# Web Scraping Project
In this project, we extract information about all the products listed on the [Sephora](https://www.sephora.com/) website.

## We begin by importing the libraries we need.

1. The first one is the Selenium WebDriver, which drives the browser.
2. The second one is pandas, which is used to store the information as DataFrames.
3. The third one, `time.sleep`, pauses the script to give the browser time to load all the content.
4. Lastly, we use `datetime` to deal with timestamps and time durations.

In [None]:
from selenium import webdriver
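# Configure Chrome: the value 2 for the image setting blocks image loading,
# so pages load faster.
chrome_options = webdriver.ChromeOptions()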
prefs = {"profile.managed_default_content_settings.images": 2,
        'permissions.default.image': 2, 'dom.ipc.plugins.enabled.libflashplayer.so': 'false'}
chrome_options.add_experimental_option("prefs", prefs)
browser = webdriver.Chrome(options=chrome_options)

import pandas as pd
from time import sleep
from datetime import datetime, timedelta

Now let's take a look at the website. All the brands are listed under **Brands ---> Brands A-Z**.

![](brandsa-z.jpg)

![](allbrands.jpg)

Next, we want to extract the URL of each brand, so we inspect the brand name elements.

![](brand_name_elements.jpg)

You can see that these elements have the following properties:
1. *Tag* = 'a'
2. *href* : The brand URL
3. *class* = 'css-11medar e65zztl0'
4. *text* : The brand name

So, we are going to use the class to find all the brand names and URLs.

\* Since the class attribute holds two classes, *'css-11medar'* and *'e65zztl0'*, it's better to use the method ***find_elements_by_css_selector*** with the parameter *'.css-11medar.e65zztl0'*, which matches elements that carry both classes.
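
As a small illustration of how that selector string is formed, the helper below turns the value of a `class` attribute into a CSS selector that requires every listed class. The function `class_to_css_selector` is hypothetical (it is not part of Selenium), and the sketch assumes plain class names that need no CSS escaping.

```python
def class_to_css_selector(class_attribute):
    """Build a CSS selector matching elements that carry every listed class.

    Hypothetical helper for illustration only; assumes plain class names
    that need no CSS escaping.
    """
    class_names = class_attribute.split()
    if not class_names:
        raise ValueError("the class attribute is empty")
    # Prefix each class with '.' and chain them without spaces:
    # '.a.b' means "has class a AND class b".
    return "".join("." + name for name in class_names)


print(class_to_css_selector("css-11medar e65zztl0"))  # .css-11medar.e65zztl0
```

The resulting string can be passed directly to the CSS-selector lookup used in the next cell.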

In [None]:
url='https://www.sephora.com/brands-list'
browser.get(url)
sleep(1)
brands_urls={}
for item in browser.find_elements_by_css_selector('.css-11medar.e65zztl0'):
    brands_urls[item.text]=item.get_attribute('href')

In [None]:
brands_urls

Next, we are going to explore [the products of the first brand](https://www.sephora.com/brand/acqua-di-parma)

In [None]:
url="https://www.sephora.com/brand/acqua-di-parma"
browser.get(url)
sleep(1)
for k in range(1,100):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight*k/100);".replace('k',str(k)))

![](first_brand_product.jpg)

There are 46 products.
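
The scrolling loop used above moves down the page in small steps, presumably so that lazily loaded products have a chance to render. A minimal sketch of the same idea as a reusable function follows. The name `scroll_in_steps` and the `pause` parameter are assumptions of this sketch; the step number is passed through the `arguments` array that Selenium's `execute_script` exposes to the script, instead of being spliced into the string. Unlike the loop above, which stops at 99% of the page height, this version scrolls all the way to the bottom.

```python
from time import sleep


def scroll_in_steps(driver, steps=100, pause=0.0):
    """Scroll from the top of the page to the bottom in `steps` increments.

    `driver` is any object exposing Selenium's `execute_script(script, *args)`.
    The function name and the `pause` parameter are assumptions of this sketch.
    """
    # arguments[0] is the current step, arguments[1] the total number of steps.
    script = ("window.scrollTo(0, document.body.scrollHeight"
              " * arguments[0] / arguments[1]);")
    for k in range(1, steps + 1):
        driver.execute_script(script, k, steps)
        if pause:
            sleep(pause)
```

With the `browser` object created earlier, the call would be `scroll_in_steps(browser)`.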

Now, in order to get the urls of these products, let's inspect the first product element.

![](first_product_inspec.jpg)

These elements are characterized by the following properties:
1. *Tag* = 'a'
2. *href* : The product URL
3. *class* = 'css-ix8km1'

So, using the method ***find_elements_by_class_name()***, we are going to find all the product URLs.

In [None]:
product_urls=[]
for item in browser.find_elements_by_class_name('css-ix8km1'):
    product_urls.append(item.get_attribute('href'))
len(product_urls)

Next, let's follow [the first product link](https://www.sephora.com/product/blu-mediterraneo-mandorlo-di-sicilia-P307803?icid2=products%20grid:p307803)

In [None]:
url='https://www.sephora.com/product/blu-mediterraneo-mandorlo-di-sicilia-P307803?icid2=products%20grid:p307803'
browser.get(url)
sleep(2)
for k in range(1,100):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight*k/100);".replace('k',str(k)))

![](first_product.jpg)

## Now, we have to scrape many details:

In [None]:
num_reviews= int(browser.find_element_by_id("ratings-reviews-container").find_element_by_tag_name('h2').text.split('(')[1][:-1])
reviews=[]
brand_url=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/h1/a').get_attribute('href')
brand_name=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/h1/a').text
prod_categ=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/nav/ol').text.replace('\n','->')
produ_name=browser.find_element_by_xpath("/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/h1/span").text
prod_price=float(browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/p/span/span[1]/b').text[1:])
try:
    prod_aver_star=float(browser.find_element_by_xpath("/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/div/a[1]/div").get_attribute('aria-label').split(' ')[0])
except:
    prod_aver_star=0
prod_likes=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/div/div/span').text
if prod_likes[-1]=='K':
    prod_likes=int(float(prod_likes[:-1])*1000)
else:
    prod_likes=int(float(prod_likes))
    
print(num_reviews)
while len(reviews)!=num_reviews:
    for i in browser.find_elements_by_css_selector('.css-13o7eu2.eanm77i0'):
        if len(i.find_elements_by_css_selector(".css-1yc3bi7.eanm77i0")):
            review_text=i.find_element_by_css_selector(".css-1x44x6f.eanm77i0").text
            star_rating=int(float(i.find_element_by_class_name('css-4qxrld').get_attribute('aria-label').split(' ')[0]))
            review_date=i.find_element_by_css_selector('.css-ak0g49.eanm77i0').text
            if 'm ago' in review_date:
                review_date=datetime.now()-timedelta(minutes=int(review_date.split(' ')[0]))
            elif 's ago' in review_date:
                review_date=datetime.now()-timedelta(seconds=int(review_date.split(' ')[0]))
            elif 'h ago' in review_date:
                review_date=datetime.now()-timedelta(hours=int(review_date.split(' ')[0]))
            elif 'd ago' in review_date:
                review_date=datetime.now()-timedelta(days=int(review_date.split(' ')[0]))
            else:
                review_date=datetime.strptime(review_date, '%d %b %Y')

            upvotes =int(i.find_elements_by_class_name("css-36ie0l")[0].find_element_by_tag_name('span').text[1:-1])
            downvotes =int(i.find_elements_by_class_name("css-36ie0l")[1].find_element_by_tag_name('span').text[1:-1])
            verified_purshase=bool(len(i.find_elements_by_css_selector(".css-1cf4ane.eanm77i0")))
            recommanded=bool(len(i.find_elements_by_css_selector(".css-12com3g.eanm77i0")))
            if len(i.find_elements_by_class_name( 'css-hoe9xz')):
                shade_of_product=i.find_element_by_class_name( 'css-hoe9xz').text
            else:
                shade_of_product=''
            
            nickname=''
            try:
                nickname=i.find_element_by_tag_name('strong').text
            except:
                pass
            
            eyes,hair,skin_col,skin_ty='','','',''            
            try:
                descr=i.find_element_by_css_selector(".css-t72irq.eanm77i0").text
                eyes=descr.split(', ')[0]
                hair=descr.split(', ')[1]
                skin_col=descr.split(', ')[2]
                skin_ty=descr.split(', ')[3]
            except:
                pass
            reviews.append({'product name':produ_name, 'product average stars notation':prod_aver_star,'product likes':prod_likes,
                            'product category': prod_categ, 'product price':prod_price,'product url':url,'brand name': brand_name, 
                            'brand URL':brand_url, 'review text':review_text,'review stars':star_rating,'review date':review_date, 'upvotes' :upvotes, 'downvotes':downvotes,
                            'verified purshase':verified_purshase, 'recommanded':recommanded, 'shade of product':shade_of_product, 
                            'nickname':nickname,'eyes color':eyes,'hair color':hair, 'skin color':skin_col, 'skin type':skin_ty})
            print(nickname+' : '+ str(star_rating))
    
    if len(reviews)!=num_reviews:
        browser.find_element_by_class_name('css-2anst8').click()
        sleep(1)

The different pieces of the code above are explained here:

1. The URL of the brand 
    ```python
    brand_url=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/h1/a').get_attribute('href')
    ```
    
    
2. The brand name
    ```python
    brand_name=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/h1/a').text
    ```
    
    
3. The product name
    ```python
    produ_name=browser.find_element_by_xpath("/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/h1/span").text
    ```
    
4. The product price *(dollars)*
    ```python
    prod_price=float(browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/p/span/span[1]/b').text[1:])
    ```
    
    
5. The product category
    ```python
    prod_categ=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/nav/ol').text.replace('\n','->')
    ```
    
    
6. The product's average star rating
    ```python
    try:
        prod_aver_star=float(browser.find_element_by_xpath("/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/div/a[1]/div").get_attribute('aria-label').split(' ')[0])
    except:
        prod_aver_star=0
    ```
    
    
7. The number of product likes
    ```python
    prod_likes=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/div/div/span').text
    ```


   - Sometimes this number ends with a 'K' suffix that stands for thousands, so we convert it to a plain integer
    ```python
    if prod_likes[-1]=='K':
        prod_likes=int(float(prod_likes[:-1])*1000)
    else:
        prod_likes=int(float(prod_likes))
    ```
    
8. The number of reviews
    ```python
    num_reviews= int(browser.find_element_by_id("ratings-reviews-container").find_element_by_tag_name('h2').text.split('(')[1][:-1])
    ```
    
    
9. The review text
    ```python
    review_text=i.find_element_by_css_selector(".css-1x44x6f.eanm77i0").text
    ```
   
   
10. The star rating of the review
    ```python
    star_rating=int(float(i.find_element_by_class_name('css-4qxrld').get_attribute('aria-label').split(' ')[0]))
    ```
    
    
11. The review date
    ```python
    review_date=i.find_element_by_css_selector('.css-ak0g49.eanm77i0').text
    if 'm ago' in review_date:
        review_date=datetime.now()-timedelta(minutes=int(review_date.split(' ')[0]))
    elif 's ago' in review_date:
        review_date=datetime.now()-timedelta(seconds=int(review_date.split(' ')[0]))
    elif 'h ago' in review_date:
        review_date=datetime.now()-timedelta(hours=int(review_date.split(' ')[0]))
    elif 'd ago' in review_date:
        review_date=datetime.now()-timedelta(days=int(review_date.split(' ')[0]))
    else:
        review_date=datetime.strptime(review_date, '%d %b %Y')
    ```
    
    
12. The review upvotes and downvotes
    ```python
    upvotes =int(i.find_elements_by_class_name("css-36ie0l")[0].find_element_by_tag_name('span').text[1:-1])
    downvotes =int(i.find_elements_by_class_name("css-36ie0l")[1].find_element_by_tag_name('span').text[1:-1])
    ```
    
    
13. Purchase verification and recommendation
    ```python
    verified_purshase=bool(len(i.find_elements_by_css_selector(".css-1cf4ane.eanm77i0")))
    recommanded=bool(len(i.find_elements_by_css_selector(".css-12com3g.eanm77i0")))
    ```
    
    
14. The shade of the product *(if available)*
    ```python
    if len(i.find_elements_by_class_name( 'css-hoe9xz')):
        shade_of_product=i.find_element_by_class_name( 'css-hoe9xz').text
    else:
        shade_of_product=''
    ```
    
    
15. Reviewer details
    ```python
    nickname=''
    try:
        nickname=i.find_element_by_tag_name('strong').text
    except:
        pass

    eyes,hair,skin_col,skin_ty='','','',''            
    try:
        descr=i.find_element_by_css_selector(".css-t72irq.eanm77i0").text
        eyes=descr.split(', ')[0]
        hair=descr.split(', ')[1]
        skin_col=descr.split(', ')[2]
        skin_ty=descr.split(', ')[3]
    except:
        pass
    ```
    
    
- The last piece of code clicks the 'next' arrow to load the next page of reviews *(see the figure below)*
    ```python
    if len(reviews)!=num_reviews:
        browser.find_element_by_class_name('css-2anst8').click()
        sleep(1)
    ```

![](next_arrow.jpg)
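
The date handling in step 11 can also be written as a small standalone function, which makes it easy to check without a browser. The sketch below assumes the date strings have exactly the shapes the code above expects: relative forms such as `5 d ago` and absolute forms such as `12 Mar 2020`. The function name `parse_review_date` and its `now` parameter are assumptions introduced for illustration.

```python
from datetime import datetime, timedelta

# Unit letters as they appear in relative dates such as "5 d ago".
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_review_date(text, now=None):
    """Convert a review date string into a datetime.

    Relative forms ("3 h ago") are subtracted from `now`; anything else is
    parsed as an absolute date in the '%d %b %Y' format.
    """
    now = now or datetime.now()
    parts = text.split()
    if len(parts) == 3 and parts[2] == "ago" and parts[1] in _UNITS:
        # Build e.g. timedelta(hours=3) from the unit letter and the number.
        return now - timedelta(**{_UNITS[parts[1]]: int(parts[0])})
    return datetime.strptime(text, "%d %b %Y")
```

Passing a fixed `now` makes the result reproducible, for example `parse_review_date("5 d ago", now=datetime(2020, 3, 15))`.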

Next, we wrap all the code above in a single function that takes a product URL and returns the information as a list of dictionaries.

In [None]:
def extract_details(url):
    browser.get(url)
    sleep(2)
    for k in range(1,100):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight*k/100);".replace('k',str(k)))
    
    reviews=[]
    num_reviews= int(browser.find_element_by_id("ratings-reviews-container").find_element_by_tag_name('h2').text.split('(')[1][:-1])
    brand_url=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/h1/a').get_attribute('href')
    brand_name=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/h1/a').text
    prod_categ=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/nav/ol').text.replace('\n','->')
    produ_name=browser.find_element_by_xpath("/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/h1/span").text
    prod_price=float(browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/p/span/span[1]/b').text[1:])
    try:
        prod_aver_star=float(browser.find_element_by_xpath("/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/div/a[1]/div").get_attribute('aria-label').split(' ')[0])
    except Exception:
        # The rating could not be read: fall back to 0.
        prod_aver_star=0
    prod_likes=browser.find_element_by_xpath('/html/body/div[1]/div[2]/div/main/div/div[1]/div[1]/div/div/span').text
    if prod_likes[-1]=='K':
        prod_likes=int(float(prod_likes[:-1])*1000)
    else:
        prod_likes=int(float(prod_likes))

    while len(reviews)!=num_reviews:
        for i in browser.find_elements_by_css_selector('.css-13o7eu2.eanm77i0'):
            if len(i.find_elements_by_css_selector(".css-1yc3bi7.eanm77i0")):
                review_text=i.find_element_by_css_selector(".css-1x44x6f.eanm77i0").text
                star_rating=int(float(i.find_element_by_class_name('css-4qxrld').get_attribute('aria-label').split(' ')[0]))
                review_date=i.find_element_by_css_selector('.css-ak0g49.eanm77i0').text
                if 'm ago' in review_date:
                    review_date=datetime.now()-timedelta(minutes=int(review_date.split(' ')[0]))
                elif 's ago' in review_date:
                    review_date=datetime.now()-timedelta(seconds=int(review_date.split(' ')[0]))
                elif 'h ago' in review_date:
                    review_date=datetime.now()-timedelta(hours=int(review_date.split(' ')[0]))
                elif 'd ago' in review_date:
                    review_date=datetime.now()-timedelta(days=int(review_date.split(' ')[0]))
                else:
                    review_date=datetime.strptime(review_date, '%d %b %Y')

                upvotes =int(i.find_elements_by_class_name("css-36ie0l")[0].find_element_by_tag_name('span').text[1:-1])
                downvotes =int(i.find_elements_by_class_name("css-36ie0l")[1].find_element_by_tag_name('span').text[1:-1])
                verified_purshase=bool(len(i.find_elements_by_css_selector(".css-1cf4ane.eanm77i0")))
                recommanded=bool(len(i.find_elements_by_css_selector(".css-12com3g.eanm77i0")))
                if len(i.find_elements_by_class_name( 'css-hoe9xz')):
                    shade_of_product=i.find_element_by_class_name( 'css-hoe9xz').text
                else:
                    shade_of_product=''

                nickname=''
                try:
                    nickname=i.find_element_by_tag_name('strong').text
                except Exception:
                    # No nickname element found: keep the empty default.
                    pass

                eyes,hair,skin_col,skin_ty='','','',''            
                try:
                    descr=i.find_element_by_css_selector(".css-t72irq.eanm77i0").text
                    eyes=descr.split(', ')[0]
                    hair=descr.split(', ')[1]
                    skin_col=descr.split(', ')[2]
                    skin_ty=descr.split(', ')[3]
                except Exception:
                    # Description missing or shorter than four parts: keep what was parsed.
                    pass
                reviews.append({'product name':produ_name, 'product average stars notation':prod_aver_star,'product likes':prod_likes,
                                'product category': prod_categ, 'product price':prod_price,'product url':url,'brand name': brand_name, 
                                'brand URL':brand_url, 'review text':review_text,'review stars':star_rating,'review date':review_date, 'upvotes' :upvotes, 'downvotes':downvotes,
                                'verified purshase':verified_purshase, 'recommanded':recommanded, 'shade of product':shade_of_product, 
                                'nickname':nickname,'eyes color':eyes,'hair color':hair, 'skin color':skin_col, 'skin type':skin_ty})

        if len(reviews)!=num_reviews:
            browser.find_element_by_class_name('css-2anst8').click()
            sleep(1)

    return reviews
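
Because `extract_details` mixes browser calls with text parsing, the parsing steps are hard to check on their own. One possible refactoring, sketched below, pulls three of them into pure functions that work on plain strings. All three function names are assumptions introduced here, and the heading format shown in the docstring of `parse_review_count` is only inferred from the `split('(')` logic above.

```python
import re


def parse_likes(text):
    """Turn a like counter such as '874' or '1.2K' into an integer."""
    text = text.strip()
    if text.upper().endswith("K"):
        # round() guards against floating-point results like 1100.0000000000002
        return int(round(float(text[:-1]) * 1000))
    return int(float(text))


def parse_review_count(heading):
    """Extract the number in parentheses, e.g. 46 from 'Ratings & Reviews (46)'."""
    match = re.search(r"\(([\d,]+)\)", heading)
    if match is None:
        raise ValueError("no review count found in %r" % heading)
    return int(match.group(1).replace(",", ""))


def split_description(descr):
    """Split 'eyes, hair, skin color, skin type' into four fields.

    Missing trailing fields are returned as empty strings.
    """
    parts = descr.split(", ")[:4]
    return tuple(parts + [""] * (4 - len(parts)))
```

Inside `extract_details`, a call such as `parse_likes(prod_likes)` would then replace the corresponding `if`/`else` block.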

In order to get the job done, we proceed like this:

- Instantiate a pandas DataFrame object `results`
- For each brand in the dictionary of brands: `brands_urls`
    - Find the list of product URLs
    - For each product URL in that list
        - Extract the details using `extract_details(url)`
        - Add the extracted information into `results`
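
The steps above can be sketched independently of the browser by passing the two scraping operations in as functions. In this sketch, `collect_reviews`, `list_product_urls`, `skip` and the returned `failed` list are all assumptions introduced for illustration; rows are gathered in a plain list rather than appended to a DataFrame one at a time.

```python
def collect_reviews(brand_urls, list_product_urls, extract_details, skip=()):
    """Gather review dictionaries for every product of every brand.

    `list_product_urls(brand_url)` and `extract_details(product_url)` are
    supplied by the caller; `skip` holds product URLs to leave out.
    Returns the collected rows and the product URLs that failed.
    """
    rows = []
    failed = []
    seen = set(skip)
    for brand_url in brand_urls:
        for product_url in list_product_urls(brand_url):
            if product_url in seen:
                continue  # already handled, or explicitly skipped
            seen.add(product_url)
            try:
                rows.extend(extract_details(product_url))
            except Exception:
                # Remember the failure instead of silently dropping it.
                failed.append(product_url)
    return rows, failed
```

The list of dictionaries can be turned into a DataFrame in a single step with `pd.DataFrame(rows)`, and the `failed` list shows which products to retry.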

In [None]:
results=pd.DataFrame(columns=['product name', 'product average stars notation','product likes'
                         ,'product category', 'product price','product url','brand name', 
                         'brand URL', 'review text','review stars','review date', 'upvotes' , 'downvotes',
                         'verified purshase', 'recommanded','shade of product','nickname',
                         'eyes color','hair color', 'skin color', 'skin type'])
for brand_url in brands_urls.values():
    print(brand_url)
    browser.get(brand_url)
    sleep(1)
    for k in range(1,100):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight*k/100);"
                               .replace('k',str(k)))
    product_urls=[]
    for item in browser.find_elements_by_class_name('css-ix8km1'):
        product_urls.append(item.get_attribute('href'))
    for url in product_urls:
        print(url)
        try:
            reviews=extract_details(url)
        except Exception:
            # Skip products whose page could not be parsed.
            continue

        for review in reviews:
            results=results.append(review, ignore_index=True)


The code in the cell above looks sufficient to accomplish the task. However, since there are many brands and many products, the operation takes a long time, and a network or power failure may interrupt it.

One way to guard against this is as follows:
1. Create an Excel file `results.xlsx` that contains only the header row, so that the `product url` column exists
2. Read the Excel file into a DataFrame and extract the list of product URLs that have already been scraped
    ```python
    old_results=pd.read_excel('results.xlsx')
    got_url = old_results['product url'].tolist()
    ```
3. Run the cell containing the extraction code
    ```python
    results=pd.DataFrame(columns=['product name', 'product average stars notation', 
    ...
    ...
    ...
                for review in reviews:
                    results=results.append(review, ignore_index=True)
    ```
    
4. If the cell stops for any reason, save the extracted information
    ```python
    results.to_excel('results.xlsx', index=False)
    ```
5. Repeat steps ***2***, ***3*** and ***4*** until every product has been scraped
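
A variation on the same idea is to write the rows to disk as soon as each product has been scraped, so that nothing has to be saved by hand after a crash. The sketch below swaps the Excel file for a CSV file and uses only the standard library's `csv` module. The function names and the shortened column list are assumptions made for illustration, not part of the procedure above.

```python
import csv
import os

FIELDS = ["product url", "review text"]  # shortened, illustrative column list


def load_done_urls(path):
    """Return the set of product URLs already stored in the CSV file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["product url"] for row in csv.DictReader(handle)}


def append_reviews(path, reviews):
    """Append review dictionaries to the CSV file, writing the header once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        # Keys that are not listed in FIELDS are ignored rather than rejected.
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerows(reviews)
```

In the scraping loop, `append_reviews` would be called right after each successful `extract_details(url)`, and `load_done_urls` would take the place of reading `got_url` from the Excel file. As with the Excel approach, a product without any reviews leaves no row behind and is therefore visited again on the next run.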

In [None]:
old_results=pd.read_excel('results.xlsx')
got_url= old_results['product url'].tolist()

In [None]:
results=pd.DataFrame(columns=['product name', 'product average stars notation','product likes'
                         ,'product category', 'product price','product url','brand name', 
                         'brand URL', 'review text','review stars','review date', 'upvotes' , 'downvotes',
                         'verified purshase', 'recommanded','shade of product','nickname',
                         'eyes color','hair color', 'skin color', 'skin type'])
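# Start from the rows saved by previous runs; append returns a new DataFrame,
# so the result has to be assigned back.
results=results.append(old_results, ignore_index=True)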
for brand_url in brands_urls.values():
    print(brand_url)
    browser.get(brand_url)
    sleep(1)
    for k in range(1,100):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight*k/100);"
                               .replace('k',str(k)))
    product_urls=[]
    for item in browser.find_elements_by_class_name('css-ix8km1'):
        product_urls.append(item.get_attribute('href'))
    for url in product_urls:
        if url not in got_url:
            print(url)
            try:
                reviews=extract_details(url)
                got_url.append(url)
            except Exception:
                # Skip products whose page could not be parsed.
                continue

            for review in reviews:
                results=results.append(review, ignore_index=True)


In [None]:
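# index=False keeps the row index out of the file, so reloading it
# does not add an extra 'Unnamed: 0' column.
results.to_excel('results.xlsx', index=False)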