# Data extraction

This notebook is the first part of the project and contains functions used for data acquisition, i.e. downloading products reviews and ratings from Ebay which is an e-commerce website (https://www.ebay.com/).
The input data was a list containing search words (words that can be entered in a search bar). The products category was mainly focused on electronics and tools (see the search word list below). The script returns  pandas dataframe consisting of four columns: product category, raw review title, raw review content and rating.

Five functions were defined:

 - _ebay_parser:_
 The function downloads all reviews and ratings for products that have more than one page with reviews
 
 - _ebay_parser2:_
 The function downloads all reviews and ratings for products that have only one page with reviews
 
 - _one_product_reviews:_
 The function downloads all reviews and ratings of a given product (the input argument is product url)
 
 - _review_df:_
 The function downloads all reviews and ratings for a given list with products' url
 
 - _product_collector:_
 The function collects products' urls which appear after entering a given search word in the search bar
 
__Note:__ The notebook was run twice at different time in order to collect new reviews and ratings

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re
import pickle 

## Functions

In [2]:
def ebay_parser(url, category=None):
    """
    The function collects all reviews and ratings for a given url if a product has more than one page with reviews
    """
    try:
        content = requests.get(url).text
    except:
        return 'connection failed'
    
    review_data = pd.DataFrame(columns=['category', 'review title', 'review content', 'rating'])
    
    soup = BeautifulSoup(content)
    for review, rating_stars in zip(soup.find_all('div',class_='ebay-review-section-r'), soup.find_all('span',class_='star-rating')[1:]):
        rating = len(rating_stars.find_all('i', class_='fullStar'))
        
        #checkin if a review has a content and a title - if not it is skipped
        if review.p is None or review.h3 is None: 
            continue
        else:
            review_data = review_data.append({'category':category, 'review title':review.h3.text, 'review content':review.p.text, 'rating':rating}, ignore_index=True)
    
    return review_data


def ebay_parser2(url, category=None):
    """
    The function collects all reviews and ratings for a given url if a product has only one page with reviews
    """
    try:
        content = requests.get(url).text
    except:
        return 'connection failed'
    
    review_data = pd.DataFrame(columns=['category', 'review title', 'review content', 'rating'])
    
    soup = BeautifulSoup(content)

    
    for review, rating_stars in zip(soup.find_all('div',class_='review--section--r'), soup.find_all('div',class_='review--section--l')):
        rating = rating_stars.span.text[0]
        review_data = review_data.append({'category':category, 'review title':review.h4.text, 'review content':review.p.text, 'rating':rating}, ignore_index=True)
    
    return review_data


def one_product_reviews(url):
    """
    This function returns pandas dataframe with reviews and rating for a given url
    """

    isMore = True
    content = requests.get(url).text
    soup = BeautifulSoup(content) 
    
    #checking if there is a link to the review section (more than 10 reviews)
    try:
        url = soup.find('div', class_="see--all--reviews").a['href']
    except:
        isMore = False
    
    category = 'not defined'
    try:
        category = soup.find('nav', class_='breadcrumb clearfix').text.split('>')[-1]
    except:
        pass
    
    #if there is more than 10 reviews
    if isMore:
        content = requests.get(url).text
        soup = BeautifulSoup(content)  
        

        reviews_page_number = len(soup.find_all('a', class_='spf-link'))-2
        if reviews_page_number == -2:
            return ebay_parser(url+'?pgn=1')
        else:
            for page in range(1, reviews_page_number+1):
                url_temp = url + f'?pgn={page}'
                if page == 1:
                    reviews = ebay_parser(url_temp, category)
                else:
                    reviews = pd.concat((reviews, ebay_parser(url_temp, category)))
            return reviews
    
    #if there is less than 11 reviews
    else:
        return ebay_parser2(url, category)
    
    
def review_df(url_list):
    """
    This function receives a list with urls and returns one dataframe with reviews and ratings
    """ 
    df = one_product_reviews(url_list[0])
    for product in url_list[1:]:
        df = pd.concat((df, one_product_reviews(product)))
    df.set_index(np.arange(df.shape[0]), inplace=True)
    return df

def product_collector(url):
    """
    The function receives an url with search word results and collects reviews and ratings of searched products
    """
    links_list = []
    soup = BeautifulSoup(requests.get(url).text)
    for link in soup.find_all('div', class_='s-item__reviews'):
        links_list.append(link.a['href'])
            
    return links_list

## Search words and loop for collecting products urls

In [4]:
final_search=['iphone, xbox, playstation','sony', 'samsung', 'apple', 
              'lenovo', 'asus', 'dell', 'acer', 'HP', 'toshiba', 
              'philips','denon','canon','gopro','xiaomi','google', 
              'monitor', 'keyboard', 'mouse','headphones','cable', 'set', 
              'tool','tools','saw', 'drill', 'game', 'car','kitchen']

for word in final_search:
    for i in range(1,100):     
        #this is an i-th page with search results for a given word
        url = f'https://www.ebay.com/sch/i.html?_from=R40&_nkw={word}&_sacat=0&_pgn={i}'
        list_temp = product_collector(url) 
        final_list.extend(list_temp)

In [5]:
#number of products
len(final_list)

1653

## Collecting reviews and ratings is divided into many steps to get a better control on the process

In [6]:
df1 = review_df(final_list[:100])

In [7]:
df2 = review_df(final_list[100:250])

In [8]:
df3 = review_df(final_list[250:400])

In [9]:
df4 = review_df(final_list[400:500])

In [10]:
df5 = review_df(final_list[500:600])

In [11]:
df6 = review_df(final_list[600:800])

In [12]:
df7 = review_df(final_list[800:1000])

In [13]:
df8 = review_df(final_list[1000:1200])

In [14]:
df9 = review_df(final_list[1200:1400])

In [15]:
df10 = review_df(final_list[1400:])

## Creating the final dataframe

In [16]:
data = pd.concat([df1,df2,df3,df4,df5,df6,df7,df8,df9,df10])
data.set_index(np.arange(data.shape[0]), inplace=True)
data

Unnamed: 0,category,review title,review content,rating
0,Headsets,Wireless gaming headset,This gaming headset ticks all the boxes # look...,5
1,Headsets,"Cheap, not great, but good enough.",I don't want to be too harsh because it's reas...,3
2,Headsets,MezumiWireless Gaming Headset,I originally bought this wireless headset for ...,5
3,Headsets,"Good for those with a big head, low budget","Easy setup, rated for 6 hours battery but mine...",3
4,Headsets,HW- S2 great headset.,"This is my 2nd Mezumi headset, It kills the fi...",5
...,...,...,...,...
50150,Racks & Holders,Utensil holder,Reasonably priced but a little flimsy,3
50151,Racks & Holders,Recommended,As described,5
50152,Racks & Holders,cheap looking,cheap looking,1
50153,Racks & Holders,Ok,Okay,5


## Saving the dataframe to a csv file

In [39]:
date='30122020'

data.to_csv(f'ebay_reviews{date}.csv', index=True)