# Data extraction

This notebook is the first part of the project. It contains the class used for data acquisition, i.e. downloading product reviews and ratings from eBay, an e-commerce website (https://www.ebay.com/).
The input is a list of search words (words that can be entered in the eBay search bar). The searches focused mainly on electronics and tools. The class produces a pandas dataframe with four columns: product category, raw review title, raw review content and rating.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

In [106]:
class EbayCrawler:
    """
    This class is dedicated to collecting ebay reviews.
    The final result is a pandas dataframe which contains four columns:
    category, review title, review content and rating

    The input data consists of search words. For every given search
    word the class checks the search results and collects the urls of
    products that have reviews.
    """

    def __init__(self, search_words, number_of_pages=1):
        """
        A user gives a list of search words and the number of pages
        which will be checked

        Parameters
        ----------
        search_words : list
        List with search words that can be put in ebay search bar,
        for example search_words = ['apple', 'huawei', 'samsung']

        number_of_pages : int, default 1
        How many pages of results will be searched for every search
        word in order to find products with reviews
        
        """
        if number_of_pages < 1:
            raise ValueError("number of pages must be at least 1")
        if len(search_words) < 1:
            raise ValueError("there must be at least one search word")

        self.search_words = search_words
        self.number_of_pages = number_of_pages
        self.search_urls = []
        self.products_urls = []
        self.df = None

    def __add__(self, other_df):
        """
        Append the reviews held by another EbayCrawler instance or by a
        pandas dataframe to this instance's dataframe.

        The instance is modified in place and returned. If this instance
        has no dataframe yet, the other operand is returned unchanged.
        """
        if not isinstance(other_df, (pd.DataFrame, EbayCrawler)):
            raise TypeError(
                "Cannot perform add operation on different types. "
                "The other type must be pandas data frame or an "
                "instance of EbayCrawler")
        if self.df is None:
            return other_df

        # take the dataframe out of the other operand if it is a crawler
        if isinstance(other_df, EbayCrawler):
            other = other_df.df
        else:
            other = other_df
        if self.df.shape[1] != other.shape[1]:
            raise ValueError("Dimensions of each dataframe must match")
        self.df = pd.concat([self.df, other], ignore_index=True)
        return self
    
    def search_word_urls(self):
        """
        Method that creates urls for every search word
        """
        if not self.search_urls:
            print('creating search links...\n')
            for word in self.search_words:
                for i in range(1, self.number_of_pages+1):
                    url = f'https://www.ebay.com/sch/i.html?_from=R40&_nkw' \
                          f'={word}&_sacat=0&_pgn={i}'
                    self.search_urls.append(url)

    def find_products_urls(self, display=True, num_to_display=10):
        """
        Method that goes through the urls with search results and
        collects urls of products that have some reviews

        Parameters
        ----------
        display : boolean, default True
        If True the method shows products urls

        num_to_display : int, default 10
        If display is True, the number of urls which will be shown
        
        """
        if not self.products_urls:
            print('downloading products urls...\n')
            for url in self.search_urls:
                soup = BeautifulSoup(requests.get(url).text)
                for link in soup.find_all('div', class_='s-item__reviews'):
                    self.products_urls.append(link.a['href'])
        if display and self.products_urls != []:
            if len(self.products_urls) < num_to_display:
                num_to_display = len(self.products_urls)
            print(f'Links for the first {num_to_display} products:')
            for i in range(num_to_display):
                print(self.products_urls[i])
        print('downloading finished\n')

    @property
    def urls_number(self):
        "Property that shows number of found products with reviews"
        return len(self.products_urls)

    def create_reviews_df(self, num_to_download=None, clear=True):
        """
        Method that downloads the reviews and ratings for the collected
        products urls and stores them in one dataframe (self.df). The user
        can define how many links will be considered in downloading
        the reviews

        Parameters
        ----------
        num_to_download : int, default None
        Number of links which will provide reviews. Downloading from,
        for example, 1000 found links takes a long time, so the user may
        want to collect only the first 10 links. If no number is given
        the method collects all of the links

        clear : boolean, default True
        If False the instance dataframe is deleted (set to None) before
        downloading, so the method may be used again for downloading
        reviews. If True and a dataframe already exists, it is kept and
        nothing is downloaded.
        
        """
        if not clear:
            self.df = None
        num_of_links = len(self.products_urls)

        if num_to_download is not None:
            if num_to_download < len(self.products_urls):
                num_of_links = num_to_download

        if self.df is None:
            for i, product in enumerate(self.products_urls[:num_of_links], 0):
                print(f'downloading {i+1} link from {num_of_links}')
                self.df = pd.concat(
                    (self.df, EbayCrawler.one_product_reviews(product))
                    )
            self.df.set_index(np.arange(self.df.shape[0]), inplace=True)
            print('creating dataframe is completed')
        else:
            print('creating dataframe is completed')

    @classmethod
    def one_product_reviews(cls, url):
        """
        Class method that returns pandas dataframe with reviews and rating
        for a given url
        
        Parameters
        ----------
        url : string
        Url with product that has reviews
        """

        isMore = True
        content = requests.get(url).text
        soup = BeautifulSoup(content)

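        # checking if there is a link to the review section
        # (more than 10 reviews)
        try:
            url = soup.find('div', class_="see--all--reviews").a['href']
        except (AttributeError, TypeError, KeyError):
            # the div or the link inside it was not found
            isMore = False

        category = 'not defined'
        try:
            category = soup.find('nav', class_='breadcrumb clearfix') \
                       .text.split('>')[-1]
        except AttributeError:
            # no breadcrumb on the page, the default category is kept
            pass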

        # if there are more than 10 reviews
        if isMore:
            content = requests.get(url).text
            soup = BeautifulSoup(content)

            reviews_page_number = len(soup.find_all('a', class_='spf-link'))-2
            if reviews_page_number < 1:
                # no usable pagination links: only the first page is parsed
                return EbayCrawler.ebay_parser(url+'?pgn=1', category)
            else:
                pages = [
                    EbayCrawler.ebay_parser(url + f'?pgn={page}', category)
                    for page in range(1, reviews_page_number+1)
                ]
                return pd.concat(pages)

        # if there are 10 reviews or fewer
        else:
            return EbayCrawler.ebay_parser2(url, category)

    @staticmethod
    def ebay_parser(url, category=None):
        """
        Staticmethod that collects all of the reviews and ratings
        for a given url if a product has more than one page with reviews
        """
        columns = ['category', 'review title', 'review content', 'rating']
        try:
            content = requests.get(url).text
        except requests.RequestException:
            # an empty dataframe keeps pd.concat in the callers working
            print(f'connection failed: {url}')
            return pd.DataFrame(columns=columns)

        soup = BeautifulSoup(content)

        rows = []
        for review, rating_stars in zip(
            soup.find_all('div', class_='ebay-review-section-r'),
            soup.find_all('span', class_='star-rating')[1:]
        ):
            # the rating is the number of filled stars
            rating = len(rating_stars.find_all('i', class_='fullStar'))

            # checking if a review has a content and a title,
            # if not it is skipped
            if review.p is None or review.h3 is None:
                continue
            rows.append(
                {
                    'category': category,
                    'review title': review.h3.text,
                    'review content': review.p.text,
                    'rating': rating
                }
            )

        # DataFrame.append was removed in pandas 2.0, so the rows are
        # collected in a list and converted to a dataframe once
        return pd.DataFrame(rows, columns=columns)

    @staticmethod
    def ebay_parser2(url, category=None):
        """
        Staticmethod that collects all of the reviews and ratings
        for a given url if a product has only one page with reviews
        """
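        columns = ['category', 'review title', 'review content', 'rating']
        try:
            content = requests.get(url).text
        except requests.RequestException:
            # an empty dataframe keeps pd.concat in the callers working
            print(f'connection failed: {url}')
            return pd.DataFrame(columns=columns)

        soup = BeautifulSoup(content)

        rows = []
        for review, rating_stars in zip(
            soup.find_all('div', class_='review--section--r'),
            soup.find_all('div', class_='review--section--l')
        ):
            # the first character of the rating label is taken as the rating
            rating = rating_stars.span.text[0]
            rows.append(
                {
                    'category': category,
                    'review title': review.h4.text,
                    'review content': review.p.text,
                    'rating': rating
                }
            )

        # DataFrame.append was removed in pandas 2.0, so the rows are
        # collected in a list and converted to a dataframe once
        return pd.DataFrame(rows, columns=columns)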

    def save_to_csv(self, file_name, **kwargs):
        """
        Saving created dataframe to csv file
         
        Parameters
        ----------
        file_name : string
        Name of the file to which dataframe will be saved
        
        **kwargs
        Arguments for pandas to_csv method
        """
        if self.df is None:
            raise ValueError('dataframe needs to be created')
        else:
            self.df.to_csv(file_name, **kwargs)
            print('dataframe saved to csv file')

## Example showing how to use the class
Note: the instance below was not used to collect the reviews used in the project.

In [107]:
# creating an instance of a class with a list with search words
# and number of pages that are going to be searched
electronics = EbayCrawler(['apple', 'samsung'], 5)

In [108]:
# creating urls for every given search word
electronics.search_word_urls()

creating search links...



In [109]:
# the first five search links
electronics.search_urls[:5]

['https://www.ebay.com/sch/i.html?_from=R40&_nkw=apple&_sacat=0&_pgn=1',
 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=apple&_sacat=0&_pgn=2',
 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=apple&_sacat=0&_pgn=3',
 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=apple&_sacat=0&_pgn=4',
 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=apple&_sacat=0&_pgn=5']
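
The search links above are built by inserting each search word directly into the query string. A minimal sketch of the same construction follows, with one addition that is an assumption and not part of the class: the search word is passed through `urllib.parse.quote_plus`, so that a multi-word search word still gives a valid query. The helper name `build_search_urls` and the search word 'cordless drill' are hypothetical.

```python
from urllib.parse import quote_plus


def build_search_urls(search_words, number_of_pages=1):
    """Build one search URL per (search word, page) pair."""
    urls = []
    for word in search_words:
        # quote_plus turns spaces into '+' and escapes characters such as '&'
        encoded = quote_plus(word)
        for page in range(1, number_of_pages + 1):
            urls.append(
                "https://www.ebay.com/sch/i.html?_from=R40"
                f"&_nkw={encoded}&_sacat=0&_pgn={page}"
            )
    return urls


urls = build_search_urls(["apple", "cordless drill"], number_of_pages=2)
for url in urls:
    print(url)
```

For single-word search words such as 'apple' the result is identical to the links produced by `search_word_urls`.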

In [110]:
# downloading urls of products that have reviews
# the first five links are displayed
electronics.find_products_urls(display=True, num_to_display=5)

downloading products urls...

Links for the first 5 products:
https://www.ebay.com/p/19034211488?iid=154462877777&var=454530222702#UserReviews
https://www.ebay.com/p/4018215500?iid=174105091978&var=472963059163#UserReviews
https://www.ebay.com/p/15022478164?iid=274505797468&var=574685374492#UserReviews
https://www.ebay.com/p/3033813445?iid=113652223677&var=413778701567&rt=nc#UserReviews
https://www.ebay.com/p/23024045643?iid=114711167770&var=414867589699#UserReviews
downloading finished
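
The way `find_products_urls` picks out these links can be tried offline on a small piece of markup. The sketch below assumes a heavily simplified, hypothetical search page (the real eBay markup is much larger and its class names may change) and uses placeholder `example.com` links. The helper name `extract_review_links` is also hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup with placeholder links.
html = """
<ul>
  <li class="s-item">
    <div class="s-item__reviews">
      <a href="https://example.com/p/111#UserReviews">12 product ratings</a>
    </div>
  </li>
  <li class="s-item"><span>listing without reviews</span></li>
  <li class="s-item">
    <div class="s-item__reviews">
      <a href="https://example.com/p/222#UserReviews">3 product ratings</a>
    </div>
  </li>
</ul>
"""


def extract_review_links(page_html):
    """Return the href of the first link inside every review block."""
    soup = BeautifulSoup(page_html, "html.parser")
    links = []
    for block in soup.find_all("div", class_="s-item__reviews"):
        # skip blocks without a usable link instead of raising
        if block.a is not None and block.a.has_attr("href"):
            links.append(block.a["href"])
    return links


product_links = extract_review_links(html)
print(product_links)
```

Only listings that contain a review block contribute a link, which is why the number of product urls is usually smaller than the number of search results.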



In [112]:
# a total number of found urls
electronics.urls_number

269

In [113]:
# creating dataframe
# reviews are downloaded from the first two links
electronics.create_reviews_df(num_to_download=2, clear=True)

downloading 1 link from 2
downloading 2 link from 2
creating dataframe is completed


In [116]:
# final dataframe
electronics.df.head()

|   | category | review title | review content | rating |
|---|---|---|---|---|
| 0 | Cell Phones & Smartphones | Great Phone, But Repetitive Design | The iPhone 11 Pro Max continues to showcase Ap... | 5 |
| 1 | Cell Phones & Smartphones | Smooth and fast | Like any new wave of Apple product, this is ea... | 5 |
| 2 | Cell Phones & Smartphones | Awesome | Absolutely love it ,perfect new condition flaw... | 5 |
| 3 | Cell Phones & Smartphones | Smokin fast | Great phone. User friendly, super fast, long l... | 5 |
| 4 | Cell Phones & Smartphones | Great phone | It all I wanted and now I wouldn’t trade it fo... | 5 |


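The class also implements the addition operator: adding one instance to another appends the second instance's reviews to the first one's dataframe.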

In [117]:
# creating second dataframe
electronics2 = EbayCrawler(['huawei'], 5)
electronics2.search_word_urls()
electronics2.find_products_urls(display=False)
electronics2.create_reviews_df(num_to_download=5)
electronics2.df.head()

creating search links...

downloading products urls...

downloading finished

downloading 1 link from 5
downloading 2 link from 5
downloading 3 link from 5
downloading 4 link from 5
downloading 5 link from 5
creating dataframe is completed


|   | category | review title | review content | rating |
|---|---|---|---|---|
| 0 | Cell Phones & Smartphones | Would Reccomend even without Google. | Great fit in your hand great Battery life, Per... | 5 |
| 1 | Cell Phones & Smartphones | The best bang for your buck! | I do have a couple of small issues with the P2... | 5 |
| 2 | Cell Phones & Smartphones | Amazing | Two small problems: 1. The phone came started ... | 5 |
| 3 | Cell Phones & Smartphones | Very Reliable phone. | This is our 2nd Wuawei phone. I got a mate 9 ... | 5 |
| 4 | Cell Phones & Smartphones | More than I expected | Amazing product! Took a chance on an unfamilia... | 5 |


In [119]:
# shapes before add
print(electronics.df.shape, electronics2.df.shape)

# adding two instances
electronics + electronics2

# shape after add
print(electronics.df.shape)

(30, 4) (21, 4)
(51, 4)
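
The shape change above comes from a row-wise concatenation of the two dataframes followed by an index reset. A minimal sketch of that step on its own, using made-up review rows and not the downloaded data:

```python
import pandas as pd

columns = ["category", "review title", "review content", "rating"]

# made-up rows standing in for two downloaded review sets
first = pd.DataFrame(
    [["Phones", "Great", "Works well", 5],
     ["Phones", "Bad", "Stopped working", 1]],
    columns=columns,
)
second = pd.DataFrame(
    [["Tools", "Solid", "Sturdy drill", 4]],
    columns=columns,
)

# ignore_index=True renumbers the rows 0..n-1 after stacking
combined = pd.concat([first, second], ignore_index=True)
print(first.shape, second.shape)
print(combined.shape)
```

The row counts add up while the number of columns stays the same, which is why the class checks that both dataframes have the same number of columns before adding them.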


In [120]:
# saving dataframe to csv file, a user can give some additional options regarding saving
electronics.save_to_csv('test.csv', index=False)

dataframe saved to csv file
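
Because `save_to_csv` forwards its keyword arguments to the pandas `to_csv` method, options such as `index=False` decide what ends up in the file. The sketch below illustrates the effect of that option when the data is read back. It uses a made-up review row and an in-memory buffer in place of a file on disk, both of which are assumptions for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "category": ["Phones"],
    "review title": ["Great"],
    "review content": ["Works, and works well"],
    "rating": [5],
})

# default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the four data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

print(columns_with_index)
print(restored.columns.tolist())
```

Without `index=False`, reading the file back yields an additional column named `Unnamed: 0`.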


In [121]:
# reviews of one product can be collected using class method
url_test = (
    'https://www.ebay.com/p/19032164388'
    '?iid=301769309963&var=600587289314#UserReviews'
)
test = EbayCrawler.one_product_reviews(url_test)

In [122]:
test

|   | category | review title | review content | rating |
|---|---|---|---|---|
| 0 | Wormer Products | Panacur C 3 packets 10 lb dog | i purchase this for my cat and chose what is a... | 5 |
| 1 | Wormer Products | use regular.. makes for healthier dogs | Works great! no more worms and inproves breat... | 5 |
| 2 | Wormer Products | This is a great product . After giving it to m... | Verified purchase: Yes \| Condition: new \| Sol... | 5 |
| 3 | Wormer Products | great item thank you for prompt shipping | this med is great for both dog and cat , a gre... | 5 |
| 4 | Wormer Products | good for planaria as well | used it to deworm my fish tanks worked like a ... | 5 |
| 5 | Wormer Products | Panacur C | works every time ! read and follow directions ... | 5 |
| 6 | Wormer Products | A better product @2grams = 444mg of Fenbendazole | A better product @2grams = 444mg of Fenbendazo... | 5 |
| 7 | Wormer Products | GREAT | Very effective. Price has doubled in the last ... | 5 |
| 8 | Wormer Products | Great product, great price ! | Verified purchase: Yes \| Condition: new \| Sol... | 5 |
| 9 | Wormer Products | Will reorder | Works great, fast, no Rx needed. | 5 |
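
In the class, `ebay_parser` stores the rating as a count of star icons (an integer), while `ebay_parser2` takes the first character of a text label (a string), so a combined dataframe may hold both types in the `rating` column. One possible way to normalize the column before analysis is sketched below with made-up values. This step is a suggestion and is not part of the class.

```python
import pandas as pd

# made-up ratings mixing both representations, plus one unreadable value
reviews = pd.DataFrame({"rating": [5, "4", "n/a"]})

# values that cannot be parsed become missing instead of raising an error;
# the nullable Int64 dtype keeps whole numbers alongside missing values
ratings = pd.to_numeric(reviews["rating"], errors="coerce").astype("Int64")
print(ratings.tolist())
print(ratings.dtype)
```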
