## Importing Libraries

In [None]:
import time
from requests_html import HTMLSession
from lxml import html
from dotenv import load_dotenv
import os

## Scraping

The user agent identifies my client to the website, and I would rather not publish it. For this reason I store it in an environment variable, loaded from a `.env` file, instead of writing it in the notebook.

In [None]:
load_dotenv()
USER_AGENT = os.getenv("USER_AGENT")
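
Note that `os.getenv` returns `None` when the variable is not defined, so a missing `.env` entry would only show up later as a failed request. A minimal sketch of an early check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both "not defined" (None) and "defined but empty" ("")
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

After `load_dotenv()`, this could be used as `USER_AGENT = require_env("USER_AGENT")`.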

Here we have the Scraper class.
In the `__init__` method I declare 3 attributes:
- *Session*: an `HTMLSession` object from the requests_html library that handles the communication with the website;
- *Asin*: the code that identifies an Amazon product inside the marketplace. It is unique only within a specific country, so amazon.com has its own asin for a product, while the same product on Amazon UK may need a different asin;
- *Url*: the url that links directly to the reviews. The string ends with `pageNumber=`, so every page can be reached by appending the page number.
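
The way the page number completes the url can be sketched as a small standalone function. This is only an illustration: the name `build_reviews_url` and the `domain` parameter are assumptions, while the path and query parameters are copied from the url used in the class.

```python
from urllib.parse import urlencode


def build_reviews_url(asin: str, page: int, domain: str = "www.amazon.com") -> str:
    """Build the reviews url for one page of one product."""
    # Same query parameters as the url in Scraper, with the page number last
    query = urlencode({
        "ie": "UTF8",
        "reviewerType": "all_reviews",
        "sortBy": "recent",
        "pageNumber": page,
    })
    return f"https://{domain}/product-reviews/{asin}/ref=cm_cr_dp_d_show_all_btm?{query}"


print(build_reviews_url("B08D6VD9TR", 2))
```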

Then we have the *check_page* method. Its role is to check whether the page I want to scrape contains reviews. I pass only *i*, which represents the page number, and use a CSS selector to look for review elements. The method returns the reviews if it finds any, otherwise it returns False.
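
The same check can be tried offline on a plain HTML string. The sketch below does not use requests_html: it swaps in the standard library `html.parser` and looks for the same `div` elements with `data-hook="review"`. The names `ReviewCounter` and `has_reviews` are hypothetical.

```python
from html.parser import HTMLParser


class ReviewCounter(HTMLParser):
    """Count <div data-hook="review"> elements in an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "div" and dict(attrs).get("data-hook") == "review":
            self.count += 1


def has_reviews(page_html: str) -> bool:
    """Return True if the page contains at least one review element."""
    counter = ReviewCounter()
    counter.feed(page_html)
    return counter.count > 0
```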

The last method, *scrape*, does the actual scraping. If there are reviews on the page, I loop over the list and extract the title, rating and body of each review, skipping any review where one of these elements is missing.

In [None]:
class Scraper:
    def __init__(self, asin) -> None:
        self.session = HTMLSession()
        self.asin = asin
        self.url = f"https://www.amazon.com/product-reviews/{self.asin}/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber="
    
    def check_page(self, i):
        headers = {"User-Agent": USER_AGENT}
        response = self.session.get(self.url + str(i), headers=headers)

        # Run the selector once; an empty list means the page has no reviews
        reviews = response.html.find('div[data-hook=review]')
        return reviews if reviews else False

    def scrape(self, reviews):
        total = []

        for review in reviews:
            try:
                title = review.find('div[data-hook="review"] span[data-hook="review-title"]', first = True).text.strip().replace('\n', '')
                rating = review.find('i[data-hook=review-star-rating] span', first = True).text.strip().replace('\n', '').replace(' out of 5 stars', '')
                body = review.find('div[data-hook="review"] span[data-hook="review-body"]', first = True).text.strip().replace('\n', '')
            except AttributeError:
                # find(..., first=True) returns None when an element is missing
                continue

            data = {'title': title,
                    'rating': rating,
                    'body': body}      
            total.append(data)
        return total
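
The rating is stored as a string such as `'4.0'`. If a number is needed later, the cleanup done in *scrape* can be followed by a conversion. This is a minimal sketch, assuming the rating text always has the form `"<number> out of 5 stars"` with a dot as decimal separator; the name `parse_rating` is hypothetical.

```python
def parse_rating(raw: str) -> float:
    """Turn a rating text like '4.0 out of 5 stars' into a float."""
    # Same cleanup steps as in Scraper.scrape, then a numeric conversion
    cleaned = raw.strip().replace("\n", "").replace(" out of 5 stars", "")
    return float(cleaned)
```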

## Main

In [None]:
scraper = Scraper('B08D6VD9TR')

results = []
for i in range(1,51):          
    print('Getting page', i)
    time.sleep(0.3)             
    reviews = scraper.check_page(i)
    
    if reviews:
        results.append(scraper.scrape(reviews))
    else:
        print('No more reviews')
        break  # stop requesting pages once one comes back empty
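
Because `results.append` adds one list per page, `results` ends up as a list of lists. A minimal sketch of flattening it into a single list of review dictionaries follows; the name `flatten_pages` and the sample data are illustrative only.

```python
from itertools import chain


def flatten_pages(pages):
    """Merge a list of per-page review lists into one flat list."""
    return list(chain.from_iterable(pages))


# Made-up sample with the same shape as `results`
sample = [
    [{"title": "a", "rating": "5.0", "body": "x"}],
    [{"title": "b", "rating": "3.0", "body": "y"},
     {"title": "c", "rating": "1.0", "body": "z"}],
]
flat = flatten_pages(sample)
print(len(flat))
```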
