## Importing Libraries

In [10]:
import time
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import os
from requests_html import HTMLSession

## Scraping

The user agent string identifies the client making the request to the website. I store it in an environment variable so that it stays out of the public notebook.

In [12]:
load_dotenv()
USER_AGENT = os.getenv("USER_AGENT")
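
`os.getenv` returns `None` when the variable is unset, which is easy to miss. The following is a minimal sketch of a guard for that case. The helper name `read_user_agent` and its error message are hypothetical, and it reads from a mapping so it can be tried without a `.env` file.

```python
import os


def read_user_agent(env=os.environ):
    """Return the USER_AGENT value from `env`, or raise if it is missing or empty."""
    # Hypothetical helper: strips whitespace and rejects an unset or blank value.
    user_agent = env.get("USER_AGENT", "").strip()
    if not user_agent:
        raise RuntimeError("USER_AGENT is not set; add it to the .env file")
    return user_agent


# Example with an explicit mapping instead of the real environment.
print(read_user_agent({"USER_AGENT": "Mozilla/5.0 (example)"}))
```

In the notebook this could be called right after `load_dotenv()`, in place of the bare `os.getenv` call.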

Here is the Scraper class.
Its init function sets 3 attributes:
- **Session**: an `HTMLSession` object from the `requests_html` library, used to send requests to the website;
- **Asin**: the code that identifies an Amazon product inside a marketplace. It is unique only within a specific country's store, so a product on amazon.com has one ASIN, while the same product on Amazon UK may be listed under a different ASIN;
- **Url**: the URL that links directly to the reviews. The string ends with `pageNumber=`, so every page can be reached by appending its number.
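
To illustrate how one URL template covers every review page, here is a small sketch. The URL is the one used in this notebook; the names `REVIEWS_URL` and `review_page_url` are hypothetical, and the check that page numbers start at 1 is an assumption.

```python
REVIEWS_URL = (
    "https://www.amazon.com/product-reviews/{asin}/"
    "ref=cm_cr_dp_d_show_all_btm"
    "?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber={page}"
)


def review_page_url(asin, page):
    """Build the reviews URL for one ASIN and one page number."""
    if page < 1:
        # Assumption: pages are numbered from 1.
        raise ValueError("page numbers start at 1")
    return REVIEWS_URL.format(asin=asin, page=page)


# Print the URL of the second review page for the ASIN used in this notebook.
print(review_page_url("B08D6VD9TR", 2))
```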

Then we have the **check_page** function. Its role is to assess whether the page I want to scrape can be fetched. I pass only **i**, the number of the page, and the function requests that page with the user agent header. It returns the response if the status code is 200, otherwise it returns False.

The last function, **scrape**, does the actual scraping. If the page contains the reviews container, I use CSS selectors to extract the parts of each review. In its current form the function returns only the review title elements. The intended end result is a dictionary whose keys are the parts of a review and whose values are lists holding that part for every review.
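
The dictionary of lists mentioned above could be assembled roughly as follows. This is only a sketch: the function name `collect_review_parts`, the field names, and the sample data are all made up for illustration and are not taken from the scraper.

```python
from collections import defaultdict


def collect_review_parts(reviews):
    """Turn a list of per-review dicts into one dict of lists, keyed by part name."""
    parts = defaultdict(list)
    for review in reviews:
        for name, value in review.items():
            parts[name].append(value)
    return dict(parts)


# Made-up sample data, only to show the shape of the result.
sample = [
    {"title": "Great", "rating": "5.0", "body": "Works as described."},
    {"title": "Too small", "rating": "2.0", "body": "Returned it."},
]
print(collect_review_parts(sample))
```

The lists stay aligned only if every review supplies every part, so a real scraper would need a placeholder value for any part that is missing.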

In [37]:
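class Scraper:
    def __init__(self, asin) -> None:
        self.session = HTMLSession()
        self.asin = asin
        # Reviews URL for this ASIN; check_page appends the page number.
        self.url = f"https://www.amazon.com/product-reviews/{self.asin}/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber="

    def check_page(self, i):
        """Fetch review page i; return the response, or False if the status is not 200."""
        headers = {"User-Agent": USER_AGENT}
        response = self.session.get(self.url + str(i), headers=headers)

        if response.status_code == 200:
            return response
        return False

    def scrape(self, page):
        """Return the review title elements, or False if the reviews container is missing."""
        # check_page returns False on failure, so guard before touching page.html.
        if not page:
            return False
        # Chained class selector: a single div carrying all four classes.
        # find() returns a list (never None) unless first=True is passed.
        container = page.html.find("div.a-section.a-spacing-none.reviews-content.a-size-base", first=True)
        if container is None:
            return False
        return page.html.find(f"div[data-asin='{self.asin}'] .review-title-content span")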

## Main

In [38]:
scraper = Scraper('B08D6VD9TR')

results = []

page = scraper.check_page(1)
print(scraper.scrape(page))


[]


In [5]:
results

[]

In [6]:

# Full crawl over pages 1 to 50, currently disabled.
# for i in range(1, 51):
#     page = scraper.check_page(i)
#     if page:
#         print(f"Scraping page {i}")
#         review = scraper.scrape(page)
#         results.append(review)
#         time.sleep(0.5)
#     else:
#         print(f"Page {i} could not be fetched")
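
A possible variant of the crawl loop is sketched below. It stops at the first page that cannot be fetched or that yields no reviews, which is a design choice of this sketch and not the behavior of the loop above. The name `crawl_pages` and its parameters are hypothetical, and stand-in functions replace the network calls so it runs offline.

```python
import time


def crawl_pages(fetch, extract, max_pages=50, delay=0.5):
    """Call fetch(i) for pages 1..max_pages and gather the extract(page) results.

    Stops at the first page that cannot be fetched or yields no reviews.
    """
    results = []
    for i in range(1, max_pages + 1):
        page = fetch(i)
        if not page:
            break
        reviews = extract(page)
        if not reviews:
            break
        results.extend(reviews)
        time.sleep(delay)  # pause between requests
    return results


# Stand-in data and functions so the sketch runs without network access.
fake_pages = {1: ["review a", "review b"], 2: ["review c"]}
print(crawl_pages(fake_pages.get, lambda page: page, delay=0))
```

With the class from this notebook, `fetch` would be `scraper.check_page` and `extract` would be `scraper.scrape`.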