## Importing Libraries

In [1]:
import time
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import os
import yaml

## Scraping

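The User-Agent header identifies the client that makes the request. Since mine describes my own browser and system, I load it from an environment variable instead of hard-coding it in the notebook, so that it stays hidden from the public. The path to the YAML file with the CSS selectors is read the same way.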

In [2]:
load_dotenv()
USER_AGENT = os.getenv("USER_AGENT")
HEADER = os.getenv("HEADER")
YML_FILE = os.getenv("YML_FILE")
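
One caveat with `os.getenv` is that it returns `None` when a variable is missing, so a typo in the `.env` file only surfaces later (for example, `str(YML_FILE)` would become the literal string `"None"`). Below is a minimal sketch of a stricter lookup. The names `require_env` and `DEMO_USER_AGENT` are hypothetical and introduced only for illustration; they are not part of the notebook.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Demo with a throwaway variable (hypothetical name, set here only for the example).
os.environ["DEMO_USER_AGENT"] = "Mozilla/5.0 (demo)"
print(require_env("DEMO_USER_AGENT"))
```

In the notebook this could be used as `USER_AGENT = require_env("USER_AGENT")`, which would fail right away if the `.env` file were incomplete.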

Here we have the Scraper class.
In the init function I declare 3 attributes:
- **Session**: a `requests.Session` object that allows the user to communicate with the website;
- **Asin**: the code that identifies an Amazon product inside a marketplace. It is unique only within a specific country's store, so the same product may have one ASIN on amazon.com and a different one on Amazon UK;
- **Url**: the URL that links directly to the reviews. The string is already formatted so that every page can be reached by appending the page number to the end.

Then we have the **check_page** function. Its role is to assess whether the page I want to scrape is available. I pass only **i**, which represents the page number, and the function requests that page. It returns the page content if the response status is 200; otherwise it returns False.

The **load_yml** function reads the YAML file that maps each part of a review to its CSS selector.

The last function, **scrape**, does the actual scraping. It parses the page and, for each entry in the YAML file, uses the CSS selector to extract the matching part of every review. In the end I build a dictionary whose keys are the parts of the review and whose values are lists with all the extracted text.

In [3]:
class Scraper:
    def __init__(self, asin) -> None:
        self.session = requests.Session()
        self.asin = asin
        self.url = f"https://www.amazon.com/product-reviews/{self.asin}/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber="
    
    def check_page(self, i):
        headers = {"User-Agent": USER_AGENT}
        response = self.session.get(self.url + str(i), headers=headers)
        
        if response.status_code == 200:
            return response.content
        else:
            return False
    
    def load_yml(self):
        with open(str(YML_FILE), "r") as f:
            config = yaml.safe_load(f)
        return config

    def scrape(self, page):
        total = []
        
        scraped_data = {}
        
        config = self.load_yml()
        
        soup = BeautifulSoup(page, 'html.parser')
        
        for key, selector in config.items():
            elements = soup.select(selector['css'])
            scraped_data[key] = [element.text.strip().replace('\n', '') for element in elements]
        # Append once, after every selector has been applied, to avoid duplicate entries
        total.append(scraped_data)
        return total
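
The YAML file itself is not shown in this notebook. As a rough sketch of the shape that `scrape` expects (each key maps to a dictionary with a `css` entry), the example below parses a small config and applies it to a stand-in HTML snippet. The keys, the selectors, the sample HTML and the helper name `extract` are all assumptions made for illustration, not the real configuration or Amazon's actual markup.

```python
import yaml
from bs4 import BeautifulSoup

# Hypothetical selector config; the real YML_FILE contents are not shown here.
EXAMPLE_YML = """
title:
  css: "a.review-title"
body:
  css: "span.review-text"
"""

# Tiny stand-in for a downloaded review page (not real Amazon markup).
EXAMPLE_HTML = """
<div class="review">
  <a class="review-title"> Great product </a>
  <span class="review-text">Works as expected</span>
</div>
<div class="review">
  <a class="review-title">Not for me</a>
  <span class="review-text">Stopped working</span>
</div>
"""


def extract(html, config):
    """Apply every CSS selector in `config` and collect the stripped text per key."""
    soup = BeautifulSoup(html, "html.parser")
    scraped = {}
    for key, selector in config.items():
        elements = soup.select(selector["css"])
        scraped[key] = [el.text.strip().replace("\n", "") for el in elements]
    return scraped


config = yaml.safe_load(EXAMPLE_YML)
example_result = extract(EXAMPLE_HTML, config)
print(example_result)
```

With this layout, supporting another part of the review only requires a new key in the YAML file, with no change to the Python code.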

## Main

In [4]:
scraper = Scraper('B08D6VD9TR')

results = []

for i in range(1, 51):
    page = scraper.check_page(i)
    if page:
        print(f"Scraping page {i}")
        review = scraper.scrape(page)
        time.sleep(0.5)
        results.append(review)


Scraping page 1
Scraping page 2
Scraping page 3
Scraping page 4
Scraping page 5
Scraping page 6
Scraping page 7
Scraping page 8
Scraping page 9
Scraping page 10
Scraping page 11
Scraping page 12
Scraping page 13
Scraping page 14
Scraping page 15
Scraping page 16
Scraping page 17
Scraping page 18
Scraping page 19
Scraping page 20
Scraping page 21
Scraping page 22
Scraping page 23
Scraping page 24
Scraping page 25
Scraping page 27
Scraping page 28
Scraping page 29
Scraping page 30
Scraping page 31
Scraping page 32
Scraping page 33
Scraping page 34
Scraping page 35
Scraping page 36
Scraping page 38
Scraping page 39
Scraping page 40
Scraping page 41
Scraping page 42
Scraping page 43
Scraping page 44
Scraping page 45
Scraping page 46
Scraping page 47
Scraping page 48
Scraping page 49
Scraping page 50
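
Pages 26 and 37 do not appear in the output above, which most likely means that `check_page` returned False for them and they were skipped. One possible way to handle this, sketched under that assumption, is to retry a failed page a few times with an increasing delay. The helper `fetch_with_retry`, its parameters and the fake fetcher used in the demo are hypothetical and not part of the original notebook.

```python
import time


def fetch_with_retry(fetch, page_number, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(page_number) until it returns a truthy value or attempts run out."""
    for attempt in range(attempts):
        page = fetch(page_number)
        if page:
            return page
        # Wait longer after each failure, but not after the final attempt.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False


# Demo with a fake fetcher that fails twice before succeeding.
calls = []


def flaky_fetch(page_number):
    calls.append(page_number)
    return b"<html></html>" if len(calls) >= 3 else False


# The sleep function is replaced with a no-op so the demo runs instantly.
demo_page = fetch_with_retry(flaky_fetch, 26, sleep=lambda seconds: None)
print(demo_page, len(calls))
```

In the main loop, `scraper.check_page(i)` could then be replaced with `fetch_with_retry(scraper.check_page, i)`.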
