# Script that performs scraping from Amazon website of a specific product

Sentiment analysis on product reviews: Students will extract product reviews from an e-commerce site 
like Amazon, using web scraping. They will then process the reviews and use sentiment analysis techniques 
to classify opinions as positive, negative, or neutral.

### Importing Libraries

In [10]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
import time
import streamlit as st

In [11]:
# Set display options
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.max_rows', None)  # Display all rows
pd.set_option('display.max_colwidth', None)  # Prevent DataFrame wrapping

In [12]:
review_title = []
review_body = []
review_stars = []
i = 1

### Scraping:
The code underneath basically does all the scraping I need to construct the dataframe with the reviews. 

First of all I declared a variable **URL** that obviously contains the link to the product. Secondly I declared an **header** that is mandatory in order to access the webpage.

After that I iterate the first 50 pages of reviews in order to obtain more a consistent result. Before the actual scraping I check for any possible error while requesting the page. 

After that I look for every css selector that include the title, the body and the star the person gave to the product. Through the css selector I can also find automatically the next page.

In [13]:
URL = "https://www.amazon.it/echo-dot-2022/product-reviews/B09B8X9RGM/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"

headers = {
        'authority': 'www.amazon.it',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-language': 'it-IT,it;en-GB,en-US;q=0.9,en;q=0.8'
    }

while i <= 10:
    try:
        webpage = requests.get(URL, headers=headers)
        # Process the response if the request was successful
        if webpage.status_code == 200:
            # Starting the scraping
            soup = BeautifulSoup(webpage.content, 'html.parser')
            print(f'Scraping page {i}')
            review_title.append(soup.select('a.review-title')) # css selector for the title of the review
            review_body.append(soup.select('div.a-row.review-data span.review-text')) # css selector for the body of the review
            review_stars.append(soup.select('div.a-row:nth-of-type(2) > a.a-link-normal:nth-of-type(1)')) # css selector for the stars of the review
            try:
                next_link = soup.select_one('li.a-last a')
                if next_link is not None:
                    next_url = next_link.get('href')
                    URL = f"https://www.amazon.it{next_url}"
            except Exception as e:
                print(f'An error occured {e}')
        else:
            # Handle the response if it's not successful
            print(f"Request failed with status code: {webpage.status_code}")
    except requests.RequestException as e:
        # Handle any exceptions that occur during the request
        print(f"An error occurred: {e}")
    i += 1
    time.sleep(2)


Scraping page 1
Scraping page 2
Scraping page 3
Scraping page 4
Scraping page 5
Scraping page 6
An error occurred: HTTPSConnectionPool(host='www.amazon.ithttps', port=443): Max retries exceeded with url: //www.amazon.it/ap/signin?openid.return_to=https%3A%2F%2Fwww.amazon.it%2Fecho-dot-2022%2Fproduct-reviews%2FB09B8X9RGM%3FpageNumber%3D2%26reviewerType%3Dall_reviews&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=itflex&openid.mode=checkid_setup&marketPlaceId=APJ6JRA9NG5V4&language=it&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x13e47c310>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
An error occurred: HTTPSConnectionPool(host='www.amazon.ithttps', port=443): Max retries exceeded with url: //www.amazon.it/ap/signin?openi

### Text processing

Here I make all the texts more flat, removing all the '\n', keeping only letters inside the texts and ultimately making everything lowercase

In [14]:
review_title = [[element.text.replace('\n', '') for element in sublist] for sublist in review_title]
review_body = [[element.text.replace('\n', '') for element in sublist] for sublist in review_body]
review_stars = [[element.get('title').split()[0] for element in sublist] for sublist in review_stars] # getting only the number of stars the user put

AttributeError: 'NoneType' object has no attribute 'split'

In [None]:
review_title = [[re.sub("[^a-zA-ZÀ-ÖØ-öø-ÿ]", " ", element) for element in sublist] for sublist in review_title] # keeping only the letters from the titles
review_title = [[element.lower() for element in sublist] for sublist in review_title]

In [None]:
review_body = [[re.sub("[^a-zA-ZÀ-ÖØ-öø-ÿ]", " ", element) for element in sublist] for sublist in review_body] # keeping only the letters from the bodies
review_body = [[element.lower() for element in sublist] for sublist in review_body]

In [None]:
df = pd.DataFrame(columns = ['Title', 'Body', 'Stars'])

In [None]:
df['Title'] = [item for sublist in review_title for item in sublist]
df['Body'] = [item for sublist in review_body for item in sublist]
df['Stars'] = [item for sublist in review_stars for item in sublist]

In [None]:
df['Stars'] = [element.replace(',0', '') for element in df['Stars']] # keeping only the number of stars the person put since previously the column was like '4,0', '3,0' etc...
df['Stars'] = df['Stars'].astype(int)
df['Title'] = df['Title'].astype(str)
df['Body'] = df['Body'].astype(str)

### NA's check and file writing

In [None]:
df.isnull().sum()

Title    0
Body     0
Stars    0
dtype: int64

In [None]:
df.sample(10)

Unnamed: 0,Title,Body,Stars
282,bello,alexa è stratosferica risponde bene ad ogni comando molto comodo e utile per chi sta sempre a letto e per chi perde il telecomando,5
455,alexa,e piccola ma ha un suono spettacolo mi piace un sacco regalate a natale anche hai miei genitori e miei suoceriottimo tuttoimpacchettamwnto spedizione prodotto e consegna ora voglio le lampadine appena potrò me le regalo,5
311,soddisfatto,articolo consegnato nei tempi e rispondente alla descrizione,4
12,utilissimo,per qualità prezzo non si poteva chiedere di meglio,5
425,ottimo prodotto,tutto perfetto ma per installarla ho dovuto chiamare un tecnico non è poi così semplice come dicono,5
238,alexa tutta la vita,ottimo acquisto,5
86,amazon,ottima come tutti prodotti amazon secondo me é da migliore il touch per mettere pausa perché non prende al primo colpo,5
156,miglioramenti solamente nel design,rispetto al modello precedente di eco dot quello di terza generazione personalmente noto solo miglioramenti nel design la qualità audio mi sembra la stessa e l efficienza energetica è pressoché la stessa mi sono trovato bene in passato mi trovo bene tutt ora nulla da aggiungere,5
347,bene,audio discreto a patto di non alzare molto il volume esteticamente accattivante ha un sensore di temperatura che chiedendo informa sui gradi della stanza in cui e collocato senza abbonamento premium si ascoltano i brani casuali e non quelli voluti,5
401,dovrebbero averlo tutti,ne avevo già una in cada e ne ho comprata un altra per mio cognato è un oggetto comodissimo soprattutto se associato ad altri come tv e lampadine non uso più gli interruttori e lo trovo comodissimo è comodo anche per ascoltare musica e come timer può essere usato come vivavoce e per fare chiamate se collegato al telefono eccellente,5


Here I write the dataframe into a csv file in order to upload it in the next file

In [None]:
df.to_csv('data.csv', index=False, header=0)