<Center>
    <h1 style="font-family: Roboto slab">
        <p>
        <font color="white">
            Sentiment Mining for Amazon Devices: 
            <br>
            Applying Natural Language Processing with Machine Learning and Deep Learning Techniques
        </font>
    </h1>
    <h3 style="font-family: Roboto slab">
        <font color="yellow">
            Notebook 1/5: Web Scraping
        </font>
    </h3>
</Center>

# I. Introduction & Context
---------------------------------------------------------------------------

### <font color = "yellow" >Objective:</font>
This project aims to build a sentiment analysis tool to classify customer reviews of Amazon devices as positive, negative, or neutral. <br>

The project preprocesses review text, extracts key features, and implements both traditional machine learning models (Logistic Regression, Naive Bayes, Support Vector Machines, ...) and deep learning models (LSTM-based RNNs). After training and evaluation, the models are compared and the best-performing one is selected for deployment in the sentiment analysis tool, so that its sentiment classifications are accurate enough to support business decisions and product improvements.

### <font color = "yellow">Application Overview:</font>
The project is divided into five core steps:

<ul>
    <li>
        <u>Step 1:</u>  Data Collection
        <ul>
            <li> <b>Description:</b> Collect Amazon device reviews from Amazon's website with Selenium and BeautifulSoup, capturing Review_id, Reviewer, Rating, Date, Review title, Review content, Product_id, Product_title, and Product_link for each review.</li>
            <li> <b> Output: </b> Raw dataset of Amazon device reviews, including review text, star ratings, and other relevant metadata.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 2:</u> Data Pre-Processing
        <ul>
            <li><b> Description: </b> Clean and prepare the review text for exploratory analysis. Identify and resolve missing values and inconsistencies in the dataset. Convert text to lowercase; remove special characters, stop words, and punctuation; then apply tokenization and lemmatization.
            </li>
            <li> <b> Output: </b> Cleaned dataset for EDA and model development.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 3:</u> Exploratory Data Analysis
        <ul>
            <li>Data Distribution: Analyze review counts across sentiment classes. </li>
            <li>Text Analysis: Review text length, word count, and common words.</li>
            <li>Sentiment Visualization: Visualize trends in positive, neutral, and negative reviews.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 4:</u> Model Development
         <ul>
            <li>Developing 5 machine learning and LSTM deep learning models.</li>
            <li>Evaluating model performance.</li>
            <li>Selecting the best model for final hyperparameter tuning.</li>
            <li>Validating the final model's predictions on 10 new real reviews.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 5:</u> Model Deployment
        <ul>
            <li>Deploy the sentiment analysis model via a Flask API and create a website for users to input reviews and view predicted sentiments in real-time.</li>
        </ul>
    </li>
</ul>

# II. Amazon Web Scraping
---------------------------------------------------------------------------

## <font color = "red">1.  Web Scraping Tools: Selenium and BeautifulSoup</font>

- <b>Selenium:</b> a web automation tool primarily used for testing web applications. It can drive a real browser through actions such as clicking the "Next" button to page through reviews, accepting cookies or dismissing pop-ups, and interacting with JavaScript elements. In this project, Selenium loads each of Amazon’s product list pages and review pages in Chrome, so that dynamically rendered reviews are present in the page source before it is parsed.

- <b>BeautifulSoup:</b> a Python library for parsing HTML and XML documents. Once Selenium has loaded a page, BeautifulSoup parses the page source and extracts specific elements from it. Its methods for accessing tags, attributes, and text content make it well suited to turning scraped pages into structured data.

## <font color = "red">2.  Libraries Import</font>

In [4]:
import time
from datetime import datetime
from urllib.parse import unquote, urlparse, parse_qs

import pandas as pd
from bs4 import BeautifulSoup

# Selenium libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

## <font color="red">3.  Function initialization</font>

In [None]:
def generate_review_link(page_no, product_path, product_id):
    """
    Function that generates one review page url for each product

    Parameters
    ----------
    page_no : review page number
    product_path: product name path
    product_id: product id

    Returns
    -------
    str: review page url

    """
    return (f"https://www.amazon.com/{product_path}/product-reviews/{product_id}"
            f"/ref=cm_cr_arp_d_paging_btm_next_{page_no}?ie=UTF8&reviewerType=all_reviews&pageNumber={page_no}")
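
As a quick check of the URL pattern, the sketch below rebuilds the same review link with `urllib.parse.urlencode` and prints page 2 for the example product path and ID quoted later in this notebook. `build_review_url` is a hypothetical name introduced here for illustration only; it is not part of the scraper.

```python
from urllib.parse import urlencode


def build_review_url(page_no, product_path, product_id):
    """Hypothetical helper: same URL pattern as generate_review_link."""
    # Encode the query string instead of writing it by hand
    query = urlencode({
        "ie": "UTF8",
        "reviewerType": "all_reviews",
        "pageNumber": page_no,
    })
    return (f"https://www.amazon.com/{product_path}/product-reviews/{product_id}"
            f"/ref=cm_cr_arp_d_paging_btm_next_{page_no}?{query}")


example_url = build_review_url(2, "Android-Expanded-Quad-core-processor-certified", "B0D6279VW5")
print(example_url)
```

The page number appears twice in the link, once in the `ref` segment and once in the `pageNumber` parameter, which is why the function takes it as a single argument.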


In [120]:
def normalize_url(url):
    """
    Function that normalizes product url to direct link

    Parameters
    ----------
    url: original url extracted from web page

    Returns
    -------
    str: normalized product url

    """
    
    if "/sspa/click" in url:   # Type 1: sponsored link that wraps the real product URL
        parsed_url = urlparse(url)
        query_params = parse_qs(parsed_url.query)   # Parse the query string into a dict of lists

        if 'url' in query_params:
            product_url = unquote(query_params['url'][0])   # Decode the product path held in the 'url' parameter
            return product_url
    
    # Type 2: already a direct product link (or no 'url' parameter found), so return it unchanged
    return url
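
To make the two link types concrete, the sketch below runs the same decoding logic on two made-up links. The function name `resolve_sponsored_link` and both example links are assumptions for illustration; real sponsored links carry more query parameters.

```python
from urllib.parse import parse_qs, unquote, urlparse


def resolve_sponsored_link(url):
    """Hypothetical stand-in that mirrors the logic of normalize_url."""
    if "/sspa/click" in url:
        # parse_qs returns a dict mapping each parameter to a list of values
        params = parse_qs(urlparse(url).query)
        if "url" in params:
            return unquote(params["url"][0])
    return url


# Made-up links for illustration only
sponsored = "/sspa/click?ie=UTF8&spc=abc&url=%2FExample-Product%2Fdp%2FB000000000%2Fref%3Dsr_1_1_sspa"
direct = "/Example-Product/dp/B000000000/ref=sr_1_1"

print(resolve_sponsored_link(sponsored))
print(resolve_sponsored_link(direct))
```

The sponsored link resolves to the product path stored in its `url` parameter, while the direct link passes through unchanged.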

In [None]:
# Initialize the ChromeDriver service for Selenium 
webdriver_service = Service("/usr/local/bin/chromedriver")

# Set Chrome options
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_experimental_option("detach", True)
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")  # Avoid detection
chrome_options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)")

# Create a new Chrome browser instance
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)  
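
The scraping function in this notebook pauses with fixed `time.sleep` calls after each page load. A common alternative is to poll until the content actually appears. Selenium provides this through `WebDriverWait` in `selenium.webdriver.support.ui`. The sketch below shows the same idea without any browser, using a hypothetical `wait_until` helper and a fake page check, so it is an illustration of the technique rather than the notebook's method.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that needs three polls before its content appears
attempts = {"count": 0}


def fake_page_ready():
    attempts["count"] += 1
    return ["review-1", "review-2"] if attempts["count"] >= 3 else []


reviews_found = wait_until(fake_page_ready, timeout=5.0, poll_interval=0.01)
print(reviews_found, attempts["count"])
```

With a live driver, the condition could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, 'div[data-hook="review"]')`, which returns an empty list until the reviews are rendered.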

In [None]:
def getAllHtml(products_page_len):
    """
    Function that uses Selenium to load product and review pages and BeautifulSoup to parse their HTML

    Parameters
    ----------
    products_page_len: number of product pages to scrape

    Returns
    -------
    list: list of dictionaries, one per review page, each holding the product id, product title, product url and the page's BeautifulSoup object

    """

    # Initialize an empty list to hold BeautifulSoup objects
    soups = []

    #///////////////////////////////////////////#
    # Outer Loop: Navigating Product List Pages #
    #///////////////////////////////////////////#

    for page_no in range(1, products_page_len + 1):
         
        URL = f"https://www.amazon.com/s?i=amazon-devices&rh=n%3A2102313011&s=popularity-rank&fs=true&page={page_no}&qid=1729958028&ref=sr_pg_{page_no}"   # List of products pages to scrape

        driver.get(URL)  # Load each product list page
        driver.maximize_window()
        time.sleep(5) # Wait for content to load fully

        # Parse the rendered page and collect every product <div>
        time.sleep(2)

        product_list = BeautifulSoup(driver.page_source, 'html.parser') 
        product_divs = product_list.find_all('div', attrs={'data-component-type': "s-search-result"})  # Each matching <div> is one search result (product)
  

        print(f"Page {page_no}: {URL}")
        print(f"--> There are: {len(product_divs)} products")

        if not product_divs: # Exit the loop and finish the function when it comes to last page which has no products
            break
        else: 
            #///////////////////////////////////////////#
            # Loop through all product divs of each page #
            #///////////////////////////////////////////#
            count = 0
            for div in product_divs:

                count = count + 1

                link_html = div.find("a", attrs={'class':'a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal'})  # Find the product <a> tag within the current <div>
                title_html = div.find("span", attrs={'class': 'a-size-medium a-color-base a-text-normal'})
                if link_html is None or title_html is None:  # Skip results that lack the expected link or title markup
                    continue

                link = normalize_url(link_html.get('href'))
                product_title = title_html.get_text(strip=True)  # Extract product title

                path_parts = urlparse(link).path.split('/')  # Split the product path into its segments
                if len(path_parts) < 4:  # Skip links whose path is too short to hold a product path and ID
                    continue

                product_path = path_parts[1]  # Product path, example: 'Android-Expanded-Quad-core-processor-certified'
                product_id = path_parts[3]  # Product ID, example: 'B0D6279VW5'
                product_url = f"https://www.amazon.com{link}"

            
                #////////////////////////////////////////////////////#
                # Loop through list of review pages for Each Product#
                #///////////////////////////////////////////////////#
                print(f"Page {page_no}, product {count} : {product_title}")

                for i in range(1, 11):  # Amazon shows at most 10 pages of reviews per product

                    review_link = generate_review_link(i, product_path, product_id)  # Call generate_review_link function to generate link for each review list page
                    
                    driver.get(review_link)  #Load each review page
                    time.sleep(5)  # Wait for content to load

                    review_divs = driver.find_elements(By.CSS_SELECTOR, 'div[data-hook="review"]')  # Find elements that contain review details
                    time.sleep(2)

                    soup = BeautifulSoup(driver.page_source, 'html.parser')

                    if not review_divs: # exit this loop when it comes to the last page of reviews and move to next product
                        print("no review divs")
                        break

                    else:
                        print(f"Review page {i}: There are {len(review_divs)} reviews")  
                        # Add single Html product review data in master soups list
                        soup_dict = {
                            "product_id": product_id,
                            "product_title": product_title,
                            "product_url": product_url,
                            "review_data": soup
                        }
                        soups.append(soup_dict)

    driver.quit()
    
    return soups
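
`getAllHtml` reads the product path and product ID by position in the link path, which assumes every link has the same layout. The sketch below is one hedged alternative that matches the layout explicitly and returns `None` for anything else. The helper name `split_product_link`, the `/dp/` segment, and the 10-character ID pattern are assumptions based on the example values in the code comments, not something the notebook verifies.

```python
import re
from urllib.parse import urlparse

# Assumption: product links look like /<product-path>/dp/<10-character id>/...
PRODUCT_PATH_RE = re.compile(r"^/([^/]+)/dp/([A-Z0-9]{10})(?:/|$)")


def split_product_link(link):
    """Return (product_path, product_id), or None when the link has another shape."""
    match = PRODUCT_PATH_RE.match(urlparse(link).path)
    return match.groups() if match else None


# Made-up links for illustration only
print(split_product_link("/Example-Product/dp/B000000000/ref=sr_1_1"))
print(split_product_link("/dp/B000000000/ref=sr_1_1"))
```

A caller can then skip any product for which the helper returns `None` instead of failing partway through a long scraping run.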


In [None]:
def getReviews(review_dict):
    """
    Function that extracts review details from one scraped review page

    Parameters
    ----------
    review_dict : dictionary holding product_id, product_title, product_url and review_data (the page's BeautifulSoup object)

    Returns
    -------
    list : list of dictionaries of reviews: review_id, reviewer name, rating, date, review title, review content, product id, product title, product link

    """

    data_dicts = []  # Empty list to hold all reviews from this page
    html_product = review_dict["review_data"]
    
    boxes = html_product.select('div[data-hook="review"]')  # Select every review box with a CSS selector
    
    
    #////////////////////////////////////////////////////#
    # Loop through all the reviews///////////////////////#
    #///////////////////////////////////////////////////#
    for box in boxes:
        
        try:
            name = box.select_one('[class="a-profile-name"]').text.strip()   # Select Name using css selector and cleaning text using strip()
        except Exception as e:
            name = 'N/A'    # If Value is empty define value with 'N/A' for all.

        try:
            rating = box.select_one('[data-hook="review-star-rating"]').text.strip().split(' out')[0]
        except Exception as e:
            rating = 'N/A' 

        review_id = box.get('id') or 'N/A'  # .get() returns None instead of raising, so fall back explicitly

        try:
            review_title = box.select_one('[data-hook="review-title"]').text.strip().split('\n')[1]
        except Exception as e:
            review_title = 'N/A'

        try:
            # Convert date str to dd/mm/yyyy format
            datetime_str = box.select_one('[data-hook="review-date"]').text.strip().split(' on ')[1]
            review_date = datetime.strptime(datetime_str, '%B %d, %Y').strftime("%d/%m/%Y")
        except Exception as e:
            review_date = 'N/A'

        try:
            review_content = box.select_one('[data-hook="review-body"]').text.strip()
        except Exception as e:
            review_content = 'N/A'

        # Create a dictionary holding all fields of this review
        data_dict = {
            'Review_id': review_id,
            'Reviewer': name,
            'Rating': rating,
            'Date': review_date,
            'Review title': review_title,
            'Review content': review_content,
            'Product_id': review_dict["product_id"],
            'Product_title': review_dict["product_title"],
            'Product_link': review_dict["product_url"]
        }

        # Append the dictionary to the master list
        data_dicts.append(data_dict)
    
    return data_dicts
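
The rating and date fields are the two places where `getReviews` reshapes text rather than copying it. The sketch below isolates those two steps so they can be checked on their own. The helper names and the sample strings are assumptions chosen to match the split patterns used above (`' out'` and `' on '`); returning `None` instead of `'N/A'` is a design choice of this sketch, not the notebook's behavior.

```python
from datetime import datetime


def parse_rating(text):
    """Turn '5.0 out of 5 stars' into 5.0; return None if the text has another shape."""
    try:
        return float(text.strip().split(" out")[0])
    except (AttributeError, ValueError):
        return None


def parse_review_date(text):
    """Turn '... on July 13, 2024' into '13/07/2024'; return None on failure."""
    try:
        date_part = text.strip().rsplit(" on ", 1)[1]
        return datetime.strptime(date_part, "%B %d, %Y").strftime("%d/%m/%Y")
    except (AttributeError, IndexError, ValueError):
        return None


print(parse_rating("5.0 out of 5 stars"))
print(parse_review_date("Reviewed in the United States on July 13, 2024"))
print(parse_rating("no rating"), parse_review_date("unknown"))
```

Catching only the specific exceptions each step can raise keeps unrelated bugs visible, whereas a bare `except Exception` would silently turn them into missing values.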

## <font color="red">4. Function application & Data Processing</font>

In [123]:
# Grab all HTML data from the 29 product list pages of Amazon Devices
html_datas = getAllHtml(29)

Page 1: https://www.amazon.com/s?i=amazon-devices&rh=n%3A2102313011&s=popularity-rank&fs=true&page=1&qid=1729958028&ref=sr_pg_1
--> There are: 24 products
Page 1, product 1 : Blink Subscription Plus Plan with monthly auto-renewal
Review page 1: There are 10 reviews
Review page 2: There are 10 reviews
Review page 3: There are 10 reviews
Review page 4: There are 10 reviews
Review page 5: There are 10 reviews
Review page 6: There are 10 reviews
Review page 7: There are 10 reviews
Review page 8: There are 10 reviews
Review page 9: There are 10 reviews
Review page 10: There are 10 reviews
Page 1, product 2 : Amazon Fire TV Stick 4K with AI-powered Fire TV Search, Wi-Fi 6, stream over 1.5 million movies and shows, free & live TV
Review page 1: There are 10 reviews
Review page 2: There are 10 reviews
Review page 3: There are 10 reviews
Review page 4: There are 10 reviews
Review page 5: There are 10 reviews
Review page 6: There are 10 reviews
Review page 7: There are 10 reviews
Review page 8: 

In [124]:
# Empty List to Hold all reviews data
reviews = []

# Iterate over every scraped review page
for html_data in html_datas:
    
    # Extract the reviews found on this page
    review = getReviews(html_data)
    
    # Append them to the master list
    reviews += review
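
Because the same product can appear on more than one product list page, the combined list may contain repeated reviews. Whether that happens in this dataset is not established here, so the sketch below is an optional check rather than a required step. `drop_duplicate_reviews` is a hypothetical helper that keeps the first review seen for each `Review_id`.

```python
def drop_duplicate_reviews(reviews, key="Review_id"):
    """Keep the first review seen for each key value, preserving order."""
    seen = set()
    unique = []
    for review in reviews:
        review_key = review.get(key)
        if review_key in seen:
            continue
        seen.add(review_key)
        unique.append(review)
    return unique


# Made-up records for illustration only
sample = [
    {"Review_id": "R1", "Rating": "5.0"},
    {"Review_id": "R2", "Rating": "4.0"},
    {"Review_id": "R1", "Rating": "5.0"},
]
print(len(drop_duplicate_reviews(sample)))
```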

In [125]:
# Create a dataframe with reviews Data
df_reviews = pd.DataFrame(reviews)

In [126]:
df_reviews

| | Review_id | Reviewer | Rating | Date | Review title | Review content | Product_id | Product_title | Product_link |
|---|---|---|---|---|---|---|---|---|---|
| 0 | R192QJ45JRSLTC | Chris | 5.0 | 13/07/2024 | Didn't think I needed it until I wish I had it. | The Blink Subscription Basic Plan offers essen... | B08JHCVHTY | Blink Subscription Plus Plan with monthly auto... | https://www.amazon.com/Blink-Plus-Plan-monthly... |
| 1 | RLJN0G2I0CRNC | Amazon Customer | 5.0 | 12/10/2024 | Worth every penny! | I have had Blink cameras at my house for years... | B08JHCVHTY | Blink Subscription Plus Plan with monthly auto... | https://www.amazon.com/Blink-Plus-Plan-monthly... |
| 2 | R19D78F9YK0DVA | uniquely unique | 5.0 | 11/10/2024 | I’m quite satisfied | I’ve been using the Blink Subscription Plus Pl... | B08JHCVHTY | Blink Subscription Plus Plan with monthly auto... | https://www.amazon.com/Blink-Plus-Plan-monthly... |
| 3 | R2W7QUYHDCN6CB | Lyndi Dawn Macdonald | 4.0 | 26/09/2024 | Very Nice Added Security | Really like having these cameras, they give us... | B08JHCVHTY | Blink Subscription Plus Plan with monthly auto... | https://www.amazon.com/Blink-Plus-Plan-monthly... |
| 4 | RM9R0N4N310DC | ccoulson90 | 5.0 | 21/10/2024 | Great Item | The Blink Subscription Plus Plan is a fantasti... | B08JHCVHTY | Blink Subscription Plus Plan with monthly auto... | https://www.amazon.com/Blink-Plus-Plan-monthly... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15857 | R3M2LG3W8S0EKR | A. Rose | 2.0 | 20/12/2018 | Cannot Change Reading Progress Font Sizes | Just wasted half an hour on the phone with tec... | B00QJDVBFU | Kindle Paperwhite 3G, 6" High-Resolution Displ... | https://www.amazon.com/Paperwhite-High-Resolut... |
| 15858 | RMFF5NVA9YK87 | Kindle Customer | 2.0 | 26/01/2016 | New paperwhite | Why would I buy this I like to listen to music... | B00QJDVBFU | Kindle Paperwhite 3G, 6" High-Resolution Displ... | https://www.amazon.com/Paperwhite-High-Resolut... |
| 15859 | R11DMWR29QKLWY | Gordon | 4.0 | 11/05/2016 | Kindle Great Device | I am in my eighties and find it quite intuitiv... | B00QJDVBFU | Kindle Paperwhite 3G, 6" High-Resolution Displ... | https://www.amazon.com/Paperwhite-High-Resolut... |
| 15860 | R1PCPXC0B5ZUNC | dorothyinoz | 1.0 | 12/05/2016 | DEFECTIVE PRODUCT | Within six months of buying this product it fa... | B00QJDVBFU | Kindle Paperwhite 3G, 6" High-Resolution Displ... | https://www.amazon.com/Paperwhite-High-Resolut... |


## <font color="red">5. Dataset Export</font>

In [127]:
# Export data to an Excel file
df_reviews.to_excel('amazon_reviews.xlsx', index=False)