<Center>
    <h1 style="font-family: Roboto slab">
        <p>
        <font color="white">
            Sentiment Mining for Amazon Devices: 
            <br>
            Applying Natural Language Processing with Machine Learning and Deep Learning Techniques
        </font>
    </h1>
    <h3 style="font-family: Roboto slab">
        <font color="yellow">
            Notebook 2/5: Data Pre-Processing
        </font>
    </h3>
</Center>

# I. Introduction & Context
---------------------------------------------------------------------------

### <font color = "yellow" >Objective:</font>
This project aims to build a sentiment analysis tool to classify customer reviews of Amazon devices as positive, negative, or neutral. <br>

The project involves preprocessing review text, extracting key features, and implementing both traditional machine learning models (Logistic Regression, Naive Bayes, Support Vector Machines, ...) and deep learning models (LSTM-based RNNs). After training and evaluating these models, the project will compare their performance to select the most effective one for deployment in the sentiment analysis tool. This approach ensures that the tool utilizes the best-performing model to deliver accurate sentiment classification, ultimately supporting better business decisions and product improvements.

### <font color = "yellow">Application Overview:</font>
The project is divided into five core steps:

<ul>
    <li>
        <u>Step 1:</u>  Data Collection
        <ul>
            <li> <b>Description:</b> Gather Amazon device reviews from Amazon's website with web scraping tools (Selenium and BeautifulSoup), capturing Review_id, Reviewer, Rating, Date, Review_title, Review_content, Product_id, Product_title, and Product_link.</li>
            <li> <b> Output: </b> Raw dataset of Amazon device reviews, including review text, star ratings, and other relevant metadata.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 2:</u> Data Pre-Processing
        <ul>
            <li><b> Description: </b> Clean and prepare the review text for exploratory analysis. Identify and resolve missing values or inconsistencies in the dataset. Convert text to lowercase; remove special characters, stop words, and punctuation; apply tokenization and lemmatization.
            </li>
            <li> <b> Output: </b> Cleaned dataset for EDA and model development.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 3:</u> Exploratory Data Analysis
        <ul>
            <li>Data Distribution: Analyze review counts across sentiment classes. </li>
            <li>Text Analysis: Review text length, word count, and common words.</li>
            <li>Sentiment Visualization: Visualize trends in positive, neutral, and negative reviews.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 4:</u> Model Development
         <ul>
            <li>Developing 5 machine learning and LSTM deep learning models.</li>
            <li>Evaluating model performance.</li>
            <li>Selecting the best model for final hyperparameter tuning.</li>
            <li>Validating the final model's predictions on 10 new real reviews.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 5:</u> Model Deployment
        <ul>
            <li>Deploy the sentiment analysis model via a Flask API and create a website for users to input reviews and view predicted sentiments in real-time.</li>
        </ul>
    </li>
</ul>

# II. Data Preprocessing
---------------------------------------------------------------------------

## <font color = "red">1.  Libraries Import</font>

In [145]:
# Basic libraries
import pandas as pd
import numpy as np

# For Decoding HTML Entities
import html

from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
import nltk
import contractions
import re

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

# Suppress FutureWarnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vytran/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/vytran/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/vytran/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/vytran/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/vytran/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## <font color="red">2.  Dataset import</font>

In [146]:
# Read reviews from an Excel file
raw_reviews = pd.read_excel('../data/amazon_reviews.xlsx')

In [147]:
# Check columns and rows
raw_reviews.shape

(15862, 9)

In [148]:
# Check data information
raw_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15862 entries, 0 to 15861
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Review_id       15862 non-null  object 
 1   Reviewer        15858 non-null  object 
 2   Rating          15806 non-null  float64
 3   Date            15862 non-null  object 
 4   Review title    15828 non-null  object 
 5   Review content  15847 non-null  object 
 6   Product_id      15862 non-null  object 
 7   Product_title   15862 non-null  object 
 8   Product_link    15862 non-null  object 
dtypes: float64(1), object(8)
memory usage: 1.1+ MB


In [149]:
raw_reviews.head()

Unnamed: 0,Review_id,Reviewer,Rating,Date,Review title,Review content,Product_id,Product_title,Product_link
0,R192QJ45JRSLTC,Chris,5.0,13/07/2024,Didn't think I needed it until I wish I had it.,The Blink Subscription Basic Plan offers essen...,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...
1,RLJN0G2I0CRNC,Amazon Customer,5.0,2024-12-10 00:00:00,Worth every penny!,I have had Blink cameras at my house for years...,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...
2,R19D78F9YK0DVA,uniquely unique,5.0,2024-11-10 00:00:00,I’m quite satisfied,I’ve been using the Blink Subscription Plus Pl...,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...
3,R2W7QUYHDCN6CB,Lyndi Dawn Macdonald,4.0,26/09/2024,Very Nice Added Security,"Really like having these cameras, they give us...",B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...
4,RM9R0N4N310DC,ccoulson90,5.0,21/10/2024,Great Item,The Blink Subscription Plus Plan is a fantasti...,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...


## <font color="red">3.  Dataset Details</font>

The Amazon review dataset contains the customer reviews for all listed Amazon devices: Echo, Fire TV, Fire Tablet...<br>
There are a total of 15,862 reviews on 690 unique products. 


Descriptions of columns:
- Review_id: ID of the review
- Reviewer: Name of the reviewer
- Rating: Star rating given to the product by the reviewer
- Date: Date of the review (dd/mm/yyyy)
- Review title: Title of the review
- Review content: Content of the review
- Product_id: ID of the product being reviewed
- Product_title: Product title
- Product_link: Product link



## <font color="red">4.  Rename Columns</font>

In [150]:
# Creating a copy dataset
process_reviews = raw_reviews.copy()

# Renaming the column 'Review title' to 'Review_title'
process_reviews = process_reviews.rename(columns={'Review title': 'Review_title'})

# Renaming the column 'Review content' to 'Review_content'
process_reviews = process_reviews.rename(columns={'Review content': 'Review_content'})

# Check data information
process_reviews.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15862 entries, 0 to 15861
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Review_id       15862 non-null  object 
 1   Reviewer        15858 non-null  object 
 2   Rating          15806 non-null  float64
 3   Date            15862 non-null  object 
 4   Review_title    15828 non-null  object 
 5   Review_content  15847 non-null  object 
 6   Product_id      15862 non-null  object 
 7   Product_title   15862 non-null  object 
 8   Product_link    15862 non-null  object 
dtypes: float64(1), object(8)
memory usage: 1.1+ MB


## <font color="red">5.  Handling Null Values</font>

In [151]:
# Checking for null values
process_reviews.isnull().sum()

Review_id          0
Reviewer           4
Rating            56
Date               0
Review_title      34
Review_content    15
Product_id         0
Product_title      0
Product_link       0
dtype: int64

Explanation:
- Reviewer (4 nulls): This column is unlikely to affect sentiment analysis, so the nulls are filled with the placeholder "Anonymous" to avoid losing rows.
- Rating (56 nulls): Ratings are essential because the sentiment label is derived from them. Null ratings make up about 0.35% of the dataset (56 of 15,862 rows), so these rows are dropped to maintain integrity.
- Review_title (34 nulls): As long as the review content is available, the title is not strictly necessary for sentiment analysis, so these nulls are filled with the placeholder "Title sample" to avoid losing rows.
- Review_content (15 nulls): Review content is crucial for sentiment analysis and a placeholder would add no meaningful text, so rows with null content are dropped.
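
The share of nulls per column can be checked before choosing between filling and dropping. The snippet below is a minimal sketch on a made-up four-row frame (the `toy` data and the `null_share` and `cleaned` names are illustrative assumptions, not part of the dataset); it applies the same policy described above.

```python
import pandas as pd

# Hypothetical toy frame that mimics three of the review columns
toy = pd.DataFrame({
    "Reviewer": ["Chris", None, "Ann", "Bo"],
    "Rating": [5.0, 4.0, None, 3.0],
    "Review_content": ["great", "fine", "ok", None],
})

# Percentage of missing values in each column
null_share = toy.isnull().mean() * 100
print(null_share)

# Fill the low-impact column, drop rows missing an essential one
cleaned = (
    toy.assign(Reviewer=toy["Reviewer"].fillna("Anonymous"))
       .dropna(subset=["Rating", "Review_content"])
)
print(cleaned)
```

On the real data, the same `isnull().mean()` call on `process_reviews` would give the per-column percentages.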

In [152]:
# Filling 4 nulls in 'Reviewer' columns with : "Anonymous"
process_reviews['Reviewer'] = process_reviews['Reviewer'].fillna('Anonymous')

# Dropping rows with null values in the 'Rating' column
process_reviews = process_reviews.dropna(subset=['Rating'])

# Filling 34 nulls in 'Review_title' columns with : "Title sample"
process_reviews["Review_title"] = process_reviews['Review_title'].fillna('Title sample')

# Dropping rows with null values in the 'Review_content' column
process_reviews = process_reviews.dropna(subset=['Review_content'])

# Checking for null values
process_reviews.isnull().sum()

Review_id         0
Reviewer          0
Rating            0
Date              0
Review_title      0
Review_content    0
Product_id        0
Product_title     0
Product_link      0
dtype: int64

## <font color="red">6.  Creating 'Review_len' and 'Word_count' Columns</font>

This step calculates the character length (Review_len) and word count (Word_count) of each review to gain insights into the dataset's text structure.

In [153]:
# Calculating review length and word count
process_reviews['Review_len'] = process_reviews['Review_content'].apply(len)
process_reviews['Word_count'] = process_reviews['Review_content'].apply(lambda x: len(x.split()))

process_reviews.head()

Unnamed: 0,Review_id,Reviewer,Rating,Date,Review_title,Review_content,Product_id,Product_title,Product_link,Review_len,Word_count
0,R192QJ45JRSLTC,Chris,5.0,13/07/2024,Didn't think I needed it until I wish I had it.,The Blink Subscription Basic Plan offers essen...,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,2558,356
1,RLJN0G2I0CRNC,Amazon Customer,5.0,2024-12-10 00:00:00,Worth every penny!,I have had Blink cameras at my house for years...,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,190,37
2,R19D78F9YK0DVA,uniquely unique,5.0,2024-11-10 00:00:00,I’m quite satisfied,I’ve been using the Blink Subscription Plus Pl...,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,638,97
3,R2W7QUYHDCN6CB,Lyndi Dawn Macdonald,4.0,26/09/2024,Very Nice Added Security,"Really like having these cameras, they give us...",B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,996,189
4,RM9R0N4N310DC,ccoulson90,5.0,21/10/2024,Great Item,The Blink Subscription Plus Plan is a fantasti...,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,766,114


## <font color="red">7.  Concatenating Review_title and Review_content</font>

Concatenating Review_title and Review_content into a single text field can enhance sentiment analysis by combining all relevant information and simplifying preprocessing, feature extraction, and vectorization for a more efficient analysis process.

In [154]:
# Checking Review_title and Review_content before concatenating
print(f"Title: {process_reviews.iloc[5]['Review_title']}")
print(f"Content: {process_reviews.iloc[5]['Review_content']}")

Title: Doesn't do much .. But nothing works Without it.
Content: I&apos;m not impressed with the blink system at all. It spends most of it&apos;s time offline and missing. Currently, I have 2 mini blinks ... which work the best - if they are plugged in. Currently, one is missing, and the other is often unplugged because it is annoying if someone is working in the backyard. We have 2 doorbells. One never seems to be working, but the front door at least rings through. I rarely get pictures from it anymore.

We have 2 spotlights ... one requires the little white box to work that stores the pictures and uploads to the Blink plan ... however, if the white box fails, you have no way to contact or store the pictures. It&apos;s been ages since the white storage box quit working, and I can&apos;t find the replacement. My husband says he&apos;ll get around to installing the second spotlights, but he&apos;s been saying that for 2 years now.

Overall, I would not recommend the blink as a system. I

In [155]:
# Concatenating 'Review_title' and 'Review_content' into a single column 'Full_review'
process_reviews['Full_review'] = process_reviews["Review_title"].astype(str) + ' ' + process_reviews['Review_content'].astype(str)

# Dropping 'Review_title' and 'Review_content' columns
process_reviews = process_reviews.drop(['Review_title', 'Review_content'], axis = 1)

process_reviews.head()

Unnamed: 0,Review_id,Reviewer,Rating,Date,Product_id,Product_title,Product_link,Review_len,Word_count,Full_review
0,R192QJ45JRSLTC,Chris,5.0,13/07/2024,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,2558,356,Didn't think I needed it until I wish I had it...
1,RLJN0G2I0CRNC,Amazon Customer,5.0,2024-12-10 00:00:00,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,190,37,Worth every penny! I have had Blink cameras at...
2,R19D78F9YK0DVA,uniquely unique,5.0,2024-11-10 00:00:00,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,638,97,I’m quite satisfied I’ve been using the Blink ...
3,R2W7QUYHDCN6CB,Lyndi Dawn Macdonald,4.0,26/09/2024,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,996,189,Very Nice Added Security Really like having th...
4,RM9R0N4N310DC,ccoulson90,5.0,21/10/2024,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,766,114,Great Item The Blink Subscription Plus Plan is...


In [156]:
# Checking Full_review after concatenating
print('Output review example:')
print(process_reviews.iloc[5]['Full_review'])

Output review example:
Doesn't do much .. But nothing works Without it. I&apos;m not impressed with the blink system at all. It spends most of it&apos;s time offline and missing. Currently, I have 2 mini blinks ... which work the best - if they are plugged in. Currently, one is missing, and the other is often unplugged because it is annoying if someone is working in the backyard. We have 2 doorbells. One never seems to be working, but the front door at least rings through. I rarely get pictures from it anymore.

We have 2 spotlights ... one requires the little white box to work that stores the pictures and uploads to the Blink plan ... however, if the white box fails, you have no way to contact or store the pictures. It&apos;s been ages since the white storage box quit working, and I can&apos;t find the replacement. My husband says he&apos;ll get around to installing the second spotlights, but he&apos;s been saying that for 2 years now.

Overall, I would not recommend the blink as a sy

## <font color="red">8.  Creating 'Sentiment' Column</font>

In this preprocessing step, a sentiment column is created as the outcome variable, categorizing reviews based on their ratings: ratings greater than 3 indicate positive sentiment, ratings less than 3 indicate negative sentiment, and ratings equal to 3 are considered neutral.

In [157]:
# Counting the distribution of ratings
process_reviews['Rating'].value_counts()

Rating
5.0    5876
3.0    4020
4.0    2451
2.0    2019
1.0    1425
Name: count, dtype: int64

In [158]:
def sentiment_categorization(row):

    ''' Assigns a sentiment value based on the user's rating.'''

    if row['Rating'] < 3.0:
        val = 'Negative'
    elif row['Rating'] > 3.0:
        val = 'Positive'
    else:
        val = 'Neutral'

    return val

In [159]:
# Applying the function to create the 'Sentiment' column
process_reviews['Sentiment'] = process_reviews.apply(sentiment_categorization, axis=1)
process_reviews.head()

Unnamed: 0,Review_id,Reviewer,Rating,Date,Product_id,Product_title,Product_link,Review_len,Word_count,Full_review,Sentiment
0,R192QJ45JRSLTC,Chris,5.0,13/07/2024,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,2558,356,Didn't think I needed it until I wish I had it...,Positive
1,RLJN0G2I0CRNC,Amazon Customer,5.0,2024-12-10 00:00:00,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,190,37,Worth every penny! I have had Blink cameras at...,Positive
2,R19D78F9YK0DVA,uniquely unique,5.0,2024-11-10 00:00:00,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,638,97,I’m quite satisfied I’ve been using the Blink ...,Positive
3,R2W7QUYHDCN6CB,Lyndi Dawn Macdonald,4.0,26/09/2024,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,996,189,Very Nice Added Security Really like having th...,Positive
4,RM9R0N4N310DC,ccoulson90,5.0,21/10/2024,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,766,114,Great Item The Blink Subscription Plus Plan is...,Positive


In [160]:
process_reviews['Sentiment'].value_counts()

Sentiment
Positive    8327
Neutral     4020
Negative    3444
Name: count, dtype: int64
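
The same rating-to-sentiment rule can also be expressed without a row-wise `apply`. The following is a minimal sketch on a small made-up Series (the `ratings` values are illustrative, not taken from the dataset) using `np.select`, which evaluates the conditions in order and falls back to the default for a rating of exactly 3.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for illustration
ratings = pd.Series([5.0, 3.0, 1.0, 4.0, 2.0])

# Conditions are checked in order; anything matching neither is 'Neutral'
sentiment = np.select(
    [ratings < 3.0, ratings > 3.0],
    ["Negative", "Positive"],
    default="Neutral",
)
print(list(sentiment))
```

On a dataset of this size the row-wise function is fast enough, so this is a stylistic alternative rather than a required change.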

## <font color="red">9.  Handling 'Date' Column</font>

The Date column stores the day, month, and year of each review in a single value, and the raw values arrive in mixed formats (dd/mm/yyyy strings alongside full timestamps). To support time-based analysis, the column is parsed and split into separate components: year, month, and day. <br>This makes it possible to look for patterns such as seasonal trends or changes in sentiment over time.

In [161]:
# Try to parse the 'Date' column, accounting for mixed formats
process_reviews['Date'] = pd.to_datetime(
    process_reviews['Date'], 
    errors='coerce', 
    dayfirst=True
)

# Extracting day, month, and year as separate columns
process_reviews['Day'] = process_reviews['Date'].dt.day
process_reviews['Month'] = process_reviews['Date'].dt.month
process_reviews['Year'] = process_reviews['Date'].dt.year

# Count rows where 'Date' conversion failed (errors='coerce' turns them into NaT)
missing_date_conversion = process_reviews[process_reviews['Date'].isnull()]
print(f"Rows with unconvertible dates: {len(missing_date_conversion)}")

process_reviews = process_reviews.drop(['Date'], axis=1)
process_reviews.head()

Rows with unconvertible dates: 0


Unnamed: 0,Review_id,Reviewer,Rating,Product_id,Product_title,Product_link,Review_len,Word_count,Full_review,Sentiment,Day,Month,Year
0,R192QJ45JRSLTC,Chris,5.0,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,2558,356,Didn't think I needed it until I wish I had it...,Positive,13,7,2024
1,RLJN0G2I0CRNC,Amazon Customer,5.0,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,190,37,Worth every penny! I have had Blink cameras at...,Positive,10,12,2024
2,R19D78F9YK0DVA,uniquely unique,5.0,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,638,97,I’m quite satisfied I’ve been using the Blink ...,Positive,10,11,2024
3,R2W7QUYHDCN6CB,Lyndi Dawn Macdonald,4.0,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,996,189,Very Nice Added Security Really like having th...,Positive,26,9,2024
4,RM9R0N4N310DC,ccoulson90,5.0,B08JHCVHTY,Blink Subscription Plus Plan with monthly auto...,https://www.amazon.com/Blink-Plus-Plan-monthly...,766,114,Great Item The Blink Subscription Plus Plan is...,Positive,21,10,2024
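
Because the raw Date column mixes dd/mm/yyyy strings with values that were already stored as dates, it can help to see how many values fall into each group. The sketch below uses only the standard library on a made-up list; `parse_review_date` and `raw_dates` are hypothetical names, and the dd/mm/yyyy format is assumed from the column description.

```python
from datetime import datetime

def parse_review_date(value):
    """Return a datetime for a dd/mm/yyyy string or an existing datetime."""
    if isinstance(value, datetime):
        # Already parsed upstream; its day/month order cannot be re-checked here
        return value
    return datetime.strptime(value.strip(), "%d/%m/%Y")

# Hypothetical mix of text dates and an already-parsed date
raw_dates = ["13/07/2024", datetime(2024, 12, 10), "26/09/2024"]

parsed = [parse_review_date(v) for v in raw_dates]
already_parsed = sum(isinstance(v, datetime) for v in raw_dates)

print(parsed)
print(f"Values that were dates before parsing: {already_parsed}")
```

The `dayfirst=True` hint only affects values that are parsed from text. Values that reached pandas as dates keep whatever day and month they were given upstream, so it may be worth spot-checking a few of them against the original reviews.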


## <font color="red">10.  Removing Unnecessary Columns</font>

In [162]:
process_reviews = process_reviews.drop(['Reviewer', 'Product_title', 'Product_link'], axis = 1)

process_reviews.head()

Unnamed: 0,Review_id,Rating,Product_id,Review_len,Word_count,Full_review,Sentiment,Day,Month,Year
0,R192QJ45JRSLTC,5.0,B08JHCVHTY,2558,356,Didn't think I needed it until I wish I had it...,Positive,13,7,2024
1,RLJN0G2I0CRNC,5.0,B08JHCVHTY,190,37,Worth every penny! I have had Blink cameras at...,Positive,10,12,2024
2,R19D78F9YK0DVA,5.0,B08JHCVHTY,638,97,I’m quite satisfied I’ve been using the Blink ...,Positive,10,11,2024
3,R2W7QUYHDCN6CB,4.0,B08JHCVHTY,996,189,Very Nice Added Security Really like having th...,Positive,26,9,2024
4,RM9R0N4N310DC,5.0,B08JHCVHTY,766,114,Great Item The Blink Subscription Plus Plan is...,Positive,21,10,2024


## <font color="red">11.  Text Pre-Processing</font>

### <font color="red">1.  Converting Reviews to Lower Case</font>

Converting all text to lowercase removes case distinctions, so that variants such as "Great" and "great" are treated as the same word.

In [163]:
process_reviews['Full_review'] = process_reviews['Full_review'].str.lower()

print('Output review example:')
print(process_reviews.iloc[5]['Full_review'])

Output review example:
doesn't do much .. but nothing works without it. i&apos;m not impressed with the blink system at all. it spends most of it&apos;s time offline and missing. currently, i have 2 mini blinks ... which work the best - if they are plugged in. currently, one is missing, and the other is often unplugged because it is annoying if someone is working in the backyard. we have 2 doorbells. one never seems to be working, but the front door at least rings through. i rarely get pictures from it anymore.

we have 2 spotlights ... one requires the little white box to work that stores the pictures and uploads to the blink plan ... however, if the white box fails, you have no way to contact or store the pictures. it&apos;s been ages since the white storage box quit working, and i can&apos;t find the replacement. my husband says he&apos;ll get around to installing the second spotlights, but he&apos;s been saying that for 2 years now.

overall, i would not recommend the blink as a sy

### <font color="red">2.  Decoding HTML Entities</font>

HTML entities are escape codes that stand in for characters that are reserved in HTML or awkward to type directly, such as the apostrophe (&#39;) and the en dash (&#8211;). They show up in text scraped from web pages because the page source stores those characters in escaped form. In this dataset, apostrophes arrive as "&apos;", as in "I&apos;m".

The html library in Python provides a function called html.unescape() that converts HTML entities to their readable text equivalents.
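
As a quick illustration, the sketch below runs `html.unescape()` on a few made-up strings (the `samples` list is illustrative) covering a named entity, two numeric entities, and an ampersand.

```python
import html

# Hypothetical snippets containing different kinds of HTML entities
samples = [
    "i&apos;m not impressed",   # named entity for the apostrophe
    "it&#39;s fine",            # numeric entity for the apostrophe
    "pages 5&#8211;10",         # numeric entity for the en dash
    "fast &amp; quiet",         # named entity for the ampersand
]

decoded = [html.unescape(s) for s in samples]
for before, after in zip(samples, decoded):
    print(f"{before!r} -> {after!r}")
```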

In [164]:
# Decode HTML entities in the 'Full_review' column
process_reviews['Full_review'] = process_reviews['Full_review'].apply(html.unescape)

# Display a review example
print(process_reviews.iloc[5]['Full_review'])

doesn't do much .. but nothing works without it. i'm not impressed with the blink system at all. it spends most of it's time offline and missing. currently, i have 2 mini blinks ... which work the best - if they are plugged in. currently, one is missing, and the other is often unplugged because it is annoying if someone is working in the backyard. we have 2 doorbells. one never seems to be working, but the front door at least rings through. i rarely get pictures from it anymore.

we have 2 spotlights ... one requires the little white box to work that stores the pictures and uploads to the blink plan ... however, if the white box fails, you have no way to contact or store the pictures. it's been ages since the white storage box quit working, and i can't find the replacement. my husband says he'll get around to installing the second spotlights, but he's been saying that for 2 years now.

overall, i would not recommend the blink as a system. i'm torn between starting over with a ring syst

### <font color="red">3.  Expanding Contractions</font>

Expanding contractions converts shortened forms like "I'll" and "can't" into their full forms ("I will" and "cannot"). <br>
Doing this before punctuation removal preserves meaning: stripping the apostrophe first would turn "I'll" into "ill" or "y'all" into "yall," which can confuse the lemmatizer. <br>
The expansion does not resolve ambiguous forms from context (in the example output, "it's been" becomes "it is been"), but most contractions, including negations such as "can't", are expanded in a way that keeps the original sentiment available to later steps such as lemmatization.
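
The effect of the ordering can be seen with a small standard-library sketch. The `TOY_CONTRACTIONS` mapping and both helper functions are hypothetical stand-ins written for this illustration; they are not the `contractions` library, which covers far more forms.

```python
import re

# Hypothetical three-entry mapping, for illustration only
TOY_CONTRACTIONS = {"i'll": "i will", "can't": "cannot", "y'all": "you all"}
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in TOY_CONTRACTIONS) + r")\b"
)

def expand_toy(text):
    """Replace each known contraction with its full form."""
    return _PATTERN.sub(lambda m: TOY_CONTRACTIONS[m.group(0)], text)

def strip_punct(text):
    """Remove every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)

sample = "i'll say y'all can't lose"

stripped_first = strip_punct(sample)               # apostrophes removed first
expanded_first = strip_punct(expand_toy(sample))   # contractions expanded first

print(stripped_first)
print(expanded_first)
```

Stripping first leaves tokens such as "ill" and "cant", while expanding first keeps the negation as the separate word "cannot".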

In [165]:
# Expand contractions in the 'Full_review' column
process_reviews['Full_review'] = process_reviews['Full_review'].apply(contractions.fix)

# Display a review example
print(process_reviews.iloc[5]['Full_review'])

does not do much .. but nothing works without it. i am not impressed with the blink system at all. it spends most of it is time offline and missing. currently, i have 2 mini blinks ... which work the best - if they are plugged in. currently, one is missing, and the other is often unplugged because it is annoying if someone is working in the backyard. we have 2 doorbells. one never seems to be working, but the front door at least rings through. i rarely get pictures from it anymore.

we have 2 spotlights ... one requires the little white box to work that stores the pictures and uploads to the blink plan ... however, if the white box fails, you have no way to contact or store the pictures. it is been ages since the white storage box quit working, and i cannot find the replacement. my husband says he will get around to installing the second spotlights, but he is been saying that for 2 years now.

overall, i would not recommend the blink as a system. i am torn between starting over with a 

### <font color="red">4.  Tokenization & Lemmatization</font>

Stemming and lemmatization are two common techniques in natural language processing (NLP) for reducing words to their base or root forms. Both are used to simplify text data, helping models treat different forms of a word as the same concept. However, they differ in their methods and the precision of their output.<br>
- Stemming strips common endings like "ing," "es," or "ed" to shorten words, often producing forms that aren’t real words (e.g., the Porter stemmer turns "studies" into "studi"). It’s fast and efficient but less accurate because it doesn’t consider grammar or context.
- Lemmatization reduces words to their dictionary base form by using vocabulary and part-of-speech (POS) tagging. It produces valid words adjusted for context, so "better" becomes "good" when used as an adjective. Though slower, it’s more precise because it considers grammar, making it ideal for applications needing higher accuracy.
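
The stemming side of this comparison can be tried with NLTK's `PorterStemmer`, which needs no extra corpus download. This is a minimal sketch on a few made-up words; the notebook itself uses lemmatization, so the stemmer appears here only for contrast.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words
words = ["running", "studies", "better"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

A suffix-stripping stemmer has no way to relate "better" to "good", which is the kind of case where POS-aware lemmatization helps.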





This step combines tokenization and lemmatization with part-of-speech (POS) tagging to standardize each review. Tokenizing each sentence and then lemmatizing each word based on its POS tag accurately reduces words to their base form while considering grammatical roles. <br>
The WordNetLemmatizer from the Natural Language Toolkit (NLTK) is used for lemmatization. Since lemmatization depends on sentence structure to understand context, part-of-speech tags are necessary for each word.

Explanation:
- Tokenization: Each review is first split into individual words (tokens), allowing for separate processing of each word.
- POS Tagging: Each token is assigned a part-of-speech tag (e.g., noun, verb, adjective), helping determine its correct base form for lemmatization.
- Lemmatization with POS: By applying POS-specific lemmatization, words are converted to their root form accurately (e.g., "running" to "run" if tagged as a verb), preserving the correct meaning and grammatical structure.

In [166]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to convert NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun if POS tag is unknown

# Function to tokenize and lemmatize with POS tagging (punctuation is removed in a later step)
def tokenize_and_lemmatize(text):
    
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)
    
    # Step 2: Get POS tags for each token
    pos_tags = pos_tag(tokens)
    
    # Step 3: Lemmatize each token based on its POS tag
    lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in pos_tags]

    # Lemmatize each token as a noun (default behavior without specifying POS)
    #lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return " ".join(lemmatized_tokens)

# Apply the function to the 'Full_review' column
process_reviews['Full_review'] = process_reviews['Full_review'].apply(tokenize_and_lemmatize)

print('Output review example:')
print(process_reviews.iloc[5]['Full_review'])

Output review example:
do not do much .. but nothing work without it . i be not impress with the blink system at all . it spend most of it be time offline and miss . currently , i have 2 mini blink ... which work the best - if they be plug in . currently , one be miss , and the other be often unplugged because it be annoy if someone be work in the backyard . we have 2 doorbell . one never seem to be work , but the front door at least ring through . i rarely get picture from it anymore . we have 2 spotlight ... one require the little white box to work that store the picture and uploads to the blink plan ... however , if the white box fails , you have no way to contact or store the picture . it be be age since the white storage box quit working , and i can not find the replacement . my husband say he will get around to instal the second spotlight , but he be be say that for 2 year now . overall , i would not recommend the blink a a system . i be tear between start over with a ring system

### <font color="red">5.  Removing Extra Whitespace</font>

Removing extra whitespace is generally a helpful step in text preprocessing. Extra whitespace can create unintended tokens or affect character counts, potentially adding noise to the data.
This step will remove any leading, trailing, or excessive spaces within each review in the 'Full_review' column.

In [167]:
process_reviews['Full_review'] = process_reviews['Full_review'].apply(lambda x: " ".join(x.split()))

print('Output review example:')
print(process_reviews.iloc[5]['Full_review'])

Output review example:
do not do much .. but nothing work without it . i be not impress with the blink system at all . it spend most of it be time offline and miss . currently , i have 2 mini blink ... which work the best - if they be plug in . currently , one be miss , and the other be often unplugged because it be annoy if someone be work in the backyard . we have 2 doorbell . one never seem to be work , but the front door at least ring through . i rarely get picture from it anymore . we have 2 spotlight ... one require the little white box to work that store the picture and uploads to the blink plan ... however , if the white box fails , you have no way to contact or store the picture . it be be age since the white storage box quit working , and i can not find the replacement . my husband say he will get around to instal the second spotlight , but he be be say that for 2 year now . overall , i would not recommend the blink a a system . i be tear between start over with a ring system
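
As a quick illustration of the whitespace normalization above, the following minimal sketch applies the same `" ".join(x.split())` idiom to a made-up string. The `normalize_whitespace` helper and the sample text are hypothetical and not part of the notebook. Calling `split()` with no arguments splits on any run of whitespace, including tabs and newlines, and discards leading and trailing whitespace.

```python
def normalize_whitespace(text: str) -> str:
    # str.split() with no arguments splits on any whitespace run
    # and never returns empty strings, so joining with a single
    # space yields exactly one space between words.
    return " ".join(text.split())


# Hypothetical sample with leading, trailing, and repeated whitespace
sample = "  blink   camera\twork \n great  "
normalized = normalize_whitespace(sample)
print(repr(normalized))
```

The review shown above is unchanged by this step because the tokens were already joined with single spaces after lemmatization.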

### <font color="red">6.  Removing Punctuation</font>

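Punctuation marks are symbols used in writing to clarify meaning, separate sentences, and organize text. <br>
Removing them makes the text more uniform; in this project punctuation is treated as noise rather than as a sentiment signal. Because each mark is replaced with an empty string, tokens that were separated by spaced punctuation (such as ` , `) are left with two spaces between them, as the output below shows.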

In [168]:
# Remove punctuation except apostrophes
#process_reviews['Full_review'] = process_reviews['Full_review'].str.replace(r"[^\w\s']", " ", regex=True)

# Remove all punctuation, including apostrophes
process_reviews['Full_review'] = process_reviews['Full_review'].str.replace(r"[^\w\s]", "", regex=True)

#process_reviews['Full_review'] = process_reviews['Full_review'].str.translate(str.maketrans('','', string.punctuation))


print('Output review example:')
print(process_reviews.iloc[5]['Full_review'])

Output review example:
do not do much  but nothing work without it  i be not impress with the blink system at all  it spend most of it be time offline and miss  currently  i have 2 mini blink  which work the best  if they be plug in  currently  one be miss  and the other be often unplugged because it be annoy if someone be work in the backyard  we have 2 doorbell  one never seem to be work  but the front door at least ring through  i rarely get picture from it anymore  we have 2 spotlight  one require the little white box to work that store the picture and uploads to the blink plan  however  if the white box fails  you have no way to contact or store the picture  it be be age since the white storage box quit working  and i can not find the replacement  my husband say he will get around to instal the second spotlight  but he be be say that for 2 year now  overall  i would not recommend the blink a a system  i be tear between start over with a ring system and try to get the blink system 
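
The behavior of the `[^\w\s]` pattern can be checked on a small string. The sketch below uses the standard-library `re` module rather than pandas, on the assumption that `Series.str.replace(..., regex=True)` follows the same regular-expression semantics; the helper names and the sample text are hypothetical. It shows two side effects worth knowing about: underscores survive because `\w` matches them, and removing spaced punctuation leaves double spaces behind, so whitespace may need to be collapsed again afterwards.

```python
import re


def strip_punctuation(text: str) -> str:
    # Delete every character that is neither a word character
    # (letters, digits, underscore) nor whitespace.
    return re.sub(r"[^\w\s]", "", text)


def collapse_whitespace(text: str) -> str:
    # Same idiom as the whitespace-removal step.
    return " ".join(text.split())


# Hypothetical sample: spaced punctuation plus an underscore
sample = "do not do much .. but nothing work , user_name"

stripped = strip_punctuation(sample)
collapsed = collapse_whitespace(stripped)

print(repr(stripped))   # double spaces where punctuation used to be
print(repr(collapsed))  # single spaces restored
```

In this notebook the later stop word step re-joins the words with single spaces, so the double spaces do not reach the exported text.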

### <font color="red">7.  Removing Numeric Characters</font>

This step removes all digits from the text, since numbers rarely carry sentiment on their own and would otherwise add noise to later tokenization and feature extraction.

In [169]:
process_reviews['Full_review'] = process_reviews['Full_review'].str.replace(r'\d+', '', regex=True)

print('Output review example:')
print(process_reviews.iloc[5]['Full_review'])

Output review example:
do not do much  but nothing work without it  i be not impress with the blink system at all  it spend most of it be time offline and miss  currently  i have  mini blink  which work the best  if they be plug in  currently  one be miss  and the other be often unplugged because it be annoy if someone be work in the backyard  we have  doorbell  one never seem to be work  but the front door at least ring through  i rarely get picture from it anymore  we have  spotlight  one require the little white box to work that store the picture and uploads to the blink plan  however  if the white box fails  you have no way to contact or store the picture  it be be age since the white storage box quit working  and i can not find the replacement  my husband say he will get around to instal the second spotlight  but he be be say that for  year now  overall  i would not recommend the blink a a system  i be tear between start over with a ring system and try to get the blink system work
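
One consequence of the `\d+` pattern is that it also strips digits embedded in alphanumeric tokens. The sketch below, using the standard-library `re` module on a made-up string, contrasts the notebook's pattern with a word-boundary variant that removes only standalone numbers. The variant is shown purely as an alternative for comparison, not as the approach this project takes, and the helper names are hypothetical.

```python
import re


def remove_all_digits(text: str) -> str:
    # Same pattern as the notebook: delete every run of digits.
    return re.sub(r"\d+", "", text)


def remove_standalone_numbers(text: str) -> str:
    # Alternative (assumption, not used in the notebook):
    # delete digit runs only when they form a whole token.
    return re.sub(r"\b\d+\b", "", text)


# Hypothetical sample with a standalone number and a mixed token
sample = "i have 2 mini blink and a 4k tv"

all_removed = remove_all_digits(sample)
standalone_removed = remove_standalone_numbers(sample)

print(repr(all_removed))         # "4k" becomes "k"
print(repr(standalone_removed))  # "4k" is kept intact
```

Which behavior is preferable depends on whether tokens such as model names or resolutions matter for the downstream models.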

### <font color="red">8.  Removing Stopwords</font>

The Natural Language Toolkit (NLTK) is a Python library for natural language processing (NLP) tasks. It provides tools for text processing such as tokenization, stemming, lemmatization, and stop word removal. <br>
In this step, NLTK’s English stop word list is used to remove common, low-information words such as "the", "I", "he", "she", "and", "is", and "are". <br>
For sentiment analysis, a customized stop word list is used so that important words like "not" or "hasn't" are kept, because they can reverse the overall sentiment of a review. This step streamlines the text while keeping the words that are essential for accurate analysis.


In [170]:
# Loading the general English stop words
stop_words = set(stopwords.words('english'))

# Defining the words to keep for sentiment analysis
important_words = {
    "doesn", "doesn't", "doesnt", "dont", "don't", "not",
    "aren", "aren't", "arent",  "couldn", "couldn't", "couldnt", "didn",
    "didn't", "didnt", "hadn", "hadn't", "hadnt",  "hasn", "hasn't", "hasnt",
    "haven't", "havent", "isn", "isn't", "isnt", "mightn",  "mightn't",
    "mightnt", "mustn", "mustn't", "mustnt", "needn", "needn't", "neednt",
    "shan", "shan't", "shant", "shouldn", "shouldn't", "shouldnt", "wasn",
    "wasn't",  "wasnt", "weren", "weren't", "werent", "won", "won't", "wont",
    "wouldn", "wouldn't", "wouldnt", "good", "bad", "worst", "wonderful",
    "best", "better", "no", "but", "yet", "never", "none"
}

# Removing important sentiment-related words from the stop words list
custom_stop_words = stop_words - important_words

# Removing customized stop words
process_reviews['Full_review'] = process_reviews['Full_review'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in custom_stop_words)
)

print('Output review example:')
print(process_reviews.iloc[5]['Full_review'])

Output review example:
not much but nothing work without not impress blink system spend time offline miss currently mini blink work best plug currently one miss often unplugged annoy someone work backyard doorbell one never seem work but front door least ring rarely get picture anymore spotlight one require little white box work store picture uploads blink plan however white box fails no way contact store picture age since white storage box quit working not find replacement husband say get around instal second spotlight but say year overall would not recommend blink system tear start ring system try get blink system work super frustrate hidden cost unreliability work nice though always delay never get face action frequently record back people leave nothing person already go get first blink almost year ago get two mini blink apartment get doorbell get house year ago since nobody could hear knocking front door garage door work great month stick not hear doorbell without plan plan super o
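
The set arithmetic behind the customized stop word list can be seen in isolation with a small sketch. It uses a tiny hand-written stop word set as a stand-in for NLTK's English list, so every name and word list below is a hypothetical assumption chosen for illustration. Subtracting the words to keep from the stop word set leaves negations such as "not" in the text, and any word in the keep list that was never a stop word has no effect on the result.

```python
# Hypothetical stand-in for NLTK's English stop word list
demo_stop_words = {"i", "the", "it", "be", "with", "not", "but", "at", "all"}

# Words to protect; "good" is deliberately not in the demo stop list
demo_important_words = {"not", "but", "good"}

# Set difference: stop words minus the protected words
demo_custom_stop_words = demo_stop_words - demo_important_words


def remove_stop_words(text: str, stop_words: set) -> str:
    # Keep only the words that are not in the stop word set
    return " ".join(word for word in text.split() if word not in stop_words)


sample = "i be not impress with the blink system at all"
filtered = remove_stop_words(sample, demo_custom_stop_words)

# Protected words that were never stop words, so listing them changes nothing
no_effect = demo_important_words - demo_stop_words

print(filtered)
print(no_effect)
```

Running the same `important_words - stop_words` check against the real lists is a cheap way to see which protected entries actually change the filter.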

The resulting text is now concise and standardized while retaining the context that matters for sentiment. Contractions have been expanded so that negations survive as separate words, words have been lemmatized to their base forms, and punctuation, digits, and stop words (other than the negation and contrast words kept above) have been removed. The cleaned text is harder for a person to read than the original review, but it is compact and consistent, which makes it well suited for both visualization and further analysis.

## <font color="red">12.  Export Cleaned Dataset</font>

In [171]:
process_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15791 entries, 0 to 15861
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Review_id    15791 non-null  object 
 1   Rating       15791 non-null  float64
 2   Product_id   15791 non-null  object 
 3   Review_len   15791 non-null  int64  
 4   Word_count   15791 non-null  int64  
 5   Full_review  15791 non-null  object 
 6   Sentiment    15791 non-null  object 
 7   Day          15791 non-null  int32  
 8   Month        15791 non-null  int32  
 9   Year         15791 non-null  int32  
dtypes: float64(1), int32(3), int64(2), object(4)
memory usage: 1.1+ MB


In [172]:
# Check for null values
process_reviews.isnull().sum()

Review_id      0
Rating         0
Product_id     0
Review_len     0
Word_count     0
Full_review    0
Sentiment      0
Day            0
Month          0
Year           0
dtype: int64
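
A null check does not flag reviews that became empty strings during cleaning, for example a review that consisted only of stop words, digits, or punctuation. The following minimal sketch, on a small made-up DataFrame with hypothetical values, shows one way such rows could be counted and filtered with pandas. It is a suggested sanity check under that assumption and not a step the notebook performs.

```python
import pandas as pd

# Hypothetical miniature of the cleaned dataset
reviews = pd.DataFrame({
    "Review_id": ["a", "b", "c"],
    "Full_review": ["great camera", "", "   "],
})

# isnull() sees no missing values, because empty strings are not null
null_count = int(reviews["Full_review"].isnull().sum())

# Flag rows whose text is empty once surrounding whitespace is stripped
blank_mask = reviews["Full_review"].str.strip().eq("")
blank_count = int(blank_mask.sum())

# Keep only rows that still contain text
non_blank = reviews.loc[~blank_mask]

print(null_count, blank_count, len(non_blank))
```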

In [173]:
process_reviews.shape

(15791, 10)

In [174]:
process_reviews.head()

| | Review_id | Rating | Product_id | Review_len | Word_count | Full_review | Sentiment | Day | Month | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R192QJ45JRSLTC | 5.0 | B08JHCVHTY | 2558 | 356 | not think need wish blink subscription basic p... | Positive | 13 | 7 | 2024 |
| 1 | RLJN0G2I0CRNC | 5.0 | B08JHCVHTY | 190 | 37 | worth every penny blink camera house year peac... | Positive | 10 | 12 | 2024 |
| 2 | R19D78F9YK0DVA | 5.0 | B08JHCVHTY | 638 | 97 | quite satisfied use blink subscription plus pl... | Positive | 10 | 11 | 2024 |
| 3 | R2W7QUYHDCN6CB | 4.0 | B08JHCVHTY | 996 | 189 | nice add security really like camera give u st... | Positive | 26 | 9 | 2024 |
| 4 | RM9R0N4N310DC | 5.0 | B08JHCVHTY | 766 | 114 | great item blink subscription plus plan fantas... | Positive | 21 | 10 | 2024 |


In [175]:
# Export the cleaned dataset to an Excel file
process_reviews.to_excel('../data/cleaned_amazon_reviews.xlsx', index=False)