### Sentiment Analysis Notebook

This notebook is for testing and gathering ideas before I tidy up the code and turn it into a Python script.

In [1]:
# Imports
import pandas as pd
import spacy
from textblob import TextBlob

In [2]:
# Import spaCy model
nlp = spacy.load("en_core_web_sm")

In [3]:
# Reading in csv file
df = pd.read_csv("amazon_product_reviews.csv")
df.head()

Unnamed: 0,id,dateAdded,dateUpdated,name,asins,brand,categories,primaryCategories,imageURLs,keys,...,reviews.didPurchase,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username,sourceURLs
0,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,3,https://www.amazon.com/product-reviews/B00QWO9...,I order 3 of them and one of the item is bad q...,... 3 of them and one of the item is bad quali...,Byger yang,"https://www.barcodable.com/upc/841710106442,ht..."
1,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,4,https://www.amazon.com/product-reviews/B00QWO9...,Bulk is always the less expensive way to go fo...,... always the less expensive way to go for pr...,ByMG,"https://www.barcodable.com/upc/841710106442,ht..."
2,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Well they are not Duracell but for the price i...,... are not Duracell but for the price i am ha...,BySharon Lambert,"https://www.barcodable.com/upc/841710106442,ht..."
3,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Seem to work as well as name brand batteries a...,... as well as name brand batteries at a much ...,Bymark sexson,"https://www.barcodable.com/upc/841710106442,ht..."
4,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,These batteries are very long lasting the pric...,... batteries are very long lasting the price ...,Bylinda,"https://www.barcodable.com/upc/841710106442,ht..."


In [4]:
# Selecting the review column
data = df["reviews.text"]
data.info()

<class 'pandas.core.series.Series'>
RangeIndex: 28332 entries, 0 to 28331
Series name: reviews.text
Non-Null Count  Dtype 
--------------  ----- 
28332 non-null  object
dtypes: object(1)
memory usage: 221.5+ KB


In [5]:
data = data.to_frame(name="Review")
data.head()

Unnamed: 0,Review
0,I order 3 of them and one of the item is bad q...
1,Bulk is always the less expensive way to go fo...
2,Well they are not Duracell but for the price i...
3,Seem to work as well as name brand batteries a...
4,These batteries are very long lasting the pric...


In [6]:
# Initialising the stopwords variable
stopwords = nlp.Defaults.stop_words

# Creating a function for removing the stopwords
def remove_stopwords(review):
    # Splitting each review into individual words
    review_words = review.split()
    # Creating a list for cleaned words to append to
    cleaned_words = []
    # Loop through the split words and filter out the stopwords
    for word in review_words:
        if word.lower() not in stopwords:
            cleaned_words.append(word)
    # Combine the cleaned words back into a single string and return it
    return " ".join(cleaned_words)
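
One thing to keep in mind for the script: the output of the stopword step shows that negations such as "not" get dropped ("Well they are not Duracell..." becomes "Duracell price happy."), and words with trailing punctuation are compared as-is. The sketch below is one possible variant, not the version used in this notebook. `DEMO_STOPWORDS`, `NEGATIONS` and `remove_stopwords_keep_negations` are hypothetical names, and the tiny stopword set only stands in for `nlp.Defaults.stop_words` so the example runs without spaCy.

```python
import string

# Tiny illustrative stopword set (an assumption for this sketch);
# the notebook itself uses nlp.Defaults.stop_words.
DEMO_STOPWORDS = {"well", "they", "are", "not", "but", "for", "the", "i", "am"}

# Negations kept on purpose because they can flip the sentiment of a phrase.
NEGATIONS = {"not", "no", "never"}


def remove_stopwords_keep_negations(review, stopwords=DEMO_STOPWORDS):
    """Drop stopwords but keep negations, matching on punctuation-stripped words."""
    kept = []
    for word in review.split():
        # Compare on a lowercased copy with surrounding punctuation removed,
        # but keep the original word in the output.
        key = word.lower().strip(string.punctuation)
        if key in NEGATIONS or key not in stopwords:
            kept.append(word)
    return " ".join(kept)


print(remove_stopwords_keep_negations(
    "Well they are not Duracell but for the price i am happy."
))
```

Whether keeping negations actually improves the polarity scores would need to be checked against the real data before adopting it.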

In [7]:
# Removing the stopwords from the data
cleaned_reviews = data["Review"].apply(remove_stopwords)
cleaned_reviews.head()

0    order 3 item bad quality. missing backup sprin...
1                     Bulk expensive way products like
2                                Duracell price happy.
3                    work brand batteries better price
4                  batteries long lasting price great.
Name: Review, dtype: object

In [8]:
# Checking missing values
cleaned_reviews.info()

<class 'pandas.core.series.Series'>
RangeIndex: 28332 entries, 0 to 28331
Series name: Review
Non-Null Count  Dtype 
--------------  ----- 
28332 non-null  object
dtypes: object(1)
memory usage: 221.5+ KB


Note that there are no null values: all 28332 entries are non-null.

In [9]:
# Testing out TextBlob's sentiment and polarity properties
print(TextBlob(cleaned_reviews[42]).sentiment)
print(TextBlob(cleaned_reviews[42]).polarity)

Sentiment(polarity=0.05, subjectivity=0.32499999999999996)
0.05


In [10]:
# Creating a function for sentiment analysis using TextBlob
def get_sentiment(review):
    return TextBlob(review).polarity

In [11]:
# Generate sentiment polarity scores for the reviews
data["Polarity Score"] = cleaned_reviews.apply(get_sentiment)
data.head()

Unnamed: 0,Review,Polarity Score
0,I order 3 of them and one of the item is bad q...,-0.45
1,Bulk is always the less expensive way to go fo...,-0.5
2,Well they are not Duracell but for the price i...,0.8
3,Seem to work as well as name brand batteries a...,0.5
4,These batteries are very long lasting the pric...,0.25


In [12]:
# Looking at the stats of the sentiment analysis
data.describe()

Unnamed: 0,Polarity Score
count,28332.0
mean,0.380882
std,0.318976
min,-1.0
25%,0.141667
50%,0.397813
75%,0.616667
max,1.0


In [13]:
# Creating a function to categorise the sentiment according to the polarity score
def sentiment_category(score):
    if score <= -0.01:
        return "Negative"
    elif score >= 0.01:
        return "Positive"
    else:
        return "Neutral"
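
To double-check how the cut-offs behave at the edges, here is a small standalone sketch. It uses the same comparisons as the function above but exposes the cut-off as a parameter. `categorise_score` and `threshold` are hypothetical names introduced only for this example, and the default of 0.01 simply mirrors the values used in this notebook.

```python
def categorise_score(score, threshold=0.01):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score <= -threshold:
        return "Negative"
    if score >= threshold:
        return "Positive"
    # Anything strictly between -threshold and +threshold counts as neutral.
    return "Neutral"


# Probe a few values, including both boundaries.
for score in (-0.45, -0.01, 0.0, 0.005, 0.01, 0.8):
    print(f"{score:+.3f} -> {categorise_score(score)}")
```

With a cut-off this narrow, only scores very close to zero end up as "Neutral"; a wider band could be tried by passing a larger `threshold`.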

In [14]:
# Creating a new column that labels each review as Negative, Neutral or Positive
data["Sentiment Prediction"] = data["Polarity Score"].apply(sentiment_category)

In [15]:
# Checking the predictions
pd.options.display.max_colwidth = 100
data.sample(5, random_state=12)

Unnamed: 0,Review,Polarity Score,Sentiment Prediction
12860,I bought this thinking that I needed it but I didnt so it was a waste of money. I do so hate to ...,-0.3,Negative
14350,I like that you can set parental setting so the kids can't purchase without you knowing about it...,0.8,Positive
22605,I like the unit but battery seems to run down more quickly that my previous Fire. Also gets warm...,0.216667,Positive
2199,"Great batteries, great price, fast shipping.",0.6,Positive
6898,This is sth i never had with other serious brands. disappointing and returning,-0.6,Negative
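
A sample of five rows only gives a rough feel for the predictions, so tallying the labels is a useful next check. The sketch below counts the five labels from the sample above with the standard library; in the notebook, `data["Sentiment Prediction"].value_counts()` should give the equivalent tally over all reviews. The list `sample_labels` is typed in by hand for illustration.

```python
from collections import Counter

# Labels copied by hand from the five sampled rows shown above.
sample_labels = ["Negative", "Positive", "Positive", "Positive", "Negative"]

counts = Counter(sample_labels)
total = sum(counts.values())

# Print each label with its count and share, most frequent first.
for label, count in counts.most_common():
    print(f"{label}: {count} ({count / total:.0%})")
```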


I will now turn this into a Python script, as requested by the task.