# Sentiment Analysis of Amazon Reviews

#### Sentiment analysis, also known as opinion mining, is a technique used in natural language processing (NLP) to determine the emotional undertone of a document.

#### Types

* Fine-grained sentiment analysis
* Emotion detection
* Intent-Based Sentiment Analysis
* Aspect-Based Sentiment Analysis

#### Sentiment analysis
Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) technique used to determine the sentiment or emotional tone expressed in a piece of text. It involves analyzing and categorizing the subjective information present in text data, such as reviews, social media posts, and customer feedback. The goal is to identify the writer's attitude, opinion, or sentiment towards a particular topic, product, or event.

Different kinds of sentiment analysis include:

#### Fine-grained sentiment analysis: 
* This approach focuses on classifying sentiment into multiple categories or levels, allowing for more nuanced analysis. Instead of just labeling text as positive, negative, or neutral, it aims to capture a wider range of sentiments. For example, sentiment labels could include very positive, positive, neutral, negative, and very negative. Fine-grained sentiment analysis provides more detailed insights into sentiment variations and can be valuable in scenarios where simple positive/negative classification is insufficient.
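
As a minimal sketch of this idea, a continuous polarity score in [-1, 1] (for example TextBlob's polarity or VADER's compound score) can be binned into the five labels mentioned above. The function name `fine_grained_label` and the cut points of ±0.2 and ±0.6 are illustrative assumptions, not values taken from any library.

```python
def fine_grained_label(score: float) -> str:
    """Map a polarity score in [-1, 1] to one of five sentiment labels.

    The cut points are assumed for illustration and should be tuned on labeled data.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score >= 0.6:
        return "very positive"
    if score >= 0.2:
        return "positive"
    if score > -0.2:
        return "neutral"
    if score > -0.6:
        return "negative"
    return "very negative"


# Hypothetical scores covering each of the five bins
for s in (0.9, 0.3, 0.0, -0.4, -0.8):
    print(s, fine_grained_label(s))
```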

#### Emotion detection sentiment analysis:
* Emotion-based sentiment analysis goes beyond basic sentiment classification and aims to identify specific emotions expressed in the text. It involves recognizing emotions such as happiness, sadness, anger, fear, or surprise. Emotion detection can provide deeper insights into the emotional impact of certain topics or events, making it useful in areas like social media monitoring, brand reputation management, and market research.

#### Intent-Based Sentiment Analysis: 
* Intent-based sentiment analysis focuses on understanding the intention behind a particular sentiment expressed in the text. It goes beyond sentiment polarity and tries to identify whether the sentiment indicates a desire to buy, recommend, complain, praise, or express other specific intents. This type of sentiment analysis helps in understanding not just the sentiment but also the action or behavior associated with it, which can be valuable in customer feedback analysis, brand management, and market analysis.

#### Aspect-Based Sentiment Analysis: 
* Aspect-based sentiment analysis aims to identify and analyze sentiment at a more granular level by considering specific aspects or features of a product, service, or event. Instead of assigning a single sentiment score to the entire text, it analyzes sentiments associated with different aspects mentioned in the text. For example, in a product review, aspect-based sentiment analysis can identify sentiments towards individual features like design, performance, usability, customer service, etc. This enables a more detailed understanding of sentiment distribution across different aspects, helping businesses focus on areas that need improvement or promotion.
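
The following toy sketch illustrates the aspect-level idea: each sentence is matched against aspect keywords, and the sentiment words in that sentence are credited to the aspects it mentions. The lexicons, the aspect names, and the function `aspect_sentiment` are all hypothetical and hand-made; a real system would rely on a trained model or a much larger lexicon.

```python
import re

# Hypothetical, hand-made lexicons for illustration only
ASPECT_KEYWORDS = {
    "performance": {"speed", "fast", "slow"},
    "price": {"price", "cost", "expensive", "cheap"},
}
POSITIVE_WORDS = {"fast", "great", "cheap", "reliable"}
NEGATIVE_WORDS = {"slow", "expensive", "bad", "unreliable"}


def aspect_sentiment(review: str) -> dict:
    """Return a net score per aspect: +1 for each distinct positive word and -1 for
    each distinct negative word found in sentences that mention the aspect."""
    scores = {}
    # Naive sentence split on '.', '!' and '?'
    for sentence in re.split(r"[.!?]+", review.lower()):
        words = set(re.findall(r"[a-z]+", sentence))
        net = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if words & keywords:
                scores[aspect] = scores.get(aspect, 0) + net
    return scores


print(aspect_sentiment("The speed is great. The price is bad."))
# {'performance': 1, 'price': -1}
```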

These different types of sentiment analysis techniques cater to various needs and applications, providing insights into sentiment variations, emotional responses, user intentions, and aspect-level sentiments. The choice of sentiment analysis approach depends on the specific requirements and goals of the analysis task at hand.

#### Applications

* Brand Popularity
* Monitoring
* Customer Service
* Identifying Demographics
* Marketing Effort

Is text analysis also referred to as text mining?



In [2]:
import numpy as np
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

import re
# re = regular expressions

from textblob import TextBlob
# to process textual data

from wordcloud import WordCloud

import seaborn as sns
import matplotlib.pyplot as plt
import cufflinks as cf
%matplotlib inline


from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected = True)
cf.go_offline()


import plotly.graph_objects as go
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")


pd.set_option('display.max_columns', None)

In [3]:
df = pd.read_csv("amazon.csv")

# using the pandas library to read the data set

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,reviewerName,overall,reviewText,reviewTime,day_diff,helpful_yes,helpful_no,total_vote,score_pos_neg_diff,score_average_rating,wilson_lower_bound
0,0,,4,No issues.,23-07-2014,138,0,0,0,0,0.0,0.0
1,1,0mie,5,"Purchased this for my device, it worked as adv...",25-10-2013,409,0,0,0,0,0.0,0.0
2,2,1K3,4,it works as expected. I should have sprung for...,23-12-2012,715,0,0,0,0,0.0,0.0
3,3,1m2,5,This think has worked out great.Had a diff. br...,21-11-2013,382,0,0,0,0,0.0,0.0
4,4,2&1/2Men,5,"Bought it with Retail Packaging, arrived legit...",13-07-2013,513,0,0,0,0,0.0,0.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4915 entries, 0 to 4914
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            4915 non-null   int64  
 1   reviewerName          4914 non-null   object 
 2   overall               4915 non-null   int64  
 3   reviewText            4914 non-null   object 
 4   reviewTime            4915 non-null   object 
 5   day_diff              4915 non-null   int64  
 6   helpful_yes           4915 non-null   int64  
 7   helpful_no            4915 non-null   int64  
 8   total_vote            4915 non-null   int64  
 9   score_pos_neg_diff    4915 non-null   int64  
 10  score_average_rating  4915 non-null   float64
 11  wilson_lower_bound    4915 non-null   float64
dtypes: float64(2), int64(7), object(3)
memory usage: 460.9+ KB


In [6]:
# sort the reviews by wilson_lower_bound in descending order (highest first)


df = df.sort_values("wilson_lower_bound", ascending=False)

df.drop('Unnamed: 0', inplace=True, axis=1)
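
The notebook does not show how the `wilson_lower_bound` column was produced. As a sketch, assuming it is the standard lower bound of the Wilson score interval at 95% confidence applied to the helpful and unhelpful vote counts, it could be computed as follows. The function name and the `confidence` parameter are illustrative.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(helpful_yes: int, helpful_no: int, confidence: float = 0.95) -> float:
    """Lower bound of the Wilson score interval for the share of helpful votes."""
    n = helpful_yes + helpful_no
    if n == 0:
        return 0.0  # no votes: return 0, matching the zero-vote rows in the data
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    phat = helpful_yes / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)


# Votes of review 2031 in the data: 1952 helpful, 68 not helpful
print(round(wilson_lower_bound(1952, 68), 4))  # 0.9575
```

Under that assumption the result agrees with the value of about 0.9575 stored for this review. The bound rewards reviews that have both a high share of helpful votes and many votes, which is why sorting by it is more robust than sorting by the raw ratio.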


In [9]:
df.head()

Unnamed: 0,reviewerName,overall,reviewText,reviewTime,day_diff,helpful_yes,helpful_no,total_vote,score_pos_neg_diff,score_average_rating,wilson_lower_bound
2031,"Hyoun Kim ""Faluzure""",5,[[ UPDATE - 6/19/2014 ]]So my lovely wife boug...,05-01-2013,702,1952,68,2020,1884,0.966337,0.957544
3449,NLee the Engineer,5,I have tested dozens of SDHC and micro-SDHC ca...,26-09-2012,803,1428,77,1505,1351,0.948837,0.936519
4212,SkincareCEO,1,NOTE: please read the last update (scroll to ...,08-05-2013,579,1568,126,1694,1442,0.92562,0.912139
317,"Amazon Customer ""Kelly""",1,"If your card gets hot enough to be painful, it...",09-02-2012,1033,422,73,495,349,0.852525,0.818577
4672,Twister,5,Sandisk announcement of the first 128GB micro ...,03-07-2014,158,45,4,49,41,0.918367,0.808109


In [10]:


def missing_values_analysis(df):
    # Step 1: Identify columns with missing values
    # Create a list of column names that have at least one missing value
    na_columns_ = [col for col in df.columns if df[col].isnull().sum() > 0]

    # Step 2: Calculate the number of missing values for each column
    # Select the subset of the DataFrame that contains only the columns with missing values
    # Apply the isnull() method to identify the missing values in each column
    # Calculate the sum of missing values for each column
    # Sort the resulting Series in ascending order
    n_miss = df[na_columns_].isnull().sum().sort_values(ascending=True)

    # Step 3: Calculate the percentage of missing values for each column
    # Divide the sum of missing values for each column by the total number of rows in the DataFrame
    # Multiply by 100 to get the percentage
    # Sort the resulting Series in ascending order
    ratio_ = (df[na_columns_].isnull().sum() / df.shape[0] * 100).sort_values(ascending=True)

    # Step 4: Create a DataFrame with missing values count and percentage
    # Concatenate the n_miss Series and the ratio_ Series along the columns axis (axis=1)
    # Assign column names 'Missing Values' and 'Ratio' to the resulting DataFrame
    missing_df = pd.concat([n_miss, np.round(ratio_, 2)], axis=1, keys=['Missing Values', 'Ratio'])

    # Step 5: Convert the concatenated DataFrame to a new DataFrame
    # This step is not necessary as the concatenated DataFrame already has the desired structure,
    # but it ensures that the result is of the pd.DataFrame type
    missing_df = pd.DataFrame(missing_df)

    # Step 6: Return the DataFrame with missing values information
    return missing_df


In [11]:

def check_dataframe(df, head = 5, tail = 5):
    print("SHAPE".center(82, '~'))
    print('Rows: {}'.format(df.shape[0]))
    print('columns: {}'.format(df.shape[1]))
    print("TYPES".center(82,'~'))
    print(df.dtypes)
    print("".center(82, '~'))
    print(missing_values_analysis(df))
    print('DUPLICATED VALUES'.center(83,'~'))
    print(df.duplicated().sum())
    print("QUANTILES".center(82,'~'))
    # restrict to numeric columns; quantiles are not defined for the text columns
    print(df.quantile([0,0.05,0.50,0.95,0.99,1], numeric_only=True).T)

    
check_dataframe(df)
    
    

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~SHAPE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rows: 4915
columns: 11
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~TYPES~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
reviewerName             object
overall                   int64
reviewText               object
reviewTime               object
day_diff                  int64
helpful_yes               int64
helpful_no                int64
total_vote                int64
score_pos_neg_diff        int64
score_average_rating    float64
wilson_lower_bound      float64
dtype: object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
              Missing Values  Ratio
reviewerName               1   0.02
reviewText                 1   0.02
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~DUPLICATED VALUES~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~QUANTILES~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                       0.00  0.05   0.50        0.95       0.99         1.00
overall 

In [12]:
# A quantile is the value below which a given fraction of the data falls; for example, the 0.50 quantile is the median
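
To make the definition concrete, here is a small pure-Python sketch of a quantile with linear interpolation, which is the default interpolation used by pandas' `quantile`. The function name `quantile` and the sample ratings are illustrative.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 1:
        raise ValueError("q must lie in [0, 1]")
    data = sorted(values)
    position = q * (len(data) - 1)          # fractional index into the sorted data
    lower = int(position)                   # index of the value at or below the position
    upper = min(lower + 1, len(data) - 1)   # neighbouring index, clamped at the end
    fraction = position - lower
    return data[lower] + (data[upper] - data[lower]) * fraction


ratings = [1, 2, 3, 4, 5]  # hypothetical sample
for q in (0, 0.5, 1):
    print(q, quantile(ratings, q))
```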

In [13]:
import pandas as pd

def check_class(dataframe):
    # Step 1: Calculate the number of unique classes for each variable/column
    # Create a DataFrame with two columns: 'variable' and 'classes'
    # Iterate over each column in the DataFrame and calculate the number of unique classes using the nunique() method
    nunique_df = pd.DataFrame({'variable': dataframe.columns,
                               'classes': [dataframe[i].nunique() for i in dataframe.columns]})

    # Step 2: Sort the DataFrame by the number of classes in descending order
    # Sort the DataFrame based on the 'classes' column in descending order
    nunique_df = nunique_df.sort_values('classes', ascending=False)

    # Step 3: Reset the index of the DataFrame
    # Reset the index of the DataFrame to ensure a new index is generated and the old index is discarded
    nunique_df = nunique_df.reset_index(drop=True)

    # Step 4: Return the DataFrame with the number of unique classes for each variable
    return nunique_df


In [14]:


check_class(df)

Unnamed: 0,variable,classes
0,reviewText,4912
1,reviewerName,4594
2,reviewTime,690
3,day_diff,690
4,wilson_lower_bound,40
5,score_average_rating,28
6,score_pos_neg_diff,27
7,total_vote,26
8,helpful_yes,23
9,helpful_no,17


In [15]:
# bar chart

constraints = ['#B34D22', '#EBE00C', '#1FEB0C', '#0C92EB', '#EB0CD5']

def categorical_variable_summary(df, column_name):
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=('Countplot', 'Percentage'),
                        specs=[[{"type": "xy"}, {'type': 'domain'}]])

    fig.add_trace(go.Bar(y=df[column_name].value_counts().values.tolist(),
                         x=[str(i) for i in df[column_name].value_counts().index],
                         text=df[column_name].value_counts().values.tolist(),
                         textfont=dict(size=14),
                         name=column_name,
                         textposition='auto',
                         showlegend=False,
                         marker=dict(color=constraints,
                                     line=dict(color='#DBE6EC',
                                               width=1))),
                  row=1, col=1)

    # pie chart
    fig.add_trace(go.Pie(labels=df[column_name].value_counts().keys(),
                         values=df[column_name].value_counts().values.tolist(),
                         textfont=dict(size=18),
                         textposition='auto',
                         showlegend=False,
                         name=column_name,
                         marker=dict(colors=constraints)),
                  row=1, col=2)

    fig.update_layout(title={'text': column_name,
                             'y': 0.9,
                             'x': 0.5,
                             'xanchor': 'center',
                             'yanchor': 'top'},
                      template='plotly_white')

    iplot(fig)

In [16]:
categorical_variable_summary(df, 'overall')

In [17]:
# ranking the reviews using sentiment analysis

In [18]:
df.reviewText.head()

2031    [[ UPDATE - 6/19/2014 ]]So my lovely wife boug...
3449    I have tested dozens of SDHC and micro-SDHC ca...
4212    NOTE:  please read the last update (scroll to ...
317     If your card gets hot enough to be painful, it...
4672    Sandisk announcement of the first 128GB micro ...
Name: reviewText, dtype: object

In [19]:
review_example = df.reviewText[2031]
review_example

'[[ UPDATE - 6/19/2014 ]]So my lovely wife bought me a Samsung Galaxy Tab 4 for Father\'s Day and I\'ve been loving it ever since.  Just as other with Samsung products, the Galaxy Tab 4 has the ability to add a microSD card to expand the memory on the device.  Since it\'s been over a year, I decided to do some more research to see if SanDisk offered anything new.  As of 6/19/2014, their product lineup for microSD cards from worst to best (performance-wise) are the as follows:SanDiskSanDisk UltraSanDisk Ultra PLUSSanDisk ExtremeSanDisk Extreme PLUSSanDisk Extreme PRONow, the difference between all of these cards are simply the speed in which you can read/write data to the card.  Yes, the published rating of most all these cards (except the SanDisk regular) are Class 10/UHS-I but that\'s just a rating... Actual real world performance does get better with each model, but with faster cards come more expensive prices.  Since Amazon doesn\'t carry the Ultra PLUS model of microSD card, I had 

In [20]:
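# Caution: substituting '' removes the spaces along with digits and punctuation,
# so the words fuse together (see the output). The DataFrame-wide cleanup in cell 23 substitutes ' ' instead.
review_example = re.sub("[^a-zA-Z]",'',review_example)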
review_example

'UPDATESomylovelywifeboughtmeaSamsungGalaxyTabforFathersDayandIvebeenlovingiteversinceJustasotherwithSamsungproductstheGalaxyTabhastheabilitytoaddamicroSDcardtoexpandthememoryonthedeviceSinceitsbeenoverayearIdecidedtodosomemoreresearchtoseeifSanDiskofferedanythingnewAsoftheirproductlineupformicroSDcardsfromworsttobestperformancewisearetheasfollowsSanDiskSanDiskUltraSanDiskUltraPLUSSanDiskExtremeSanDiskExtremePLUSSanDiskExtremePRONowthedifferencebetweenallofthesecardsaresimplythespeedinwhichyoucanreadwritedatatothecardYesthepublishedratingofmostallthesecardsexcepttheSanDiskregularareClassUHSIbutthatsjustaratingActualrealworldperformancedoesgetbetterwitheachmodelbutwithfastercardscomemoreexpensivepricesSinceAmazondoesntcarrytheUltraPLUSmodelofmicroSDcardIhadtododirectcomparisonsbetweentheSanDiskUltraExtremeandExtremePLUSAsmentionedinmyearlierreviewIpurchasedtheSanDiskUltraformyGalaxySMyquestionwasdidIwanttopayovermoreforacardthatisfasterthantheoneIalreadyownedOrIcouldpayalmostdoubletoget

In [21]:
review_example = review_example.lower().split()


In [22]:
review_example

['updatesomylovelywifeboughtmeasamsunggalaxytabforfathersdayandivebeenlovingiteversincejustasotherwithsamsungproductsthegalaxytabhastheabilitytoaddamicrosdcardtoexpandthememoryonthedevicesinceitsbeenoverayearidecidedtodosomemoreresearchtoseeifsandiskofferedanythingnewasoftheirproductlineupformicrosdcardsfromworsttobestperformancewisearetheasfollowssandisksandiskultrasandiskultraplussandiskextremesandiskextremeplussandiskextremepronowthedifferencebetweenallofthesecardsaresimplythespeedinwhichyoucanreadwritedatatothecardyesthepublishedratingofmostallthesecardsexceptthesandiskregularareclassuhsibutthatsjustaratingactualrealworldperformancedoesgetbetterwitheachmodelbutwithfastercardscomemoreexpensivepricessinceamazondoesntcarrytheultraplusmodelofmicrosdcardihadtododirectcomparisonsbetweenthesandiskultraextremeandextremeplusasmentionedinmyearlierreviewipurchasedthesandiskultraformygalaxysmyquestionwasdidiwanttopayovermoreforacardthatisfasterthantheoneialreadyownedoricouldpayalmostdoubletoge
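
The single long token above appears because the substitution deleted the spaces as well as the punctuation. A short sketch on a made-up sentence shows the difference between substituting an empty string and substituting a space; the sample text is hypothetical.

```python
import re

sample = "Works great! Bought 2 cards."  # hypothetical review text

# Deleting every non-letter also deletes the spaces, so the words fuse together
fused = re.sub("[^a-zA-Z]", "", sample).lower().split()

# Substituting a space keeps the word boundaries; split() then drops the extra whitespace
tokens = re.sub("[^a-zA-Z]", " ", sample).lower().split()

print(fused)   # ['worksgreatboughtcards']
print(tokens)  # ['works', 'great', 'bought', 'cards']
```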

In [23]:
rt = lambda x: re.sub("[^a-zA-Z]",' ',str(x))
df["reviewText"] = df["reviewText"].map(rt)
df["reviewText"] = df["reviewText"].str.lower()
df.head()

Unnamed: 0,reviewerName,overall,reviewText,reviewTime,day_diff,helpful_yes,helpful_no,total_vote,score_pos_neg_diff,score_average_rating,wilson_lower_bound
2031,"Hyoun Kim ""Faluzure""",5,update so my lovely wife boug...,05-01-2013,702,1952,68,2020,1884,0.966337,0.957544
3449,NLee the Engineer,5,i have tested dozens of sdhc and micro sdhc ca...,26-09-2012,803,1428,77,1505,1351,0.948837,0.936519
4212,SkincareCEO,1,note please read the last update scroll to ...,08-05-2013,579,1568,126,1694,1442,0.92562,0.912139
317,"Amazon Customer ""Kelly""",1,if your card gets hot enough to be painful it...,09-02-2012,1033,422,73,495,349,0.852525,0.818577
4672,Twister,5,sandisk announcement of the first gb micro ...,03-07-2014,158,45,4,49,41,0.918367,0.808109


In [96]:
pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2
Note: you may need to restart the kernel to use updated packages.


In [24]:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

df[['polarity', 'subjectivity']] = df['reviewText'].apply(lambda Text: pd.Series(TextBlob(Text).sentiment))

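# Build the analyzer once; creating it inside the loop reloads the lexicon for every review
analyzer = SentimentIntensityAnalyzer()

# Series.items() replaces iteritems(), which was removed in pandas 2.0
for index, row in df['reviewText'].items():

    score = analyzer.polarity_scores(row)

    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    # Label each review by whichever of the negative and positive proportions is larger
    if neg > pos:
        df.loc[index, 'sentiment'] = "Negative"
    elif pos > neg:
        df.loc[index, 'sentiment'] = "Positive"
    else:
        df.loc[index, 'sentiment'] = "Neutral"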
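
The loop above labels a review by comparing the `pos` and `neg` proportions. An alternative convention, described in the VADER documentation, thresholds the `compound` score at ±0.05. The sketch below applies that rule to a score dictionary; the function name `label_from_compound` and the example dictionaries are hypothetical, and this is not the method used in this notebook.

```python
def label_from_compound(scores: dict, threshold: float = 0.05) -> str:
    """Label a VADER-style score dict by its 'compound' value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical score dicts shaped like the output of polarity_scores()
print(label_from_compound({"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6}))
print(label_from_compound({"neg": 0.1, "neu": 0.8, "pos": 0.1, "compound": 0.0}))
```

Compared with the pos-versus-neg comparison, the compound rule leaves a neutral band around zero, so reviews with only weak sentiment are not forced into the positive or negative class.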

In [25]:
df[df["sentiment"] == "Positive"].sort_values("wilson_lower_bound", ascending=False).head(5)

Unnamed: 0,reviewerName,overall,reviewText,reviewTime,day_diff,helpful_yes,helpful_no,total_vote,score_pos_neg_diff,score_average_rating,wilson_lower_bound,polarity,subjectivity,sentiment
2031,"Hyoun Kim ""Faluzure""",5,update so my lovely wife boug...,05-01-2013,702,1952,68,2020,1884,0.966337,0.957544,0.163859,0.562259,Positive
3449,NLee the Engineer,5,i have tested dozens of sdhc and micro sdhc ca...,26-09-2012,803,1428,77,1505,1351,0.948837,0.936519,0.10387,0.516435,Positive
4212,SkincareCEO,1,note please read the last update scroll to ...,08-05-2013,579,1568,126,1694,1442,0.92562,0.912139,0.212251,0.505394,Positive
317,"Amazon Customer ""Kelly""",1,if your card gets hot enough to be painful it...,09-02-2012,1033,422,73,495,349,0.852525,0.818577,0.143519,0.494207,Positive
4672,Twister,5,sandisk announcement of the first gb micro ...,03-07-2014,158,45,4,49,41,0.918367,0.808109,0.172332,0.511282,Positive


In [27]:
categorical_variable_summary(df,'sentiment')

# The pie chart shows the share of reviews labeled positive, negative, and neutral