## <font color = Green > Business Understanding </font>

With growing digitisation and the prevalence of mobile phones and internet access, more consumers have an online presence, and their opinions are valuable to any product-based company, especially B2C businesses. Companies are fine-tuning their strategies to suit consumer needs, using the hints about their preferences that consumers leave online.

Whenever you are working on devising market strategies or working on product development, there are a few standard things you need to look out for that can be developed by analysing the dataset you will be working with.

![image.png](attachment:image.png)

+ By analysing the sentiment of the reviews, you can find the features of the phones that have resulted in positive/negative sentiments. This will help companies include or improve those particular features while developing a new product. If the data is of the competitor brands, the company will benefit by not repeating the same mistake during product development as their rival.
+ Companies can effectively design their Ad campaign by highlighting the features that are most talked about among the consumers.
+ Comparing the competitors' pricing and their market shares will help companies decide the price of their products.
+ It can be assumed that if the number of reviews for a particular brand is high, the number of people buying phones of that brand is also high. This will help companies gauge the market share of their competitors.
+ Before purchasing any product, we all look at similar products in various brands. This data will help the companies know their major competitors in the market. 

## <font color = Green > Problem Statement </font>

Suppose your customer is a mobile manufacturer based in the US, which entered the market three years ago. As they are a new entrant in the sector, they want to understand their competitors and preferences of their users so that they can design their strategies accordingly. They want to tweak the marketing strategies to add more value to their brand, provide features to customers that add the most value, and close the demand-supply gap. Their objective is to increase the market share as well as the brand value.

Assume that as a data analytics provider, you have been approached by this mobile phone manufacturer. They want you to provide them with some major insights into the mobile phone industry to help them achieve their objective. Their objective is to develop a new product optimally and create some marketing strategies.

## <font color = Green > Business Goal </font>

+ Part 1: Deriving the business insights that are useful for product development and marketing.
+ Part 2: Creating a sentiment classification engine.


## <font color = Green > Steps Followed </font>

#### Step 1: Data pre-processing
#### Step 2: EDA
#### Step 3: Text analytics
#### Step 4: Building a sentiment classification engine

# Reading and Understanding the data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Importing the Libraries

import pandas as pd
import numpy as np
import json

import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
# Reading the meta data
# importing libraries

import gzip
import shutil

#Path to the meta data zip file. 'sentiment_analysis' is the folder name under 'My Drive'
path1 = '/content/drive/My Drive/sentiment_analysis/meta_Cell_Phones_and_Accessories.json.gz'

# Path to meta data .json file
path2 = '/content/drive/My Drive/sentiment_analysis/meta_Cell_Phones_and_Accessories.json'

# Unzipping the meta data file
with gzip.open(path1, 'rb') as f_in:
    with open(path2, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
# Reading the unzipped meta data into a Python list. The result will be a list of dictionaries. 
import json

# Empty list to store the dictionaries
phonemetadata = []

# Reading the dictionaries in the json file and appending it to the list phonemetadata[]
with open('/content/drive/My Drive/sentiment_analysis/meta_Cell_Phones_and_Accessories.json', 'r') as f:
    for line in f:
        phonemetadata.append(json.loads(line))

In [None]:
# Getting the number of entries in the phonemetadata list

len(phonemetadata)

In [None]:
#converting the list phonemetadata into a data frame

df_meta = pd.DataFrame(phonemetadata)

In [None]:
df_meta.head()

In [None]:
df_meta.shape

In [None]:
df_meta.info()

### <font color = Green > Insights: </font>

The data does not seem to be in bad shape, but it needs to be understood better before we use it further.

In [None]:
# Reading the .csv file of the Cell Phones and Accessories into a dataframe

df_celldata = pd.read_csv('/content/drive/My Drive/sentiment_analysis/Cell_Phones_and_Accessories_5.csv')

In [None]:
df_celldata.head()

In [None]:
len(df_celldata)

In [None]:
df_celldata.shape

In [None]:
df_celldata.info()

### <font color = Green > Insights: </font>

+ The cell data seems to be in bad shape, so we need to handle missing values, remove duplicates, standardise, filter, and so on.
+ The dataframe has bool(1), float64(1), int64(1) and object(9) columns.
+ It has 12 columns and more than 11 lakh (1.1 million) rows.
+ `unixReviewTime` needs to be converted to a date format.
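
The `unixReviewTime` values are seconds since the Unix epoch. The following is a minimal sketch of one way to convert them, assuming the column holds integer seconds and that UTC dates are acceptable. The small `demo` frame and the `review_date` column name are illustrative and not part of the original data.

```python
import pandas as pd

# Illustrative frame; in the notebook the column lives in df_celldata
demo = pd.DataFrame({"unixReviewTime": [1388534400, 1420070400]})

# unit="s" tells pandas the integers are seconds since 1970-01-01 UTC
demo["review_date"] = pd.to_datetime(demo["unixReviewTime"], unit="s")
demo["year"] = demo["review_date"].dt.year
demo["month"] = demo["review_date"].dt.month
print(demo)
```

Because the conversion is done in UTC, the result does not depend on the timezone of the machine running the notebook.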

In [None]:
#Reading the .csv file of the phone data into a dataframe

df_phonedata = pd.read_csv('/content/drive/My Drive/sentiment_analysis/phone_data_final.csv', index_col=0, header=0)

In [None]:
df_phonedata.head(5)

In [None]:
len(df_phonedata)

In [None]:
df_phonedata.shape

In [None]:
df_phonedata.info()

### <font color = Green > Insights: </font>

+ The dataframe has bool(1), float64(1), int64(1), object(10) values. 
+ We can also see some missing values in the price and brand columns.
+ This dataframe seems to be a consolidated, cleaner version of df_meta and df_celldata, so we will not use it further.

### <font color = Green > More Insights: </font>

1. df_phonedata seems to be the most sophisticated one, a culmination of both df_meta and df_celldata, but df_celldata has more entries.
2. Cell data and phone data seem to have similar attributes but differ in shape and content, so we need to dissect them further and understand the relevant columns from both.
3. This might also include merging the two on their unique columns after dissecting them. Let's see.

### Let's read the other datasets provided to us to better understand their uses.



In [None]:
df_poscorpus = pd.read_excel('/content/drive/MyDrive/sentiment_analysis/positive_corpus.xlsx', index_col=None, header=0)
df_poscorpus.head()

In [None]:
df_poswords = pd.read_csv('/content/drive/MyDrive/sentiment_analysis/pos_words.txt', sep=' ', header=None)
df_poswords.head()

### <font color = Green > Insights: </font>

We do not yet understand how df_poscorpus and df_poswords are meant to be used, so we will set them aside for now and pick them up at a later stage of the analysis.

In [None]:
df_negcorpus = pd.read_excel('/content/drive/MyDrive/sentiment_analysis/negative_corpus.xlsx', index_col=None, header=0)
df_negcorpus.head()

In [None]:
df_negwords = pd.read_csv('/content/drive/MyDrive/sentiment_analysis/neg_words.txt', sep=' ', header=None)
df_negwords.head()

### <font color = Green > Insights: </font>

We do not yet understand how df_negcorpus and df_negwords are meant to be used, so we will set them aside for now and pick them up at a later stage of the analysis.

In [None]:
df_phonereviews = pd.read_csv('/content/drive/MyDrive/sentiment_analysis/phone_reviews.csv', index_col=None, header=0)
df_phonereviews.head()

In [None]:
df_phonereviews.shape

### <font color = Green > Insights: </font>

df_phonereviews also seems to be similar to df_phonedata, so we will set it aside for now.

In [None]:
df_revsent = pd.read_csv('/content/drive/MyDrive/sentiment_analysis/review_sentiment.csv', index_col=None, header=0)
df_revsent.head()

### <font color = Green > Insights: </font>

df_revsent also seems to be similar to df_phonedata, so we will set it aside for now.

In [None]:
df_brandasins = pd.read_csv('/content/drive/MyDrive/sentiment_analysis/Brands and Asins.csv', index_col=None, header=0)
df_brandasins.head()

### <font color = Green > Insights: </font>

df_brandasins contains the unique asin codes for a particular brand listed on Amazon.

### Let's start by merging df_meta and df_celldata, as these seem to hold the most information for our analysis


In [None]:
# Merging df_meta and df_celldata using the unique asin column values

merge_df = df_meta.merge(df_celldata, on = "asin")


In [None]:
merge_df.head(10)

In [None]:
merge_df.shape

In [None]:
merge_df.info()

In [None]:
merge_df.describe()

### <font color = Green > Insights: </font>

+ Now that we have a consolidated picture, we can see that the data has a lot of noise that needs to be preprocessed.
+ Once the data is cleaned and ready we can then start with our Exploratory Data Analysis and Text Processing. 
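
Before cleaning, it can help to check how an inner merge on `asin` behaves, since it silently drops rows without a match and multiplies rows when the key is duplicated on the metadata side. The sketch below uses toy stand-ins (`meta`, `reviews`) rather than the real dataframes, so the counts it prints say nothing about the actual data.

```python
import pandas as pd

# Toy stand-ins for df_meta (one row per product) and df_celldata (one row per review)
meta = pd.DataFrame({"asin": ["A1", "A2", "A3"], "brand": ["X", "Y", "Z"]})
reviews = pd.DataFrame({"asin": ["A1", "A1", "A2", "A9"], "overall": [5, 4, 1, 3]})

# how="outer" with indicator=True labels where each row came from
check = meta.merge(reviews, on="asin", how="outer", indicator=True)
merge_counts = check["_merge"].value_counts()
print(merge_counts)

# Duplicate asins in the metadata would multiply review rows in the merge
n_dup_meta = int(meta["asin"].duplicated().sum())
print("duplicate asins in metadata:", n_dup_meta)
```

Rows labelled `both` are the ones an inner merge keeps; `left_only` and `right_only` rows are the ones it discards.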

# Step 1: Data pre-processing



In [None]:
merge_df.head()

### Dissecting each column to see its contents and deciding whether to keep or drop it

In [None]:
merge_df.columns

In [None]:
# Dropping some unwanted and noisy columns

merge_df.drop(['category', 'description', 'image_x', 'rank', 'details', 'similar_item', 'date', 'image_y'],  axis = 1, inplace = True)

### Handling Missing Values


In [None]:
# Checking missing values columns

import missingno as msno
msno.bar(merge_df)

In [None]:
# Checking the percentage of missing values in each column

def null_values(df):
    return round((df.isnull().sum() / len(df) * 100).sort_values(ascending=False), 2)

null_values(merge_df)

In [None]:
# Dropping vote and style as it has more than 45% missing values

merge_df.drop(['vote','style'], axis = 1, inplace = True)

In [None]:
# Checking the number of columns left after dropping few of them 

merge_df.shape

In [None]:
# Rechecking the remaining missing values

null_values(merge_df)

In [None]:
# Imputing the mode value in the reviewerName column
# Assigning the result back avoids chained in-place modification of a column

merge_df['reviewerName'] = merge_df['reviewerName'].fillna(merge_df['reviewerName'].mode()[0])

In [None]:
# Checking missing values again: 

null_values(merge_df)

In [None]:
# Dissecting Summary Column: 

merge_df.summary.unique()

In [None]:
# Dissecting reviewText Column: 

merge_df.reviewText.unique()

### <font color = Green > Insights: </font>

+ summary and reviewText are string columns with a small percentage of missing values, so we will leave them as they are.

In [None]:
merge_df.info()

In [None]:
merge_df.shape

In [None]:
# Converting unixReviewTime to a date string (YYYY-MM-DD)
# Note: fromtimestamp() uses the machine's local timezone, and the 2-hour offset
# is kept as written, so dates close to midnight can shift by a day.
from datetime import datetime, timedelta
merge_df['Date&Time'] = merge_df['unixReviewTime'].apply(lambda d: (datetime.fromtimestamp(d) - timedelta(hours=2)).strftime('%Y-%m-%d'))


In [None]:
# Dropping unixReviewTime 

merge_df.drop('unixReviewTime', axis = 1, inplace = True)

In [None]:
merge_df.head()


In [None]:
# Dissecting tech columns

merge_df["tech2"].unique()

In [None]:
merge_df.loc[:50,"tech2"]


### <font color = Green > Insights: </font>

Nothing could be understood from exploring the tech columns, so we drop them both.

In [None]:
# Dropping tech1 and tech2 as it has more than 45% missing values

merge_df.drop(['tech1','tech2'], axis = 1, inplace = True)

In [None]:
# Dissecting the fit column

merge_df["fit"].unique()

In [None]:
merge_df.loc[:10,"fit"]


### <font color = Green > Insights: </font>

The same applies to the fit column.

In [None]:
# Dropping fit as it has more than 45% missing values

merge_df.drop('fit', axis = 1, inplace = True)

In [None]:
merge_df.head()

In [None]:
# Let's subset the data to our particular domain which is cell phones

merge_df.main_cat.value_counts()

In [None]:
# Subsetting the dataframe to only "Cell Phones & Accessories"
# .copy() avoids SettingWithCopyWarning when columns are modified later

merge_df = merge_df[merge_df["main_cat"] == "Cell Phones & Accessories"].copy()

In [None]:
merge_df.shape

In [None]:
# Standardising the price column and converting it to numeric

merge_df.price.unique()

In [None]:
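# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
merge_df['price'] = merge_df['price'].str.replace('$', '', regex=False)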

In [None]:
merge_df.price.unique()

In [None]:
merge_df['price'] = pd.to_numeric(merge_df['price'] , errors = 'coerce')

In [None]:
merge_df.info()

In [None]:
# Understanding the use of title column 

merge_df.title.value_counts()

In [None]:
# drop irrelevant columns with no use further

merge_df.drop(['title', 'feature', 'summary']  , axis = 1, inplace = True)

In [None]:
# Checking missing values again: 

null_values(merge_df)

In [None]:
# Checking the distribution of the price column
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style("dark")
plt.style.use("ggplot")
plt.figure(figsize=[10,6])
sns.distplot(merge_df['price'], rug = True, color = 'royalblue')
plt.show()

In [None]:
merge_df['price'].describe()

In [None]:
merge_df['price'].mode()[0]

### <font color = Green > Rule Followed </font>

Missing values in numerical columns are filled using the following rules:

1. If the mean and median are approximately equal, substitute the mean.

2. If the mean and median differ, substitute the median.

3. If there is a huge difference between the mean and the maximum, substitute the mode.
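
The rules above can be sketched as a small helper. This is only an illustration: the function name `choose_fill_value` and the thresholds `rel_tol` and `max_ratio` are assumptions introduced here, since the notebook does not state what counts as "approximately equal" or "huge".

```python
import pandas as pd

def choose_fill_value(series, rel_tol=0.10, max_ratio=5.0):
    """Pick an imputation value following the three rules above.

    rel_tol and max_ratio are illustrative thresholds, not values from the notebook.
    """
    s = series.dropna()
    mean, median, maximum = s.mean(), s.median(), s.max()
    # Rule 3: a maximum far above the mean signals heavy skew, so use the mode
    if mean > 0 and maximum / mean > max_ratio:
        return "mode", s.mode()[0]
    # Rule 1: mean and median roughly agree, so use the mean
    if abs(mean - median) <= rel_tol * abs(median):
        return "mean", mean
    # Rule 2: otherwise fall back to the median
    return "median", median

symmetric = pd.Series([9.0, 10.0, 11.0, None])
skewed = pd.Series([9.99, 9.99, 12.0, 15.0, 20.0, 25.0, 900.0, None])
print(choose_fill_value(symmetric))
print(choose_fill_value(skewed))
```

Checking rule 3 first treats it as an override of the other two, which is how the list reads; a different ordering or different thresholds would be equally defensible.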

In [None]:
# Imputing missing values of price with its mode value

merge_df['price'] = merge_df['price'].fillna(merge_df['price'].mode()[0])

In [None]:
# Checking missing values again: 

null_values(merge_df)

In [None]:
# Extracting year and month and creating a new column for them 

merge_df['year'] = pd.to_datetime(merge_df['Date&Time']).dt.year
merge_df['month'] = pd.to_datetime(merge_df['Date&Time']).dt.month

In [None]:
merge_df.head()

In [None]:
# Dropping Date&Time as we have month and year separately

merge_df.drop('Date&Time', axis = 1, inplace = True)

In [None]:
# Renaming overall to ratings 

merge_df.rename(columns={'overall': 'ratings'}, inplace=True)

In [None]:
# Checking the value counts of ratings and the success of the rename

merge_df.ratings.value_counts()

In [None]:
# Checking the value counts of month

merge_df.month.value_counts()

In [None]:
# Checking the value counts of year

merge_df.year.value_counts()

#### Understand the data (Numeric and Categorical Analysis)

In [None]:
# Check the summary for the numeric columns 

merge_df.describe()

### <font color = Green > Insights </font>

+ As we replaced the null prices (almost 45% of the column) with 9.99, that value now dominates the price column and pulls the average price of products listed under cell phones and accessories towards it. There is also a wide spread in the data, as the standard deviation is larger than the mean.
+ Most of the ratings given are positive, i.e. 4 or 5.
+ The highest numbers of reviews were received in 2015 and 2016.


In [None]:
# Checking the ratio of actual positive and actual negative labels under review_sentiment in the dataframe.

merge_df.review_sentiment.value_counts(normalize = True)*100

In [None]:
# Export the modified version of the merged dataframe for the Tableau demonstration

merge_df.to_csv('merge_df(modified).csv',index=False)

In [None]:
! ls

# EDA

In [None]:
merge_df.head()

### <font color = Green > Insights </font>

+ Using TextBlob to calculate sentiment polarity which lies in the range of [-1,1] where 1 means positive sentiment and -1 means a negative sentiment.
+ Creating new feature for the length of the review to review their distribution. 
+ Creating new feature for the word count of the review to review their distribution.
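
One caveat for the length features: a missing review cast directly to a string becomes the text "nan", which has length 3 and counts as one word. The sketch below shows one way to avoid that, assuming it is acceptable to treat missing reviews as empty; the `demo` frame is illustrative.

```python
import pandas as pd

demo = pd.DataFrame({"reviewText": ["Great phone, love it", None, "Bad"]})

# Replace missing reviews with an empty string so they count as length 0;
# casting NaN straight to str would produce the text "nan" (length 3, one word)
clean_text = demo["reviewText"].fillna("").astype(str)
demo["review_len"] = clean_text.str.len()
demo["word_count"] = clean_text.str.split().str.len()
print(demo)
```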

In [None]:
# Creating a function to get the polarity of the sentiment in the range of -1 to 1

from textblob import TextBlob

def GetPolarity(text):
  # Non-string values (e.g. NaN for a missing review) cannot be scored
  if not isinstance(text, str):
    return None
  return TextBlob(text).sentiment.polarity

# Create a new column 'polarity'
merge_df['polarity'] = merge_df['reviewText'].apply(GetPolarity)

In [None]:
# Creating a function to get the TB scores.

def GetTBScore(score):
  # Missing polarity (NaN/None) stays unlabelled instead of falling through to 'Positive'
  if pd.isna(score):
    return None
  if score < 0:
    return 'Negative'
  elif score == 0:
    return 'Neutral'
  else:
    return 'Positive'


# Create a new column 'TB_score'

merge_df['TB_score'] = merge_df['polarity'].apply(GetTBScore)


In [None]:
# Creating review length and review word count column. 

merge_df['review_len'] = merge_df['reviewText'].astype(str).apply(len)
merge_df['word_count'] = merge_df['reviewText'].apply(lambda x: len(str(x).split()))

In [None]:
merge_df.head()

In [None]:
# Let's understand how the sentiment polarity score works, for that we randomly select 5 reviews with the highest sentiment polarity score (1):

print('5 random reviews with the highest positive sentiment polarity: \n')
cl = merge_df.loc[merge_df.polarity == 1, ['reviewText']].sample(5).values
for c in cl:
    print(c[0])

In [None]:
# Let's understand how the sentiment polarity score works, for that we randomly select 5 reviews with the most neutral sentiment polarity score (zero):

print('5 random reviews with the most neutral sentiment(zero) polarity: \n')
cl = merge_df.loc[merge_df.polarity == 0, ['reviewText']].sample(5).values
for c in cl:
    print(c[0])

In [None]:
merge_df.polarity.min()

In [None]:
merge_df.loc[merge_df.polarity == -1.0]


In [None]:
# Then randomly select 5 reviews with the most negative sentiment polarity score (-1):


print('5 reviews with the most negative polarity: \n')
cl = merge_df.loc[merge_df.polarity == -1.0, ['reviewText']].sample(5).values
for c in cl:
    print(c[0])

### Univariate Analysis 

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)
import plotly.graph_objs as go
#import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.express as px
from plotly.offline import iplot
import cufflinks as cf
cf.go_offline()
cf.set_config_file(world_readable=True, theme='pearl', offline=False)

%matplotlib inline

In [None]:
merge_df.head()

In [None]:
# Review Sentiment Polarity distribution 

sns.set_style("dark")
plt.style.use("ggplot")
plt.figure(figsize=[12,6])
sns.distplot(merge_df['polarity'], rug = True, color = 'royalblue').set_title('Sentiment Polarity Distribution')
plt.show()

### <font color = Green > Insights </font>

The vast majority of the sentiment polarity scores are greater than zero, which means most reviews are positive.

In [None]:
# Review Text Length distribution 

sns.set_style("dark")
plt.style.use("ggplot")
plt.figure(figsize=[12,6])
sns.distplot(merge_df['review_len'], rug = True, color = 'royalblue').set_title('Review Text Length Distribution')
plt.show()

### <font color = Green > Insights </font>

The length of most review texts is between 500 and 1,000 characters.

In [None]:
# Review Word Count Distribution 

sns.set_style("dark")
plt.style.use("ggplot")
plt.figure(figsize=[12,6])
sns.distplot(merge_df['word_count'], rug = True, color = 'royalblue').set_title('Review Word Count Distribution')
plt.show()

### <font color = Green > Insights </font>

On average, each review contains 50 to 200 words.

In [None]:
# Review Price Distribution 

sns.set_style("dark")
plt.style.use("ggplot")
plt.figure(figsize=[12,6])
sns.distplot(merge_df['price'], rug = True, color = 'royalblue').set_title('Review Price Distribution')
plt.show()

### <font color = Green > Insights </font>

The majority of the cell phone and accessory products listed are priced between 10 and 150 dollars.

In [None]:
# Review Rating Count Distribution 

sns.set_theme(style="darkgrid")
plt.figure(figsize=[12,6])
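# Pass the column as x explicitly; newer seaborn versions treat the first positional argument as `data`
sns.countplot(x=merge_df['ratings'], palette = 'inferno')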
plt.title('Rating Count Distribution', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

### <font color = Green > Insights </font>

+ The majority of the products have received positive ratings, i.e. either 4 or 5.

In [None]:
# Review TB_Score count percentage wise

plt.figure(figsize=[12,6])
(merge_df.TB_score.value_counts(normalize = True)*100)[:10].plot.bar()
plt.title('TB_Score Count Distribution', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

### <font color = Green > Insights </font>

+ As seen earlier, most of the reviews fall in the positive category.

In [None]:
# Review verified reviews percentage wise

plt.figure(figsize=[12,6])
(merge_df.verified.value_counts(normalize = True)*100)[:10].plot.bar()
plt.title('Verified Reviews Count Distribution', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

### <font color = Green > Insights </font>

Out of all the reviews received, more than 80% are verified on the website.

In [None]:
# Distribution of brands listed on Amazon having more than 5000 reviews

plt.figure(figsize=[18,8])
brand_counts = merge_df.groupby('brand').count()['reviewerID'].sort_values(ascending=False)
brand_counts[brand_counts > 5000].plot.bar()
plt.title('Brands listed on Amazon having more than 5000 reviews', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

### <font color = Green > Insights </font>

Samsung, Spigen, OtterBox, Generic, Anker and Motorola are the leading brands on the website by number of reviews.

In [None]:
# Distribution of reviewers on Amazon having more than 500 reviews 

plt.figure(figsize=[18,8])
reviewer_count = merge_df.groupby('reviewerName').count()['reviewerID'].sort_values(ascending=False)
reviewer_count[reviewer_count > 500].plot.bar()
plt.title('Reviewers on Amazon having more than 500 reviews', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

### <font color = Green > Insights </font>

Setting aside the first two names, these are the reviewers with the most reviews on the website.

In [None]:
# Reading stop words from a text file into a list; `with` closes the file afterwards
with open('/content/drive/My Drive/sentiment_analysis/stop_words_long.txt') as f:
    stop_words = [line.rstrip('\n') for line in f]

In [None]:
# Top 20 unigrams distribution "before" removing stop words

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

pd.set_option('max_colwidth', 100)

def get_top_n_words(corpus, n=None):
  vec = CountVectorizer().fit(corpus)
  bag_of_words = vec.transform(corpus)
  sum_words = bag_of_words.sum(axis=0) 
  words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
  words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
  return words_freq[:n]
common_words = get_top_n_words(merge_df['reviewText'].astype('U').values, 20)
for word, freq in common_words:
  print(word, freq)
df1 = pd.DataFrame(common_words, columns = ['reviewText' , 'count'])
plt.figure(figsize=[18,8])
df1.groupby('reviewText').sum()['count'].sort_values(ascending=False).plot.bar()
plt.title('Top 20 unigram words in review "before" removing stop words', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

In [None]:
from wordcloud import WordCloud

text = " ".join(review for review in df1.reviewText)


wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=stop_words).generate(text)
plt.figure(figsize=[15,10])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show();

In [None]:
# Top 20 unigrams distribution "after" removing stop words

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words = 'english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_words(merge_df['reviewText'].apply(lambda x: np.str_(x)), 20)
for word, freq in common_words:
    print(word, freq)
df1 = pd.DataFrame(common_words, columns = ['reviewText' , 'count'])
plt.figure(figsize=[18,8])
df1.groupby('reviewText').sum()['count'].sort_values(ascending=False).plot.bar()
plt.title('Top 20 unigram words in review "after" removing stop words', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

In [None]:
from wordcloud import WordCloud

text = " ".join(review for review in df1.reviewText)


wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=stop_words).generate(text)
plt.figure(figsize=[15,10])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show();

### <font color = Green > Insights </font>

Ignoring stop words with CountVectorizer helps us identify the meaningful single words for our analysis.
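
To see why stop-word removal changes the ranking, here is a simplified, standard-library sketch of n-gram counting. It splits on whitespace only, so it is not a replacement for CountVectorizer's tokenizer; the function name `top_ngrams`, the sample `docs` and the `stops` set are all illustrative.

```python
from collections import Counter

def top_ngrams(docs, n=1, stop_words=frozenset(), k=3):
    """Count n-grams over whitespace tokens, dropping stop words first."""
    counts = Counter()
    for doc in docs:
        tokens = [t for t in doc.lower().split() if t not in stop_words]
        # zip over n shifted copies of the token list yields consecutive n-grams
        counts.update(" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

docs = ["the case fits the phone", "the phone case is great", "great case"]
stops = {"the", "is"}
print(top_ngrams(docs, n=1))                    # stop words compete for the top ranks
print(top_ngrams(docs, n=1, stop_words=stops))  # content words only
print(top_ngrams(docs, n=2, stop_words=stops))  # bigrams after stop-word removal
```

Setting `n=2` gives the bigram counts, which is the same idea as `ngram_range=(2, 2)` in the cells that follow.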

In [None]:
# Top 20 bigrams distribution "before" removing stop words

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_words(merge_df['reviewText'].apply(lambda x: np.str_(x)), 20)
for word, freq in common_words:
    print(word, freq)
df1 = pd.DataFrame(common_words, columns = ['reviewText' , 'count'])
plt.figure(figsize=[18,8])
df1.groupby('reviewText').sum()['count'].sort_values(ascending=False).plot.bar()
plt.title('Top 20 bigram words in review "before" removing stop words', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

In [None]:
from wordcloud import WordCloud

text = " ".join(review for review in df1.reviewText)


wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=stop_words).generate(text)
plt.figure(figsize=[15,10])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show();

In [None]:
# Top 20 bigrams distribution "after" removing stop words

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_words(merge_df['reviewText'].apply(lambda x: np.str_(x)), 20)
for word, freq in common_words:
    print(word, freq)
df1 = pd.DataFrame(common_words, columns = ['reviewText' , 'count'])
plt.figure(figsize=[18,8])
df1.groupby('reviewText').sum()['count'].sort_values(ascending=False).plot.bar()
plt.title('Top 20 bigram words in review "after" removing stop words', fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

In [None]:
from wordcloud import WordCloud

text = " ".join(review for review in df1.reviewText)


wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=stop_words).generate(text)
plt.figure(figsize=[15,10])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show();

### <font color = Green > Insights </font>

Ignoring stop words with CountVectorizer helps us identify the meaningful two-word phrases (bigrams) for our analysis.

### Bi / Multivariate Analysis 

In [None]:
merge_df.head()

In [None]:
# Polarity Distribution Month Wise

plt.figure(figsize=[14,8])
sns.boxplot(data = merge_df, orient='v', x = 'month', y='polarity')
plt.title("polarity distribution - Month Wise", fontdict={'fontsize' : 20, 'fontweight' : 5, 'color': 'Green'})
plt.show()

In [None]:
# Polarity Distribution Year Wise

plt.figure(figsize=[14,8])
sns.boxplot(data= merge_df, orient='v', x = 'year', y='polarity')
plt.title("polarity distribution - Year Wise", fontdict={'fontsize' : 20, 'fontweight' : 5, 'color': 'Green'})
plt.show()

In [None]:
# Relationship between review Length and Word count in a review
plt.figure(figsize=[12,8])
sns.scatterplot(data=merge_df, x="review_len", y="word_count", hue="TB_score", style="TB_score")
plt.show()

In [None]:
# Relationship between review Length and polarity in a review

plt.figure(figsize=[12,8])
sns.scatterplot(data=merge_df, x="review_len", y="polarity", hue="TB_score", style="TB_score")
plt.show()

In [None]:
# Relationship between word_count and polarity in a review

plt.figure(figsize=[12,8])
sns.scatterplot(data=merge_df, x="word_count", y="polarity", hue="TB_score", style="TB_score")
plt.show()

In [None]:
plt.figure(figsize=[14,8])
sns.barplot(x=merge_df['TB_score'] , y=merge_df['polarity'], ci=None)
plt.title("TB Score vs Polarity Score", fontdict={'fontsize': 20, 'fontweight': 5, 'color': 'Green'})
plt.show()

### <font color = Green > Insights </font>

Bi/multivariate analysis did not yield many insights.
Over time, the cell phone category has made a big impact, as more positive reviews have been received. We also saw that the word count and the length of a review are correlated.
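
The relationship between review length and word count can also be quantified rather than read off a scatter plot. A minimal sketch on an illustrative `demo` frame, using the Pearson correlation from pandas:

```python
import pandas as pd

demo = pd.DataFrame({"reviewText": [
    "good",
    "works well so far",
    "the screen cracked after one week of use",
]})
demo["review_len"] = demo["reviewText"].str.len()
demo["word_count"] = demo["reviewText"].str.split().str.len()

# Pearson correlation between the two length features
r = demo["review_len"].corr(demo["word_count"])
print(round(r, 3))
```

On the real data, the same call on `merge_df['review_len']` and `merge_df['word_count']` would give a single number to report alongside the plot.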

# Text Analytics

In [None]:
# Reading stop words from a text file into a list; `with` closes the file afterwards
with open('/content/drive/My Drive/sentiment_analysis/stop_words_long.txt') as f:
    stop_words = [line.rstrip('\n') for line in f]

In [None]:
print(stop_words)

In [None]:
merge_df.head()

In [None]:
merge_df["review_sentiment"].value_counts(normalize = True)*100

In [None]:
merge_df["TB_score"].value_counts(normalize = True)*100

In [None]:
# Dataframe to proceed with text analytics
df_text = merge_df[["reviewText", "review_sentiment"]]
df_text.head()

In [None]:
df_text.shape

### Bag of words model

+ Subsetting the dataset
+ Plotting word frequencies and removing stopwords
+ Tokenisation
+ Stemming
+ Lemmatization
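
To make the bag-of-words idea concrete before applying CountVectorizer, here is a tiny hand-rolled version on two made-up sentences. It is a sketch of the concept only: each document becomes a row of word counts over a shared vocabulary.

```python
from collections import Counter

docs = ["great phone great price", "poor battery"]

# Vocabulary: sorted unique tokens across all documents
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word, cells hold counts
bow = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
for row in bow:
    print(row)
```

Word order is discarded in this representation; only how often each word occurs in a document is kept.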

#### Let's take a subset of the data (first 50 rows only) and create a bag-of-words model on it. The objective is to understand the text using CountVectorizer.

In [None]:
text = df_text.iloc[0:50,:]
print(text)

In [None]:
# extract the reviews from the dataframe
reviewTexts = text.reviewText
print(reviewTexts)

In [None]:
reviewTexts.shape

In [None]:
# convert reviewTexts into list
reviewTexts = [review for review in reviewTexts]
print(reviewTexts)

In [None]:
# load all necessary libraries
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import string

pd.set_option('max_colwidth', 100)

In [None]:
def preprocess(document):
    'changes document to lower case, removes stopwords and punctuations'

    # change sentence to lower case
    document = document.lower()

    # tokenize into words
    words = word_tokenize(document)

    # remove stop words
    words = [word for word in words if word not in stop_words]
    
    # for punctuation removal
    words = [word for word in words if word not in string.punctuation]

    # join words to make sentence
    document = " ".join(words)
    
    return document

In [None]:
# preprocess messages using the preprocess function
reviewTexts = [preprocess(review) for review in reviewTexts]
print(reviewTexts)


In [None]:
# bag of words model
vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(reviewTexts)
print(bow_model.toarray())

In [None]:
print(bow_model.shape)
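# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())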

## Stemming and lemmatising

#### Stemming

It is a rule-based technique that just chops off the suffix of a word to get its root form, which is called the ‘stem’. For example, if you use a simple suffix-stripping stemmer on the words of the string "The driver is racing in his boss’ car", the words ‘driver’ and ‘racing’ will be converted to their root form by just chopping off the suffixes ‘er’ and ‘ing’. So, ‘driver’ will be converted to ‘driv’ and ‘racing’ will be converted to ‘rac’. Real stemmers such as the Porter stemmer apply further rules, so their exact output may differ.

You might think that the root forms (or stems) don’t resemble the root words ‘drive’ and ‘race’. You don’t have to worry about this because the stemmer will convert all the variants of ‘drive’ and ‘race’ to those root forms only. So, it will convert ‘drive’, ‘driving’, etc. to ‘driv’, and ‘race’, ‘racer’, etc. to ‘rac’. This gives us satisfactory results in most cases.
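
The suffix-chopping idea can be illustrated with a deliberately naive stemmer. This is a toy written for this explanation, not the Porter algorithm: the function name `toy_stem` and its suffix list are assumptions, and real stemmers use many more rules and conditions.

```python
def toy_stem(word, suffixes=("ing", "er", "ed", "es", "s", "e")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["driver", "driving", "drive", "racing", "racer", "race"]:
    print(w, "->", toy_stem(w))
```

All variants of a word collapse to the same stem, which is what matters for counting, even though the stem itself is not a dictionary word.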

#### Lemmatising

This is a more sophisticated technique (and perhaps more 'intelligent') in the sense that it doesn’t just chop off the suffix of a word. Instead, it takes an input word and searches for its base word by going recursively through all the variations of dictionary words. The base word in this case is called the lemma. Words such as ‘feet’, ‘drove’, ‘arose’, ‘bought’, etc. can’t be reduced to their correct base form using a stemmer, but a lemmatizer can reduce them to their correct base form. The most popular lemmatizer is the WordNet lemmatizer, created by a team of researchers at Princeton University.

You may sometimes be unsure whether to use a stemmer or a lemmatizer in your application. The following points might help you decide:

+ A stemmer is rule-based and hence much faster than a lemmatizer (which searches a dictionary for the lemma of a word). On the other hand, a stemmer typically gives less accurate results than a lemmatizer.
+ A lemmatizer is slower because of the dictionary lookup but gives better results than a stemmer. As a side note, for a lemmatizer to perform accurately you need to provide the part-of-speech (POS) tag of the input word (noun, verb, adjective, etc.). You’ll learn POS tagging in the next session, but for now it suffices to know that the POS tagger itself is often quite inaccurate on your text, which worsens the performance of the lemmatizer as well. In short, consider a stemmer rather than a lemmatizer if you notice that POS tagging is inaccurate.

In general, you can try both and see if it’s worth using a lemmatizer over a stemmer. If a stemmer gives you almost the same results with better efficiency, then choose the stemmer; otherwise use a lemmatizer.
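
To make the contrast concrete, here is a minimal toy sketch of the two ideas. It is not the Porter stemmer or the WordNet lemmatizer: the suffix list `SUFFIXES` and the mini-dictionary `LEMMA_DICT` are hypothetical and exist only for illustration. The point is that the rule-based function conflates ‘drive’, ‘driver’ and ‘driving’ but cannot handle irregular forms, while the dictionary lookup can.

```python
# Toy illustration only: real stemmers (e.g. Porter) use many more rules,
# and real lemmatizers (e.g. WordNet) use a full dictionary plus POS tags.

# Hypothetical suffix rules, longest suffixes first.
SUFFIXES = ("ing", "er", "ed", "es", "s", "e")

def toy_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMA_DICT = {
    "feet": "foot",
    "drove": "drive",
    "driving": "drive",
    "racing": "race",
    "bought": "buy",
}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

for w in ["drive", "driver", "driving", "race", "racing", "feet", "drove", "bought"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stems are not dictionary words, but all variants of a word collapse to the same string, which is what matters for a bag-of-words model. The irregular forms ‘feet’ and ‘bought’ pass through the toy stemmer unchanged, whereas the lookup maps them to ‘foot’ and ‘buy’.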

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string

stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

# add stemming and lemmatisation in the preprocess function
def preprocess(document, stem=True):
    """Lowercase the document, remove stop words and punctuation,
    then stem (stem=True) or lemmatise (stem=False) each word."""

    # change sentence to lower case
    document = document.lower()

    # tokenize into words
    words = word_tokenize(document)

    # remove stop words
    words = [word for word in words if word not in stop_words]
    
    # for punctuation removal
    words = [word for word in words if word not in string.punctuation]
    
    # new step: if stem is True, apply the Porter stemmer; otherwise apply the WordNet lemmatizer
    if stem:
        words = [stemmer.stem(word) for word in words]
    else:
        # pos='v' treats every word as a verb; this is a simplification used here in place of POS tagging
        words = [wordnet_lemmatizer.lemmatize(word, pos='v') for word in words]

    # join words to make sentence
    document = " ".join(words)
    
    return document

### Bag of words model on stemmed reviews

In [None]:
# stemming reviews
reviewTexts = [preprocess(review, stem=True) for review in text.reviewText]

# bag of words model
vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(reviewTexts)

In [None]:
# look at the dataframe
pd.DataFrame(bow_model.toarray(), columns = vectorizer.get_feature_names_out())

In [None]:
print(vectorizer.get_feature_names_out())

In [None]:
len(vectorizer.get_feature_names_out())

In [None]:
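# Word cloud of the vocabulary from the Porter-stemmed bag-of-words model
# note: each vocabulary term is joined once, so the cloud shows the vocabulary rather than term frequencies

from wordcloud import WordCloud

text1 = " ".join(term for term in vectorizer.get_feature_names_out())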


wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=stop_words).generate(text1)
plt.figure(figsize = [15,12])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show();

### Let's try lemmatizing the reviews.

In [None]:
# lemmatise reviews
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
reviewTexts = [preprocess(review, stem=False) for review in text.reviewText]

# bag of words model
vectorizer = CountVectorizer()
bow_model = vectorizer.fit_transform(reviewTexts)

In [None]:
# look at the dataframe
pd.DataFrame(bow_model.toarray(), columns = vectorizer.get_feature_names_out())

In [None]:
print(vectorizer.get_feature_names_out())

In [None]:
len(vectorizer.get_feature_names_out())

In [None]:
# Word cloud of the vocabulary from the WordNet-lemmatized bag-of-words model

from wordcloud import WordCloud

text2 = " ".join(term for term in vectorizer.get_feature_names_out())


wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=stop_words).generate(text2)
plt.figure(figsize = [15,12])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show();

### <font color = Green > Insights: </font>

+ Lemmatization appears to produce cleaner, more recognisable word forms than stemming.


## TF-IDF model
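
As a rough guide to what the cells below compute, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each document vector, which is the documented default behaviour of scikit-learn's `TfidfVectorizer`. The three-review corpus `docs` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of already preprocessed reviews.
docs = ["good phone good battery", "bad battery", "good screen"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency: rarer terms get a larger weight.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Term count times IDF for every vocabulary term, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
for doc, vec in zip(docs, vectors):
    print(f"{doc!r}: {[round(v, 3) for v in vec]}")
```

Terms that occur in many reviews (such as 'good' and 'battery' here) get a lower IDF than terms confined to one review (such as 'bad'), so within a document the distinctive terms carry more weight than in a plain count-based bag of words.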

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
df_text.head()

In [None]:
df_text.shape

In [None]:
text.shape

In [None]:
# extract the review texts from the dataframe
reviewTexts = text.reviewText
print(reviewTexts)

In [None]:
# Converting reviewText into a list

reviewTexts = list(reviewTexts)
print(reviewTexts)

In [None]:
def preprocess(document):
    'changes document to lower case, removes stopwords and punctuations'

    # change sentence to lower case
    document = document.lower()

    # tokenize into words
    words = word_tokenize(document)

    # remove stop words
    words = [word for word in words if word not in stop_words]
    
    # for punctuation removal
    words = [word for word in words if word not in string.punctuation]

    # join words to make sentence
    document = " ".join(words)
    
    return document

In [None]:
# preprocess messages using the preprocess function
reviewTexts = [preprocess(review) for review in reviewTexts]
print(reviewTexts)

In [None]:
# TF-IDF model using TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_model = vectorizer.fit_transform(reviewTexts)

In [None]:
# Let's look at the dataframe
tfidf = pd.DataFrame(tfidf_model.toarray(), columns = vectorizer.get_feature_names_out())
tfidf

In [None]:
# token names
print(vectorizer.get_feature_names_out())

In [None]:
len(vectorizer.get_feature_names_out())

### <font color = Green > Insights: </font>

+ Finally, we can see that the TF-IDF technique produces much cleaner and less noisy words, hence we will build our model on this technique.


In [None]:
# Word cloud of the vocabulary from the TF-IDF model

from wordcloud import WordCloud

text3 = " ".join(term for term in vectorizer.get_feature_names_out())


wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=stop_words).generate(text3)
plt.figure(figsize = [15,12])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show();

### <font color = Green > Insights: </font>


# Model Building & Evaluation

In [None]:
merge_df.head()

In [None]:
df_text = merge_df[["brand", "reviewerName", "reviewText", "review_sentiment"]]
df_text.head()

In [None]:
df_text.shape

In [None]:
# Train Test Split
from sklearn.model_selection import train_test_split
X = df_text[["brand", "reviewerName", "reviewText"]]
y = df_text.review_sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state = 42)
print("Value counts for Train sentiments")
print(y_train.value_counts())
    
print("Value counts for Test sentiments")
print(y_test.value_counts())
print(" ")
print(type(X_train))
print(type(y_train))
print(" ")
print(X_train.head())

In [None]:
df_text.head()

### Multinomial Naive Bayes Classifier 

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix

# list(...) assumes stop_words is a list or set of words; recent scikit-learn versions reject a set here
mnb = Pipeline([('vect', CountVectorizer(stop_words = list(stop_words))),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
              ])
# cast to str so that missing or non-string reviews do not break the vectorizer
mnb.fit(X_train['reviewText'].astype(str), y_train)

from sklearn.metrics import classification_report
y_pred = mnb.predict(X_test['reviewText'].astype(str))

# accuracy_score expects (y_true, y_pred)
print('accuracy %s' % accuracy_score(y_test, y_pred))
#print(classification_report(y_test, y_pred, target_names = df_text['review_sentiment'].unique()))

### <font color = Green > Insights: </font>

**Accuracy = Correctly Predicted Labels / Total Number of Labels**

The correctly predicted labels are:

+ 'Positive' reviews identified as Positive
+ 'Negative' reviews identified as Negative

We have achieved an accuracy of 81%, but is accuracy enough to assess the goodness of the model? The answer is a **big NO!** When one class dominates the data, a model can score well on accuracy while misclassifying most of the smaller class.

Let's take a look at the confusion matrix we got for our final model. 


In [None]:
mnb

### <font color = Green > Insights: </font>

+ Removing stop words in CountVectorizer has helped us improve the accuracy of the model.

In [None]:
# predicting probabilities of test data
proba = mnb.predict_proba(X_test['reviewText'].apply(lambda x: np.str_(x)))
proba

In [None]:
# confusion matrix
from sklearn import metrics
metrics.confusion_matrix(y_test, y_pred)

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt


cm = confusion_matrix(y_test,y_pred)
cm_df = pd.DataFrame(cm,
                     index = ['Negative','Positive'], 
                     columns = ['Negative','Positive'])
#Plotting the confusion matrix
plt.figure(figsize=(5,4))
sns.heatmap(cm_df, annot=True,fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()

### <font color = Green > Insights: </font>

+ The actual labels are along the rows while the predicted labels are along the columns.
+ 8698 reviews are correctly predicted as 'Negative'. 686 reviews are actually 'Positive' but predicted as 'Negative' by the model. In total there are 686 + 245536 = 246222 actual Positives, so the model misses only a small fraction of them, which is acceptable.
+ However, the model predicts 58736 reviews as 'Positive' when they are actually 'Negative', which could mislead us when identifying the brands that have positive reviews. This is a bit risky.

This brings us to two of the most commonly used metrics to evaluate a classification model:

+ **Sensitivity: out of all the actual positives, how many did the model detect?**
+ **Specificity: out of all the actual negatives, how many did the model detect?**

| Actual \ Predicted | Negative | Positive |
|---|---|---|
| Negative | True Negatives | False Positives |
| Positive | False Negatives | True Positives |

In [None]:
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]

In [None]:
sensitivity = TP / float(FN + TP)
print("sensitivity",sensitivity)

### <font color = Green > Insights: </font>

We have good sensitivity. When a model's sensitivity is high, it rarely produces false negatives, so very few actual Positive reviews are missed. This is confirmed by the low number of False Negatives (686).

In [None]:
specificity = TN / float(TN + FP)
print("specificity",specificity)

### <font color = Green > Insights: </font>

We have very low specificity. A model with low specificity is too eager to predict Positive even when the review is Negative, and so produces a high number of false positives. This is confirmed by the high number of False Positives (58736).
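
As a quick sanity check, the metrics can be recomputed by hand from the four counts quoted in the insights above (8698, 58736, 686 and 245536). This is only a sketch of the arithmetic. The variable `baseline_accuracy` is an illustrative addition: it is the accuracy of a trivial model that labels every review Positive.

```python
# Counts quoted in the insights above (rows = actual, columns = predicted).
TN, FP = 8698, 58736      # actual Negative
FN, TP = 686, 245536      # actual Positive

total = TN + FP + FN + TP

accuracy = (TP + TN) / total
sensitivity = TP / (TP + FN)   # share of actual Positives that were detected
specificity = TN / (TN + FP)   # share of actual Negatives that were detected
precision = TP / (TP + FP)     # share of predicted Positives that are correct

# Illustrative baseline: a model that predicts Positive for every review.
baseline_accuracy = (TP + FN) / total

print(f"accuracy          {accuracy:.3f}")
print(f"sensitivity       {sensitivity:.3f}")
print(f"specificity       {specificity:.3f}")
print(f"precision         {precision:.3f}")
print(f"baseline accuracy {baseline_accuracy:.3f}")
```

With these counts, accuracy is about 0.81 while the always-Positive baseline already reaches about 0.785, so accuracy alone says little here. Specificity of about 0.13 shows that most Negative reviews are misclassified.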

In [None]:
from sklearn import metrics
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

print("PRECISION SCORE :", precision_score(y_test, y_pred, pos_label='POSITIVE'))
print("RECALL SCORE :", recall_score(y_test, y_pred, pos_label='POSITIVE'))
print("F1 SCORE :", metrics.f1_score(y_test, y_pred, pos_label='POSITIVE'))

### <font color = Green > Insights: </font>


In [None]:
y_test.head()

In [None]:
# mapping labels to 0 and 1 in y_test
y_test = y_test.map({'NEGATIVE':0, 'POSITIVE':1})

In [None]:
# creating an ROC curve
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)

In [None]:
# area under the curve
print(roc_auc)

### <font color = Green > Insights: </font>


In [None]:
# matrix of thresholds, tpr, fpr
pd.DataFrame({'Threshold': thresholds, 
              'TPR': true_positive_rate, 
              'FPR':false_positive_rate
             })

### Plotting the ROC Curve

An ROC curve demonstrates several things:

- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
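
To show where the points on the curve come from, here is a minimal pure-Python sketch that sweeps a threshold over predicted probabilities and integrates the curve with the trapezoid rule. The labels `y_true` and the probabilities `scores` are hypothetical, and the helper names `roc_points` and `auc_trapezoid` are illustrative. In the notebook itself, `roc_curve` and `auc` from scikit-learn do this work.

```python
# Hypothetical true labels (1 = positive) and predicted positive-class probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.6]

def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from the strictest threshold to the loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print("AUC:", auc_trapezoid(points))
```

Lowering the threshold never decreases TPR or FPR, which is why the curve runs from (0, 0) to (1, 1) and why gaining sensitivity costs specificity.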

In [None]:
# plotting the ROC curve
%matplotlib inline  
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.plot(false_positive_rate, true_positive_rate)

### <font color = Green > Insights: </font>

The area under the curve is 0.93, which is good.

**Note: A good model is one in which TPR is high (close to 100%) and FPR is low (close to 0%), so the two need to be balanced.**

TPR and FPR are nothing but sensitivity and (1 - specificity), so it can also be looked at as a tradeoff between sensitivity and specificity.

# Displaying a Business Use Case

In [None]:
# Creating a new column from Predictions from model

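# work on a copy so that adding a column does not trigger pandas' SettingWithCopyWarning
X_test = X_test.copy()
X_test['sentiment_predicted'] = y_pred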

In [None]:
# Assume you have launched a product in the market and want to see the sentiment towards it.

temp = X_test.loc[X_test['brand']=="Motorola"].reset_index(drop=True)

In [None]:
# Visualizing the sentiments for the product

# the column holds string labels, so plot the label counts as a bar chart
temp['sentiment_predicted'].value_counts().plot(kind='bar')

### <font color = Green > Insights: </font>


In [None]:
negative_df = temp.loc[temp['sentiment_predicted']=='NEGATIVE'].reset_index(drop=True)


In [None]:
negative_df.head()

In [None]:
from wordcloud import WordCloud

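# use a new name so the `text` dataframe used in earlier cells is not overwritten
negative_text = " ".join(str(review) for review in negative_df.reviewText)


wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", stopwords=stop_words).generate(negative_text)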
plt.figure(figsize=[15,10])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show();

In [None]:
# Also, let's assume you want to see the split of positive and negative reviews that a particular reviewer has posted on Amazon

temp1 = X_test.loc[X_test['reviewerName']=="Daniel"].reset_index(drop=True)

In [None]:
# Visualizing the sentiments for a particular Reviewer 

# the column holds string labels, so plot the label counts as a bar chart
temp1['sentiment_predicted'].value_counts().plot(kind='bar')