# Exploratory Data Analysis

We will be exploring the Amazon Rating Data obtained in this notebook from a perspective of being able to better qualify the sentiment expressed by the reviewer in their reviews outside of just relying on the Rating provided. As a part of the exploratory analysis, we will 

<ul>
    <li>
        <b>Data Engineering:</b> How can we reduce the noise by removing unwanted features, imputing missing values, deriving new features that might be relevant for the purpose, sampling the data, etc
    </li>
    <li>
        <b>Feature engineering:</b> Can we leverage combination of attributes, perform text pre processing to obtain relevant tokens that can help us get more context on the sentiment  
    </li>
</ul>

In [1]:
!pip install --upgrade pip
!pip install nltk
!pip install contractions
!pip install inflect
!pip install numpy 
!pip install scikit-learn 
!pip install gensim
!pip uninstall -y tensorflow
!pip install torch
!pip install transformers



In [2]:
# Set up the notebook to import modules from relative paths
import os, sys

#'/home/user/example/parent/child'
current_path = os.path.abspath('.')

#'/home/user/example/parent'
parent_path = os.path.dirname(current_path)

sys.path.append(parent_path)

In [3]:
from transformers import pipeline

# Specify the model
model_id = "cardiffnlp/twitter-roberta-base-sentiment-latest"

sentiment_pipe = pipeline("sentiment-analysis", model=model_id)
print(sentiment_pipe('I hate you'))

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'label': 'negative', 'score': 0.7866930365562439}]


In [4]:
import pandas as pd
import numpy as np
import sklearn
from IPython.display import display, HTML

# Display Properties
from IPython.display import display, HTML
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.colheader_justify', 'center')
pd.set_option('display.precision', 2)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

We will first load the dataset, take a peek at sample values for all the data and also explore the Data types

In [5]:
# Initialize the reviews
base_dir = "/Users/shaileshhemdev/ai/ai-enabledsystems/workspace"
path = base_dir + "/amazon_movie_reviews.csv"

# Read the file with 5 years worth of data
df = pd.read_csv(path)
df.head()

  df = pd.read_csv(path)


Unnamed: 0.1,Unnamed: 0,rating,review_title,text,images_x,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,main_category,movie_title,subtitle,average_rating,rating_number,features,description,price,images_y,videos,store,categories,details,bought_together,author
0,0,5.0,Five Stars,"Amazon, please buy the show! I'm hooked!",[],B013488XFS,B013488XFS,AGGZ357AO26RQZVRLGU4D4N52DZQ,1440385637000,0,True,Prime Video,Sneaky Pete,Ads BadgeAds Badge,4.6,56658.0,"['IMDb 8.1', '2017', '10 episodes', 'X-Ray', '...",['A\xa0con man (Giovanni Ribisi) on the run fr...,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,"['Suspense', 'Drama']","{'Content advisory': ['Nudity', 'violence', 's...",,
1,1,5.0,Five Stars,My Kiddos LOVE this show!!,[],B00CB6VTDS,B00CB6VTDS,AGKASBHYZPGTEPO6LWZPVJWB2BVA,1461100610000,0,True,Prime Video,Creative Galaxy,Season 1,4.8,6403.0,"['2014', '13 episodes', 'X-Ray', 'ALL']",['Follow the adventures of Arty and his sideki...,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,['Kids'],{'Audio languages': ['English Dialogue Boost: ...,,
2,2,3.0,Some decent moments...but...,Annabella Sciorra did her character justice wi...,[],B096Z8Z3R6,B096Z8Z3R6,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,1646271834582,0,True,Prime Video,,,3.9,182.0,,,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,,"{'Content advisory': ['Violence', 'substance u...",,
3,3,4.0,"Decent Depiction of Lower-Functioning Autism, ...",...there should be more of a range of characte...,[],B09M14D9FZ,B09M14D9FZ,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,1645937761864,1,False,Prime Video,,,4.8,533.0,,,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,,"{'Content advisory': ['Violence', 'alcohol use...",,
4,4,5.0,What Love Is...,"...isn't always how you expect it to be, but w...",[],B001H1SVZC,B001H1SVZC,AG2L7H23R5LLKDKLBEF2Q3L2MVDA,1590639227074,0,True,Prime Video,,,4.5,389.0,,,,[{'360w': 'https://images-na.ssl-images-amazon...,[],,,"{'Subtitles': ['None available'], 'Directors':...",,


In [6]:
## View the Types
print(df.dtypes)

Unnamed: 0             int64
rating               float64
review_title          object
text                  object
images_x              object
asin                  object
parent_asin           object
user_id               object
timestamp              int64
helpful_vote           int64
verified_purchase       bool
main_category         object
movie_title           object
subtitle              object
average_rating       float64
rating_number        float64
features              object
description           object
price                 object
images_y              object
videos                object
store                 object
categories            object
details               object
bought_together      float64
author                object
dtype: object


Next we will look at missing values to see if fields with high missing values

In [7]:
# Then let's look at missing values 
print(df.isna().sum())

Unnamed: 0                 0
rating                     0
review_title              70
text                      78
images_x                   0
asin                       0
parent_asin                0
user_id                    0
timestamp                  0
helpful_vote               0
verified_purchase          0
main_category           9351
movie_title           542292
subtitle              979884
average_rating            10
rating_number             10
features              542290
description           542290
price                 611251
images_y                   0
videos                     0
store                 610148
categories            542290
details                    0
bought_together      1000000
author                999882
dtype: int64


With the above analysis on the columns which have missing values, we can remove the following columns since they have a very high degree of missing values hence unlikely that we would be able to impute or make meaningful generalizations on the basis of these fields

<ul>
    <li>movie_title</li>
    <li>subtitle</li>
    <li>price</li>
    <li>store</li>
    <li>bought_together</li>
    <li>author</li>
    <li>features</li>
</ul>

We will hold on to following columns with high missing values since they might actually provide us with some useful context on the sentiment

<ul>
    <li>categories</li>
    <li>description</li>
</ul>

In [8]:
# Get the total records
total_records = len(df)

# Get unique values for main category
unique_main_category = df['main_category'].nunique() 

# Get unique values for user id
unique_user_id = df['user_id'].nunique() 

# Unique values for the features
unique_features = df['features'].nunique() 

# Get Unique Movie Titles
unique_movie_title = df['movie_title'].nunique()

# Get Unique Movie Sub titles
unique_subtitle = df['subtitle'].nunique()

# Get Unique details
unique_details = df['details'].nunique()

# Get Unique asin
unique_asin = df['asin'].nunique()

# Get Unique parent asin
unique_parent_asin = df['parent_asin'].nunique()

# Get unique rating_number
unique_rating_number = df['rating_number'].nunique()

# Get unique categories
unique_categories = df['categories'].nunique()

# Get unique average rating
unique_average_rating = df['average_rating'].nunique()

In [9]:
unique_stats = np.column_stack((total_records, unique_main_category, unique_user_id, unique_features, 
                               unique_movie_title, unique_subtitle , unique_details, unique_asin, 
                                unique_parent_asin, unique_rating_number, unique_categories, unique_average_rating))
unique_stats_df = pd.DataFrame(unique_stats, columns = ['Total Reviews','Main Category',
                                                        'Users','Features', 
                                                        'Movie Titles','Sub Titles',
                                                        'Details','Asin', 'Parent Asin',
                                                        'Rating Number', 'Categories', 'Average Rating'])
display(HTML(unique_stats_df.to_html()))

Unnamed: 0,Total Reviews,Main Category,Users,Features,Movie Titles,Sub Titles,Details,Asin,Parent Asin,Rating Number,Categories,Average Rating
0,1000000,25,177374,11749,102620,83,175306,211352,211342,18561,5884,41


From the above, we can see that the following columns are high cardinality i.e. a lot of values 

<ul>
    <li>user_id</li>
    <li>asin</li>
    <li>parent_asin</li>
    <li>rating_number</li>
</ul>

Similarly since the focus of our analysis is to get to the Sentiment from actual review and any supporting elements, the following attributes are also not likely to aid in our analysis

<ul>
    <li>images_x</li>
    <li>images_y</li>
    <li>videos</li>
    <li>timestamp</li>
</ul>

In [10]:
# Keep back up for analysis
df_bak = df.copy(deep=True)

# Drop the columns we have concluded as not being meaningful
cols_to_drop = ['user_id','asin','parent_asin','movie_title','subtitle','rating_number','price','bought_together',
                'store','images_y','videos','author','images_x','timestamp','features']
df = df.drop(columns=cols_to_drop,errors='ignore')

df.head()

Unnamed: 0.1,Unnamed: 0,rating,review_title,text,helpful_vote,verified_purchase,main_category,average_rating,description,categories,details
0,0,5.0,Five Stars,"Amazon, please buy the show! I'm hooked!",0,True,Prime Video,4.6,['A\xa0con man (Giovanni Ribisi) on the run fr...,"['Suspense', 'Drama']","{'Content advisory': ['Nudity', 'violence', 's..."
1,1,5.0,Five Stars,My Kiddos LOVE this show!!,0,True,Prime Video,4.8,['Follow the adventures of Arty and his sideki...,['Kids'],{'Audio languages': ['English Dialogue Boost: ...
2,2,3.0,Some decent moments...but...,Annabella Sciorra did her character justice wi...,0,True,Prime Video,3.9,,,"{'Content advisory': ['Violence', 'substance u..."
3,3,4.0,"Decent Depiction of Lower-Functioning Autism, ...",...there should be more of a range of characte...,1,False,Prime Video,4.8,,,"{'Content advisory': ['Violence', 'alcohol use..."
4,4,5.0,What Love Is...,"...isn't always how you expect it to be, but w...",0,True,Prime Video,4.5,,,"{'Subtitles': ['None available'], 'Directors':..."


We are now left with 

In [11]:
import json
test = df.iloc[1]["details"]
test = test.replace("'",'"')

print(json.loads(test))

{'Audio languages': ['English Dialogue Boost: High', 'English', 'English [Audio Description]', 'English Dialogue Boost: Medium', 'English Dialogue Boost: Low', 'Italiano', '한국어', '日本語', 'العربية', 'Português', 'Nederlands', 'Deutsch', 'Русский', 'हिन्दी', 'Español (España)', 'Indonesia', 'Español (Latinoamérica)', 'Türkçe', '中文（台灣）', '中文（中国）', 'Français'], 'Subtitles': ['English [CC]', 'العربية', 'Deutsch', 'Español (Latinoamérica)', 'Español (España)', 'Français', 'हिन्दी', 'Indonesia', 'Italiano', 'Italiano [CC]', '日本語', '한국어', 'Nederlands', 'Português', 'Русский', 'Türkçe', '中文（简体）', '中文（繁體）'], 'Directors': ['Larry Jacobs'], 'Producers': ['Out of the Blue Enterprises', '9 Story Entertainment Inc'], 'Starring': ['Christian Distefano', 'Kira Gelineau', 'Samantha Bee']}


In [12]:
import json

# Parse the Category
def parse_categories(category):
    if (category == ''):
        return category
    else:
        try:
            start = category.index('[') + 1
            end   = category.index(']') 
            elems = category[start:end]
            result = elems.replace("'","")
            result = result.replace(","," ")
            return result
        except:
            return category

def parse_details(details):
    if (details == ''):
        return details
    else:
        result = details.replace("'",'"')
        tags = ''
        try:
            res = json.loads(result)

            if 'Content advisory' in res.keys():
                content_advisory = res["Content advisory"]
                tags = tags + " ".join(content_advisory)
            
            if 'Genre' in res.keys():
                genre = res["Genre"]
                tags = tags + " ".join(genre)
        except:
            tags = ''
        
        return tags

df['categories'].fillna('', inplace=True)
df['categories'] = df['categories'].apply(lambda x: parse_categories(x))

df['details'].fillna('', inplace=True)
df['tags'] = df['details'].apply(lambda x: parse_details(x))
df.head()

Unnamed: 0.1,Unnamed: 0,rating,review_title,text,helpful_vote,verified_purchase,main_category,average_rating,description,categories,details,tags
0,0,5.0,Five Stars,"Amazon, please buy the show! I'm hooked!",0,True,Prime Video,4.6,['A\xa0con man (Giovanni Ribisi) on the run fr...,Suspense Drama,"{'Content advisory': ['Nudity', 'violence', 's...",Nudity violence substance use alcohol use smok...
1,1,5.0,Five Stars,My Kiddos LOVE this show!!,0,True,Prime Video,4.8,['Follow the adventures of Arty and his sideki...,Kids,{'Audio languages': ['English Dialogue Boost: ...,
2,2,3.0,Some decent moments...but...,Annabella Sciorra did her character justice wi...,0,True,Prime Video,3.9,,,"{'Content advisory': ['Violence', 'substance u...",Violence substance use foul language sexual co...
3,3,4.0,"Decent Depiction of Lower-Functioning Autism, ...",...there should be more of a range of characte...,1,False,Prime Video,4.8,,,"{'Content advisory': ['Violence', 'alcohol use...",Violence alcohol use foul language sexual content
4,4,5.0,What Love Is...,"...isn't always how you expect it to be, but w...",0,True,Prime Video,4.5,,,"{'Subtitles': ['None available'], 'Directors':...",


In [13]:
# Group reviews by rating
rating_group_df = df[["rating","verified_purchase"]].groupby(['rating','verified_purchase'])
rating_agg_group_df = rating_group_df[['verified_purchase']].agg('sum')
rating_agg_group_df = rating_group_df[['rating']].count()
rating_agg_group_df

Unnamed: 0_level_0,Unnamed: 1_level_0,rating
rating,verified_purchase,Unnamed: 2_level_1
1.0,False,20182
1.0,True,46583
2.0,False,13399
2.0,True,34783
3.0,False,20910
3.0,True,68020
4.0,False,36217
4.0,True,129690
5.0,False,99485
5.0,True,530731


In [14]:
# Group reviews by rating
rating_group_df = df[["rating","main_category"]].groupby(['rating','main_category'])
rating_agg_group_df = rating_group_df[['main_category']].count()
rating_agg_group_df = rating_group_df[['rating']].count()
rating_agg_group_df

Unnamed: 0_level_0,Unnamed: 1_level_0,rating
rating,main_category,Unnamed: 2_level_1
1.0,AMAZON FASHION,4
1.0,Amazon Home,1
1.0,Books,15
1.0,Computers,2
1.0,Digital Music,3
1.0,Entertainment,1
1.0,Health & Personal Care,1
1.0,Movies & TV,15415
1.0,Musical Instruments,1
1.0,Prime Video,50919


In [15]:
video_df = df[df['main_category'].isin(["Prime Video","Movies & TV"])] 
video_df.tail()

Unnamed: 0.1,Unnamed: 0,rating,review_title,text,helpful_vote,verified_purchase,main_category,average_rating,description,categories,details,tags
999995,999995,5.0,Five Stars,You can't go wrong with Martin.,0,True,Movies & TV,4.8,['Martin: The Complete Fourth Season (RPKG/DVD...,Movies & TV Featured Categories DVD Comedy,"{'Genre': 'Comedy', 'Format': 'Color, Dolby, F...",C o m e d y
999996,999996,5.0,Five Stars,You can't go wrong with Martin.,0,True,Movies & TV,4.9,"[""Heeeey! Comic superstar Martin Lawrence (Bad...",Movies & TV Featured Categories DVD Comedy,"{'Genre': 'Comedy', 'Format': 'Color, Dolby, N...",C o m e d y
999997,999997,4.0,Predictable ending but good action,Good pace of action and good characters. Plot...,1,True,Prime Video,4.2,,,"{'Content advisory': ['Violence', 'substance u...",Violence substance use alcohol use smoking fou...
999998,999998,4.0,pretty decent ww2 flick,i watched th whole thing so that is one criter...,1,True,Prime Video,3.1,,,"{'Content advisory': ['Violence', 'foul langua...",Violence foul language drug use
999999,999999,1.0,weak. very weak,this starts out somewhat interesting but fades...,6,True,Prime Video,4.5,,,"{'Content advisory': ['Nudity', 'violence', 's...",Nudity violence substance use alcohol use smok...


In [16]:
details_df = df[["details"]].copy(deep=True)

In [17]:
print(details_df.iloc[1]["details"])

{'Audio languages': ['English Dialogue Boost: High', 'English', 'English [Audio Description]', 'English Dialogue Boost: Medium', 'English Dialogue Boost: Low', 'Italiano', '한국어', '日本語', 'العربية', 'Português', 'Nederlands', 'Deutsch', 'Русский', 'हिन्दी', 'Español (España)', 'Indonesia', 'Español (Latinoamérica)', 'Türkçe', '中文（台灣）', '中文（中国）', 'Français'], 'Subtitles': ['English [CC]', 'العربية', 'Deutsch', 'Español (Latinoamérica)', 'Español (España)', 'Français', 'हिन्दी', 'Indonesia', 'Italiano', 'Italiano [CC]', '日本語', '한국어', 'Nederlands', 'Português', 'Русский', 'Türkçe', '中文（简体）', '中文（繁體）'], 'Directors': ['Larry Jacobs'], 'Producers': ['Out of the Blue Enterprises', '9 Story Entertainment Inc'], 'Starring': ['Christian Distefano', 'Kira Gelineau', 'Samantha Bee']}


In [18]:
df1 = df[["rating","review_title","text","helpful_vote","verified_purchase"]]
df1.head()

Unnamed: 0,rating,review_title,text,helpful_vote,verified_purchase
0,5.0,Five Stars,"Amazon, please buy the show! I'm hooked!",0,True
1,5.0,Five Stars,My Kiddos LOVE this show!!,0,True
2,3.0,Some decent moments...but...,Annabella Sciorra did her character justice wi...,0,True
3,4.0,"Decent Depiction of Lower-Functioning Autism, ...",...there should be more of a range of characte...,1,False
4,5.0,What Love Is...,"...isn't always how you expect it to be, but w...",0,True


We will now apply a model to it using Large Language Models

In [None]:
from data_pipeline import Text_Pipeline

# Initialize various tools
text_pipeline = Text_Pipeline('CONVERT')

# Function for the sentiment
def analyze_sentiment(text):
    result = sentiment_pipe(text)
    return result[0]['label']

# Get the reviews
amazon_reviews = df1['text'].values.tolist()
total_size = len(amazon_reviews)

# Obtain sentiment scores for all the reviews
sentiment_scores = []
end = 100
for i in range(0,total_size,100):
    # Create a Pandas series 
    s = pd.Series(amazon_reviews[i:end]) 

    # Obtain pre processed series
    preprocessed_series = text_pipeline.preprocess(s)

    # Get reviews
    reviews = preprocessed_series.values.tolist()
    
    # Analyze sentiment for each review
    sentiments = [analyze_sentiment(review[:2000]) for review in reviews]
    print(len(sentiments))
    #sentiments = analyze_sentiment(amazon_reviews[2000])
    sentiment_scores += sentiments
    end += 100

print(len(sentiment_scores))

In [None]:
print(len(sentiment_scores))
mapped_sentiments = {'positive':2, 'neutral':1, 'negative':0}
predicted_classes = [mapped_sentiments[s] for s in sentiment_scores]

In [None]:
# Map ratings to sentiments
sentiment_classes = {5 : 2, 4 : 2, 3 : 1, 2 : 0, 1 : 0} 

# Get the rows we were able to process
df2 = df1.iloc[0:len(sentiment_scores),:].copy(deep=True)
df2["class"] = df2["rating"].map(sentiment_classes) 
df2["predicted_class"] = pd.Series(predicted_classes)
df2[["text","rating","class","predicted_class"]].head()

In [None]:
out_df = df2[["text","rating","class","predicted_class"]].copy(deep=True)
out_df.to_csv('sentiments-twitter-model.csv', index=False) 

In [None]:
from data_pipeline import Text_Pipeline

# Initialize various tools
text_pipeline = Text_Pipeline('CONVERT')

# Obtain pre processed series
sample_text = list(df2.iloc[4:5,:]['text'])
len(sample_text)
preprocessed_series = text_pipeline.preprocess(pd.Series(sample_text))

sentiments = [analyze_sentiment(review) for review in preprocessed_series.values.tolist()]
print(sentiments)

In [None]:
print(sentiment_scores)