# The Last of Us Reviews: a starter analysis

![ellie!](https://www.hdwallpapers.in/download/the_last_of_us_part_ii_ellie-1366x768.jpg)

Natural language processing (NLP) is the branch of artificial intelligence concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. In fact, machines today can analyze more language-based data than humans, without fatigue and in a consistent, unbiased way ([SAS Insights](https://www.sas.com/en_us/insights/analytics/what-is-natural-language-processing-nlp.html#close)). The projects in this field show that these tools keep improving and powering impressive applications. One of the first steps in the field is sentiment analysis, which usually means building a classification model. It transforms large-scale unstructured text into structured, quantitative measurements in order to identify whether the sentiment behind a piece of text is positive, negative, or neutral.

On the other hand, The Last of Us Part II was the second-best-rated and the most discussed game of last year ([Metacritic](https://www.metacritic.com/game/playstation-4/the-last-of-us-part-ii)), and it drew a lot of conflicting opinions. I was surprised by how divisive this video game turned out to be: the reviews are either very positive or very negative. If you visit the page above, you will see a large number of user reviews and an unusual distribution of user scores. So, is there a way to analyze it in depth? Why are the bad reviews so bad, and why are the good reviews so good? In this notebook, I try to build a classification model from the reviews on that page.

First of all, let's look at how the game was presented (more info in the [PlayStation Store](https://www.playstation.com/en-us/games/the-last-of-us-part-ii/)):

<div class="alert alert-block alert-success">  
Five years after their dangerous journey across the post-pandemic United States, Ellie and Joel have settled down in Jackson, Wyoming. Living amongst a thriving community of survivors has allowed them peace and stability, despite the constant threat of the infected and other, more desperate survivors. When a violent event disrupts that peace, Ellie embarks on a relentless journey to carry out justice and find closure. As she hunts those responsible one by one, she is confronted with the devastating physical and emotional repercussions of her actions.
    <ul>
    <li>A complex and emotional story</li>
<li>Tense and desperate action-survival gameplay</li>
<li>A beautiful yet dangerous world</li>
    </ul>
</div>

Interesting. It is the sequel to one of the best PS4 video games, with a promising story and Ellie as a more complex character. No less important, it is a post-pandemic story released in these times. Personally, I thought the game was great: it is very realistic and has incredible graphics. However, not all users agree. On the other side, some people hate this game; they say it has a bad story, that the characters make no sense, and more. I understand that point of view too, though not in its extreme form.

Anyway, opinions differ widely, which makes proposing any model a challenge. It would also be possible to build an interactive dashboard or app; I will probably add one in an update. For the model, I want something like the GIF below, with good predictive power. So, let's get started.

P.S.: Obviously, this work contains spoilers. If you don't know the story of The Last of Us (mainly the first game), go play it!

![sgif!](https://coinerblog.com/wp-content/uploads/2019/05/Can-We-Represent-Emotions-Using-Machine-Learning.gif)

-- --
Hello everybody! 👋 Welcome to this starter notebook. Here I only use the user reviews for the second game. The Kaggle dataset contains reviews for the whole saga, and you can find it [here](https://www.kaggle.com/lazaro97/the-last-of-us-reviews). The data was collected through web scraping; more info [here](https://github.com/Lazaro-97/Web-Scraping-Project). If you are interested in this project, don't hesitate to propose any notebook or topic. I'll read everything!

P.S.: Don't forget to visit https://www.kaggle.com/lazaro97/the-last-of-us-reviews

In [None]:
!pip install langdetect
!pip install nrclex
!pip install spacy
!pip install plotly
#general packages
import pandas as pd
import numpy as np
import seaborn as sns
from collections import Counter,OrderedDict
#plots and visualizations
from matplotlib import rc
import matplotlib.pyplot as plt
from matplotlib_venn import venn2,venn3
%matplotlib inline
from mpl_toolkits.axes_grid1.inset_locator import mark_inset,zoomed_inset_axes
import altair as alt
import plotly.graph_objs as go
import plotly.offline as pyo
pyo.init_notebook_mode() # Set notebook mode to work in offline
from plotly.subplots import make_subplots
import plotly.express as px
from plotly.colors import n_colors
#date preprocessing
from datetime import datetime
#text preprocessing
import spacy
#from spacy.matcher import PhraseMatcher
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer,SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams
from nltk import word_tokenize
from string import punctuation, digits
from textblob import Word 
from wordcloud import WordCloud
from langdetect import detect
from nrclex import NRCLex
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer,TfidfTransformer
#modelling
from sklearn.linear_model import LogisticRegression,RidgeClassifier
from sklearn.model_selection import train_test_split,StratifiedKFold,KFold
from sklearn.metrics import plot_confusion_matrix,roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.cluster import DBSCAN,KMeans
from sklearn.decomposition import PCA


In [None]:
#GET DATA
#list of reviews
dat=pd.read_csv('../input/the-last-of-us-reviews/user_reviews_g2.csv')
#specific stopwords
vg_stopwords=['videogame','game','play','player','character','tlou','tlou2', 'neil','druckmann','naughty','dog','nc',
 'aaa', 'na','youtube', 'youtuber','f','h','ng']
#lexicons
afinn=pd.read_csv('../input/sentiment-lexicons-for-text-mining/afinn.csv',encoding='iso-8859-1',index_col=0)
bing=pd.read_csv('../input/sentiment-lexicons-for-text-mining/bing.csv',index_col=0)
nrc=pd.read_csv('../input/sentiment-lexicons-for-text-mining/nrc.csv',index_col=0)
loug=pd.read_csv('../input/sentiment-lexicons-for-text-mining/loughran.csv',index_col=0)

# Exploratory analysis

In [None]:
#CLEANING
#Replace null values
#sns.heatmap(dat.isna().T,cbar=False,cmap="rocket",xticklabels=False) #This set doesn't have null values 
dat.fillna('',inplace=True)
#Create some variables
#general features
dat['sug']=dat.votes/dat.views
#possible Target
def new_label(x): 
    if x>7: return 'Excellent'
    elif x<3: return 'Bad'
    else: return 'Ok'
dat['label']=dat.score.apply(new_label)
#datetime features
dat['date']=dat.date.astype('datetime64[ns]')
dat['month']=dat.date.apply(lambda x: x.strftime("%B"))
dat['day_of_month']=dat.date.apply(lambda x: x.strftime("%d"))
dat['day_of_week']=dat.date.apply(lambda x: x.strftime("%A"))
#text features
dat['review_length'] = dat.review.apply(len)
dat.head()

In [None]:
#Summary of reviews
temp=pd.DataFrame(dat.score.value_counts())
temp.sort_index(inplace=True)
temp.columns=['Number of reviews']
clrs=['red' if (x > 5000) else 'lightcoral' for x in temp['Number of reviews'] ] #highlight bars
sns.countplot(data=dat,x='score',palette=clrs) 
print(temp.T)
print('Mean score: ', dat.score.mean())

* Opinions on this game are strongly polarized: see the two tall bars at the extremes of the graph.
* In general, a score distribution like this should not lean to either side and should carry little weight at the extremes.

In [None]:
#Analyze reviews for each score
for i in range(11):
    temp=dat[(dat.score==10-i) & (dat.type_review=='normal')].review.iloc[np.random.randint(1,100)] #For each score, choose a random review
    print('*'*100)
    print(f'SCORE {10-i} ------- {temp}')

This way you can read 11 random reviews, one per score, and get a general idea of what users wrote. I don't know which reviews you will see, but in my case I think they can be classified as follows:
* The good reviews say that everyone in that universe is a bad person and that it is not right not to feel sad for Abby. Others say that the gameplay and graphics are enough.
* The ok reviews encourage you to buy the game but say that the story needs improvement. There is also more nostalgia.
* The bad reviews demand that the second game be the same as the first, and say the first is much better than this sequel. They also claim that the 'inclusion' is not the problem; the problem is how forced the story feels.

I read some reviews with bad scores, and their arguments sound logical.

![zombie](https://th.bing.com/th/id/R2da419735dbba740609458b9da22e34e?rik=zG%2bMDaGEd6vKbg&riu=http%3a%2f%2fwww.reviewstl.com%2fwp-content%2fuploads%2f2013%2f07%2fThe-Last-of-Us-Zombies.jpg&ehk=lvawsmdk6rqABLUYEWUIqKwyq5Tpi6NGKp1UHC8i3nU%3d&risl=&pid=ImgRaw)

In [None]:
#Analyze the user's score over the time
def plot_line(var,txt):
    '''
    usage: plot some filter of the dataframe
    '''
    temp=Counter(dat[dat.label==var].date)
    plt.plot(list(temp.keys()),list(temp.values()),color=txt)
def subplot_line(var,txt):
    '''
    usage: plot some filter of the dataframe; but in this case, over the axes
    '''
    temp=Counter(dat[(dat.date>'2020-09-25')&(dat.label==var)].date)
    sub_axes.plot(list(temp.keys()),list(temp.values()),color=txt) 
#Define the main figure
fig, ax=plt.subplots(figsize=(8,5))
plot_line('Excellent','blue')
plot_line('Ok','green')
plot_line('Bad','red')
#Add sections
ax.axvspan('2020-06-17', '2020-06-23', alpha=0.1, color='red') #Early reviews. 06-19 was the release date; I assume 4 days to finish the game
ax.axvspan('2020-06-23', '2020-09-23', alpha=0.1, color='orange') #Middle reviews. I don't know the mean time to buy and finish the game; it took me three months, so I assume one to two months for a typical player
ax.axvspan('2020-09-23', str(dat.date.max()), alpha=0.09, color='yellow') #Latest reviews
# ax.text('2020-06-21',1800,'Early\n reviews', fontsize=7,color='red')
# ax.text('2020-07-15',1700,'Mean\n reviews', fontsize=14,color='orange')
# ax.text('2020-09-25',1700,'Latest\n reviews', fontsize=14,color='black')
# #Add common layers
plt.xlabel('Date');plt.ylabel('Number of reviews')
ax.legend(['Excellent','Ok','Bad'])
#Add a zoom portion 
sub_axes =plt.axes([.55, .3, .28, .25]) 
subplot_line('Excellent','blue')
subplot_line('Ok','green')
subplot_line('Bad','red')
plt.setp(sub_axes,xticks=[], yticks=[])
#Draw the connection between plots
mark_inset(ax, sub_axes, loc1=2, loc2=4, fc="none", ec="0.1")

* Interesting. Most of the bad reviews are very early. The game takes 25 to 30 hours to finish, so did they actually play it? Maybe they did not want to keep playing, which is quite possible and understandable. It is also possible that they finished the game in a couple of days.
* Also, if you zoom in on the latest reviews, you can see a more normal pattern in which good reviews outnumber the others.
* About the graph: the first section covers the early reviews, the second the middle-period reviews, and the final section the latest reviews.

In [None]:
# DATE AND SCORE
early=dat[dat.date<'2020-06-20'].index
bad=dat[dat.score==0].index
last=dat[dat.date>'2020-09-25'].index
good=dat[dat.score==10].index
fig,ax=plt.subplots(figsize=(12,8),nrows=1,ncols=2)
venn3([set(early),set(last),set(bad)], set_labels = ('Early', 'Latest','Score 0'),ax=ax[0])
venn3([set(early),set(last),set(good)], set_labels = ('Early', 'Latest','Score 10'),ax=ax[1])

In [None]:
#DATE, SCORE AND WEIRD REVIEWS
weird=dat[(dat.sug>0.7)& (dat.views<10)].index
fig,ax=plt.subplots(figsize=(12,8),nrows=1,ncols=2)
venn3([set(early),set(weird),set(bad)], set_labels = ('Early review', 'Weird\n review','Bad sentiment'),ax=ax[0])
venn3([set(early),set(weird),set(good)], set_labels = ('Early review', 'Weird\n review','Good sentiment'),ax=ax[1])

* A review that is old, has few views, and yet has many votes relative to those views is weird. The intersection between *early* and *weird* is a *possible bot*. This is only a hypothesis; I did not find studies that use this step.
* **In the first graph** the association between sentiment and date is clearer.
* **In the second graph** look at the intersection of the three sets in the two images. The bad sentiment has more possible bot reviews (100 versus 2).

In [None]:
#Distribution of the languages
temp=dat.language.value_counts().iloc[:5] #Only the languages with more reviews
temp.index=['English', 'Spanish', 'Portuguese', 'Russian','Others']
temp.iloc[4]=dat.language.value_counts().iloc[4:].sum()
my_colors = [ 'red','#00008b','#483d8b', "#3498db", "#95a5a6","#34495e"]
plt.figure(figsize = (6,6))
plt.pie(temp.values,labels=temp.keys(),autopct = '%1.1f%%',colors=my_colors,
        explode=[0.1,0,0,0,0],startangle=45,shadow=True,wedgeprops = {'linewidth': 5})
plt.show()

In [None]:
#Only the english reviews
dat=dat[dat.language=='English'].reset_index(drop=True)
#Translate and use all reviews
# gs = goslate.Goslate()
# dat[dat.language!='English'].review=dat[dat.language!='English'].review.apply(lambda x: gs.translate(x,'en'))

In [None]:
fig, ax=plt.subplots(figsize=(8,5))
sns.histplot(data=dat,x='review_length',fill=True,hue='type_review')
sns.histplot(data=dat,x='review_length',color='black',alpha=0.1,fill=True)
plt.axvline(410,color='red') #This line marks the review length limit of the web page
plt.xlabel('Length of review');plt.ylabel('Number of reviews') #x is review_length, y is the count

# Text preprocessing
* Converting all letters to lower or upper case
* Converting numbers into words or removing numbers
* Removing punctuations, accent marks and other diacritics
* Removing white spaces
* Expanding abbreviations
* Removing stop words, sparse terms, and particular words
* Text canonicalization

More details in this [blog](https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908).
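
As a rough illustration of the steps listed above, here is a minimal sketch that uses only the Python standard library. The function name `basic_clean` and the tiny stop-word set `TOY_STOPWORDS` are hypothetical, made up for this example; they are not the preprocessing used later in this notebook. Abbreviation expansion and canonicalization (for instance lemmatization) are left out.

```python
import re
import string
import unicodedata

# Hypothetical mini stop-word list, for illustration only
TOY_STOPWORDS = {"the", "a", "is", "and", "but", "i", "it"}

def basic_clean(text, stopwords=TOY_STOPWORDS):
    """Apply the basic cleaning steps in order and return a list of tokens."""
    text = text.lower()                          # lower case
    text = re.sub(r"\d+", " ", text)             # remove numbers
    text = unicodedata.normalize("NFKD", text)   # split accents from their letters
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accent marks
    text = text.translate(str.maketrans("", "", string.punctuation))    # drop punctuation
    tokens = text.split()                        # split() also collapses extra white space
    return [t for t in tokens if t not in stopwords]  # remove stop words

cleaned = basic_clean("The story is  AMAZING, but Elli\u00e9 deserved better!! 10/10")
print(cleaned)
```

The order matters a little: lowercasing first lets the stop-word lookup work with a lowercase list, and removing punctuation before splitting keeps tokens such as "amazing," from surviving with a trailing comma.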

![world](https://multiplayer.net-cdn.it/thumbs/images/2020/06/21/the-last-of-us-part-ii-seattle-guida_jpg_750x400_crop_upscale_q85.jpg)

In [None]:
#A first model
# dat=dat[dat.label!='Ok'] #Remove the neutral review
# dat.reset_index(inplace=True,drop=True)
dat['target']=dat.score.apply(lambda x: 0 if x<5 else 1)  #Another option is to use only score 0 and score 10
print(f'Dimensions of data: {dat.shape}')
x_train,x_test,y_train,y_test = train_test_split(dat.review,dat.target, test_size=0.2, random_state=60)
pipe = Pipeline([('vect', CountVectorizer(analyzer = "word",min_df=50,stop_words='english')),
                 ('tfidf', TfidfTransformer()),
                 ('model', LogisticRegression())]).fit(x_train, y_train)
prediction = pipe.predict(x_test)
#roc_auc_score expects (y_true, y_score), in that order
print(roc_auc_score(y_train,pipe.predict(x_train)))
print(roc_auc_score(y_test,prediction))
plot_confusion_matrix(pipe,x_test,y_test,cmap='rocket')
#That looks great. The model has good predictive power and shows no overfitting

In [None]:
nlp=English()
def preprocess_text(text): #This way you can define specific conditions (compared to CountVectorizer)
    '''
    input: a string
    output: the cleaned string
    usage: tokenize the string, remove stopwords and lemmatize. Allows extra conditions compared to CountVectorizer()
    '''
    list_words = nlp(text)
    sentence=[]
    #Remove stopwords
    for token in list_words: 
        val=token.text.lower()#to lower
        stop1=val in vg_stopwords #specific
        stop2=nlp.vocab[val].is_stop #general
        if (not stop2) and (not stop1): sentence.append(val)
    filtered_sentence =[]
    #Lemmatize
    for val in sentence:  
        word = Word(val).lemmatize("n")
        word = Word(word).lemmatize("v")
        word = Word(word).lemmatize("a")
        word=WordNetLemmatizer().lemmatize(word,'v')
        filtered_sentence.append(word) 
    final_sentence=[]
    #Remove some stopwords again
    for val in filtered_sentence:
        if not (val in vg_stopwords): final_sentence.append(val) #Some specific stopwords can appear again; example:gamer to game
    final_sentence=' '.join(final_sentence)
    #Remove sign and punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    final_sentence=tokenizer.tokenize(final_sentence)  
    return ' '.join(final_sentence)
dat['pre_review']=dat.review.apply(preprocess_text) 
#These steps were already done above by another package, but I want this information in objects. This also makes the algorithm visible
vectorizer=CountVectorizer(analyzer = "word",min_df=50,stop_words='english') #Only with the most repeated words
feature_matrix = vectorizer.fit_transform(dat.pre_review).toarray()
features=pd.DataFrame(feature_matrix, columns=vectorizer.get_feature_names())
# bow=features.sum(axis=0) Another way to define 1-gram and its count, bag of words

# WordCloud
A word cloud is a data visualization technique in which the words of a given text are displayed in a chart: the more important words are written in bigger, bolder fonts, while less important words are displayed in smaller, thinner fonts or not displayed at all.
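
To make the idea concrete, here is a small sketch that maps word frequency to font size with a plain linear scale. This is an assumption made for illustration; it is not how the `wordcloud` package computes sizes internally, and the name `font_sizes` and its parameters are hypothetical.

```python
from collections import Counter

def font_sizes(text, min_size=10, max_size=60, top=5):
    """Map the `top` most frequent words to a font size between min_size and max_size."""
    counts = Counter(text.lower().split()).most_common(top)
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = max(highest - lowest, 1)  # avoid division by zero when all counts are equal
    return {word: min_size + (max_size - min_size) * (n - lowest) / span
            for word, n in counts}

sizes = font_sizes("story story story ellie ellie graphics")
print(sizes)
```

The most frequent word gets `max_size` and the least frequent of the selected words gets `min_size`, which is why a word that appears everywhere dominates every cloud.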

Some important words that I removed because they appear in all reviews and showed no pattern across user scores:
- Good. All reviews contain "good", including the bad reviews. If you keep it, this word appears large in every word cloud.
- Story. I agree that it is the most controversial part of this game. The Last of Us (first game) earned its excellent reputation precisely for its story, so the sequel was expected to be awesome.
- Ellie, Joel, Abby. Similar reason: they are the main characters, and all users talk about them.
- Naughty Dog. The studio that created the video game. Here I saw a larger size when the score was 0, but it always appears.

**Note:** These words are removed only in the word clouds and tables in the next images, **not** in the model.

![zbie](https://cdn.mos.cms.futurecdn.net/Mz3heMHRTTD6GRa45ikjLC-1200-80.jpg)

In [None]:
def generate_worldcloud(x,txt,show_table=True):
    '''
    input: a dataframe, the title, a boolean
    output: a word cloud, (a summary table)
    '''
    #Word cloud
    cloud = WordCloud(width=1640, height=1200,random_state=60,
                    stopwords=['game','story','character','play','gameplay','good','end','like','ellie','joel','abby','nt','s','ll','n','t']).generate(" ".join(x.pre_review))
    fig = plt.figure(figsize=(15,12))
    ax = fig.add_subplot(111)
    plt.imshow(cloud)
    plt.axis('off')
    #Summary table
    if show_table==True:
        lst=x.review.iloc[:12].apply(lambda x: [' '+x[:100]+'\n'+x[100:200]+'\n'+x[200:300]+'...'])  
        tab=plt.table(cellText=list(lst), rowLabels=np.arange(12)+1, 
                  colLabels=[txt],colColours=['red'], loc='right',cellLoc='center',colLoc='center')
        tab.auto_set_font_size(False); tab.set_fontsize(14)
        tab.scale(1, 4)
        set_pad_for_column(tb=tab,col=0, pad=0.01) 
    plt.show() 
def subplot_worldcloud(x):
    '''
    input: a dataframe
    output: a word cloud
    obs: it may be better to combine this and generate_worldcloud into one function, or into a class
    '''
    cloud = WordCloud(width=140, height=108,random_state=60,
                      stopwords=set(['game','story','character','play','gameplay','good','end','like','ellie','joel','abby','nt','s','ll','n','t'])).generate(" ".join(x.pre_review))
    plt.imshow(cloud)
    plt.title(f'Score: {i}',loc='left')
    plt.axis('off')
def create_report(df):
    '''
    input: a dataframe
    output: return a table that shows its reviews and scores
    usage: filter the dataframe and obtain that specific report
    '''
    #Add colors by score
    fill_color = []; n = len(df)
    for i in range(n):  
        if df.iloc[i].label=='Excellent':fill_color.append('rgb(102, 178, 255)')
        elif df.iloc[i].label=='Ok':fill_color.append('rgb(153, 255, 204)')
        else:fill_color.append("rgb(255, 102, 102)")
    #Create figure
    fig = go.Figure(data=[  
        go.Table(columnorder = [1,2], columnwidth = [440,40],
      header=dict(values=['<b>Reviews</b>', '<b>Score</b>'],
        line_color='black', fill_color='black',
        align='center',font=dict(color='white', size=12)),
      cells=dict(values=[df.review, df.score],
        line_color=['black']*2,fill_color=[fill_color,fill_color],
        align='center', font=dict(color='black', size=11)))
    ])
    fig.show()
def set_pad_for_column(tb,col, pad=0.1):
    '''
    usage: fix the pad of a plt.table
    '''
    cells = tb.get_celld()
    column = [cell for cell in tb.get_celld() if cell[1] == col]
    for cell in column: cells[cell].PAD = pad

In [None]:
fig = plt.figure(figsize=(15, 7))
for i in range(11): 
    plt.subplot(4,3,i+1)
    subplot_worldcloud(dat[dat.score==i])

* We can see the pattern {0->10: (bad, kill)->(think)}. However, the intermediate scores contradict this idea. Maybe this graph only suggests using scores 0 and 10 for the modeling.
* Another pattern could be {10: love}, {9: people}, {8: people}... that is, words that stand out only in their own score.
* When the words mentioned above are not removed, all the graphs look very similar. Try it on your own machine if you are skeptical. Example: "good" appears in every cloud.

In [None]:
generate_worldcloud(dat[dat.label=='Excellent'],'Excellent Reviews') 

In [None]:
generate_worldcloud(dat[dat.label=='Bad'],'Bad Reviews',show_table=False)
#You can analyze any case. More examples:
# generate_worldcloud(dat[dat.date<'2020-06-20'],'Earlier reviews')
# generate_worldcloud(dat[dat.date>'2020-09-25'],'Last reviews')


In [None]:
#Then if you want to see its reviews..
create_report(dat.iloc[:30])  #First reviews
# create_report(dat[dat.date<'2020-06-20'])  #Early reviews
# create_report(dat[dat.date>'2020-09-25']) #Latest reviews

# Dimensionality reduction
Here I only show how to work this way. I did not analyze this step in depth, since it is not the main theme of this notebook.

Some steps that I did not consider: the analysis of explained variance or inertia, and the use of alternative algorithms such as t-SNE or DBSCAN.
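
For reference, inertia is the sum of squared distances from each observation to the centroid of its cluster (scikit-learn's `KMeans` exposes it as `inertia_` after fitting). The sketch below computes it by hand on made-up 2-D points, just to show what an inertia analysis would compare; the function name `inertia` and the toy data are hypothetical.

```python
def inertia(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

# Made-up points forming two obvious groups
toy = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = inertia(toy, [0, 0, 0, 0])
two_clusters = inertia(toy, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

A sharp drop in inertia when going from k to k+1 clusters, followed by small gains, is the usual "elbow" signal for choosing the number of clusters.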

In [None]:
def reduction(df,n_components):
    '''
    input: p features
    output: k components that explain that features
    usage: reduction of dimension (feature space)
    '''
    df = PCA(n_components=n_components,random_state=60).fit_transform(df)
#     df=TSNE( n_components=n_components,init='random' method='barnes_hut',n_iter=1000,verbose=2,angle=0.5).fit_transform(data)

    names=[]
    for i in range(n_components): names.append(f'Component {i+1}')
    df=pd.DataFrame(df,columns=names)
    return df
def clustering(df,n_clusters):
    '''
    input: n observations
    output: m clusters that group those observations
    usage: reduction of dimension (observation space)
    '''
    kmeans = KMeans(n_clusters = n_clusters, n_init = 20,random_state=60)
#     kmeans=DBSCAN(eps=3, min_samples=3).fit_predict(df)
    df['label']=kmeans.fit_predict(df)
    df['label']=df.label.apply(lambda x: f'cluster {x+1}')
    return df
new_ft=reduction(features,3)

In [None]:
#Remove outliers. This way it is easier to spot whether an observation is an outlier. I also revisit this step after the segmentation
outlier=new_ft[new_ft['Component 1']>30]
print(outlier)
# new_ft.drop(outlier.index,axis=0,inplace=True)
# new_ft.boxplot()
create_report(dat.iloc[outlier.index].sort_index(ascending=False))

In [None]:
clustering(new_ft,3)
sns.pairplot(new_ft,palette= 'Dark2', diag_kind='kde',hue='label',height=1.85)

In [None]:
new_ft=new_ft[new_ft.label!='cluster 2'] #Cluster 2 contains the outliers, so it is removed
sns.pairplot(new_ft,palette= 'Dark2', diag_kind='kde',hue='label')

In [None]:
fig=px.scatter_3d(new_ft,
    x='Component 1',y='Component 2',z='Component 3',
    color='label')
fig.show() 

In [None]:
cluster1_idx=new_ft[new_ft.label=='cluster 1'].index
cluster2_idx=new_ft[new_ft.label=='cluster 3'].index
create_report(dat.iloc[cluster1_idx])

# Unigram, Bigram, Trigram
<center><b>How can a machine understand linguistic structures and their meanings? What is the elementary text object?</b></center>
The elementary object is the word, but words alone are not sufficient. If we join <b>adjacent</b> words, we obtain n-grams. Then, if we count their occurrences, we obtain a <b>bag</b> of elements. Non-adjacent words can surely be useful too, but that is another perspective.
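
A minimal sketch of the idea, using only the standard library: slide a window of n adjacent tokens over the text and count each window. The helper name `count_ngrams` and the sample sentence are made up for this illustration; the notebook itself relies on `nltk.util.ngrams`.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n adjacent tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))  # n shifted copies give the sliding window
    return Counter(grams)

tokens = "good story good graphics good story".split()
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)
print(bigrams.most_common(2))
```

Note that order matters here: `('good', 'story')` and `('story', 'good')` are counted as different bigrams unless they are merged afterwards.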

![lou](https://media.giphy.com/media/pZnubf8FJxONO/giphy.gif)

In [None]:
def n_grams(word,k):  
    '''
    input: a string, the number k of tokens per gram
    output: a Counter {k-gram tuple: count}
    usage: as a function to transform the strings
    '''
    token = word_tokenize(word)
    gram = ngrams(token,k)
    return Counter(gram)

def get_ngram(df,k,c):
    '''
    input: a dataframe, the size k of the n-gram, the minimum count c
    output: a dataframe that shows each k-gram with its count
    usage: obtain the bag of words, the set of n-grams
    '''
    #get the k-set
    #I use the processed text (it seems more logical to remove stopwords), taken from df and joined with spaces so words of different reviews do not merge
    aux=' '.join(df.pre_review)
    gramk=n_grams(aux,k)
    gramk=join_repeats(gramk)
    #drop duplicate keys
    try: gramk=filter_dict(gramk,lambda x,y: x[0]!=x[1])  #for bigram
    except:gramk=gramk
    try: 
        gramk=filter_dict(gramk,lambda x,y: x[0]!=x[2])  #for trigram
        gramk=filter_dict(gramk,lambda x,y: x[1]!=x[2])
    except:gramk=gramk
    #only select the most counts
    gramk=filter_dict(gramk,lambda x,y:y>c)
    gramk_df=pd.DataFrame({f'{k}-gram':gramk.keys(),'counts':gramk.values()})
    return(gramk_df.sort_values('counts',ascending=False))

def join_repeats(dic):
    '''
    usage: sum elements with the same key
    '''
    newDict = dict()
    for key, value in dic.items():
        nkey=tuple(sorted(key))
        if nkey in newDict.keys(): newDict[nkey]=value+newDict[nkey]
        else: newDict[nkey]=value
    return newDict 
def filter_dict(dic, f):
    '''
    usage: Filter the items of a dictionary'''
    newDict = dict()
    for key, value in dic.items():
        if f(key, value): newDict[key] = value
    return newDict  

In [None]:
def my_palette(ar):
    '''
    usage:highlight bars
    '''
    lst=[]
    for i in range(len(ar)):
        if ar.iloc[i]=='Good':lst.append('#08439A')
        else: lst.append("#e33f2c")
    return lst
def plot_ngram(df):
    '''
    input: a dataframe {set, count}
    output: a bar plot colored by the model's prediction for each set
    '''
    fig,ax=plt.subplots(figsize=(10,6))
#     sns.barplot(y=df.iloc[:,0],x=df.iloc[:,1],palette=['gray']*len(df),ax=ax[0])    
    df['label']=pipe.predict(df.iloc[:,0].apply(str))
    df['label']=df.label.apply(lambda x: 'Good' if x==1 else 'Bad')
    sns.barplot(y=df.iloc[:,0],x=df.iloc[:,1],palette=my_palette(df.label))


In [None]:
grams1=get_ngram(dat,1,4000)  #Another way: CountVectorizer() DataFrame() sum.axis()
grams1.iloc[:,0]=grams1.iloc[:,0].apply(lambda x:x[0])
plot_ngram(grams1)

* Remember that, in this setup, the model predicts on a review that contains only this word.
* "Like" should be good, but in this particular case it is not. (Example: "I like the graphics, but the story is horrible", score=0)

In [None]:
gp_idx=features[(features.graphic==1)].index
print('Summary of one word')
print(dat.iloc[gp_idx].label.value_counts()) #dat and features have the same index
create_report(dat.iloc[gp_idx])

In [None]:
grams2=get_ngram(dat,2,700)
plot_ngram(grams2)

If you read some of the reviews above, you can see why 'good story' is closer to an excellent review, but I understand why the model can be wrong. A pair of words can represent opposite things.

In [None]:
ind_gs=features[(features.like==1)&(features.story==1)].index
print('Summary of one bigram')
print(dat.iloc[ind_gs].label.value_counts())
create_report(dat.iloc[ind_gs])

In [None]:
grams3=get_ngram(dat,3,100)
plot_ngram(grams3)

In [None]:
ind_gs=features[(features.gameplay==1)&(features.graphic==1)&(features.story==1)].index
print('Summary of one trigram')
print(dat.iloc[ind_gs].label.value_counts())
create_report(dat.iloc[ind_gs])

In [None]:
#Another try: Bigram,trigram
#Note: analyzer='char' builds character n-grams, and stop_words is ignored unless analyzer='word'
dat['target']=dat.score.apply(lambda x: 0 if x<5 else 1)
x_train,x_test,y_train,y_test = train_test_split(dat.review,dat.target, test_size=0.2, random_state=60)
pipe_bt = Pipeline([('vect',CountVectorizer(analyzer='char',ngram_range=(2,3),min_df=20,stop_words='english') ),
                 ('tfidf', TfidfTransformer()),
                 ('model', LogisticRegression())]).fit(x_train, y_train)
prediction = pipe_bt.predict(x_test)
#roc_auc_score expects (y_true, y_score), in that order
print(roc_auc_score(y_train,pipe_bt.predict(x_train)))
print(roc_auc_score(y_test,prediction))
plot_confusion_matrix(pipe_bt,x_test,y_test,cmap='rocket')
#Again, that looks great. The model has good predictive power and shows no overfitting

# Sentiment and emotion lexicons
The lexicon approach relies on a dictionary that contains a comprehensive list of sentiment features. Each lexical feature is rated for the polarity and intensity of its sentiment (or emotion). The average score can then be used as the sentiment indicator. This way we can obtain the emotion and sentiment of a review.

One advantage is that it discards words that are not useful, in addition to stopwords, but the method has its limits.
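
A minimal sketch of lexicon scoring, assuming a tiny dictionary with AFINN-style integer scores. The values in `TOY_LEXICON` are made up for illustration and are not taken from the real lexicons loaded in this notebook; `lexicon_score` is a hypothetical helper.

```python
# Hypothetical toy lexicon with AFINN-style integer scores (values made up for illustration)
TOY_LEXICON = {"love": 3, "great": 3, "good": 2, "bad": -2, "hate": -3, "horrible": -3}

def lexicon_score(text, lexicon=TOY_LEXICON):
    """Average score of the words found in the lexicon; None if no word matches."""
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    if not hits:
        return None
    return sum(hits) / len(hits)

mixed = lexicon_score("i love the graphics but the story is horrible")
print(mixed)
```

The mixed review also shows one of the limits of the method: a positive and a negative word cancel out, so the average hides the fact that the reviewer is unhappy with the part that mattered to them.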

![joel](https://i.ytimg.com/vi/RVINxPk0190/maxresdefault.jpg)

In [None]:
def emotion(x):
    '''
    input: a string
    output: a dictionary that shows the emotions of that string
    usage: function of a column'''
    x = NRCLex(x)
    return x.affect_frequencies
dat['emotion'] = dat.review.apply(emotion)
feature_matrix = vectorizer.fit_transform(dat.review).toarray()#considering all original words
features=pd.DataFrame(feature_matrix, columns=vectorizer.get_feature_names())
bow=features.sum(axis=0) 
bow=pd.DataFrame({'word':bow.keys(),'count':bow.values})
bow=bow.merge(nrc,how='inner',on='word')

In [None]:
#bag of words
bow.sort_values(['sentiment','count'],ascending=False,inplace=True)
bow.reset_index(drop=True,inplace=True)
#merge the sentiment (or emotion)
aux=pd.DataFrame(columns=['word','count','sentiment'])
for i in bow.sentiment.unique():
    aux=pd.concat([aux,bow[bow.sentiment==i].iloc[:5]],axis=0)
alt.Chart(aux).mark_bar().encode(y='word',
    x='count',color='sentiment',facet=alt.Facet('sentiment',columns=5,align='none')).resolve_scale(y='independent').properties(width=250,
    height=250)

In [None]:
#treemap
thmap1 = pd.DataFrame({'scenario' : nrc.sentiment.value_counts().index,
                          'percentage':nrc.sentiment.value_counts(normalize=True).tolist(),
                        'dataset':"Original's lexicon" })
thmap2 = pd.DataFrame({'scenario' : bow.sentiment.value_counts().index,
                          'percentage':bow.sentiment.value_counts(normalize=True).tolist(),
                          'dataset':"Dataset's lexicon"})
thmap=pd.concat([thmap1,thmap2],axis=0)
thmap.reset_index(inplace=True,drop=True)
fig = px.treemap(thmap, path= ["scenario","dataset"], values ="percentage",
                 color='percentage',color_continuous_scale='viridis_r')
fig.show()


* The treemap shows the share of words associated with each affect category.
* For this case, I added a level to the graph: the original lexicon and the dataset's lexicon.
* The original lexicon is the original NRC lexicon. This set contains the **original** proportion of each sentiment (or emotion).
* The dataset's lexicon contains the **actual** proportion of each sentiment (or emotion) in the reviews.
* Look at the differences between these two groups to see the specific conditions of the sentiment in this dataset.
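
The comparison behind the treemap can be sketched as follows: compute the share of each category in both word lists and look at the difference. The category lists below are made up for illustration, and `category_share` is a hypothetical helper, not part of the notebook's code.

```python
from collections import Counter

def category_share(categories):
    """Return each category's share of the total, as a dict."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Made-up category labels, one entry per word
lexicon_words = ["negative", "negative", "positive", "fear"]
dataset_words = ["positive", "positive", "negative", "fear"]

lex = category_share(lexicon_words)
data = category_share(dataset_words)
# Positive values mean the category is over-represented in the dataset
shift = {cat: data.get(cat, 0) - lex.get(cat, 0) for cat in lex}
print(shift)
```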

In [None]:
#Word count distribution over sentiment intensity
del bow['count']
bow=bow.merge(afinn,how='inner',on='word')
bow.dropna(inplace=True)
plt.figure(figsize=(10,6))
sns.kdeplot(data=bow,x='value',hue='sentiment',fill=True, palette="seismic",
   alpha=.7, linewidth=0.3,)

* Well, it's curious. The emotions fit appropriately in each subgroup
* Note the bimodal distribution. It could be modeled with a mixture of Gaussians.
* Honestly, I'm surprised. I saw this graph in another notebook where the subgroups were not separated, but I'm not sure how to take advantage of that
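
One possible way to follow up on the bimodal shape noted above is to fit a two-component Gaussian mixture. This is only a sketch on synthetic values (the two clusters around -2 and 2 are an assumption, not the real AFINN values of `bow`), using scikit-learn's `GaussianMixture`:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic word valences (assumption): one negative and one positive cluster
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(-2.0, 0.8, size=500),
    rng.normal(2.0, 0.8, size=500),
])

# GaussianMixture expects a 2-D array of shape (n_samples, n_features)
gmm = GaussianMixture(n_components=2, random_state=0).fit(values.reshape(-1, 1))

means = np.sort(gmm.means_.ravel())
print("component means:", means)
print("component weights:", gmm.weights_)

# Posterior probability of each component for a few valence values
print(gmm.predict_proba(np.array([[-3.0], [0.0], [3.0]])))
```

With the real data, `values` would be the `value` column of `bow`, and the posterior probabilities could serve as a soft negative/positive score for each word.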

In [None]:
def get_lexicon(dat):
    '''
    input: a dataframe with an 'emotion' column of dictionaries
    output: a tuple of dictionaries: emotion proportions, sentiment (positive/negative) proportions
    '''
    n=len(dat);aux={'fear':0,'anger':0,'anticipation':0,'trust':0,'surprise':0,'positive':0,'negative':0,'sadness':0,
'disgust':0,'joy':0};newDict1=dict();newDict2=dict()
    #Obtain the sets: emotion, score
    for i in range(n): 
        aux=sum_dictionary(aux,dat.emotion.iloc[i])
    for key, value in aux.items(): 
        if (key!='positive')&(key!='negative'):
            newDict1[key]=value
        else: newDict2[key]=value
    #Define proportions in each one
    s1=sum(list(newDict1.values()));s2=sum(list(newDict2.values()))
    return(function_dic(newDict1,lambda x:x/s1),function_dic(newDict2,lambda x:x/s2))
def sum_dictionary(dic1,dic2):
    '''
    usage: sum all dictionaries in the column
    '''
    newDict=dict()
    for key, value in dic1.items():
        if key in dic2.keys(): newDict[key]=value+dic2[key]
    return newDict
def function_dic(dic,f):
    '''
    usage:change the values in a dictionary
    '''
    newDict=dict()
    for key, value in dic.items(): newDict[key]=f(value)
    return newDict

In [None]:
#Define scenarios
joel=features[(features.joel!=0)&(features.kill!=0)&(features.death!=0)]
ellie=features[(features.ellie!=0)&(features.story!=0)] #an observation. The users maybe write incorrectly(Ex: elie)
score10=features[(features.feel!=0)&(features.love!=0)] #parentheses are needed: & binds tighter than !=
joel_emotion,joel_score=get_lexicon(dat.iloc[joel.index])
ellie_emotion,ellie_score=get_lexicon(dat.iloc[ellie.index])
score10_emotion,score10_score=get_lexicon(dat.iloc[score10.index])

In [None]:
fig,ax=plt.subplots(figsize=(8,7),nrows=3,ncols=2)
def my_palette(ar):
    '''
    usage: highlight bars
    '''
    n=len(ar);lst=[]
    lim_ar=sum(ar)/n+0.02
    for i in range(n):
        if ar[i]>lim_ar: lst.append('#008000')
        else: lst.append("#15B01A")
    return lst
#bar chart
sns.barplot(y=list(joel_emotion.keys()),x=list(joel_emotion.values()),palette=my_palette(list(joel_emotion.values())),
            label=list(joel_emotion.keys()),ax=ax[0,0])
#adjust format to the bar
ax[0,0].set_ylabel("Joel's death",fontsize=10,fontweight="bold",
                   bbox=dict(facecolor='wheat', edgecolor='gold', pad=3))
ax[0,0].set_xlim([0,0.2])
#pie chart
ax[0,1].pie(list(joel_score.values()),colors=["#3498db","#e74c3c"],autopct = '%1.1f%%',
            shadow=True,explode=[0,0.1],labels=list(joel_score.keys()))
sns.barplot(y=list(ellie_emotion.keys()),x=list(ellie_emotion.values()),
            palette=my_palette(list(ellie_emotion.values())),ax=ax[1,0])
ax[1,0].set_ylabel('Ellie and her story',fontsize=10,fontweight="bold",bbox=dict(facecolor='wheat', edgecolor='gold', pad=3))
ax[1,0].set_xlim([0,0.2])
ax[1,1].pie(list(ellie_score.values()),colors=["#3498db","#e74c3c"],autopct = '%1.1f%%',
            shadow=True,explode=[0,0.1],labels=list(ellie_score.keys()))
sns.barplot(y=list(score10_emotion.keys()),x=list(score10_emotion.values()),palette=my_palette(list(score10_emotion.values())),ax=ax[2,0])
ax[2,0].set_ylabel('Great game!',fontsize=10,fontweight="bold",bbox=dict(facecolor='wheat', edgecolor='gold', pad=3))
ax[2,0].set_xlim([0,0.2])
ax[2,1].pie(list(score10_score.values()),colors=["#3498db","#e74c3c"],autopct = '%1.1f%%',
            shadow=True,explode=[0.1,0],labels=list(score10_score.keys()))
plt.subplots_adjust(wspace=1e-2,hspace=0.15)

* *Joel's death* is dominated by sadness, fear and anger. Its reviews are mostly negative
* *Ellie and her story* is dominated by fear and sadness. Its reviews are mostly negative
* *Great game!* is dominated by trust and fear. Its reviews are mostly positive

In [None]:
categories = np.array(list(joel_emotion.keys())) #list of emotions
fig = go.Figure()
#figure 1
fig.add_trace(go.Scatterpolar(
      r=np.array(list(joel_emotion.values())), theta=categories,fill='toself',name="Joel's death")
             )
#figure 2
fig.add_trace(go.Scatterpolar(
      r=np.array(list(ellie_emotion.values())),
      theta=categories,
       fill='toself',
      name='Ellie and her story'
))
#figure 3
fig.add_trace(go.Scatterpolar(
      r=np.array(list(score10_emotion.values())),theta=categories,fill='toself',name='Great game!')
             )
fig.update_layout(
  polar=dict(radialaxis=dict(
      visible=True,
      range=[0, 0.21])
            ), showlegend=False
)

fig.show()

* This radar chart complements the previous graph
* *Joel's death* has interesting extremes: sadness and fear
* *Great game!* has interesting extremes: joy and trust
* *Ellie and her story* and *Great game!* are a little similar. That's funny
![1st](https://31.media.tumblr.com/b7847cf1da3d34f9229955dbb1f3c3ab/tumblr_nod44mTfeJ1tyvnamo1_500.gif)
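
The visual similarity between two radar shapes can be turned into a single number. Here is a minimal sketch using cosine similarity, assuming two made-up emotion profiles (`profile_a` and `profile_b` are hypothetical, not the values computed by `get_lexicon`):

```python
import numpy as np

emotions = ["fear", "anger", "anticipation", "trust",
            "surprise", "sadness", "disgust", "joy"]

# Hypothetical emotion proportions for two scenarios (each sums to 1)
profile_a = {"fear": 0.18, "anger": 0.14, "anticipation": 0.12, "trust": 0.14,
             "surprise": 0.07, "sadness": 0.16, "disgust": 0.09, "joy": 0.10}
profile_b = {"fear": 0.13, "anger": 0.10, "anticipation": 0.14, "trust": 0.19,
             "surprise": 0.08, "sadness": 0.11, "disgust": 0.07, "joy": 0.18}

def cosine_similarity(p, q, keys):
    """Cosine of the angle between two profiles, read in the same key order."""
    a = np.array([p[k] for k in keys], dtype=float)
    b = np.array([q[k] for k in keys], dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

similarity = cosine_similarity(profile_a, profile_b, emotions)
print(f"cosine similarity: {similarity:.3f}")
```

Because all proportions are non-negative, every pair will score fairly close to 1, so the number is most useful for ranking pairs of scenarios against each other rather than as an absolute measure.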

In [None]:
#nrclex prediction
def get_lexicon_element(dat):
    '''
    input: a dictionary of emotion scores (one review)
    output: a tuple of dictionaries: the emotions, the sentiment scores (for each review)
    '''
    newDict1=dict();newDict2=dict()
    #Obtain the sets: emotion, score
    for key, value in dat.items(): 
        if (key!='positive')&(key!='negative'):
            newDict1[key]=value
        else: newDict2[key]=value
    #Return each one
    return(newDict1,newDict2)
def predict_lexicon(df):
    '''
    input: a dictionary of emotion scores (one review)
    usage: obtain predictions (1: positive, 0: negative, '-': tie)
    '''
    dic=get_lexicon_element(df)
    if dic[1]['positive']>dic[1]['negative']: return 1
    elif dic[1]['positive']<dic[1]['negative']: return 0
    else: return('-')

In [None]:
# Finally, we evaluate these models with new reviews
my_reviews=pd.DataFrame({'review':["The gameplay and graphics were great. The game was entertaining.","I enjoy to play",
                      'Play as Abby was a horrible idea, what they though?',
                      "Last of Us don't deserve this sequel. Do it again","Last of Us don't deserve this sequel. Joel shouldn't die. "]})
print(f'Bag of words: {pipe.predict(my_reviews.review)}')
print(f'Bigram and trigram: {pipe_bt.predict(my_reviews.review)}')

my_reviews['emotion']=my_reviews.review.apply(lambda x: emotion(str(x)))
print(f'NRCLex: {np.array(my_reviews.emotion.apply(predict_lexicon))}') #that's bad; this shows its limits. Adding a probabilistic distribution may be more useful

# Conclusions and recommendations
* The word clouds and reports show that the intermediate scores look similar to each other. It is therefore suggested to keep this perspective (only scores 0 and 10) in the next modelling.
* Both approaches seen here (n-gram, lexicon) give interesting insights.
* In the first case, many words (or grams) can carry opposite meanings.
* Vectorization, with other conditions, is an excellent way to quantify text features.
* In the second case, you can view the emotion and sentiment in each review. I plotted some graphs to help the visualization.
* The NRCLex list can detect the emotion and the sentiment automatically.
* In terms of prediction, models based on unigrams, bigrams and trigrams are better because they are fitted to the specific data.
* NRCLex is based only on the word, not on its context in the sentence. Maybe this alternative works well if you use a probabilistic approach.
* I always think that a convolutional neural network is the best option in this type of modeling. That's a future task.
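
As a rough idea of what a probabilistic reading of the lexicon scores could look like, here is a sketch that turns the positive and negative scores of one review into a smoothed probability. The functions `positive_probability` and `predict_with_margin`, and the parameters `alpha` and `margin`, are hypothetical names for this illustration, and the scores are assumed to be counts:

```python
def positive_probability(scores, alpha=1.0):
    """Smoothed share of positive evidence; 0.5 means no evidence either way."""
    pos = scores.get("positive", 0.0)
    neg = scores.get("negative", 0.0)
    # Additive smoothing keeps the result defined when both counts are 0
    return (pos + alpha) / (pos + neg + 2 * alpha)

def predict_with_margin(scores, margin=0.1):
    """Return 1 or 0 only when the probability is clearly away from 0.5."""
    p = positive_probability(scores)
    if p > 0.5 + margin:
        return 1
    if p < 0.5 - margin:
        return 0
    return "-"

examples = [
    {"positive": 3, "negative": 1},
    {"positive": 1, "negative": 1},
    {"positive": 0, "negative": 2},
    {},
]
for scores in examples:
    print(scores, round(positive_probability(scores), 3), predict_with_margin(scores))
```

Compared with a hard positive-versus-negative comparison, this keeps the uncertainty visible: a review with no lexicon hits sits at 0.5 instead of being forced into a class. If the scores are frequencies rather than counts, a much smaller `alpha` would be needed.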

# Additional resources
I know that other specialized packages exist for this topic, but I tried to write my own code because I wanted to understand the algorithm better. If you find any improvement, please let me know. On the other hand, I mention additional resources below.

**Kaggle notebooks:**
* https://www.kaggle.com/andradaolteanu/sentiment-analysis-rick-and-morty-scripts#3.-Dialogues-:-Who-talks-the-most?-%F0%9F%92%AC
* https://www.kaggle.com/ruchi798/sentiment-analysis-the-simpsons
* https://www.kaggle.com/arthurtok/principal-component-analysis-with-kmeans-visuals?scriptVersionId=1524454
* https://www.kaggle.com/meesalasaidhanush/review-scores-of-fall-guys-game-acc-99

**Courses and books:**
* https://www.tidytextmining.com/index.html
* https://www.kaggle.com/learn/natural-language-processing

 ![end](https://64.media.tumblr.com/0ff977cfd2fd40b9ff304af285acd14a/tumblr_phloi1SAjK1vrquye_540.gif)

<div class="alert alert-block alert-info">  
That's all. Thanks for reading this notebook. If you think this material is useful, please upvote. That motivates me to keep doing this type of project.
</div>