# Game Console Sentiment Analysis
## Rudy Duran
## Practicum 1
## 4-08-2021

# Purpose

The purpose of this project is to analyze Amazon customer reviews to see what consumers think of the new game consoles: the PlayStation 5 and the Xbox Series X.<br>The scope covers the PlayStation 5, the Xbox Series X, and their direct predecessors, the PlayStation 4 and the Xbox One X.<br>The method is to analyze customer reviews of these products to see whether sentiment differs between the PS5 and Xbox Series X and their respective predecessors, and how it differs.<br>This analysis will help in better understanding consumers' needs and areas for improvement in the new consoles.<br>

Natural Language Processing, LDA topic modeling, and sentiment analysis will be used on these reviews in order to achieve this purpose.

For context, LDA (Latent Dirichlet Allocation) is a popular topic modeling method that aims to find abstract topics within a
collection of documents, which in this case are the Amazon reviews. I chose to do this project on LDA
and NLP because I wanted to gain more experience and familiarity with the NLP data science domain. I wanted
to get better at it and also apply these tools to a real-life application, which in this case is the
set of Amazon reviews of game consoles. I am an avid gamer myself, which is why I chose these products for my
project.

Because this notebook is shared through a Binder link, the following packages need to be installed for it to function correctly:

In [None]:
!pip install pandas
!pip install nltk
!pip install matplotlib
!pip install seaborn
!pip install spacy
!pip install gensim
!pip install pyLDAvis
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz --no-deps
!pip install -U spacy==2.3.1

Once the packages have been installed, the following libraries are imported for use in this project:

In [None]:
import pandas as pd
from nltk import FreqDist
import matplotlib.pyplot as plt 
import seaborn as sns
import re
import spacy
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

import gensim
from gensim import corpora

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
%matplotlib inline

The pandas library will be needed for data manipulation.<br>
The seaborn and matplotlib packages will be used for data visualizations.<br> 
The re and spacy packages will be used for regular expressions and for tokenization and lemmatization.<br>
The NLTK library will be used to bring in stopwords for data cleaning.<br>
The gensim and pyLDAvis libraries will be used for LDA Topic Analysis and LDA visualizations. 

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
pd.options.mode.chained_assignment = None 

The warnings filter and the pd.options setting suppress the DeprecationWarning messages and the pandas "SettingWithCopyWarning" message.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

The code above is run a second time to make sure the deprecation warnings from the gensim package are suppressed.

# Data Sources

The data was scraped directly from Amazon using the Scrapy package, which was installed on my local machine.<br>
The scraped reviews were saved into CSV files.


The reviews for all 4 consoles are loaded into pandas dataframes.

In [None]:
ps4reviews = pd.read_csv("PS4Amazon.csv")
ps5reviews = pd.read_csv("ps5new.csv")
xboxonexreviews = pd.read_csv("xboxonexreviews.csv")
xboxseriesxreviews = pd.read_csv("xboxseriesxreviews.csv")

There are some extra columns not needed in the PS5 reviews dataset: "Name" and "Title". 
I use the following piece of code to drop them:

In [None]:
ps5reviews = ps5reviews.drop(['Name', 'Title'], axis = 1)

To clean the data and get it ready for modeling and visualizations, I need to replace the rating strings (for example, "5.0 out of 5 stars")
with numerical values (1, 2, 3, 4, 5) so that the ratings can be filtered and compared:

In [None]:
ps5reviews['Rating'] = ps5reviews['Rating'].replace(['1.0 out of 5 stars','2.0 out of 5 stars', '3.0 out of 5 stars', '4.0 out of 5 stars', '5.0 out of 5 stars'],
                                                    [1, 2, 3, 4, 5])
ps4reviews['stars'] = ps4reviews['stars'].replace(['1.0 out of 5 stars','2.0 out of 5 stars', '3.0 out of 5 stars', '4.0 out of 5 stars', '5.0 out of 5 stars'],
                                                    [1, 2, 3, 4, 5])
xboxonexreviews['stars'] = xboxonexreviews['stars'].replace(['1.0 out of 5 stars','2.0 out of 5 stars', '3.0 out of 5 stars', '4.0 out of 5 stars', '5.0 out of 5 stars'],
                                                              [1, 2, 3, 4, 5])
xboxseriesxreviews['stars'] = xboxseriesxreviews['stars'].replace(['1.0 out of 5 stars','2.0 out of 5 stars', '3.0 out of 5 stars', '4.0 out of 5 stars', '5.0 out of 5 stars'],
                                                    [1, 2, 3, 4, 5])
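
As an alternative to listing every rating string, the numeric value can be pulled out with a regular expression. The sketch below is a minimal illustration, assuming the ratings always look like "4.0 out of 5 stars"; `parse_rating` is a hypothetical helper and is not part of this notebook or of pandas.

```python
import re

# Matches strings such as "4.0 out of 5 stars" and captures the leading number.
_RATING_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+out of 5 stars\s*$")

def parse_rating(text):
    """Return the star rating as an int, or None if the text does not match."""
    match = _RATING_RE.match(str(text))
    if match is None:
        return None
    return int(float(match.group(1)))

print(parse_rating("4.0 out of 5 stars"))
print(parse_rating("no rating"))
```

With pandas, a helper like this could be applied to a column through `Series.map`, for example `ps5reviews['Rating'].map(parse_rating)`. Unmatched strings become missing values instead of being silently left as text.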


The head function is run to check the first 5 rows of the data:

In [None]:
ps5reviews.head(5)

Based on the dataframe above, the "Rating" column values have been replaced successfully.<br>
However, the "Comment" column for the PS5 reviews appears to contain extra whitespace and blank lines.<br>
The rest of the datasets are checked to see if they have the same issue:

In [None]:
ps4reviews.head(5)

In [None]:
xboxonexreviews.head(5)

In [None]:
xboxseriesxreviews.head(5)

The review text in all four datasets has the same extra whitespace issue.<br>
The code below removes the tab, newline, and carriage return characters using regular expressions:

In [None]:
ps4reviews.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
ps5reviews.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
xboxonexreviews.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
xboxseriesxreviews.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
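
One thing to watch with this approach is that replacing a newline with an empty string can glue two words together when the newline was the only separator (for example "console.\nLoads" becomes "console.Loads"). The sketch below shows an alternative that collapses whitespace into a single space instead. It is an illustration only; `normalize_whitespace` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def normalize_whitespace(text):
    """Collapse literal and escaped tabs/newlines into single spaces."""
    # Turn escaped sequences (a backslash followed by t, n, or r) into a space.
    text = re.sub(r"\\[tnr]", " ", text)
    # Collapse any run of real whitespace (tabs, newlines, spaces) into one space.
    return re.sub(r"\s+", " ", text).strip()

raw = "\n\n  Great console.\nLoads fast.\\n  "
print(normalize_whitespace(raw))
```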

Next, the "Rating" and "stars" columns are converted with pd.to_numeric.
This is a safeguard to make sure the columns are numeric before they are filtered in later steps.

In [None]:
ps5reviews['Rating'] = pd.to_numeric(ps5reviews['Rating'])
ps4reviews['stars'] = pd.to_numeric(ps4reviews['stars'])
xboxonexreviews['stars'] = pd.to_numeric(xboxonexreviews['stars'])
xboxseriesxreviews['stars'] = pd.to_numeric(xboxseriesxreviews['stars'])

In [None]:
ps4reviews.head(5)

In [None]:
ps5reviews.head(5)

In [None]:
xboxonexreviews.head(5)

In [None]:
xboxseriesxreviews.head(5)

As seen by the dataframes above, the whitespace issue has been fixed and the ratings and stars columns are now numeric.<br>
The datasets will now be checked to see if there are any null values:

In [None]:
ps4reviews.info()

There are no null values for the PS4 dataset.

In [None]:
ps5reviews.info()

There are no null values for the PS5 dataset.

In [None]:
xboxonexreviews.info()

There is one NULL value in the "comment" section for the Xbox One X dataset.

In [None]:
xboxseriesxreviews.info()

There are no NULL values for the Xbox Series X dataset.<br>
The only NULL value that appears is for the Xbox One X dataset.<br>
Since it is only one NULL row, I will drop the row from the dataset:

In [None]:
xboxonexreviews = xboxonexreviews.dropna()

In [None]:
xboxonexreviews.info()

As shown above, there are no more NULL values within the Xbox One X dataset.

# Data Splitting: PS4 Reviews

The next step is to split the dataframes into 2 datasets for each review dataframe: one positive review dataset and one negative review dataset.

Positive Reviews will be defined as ratings >=4.<br>
Negative Reviews will be defined as ratings <=3.<br>
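
The rule above can be written as a small function. This is a minimal sketch: `label_sentiment` and its `positive_min` parameter are hypothetical names used only for illustration. Note that 3-star reviews fall on the negative side under this definition.

```python
def label_sentiment(stars, positive_min=4):
    """Label a 1-5 star rating as 'positive' (>= positive_min) or 'negative'."""
    return "positive" if stars >= positive_min else "negative"

for stars in (1, 3, 4, 5):
    print(stars, label_sentiment(stars))
```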

Here, the PS4 reviews are split into the negative set with ratings <= 3:

In [None]:
is_negative = ps4reviews['stars'] <= 3
ps4reviews_negative = ps4reviews[is_negative]
ps4reviews_negative.head(5)

It looks like the code was successful and the dataset has been created.

Here, the shape for the negative review dataset is shown:

In [None]:
ps4reviews_negative.shape

As can be seen, there are 137 negative reviews in the PS4 dataset.

Here, the PS4 reviews are split into the positive set with ratings >= 4:

In [None]:
is_positive = ps4reviews['stars'] >= 4
ps4reviews_positive = ps4reviews[is_positive]
ps4reviews_positive.head(5)

In [None]:
ps4reviews_positive.shape

There are 1,063 positive reviews for the PS4, with 2 columns.

# Data Cleaning (PS4 Positive Reviews)

The following section is done in order to prepare the PS4 positive reviews for the LDA model.<br>
The data will need to be split, tokenized, and lemmatized in order to prepare it for the model.<br>
Visualizations will also be built in order to take a better look at the data.<br>

The following is a function which is used in order to split, tokenize, and plot a bar graph of the most frequent words
in the PS4 positive review dataset:

In [None]:
def freq_words(x, terms = 30):
    # Join reviews with a space so words at review boundaries are not merged
    all_words = ' '.join([text for text in x])
    all_words = all_words.split()
    
    # Count each word and keep the `terms` most frequent ones
    fdist = FreqDist(all_words)
    words_df = pd.DataFrame({'word': list(fdist.keys()), 'count':list(fdist.values())})
    d = words_df.nlargest(columns = 'count', n = terms)
    plt.figure(figsize = (20,5))
    
    # Plot the word counts as a bar graph
    ax = sns.barplot(data = d, x = 'word', y = 'count')
    ax.set(ylabel = 'Count')
    plt.show()
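
The counting step of this function can be illustrated without any plotting. The sketch below uses only the standard library and assumes simple whitespace tokenization; `top_words` is a hypothetical name. The reviews are joined with a space so that the last word of one review and the first word of the next stay separate tokens.

```python
from collections import Counter

def top_words(texts, n=5):
    """Count whitespace-separated words across all texts and return the n most common."""
    # Join with a space so words at the boundary between two reviews are not merged.
    all_words = " ".join(texts).split()
    return Counter(all_words).most_common(n)

sample = ["the console is great", "the controller is great too"]
print(top_words(sample, 3))
```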

In [None]:
freq_words(ps4reviews_positive['comment'])

Looking at the graph above, the most frequent words are stopwords such as "the", "and", and "for".<br> These words are not useful because they don't carry sentiment.<br> The dataset needs further cleaning to prepare it for the model.

The following code replaces every character that is not a letter or "#" with a space:

In [None]:
ps4reviews_positive['comment'] = ps4reviews_positive['comment'].str.replace("[^a-zA-Z#]", " ", regex=True)

This code is run in order to bring stopwords into the notebook:

In [None]:
stop_words = set(stopwords.words('english'))

This function removes the stopwords from a tokenized review and is reused for the other datasets:

In [None]:
def remove_stopwords(rev):
    rev_new = " ".join([i for i in rev if i not in stop_words])
    return rev_new

The following code is run in order to remove short words (length < 3) to make the data cleaner:

In [None]:
ps4reviews_positive['comment'] = ps4reviews_positive['comment'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

This code is run to remove stopwords from the text:

In [None]:
reviews = [remove_stopwords(r.split()) for r in ps4reviews_positive['comment']]

This code is run in order to make the text lowercase:

In [None]:
reviews = [r.lower() for r in reviews]
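
The cleaning steps so far can also be sketched as a single function. This is an illustration under stated assumptions, not the code used in this notebook: `clean_review` is a hypothetical helper, and the tiny `STOP_WORDS` set stands in for NLTK's English stopword list. It lowercases the text first, which differs from the order used above, so that capitalized stopwords such as "The" are also caught.

```python
import re

# Hypothetical, tiny stopword list for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "and", "for", "this", "was", "with"}

def clean_review(text, stop_words=STOP_WORDS, min_len=3):
    """Lowercase, keep letters only, then drop short words and stopwords."""
    text = text.lower()                    # lowercase first so "The" matches "the"
    text = re.sub(r"[^a-z#]", " ", text)   # replace anything that is not a letter or '#'
    words = [w for w in text.split() if len(w) >= min_len and w not in stop_words]
    return " ".join(words)

print(clean_review("The PS4 was GREAT for the price!!"))
```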

In [None]:
freq_words(reviews, 35)

Re-running the frequency bar graph shows that the stopwords have been removed and the reviews are closer 
to being ready for the model.<br> The next step is to lemmatize the data.

The following code loads the spaCy English model and defines a function that lemmatizes the data, keeping only nouns and adjectives.<br> 
The function is reused for the other datasets as well:

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

def lemmatization(texts, tags = ['NOUN', 'ADJ']): 
    output = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output

This code splits each review into tokens:

In [None]:
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

The code here is run in order to lemmatize the reviews:

In [None]:
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1]) # print lemmatized review

The following code joins each lemmatized review back into a string and stores it in a new "reviews" column, so that a frequency graph 
of the lemmatized reviews can be shown later on:

In [None]:
reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

ps4reviews_positive['reviews'] = reviews_3

Here, a frequency bar graph of the top 20 words is generated from the new "reviews" column created in the previous step:

In [None]:
freq_words(ps4reviews_positive['reviews'], 20)

As can be seen, the tokens are now cleaner as words such as "game", "system", and "console" are now clearly showing in the dataset.<br>
The dataset is now ready for model building.

# LDA Model (PS4 Positive Reviews)

The code below builds a gensim dictionary from the lemmatized reviews and converts each review into a bag-of-words vector to set up the LDA model:

In [None]:
dictionary = corpora.Dictionary(reviews_2)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]
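
To make the bag-of-words idea concrete, the sketch below reproduces it with the standard library only. `build_vocab` and `to_bow` are hypothetical helpers written for illustration; gensim's `Dictionary` may assign ids in a different order, but the output has the same shape: each document becomes a list of (token id, count) pairs.

```python
from collections import Counter

def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = [["game", "graphic", "game"], ["controller", "game"]]
vocab = build_vocab(docs)
print(vocab)
print([to_bow(d, vocab) for d in docs])
```

Word order is discarded in this representation, which is why the earlier cleaning and lemmatization steps matter so much: the model only sees which terms occur and how often.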

Next, the LDA model class is referenced from gensim and the model is built.<br>
The number of topics, chosen through trial and error, was 3.

In [None]:
LDA = gensim.models.ldamodel.LdaModel

lda_model = LDA(corpus = doc_term_matrix, 
                id2word = dictionary,
                num_topics = 3,
                random_state = 100,
                chunksize = 1000,
                passes = 50)

This code prints the topics and the words associated with each one:

In [None]:
lda_model.print_topics()
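
Each topic is printed as a string of weighted words joined by plus signs. The sketch below shows one way to turn such a string into (word, weight) pairs for easier reading or sorting. `parse_topic` is a hypothetical helper, and the example string uses made-up weights rather than output from this model.

```python
import re

def parse_topic(topic_string):
    """Split a string like '0.035*"game" + 0.020*"system"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

example = '0.035*"game" + 0.020*"system" + 0.015*"controller"'  # made-up weights
print(parse_topic(example))
```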

This code visualizes the LDA model as an interactive bubble plot:

In [None]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
vis

As can be seen above, there are 3 topics which were created.<br> Based on the words within each topic, I can take a reasonable guess as to what the topics are about:<br>

Topic 1: Positive reviews for PS4 features: graphics, controller, play, well, easy<br>
Topic 2: Positive reviews for PS4 internal features: amazing, system, awesome, feature<br>
Topic 3: Positive reviews on PS4 speed and power.<br>

It seems to me that people were really happy with the PS4 based on the firmware and software.<br>
There were positive comments for the controller, graphics, and the power of the PS4 as well.<br>

# Data Cleaning (PS4 Negative Reviews)

The following section is done in order to prepare the PS4 negative reviews for the LDA model.<br>
The data will need to be split, tokenized, and lemmatized in order to prepare it for the model.<br>
Visualizations will also be built in order to take a better look at the data.

In [None]:
freq_words(ps4reviews_negative['comment'])

Looking at the graph above, the most frequent words are stopwords such as "the", "and", and "for".<br> These words are not useful because they don't carry sentiment.<br> The dataset needs further cleaning to prepare it for the model.

The following code is run in order to remove short words (length < 3) to make the data cleaner:

In [None]:
ps4reviews_negative['comment'] = ps4reviews_negative['comment'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

The following code is run to remove stopwords from the text:

In [None]:
reviews = [remove_stopwords(r.split()) for r in ps4reviews_negative['comment']]

This code is run to make the text lowercase:

In [None]:
reviews = [r.lower() for r in reviews]

In [None]:
freq_words(reviews, 35)

Re-running the frequency bar graph shows that the stopwords have been removed and the reviews are closer to being ready for the model.<br> The next step is to lemmatize the data.

This code splits each review into tokens:

In [None]:
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

This code below is run in order to lemmatize the reviews:

In [None]:
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1]) 

The following code joins each lemmatized review back into a string and stores it in a new "reviews" column, so that a frequency graph 
of the lemmatized reviews can be shown later on:

In [None]:
reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

ps4reviews_negative['reviews'] = reviews_3


In [None]:
freq_words(ps4reviews_negative['reviews'], 20)

As can be seen, the tokens are now cleaner as words such as "game", "system", and "console" are now clearly showing in the dataset.<br>The dataset is now ready for model building.

# LDA Model (PS4 Negative Reviews)

The code below builds a gensim dictionary from the lemmatized reviews and converts each review into a bag-of-words vector to set up the LDA model.


In [None]:
dictionary = corpora.Dictionary(reviews_2)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]

Next, the LDA model class is referenced from gensim and the model is built.<br>
The number of topics, chosen through trial and error, was 4.

In [None]:
LDA = gensim.models.ldamodel.LdaModel

lda_model = LDA(corpus = doc_term_matrix, 
                id2word = dictionary,
                num_topics = 4,
                random_state = 100,
                chunksize = 1000,
                passes = 50)

This code prints the topics and the words associated with each one:

In [None]:
lda_model.print_topics()

This code visualizes the LDA model as an interactive bubble plot:

In [None]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
vis

As can be seen above, there are 4 topics which are clearly separated. Based on the words within each topic, I can take a reasonable guess as to what the topics are about:<br>

Topic 1: Possible issues with graphics and complaints about system being old: hard, drive, old, product<br>
Topic 2: Possible complaints over the price and length of warranty: disc, money, issue, warranty<br>
Topic 3: Issues with product hardware: eject, disc, warranty, sound, issue<br>
Topic 4: Calls for refund due to the condition of the PS4: refund, death, condition<br>

It seems to me that most customers complained about the PS4 being an old system, the length of the warranty, and hardware issues such as disc ejection and sound.<br>There were also customers who wanted a refund for possible product defects.

# Data Splitting: PS5 Reviews

Here, the PS5 reviews are split into the negative set with ratings <= 3:

In [None]:
negative =  ps5reviews['Rating'] <= 3
ps5reviews_negative = ps5reviews[negative]
ps5reviews_negative.head(5)

In [None]:
ps5reviews_negative.shape

There are 339 reviews with 2 columns in the dataframe.

Here, the PS5 reviews are split into the positive set with ratings >= 4:

In [None]:
positive =  ps5reviews['Rating'] >= 4
ps5reviews_positive = ps5reviews[positive]
ps5reviews_positive.head(5)

In [None]:
ps5reviews_positive.shape

There are 742 reviews and 2 columns in the dataframe.

# Data Cleaning (PS5 Positive Reviews)

The following section is done in order to prepare the PS5 positive reviews for the LDA model.<br>
The data will need to be split, tokenized, and lemmatized in order to prepare it for the model.<br>
Visualizations will also be built in order to take a better look at the data.

The following produces a frequency bar graph for the PS5 positive reviews:

In [None]:
freq_words(ps5reviews_positive['Comment'])

Looking at the graph above, the most frequent words are stopwords such as "the", "and", and "for".<br>These words are not useful because they don't carry sentiment.<br>The dataset needs further cleaning to prepare it for the model.

The following code replaces every character that is not a letter or "#" with a space:

In [None]:
ps5reviews_positive['Comment'] = ps5reviews_positive['Comment'].str.replace("[^a-zA-Z#]", " ", regex=True)

The following code is run in order to remove short words (length < 3) to make the data cleaner:

In [None]:
ps5reviews_positive['Comment'] = ps5reviews_positive['Comment'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

The following code is run in order to remove the stopwords from the data:

In [None]:
reviews = [remove_stopwords(r.split()) for r in ps5reviews_positive['Comment']]

This code is run to make the text lowercase:

In [None]:
reviews = [r.lower() for r in reviews]

In [None]:
freq_words(reviews, 20)

Re-running the frequency bar graph shows that the stopwords have been removed and the reviews are closer to being ready for the model.<br>The next step is to lemmatize the data.

The following code is run in order to separate the text into tokens:

In [None]:
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

This code is run in order to lemmatize the data:

In [None]:
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1]) 

The following code joins each lemmatized review back into a string and stores it in a new "reviews" column, so that a frequency graph 
of the lemmatized reviews can be shown later on:

In [None]:
reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

ps5reviews_positive['reviews'] = reviews_3

In [None]:
freq_words(ps5reviews_positive['reviews'], 20)

As can be seen, the tokens are now cleaner as words such as "game", "system", and "console" are now clearly showing in the dataset.<br> The dataset is now ready for model building.

# LDA Model (PS5 Positive Reviews)

The code below builds a gensim dictionary from the lemmatized reviews and converts each review into a bag-of-words vector to set up the LDA model.

In [None]:
dictionary = corpora.Dictionary(reviews_2)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]

Next, the LDA model class is referenced from gensim and the model is built.<br>
The number of topics, chosen through trial and error, was 4.

In [None]:
LDA = gensim.models.ldamodel.LdaModel

lda_model = LDA(corpus = doc_term_matrix, 
                id2word = dictionary,
                num_topics = 4,
                random_state = 100,
                chunksize = 1000,
                passes = 50)

This code prints the topics and the words associated with each one:

In [None]:
lda_model.print_topics()

This code visualizes the LDA model as an interactive bubble plot:

In [None]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
vis

As can be seen above, there are 4 topics which are clearly separated. Based on the words within each topic, I can take a reasonable guess as to what the topics are about:<br>

Topic 1: Great reviews on how fast the system is: Load, fast, awesome, time, feedback<br>
Topic 2: Great reviews on the design and controller and speed: design, controller, speed, performance<br>
Topic 3: Positive reviews on the hardware and the system being quiet.<br>
Topic 4: Positive comments on the graphics and the system being perfect: Graphic, worth, love, release, perfect<br>

It seems to me that most customers were really happy with how fast the PS5 is compared to the PS4.<br> 
There is also great feedback on the controller and the system design.<br> "Quiet" also shows up, which likely refers to
how quiet the system is when running games.<br> Some consumers even called the system perfect.

# Data Cleaning (PS5 Negative Reviews)

The following section is done in order to prepare the PS5 negative reviews for the LDA model.
The data will need to be split, tokenized, and lemmatized in order to prepare it for the model.
Visualizations will also be built in order to take a better look at the data.

The following code is run to generate a frequency bar graph for the negative reviews:

In [None]:
freq_words(ps5reviews_negative['Comment'])

Looking at the graph above, the most frequent words are stopwords such as "the", "and", and "for".<br> These words are not useful because they don't carry sentiment.<br>The dataset needs further cleaning to prepare it for the model.

This code replaces every character that is not a letter or "#" with a space:

In [None]:
ps5reviews_negative['Comment'] = ps5reviews_negative['Comment'].str.replace("[^a-zA-Z#]", " ", regex=True)

The following code is run in order to remove short words (length < 3) to make the data cleaner:

In [None]:
ps5reviews_negative['Comment'] = ps5reviews_negative['Comment'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

This code removes the stopwords from the text:

In [None]:
reviews = [remove_stopwords(r.split()) for r in ps5reviews_negative['Comment']]

Here, the text is lowercased:

In [None]:
reviews = [r.lower() for r in reviews]

In [None]:
freq_words(reviews, 20)

Re-running the frequency bar graph shows that the stopwords have been removed and the reviews are closer to being ready for the model.<br>The next step is to lemmatize the data.

Here, the text within the dataset is split into tokens:

In [None]:
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

The data is lemmatized through this code:

In [None]:
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1]) 

The following code joins each lemmatized review back into a string and stores it in a new "reviews" column, so that a frequency graph 
of the lemmatized reviews can be shown later on:

In [None]:
reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

ps5reviews_negative['reviews'] = reviews_3

In [None]:
freq_words(ps5reviews_negative['reviews'], 20)

As can be seen, the tokens are now clearer as words such as "game", "system", and "money" are now clearly showing in the dataset.<br> The dataset is now ready for model building.

# LDA Model (PS5 Negative Reviews)

The code below builds a gensim dictionary from the lemmatized negative reviews and converts each review into a bag-of-words vector to set up the LDA model.


In [None]:
dictionary = corpora.Dictionary(reviews_2)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]

Next, the LDA model class is referenced from gensim and the model is built.<br>
The number of topics, chosen through trial and error, was 4.

In [None]:
LDA = gensim.models.ldamodel.LdaModel

lda_model = LDA(corpus = doc_term_matrix, 
                id2word = dictionary,
                num_topics = 4,
                random_state = 100,
                chunksize = 1000,
                passes = 50)

This code prints the topics and the words associated with each one:

In [None]:
lda_model.print_topics()

This code visualizes the LDA model as an interactive bubble plot:

In [None]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
vis

As can be seen above, there are 4 topics which are clearly separated.<br>Based on the words within each topic, I can take a reasonable guess as to what the topics are about:<br>

Topic 1: Complaints mostly on bots: scam, bot, ridiculousness, reseller<br>
Topic 2: Complaints mostly on the supply and demand of the product<br>
Topic 3: Complaints mostly on the price of the PS5<br>
Topic 4: Complaints on the PS5 crashing<br>

It seems to me that most customers were really unhappy with the scalpers and bots that bought most of the PS5s
online during the COVID-19 pandemic.<br>There were complaints about the limited supply and high demand of the product.<br>
There were also complaints about the price of the PS5.<br>
Finally, there were a few cases where the PS5 tended to crash for some customers.<br> 

These are interesting results because few of the complaints are about the product itself.<br>The complaints are mostly
about obtaining a PS5, which has been difficult for consumers due to the pandemic. 

# Data Splitting: Xbox One X Reviews

Here, the Xbox One X reviews are split into the positive set with ratings >= 4:

In [None]:
positive =  xboxonexreviews['stars'] >= 4
xboxonex_positive = xboxonexreviews[positive]
xboxonex_positive.head(5)

In [None]:
xboxonex_positive.shape

There are 1,853 positive review rows and 2 columns in the dataframe.

Here, the Xbox One X reviews are split into the negative set with ratings <= 3:

In [None]:
negative =  xboxonexreviews['stars'] <= 3
xboxonex_negative = xboxonexreviews[negative]
xboxonex_negative.head(5)

In [None]:
xboxonex_negative.shape

There are 612 negative review rows and 2 columns in the dataframe.
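
Since the positive and negative sets differ a lot in size, it can help to compare them as shares rather than raw counts. The sketch below does that arithmetic with the row counts reported by the `.shape` outputs in this notebook; `positive_share` is a hypothetical helper added only for illustration.

```python
def positive_share(n_positive, n_negative):
    """Fraction of reviews that are positive."""
    return n_positive / (n_positive + n_negative)

# (positive, negative) review counts as reported earlier in this notebook.
counts = {
    "PS4": (1063, 137),
    "PS5": (742, 339),
    "Xbox One X": (1853, 612),
}
for console, (pos, neg) in counts.items():
    print(f"{console}: {positive_share(pos, neg):.1%} positive")
```

These shares only describe the scraped samples, so they should be read as context for the topic models rather than as a measure of overall customer satisfaction.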

# Data Cleaning (Xbox One X Positive Reviews)

The following section is done in order to prepare the Xbox One X positive reviews for the LDA model.<br>
The data will need to be split, tokenized, and lemmatized in order to prepare it for the model.<br>
Visualizations will also be built in order to take a better look at the data.

In [None]:
freq_words(xboxonex_positive['comment'])

Looking at the graph above, the most frequent words are stopwords such as "the", "and", and "for".<br>These words are not useful because they don't carry sentiment.<br>The dataset needs further cleaning to prepare it for the model.

This code replaces every character that is not a letter or "#" with a space:

In [None]:
xboxonex_positive['comment'] = xboxonex_positive['comment'].str.replace("[^a-zA-Z#]", " ", regex=True)

The following code is run in order to remove short words (length < 3) to make the data cleaner:

In [None]:
xboxonex_positive['comment'] = xboxonex_positive['comment'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

This code removes the stopwords from the dataset:

In [None]:
reviews = [remove_stopwords(r.split()) for r in xboxonex_positive['comment']]

This code makes the text lowercase:

In [None]:
reviews = [r.lower() for r in reviews]

In [None]:
freq_words(reviews, 20)

Re-running the frequency bar graph shows that the stopwords have been removed and the reviews are closer to 
being ready for the model.<br>The next step is to lemmatize the data.

Here, the words are split into tokens:

In [None]:
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

Below, the positive reviews are lemmatized:

In [None]:
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1]) 

The following code joins each lemmatized review back into a string and stores it in a new "reviews" column, so that a frequency graph 
of the lemmatized reviews can be shown later on:

In [None]:
reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

xboxonex_positive['reviews'] = reviews_3

In [None]:
freq_words(xboxonex_positive['reviews'], 20)

As can be seen, the tokens are now clearer as words such as "game", "system", and "time" are now clearly showing in the dataset.<br>The dataset is now ready for model building.

# LDA Model (Xbox One X Positive Reviews)

The code below builds a gensim dictionary from the lemmatized reviews and converts each review into a bag-of-words vector to set up the LDA model.

In [None]:
dictionary = corpora.Dictionary(reviews_2)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]

Next, the LDA model class is referenced from gensim and the model is built.<br>
The number of topics, chosen through trial and error, was 3.

In [None]:
LDA = gensim.models.ldamodel.LdaModel

lda_model = LDA(corpus = doc_term_matrix, 
                id2word = dictionary,
                num_topics = 3,
                random_state = 100,
                chunksize = 1000,
                passes = 50)

This code prints the topics and the words associated with each one.

In [None]:
lda_model.print_topics()

This code visualizes the LDA model as a bubble plot:

In [None]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
vis

As can be seen above, there are 3 topics which are clearly separated.<br>Based on the words within each topic, I can take a reasonable guess as to what the topics are about:<br>

Topic 1: The topic seems to be on the power and quality of the system: powerful, drive, graphic, quality, amazing<br>
Topic 2: The topic seems to be on the condition of the Xbox and the controller: perfect, controller, excellent, condition<br>
Topic 3: The topic is focused on the speed of the Xbox: fast, time, awesome, quick<br>

Based on the reviews, it seems most customers were happy with how quick the system was, the graphics, and the experience.<br>
There were also positive reviews about the controller, firmware, size, and features.<br>Furthermore, some consumers praised the quality of the system, calling the product perfect.

#                              Data Cleaning (Xbox One X Negative Reviews)

The following section is done in order to prepare the Xbox One X negative reviews for the LDA model.
The data will need to be split, tokenized, and lemmatized in order to prepare it for the model.
Visualizations will also be built in order to take a better look at the data.

Here is a frequency bar graph for the Xbox One X negative comments:

In [None]:
freq_words(xboxonex_negative['comment'])

Looking above, the most frequent words are "the", "and", and "for".<br>These words are not useful because they do not convey sentiment.<br>The dataset will need to be cleaned further to prepare it for the model.

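This code replaces every character other than letters and "#" with a space:

In [None]:
# regex=True is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default
xboxonex_negative['comment'] = xboxonex_negative['comment'].str.replace("[^a-zA-Z#]", " ", regex=True)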

The following code is run in order to remove short words (length < 3) to make the data cleaner:

In [None]:
xboxonex_negative['comment'] = xboxonex_negative['comment'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

This code removes the stopwords from the dataset:

In [None]:
reviews = [remove_stopwords(r.split()) for r in xboxonex_negative['comment']]

This code makes the text lowercase:

In [None]:
reviews = [r.lower() for r in reviews]

In [None]:
freq_words(reviews, 20)

Re-running the frequency bar graph shows that the stopwords have been removed and the reviews are closer to being ready for the model.<br>The next step is to lemmatize the data.
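
The cleaning steps above can also be read as a single function applied to one review at a time. The following is a minimal sketch using only the standard library; `clean_comment` and the short `STOPWORDS` set are hypothetical stand-ins, not the notebook's `remove_stopwords` helper or its stopword list. One deliberate difference is that the sketch lowercases before filtering, so a capitalized stopword such as "The" is also removed; the notebook lowercases after stopword removal, so whether such words survive there depends on how `remove_stopwords` is written.

```python
import re

# Hypothetical stand-in for a full stopword list such as NLTK's.
STOPWORDS = {"the", "and", "for", "was", "this", "with"}


def clean_comment(text, stopwords=STOPWORDS, min_len=3):
    """Apply the cleaning steps to a single review string."""
    # Replace everything except letters and '#' with a space.
    text = re.sub(r"[^a-zA-Z#]", " ", text)
    # Lowercase first so stopword matching ignores capitalization.
    tokens = text.lower().split()
    # Drop words shorter than min_len and any stopwords.
    kept = [w for w in tokens if len(w) >= min_len and w not in stopwords]
    return " ".join(kept)


sample = "The Xbox was DOA, and support took 3 weeks!"
print(clean_comment(sample))  # xbox doa support took weeks
```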

Here, the reviews are tokenized:

In [None]:
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

Below, the reviews are lemmatized:

In [None]:
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[2]) 

The following code joins each review's lemmatized tokens back into a single string and stores the result in a new 'reviews' column of the dataframe, 
so that a frequency graph of the lemmatized reviews can be shown afterwards:

In [None]:
reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

xboxonex_negative['reviews'] = reviews_3

In [None]:
freq_words(xboxonex_negative['reviews'], 20)

As can be seen, the tokens are now cleaner as words such as "game", "system", and "hour" are now clearly showing in the dataset.<br>The dataset is now ready for model building.

#                                              LDA Model (Xbox One X Negative Reviews)

The code below builds a gensim dictionary from the lemmatized reviews and converts each review into a bag-of-words vector; together these form the input for the LDA model.

In [None]:
dictionary = corpora.Dictionary(reviews_2)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]

Next, the LdaModel class is assigned to a shorter name and the model is built.<br>
The number of topics, chosen through trial and error, was 4.

In [None]:
LDA = gensim.models.ldamodel.LdaModel

lda_model = LDA(corpus = doc_term_matrix, 
                id2word = dictionary,
                num_topics = 4,
                random_state = 100,
                chunksize = 1000,
                passes = 50)

This code prints the topics and the words associated with each one:

In [None]:
lda_model.print_topics()

This code visualizes the LDA model as a bubble plot:

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
vis

As can be seen above, there are 4 topics which are clearly separated.<br>Based on the words within each topic, I can take a reasonable guess as to what the topics are about:<br>                                    

Topic 1: Possible issues with the price of the product: issue, price, bad, product<br>
Topic 2: Possible issues with the warranty and customer service: problem, warranty, customer, service<br>
Topic 3: Topic is on product defects: error, issue, factory, product, reset<br>
Topic 4: Possible issues with scammers for the product as well as calls for replacements: disc, box, internet, scam, replacement<br>

Based on the topics, it looks like there were more issues with the Xbox One X than with the Playstation 4.<br>
The price of the system seemed to be an issue.<br>It also seems there were possible issues with customer support, as customers were calling to replace their Xbox One X.<br>The issues also extend to product defects, as there are reviews where people needed to reset their machines.
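
The topic labels above were assigned by reading the top words of each topic. As a small aid for that step, here is a sketch that pulls the words out of strings shaped like the output of `lda_model.print_topics()`. The helper `parse_topic` is hypothetical, and the example words and weights are made-up placeholders, not results from this notebook.

```python
import re


def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# Placeholder data in the (topic_id, string) shape of print_topics().
example_topics = [
    (0, '0.030*"issue" + 0.020*"price"'),
    (1, '0.025*"warranty" + 0.015*"service"'),
]

for topic_id, topic_string in example_topics:
    words = [word for word, _ in parse_topic(topic_string)]
    print(topic_id, words)
```

Listing only the words side by side makes it easier to compare topics when writing the short descriptions.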


#                                        Data Splitting: Xbox Series X Reviews

Here, the Xbox Series X reviews are split into the positive set with ratings >= 4:

In [None]:
positive =  xboxseriesxreviews['stars'] >= 4
xboxseriesxreviews_positive = xboxseriesxreviews[positive]
xboxseriesxreviews_positive.head(5)

In [None]:
xboxseriesxreviews_positive.shape

There are 448 reviews with 3 columns in the dataset.

Below, the Xbox Series X reviews are split into the negative set with ratings <= 3:

In [None]:
negative =  xboxseriesxreviews['stars'] <= 3
xboxseriesxreviews_negative = xboxseriesxreviews[negative]
xboxseriesxreviews_negative.head(5)

In [None]:
xboxseriesxreviews_negative.shape

There are 115 reviews with 3 columns in the dataset.
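
The split above uses boolean masks, and the resulting subsets later receive new column values during cleaning. A minimal sketch of the same pattern on toy data follows; the rows are made up, the toy table has only two columns, and the `.copy()` calls are an optional addition (not part of the notebook's code) that avoids pandas' SettingWithCopyWarning when columns are assigned on a filtered subset.

```python
import pandas as pd

# Made-up rows; the real review tables have 3 columns.
toy = pd.DataFrame({
    "stars": [5, 4, 3, 1, 2],
    "comment": ["great", "fast", "okay", "broken", "loud"],
})

# .copy() gives each subset its own data, so later column assignments
# act on the subset itself rather than on a view of the original table.
toy_positive = toy[toy["stars"] >= 4].copy()
toy_negative = toy[toy["stars"] <= 3].copy()

# Assigning a column on the copies works without a warning.
toy_positive["comment"] = toy_positive["comment"].str.upper()

print(toy_positive.shape, toy_negative.shape)  # (2, 2) (3, 2)
```

With these thresholds every review lands in exactly one subset, and 3-star reviews are counted as negative.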

#                         Data Cleaning (Xbox Series X Positive Reviews)

The following section is done in order to prepare the Xbox Series X positive reviews for the LDA model.
The data will need to be split, tokenized, and lemmatized in order to prepare it for the model.
Visualizations will also be built in order to take a better look at the data.

Here is a frequency bar graph of the positive reviews:

In [None]:
freq_words(xboxseriesxreviews_positive['comment'])

Looking above, the most frequent words are "the", "and", and "for".<br>These words are not useful because they do not convey sentiment.<br>The dataset will need to be cleaned further to prepare it for the model.

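This code replaces every character other than letters and "#" with a space:

In [None]:
# regex=True is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default
xboxseriesxreviews_positive['comment'] = xboxseriesxreviews_positive['comment'].str.replace("[^a-zA-Z#]", " ", regex=True)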

The following code is run in order to remove short words (length < 3) to make the data cleaner:

In [None]:
xboxseriesxreviews_positive['comment'] = xboxseriesxreviews_positive['comment'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

The following removes the stopwords from the text:

In [None]:
reviews = [remove_stopwords(r.split()) for r in xboxseriesxreviews_positive['comment']]

The following turns the reviews lowercase:

In [None]:
reviews = [r.lower() for r in reviews]

In [None]:
freq_words(reviews, 20)

Re-running the frequency bar graph shows that the stopwords have been removed and the reviews are closer to being ready for the model.<br>The next step is to lemmatize the data.

Below, the reviews are tokenized:

In [None]:
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

Below, the reviews are lemmatized:

In [None]:
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1]) 

The following code joins each review's lemmatized tokens back into a single string and stores the result in a new 'reviews' column of the dataframe, 
so that a frequency graph of the lemmatized reviews can be shown afterwards:

In [None]:
reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

xboxseriesxreviews_positive['reviews'] = reviews_3

In [None]:
freq_words(xboxseriesxreviews_positive['reviews'], 20)

As can be seen, the tokens are now cleaner as words such as "game", "system", and "hour" are now clearly showing in the dataset.<br>The dataset is now ready for model building.
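
The frequency graphs in this notebook come from the `freq_words` helper. As a rough picture of the counting behind such a graph, here is a minimal sketch using `collections.Counter`; `top_words` and the sample strings are hypothetical and are not the notebook's implementation.

```python
from collections import Counter


def top_words(texts, n=20):
    """Return the n most common whitespace-separated words across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)


# Made-up cleaned reviews.
sample_reviews = ["game load fast", "fast console game", "game pass"]
print(top_words(sample_reviews, 2))  # [('game', 3), ('fast', 2)]
```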

#                                        LDA Model (Xbox Series X Positive Reviews)

The code below builds a gensim dictionary from the lemmatized reviews and converts each review into a bag-of-words vector; together these form the input for the LDA model.

In [None]:
dictionary = corpora.Dictionary(reviews_2)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]

Next, the LdaModel class is assigned to a shorter name and the model is built.<br>
The number of topics, chosen through trial and error, was 3.

In [None]:
LDA = gensim.models.ldamodel.LdaModel

lda_model = LDA(corpus = doc_term_matrix, 
                id2word = dictionary,
                num_topics = 3,
                random_state = 100,
                chunksize = 1000,
                passes = 50)

This code prints the topics and the words associated with each one:

In [None]:
lda_model.print_topics()

This code visualizes the LDA model as a bubble plot:

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
vis

As can be seen above, there are 3 topics which are clearly separated.<br>Based on the words within each topic, I can take a reasonable guess as to what the topics are about:<br>         

Topic 1: Topic seems to be on how fast the system has become: amazing, fast, load, quick, new, powerful<br>
Topic 2: Topic on speed but also graphics of the system: controller, ssd, speed, quiet, feature, launch, graphic<br>
Topic 3: Topic seems to be on the size of the Xbox: new, experience, old, system, big, box<br>

Generally, it seems that early reviews of the system are positive based on the speed of the new Xbox.<br>There were also 
some positive comments on the SSD, controller, and graphics.<br>Finally, the positive reviews also included comments
on the size of the Xbox Series X.

#                                 Data Cleaning (Xbox Series X Negative Reviews)

The following section is done in order to prepare the Xbox Series X negative reviews for the LDA model.<br>
The data will need to be split, tokenized, and lemmatized in order to prepare it for the model.<br>
Visualizations will also be built in order to take a better look at the data.<br>

Below is a frequency bar graph of the Xbox Series X negative reviews:

In [None]:
freq_words(xboxseriesxreviews_negative['comment'])

Looking above, the most frequent words are "the", "and", and "for".<br>These words are not useful because they do not convey sentiment.<br>The dataset will need to be cleaned further to prepare it for the model.

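This code replaces every character other than letters and "#" with a space:

In [None]:
# regex=True is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default
xboxseriesxreviews_negative['comment'] = xboxseriesxreviews_negative['comment'].str.replace("[^a-zA-Z#]", " ", regex=True)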

The following code is run in order to remove short words (length < 3) to make the data cleaner:

In [None]:
xboxseriesxreviews_negative['comment'] = xboxseriesxreviews_negative['comment'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

The following removes the stopwords from the text:

In [None]:
reviews = [remove_stopwords(r.split()) for r in xboxseriesxreviews_negative['comment']]

The following makes the text lowercase:

In [None]:
reviews = [r.lower() for r in reviews]

In [None]:
freq_words(reviews, 20)

Re-running the frequency bar graph shows that the stopwords have been removed and the reviews are closer to being ready for the model.<br>The next step is to lemmatize the data.

Below, the following code splits the text into tokens: 

In [None]:
tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

Below, the reviews are then lemmatized:

In [None]:
reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1]) 

The following code joins each review's lemmatized tokens back into a single string and stores the result in a new 'reviews' column of the dataframe, 
so that a frequency graph of the lemmatized reviews can be shown afterwards:

In [None]:
reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

xboxseriesxreviews_negative['reviews'] = reviews_3

In [None]:
freq_words(xboxseriesxreviews_negative['reviews'], 20)

As can be seen, the tokens are now clearer as words such as "game", "system", and "hour" are now clearly showing in the dataset.<br>The dataset is now ready for model building.

#                                             LDA Model (Xbox Series X Negative Reviews)

The code below builds a gensim dictionary from the lemmatized reviews and converts each review into a bag-of-words vector; together these form the input for the LDA model.


In [None]:
dictionary = corpora.Dictionary(reviews_2)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]

Next, the LdaModel class is assigned to a shorter name and the model is built.
The number of topics, chosen through trial and error, was 5.

In [None]:
LDA = gensim.models.ldamodel.LdaModel

lda_model = LDA(corpus = doc_term_matrix, 
                id2word = dictionary,
                num_topics = 5,
                random_state = 100,
                chunksize = 1000,
                passes = 50)

This code prints the topics and the words associated with each one:

In [None]:
lda_model.print_topics()

This code visualizes the LDA model as a bubble plot:

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
vis

As can be seen above, there are 5 topics which are clearly separated.<br>Based on the words within each topic, I can take a reasonable guess as to what the topics are about:<br>      

Topic 1: Topic may possibly be on the hdr: issue, hdr<br>
Topic 2: Topic could be on complaints of the price and battery: price, issue, time, battery<br>
Topic 3: Possible complaints on the support service: call, hour, week, service, replacement<br>
Topic 4: Topic is on replacing the Xbox: boot, disc, available, replacement<br>
Topic 5: Topic is possibly based on the price and issues with the system<br>

Based on the topics, it seems the negative reviews for the system are similar to those of its predecessor.<br> There seem to be issues
with the hardware.<br>There are also issues with the price, time, and battery.<br>Furthermore, the support service for the Xbox
looks to continue to be an issue.<br>Already, customers are looking to replace their Xboxes.<br>This seems consistent with news reports of the Xbox overheating and breaking down.



#                                              Summary/ Data Limitations


Based on the entire analysis, I can conclude the following:<br>

1. Comparing the PS4 negative reviews with the PS5 negative reviews, it seems Sony has fixed most of the PS4's problems in the PS5.<br> The PS4 had issues with the hard drive, warranty, and hardware of the product.<br> In comparison, most consumers complained more about obtaining the PS5 than about the actual product itself.<br> This difference suggests that, from a product standpoint, the PS5 has few or no issues compared to its predecessor, which is a very good result for Sony.

2. Comparing the PS5 negative reviews with the Xbox Series X negative reviews, the latter seems to have product issues, as there were reviews pointing out the system dying and customers reaching out to customer support for refunds.<br> There were also issues with the price and battery.<br>Based on this analysis, it doesn't seem that Microsoft checked its new system for defects as thoroughly as Sony did.<br>This suggests that Microsoft needs to do a more thorough job of identifying these types of defects.<br>Sony seems to have done a great job, for the most part, in reducing the number of defects in the PS5.

3. Comparing the PS5 positive reviews with the Xbox Series X positive reviews, both seemed to be similar in their respective LDA topic models.<br>Both systems seem to be praised for their controller, hardware, and speed. Words such as "amazing" and "graphics" were part of both LDA models.<br>Both systems seem to have satisfied consumers' expectations based on the positive reviews.

4. Comparing the PS4 positive reviews with the Xbox One X positive reviews, consumers seem to be really satisfied.<br>The models both showed comments on the controller, design, and speed for both systems.<br>Both systems seem to have improved similarly.<br>The PS4 and Xbox One X positive reviews follow the same pattern as the PS5 and Xbox Series X positive reviews.

5. Comparing the Xbox One X negative reviews with the Xbox Series X negative reviews, both systems seem to have issues with customer service as well as product defects.<br>There were comments from consumers asking for replacements and refunds due to defects.<br>This is in sharp contrast to the Playstation reviews, as there were little to no comments from consumers asking for a refund.


Data limitations need to be recognized, though.<br>There was not a lot of data to analyze, as only about 6,000 reviews overall were analyzed for this sentiment analysis.<br>The PS5 and Xbox Series X are both fairly new, so it was difficult to get a good sample size for these reviews.<br>In contrast, PS4 and Xbox One X reviews were plentiful, considering both products have been around for some time.<br>
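
To put the sample-size concern in numbers, the short calculation below uses only the Xbox Series X counts reported in the data splitting section (448 positive and 115 negative reviews). It is a back-of-the-envelope sketch, not an additional analysis.

```python
# Review counts reported earlier in this notebook for the Xbox Series X split.
positive_count = 448
negative_count = 115

total = positive_count + negative_count
positive_share = positive_count / total

print(f"{total} reviews, {positive_share:.1%} positive")
# 563 reviews, 79.6% positive
```

The Xbox Series X negative topic model therefore rests on 115 reviews, which is worth keeping in mind when reading its five topics.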


Nevertheless, this analysis shows the different sentiments and comments that consumers have towards the PS4, Xbox One X, PS5, and Xbox Series X.<br>Readers should be able to follow this analysis, and both Sony and Microsoft can benefit from it in terms of improving customer experiences and looking for areas of growth and opportunity for their respective game consoles.