### Load Data

In [1]:
import numpy as np
import pandas as pd
import nltk
import re
from collections import Counter

In [2]:
#Read the reviews file
data = pd.read_csv('steam_final.csv',encoding = 'utf-8')
data.dropna(how='any', inplace=True)
data.drop_duplicates('review', inplace=True)

In [3]:
reviews=pd.DataFrame(columns=['review','recommended'])
reviews['review']=data['review']
reviews['recommended']=data['recommended']
reviews.head(5)

Unnamed: 0,review,recommended
0,"Great Game, terrible optimisation. Hopefully t...",False
1,Huh. I actually don't like it. I've 100%-ed ea...,False
2,very trashy talking about the tsev skyrim le,False
3,Lag as shit,False
4,Precarious policies.,False


In [4]:
#Clean up text
reviews['review'] = reviews['review'].str.replace(r"\b\w+n't\b",'not')
reviews['review'] = reviews['review'].str.replace(r'&#[0-9]+;', '')
reviews['review'] = reviews['review'].str.replace(r'[^\w\s]', '')
reviews['review'] = reviews['review'].str.replace(r"[\u4e00-\u9fa5]",'')
reviews['review'] = reviews['review'].str.lower()
reviews.head(5)

  reviews['review'] = reviews['review'].str.replace(r"\b\w+n't\b",'not')
  reviews['review'] = reviews['review'].str.replace(r'&#[0-9]+;', '')
  reviews['review'] = reviews['review'].str.replace(r'[^\w\s]', '')
  reviews['review'] = reviews['review'].str.replace(r"[\u4e00-\u9fa5]",'')


Unnamed: 0,review,recommended
0,great game terrible optimisation hopefully the...,False
1,huh i actually not like it ive 100ed each game...,False
2,very trashy talking about the tsev skyrim le,False
3,lag as shit,False
4,precarious policies,False


### Remove Common Stopwords

Stopwords are common words that appears frequently in the text but do not provide much meaning/insights for the purpose of NLP analysis. We want to remove those stopwords to reduce dimensionality for TF-IDF analysis.

In [5]:
from nltk.corpus import stopwords
stopwords = list(set(stopwords.words('english')))# see the set of words NLTK considers stopwords

First we want to generate a list of general stopwords: set of words NLTK considers stopwords. We will perform some modifications on this list based on our business case.

In [6]:
for word in ["she's", "his", "him", "she", "her", "he", "herself", "himself"]:
    stopwords.remove(word)
print(stopwords)

['just', 'wasn', "shan't", 'mustn', 'out', "mustn't", 'at', 'd', 'only', "wasn't", 'yours', 'are', 'me', 'didn', 'being', 'theirs', 'than', 'and', 'we', 'now', 'isn', 'hers', 'any', 'between', "aren't", 'such', 'you', 'there', 'then', "should've", 'under', 'aren', 'our', 'both', 'ma', 'up', 'few', 've', 'about', 'how', 'my', 'do', 'their', 'what', 'be', "needn't", 'whom', 'until', 'again', 'when', "you'll", 'not', 'some', 'down', 'so', 'further', 'needn', "you've", 'these', "that'll", 'had', 'against', 'of', 'is', 'by', 'other', 'with', 'over', 'doing', "hasn't", 'don', 'this', 'in', 'own', 'haven', "weren't", 'myself', 'can', 'while', 'shan', 'have', 'for', 'shouldn', "haven't", "don't", 'does', 'off', 'hasn', 'why', "isn't", 'the', 'who', 'on', 'm', 'once', 'i', 'those', 'won', 'was', "couldn't", 'as', 'to', 'couldn', 'its', 'ours', 'an', 'but', "wouldn't", "it's", 'am', "you're", 'into', 'o', 'after', 'nor', 'same', 'all', 'because', 'been', 'hadn', 'itself', "hadn't", 'above', 'her

In [7]:
reviews["review"] = reviews["review"].astype(str)

We want to keep gender related words in this case because they can help us analyze who our customers are and how they reviews our products based on demographic information.

In [8]:
#tokenize reviews
reviews["review"] = reviews["review"].apply(nltk.word_tokenize)

In [9]:
#remove stopwords in reviews
def remove_stopwords(tokens):
    tokens_copy = tokens.copy()
    for word in stopwords:
        while word in tokens_copy:
            tokens_copy.remove(word)
    return tokens_copy
reviews["removed_tokens"] = reviews['review'].apply(remove_stopwords)

In [10]:
reviews_removed_tokens = [" ".join(i) for i in reviews["removed_tokens"]]
reviews_removed_tokens[:10]

['great game terrible optimisation hopefully fix instead trying cash grab',
 'huh actually like ive 100ed game edmund ive fan minute end nigh came quickly expectations focus far precision platforming meat boys charm fluidity speed precision platforming 100 miles minute far methodical drawn mention lot forced trial error collectables pretty meh cartridges big ticket item however theyre merely stages lofi graphics music great theres odd charm fact 3050 soundrack public domain music remixed gothicrock stylings isaac animations graphics fantastic controls perfect self contained game map stage real cutscenes yes must backtrack world entirety search collectables mileage may one real charm bland color scheme world little variety mostly waiting jumping right spot repeat opposed meat boys dynamic levels far obstacles variety different mechanics speed kept things fresh end nigh basically meat boy distilled main mechanics remove wall jumping sprint switch focus fluidity precision platforming dull

In [11]:
#remove consecutively repeated patterns
reviews_removed_tokens = [re.sub(r'^(.+?)\1+', r'\1', i) for i in reviews_removed_tokens]

### Lemmatization

We would choose lemmatization because it is less aggressive and keeps the original form of a word. This avoids confusion caused by stemming because it reduces words to its original form. Lemmatization also considers words within the context, would could be more accurate in many ways when we perform NLP data analysis.

In [13]:
# lemmatize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [14]:
cleaned_texts = []
for text in reviews_removed_tokens:
    cleaned_texts.append(lemmatize_sentence(text))

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words = 'english',binary = True, min_df =0.005, max_features=200)
X = vectorizer.fit_transform(cleaned_texts)

vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurences: {vectorized_df.sum().sum()}")
#print(f"Word counts: {vectorized_df.sum()}")
list(vectorized_df.columns)

Shape of dataframe is (64122, 200)
Total number of occurences: 539538


['10',
 '100',
 '1010',
 '20',
 'able',
 'absolutely',
 'access',
 'actually',
 'add',
 'ai',
 'amaze',
 'amazing',
 'away',
 'awesome',
 'bad',
 'base',
 'best',
 'big',
 'bit',
 'bore',
 'boring',
 'break',
 'bug',
 'build',
 'buy',
 'care',
 'challenge',
 'change',
 'character',
 'combat',
 'come',
 'community',
 'complete',
 'completely',
 'content',
 'control',
 'cool',
 'crash',
 'day',
 'design',
 'developer',
 'devs',
 'die',
 'different',
 'dlc',
 'dont',
 'drop',
 'early',
 'easy',
 'end',
 'enemy',
 'enjoy',
 'especially',
 'expect',
 'experience',
 'fact',
 'fan',
 'far',
 'feature',
 'feel',
 'fight',
 'finish',
 'fix',
 'force',
 'fps',
 'free',
 'friend',
 'fuck',
 'fun',
 'game',
 'gameplay',
 'good',
 'graphic',
 'great',
 'guy',
 'half',
 'happen',
 'hard',
 'hell',
 'help',
 'high',
 'hit',
 'honestly',
 'hope',
 'hour',
 'huge',
 'id',
 'idea',
 'im',
 'instead',
 'issue',
 'item',
 'ive',
 'kill',
 'kind',
 'know',
 'lack',
 'learn',
 'leave',
 'let',
 'level',
 'l

In [16]:
#cleaned_df = pd.DataFrame(cleaned_texts)
#cleaned_df.to_csv('cleaned_df.csv', index=False)

### Regex Cleaning

In [17]:
import re
from textacy.preprocessing.replace import urls, hashtags, numbers, emails, emojis, currency_symbols
for i in range(len(cleaned_texts)):
    cleaned_texts[i] = re.sub(r"\b\w+n't\b",'not',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'id\b','i would',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\byouve\b','you have',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\bive\b','i have',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'br\b',' ',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\bn?o+?\b','no',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\bb+?a+?d+?\b','bad',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\bg+?o+?d+?\b','good',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\bgreat\b','good',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\bbest\b','good',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\bamazing\b','good',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\bevery time\b','',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\blook like\b','',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = re.sub(r'\bseem likeb','',cleaned_texts[i],flags = re.IGNORECASE)
    cleaned_texts[i] = numbers(cleaned_texts[i])
reviews['cleaned_text'] = cleaned_texts
reviews.head(10)

Unnamed: 0,review,recommended,removed_tokens,cleaned_text
0,"[great, game, terrible, optimisation, hopefull...",False,"[great, game, terrible, optimisation, hopefull...",good game terrible optimisation hopefully fix ...
1,"[huh, i, actually, not, like, it, ive, 100ed, ...",False,"[huh, actually, like, ive, 100ed, game, edmund...",huh actually like i have 100ed game edmund i h...
2,"[very, trashy, talking, about, the, tsev, skyr...",False,"[trashy, talking, tsev, skyrim, le]",trashy talk tsev skyrim le
3,"[lag, as, shit]",False,"[lag, shit]",lag shit
4,"[precarious, policies]",False,"[precarious, policies]",precarious policy
5,"[i, never, liked, this, game, anyway]",False,"[never, liked, game, anyway]",never like game anyway
6,"[we, need, chinese, about, eu4, the, third, pa...",False,"[need, chinese, eu4, third, party52, muyou, pa...",need chinese eu4 third party52 muyou party mak...
7,"[fuck, these, assholes]",False,"[fuck, assholes]",fuck asshole
8,"[im, torn, this, game, really, is, smartly, de...",False,"[im, torn, game, really, smartly, designed, pl...",im torn game really smartly design played game...
9,[boring],False,[boring],boring


### Custom stopwords

In [18]:
stopwords.extend(['steam', 'platform', 'another', 'huh', 'also', 'game', 'like', 'play', 'get', 'good', 'make', 'feel', 'take', 'time', 'one', 'games', 'would', 'hours', 'playing', 'people', 'played', 'im', 'first', 'also', 'every', 'many', 'go', 'got', '2', 'could', 'back', 'things'])

We would like to remove some words that are too general in our context. For example, steam and platform. Since we are analyzing steam reviews, the word "steam" could be mentioned multiple times, but it does not help much with understanding the pattern. We also want to remove some common words in the documents according to the word count. 

In [19]:
stopwords.extend(['always', 'very', 'really', 'totally', 'definitely', 'especially', 'only', 'absolutely', 'exactly', 'equally', 'perfectly', 'obsolutely', 'highly', 'mostly', 'hopefully', 'anyway','would','could', 'really', 'even', 'definitely',
             'still', 'bit', 'way','absolutely', 'almost', 'enough','ever','much','pretty','seem'])

In [20]:
stopwords.extend(['region','lock','waste','money','worth','price','full','single','player','bug','bugs','load','screen','_number_'])

We would like to remove some adverbs that emphasis certain verbs. They are useful in emphasisng, but not as useful in NLP when we tokenize words and look at adverbs and verbs alone.

In [21]:
reviews["cleaned_text"] = reviews["cleaned_text"].apply(nltk.word_tokenize)

In [22]:
#remove stopwords in cleaned reviews
reviews["final_text"] = reviews['cleaned_text'].apply(remove_stopwords)

In [23]:
reviews["final_text"] = [" ".join(i) for i in reviews["final_text"]]
reviews.head(10)

Unnamed: 0,review,recommended,removed_tokens,cleaned_text,final_text
0,"[great, game, terrible, optimisation, hopefull...",False,"[great, game, terrible, optimisation, hopefull...","[good, game, terrible, optimisation, hopefully...",terrible optimisation fix instead try cash grab
1,"[huh, i, actually, not, like, it, ive, 100ed, ...",False,"[huh, actually, like, ive, 100ed, game, edmund...","[huh, actually, like, i, have, 100ed, game, ed...",actually 100ed edmund fan minute end nigh come...
2,"[very, trashy, talking, about, the, tsev, skyr...",False,"[trashy, talking, tsev, skyrim, le]","[trashy, talk, tsev, skyrim, le]",trashy talk tsev skyrim le
3,"[lag, as, shit]",False,"[lag, shit]","[lag, shit]",lag shit
4,"[precarious, policies]",False,"[precarious, policies]","[precarious, policy]",precarious policy
5,"[i, never, liked, this, game, anyway]",False,"[never, liked, game, anyway]","[never, like, game, anyway]",never
6,"[we, need, chinese, about, eu4, the, third, pa...",False,"[need, chinese, eu4, third, party52, muyou, pa...","[need, chinese, eu4, third, party52, muyou, pa...",need chinese eu4 third party52 muyou party us ...
7,"[fuck, these, assholes]",False,"[fuck, assholes]","[fuck, asshole]",fuck asshole
8,"[im, torn, this, game, really, is, smartly, de...",False,"[im, torn, game, really, smartly, designed, pl...","[im, torn, game, really, smartly, design, play...",torn smartly design phone less _NUMBER_ maybe ...
9,[boring],False,[boring],[boring],boring


### TF-IDF

### N-grams

Since a single without any context usually does not provide much useful information for analysis, we decided to try n-grams of 2, 3, and 4 to determine which would be the most suitable n-gram for our analysis. 

In [24]:
good_review=reviews[reviews['recommended']==True]
poor_review=reviews[reviews['recommended']==False]

#### good_review

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

for i in range(2, 5):

    vectorizer = TfidfVectorizer(ngram_range=(i,i),
                             binary=True,
                             max_features=1000,
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             stop_words=stopwords)
    
    X = vectorizer.fit_transform(good_review["final_text"])
    terms = vectorizer.get_feature_names()
    tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

    tf_idf = tf_idf.sum(axis=1)
    score = pd.DataFrame(tf_idf, columns=["score"])
    score.sort_values(by="score", ascending=False, inplace=True)
    print(f'Top 20 Common reasons with n-grams of {i}:')
    print("{}\n".format(score.head(20)))

Top 20 Common reasons with n-grams of 2:
                       score
fun friend        281.430182
lot fun           190.686628
early access      139.572007
open world        121.146081
recommend anyone  109.564439
dark soul         104.708464
super fun         101.795878
total war          94.524035
look forward       92.343743
art style          72.984692
story line         70.744077
learn curve        70.263756
replay value       68.419024
spend hour         64.984784
fun recommend      60.008835
easy learn         58.044538
fun friends        57.184770
write review       56.702865
must buy           56.625234
hour fun           55.368190

Top 20 Common reasons with n-grams of 3:
                            score
learn hard master       29.474658
easy learn hard         29.356972
steep learn curve       27.747872
grand theft auto        21.219674
lot fun friend          20.937334
total war series        20.749579
recommend anyone enjoy  17.842134
recommend anyone want   17.447755
re

#### poor_review

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

for i in range(2, 5):

    vectorizer = TfidfVectorizer(ngram_range=(i,i),
                             binary=True,
                             max_features=1000,
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             stop_words=stopwords)
    
    X = vectorizer.fit_transform(poor_review["final_text"])
    terms = vectorizer.get_feature_names()
    tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

    tf_idf = tf_idf.sum(axis=1)
    score = pd.DataFrame(tf_idf, columns=["score"])
    score.sort_values(by="score", ascending=False, inplace=True)
    print(f'Top 20 Common reasons with n-grams of {i}:')
    print("{}\n".format(score.head(20)))

Top 20 Common reasons with n-grams of 2:
                       score
early access      324.041088
dont buy          273.793540
pay mod           257.658828
current state     142.054731
negative review   130.083835
year old          127.631353
total war         114.137692
year ago          114.013738
pay win           112.790483
creation club     109.016242
gta online        107.608684
piece shit        107.333694
run around        105.220373
recommend anyone  103.905840
stay away         103.135008
spend hour         95.602849
dont know          93.736584
open world         92.121947
combat system      88.305443
new update         88.221435

Top 20 Common reasons with n-grams of 3:
                             score
recommend current state  59.771744
give negative review     53.539823
fallout new vega         47.007133
buy shark card           46.039478
give positive review     36.125453
dlc early access         34.249450
grand theft auto         29.417508
need lot work            25.

### TF-IDF Analysis

From the TF-IDF scores calculated above, we can determine that customer's review for Steam are both from the positive and negative directions. Based on the analysis, we have gathered some general common themes in good reviews. These themes can be addressed with different functional teams in the company.

The most common themes in poor reviews that can be addressed with different teams are:

Sales and marketing teams:
- worth and reasonable pricing

Game design teams (art, engineering, strategy, game planning):
- fun to play with friends
- game content (art style, story line, etc.)
- game play features (open world, early access etc.)


The most common themes in poor reviews that can be addressed with different teams are:

Operations, sales and marketing teams:
- age and region block
- not worth the money spent

Game design teams:
- single player mode
- game play experience (load screen, bugs, etc.)

After getting the most common themes of all comments, we want to take a look at each categories in more details and see what stands out in each theme.

In [47]:
#positive reviews
good_review['fun_friend'] = good_review['final_text'].str.contains(r'lot fun|fun friend|super fun|lot fun friend')
good_review['worth_money'] = good_review['final_text'].str.contains(r'worth money|recommend anyone|well worth|worth price|full price|full recommend|worth full price|well worth moeny|buy full price|well worth price')
good_review['game_content'] = good_review['final_text'].str.contains(r'story line|art style|learn curve|learn hard master|steep learn curve')
good_review['game_play'] = good_review['final_text'].str.contains(r'open world|early access|turn base strategy|single player campaign')

#negative reviews
poor_review['region_lock'] = poor_review['final_text'].str.contains(r'region lock|regionlocking')
poor_review['not_worthy'] = poor_review['final_text'].str.contains(r'waste money|worth price|worth full price|pay mod|dont buy unless')
poor_review['single_player'] = poor_review['final_text'].str.contains(r'single player|single|multiplayer')
poor_review['game_exp'] = poor_review['final_text'].str.contains(r'bug|bugs|load screen|need lot work|early access')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  good_review['fun_friend'] = good_review['final_text'].str.contains(r'lot fun|fun friend|super fun|lot fun friend')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  good_review['worth_money'] = good_review['final_text'].str.contains(r'worth money|recommend anyone|well worth|worth price|full price|full recommend|worth full price|well worth moeny|buy full price|well worth price')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats

Positive reviews:

In [48]:
fun_friend = good_review[good_review['fun_friend'] == True ]['final_text'].values

vectorizer = TfidfVectorizer(ngram_range=(3,3) )
X = vectorizer.fit_transform(fun_friend)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("Top 10 Common theme related to fun friend reviews with n-grams of 3")
(score.head(10))

Top 10 Common theme related to fun friend reviews with n-grams of 3


Unnamed: 0,score
lot fun friend,6.93526
super fun friend,6.320107
extremely fun friend,4.577513
fun friend recommend,3.811397
fun friend fun,3.589771
amaze fun friend,2.899047
alot fun friend,2.837891
fun friend solo,2.284137
fun friend online,2.251716
fun friends _number_,2.21592


In [49]:
worth_money = good_review[good_review['worth_money'] == True ]['final_text'].values

vectorizer = TfidfVectorizer(ngram_range=(3,3) )
X = vectorizer.fit_transform(worth_money)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("Top 10 Common theme related to worthy of money reviews with n-grams of 3")
score.head(10)

Top 10 Common theme related to worthy of money reviews with n-grams of 3


Unnamed: 0,score
recommend anyone enjoy,2.755082
_number_ recommend anyone,2.665011
recommend anyone look,2.589169
recommend anyone want,2.443731
recommend anyone love,2.349824
fun recommend anyone,2.337154
recommend anyone interested,1.700756
strongly recommend anyone,1.603435
recommend anyone friend,1.424995
love recommend anyone,1.36088


In [50]:
game_content = good_review[good_review['game_content'] == True ]['final_text'].values

vectorizer = TfidfVectorizer(ngram_range=(3,3) )
X = vectorizer.fit_transform(game_content)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("Top 10 Common theme related to game content reviews with n-grams of 3")
score.head(10)

Top 10 Common theme related to game content reviews with n-grams of 3


Unnamed: 0,score
learn hard master,7.538876
easy learn hard,7.350267
steep learn curve,4.706471
love art style,2.698874
learn curve fun,2.690482
_number_ _number_ hour,2.410251
art style music,1.818224
short _number_ _number_,1.619025
story line overall,1.606521
significant brain usage,1.478926


In [51]:
game_play = good_review[good_review['game_play'] == True ]['final_text'].values

vectorizer = TfidfVectorizer(ngram_range=(3,3) )
X = vectorizer.fit_transform(game_play)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("Top 10 Common theme related to game play reviews with n-grams of 3")
score.head(10)

Top 10 Common theme related to game play reviews with n-grams of 3


Unnamed: 0,score
open world rpg,3.703285
amaze open world,3.679078
open world _number_,2.861091
turn base strategy,2.473814
though early access,2.215124
love open world,2.078421
early access fun,2.012725
open world survival,1.914062
buy early access,1.63332
since early access,1.566592


Negative reviews:

In [53]:
region_lock = poor_review[poor_review['region_lock'] == True ]['final_text']

vectorizer = TfidfVectorizer(ngram_range=(3,3))
X = vectorizer.fit_transform(region_lock)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("Top 10 Common theme related to region locking reviews with n-grams of 3")
score.head(10)

Top 10 Common theme related to region locking reviews with n-grams of 3


Unnamed: 0,score
zone competitive world,0.11547
properly organize community,0.11547
hour come hard,0.11547
real problem progress,0.11547
ignore multiple problem,0.11547
immense momentum popular,0.11547
important reason inability,0.11547
cosmetic nothing us,0.11547
inability listen properly,0.11547
continue ignore multiple,0.11547


In [54]:
not_worthy = poor_review[poor_review['not_worthy'] == True ]['final_text']

vectorizer = TfidfVectorizer(ngram_range=(3,3))
X = vectorizer.fit_transform(not_worthy)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("Top 5 Common theme related to not worthy reviews with n-grams of 3")
score.head(5)

Top 5 Common theme related to not worthy reviews with n-grams of 3


Unnamed: 0,score
pay mod fuck,5.016886
dont buy unless,3.867898
club pay mod,3.513929
creation club pay,3.504164
pay mod bad,3.421369


In [55]:
single_player = poor_review[poor_review['single_player'] == True ]['final_text']

vectorizer = TfidfVectorizer(ngram_range=(3,3))
X = vectorizer.fit_transform(single_player)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("Top 10 Common theme related to single player reviews with n-grams of 3")
score.head(10)

Top 10 Common theme related to single player reviews with n-grams of 3


Unnamed: 0,score
_number_ year old,1.459696
buy shark card,1.444074
ban use mod,1.195376
_number_ hour multiplayer,1.074287
ban singleplayer mod,1.071135
use mod singleplayer,1.037581
_number_ _number_ hour,1.003
multiplayer completely broken,1.0
singleplayer user though,1.0
doom multiplayer horrible,1.0


In [56]:
game_exp = poor_review[poor_review['game_exp'] == True ]['final_text']

vectorizer = TfidfVectorizer(ngram_range=(3,3))
X = vectorizer.fit_transform(game_exp)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
print("Top 10 Common theme related to game play experience reviews with n-grams of 3")
score.head(10)

Top 10 Common theme related to game play experience reviews with n-grams of 3


Unnamed: 0,score
dlc early access,10.567972
pay dlc early,5.232544
need lot work,4.182778
early access title,3.405531
buy early access,3.262628
_number_ early access,2.886342
early access pay,2.810989
access pay dlc,2.742918
buggy piece shit,2.7388
early access _number_,2.532354


According to the above analysis, our team have identified some features that can help Steam generate more revenue in game contracts and also some areas for improvements. Users are generally sensitive about price. For some games, users think the game worth the money spent. However, there are also many games are not well-worthy according to the reviews. We recommend the opeartions, sales, and marketing team at Steam to complete further investments based on game type and region to determine if price reduction strategy or temporary promotion can be implemented. We also recommend these teams to perform target marketing based on users preferences' of single player mode or multiplayer mode. 

For game design team, users commented the most about game features and contents. A game is highly rated by users when the art style and story line are attractive. Users also enjoyed open world games, early access features, and turn based strategy. However, not all users are satisfied with their game play exprience. Some complained about in-game bugs and lags while some others complained about lacking of new game content.