### Problem Statement Overview

Using game data gathered from Steam (Valve), predict the success of a game based on its description and reviews by applying, at a minimum, **Naive Bayes and Random Forest** algorithms. Success will be determined by the average review score and the number of positive reviews. This is a binary classification problem.

In [None]:
### importing libraries

import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import tqdm
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB 
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [None]:
### load dataset

master_df = pd.read_csv('/kaggle/input/master-df3/master_df3.csv')
master_df['reviews']

In [None]:
master_df.info()

In [None]:
### adjusting column header space

_ls = master_df.columns
new_col = []
for i in _ls:
    new_col.append(i.strip())

master_df.columns = new_col

In [None]:
master_df.info()

There are some missing values in our dataset in the columns for price, description, reviews and tagged. 

In [None]:
missing_vals = master_df.isnull().sum()
missing_vals

The number of missing values in each column is negligible relative to our roughly 50k entries, so it should be fine to drop those rows.

In [None]:
def extract_int(string):
    match = re.search(r'(app|sub)/(\d+)', string)
    if match:
        return int(match.group(2))
    else:
        return None

master_df['game_id'] = master_df['link'].apply(extract_int)
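
To illustrate what the ID extraction returns, here is a small self-contained sketch. The function name `extract_id` and the sample links are hypothetical, chosen only to show the `app/<id>` and `sub/<id>` patterns that the regular expression targets; they are not rows from the dataset.

```python
import re

def extract_id(link):
    """Return the numeric ID following 'app/' or 'sub/' in a store link, else None."""
    match = re.search(r'(?:app|sub)/(\d+)', link)
    return int(match.group(1)) if match else None

# Hypothetical links, for illustration only
samples = [
    "https://store.steampowered.com/app/12345/Example_Game/",
    "https://store.steampowered.com/sub/678/",
    "https://store.steampowered.com/bundle/999/",
]
ids = [extract_id(s) for s in samples]
print(ids)
```

Links that match neither pattern yield `None`, so such rows would be removed by a later `dropna()`.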

In [None]:
### creating a subset after dropping rows with missing values
### dropping link column as well

# chaining avoids modifying a slice of master_df in place (SettingWithCopyWarning)
df_1 = master_df.dropna().drop(columns='link')
df_1

In [None]:
### converting all characters to lowercase for standardization

df_1 = df_1.apply(lambda x: x.str.lower() if x.dtype == 'object' else x)


In [None]:
df_1

In [None]:
### removing the header of the description of the game 'about this game'

df_1['description'] = df_1['description'].str.replace('about this game', '', regex=False)
df_1

In [None]:
### splitting the review column to the reviews overview, review types, and review statistics 

df_1[['reviews','review_type','review_stats']] = df_1['reviews'].str.split(':|-', n=2, expand=True)
df_1[['positive_percentage','total_reviews']] = df_1['review_stats'].str.extract(r'(\d+%)\s+of\s+the\s+(\d+)\s+user\s+reviews', expand=True)

df_1 = df_1.reindex(columns =['title','r_date',
                       'price','description','reviews',
                       'review_type','review_stats','positive_percentage',
                              'total_reviews','tagged','game_id'])
df_1.drop(columns='reviews', inplace=True)
df_1.head()
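
The extraction pattern above only matches review counts written as plain digits. As a hedged sketch, the helper below (a hypothetical `parse_review_stats`, not part of the notebook) shows a variant that also tolerates thousands separators. The first sample follows the wording seen in this dataset's review text; the comma-separated form is an assumption that should be checked against the raw data.

```python
import re

# Pattern tolerant of thousands separators such as "1,234"
STATS_RE = re.compile(r'(\d+)%\s+of\s+the\s+([\d,]+)\s+user\s+reviews')

def parse_review_stats(text):
    """Return (positive_percentage, total_reviews) as ints, or None if no match."""
    match = STATS_RE.search(text)
    if match is None:
        return None
    pct = int(match.group(1))
    total = int(match.group(2).replace(',', ''))
    return pct, total

small = parse_review_stats("68% of the 29 user reviews for this game are positive.")
# Assumed format for larger counts; verify against the raw data
large = parse_review_stats("95% of the 1,234 user reviews for this game are positive.")
print(small, large)
```

If the raw text does use separators, rows with 1,000 or more reviews would otherwise end up as missing values and be dropped.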

In [None]:
df_1['review_type'].unique()

In [None]:
df_1['r_date'] = pd.to_datetime(df_1['r_date'])

In [None]:
def convert(string):
    match = re.search(r'\$\s*([\d.]+)', string)
    if match:
        return float(match.group(1))
    elif 'free' in string:
        return 0
    else:
        return None

df_1['price'] = df_1['price'].apply(convert)

In [None]:
df_1.info()

In [None]:
### Checking to see if there are any duplicated entries in our dataset

df_1['game_id'].duplicated().value_counts()

In [None]:
df_1.drop_duplicates(subset=['game_id'], inplace=True)

In [None]:
df_1['game_id'].duplicated().value_counts()

In [None]:
df2 = df_1.dropna().copy()  # copy so later column assignments do not act on a slice of df_1
df2

In [None]:
df2['total_reviews']=df2['total_reviews'].astype(int)
df2['positive_percentage'] = df2['positive_percentage'].str.replace('%','').astype(int)

# EDA

We will perform some EDA to decide how to classify a game as successful or not successful based on the characteristics of its reviews. <br><br>
*Note: games priced at $0 are "Free-to-Play" games, which sometimes include in-game purchases.*

In [None]:
df2.describe()

The median percentage of positive reviews is 81%, with the upper quartile (75th percentile) at 91%. We can take this into account when setting the threshold for our measure of a game's success.

In [None]:
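# numeric_only=True skips text and date columns, which raise an error in pandas >= 2.0
sns.heatmap(df2.corr(numeric_only=True), cmap='coolwarm', annot=True)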

There is almost no linear correlation between the positive percentage and total_reviews (0.066), and likewise none between the positive percentage and the price.
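
A coefficient near zero only rules out a *linear* relationship. The sketch below uses toy numbers (not the Steam data) and a hand-written `pearson_r` helper, both introduced purely for illustration, to show that a perfectly dependent but symmetric relationship can still give a Pearson coefficient of zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

xs = [-2, -1, 0, 1, 2]
linear = pearson_r(xs, [2 * x + 1 for x in xs])   # exact linear dependence
quadratic = pearson_r(xs, [x * x for x in xs])    # exact but non-linear dependence
print(round(linear, 3), round(quadratic, 3))
```

So the heatmap supports "no linear association", which is a weaker statement than "no relationship at all".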

In [None]:
df2[df2['description'].duplicated()]

In [None]:
duplicated_rows = df2[df2.duplicated('description')]

In [None]:
duplicated_rows['description'].to_list()[0]

In [None]:
duplicate_counts = duplicated_rows.value_counts()
duplicated_rows['description'].to_list()[0]

In [None]:
# df2[df2['description'] == ].shape[0]

In [None]:
# descriptions are empty because these store pages use images instead of typed text
df2[df2['description'] == '']

In [None]:
whitespace_only_rows = df2['description'].apply(lambda x: x.isspace())

In [None]:
# re.find_all(df['description'])

In [None]:
df2[df2['description'] == "boom blaster, 2020-07-18 00:00:00, 4.99, , mixed(29),  68% of the 29 user reviews for this game are positive., 68, 29, action, indie, casual, adventure, shooter, 2d, combat, nature, hero shooter, dungeon crawler, shoot 'em up, puzzle, education"]

In [None]:
df2[df2['total_reviews']> 20]

In [None]:
plt.figure(figsize=(20,10))
sns.histplot(data = df2[df2['total_reviews']>=100], x='total_reviews', bins=100)
plt.title('Fig. 1 - Reviews Per Game (games with at least 100 reviews)', fontdict={'fontsize':20})
plt.xlabel('Total number of reviews per game');

Fig. 1 shows that the number of games falls as the number of reviews per game rises. That makes sense: from our domain understanding of gaming, only the most popular games, with the most players, accumulate very large numbers of reviews.

In [None]:
positive_vals = df2[df2['total_reviews']>=100]['positive_percentage'].unique()

In [None]:
plt.figure(figsize=(16,10))
sns.histplot(data = df2[df2['total_reviews']>=100], x='positive_percentage')
plt.title('Fig. 2 - Positive Review Percentage Per Game (games with at least 100 reviews)', fontdict={'fontsize':20})
plt.xticks(ticks = positive_vals, rotation = 90);

In Fig. 2, there is a sharp increase in the number of games at 75% positive reviews. Perhaps we can use that as our threshold for measuring a game's success.

In [None]:
df2['positive_percentage'].quantile(0.3)

In [None]:
df2['positive_percentage'].quantile(0.48)

In [None]:
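df2_filtered = df2[df2['total_reviews'] >= 100].copy()
# 10 bin edges define 9 intervals, so exactly 9 labels are needed;
# include_lowest=True keeps games with exactly 100 reviews in the first bin
df2_filtered['total_review_bins'] = pd.cut(df2_filtered["total_reviews"],
                                          bins=[100,200,300,400,500,600,700,800,900,1000],
                                          labels=["100-200", "200-300", "300-400", "400-500", '500-600',
                                                  '600-700','700-800','800-900','900-1000'],
                                          include_lowest=True)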
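
`pd.cut` expects one fewer label than bin edges, so hand-typed labels are easy to get out of step with the bins. One way to avoid that, shown as a minimal sketch with the same edges as above, is to derive the labels from the edges themselves.

```python
# Bin edges for total reviews; each label is built from a pair of neighbouring edges
edges = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# One label per interval between consecutive edges
assert len(labels) == len(edges) - 1
print(labels)
```

The resulting list can be passed as `labels=` alongside `bins=edges`.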
                                          


In [None]:
# plt.figure(figsize=(15,20))

# plt.subplot(2,1,1)
# sns.boxplot(df2_filtered, x='total_review_bins', y = 'positive_percentage')
# plt.title('Fig. 3a - Number of Positive Reviews over Number of Total Reviews Per Game', fontdict={'fontsize':15})
# plt.ylabel('Positive Review Percentage Per Game')
# plt.xlabel('Total Number of Reviews Per Game')

# plt.subplot(2,1,2)
# sns.violinplot(df2_filtered, x='total_review_bins', y = 'positive_percentage')
# plt.title('Fig. 3b - Number of Positive Reviews over Number of Total Reviews Per Game', fontdict={'fontsize':15})
# plt.ylabel('Positive Review Percentage Per Game')
# plt.xlabel('Total Number of Reviews Per Game');

We have grouped the games by their number of reviews, in bins of 100. <br>
Fig. 3a shows that games with fewer reviews usually have more outliers.
<br>
Fig. 3b also shows that each group contains a substantial number of positively reviewed games.

Let's explore whether there is any relationship between release date, price and positive reviews.

In [None]:
### Creating new feature year and including into analysis 

df2['r_year'] = df2['r_date'].dt.year


In [None]:
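import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

### Define the independent and dependent variables

x = df2[['price','r_year','total_reviews']]
y = df2['positive_percentage']

### Scaling independent variables (keep column names and index so they align with y)
scaler = StandardScaler()
x_scaled = pd.DataFrame(scaler.fit_transform(x), columns=x.columns, index=x.index)

# Add a constant (intercept) to the scaled independent variables
x_scaled = sm.add_constant(x_scaled)


# Fitting OLS model on the scaled features (previously the unscaled x was used by mistake)

model = sm.OLS(y, x_scaled).fit()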

        
        

In [None]:
model.summary()

**The p-value of the game price is >0.05, so we cannot reject the hypothesis that price has no linear effect on the percentage of positive reviews.** <br><br>

In [None]:
# plt.figure(figsize=(15,10))
# sns.set_style('whitegrid')
# sns.boxplot(df2, x='r_year',y='price')
# plt.title('Fig. 4 - Prices of Games over the Years', fontdict={'fontsize':15})
# plt.xlabel('Release Year of Games')
# plt.ylabel('Prices of Games');

Setting aside 2016 and earlier, as well as 2023 (since we're only in the first month), the ceiling price of games seems to have increased year on year.

In [None]:
price_bins = pd.cut(df2['price'],bins=26, labels = list(range(0,251,10)))
price_bins

In [None]:
# plt.figure(figsize=(15,10))
# sns.barplot(df2, x=price_bins,y='total_reviews')
# plt.title('Fig.5 - Prices vs Total Reviews', fontdict={'fontsize':15})
# plt.xlabel('Prices ranges of Games')
# plt.ylabel('Mean Total Reviews per game price range')
# plt.xticks(rotation = 90, minor=True);


From Fig. 5, we observe the following:<br>
1. Games priced between \$20 and \$100 average at least 100 reviews, and the number of reviews falls drastically once a game costs more than \$100. A high price probably leads to fewer people buying the game and therefore to very few reviews.
2. Review counts are highest for games in the \$70 and \$100 price ranges.
<br><br>
Perhaps we should omit games that cost more than \$100 from model training later on.



In [None]:
sns.countplot(x='r_year', data=df2)
plt.title('Number of Games Released Each Year')
plt.ylabel('Number of Games Released')
plt.xlabel('Year');

The gaming industry seems to have been very active since 2017, or at least Steam has been expanding and collecting more data since then. The number of games released per year passed the 5,000 mark in 2020, possibly boosted by stay-at-home periods during COVID-19. <br> Some background can be found in the following links:<br>[Washington Post article, 2020](https://www.washingtonpost.com/video-games/2020/05/12/video-game-industry-coronavirus/)<br> [Forbes article, 2021, on how the gaming industry has leveled up during the pandemic](https://www.forbes.com/sites/forbestechcouncil/2021/06/17/how-the-gaming-industry-has-leveled-up-during-the-pandemic/?sh=30c984e7297c)

### Deciding on our measure of success 

From our EDA, we have gathered the following information:

1. The gaming industry seems to have been very active since 2017, or at least Steam has been expanding and collecting more data since then. The number of games released per year passed the 5,000 mark in 2020, possibly boosted by stay-at-home periods during COVID-19.
<br><br>
2. Setting aside 2016 and earlier, as well as 2023 (since we're only in the first month), the ceiling price of games seems to have increased year on year.
<br><br>
3. The p-value of the game price is >0.05, so we find no statistically significant linear relationship between price and the percentage of positive reviews.
<br><br>
4. The median percentage of positive reviews is 81%, with the upper quartile (75th percentile) at 91%. We can take this into account when setting the threshold for our measure of a game's success.
<br><br>
5. In Fig. 1, the number of games falls as the number of reviews per game rises. This matches our domain understanding: only the most popular games, with the most players, accumulate very large numbers of reviews.
<br><br>
6. In Fig. 5, games priced between \$20 and \$100 average at least 100 reviews, and the number of reviews falls drastically once a game costs more than \$100. A high price probably leads to fewer buyers and therefore very few reviews, which could affect how reliably we can measure success for such games. Review counts are highest in the \$70 and \$100 price ranges.
<br><br>
7. In Fig. 2, there is a sharp increase in the number of games at 75% positive reviews.
<br><br>
8. In Fig. 3a and 3b, games with fewer reviews tend to show more outliers, and every group contains a substantial number of positively reviewed games.
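
One way to turn these observations into a binary target is sketched below. The 75% cut-off comes from the Fig. 2 observation; the minimum of 100 reviews and the function name `is_successful` are assumptions made for this illustration, not the rule used later in the notebook. The sample values are hypothetical.

```python
def is_successful(positive_percentage, total_reviews,
                  min_positive=75, min_reviews=100):
    """Return 1 if a game clears both thresholds, else 0.

    min_positive and min_reviews are illustrative defaults, not fixed choices.
    """
    return int(positive_percentage >= min_positive and total_reviews >= min_reviews)

# Hypothetical (positive_percentage, total_reviews) pairs
examples = [(81, 250), (91, 40), (68, 29), (75, 100)]
flags = [is_successful(p, n) for p, n in examples]
print(flags)
```

Requiring a minimum review count guards against labelling a game successful on the strength of a handful of reviews.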












In [None]:

import nltk
nltk.download('stopwords', quiet=True)  # the stopword corpus must be available before it is loaded
stop_words = stopwords.words("english")
# also removing "game" and "rated" (uninformative, or might give away the rating), plus filler words and URL fragments
stop_words2 = stop_words + ["game", "rated", "get","like","one", "also", "https", "store", "steampowered"]

In [None]:
%pip install langdetect

In [None]:
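from langdetect import detect, DetectorFactory

# langdetect is non-deterministic by default; fix the seed for reproducible results
DetectorFactory.seed = 0

# function to detect the language of a given text
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # detection fails on empty or non-linguistic text (e.g. only digits or symbols)
        return None

# apply the function to the 'description' column of the DataFrame
df2['language'] = df2['description'].apply(detect_language)

# filter the DataFrame to only include rows where the language is 'en' (English)
df3 = df2[df2['language'] == 'en']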

In [None]:
df3 = df3.reset_index(drop=True)

In [None]:
df2.shape

In [None]:
df3


In [None]:
# no capture group is needed for a plain match (a group triggers a pandas UserWarning)
df2[df2['description'].str.contains(r"\brated\b", regex=True, case=False)].iloc[0]['description']

In [None]:
import nltk
nltk.download('stopwords')

cvec = CountVectorizer(stop_words=stop_words2,ngram_range=(1,2),min_df=0.01)

text_data = cvec.fit_transform(df3['description'])


test_df = pd.DataFrame(text_data.toarray(), columns=cvec.get_feature_names_out())
test_df.sum().sort_values(ascending=False)[:20].plot(kind='barh')
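
To make `ngram_range=(1,2)` concrete, the following sketch counts unigrams and bigrams with the standard library only. The sentence, the tiny stop-word set and the `ngrams` helper are made up for illustration; it mirrors the general idea of dropping stop words before building n-grams rather than reproducing `CountVectorizer` exactly (no lowercasing, punctuation handling or `min_df` filtering).

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

doc = "explore the open world and build the world"   # made-up description
stop = {"the", "and"}                                 # illustrative stop words
tokens = [t for t in doc.split() if t not in stop]
counts = Counter(ngrams(tokens))
print(counts.most_common(3))
```

Each distinct unigram or bigram becomes one feature column, which is why widening the n-gram range increases the width of the feature matrix.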

In [None]:
df33 = df3['tagged'].str.get_dummies(', ')

In [None]:
test_df

In [None]:
df33

In [None]:
df3

In [None]:
cvec = CountVectorizer(stop_words=stop_words2,ngram_range=(1,3),min_df=0.01)

text_data = cvec.fit_transform(df3['description'])


test_df = pd.DataFrame(text_data.toarray(), columns=cvec.get_feature_names_out())

test_df.sum().sort_values(ascending=False)[:20].plot(kind='barh')
test_df

In [None]:
result = test_df.merge(df33, left_index=True, right_index=True, suffixes=('_df1', '_df2'))

In [None]:
description_length = df2['description'].apply(lambda x: len(re.findall(r'\w+', x)))
df2.insert(4,'description_length',description_length, True)
df2.head()

In [None]:
df3['sentiment'] = df3['review_type'].apply(lambda x: 1 if 'positive' in x else 0) 
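
The substring rule above can be checked against Steam's review summary categories. The list below uses the lowercase category names; the exact strings stored in `review_type` may carry extra characters (such as counts), which the substring test tolerates. The helper name `to_sentiment` is introduced only for this sketch.

```python
def to_sentiment(review_type):
    """1 if the review summary contains 'positive', else 0."""
    return 1 if 'positive' in review_type else 0

categories = [
    "overwhelmingly positive", "very positive", "positive", "mostly positive",
    "mixed",
    "mostly negative", "negative", "very negative", "overwhelmingly negative",
]
mapping = {c: to_sentiment(c) for c in categories}
for name, label in mapping.items():
    print(f"{name:25s} -> {label}")
```

Note that "mixed" falls into class 0 together with the negative categories, which affects the class balance of the target.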

In [None]:
df3

In [None]:
new_df = result.join(df3['sentiment'])


In [None]:
new_df

In [None]:
result_2 = result.join(df3['sentiment'])

In [None]:
X = result
y = df3['sentiment']

In [None]:
y

In [None]:
# fixed random_state so the split, and every score below, is reproducible
X_train_cvec,X_test_cvec,y_train_cvec,y_test_cvec = train_test_split(X,y,stratify=y,random_state=42)

In [None]:
lr_pipe = Pipeline([
    ('lr', LogisticRegression(max_iter=1000)) 
])


In [None]:
lr_pipe.fit(X_train_cvec,y_train_cvec)

In [None]:
lr_pipe.score(X_test_cvec,y_test_cvec)

In [None]:
lr_pipe.score(X_train_cvec,y_train_cvec)

In [None]:
y_pred = lr_pipe.predict(X_test_cvec)

In [None]:
cm = confusion_matrix(y_test_cvec, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot();

In [None]:
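param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
# liblinear supports both l1 and l2 penalties; the default lbfgs solver rejects l1
grid_search = GridSearchCV(LogisticRegression(max_iter=1000, solver='liblinear'), param_grid, cv=5)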

In [None]:
grid_search.fit(X_train_cvec,y_train_cvec)

In [None]:
grid_search.best_params_

In [None]:
grid_search.score(X_train_cvec,y_train_cvec)

In [None]:
grid_search.score(X_test_cvec,y_test_cvec)

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay is its replacement
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, RocCurveDisplay, roc_auc_score, recall_score, precision_score, f1_score, classification_report

In [None]:
y_pred_1 = grid_search.predict(X_test_cvec)

In [None]:
yy = y_test_cvec.to_frame()

In [None]:
group_sizes = y_test_cvec.to_frame().groupby('sentiment').size()
group_sizes / group_sizes.sum()
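
The class shares computed above give a useful reference point: a model that always predicts the majority class already reaches an accuracy equal to the larger share. A minimal sketch, using a hypothetical 70/30 label split rather than the real test set:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

y_example = [1] * 7 + [0] * 3   # hypothetical labels: 70% positive, 30% not
baseline = majority_baseline_accuracy(y_example)
print(baseline)
```

Test accuracies are only informative to the extent that they exceed this baseline.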

In [None]:
print(classification_report(y_test_cvec, y_pred_1))

In [None]:
cm = confusion_matrix(y_test_cvec, y_pred_1)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot();
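
The figures in the classification report can be derived directly from the confusion matrix. The sketch below assumes scikit-learn's binary layout, `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions), and uses made-up counts; `binary_metrics` is a helper written for this example.

```python
def binary_metrics(cm):
    """Precision, recall, F1 (positive class) and accuracy from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

cm_example = [[50, 10], [5, 35]]   # hypothetical counts, not notebook output
metrics = binary_metrics(cm_example)
print({k: round(v, 3) for k, v in metrics.items()})
```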

In [None]:
lr_pipe.fit(X_train_cvec,y_train_cvec)

In [None]:
lr_pipe.score(X_train_cvec,y_train_cvec)

In [None]:
lr_pipe.score(X_test_cvec,y_test_cvec)

In [None]:
tfidf_vec = TfidfVectorizer(stop_words=stop_words2,ngram_range=(1,3),min_df=0.01)

text_data = tfidf_vec.fit_transform(df3['description'])


# column names must come from the fitted TF-IDF vectorizer, not from cvec
test_df2 = pd.DataFrame(text_data.toarray(), columns=tfidf_vec.get_feature_names_out())

test_df2.sum().sort_values(ascending=False)[:20].plot(kind='barh')
test_df2
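
For intuition about how TF-IDF weights differ from raw counts, here is a small standard-library sketch. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which corresponds to the `TfidfVectorizer` defaults; tokenisation is simplified to whitespace splitting, and the three documents are invented for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalisation (one dict per document)."""
    n = len(docs)
    tokenised = [d.split() for d in docs]
    # document frequency: in how many documents each term appears
    df = Counter(term for toks in tokenised for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["space shooter arcade", "space strategy", "farming simulator"]  # made up
rows = tfidf(docs)
print({t: round(w, 3) for t, w in rows[0].items()})
```

In the first document, "space" receives a lower weight than "shooter" because it also appears in another document.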

In [None]:
lr_pipe_2 = Pipeline([
    ('lr', LogisticRegression(max_iter=1000)) 
])



In [None]:
result2 = test_df2.merge(df33, left_index=True, right_index=True, suffixes=('_df1', '_df2'))


In [None]:
X_tf = result2
y_tf = df3['sentiment']

In [None]:
# fixed random_state so the split, and every score below, is reproducible
X_train_tf, X_test_tf, y_train_tf, y_test_tf = train_test_split(X_tf,y_tf, stratify=y_tf, random_state=42)

In [None]:
lr_pipe_tfidf = Pipeline([
    ('lr', LogisticRegression(max_iter=1000)) 
])

In [None]:
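param_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
# liblinear supports both l1 and l2 penalties; the default lbfgs solver rejects l1
grid_search_2 = GridSearchCV(LogisticRegression(max_iter=1000, solver='liblinear'), param_grid, cv=5)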

In [None]:
grid_search_2.fit(X_train_tf, y_train_tf)

In [None]:
grid_search_2.score(X_train_tf,y_train_tf)

In [None]:
grid_search_2.score(X_test_tf,y_test_tf)

In [None]:
# group_sizes = y_test.to_frame().groupby('sentiment').size()
# group_sizes / group_sizes.sum()

In [None]:
gs_2_y_pred_tf = grid_search_2.predict(X_test_tf)

In [None]:


# group_sizes = y_test_tfidf.to_frame().groupby('sentiment').size()
# group_sizes / group_sizes.sum()
print(classification_report(y_test_tf, gs_2_y_pred_tf))

In [None]:

cm = confusion_matrix(y_test_tf, gs_2_y_pred_tf)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot();

In [None]:
pipe = Pipeline([
    ('nb', MultinomialNB()) 
])

In [None]:
pipe.fit(X_train_cvec, y_train_cvec)

In [None]:
pipe.score(X_train_cvec,y_train_cvec)

In [None]:
pipe.score(X_test_cvec,y_test_cvec)
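
As a rough picture of what a multinomial Naive Bayes classifier computes on count features, the sketch below implements the textbook version with Laplace smoothing (`alpha=1.0`, the same default value as `MultinomialNB`). The toy token lists, labels and the `train_nb` / `predict_nb` helpers are all invented for illustration and are far simpler than the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Return {class: (log_prior, {word: log_likelihood})} with additive smoothing."""
    vocab = sorted({w for d in docs for w in d})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    model = {}
    for y, n_docs in class_counts.items():
        total = sum(word_counts[y].values())
        log_prior = math.log(n_docs / len(labels))
        log_like = {w: math.log((word_counts[y][w] + alpha) /
                                (total + alpha * len(vocab)))
                    for w in vocab}
        model[y] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    scores = {y: lp + sum(ll[w] for w in doc if w in ll)
              for y, (lp, ll) in model.items()}
    return max(scores, key=scores.get)

# Toy training data: token lists with 1 = positive, 0 = not positive
train_docs = [["fun", "polished", "fun"], ["great", "fun"],
              ["buggy", "crash"], ["buggy", "boring"]]
train_labels = [1, 1, 0, 0]
model = train_nb(train_docs, train_labels)
pred_pos = predict_nb(model, ["fun", "great"])
pred_neg = predict_nb(model, ["buggy", "crash", "boring"])
print(pred_pos, pred_neg)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.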

In [None]:
pipe_y_pred = pipe.predict(X_test_cvec)
cm = confusion_matrix(y_test_cvec, pipe_y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot();

In [None]:
print(classification_report(y_test_cvec, pipe_y_pred))

In [None]:
pipe2 = Pipeline([
    ('nb', MultinomialNB()) 
])

In [None]:
pipe2.fit(X_train_tf,y_train_tf)

In [None]:
pipe2.score(X_train_tf,y_train_tf)

In [None]:
pipe2.score(X_test_tf,y_test_tf)

In [None]:
pipe_y_pred_tf = pipe2.predict(X_test_tf)
# compare against the labels of the TF-IDF split, not the CountVectorizer split
cm = confusion_matrix(y_test_tf, pipe_y_pred_tf)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot();

In [None]:
print(classification_report(y_test_tf, pipe_y_pred_tf))

In [None]:
rf = Pipeline([
    ('rf', RandomForestClassifier())])




In [None]:
#  = GridSearchCV(rf,
#                   param_grid= {'ngram_range': (1,3),'n_estimators': [100, 150, 200], 'max_depth': ['None', 1, 2, 3, 4, 5]},cv=5) 

In [None]:
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 4, 8],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the classifier
clf = RandomForestClassifier()

# Initialize the GridSearchCV object
rf_gs_cvec = GridSearchCV(clf, param_grid, cv=5)

# Fit the GridSearchCV object to the training data
rf_gs_cvec.fit(X_train_cvec, y_train_cvec)

# Get the best parameters and best score
print(rf_gs_cvec.best_params_)
print(rf_gs_cvec.best_score_)
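
Before launching a grid search of this size it helps to count the model fits involved. The sketch below enumerates the same parameter grid with `itertools.product`; with 5-fold cross-validation every combination is fitted five times, and `GridSearchCV` then refits the best combination once more on the full training set by default.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 4, 8],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds
print(len(combinations), "combinations,", n_fits, "cross-validation fits")
```

If this is too slow, a smaller grid or a randomized search over the same ranges are common alternatives.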

In [None]:
RandomForestClassifier().get_params().keys()

In [None]:
rf_gs_cevc_y_pred = rf_gs_cvec.predict(X_test_cvec)

In [None]:
cm = confusion_matrix(y_test_cvec, rf_gs_cevc_y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot();

In [None]:
print(classification_report(y_test_cvec, rf_gs_cevc_y_pred))

In [None]:
rf_tf = Pipeline([
    ('rf', RandomForestClassifier())])

In [None]:
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 4, 8],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the classifier
clf2 = RandomForestClassifier()

# Initialize the GridSearchCV object
rf_gs_tf = GridSearchCV(clf2, param_grid, cv=5)

# Fit the GridSearchCV object to the training data
rf_gs_tf.fit(X_train_tf, y_train_tf)

# Get the best parameters and best score
print(rf_gs_tf.best_params_)
print(rf_gs_tf.best_score_)

In [None]:
# rf_gs_tf is already fitted in the previous cell; calling fit again would repeat the full grid search
rf_gs_tf

In [None]:
rf_gs_tf.score(X_train_tf,y_train_tf)

In [None]:
rf_gs_tf.score(X_test_tf,y_test_tf)

In [None]:
rf_gs_tf_y_pred = rf_gs_tf.predict(X_test_tf)

In [None]:
# compare against the labels of the TF-IDF split, not the CountVectorizer split
cm = confusion_matrix(y_test_tf, rf_gs_tf_y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot();

In [None]:
print(classification_report(y_test_tf, rf_gs_tf_y_pred))