# **Video Game Sales EDA & Prediction**

# **Exploratory Data Analysis / Data Preprocessing**

**Import necessary modules**

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import cross_val_score

## **Data Familiarization**

In [None]:
df = pd.read_csv('/kaggle/input/videogamesales/vgsales.csv', index_col='Rank')

*Note: Sales are in millions*

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.describe()

**Unique values**

In [None]:
df.nunique()

**Duplicate values**

In [None]:
df.duplicated().sum()

In [None]:
df = df.drop_duplicates(keep='first').copy()

In [None]:
df.loc[df['Name'] == 'Call of Duty: Black Ops II']

Above, we see that this game has multiple rows. This is valid, though, because each row represents the game on a different platform.
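
To see why such rows survive `drop_duplicates`, here is a minimal sketch on a made-up three-row frame (the titles and numbers are hypothetical, not taken from the dataset). A row only counts as a duplicate when every column matches an earlier row, so the same title on a different platform is kept.

```python
import pandas as pd

# Hypothetical toy rows: one title on two platforms, plus one exact repeat.
toy = pd.DataFrame({
    'Name': ['Game A', 'Game A', 'Game A'],
    'Platform': ['PS3', 'X360', 'PS3'],
    'Global_Sales': [1.5, 1.2, 1.5],
})

# duplicated() flags a row only when all of its columns match an earlier row.
dup_mask = toy.duplicated()
print(dup_mask.tolist())

# Only the exact repeat is dropped; the X360 row stays.
deduped = toy.drop_duplicates(keep='first')
print(deduped)
```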

**Missing values**

In [None]:
df.isna().sum()

Below, we'll put these missing values into perspective by computing, for each column, the percentage of rows that are NaN.

In [None]:
# isnull() is an alias of isna(), so a single call is enough
(df.isna().sum() * 100 / len(df)).round(2)

Since the missing values make up only a very small percentage of the rows in the DataFrame, we will drop them.

In [None]:
df = df.dropna().copy()
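
Per-column percentages can understate how many rows `dropna()` removes, because the NaNs may sit in different rows. The sketch below checks this on a made-up frame (all values are hypothetical); `isna().mean()` is just a shorter way to get the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with NaNs in two different rows.
toy = pd.DataFrame({
    'Year': [2006.0, np.nan, 2008.0, 2009.0],
    'Publisher': ['Publisher A', 'Publisher A', None, 'Publisher B'],
    'Global_Sales': [8.0, 4.0, 3.5, 3.0],
})

# Share of missing values per column, as a percentage.
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)

# Number of rows that dropna() would remove (a row is dropped if any column is NaN).
rows_lost = len(toy) - len(toy.dropna())
print(rows_lost)
```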

## **DataFrame Creation/Aggregation**

In [None]:
df.loc[df['Name'] == 'Super Mario Bros.']

As discussed above, a game can have multiple rows because each row shows the game on a different platform. That said, it would be nice to see the combined sales for each game across all of its platforms, so we'll compute that below!

In [None]:
df_sales_all_plat = df.groupby(['Name'], as_index=False)['Global_Sales'].sum()
df_sales_all_plat = df_sales_all_plat.sort_values(['Global_Sales'], ascending=False)

In [None]:
df_sales_all_plat.loc[df_sales_all_plat['Name'] == 'Super Mario Bros.']
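
If we also wanted to know how many platforms contributed to each combined total, one option is named aggregation. This is only a sketch on made-up rows; the `Platforms` column and the toy values are assumptions for illustration, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy rows: Game A on two platforms, Game B on one.
toy = pd.DataFrame({
    'Name': ['Game A', 'Game A', 'Game B'],
    'Platform': ['PS3', 'X360', 'Wii'],
    'Global_Sales': [1.5, 1.25, 2.0],
})

# Sum sales per title and count the distinct platforms behind each sum.
combined = (
    toy.groupby('Name', as_index=False)
       .agg(Global_Sales=('Global_Sales', 'sum'),
            Platforms=('Platform', 'nunique'))
       .sort_values('Global_Sales', ascending=False)
)
print(combined)
```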

In [None]:
df_top_genre = df.groupby(['Genre'], as_index=False)['Global_Sales'].sum()
df_top_genre = df_top_genre.sort_values(['Global_Sales'], ascending=False)

## **Data Visualization**

*What are the most popular games on a platform?*

In [None]:
a = df['Name'][:5]
b = df['Global_Sales'][:5]

plt.figure(figsize=[13, 6])
plt.title('The 5 most popular video games (by sales) - on their most popular platforms', 
          weight='bold')
plt.xlabel('Games', weight='bold')
plt.ylabel('Sales (in millions)', weight='bold')
plt.bar(a, b, color=['forestgreen', 'limegreen', 'springgreen', 'aquamarine', 'turquoise'])
plt.show()

*What are the most popular video games across all of their platforms?*

In [None]:
c = df_sales_all_plat['Name'][:5]
d = df_sales_all_plat['Global_Sales'][:5]

plt.figure(figsize=[13, 6])
plt.title('The 5 most popular video games (by sales) - across ALL of their platforms', 
          weight='bold')
plt.xlabel('Games', weight='bold')
plt.ylabel('Sales (in millions)', weight='bold')
plt.bar(c, d, color=['maroon', 'indianred', 'lightcoral', 'salmon', 'tomato'])
plt.show()

In [None]:
e = df_top_genre['Genre'][:5]
f = df_top_genre['Global_Sales'][:5]

plt.figure(figsize=[13, 6])
plt.title('The 5 most popular videogame genres', 
          weight='bold')
plt.xlabel('Genres', weight='bold')
plt.ylabel('Sales (in millions)', weight='bold')
plt.barh(e, f, color=['darkviolet', 'blueviolet', 'mediumpurple', 'mediumslateblue', 'royalblue'])
plt.show()

In [None]:
df_annual_regional_sales = df.groupby('Year')[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()
df_annual_regional_sales

In [None]:
plt.figure(figsize=(9, 6))
plt.xlabel('Years (1980-2020)', weight='bold')
plt.ylabel('Sales (in millions)', weight='bold')
sns.lineplot(data=df_annual_regional_sales)

We see that video game sales in this dataset peaked around 2005-2010 and have dropped significantly since then.
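
Rather than reading the peak off the chart, we could compute it. The sketch below does this on a small made-up frame shaped like `df_annual_regional_sales` (the years and values are hypothetical): sum the regional columns per year, then take the year with the largest total.

```python
import pandas as pd

# Hypothetical annual totals with the same layout as df_annual_regional_sales.
annual = pd.DataFrame(
    {'NA_Sales': [10.0, 30.0, 20.0], 'EU_Sales': [5.0, 15.0, 10.0]},
    index=pd.Index([2000, 2008, 2015], name='Year'),
)

# Total sales per year across the regional columns.
total_per_year = annual.sum(axis=1)

# idxmax() returns the index label (here, the year) of the largest total.
peak_year = total_per_year.idxmax()
print(peak_year, total_per_year.loc[peak_year])
```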

# **Machine Learning**

For our models to work with the data, every feature needs to be numerical. We'll therefore use `LabelEncoder` to convert the categorical columns (`Name`, `Platform`, `Genre`, `Publisher`) to integer codes.

In [None]:
le = LabelEncoder()

df['Name'] = le.fit_transform(df['Name'])
df['Platform'] = le.fit_transform(df['Platform'])
df['Genre'] = le.fit_transform(df['Genre'])
df['Publisher'] = le.fit_transform(df['Publisher'])
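
Refitting one `LabelEncoder` on each column keeps only the last column's mapping. If we wanted to translate codes back to labels later, a possible variation is one encoder per column, as sketched below on made-up values (the `encoders` dictionary is a hypothetical helper, not something the notebook uses).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({
    'Genre': ['Sports', 'Action', 'Sports'],
    'Platform': ['Wii', 'PS3', 'Wii'],
})

# Keep a separate fitted encoder per column so each mapping stays available.
encoders = {}
for col in ['Genre', 'Platform']:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy)

# Codes follow the sorted order of the labels; inverse_transform recovers them.
genre_labels = list(encoders['Genre'].inverse_transform([0, 1]))
print(genre_labels)
```

Note that the integer codes only reflect alphabetical order, so a model may read an ordering into them that the categories do not have.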

In [None]:
df.info()

**Select variables**

In [None]:
X = df.drop(columns=['Global_Sales', 'JP_Sales', 'Other_Sales'])
y = df['Global_Sales']

**Train-test-split**

In [None]:
# Fix random_state so the split (and the scores below) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### **Linear Regression**

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

#### Model Evaluation: Linear Regression

In [None]:
def model_score(model, model_name='Model Name'):
    # For regressors, .score() returns R² (coefficient of determination), not accuracy
    print(f'R² score of {model_name} model: {model.score(X_test, y_test) * 100}%')

In [None]:
model_score(lr, 'Linear Regression')

In [None]:
def mse(pred, model_name='Model Name'):
    # Mean squared error between the test targets and a model's predictions
    print('Mean Squared Error: {} of {} model'.format(mean_squared_error(y_test, pred), model_name))

In [None]:
mse(lr_pred, 'Linear Regression')

In [None]:
def mae(pred, model_name='Model Name'):
    # Mean absolute error between the test targets and a model's predictions
    print('Mean Absolute Error: {} of {} model'.format(mean_absolute_error(y_test, pred), model_name))

In [None]:
mae(lr_pred, 'Linear Regression')
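
The helpers above each print one metric. As a rough sketch, the same numbers could also be returned from a single function so they can be compared across models; `regression_report` is a hypothetical name, and the arrays below are made-up values chosen so the metrics are easy to check by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Hypothetical helper: return R², MSE and MAE in one dictionary."""
    return {
        'r2': r2_score(y_true, y_pred),
        'mse': mean_squared_error(y_true, y_pred),
        'mae': mean_absolute_error(y_true, y_pred),
    }


# Made-up targets and predictions: only the last prediction is off, by 1.
y_true_toy = np.array([1.0, 2.0, 3.0])
y_pred_toy = np.array([1.0, 2.0, 4.0])

report = regression_report(y_true_toy, y_pred_toy)
print(report)
```

Here the squared and absolute errors both average to 1/3, and R² is 1 - 1/2 = 0.5 because the residual sum of squares is 1 while the total sum of squares around the mean is 2.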

In [None]:
def cross_val(model, model_name='Model Name'):
    # Mean R² over 5 folds; an integer cv uses contiguous, unshuffled folds for regressors
    print('Cross Validation: {} of {} model'.format(np.mean(cross_val_score(model, X, y, cv=5)), model_name))

In [None]:
cross_val(lr, 'Linear Regression')
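
For a regressor, `cross_val_score` with `cv=5` splits the rows into five contiguous blocks without shuffling. When the rows are sorted by the target, each block covers a different range of values, which can change the per-fold scores. The sketch below compares the default folds with a shuffled `KFold` on synthetic data (the data, seed and coefficients are assumptions for illustration only).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic linear data, then sorted by target to mimic a ranked dataset.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = X_toy @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=100)
order = np.argsort(-y_toy)
X_toy, y_toy = X_toy[order], y_toy[order]

# Default: contiguous, unshuffled folds.
default_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5)

# Alternative: shuffle the rows before splitting them into folds.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=shuffled_cv)

print('default folds :', default_scores.round(3))
print('shuffled folds:', shuffled_scores.round(3))
```

Passing a `KFold` object instead of an integer is the only change; whether shuffling is appropriate depends on whether the row order carries meaning we want to respect.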

### **Decision Tree Regressor**

In [None]:
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
dtr_pred = dtr.predict(X_test)

#### Model Evaluation: Decision Tree Regressor

In [None]:
model_score(dtr, 'Decision Tree Regressor')

In [None]:
mse(dtr_pred, 'Decision Tree Regressor')

In [None]:
mae(dtr_pred, 'Decision Tree Regressor')

In [None]:
cross_val(dtr, 'Decision Tree Regressor')

### **Random Forest Regressor**

In [None]:
rfg = RandomForestRegressor()
rfg.fit(X_train, y_train)
rfg_pred = rfg.predict(X_test)

#### Model Evaluation: Random Forest Regressor

In [None]:
print(f'Score of Random Forest Regressor Model: {rfg.score(X_test, y_test) * 100}%')

In [None]:
model_score(rfg, 'Random Forest Regressor')

In [None]:
print(f'Mean Absolute Error: {mean_absolute_error(y_test, rfg_pred)} - Random Forest Regressor')

In [None]:
mse(rfg_pred, 'Random Forest Regressor')

In [None]:
mae(rfg_pred, 'Random Forest Regressor')

In [None]:
cross_val(rfg, 'Random Forest Regressor')

### **XGBoost**

In [None]:
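xgb = XGBRegressor(n_estimators=1000, learning_rate=0.05)
# Note: newer XGBoost releases deprecate passing early_stopping_rounds to fit() in favor of the constructor.
# Early stopping monitors the test set here, so the test scores below may be somewhat optimistic.
xgb.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_test, y_test)], verbose=False)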
xgb_pred = xgb.predict(X_test)

#### Model Evaluation: XGBoost Regressor

In [None]:
model_score(xgb, 'XGBoost')

In [None]:
mse(xgb_pred, 'XGBoost')

In [None]:
mae(xgb_pred, 'XGBoost')

In [None]:
cross_val(xgb, 'XGBoost')

# **Conclusion**
From these results, we see that our best model is Linear Regression! This makes sense because Global Sales follow a linear trend that depends heavily on NA and EU Sales, both of which are among our features. Our Linear Regression model scores extremely well in terms of R² (the value `.score()` reports for a regressor, which is not a classification accuracy), mean squared error, and mean absolute error. However, cross-validation on our model does not return a very good result. This means that our model's performance depends on how the data is split: if the data were split differently, the results would vary. One likely contributor is that `cross_val_score` with `cv=5` uses contiguous, unshuffled folds, while the rows are ordered by rank.
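
To illustrate why a linear model fits this kind of target so well, here is a sketch on synthetic data. It assumes that global sales are roughly the sum of the regional sales, which this notebook does not verify, and all distributions and numbers below are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regional sales (assumption: the global figure is the sum of the regions).
rng = np.random.default_rng(0)
na = rng.exponential(1.0, 500)
eu = rng.exponential(0.6, 500)
jp = rng.exponential(0.1, 500)
other = rng.exponential(0.1, 500)
global_sales = na + eu + jp + other

# Fit on NA and EU only, mirroring the features kept in X above.
X_syn = np.column_stack([na, eu])
syn_model = LinearRegression().fit(X_syn, global_sales)

syn_r2 = syn_model.score(X_syn, global_sales)
print('coefficients:', syn_model.coef_.round(3))
print('R² on the synthetic data:', round(syn_r2, 3))
```

When the two largest regions are features, the target is close to a plain sum of the inputs, so the fitted coefficients land near 1 and little variance is left unexplained.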
