# Video Game Sales Prediction
---
## Problem Statement
A gaming analytics company wants to understand the gaming market better. It wants a model that predicts the global sales of video games so it can provide better service to its customers. The goal is to achieve the lowest RMSE possible.

### Load Libraries & Data

In [60]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor


In [61]:
# Load data
video_games = pd.read_csv('../data/train.csv')
video_games.head()

Unnamed: 0,name,platform,genre,publisher,developer,rating,year_of_release,na_sales,eu_sales,jp_sales,other_sales,global_sales,critic_score,critic_count,user_score,user_count
0,Warriors Orochi 3,XOne,Action,Tecmo Koei,unknown,E,2014.0,0.01,0.03,0.0,0.0,0.04,68.997119,26.440992,7.1269,163.008846
1,Shooter: Starfighter Sanvein,PS,Shooter,Midas Interactive Entertainment,unknown,E,2000.0,0.01,0.01,0.0,0.0,0.02,68.997119,26.440992,7.1269,163.008846
2,CIMA: The Enemy,GBA,Role-Playing,Marvelous Interactive,Neverland,E,2003.0,0.02,0.01,0.0,0.0,0.03,70.0,11.0,7.1269,163.008846
3,Borderlands: The Pre-Sequel,PS3,Shooter,Take-Two Interactive,2K Australia,M,2014.0,0.26,0.21,0.05,0.1,0.61,77.0,24.0,6.3,130.0
4,Destiny,XOne,Shooter,Activision,"Bungie Software, Bungie",T,2014.0,2.14,0.92,0.0,0.31,3.37,75.0,11.0,5.5,1735.0


## Modeling

### Model Preparation

In [62]:
# select model features
X = video_games.drop(columns=['name','jp_sales', 'other_sales', 'global_sales'])
# select model target
y = video_games['global_sales']

# split train data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

### Baseline Model

In [63]:
# Take the mean to determine the baseline score to beat
y.mean()

0.5293352116966389

The baseline is to predict the mean of global sales, about 0.53 million, for every game.
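The mean is a prediction rather than an error score, so a baseline that is directly comparable to the model RMSE values would come from predicting the training mean for every game. A minimal sketch, using small made-up sales figures in place of the real `y_train` and `y_val`:

```python
import numpy as np

# Hypothetical sales figures (in millions); stand-ins for y_train / y_val
y_train_demo = np.array([0.04, 0.02, 0.03, 0.61, 3.37])
y_val_demo = np.array([0.10, 0.50, 1.20])

# Baseline model: always predict the mean of the training target
baseline_pred = np.full_like(y_val_demo, y_train_demo.mean())

# RMSE of that constant prediction on the validation target
baseline_rmse = np.sqrt(np.mean((y_val_demo - baseline_pred) ** 2))
print('Baseline RMSE:', baseline_rmse)
```

Any model worth keeping should produce a validation RMSE below this number.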

### Linear Regression

In [64]:
# Linear Regression Pipeline
# Note: in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`
linreg_pipe = Pipeline([
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ('sc', StandardScaler()),
    ('lg', LinearRegression())
])

In [65]:
# fit the training data
linreg_pipe.fit(X_train, y_train)

Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                ('sc', StandardScaler()), ('lg', LinearRegression())])

In [66]:
# score model on training data (R-Squared)
linreg_pipe.score(X_train, y_train)

0.9877943283580946

Approximately 99% of the variation in the training data can be explained by the features in the model.

In [67]:
# score model on validation data (R-squared)
linreg_pipe.score(X_val, y_val)

-9.353933185301364e+25

The model is severely overfit. An R-squared of about -9.35e25 cannot be read as a percentage of variance explained; a negative value means the validation predictions are far worse than simply predicting the mean.
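R-squared is computed as 1 - SS_res / SS_tot, so it drops below zero whenever the model's squared error exceeds that of a constant mean prediction. A small illustration with made-up numbers (not taken from the dataset):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([0.1, 0.5, 1.2])        # made-up target values
good_pred = np.array([0.2, 0.4, 1.0])     # close to the targets
wild_pred = np.array([0.2, 0.4, 1.0e6])   # one exploding prediction

print(r_squared(y_true, good_pred))  # between 0 and 1
print(r_squared(y_true, wild_pred))  # hugely negative
```

A single runaway prediction is enough to push the score far below zero, which matches the pattern seen in the validation score.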

In [68]:
# cross validate training data
linreg_scores = cross_val_score(linreg_pipe, X_train, y_train, cv=5, n_jobs=-1)
linreg_scores.mean()

-1.1919527464684414e+26

Linear regression does not perform well on this data. The mean cross-validated R-squared is about -1.19e26, so on held-out folds the model does far worse than predicting the mean.

In [69]:
# RMSE for Train data
linreg_preds = linreg_pipe.predict(X_train)
print('RMSE Train:', mean_squared_error(y_train, linreg_preds, squared=False))

RMSE Train: 0.15562336114097927


In [70]:
# RMSE for Validation data
linreg_preds = linreg_pipe.predict(X_val)
print('RMSE Val:', mean_squared_error(y_val, linreg_preds, squared=False))

RMSE Val: 13589989979286.088


Linear regression is not able to model the data accurately and is worse overall than the baseline model. It does not seem worthwhile to test other linear models such as Lasso or Ridge regression.
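One possible contributor to these extreme validation scores is that the pipeline one-hot encodes every column, including numeric ones such as `critic_score`, so numeric values not seen during training are encoded as all zeros. A hedged sketch of an alternative that uses the already-imported `ColumnTransformer` to encode only categorical columns and scale numeric ones. The tiny DataFrame and the column split are illustrative assumptions, not the notebook's data or method:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that mimic a few of the dataset's columns
demo = pd.DataFrame({
    'platform': ['XOne', 'PS', 'GBA', 'PS3', 'XOne', 'PS3'],
    'genre': ['Action', 'Shooter', 'Role-Playing', 'Shooter', 'Shooter', 'Action'],
    'critic_score': [69.0, 69.0, 70.0, 77.0, 75.0, 81.0],
    'user_score': [7.1, 7.1, 7.1, 6.3, 5.5, 8.0],
    'global_sales': [0.04, 0.02, 0.03, 0.61, 3.37, 1.10],
})

categorical_cols = ['platform', 'genre']        # assumed split
numeric_cols = ['critic_score', 'user_score']   # assumed split

# One-hot encode the categorical columns only; scale the numeric ones
preprocess = ColumnTransformer([
    ('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('sc', StandardScaler(), numeric_cols),
])

demo_pipe = Pipeline([
    ('prep', preprocess),
    ('lg', LinearRegression()),
])

X_demo = demo[categorical_cols + numeric_cols]
y_demo = demo['global_sales']
demo_pipe.fit(X_demo, y_demo)

# An unseen platform is ignored by the encoder; numeric values are still used
new_game = pd.DataFrame({
    'platform': ['Wii'], 'genre': ['Action'],
    'critic_score': [72.0], 'user_score': [6.8],
})
demo_preds = demo_pipe.predict(new_game)
print(demo_preds)
```

With this layout, an unseen numeric value still contributes to the prediction instead of being dropped by the encoder.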

### Random Forest

In [71]:
# Random Forest pipeline
forest_pipe = Pipeline([
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ('sc', StandardScaler()),
    ('rf', RandomForestRegressor())
])

In [72]:
# fit the training set
forest_pipe.fit(X_train, y_train)

Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                ('sc', StandardScaler()), ('rf', RandomForestRegressor())])

In [73]:
# score model on training data (R-Squared)
forest_pipe.score(X_train, y_train)

0.9194278726328529

Approximately 92% of the variation in the training data can be explained by the features in the model.

In [74]:
# score model on validation data (R-Squared)
forest_pipe.score(X_val, y_val)

0.35233128960843885

The model is overfitting, but it performs better than the linear regression model. Approximately 35% of the variation in the validation data can be explained by the features in the model.

In [75]:
# cross validate training data
forest_scores = cross_val_score(forest_pipe, X_train, y_train, cv=5, n_jobs=-1)
forest_scores.mean()

0.35991874975270777

The model does not perform especially well, but it is better than linear regression. Averaged over the five folds, approximately 36% of the variation in the held-out training data can be explained by the features in the model.
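Because the stated goal is the lowest RMSE, cross-validation could also report RMSE directly instead of the default R-squared. A minimal sketch on synthetic data, using scikit-learn's `'neg_root_mean_squared_error'` scorer (scores are negated so that higher is better):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for X_train / y_train
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

demo_forest = RandomForestRegressor(n_estimators=20, random_state=42)

# The scorer returns negative RMSE, so flip the sign to get RMSE per fold
neg_rmse = cross_val_score(demo_forest, X_demo, y_demo, cv=5,
                           scoring='neg_root_mean_squared_error')
cv_rmse = -neg_rmse
print('CV RMSE per fold:', cv_rmse)
print('Mean CV RMSE:', cv_rmse.mean())
```

The same `scoring` argument should work with a full pipeline in place of `demo_forest`.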

In [76]:
# RMSE for Train Data
forest_preds = forest_pipe.predict(X_train)
print('RMSE Train:', mean_squared_error(y_train, forest_preds, squared=False))

RMSE Train: 0.39984011625762444


In [77]:
# RMSE for Validation data
forest_preds = forest_pipe.predict(X_val)
print('RMSE Val:', mean_squared_error(y_val, forest_preds, squared=False))

RMSE Val: 1.1308329722357922


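The validation RMSE of about 1.13 is nearly three times the training RMSE of about 0.40, so the model is overfitting, consistent with the R-squared scores. The validation RMSE is also larger than the 0.53 mean of global sales, so the error is large relative to typical sales.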

### XGBoost

In [78]:
# XGBoost pipeline
xgb_pipe = Pipeline([
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ('sc', StandardScaler()),
    ('xgb', XGBRegressor())
]) 

In [79]:
# fit the training set
xgb_pipe.fit(X_train, y_train)

Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                ('sc', StandardScaler()),
                ('xgb',
                 XGBRegressor(base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=1, gamma=0, gpu_id=-1,
                              importance_type='gain',
                              interaction_constraints='',
                              learning_rate=0.300000012, max_delta_step=0,
                              max_depth=6, min_child_weight=1, missing=nan,
                              monotone_constraints='()', n_estimators=100,
                              n_jobs=4, num_parallel_tree=1, random_state=0,
                              reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                              subsample=1, tree_method='exact',
                              validate_parameters=1, verbosity=None))])

In [80]:
# score model on training data (R-Squared)
xgb_pipe.score(X_train, y_train)

0.9024256094847313

Approximately 90% of the variation in the training data can be explained by the features in the model.

In [81]:
# score model on validation data (R-Squared)
xgb_pipe.score(X_val, y_val)

0.3060243911942018

Overfitting is present. The model performs slightly worse than the random forest, but better than linear regression. Approximately 31% of the variation in the validation data can be explained by the features in the model.

In [82]:
# cross validate training data
xgb_scores = cross_val_score(xgb_pipe, X_train, y_train, cv=5, n_jobs=-1)
xgb_scores.mean()

0.3804003264146777

The XGBoost cross-validation score is the best of the three models. Averaged over the five folds, approximately 38% of the variation in the held-out training data can be explained by the features in the model.

In [83]:
# RMSE for the Train data
xgb_preds = xgb_pipe.predict(X_train)
print('RMSE Train:', mean_squared_error(y_train, xgb_preds, squared=False))

RMSE Train: 0.44000931780139585


In [84]:
# RMSE for the Validation data
xgb_preds = xgb_pipe.predict(X_val)
print('RMSE Val:', mean_squared_error(y_val, xgb_preds, squared=False))

RMSE Val: 1.170561155208276


On validation RMSE, XGBoost (1.17) performed slightly worse than the Random Forest Regressor (1.13).
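The repeated train and validation RMSE checks could be wrapped in one helper so that models are compared side by side. A sketch on synthetic data; `rmse_report` is a hypothetical helper name, not part of the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse_report(model, X_tr, y_tr, X_va, y_va):
    """Fit the model and return (train RMSE, validation RMSE)."""
    model.fit(X_tr, y_tr)
    # Taking np.sqrt of the MSE avoids relying on the `squared` argument
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    val_rmse = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))
    return train_rmse, val_rmse

# Synthetic data standing in for the video game features and sales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

demo_models = {
    'linear': LinearRegression(),
    'forest': RandomForestRegressor(n_estimators=20, random_state=42),
}
report = {name: rmse_report(m, X_tr, y_tr, X_va, y_va)
          for name, m in demo_models.items()}
for name, (tr, va) in report.items():
    print(f'{name}: train RMSE={tr:.3f}, val RMSE={va:.3f}')
```

A large gap between the two numbers for a given model is the overfitting signal discussed in the sections above.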