# Video Game Sales Prediction
---
## Problem Statement
A gaming analytics company wants to understand the gaming market better. It wants a model that predicts the global sales of video games so that it can provide better service to its customers. The goal is to achieve the lowest RMSE possible.
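
For reference, RMSE is the square root of the mean squared difference between predicted and actual sales, so it is expressed in the same units as the target. A minimal sketch with NumPy follows; the `rmse` helper and the predicted values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted sales, for illustration only
actual = [0.04, 0.61, 3.37]
predicted = [0.10, 0.50, 2.90]
print('RMSE:', rmse(actual, predicted))
```

Because the errors are squared before averaging, a few badly mispredicted best-sellers can dominate the score.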

### Load Libraries & Data

In [2]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV


In [6]:
# Load data
video_games = pd.read_csv('../data/train.csv')
video_games.head()

Unnamed: 0,name,platform,genre,publisher,developer,rating,year_of_release,na_sales,eu_sales,jp_sales,other_sales,global_sales,critic_score,critic_count,user_score,user_count
0,Warriors Orochi 3,XOne,Action,Tecmo Koei,unknown,E,2014.0,0.01,0.03,0.0,0.0,0.04,68.997119,26.440992,7.1269,163.008846
1,Shooter: Starfighter Sanvein,PS,Shooter,Midas Interactive Entertainment,unknown,E,2000.0,0.01,0.01,0.0,0.0,0.02,68.997119,26.440992,7.1269,163.008846
2,CIMA: The Enemy,GBA,Role-Playing,Marvelous Interactive,Neverland,E,2003.0,0.02,0.01,0.0,0.0,0.03,70.0,11.0,7.1269,163.008846
3,Borderlands: The Pre-Sequel,PS3,Shooter,Take-Two Interactive,2K Australia,M,2014.0,0.26,0.21,0.05,0.1,0.61,77.0,24.0,6.3,130.0
4,Destiny,XOne,Shooter,Activision,"Bungie Software, Bungie",T,2014.0,2.14,0.92,0.0,0.31,3.37,75.0,11.0,5.5,1735.0


## Modeling

### Model Preparation

In [7]:
# select model features
X = video_games.drop(columns=['jp_sales', 'other_sales', 'global_sales', 'name'])
# select model target 
y = video_games['global_sales']

# split train data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=42)

### Random Forest Fine Tuning

In [13]:
%%time
# Random Forest pipeline
forest_pipe = Pipeline([
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ('sc', StandardScaler()),
    ('rf', RandomForestRegressor())
])

# set up pipe parameters
forest_params = {
    'rf__n_estimators': [400, 500, 600],
    'rf__max_depth': [10, 15, 20],
    'rf__max_features': ['sqrt', 'auto']
}

# instantiate GridSearchCV with pipe and params
forest_grid = GridSearchCV(forest_pipe, forest_params, cv=5, n_jobs=-1, verbose=1)

# fit GridSearchCV on the train data
forest_grid.fit(X_train, y_train)

# print best mean cross-validated score (R-squared by default)
print('RandomForest Best Score:', forest_grid.best_score_)

# print parameters from best model
forest_grid.best_params_

Fitting 5 folds for each of 18 candidates, totalling 90 fits
RandomForest Best Score: 0.2548281373217177
CPU times: user 21min 6s, sys: 971 ms, total: 21min 7s
Wall time: 1h 16min 49s


{'rf__max_depth': 20, 'rf__max_features': 'auto', 'rf__n_estimators': 600}

In [14]:
# score model on training data (R-squared)
forest_grid.score(X_train, y_train)

0.7256874445326937

In [15]:
# score model on validation data (R-squared)
forest_grid.score(X_val, y_val)

0.25111282319171524

In [16]:
# RMSE for Train Data
forest_preds = forest_grid.predict(X_train)
print('RMSE Train:', mean_squared_error(y_train, forest_preds, squared=False))

RMSE Train: 0.7377628673204125


In [17]:
# RMSE for Validation Data
forest_preds = forest_grid.predict(X_val)
print('RMSE Validation:', mean_squared_error(y_val, forest_preds, squared=False))

RMSE Validation: 1.215990555712607


This model is overfitting (train R-squared of 0.73 versus validation R-squared of 0.25) and performing worse than the Random Forest model using default parameters.
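
Since the stated goal is RMSE while the search above reports R-squared, one option is to have the search optimize RMSE directly, and to sample parameter combinations instead of trying all of them to reduce run time. The sketch below is an assumption-laden illustration, not the notebook's method: it uses `RandomizedSearchCV` with `scoring='neg_root_mean_squared_error'` on synthetic data from `make_regression`, and the parameter ranges are arbitrary small values chosen so it runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data; the notebook would use X_train / y_train instead
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Illustrative search space (small values so the sketch runs in seconds)
param_distributions = {
    'n_estimators': randint(20, 60),
    'max_depth': [5, 10, None],
    'max_features': ['sqrt', 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=4,                               # sample 4 combinations, not the full grid
    scoring='neg_root_mean_squared_error',  # higher is better, so RMSE is negated
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

# Flip the sign to recover the cross-validated RMSE
cv_rmse = -search.best_score_
print('Best CV RMSE:', cv_rmse)
print('Best params:', search.best_params_)
```

With this scoring, `best_score_` can be compared directly against the RMSE figures printed elsewhere in the notebook.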

### Random Forest Further Tuning

In [5]:
%%time
# Random Forest pipeline
forest_pipe = Pipeline([
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ('sc', StandardScaler()),
    ('rf', RandomForestRegressor(n_jobs=12))
])

# set up pipe parameters
forest_params = {
    'rf__n_estimators': [600, 1000],
    'rf__max_depth': [100, None],
    'rf__max_features': ['auto']
}

# instantiate GridSearchCV with pipe and params
forest_grid = GridSearchCV(forest_pipe, forest_params, cv=5, n_jobs=24, verbose=1)

# fit GridSearchCV on the train data
forest_grid.fit(X_train, y_train)

# print best mean cross-validated score (R-squared by default)
print('RandomForest Best Score:', forest_grid.best_score_)

# print parameters from best model
forest_grid.best_params_

Fitting 5 folds for each of 4 candidates, totalling 20 fits
RandomForest Best Score: 0.3590417861812004
CPU times: user 1h 4min 37s, sys: 1.13 s, total: 1h 4min 38s
Wall time: 1h 2min 37s


{'rf__max_depth': None, 'rf__max_features': 'auto', 'rf__n_estimators': 1000}

In [6]:
# score model on training data (R-squared)
forest_grid.score(X_train, y_train)

0.9119134063622014

In [7]:
# score model on validation data (R-squared)
forest_grid.score(X_val, y_val)

0.34930892442713224

In [8]:
# RMSE for Train Data
forest_preds = forest_grid.predict(X_train)
print('RMSE Train:', mean_squared_error(y_train, forest_preds, squared=False))

RMSE Train: 0.4180698587581727


In [9]:
# RMSE for Validation Data
forest_preds = forest_grid.predict(X_val)
print('RMSE Validation:', mean_squared_error(y_val, forest_preds, squared=False))

RMSE Validation: 1.1334684339448633


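This model is also performing slightly worse than the Random Forest model with default parameters, and the gap between the train R-squared (0.91) and the validation R-squared (0.35) shows it is still overfitting.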

### Final Model

In [3]:
# Load test data
video_game_test = pd.read_csv('../data/test.csv')
video_game_test.head()

Unnamed: 0,name,platform,genre,publisher,developer,rating,year_of_release,na_sales,eu_sales,jp_sales,other_sales,global_sales,critic_score,critic_count,user_score,user_count
0,Tron 2.0: Killer App,GBA,Action,Disney Interactive Studios,Digital Eclipse,E,2004.0,0.04,0.02,0.0,0.0,0.06,68.0,16.0,7.1269,163.008846
1,Tales of Xillia 2,PS3,Role-Playing,Namco Bandai Games,Bandai Namco Games,T,2012.0,0.2,0.12,0.45,0.07,0.84,71.0,59.0,7.9,216.0
2,Totally Spies! Totally Party,PS2,Misc,Ubisoft,unknown,E,2008.0,0.01,0.01,0.0,0.0,0.01,68.997119,26.440992,7.1269,163.008846
3,Super Fire ProWrestling X,SNES,Fighting,Human Entertainment,unknown,E,1995.0,0.0,0.0,0.28,0.0,0.28,68.997119,26.440992,7.1269,163.008846
4,Star Fox: Zero,WiiU,Shooter,Nintendo,PlatinumGames,E10+,2016.0,0.17,0.1,0.07,0.03,0.36,69.0,82.0,7.4,662.0


In [4]:
# select model features
X = video_game_test.drop(columns=['jp_sales', 'other_sales', 'global_sales', 'name'])
# select model target 
y = video_game_test['global_sales']

In [8]:
# Data pipeline
pipe = Pipeline([
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ('sc', StandardScaler()),
    ('rf', RandomForestRegressor(n_jobs=-1))
])

pipe.fit(X_train, y_train)

Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                ('sc', StandardScaler()),
                ('rf', RandomForestRegressor(n_jobs=-1))])

In [13]:
columns = pipe.named_steps['ohe'].get_feature_names()
feature_importances = pipe.named_steps['rf'].feature_importances_
pd.Series(feature_importances, columns).sort_values(ascending=False)[:20]

x2_Nintendo             0.059370
x7_3.58                 0.033476
x6_29.08                0.032408
x3_Polyphony Digital    0.028697
x7_0.0                  0.028624
x6_23.2                 0.028427
x6_26.93                0.028249
x3_Rockstar North       0.026846
x7_12.76                0.022837
x6_15.68                0.022371
x6_11.27                0.021584
x7_8.89                 0.016483
x8_97.0                 0.013989
x7_6.18                 0.013671
x7_2.26                 0.013142
x7_0.02                 0.012084
x7_0.01                 0.011740
x6_9.0                  0.011103
x3_Infinity Ward        0.011057
x6_0.0                  0.009823
dtype: float64

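It appears the most important features for determining global sales are the publisher and developer (`x2`, `x3`), North American sales (`x6`), and European sales (`x7`). Because the one-hot encoder is applied to every column, each distinct numeric value (for example `x7_3.58`) shows up as its own category in this list.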

In [10]:
# RMSE for Test Data
preds = pipe.predict(X)
print('RMSE Test:', mean_squared_error(y, preds, squared=False))

RMSE Test: 0.6326139187487348


### Conclusion

Overall, none of the models were able to outperform the baseline score of 0.53 million. The final RMSE that I was able to attain on the test data is 0.63 million. This is close to the baseline, so there is room for improvement.
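
The notebook does not show how the 0.53 baseline was obtained. One common choice, assumed here purely for illustration, is to predict the training mean for every game and compute the RMSE of that constant prediction. The sketch uses scikit-learn's `DummyRegressor` and only the `global_sales` values visible in the two data previews, so its output will not match the reported baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# global_sales values from the train and test previews (five rows each)
y_train_demo = np.array([0.04, 0.02, 0.03, 0.61, 3.37])
y_test_demo = np.array([0.06, 0.84, 0.01, 0.28, 0.36])

# DummyRegressor ignores the features, so placeholder columns are enough
X_train_placeholder = np.zeros((len(y_train_demo), 1))
X_test_placeholder = np.zeros((len(y_test_demo), 1))

baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train_placeholder, y_train_demo)
baseline_preds = baseline.predict(X_test_placeholder)

baseline_rmse = float(np.sqrt(np.mean((y_test_demo - baseline_preds) ** 2)))
print('Baseline RMSE:', baseline_rmse)
```

Any fitted model should be compared against a baseline computed on the same split, otherwise the two RMSE values are not directly comparable.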

The most important features according to this model are the publisher/developer, North American sales, and European sales. This aligns with the data exploration I conducted, which suggests the model is capturing the data relatively well.

With a little more exploration in model development, I believe it is possible to lower the RMSE below the baseline score. If I were to continue model development, I would try a voting ensemble of Random Forest and XGBoost (a voting regressor rather than a classifier, since the target is continuous).
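
The ensemble idea could look roughly like the following. Because global sales is a continuous target, the sketch uses scikit-learn's `VotingRegressor`, which averages the predictions of its members. Two assumptions to flag: the data is synthetic (`make_regression`), and `GradientBoostingRegressor` stands in for XGBoost so the example needs no extra dependency.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook would use its own train/validation split
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Average the predictions of a random forest and a gradient-boosted model
voter = VotingRegressor([
    ('rf', RandomForestRegressor(n_estimators=50, random_state=0)),
    ('gb', GradientBoostingRegressor(random_state=0)),  # stand-in for XGBoost
])
voter.fit(X_tr, y_tr)

# RMSE computed as the square root of MSE
val_rmse = mean_squared_error(y_va, voter.predict(X_va)) ** 0.5
print('Voting ensemble validation RMSE:', val_rmse)
```

Averaging tends to help most when the member models make different kinds of errors, so it is worth checking each member's validation RMSE alongside the ensemble's.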

Some surprising findings were that neither Platform nor Rating appeared among the top features, given that Nintendo Wii, PlayStation, Xbox, and PC are the top platforms and E- and M-rated games have the most global sales.