# Video Game Sales Predictions

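The objective of this project is to use classification techniques to predict the sales of games in different regions. Because classifiers predict discrete labels, the sales figures are binned into five categories before modeling. The dataset used can be found [here](https://www.kaggle.com/gregorut/videogamesales).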

The following are the columns that we primarily focused on:
- Input (x):
    - Platform
    - Year
    - Genre
    - Publisher
- Output (y):
    - NA_Sales
    - EU_Sales
    - JP_Sales
    - Other_Sales

The input columns were used to predict the output columns individually for each region.

In [None]:
import pandas as pd

In [None]:
vgsales = pd.read_csv("../input/videogamesales/vgsales.csv")
vgsales.head()

# Removing missing data

We check for missing values in the dataframe and see that only the Year and Publisher columns have any.

In [None]:
vgsales.isna().sum()

We decide to remove any rows that have missing values.

In [None]:
print("rows before removing missing values", vgsales.shape[0])

In [None]:
vgsales = vgsales.dropna()
vgsales.isna().sum()

In [None]:
print("rows after removing missing values", vgsales.shape[0])

In [None]:
vgsales['Year'] = vgsales['Year'].astype(int)

# Removing Outliers

We remove any outlier values within the output columns so they do not skew our analysis.

In [None]:
price_cols = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']

In [None]:
vgsales[price_cols].boxplot(figsize=(24, 8))

In [None]:
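def std_based(col_name, df):
    """Return the rows of df whose col_name value lies within 2 standard deviations of the mean."""
    mean = df[col_name].mean()
    std = df[col_name].std()
    cut_off = std * 2
    lower, upper = mean - cut_off, mean + cut_off
    # Keep rows strictly between the lower and upper bounds.
    new_df = df[(df[col_name] < upper) & (df[col_name] > lower)]
    return new_df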
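
To make the two-standard-deviation rule concrete, here is a minimal sketch on a made-up column. The `toy` frame and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy data: nine small values and one extreme value.
toy = pd.DataFrame({"sales": [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.1, 10.0]})

mean = toy["sales"].mean()
std = toy["sales"].std()  # sample standard deviation (ddof=1), the pandas default
lower, upper = mean - 2 * std, mean + 2 * std

# Keep only rows strictly inside (lower, upper), mirroring std_based.
kept = toy[(toy["sales"] > lower) & (toy["sales"] < upper)]
print(f"bounds = ({lower:.2f}, {upper:.2f}); rows kept = {len(kept)} of {len(toy)}")
```

Note that the loop in the next cell applies the filter one column at a time, so each column's mean and standard deviation are computed on the rows that survived the previous filters.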

In [None]:
print(f"dataset size before removing outliers= {vgsales.shape[0]}")
for col in price_cols:
    vgsales = std_based(col, vgsales)
print(f"dataset size after removing outliers= {vgsales.shape[0]}")

# Feature Labeling


We one-hot encode the categorical input columns (Platform, Genre, and Publisher) and leave Year as a number. We then convert each output column from a continuous sales figure into one of five ordinal categories, which serve as the class labels.

**Reference:** https://pbpython.com/categorical-encoding.html
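
As a small illustration of what one-hot encoding produces, the sketch below runs `pd.get_dummies` on a made-up three-row frame. The rows are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical three-row frame with the same kind of columns as the dataset.
toy = pd.DataFrame({
    "Platform": ["Wii", "PS2", "Wii"],
    "Genre": ["Sports", "Racing", "Racing"],
    "Year": [2006, 2004, 2008],
})

# Each distinct value of an encoded column becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["Platform", "Genre"], prefix=["platform", "genre"])
print(encoded.columns.tolist())
print(encoded.shape)
```

Every distinct category adds one column, which is why the encoded frame grows much wider once Publisher is included.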

In [None]:
labeled_df = pd.get_dummies(
    vgsales,
    columns=["Platform", "Genre", 'Publisher'], 
    prefix=["platform", "genre", 'pub']
)
labeled_df.head()

In [None]:
labeled_df.shape

In [None]:
labeled_df.dtypes

In [None]:
bins = pd.qcut(labeled_df['NA_Sales'],7, duplicates='drop')
bins.value_counts(sort=False)

In [None]:
bins = pd.qcut(labeled_df['EU_Sales'],7, duplicates='drop')
bins.value_counts(sort=False)

In [None]:
bins = pd.qcut(labeled_df['JP_Sales'],7, duplicates='drop')
bins.value_counts(sort=False)

In [None]:
bins = pd.qcut(labeled_df['Other_Sales'],7, duplicates='drop')
bins.value_counts(sort=False)

In [None]:
def sales_category(value):
    if value > 0.31:
        return 5
    elif value > 0.16:
        return 4
    elif value > 0.09:
        return 3
    elif value > 0.04:
        return 2
    else:
        return 1

In [None]:
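# DataFrame.applymap is deprecated since pandas 2.1; mapping each column works on every version.
labeled_df[price_cols] = labeled_df[price_cols].apply(lambda col: col.map(sales_category))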
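
The hand-written thresholds can also be expressed as bin edges for `pd.cut`. The sketch below is an alternative formulation, not the notebook's method; the `sample` values are made up and include numbers that sit exactly on a threshold, to check that both approaches agree.

```python
import numpy as np
import pandas as pd

def sales_category(value):
    # Same thresholds as the notebook's function.
    if value > 0.31:
        return 5
    elif value > 0.16:
        return 4
    elif value > 0.09:
        return 3
    elif value > 0.04:
        return 2
    return 1

# Illustrative values, including ones exactly on a threshold.
sample = pd.Series([0.0, 0.04, 0.05, 0.09, 0.16, 0.2, 0.31, 0.5])

edges = [-np.inf, 0.04, 0.09, 0.16, 0.31, np.inf]
# right=True (the default) makes each bin (low, high], matching the `>` tests above.
binned = pd.cut(sample, bins=edges, labels=[1, 2, 3, 4, 5]).astype(int)

print(binned.tolist())
print(sample.map(sales_category).tolist())
```

Keeping the edges in a single list makes it easier to adjust the number of categories later.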

# Sampling the data
Evaluating the models on the full dataset takes hours, so we take a 20% random sample and evaluate on that instead.

In [None]:
labeled_df = labeled_df.sample(frac=0.2, random_state=1)

# Splitting the data

We split the data 80/20: 80% to train the models and 20% to test them. We create separate y values for each output column.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
y_na = labeled_df['NA_Sales']
y_eu = labeled_df['EU_Sales']
y_jp = labeled_df['JP_Sales']
y_other = labeled_df['Other_Sales']
x = labeled_df.drop(columns=[*price_cols, 'Global_Sales', 'Name'])

In [None]:
x_train_na, x_test_na, y_train_na, y_test_na = train_test_split(x, y_na, test_size=0.2, random_state=1)
x_train_eu, x_test_eu, y_train_eu, y_test_eu = train_test_split(x, y_eu, test_size=0.2, random_state=1)
x_train_jp, x_test_jp, y_train_jp, y_test_jp = train_test_split(x, y_jp, test_size=0.2, random_state=1)
x_train_other, x_test_other, y_train_other, y_test_other = train_test_split(x, y_other, test_size=0.2, random_state=1)
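
Because the four calls above share the same `x`, `test_size`, and `random_state`, they select the same rows each time. A possible simplification is to split the features and all four targets in one call. The sketch below uses randomly generated stand-in data (`x_toy` and `y_toy` are hypothetical names), not the real frame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for x and the four target columns.
rng = np.random.default_rng(1)
x_toy = pd.DataFrame({"Year": rng.integers(1980, 2020, size=50)})
y_toy = pd.DataFrame(
    rng.integers(1, 6, size=(50, 4)),
    columns=["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"],
)

# One call splits the features and all four targets on the same rows.
x_train, x_test, y_train, y_test = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=1
)

print(x_train.shape, x_test.shape)
print(y_train["NA_Sales"].shape, y_test["NA_Sales"].shape)
```

Each region's target is then a single column of `y_train` or `y_test`, for example `y_train["NA_Sales"]`.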

# Model Evaluation

The classifiers that will be evaluated for the dataset are:
- Support Vector Machines
- Logistic Regression
- K-Nearest Neighbor

**Reference:** https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

We will evaluate the performance of each classifier listed above using cross validation, and then use grid search to find the best parameters for each model.
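
The two-step procedure can be sketched end to end on synthetic data. The data from `make_classification` and the `max_iter=1000` setting are assumptions made so the example runs quickly and converges; they are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in data; the notebook itself uses the encoded vgsales frame.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

clf = LogisticRegression(max_iter=1000)

# Step 1: 10-fold cross validation with default hyperparameters.
scores = cross_val_score(clf, X, y, cv=10)
print("mean CV accuracy:", scores.mean())

# Step 2: grid search over C on the training split, then score on held-out data.
grid = GridSearchCV(clf, param_grid={"C": [0.1, 1.0, 10.0]}, scoring="accuracy", cv=10)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```

After fitting, `grid.score` uses the estimator refit with the best parameters, so the reported test accuracy belongs to the tuned model.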

### Cross Validation for SVM

In [None]:
clf = SVC()

scores_na = cross_val_score(clf, x, y_na, cv=10)
print("CV scores of NA = {}".format(scores_na))
print("mean CV score of NA= {}".format(scores_na.mean()))

scores_eu = cross_val_score(clf, x, y_eu, cv=10)
print("CV scores of EU = {}".format(scores_eu))
print("mean CV score of EU = {}".format(scores_eu.mean()))

scores_jp = cross_val_score(clf, x, y_jp, cv=10)
print("CV scores of JP = {}".format(scores_jp))
print("mean CV score of JP = {}".format(scores_jp.mean()))

scores_other = cross_val_score(clf, x, y_other, cv=10)
print("CV scores of Other = {}".format(scores_other))
print("mean CV score of Other = {}".format(scores_other.mean()))

### Grid Search for SVM

In [None]:
params = {'C': [0.1, 1.0, 10.0], 'kernel': ['linear', 'rbf']}
svc_grid_na = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)
svc_grid_eu = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)
svc_grid_jp = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)
svc_grid_other = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)

In [None]:
svc_grid_na.fit(x_train_na, y_train_na)
print(svc_grid_na.score(x_test_na, y_test_na))
print(svc_grid_na.best_params_)

In [None]:
svc_grid_eu.fit(x_train_eu, y_train_eu)
print(svc_grid_eu.score(x_test_eu, y_test_eu))
print(svc_grid_eu.best_params_)

In [None]:
svc_grid_jp.fit(x_train_jp, y_train_jp)
print(svc_grid_jp.score(x_test_jp, y_test_jp))
print(svc_grid_jp.best_params_)

In [None]:
svc_grid_other.fit(x_train_other, y_train_other)
print(svc_grid_other.score(x_test_other, y_test_other))
print(svc_grid_other.best_params_)

SVM Best params:
- NA: `{'C': 10.0, 'kernel': 'linear'}` -> 0.5969
- EU: `{'C': 0.1, 'kernel': 'linear'}` -> 0.7273
- JP: `{'C': 0.1, 'kernel': 'linear'}` -> 0.8268
- Other: `{'C': 1.0, 'kernel': 'linear'}` -> 0.9245

### Cross Validation for Logistic Regression

In [None]:
clf = LogisticRegression()

scores_na = cross_val_score(clf, x, y_na, cv=10)
print("CV scores of NA = {}".format(scores_na))
print("mean CV score of NA= {}".format(scores_na.mean()))

scores_eu = cross_val_score(clf, x, y_eu, cv=10)
print("CV scores of EU = {}".format(scores_eu))
print("mean CV score of EU = {}".format(scores_eu.mean()))

scores_jp = cross_val_score(clf, x, y_jp, cv=10)
print("CV scores of JP = {}".format(scores_jp))
print("mean CV score of JP = {}".format(scores_jp.mean()))

scores_other = cross_val_score(clf, x, y_other, cv=10)
print("CV scores of Other = {}".format(scores_other))
print("mean CV score of Other = {}".format(scores_other.mean()))


### Grid Search for Logistic Regression

In [None]:
params = {'C': [0.1, 1.0, 10.0]}
lr_grid_na = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)
lr_grid_eu = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)
lr_grid_jp = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)
lr_grid_other = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)

In [None]:
lr_grid_na.fit(x_train_na, y_train_na)
print(lr_grid_na.score(x_test_na, y_test_na))
print(lr_grid_na.best_params_)

In [None]:
lr_grid_eu.fit(x_train_eu, y_train_eu)
print(lr_grid_eu.score(x_test_eu, y_test_eu))
print(lr_grid_eu.best_params_)

In [None]:
lr_grid_jp.fit(x_train_jp, y_train_jp)
print(lr_grid_jp.score(x_test_jp, y_test_jp))
print(lr_grid_jp.best_params_)

In [None]:
lr_grid_other.fit(x_train_other, y_train_other)
print(lr_grid_other.score(x_test_other, y_test_other))
print(lr_grid_other.best_params_)

Logistic Regression Best params:
- NA: `{'C': 1.0}` -> 0.5815
- EU: `{'C': 10.0}` -> 0.7256
- JP: `{'C': 0.1}` -> 0.8309
- Other: `{'C': 1.0}` -> 0.9074

### Cross Validation for KNN

In [None]:
clf = KNeighborsClassifier()

scores_na = cross_val_score(clf, x, y_na, cv=10)
print("CV scores of NA = {}".format(scores_na))
print("mean CV score of NA= {}".format(scores_na.mean()))

scores_eu = cross_val_score(clf, x, y_eu, cv=10)
print("CV scores of EU = {}".format(scores_eu))
print("mean CV score of EU = {}".format(scores_eu.mean()))

scores_jp = cross_val_score(clf, x, y_jp, cv=10)
print("CV scores of JP = {}".format(scores_jp))
print("mean CV score of JP = {}".format(scores_jp.mean()))

scores_other = cross_val_score(clf, x, y_other, cv=10)
print("CV scores of Other = {}".format(scores_other))
print("mean CV score of Other = {}".format(scores_other.mean()))

### Grid Search for KNN

In [None]:
params = {'n_neighbors': [5, 10, 20, 30, 50],'metric': ['minkowski',], 'p': [2]}
knn_grid_na = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)
knn_grid_eu = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)
knn_grid_jp = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)
knn_grid_other = GridSearchCV(clf, param_grid = params, scoring='accuracy', n_jobs=-1, cv=10)

In [None]:
knn_grid_na.fit(x_train_na, y_train_na)
print(knn_grid_na.score(x_test_na, y_test_na))
print(knn_grid_na.best_params_)

In [None]:
knn_grid_eu.fit(x_train_eu, y_train_eu)
print(knn_grid_eu.score(x_test_eu, y_test_eu))
print(knn_grid_eu.best_params_)

In [None]:
knn_grid_jp.fit(x_train_jp, y_train_jp)
print(knn_grid_jp.score(x_test_jp, y_test_jp))
print(knn_grid_jp.best_params_)

In [None]:
knn_grid_other.fit(x_train_other, y_train_other)
print(knn_grid_other.score(x_test_other, y_test_other))
print(knn_grid_other.best_params_)

KNN Best params:
- NA: `{'metric': 'minkowski', 'n_neighbors': 50, 'p': 2}` -> 0.6552
- EU: `{'metric': 'minkowski', 'n_neighbors': 30, 'p': 2}` -> 0.7050
- JP: `{'metric': 'minkowski', 'n_neighbors': 50, 'p': 2}` -> 0.8268
- Other: `{'metric': 'minkowski', 'n_neighbors': 50, 'p': 2}` -> 0.8799

### Result Summary

The following table shows, for each region, the method and parameters that gave the highest test accuracy for our target columns.

| Region | Method | Params | Accuracy |
| --- | --- | --- | --- |
| NA | KNN | `{'metric': 'minkowski', 'n_neighbors': 50, 'p': 2}` | 65.5% |
| EU | SVM | `{'C': 0.1, 'kernel': 'linear'}` | 72.7% |
| JP | LR | `{'C': 0.1}` | 83.1% |
| Other | SVM | `{'C': 1.0, 'kernel': 'linear'}` | 92.5% |
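
The winner for each region can be derived directly from the test accuracies reported in the three "Best params" lists above. This is a small sketch; the `scores` and `summary` names are introduced here for illustration.

```python
import pandas as pd

# Test accuracies as reported in the "Best params" lists for each classifier.
scores = pd.DataFrame(
    {
        "SVM": [0.5969, 0.7273, 0.8268, 0.9245],
        "LR": [0.5815, 0.7256, 0.8309, 0.9074],
        "KNN": [0.6552, 0.7050, 0.8268, 0.8799],
    },
    index=["NA", "EU", "JP", "Other"],
)

# For each region (row), pick the method with the highest accuracy.
summary = pd.DataFrame(
    {"Method": scores.idxmax(axis=1), "Accuracy": scores.max(axis=1)}
)
print(summary)
```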