# 1. Introduction

This notebook takes a quick look at the **videogamesales** dataset from GregorySmith. We will do some simple data visualization and then try to build a simple model that guesses a game's platform from its other attributes.

## 1.1. Imports 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt #Simple plots
import seaborn as sns #Pretty plots
import altair as alt #Interactive plots
import IPython #for JS

#Data science
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense,Input

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## 1.2. A look at the dataset with Pandas

In [None]:
vgDB = pd.read_csv("/kaggle/input/videogamesales/vgsales.csv")
vgDB

# 2. Data Visualization

## 2.1. Using Matplotlib

In [None]:
# Select Global_Sales before summing so that text columns are never aggregated
plt.figure(figsize=(25,5))
plt.subplot(2, 2, 1)
plt.title('Top 5 of the sum of sales per Publisher')
plt.plot(vgDB.groupby("Publisher")[["Global_Sales"]].sum().sort_values(by="Global_Sales",ascending=False).head(5))

plt.figure(figsize=(25,2))
plt.subplot(2, 2, 2)
plt.title('Top 10 sales per Platform')
plt.plot(vgDB.groupby("Platform")[["Global_Sales"]].sum().sort_values(by="Global_Sales",ascending=False).head(10))

plt.figure(figsize=(20,5))
plt.subplot(2, 2, 3)
plt.title('Sum of sales per Year')
plt.plot(vgDB.query("Year < 2015").groupby("Year")[["Global_Sales"]].sum())

## 2.2. Using Seaborn (for pretty graphs)

In [None]:
plt.title('Sales per year')
an = vgDB.query("Year < 2015 & Publisher in ['Electronic Arts','Ubisoft','Nintendo','Activision','Sony Computer Entertainment']").filter(["Year","Global_Sales","Publisher"])
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=an["Year"],y=an["Global_Sales"]/80,hue=an["Publisher"])

## 2.3. Using Altair (for interactive graphs)

In [None]:
source = an

an['Year'] = an['Year'].astype(float).astype(int).astype(str)

selection = alt.selection_multi(fields=['Publisher'], bind='legend')

alt.Chart(source).mark_point().encode(
  alt.X('Year:T'),
  alt.Y('Global_Sales'),
  alt.Color('Publisher'),
  opacity=alt.condition(selection, alt.value(1), alt.value(0.1))
).add_selection(
    selection
).interactive()

# 3. Data Science

## 3.1. Data Preparation

Our neural network has to handle numbers rather than strings, so we have to encode *Publisher*, *Genre* and *Platform* as integers: for instance, "Nintendo" becomes 0, "Sega" becomes 1, etc., using dictionaries.
However, for this notebook, I will only keep the top 10 publishers.

In [None]:
# Top 10 publishers by total global sales
List_dev = vgDB.groupby("Publisher")[["Global_Sales"]].sum().sort_values(by="Global_Sales",ascending=False).head(10)
class_to_dev = {name: i for i, name in enumerate(List_dev.index.values)}
DB = vgDB.copy()
# Keep only the rows of the top 10 publishers, then replace each name by its integer code
DB = DB[DB["Publisher"].isin(list(class_to_dev))].reset_index(drop=True)
DB["Publisher"] = DB["Publisher"].map(class_to_dev)

In [None]:
List_platf = vgDB["Platform"].unique()
class_to_platf = { List_platf[i]:i  for i in range(len(List_platf))}
DB["Platform"] = DB["Platform"].map(class_to_platf)  # assign back instead of replacing in place

In [None]:
List_genre = vgDB["Genre"].unique()
class_to_genre = { List_genre[i]:i  for i in range(len(List_genre))}
DB["Genre"] = DB["Genre"].map(class_to_genre)  # assign back instead of replacing in place
DB.dropna(inplace=True)
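
The dictionary-based encoding used in the cells above can be illustrated on a toy frame. This is only a minimal sketch: the four rows, the `Publisher_id` column and the `id_to_class` helper are made up for the illustration and are not part of the notebook's pipeline.

```python
import pandas as pd

# Toy data; the rows are illustrative only.
toy = pd.DataFrame({"Publisher": ["Nintendo", "Sega", "Nintendo", "Ubisoft"]})

# Build the label -> integer dictionary in order of first appearance.
labels = toy["Publisher"].unique()
class_to_id = {label: i for i, label in enumerate(labels)}

# map() returns a new integer column and leaves the original untouched.
toy["Publisher_id"] = toy["Publisher"].map(class_to_id)

# The inverse dictionary turns a predicted integer back into its label.
id_to_class = {i: label for label, i in class_to_id.items()}

print(toy)
print(id_to_class[1])
```

Keeping the inverse dictionary around makes it easy to translate a model's integer prediction back into a readable name.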

Then, we normalize *Year* and all the sales columns (zero mean, unit standard deviation) so that the features are on comparable scales for the network.

In [None]:
def Normalization(column):
  return (column-column.mean())/column.std()

In [None]:
for col in ["Year","NA_Sales","EU_Sales","JP_Sales","Other_Sales","Global_Sales"]:
  DB[col] = Normalization(DB[col])
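
As a quick sanity check of this kind of normalization, the sketch below applies the same formula to a small hypothetical series (the values are made up) and verifies that the result has zero mean and unit standard deviation. Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

def normalize(column):
    # Same z-score formula as in the notebook: centre, then scale.
    return (column - column.mean()) / column.std()

# Hypothetical sales figures, for illustration only.
sales = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
scaled = normalize(sales)

print(scaled.round(3).tolist())
print(abs(scaled.mean()) < 1e-9, abs(scaled.std() - 1.0) < 1e-9)
```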

It is common to use sklearn's *train_test_split* function to split the dataset into a training set and a test set. This way, the network is evaluated on data it has never seen, which gives a more reliable estimate of its accuracy/score.

In [None]:
X = DB.reset_index().filter(["Year","Genre","Publisher","NA_Sales","EU_Sales","JP_Sales","Other_Sales","Global_Sales"]).to_numpy()
y = DB.reset_index().filter(["Platform"]).to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X,y)
print("Separation from {} elements to : train = {} ; test = {}.".format(X.shape[0],X_train.shape[0],X_test.shape[0]))
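
For reference, here is a minimal sketch of how `train_test_split` behaves on toy arrays. When no size is given, it holds out 25% of the samples for testing. The `random_state=0` argument is an addition for this illustration (the notebook does not set it); it makes the split reproducible from one run to the next.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each, and alternating labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# Default split is 75% train / 25% test; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

print(X_tr.shape, X_te.shape)
```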

## 3.2. Using K-nearest neighbours with sklearn

This model reaches a reasonable score, but we will see later that the neural-network classifiers are more relevant for this problem. Note that the argument passed to *KNeighborsClassifier* is the number of neighbours that vote on each prediction, not the number of classes; here it is simply set to the number of distinct platforms.

In [None]:
knn = KNeighborsClassifier(len(List_platf))
knn.fit(X_train, y_train.ravel())
score = knn.score(X_test, y_test)
print("Score :", score)
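
To make the role of `n_neighbors` concrete, here is a small sketch on made-up 2D points. It has two classes but uses three neighbours, which shows that the two numbers are independent. It also shows that `predict` returns class labels directly rather than one score per class.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups of points (illustrative data only).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Each prediction is a majority vote among the 3 closest training points.
toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_toy, y_toy)

pred = toy_knn.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(pred)
```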

In [None]:
test = np.array([5,1,0,1,2,2,1,2]) # arbitrary test sample with Nintendo as the publisher
test = test.reshape((1,8))

# predict() already returns the class label, so it is used directly as an index (no argmax)
print("For " + str(List_dev.index.values[test[0,2]]) + " as a publisher and " + str(List_genre[test[0,1]]) + " as a genre, the platform predicted is : \n" + str(List_platf[int(knn.predict(test)[0])]))

## 3.3. Using an MLPClassifier with sklearn

In [None]:
model_sk = MLPClassifier(hidden_layer_sizes=(8,16,32,64))
model_sk.fit(X_train,y_train.ravel())
score = model_sk.score(X_test,y_test)
print("Score :", score)

In [None]:
test = np.array([5,1,0,1,2,2,1,2]) # arbitrary test sample with Nintendo as the publisher
test = test.reshape((1,8))

# predict() already returns the class label, so it is used directly as an index (no argmax)
print("For " + str(List_dev.index.values[test[0,2]]) + " as a publisher and " + str(List_genre[test[0,1]]) + " as a genre, the platform predicted is : \n" + str(List_platf[int(model_sk.predict(test)[0])]))

## 3.4. Using TensorFlow

In [None]:
X_train= np.array(X_train).astype('float32')
X_test=np.array(X_test).astype('float32')
y_train=np.array(y_train).astype('float32')
y_test =np.array(y_test).astype('float32')

In [None]:
model = Sequential()
model.add(Input(shape=(8,)))
model.add(Dense(8,activation="relu"))
model.add(Dense(16,activation="relu"))
model.add(Dense(32,activation="relu"))
model.add(Dense(64,activation="relu"))
model.add(Dense(len(List_platf),activation="softmax"))
# The Input layer above already defines the input shape, so no explicit build() call is needed
model.summary()

In [None]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
model.compile(loss=loss_fn, optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train,y_train,epochs=150,validation_data=(X_test,y_test))

In [None]:
loss_curve = history.history["loss"]
acc_curve = history.history["accuracy"]
loss_val_curve = history.history["val_loss"]
acc_val_curve = history.history["val_accuracy"]

plt.plot(loss_curve,label="Train")
plt.plot(loss_val_curve,label="Validation")
plt.legend(loc='upper right')
plt.title("Loss")
plt.show()
plt.plot(acc_curve,label="Train")
plt.plot(acc_val_curve,label="Validation")
plt.legend(loc='lower right')
plt.title("Accuracy")
plt.show()

The plots show the loss and accuracy curves for both the training set and the test set. We can see the first signs of overfitting: training accuracy keeps increasing while test accuracy stays stuck around 0.55. However, this remains a better score than our two previous models.
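
One simple way to quantify where overfitting starts is to look for the epoch with the lowest validation loss. The sketch below does this on a hypothetical curve; the numbers are made up and are not the output of the training run above, and `best_epoch` is a helper introduced only for this illustration. The same function could be applied to `history.history["val_loss"]`.

```python
def best_epoch(val_loss):
    """Return the 0-based index of the epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)

# Hypothetical validation-loss curve: it improves, then degrades again.
val_loss = [1.90, 1.50, 1.30, 1.25, 1.27, 1.31, 1.40]
epoch = best_epoch(val_loss)

print("Lowest validation loss at epoch", epoch)
```

Keras also provides the `tf.keras.callbacks.EarlyStopping` callback, which can stop training automatically once the monitored validation metric stops improving.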

In [None]:
test = np.array([5,1,0,1,2,2,1,2]) # arbitrary test sample with Nintendo as the publisher
test = test.reshape((1,8))

print("For " + str(List_dev.index.values[test[0,2]]) + " as a publisher and " + str(List_genre[test[0,1]]) + " as a genre, the platform predicted is : \n" + str(List_platf[np.argmax(model.predict(test))]))