**PUBG Finish Placement Model Creation**

What did we discover? 
Note: Some of the markdown cells in this notebook are generally notes for myself and thought process about tackling the data.

**Data Input**

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pandas as pd
import matplotlib.pyplot as plt

trainpath='../input/pubg-finish-placement-prediction/train_V2.csv'
testpath='../input/pubg-finish-placement-prediction/test_V2.csv'
samplepath='../input/pubg-finish-placement-prediction/sample_submission_V2.csv'

train = pd.read_csv(trainpath)
test = pd.read_csv(testpath)
sample= pd.read_csv(samplepath)

In [2]:
train.head()

In [3]:
test.head()

Okay, so looking at the head of the data. I can already spot a few rows to get rid of since they are not useful for our purposes here. I think ID, GroupID, and Match ID are irrelevant to our predictions as they are essentially just labels to identify groups and players as well as the matches the stats came from. Additionally, there is the Match_Type, ideally this categorical variable should be turned into dummy variables -> Additional columns to mark 1 and 0. This is likely where my model is deficient!

Additionally there's quite a few anomalous values for rank points, with -1 values amongst 1500's or greater. This just seems not great, I could choose to drop it for time or attempt to regularize this. (I chose to drop it)

**Data Preprocessing**

Removal of the various ID's, as well as filling a NaN in the index 2744604 as noted by a Kaggler named SeaOtter

In [4]:
train=train.fillna(method='ffill')
train.drop(['Id','groupId','matchId'],inplace=True,axis=1)
test.drop(['Id','groupId','matchId'],inplace=True,axis=1)

This was an attempt to make a heatmap of the correlations between the features and the goal. While mostly unreadable it does give a valuable look that there are actually a pretty significant number of variables just have close to no correlation to the final placement, but a select few that do have positive correlations. 

In [5]:
import seaborn as sn
corrMatrix=train.corr()
sn.heatmap(corrMatrix, annot=True)
plt.savefig('heatmap.png')
plt.show()
corrMatrix

We can see from the correlation matrix, the variables that have the largest correlation to winPlacePerc are boosts, walking distance, weapons acquired. ETC, So I'll toss the other uncorrelated values from the data and train on just those. Since I find that the others will be relatively useless bloat or bring down our model. The data set is properly enormous so if we can bring down the size of it that would be perfect.

Creating new sub-data sets out of the columns I think are best correlated to the goal, and creating the label set. Then splitting these smaller sets into train_test_split groups so I can cross validate within the notebook before submitting it. Additionally, ignoring the actual test data to instead use the CV data is just....computationally less annoying. Using the test data takes forever.

In [6]:
xtrain=train[['boosts','walkDistance','weaponsAcquired','damageDealt','kills']]
ytrain=train['winPlacePerc']
xtest=test[['boosts','walkDistance','weaponsAcquired','damageDealt','kills']]

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(xtrain, ytrain, test_size=0.33, random_state=42)

from sklearn import preprocessing

print('Pre-scale: ', x_train[:2])
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
#Apply the learned scaler to Test data:
x_test = scaler.transform(x_test)
print('Post-scale', x_train[:2])

Here lies my final dense NN I left behind rather quickly. I tried many variations of the layers including different numbers of neurons, different amounts of layers, and variations of dropout. All very fruitless in producing good results for this dataset compared to Linear and Tree based models. 

In [7]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras.layers import Flatten
def myNN(train_data):
  model=tf.keras.models.Sequential()
  model.add(layers.Dense(units=64,input_shape=(train_data.shape[1],),activation='relu',name='input'))
  model.add(layers.Dropout(0.15))
  model.add(layers.Dense(units=64,activation='relu',name='hidden1'))
  model.add(layers.Dropout(0.15))
  model.add(layers.Dense(units=64,activation='relu',name='hidden2'))
  model.add(layers.Dropout(0.15))
  model.add(layers.Dense(units=64,activation='relu',name='hidden3'))

  model.add(layers.Dense(units=1,activation=None,name='output'))

  model.compile(optimizer='rmsprop',loss='mse',metrics=['mae','accuracy'])

  return model

In [8]:
PUBGmodel = myNN(x_train)
PUBGmodel.summary()

In [9]:
ne = 5 #number of train loops

#Train
callbacks=[keras.callbacks.ModelCheckpoint("pubg.keras",save_best_only=True)]
history=PUBGmodel.fit(x_train, y_train, epochs=ne, batch_size=128, verbose=1,validation_split=0.25,callbacks=callbacks)
model = keras.models.load_model("pubg.keras")

In [10]:
hd = history.history
print(hd)
loss_tr = hd['accuracy']
loss_va = hd['val_accuracy']
epochs = range(0, ne) #ne is number of epochs. Set it! 

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

plt.plot(epochs, loss_tr, '-.o', label='Training Acc')
plt.plot(epochs, loss_va, 'r', label='Validation Acc')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

As we can see, the dense network just doesn't do that well. It's kind of horrible actually. It had an MAE of 0.16610 when submitted to the competition. It somehow beat a few peoples submissions but it's definitely not the best we can aim for. So let's move on to the tried and true regression models. 

In [11]:
#ytest=model.predict(xtest)

In [12]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

y_model = model.predict(x_train)
#Calculate error and plot
mse = mean_squared_error(y_train, y_model)
mae = mean_absolute_error(y_train,y_model)
print("MSE: ", mse)
print("MAE: ", mae)
import gc
gc.collect()

Here I tried applying our processed data into a very simple linear regression model. Not too much of interest going on here code wise. However...

In [13]:
from sklearn import linear_model
PUBGModel2=linear_model.LinearRegression()
PUBGModel2.fit(x_train,y_train)
#ytest=PUBGModel2.predict(xtest)

In [14]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

y_model = PUBGModel2.predict(x_train)
#Calculate error and plot
mse = mean_squared_error(y_train, y_model)
mae = mean_absolute_error(y_train,y_model)
print("MSE: ", mse)
print("MAE: ", mae)
import gc
gc.collect()

Our Linear Regression actually does worse in cross validation, but when submitted to the competition itself it nets a better score than that NN. 

This simple linear regression model was able to climb pretty well up the leaderboard alone with no other changes and just some data preprocessing. MAE: Score: 0.12912

Let's try something else

Histogram-based Gradient Boosting Regression Tree.
The sklearn page for this model type credits its implementation to being inspired by LightGBM https://github.com/Microsoft/LightGBM.

So what is a Light Gradient Boosting Tree? It is a decision tree where multiple weak models are created, then combined to improve as we go. It has some similarities to random forests.

In [15]:
from sklearn.ensemble import HistGradientBoostingRegressor
PUBGM=HistGradientBoostingRegressor().fit(x_train,y_train)
ytest=PUBGM.predict(xtest)

In [16]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

y_model = PUBGM.predict(x_train)
#Calculate error and plot
mse = mean_squared_error(y_train, y_model)
mae = mean_absolute_error(y_train,y_model)
print("MSE: ", mse)
print("MAE: ", mae)

This version of the model gave us a score of: 0.10597 closer to something greater, it also does better in cross validation than the other two models. 

In [17]:
df_submit = sample.copy()
df_submit['winPlacePerc']=ytest
df_submit.head()
df_submit.to_csv('submission.csv', index=False)