In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import tensorflow as tf
from keras import backend as K
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from scipy import stats

Let's define a function to read a given file and clean up the data in a unified manner

In [None]:
def cleanFeatures(df):
    #drop columns that likely don't affect the outcome of the match (the ids), the y values (winPlacePerc), and matchType (one-hot encoded below)
    droppedColumns = df.drop(columns=['Id', 'groupId', 'matchId', 'winPlacePerc', 'matchType'], errors='ignore')

    #get the dummy (one-hot) columns for the matchType
    matchTypeDummies = pd.get_dummies(df['matchType'])

    #scale everything else to the 0-1 range (note: the scaler is refit on every call)
    mms = MinMaxScaler()
    scaledDroppedColumns = mms.fit_transform(droppedColumns)

    #combine the scaled features and the dummies into one input frame, sharing df's index so the rows line up
    scaledDroppedColumnsDf = pd.DataFrame(data=scaledDroppedColumns, columns=list(droppedColumns), index=df.index)
    X = pd.concat([scaledDroppedColumnsDf, matchTypeDummies], axis=1)
    
    return X


In [None]:
#read in the CSV file
allColumns = pd.read_csv("../input/train_V2.csv")

#replace empty win percentages (the labels) with 0.0
allColumns['winPlacePerc'] = allColumns['winPlacePerc'].fillna(0.0)
#remove any other NaN rows as they will just cause problems in the NN
#dropna returns a new frame, so keep the result and reset the index
allColumns = allColumns.dropna().reset_index(drop=True)

#build the features only after cleaning so that X and y contain the same rows
X = cleanFeatures(allColumns)

#get just the labels for win percentage
y = allColumns['winPlacePerc'].values

#split the data at an 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

In [None]:
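#create a heat map to see if there are any features that are highly correlated
#only numeric columns can be correlated, so the id and matchType strings are left out
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(allColumns.select_dtypes(include='number').corr(), annot=True, linewidths=.5, fmt='.1f', ax=ax)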
plt.show()

The closer a value is to 1.0 or -1.0, the more strongly two features are correlated, positively or negatively. As can be seen, the diagonal shows that every feature is perfectly correlated with itself.

As can be expected, the number of "winPoints" correlates very highly with the number of "killPoints": more kills means more wins.
The high correlation between "numGroups" and "maxPlace" is a side effect of the way the data was collected.

The -1.0 correlations are between "rankPoints", "killPoints", and "winPoints". All three of these features are related to an Elo-type ranking system for players, but it appears that "rankPoints" is being deprecated and can likely be discounted. The deprecation could be the reason that "rankPoints" has such a strong negative correlation.

Looking strictly at the "winPlacePerc" correlations, "walkDistance" appears to have the strongest relation to whether a player wins the match, whereas "killPlace" (the rank for number of players killed) appears to have very little correlation with whether a player wins. The further a player walks (searching for new items, getting to the center of the battleground), the more likely they are to win.
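
The heat map is dense, so it can help to list the correlations with the target directly. The following is a minimal sketch on a tiny made-up frame; the helper name `rankTargetCorrelations` and the toy values are illustrative assumptions and are not taken from the data set. On the real data the same call would be made with `allColumns`.

```python
import pandas as pd

def rankTargetCorrelations(df, target='winPlacePerc'):
    # keep numeric columns only; id and matchType strings cannot be correlated
    numeric = df.select_dtypes(include='number')
    # Pearson correlation of every numeric feature with the target
    corr = numeric.corr()[target].drop(target)
    # order by absolute value so strong negative correlations are not overlooked
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# toy data for illustration only (not taken from train_V2.csv)
toy = pd.DataFrame({
    'walkDistance': [10.0, 200.0, 900.0, 1500.0, 3000.0],
    'killPlace':    [90, 70, 40, 20, 5],
    'matchType':    ['solo', 'duo', 'squad', 'solo', 'duo'],
    'winPlacePerc': [0.05, 0.20, 0.55, 0.75, 1.00],
})
ranked = rankTargetCorrelations(toy)
print(ranked)
```

Sorting by absolute value matters because a large negative correlation is as informative as a large positive one, and a heat map formatted to one decimal place can make the two hard to compare by eye.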

The first pass through this data set used a fully connected Keras neural network with 6 layers of between 64 and 1024 nodes each. The design included dropout (due to the high number of input points) and batch normalization, with Mean Squared Error as the loss. Unfortunately, even a single fitting run showed that the validation set was only around 2.8% correct. That is unusable and frankly a waste of time to pursue further. Even reducing the model to one layer with only a handful of nodes still settled at around 2.8% correct.

In the end, Occam's Razor wins: back to the drawing board to see if there is anything easier to use and significantly more correct. 

After some searching, one of scikit-learn's pre-made neural networks, the Multi-Layer Perceptron Regressor (MLPRegressor - http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor), looked likely to solve the issues at hand.

The MLPRegressor has a number of parameters that control how the network is built and trained. To find the best configuration, a grid search over these parameters was performed (elsewhere). The grid of parameters is as follows:
* Hidden Layer Size (number of nodes in the single hidden layer)
    * 10
    * 100
    * 200
* Activation
    * Logistic
    * RELU
* Optimizer
    * Adam
    * SGD
* Alpha (L2 Penalty)
    * 0.0001
    * 0.01
* Batch Size
    * 128
    * 256
* Learning Rate Schedule (for weight updates)
    * Constant
    * Inverse Scaling: "gradually decreases the learning rate learning_rate_ at each time step ‘t’ using an inverse scaling exponent of ‘power_t’. effective_learning_rate = learning_rate_init / pow(t, power_t)"
    * Adaptive: "keeps the learning rate constant to ‘learning_rate_init’ as long as training loss keeps decreasing. Each time two consecutive epochs fail to decrease training loss by at least tol, or fail to increase validation score by at least tol if ‘early_stopping’ is on, the current learning rate is divided by 5."
* Initial Learning Rate
    * 0.001
    * 0.005
* powerT: "The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’"
    * 0.5
    * 0.9
* Momentum
    * 0.5
    * 0.9

The quoted notes are from the MLPRegressor documentation.
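
Since the search itself was run elsewhere, the following is only a sketch of how such a grid could be expressed with scikit-learn's `GridSearchCV`. The full grid is written out with MLPRegressor's parameter names, assuming the hidden layer values are widths of a single hidden layer; the actual fit uses a much smaller grid, synthetic data, and a low `max_iter`, all of which are assumptions made so the example runs in seconds.

```python
import warnings
from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# the grid described above, written with MLPRegressor's parameter names
fullGrid = {
    'hidden_layer_sizes': [(10,), (100,), (200,)],
    'activation': ['logistic', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0.0001, 0.01],
    'batch_size': [128, 256],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init': [0.001, 0.005],
    'power_t': [0.5, 0.9],
    'momentum': [0.5, 0.9],
}

# a much smaller grid and synthetic data so the sketch runs quickly (assumption)
smallGrid = {
    'hidden_layer_sizes': [(10,), (20,)],
    'activation': ['logistic', 'relu'],
}
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

search = GridSearchCV(MLPRegressor(max_iter=50, random_state=0), smallGrid, cv=3)
with warnings.catch_warnings():
    # 50 iterations is too few to converge; silence the expected warning
    warnings.simplefilter('ignore', category=ConvergenceWarning)
    search.fit(X_demo, y_demo)

print(search.best_params_)
```

The full grid contains 1,152 combinations, each of which is fitted once per cross-validation fold, which is why a search of this size is lengthy.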


After a lengthy search for the best-scoring network, the network hyper parameters chosen are:

* Activation function: RELU
* Alpha: 0.0001
* Batch Size: 256
* Hidden Layer Size: 200
* Learning Rate Schedule: Constant
* Initial Learning Rate: 0.001
* Momentum: 0.9
* Optimizer: Adam
* PowerT: 0.5

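This network showed a score of about 92.9% for the input data, which is significantly higher than the 2.8% from the custom built network. Note that the MLPRegressor score is the coefficient of determination (R²) of the predictions, not a percentage of correct predictions.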
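
To make the meaning of that score concrete, the sketch below checks on synthetic data that `MLPRegressor.score` matches `r2_score`, and also reports the mean absolute error, which is expressed in the same units as the target. The data, the small network, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.neural_network import MLPRegressor

# synthetic regression data with targets in the 0-1 range (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 4)
y_demo = X_demo.mean(axis=1)

demoModel = MLPRegressor(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
demoModel.fit(X_demo[:200], y_demo[:200])

predictions = demoModel.predict(X_demo[200:])
# score() is the coefficient of determination, the same value r2_score computes
r2 = demoModel.score(X_demo[200:], y_demo[200:])
# mean absolute error is in the same units as the target
mae = mean_absolute_error(y_demo[200:], predictions)
print(r2, mae)
```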

Interestingly, another set of hyper parameters scored almost as well, about 92.7%: every hyper parameter was the same except that the Batch Size was 128 and the Hidden Layer Size was 100, both half of the respective values for the chosen network. These two values define the size of the network and how many samples feed each weight update, rather than how the optimizer adjusts the weights and biases, and they likely interact closely with each other.
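
One detail worth keeping in mind when reading the grid: according to the MLPRegressor documentation, `learning_rate`, `power_t`, and `momentum` are only used when the solver is `sgd`. The sketch below illustrates this on synthetic data by training two Adam networks that differ only in those three parameters; the data and the small network size are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# synthetic data for illustration only
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
y_demo = X_demo.sum(axis=1)

# identical settings and seed; only the sgd-specific parameters differ
common = dict(hidden_layer_sizes=(10,), solver='adam', max_iter=30, random_state=0)
modelA = MLPRegressor(momentum=0.5, power_t=0.5, learning_rate='constant', **common)
modelB = MLPRegressor(momentum=0.9, power_t=0.9, learning_rate='invscaling', **common)

predA = modelA.fit(X_demo, y_demo).predict(X_demo)
predB = modelB.fit(X_demo, y_demo).predict(X_demo)

# True when the sgd-specific parameters had no effect on the Adam-trained network
sameResult = np.allclose(predA, predB)
print(sameResult)
```

For a grid that includes the Adam solver, this means several combinations train the same network, so restricting those three parameters to the `sgd` branch of the grid would shorten the search.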

Now we use the hyper parameters found to create an MLPRegressor network

In [None]:
K.clear_session()

model = MLPRegressor(
    hidden_layer_sizes=(200,), 
    activation="relu", 
    solver="adam", 
    alpha=0.0001, 
    batch_size=256, 
    learning_rate="constant", # only used when solver="sgd"
    learning_rate_init=0.001, 
    power_t=0.5, # only used when solver="sgd"
    momentum=0.9, # only used when solver="sgd"
    verbose=1, 
    early_stopping=True) # allow the model to stop early when it is no longer progressing
model.fit(X_train, y_train)

In [None]:
print("Score for test split data: " + str(model.score(X_test, y_test)))

Now the actual test data should be read in, and transformed in the same manner as the training data. 

In [None]:
testColumns = pd.read_csv('../input/test_V2.csv') 
testFeatures = cleanFeatures(testColumns)

testFeatures.describe()
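
Note that `cleanFeatures` fits a new `MinMaxScaler` on every call, so the training and test frames are each scaled by their own minimum and maximum, and the dummy columns depend on which match types happen to appear in each file. A minimal sketch of one way to keep the two transformations identical is shown below; the helper names `fitPreprocessing` and `applyPreprocessing` and the toy frames are hypothetical and are not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

NON_FEATURES = ['Id', 'groupId', 'matchId', 'winPlacePerc', 'matchType']

def fitPreprocessing(trainDf):
    # learn the scaling and the dummy columns from the training data only
    numeric = trainDf.drop(columns=NON_FEATURES, errors='ignore')
    scaler = MinMaxScaler().fit(numeric)
    dummyColumns = list(pd.get_dummies(trainDf['matchType']).columns)
    return scaler, list(numeric.columns), dummyColumns

def applyPreprocessing(df, scaler, numericColumns, dummyColumns):
    # reuse the fitted scaler so train and test share one scale
    scaled = pd.DataFrame(scaler.transform(df[numericColumns]),
                          columns=numericColumns, index=df.index)
    # reindex so match types missing from this frame become all-zero columns
    dummies = pd.get_dummies(df['matchType']).reindex(columns=dummyColumns, fill_value=0)
    return pd.concat([scaled, dummies.astype(float)], axis=1)

# toy frames for illustration only
trainToy = pd.DataFrame({'kills': [0, 2, 4], 'matchType': ['solo', 'duo', 'squad'],
                         'winPlacePerc': [0.1, 0.5, 0.9]})
testToy = pd.DataFrame({'kills': [1, 8], 'matchType': ['solo', 'solo']})

scaler, numericColumns, dummyColumns = fitPreprocessing(trainToy)
trainX = applyPreprocessing(trainToy, scaler, numericColumns, dummyColumns)
testX = applyPreprocessing(testToy, scaler, numericColumns, dummyColumns)
print(testX)
```

With a shared scaler, test values that exceed the training maximum map above 1.0 rather than being squeezed back into the 0-1 range, which keeps a given raw value meaning the same thing in both sets.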

As can be seen here (https://datascience.stackexchange.com/questions/31957/mlpregressor-output-range), there are some nuances to the MLPRegressor that may cause the output of the network to fall outside of the expected range of 0-1. To counteract these out-of-range values, the "out_activation_" attribute on the fitted model is manually set to "relu" before predicting the test output, which raises any negative output to a minimum of 0.

In [None]:
model.out_activation_ = 'relu'
placementPredictions = model.predict(testFeatures)

However, there are still predictions with a value above 1.0. Clean up those output values.

In [None]:
#treat any prediction above 1.00 as a win by capping it at 1.00
placementPredictions[placementPredictions > 1.0] = 1.0

stats.describe(placementPredictions)
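
As an alternative sketch, both ends of the range can be enforced in one step with `np.clip`, which would not require changing an attribute on the fitted model. The array below is made up for illustration and stands in for the raw output of `model.predict`.

```python
import numpy as np

# made-up raw regressor outputs, some outside the valid 0-1 range
rawPredictions = np.array([-0.12, 0.0, 0.37, 0.99, 1.08])

# clamp both ends in one step, leaving in-range values untouched
clipped = np.clip(rawPredictions, 0.0, 1.0)
print(clipped)
```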

Now that the predictions have been made, read the sample submission and write the placement values into it.

In [None]:
submission = pd.read_csv('../input/sample_submission_V2.csv')
submission['winPlacePerc'] = placementPredictions

submission.head()

Create the submission.csv

In [None]:
submission.to_csv("submission.csv", index=False)

Fin