# PUBG predictions with Gradient Boost

I previously created a notebook going through some [exploratory analysis][exploratory anaylsys] of the PUBG data. I went through many of the different features available and displayed plots describing the data and its potential correlation with the target variable.

* I found that there was one missing value for the target variable and decided that this row of data should be removed, as there was only one player for the match identified by the missing value.

* I also made a few decisions about creating new features, and found one important way of breaking the data up to gain higher correlations with our features for separate match types.

## Why Gradient Boost?
I've created a [kernel] that runs through a number of models to determine which would be best, and I've decided to create predictions for the few models that achieved reasonable accuracy. This kernel runs the same feature engineering and scaling before fitting the model to the training data and making predictions for the testing data. The Gradient Boost Regression model was the best, scoring 92.86% on the validation data (30% of the training data).

[exploratory anaylsys]: https://www.kaggle.com/beaubellamy/pubg-eda#
[kernel]: https://www.kaggle.com/beaubellamy/pubg-predictions

## Import libraries
We import the required libraries and load the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set()
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')

In [None]:
train = pd.read_csv('../input/train_V2.csv')
test = pd.read_csv('../input/test_V2.csv')

Let's check out the data again.

In [None]:

train.head()

## Missing Data
Based on our EDA, we found a row that had a NULL value for the target variable. We will remove the irrelevant row of data.

In [None]:
# Remove the row with the missing target value
train = train[train['winPlacePerc'].notna()]


## Let's engineer some features
We'll process the testing data the same way we do for the training data so the testing data has the same features and scaling as our training data.


### PlayersJoined
We can determine the number of players that joined each match by grouping the data by matchId and counting the players.

In [None]:
# Add a feature containing the number of players that joined each match.
train['playersJoined'] = train.groupby('matchId')['matchId'].transform('count')
test['playersJoined'] = test.groupby('matchId')['matchId'].transform('count')
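
To see what `transform('count')` does here, a minimal sketch on a made-up frame may help (the player and match IDs below are invented for illustration). Every row receives the size of its own match group, so the result lines up with the original rows.

```python
import pandas as pd

# Toy data: three players in match "m1", two players in match "m2" (hypothetical IDs)
toy = pd.DataFrame({
    'Id': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'matchId': ['m1', 'm1', 'm1', 'm2', 'm2'],
})

# transform('count') broadcasts each group's row count back onto every row of that group
toy['playersJoined'] = toy.groupby('matchId')['matchId'].transform('count')

print(toy['playersJoined'].tolist())   # [3, 3, 3, 2, 2]
```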


In [None]:
# Let's look at only those matches with more than 50 players.
data = train[train['playersJoined'] > 50]

plt.figure(figsize=(15,15))
# Pass the values explicitly as x, which works across seaborn versions
sns.countplot(x=data['playersJoined'].sort_values())
plt.title('Number of players joined',fontsize=15)
plt.show()

You can see that there aren't always 100 players in each match; in fact, it's more likely to have between 90 and 100 players. It may be beneficial to normalise those features that are affected by the number of players.

### Normalised Features
Here, I am making the assumption that it is easier to find an enemy when there are 100 players, than it is when there are 90 players.


In [None]:
def normaliseFeatures(data):
    # Scale factor: 1.0 for a full 100-player match, larger for smaller matches
    factor = (100 - data['playersJoined']) / 100 + 1

    features = ['kills', 'headshotKills', 'killPlace', 'killPoints', 'killStreaks',
                'longestKill', 'roadKills', 'teamKills', 'damageDealt', 'DBNOs', 'revives']

    # Add a normalised copy of each feature, e.g. 'kills' -> 'killsNorm'
    for feature in features:
        data[feature + 'Norm'] = data[feature] * factor

    # Remove the original features we normalised
    data = data.drop(features, axis=1)

    return data

train = normaliseFeatures(train)
test = normaliseFeatures(test)
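
As a quick check of the scaling used above, the multiplier is 1.0 for a full 100-player match and grows linearly as the match gets smaller. The helper name `norm_factor` and the kill count below are only for illustration.

```python
def norm_factor(players_joined):
    """Multiplier applied to per-player stats: (100 - n) / 100 + 1."""
    return (100 - players_joined) / 100 + 1

# 5 kills in a 90-player match are scaled up by a factor of 1.10
kills = 5
print(norm_factor(100))                    # 1.0
print(round(kills * norm_factor(90), 2))   # 5.5
```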

In [None]:
train.head()

### TotalDistance
An additional feature we can create is the total distance the player travels. This is a combination of all the distance features in the original data set.

In [None]:
# Total distance travelled
train['totalDistance'] = train['walkDistance'] + train['rideDistance'] + train['swimDistance']
test['totalDistance'] = test['walkDistance'] + test['rideDistance'] + test['swimDistance']


### Standardize the matchType feature
Here I decided that many of the existing 16 separate modes of game play were just different versions of four types of game.

1. Solo: Hunger Games style, last man/woman standing.
2. Duo: Teams of two against all other players.
3. Squad: Teams of up to 4 players against all other players.
4. Other: Custom and special event modes.

In [None]:
# Normalise the matchTypes to a standard format
def standardize_matchType(data):
    # Map each of the 16 raw match types onto one of four groups
    match_type_map = {
        'solo': 'Solo', 'solo-fpp': 'Solo',
        'normal-solo': 'Solo', 'normal-solo-fpp': 'Solo',
        'duo': 'Duo', 'duo-fpp': 'Duo',
        'normal-duo': 'Duo', 'normal-duo-fpp': 'Duo',
        'squad': 'Squad', 'squad-fpp': 'Squad',
        'normal-squad': 'Squad', 'normal-squad-fpp': 'Squad',
        'flaretpp': 'Other', 'flarefpp': 'Other',
        'crashtpp': 'Other', 'crashfpp': 'Other',
    }

    # replace() avoids chained indexing, which can fail to modify the DataFrame;
    # values not in the map are left unchanged
    data['matchType'] = data['matchType'].replace(match_type_map)

    return data


train = standardize_matchType(train)
test = standardize_matchType(test)

In [None]:
train = train.drop(['Id','groupId','matchId'], axis=1)
# Save the Ids for the submission later on
test_ids = test['Id']
test = test.drop(['Id','groupId','matchId'], axis=1)

Now we can transform the matchType values into integer labels so we can use them in the model.

In [None]:
# Transform the matchType into scalar values
le = LabelEncoder()
train['matchType'] = le.fit_transform(train['matchType'])
# Reuse the encoder fitted on the training data so both sets share the same mapping
test['matchType'] = le.transform(test['matchType'])
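
One thing to watch with label encoding: if the encoder is fitted a second time on data that contains a different set of categories, the same label can end up with a different integer. The sketch below uses two small invented series to show the difference between reusing a fitted encoder and refitting it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_types = pd.Series(['Duo', 'Other', 'Solo', 'Squad'])
test_types = pd.Series(['Solo', 'Squad', 'Solo'])   # 'Duo' and 'Other' happen to be absent

le = LabelEncoder()
le.fit(train_types)   # classes are sorted: Duo=0, Other=1, Solo=2, Squad=3

consistent = le.transform(test_types)                 # reuses the training mapping
refitted = LabelEncoder().fit_transform(test_types)   # builds a new mapping from the test data only

print(list(consistent))   # [2, 3, 2]
print(list(refitted))     # [0, 1, 0]
```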

In [None]:
# We can do a sanity check of the data, making sure we have the new 
# features created and the matchType feature is standardised.
train.head()

In [None]:
test.head()

## Scale the features
Some of the features have large variances, so to make sure they don't overly influence the training or predictions, we can scale all our features so that they have the same influence on the model.

In [None]:
train.describe()

You can see that most features range from 0 to the 100s or 1000s, but two features don't really need scaling, vehicleDestroys and matchType, as they only range from 0 to about 5 or 6. It's not necessary to scale these features, but we will anyway because it keeps the code simpler.

In [None]:
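scaler = MinMaxScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# Fit a second scaler on the training features only (test has no target column),
# so the test data is scaled with the training minimum and maximum, not its own.
feature_scaler = MinMaxScaler().fit(train[test.columns])
test_scaled = pd.DataFrame(feature_scaler.transform(test), columns=test.columns)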

train_scaled.head()

In [None]:
train_scaled.describe()
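
For reference, min-max scaling maps each column to the range 0 to 1 using (x - min) / (max - min). The sketch below, on invented numbers, shows that a scaler fitted on one data set and applied to another uses the first set's minimum and maximum, so values beyond that range land outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[0.0], [50.0], [100.0]])   # hypothetical training column
other_col = np.array([[25.0], [200.0]])          # hypothetical unseen column

scaler = MinMaxScaler()
train_result = scaler.fit_transform(train_col)   # learns min=0 and max=100
other_result = scaler.transform(other_col)       # reuses that min and max

print(train_result.ravel())   # 0 -> 0, 50 -> 0.5, 100 -> 1
print(other_result.ravel())   # 25 -> 0.25, 200 -> 2 (outside the 0 to 1 range)
```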

## Model Development
We'll first validate the model by holding out part of the training data to check the results.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor


We need to extract the target variable and split the data up into a training and validation set.

In [None]:
# Train Test Split
y = train_scaled['winPlacePerc']
X = train_scaled.drop(['winPlacePerc'],axis=1)
size = 0.30
seed = 42

X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=size, random_state=seed)
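
The validation step itself is not shown in this notebook. A minimal sketch of how it could look follows, assuming the reported figure is the R² value returned by `score` (an assumption, not confirmed here). It uses a small synthetic data set in place of the PUBG features so that it runs quickly; the names `X_demo`, `y_demo` and `model` are for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (not PUBG data)
rng = np.random.RandomState(42)
X_demo = rng.rand(500, 3)
y_demo = 0.6 * X_demo[:, 0] + 0.4 * X_demo[:, 1] ** 2

# Hold out 30% of the rows for validation, as in the split above
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_tr, y_tr)                    # fit on the 70% training split only

val_score = model.score(X_val, y_val)    # R^2 on the held-out 30%
print(f"Validation R^2: {val_score:.4f}")
```

Once the validation score looks reasonable, the model can be refitted on all of the training data before predicting on the test set.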


## Gradient Boost Regressor


In [None]:
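GBR = GradientBoostingRegressor(learning_rate=0.8)
# Fit on the full training set for the final predictions
GBR.fit(X,y)

# Predict on the scaled test data, as the model was trained on scaled features
predictions = GBR.predict(test_scaled)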

Before we make the submission of our predictions, we need to make sure they are consistent with the boundaries of the target variable. The target variable "winPlacePerc" is a number between 0 and 1, so anything outside that will contribute to incorrect predictions.

Here we'll force any values outside this range back to the boundaries.

In [None]:
predictions[predictions > 1] = 1
predictions[predictions < 0] = 0
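
The same clipping can also be written as a single call to NumPy's `np.clip`. A small sketch on invented values:

```python
import numpy as np

raw = np.array([-0.05, 0.20, 0.97, 1.08])   # hypothetical model outputs
clipped = np.clip(raw, 0, 1)                # values below 0 become 0, values above 1 become 1

print(clipped.tolist())   # [0.0, 0.2, 0.97, 1.0]
```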

In [None]:
submission = pd.DataFrame({'Id': test_ids, 'winPlacePerc': predictions})
submission.to_csv('submission_GBR.csv',index=False)