 # PUBG Finish Placement Prediction
 
**Final aim**: to build a model that predicts each player's final ***winPlacePerc***


> ## Outline: 

1. EDA, including univariate and bivariate analysis
2. Feature Engineering 
3. Outlier Handling
4. Model Training and Evaluation










In [None]:
%matplotlib inline
%pylab inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os
print(os.listdir("../input"))
import xgboost
import warnings
warnings.filterwarnings(action='ignore')

# Resetting the datatypes

**The dataset is large: the training data has almost 4.5 million rows, which causes time and memory issues.**

One solution is to optimise the data types. Let's see how much it helps.

 

In [None]:
# Memory saving function
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

In [None]:
df_train = pd.read_csv('../input/train_V2.csv')
df = reduce_mem_usage(df_train)
df.info()

**Memory usage of dataframe is 983.90 MB**

**Memory usage after optimization is: 288.39 MB**

**Decreased by 70.7%**
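
As a cross-check on the hand-written function, pandas has a built-in downcasting option. The sketch below uses a small made-up frame (the column names are borrowed from the dataset, the values are invented) to show how `pd.to_numeric(..., downcast=...)` picks the smallest dtype. `downcast_numeric` is a hypothetical helper name. Unlike `reduce_mem_usage`, this route never goes below float32, which trades some memory for less rounding error.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'kills': np.array([0, 3, 12], dtype=np.int64),
    'walkDistance': np.array([0.0, 151.5, 2803.25], dtype=np.float64),
})

def downcast_numeric(frame):
    """Return a copy whose numeric columns use the smallest dtype pandas allows."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage(index=False).sum(), 'bytes ->',
      small.memory_usage(index=False).sum(), 'bytes')
```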

In [None]:
df.columns

> ## Finding correlation between different variables using Pearson's coefficient

In [None]:
f,ax = plt.subplots(figsize=(11,11))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
plt.show()

> **With respect to the target variable (winPlacePerc), a few variables show medium to high correlation. The highest positive correlation is with walkDistance and the highest negative correlation is with killPlace.**
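
The heatmap is dense, so it can help to list the correlations with the target on their own. A minimal sketch on synthetic data follows; the values and the relationships between the columns are invented, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data: the relationships below are invented for illustration.
walk = rng.uniform(0, 3000, n)
kill_place = rng.integers(1, 101, n).astype(float)
noise = rng.normal(0, 0.05, n)
toy = pd.DataFrame({
    'walkDistance': walk,
    'killPlace': kill_place,
    'winPlacePerc': np.clip(walk / 6000 + (100 - kill_place) / 200 + noise, 0, 1),
})

# Pearson correlation of every feature with the target, most positive first.
target_corr = (toy.corr()['winPlacePerc']
                  .drop('winPlacePerc')
                  .sort_values(ascending=False))
print(target_corr)
```

On the real data the same expression would be applied to `df` instead of `toy`.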


In [None]:
#importing other dependencies
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display
from sklearn import metrics
from scipy.cluster import hierarchy as hc
from pdpbox import pdp
from plotnine import *
from fastai.imports import *
from fastai.structured import *

In [None]:
df.describe()

**One player has a 'winPlacePerc' of NaN, because that match had only one player. We delete this row from our dataset.**

In [None]:
df[df['winPlacePerc'].isnull()]

In [None]:
# Delete this player (select by mask instead of a hard-coded row label)
df.drop(df[df['winPlacePerc'].isnull()].index, inplace=True)

# And he's gone
df[df['winPlacePerc'].isnull()]

In [None]:
train = df

# Feature Engineering

> In this section we add more interesting features to improve the predictive quality of our machine learning models.
> Note: It is important with feature engineering that you also add the engineered features to your test set!

### Feature Engineering Code Ideas 
* kills relative to totalDistance
* kills relative to matchType
* kills relative to matchDuration
* weaponsAcquired and Kills
* damageDealt relative to Kills
* damageDealt relative to matchDuration

## Players Joined

This is likely a very valuable feature for our model. If we know how many people are in a match we can **normalize** other features and get stronger predictions on individual players.

In [None]:
train['playersJoined'] = train.groupby('matchId')['matchId'].transform('count')
data = train.copy()
data = data[data['playersJoined']>49]
plt.figure(figsize=(15,10))
sns.countplot(x=data['playersJoined'])  # pass the column as x explicitly
plt.title("Players Joined",fontsize=15)
plt.show()

A few matches have fewer than 75 players, but as you can see most matches are nearly full, with close to 100 players. It is nevertheless useful to include this feature in our analysis.

## Normalized features
> Now that we have the feature 'playersJoined', we can normalize other features by the number of players. Features that can be valuable to normalize are:

* kills
* damageDealt
* maxPlace
* matchDuration

Let's try out some things!

In [None]:
train['killsNorm'] = train['kills']*((100-train['playersJoined'])/100 + 1)
train['damageDealtNorm'] = train['damageDealt']*((100-train['playersJoined'])/100 + 1)
train['maxPlaceNorm'] = train['maxPlace']*((100-train['playersJoined'])/100 + 1)
train['matchDurationNorm'] = train['matchDuration']*((100-train['playersJoined'])/100 + 1)
to_show = ['Id', 'kills','killsNorm','damageDealt', 'damageDealtNorm', 'maxPlace', 'maxPlaceNorm', 'matchDuration', 'matchDurationNorm']
train[to_show][0:11]
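
To make the scaling factor concrete: it is `(100 - playersJoined) / 100 + 1`, so a full 100-player match leaves a value unchanged, a 90-player match scales it by 1.1, and a 50-player match by 1.5. The sketch below applies the formula to three invented rows with the same raw kill count.

```python
import pandas as pd

# Invented rows: identical kill counts in matches of different sizes.
toy = pd.DataFrame({
    'playersJoined': [100, 90, 50],
    'kills': [4, 4, 4],
})

# Same factor as in the cell above: fewer players means a larger weight.
scale = (100 - toy['playersJoined']) / 100 + 1
toy['killsNorm'] = toy['kills'] * scale
print(toy)
```

The idea is that four kills in a half-empty match are harder to come by than four kills in a full one, so they are weighted up.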

## Heals and Boosts

We create a feature called 'healsandboosts' by adding heals and boosts. (duh!) We are not sure if this has additional predictive value but we can always delete it if the feature importance according to our random forest model is too low.

In [None]:
train['healsandboosts'] = train['heals'] + train['boosts']
train[['heals', 'boosts', 'healsandboosts']].tail()

## Killing without moving

We try to identify cheaters by checking if people are getting kills without moving. We first identify the totalDistance travelled by a player and then set a boolean value to True if someone got kills without moving a single inch. We will remove cheaters in our outlier detection section.

In [None]:
train['totalDistance'] = train['rideDistance'] + train['walkDistance'] + train['swimDistance']
train['killsWithoutMoving'] = ((train['kills'] > 0) & (train['totalDistance'] == 0))

**The feature headshot_rate will also help us to catch cheaters.**

In [None]:
train['headshot_rate'] = train['headshotKills'] / train['kills']
train['headshot_rate'] = train['headshot_rate'].fillna(0)
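
The `fillna(0)` step matters because players with zero kills produce a 0/0 division, which pandas evaluates to NaN rather than raising an error. A small sketch with invented rows:

```python
import pandas as pd

# Invented rows: no kills, half headshots, all headshots.
toy = pd.DataFrame({'kills': [0, 2, 10], 'headshotKills': [0, 1, 10]})

raw_rate = toy['headshotKills'] / toy['kills']  # first row is 0/0 -> NaN
toy['headshot_rate'] = raw_rate.fillna(0)       # treat "no kills" as rate 0
print(toy)
```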

# Outlier Detection

Some rows in our dataset have really weird characteristics. The players could be cheaters, maniacs or just anomalies. Removing these outliers will most likely improve results.

## Kills without movement

This is perhaps the most obvious sign of cheating in the game. It is already fishy if a player hasn't moved during the whole game, but the player could be AFK and got killed. However, if the player managed to get kills it is most likely a cheater.

In [None]:
display(train[train['killsWithoutMoving'] == True].shape)
train[train['killsWithoutMoving'] == True]

In [None]:
# Remove outliers
train.drop(train[train['killsWithoutMoving'] == True].index, inplace=True)

## Anomalies in roadKills

In [None]:
# Players who got more than 10 roadKills
train[train['roadKills'] > 10]

Note that player c3e444f7d1289d drove 5 meters but killed 14 people with it. Sounds insane doesn't it?

In [None]:
train.drop(train[train['roadKills'] > 10].index, inplace=True)

> ### Anomalies in aim part 1 (More than 30 kills)

Let's plot the total kills for every player first. It doesn't look like there are too many outliers.

In [None]:
plt.figure(figsize=(12,4))
sns.countplot(data=train, x=train['kills']).set_title('Kills')
plt.show()

In [None]:
# Let's take a closer look
# Players who got more than 30 kills
display(train[train['kills'] > 30].shape)
train[train['kills'] > 30]

In [None]:
# Remove outliers
train.drop(train[train['kills'] > 30].index, inplace=True)

> ### Anomalies in aim part 2 (100% headshot rate)

Again, we first look at the whole dataset, using the 'headshot_rate' feature created earlier. Most players score in the 0 to 10% region. However, there are a few anomalies with a headshot_rate of 100% and more than 9 kills!

In [None]:
plt.figure(figsize=(12,4))
sns.distplot(train['headshot_rate'], bins=10)
plt.show()

In [None]:
# Players who made a minimum of 10 kills and have a headshot_rate of 100%
display(train[(train['headshot_rate'] == 1) & (train['kills'] > 9)].shape)
train[(train['headshot_rate'] == 1) & (train['kills'] > 9)]

***It is unclear whether these players are cheating, so we do not delete them from the dataset. If they are legitimate players, they are probably really crushing the game!***

> ### Anomalies in aim part 3 (Longest kill)

Most kills are made from a distance of 100 meters or closer. There are however some outliers who make a kill from more than 1km away. This is probably done by cheaters.

In [None]:
plt.figure(figsize=(12,4))
sns.distplot(train['longestKill'], bins=10)
plt.show()

In [None]:
#Let's take a look at the players who make these shots.
display(train[train['longestKill'] >= 1000].shape)
train[train['longestKill'] >= 1000]

**There is something fishy going on with these players. We are probably better off removing them from our dataset.**

In [None]:
# Remove outliers
train.drop(train[train['longestKill'] >= 1000].index, inplace=True)

> ### Anomalies in travelling (rideDistance, walkDistance and swimDistance)

Let's check out anomalies in Distance travelled.

In [None]:
train[['walkDistance', 'rideDistance', 'swimDistance', 'totalDistance']].describe()

> ## walkDistance

In [None]:
plt.figure(figsize=(12,4))
sns.distplot(train['walkDistance'], bins=10)
plt.show()


In [None]:
# walkDistance anomalies
display(train[train['walkDistance'] >= 10000].shape)
train[train['walkDistance'] >= 10000]

In [None]:
# Remove outliers
train.drop(train[train['walkDistance'] >= 10000].index, inplace=True)

> ## Ride Distance

In [None]:
plt.figure(figsize=(12,4))
sns.distplot(train['rideDistance'], bins=10)
plt.show()

In [None]:
# rideDistance anomalies
train[train['rideDistance'] >= 20000]

In [None]:
# Remove outliers
train.drop(train[train['rideDistance'] >= 20000].index, inplace=True)

> ## swimDistance

In [None]:
#swimDistance
plt.figure(figsize=(12,4))
sns.distplot(train['swimDistance'], bins=10)
plt.show()

In [None]:
train[train['swimDistance'] >= 2000]

In [None]:
# Remove outliers
train.drop(train[train['swimDistance'] >= 2000].index, inplace=True)

> ## Anomalies in supplies (weaponsAcquired)

Most people acquire between 0 and 10 weapons in a game, but you also see some people acquire more than 80 weapons! Let's check these guys out.

In [None]:
plt.figure(figsize=(12,4))
sns.distplot(train['weaponsAcquired'], bins=100)
plt.show()

In [None]:
train[train['weaponsAcquired'] >= 80]

In [None]:
# Remove outliers
train.drop(train[train['weaponsAcquired'] >= 80].index, inplace=True)

**We should probably remove these outliers from our model.**
Note that player 3f2bcf53b108c4 acquired 236 weapons in one game!

> ## Anomalies in supplies part 2 (heals)

Most players use 5 healing items or fewer. We can again recognize some weird anomalies.

In [None]:
plt.figure(figsize=(12,4))
sns.distplot(train['heals'], bins=10)
plt.show()

In [None]:
display(train[train['heals'] >= 40].shape)
train[train['heals'] >= 40]

In [None]:
# Remove outliers
train.drop(train[train['heals'] >= 40].index, inplace=True)

> ## Outlier conclusions

We removed about 2000 players from our dataset.

In [None]:
train.shape
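
To see how many rows each rule removes, the thresholds used in this section can be collected in one place and applied in order. The following is only a sketch: `RULES` and `drop_outliers` are hypothetical names, the thresholds are copied from the cells above, and the four toy rows are invented.

```python
import pandas as pd

# Thresholds copied from the outlier cells; each rule returns a boolean mask.
RULES = {
    'killsWithoutMoving': lambda d: d['killsWithoutMoving'],
    'roadKills > 10': lambda d: d['roadKills'] > 10,
    'kills > 30': lambda d: d['kills'] > 30,
    'longestKill >= 1000': lambda d: d['longestKill'] >= 1000,
    'walkDistance >= 10000': lambda d: d['walkDistance'] >= 10000,
    'rideDistance >= 20000': lambda d: d['rideDistance'] >= 20000,
    'swimDistance >= 2000': lambda d: d['swimDistance'] >= 2000,
    'weaponsAcquired >= 80': lambda d: d['weaponsAcquired'] >= 80,
    'heals >= 40': lambda d: d['heals'] >= 40,
}

def drop_outliers(frame, rules):
    """Apply each rule in order; return the cleaned frame and rows removed per rule."""
    removed = {}
    for name, rule in rules.items():
        mask = rule(frame)
        removed[name] = int(mask.sum())
        frame = frame[~mask]
    return frame, removed

# Invented rows: one normal player and three that each trip a different rule.
toy = pd.DataFrame({
    'killsWithoutMoving': [False, True, False, False],
    'roadKills': [0, 0, 14, 0],
    'kills': [2, 5, 14, 1],
    'longestKill': [35.0, 10.0, 5.0, 1200.0],
    'walkDistance': [1500.0, 0.0, 5.0, 800.0],
    'rideDistance': [0.0, 0.0, 5.0, 0.0],
    'swimDistance': [0.0, 0.0, 0.0, 0.0],
    'weaponsAcquired': [4, 1, 0, 3],
    'heals': [2, 0, 0, 1],
})

clean, removed = drop_outliers(toy, RULES)
print(removed)
print('rows kept:', len(clean), 'rows removed:', sum(removed.values()))
```

Because the rules run in sequence, a row that violates several of them is only counted under the first one it hits.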

> # Preparation for Machine Learning

In [None]:
# We delete the matchType, Id, groupId and matchId columns here for convenience
# We will come back to this later.
train = train.drop(columns = ['matchType', 'Id', 'groupId', 'matchId'])

In [None]:
# Take sample for debugging and exploration
sample = 500000
df_sample = train.sample(sample)

In [None]:
df = df_sample.drop(columns = ['winPlacePerc']) #all columns except target
y = df_sample['winPlacePerc'] # Only target variable

> ## Split target variable, validation data, etc.

In [None]:
def split_vals(a, n : int): 
    return a[:n].copy(), a[n:].copy()
val_perc = 0.12 # % to use for validation set
n_valid = int(val_perc * sample) 
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_sample, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

print('Sample train shape: ', X_train.shape, 
      'Sample target shape: ', y_train.shape, 
      'Sample validation shape: ', X_valid.shape)
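
`split_vals` slices by position, which works for `df_sample` because `train.sample` already returns rows in random order. When the rows are not pre-shuffled, an explicit shuffle with a fixed seed is a possible alternative. The sketch below shows one way to do it; `shuffled_split` is a hypothetical helper (not part of the notebook or of any library) and the data is invented.

```python
import numpy as np
import pandas as pd

def shuffled_split(X, y, val_frac=0.12, seed=0):
    """Shuffle row positions once, then hold out a val_frac share for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_valid = int(val_frac * len(X))
    n_trn = len(X) - n_valid
    trn_idx, val_idx = order[:n_trn], order[n_trn:]
    return X.iloc[trn_idx], X.iloc[val_idx], y.iloc[trn_idx], y.iloc[val_idx]

# Invented data: 100 rows with a single feature and a target in [0, 1].
X = pd.DataFrame({'kills': range(100)})
y = pd.Series(np.linspace(0, 1, 100), name='winPlacePerc')

X_trn, X_val, y_trn, y_val = shuffled_split(X, y)
print(X_trn.shape, X_val.shape)
```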

In [None]:
#Set metrics (MAE)
# Metric used for the PUBG competition (Mean Absolute Error (MAE))
from sklearn.metrics import mean_absolute_error

# Function to print the MAE score
def print_score(m : RandomForestRegressor):
    res = ['mae train: ', mean_absolute_error(m.predict(X_train), y_train), 
           'mae val: ', mean_absolute_error(m.predict(X_valid), y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
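
For reference, MAE is the average of the absolute differences between the true and the predicted values, so it is symmetric in its two arguments. A minimal NumPy sketch with invented numbers follows; the function `mae` is written here for illustration and is meant to match what `mean_absolute_error` computes for one-dimensional inputs.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_true - y_pred|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Invented placements and predictions, purely for illustration.
actual = [0.20, 0.50, 1.00]
predicted = [0.25, 0.40, 0.90]

score = mae(actual, predicted)
print(round(score, 4))
```

Since winPlacePerc lies between 0 and 1, an MAE of 0.08 means the prediction is off by about 8 percentage points of placement on average.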

In [None]:
# Train basic model
m1 = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features='sqrt',
                          n_jobs=-1)
m1.fit(X_train, y_train)
print_score(m1)

In [None]:
# What are the most predictive features according to our basic random forest model
fi = rf_feat_importance(m1, df); fi[:10]

In [None]:
# Plot a feature importance graph for all features
plot1 = fi.plot('cols', 'imp', figsize=(12,10), legend=False, kind = 'barh')
plot1

# Save figure
#fig = plot1.get_figure()
#fig.savefig("Feature_importances(AllFeatures).png")

In [None]:
#Keep only significant features
to_keep = fi[fi.imp>0.005].cols
print('Significant features: ', len(to_keep))
to_keep
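
The threshold filter can be checked on a small table first. The sketch below assumes, as the cells above do, that the importance table has a `cols` and an `imp` column; the importance values are hypothetical and do not come from the fitted model.

```python
import pandas as pd

# Hypothetical importances; the real values come from the fitted forest.
fi_toy = pd.DataFrame({
    'cols': ['walkDistance', 'killPlace', 'boosts', 'roadKills', 'vehicleDestroys'],
    'imp': [0.45, 0.30, 0.20, 0.004, 0.001],
}).sort_values('imp', ascending=False)

threshold = 0.005
keep_mask = fi_toy.imp > threshold
keep_toy = fi_toy[keep_mask].cols

print(list(keep_toy))
print('share of total importance kept:',
      fi_toy.loc[keep_mask, 'imp'].sum() / fi_toy['imp'].sum())
```

Printing the retained share is a quick sanity check that the dropped features really carry little weight.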

In [None]:
# Make a DataFrame with only significant features
df_keep = df[to_keep].copy()
X_train, X_valid = split_vals(df_keep, n_trn)

## Second Random Forest Model

In [None]:
# Train model on top features
m2 = RandomForestRegressor(n_estimators=80, min_samples_leaf=3, max_features='sqrt',
                          n_jobs=-1)
m2.fit(X_train, y_train)
print_score(m2)

In [None]:
# Get feature importances of our top features
fi_to_keep = rf_feat_importance(m2, df_keep)
plot2 = fi_to_keep.plot('cols', 'imp', figsize=(10,6), legend=False, kind = 'barh')
plot2

# save figure
#fig = plot2.get_figure()
#fig.savefig("Feature_importances(TopFeatures).png")

In [None]:
# Create a Dendrogram to view highly correlated features
corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)
corr_condensed = hc.distance.squareform(1-corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(14,10))
dendrogram = hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=16)
plt.show()

# For saving figure
#plt.savefig('Dendrogram.png')

In [None]:
# Plot the predictive quality of kills 
x_all = get_sample(train, 100000)
ggplot(x_all, aes('kills','winPlacePerc'))+stat_smooth(se=True, colour='red', method='mavg')

In [None]:
# Plot the predictive quality of walkDistance
x_all = get_sample(train, 100000)
ggplot(x_all, aes('walkDistance','winPlacePerc'))+stat_smooth(se=True, colour='red', method='mavg')

## Final Random Forest Model

In [None]:
# Prepare data
val_perc_full = 0.12 # % to use for validation set
n_valid_full = int(val_perc_full * len(train)) 
n_trn_full = len(train)-n_valid_full
df_full = train.drop(columns = ['winPlacePerc']) # all columns except target
y = train['winPlacePerc'] # target variable
df_full = df_full[to_keep] # Keep only relevant features
X_train, X_valid = split_vals(df_full, n_trn_full)
y_train, y_valid = split_vals(y, n_trn_full)

print('Full train shape: ', X_train.shape, 
      'Full target shape: ', y_train.shape, 
      'Full validation shape: ', X_valid.shape)

In [None]:
m3 = RandomForestRegressor(n_estimators=60, min_samples_leaf=3, max_features=0.5, n_jobs=-1)
m3.fit(X_train, y_train)
print_score(m3)

In [None]:
test = pd.read_csv('../input/test_V2.csv')

> ## Normalising test variables as well

In [None]:
test['headshot_rate'] = test['headshotKills'] / test['kills']
test['headshot_rate'] = test['headshot_rate'].fillna(0)
test['totalDistance'] = test['rideDistance'] + test['walkDistance'] + test['swimDistance']
test['playersJoined'] = test.groupby('matchId')['matchId'].transform('count')
test['killsNorm'] = test['kills']*((100-test['playersJoined'])/100 + 1)
test['damageDealtNorm'] = test['damageDealt']*((100-test['playersJoined'])/100 + 1)
test['maxPlaceNorm'] = test['maxPlace']*((100-test['playersJoined'])/100 + 1)
test['matchDurationNorm'] = test['matchDuration']*((100-test['playersJoined'])/100 + 1)
test['healsandboosts'] = test['heals'] + test['boosts']
test['killsWithoutMoving'] = ((test['kills'] > 0) & (test['totalDistance'] == 0))

# Remove irrelevant features from the test set
test_pred = test[to_keep].copy()

# Fill NaN with 0 (temporary)
test_pred.fillna(0, inplace=True)
test_pred.head()
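
Since the engineered features must be identical for train and test, one way to keep them in sync is a single function that is applied to both frames. The sketch below is a hedged restructuring of the feature code from this notebook; `add_features` is a hypothetical name and the toy rows are invented.

```python
import pandas as pd

def add_features(frame):
    """Return a copy of frame with the engineered columns used in this notebook."""
    out = frame.copy()
    out['playersJoined'] = out.groupby('matchId')['matchId'].transform('count')
    scale = (100 - out['playersJoined']) / 100 + 1
    for col in ['kills', 'damageDealt', 'maxPlace', 'matchDuration']:
        out[col + 'Norm'] = out[col] * scale
    out['healsandboosts'] = out['heals'] + out['boosts']
    out['totalDistance'] = (out['rideDistance'] + out['walkDistance']
                            + out['swimDistance'])
    out['killsWithoutMoving'] = (out['kills'] > 0) & (out['totalDistance'] == 0)
    out['headshot_rate'] = (out['headshotKills'] / out['kills']).fillna(0)
    return out

# Invented rows: two players in match 'a', one player in match 'b'.
toy = pd.DataFrame({
    'matchId': ['a', 'a', 'b'],
    'kills': [1, 0, 2],
    'headshotKills': [1, 0, 0],
    'damageDealt': [100.0, 0.0, 250.0],
    'maxPlace': [50, 50, 25],
    'matchDuration': [1800, 1800, 1300],
    'heals': [1, 0, 3],
    'boosts': [2, 0, 1],
    'rideDistance': [0.0, 0.0, 0.0],
    'walkDistance': [100.0, 50.0, 0.0],
    'swimDistance': [0.0, 0.0, 0.0],
})

features = add_features(toy)
print(features[['playersJoined', 'killsNorm', 'headshot_rate', 'killsWithoutMoving']])
```

With such a helper, the same call would be made once on `train` and once on `test`, so a column can never be computed from the wrong frame.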

In [None]:
predictions = np.clip(a = m3.predict(test_pred), a_min = 0.0, a_max = 1.0)
pred_df = pd.DataFrame({'Id' : test['Id'], 'winPlacePerc' : predictions})
# Create submission file
pred_df.to_csv("submission.csv", index=False)

In [None]:
print('Head of submission: ')
display(pred_df.head())
print('Tail of submission: ')
display(pred_df.tail())