## Introduction

### Challenge description

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.

You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.

You must create a model which predicts players' finishing placement based on their final stats, on a scale from 1 (first place) to 0 (last place).
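
To make the target scale concrete, here is a minimal sketch of one way a finishing place could be mapped onto the 1 (first) to 0 (last) range. The linear spacing and the helper name `win_place_perc` are assumptions made for illustration, not the competition's documented formula.

```python
import math


def win_place_perc(place: int, max_place: int) -> float:
    """Map a 1-based finishing place to [0, 1] (hypothetical helper).

    Assumes placements are spread linearly between first and last place.
    """
    if max_place < 2:
        # With a single placement there is nothing to rank against.
        return float("nan")
    return (max_place - place) / (max_place - 1)


print(win_place_perc(1, 50))   # first place
print(win_place_perc(50, 50))  # last place
print(win_place_perc(25, 49))  # middle of a 49-place match
```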

### PUBG Gameplay description

Battlegrounds is a player versus player shooter game in which up to one hundred players fight in a battle royale, a type of large-scale last man standing deathmatch where players fight to remain the last alive. Players can choose to enter the match solo, duo, or with a small team of up to four people. The last person or team alive wins the match.

Each match starts with players parachuting from a plane onto one of the four maps, with areas of approximately 8 × 8 kilometres (5.0 × 5.0 mi), 6 × 6 kilometres (3.7 × 3.7 mi), and 4 × 4 kilometres (2.5 × 2.5 mi) in size. The plane's flight path across the map varies with each round, requiring players to quickly determine the best time to eject and parachute to the ground. Players start with no gear beyond customized clothing selections which do not affect gameplay. Once they land, players can search buildings, ghost towns and other sites to find weapons, vehicles, armor, and other equipment. These items are procedurally distributed throughout the map at the start of a match, with certain high-risk zones typically having better equipment. Killed players can be looted to acquire their gear as well. Players can opt to play either from the first-person or third-person perspective, each having their own advantages and disadvantages in combat and situational awareness; though server-specific settings can be used to force all players into one perspective to eliminate some advantages.

Every few minutes, the playable area of the map begins to shrink down towards a random location, with any player caught outside the safe area taking damage incrementally, and eventually being eliminated if the safe zone is not entered in time; in game, the players see the boundary as a shimmering blue wall that contracts over time. This results in a more confined map, in turn increasing the chances of encounters. During the course of the match, random regions of the map are highlighted in red and bombed, posing a threat to players who remain in that area. In both cases, players are warned a few minutes before these events, giving them time to relocate to safety. A plane will fly over various parts of the playable map occasionally at random, or wherever a player uses a flare gun, and drop a loot package, containing items which are typically unobtainable during normal gameplay. These packages emit highly visible red smoke, drawing interested players near it and creating further confrontations. On average, a full round takes no more than 30 minutes.

At the completion of each round, players gain in-game currency based on their performance. The currency is used to purchase crates which contain cosmetic items for character or weapon customization. A rotating "event mode" was added to the game in March 2018. These events change up the normal game rules, such as establishing larger teams or squads, or altering the distribution of weapons and armor across the game map.

Source: [Wikipedia](https://en.wikipedia.org/wiki/PlayerUnknown%27s_Battlegrounds)


## Exploratory Data Analysis

### Loading train data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
from sklearn.model_selection import KFold
from sklearn import svm
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
import time

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
train = pd.read_csv('../input/train_V2.csv')

Let's take a quick look at the features and look for NaN values.

In [None]:
train.info()

In [None]:
train.head()

In [None]:
train.tail()

Searching for null values...

In [None]:
print("Total of NaN values on train dataset: ", train.isna().sum().sum())

So let's find that row.

In [None]:
train.isna().sum()

In [None]:
train[train.isna().any(axis=1)]

This looks like a unique case in which only one player connected to the server: it is a solo match with a single group, so that group is the lone player, and the match lasted only 9 seconds. It makes sense to remove this row from our dataset.

In [None]:
# Removing outlier
train.drop(train[train.isna().any(axis=1)].index, inplace=True)
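
The same find-and-drop pattern can be tried on a tiny frame first. This is a sketch on synthetic data (the values are made up); only the `isna().any(axis=1)` mask mirrors the cells above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; the values are made up.
toy = pd.DataFrame({
    "Id": ["a", "b", "c"],
    "kills": [2, 0, 1],
    "winPlacePerc": [0.75, np.nan, 0.25],
})

# Boolean mask that is True for every row containing at least one NaN.
has_nan = toy.isna().any(axis=1)
print(toy[has_nan])

# Keep only the complete rows; the original frame is left untouched.
toy_clean = toy[~has_nan].reset_index(drop=True)
print(len(toy), "->", len(toy_clean))
```

Calling `toy.dropna()` keeps the same rows in a single step; the explicit mask is handy when you want to inspect the offending rows before removing them.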

Alright, then we can start to look at the values and understand some of the data.

I am a PUBG player, so I know that not every match has all 100 players joined on the server, and that not every squad is full in squad matches. Sometimes, due to a lack of servers, players join squad matches alone so that finding a match does not take too long. Another case is a player who just wants to practice being the last squad member alive.

Let's check this information.

In [None]:
# Creating a new feature of number of players joined in a match
train['playersJoined'] = train.groupby('matchId')['matchId'].transform('count')
plt.figure(figsize=(25,10))
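sns.countplot(x=train['playersJoined'])  # pass the column as x explicitly; recent seaborn versions treat the first positional argument as data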
plt.title('Players Joined in a Match')
plt.show()
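
To see what `groupby(...).transform('count')` does, here is a small sketch on made-up rows. Unlike an aggregation, `transform` returns one value per original row, so the result can be assigned straight back as a column.

```python
import pandas as pd

# Five made-up players spread over two matches.
toy = pd.DataFrame({
    "Id": ["p1", "p2", "p3", "p4", "p5"],
    "matchId": ["m1", "m1", "m1", "m2", "m2"],
})

# Each row receives the size of the match it belongs to.
toy["playersJoined"] = toy.groupby("matchId")["matchId"].transform("count")
print(toy)
```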

According to the data description given in the challenge:

*matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.*

Let's check what they mean by that.

In [None]:
train['matchType'].unique().tolist()

So we will remove the event and custom modes to keep the analysis on the standard modes of the game.

In [None]:
# Removing event and custom matches
train = train[train.matchType.isin(['solo', 'duo', 'squad', 'solo-fpp', 'duo-fpp', 'squad-fpp'])]
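
A quick sketch of the `isin` filter on a toy frame. The two non-standard labels below are placeholders chosen for illustration, and `STANDARD_MODES` is a name introduced here.

```python
import pandas as pd

STANDARD_MODES = ["solo", "duo", "squad", "solo-fpp", "duo-fpp", "squad-fpp"]

# Toy rows; "normal-duo" and "crashfpp" stand in for event/custom modes.
toy = pd.DataFrame({
    "matchId": ["m1", "m2", "m3", "m4"],
    "matchType": ["squad-fpp", "normal-duo", "solo", "crashfpp"],
})

is_standard = toy["matchType"].isin(STANDARD_MODES)
standard = toy[is_standard]
print(standard)
print("dropped modes:", sorted(toy.loc[~is_standard, "matchType"].unique()))
```

Keeping the mask in its own variable makes it easy to report what was dropped before overwriting the frame.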

Now I would like to group the players and check whether any match has a group with more than 4 players, which is the maximum for squad matches.

In [None]:
group = train.groupby(['matchId','groupId','matchType'])['Id'].count().to_frame('players').reset_index()
group.info()
group.head()


As we can see, the data still has some non-standard matches. Let's find them all.

In [None]:
# Non-standard matches
to_remove = group[group['players'] > 4].matchId.unique().tolist()
print(len(to_remove), "matches don't agree with the standard PUBG gameplay.")

# Removing those matchIds from the group dataframe
group = group[~group.matchId.isin(to_remove)]
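
The two steps above (count players per group, then drop every match that contains an oversized group) can be checked on a toy frame. The data is synthetic; only the pattern follows the cells above.

```python
import pandas as pd

# Eight made-up players: match m1 has groups of 2 and 1, match m2 has a group of 5.
toy = pd.DataFrame({
    "Id": [f"p{i}" for i in range(8)],
    "matchId": ["m1"] * 3 + ["m2"] * 5,
    "groupId": ["g1", "g1", "g2"] + ["g3"] * 5,
})

# One row per (match, group) with the number of players in that group.
sizes = (
    toy.groupby(["matchId", "groupId"])["Id"]
    .count()
    .to_frame("players")
    .reset_index()
)

# A single oversized group disqualifies the whole match.
oversized = sizes.loc[sizes["players"] > 4, "matchId"].unique().tolist()
kept = sizes[~sizes["matchId"].isin(oversized)]
print(oversized)
print(kept)
```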

In order to get a general look at these match groups, let's plot them.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(16, 8))
for mt, ax in zip(['solo', 'duo', 'squad', 'solo-fpp', 'duo-fpp', 'squad-fpp'], axes.ravel()):
    group[group['matchType'] == mt]['players'].value_counts().sort_index().plot.bar(ax=ax)
    ax.set_xlabel(mt)  # set the label last so the plotting call cannot overwrite it

Apparently, we've missed some event/custom matches. Time to remove them as well. First, let's check whether these players are only observers or whether they have dealt any damage.

In [None]:
group.loc[(group['matchType'] == 'solo-fpp') & (group['players'] > 1)].tail()

In [None]:
train[train.groupId == '07b6286649f1e5']

In [None]:
train[train.groupId == '9512eb0b2c0d24']

Picking two *groupId* values from that list, we can see that the players in the first group dealt damage. On the other hand, the second group only travelled and picked up one weapon. Since these players are not just spectators, and these matches have more players per group than the mode normally allows, it makes sense to remove this kind of match.

In [None]:
# Removing matches not agreeing with standard PUBG gameplay
to_remove = group.loc[(group['matchType'] == 'solo') & (group['players'] > 1)].matchId.unique().tolist()
group = group[~group.matchId.isin(to_remove)]

to_remove = group.loc[(group['matchType'] == 'solo-fpp') & (group['players'] > 1)].matchId.unique().tolist()
group = group[~group.matchId.isin(to_remove)]

to_remove = group.loc[(group['matchType'] == 'duo') & (group['players'] > 2)].matchId.unique().tolist()
group = group[~group.matchId.isin(to_remove)]

to_remove = group.loc[(group['matchType'] == 'duo-fpp') & (group['players'] > 2)].matchId.unique().tolist()
group = group[~group.matchId.isin(to_remove)]
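
The four near-identical filters above could also be written once with a lookup table. This is a sketch on toy data; `MAX_GROUP_SIZE` is a name introduced here, and the limits are the usual 1/2/4 players per group.

```python
import pandas as pd

# Largest group size expected in each standard mode.
MAX_GROUP_SIZE = {
    "solo": 1, "solo-fpp": 1,
    "duo": 2, "duo-fpp": 2,
    "squad": 4, "squad-fpp": 4,
}

# Toy per-group table with the same columns as `group` above.
toy_group = pd.DataFrame({
    "matchId": ["m1", "m1", "m2", "m3", "m3"],
    "matchType": ["solo", "solo", "duo-fpp", "duo", "duo"],
    "players": [1, 2, 2, 3, 1],
})

# Look up the limit for each row, then flag whole matches that break it.
limit = toy_group["matchType"].map(MAX_GROUP_SIZE)
bad_matches = toy_group.loc[toy_group["players"] > limit, "matchId"].unique()
clean = toy_group[~toy_group["matchId"].isin(bad_matches)]
print(sorted(bad_matches))
print(clean)
```

Because whole matches are removed either way, a single pass with the lookup should select the same matches as the sequential filters.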

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(16, 8))
for mt, ax in zip(['solo', 'duo', 'squad', 'solo-fpp', 'duo-fpp', 'squad-fpp'], axes.ravel()):
    group[group['matchType'] == mt]['players'].value_counts().sort_index().plot.bar(ax=ax)
    ax.set_xlabel(mt)  # set the label last so the plotting call cannot overwrite it

Now our plots look closer to what is expected for these kinds of PUBG matches. It is time to clean our train dataset with these changes.

In [None]:
# Removing more event and custom matches from the actual train dataset
train = train[train.matchId.isin(group.matchId.unique().tolist())]

print("We have now only {} matches.".format(len(train.matchId.unique().tolist())))

Looking at the description again I noticed this:

*rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.*

Even though *rankPoints* is inconsistent (we should probably drop it from our data), I decided to check how the available rankings (*rankPoints*, *killPoints*, *winPoints*) relate to *winPlacePerc*.


In [None]:
g = sns.PairGrid(train, y_vars=['winPlacePerc'], x_vars=['rankPoints', 'killPoints', 'winPoints'], height=5)
g.map(sns.scatterplot)

They don't look like good features to me, but we will keep them. Also, there are lots of low *rankPoints*/*killPoints*/*winPoints* values with a high *winPlacePerc*. These values would make sense for duo/squad players who were carried to the win by their teammates, especially if the player died at the beginning of the match (and did not quit) and the teammates got the win.

In my personal experience playing the game, boost and heal items help the player stay alive outside the "playing zone" and probably win the game. Let's plot this information.

In [None]:
g = sns.PairGrid(train, y_vars=['winPlacePerc'], x_vars=['boosts', 'heals'], height=5)
g.map(sns.scatterplot)

The game changes the playzone at specific times during a match, making it smaller each time. Unless the winner was lucky enough to always be inside the playzone without moving, they probably had to walk, swim or ride a certain distance to stay inside it. So let's plot this as well.

In [None]:
g = sns.PairGrid(train, y_vars=['winPlacePerc'], x_vars=['walkDistance', 'rideDistance', 'swimDistance'], height=5)
g.map(sns.scatterplot)

## Feature Engineering

Well, we have already created one feature (*playersJoined*). This feature can help us normalize some other features, such as *kills* and *damageDealt*, because not every match has 100 players.

In [None]:
train['killsNorm'] = train['kills']*((100-train['playersJoined'])/100 + 1)
train['damageDealtNorm'] = train['damageDealt']*((100-train['playersJoined'])/100 + 1)
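
A worked example of the scaling factor may help. With 80 players the factor is (100 - 80)/100 + 1 = 1.2, so 5 kills count as 6. The frame below is synthetic.

```python
import pandas as pd

# Same raw kill count in matches of different sizes.
toy = pd.DataFrame({"kills": [5, 5, 5], "playersJoined": [100, 80, 50]})

# Factor is 1.0 for a full match and grows as the match gets emptier.
scale = (100 - toy["playersJoined"]) / 100 + 1
toy["killsNorm"] = toy["kills"] * scale
print(toy)
```

The fewer players a match has, the more each kill is weighted, which is the intent of the normalization above.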

Boosts and healing items are crucial for player survival, so it could be useful to create a new feature *boostsHeals* as the sum of those features.

Above we have seen some relation between *winPlacePerc* and walk, ride, and swim distance. Let's make a *totalDistance* feature too.

As I have said, boosts and healing items help the player stay longer in the match. Boosts make the player run faster and regenerate a small amount of HP (health points) over time. Heals regenerate HP after damage is taken and help players stay alive until they leave the "blue zone" (outside the playzone, where players receive damage over time). So it makes sense to create a feature that represents how many boosts and heals were used per unit of total distance travelled.

In [None]:
train['boostsHeals'] = train['boosts'] + train['heals']
train['totalDistance'] = train['walkDistance'] + train['rideDistance'] + train['swimDistance']
train['boostsHealsPerTotalDistance'] = train['boostsHeals']/(train['totalDistance']+1) # Adding 1 avoids division by zero (infinite values) when totalDistance is 0 and boostsHeals > 0
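
To check that the `+1` in the denominator behaves as intended, here is a sketch on three made-up players, two of whom never moved.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "boosts": [2, 1, 0],
    "heals": [3, 0, 0],
    "walkDistance": [999.0, 0.0, 0.0],
    "rideDistance": [0.0, 0.0, 0.0],
    "swimDistance": [0.0, 0.0, 0.0],
})

toy["boostsHeals"] = toy["boosts"] + toy["heals"]
toy["totalDistance"] = toy[["walkDistance", "rideDistance", "swimDistance"]].sum(axis=1)

# The +1 keeps the ratio finite for players who never moved.
toy["boostsHealsPerTotalDistance"] = toy["boostsHeals"] / (toy["totalDistance"] + 1)
print(toy[["boostsHeals", "totalDistance", "boostsHealsPerTotalDistance"]])
```

Without the offset, the second row would divide 1 by 0 and produce an infinite value, which most regressors reject.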

## Machine Learning Model Preparation

Initially, I got some errors and Kaggle suggested one-hot encoding some features, such as *matchType*. However, that does not seem to be the best idea for *matchId* and *groupId* ([see here](https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding) and [here](https://www.kaggle.com/carlolepelaars/pubg-data-exploration-rf-funny-gifs#Categorical-Variables-)). In addition, *Id* can be dropped. Let's make those adjustments.

In [None]:
# One hot encode matchType
train = pd.get_dummies(train, columns=['matchType'])

# Turn groupId and match Id into categorical types
train['groupId'] = train['groupId'].astype('category')
train['matchId'] = train['matchId'].astype('category')

# Get category coding for groupId and matchID
train['groupId_cat'] = train['groupId'].cat.codes
train['matchId_cat'] = train['matchId'].cat.codes

# Get rid of old columns
train.drop(columns=['groupId', 'matchId'], inplace=True)

# Lets take a look at our newly created features
# train[['groupId_cat', 'matchId_cat']].head()

# Take a look at the encoding
# matchType_encoding = train.filter(regex='matchType')
# matchType_encoding.head()
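
Here is a small sketch of what the two encodings produce, using made-up identifiers. It also shows a caveat worth keeping in mind: category codes are assigned per frame, so the same identifier can receive different codes in two different frames.

```python
import pandas as pd

# One-hot encoding: one indicator column per distinct value.
dummies = pd.get_dummies(
    pd.DataFrame({"matchType": ["solo", "duo", "solo"]}),
    columns=["matchType"],
)
print(dummies.columns.tolist())

# Category codes: integers assigned in sorted order of the values present.
train_ids = pd.Series(["m_b", "m_a", "m_b"])
test_ids = pd.Series(["m_c", "m_b"])
train_codes = train_ids.astype("category").cat.codes
test_codes = test_ids.astype("category").cat.codes

# "m_b" is coded 1 in the first series but 0 in the second.
print(train_codes.tolist())
print(test_codes.tolist())
```

Because the codes carry no meaning across frames, a model can at best use them as arbitrary labels within the data it was fitted on.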

### Linear Regression

My first idea was to use a more robust method, such as a Support Vector Regressor. However, due to time constraints I had to choose a quicker method. I also subsampled the curated dataset for debugging purposes, but the model is trained with the complete dataset.

To estimate the error on unseen data and spot overfitting, I decided to cross-validate our training with KFold and the standard value of *k* = 10.

In [None]:
# Subsampling dataset
#sample = 30000
#train_sample = train.sample(sample)

# Split dataset into training data and target variable

#X = train_sample.drop(columns = ['Id', 'winPlacePerc']) # Subsampled X
#y = train_sample['winPlacePerc'] # Subsampled y

X = train.drop(columns = ['Id', 'winPlacePerc'])
y = train['winPlacePerc']

kf = KFold(n_splits=10)
reg = linear_model.LinearRegression(n_jobs = -1)
outcomes = []
fold = 0

start = time.time()
for train_index, test_index in kf.split(X):
        fold += 1
        X_train, X_test = X.values[train_index], X.values[test_index]
        y_train, y_test = y.values[train_index], y.values[test_index]
        reg.fit(X_train, y_train)
        predictions = reg.predict(X_test)
        print("Coefficients:", reg.coef_)
        mse = mean_squared_error(y_test, predictions)
        outcomes.append(mse)
        print("Fold {0} MSE: {1}".format(fold, mse))     
mean_outcome = np.mean(outcomes)
print("Average MSE: {0}".format(mean_outcome))

end = time.time()
print("Elapsed time:",(end-start))

# Checking predicted values against the real ones (only last fold)
df = pd.DataFrame({'predicted': predictions, 'truth': y_test})
df.head()
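
The cross-validation loop can be tried in isolation on synthetic data. This sketch differs from the cell above in two labeled ways: it uses `shuffle=True` with a fixed seed, and it creates a fresh model per fold. Both are choices made here for illustration, not part of the original run.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression problem: y is a noisy linear function of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([0.5, -0.2, 0.1])
y = X @ true_coef + rng.normal(scale=0.01, size=200)

# shuffle=True mixes the rows before splitting (an assumption of this sketch).
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression()  # fresh model for every fold
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], predictions))

print("Average MSE:", np.mean(fold_mse))
```

Note that after such a loop the last fitted model has only seen the training part of the final fold; refitting on all rows before predicting on new data is a common follow-up.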



## Prediction and Submission


### Loading test dataset

In [None]:
# Loading test dataset

test = pd.read_csv('../input/test_V2.csv')
print("Total of NaN values on test dataset: ", test.isna().sum().sum())

### Test dataset preparation

In [None]:
# Creating playersJoined feature
test['playersJoined'] = test.groupby('matchId')['matchId'].transform('count')

# Removing event and custom matches
# Note: rows dropped from test get no prediction, so the submission will not cover every test Id
test = test[test.matchType.isin(['solo', 'duo', 'squad', 'solo-fpp', 'duo-fpp', 'squad-fpp'])]
group = test.groupby(['matchId','groupId','matchType'])['Id'].count().to_frame('players').reset_index()

# Non-standard matches
to_remove = group[group['players'] > 4].matchId.unique().tolist()

# Removing those matchIds from the group dataframe
group = group[~group.matchId.isin(to_remove)]

# Removing matches not agreeing with standard PUBG gameplay
to_remove = group.loc[(group['matchType'] == 'solo') & (group['players'] > 1)].matchId.unique().tolist()
group = group[~group.matchId.isin(to_remove)]

to_remove = group.loc[(group['matchType'] == 'solo-fpp') & (group['players'] > 1)].matchId.unique().tolist()
group = group[~group.matchId.isin(to_remove)]

to_remove = group.loc[(group['matchType'] == 'duo') & (group['players'] > 2)].matchId.unique().tolist()
group = group[~group.matchId.isin(to_remove)]

to_remove = group.loc[(group['matchType'] == 'duo-fpp') & (group['players'] > 2)].matchId.unique().tolist()
group = group[~group.matchId.isin(to_remove)]

# Removing more event and custom matches from the actual test dataset
test = test[test.matchId.isin(group.matchId.unique().tolist())]

# Creating the same features as train dataset
test['killsNorm'] = test['kills']*((100-test['playersJoined'])/100 + 1)
test['damageDealtNorm'] = test['damageDealt']*((100-test['playersJoined'])/100 + 1)
test['boostsHeals'] = test['boosts'] + test['heals']
test['totalDistance'] = test['walkDistance'] + test['rideDistance'] + test['swimDistance']
test['boostsHealsPerTotalDistance'] = test['boostsHeals']/(test['totalDistance']+1) # Adding 1 avoids division by zero (infinite values) when totalDistance is 0 and boostsHeals > 0

# Working on those non-categorical values
# One hot encode matchType
test = pd.get_dummies(test, columns=['matchType'])

# Turn groupId and match Id into categorical types
test['groupId'] = test['groupId'].astype('category')
test['matchId'] = test['matchId'].astype('category')

# Get category coding for groupId and matchID
test['groupId_cat'] = test['groupId'].cat.codes
test['matchId_cat'] = test['matchId'].cat.codes

# Get rid of old columns
test.drop(columns=['groupId', 'matchId'], inplace=True)

# Final test dataset for prediction without Id
test_pred = test.loc[:, ~test.columns.isin(['Id'])]
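
Since the model was fitted on a plain array, the test columns have to arrive in exactly the same order as the training columns. A defensive way to guarantee that, sketched here on toy frames, is to reindex the test features against the training columns. The frame names `train_X` and `test_X` are placeholders introduced for this example.

```python
import pandas as pd

# Toy feature frames: the test frame lacks one dummy column and has a different order.
train_X = pd.DataFrame({
    "kills": [1, 0],
    "matchType_duo": [1, 0],
    "matchType_solo": [0, 1],
})
test_X = pd.DataFrame({
    "matchType_solo": [1, 1],
    "kills": [3, 2],
})

# Reorder to the training layout and fill any missing column with 0.
test_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
print(test_aligned)
```

This also covers the case where a match type present in train never shows up in test, which would otherwise leave a dummy column missing.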

Last check before predicting the data.

In [None]:
test_pred.head()

In [None]:
predictions = reg.predict(test_pred)
pred_df = pd.DataFrame({'Id' : test['Id'], 'winPlacePerc' : predictions})

# Submission file
pred_df.to_csv("submission.csv", index=False)
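
A linear model is not bounded, so some predictions may fall outside the 0 to 1 range of *winPlacePerc*. One simple post-processing step, shown here on made-up numbers, is to clip them.

```python
import numpy as np

# Made-up raw predictions, two of them outside the valid range.
raw = np.array([-0.08, 0.31, 0.97, 1.12])

# Values below 0 become 0, values above 1 become 1, the rest are unchanged.
clipped = np.clip(raw, 0.0, 1.0)
print(clipped)
```

If used, the clipping would go on the prediction array before it is written to the submission frame.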

Finally, a peek at the submission file.

In [None]:
pred_df.head()

Eventually I will try the methods below to predict the data.

### Ridge Regression

*TODO*

### SVR (Support Vector Regressor)

*TODO*

### Random Forest Regressor

*TODO*