# PUBG: Exploratory analysis and predictions
#### by Kristofer Söderström
___
*This notebook showcases data exploration, manipulation and predictions with a video-game database*
### Contents
1. Database Description
1. Exploratory Analysis
1. Baseline Model
1. Feature Engineering

**Keywords:** Descriptive statistics, feature engineering, prediction. 

### 1. Database Description

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.
You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.
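
The caveat about match and group sizes can be checked directly with a `groupby`. The following is a minimal sketch on a tiny made-up sample (the rows are hypothetical; only the column names `Id`, `groupId` and `matchId` come from the dataset):

```python
import pandas as pd

# Tiny hypothetical sample that reuses the dataset's column names.
sample = pd.DataFrame({
    "Id":      ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"],
    "groupId": ["g1", "g1", "g2", "g3", "g3", "g3", "g3", "g3"],
    "matchId": ["m1", "m1", "m1", "m2", "m2", "m2", "m2", "m2"],
})

# Players per match: nothing forces this to be 100.
players_per_match = sample.groupby("matchId")["Id"].count()

# Players per group within a match: nothing forces this to be at most 4.
players_per_group = sample.groupby(["matchId", "groupId"])["Id"].count()

print(players_per_match.to_dict())
print("largest group:", players_per_group.max())
```

Running the same two `groupby` calls on the full training data would show how far real matches deviate from the nominal sizes.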

**Objective**: Predict a player's finishing placement based on their stats.

**Data fields**
* DBNOs - Number of enemy players knocked.
* assists - Number of enemy players this player damaged that were killed by teammates.
* boosts - Number of boost items used.
* damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
* headshotKills - Number of enemy players killed with headshots.
* heals - Number of healing items used.
* Id - Player’s Id
* killPlace - Ranking in match of number of enemy players killed.
* killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
* killStreaks - Max number of enemy players killed in a short amount of time.
* kills - Number of enemy players killed.
* longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
* matchDuration - Duration of match in seconds.
* matchId - ID to identify match. There are no matches that are in both the training and testing set.
* matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
* rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
* revives - Number of times this player revived teammates.
* rideDistance - Total distance traveled in vehicles measured in meters.
* roadKills - Number of kills while in a vehicle.
* swimDistance - Total distance traveled by swimming measured in meters.
* teamKills - Number of times this player killed a teammate.
* vehicleDestroys - Number of vehicles destroyed.
* walkDistance - Total distance traveled on foot measured in meters.
* weaponsAcquired - Number of weapons picked up.
* winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
* groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
* numGroups - Number of groups we have data for in the match.
* maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
* winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
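
The "None" conventions described for rankPoints, killPoints and winPoints can be made explicit by converting the placeholder values to `NaN`. This is a minimal sketch on hypothetical values, not a step that the rest of the notebook performs:

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names come from the dataset.
sample = pd.DataFrame({
    "rankPoints": [-1, 1500, 1480, -1],
    "killPoints": [1200, 0, 0, 0],
    "winPoints":  [1450, 0, 0, 0],
})

cleaned = sample.astype(float)
has_rank = cleaned["rankPoints"] != -1

# A value of -1 in rankPoints takes the place of "None".
cleaned.loc[~has_rank, "rankPoints"] = np.nan

# A 0 in killPoints/winPoints means "None" only when rankPoints is populated.
for col in ["killPoints", "winPoints"]:
    cleaned.loc[has_rank & (cleaned[col] == 0), col] = np.nan

print(cleaned)
```

In the last row rankPoints is -1, so the 0 in killPoints and winPoints is kept as a real value.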

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
import os
print(os.listdir("../input"))
#loading additional dependencies 
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #plots and graphs
import seaborn as sns #additional functionality and visualization
from math import sqrt
import random
random.seed(30) #seed for reproducibility
#dependencies for preprocessing and modelling data
from sklearn import preprocessing 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,r2_score
import xgboost as xgb


### 2. Exploratory Analysis

Exploratory data analysis (EDA) helps us understand the data by summarizing its main characteristics. It also helps formulate hypotheses for further data collection or experiments.

In [None]:
#load data and create dataframe 
train_data = pd.read_csv('../input/train_V2.csv')
#summarize information 
train_data.info()
print("database shape:",train_data.shape)
before = train_data.shape
print("missing data?",train_data.isnull().values.any())
print("deleting missing values...")# dataframe has missing values, we will drop them because of time constraints. Usually not desirable since missing information can actually provide with important insights.
train_data = train_data.dropna()
print("missing data?",train_data.isnull().values.any())
after = train_data.shape
#print("using random sample (1% of data) to speed up computation...")
#train_data = train_data.sample(n=None, frac=0.01, replace=False, weights=None, random_state=None, axis=None)
print("database shape:",train_data.shape)
print("Dropped rows:",before[0]-after[0])
train_data.head()

* There is a mix of object, integer, and float data.
* Around 4.5 million rows.
* 29 columns 
* The **target** or **dependent variable** is continuous ranging from 0 (last place) to 1 (first place). 
* Most data is numerical. Id, groupId, matchId and matchType are object data

A **heatmap** is a good way to start visualization. It gives a bird's-eye view of the dataset and helps identify correlations between features.

In [None]:
#developing a heatmap with example from https://seaborn.pydata.org/examples/many_pairwise_correlations.html
sns.set(style="white") # set a style for the graph
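# compute correlation matrix on numeric columns only (object columns such as Id or matchType cannot be correlated)
corr = train_data.select_dtypes(include="number").corr()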
f, ax = plt.subplots(figsize=(15,15)) #set size
cmap = sns.diverging_palette(220,10,as_cmap=True) #define a custom color palette
sns.heatmap(corr,annot=False,cmap=cmap,square=True,linewidths=0.5) #draw graph
plt.show()

It's already possible to make out some insights from the data. The upper left quadrant of the heatmap lets us know that some of the features (independent variables) are highly correlated, for example kills and damage dealt, or heals and boosts. In econometrics, this is called multicollinearity. It is less of a problem for prediction models (at least within the sample) than it is for explanatory models. However, there are some ways to deal with this problem in machine learning, like feature engineering by hand or principal component analysis (PCA).
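
As an illustration of the PCA option, here is a minimal sketch on synthetic data. The four arrays are made-up stand-ins for correlated stats (they are not drawn from the dataset), and the choice of two components is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(30)
n = 1000

# Synthetic stand-ins: "damage" drives "kills", "heals" drives "boosts".
damage = rng.gamma(shape=2.0, scale=60.0, size=n)
kills = damage / 100.0 + rng.normal(0.0, 0.3, size=n)
heals = rng.poisson(2.0, size=n).astype(float)
boosts = 0.5 * heals + rng.normal(0.0, 0.5, size=n)
X = np.column_stack([damage, kills, heals, boosts])

# Standardize first so that no feature dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

# The resulting components are uncorrelated with each other by construction.
print(np.round(np.corrcoef(components.T), 3))
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The trade-off is interpretability: each component is a mix of the original features, so it no longer maps to a single in-game stat.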

The objective is to predict the finishing placement (winPlacePerc), and we can see that some variables are highly correlated with it. It is also interesting to see which variables have an almost zero or an inverse relationship with the target:
* walkDistance, high correlation
* boosts, high correlation
* damageDealt, high correlation
* roadKills, low correlation
* matchDuration, low correlation
* killPlace, negative correlation.
    * This feature is the player's rank within the match by number of enemy players killed, so a lower value means more kills. The negative correlation therefore suggests that players with more kills tend to place higher, perhaps reflecting the strategy of winning players.
    
We can zoom in on those to get a better understanding of the potential predictors.

In [None]:
n = 10
f, ax = plt.subplots(figsize=(15,15))
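# restrict to numeric columns, since object columns cannot be correlated
cols = train_data.select_dtypes(include="number").corr().nlargest(n,"winPlacePerc")["winPlacePerc"].index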
corr = np.corrcoef(train_data[cols].values.T)
sns.heatmap(corr, annot=True,cmap=cmap,square=True,linewidths=0.5,xticklabels=cols.values,yticklabels=cols.values)
plt.show()

Walking distance is the feature with the highest correlation. This makes sense if we think of walking distance as a proxy for persistence in the game. This is a good example of correlation vs causation. Obviously, running is not going to make you win first place in a match but it does give a signal that will be useful for prediction. 

In [None]:
sns.set_style("white")
f, ax = plt.subplots(figsize=(8,5))
sns.scatterplot(x="walkDistance",y="winPlacePerc",data=train_data)
plt.show()

There are some observations to be made here:
1. There is a bias towards zero
    1. There are players that appear to win the match without moving, which might indicate some sort of cheating in the game, although more thorough examination must be done to confirm this. 
1. There are a number of outliers skewing the graph. 
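
Both observations can be turned into simple filters for closer inspection. The sketch below uses hypothetical rows, and the 0.99 quantile cutoff is an arbitrary assumption rather than a recommended threshold:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "walkDistance": [0.0, 0.0, 2500.0, 3900.0, 25000.0],
    "winPlacePerc": [1.0, 0.0, 0.8, 1.0, 0.9],
})

# Players who finished first without walking at all.
stationary_winners = sample[
    (sample["walkDistance"] == 0) & (sample["winPlacePerc"] == 1)
]

# Flag extreme distances with an upper quantile cutoff (0.99 is arbitrary).
cutoff = sample["walkDistance"].quantile(0.99)
outliers = sample[sample["walkDistance"] > cutoff]

print(len(stationary_winners), "stationary winner(s)")
print(outliers)
```

A flagged row is only a candidate for review. Whether it reflects cheating or a data artifact would still need the more thorough examination mentioned above.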

The relationship is easier to see (at a higher computational cost) with a line plot, which aggregates the observations at each x value.


In [None]:
sns.set_style("white")
f, ax = plt.subplots(figsize=(8,5))
sns.lineplot(x="walkDistance", y="winPlacePerc", data=train_data)
plt.show()

There is an interesting logarithmic trend in the data that plateaus around 4000 meters. There is a steady trend in the range of 2000 to 4000 meters travelled that seems to be associated with finishing placement. Perhaps this signifies the walking distance from the outer to the innermost rings during the matches.
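
A cheaper way to look at the same kind of trend is to bin the distance and average the target per bin. This is a minimal sketch on synthetic data: the saturating curve is an assumption made for illustration and the 500 meter bin width is arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(30)

# Synthetic stand-in: placement rises with distance and then saturates.
walk = rng.uniform(0, 6000, size=5000)
place = np.clip(1 - np.exp(-walk / 1500) + rng.normal(0, 0.1, size=5000), 0, 1)
sample = pd.DataFrame({"walkDistance": walk, "winPlacePerc": place})

# Bin the distance into 500 m intervals and average the target in each bin.
bins = np.arange(0, 6001, 500)
sample["walkBin"] = pd.cut(sample["walkDistance"], bins=bins, include_lowest=True)
binned_mean = sample.groupby("walkBin", observed=True)["winPlacePerc"].mean()

print(binned_mean)
```

Applied to `train_data`, the same `pd.cut` and `groupby` steps produce one mean per bin, which is much faster to plot than millions of individual points.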

In [None]:
sns.set(style="white")
cols = ['winPlacePerc','walkDistance', 
        'boosts','weaponsAcquired',
        'damageDealt','heals']
sns.pairplot(train_data[cols])
plt.show()

As expected, based on their correlation coefficient, it is possible to see a strong positive association between the features and the target variable. It is also possible to see the multicollinearity between some of the features. 
The diagonal graphs show the histogram of each variable, where the skewness towards zero is even more evident. This will warrant some sort of normalization of the data to achieve more accurate results.
Usually, data exploration would be performed deeper to better understand the data. For the purposes of this demo, we will jump ahead to a baseline model. 
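
One common option for right-skewed, non-negative features is a log transform. The sketch below uses a synthetic lognormal sample as a stand-in for a skewed stat. It is shown only as an illustration and is not the normalization applied later in this notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(30)

# Synthetic right-skewed stand-in for a stat such as damage dealt.
damage = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=10000))

# log1p computes log(1 + x), so zero values remain valid inputs.
logged = np.log1p(damage)

print("skew before:", round(damage.skew(), 2))
print("skew after: ", round(logged.skew(), 2))
```

On the real features the effect depends on how many exact zeros there are, since a large spike at zero stays a spike after the transform.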

### 3. Baseline Model
#### Linear Regression

The baseline model will be the simplest model possible, against which we can compare and gauge the results of more complex modelling. It is a good starting point where little is changed in the original dataset. Before we do this, we need to separate features and targets, as well as transform matchType into categorical values, since it might include some useful information for the model. For now, we drop Id, groupId and matchId.

In [None]:
train_data["matchType"].unique() # we can see the different types of matches in the game

In [None]:
#we start by dividing the training data between features and targets
x_train = train_data.iloc[:,3:-1]
y_train = train_data.iloc[:,-1].values
print("training data has the shape:",x_train.shape)
#one hot encoding matchType to include in analysis, it has 16 different types which might reflect specific characteristics between the match types
x_train = pd.get_dummies(x_train, prefix = ["matchType"])
print("x_train shape after one hot encoding",x_train.shape)
print("y_train shape",y_train.shape)

#we will normalize data to facilitate learning
print("normalizing data...")
#x_train = preprocessing.StandardScaler().fit_transform(x_train)
x_train = preprocessing.scale(x_train)

print("validation split to 20%")#
X_train,X_val,y_train,y_val = train_test_split(x_train,y_train,
                                               test_size=0.2,
                                               random_state=30)
print("fitting linear regression...")
reg = LinearRegression()
fit = reg.fit(X_train,y_train)
y_predicted = reg.predict(X_val)
print('Train R2',reg.score(X_train,y_train))
print('Val R2', r2_score(y_val,y_predicted))
print('Val RMSE', sqrt(mean_squared_error(y_val, y_predicted)))

Our baseline model is an out-of-the-box Linear Regression. It is adequate for regression problems where we have an abundance of data and are not concerned with selecting only a few features. It should be noted that the features were normalized to facilitate learning by the model.
1. Training and validation R2 are both around 0.84, indicating robustness and a reasonable fit out of the box.
2. The value of the RMSE implies a typical error of about 12.3% (relative to the 0-1 scale of the target) between the predicted and actual validation values. Compared to more advanced models, this is probably high. Let's try a more complex model.
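
To make the two metrics concrete, they can be computed by hand. The arrays below are made-up values on the same 0 to 1 scale as the target, not results from the model:

```python
import numpy as np

# Made-up true and predicted placements on the 0-1 target scale.
y_true = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y_pred = np.array([0.1, 0.2, 0.5, 0.9, 0.9])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the units of the target.
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: share of the target's variance that the predictions account for.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("RMSE:", round(rmse, 4))
print("R2:  ", round(r2, 4))
```

Because the target ranges from 0 to 1, an RMSE can be read directly as a fraction of that range, which is how the percentage above is obtained.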

#### XGBoost: A more complex model
XGBoost is a gradient-boosted decision tree library. It is a powerful algorithm that often performs very well on tabular data and is easy to use with its default settings.

In [None]:
reg = xgb.XGBRegressor()                        
print("fitting xgboost regression")
fit = reg.fit(X_train,y_train)
y_predicted = reg.predict(X_val)
print('Train R2',reg.score(X_train,y_train))
print('Val R2', r2_score(y_val,y_predicted))
print('Val RMSE', sqrt(mean_squared_error(y_val, y_predicted)))

The default XGBoost model outperformed Linear Regression on all fronts, increasing training and validation R2 to around 0.90 and reducing the error to 9.8%. However, computation time was significantly higher.
For now we can continue to feature engineering.

### 4. Feature Engineering
The process of feature engineering requires domain knowledge to create features. It is one of the main components of applied machine learning and is very time consuming.
I have never played this particular game but I am familiar with battle royale style games and gaming in general, which might help with the objective of predicting final placement.
#### Feature creation and selection
Based on our correlation matrix earlier, it is possible to see some variables that do not seem to be strong predictors for the target. Whether they should be dropped from the model is not easy to say at first glance. However, it's possible to combine some features that are already correlated with each other to reduce multicollinearity and increase robustness.
* Boosts and heals are items that might increase a player's **passive capabilities**, rather than acting actively like a weapon would.

In [None]:
#boosts and heals 
print("correlation between passive items and finishing placement:")
train_data["_passiveItems"] = train_data["boosts"]+train_data["heals"]
print(np.corrcoef(train_data["_passiveItems"],train_data["winPlacePerc"])) #corrcoefficient
#correlation graph
sns.set_style("white")
f, ax = plt.subplots(figsize=(8,5))
sns.scatterplot(x="_passiveItems",y="winPlacePerc",data=train_data,legend="full")
plt.show()

* If we think of **distance** as just another measure of persistence in the match, we can go ahead and combine ride, swim and walk distance into a single feature of total distance travelled. However, we can see a sharp decrease in the correlation coefficient when compared to walkDistance alone. This might affect the results.

In [None]:
#total distance
print("correlation total distance travelled and finishing placement:")
train_data["_totalDistance"] = train_data["walkDistance"]+train_data["rideDistance"]+train_data["swimDistance"]

print(np.corrcoef(train_data["_totalDistance"],train_data["winPlacePerc"])) #corrcoefficient
#correlation graph
sns.set_style("white")
f, ax = plt.subplots(figsize=(8,5))
sns.lineplot(x="_totalDistance",y="winPlacePerc",data=train_data,legend="full")
plt.show()
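
One possible reason for the drop in correlation is that adding a high-variance component that is only weakly related to the target dilutes the signal of the original feature. The sketch below demonstrates the effect on synthetic data. It assumes, purely for illustration, that the target depends on walking only and that few players drive, which may not hold in the real dataset:

```python
import numpy as np

rng = np.random.default_rng(30)
n = 20000

# Synthetic walking distance and a target that depends on walking only.
walk = rng.gamma(2.0, 600.0, size=n)
target = walk / walk.max() + rng.normal(0.0, 0.05, size=n)

# Assumption: most players never drive, a few drive very far.
ride = np.where(rng.random(n) < 0.2, rng.gamma(2.0, 2000.0, size=n), 0.0)
total = walk + ride

corr_walk = np.corrcoef(walk, target)[0, 1]
corr_total = np.corrcoef(total, target)[0, 1]

print("walk only:", round(corr_walk, 3))
print("walk+ride:", round(corr_total, 3))
```

If this is what happens in the real data, keeping walkDistance as its own feature next to the total would preserve the stronger signal.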

* In econometrics, the use of dummy variables aims to control for unseen characteristics in a model. We might not know exactly how they interact with the target, but we suppose they are there. Similarly, there might be match-specific characteristics, meaning any set of characteristics particular to the match in progress, that ultimately impact the outcome of a match.

In [None]:
train_data["matchId"].describe()

There are 47,964 unique matches in the data. If we tried to one-hot encode them, it would add the same number of columns to the database. In terms of added information vs cost, this might not be worth exploring. We will attempt to do so anyway to see what happens. Deep learning models perform better with more data and perhaps there is a signal here that we would otherwise be missing out on.
Unfortunately, attempting to one-hot encode all the unique matches results in a memory error. While using categorical values in a single column instead would solve that problem, it would incorrectly impose an ordering on the variable (one match is not necessarily "better" or "worse" than another). One attempt used a random sample of 1% of the data, which made the one-hot encoding possible. However, even fitting a linear regression on it again resulted in a memory error.
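
A back-of-the-envelope calculation shows why the memory error is expected. The row count is the rounded "around 4.5 million" from the data summary, so the result is an estimate, and it counts only the dummy columns of a dense matrix:

```python
rows = 4_500_000        # rounded row count from the data summary
match_columns = 47_964  # one dummy column per unique matchId

cells = rows * match_columns
bytes_one_byte = cells * 1  # one byte per cell (uint8 or bool dummies)
bytes_float64 = cells * 8   # eight bytes per cell after scaling to float64

gib = 1024 ** 3
print(f"{cells:,} cells")
print(f"{bytes_one_byte / gib:.0f} GiB with one byte per cell")
print(f"{bytes_float64 / gib:.0f} GiB as float64")
```

Even with one byte per cell the dense matrix needs roughly 200 GiB, and normalizing it to floating point multiplies that by eight.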

In [None]:
#skipped
#adding matchId
#x_train = train_data.iloc[:,2:-1]
#y_train = train_data.iloc[:,-1].values
#print("traning data has the shape:",x_train.shape)

#x_train = pd.get_dummies(x_train, prefix = ["matchId","matchType"])
#print("x_train shape after one hot encoding", x_train.shape)
#print("y_train shape", y_train.shape)
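
An alternative that captures some match-specific context without adding one column per match is to compute per-match aggregates and attach them to each player row. This is a hypothetical sketch that was not tried in this notebook. The feature names `_matchMeanKills` and `_killsVsMatch` are made up for illustration:

```python
import pandas as pd

# Hypothetical rows; only matchId and kills are dataset column names.
sample = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m2", "m2"],
    "kills":   [0, 2, 4, 1, 3],
})

# Mean kills in each match, broadcast back to every player row.
sample["_matchMeanKills"] = sample.groupby("matchId")["kills"].transform("mean")

# A player's kills relative to the average of their own match.
sample["_killsVsMatch"] = sample["kills"] - sample["_matchMeanKills"]

print(sample)
```

This adds two columns regardless of how many matches there are, so memory use stays proportional to the number of rows.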


Let's see how our constructed features perform on XGBoost relative to the baseline models. We will only drop the features that were used to construct our own, along with the ID columns.

In [None]:
train_data_feat = train_data.drop(columns=["boosts","heals","walkDistance",
                                           "rideDistance","swimDistance",
                                           "Id","groupId","matchId"])

#once again we construct our x and y sets, one hot encode matchType and normalize the data
x_train = train_data_feat.drop(columns=["winPlacePerc"])
y_train = train_data["winPlacePerc"].values #1-D target array, as in the baseline
print("training data has the shape:",x_train.shape)
#one hot encoding matchType to include in analysis
x_train = pd.get_dummies(x_train, prefix = ["matchType"])
print("x_train shape after one hot encoding", x_train.shape)
print("y_train shape", y_train.shape)

#we will normalize data to facilitate learning
print("normalizing data...")
from sklearn import preprocessing 
x_train = preprocessing.scale(x_train)

#training-validation split
X_train,X_val,y_train,y_val = train_test_split(x_train,y_train,test_size=0.2,random_state=30)
print("fitting xgboost regression...")

reg = xgb.XGBRegressor()  
fit = reg.fit(X_train,y_train)
y_predicted = reg.predict(X_val)
print('Train R2',reg.score(X_train,y_train))
print('Val R2', r2_score(y_val,y_predicted))
print('Val RMSE', sqrt(mean_squared_error(y_val, y_predicted)))

It seems that our constructed variables do not increase the predictive capacity of the model, as measured by R2 on the validation data. It might be that, since we are decreasing the number of variables instead of increasing it, we are taking signals away from the model. However, the similarity of the results to those of the previous model might indicate some level of robustness.

Our XGBoost regression model without the constructed variables was the better performer. Next steps could be trying to create better features, as well as tuning the current model or trying other approaches like neural networks.