# How to Win at PUBG

### A Machine Learning Project by Eric Plass and Victor Butoi

PUBG is a multi-million-dollar game in which players, either teamed up or going solo, try to kill off the rest of the competition and be the last man standing.
<br>
This project's goal is to predict the finishing placement of PUBG players from the following statistics:
<br>
<li>DBNOs - Number of enemy players knocked.
<li>assists - Number of enemy players this player damaged that were killed by teammates.
<li>boosts - Number of boost items used.
<li>damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
<li>headshotKills - Number of enemy players killed with headshots.
<li>heals - Number of healing items used.
<li>Id - Player’s Id
<li>killPlace - Ranking in match of number of enemy players killed.
<li>killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
<li>killStreaks - Max number of enemy players killed in a short amount of time.
<li>kills - Number of enemy players killed.
<li>longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
<li>matchDuration - Duration of match in seconds.
<li>matchId - ID to identify match. There are no matches that are in both the training and testing set.
<li>matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
<li>rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
<li>revives - Number of times this player revived teammates.
<li>rideDistance - Total distance traveled in vehicles measured in meters.
<li>roadKills - Number of kills while in a vehicle.
<li>swimDistance - Total distance traveled by swimming measured in meters.
<li>teamKills - Number of times this player killed a teammate.
<li>vehicleDestroys - Number of vehicles destroyed.
<li>walkDistance - Total distance traveled on foot measured in meters.
<li>weaponsAcquired - Number of weapons picked up.
<li>winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
<li>groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
<li>numGroups - Number of groups we have data for in the match.
<li>maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
<li>winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
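
To make the target concrete, here is a minimal sketch of how a percentile placement could be computed from a finishing place and maxPlace. The exact formula is an assumption on our part (the dataset description only says that 1 is first place, 0 is last place, and that the value is calculated off of maxPlace), and `win_place_perc` is a hypothetical helper, not part of the dataset or any API.

```python
def win_place_perc(place, max_place):
    """Assumed formula: linear scale from 1.0 (1st place) to 0.0 (last place)."""
    if max_place <= 1:
        # Only one placement recorded, so there is nothing to scale against.
        return 0.0
    return (max_place - place) / (max_place - 1)

first = win_place_perc(1, 50)     # best placement
last = win_place_perc(50, 50)     # worst placement
middle = win_place_perc(25, 49)   # exactly halfway down a 49-place match
print(first, last, middle)
```

Under this assumption every placement step is worth the same amount, which is why skipped placements show up as "missing chunks" in the target values.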

# Working with the Data
<br>
First, let's start with the standard procedure of loading the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import gc

import seaborn as sns
import numpy as np

import random
random.seed(42)

from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats 
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline

train_f = pd.read_csv("train_V2.csv")

Here's what the data looks like:

In [None]:
train_f.head(5)

Let's look at general info about our data:

In [None]:
train_f.info()

With all the different parameters and games in this dataset, it's in our best interest to save some memory.

We can do better if we know what features we are targeting. By default, pandas infers wide types such as int64 and float64 for every numeric column, which wastes a lot of space. Since we know that most of the numbers we are dealing with aren't that large, we can shrink each column down to the smallest type its values fit in.
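
As a small illustration of the idea, the sketch below builds a synthetic frame (not the PUBG data; the column names are only borrowed for readability) and lets pandas pick smaller types with `pd.to_numeric(..., downcast=...)`, then compares memory usage before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'kills': rng.integers(0, 30, 10_000),           # stored as int64 by default
    'walkDistance': rng.uniform(0, 1000, 10_000),   # stored as float64 by default
})
before = df.memory_usage(deep=True).sum()

small = df.copy()
# Ask pandas for the smallest unsigned integer / float type that fits the values.
small['kills'] = pd.to_numeric(small['kills'], downcast='unsigned')
small['walkDistance'] = pd.to_numeric(small['walkDistance'], downcast='float')
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Passing an explicit `dtype` mapping to `read_csv`, as we do next, goes one step further because the wide types are never allocated in the first place.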

In [None]:
dtypes = {
        'Id'                : 'object',
        'groupId'           : 'object',
        'matchId'           : 'object',
        'assists'           : 'uint8',
        'boosts'            : 'uint8',
        'damageDealt'       : 'float16',
        'DBNOs'             : 'uint8',
        'headshotKills'     : 'uint8', 
        'heals'             : 'uint8',    
        'killPlace'         : 'uint8',    
        'killPoints'        : 'uint16',   # Elo-style rating, values go well past 255
        'kills'             : 'uint8',    
        'killStreaks'       : 'uint8',    
        'longestKill'       : 'float16',    
        'maxPlace'          : 'uint8',
        'matchType'         : 'object',
        'numGroups'         : 'uint8',    
        'revives'           : 'uint8',    
        'rideDistance'      : 'float16',    
        'roadKills'         : 'uint8',    
        'swimDistance'      : 'float16',    
        'teamKills'         : 'uint8',    
        'vehicleDestroys'   : 'uint8',    
        'walkDistance'      : 'float16',    
        'weaponsAcquired'   : 'uint8',    
        'winPoints'         : 'uint16',   # Elo-style rating, values go well past 255
        'winPlacePerc'      : 'float16' 
}

train = pd.read_csv("train_V2.csv", dtype=dtypes)

train.info()

We saved a whopping 657.3MB, or 67% of the memory! Now on to analyzing the features themselves.

# Cleanup, Data Analysis, and Feature Engineering

To begin, we know that any matches with NaN values for winPlacePerc should be thrown out.

In [None]:
# Find matches that contain a NaN target and drop every row from those matches
falseWinPlacePerc = train[train['winPlacePerc'].isna()]['matchId'].values
train = train[~train['matchId'].isin(falseWinPlacePerc)]

In order to see which features we should choose, a correlation map is our best bet. However, the columns with the object type (the IDs and the match type string) can't be correlated numerically, so we toss them out of this graph.

In [None]:
droppedColumns = ['Id', 'groupId', 'matchId', 'matchType']
fittedColumns = [col for col in train.columns if col not in droppedColumns]
corr = train[fittedColumns].corr()

Then we build the heatmap to present the correlation

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(
    corr,
    xticklabels=corr.columns.values,
    yticklabels=corr.columns.values,
    linecolor='white',
    linewidths=0.1,
    cmap="BuPu"
)
plt.show()

Nice! Now we have an understanding of which things are correlated, and we can start to unpack some of the relationships. For example, it makes a lot of sense that damage is highly correlated with kills, since you have to deal damage to get kills. A more hidden and interesting correlation is boosts to walkDistance. This makes sense too: if you can run faster, you will (on average) go farther. A little gem of a statistic.

This is an interesting analysis, but for training we really only care about correlations that are reasonably strong, so let's boil everything down to the pairs whose absolute correlation is above a base value of 0.25.

In [None]:
plt.figure(figsize=(15,10))
corr1=corr.abs()>0.25
sns.heatmap(corr1,annot=True)
plt.show()

From playing the game, two things are super apparent as the biggest correlators with winning: how many people you kill and how far you walk.

To show these correlations, we chose to use box and whisker plots for "bins" of killers and runners, to show the skewed shape of the distribution, its central value, and its variability.
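
For readers unfamiliar with `pd.cut`, here is a small sketch of how the binning works, using a made-up handful of kill counts rather than the real data. The bin edges mirror the ones we use for kills; the label names are only illustrative.

```python
import pandas as pd

kills = pd.Series([0, 1, 2, 3, 5, 6, 12])
bins = [-1, 0, 2, 5, 60]   # intervals are right-closed: (-1, 0], (0, 2], (2, 5], (5, 60]
labels = ['0_kills', '1-2_kills', '3-5_kills', '6+_kills']

binned = pd.cut(kills, bins, labels=labels)
print(binned.tolist())
```

Because each interval includes its right edge, starting the first bin at -1 is what lets a player with exactly 0 kills land in the first category.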
<br>
Here's what we found from analyzing the kills

In [None]:
killsCategories = train.copy()
killsCategories['killsCategories'] = pd.cut(train['kills'], [-1, 0, 2, 5, 60], labels=['0_kills','1-2_kills', '3-5_kills', '6-10+kills'])

plt.figure(figsize=(15,10))
sns.boxplot(x="killsCategories", y="winPlacePerc", data=killsCategories)
plt.show()

No surprise here: you need kills to win (unless the zone happens to close on you). But how about running?

In [None]:
walkDistanceData = train.copy()
walkDistanceData['walkDistance'] = pd.cut(train['walkDistance'], [-1, 0, 1000, 2000, 10000], labels=['0_steps','1-1000_steps','1000-2000_steps','2000-10000_steps'])

plt.figure(figsize=(15,10))
sns.boxplot(x="walkDistance", y="winPlacePerc", data=walkDistanceData)
plt.show()

This is exactly what we thought would happen: you need to kill and run to win PUBG games. Let's make some features that show how players' killing and running relate to other aspects of the game.

Let's begin by making some new features that have to deal with how well people do with killing. We work based on two assumptions with these features:
<li>if someone gets more headshots for their kills, then they are a better player.
<li>if someone can get more kills without moving as much, they are a better player.

In [None]:
# Fraction of kills that were headshots; 0 kills gives 0/0 = NaN, so map NaN and inf to 0
headshot_ratio = train['headshotKills'] / train['kills']
train['headshotKills_over_kills'] = headshot_ratio.replace(np.inf, 0).fillna(0)

# Despite the name, this is kills per meter walked; the +1 avoids dividing by zero
kill_ratio = train['kills'] / (train['walkDistance'] + 1)
train['distancePerKill'] = kill_ratio.replace(np.inf, 0).fillna(0)

We already showed above how much kills matter, but our data isn't really that nice. A kill should count more if there are fewer players to kill, right? So we account for that by normalizing the data.

In [None]:
#If 100 players join, then a kill matters less than if 2 players joined
train['normalizedKills'] = train['kills'] * ((100 - train['numGroups']) / 100 + 1)
train['normalizedDamage'] = train['damageDealt'] * ((100 - train['numGroups']) / 100 + 1)
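
To see what this scaling does, here is a short worked example of the factor `(100 - numGroups) / 100 + 1` for a few match sizes. `group_factor` is a hypothetical helper introduced only for this illustration.

```python
def group_factor(num_groups):
    """Multiplier applied to kills and damage: 1.0 for a full 100-group match, larger for smaller ones."""
    return (100 - num_groups) / 100 + 1

full = group_factor(100)     # no adjustment
half = group_factor(50)      # each kill counts 1.5x
quarter = group_factor(25)   # each kill counts 1.75x
print(full, half, quarter)
```

So a kill in a 50-group match is weighted 1.5 times as much as a kill in a full 100-group match.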

Next, we know that winPlacePerc correlates well with both boosts and heals, so why not their combination? Good players also know how to use heals and boosts efficiently, and will probably have similar ratios of heals and boosts per distance traveled.

In [None]:
train['healsAndBoosts'] = train['heals'] + train['boosts']

# The +1 in the denominator means no divide by zeros today
train['itemEfficiency'] = (train['healsAndBoosts'] / (train['walkDistance'] + 1)).fillna(0)

Finally, let's consider how far a person travels around the map. How much they move overall should be correlated with how good a player they are.

In [None]:
train['totalDistance'] = train['walkDistance'] + train['rideDistance'] + train['swimDistance']

# Testing our New Features

Time to see whether the new features we added can actually help us with predictions. First we use a simple correlation graph like before, then a more advanced tactic.

In [None]:
corr = train[['headshotKills_over_kills','distancePerKill','normalizedKills','normalizedDamage','healsAndBoosts',
             'itemEfficiency','totalDistance','winPlacePerc']].corr()

plt.figure(figsize=(15,10))
sns.heatmap(
    corr,
    xticklabels=corr.columns.values,
    yticklabels=corr.columns.values,
    annot=True,
    linecolor='white',
    linewidths=0.1,
    cmap="RdBu"
)
plt.show()

It seems some of our predictions could be off; however, Linear Regression gives us a more definitive way of checking. First, we define a special train/test split that splits by match ID, so that all players from the same match end up on the same side of the split.

In [None]:
def group_by_match_TTS(train, test_size=0.2):
    # Sample whole matches so that no match is split across the two sets
    match_ids = train['matchId'].unique().tolist()
    train_size = int(len(match_ids) * (1 - test_size))
    train_match_ids = random.sample(match_ids, train_size)

    in_train = train['matchId'].isin(train_match_ids)
    train_part = train[in_train]
    val_part = train[~in_train]

    return train_part, val_part
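
For comparison, scikit-learn ships a splitter that does the same kind of group-aware split. The sketch below is an alternative, not what we use in this notebook, and it runs on a tiny made-up frame so the result is easy to inspect.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    'matchId': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e'],
    'kills': [0, 2, 1, 4, 0, 0, 3, 1, 2, 5],
    'winPlacePerc': [0.1, 0.8, 0.4, 1.0, 0.0, 0.2, 0.9, 0.5, 0.6, 1.0],
})

# One split, holding out roughly 20% of the matches (not 20% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(df, groups=df['matchId']))
train_part, val_part = df.iloc[train_idx], df.iloc[val_idx]

# No match should appear on both sides of the split.
shared = set(train_part['matchId']) & set(val_part['matchId'])
print(sorted(val_part['matchId'].unique()), shared)
```

Keeping matches together matters because placements within one match are tied to each other, so splitting a match would leak information from training into validation.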

The following testFeatures function takes in a list of feature functions and, for each one, runs a Linear Regression model on a version of the data that has that new feature added, to see its impact on the final score.

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). With more than one explanatory variable, the process is called multiple linear regression. Since our goal is to explain the variation in the response variable that can be attributed to variation in the explanatory variables, we can use multiple linear regression to show the strength of the relationship between the response and the explanatory variables, and in particular to determine whether some explanatory variables do not contribute to the accuracy of our model.
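
As a toy illustration of that last point, the sketch below fits a multiple linear regression on synthetic data (not the PUBG data) where one explanatory variable drives the response and the other is pure noise. The variable names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
useful = rng.normal(size=n)          # actually drives the response
noise_feature = rng.normal(size=n)   # unrelated to the response
y = 2.0 * useful + rng.normal(scale=0.1, size=n)

X = np.column_stack([useful, noise_feature])
model = LinearRegression().fit(X, y)
coef_useful, coef_noise = model.coef_
print(coef_useful, coef_noise)
```

The fitted coefficient for the useful variable lands near 2, while the one for the noise variable stays near 0, which is the kind of signal we look for when judging whether a new feature contributes anything.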

In [None]:
def testFeatures(features):
    results = []
    
    for feature in features:
        gc.collect()
        test_train = pd.read_csv("train_V2.csv", dtype=dtypes)
        invalid_match_ids = test_train[test_train['winPlacePerc'].isna()]['matchId'].values
        test_train = test_train[~test_train['matchId'].isin(invalid_match_ids)]
        
        test_train.drop(columns=['matchType'], inplace=True)
        test_train = feature(test_train)

        cols_to_drop = ['Id', 'groupId', 'matchId', 'winPlacePerc']
        cols_to_fit = [col for col in test_train.columns if col not in cols_to_drop]
        
        train, val = group_by_match_TTS(test_train, 0.2)
    
        model = LinearRegression()
        model.fit(train[cols_to_fit], train['winPlacePerc'])
    
        y_true = val['winPlacePerc']
        y_pred = model.predict(val[cols_to_fit])
        score = mean_absolute_error(y_true, y_pred)
        
        results.append({
            'name': feature.__name__,
            'score': 1 - score
        })
        
    return pd.DataFrame(results, columns=['name', 'score']).sort_values(by='score')

The way this works is that we pass in functions that get called on the data, so we need to define what each one does when it builds its feature, i.e., how it adds that feature to the sample dataset.

In [None]:
def normal(train):
    return train

def totalDistance(train):
    train['totalDistance'] = train['walkDistance'] + train['rideDistance'] + train['swimDistance']
    return train

def headshotKills_over_kills(train):
    # 0 kills gives 0/0 = NaN; treat NaN and inf as 0
    ratio = train['headshotKills'] / train['kills']
    train['headshotKills_over_kills'] = ratio.replace(np.inf, 0).fillna(0)
    return train

def normalizedKills(train):
    train['normalizedKills'] = train['kills'] * ((100 - train['numGroups']) / 100 + 1)
    return train

def normalizedDamage(train):   
    train['normalizedDamage'] = train['damageDealt'] * ((100 - train['numGroups']) / 100 + 1)
    return train

def healsAndBoosts(train):
    train['healsAndBoosts'] = train['heals'] + train['boosts']
    return train

def itemEfficiency(train):
    train['healsAndBoosts'] = train['heals'] + train['boosts']
    # The +1 in the denominator means no divide by zeros today
    train['itemEfficiency'] = (train['healsAndBoosts'] / (train['walkDistance'] + 1)).fillna(0)
    return train

def distancePerKill(train):
    # Despite the name, this is kills per meter walked; the +1 avoids an infinity error
    ratio = train['kills'] / (train['walkDistance'] + 1)
    train['distancePerKill'] = ratio.replace(np.inf, 0).fillna(0)
    return train

Time to run over all the new features. Note that the score reported is 1 minus the MAE, so higher is better. The Mean Absolute Error of a model is the mean of the absolute differences between the actual values and the predicted values.
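
A quick worked example of the metric, with three made-up placements and predictions:

```python
import numpy as np

y_true = np.array([1.0, 0.5, 0.0])
y_pred = np.array([0.8, 0.5, 0.3])

# Absolute errors are 0.2, 0.0 and 0.3, so the mean is 0.5 / 3.
mae = np.mean(np.abs(y_true - y_pred))
score = 1 - mae   # the quantity reported in the results table
print(mae, score)
```

Since winPlacePerc lives between 0 and 1, an MAE of about 0.17 here means the predictions are off by roughly 17 percentile points on average.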

In [None]:
testFeatures([
    normal,
    totalDistance,
    headshotKills_over_kills,
    normalizedKills,
    normalizedDamage,
    healsAndBoosts,
    itemEfficiency,
    distancePerKill
])

Interestingly enough, the new heals-and-boosts feature showed a high correlation with winning but in fact did nothing to help our training score.

# Cheaters

Now we are going to deal with a group of players that we HATE to run into in the game: cheaters. But how do we know if someone is cheating? Well, if someone gets kills without moving, they are almost certainly cheating. If they get 100% headshot accuracy, they are either really good or aimbotting; we will let those players through, since filtering on it might remove too many real players.

Here are the categories of cheaters that we want to drop. Removing these outliers should help our model's accuracy.
<br>
<li>Killing without moving

In [None]:
##Dropping people who got kills without moving
train['killsWithoutMoving'] = ((train['kills'] > 0) & (train['totalDistance'] == 0))
train.drop(train[train['killsWithoutMoving'] == True].index, inplace=True)

<li>Killing over insane distances

In [None]:
##Dropping people who got kills over 1000m
train.drop(train[train['longestKill'] >= 1000].index, inplace=True)

<li>Irregular Movement

In [None]:
##Dropping people who speedrun like Sonic through the map
train.drop(train[train['rideDistance'] >= 20000].index, inplace=True)
train.drop(train[train['walkDistance'] >= 10000].index, inplace=True)
train.drop(train[train['swimDistance'] >= 2000].index, inplace=True)

<li>Suspicious weapon counts

In [None]:
##Dropping people who got a suspicious number of weapons
train.drop(train[train['weaponsAcquired'] >= 80].index, inplace=True)

Now this won't catch all the cheaters, but it does catch the most common ones and helps improve the accuracy of our model.
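
The same kind of filtering can also be written as one combined boolean mask, which makes it easy to count how many rows each rule removes before dropping anything. This is a sketch on a tiny made-up frame that reuses a few of the thresholds from above; it is not a replacement for the cells we actually ran.

```python
import pandas as pd

players = pd.DataFrame({
    'kills': [0, 3, 5, 1],
    'totalDistance': [0.0, 0.0, 1500.0, 800.0],
    'longestKill': [0.0, 20.0, 1200.0, 90.0],
    'weaponsAcquired': [0, 2, 4, 3],
})

suspicious = (
    ((players['kills'] > 0) & (players['totalDistance'] == 0))   # kills without moving
    | (players['longestKill'] >= 1000)                           # kills over 1000m
    | (players['weaponsAcquired'] >= 80)                         # suspicious weapon count
)
clean = players[~suspicious]
print(suspicious.tolist(), len(clean))
```

Note that a player with 0 kills and 0 distance (for example someone who went AFK) is kept, since standing still is only suspicious when it comes with kills.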

# Using the Boost Model

Boosting is a sequential ensemble method: we start with a very weak learner, and with each iteration the ensemble gets better. The general definition leaves open exactly how the model is improved, and each boosting implementation does it differently.

Generally, the improvement works as follows: take the parts of the data where the current model has done poorly, train a new model on this "failed" data, and combine it with the existing models in the hope that the ensemble performs better overall.
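
To make the loop concrete, here is a minimal from-scratch sketch of one flavour of boosting: gradient-boosting style with squared error, where each new weak learner (a one-split "stump") is fit to the residuals, i.e. to whatever the current ensemble still gets wrong. It uses synthetic 1D data, and `fit_stump` and `predict_stump` are hypothetical helpers; CatBoost and XGBoost are far more sophisticated than this.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best splits the residuals (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:            # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return t, left_value, right_value

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

learning_rate = 0.3
pred = np.full_like(y, y.mean())           # start from a very weak model: the mean
errors = [np.mean((y - pred) ** 2)]
for _ in range(50):
    residual = y - pred                    # where the current ensemble is still wrong
    stump = fit_stump(x, residual)
    pred = pred + learning_rate * predict_stump(stump, x)
    errors.append(np.mean((y - pred) ** 2))

print(errors[0], errors[-1])
```

Each round only nudges the prediction by a fraction of the stump's output (the learning rate), which is the same trade-off the `learning_rate` and `iterations` parameters control in the libraries we use below.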

For our model we chose CatBoost, an open-source gradient boosting library from Yandex.

One reason for using CatBoost is that it handles categorical features automatically: we can use CatBoost without any explicit pre-processing to convert categories into numbers. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and combinations of categorical and numerical features.

It also reduces the need for extensive hyper-parameter tuning and lowers the chance of overfitting, which leads to more generalized models.

We use R2 as the statistic that will give some information about how good our model is. In regression, the R2 coefficient of determination shows how well the regression predictions approximate the real data points. 
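
For reference, R2 is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A small worked example with made-up numbers:

```python
import numpy as np

y_true = np.array([1.0, 0.5, 0.0])
y_pred = np.array([0.8, 0.5, 0.3])

ss_res = np.sum((y_true - y_pred) ** 2)          # 0.04 + 0.00 + 0.09 = 0.13
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 0.25 + 0.00 + 0.25 = 0.50
r2 = 1 - ss_res / ss_tot
print(r2)
```

An R2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the true values.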

In [None]:
from catboost import CatBoostRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import cross_val_score,KFold
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from  sklearn.model_selection import RandomizedSearchCV,train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone

In [None]:
cb = CatBoostRegressor(iterations=500, learning_rate=0.05, 
                       loss_function='MAE',eval_metric='MAE', 
                       depth = 15, use_best_model=True, od_type="Iter", 
                       od_wait=20)

In [None]:
xgb = XGBRegressor(learning_rate=.03,min_child_weight=4,
                 max_depth=10,subsample=.4,
                 n_estimators=500,n_jobs=-1)

In [None]:
GBoost = GradientBoostingRegressor(n_estimators=600, learning_rate=0.05,
                                   max_depth=4, max_features=None,  # 'auto' is no longer accepted by newer scikit-learn; None uses all features
                                   min_samples_leaf=15, min_samples_split=5, 
                                   loss='huber', random_state =5)

In [None]:
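y = train['winPlacePerc']
train = train.drop(columns=['winPlacePerc', 'Id', 'groupId', 'matchId'])
train = pd.get_dummies(train)

# Three-way split: base models learn on x_train, the meta-model learns on
# base-model predictions for x_val, and x_test is held out for the final score.
x_rest, x_test, y_rest, y_test = train_test_split(train, y, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.25)

xgb.fit(x_train, y_train)
# use_best_model=True and the overfitting detector need an eval_set
cb.fit(x_train, y_train, eval_set=(x_val, y_val))
GBoost.fit(x_train, y_train)

def stack(X):
    # One column of predictions per base model
    return np.column_stack((xgb.predict(X), cb.predict(X), GBoost.predict(X)))

meta_model = LinearRegression()
meta_model.fit(stack(x_val), y_val)
final = meta_model.predict(stack(x_test))

print('r2 score is:', r2_score(y_test, final))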


An R2 score of 0.9293, that's pretty good!