# Introduction 

![Pubg](https://i.pinimg.com/originals/fa/fe/d5/fafed5fc56723131462f114588efc306.jpg) [[1](#references)]


#### * Game was released at the end of 2017 
#### * Created by Brendan Greene whose online handle was PlayerUnknown
#### * Won numerous awards including Steam's Game of the Year (2018) [[2](#references)]
#### * The 5th best selling game of all time with over 50 million copies sold [[3](#references)]
#### * Has expanded to mobile with over 400 million users (50 million daily users) [[4](#references)]

![Pubg](https://images.mmorpg.com/images/heroes/news/52736.jpg) [[5](#references)]

![Pubg](https://qph.fs.quoracdn.net/main-qimg-da91453430a3e09c90b6131c85265e22) [[6](#references)]

### Problem Statement: Given over 65,000 games' worth of anonymized player data, can one predict final placement from final in-game stats and initial player ratings?

## Data fields [[3](#references)]

* DBNOs - Number of enemy players knocked.
* assists - Number of enemy players this player damaged that were killed by teammates.
* boosts - Number of boost items used.
* damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
* headshotKills - Number of enemy players killed with headshots.
* heals - Number of healing items used.
* Id - Player’s Id
* killPlace - Ranking in match of number of enemy players killed.
* killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
* killStreaks - Max number of enemy players killed in a short amount of time.
* kills - Number of enemy players killed.
* longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
* matchDuration - Duration of match in seconds.
* matchId - ID to identify match. There are no matches that are in both the training and testing set.
* matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
* rankPoints - Elo-like ranking of player. **This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution.** Value of -1 takes place of “None”.
* revives - Number of times this player revived teammates.
* rideDistance - Total distance traveled in vehicles measured in meters.
* roadKills - Number of kills while in a vehicle.
* swimDistance - Total distance traveled by swimming measured in meters.
* teamKills - Number of times this player killed a teammate.
* vehicleDestroys - Number of vehicles destroyed.
* walkDistance - Total distance traveled on foot measured in meters.
* weaponsAcquired - Number of weapons picked up.
* winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
* groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
* numGroups - Number of groups we have data for in the match.
* maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
* winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
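
To make the last field concrete, here is a minimal sketch of how a percentile placement could be computed from `maxPlace`. The dataset page only says the value is calculated from `maxPlace`; the exact formula used here, `(maxPlace - place) / (maxPlace - 1)`, and the helper name `win_place_perc` are assumptions for illustration.

```python
import numpy as np

def win_place_perc(place, max_place):
    """Hypothetical helper: percentile placement from a 1-based finishing place.

    Assumes winPlacePerc = (maxPlace - place) / (maxPlace - 1), which maps
    1st place to 1.0 and last place to 0.0.
    """
    place = np.asarray(place, dtype=float)
    if max_place <= 1:
        # Only one placement exists, so the formula is undefined: return NaN
        return np.full_like(place, np.nan)
    return (max_place - place) / (max_place - 1)

# Toy match where maxPlace is 5 but placement 3 is missing from the data
places = [1, 2, 4, 5]
percs = win_place_perc(places, max_place=5)
print(percs)
```

Under this assumption the values stay evenly spaced on the `maxPlace` grid, so a skipped placement shows up as a gap rather than shifting the other values.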

In [None]:
# Import libraries 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import os
import warnings

%matplotlib inline
warnings.filterwarnings("ignore")

In [None]:
# Set the size of the plots 
plt.rcParams["figure.figsize"] = (18,8)
sns.set(rc={'figure.figsize':(18,8)})

In [None]:
data = pd.read_csv("../input/pubg-finish-placement-prediction/train_V2.csv")
print("Finished loading the data")

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.head()

# Basic Assumptions and Observations

Remove `rankPoints`, since the dataset notes say this ranking is inconsistent and is being deprecated in the next version of the API.

In [None]:
data.drop(columns=['rankPoints'], inplace=True)
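
One caveat: the field notes say that when `rankPoints` is not -1, a 0 in `killPoints` or `winPoints` should be read as "None". That information is lost once `rankPoints` is gone. A minimal sketch of marking those placeholder zeros as missing follows; it runs on made-up toy rows, and on the real data it would need to run before the drop.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up)
toy = pd.DataFrame({
    "rankPoints": [-1, 1500, 1480, -1],
    "killPoints": [1200, 0, 0, 0],
    "winPoints":  [1450, 0, 0, 1500],
})

# Where rankPoints is set (not -1), a 0 in killPoints/winPoints means "None"
has_rank = toy["rankPoints"] != -1
for col in ["killPoints", "winPoints"]:
    toy[col] = toy[col].astype(float).mask(has_rank & (toy[col] == 0))

print(toy)
```

Zeros in rows where `rankPoints` is -1 are left untouched, since the notes only describe the placeholder meaning for rows with a real `rankPoints` value.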

In [None]:
# Check to see what we are dealing with regarding missing and null values 
data.isnull().values.any()

In [None]:
data.isnull().sum()

In [None]:
data.dropna(inplace=True)
data.isnull().values.any()

In [None]:
# Check to see win percentage distribution 
sns.distplot(data['winPlacePerc']).set_title('Distribution of Winning Percentile');

In [None]:
print('Mean: {:.4f}, Median {:.4f}'.format(data['winPlacePerc'].mean(), data['winPlacePerc'].median()))

Outside of the tails, the distribution is roughly uniform, as one would expect for a percentile-based target.

Next, look at the mean and median of `winPlacePerc` within each match (grouped by `matchId`).

In [None]:
data['matchMean'] = data.groupby('matchId')['winPlacePerc'].transform('mean')
data['matchMedian'] = data.groupby('matchId')['winPlacePerc'].transform('median')

In [None]:
sns.distplot(data['matchMean'], kde=False).set_title('Mean for Winning Percentile grouped by match');

In [None]:
sns.distplot(data['matchMedian'], kde=False).set_title('Median for Winning Percentile grouped by match');

In [None]:
# Mean of the per-match means and median of the per-match medians
print('Mean: {:.4f}, Median: {:.4f}'.format(data['matchMean'].mean(), data['matchMedian'].median()))

Let's look into team sizes and match size 

In [None]:
# Can do this with matchType and then derive the team and match size
data['matchType'].unique()

In [None]:
# Pass x by keyword; newer seaborn releases no longer accept it positionally
sns.countplot(x='matchType', data=data);

In [None]:
data['teamSize'] = data.groupby('groupId')['groupId'].transform('count')
data['maxTeamSize'] = data.groupby('matchId')['teamSize'].transform('max')
data['matchSize'] = data.groupby('matchId')['Id'].transform('nunique')

In [None]:
sns.distplot(data['matchSize'], kde=False).set_title('Distribution of Players per Game');

The frequency rises sharply as matches approach 100 players. I wouldn't be surprised if the spread in the earlier per-match plots were due to team imbalance.
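
To illustrate why uneven teams could move a per-match mean away from 0.5, here is a small sketch on toy matches. The helper `match_mean` is hypothetical, and it assumes the percentile formula `(maxPlace - place) / (maxPlace - 1)`, which the dataset notes do not spell out.

```python
import pandas as pd

def match_mean(team_sizes):
    """Per-player mean of winPlacePerc for one toy match.

    team_sizes[i] is the number of players on the team that finished in
    place i + 1. Uses the assumed formula (maxPlace - place) / (maxPlace - 1).
    """
    max_place = len(team_sizes)
    rows = [
        {"place": place, "winPlacePerc": (max_place - place) / (max_place - 1)}
        for place, size in enumerate(team_sizes, start=1)
        for _ in range(size)
    ]
    return pd.DataFrame(rows)["winPlacePerc"].mean()

balanced = match_mean([2, 2, 2, 2, 2])   # equal teams
top_heavy = match_mean([4, 2, 2, 1, 1])  # larger teams placed higher
print(balanced, top_heavy)
```

With equal teams the per-player mean is exactly 0.5. When the larger teams finish higher, more rows carry high percentiles and the mean rises, which is one possible explanation for matches whose mean is not 0.5.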

In [None]:
# Let's see the largest team size
data['maxTeamSize'].max()

In [None]:
sns.distplot(data['teamSize'], kde=False);

It looks like custom games are distorting the statistics.

Let's keep only the six standard match types and drop the less represented modes.

In [None]:
types = ['solo', 'solo-fpp', 'duo', 'duo-fpp', 'squad', 'squad-fpp']
data = data.loc[data['matchType'].isin(types)]
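
Filtering on `matchType` removes event and custom modes, but it may still be worth checking whether any remaining group is larger than its mode should allow. The sketch below uses made-up toy rows; the nominal sizes in `NOMINAL_TEAM_SIZE` (1, 2, and 4) are an assumption and are not stated in the dataset notes.

```python
import pandas as pd

# Assumed nominal team sizes per standard mode (not stated in the dataset notes)
NOMINAL_TEAM_SIZE = {
    "solo": 1, "solo-fpp": 1,
    "duo": 2, "duo-fpp": 2,
    "squad": 4, "squad-fpp": 4,
}

# Toy rows standing in for the real data (values are made up)
toy = pd.DataFrame({
    "groupId":   ["a", "a", "b", "b", "b", "c"],
    "matchType": ["duo", "duo", "duo", "duo", "duo", "solo"],
})
toy["teamSize"] = toy.groupby("groupId")["groupId"].transform("count")
toy["nominalSize"] = toy["matchType"].map(NOMINAL_TEAM_SIZE)

# Groups with more members than the mode should allow
oversized = toy.loc[toy["teamSize"] > toy["nominalSize"], "groupId"].unique()
print(oversized)
```

On the real data, the same comparison could be run against the existing `teamSize` column to see how many oversized groups remain after the filter.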

In [None]:
# Pass x by keyword; newer seaborn releases no longer accept it positionally
sns.countplot(x='matchType', data=data);

In [None]:
sns.distplot(data['matchSize'], kde=False).set_title('Distribution of Players per Game');

In [None]:
data['matchSize'].min()

In [None]:
sns.distplot(data['teamSize'], kde=False);

# Look at a few features

The features I am interested in are:

* Boosts 
* Heals 
* Kills
* Damage Dealt
* Match Duration
* Kill Points
* Win Points

In [None]:
# Also look at top 10% and bottom 10% of players 
top_10 = data[data['winPlacePerc'] >= 0.9]
bottom_10 = data[data['winPlacePerc'] <= 0.1]

## Boosts and Heals 

In [None]:
data['boosts'].unique()

In [None]:
sns.scatterplot(x="boosts", y="winPlacePerc", data=data, color='seagreen');

In [None]:
sns.scatterplot(x="boosts", y="winPlacePerc", data=top_10, color='seagreen');

In [None]:
sns.scatterplot(x="boosts", y="winPlacePerc", data=bottom_10, color='seagreen');

In [None]:
sns.scatterplot(x="heals", y="winPlacePerc", data=data, color='seagreen');

In [None]:
sns.scatterplot(x="heals", y="winPlacePerc", data=top_10, color='seagreen');

In [None]:
sns.scatterplot(x="heals", y="winPlacePerc", data=bottom_10, color='seagreen');

In [None]:
top_10[['boosts', 'heals']].describe()

In [None]:
bottom_10[['boosts', 'heals']].describe()

## Kills

In [None]:
# Count of players by number of kills (x passed by keyword for newer seaborn)
sns.countplot(x='kills', data=data, color='red');

In [None]:
sns.lineplot(x="kills", y='winPlacePerc', data=data, color='red');

In [None]:
sns.scatterplot(x="kills", y="winPlacePerc", data=data, color='red');

In [None]:
sns.scatterplot(x="kills", y="winPlacePerc", data=top_10, color='red');

In [None]:
sns.scatterplot(x="kills", y="winPlacePerc", data=bottom_10, color='red');

In the bottom 10%, it looks like some of the high-kill players hot drop just for the heck of it.

Since most players ended the game with 0 kills, I want to dig into that data to see if there is anything interesting.

In [None]:
# Keep only players with 0 kills; copy so later edits don't touch `data`
zero_kills = data[data['kills'] == 0].copy()

In [None]:
# kills is always 0 here, so this shows the spread of winPlacePerc among zero-kill players
sns.scatterplot(x="kills", y='winPlacePerc', data=zero_kills);

In [None]:
sns.lineplot(x="killPlace", y='winPlacePerc', data=zero_kills);

I want to see whether playing in a group (duos and squads) helps increase placement for players with 0 kills.

In [None]:
data.head()

In [None]:
data[data['groupId'] == '4d4b580de459be'][['matchType', 'kills', 'killPlace', 'winPlacePerc']]

In [None]:
data[data['matchType'] == 'duo-fpp'].head()

In [None]:
data[data['groupId'] == '8e0a0ea95d3596'][['matchType', 'kills', 'killPlace', 'winPlacePerc']]
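
Inspecting individual groups gives a feel for the data, but the question of whether groups help zero-kill players can also be checked in aggregate. The sketch below runs on made-up toy rows; the `teamMode` column is a hypothetical helper that merges each mode with its `-fpp` variant.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up)
toy = pd.DataFrame({
    "matchType":    ["solo", "solo-fpp", "duo", "duo-fpp", "squad", "squad", "squad-fpp"],
    "kills":        [0, 3, 0, 0, 0, 1, 0],
    "winPlacePerc": [0.20, 0.90, 0.40, 0.60, 0.70, 0.80, 0.50],
})

# Collapse the -fpp variants so "solo" and "solo-fpp" count as one mode
toy["teamMode"] = toy["matchType"].str.replace("-fpp", "", regex=False)

# Placement summary for zero-kill players in each mode
zero_kill_summary = (
    toy[toy["kills"] == 0]
    .groupby("teamMode")["winPlacePerc"]
    .agg(["count", "mean", "median"])
)
print(zero_kill_summary)
```

If zero-kill players in duos and squads show a noticeably higher mean or median than in solos on the real data, that would support the idea that teammates carry placement.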

## Damage Dealt

In [None]:
sns.scatterplot(x="damageDealt", y="winPlacePerc", data=data);

In [None]:
sns.scatterplot(x="damageDealt", y="winPlacePerc", data=top_10);

In [None]:
sns.scatterplot(x="damageDealt", y="winPlacePerc", data=bottom_10);

## Match Duration 

In [None]:
sns.scatterplot(x="matchDuration", y="winPlacePerc", data=data, color='yellow');

In [None]:
sns.scatterplot(x="matchDuration", y="winPlacePerc", data=top_10, color='yellow');

In [None]:
sns.scatterplot(x="matchDuration", y="winPlacePerc", data=bottom_10, color='yellow');

## Kill Points

In [None]:
sns.scatterplot(x="killPoints", y="winPlacePerc", data=data, color='orange');

In [None]:
sns.scatterplot(x="killPoints", y="winPlacePerc", data=top_10, color='orange');

In [None]:
sns.scatterplot(x="killPoints", y="winPlacePerc", data=bottom_10, color='orange');

In [None]:
sns.lineplot(x="killPoints", y='kills', data=data, color='orange');

In [None]:
sns.lineplot(x="kills", y='killPoints', data=data, color='orange');

## Win Points

In [None]:
sns.lineplot(x="winPoints", y='winPlacePerc', data=data, color='brown');

In [None]:
sns.scatterplot(x="winPoints", y="winPlacePerc", data=data, color='brown');

In [None]:
sns.scatterplot(x="winPoints", y="winPlacePerc", data=top_10, color='brown');

In [None]:
sns.scatterplot(x="winPoints", y="winPlacePerc", data=bottom_10, color='brown');

<a id='references'></a>

# References

[1] https://i.pinimg.com/originals/fa/fe/d5/fafed5fc56723131462f114588efc306.jpg

[2] https://www.pcgamer.com/pubg-claims-game-of-the-year-in-the-2018-steam-awards/

[3] https://www.kaggle.com/c/pubg-finish-placement-prediction/data

[4] https://www.pcgamesn.com/pubg-mobile-player-count

[5] https://images.mmorpg.com/images/heroes/news/52736.jpg

[6] https://qph.fs.quoracdn.net/main-qimg-da91453430a3e09c90b6131c85265e22

[\*] https://www.kaggle.com/gemartin/load-data-reduce-memory-usage

\* Used throughout other notebooks