# Background

PlayerUnknown's Battlegrounds (PUBG) is a game where 100 players drop onto a deserted island alone, with a partner, or with three others and seek to be the final one(s) standing. The goal is to predict how well an individual will place (the winPlacePerc target) based on a variety of statistics.

The features, as taken from the competition's data page, are:
* **DBNOs** - Number of enemy players knocked.
* **assists** - Number of enemy players this player damaged that were killed by teammates.
* **boosts** - Number of boost items used.
* **damageDealt** - Total damage dealt. Note: Self inflicted damage is subtracted.
* **headshotKills** - Number of enemy players killed with headshots.
* **heals** - Number of healing items used.
* **Id** - Player’s Id
* **killPlace** - Ranking in match of number of enemy players killed.
* **killPoints** - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
* **killStreaks** - Max number of enemy players killed in a short amount of time.
* **kills** - Number of enemy players killed.
* **longestKill** - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
* **matchDuration** - Duration of match in seconds.
* **matchId** - ID to identify match. There are no matches that are in both the training and testing set.
* **matchType** - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
* **rankPoints** - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
* **revives** - Number of times this player revived teammates.
* **rideDistance** - Total distance traveled in vehicles measured in meters.
* **roadKills** - Number of kills while in a vehicle.
* **swimDistance** - Total distance traveled by swimming measured in meters.
* **teamKills** - Number of times this player killed a teammate.
* **vehicleDestroys** - Number of vehicles destroyed.
* **walkDistance** - Total distance traveled on foot measured in meters.
* **weaponsAcquired** - Number of weapons picked up.
* **winPoints** - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
* **groupId** - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
* **numGroups** - Number of groups we have data for in the match.
* **maxPlace** - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
* **winPlacePerc** - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
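
The winPlacePerc description can be made concrete with a small sketch. The formula below, `(maxPlace - place) / (maxPlace - 1)`, is an assumption: it is one mapping consistent with "1 is 1st place, 0 is last place, calculated off of maxPlace", not something the data page confirms. The `place` argument and the function name are hypothetical.

```python
import numpy as np

def win_place_perc(place, max_place):
    """Map a finishing place (1 = best) to a 0-1 percentile.

    Assumed formula: (max_place - place) / (max_place - 1).
    """
    place = np.asarray(place, dtype=float)
    max_place = np.asarray(max_place, dtype=float)
    denom = max_place - 1
    # A match with a single placement has no spread, so return NaN there
    with np.errstate(divide="ignore", invalid="ignore"):
        perc = np.where(denom > 0, (max_place - place) / denom, np.nan)
    return perc

# First, middle, and last place in a match with maxPlace = 100
example = win_place_perc([1, 50, 100], [100, 100, 100])
print(example)
```

Because the denominator is maxPlace rather than numGroups, a match with skipped placements can still span the full 0 to 1 range.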

I will be breaking this up into a three-part series. Part 1 (this notebook) investigates the features individually, Part 2 will look at the groups, and Part 3 will cover the modeling and predictions.

In [None]:
# Import necessary libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import os
import warnings

warnings.filterwarnings("ignore")
plt.rcParams["figure.figsize"] = (18,8)
sns.set(rc={'figure.figsize':(18,8)})

In [None]:
data = pd.read_csv("../input/train_V2.csv")
print("Done loading the data")
# Keep an untouched copy of the original dataset in case I need it
data2 = data.copy()
print("Done copying the data")

# Exploration

In [None]:
data.shape

So we have 28 features and 1 target.

In [None]:
# Let's get some information about the data
data.info()

In [None]:
# A look into data
data.head()

I want to examine this data first with each row as an individual observation, and then with the rows grouped together by matchType.

However, before I continue I want to get rid of rankPoints, since the notes on the dataset say this feature is inconsistent and is going to be deprecated in future versions.
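
One caveat from the feature list: a 0 in killPoints or winPoints means "None" whenever rankPoints is not -1, and dropping rankPoints removes the only way to tell those zeros apart. A sketch of masking them first follows. This is not what the notebook does; the helper name and the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd

def mask_missing_points(df):
    """Return a copy where sentinel zeros in killPoints/winPoints become NaN."""
    out = df.copy()
    # rankPoints == -1 means "no rankPoints", so the other two columns are real
    has_rank = out["rankPoints"] != -1
    for col in ["killPoints", "winPoints"]:
        out[col] = out[col].astype(float).mask(has_rank & (out[col] == 0))
    return out

# Made-up rows: the first has real killPoints/winPoints, the others do not
toy = pd.DataFrame({
    "rankPoints": [-1, 1500, 1480],
    "killPoints": [1200, 0, 0],
    "winPoints": [1450, 0, 0],
})
masked = mask_missing_points(toy)
print(masked)
```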

In [None]:
data.drop(columns=['rankPoints'], inplace=True)

In [None]:
# Now check for missing values
data.isnull().values.any()

In [None]:
data.isnull().sum()

In [None]:
data.dropna(inplace=True)
data.isnull().values.any()

In [None]:
# A more detailed look into the data
data.describe()

Just eyeballing the data, I think it is interesting that most of the kill counts seem to be centered around 0. I want to explore that in more detail, as well as walkDistance, heals, boosts, weaponsAcquired, damageDealt, and matchDuration.
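
How strongly the kills pile up at 0 can also be checked numerically rather than by eye. A minimal sketch on a made-up `kills` column (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up sample; in the notebook this would be data['kills']
kills = pd.Series([0, 0, 0, 1, 0, 2, 0, 5, 1, 0], name="kills")

# Fraction of players at each kill count, ordered by kill count
kill_share = kills.value_counts(normalize=True).sort_index()
zero_share = kill_share.loc[0]
print(f"Share of players with 0 kills: {zero_share:.0%}")
```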

## Kills

In [None]:
# Pass the column by keyword; newer seaborn versions no longer accept it positionally
sns.countplot(x='kills', data=data).set_title("Kills");

In [None]:
sns.lineplot(x="kills", y='killPoints', data=data);

In [None]:
sns.lineplot(x="kills", y='winPlacePerc', data=data);

Based on this, if you have over 50 kills it seems likely that you are going to win. However, it is extremely unlikely that anyone gets that many kills (out of the 100 players who started) without using some cheat, especially if the other players are somewhat skilled.
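
Rows like these can be isolated for a closer look before modeling. A sketch that flags players above a kill threshold: the cutoff of 50 comes from the plot reading above, while the helper name and the toy rows are hypothetical. Whether such rows really come from cheating cannot be confirmed from the data alone.

```python
import pandas as pd

def flag_high_kills(df, threshold=50):
    """Return the rows whose kill count exceeds the threshold."""
    return df[df["kills"] > threshold]

# Made-up rows for illustration
toy = pd.DataFrame({
    "Id": ["a", "b", "c", "d"],
    "kills": [0, 3, 55, 72],
    "winPlacePerc": [0.20, 0.65, 1.00, 1.00],
})
suspects = flag_high_kills(toy)
print(suspects)
```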

I want to draw the same graphs as above, but limited to players with 0 kills, to see if there is anything interesting there.

In [None]:
# Filter first, then copy, so only the zero-kill rows are duplicated
zero_kills = data[data['kills'] == 0].copy()
# Scatter plot instead of lineplot since line is hard to see 
sns.scatterplot(x='kills', y='killPoints', data=zero_kills);

In [None]:
# Scatter plot for the same reason as in the previous cell
sns.scatterplot(x="kills", y='winPlacePerc', data=zero_kills);

In [None]:
sns.lineplot(x="killPlace", y='winPlacePerc', data=zero_kills);

Based on this I definitely believe I need to use group statistics instead of individual statistics. I believe this is because good luck, or one or two competent teammates, makes you more likely to win. This is suggested by the fact that there are many OK players (0 kills) and a few really good, consistent players. However, I first want to look into the rest of the individual features I mentioned above before moving on to group stats, since there may be some interesting patterns lurking about.
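
A sketch of what group statistics could look like: aggregate a feature over (matchId, groupId) and broadcast the result back onto each member's row. The column choice, the new column names, and the toy rows are assumptions for illustration; Part 2 may do this differently.

```python
import pandas as pd

# Made-up match with two duos; g1 wins, g2 finishes last
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m1"],
    "groupId": ["g1", "g1", "g2", "g2"],
    "kills": [0, 4, 0, 1],
    "winPlacePerc": [1.0, 1.0, 0.0, 0.0],
})

group_kills = toy.groupby(["matchId", "groupId"])["kills"]
# transform returns one value per original row, so it aligns with toy
toy = toy.assign(
    groupKillsSum=group_kills.transform("sum"),
    groupKillsMax=group_kills.transform("max"),
)
print(toy)
```

In this toy match the zero-kill player in g1 shares first place with a teammate who had 4 kills, which is the effect the individual kill count cannot capture.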

## Walk Distance

In [None]:
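# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(data['walkDistance'], bins=50, kde=True, stat='density', color='sandybrown');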

In [None]:

sns.lineplot(x="walkDistance", y='winPlacePerc', data=data, color='sandybrown');

## Heals and Boosts

In [None]:
sns.jointplot(x="heals", y="winPlacePerc",  data=data, height = 12, ratio = 4, color='seagreen');

In [None]:
sns.jointplot(x="boosts", y="winPlacePerc",  data=data, height = 12, ratio = 4, color='seagreen');

## Weapons Acquired

In [None]:
sns.jointplot(x="weaponsAcquired", y="winPlacePerc",  data=data, height = 10, ratio = 4, color='orchid');

## Damage Dealt

In [None]:
sns.lineplot(x="damageDealt", y='winPlacePerc', data=data, color='darkgreen');

## Match Duration

In [None]:
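# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(data['matchDuration'], bins=50, kde=True, stat='density', color='darkgreen');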

# Created Features 

In [None]:
data['killsPerMeter'] = data['kills'] / data['walkDistance']
# Division by zero gives inf (x/0) or NaN (0/0); treat both as 0.
# Assign the result back instead of using inplace on a column, which newer pandas may not propagate
data['killsPerMeter'] = data['killsPerMeter'].replace([np.inf, -np.inf], np.nan).fillna(0)

In [None]:
data['healsPerMeter'] = data['heals'] / data['walkDistance']
data['healsPerMeter'] = data['healsPerMeter'].replace([np.inf, -np.inf], np.nan).fillna(0)

In [None]:
data['killsPerHeal'] = data['kills'] / data['heals']
data['killsPerHeal'] = data['killsPerHeal'].replace([np.inf, -np.inf], np.nan).fillna(0)

In [None]:
data['killsPerSecond'] = data['kills'] / data['matchDuration']
data['killsPerSecond'] = data['killsPerSecond'].replace([np.inf, -np.inf], np.nan).fillna(0)

In [None]:
data['TotalHealsPerTotalDistance'] = (data['boosts'] + data['heals']) / (data['walkDistance'] + data['rideDistance'] + data['swimDistance'])
data['TotalHealsPerTotalDistance'] = data['TotalHealsPerTotalDistance'].replace([np.inf, -np.inf], np.nan).fillna(0)

In [None]:
data['killPlacePerMaxPlace'] = data['killPlace'] / data['maxPlace']
data['killPlacePerMaxPlace'] = data['killPlacePerMaxPlace'].replace([np.inf, -np.inf], np.nan).fillna(0)
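
The six ratio features above all follow the same divide-then-clean pattern, which could be folded into one helper. A sketch: `safe_ratio` is a hypothetical name, and mapping undefined ratios to 0 mirrors the choice made in the cells above rather than being the only reasonable one.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio with x/0 (inf) and 0/0 (NaN) mapped to 0."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

# Made-up rows: a normal case, 0/0, and 3/0
toy = pd.DataFrame({"kills": [2, 0, 3], "walkDistance": [100.0, 0.0, 0.0]})
toy["killsPerMeter"] = safe_ratio(toy["kills"], toy["walkDistance"])
print(toy)
```

Note that a player with 3 kills and no walking distance ends up with the same 0 as a player who did nothing, so a separate indicator column for zero distance may be worth considering.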

In [None]:
len(data.columns)

In [None]:
# Check Correlations
f, ax = plt.subplots(figsize=(20,20))
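# Restrict to numeric columns; recent pandas raises on string columns such as Id and matchType
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, linewidths=1, fmt='.2f', ax=ax)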
plt.show()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split

In [None]:
def lin_reg_exp(df):
    """Fit a baseline linear regression, then refit with each feature left out in turn."""
    target = 'winPlacePerc'
    # Right now ignoring categorical variables but will look at them and incorporate soon
    drop = ['Id', 'matchId', 'groupId', 'matchType', target]

    clean = df.dropna()
    X = clean.drop(columns=drop)
    y = clean[target]

    # Split once: with a fixed random_state every pass would get the same split anyway
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=12)

    def fit_and_score(columns):
        # Train on the given feature subset and score it on the held-out rows
        model = LinearRegression()
        model.fit(X_train[columns], y_train)
        y_pred = model.predict(X_test[columns])
        return mae(y_test, y_pred), mse(y_test, y_pred)

    # Take the feature list from the data instead of hard-coding the column count
    features = list(X.columns)
    feature_dropped = ['None']
    scores = [fit_and_score(features)]
    print("First pass done")

    # Leave one feature out at a time, working from the last column to the first
    for feature in reversed(features):
        kept = [f for f in features if f != feature]
        scores.append(fit_and_score(kept))
        feature_dropped.append(feature)
        print(f"Dropped {feature}")

    results = pd.DataFrame({'MAE Error': [s[0] for s in scores],
                            'MSE Error': [s[1] for s in scores],
                            'Dropped Feature': feature_dropped})

    return results

In [None]:
lin_reg_exp(data)
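
One way to read the table returned by `lin_reg_exp` is to compare each row against the all-features baseline: a larger MAE after dropping a feature suggests the linear model relied on it. A sketch with invented numbers; only the column names match the function above, and the `MAE Increase` column is a hypothetical addition.

```python
import pandas as pd

# Invented numbers shaped like the output of lin_reg_exp
results = pd.DataFrame({
    "MAE Error": [0.090, 0.091, 0.120, 0.090],
    "MSE Error": [0.0150, 0.0152, 0.0260, 0.0150],
    "Dropped Feature": ["None", "heals", "walkDistance", "roadKills"],
})

# The row with no feature dropped is the baseline
is_baseline = results["Dropped Feature"] == "None"
baseline_mae = results.loc[is_baseline, "MAE Error"].iloc[0]

# Rank the dropped features by how much the error grew without them
ablation = results[~is_baseline].copy()
ablation["MAE Increase"] = ablation["MAE Error"] - baseline_mae
ablation = ablation.sort_values("MAE Increase", ascending=False)
print(ablation[["Dropped Feature", "MAE Increase"]])
```

With correlated features the increase can understate importance, since another column may carry the same signal.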

In [None]:
# index=False avoids writing the row index as an extra unnamed column
data.to_csv('Training_Data_New.csv', index=False)
