<img src="pubg.jpg">

## Problem Statement

PlayerUnknown’s Battlegrounds (PUBG) has taken the world by storm. 100 players are dropped onto an island empty-handed and must explore, scavenge, and eliminate other players until only one is left standing, all while the play zone continues to shrink.
PUBG has enjoyed massive popularity. With over 50 million copies sold, it's the fifth best-selling game of all time and has millions of monthly active players.
We are pretty sure that all of you have shown some great skills playing PUBG. Now it’s time for some action outside “The Blue Circle”, this time with the power of Machine Learning.

The task is to predict the number of kills made by a player from the player's other attributes, such as survival time, team size, assists, and walking and riding distance, given in the dataset ‘pubg_kills.csv’.

## Data Description

The given dataset has the following variables:

* match_id: The unique match id.
* date: The date and time the match took place
* game_size: The total number of teams that were in the game
* match_mode: whether the game was played in first-person (FPP) or third-person (TPP)
* party_size: The maximum number of players per team, e.g. 2 implies it was a duo.
* player_name: Name of the player
* team_id: The team id that the player belonged to
* team_placement: The final rank of the team within the match
* player_dbno: Number of knockdowns the player has scored
* player_assists: Number of assists the player has scored
* player_dmg: Total hit points of damage that the player has dealt
* player_dist_ride: Total distance that the player has traveled in a vehicle
* player_dist_walk: Total distance that the player has traveled on foot
* player_survive_time: How long the player survived in the match
* player_kills: Number of kills the player has scored ⇒ <b>To be predicted</b>


## Let's Start the Data Science Game

### Importing Standard Libraries

In [None]:
import numpy as np  # Library for array processing , Linear algebra
import pandas as pd  # Library for data processing, data manipulation
import matplotlib.pyplot as plt  # Library for data visualisation
import seaborn as sns  # Library for different plots

from sklearn.model_selection import train_test_split  # To split data into training and validation data
from sklearn.metrics import mean_squared_error  # Evaluation metric

from subprocess import check_call # for running command line process

sns.set(style="whitegrid", color_codes=True) 
sns.set(font_scale=1)

from IPython.display import display 
pd.options.display.max_columns = None  # To display all columns in the notebook
from IPython.display import Image as PImage # To display all images inside the notebook

# Displaying graphs in the notebook itself
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')  # Doesn't display warnings

### Loading Dataset

In [None]:
### START CODE HERE ###
# Read and store the data in a dataframe 'data' to be used for further processing (1 line of code)
data = pd.read_csv("pubg_kills.csv")
### END CODE HERE ###

In [None]:
# Display first five rows of the dataset
data.head()

In [None]:
# Similarly, data.tail() shows the last five rows of the data

### START CODE HERE ###
# Display the last five rows of the data (1 line of code)
data.tail()
### END CODE HERE ###

In [None]:
# Dimensions of the data
# Number of rows, Number of columns(features)
print(data.shape)

In [None]:
# Print all the columns/features in the data
print(data.columns.tolist())

#### Length of the dataset

In [None]:
#length of dataset
len(data)

## Understanding Pandas DataFrame

In [None]:
#To access a column player_survive_time
data['player_survive_time'].head()

In [None]:
#To access multiple columns
data[['party_size','player_kills']].head(4)

In [None]:
#To access multiple rows
data.iloc[3:6]

## Dealing with the 'date' feature

In [None]:
#to change the date format
data['date'] =  pd.to_datetime(data['date'], format='%Y-%m-%dT%H:%M:%S+0000')

In [None]:
#extracting the weekday from date
data['Day'] = pd.DatetimeIndex(data['date']).weekday
weekday_map = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}

In [None]:
# Extracting hour from time
# creating new variable hour from the time variable 
data['Hour'] = pd.DatetimeIndex(data['date']).hour

In [None]:
# display first three rows of the data
data.head(3)
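The same weekday and hour extraction can also be written with the pandas `.dt` accessor. The following is a minimal sketch on a made-up two-row frame (the timestamps and the `DayName` column are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Toy timestamps in the same layout the notebook parses (values are made up).
toy = pd.DataFrame({"date": ["2018-01-01T13:05:00+0000", "2018-01-06T23:59:59+0000"]})
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%dT%H:%M:%S+0000")

# The .dt accessor exposes the same fields as pd.DatetimeIndex.
toy["Day"] = toy["date"].dt.weekday          # Monday=0 ... Sunday=6
toy["Hour"] = toy["date"].dt.hour            # 0 ... 23
toy["DayName"] = toy["date"].dt.day_name()   # human-readable weekday
print(toy)
```

Keeping `Day` as an integer is convenient for modelling, while a name column like `DayName` is mainly useful for plot labels.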

### Getting Rid of Redundant Variables

In [None]:
del data['date']  # As we have already extracted the useful info i.e. Weekday and Hour
del data['match_mode']  # Because all the matches were played in TPP (Third-Person Perspective) mode
del data['team_id']  # Because we already have match_id and player_name to uniquely identify an instance

## Steps
*  Problem Identification 
*  Hypothesis Generation
*  Variable Identification
*  Univariate Analysis
*  Bivariate Analysis
*  Missing Values
*  Outliers
*  Feature Engineering/Variable Transformation
*  Predictive Modeling
*  Analysing the Model
*  Final Model Selection

## Variable Identification & their datatypes
Identify the predictor and target variables & their data types along with the category of variables

In [None]:
# determining data types of the variable
data.dtypes

#### Normally, numeric columns in pandas are represented as "int32", "float32", "int64", "float64", whereas character columns are represented as "object"

## Univariate Analysis
Analysing the variables one at a time. Let's analyse continuous and categorical variables separately.

### For Continuous Variables : We generally measure the central tendency and spread of the variable, such as mean, median, mode, standard deviation, variance, etc.
* Basic Statistics
* Plotting Histogram
* Plotting Boxplot

In [None]:
# continuous variable analysis
data.describe()

In [None]:
# plot given numerical variable with respect to other variables
cont_vars = ['player_dbno', 'player_dist_walk', 'player_dmg', 'player_kills']
sns.pairplot(data[cont_vars])

In [None]:
#Plotting histogram for 'player_kills' variable
# sns.histplot replaces the deprecated sns.distplot (requires seaborn >= 0.11)
sns.histplot(data['player_kills'], color="purple", kde=False)
plt.title("Distribution of Number of Kills")
plt.ylabel("Number of Occurrences")
plt.xlabel("Number of Kills");

In [None]:
#frequency of each value in weekday column
weekday_map = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
dict(data.Day.value_counts())

In [None]:
# Plotting the frequency of each weekday (counts computed from the data instead of hard-coded)
day_counts = data['Day'].value_counts().sort_index()
names = [weekday_map[day][:3] for day in day_counts.index]  # 'Mon', 'Tue', ...
values = day_counts.values

fig, axs = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
axs[0].bar(names, values)
axs[1].plot(names, values)
fig.suptitle('Categorical Plotting')

In [None]:
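#for more information -> https://chartio.com/resources/tutorials/what-is-a-box-plot/
# Pass the column as a keyword argument; recent seaborn versions no longer accept it positionally
sns.boxplot(x="game_size", data=data, showfliers=False)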
plt.title("Distribution of game_size")
plt.xlabel("Number of Teams in Game");

### For categorical variables: We generally measure the frequency of categories appearing in a particular categorical variable
* Count/Frequency Table
* Plotting Stacked Bar Graph

In [None]:
# selecting categorical variables from the data
categorical_variables = ['party_size', 'Day', 'Hour']

In [None]:
#print categorical variables
print(categorical_variables)

In [None]:
# unique values count in each categorical variable
data[categorical_variables].apply(lambda x: len(x.unique()))

In [None]:
#frequency count of each categorical variable
for var in categorical_variables:
    print(var)
    print(data[var].value_counts())
    print('\n')

In [None]:
#display in pie chart
# Take labels and sizes from the same value_counts() result so they stay aligned
party_counts = data['party_size'].value_counts()
labels = party_counts.index
sizes = party_counts.values
explode=[0.1,0,0]
percent = 100.*sizes/sizes.sum()
labels = ['{0} - {1:1.1f} %'.format(i,j) for i,j in zip(labels, percent)]

colors = ['yellowgreen', 'gold', 'lightblue']
patches, texts= plt.pie(sizes, colors=colors,explode=explode,
                        shadow=True,startangle=90)
plt.legend(patches, labels, loc="best")

plt.title("Party Size Classification")
plt.show()

## Bivariate Analysis
Bivariate analysis is used to find out the relationship between any 2 variables. It can be done for any combination of variables. The combinations are: 
* Continuous & Continuous
* Categorical & Continuous
* Categorical & Categorical

### Continuous & Continuous
Scatter Plots are used

In [None]:
#scatter plot (square root of damage on the x-axis to spread out small values)
plt.scatter(np.sqrt(data["player_dmg"]), data["player_dbno"])
# to display title above the plot
plt.title("Hit Points Dealt vs Down But Not Out")
# to label y-axis
plt.ylabel("No. of DBNOs")
# to label x-axis
plt.xlabel("Square Root of Hit Points Dealt by the Player");

In [None]:
# correlation between variables 
# heat map
corrMatrix = data[["game_size", "player_assists", "player_dbno",
                   "player_dist_ride", "player_dist_walk", "player_dmg",
                   "player_survive_time", "team_placement", "player_kills"]].corr()

sns.set(font_scale=1.10)
plt.figure(figsize=(9, 9))

sns.heatmap(corrMatrix, vmax=.8, linewidths=0.01,
            square=True,annot=True,cmap='viridis',linecolor="white")
plt.title('Correlation between features');

#### +1 : perfect positive correlation ; -1 : perfect negative correlation ; 0 : no linear correlation
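To make the three reference values concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The helper `pearson_r` and the toy numbers are illustrative assumptions; `DataFrame.corr()` uses this same Pearson coefficient by default.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

damage = [0, 100, 200, 300]   # made-up values
kills_up = [0, 1, 2, 3]       # rises with damage
kills_down = [3, 2, 1, 0]     # falls with damage
kills_flat = [1, 2, 2, 1]     # no linear relationship with damage

print(pearson_r(damage, kills_up))    # perfect positive correlation
print(pearson_r(damage, kills_down))  # perfect negative correlation
print(pearson_r(damage, kills_flat))  # no linear correlation
```

Note that a coefficient of 0 only rules out a linear relationship; `kills_flat` still depends on `damage` in a non-linear way.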

### Categorical & Continuous
Boxplots can be used

In [None]:
# sns.boxplot(x=..., y=..., data=..., argument to hide outliers)
sns.boxplot(x="party_size", y="player_survive_time", data=data, showfliers=False)
# title for the plot
plt.title("Survival Time vs Team Size")
plt.ylabel("Survival Time")
plt.xlabel("Team Size");

### Categorical and categorical
Crosstable and stacked bar plots are used

In [None]:
crosstable = pd.crosstab(data.Day, data.party_size)

In [None]:
crosstable

In [None]:
# Plotting stacked bar plot
crosstable.plot(kind='bar', stacked=True)

## Missing Values

In [None]:
# Detecting missing values
data.isnull().sum()


### Treating missing values:
* For continuous variables, impute with the mean
* For categorical variables, impute with the mode
* For better results, predict the missing values of a variable by treating it as the target variable
* If only a few values are missing, we can delete the observations that contain them
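The first, second, and fourth strategies can be sketched on a toy frame as follows. The frame and its values are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "player_dmg": [100.0, np.nan, 300.0, 200.0],  # continuous variable
    "party_size": [2, 4, np.nan, 4],              # categorical variable
})

# Continuous variable: fill gaps with the column mean.
toy["player_dmg"] = toy["player_dmg"].fillna(toy["player_dmg"].mean())

# Categorical variable: fill gaps with the most frequent value (the mode).
toy["party_size"] = toy["party_size"].fillna(toy["party_size"].mode()[0])

# Few missing values: simply drop the affected rows.
few_missing = pd.DataFrame({"a": [1.0, np.nan, 3.0]}).dropna()

print(toy)
print(few_missing)
```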


## Outliers
Outliers are data points that show out-of-the-box behaviour, i.e. that lie far away from the overall trend.

In [None]:
#box plot
sns.boxplot(x="player_survive_time", data=data, showfliers=True)
plt.title("Distribution of Survival Time")
plt.xlabel("Survival Time");

In [None]:
# Treating outliers
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
Q1 = data['player_survive_time'].quantile(.25)
Q3 = data['player_survive_time'].quantile(.75)
IQR = Q3 - Q1
lower_value = Q1 - 1.5 * IQR
upper_value = Q3 + 1.5 * IQR

In [None]:
# print range lower_value and upper_value
lower_value, upper_value

In [None]:
# Replace outliers with the median value of the column
survive_time_median = data['player_survive_time'].median()  # computed once, not once per row

def outlier_imputer(x):
    if x < lower_value or x > upper_value:
        return survive_time_median
    else:
        return x

In [None]:
result = data['player_survive_time'].apply(outlier_imputer)  # Applies the imputer to every value

In [None]:
sns.boxplot(x=result, showfliers=True)
plt.title("Distribution of Survival Time")
plt.xlabel("Survival Time");
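As a compact alternative to the row-by-row function, the sketch below applies the usual 1.5 × IQR rule in a vectorised way. The toy series and variable names are illustrative assumptions; the fences are Q1 − 1.5·IQR and Q3 + 1.5·IQR.

```python
import pandas as pd

# Made-up survival times with one obvious outlier.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of points outside the fences.
is_outlier = (s < lower) | (s > upper)

# Keep in-range values, replace outliers with the median.
cleaned = s.where(~is_outlier, s.median())
print(lower, upper)
print(cleaned.tolist())
```

`Series.where` works on the whole column at once, so it avoids calling a Python function for each row.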

# Building the First Model

#### After tightening our seat belts, it's time to take off

In [None]:
#dependent_variable -> the variable we are going to predict
#independent_variable -> the variables that help in predicting dependent_variable
dependent_variable = 'player_kills'
independent_variable = ['game_size', 'party_size', 'player_assists', 'player_dbno', 'player_dist_ride', 'Hour', 
                        'player_dist_walk', 'player_dmg', 'player_survive_time', 'team_placement', 'Day']

In [None]:
independent_variable

### Splitting our data into training and testing (validation) data

In [None]:
#library to split data
from sklearn.model_selection import train_test_split

In [None]:
train, test = train_test_split(data, test_size=.2, shuffle=True, random_state=42)

In [None]:
train.head()

In [None]:
print(len(data))
print(len(train))
print(len(test))

In [None]:
# Baseline: predict the rounded mean number of kills for every player
np.round(train['player_kills'].mean())  # train['player_kills'].mean() = 0.887

In [None]:
test['prediction'] = 1.0

In [None]:
test.head()

In [None]:
# Analysing the prediction
from sklearn.metrics import mean_squared_error

In [None]:
# mean_squared_error expects (y_true, y_pred)
RMSE = np.sqrt(mean_squared_error(test[dependent_variable], test['prediction']))
np.round(RMSE, 3)  # RMSE = 1.616
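For reference, RMSE is the square root of the mean of the squared prediction errors. A minimal sketch on made-up numbers, using a constant prediction of 1 kill as in the baseline:

```python
import numpy as np

y_true = np.array([0, 1, 3, 0, 2])               # made-up kill counts
y_pred = np.full_like(y_true, 1, dtype=float)    # constant baseline prediction

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))
print(rmse)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.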

# Building Machine Learning Model

# Linear Regression

## Simple Linear Regression

Simple linear regression is an approach for predicting a **quantitative response** using a **single feature** (or "predictor" or "input variable"). It takes the following form:

$y = \beta_0 + \beta_1x$

What does each term represent?
- $y$ is the response
- $x$ is the feature
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for x

Together, $\beta_0$ and $\beta_1$ are called the **model coefficients**. To create the model, we must "learn" the values of these coefficients. Once we've learned them, we can use the model to predict the number of kills!

## Estimating ("Learning") Model Coefficients

Generally speaking, coefficients are estimated using the **least squares criterion**, which means we find the line (mathematically) that minimizes the **sum of squared residuals** (or "sum of squared errors"):

<img src="08_estimating_coefficients.png">

What elements are present in the diagram?
- The black dots are the **observed values** of x and y.
- The blue line is our **least squares line**.
- The red lines are the **residuals**, which are the distances between the observed values and the least squares line.

How do the model coefficients relate to the least squares line?
- $\beta_0$ is the **intercept** (the value of $y$ when $x$=0)
- $\beta_1$ is the **slope** (the change in $y$ divided by the change in $x$)
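For a single feature, the least squares intercept and slope have a closed form: $\beta_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. The sketch below evaluates these on made-up numbers and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# Made-up observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Closed-form least squares estimates for one feature.
beta_1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
beta_0 = y.mean() - beta_1 * x.mean()

# Residuals: observed values minus the values on the fitted line.
residuals = y - (beta_0 + beta_1 * x)
sum_squared_residuals = (residuals ** 2).sum()

# Cross-check with NumPy's degree-1 polynomial fit (returns slope, then intercept).
slope_check, intercept_check = np.polyfit(x, y, deg=1)
print(beta_0, beta_1, sum_squared_residuals)
```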

Here is a graphical depiction of those calculations:

<img src="08_slope_intercept.png">

### Using Linear Regression Algorithm

In [None]:
# Importing machine learning library
from sklearn.linear_model import LinearRegression

In [None]:
# Creating machine learning model
model1 = LinearRegression()

In [None]:
# Training our model
model1.fit(train[independent_variable], train[dependent_variable])

In [None]:
# Get coefficients
model1.coef_

In [None]:
# Get intercept
model1.intercept_

In [None]:
# Predicting on test data
prediction = model1.predict(test[independent_variable])

#### Analysing our model

In [None]:
# RMSE on the training dataset (lower is better)
np.sqrt(mean_squared_error(model1.predict(train[independent_variable]), train[dependent_variable]))

In [None]:
# RMSE on the testing dataset (lower is better)
np.sqrt(mean_squared_error(model1.predict(test[independent_variable]), test[dependent_variable]))

# Introduction to Decision Trees


||continuous|categorical|
|---|---|---|
|**supervised**|**regression**|**classification**|
|**unsupervised**|dimension reduction|clustering|

## Regression trees

Let's look at a simple example to motivate our learning.

Our goal is to **predict a baseball player's Salary** based on **Years** (number of years playing in the major leagues) and **Hits** (number of hits he made in the previous year). Here is the training data, represented visually (low salary is blue/green, high salary is red/yellow):

<img src="15_salary_color.png">

**How might you "stratify" or "segment" the feature space into regions, based on salary?** Intuitively, you want to **maximize** the similarity (or "homogeneity") within a given region, and **minimize** the similarity between different regions.

Below is a regression tree that has been fit to the data by a computer. (We will talk later about how the fitting algorithm actually works.) Note that  Salary is measured in thousands and has been log-transformed.

<img src="15_salary_tree.png">

**How do we make Salary predictions (for out-of-sample data) using a decision tree?**

- Start at the top, and examine the first "splitting rule" (Years < 4.5).
- If the rule is True for a given player, follow the left branch. If the rule is False, follow the right branch.
- Continue until reaching the bottom. The predicted Salary is the number in that particular "bucket".
- *Side note:* Years and Hits are both integers, but the convention is to label these rules using the midpoint between adjacent values.

Example predictions:

- Years=3, then predict 5.11 ($\$1000 \times e^{5.11} \approx \$166000$)
- Years=5 and Hits=100, then predict 6.00 ($\$1000 \times e^{6.00} \approx \$403000$)
- Years=8 and Hits=120, then predict 6.74 ($\$1000 \times e^{6.74} \approx \$846000$)
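The traversal can be written as a pair of nested rules. In this sketch the Hits cutpoint of 117.5 is an assumption (the text above only states the Years rule), and the function names are hypothetical; the three leaf values are the ones quoted in the examples.

```python
import math

def predict_log_salary(years, hits, hits_cutpoint=117.5):
    """Walk the three-leaf tree. hits_cutpoint is an assumed value for illustration."""
    if years < 4.5:           # first splitting rule: left branch
        return 5.11
    if hits < hits_cutpoint:  # second splitting rule, only reached when Years >= 4.5
        return 6.00
    return 6.74

def to_dollars(log_salary):
    """Undo the log transform; Salary is measured in thousands of dollars."""
    return 1000 * math.exp(log_salary)

for years, hits in [(3, 50), (5, 100), (8, 120)]:
    log_salary = predict_log_salary(years, hits)
    print(years, hits, log_salary, round(to_dollars(log_salary), -3))
```

For a player with fewer than 4.5 Years, the Hits value is never inspected, which is why the first example does not need to specify it.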

**How did we come up with the numbers at the bottom of the tree?** Each number is just the **mean Salary in the training data** of players who fit those criteria. Here's the same diagram as before, split into the three regions:

<img src="15_salary_regions.png">

This diagram is essentially a combination of the two previous diagrams (except that the observations are no longer color-coded). In $R_1$, the mean log Salary was 5.11. In $R_2$, the mean log Salary was 6.00. In $R_3$, the mean log Salary was 6.74. Thus, those values are used to predict out-of-sample data.

Let's introduce some terminology:

<img src="15_salary_tree_annotated.png">

**How might you interpret the "meaning" of this tree?**

- Years is the most important factor determining Salary, with a lower number of Years corresponding to a lower Salary.
- For a player with a lower number of Years, Hits is not an important factor determining Salary.
- For a player with a higher number of Years, Hits is an important factor determining Salary, with a greater number of Hits corresponding to a higher Salary.

What we have seen so far hints at the advantages and disadvantages of decision trees:

**Advantages:**

- Highly interpretable
- Can be displayed graphically
- Prediction is fast

**Disadvantages:**

- Predictive accuracy is not as high as some supervised learning methods
- Can easily overfit the training data (high variance)

## How does a computer build a regression tree?

The ideal approach would be for the computer to consider every possible partition of the feature space. However, this is computationally infeasible, so instead an approach is used called **recursive binary splitting:**

- Begin at the top of the tree.
- For every single predictor, examine every possible cutpoint, and choose the predictor and cutpoint such that the resulting tree has the **lowest possible mean squared error (MSE)**. Make that split.
- Repeat the examination for the two resulting regions, and again make a single split (in one of the regions) to minimize the MSE.
- Keep repeating this process until a stopping criterion is met.
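A single step of this search, restricted to one predictor, might look like the sketch below. The helper `best_split` and the toy data are illustrative assumptions; it minimizes the summed squared error of the two regions, which is equivalent to minimizing the MSE because the number of samples is fixed.

```python
import numpy as np

def best_split(x, y):
    """Return (cutpoint, sse) for the one split of x that minimizes squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_cut, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no cutpoint exists between identical values
        left, right = y[:i], y[i:]
        # Each region predicts its own mean; add up the squared errors.
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Label the rule with the midpoint between adjacent values.
            best_cut, best_sse = (x[i - 1] + x[i]) / 2, sse
    return best_cut, best_sse

# Made-up data: log salary jumps between 3 and 6 years.
years = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
log_salary = np.array([5.0, 5.2, 5.1, 6.5, 6.4, 6.6])

cut, sse = best_split(years, log_salary)
print(cut, sse)
```

A full tree builder would run this for every predictor, keep the best (predictor, cutpoint) pair, and then recurse into the two resulting regions.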

**How does it know when to stop?**

1. We could define a stopping criterion, such as a **maximum depth** of the tree or the **minimum number of samples in the leaf**.
2. We could grow the tree deep, and then "prune" it back using a method such as "cost complexity pruning" (aka "weakest link pruning").

Method 2 involves setting a tuning parameter that penalizes the tree for having too many leaves. As the parameter is increased, branches automatically get pruned from the tree, resulting in smaller and smaller trees. The tuning parameter can be selected through cross-validation.

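Note: **Method 2 is available in scikit-learn 0.22 and later** through the `ccp_alpha` parameter of the tree estimators; older versions support only Method 1 (parameters such as `max_depth` and `min_samples_leaf`).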

Here's an example of an **unpruned tree**, and a comparison of the training, test, and cross-validation errors for trees with different numbers of leaves:

<img src="15_salary_unpruned.png">

As you can see, the **training error** continues to go down as the tree size increases, but the lowest **cross-validation error** occurs for a tree with 3 leaves.
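Both stopping methods can be tried with a few lines of scikit-learn. This is a sketch on synthetic data; the parameter values are arbitrary choices for illustration, and `ccp_alpha` assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# No stopping criterion: the tree grows until the leaves are pure.
full = DecisionTreeRegressor(random_state=0).fit(X, y)

# Method 1: stop early with a maximum depth and a minimum leaf size.
limited = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)

# Method 2: grow fully, then prune with a cost complexity penalty.
pruned = DecisionTreeRegressor(ccp_alpha=0.01, random_state=0).fit(X, y)

print(full.get_n_leaves(), limited.get_n_leaves(), pruned.get_n_leaves())
```

In practice the values of `max_depth`, `min_samples_leaf`, or `ccp_alpha` would be chosen by cross-validation rather than fixed by hand.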

## Building a regression tree in scikit-learn

In [None]:
# Importing Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor

In [None]:
model2 = DecisionTreeRegressor()

In [None]:
# Training our model
model2.fit(train[independent_variable], train[dependent_variable])

In [None]:
# Get Predictions
prediction = model2.predict(test[independent_variable])

In [None]:
# RMSE on the testing dataset (lower is better)
np.sqrt(mean_squared_error(prediction, test[dependent_variable]))

In [None]:
# Create a Graphviz file (export_graphviz accepts a file path for out_file)
from sklearn.tree import export_graphviz
export_graphviz(model2, out_file="tree1.dot", feature_names=independent_variable)

# Convert .dot to .png to allow display in the web notebook
# Please install Graphviz before this, e.g. conda install python-graphviz
#check_call(['dot','-Tpng','tree1.dot','-o','tree1.png'])

# Annotating chart with PIL
#from PIL import Image
#img = Image.open("tree1.png")
#img.save('sample-out.png')
#PImage(filename="sample-out.png")

# Introduction to Boosting

### Boosting is an ensemble technique in which the predictors are not made independently, but sequentially.

In this technique, each subsequent predictor learns from the mistakes of the previous predictors. In sampling-based variants, the observations therefore have an unequal probability of appearing in subsequent models, and the ones with the highest error appear most often (so the observations are chosen based on the error, not by the bootstrap process). The predictors can be chosen from a range of models, such as decision trees, regressors, and classifiers.
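One common form of this idea fits each new predictor to the residuals of the current ensemble, which is the squared-error case of gradient boosting. The sketch below is a simplified illustration with one-split trees ("stumps") on synthetic data; the helper names, the number of rounds, and the learning rate are assumptions, and this is not how `GradientBoostingRegressor` is implemented internally.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree: returns (cutpoint, left_mean, right_mean)."""
    best, best_sse = (None, y.mean(), y.mean()), ((y - y.mean()) ** 2).sum()
    values = np.unique(x)
    for cut in (values[:-1] + values[1:]) / 2:  # midpoints between adjacent values
        left, right = y[x < cut], y[x >= cut]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best, best_sse = (cut, left.mean(), right.mean()), sse
    return best

def predict_stump(stump, x):
    cut, left_mean, right_mean = stump
    if cut is None:  # no useful split was found
        return np.full(len(x), left_mean)
    return np.where(x < cut, left_mean, right_mean)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the running prediction."""
    pred = np.full(len(y), y.mean())  # start from the constant baseline
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred                      # mistakes of the ensemble so far
        stump = fit_stump(x, residuals)           # next predictor learns those mistakes
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, pred

# Synthetic data: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.1, size=100)

stumps, pred = boost(x, y)
rmse_baseline = np.sqrt(np.mean((y - y.mean()) ** 2))
rmse_boosted = np.sqrt(np.mean((y - pred) ** 2))
print(rmse_baseline, rmse_boosted)
```

The learning rate shrinks each stump's contribution, so many weak predictors are combined gradually instead of letting any single one dominate.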

<img src="residuals_learning.png">

<img src="residuals_learning2.png">

### Using GradientBoostingRegressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
model3 = GradientBoostingRegressor()

In [None]:
# Training our model
model3.fit(train[independent_variable], train[dependent_variable])

In [None]:
feat_importances = pd.Series(model3.feature_importances_, index=train[independent_variable].columns)
feat_importances.nsmallest(len(independent_variable)).plot(kind='barh')

In [None]:
# Get Predictions
prediction = model3.predict(test[independent_variable])

In [None]:
# RMSE on the testing dataset (lower is better)
np.sqrt(mean_squared_error(prediction, test[dependent_variable]))

## What You Can Try Next on Your Own

We saw that gradient boosting outperformed Linear Regression and Decision Trees by a small margin and clearly surpassed our baseline model. However, a few more things can be tried to lower the RMSE further:

* Hyperparameter tuning, e.g. with Hyperopt.
* Better feature generation.
* Trying ensembles of different models.
* Better feature transformations.

## Where to Go from Here

Here are some resources and blogs that would help one to get started in Data Science and Machine Learning:

* __[DSG Blog about How to Start Data Science](https://medium.com/data-science-group-iitr/stop-thinking-start-learning-cb74629bca3a)__
* __[DSG Medium Handle](https://medium.com/data-science-group-iitr)__
* __[3 Blue 1 Brown](https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw)__
* __[Harvard Data Science Course (CS109)](http://cs109.github.io/2015/pages/videos.html)__
* __[Andrew Ng Machine Learning Course](http://cs229.stanford.edu/)__
* __[Analytics Vidhya](https://www.analyticsvidhya.com/blog/)__
* __[Machine Learning Mastery](https://machinelearningmastery.com/)__
* __[Kaggle (A Competitive Data Science Platform)](https://www.kaggle.com/)__