# PUBG Exploratory Data Analysis (EDA)

**Task**

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.

You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.


# Understanding the dataset

**Data Set Information**

**Features**

|Feature|Description|
|-----|-----|
|DBNOs|Number of enemy players knocked.|
|assists|Number of enemy players this player damaged that were killed by teammates.|
|boosts|Number of boost items used.|
|damageDealt|Total damage dealt. Note: Self inflicted damage is subtracted.|
|headshotKills|Number of enemy players killed with headshots.|
|heals|Number of healing items used.|
|Id|Player’s Id|
|killPlace|Ranking in match of number of enemy players killed.|
|killPoints|Kills-based external ranking of players. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.|
|killStreaks|Max number of enemy players killed in a short amount of time.|
|kills|Number of enemy players killed.|
|longestKill|Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.|
|matchDuration|Duration of match in seconds.|
|matchId|ID to identify matches. There are no matches that are in both the training and testing set.|
|matchType|String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.|
|rankPoints|Elo-like ranking of players. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes the place of “None”.|
|revives|Number of times this player revived teammates.|
|rideDistance|Total distance traveled in vehicles measured in meters.|
|roadKills|Number of kills while in a vehicle.|
|swimDistance|Total distance traveled by swimming measured in meters.|
|teamKills|Number of times this player killed a teammate.|
|vehicleDestroys|Number of vehicles destroyed.|
|walkDistance|Total distance traveled on foot measured in meters.|
|weaponsAcquired|Number of weapons picked up.|
|winPoints|Win-based external ranking of players. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.|
|groupId|ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.|
|numGroups|Number of groups we have data for in the match.|
|maxPlace|Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.|
|winPlacePerc|The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.|
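
The table describes two placeholder conventions: -1 in rankPoints stands for "None", and a 0 in killPoints or winPoints should be read as "None" whenever rankPoints holds a real value. A minimal sketch of how these placeholders could be turned into NaN before analysis is shown below. The helper name `mask_missing_points` and the toy values are assumptions made for illustration; they are not part of the dataset or of this notebook's pipeline.

```python
import pandas as pd

def mask_missing_points(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which the documented 'None' placeholders become NaN."""
    out = frame.copy()
    has_rank = out['rankPoints'] != -1
    # Where rankPoints holds a real value, a 0 in killPoints/winPoints means "None"
    for col in ['killPoints', 'winPoints']:
        out[col] = out[col].astype(float).mask(has_rank & (out[col] == 0))
    # In rankPoints itself, -1 takes the place of "None"
    out['rankPoints'] = out['rankPoints'].astype(float).mask(~has_rank)
    return out

# Tiny invented frame, purely for illustration (not real PUBG data)
toy = pd.DataFrame({
    'rankPoints': [-1, 1500, 1480],
    'killPoints': [1200, 0, 0],
    'winPoints':  [1400, 0, 0],
})
cleaned = mask_missing_points(toy)
print(cleaned)
```

Whether to apply such a step depends on the analysis; the rest of this notebook works with the raw values.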

###  Importing necessary libraries

The following code is written in Python 3.x. Libraries provide pre-written functionality to perform necessary tasks.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# loading the dataset in the variable called 'df' and setting the index column as 'Id'

df= pd.read_csv('/kaggle/input/pubg-dataset/data.csv', index_col='Id')

In [None]:
df.head()  # To view top 5 entries in the dataset.

# in order to view bottom 5 entries, we can do
#df.tail()

#in order to view more than 5 entries, we can enter any integer value into '()'.
#Ex: df.head(10) or df.tail(15), etc

In [None]:
# Now let us see all the column names.
df.columns

In [None]:
# The column named 'Unnamed: 0' can be removed from the dataset because it is not clear what it represents.

df = df.drop(['Unnamed: 0'], axis=1)  # axis=0 operates on rows; since we are removing a column, we use axis=1.

In [None]:
df.head(2)

In [None]:
# Now let us check the shape of the dataset and whether it contains any null values.

print('shape of the dataset =', df.shape)

print('\nThe null count of each column of the dataset is as follows:')
df.isnull().sum()

From the output above, the shape of the dataset is (1111742, 28). This means that the dataset has 1111742 rows and 28 columns.

Only one column contains a null value, and it contains exactly one: 'winPlacePerc'.

In [None]:
# To view the row with the null value:

df[df['winPlacePerc'].isnull()]
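
Since the null sits in the target column, one common option is simply to drop that row. The sketch below shows this on a small invented frame (`toy` stands in for `df`); whether dropping is the right choice for the real data is a judgment call that this notebook leaves open.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only
toy = pd.DataFrame({'kills': [0, 2, 1], 'winPlacePerc': [0.25, np.nan, 1.0]})

# Drop only the rows whose target is missing and keep everything else
toy_clean = toy.dropna(subset=['winPlacePerc'])
print(len(toy), '->', len(toy_clean))
```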

In [None]:
# Function to identify numeric features:

def numeric_features(dataset):
    numeric_col = dataset.select_dtypes(include=np.number).columns.tolist()
    return dataset[numeric_col].head()
    
numeric_columns = numeric_features(df)
print("Numerical Features:")
print(numeric_columns)

print("===="*20)




# Function to identify categorical features:

def categorical_features(dataset):
    categorical_col = dataset.select_dtypes(exclude=np.number).columns.tolist()
    return dataset[categorical_col].head()

categorical_columns = categorical_features(df)
print("Categorical Features:")
print(categorical_columns)

print("===="*20)



# Function to check the datatypes of all the columns:

def check_datatypes(dataset):
    return dataset.dtypes

print("Datatypes of all the columns:")
check_datatypes(df)

### Detect outliers in the continuous columns

Outliers are observations that lie far away from the majority of observations in the dataset. They can be defined mathematically in different ways.

One common definition: outliers are data points lying above **(third quartile + 1.5 x IQR)** or below **(first quartile - 1.5 x IQR)**, where IQR (the interquartile range) is the third quartile minus the first quartile.

- The function below takes a dataframe and outputs the number of outliers in every numeric feature based on the above *IQR* rule.

You can modify the function below to capture outliers according to other definitions.

In [None]:
# Function to detect outliers in every numeric feature
def detect_outliers(df):
    rows = []
    for column in df.select_dtypes(include=np.number).columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        fence_low = q1 - (1.5*iqr)
        fence_high = q3 + (1.5*iqr)
        n_outliers = ((df[column] < fence_low) | (df[column] > fence_high)).sum()
        # Collect rows in a list: DataFrame.append was removed in pandas 2.0
        rows.append({'Feature': column, 'Number of Outliers': n_outliers})
    return pd.DataFrame(rows, columns=['Feature', 'Number of Outliers'])

detect_outliers(df)
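
To see the IQR rule at work on something small, the sketch below applies it to a five-value series. The helper `iqr_outlier_mask` and its `k` parameter are illustrative names, not part of the notebook; changing `k` (for example to 3) is one simple way to get a stricter definition of an outlier.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside the IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

toy = pd.Series([1, 2, 3, 4, 100])  # invented values; 100 is the obvious outlier
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

Here the quartiles are 2 and 4, so the fences sit at -1 and 7 and only the value 100 falls outside them.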

### Observations :
- As per the IQR methodology, there are outliers in the majority of the columns.

## EDA & Data Visualizations

Exploratory data analysis is an approach to analyzing data sets by summarizing their main characteristics with visualizations. The EDA process is a crucial step prior to building a model in order to unravel various insights that later become important in developing a robust algorithmic model.

## Univariate analysis

Univariate analysis means analysis of a single variable. It mainly describes the characteristics of that variable.


#### Number of enemies the player killed:
This is the number of enemy players killed by each player.

In [None]:
# Summary statistics for the number of kills
print('The average person kills {:.4f} players'.format(df['kills'].mean()))
print('50% of people have ',df['kills'].quantile(0.50),' kills or less')
print('75% of people have ',df['kills'].quantile(0.75),' kills or less')
print('99% of people have ',df['kills'].quantile(0.99),' kills or less')
print('while the most kills recorded in the data is', df['kills'].max())
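
The same kind of summary can also be obtained in a single call with `Series.describe`, which accepts a `percentiles` argument. The sketch below uses a short invented series in place of `df['kills']`.

```python
import pandas as pd

# Invented kill counts standing in for df['kills']
toy_kills = pd.Series([0, 0, 0, 1, 1, 2, 3, 8])

# count, mean, std, min, the requested percentiles, and max in one table
summary = toy_kills.describe(percentiles=[0.50, 0.75, 0.99])
print(summary)
```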

Another way to view these statistics is as a graph. Here is a plot of the number of players that make 0, 1, 2, ..., 8+ kills in a game.

In [None]:
data = df.copy()
# Relabel only the 'kills' column: values above the 99th percentile are grouped as '8+'
kills = data['kills'].astype(str)
kills[data['kills'] > data['kills'].quantile(0.99)] = '8+'
plt.figure(figsize=(20,15))
sns.countplot(x=kills.sort_values())
plt.title('Kill Count',fontsize=15)
plt.xlabel('Kills', fontsize=15)
plt.ylabel('Count',fontsize=13)
plt.show()
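
The '8+' label in the cell above is written by hand. A minimal sketch of deriving the label from the quantile instead is shown below; `bucket_top` is a hypothetical helper, and it assumes the column holds non-negative integer counts. Returning an ordered categorical also keeps the bars in numeric order when the labels are plotted.

```python
import pandas as pd

def bucket_top(series: pd.Series, q: float = 0.99) -> pd.Series:
    """Group every value above the q-th quantile under a single 'N+' label."""
    threshold = int(series.quantile(q))
    top_label = f'{threshold + 1}+'
    labels = series.astype(str).where(series <= threshold, top_label)
    # Ordered categories keep '10' after '9' and place the 'N+' bucket last
    order = [str(v) for v in range(threshold + 1)] + [top_label]
    return pd.Series(pd.Categorical(labels, categories=order, ordered=True),
                     index=series.index)

toy = pd.Series([0, 1, 2, 3, 10])  # invented counts for illustration
bucketed = bucket_top(toy, q=0.75)
print(bucketed.value_counts(sort=False))
```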

#### Maximum number of enemy players killed in a short time.
This is the maximum number of enemy players each player killed in a short time.

In [None]:
# Summary statistics for kill streaks
print('The average person kills {:.4f} players in a short time'.format(df['killStreaks'].mean()))
print('50% of people have ',df['killStreaks'].quantile(0.50),' kills or less in a short time')
print('75% of people have ',df['killStreaks'].quantile(0.75),' kills or less in a short time')
print('99% of people have ',df['killStreaks'].quantile(0.99),' kills or less in a short time')
print('While the most kills in a row recorded in the data is', df['killStreaks'].max())

Another way to view these statistics is as a graph. Here is a plot of the number of players that reach a kill streak of 0, 1, 2, 3, or 4+ in a game.

In [None]:
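data = df.copy()
# Relabel only the 'killStreaks' column: values above the 99th percentile are grouped as '4+'
streaks = data['killStreaks'].astype(str)
streaks[data['killStreaks'] > data['killStreaks'].quantile(0.99)] = '4+'
plt.figure(figsize=(20,15))
sns.countplot(x=streaks.sort_values())
plt.title('Kill Streaks',fontsize=15)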
plt.xlabel('killStreaks', fontsize=15)
plt.ylabel('Count',fontsize=13)
plt.show()

#### Dealing with the 'matchType' column

In [None]:
# To check how many unique values are present in this categorical column: 

df['matchType'].value_counts()

In [None]:
# To plot the above insights in form of 'countplot'

plt.figure(figsize=(20,15))
sns.countplot(x=df['matchType'])
plt.title('Match Type',fontsize=15)
plt.xlabel('Match Type', fontsize=15)
plt.ylabel('Count',fontsize=13)
plt.show()

##### Observations :
- From the above graph, it is clear that the most played matchtype is **squad-fpp** 
- The least played matchtype is **normal-duo**
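
The feature table lists six standard modes and says the remaining values come from events or custom matches. A small sketch of separating the two groups with `isin` is shown below; the set name `STANDARD_MODES` and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# The six standard modes named in the feature table
STANDARD_MODES = {'solo', 'duo', 'squad', 'solo-fpp', 'duo-fpp', 'squad-fpp'}

# Invented sample standing in for df['matchType']
toy_types = pd.Series(['squad-fpp', 'duo', 'normal-duo', 'solo-fpp'])

# True for standard modes, False for event or custom modes
is_standard = toy_types.isin(STANDARD_MODES)
print(is_standard.value_counts())
```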

#### Damage to enemy players
We've seen that most players don't manage to kill anyone, but they may still inflict some damage on their enemies.

In [None]:
data = df.copy()

# Keep only those players that didn't kill anyone
data = data[data['kills']==0]
plt.figure(figsize=(15,10))
plt.title('Damage Dealt by 0 killers',fontsize=15)
sns.histplot(data['damageDealt'], kde=True, stat='density')  # distplot is deprecated in recent seaborn versions
plt.xlabel('Damage Dealt', fontsize=15)
plt.ylabel('Density',fontsize=13)
plt.show()

##### Observation:
- Here we see the distribution of how much damage players who don't kill anyone inflict on their enemies. Most of these players don't deal much damage. A plausible explanation is that many of them are new players still figuring out the controls and getting to know the game while they are repeatedly beaten by more experienced players.

#### Let's have a look at the match duration for all the winners.

In [None]:
# Keep only the players that won the match
data = df[df['winPlacePerc'] == 1]

plt.figure(figsize=(15,10))
plt.title('Match duration for winners',fontsize=15)
sns.histplot(data['matchDuration'], kde=True, stat='density')  # distplot is deprecated in recent seaborn versions
plt.xlabel('Match Duration', fontsize=15)
plt.ylabel('Density',fontsize=13)
plt.show()

##### Observations:
- The winners' match durations are spread over a wide range, so match duration on its own does not appear to have much bearing on winPlacePerc. Apparently you can even win the game in just over 2 minutes, but more commonly the game is won in approximately 1400 or 1850 seconds.
- Match duration is therefore not likely to be a useful feature for predicting winPlacePerc.

### Bivariate Analysis 

Bivariate analysis involves checking the relationship between two variables simultaneously.

Let's look at the data and see whether any features are correlated with our target variable "winPlacePerc".

A **correlation** between two random variables describes a statistical association: how close the two variables are to having a linear relationship. The correlation can range between -1 and 1:

- A correlation of 1 means the variables are perfectly positively correlated.
- A correlation of 0 means there is no linear correlation between the variables.
- A correlation of -1 means the variables are perfectly negatively correlated.
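
The three cases above can be checked numerically. The sketch below computes the Pearson correlation directly from its definition (the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares it with pandas' `Series.corr`. The function name `pearson_r` and the input values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y) -> float:
    """Pearson correlation computed directly from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()  # deviations from the mean
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = [1, 2, 3, 4]
print(pearson_r(x, [2, 4, 6, 8]))   # y rises in step with x
print(pearson_r(x, [8, 6, 4, 2]))   # y falls in step with x
# pandas uses the Pearson method by default, so the two values should agree
print(pearson_r(x, [1, 3, 2, 4]), pd.Series(x).corr(pd.Series([1, 3, 2, 4])))
```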

In [None]:
g = sns.jointplot(x='winPlacePerc', y='killStreaks', data=df, color='b')
# jointplot creates its own figure, so label the axes through the returned JointGrid
g.set_axis_labels('Win Place Perc', 'Kill streaks', fontsize=15)
plt.show()

In [None]:
sns.jointplot(x='winPlacePerc', y='damageDealt', data=df, color='b')

There is a reasonable correlation between the damage dealt to enemy players and winPlacePerc.

#### Number of times a player killed a teammate
- This is the number of times a player killed a member of their own team.

In [None]:
# Summary statistics for the number of team kills
print('The average person kills {:.4f} players on their own team'.format(df['teamKills'].mean()))
print('50% of people have killed ',df['teamKills'].quantile(0.50),' team players or fewer')
print('75% of people have killed ',df['teamKills'].quantile(0.75),' team players or fewer')
print('99% of people have killed ',df['teamKills'].quantile(0.99),' team players or fewer')
print('while the most team kills recorded in the data is', df['teamKills'].max())

In [None]:
sns.jointplot(x='winPlacePerc', y='teamKills', data=df, ratio=3, color='r')

#### Total distance travelled
This is not an existing feature in the data, but we can combine the distance features to form a total distance measure and see whether it has any predictive power for our target variable.

In [None]:
# Create a new feature for total distance travelled
data = df[['winPlacePerc']].copy()
data['totalDistance'] = df['walkDistance'] + df['rideDistance'] + df['swimDistance']

# Summary statistics for the total distance travelled
print('The average person travelled {:.2f} m'.format(data['totalDistance'].mean()))
print('25% of people have travelled {:.2f} m or less'.format(data['totalDistance'].quantile(0.25)))
print('50% of people have travelled {:.2f} m or less'.format(data['totalDistance'].quantile(0.50)))
print('75% of people have travelled {:.2f} m or less'.format(data['totalDistance'].quantile(0.75)))
print('99% of people have travelled {:.2f} m or less'.format(data['totalDistance'].quantile(0.99)))
print('The longest distance travelled in the data is {:.2f} m'.format(data['totalDistance'].max()))

In [None]:
sns.jointplot(x='winPlacePerc', y='totalDistance', data=data, ratio=3, color='r')


##### Observation:
- There is a reasonably strong correlation between the total distance travelled and winning, although most of this correlation may simply be due to the strong correlation between walking distance and winPlacePerc. One interesting item to note is that the person who travelled the longest distance, over 41 km in a single match, does not appear to have won.
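
One way to check how much of the total-distance signal comes from walking is to compare the correlation of each distance component with the target side by side. The sketch below does this with `DataFrame.corrwith` on a small invented frame; the numbers are made up for illustration and say nothing about the real data.

```python
import pandas as pd

# Invented values standing in for the real columns
toy = pd.DataFrame({
    'walkDistance': [50, 400, 1200, 2500, 3500],
    'rideDistance': [0, 0, 3000, 0, 1000],
    'swimDistance': [0, 10, 0, 0, 20],
    'winPlacePerc': [0.05, 0.30, 0.55, 0.80, 1.00],
})
toy['totalDistance'] = toy[['walkDistance', 'rideDistance', 'swimDistance']].sum(axis=1)

# Pearson correlation of every distance column with the target
dist_corr = toy.drop(columns='winPlacePerc').corrwith(toy['winPlacePerc'])
print(dist_corr.sort_values(ascending=False))
```

On the real data the same two lines would run with `df` in place of `toy`, after adding `totalDistance` as in the cell above.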

#### Effect of healing and boost items on the result
- Healing items are used to heal yourself after you've been injured, which restores your health and allows you to keep playing for longer.
- Boost items are used by a player to increase speed and accuracy, which allows a player to achieve more kills with weapons or get away from a fight faster.

In [None]:
# Summary statistics for the number of healing items used
print('The average person uses {:.2f} healing items'.format(df['heals'].mean()))
print('50% of people used {:.2f} healing items or less'.format(df['heals'].quantile(0.50)))
print('75% of people used {:.2f} healing items or less'.format(df['heals'].quantile(0.75)))
print('99% of people used {:.2f} healing items or less'.format(df['heals'].quantile(0.99)))
print('The doctor of the data used {:.2f} healing items'.format(df['heals'].max()))

In [None]:
# Summary statistics for the number of boosting items used
print('The average person uses {:.2f} boosting items'.format(df['boosts'].mean()))
print('50% of people used {:.2f} boosting items or less'.format(df['boosts'].quantile(0.50)))
print('75% of people used {:.2f} boosting items or less'.format(df['boosts'].quantile(0.75)))
print('99% of people used {:.2f} boosting items or less'.format(df['boosts'].quantile(0.99)))
print('The addict of the data used {:.2f} boosting items'.format(df['boosts'].max()))

In [None]:
data = df.copy()
data = data[data['heals'] < data['heals'].quantile(0.99)]
data = data[data['boosts'] < data['boosts'].quantile(0.99)]

f,ax1 = plt.subplots(figsize =(20,10))
sns.pointplot(x='heals',y='winPlacePerc',data=data,color='red',alpha=1.0)
sns.pointplot(x='boosts',y='winPlacePerc',data=data,color='blue',alpha=0.8)
plt.text(4,0.6,'Heals',color='red',fontsize = 17,style = 'italic')
plt.text(4,0.55,'Boosts',color='blue',fontsize = 17,style = 'italic')
plt.xlabel('Number of heal/boost items',fontsize = 15,color='blue')
plt.ylabel('Win Percentage',fontsize = 15,color='blue')
plt.title('Heals vs Boosts',fontsize = 20,color='blue')
plt.grid()

##### Observations:
- Here we can see how heal items and boost items compare with each other in their relationship to winPlacePerc.
- This seems to indicate that using a few healing items increases your chance of winning, but you need to use more boosts to actually achieve a chance of winning.
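
By default `sns.pointplot` draws the mean of the y variable for each x category, so the points in the plot above can be reproduced as a table with a `groupby`. The sketch below shows the idea on invented values; on the real data the same call would use `data` and either 'heals' or 'boosts'.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    'heals':        [0, 0, 1, 1, 2, 2],
    'winPlacePerc': [0.1, 0.3, 0.4, 0.6, 0.7, 0.9],
})

# Mean placement for each number of healing items used
mean_by_heals = toy.groupby('heals')['winPlacePerc'].mean()
print(mean_by_heals)
```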

### Multivariate analysis

#### Pearson correlation between all features

In [None]:
f,ax = plt.subplots(figsize=(15, 15))
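# Restrict to numeric columns: recent pandas versions raise an error if string columns are passed to corr()
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True, linewidths=.5, fmt= '.2f',ax=ax)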

##### Observations:
- The colorbar shows the strength and sign of the correlation between each pair of features.
- If the correlation is positive, one variable tends to increase as the other increases.
- If the correlation is negative, one variable tends to decrease as the other increases.
- If the correlation is 1, the two variables are perfectly linearly related; in practice this usually means they are the same or almost the same feature.
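
With this many features the heatmap is dense, so it can help to pull out only the column for the target and sort it. The sketch below does this on a small invented frame; the column values are assumptions for illustration, and on the real data `toy` would be replaced by `df`.

```python
import pandas as pd

# Invented values; 'matchType' is included to show that text columns are skipped
toy = pd.DataFrame({
    'walkDistance': [10, 200, 800, 1500, 3000],
    'killPlace':    [90, 70, 40, 20, 5],
    'matchType':    ['solo', 'duo', 'squad', 'solo', 'duo'],
    'winPlacePerc': [0.0, 0.2, 0.5, 0.8, 1.0],
})

# Correlation of every numeric feature with the target, strongest positive first
target_corr = (
    toy.select_dtypes(include='number')
       .corr()['winPlacePerc']
       .drop('winPlacePerc')
       .sort_values(ascending=False)
)
print(target_corr)
```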
