# **Know The Data**


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Players are dropped into a wide, open area, and they must fight to the death - all while the battlefield shrinks, adding pressure to all in its grip. Use a variety of interesting weapons and vehicles amid the BATTLEGROUNDS. "Killing off another player cuts down on your competition, but it also offers up the opportunity to grab some loot. Your character can only carry around a limited amount of gear, so there are important questions to ask whenever you come across new items. Is it better to stick with your current 9mm pistol, or hold out hope that you'll be able to find ammo for a found 12 gauge shotgun?" Source: [pubg.gamepedia.com](https://pubg.gamepedia.com/About)  

In this kernel I'll explore the data through Exploratory Data Analysis and Feature Engineering, and see how the different columns are correlated with the target variable.

### Reading the data
First things first, let's start by importing the libraries that I'll need and reading the data using pandas.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('ggplot')

df = pd.read_csv('../input/pubg-finish-placement-prediction/train_V2.csv')
df.head()

The DataFrame (df) has 4446966 rows and 29 columns. Before starting, let's get some information about each column.
* **DBNOs** - Number of enemy players knocked.
* **assists** - Number of enemy players this player damaged that were killed by teammates.
* **boosts** - Number of boost items used.
* **damageDealt** - Total damage dealt. Note: Self inflicted damage is subtracted.
* **headshotKills** - Number of enemy players killed with headshots.
* **heals** - Number of healing items used.
* **Id** - Player’s Id
* **killPlace** - Ranking in match of number of enemy players killed.
* **killPoints** - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
* **killStreaks** - Max number of enemy players killed in a short amount of time.
* **kills** - Number of enemy players killed.
* **longestKill** - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
* **matchDuration** - Duration of match in seconds.
* **matchId** - ID to identify match. There are no matches that are in both the training and testing set.
* **matchType** - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
* **rankPoints** - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
* **revives** - Number of times this player revived teammates.
* **rideDistance** - Total distance traveled in vehicles measured in meters.
* **roadKills** - Number of kills while in a vehicle.
* **swimDistance** - Total distance traveled by swimming measured in meters.
* **teamKills** - Number of times this player killed a teammate.
* **vehicleDestroys** - Number of vehicles destroyed.
* **walkDistance** - Total distance traveled on foot measured in meters.
* **weaponsAcquired** - Number of weapons picked up.
* **winPoints** - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
* **groupId** - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
* **numGroups** - Number of groups we have data for in the match.
* **maxPlace** - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
* **winPlacePerc** - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.   
[Click Here to know more about data](https://www.kaggle.com/c/pubg-finish-placement-prediction/data)
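
The sentinel rules described above for rankPoints, killPoints and winPoints can be made explicit before any modelling. Below is a minimal sketch on a toy frame with made-up values; the helper name `mask_missing_points` is hypothetical and is not used elsewhere in this kernel.

```python
import pandas as pd

def mask_missing_points(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which the documented 'None' sentinels become NaN."""
    out = frame.copy()
    has_rank = out['rankPoints'] != -1  # -1 stands in for "None" in rankPoints
    for col in ['killPoints', 'winPoints']:
        # A 0 only means "None" when rankPoints holds a real value
        out[col] = out[col].astype(float).mask(has_rank & (out[col] == 0))
    out['rankPoints'] = out['rankPoints'].astype(float).mask(~has_rank)
    return out

# Toy rows (invented values, not taken from train_V2.csv)
toy = pd.DataFrame({
    'rankPoints': [-1, 1500, 1480],
    'killPoints': [1200, 0, 0],
    'winPoints': [1400, 0, 0],
})
cleaned = mask_missing_points(toy)
print(cleaned)
```

Working on a copy keeps the raw columns available in case the original sentinel values are needed later.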



## Let's Start
Let's start by identifying the cheaters/hackers in the game. As the game grew in popularity, some players started using third-party applications that increase their chance of winning. I'll use 3 approaches to identify cheaters/hackers, looking for players who:
* Have a very high headshot percentage.
* Have lots of kills without moving.
* Have acquired lots of weapons without moving.

Let's Start our Analysis:

1. **Have a very high headshot percentage.**
Cheaters generally have a very high headshot percentage, so I'm looking for players who have more than 40 kills with more than 90% of them from headshots.


In [None]:
# Headshot percentage: 'headshotKills' / 'kills', or 0 for players with no kills
df['headShotPercentage'] = (df['headshotKills'] / df['kills']).where(df['kills'] > 0, 0)
# Possible cheaters: more than 40 kills, over 90% of them headshots
df[(df['headShotPercentage'] > 0.9) & (df['kills'] > 40)][['Id','boosts','heals','headshotKills','kills','winPlacePerc']]

The filter returns two players. One, with Id '15622257cb44e2', got 40 out of 42 kills by headshot without using any boosts or heals; the other got 39 out of 42 kills by headshot with only two heals and no boosts. 

2. **Have lots of kills without moving.**  
If two opponents land at the same place, one of them might get a few kills without moving. I'm using a threshold of 10 kills, i.e. any player with more than 10 kills without moving might be a cheater. For this I'm adding a new feature, 'totalDistance', which is the total distance travelled by a player.

In [None]:
# Total distance travelled by a player: sum of 'rideDistance', 'swimDistance' and 'walkDistance'
df['totalDistance'] = df['rideDistance'] + df['swimDistance'] + df['walkDistance']
kills_without_movement = df[(df['totalDistance'] <= 0) & (df['kills'] > 10)]  # players with more than 10 kills who never moved
kills_without_movement['kills'].plot.hist(figsize=(12, 7))  # histogram of kill counts for these players
plt.xlabel('Kills')
plt.title('Frequency of kills by cheaters without movement')

Most of these players (about 100) have between 10 and 14 kills, and a few even have more than 30 kills without any movement.

3. **Have acquired lots of weapons without moving.**  
In PUBG, players land bare-handed and have to loot for weapons. Some cheaters may have acquired lots of weapons without travelling any distance. Let's see if there are any in our dataset.  
For this I'm looking for players who have acquired more than 50 weapons without moving.


In [None]:
weapons_without_movement = df[(df['totalDistance'] == 0) & (df['weaponsAcquired'] > 50)]  # players who acquired more than 50 weapons without moving
fig, ax1 = plt.subplots(figsize=(12, 7))
sns.histplot(weapons_without_movement['weaponsAcquired'], kde=True, ax=ax1)  # histplot replaces the deprecated distplot
plt.title('Frequency of weapons acquired by cheaters')

The distribution shows that a number of players picked up more than 50 weapons without moving at all, so there are cheaters of this kind in the dataset.  

Now let's see if any players appear in both of the previous observations, i.e. kills without movement and weapons without movement.


In [None]:
# Players who appear in both of the previous observations (set intersection on 'Id')
common_players = sorted(set(kills_without_movement['Id']) & set(weapons_without_movement['Id']))
common_players

There are a total of 15 players common to both observations. I can either:
* drop only the rows in which a player was flagged as a cheater, or
* drop every entry belonging to these players  

I'm not doing anything with them for now because the main focus is on analysing the data.
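
If the flagged rows were to be dropped later, the three rules above could be combined into a single boolean mask. This is only a sketch on a toy frame with invented values; the thresholds are the ones used in this kernel, while the names `suspicious` and `clean` are hypothetical.

```python
import pandas as pd

# Toy rows (invented values) that mimic the columns used by the three rules
toy = pd.DataFrame({
    'Id': ['a', 'b', 'c', 'd'],
    'kills': [42, 12, 3, 0],
    'headshotKills': [40, 2, 1, 0],
    'totalDistance': [3500.0, 0.0, 0.0, 1200.0],
    'weaponsAcquired': [6, 4, 60, 3],
})

# Headshot rate, set to 0 for players with no kills
headshot_rate = (toy['headshotKills'] / toy['kills']).where(toy['kills'] > 0, 0)
no_movement = toy['totalDistance'] == 0

# A row is suspicious if it matches any one of the three rules
suspicious = (
    ((headshot_rate > 0.9) & (toy['kills'] > 40))
    | (no_movement & (toy['kills'] > 10))
    | (no_movement & (toy['weaponsAcquired'] > 50))
)

clean = toy[~suspicious].reset_index(drop=True)
print(clean['Id'].tolist())
```

Keeping the mask as its own Series makes it easy to inspect how many rows each rule removes before committing to the drop.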


Let's see which match types cheaters prefer. Do they like solo or duo? Let's find out.

In [None]:
matchType = kills_without_movement['matchType'].value_counts()  # number of flagged players per match type
matchType = matchType[matchType > 0]  # keep only the match types that actually occur
matchType.plot.barh(figsize=(12,6))  # horizontal bar plot
plt.xlabel('Numbers of players')
plt.ylabel('Match type')
plt.title('Match type cheaters are interested in')

Cheaters like to play in 'normal-solo-fpp', 'normal-squad', 'normal-duo-fpp', 'normal-duo', 'normal-solo' and 'normal-squad-fpp'.

## Match Type  
Let's see how kills vary with match type.

In [None]:
fig, axes = plt.subplots(figsize=(20,5))
sns.boxplot(y='kills', x='matchType', data=df, ax= axes)#Plotting box plot between different match type and kills
plt.xticks(rotation=45)
plt.title('kills for different matches')

As the graph above shows, 'normal-solo-fpp', 'normal-squad', 'normal-duo-fpp' and 'normal-squad-fpp' have the highest kill counts. This might be because there are lots of cheaters in these match types, as discussed above.  

Let's see which match types are played the most. 

In [None]:
match_count = df.groupby(['matchType']).size().reset_index() #total size (count) of each match in match type
match_count[0] = match_count[0]/df.shape[0]*100  # playing % of each match
match_count.drop(match_count[match_count[0] < 1].index, axis=0, inplace=True) #Deleting the match which have playing percentage less than 1%
fig, ax1 = plt.subplots(figsize=(16,7))
ax1.pie(match_count[0], labels=match_count['matchType'], autopct='%1.2f%%', shadow=True, startangle=90)

From the pie chart it is clear that players prefer FPP (First Person Perspective) matches to the classic 'solo', 'duo' and 'squad' modes. The 'normal-solo-fpp', 'normal-squad', 'normal-duo-fpp' and 'normal-squad-fpp' types each account for less than 1% of players, so they were dropped from the chart; these are the rare match types where the flagged cheaters showed up above.  
Let's see how kills and win percentage (winPlacePerc) are related to each other in each match type.


In [None]:
match_to_keep = ['squad-fpp','squad','solo-fpp','solo','duo-fpp','duo']
most_match = df[df.matchType.isin(match_to_keep)] #Keeping only the match types with a playing % above 1
fig, ax1 = plt.subplots(figsize=(16,7))
sns.pointplot(x='kills', y='winPlacePerc', hue='matchType', data=most_match, ax=ax1)
plt.title('Win percentage vs Kills for different matches')

## Boosts/Heals 

Let's see how win percentage varies with Boosts and heals.

In [None]:
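g = sns.jointplot(y='boosts', x='winPlacePerc', data=df, color='#0066ff')  # joint distribution of boosts and winPlacePerc
g.figure.suptitle('Win Percentage vs Boosts', y=1.02)  # title the whole figure rather than a marginal axis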

As the number of boost items used increases, the win percentage increases. 

In [None]:
fig, ax1 = plt.subplots(figsize=(13,5))# plotting for heals
sns.pointplot(x='heals', y='winPlacePerc', data=df, ax=ax1)
plt.xlim((0,30))
plt.title('win percentage vs heals')

As I expected, the win percentage also rises with the number of healing items used.

## Total Distance  

Since we created the 'totalDistance' feature earlier, let's first see how it is distributed.

In [None]:
fig, ax1 = plt.subplots(figsize=(14,6))
sns.histplot(df['totalDistance'], bins=50, stat='density', kde=True, color='#1ab2ff', ax=ax1)  # distribution of total distance (histplot replaces the deprecated distplot)
plt.title('Distribution of Total Distance')

Most players travel between 0 and 10000 meters. A few have travelled more than 30000 meters; be aware that they might be cheaters using some application for high speed.  

Now let's see how win percentage depends on total distance.

In [None]:
fig, ax1 = plt.subplots(figsize=(14,6))
distance = df.loc[df['totalDistance'] < 30000, ['totalDistance','kills','winPlacePerc']].copy()  # players who travelled less than 30 km (30000 m); copy avoids SettingWithCopyWarning
distance['totalDistance'] = (distance['totalDistance'] / 1000).round()  # convert to km and round to the nearest whole number
sns.lineplot(x='totalDistance', y='winPlacePerc', data=distance, color='#002233', ax=ax1)
plt.xlabel('Total Distance in Km')
plt.title('win percentage vs total distance')

Win percentage dips slightly between 15 and 25 km. This might be because of the play zone: players who travelled less may have landed inside the play zone and already taken up good positions for attack and defence, while players who travelled a long way to reach the safe zone have to fight them on arrival and so have a lower chance of winning. That is only my guess at the reason. 

## Knocks  

Let's see how win percentage depends on knocks.

In [None]:
bin_used = [0,10,20,30,40,55]
label_used = ['0-10','10-20','20-30','30-40','40-55']
categories = pd.cut(df['DBNOs'], bins=bin_used, labels=label_used, include_lowest=True) #Converting the knocks into bins (intervals are right-closed, so '10-20' covers 11 to 20)

fig, ax1 = plt.subplots(figsize=(14,6))
sns.boxplot(y=df['winPlacePerc'], x=categories, ax=ax1)
plt.title('Win percentage vs Knocks')
plt.xlabel('Knocks')

Win percentage rises with the number of knocks. The 40-55 bin sits at a win percentage of 1.0, but that is because only a few players fall in that range and they all have a very high win percentage.
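
Before reading much into a bin, it helps to check how many players it holds. The sketch below reuses the bin edges and labels from the cell above on a small made-up Series (not the real DBNOs column) to show how `pd.cut` assigns values and how the counts per bin can be inspected.

```python
import pandas as pd

# Made-up knock counts, chosen to land on and around the bin edges
knocks = pd.Series([0, 3, 10, 11, 25, 41, 55], name='DBNOs')
bins = [0, 10, 20, 30, 40, 55]
labels = ['0-10', '10-20', '20-30', '30-40', '40-55']

# Intervals are right-closed: 10 falls in '0-10', 11 falls in '10-20'
binned = pd.cut(knocks, bins=bins, labels=labels, include_lowest=True)

# Number of values per bin, in bin order
counts = binned.value_counts(sort=False)
print(counts)
```

Running the same `value_counts` on the real `categories` Series would show how thin the upper bins are.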

## Feature Engineering  

Now let's create some interesting features.  
I have already created two features, i.e. 'headShotPercentage' and 'totalDistance'. Now let's create some new ones.

In [None]:
df['killsPerDistance'] = (df['kills'] / df['totalDistance']).where(df['totalDistance'] > 0, 0)  # 'kills' by 'totalDistance', or 0 if the player never moved
df['totalBoosts/Heals'] = df['boosts'] + df['heals']  # sum of 'boosts' and 'heals'
df['damagePerKill'] = (df['damageDealt'] / df['kills']).where(df['kills'] > 0, 0)  # damage dealt per kill, or 0 if the player has no kills

I have created 3 more features:
* **killsPerDistance** : kills divided by totalDistance
* **totalBoosts/Heals** : the sum of boosts and heals
* **damagePerKill** : damage dealt per kill
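
Ratio features like these all need the same guard against a zero denominator, so the pattern can be factored into a small helper. This is a sketch only: `safe_ratio` is a hypothetical name, and the toy values are invented for illustration.

```python
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise numerator / denominator, with 0 where the denominator is not positive."""
    # Non-positive denominators become NaN, so the division yields NaN there
    ratio = numerator / denominator.where(denominator > 0)
    # Note: rows where the numerator itself is NaN also end up as 0
    return ratio.fillna(0.0)

# Toy rows (invented values)
toy = pd.DataFrame({
    'kills': [2, 0, 5],
    'damageDealt': [250.0, 80.0, 500.0],
    'totalDistance': [1000.0, 0.0, 2500.0],
})
toy['damagePerKill'] = safe_ratio(toy['damageDealt'], toy['kills'])
toy['killsPerDistance'] = safe_ratio(toy['kills'], toy['totalDistance'])
print(toy)
```

Because the helper works on whole columns, it avoids the row-by-row `apply`, which is slow on a frame with millions of rows.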

## Correlation with Target  

Let's see which are the top 10 features that are highly correlated with the target variable i.e winPlacePerc.

In [None]:
corr = df.select_dtypes('number').corr()  # correlation of the numeric columns only (Id, groupId, matchId and matchType are strings)
corr1 = corr.abs().nlargest(11, 'winPlacePerc')  # winPlacePerc itself plus the 10 columns most correlated with it, by absolute value
corr = corr.loc[corr1.index, corr1.index]  # restrict the matrix to those columns

fig, axes = plt.subplots(figsize=(13,6))
sns.heatmap(corr, annot=True, cmap='RdYlBu', ax=axes)

From the heatmap above we can say that:
* totalDistance is positively correlated with walkDistance. 
* totalBoosts/Heals is highly positively correlated with boosts and heals.
* kills is strongly correlated with killPlace and damageDealt. The killPlace correlation should be negative, since killPlace is a ranking in which a lower number means more kills.  

The columns shown in the heatmap are the ones most strongly correlated with winPlacePerc.
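
To read the ranking directly instead of from the heatmap, the target column of the correlation matrix can be sorted by absolute value. The sketch below uses synthetic data with a built-in relationship (it is not the PUBG data), so the column values and the `noise` column are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target follows walkDistance, killPlace falls as it rises
rng = np.random.default_rng(0)
n = 500
walk = rng.uniform(0, 3000, n)
toy = pd.DataFrame({
    'walkDistance': walk,
    'killPlace': 100 - walk / 30 + rng.normal(0, 5, n),
    'noise': rng.normal(size=n),
    'matchType': ['solo'] * n,  # string column, excluded from the correlation
    'winPlacePerc': walk / 3000,
})

# Correlation of every numeric feature with the target, without the target itself
numeric = toy.select_dtypes('number')
target_corr = numeric.corr()['winPlacePerc'].drop('winPlacePerc')

# Rank by absolute value so strong negative correlations are not lost
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking.head(10))
```

Keeping the signed `target_corr` next to the absolute `ranking` shows both how strong each relationship is and in which direction it runs.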

Let's see how all these columns are related to each other with a pair plot.

In [None]:
corr_col = df[corr.index] #Getting the correlated columns
sns.pairplot(corr_col)

## **Thanks for reading the kernel.** 
## **If you like it, please upvote.**