# PUBG Data Analysis


# About The Dataset:
DBNOs - Number of enemy players knocked.

assists - Number of enemy players this player damaged that were killed by teammates.

boosts - Number of boost items used.

damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.

headshotKills - Number of enemy players killed with headshots.

heals - Number of healing items used.

Id - Player’s Id

killPlace - Ranking in match of number of enemy players killed.

killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.

killStreaks - Max number of enemy players killed in a short amount of time.

kills - Number of enemy players killed.

longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.

matchDuration - Duration of match in seconds.

matchId - ID to identify match. There are no matches that are in both the training and testing set.

matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.

rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.

revives - Number of times this player revived teammates.

rideDistance - Total distance traveled in vehicles measured in meters.

roadKills - Number of kills while in a vehicle.

swimDistance - Total distance traveled by swimming measured in meters.

teamKills - Number of times this player killed a teammate.

vehicleDestroys - Number of vehicles destroyed.

walkDistance - Total distance traveled on foot measured in meters.

weaponsAcquired - Number of weapons picked up.

winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.

groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.

numGroups - Number of groups we have data for in the match.

maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.

winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
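
The "None" rules for rankPoints, killPoints and winPoints described above can be applied before any analysis. Below is a minimal sketch on a made-up three-row table; the helper name `mask_missing_points` and the toy values are my own assumptions, not part of the dataset.

```python
import pandas as pd

def mask_missing_points(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where the sentinel values in the ranking columns become NaN."""
    out = df.copy()
    has_rank = out["rankPoints"] != -1
    for col in ["killPoints", "winPoints"]:
        # a 0 only means "None" when rankPoints carries a real value
        out[col] = out[col].astype(float).mask(has_rank & (out[col] == 0))
    # a -1 in rankPoints takes the place of "None"
    out["rankPoints"] = out["rankPoints"].astype(float).mask(~has_rank)
    return out

# toy rows (invented values, only to show the rule)
toy = pd.DataFrame({
    "rankPoints": [-1, 1500, 1480],
    "killPoints": [1200, 0, 0],
    "winPoints": [1400, 0, 0],
})
masked = mask_missing_points(toy)
print(masked)
```

Turning the sentinels into NaN keeps them out of means and correlations, which would otherwise be pulled towards 0 or -1.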

# Dataset Collection

The dataset was collected from Kaggle (https://www.kaggle.com/competitions/pubg-finish-placement-prediction/data).

# About the Project

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.

I have been provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.

I will analyse the relationships between the different features, using graphical representations built with the matplotlib and seaborn libraries in Python to make the dataset easier to understand. I will then create about 10 columns that, to my knowledge, should be useful for predicting the final position of a player.


# Importing Libraries
numpy for numerical operations, pandas for reading the dataset, and matplotlib.pyplot and seaborn for data visualization.

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the dataset

In [None]:
#loading the dataset
train=pd.read_csv("../input/pubg-finish-placement-prediction/train_V2.csv")

# DATA PREPARATION & CLEANING

Let's gather some information about our data.

In [None]:
#collecting information about the data 
train.info()

We have a very large dataset (about 4446966 entries).

In [None]:
# checking the top 5 rows 
train.head()

# Checking for null values

In [None]:
train.isnull().sum()

We can see that the data is very clean.

# Exploration & Visualizing

# Q. How many players does the average person kill?

In [None]:
print("The average person kills {:.4f} players.".format(train['kills'].mean()))

# Q. How many kills do 99% of people have?

In [None]:
print("99% of people have {} kills or fewer.".format(train['kills'].quantile(0.99)))

# Q. What is the highest number of kills?

In [None]:
print("The most kills ever recorded is {}.".format(train['kills'].max()))

# Visualizing the kill data

We will only plot kill counts from 0 to 7 individually, because 99% of players have 7 kills or fewer while the remaining values are spread thinly up to the maximum of 72. All counts of 8 or more are shown as 8+.

In [None]:
data=train.copy()
# cast to string first so the '8+' label fits in the column
data['kills']=data['kills'].astype('str')
# relabel only the kills column (not the whole row) for players above the 99th percentile
data.loc[train['kills']>train['kills'].quantile(0.99),'kills']='8+'
plt.figure(figsize=(15,10))
sns.countplot(x=data['kills'].sort_values())
plt.title("Kill count",fontsize=15)
plt.show()

# Q. Most people can't make a single kill. Do they at least deal damage?

In [None]:
data=train.copy()
data=data[data['kills']==0]
plt.figure(figsize=(10,10))
plt.title("damage dealt by 0 killers",fontsize=15)
sns.distplot(data['damageDealt'])
plt.show()

Well, most of them don't. Let's investigate the exceptions.

# Q. How many players have won the game without a single kill?

In [None]:
# `data` still holds the zero-kill players selected in the previous cell
print("{} players have won without a single kill!".format(len(data[data['winPlacePerc']==1])))

# Q. How many players have won without dealing damage? 

In [None]:
data1=train[train['damageDealt']==0].copy()
print("{} players have won without dealing damage!".format (len(data1[data1['winPlacePerc']==1])))

# Visualizing win placement percentage vs kills

In [None]:
sns.jointplot(x="winPlacePerc",y="kills",data=train,height=10,ratio=3)
plt.show()

# Q. Do kills have any relation with final placement?

In [None]:
kills=train.copy()
# the last bin edge is the maximum kill count, so no player falls outside the bins
kills['killsCategories']=pd.cut(kills['kills'],
                                [-1,0,2,5,10,kills['kills'].max()],
                                labels=['0_kills','1-2_kills','3-5_kills','6-10_kills','10+_kills'])
sns.boxplot(x="killsCategories", y="winPlacePerc",data=kills)
plt.show()

Apparently, kills are correlated with winning. The box plot groups players by their number of kills (0 kills, 1-2 kills, 3-5 kills, 6-10 kills, and 10+ kills).
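
To put numbers next to the box plot, the same kill categories can be summarised with a groupby. This is only a sketch on a made-up seven-row table (the values are invented for illustration), not a result from the real data.

```python
import pandas as pd

# toy rows (invented values, only to show the grouping)
toy = pd.DataFrame({
    "kills": [0, 0, 1, 2, 4, 7, 12],
    "winPlacePerc": [0.1, 0.3, 0.5, 0.6, 0.8, 0.9, 1.0],
})
bins = [-1, 0, 2, 5, 10, toy["kills"].max()]
labels = ["0_kills", "1-2_kills", "3-5_kills", "6-10_kills", "10+_kills"]
toy["killsCategories"] = pd.cut(toy["kills"], bins, labels=labels)

# median winPlacePerc inside every kill category
summary = toy.groupby("killsCategories", observed=False)["winPlacePerc"].median()
print(summary)
```

On the real data the same two lines can be run on the `kills` frame in place of `toy`.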

# Walkers

# Q. What is the average walking distance of a person?

In [None]:
print("The average person walks for {:.1f}m.".format(train['walkDistance'].mean()))

# Q. How far do 99% of people walk?

In [None]:
print("99% of people have walked {}m or less.".format(train['walkDistance'].quantile(0.99)))

# Q. What is the maximum walking distance recorded?

In [None]:
print("The marathon champion walked for {}m.".format(train['walkDistance'].max()))

# Visualizing the walking data

In [None]:
data=train.copy()
data=data[data['walkDistance']<train['walkDistance'].quantile(0.99)]
plt.title("walking distance distribution",fontsize=15)
sns.distplot(data['walkDistance'])
plt.show()


# Q. How many players died before walking a single step?

In [None]:
print("{} players walked 0 meters.".format(len(train[train['walkDistance']==0])))
print("This means that they died before even taking a step.")

# Q. Does walking have any correlation with win place?

# Visualizing walking distance vs win place

In [None]:
sns.jointplot(x="winPlacePerc",y="walkDistance",data=train)
plt.show()

Apparently, walking has a high correlation with win place.
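
The joint plot only shows the relationship visually. A rough way to quantify it is to compute the Pearson correlation and, because distance is heavily skewed, also a rank-based (Spearman) correlation, which is simply the Pearson correlation of the ranks. The sketch below uses made-up values, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# toy rows (invented values, only to show the two measures)
toy = pd.DataFrame({
    "walkDistance": [0.0, 50.0, 400.0, 1200.0, 3000.0],
    "winPlacePerc": [0.0, 0.1, 0.4, 0.7, 1.0],
})

# linear (Pearson) correlation
pearson = toy["walkDistance"].corr(toy["winPlacePerc"])
# Spearman correlation computed as Pearson on the ranks
spearman = toy["walkDistance"].rank().corr(toy["winPlacePerc"].rank())

print("pearson: {:.3f}, spearman: {:.3f}".format(pearson, spearman))
```

When the relationship is monotonic but not linear, the rank-based value comes out higher than the Pearson one, as it does in this toy example.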

# Drivers

# Q. How far does the average person drive?
# Q. How far do 99% of people in the game drive?
# Q. What is the maximum driving distance recorded?

In [None]:
print("The average person drives for {}m.".format(train['rideDistance'].mean()))
print("99% of people have driven {}m or less.".format(train['rideDistance'].quantile(0.99)))
print("The Formula 1 champion drove for {}m.".format(train['rideDistance'].max()))

# Q. How many players have driven 0 meters?

In [None]:
# note: `data` is still the subset from the walking section (walkDistance below the 99th percentile)
print("{} players drove for 0 meters.".format(len(data[data['rideDistance']==0])))

# Q. Is there any correlation between riding and win place?
# Visualizing the ride distance distribution

In [None]:
data=train.copy()
data=data[data['rideDistance']<train['rideDistance'].quantile(0.9)]
plt.title("Ride Distance Distribution",fontsize=15)
sns.distplot(data['rideDistance'])
plt.show()

It seems that these two variables are slightly correlated.

# Q. Does destroying a vehicle have any relation with winning?

In this case I am using a point plot because this variable takes only a limited number of values.

In [None]:
f,ax1=plt.subplots(figsize=(20,10))
sns.pointplot(x='vehicleDestroys',y='winPlacePerc',data=data)
plt.xlabel('number of vehicle destroyed',fontsize=15,color='blue')
plt.ylabel('win percentage',fontsize=15,color='blue')
plt.title('vehicle destroys/win ratio',fontsize=20,color='blue')
plt.grid()
plt.show()

Yes, destroying vehicles is associated with a higher chance of winning.

# Swimmers

# Q. How far does the average person swim?
# Q. How far do 99% of people in the game swim?
# Q. What is the maximum swimming distance recorded?

In [None]:
print("The average person swims for {}m.".format(train['swimDistance'].mean()))
print("99% of people have swum {}m or less.".format(train['swimDistance'].quantile(0.99)))
print("The Olympic champion swam for {}m.".format(train['swimDistance'].max()))

# Visualizing the swimming distance below the 95th percentile

In [None]:
data=train.copy()
data=data[data['swimDistance']<train['swimDistance'].quantile(0.95)]
plt.figure()
plt.title("swim distance distribution")
sns.distplot(data['swimDistance'])
plt.show()

Almost no one swims. Let's group the swimming distances into 4 categories and plot them against winPlacePerc.

# Visualizing the swimming data in groups

In [None]:
swim=train.copy()
swim['swimDistance']=pd.cut(swim['swimDistance'],
                             [-1,0,5,20,5286],
                             labels=['0m','1-5m','6-20m','20m+'])
plt.figure()
sns.boxplot(x="swimDistance",y="winPlacePerc",data=swim)
plt.show()


It seems that if you swim, you rise to the top. PUBG currently has 3 maps, and one of them has almost no water. Keep that in mind; we might later try to work out which map a match was played on.

# Healers

# Q. How many heal items does the average person use?
# Q. How many heal items do 99% of people in the game use?
# Q. What is the maximum number of heal items used?

In [None]:
print("The average person uses {} heal items.".format(train['heals'].mean()))
print("99% of people use {} or fewer heals.".format(train['heals'].quantile(0.99)))
print("The most heals used is {}.".format(train['heals'].max()))

# Q. How many boost items does the average person use?
# Q. How many boost items do 99% of people in the game use?
# Q. What is the maximum number of boost items used?

In [None]:
print("The average person uses {} boost items.".format(train["boosts"].mean()))
print("99% of people use {} or fewer boosts.".format(train['boosts'].quantile(0.99)))
print("The most boosts used is {}.".format(train['boosts'].max()))

# Visualizing boosts and heals against win place

In [None]:
data=train.copy()
data=data[data['heals']<data['heals'].quantile(0.99)]
data=data[data['boosts']<data['boosts'].quantile(0.99)]
f,ax1=plt.subplots(figsize=(20,10))
sns.pointplot(x='heals',y='winPlacePerc',data=data,color='lime',alpha=0.8)
sns.pointplot(x='boosts',y='winPlacePerc',data=data,color='blue',alpha=0.8)
plt.text(4,0.6,'heals',color='lime',fontsize=17)
plt.text(4,0.55,'boosts',color='blue',fontsize=17)
plt.xlabel('number of heal/boost items',fontsize=17)
plt.ylabel('win percentage',fontsize=17)
plt.grid()
plt.show()

# Visualizing healing with win place 

In [None]:
sns.jointplot(x="winPlacePerc",y="heals",data=train, height=10,ratio=3)
plt.show()

# Visualizing boosting with win place 

In [None]:
sns.jointplot(x="winPlacePerc",y="boosts",data=train, height=10 ,ratio=3)
plt.show()

So healing and boosting are definitely correlated with winPlacePerc, and boosting is the more strongly correlated of the two. In every plot, there is abnormal behavior when the values are 0.

# Comparing solos, duos and squads

There are 3 game modes in the game. One can play solo, with a friend (duo), or with 3 other friends (squad). 100 players join the same server, so in the case of duos the maximum number of teams is 50 and in the case of squads it is 25.

In [None]:
s=train[train['numGroups']>50]
d=train[(train['numGroups']>25) & (train['numGroups']<=50)]
sq=train[train['numGroups']<=25]
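
The split above infers the mode from numGroups. Since the dataset also has a matchType column with the standard modes “solo”, “duo”, “squad” and their “-fpp” variants, the mode could also be read from it directly. The sketch below is one possible way to do that; the helper name `base_mode`, the choice to merge the "-fpp" variants into their base mode, and the "other" bucket for event or custom modes are my own assumptions.

```python
import pandas as pd

def base_mode(match_type: str) -> str:
    """Collapse a matchType string into solo / duo / squad / other."""
    core = match_type.replace("-fpp", "")
    return core if core in {"solo", "duo", "squad"} else "other"

# toy values; the last one stands in for an event or custom mode
toy = pd.Series(["solo", "duo-fpp", "squad-fpp", "squad", "crashfpp"])
modes = toy.map(base_mode)
print(modes.value_counts())
```

On the real data this would be `train['matchType'].map(base_mode)`, and the result may not agree with the numGroups thresholds for matches that were not full.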

# Q. What percentage of the data comes from solo games?
# Q. What percentage of the data comes from duo games?
# Q. What percentage of the data comes from squad games?

In [None]:
print("There are {} ({}%) solo games.".format(len(s),100*len(s)/len(train)))
print("There are {} ({}%) duo games.".format(len(d),100*len(d)/len(train)))
print("There are {} ({}%) squad games.".format(len(sq),100*len(sq)/len(train)))

# Visualizing the solo-duo-squad killing data with winning place data 

In [None]:
f,ax1=plt.subplots(figsize=(20,10))
sns.pointplot(x='kills',y='winPlacePerc',data=s,color='black',alpha=0.8)
sns.pointplot(x='kills',y='winPlacePerc',data=d,color='red',alpha=0.8)
sns.pointplot(x='kills',y='winPlacePerc',data=sq,color='blue',alpha=0.8)
plt.text(37,0.6,'solos',color='black',fontsize=17)
plt.text(37,0.55,'duos',color='red',fontsize=17)
plt.text(37,0.5,'squad',color='blue',fontsize=17)
plt.xlabel('number of kills',fontsize=15)
plt.ylabel('win percentage',fontsize=15)
plt.title('solo vs squad kills', fontsize=20)
plt.grid()
plt.show()

Solos and duos behave the same, but when playing squads kills don't matter that much.
The attribute DBNOs means enemy players knocked. A "knock" can happen only in duos or squads.
The attribute assists can also happen only in duos or squads. It generally means that the player was involved in a kill.
The attribute revives also applies only to duos or squads.

# Visualizing knocks (DBNOs), assists and revives against win place for duos and squads

In [None]:
f,ax1=plt.subplots(figsize=(20,10))
sns.pointplot(x='DBNOs', y='winPlacePerc',data=d,color='black',alpha=0.8)
sns.pointplot(x='DBNOs', y='winPlacePerc',data=sq,color='lime',alpha=0.8)
sns.pointplot(x='assists', y='winPlacePerc',data=d,color='red',alpha=0.8)
sns.pointplot(x='assists', y='winPlacePerc',data=sq,color='blue',alpha=0.8)
sns.pointplot(x='revives', y='winPlacePerc',data=d,color='green',alpha=0.8)
sns.pointplot(x='revives', y='winPlacePerc',data=sq,color='purple',alpha=0.8)
plt.text(14,0.5,'duos-assists',color='red',fontsize=17)
plt.text(14,0.45,'duos-dbnos',color='black',fontsize=17)
plt.text(14,0.4,'duos-revives',color='green',fontsize=17)
plt.text(14,0.35,'squad-assists',color='blue',fontsize=17)
plt.text(14,0.3,'squads-dbnos',color='lime',fontsize=17)
plt.text(14,0.25,'squads-revive',color='purple',fontsize=17)
plt.xlabel('number of dbnos/assists/revives',fontsize=15)
plt.ylabel('win percentage',fontsize=15)
plt.title('duo vs squad dbnos,assists, and revives',fontsize=20)
plt.grid()
plt.show()

# Visualizing the Pearson's correlation between variables

In [None]:
f,ax=plt.subplots(figsize=(15,15))
# numeric_only=True skips the string columns (Id, groupId, matchId, matchType)
sns.heatmap(train.corr(numeric_only=True),annot=True,linewidths=.5,fmt='.1f',ax=ax)
plt.show()

In terms of the target variable (winPlacePerc), a few variables have a medium to high correlation. The highest positive correlation is with walkDistance and the highest negative correlation is with killPlace.
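
Reading the strongest correlations off a large heatmap is error-prone, so it can help to list the correlations with the target as a sorted column. A minimal sketch on made-up rows follows; the values are invented and only three of the real column names are used.

```python
import pandas as pd

# toy rows (invented values, only to show the ranking)
toy = pd.DataFrame({
    "walkDistance": [0, 100, 900, 2000, 3500],
    "killPlace": [95, 80, 40, 12, 1],
    "matchType": ["solo", "duo", "squad", "solo", "duo"],
    "winPlacePerc": [0.0, 0.2, 0.5, 0.8, 1.0],
})

corr_with_target = (
    toy.corr(numeric_only=True)["winPlacePerc"]  # string columns are skipped
    .drop("winPlacePerc")                        # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_target)
```

The most positive features end up at the top and the most negative ones at the bottom of the printed column.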

# Visualizing the top 5 most positive correlated variables with the target

In [None]:
k=5
f,ax=plt.subplots(figsize=(11,11))
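# numeric_only=True skips the string columns; the target itself is one of the k columns
cols=train.corr(numeric_only=True).nlargest(k,'winPlacePerc')['winPlacePerc'].index
cm=np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm=sns.heatmap(cm,cbar=True,annot=True,square=True,xticklabels=cols.values,yticklabels=cols.values)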
plt.show()

# Visualizing the top 5 variables and target variable as pair plot

In [None]:
sns.set()
cols=['winPlacePerc','walkDistance','boosts','weaponsAcquired','damageDealt','killPlace']
# newer seaborn versions use `height` instead of `size`
sns.pairplot(train[cols],height=2.5)
plt.show()

In this plot you can see the pairwise relationships between the selected variables.

# Feature Engineering

A game in PUBG can have up to 100 players fighting, but most of the time a game isn't full. There is no variable that gives us the number of players who joined, so let's create one.

In [None]:
train['playersJoined']=train.groupby('matchId')['matchId'].transform('count')
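
To see what the groupby and transform('count') combination does, here is the same call on a made-up table with two matches. The match and player ids are invented.

```python
import pandas as pd

# toy rows: match m1 has three players, match m2 has two
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m2", "m2"],
    "Id": ["a", "b", "c", "d", "e"],
})

# transform keeps one value per row, so every player gets the size of their own match
toy["playersJoined"] = toy.groupby("matchId")["matchId"].transform("count")
print(toy)
```

Unlike a plain groupby count, transform returns a column with the same length as the original frame, so it can be assigned straight back.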

# Visualizing the number of players joined for matches with more than 49 players

In [None]:
data=train.copy()
data = data [data['playersJoined']>49]
plt.figure(figsize=(15,10))
sns.countplot(x=data['playersJoined'])
plt.title("Players Joined",fontsize=15)
plt.show()

Based on the "playersJoined" feature we can create (or change) a lot of other features to normalize their values. For example, I will create the "killsNorm" and "damageDealtNorm" features. When there are 100 players in the game it might be easier to find and kill someone than when there are 90 players. So I will normalize the kills in such a way that a kill in a 100-player match scores 1 (as it is) and a kill in a 90-player match scores (100-90)/100+1=1.1. This is just an assumption; you can use different scales.

# Creating some new columns that can help us predict the final placement

In [None]:
train['killsNorm']=train['kills']*((100-train['playersJoined'])/100+1)
train['damageDealtNorm']=train['damageDealt']*((100-train['playersJoined'])/100+1)
train[['playersJoined','kills','killsNorm','damageDealt','damageDealtNorm']][5:8]
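
As a quick check of the scaling rule, the factor can be written as a small function and evaluated for a few match sizes. The helper names `norm_factor` and `kills_norm` are hypothetical and only used for this sketch.

```python
def norm_factor(players_joined: int) -> float:
    """Scaling factor: 1.0 for a full 100-player match, larger for emptier matches."""
    return (100 - players_joined) / 100 + 1

def kills_norm(kills: float, players_joined: int) -> float:
    """Kills scaled by the match-size factor."""
    return kills * norm_factor(players_joined)

for n in (100, 90, 50):
    print(n, "players -> factor", round(norm_factor(n), 2))

print("3 kills in a 90-player match count as", round(kills_norm(3, 90), 2))
```

The factor grows linearly as the match gets emptier, so a kill in a 50-player match counts as 1.5 kills under this assumption.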

Another simple feature is the sum of heals and boosts. The total distance travelled is another.

In [None]:
train['healsAndBoosts']=train['heals']+train['boosts']
train['totalDistance']=train['walkDistance']+train['rideDistance']+train['swimDistance']

When using boost items you run faster. They also help you stay out of the zone. So let's create a feature for boosts per walking distance. Heals don't make you run faster, but they also help you stay out of the zone and loot more, so let's create the same feature for heals.

In [None]:
#the +1 is to avoid infinity, because there are entries where boosts>0
#and walkDistance=0
train['boostsPerWalkDistance']=train['boosts']/(train['walkDistance']+1)
train['boostsPerWalkDistance']=train['boostsPerWalkDistance'].fillna(0)

In [None]:
#the +1 is to avoid infinity, because there are entries
#where heals>0 and walkDistance=0
train['healsPerWalkDistance']=train['heals']/(train['walkDistance']+1)
train['healsPerWalkDistance']=train['healsPerWalkDistance'].fillna(0)
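
The effect of the +1 in the denominator is easiest to see on a tiny made-up example, comparing the plain ratio with the shifted one. The three rows below are invented.

```python
import numpy as np
import pandas as pd

# toy rows: two players who never walked, one who walked 1500m
toy = pd.DataFrame({"boosts": [2, 0, 3], "walkDistance": [0.0, 0.0, 1500.0]})

raw = toy["boosts"] / toy["walkDistance"]         # division by zero gives inf or NaN
safe = toy["boosts"] / (toy["walkDistance"] + 1)  # finite for every row

print(pd.DataFrame({"raw": raw, "safe": safe}))
```

With the +1, a player with boosts but no walking gets a large but finite value instead of infinity, and a player with neither gets 0 instead of NaN.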

# Viewing the new columns

In [None]:
#the +1 is to avoid infinity
train['healsAndBoostsPerWalkDistance']=train['healsAndBoosts']/(train['walkDistance']+1)
train['healsAndBoostsPerWalkDistance']=train['healsAndBoostsPerWalkDistance'].fillna(0)

train[['walkDistance','boosts','boostsPerWalkDistance','heals','healsPerWalkDistance','healsAndBoosts']][5:20]

# Creating the feature "killsPerWalkDistance"

In [None]:
#the +1 is to avoid infinity, because there are entries where kills>0 and walkDistance=0
train['killsPerWalkDistance']=train['kills']/(train['walkDistance']+1)
train['killsPerWalkDistance']=train['killsPerWalkDistance'].fillna(0)
#show the rows with the highest kills per walking distance, which the note below refers to
train[['kills','walkDistance','rideDistance','killsPerWalkDistance','winPlacePerc']].sort_values(by='killsPerWalkDistance').tail(10)

Zero walking distance and many kills? Most of these players also have winPlacePerc=1. Very likely cheaters.
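
Such rows can be marked with a boolean column so that they can be inspected or dropped later. The sketch below flags players who have kills but did not move at all (no walking, riding or swimming); the column name `killsWithoutMoving` and the toy values are my own assumptions, and a flag like this is a hint rather than proof of cheating.

```python
import pandas as pd

# toy rows (invented values, only to show the flag)
toy = pd.DataFrame({
    "kills": [0, 5, 2, 9],
    "walkDistance": [0.0, 0.0, 800.0, 0.0],
    "rideDistance": [0.0, 0.0, 0.0, 1200.0],
    "swimDistance": [0.0, 0.0, 0.0, 0.0],
})

total = toy["walkDistance"] + toy["rideDistance"] + toy["swimDistance"]
# kills without any movement at all look suspicious
toy["killsWithoutMoving"] = (toy["kills"] > 0) & (total == 0)
print(toy[toy["killsWithoutMoving"]])
```

Using the total distance instead of walkDistance alone avoids flagging players who got their kills from a vehicle, like the last toy row.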

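# Creating a new column for solos, duos and squads

In [None]:
#team size from the number of groups: more than 50 groups = solo (1), 26 to 50 = duo (2), 25 or fewer = squad (4)
train['team']=[1 if i>50 else 2 if i>25 else 4 for i in train['numGroups']]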
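
On a dataset of this size a Python loop over every row is slow, so the same assignment can also be written in a vectorized way. This is a sketch using np.select with the thresholds from the solo/duo/squad split earlier (more than 50 groups = solos, 26 to 50 = duos, 25 or fewer = squads); the toy numGroups values are invented.

```python
import numpy as np
import pandas as pd

# toy numGroups values around the two thresholds
num_groups = pd.Series([96, 51, 50, 26, 25, 10])

# conditions are checked in order, so the second one only sees values of 50 or less
team = np.select([num_groups > 50, num_groups > 25], [1, 2], default=4)
print(team)
```

On the real data this would be `train['team'] = np.select([train['numGroups'] > 50, train['numGroups'] > 25], [1, 2], default=4)`.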

In [None]:
train.head()

# Inferences and Conclusions
I've drawn many inferences from the data. Here's a summary of a few of them:

The dataset is very large, with around 4446966 entries.

Killing - The average person kills 0.924783 players, 99% of people have 7.0 kills or fewer, and the most kills ever recorded is 72. Most people don't make a single kill in the game. 16666 players have won without a single kill, and 4770 players have won without dealing damage. Apparently, kills are correlated with win placement.

Walking - The average person walks for 1154.2m, 99% of people have walked 4396.0m or less, and the marathon champion walked for 25780.0m. 99603 players walked 0 meters, which means that they died before even taking a step. Apparently, walking has a high correlation with win place.

Driving - The average person drives for about 606.1m, 99% of people have driven 6966.0m or less, and the Formula 1 champion drove for 40710.0m. 3309429 players drove for 0 meters. Ride distance and win place seem to be slightly correlated. Destroying vehicles is associated with a higher chance of winning.

Swimming - The average person swims for about 4.5m, 99% of people have swum 123.0m or less, and the Olympic champion swam for 3823.0m. It seems that if you swim, you rise to the top. PUBG currently has 3 maps, and one of them has almost no water. Keep that in mind; we might later try to work out which map a match was played on.

Healing - The average person uses about 1.1 boost items, 99% of people use 7.0 or fewer boosts, and the most boosts used is 33. Healing and boosting are definitely correlated with winPlacePerc, and boosting is the more strongly correlated of the two. In every plot, there is abnormal behavior when the values are 0.

There are 3 game modes in the game. One can play solo, with a friend (duo), or with 3 other friends (squad). 100 players join the same server, so in the case of duos the maximum number of teams is 50 and in the case of squads it is 25.
There are 709111 (15.95%) solo games, 3295326 (74.10%) duo games, and 442529 (9.95%) squad games in the data. Solos and duos behave the same, but when playing squads kills don't matter that much.
The attribute DBNOs means enemy players knocked. A "knock" can happen only in duos or squads.
The attribute assists can also happen only in duos or squads. It generally means that the player was involved in a kill.
The attribute revives also applies only to duos or squads.
Win place - winPlacePerc is most strongly correlated with five features: boosts, weaponsAcquired, damageDealt, walkDistance, and killPlace (the last one negatively). Based on how the game works, I have also created some more features that I expect to be related to winPlacePerc: "killsNorm", "damageDealtNorm", "healsAndBoosts", "totalDistance", "boostsPerWalkDistance", "healsPerWalkDistance", "killsPerWalkDistance", and "team" (solo, duo or squad), and added them to the table.

Future work -
This data analysis project can be used to develop the game further and make it more interesting.

In [None]:
train.to_csv("submission.csv",index=False)