# PUBG mobile data analysis

## In this notebook, I’ll analyze a dataset from PUBG Mobile, an online battle royale game, and build a model to predict a player's win placement percentage.
### In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) that get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In-game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences, such as falling too far or running themselves over and eliminating themselves. My goal in this project is to build a storyline of the perfect winning PUBG match with the help of EDA and ML.

### importing libraries

In [None]:
import numpy as np 
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error,r2_score
from sklearn.neural_network import MLPRegressor
%matplotlib inline
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### reading dataset

In [None]:
train = pd.read_csv('/kaggle/input/pubg-finish-placement-prediction/train_V2.csv')

In [None]:
train.shape

### all the 29 columns wouldn't fit in the preview so I'll divide them

In [None]:
train.iloc[:,:15].head()

In [None]:
train.iloc[:,15:].head()

### checking datatypes

In [None]:
train.info()

### looks like datatypes are fine

### checking missing values

In [None]:
train.isnull().sum()

### only one missing value, we'll drop it as it's not significant 

In [None]:
train.dropna(inplace=True)

### getting summary statistics

In [None]:
train.describe().T

## EDA and visualization

### The killers..

In [None]:
train['kills'].quantile(0.99)

In [None]:
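# group every kill count above the 99th percentile under one '8+' label for better visuals
temp = train.copy()
high_kills = temp['kills'] > temp['kills'].quantile(0.99)
temp['kills'] = temp['kills'].astype('str')
# assign the label to the kills column only, not to every column of the matching rows
temp.loc[high_kills, 'kills'] = '8+'
plt.figure(figsize=(15,10))
sns.countplot(x=temp['kills'].sort_values())
plt.title('No. of Kills');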

### most people don't kill anyone, let's check if they deal damage..

In [None]:
temp= train.copy()
temp =temp[temp['kills']==0]
plt.figure(figsize=(15,10))
sns.distplot(temp['damageDealt'])
plt.title('Damage dealt by non killers');

### it's obvious that most non-killers don't deal damage either.

In [None]:
del temp

### Let's move on and check which mode is played most: solos, duos, or squads?

In [None]:
train.columns

In [None]:
plt.figure(figsize=(15,10))
plt.xticks(rotation=45)
sns.countplot(x=train['matchType'].astype('str'));

### most players play in squads (4 members)

### let's look at matchType further, as tpp and fpp are not that interesting, they're just modes

In [None]:
train['matchType'].value_counts()

### Combine all squad types into one value, Squad, and do the same for Solo and Duo; the remaining types are combined into Othertypes

In [None]:
# assign the result back instead of using inplace=True on a column accessed from the frame
train['matchType'] = train['matchType'].replace(['squad-fpp','squad','normal-squad-fpp','normal-squad'],'Squad')

In [None]:
train['matchType'] = train['matchType'].replace(['duo-fpp','duo','normal-duo-fpp','normal-duo'],'Duo')

In [None]:
train['matchType'] = train['matchType'].replace(['solo-fpp','solo','normal-solo-fpp','normal-solo'],'Solo')

In [None]:
train['matchType'] = train['matchType'].replace(['crashfpp','flaretpp','flarefpp','crashtpp'],'Othertypes')

In [None]:
sns.countplot(x=train.matchType);

In [None]:
print('{}% of players play as Squads'.format(train.matchType.value_counts()['Squad']/len(train.matchType) *100 ))
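
The four replacements above can also be written as a single lookup. This is only a sketch on a made-up toy Series; `MATCH_TYPE_GROUPS` and `group_match_type` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical lookup table (not in the original notebook): raw matchType -> coarse group.
MATCH_TYPE_GROUPS = {
    **dict.fromkeys(['squad-fpp', 'squad', 'normal-squad-fpp', 'normal-squad'], 'Squad'),
    **dict.fromkeys(['duo-fpp', 'duo', 'normal-duo-fpp', 'normal-duo'], 'Duo'),
    **dict.fromkeys(['solo-fpp', 'solo', 'normal-solo-fpp', 'normal-solo'], 'Solo'),
    **dict.fromkeys(['crashfpp', 'flaretpp', 'flarefpp', 'crashtpp'], 'Othertypes'),
}


def group_match_type(match_type: pd.Series) -> pd.Series:
    """Map raw match types to coarse groups; labels missing from the table stay unchanged."""
    return match_type.map(MATCH_TYPE_GROUPS).fillna(match_type)


# Made-up toy values, only to show the mapping and the share calculation.
toy = pd.Series(['squad-fpp', 'duo', 'normal-solo', 'crashtpp', 'squad'])
grouped = group_match_type(toy)
squad_share = (grouped == 'Squad').mean() * 100
print(grouped.tolist(), squad_share)
```

Keeping the groups in one dictionary makes it easier to see at a glance which raw types end up in which bucket.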

## let's see what the best strategy to win a match is, looking at vehicles, kills, damage, and other variables.

### 1- Movement and vehicles

In [None]:
## The running players
print('A player travels an avg distance of {} meters'.format(train['walkDistance'].mean()))

In [None]:
temp= train.copy()
# keep walk distances below the 99th percentile so outliers don't stretch the plot
temp=temp[temp['walkDistance'] < temp['walkDistance'].quantile(0.99)] 
plt.figure(figsize=(15,10))
sns.distplot(temp['walkDistance'])
plt.title('Walking distance distribution');

### The relationship between walking and winning

In [None]:
plt.figure(figsize=(15,10))
sns.scatterplot(x='winPlacePerc',y='walkDistance',data=train)
plt.title('The relationship between winning and running')

### looks like there's a +ve correlation, let's get the exact value

In [None]:
train[['winPlacePerc','walkDistance']].corr()

### Players who walk more tend to place higher: the scatter plot shows a positive trend, with a correlation coefficient of 0.81
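
For reference, `DataFrame.corr()` computes the Pearson coefficient by default. The sketch below recomputes it by hand on made-up toy values (not taken from the dataset) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Made-up toy values, only to illustrate the calculation (not taken from the dataset).
toy = pd.DataFrame({
    'walkDistance': [0.0, 150.0, 900.0, 2100.0, 3500.0],
    'winPlacePerc': [0.05, 0.20, 0.45, 0.80, 0.95],
})

x = toy['walkDistance'].to_numpy()
y = toy['winPlacePerc'].to_numpy()

# Pearson r: covariance of x and y divided by the product of their standard deviations.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# pandas uses the Pearson method by default, so the two values should agree.
r_pandas = toy['walkDistance'].corr(toy['winPlacePerc'])
print(r_manual, r_pandas)
```

Pearson's r only measures linear association, so it says nothing about whether walking causes a better placement.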

### let's check riding vehicles..

In [None]:
plt.figure(figsize=(15,10))
sns.scatterplot(x='winPlacePerc',y='rideDistance',data=train)
plt.title('The relationship between winning and driving')

In [None]:
train[['winPlacePerc','rideDistance']].corr()

### driving is less correlated, but there's a trick in PUBG: a player can kill an enemy by destroying the enemy's car, either by shooting it or by throwing a bomb at it. Let's check

In [None]:
f,ax1 = plt.subplots(figsize =(20,10))
sns.pointplot(x='vehicleDestroys',y='winPlacePerc',data=train,color='#606060',alpha=0.8)
plt.xlabel('Number of Vehicle Destroys',fontsize = 15,color='blue')
plt.ylabel('Win Percentage',fontsize = 15,color='blue')
plt.title('Vehicle Destroys/ Win Ratio',fontsize = 20,color='blue')
plt.grid();

### The point plot shows that destroying at least one vehicle is associated with a ~35% higher chance of winning, AWESOME!!
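
By default, `sns.pointplot` draws the mean of the y variable for each x category, so the gap can also be read off a grouped mean. A minimal sketch on made-up toy rows (not the real data):

```python
import pandas as pd

# Made-up toy rows, only to show the calculation (not taken from the dataset).
toy = pd.DataFrame({
    'vehicleDestroys': [0, 0, 0, 0, 1, 1, 2],
    'winPlacePerc':    [0.2, 0.4, 0.5, 0.7, 0.8, 0.9, 1.0],
})

# Mean placement and number of players for each vehicleDestroys value.
summary = toy.groupby('vehicleDestroys')['winPlacePerc'].agg(['mean', 'count'])

# Gap between players with one destroyed vehicle and players with none.
gap = summary.loc[1, 'mean'] - summary.loc[0, 'mean']
print(summary)
print(gap)
```

The count column is worth checking alongside the mean, since a mean over very few players is noisy.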

## PUBG is a team-based game: when a teammate is knocked down, you can revive them and bring them back into the game as long as they're not dead. Let's check whether that affects winning.

In [None]:
f,ax1 = plt.subplots(figsize =(20,10))
sns.pointplot(x='revives',y='winPlacePerc',data=train,alpha=0.8)
plt.xlabel('Number of Revives',fontsize = 15,color='blue')
plt.ylabel('Win Percentage',fontsize = 15,color='blue')
plt.title('Revives/ Win Ratio',fontsize = 20,color='blue')
plt.grid();

### looks like it doesn't affect winning that much.. 

## The last thing I'll check is boosts and healing items


In [None]:
plt.figure(figsize=(15,10))
sns.scatterplot(x='winPlacePerc',y='heals',data=train)
plt.title('The relationship between winning and healing elements')

In [None]:
train[['winPlacePerc','heals']].corr()

In [None]:
train[['winPlacePerc','boosts']].corr()

### looks like health boosters do relate to winning, as shown in the plot and correlations above, with a correlation coefficient of 0.42 for heals and 0.634 for boosts

## Feature engineering.

### let's check correlation between variables in our data

In [None]:
train.shape

In [None]:
f,ax = plt.subplots(figsize=(15, 15))
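# numeric_only=True skips string columns such as Id, groupId, matchId and matchType
sns.heatmap(train.corr(numeric_only=True), annot=True, linewidths=.5, fmt= '.1f',ax=ax);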

### Many attributes have low correlation with the target variable, so let's choose the top 5 attributes and explore them further. Note also that the feature with the lowest (most negative) correlation value is killPlace

In [None]:
f,ax = plt.subplots(figsize=(11, 11))
cols = train.corr(numeric_only=True).nlargest(5, 'winPlacePerc')['winPlacePerc'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

### A PUBG game typically has 100 players, but sometimes not all 100 join, so let's create a new feature, playersJoined, that indicates the number of players in a match

In [None]:
train['playersJoined'] = train.groupby('matchId')['matchId'].transform('count')
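
To see what `transform('count')` does here, consider this small sketch on made-up toy rows (two matches, with 3 and 2 players):

```python
import pandas as pd

# Made-up toy rows: match m1 has 3 players, match m2 has 2.
toy = pd.DataFrame({
    'Id': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'matchId': ['m1', 'm1', 'm1', 'm2', 'm2'],
})

# transform('count') returns one value per row, aligned with the original index,
# so the match size can be stored directly as a new column.
toy['playersJoined'] = toy.groupby('matchId')['matchId'].transform('count')
print(toy)
```

Unlike a plain `groupby(...).count()`, which returns one row per match, `transform` keeps one row per player, so no merge is needed afterwards.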

In [None]:
plt.figure(figsize=(15,15))
temp=train[train['playersJoined']>=50]
sns.countplot(x=temp['playersJoined'])
plt.title("Players Joined",fontsize=15)
plt.show()

### In the EDA we saw that boosts and healing items increase the chance of winning a game, so let's combine them into one feature, and do the same for the walking, swimming, and riding distances

In [None]:
train['healsAndBoosts'] = train['heals']+train['boosts']
train['totalDistance'] = train['walkDistance']+train['rideDistance']+train['swimDistance']

In [None]:
train.columns

### Derive a team column holding the team size (1 for solos, 2 for duos, 4 for squads) from the number of groups in the match

In [None]:
# a chained comparison avoids the precedence bug in (i>25 & i<=50), where & binds before > and <=
train['team'] = [1 if i>50 else 2 if 25<i<=50 else 4 for i in train['numGroups']]
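
The same thresholds can be applied without a Python-level loop. The sketch below uses `np.select` on made-up `numGroups` values; in element-wise expressions `&` binds more tightly than `>` and `<=`, so each comparison needs its own parentheses.

```python
import numpy as np
import pandas as pd

# Made-up numGroups values covering each branch and the boundary cases.
toy = pd.DataFrame({'numGroups': [96, 51, 50, 26, 25, 10]})

conditions = [
    toy['numGroups'] > 50,                               # many groups -> solo (1)
    (toy['numGroups'] > 25) & (toy['numGroups'] <= 50),  # mid-range -> duo (2)
]
# Rows matching neither condition fall back to the default, squad (4).
toy['team'] = np.select(conditions, [1, 2], default=4)
print(toy)
```

The thresholds are the notebook's heuristic: the group count is only a proxy for team size, so matches that start with fewer players can be misclassified.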

### selecting relevant data columns

In [None]:
train.columns

In [None]:
train=train[['assists','healsAndBoosts','damageDealt','DBNOs','kills','playersJoined','totalDistance','weaponsAcquired','winPlacePerc']]

In [None]:
train.head()

In [None]:
X=train.drop('winPlacePerc',axis=1)
y=train['winPlacePerc']

### getting rid of skewness

In [None]:
sns.distplot(X['damageDealt']);

In [None]:
sns.distplot(X['totalDistance']);

### both are +ve skewed, so I'll use a cube root transformation, which keeps the 0 values (a log transform is not defined at 0)

In [None]:
X['damageDealt']=X['damageDealt']**(1/3)
sns.distplot(X['damageDealt']);

In [None]:
X['totalDistance']=X['totalDistance']**(1/3)
sns.distplot(X['totalDistance']);
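
To check that the transformation actually reduces skewness, the skew can be compared before and after. This sketch uses a made-up right-skewed sample and `np.cbrt`, which gives the same result as `**(1/3)` for non-negative values.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample that includes zeros, like damageDealt or totalDistance.
values = pd.Series([0.0, 0.0, 1.0, 8.0, 27.0, 64.0, 1000.0])

# np.cbrt is defined at 0, so rows with zero damage or distance are preserved.
cube_root = np.cbrt(values)

# Series.skew() returns the sample skewness; values closer to 0 mean a more symmetric shape.
skew_before = values.skew()
skew_after = cube_root.skew()
print(skew_before, skew_after)
```

On this toy sample the skewness drops after the transformation but stays positive, so it is worth re-plotting the real columns rather than assuming the skew is gone.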

## The winning strategy (My objective):
### I said at the beginning that I’m trying to find the best strategy to win a PUBG game using analytics, so here’s what I figured out from the analysis:
1. Play in a team.
2. Use healing and health-boosting items.
3. Destroy your enemies' vehicles.
4. Kill as many enemies as you can.
5. Move a lot and collect powerful weapons.


In [None]:
## to do, regression :D