# <center><font color = "green">PUBG Game Prediction</font></center>


<center><img src = "https://media.giphy.com/media/XVbrX433vn6rqkexSj/giphy.gif"></center>

## <font color = "green">About Dataset:</font>

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.

You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.

### <font color = "green">Link to dataset:</font>

 - Kaggle - https://www.kaggle.com/datasets/ashishjangra27/pubg-games-dataset

## <font color = "green">Data Description:</font>

- **DBNOs -** Number of enemy players knocked.
- **assists -** Number of enemy players this player damaged that were killed by teammates.
- **boosts -** Number of boost items used.
- **damageDealt -** Total damage dealt. Note: Self inflicted damage is subtracted.
- **headshotKills -** Number of enemy players killed with headshots.
- **heals -** Number of healing items used.
- **Id -** Player’s Id
- **killPlace -** Ranking in match of number of enemy players killed.
- **killPoints -** Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
- **killStreaks -** Max number of enemy players killed in a short amount of time.
- **kills -** Number of enemy players killed.
- **longestKill -** Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
- **matchDuration -** Duration of match in seconds.
- **matchId -** ID to identify match. There are no matches that are in both the training and testing set.
- **matchType -** String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
- **rankPoints -** Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
- **revives -** Number of times this player revived teammates.
- **rideDistance -** Total distance traveled in vehicles measured in meters.
- **roadKills -** Number of kills while in a vehicle.
- **swimDistance -** Total distance traveled by swimming measured in meters.
- **teamKills -** Number of times this player killed a teammate.
- **vehicleDestroys -** Number of vehicles destroyed.
- **walkDistance -** Total distance traveled on foot measured in meters.- 
- **weaponsAcquired -** Number of weapons picked up.
- **winPoints -** Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
- **groupId -** ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
- **numGroups -** Number of groups we have data for in the match.
- **maxPlace -** Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
- **winPlacePerc -** The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.

### <font color = "green">Tool and Libraries Used:</font>

- **Tool:**
  - Python 3.11.7
- **Standard Libraries:**
  - warnings
  - numpy (imported as np)
  - pandas (imported as pd)
- **Visualization Libraries:**
  - matplotlib.pyplot (imported as plt)
  - seaborn (imported as sns)
- **Machine Learning Libraries:**
  - sklearn.preprocessing (specifically StandardScaler)
  - sklearn.model_selection (specifically train_test_split)
  - catboost (imported as cb)
  - sklearn.metrics (specifically mean_squared_error and r2_score)

### <font color = "green">Table of Content</font><a class = "anchor" id = "content"></a>

1. [Importing Libraries](#import)
2. [Reading the Data](#read)
3. [Data Wrangling](#wrangle)
4. [Feature Engineering](#feature)
6. [ML - CatBoost Model](#cat)
  - [CatBoost Model](#catboost)
  - [Prediction](#prediction)

# <font color = "green">Importing Libraries</font><a class = "anchor" id = "import"></a>

In [1]:
## handling warnings

import warnings
warnings.filterwarnings("ignore")

##standard libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

## visualisation

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (11,5)

import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## !pip install catboost (for jupyter/colab)

import catboost as cb

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

[🔝](#content)

# <font color = "green">Reading the Data </font><a class = "anchor" id = "read"></a>

In [2]:
## load the data

df = pd.read_csv("pubg_game_prediction.csv")

## glimpse of the data

df.head(2)

FileNotFoundError: [Errno 2] No such file or directory: 'pubg_game_prediction.csv'

In [None]:
## data dimension

df.shape

In [None]:
## data information

df.info()

[🔝](#content)

# <font color = "green">Data Wrangling</font><a class = "anchor" id = "wrangle"></a>

#### Check for the rows with missing win prediction value

In [None]:
## check row with NULL win prediction value

df[df['winPlacePerc'].isnull()]

In [None]:
## remove the data row - 2744604

df.drop(2744604, inplace = True)

#### Understanding Players distribution in a game

In [None]:
## prepare new parameter to know how many players are in a game

df['playersJoined'] = df.groupby('matchId')['matchId'].transform('count')
df.head(1)

In [None]:
## visualize matches where players joined >= 75

sns.countplot(data = df[df['playersJoined']>=75],x = 'playersJoined')
plt.show()

###### Observation:
The data for 75 and + people in a match is huge with maximum matches having 95-98 players

## Analysing the data

#### Kills Without Moving?

###### It is not possible to kill even 1 player if you do not move by atleast 1 unit. Following are mostly used practices by cheaters (ones who interfere with the game's genuine natural processes):
* Aimbots
* Wallhacks
* Triggerbots
* ESP (Extra Sensory Perception)
* Silent Aim

In [None]:
## prepare a data parameter to gather the information of the total distance travelled

df['totalDistance'] = df['rideDistance'] + df['walkDistance'] + df['swimDistance']

## prepare a data parameter to check for anamoly detection that
## the person has not moved but still managed to do the kills

df['killswithoutMoving'] = ((df['kills'] > 0) & (df['totalDistance'] == 0))

In [None]:
## check data for people who have killed without moving

df[df['killswithoutMoving'] == True].head(2)

In [None]:
## check total kills without moving data

df[df['killswithoutMoving'] == True].shape

###### Observation:
1535 instances have either used hacks or been lucky ! We cannot use such data (which cannot be generalised) for our model. Hence, dropping these instances.

In [None]:
## drop the instances

df.drop(df[df['killswithoutMoving'] == True].index , inplace = True)

#### Extra-ordinary Road Kills !

In [None]:
## check data for roadkills > 5

df[df['roadKills'] > 5].shape

###### Observation:
It takes to be expert among the other players in a match to kill by vehicles only. Hence dropping the 46 instances from data frame.

In [None]:
## drop the instance

df.drop(df[df['roadKills'] > 5].index, inplace = True)

#### So many KILLS - how ???

In [None]:
## visualize data for No. of players | Kills

sns.countplot(data = df, x = df['kills']).set_title("Distribution of KILLS by a player")
plt.ylabel("Count of players")
plt.xlabel("Number of Kills")
plt.show()

###### Observation:
Maximum people kills upto maximum 12 players.

In [None]:
## visualize data for No. of players | Kills >= 15

sns.countplot(data = df[df['kills']>=15],x='kills').set_title("Distribution of KILLS by a player")
plt.ylabel("Count of players")
plt.xlabel("Number of Kills")
plt.show()

In [None]:
## kills > 20 cannot be generalized

df[df['kills'] > 20].shape

###### Observation:
Kills beyond 20 are rare and cannot be used a general use case. Hence, dropping the instance.

In [None]:
## drop the instances

df.drop(df[df['kills'] > 20].index, inplace = True)

#### Head Shot

<center><img src = "https://media.giphy.com/media/l3mZrOajz5VCZf7Hy/giphy.gif"></center>

In [None]:
## calculate headshot rate

df['headshot_rate'] = df['headshotKills']/df['kills']

## fill with 0 if there is not headshot

df['headshot_rate'] = df['headshot_rate'].fillna(0)

In [None]:
## plot the headshot rate distribution

sns.distplot(df['headshot_rate'], bins =10).set_title("Distplot showing the distribution of headshot rate")
plt.ylabel("Count of players")
plt.show()

In [None]:
## find headshot rate == 100% with kills > 5

df[(df['headshot_rate'] == 1) & (df['kills'] > 5)].shape

###### Observation
Killing more than 5 people as headshots where all the shots in a match are headshots is mostly not a general case. 187 instances have such anomaly and hence, we will drop them.

In [None]:
## droping the instances

df.drop(df[(df['headshot_rate'] == 1) & (df['kills'] > 6)].index, inplace = True)

#### Longest Shot

###### The maximum possible distance that is made possible to snipe from in PUBG is 1km or 1000 meters. However, this is not general case and most of the times, hackers use either of the following to take advantage and win a match:
* Sniper Aimbots
* Bullet Speed/Trajectory Hacks
* No Recoil/No Spread
* Zoom Hacks

In [None]:
## visualize Number of people | Longest Kills

sns.distplot(df['longestKill'], bins = 50).set_title("Histogram showing the Longest Kill Distribution")
plt.ylabel("Count of players")
plt.show()

In [None]:
## calculate instances with longestkill distance > 500 meters

df[df['longestKill']>=500].shape

###### Observation:
1747 instances have kills > 500. hence, we will drop these.

In [None]:
## dropping the instances

df.drop(df[df['longestKill']>=500].index, inplace = True)

#### Weapon Change

###### In general, people change upto 10 guns in match (avg. being 5 to 6). But, cheaters sometimes use either of the following for unlimited recoil/ guns in a single match:
* Macro Scripts
* Rapid Fire Hacks
* Input Spoofing

In [None]:
## visualize number of players | weapon change

sns.distplot(df['weaponsAcquired'], bins=100).set_title("Weapons Distribution")
plt.show()

In [None]:
## calculate instances with weapons acquired > 15

df[df['weaponsAcquired']>=15].shape

##### Observation:
In 6809 instances, people have changed gun more than 15 times in a match. Such is not a general the use case and hence, we will drop these values.

In [None]:
## drop instance

df.drop(df[df['weaponsAcquired']>=15].index, inplace = True)

### Exploratory Data Analysis

In [None]:
## final shape

df.shape

In [None]:
## total number of null values

df.isna().sum()

In [None]:
## correlation of parameter with Win Prediction

plt.figure(figsize=[30,30])
sns.heatmap(df.corr(numeric_only = True), annot = True)
plt.show()

[🔝](#content)

# <font color = "green">Feature Engineering</font><a class = "anchor" id = "feature"></a>

In [None]:
## calculate normalization factor
## (100-factor)/100 = 0 for matches including 100 players
## use (100-factor)/100 + 1

normalising_factor = (100 - df['playersJoined']/100)+1

In [None]:
## create new attributes with normalization factor

df['killsNorm'] = df['kills'] * normalising_factor
df['damageDealtNorm'] = df['damageDealt'] * normalising_factor
df['maxPlaceNorm'] = df['maxPlace'] * normalising_factor
df['matchDurationNorm'] = df['matchDuration'] * normalising_factor
df['traveldistance'] = df['walkDistance']+ df['swimDistance'] + df['rideDistance']
df['healsnboosts'] = df['heals'] + df['boosts']
df['assist'] = df['assists'] + df['revives']

In [None]:
## analyze columns

df.columns

#### Removing unwanted columns

In [None]:
## not tampering the cleaned data, creating important dataset

data = df.drop(columns = ['Id', 'groupId', 'matchId', 'assists', 'boosts', 'walkDistance', 'swimDistance',
                          'rideDistance', 'heals', 'revives', 'kills', 'damageDealt', 'maxPlace', 'matchDuration'])

In [None]:
## check data dataframe

data.head(2)

# <font color = "green">ML - Catboost Model</font><a class = "anchor" id = "cat"></a>

#### Handling categorical data

In [None]:
x = data.drop(['winPlacePerc'], axis = 1)
y = data['winPlacePerc']

#### One-hot Encoding

In [None]:
x = pd.get_dummies(x, columns = ['matchType', 'killswithoutMoving'])
x = x.applymap(lambda x: int(x) if isinstance(x, bool) else x)
x.head()

In [None]:
features = x.columns

#### Scaling the data

In [None]:
## prevent model from giving undue preference
## to instances with higher values

sc = StandardScaler()
sc.fit(x)
x = pd.DataFrame(sc.transform(x))
x.head(2)

#### Splitting data

In [None]:
## train and test within the single file

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.3, random_state = 0)
print(xtrain.shape, ytrain.shape)
print(xtest.shape, ytest.shape)

##### Check:
<font  color = "Green">**Training Parameters:**</font> **3105414** <br>
<font  color = "Green">**Testing Parameters:**</font> **1330892** <br>

### <font color = "blue">CatBoost Model</font><a class = "anchor" id = "catboost"></a>

In [None]:
train_dataset = cb.Pool(xtrain, ytrain)
test_dataset = cb.Pool(xtest, ytest)

In [None]:
model = cb.CatBoostRegressor(loss_function='RMSE')

In [None]:
## GRID search
## run model one by one on all combinations
## return the best parameter combination

grid = {'iterations': [100, 150], 
       'learning_rate': [0.03, 0.1], 
       'depth': [2, 4, 6, 8]} ## runs 16 combinations here

model.grid_search(grid, train_dataset)

###### Observations:
Our model has prepare final data after Kfold cross validation.

**Best Parameters:**
 - 'depth': 8
 - 'learning_rate': 0.1
 - 'iterations': 150}
 - 'iterations': [0,....149]

In [None]:
feature_importance_df = pd.DataFrame()
feature_importance_df['features'] = features
feature_importance_df['importance'] = model.feature_importances_

feature_importance_df = feature_importance_df.sort_values(by = ['importance'], ascending=False)
feature_importance_df

In [None]:
plt.figure(figsize=(10, 6))  # Adjust the figure size if needed

# Set the background color of the graph
plt.gca().set_facecolor('green')

# Plot the bar chart with specified colors
bars = plt.bar(feature_importance_df.features, feature_importance_df.importance, color='yellow', edgecolor='white')

# Set the labels and their colors
plt.ylabel("CatBoost Feature Importance", color='black')
plt.xticks(rotation=90, color='black')
plt.yticks(color='black')

# Display the plot
plt.show()

###### Observation:
The model can be trained dropping the following parameters:

* matchType_normal-squad
* vehicleDestroys
* headshot_rate
* matchType_normal-solo
* matchType_normal-solo-fpp
* matchType_crashtpp
* matchType_normal-duo-fpp
* matchType_normal-duo
* matchType_flarefpp
* headshotKills
* killswithoutMoving_False

## <font color = "Blue">Prediction</font><a class = "anchor" id = "catboost"></a>

In [None]:
pred = model.predict(xtest)

In [None]:
## evaluate model

rmse = np.sqrt(mean_squared_error(ytest, pred)) ## percentage of error
r2 = r2_score(ytest, pred) ## needs to be high closer to 1 (ranging from 0 to 1)

print("Testing performance")

print("RMSE: {:.2f}".format(rmse))
print("R2: {:.2f}".format(r2))

###### Observation:
An 8% error with r2 Value closer to 1, which means the model accuracy is high without being overfitting.

Hence,
<center>
  <img src="https://media.giphy.com/media/KB89dMAtH79VIvxNCW/giphy.gif" style="width:80%; height:400px;">
</center>


I hope you found this analysis of PUBG game ranking prediction using the CatBoost model both comprehensive and insightful! With an RSME of 0.08 and an R² score close to 1, the model demonstrates high accuracy in predicting player rankings.<br><br>
Your feedback is invaluable, please share your thoughts if you enjoyed it. 
<br><br>
Check out more such projects [here](https://github.com/sho-das)! 😄😅
<br><br>
[🔝](#content)