# **PUBG Finish Placement Prediction**


!["Image"](https://cdn.dnaindia.com/sites/default/files/styles/full/public/2020/10/22/933055-pubg-2.jpg)


**Given over 65,000 games' worth of anonymized player data, split into training and testing sets, the task is to predict final placement from final in-game stats and initial player ratings.**


## **0. Data Reading and Description**

### Libraries and Data information

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

import plotly.express as px

In [None]:
# Train/test data Reading 

train_df = pd.read_csv("../input/pubg-finish-placement-prediction/train_V2.csv")

test_df = pd.read_csv('../input/pubg-finish-placement-prediction/test_V2.csv')

print(train_df.shape,test_df.shape)

In [None]:
# look at some of data points
train_df.sample(10)

In [None]:
test_df.head(2)

In [None]:
# data description
train_df.describe()

In [None]:
# data information
train_df.info()

- **Looking at the dataset, this appears to be a regression problem with an ample amount of data. Next, I'll perform some EDA, preprocess the data, and build a model suited to the prediction task.**

## **1. EDA**

- **Correlation within the data**

In [None]:
plt.figure(figsize=[25,12])
# restrict to numeric columns: the ID and matchType columns are strings
sns.heatmap(train_df.select_dtypes("number").corr(),annot = True,cmap = "BuPu");
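
A heatmap over this many columns can be hard to read, so it may help to rank features by their correlation with the target. The sketch below does this on a small synthetic frame. The column names are borrowed from the dataset, but `demo_df` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_df; the values are made up for illustration.
rng = np.random.default_rng(0)
n = 1_000
walk = rng.uniform(0, 4000, n)
demo_df = pd.DataFrame({
    "walkDistance": walk,
    "kills": rng.poisson(1.0, n),
    "matchType": rng.choice(["solo", "duo", "squad"], n),  # non-numeric column
    "winPlacePerc": np.clip(walk / 4000 + rng.normal(0, 0.1, n), 0, 1),
})

# Correlation of every numeric feature with the target (target itself dropped).
corr = (
    demo_df.select_dtypes("number")
    .corr()["winPlacePerc"]
    .drop("winPlacePerc")
)

# Order by absolute strength while keeping the sign of each correlation.
corr_with_target = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_target)
```

On the real data, the same pattern would be applied to `train_df` instead of `demo_df`.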

- **Number of unique values for IDs**

In [None]:
id_cols = train_df.columns[:3]

for i in id_cols:
  print(f" {i} : {train_df[i].nunique()}")

> Probably should drop these later from both train/test datasets.

- **Assists v/s Kills v/s winPlacePerc**

In [None]:
# taking a subset of data for plotting
exp_df = train_df.sample(30_000)

In [None]:
fig = px.scatter(exp_df, x = "assists", y = "kills",color="winPlacePerc",
                 size="assists",hover_name="Id")
fig.show()

> High kill counts may be associated with teams being eliminated early.


- **DamageDealt v/s Heals v/s Kills v/s winPlacePercentage.**

In [None]:
fig = px.scatter(exp_df, y = "damageDealt", x = "heals", log_y= False,
                    color = "kills", hover_name = "winPlacePerc",
                 size = "damageDealt"
                   )
fig.show()

> High kills tend to come with more healing, but not with particularly high win chances.

- **Match Duration v/s  Walk Distance v/s Swim distance**

In [None]:
fig = px.scatter(exp_df, x = "matchDuration", y = "walkDistance",size = "swimDistance",
                 color = "revives",width=1200, hover_name = "winPoints")
fig.show()

- **Win Percentage Distribution**

In [None]:
# no color argument: coloring by a continuous column creates one trace per unique value
fig = px.histogram(exp_df,x = "winPlacePerc")
fig.show()

## **2. Data Pre-Processing**

In [None]:
# Dropping id columns for both train/test

data_train = train_df.drop(columns = id_cols)
data_test = test_df.drop(columns = id_cols)

In [None]:
data_train.columns

- **Checking for null_values**

In [None]:
data_train.isna().sum()

In [None]:
data_test.isna().sum().any()

In [None]:
# fill the single missing target value with the column mean
data_train["winPlacePerc"] = data_train["winPlacePerc"].fillna(data_train["winPlacePerc"].mean())

- **Encoding Categorical Data**

In [None]:
data_train.matchType.value_counts()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
data_train["matchType"] = le.fit_transform(data_train["matchType"])
data_test["matchType"] = le.transform(data_test["matchType"])
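
As a small illustration of what the encoder does: `LabelEncoder` sorts the classes it sees during `fit` and maps them to integers, and `transform` raises a `ValueError` for a label it has not seen. The lists below are toy values, and `"some-new-mode"` is a hypothetical label used only to trigger the error.

```python
from sklearn.preprocessing import LabelEncoder

# Toy match types standing in for the matchType column.
train_types = ["squad-fpp", "duo", "solo-fpp", "squad-fpp", "duo"]
test_types = ["duo", "solo-fpp"]

demo_le = LabelEncoder()
train_codes = demo_le.fit_transform(train_types)  # learn the mapping on train
test_codes = demo_le.transform(test_types)        # reuse the same mapping on test

# Classes are stored in sorted order; the position is the integer code.
print(dict(zip(demo_le.classes_, range(len(demo_le.classes_)))))
print(train_codes, test_codes)

# A label that was absent during fit cannot be transformed.
try:
    demo_le.transform(["some-new-mode"])  # hypothetical unseen label
    unseen_ok = True
except ValueError:
    unseen_ok = False
print("unseen label accepted:", unseen_ok)
```

The integer codes carry an arbitrary order, which tree-based models tolerate better than linear ones.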

- **Splitting target and Data points**

In [None]:
data = data_train.drop("winPlacePerc",axis = 1)
target = data_train["winPlacePerc"]

- **Scaling numerical Features.**

In [None]:
from sklearn.preprocessing import StandardScaler
sc  = StandardScaler()

In [None]:
data = sc.fit_transform(data)
data_test = sc.transform(data_test)
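
The scaler is fitted on the training features only, and the test features are transformed with those same training statistics. A minimal sketch on toy arrays (the numbers are made up) shows what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices; the values are made up for illustration.
train_X = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
test_X = np.array([[2.0, 40.0]])

demo_sc = StandardScaler()
train_scaled = demo_sc.fit_transform(train_X)  # learns mean and std from train
test_scaled = demo_sc.transform(test_X)        # applies the train mean and std

print("train mean per column:", demo_sc.mean_)
print("scaled train mean:", train_scaled.mean(axis=0))
print("scaled test row:", test_scaled)
```

The scaled test row is not centered on zero, because it is measured against the training distribution rather than its own.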

### Splitting into train/test data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train,x_test, y_train,y_test = train_test_split(data,target, random_state = 1234, test_size = 0.2)

print(x_train.shape, x_test.shape)

## **3. Model**

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor
import time
import pickle

In [None]:
# define a pipeline to train a few models 
pipeline = {
      "LinearRegression" : make_pipeline(LinearRegression()),
      "RandomForestRegressor" : make_pipeline(RandomForestRegressor()),
      "XGBRFRegressor" : make_pipeline(XGBRFRegressor())
}

In [None]:
# train the models on a subset of the train data

fitted_models = {}

# use a separate loop variable so the `pipeline` dict is not overwritten
for algo,pipe in pipeline.items():
  model = pipe.fit(x_train[:50_000],y_train[:50_000])
  fitted_models[algo] = model


print("Finished training..")

In [None]:
fitted_models

In [None]:
# train score (R^2 on the first 10,000 training rows)
for model in fitted_models:
  print(f" Score for {model} is {fitted_models[model].score(x_train[:10_000],y_train[:10_000])}")

In [None]:
# test score (R^2 on the first 10,000 held-out rows)
for model in fitted_models:
  print(f" Score for {model} is {fitted_models[model].score(x_test[:10_000],y_test[:10_000])}")

- **Mean Absolute error**

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
pred_test = fitted_models["RandomForestRegressor"].predict(x_test[:10_000])

In [None]:
print(f"{mean_absolute_error(pred_test,y_test[:10_000]):.4f}")
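
Mean absolute error is the average of the absolute differences between the true and predicted values. Since `winPlacePerc` lies between 0 and 1, the MAE reads directly as an average miss in placement percentile. A small worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up targets and predictions on the winPlacePerc scale.
y_true = np.array([0.0, 0.5, 1.0, 0.25])
y_pred = np.array([0.1, 0.4, 0.7, 0.25])

# Absolute errors are 0.1, 0.1, 0.3 and 0.0, so the mean is 0.5 / 4.
manual_mae = np.mean(np.abs(y_true - y_pred))
sk_mae = mean_absolute_error(y_true, y_pred)

print(f"manual: {manual_mae:.4f}, sklearn: {sk_mae:.4f}")
```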

### I'll use Random Forest for hyperparameter tuning

In [None]:
# subset of the large data for hyperparameter tuning
x_train_small,x_test_small,y_train_small,y_test_small = x_train[:50_000],x_test[:30_000],y_train[:50_000],y_test[:30_000]

In [None]:
# function to train a model with the given parameters and evaluate it
def train_and_eval(x_train,y_train,x_test,y_test, **params):
    model = RandomForestRegressor(random_state=42, n_jobs = -1, **params)
    model.fit(x_train,y_train)
    train_mae = mean_absolute_error(model.predict(x_train),y_train)
    test_mae = mean_absolute_error(model.predict(x_test),y_test)
    return model,train_mae,test_mae

In [None]:
# test 1
model,train_mae,test_mae = train_and_eval(x_train_small,y_train_small,x_test_small,y_test_small)
print(f"Model : {model},\n\n train_mae : {train_mae:.4f}, test_mae : {test_mae:.4f}\n")

In [None]:
# test 2 
model,train_mae,test_mae = train_and_eval(x_train_small,y_train_small,x_test_small,y_test_small, n_estimators = 30, max_depth = 10, min_samples_leaf = 3)
print(f"Model : {model},\n\n train_mae : {train_mae:.4f}, test_mae : {test_mae:.4f}\n")

In [None]:
# test 3

model,train_mae,test_mae = train_and_eval(x_train_small,y_train_small,x_test_small,y_test_small, n_estimators = 50, max_depth = 5, min_samples_leaf = 3)
print(f"Model : {model},\n\n train_mae : {train_mae:.4f}, test_mae : {test_mae:.4f}\n")

In [None]:
# test 4

model,train_mae,test_mae = train_and_eval(x_train_small,y_train_small,x_test_small,y_test_small, n_estimators = 150, max_depth = 15, min_samples_leaf = 3)
print(f"Model : {model},\n\n train_mae : {train_mae:.4f}, test_mae : {test_mae:.4f}\n")
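
The manual trials above can also be written as a loop that collects every result in one table, which makes the comparison easier. This is only a sketch: it runs on synthetic data from `make_regression`, and the names `param_sets` and `results` are illustrative. The parameter values are the ones tried above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the PUBG features.
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# The same parameter combinations as the manual trials.
param_sets = [
    {"n_estimators": 30, "max_depth": 10, "min_samples_leaf": 3},
    {"n_estimators": 50, "max_depth": 5, "min_samples_leaf": 3},
    {"n_estimators": 150, "max_depth": 15, "min_samples_leaf": 3},
]

rows = []
for params in param_sets:
    rf = RandomForestRegressor(random_state=42, **params)
    rf.fit(X_tr, y_tr)
    rows.append({
        **params,
        "train_mae": mean_absolute_error(y_tr, rf.predict(X_tr)),
        "val_mae": mean_absolute_error(y_va, rf.predict(X_va)),
    })

# One row per trial, best validation error first.
results = pd.DataFrame(rows).sort_values("val_mae")
print(results)
```

A large gap between `train_mae` and `val_mae` in a row suggests that setting is overfitting.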

### Making Predictions

In [None]:
model = RandomForestRegressor(n_estimators = 110, max_depth = 13, min_samples_leaf = 3)

### Train

In [None]:
start = time.time()
model.fit(x_train,y_train)
end = time.time()

In [None]:
print(f"Finished training in {(end-start):.2f} seconds.")

In [None]:
# saving this model

with open("Model_RF.pkl","wb") as f:
    pickle.dump(model,f)
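
To reuse a saved model later, it can be loaded back with `pickle.load`. The sketch below round-trips a small forest trained on made-up data through a temporary file and checks that the restored model predicts the same values. The file name `demo_rf.pkl` is illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small made-up regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

small_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_rf.pkl")
    with open(path, "wb") as f:
        pickle.dump(small_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same_predictions = np.array_equal(small_model.predict(X), restored.predict(X))
print("restored model matches:", same_predictions)
```

Pickled scikit-learn models are best loaded with the same library version that saved them, and only from trusted sources.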

> Train score

In [None]:
pred_train = model.predict(x_train)
print(mean_absolute_error(pred_train,y_train))

> Test Score

In [None]:
pred_test = model.predict(x_test)
print(mean_absolute_error(pred_test,y_test))

### Visualizations and weights

In [None]:
from sklearn.tree import plot_tree, export_text

In [None]:
model.estimators_[0]

In [None]:
model.estimators_[109]

In [None]:
plt.figure(figsize = [20,15])
# pass feature names as a plain list; some scikit-learn versions reject a pandas Index
plot_tree(model.estimators_[0],max_depth = 2, feature_names=list(data_train.columns[:-1]),filled = True,rounded = True);

In [None]:
plt.figure(figsize = [20,15])
# pass feature names as a plain list; some scikit-learn versions reject a pandas Index
plot_tree(model.estimators_[109],max_depth = 2, feature_names=list(data_train.columns[:-1]),filled = True,rounded = True);

> Saving importances

In [None]:
importance_df = pd.DataFrame({
    'feature': data_train.columns[:-1],
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
importance_df[:10]

In [None]:
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');
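
For intuition on how to read these values: the impurity-based `feature_importances_` of a random forest sum to 1, and a feature that drives the target receives most of that weight. A minimal sketch on synthetic data, where the column names `signal`, `noise_a` and `noise_b` are made up:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# One informative column and two pure-noise columns.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Pair each importance with its column name and sort from high to low.
demo_importance = pd.Series(rf.feature_importances_, index=X.columns)
demo_importance = demo_importance.sort_values(ascending=False)
print(demo_importance)
```

Impurity-based importances can favor features with many distinct values, so they are best treated as a rough ranking.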

### **Summary.**
 - Downloaded a real-world dataset
 - Prepared the dataset
 - Performed EDA
 - Cleaned and split the data
 - Tested three models:
 > Linear Regression,
 > Random Forest,
 > XGBoost random forest (`XGBRFRegressor`)

 - Tuned Random Forest hyperparameters
 - Evaluated the model and made predictions

 Dataset from: https://www.kaggle.com/c/pubg-finish-placement-prediction

### Make predictions on test data and submit

In [None]:
sample = pd.read_csv("../input/pubg-finish-placement-prediction/sample_submission_V2.csv")

In [None]:
sample.shape

In [None]:
pd.DataFrame(data_test)

In [None]:
predictions = model.predict(data_test)

In [None]:
predictions[:10]

In [None]:
sample["winPlacePerc"] = predictions

In [None]:
sample

In [None]:
sample.to_csv("submission.csv",index = False)
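
Before uploading, a few quick checks on the predictions can catch common mistakes. The helper below is a hypothetical sketch, not part of the original workflow. It checks that the number of predictions matches the submission frame, that none are missing, and that all lie within the valid `winPlacePerc` range of 0 to 1. The demo frame and values are made up.

```python
import numpy as np
import pandas as pd

def check_submission(sample_df, preds):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(preds) != len(sample_df):
        problems.append("row count mismatch")
    if np.isnan(preds).any():
        problems.append("NaN predictions")
    if (preds < 0).any() or (preds > 1).any():
        problems.append("predictions outside [0, 1]")
    return problems

# Made-up stand-ins for `sample` and `predictions`.
demo_sample = pd.DataFrame({"Id": ["a", "b", "c"], "winPlacePerc": [1.0, 1.0, 1.0]})
demo_predictions = np.array([0.12, 0.87, 0.5])

print(check_submission(demo_sample, demo_predictions))       # passes all checks
print(check_submission(demo_sample, np.array([0.1, 1.2])))   # two problems
```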