#  PUBG EDA & Baseline Model

![](https://cnet3.cbsistatic.com/img/eByo0DXxBrKt_cftmlqVzapL6qA=/970x0/2017/12/16/765799c3-237e-4c8a-8d7e-3a5a14dedc0a/pubg.jpg)

### This is a comprehensive kernel covering exploratory data analysis and prediction with LightGBM (LGB), XGBoost (XGB) and linear regression (LR)
#### Here is the EDA part.

## Libraries and data loading

In [None]:
import numpy as np
import pandas as pd
from sklearn import model_selection, preprocessing, metrics
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.width', 1000) 

pd.set_option('display.max_rows', 200) 

pd.set_option('display.max_columns', 200) 

df_train =  pd.read_csv('../input/train.csv')

df_test = pd.read_csv('../input/test.csv')


### Check the training set

In [None]:
df_train.info()

In [None]:
df_train['Id'].nunique()

In [None]:
df_train['groupId'].nunique()

In [None]:
df_train['matchId'].nunique()

4357336 players took part; they formed 1888732 groups and played 47734 matches.
These numbers agree with common sense about PUBG: a group consists of a few players, typically up to 3 or 4 (the counts above give about 2.3 per group on average), and about 100 players can play in a single match (about 91 on average here).
![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTzlJgrVHqRHJKeZukHtyx6h9-uQbZ7ZcGerKFnknk6KcmaMTbCSQ)
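The per-group and per-match averages can be checked directly from the ID columns. What follows is only a minimal sketch on a tiny made-up frame: the `toy` data is hypothetical, and only the column names `Id`, `groupId` and `matchId` come from the training set. The same two ratios can be computed on `df_train`.

```python
import pandas as pd

# Hypothetical toy data: 6 players, 3 groups, 2 matches
toy = pd.DataFrame({
    'Id':      [1, 2, 3, 4, 5, 6],
    'groupId': [10, 10, 11, 20, 20, 20],
    'matchId': [100, 100, 100, 200, 200, 200],
})

# Average number of players per group and per match
players_per_group = toy['Id'].nunique() / toy['groupId'].nunique()
players_per_match = toy['Id'].nunique() / toy['matchId'].nunique()

print(players_per_group, players_per_match)
```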

## Explore distribution of single variable

In [None]:
#================== EDA =======================================================

# ---------- single distributions ---------

plt.hist(df_train['winPlacePerc'])
plt.xlabel("winPlacePerc") 
plt.ylabel("count") 
plt.title('Distribution of winPlacePerc')

**winPlacePerc** is the target we are going to predict on the test set.

![](https://i.ytimg.com/vi/EY_9IVJE8MU/maxresdefault.jpg)

Its distribution on the training set is far from a normal distribution. It is closer to the opposite: values near 0 and 1 are visibly more frequent than values in the middle.
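One way to put a number on that shape is to compare the counts in the outer bins with the counts in the central bins. The sketch below uses a synthetic U-shaped sample (`toy_target`, drawn from a Beta(0.5, 0.5) distribution) purely as a stand-in; it is an assumption for illustration, not the real target, and `df_train['winPlacePerc']` would take its place.

```python
import numpy as np

# Hypothetical stand-in for df_train['winPlacePerc']: a U-shaped sample on [0, 1]
rng = np.random.default_rng(0)
toy_target = rng.beta(0.5, 0.5, size=10_000)

# Count values in 10 equal-width bins over [0, 1]
counts, edges = np.histogram(toy_target, bins=10, range=(0.0, 1.0))

# Compare the two outer bins with the two central bins
outer = counts[0] + counts[-1]
middle = counts[4] + counts[5]
print(counts)
print('outer:', outer, 'middle:', middle)
```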


In [None]:
plt.figure(figsize=[10,6])
df_train['assists'].value_counts().plot(kind='bar')
plt.title("Distribution of assists") 
plt.ylabel("count") 
plt.show()
print(df_train['assists'].value_counts())

In [None]:
plt.figure(figsize=[10,6])
df_train['boosts'].value_counts().plot(kind='bar')
plt.title("Distribution of boosts") 
plt.ylabel("count") 
plt.show()
print(df_train['boosts'].value_counts())

In [None]:
plt.figure(figsize=[10,6])
(df_train.loc[df_train['damageDealt']>500, 'damageDealt'].astype(float)).value_counts().plot(kind='bar')
plt.title("Distribution of damageDealt") 
plt.ylabel("count") 
plt.show()


Here we show only the players whose damageDealt is above 500.
The plot shows that the counts fall off smoothly as damageDealt increases.
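Because damageDealt is continuous, counting exact values gives one bar per distinct number. A possible alternative, sketched here on made-up values, is to bin the column first with `pd.cut` and count rows per bin. The `toy_damage` series and the bin width of 250 are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical damage values standing in for df_train['damageDealt'] above 500
toy_damage = pd.Series([520.0, 610.5, 530.2, 1200.0, 780.9, 505.1, 950.0, 640.3])

# Group the continuous values into fixed-width bins (width 250 is an arbitrary choice)
bins = [500, 750, 1000, 1250]
binned = pd.cut(toy_damage, bins=bins)

# Count rows per bin, keeping the bins in numeric order
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```

The resulting `bin_counts` can be passed to `.plot(kind='bar')` in the same way as the other cells.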

In [None]:
plt.figure(figsize=[10,6])
df_train['DBNOs'].value_counts().plot(kind='bar')
plt.title("Distribution of DBNOs") 
plt.ylabel("count") 
plt.show()
print(df_train['DBNOs'].value_counts())

PS: DBNO means **'down but not out'** in PUBG. As experienced PUBG players know, you often cannot kill an enemy you encounter outright but only knock them down, and they can still be revived by their teammates.
 
![](http://gameplay.tips/uploads/posts/2017-09/1505063822_8.jpg) 

In [None]:
plt.figure(figsize=[10,6])
df_train['headshotKills'].value_counts().plot(kind='bar')
plt.title("Distribution of headshotKills") 
plt.ylabel("count") 
plt.show()
print(df_train['headshotKills'].value_counts())

In [None]:
plt.figure(figsize=[10,6])
df_train['heals'].value_counts().plot(kind='bar')
plt.title("Distribution of heals") 
plt.ylabel("count") 
plt.show()
print(df_train['heals'].value_counts())

PS: `heals` is described in the competition data as the **number of healing items used**. We might naively infer that the more healing items a player uses, the more likely they are to finish with a higher rank.
![](https://media0dk-a.akamaihd.net/80/59/2df2682d732d90d2ec1231282817cf45.jpg)

In [None]:
plt.figure(figsize=[18,4])
df_train['killPlace'].value_counts().plot(kind='bar')
plt.title("Distribution of killPlace") 
plt.ylabel("count") 
plt.show()
print(df_train['killPlace'].value_counts())

In [None]:
df_train['matchId'].nunique()

I repeat the number of matches above to show that the highest count for any killPlace value equals the number of matches.
The killPlace counts are flat from 1 to about 95 and then fall off slowly toward 100, which indicates that killPlace is the player's position on the kill leaderboard of a match, ranging from 1 to 100.
The drop between 90 and 100 arises because a match does not always have 100 players; ninety-odd players are enough to start a game.
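This reading of killPlace can be tested per match. The sketch below runs on a hypothetical two-match frame (the `toy` values are invented; only the column names come from the data) and checks that the largest killPlace in a match equals its player count, and that low killPlace values occur in every match while high ones occur in fewer.

```python
import pandas as pd

# Hypothetical toy data: match 100 has 3 players, match 200 has 2
toy = pd.DataFrame({
    'matchId':   [100, 100, 100, 200, 200],
    'killPlace': [1, 2, 3, 1, 2],
})

# Per match: number of players and the largest killPlace value
per_match = toy.groupby('matchId')['killPlace'].agg(['size', 'max'])
print(per_match)

# Number of matches in which each killPlace value occurs
matches_per_place = toy.groupby('killPlace')['matchId'].nunique()
print(matches_per_place)
```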

In [None]:
plt.figure(figsize=[18,4])
df_train['kills'].value_counts().plot(kind='bar')
plt.title("Distribution of kills") 
plt.ylabel("count") 
plt.show()
print(df_train['kills'].value_counts())


In [None]:
plt.figure(figsize=[18,4])
df_train['killStreaks'].value_counts().plot(kind='bar')
plt.title("Distribution of killStreaks") 
plt.ylabel("count") 
plt.show()
print(df_train['killStreaks'].value_counts())

In [None]:
plt.figure(figsize=[18,4])
df_train['maxPlace'].value_counts().plot(kind='bar')
plt.title("Distribution of maxPlace") 
plt.ylabel("count") 
plt.show()
print(df_train['maxPlace'].value_counts())

In [None]:
plt.figure(figsize=[18,4])
df_train['numGroups'].value_counts().plot(kind='bar')
plt.title("Distribution of numGroups") 
plt.ylabel("count") 
plt.show()
print(df_train['numGroups'].value_counts())

In [None]:
print(df_train['revives'].value_counts())

In [None]:
plt.figure(figsize=[10,6])
(df_train.loc[df_train['longestKill']>0, 'longestKill'].astype(float)).value_counts().plot(kind='bar')
plt.title("Distribution of longestKill") 
plt.ylabel("count") 
plt.show()

In [None]:
print(df_train['teamKills'].value_counts())
# teamKills is not a valuable variable

In [None]:
print(df_train['vehicleDestroys'].value_counts())

In [None]:
print(df_train['weaponsAcquired'].value_counts())

## See the variables' correlation with target

In [None]:
# ---------------- correlation --------------

# variable correlation 
correlation = df_train.corr()
correlation = correlation['winPlacePerc'].sort_values(ascending=False)
print(correlation.head(20))

In [None]:
sns.heatmap(df_train.corr(),annot=True,cmap='RdYlGn',linewidths=0.2) 
fig=plt.gcf()
fig.set_size_inches(20,16)
plt.show()

### View the non-sparse variables as tables

In [None]:
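train_ = df_train  # keeps a handle on the full training frame (df_train is reassigned later)

def show_count_sum(df, col, n=10):
    # Despite the name, this reports the count and the mean of winPlacePerc
    # for the n most frequent values of `col`
    return df.groupby(col).agg({'winPlacePerc': ['count', 'mean']}).sort_values(('winPlacePerc', 'count'), ascending=False).head(n)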

In [None]:
show_count_sum(train_, 'assists')

In [None]:
show_count_sum(train_, 'boosts')

In [None]:
show_count_sum(train_, 'DBNOs')

In [None]:
show_count_sum(train_, 'headshotKills')

In [None]:
show_count_sum(train_, 'heals')

In [None]:
show_count_sum(train_, 'killPlace')

In [None]:
show_count_sum(train_, 'killPoints')

In [None]:
show_count_sum(train_, 'kills')

In [None]:
show_count_sum(train_, 'killStreaks')

In [None]:
show_count_sum(train_, 'maxPlace')

In [None]:
show_count_sum(train_, 'numGroups')

In [None]:
show_count_sum(train_, 'revives')

In [None]:
show_count_sum(train_, 'vehicleDestroys')

In [None]:
show_count_sum(train_, 'weaponsAcquired')

In [None]:
show_count_sum(train_, 'winPoints')

### View the sparse variables as scatter plots

In [None]:
data = pd.concat([train_['damageDealt'], train_['winPlacePerc']], axis=1)
data.plot.scatter(x='damageDealt', y='winPlacePerc')

In [None]:
data = pd.concat([train_['killPoints'], train_['winPlacePerc']], axis=1)
data.plot.scatter(x='killPoints', y='winPlacePerc')


In [None]:
data = pd.concat([train_['longestKill'], train_['winPlacePerc']], axis=1)
data.plot.scatter(x='longestKill', y='winPlacePerc')


In [None]:
data = pd.concat([train_['rideDistance'], train_['winPlacePerc']], axis=1)
data.plot.scatter(x='rideDistance', y='winPlacePerc')

In [None]:
data = pd.concat([train_['walkDistance'], train_['winPlacePerc']], axis=1)
data.plot.scatter(x='walkDistance', y='winPlacePerc')

In [None]:
data = pd.concat([train_['winPoints'], train_['winPlacePerc']], axis=1)
data.plot.scatter(x='winPoints', y='winPlacePerc')

# Predicting

## Data Preparation

In [None]:
#====================== Predicting ============================================

Y = (df_train['winPlacePerc'].astype(float)).values

sum_id = df_test["Id"].values

df_train = df_train.drop(['Id','groupId','matchId','winPlacePerc'], axis = 1)
                          
df_test= df_test.drop(['Id','groupId','matchId'], axis = 1)

**Reference:** The code below is by Joao, from https://www.kaggle.com/joaopmpeinado/winner-winner-chicken-dinner

It is needed mainly because players are ranked together as a group within a match, so the per-player predictions have to be averaged per group and then ranked within each match.

In [None]:
'''
lgb_pred[lgb_pred > 1] = 1
    
test  = pd.read_csv('../input/test.csv')
test['winPlacePercPred'] = lgb_pred
aux = test.groupby(['matchId','groupId'])['winPlacePercPred'].agg('mean').groupby('matchId').rank(pct=True).reset_index()
aux.columns = ['matchId','groupId','winPlacePerc']
test = test.merge(aux, how='left', on=['matchId','groupId'])
    
subm = test[['Id','winPlacePerc']]
    
subm.to_csv("LGB.csv", index=False)
'''
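To see what this post-processing does, here is a minimal sketch of the same steps on one hypothetical match with three groups. The `toy` frame and its prediction values are invented for illustration; the column names and the groupby/rank/merge calls follow the referenced code.

```python
import pandas as pd

# Hypothetical per-player predictions for one match with three groups
toy = pd.DataFrame({
    'Id':      [1, 2, 3, 4, 5],
    'matchId': [100, 100, 100, 100, 100],
    'groupId': [10, 10, 11, 12, 12],
    'winPlacePercPred': [0.9, 0.7, 0.2, 0.5, 0.4],
})

# Mean prediction per group, then percentile rank of the groups inside each match
group_mean = toy.groupby(['matchId', 'groupId'])['winPlacePercPred'].mean()
group_rank = group_mean.groupby('matchId').rank(pct=True).reset_index()
group_rank.columns = ['matchId', 'groupId', 'winPlacePerc']

# Give every player the rank of their group
ranked = toy.merge(group_rank, how='left', on=['matchId', 'groupId'])
print(ranked[['Id', 'groupId', 'winPlacePerc']])
```

Players in the same group end up with the same value, and the groups are ordered by their mean prediction within the match.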

## LGB

In [None]:
#=========================== lgb =================================== 

import lightgbm as lgb

model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

model_lgb.fit(df_train, Y)
lgb_pred = model_lgb.predict(df_test)

lgb_pred[lgb_pred > 1] = 1
    
test  = pd.read_csv('../input/test.csv')
test['winPlacePercPred'] = lgb_pred
aux = test.groupby(['matchId','groupId'])['winPlacePercPred'].agg('mean').groupby('matchId').rank(pct=True).reset_index()
aux.columns = ['matchId','groupId','winPlacePerc']
test = test.merge(aux, how='left', on=['matchId','groupId'])
    
subm = test[['Id','winPlacePerc']]
    
subm.to_csv("LGB.csv", index=False)

## XGB

In [None]:
#=========================== xgboost ===================================

#----------------- 1 ------------------ 

import xgboost as xgb 

dtrain = xgb.DMatrix(df_train, label=Y)
dtest = xgb.DMatrix(df_test)

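params = {'max_depth':7,
          'eta':1,                   # NOTE: 'eta' is an alias of 'learning_rate'; setting both is ambiguous, keep only one
          'silent':1,                # replaced by 'verbosity' in newer XGBoost releases
          'objective':'reg:linear',  # older name of 'reg:squarederror'
          'eval_metric':'rmse',
          'learning_rate':0.05
         }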
num_rounds = 50

xb = xgb.train(params, dtrain, num_rounds)

y_pred_xgb = xb.predict(dtest)

y_pred_xgb[y_pred_xgb > 1] = 1
    
test  = pd.read_csv('../input/test.csv')
test['winPlacePercPred'] = y_pred_xgb
aux = test.groupby(['matchId','groupId'])['winPlacePercPred'].agg('mean').groupby('matchId').rank(pct=True).reset_index()
aux.columns = ['matchId','groupId','winPlacePerc']
test = test.merge(aux, how='left', on=['matchId','groupId'])
    
subm = test[['Id','winPlacePerc']]
    
subm.to_csv("XGB1.csv", index=False)

In [None]:
#----------------- 2 ------------------ Score: 

import xgboost as xgb

dtrain = xgb.DMatrix(df_train, label = Y)
dtest = xgb.DMatrix(df_test)

params = {"max_depth":2, "eta":0.1}
model = xgb.cv(params, dtrain,  num_boost_round=500, early_stopping_rounds=100)

model_xgb = xgb.XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1) #the params were tuned using xgb.cv
model_xgb.fit(df_train, Y)

xgb_preds = model_xgb.predict(df_test)

xgb_preds[xgb_preds > 1] = 1
    
test  = pd.read_csv('../input/test.csv')
test['winPlacePercPred'] = xgb_preds
aux = test.groupby(['matchId','groupId'])['winPlacePercPred'].agg('mean').groupby('matchId').rank(pct=True).reset_index()
aux.columns = ['matchId','groupId','winPlacePerc']
test = test.merge(aux, how='left', on=['matchId','groupId'])
    
subm = test[['Id','winPlacePerc']]
    
subm.to_csv("XGB2.csv", index=False)

## Model Stacking

In [None]:
# Weighted average (blend) of the three models' predictions; the weights sum to 1
sub = xgb_preds * 0.1 + y_pred_xgb * 0.65 + lgb_pred * 0.25 # Score:

sub[sub > 1] = 1
    
test  = pd.read_csv('../input/test.csv')
test['winPlacePercPred'] = sub
aux = test.groupby(['matchId','groupId'])['winPlacePercPred'].agg('mean').groupby('matchId').rank(pct=True).reset_index()
aux.columns = ['matchId','groupId','winPlacePerc']
test = test.merge(aux, how='left', on=['matchId','groupId'])
    
subm = test[['Id','winPlacePerc']]
    
subm.to_csv("Stacked_1.csv", index=False)
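The combination in this cell is a fixed-weight average of the three prediction vectors rather than a trained meta-model. A small sketch with made-up numbers shows the arithmetic; the three `pred_*` arrays are hypothetical, and only the weights 0.1, 0.65 and 0.25 come from the cell above.

```python
import numpy as np

# Hypothetical predictions from three models for four players
pred_a = np.array([0.20, 0.95, 0.50, 1.10])
pred_b = np.array([0.30, 0.90, 0.40, 1.00])
pred_c = np.array([0.10, 1.05, 0.60, 0.90])

# Same weights as in the cell above; they sum to 1
weights = np.array([0.1, 0.65, 0.25])
blend = weights[0] * pred_a + weights[1] * pred_b + weights[2] * pred_c

# Clip to [0, 1]; note the original cell only caps values above 1
blend = np.clip(blend, 0.0, 1.0)
print(blend)
```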

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

LR = LinearRegression()

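# Hold out 30% of the rows before fitting, so the validation score is not leaked
X_train, X_val, y_train, y_val = train_test_split(df_train, Y, test_size=0.3, random_state=42)
LR.fit(X_train, y_train)

# LinearRegression.score returns R^2, not accuracy
print('R^2 on training:\n', LR.score(X_train, y_train))
print('R^2 on validation:\n', LR.score(X_val, y_val))

# Refit on the full training set for the final predictions
LR.fit(df_train, Y)
print('LinearRegression R^2 on the full training set:\n', LR.score(df_train, Y))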

pred = LR.predict(df_test)
  
pred = pd.DataFrame({'Id':sum_id, 'winPlacePerc':pred}) 

pred.to_csv('pred_Linear.csv',index=None) 

## Thanks for reading all the way through.
## If you found this kernel helpful, please upvote it. Much appreciated.


# PS: Good Luck and have a Dinner!
![](https://www.hindustantimes.com/rf/image_size_960x540/HT/p2/2018/06/30/Pictures/_802421c8-7c33-11e8-8d5f-3f0c905295d2.jpg)

