# PUBG prediction challenge

I play a lot of PUBG Mobile, so I found this competition pretty interesting.
I hope gaming-related competitions like this continue. This is my first kernel. Let's get to it!



## Data columns

- __DBNOs__ - Number of enemy players knocked.
- __assists__ - Number of enemy players this player damaged that were killed by teammates.
- __boosts__ - Number of boost items used.
- __damageDealt__ - Total damage dealt. Note: Self inflicted damage is subtracted.
- __headshotKills__ - Number of enemy players killed with headshots.
- heals - Number of healing items used.
- Id - Player’s Id
- killPlace - Ranking in match of number of enemy players killed.
- killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
- killStreaks - Max number of enemy players killed in a short amount of time.
- kills - Number of enemy players killed.
- longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
- matchDuration - Duration of match in seconds.
- matchId - ID to identify match. There are no matches that are in both the training and testing set.
- matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
- rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
- revives - Number of times this player revived teammates.
- rideDistance - Total distance traveled in vehicles measured in meters.
- roadKills - Number of kills while in a vehicle.
- swimDistance - Total distance traveled by swimming measured in meters.
- teamKills - Number of times this player killed a teammate.
- vehicleDestroys - Number of vehicles destroyed.
- walkDistance - Total distance traveled on foot measured in meters.
- weaponsAcquired - Number of weapons picked up.
- winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
- groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
- numGroups - Number of groups we have data for in the match.
- maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
- __winPlacePerc__ - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
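
The "None" conventions for `rankPoints`, `killPoints` and `winPoints` described above are easy to miss, so here is a minimal sketch of how they could be turned into real missing values. The helper name `mask_missing_ratings` and the toy frame are my own assumptions for illustration; they are not part of the competition data or of the pipeline below.

```python
import numpy as np
import pandas as pd

def mask_missing_ratings(df):
    """Return a copy where placeholder ratings are replaced by NaN (hypothetical helper)."""
    out = df.copy()
    # Cast to float so the columns can hold NaN.
    for col in ["rankPoints", "killPoints", "winPoints"]:
        out[col] = out[col].astype("float64")
    has_rank = out["rankPoints"] != -1
    # A 0 in killPoints/winPoints means "None" when rankPoints holds a real value.
    out.loc[has_rank & (out["killPoints"] == 0), "killPoints"] = np.nan
    out.loc[has_rank & (out["winPoints"] == 0), "winPoints"] = np.nan
    # A -1 in rankPoints takes the place of "None".
    out.loc[~has_rank, "rankPoints"] = np.nan
    return out

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "rankPoints": [-1, 1500, 1480],
    "killPoints": [1200, 0, 0],
    "winPoints": [1450, 0, 0],
})
masked = mask_missing_ratings(toy)
print(masked)
```

Whether NaN or a separate indicator column works better is a modelling choice; the feature engineering later in this notebook uses indicator columns instead.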

In [None]:
# debug mode
debug = False
debug_rows = 10000

# import
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import xgboost as xgb
import lightgbm as lgb

from xgboost import XGBRegressor
from sklearn import model_selection
from sklearn.metrics import confusion_matrix, mean_squared_error, mean_absolute_error

import gc, sys
gc.enable()

pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', 100)

%matplotlib inline
# if(debug):
#     plt.style.use("dark_background")


In [None]:
if debug:
    # debug mode: read only the first debug_rows rows of the training data
    train = pd.read_csv('../input/train_V2.csv', nrows = debug_rows)
    test  = pd.read_csv('../input/test_V2.csv')
else:
    # full run: read the whole training data
    train = pd.read_csv('../input/train_V2.csv')
    test  = pd.read_csv('../input/test_V2.csv')

In [None]:
train.shape

In [None]:
test.shape

### Dropping NA value

In [None]:
train.isnull().sum()

In [None]:
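# dropna returns a new frame, so assign it back; how='any' drops every row that has a missing value
train = train.dropna(axis=0, how='any')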
print(train.isnull().any().any())
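
Two details of `dropna` matter for this step: it returns a new frame instead of modifying the original, and `how='all'` only drops rows where every column is missing. A small sketch on a made-up toy frame (the values are assumptions, not competition data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "kills": [1.0, 2.0, np.nan],
    "winPlacePerc": [0.5, np.nan, np.nan],
})

dropped_all = toy.dropna(axis=0, how="all")  # drops only the row where every value is NaN
dropped_any = toy.dropna(axis=0, how="any")  # drops every row that contains a NaN

# toy itself is unchanged because dropna is not in-place by default.
print(len(toy), len(dropped_all), len(dropped_any))
```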

## Data tweaking

### Memory reduction

First, reduce memory usage. Credit to [this kernel](https://www.kaggle.com/gemartin/load-data-reduce-memory-usage).

In [None]:
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    #start_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    #end_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

# train = reduce_mem_usage(train)
# test = reduce_mem_usage(test)

In [None]:
train.describe()

In [None]:
train.quantile(q=[0.10, 0.90], numeric_only=True)


## EDA

### Assists

In [None]:
assists = train['assists']
assists.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=assists, y=train['winPlacePerc'], ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.histplot(assists, kde=False, ax=ax)  # distplot is deprecated in recent seaborn versions

In [None]:
assists[assists>10].count()

___Findings___
- Players with more assists tend to place higher.
- 5 or more assists: rare.
- 10 or more assists: cheater...? (TODO: check large assist counts for cheaters again)
- More assists = teammates got the kills = the team is winning or strong = high winPlacePerc. Consider adding the team's kill count as a feature.
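
The last finding can be sketched with a grouped sum. This is only an illustration on made-up rows; the column names `groupKills` and `teammateKills` are hypothetical (I avoid `teamKills` because that column already means kills of one's own teammates).

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m2"],
    "groupId": ["g1", "g1", "g2", "g3"],
    "kills":   [3, 1, 2, 0],
    "assists": [1, 2, 0, 0],
})

# Sum kills over each (matchId, groupId) team and broadcast the sum back to every player.
toy["groupKills"] = toy.groupby(["matchId", "groupId"])["kills"].transform("sum")
# Kills made by the player's teammates only.
toy["teammateKills"] = toy["groupKills"] - toy["kills"]
print(toy)
```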

### Boosts

In [None]:
boosts = train['boosts']
boosts.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=boosts, y=train['winPlacePerc'], ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.histplot(boosts, kde=False, ax=ax)  # distplot is deprecated in recent seaborn versions

In [None]:
train[['boosts', 'heals']].corr()

___Findings___
- More boosts, more chicken dinners.
- Around 15 boosts: probably not a cheater, just someone who uses a lot of boost items?

### DamageDealt

In [None]:
eda = train['damageDealt']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=eda, y=train['winPlacePerc'], fit_reg=False, ax=ax)

In [None]:
eda[eda>15].count()

___Findings___
- More damage dealt, more chicken dinners.
- There are players with winPlacePerc 0.0 and 0 to 1000 damage. Cheaters? Yes, there are actually a lot of cheaters here as well: walkDistance == 0.0. Consider removing these.
- Too much walking...?

In [None]:
train[(train['damageDealt']>400) & (train['winPlacePerc']==0.0) & (train['matchType']=='solo')].head()

### DBNOs

In [None]:
eda = train['DBNOs']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'], ax=ax)

In [None]:
eda[eda>10].count()

### WalkDistance

In [None]:
eda = train['walkDistance']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=eda, y=train['winPlacePerc'], fit_reg=False,ax=ax)

In [None]:
cheater = train[(train['walkDistance']<=50.0)&(train['damageDealt']>0.0)]
cheater.head()

___Findings___

- Very small walkDistance (less than a meter) with kills, weapons, or damage == ___mostly cheaters:___ moving less than a meter and still picking up weapons is super weird.
- Rows with very small walkDistance plus weapons, heals, damage, kills, headshots, etc. should be removed from the training data.
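
Before dropping anything, it can help to build a boolean mask and count the flagged rows first. A minimal sketch: the helper `flag_suspicious`, its 10 m default and the toy rows are assumptions of mine, not the exact rule used later.

```python
import pandas as pd

def flag_suspicious(df, max_walk=10.0):
    """Boolean mask: almost no walking but combat or loot activity recorded (hypothetical rule)."""
    barely_moved = df["walkDistance"] < max_walk
    active = (
        (df["kills"] > 0)
        | (df["damageDealt"] > 0)
        | (df["heals"] > 0)
        | (df["weaponsAcquired"] > 0)
    )
    return barely_moved & active

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "walkDistance":    [0.5, 0.0, 2500.0, 3.0],
    "kills":           [2, 0, 5, 0],
    "damageDealt":     [180.0, 0.0, 450.0, 0.0],
    "heals":           [0, 0, 3, 0],
    "weaponsAcquired": [1, 0, 6, 0],
})
mask = flag_suspicious(toy)
print(mask.sum(), "suspicious rows")
clean = toy[~mask]
```

Note that this ignores rideDistance, so a player who only ever drove would be flagged too; that may or may not be what you want.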

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=cheater['walkDistance'], y=cheater['winPlacePerc'], fit_reg=False,ax=ax)

In [None]:
train[train['walkDistance']>10000]

cheaters on the left!

### HeadshotKills

In [None]:
eda = train['headshotKills']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'],ax=ax)

In [None]:
train[train['headshotKills']>10]

___Findings___

- Those who have a _DeadEye_ are likely to get the chicken dinner.
- Is it that weird to have 10 headshots...? Not definitely cheaters, just _DeadEye_ s.

### Heals

In [None]:
eda = train['heals']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'],ax=ax)

In [None]:
train[train['heals']>40].describe()

___Findings___

- More heals = more survival.
- An insane number of heals (30~): not a cheater...? Maybe just drinking a lot of energy drinks.
- Oh, perhaps it's bandages: healing with them adds 5 to 10 to the count.

### killPlace

In [None]:
eda = train['killPlace']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'],ax=ax)

Weird! This curve of killPlace against winPlacePerc could be caused by the __area phase__ of the game (once the area starts to shrink, there are more shootouts but also a higher risk of being killed).

___Findings___

- Need to find the rationale for this 'W'-shaped curve.

### killPoints

In [None]:
eda = train['killPoints']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=eda, y=train['winPlacePerc'],fit_reg=False, ax=ax)

___Findings___

- Players between 10 and 800 killPoints are poor at surviving. Consider separating this range out and using it as a feature.
- Players with 0 points are spread across the whole range. Consider using these as another feature as well.

### kills
Well, if you kill everyone in the game, you get the chicken dinner :) :)

In [None]:
eda = train['kills']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'], ax=ax)

This graph looks the same as the ones for headshotKills and damageDealt. Take a look at the correlations.

In [None]:
sns.heatmap(train[['damageDealt','kills', 'headshotKills', 'DBNOs']].corr(), annot=True)

There is a serial killer. Let's check whether there are any cheaters.

In [None]:
train[train['kills']>30].describe()

In [None]:
sns.regplot(x=train[train['kills']>10]['kills'], y=train[train['kills']>10]['walkDistance'], fit_reg=False)

___Findings___

- kills and damageDealt are multicollinear. Consider removing one of them?
- Cheaters: ~70 kills; there are serial kills without much traveling (~1000m). Consider removing these as cheaters.
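
To make the multicollinearity check less dependent on eyeballing the heatmap, the strongly correlated pairs can be listed directly. A minimal sketch: the helper `highly_correlated_pairs`, the 0.8 threshold and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds threshold (hypothetical helper)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy values, made up for illustration only.
toy = pd.DataFrame({
    "kills":        [0, 1, 2, 3, 4, 5],
    "damageDealt":  [10, 120, 190, 310, 405, 500],
    "walkDistance": [3000, 100, 2500, 50, 1800, 900],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data you would pass something like `train[['damageDealt', 'kills', 'headshotKills', 'DBNOs']]`. For tree models such as LightGBM, correlated features are usually less harmful than for linear models, so dropping them is optional.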

### killstreaks
killing spree!

In [None]:
eda = train['killStreaks']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'], ax=ax)

This graph looks the same as the ones for headshotKills and damageDealt. Take a look at the correlations.

In [None]:
sns.heatmap(train[['damageDealt','kills', 'headshotKills', 'DBNOs', 'killStreaks']].corr(), annot=True)

High correlation with 'kills'.

In [None]:
train[train['killStreaks']>10].head()

___Findings___

- kills and damageDealt are multicollinear. Consider removing one of them?
- Cheaters: players with ~10 killStreaks are likely to be cheaters.

### longestKill

In [None]:
eda = train['longestKill']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=eda, y=train['winPlacePerc'],fit_reg=False, ax=ax)

In [None]:
train[(train['longestKill']<0.01) & (train['kills']!=0)].describe()

In [None]:
sns.regplot(x=train[(train['longestKill']<1.0) & (train['kills']!=0)]['longestKill'], y=train[(train['longestKill']<1.0) & (train['kills']!=0)]['winPlacePerc'], fit_reg=False)

___Findings___

- A longestKill of ~0.8 could be CQC (close-quarters combat) range.

### matchDuration

In [None]:
eda = train['matchDuration']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=eda, y=train['winPlacePerc'],fit_reg=False, ax=ax)


### rankPoints

In [None]:
eda = train['rankPoints']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=eda, y=train['winPlacePerc'],fit_reg=False, ax=ax)

In [None]:
train[train['rankPoints']==-1].describe()

___Findings___

- rankPoints == -1 are new players. Consider separating these out and turning them into a feature.

### revives

In [None]:
eda = train['revives']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'], ax=ax)

___Findings___

- More revives = more wins.

### roadkills

In [None]:
eda = train['roadKills']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'],ax=ax)

In [None]:
train[(train['rideDistance']==0)&(train['roadKills']>0)]

___Findings___

- There are rows with rideDistance == 0 and roadKills > 0. Consider setting roadKills = 0 for these rows for better accuracy.
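
A minimal sketch of that fix, on made-up rows. Note that the feature engineering cell further down drops such rows instead of zeroing them; this is just the alternative mentioned in the finding.

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "rideDistance": [0.0, 1200.0, 0.0],
    "roadKills":    [1, 2, 0],
})

fixed = toy.copy()
# Road kills without any distance traveled in a vehicle are inconsistent.
inconsistent = (fixed["rideDistance"] == 0) & (fixed["roadKills"] > 0)
fixed.loc[inconsistent, "roadKills"] = 0
print(fixed)
```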

### swimDistance

In [None]:
eda = train['swimDistance']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=eda, y=train['winPlacePerc'],fit_reg=False, ax=ax)


### TeamKills
Shame on TK!

In [None]:
eda = train['teamKills']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'], ax=ax)

In [None]:
train[train['teamKills']>3]

___Findings___

- How did they kill teammates 4 or more times...?

### vehicleDestroys

In [None]:
eda = train['vehicleDestroys']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'], ax=ax)

___Findings___

- Not a big correlation with kills and damage, but the win rate is higher with more vehicle destroys.

### walkDistance

In [None]:
eda = train['walkDistance']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=eda, y=train['winPlacePerc'],fit_reg=False, ax=ax)

___Findings___

- 'Winners' travel a lot.

### weaponsAcquired

In [None]:
eda = train['weaponsAcquired']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.boxplot(x=eda, y=train['winPlacePerc'], ax=ax)

In [None]:
train[(train['weaponsAcquired']>30) & (train['walkDistance']<100)].describe()

___Findings___

- There are suspected cheaters (walking less than 100m but still acquiring 30~ weapons and getting a lot of kills). Consider removing such players from the training data as well.

### winpoints

In [None]:
eda = train['winPoints']
eda.describe()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.regplot(x=eda, y=train['winPlacePerc'],fit_reg=False, ax=ax)

___Findings___

- 0 or -1 is treated as 'unrated'. Consider separating these out and creating a new feature.
- winPoints of ~250 to 1200 are losers. Consider separating these out as a 'loser' feature.

## Feature engineering

TODOs left in the ___Findings___ sections of the notebook:

- Assists imply kills by teammates: make a team kill-sum feature?
- walkDistance under ~10.0 with some damage, kills, weapons, heals, or headshots: remove.
- killPoints == 0 -> unrated players. Separate these out and make a new column for them.
- killPoints 10~800 are poor at surviving. Separate these out and make a one-hot 'poor kills' feature.
- kills and damageDealt are multicollinear. Make a kills-per-damageDealt column?
- killStreaks: 10~ are likely to be cheaters.
- rankPoints == -1: new players. Consider separating these out into a new feature.
- rideDistance == 0 and roadKills > 0 should be fixed.
- Walked < 100m but acquired 30~ weapons: cheater or bug. Remove.
- winPoints: 0 and -1 are unrated, and 250 to 1200 are poor at the game. Consider separating these out into new features 'unrated' and 'win_losers'.
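
Ratio features such as kills per damageDealt need care when the denominator is 0, because plain division gives `inf` or `NaN`. A minimal sketch of a safe ratio; the helper name `safe_ratio` and the toy values are my own assumptions, and returning 0 for a zero denominator is just one possible convention.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise numerator / denominator, returning 0 where the denominator is 0 (hypothetical helper)."""
    # Replacing 0 with NaN avoids inf; the resulting NaN is then mapped to 0.
    ratio = numerator / denominator.replace(0, np.nan)
    return ratio.fillna(0.0)

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "kills":       [0, 2, 1],
    "damageDealt": [0.0, 200.0, 0.0],
})
toy["killPerDamage"] = safe_ratio(toy["kills"], toy["damageDealt"])
print(toy)
```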


In [None]:
df = train.copy()  # work on a copy so that train stays untouched

df = df.drop(df[(df['walkDistance']<10.0) & (df['damageDealt']>0)].index)
df = df.drop(df[(df['walkDistance']<10.0) & (df['kills']>10)].index)
df = df.drop(df[(df['walkDistance']<100.0) & (df['weaponsAcquired']>30)].index)
df = df.drop(df[(df['walkDistance']<10.0) & (df['heals']>100)].index)
df = df.drop(df[(df['walkDistance']<10.0) & (df['headshotKills']>5)].index)

# unrated guys (killPoints)
df['unrated_kill'] = 0
df.loc[df['killPoints']==0, 'unrated_kill']=1

# poor on kill points
df['poor_kills'] = 0
df.loc[(df['killPoints']>10) & (df['killPoints']<800), 'poor_kills'] = 1
df.loc[(df['killPoints']>10) & (df['killPoints']<800), 'killPoints'] = 0

# killPerDamage (kills with 0 damageDealt give inf or NaN, so map both to 0)
df['killPerDamage'] = df['kills']/df['damageDealt']
df = df.replace([np.inf, -np.inf], np.nan).fillna(0)

# drop savage killer (kill streak >= 10)
df = df.drop(df[df['killStreaks']>=10].index)


# rank unrated players
df['unrated_rank'] = 0
df.loc[df['rankPoints']==-1, 'unrated_rank']=1

# roadDistance glitch drop
df = df.drop(df[(df['rideDistance']==0.0) & (df['roadKills']>0)].index)

# insane weapon scavenger = cheater. drop
df = df.drop(df[(df['weaponsAcquired']>30) & (df['walkDistance']<100)].index)

# winpoints unrated
df['unrated_win'] = 0
df.loc[(df['winPoints']==-1) | (df['winPoints'] == 0), 'unrated_win']=1

# poor on winpoints
df['poor_wins'] = 0
df.loc[(df['winPoints']>250) & (df['winPoints']<1200), 'poor_wins'] = 1
df.loc[(df['winPoints']>250) & (df['winPoints']<1200), 'winPoints'] = 0  # zero winPoints here, not killPoints

print('removed:' + str(train['Id'].count() - df['Id'].count()))
df.head()



In [None]:
# thanks to awesome https://www.kaggle.com/chocozzz/lightgbm-baseline

def feature_engineering(is_train=True,debug=True):
    test_idx = None
    if is_train: 
        print("processing train.csv")
        if debug == True:
            df = pd.read_csv('../input/train_V2.csv', nrows=1000000)
        else:
            df = pd.read_csv('../input/train_V2.csv')

        df = df[df['maxPlace'] > 1]
    else:
        print("processing test.csv")
        # the test set is always read in full, regardless of debug
        df = pd.read_csv('../input/test_V2.csv')
        test_idx = df.Id
    
    # df = reduce_mem_usage(df)
    #df['totalDistance'] = df['rideDistance'] + df["walkDistance"] + df["swimDistance"]
    
    # df = df[:100]
    
    print("remove some columns")
    target = 'winPlacePerc'
    
    if(is_train):
        print("removing cheaters")
        df = df.drop(df[(df['walkDistance']<10.0) & (df['damageDealt']>0)].index)
        df = df.drop(df[(df['walkDistance']<10.0) & (df['kills']>10)].index)
        df = df.drop(df[(df['walkDistance']<100.0) & (df['weaponsAcquired']>30)].index)
        df = df.drop(df[(df['walkDistance']<10.0) & (df['heals']>100)].index)
        df = df.drop(df[(df['walkDistance']<10.0) & (df['headshotKills']>5)].index)

        # drop savage killer (kill streak >= 10)
        df = df.drop(df[df['killStreaks']>=10].index)

        # roadDistance glitch drop
        df = df.drop(df[(df['rideDistance']==0.0) & (df['roadKills']>0)].index)

        # insane weapon scavenger = cheater. drop
        df = df.drop(df[(df['weaponsAcquired']>30) & (df['walkDistance']<100)].index)

    # unrated guys (killPoints)
    df['unrated_kill'] = 0
    df.loc[df['killPoints']==0, 'unrated_kill']=1

    # poor on kill points
    df['poor_kills'] = 0
    df.loc[(df['killPoints']>10) & (df['killPoints']<800), 'poor_kills'] = 1
    df.loc[(df['killPoints']>10) & (df['killPoints']<800), 'killPoints'] = 0

    


    # rank unrated players
    df['unrated_rank'] = 0
    df.loc[df['rankPoints']==-1, 'unrated_rank']=1

    

    # winpoints unrated
    df['unrated_win'] = 0
    df.loc[(df['winPoints']==-1) | (df['winPoints'] == 0), 'unrated_win']=1

    # poor on winpoints
    df['poor_wins'] = 0
    df.loc[(df['winPoints']>250) & (df['winPoints']<1200), 'poor_wins'] = 1
    df.loc[(df['winPoints']>250) & (df['winPoints']<1200), 'winPoints'] = 0  # zero winPoints here, not killPoints

   
    # np.Inf and np.NINF were removed in NumPy 2.0; replace() handles both infinities
    df = df.replace([np.inf, -np.inf], np.nan)
    
    print("Removing Na's From DF")
    df.fillna(0, inplace=True)

    
    features = list(df.columns)
    features.remove("Id")
    features.remove("matchId")
    features.remove("groupId")
    features.remove("matchType")
    
    # matchType = pd.get_dummies(df['matchType'])
    # df = df.join(matchType)    
    
    y = None
    
    
    if is_train: 
        print("get target")
        y = np.array(df.groupby(['matchId','groupId'])[target].agg('mean'), dtype=np.float64)
        features.remove(target)

    print("get group mean feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('mean')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    
    if is_train: df_out = agg.reset_index()[['matchId','groupId']]
    else: df_out = df[['matchId','groupId']]

    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_mean", "_mean_rank"], how='left', on=['matchId', 'groupId'])
    
    # print("get group sum feature")
    # agg = df.groupby(['matchId','groupId'])[features].agg('sum')
    # agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    # df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    # df_out = df_out.merge(agg_rank, suffixes=["_sum", "_sum_rank"], how='left', on=['matchId', 'groupId'])
    
    # print("get group sum feature")
    # agg = df.groupby(['matchId','groupId'])[features].agg('sum')
    # agg_rank = agg.groupby('matchId')[features].agg('sum')
    # df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    # df_out = df_out.merge(agg_rank.reset_index(), suffixes=["_sum", "_sum_pct"], how='left', on=['matchId', 'groupId'])
    
    print("get group max feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('max')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_max", "_max_rank"], how='left', on=['matchId', 'groupId'])
    
    print("get group min feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('min')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_min", "_min_rank"], how='left', on=['matchId', 'groupId'])
    
    print("get group size feature")
    agg = df.groupby(['matchId','groupId']).size().reset_index(name='group_size')
    df_out = df_out.merge(agg, how='left', on=['matchId', 'groupId'])
    
    print("get match mean feature")
    agg = df.groupby(['matchId'])[features].agg('mean').reset_index()
    df_out = df_out.merge(agg, suffixes=["", "_match_mean"], how='left', on=['matchId'])
    
    # print("get match type feature")
    # agg = df.groupby(['matchId'])[matchType.columns].agg('mean').reset_index()
    # df_out = df_out.merge(agg, suffixes=["", "_match_type"], how='left', on=['matchId'])
    
    print("get match size feature")
    agg = df.groupby(['matchId']).size().reset_index(name='match_size')
    df_out = df_out.merge(agg, how='left', on=['matchId'])
    
    print("Adding Features")
    # NOTE: the group and match aggregates above were already merged into df_out,
    # so these per-player columns are not part of the returned X as written.
    # To use them, this block has to run before `features = list(df.columns)`.
    df['headshotrate'] = df['kills']/df['headshotKills']
    df['killStreakrate'] = df['killStreaks']/df['kills']
    df['healthitems'] = df['heals'] + df['boosts']
    df['totalDistance'] = df['rideDistance'] + df["walkDistance"] + df["swimDistance"]
    df['killPlace_over_maxPlace'] = df['killPlace'] / df['maxPlace']
    df['headshotKills_over_kills'] = df['headshotKills'] / df['kills']
    df['distance_over_weapons'] = df['totalDistance'] / df['weaponsAcquired']
    df['walkDistance_over_heals'] = df['walkDistance'] / df['heals']
    df['walkDistance_over_kills'] = df['walkDistance'] / df['kills']
    df['killsPerWalkDistance'] = df['kills'] / df['walkDistance']
    df["skill"] = df["headshotKills"] + df["roadKills"]
    
    # divisions by zero above produce inf; np.Inf and np.NINF were removed in NumPy 2.0
    df = df.replace([np.inf, -np.inf], np.nan)
    print("Removing Na's From DF")
    df.fillna(0, inplace=True)
    
    df_out.drop(["matchId", "groupId"], axis=1, inplace=True)

    X = df_out
    
    feature_names = list(df_out.columns)

    del df, df_out, agg, agg_rank
    gc.collect()

    return X, y, feature_names, test_idx

In [None]:
x_train, y_train, train_columns, _ = feature_engineering(True, debug=debug)
x_test, _, _ , test_idx = feature_engineering(False, debug=debug)


In [None]:
sns.heatmap(x_train.head(1000).corr())


In [None]:
# Thanks and credited to https://www.kaggle.com/gemartin who created this wonderful mem reducer
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() 
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem / 1024**2))  # bytes -> MB
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() 
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem / 1024**2))  # bytes -> MB
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

x_train = reduce_mem_usage(x_train)
x_test = reduce_mem_usage(x_test)

In [None]:
#excluded_features = []
#use_cols = [col for col in df_train.columns if col not in excluded_features]
gc.collect();
train_index = int(x_train.shape[0]*0.8)  # first 80% of the rows for training, the rest for validation
dev_X = x_train[:train_index] 
val_X = x_train[train_index:]
dev_y = y_train[:train_index] 
val_y = y_train[train_index:] 
gc.collect();

# custom function to run light gbm model
def run_lgb(train_X, train_y, val_X, val_y, x_test):
    params = {"objective" : "regression", "metric" : "mae", 'n_estimators':20000,
              "num_leaves" : 31, "learning_rate" : 0.05, "bagging_fraction" : 0.7,
               "bagging_seed" : 0, "num_threads" : 4,"colsample_bytree" : 0.7
             }
    
    lgtrain = lgb.Dataset(train_X, label=train_y)
    lgval = lgb.Dataset(val_X, label=val_y)
    # LightGBM 4.x removed the early_stopping_rounds/verbose_eval keyword arguments; callbacks replace them
    model = lgb.train(params, lgtrain, valid_sets=[lgtrain, lgval],
                      callbacks=[lgb.early_stopping(stopping_rounds=200), lgb.log_evaluation(period=1000)])
    
    pred_test_y = model.predict(x_test, num_iteration=model.best_iteration)
    return pred_test_y, model

# Training the model #
pred_test, model = run_lgb(dev_X, dev_y, val_X, val_y, x_test)
pred_test

In [None]:


print(pred_test.shape[0])
pred_test

df_sub = pd.read_csv("../input/sample_submission_V2.csv")
df_sub['winPlacePerc'] = pred_test
df_sub.head()


In [None]:
if(debug):
    df_sub = pd.read_csv("../input/sample_submission_V2.csv", nrows=pred_test.shape[0])
    df_test = pd.read_csv("../input/test_V2.csv", nrows=pred_test.shape[0])
else:
    df_sub = pd.read_csv("../input/sample_submission_V2.csv")
    df_test = pd.read_csv("../input/test_V2.csv")
df_sub['winPlacePerc'] = pred_test
# Restore some columns
df_sub = df_sub.merge(df_test[["Id", "matchId", "groupId", "maxPlace", "numGroups"]], on="Id", how="left")

# Sort, rank, and assign adjusted ratio
df_sub_group = df_sub.groupby(["matchId", "groupId"]).first().reset_index()
df_sub_group["rank"] = df_sub_group.groupby(["matchId"])["winPlacePerc"].rank()
df_sub_group = df_sub_group.merge(
    df_sub_group.groupby("matchId")["rank"].max().to_frame("max_rank").reset_index(), 
    on="matchId", how="left")
df_sub_group["adjusted_perc"] = (df_sub_group["rank"] - 1) / (df_sub_group["numGroups"] - 1)

df_sub = df_sub.merge(df_sub_group[["adjusted_perc", "matchId", "groupId"]], on=["matchId", "groupId"], how="left")
df_sub["winPlacePerc"] = df_sub["adjusted_perc"]

# Deal with edge cases
df_sub.loc[df_sub.maxPlace == 0, "winPlacePerc"] = 0
df_sub.loc[df_sub.maxPlace == 1, "winPlacePerc"] = 1

# Align with maxPlace
# Credit: https://www.kaggle.com/anycode/simple-nn-baseline-4
subset = df_sub.loc[df_sub.maxPlace > 1]
gap = 1.0 / (subset.maxPlace.values - 1)
new_perc = np.around(subset.winPlacePerc.values / gap) * gap
df_sub.loc[df_sub.maxPlace > 1, "winPlacePerc"] = new_perc

# Edge case
df_sub.loc[(df_sub.maxPlace > 1) & (df_sub.numGroups == 1), "winPlacePerc"] = 0
assert df_sub["winPlacePerc"].isnull().sum() == 0

df_sub[["Id", "winPlacePerc"]].to_csv("submission_adjusted.csv", index=False)
df_sub
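
The "Align with maxPlace" step above snaps each prediction to the nearest multiple of 1 / (maxPlace - 1), since those are the only values winPlacePerc can take in a match. A standalone sketch of that rounding on made-up numbers; the function name `snap_to_placement_grid` is hypothetical, and it assumes maxPlace > 1 (the cell above handles the other cases separately).

```python
import numpy as np

def snap_to_placement_grid(perc, max_place):
    """Round predictions to the nearest multiple of 1 / (max_place - 1); requires max_place > 1."""
    perc = np.asarray(perc, dtype=float)
    max_place = np.asarray(max_place, dtype=float)
    gap = 1.0 / (max_place - 1)
    return np.around(perc / gap) * gap

# With maxPlace = 5 the valid values are 0, 0.25, 0.5, 0.75 and 1.
raw = [0.13, 0.52, 0.97]
snapped = snap_to_placement_grid(raw, [5, 5, 5])
print(snapped)
```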

In [None]:
#small test in small batch data
# train_small = df.sample(10000)

In [None]:
# feature_list = ['DBNOs','headshotKills','heals','longestKill','assists','walkDistance', 'boosts','damageDealt', 'damageDealer','healer','deadEye','walker','booster','winPlacePerc']

# train_small=train_small.drop('Id', axis=1)
# train_small=train_small.drop('groupId', axis=1)
# train_small=train_small.drop('matchId', axis=1)
# train_small=train_small.drop('matchType', axis=1)

# train_small_batch = train_small.copy()

In [None]:
# corr = train_small_batch.corr()
# fig, ax = plt.subplots(figsize=(20,20))
# sns.heatmap(corr, annot=True,ax = ax)

In [None]:
# train_df, test_df = model_selection.train_test_split(train_small_batch, test_size=0.3, random_state=49)
# train_df_y = train_df[['winPlacePerc']]
# train_df_x = train_df.copy().drop('winPlacePerc', axis=1)
# test_df_y = test_df[['winPlacePerc']]
# test_df_x = test_df.copy().drop('winPlacePerc', axis=1)

In [None]:

# clf = XGBRegressor()
# clf_cv = model_selection.GridSearchCV(clf, {'max_depth': [2,4,6], 'n_estimators': [50,100,200]}, verbose=1)
# clf_cv.fit(train_df_x, train_df_y)
# print(clf_cv.best_params_, clf_cv.best_score_)

In [None]:
# clf = XGBRegressor(max_depth=4, n_estimators=200)
# clf.fit(train_df_x, train_df_y)

In [None]:
# pred = clf.predict(test_df_x)
# rmse = np.sqrt(mean_absolute_error(test_df_y, pred))
# mean_pred = [train_df_y.mean() for i in range(len(test_df_y))]
# rmse_base = np.sqrt(mean_absolute_error(test_df_y, mean_pred))

# print('trained feature list: ' + str(feature_list))

# print(rmse_base)
# print(rmse)

# xgb.plot_importance(clf, max_num_features=100)


## Feature engineering record

Tested on 10,000 rows.

`clf_cv = model_selection.GridSearchCV(clf, {'max_depth': [2,4,6], 'n_estimators': [50,100,200]}, verbose=1)`

default
- 0.28...
- 0.270245891656
- 0.261273510973
- 0.257954486653


In [None]:
# _test  = pd.read_csv('../input/test_V2.csv')
# test = _test.copy()

In [None]:
# # test['damageDealer']=0
# # test.loc[test['damageDealt']>186.70, 'damageDealer'] = 1

# # test['deadEye'] = 0
# # test.loc[test['headshotKills']>3, 'deadEye'] = 1

# # test['healer'] = 0
# # test.loc[test['heals']>=5, 'healer'] = 1

# # test['walker'] = 0
# # test.loc[test['walkDistance']>3000.0, 'walker'] = 1

# # test['booster'] = 0
# # test.loc[test['boosts']>=4.0, 'booster'] = 1

# # test['sniper'] = 0
# # test.loc[test['headshotKills']>=2.0, 'sniper'] = 1

# test['killPerddamage'] = test['kills']/test['damageDealt']

# test=test.drop('Id', axis=1)
# test=test.drop('groupId', axis=1)
# test=test.drop('matchId', axis=1)
# test=test.drop('matchType', axis=1)


In [None]:

# pred = clf.predict(test)


In [None]:
# submission = pd.DataFrame({'Id':_test['Id'], 'winPlacePerc':pred})

In [None]:
# submission.head()