This is one of my first public kernels on Kaggle. It may be a bit rough around the edges, with some leftover code here and there, but it got me in the top 25% on the competition leaderboards as of November 5th, 2018. It's been a great learning experience and I'm excited to move on to something new!

I also greatly appreciate any feedback.

In [None]:
import numpy as np
import pandas as pd
import re
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
%matplotlib inline

import gc, sys
gc.enable()

In [None]:
warnings.simplefilter("ignore")
# warnings.resetwarnings()

In [None]:
df_train = pd.read_csv('../input/train_V2.csv', sep=',')
# df_test = pd.read_csv('../input/test_V2.csv', sep=',')

In [None]:
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and downcast
    numeric dtypes to reduce memory usage."""
    
    #start_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    #end_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

In [None]:
df_train = reduce_mem_usage(df_train)
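
As a rough cross-check of what this kind of downcasting does, here is a minimal sketch on a toy frame using pandas' built-in `pd.to_numeric(..., downcast=...)`. The toy data and the `downcast_numeric` helper are illustrative assumptions, not part of the original kernel. One difference worth knowing: `downcast='float'` stops at `float32`, whereas the function above may go down to `float16`, which cannot hold distances in the thousands exactly.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy of df with numeric columns downcast (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# Toy stand-in for a couple of df_train columns
toy = pd.DataFrame({
    'kills': np.array([0, 3, 7], dtype=np.int64),
    'walkDistance': np.array([12.5, 1234.5, 4020.25], dtype=np.float64),
})
small = downcast_numeric(toy)

print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), 'bytes ->', small.memory_usage(deep=True).sum(), 'bytes')

# float16 rounds this distance, float32 keeps it exactly
print(float(np.float16(4020.25)), float(np.float32(4020.25)))
```

Whether the rounding from `float16` matters for the model is not something this sketch measures.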

I used the code below to sample the training dataset while I was modifying model parameters and whatnot.

In [None]:
# sampled_matches = pd.Series(df_train.matchId.unique()).sample(frac=1/4, random_state=42)
# df_train = df_train[df_train.matchId.isin(sampled_matches)]
# df_train.groupby("matchId").size().mean()
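
The point of sampling on `matchId` rather than on rows is that every sampled match stays complete. A minimal sketch on toy data (the frame below is made up for illustration):

```python
import pandas as pd

# Toy stand-in for df_train: 4 matches with 3 players each
toy = pd.DataFrame({
    'matchId': ['m1'] * 3 + ['m2'] * 3 + ['m3'] * 3 + ['m4'] * 3,
    'Id': ['p{}'.format(i) for i in range(12)],
})

# Sample half of the match IDs, then keep every row of the chosen matches
sampled_matches = pd.Series(toy.matchId.unique()).sample(frac=1/2, random_state=42)
toy_sample = toy[toy.matchId.isin(sampled_matches)]

# Each sampled match still has all of its players
print(toy_sample.groupby('matchId').size())
```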

Just taking a brief look at the data...

In [None]:
df_train.head(10)

In [None]:
df_train.info()

In [None]:
df_train.describe()

## Feature Descriptions

**DBNOs** - Number of enemy players knocked.

**assists** - Number of enemy players this player damaged that were killed by teammates.

**boosts** - Number of boost items used.

**damageDealt** - Total damage dealt. Note: Self inflicted damage is subtracted.

**headshotKills** - Number of enemy players killed with headshots.

**heals** - Number of healing items used.

**Id** - Player’s Id

**killPlace** - Ranking in match of number of enemy players killed.

**killPoints** - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.

**killStreaks** - Max number of enemy players killed in a short amount of time.

**kills** - Number of enemy players killed.

**longestKill** - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.

**matchDuration** - Duration of match in seconds.

**matchId** - ID to identify match. There are no matches that are in both the training and testing set.

**matchType** - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.

**rankPoints** - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.

**revives** - Number of times this player revived teammates.

**rideDistance** - Total distance traveled in vehicles measured in meters.

**roadKills** - Number of kills while in a vehicle.

**swimDistance** - Total distance traveled by swimming measured in meters.

**teamKills** - Number of times this player killed a teammate.

**vehicleDestroys** - Number of vehicles destroyed.

**walkDistance** - Total distance traveled on foot measured in meters.

**weaponsAcquired** - Number of weapons picked up.

**winPoints** - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.

**groupId** - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.

**numGroups** - Number of groups we have data for in the match.

**maxPlace** - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.

**winPlacePerc** - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
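
The "treat as None" rules for `killPoints`, `winPoints`, and `rankPoints` can be sketched as below. This is only an illustration on made-up rows, and the `_clean` column names are hypothetical. The kernel itself handles these values differently later on (mean imputation).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'rankPoints': [-1, 1500, 1480, -1],
    'killPoints': [1200, 0, 0, 0],
    'winPoints':  [1450, 0, 0, 0],
})

# A 0 in killPoints/winPoints only means "None" when rankPoints holds a real value
has_rank = toy['rankPoints'] != -1
for col in ['killPoints', 'winPoints']:
    is_placeholder = has_rank & (toy[col] == 0)
    toy[col + '_clean'] = toy[col].where(~is_placeholder, np.nan)

# In rankPoints, -1 takes the place of "None"
toy['rankPoints_clean'] = toy['rankPoints'].replace(-1, np.nan)

print(toy)
```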

## Cleaning

There is one row of data where the variable we are predicting is missing. Let's isolate it and drop it.

In [None]:
df_train.loc[pd.notnull(df_train['winPlacePerc']) == False]

In [None]:
df_train = df_train[pd.notnull(df_train['winPlacePerc'])]

In [None]:
df_train.loc[pd.notnull(df_train['winPlacePerc']) == False]

It's gone! Let's check for other missing data in our train dataset...

In [None]:
total_train = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)*100
missing_data = pd.concat([total_train, percent], axis=1,join='outer', keys=['Total Missing Count', ' % of Total Observations'])
missing_data.index.name ='Feature'
missing_data.head(10)
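
For reference, the same kind of missing-value summary can be written a little more compactly with `isnull().mean()`, and the row with the missing target can be dropped with `dropna(subset=...)`. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'kills': [1, 0, 2, 5],
    'winPlacePerc': [0.5, np.nan, 1.0, 0.0],
})

# isnull().mean() gives the share of missing values per column directly
missing = pd.DataFrame({
    'Total Missing Count': toy.isnull().sum(),
    '% of Total Observations': toy.isnull().mean() * 100,
}).sort_values('Total Missing Count', ascending=False)
missing.index.name = 'Feature'
print(missing)

# Drop rows where the prediction target is missing
toy = toy.dropna(subset=['winPlacePerc'])
```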

I tried to keep all of the feature engineering in one place, but dropping the custom games takes a bit of work up front, and the features created here are reused later.

In [None]:
df_train = df_train.assign(team_size=df_train.groupby('groupId').groupId.transform('count'))
df_train = df_train.assign(max_team_size=df_train.groupby('matchId').team_size.transform('max'))
df_train = df_train.assign(match_size=df_train.groupby('matchId').Id.transform('nunique'))

In [None]:
df_train =  df_train.assign(team_indicator = df_train.team_size.apply(lambda x: 5 if x>= 5 else x))

df_train = pd.get_dummies(df_train, columns=['team_indicator'])
dummy_cols = ['team_indicator_{}'.format(i) for i in np.arange(1,6)]
df_train[dummy_cols] = df_train.groupby('matchId')[dummy_cols].transform('mean')

In [None]:
df_train.loc[df_train.team_indicator_1 >= 0.7, 'game_mode'] = 'solo'
df_train.loc[df_train.team_indicator_2 >= 0.6, 'game_mode'] = 'duo'
df_train.loc[(df_train.team_indicator_3 + df_train.team_indicator_4) >= 0.5, 'game_mode'] = 'squad'
df_train.game_mode = np.where((df_train.team_indicator_5 >= 0.2), 'custom', df_train.game_mode)
df_train.game_mode = df_train.game_mode.fillna('custom')

In [None]:
print('Shape before dropping custom games: ' + str(df_train.shape))
df_train = df_train.loc[df_train['game_mode'] != 'custom']
print('Shape after dropping custom games: ' + str(df_train.shape))

Not only have we (roughly) eliminated custom games from our training set, we have also accounted for outliers in team size by capping the team size indicator at 5 with the lambda function above.
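
To make the game mode rule easier to follow, here is a minimal sketch on four made-up matches. It uses the same thresholds and the same precedence (custom over squad over duo over solo), but computes the per-match shares with `pd.crosstab` instead of dummy columns, so treat it as an equivalent illustration rather than the exact code used above.

```python
import numpy as np
import pandas as pd

# Toy matches: A = four solos, B = two duos, C = a team of 4 and a team of 3, D = one team of 6
toy = pd.DataFrame({
    'matchId': ['A'] * 4 + ['B'] * 4 + ['C'] * 7 + ['D'] * 6,
    'groupId': ['a1', 'a2', 'a3', 'a4']
               + ['b1', 'b1', 'b2', 'b2']
               + ['c1'] * 4 + ['c2'] * 3
               + ['d1'] * 6,
})

# Team size per player, capped at 5 so oversized teams share one bucket
team_size = toy.groupby('groupId').groupId.transform('count')
capped = team_size.clip(upper=5)

# Share of players in each team-size bucket, per match
shares = pd.crosstab(toy['matchId'], capped, normalize='index')
shares = shares.reindex(columns=range(1, 6), fill_value=0.0)

# np.select takes the first condition that matches, which mirrors the override order
mode = np.select(
    [shares[5] >= 0.2, shares[3] + shares[4] >= 0.5, shares[2] >= 0.6, shares[1] >= 0.7],
    ['custom', 'squad', 'duo', 'solo'],
    default='custom',
)
game_mode = pd.Series(mode, index=shares.index)
print(game_mode)
```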

## Preliminary Exploratory Data Analysis

#### Checking correlation

In [None]:
num_feat_train = df_train.select_dtypes(include=[np.number])
print(num_feat_train.info())

corrmat = num_feat_train.corr() 
cols = corrmat.nlargest(25, 'winPlacePerc').index # nlargest : Return this many descending sorted values
cm = np.corrcoef(num_feat_train[cols].values.T)

# correlation 
sns.set(font_scale=1.25)
f, ax = plt.subplots(figsize=(15, 12))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 8}, 
                 yticklabels=cols.values, xticklabels=cols.values)
plt.show()
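
A small sketch of the feature selection step behind the heatmap, on synthetic data (the columns and coefficients are made up). Two things it illustrates: the selected block can be sliced straight out of `corrmat` instead of recomputing it with `np.corrcoef`, and `nlargest` ranks by signed correlation, so strongly negative predictors do not make the cut.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'walkDistance': rng.rand(200),
    'kills': rng.rand(200),
    'noise': rng.rand(200),
})
toy['winPlacePerc'] = 0.7 * toy['walkDistance'] + 0.3 * toy['kills'] + 0.05 * rng.rand(200)

corrmat = toy.corr()

# Features with the largest (signed) correlation with the target; the target itself comes first
cols = corrmat.nlargest(3, 'winPlacePerc').index

# Slice the already-computed matrix instead of calling np.corrcoef again
cm = corrmat.loc[cols, cols]
print(cm.round(2))
```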

#### Function to plot distributions

In [None]:
def plot_hist(x, title, bin_count=50):
    
    fig, ax = plt.subplots(figsize=(13,7))
    formatter = plt.FuncFormatter(lambda x, y: '{:,.2f}'.format(x))
    
    ax.yaxis.set_major_formatter(formatter=formatter)
    ax.xaxis.set_major_formatter(formatter=formatter)
    b = bin_count

    ax.set_title(title)
    sns.distplot(x, bins=b, kde=True, ax=ax, color='darkred')

In [None]:
print('The average winning percentile is {:.3f}, the median is {:.3f}'.format(df_train['winPlacePerc'].mean(), df_train['winPlacePerc'].median()))

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['winPlacePerc'], title='Distribution of winning percentiles')

In [None]:
df_train = df_train.assign(match_mean = df_train.groupby('matchId')['winPlacePerc'].transform('mean'))
df_train = df_train.assign(match_median = df_train.groupby('matchId')['winPlacePerc'].transform('median'))

In [None]:
print('The average match winning percentile is {:.2f}, the median is {:.2f}'.format(df_train.match_mean.mean(), df_train.match_median.median()))

In [None]:
plot_hist(df_train.match_mean, title='Distribution of average match winning percentiles')
plot_hist(df_train.match_median, title='Distribution of median match winning percentiles')

## Boosts, Heals, and Match Length

In [None]:
plot_hist(df_train['boosts'], title='Distribution of boost usage')

In [None]:
def plot_joint(x_feat, y_feat, df):
    sns.jointplot(x=x_feat, y=y_feat, data=df, kind='reg', color='darkred')
    # sns.jointplot(boosts_df['high_boosts'], boosts_df['matchDuration'], kind='reg', color='darkred', scatter_kws={'edgecolor':'w'}, line_kws={'color':'black'})

I am using a sample here because trying to plot all the points from the full dataset is too computationally expensive.

In [None]:
sampled_matches = pd.Series(df_train.matchId.unique()).sample(2000, random_state=42)
df_train_sample = df_train[df_train.matchId.isin(sampled_matches)]
df_train_sample.groupby("matchId").size().mean()

In [None]:
# Keep only players with unusually high boost usage (more than one standard
# deviation above the mean), because most losing players don't use very many boosts

boosts_df = df_train_sample[['winPlacePerc', 'matchDuration', 'boosts']].copy()

boosts_mnPlusStd = boosts_df['boosts'].mean() + boosts_df['boosts'].std()
boosts_df = boosts_df.loc[boosts_df['boosts'] > boosts_mnPlusStd].astype(float)
boosts_df['high_boosts'] = boosts_df['boosts']

In [None]:
plt.clf()
sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(15,12)})
plot_joint('winPlacePerc', 'boosts', df_train_sample)

It seems that winning players tend to use more boosts. Part of this is likely because they are alive for longer, which is why we will later weight boosts by match duration, so that boosts become a more important predictor as matches run longer than average.

In [None]:
plt.clf()
sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(15,12)})
plot_joint('high_boosts', 'matchDuration', boosts_df)


I do not play the game, so I am unsure why matchDuration is essentially split at around 1600 seconds. Still, let's look at the relationship between boosts and match duration separately for long and short matches.

In [None]:
long_boosts_df = pd.DataFrame(boosts_df.loc[boosts_df['matchDuration'] > boosts_df['matchDuration'].mean()].astype(int))
short_boosts_df = pd.DataFrame(boosts_df.loc[boosts_df['matchDuration'] < boosts_df['matchDuration'].mean()].astype(int))

In [None]:
plt.clf()
sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(15,12)})
plot_joint('high_boosts', 'matchDuration', long_boosts_df)

In [None]:
plt.clf()
sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(15,12)})
plot_joint('high_boosts', 'matchDuration', short_boosts_df)

In [None]:
plt.clf()
sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(15,12)})
plot_joint('heals', 'matchDuration', df_train_sample)

In [None]:
del df_train_sample, boosts_df, long_boosts_df, short_boosts_df
gc.collect()

## Team and Match Sizes

In [None]:
plot_hist(df_train[df_train['match_size']>=75]['match_size'], title='Distribution of players per game')

In [None]:
# I used this to compare to the original dataset but it was too computationally expensive to justify its usage

# df_train_copy = df_train_copy.assign(team_size_copyset=df_train_copy.groupby('groupId').groupId.transform('count'))
# df_train_copy = df_train_copy.assign(max_team_size_copyset=df_train_copy.groupby('matchId').team_size_copyset.transform('max'))
# df_train_copy = df_train_copy.assign(match_size_copyset=df_train_copy.groupby('matchId').Id.transform('nunique'))


# print('The largest team in the original training dataset had {} team members'.format(df_train_copy.max_team_size_copyset.max()))
print('The largest team after dropping custom games has {} team members'.format(df_train.max_team_size.max()))

In [None]:
plot_hist(df_train.team_size, title='Distribution of team sizes')
plt.xlim(0,20)

In [None]:
plot_hist(df_train.max_team_size, title='Distribution of maximum team size', bin_count=25)
plt.xlim(0,20)

In [None]:
plt.clf()
fig = plt.figure(figsize=(10,7))
fig.add_subplot(1,1,1)
fig.autofmt_xdate()
sns.set_style("darkgrid")
sns.countplot(x='matchType', data = df_train, palette='Reds_d')

Remember, we simplified our categories earlier, from matchType to game_mode.

In [None]:
plt.clf()
fig = plt.figure(figsize=(10,7))
fig.add_subplot(1,1,1)
fig.autofmt_xdate()
sns.set_style("darkgrid")
sns.countplot(x='game_mode', data = df_train, palette='Reds_d')

In [None]:
# these objects are eventually used to drop all match type features ...NOT game_mode

matchType_keep = ['matchType_squad', 'matchType_solo-fpp', 'matchType_squad-fpp', 'matchType_duo-fpp',
                  'matchType_solo', 'matchType_duo']
matchType_drop = ['matchType_squad', 'matchType_solo-fpp', 'matchType_squad-fpp', 'matchType_duo-fpp', 
                  'matchType_solo', 'matchType_duo', 'matchType_normal-squad-fpp', 'matchType_flaretpp', 
                  'matchType_crashfpp', 'matchType_normal-duo-fpp', 'matchType_crashtpp', 'matchType_normal-solo-fpp', 
                  'matchType_flarefpp', 'matchType_normal-solo', 'matchType_normal-squad', 'matchType_normal-duo']


## Outlier Detection and Removal

Many of the ideas below are pulled from these two kernels. Big thank you to the authors of both.

https://www.kaggle.com/rejasupotaro/cheaters-and-zombies

https://www.kaggle.com/carlolepelaars/pubg-data-exploration-rf-funny-gifs-v2

I feel that most of the processes used in this section are self-explanatory. If anything is not clear, just leave a comment and I will get back to you.



#### Distance Traveled

First we are going to remove outliers in distance traveled for each of the three features, then combine what is left over into a new feature measuring total distance traveled.

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['walkDistance'], title='Distribution of distance traveled by walking per player')

In [None]:
print('Number of players who traveled more than 8000 by walking: ' + str(len(df_train[df_train['walkDistance'] > 8000])))

In [None]:
def remove_outliers_walk(data):
    print('Number of walk distance outliers: ' + str(len(data.loc[data['walkDistance'] > 8000])))
    data.drop(data.loc[data['walkDistance'] > 8000].index, inplace=True)
    return str('Walk distance outliers removed.')

In [None]:
remove_outliers_walk(df_train)

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['rideDistance'], title='Distribution of distance traveled by vehicle per player')

In [None]:
print('Number of players who traveled more than 15000 by vehicle: ' + str(len(df_train[df_train['rideDistance'] > 15000])))

In [None]:
def remove_outliers_ride(data):
    print('Number of ride distance outliers: ' + str(len(data.loc[data['rideDistance'] > 15000])))
    data.drop(data.loc[data['rideDistance'] > 15000].index, inplace=True)
    return str('Ride distance outliers removed.')

In [None]:
remove_outliers_ride(df_train)

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['swimDistance'], title='Distribution of distance swam per player')

In [None]:
print('Number of players who swam more than 500: ' + str(len(df_train[df_train['swimDistance'] > 500])))

In [None]:
def remove_outliers_swim(data):
    print('Number of swim distance outliers: ' + str(len(data.loc[data['swimDistance'] > 500])))
    data.drop(data.loc[data['swimDistance'] > 500].index, inplace=True)
    return str('Swim distance outliers removed.')

In [None]:
remove_outliers_swim(df_train)
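
The three distance functions above share one pattern: count the rows above a fixed cutoff, then drop them. As a sketch, they could be folded into a single helper. The name `remove_outliers_above` is hypothetical, and unlike the functions above it returns a filtered frame instead of dropping in place.

```python
import pandas as pd

def remove_outliers_above(data, column, threshold):
    """Drop rows where data[column] > threshold and return the rest (hypothetical helper)."""
    mask = data[column] > threshold
    print('Number of {} outliers: {}'.format(column, mask.sum()))
    return data.loc[~mask]

# Toy rows: the second walked too far, the third rode too far
toy = pd.DataFrame({
    'walkDistance': [120.0, 9500.0, 3000.0],
    'rideDistance': [0.0, 100.0, 16000.0],
    'swimDistance': [10.0, 0.0, 0.0],
})

# Same cutoffs as in the cells above
for column, threshold in [('walkDistance', 8000), ('rideDistance', 15000), ('swimDistance', 500)]:
    toy = remove_outliers_above(toy, column, threshold)

print(toy)
```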

In [None]:
df_train['total_distance'] = df_train.rideDistance + df_train.swimDistance + df_train.walkDistance

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['total_distance'], title='Distribution of total distance traveled per player')

#### Cheaters and/or Maniacs

This function removes players who never moved but somehow managed to get kills. In other words, it removes cheaters.

In [None]:
def remove_movecheat(data):
    data['killsWithoutMoving'] = ((data['kills'] > 0) & (data['total_distance'] == 0))
    print('Number of movement cheaters: ' + str(len(data.loc[data['killsWithoutMoving'] == True])))
    data.drop(data.loc[data['killsWithoutMoving'] == True].index, inplace=True)
    return str('Cheaters removed.')

In [None]:
remove_movecheat(df_train)
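
A tiny illustration of the mask used here, on made-up rows. Note that a player with zero distance and zero kills (for example someone who was away from the keyboard) is kept; only kills without any movement are flagged.

```python
import pandas as pd

toy = pd.DataFrame({
    'kills': [0, 3, 2, 0],
    'total_distance': [0.0, 0.0, 850.5, 40.0],
})

# Flag players who got kills without travelling any distance at all
kills_without_moving = (toy['kills'] > 0) & (toy['total_distance'] == 0)
toy_clean = toy.loc[~kills_without_moving]

print(kills_without_moving.tolist())
print(toy_clean)
```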

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['roadKills'], title='Distribution of total kills while in vehicle per player')

Here, we are taking players who got at least one road kill. Then, we take the mean of that group (the mean road kills for players who had at least one road kill), and add 3 standard deviations to the mean. We remove the group of players who had more road kills than that statistic from our training dataset following the logic that such cases are not representative of player skill or success in a match. Most players do not even get a single road kill. At the very least, the players removed on these grounds were probably just driving around rather than playing strategically.

In [None]:
def remove_outliers_road(data):
    roadk1_mnPlus3Std = data[data['roadKills'] >= 1]['roadKills'].mean() + 3*data[data['roadKills'] >= 1]['roadKills'].std()
    print('Number of vehicle kill outliers: ' + str(len(data.loc[data['roadKills'] > roadk1_mnPlus3Std])))
    data.drop(data.loc[data['roadKills'] > roadk1_mnPlus3Std].index, inplace=True)
    return str('Vehicle kill outliers removed.')

In [None]:
remove_outliers_road(df_train)
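
Here is the road kill cutoff worked through on made-up numbers: 60 players with no road kills, 30 with one, 5 with two, and a single player with 15. The mean and standard deviation are taken only over players with at least one road kill.

```python
import pandas as pd

toy = pd.DataFrame({'roadKills': [0] * 60 + [1] * 30 + [2] * 5 + [15]})

# Statistics over players who had at least one road kill
had_road_kill = toy.loc[toy['roadKills'] >= 1, 'roadKills']
threshold = had_road_kill.mean() + 3 * had_road_kill.std()

# Keep everyone at or below the cutoff
toy_clean = toy.loc[toy['roadKills'] <= threshold]

print('Cutoff: {:.2f}'.format(threshold))
print('Rows removed:', len(toy) - len(toy_clean))
```

On this toy data the cutoff lands between 8 and 9 road kills, so only the player with 15 is removed. Keep in mind that a single extreme value also inflates the standard deviation, so in a very small group the cutoff can end up above the outlier itself.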

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['kills'], title='Distribution of total kills per player')

There seem to be some players who achieve a kill count that is wildly detached from the rest of the distribution. This is suspicious and we are going to remove them. We will choose 20 as the cutoff point.

In [None]:
def remove_outliers_kills_old(data):
    kills_mnPlus3Std = data['kills'].mean() + 3*data['kills'].std()
    print('Number of kill outliers: ' + str(len(data.loc[data['kills'] > kills_mnPlus3Std])))
    data.drop(data.loc[data['kills'] > kills_mnPlus3Std].index, inplace=True)
    return str('Kill outliers removed.')

In [None]:
def remove_outliers_kills(data):
    print('Number of kill outliers: ' + str(len(data.loc[data['kills'] > 20])))
    data.drop(data.loc[data['kills'] > 20].index, inplace=True)
    return str('Kill outliers removed.')

In [None]:
remove_outliers_kills(df_train)

#### Weapons Acquired, Longest Kill, and Heals/Boosts

According to the description of the data, the feature measuring longest kill can be misleading because a player can down another player and then drive away before they bleed out and the kill is actually awarded. A high value may also be an indication of cheating. Let's check the distribution and choose a point above which to remove outliers.

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['longestKill'], title='Distribution of longest kill per player')

In [None]:
def remove_outliers_longkills(data):
    print('Number of longest kill outliers: ' + str(len(data.loc[data['longestKill'] > 800])))
    data.drop(data.loc[data['longestKill'] > 800].index, inplace=True)
    return str('Longest kill outliers removed.')

In [None]:
remove_outliers_longkills(df_train)

I think it's safe to say that players who pick up an abnormal number of weapons are not playing the game in a way that helps us make accurate predictions.

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['weaponsAcquired'], title='Distribution of weapons acquired per player')

In [None]:
print('Number of players who acquired more than 50 weapons: ' + str(len(df_train[df_train['weaponsAcquired'] > 50])))


In [None]:
def remove_outliers_weapons(data):
    print('Number of weapons acquired outliers: ' + str(len(data.loc[data['weaponsAcquired'] > 50])))
    data.drop(data.loc[data['weaponsAcquired'] > 50].index, inplace=True)
    return str('Weapons acquired outliers removed.')

In [None]:
remove_outliers_weapons(df_train)

In [None]:
df_train['heals_boosts'] = df_train['heals'] + df_train['boosts']

In [None]:
plt.clf()
sns.set_style("darkgrid")
plot_hist(df_train['heals_boosts'], title='Distribution of heals and boosts used per player')

In [None]:
print('Number of players who used more than 35 heals/boosts: ' + str(len(df_train[df_train['heals_boosts'] > 35])))

In [None]:
def remove_outliers_heals_boosts(data):
    print('Number of heals/boosts used outliers: ' + str(len(data.loc[data['heals_boosts'] > 35])))
    data.drop(data.loc[data['heals_boosts'] > 35].index, inplace=True)
    return str('Heals/boosts outliers removed.')

In [None]:
remove_outliers_heals_boosts(df_train)

## Feature Engineering

I did a little bit of testing and got better results when I imputed winPoints and rankPoints with the mean of each. I think assuming that players for whom data is missing are of average skill helps to make the model more robust.

In [None]:
winPoints_clean = df_train.drop(df_train[df_train['winPoints'] < 1].index)
winPoints_clean = winPoints_clean['winPoints']
winPoints_mean = winPoints_clean.mean()

df_train['winPoints_imp'] = df_train['winPoints'].replace({0 : winPoints_mean})
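
The imputation step on a toy column, to show what ends up where (values are made up). The mean is taken over the non-zero entries only, and the zeros are then replaced with it.

```python
import pandas as pd

toy = pd.DataFrame({'winPoints': [0, 1500, 0, 1400, 1600]})

# Mean over players who actually have a winPoints value
winPoints_mean = toy.loc[toy['winPoints'] > 0, 'winPoints'].mean()

# Treat 0 as "missing" and fill it with that mean
toy['winPoints_imp'] = toy['winPoints'].replace({0: winPoints_mean})

print(winPoints_mean)
print(toy)
```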

In [None]:
plot_hist(df_train['winPoints'], title='Distribution of winPoints')

In [None]:
plot_hist(df_train['winPoints_imp'], title='Distribution of Mean-Imputed winPoints')

rankPoints is highly correlated with some of our other features and I tried dropping it but ended up with worse results. Just a note.

In [None]:
rankPoints_clean = df_train.drop(df_train[df_train['rankPoints'] < 1].index)
rankPoints_clean = rankPoints_clean['rankPoints']
rankPoints_mean = rankPoints_clean.mean()

df_train['rankPoints_imp'] = df_train['rankPoints'].replace({-1 : rankPoints_mean,
                                                            0 : rankPoints_mean})

In [None]:
plot_hist(df_train['rankPoints'], title='Distribution of rankPoints')

In [None]:
plot_hist(df_train['rankPoints_imp'], title='Distribution of Mean-Imputed rankPoints')

This is a bit of leftover code from when I was attempting to capture match type and team size in a different way. I am leaving it in as an example of what not to do, and with one line active because I don't want to break anything.

In [None]:
# df_train = pd.get_dummies(df_train, columns=['matchType'])

# matchType_enc = df_train.filter(regex='matchType')
# matchType_enc.head()

In [None]:
# df_train['solo_matchType'] = df_train['matchType_solo-fpp'] + df_train['matchType_solo']


df_train = df_train.drop(columns=matchType_drop, errors='ignore')

#### Normalizing By Match Size
Here, we normalize some features based on count of players in the match.  Some other Kagglers have normalized damage dealt in this way but I feel that this is not the best technique because damage is not inherently tied to the number of players in the game. Yes, more players means more opportunity to deal damage, but the same player can receive damage, recover, then receive more damage, etc. However, kills and placement probably have more of a relationship with match size. I also normalized knock-outs, which I have not seen done elsewhere. If you're reading this and have any thoughts on the issue, drop them in the comments!

In [None]:
def playersJoined_norm(list_of_cols, df):
    for column in list_of_cols:
        df[column] = df[column]*((100-df['match_size'])/ 100 + 1)
         

In [None]:
to_pJnorm = ['DBNOs', 'kills', 'maxPlace'] # 'damageDealt',

In [None]:
playersJoined_norm(to_pJnorm, df_train)
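
A worked example of the scaling factor `(100 - match_size) / 100 + 1` on made-up rows: a full 100-player match leaves the value unchanged, while smaller matches scale it up.

```python
import pandas as pd

toy = pd.DataFrame({
    'match_size': [100, 90, 50],
    'kills': [4.0, 4.0, 4.0],
})

# Same factor as in playersJoined_norm: 1.0 at 100 players, larger for smaller matches
factor = (100 - toy['match_size']) / 100 + 1
toy['kills_norm'] = toy['kills'] * factor

print(toy)
```

So 4 kills in a 90-player match count as 4.4, and 4 kills in a 50-player match count as 6.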

#### Weighting Boosts
Weighting boosts by match length. The logic here is that boost usage depends on the stage of the game, i.e., more boosts are used in later stages. This means that, as matches get longer than average, the importance of boosts/heals increases, and as matches get shorter, it decreases. I did this because my exploratory data analysis showed a slight increase in the number of boosts used in longer games. Intuition also helped, as did reading online that people tend to save boosts for later in the game. Players who use more boosts have been alive for longer (part of why boosts are such an important predictor). Along the same lines, players who use all of their boosts at the beginning of the game are more likely to be wiped out by players who save their boosts for other skilled players, or who are good enough to take out opponents in the earlier stages without using boosts at all.

Essentially, what is happening here is that I am assigning more importance to boosts/heals in longer games in the hopes of catching some of this effect. I also combined heals and boosts a bit earlier in the kernel because they perform a similar function in the game and because this method has been working well in other kernels.

This seems to add some predictive power when training on a sample, but not a huge amount. Due to time constraints, it is hard to measure the change in predictive power when training on the whole training dataset. We will leave this weight in our data and assume that the effect may be amplified when scaling up to the full-sized training dataset.

Do you feel that this technique is valid? Let me know in the comments. I am eager for feedback.

In [None]:
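def matchDuration_wt(list_of_cols, df):
    for column in list_of_cols:
        # Scale by match length relative to the average match, so longer matches get more weight
        df[column] = (df[column] * df['matchDuration']) / df['matchDuration'].mean()
        # clip returns a new Series, so the result has to be assigned back
        df[column] = df[column].clip(lower=0)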
        

In [None]:
to_matchDwt = ['heals_boosts']

In [None]:
matchDuration_wt(to_matchDwt, df_train)
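
A minimal sketch of the weighting idea described in this section, assuming the intended weight is each row's `matchDuration` divided by the mean `matchDuration` (that reading comes from the prose; the toy numbers are made up).

```python
import pandas as pd

toy = pd.DataFrame({
    'matchDuration': [1200, 1800, 2400],
    'heals_boosts': [4.0, 4.0, 4.0],
})

# Weight is below 1 for shorter-than-average matches and above 1 for longer ones
weight = toy['matchDuration'] / toy['matchDuration'].mean()

# clip returns a new Series, so the result is assigned rather than called on its own
toy['heals_boosts_wt'] = (toy['heals_boosts'] * weight).clip(lower=0)

print(toy)
```

With a mean duration of 1800 seconds, the same 4 heals/boosts count for about 2.67 in the short match, 4.0 in the average one, and about 5.33 in the long one.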

Both of the functions below were modified from functions in other kernels, but I can't find the source of the first one. The second one is modified from the function in [this kernel](https://www.kaggle.com/anycode/simple-nn-baseline-3). We are really letting LightGBM do most of the work here by casting a wide net of potentially valuable features and feeding them into the algorithm for it to decide what is best to use.

In [None]:
def engineer_features(data):
    data['max_possible_kills'] = data.match_size - data.team_size
    # data['total_distance'] = data.rideDistance + data.swimDistance + data.walkDistance
    data['total_items_acquired'] = data.boosts + data.heals + data.weaponsAcquired
    data['items_per_distance'] =  data.total_items_acquired/data.total_distance
    data['kills_per_distance'] = data.kills/data.total_distance
    data['knocked_per_distance'] = data.DBNOs/data.total_distance
    data['damage_per_distance'] = data.damageDealt/data.total_distance
    data['headshot_kill_rate'] = data.headshotKills/data.kills
    data['headshot_kill_rate'] = data['headshot_kill_rate'].fillna(data['headshot_kill_rate'].mean())
    
    data['max_kills_by_team'] = data.groupby(['matchId','groupId']).kills.transform('max')
    data['total_team_damage'] = data.groupby(['matchId','groupId']).damageDealt.transform('sum')
    data['total_team_kills'] =  data.groupby(['matchId','groupId']).kills.transform('sum')
    data['total_team_items'] = data.groupby(['matchId','groupId']).total_items_acquired.transform('sum')
    data['pct_killed'] = data.kills/data.max_possible_kills
    data['pct_knocked'] = data.DBNOs/data.max_possible_kills
    data['pct_team_killed'] = data.total_team_kills/data.max_possible_kills
    data['team_kill_points'] = data.groupby(['matchId','groupId']).killPoints.transform('sum')
    data['team_kill_rank'] = data.groupby(['matchId','groupId']).killPlace.transform('mean')
    data['max_kills_match'] = data.groupby('matchId').kills.transform('max')
    data['total_kills_match'] = data.groupby('matchId').kills.transform('sum')
    # data['total_distance_match'] = data.groupby('matchId').total_distance.sum()
    # data['map_has_sea'] =  data.groupby('matchId').swimDistance.transform('sum').apply(lambda x: 1 if x>0 else 0)
     
    data = data.join(pd.get_dummies(data['game_mode']))
    
    data.fillna(0, axis=1, inplace=True)
      
    
    return data


In [None]:
# The join inside engineer_features creates a new frame, so keep the returned result
df_train = engineer_features(df_train)
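
One thing to watch with the per-distance ratios: when `total_distance` is 0, pandas produces `inf` (non-zero divided by zero) or `NaN` (zero divided by zero), and `fillna(0)` only catches the `NaN` case. The sketch below shows this on made-up rows, along with one possible way to handle it. Whether such rows affected the original results is not something I can tell from the kernel.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'damageDealt': [100.0, 50.0, 0.0],
    'total_distance': [200.0, 0.0, 0.0],
})

# 100/200 -> 0.5, 50/0 -> inf, 0/0 -> NaN
toy['damage_per_distance'] = toy['damageDealt'] / toy['total_distance']

# fillna alone leaves the inf in place
filled = toy['damage_per_distance'].fillna(0)

# Turning inf into NaN first lets fillna handle both cases
cleaned = toy['damage_per_distance'].replace([np.inf, -np.inf], np.nan).fillna(0)

print(filled.tolist())
print(cleaned.tolist())
```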

In [None]:
df_train = reduce_mem_usage(df_train)

In [None]:
def engineer_features2(is_train=True):
    test_idx = None
    if is_train: 
        print("processing train")
        df = df_train           

        df = df[df['maxPlace'] > 1]
    else:
        print("processing test")
        df = df_test
        test_idx = df['Id']
    

    
    print("selected features to generate aggregates for")
    target = 'winPlacePerc'
    features = [#'Id',
                #'groupId',
                #'matchId',
                'assists',
                #'boosts',
                'damageDealt',
                'DBNOs',
                #'headshotKills',
                #'heals',
                'killPlace',
                'killPoints',
                'kills',
                'killStreaks',
                'longestKill',
                #'matchDuration',
                'maxPlace',
                'numGroups',
                #'rankPoints',
                'revives',
                #'rideDistance',
                #'roadKills',
                #'swimDistance',
                'teamKills',
                'vehicleDestroys',
                #'walkDistance',
                'weaponsAcquired',
                #'winPoints',
                #'winPlacePerc',
                'team_size',
                #'max_team_size',
                'match_size',
                'team_indicator_1',
                'team_indicator_2',
                'team_indicator_3',
                'team_indicator_4',
                #'team_indicator_5',
                #'game_mode',
                #'match_mean',
                #'match_median',
                'total_distance',
                #'killsWithoutMoving',
                'heals_boosts',
                'winPoints_imp',
                'rankPoints_imp',
                'max_possible_kills',
                'total_items_acquired',
                'items_per_distance',
                'kills_per_distance',
                'knocked_per_distance',
                'damage_per_distance',
                'headshot_kill_rate',
                #'max_kills_by_team',
                'total_team_damage',
                'total_team_kills',
                'total_team_items',
                'pct_killed',
                'pct_knocked',
                'pct_team_killed',
                'team_kill_points',
                'team_kill_rank',
                #'max_kills_match',
                'total_kills_match',
                #'duo',
                #'solo',
                #'squad'
    ]

   
  
     
    y = None
        
    if is_train: 
        print("get target")
        y = np.array(df.groupby(['matchId','groupId'])[target].agg('mean'), dtype=np.float64)
        # features.remove(target)
    
    # else:
    #     y = np.array(df.groupby(['matchId','groupId'])[target].agg('mean'), dtype=np.float64)

    print("get group mean feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('mean')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    
    # Train: one row per group. Test: one row per player, so every Id gets a prediction.
    if is_train:
        df_out = agg.reset_index()[['matchId','groupId']]
    else:
        df_out = df[['matchId','groupId']]
    #df_out = agg.reset_index()[['matchId','groupId']]
        
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_mean", "_mean_rank"], how='left', on=['matchId', 'groupId'])
    
    # print("get group sum feature")
    # agg = df.groupby(['matchId','groupId'])[features].agg('sum')
    # agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    # df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    # df_out = df_out.merge(agg_rank, suffixes=["_sum", "_sum_rank"], how='left', on=['matchId', 'groupId'])
    
    # print("get group sum feature")
    # agg = df.groupby(['matchId','groupId'])[features].agg('sum')
    # agg_rank = agg.groupby('matchId')[features].agg('sum')
    # df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    # df_out = df_out.merge(agg_rank.reset_index(), suffixes=["_sum", "_sum_pct"], how='left', on=['matchId', 'groupId'])
    
    print("get group max feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('max')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_max", "_max_rank"], how='left', on=['matchId', 'groupId'])
    
    print("get group min feature")
    agg = df.groupby(['matchId','groupId'])[features].agg('min')
    agg_rank = agg.groupby('matchId')[features].rank(pct=True).reset_index()
    df_out = df_out.merge(agg.reset_index(), suffixes=["", ""], how='left', on=['matchId', 'groupId'])
    df_out = df_out.merge(agg_rank, suffixes=["_min", "_min_rank"], how='left', on=['matchId', 'groupId'])
    
    print("get group size feature")
    agg = df.groupby(['matchId','groupId']).size().reset_index(name='group_size')
    df_out = df_out.merge(agg, how='left', on=['matchId', 'groupId'])
    
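    print("get match mean feature")
    agg = df.groupby(['matchId'])[features].agg('mean').reset_index()
    # df_out holds no unsuffixed feature columns at this point, so "_match_mean" is
    # never applied and the match means keep the plain feature names (e.g. 'kills').
    df_out = df_out.merge(agg, suffixes=["", "_match_mean"], how='left', on=['matchId'])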
    
    # print("get match type feature")
    # agg = df.groupby(['matchId'])[matchType.columns].agg('mean').reset_index()
    # df_out = df_out.merge(agg, suffixes=["", "_match_type"], how='left', on=['matchId'])
    
    # print("get match size feature")
    # agg = df.groupby(['matchId']).size().reset_index(name='match_size')
    # df_out = df_out.merge(agg, how='left', on=['matchId'])
    
    # df_out.drop(["matchId", "groupId"], axis=1, inplace=True)

    X = df_out
    
    feature_names = list(df_out.columns)

    del df, df_out, agg, agg_rank
    gc.collect()

    return X, y, feature_names, test_idx

In [None]:
df_train, y_train, train_columns, _ = engineer_features2(is_train=True)
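
To make the group-level aggregation easier to follow, here is a minimal sketch of the "group mean plus within-match percentile rank" step on a toy frame. The data and the single `kills` feature are made up for illustration; the calls mirror the ones used in `engineer_features2`.

```python
import pandas as pd

# Toy data: two matches, each with two groups (all values are made up).
toy = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m2", "m2", "m2"],
    "groupId": ["g1", "g1", "g2", "g3", "g4", "g4"],
    "kills":   [2, 4, 1, 0, 5, 3],
})
features = ["kills"]

# Per-group mean, indexed by (matchId, groupId).
agg = toy.groupby(["matchId", "groupId"])[features].agg("mean")

# Percentile rank of each group's mean within its own match.
agg_rank = agg.groupby("matchId")[features].rank(pct=True).reset_index()

# Both frames carry a 'kills' column, so the suffixes are applied on merge.
out = agg.reset_index().merge(
    agg_rank, suffixes=("_mean", "_mean_rank"), how="left", on=["matchId", "groupId"]
)
print(out)
```

Each group ends up with its raw aggregate and its standing relative to the other groups in the same match.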

I'm reducing memory usage again here because the aggregate features created by these functions would otherwise use up all the RAM available in the kernel.
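
As a rough illustration of why the downcast helps, the sketch below compares the memory footprint of a made-up float64 frame before and after casting to float32. It uses a fixed dtype for simplicity, whereas `reduce_mem_usage` picks a dtype per column.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the aggregated features (float64 by default).
demo = pd.DataFrame(np.random.rand(1000, 5), columns=list("abcde"))
before = demo.memory_usage(index=False).sum()

# Downcast every column to float32, halving the bytes per value.
small = demo.astype(np.float32)
after = small.memory_usage(index=False).sum()

print("bytes before:", before, "bytes after:", after)
```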

In [None]:
df_train = reduce_mem_usage(df_train)

The lists below hold the names of features that were combined into new features, were used only for outlier detection, or were created by the functions above but do not make logical sense.

In [None]:
to_drop = ['winPoints', 'rankPoints', 'rideDistance', 'walkDistance', 'swimDistance', 'headshotKills',
           'roadKills', 'max_team_size', 'match_mean', 'match_median', 'team_indicator_5',
           'game_mode', 'killsWithoutMoving', 'killsWithoutMoving_mean', 'heals', 'boosts',
           'match_mean_mean']

to_drop_test = ['winPoints', 'rankPoints', 'rideDistance', 'walkDistance', 'swimDistance', 'headshotKills',
                'roadKills', 'max_team_size', 'team_indicator_5', 'game_mode', 'killsWithoutMoving_mean',
                'heals', 'boosts', 'match_mean_mean', 'winPlacePerc']

to_drop2 = ['Id', 'groupId', 'matchId', 'team_size']

categorical = ['matchId', 'groupId']

In [None]:
df_train.drop(labels=to_drop,inplace=True,axis=1, errors='ignore')
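
Several names in the drop lists may not exist in the frame at this point, which is why the call passes `errors='ignore'`. A small sketch on a made-up frame shows the effect:

```python
import pandas as pd

# Made-up frame; 'game_mode' is deliberately absent.
demo = pd.DataFrame({"kills": [1, 2], "heals": [0, 3], "boosts": [1, 1]})

# errors='ignore' skips labels that are missing instead of raising KeyError.
trimmed = demo.drop(labels=["heals", "boosts", "game_mode"], axis=1, errors="ignore")
print(list(trimmed.columns))
```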

In [None]:
df_train['winPlacePerc'] = y_train

## Model Development and Training

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score
from lightgbm import LGBMRegressor

def train_validation(df, train_size=0.9):
    
    # Split by match rather than by row so that no match appears in both sets.
    unique_games = df.matchId.unique()
    train_index = int(unique_games.shape[0]*train_size)
    
    np.random.seed(42)
    np.random.shuffle(unique_games)
    
    train_Id = unique_games[:train_index]
    validation_Id = unique_games[train_index:]
    
    train = df[df.matchId.isin(train_Id)]
    validation = df[df.matchId.isin(validation_Id)]
    
    return train, validation
    
train, validation = train_validation(df_train)
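
The point of splitting on `matchId` is that groups from one match never land on both sides of the split. Below is a minimal sketch of the same idea on made-up data; `split_by_match` is a hypothetical helper written for this example, not part of the notebook.

```python
import numpy as np
import pandas as pd

def split_by_match(df, train_size=0.9, seed=42):
    """Hypothetical helper: hold out whole matches, mirroring train_validation."""
    rng = np.random.RandomState(seed)
    # permutation returns a shuffled copy of the unique match ids.
    games = rng.permutation(df["matchId"].unique())
    cut = int(len(games) * train_size)
    train_ids, val_ids = games[:cut], games[cut:]
    return df[df["matchId"].isin(train_ids)], df[df["matchId"].isin(val_ids)]

# Made-up frame: 10 matches with 3 rows each.
demo = pd.DataFrame({
    "matchId": np.repeat(["m%d" % i for i in range(10)], 3),
    "winPlacePerc": np.linspace(0, 1, 30),
})
train_part, val_part = split_by_match(demo)

# No match id is shared between the two parts.
shared = set(train_part["matchId"]) & set(val_part["matchId"])
print(len(train_part), len(val_part), shared)
```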


In [None]:
# Inverse team-size weights. Note that they are not passed to lgbm.fit() below.
train_weights = 1/train['team_size']
validation_weights = 1/validation['team_size']


In [None]:
train.drop(labels=to_drop2,inplace=True,axis=1, errors='ignore')
validation.drop(labels=to_drop2,inplace=True,axis=1, errors='ignore')

X_train = train.drop(labels=['winPlacePerc'],axis=1)
X_val = validation.drop(labels=['winPlacePerc'], axis=1)

y_train = train['winPlacePerc']
y_val = validation['winPlacePerc']

print(X_train.shape, X_val.shape)

In [None]:
del df_train
gc.collect()

In [None]:
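# Written against the 2018 LightGBM API. From LightGBM 4.0 on, fit() no longer accepts
# early_stopping_rounds or verbose; pass callbacks=[lightgbm.early_stopping(100),
# lightgbm.log_evaluation(50)] instead.
lgbm = LGBMRegressor(objective='mae', n_estimators=4500,
                     learning_rate=0.03, num_leaves=300,
                     max_depth=14,
                     n_jobs=-1, random_state=42, verbose=50)

lgb_reg = lgbm.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric='mae', early_stopping_rounds=100, verbose=50)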

In [None]:
# lgb_reg.booster_.save_model('model.txt', num_iteration=lgb_reg.best_iteration_)

In [None]:
# Feature importances, sorted from least to most important.
pd.DataFrame(sorted(zip(lgbm.feature_importances_, X_train.columns)), columns=['importance', 'feature'])

In [None]:
del X_train, y_train, X_val, y_val
gc.collect()

In [None]:
df_test = pd.read_csv('../input/test_V2.csv', sep=',')
df_test = reduce_mem_usage(df_test)

In [None]:
# test_id = df_test['Id']

df_test = df_test.assign(team_size=df_test.groupby('groupId').groupId.transform('count'))
df_test = df_test.assign(max_team_size=df_test.groupby('matchId').team_size.transform('max'))
df_test = df_test.assign(match_size=df_test.groupby('matchId').Id.transform('nunique'))

# Cap team size at 5 so that every team of 5 or more shares one indicator.
df_test = df_test.assign(team_indicator=df_test.team_size.clip(upper=5))

df_test = pd.get_dummies(df_test, columns=['team_indicator'])
dummy_cols = ['team_indicator_{}'.format(i) for i in np.arange(1,6)]
df_test[dummy_cols] = df_test.groupby('matchId')[dummy_cols].transform('mean')

df_test.loc[df_test.team_indicator_1 >= 0.7, 'game_mode'] = 'solo'
df_test.loc[df_test.team_indicator_2 >= 0.6, 'game_mode'] = 'duo'
df_test.loc[(df_test.team_indicator_3 + df_test.team_indicator_4) >= 0.5, 'game_mode'] = 'squad'
df_test.game_mode = np.where((df_test.team_indicator_5 >= 0.2), 'custom', df_test.game_mode)
df_test.game_mode = df_test.game_mode.fillna('custom')

# don't drop custom games for test set, game_mode is dropped in the big function

df_test['total_distance'] = df_test.rideDistance + df_test.swimDistance + df_test.walkDistance

df_test['heals_boosts'] = df_test['heals'] + df_test['boosts']

winPoints_clean = df_test.drop(df_test[df_test['winPoints'] < 1].index)
winPoints_clean = winPoints_clean['winPoints']
winPoints_mean = winPoints_clean.mean()

df_test['winPoints_imp'] = df_test['winPoints'].replace({0 : winPoints_mean})


rankPoints_clean = df_test.drop(df_test[df_test['rankPoints'] < 1].index)
rankPoints_clean = rankPoints_clean['rankPoints']
rankPoints_mean = rankPoints_clean.mean()

df_test['rankPoints_imp'] = df_test['rankPoints'].replace({-1 : rankPoints_mean})


df_test = pd.get_dummies(df_test, columns=['matchType'])

matchType_enc = df_test.filter(regex='matchType')

playersJoined_norm(to_pJnorm, df_test)

matchDuration_wt(to_matchDwt, df_test)

df_test = engineer_features(df_test)
df_test, _, test_columns, test_id = engineer_features2(is_train=False)

df_test.drop(labels=to_drop_test,inplace=True, axis=1, errors='ignore')
df_test.drop(labels=to_drop2,inplace=True, axis=1, errors='ignore')
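
The `winPoints_imp` and `rankPoints_imp` steps in the cell above replace placeholder values (0 and -1 respectively) with the mean of the valid entries. Here is a minimal sketch of that imputation on made-up numbers, using a boolean mask in place of the `drop(...)` call:

```python
import pandas as pd

# Made-up values: 0 marks a missing winPoints, -1 a missing rankPoints.
demo = pd.DataFrame({"winPoints": [0, 1500, 0, 1400],
                     "rankPoints": [1480, -1, 1520, -1]})

# Mean over valid entries only (values below 1 are treated as missing).
win_mean = demo.loc[demo["winPoints"] >= 1, "winPoints"].mean()
rank_mean = demo.loc[demo["rankPoints"] >= 1, "rankPoints"].mean()

# Swap each placeholder for the corresponding mean.
demo["winPoints_imp"] = demo["winPoints"].replace({0: win_mean})
demo["rankPoints_imp"] = demo["rankPoints"].replace({-1: rank_mean})
print(demo)
```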




In [None]:
df_test = reduce_mem_usage(df_test)

In [None]:
results = pd.DataFrame()
results['Id'] = test_id
results['winPlacePerc'] = lgbm.predict(df_test, num_iteration=lgbm.best_iteration_)
results.winPlacePerc = results.winPlacePerc.clip(0, 1)

results.to_csv("submission.csv", index=False)
results.head(20)
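
Since `winPlacePerc` is a percentile, the clip keeps raw regression outputs inside [0, 1] before they are written out. A small sketch with made-up predictions:

```python
import numpy as np
import pandas as pd

# Made-up raw model outputs, two of which fall outside the valid range.
raw = np.array([-0.03, 0.25, 0.98, 1.07])
demo = pd.DataFrame({"Id": ["a", "b", "c", "d"], "winPlacePerc": raw})

# Values below 0 become 0 and values above 1 become 1; the rest are unchanged.
demo["winPlacePerc"] = demo["winPlacePerc"].clip(0, 1)
print(demo)
```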