In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

The real reason for doing all the hard work in the two previous notebooks is to get better results; otherwise, what is the point of writing tons of fancy code that does nothing? I attempted this competition before and got quite ordinary results. In this notebook, let's try to improve our score with everything we have learnt. Because of a bug in Kaggle kernels, we will use Keras to train our models. In the future we will also demonstrate GCP-related operations, where we will use distributed training.

Thankfully, the dataset is already split into train and test sets, so not much effort is required there. Let's take a look at the data.

In [None]:
%%bash
cd ../input/pubg-finish-placement-prediction
ls

The dataset is huge, which makes it challenging to hold in memory with pandas. Although we have plenty of RAM, we will use it cautiously: we will write a very simple Apache Beam job to preprocess the data and write the train and eval sets to disk. So we install Beam. Beam is a unified framework for defining batch and stream processing pipelines, and it can run on a variety of backends. Using this simple pipeline, we will create our eval set instead of using in-memory pandas.

Since we have plenty of data, about 4 million data points, it won't hurt us much to take a few of them randomly out of the train set. But this split has to be reproducible to get consistent results. In the previous notebooks we discussed a technique that splits the data randomly yet keeps the split reproducible. For the previous dataset we used a random seed, but here we will hash the data to generate random numbers. So let's get to work and make our train_test_split method.

# Apache beam approach to train test split

To get a repeatable, randomly distributed set of train and evaluation data, we cannot rely on a random seed. Instead, we use the dataset itself to generate the randomness. The splitting algorithm calculates a numeric hash from a column that is not used as a feature, such as the match ID. We then take the absolute value of this hash and compute its remainder modulo a number such as 10000. Depending on the size of the dataset and the granularity required for the split, we can increase this number. The remainder is then compared with a threshold, which is the eval fraction (a float such as 0.2) multiplied by the modulus we selected. If the remainder is less than the threshold, the row goes to the eval set; otherwise it goes to the train set. The pipeline below uses an eval fraction of 0.1 and a modulus of 1,000,000.
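
The splitting rule described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's own code: the function name `assign_split`, the choice of an MD5 digest from `hashlib`, and the sample keys are all assumptions made for the example. A digest is used here because Python's built-in `hash()` is salted per process for strings, so it would not reproduce the same split on every run.

```python
import hashlib

def assign_split(key, eval_fraction=0.2, buckets=10000):
    """Return 'eval' or 'train' for a row key, deterministically."""
    # Stable numeric hash of the key (hypothetical helper, not from the notebook).
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    remainder = int(digest, 16) % buckets
    # Remainders below the threshold go to the eval set.
    return "eval" if remainder < eval_fraction * buckets else "train"

# Made-up keys, only to show the call.
sample_keys = ["match-%d" % i for i in range(1000)]
splits = [assign_split(k) for k in sample_keys]
eval_share = splits.count("eval") / len(splits)
```

With enough keys, `eval_share` should land close to `eval_fraction`, and calling `assign_split` twice on the same key always gives the same answer.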

Our Beam pipeline uses this algorithm. It reads the main train file and then splits the data into two PCollections, which are written to two separate files. This may seem more time-consuming than a pandas train/test split, but it allows a distributed approach, so we can preprocess huge datasets without exhausting system memory, which is not always possible with pandas.

In [None]:
%%bash
pip3 install apache-beam

# Defining Pipeline

In [None]:
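import hashlib

import apache_beam as beam

EVAL_SPLIT = 0.1
BUCKETS = 1000000


def stable_bucket(key):
    # Python's built-in hash() is salted per process for strings, so it would give a
    # different split on every run; an MD5 digest is stable across runs.
    return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16) % BUCKETS


def is_data_row(data):
    # Skip empty records and the header row (the one that contains 'Id').
    return data is not None and 'Id' not in data


def eval_filter(data):
    # Rows whose bucket falls below the threshold go to the eval set.
    return is_data_row(data) and stable_bucket(data[0]) < BUCKETS * EVAL_SPLIT


def train_filter(data):
    # All remaining rows go to the train set.
    return is_data_row(data) and stable_bucket(data[0]) >= BUCKETS * EVAL_SPLIT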
       

cols = ['Id', 'groupId', 'matchId', 'assists', 'boosts', 'damageDealt', 'DBNOs',
       'headshotKills', 'heals', 'killPlace', 'killPoints', 'kills',
       'killStreaks', 'longestKill', 'matchDuration', 'matchType', 'maxPlace',
       'numGroups', 'rankPoints', 'revives', 'rideDistance', 'roadKills',
       'swimDistance', 'teamKills', 'vehicleDestroys', 'walkDistance',
       'weaponsAcquired', 'winPoints', 'winPlacePerc']

class SplitWords(beam.DoFn):
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def process(self, text):
        yield text.split(self.delimiter)
        
class ConvertToCsv(beam.DoFn):
    def process(self, text):
        yield ','.join(text)
      

    
with beam.Pipeline() as pipeline:
    out = (pipeline
           | "Read Data" >> beam.io.ReadFromText('../input/pubg-finish-placement-prediction/train_V2.csv')
           | "Split to array" >> beam.ParDo(SplitWords(','))
    )
    
    train = (out
             | "Filter train" >> beam.Filter(train_filter)
             | "convert to array train" >> beam.ParDo(ConvertToCsv())
             | "Write train" >> beam.io.WriteToText(
                     header = ','.join(cols),
                    file_path_prefix = 'train',
                    file_name_suffix = '.csv',
                    shard_name_template = ''
                )
    )
    
    eval_set = (out  # renamed from `eval` to avoid shadowing the Python builtin
             | "Filter eval" >> beam.Filter(eval_filter)
             | "convert to array eval" >> beam.ParDo(ConvertToCsv())
             | "Write eval" >> beam.io.WriteToText(
                     header = ','.join(cols),
                    file_path_prefix = 'eval',
                    file_name_suffix = '.csv',
                    shard_name_template = ''
                )
    )




Our Beam pipeline generates two files, train.csv and eval.csv. We will then read them through lazily loaded datasets so that we can iterate over them in a memory-efficient manner.

# Generating feature columns

In [None]:
import tensorflow as tf

In [None]:
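# x_train is not defined earlier in this notebook; as an assumption, inspect a sample of train.csv.
x_train = pd.read_csv('train.csv', nrows=100000)
for item in x_train:
    print({item : "{} : {}".format(x_train[item].unique(),len(x_train[item].unique()))})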

In [None]:
def gennerate_feature_columns():
    return [
        tf.feature_column.numeric_column('assists'),
        tf.feature_column.numeric_column('boosts'),
        tf.feature_column.numeric_column('damageDealt'),
        #tf.feature_column.bucketized_column(tf.feature_column.numeric_column('DBNOs'),[0,1,2,3,5,10,15,20,25,30]),
        tf.feature_column.numeric_column('headshotKills'),
        tf.feature_column.numeric_column('heals'),
        tf.feature_column.bucketized_column(tf.feature_column.numeric_column('killPlace'),[0,10,20]),
        tf.feature_column.bucketized_column(tf.feature_column.numeric_column('killPoints'),[50,750,1000,1100]),
        tf.feature_column.bucketized_column(tf.feature_column.numeric_column('killStreaks'),[1,3,5,10]),
        #tf.feature_column.bucketized_column(tf.feature_column.numeric_column('longestKill'),[0,20,40,60,80,100,150,200]),
        tf.feature_column.numeric_column('matchDuration'),
        #tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list('matchType',['squad-fpp','duo-fpp','squad','solo-fpp','duo'])),
        #tf.feature_column.bucketized_column(tf.feature_column.numeric_column('numGroups'),[0,10,20,22,24,26,28,30,40,42,44,46,48,50,60,70,80,82,84,86,88,90,92,94,96,98,100]),
        tf.feature_column.bucketized_column(tf.feature_column.numeric_column('revives'),[0,5,10,15,20,25,30,35,40,45,50]),
        tf.feature_column.bucketized_column(tf.feature_column.numeric_column('walkDistance'),[0,1000,3000]),
        tf.feature_column.bucketized_column(tf.feature_column.numeric_column('weaponsAcquired'),[0,5,10,15,20]),
    ]  

These feature columns, as defined in the previous notebooks, map each input column to the required preprocessed format and convert it into a tensor to feed to the model. This lets us do things like one-hot encoding as part of the model graph itself, reducing the strain on system resources and making preprocessing part of the automated training pipeline.
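
To make the bucketized columns above more concrete, here is a small pure-Python sketch of the mapping they perform. It follows the documented bucket convention of `tf.feature_column.bucketized_column` (each bucket includes its left boundary and excludes its right one), but the helper `bucketize_one_hot` is a hypothetical name written only for illustration and is not part of TensorFlow.

```python
from bisect import bisect_right

def bucketize_one_hot(value, boundaries):
    """One-hot encode `value` into len(boundaries) + 1 buckets."""
    # Bucket 0 is everything below boundaries[0]; the last bucket is everything
    # from boundaries[-1] upwards; buckets in between are [left, right).
    index = bisect_right(boundaries, value)
    one_hot = [0] * (len(boundaries) + 1)
    one_hot[index] = 1
    return one_hot

walk_boundaries = [0, 1000, 3000]  # same boundaries as the walkDistance column
encoded = bucketize_one_hot(1500, walk_boundaries)  # falls in [1000, 3000)
```

A column with three boundaries therefore produces a four-element vector, which is what the dense layers receive in place of the raw distance.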

# Model Definition

In [None]:
# def model_fn(features, labels, mode):
#     model = tf.keras.Sequential([
#       tf.keras.layers.DenseFeatures(gennerate_feature_columns()),
#       tf.keras.layers.Dense(1,activation = 'relu'),
#       tf.keras.layers.Dense(1,activation = 'softmax')
#     ])
    
#     logits = model(features, training=False)
    
#     if mode == tf.estimator.ModeKeys.PREDICT:
#         predictions = {'logits': logits}
#         return tf.estimator.EstimatorSpec(labels=labels, predictions=predictions)
    
#     optimizer = tf.compat.v1.train.AdamOptimizer()
#     loss = tf.keras.losses.MAE(labels, logits)
    
#     if mode == tf.estimator.ModeKeys.EVAL:
#         return tf.estimator.EstimatorSpec(mode = mode, loss=loss)

#     return tf.estimator.EstimatorSpec(
#           mode=mode,
#           loss=loss,
#           train_op=optimizer.minimize(
#           loss, tf.compat.v1.train.get_or_create_global_step()))


The above code is commented out because it does not really demonstrate the capabilities of the Estimator API, and because of a bug the logs are not printed properly in Kaggle. So for the time being we will use Keras to train our models, while still using TensorFlow as much as possible.

# Dataset Generation

In [None]:
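train_batch_size = 100
eval_batch_size = 10

# num_epochs=1 so that one Keras epoch corresponds to one pass over the file;
# the number of passes is controlled by `epochs` in model.fit().
train_dataset = tf.data.experimental.make_csv_dataset(
    ['train.csv'],
    train_batch_size,
    label_name='winPlacePerc',
    num_epochs=1)

# Use the eval batch size defined above for the eval set.
eval_dataset = tf.data.experimental.make_csv_dataset(
    ['eval.csv'],
    eval_batch_size,
    label_name='winPlacePerc',
    num_epochs=1)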

The above code creates datasets that are loaded lazily during training. They are fed directly to the Keras model.
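
The idea of lazy loading can be illustrated without TensorFlow. The sketch below is only an analogy for what a batched CSV dataset does, not how `make_csv_dataset` is implemented: the generator `iter_batches` and the tiny in-memory CSV are assumptions made up for the example. Rows are read one at a time and handed out in batches, so the whole file never has to sit in memory.

```python
import csv
import io

def iter_batches(lines, batch_size, label_name):
    """Yield (features, labels) batches from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    features, labels = [], []
    for row in reader:
        # Separate the label column from the remaining feature columns.
        labels.append(float(row.pop(label_name)))
        features.append(row)
        if len(features) == batch_size:
            yield features, labels
            features, labels = [], []
    if features:  # final, possibly smaller, batch
        yield features, labels

# Tiny in-memory stand-in for train.csv (values are made up).
fake_csv = io.StringIO("kills,heals,winPlacePerc\n1,0,0.5\n0,2,0.1\n3,1,0.9\n")
batches = list(iter_batches(fake_csv, batch_size=2, label_name="winPlacePerc"))
```

Here three rows with a batch size of 2 give one full batch and one final batch of a single row.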

# Model Training

In [None]:
# loss_object = tf.keras.losses.MeanAbsoluteError()

# import time

# def loss(model, x, y, training):
#     y_ = model(x, training=training)
#     return loss_object(y_true=y, y_pred=y_)

# def grad(model, inputs, targets):
#     with tf.GradientTape() as tape:
#         loss_value = loss(model, inputs, targets, training=True)
#     return loss_value, tape.gradient(loss_value, model.trainable_variables)

# optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

# model = tf.keras.Sequential([
#       tf.keras.layers.DenseFeatures(gennerate_feature_columns()),
#       tf.keras.layers.Dense(100,activation = 'relu'),
#       tf.keras.layers.Dense(10,activation = 'relu'),
#       tf.keras.layers.Dense(1,activation = 'softmax')
#     ])

# train_loss_results = []
# train_accuracy_results = []

# num_epochs = 1
# start_time = 0
# counnt = 1
# for epoch in range(num_epochs):
#     epoch_loss_avg = tf.keras.metrics.Mean()
#     epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
#     start_time = time.time()
#     print("Epoch:- ",counnt)
#     for x, y in train_dataset:
#         loss_value, grads = grad(model, x, y)
#         optimizer.apply_gradients(zip(grads, model.trainable_variables))
#         # Track progress
#         epoch_loss_avg.update_state(loss_value)  # Add current batch loss
#         # Compare predicted label to actual label
#         # training=True is needed only if there are layers with different
#         # behavior during training versus inference (e.g. Dropout).
#         epoch_accuracy.update_state(tf.reshape(y,(100,1)), model(x, training=True))
#         print(counnt)
#         counnt+=1
#      # End epoch
    
#     train_loss_results.append(epoch_loss_avg.result())
#     train_accuracy_results.append(epoch_accuracy.result())
#     print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(epoch,
#                                                                 epoch_loss_avg.result(),
#                                                                 epoch_accuracy.result()))

The above code is a custom training loop that we could use to write a custom training job. Let's use the vanilla Keras trainer for the time being.

In [None]:
next(iter(train_dataset))[1]

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(gennerate_feature_columns()),
    tf.keras.layers.Dense(2048, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(1024, activation=tf.keras.activations.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(2048, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(1024, activation=tf.keras.activations.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(2048, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(1024, activation=tf.keras.activations.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(2048, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(1024, activation=tf.keras.activations.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(2048, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(1024, activation=tf.keras.activations.relu),
    # winPlacePerc lies in [0, 1]. Softmax over a single unit always outputs 1.0,
    # so a sigmoid output is used instead.
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# This is a regression problem, so track mean absolute error rather than accuracy.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss=tf.keras.losses.mean_squared_error, metrics=["mae"])
# The datasets are already batched, so batch_size must not be passed to fit().
model.fit(train_dataset, epochs=3, verbose=1, validation_data=eval_dataset)


As we can see, we no longer have to tweak the input layer every time the shape of the input data changes, which gives us a lot more freedom. Also, when we are dealing with a large number of columns containing different types of data, it becomes difficult to calculate the exact input shape of the model. Using feature columns eliminates these problems.

We can play around with the model architecture to improve the results a lot more, and we have already achieved better results than before. But the intention of this notebook is to demonstrate how to use our system resources efficiently to train our models. These techniques will help us later to handle even larger datasets.
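
To show what calculating the input shape by hand would involve, here is a rough sketch for the feature columns defined in this notebook. It assumes each numeric column contributes a single value (the default shape) and each bucketized column contributes one slot per bucket, that is, the number of boundaries plus one. The helper `dense_input_width` is a hypothetical name used only for this illustration.

```python
def dense_input_width(numeric_columns, bucket_boundaries):
    """Width of the dense input built from numeric and bucketized columns."""
    # Each numeric column contributes one value.
    width = len(numeric_columns)
    # Each bucketized column contributes len(boundaries) + 1 one-hot slots.
    for boundaries in bucket_boundaries.values():
        width += len(boundaries) + 1
    return width

numeric = ["assists", "boosts", "damageDealt", "headshotKills", "heals", "matchDuration"]
bucketized = {
    "killPlace": [0, 10, 20],
    "killPoints": [50, 750, 1000, 1100],
    "killStreaks": [1, 3, 5, 10],
    "revives": [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
    "walkDistance": [0, 1000, 3000],
    "weaponsAcquired": [0, 5, 10, 15, 20],
}
width = dense_input_width(numeric, bucketized)  # 6 numeric + 36 bucket slots
```

Every change to a boundary list or to the set of columns changes this number, which is exactly the bookkeeping the feature columns take off our hands.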