# Creating Folds & Preprocessing

We will use Kaggle Kernels to test XGBoost and CatBoost models on GPU, and these local notebooks to test LightGBM models on CPU. We want the cross-validation scheme to be consistent across both, so we define it here, save it in the `feather` format, and upload it to Kaggle.

The raw data is large and slow to load, so we downcast the numerical columns to reduce memory usage and store the result in a format that loads faster.

In [1]:
# Global variables for testing changes to this notebook quickly
FOLD_SEED = 0
MIN_FOLDS = 3
MAX_FOLDS = 6

In [2]:
import os
import warnings
import numpy as np
import pandas as pd
import pyarrow
import time

# cross validation
from sklearn.model_selection import StratifiedKFold

# display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
#pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')

# Loading Times

We benchmark the `read_csv` method with our raw data:

In [3]:
%%time 

train = pd.read_csv('../data/train.csv')

Wall time: 9.24 s


In [4]:
%%time

test = pd.read_csv('../data/test.csv')

Wall time: 4.87 s


# Data Size

We also check how large our raw data is in memory:

In [5]:
print("Train:", 
      round(train.memory_usage().sum() / 1024 ** 2, 2), "Mb")
print("Test:", 
      round(test.memory_usage().sum() / 1024 ** 2, 2), "Mb")

Train: 877.0 Mb
Test: 448.02 Mb


# Reduce Memory Usage

We use a helper function to cast the numerical variables (all variables in this month's dataset are numerical) to their smallest possible subtype. This idea was adapted from this [Kaggle notebook](https://www.kaggle.com/bextuychiev/how-to-work-w-million-row-datasets-like-a-pro).
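
As a minimal sketch of what downcasting does, the toy frame below (hypothetical column names, not taken from the competition data) passes each column through `pd.to_numeric` with the `downcast` argument. Small integers fit in `int8`, an integer too large for 32 bits stays `int64`, and floats drop to `float32`, which is the smallest float type `pd.to_numeric` will downcast to.

```python
import numpy as np
import pandas as pd

# Toy data: column names are illustrative only
toy = pd.DataFrame({
    "small_int": np.array([0, 1, 2, 3], dtype="int64"),
    "big_int": np.array([0, 1, 2, 2**40], dtype="int64"),
    "floats": np.array([0.5, 1.25, np.nan, 3.0], dtype="float64"),
})

downcast = toy.copy()
for col in ["small_int", "big_int"]:
    # Picks the smallest integer dtype that can hold every value
    downcast[col] = pd.to_numeric(toy[col], downcast="integer")
# Picks the smallest float dtype (float32 at minimum)
downcast["floats"] = pd.to_numeric(toy["floats"], downcast="float")

print(downcast.dtypes)
print("Bytes before:", toy.memory_usage(index=False).sum())
print("Bytes after: ", downcast.memory_usage(index=False).sum())
```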

In [6]:
# Creates a copy of the original data
def reduce_memory_usage(data, verbose=True):
    df = data.copy()
    start_mem = df.memory_usage().sum() / 1024 ** 2
    # Series.iteritems() was removed in pandas 2.0; items() works in all versions
    for col, dtype in df.dtypes.items():
        if dtype.name.startswith('int'):
            df[col] = pd.to_numeric(data[col], downcast ='integer')
        elif dtype.name.startswith('float'):
            df[col] = pd.to_numeric(data[col], downcast ='float')
            # Revert the column if downcasting moved any value by more than 1
            # in either direction
            if (df[col] - data[col]).abs().max() > 1:
                df[col] = data[col]
        
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

In [7]:
%%time

train_new = reduce_memory_usage(train)
temp = train - train_new
print("Maximal Difference:", np.max(temp.max()))

Mem. usage decreased to 483.26 Mb (44.9% reduction)
Maximal Difference: 0.0003906250003637979
Wall time: 16.4 s
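
The maximal difference reported above is consistent with `float32` rounding error. As a rough illustration, assume a raw value of 8214.1 (an assumption; a value displayed as 8214.099609 appears in the `head()` output further down). Near 8214, adjacent `float32` values are 2^-10 (about 0.00098) apart, so storing a number as `float32` can shift it by up to roughly 0.00049.

```python
import numpy as np

raw = 8214.1                     # assumed raw float64 value (illustrative)
stored = np.float32(raw)         # nearest float32, as kept after downcasting
error = raw - float(stored)      # what `train - train_new` would report here
gap = float(np.spacing(stored))  # distance to the next representable float32

print("Stored value:  ", float(stored))
print("Rounding error:", error)
print("float32 gap:   ", gap)
```

The error grows with the magnitude of the value, which is why the helper keeps a column in its original type whenever the downcast changes a value by more than 1.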


In [8]:
%%time

test_new = reduce_memory_usage(test)
temp = test - test_new
print("Maximal Difference:", np.max(temp.max()))

Mem. usage decreased to 248.48 Mb (44.5% reduction)
Maximal Difference: 0.0003906250003637979
Wall time: 8.52 s


# Stratified K-Fold

We use `StratifiedKFold` to define our cross-validation scheme.

In [9]:
# Create Folds
for NUM_FOLDS in range(MIN_FOLDS,MAX_FOLDS+1):
    train_new[f"{NUM_FOLDS}fold"] = -1
    kf = StratifiedKFold(NUM_FOLDS, shuffle = True, random_state = FOLD_SEED) 
    for fold, (train_idx, valid_idx) in enumerate(kf.split(train_new, train_new['claim'])):
        train_new.loc[valid_idx,f"{NUM_FOLDS}fold"] = fold
        
# check output
train_new.head()

Unnamed: 0,id,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32,f33,f34,f35,f36,f37,f38,f39,f40,f41,f42,f43,f44,f45,f46,f47,f48,f49,f50,f51,f52,f53,f54,f55,f56,f57,f58,f59,f60,f61,f62,f63,f64,f65,f66,f67,f68,f69,f70,f71,f72,f73,f74,f75,f76,f77,f78,f79,f80,f81,f82,f83,f84,f85,f86,f87,f88,f89,f90,f91,f92,f93,f94,f95,f96,f97,f98,f99,f100,f101,f102,f103,f104,f105,f106,f107,f108,f109,f110,f111,f112,f113,f114,f115,f116,f117,f118,claim,3fold,4fold,5fold,6fold
0,0,0.10859,0.004314,-37.566002,0.017364,0.28915,-10.251,135.119995,168900.0,399240000000000.0,86.488998,0.59881,1423200000.0,0.2724,9.4556,-0.050305,1938.300049,8.6331,4.0607,26.867001,-1.18,10961.0,1.5397,135.320007,-1.4965,440.079987,2590100000000.0,2194200000.0,2968800.0,0.001431,13.327,0.7505,18509.0,146820.0,-0.000276,1.0906e+16,1705.400024,414.290009,3.5392,1888.0,0.96893,18.388,-0.001583,7.7059,5.9325,0.025693,4.5604,0.61122,10.795,0.34193,0.23501,,5237.700195,1.2961,163.660004,0.40378,0.1886,-0.001446,-0.35416,6.6432,0.30534,0.51402,1907300000.0,29.861,0.96501,1797.199951,72.178001,108.620003,1.9799,1.2907,0.99519,1.3228,827.340027,777990000000000.0,41299000000.0,0.006994,6.9835,43956.0,1978.199951,5.5084,-0.001081,6.1244,123180000000.0,275.920013,5308500.0,1704.0,50224000000.0,53.397999,-2.2012,6871.0,3.8862,-0.00558,5252.100098,166.690002,1.6074,0.66534,7768.899902,0.99662,112570000000.0,2.2432,0.93416,0.65056,94569.0,21.471001,8214.099609,0.28801,0.097826,0.001071,1412400000.0,0.11093,-12.228,1.7482,1.9096,-7.1157,4378.799805,1.2096,861340000000000.0,140.100006,1.0177,1,2,3,4,5
1,1,0.1009,0.29961,11822.0,0.2765,0.4597,-0.83733,1721.900024,119810.0,3874100000000000.0,9953.599609,1.2093,3334100000.0,0.28631,-0.012858,-0.019912,10.284,6.1872,1.0419,4.6404,31.877001,123620.0,1.3951,125.809998,1.1989,136.449997,9098100000.0,40041000000.0,1564000.0,0.000204,3.1074,1.5033,238000.0,21440.0,-0.001344,3.0794e+16,229.100006,844.820007,1.468,4726.5,0.91538,-1.5321,0.9826,7.1112,2.0797,0.042321,4.2523,0.41871,5.4499,0.012737,0.38647,7.3082,283.209991,-0.92552,140.800003,0.24739,-0.001656,-0.000975,-0.22629,2.4246,0.77147,0.011613,1803700000.0,64.603996,0.26265,4455.0,78.338997,745.51001,2.9069,1.4826,1.0051,1.4974,84.445999,3505600000000000.0,2242300000.0,0.8963,4.6749,17713.0,9003.099609,-4.3546,0.2541,6.9191,183240000000.0,9.651,32800.0,1480.599976,23006000000.0,44.050999,205.690002,4295.299805,13.388,0.46843,754.609985,83.233002,1.189,29.549999,7343.700195,0.99815,48777000000000.0,1.2708,-0.000969,5.2952,6779.0,227.720001,34.341999,0.3403,0.14337,0.049276,1903200000.0,0.97673,-56.757999,4.1684,0.34808,4.142,913.22998,1.2464,7575100000000000.0,1861.0,0.28359,0,1,1,2,2
2,2,0.17803,-0.00698,907.27002,0.27214,0.45948,0.17327,2298.0,360650.0,12245000000000.0,15827.0,0.38164,1230300000.0,0.25807,2.4556,,26.872999,7.5463,1.9967,1.9526,817.76001,-2948.699951,2.0054,1.6826,1.1968,74.624001,-32739000000.0,57189000000.0,11058.0,-0.003097,8.0241,1.1318,27940.0,862460.0,-0.002207,58491000000000.0,-897.840027,,1.3561,3063.399902,0.086232,16.106001,0.001481,11.476,5.343,0.012162,4.1018,-0.8827,8.1228,-0.67669,0.3377,-1.0732,4097.0,13.458,159.240005,0.3223,0.56009,0.000455,-0.16083,3.5753,0.6097,0.028301,527130000.0,14.454,0.11549,14605.0,36.992001,-9.6391,64.266998,,0.99278,2.5891,430.399994,-44535000000000.0,5144900000000.0,0.099591,6.5516,1887.5,43319.0,4.3931,0.26026,6.1052,101330000000.0,357.269989,1476600.0,90.845001,1306200000.0,2.3731,391.369995,2965.300049,,0.49459,43.523998,138.520004,1.1079,0.91948,47.915001,,1510500000000.0,3.4663,0.56095,4.1309,95531.0,39.486,-83.148003,0.084881,0.032222,0.001668,14365000.0,0.20102,-5.7688,1.2042,0.2629,8.1312,45119.0,1.1764,321810000000000.0,3838.199951,0.4069,1,1,1,2,2
3,3,0.15236,0.007259,780.099976,0.025179,0.51947,7.4914,112.510002,259490.0,77814000000000.0,-36.837002,1.1096,1223100000.0,0.30944,10.37,-0.10626,533.840027,7.849,1.0379,8.003,12.349,-195.279999,2.5598,92.141998,0.63789,1054.900024,-12041000000.0,5187300000000.0,1475400.0,1.0365,1.1903,0.98941,301200.0,,-7e-06,-92992000000000.0,-10.818,1020.299988,2.9553,3342.5,-0.000372,17.011,0.095268,5.7448,15.883,0.037934,4.486,-0.88909,8.4384,-1.1898,0.001391,,175.809998,67.133003,119.260002,0.007034,0.46004,-0.000705,-0.39149,2.0888,0.7979,0.13592,4011100000.0,63.063,0.033075,75194.0,103.970001,-15.482,2.9432,1.1804,1.007,2.1572,1251.5,1894700000000000.0,10770000000.0,0.99225,4.5331,14348.0,1575.699951,9.8105,0.37283,1.5606,18354000000.0,-3.4298,6485700.0,2120.0,30812000000.0,34.056,157.429993,3724.5,8.4211,0.40778,2971.199951,204.699997,-0.97998,9.9405,12011.0,0.99898,50634000000000.0,1.2261,0.2502,0.72974,373690.0,194.649994,120.93,0.26071,0.23424,-0.002794,1442300000.0,-0.01182,-34.858002,2.0694,0.79631,-16.336,4952.399902,1.1784,4533000000000.0,4889.100098,0.51486,1,0,0,0,0
4,4,0.11623,0.5029,-109.150002,0.29791,0.3449,-0.40932,2538.899902,65332.0,1907200000000000.0,144.119995,1.0531,2634100000.0,0.29782,2.6548,,1808.900024,7.2783,3.9757,,,29520.0,3.4225,96.724998,0.79725,215.570007,17326000000000.0,2635200000000.0,2161200.0,0.89547,6.8257,0.97413,142620.0,231350.0,0.001258,1.0125e+16,51.507999,293.76001,1.3351,3042.100098,0.006791,94.889,0.91709,8.7369,,0.020281,3.9115,0.65634,6.141,-1.0896,0.24794,7.9704,2063.100098,0.80633,131.770004,0.17796,0.98938,0.000344,-0.98027,2.361,0.5803,0.46577,5702500000.0,23.738001,-0.000847,75843.0,73.737,,64.591003,1.1029,0.98985,1.3446,519.200012,569320000000000.0,286960000000.0,0.011649,6.0236,1969.800049,1967.599976,,0.08569,1.5846,38252000000.0,130.699997,102100.0,1951.800049,11428000000.0,58.566002,176.830002,1279.0,4.9662,0.47912,-70.278,10.887,1.1434,6.1912,197.470001,,15748000000000.0,1.0083,0.33953,13.486,201300.0,38.841999,324.0,0.23825,0.14155,0.002208,5830700000.0,0.92739,-13.641,1.5298,1.1464,-0.43124,3856.5,1.483,-8991300000000.0,,0.23049,1,1,1,1,2
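
A quick way to confirm that the folds are stratified is to compare the size and the positive rate of each fold. The sketch below uses synthetic data (the `f1` values and `claim` labels are randomly generated, not the competition data) and builds a `5fold` column the same way as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "f1": rng.normal(size=1000),
    "claim": rng.integers(0, 2, size=1000),
})

# Assign each row the index of the fold in which it is used for validation
toy["5fold"] = -1
kf = StratifiedKFold(5, shuffle=True, random_state=0)
for fold, (_, valid_idx) in enumerate(kf.split(toy, toy["claim"])):
    toy.loc[valid_idx, "5fold"] = fold

# Every fold should have a similar size and a similar share of positives
summary = toy.groupby("5fold")["claim"].agg(["size", "mean"])
print(summary)
```

On the real data, the same `groupby` can be run on `train_new` for each of the `3fold` to `6fold` columns.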


# Saving Output

We want to save our output in a format that is easy to load and retains our memory usage savings. To this end we use the feather format in pandas.

In [10]:
path_train = '../data/train.feather'
path_test = '../data/test.feather'

# save data
train_new.to_feather(path_train)
test_new.to_feather(path_test)

# Sanity Checks

We check the following:

1. Speed at which we can load `.feather` files
2. Our `.feather` data is equivalent to the memory-reduced data we saved (same data types, same values)

## Loading Times

In [11]:
%%time 

# reload data (for testing purposes)
train_df = pd.read_feather(path_train)
test_df = pd.read_feather(path_test)

Wall time: 928 ms


## Data Equivalence

In [12]:
# Check train data types are preserved (including the fold columns)
types_1 = [train_new[x].dtype for x in train_new.columns]
types_2 = [train_df[x].dtype for x in train_df.columns]

assert types_1 == types_2

# Check test data types are preserved
types_1 = [test_new[x].dtype for x in test_new.columns]
types_2 = [test_df[x].dtype for x in test_df.columns]

assert types_1 == types_2

# Find the largest absolute difference between the saved and reloaded train data
temp = (train_new - train_df).abs()
print("Largest Difference (Train):", temp.max().max())

# Same check for the test data
temp = (test_new - test_df).abs()
print("Largest Difference (Test):", temp.max().max())

Largest Difference (Train): 0.0
Largest Difference (Test): 0.0
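
A subtraction-based check skips missing values, because `max()` ignores `NaN` by default. As a stricter alternative, `pd.testing.assert_frame_equal` compares values, dtypes and `NaN` positions together. The sketch below uses two hypothetical toy frames (not the notebook's data) that differ only in where a value is missing.

```python
import numpy as np
import pandas as pd

# Two toy frames that differ only by a missing value
saved = pd.DataFrame({"x": np.array([1.0, 2.0], dtype="float32")})
reloaded = pd.DataFrame({"x": np.array([1.0, np.nan], dtype="float32")})

# The difference at the NaN position is NaN, which max() skips
max_diff = float((saved - reloaded).abs().max().max())
print("Largest Difference:", max_diff)

# assert_frame_equal raises AssertionError on the same pair
try:
    pd.testing.assert_frame_equal(saved, reloaded)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
print("Mismatch detected:", mismatch_detected)
```

In this notebook the corresponding calls would be `pd.testing.assert_frame_equal(train_new, train_df)` and `pd.testing.assert_frame_equal(test_new, test_df)`.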
