<h1><center><font size="6">TalkingData AdTracking EDA</font></center></h1>


<img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/5340/logos/front_page.png" width="400"></img>


# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Load packages</a>  
- <a href='#2'>Parameters</a>  
- <a href='#3'>Read the data</a>  
- <a href='#4'>Check the data</a>  
    - <a href='#41'>Glimpse the data</a>  
    - <a href='#42'>Check missing data</a>
    - <a href='#43'>Check data unbalance</a>
- <a href='#5'>Data exploration</a>
- <a href='#6'>Data engineering</a>
    - <a href='#61'>Extract date and time data</a>
    - <a href='#62'>Additional feature engineering</a>
    - <a href='#63'>Data unbalance between train and test data</a>
- <a href='#7'>Model</a>  
    - <a href='#71'>Prepare the model</a>  
    - <a href='#72'>Train the model</a>
- <a href='#8'>Prediction and submission</a>


# <a id="1">Introduction</a>

Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.
Founded in 2011, TalkingData is China’s largest independent big data service platform. 

They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user’s click across their portfolio and to flag IP addresses that produce lots of clicks but never end up installing apps. With this information, they've built an IP blacklist and a device blacklist.
While successful, they wanted to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. In their 2nd competition with Kaggle, they challenge participants to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support the modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!


# <a id="2">Load packages</a>

Here we load the libraries and set a pandas display option.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split  # used by the hold-out validation branch

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression


import lightgbm as lgb
import gc # memory 
from datetime import datetime # train time checking

pd.set_option('display.max_columns', 100)

# <a id="3">Parameters</a>

Here we set most of the parameters used in this Notebook.



In [None]:
# VALIDATION
VALIDATE = False   # simple hold-out validation using train_test_split
VALID_SIZE = 0.90  # fraction of the data held out for validation (passed as test_size)

# CROSS-VALIDATION
VALIDATE_KFOLDS = True  # cross-validation using KFold
NUMBER_KFOLDS = 5       # number of folds for cross-validation

SAMPLE = True    # True: use train_sample.csv (100,000 rows); False: read a window of train.csv
RANDOM_STATE = 2018

MAX_ROUNDS = 1000 # maximum number of lgb boosting rounds
EARLY_STOP = 50   # lgb early-stopping rounds
OPT_ROUNDS = 650  # rounds used when training without validation; adjust based on best validation round
skiprows = range(1,109903891) # lines of train.csv to skip (line 0, the header, is kept)
nrows = 75000000              # number of rows to read after the skipped lines
# USE SAMPLE FROM FULL TRAIN SET
SAMPLE_SIZE = 1 # fraction of the rows read from train.csv to keep (1 keeps all rows, shuffled)
output_filename = 'submission.csv'

IS_LOCAL = False


if (IS_LOCAL):
    PATH = '../input/talkingdata-adtracking-fraud-detection/'
else:
    PATH = '../input/'
print(os.listdir(PATH))
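
The `skiprows` and `nrows` parameters select a window of rows from the full training file. The following minimal sketch shows how the two options interact on a tiny in-memory CSV; the `csv_text` content and the `window` name are made up for illustration and stand in for `train.csv`.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv: one header line followed by six data rows.
csv_text = "value\n0\n1\n2\n3\n4\n5\n"

# range(1, 4) skips file lines 1..3 (the first three data rows) but keeps
# line 0, the header. nrows=2 then reads only the next two rows.
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 4), nrows=2)

print(window["value"].tolist())
```

Starting the range at 1 rather than 0 is what preserves the header, so column names and `usecols` keep working on the windowed read.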

# <a id="3">Read the data</a>

We read the data with the options set in the **Parameters** section.


In [None]:
dtypes = {
        'ip'            : 'uint32',
        'app'           : 'uint16',
        'device'        : 'uint16',
        'os'            : 'uint16',
        'channel'       : 'uint16',
        'is_attributed' : 'uint8',
        'click_id'      : 'uint32'
        }

train_cols = ['ip','app','device','os', 'channel', 'click_time', 'is_attributed']

if SAMPLE:
    trainset = pd.read_csv(PATH+"train_sample.csv", dtype=dtypes, usecols=train_cols)    
else:
    trainset = pd.read_csv(PATH+"train.csv", skiprows=skiprows, nrows=nrows,dtype=dtypes, usecols=train_cols)
    trainset = trainset.sample(frac=SAMPLE_SIZE)

testset = pd.read_csv(PATH+"test.csv")
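
Passing explicit `dtypes` to `read_csv` matters mostly for memory. As a rough illustration (synthetic data, not the competition files), the sketch below compares the size of one id-like column stored with the default 64-bit integers and with `uint16`, the type chosen above for `app`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an id column such as `app`; all values fit in 16 bits.
n_rows = 100_000
ids = np.arange(n_rows) % 500

as_default = pd.DataFrame({"app": ids.astype("int64")})
as_compact = pd.DataFrame({"app": ids.astype("uint16")})

# Bytes used by the column data only (the index is excluded).
bytes_default = as_default["app"].memory_usage(index=False)
bytes_compact = as_compact["app"].memory_usage(index=False)

print(bytes_default, bytes_compact)
```

The compact column takes a quarter of the space, which adds up quickly when tens of millions of rows are read.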

# <a id="4">Check the data</a>

## <a id="41">Glimpse the data</a>

Let's check the train and test set structure.


In [None]:
print("train -  rows:",trainset.shape[0]," columns:", trainset.shape[1])
print("test -  rows:",testset.shape[0]," columns:", testset.shape[1])


In [None]:
trainset.head()


In [None]:
testset.head()

The columns of the training data are:
* **ip**: IP address of the click (numeric);
* **app**: app id for marketing (category);
* **device**: device type id of the user's mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.) (category);
* **os**: OS version id of the user's mobile phone (category);
* **channel**: channel id of the mobile ad publisher (category);
* **click_time**: timestamp of the click (UTC) (time);
* **attributed_time**: if the user downloaded the app after clicking an ad, the time of the app download (time); this column is not loaded here because it is not listed in `train_cols`;
* **is_attributed**: the target to be predicted, indicating that the app was downloaded (binary).

The test set differs in three columns:
* **click_id**: reference for making predictions (additional column) (numeric);
* **attributed_time**: missing column;
* **is_attributed**: missing column (target column) (binary).

The dimensions of the train and test sets are the ones printed at the start of this section.



## <a id="42">Check missing data</a>

Let's check missing data in train and test.


In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
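
To make the output of this kind of summary concrete, here is a small sketch on a made-up frame. It computes the same two columns (count and percentage of nulls per column); the `toy` data is hypothetical and only illustrates the shape of the result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: `attributed_time` is missing for three of four rows.
toy = pd.DataFrame({
    "ip": [1, 2, 3, 4],
    "attributed_time": [np.nan, "2017-11-07 10:00:00", np.nan, np.nan],
})

# Number of nulls per column, and the same figure as a percentage of rows.
total = toy.isnull().sum().sort_values(ascending=False)
percent = (toy.isnull().mean() * 100).sort_values(ascending=False)
summary = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])

print(summary)
```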

Train data:

In [None]:
missing_data(trainset)

Test data:

In [None]:
missing_data(testset)

## <a id="43">Check data unbalance</a>

We will check the **target** (**is_attributed**) data unbalance. 

In [None]:
counts = trainset['is_attributed'].value_counts()
fig, ax = plt.subplots(figsize=(6,6))
# Bar plot of the number of rows for each target value
sns.barplot(ax=ax, x=counts.index.values, y=counts.values)
plt.ylabel('Number of values', fontsize=12)
plt.xlabel('is_attributed value', fontsize=12)
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

Only a very small share of the **is_attributed** values are 1 (227 out of 100,000 values in the sample, 176,627 out of 75 M values in the full read). This means that the training dataset is highly imbalanced (about 0.23% of rows have **is_attributed** = 1). The usual remedies are to undersample the records with **is_attributed** = 0 or to oversample the records with **is_attributed** = 1. Because the dataset is large, undersampling the records with **is_attributed** = 0 is a good option.
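
Undersampling the majority class can be sketched in a few lines of pandas. The example below runs on synthetic data, and the `NEG_PER_POS` ratio is a hypothetical choice for illustration, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: 10,000 clicks, of which the first 100 are attributed.
labels = np.zeros(10_000, dtype="uint8")
labels[:100] = 1
toy = pd.DataFrame({"app": np.arange(10_000) % 50, "is_attributed": labels})

NEG_PER_POS = 10  # hypothetical: keep at most 10 negatives per positive

positives = toy[toy["is_attributed"] == 1]
negatives = toy[toy["is_attributed"] == 0]

# Keep every positive row and a random subset of the negative rows.
n_keep = min(len(negatives), NEG_PER_POS * len(positives))
kept_negatives = negatives.sample(n=n_keep, random_state=2018)

# Concatenate and shuffle so the classes are interleaved again.
balanced = pd.concat([positives, kept_negatives]).sample(frac=1, random_state=2018)

print(len(balanced), balanced["is_attributed"].mean())
```

A model trained on an undersampled set sees a higher positive rate than the real data has, so its predicted probabilities are shifted. Ranking metrics such as AUC are not affected by that shift.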

# <a id="5">Data exploration</a>

In [None]:
trainset.describe()

We can observe that, while **ip** is a number with a wide range of values, **app**, **device**, **os** and **channel** are categorical identifiers with a smaller range. Let's split the data into the rows with **is_attributed** = `1` and the rows with **is_attributed** = `0`.

In [None]:
t1 = trainset.loc[trainset['is_attributed'] != 0.]
t0 = trainset.loc[trainset['is_attributed'] == 0.]

t0.describe()

In [None]:
t1.describe()

In the two subsets, **ip**, **os** and **channel** cover the same ranges for both **is_attributed** values (0 and 1). For **is_attributed** = 0, the maximum value of **app** is almost double and the maximum value of **device** is four times larger.

Let's visualize the distribution of values of **app**, **device**, **os** and **channel**, grouped on **is_attributed**.

In [None]:
var = ['app','device','os','channel']

i = 0
t1 = trainset.loc[trainset['is_attributed'] != 0]
t0 = trainset.loc[trainset['is_attributed'] == 0]

sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(1,4,figsize=(16,4))

for feature in var:
    i += 1
    plt.subplot(1,4,i)
    sns.kdeplot(t1[feature], bw=0.5,label="is_attributed = 1")
    sns.kdeplot(t0[feature], bw=0.5,label="is_attributed = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();

# <a id="6">Data engineering</a>

## <a id="61">Extract date and time data</a>

In [None]:
# Parse the timestamp once and reuse it for every derived column
click_time = pd.to_datetime(trainset.click_time)
trainset['year'] = click_time.dt.year
trainset['month'] = click_time.dt.month
trainset['day'] = click_time.dt.day
trainset['hour'] = click_time.dt.hour
trainset['min'] = click_time.dt.minute
trainset['sec'] = click_time.dt.second
trainset.head()

In [None]:
trainset.describe()

In [None]:
var = ['day','hour']

i = 0
t1 = trainset.loc[trainset['is_attributed'] != 0]
t0 = trainset.loc[trainset['is_attributed'] == 0]

sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(1,2,figsize=(12,4))

for feature in var:
    i += 1
    plt.subplot(1,2,i)
    sns.kdeplot(t1[feature], bw=0.5,label="is_attributed = 1")
    sns.kdeplot(t0[feature], bw=0.5,label="is_attributed = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();

The distribution for **true** clicks (**is_attributed = 1**) is more irregular than the one for **false** clicks (**is_attributed = 0**). There are two possible explanations. First, because there are few **true** cases, their estimated density is noisier. Second (to be verified on a larger number of cases), the programmatic nature of *artificial* (**false**) clicks may make their distribution more uniform.

The density plot for hours shows a similar hourly profile for both **true** and **false** clicks: a plateau between hours 1 and 16, a dip between 16 and 20 and a peak between 21 and 22. On the plateau, the artificial (**false**) clicks show an additional oscillating pattern.

Let's look at the same data another way, using bar plots of the percentage of **true** clicks (**is_attributed = 1**) for each day and each hour.

In [None]:
var = ['day','hour']

for feature in var:
    fig, ax = plt.subplots(figsize=(16,6))
    # Calculate the percentage of is_attributed = 1 per category value
    cat_perc = trainset[[feature, 'is_attributed']].groupby([feature],as_index=False).mean()
    cat_perc['is_attributed'] *= 100  # the mean is a fraction; convert it to match the [%] axis label
    cat_perc.sort_values(by='is_attributed', ascending=False, inplace=True)
    # Bar plot
    #sns.barplot(ax=ax,x=feature, y='is_attributed', data=cat_perc, order=cat_perc[feature]) #for ordered bars
    sns.barplot(ax=ax,x=feature, y='is_attributed', data=cat_perc)
    plt.ylabel('Percent of `is_attributed` with value 1 [%]', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    plt.tick_params(axis='both', which='major', labelsize=12)
    plt.show()

## <a id="62">Additional feature engineering</a>

Define several functions for calculation of derived features.

In [None]:
#------------------------------------------------------------------------------
def show_max_clean(df,gp,agg_name,agg_type,show_max):
#------------------------------------------------------------------------------    
    del gp
    if show_max:
        print( agg_name + " max value = ", df[agg_name].max() )
    df[agg_name] = df[agg_name].astype(agg_type)
    gc.collect()
    return( df )
    
#------------------------------------------------------------------------------
def perform_count( df, group_cols, agg_name, agg_type='uint32', show_max=False, show_agg=True ):
#------------------------------------------------------------------------------
    if show_agg:
        print( "Aggregating by ", group_cols , '...' )
    gp = df[group_cols].groupby(group_cols).size().rename(agg_name).to_frame().reset_index()
    df = df.merge(gp, on=group_cols, how='left')
    return (show_max_clean(df,gp,agg_name,agg_type,show_max))

#------------------------------------------------------------------------------
def perform_countuniq( df, group_cols, counted, agg_name, agg_type='uint32', show_max=False, show_agg=True ):
#------------------------------------------------------------------------------    
    if show_agg:
        print( "Counting unique ", counted, " by ", group_cols , '...' )
    gp = df[group_cols+[counted]].groupby(group_cols)[counted].nunique().reset_index().rename(columns={counted:agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    return (show_max_clean(df,gp,agg_name,agg_type,show_max))

#------------------------------------------------------------------------------    
def perform_cumcount( df, group_cols, counted, agg_name, agg_type='uint32', show_max=False, show_agg=True ):
#------------------------------------------------------------------------------    
    if show_agg:
        print( "Cumulative count by ", group_cols , '...' )
    gp = df[group_cols+[counted]].groupby(group_cols)[counted].cumcount()
    df[agg_name]=gp.values
    return (show_max_clean(df,gp,agg_name,agg_type,show_max))

#------------------------------------------------------------------------------
def perform_mean( df, group_cols, counted, agg_name, agg_type='float32', show_max=False, show_agg=True ):
#------------------------------------------------------------------------------    
    if show_agg:
        print( "Calculating mean of ", counted, " by ", group_cols , '...' )
    gp = df[group_cols+[counted]].groupby(group_cols)[counted].mean().reset_index().rename(columns={counted:agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    return (show_max_clean(df,gp,agg_name,agg_type,show_max))

#------------------------------------------------------------------------------
def perform_var( df, group_cols, counted, agg_name, agg_type='float32', show_max=False, show_agg=True ):
#------------------------------------------------------------------------------    
    if show_agg:
        print( "Calculating variance of ", counted, " by ", group_cols , '...' )
    gp = df[group_cols+[counted]].groupby(group_cols)[counted].var().reset_index().rename(columns={counted:agg_name})
    df = df.merge(gp, on=group_cols, how='left')
    return (show_max_clean(df,gp,agg_name,agg_type,show_max))
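
To see what these group features look like, here is a small sketch on five made-up clicks. It uses `groupby(...).transform`, which gives the same per-row values as the groupby-then-merge pattern above when the data has no nulls; the column names `ip_app_count`, `ip_nunique_channel` and `ip_cumcount` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Five hypothetical clicks from two IP addresses.
clicks = pd.DataFrame({
    "ip":      [1, 1, 1, 2, 2],
    "app":     [3, 3, 9, 3, 3],
    "channel": [7, 8, 7, 7, 7],
})

# Number of rows in each (ip, app) group, repeated on every row of the group
# (the kind of feature perform_count builds).
clicks["ip_app_count"] = clicks.groupby(["ip", "app"])["channel"].transform("count")

# Number of distinct channels seen for each ip
# (the kind of feature perform_countuniq builds).
clicks["ip_nunique_channel"] = clicks.groupby("ip")["channel"].transform("nunique")

# 0-based running position of each click within its ip
# (the kind of feature perform_cumcount builds).
clicks["ip_cumcount"] = clicks.groupby("ip").cumcount()

print(clicks)
```

Unlike the count and unique-count features, the cumulative count depends on row order, so it only carries a time meaning when the rows are sorted by click time.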

Now we call these routines to compute the additional features.

In [None]:
trainset = perform_countuniq( trainset, ['ip'], 'channel', 'X0', 'uint8', show_max=True ); gc.collect()
trainset = perform_cumcount( trainset, ['ip', 'device', 'os'], 'app', 'X1', show_max=True ); gc.collect()
trainset = perform_countuniq( trainset, ['ip', 'day'], 'hour', 'X2', 'uint8', show_max=True ); gc.collect()
trainset = perform_countuniq( trainset, ['ip'], 'app', 'X3', 'uint8', show_max=True ); gc.collect()
trainset = perform_countuniq( trainset, ['ip', 'app'], 'os', 'X4', 'uint8', show_max=True ); gc.collect()
trainset = perform_countuniq( trainset, ['ip'], 'device', 'X5', 'uint16', show_max=True ); gc.collect()
trainset = perform_countuniq( trainset, ['app'], 'channel', 'X6', show_max=True ); gc.collect()
trainset = perform_cumcount( trainset, ['ip'], 'os', 'X7', show_max=True ); gc.collect()
trainset = perform_countuniq( trainset, ['ip', 'device', 'os'], 'app', 'X8', show_max=True ); gc.collect()
trainset = perform_count( trainset, ['ip', 'day', 'hour'], 'ip_tcount', show_max=True ); gc.collect()
trainset = perform_count( trainset, ['ip', 'app'], 'ip_app_count', show_max=True ); gc.collect()
trainset = perform_count( trainset, ['ip', 'app', 'os'], 'ip_app_os_count', 'uint16', show_max=True ); gc.collect()
trainset = perform_var( trainset, ['ip', 'day', 'channel'], 'hour', 'ip_tchan_count', show_max=True ); gc.collect()
trainset = perform_var( trainset, ['ip', 'app', 'os'], 'hour', 'ip_app_os_var', show_max=True ); gc.collect()
trainset = perform_var( trainset, ['ip', 'app', 'channel'], 'day', 'ip_app_channel_var_day', show_max=True ); gc.collect()
trainset = perform_mean( trainset, ['ip', 'app', 'channel'], 'hour', 'ip_app_channel_mean_hour', show_max=True ); gc.collect()

In [None]:
trainset.head(5)

In [None]:
var = ['X0','X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7']

i = 0
t1 = trainset.loc[trainset['is_attributed'] != 0]
t0 = trainset.loc[trainset['is_attributed'] == 0]

sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,4,figsize=(14,8))

for feature in var:
    i += 1
    plt.subplot(2,4,i)
    sns.kdeplot(t1[feature], bw=0.5,label="is_attributed = 1")
    sns.kdeplot(t0[feature], bw=0.5,label="is_attributed = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();

In [None]:
var = ['X8', 'ip_tcount', 'ip_app_count','ip_app_os_count', 
        'ip_tchan_count','ip_app_os_var','ip_app_channel_var_day', 'ip_app_channel_mean_hour']

i = 0
t1 = trainset.loc[trainset['is_attributed'] != 0]
t0 = trainset.loc[trainset['is_attributed'] == 0]

sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,4,figsize=(14,8))

for feature in var:
    i += 1
    plt.subplot(2,4,i)
    sns.kdeplot(t1[feature], bw=0.5,label="is_attributed = 1")
    sns.kdeplot(t0[feature], bw=0.5,label="is_attributed = 0")
    plt.ylabel('Density plot', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();

## <a id="63">Data unbalance between train and test data</a>

Let's compare the distribution of the features in the train and test datasets.


Let's start with the data about application, device, operating system and channel.

In [None]:
var = ['app','device','os','channel']

# Bar plot
sns.set_style('whitegrid')

plt.figure()
fig, ax = plt.subplots(1,4,figsize=(16,4))
i = 0
for feature in var:
    i = i + 1
    plt.subplot(1,4,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    #plt.setp(labels, rotation=90)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();

Let's continue with the time data. First we have to extract the date and time components from the test set as well.

In [None]:
# Parse the timestamp once and reuse it for every derived column
click_time = pd.to_datetime(testset.click_time)
testset['year'] = click_time.dt.year
testset['month'] = click_time.dt.month
testset['day'] = click_time.dt.day
testset['hour'] = click_time.dt.hour
testset['min'] = click_time.dt.minute
testset['sec'] = click_time.dt.second
testset.head()

In [None]:
testset.describe()

In [None]:
var = ['day','hour']

sns.set_style('whitegrid')

plt.figure()
fig, ax = plt.subplots(1,2,figsize=(12,4))
i = 0
for feature in var:
    i = i + 1
    plt.subplot(1,2,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    #plt.setp(labels, rotation=90)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();

The train data covers 6-9 November 2017, while the test data comes only from 10 November 2017. The hour distributions are also very different between train and test data.


Let's calculate the derived features for the test set as well.

In [None]:
testset = perform_countuniq( testset, ['ip'], 'channel', 'X0', 'uint8', show_max=True ); gc.collect()
testset = perform_cumcount( testset, ['ip', 'device', 'os'], 'app', 'X1', show_max=True ); gc.collect()
testset = perform_countuniq( testset, ['ip', 'day'], 'hour', 'X2', 'uint8', show_max=True ); gc.collect()
testset = perform_countuniq( testset, ['ip'], 'app', 'X3', 'uint8', show_max=True ); gc.collect()
testset = perform_countuniq( testset, ['ip', 'app'], 'os', 'X4', 'uint8', show_max=True ); gc.collect()
testset = perform_countuniq( testset, ['ip'], 'device', 'X5', 'uint16', show_max=True ); gc.collect()
testset = perform_countuniq( testset, ['app'], 'channel', 'X6', show_max=True ); gc.collect()
testset = perform_cumcount( testset, ['ip'], 'os', 'X7', show_max=True ); gc.collect()
testset = perform_countuniq( testset, ['ip', 'device', 'os'], 'app', 'X8', show_max=True ); gc.collect()
testset = perform_count( testset, ['ip', 'day', 'hour'], 'ip_tcount', show_max=True ); gc.collect()
testset = perform_count( testset, ['ip', 'app'], 'ip_app_count', show_max=True ); gc.collect()
testset = perform_count( testset, ['ip', 'app', 'os'], 'ip_app_os_count', 'uint16', show_max=True ); gc.collect()
testset = perform_var( testset, ['ip', 'day', 'channel'], 'hour', 'ip_tchan_count', show_max=True ); gc.collect()
testset = perform_var( testset, ['ip', 'app', 'os'], 'hour', 'ip_app_os_var', show_max=True ); gc.collect()
testset = perform_var( testset, ['ip', 'app', 'channel'], 'day', 'ip_app_channel_var_day', show_max=True ); gc.collect()
testset = perform_mean( testset, ['ip', 'app', 'channel'], 'hour', 'ip_app_channel_mean_hour', show_max=True ); gc.collect()

In [None]:
var = ['X0','X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7']

sns.set_style('whitegrid')

plt.figure()
fig, ax = plt.subplots(2,4,figsize=(14,8))
i = 0
for feature in var:
    i = i + 1
    plt.subplot(2,4,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    #plt.setp(labels, rotation=90)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();

In [None]:
var = ['X8', 'ip_tcount', 'ip_app_count','ip_app_os_count', 
       'ip_tchan_count', 'ip_app_os_var','ip_app_channel_var_day', 'ip_app_channel_mean_hour']

sns.set_style('whitegrid')

plt.figure()
fig, ax = plt.subplots(2,4,figsize=(14,8))
i = 0
for feature in var:
    i = i + 1
    plt.subplot(2,4,i)
    sns.kdeplot(trainset[feature], bw=0.5, label="train")
    sns.kdeplot(testset[feature], bw=0.5, label="test")
    plt.ylabel('Distribution', fontsize=12)
    plt.xlabel(feature, fontsize=12)
    locs, labels = plt.xticks()
    #plt.setp(labels, rotation=90)
    plt.tick_params(axis='both', which='major', labelsize=12)
plt.show();

# <a id="7">Model</a>

## <a id="71">Prepare the model</a>

In [None]:
start = datetime.now()

len_train = len(trainset)
gc.collect()

most_freq_hours_in_test_data = [4, 5, 9, 10, 13, 14]
least_freq_hours_in_test_data = [6, 11, 15]

def prep_data( df ):
    
    df['hour'] = pd.to_datetime(df.click_time).dt.hour.astype('uint8')
    df['day'] = pd.to_datetime(df.click_time).dt.day.astype('uint8')
    df.drop(['click_time'], axis=1, inplace=True)
    gc.collect()
    
    df['in_test_hh'] = (   3 
                         - 2*df['hour'].isin(  most_freq_hours_in_test_data ) 
                         - 1*df['hour'].isin( least_freq_hours_in_test_data ) ).astype('uint8')
    gp = df[['ip', 'day', 'in_test_hh', 'channel']].groupby(by=['ip', 'day', 'in_test_hh'])[['channel']].count().reset_index().rename(index=str, columns={'channel': 'nip_day_test_hh'})
    df = df.merge(gp, on=['ip','day','in_test_hh'], how='left')
    df.drop(['in_test_hh'], axis=1, inplace=True)
    df['nip_day_test_hh'] = df['nip_day_test_hh'].astype('uint32')
    del gp
    gc.collect()

    gp = df[['ip', 'day', 'hour', 'channel']].groupby(by=['ip', 'day', 'hour'])[['channel']].count().reset_index().rename(index=str, columns={'channel': 'nip_day_hh'})
    df = df.merge(gp, on=['ip','day','hour'], how='left')
    df['nip_day_hh'] = df['nip_day_hh'].astype('uint16')
    del gp
    gc.collect()
    
    gp = df[['ip', 'os', 'hour', 'channel']].groupby(by=['ip', 'os', 'hour'])[['channel']].count().reset_index().rename(index=str, columns={'channel': 'nip_hh_os'})
    df = df.merge(gp, on=['ip','os','hour'], how='left')
    df['nip_hh_os'] = df['nip_hh_os'].astype('uint16')
    del gp
    gc.collect()

    gp = df[['ip', 'app', 'hour', 'channel']].groupby(by=['ip', 'app',  'hour'])[['channel']].count().reset_index().rename(index=str, columns={'channel': 'nip_hh_app'})
    df = df.merge(gp, on=['ip','app','hour'], how='left')
    df['nip_hh_app'] = df['nip_hh_app'].astype('uint16')
    del gp
    gc.collect()

    gp = df[['ip', 'device', 'hour', 'channel']].groupby(by=['ip', 'device', 'hour'])[['channel']].count().reset_index().rename(index=str, columns={'channel': 'nip_hh_dev'})
    df = df.merge(gp, on=['ip','device','hour'], how='left')
    df['nip_hh_dev'] = df['nip_hh_dev'].astype('uint32')
    del gp
    gc.collect()

    df.drop( ['ip','day'], axis=1, inplace=True )
    gc.collect()
    return df
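
The `in_test_hh` expression in `prep_data` packs three hour groups into one small integer. A minimal sketch of that encoding on a handful of made-up hour values (the `hours` series is hypothetical; the two hour lists are the ones defined above):

```python
import pandas as pd

most_freq_hours_in_test_data = [4, 5, 9, 10, 13, 14]
least_freq_hours_in_test_data = [6, 11, 15]

# A few example hour values, two from each group.
hours = pd.Series([4, 6, 0, 14, 15, 23], name="hour")

# Booleans act as 0/1 in arithmetic, so the result is:
#   1 for hours in the "most frequent" list  (3 - 2)
#   2 for hours in the "least frequent" list (3 - 1)
#   3 for every other hour                   (3 - 0)
in_test_hh = (3
              - 2 * hours.isin(most_freq_hours_in_test_data)
              - 1 * hours.isin(least_freq_hours_in_test_data)).astype("uint8")

print(in_test_hh.tolist())
```

This prints `[1, 2, 3, 1, 2, 3]`. The encoding relies on the two lists being disjoint; an hour present in both would get the value 0.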

In [None]:
trainset = prep_data(trainset)
gc.collect()

params = {
          'boosting_type': 'gbdt',
          'objective': 'binary',
          'metric':'auc',
          'learning_rate': 0.1,
          'num_leaves': 9,  # should be smaller than 2^(max_depth)
          'max_depth': 5,  # -1 means no limit
          'min_child_samples': 100,  # minimum number of data points needed in a leaf (min_data_in_leaf)
          'max_bin': 100,  # maximum number of bins used to bucket feature values
          'subsample': 0.9,  # fraction of the training rows sampled for each tree
          'subsample_freq': 1,  # subsampling frequency in iterations; <= 0 disables it
          'colsample_bytree': 0.7,  # fraction of the columns sampled when constructing each tree
          'min_child_weight': 0,  # minimum sum of instance weight (hessian) needed in a leaf
          'min_split_gain': 0,  # minimum loss reduction required to make a split (min_gain_to_split)
          'nthread': 8,
          'verbose': 0,
          'scale_pos_weight':99.7, # weight of the positive class, because the training data is extremely imbalanced
         }

target = 'is_attributed'
predictors = ['app','device','os', 'channel', 'hour', 'nip_day_test_hh', 'nip_day_hh', 'nip_hh_os', 'nip_hh_app', 'nip_hh_dev']
categorical = ['app', 'device', 'os', 'channel', 'hour']
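
One common heuristic for `scale_pos_weight` is the ratio of negative to positive examples. The arithmetic below applies it to the class counts reported earlier for the 100,000-row sample. It is shown only for comparison with the value of 99.7 used in `params`; neither value is claimed to be optimal.

```python
# Class counts reported earlier for the 100,000-row sample.
n_total = 100_000
n_pos = 227
n_neg = n_total - n_pos

# Share of positive rows, and the negative-to-positive ratio.
positive_rate = n_pos / n_total
neg_pos_ratio = n_neg / n_pos

print(round(100 * positive_rate, 3), round(neg_pos_ratio, 1))
```

The ratio is roughly 440, so a weight of 99.7 boosts the positive class less than this heuristic would.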


## <a id="72">Train the model</a>

In [None]:
if VALIDATE:

    train_df, val_df = train_test_split(trainset, test_size=VALID_SIZE, random_state=RANDOM_STATE, shuffle=True )
    
    dtrain = lgb.Dataset(train_df[predictors].values, 
                         label=train_df[target].values,
                         feature_name=predictors,
                         categorical_feature=categorical)
    del train_df
    gc.collect()

    dvalid = lgb.Dataset(val_df[predictors].values,
                         label=val_df[target].values,
                         feature_name=predictors,
                         categorical_feature=categorical)
    del val_df
    gc.collect()

    evals_results = {}

    model = lgb.train(params, 
                      dtrain, 
                      valid_sets=[dtrain, dvalid], 
                      valid_names=['train','valid'], 
                      evals_result=evals_results, 
                      num_boost_round=MAX_ROUNDS,
                      early_stopping_rounds=EARLY_STOP,
                      verbose_eval=50, 
                      feval=None)

    del dvalid
    
elif VALIDATE_KFOLDS:
    kf = KFold(n_splits = NUMBER_KFOLDS, random_state = RANDOM_STATE, shuffle = True)
    for train_index, test_index in kf.split(trainset):
        train_X, valid_X = trainset.iloc[train_index], trainset.iloc[test_index]

        dtrain = lgb.Dataset(train_X[predictors].values, label=train_X[target].values,
                         feature_name=predictors, categorical_feature=categorical)
   
        dvalid = lgb.Dataset(valid_X[predictors].values, label=valid_X[target].values,
                         feature_name=predictors, categorical_feature=categorical)
    
        evals_results = {}
        model =  lgb.train(params, 
                      dtrain, 
                      valid_sets=[dtrain, dvalid], 
                      valid_names=['train','valid'], 
                      evals_result=evals_results, 
                      num_boost_round=MAX_ROUNDS,
                      early_stopping_rounds=EARLY_STOP,
                      verbose_eval=50, 
                      feval=None)
    
else:

    gc.collect()
    # No validation: train on the full prepared training set
    dtrain = lgb.Dataset(trainset[predictors].values, label=trainset[target].values,
                          feature_name=predictors,
                          categorical_feature=categorical
                          )
    del trainset
    gc.collect()

    evals_results = {}

    model = lgb.train(params, 
                      dtrain, 
                      valid_sets=[dtrain], 
                      valid_names=['train'], 
                      evals_result=evals_results, 
                      num_boost_round=OPT_ROUNDS,
                      verbose_eval=50,
                      feval=None)
    
del dtrain
gc.collect()
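
Two things are worth noting about the K-fold branch above: `model` is overwritten on every fold, so only the last fold's model is used for prediction, and plain `KFold` does not control how the rare positives are spread across folds. The sketch below, on synthetic labels, compares `KFold` with `StratifiedKFold` (already imported at the top of the notebook); the array sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels: 1,000 rows with 10 positives (1%), plus a dummy feature.
y = np.zeros(1000, dtype=int)
y[:10] = 1
X = np.arange(1000).reshape(-1, 1)

# Plain K-fold: the positives land wherever the shuffle puts them.
kf = KFold(n_splits=5, shuffle=True, random_state=2018)
plain_counts = [int(y[valid_idx].sum()) for _, valid_idx in kf.split(X)]

# Stratified K-fold: each validation fold keeps the overall class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)
strat_counts = [int(y[valid_idx].sum()) for _, valid_idx in skf.split(X, y)]

print("KFold positives per fold:          ", plain_counts)
print("StratifiedKFold positives per fold:", strat_counts)
```

With stratification every validation fold receives two of the ten positives, which keeps per-fold AUC estimates comparable when positives are this rare.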

# <a id="8">Prediction and submission</a>

In [None]:
test_cols = ['ip','app','device','os', 'channel', 'click_time', 'click_id']

test_df = prep_data(testset)
gc.collect()

sub = pd.DataFrame()
sub['click_id'] = test_df['click_id']
sub['is_attributed'] = model.predict(test_df[predictors])
sub.to_csv(output_filename, index=False, float_format='%.9f')
