### Problem: 
Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.

TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user’s click across their portfolio and flag IP addresses that produce lots of clicks but never end up installing apps. With this information, they've built an IP blacklist and a device blacklist.

While successful, they want to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. In their 2nd competition with Kaggle, you’re challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!

### Goal:
Predict the probability of an app download for each click_id in the test set.

For each click_id in the test set, we must predict a probability for the target is_attributed variable. The file should contain a header and have the following format:
* click_id,is_attributed
* 1,0.003
* 2,0.001
* 3,0.000
* etc.

### Dataset Description
Each row of the training data contains a click record, with the following features.

- **ip**: ip address of click.
- **app**: app id for marketing.
- **device**: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- **os**: os version id of user mobile phone
- **channel**: channel id of mobile ad publisher
- **click_time**: timestamp of click (UTC)
- **attributed_time**: if the user downloaded the app after clicking an ad, this is the time of the app download
- **is_attributed**: the target that is to be predicted, indicating the app was downloaded
    Note that ip, app, device, os, and channel are encoded.

The test data is similar, with the following differences:
- **click_id**: reference for making predictions
- **is_attributed**: not included

### Evaluation:
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
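
As a rough illustration of the metric, AUC can be read as the probability that a randomly chosen positive click is scored above a randomly chosen negative one, with ties counting half. The sketch below is a plain-Python version of that definition on made-up labels and scores; the function name `roc_auc` is hypothetical, and the pairwise loop is only practical for small inputs.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc(labels, scores)
print(auc_value)  # 0.75

# Rescaling the scores keeps their order, so the AUC is unchanged.
rescaled_auc = roc_auc(labels, [s * 0.01 for s in scores])
```

Because only the ordering of the predictions matters, the metric is not affected by how well the predicted probabilities are calibrated.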

### TL;DR Summary
- Total train samples are 184,903,890.
- Total test samples are 18,790,469.
- Total test supplement samples are 57,537,505. 
- Percentage of positive data: 0.2%
- Read the last 61M rows of the training set and downsampled the majority class to 10M rows, since the full dataset is very large.
- Used Kaggle kernel for training and generating the output for this problem 
- Used XGBoost for training and testing purposes

### Leaderboard Score:
- **Public score**: 0.95613
- **Private score**: 0.95558

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:

import gc
import time
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import plot_importance
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

%config InlineBackend.figure_format = 'retina'
plt.figure(figsize=(12,5))

### Train dataset

In [None]:
train_path = "../input/" + 'train.csv'
train_df = pd.read_csv(train_path, nrows = 10000)
train_df.head()

### Test dataset

In [None]:
test_path = "../input/" + 'test.csv'
test_df = pd.read_csv(test_path, nrows = 10000)
test_df.head()

### Test supplement dataset

In [None]:
test_sup_path = "../input/" + 'test_supplement.csv'
test_sup_df = pd.read_csv(test_sup_path, nrows = 10000)
test_sup_df.head()

In [None]:
train_df.dtypes

In [None]:
train_cols = ['ip','app','device','os','channel','click_time','is_attributed']
test_cols = ['ip','app','device','os','channel','click_time','click_id']

# By default, pandas sets the dtype of integers to int64. In many cases, 
# this datatype takes up extra memory which is just not required.
# Hence, memory reduction by changing the datatypes is very helpful.

dtypes = {
        'ip'            : 'uint32',
        'app'           : 'uint16',
        'device'        : 'uint16',
        'os'            : 'uint16',
        'channel'       : 'uint16',
        'click_id'      : 'uint32',
        'is_attributed' : 'uint8'
        }
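
To give a feel for the saving, the sketch below builds a small synthetic frame (random values, not the competition data) with three of the columns above, and compares its memory footprint as `int64` against the narrower dtypes from the `dtypes` mapping.

```python
import numpy as np
import pandas as pd

n_rows = 100_000
rng = np.random.default_rng(0)

# Synthetic stand-in for three of the columns, stored as pandas' default int64.
wide = pd.DataFrame({
    "ip": rng.integers(0, 400_000, n_rows),
    "app": rng.integers(0, 800, n_rows),
    "is_attributed": rng.integers(0, 2, n_rows),
}).astype("int64")

# Same values stored with the narrower dtypes used in the notebook.
compact = wide.astype({"ip": "uint32", "app": "uint16", "is_attributed": "uint8"})

wide_bytes = int(wide.memory_usage(index=False).sum())
compact_bytes = int(compact.memory_usage(index=False).sum())
print(wide_bytes, compact_bytes)  # 2400000 700000
```

Here the three columns drop from 24 bytes per row to 7 bytes per row. The narrower dtypes are only safe when every value fits in the chosen range.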

In [None]:
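# Read the last 61M rows of train.csv (row 0 is the header, so rows 1..123,903,890 are skipped);
# the majority class is downsampled to 10M rows further below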
train_df = pd.read_csv( '../input/' + "train.csv", skiprows=range(1,123903891), nrows=61000000, usecols=train_cols, dtype=dtypes)
test_sup_df = pd.read_csv('../input/' + "test_supplement.csv", usecols=test_cols, dtype=dtypes)
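
The `skiprows` range follows from the row counts: with 184,903,890 training rows and 61,000,000 rows to keep, the first 123,903,890 data rows are skipped while line 0 (the header) is kept. A minimal sketch of that arithmetic, checked on a tiny in-memory CSV; the helper name `tail_skiprows` is hypothetical.

```python
import io

import pandas as pd


def tail_skiprows(total_rows, keep_last):
    """Line indices to skip so that only the last `keep_last` data rows are read.

    Line 0 is the header and is never skipped.
    """
    return range(1, total_rows - keep_last + 1)


# Tiny CSV with a header and 10 data rows holding the values 0..9.
csv_text = "x\n" + "\n".join(str(i) for i in range(10))
tail = pd.read_csv(io.StringIO(csv_text), skiprows=tail_skiprows(10, 3))
print(tail["x"].tolist())  # [7, 8, 9]

# The same arithmetic with the row counts from the summary.
full_skip = tail_skiprows(184_903_890, 61_000_000)
print(full_skip)  # range(1, 123903891)
```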

In [None]:
# Separate majority and minority classes
df_train_majority = train_df[train_df.is_attributed==0]
df_train_minority = train_df[train_df.is_attributed==1]

In [None]:
len(df_train_majority)

In [None]:
len(df_train_minority)

In [None]:
# Downsample majority class
df_majority_downsampled = resample(df_train_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=10000000,     # keep 10M majority rows; the minority class is kept in full
                                 random_state=123) # reproducible results

In [None]:
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_train_minority])

In [None]:
# Display new class counts
df_downsampled.is_attributed.value_counts()
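
The downsampling steps above can be sketched on a toy frame. This version uses pandas' `DataFrame.sample` rather than `sklearn.utils.resample`, purely to keep the example dependency-light; the helper name `downsample_majority` and the toy data are assumptions for illustration.

```python
import pandas as pd


def downsample_majority(df, target, n_majority, seed=123):
    """Keep every minority row (target == 1) and a random subset of majority rows."""
    majority = df[df[target] == 0]
    minority = df[df[target] == 1]
    n_keep = min(n_majority, len(majority))
    kept = majority.sample(n=n_keep, replace=False, random_state=seed)
    return pd.concat([kept, minority])


# Toy data: 1000 clicks, of which 10 (1%) are attributed.
toy = pd.DataFrame({
    "app": range(1000),
    "is_attributed": [1 if i % 100 == 0 else 0 for i in range(1000)],
})

balanced = downsample_majority(toy, "is_attributed", n_majority=50)
counts = balanced["is_attributed"].value_counts().to_dict()
print(counts[0], counts[1])  # 50 10
```

Downsampling raises the share of positives in the training data, so predicted probabilities will tend to be shifted upward relative to the original class balance.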

In [None]:
# Feature extraction using click_time column in the datasets
def feature_extraction(df):
    df['date'] = pd.to_datetime(df['click_time'])
    df['dayOfWeek'] = df['date'].dt.dayofweek.astype('uint16')
    df['dayOfYear'] = df['date'].dt.dayofyear.astype('uint16')
    df['hour'] = df['date'].dt.hour.astype('uint8')
    df['min'] = df['date'].dt.minute.astype('uint8')
    df['sec'] = df['date'].dt.second.astype('uint8')
    df.drop(['date','click_time'], axis= 1, inplace=True)
    return df
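
To show what these time features look like, the sketch below applies the same `.dt` accessors to two made-up timestamps (illustrative values, not rows from the dataset) without modifying the input frame.

```python
import pandas as pd

# Two made-up click timestamps in the same string format as click_time.
clicks = pd.DataFrame({"click_time": ["2017-11-09 14:23:39", "2017-11-10 00:00:05"]})

ts = pd.to_datetime(clicks["click_time"])
features = pd.DataFrame({
    "dayOfWeek": ts.dt.dayofweek,   # Monday = 0 ... Sunday = 6
    "dayOfYear": ts.dt.dayofyear,
    "hour": ts.dt.hour,
    "min": ts.dt.minute,
    "sec": ts.dt.second,
})
print(features)
```

Since the training data spans only about four days, `dayOfWeek` and `dayOfYear` can take only a handful of distinct values, which may limit how much they contribute.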

In [None]:
del df_train_minority, train_df, df_majority_downsampled
gc.collect()

In [None]:
df_downsampled.head()

In [None]:
# drop the target values from train dataset 
y = df_downsampled['is_attributed']
df_downsampled.drop(['is_attributed'], axis=1, inplace=True)

In [None]:
# drop click_id from the supplement test data (it is an identifier, not a model feature)
test_sup_df.drop(['click_id'], axis=1, inplace=True)
gc.collect()

In [None]:
# Concatenate the downsampled training data with the supplement test data,
# so that click counts per ip are computed over both
rows_train = df_downsampled.shape[0]
merge_df = pd.concat([df_downsampled, test_sup_df])
merge_df.head()

In [None]:
print('Length of combine dataset: ', len(merge_df))

In [None]:
del df_downsampled, test_sup_df
gc.collect()

In [None]:
# Group by ip to count the number of clicks
ip_groups = merge_df.groupby(['ip'])['channel'].count().reset_index(name = 'clicks_by_ip')
print(ip_groups)

In [None]:
merge_df = pd.merge(merge_df, ip_groups, on='ip', how='left', sort=False)
print(merge_df.head())

In [None]:
# Note: uint16 holds values up to 65,535; larger counts would wrap around on this cast
merge_df['clicks_by_ip'] = merge_df['clicks_by_ip'].astype('uint16')
merge_df.drop('ip', axis=1, inplace=True)
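
The click-count feature can be reproduced on a toy frame. The sketch below shows the groupby-plus-merge pattern used above next to an equivalent `transform('count')` one-liner, plus a check that the largest count fits in `uint16` before casting. The toy data is made up; the merge variant has the advantage of leaving a reusable lookup table (`ip_counts` here) that can later be joined onto another frame.

```python
import numpy as np
import pandas as pd

clicks = pd.DataFrame({
    "ip": [10, 10, 20, 10, 30, 20],
    "channel": [1, 2, 1, 3, 1, 1],
})

# Pattern from the notebook: build a per-ip count table, then left-join it back.
ip_counts = clicks.groupby("ip")["channel"].count().reset_index(name="clicks_by_ip")
merged = clicks.merge(ip_counts, on="ip", how="left", sort=False)

# Equivalent in one step: broadcast each group's count back to its rows.
via_transform = clicks.groupby("ip")["channel"].transform("count")

# Check the value range before narrowing the dtype.
fits_uint16 = int(merged["clicks_by_ip"].max()) <= np.iinfo(np.uint16).max

print(merged["clicks_by_ip"].tolist())  # [3, 3, 2, 3, 1, 2]
```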

In [None]:
train_df = merge_df[:rows_train]
test_df = merge_df[rows_train:]

In [None]:
del test_df,merge_df
gc.collect()

In [None]:
train_df = feature_extraction(train_df)
gc.collect()

In [None]:
train_df.head()

In [None]:
# Set the params for xgboost model
params = {'eta': 0.3,
          'tree_method': "hist",
          'grow_policy': "lossguide",
          'max_leaves': 1400,  
          'max_depth': 0, 
          'subsample': 0.9, 
          'colsample_bytree': 0.7, 
          'colsample_bylevel':0.7,
          'min_child_weight':0,
          'alpha':4,
          'objective': 'binary:logistic', 
          'scale_pos_weight':9,
          'eval_metric': 'auc', 
          'nthread':8,
          'random_state': 99, 
          'silent': True}
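
For `scale_pos_weight`, the XGBoost documentation suggests the ratio of negative to positive examples as a typical starting value. The sketch below computes that ratio on made-up labels; the helper name is hypothetical, and this is not necessarily how the value of 9 above was chosen.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples, a common starting point for scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# Made-up labels: 90 negatives and 10 positives.
toy_labels = [0] * 90 + [1] * 10
weight = suggested_scale_pos_weight(toy_labels)
print(weight)  # 9.0
```

In practice the value is usually tuned against the validation AUC rather than taken directly from the ratio.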

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_df, y, test_size=0.1, stratify=y, random_state=99)
dtrain = xgb.DMatrix(X_train, y_train)
dvalid = xgb.DMatrix(X_test, y_test)
del X_train, y_train
gc.collect()
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
model = xgb.train(params, dtrain, 200, watchlist, maximize=True, early_stopping_rounds = 20, verbose_eval=5)
del dvalid

print("Validating...")
check = model.predict(xgb.DMatrix(X_test), ntree_limit=model.best_iteration+1)

In [None]:
from sklearn.metrics import roc_curve, auc
# Compute the ROC curve and the area under it (AUC) on the validation split
fpr, tpr, _ = roc_curve(y_test.values, check)
roc_auc = auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([-0.02, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc="lower right")
plt.show()

In [None]:
# Plot the feature importances from xgboost
plot_importance(model)
plt.gcf().savefig('xgb_feature_importance_v2_downsample_roc.png')

In [None]:
# Load the test dataset for prediction
test_cols  = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'click_id']
test_df = pd.read_csv('../input/' +"test.csv", usecols=test_cols, dtype=dtypes)
test_df = pd.merge(test_df, ip_groups, on='ip', how='left', sort=False)
del ip_groups
gc.collect()

In [None]:
test_df.head()

In [None]:
del check, X_test, y_test

In [None]:
# Creating a dataframe for submission
submission_df = pd.DataFrame()
submission_df['click_id'] = test_df['click_id'].astype('int')

test_df['clicks_by_ip'] = test_df['clicks_by_ip'].astype('uint16')
test_df = feature_extraction(test_df)
test_df.drop(['click_id', 'ip'], axis=1, inplace=True)
dtest = xgb.DMatrix(test_df)
del test_df
gc.collect()

In [None]:
submission_df.head()

In [None]:
gc.collect()

In [None]:
# Get predictions from the best iteration with model.best_ntree_limit.
submission_df['is_attributed'] = model.predict(dtest, ntree_limit=model.best_ntree_limit)
submission_df.to_csv('xgb_submission_roc.csv', float_format='%.8f', index=False)