<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [2]:
PATH_TO_DATA = ('/Users/lucky/.kaggle/competitions/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [3]:
%%time

site_cols = ['site'+str(i) for i in range(1, 11)]
time_cols = ['time'+str(i) for i in range(1, 11)]

def prepare_dataset(df, is_test=False):
    
    sites = df[site_cols].fillna(0).astype(int)
    times = df[time_cols].ffill(axis=1).astype('datetime64[ns]')

    list_to_concat = [sites, times]
    if not is_test:
        list_to_concat.append(df['target'])
    df = pd.concat(list_to_concat, axis=1)

    df['#unique_sites'] = sites.nunique(axis=1) / 10 # to scale the feature we divide it by 10

    df['hour_of_day'] = df['time1'].dt.hour
    df['day_of_week'] = df['time1'].dt.dayofweek
    df['weekend'] = df['day_of_week'].apply(lambda d: 1 if d==5 or d==6 else 0)
    df['part_of_day'] = df['hour_of_day'].apply(lambda h: 1 if h > 11 and h <= 13 else (
                                                     2 if h > 15 and h <= 18 else (
                                                     3 if h > 18 and h <= 24 
                                                         else 4)))

    df['session_span'] = (df['time10'] - df['time1']).dt.total_seconds()
    for i in range(1, 10):
        df['diff'+str(i)] = (df['time'+str(i+1)] - df['time'+str(i)]).dt.total_seconds()
    
    return df
    
train_dataset = prepare_dataset(train_df)
test_dataset = prepare_dataset(test_df, is_test=True)

CPU times: user 33.4 s, sys: 448 ms, total: 33.9 s
Wall time: 34.8 s
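
To make the time features concrete, here is a minimal pure-Python sketch of what the forward fill and the consecutive differences do for a single short session. The helpers `forward_fill` and `consecutive_diffs_seconds` and the sample timestamps are hypothetical and not part of the notebook, which does the same thing column-wise with pandas.

```python
from datetime import datetime

def forward_fill(values):
    """Replace each None with the most recent non-None value to its left."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def consecutive_diffs_seconds(times):
    """Seconds elapsed between each timestamp and the next one."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# A hypothetical session with three visited pages out of four slots.
raw_times = [
    datetime(2014, 2, 20, 10, 2, 45),
    datetime(2014, 2, 20, 10, 2, 50),
    datetime(2014, 2, 20, 10, 3, 20),
    None,  # padding: the session ended early
]

times = forward_fill(raw_times)
diffs = consecutive_diffs_seconds(times)
session_span = (times[-1] - times[0]).total_seconds()

print(diffs)         # the padded slot contributes a zero-length gap
print(session_span)
```

Because missing timestamps are filled from the left, padded slots produce zero differences and do not inflate `session_span`.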


Let's find the 30 most frequent sites in the train sessions with `target == 1`:

In [4]:
import pickle
with open(os.path.join(PATH_TO_DATA, 'site_dic.pkl'), "rb") as input_file:
    site_dict = pickle.load(input_file)

reverse_site_dict = dict((v,k) for (k,v) in site_dict.items())

unique, counts = np.unique(train_dataset[train_dataset['target'] == 1][site_cols].values.flatten(), return_counts=True)
top30 = [s[0] for s in sorted(zip(unique, counts), key=lambda x: x[1], reverse=True)[0:31]]
top30.remove(0)
[reverse_site_dict[site_id] for site_id in top30]

['i1.ytimg.com',
 's.youtube.com',
 'www.youtube.com',
 'www.facebook.com',
 'www.google.fr',
 'r4---sn-gxo5uxg-jqbe.googlevideo.com',
 'apis.google.com',
 'r1---sn-gxo5uxg-jqbe.googlevideo.com',
 's.ytimg.com',
 'r2---sn-gxo5uxg-jqbe.googlevideo.com',
 'www.google.com',
 's-static.ak.facebook.com',
 'r3---sn-gxo5uxg-jqbe.googlevideo.com',
 'twitter.com',
 'static.ak.facebook.com',
 'vk.com',
 'translate.google.fr',
 'platform.twitter.com',
 'yt3.ggpht.com',
 'mts0.google.com',
 'www.info-jeunes.net',
 'clients1.google.com',
 'www.audienceinsights.net',
 'www.melty.fr',
 'gg.google.com',
 'plus.googleapis.com',
 'mts1.google.com',
 'api.bing.com',
 'www.dailymotion.com',
 'youwatch.org']
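
The same ranking can be sketched with `collections.Counter` from the standard library. The helper `top_sites` and the toy `sessions` list are made up for illustration; `0` plays the role of the padding ID, as in the notebook.

```python
from collections import Counter

def top_sites(sessions, n, padding_id=0):
    """Return the n most frequent site IDs, ignoring the padding ID."""
    counts = Counter(site for session in sessions for site in session
                     if site != padding_id)
    return [site for site, _ in counts.most_common(n)]

# Hypothetical sessions: each inner list is one session of site IDs.
sessions = [
    [21, 21, 7, 0, 0],
    [7, 21, 5, 5, 0],
    [21, 7, 0, 0, 0],
]
top2 = top_sites(sessions, n=2)
print(top2)
```

Filtering out the padding ID before counting avoids taking one extra element and then calling `remove(0)`, which raises `ValueError` when `0` is not among the selected entries.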

Compute the average time per visit that a user spends on the top sites (the loop below uses the slice `top30[2:18]`):

In [5]:
%%time

avg_ss_columns = []

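def avg_session_span_for(site_id, row):
    # Average time (in seconds) per visit to `site_id` within one session.
    # Only site1..site9 are considered: diff<i> is the time between visit i
    # and visit i+1, so the last slot (site10) has no duration.
    n_visits = 0
    duration = 0
    for i in range(1, 10):
        if row['site'+str(i)] == site_id:
            n_visits += 1
            duration += row['diff'+str(i)]
    return duration/n_visits if n_visits > 0 else 0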



for site in top30[2:18]:
    train_dataset['avg_ss_for_'+str(site)] = train_dataset.apply(lambda r: avg_session_span_for(site, r), axis=1)
    print('*** {} [train]'.format(reverse_site_dict[site]))
    test_dataset['avg_ss_for_'+str(site)] = test_dataset.apply(lambda r: avg_session_span_for(site, r), axis=1)
    print('--- {} [test]'.format(reverse_site_dict[site]))
    avg_ss_columns.append('avg_ss_for_'+str(site))


*** www.youtube.com [train]
--- www.youtube.com [test]
*** www.facebook.com [train]
--- www.facebook.com [test]
*** www.google.fr [train]
--- www.google.fr [test]
*** r4---sn-gxo5uxg-jqbe.googlevideo.com [train]
--- r4---sn-gxo5uxg-jqbe.googlevideo.com [test]
*** apis.google.com [train]
--- apis.google.com [test]
*** r1---sn-gxo5uxg-jqbe.googlevideo.com [train]
--- r1---sn-gxo5uxg-jqbe.googlevideo.com [test]
*** s.ytimg.com [train]
--- s.ytimg.com [test]
*** r2---sn-gxo5uxg-jqbe.googlevideo.com [train]
--- r2---sn-gxo5uxg-jqbe.googlevideo.com [test]
*** www.google.com [train]
--- www.google.com [test]
*** s-static.ak.facebook.com [train]
--- s-static.ak.facebook.com [test]
*** r3---sn-gxo5uxg-jqbe.googlevideo.com [train]
--- r3---sn-gxo5uxg-jqbe.googlevideo.com [test]
*** twitter.com [train]
--- twitter.com [test]
*** static.ak.facebook.com [train]
--- static.ak.facebook.com [test]
*** vk.com [train]
--- vk.com [test]
*** translate.google.fr [train]
--- translate.google.fr [test]
*** p
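
Row-wise `apply` is slow on a dataset of this size. A vectorized variant of the same computation might look like the sketch below. It assumes NumPy is available; the function name `avg_time_on_site` and the toy arrays are hypothetical, and in the notebook the inputs would be the `site1..site9` and `diff1..diff9` columns converted to arrays.

```python
import numpy as np

def avg_time_on_site(sites, diffs, site_id):
    """Mean time per visit to `site_id` for every session (0 if never visited).

    sites: (n_sessions, k) integer array of site IDs
    diffs: (n_sessions, k) float array of seconds spent on each visit
    """
    mask = sites == site_id                        # True where the site was visited
    n_visits = mask.sum(axis=1)                    # visits per session
    total = np.where(mask, diffs, 0.0).sum(axis=1) # time spent on the site
    # Divide only where there was at least one visit; keep 0 elsewhere.
    return np.divide(total, n_visits,
                     out=np.zeros(len(sites), dtype=float),
                     where=n_visits > 0)

# Toy data: three sessions, three visit slots each.
sites = np.array([[3, 5, 3], [5, 5, 0], [0, 0, 0]])
diffs = np.array([[10.0, 2.0, 20.0], [4.0, 6.0, 0.0], [0.0, 0.0, 0.0]])
result = avg_time_on_site(sites, diffs, site_id=3)
print(result)
```

The whole column is computed with a handful of array operations instead of one Python call per row, which is usually much faster for hundreds of thousands of sessions.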

Separate the target variable:

In [6]:
y = train_df['target']

Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [7]:
%%time

train_sessions = train_dataset[site_cols].astype(str).apply(lambda s: ' '.join(s), axis=1)
test_sessions = test_dataset[site_cols].astype(str).apply(lambda s: ' '.join(s), axis=1)

# fit TfidfVectorizer with all sites from user sessions
vec = TfidfVectorizer(ngram_range=(1, 3), max_features=200000, stop_words=['0'])
vec = vec.fit(pd.concat([test_sessions, train_sessions]))

# generate sparse matrices
train_v = vec.transform(train_sessions)
test_v = vec.transform(test_sessions)

CPU times: user 52.2 s, sys: 938 ms, total: 53.1 s
Wall time: 53.4 s
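
To illustrate what `ngram_range=(1, 3)` means for a session string, here is a small pure-Python sketch that lists word n-grams after dropping the padding token. The helper `word_ngrams` and the sample site IDs are hypothetical. `TfidfVectorizer` does its own tokenization: stop words are removed before n-grams are built, and the default `token_pattern` keeps only tokens of two or more word characters, so single-digit site IDs are dropped unless `token_pattern` is changed.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """All contiguous word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

session = '718 287 718 0'   # hypothetical site IDs, 0 is padding
without_padding = ' '.join(t for t in session.split() if t != '0')
grams = word_ngrams(without_padding)
print(grams)
```

Each distinct n-gram becomes one candidate column of the Tf-Idf matrix, and `max_features` keeps only the most frequent ones.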


Ideas:
* Users tend to prefer a particular search engine and a particular social network, so we can introduce categorical features: search engine and social network, with values (google, yandex, mail.ru, rambler, microsoft ... and facebook, vk, odnoklassniki ...)
* The time spent on a site may carry useful information, so a sparse matrix with sessions as rows, sites as columns, and the time spent on the site during the session as values could, in theory, improve the model
* Combinations such as "favorite sites in the morning" or "favorite sites on weekends" might work
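
The first idea could be prototyped roughly as follows. This is only a sketch: the keyword lists, the helper `dominant_category`, and the `'none'` label are assumptions made for illustration and were not part of the experiments in this notebook.

```python
from collections import Counter

# Hypothetical keyword lists; extend them after inspecting the real domains.
SEARCH_ENGINES = ['google', 'yandex', 'mail.ru', 'rambler', 'bing']
SOCIAL_NETWORKS = ['facebook', 'vk.com', 'odnoklassniki', 'twitter']

def dominant_category(domains, keywords, default='none'):
    """Return the keyword that matches the most domains in a session."""
    hits = Counter()
    for domain in domains:
        for kw in keywords:
            if kw in domain:
                hits[kw] += 1
                break  # count each domain at most once
    return hits.most_common(1)[0][0] if hits else default

session = ['www.google.fr', 'apis.google.com', 'www.facebook.com',
           'vk.com', 's-static.ak.facebook.com']
engine = dominant_category(session, SEARCH_ENGINES)
network = dominant_category(session, SOCIAL_NETWORKS)
print(engine, network)
```

The resulting labels could then be one-hot encoded like the other categorical features. Substring matching is crude and may need refinement for real domain names.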

Add features based on the session start time: hour, whether it's morning, day or night, and so on.
Scale these features and combine them with Tf-Idf based on sites (you'll need `scipy.sparse.hstack`).

In [8]:
cat_features = ['hour_of_day', 'part_of_day', 'day_of_week', 'weekend']
scalable_features = ['session_span'] + avg_ss_columns
other_features = ['#unique_sites']
all_features = cat_features + scalable_features + other_features

# copy so that the scaled values are written to independent frames, not to views
features_train = train_dataset[all_features].copy()
features_test = test_dataset[all_features].copy()

scaler = MinMaxScaler()
features_train[scalable_features] = scaler.fit_transform(features_train[scalable_features])
features_test[scalable_features] = scaler.transform(features_test[scalable_features])
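
Min-max scaling with statistics learned on the train set only can be sketched in plain Python as follows. The helpers `fit_min_max` and `apply_min_max` and the sample values are hypothetical; `MinMaxScaler` applies the same arithmetic per column. Mapping a constant column to 0.0 is a simplification chosen for this sketch.

```python
def fit_min_max(train_values):
    """Learn the scaling range from the training values only."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Map values so that lo -> 0.0 and hi -> 1.0 (constant columns -> 0.0)."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

train_span = [0.0, 30.0, 120.0]   # hypothetical session spans in seconds
test_span = [60.0, 240.0]

lo, hi = fit_min_max(train_span)
train_scaled = apply_min_max(train_span, lo, hi)
test_scaled = apply_min_max(test_span, lo, hi)
print(train_scaled)
print(test_scaled)
```

Test values larger than the train maximum map above 1.0, which is expected when the scaler is fit on the train set only.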

In [9]:
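# One-hot encode the categorical columns; the remaining, already scaled
# columns are passed through unchanged and placed after the encoded ones.
from sklearn.compose import ColumnTransformer

encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(), cat_features)],
    remainder='passthrough')
train_mtx = encoder.fit_transform(features_train)
test_mtx = encoder.transform(features_test)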

In [10]:
train_X = hstack([train_mtx, train_v])
test_X = hstack([test_mtx, test_v])

Perform cross-validation with logistic regression.

In [20]:
%%time
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
logitCV = LogisticRegressionCV(
        Cs=np.linspace(10, 14, 7),
        penalty='l2',
        scoring='roc_auc',
        cv=skf,
        random_state=42,
        solver='liblinear',
        n_jobs=-1,
        refit=True,
        verbose=2,
        max_iter=100,
        tol=0.0001
    )

logitCV.fit(train_X, y)


os.system('say "your program has finished"')

[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed: 12.7min finished


[LibLinear]CPU times: user 20.5 s, sys: 1.54 s, total: 22 s
Wall time: 13min


Let's evaluate our model:

In [18]:
print('AUC_ROC for our model:', logitCV.scores_[1].mean(axis=0).max())

AUC_ROC for our model: 0.988172485677
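
The printed number is the best mean validation score over the grid of `C` values: `logitCV.scores_[1]` holds one row per fold and one column per candidate `C`, the mean is taken over folds, and the maximum over candidates. A minimal sketch of that aggregation with made-up scores (the values below are not results from this notebook):

```python
from statistics import mean

# Hypothetical AUC values: one row per CV fold, one column per candidate C.
Cs = [10.0, 12.0, 14.0]
fold_scores = [
    [0.980, 0.985, 0.983],
    [0.978, 0.987, 0.984],
    [0.982, 0.986, 0.985],
]

# Average over folds for each C (the equivalent of `.mean(axis=0)`).
mean_per_C = [mean(column) for column in zip(*fold_scores)]
best_index = max(range(len(Cs)), key=lambda i: mean_per_C[i])
best_C = Cs[best_index]
best_mean_auc = mean_per_C[best_index]
print(best_C, round(best_mean_auc, 4))
```

Since the score is a maximum over several candidates evaluated on the same folds, it tends to be slightly optimistic compared with a score on truly unseen data.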


Experiments:

 * TFxIDF+hour,pod,dow,is_weekend,session_span,#unique sites /np.logspace(-7, 7, 15); l1; 3 folds/ = 0.98138
 * TFxIDF+hour,pod,dow,is_weekend,session_span,#unique sites /np.logspace(-2, 3, 9); l2; 3 folds/ = 0.98430
 * TFxIDF+hour,pod,dow,is_weekend,session_span,#unique sites /l2; 3 folds; ngram_range=(1,2), C=12.55/ = 0.98446
 * TFxIDF only /.../ = 0.95835
 * TFxIDF+hour,pod,dow,is_weekend,session_span,#unique sites /l2; 5 folds; ngram_range=(1,3), C=11.55/ = 0.98626403243
 * ... /10 folds/ = 0.98667
 * ... /tol=0.00001/ = 0.98667
 * ...+avg_per_site = 0.98668
 * ...+TFxIDF(max_features=200000) = 0.98671
 * ...+TFxIDF(max_features=500000) = 0.98598
 * tuned hour feature+TFxIDF(max_features=200000)  = 0.98671
 * ...+day,month,year = 0.99170 (probably overfit 0.90 on kaggle)
 * ...+month = 0.98994
 * TFxIDF+hour,pod,dow,is_weekend,session_span,#unique sites, tuned hour, avg_per_site/l2; 10 folds; ngram_range=(1,3), C=12./ = 0.98807
 * ... C=13.27 = 0.98808
 * refactored = 0.98818 > 0.94894
 * top10 sites duration = 0.98817 > 0.94896
 * top15 of alice sites /3 folds/ = 0.98666 > 0.94904
 * ... /10 folds, C=14.677/ = 0.98817 > 0.94904

In [21]:
logitCV.C_

array([ 14.])

Make prediction for the test set and form a submission file.

In [14]:
test_pred = logitCV.predict_proba(test_X)[:,1]
test_pred = np.array(list("{:.6f}".format(x) for x in test_pred))

In [15]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [16]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")