<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [130]:
import seaborn as sns
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt

Reading original data

In [131]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [132]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [133]:
train_df = train_df.sort_values(by='time1')

train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
21669,56,2013-01-12 08:05:57,55.0,2013-01-12 08:05:57,,,,,,,...,,,,,,,,,,0
54843,56,2013-01-12 08:37:23,55.0,2013-01-12 08:37:23,56.0,2013-01-12 09:07:07,55.0,2013-01-12 09:07:09,,,...,,,,,,,,,,0
77292,946,2013-01-12 08:50:13,946.0,2013-01-12 08:50:14,951.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:15,946.0,2013-01-12 08:50:16,...,2013-01-12 08:50:16,948.0,2013-01-12 08:50:16,784.0,2013-01-12 08:50:16,949.0,2013-01-12 08:50:17,946.0,2013-01-12 08:50:17,0
114021,945,2013-01-12 08:50:17,948.0,2013-01-12 08:50:17,949.0,2013-01-12 08:50:18,948.0,2013-01-12 08:50:18,945.0,2013-01-12 08:50:18,...,2013-01-12 08:50:18,947.0,2013-01-12 08:50:19,945.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:19,946.0,2013-01-12 08:50:20,0
146670,947,2013-01-12 08:50:20,950.0,2013-01-12 08:50:20,948.0,2013-01-12 08:50:20,947.0,2013-01-12 08:50:21,950.0,2013-01-12 08:50:21,...,2013-01-12 08:50:21,946.0,2013-01-12 08:50:21,951.0,2013-01-12 08:50:22,946.0,2013-01-12 08:50:22,947.0,2013-01-12 08:50:22,0


Separate the target from the features.

In [134]:
y_train = train_df['target']
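# drop() returns a new frame and the result is not assigned, so train_df still
# contains 'target' after this line; it is dropped again when full_df is built below
train_df.drop('target', axis=1);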

Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [135]:
import pickle

sites = ['site%s' % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype('int')
test_df[sites] = test_df[sites].fillna(0).astype('int')

times = ['time%s' % i for i in range(1, 11)]

# Load websites dictionary
with open(os.path.join(PATH_TO_DATA, 'site_dic.pkl'), 'rb') as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
sites_dict.head()

Unnamed: 0,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


In [161]:
def vec(train, test):
    
    train_ = train.apply(lambda x: x.map(sites_dict['site'])).fillna('')
    train_ = train_.apply(lambda x: ' '.join(x), axis=1)
    
    test_ = test.apply(lambda x: x.map(sites_dict['site'])).fillna('')
    test_ = test_.apply(lambda x: ' '.join(x), axis=1)
    
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=10000)
    vectorizer.fit(pd.concat([train_, test_]))
    
    return vectorizer.transform(train_), vectorizer.transform(test_)

X_train_sites, X_test_sites = vec(train_df[sites], test_df[sites])
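
How the vectorizer tokenizes the joined site names affects which n-grams it can build. The sketch below uses made-up toy sessions (not the competition data) to compare scikit-learn's default tokenization with a whitespace-only tokenizer that keeps each site as one token. The names `toy_sessions`, `default_vec` and `site_vec` are illustrative assumptions, and this is not necessarily the setup used to produce the score reported later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sessions (hypothetical data): space-separated site names, one string per session
toy_sessions = [
    "google.com mail.google.com google.com",
    "youtube.com google.com",
    "mail.google.com youtube.com youtube.com",
]

# The default token pattern splits on punctuation,
# so "mail.google.com" is broken into "mail", "google" and "com"
default_vec = TfidfVectorizer(ngram_range=(1, 1))
default_vec.fit(toy_sessions)
default_vocab = sorted(default_vec.vocabulary_)

# Splitting on whitespace only keeps every site as a single token,
# so n-grams become sequences of whole sites
site_vec = TfidfVectorizer(ngram_range=(1, 3), tokenizer=str.split, token_pattern=None)
X_toy = site_vec.fit_transform(toy_sessions)
site_vocab = sorted(site_vec.vocabulary_)

print(default_vocab)
print(site_vocab)
```

With whole-site tokens, a bigram such as `google.com mail.google.com` describes two consecutive visits rather than fragments of one domain name.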

In [None]:
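# Alternative bag-of-sites representation. Note that running this cell
# overwrites X_train_sites and X_test_sites produced by vec() above.
# Combined dataframe of the initial data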
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

# Index to split the training and test data sets
idx_split = train_df.shape[0]

full_sites = full_df[sites]

# sequence of indices
sites_flatten = full_sites.values.flatten()

# and the matrix we are looking for
full_sites_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0]  + 10, 10)))[:, 1:]

X_train_sites = full_sites_sparse[:idx_split, :]
X_test_sites = full_sites_sparse[idx_split:, :]
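
The `csr_matrix((data, indices, indptr))` constructor used above can be easier to follow on a tiny example. The sketch below assumes two hypothetical sessions of three site IDs each, with 0 standing for a missing site, and shows how repeated IDs within a session turn into counts.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy data (hypothetical): 2 sessions, 3 site IDs each; 0 marks a missing site
toy_sites = np.array([[1, 2, 1],
                      [3, 0, 0]])
flat = toy_sites.flatten()
row_len = toy_sites.shape[1]

# indptr marks where each row starts in `flat`: [0, 3, 6]
indptr = np.arange(0, flat.shape[0] + row_len, row_len)

# data is all ones; column indices are the site IDs themselves
counts = csr_matrix((np.ones(flat.shape[0], dtype=int), flat, indptr))

# Duplicate (row, column) entries are summed when converting to dense,
# so each cell holds the number of visits to that site in the session
dense = counts.toarray()

# Dropping column 0 removes the "missing site" placeholder
dense_no_zero = counts[:, 1:].toarray()
print(dense_no_zero)
```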

Add features based on the session start time: the hour, whether it is morning, day or night, and so on.

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [142]:
def full_times(train, test):
    
    def f(data):
        # 10 binary flags per session: columns 0-6 for day of week, 7-9 for part of day
        data_ = pd.DataFrame(np.zeros((data.shape[0], 10), dtype=bool))
        for n in range(1, 11):
            d = pd.DatetimeIndex(data['time' + str(n)])
            for i in range(0, 7):
                # pandas numbers days Monday=0 ... Sunday=6, so compare with i, not i + 1
                data_[i] = np.where(data_[i] | (d.dayofweek == i), 1, 0)
            # a flag is set if any visit in the session falls into the interval
            data_[7] = np.where(data_[7] | ((d.hour >= 0) & (d.hour < 9)), 1, 0)
            data_[8] = np.where(data_[8] | ((d.hour >= 9) & (d.hour < 19)), 1, 0)
            data_[9] = np.where(data_[9] | ((d.hour >= 19) & (d.hour < 24)), 1, 0)
            
        return data_
        
    train_, test_ = f(train), f(test)
    
    return train_, test_

X_train_times, X_test_times = full_times(train_df[times], test_df[times])
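
The day-of-week and hour conventions that such time features rely on can be checked on a few timestamps. This is a minimal sketch on made-up values (the first timestamp matches the format shown in `train_df.head()`); the variable names and the hour thresholds for `is_morning` and `is_evening` are assumptions for illustration.

```python
import pandas as pd

# Toy timestamps (hypothetical); None stands for a missing visit
toy_times = pd.to_datetime(pd.Series(["2013-01-12 08:05:57",
                                      "2013-01-14 19:30:00",
                                      None]))

# pandas numbers weekdays from Monday=0 to Sunday=6; missing values give NaN
dow = toy_times.dt.dayofweek
hour = toy_times.dt.hour

# Simple part-of-day flags; comparisons with NaN evaluate to False
is_morning = (hour >= 0) & (hour < 9)
is_evening = (hour >= 19) & (hour < 24)

print(pd.DataFrame({"dow": dow, "hour": hour,
                    "is_morning": is_morning, "is_evening": is_evening}))
```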

In [188]:
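def full_times1(train, test):
    
    def f(data):
        # day of week and hour for each of the 10 visits in a session
        data_ = pd.DataFrame()
        for n in times:
            d = pd.DatetimeIndex(data[n]) 
            data_[n + '_dw'] = d.dayofweek
            data_[n + '_h'] = d.hour
        
        # missing visits become 0, i.e. the same code as Monday / hour 0
        return data_.fillna(0)
    
    train_, test_ = f(train), f(test)
    
    # one-hot encode the categorical time features (fitted on train and test together)
    encoder = OneHotEncoder()
    
    encoder.fit(pd.concat([train_, test_]))
    
    return encoder.transform(train_), encoder.transform(test_)

X_train_times, X_test_times = full_times1(train_df[times], test_df[times])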

In [189]:
X_train = hstack([X_train_sites, X_train_times], format='csr')
X_test = hstack([X_test_sites, X_test_times], format='csr')
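
The cells above one-hot encode the time features. The scaling route mentioned in the task text can be sketched as follows, assuming a single numeric feature (a hypothetical start hour) and a small stand-in for the sparse site matrix; all names and values here are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Stand-in for the sparse site features (hypothetical values): 3 sessions, 2 columns
sparse_part = csr_matrix(np.array([[1.0, 0.0],
                                   [0.0, 2.0],
                                   [0.0, 0.0]]))

# One numeric feature per session, e.g. the start hour (hypothetical values)
start_hour = np.array([[8.0], [12.0], [20.0]])

# Standardize to zero mean and unit variance so the feature's scale
# is comparable to the Tf-Idf values
scaler = StandardScaler()
hour_scaled = scaler.fit_transform(start_hour)

# Stack column-wise; format='csr' returns a CSR matrix directly
combined = hstack([sparse_part, csr_matrix(hour_scaled)], format='csr')
print(combined.shape)
```

In a real pipeline the scaler would be fitted on the training rows and then applied to the test rows with `scaler.transform`.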

Perform cross-validation with logistic regression.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import TimeSeriesSplit

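lr = LogisticRegression(C=1.5, random_state=17, solver='newton-cg')

# train_df is sorted by time1, so each fold trains on earlier sessions
# and validates on later ones
time_split = TimeSeriesSplit(n_splits=4)

cv_aucs = cross_val_score(lr, X_train, y_train, scoring="roc_auc", cv=time_split)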

print(np.mean(cv_aucs))

0.948661845726
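
To see what `TimeSeriesSplit` does with time-ordered rows, the sketch below lists the folds for ten toy samples (hypothetical data). In every fold the training indices precede the validation indices, which is why the training frame is sorted by `time1` beforehand.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten toy samples, assumed to be already sorted by time
X_toy = np.arange(10).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=4).split(X_toy))

for train_idx, valid_idx in splits:
    # each fold trains on a growing prefix and validates on the block right after it
    print("train:", train_idx, "valid:", valid_idx)
```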

Make predictions for the test set and write a submission file.

In [182]:
lr.fit(X_train, y_train)
test_pred = lr.predict_proba(X_test)[:, 1]

In [183]:
write_to_submission_file(test_pred, os.path.join(PATH_TO_DATA, "assignment6_alice_submission.csv"))
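
The layout of the resulting file can be previewed without touching the disk. This sketch repeats the logic of `write_to_submission_file` on three made-up probabilities and writes to an in-memory buffer. Like the original function, it assumes that predictions are in the original order of the test set, so that session IDs run from 1 to n.

```python
import io

import numpy as np
import pandas as pd

# Made-up predicted probabilities for three test sessions
toy_pred = np.array([0.1, 0.9, 0.35])

# Same construction as in write_to_submission_file: IDs 1..n, one 'target' column
toy_df = pd.DataFrame(toy_pred,
                      index=np.arange(1, toy_pred.shape[0] + 1),
                      columns=["target"])

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
toy_df.to_csv(buffer, index_label="session_id")
csv_text = buffer.getvalue()
print(csv_text)
```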