<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [207]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [208]:
PATH_TO_DATA = ('/Users/lucky/.kaggle/competitions/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

Prepare the dataset: replace missing values (N/A) with 0 and convert the columns to appropriate types.

In [262]:
toy_train_df = train_df.iloc[:5]
toy_test_df = test_df.iloc[:5]


toy_train_sites = toy_train_df[['site'+str(i) for i in range(1, 11)]]
toy_test_sites = toy_test_df[['site'+str(i) for i in range(1, 11)]]
toy_full_sites = pd.concat([toy_test_sites, toy_train_sites])  # DataFrame.append was removed in pandas 2.0

toy_train_sessions = toy_train_sites.fillna(0).astype(int).astype(str).apply(lambda s: ' '.join(s), axis=1)
toy_test_sessions = toy_test_sites.fillna(0).astype(int).astype(str).apply(lambda s: ' '.join(s), axis=1)
toy_full_sessions = toy_full_sites.fillna(0).astype(int).astype(str).apply(lambda s: ' '.join(s), axis=1)
#print(toy_full_sessions)

vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
vectorizer = vectorizer.fit(toy_full_sessions)
toy_train_v = vectorizer.transform(toy_train_sessions)
toy_test_v = vectorizer.transform(toy_test_sessions)

vstack([toy_test_v, toy_train_v])

toy_train_times = toy_train_df[['time'+str(i) for i in range(1, 11)]]
toy_train_times.loc[:, ['time2','time1']]
toy_train_times['delay9'] = toy_train_times['time10']-toy_train_times['time9']
toy_train_times['delay8'] = toy_train_times['time9']-toy_train_times['time8']
toy_train_times['delay7'] = toy_train_times['time8']-toy_train_times['time7']
toy_train_times['delay6'] = toy_train_times['time7']-toy_train_times['time6']
toy_train_times['delay5'] = toy_train_times['time6']-toy_train_times['time5']
toy_train_times['delay4'] = toy_train_times['time5']-toy_train_times['time4']
toy_train_times['delay3'] = toy_train_times['time4']-toy_train_times['time3']
toy_train_times['delay2'] = toy_train_times['time3']-toy_train_times['time2']
toy_train_times['delay1'] = toy_train_times['time2']-toy_train_times['time1']
toy_train_times['delay1'] = 0 
np.mean(toy_train_times[['delay1','delay2','delay3','delay4','delay5', 'delay6', 'delay7', 'delay8', 'delay9']], axis=1)

session_id
1   -1792 days +17:33:01.666666
2               0 days 00:00:03
3        0 days 00:00:00.888888
4        0 days 00:01:03.333333
5        0 days 00:00:43.555555
dtype: timedelta64[ns]
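
The large negative mean for session 1 is consistent with missing timestamps having been filled with 0, which parses as the Unix epoch (1970-01-01), so the gap between a real visit and a padded one spans decades. Below is a minimal sketch of one alternative, assuming missing visits are kept as `NaT` and skipped when averaging. The toy frame and the names `toy`, `delays` and `mean_delay` are hypothetical and not part of the assignment.

```python
import pandas as pd

# Hypothetical toy sessions: session 1 has a single visit, session 2 has three.
toy = pd.DataFrame(
    {
        "time1": ["2014-02-20 10:02:45", "2014-02-22 11:19:50"],
        "time2": [None, "2014-02-22 11:19:51"],
        "time3": [None, "2014-02-22 11:19:55"],
    },
    index=pd.Index([1, 2], name="session_id"),
)

# Parse every column; missing values become NaT instead of 1970-01-01.
times = toy.apply(pd.to_datetime)

# Delay between consecutive visits, in seconds (NaN where a visit is missing).
cols = list(times.columns)
delays = pd.DataFrame(
    {
        f"delay{i}": (times[cols[i]] - times[cols[i - 1]]).dt.total_seconds()
        for i in range(1, len(cols))
    }
)

# Row-wise mean skips NaN; sessions with a single visit get a delay of 0.
mean_delay = delays.mean(axis=1).fillna(0.0)
print(mean_delay)
```

Working in seconds rather than `timedelta64` also gives a plain numeric column that can be scaled like any other feature.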

In [210]:
def prepare_dataset(df):
    df = df.fillna(0)
    for i in range(1, 11):
        df['site' + str(i)] = df['site' + str(i)].astype(int)
        df['time' + str(i)] = pd.to_datetime(df['time' + str(i)])
    return df


train_df = prepare_dataset(train_df)
test_df = prepare_dataset(test_df)

Separate the target variable.

In [211]:
y = train_df['target']

In [None]:
X_train, X_val, y_train, y_val = train_test_split(train_df, y, test_size=0.2, random_state=42)

Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [212]:
%%time
vec = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)

train_sites = X_train[['site'+str(i) for i in range(1, 11)]]
val_sites = X_val[['site'+str(i) for i in range(1, 11)]]
test_sites = test_df[['site'+str(i) for i in range(1, 11)]]
#full_sites = test_sites.append(train_sites)

train_sessions = train_sites.astype(str).apply(lambda s: ' '.join(s), axis=1)
val_sessions = val_sites.astype(str).apply(lambda s: ' '.join(s), axis=1)
test_sessions = test_sites.astype(str).apply(lambda s: ' '.join(s), axis=1)
full_sessions = pd.concat([test_sessions, train_sessions, val_sessions])
    
vec = vec.fit(full_sessions)

train_v = vec.transform(train_sessions)
val_v = vec.transform(val_sessions)
test_v = vec.transform(test_sessions)

CPU times: user 49.6 s, sys: 941 ms, total: 50.6 s
Wall time: 51 s
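
To make the "session as a document" idea concrete, here is a minimal sketch on a hypothetical three-session frame (the site IDs are made up for illustration). It assumes the default `token_pattern` of `TfidfVectorizer`, which keeps only tokens of two or more characters, so the `0` padding (and any single-digit site ID) is dropped from the vocabulary.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sessions; 0 stands for "no site visited", as after fillna(0).
sites = pd.DataFrame(
    {
        "site1": [718, 890, 14769],
        "site2": [0, 941, 39],
        "site3": [0, 3847, 14768],
    }
)

# Each session becomes one space-separated "sentence" of site IDs.
sessions = sites.astype(str).apply(" ".join, axis=1)

# Unigrams, bigrams and trigrams of consecutive site IDs.
vec = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
X = vec.fit_transform(sessions)

print(sessions.tolist())
print(X.shape)
print(sorted(vec.vocabulary_))
```

If single-digit site IDs should be kept as features, passing a custom `token_pattern` (for example `r"\b\w+\b"`) is one option; whether that helps the score here is untested.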


Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [213]:
 
def analyze_session_start(df):
    df['session_hour'] = df['time1'].dt.hour
    df['session_pod'] = df['session_hour'].apply(lambda h: 1 if h > 6 and h <= 12 else (
                                                     2 if h > 12 and h <= 18 else (
                                                     3 if h > 18 and h <= 24 
                                                         else 4)))
    df['session_dow'] = df['time1'].dt.dayofweek
    df['is_weekend'] = df['session_dow'].apply(lambda d: 1 if d==5 or d==6 else 0)
    return df[['session_hour', 'session_pod', 'session_dow', 'is_weekend']]
    
time_features_train = analyze_session_start(X_train)
time_features_val = analyze_session_start(X_val)
time_features_test = analyze_session_start(test_df)
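
The same part-of-day binning can be written without nested conditionals. The following is a sketch on hypothetical timestamps, assuming the same boundaries as above (hours 7 to 12 map to 1, 13 to 18 to 2, 19 to 23 to 3, everything else to 4); the names `start`, `pod` and `is_weekend` are illustrative.

```python
import pandas as pd

# Hypothetical session start times.
start = pd.Series(
    pd.to_datetime(
        [
            "2014-02-20 05:30:00",  # Thursday, early morning
            "2014-02-22 09:00:00",  # Saturday
            "2014-02-23 13:15:00",  # Sunday
            "2014-02-24 21:45:00",  # Monday
        ]
    )
)

hour = start.dt.hour

# Right-closed bins: (-1, 6] -> 4, (6, 12] -> 1, (12, 18] -> 2, (18, 24] -> 3.
pod = pd.cut(hour, bins=[-1, 6, 12, 18, 24], labels=[4, 1, 2, 3]).astype(int)

# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday.
is_weekend = (start.dt.dayofweek >= 5).astype(int)

print(pd.DataFrame({"hour": hour, "pod": pod, "is_weekend": is_weekend}))
```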

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [214]:
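# Fit the encoder on the training split only and reuse it for validation and test,
# so that all three matrices end up with the same columns.
encoder = OneHotEncoder(handle_unknown='ignore')  # categories unseen in training encode as all zeros
time_features_train = encoder.fit_transform(time_features_train)
time_features_val = encoder.transform(time_features_val)
time_features_test = encoder.transform(time_features_test)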


In [215]:
scaler = StandardScaler(with_mean=False)

scaler = scaler.fit(time_features_train)
time_features_train = scaler.transform(time_features_train)
time_features_val = scaler.transform(time_features_val)
time_features_test = scaler.transform(time_features_test)    

train_dataset = hstack([time_features_train, train_v])
val_dataset = hstack([time_features_val, val_v])
test_dataset = hstack([time_features_test, test_v])
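
Every split must end up with the same number of columns, otherwise the model rejects the matrix at prediction time. Below is a minimal sketch of the encode-then-stack step on hypothetical toy arrays, fitting the encoder on the training split only; the `*_text` matrices merely stand in for Tf-Idf output.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# Hypothetical (hour, day-of-week) features.
train_time = np.array([[9, 0], [14, 1], [20, 5]])
val_time = np.array([[9, 5], [23, 1]])  # hour 23 never occurs in training

# Learn the categories from the training split only.
encoder = OneHotEncoder(handle_unknown="ignore")
train_ohe = encoder.fit_transform(train_time)
val_ohe = encoder.transform(val_time)  # unseen hour 23 becomes all zeros

# Stand-ins for the Tf-Idf matrices (4 text features each).
train_text = csr_matrix(np.ones((3, 4)))
val_text = csr_matrix(np.ones((2, 4)))

# Column-wise concatenation; CSR format supports row slicing for CV folds.
train_X = hstack([train_ohe, train_text]).tocsr()
val_X = hstack([val_ohe, val_text]).tocsr()

print(train_X.shape, val_X.shape)
```

Calling `fit_transform` separately on each split would instead learn a different set of categories per split, and the column counts could diverge.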

Perform cross-validation with logistic regression.

In [218]:
%%time
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
logitCV = LogisticRegressionCV(
        Cs=10,
        penalty='l1',
        scoring='roc_auc',
        cv=skf,
        random_state=42,
        solver='liblinear',
        n_jobs=-1
#        tol=10
    )

logitCV.fit(train_dataset, y_train)


os.system('say "your program has finished"')  # optional audible notification; `say` is a macOS command

ValueError: X has 100000 features per sample; expecting 100027
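
For reference, the idea behind the cross-validated search can be sketched on synthetic data. This is not the notebook's pipeline: it assumes a small imbalanced dataset from `make_classification` and uses a manual grid over `C` with `cross_val_score` rather than `LogisticRegressionCV`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the session features (about 10% positives).
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.9, 0.1], random_state=42
)

# Stratified folds keep the class ratio roughly equal in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Mean ROC AUC across folds for each regularization strength.
scores = {}
for C in np.logspace(-2, 2, 5):
    model = LogisticRegression(C=C, solver="liblinear", random_state=42)
    scores[C] = cross_val_score(model, X, y, cv=skf, scoring="roc_auc").mean()

best_C = max(scores, key=scores.get)
print(best_C, scores[best_C])
```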

Let's evaluate our model:

In [223]:
print('AUC_ROC for our model:', roc_auc_score(y_val, logitCV.predict_proba(val_dataset)[:, 1]))

Max auc_roc: 0.998155600268
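
For a binary target, `roc_auc_score` expects one score per sample for the positive class, while `predict_proba` returns one column per class. A tiny sketch with hypothetical probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Hypothetical predict_proba output: column 0 is P(class 0), column 1 is P(class 1).
proba = np.array(
    [
        [0.90, 0.10],
        [0.60, 0.40],
        [0.65, 0.35],
        [0.20, 0.80],
    ]
)

# Use only the positive-class column as the ranking score.
auc = roc_auc_score(y_true, proba[:, 1])
print(auc)  # 0.75: three of the four (positive, negative) pairs are ranked correctly
```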


Make predictions for the test set and create a submission file.

In [224]:
test_pred = logitCV.predict_proba(test_dataset)[:,1]

In [225]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [226]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
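
Before uploading, it can be worth reading the file back to confirm its layout. Below is a self-contained sketch that writes hypothetical predictions to a temporary file in the same two-column format (`session_id`, `target`) and checks the result; `fake_pred` and the file name are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for the real test-set probabilities.
fake_pred = np.array([0.02, 0.91, 0.15])

# Write a file with a 1-based session_id index and a single target column.
out_path = os.path.join(tempfile.mkdtemp(), "toy_submission.csv")
pd.DataFrame(
    {"target": fake_pred},
    index=pd.RangeIndex(1, len(fake_pred) + 1, name="session_id"),
).to_csv(out_path)

# Read it back and verify the header, the IDs and the value range.
check = pd.read_csv(out_path)
print(check)
```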