<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-comercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [850]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [851]:
PATH_TO_DATA = ('/Users/lucky/.kaggle/competitions/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

Separate target feature 

In [852]:
y = train_df['target']

Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [853]:
%%time

# prepare columns (clean n/a values and convert them to integer then to strings)
site_cols = ['site'+str(i) for i in range(1, 11)]
train_sites = X_train[site_cols].fillna(0).astype(int)
test_sites = test_df[site_cols].fillna(0).astype(int)

# TODO: take a number of unique sites into account
train_sites['#unique_sites'] = train_sites.nunique(axis=1) / 10
test_sites['#unique_sites'] = test_sites.nunique(axis=1) / 10


train_sessions = train_sites.astype(str).apply(lambda s: ' '.join(s), axis=1)
test_sessions = test_sites.astype(str).apply(lambda s: ' '.join(s), axis=1)

# fit TfidfVectorizer with all sites from user sessions
vec = TfidfVectorizer(ngram_range=(1, 3), max_features=200000)
full_sessions = test_sessions.append(train_sessions)
vec = vec.fit(full_sessions)

# generate sparse matrices
train_v = vec.transform(train_sessions)
test_v = vec.transform(test_sessions)

CPU times: user 1min 10s, sys: 978 ms, total: 1min 11s
Wall time: 1min 11s


Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [854]:
# separate "timeN" columns into another dataset and clean it up and convert to datetime
timecols = ['time'+str(i) for i in range(1, 11)]
train_times = X_train[timecols].fillna(method='ffill', axis=1).astype('datetime64[ns]')
test_times = test_df[timecols].fillna(method='ffill', axis=1).astype('datetime64[ns]')

def analyze_time_columns(df):
    res = pd.DataFrame()
    res['session_span'] = (df.max(axis=1) - df.min(axis=1)).astype('timedelta64[s]')
    res['session_hour'] = df['time1'].dt.hour
    res['session_pod'] = res['session_hour'].apply(lambda h: 1 if h > 11 and h <= 13 else (
                                                     2 if h > 15 and h <= 18 else (
                                                     3 if h > 18 and h <= 24 
                                                         else 4)))
    res['session_dow'] = df['time1'].dt.dayofweek
    res['session_day'] = df['time1'].dt.day
    res['session_month'] = df['time1'].dt.month
    res['session_year'] = df['time1'].dt.year
    res['is_weekend'] = res['session_dow'].apply(lambda d: 1 if d==5 or d==6 else 0)

    return res
    
time_features_train = analyze_time_columns(train_times)
time_features_test = analyze_time_columns(test_times)

Let's visualize time based features

In [855]:
#time_features_train = analyze_time_columns(train_times)
alice_tf = train_times[train_df['target'] == 0]
#alice_tf.time1.dt.month.hist()

Scale this features and combine then with Tf-Idf based on sites (you'll need `scipy.sparse.hstack`)

In [856]:
scaler = StandardScaler()

time_features_train['avg_per_site'] = time_features_train['session_span'] / train_sites['#unique_sites']
time_features_test['avg_per_site'] = time_features_test['session_span'] / test_sites['#unique_sites']

time_features_train[['session_span', 'avg_per_site']] = scaler.fit_transform(time_features_train[['session_span', 'avg_per_site']])
time_features_test[['session_span', 'avg_per_site']] = scaler.transform(time_features_test[['session_span', 'avg_per_site']])

features_train = pd.concat([time_features_train, train_sites[['#unique_sites']]], axis=1)

In [857]:


cat_features = ['session_hour', 'session_pod', 'session_dow', 'is_weekend', 'session_day', 'session_month', 'session_year']
enc = LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)

encoder = OneHotEncoder(categorical_features=new_cat_features)
tf_train_mtx = encoder.fit_transform(time_features_train)
tf_test_mtx = encoder.transform(time_features_test)

In [858]:
train_dataset = hstack([tf_train_mtx, train_v ])
test_dataset = hstack([tf_test_mtx, test_v])
#train_dataset = train_v
#test_dataset = test_v

Perform cross-validation with logistic regression.

In [859]:
%%time
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
logitCV = LogisticRegressionCV(
        Cs=[11.33, 11.55, 12.33],
        penalty='l2',
        scoring='roc_auc',
        cv=skf,
        random_state=42,
        solver='liblinear',
        n_jobs=-1,
        refit=True,
#        verbose=2,
        tol=0.0001
    )

logitCV.fit(train_dataset, y_train)


os.system('say "your program has finished"')

CPU times: user 17.9 s, sys: 877 ms, total: 18.8 s
Wall time: 5min 2s


Let's evaluate our model:

In [860]:
print ('AUC_ROC for our model:', logitCV.scores_[1].mean(axis=0).max())

AUC_ROC for our model: 0.991703359184


Experiments:

 * TFxIDF+hour,pod,dow,is_weekend,session_span,#unique sites /np.logspace(-7, 7, 15); l1; 3 folds/ = 0.98138
 * TFxIDF+hour,pod,dow,is_weekend,session_span,#unique sites /np.logspace(-2, 3, 9); l2; 3 folds/ = 0.98430
 * TFxIDF+hour,pod,dow,is_weekend,session_span,#unique sites /l2; 3 folds; ngram_range=(1,2), C=12.55/ = 0.98446
 * TFxIDF only /.../ = 0.95835
 * TFxIDF+hour,pod,dow,is_weekend,session_span,#unique sites /l2; 5 folds; ngram_range=(1,3), C=11.55/ = 0.98626403243
 * ... /10 folds/ = 0.98667
 * ... /tol=0.00001/ = 0.98667
 * ...+avg_per_site = 0.98668
 * ...+TFxIDF(max_features=200000) = 0.98671
 * ...+TFxIDF(max_features=500000) = 0.98598
 * tuned hour feature+TFxIDF(max_features=200000)  = 0.98671
 * ...+day,month,year = 0.99170

Make prediction for the test set and form a submission file.

In [861]:
test_pred = logitCV.predict_proba(test_dataset)[:,1]
test_pred = np.array(list("{:.6f}".format(x) for x in test_pred))

In [862]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [863]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")