<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided that the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [9]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer
import functools

Read the original data.

In [10]:
PATH_TO_DATA = ('/Users/lucky/.kaggle/competitions/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

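Prepare the dataset: replace missing site IDs with 0 and convert the columns to appropriate types. Missing timestamps are left as `NaT`.

In [11]:
def prepare_dataset(df):
    df = df.copy()
    for i in range(1, 11):
        # Sessions shorter than 10 pages have missing sites; encode them as site ID 0
        df['site' + str(i)] = df['site' + str(i)].fillna(0).astype(int)
        # Parse timestamps; missing values become NaT
        df['time' + str(i)] = pd.to_datetime(df['time' + str(i)])
    return df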


train_df = prepare_dataset(train_df)
test_df = prepare_dataset(test_df)

Separate the target variable.

In [12]:
y = train_df['target']

Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [13]:
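def join_sites(df):
    # Represent each session as one space-separated string of its ten site IDs
    sites = df['site1'].astype(str)
    for i in range(2, 11):
        sites = sites + ' ' + df['site' + str(i)].astype(str)
    return sites

# Fit the vocabulary and IDF weights on the training set only, then reuse them
# for the test set so that both matrices share the same columns
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
train_vectorized = vectorizer.fit_transform(join_sites(train_df))
test_vectorized = vectorizer.transform(join_sites(test_df))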
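
To make `ngram_range=(1, 3)` concrete, here is a minimal pure-Python sketch of the word n-grams that a single session string yields. The helper `word_ngrams` and the site IDs are hypothetical and only illustrate the idea; `TfidfVectorizer` performs its own tokenization and then weights the resulting n-grams by Tf-Idf.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams


# Hypothetical session consisting of three site IDs
session = '270 21 7832'
features = word_ngrams(session.split())
print(features)
```

Note that with its default `token_pattern`, `TfidfVectorizer` ignores one-character tokens, so single-digit site IDs (including the 0 used for padding) do not become features.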

Add features based on the session start time: the hour, and whether the session started in the morning, during the day, or in the evening.

In [14]:
def analyze_session_start(df):
    df['session_hour'] = df['time1'].dt.hour
    df['morning'] = (df['session_hour'] > 6) & (df['session_hour'] <= 12)
    df['day'] = (df['session_hour'] > 12) & (df['session_hour'] <= 18)
    df['evening'] = (df['session_hour'] > 18) & (df['session_hour'] <= 24) 
    return df
    
train_df = analyze_session_start(train_df)
test_df = analyze_session_start(test_df)
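
As a quick check of the hour boundaries, the following sketch applies the same comparisons to a few start times in plain Python. The function name `time_of_day_flags` and the timestamps are made up for illustration.

```python
from datetime import datetime


def time_of_day_flags(hour):
    """Return (morning, day, evening) flags using the same bounds as above."""
    morning = 6 < hour <= 12
    day = 12 < hour <= 18
    evening = 18 < hour <= 24
    return morning, day, evening


# Hypothetical session start times
starts = [
    datetime(2014, 2, 20, 10, 2),
    datetime(2014, 2, 20, 15, 30),
    datetime(2014, 2, 20, 23, 59),
    datetime(2014, 2, 20, 3, 0),
]
flags = [time_of_day_flags(ts.hour) for ts in starts]
for ts, f in zip(starts, flags):
    print(ts.time(), f)
```

Hours 0 through 6 set none of the three flags, so a row with all flags equal to zero implicitly encodes a night session.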

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [15]:
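def normalize_features(df, scaler, fit=False):
    # Convert the boolean time-of-day flags to 0/1
    for col in ['morning', 'day', 'evening']:
        df[col] = df[col].astype(int)

    # Learn the mean and variance on the training set only (fit=True)
    # and apply the same transform to the test set
    hours = df[['session_hour']]
    scaled = scaler.fit_transform(hours) if fit else scaler.transform(hours)
    df['session_hour'] = scaled.ravel()
    return df

scaler = StandardScaler()
train_df = normalize_features(train_df, scaler, fit=True)
test_df = normalize_features(test_df, scaler)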

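# Convert the dense time features to a sparse matrix before stacking,
# and keep the result in CSR format for efficient row slicing during cross-validation
time_features = ['session_hour', 'morning', 'day', 'evening']
train_dataset = hstack([csr_matrix(train_df[time_features].values), train_vectorized]).tocsr()
test_dataset = hstack([csr_matrix(test_df[time_features].values), test_vectorized]).tocsr()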
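
For reference, standardization subtracts the mean and divides by the standard deviation. A common convention is to learn both parameters from the training set and reuse them for the test set, so that the same raw value maps to the same scaled value in both. The sketch below shows this with the standard library only; the helper names and the hour values are hypothetical. `StandardScaler` uses the population standard deviation, which corresponds to `pstdev` here.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation of a feature."""
    return fmean(values), pstdev(values)


def apply_scaler(values, mean, std):
    """Standardize values with previously learned parameters."""
    return [(v - mean) / std for v in values]


train_hours = [8, 9, 12, 15, 16]  # hypothetical training values
test_hours = [10, 14]             # hypothetical test values

mean, std = fit_scaler(train_hours)
train_scaled = apply_scaler(train_hours, mean, std)
test_scaled = apply_scaler(test_hours, mean, std)
print(mean, std)
print(test_scaled)
```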

Perform cross-validation with logistic regression.

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
searchCV = LogisticRegressionCV(
        Cs=10,
        penalty='l1',
        scoring='roc_auc',
        cv=skf,
        random_state=42,
        solver='liblinear',
        n_jobs=-1
#        tol=10
    )

searchCV.fit(train_dataset, y)

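# scores_[1] has shape (n_folds, n_Cs): average over folds, then take the best C
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max())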
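
ROC AUC can be read as the share of (positive, negative) pairs in which the positive example receives the higher score. The following sketch computes it by brute force on made-up data to show that it depends only on the ranking of the scores, and that thresholded labels discard part of that ranking. The function `roc_auc` is an illustrative helper, not the scikit-learn implementation.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the share of correctly ranked (positive, negative) pairs.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError('both classes must be present')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.45, 0.8]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = roc_auc(y_true, probabilities)
auc_from_labels = roc_auc(y_true, hard_labels)
print(auc_from_probabilities, auc_from_labels)
```

In this toy case the probabilities rank every positive above every negative, while the hard labels tie one positive with both negatives and lose a quarter of the score.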

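Make predictions for the test set and form a submission file. Since the model is scored with ROC AUC, predict the probability of the positive class rather than hard 0/1 labels.

In [None]:
test_pred = searchCV.predict_proba(test_dataset)[:, 1]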

In [None]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [None]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
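
The layout that `write_to_submission_file` produces is a two-column CSV with a `session_id` index starting at 1 and a `target` column. The sketch below reproduces that layout with the standard library and an in-memory buffer; the helper `write_submission` and the scores are hypothetical and serve only to show the expected file contents.

```python
import csv
import io


def write_submission(scores, out):
    """Write one row per test session, with session IDs starting at 1."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['session_id', 'target'])
    for session_id, score in enumerate(scores, start=1):
        writer.writerow([session_id, score])


buffer = io.StringIO()  # stands in for a real file
write_submission([0.02, 0.91, 0.15], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```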