In [0]:
!ls -l ../input/

In [0]:
!head ../input/train_sessions.csv

In [0]:
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

We will solve an intruder detection problem by analyzing a user's behavior on the Internet. It is a complex and interesting problem that combines data analysis and behavioral psychology.

For example, Yandex detects mailbox intruders based on users' behavior patterns. In a nutshell, an intruder's behavior pattern may differ from the owner's:
- the intruder might not delete emails right after reading them, as the mailbox owner might do
- the intruder might mark emails and even move the cursor differently
- etc.

The intruder could thus be detected and locked out of the mailbox, with the owner asked to authenticate via an SMS code.
This pilot project is described in a Habrahabr article.

Similar approaches are being developed in Google Analytics and are described in the research literature. You can find more on this topic by searching for "Traversal Pattern Mining" and "Sequential Pattern Mining".

In this competition we are going to solve a similar problem: our algorithm analyzes the sequence of websites consecutively visited by a particular person and predicts whether this person is Alice or an intruder (someone else). As a metric we will use [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). We will reveal who Alice is at the end of the course.
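
To make the metric concrete, here is a minimal sketch on made-up labels and scores (the numbers are illustrative only, not taken from the competition data). It computes ROC AUC with `sklearn.metrics.roc_auc_score` and compares it with the pairwise interpretation: the fraction of (Alice, non-Alice) pairs in which the Alice session receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data (assumed for illustration): 1 = Alice's session, 0 = someone else
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC AUC as computed by scikit-learn
auc = roc_auc_score(y_true, y_score)

# Pairwise view: share of (positive, negative) pairs ranked correctly,
# counting ties as half a correct pair
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

print(auc, pairwise)
```

Because the metric depends only on how sessions are ranked, the model's scores do not need to be calibrated probabilities.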

###  Data Downloading and Transformation
First, read the training and test sets. 

In [0]:
# Read the training and test data sets
train_df = pd.read_csv('../input/train_sessions.csv',
                       index_col='session_id', parse_dates=['time1'])
test_df = pd.read_csv('../input/test_sessions.csv',
                      index_col='session_id', parse_dates=['time1'])

In [0]:
print(train_df.info())
train_df.head()

In [0]:
print(test_df.info())
test_df.head()

In [0]:
train_df.shape, test_df.shape

The training data set contains the following features:

- **site1** – id of the first visited website in the session
- **time1** – visiting time for the first website in the session
- ...
- **site10** – id of the tenth visited website in the session
- **time10** – visiting time for the tenth website in the session
- **target** – target variable, equal to 1 for Alice's sessions and 0 for other users' sessions
    
User sessions are constructed so that they last no longer than half an hour and contain no more than ten websites. That is, a session is considered ended either when the user has visited ten websites or when the session has lasted more than thirty minutes.
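
The session rule can be sketched as follows. This is a hypothetical helper (`split_into_sessions` is not part of the competition code), and it assumes the thirty-minute limit is measured from the first visit in the session; the organizers' exact procedure may differ.

```python
from datetime import datetime, timedelta

def split_into_sessions(visits, max_sites=10, max_duration=timedelta(minutes=30)):
    """Split a time-ordered list of (site_id, timestamp) pairs into sessions.

    A new session starts when the current one already holds `max_sites`
    visits or when the next visit falls more than `max_duration` after
    the first visit of the current session (an assumption of this sketch).
    """
    sessions, current = [], []
    for site, ts in visits:
        if current and (len(current) >= max_sites
                        or ts - current[0][1] > max_duration):
            sessions.append(current)
            current = []
        current.append((site, ts))
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2014, 1, 1, 9, 0)
# Twelve visits one minute apart: the site limit closes the first session
dense = [(i, t0 + timedelta(minutes=i)) for i in range(12)]
dense_lengths = [len(s) for s in split_into_sessions(dense)]
# Three visits with a long pause: the time limit closes the first session
sparse = [(1, t0), (2, t0 + timedelta(minutes=5)), (3, t0 + timedelta(minutes=40))]
sparse_lengths = [len(s) for s in split_into_sessions(sparse)]
print(dense_lengths, sparse_lengths)
```

Sessions shorter than ten sites, like the second and third ones here, are what produce the empty values in the table.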

There are some empty values in the table, which means that some sessions contain fewer than ten websites. These empty values are recoded as described in the next section, and the site columns are converted to a uniform type.

### Data engineering

Missing data is coded as follows:
- `-1` for undefined site features.
- `0` (the Unix epoch) for undefined time features.
- `-1` for undefined date (day, dom, week, mon) features.
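
As a minimal sketch of this coding scheme, the snippet below applies it to a two-row toy frame (the values are made up, and only two of the ten site/time pairs are shown). It is a simplified stand-in for the cells that follow, not a replacement for them.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed values): the second session has only one visited site
toy = pd.DataFrame({
    'site1': [56, 27],
    'time1': pd.to_datetime(['2014-02-20 10:02:45', '2014-02-22 11:19:50']),
    'site2': [55.0, np.nan],
    'time2': pd.to_datetime(['2014-02-20 10:02:46', None]),
})
site_cols = ['site1', 'site2']
time_cols = ['time1', 'time2']

# Undefined sites become -1; ids are then stored as strings
for c in site_cols:
    toy[c] = toy[c].fillna(-1).astype(int).astype(str)

# Undefined timestamps become the Unix epoch (timestamp 0)
epoch = pd.Timestamp(0)
for c in time_cols:
    toy[c] = toy[c].fillna(epoch)

# Date-derived features are set to -1 wherever the timestamp is the placeholder
hour2 = toy['time2'].dt.hour.where(toy['time2'] != epoch, -1)

print(toy)
print(hour2.tolist())
```

Using a sentinel such as `-1` keeps "no visit" distinguishable from every real site id and every real hour or weekday value.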

In [0]:
# Fill NA values in site1, ..., site10 with -1, then convert to integer and finally to string
site_feature_names = [x for x in train_df.columns if 'site' in x]
for df in (train_df, test_df):
    df[site_feature_names] = df[site_feature_names].fillna(-1).astype('int').astype('str')
print(train_df[site_feature_names].info())
train_df[site_feature_names].head()

The site features built below are later checked against the sites present in the test set, so this code cannot be moved into the pipeline.

In [0]:
%%time
# Add siteMN... features
site_feature_names = [x for x in train_df.columns if 'site' in x]
for df in (train_df, test_df):
    for l in range(2,len(site_feature_names)):
        for i in range(len(site_feature_names)-l+1):
            print('Progress: %d/%d, %d/%d\r' % (l,len(site_feature_names)-1,i,len(site_feature_names)-l), end='')
            df['site' + str(i) + str(l)] = df[site_feature_names[i:i+l]].apply(
                lambda x: '_'.join(x.values) if '-1' not in x.values else '-1', axis=1)

site_feature_names = [x for x in train_df.columns if 'site' in x]
print(train_df[site_feature_names].info())
train_df[site_feature_names].head()
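
The cell above adds one column per window of consecutive sites, named `site` + start index + window length, and marks any window that touches a missing site as `'-1'`. A minimal sketch of the same idea on a toy frame (made-up ids, three site columns instead of ten, and a hypothetical helper `site_ngrams`):

```python
import pandas as pd

# Toy sessions (assumed values); '-1' marks a missing site
toy = pd.DataFrame({
    'site1': ['3', '7'],
    'site2': ['5', '-1'],
    'site3': ['8', '-1'],
})

def site_ngrams(row, n):
    """Join every window of n consecutive sites with '_'.

    A window that contains the '-1' placeholder is itself coded as '-1'.
    """
    out = []
    for i in range(len(row) - n + 1):
        window = row[i:i + n]
        out.append('_'.join(window) if '-1' not in window else '-1')
    return out

bigrams = [site_ngrams(list(r), 2) for r in toy.values]
print(bigrams)
```

Such n-grams let a linear model pick up on characteristic transitions between sites rather than on individual sites alone.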

In [0]:
# Fix timeN columns
time_feature_names = [x for x in train_df.columns if 'time' in x]
for df in (train_df, test_df):
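    # Missing timestamps become the epoch (1970-01-01); later cells treat that value as "undefined"
    df[time_feature_names] = df[time_feature_names].fillna(0).astype('datetime64[ns]')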

print(train_df[time_feature_names].info())
train_df[time_feature_names].head()

In [0]:
# Add dur, durN columns
for df in (train_df, test_df):
    df.drop([x for x in df.columns if 'dur' in x], axis=1, inplace=True)

for df in (train_df, test_df):
    df['timex'] = pd.to_datetime(0)
time_feature_names = [x for x in train_df.columns if 'time' in x]
for df in (train_df, test_df):
    for i,k in [(x,x+1) for x in range(len(time_feature_names)-1)]:
        df['dur' + str(i)] = (df[time_feature_names[k]] - df[time_feature_names[i]]).astype('timedelta64[s]').astype('int')
        df.loc[df[time_feature_names[k]].astype('int') == 0, 'dur' + str(i)] = 0
# Both frames share the same dur columns, so take the names from train_df explicitly
dur_feature_names = [x for x in train_df.columns if 'dur' in x]
for df in (train_df, test_df):
    impute_dur = int(df[dur_feature_names].median().median())
    df['dur'] = df[dur_feature_names].sum(axis=1)
    for i,k in [(x,x+1) for x in range(len(time_feature_names)-1)]:
        df.loc[df[time_feature_names[k]].astype('int') == 0, 'dur' + str(i)] = impute_dur
for df in (train_df, test_df):
    df.drop('timex', axis=1, inplace=True)

feature_names = [x for x in train_df.columns if 'dur' in x]
print(train_df[feature_names].info())
train_df[feature_names].head()

In [0]:
# Add dayN, hourN, minN, monN, domN, weekN columns
for df in (train_df, test_df):
    df.drop([x for x in df.columns if 'cat' in x], axis=1, inplace=True)

time_feature_names = [x for x in train_df.columns if 'time' in x]
for df in (train_df, test_df):
    for i in range(len(time_feature_names)):
        df['cat_day' + str(i)] = df[time_feature_names[i]].dt.dayofweek
        df['cat_hour' + str(i)] = df[time_feature_names[i]].dt.hour
        df['cat_min' + str(i)] = df[time_feature_names[i]].dt.minute
        df['cat_mon' + str(i)] = df[time_feature_names[i]].dt.month
        df['cat_dom' + str(i)] = df[time_feature_names[i]].dt.day
        df['cat_week' + str(i)] = df[time_feature_names[i]].dt.week
        cat_feature_names = [x for x in df.columns if 'cat' in x and str(i) in x]
        df.loc[df[time_feature_names[i]].astype('int') == 0, cat_feature_names] = -1

cat_feature_names = [x for x in train_df.columns if 'cat' in x]
print(train_df[cat_feature_names].info())
train_df[cat_feature_names].head()

In [0]:
print(train_df.info())
train_df.head()

In [0]:
print(test_df.info())
test_df.head()

In [0]:
site_feature_names = [x for x in train_df.columns if 'site' in x]
train_site_ids = set(train_df[site_feature_names].values.ravel())
test_site_ids = set(test_df[site_feature_names].values.ravel())
len(train_site_ids), len(test_site_ids), len(test_site_ids - train_site_ids), \
    len(test_site_ids & train_site_ids), len(train_site_ids | test_site_ids)

Only sites that occur in both the training and the test set are used.

In [0]:
# Site values are strings at this point, so the missing-value marker is '-1', not -1
common_site_ids = [x for x in train_site_ids & test_site_ids if x != '-1']
len(common_site_ids)

For the very basic model, we will use only the websites visited in the session (and will not take timestamp features into account). The idea behind this data selection is: *Alice has her favorite sites, and the more often you see these sites in a session, the higher the probability that it is Alice's session, and vice versa.*
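
This "favorite sites" idea amounts to a bag-of-sites representation: each session becomes a vector of per-site visit counts. A minimal sketch with `CountVectorizer` on made-up sessions, assuming a small fixed vocabulary and a named identity analyzer in place of the lambda used later:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sessions (assumed values); '-1' marks a missing site
sessions = [['3', '5', '3', '-1'],
            ['7', '3', '-1', '-1']]

# Fixed vocabulary: site id -> column index; '-1' is deliberately left out
vocabulary = {'3': 0, '5': 1, '7': 2}

def identity_analyzer(tokens):
    """Each session is already a list of tokens, so return it unchanged."""
    return tokens

cv_toy = CountVectorizer(analyzer=identity_analyzer, vocabulary=vocabulary)
counts = cv_toy.transform(sessions).toarray()
print(counts)
```

Tokens outside the vocabulary, including the `'-1'` placeholder, are simply ignored, so padding does not produce a feature of its own.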

Let us prepare the data. The site features are `site1, site2, ... , site10` together with the n-gram columns derived from them; keep in mind that missing values in them were replaced with `-1`. The helper below prints the shape of the data as it passes through each pipeline step:

In [0]:
def pl_debug(X, text=''):
    print('%s:' % text, X.shape, '      ')
    return X

X_columns = test_df.columns

In [0]:
%%time

# Map each common site id to its column index in the count matrix
cv_voc = {site: idx for idx, site in enumerate(common_site_ids)}
cv = CountVectorizer(analyzer=lambda x: x, vocabulary=cv_voc)

# Debug
cv.fit_transform(train_df[[f for f in train_df.columns if 'site' in f]].values,
                 train_df['target'].values).shape

In [0]:
%%time

from itertools import combinations

def combinations_transformer(X):
    cs = []
    for i in range(2,X.shape[1]-1):
        cs += list(combinations(range(X.shape[1]), i))
    xx = [X]
    for i,c in enumerate(cs):
        print('Combinations progress: %d%% (%d/%d)\r' % (i*100/len(cs), i, len(cs)), end='')
        xx += [np.array([
            '_'.join([str(int(x)) for x in v]) \
            if -1 not in v else -1 \
            for v in X[:,np.array(c)]]).reshape(X.shape[0],1)]
    X = np.hstack(xx)
    return X

def make_comb_transformer():
    return FunctionTransformer(combinations_transformer, validate=True, accept_sparse=True)

# Debug
X_debug_comb = make_comb_transformer().fit_transform(train_df[[f for f in train_df.columns if 'cat' in f and '9' in f]].values, train_df['target'].values)
X_debug_comb.shape

In [0]:
%%time

def ohe_cleanup(X, ohe):
    idxs = np.array([i for i,c in enumerate(np.hstack(ohe.categories_)) if '-1' not in str(c)])
    return X[:,idxs]

def make_ohe_transformer(i):
    ohe_int = OneHotEncoder(dtype='int8', categories='auto')
    ohe_pl = Pipeline([
        ('d-ohe-%d-1' % i, FunctionTransformer(pl_debug, validate=False, kw_args={'text': 'Enter ohe_int%d' % i})),
        ('ohe_int%d' % i, ohe_int),
        ('d-ohe-%d-2' % i, FunctionTransformer(pl_debug, validate=False, kw_args={'text': 'Enter ohe_cleanup%d' % i})),
        ('ohe_cleanup%d' % i, FunctionTransformer(ohe_cleanup, validate=False, kw_args={'ohe': ohe_int})),
        ('d-ohe-%d-3' % i, FunctionTransformer(pl_debug, validate=False, kw_args={'text': 'Exit ohe_cleanup%d' % i})),
    ])
    return ohe_pl

# Debug
make_ohe_transformer(42).fit_transform(X_debug_comb, train_df['target'].values).shape

In [0]:
%%time

def make_cat_transformer(nums, columns):
    ts = []
    for i in nums:
        ts += [
            ('cat%d' % i, Pipeline([
                ('d-cat%d-1' % i, FunctionTransformer(pl_debug, validate=False,
                                                      kw_args={'text': 'Enter comb%d' % i})),
                ('comb%d' % i, make_comb_transformer()),
                ('d-cat%d-2' % i, FunctionTransformer(pl_debug, validate=False,
                                                      kw_args={'text': 'Enter ohe%d' % i})),
                ('ohe%d' % i, make_ohe_transformer(i)),
                ('d-cat%d-3' % i, FunctionTransformer(pl_debug, validate=False,
                                                      kw_args={'text': 'Exit ohe%d' % i})),
            ]), [k for k,f in enumerate(columns) if 'cat' in f and str(i) in f]),
        ]
    ct = ColumnTransformer(ts, remainder='passthrough', n_jobs=1, verbose=True)
    return ct

# Debug
cat_feature_names = [f for f in train_df.columns if 'cat' in f]
make_cat_transformer(range(10), cat_feature_names).fit_transform(train_df[cat_feature_names].values, train_df['target'].values)

In [0]:
m = Pipeline([
    ('d1', FunctionTransformer(pl_debug, validate=False, kw_args={'text': 'Enter ct'})),
    # Build the feature matrix: site counts, one-hot encoded date features, raw durations
    ('ct', ColumnTransformer([
        ('cv', cv, [i for i,f in enumerate(X_columns) if 'site' in f]),
        ('cat', make_cat_transformer(range(10), [f for f in X_columns if 'cat' in f]), [i for i,f in enumerate(X_columns) if 'cat' in f]),
        ('dur', 'passthrough', [i for i,f in enumerate(X_columns) if 'dur' in f]),
    ], n_jobs=-1, verbose=True)),
    ('d3', FunctionTransformer(pl_debug, validate=False, kw_args={'text': 'Enter fs'})),
    # Keep at most 600 features, ranked by the coefficients of a logistic regression
    ('fs', SelectFromModel(
        LogisticRegression(random_state=17, solver='lbfgs', max_iter=10000, verbose=3),
        max_features=600)
    ),
    ('d4', FunctionTransformer(pl_debug, validate=False, kw_args={'text': 'Enter logit'})),
    # Final classifier
    ('logit', LogisticRegression(random_state=17, solver='lbfgs', max_iter=100, C=4, verbose=3)),
    ('dx', FunctionTransformer(pl_debug, validate=False, kw_args={'text': 'Exit'})),
])

cross_val_score(m, train_df.drop('target', axis=1).values, train_df['target'].values,
                scoring='roc_auc', cv=5, n_jobs=1, verbose=3)

The baseline is **0.91252**

Strong baseline is **0.95965**

In [0]:
%%time
m.fit(train_df.drop('target', axis=1).values, train_df['target'].values)

In [0]:
%%time
y_test = m.predict_proba(test_df.values)[:,1]

In [0]:
# Function for writing predictions to a file
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)

# Write data to the file which could be submitted
write_to_submission_file(y_test, 'submission.csv')