<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Translated by [Eugene Mashkin](https://www.linkedin.com/in/eugene-mashkin-88490883/) (@emashkin). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose. This material is a translated version of the Capstone project (by the same author) from specialization "Machine learning and data analysis" by Yandex and MIPT. No solutions shared.

# <center> Capstone Project №1. User Identification Based on Visited Websites

In this project, we are going to tackle a user identification problem using their behavioral patterns on the Internet. It is a complicated and interesting problem that combines data analysis and behavioral psychology.
As an example, Yandex Company solves intruder detection problem based on the user's behavior patterns. In a nutshell, an intruder's pattern differs from an owner's one:

- the intruder might not delete emails right after they are read as the mailbox owner would
- the intruder might mark emails or even move the cursor differently
- etc.
 
Therefore, the intruder could be detected and thrown out from the mailbox, forcing the user to authenticate via SMS-code. This pilot project is described in the Habrahabr [article](https://habrahabr.ru/company/yandex/blog/230583/) (in Russian). 
 
Similar things are being developed in Google Analytics and the research community. You can find more on this topic by searching for "Traversal Pattern Mining" and "Sequential Pattern Mining".

<img src='../../img/stock-illustration-21546327-identification-de-l-utilisateur.jpg'>

In this project, we are going to solve a similar problem: identify a user given the sequence of websites visited by them. The main idea is following: Internet users have different patterns in visiting websites: some person might check mailbox first, then read the latest football news and, only after that, get down to business, another person might get to work right away. 

Our algorithm needs to analyze the sequence of websites consequently visited by a particular person and predict whether this person is Alice or an intruder (someone else). We will measure ROC AUC. Stay tuned until the end of the course to find out who Alice is.

We will use the data from ["A Tool for Classification of Sequential Data"](http://ceur-ws.org/Vol-1703/paper12.pdf) article. Even though we can't recommend this article (described methods are far from state-of-the-art, it's better to refer to the ["Frequent Pattern Mining"](http://www.charuaggarwal.net/freqbook.pdf) book and the latest ICDM articles), but the data is collected carefully and hence is of interest.

There are data from Blaise Pascal University's proxy-servers. It has a super simple structure. There is a file `user[USER_ID].csv` for each user (where [USER_ID] - is an id of a particular user). All website visits have the following format: <br><br>

<center><b>timestamp, visited website</b></center>

You can download the data using the link provided in the article. Data description could be found there as well. Using the data of all 3000 users is not necessary for this project. Data of 10 and 150 users will be enough. Capstone_user_identification archive [link](https://drive.google.com/open?id=1AU3M_mFPofbfhFQa_Bktozq_vFREkWJA) (~7 Mb, unziped data ~60 Mb).

In the final project, you'll face an issue that not all operations can be executed within a reasonable time (e.g. it's unlikely you will be able to perform cross-validation grid search over 100 combinations of random forest parameters on this data). Therefore we are going to use two sets of data:
- data of 10 users - we will use it to test and debug our code
- data of 150 users - it will be our main working data

The data has the following
- there are 10 `user[USER_ID].csv` files in `10users` directory
- the same for the `150users` directory - but there 150 files in total
- there is a toy example of 3 users data in the `3users` directory - use it to debug your preprocessing code, which you are supposed to develop further

# <center>Week 1. Data Preproessing

The fisrt part of the project is devoted to data preparation for further descriptive analysis and predictive model development. You need to write a code for data preprocessing and producing a single training set (initially the data is stored in separate files). Also we will learn sparse data format (`Scipy.sparse` matricies), which is well-suited for this purpose.


**Plan for the first week:**
 - Part 1. Training set preparation
 - Part 2. Working with sparse data format

**Your task**
1. Fill in the missing code in the provided notebook
2. Choose the answers in the [form](https://docs.google.com/forms/d/10yakW9zN85pTVnTo9uStzdAUlqj8nARwIhZxlT4stZk)

**You might find the following materials useful:**
- [Loops](https://www.datacamp.com/community/tutorials/loops-python-tutorial), [functions](https://www.datacamp.com/community/tutorials/functions-python-tutorial), [generators](https://www.learnpython.org/en/Generators), [list_comprehension](https://www.datacamp.com/community/tutorials/python-list-comprehension)
- [Reading and writing data from/to files in python](https://www.datacamp.com/community/tutorials/reading-writing-files-python)
- [Pandas Tutorial: DataFrames in Python](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python)
   
**Besides, we are going to use [`pickle`](https://docs.python.org/3/library/pickle.html), [`glob`](https://docs.python.org/3/library/glob.html) and `Scipy.sparse` ([`csr_matrix`](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.html) class) python libraries.**

Finally, let's list versions of all main libraries used in this project for better results reproducibility: NumPy, SciPy, Pandas, Matplotlib, Statsmodels and Scikit-learn. To do this use [watermark](https://github.com/rasbt/watermark) extension. It is recommended to use Open Data Science docker container with all  libraries and dependecies preinstalled (details are [here](https://github.com/Yorko/mlcourse.ai/wiki/Prerequisites:-Python,-math,-software,-and-DevOps)). 

In [1]:
# pip install watermark
%load_ext watermark

In [2]:
%watermark -v -m -p numpy,scipy,pandas,matplotlib,statsmodels,sklearn -g

CPython 3.6.4
IPython 7.1.1

numpy 1.14.2
scipy 1.0.0
pandas 0.22.0
matplotlib 2.1.2
statsmodels 0.8.0
sklearn 0.19.1

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 17.7.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
Git hash   : 8cdafe5c7f3f6586082c74f430c20d3460746d01


In [3]:
# Disable Anaconda warnings
import warnings
warnings.filterwarnings('ignore')
from glob import glob
import os
import pickle
# pip install tqdm
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

**Let's look at one of the files containing the data about websites visited by a particular user (user 31).**

In [4]:
# Change the path to data
PATH_TO_DATA = '../../data/capstone_user_identification'

In [5]:
user31_data = pd.read_csv(os.path.join(PATH_TO_DATA, 
                                       '10users/user0031.csv'))

In [6]:
user31_data.head()

Unnamed: 0,timestamp,site
0,2013-11-15 08:12:07,fpdownload2.macromedia.com
1,2013-11-15 08:12:17,laposte.net
2,2013-11-15 08:12:17,www.laposte.net
3,2013-11-15 08:12:17,www.google.com
4,2013-11-15 08:12:18,www.laposte.net


**Let's define the problem: identify a user given a session containing 10 consequentially visited websites. Each sample is a session of 10 websites consequentially visited by a particular user, features are indexes of these 10 websites (later, we will get a bag of websites applying bag-of-words approach). Target class is the user id.**

### <center>Toy example</center>
**Suppose there are only 2 users, and session length is 2 (websites).**

<center>user0001.csv</center>
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-031e">timestamp</th>
    <th class="tg-031e">site</th>
  </tr>
  <tr>
    <td class="tg-031e">00:00:01</td>
    <td class="tg-031e">vk.com</td>
  </tr>
  <tr>
    <td class="tg-yw4l">00:00:11</td>
    <td class="tg-yw4l">google.com</td>
  </tr>
  <tr>
    <td class="tg-031e">00:00:16</td>
    <td class="tg-031e">vk.com</td>
  </tr>
  <tr>
    <td class="tg-031e">00:00:20</td>
    <td class="tg-031e">yandex.ru</td>
  </tr>
</table>

<center>user0002.csv</center>
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-031e">timestamp</th>
    <th class="tg-031e">site</th>
  </tr>
  <tr>
    <td class="tg-031e">00:00:02</td>
    <td class="tg-031e">yandex.ru</td>
  </tr>
  <tr>
    <td class="tg-yw4l">00:00:14</td>
    <td class="tg-yw4l">google.com</td>
  </tr>
  <tr>
    <td class="tg-031e">00:00:17</td>
    <td class="tg-031e">facebook.com</td>
  </tr>
  <tr>
    <td class="tg-031e">00:00:25</td>
    <td class="tg-031e">yandex.ru</td>
  </tr>
</table>

Iterate through the first file and assign consequtive site_id-s (numbers) to each new incoming websites: 
- vk.com – gets site_id=1, 
- google.com – gets site_id=2, 
- etc. 
- Then, iterate through the second file. So we collect a bag of websites. 

You are supposed to get the following mapping:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-031e">site</th>
    <th class="tg-yw4l">site_id</th>
  </tr>
  <tr>
    <td class="tg-yw4l">vk.com</td>
    <td class="tg-yw4l">1</td>
  </tr>
  <tr>
    <td class="tg-yw4l">google.com</td>
    <td class="tg-yw4l">2</td>
  </tr>
  <tr>
    <td class="tg-yw4l">yandex.ru</td>
    <td class="tg-yw4l">3</td>
  </tr>
  <tr>
    <td class="tg-yw4l">facebook.com</td>
    <td class="tg-yw4l">4</td>
  </tr>
</table>

Then your training set should be **target class - user_id**:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-s6z2{text-align:center}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-hgcj{font-weight:bold;text-align:center}
.tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-hgcj">session_id</th>
    <th class="tg-hgcj">site1</th>
    <th class="tg-hgcj">site2</th>
    <th class="tg-amwm">user_id</th>
  </tr>
  <tr>
    <td class="tg-s6z2">1</td>
    <td class="tg-s6z2">1</td>
    <td class="tg-s6z2">2</td>
    <td class="tg-baqh">1</td>
  </tr>
  <tr>
    <td class="tg-s6z2">2</td>
    <td class="tg-s6z2">1</td>
    <td class="tg-s6z2">3</td>
    <td class="tg-baqh">1</td>
  </tr>
  <tr>
    <td class="tg-s6z2">3</td>
    <td class="tg-s6z2">3</td>
    <td class="tg-s6z2">2</td>
    <td class="tg-baqh">2</td>
  </tr>
  <tr>
    <td class="tg-s6z2">4</td>
    <td class="tg-s6z2">4</td>
    <td class="tg-s6z2">3</td>
    <td class="tg-baqh">2</td>
  </tr>
</table>


Here, the first sample is a session containing two websites visited by the first user (target=1). There are `vk.com` and `google.com` in it (site1=1 and site2=2). Four sessions in total. In this example sessions are not overlapping each other, i.e. each website belongs to only one session.

## Part 1. Training set preparation

Implement function *prepare_train_set*, which takes as an input:
- *path_to_csv_files* - path to the directory contains csv-files
- *session_length* - number of websites in a session

and returns two objects:
- DataFrame with rows corresponding to a unique user session (each of *session_length* websites) and *session_length* columns corresponding to the websites in the session. And there should be one more column - *user_id* which is target lass of a sample. Hence the DataFrame should have *session_length + 1* columns.
- Dictionary of websites' frequncies which has {'site_string': [site_id, site_freq]} structure, where `site_string` - original website name, `site_id` - a number the website got be encoded, `site_freq` - total number of website occurrences in all files among all users. For our recent toy example it is:
`{'vk.com': (1, 2), 'google.com': (2, 2), 'yandex.ru': (3, 3), 'facebook.com': (4, 1)}`

Details:
- You can find example of function's output below
- Use `glob` (or somthing similar) to iterate through file in directory. For certainty, sort the list of files lexicographically. It is handy to use `tqdm_notebook` (or just `tqdm` in case of python-script) to track the number of loop iterations done
- Create the dictionary of websites' frequencies (with the following structure: {'site_string': (site_id, site_freq)}) and fill it iterating through the file. Start with 1
- It is recommended to give smaller indices(numbers) to more frequent websites (least description principle)
- Don't do entity recognition, consider *google.com*, *http://www.google.com* and *www.google.com* as different websites (you can try entity recognition in scope of your individual work on the project)
- It's more likely the number of records in file is not divisible by *session_length*. In this case the last session is shorter. Fill empty values with zeros. I.e. if there 24 recods in a file and session_length is 10, then third sessions will contain 4 websites and its corresponding vector will be:<br> 
`[*site1_id*, *site2_id*, *site3_id*, *site4_id*, 0, 0, 0, 0, 0, 0, *user_id*]`
- Some sessions might be repeatations of previous ones - leave it as is, don't drop duplicates. If there are identical web_sites in two sessions, but these sessions belong to different users, leave it as is as well. It's natural data uncertainty.
- Don't keep the website with `site_id`=0 in the dictionary (after your function returns that dictionary).
- It took me less than 5 sec to process 150 files from `capstone_websites_data/150users/`, but, surely, it depends on implementation of the function and hardware you use. Frankly, it's more likely that your first implementation won't be the most efficient. As a next step you can profile your code (especially if you plan to run your code on 3000 users data). Also efficient implementation of that function will help you next week.

In [36]:
import operator
import re

def prepare_train_set(path_to_csv_files, session_length=10):
    ''' YOUR CODE IS HERE '''
    dict_sites = {}
    list_file = sorted(glob(os.path.join(path_to_csv_files,'*')))
    #print(list_file)
    for file in tqdm_notebook(list_file):
        df_site = pd.read_csv(file)
        for i in range(len(df_site)):
            key = df_site.iloc[i]['site']
            if key not in dict_sites:
                dict_sites[key] = 1
            else:
                dict_sites[key] += 1
    dict_id_freq = {}
    for counter, value in enumerate(sorted(dict_sites.items(), key=operator.itemgetter(1),reverse=True)):
        dict_id_freq[value[0]] = (counter+1, value[1])
#     print(dict_id_freq)    
    df_res = pd.DataFrame()
    for file in tqdm_notebook(list_file):
        df_site = pd.read_csv(file)
        df_site['site'] = df_site['site'].map(lambda x: dict_id_freq[x][0])
        df_site['row'] = df_site.index // session_length
        df_site['col'] = df_site.index % session_length
#         print(df_site)
        df_site_pivot = df_site.pivot(index='row', columns = 'col', values = 'site')
        df_site_pivot.columns = ['site' + str(i+1) for i in range(min(len(df_site),session_length))]
        df_site_pivot['user_id'] = int(re.findall(r'\d+', file)[-1])
#         print(df_site_pivot)
        df_res = df_res.append(df_site_pivot)
#         df_res.columns = df_site_pivot.columns
#     print(df_res)
    return df_res.fillna(0), dict_id_freq

In [20]:
df_test = pd.read_csv('../../data/capstone_user_identification/3users/user0001.csv')
df_test.head()

Unnamed: 0,timestamp,site
0,2013-11-15 09:28:17,vk.com
1,2013-11-15 09:33:04,oracle.com
2,2013-11-15 09:52:48,oracle.com
3,2013-11-15 11:37:26,geo.mozilla.org
4,2013-11-15 11:40:32,oracle.com


In [21]:
df_test['row'] = df_test.index // 10
df_test['col'] = df_test.index % 10
print(df_test)

              timestamp                 site  row  col
0   2013-11-15 09:28:17               vk.com    0    0
1   2013-11-15 09:33:04           oracle.com    0    1
2   2013-11-15 09:52:48           oracle.com    0    2
3   2013-11-15 11:37:26      geo.mozilla.org    0    3
4   2013-11-15 11:40:32           oracle.com    0    4
5   2013-11-15 11:40:34           google.com    0    5
6   2013-11-15 11:40:35  accounts.google.com    0    6
7   2013-11-15 11:40:37      mail.google.com    0    7
8   2013-11-15 11:40:40      apis.google.com    0    8
9   2013-11-15 11:41:35      plus.google.com    0    9
10  2013-11-15 12:40:35               vk.com    1    0
11  2013-11-15 12:40:37           google.com    1    1
12  2013-11-15 12:40:40           google.com    1    2
13  2013-11-15 12:41:35           google.com    1    3


In [22]:
df_test = df_test.pivot(index='row', columns='col', values='site')
df_test.columns = ['site' + str(i + 1) for i in range(10)]
df_test

Unnamed: 0_level_0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
row,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,vk.com,oracle.com,oracle.com,geo.mozilla.org,oracle.com,google.com,accounts.google.com,mail.google.com,apis.google.com,plus.google.com
1,vk.com,google.com,google.com,google.com,,,,,,


In [23]:
df_test

Unnamed: 0_level_0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
row,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,vk.com,oracle.com,oracle.com,geo.mozilla.org,oracle.com,google.com,accounts.google.com,mail.google.com,apis.google.com,plus.google.com
1,vk.com,google.com,google.com,google.com,,,,,,


In [37]:
train_data_toy, dict_id_freq = prepare_train_set(os.path.join(PATH_TO_DATA, '3users'), 
                                                     session_length=10)
train_data_toy
dict_id_freq

HBox(children=(IntProgress(value=0, max=3), HTML(value='')))




HBox(children=(IntProgress(value=0, max=3), HTML(value='')))




{'google.com': (1, 9),
 'oracle.com': (2, 8),
 'vk.com': (3, 3),
 'meduza.io': (4, 3),
 'mail.google.com': (5, 2),
 'football.kulichki.ru': (6, 2),
 'geo.mozilla.org': (7, 1),
 'accounts.google.com': (8, 1),
 'apis.google.com': (9, 1),
 'plus.google.com': (10, 1),
 'yandex.ru': (11, 1)}

In [38]:
train_data_toy

Unnamed: 0_level_0,site1,site10,site2,site3,site4,site5,site6,site7,site8,site9,user_id
row,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,3.0,10.0,2.0,2.0,7.0,2.0,1.0,8.0,5.0,9.0,1
1,3.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1
0,3.0,0.0,2.0,6.0,6.0,2.0,0.0,0.0,0.0,0.0,2
0,4.0,4.0,1.0,2.0,1.0,2.0,1.0,1.0,5.0,11.0,3
1,4.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,3


**Apply your function to the toy example and make sure that everything works fine:**

In [6]:
!cat $PATH_TO_DATA/3users/user0001.csv

timestamp,site
2013-11-15 09:28:17,vk.com
2013-11-15 09:33:04,oracle.com
2013-11-15 09:52:48,oracle.com
2013-11-15 11:37:26,geo.mozilla.org
2013-11-15 11:40:32,oracle.com
2013-11-15 11:40:34,google.com
2013-11-15 11:40:35,accounts.google.com
2013-11-15 11:40:37,mail.google.com
2013-11-15 11:40:40,apis.google.com
2013-11-15 11:41:35,plus.google.com
2013-11-15 12:40:35,vk.com
2013-11-15 12:40:37,google.com
2013-11-15 12:40:40,google.com
2013-11-15 12:41:35,google.com


In [7]:
!cat $PATH_TO_DATA/3users/user0002.csv

timestamp,site
2013-11-15 09:28:17,vk.com
2013-11-15 09:33:04,oracle.com
2013-11-15 09:52:48,football.kulichki.ru
2013-11-15 11:37:26,football.kulichki.ru
2013-11-15 11:40:32,oracle.com


In [8]:
!cat $PATH_TO_DATA/3users/user0003.csv

timestamp,site
2013-11-15 09:28:17,meduza.io
2013-11-15 09:33:04,google.com
2013-11-15 09:52:48,oracle.com
2013-11-15 11:37:26,google.com
2013-11-15 11:40:32,oracle.com
2013-11-15 11:40:34,google.com
2013-11-15 11:40:35,google.com
2013-11-15 11:40:37,mail.google.com
2013-11-15 11:40:40,yandex.ru
2013-11-15 11:41:35,meduza.io
2013-11-15 12:28:17,meduza.io
2013-11-15 12:33:04,google.com
2013-11-15 12:52:48,oracle.com


In [9]:
train_data_toy, site_freq_3users = prepare_train_set(os.path.join(PATH_TO_DATA, '3users'), 
                                                     session_length=10)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




In [10]:
train_data_toy

Unnamed: 0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,user_id
0,3,2,2,8,2,1,10,5,7,9,1
1,3,1,1,1,0,0,0,0,0,0,1
2,3,2,6,6,2,0,0,0,0,0,2
3,4,1,2,1,2,1,1,5,11,4,3
4,4,1,2,0,0,0,0,0,0,0,3


**Websites' frequencies (second element of a tuple) should be exactly the same. Numeration (websites order) (first elements of tuples) may vary.**

In [11]:
site_freq_3users

{'google.com': (1, 9),
 'oracle.com': (2, 8),
 'vk.com': (3, 3),
 'meduza.io': (4, 3),
 'mail.google.com': (5, 2),
 'football.kulichki.ru': (6, 2),
 'apis.google.com': (7, 1),
 'geo.mozilla.org': (8, 1),
 'plus.google.com': (9, 1),
 'accounts.google.com': (10, 1),
 'yandex.ru': (11, 1)}

**Apply your function to the data with 10 users..**

**<font color='red'> Question 1. </font>How many sessions with length of 10 websites are there in 10users data?**

In [39]:
train_data_10users, site_freq_10users = prepare_train_set(os.path.join(PATH_TO_DATA, '10users'), 
                                                     session_length=10)#''' YOUR CODE IS HERE '''
print(len(train_data_10users))

HBox(children=(IntProgress(value=0, max=10), HTML(value='')))




HBox(children=(IntProgress(value=0, max=10), HTML(value='')))


14061


**<font color='red'> Questioin 2. </font>How many unique websites are there in 10users data?**

In [40]:
''' YOUR CODE IS HERE '''
print(len(site_freq_10users))

4913


**Apply your function to the data with 150 users.**

**<font color='red'> Question 3. </font> How many sessions with length of 10 websites are there in 150users data?**

In [42]:
%%time
train_data_150users, site_freq_150users = prepare_train_set(os.path.join(PATH_TO_DATA, '150users'), 
                                                     session_length=10)#''' YOUR CODE IS HERE '''
print(len(train_data_150users))

HBox(children=(IntProgress(value=0, max=150), HTML(value='')))




HBox(children=(IntProgress(value=0, max=150), HTML(value='')))


137019
CPU times: user 2min 25s, sys: 1.05 s, total: 2min 26s
Wall time: 2min 26s


**<font color='red'> Question 4. </font> How many unique websites are there in 150users data?**

In [43]:
''' YOUR CODE IS HERE '''
print(len(site_freq_150users))

27797


**<font color='red'> Question 5. </font> 
Which of these websites is <font color='red'> NOT </font> in top-10 most visited websites among 150 users?**
- www.google.fr
- www.youtube.com
- safebrowsing-cache.google.com
- www.linkedin.com

In [44]:
''' YOUR CODE IS HERE '''
print(site_freq_150users)

{'www.google.fr': (1, 64785), 'www.google.com': (2, 51320), 'www.facebook.com': (3, 39002), 'apis.google.com': (4, 29983), 's.youtube.com': (5, 29102), 'clients1.google.com': (6, 25087), 'mail.google.com': (7, 19072), 'plus.google.com': (8, 18467), 'safebrowsing-cache.google.com': (9, 17960), 'www.youtube.com': (10, 16319), 'twitter.com': (11, 16219), 'platform.twitter.com': (12, 15317), 's-static.ak.facebook.com': (13, 15048), 'accounts.google.com': (14, 13855), 'www.bing.com': (15, 13797), 'static.ak.facebook.com': (16, 13117), 'i1.ytimg.com': (17, 13117), 'download.jboss.org': (18, 11740), 'api.twitter.com': (19, 9350), 'safebrowsing.clients.google.com': (20, 8981), 'r1---sn-gxo5uxg-jqbe.googlevideo.com': (21, 8579), 'fr.openclassrooms.com': (22, 8100), 'ajax.googleapis.com': (23, 7811), 'r3---sn-gxo5uxg-jqbe.googlevideo.com': (24, 7482), 'drive.google.com': (25, 7341), 'r2---sn-gxo5uxg-jqbe.googlevideo.com': (26, 7053), 'r4---sn-gxo5uxg-jqbe.googlevideo.com': (27, 7039), 's.ytimg.c

**Write dataframes to csv files for further analysis.**

In [None]:
train_data_10users.to_csv(os.path.join(PATH_TO_DATA, 
                                       'train_data_10users.csv'), 
                        index_label='session_id', float_format='%d')
train_data_150users.to_csv(os.path.join(PATH_TO_DATA, 
                                        'train_data_150users.csv'), 
                         index_label='session_id', float_format='%d')

## Part 2. Working with sparse data format

If you think carefully, the features we've got - *site1*, ..., *site10* - won't work at all in a classification problem. But if we use the Bag of Words idea from text analysis - it's another story. Create new matrices with sessions as rows, and site_id as columns. The intersection of row $i$ and column $j$ is $n_{ij}$ - the number of website $j$ occurrences in session $i$. We are going to do this using sparse Scipy matrices - [csr_matrix](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csr_matrix.html). Read the docs, figure out how to use sparse matrices and create such matrices for our data. First, debug your code on toy example and then apply it to 10users and 150users data.

Note, that in short sessions (less than 10 websites) there are trailing zeros, hence the meaning of the first feature (number of 0 occurrences in a session) differs from others (number of $i$-th website occurrences in a session). **Therefore you need to drop the first column in a Dataframe.**

In [None]:
X_toy, y_toy = train_data_toy.iloc[:, :-1].values, train_data_toy.iloc[:, -1].values

In [None]:
X_toy

In [None]:
X_sparse_toy = csr_matrix ''' YOUR CODE IS HERE '''   

**Number of columns in `X_sparse_toy` should be 11, because in the toy example 3 users visited 11 unique websites.**

In [None]:
X_sparse_toy.todense()

In [None]:
X_10users, y_10users = train_data_10users.iloc[:, :-1].values, \
                       train_data_10users.iloc[:, -1].values
X_150users, y_150users = train_data_150users.iloc[:, :-1].values, \
                         train_data_150users.iloc[:, -1].values

In [None]:
X_sparse_10users = ''' YOUR CODE IS HERE '''
X_sparse_150users = ''' YOUR CODE IS HERE '''

**Save these sparse matrices with [pickle](https://docs.python.org/3/library/pickle.html) (serialization in Python), and *y_10users, y_150users* - target variables (user_id-s) for our 10users and 150users data. The fact that the names of these matrices start with X and y implies that we are going to check our first classification models on them.
Finally, save frequency dictionaries for 3users, 10users and 150users data.**

In [None]:
with open(os.path.join(PATH_TO_DATA, 'X_sparse_10users.pkl'), 'wb') as X10_pkl:
    pickle.dump(X_sparse_10users, X10_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'y_10users.pkl'), 'wb') as y10_pkl:
    pickle.dump(y_10users, y10_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'X_sparse_150users.pkl'), 'wb') as X150_pkl:
    pickle.dump(X_sparse_150users, X150_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'y_150users.pkl'), 'wb') as y150_pkl:
    pickle.dump(y_150users, y150_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'site_freq_3users.pkl'), 'wb') as site_freq_3users_pkl:
    pickle.dump(site_freq_3users, site_freq_3users_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'site_freq_10users.pkl'), 'wb') as site_freq_10users_pkl:
    pickle.dump(site_freq_10users, site_freq_10users_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'site_freq_150users.pkl'), 'wb') as site_freq_150users_pkl:
    pickle.dump(site_freq_150users, site_freq_150users_pkl, protocol=2)

**Just in case doublecheck that number of columns in sparse matrices `X_sparse_10users` and `X_sparse_150users` equals to the number of unique websites in 10users and 150users data evaluated earlier.**

In [None]:
assert X_sparse_10users.shape[1] == len(site_freq_10users)

In [None]:
assert X_sparse_150users.shape[1] == len(site_freq_150users)

Next week we will preprocess the data a bit and check first hypotheses regarding our observations.