# Illustration of splitting the Netflix dataset into train/validation/test
<blockquote>
    <p>The feature preprocessing for the Netflix dataset used here is factored out into a
       <a href="https://github.com/AlgoMathITMO/sber-simulator/blob/experiments-new/experiments/Netflix/netflix.py" title="netflix.py">file</a>,
       and its operation is illustrated in a <a href="https://github.com/AlgoMathITMO/sber-simulator/blob/experiments-new/experiments/Netflix/netflix_processing.ipynb" 
       title="netflix_processing">notebook</a>.</p>
</blockquote>

### $\textbf{Contents}$:

### $\textbf{I. Loading and preparing the data}$
#### - Reading the data from disk;
#### - Encoding movie and user ids as integer identifiers;
---

### $\textbf{II. Splitting the data for the experiment}$
### To split the data into $\it{train/val/test}$, the source dataset *df_rating* is divided by the quantiles $\mathbb{q}$ of the $\it{timestamp}$ attribute. For feature generation, cumulative windows are used:
#### $\it{rating}_{t}$ = *df_rating*$[0, \mathbb{q}_{t}]$, where $\mathbb{q}_{train}=0.5$, $\mathbb{q}_{val}=0.75$, $\mathbb{q}_{test}=1$:
#### - $\it{rating}_{train}$ = *df_rating*$[0, 0.5]$;
#### - $\it{rating}_{val}$ = *df_rating*$[0, 0.75]$;
#### - $\it{rating}_{test}$ = *df_rating*$[0, 1]$;
### Next, for each of the intervals {$\it{rating}_{train}$, $\it{rating}_{val}$, $\it{rating}_{test}$}, the corresponding user and item features are generated (following this [example](https://github.com/AlgoMathITMO/sber-simulator/blob/experiments-new/experiments/Netflix/netflix_processing.ipynb "netflix_processing")):
#### - $\it{items}_{t}$, $\it{users}_{t}$, $\it{rating}_{t}$ = data_processing(movies, $\it{rating}_{t}$, tags), $t \in \{\it{train}, \it{val}, \it{test}\}$;
### The final ratings are then formed:
#### - $\it{rating}_{train}$ = $\it{rating}_{train}$ = *df_rating*$[0, 0.5]$;
#### - $\it{rating}_{val}$ = $\it{rating}_{val}$[$\mathbb{q}>\mathbb{q}_{train}$] = *df_rating*$(0.5, 0.75]$;
#### - $\it{rating}_{test}$ = $\it{rating}_{test}$[$\mathbb{q}>\mathbb{q}_{val}$] = *df_rating*$(0.75, 1]$;

<blockquote>
    <p>That is, while the features for the validation set are generated from timestamps in the 0 to 0.75 quantile range, only the ratings from the 0.5 to 0.75 quantile range
       are used as ratings. The same holds for the test set: all timestamps are used for feature generation, but only the ratings from the 0.75 to 1 quantile range
       are used as ratings.</p>
</blockquote>
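
The scheme above can be sketched on a toy frame. This is a minimal illustration rather than the notebook's own code: the helper name `split_by_quantiles` and the toy data are hypothetical, and only the `timestamp` column name and the quantile levels 0.5 and 0.75 are taken from the text.

```python
import pandas as pd


def split_by_quantiles(df, q_train=0.5, q_val=0.75, col="timestamp"):
    """Return cumulative feature windows and disjoint rating windows.

    Hypothetical helper, for illustration only.
    """
    t_train = df[col].quantile(q_train)
    t_val = df[col].quantile(q_val)

    # Cumulative windows used for feature generation: [0, q_t].
    feature_windows = {
        "train": df[df[col] <= t_train],
        "val": df[df[col] <= t_val],
        "test": df,
    }
    # Disjoint windows used as final ratings: [0, 0.5], (0.5, 0.75], (0.75, 1].
    rating_windows = {
        "train": df[df[col] <= t_train],
        "val": df[(df[col] > t_train) & (df[col] <= t_val)],
        "test": df[df[col] > t_val],
    }
    return feature_windows, rating_windows


# Toy data: 8 ratings with distinct, increasing timestamps.
toy = pd.DataFrame({"timestamp": range(1, 9), "rating": [3, 5, 4, 4, 3, 2, 5, 1]})
features, ratings = split_by_quantiles(toy)

for name in ("train", "val", "test"):
    print(name, len(features[name]), len(ratings[name]))
```

The feature windows are nested (each contains the previous one), while the rating windows do not overlap, so no rating used for evaluation on one split is counted as a target on another.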
<hr>

In [2]:
import pandas as pd
import numpy as np
import re
import itertools
import tqdm

from netflix import data_processing

[nltk_data] Downloading package stopwords to
[nltk_data]     /data/home/agurov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /data/home/agurov/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /data/home/agurov/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### I. Loading and preparing the data

In [3]:
MOVIE_PATH = r'./data_clean/movies.csv'
RATING_PATH = r'./data_clean/rating.csv'

In [4]:
df_movies = pd.read_csv(MOVIE_PATH)
df_rating = pd.read_csv(RATING_PATH)

#### Encoding movie and user ids as integer identifiers

In [5]:
# Map each original movie id to its integer category code
cat_dict_movies = pd.Series(df_movies.movie_Id.astype("category").cat.codes.values, index=df_movies.movie_Id).to_dict()
# Same for user ids, built from the unique users present in the ratings
cat_dict_users = pd.Series(df_rating.user_Id.drop_duplicates().astype("category").cat.codes.values, index=df_rating.user_Id.drop_duplicates()).to_dict()

# Replace the original ids with the integer codes in both tables
df_movies.movie_Id = df_movies.movie_Id.apply(lambda x: cat_dict_movies[x])
df_rating.movie_Id = df_rating.movie_Id.apply(lambda x: cat_dict_movies[x])
df_rating.user_Id = df_rating.user_Id.apply(lambda x: cat_dict_users[x])
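
As a toy illustration of the encoding used in the cell above: `astype("category").cat.codes` numbers the unique values in their sorted order, so the resulting codes are dense integers starting at 0. The raw ids below are made up, and `Series.map` is used here only for brevity.

```python
import pandas as pd

# Made-up raw ids; only the technique mirrors the notebook cell.
raw_user_ids = pd.Series([1007, 42, 1007, 9001, 42])

unique_ids = raw_user_ids.drop_duplicates()
# Category codes follow the sorted order of the unique values.
codes = unique_ids.astype("category").cat.codes.values
user_to_code = pd.Series(codes, index=unique_ids).to_dict()

# Replace every raw id with its integer code.
encoded = raw_user_ids.map(user_to_code)
print(user_to_code)
print(encoded.tolist())
```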

In [6]:
df_movies.head()

Unnamed: 0,movie_Id,rating_cnt,rating_avg,year,title
0,0,547,3.749543,2003,Dinosaur Planet
1,1,145,3.558621,2004,Isle of Man TT 2004 Review
2,2,2012,3.641153,1997,Character
3,3,142,2.739437,1994,Paula Abdul's Get Up & Dance
4,4,1140,3.919298,2004,The Rise and Fall of ECW


In [7]:
df_rating.head()

Unnamed: 0,movie_Id,user_Id,rating,timestamp
0,0,270045,3,1104970000.0
1,0,149546,5,1105574000.0
2,0,160878,4,1106093000.0
3,0,5466,4,1106698000.0
4,0,149791,3,1073088000.0


### II. Splitting the data for the experiment

### Splitting df_rating into train/validation/test parts by timestamp quantiles:
####  - train [0, 0.5]
####  - validation [0, 0.75]
####  - test [0, 1.]

In [8]:
QUANTILES = [0.5, 0.75]
df_rating = df_rating.sort_values(by='timestamp').reset_index(drop=True)
quantiles_values = [df_rating.timestamp.quantile(i) for i in QUANTILES]
quantiles_values

[1104624000.0, 1105833600.0]
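
To see where the two cut points fall on the calendar, the printed values can be converted to dates. This sketch assumes the timestamps are Unix epoch seconds in UTC, which the notebook does not state explicitly.

```python
import pandas as pd

# Quantile boundaries printed by the cell above.
quantiles_values = [1104624000.0, 1105833600.0]

# Assumption: timestamps are Unix epoch seconds in UTC.
boundaries = pd.to_datetime(quantiles_values, unit="s", utc=True)
for level, moment in zip([0.5, 0.75], boundaries):
    print(f"q={level}: {moment.date()}")
```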

In [9]:
df_rating_train = df_rating[df_rating.timestamp <= quantiles_values[0]]
print(f"DataFrame size: {df_rating_train.shape}")
df_rating_train.head(5)

DataFrame size: (50934106, 4)


Unnamed: 0,movie_Id,user_Id,rating,timestamp
0,5473,92898,2,915580800.0
1,3420,92898,3,915580800.0
2,16181,92898,5,915580800.0
3,9535,92898,5,915580800.0
4,14454,92898,3,915580800.0


In [10]:
df_rating_val = df_rating[df_rating.timestamp <= quantiles_values[1]]
print(f"DataFrame size: {df_rating_val.shape}")
df_rating_val.head()

DataFrame size: (75364540, 4)


Unnamed: 0,movie_Id,user_Id,rating,timestamp
0,5473,92898,2,915580800.0
1,3420,92898,3,915580800.0
2,16181,92898,5,915580800.0
3,9535,92898,5,915580800.0
4,14454,92898,3,915580800.0


In [11]:
df_rating_test = df_rating.copy()
print(f"DataFrame size: {df_rating_test.shape}")
df_rating_test.head(5)

DataFrame size: (100480507, 4)


Unnamed: 0,movie_Id,user_Id,rating,timestamp
0,5473,92898,2,915580800.0
1,3420,92898,3,915580800.0
2,16181,92898,5,915580800.0
3,9535,92898,5,915580800.0
4,14454,92898,3,915580800.0


### Feature generation by time interval

#### Train data

In [12]:
data_train = data_processing(df_movies, df_rating_train, False)

  0%|          | 0/5094 [00:00<?, ?it/s]

In [13]:
df_items_train, df_users_train, df_rating_train = data_train

In [14]:
print(f"DataFrame size: {df_items_train.shape}")
df_items_train.head(5)

DataFrame size: (17751, 304)


Unnamed: 0,item_idx,rating_avg,rating_cnt,year,w2v_0,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,...,w2v_290,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299
0,0,3.65368,231,2003,0.0791626,0.141602,0.0653076,0.144043,-0.169922,-0.00537109,...,-0.170227,0.147827,-0.256836,0.119385,0.0618286,-0.0197067,-0.0344849,0.204407,0.0576172,-0.0251923
1,1,3.52381,21,2004,-0.0749512,0.0182292,-0.084554,0.0145671,0.0375163,-0.0357259,...,0.108398,-0.0535482,-0.0673828,0.0973307,-0.023112,-0.0393066,-0.146362,-0.221191,-0.023763,0.0696615
2,2,3.650708,1483,1997,0.257812,-0.0258789,-0.00357056,0.0163574,-0.0544434,0.289062,...,-0.180664,0.208984,-0.235352,-0.283203,-0.188477,0.0142822,0.143555,-0.0393066,-0.120605,0.041748
3,3,2.861702,94,1994,0.0388997,-0.162272,-0.00537109,0.194661,-0.0142822,-0.00219727,...,-0.00626628,0.115234,-0.116211,-0.00423177,0.034078,0.0113932,0.0836182,-0.107096,0.0487976,0.0219727
4,4,4.296512,172,2004,-0.00842285,0.0577393,-0.145508,0.188477,-0.142578,-0.101318,...,0.00756836,-0.0305786,-0.166504,-0.0996094,-0.0799561,-0.325195,-0.132568,0.020874,0.1427,0.0246277


In [15]:
print(f"DataFrame size: {df_users_train.shape}")
df_users_train.head(5)

DataFrame size: (365792, 303)


Unnamed: 0,user_idx,rating_avg,w2v_0,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,w2v_6,w2v_7,...,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299,rating_cnt
0,0,3.596215,0.043663,0.029384,0.002536,0.066195,-0.007591,0.013208,0.010071,-0.085206,...,0.002209,-0.110979,-0.005254,-0.017464,-0.041331,-0.039495,-0.04489,0.006869,0.016734,317
1,1,4.026432,0.031384,0.03794,-0.003657,0.057513,-0.015136,0.0098,0.013432,-0.081515,...,0.009463,-0.09147,-0.007832,-0.019469,-0.045427,-0.03511,-0.040313,0.012134,0.012192,227
2,3,3.386364,0.047955,0.027966,0.011197,0.062136,-0.020985,0.013125,0.011197,-0.089339,...,0.002407,-0.121533,-0.012659,-0.031243,-0.034244,-0.019378,-0.052706,0.005008,0.029358,176
3,4,3.0,0.034546,-0.086487,-0.084869,0.079918,-0.053955,0.007385,0.051407,-0.030701,...,0.023674,-0.08606,-0.137726,-0.078156,0.042938,0.044037,-0.190643,0.012085,-0.061989,2
4,5,3.8,0.003311,-0.008862,-0.051199,0.071934,-0.026351,-0.029214,0.025361,-0.078089,...,0.035499,-0.090897,-0.039709,0.002157,-0.025337,-0.06733,-0.010834,-0.014924,-0.018644,10


In [16]:
print(f"DataFrame size: {df_rating_train.shape}")
df_rating_train.head(5)

DataFrame size: (50934106, 4)


Unnamed: 0,item_idx,user_idx,relevance,timestamp
0,5473,92898,2,915580800.0
1,3420,92898,3,915580800.0
2,16181,92898,5,915580800.0
3,9535,92898,5,915580800.0
4,14454,92898,3,915580800.0


#### Validation data

In [17]:
data_val = data_processing(df_movies, df_rating_val, False)

  0%|          | 0/7537 [00:00<?, ?it/s]

In [18]:
df_items_val, df_users_val, df_rating_val = data_val

In [19]:
print(f"DataFrame size: {df_items_val.shape}")
df_items_val.head(5)

DataFrame size: (17770, 304)


Unnamed: 0,item_idx,rating_avg,rating_cnt,year,w2v_0,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,...,w2v_290,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299
0,0,3.699482,386,2003,0.0791626,0.141602,0.0653076,0.144043,-0.169922,-0.00537109,...,-0.170227,0.147827,-0.256836,0.119385,0.0618286,-0.0197067,-0.0344849,0.204407,0.0576172,-0.0251923
1,1,3.519481,77,2004,-0.0749512,0.0182292,-0.084554,0.0145671,0.0375163,-0.0357259,...,0.108398,-0.0535482,-0.0673828,0.0973307,-0.023112,-0.0393066,-0.146362,-0.221191,-0.023763,0.0696615
2,2,3.649514,1749,1997,0.257812,-0.0258789,-0.00357056,0.0163574,-0.0544434,0.289062,...,-0.180664,0.208984,-0.235352,-0.283203,-0.188477,0.0142822,0.143555,-0.0393066,-0.120605,0.041748
3,3,2.77193,114,1994,0.0388997,-0.162272,-0.00537109,0.194661,-0.0142822,-0.00219727,...,-0.00626628,0.115234,-0.116211,-0.00423177,0.034078,0.0113932,0.0836182,-0.107096,0.0487976,0.0219727
4,4,4.017107,643,2004,-0.00842285,0.0577393,-0.145508,0.188477,-0.142578,-0.101318,...,0.00756836,-0.0305786,-0.166504,-0.0996094,-0.0799561,-0.325195,-0.132568,0.020874,0.1427,0.0246277
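
The `rating_avg` and `rating_cnt` columns differ between the train and validation item tables (for `item_idx` 0, `rating_cnt` is 231 on train and 386 on validation), which is consistent with statistics being recomputed over each cumulative window. The actual logic lives in `netflix.py` and is not shown in this notebook; the sketch below is only an assumption about how such per-item aggregates could be obtained with a `groupby`, on made-up data. The helper `item_stats` is hypothetical.

```python
import pandas as pd

# Made-up ratings; column names follow the processed rating tables.
ratings = pd.DataFrame({
    "item_idx": [0, 0, 1, 0, 1, 1],
    "relevance": [3, 5, 4, 4, 2, 3],
    "timestamp": [10, 20, 30, 40, 50, 60],
})


def item_stats(window):
    """Per-item mean rating and rating count over one time window."""
    return (
        window.groupby("item_idx")["relevance"]
        .agg(rating_avg="mean", rating_cnt="count")
        .reset_index()
    )


# Cumulative windows: the later one contains the earlier one.
stats_train = item_stats(ratings[ratings.timestamp <= 30])
stats_val = item_stats(ratings[ratings.timestamp <= 50])
print(stats_train)
print(stats_val)
```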


In [20]:
print(f"DataFrame size: {df_users_val.shape}")
df_users_val.head(5)

DataFrame size: (460689, 303)


Unnamed: 0,user_idx,rating_avg,w2v_0,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,w2v_6,w2v_7,...,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299,rating_cnt
0,0,3.501149,0.048998,0.03132,-0.000489,0.068118,-0.01529,0.015356,0.011137,-0.083811,...,0.006543,-0.110249,-0.003422,-0.025799,-0.042322,-0.036391,-0.047147,0.007338,0.02451,435
1,1,4.064885,0.032301,0.036464,-0.001935,0.059937,-0.017538,0.005171,0.015526,-0.083217,...,0.00525,-0.091632,-0.010069,-0.015506,-0.044552,-0.035608,-0.038914,0.007991,0.016136,262
2,3,3.384615,0.046666,0.032125,0.011057,0.063338,-0.018786,0.008224,0.014202,-0.089903,...,0.000309,-0.118802,-0.008717,-0.028164,-0.037049,-0.022207,-0.049561,-0.001797,0.025735,208
3,4,4.142857,0.100987,0.023281,-0.029916,0.06929,0.048898,-0.033186,-0.021193,-0.034755,...,-0.037382,-0.187605,-0.101257,0.010559,0.045502,0.098319,-0.060224,0.078125,-0.071213,7
4,5,3.8,0.003311,-0.008862,-0.051199,0.071934,-0.026351,-0.029214,0.025361,-0.078089,...,0.035499,-0.090897,-0.039709,0.002157,-0.025337,-0.06733,-0.010834,-0.014924,-0.018644,10


In [21]:
df_rating_val = df_rating_val[df_rating_val.timestamp > quantiles_values[0]]
print(f"DataFrame size: {df_rating_val.shape}")
df_rating_val.head(5)

DataFrame size: (24430434, 4)


Unnamed: 0,item_idx,user_idx,relevance,timestamp
50934106,1019,102935,2,1104710000.0
50934107,4660,319145,5,1104710000.0
50934108,16235,253986,4,1104710000.0
50934109,14279,459914,4,1104710000.0
50934110,16264,221677,5,1104710000.0


#### Test data

In [22]:
data_test = data_processing(df_movies, df_rating_test, False)

  0%|          | 0/10049 [00:00<?, ?it/s]

In [23]:
df_items_test, df_users_test, df_rating_test = data_test

In [24]:
print(f"DataFrame size: {df_items_test.shape}")
df_items_test.head(5)

DataFrame size: (17770, 304)


Unnamed: 0,item_idx,rating_avg,rating_cnt,year,w2v_0,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,...,w2v_290,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299
0,0,3.749543,547,2003,0.0791626,0.141602,0.0653076,0.144043,-0.169922,-0.00537109,...,-0.170227,0.147827,-0.256836,0.119385,0.0618286,-0.0197067,-0.0344849,0.204407,0.0576172,-0.0251923
1,1,3.558621,145,2004,-0.0749512,0.0182292,-0.084554,0.0145671,0.0375163,-0.0357259,...,0.108398,-0.0535482,-0.0673828,0.0973307,-0.023112,-0.0393066,-0.146362,-0.221191,-0.023763,0.0696615
2,2,3.641153,2012,1997,0.257812,-0.0258789,-0.00357056,0.0163574,-0.0544434,0.289062,...,-0.180664,0.208984,-0.235352,-0.283203,-0.188477,0.0142822,0.143555,-0.0393066,-0.120605,0.041748
3,3,2.739437,142,1994,0.0388997,-0.162272,-0.00537109,0.194661,-0.0142822,-0.00219727,...,-0.00626628,0.115234,-0.116211,-0.00423177,0.034078,0.0113932,0.0836182,-0.107096,0.0487976,0.0219727
4,4,3.919298,1140,2004,-0.00842285,0.0577393,-0.145508,0.188477,-0.142578,-0.101318,...,0.00756836,-0.0305786,-0.166504,-0.0996094,-0.0799561,-0.325195,-0.132568,0.020874,0.1427,0.0246277


In [25]:
print(f"DataFrame size: {df_users_test.shape}")
df_users_test.head(5)

DataFrame size: (480189, 303)


Unnamed: 0,user_idx,rating_avg,w2v_0,w2v_1,w2v_2,w2v_3,w2v_4,w2v_5,w2v_6,w2v_7,...,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299,rating_cnt
0,0,3.41853,0.047119,0.038021,-0.008658,0.061763,-0.006199,0.011819,0.012048,-0.086528,...,0.001927,-0.110658,-0.009286,-0.01813,-0.04322,-0.039856,-0.048123,0.006335,0.016588,626
1,1,4.011351,0.039846,0.035202,-0.000552,0.065487,-0.009792,0.003091,0.017207,-0.086343,...,0.011999,-0.103366,-0.01365,-0.021188,-0.039596,-0.028995,-0.047761,0.006293,0.020549,881
2,2,4.214286,0.029219,0.046266,0.028962,0.060885,-0.016917,0.019689,0.022825,-0.078199,...,-0.008831,-0.123834,-0.009376,-0.031162,-0.042404,-0.042829,-0.04536,-0.004269,0.017736,98
3,3,3.392308,0.045342,0.028131,0.005163,0.062421,-0.017837,0.006805,0.013406,-0.084151,...,0.006446,-0.117563,-0.005086,-0.029055,-0.038691,-0.02202,-0.047071,0.000898,0.022258,260
4,4,3.481481,0.050034,0.038392,0.023527,0.046617,0.021565,-0.016197,0.014753,-0.037128,...,-0.030786,-0.158192,-0.064261,0.001065,0.007731,0.04297,-0.057525,-0.00065,-0.050495,27


In [26]:
df_rating_test = df_rating_test[df_rating_test.timestamp > quantiles_values[1]]
print(f"DataFrame size: {df_rating_test.shape}")
df_rating_test.head(5)

DataFrame size: (25115967, 4)


Unnamed: 0,item_idx,user_idx,relevance,timestamp
75364540,1864,341295,4,1105920000.0
75364541,482,437697,4,1105920000.0
75364542,6205,208030,5,1105920000.0
75364543,8117,60270,2,1105920000.0
75364544,6910,159550,2,1105920000.0
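
The row counts printed throughout the notebook can be cross-checked against the split definition. The sketch below uses only the sizes reported by the cells above; the variable names are introduced here for illustration.

```python
# Row counts reported by the cells above.
n_train_features = 50_934_106   # df_rating[0, 0.5]
n_val_features = 75_364_540     # df_rating[0, 0.75]
n_test_features = 100_480_507   # df_rating[0, 1]

n_val_ratings = 24_430_434      # df_rating(0.5, 0.75]
n_test_ratings = 25_115_967     # df_rating(0.75, 1]

# Each rating window is the difference of two consecutive feature windows.
assert n_val_ratings == n_val_features - n_train_features
assert n_test_ratings == n_test_features - n_val_features

# Train, validation, and test ratings together cover the whole dataset.
assert n_train_features + n_val_ratings + n_test_ratings == n_test_features

train_share = n_train_features / n_test_features
print(f"train share of all ratings: {train_share:.4f}")
```

The train share comes out slightly above 0.5. One plausible explanation, not verified here, is that many ratings share the boundary timestamp and the `<=` comparison places all of them in the train window.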
