# Illustration of splitting the MovieLens dataset into train/val/test
<blockquote>
    <p>The feature preprocessing for the MovieLens dataset used here lives in a separate
       <a href="https://github.com/AlgoMathITMO/sber-simulator/blob/experiments-new/experiments/Movielens/ml20.py">file</a>;
       how it works is illustrated in this <a href="https://github.com/AlgoMathITMO/sber-simulator/blob/experiments-new/experiments/Movielens/movie_preprocessing.ipynb" 
       title="movie_preprocessing">notebook</a>.</p>
</blockquote>

### $\textbf{Contents}$:

### $\textbf{I. Loading and preparing the data}$
#### - Reading the data from disk;
#### - Encoding movie and user ids as integer identifiers;
---

### $\textbf{II. Splitting the data for the experiment}$
### To split the data into $\it{train/val/test}$, the source dataset *df_rating* is cut at quantiles $\mathbb{q}$ of the $\it{timestamp}$ attribute. The resulting prefixes are used for feature generation:
#### $\it{rating}_{t}$ = *df_rating*$[0, \mathbb{q}_{t}]$, where $\mathbb{q}_{train}=0.5$, $\mathbb{q}_{val}=0.75$, $\mathbb{q}_{test}=1$:
#### - $\it{rating}_{train}$ = *df_rating*$[0, 0.5]$;
#### - $\it{rating}_{val}$ = *df_rating*$[0, 0.75]$;
#### - $\it{rating}_{test}$ = *df_rating*$[0, 1]$;
### Next, for each of the ranges {$\it{rating}_{train}$, $\it{rating}_{val}$, $\it{rating}_{test}$}, the corresponding user and item features are generated (following this [example](https://github.com/AlgoMathITMO/sber-simulator/blob/experiments-new/experiments/Movielens/movie_preprocessing.ipynb)):
#### - $\it{items}_{t}$, $\it{users}_{t}$, $\it{rating}_{t}$ = data_processing(movies, $\it{rating}_{t}$, tags), $t \in \{\it{train}, \it{val}, \it{test}\}$;
### The final ratings are then formed:
#### - $\it{rating}_{train}$ = $\it{rating}_{train}$ = *df_rating*$[0, 0.5]$;
#### - $\it{rating}_{val}$ = $\it{rating}_{val}$[$\mathbb{q}>\mathbb{q}_{train}$] = *df_rating*$(0.5, 0.75]$;
#### - $\it{rating}_{test}$ = $\it{rating}_{test}$[$\mathbb{q}>\mathbb{q}_{val}$] = *df_rating*$(0.75, 1]$;

<blockquote>
    <p>In other words, features for the validation set are generated from ratings whose timestamps fall between the 0 and 0.75 quantiles, while the validation ratings themselves
       are only those between the 0.5 and 0.75 quantiles. The test set works the same way: all timestamps are used for feature generation, but the ratings are only those between
       the 0.75 and 1 quantiles.</p>
</blockquote>
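
The scheme above can be sketched on a toy table. This is a minimal illustration, not the notebook's actual data: the `toy` frame, its values, and the names `feature_windows` and `rating_windows` are made up for the example.

```python
import pandas as pd

# Toy stand-in for df_rating; all values are invented for illustration.
toy = pd.DataFrame({
    "userId": [0, 1, 0, 2, 1, 2, 0, 1],
    "movieId": [0, 0, 1, 1, 2, 2, 3, 3],
    "rating": [4.0, 3.0, 5.0, 2.5, 3.5, 4.5, 1.0, 4.0],
    "timestamp": [10, 20, 30, 40, 50, 60, 70, 80],
}).sort_values("timestamp").reset_index(drop=True)

# Timestamp thresholds at the 0.5 and 0.75 quantiles.
q_train, q_val = (toy["timestamp"].quantile(q) for q in (0.5, 0.75))

# Feature windows are nested prefixes of the timeline: [0, q_t].
feature_windows = {
    "train": toy[toy["timestamp"] <= q_train],
    "val": toy[toy["timestamp"] <= q_val],
    "test": toy,
}

# Final rating windows are disjoint slices: [0, 0.5], (0.5, 0.75], (0.75, 1].
rating_windows = {
    "train": toy[toy["timestamp"] <= q_train],
    "val": toy[(toy["timestamp"] > q_train) & (toy["timestamp"] <= q_val)],
    "test": toy[toy["timestamp"] > q_val],
}

for name in ("train", "val", "test"):
    print(name, len(feature_windows[name]), len(rating_windows[name]))
```

Each feature window contains everything before its threshold, while the rating windows never overlap, so each rating ends up as a target in exactly one split.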
<hr>

In [1]:
import sys
import os
import pandas as pd
import numpy as np
import re
import itertools
import tqdm


from ml20 import data_processing

[nltk_data] Downloading package stopwords to
[nltk_data]     /data/home/agurov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /data/home/agurov/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /data/home/agurov/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [43]:
MOVIE_PATH = r'./data_row/movies.csv'
RATING_PATH = r'./data_row/ratings.csv'
TAG_PATH = r'./data_row/tags.csv'

SAVE_PATH = r"./final_data"

In [3]:
QUANTILES = [0.5, 0.75]

### I. Loading and preparing the data

#### Reading the data from disk

In [4]:
df_movie = pd.read_csv(MOVIE_PATH)
df_movie.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
df_rating = pd.read_csv(RATING_PATH)
df_rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


In [6]:
df_tags = pd.read_csv(TAG_PATH)
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,1240597180
1,65,208,dark hero,1368150078
2,65,353,dark hero,1368150079
3,65,521,noir thriller,1368149983
4,65,592,dark hero,1368150078


#### Encoding movie and user ids as integer identifiers

In [7]:
cat_dict_movies = pd.Series(df_movie.movieId.astype("category").cat.codes.values, index=df_movie.movieId).to_dict()
cat_dict_users = pd.Series(df_rating.userId.astype("category").cat.codes.values, index=df_rating.userId).to_dict()

In [8]:
df_movie.movieId = df_movie.movieId.apply(lambda x: cat_dict_movies[x])

df_rating.movieId = df_rating.movieId.apply(lambda x: cat_dict_movies[x])
df_rating.userId = df_rating.userId.apply(lambda x: cat_dict_users[x])

df_tags.movieId = df_tags.movieId.apply(lambda x: cat_dict_movies[x])
df_tags.userId = df_tags.userId.apply(lambda x: cat_dict_users[x])
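
To make the encoding step concrete, here is a small sketch on an invented id column (`raw_ids` is hypothetical). It builds the mapping the same way as the cells above and compares `apply` with `Series.map`, which is one possible alternative and not what the notebook uses.

```python
import pandas as pd

# Hypothetical raw ids with gaps and repeats.
raw_ids = pd.Series([50, 2, 29, 2, 50, 47])

# Same construction as in the notebook: category codes follow the
# sorted order of the unique ids, so 2 -> 0, 29 -> 1, 47 -> 2, 50 -> 3.
codes = raw_ids.astype("category").cat.codes
mapping = pd.Series(codes.values, index=raw_ids).to_dict()

# Element-wise lookup, as in the notebook; raises KeyError for unknown ids.
encoded_apply = raw_ids.apply(lambda x: mapping[x])

# Dictionary-based lookup; ids missing from the mapping would become NaN.
encoded_map = raw_ids.map(mapping)

print(mapping)
print(encoded_map.tolist())
```

The two lookups agree whenever every id is present in the mapping. They differ only in how they treat unknown ids, which matters if `df_tags` contains a user or movie that is absent from the table the mapping was built from.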

### II. Splitting the data for the experiment

### Splitting df_rating into train/validation/test feature windows by timestamp quantiles:
####  - train [0, 0.5]
####  - validation [0, 0.75]
####  - test [0, 1.]

In [9]:
df_rating = df_rating.sort_values(by='timestamp').reset_index(drop=True)
quantiles_values = [df_rating.timestamp.quantile(i) for i in QUANTILES]
quantiles_values

[1103555886.0, 1225642317.5]

In [10]:
df_rating_train = df_rating[df_rating.timestamp <= quantiles_values[0]]
print(f"DataFrame size: {df_rating_train.shape}")
df_rating_train.head(5)

DataFrame size: (10000132, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,28506,1153,4.0,789652004
1,131159,1057,3.0,789652009
2,131159,46,5.0,789652009
3,131159,20,3.0,789652009
4,85251,44,3.0,822873600


In [13]:
df_rating_val = df_rating[(df_rating.timestamp <= quantiles_values[1])]
print(f"DataFrame size: {df_rating_val.shape}")
df_rating_val.head()

DataFrame size: (15000197, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,28506,1153,4.0,789652004
1,131159,1057,3.0,789652009
2,131159,46,5.0,789652009
3,131159,20,3.0,789652009
4,85251,44,3.0,822873600


In [14]:
df_rating_test = df_rating.copy()
print(f"DataFrame size: {df_rating_test.shape}")
df_rating_test.head(5)

DataFrame size: (20000263, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,28506,1153,4.0,789652004
1,131159,1057,3.0,789652009
2,131159,46,5.0,789652009
3,131159,20,3.0,789652009
4,85251,44,3.0,822873600


### Feature generation by time range

#### Train data

In [15]:
data_train = data_processing(df_movie, df_rating_train, df_tags)

------------------------ Movie processing ------------------------
------------------------ Rating processing ------------------------
------------------------ Tags processing ------------------------
------------------------ Tags embedding ------------------------
------------------------ Users processing ------------------------
------------------------ Users embedding ------------------------


100%|██████████| 80650/80650 [30:04<00:00, 44.69it/s]  


In [16]:
df_items_train, df_users_train, df_rating_train = data_train

In [17]:
print(f"DataFrame size: {df_items_train.shape}")
df_items_train.head(5)

DataFrame size: (27278, 322)


Unnamed: 0,item_idx,year,rating_avg,genre0,genre1,genre2,genre3,genre4,genre5,genre6,...,w2v_290,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299
0,0,1995,4.018155,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.174128,0.101186,-0.124264,-0.0650168,-0.00856246,-0.0169785,0.0107916,-0.0544386,-0.0506917,0.00865543
1,1,1995,3.34015,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.103736,0.0705649,-0.0288628,-0.00690375,0.0374264,-0.0100587,-0.00881062,0.0436335,0.0338212,0.033346
2,2,1995,3.243801,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.0761528,0.123608,-0.0645615,0.027002,0.00566101,-0.0378204,-0.0740143,-0.057489,0.00765991,0.0744789
3,3,1995,2.894591,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.0975691,0.27274,-0.243025,-0.0733817,-0.0772182,-0.0487932,0.0435268,-0.133022,-0.110073,0.114049
4,4,1995,3.162804,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.125023,0.0665251,-0.0802847,0.0328275,0.00461989,-0.0376446,-0.135221,-0.0997596,0.00258226,0.0805623


In [18]:
print(f"DataFrame size: {df_users_train.shape}")
df_users_train.head(5)

DataFrame size: (80650, 321)


Unnamed: 0,user_idx,genre0,genre1,genre2,genre3,genre4,genre5,genre6,genre7,genre8,...,w2v_290,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299
0,0,0.217391,0.565217,0.108696,0.26087,0.021739,0.0,0.195652,0.021739,0.043478,...,-0.082082,0.065216,-0.093123,0.043002,-0.057089,-0.072583,0.006889,-0.06539,0.012009,0.068208
1,1,0.163934,0.295082,0.098361,0.377049,0.065574,0.016393,0.278689,0.032787,0.016393,...,-0.070471,0.051933,-0.095922,0.050092,-0.054486,-0.070923,0.009246,-0.0577,-0.001037,0.050525
2,2,0.278075,0.171123,0.085561,0.497326,0.032086,0.0,0.26738,0.016043,0.053476,...,-0.065744,0.063735,-0.094431,0.030662,-0.050246,-0.064388,-0.001983,-0.066358,-0.011671,0.056781
3,3,0.392857,0.0,0.142857,0.178571,0.035714,0.0,0.214286,0.035714,0.142857,...,-0.026629,0.064698,-0.068943,0.024203,-0.049823,-0.075761,-0.024932,-0.098899,-0.023882,0.055239
4,4,0.363636,0.015152,0.242424,0.151515,0.015152,0.045455,0.318182,0.030303,0.166667,...,-0.054649,0.061606,-0.108559,0.006885,-0.036019,-0.068205,0.00406,-0.078106,-0.009912,0.067181


In [19]:
print(f"DataFrame size: {df_rating_train.shape}")
df_rating_train.head(5)

DataFrame size: (10000132, 4)


Unnamed: 0,user_idx,item_idx,relevance,timestamp
0,28506,1153,4.0,789652004
1,131159,1057,3.0,789652009
2,131159,46,5.0,789652009
3,131159,20,3.0,789652009
4,85251,44,3.0,822873600


In [49]:
df_items_train.to_csv(os.path.join(SAVE_PATH, r"train/items.csv"), index=False)
df_users_train.to_csv(os.path.join(SAVE_PATH, r"train/users.csv"), index=False)
df_rating_train.to_csv(os.path.join(SAVE_PATH, r"train/rating.csv"), index=False)

#### Validation data

In [None]:
data_val = data_processing(df_movie, df_rating_val, df_tags)

------------------------ Movie processing ------------------------
------------------------ Rating processing ------------------------
------------------------ Tags processing ------------------------
------------------------ Tags embedding ------------------------
------------------------ Users processing ------------------------
------------------------ Users embedding ------------------------


 92%|█████████▏| 98398/106573 [47:51<05:54, 23.06it/s]  

In [None]:
df_items_val, df_users_val, df_rating_val = data_val

In [31]:
print(f"DataFrame size: {df_items_val.shape}")
df_items_val.head(5)

DataFrame size: (27278, 322)


Unnamed: 0,item_idx,year,rating_avg,genre0,genre1,genre2,genre3,genre4,genre5,genre6,...,w2v_290,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299
0,0,1995,3.935369,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.174128,0.101186,-0.124264,-0.0650168,-0.00856246,-0.0169785,0.0107916,-0.0544386,-0.0506917,0.00865543
1,1,1995,3.223721,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.103736,0.0705649,-0.0288628,-0.00690375,0.0374264,-0.0100587,-0.00881062,0.0436335,0.0338212,0.033346
2,2,1995,3.160203,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.0761528,0.123608,-0.0645615,0.027002,0.00566101,-0.0378204,-0.0740143,-0.057489,0.00765991,0.0744789
3,3,1995,2.867735,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.0975691,0.27274,-0.243025,-0.0733817,-0.0772182,-0.0487932,0.0435268,-0.133022,-0.110073,0.114049
4,4,1995,3.092171,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.125023,0.0665251,-0.0802847,0.0328275,0.00461989,-0.0376446,-0.135221,-0.0997596,0.00258226,0.0805623


In [32]:
print(f"DataFrame size: {df_users_val.shape}")
df_users_val.head(5)

DataFrame size: (106573, 321)


Unnamed: 0,user_idx,genre0,genre1,genre2,genre3,genre4,genre5,genre6,genre7,genre8,...,w2v_290,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299
0,0,0.234286,0.257143,0.062857,0.228571,0.051429,0.011429,0.417143,0.022857,0.108571,...,-0.070012,0.064255,-0.097324,0.027969,-0.048789,-0.067166,0.005073,-0.054991,0.001487,0.064595
1,1,0.163934,0.295082,0.098361,0.377049,0.065574,0.016393,0.278689,0.032787,0.016393,...,-0.070471,0.051933,-0.095922,0.050092,-0.054486,-0.070923,0.009246,-0.0577,-0.001037,0.050525
2,2,0.278075,0.171123,0.085561,0.497326,0.032086,0.0,0.26738,0.016043,0.053476,...,-0.065744,0.063735,-0.094431,0.030662,-0.050246,-0.064388,-0.001983,-0.066358,-0.011671,0.056781
3,3,0.392857,0.0,0.142857,0.178571,0.035714,0.0,0.214286,0.035714,0.142857,...,-0.026629,0.064698,-0.068943,0.024203,-0.049823,-0.075761,-0.024932,-0.098899,-0.023882,0.055239
4,4,0.363636,0.015152,0.242424,0.151515,0.015152,0.045455,0.318182,0.030303,0.166667,...,-0.054649,0.061606,-0.108559,0.006885,-0.036019,-0.068205,0.00406,-0.078106,-0.009912,0.067181


In [33]:
df_rating_val = df_rating_val[df_rating_val.timestamp > quantiles_values[0]]
print(f"DataFrame size: {df_rating_val.shape}")
df_rating_val.head(5)

DataFrame size: (5000065, 4)


Unnamed: 0,user_idx,item_idx,relevance,timestamp
10000132,88819,7982,3.0,1103555902
10000133,14095,914,4.5,1103556126
10000134,14095,2244,4.5,1103556167
10000135,14095,583,4.5,1103556194
10000136,14095,1789,5.0,1103556223


In [50]:
# Save the validation frames (not the train ones) into the val folder.
df_items_val.to_csv(os.path.join(SAVE_PATH, r"val/items.csv"), index=False)
df_users_val.to_csv(os.path.join(SAVE_PATH, r"val/users.csv"), index=False)
df_rating_val.to_csv(os.path.join(SAVE_PATH, r"val/rating.csv"), index=False)

#### Test data

In [None]:
data_test = data_processing(df_movie, df_rating_test, df_tags)

In [None]:
df_items_test, df_users_test, df_rating_test = data_test

In [34]:
print(f"DataFrame size: {df_items_test.shape}")
df_items_test.head(5)

DataFrame size: (27278, 322)


Unnamed: 0,item_idx,year,rating_avg,genre0,genre1,genre2,genre3,genre4,genre5,genre6,...,w2v_290,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299
0,0,1995,3.92124,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.174128,0.101186,-0.124264,-0.0650168,-0.00856246,-0.0169785,0.0107916,-0.0544386,-0.0506917,0.00865543
1,1,1995,3.211977,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.103736,0.0705649,-0.0288628,-0.00690375,0.0374264,-0.0100587,-0.00881062,0.0436335,0.0338212,0.033346
2,2,1995,3.15104,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.0761528,0.123608,-0.0645615,0.027002,0.00566101,-0.0378204,-0.0740143,-0.057489,0.00765991,0.0744789
3,3,1995,2.861393,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.0975691,0.27274,-0.243025,-0.0733817,-0.0772182,-0.0487932,0.0435268,-0.133022,-0.110073,0.114049
4,4,1995,3.064592,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.125023,0.0665251,-0.0802847,0.0328275,0.00461989,-0.0376446,-0.135221,-0.0997596,0.00258226,0.0805623


In [35]:
print(f"DataFrame size: {df_users_test.shape}")
df_users_test.head(5)

DataFrame size: (138493, 321)


Unnamed: 0,user_idx,genre0,genre1,genre2,genre3,genre4,genre5,genre6,genre7,genre8,...,w2v_290,w2v_291,w2v_292,w2v_293,w2v_294,w2v_295,w2v_296,w2v_297,w2v_298,w2v_299
0,0,0.234286,0.257143,0.062857,0.228571,0.051429,0.011429,0.417143,0.022857,0.108571,...,-0.070012,0.064255,-0.097324,0.027969,-0.048789,-0.067166,0.005073,-0.054991,0.001487,0.064595
1,1,0.163934,0.295082,0.098361,0.377049,0.065574,0.016393,0.278689,0.032787,0.016393,...,-0.070471,0.051933,-0.095922,0.050092,-0.054486,-0.070923,0.009246,-0.0577,-0.001037,0.050525
2,2,0.278075,0.171123,0.085561,0.497326,0.032086,0.0,0.26738,0.016043,0.053476,...,-0.065744,0.063735,-0.094431,0.030662,-0.050246,-0.064388,-0.001983,-0.066358,-0.011671,0.056781
3,3,0.392857,0.0,0.142857,0.178571,0.035714,0.0,0.214286,0.035714,0.142857,...,-0.026629,0.064698,-0.068943,0.024203,-0.049823,-0.075761,-0.024932,-0.098899,-0.023882,0.055239
4,4,0.363636,0.015152,0.242424,0.151515,0.015152,0.045455,0.318182,0.030303,0.166667,...,-0.054649,0.061606,-0.108559,0.006885,-0.036019,-0.068205,0.00406,-0.078106,-0.009912,0.067181


In [36]:
df_rating_test = df_rating_test[df_rating_test.timestamp > quantiles_values[1]]
print(f"DataFrame size: {df_rating_test.shape}")
df_rating_test.head(5)

DataFrame size: (5000066, 4)


Unnamed: 0,user_idx,item_idx,relevance,timestamp
15000197,29881,1596,3.5,1225642324
15000198,29881,1582,3.0,1225642328
15000199,29881,1532,3.0,1225642333
15000200,29881,1502,3.5,1225642336
15000201,29881,1445,2.0,1225642340
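
As a quick consistency check, the three final rating tables should partition the full dataset. The sketch below only redoes the arithmetic on the row counts printed in this notebook; the variable names are illustrative.

```python
# Row counts as printed by the cells above.
n_total = 20000263                      # full df_rating
n_train, n_val, n_test = 10000132, 5000065, 5000066

# The final rating windows are disjoint and together cover every row.
assert n_train + n_val + n_test == n_total

# Share of each split, rounded to four decimals.
shares = [round(n / n_total, 4) for n in (n_train, n_val, n_test)]
print(shares)
```

The shares match the quantile boundaries 0.5 and 0.75 used for the split.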


In [51]:
# Save the test frames (not the train ones) into the test folder.
df_items_test.to_csv(os.path.join(SAVE_PATH, r"test/items.csv"), index=False)
df_users_test.to_csv(os.path.join(SAVE_PATH, r"test/users.csv"), index=False)
df_rating_test.to_csv(os.path.join(SAVE_PATH, r"test/rating.csv"), index=False)
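
The three save cells differ only in the split name, so they could be folded into one helper, which makes it harder to pair a frame with the wrong folder. The sketch below is an assumption about how that might look: `save_split` is a hypothetical function, and the toy frames and temporary directory exist only for the demonstration.

```python
import os
import tempfile

import pandas as pd


def save_split(save_path, split, items, users, rating):
    """Hypothetical helper: write one split's three tables to save_path/split."""
    split_dir = os.path.join(save_path, split)
    # to_csv does not create missing folders, so make sure the target exists.
    os.makedirs(split_dir, exist_ok=True)
    items.to_csv(os.path.join(split_dir, "items.csv"), index=False)
    users.to_csv(os.path.join(split_dir, "users.csv"), index=False)
    rating.to_csv(os.path.join(split_dir, "rating.csv"), index=False)
    return split_dir


# Demonstration on tiny invented frames in a temporary directory.
demo_root = tempfile.mkdtemp()
toy_items = pd.DataFrame({"item_idx": [0, 1]})
toy_users = pd.DataFrame({"user_idx": [0]})
toy_rating = pd.DataFrame({"user_idx": [0], "item_idx": [1], "relevance": [4.0]})

val_dir = save_split(demo_root, "val", toy_items, toy_users, toy_rating)
written = sorted(os.listdir(val_dir))
print(written)
```

In the notebook this would be called once per split, for example `save_split(SAVE_PATH, "val", df_items_val, df_users_val, df_rating_val)`.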