# Homework: Content-Based Recommendations

# Assignment

**Instructors:** Наталья Баданина, Иван Анисковец, Юлия Пономарева, Ярослав Сапронов, Егор Шишковец

1. Use the [MovieLens](https://grouplens.org/datasets/movielens/latest/) dataset.
2. Build recommendations (a regression that predicts the rating) on the following features:
* TF-IDF on tags and genres;
* mean ratings (plus median, variance, etc.) of the user and of the movie.
3. Evaluate RMSE on the test set.

# Main tables in MovieLens

1. **movies**:
- **Description**: Information about the movies.
- `movieId`: Unique movie identifier.
- `title`: Movie title.
- `genres`: Movie genres, given as a single string with the genres separated by `|` (for example, "Action|Comedy").

2. **ratings**:
- **Description**: Ratings that users gave to movies.
- `userId`: Unique user identifier.
- `movieId`: Unique movie identifier (refers to the `movies` table).
- `rating`: Rating (usually from 0.5 to 5 in steps of 0.5).
- `timestamp`: Time when the rating was given (usually a Unix timestamp).

3. **tags** (not always present):
- **Description**: Tags that users attached to movies.
- `userId`: Unique user identifier.
- `movieId`: Unique movie identifier (refers to the `movies` table).
- `tag`: Free-text tag added by the user.
- `timestamp`: Time when the tag was added (usually a Unix timestamp).
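As a small illustration of the column formats described above, the following sketch splits a `genres` string on `|` and converts a Unix `timestamp` to a UTC datetime. The helper names `split_genres` and `to_utc` are hypothetical and are not part of the dataset or of this notebook.

```python
from datetime import datetime, timezone


def split_genres(genres: str) -> list[str]:
    """Split a MovieLens-style genres string such as "Action|Comedy" into a list."""
    return genres.split("|")


def to_utc(timestamp: int) -> datetime:
    """Interpret a Unix timestamp (seconds since the epoch) as a UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


genres = split_genres("Adventure|Children|Fantasy")
rated_at = to_utc(1260759144)  # timestamp value taken from the ratings table

print(genres)
print(rated_at.year)
```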

# Loading the data

In [85]:
import pandas as pd

In [86]:
# Load the data from CSV files into DataFrames
movies = pd.read_csv("https://raw.githubusercontent.com/stefkong1982/netology.ru/refs/heads/Master/Rekomendatelnye_sistemy/Rekomendacii_na_osnove_soderzhaniya/movies.csv")
ratings = pd.read_csv("https://raw.githubusercontent.com/stefkong1982/netology.ru/refs/heads/Master/Rekomendatelnye_sistemy/Rekomendacii_na_osnove_soderzhaniya/ratings.csv")
tags = pd.read_csv("https://raw.githubusercontent.com/stefkong1982/netology.ru/refs/heads/Master/Rekomendatelnye_sistemy/Rekomendacii_na_osnove_soderzhaniya/tags.csv")

# Inspecting the loaded data

In [87]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9120,162672,Mohenjo Daro (2016),Adventure|Drama|Romance
9121,163056,Shin Godzilla (2016),Action|Adventure|Fantasy|Sci-Fi
9122,163949,The Beatles: Eight Days a Week - The Touring Y...,Documentary
9123,164977,The Gay Desperado (1936),Comedy


In [88]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
...,...,...,...,...
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663


In [89]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,15,339,sandra 'boring' bullock,1138537770
1,15,1955,dentist,1193435061
2,15,7478,Cambodia,1170560997
3,15,32892,Russian,1170626366
4,15,34162,forgettable,1141391765
...,...,...,...,...
1291,660,135518,meaning of life,1436680885
1292,660,135518,philosophical,1436680885
1293,660,135518,sci-fi,1436680885
1294,663,260,action,1438398078


# Merging the data

In [90]:
# Merge movies and ratings on movieId
movies_ratings = pd.merge(movies, ratings, on='movieId', how='left')

In [91]:
movies_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,3.0,8.518667e+08
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,9.0,4.0,9.386292e+08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,13.0,5.0,1.331380e+09
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.0,9.979383e+08
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19.0,3.0,8.551901e+08
...,...,...,...,...,...,...
100058,162672,Mohenjo Daro (2016),Adventure|Drama|Romance,611.0,3.0,1.471524e+09
100059,163056,Shin Godzilla (2016),Action|Adventure|Fantasy|Sci-Fi,,,
100060,163949,The Beatles: Eight Days a Week - The Touring Y...,Documentary,547.0,5.0,1.476419e+09
100061,164977,The Gay Desperado (1936),Comedy,,,


In [92]:
# Count the rows that contain at least one NaN
num_nan_rows = movies_ratings.isna().any(axis=1).sum()
print(f"Number of rows with at least one NaN: {num_nan_rows}")

Number of rows with at least one NaN: 59
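The NaN rows come from the left join: a movie with no match in `ratings` is kept, and its rating columns are filled with NaN, which also turns `userId` into a float column. A minimal sketch on toy data (the small frames below are made up for illustration) uses the `indicator=True` option of `pd.merge` to show which rows have no match:

```python
import pandas as pd

# Toy stand-ins for the movies and ratings tables (values are illustrative).
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({"userId": [7, 9], "movieId": [1, 1], "rating": [3.0, 4.0]})

# how="left" keeps every movie; indicator=True adds a `_merge` column
# telling whether a row was found in both tables or only in the left one.
merged = pd.merge(toy_movies, toy_ratings, on="movieId", how="left", indicator=True)

unrated = merged[merged["_merge"] == "left_only"]
print(unrated[["movieId", "title"]])

# An inner join drops unrated movies directly, so no dropna() is needed
# and userId keeps its integer dtype.
inner = pd.merge(toy_movies, toy_ratings, on="movieId", how="inner")
print(inner.dtypes["userId"])
```

Whether to prefer the inner join is a design choice: it gives the same rows as the left join followed by `dropna()` as long as the original tables contain no other missing values.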


In [93]:
# Drop every row that contains at least one NaN
movies_ratings = movies_ratings.dropna()

In [94]:
movies_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,3.0,8.518667e+08
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,9.0,4.0,9.386292e+08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,13.0,5.0,1.331380e+09
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.0,9.979383e+08
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19.0,3.0,8.551901e+08
...,...,...,...,...,...,...
100055,161944,The Last Brickmaker in America (2001),Drama,287.0,5.0,1.470168e+09
100056,162376,Stranger Things,Drama,73.0,4.5,1.474256e+09
100057,162542,Rustom (2016),Romance|Thriller,611.0,5.0,1.471521e+09
100058,162672,Mohenjo Daro (2016),Adventure|Drama|Romance,611.0,3.0,1.471524e+09


We drop the `userId` and `timestamp` columns from the `tags` DataFrame so that they are not duplicated in the final table.

In [95]:
# Drop the unneeded columns (userId and timestamp) from tags
tags = tags.drop(columns=['userId', 'timestamp'])

In [96]:
tags

Unnamed: 0,movieId,tag
0,339,sandra 'boring' bullock
1,1955,dentist
2,7478,Cambodia
3,32892,Russian
4,34162,forgettable
...,...,...
1291,135518,meaning of life
1292,135518,philosophical
1293,135518,sci-fi
1294,260,action


In [97]:
# Merge the resulting DataFrame with tags on movieId
full_data = pd.merge(movies_ratings, tags, on='movieId', how='left')

In [98]:
full_data

Unnamed: 0,movieId,title,genres,userId,rating,timestamp,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,3.0,8.518667e+08,Pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,9.0,4.0,9.386292e+08,Pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,13.0,5.0,1.331380e+09,Pixar
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.0,9.979383e+08,Pixar
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19.0,3.0,8.551901e+08,Pixar
...,...,...,...,...,...,...,...
141018,161944,The Last Brickmaker in America (2001),Drama,287.0,5.0,1.470168e+09,
141019,162376,Stranger Things,Drama,73.0,4.5,1.474256e+09,
141020,162542,Rustom (2016),Romance|Thriller,611.0,5.0,1.471521e+09,
141021,162672,Mohenjo Daro (2016),Adventure|Drama|Romance,611.0,3.0,1.471524e+09,


In [99]:
# Count the rows that contain at least one NaN
num_nan_rows = full_data.isna().any(axis=1).sum()

print(f"Number of rows with at least one NaN in the DataFrame: {num_nan_rows}")

Number of rows with at least one NaN in the DataFrame: 77898
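Joining `tags` row by row repeats each rating once per tag of the movie, which is why the table grows from about 100 thousand to about 141 thousand rows. One possible alternative, sketched here on toy data and not used in this notebook, is to collapse the tags of each movie into a single string first, so that the merge keeps one row per rating:

```python
import pandas as pd

toy_ratings = pd.DataFrame({"movieId": [1, 1, 2], "userId": [7, 9, 7], "rating": [3.0, 4.0, 2.5]})
toy_tags = pd.DataFrame({"movieId": [1, 1], "tag": ["Pixar", "animation"]})

# Row-wise join: every rating of movie 1 is repeated for each of its two tags.
row_wise = pd.merge(toy_ratings, toy_tags, on="movieId", how="left")

# Collapse tags to one string per movie before joining.
tags_per_movie = toy_tags.groupby("movieId")["tag"].agg(" ".join).reset_index()
one_row_per_rating = pd.merge(toy_ratings, tags_per_movie, on="movieId", how="left")
one_row_per_rating["tag"] = one_row_per_rating["tag"].fillna("")

print(len(row_wise), len(one_row_per_rating))
```

Keeping one row per rating also prevents copies of the same rating from landing in both the training and the test set after a random split.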


In [100]:
# Fill NaN in the tag column with the placeholder "No Tag"
full_data['tag'] = full_data['tag'].fillna('No Tag')

In [101]:
full_data

Unnamed: 0,movieId,title,genres,userId,rating,timestamp,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,3.0,8.518667e+08,Pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,9.0,4.0,9.386292e+08,Pixar
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,13.0,5.0,1.331380e+09,Pixar
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.0,9.979383e+08,Pixar
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19.0,3.0,8.551901e+08,Pixar
...,...,...,...,...,...,...,...
141018,161944,The Last Brickmaker in America (2001),Drama,287.0,5.0,1.470168e+09,No Tag
141019,162376,Stranger Things,Drama,73.0,4.5,1.474256e+09,No Tag
141020,162542,Rustom (2016),Romance|Thriller,611.0,5.0,1.471521e+09,No Tag
141021,162672,Mohenjo Daro (2016),Adventure|Drama|Romance,611.0,3.0,1.471524e+09,No Tag


In [102]:
import numpy as np

# Applying TF-IDF

In [103]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [104]:
# Combine tags and genres into one string
full_data['combined'] = full_data['tag'] + ' ' + full_data['genres']

In [105]:
full_data

Unnamed: 0,movieId,title,genres,userId,rating,timestamp,tag,combined
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,3.0,8.518667e+08,Pixar,Pixar Adventure|Animation|Children|Comedy|Fantasy
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,9.0,4.0,9.386292e+08,Pixar,Pixar Adventure|Animation|Children|Comedy|Fantasy
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,13.0,5.0,1.331380e+09,Pixar,Pixar Adventure|Animation|Children|Comedy|Fantasy
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.0,9.979383e+08,Pixar,Pixar Adventure|Animation|Children|Comedy|Fantasy
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19.0,3.0,8.551901e+08,Pixar,Pixar Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...,...,...
141018,161944,The Last Brickmaker in America (2001),Drama,287.0,5.0,1.470168e+09,No Tag,No Tag Drama
141019,162376,Stranger Things,Drama,73.0,4.5,1.474256e+09,No Tag,No Tag Drama
141020,162542,Rustom (2016),Romance|Thriller,611.0,5.0,1.471521e+09,No Tag,No Tag Romance|Thriller
141021,162672,Mohenjo Daro (2016),Adventure|Drama|Romance,611.0,3.0,1.471524e+09,No Tag,No Tag Adventure|Drama|Romance


In [106]:
# Apply TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(full_data['combined'])

In [107]:
# Convert the TF-IDF matrix to a DataFrame for convenience
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=full_data['movieId'])

In [108]:
tfidf_df

movieId,1940,80,abigail,acclaimed,accurate,acting,action,activist,adam,adaptation,...,work,world,worth,wrongful,ww2,york,you,your,zombies,zooey
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161944,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162376,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162542,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162672,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [109]:
# First, create a DataFrame with the core columns
metadata_df = full_data[['movieId', 'userId', 'rating', 'timestamp']].copy()

In [110]:
metadata_df

Unnamed: 0,movieId,userId,rating,timestamp
0,1,7.0,3.0,8.518667e+08
1,1,9.0,4.0,9.386292e+08
2,1,13.0,5.0,1.331380e+09
3,1,15.0,2.0,9.979383e+08
4,1,19.0,3.0,8.551901e+08
...,...,...,...,...
141018,161944,287.0,5.0,1.470168e+09
141019,162376,73.0,4.5,1.474256e+09
141020,162542,611.0,5.0,1.471521e+09
141021,162672,611.0,3.0,1.471524e+09


In [111]:
# Concatenate the two DataFrames column-wise; both are indexed by movieId in the same row order
combined_df = pd.concat([metadata_df.set_index('movieId'), tfidf_df], axis=1)

In [112]:
# Reset the index for convenience
combined_df.reset_index(inplace=True)

In [113]:
combined_df

Unnamed: 0,movieId,userId,rating,timestamp,1940,80,abigail,acclaimed,accurate,acting,...,work,world,worth,wrongful,ww2,york,you,your,zombies,zooey
0,1,7.0,3.0,8.518667e+08,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,9.0,4.0,9.386292e+08,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,13.0,5.0,1.331380e+09,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,15.0,2.0,9.979383e+08,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,19.0,3.0,8.551901e+08,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141018,161944,287.0,5.0,1.470168e+09,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
141019,162376,73.0,4.5,1.474256e+09,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
141020,162542,611.0,5.0,1.471521e+09,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
141021,162672,611.0,3.0,1.471524e+09,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Feature engineering

In [114]:
# Build the working DataFrame
final_df = combined_df[['movieId', 'userId', 'rating']]  # Keep only the columns needed

In [115]:
final_df

Unnamed: 0,movieId,userId,rating
0,1,7.0,3.0
1,1,9.0,4.0
2,1,13.0,5.0
3,1,15.0,2.0
4,1,19.0,3.0
...,...,...,...
141018,161944,287.0,5.0
141019,162376,73.0,4.5
141020,162542,611.0,5.0
141021,162672,611.0,3.0


In [116]:
# Import the nearest-neighbors estimator
from sklearn.neighbors import NearestNeighbors

In [117]:
# Create an estimator for finding nearest neighbors
model = NearestNeighbors(n_neighbors=10, metric='euclidean')

In [118]:
# Fit on the movieId and rating columns (note that rating is also the prediction target)
model.fit(final_df[['movieId', 'rating']])

In [119]:
# Find the nearest neighbors of every row
distances, indices = model.kneighbors(final_df[['movieId', 'rating']])

In [120]:
# List for the mean rating of each row's neighbors
mean_neighbors = []

In [121]:
# Compute the mean rating of the nearest neighbors of every row
for idx in range(final_df.shape[0]):
    # Skip the first neighbor (assumed to be the row itself), which leaves 9 of the 10 neighbors
    neighbors_indices = indices[idx][1:]
    mean_value = final_df.loc[neighbors_indices, 'rating'].mean()  # 'rating' is the column name
    mean_neighbors.append(mean_value)

In [122]:
# Keep rows with a positive rating; every rating here is at least 0.5, so no rows are dropped
filtered_df = final_df[final_df['rating'] > 0].copy()  # Filtered copy
filtered_df['mean_neighbor_rating'] = mean_neighbors  # Add the mean rating of the neighbors

In [123]:
filtered_df

Unnamed: 0,movieId,userId,rating,mean_neighbor_rating
0,1,7.0,3.0,3.000000
1,1,9.0,4.0,4.000000
2,1,13.0,5.0,5.000000
3,1,15.0,2.0,2.000000
4,1,19.0,3.0,3.000000
...,...,...,...,...
141018,161944,287.0,5.0,2.777778
141019,162376,73.0,4.5,2.944444
141020,162542,611.0,5.0,2.888889
141021,162672,611.0,3.0,3.111111
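How `kneighbors` behaves on the data it was fitted on can be seen on a tiny made-up example. When the points are distinct, each point is returned as its own nearest neighbor at distance 0, so `n_neighbors=10` followed by dropping the first column leaves 9 neighbors. The sketch also replaces the Python loop with NumPy indexing; the array names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Five distinct points in a (movieId, rating) plane; values are illustrative.
points = np.array([[1, 3.0], [1, 4.0], [1, 5.0], [2, 2.0], [2, 3.5]])
ratings = points[:, 1]

knn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(points)
distances, indices = knn.kneighbors(points)

# Column 0 is the query point itself (distance 0), so only 2 real neighbors remain.
neighbor_idx = indices[:, 1:]

# Vectorized equivalent of looping over rows and averaging neighbor ratings.
mean_neighbor_rating = ratings[neighbor_idx].mean(axis=1)
print(mean_neighbor_rating)
```

With duplicate points, as in the real table where many rows share the same `movieId` and `rating`, the first returned index is not guaranteed to be the row itself.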


In [124]:
# Compute the mean rating of every movie
mean_rating_per_movie = final_df.groupby('movieId')['rating'].mean().reset_index()
mean_rating_per_movie.rename(columns={'rating': 'mean_movie_rating'}, inplace=True)  # Rename the column

In [125]:
mean_rating_per_movie

Unnamed: 0,movieId,mean_movie_rating
0,1,3.872470
1,2,3.401869
2,3,3.161017
3,4,2.384615
4,5,3.267857
...,...,...
9061,161944,5.000000
9062,162376,4.500000
9063,162542,5.000000
9064,162672,3.000000
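The assignment also mentions the median, the variance, and per-user statistics. A minimal sketch of how such features could be built with `groupby(...).agg(...)` follows; the toy frame, the helper `rating_stats`, and the resulting column names (`movie_mean`, `user_median`, and so on) are assumptions made for illustration, not names used elsewhere in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 10],
    "rating": [4.0, 2.0, 5.0, 3.0, 3.0],
})


def rating_stats(df, key, prefix):
    """Return mean, median, and variance of `rating` per `key`, with prefixed column names."""
    stats = df.groupby(key)["rating"].agg(["mean", "median", "var"])
    return stats.add_prefix(f"{prefix}_").reset_index()


movie_stats = rating_stats(toy, "movieId", "movie")
user_stats = rating_stats(toy, "userId", "user")

# Attach both sets of statistics to every rating row.
features = (
    toy.merge(movie_stats, on="movieId", how="left")
       .merge(user_stats, on="userId", how="left")
)
print(features.columns.tolist())
```

pandas computes `var` with `ddof=1`, so a user or movie with a single rating gets a NaN variance that has to be filled before fitting a model.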


In [126]:
# Merge the per-row DataFrame with the per-movie mean ratings
final_df = filtered_df.merge(mean_rating_per_movie, on='movieId', how='left')

In [127]:
final_df

Unnamed: 0,movieId,userId,rating,mean_neighbor_rating,mean_movie_rating
0,1,7.0,3.0,3.000000,3.87247
1,1,9.0,4.0,4.000000,3.87247
2,1,13.0,5.0,5.000000,3.87247
3,1,15.0,2.0,2.000000,3.87247
4,1,19.0,3.0,3.000000,3.87247
...,...,...,...,...,...
141018,161944,287.0,5.0,2.777778,5.00000
141019,162376,73.0,4.5,2.944444,4.50000
141020,162542,611.0,5.0,2.888889,5.00000
141021,162672,611.0,3.0,3.111111,3.00000


In [128]:
# Drop columns that are no longer needed
final_df.drop(columns=['userId', 'rating'], inplace=True)

In [129]:
final_df

Unnamed: 0,movieId,mean_neighbor_rating,mean_movie_rating
0,1,3.000000,3.87247
1,1,4.000000,3.87247
2,1,5.000000,3.87247
3,1,2.000000,3.87247
4,1,3.000000,3.87247
...,...,...,...
141018,161944,2.777778,5.00000
141019,162376,2.944444,4.50000
141020,162542,2.888889,5.00000
141021,162672,3.111111,3.00000


In [130]:
# Use movieId as the index in both DataFrames
final_df.set_index('movieId', inplace=True)
combined_df.set_index('movieId', inplace=True)

In [131]:
# Combine the DataFrames column-wise with concat
merged_df = pd.concat([final_df, combined_df], axis=1)

In [132]:
# Turn the index back into a regular column
merged_df.reset_index(inplace=True)

In [133]:
merged_df

Unnamed: 0,movieId,mean_neighbor_rating,mean_movie_rating,userId,rating,timestamp,1940,80,abigail,acclaimed,...,work,world,worth,wrongful,ww2,york,you,your,zombies,zooey
0,1,3.000000,3.87247,7.0,3.0,8.518667e+08,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,4.000000,3.87247,9.0,4.0,9.386292e+08,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,5.000000,3.87247,13.0,5.0,1.331380e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,2.000000,3.87247,15.0,2.0,9.979383e+08,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,3.000000,3.87247,19.0,3.0,8.551901e+08,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141018,161944,2.777778,5.00000,287.0,5.0,1.470168e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
141019,162376,2.944444,4.50000,73.0,4.5,1.474256e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
141020,162542,2.888889,5.00000,611.0,5.0,1.471521e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
141021,162672,3.111111,3.00000,611.0,3.0,1.471524e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [134]:
from sklearn.model_selection import train_test_split
# Split the data
test_size = 0.2  # 20% for the test set
random_state = 42  # For reproducible results

train_df, test_df = train_test_split(merged_df, test_size=test_size, random_state=random_state)

In [135]:
test_df

Unnamed: 0,movieId,mean_neighbor_rating,mean_movie_rating,userId,rating,timestamp,1940,80,abigail,acclaimed,...,work,world,worth,wrongful,ww2,york,you,your,zombies,zooey
126190,55908,3.555556,3.916667,199.0,4.5,1.214655e+09,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
69063,2109,3.277778,3.696078,485.0,3.5,1.337749e+09,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
72857,2383,1.888889,1.708333,518.0,2.0,9.453656e+08,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
120933,39446,3.833333,3.250000,384.0,3.5,1.153574e+09,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
132234,73017,4.000000,3.875000,339.0,4.0,1.446664e+09,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116803,8917,3.166667,3.425000,187.0,3.0,1.237163e+09,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
116537,8865,3.888889,2.727273,240.0,4.0,1.098941e+09,0.775794,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18120,293,3.000000,4.071970,107.0,3.0,1.446493e+09,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
58405,1527,1.277778,3.677966,395.0,1.0,9.530073e+08,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [136]:
train_df

Unnamed: 0,movieId,mean_neighbor_rating,mean_movie_rating,userId,rating,timestamp,1940,80,abigail,acclaimed,...,work,world,worth,wrongful,ww2,york,you,your,zombies,zooey
55605,1340,4.555556,3.750000,102.0,5.0,9.579801e+08,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
118687,30825,3.111111,3.320000,471.0,3.0,1.257193e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
110940,6814,2.944444,3.000000,624.0,3.0,1.066070e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96394,4226,3.500000,4.227273,468.0,3.5,1.296196e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
111913,6942,4.500000,3.581818,362.0,4.5,1.221862e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110268,6561,3.611111,3.875000,148.0,4.5,1.059507e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
119879,34334,2.555556,2.125000,380.0,2.0,1.148315e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
103694,5349,4.000000,3.522388,68.0,4.0,1.194743e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131932,72998,4.000000,3.641791,94.0,4.0,1.291780e+09,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
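In this notebook the per-movie mean is computed on all rows before the split, so each test rating contributes to its own feature. One way to avoid that, shown here as a sketch on toy data with hypothetical frame names, is to compute the aggregate on the training part only and map it onto the test part, falling back to the global training mean for unseen movies:

```python
import pandas as pd

train = pd.DataFrame({"movieId": [1, 1, 2], "rating": [3.0, 5.0, 2.0]})
test = pd.DataFrame({"movieId": [1, 3], "rating": [4.0, 1.0]})

# Statistics are estimated from the training rows only.
movie_mean = train.groupby("movieId")["rating"].mean()
global_mean = train["rating"].mean()

train["mean_movie_rating"] = train["movieId"].map(movie_mean)
# Movie 3 never appears in train, so it falls back to the global mean.
test["mean_movie_rating"] = test["movieId"].map(movie_mean).fillna(global_mean)
print(test)
```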


# Training and evaluating the model

#### 1. Model on all features and the movie's mean rating over all users

In [137]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Prepare the data
# Take the features and the target from the training set
X_train = train_df.drop(columns=['rating', 'mean_neighbor_rating', 'timestamp'])  # 'rating' is the target
y_train = train_df['rating']  # Target

# Same for the test set
X_test = test_df.drop(columns=['rating', 'mean_neighbor_rating', 'timestamp'])
y_test = test_df['rating']

# Choose the model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Predictions on the training set
y_train_pred = model.predict(X_train)
# Predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model
# Compute RMSE for the training and test sets
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Print the results
print(f'Root mean squared error (RMSE) on the training data: {rmse_train}')
print(f'Root mean squared error (RMSE) on the test data: {rmse_test}')

Root mean squared error (RMSE) on the training data: 0.8964176202626511
Root mean squared error (RMSE) on the test data: 0.9013793450307529


RMSE on the training data: 0.8964  
RMSE on the test data: 0.9014  

These values suggest that the model is not very accurate. An RMSE of about 0.90 on a rating scale from 0.5 to 5 means that a typical prediction misses the true rating by almost a full point. A likely reason is that the model captures little about individual users: it relies on the TF-IDF features and the movie's mean rating, plus the raw `movieId` and `userId` numbers, which are arbitrary identifiers and are unlikely to carry a meaningful linear signal.
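For reference, RMSE is the square root of the mean squared difference between the true and the predicted ratings. The sketch below computes it with NumPy on made-up numbers and compares a model's predictions with a constant baseline that always predicts the mean rating; such a baseline is a useful yardstick for judging whether a value like 0.90 is good. The function name `rmse` is an assumption.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up ratings and predictions, for illustration only.
y_true = np.array([3.0, 4.0, 5.0, 2.0])
y_pred = np.array([3.5, 4.0, 4.0, 2.5])

model_rmse = rmse(y_true, y_pred)
# Constant baseline: always predict the mean rating
# (in practice the mean would come from the training set).
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))
print(model_rmse, baseline_rmse)
```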

#### 2. Model on all features and the mean rating of the nearest neighbors (`mean_neighbor_rating`)

In [138]:
# Prepare the data
# Take the features and the target from the training set
X_train = train_df.drop(columns=['rating', 'mean_movie_rating', 'timestamp'])  # 'rating' is the target
y_train = train_df['rating']  # Target

# Same for the test set
X_test = test_df.drop(columns=['rating', 'mean_movie_rating', 'timestamp'])
y_test = test_df['rating']

# Choose the model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Predictions on the training set
y_train_pred = model.predict(X_train)
# Predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model
# Compute RMSE for the training and test sets
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Print the results
print(f'Root mean squared error (RMSE) on the training data: {rmse_train}')
print(f'Root mean squared error (RMSE) on the test data: {rmse_test}')

Root mean squared error (RMSE) on the training data: 0.35043139989551064
Root mean squared error (RMSE) on the test data: 0.3489616035207577


RMSE on the training data: 0.3504  
RMSE on the test data: 0.3490  

In the second experiment the results improved considerably: RMSE dropped to about 0.35. It is worth noting what `mean_neighbor_rating` actually measures. The neighbors were found with `NearestNeighbors` fitted on the `movieId` and `rating` columns, so they are the rows closest to the given row in that space (9 of them, once the first neighbor is skipped) and not the 10 most similar users. Because the target `rating` is itself one of the coordinates used in the search, the neighbors tend to have almost the same rating as the row being predicted, and the feature leaks the target. The low RMSE therefore most likely reflects this leakage and should not be read as evidence that similarity between users drives the prediction quality.
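A quick way to check whether a feature leaks the target is to build it the same way on data where the target is pure noise. In the sketch below (synthetic data, hypothetical names), ratings are drawn at random and are unrelated to `movie_id`, yet a neighbor mean found in the `(movieId, rating)` space still predicts them better than a constant, because the rating itself guides the neighbor search:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 500
movie_id = rng.integers(1, 21, size=n).astype(float)
rating = rng.choice(np.arange(0.5, 5.5, 0.5), size=n)  # random, unrelated to movie_id

# Same construction as in the notebook: neighbors in the (movieId, rating) space,
# then the mean rating over all returned neighbors except the first.
space = np.column_stack([movie_id, rating])
_, idx = NearestNeighbors(n_neighbors=10).fit(space).kneighbors(space)
leaky_feature = rating[idx[:, 1:]].mean(axis=1)


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# In-sample fit on the single leaky feature versus a constant prediction.
X = leaky_feature.reshape(-1, 1)
leaky_rmse = rmse(rating, LinearRegression().fit(X, rating).predict(X))
baseline_rmse = rmse(rating, np.full(n, rating.mean()))
print(leaky_rmse, baseline_rmse)
```

If a feature built this way beats the baseline on random targets, its gain on real data cannot be attributed to genuine signal alone.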

#### 3. Model on all features with both the nearest-neighbor mean rating and the movie's mean rating over all users

In [139]:
# Prepare the data
# Take the features and the target from the training set
X_train = train_df.drop(columns=['rating', 'timestamp'])  # 'rating' is the target
y_train = train_df['rating']  # Target

# Same for the test set
X_test = test_df.drop(columns=['rating', 'timestamp'])
y_test = test_df['rating']

# Choose the model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Predictions on the training set
y_train_pred = model.predict(X_train)
# Predictions on the test set
y_test_pred = model.predict(X_test)

# Evaluate the model
# Compute RMSE for the training and test sets
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Print the results
print(f'Root mean squared error (RMSE) on the training data: {rmse_train}')
print(f'Root mean squared error (RMSE) on the test data: {rmse_test}')

Root mean squared error (RMSE) on the training data: 0.34347784346067006
Root mean squared error (RMSE) on the test data: 0.34046836653417056


RMSE on the training data: 0.3435  
RMSE on the test data: 0.3405  

This experiment combines the two previous feature sets: the model sees both `mean_neighbor_rating` and `mean_movie_rating`. RMSE decreased further, to 0.3435 on the training data and 0.3405 on the test data. The gain over the second model is small (0.3490 versus 0.3405 on the test set), which suggests that the movie's mean rating adds little once the neighbor-based feature is present. Since `mean_neighbor_rating` is computed from the `rating` column itself, these numbers are likely optimistic; the combined model is the best of the three on this split, but the result should be confirmed with features built from the training data only.
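The three experiments differ only in which columns are dropped, so the repeated cells can be folded into one helper. The following is a sketch under that assumption; `fit_and_score` is a hypothetical name, and the toy frame merely stands in for `train_df` and `test_df`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def fit_and_score(train_df, test_df, drop_cols, target="rating"):
    """Fit a linear regression without `drop_cols` and return (train RMSE, test RMSE)."""
    excluded = set(drop_cols) | {target}
    cols = [c for c in train_df.columns if c not in excluded]
    model = LinearRegression().fit(train_df[cols], train_df[target])
    scores = []
    for part in (train_df, test_df):
        pred = model.predict(part[cols])
        scores.append(float(np.sqrt(mean_squared_error(part[target], pred))))
    return tuple(scores)


# Toy data: the rating depends linearly on feature `a`; `b` is noise.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy["rating"] = 3.0 + 0.5 * toy["a"] + rng.normal(scale=0.1, size=200)
train_part, test_part = toy.iloc[:160], toy.iloc[160:]

# Each experiment is described only by the columns it leaves out.
experiments = {"without a": ["a"], "all features": []}
results = {name: fit_and_score(train_part, test_part, drop) for name, drop in experiments.items()}
print(results)
```

In the notebook itself the dictionary of experiments would list the column sets dropped in each of the three cells above.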