#### Homework for the lesson "Content-Based Recommendations"

1. Use the MovieLens dataset
2. Build recommendations (regression: predict the rating) on the following features:  
TF-IDF on tags and genres  
Mean ratings (+ median, variance, etc.) of the user and the movie
3. Evaluate RMSE on the test set

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

%matplotlib inline

In [2]:
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')
links = pd.read_csv('links.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [5]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,15,339,sandra 'boring' bullock,1138537770
1,15,1955,dentist,1193435061
2,15,7478,Cambodia,1170560997
3,15,32892,Russian,1170626366
4,15,34162,forgettable,1141391765


Create a list with the mean rating of each movie

In [6]:
# ratings.groupby('movieId').rating.mean()

In [7]:
mean_rank_list = ratings.groupby('movieId').rating.mean().values.tolist()
print(len(mean_rank_list))
mean_rank_list[:5]

9066


[3.8724696356275303,
 3.4018691588785046,
 3.1610169491525424,
 2.3846153846153846,
 3.267857142857143]

Create a list with the ids of all rated movies

In [8]:
movieId_list = ratings.groupby('movieId').rating.mean().keys().tolist()
print(len(movieId_list))
movieId_list[:15]

9066


[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

Create a dataframe holding each movie's id and its mean rating

In [9]:
movie_mean_rank = pd.DataFrame({'movieId': movieId_list, 'movie_mean_rank': mean_rank_list})
movie_mean_rank[:5]

Unnamed: 0,movieId,movie_mean_rank
0,1,3.87247
1,2,3.401869
2,3,3.161017
3,4,2.384615
4,5,3.267857
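
The task also asks for statistics beyond the mean (median, variance, and so on). Below is a minimal sketch of how they might be computed in one pass with `groupby(...).agg(...)`. The toy `ratings_demo` frame and the `movie_*` / `user_*` column names are illustrative assumptions and do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for ratings.csv (made-up values, same column names).
ratings_demo = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 10, 30],
    'rating':  [4.0, 3.0, 5.0, 2.0, 3.0, 4.5],
})

# One row per movie with several summary statistics of its ratings.
# The sample variance is NaN for a movie with a single rating.
movie_stats = (
    ratings_demo.groupby('movieId')['rating']
    .agg(movie_mean='mean', movie_median='median',
         movie_var='var', movie_count='count')
    .reset_index()
)

# The same aggregation per user.
user_stats = (
    ratings_demo.groupby('userId')['rating']
    .agg(user_mean='mean', user_median='median',
         user_var='var', user_count='count')
    .reset_index()
)

print(movie_stats)
print(user_stats)
```

Both frames keep `movieId` / `userId` as ordinary columns, so they can be merged onto a ratings table by key.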


Create a dataframe that holds each movie's title, genres, and tags together

In [10]:
movies_genres_tags = movies.join(tags.set_index('movieId'), on='movieId')

In [11]:
movies_genres_tags.head()

Unnamed: 0,movieId,title,genres,userId,tag,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,501.0,Pixar,1292956000.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,,,
2,3,Grumpier Old Men (1995),Comedy|Romance,,,
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,,,
4,5,Father of the Bride Part II (1995),Comedy,431.0,steve martin,1140455000.0


Add a column with the movies' mean ratings to it

In [12]:
movies_genres_tags_mmranks = movies_genres_tags.join(movie_mean_rank.set_index('movieId'), on='movieId')

Drop the rows with missing values and the unneeded column

In [13]:
movies_genres_tags_mmranks.dropna(inplace=True)
movies_genres_tags_mmranks.drop(['timestamp'], axis=1, inplace=True)

In [14]:
movies_genres_tags_mmranks.head()

Unnamed: 0,movieId,title,genres,userId,tag,movie_mean_rank
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,501.0,Pixar,3.87247
4,5,Father of the Bride Part II (1995),Comedy,431.0,steve martin,3.267857
45,47,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,364.0,biblical,4.034826
45,47,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,364.0,crime,4.034826
45,47,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,364.0,dark,4.034826


From the resulting dataframe, pick the most active user: the one with the most rows, which here means the most tags assigned (value_counts on this dataframe counts tag rows, not ratings)

In [15]:
# movies_genres_tags_mmranks.userId.value_counts()
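
The commented-out `value_counts()` call can also return the most frequent id directly instead of being read off by eye. A small sketch on made-up data; `tags_demo` and `top_user` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Toy stand-in for the joined movie/tag table (made-up values).
tags_demo = pd.DataFrame({
    'userId':  [15, 15, 547, 547, 547, 431],
    'movieId': [339, 1955, 215, 293, 306, 5],
    'tag':     ['boring', 'dentist', 'holes90s', 'holes90s',
                'holes90s', 'steve martin'],
})

# value_counts() counts rows per userId; idxmax() returns the id
# with the largest count.
tag_counts = tags_demo['userId'].value_counts()
top_user = tag_counts.idxmax()

print(top_user, tag_counts.loc[top_user])
```

To pick the user with the most ratings instead, the same two lines could be applied to the `userId` column of the ratings table.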

That user turned out to be the one with id 547. Create a dataframe with only this user's movies

In [16]:
user_movies = movies_genres_tags_mmranks[movies_genres_tags_mmranks.userId == 547.0]

In [17]:
user_movies.head()

Unnamed: 0,movieId,title,genres,userId,tag,movie_mean_rank
189,215,Before Sunrise (1995),Drama|Romance,547.0,holes90s,3.722222
263,293,Léon: The Professional (a.k.a. The Professiona...,Action|Crime|Drama|Thriller,547.0,holes90s,4.07197
274,306,Three Colors: Red (Trois couleurs: Rouge) (1994),Drama,547.0,holes90s,4.171875
285,319,Shallow Grave (1994),Comedy|Drama|Thriller,547.0,holes90s,3.973684
328,364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,547.0,holes90s,3.7775


Add a column with the mean rating this user gives

In [18]:
# assign() returns a new dataframe, which avoids SettingWithCopyWarning on the filtered slice
user_movies = user_movies.assign(user_mean_rank=ratings[ratings.userId == 547.0].rating.mean())
user_movies.head()


Unnamed: 0,movieId,title,genres,userId,tag,movie_mean_rank,user_mean_rank
189,215,Before Sunrise (1995),Drama|Romance,547.0,holes90s,3.722222,3.366792
263,293,Léon: The Professional (a.k.a. The Professiona...,Action|Crime|Drama|Thriller,547.0,holes90s,4.07197,3.366792
274,306,Three Colors: Red (Trois couleurs: Rouge) (1994),Drama,547.0,holes90s,4.171875,3.366792
285,319,Shallow Grave (1994),Comedy|Drama|Thriller,547.0,holes90s,3.973684,3.366792
328,364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,547.0,holes90s,3.7775,3.366792


Add a column with the ratings this user gave to the movies they watched

To do this, first create a temporary dataframe containing only the ratings given by our user

In [19]:
cut_ratings = ratings[ratings.userId == 547.0].drop(['timestamp', 'userId'], axis=1)

In [20]:
cut_ratings.head()

Unnamed: 0,movieId,rating
78433,1,3.5
78434,6,2.5
78435,7,2.0
78436,11,3.0
78437,14,3.5


Now join it with the existing dataframe

In [21]:
user_movies = user_movies.join(cut_ratings.set_index('movieId'), on='movieId')

Unfortunately, our user sometimes tagged a movie without rating it (and vice versa), so drop the new missing values and, while we are at it, the column with the user's id

In [22]:
user_movies.dropna(inplace=True)
user_movies.drop(['userId'], axis=1, inplace=True)
user_movies.reset_index(drop=True, inplace=True)

In [23]:
user_movies.head()

Unnamed: 0,movieId,title,genres,tag,movie_mean_rank,user_mean_rank,rating
0,541,Blade Runner (1982),Action|Sci-Fi|Thriller,afi,4.037671,3.366792,3.5
1,954,Mr. Smith Goes to Washington (1939),Drama,afi,4.225806,3.366792,4.5
2,1153,Raw Deal (1948),Film-Noir,tcm,3.0,3.366792,3.0
3,1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama,holes80s,4.26087,3.366792,5.0
4,3022,"General, The (1926)",Comedy|War,afi,4.25,3.366792,4.5


We now have a dataframe of a single user's movies, with their genres, the mean ratings of the movie and of the user, and the tags and ratings the user assigned

Create two lists: one with the movie ids, the other with strings of genres and tags

In [24]:
str_movieId = []
str_genres_tags = []

for movie, group in user_movies.groupby('movieId'):
    str_movieId.append(movie)
    # Genres are identical on every row of a movie; strip spaces and hyphens so that 'Sci-Fi' becomes one token
    str_genres = ' '.join(str(group.genres.iloc[0]).replace(' ', '').replace('-', '').split('|'))
    # Collapse each tag into a single token in the same way
    str_tags = ' '.join([str(s).replace(' ', '').replace('-', '') for s in group.tag.values])
    str_genres_tags.append(str_genres + ' ' + str_tags)

This is the resulting list of movie ids

In [25]:
str_movieId[:5]

[541, 954, 1153, 1172, 3022]

And this is the resulting list, in which each string holds both the genres and the tags of a movie

In [26]:
str_genres_tags[:5]

['Action SciFi Thriller afi',
 'Drama afi',
 'FilmNoir tcm',
 'Drama holes80s',
 'Comedy War afi']

In [27]:
print(len(str_movieId))
print(len(str_genres_tags))

174
174
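
The normalization in the loop matters most for multi-word values: without it, a tag such as "serial killer" would later be split into two unrelated tokens. A minimal sketch on made-up rows; the helper `to_token` and the frame `demo` are hypothetical and only restate the same replace-and-join logic.

```python
import pandas as pd

def to_token(text):
    """Collapse a multi-word genre or tag into a single token."""
    return str(text).replace(' ', '').replace('-', '')

# Toy rows: movie 47 carries two tags, movie 541 carries one.
demo = pd.DataFrame({
    'movieId': [47, 47, 541],
    'genres':  ['Mystery|Thriller', 'Mystery|Thriller',
                'Action|Sci-Fi|Thriller'],
    'tag':     ['serial killer', 'dark', 'afi'],
})

docs = {}
for movie_id, group in demo.groupby('movieId'):
    # Genres repeat on every row of a movie, so the first row is enough.
    genres = [to_token(g) for g in group['genres'].iloc[0].split('|')]
    tags_ = [to_token(t) for t in group['tag']]
    docs[movie_id] = ' '.join(genres + tags_)

print(docs)
```

Keying the result by `movieId` keeps each string attached to its movie, which makes later joins less dependent on list order.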


Take the second list with genres and tags (str_genres_tags) and fit a CountVectorizer on it. This turns the strings into a matrix of token counts: in each row, a zero means the movie lacks a given genre or tag, and a positive count (here almost always 1) means it has it, much like a OneHotEncoder. Note that CountVectorizer lowercases tokens by default

In [28]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(str_genres_tags)

Look at the result

In [29]:
X_train_counts.todense()

matrix([[1, 0, 1, ..., 0, 0, 0],
        [0, 0, 1, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
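
The dense matrix alone does not show which column stands for which token. A short sketch, on made-up strings, of reading the column order from the fitted `vocabulary_` attribute; the third document repeats a token to show that entries are counts rather than strict 0/1 flags.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents in the same "genres + tags" format.
docs = ['Action SciFi Thriller afi', 'Drama afi', 'Drama Drama tcm']

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# vocabulary_ maps each lowercased token to its column index;
# sorting by that index gives the column order of the matrix.
columns = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(columns)
print(counts.toarray())
```

If strict presence/absence flags are wanted, `CountVectorizer(binary=True)` caps every non-zero count at 1.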

Fit a TfidfTransformer on these counts, turning the previous matrix into a matrix of TF-IDF weights in which each row is the feature vector of one movie

In [30]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

The result is shown below

In [31]:
X_train_tfidf.todense()

matrix([[0.49731318, 0.        , 0.62931596, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.96463268, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

In [32]:
X_train_tfidf.shape

(174, 35)
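
The two-step CountVectorizer + TfidfTransformer combination can also be written as a single `TfidfVectorizer`. The sketch below, on made-up strings, checks that both routes give the same matrix with default parameters and that each row is L2-normalized, which is the default `norm` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

# Made-up documents in the same "genres + tags" format.
docs = ['Action SciFi Thriller afi', 'Drama afi',
        'FilmNoir tcm', 'Comedy War afi']

# Two steps: raw counts first, then TF-IDF weighting.
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(docs)
)

# One step with default parameters.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())

# Every non-empty row has unit Euclidean length after normalization.
row_norms = np.linalg.norm(one_step.toarray(), axis=1)

print(same, row_norms)
```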

Convert the resulting matrix into a dataframe

In [33]:
tfidf_df = pd.DataFrame(X_train_tfidf.todense())
tfidf_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,25,26,27,28,29,30,31,32,33,34
0,0.497313,0.0,0.629316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.964633,0.0,0.0,0.0,0.0,0.0,0.263598,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.221833,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.696407,0.0,0.0,0.407918,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.590441,0.0


Join it with the main dataframe

In [34]:
final = user_movies.join(tfidf_df, how='outer')
final.head()

Unnamed: 0,movieId,title,genres,tag,movie_mean_rank,user_mean_rank,rating,0,1,2,...,25,26,27,28,29,30,31,32,33,34
0,541,Blade Runner (1982),Action|Sci-Fi|Thriller,afi,4.037671,3.366792,3.5,0.497313,0.0,0.629316,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,954,Mr. Smith Goes to Washington (1939),Drama,afi,4.225806,3.366792,4.5,0.0,0.0,0.964633,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1153,Raw Deal (1948),Film-Noir,tcm,3.0,3.366792,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama,holes80s,4.26087,3.366792,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3022,"General, The (1926)",Comedy|War,afi,4.25,3.366792,4.5,0.0,0.0,0.696407,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.590441,0.0
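
Joining on the row index lines up correctly only if `user_movies` has exactly one row per movie, in the same order as `str_movieId`; whether that holds for the full data is not shown here. A more defensive variant, sketched on made-up rows, carries `movieId` into the TF-IDF frame and merges on that key. The `tfidf_` column prefix and all `*_demo` names are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# One row per (movie, tag): movie 47 appears twice.
user_movies_demo = pd.DataFrame({
    'movieId': [47, 47, 541],
    'tag':     ['dark', 'crime', 'afi'],
    'rating':  [4.0, 4.0, 3.5],
})
movie_ids = [47, 541]
docs = ['Mystery Thriller dark crime', 'Action SciFi Thriller afi']

vect = TfidfVectorizer()
tfidf = vect.fit_transform(docs)
feature_names = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# Named feature columns plus an explicit movieId key.
tfidf_demo = pd.DataFrame(
    tfidf.toarray(), columns=['tfidf_' + name for name in feature_names]
)
tfidf_demo.insert(0, 'movieId', movie_ids)

# Reduce to one row per movie, then merge on the key, not the index.
per_movie = user_movies_demo.drop(columns='tag').drop_duplicates('movieId')
final_demo = per_movie.merge(tfidf_demo, on='movieId', how='inner')

print(final_demo)
```

With named columns the features can later be selected by prefix rather than by position.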


Move the movie's rating to the very end of the dataframe. This makes it easier to split the frame into the independent variables (X) and the dependent variable (y)

In [35]:
cols = final.columns.tolist()

In [36]:
# cols

In [37]:
cols = cols[:6] + cols[7:] + [cols[6]]

In [38]:
final = final[cols]
final.head()

Unnamed: 0,movieId,title,genres,tag,movie_mean_rank,user_mean_rank,0,1,2,3,...,26,27,28,29,30,31,32,33,34,rating
0,541,Blade Runner (1982),Action|Sci-Fi|Thriller,afi,4.037671,3.366792,0.497313,0.0,0.629316,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5
1,954,Mr. Smith Goes to Washington (1939),Drama,afi,4.225806,3.366792,0.0,0.0,0.964633,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.5
2,1153,Raw Deal (1948),Film-Noir,tcm,3.0,3.366792,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
3,1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama,holes80s,4.26087,3.366792,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
4,3022,"General, The (1926)",Comedy|War,afi,4.25,3.366792,0.0,0.0,0.696407,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.590441,0.0,4.5


Define the independent variables

In [39]:
X = final[final.columns[4:-1]]  # equivalently: final.iloc[:, 4:-1]
X

Unnamed: 0,movie_mean_rank,user_mean_rank,0,1,2,3,4,5,6,7,...,25,26,27,28,29,30,31,32,33,34
0,4.037671,3.366792,0.497313,0.0,0.629316,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0
1,4.225806,3.366792,0.000000,0.0,0.964633,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0
2,3.000000,3.366792,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0
3,4.260870,3.366792,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0
4,4.250000,3.366792,0.000000,0.0,0.696407,0.0,0.0,0.407918,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.590441,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169,3.500000,3.366792,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.766232,0.000000,0.0,0.000000,0.0
170,4.166667,3.366792,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.936748,0.000000,0.0,0.000000,0.0
171,3.600000,3.366792,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.936748,0.000000,0.0,0.000000,0.0
172,4.000000,3.366792,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.971115,0.0,0.000000,0.0


Define the dependent variable

In [40]:
y = final['rating']
y[:10]

0    3.5
1    4.5
2    3.0
3    5.0
4    4.5
5    3.5
6    4.0
7    4.5
8    4.5
9    5.0
Name: rating, dtype: float64

Fit a linear regression model on this data, after first splitting it into training and test sets

In [41]:
model = LinearRegression()

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [43]:
model.fit(X_train, y_train)

LinearRegression()

In [44]:
y_predicted = model.predict(X_test)
y_predicted[:10]

array([4.77246094, 4.33984375, 4.13574219, 4.47265625, 4.25      ,
       4.43066406, 4.08984375, 4.421875  , 3.84179688, 4.06054688])

Compute RMSE. To get RMSE rather than MSE, remember to set squared=False

In [45]:
mean_squared_error(y_test, y_predicted, squared=False)

7090408522.217026