# Practical project: building a recommender system

# Goal
Build a recommender system from a prepared dataset.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## Dataset

This dataset contains user preference data from 73,516 users on 12,294 anime. Each user can add an anime to their completed list and give it a rating; the dataset is a compilation of those ratings.

Anime.csv

    anime_id - unique myanimelist.net ID identifying the anime.
    name - full title of the anime.
    genre - comma-separated list of genres for this anime.
    type - Movie, TV, OVA, etc.
    episodes - number of episodes in the show (1 for a movie).
    rating - average rating out of 10 for this anime.
    members - number of community members in this anime's "group".

Rating.csv

    user_id - non-identifiable, randomly generated user ID.
    anime_id - the anime this user rated.
    rating - rating out of 10 assigned by this user (-1 if the user watched it but did not assign a rating).
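
Because `-1` marks a title that was watched but not scored, it is not a point on the 1-10 scale. A minimal sketch, on made-up rows, of separating the two cases before computing any statistics; the toy frame and the `explicit` name are illustrative assumptions, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for rating.csv (values are made up for illustration).
toy_rating = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "anime_id": [20, 24, 20, 79, 24],
    "rating":   [-1, 8, 10, -1, 6],
})

# Treat -1 ("watched, no score") as missing rather than as a low score.
explicit = toy_rating["rating"].replace(-1, np.nan)

n_unrated = int(explicit.isna().sum())   # rows without an explicit score
mean_explicit = explicit.mean()          # mean over scored rows only
print(n_unrated, mean_explicit)
```

Whether unscored views should be kept as implicit feedback or dropped depends on the model; the sketch only makes the distinction visible.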


In [6]:
df_rating = pd.read_csv('data/rating.csv')
df_anime = pd.read_csv('data/anime.csv')

In [7]:
print('Rating table')
df_rating.head(5)

Rating table


,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


In [8]:
print('Anime table')
df_anime.head(5)

Anime table


,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


The data shows that `anime_id` is the key linking the two tables. Let's count the distinct `anime_id` values in each table.

In [17]:
print("rating.csv count records: {}, \nanime.csv count records: {}".format(
    len(df_rating.groupby('anime_id').count()),
    len(df_anime.groupby('anime_id').count())
))

rating.csv count records: 11200, 
anime.csv count records: 12294
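
The same distinct-ID counts can also be obtained with `Series.nunique()`, which avoids building a grouped frame first. A small sketch on made-up data (the toy frames and variable names are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the two tables.
toy_rating = pd.DataFrame({"user_id": [1, 1, 2], "anime_id": [20, 24, 20]})
toy_anime = pd.DataFrame({"anime_id": [20, 24, 79], "name": ["A", "B", "C"]})

# Number of distinct anime_id values in each table.
n_rated_titles = toy_rating["anime_id"].nunique()
n_catalog_titles = toy_anime["anime_id"].nunique()
print(n_rated_titles, n_catalog_titles)
```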


The `anime` table contains more titles, so the rows that have no match in `rating` need to be dropped. We select only the `anime_id` values that are present in both tables.

In [50]:
set_of_rating_id = set(df_rating.groupby('anime_id').count().index)
set_of_anime_id = set(df_anime.groupby('anime_id').count().index)

set_of_inters_anime_id = set_of_rating_id.intersection(set_of_anime_id)
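
A toy illustration of the idea, assuming made-up IDs: the intersection keeps the shared IDs, while the two set differences show which IDs would be lost from each side. Unmatched IDs can exist in both directions, so it can be worth checking both. All names below are hypothetical:

```python
# Made-up ID sets standing in for the two tables.
rating_ids = {20, 24, 500}
anime_ids = {20, 24, 79}

shared_ids = rating_ids & anime_ids        # IDs present in both tables
only_in_rating = rating_ids - anime_ids    # rated, but missing from the catalog
only_in_anime = anime_ids - rating_ids     # in the catalog, but never rated
print(shared_ids, only_in_rating, only_in_anime)
```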

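Filter both tables by the selected `anime_id` values and rename the two `rating` columns so that they remain distinguishable after merging.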

In [56]:
# Keep only titles present in both tables; .copy() makes each result an
# independent frame, which avoids SettingWithCopyWarning on later edits.
df_anime = df_anime[df_anime.anime_id.isin(set_of_inters_anime_id)].copy()
df_rating = df_rating[df_rating.anime_id.isin(set_of_inters_anime_id)].copy()

# Rename the two `rating` columns so they stay distinguishable after the merge.
df_rating = df_rating.rename(columns={'rating': 'user_rating'})
df_anime = df_anime.rename(columns={'rating': 'avg_rating'})


Merge the tables and save the result to a file.

In [59]:
df = pd.merge(df_anime,df_rating,on='anime_id')
df.to_csv('anime_rating.csv',index=False)
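
Note that `pd.merge` performs an inner join by default, so it keeps only keys found in both frames. A minimal sketch on made-up rows; the outer join with `indicator=True` is shown only as a way to audit which rows would otherwise be dropped, and the toy frames are assumptions:

```python
import pandas as pd

toy_anime = pd.DataFrame({"anime_id": [20, 24, 79], "name": ["A", "B", "C"]})
toy_rating = pd.DataFrame({
    "user_id": [1, 2, 2],
    "anime_id": [20, 20, 500],
    "user_rating": [8, 10, 7],
})

# how="inner" is the default: only keys found in both frames survive.
merged = pd.merge(toy_anime, toy_rating, on="anime_id")

# An outer join with indicator=True labels the origin of every row.
audit = pd.merge(toy_anime, toy_rating, on="anime_id", how="outer", indicator=True)
print(merged)
print(audit["_merge"].value_counts())
```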

Read the saved file back.

In [62]:
df = pd.read_csv('anime_rating.csv')
df.head(5)

,anime_id,name,genre,type,episodes,avg_rating,members,user_id,user_rating
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,99,5
1,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,152,10
2,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,244,10
3,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,271,10
4,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630,278,-1


# EDA

Write a function that builds the user-item interaction matrix, then build the matrix.

In [65]:
from scipy import sparse
from lightfm import LightFM
from sklearn.metrics.pairwise import cosine_similarity

def create_interaction_matrix(df, user_col, item_col, rating_col, norm=False, threshold=None):
    # Pivot to a users x items table; pairs with no rating become 0.
    interactions = (df.groupby([user_col, item_col])[rating_col]
                      .sum().unstack().reset_index()
                      .fillna(0).set_index(user_col))
    if norm:
        # Binarize: 1 where the rating exceeds the threshold, otherwise 0.
        interactions = (interactions > threshold).astype(int)
    return interactions



In [156]:
%%time

interactions = create_interaction_matrix(df = df,
                                         user_col = 'user_id',
                                         item_col = 'anime_id',
                                         rating_col = 'user_rating',
                                         threshold = 0)
interactions.shape

CPU times: user 48.1 s, sys: 45 s, total: 1min 33s
Wall time: 1min 40s


(73515, 11197)
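
The dense pivot above has about 823 million cells (73515 × 11197), most of them zero. One possible alternative, sketched here under assumptions, is to build a SciPy sparse matrix directly from the (user, item, rating) triples. The helper name `build_sparse_interactions` and the toy data are hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd
from scipy import sparse

def build_sparse_interactions(df, user_col, item_col, rating_col):
    """Hypothetical helper: CSR matrix plus the ID labels of its rows and columns."""
    users = pd.Categorical(df[user_col])   # codes give 0-based row positions
    items = pd.Categorical(df[item_col])   # codes give 0-based column positions
    matrix = sparse.coo_matrix(
        (df[rating_col].to_numpy(dtype=np.float32), (users.codes, items.codes)),
        shape=(len(users.categories), len(items.categories)),
    ).tocsr()  # duplicate (user, item) pairs are summed on conversion
    return matrix, users.categories, items.categories

toy = pd.DataFrame({
    "user_id": [5, 5, 7, 9],
    "anime_id": [6, 15, 6, 20],
    "user_rating": [8, 6, -1, 10],
})
matrix, user_ids, item_ids = build_sparse_interactions(
    toy, "user_id", "anime_id", "user_rating")
print(matrix.shape)
print(matrix.toarray())
```

The returned `user_ids` and `item_ids` play the role of the DataFrame index and columns, so the ID-to-position dictionaries used later could be derived from them.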

In [157]:
interactions.head()

user_id \ anime_id,1,5,6,7,8,15,16,17,18,19,...,34283,34324,34325,34349,34358,34367,34412,34475,34476,34519
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,8.0,0.0,0.0,6.0,0.0,6.0,6.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Define a function that maps each user ID to its row position in the interaction matrix, and build that dictionary.

In [158]:
def create_user_dict(interactions):
    # Map each user_id (a row label of the matrix) to its 0-based row position.
    return {user_id: row for row, user_id in enumerate(interactions.index)}

In [159]:
user_dict = create_user_dict(interactions=interactions)

Define a function that maps each anime ID to its title, and build that dictionary.

In [160]:
def create_item_dict(df, id_col, name_col):
    # Map each item ID to its title, visiting the frame row by row.
    item_dict = {}
    for i in range(df.shape[0]):
        item_dict[df.iloc[i][id_col]] = df.iloc[i][name_col]
    return item_dict

In [161]:
%%time
# `df_test` is not defined anywhere in this notebook; the merged table `df` is assumed here.
movies_dict = create_item_dict(df = df, id_col = 'anime_id', name_col = 'name')

CPU times: user 5min 33s, sys: 2.71 s, total: 5min 36s
Wall time: 6min 4s
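
Most of that time goes into row-by-row `iloc` access. A possibly faster variant, sketched under the assumption that each ID always carries the same title, deduplicates the IDs first and builds the dictionary in one pass. The name `create_item_dict_fast` is hypothetical:

```python
import pandas as pd

def create_item_dict_fast(df, id_col, name_col):
    """Hypothetical vectorized variant: one entry per distinct ID."""
    unique_items = df[[id_col, name_col]].drop_duplicates(subset=id_col)
    return dict(zip(unique_items[id_col], unique_items[name_col]))

# Toy rows shaped like the merged table (one row per user rating).
toy = pd.DataFrame({
    "anime_id": [32281, 32281, 5114],
    "name": ["Kimi no Na wa.", "Kimi no Na wa.", "Fullmetal Alchemist: Brotherhood"],
    "user_id": [99, 152, 7],
})
toy_dict = create_item_dict_fast(toy, "anime_id", "name")
print(toy_dict)
```

If an ID could appear with different titles, this variant keeps the first one seen, whereas the row-by-row loop keeps the last.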


# Building the model

We use the `LightFM` implementation as the recommendation algorithm.

In [162]:
def runMF(interactions, n_components=30, loss='warp', k=15, epoch=30, n_jobs=4):
    # LightFM takes a SciPy sparse matrix, so convert the dense DataFrame first.
    x = sparse.csr_matrix(interactions.values)
    model = LightFM(no_components=n_components, loss=loss, k=k)
    model.fit(x, epochs=epoch, num_threads=n_jobs)
    return model
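
For intuition, a factorization model of this kind scores a (user, item) pair from latent vectors: as the LightFM documentation describes it, the dot product of the user and item representations, adjusted by user and item biases. The sketch below uses random NumPy arrays as stand-ins for learned parameters. It does not call LightFM, and every name in it is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_components = 4, 6, 3

# Random stand-ins for parameters that training would learn.
user_emb = rng.normal(size=(n_users, n_components))
item_emb = rng.normal(size=(n_items, n_components))
user_bias = rng.normal(size=n_users)
item_bias = rng.normal(size=n_items)

def score_all_items(user_row):
    """Score of every item for one user: dot product plus biases."""
    return item_emb @ user_emb[user_row] + item_bias + user_bias[user_row]

scores = score_all_items(user_row=2)
ranking = np.argsort(-scores)  # item positions, best first
print(ranking)
```

Here `n_components` corresponds to the length of each latent vector, which is what the `no_components` argument controls in the function above.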

Train the model.

In [163]:
%%time
mf_model = runMF(interactions = interactions,
                 n_components = 30,
                 loss = 'warp',
                 k =15,
                 epoch = 30,
                 n_jobs = 4)

CPU times: user 8min 56s, sys: 14.5 s, total: 9min 11s
Wall time: 10min


# Checking the recommendations

In [164]:
def sample_recommendation_user(model, interactions, user_id, user_dict,
                               item_dict, threshold=0, nrec_items=10, show=True):
    n_users, n_items = interactions.shape
    user_x = user_dict[user_id]  # row position of this user in the matrix
    # Score every item for this user and order the item IDs from best to worst.
    scores = pd.Series(model.predict(user_x, np.arange(n_items)),
                       index=interactions.columns)
    scores = list(scores.sort_values(ascending=False).index)

    # Items the user already rated above the threshold, highest ID first.
    user_row = interactions.loc[user_id, :]
    known_items = list(pd.Series(user_row[user_row > threshold].index)
                       .sort_values(ascending=False))

    # Drop the known items and keep the IDs of the top nrec_items.
    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    known_items = [item_dict[x] for x in known_items]
    scores = [item_dict[x] for x in return_score_list]
    if show:
        print("Known Likes:")
        for counter, name in enumerate(known_items, start=1):
            print(str(counter) + '- ' + name)

        print("\n Recommended Items:")
        for counter, name in enumerate(scores, start=1):
            print(str(counter) + '- ' + name)
    return return_score_list
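
The core of this function is "rank all items by score, drop what the user already rated above the threshold, keep the top N". A self-contained sketch of that logic on a toy matrix, with made-up scores in place of `model.predict`; the helper `top_unseen` and all data are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy interaction matrix: rows are user IDs, columns are item IDs.
toy_interactions = pd.DataFrame(
    [[8.0, 0.0, 6.0, 0.0], [0.0, 9.0, 0.0, -1.0]],
    index=pd.Index([39, 40], name="user_id"),
    columns=pd.Index([1, 5, 6, 7], name="anime_id"),
)
# Made-up model scores for user 39, one per column of the matrix.
fake_scores = np.array([0.9, 0.2, 0.7, 0.5])

def top_unseen(interactions, user_id, scores, threshold=0, nrec_items=2):
    # Attach item IDs to the scores and sort from best to worst.
    ranked = pd.Series(scores, index=interactions.columns).sort_values(ascending=False)
    # Items this user rated above the threshold count as already known.
    user_row = interactions.loc[user_id]
    known = set(user_row[user_row > threshold].index)
    return [item for item in ranked.index if item not in known][:nrec_items]

recommended = top_unseen(toy_interactions, 39, fake_scores)
print(recommended)
```

Raising `threshold` shrinks the set of known items, so a title the user rated at or below it can appear among the recommendations again.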

Generate recommendations for a user.

In [182]:
rec_list = sample_recommendation_user(model = mf_model, 
                                      interactions = interactions, 
                                      user_id = 39, 
                                      user_dict = user_dict,
                                      item_dict = movies_dict, 
                                      threshold = 5,
                                      nrec_items = 5)

Known Likes:
1- Death Parade
2- Mekakucity Actors
3- Hitsugi no Chaika
4- Free!
5- Little Busters!: Refrain
6- Gin no Saji
7- Mirai Nikki Redial
8- Watashi ga Motenai no wa Dou Kangaetemo Omaera ga Warui!
9- Suisei no Gargantia
10- Shingeki no Kyojin
11- Photokano
12- Aku no Hana
13- Yahari Ore no Seishun Love Comedy wa Machigatteiru.
14- Ore no Kanojo to Osananajimi ga Shuraba Sugiru
15- Sukitte Ii na yo.
16- Tonari no Kaibutsu-kun
17- Little Busters!
18- Shinsekai yori
19- Kokoro Connect
20- Fate/Zero 2nd Season
21- Mirai Nikki (TV)
22- Working&#039;!!
23- Fate/Zero
24- Mahou Shoujo Madoka★Magica
25- HHH Triple Ecchi
26- Ore no Imouto ga Konnani Kawaii Wake ga Nai
27- Kanojo x Kanojo x Kanojo: Sanshimai to no DokiDoki Kyoudou Seikatsu
28- Working!!
29- Fairy Tail
30- Angel Beats!
31- K-On!
32- Fullmetal Alchemist: Brotherhood
33- Clannad: After Story
34- Code Geass: Hangyaku no Lelouch R2
35- School Days
36- Clannad
37- Tengen Toppa Gurren Lagann
38- Lucky☆Star
39- Katekyo Hitman Reb

# Conclusion

In this project we built a system that recommends anime titles to a user based on the anime that user has previously rated.