# The Movies Dataset TF-IDF Content-Based Recommendation

### Dataset

The dataset contains information on 45,000 films released up to July 2017. It is available on Kaggle at https://www.kaggle.com/rounakbanik/the-movies-dataset, where its files are described as follows:

`movies_metadata.csv`: The main movie metadata file. Contains information on the 45,000 movies featured in the Full MovieLens dataset. The table includes posters, backdrops, budget, revenue, release dates, languages, production countries and production companies.

`keywords.csv`: Contains the plot keywords for our MovieLens movies. Available as a stringified JSON object.

`credits.csv`: Consists of cast and crew information for all our movies. Available as a stringified JSON object.

`links.csv`: Contains the TMDB and IMDB identifiers of all the movies featured in the Full MovieLens dataset.

`links_small.csv`: Contains the TMDB and IMDB identifiers of a small subset of 9,000 movies from the full dataset.

`ratings_small.csv`: A subset of 100,000 ratings from 700 users on 9,000 movies.

`ratings.csv`: The full list of ratings given by users to movies.



Let us look at two tables in more detail:

*   `movies_metadata.csv`
*   `ratings.csv`

In [None]:
import pandas as pd
import numpy as np

In [None]:
#! pip install --upgrade --no-cache-dir gdown

In [None]:
#!gdown 14LeiPV598IHba4rGi07mhfC7O2Iw-NoL

In [None]:
#!gdown 17nemNXYD8D_rNdWSgVbcBrrNAozxTWgg

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

We start by analysing the `movies_metadata` dataframe.

## Task 1

In [None]:
metadata = pd.read_csv(filepath_or_buffer = "movies_metadata.csv")

In [None]:
metadata.info()

Remove from the `metadata` dataframe the rows that have no overview. Note that some films formally have an overview, but it only says `No overview found`, `No overview`, `No movie overview available` or `Released`. Such rows must be removed as well.
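A minimal sketch of this filtering step on a toy frame. The sample rows and the `PLACEHOLDERS` set are illustrative assumptions; on the real data the set of placeholder strings should be derived from the overviews themselves.

```python
import pandas as pd

# Toy stand-in for the metadata frame (illustrative rows, not real data).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'overview': ['A long real plot summary.', None, 'No overview found.',
                 'Released', ' no movie overview available '],
})

# Hypothetical placeholder list, stored lowercase and without a trailing dot.
PLACEHOLDERS = {'no overview found', 'no overview',
                'no movie overview available', 'released'}

# Normalise each overview before comparing: trim, lowercase, drop a trailing dot.
normalized = toy['overview'].str.strip().str.lower().str.rstrip('.')

# Keep rows that have an overview and whose overview is not a placeholder.
keep = toy['overview'].notna() & ~normalized.isin(PLACEHOLDERS)
toy_clean = toy[keep].copy()
print(toy_clean['title'].tolist())
```

Normalising before the comparison means that variants differing only in case, surrounding spaces or a final dot do not have to be listed one by one.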

In [None]:
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [None]:
# .copy() avoids SettingWithCopyWarning when new columns are added below.
movies_df = metadata[metadata['overview'].notna()].copy()

# Strip punctuation, lowercase, tokenize and drop English stop words.
punct_table = str.maketrans('', '', string.punctuation)
stop_set = set(stop)
movies_df['overview_1'] = (
    movies_df['overview']
    .apply(lambda x: x.translate(punct_table).lower())
    .apply(word_tokenize)
    .apply(lambda words: [word for word in words if word not in stop_set])
)

In [None]:
movies_df['overview_1']

0        [led, woody, andys, toys, live, happily, room,...
1        [siblings, judy, peter, discover, enchanted, b...
2        [family, wedding, reignites, ancient, feud, ne...
3        [cheated, mistreated, stepped, women, holding,...
4        [george, banks, recovered, daughters, wedding,...
                               ...                        
45461                        [rising, falling, man, woman]
45462    [artist, struggles, finish, work, storyline, c...
45463    [one, hits, goes, wrong, professional, assassi...
45464    [small, town, live, two, brothers, one, minist...
45465    [50, years, decriminalisation, homosexuality, ...
Name: overview_1, Length: 44512, dtype: object

In [None]:
overview_set = set(movies_df[movies_df['overview'].str.match('.*overview.*', case=False)]['overview'])
sorted(overview_set, key=len)

['No Overview',
 'No overview',
 'No overview.',
 'no overview yet',
 'No overview yet.',
 'No overview found',
 'No overview found.',
 'No plot overview available',
 'No movie overview available.',
 'No movie overview available, please add one at themoviedb.org',
 "An overview of the making of Terrence Malick's The New World (2005).",
 'An overview of the life of the most shocking, vile, and notorious of punk rock legends.',
 'A sweeping overview of humanity’s accomplishments in space, as well as our ongoing activities and future plans.',
 'Russell’s last DVD and CD, Outsourced, was taped before a sold out audience at the Warfield Theatre in San Francisco, and gives viewers and listeners an excellent overview of Russell’s comedic genius.',
 "An overview of the early years--late 1970s, early 1980s--of San Francisco punk band Dead Kennedys, with clips from some of their live concerts and footage of landmark San Francisco locations of the punk music scene. Jello Biafra and The Dead Kenne

In [None]:
stringsContainsOverview = set(filter(lambda x: len(x) <= 61, overview_set))
stringsContainsOverview

{'No Overview',
 'No movie overview available, please add one at themoviedb.org',
 'No movie overview available.',
 'No overview',
 'No overview found',
 'No overview found.',
 'No overview yet.',
 'No overview.',
 'No plot overview available',
 'no overview yet'}

In [None]:
movies_df = movies_df[~movies_df['overview'].isin(stringsContainsOverview)]

In [None]:
# Drop very short overviews (10 tokens or fewer), which also removes stubs such as "Released".
movies_df['word_count'] = movies_df['overview_1'].apply(len)
movies_df = movies_df[movies_df['word_count'] > 10]

## Task 2

Keep only the columns `'id', 'imdb_id', 'overview', 'title'` in the dataframe. Print 10 random rows.

In [None]:
movies_df = movies_df[['id', 'imdb_id', 'overview', 'title']]
movies_df.sample(n=10)

Unnamed: 0,id,imdb_id,overview,title
21245,61198,tt1172957,Director Mark Wexler embarks on a worldwide tr...,How to Live Forever
22547,2442,tt0113063,Mysterious bomber is planting explosive device...,The Final Cut
20445,64335,tt1808197,On a night of despair after being turned down ...,Fig Jam
40114,374617,tt4781612,"Nate Foster, a young, idealistic FBI agent, go...",Imperium
1183,3109,tt0045061,Sean Thornton has returned from America to rec...,The Quiet Man
34840,63441,tt0082153,Philip Kwok plays a repentant killer who vows ...,Masked Avengers
32507,306543,tt3851324,Just when Timo is trying to win back his ex-gi...,Homies
27842,224972,tt3983674,"Compared to girls, research shows that boys in...",The Mask You Live In
31345,348090,tt2466212,This film focuses on how a group of African Am...,The Black Kung Fu Experience
5488,31005,tt0179098,As he copes with the death of his fiancee alon...,Moonlight Mile


## Task 3

In [None]:
ratings = pd.read_csv(filepath_or_buffer = "ratings.csv")

In [None]:
ratings.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556
5,1,1968,4.0,1425942148
6,1,2762,4.5,1425941300
7,1,2918,5.0,1425941593
8,1,2959,4.0,1425941601
9,1,4226,4.0,1425942228


Let us check that there are no missing values.

In [None]:
pd.isnull(ratings).sum()

Merge the `metadata` and `ratings` dataframes into one. Note that `'id'` in `metadata` is the same identifier as `'movieId'` in `ratings`. Merge on this identifier (and pay attention to its data type).
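As a sketch of the data type point: if `id` is read as strings while `movieId` is an integer column, the keys have to be brought to one type before merging. The toy frames and the malformed value `'1997-08-20'` below are illustrative assumptions. `pd.to_numeric(..., errors='coerce')` turns ids that cannot be parsed into `NaN`, so they can be dropped before the cast.

```python
import pandas as pd

# Toy frames (illustrative): `id` arrives as strings, `movieId` as integers.
toy_movies = pd.DataFrame({'id': ['862', '1997-08-20', '949'],
                           'title': ['Toy Story', 'broken row', 'Heat']})
toy_ratings = pd.DataFrame({'userId': [1, 1, 2],
                            'movieId': [862, 949, 862],
                            'rating': [4.0, 4.5, 5.0]})

# Convert ids to numbers; values that cannot be parsed become NaN and are dropped.
toy_movies['id'] = pd.to_numeric(toy_movies['id'], errors='coerce')
toy_movies = toy_movies.dropna(subset=['id']).astype({'id': 'int64'})

# The inner join keeps only films that have at least one rating.
toy_merged = toy_movies.merge(toy_ratings, how='inner',
                              left_on='id', right_on='movieId').drop(columns=['movieId'])
print(toy_merged)
```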

In [None]:
movies_df = movies_df.astype({'id': np.int64})
movieAndRating = pd.merge(movies_df, ratings, how='inner', left_on='id', right_on='movieId')
movieAndRating = movieAndRating.drop(columns=['movieId'])

In [None]:
movieAndRating.sample(15)

Unnamed: 0,id,imdb_id,overview,title,userId,rating,timestamp
6156452,42015,tt0094089,Spalding Gray sits behind a desk throughout th...,Swimming to Cambodia,79042,3.0,1463658015
6816836,1884,tt0087225,The Towani family civilian shuttlecraft crashe...,The Ewok Adventure,155867,5.0,1452873377
9933980,46578,tt0078960,A busload containing three cheerleading teams ...,Cheerleaders' Wild Weekend,16744,4.0,1469163695
10068305,53129,tt0374298,The film is based on the second book from the ...,The Turkish Gambit,140384,3.5,1229033677
2197009,1580,tt0040746,"Two young men strangle their ""inferior"" classm...",Rope,160281,3.0,943924712
2115310,907,tt0059113,Doctor Zhivago is the filmed adapation of the ...,Doctor Zhivago,127696,5.0,938536360
8912778,2253,tt0985699,"Wounded in Africa during World War II, Nazi Co...",Valkyrie,153344,1.0,945876202
1930909,380,tt0095953,Selfish yuppie Charlie Babbitt's father left a...,Rain Man,145679,4.0,1252588969
3598618,541,tt0048347,Frankie is a heroin addict and sits in prison....,The Man with the Golden Arm,219249,3.0,963368041
1195786,832,tt0022100,"In this classic German thriller, Hans Beckert,...",M,130626,4.0,1334967171


## Task 4

Did any missing values appear in the merged dataframe? If so, explain why and remove the rows with missing values.

Print the shape of the resulting dataframe and 10 random records from it.
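One way to approach this question, sketched on toy data (the frames below are illustrative assumptions): count missing values with `isna().sum()` and compare join types. An inner join creates no new gaps because unmatched rows are discarded, whereas a left join keeps films without ratings and fills their rating columns with `NaN`.

```python
import pandas as pd

toy_movies = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
toy_ratings = pd.DataFrame({'movieId': [1, 1, 3], 'rating': [4.0, 3.5, 5.0]})

inner = toy_movies.merge(toy_ratings, how='inner', left_on='id', right_on='movieId')
# indicator=True adds a `_merge` column showing which side each row came from.
left = toy_movies.merge(toy_ratings, how='left', left_on='id', right_on='movieId',
                        indicator=True)

print(inner.isna().sum().sum())     # number of missing cells after the inner join
print(left['rating'].isna().sum())  # number of films left without a rating

# Drop incomplete rows and report the resulting shape.
left_clean = left.dropna(subset=['rating'])
print(left_clean.shape)
```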

In [None]:
movieAndRating.sample(n=10)

Unnamed: 0,id,imdb_id,overview,title,userId,rating,timestamp
6407486,46976,tt0094331,In a staid English seaside town after the Seco...,Wish You Were Here,206421,4.0,1277009301
126631,4954,tt0109676,A team of skydiving crooks led by DEA-agent-tu...,Drop Zone,247118,2.5,1115990878
2147271,2182,tt0080391,Attack of the Killer Tomatoes is a 1978 comedy...,Attack of the Killer Tomatoes!,37448,4.0,1101948498
6649616,3060,tt0015624,The story of an idle rich boy who joins the US...,The Big Parade,216907,4.0,1052773659
9863857,44191,tt0089501,"Brad, Steve, Hue, and Marvin are four get-nowh...",Loose Screws,91768,5.0,1447840062
4599077,112,tt0243862,This fifth Danish Dogme film is about six vuln...,Italian for Beginners,148201,3.0,919552828
3086778,293,tt0105265,A River Runs Through is a cinematographically ...,A River Runs Through It,181235,3.5,1273249013
1176985,207,tt0097165,"At an elite, old-fashioned boarding school in ...",Dead Poets Society,242904,4.5,1226252456
1673630,8874,tt0119738,When she receives word that her longtime plato...,My Best Friend's Wedding,87240,4.5,1094985012
2692643,104,tt0130827,Lola receives a phone call from her boyfriend ...,Run Lola Run,127995,2.5,1481387362


## Task 5

Pick a random user (make sure they have enough ratings). Build a recommendation for this user using collaborative filtering. Evaluate the quality of this recommendation.


Repeat these calculations for a large number of users and give an aggregate estimate.
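The idea can be sketched on a tiny rating matrix. The users, films and ratings below are made up for illustration, and the similarity used here, 1 / (1 + Euclidean distance) over co-rated films, is one common choice rather than the only one. The predicted rating for an unseen film is the similarity-weighted mean of the neighbours' ratings, and quality is judged by comparing the prediction with a held-out true rating.

```python
import numpy as np

# Rows: users, columns: films; np.nan marks "not rated" (illustrative data).
R = np.array([
    [5.0, 4.0, 1.0, np.nan],   # target user; the last film is to be predicted
    [5.0, 4.0, 1.0, 4.0],      # neighbour with identical taste
    [1.0, 2.0, 5.0, 2.0],      # neighbour with opposite taste
])

def euclid_similarity(a, b):
    """1 / (1 + Euclidean distance) over the films rated by both users."""
    both = ~np.isnan(a) & ~np.isnan(b)
    return 1.0 / (1.0 + np.sqrt(np.sum((a[both] - b[both]) ** 2)))

def predict(R, user, film):
    """Similarity-weighted mean of the other users' ratings for `film`."""
    sims, ratings = [], []
    for other in range(R.shape[0]):
        if other != user and not np.isnan(R[other, film]):
            sims.append(euclid_similarity(R[user], R[other]))
            ratings.append(R[other, film])
    return float(np.dot(sims, ratings) / np.sum(sims))

pred = predict(R, user=0, film=3)

# Compare with a held-out "true" rating (assumed to be 4.0 for this sketch).
true_rating = 4.0
print(pred, abs(true_rating - pred))
```

Averaging such absolute (or squared) errors over many held-out ratings and many users gives an aggregate score such as MAE or RMSE.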

In [None]:
# Keep only "critics": users who rated more than 700 films.
critics = (movieAndRating.groupby('userId')['rating'].count() > 700).reset_index()
critics = critics[critics['rating']]

critics

In [None]:
myUserId = critics.sample(1).userId.values[0]

myUserId

In [None]:
movieAndRating[movieAndRating['userId'] == myUserId]

In [None]:
myUserRatings = movieAndRating[movieAndRating['userId'] == myUserId]
otherUsersRatings = movieAndRating[(movieAndRating['userId'] != myUserId) & \
                                   (movieAndRating.userId.isin(critics.userId))]
otherUsersRatingsMyFilms = otherUsersRatings[otherUsersRatings.id.isin(myUserRatings.id)]
sim_films = pd.merge(myUserRatings, otherUsersRatingsMyFilms[['id', 'userId', 'rating']], \
                      how='left', left_on='id', right_on='id')

In [None]:
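def compute_euclid(x, y):
  """Euclidean distance between two equal-length rating vectors."""
  x = np.asarray(x, dtype=float)
  y = np.asarray(y, dtype=float)
  if x.shape != y.shape:
    # A distance of 0 here would be read as a perfect match, so fail loudly instead.
    raise ValueError('rating vectors must have the same length')
  return np.sqrt(np.sum(np.square(y - x)))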

In [None]:
def similarity(x, y):
  return 1 / (1 + compute_euclid(x, y))

In [None]:
sim_films = sim_films.rename(columns = {'userId_y' : 'userId'})

In [None]:
from scipy.stats import pearsonr
import warnings
warnings.filterwarnings("ignore")

def count_sim(sim_films, pearson):
  """Return {userId: similarity} for users who share more than 120 rated films."""
  sim_users = {}

  for key, group in sim_films.groupby('userId'):
    x = group['rating_x'].values
    y = group['rating_y'].values
    if len(x) > 120:
      score = pearsonr(x, y)[0] if pearson else similarity(x, y)
      # Negative (or NaN) correlations are clipped to zero.
      sim_users[key] = score if score >= 0.0 else 0
  return sim_users

In [None]:
sim_users = count_sim(sim_films, False)

In [None]:
userDF = pd.DataFrame.from_dict(sim_users, orient='index')
userDF.columns = ['similarityIndex']
userDF['userId'] = userDF.index
userDF.index = range(len(userDF))
userDF.sample(5)

Unnamed: 0,similarityIndex,userId
193,0.031631,172480
182,0.038883,162505
246,0.037947,224707
221,0.025585,194690
270,0.031117,251298


In [None]:
topUsers=userDF.sort_values(by='similarityIndex', ascending=False)[:10]
topUsers

Unnamed: 0,similarityIndex,userId
44,0.071331,37222
249,0.064842,228291
267,0.062961,249810
167,0.0625,147611
34,0.061986,30494
222,0.061025,196061
66,0.060905,53562
2,0.060196,4160
184,0.059652,164198
69,0.059624,56954


In [None]:
NoWatchedFilms = otherUsersRatings[~otherUsersRatings.id.isin(myUserRatings.id)]
NoWatchedFilms

In [None]:
topUsersRating=NoWatchedFilms.merge(topUsers, left_on='userId', right_on='userId', how='inner')
topUsersRating[['id', 'title', 'userId',  'rating', 'similarityIndex']].sample(10)

In [None]:
topUsersRating['weightedRating'] = topUsersRating['similarityIndex'] * topUsersRating['rating']
topUsersRating[['id', 'title', 'userId',  'rating', 'similarityIndex', 'weightedRating']].head(15)

Unnamed: 0,id,title,userId,rating,similarityIndex,weightedRating
0,949,Heat,30494,4.5,0.061986,0.278936
1,2074,Flirting with Disaster,30494,4.0,0.061986,0.247943
2,1572,Die Hard: With a Vengeance,30494,4.5,0.061986,0.278936
3,8973,Lord of Illusions,30494,4.0,0.061986,0.247943
4,26258,Bushwhacked,30494,4.0,0.061986,0.247943
5,1909,Don Juan DeMarco,30494,4.0,0.061986,0.247943
6,8984,Disclosure,30494,4.5,0.061986,0.278936
7,1945,Nell,30494,4.5,0.061986,0.278936
8,527,Once Were Warriors,30494,4.5,0.061986,0.278936
9,101,Leon: The Professional,30494,3.5,0.061986,0.21695


In [None]:
# Select the numeric columns before summing so that text columns are not aggregated.
tempTopUsersRating = topUsersRating.groupby(['id', 'title'])[['similarityIndex', 'weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,sum_similarityIndex,sum_weightedRating
id,title,Unnamed: 2_level_1,Unnamed: 3_level_1
14,American Beauty,0.243262,0.974702
16,Dancer in the Dark,0.430533,1.596251
18,The Fifth Element,0.245163,0.889269
19,Metropolis,0.318401,0.801471
20,My Life Without Me,0.134292,0.434355
25,Jarhead,0.428652,1.472698
26,Walk on Water,0.122613,0.523587
68,Brazil,0.125867,0.347088
69,Walk the Line,0.062961,0.251844
70,Million Dollar Baby,0.376117,1.383807


In [None]:
recommendation_df = pd.DataFrame()
recommendation_df['score'] = tempTopUsersRating['sum_weightedRating'] / tempTopUsersRating['sum_similarityIndex']
#recommendation_df['id_'] = tempTopUsersRating.index
recommendation_df = recommendation_df.sort_values(by='score', ascending=False)

In [None]:
recommendation_df.sample(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,score
id,title,Unnamed: 2_level_1
125263,Broken Vessels,3.0
84152,Dollman vs. Demonic Toys,3.277463
42217,Bobby Deerfield,3.76048
118900,"In the Meantime, Darling",3.765581
1647,The Recruit,3.247155
8886,Palermo Shooting,3.0
80844,The Other Side of Midnight,2.476173
696,Manhattan,3.5
26314,"Job, czyli ostatnia szara komórka",3.0
54328,Ocean Heaven,3.0


In [None]:
recommendation_df = recommendation_df[recommendation_df.score > 4.0]
recommendation_df.head(7)

Unnamed: 0_level_0,Unnamed: 1_level_0,score
id,title,Unnamed: 2_level_1
26479,Strapped,5.0
4176,Murder on the Orient Express,5.0
41434,Don't Look Up,5.0
5817,"You, the Living",5.0
4708,Is Paris Burning?,5.0
4612,Absolon,5.0
123109,No One Lives,5.0


Final function

In [None]:
def count_recomended(selected_userId, pearson):
  myUserRatings = movieAndRating[movieAndRating['userId'] == selected_userId]
  otherUsersRatings = movieAndRating[(movieAndRating['userId'] != selected_userId) & \
                                   (movieAndRating.userId.isin(critics.userId))]
  otherUsersRatingsMyFilms = otherUsersRatings[otherUsersRatings.id.isin(myUserRatings.id)]
  sim_films = pd.merge(myUserRatings, otherUsersRatingsMyFilms[['id', 'userId', 'rating']], \
                      how='left', left_on='id', right_on='id')

  #print(sim_films)
  sim_films = sim_films.rename(columns = {'userId_y' : 'userId'})
  sim_users = count_sim(sim_films, pearson)

  userDF = pd.DataFrame.from_dict(sim_users, orient='index')
  userDF.columns = ['similarityIndex']
  userDF['userId'] = userDF.index
  userDF.index = range(len(userDF))
  userDF.head()

  topUsers=userDF.sort_values(by='similarityIndex', ascending=False)[:10]

  NoWatchedFilms = otherUsersRatings[~otherUsersRatings.id.isin(myUserRatings.id)]

  topUsersRating=NoWatchedFilms.merge(topUsers, left_on='userId', right_on='userId', how='inner')
  topUsersRating['weightedRating'] = topUsersRating['similarityIndex'] * topUsersRating['rating']

  # Select the numeric columns before summing so that text columns are not aggregated.
  tempTopUsersRating = topUsersRating.groupby(['id', 'title'])[['similarityIndex', 'weightedRating']].sum()
  tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']

  recommendation_df = pd.DataFrame()
  recommendation_df['score'] = tempTopUsersRating['sum_weightedRating'] / tempTopUsersRating['sum_similarityIndex']
  recommendation_df = recommendation_df.sort_values(by='score', ascending=False)
  recommendation_df = recommendation_df.reset_index(level=['id','title'])

  return recommendation_df

In [None]:
newUserId = critics.sample(1).userId.values[0]

newUserId

225793

In [None]:
rec = count_recomended(newUserId, False)

In [None]:
rec.head(7)
#rec[rec.score > 4.0]

Unnamed: 0,id,title,score
0,37495,Four Lions,5.0
1,33294,In Your Hands,5.0
2,131739,"Isoroku Yamamoto, the Commander-in-Chief of th...",5.0
3,668,On Her Majesty's Secret Service,5.0
4,98532,Sensation,5.0
5,26578,The Falcon and the Snowman,5.0
6,42217,Bobby Deerfield,5.0


In [None]:
from IPython.display import display

def count_recomended_2(user_movies_df, userId, pearson):
  myUserRatings = user_movies_df

  otherUsersRatings = movieAndRating[(movieAndRating['userId'] != userId) & \
                                   (movieAndRating.userId.isin(critics.userId))]
  otherUsersRatingsMyFilms = otherUsersRatings[otherUsersRatings.id.isin(myUserRatings.id)]

  sim_films = pd.merge(myUserRatings, otherUsersRatingsMyFilms[['id', 'userId', 'rating']], \
                      how='left', left_on='id', right_on='id')
  #display(sim_films)

  sim_films = sim_films.rename(columns = {'userId_y' : 'userId'})
  sim_users = count_sim(sim_films, pearson)

  userDF = pd.DataFrame.from_dict(sim_users, orient='index')
  userDF.columns = ['similarityIndex']
  userDF['userId'] = userDF.index
  userDF.index = range(len(userDF))
  userDF.head()

  topUsers=userDF.sort_values(by='similarityIndex', ascending=False)[:7]

  watchedFilms = otherUsersRatingsMyFilms

  topUsersRating= watchedFilms.merge(topUsers, left_on='userId', right_on='userId', how='inner')
  topUsersRating['weightedRating'] = topUsersRating['similarityIndex'] * topUsersRating['rating']

  # Select the numeric columns before summing so that text columns are not aggregated.
  tempTopUsersRating = topUsersRating.groupby(['id', 'title'])[['similarityIndex', 'weightedRating']].sum()
  tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']

  recommendation_df = pd.DataFrame()
  recommendation_df['score'] = tempTopUsersRating['sum_weightedRating'] / tempTopUsersRating['sum_similarityIndex']
  recommendation_df = recommendation_df.sort_values(by='score', ascending=False)
  recommendation_df = recommendation_df.reset_index(level=['id','title'])
  recommendation_df = recommendation_df.drop(columns='title')

  return recommendation_df

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

def calculateMetricsFromPairs(actual_pairs, predicted_pairs):
    metrics = [
        # MSE
        mean_squared_error,
        # RMSE
        lambda x, y: np.sqrt(mean_squared_error(x, y)),
        # MAE
        mean_absolute_error
    ]

    x = actual_pairs
    y = predicted_pairs


    calculations = [
        f(x, y)
        for f in metrics
    ]

    return pd.DataFrame([calculations], columns=['MSE', 'RMSE', 'MAE'])

In [None]:
newUserId = critics.sample(1).userId.values[0]

newUserId

75660

In [None]:
from sklearn.model_selection import train_test_split

# Evaluate on the user sampled above, so the printed userId matches the data used.
viewed_movies = movieAndRating[movieAndRating['userId'] == newUserId]
train_movies, test_movies = train_test_split(viewed_movies, test_size=0.33, random_state=42)

# Euclidean-distance similarity
predicted = count_recomended_2(test_movies, newUserId, False)
predicted = predicted.rename(columns = {'score' : 'rating'})

result = test_movies.merge(predicted, left_on='id', right_on='id', how='inner')

metrics = calculateMetricsFromPairs(result['rating_x'].values, result['rating_y'].values)
print(f'Calculated metrics by leave-one-out method for userId = {newUserId}:')
display(metrics)

# Pearson-correlation similarity
predicted = count_recomended_2(test_movies, newUserId, True)
predicted = predicted.rename(columns = {'score' : 'rating'})
result = test_movies.merge(predicted, left_on='id', right_on='id', how='inner')

metrics = calculateMetricsFromPairs(result['rating_x'].values, result['rating_y'].values)
print(f'Calculated metrics by leave-one-out method for userId = {newUserId}: with pearson')
display(metrics)

Calculated metrics by leave-one-out method for userId = 75660:


Unnamed: 0,MSE,RMSE,MAE
0,0.631055,0.794389,0.618961


Calculated metrics by leave-one-out method for userId = 75660: with pearson


Unnamed: 0,MSE,RMSE,MAE
0,1.296606,1.138686,0.919796


In [None]:
from sklearn.model_selection import train_test_split
import math

def splitTestAndPredict(myUserId, pearson):
    viewed_movies = movieAndRating[movieAndRating['userId'] == myUserId]
    train_movies, test_movies = train_test_split(viewed_movies, test_size=0.33, random_state=42)

    predicted = count_recomended_2(test_movies, myUserId, pearson)
    predicted = predicted.rename(columns = {'score' : 'rating'})

    result = test_movies.merge(predicted, left_on='id', right_on='id', how='inner')

    return result[['id','rating_x', 'rating_y']]

In [None]:
result = splitTestAndPredict(newUserId, False)

In [None]:
print(f'Calculated metrics by leave-one-out method for userId = {newUserId}:')
calculateMetricsFromPairs(result['rating_x'].values, result['rating_y'].values)

Calculated metrics by leave-one-out method for userId = 75660:


Unnamed: 0,MSE,RMSE,MAE
0,0.631055,0.794389,0.618961


In [None]:
random_users = critics.sample(150).userId.values
metrics_per_user = []

for us in random_users:
  temp = splitTestAndPredict(us, True)
  metrics_per_user.append(calculateMetricsFromPairs(temp['rating_x'].values, temp['rating_y'].values))

# DataFrame.append was removed in pandas 2.0; concatenate once instead.
results = pd.concat(metrics_per_user, ignore_index=True)

In [None]:
results.mean()

MSE     0.856058
RMSE    0.893827
MAE     0.709377
dtype: float64

## Task 6

Use the Term Frequency Inverse Document Frequency (TF-IDF) method to filter for films that are similar (use cosine distance for this) to those the user rated highly.

Evaluate the quality of this recommendation.
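To illustrate the idea, here is a small sketch using only the standard library. It uses a simplified TF-IDF variant (raw term count times log(N / df)), not the smoothed and normalised formula that scikit-learn's `TfidfVectorizer` applies by default, and the three toy overviews are invented for the example.

```python
import math
from collections import Counter

# Toy overviews (illustrative), already lowercased and without stop words.
docs = {
    'film_a': 'toys come alive boy leaves room',
    'film_b': 'boy toys go adventure',
    'film_c': 'detective hunts killer city',
}

tokens = {name: text.split() for name, text in docs.items()}
n_docs = len(tokens)

# Document frequency: in how many overviews each word occurs.
df = Counter(word for words in tokens.values() for word in set(words))

def tfidf_vector(words):
    """Map each word to tf * log(N / df); words found in every overview get weight 0."""
    tf = Counter(words)
    return {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = {name: tfidf_vector(words) for name, words in tokens.items()}

# Rank the other films by similarity to film_a, most similar first.
ranking = sorted(((cosine(vectors['film_a'], vec), name)
                  for name, vec in vectors.items() if name != 'film_a'), reverse=True)
print(ranking)
```

Cosine distance is one minus this similarity, so the most similar films are those with the smallest distance.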

In [None]:
#%%capture
#!pip install pymorphy2

In [None]:
#%%capture
#!pip install nltk

In [None]:
import string
import ssl
import nltk

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt')
from nltk.tokenize import word_tokenize
import pymorphy2

[nltk_data] Downloading package punkt to /Users/macbook/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/macbook/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [None]:
test = movieAndRating.copy()
test = test[['id', 'imdb_id', 'overview', 'title']]
test = test.drop_duplicates()

In [None]:
punct_table = str.maketrans('', '', string.punctuation)
stop_set = set(stop)
analyzer = pymorphy2.MorphAnalyzer()

def normalize_overview(text):
  """Strip punctuation, lowercase, tokenize, drop stop words and take normal forms."""
  words = word_tokenize(text.translate(punct_table).lower())
  words = [word for word in words if word not in stop_set]
  # parse() returns candidate analyses; use the normal form of the first one.
  return ' '.join(str(analyzer.parse(word)[0].normal_form) for word in words)

test['overview'] = test['overview'].apply(normalize_overview)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def calculateSimilarFilms(df, movieId):
    """Return the 5 films whose overviews are closest to that of `movieId`."""
    temp = df.copy()

    tfidf = TfidfVectorizer()
    mx_tf = tfidf.fit_transform(temp['overview'])

    find_nearest_to = temp[temp['id'] == movieId]['overview'].values[0]
    new_entry = tfidf.transform([find_nearest_to])

    # TF-IDF rows are L2-normalised by default, so the dot product is the cosine similarity.
    cosine_similarities = linear_kernel(new_entry, mx_tf).flatten()
    temp['cos_similarities'] = cosine_similarities
    temp['similar_to'] = movieId

    # The top hit is normally the query film itself, so skip it and keep the next five.
    return temp.sort_values(by='cos_similarities', ascending=False).iloc[1:6]

In [None]:
calculateSimilarFilms(test, 862)

Unnamed: 0,id,imdb_id,overview,title,cos_similarities,similar_to
479971,2246,tt0107497,tale happily married couple would like childre...,Malice,0.174483,862
10400764,159201,tt0280000,ozzie young koala living australia kidnapped t...,Ozzie,0.129984,862
9847002,123235,tt1885331,paris pharmacist alice obsessed woody allen fa...,Paris-Manhattan,0.121163,862
9917821,117428,tt0062492,norman mailer ’ first feature filmmaking effor...,Wild 90,0.117878,862
2388455,4478,tt0107211,robert redford stars billionaire john gage off...,Indecent Proposal,0.117254,862


In [None]:
random_users = (ratings.groupby('userId')['rating'].count() < 50).reset_index()
random_users = random_users[random_users['rating'] == True].sample(1).userId.values

myUserId = random_users[0]
myUserId

181855

In [None]:
len(ratings[ratings['userId'] == myUserId])

14

In [None]:
myUserLikedFilms = movieAndRating[(movieAndRating['userId'] == myUserId) & (movieAndRating['rating'] >= 4.0)]
myUserLikedFilms

Unnamed: 0,id,imdb_id,overview,title,userId,rating,timestamp
215373,527,tt0110729,A drama about a Maori family lving in Auckland...,Once Were Warriors,181855,5.0,951063693
2801576,3175,tt0072684,"In the Eighteenth Century, in a small village ...",Barry Lyndon,181855,5.0,951063782
3306541,3114,tt0049730,As a Civil War veteran spends years searching ...,The Searchers,181855,4.0,951063768
4338546,2770,tt0252866,The whole gang are back and as close as ever. ...,American Pie 2,181855,5.0,951063925


In [None]:
myUserLikedFilms_Ids = myUserLikedFilms['id'].values
myUserLikedFilms_Ids

array([ 527, 3175, 3114, 2770])

In [None]:
simFilms = {}

# `movie_id` avoids shadowing the built-in `id`.
for movie_id in myUserLikedFilms_Ids:
  simFilms[movie_id] = calculateSimilarFilms(test, movie_id)

In [None]:
simFilms[myUserLikedFilms_Ids[0]]

Unnamed: 0,id,imdb_id,overview,title,cos_similarities,similar_to
10274539,53766,tt0011865,robert beth bordon married share little runs s...,Why Change Your Wife?,0.114926,527
5052178,99826,tt0271020,failed new jersey inventor embarks career stan...,The Jimmy Show,0.107521,527
10071003,142106,tt1826813,story family whose growth stunted family learn...,Petunia,0.102834,527
6946345,363,tt0347048,head german director fatih akin ’ story alcoho...,Head-On,0.101902,527
7357400,349,tt0428430,crustaces et coquillages fresh french comedy f...,Cockles and Muscles,0.100265,527


In [None]:
simFilms[myUserLikedFilms_Ids[len(myUserLikedFilms_Ids) - 1 ]]

Unnamed: 0,id,imdb_id,overview,title,cos_similarities,similar_to
7527671,5741,tt0058301,lorna married jim year still hasnt satisfied s...,Lorna,0.160044,2770
5415755,8273,tt0328828,high school distant memory jim michelle gettin...,American Wedding,0.150007,2770
9667517,26831,tt0070788,britain 1958 restless school bored life jim le...,That'll Be The Day,0.149522,2770
7194451,8976,tt0391304,flight los angeles new york oliver emily make ...,A Lot Like Love,0.12498,2770
9428557,86664,tt0160440,spending summer exotic beach two brothers fall...,Crazed Fruit,0.119505,2770


In [None]:
# DataFrame.append was removed in pandas 2.0; concatenate the per-film results instead.
rec = pd.concat(list(simFilms.values()))

rec

Unnamed: 0,id,imdb_id,overview,title,cos_similarities,similar_to
10274539,53766,tt0011865,robert beth bordon married share little runs s...,Why Change Your Wife?,0.114926,527
5052178,99826,tt0271020,failed new jersey inventor embarks career stan...,The Jimmy Show,0.107521,527
10071003,142106,tt1826813,story family whose growth stunted family learn...,Petunia,0.102834,527
6946345,363,tt0347048,head german director fatih akin ’ story alcoho...,Head-On,0.101902,527
7357400,349,tt0428430,crustaces et coquillages fresh french comedy f...,Cockles and Muscles,0.100265,527
9727601,79723,tt2043932,gerrie richard rikkert robbie barry maaskantje...,New Kids Nitro,0.15606,3175
1718474,56651,tt0119457,redmond young guy cant find life uncle sam giv...,Kicked in the Head,0.131503,3175
7860353,1116,tt0460989,1920s ireland young doctor damien odonovan pre...,The Wind That Shakes the Barley,0.115236,3175
896390,819,tt0117665,two gangsters seek revenge state jail worker s...,Sleepers,0.113337,3175
746205,3529,tt0025878,four year absence one time detective nick char...,The Thin Man,0.109264,3175


In [None]:
def countRecomendation(userId, df):
  """Collect the films most similar to each film that `userId` rated 4.0 or higher."""
  myUserLikedFilms = movieAndRating[(movieAndRating['userId'] == userId)
                                    & (movieAndRating['rating'] >= 4.0)]

  # Skip users with fewer than 5 liked films.
  if len(myUserLikedFilms) < 5:
    return pd.DataFrame()

  simFilms = {}

  for movie_id in myUserLikedFilms['id'].values:
    simFilms[movie_id] = calculateSimilarFilms(df, movie_id)

  # DataFrame.append was removed in pandas 2.0; concatenate the per-film results instead.
  return pd.concat(list(simFilms.values()))

In [None]:
random_users = (ratings.groupby('userId')['rating'].count() < 50).reset_index()
random_users = random_users[random_users['rating'] == True].sample(1).userId.values

testUserId = random_users[0]
testUserId

200648

In [None]:
rec = countRecomendation(testUserId, test)
rec

Unnamed: 0,id,imdb_id,overview,title,cos_similarities,similar_to
862556,562,tt0095016,nypd cop john mcclanes plan reconcile estrange...,Die Hard,0.164502,1573
4377504,2034,tt0139654,first day job narcotics officer rookie cop wor...,Training Day,0.121671,1573
86769,1572,tt0112864,new york detective john mcclane back kicking b...,Die Hard: With a Vengeance,0.109046,1573
9759132,81393,tt1846442,12 dates christmas romantic comedy follows kat...,12 Dates of Christmas,0.105863,1573
1555948,8845,tt0105690,actionpacked thriller takes place soontobedeco...,Under Siege,0.104488,1573
9364495,27769,tt0102960,based short story stephen king man family retu...,Sometimes They Come Back,0.125464,1552
5521950,321,tt0330602,sweet comic film italian man comes closet affe...,Mambo Italiano,0.123048,1552
10071003,142106,tt1826813,story family whose growth stunted family learn...,Petunia,0.109314,1552
5234522,111815,tt0100200,five member family father conservative traditi...,Mr & Mrs Bridge,0.107658,1552
3102173,5486,tt0067656,elderly heiress killed husband wants control f...,A Bay of Blood,0.105623,1552


Quality evaluation

In [None]:
rec['cos_similarities_inv'] = 1 - rec['cos_similarities']
rec

In [None]:
grouped_cos_sim_inv = rec.groupby(by=['similar_to'])['cos_similarities_inv'] \
    .apply(lambda x: sorted(list(x), reverse=True)) \
    .to_dict()

grouped_cos_sim_inv

In [None]:
precision = {
    k: sum(grouped_cos_sim_inv[k]) / len(grouped_cos_sim_inv[k])
    for k in grouped_cos_sim_inv
}

precision

{457: 0.7887356040651421,
 590: 0.8851392824280835,
 1259: 0.803699536331363,
 1552: 0.8857785235499559,
 1573: 0.8788863277454869,
 1917: 0.8995835921783752,
 1954: 0.8689463328226911,
 2447: 0.8980748738137834}

In [None]:
def ap_calc(v):
    """Average of the running means of `v`, weighted by the values of `v`."""
    # k = 147
    # v = [0.6765093902770655, 0.6805450468328779, 0.6843556880131448]

    numerator = 0
    denominator = sum(v)

    for i in range(len(v)):
        # v[i] times the mean of the first i + 1 values
        numerator += v[i] * (sum(v[:i + 1]) / len(v[:i + 1]))

    return numerator / denominator

ap = {
    k: ap_calc(grouped_cos_sim_inv[k])
    for k in grouped_cos_sim_inv
}

ap

In [None]:
movies_ap = ap.values()
print('Mean AP over films:')
sum(movies_ap) / len(movies_ap)

Mean AP over films:


0.8766809002198552

In [None]:
sorted_recommendations = rec.copy()
sorted_recommendations = sorted_recommendations.sort_values(by='cos_similarities_inv', ascending=False)

sorted_recommendations

In [None]:
def calc_precision_ap(series, k):
    precision_k = sum(series.iloc[:k]) / k
    ap_k = ap_calc(list(series.iloc[:k]))

    print(f'Precision@K ({k = }):')
    print(precision_k)
    print()
    print(f'AP@k ({k = }):')
    print(ap_k)
    print('-----------------')

    return precision_k, ap_k

In [None]:
random_users = (ratings.groupby('userId')['rating'].count() < 50).reset_index()
random_users = random_users[random_users['rating'] == True].sample(10).userId.values

random_users

array([ 51083,  25732, 104287, 226945,  65923, 102562, 231590,  65895,
        26381,  10172])

In [None]:
k = 10
total_ap_sum = 0
recommendation = pd.DataFrame()

for userId in random_users:
    userRec = countRecomendation(userId, test)
    if userRec.empty:
      # Drop users for whom no recommendation could be built.
      random_users = random_users[random_users != userId]
      continue

    # DataFrame.append was removed in pandas 2.0; use pd.concat instead.
    recommendation = pd.concat([recommendation, userRec])
    recommendation['cos_similarities_inv'] = 1 - recommendation['cos_similarities']
    recommendation = recommendation.sort_values(by='cos_similarities_inv', ascending=False)
    _, ap = calc_precision_ap(recommendation['cos_similarities_inv'], k)
    total_ap_sum += ap