**Name:** Stav Yosef

**ID:** 316298876


---


**Name:** Daniel Sabba

**ID:** 311500227


**Colab link:** [https://drive.google.com/file/d/1PxyFTxrHWmrt9IFB9XtJr-jM5yk9owVM/view?usp=sharing](https://drive.google.com/file/d/1PxyFTxrHWmrt9IFB9XtJr-jM5yk9owVM/view?usp=sharing) 

<h2> Content Based Recommender System - Metafeatures </h2>

The goal of this notebook is to implement content based recommender system on the Movielens 100k dataset.

The movie profile is based on the movie genres.

Two approaches are implemented. 

<b> Approach 1: </b>

The user profile is either a weighted average of the movie profile he\she rated, or the average of the movie profile he\she liked (rating >=3) - the average rating he\she didn't like (with a lower weight for the disliked movies)

The recommended movies are the closest ones (e.g. by Cosine similarity) to the user profile vector

The implementation is based on this blog post [website]
    
<b> Approach 2: </b>

The similarity score between two movies is calculated by computing the similarity between the movie profiles of each movies pair. 

The predicted rating a user will give to a candidate item, is calculated by the rating the user gave to K most similar items to the candidate item. The recommended movies are those with highest predicted rating.  

The implementation is based on this post [website2]

[website2]: https://www.kaggle.com/varian97/item-based-collaborative-filtering    

[website]: https://towardsdatascience.com/movie-recommendation-system-based-on-movielens-ef0df580cd0e

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

# Setup

In [106]:
import os.path as path
import warnings

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings('ignore')

## Download Dataset

In [107]:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

--2021-01-10 13:28:32--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.2’


2021-01-10 13:28:32 (14.0 MB/s) - ‘ml-100k.zip.2’ saved [4924029/4924029]



In [108]:
!ls

item_vec_latent.npy  ml-100k.zip    ml-100k.zip.2
ml-100k		     ml-100k.zip.1  sample_data


In [109]:
!unzip -u ml-100k.zip

Archive:  ml-100k.zip


## Data Managment

In [110]:
!ls "ml-100k/"

allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base	u.genre  u.occupation
mku.sh	   u1.test  u3.base  u4.test  ua.base  ub.test	u.info	 u.user
README	   u2.base  u3.test  u5.base  ua.test  u.data	u.item


In [111]:
def get_dataset_folder() -> str:
    return "ml-100k/"

In [112]:
def get_train_test_path(k: int) -> (str, str):
    if 1 <= k <= 5:
        return path.join(get_dataset_folder(), f'u{k}.base'), path.join(get_dataset_folder(), f'u{k}.test')
    else:
        return None

In [113]:
def build_matrix(_path: str):
    df = pd.read_csv(_path, sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])

    n_users = int(get_num_users())
    n_items = int(get_num_items())

    matrix = np.zeros((n_users, n_items))
    for row in df.itertuples():
        matrix[row[1] - 1, row[2] - 1] = row[3]
    return matrix

In [114]:
def build_train_test_matrix(k: int) -> (np.ndarray, np.ndarray):
    path_train, path_test = get_train_test_path(k=k)

    return build_matrix(_path=path_train), build_matrix(_path=path_test)

In [115]:
def build_train_matrix() -> np.ndarray:
    matrix_train, _ = build_train_test_matrix(1)

    for i in range(2, 6, 1):
        train, _ = build_train_test_matrix(i)
        matrix_train += train
    
    return matrix_train

In [116]:
def build_test_matrix() -> np.ndarray:
    _, matrix_test = build_train_test_matrix(1)

    for i in range(2, 6, 1):
        _, test = build_train_test_matrix(i)
        matrix_test += test
    
    return matrix_test

In [117]:
def get_data_path() -> str:
    return path.join(get_dataset_folder(), "u.data")

In [118]:
def get_users_path() -> str:
    return path.join(get_dataset_folder(), "u.user")

In [119]:
def get_genres_path() -> str:
    return path.join(get_dataset_folder(), "u.genre")

In [120]:
def get_items_path() -> str:
    return path.join(get_dataset_folder(), "u.item")

In [121]:
def load_genres() -> list:
    _ = pd.read_csv(get_genres_path(),
                    delimiter='|',
                    names=["Genre", "Code"],
                    encoding='latin-1')
    
    return _[_.columns[0]].to_list()

In [122]:
def load_items() -> pd.DataFrame:
    m_cols = ['movie_id', 'movie_title', 'release date', 'video release date', 'IMDb URL'] + load_genres()
    return pd.read_csv(get_items_path(), delimiter='|', names=m_cols, encoding='latin-1')

In [123]:
def load_data() -> pd.DataFrame:
    m_cols = ['user_id', 'item_id', 'rating', 'timestamp']
    return pd.read_csv(get_data_path(), delimiter='\t', names=m_cols, encoding='latin-1')

In [124]:
def load_users() -> pd.DataFrame:
    m_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
    return pd.read_csv(get_users_path(), delimiter='\t', names=m_cols, encoding='latin-1')

In [125]:
def get_num_users() -> int:
    users = load_users()
    return users['user_id'].unique().shape[0]

In [126]:
def get_num_items() -> int:
    items = load_items()
    return items['movie_id'].unique().shape[0]

# Data loading

In [127]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
ratings = load_data()

ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   user_id    100000 non-null  int64
 1   item_id    100000 non-null  int64
 2   rating     100000 non-null  int64
 3   timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


In [128]:
#Binary option should be set to True if the rating should be binary. 
#It should be set to True for the first approach and False for the second approach
BINARY_OPTION = True
#NEGATIVE WEIGHT is relevant only for the first approach
NEGATIVE_WEIGHT = 0.25

In [129]:
def brating(row):
    if row['rating'] >= 3:
        val = 1
    elif row['rating'] >=0:
        val = -NEGATIVE_WEIGHT
    else:
        val = row['rating']
    return val


ratings['binary_rating'] = ratings.apply(brating, axis=1)
ratings.head()

Unnamed: 0,user_id,item_id,rating,timestamp,binary_rating
0,196,242,3,881250949,1.0
1,186,302,3,891717742,1.0
2,22,377,1,878887116,-0.25
3,244,51,2,880606923,-0.25
4,166,346,1,886397596,-0.25


In [130]:
movie_titles = load_items()
movie_titles

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


# Question 1

## A

Creating movie profile, removing all columns but the genres.

In [131]:
movie_profile = movie_titles[load_genres()[1:]]
movie_profile.sort_index(axis=0, inplace=True)
movie_profile.head()

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


## B

In [132]:
def get_movies_table_from_ids(df: pd.DataFrame) -> pd.DataFrame:
    movies_trim = movie_titles[movie_titles.columns.drop(load_genres())]
    
    df = df.merge(movies_trim, how='inner', on='movie_id')

    return df

In [133]:
def get_movies_cosine_similarity() -> pd.DataFrame:
    _ = movie_profile
    return pd.DataFrame(cosine_similarity(_, _),
                        index=_.index + 1,
                        columns=_.index + 1)

In [134]:
def find_most_similar_movies(movie_id: int, top_k: int) -> pd.Series:
    df_cos = get_movies_cosine_similarity()

    movie_vec = df_cos[movie_id]  # Extract the movie similarity cosine vector
    movie_vec = movie_vec.drop(movie_id)  # Remove the movie itself from the vector

    top_movies = movie_vec.sort_values(ascending=False).iloc[:top_k]  # Get top top_k similar movies.

    res = pd.DataFrame({"movie_id": top_movies.index, "similartiy_score": top_movies})
    res.reset_index(drop=True, inplace=True)

    return get_movies_table_from_ids(df=res)

In [155]:
top_similar = find_most_similar_movies(movie_id=1, top_k=5)
top_similar.head(5)

Unnamed: 0,movie_id,similartiy_score,movie_title,release date,video release date,IMDb URL
0,422,1.0,Aladdin and the King of Thieves (1996),01-Jan-1996,,http://us.imdb.com/M/title-exact?Aladdin%20and...
1,1219,0.866025,"Goofy Movie, A (1995)",01-Jan-1995,,http://us.imdb.com/M/title-exact?Goofy%20Movie...
2,95,0.866025,Aladdin (1992),01-Jan-1992,,http://us.imdb.com/M/title-exact?Aladdin%20(1992)
3,1078,0.816497,Oliver & Company (1988),29-Mar-1988,,http://us.imdb.com/M/title-exact?Oliver%20&%20...
4,477,0.816497,Matilda (1996),02-Aug-1996,,http://us.imdb.com/M/title-exact?Matilda%20(1996)


## C

We understood this question in two different ways so down below we solved it twice just in case.

### First approach

Use function b to find top k (5) items similar for specific item.

In [158]:
top_similar = find_most_similar_movies(movie_id=1, top_k=5)
top_similar.head()

Unnamed: 0,movie_id,similartiy_score,movie_title,release date,video release date,IMDb URL
0,422,1.0,Aladdin and the King of Thieves (1996),01-Jan-1996,,http://us.imdb.com/M/title-exact?Aladdin%20and...
1,1219,0.866025,"Goofy Movie, A (1995)",01-Jan-1995,,http://us.imdb.com/M/title-exact?Goofy%20Movie...
2,95,0.866025,Aladdin (1992),01-Jan-1992,,http://us.imdb.com/M/title-exact?Aladdin%20(1992)
3,1078,0.816497,Oliver & Company (1988),29-Mar-1988,,http://us.imdb.com/M/title-exact?Oliver%20&%20...
4,477,0.816497,Matilda (1996),02-Aug-1996,,http://us.imdb.com/M/title-exact?Matilda%20(1996)


In [173]:
top_similar = find_most_similar_movies(movie_id=1000, top_k=5)
top_similar.head()

Unnamed: 0,movie_id,similartiy_score,movie_title,release date,video release date,IMDb URL
0,575,1.0,City Slickers II: The Legend of Curly's Gold (...,01-Jan-1994,,http://us.imdb.com/M/title-exact?City%20Slicke...
1,415,0.816497,"Apple Dumpling Gang, The (1975)",01-Jan-1975,,http://us.imdb.com/M/title-exact?Apple%20Dumpl...
2,1188,0.816497,Young Guns II (1990),01-Jan-1990,,http://us.imdb.com/M/title-exact?Young%20Guns%...
3,435,0.816497,Butch Cassidy and the Sundance Kid (1969),01-Jan-1969,,http://us.imdb.com/M/title-exact?Butch%20Cassi...
4,73,0.816497,Maverick (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Maverick%20(1...


#### Explanation

In [137]:
movie_titles[movie_titles["movie_id"] == 1]

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0


As we can see the genres of Toy Story are:

1.   Animation
2.   Children's
3.   Comedy

In [138]:
movie_titles[movie_titles["movie_id"].isin([422, 1219, 95, 1078, 477])]

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
94,95,Aladdin (1992),01-Jan-1992,,http://us.imdb.com/M/title-exact?Aladdin%20(1992),0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0
421,422,Aladdin and the King of Thieves (1996),01-Jan-1996,,http://us.imdb.com/M/title-exact?Aladdin%20and...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
476,477,Matilda (1996),02-Aug-1996,,http://us.imdb.com/M/title-exact?Matilda%20(1996),0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1077,1078,Oliver & Company (1988),29-Mar-1988,,http://us.imdb.com/M/title-exact?Oliver%20&%20...,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1218,1219,"Goofy Movie, A (1995)",01-Jan-1995,,http://us.imdb.com/M/title-exact?Goofy%20Movie...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0


Aladdin and the King of Thieves have the same exactly genres as Toy Story so the similarity is 1.

Goofy Movie, A & Aladdin have the same genres as Toy Story with 1 more genre so it's close to 1 but not one (0.866)

Oliver & Company & Matilda have 2/3 genres as Toy Story so they are pretty similar but not exactly the same $\rightarrow$ 0.816


**Good Results!**



---



In [174]:
movie_titles[movie_titles["movie_id"] == 1000]

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
999,1000,Lightning Jack (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Lightning%20J...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1


As we can see the genres of Lightning Jack are:

1.   Comedy
2.   Western

In [175]:
movie_titles[movie_titles["movie_id"].isin([575, 415, 1188, 435, 73])]

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
72,73,Maverick (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Maverick%20(1...,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
414,415,"Apple Dumpling Gang, The (1975)",01-Jan-1975,,http://us.imdb.com/M/title-exact?Apple%20Dumpl...,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1
434,435,Butch Cassidy and the Sundance Kid (1969),01-Jan-1969,,http://us.imdb.com/M/title-exact?Butch%20Cassi...,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
574,575,City Slickers II: The Legend of Curly's Gold (...,01-Jan-1994,,http://us.imdb.com/M/title-exact?City%20Slicke...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1187,1188,Young Guns II (1990),01-Jan-1990,,http://us.imdb.com/M/title-exact?Young%20Guns%...,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1


City Slickers II: The Legend of Curly's Gold have the same exactly genres as Lightning Jack so the similarity is 1.

All others movies are Comedy & Western with addition one genre $\rightarrow$ 0.816


**Good Results!**

### Second approach

We look at the question in different perspective, mabye the question is to take movie X and movie Y and we need to produce common recommendation for both of the movie in the same time.

The idea is to check similarity to each of the items and then check the intersection of them by the movie id column.

In the last stage calculate the average $\frac{score_{i,first-mov-id} + score_{i,second-mov-id}}{2}$ And take the top k (5).

In [139]:
def most_similar_for_2_movies(first_mov_id: int, second_mov_id: int, top_k: int = 5) -> pd.DataFrame:
    df = movie_profile.copy()
    k = df.shape[0]

    similar_movies_first = find_most_similar_movies(movie_id=first_mov_id, top_k=k)
    similar_movies_second = find_most_similar_movies(movie_id=second_mov_id, top_k=k)

    merged = similar_movies_first.merge(similar_movies_second, how='inner', on='movie_id')

    pt = merged.pivot_table(index='movie_id')
    pt = pt.mean(axis=1)
    pt.sort_values(ascending=False, inplace=True)

    res = pt.to_frame(name="score")
    res = res.iloc[:top_k]
    res = pd.DataFrame({"movie_id": res.index, "score": res["score"]})
    res.reset_index(drop=True, inplace=True)
    
    res = get_movies_table_from_ids(res)

    return res

In [140]:
movie_titles[movie_titles['movie_id'] == 1]

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [141]:
movie_titles[movie_titles['movie_id'] == 402]

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
401,402,Ghost (1990),01-Jan-1990,,http://us.imdb.com/M/title-exact?Ghost%20(1990),0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0


Toy Story genres:

1.   Animation
2.   Children's
3.   Comedy

Ghost genres:

1.   Comedy
2.   Romance
3.   Thriller

In the results we should see something in between Toy Story & Ghost $\rightarrow$ movies with mixed genres.

In [142]:
top_similar = most_similar_for_2_movies(first_mov_id=1, second_mov_id=402, top_k=5)
top_similar.head()

Unnamed: 0,movie_id,score,movie_title,release date,video release date,IMDb URL
0,1219,0.721688,"Goofy Movie, A (1995)",01-Jan-1995,,http://us.imdb.com/M/title-exact?Goofy%20Movie...
1,422,0.666667,Aladdin and the King of Thieves (1996),01-Jan-1996,,http://us.imdb.com/M/title-exact?Aladdin%20and...
2,490,0.666667,To Catch a Thief (1955),01-Jan-1955,,http://us.imdb.com/M/title-exact?To%20Catch%20...
3,408,0.666667,"Close Shave, A (1995)",28-Apr-1996,,http://us.imdb.com/M/title-exact?Close%20Shave...
4,90,0.666667,So I Married an Axe Murderer (1993),01-Jan-1993,,http://us.imdb.com/M/title-exact?So%20I%20Marr...


In [143]:
movie_titles[movie_titles["movie_id"].isin([1219, 422, 490, 408, 90])]

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
89,90,So I Married an Axe Murderer (1993),01-Jan-1993,,http://us.imdb.com/M/title-exact?So%20I%20Marr...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0
407,408,"Close Shave, A (1995)",28-Apr-1996,,http://us.imdb.com/M/title-exact?Close%20Shave...,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
421,422,Aladdin and the King of Thieves (1996),01-Jan-1996,,http://us.imdb.com/M/title-exact?Aladdin%20and...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
489,490,To Catch a Thief (1955),01-Jan-1955,,http://us.imdb.com/M/title-exact?To%20Catch%20...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0
1218,1219,"Goofy Movie, A (1995)",01-Jan-1995,,http://us.imdb.com/M/title-exact?Goofy%20Movie...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0


As we can see it's produced movies as we anticipated (mixed genres)

**Good Results!**

## D

In [144]:
with open('item_vec_latent.npy', 'rb') as f:
    items_vec_latent = np.load(f)

items_vec_latent

array([[-0.04783021,  0.02221448, -0.05127662, ..., -0.01007532,
        -0.02563317, -0.01639253],
       [ 0.01588862,  0.02223237,  0.01566593, ...,  0.01292368,
         0.01028191, -0.00914688],
       [ 0.02308113, -0.000877  ,  0.01033436, ...,  0.01291555,
        -0.0148144 , -0.03359524],
       ...,
       [ 0.01516091,  0.04004909,  0.03174036, ...,  0.00368512,
        -0.00010248, -0.04094235],
       [ 0.02120049,  0.01216243, -0.00912177, ..., -0.02031322,
        -0.03148889,  0.01064072],
       [-0.00871174,  0.03308845, -0.00286855, ...,  0.02723331,
         0.00094329,  0.05252173]])

In [145]:
df_items_ex1 = pd.DataFrame(items_vec_latent)
df_items_ex1

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,-0.047830,0.022214,-0.051277,0.018955,-0.010539,0.008496,0.025852,0.054459,-0.023817,-0.005165,0.016127,0.026267,0.035355,0.025266,0.006466,-0.011044,-0.032430,-0.010075,-0.025633,-0.016393
1,0.015889,0.022232,0.015666,0.008329,0.007295,0.000263,0.010206,-0.018697,0.012755,0.007449,-0.000243,-0.046671,0.019493,0.003477,0.010911,0.030813,0.014695,0.012924,0.010282,-0.009147
2,0.023081,-0.000877,0.010334,0.004438,0.013743,0.005990,-0.003703,-0.018383,0.015738,0.043424,0.001468,0.009810,0.014812,0.036310,-0.015183,0.025970,-0.017467,0.012916,-0.014814,-0.033595
3,-0.014282,0.004960,0.015986,0.007655,-0.009241,0.030535,0.010042,0.015250,0.037587,-0.005528,-0.027891,-0.001879,0.011368,-0.027538,0.012439,-0.010332,-0.017225,0.015146,-0.005361,0.012977
4,-0.009577,-0.024733,0.021396,0.022057,0.000463,0.041410,-0.016403,-0.019643,-0.027070,-0.034789,-0.042950,0.023231,-0.027747,-0.018103,0.003531,-0.001671,-0.035076,-0.010347,0.008591,0.026025
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,0.023184,0.026013,0.014345,0.013616,-0.025923,0.005807,0.033800,-0.019447,0.004049,-0.041714,0.026322,-0.016132,0.008790,0.004170,0.011377,0.012401,-0.021064,0.002501,-0.048414,0.019632
1678,-0.056877,0.037772,0.010527,0.001813,0.002640,-0.015562,-0.006522,-0.022140,-0.022059,-0.010712,0.002430,-0.013007,0.007945,-0.018803,0.015815,0.011522,0.020750,-0.027863,0.033905,0.014389
1679,0.015161,0.040049,0.031740,0.010561,-0.021497,-0.031669,0.010174,0.027132,-0.005823,-0.032951,0.016829,-0.003812,0.000192,-0.001203,0.001142,0.003432,-0.048768,0.003685,-0.000102,-0.040942
1680,0.021200,0.012162,-0.009122,0.007452,0.000563,-0.010825,0.006672,-0.023340,0.013082,0.004566,0.023572,0.012783,0.035359,-0.021385,0.000472,-0.013131,-0.023517,-0.020313,-0.031489,0.010641


In [146]:
_ = df_items_ex1
df_items_ex1_cos_sim = pd.DataFrame(cosine_similarity(_, _), index=_.index + 1, columns=_.index + 1)
df_items_ex1_cos_sim

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,1643,1644,1645,1646,1647,1648,1649,1650,1651,1652,1653,1654,1655,1656,1657,1658,1659,1660,1661,1662,1663,1664,1665,1666,1667,1668,1669,1670,1671,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,1.000000,-0.368499,-0.060298,0.045094,-0.148955,0.089401,0.012261,0.107576,-0.014639,0.166914,-0.004493,-0.147061,0.010930,-0.201001,-0.023828,-0.155354,-0.301902,0.160771,0.108748,-0.038651,0.107234,0.187469,0.243426,0.271970,-0.184363,-0.134612,-0.173588,0.122387,-0.136159,0.044130,-0.179307,-0.156573,-0.174929,-0.066519,0.270463,-0.096941,-0.085525,-0.003208,-0.242804,0.506207,...,0.145854,0.271761,0.039841,0.012205,-0.183651,-0.046569,0.283053,0.036322,0.085589,0.121377,-0.212264,0.460641,0.091918,0.507066,-0.049666,-0.246374,-0.018840,0.125273,0.117997,-0.062008,0.353439,0.088489,0.051704,-0.444061,-0.032090,0.241286,-0.065924,0.007620,-0.080398,-0.356059,0.043105,0.069914,-0.369793,0.241246,-0.121633,0.094924,0.006712,0.240328,0.153660,0.212588
2,-0.368499,1.000000,0.313474,0.033400,-0.368053,-0.220800,0.032579,0.164773,-0.246206,-0.005304,0.090226,0.237098,-0.155474,0.050440,-0.009752,0.320401,0.174803,0.059179,-0.165036,0.129775,-0.001392,-0.276989,-0.225336,-0.311257,-0.162654,-0.051824,-0.158737,0.005449,-0.101171,-0.394512,0.178051,0.162911,0.082840,0.198701,0.015502,0.272466,0.267067,0.609615,0.138511,0.046202,...,0.095822,0.048928,-0.248338,-0.109858,-0.309568,0.571712,-0.193170,0.069392,-0.486284,-0.019337,0.174527,-0.135580,0.044989,-0.387045,0.368600,0.078159,-0.071415,-0.185196,-0.034153,0.106194,-0.132757,0.041537,0.336419,0.012777,0.158313,-0.019387,-0.067667,-0.043902,-0.172188,-0.275720,-0.505435,0.129573,0.048411,-0.351932,-0.148275,0.260929,0.197061,0.102602,-0.029763,0.009031
3,-0.060298,0.313474,1.000000,-0.184132,-0.281953,-0.156866,-0.003729,0.183377,-0.390355,0.010021,-0.140235,0.126966,-0.153166,0.102730,-0.342067,0.271002,-0.092980,-0.051116,-0.268739,0.082019,0.394431,-0.167997,-0.265353,0.165367,-0.002185,-0.050031,-0.118622,-0.305592,-0.202738,-0.328010,-0.133287,0.252913,0.146597,-0.105807,0.163569,0.021938,0.272905,0.310827,0.184299,0.482161,...,-0.243092,0.468170,-0.267864,0.188754,-0.098002,0.464955,-0.035345,0.202535,0.069050,0.274346,0.356534,-0.013323,-0.112866,0.193689,-0.050500,-0.011092,0.036910,-0.082452,-0.007555,0.148936,-0.141669,0.011055,0.404319,-0.238414,-0.165786,0.041539,-0.283619,-0.244887,-0.039139,-0.029737,-0.173131,-0.092184,-0.459699,0.130960,-0.130156,-0.039231,-0.462575,0.048277,0.141143,-0.140425
4,0.045094,0.033400,-0.184132,1.000000,0.316618,0.262509,-0.095359,0.104148,0.019876,-0.156545,0.208399,-0.013963,-0.189931,-0.201443,0.242239,0.049149,0.257145,-0.037469,0.282366,0.336560,-0.161530,0.170038,-0.144742,-0.500024,-0.340226,0.254008,-0.357297,-0.263988,-0.157271,-0.193804,-0.040739,-0.174642,0.187882,-0.004512,0.197982,-0.046744,-0.178851,0.002050,0.157715,0.263588,...,-0.104849,-0.013286,-0.220394,0.383820,0.034089,0.020278,-0.187383,0.044926,0.174956,-0.255669,-0.265750,0.174189,0.017366,-0.171451,-0.310110,0.033258,-0.077647,-0.018622,0.177948,-0.011794,0.476376,-0.339349,-0.113305,0.177698,-0.074095,0.217012,0.145929,0.031134,-0.343522,0.166062,-0.050883,-0.110040,0.178057,0.020849,0.016494,0.152172,-0.081708,0.024685,0.059517,-0.036755
5,-0.148955,-0.368053,-0.281953,0.316618,1.000000,0.007823,0.170728,-0.190936,-0.109926,-0.088699,-0.081242,0.089520,-0.106859,0.010064,0.218363,0.005680,0.193775,-0.019554,-0.164518,-0.011836,-0.003899,0.251581,-0.159390,-0.015469,-0.212066,0.306363,-0.133574,-0.177901,0.262332,0.334761,0.158262,0.100657,-0.302290,-0.146674,-0.336904,-0.231107,-0.026016,-0.250731,-0.183758,0.066096,...,-0.238937,0.067627,-0.278762,0.067813,0.120354,-0.250403,-0.078826,0.285560,0.025385,-0.132604,0.168295,-0.030497,0.003786,-0.280402,-0.488858,0.121323,0.192485,0.448307,0.025358,0.125723,0.192921,-0.306660,0.030335,0.105302,0.156694,0.008145,-0.266004,0.344549,0.070770,0.065558,0.216451,-0.443869,0.253838,-0.161326,0.190342,0.013168,0.068083,-0.101527,-0.163029,-0.102740
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,0.094924,0.260929,-0.039231,0.152172,0.013168,-0.399247,0.010474,0.071242,-0.049377,0.117439,0.411789,-0.029335,0.086366,0.170167,0.185121,0.144705,0.018620,0.106638,0.054448,-0.157014,-0.216205,-0.084743,0.020591,0.072422,0.293135,0.200232,-0.199014,0.007394,0.219499,-0.042813,0.526284,-0.041843,0.042292,-0.055275,0.316259,-0.198525,-0.034900,0.515755,-0.003915,0.002430,...,-0.109317,0.047822,-0.288577,0.218382,-0.386059,0.166982,0.038306,-0.043690,-0.265948,0.216061,0.167440,0.031575,0.370092,-0.019081,-0.133036,0.039066,-0.223922,-0.134168,0.151957,-0.296289,0.446259,-0.025626,0.342231,0.008047,0.265164,-0.096610,-0.400532,0.190075,-0.383973,-0.098027,-0.350065,-0.116075,-0.035531,-0.275040,-0.372132,1.000000,-0.104009,0.415177,0.488688,0.139502
1679,0.006712,0.197061,-0.462575,-0.081708,0.068083,0.057282,0.170258,-0.091651,0.207842,0.022465,-0.157962,-0.018540,-0.336525,0.018379,0.449742,-0.064424,-0.169564,0.311042,-0.456117,0.221596,-0.119073,-0.359209,0.161785,-0.041258,-0.261165,-0.299931,-0.013957,0.417688,-0.029134,0.199775,0.115345,0.126233,-0.059482,0.037921,-0.187106,0.406175,-0.179245,-0.040947,-0.159160,-0.215950,...,0.390544,-0.125112,0.395273,-0.532578,-0.084647,-0.126684,-0.069496,-0.228859,-0.380495,-0.018502,-0.069349,-0.098510,-0.076722,-0.165092,0.133267,-0.070978,0.166429,-0.130028,0.133600,-0.113217,-0.147648,0.122558,-0.036878,0.115416,0.022064,-0.121454,0.168018,0.405448,0.197664,-0.053724,0.185978,0.088264,0.409177,-0.258399,-0.068726,-0.104009,1.000000,-0.027497,-0.118878,0.254880
1680,0.240328,0.102602,0.048277,0.024685,-0.101527,-0.370909,0.031459,0.274644,0.020788,-0.210184,0.463426,0.161201,-0.464082,-0.045917,-0.169159,-0.205298,-0.251144,0.318867,-0.107336,-0.140748,-0.120770,-0.030143,-0.095523,0.491634,0.000802,-0.175267,-0.093526,0.147356,-0.283981,0.153003,0.379191,-0.374302,0.361217,-0.021906,0.130096,0.109647,-0.174322,0.387738,0.351883,0.090020,...,-0.101443,-0.000552,-0.078422,0.293185,-0.062657,0.103572,-0.021539,0.303082,-0.440781,0.510625,0.166701,0.037750,0.263804,0.290280,0.292824,-0.010044,0.139670,-0.010623,-0.362490,-0.522322,0.574758,0.052055,0.188597,-0.216959,-0.252000,-0.478106,-0.260970,-0.159470,-0.421558,0.067318,-0.195735,-0.001864,0.092652,-0.205842,-0.130319,0.415177,-0.027497,1.000000,0.135019,0.059673
1681,0.153660,-0.029763,0.141143,0.059517,-0.163029,-0.106165,0.292333,0.537248,0.146897,-0.043668,0.265532,0.130179,0.178581,0.190021,0.211352,0.039550,-0.232948,0.397116,-0.031104,-0.010824,0.331503,-0.328388,0.124912,0.122270,0.433751,0.010118,0.064960,0.090761,-0.130187,0.053790,0.072678,0.037745,0.169838,0.076745,0.372755,-0.257802,-0.543154,0.293763,0.117955,0.183443,...,-0.273583,0.049804,0.156829,0.456657,0.104537,0.096004,0.081307,-0.252217,-0.084975,0.024343,0.153781,0.184180,0.336647,0.301954,-0.106508,-0.418223,-0.257161,-0.317257,0.151559,-0.315806,0.349467,-0.230918,0.237062,0.142436,-0.033831,0.118242,-0.041576,0.004658,0.105731,0.078521,0.194599,0.028744,-0.116929,-0.012214,-0.599713,0.488688,-0.118878,0.135019,1.000000,0.226509


In [147]:
def find_most_similar_items_mf(mov_id: int, top_k: int = 5) -> pd.DataFrame:
    item_vec = df_items_ex1_cos_sim[mov_id]  # Extract the movie similarity cosine vector
    item_vec = item_vec.drop(mov_id)  # Remove the movie itself from the vector

    top_items = item_vec.sort_values(ascending=False).iloc[:5]

    res = pd.DataFrame({"movie_id": top_items.index, "similartiy_score": top_items})
    res.reset_index(drop=True, inplace=True)

    return get_movies_table_from_ids(df=res)

In [148]:
find_most_similar_items_mf(mov_id=1, top_k=5)

Unnamed: 0,movie_id,similartiy_score,movie_title,release date,video release date,IMDb URL
0,1453,0.644865,Angel on My Shoulder (1946),01-Jan-1946,,http://us.imdb.com/M/title-exact?Angel%20on%20...
1,405,0.64395,Mission: Impossible (1996),22-May-1996,,http://us.imdb.com/M/title-exact?Mission:%20Im...
2,511,0.627022,Lawrence of Arabia (1962),01-Jan-1962,,http://us.imdb.com/M/title-exact?Lawrence%20of...
3,979,0.605899,"Trigger Effect, The (1996)",30-Aug-1996,,http://us.imdb.com/M/title-exact?Trigger%20Eff...
4,1139,0.598576,Hackers (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Hackers%20(1995)


In [149]:
find_most_similar_items_mf(mov_id=402, top_k=5)

Unnamed: 0,movie_id,similartiy_score,movie_title,release date,video release date,IMDb URL
0,236,0.647709,Citizen Ruth (1996),13-Dec-1996,,http://us.imdb.com/M/title-exact?Citizen%20Rut...
1,1180,0.644818,I Love Trouble (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?I%20Love%20Tr...
2,865,0.642947,"Ice Storm, The (1997)",01-Jan-1997,,http://us.imdb.com/M/title-exact?Ice+Storm%2C+...
3,591,0.627722,Primal Fear (1996),30-Mar-1996,,http://us.imdb.com/M/title-exact?Primal%20Fear...
4,449,0.594677,Star Trek: The Motion Picture (1979),01-Jan-1979,,http://us.imdb.com/M/title-exact?Star%20Trek:%...


Implemented also the second approach from C

In [150]:
def most_similar_for_2_items_mf(first_mov_id: int, second_mov_id: int, top_k: int = 5) -> pd.DataFrame:
    df = movie_profile.copy()
    k = df.shape[0]

    similar_items_first = find_most_similar_items_mf(mov_id=first_mov_id, top_k=k)
    similar_items_second = find_most_similar_items_mf(mov_id=second_mov_id, top_k=k)

    merged = similar_items_first.merge(similar_items_second, how='inner', on='movie_id')

    pt = merged.pivot_table(index='movie_id')
    pt = pt.mean(axis=1)
    pt.sort_values(ascending=False, inplace=True)

    res = pt.to_frame(name="score")
    res = res.iloc[:top_k]
    res = pd.DataFrame({"movie_id": res.index, "score": res["score"]})
    res.reset_index(drop=True, inplace=True)

    return get_movies_table_from_ids(res)

In [151]:
most_similar_for_2_items_mf(first_mov_id=1, second_mov_id=402, top_k=5)

Unnamed: 0,score,movie_id,movie_title,release date,video release date,IMDb URL


### Explanation

It is obvious that the results are not the same. To understand that we need to understand how the evaluation of each method is different from each other.

In the matrix factorization we calculating each item score (latent vector) based on users rating so two different movies with the same genres can be evaluated differently because one can be highly score rated movie and the other low score rated, On the other hand in content based approach we calculating each item score based on it's "dry" properties, for example genres, title, year of release, producer and more. 

# Question 2

In [44]:
user_x_movie = pd.pivot_table(ratings, values='binary_rating', index=['item_id'], columns = ['user_id'])
user_x_movie.sort_index(axis=0, inplace=True)

userIDs = user_x_movie.columns
user_profile = pd.DataFrame(columns = movie_profile.columns)

user_x_movie is the rating matrix. Rows are item_id, columns are user_id. Missing values are NaN

In [45]:
user_x_movie

user_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
1,1.0,1.0,,,1.0,1.0,,,,1.0,,,1.00,,-0.25,1.0,1.0,1.0,,1.0,1.00,,1.0,,1.0,1.0,,,,,,,,,,,,1.0,,,...,,,,1.0,,,1.00,,,-0.25,,,1.0,1.00,1.0,1.00,,1.0,1.0,1.0,1.0,,,1.0,,1.0,1.0,,1.0,1.0,-0.25,1.0,1.0,,1.0,,,1.0,,
2,1.0,,,,1.0,,,,,,,,1.00,,,,,,,,,-0.25,,,,,,,,1.0,,,,,,,,,,,...,,,,,,,,,,,,,1.0,,,,,,,,1.0,,,,,,,,,,1.00,,,,,,,,,1.0
3,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,-0.25,,,,,,1.0,-0.25,,,,,,1.0,,,,,,,,,,,,,1.0,,,,,,,
4,1.0,,,,,,1.0,,,1.0,,1.0,1.00,,,1.0,,1.0,1.0,,,1.00,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,1.00,,,1.0,,,-0.25,,,,,,,,,,,,,,1.0,1.00,,,,,,-0.25,,,
5,1.0,,,,,,,,,,,,-0.25,,,,,,,,-0.25,,,,,,,1.0,,,,,,,,,,,,,...,,,,1.0,,,,,,,,,1.0,,,1.00,,,,,,1.0,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1679,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1680,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1681,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [46]:
user_profile.head()

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western


The user profile is the average rating the user gave to movies which belong to each genre

In [47]:
for i in range(len(user_x_movie.columns)):
  working_df = movie_profile.mul(user_x_movie.iloc[:,i], axis=0)
  user_profile.loc[userIDs[i]] = working_df.mean(axis=0)

In [48]:
user_profile.head()

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0.219669,0.140625,0.026654,0.051471,0.261949,0.077206,0.013787,0.292279,0.002757,0.003676,0.043199,0.024816,0.013787,0.129596,0.13511,0.13511,0.073529,0.017463
2,0.157258,0.080645,0.016129,0.096774,0.173387,0.108871,0.0,0.487903,0.016129,0.032258,0.048387,0.016129,0.064516,0.241935,0.080645,0.197581,0.032258,0.016129
3,0.078704,0.032407,0.0,0.050926,0.138889,0.074074,-0.00463,0.199074,0.018519,0.0,0.032407,0.032407,0.050926,0.027778,0.032407,0.166667,-0.009259,0.0
4,0.25,0.125,0.0,0.125,0.208333,0.166667,0.041667,0.291667,0.0,0.041667,0.0,0.083333,0.083333,0.083333,0.083333,0.40625,0.083333,0.041667
5,0.158571,0.082857,0.071429,0.091429,0.228571,0.02,0.0,0.131429,0.004286,0.0,0.05,0.067143,-0.004286,0.051429,0.091429,0.078571,0.042857,0.021429


**TFIDF**

In the movie profile we want to give higher weight to rare genres. The movie profile is now represented by a TFIDF of the genres in the dataset

In [49]:
# TFIDF
df = movie_profile.sum()
idf = (len(movie_titles) / df).apply(np.log) #log inverse of DF
TFIDF = movie_profile.mul(idf.values)

In [50]:
TFIDF.head()

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0.0,0.0,3.690069,2.623718,1.20318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.902286,2.522464,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0
3,1.902286,0.0,0.0,0.0,1.20318,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,2.736391,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0


The recommended items to a user, are the items with highest Cosine similarity with the user profile vector

In [51]:
cosine_similarity_user_item = cosine_similarity(user_profile, TFIDF)

## A

In [52]:
def predict_most_similar_items_per_user(user_id: int,num_items: int = 5) -> np.ndarray:
    result = np.argsort(cosine_similarity_user_item[user_profile.index.get_loc(user_id),:])[::-1][:num_items]
    ret_result = [movie_profile.index[i] for i in result]
    return np.array(ret_result)

## B

### Test 1


In [53]:
user_idx = 0

In [54]:
s = user_profile.iloc[user_idx]
s.sort_values(ascending=False)

Drama          0.292279
Comedy         0.261949
Action         0.219669
Adventure      0.140625
Thriller       0.135110
Sci-Fi         0.135110
Romance        0.129596
Crime          0.077206
War            0.073529
Children's     0.051471
Horror         0.043199
Animation      0.026654
Musical        0.024816
Western        0.017463
Documentary    0.013787
Mystery        0.013787
Film-Noir      0.003676
Fantasy        0.002757
Name: 1, dtype: float64

We can see that user 1 is into Drama (0.292279), Comedy (0.261949) and Action (0.219669)

Let's predict and examine the results.

In [55]:
res = predict_most_similar_items_per_user(user_id=user_idx + 1, num_items=5)
df_res = pd.DataFrame({"movie_id": res + 1})
df_res = df_res.merge(movie_titles, how='inner', on='movie_id')
df_res

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
1,74,Faster Pussycat! Kill! Kill! (1965),01-Jan-1965,,http://us.imdb.com/M/title-exact?Faster%20Puss...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
2,65,What's Eating Gilbert Grape (1993),01-Jan-1993,,http://us.imdb.com/M/title-exact?What's%20Eati...,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
3,964,"Month by the Lake, A (1995)",01-Jan-1995,,http://us.imdb.com/M/title-exact?Month%20by%20...,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,723,Boys on the Side (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Boys%20on%20t...,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0


As we can see the first 2 movies are Action, Comedy & Drama.

3-5 movies are Comedy & Drama.


**Results are good!**

TFIDF is the movies profile matrix after TFIDF weighting 

In [56]:
TFIDF.iloc[res]

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
3,1.902286,0.0,0.0,0.0,1.20318,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73,1.902286,0.0,0.0,0.0,1.20318,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
64,0.0,0.0,0.0,0.0,1.20318,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
963,0.0,0.0,0.0,0.0,1.20318,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
722,0.0,0.0,0.0,0.0,1.20318,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Good Matches!**

### Test 2


In [57]:
user_idx = 111

In [58]:
s = user_profile.iloc[user_idx]
s.sort_values(ascending=False)

Drama          0.375000
Thriller       0.315217
Action         0.206522
Comedy         0.168478
Romance        0.146739
Children's     0.130435
Mystery        0.125000
Crime          0.125000
Sci-Fi         0.086957
Adventure      0.086957
Fantasy        0.065217
Film-Noir      0.043478
Horror         0.021739
War            0.016304
Documentary    0.000000
Animation      0.000000
Musical        0.000000
Western        0.000000
Name: 112, dtype: float64

We can see that user 112 is into Drama (0.375), Thriller (0.315217) and Action (0.206522)

Let's predict and examine the results.

In [59]:
res = predict_most_similar_items_per_user(user_id=user_idx + 1, num_items=5)
df_res = pd.DataFrame({"movie_id": res + 1})
df_res = df_res.merge(movie_titles, how='inner', on='movie_id')
df_res

Unnamed: 0,movie_id,movie_title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,337,"House of Yes, The (1997)",01-Jan-1997,,"http://us.imdb.com/M/title-exact?House+of+Yes,...",0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0
1,54,Outbreak (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Outbreak%20(1...,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
2,1025,Fire Down Below (1997),01-Jan-1997,,http://us.imdb.com/M/title-exact?Fire+Down+Bel...,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
3,1491,Tough and Deadly (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Tough%20and%2...,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
4,244,Smilla's Sense of Snow (1997),14-Mar-1997,,http://us.imdb.com/M/title-exact?Smilla%27s%20...,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0


As we can see all of them are Thriller & Drama.

All of them also Comedy or Action which is the number 3 in the preferences of the user


**Results are good!**

TFIDF is the movies profile matrix after TFIDF weighting 

In [60]:
TFIDF.iloc[res]

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
336,0.0,0.0,0.0,0.0,1.20318,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0
53,1.902286,0.0,0.0,0.0,0.0,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0
1024,1.902286,0.0,0.0,0.0,0.0,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0
1490,1.902286,0.0,0.0,0.0,0.0,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0
243,1.902286,0.0,0.0,0.0,0.0,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0


**Good Matches**

## C


In [61]:
movie_sim_df = pd.DataFrame(cosine_similarity(movie_profile, movie_profile),
                            index=movie_profile.index,
                            columns=movie_profile.index)
movie_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,1642,1643,1644,1645,1646,1647,1648,1649,1650,1651,1652,1653,1654,1655,1656,1657,1658,1659,1660,1661,1662,1663,1664,1665,1666,1667,1668,1669,1670,1671,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
0,1.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.666667,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.408248,0.258199,0.0,0.0,0.0,0.258199,0.0,0.0,0.0,0.57735,0.57735,0.0,0.0,0.288675,0.0,0.0,0.0,0.0,0.408248,0.333333,0.0,0.0,0.0,0.0,0.57735,...,0.0,0.57735,0.0,0.0,0.333333,0.0,0.408248,0.0,0.0,0.0,0.0,0.57735,0.408248,0.408248,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.57735,0.0,0.408248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0
1,0.0,1.0,0.57735,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,0.408248,0.408248,0.0,0.0,0.0,0.0,0.516398,0.0,0.0,0.0,0.774597,0.333333,0.408248,0.666667,0.0,0.0,0.57735,0.666667,0.57735,0.0,0.333333,0.0,0.666667,0.0,0.333333,0.0,0.0,0.408248,0.333333,0.0,...,0.0,0.0,0.0,0.408248,0.0,0.0,0.0,0.0,0.408248,0.0,0.0,0.0,0.0,0.0,0.408248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.408248,0.0,0.0,0.816497,0.0,0.0,0.0,0.0,0.0,0.408248,0.0,0.0,0.0
2,0.0,0.57735,1.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.707107,0.707107,0.0,0.0,0.0,0.0,0.447214,0.0,0.0,0.0,0.447214,0.0,0.707107,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0
3,0.333333,0.333333,0.0,1.0,0.333333,0.57735,0.408248,0.666667,0.57735,0.408248,0.0,0.0,0.57735,0.408248,0.57735,0.408248,0.516398,0.57735,0.57735,0.408248,0.516398,0.666667,0.408248,0.333333,0.57735,0.57735,0.57735,0.666667,0.57735,0.57735,0.333333,0.0,0.333333,0.816497,0.333333,0.408248,0.57735,0.0,0.333333,0.57735,...,0.57735,0.57735,0.57735,0.816497,0.666667,0.57735,0.408248,0.57735,0.408248,0.0,0.57735,0.57735,0.408248,0.408248,0.816497,0.57735,0.57735,0.57735,0.57735,0.408248,0.408248,0.57735,0.57735,0.57735,0.57735,0.57735,0.333333,0.408248,0.57735,0.57735,0.408248,0.57735,0.57735,0.57735,0.57735,0.57735,0.0,0.408248,0.57735,0.57735
4,0.0,0.333333,0.57735,0.333333,1.0,0.57735,0.408248,0.333333,0.57735,0.408248,0.816497,0.816497,0.0,0.408248,0.57735,0.0,0.516398,0.57735,0.57735,0.408248,0.258199,0.333333,0.816497,0.333333,0.0,0.0,0.0,0.666667,0.288675,0.57735,0.666667,0.0,0.333333,0.408248,0.333333,0.408248,0.57735,0.408248,0.333333,0.0,...,0.57735,0.0,0.57735,0.408248,0.666667,0.57735,0.0,0.57735,0.816497,0.0,0.57735,0.0,0.0,0.0,0.408248,0.57735,0.0,0.57735,0.57735,0.408248,0.408248,0.0,0.57735,0.57735,0.57735,0.0,0.666667,0.408248,0.57735,0.57735,0.408248,0.57735,0.57735,0.57735,0.57735,0.57735,0.408248,0.408248,0.0,0.57735


In [62]:
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
data_path = path.join(get_dataset_folder(), 'u1.base')

f1_train = pd.read_csv(data_path, delimiter='\t', names=r_cols, encoding='latin-1')
f1_train.drop(columns=["timestamp"], inplace=True)
f1_train['binary_rating'] = ratings.apply(brating, axis=1)

user_x_movie = pd.pivot_table(f1_train, values='binary_rating', index=['movie_id'], columns=['user_id'])
user_x_movie_n = user_x_movie.copy()
user_x_movie_n.fillna(0, inplace=True)
user_x_movie_n.head()

user_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
1,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,-0.25,0.0,0.0,1.0,1.0,1.0,-0.25,0.0,1.0,1.0,1.0,1.0,0.0,0.0,-0.25,0.0,1.0,1.0,0.0,1.0,-0.25,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,-0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.25,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.25,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,-0.25,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.25,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.25,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
5,-0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [63]:
def get_topN_rec(df: pd.DataFrame, n: int) -> pd.DataFrame:
    df2 = df
    df2 = df2.fillna(0)
    x = df2.T

    result = pd.DataFrame(np.zeros((0, n)), columns=[n for n in range(1, n + 1)])

    for i in x.columns:
        df1row = pd.DataFrame(x.nlargest(n, i).index.tolist(), index=[n for n in range(1, n + 1)], columns=[i]).T
        result = pd.concat([result, df1row], axis=0)
        
    return result

In [74]:
def calc_mrr_score(topN: pd.DataFrame) -> pd.DataFrame:
    top2N = topN.copy()
    for top in top2N.columns:
        for user in top2N.index:
            try:
                rating = \
                f1_test[(f1_test["user_id"] == user) & (f1_test["movie_id"] == top2N.at[user, top])].rating.values[-1]
            except:
                rating = -1
            if rating > 3:
                top2N.at[user, top] = 1
            elif rating == -1:
                top2N.at[user, top] = -1
            else:
                top2N.at[user, top] = 0

    mrr_calc = []
    for idx, row in top2N.iterrows():
        try:
            first_occ = list(mrr_score.loc[idx]).index(1) + 1
            mrr_calc.append(1 / first_occ)
        except:
            mrr_calc.append(0)

    top2N["MRR Score"] = mrr_calc
    return top2N

In [75]:
def get_similar_movie(movie_id) -> (pd.DataFrame, pd.DataFrame):
    if movie_id not in movie_profile.index:
        print(movie_id, " not in movie_profile")
        return None, None
    else:
        sim_movie = movie_sim_df.sort_values(by=movie_id, ascending=False).index[1:]
        sim_score = movie_sim_df.sort_values(by=movie_id, ascending=False).loc[:, movie_id].tolist()[1:]
        return sim_movie, sim_score

In [76]:
# predict the rating of movie x by user y
def predict_rating(user_id: int, movie_id: int, max_neighbor: int = 10) -> np.ndarray:
    movies, scores = get_similar_movie(movie_id)
    movie_arr = []
    sim_arr = []
    for movie, score in zip(movies, scores):
        if (movie in user_x_movie_n.index):
            movie_arr.append(movie)
            sim_arr.append(score)

    sim_arr = np.array([x for x in scores])
    movie_arr = np.array([x for x in movie_arr])

    # select only the movies that has already rated by user x
    filtering = user_x_movie_n[user_id].loc[movie_arr] > 0
    indxs = filtering[filtering == True].index
    # calculate the predicted score
    s = 0.0
    # don't estimate rating by less than 4 nearest neighbors (by content)
    if np.sum(sim_arr[indxs][:max_neighbor]) > 0.0 and np.where(sim_arr[indxs] > 0.0)[0].size > 3:
        s = np.dot(sim_arr[indxs][:max_neighbor], user_x_movie_n[user_id].loc[movie_arr[indxs][:max_neighbor]]) \
            / np.sum(sim_arr[indxs][:max_neighbor])

    return s

In [77]:
data_path = path.join(get_dataset_folder(), 'u1.test')

f1_test = pd.read_csv(data_path, delimiter='\t', names=r_cols, encoding='latin-1')
f1_test.drop(columns=["timestamp"], inplace=True)

In [78]:
def predict_values(f1_test: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame(index=set(f1_test["user_id"]), columns=set(f1_test["movie_id"]))
    for idx, row in f1_test.iterrows():
        user, item = row["user_id"], row["movie_id"]
        try:
            pred_value = predict_rating(user, item)
        except:
            # print("user {} or item {} are not exists in U or V".format(user,item))
            continue;
        df.at[user, item] = pred_value
    return df

In [69]:
df = predict_values(f1_test=f1_test)

In [79]:
topN = get_topN_rec(df=df, n=5)

In [80]:
mrr_score = calc_mrr_score(topN=topN)

In [81]:
mrr_score = mrr_score[mrr_score.sum(axis = 1) != -5]

In [85]:
print("MRR:", np.round(mrr_score["MRR Score"].mean(), 3))

MRR: 0.701


## D
As we can see above with content base recommender we achieved 0.701 with MRR evaluation and in matrix factorization we recieved 0.90127 score.

Using matrix factorization, recommendations are based on movies each user watched, and therefore, most viewed movies are more potential to be recommended while their score was high, while in collaborative filtering, a variety of movies are recommended, therefore, when calculating the MRR score, a lot of "rare" movies will be recommended, and therefore the chances of a user to score those items is low.


## E

Content based are better in few aspects, content based filtering isn't sensitive to **cold start problem**, because when a new movie is added to the list, he will be recommended based on his properties and not based on ratings.

Second, in content based we can add more information about the movies or the users in order to further enhance our system.

Content based is easier to explain because each recommendation is based on its properties and therefore easily to interpret, on the other hand in MF we have latent vector, which it's operate like a black box.

One of the most biggest advantages of matrix factorization approach is that the accuracy if we have enough data (for example in our case rating) it is much better and accurate. We can achieve pretty high results because the MF look at each user and customizing recommendation more personally than content based but it's cost us much more computing time against content based approach.
