## Item Based Movie Recommendation System

Two types of collaborative filtering:

* User based
    * Let’s say we want to show recommendations to user **A**.
    * In this method, we try to find a similar user **B** who tends to like the same items that user **A** likes.
    * We then recommend user **B**’s other liked items to user **A**.
    * The logic behind this is that similar people may like similar items.
* Item based
    * Let’s say a user buys item **P**.
    * Across all users’ data, there is an item **S** that is bought almost every time item **P** is bought.
    * So we recommend item **S** to users whenever they buy item **P**.
    * The logic behind this is that similar items may be sold together.
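The item-based idea can be illustrated with a toy purchase log. The baskets and item names below are made up for illustration, and counting co-purchases is only a minimal sketch of the intuition; the rest of this notebook uses cosine similarity on ratings instead.

```python
from collections import Counter

# Hypothetical purchase baskets (one set of items per user); not real data.
baskets = [
    {"P", "S"},
    {"P", "S", "Q"},
    {"P", "S"},
    {"Q", "R"},
]

def co_purchase_counts(baskets, item):
    """Count how often every other item appears in a basket that contains `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})
    return counts

counts = co_purchase_counts(baskets, "P")
best_item, best_count = counts.most_common(1)[0]
print(best_item, best_count)
```

In this toy log, the item most often bought together with **P** is the one we would recommend to a user who buys **P**.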


### Item Based Recommendation System

In [1]:
# pandas to load the data files
import pandas as pd

# for numerical operations
import numpy as np

# scikit-learn to measure pairwise distances
from sklearn.metrics.pairwise import pairwise_distances

Let's read the data

In [1]:
header = ['user_id', 'item_id', 'rating', 'timestamp']
movielens_data = pd.read_csv('../datasets/ml-100k/u.data', sep='\t', names=header)
movielens_data.head()


In [3]:
movielens_data.shape

(100000, 4)

In [22]:
n_users, n_movies  = movielens_data['user_id'].nunique(), movielens_data['item_id'].nunique() #unique users and movies
n_users, n_movies


(943, 1682)

### We can see we have 100k ratings from 943 unique users for 1682 unique movies.

In [24]:
# Build the user x movie rating matrix with a loop.
# (pandas' pivot_table could also create this 2D matrix.)

train_data_matrix = np.zeros((n_users, n_movies))

for row in movielens_data.itertuples():
    # ids start at 1, so subtract 1 to get the row (user) and column (movie) index
    train_data_matrix[row.user_id - 1, row.item_id - 1] = row.rating
    
train_data_matrix.shape

(943, 1682)
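As an alternative to the loop, the same kind of user x movie matrix can be built with pandas' `pivot_table`. The sketch below uses a tiny made-up ratings table with the same column names as `u.data`; the variable names `ratings`, `matrix` and `dense` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Tiny hypothetical ratings table with the same column names as u.data.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [1, 3, 1, 2],
    "rating":  [5, 4, 3, 2],
})

# Rows are users, columns are movies; missing ratings become 0.
matrix = ratings.pivot_table(index="user_id", columns="item_id",
                             values="rating", fill_value=0)
dense = matrix.to_numpy(dtype=float)
print(dense)
```

Note that `pivot_table` only creates rows and columns for ids that actually occur in the data, so row `i` matches user `i + 1` only when the ids are contiguous, which is an assumption to check before swapping it in for the loop.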

In [7]:
train_data_matrix

array([[5., 3., 4., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 5., 0., ..., 0., 0., 0.]])

This 2D matrix has `943` rows & `1682` columns.<br>
### Each row represents a user, each column represents a movie.<br>

If a user has rated a movie, the value in that user's row and that movie's column is the rating the user gave it. All other values in that row are zero.<br>

### Calculate user_distances & movie_distances.<br>
* `user_distances` holds the distance of each user to every other user.<br>
* `movie_distances` holds the distance of each movie to every other movie.<br>

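### We will use cosine distance to measure distance.  
Cosine distance compares the direction of two rating vectors rather than their length, which often suits sparse rating vectors better than metrics such as Euclidean distance.<br>
Here's the formula used to calculate the cosine distance between two vectors $u$ and $v$:<br>

$$d(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$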
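Cosine distance is one minus the cosine of the angle between two vectors, and it can be computed by hand with NumPy. The helper `cosine_distance` and the toy vectors below are illustrative assumptions, not part of the notebook; the sketch also assumes neither vector is all zeros.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes neither vector is all zeros."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([5.0, 3.0, 0.0])
b = np.array([10.0, 6.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 4.0])   # no rated movies in common with a

d_ab = cosine_distance(a, b)
d_ac = cosine_distance(a, c)
print(d_ab, d_ac)
```

Vectors pointing in the same direction are at distance (approximately) 0 regardless of their length, while vectors with no overlap are at distance 1.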

In [8]:
user_distances = pairwise_distances(train_data_matrix, metric="cosine")

# ".T" below is to transpose our 2D matrix.
train_data_matrix_transpose = train_data_matrix.T
movie_distances = pairwise_distances(train_data_matrix_transpose, metric="cosine")

user_distances.shape, movie_distances.shape

((943, 943), (1682, 1682))

### "Distance" here means how far apart two users are in terms of their favorite movies.<br>
For example: <br>
* User **A** likes 6 movies and gives each of them a rating of 5. <br>
* User **B** likes 4 movies and gives each of them a rating of 5.<br><br>

Suppose all 4 of those movies are among the 6 movies that user **A** likes. <br>
Then the distance between user **B** & user **A** is smaller than the distance from either of them to a user **C** whose favorite movies have no overlap with those of **A** or **B**.
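The A/B/C example can be checked numerically. The sketch below assumes a catalogue of 10 movies and made-up rating vectors (0 means "not rated"); the helper `cosine_distance` is a hypothetical hand-written version of the metric.

```python
import numpy as np

# Hypothetical ratings over 10 movies; 0 means "not rated".
user_a = np.array([5, 5, 5, 5, 5, 5, 0, 0, 0, 0], dtype=float)  # likes 6 movies
user_b = np.array([5, 5, 5, 5, 0, 0, 0, 0, 0, 0], dtype=float)  # 4 of A's 6 movies
user_c = np.array([0, 0, 0, 0, 0, 0, 5, 5, 5, 5], dtype=float)  # no overlap with A or B

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes neither vector is all zeros."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

d_ab = cosine_distance(user_a, user_b)
d_ac = cosine_distance(user_a, user_c)
print(round(d_ab, 4), round(d_ac, 4))
```

Here the distance between **A** and **B** is about 0.18, while the distance between **A** and **C** is 1, the largest value possible for non-negative ratings.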

In [9]:
user_distances

array([[0.        , 0.83306902, 0.95254046, ..., 0.85138306, 0.82049212,
        0.60182526],
       [0.83306902, 0.        , 0.88940868, ..., 0.83851522, 0.82773219,
        0.89420212],
       [0.95254046, 0.88940868, 0.        , ..., 0.89875744, 0.86658385,
        0.97344413],
       ...,
       [0.85138306, 0.83851522, 0.89875744, ..., 0.        , 0.8983582 ,
        0.90488042],
       [0.82049212, 0.82773219, 0.86658385, ..., 0.8983582 , 0.        ,
        0.81753534],
       [0.60182526, 0.89420212, 0.97344413, ..., 0.90488042, 0.81753534,
        0.        ]])

In [10]:
movie_distances

array([[0.        , 0.59761782, 0.66975521, ..., 1.        , 0.95281693,
        0.95281693],
       [0.59761782, 0.        , 0.72693082, ..., 1.        , 0.92170064,
        0.92170064],
       [0.66975521, 0.72693082, 0.        , ..., 1.        , 1.        ,
        0.90312495],
       ...,
       [1.        , 1.        , 1.        , ..., 0.        , 1.        ,
        1.        ],
       [0.95281693, 0.92170064, 1.        , ..., 1.        , 0.        ,
        1.        ],
       [0.95281693, 0.92170064, 0.90312495, ..., 1.        , 1.        ,
        0.        ]])

### The values above represent "distances".<br>So, let's make "similarity" matrices from them.<br>We can calculate "similarity" by subtracting every value from 1.

In [11]:
user_similarity = 1 - user_distances
movie_similarity = 1 - movie_distances

In [12]:
user_similarity

array([[1.        , 0.16693098, 0.04745954, ..., 0.14861694, 0.17950788,
        0.39817474],
       [0.16693098, 1.        , 0.11059132, ..., 0.16148478, 0.17226781,
        0.10579788],
       [0.04745954, 0.11059132, 1.        , ..., 0.10124256, 0.13341615,
        0.02655587],
       ...,
       [0.14861694, 0.16148478, 0.10124256, ..., 1.        , 0.1016418 ,
        0.09511958],
       [0.17950788, 0.17226781, 0.13341615, ..., 0.1016418 , 1.        ,
        0.18246466],
       [0.39817474, 0.10579788, 0.02655587, ..., 0.09511958, 0.18246466,
        1.        ]])

In [13]:
movie_similarity

array([[1.        , 0.40238218, 0.33024479, ..., 0.        , 0.04718307,
        0.04718307],
       [0.40238218, 1.        , 0.27306918, ..., 0.        , 0.07829936,
        0.07829936],
       [0.33024479, 0.27306918, 1.        , ..., 0.        , 0.        ,
        0.09687505],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.04718307, 0.07829936, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.04718307, 0.07829936, 0.09687505, ..., 0.        , 0.        ,
        1.        ]])

### Now, the matrices above represent "similarities".


In [14]:
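idx_to_movie = {}

# u.item is pipe-separated: the first field is the movie id, the second is the title
with open('../datasets/ml-100k/u.item', 'r', encoding="ISO-8859-1") as f:
    for line in f:
        info = line.split('|')
        idx_to_movie[int(info[0])-1] = info[1]  # ids start at 1; shift to a 0-based index

# reverse mapping: title -> index
movie_to_idx = {v: k for k, v in idx_to_movie.items()}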

In [15]:
idx_to_movie[0], idx_to_movie[1], idx_to_movie[2], idx_to_movie[3] 

('Toy Story (1995)',
 'GoldenEye (1995)',
 'Four Rooms (1995)',
 'Get Shorty (1995)')

In [16]:
movie_to_idx['Toy Story (1995)'], movie_to_idx['GoldenEye (1995)'], movie_to_idx['Four Rooms (1995)'], movie_to_idx['Get Shorty (1995)'] 

(0, 1, 2, 3)

* `idx_to_movie` is a dictionary that maps a movie index to the movie name<br>
* `movie_to_idx` is a dictionary that maps a movie name to its movie index<br>
<br>
Now, let's write a function that, given a movie's index, finds the `k` closest movies to it.

In [52]:
# We take that movie's row of the similarity matrix & sort it by value.
# The values represent "similarity", so we keep the k+1 largest, most similar first.
# The extra entry is the movie itself (similarity 1), which the caller skips.

def top_k_movies(similarity, mapper, movie_idx, k=6):
    return [mapper[x] for x in np.argsort(similarity[movie_idx,:])[-k-1:]][::-1]

In [40]:
np.argsort(movie_similarity[0,:])[:-6 -2:-1]

array([  0,  49, 180, 120, 116, 404, 150], dtype=int64)

In [43]:
np.argsort(movie_similarity[0,:])[-6:]

array([404, 116, 120, 180,  49,   0], dtype=int64)
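The two slicing experiments above rely on how `np.argsort` orders indices. A minimal sketch on a made-up similarity row (the values and the variable names `row`, `order`, `top_with_self` and `top_k` are hypothetical) shows the same steps on a small scale:

```python
import numpy as np

# Hypothetical similarity row for movie 0 (similarity with itself is 1.0).
row = np.array([1.0, 0.2, 0.9, 0.0, 0.5])

order = np.argsort(row)               # indices sorted by ascending similarity
k = 2
top_with_self = order[-k-1:][::-1]    # k+1 most similar, most similar first
top_k = top_with_self[1:]             # drop the first entry (the movie itself)
print(order, top_with_self, top_k)
```

Dropping the first entry assumes the movie itself ranks first; if another movie had an identical rating vector (similarity 1 as well), the tie could be ordered either way, so this is a simplification.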

## Let's find movies similar to "Batman Forever".<br>
### We can recommend these movies to users who like "Batman Forever".

In [54]:
favorite_movie_name = 'Batman Forever (1995)'
movie_index = movie_to_idx[favorite_movie_name]
movie_index

28

In [55]:
how_much_movie_to_show = 7

movies = top_k_movies(movie_similarity, idx_to_movie, movie_index, k = how_much_movie_to_show)
movies[1:how_much_movie_to_show + 1]

['Batman (1989)',
 'Batman Returns (1992)',
 'Cliffhanger (1993)',
 'Demolition Man (1993)',
 'Stargate (1994)',
 'Net, The (1995)',
 'Waterworld (1995)']

In [56]:
favorite_movie_name = 'Star Wars (1977)'
movie_index = movie_to_idx[favorite_movie_name]
movie_index

49

In [57]:
how_much_movie_to_show = 7
movies = top_k_movies(movie_similarity, idx_to_movie, movie_index, k = how_much_movie_to_show)
movies[1:how_much_movie_to_show + 1]

['Return of the Jedi (1983)',
 'Raiders of the Lost Ark (1981)',
 'Empire Strikes Back, The (1980)',
 'Toy Story (1995)',
 'Godfather, The (1972)',
 'Independence Day (ID4) (1996)',
 'Indiana Jones and the Last Crusade (1989)']

# Summary

* We saw 2 types of recommendation: "User based" & "Item based".
* We saw how to create a 2D user-movie rating matrix, which we can use to compute distances between movies & between users.
* We learned that we can use cosine distance to calculate the distance between them.
* We saw how we can recommend movies to a user by finding the k nearest movies to a movie that the user likes.
