# Exploration 14: Movie Recommendation
---

#### Model
* Matrix Factorization

#### Data
* MovieLens 1M Dataset

#### Goals

1. Create a CSR matrix
2. Build an MF model
3. Recommend movies
---

#### Note

* Star ratings are explicit feedback, but here I treat them as implicit feedback.
* Each star rating is interpreted as a view count.
* I assume that a rating below three stars means the user did not like the movie, so those rows are excluded.

## Importing Dependencies

In [128]:
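import os

# Set the thread-related environment variables before importing the numeric
# libraries, so they are already in place when those libraries load.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
os.environ['MKL_NUM_THREADS'] = '1'

import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares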

print("All imported!")

All imported!


## Loading Data

In [101]:
rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'ratings', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
orginal_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,ratings,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


---

## Data Pre-processing

In [102]:
# keep only ratings of three stars or more
ratings = ratings[ratings['ratings']>=3]
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [103]:
# renaming the "ratings" column to "counts"
ratings.rename(columns={'ratings':'counts'}, inplace=True)

In [104]:
ratings['counts']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: counts, Length: 836478, dtype: int64

In [105]:
# reading the meta data to check movie titles
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [106]:
ratings = ratings.join(movies.set_index('movie_id'), on='movie_id')
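ratings.head  # no parentheses, so this shows the bound method's repr; ratings.head() would show the first five rows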

<bound method NDFrame.head of          user_id  movie_id  counts  timestamp  \
0              1      1193       5  978300760   
1              1       661       3  978302109   
2              1       914       3  978301968   
3              1      3408       4  978300275   
4              1      2355       5  978824291   
...          ...       ...     ...        ...   
1000203     6040      1090       3  956715518   
1000205     6040      1094       5  956704887   
1000206     6040       562       5  956704746   
1000207     6040      1096       4  956715648   
1000208     6040      1097       4  956715569   

                                          title  \
0        One Flew Over the Cuckoo's Nest (1975)   
1              James and the Giant Peach (1996)   
2                           My Fair Lady (1964)   
3                        Erin Brockovich (2000)   
4                          Bug's Life, A (1998)   
...                                         ...   
1000203                 

In [107]:
# exclude timestamp and genre columns
ratings = ratings.drop(columns=['timestamp', 'genre'])
ratings.head()

Unnamed: 0,user_id,movie_id,counts,title
0,1,1193,5,One Flew Over the Cuckoo's Nest (1975)
1,1,661,3,James and the Giant Peach (1996)
2,1,914,3,My Fair Lady (1964)
3,1,3408,4,Erin Brockovich (2000)
4,1,2355,5,"Bug's Life, A (1998)"


In [108]:
#convert to lowercase
ratings['title'] = ratings['title'].str.lower()
ratings.head()

Unnamed: 0,user_id,movie_id,counts,title
0,1,1193,5,one flew over the cuckoo's nest (1975)
1,1,661,3,james and the giant peach (1996)
2,1,914,3,my fair lady (1964)
3,1,3408,4,erin brockovich (2000)
4,1,2355,5,"bug's life, a (1998)"


## Analysis

In [109]:
print('# of user_id:', ratings['user_id'].nunique())
print('# of movie_id:', ratings['movie_id'].nunique())

# of user_id: 6039
# of movie_id: 3628


In [110]:
# top 30 movies
movie_counts = ratings.groupby('title')['user_id'].count()
movie_counts.sort_values(ascending=False).head(30)

title
american beauty (1999)                                   3211
star wars: episode iv - a new hope (1977)                2910
star wars: episode v - the empire strikes back (1980)    2885
star wars: episode vi - return of the jedi (1983)        2716
saving private ryan (1998)                               2561
terminator 2: judgment day (1991)                        2509
silence of the lambs, the (1991)                         2498
raiders of the lost ark (1981)                           2473
back to the future (1985)                                2460
matrix, the (1999)                                       2434
jurassic park (1993)                                     2413
sixth sense, the (1999)                                  2385
fargo (1996)                                             2371
braveheart (1995)                                        2314
men in black (1997)                                      2297
schindler's list (1993)                                  2257
pr

In [111]:
# summary statistics for the number of movies rated per user
user_count = ratings.groupby('user_id')['movie_id'].count()
user_count.describe()

count    6039.000000
mean      138.512668
std       156.241599
min         1.000000
25%        38.000000
50%        81.000000
75%       177.000000
max      1968.000000
Name: movie_id, dtype: float64

## Add my favorites

In [112]:
ratings[ratings['title'].str.contains('toy story', regex=False)]

Unnamed: 0,user_id,movie_id,counts,title
40,1,1,5,toy story (1995)
50,1,3114,4,toy story 2 (1999)
203,3,3114,3,toy story 2 (1999)
469,6,1,4,toy story (1995)
581,8,1,4,toy story (1995)
...,...,...,...,...
998170,6032,1,4,toy story (1995)
998360,6035,1,4,toy story (1995)
998926,6036,3114,4,toy story 2 (1999)
999583,6037,3114,4,toy story 2 (1999)


In [113]:
my_favorite_id = [1, 356, 480, 1197, 1580]

my_favorite_title = []
for i in my_favorite_id:
    my_favorite_title.extend(list(movies[movies['movie_id'] == i]['title']))
    
my_movie = pd.DataFrame({'user_id':['6041']*5, 'movie_id': my_favorite_id, 'counts': [5] * 5, 'title': my_favorite_title})
my_movie

Unnamed: 0,user_id,movie_id,counts,title
0,6041,1,5,Toy Story (1995)
1,6041,356,5,Forrest Gump (1994)
2,6041,480,5,Jurassic Park (1993)
3,6041,1197,5,"Princess Bride, The (1987)"
4,6041,1580,5,Men in Black (1997)


In [114]:
if not ratings.isin({'user_id':['6041']})['user_id'].any():
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    ratings = pd.concat([ratings, my_movie], ignore_index=True)
    
ratings.tail(10)

Unnamed: 0,user_id,movie_id,counts,title
836473,6040,1090,3,platoon (1986)
836474,6040,1094,5,"crying game, the (1992)"
836475,6040,562,5,welcome to the dollhouse (1995)
836476,6040,1096,4,sophie's choice (1982)
836477,6040,1097,4,e.t. the extra-terrestrial (1982)
836478,6041,1,5,Toy Story (1995)
836479,6041,356,5,Forrest Gump (1994)
836480,6041,480,5,Jurassic Park (1993)
836481,6041,1197,5,"Princess Bride, The (1987)"
836482,6041,1580,5,Men in Black (1997)


In [115]:
#convert to lowercase
ratings['title'] = ratings['title'].str.lower()
ratings.tail(5)

Unnamed: 0,user_id,movie_id,counts,title
836478,6041,1,5,toy story (1995)
836479,6041,356,5,forrest gump (1994)
836480,6041,480,5,jurassic park (1993)
836481,6041,1197,5,"princess bride, the (1987)"
836482,6041,1580,5,men in black (1997)


## CSR Matrix

In [116]:
user_unique = ratings['user_id'].unique()
movie_unique = ratings['title'].unique()

#indexing
user_to_idx = {v:k for k, v in enumerate(user_unique)}
movie_to_idx = {v:k for k, v in enumerate(movie_unique)}

In [117]:
print(user_to_idx['6041'])
print(movie_to_idx['jurassic park (1993)'])

6039
107
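
The two dictionaries above give every distinct user and title a consecutive integer index. As a minimal sketch of the same idea on a toy frame, `pd.factorize` can stand in for the hand-built dictionaries. The toy data and variable names here are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy interactions; the values are made up for illustration.
toy = pd.DataFrame({
    "user_id": [10, 10, 42, 7],
    "title": ["toy story (1995)", "fargo (1996)", "toy story (1995)", "fargo (1996)"],
})

# factorize returns integer codes in order of first appearance,
# plus the unique labels needed to map codes back to the original values.
user_codes, user_labels = pd.factorize(toy["user_id"])
title_codes, title_labels = pd.factorize(toy["title"])

print(user_codes.tolist())   # [0, 0, 1, 2]
print(title_codes.tolist())  # [0, 1, 0, 1]
print(title_labels[1])       # fargo (1996)
```

The returned label arrays play the role of the reverse lookup (index back to title), so no second dictionary is needed.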


In [123]:
# replace data columns with indexing values
# (run this cell once: re-running it on already-indexed columns makes the length check fail)
temp_user_data = ratings['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(ratings):
    print('user_id column indexing OK!!')
    ratings['user_id'] = temp_user_data
else:
    print('user_id column indexing Fail!!')

temp_movie_data = ratings['title'].map(movie_to_idx.get).dropna()
if len(temp_movie_data) == len(ratings):
    print('movie column indexing OK!!')
    ratings['title'] = temp_movie_data
else:
    print('movie column indexing Fail!!')

ratings

user_id column indexing Fail!!
movie column indexing Fail!!


Unnamed: 0,user_id,movie_id,counts,title
0,0,1193,5,0
1,0,661,3,1
2,0,914,3,2
3,0,3408,4,3
4,0,2355,5,4
...,...,...,...,...
836478,6039,1,5,40
836479,6039,356,5,160
836480,6039,480,5,107
836481,6039,1197,5,5


In [125]:
# CSR Matrix
num_user = ratings['user_id'].nunique()
num_movie = ratings['title'].nunique()

csr_data = csr_matrix((ratings['counts'], (ratings.user_id, ratings.title)), shape = (num_user, num_movie))
csr_data

<6040x3628 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Row format>
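
To make the `(data, (row, col))` constructor used above concrete, here is a small sketch on made-up triplets. The toy indices and values are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy triplets (row index, column index, value); made up for illustration.
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([5, 3, 4, 5])

# shape is (number of users, number of items); cells not listed are implicit zeros.
toy_csr = csr_matrix((vals, (rows, cols)), shape=(3, 3))

print(toy_csr.toarray())
print(toy_csr.nnz)  # number of stored entries

# Fraction of cells that are empty.
sparsity = 1 - toy_csr.nnz / (toy_csr.shape[0] * toy_csr.shape[1])
print(f"{sparsity:.2%}")
```

Applying the same arithmetic to the matrix above (836483 stored entries in a 6040 x 3628 matrix) shows that roughly 96% of its cells are empty, which is why a sparse format is used.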

## Model

In [132]:
# Implicit AlternatingLeastSquares model
als_model = AlternatingLeastSquares(factors=500, regularization=0.01, use_gpu=False, iterations=30, dtype=np.float32)

In [133]:
# ALS model input: item x user matrix
# (implicit releases before 0.5 expect item x user here; later releases expect user x item)
csr_data_transpose = csr_data.T
csr_data_transpose

<3628x6040 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Column format>

In [134]:
als_model.fit(csr_data_transpose)

  0%|          | 0/30 [00:00<?, ?it/s]
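
To give a rough idea of what alternating least squares does during `fit`, here is a dense NumPy sketch of the implicit-feedback update from Hu, Koren, and Volinsky, the paper the `implicit` library's ALS is based on. This is not the library's code, which works on sparse data with further optimizations. The toy matrix, the `alpha` and `reg` values, and the function names are all assumptions for illustration.

```python
import numpy as np

def als_half_step(R, fixed, reg, alpha):
    """Solve for one side's factors while the other side (`fixed`) is held constant."""
    n_factors = fixed.shape[1]
    solved = np.zeros((R.shape[0], n_factors))
    for i in range(R.shape[0]):
        conf = 1.0 + alpha * R[i]          # confidence: c = 1 + alpha * count
        pref = (R[i] > 0).astype(float)    # binary preference: p = 1 if count > 0
        A = (fixed.T * conf) @ fixed + reg * np.eye(n_factors)
        b = (fixed.T * conf) @ pref
        solved[i] = np.linalg.solve(A, b)  # regularized weighted least squares
    return solved

def loss(R, U, V, reg, alpha):
    """Confidence-weighted squared error plus L2 regularization."""
    conf = 1.0 + alpha * R
    pref = (R > 0).astype(float)
    err = pref - U @ V.T
    return (conf * err ** 2).sum() + reg * ((U ** 2).sum() + (V ** 2).sum())

rng = np.random.default_rng(0)
# Toy user x item counts; made up for illustration.
R = np.array([[5, 3, 0, 0],
              [4, 0, 0, 3],
              [0, 0, 5, 4],
              [0, 3, 4, 5]], dtype=float)
U = rng.normal(scale=0.1, size=(4, 2))  # user factors
V = rng.normal(scale=0.1, size=(4, 2))  # item factors

losses = []
for _ in range(10):
    U = als_half_step(R, V, reg=0.01, alpha=1.0)    # update users, items fixed
    V = als_half_step(R.T, U, reg=0.01, alpha=1.0)  # update items, users fixed
    losses.append(loss(R, U, V, reg=0.01, alpha=1.0))

print(np.round(losses, 4))
```

Each half step solves its least-squares problem exactly, so the printed loss should never increase from one iteration to the next.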

In [137]:
me, toystory = user_to_idx['6041'], movie_to_idx['toy story (1995)']
me_vector, toystory_vector = als_model.user_factors[me], als_model.item_factors[toystory]

print(me_vector.shape)
print(toystory_vector.shape)

(500,)
(500,)


In [140]:
# preferred movie
meninblack_vector = als_model.item_factors[movie_to_idx['men in black (1997)']]
np.dot(me_vector, meninblack_vector)

0.92834914

In [139]:
# not preferred
platoon_vector = als_model.item_factors[movie_to_idx['platoon (1986)']]
np.dot(me_vector, platoon_vector)

-0.032725602

## Movie Recommendation

In [142]:
favorite_movie = 'toy story (1995)'
movie_id = movie_to_idx[favorite_movie]
similar_movie = als_model.similar_items(movie_id, N =10)
similar_movie

[(40, 0.9999999),
 (2938, 0.31202558),
 (3266, 0.30120242),
 (2969, 0.29848966),
 (3589, 0.29720727),
 (2364, 0.29604134),
 (3624, 0.29526508),
 (2791, 0.295201),
 (3625, 0.2946685),
 (3260, 0.29457784)]

In [144]:
# index to movie title
idx_to_movie = {v:k for k,v in movie_to_idx.items()}
[idx_to_movie[i[0]] for i in similar_movie]

['toy story (1995)',
 'nobody loves me (keiner liebt mich) (1994)',
 'amityville: dollhouse (1996)',
 'best men (1997)',
 'soft toilet seats (1999)',
 'getting even with dad (1994)',
 'slaughterhouse (1987)',
 'blood beach (1981)',
 'promise, the (versprechen, das) (1994)',
 'price of glory (2000)']
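
The similarity scores above compare latent item vectors. A minimal sketch of that ranking with cosine similarity on toy vectors follows. The vectors and the helper name `top_similar` are assumptions, and the library's exact scoring may differ between versions.

```python
import numpy as np

def top_similar(item_factors, item_idx, n=3):
    """Return (index, cosine similarity) pairs for the n items closest to item_idx."""
    norms = np.linalg.norm(item_factors, axis=1)
    scores = item_factors @ item_factors[item_idx] / (norms * norms[item_idx])
    best = np.argsort(-scores)[:n]  # highest similarity first
    return [(int(i), float(scores[i])) for i in best]

# Toy 2-factor item vectors; made up for illustration.
toy_items = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

print(top_similar(toy_items, item_idx=0))
```

The query item always ranks first with a similarity of 1, which matches the first entry in the output above, where index 40 (Toy Story itself) scores 0.9999999.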

In [145]:
def get_similar_movie(movie_title: str):
    movie_id = movie_to_idx[movie_title]
    similar_movie = als_model.similar_items(movie_id, N=10)
    similar_movie = [idx_to_movie[i[0]] for i in similar_movie]
    return similar_movie
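
`get_similar_movie` needs the exact lowercase title used as a key in `movie_to_idx`, and any other string raises a `KeyError`. A small lookup helper, sketched here with a hypothetical name `find_titles` and a toy stand-in for `movie_to_idx`, can list candidate titles first.

```python
def find_titles(query, title_index):
    """Return indexed titles that contain the query, case-insensitively."""
    q = query.lower()
    return sorted(t for t in title_index if q in t)

# Toy stand-in for movie_to_idx; the index values match the outputs shown in this notebook.
toy_index = {
    "toy story (1995)": 40,
    "toy story 2 (1999)": 50,
    "jurassic park (1993)": 107,
}

print(find_titles("Toy Story", toy_index))
# ['toy story (1995)', 'toy story 2 (1999)']
```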

In [148]:
get_similar_movie('terminator 2: judgment day (1991)')

['terminator 2: judgment day (1991)',
 'terminator, the (1984)',
 'grosse fatigue (1994)',
 'man from down under, the (1943)',
 'sorority house massacre ii (1990)',
 'city of the living dead (paura nella città dei morti viventi) (1980)',
 "i can't sleep (j'ai pas sommeil) (1994)",
 'running free (2000)',
 'stranger, the (1994)',
 'schlafes bruder (brother of sleep) (1995)']

In [149]:
get_similar_movie('e.t. the extra-terrestrial (1982)')

['e.t. the extra-terrestrial (1982)',
 'paralyzing fear: the story of polio in america, a (1998)',
 'murder! (1930)',
 'smashing time (1967)',
 'lured (1947)',
 'so dear to my heart (1949)',
 'house party 3 (1994)',
 'yankee zulu (1994)',
 'year of the horse (1997)',
 'held up (2000)']

In [150]:
get_similar_movie('back to the future (1985)')

['back to the future (1985)',
 'eaten alive (1976)',
 'arguing the world (1996)',
 'prom night iv: deliver us from evil (1992)',
 'slumber party massacre iii, the (1990)',
 'anna (1996)',
 'smoking/no smoking (1993)',
 'sorority house massacre ii (1990)',
 'tough and deadly (1995)',
 'small wonders (1996)']

In [154]:
# recommendations for me
user = user_to_idx['6041']

movie_recommend = als_model.recommend(user, csr_data, N=15, filter_already_liked_items = True)
movie_recommend

[(50, 0.10809891),
 (4, 0.10345341),
 (150, 0.1032293),
 (41, 0.09124587),
 (336, 0.08243993),
 (768, 0.08199596),
 (542, 0.079098724),
 (322, 0.0784945),
 (702, 0.075643),
 (913, 0.075600274),
 (58, 0.07500526),
 (479, 0.073181316),
 (1049, 0.071042),
 (105, 0.07095452),
 (265, 0.07023032)]

In [155]:
[idx_to_movie[i[0]] for i in movie_recommend]

['toy story 2 (1999)',
 "bug's life, a (1998)",
 'independence day (id4) (1996)',
 'rain man (1988)',
 'pretty woman (1990)',
 'get shorty (1995)',
 'rocky horror picture show, the (1975)',
 'babe (1995)',
 'field of dreams (1989)',
 'fly, the (1986)',
 'mission: impossible (1996)',
 'contact (1997)',
 'gods must be crazy, the (1980)',
 'last of the mohicans, the (1992)',
 'sweet hereafter, the (1997)']
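
Conceptually, the recommendation step scores every item by the dot product of the user vector with each item vector, drops the items the user has already interacted with, and keeps the top N. Here is a sketch under those assumptions. The toy vectors and the helper name `recommend_top_n` are illustrative, and this is not the library's implementation.

```python
import numpy as np

def recommend_top_n(user_vector, item_factors, seen_items, n=2):
    """Score all items by dot product, mask already-seen ones, return the top n."""
    scores = item_factors @ user_vector
    seen = np.array(sorted(seen_items), dtype=int)
    scores[seen] = -np.inf          # mirrors filtering out already-liked items
    best = np.argsort(-scores)[:n]  # highest score first
    return [(int(i), float(scores[i])) for i in best]

# Toy 2-factor vectors; made up for illustration.
toy_user = np.array([1.0, 0.5])
toy_items = np.array([
    [0.9, 0.2],   # item 0 (already seen)
    [0.1, 0.8],   # item 1
    [0.7, 0.7],   # item 2
    [-0.5, 0.1],  # item 3
])

print(recommend_top_n(toy_user, toy_items, seen_items={0}))
```

Item 0 has a high raw score but is masked out because the toy user has already seen it, so items 2 and 1 are returned instead.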

In [161]:
# find out which of my favorites contributed to this recommendation
recommended = movie_to_idx['toy story 2 (1999)']
explain = als_model.explain(user, csr_data, itemid=recommended)

[(idx_to_movie[i[0]], i[1]) for i in explain[1]]

[('toy story (1995)', 0.11156882214909725),
 ('men in black (1997)', 0.012746875296119982),
 ('princess bride, the (1987)', 0.004990073161428124),
 ('forrest gump (1994)', 0.004435500528995823),
 ('jurassic park (1993)', -0.026654536889060144)]

---

## Conclusion

Recommendation systems have become a key component of services such as YouTube, Netflix, and Spotify. <br/>
I had always been curious about recommendation systems, and it was fun to try out one of the most basic ones.<br/>
I would like to learn more about handling the cold-start problem and about more complex models.<br/>
Although some of the recommended movies didn't suit my taste, most of them did.<br/>
3 out of 3 goals!