<p>
<center>
<font size="5">
<b>Content-Based Filtering</b>
</font>
</center>
<center>
<font size="4">
<b>Content-Based Filtering using TF-IDF and Cosine Similarity</b>
</font>
</center>
</p>
   


### Overview

#### Aim
1. Create Similarity Matrix using TfidfVectorizer
2. Recommend Top 20 Movies Similar to a Particular Movie

### Import

In [1]:
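# Import Statements
# import_ipynb lets a notebook (.ipynb) be imported like a regular module
import import_ipynb
# Functions.ipynb supplies the data frames (data, movies), the scikit-learn names
# (TfidfVectorizer, linear_kernel) and the helper functions used in the cells below
from Functions import *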

importing Jupyter notebook from Functions.ipynb


### Create Similarity Matrix

#### Find Average Rating for Each Movie

In [2]:
# Copy the ratings data and drop the userId column, which is not needed to average ratings per movie
movie_info = data.copy(deep=True)
movie_info = movie_info.drop(columns='userId')
movie_info

Unnamed: 0,movieId,title,genres,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.5
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2.5
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,4.5
...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,4.0
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,3.5
100833,193585,Flint (2017),Drama,3.5
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,3.5


In [3]:
average_rating = movie_info.groupby(['title','movieId', 'genres'], as_index=False)['rating'].mean()
# average_rating = average_rating.to_frame()
average_rating

Unnamed: 0,title,movieId,genres,rating
0,'71 (2014),117867,Action|Drama|Thriller|War,4.000000
1,'Hellboy': The Seeds of Creation (2004),97757,Action|Adventure|Comedy|Documentary|Fantasy,4.000000
2,'Round Midnight (1986),26564,Drama|Musical,3.500000
3,'Salem's Lot (2004),27751,Drama|Horror|Mystery|Thriller,5.000000
4,'Til There Was You (1997),779,Drama|Romance,4.000000
...,...,...,...,...
9719,eXistenZ (1999),2600,Action|Sci-Fi|Thriller,3.863636
9720,xXx (2002),5507,Action|Crime|Thriller,2.770833
9721,xXx: State of the Union (2005),33158,Action|Crime|Thriller,2.000000
9722,¡Three Amigos! (1986),2478,Comedy|Western,3.134615


#### Compute Similarity Matrix using TF-IDF and Cosine Similarity

In [4]:
# Take genres from the movies table, which has one row per movie, so that titles are not repeated
genre = movies['genres']
# print(genre)
# Treat the placeholder '(no genres listed)' as an empty string so it adds no terms
genre = genre.replace(to_replace = '(no genres listed)', value = '')
print('This dataset includes movies from the following list of genres: \n', list_of_genres(genre))

This dataset includes movies from the following list of genres: 
 ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy', 'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'Mystery', 'Sci-Fi', 'War', 'Musical', 'Documentary', 'IMAX', 'Western', 'Film-Noir', '']
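
The helper `list_of_genres` is defined in `Functions.ipynb` and is not shown in this notebook. As a rough sketch of what such a helper might do, the function below (the name `unique_genres` is hypothetical) splits each pipe-separated string and keeps the distinct genres in first-seen order. Under that assumption an empty string yields an empty genre, which would be consistent with the trailing `''` in the list above.

```python
def unique_genres(genre_strings):
    """Return distinct genres in first-seen order from pipe-separated strings.

    Hypothetical stand-in for list_of_genres, whose real code is not shown.
    """
    seen = []
    for entry in genre_strings:
        for g in entry.split('|'):  # ''.split('|') gives [''], so empty entries survive
            if g not in seen:
                seen.append(g)
    return seen

# Toy input, not taken from the dataset
sample = ['Adventure|Animation|Children', 'Comedy|Romance', 'Adventure|Comedy', '']
print(unique_genres(sample))
```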


In [5]:
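# Initialize the vectorizer: unigrams and bigrams over the genre tokens.
# min_df=1 keeps every term, as min_df=0 did; recent scikit-learn releases reject the integer 0.
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')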
# constructing the genre tf-idf matrix using fit and transform
tfidf_genre_matrix = vectorizer.fit_transform(genre)
print(tfidf_genre_matrix.shape)
print(tfidf_genre_matrix)

(9742, 174)
  (0, 63)	0.4051430286389587
  (0, 47)	0.3681884973089335
  (0, 34)	0.38369482677526473
  (0, 18)	0.4008862821540716
  (0, 108)	0.30254034715329503
  (0, 59)	0.16761357728391116
  (0, 46)	0.3162303113127544
  (0, 33)	0.32335863498874723
  (0, 17)	0.26110809240797916
  (1, 51)	0.5795995638728872
  (1, 19)	0.5337814180965866
  (1, 108)	0.36554429536140276
  (1, 46)	0.382085190978399
  (1, 17)	0.31548378439611124
  (2, 68)	0.7695974416123483
  (2, 157)	0.5242383036039113
  (2, 59)	0.36454626441402677
  (3, 103)	0.5645649298589199
  (3, 62)	0.5417511322516687
  (3, 96)	0.2904365851652309
  (3, 157)	0.4522400920963429
  (3, 59)	0.31447995130958456
  (4, 59)	1.0
  (5, 84)	0.604518892749723
  (5, 5)	0.5454388121871825
  :	:
  (9733, 38)	0.835677806885533
  (9733, 96)	0.23714974930952545
  (9733, 33)	0.495381266784903
  (9734, 62)	0.7846149876753742
  (9734, 96)	0.42063760299449465
  (9734, 59)	0.4554594691761476
  (9735, 33)	1.0
  (9736, 86)	1.0
  (9737, 2)	0.5335755137706529
  (9
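
To make the matrix above less opaque, here is a dependency-free sketch of how word-level TF-IDF with unigrams and bigrams treats a genre string. It assumes scikit-learn's documented defaults (lowercasing, the token pattern `(?u)\b\w\w+\b`, smoothed IDF, L2-normalized rows) and ignores stop-word filtering; the function names are illustrative and not part of any library. Because the tokenizer splits on `|` and `-`, a genre such as `Sci-Fi` becomes the two tokens `sci` and `fi`, and adjacent tokens form bigrams, which would explain why the vocabulary is much larger than the list of genres.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default token_pattern

def ngrams(text, n_max=2):
    """Lowercase, tokenize, and return all 1-grams up to n_max-grams."""
    tokens = TOKEN.findall(text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalized rows, returned as term -> weight dicts."""
    counts = [Counter(ngrams(d)) for d in docs]
    n_docs = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency per term
    idf = {t: math.log((1 + n_docs) / (1 + k)) + 1 for t, k in df.items()}
    rows = []
    for c in counts:
        raw = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows

# Toy genre strings, not taken from the dataset
docs = ['Action|Sci-Fi', 'Comedy', 'Action|Comedy']
print(ngrams(docs[0]))
rows = tfidf_rows(docs)
print(rows[1])
```

A document with a single term ends up with a weight of 1 after normalization, which matches single-entry rows such as `(4, 59)	1.0` in the output above.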

In [6]:
similar_movie = linear_kernel(tfidf_genre_matrix, tfidf_genre_matrix)
similar_movie

array([[1.        , 0.31379419, 0.0611029 , ..., 0.        , 0.16123168,
        0.16761358],
       [0.31379419, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.0611029 , 0.        , 1.        , ..., 0.        , 0.        ,
        0.36454626],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.16123168, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.16761358, 0.        , 0.36454626, ..., 0.        , 0.        ,
        1.        ]])
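
`linear_kernel` computes plain dot products, yet the result can be read as cosine similarity because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). For unit-length vectors the dot product and the cosine coincide, which is also why the diagonal of the matrix is 1. A small pure-Python check of that identity, using made-up vectors:

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity from its definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Illustrative vectors only
a = [3.0, 4.0, 0.0]
b = [1.0, 0.0, 2.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

# Both expressions give the same value up to floating-point rounding
print(cosine(a, b), dot(a_hat, b_hat))
```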

### Recommend Top 20 Movies Similar to a Particular Movie

#### Generate Recommendations (By Movie)

In [7]:
title = movies['title']
print(title.head())

0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)
Name: title, dtype: object


In [8]:
# generate recommendations based on a target movie
recommended_movies = movie_list_by_title('Silence of the Lambs, The (1991)', title, similar_movie, average_rating)
recommended_movies

Top 20 Movies Similar to Silence of the Lambs, The (1991) are:


Unnamed: 0,title,movieId,genres,rating
0,"Usual Suspects, The (1995)",50,Crime|Mystery|Thriller,4.237745
1,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),32,Mystery|Sci-Fi|Thriller,3.983051
2,Seven (a.k.a. Se7en) (1995),47,Mystery|Thriller,3.975369
3,Heat (1995),6,Action|Crime|Thriller,3.946078
4,Casino (1995),16,Crime|Drama,3.926829
5,Dead Man Walking (1995),36,Crime|Drama,3.835821
6,Eye for an Eye (1996),61,Drama|Thriller,3.75
7,From Dusk Till Dawn (1996),70,Action|Comedy|Horror|Thriller,3.509091
8,GoldenEye (1995),10,Action|Adventure|Thriller,3.496212
9,Get Shorty (1995),21,Comedy|Crime|Thriller,3.494382
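
The code of `movie_list_by_title` is in `Functions.ipynb` and is not reproduced here. A minimal sketch of the core lookup such a function presumably performs is shown below: find the row of the target title in the similarity matrix, rank the other titles by score, and keep the top `n`. The name `top_n_similar` and the toy data are hypothetical. The table above appears to be ordered by average rating, so the real helper likely applies a further step that this sketch omits.

```python
def top_n_similar(target, titles, similarity, n=20):
    """Return up to n (title, score) pairs most similar to target, excluding target itself.

    Hypothetical sketch; titles is a list and similarity a square list of lists.
    """
    idx = titles.index(target)
    scores = similarity[idx]
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx),
        key=lambda i: scores[i],
        reverse=True,
    )
    return [(titles[i], scores[i]) for i in ranked[:n]]

# Toy data, not taken from the dataset
titles = ['A', 'B', 'C', 'D']
similarity = [
    [1.0, 0.3, 0.0, 0.8],
    [0.3, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.2],
    [0.8, 0.0, 0.2, 1.0],
]
print(top_n_similar('A', titles, similarity, n=2))
```

With a pandas Series of titles, as in this notebook, the Series would first need converting with `.tolist()`.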


#### Evaluation (By Movie)

In [9]:
# find hit rate of the recommendations based on genre
hit_rate_movie = movie_evaluation('Silence of the Lambs, The (1991)', recommended_movies)
print('The Hit Rate for this method is:', hit_rate_movie)

The Hit Rate for this method is: 0.4827586206896552
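
The notebook does not define how `movie_evaluation` computes the hit rate, so the figure above cannot be reproduced from this section alone. As one plausible reading of "hit rate based on genre", the sketch below counts a recommendation as a hit when it shares at least one genre with the target movie. This definition, the name `genre_hit_rate`, and the input strings are all assumptions for illustration and may differ from the actual helper.

```python
def genre_hit_rate(target_genres, recommended_genres):
    """Fraction of recommendations sharing at least one genre with the target.

    Assumed definition; the notebook's movie_evaluation may work differently.
    """
    if not recommended_genres:
        return 0.0
    target = set(target_genres.split('|'))
    hits = sum(1 for g in recommended_genres if target & set(g.split('|')))
    return hits / len(recommended_genres)

# Illustrative genre strings only
rate = genre_hit_rate(
    'Crime|Horror|Thriller',
    ['Crime|Drama', 'Comedy', 'Mystery|Thriller', 'Drama'],
)
print(rate)
```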
