> # RECOMMENDATION SYSTEM

- **Collaborative Filtering**: This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.   
- **Content-Based Filtering**: This method uses only the descriptions and attributes of the items a user has previously consumed to model that user's preferences. In other words, these algorithms try to recommend items similar to those a user liked in the past (or is examining in the present). In particular, candidate items are compared with items previously rated by the user, and the best-matching items are recommended.
- **Hybrid methods**: Research has shown that a hybrid approach, combining collaborative filtering and content-based filtering, can be more effective than either pure approach in some cases. These methods can also mitigate common problems in recommender systems, such as cold start and sparsity.
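
To make the collaborative filtering assumption concrete, here is a minimal sketch of a user-based prediction. The ratings matrix, the user labels and the similarity-weighted average are illustrative assumptions, not part of the TMDB notebook that follows.

```python
import numpy as np

# Hypothetical ratings: rows are users A, B, C; columns are four items; 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user A has not rated item 2
    [5.0, 5.0, 4.0, 1.0],   # user B, whose tastes are close to A's
    [1.0, 1.0, 2.0, 5.0],   # user C, whose tastes differ from A's
])

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target, item = 0, 2
rated = ratings[target] > 0   # compare users only on items the target has rated
others = [u for u in range(len(ratings)) if u != target and ratings[u, item] > 0]

weights = np.array([cosine(ratings[target, rated], ratings[u, rated]) for u in others])
neighbour_ratings = np.array([ratings[u, item] for u in others])

# Similarity-weighted average of the neighbours' ratings for the unseen item
predicted = float(weights @ neighbour_ratings / weights.sum())
print(weights, predicted)
```

User B agrees with A on the shared items, so B's rating carries more weight in the prediction than C's.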

# Content Based Recommendation System

## Import the libraries and read the dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
movies = pd.read_csv(r'tmdb_5000_movies.csv')
credits = pd.read_csv(r'tmdb_5000_credits.csv')

In [3]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [4]:
credits.head(2)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


## Data Preprocessing

In [5]:
movies.shape, credits.shape

((4803, 20), (4803, 4))

In [6]:
data = pd.merge(movies,credits,left_on='id',right_on='movie_id',how='inner')

In [7]:
data.shape

(4803, 24)
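
The merged frame has 24 columns because both key columns (`id` and `movie_id`) are kept and the shared `title` column is split into `title_x` and `title_y`. The toy frames below are hypothetical stand-ins that sketch this behaviour; `validate='one_to_one'` is an optional extra check that is not used in the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of `movies` and `credits`
left = pd.DataFrame({'id': [1, 2], 'title': ['A', 'B'], 'budget': [10, 20]})
right = pd.DataFrame({'movie_id': [1, 2], 'title': ['A', 'B'], 'cast': ['x', 'y']})

merged = pd.merge(left, right, left_on='id', right_on='movie_id',
                  how='inner', validate='one_to_one')  # raises if a key repeats on either side

print(merged.shape)
print(merged.columns.tolist())
```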

In [8]:
data.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title_x,vote_average,vote_count,movie_id,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [9]:
data.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title_x', 'vote_average',
       'vote_count', 'movie_id', 'title_y', 'cast', 'crew'],
      dtype='object')

In [10]:
data.drop(['homepage','status','title_x','title_y','production_companies'],axis=1,inplace=True)

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   id                    4803 non-null   int64  
 3   keywords              4803 non-null   object 
 4   original_language     4803 non-null   object 
 5   original_title        4803 non-null   object 
 6   overview              4800 non-null   object 
 7   popularity            4803 non-null   float64
 8   production_countries  4803 non-null   object 
 9   release_date          4802 non-null   object 
 10  revenue               4803 non-null   int64  
 11  runtime               4801 non-null   float64
 12  spoken_languages      4803 non-null   object 
 13  tagline               3959 non-null   object 
 14  vote_average          4803 non-null   float64
 15  vote_count           

In [12]:
data.isna().sum()

budget                    0
genres                    0
id                        0
keywords                  0
original_language         0
original_title            0
overview                  3
popularity                0
production_countries      0
release_date              1
revenue                   0
runtime                   2
spoken_languages          0
tagline                 844
vote_average              0
vote_count                0
movie_id                  0
cast                      0
crew                      0
dtype: int64

In [13]:
data['overview'] = data['overview'].fillna('')

### Now let's make recommendations based on the movies' plot summaries in the overview column. Given a movie title, the goal is to recommend movies with similar plot summaries.

## TF-IDF Vectorization

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer(min_df=3,max_features=None,
                         strip_accents='unicode',analyzer='word',
                         ngram_range=(1,3),stop_words='english')
# stop_words='english' removes common English words such as a, an, the, she, he
# ngram_range=(1,3) extracts unigrams, bigrams and trigrams (sequences of 1 to 3 consecutive words)

### * min_df : float or int, default=1
	When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature.

### * max_features : int, default=None
	If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus. Otherwise, all features are used. This parameter is ignored if vocabulary is not None.

### * strip_accents : {‘ascii’, ‘unicode’} or callable, default=None
	Remove accents and perform other character normalization during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.
   
### * analyzer : {‘word’, ‘char’, ‘char_wb’} or callable, default=’word’
	Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
    
### * ngram_range : tuple (min_n, max_n), default=(1, 1)
	The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

### * stop_words : {‘english’}, list, default=None
	If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value. There are several known issues with ‘english’ and you should consider an alternative (see the "Using stop words" section of the scikit-learn documentation). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. If None, no stop words will be used.
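
The effect of `ngram_range` and `min_df` is easier to see on a tiny corpus. The three sentences below are made up for illustration; the sketch only compares the vocabularies that different settings produce.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not taken from the TMDB data
docs = [
    "a secret agent saves the world",
    "a secret agent retires",
    "the world explodes",
]

unigrams = TfidfVectorizer(ngram_range=(1, 1), stop_words='english').fit(docs)
up_to_bigrams = TfidfVectorizer(ngram_range=(1, 2), stop_words='english').fit(docs)
# min_df=2 keeps only terms that occur in at least two documents
frequent_only = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', min_df=2).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(up_to_bigrams.vocabulary_))
print(sorted(frequent_only.vocabulary_))
```

Widening `ngram_range` grows the vocabulary, while raising `min_df` prunes terms that appear in too few documents. In the notebook, the combination of `ngram_range=(1,3)` and `min_df=3` yields 9920 features.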

In [15]:
data['overview'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

## Fit TF-IDF on overview

In [16]:
tfv_matrix = tfv.fit_transform(data['overview'])

In [17]:
tfv_matrix

<4803x9920 sparse matrix of type '<class 'numpy.float64'>'
	with 121504 stored elements in Compressed Sparse Row format>

In [18]:
tfv_matrix.shape

(4803, 9920)

In [19]:
from sklearn.metrics.pairwise import sigmoid_kernel
sig = sigmoid_kernel(tfv_matrix,tfv_matrix)

In [20]:
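# reverse mapping from movie title to row index
# note: Series.drop_duplicates() compares the values (the row indices), so repeated titles are not removed here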
indices = pd.Series(data.index,index=data['original_title']).drop_duplicates()
indices

original_title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64
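
The length of 4803 shows that no rows were dropped: `Series.drop_duplicates()` compares values, and the row indices are all distinct. If the dataset contains repeated titles, a lookup such as `indices[title]` would then return a Series rather than a single integer. The sketch below uses a hypothetical three-row frame to show one way to de-duplicate on the title instead; it is an alternative, not what the notebook does.

```python
import pandas as pd

# Hypothetical frame with a repeated title, standing in for `data`
toy = pd.DataFrame({'original_title': ['Batman', 'Heat', 'Batman']})

# Same construction as in the notebook: duplicates are judged on the values 0, 1, 2
by_value = pd.Series(toy.index, index=toy['original_title']).drop_duplicates()

# Alternative: keep only the first row for each title
by_title = pd.Series(toy.index, index=toy['original_title'])
by_title = by_title[~by_title.index.duplicated(keep='first')]

print(len(by_value), type(by_value['Batman']).__name__)
print(len(by_title), by_title['Batman'])
```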

In [21]:
def rec(title,sig=sig):
    idx = indices[title] # get the index corresponding to the original title
    sig_scores = list(enumerate(sig[idx])) # get pairwise similarity scores
    sig_scores = sorted(sig_scores,key= lambda x: x[1],reverse=True)
    sig_scores = sig_scores[1:11] # keep the 10 best matches after position 0
    # position 0 is skipped because it is the queried movie itself (e.g. 'Spy Kids'), which [:10] would include
    movie_indices = [i[0] for i in sig_scores] # indices of movies
    return data['original_title'].iloc[movie_indices] # top 10 most similar movies

In [22]:
indices['Newlyweds']

4799

In [23]:
sig[4799]

array([0.76159416, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
       0.76159416])
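
The near-constant values above follow from the kernel's defaults. `sigmoid_kernel` computes `tanh(gamma * <x, y> + coef0)`, with `gamma` defaulting to `1 / n_features` and `coef0` to 1. TF-IDF rows are L2-normalised by default, so every dot product lies between 0 and 1 and every score falls in a very narrow band just above `tanh(1)`. The sketch below reproduces the two ends of that band with the standard library only; `sigmoid_score` is a hypothetical helper, not a scikit-learn function.

```python
import math

n_features = 9920          # vocabulary size reported by tfv_matrix.shape
gamma = 1.0 / n_features   # sigmoid_kernel default when gamma is None
coef0 = 1.0                # sigmoid_kernel default

def sigmoid_score(dot_product):
    """tanh(gamma * <x, y> + coef0) for one pair of vectors."""
    return math.tanh(gamma * dot_product + coef0)

no_overlap = sigmoid_score(0.0)   # two overviews with no shared terms
identical = sigmoid_score(1.0)    # an L2-normalised row compared with itself

print(no_overlap)
print(identical)
print(identical - no_overlap)     # width of the band that holds all scores
```

Because `tanh` is monotonic, the ranking is the same as ranking by the raw dot product; the scores are just compressed into a small interval.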

In [24]:
list(enumerate(sig[indices['Newlyweds']]))

[(0, 0.7615941559557649),
 (1, 0.7615941559557649),
 (2, 0.7615941559557649),
 (3, 0.7615941559557649),
 (4, 0.7615941559557649),
 (5, 0.7615941559557649),
 (6, 0.7615941559557649),
 (7, 0.7615941559557649),
 (8, 0.7615941559557649),
 (9, 0.7615941559557649),
 (10, 0.7615941559557649),
 (11, 0.7615941559557649),
 (12, 0.7615941559557649),
 (13, 0.7615941559557649),
 (14, 0.7615941559557649),
 (15, 0.7615941559557649),
 (16, 0.7615941559557649),
 (17, 0.7615941559557649),
 (18, 0.7615941559557649),
 (19, 0.7615941559557649),
 (20, 0.7615941559557649),
 (21, 0.7615941559557649),
 (22, 0.7615941559557649),
 (23, 0.7615941559557649),
 (24, 0.7615941559557649),
 (25, 0.7615941559557649),
 (26, 0.7615941559557649),
 (27, 0.7615941559557649),
 (28, 0.7615941559557649),
 (29, 0.7615941559557649),
 (30, 0.7615941559557649),
 (31, 0.7615941559557649),
 (32, 0.7615941559557649),
 (33, 0.7615941559557649),
 (34, 0.7615941559557649),
 (35, 0.7615941559557649),
 (36, 0.7615941559557649),
 (37, 0.761

In [25]:
sorted(list(enumerate(sig[indices['Newlyweds']])), key= lambda x: x[1],reverse=True)

[(4799, 0.7616364888287208),
 (616, 0.7616064084426961),
 (869, 0.7616034178776128),
 (2689, 0.761600965560282),
 (3969, 0.7616006690796738),
 (1576, 0.7616006385374702),
 (2290, 0.7616002366156066),
 (504, 0.7615998032036724),
 (866, 0.761599273973311),
 (2962, 0.7615988859616694),
 (242, 0.7615987888376458),
 (4576, 0.7615987405779349),
 (1223, 0.7615987281085932),
 (3479, 0.7615987268724713),
 (2688, 0.7615984153199932),
 (3155, 0.761598404367395),
 (2869, 0.7615983709610472),
 (3559, 0.7615983643583946),
 (4641, 0.761598353802844),
 (4616, 0.7615981856256006),
 (1071, 0.761598180113216),
 (3393, 0.7615980969770934),
 (1970, 0.7615978373905358),
 (1856, 0.7615977257330712),
 (4591, 0.7615975483430093),
 (1110, 0.7615975379580007),
 (237, 0.7615974566214845),
 (4584, 0.7615974399662768),
 (3583, 0.7615974104345924),
 (1385, 0.7615973983527261),
 (1949, 0.7615972820266094),
 (3253, 0.761597246958908),
 (1364, 0.7615971450305588),
 (4085, 0.761597087141578),
 (3309, 0.7615969991514514)

In [26]:
rec('Spy Kids')

1302    Spy Kids 2: The Island of Lost Dreams
1155                  Spy Kids 3-D: Game Over
1769      Spy Kids: All the Time in the World
4044                               Go for It!
1825                Jimmy Neutron: Boy Genius
339                           The Incredibles
3793                     The Velocity of Gary
1081                       Revolutionary Road
2441                   Escobar: Paradise Lost
577                   AVP: Alien vs. Predator
Name: original_title, dtype: object

In [27]:
data.columns

Index(['budget', 'genres', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_countries',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'tagline',
       'vote_average', 'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')

In [28]:
# Define the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [29]:
# Fit the vectorizer to the plot overview
X = vectorizer.fit_transform(data['overview'])

In [30]:
# Compute the pairwise cosine similarity between the movies
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(X)
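
Since `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`), the cosine similarity of two rows equals their plain dot product. The sketch below checks this on a made-up corpus; using `linear_kernel` is an optional shortcut, not something the notebook relies on.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus standing in for data['overview']
docs = [
    "two young spies rescue their parents",
    "young spies travel to a mysterious island",
    "a retired sheriff defends a small town",
]

X_toy = TfidfVectorizer().fit_transform(docs)

cos = cosine_similarity(X_toy)   # normalises rows, then takes dot products
lin = linear_kernel(X_toy)       # dot products only

print(np.allclose(cos, lin))
```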

In [31]:
def recommend(title):
    # Find the index of the input movie in the dataframe
    idx = data[data['original_title'] == title].index[0]
    # Get the pairwise cosine similarity between the input movie and all other movies
    scores = list(enumerate(similarity[idx]))
    # Sort the movies by similarity score
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    # Get the top 10 most similar movies, skipping position 0 (normally the input movie itself)
    top_scores = scores[1:11]
    # Get the titles of the most similar movies
    top_titles = [data.iloc[score[0]]['original_title'] for score in top_scores]
    return top_titles

In [32]:
# Get recommendations for a given movie:
recommend('Spy Kids')

['Spy Kids 2: The Island of Lost Dreams',
 'Spy Kids 3-D: Game Over',
 'The Blues Brothers',
 'Courageous',
 'Sex Tape',
 'Monsters, Inc.',
 'Silverado',
 "Bill & Ted's Excellent Adventure",
 'Earth to Echo',
 'Go for It!']
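
Both functions above assume that the queried movie sorts first and that the title exists. Neither is guaranteed. For example, with cosine similarity a movie whose overview is empty scores 0 against every movie, itself included. The function below is a hedged variant that removes the query by index and raises a clear error for unknown titles. `top_k_similar`, its parameters and the toy inputs are hypothetical names introduced for this sketch.

```python
import numpy as np
import pandas as pd

def top_k_similar(titles, similarity, title, k=10):
    """Return up to k titles most similar to `title`, never the title itself."""
    matches = np.flatnonzero((titles == title).to_numpy())
    if matches.size == 0:
        raise KeyError(f"{title!r} not found")
    idx = matches[0]                                     # first match if the title repeats
    order = np.argsort(-similarity[idx], kind='stable')  # highest score first
    order = order[order != idx][:k]                      # drop the query by index, not by position
    return titles.iloc[order].tolist()

# Toy inputs standing in for data['original_title'] and the similarity matrix
toy_titles = pd.Series(['Alpha', 'Beta', 'Gamma', 'Delta'])
toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

print(top_k_similar(toy_titles, toy_similarity, 'Alpha', k=2))
```

With the notebook's objects, the equivalent call would be `top_k_similar(data['original_title'], similarity, 'Spy Kids')`, assuming `data` keeps its default 0 to 4802 index.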