# Content-Based Filtering

Content-based filtering uses similarities between the features of products, services, or content, together with information accumulated about the user, to make recommendations.

In this case, the features used to recommend a movie are its `genre` and its storyline (the `overview` column).

In [508]:
#importing the libraries we need
import pandas as pd
import numpy as np
import nltk
from datetime import datetime
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
#fuzzy string matching, used by find_similar_movies below
from fuzzywuzzy import fuzz, process



Load the cleaned dataset which contains 6k rows of movie info:

In [3]:
movies = pd.read_csv("Data/movies.csv")

Checking out the `movies` dataframe:

In [4]:
movies.shape

(6020, 33)

For content-based filtering, I'm going to use `title`, `overview` and `genre`.

The idea is to turn each of these into vectors (TF-IDF for the text columns, binary dummies for the genres) and create 3 cosine similarity matrices.

Each time one of the functions `recommender_on_story` or `recommender_on_genre` is called, it first checks whether the title is valid. If it is, the function proceeds; if no movie matches the title, `find_similar_movies` is called and returns the movies whose titles are most similar to the name that was searched.

The next step in `recommender_on_story` or `recommender_on_genre` is to take the first of those movies (the one with the highest similarity) and find movies based on the feature we want.

### Searching the dataframe for movies whose titles are similar to the one searched

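TF-IDF (term frequency-inverse document frequency) vectorization stores a measure of how relevant every word is to each document by reweighting the raw word counts: words that appear in many documents get a lower weight than words that are specific to a few.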
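To make the reweighting concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents for the default `TfidfVectorizer` settings. The helper name `tfidf_vectors` and the toy titles are made up for illustration and are not part of this notebook's pipeline.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors


# Hypothetical sample titles, for illustration only
toy_titles = ["batman returns", "batman forever", "spider man"]
vocab, vectors = tfidf_vectors(toy_titles)
weights = dict(zip(vocab, vectors[0]))
print(weights)
```

In the first toy title, "returns" ends up with a larger weight than "batman", because "batman" also occurs in another title and is therefore less distinctive.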



- Create a `TF-IDF` vectorizer.
- Fit and transform the titles with the TF-IDF vectorizer.
- Create a matrix out of the transformed titles.


In [297]:
#Define a TF-IDF Vectorizer Object for titles

#min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tfidf_title = TfidfVectorizer(stop_words='english',min_df=1,ngram_range=(1,2))


#Construct the required TF-IDF matrix by fitting and transforming the data
title_matrix = tfidf_title.fit_transform(movies['title'])

#Output the shape of tfidf_matrix
title_matrix.shape

(6020, 5258)

### Cosine Similarity
Cosine similarity measures how similar two vectors are by the cosine of the angle between them.
Here we're calculating the similarity between the titles:

In [303]:
title_similarity = cosine_similarity(title_matrix,title_matrix)


In [304]:
title_similarity

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.02596629, 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.02596629, ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

`title_similarity` is a matrix in which both the rows and the columns are movie titles, and each value indicates how closely two titles are related. The higher the value (the closer to 1), the more similar the titles are. That is why the similarity between a row and a column with the same title is 1.
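As a small worked illustration of how such a matrix comes about, the sketch below computes cosine similarity by hand as the dot product divided by the product of the vector lengths. The function names and the three toy vectors are assumptions made up for this example; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vector i with vector j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Three made-up term-weight vectors over a 3-word vocabulary
toy_vectors = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 2.0],
]
sim = similarity_matrix(toy_vectors)
for row in sim:
    print([round(value, 3) for value in row])
```

The diagonal is 1 (up to floating-point rounding) because every vector points in the same direction as itself, the matrix is symmetric, and vectors that share no non-zero terms get a similarity of 0.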

In [305]:
title_similarity.shape
#sanity check on the shape of our matrix

(6020, 6020)

#### `find_similar_movies` function
This function takes a movie title and returns the top movies whose names are similar to the one searched.


I'm using the `fuzzywuzzy` library for this function, since the user may make typos and fuzzy matching helps us still find the intended title.
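To show the idea of typo-tolerant matching without any third-party dependency, here is a rough stand-in that uses `difflib.SequenceMatcher` from the Python standard library instead of `fuzzywuzzy`. It scores on a 0 to 1 scale rather than fuzzywuzzy's 0 to 100 and will not reproduce the same numbers; the helper name `closest_titles`, the threshold, and the toy titles are assumptions for illustration.

```python
from difflib import SequenceMatcher


def closest_titles(query, titles, threshold=0.6, top_n=3):
    """Return (title, score) pairs whose similarity to query reaches the threshold."""
    scored = [
        (title, SequenceMatcher(None, query.lower(), title.lower()).ratio())
        for title in titles
    ]
    kept = [(title, score) for title, score in scored if score >= threshold]
    # Best match first
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical catalogue, for illustration only
toy_titles = ["Batman", "Batman Returns", "Spider-Man", "Hancock"]
matches = closest_titles("barman", toy_titles)
print(matches)
```

Even though "barman" is not an exact title, the one-letter typo still leaves "Batman" as the best-scoring candidate.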

In [452]:
def find_similar_movies(movie_title, top_n=10, threshold=60):
    # All movie titles we have, as a list
    titles = movies['title'].tolist()
    # Fuzzy similarity score between the searched title and every movie title.
    # Note: process.extract returns only the 5 best matches by default (limit=5).
    similarity_scores = process.extract(movie_title, titles, scorer=fuzz.partial_ratio)

    # Keep only the movies whose similarity score reaches the threshold
    similar_movies = [name for name, score in similarity_scores if score >= threshold]

    # Transform the candidate titles plus the searched title (last row)
    # with the fitted title TF-IDF vectorizer
    candidate_matrix = tfidf_title.transform(similar_movies + [movie_title])
    title_sim = cosine_similarity(candidate_matrix)

    # The last row holds the similarity of the searched title to each candidate
    title_similarity_df = pd.DataFrame({"title": similar_movies,
                                        "similarity": title_sim[-1, :-1]})
    title_similarity_df = title_similarity_df.sort_values(by='similarity', ascending=False).head(top_n)
    return title_similarity_df.title.values
    


In [453]:

similar_movies = find_similar_movies("barman", top_n=10)

print(similar_movies)

['M' 'Batman Forever' 'Batman' 'Batman Returns' 'Batman & Robin']


# Recommenders:

### TF-IDF for Movie overviews

- Create a TF-IDF vectorizer for the 5000 most frequent terms in the movie overviews.
- `max_features` is the maximum number of terms we want to keep from this vectorization. If we don't specify it, every term is kept, which takes a lot of space and time.
- Stem the words: since this is an overview, it might contain related word forms such as run, running, runs, so we stem them to count all of those as one word ('run').
- Fit and transform the overviews.
- Create a cosine similarity matrix, just like above, for the overview TF-IDF matrix.
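The interaction between a document-frequency cutoff and a cap on the number of terms can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it assumes whitespace tokens, keeps terms that occur in at least `min_df` documents, and then keeps the `max_features` most frequent of those. The helper name `build_vocabulary` and the toy overviews are made up.

```python
from collections import Counter


def build_vocabulary(docs, min_df=1, max_features=None):
    """Select terms by document frequency first, then cap by total frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    term_counts = Counter(tok for toks in tokenized for tok in toks)
    doc_counts = Counter(tok for toks in tokenized for tok in set(toks))
    # Drop terms that appear in fewer than min_df documents
    kept = [term for term in term_counts if doc_counts[term] >= min_df]
    # Most frequent terms first; ties broken alphabetically
    kept.sort(key=lambda term: (-term_counts[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


# Hypothetical mini-overviews, for illustration only
toy_overviews = ["hero saves city", "hero returns", "villain returns city", "hero city"]
print(build_vocabulary(toy_overviews, min_df=2))
print(build_vocabulary(toy_overviews, min_df=2, max_features=2))
print(build_vocabulary(toy_overviews, min_df=2, max_features=10))
```

The last call shows that `max_features` is only an upper bound: when the `min_df` cutoff leaves fewer terms than the cap, the vocabulary is simply smaller than `max_features`.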

In [448]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
stemmer = nltk.stem.PorterStemmer()
tfidf = TfidfVectorizer(stop_words='english',
                        min_df=10,
                        ngram_range=(1,2),
                        max_features=5000,
                        tokenizer=lambda x: [stemmer.stem(i) for i in x.split(' ')])


#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(6020, 3036)

In [449]:
overview_similarity = cosine_similarity(tfidf_matrix)
#cosine similarity gives the relationship between the vectorized overviews.
#the higher the number (closer to 1), the more similar two overviews are.

In [450]:
overview_similarity
#similarity between the movie overviews.

array([[1.        , 0.01383318, 0.01645828, ..., 0.01044857, 0.04085473,
        0.        ],
       [0.01383318, 1.        , 0.06442897, ..., 0.        , 0.03124879,
        0.        ],
       [0.01645828, 0.06442897, 1.        , ..., 0.00819855, 0.032057  ,
        0.01317856],
       ...,
       [0.01044857, 0.        , 0.00819855, ..., 1.        , 0.02035144,
        0.        ],
       [0.04085473, 0.03124879, 0.032057  , ..., 0.02035144, 1.        ,
        0.        ],
       [0.        , 0.        , 0.01317856, ..., 0.        , 0.        ,
        1.        ]])

### `recommender_on_story`

Now we want to make use of our `overview_similarity` matrix and find the titles of the movies that have similar overviews.

The `recommender_on_story` function takes as inputs the title of the movie we want recommendations for, plus the `overview_similarity` matrix.

- The function first looks up the index of the movie in the original dataset.
- If no index is found, it uses `find_similar_movies` to find the titles most similar to the one searched.
- The next step is to put the movies and their similarities in a dataframe (`sim_df`) and return it sorted by similarity.
- The top 10 movies with the highest similarities have overviews similar to the movie searched.
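The ranking step in the list above can be sketched without pandas. This is a simplified variation under stated assumptions: the similarity matrix is a plain list of lists, titles are unique, and, unlike the function in this notebook (which keeps the searched movie in its output with a similarity of 1.0), the sketch drops the searched title from the results. The name `top_similar` and the toy data are hypothetical.

```python
def top_similar(titles, sim_matrix, query_title, top_n=2):
    """Rank titles by their similarity to query_title, skipping the query itself."""
    idx = titles.index(query_title)  # raises ValueError if the title is unknown
    ranked = sorted(zip(titles, sim_matrix[idx]), key=lambda pair: pair[1], reverse=True)
    return [(title, score) for title, score in ranked if title != query_title][:top_n]


# Made-up titles and a made-up symmetric similarity matrix
toy_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
]
recommendations = top_similar(toy_titles, toy_sim, "Movie A")
print(recommendations)
```

Row `idx` of the matrix holds the similarity of the searched movie to every other movie, so sorting that single row is all the ranking needs.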

In [486]:
def recommender_on_story(title,similarities): #taking title and similarity matrix as inputs.
    idx=movies[movies['title']==title].index.values
    if len(idx)==0: #checking whether there was such a movie or we have to find a movie name close to it
        similar_titles = find_similar_movies(title, top_n=10, threshold=80)
        if len(similar_titles)==0:
            raise ValueError("No similar movie titles found.")
        title=similar_titles[0]

    #row index of the (possibly corrected) title
    idx=movies[movies['title']==title].index.values[0]
    #similarity of this movie's overview to every overview, shown in a dataframe
    sim_df = pd.DataFrame({'Movie': movies['title'],
                           'Similarity': similarities[idx]})
    top_movies = sim_df.sort_values(by='Similarity', ascending=False).head(10)
    return top_movies

In [500]:
#example 
recommender_on_story('batman',overview_similarity)

                                       Movie  Similarity
240                                   Batman    1.000000
3516  Harry Potter and the Half-Blood Prince    0.274113
4121                   The Dark Knight Rises    0.264122
3306                   Batman: Gotham Knight    0.252821
63                            Batman Forever    0.242784
611                           Batman & Robin    0.238748
2112                              Videodrome    0.230962
3763              Batman: Under the Red Hood    0.228181
5601                       Batman: Bad Blood    0.215129
1557                 Masters of the Universe    0.198168


### Genre

First:

Load the dataframe that contains the genre dummies and movieIds.

In [473]:
genres_df = pd.read_csv("Data/genres_dummies.csv")
genres_df.drop("Unnamed: 0",axis=1,inplace=True)

Since the genres are already in binary format, we can use them directly for the cosine similarity and get the relationship (similarity) between the movies' genres.
Then we're going to define a cosine similarity matrix for them:

In [474]:
genre_columns = genres_df.columns[1:]

# Calculate the similarity matrix based on genres
genre_similarity = cosine_similarity(genres_df[genre_columns])

In [475]:
genre_similarity.shape

(6020, 6020)
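For binary genre vectors, cosine similarity reduces to the number of shared genres divided by the square root of the product of the two genre counts. The sketch below works that out on sets; the helper name `genre_cosine` and the toy movies are assumptions for illustration, not data from `genres_df`.

```python
import math


def genre_cosine(genres_a, genres_b):
    """Cosine similarity of two binary genre vectors, computed from genre sets."""
    if not genres_a or not genres_b:
        return 0.0  # a movie without genres is treated as similar to nothing
    shared = len(genres_a & genres_b)
    return shared / math.sqrt(len(genres_a) * len(genres_b))


# Made-up movies and genre sets
toy_genres = {
    "Movie A": {"Action", "Fantasy"},
    "Movie B": {"Action", "Fantasy"},
    "Movie C": {"Action", "Comedy", "Romance"},
}
same = genre_cosine(toy_genres["Movie A"], toy_genres["Movie B"])
partial = genre_cosine(toy_genres["Movie A"], toy_genres["Movie C"])
print(same, round(partial, 3))
```

Two movies with exactly the same genre set always score 1.0, so a genre-based ranking can contain many ties at the top, and the order among tied movies is then arbitrary.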

In [476]:
movies.shape

(6020, 33)

In [488]:
def recommender_on_genre(title,similarities):
    idx = movies[movies['title']==title].index
    if len(idx)==0: #checking whether there was such a movie or we have to find a movie name close to it
        similar_titles = find_similar_movies(title, top_n=10, threshold=80)
        if len(similar_titles)==0:
            raise ValueError("No similar movie titles found.")
        title=similar_titles[0]

    #row index of the (possibly corrected) title
    idx=movies[movies['title']==title].index.values[0]
    sim_df = pd.DataFrame({'movie':movies['title'], 
                       'similarity': similarities[idx]})
    top_movies = sim_df.sort_values(by='similarity', ascending=False).head(10)
    return top_movies

In [498]:
def recommend(title):
    print("Recommendation based on Genre: \n---------\n")
    print(recommender_on_genre(title,genre_similarity))
    print("Recommendation based on storyline: \n---------\n")
    print(recommender_on_story(title,overview_similarity))

In [499]:
#Enter any movie Title and get the results: (Based on content)
recommend("Batman")

Recommendation based on Genre: 
---------

                                                movie  similarity
2719                                       Mirrormask         1.0
1069                               The Legend of 1900         1.0
1843                                       Spider-Man         1.0
851                                  The Dark Crystal         1.0
3282                                          Hancock         1.0
3571                                              Ink         1.0
56    The Neverending Story III: Escape from Fantasia         1.0
5619                                    Gods of Egypt         1.0
28                                      Mortal Kombat         1.0
560                                    Batman Returns         1.0
Recommendation based on storyline: 
---------

                                       Movie  Similarity
240                                   Batman    1.000000
3516  Harry Potter and the Half-Blood Prince    0.274113
4121         