# Movie Recommendation System using TF-IDF and Collaborative Filtering

This project involves developing a movie recommendation system that combines content-based filtering using TF-IDF vectorisation with collaborative filtering techniques. The system preprocesses movie data, generates a TF-IDF matrix for text-based search, and utilises user ratings to recommend movies based on both content similarity and user preferences. The project also features an interactive search interface using ipywidgets for dynamic recommendations.


In [1]:
import pandas as pd
import numpy as np

**1) Reading data:**
First, we will read the data in with pandas. We are using a dataset from Kaggle.

In [2]:
movies = pd.read_csv("movies.csv")
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


**2) Cleaning and pre-processing data:** We will use a regular expression to strip special characters from each title, keeping only letters, digits, and spaces. We will also drop the genres column, which is not used in the rest of the notebook.

In [3]:
import re

def clean_title(title):
    return re.sub("[^a-zA-Z0-9 ]", "", title)

movies["clean_title"] = movies["title"].apply(clean_title)
movies.drop(columns=["genres"], inplace=True)
movies.head()

Unnamed: 0,movieId,title,clean_title
0,1,Toy Story (1995),Toy Story 1995
1,2,Jumanji (1995),Jumanji 1995
2,3,Grumpier Old Men (1995),Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Father of the Bride Part II 1995
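
One side effect of the pattern above is that any character outside a-z, A-Z, 0-9, and space is deleted, including accented letters. The sketch below shows this on a made-up input (not necessarily a row in movies.csv) and compares it with a hypothetical variant, clean_title_ascii, which is not part of the notebook and folds accents to plain ASCII before applying the same pattern.

```python
import re
import unicodedata

def clean_title(title):
    # Same pattern as the notebook: keep ASCII letters, digits, and spaces.
    return re.sub("[^a-zA-Z0-9 ]", "", title)

def clean_title_ascii(title):
    # Hypothetical variant (an assumption, not the notebook's method):
    # decompose accented characters, drop the accent marks, then clean.
    folded = unicodedata.normalize("NFKD", title)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub("[^a-zA-Z0-9 ]", "", folded)

# "\u00e9" is an e with an acute accent; the title is illustrative only.
sample = "L\u00e9on: The Professional (1994)"
plain = clean_title(sample)
folded = clean_title_ascii(sample)
print(plain)
print(folded)
```

Whether folding accents is worthwhile depends on how users are expected to type titles into the search box; the notebook's simpler pattern is kept in the cells that follow.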


**3) Creating a search engine:** We will use TF-IDF vectorisation over the unigrams and bigrams of each cleaned title, and rank titles by their cosine similarity to the search query. We will use ipywidgets to expose the search engine as an interactive widget.

In [4]:
#tfidf matrix

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(movies["clean_title"])

In [5]:
from sklearn.metrics.pairwise import cosine_similarity

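def search(title):
    # Clean the query the same way the titles were cleaned
    title = clean_title(title)
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # argsort is ascending: take the last five and reverse so the best match comes first
    indices = np.argsort(similarity)[-5:][::-1]
    results = movies.iloc[indices]

    return results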
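
A note on picking the top matches from a similarity vector: np.argpartition only guarantees that the k largest values end up in the last k positions, not that they are ordered among themselves, whereas np.argsort orders everything. The sketch below uses made-up scores to show one way to combine the two, partitioning first and then sorting only the k candidates; this is an optional optimisation for large arrays, not something the notebook requires.

```python
import numpy as np

# Made-up similarity scores for seven items (all distinct, so no ties).
similarity = np.array([0.10, 0.90, 0.30, 0.70, 0.20, 0.50, 0.05])
k = 5

# The k largest values land in the last k slots, in no guaranteed order.
top_unordered = np.argpartition(similarity, -k)[-k:]

# Sort only those k candidates by score, highest first.
order = np.argsort(similarity[top_unordered])[::-1]
top_ordered = top_unordered[order]
print(top_ordered)
```

Any code that treats the first row of the results as the best match depends on this best-first ordering.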

In [6]:
import ipywidgets as widgets
from IPython.display import display, clear_output

movie_input = widgets.Text(
    description='Movie Title:',
    disabled=False
)
movie_list = widgets.Output()

def on_type(data):
    with movie_list:
        clear_output(wait=True)
        title = data["new"]
        if len(title) > 5:
            print(f"Searching for: {title}")
            results = search(title)
            display(results)

movie_input.observe(on_type, names='value')


display(movie_input, movie_list)

Text(value='', description='Movie Title:')

Output()

**4) Movie Recommendation System:** We will design a recommendation system that uses collaborative filtering on user ratings to compute a recommendation score for each movie, and include a widgets-based user interface.

In [7]:
ratings = pd.read_csv("ratings.csv")

In [8]:
def find_recs(movie_id):
    # users who liked the given movie (rating above 4)
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

    # movies liked by more than 10% of those similar users
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)
    similar_user_recs = similar_user_recs[similar_user_recs > .10]

    # for the same movies, the fraction of users who liked each one,
    # measured among all users who liked at least one of them
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    return similar_user_recs, all_user_recs

def movie_scores(movie_id):
    similar_user_recs, all_user_recs = find_recs(movie_id) 
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title"]]
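
The score is a ratio: how popular a movie is among users who liked the chosen movie, divided by how popular it is among users in general. A score above 1 suggests the movie is liked disproportionately by similar users. The sketch below reproduces the same arithmetic on a tiny made-up ratings table; toy_ratings and the IDs in it are illustrative only and do not come from ratings.csv.

```python
import pandas as pd

# Made-up ratings: four users, three movies (IDs 10, 20, 30).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 10, 20, 10, 30, 30, 20],
    "rating":  [5.0, 5.0, 5.0, 4.5, 5.0, 5.0, 5.0, 3.0],
})
liked = toy_ratings[toy_ratings["rating"] > 4]

# Users who liked movie 10, and the share of them who liked each movie.
similar_users = liked[liked["movieId"] == 10]["userId"].unique()
similar = liked[liked["userId"].isin(similar_users)]["movieId"].value_counts()
similar = similar / len(similar_users)

# Share of all users (with at least one like among these movies) who liked each movie.
candidates = liked[liked["movieId"].isin(similar.index)]
overall = candidates["movieId"].value_counts() / candidates["userId"].nunique()

# Ratio of the two shares; the DataFrame constructor aligns both on movieId.
scores = pd.DataFrame({"similar": similar, "all": overall})
scores["score"] = scores["similar"] / scores["all"]
print(scores.sort_values("score", ascending=False))
```

Here movie 20 is liked by two of the three similar users but only two of all four users, so it scores above 1, while movie 30 is liked by one similar user out of three and two users out of four overall, so it scores below 1.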

In [9]:
movie_name_input = widgets.Text(
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

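# Named differently from the search widget's callback so the two do not shadow each other
def on_recommendation_type(data):
    with recommendation_list:
        clear_output(wait=True)
        title = data["new"]
        if len(title) > 5:
            # Use the best-matching title from the search as the seed movie
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(movie_scores(movie_id))

movie_name_input.observe(on_recommendation_type, names='value')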

display(movie_name_input, recommendation_list)

Text(value='', description='Movie Title:')

Output()