# Movie Recommendation System Using TF-IDF & Cosine Similarity

This notebook demonstrates how to build a content-based movie recommender using TF-IDF for feature extraction and cosine similarity to score how similar, on average, every other movie is to the user’s favourites.
Users select their favourite movies, and the notebook ranks the remaining movies by that average similarity.

### Step 1. Importing Required Libraries

This section imports the key Python libraries used in the project:

- `pandas` for data manipulation
- `numpy` for sorting the list of titles
- `TfidfVectorizer` from `scikit-learn` to convert movie metadata into vector form
- `cosine_similarity` from `scikit-learn` to measure similarity between movies

In [23]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Step 2. Loading Movie Data

The dataset used is the IMDb Top 1000 Movies dataset. It includes information about each movie's title, release year, genre(s), director, leading actors, and gross revenue.

We read in the dataset with a relative path, so the cell assumes the working directory is the notebook's own folder, with the `Data` folder one level up.

In [24]:
movies = pd.read_csv("../Data/imdb_top_1000.csv")
movies.head()

Unnamed: 0,Poster_Link,Title,Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


### Step 3. Cleaning and Preprocessing
We perform several preprocessing steps to clean the data:

- Convert `Year` to integer
- Split the `Genre` column into three separate columns
- Remove unnecessary columns (e.g. poster links, rating scores)
- Convert the `Gross` revenue column into a numeric format
- Rename columns for easier referencing throughout the project

In [25]:
# Convert year to integer and split the genre string into separate columns
movies['Year'] = movies['Year'].astype('int')
movies = movies.join(movies['Genre'].str.split(', ', expand=True)).drop(['Genre', 'Poster_Link', 'IMDB_Rating', 'Certificate', 'Overview', 'Meta_score', 'No_of_Votes'], axis = 1)

# Remove commas from Gross column and convert to float
movies['Gross'] = (movies['Gross'].replace(',', '', regex=True).astype(float))

# Rename columns for simplicity
movies = movies.set_axis(['Title', 'Year', 'Length','Director','Star1','Star2','Star3','Star4','Gross','Genre1','Genre2','Genre3'], axis = 1)
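
To see the genre split in isolation, here is a minimal sketch on a made-up two-row frame (the `toy` frame and its titles are illustrative, not rows from the dataset). `str.split(..., expand=True)` returns integer-named columns `0, 1, 2`, padded with missing values where a movie has fewer genres, which is why a rename step follows.

```python
import pandas as pd

# Hypothetical toy data, only to show the shape of the result
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Genre": ["Action, Crime, Drama", "Drama"],
})

# expand=True returns one column per genre, named 0, 1, 2
genre_cols = toy["Genre"].str.split(", ", expand=True)
toy = toy.join(genre_cols).drop(columns=["Genre"])

# Give the integer-named columns readable names
toy = toy.set_axis(["Title", "Genre1", "Genre2", "Genre3"], axis=1)

print(toy)
```

Movies with a single genre end up with missing values in `Genre2` and `Genre3`, so later steps need `fillna('')` before concatenating strings.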

### Step 4. Building a Balanced Subset of Recognisable Movies

To improve the user experience when selecting favourites, we create a subset of top-grossing films that span a wide variety of genres. This ensures users are more likely to recognise and select movies they’ve seen.

For each of the 10 most common primary genres, we randomly sample up to 20 movies from that genre's 100 highest-grossing titles.

In [26]:
# Drop rows with missing data
movies_filtered = movies.dropna(subset=['Gross', 'Genre1'])

# Sort by Gross (descending), using the filtered frame so rows with missing data stay excluded
movies_sorted = movies_filtered.sort_values(by='Gross', ascending=False)

# Select a subset of movies from each major genre
sample_per_genre = 20  # Number of movies to sample per genre
unique_titles = set()

# Get most common genres
top_genres = movies_sorted['Genre1'].value_counts().head(10).index.tolist()
print(top_genres)

# Sample movies from each genre
for genre in top_genres:
    genre_subset = movies_sorted[movies_sorted['Genre1'] == genre]

    # Take the 100 highest-grossing movies in this genre, then sample from them for variety
    top_grossing = genre_subset.head(100)
    sampled = top_grossing.sample(
        n=min(sample_per_genre, len(top_grossing)),
        random_state=42
    )['Title'].tolist()

    unique_titles.update(sampled)

# Convert to sorted list
movie_titles = sorted(unique_titles)

['Drama', 'Action', 'Comedy', 'Crime', 'Biography', 'Animation', 'Adventure', 'Mystery', 'Horror', 'Western']
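
The sampling loop can also be wrapped in a function, which makes the pool size and sample size explicit. This is a sketch under assumptions: `sample_titles_per_genre` and its parameters are hypothetical names, and the toy frame is invented for illustration.

```python
import pandas as pd

def sample_titles_per_genre(df, genres, pool_size=100, n=20, seed=42):
    """Hypothetical helper: sample up to n titles from each genre's top grossers."""
    titles = set()
    # Drop incomplete rows first, then rank by revenue
    ranked = df.dropna(subset=["Gross", "Genre1"]).sort_values("Gross", ascending=False)
    for genre in genres:
        pool = ranked[ranked["Genre1"] == genre].head(pool_size)
        picked = pool.sample(n=min(n, len(pool)), random_state=seed)
        titles.update(picked["Title"])
    return sorted(titles)

# Invented data: three dramas, one action film, one row with no genre
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Genre1": ["Drama", "Drama", "Drama", "Action", None],
    "Gross": [10.0, 30.0, 20.0, 5.0, 99.0],
})
result = sample_titles_per_genre(toy, ["Drama", "Action"], pool_size=2, n=1)
print(result)
```

With `pool_size=2`, the drama pool contains only the two highest-grossing dramas (`B` and `C`), so `A` can never be sampled, and the row without a genre is dropped before ranking.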


### Step 5. User Input: Selecting Favourite Movies
The user is asked to select multiple favourite movies from a numbered list. These selections form the basis for generating recommendations using content similarity.

The cell below rebuilds `movie_titles` from the first 100 rows of the dataset, which replaces the genre-balanced list from Step 4.

If no movies are selected, the cell prints a notice.

In [27]:
# Display top 100 movie titles for user selection
movie_titles = np.sort(movies['Title'].head(100).dropna().unique().tolist())

# Interactive multi-select (manually input as list for notebook use)
print("Top 100 Movies:\n")
for idx, title in enumerate(movie_titles, 1):
    print(f"{idx}. {title}")

selected_indexes = input("\nEnter the numbers of your favorite movies separated by commas (e.g., 1, 5, 10): ")
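selected_indexes = [int(i.strip()) - 1 for i in selected_indexes.split(",") if i.strip().isdigit()]
# Keep only numbers that refer to an entry in the displayed list
selected_indexes = [i for i in selected_indexes if 0 <= i < len(movie_titles)]
selected_titles = [movie_titles[i] for i in selected_indexes]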

# Display selections
if selected_titles:
    print("\nYou selected the following movies:")
    for title in selected_titles:
        print(f"- {title}")
else:
    print("No movies selected.")

Top 100 Movies:

1. 12 Angry Men
2. 1917
3. 3 Idiots
4. Alien
5. American Beauty
6. American History X
7. Amélie
8. Anand
9. Andhadhun
10. Apocalypse Now
11. Avengers: Endgame
12. Avengers: Infinity War
13. Ayla: The Daughter of War
14. Babam ve Oglum
15. Back to the Future
16. Capharnaüm
17. Casablanca
18. Cidade de Deus
19. City Lights
20. Coco
21. Dangal
22. Django Unchained
23. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
24. Drishyam
25. Eternal Sunshine of the Spotless Mind
26. Fight Club
27. Forrest Gump
28. Gisaengchung
29. Gladiator
30. Good Will Hunting
31. Goodfellas
32. Hamilton
33. Hotaru no haka
34. Il buono, il brutto, il cattivo
35. Incendies
36. Inception
37. Inglourious Basterds
38. Interstellar
39. It's a Wonderful Life
40. Jagten
41. Jodaeiye Nader az Simin
42. Joker
43. Kimi no na wa.
44. La vita è bella
45. Léon
46. Memento
47. Miracle in cell NO.7
48. Modern Times
49. Mononoke-hime
50. Nuovo Cinema Paradiso
51. Oldeuboi
52. Once Upon a Tim
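
Input parsing is easier to test when separated from `input()`. The sketch below is one possible approach; `parse_selection` is a hypothetical helper, not part of the notebook. It skips non-numeric tokens, out-of-range numbers, and duplicates.

```python
def parse_selection(raw, n_items):
    """Hypothetical helper: turn '1, 5, 10' into zero-based indices within range."""
    indices = []
    for token in raw.split(","):
        token = token.strip()
        if not token.isdigit():
            continue  # skip blanks and non-numeric entries
        idx = int(token) - 1  # the displayed list is 1-based
        if 0 <= idx < n_items and idx not in indices:
            indices.append(idx)
    return indices

print(parse_selection("1, 5, abc, 0, 500, 5", n_items=100))  # → [0, 4]
```

Rejecting `0` matters because `0 - 1` is `-1`, which Python would otherwise treat as a valid index pointing at the last title.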

### Step 6. Feature Engineering

To compare movies based on content, we combine each movie’s key features (its first two genres, its director, and its four lead actors) into a single string.

This textual representation is used to compute content similarity between films.

In [28]:
# Combine features (first two genres, director, four lead actors) into a single string
# TF-IDF turns this text into the vectors used for similarity scoring
movies['features'] = (
    movies['Genre1'].fillna('') + ' ' +
    movies['Genre2'].fillna('') + ' ' +
    movies['Director'].fillna('') + ' ' +
    movies['Star1'].fillna('') + ' ' +
    movies['Star2'].fillna('') + ' ' +
    movies['Star3'].fillna('') + ' ' +
    movies['Star4'].fillna('')
)

### Step 7. TF-IDF Vectorisation

We apply Term Frequency-Inverse Document Frequency (TF-IDF) vectorisation to the combined feature strings.

TF-IDF transforms the raw text into numerical vectors that highlight important, distinguishing terms — helping us quantify the similarity between movies.

In [29]:
# Convert text features into a numeric matrix using TF-IDF
# TF-IDF scores help identify the most distinguishing features
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(movies['features'])

### Step 8. Calculating Cosine Similarity

Using the TF-IDF matrix, we compute the pairwise cosine similarity between all movies.

This gives us a similarity score for each pair of movies. Because TF-IDF values are non-negative, scores range from 0 (no shared terms) to 1 (feature vectors pointing in the same direction).

In [30]:
similarity_matrix = cosine_similarity(tfidf_matrix)
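
As a small worked example, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for two made-up vectors and compares the result with `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two illustrative vectors that share one of their two non-zero terms
a = np.array([[1.0, 1.0, 0.0]])
b = np.array([[1.0, 0.0, 1.0]])

# Dot product divided by the product of the vector norms
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))
library = cosine_similarity(a, b)[0, 0]

print(manual, library)
```

Both values come out at about 0.5. Since `TfidfVectorizer` L2-normalises each row by default, the denominator is already 1 for TF-IDF rows and the similarity reduces to a plain dot product.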

### Step 9. Generating Recommendations

For each user-selected favourite, we retrieve the corresponding similarity scores from the matrix.

We then compute the **average similarity score** for all other movies in the dataset — effectively ranking how similar each one is to the user’s overall taste.

Movies already liked by the user are excluded from the recommendations.

In [31]:
# Get indices of liked movies
liked_indices = movies[movies['Title'].isin(selected_titles)].index

# Compute average similarity to all other movies
similarity_scores = similarity_matrix[liked_indices].mean(axis=0)

# Set scores of liked movies to -1 so they don't appear in results
similarity_scores[liked_indices] = -1
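
The averaging and masking steps are easier to follow on a tiny example. The 4x4 matrix below is invented for illustration: rows 0 and 1 play the role of the user's favourites.

```python
import numpy as np

# Hypothetical similarity matrix (symmetric, ones on the diagonal)
sim = np.array([
    [1.0, 0.2, 0.6, 0.0],
    [0.2, 1.0, 0.4, 0.8],
    [0.6, 0.4, 1.0, 0.1],
    [0.0, 0.8, 0.1, 1.0],
])
liked = np.array([0, 1])  # positions of the user's favourites

scores = sim[liked].mean(axis=0)  # column-wise mean over the liked rows
scores[liked] = -1                # mask the favourites themselves
print(scores)
```

Movie 2 scores (0.6 + 0.4) / 2 = 0.5 and movie 3 scores (0.0 + 0.8) / 2 = 0.4, so movie 2 ranks first. The notebook's version indexes the matrix with `movies[...].index`, which works because the DataFrame still has its default 0-based index.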

### Step 10. Final Recommendations

We return the top 10 movies with the highest average similarity scores. These are the films that share the most content characteristics with the user's selected favourites.

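This unsupervised approach needs no ratings or labelled training data, so it works from movie metadata alone. The full pairwise similarity matrix grows quadratically with the number of movies, which is manageable for 1,000 titles but becomes costly for much larger catalogues.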

In [32]:
top_indices = similarity_scores.argsort()[::-1][:10]
recommendations = movies.iloc[top_indices][['Title', 'Year']]

print("\nRecommended movies:")
for i, row in recommendations.iterrows():
    print(f"{row['Title']} ({row['Year']})")


Recommended movies:
Star Wars: Episode VI - Return of the Jedi (1983)
When Harry Met Sally... (1989)
Raiders of the Lost Ark (1981)
Indiana Jones and the Last Crusade (1989)
The Fugitive (1993)
Blade Runner 2049 (2017)
Blade Runner (1982)
Memento (2000)
Aliens (1986)
Lawrence of Arabia (1962)
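
Steps 9 and 10 can be combined into one function. The following is a sketch under assumptions, not the notebook's own code: `recommend` is a hypothetical name, it looks up favourites by position rather than by index label, and it returns an empty result when nothing was selected.

```python
import numpy as np
import pandas as pd

def recommend(movies, sim, liked_titles, k=10):
    """Hypothetical wrapper around Steps 9 and 10."""
    # Positional lookup, so the result does not depend on the index labels
    liked_pos = np.flatnonzero(movies["Title"].isin(liked_titles).to_numpy())
    if liked_pos.size == 0:
        return movies.iloc[[]][["Title", "Year"]]  # nothing selected, nothing to rank
    scores = sim[liked_pos].mean(axis=0)
    scores[liked_pos] = -1  # exclude the favourites
    top = np.argsort(scores)[::-1][:k]
    return movies.iloc[top][["Title", "Year"]]

# Invented data with a non-default index to show that positions are used
toy = pd.DataFrame(
    {"Title": ["A", "B", "C", "D"], "Year": [2000, 2001, 2002, 2003]},
    index=[10, 11, 12, 13],
)
sim = np.array([
    [1.0, 0.2, 0.6, 0.0],
    [0.2, 1.0, 0.4, 0.8],
    [0.6, 0.4, 1.0, 0.1],
    [0.0, 0.8, 0.1, 1.0],
])
recs = recommend(toy, sim, ["A", "B"], k=2)
print(recs["Title"].tolist())
```

Without the empty-selection guard, averaging over zero rows would produce `NaN` scores and an arbitrary ranking.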
