
# 🎬 Make your own recommender system!

In this notebook, we'll build a simple but powerful **movie recommender** using different text similarity techniques, putting all the pieces from the last three weeks together. More specifically, you will use your knowledge of preprocessing (week 1 and week 2), vectorizers (week 2), embeddings (week 2) and (soft) cosine similarity (week 3) to build your own recommender system.


We'll cover:

-  A knowledge-based recommender using genres
-  Content-based recommender systems using:
   -  regular cosine similarity based on Count and TF-IDF vectorizers 
   -  soft cosine similarity based on spaCy embeddings for smart comparison



In [None]:
# Imports
import nltk
import string
import numpy as np
import pandas as pd
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from ast import literal_eval

In [None]:
# Download resources
nltk.download('punkt')
nltk.download('stopwords')

### Explore and Preprocess the data

Today, we will start working on building our own recommender system. For this assignment, we will work with movie data.
Download the following datasets [here](https://www.kaggle.com/tmdb/tmdb-movie-metadata):
- `tmdb_5000_credits.csv`
- `tmdb_5000_movies.csv`

Place the files in a folder in the current working directory, which you can call `data/`.


Let's explore the two datasets and identify what information is available and which columns may be useful for building a **knowledge-based** and **content-based recommender system**.

---

#### 🎬 `tmdb_5000_movies.csv`

This dataset contains metadata for each movie, including:
- `title`: Movie title
- `genres`: JSON-formatted list of genres
- `keywords`: Tags or themes for the movie
- `overview`: A short description of the movie plot
- `vote_average`, `vote_count`: Useful for understanding popularity
- `runtime`, `release_date`, `popularity`
- `production_companies`, `production_countries`
- `original_language`
- `id`: The unique movie ID (important for merging)

We’ll likely focus on:
- `title`, `overview`, `genres`, `keywords`, `vote_average`, `popularity`

---

#### 👥 `tmdb_5000_credits.csv`

This contains:
- `movie_id`: Unique ID (can be matched with `id` in movies dataset)
- `title`: Redundant but helpful for validation
- `cast`: JSON-formatted list of cast members
- `crew`: JSON-formatted list of crew members (can extract directors, writers, etc.)

We’ll likely use:
- `movie_id`, `cast`, `crew`

---


In [None]:
movies = pd.read_csv('data/tmdb_5000_movies.csv')
credits = pd.read_csv('data/tmdb_5000_credits.csv')

PATH = 'data/'

VOTE_COUNT = 2000 #If you want to work with a larger dataset, decrease this value.

def get_data(path_to_data):

    data1 = pd.read_csv(f'{path_to_data}tmdb_5000_credits.csv')
    data2 = pd.read_csv(f'{path_to_data}tmdb_5000_movies.csv')
    data2.rename(columns={'id': 'movie_id'}, inplace=True)

    data = pd.merge(data1,data2,  on=['movie_id', 'title'])
    data["original_title"] = data["original_title"].str.lower()

    data = data[data['vote_count'] > VOTE_COUNT] # for now, only keep movies with frequent votes (this will keep the dataset rather small and therefore computation is faster)
    data = data.reset_index(drop=True)  # renumber rows 0..n-1 after filtering
    return data

In [None]:
data = get_data(PATH)
data.head(3)

# 1. Knowledge-based recommender system

## Text Preprocessing

As a first step, some data wrangling techniques are needed to get the data into the right shape.
- Can you convert `release_date` to a yearly-level variable (`release_year`)?
- Can you clean up the `genres` column?

In [None]:
# Step 1: Extract the 'release_year' from 'release_date'
data['release_year'] = pd.to_datetime(data['release_date'], errors='coerce').dt.year

# Step 2: Function to clean and combine genres into a simple text string
def get_genres(x):
    try:
        # Convert the string of genres into a list, then join the names into one string (lowercase)
        return " ".join([genre['name'].lower() for genre in literal_eval(x)])
    except (ValueError, SyntaxError):
        # If there's an error in parsing, return an empty string
        return ''

# Step 3: Apply the function to the 'genres' column
data['genres'] = data['genres'].apply(get_genres)

# Step 4: Display the first few rows to check the result
data.head(3)



-   In a knowledge-based recommender system, we use specific attributes of items (here, movies) to recommend items that match a user's stated preferences. One of the most useful attributes for movies is genre (e.g., Action, Comedy, Drama). In the next cell, we 'explode' the `genres` column: after the cleaning step above, it holds all genres of a movie in a single string (e.g., `action adventure`). To filter on one genre at a time, we split that string so that each row in the data corresponds to a single genre for a movie.

-   Creating a "long" format: Transforming the data this way is called exploding. Each row then represents a movie and one of its genres, so a movie with multiple genres appears in multiple rows (one per genre).
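To make the "long" format concrete, here is a minimal sketch on a toy table. The movie titles and the `toy` / `toy_long` names are made up for illustration, and it uses `DataFrame.explode` (pandas 0.25 or later) as one possible way to get one row per movie and genre; it is not necessarily the approach used in the notebook cell.

```python
import pandas as pd

# Toy data (hypothetical): all genres of a movie in one space-separated string
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["action adventure", "comedy"],
})

# Split each string into a list, then give every list element its own row
toy_long = (
    toy.assign(single_genre=toy["genres"].str.split())
       .explode("single_genre")
       .reset_index(drop=True)
)

print(toy_long[["title", "single_genre"]])
```

"Movie A" has two genres, so it ends up in two rows of `toy_long`, while "Movie B" keeps a single row.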

In [None]:
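# Split each genre string on whitespace and stack the pieces into one row per (movie, genre).
# Note: multi-word genres such as "science fiction" are split into separate tokens here.
s = data.apply(lambda x: pd.Series(x['genres'].split()),axis=1).stack().reset_index(level=1, drop=True)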
s.name = 'single_genre'
data = data.join(s)

data[['single_genre', 'title', 'vote_average', 'vote_count', 'release_year']].head(10)

Now that the data is in good shape, we can start with the actual recommendation system. Our knowledge-based recommender will ask the user for input, so let's first see how `input()` works:

In [None]:
print("Enter whatever:")
a_test = input()
print(a_test) ## do you understand what happens here? Some input is taken from the user and stored in a_test.

#### Example of a knowledge-based recommender system

Feel free to play around with this simple knowledge-based recommender system to see how it works and inspect the output! By experimenting with different genres, release years, and other preferences, you can explore how the system tailors movie recommendations based on your input. 

In [None]:

def knowledge_based_recommender(data):

    # Work on a copy so the assignment below does not trigger a SettingWithCopyWarning
    data = data[data['single_genre'].notna()].copy()
    data['single_genre'] = data['single_genre'].str.lower()

    print(f"What type of genre do you like? \n\nYou can choose from the following:\n\n{set(data['single_genre'])}")
    genre = input().lower()

    print("What is the minimum release year of movies you are interested in? (e.g., how 'old' may a movie be?)" )
    release_year = int(input())

    movies = data[(data['single_genre'] == genre) &
    (data['release_year'] >= release_year) ]

    recommend_movies = movies.sort_values('vote_average', ascending=False)

    return recommend_movies[['title', 'vote_average', 'genres']].head(5)


knowledge_based_recommender(data)
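If you want to try the same filtering logic without typing into `input()` prompts, a non-interactive variant can help. The sketch below is an illustration only: the helper name `recommend_by_genre`, its parameters, and the toy movies are assumptions, not part of the notebook.

```python
import pandas as pd

def recommend_by_genre(df, genre, min_year, top_n=5):
    """Hypothetical helper: filter by genre and minimum year, rank by vote_average."""
    mask = (df["single_genre"].str.lower() == genre.lower()) & (df["release_year"] >= min_year)
    ranked = df[mask].sort_values("vote_average", ascending=False)
    return ranked[["title", "vote_average", "single_genre"]].head(top_n)

# Toy data (hypothetical) in the exploded, one-genre-per-row format
toy_movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "single_genre": ["action", "action", "comedy", "action"],
    "release_year": [1999, 2010, 2012, 2015],
    "vote_average": [7.9, 6.5, 7.0, 8.1],
})

result = recommend_by_genre(toy_movies, "Action", 2000)
print(result)
```

Passing the preferences as arguments makes it easy to call the function repeatedly with different genres and years and compare the outputs.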


# 2. Content-based recommender system

## 2a. Content-based using Cosine Similarity

For this task, we go back to the dataset in its original format (that is, before exploding the data to a long format).


### Key Steps:

1. **Create a Combined Feature Column**:
   - We combine multiple text columns such as **overview**, **genres**, **tagline**, etc., to create a new column called `combined_features`. This column will hold all the relevant textual data for each movie that will be used for similarity comparison.
   - For example, combining the **overview** and **genres** might provide a richer context to better identify similar movies.

2. **Preprocessing**: 
   - **Lowercase**: Convert the text to lowercase for uniformity.
   - **Remove Punctuation**: Strip punctuation marks from the text.
   - **Tokenization**: Split the text into individual words (tokens).
   - **Stopwords Removal**: Remove common words like "the", "is", "and", etc., which do not contribute much to the meaning.
   - **Stemming**: Reduce words to their root form (e.g., "running" → "run").

3. **User Input Preprocessing**:
   - When a user enters a movie title, we preprocess the input text in the same way we processed the dataset, to ensure that the comparison is valid.

4. **Vectorization**:
   - Convert the text data in the `combined_features` column and user input into numerical vectors using **CountVectorizer**. This step transforms the text into a format that can be compared mathematically.

5. **Cosine Similarity**:
   - **Cosine similarity** is used to calculate how similar the user's input is to the movies in the dataset.
   - The higher the cosine similarity, the more similar the movies are to the user's input.

6. **Recommendation**:
   - The system returns a list of the top 10 most similar movies based on the calculated cosine similarity.

### Notes:
- Creating the combined feature column allows the system to consider multiple aspects (overview, genres, etc.) when calculating similarity, which helps improve the quality of recommendations.
- Preprocessing ensures that text variations (case, punctuation, etc.) do not affect the similarity calculation.
- The **CountVectorizer** converts text into numerical form, which is required for comparing the user input with the movie dataset.
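The vectorization and cosine similarity steps above can be sketched on a tiny corpus. The three documents and the query below are invented for illustration; only `CountVectorizer` and `cosine_similarity` from scikit-learn are used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "combined features" (hypothetical), already lowercased and cleaned
docs = [
    "space alien battle",       # doc 0
    "space alien rescue",       # doc 1
    "romantic comedy wedding",  # doc 2
]

# Learn the vocabulary from the documents and turn them into count vectors
count_vec = CountVectorizer()
doc_vectors = count_vec.fit_transform(docs)

# Map the query into the same vector space (transform, not fit_transform)
query_vector = count_vec.transform(["alien battle"])

# One similarity score per document; higher means more word overlap
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()

print(scores)
print("Most similar document:", docs[best])
```

The query shares two words with doc 0, one with doc 1, and none with doc 2, so the scores decrease in that order and doc 2 scores exactly zero.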


In [None]:
data = get_data(PATH)
data['release_year'] = pd.DatetimeIndex(data['release_date']).year
data['genres'] = data['genres'].apply(get_genres)
data.head(3)

### a. Create a combined feature column.


The goal is to create a **combined feature column** by merging relevant columns in your dataset. This can improve your recommender system by providing a richer representation of each movie.

### Example:

Combine columns like **`overview`**, and **`genres`**:

```python
data['combined'] = data[['genres', 'overview']].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)
```


In [None]:
def combine_features(data): 
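    # Join with a space: a comma would be stripped during punctuation removal,
    # fusing the last word of one column with the first word of the next
    data['combined_features'] = data[['original_title', 'genres', 'overview', 'tagline']].apply(lambda x: ' '.join(x.dropna().astype(str)),axis=1)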
    return data

data = combine_features(data)
data.head(3)

### b. Preprocess the data

Before moving on to our vectorizers, we clean our created `combined_features` column. This is important because we want to make sure that our text data is in a format that can be easily processed by the vectorizers.

Think about the following preprocessing steps from week 1 and week 2 of this course:

- Lowercasing
- Removing punctuation
- Tokenizing
- Removing stopwords (like "the", "is")
- Stemming or Lemmatization (optional)



In [None]:
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed_tokens)

# Apply preprocessing to the 'combined_features' column
data['processed_combined_features'] = data['combined_features'].apply(preprocess_text)

### c. Transform your data -- decide upon Count or Tfidf vectorizer. 

Think about a strategy for transforming your combined data column, as designed in the previous step. More specifically, `fit_transform` the combined data column using `tfidf` or `count` vectorizer.
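To help with that choice, here is a small sketch of how the two vectorizers weight the same words. The three toy documents are invented, and the TF-IDF behaviour shown relies on scikit-learn's default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical): "movie" appears in every document, the other words only once
docs = ["movie space alien", "movie romantic wedding", "movie heist crew"]

count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(docs).toarray()

# Compare the weights of "movie" and "alien" in the first document
count_movie = count_matrix[0, count_vec.vocabulary_["movie"]]
count_alien = count_matrix[0, count_vec.vocabulary_["alien"]]
tfidf_movie = tfidf_matrix[0, tfidf_vec.vocabulary_["movie"]]
tfidf_alien = tfidf_matrix[0, tfidf_vec.vocabulary_["alien"]]

print("Count :", count_movie, count_alien)
print("TF-IDF:", round(tfidf_movie, 3), round(tfidf_alien, 3))
```

With raw counts both words weigh the same. TF-IDF down-weights "movie" because it occurs in every document, so rarer and more distinctive words such as "alien" contribute more to the similarity score.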

In [None]:
# Step 1: Vectorize the combined features
vectorizer = CountVectorizer()  # or TfidfVectorizer() for TF-IDF
vectors = vectorizer.fit_transform(data['processed_combined_features'])

# Step 2: Compute cosine similarity between the movies
cosine_sim = cosine_similarity(vectors)

# Get the movie title from the user
print("Welcome to the Movie Recommender!")
print("What movie do you like? Please enter the movie title:")

# Preprocess the user input
user_input = input().strip().lower()
user_input_processed = preprocess_text(user_input)  # Preprocess the input in the same way

# Step 3: Vectorize the preprocessed user input
user_input_vector = vectorizer.transform([user_input_processed])

# Step 4: Compute cosine similarity between the user input and all movies
cosine_sim_user = cosine_similarity(user_input_vector, vectors)

# Step 5: Check if the movie exists in the dataset
if user_input not in data['original_title'].str.lower().values:
    print("Movie not in Database")
else:
    # Find the index of the movie entered by the user
    indices = pd.Series(data.index, index=data['original_title'].str.lower())
    index = indices[user_input]

    # Get similarity scores between the user input and every movie
    sim_scores = list(enumerate(cosine_sim_user[0]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the top 10, dropping the entered movie by its index
    # (the top-ranked result is not guaranteed to be the entered movie itself)
    entered = set(np.atleast_1d(index))
    sim_scores = [score for score in sim_scores if score[0] not in entered][:10]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Extract movie details for the recommendations
    movie_id = data['movie_id'].iloc[movie_indices]
    movie_title = data['original_title'].iloc[movie_indices]
    movie_genres = data['genres'].iloc[movie_indices]

    # Create a DataFrame for the recommendations
    recommendation = pd.DataFrame(columns=['Id', 'title', 'genres'])
    recommendation['Id'] = movie_id
    recommendation['title'] = movie_title
    recommendation['genres'] = movie_genres

    # Display the recommendations
    print("\nHere are some movie recommendations based on your choice:")
    for index, row in recommendation.iterrows():
        print(f"Title: {row['title']}, Genres: {row['genres']}")



## 2b. Content based using Soft Cosine -- spaCy Embeddings (No Preprocessing!)

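spaCy's `en_core_web_md` model ships with pretrained word vectors, so related words (e.g., "film" and "movie") end up close together and we **don't need to preprocess** the text ourselves. Note that `Doc.similarity` computes the cosine similarity between the averaged word vectors of two documents; this captures word relatedness in the spirit of soft cosine similarity, but it is not the soft cosine formula itself.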
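For reference, soft cosine similarity in its textbook form extends regular cosine similarity with a term-to-term similarity matrix `S`, so that related but different words still count as partial matches. The sketch below uses NumPy only; the three-word vocabulary, the values in `S`, and the helper name `soft_cosine` are assumptions chosen for illustration.

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine between bag-of-words vectors a and b, given term-similarity matrix S."""
    numerator = a @ S @ b
    denominator = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return numerator / denominator

# Hypothetical vocabulary: ["film", "movie", "banana"]
# S[i, j] says how similar term i is to term j (1.0 on the diagonal)
S = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

doc_a = np.array([1.0, 0.0, 0.0])  # contains only "film"
doc_b = np.array([0.0, 1.0, 0.0])  # contains only "movie"

# With the identity matrix, soft cosine reduces to regular cosine similarity
regular = soft_cosine(doc_a, doc_b, np.eye(3))
soft = soft_cosine(doc_a, doc_b, S)

print("Regular cosine:", regular)
print("Soft cosine   :", soft)
```

The two documents share no words, so regular cosine similarity is zero, while the soft version rewards the fact that "film" and "movie" are treated as near-synonyms in `S`.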


In [None]:
import spacy
import pandas as pd

# Load spaCy's English model
nlp = spacy.load("en_core_web_md")

# Take user input for the movie description
query_plot = input("What movie do you like? Please enter the movie description: ")

# Process the user input (query) using spaCy
query_doc = nlp(query_plot)

# Process the 'combined_features' column for all movies in your dataset using spaCy
doc_objects = list(nlp.pipe(data['combined_features']))  # nlp.pipe processes the texts in batches

# Calculate similarity scores between the user's query and each movie's combined features
spacy_scores = [query_doc.similarity(doc) for doc in doc_objects]

# Get the indices of the top 10 most similar movies
sorted_indices = sorted(range(len(spacy_scores)), key=lambda i: spacy_scores[i], reverse=True)[:10]

# Create a DataFrame to store the top 10 recommendations
top_10_movies = pd.DataFrame({
    'title': data['original_title'].iloc[sorted_indices],
    'score': [spacy_scores[i] for i in sorted_indices]
})

# Display the top 10 most similar movies
print("\nTop 10 Most Similar Movies:")
for index, row in top_10_movies.iterrows():
    print(f"Title: {row['title']}, Similarity Score: {row['score']:.4f}")
