# AI/Machine Learning Intern Challenge: Simple Content-Based Recommendation

In [8]:
# Import Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import kagglehub
import os

## Load dataset

This notebook uses the `jrobischon/wikipedia-movie-plots` dataset from Kaggle, which contains Wikipedia movie plot summaries.

In [9]:
# Download latest version
path = kagglehub.dataset_download("jrobischon/wikipedia-movie-plots")
csv_path = os.path.join(path, "wiki_movie_plots_deduped.csv")
df = pd.read_csv(csv_path)

df = df.dropna(subset=['Plot'])  # drop rows with a missing plot
df = df.sample(500, random_state=42)[['Title', 'Plot']]  # randomly sample 500 rows and keep only the relevant columns

In [10]:
# Overview of the dataframe
df.head()

Unnamed: 0,Title,Plot
5337,The Day the Earth Stood Still,"When a flying saucer lands in Washington, D.C...."
9809,The Burning,"One night at Camp Blackfoot, several campers p..."
24075,Nobel Chor,"The first Asian Nobel Laureate, Rabindranath T..."
19057,Trent's Last Case,A major international financier is found dead ...
24991,Aafat,Inspector Amar and Inspector Chhaya are after ...


## Vectorize Plots

We use **TF-IDF (Term Frequency-Inverse Document Frequency)**, which:
- Weights words that are distinctive to a plot more heavily than words that appear across many plots (e.g., "the", "and").
- Produces a sparse matrix with one row per movie plot and one column per vocabulary term.
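
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. The corpus, the `tfidf_weights` helper, and the whitespace tokenization are assumptions made for illustration; the smoothed IDF formula and L2 normalization are intended to mirror scikit-learn's default settings, but this is not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the dataset.
toy_plots = [
    "the crew explores deep space",
    "the detective chases a thief",
    "a robot saves the crew",
]

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]  # simplification: split on whitespace
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency.
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Scale each document vector to unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({term: w / norm for term, w in raw.items()})
    return weights

toy_weights = tfidf_weights(toy_plots)
first = toy_weights[0]
for term in ("the", "crew", "space"):
    print(f"{term}: {first[term]:.3f}")
```

In the first toy plot, "the" occurs in every document and receives the smallest weight, "crew" occurs in two documents and sits in the middle, and "space" occurs in only one and receives the largest.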

In [11]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['Plot'])

# The shape should be (500, vocabulary size)
print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (500, 17261)


## Recommendation Function

In [12]:
def recommend_movies(query, top_n=5):
    """
    Recommend movies based on a user's text query using TF-IDF and cosine similarity.
    
    Args:
        query: User's text description of preferences (e.g., "action movies in space").
        top_n: Number of top movies to recommend (default: 5).
    
    Returns:
        list: Titles of the top N recommended movies.
    """
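    # Vectorize the query with the vocabulary fitted on the plots
    query_vector = vectorizer.transform([query])

    # Compute cosine similarity between the query and every plot
    similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

    # Indices of the top N most similar movies, best match first
    # (reversing before slicing also handles top_n=0 correctly)
    top_indices = similarities.argsort()[::-1][:top_n]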
    
    return df['Title'].iloc[top_indices].tolist()
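
As a rough illustration of what the function computes, the sketch below scores a query against a few hand-made vectors with cosine similarity and keeps the best matches. The titles, weights, and the `cosine` and `rank_titles` helpers are hypothetical; the notebook itself relies on scikit-learn's `cosine_similarity` over the sparse TF-IDF matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has nothing to compare
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; titles and weights are made up for illustration.
toy_movies = {
    "Space Film": {"space": 0.8, "crew": 0.6},
    "Heist Film": {"thief": 0.7, "bank": 0.7},
    "Robot Film": {"robot": 0.9, "space": 0.4},
}
toy_query = {"space": 1.0}

def rank_titles(query_vec, movies, top_n=2):
    """Return the top_n (score, title) pairs, highest score first."""
    scored = [(cosine(query_vec, vec), title) for title, vec in movies.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

ranked = rank_titles(toy_query, toy_movies)
for score, title in ranked:
    print(f"{title}: {score:.3f}")
```

A score of zero means the query and the plot share no vocabulary, so returning scores alongside titles can help distinguish strong matches from weak ones.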

## Sample Query & Outputs

In [13]:
sample_query = "I want sci-fi thrillers set in outer space with a bit of humor."
recommendation = recommend_movies(sample_query)
print(f"Query: {sample_query}\n")
print("Recommended Movies: \n")
for i, title in enumerate(recommendation, 1):
    print(f"{i}. {title}\n")

Query: I want sci-fi thrillers set in outer space with a bit of humor.

Recommended Movies: 

1. Kaizoku Sentai Gokaiger vs. Space Sheriff Gavan: The Movie

2.  Angst

3. Neighor, The !The Neighbor

4. Kamen Rider Kabuto: GOD SPEED LOVE

5. The Wayward Bus

