---
# **Table of Contents**
---

1. [**Problem Statement**](#Section1)<br>
2. [**Importing Libraries**](#Section2)<br>
3. [**Data Acquisition & Description**](#Section3)<br>
4. [**Data Pre-Processing**](#Section4)<br>
5. [**Feature Engineering**](#Section5)<br>
6. [**Recommendation**](#Section6)<br>
7. [**Salary Expectations**](#Section7)
---

---
<a name = Section1></a>
# ***1. Problem Statement***
---
Build a **content-based recommendation system** that, given a **short text description** of a user’s preferences, suggests **similar items** (e.g., movies) from a small dataset. This challenge should take about **3 hours**, so keep your solution **simple** yet **functional**.

### Example Use Case

- The user inputs:  
  "I love thrilling action movies set in space, with a comedic twist."
- Your system processes this description (query) and compares it to a dataset of items (e.g., movies with their plot summaries or keywords).  
- You then return the **top 3–5 “closest” matches** to the user.


<center><img width=50% src="https://shorturl.at/5xY64"></center>


---
<a name = Section2></a>
# ***2. Importing Libraries***
---

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import re
import string
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer
import nltk
from nltk.corpus import stopwords

---
<a name = Section3></a>
# ***3. Data Acquisition & Description***
---

- The IMDb Top 1000 Movies dataset

|Id|Feature|Description|
|:--|:--|:--|
|01| Poster_Link            | Link to the movie poster |
|02| Series_Title           | Title of the movie |
|03| Released_Year          | Year in which the movie was released |
|04| Certificate            | Certification rating assigned by the movie board |
|05| Runtime                | Movie runtime (e.g. "142 min") |
|06| Genre                  | Genre(s) of the movie |
|07| IMDB_Rating            | IMDb rating of the movie |
|08| Overview               | A brief description of the movie |
|09| Meta_score             | Metascore of the movie |
|10| Director               | Movie director |
|11| Star1                  | Movie Star 1   |
|12| Star2                  | Movie Star 2   |
|13| Star3                  | Movie Star 3   |
|14| Star4                  | Movie Star 4   |
|15| No_of_Votes            | Number of votes the movie received |
|16| Gross                  | Gross earnings of the movie |

In [2]:
# Loading the data
df = pd.read_csv('imdb_top_1000.csv') # replace with your data path
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [3]:
# Inspect column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB


- Most columns are text-based (object dtype); 'Certificate', 'Meta_score' and 'Gross' contain missing values

In [4]:
# Describe the dataset
df.describe()

Unnamed: 0,IMDB_Rating,Meta_score,No_of_Votes
count,1000.0,843.0,1000.0
mean,7.9493,77.97153,273692.9
std,0.275491,12.376099,327372.7
min,7.6,28.0,25088.0
25%,7.7,70.0,55526.25
50%,7.9,79.0,138548.5
75%,8.1,87.0,374161.2
max,9.3,100.0,2343110.0


- The 3 numerical columns in the dataset are 'IMDB_Rating', 'Meta_score' and 'No_of_Votes'

- Going ahead, we will use only the 'Series_Title', 'Genre', 'Overview' and 'Director' columns, since we are focusing on a simple text-based recommendation model

In [5]:
# Select only the required columns; copy so later column assignments do not act on a view of df
df_process = df[['Series_Title', 'Genre', 'Overview', 'Director']].copy()

In [6]:
df_process['Overview'][0]

'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.'

In [7]:
df_process.describe()

Unnamed: 0,Series_Title,Genre,Overview,Director
count,1000,1000,1000,1000
unique,999,202,1000,548
top,Drishyam,Drama,Two imprisoned men bond over a number of years...,Alfred Hitchcock
freq,2,85,1,14


In [8]:
df_process.isnull().sum()

Series_Title    0
Genre           0
Overview        0
Director        0
dtype: int64

- Going ahead, we will process only these columns for our data

---
<a name = Section4></a>
# **4. Data Preprocessing**
---

In [9]:
# Download the NLTK tokenizer and stopword packages
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/saggysimmba/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/saggysimmba/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [10]:
# Porter stemmer used to reduce words to their stems
stemmer = PorterStemmer()
# Translation table that deletes every punctuation character
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

In [11]:
def clean_tokens(text):
  """Lowercase the text and keep only letters and single spaces."""
  lower_text = text.lower()  # Convert to lowercase
  lower_text = re.sub(r'[^a-zA-Z ]', ' ', lower_text)  # Replace everything except letters and spaces with a space
  lower_text = re.sub(r'\s+', ' ', lower_text).strip()  # Collapse repeated whitespace and trim
  no_punctuation = lower_text.translate(remove_punctuation_map)  # No-op here: the regex above already removed punctuation
  tokens = nltk.word_tokenize(no_punctuation)  # Split the text into word tokens
  stop_words = set(stopwords.words("english"))  # Build the stop-word set once instead of once per token
  filtered = [w for w in tokens if w not in stop_words]  # Drop English stop words
  stemmed = [stemmer.stem(item) for item in filtered]  # Stem the remaining tokens
  # NOTE: `stemmed` is computed but never used. The function returns the cleaned,
  # unstemmed text, and all results below were produced from that text.
  return lower_text
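
The function above computes stemmed tokens but returns the unstemmed text. If stemming is actually wanted, a variant could look like the minimal sketch below. The name `clean_and_stem` is hypothetical, and two assumptions are swapped in so the sketch runs without NLTK downloads: scikit-learn's built-in `ENGLISH_STOP_WORDS` replaces the NLTK stop-word list, and a whitespace split replaces `nltk.word_tokenize`. Using it would change the vocabulary sizes and recommendations shown later in this notebook.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def clean_and_stem(text, stop_words=ENGLISH_STOP_WORDS):
    """Hypothetical variant of clean_tokens that returns stemmed text."""
    # Lowercase, then replace everything except letters and spaces with a space
    text = re.sub(r'[^a-z ]', ' ', text.lower())
    # Assumption: a whitespace split stands in for nltk.word_tokenize
    tokens = text.split()
    # Drop stop words, then stem what is left
    kept = [t for t in tokens if t not in stop_words]
    return ' '.join(stemmer.stem(t) for t in kept)

print(clean_and_stem("The movies are running!"))
```

If the corpus is stemmed this way, the user query would need to go through the same function before `vectorizer.transform`, otherwise query words such as "movies" would not match the stemmed vocabulary.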

In [12]:
# Apply cleaning to relevant columns
df_process['Genre'] = df_process['Genre'].apply(clean_tokens)
df_process['Series_Title'] = df_process['Series_Title'].apply(clean_tokens)
df_process['Overview'] = df_process['Overview'].apply(clean_tokens)
df_process['Director'] = df_process['Director'].apply(clean_tokens)

- Now we will combine 'Series_Title', 'Genre', 'Overview' and 'Director' into a single 'tags' column, so that each movie is described by one text field

In [13]:
# Create the tags
df_process['tags'] = df_process['Series_Title'] + ' ' + df_process['Genre'] + ' ' + df_process['Overview'] + ' ' + df_process['Director']

---
<a name = Section5></a>
# **5. Feature Engineering**
---

- Here we will experiment with the TF-IDF parameters and see how they affect the results.
- First we will fit a vectorizer with default parameters, which keeps the full vocabulary. Then we will fit a second vectorizer with a reduced vocabulary.

In [14]:
# Initialize tfidf vectorizer to extract vocabulary
tfidf1 = TfidfVectorizer()
tfidf_matrix1 = tfidf1.fit_transform(df_process['tags'])  # Fit on the combined 'tags' text

In [15]:
# Get vocabulary size
vocab_size = len(tfidf1.vocabulary_)
vocab_size

7191

The vocabulary seems too large. Let's shrink it by requiring a minimum document frequency for each term

In [16]:
# Let's define the second vectorizer
tfidf2 = TfidfVectorizer(min_df=10)
tfidf_matrix2 = tfidf2.fit_transform(df_process['tags'])  # Fit on the combined 'tags' text
vocab_size = len(tfidf2.vocabulary_)
vocab_size

363

- Here we can see that just by dropping the words that occurred in fewer than 10 documents (an integer min_df is an absolute document count, not a percentage), we were able to reduce the number of features from 7191 to just 363
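
To make the meaning of min_df concrete, here is a minimal sketch on a hypothetical four-document toy corpus (not the IMDb data). An integer value is read as an absolute document count, while a float is read as a fraction of all documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; document frequencies:
# space=3, action=2, drama=2, comedy=1, thriller=1, crime=1
docs = [
    "space action comedy",
    "space drama",
    "space thriller action",
    "crime drama",
]

# Integer min_df: keep terms that appear in at least 2 documents
by_count = TfidfVectorizer(min_df=2).fit(docs)

# Float min_df: keep terms that appear in at least 75% of documents (3 of 4)
by_fraction = TfidfVectorizer(min_df=0.75).fit(docs)

print(sorted(by_count.vocabulary_))     # ['action', 'drama', 'space']
print(sorted(by_fraction.vocabulary_))  # ['space']
```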

---
<a name = Section6></a>
# **6. Recommendation**
---

- Here we will test out recommendations from both the matrices and comment on the result

In [17]:
def recommend_from_text(query, vectorizer, df_movies, tfidf_matrix, n_recommendations=5):
    """
    Recommend movies based on a text query, including similarity scores.

    Parameters:
    - query (str): User input describing the movie.
    - vectorizer (TfidfVectorizer): Fitted TF-IDF vectorizer.
    - df_movies (DataFrame): Movie dataset with 'Series_Title' column.
    - tfidf_matrix (sparse matrix): The TF-IDF feature matrix of all movies.
    - n_recommendations (int): Number of movies to return.

    Returns:
    - List of tuples with recommended movie titles and their similarity scores.
    """
    # Step 1: Convert input text into TF-IDF features
    query_tfidf = vectorizer.transform([query])  # Apply TF-IDF transformation

    # Step 2: Compute cosine similarity between query and all movies in the dataset
    similarity_scores = cosine_similarity(query_tfidf, tfidf_matrix)[0]

    # Step 3: Get the indices of the top N most similar movies, highest score first
    top_movie_indices = similarity_scores.argsort()[::-1][:n_recommendations]

    # Step 4: Retrieve recommended movie titles and their similarity scores
    recommended_movies = [
        (df_movies.iloc[i]['Series_Title'], similarity_scores[i]) for i in top_movie_indices
    ]

    return recommended_movies
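
When a query shares no vocabulary with the fitted vectorizer, every similarity score is 0, yet the ranking above still returns n_recommendations titles. Below is a minimal sketch of one possible guard that skips zero-score rows. The helper name `top_matches` and the toy titles and plots are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_matches(query, vectorizer, titles, tfidf_matrix, n=5):
    """Rank titles by cosine similarity to the query, skipping zero-score rows."""
    scores = cosine_similarity(vectorizer.transform([query]), tfidf_matrix)[0]
    order = np.argsort(scores)[::-1][:n]  # indices sorted by descending score
    return [(titles[i], float(scores[i])) for i in order if scores[i] > 0]

# Hypothetical toy data
titles = ["Space Comedy", "Crime Drama", "Space Thriller"]
plots = [
    "funny astronauts in space",
    "a detective solves a crime",
    "a tense mission in deep space",
]
vec = TfidfVectorizer()
matrix = vec.fit_transform(plots)

hits = top_matches("space adventure", vec, titles, matrix)
no_hits = top_matches("romantic musical", vec, titles, matrix)
print(hits)     # only the two space titles, since "space" is the one shared term
print(no_hits)  # [] because no query word is in the vocabulary
```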

In [18]:
# Example Usage for first tfidf matrix
recommend_from_text(
    "I love thrilling action movies set in space, with a comedic twist.",
    tfidf1,  # TF-IDF vectorizer
    df,  # DataFrame with movies
    tfidf_matrix1  # TF-IDF feature matrix
)

[('Amarcord', np.float64(0.27847261075557966)),
 ('Aliens', np.float64(0.13616430405418314)),
 ('The Man Who Would Be King', np.float64(0.13217272989380913)),
 ('Barton Fink', np.float64(0.1237384750664491)),
 ('Clerks', np.float64(0.11124411444944568))]

In [19]:
# Example Usage for second tfidf matrix
recommend_from_text(
    "I love thrilling action movies set in space, with a comedic twist.",
    tfidf2,  # TF-IDF vectorizer
    df,  # DataFrame with movies
    tfidf_matrix2  # TF-IDF feature matrix
)

[('Aliens', np.float64(0.3608801625928601)),
 ('Ghostbusters', np.float64(0.3264176120018433)),
 ('Amarcord', np.float64(0.3056737953920116)),
 ('Gattaca', np.float64(0.2928924835594347)),
 ('Blade Runner', np.float64(0.28458284071687384))]

**Comments**
- The second system, with its reduced vocabulary, returns results that look more relevant to the query: **Ghostbusters** fits the **comedy + sci-fi** action theme, while **Gattaca** and **Blade Runner** add strong sci-fi/thriller elements. Its similarity scores are also higher, but scores from the two vector spaces are not directly comparable, since dropping rare terms tends to raise cosine similarity on its own.
- **Aliens** is the only relevant recommendation that appears in both systems. **Amarcord** also appears in both lists.
- The first system's results lean more towards **comedy**.
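
A caveat when comparing scores between the two systems: a smaller vocabulary can raise cosine similarity even when a document is no better a match. The minimal sketch below uses a hypothetical toy corpus (not the IMDb data) to show the same document scoring higher for the same query once rare terms are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: only "space" and "drama" occur in 2+ documents
docs = [
    "space aliens marines colony",
    "space comedy ghosts",
    "space drama",
    "quiet drama",
]
query = ["space"]

def score_first_doc(vectorizer):
    """Cosine similarity between the query and the first document."""
    matrix = vectorizer.fit_transform(docs)
    return float(cosine_similarity(vectorizer.transform(query), matrix)[0, 0])

full_score = score_first_doc(TfidfVectorizer())             # full vocabulary
reduced_score = score_first_doc(TfidfVectorizer(min_df=2))  # rare terms dropped

# In the reduced space the first document keeps only "space",
# so it points in exactly the same direction as the query.
print(full_score, reduced_score)
```

This is why the relevance of the returned titles is a better basis for comparing the two systems than the raw score values.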

---
<a name = Section7></a>
# **7. Salary Expectations**
---

- My salary expectations are in the range of $1200 to $2000 per month