# 🎥 Built a Movie Recommendation System with Python

I recently developed a content-based movie recommendation system using The Movie Database (TMDB) API. This project combined data science, NLP, and my passion for movies to create personalized recommendations. Below, I’ve shared the key steps and the exact code used, along with explanations for each section. Let’s dive in!

### 1) Importing Core Libraries
This section imports essential Python libraries for data handling (`pandas`, `numpy`), progress tracking (`tqdm`), and API requests (`requests`).


In [None]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [None]:
import requests

### 2) Initial API Request
Here, I set up the TMDB API request to fetch top-rated movies, using headers for authentication and printing the JSON response to verify the data structure.


In [None]:
import os

url = "https://api.themoviedb.org/3/movie/top_rated?language=en-US&page=1"

# Read the TMDB read-access token from an environment variable instead of hardcoding it.
# The variable name TMDB_API_TOKEN is an assumption; use whatever name you export.
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {os.environ['TMDB_API_TOKEN']}"
}

response = requests.get(url, headers=headers)
print(response.json())

{'page': 1, 'results': [{'adult': False, 'backdrop_path': '/zfbjgQE1uSd9wiPTX4VzsLi0rGG.jpg', 'genre_ids': [18, 80], 'id': 278, 'original_language': 'en', 'original_title': 'The Shawshank Redemption', 'overview': 'Imprisoned in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he puts his accounting skills to work for an amoral warden. During his long stretch in prison, Dufresne comes to be admired by the other inmates -- including an older prisoner named Red -- for his integrity and unquenchable sense of hope.', 'popularity': 34.4666, 'poster_path': '/9cqNxx0GxF0bflZmeSMuL5tnGzr.jpg', 'release_date': '1994-09-23', 'title': 'The Shawshank Redemption', 'video': False, 'vote_average': 8.711, 'vote_count': 28435}, {'adult': False, 'backdrop_path': '/tmU7GeKVybMWFButWEGl2M4GeiP.jpg', 'genre_ids': [18, 80], 'id': 238, 'original_language': 'en', 'original_title': 'The Godfather', 'overview': 'Spanning t

In [None]:
response.json()["results"]

[{'adult': False,
  'backdrop_path': '/zfbjgQE1uSd9wiPTX4VzsLi0rGG.jpg',
  'genre_ids': [18, 80],
  'id': 278,
  'original_language': 'en',
  'original_title': 'The Shawshank Redemption',
  'overview': 'Imprisoned in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he puts his accounting skills to work for an amoral warden. During his long stretch in prison, Dufresne comes to be admired by the other inmates -- including an older prisoner named Red -- for his integrity and unquenchable sense of hope.',
  'popularity': 34.4666,
  'poster_path': '/9cqNxx0GxF0bflZmeSMuL5tnGzr.jpg',
  'release_date': '1994-09-23',
  'title': 'The Shawshank Redemption',
  'video': False,
  'vote_average': 8.711,
  'vote_count': 28435},
 {'adult': False,
  'backdrop_path': '/tmU7GeKVybMWFButWEGl2M4GeiP.jpg',
  'genre_ids': [18, 80],
  'id': 238,
  'original_language': 'en',
  'original_title': 'The Godfather',
  'overvi

### 3) Creating a DataFrame
Here we extract specific fields (`id`, `title`, `overview`, `genre_ids`, `poster_path`) from the API response and store them in a Pandas DataFrame.


In [None]:
data=pd.DataFrame(response.json()["results"])[["id","title","overview","genre_ids","poster_path"]]


In [None]:
data.head()

Unnamed: 0,id,title,overview,genre_ids,poster_path
0,29426,Survival of the Dead,"On a small island off the coast of Delaware, t...",27351853,/2BmqSRt10J2mpJmCgZbo5YYkQLj.jpg
1,241251,The Boy Next Door,A recently cheated on married woman falls for ...,53,/gicmSeLG6Uh7BF1r1mxZHUQ8r26.jpg
2,214597,Fright Night 2: New Blood,"By day Gerri Dandridge is a sexy professor, bu...",2735,/3Is5G28YLNKq22n5Ee2yTmYA3m6.jpg
3,351819,Fifty Shades of Black,An inexperienced college student meets a wealt...,1074935,/nkGhv7WMbyX7tL8CJkvMY3S2lW.jpg
4,654974,Home Sweet Home Alone,After being left at home by himself for the ho...,107513580,/fP3VvqUjEBjawxZHL4sYCq2ZdJD.jpg


### 4) Converting Genre IDs to Strings
Here we convert each `genre_ids` list into a comma-separated string for easier processing later.


In [None]:
data["genre_ids"] = data["genre_ids"].apply(lambda x: ",".join(map(str, x)))
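As a quick illustration of what that lambda does, here is a minimal sketch on plain Python lists rather than a DataFrame column. The sample rows and the helper names `to_string` and `to_list` are made up for the example.

```python
# Toy rows shaped like the `genre_ids` field in the TMDB response.
rows = [[18, 80], [27], []]

# Same expression as in the cell above: cast each ID to str, then join with commas.
to_string = lambda ids: ",".join(map(str, ids))
as_strings = [to_string(ids) for ids in rows]
print(as_strings)  # ['18,80', '27', '']

# Going back the other way: split on commas and cast to int, skipping empty parts.
to_list = lambda s: [int(part) for part in s.split(",") if part]
restored = [to_list(s) for s in as_strings]
print(restored)  # [[18, 80], [27], []]
```

The round trip matters if the string form is written to a CSV and the IDs are needed as numbers again later.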

### 5) Initializing an Empty DataFrame
An empty DataFrame is created to store the movie data from multiple API calls.


In [None]:
df=pd.DataFrame()
df

### 6) Fetching Data Across Pages
This loop iterates over 500 pages of the TMDB API to collect movie data, appending each page’s results to the main DataFrame.


In [None]:

import os

# Headers are the same for every page, so build them once outside the loop.
# TMDB_API_TOKEN is an assumed environment variable name holding the read-access token.
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {os.environ['TMDB_API_TOKEN']}"
}

for i in tqdm(range(1, 501)):
    url = f"https://api.themoviedb.org/3/movie/top_rated?language=en-US&page={i}"
    response = requests.get(url, headers=headers)
    temp_df = pd.DataFrame(response.json()["results"])[["id", "title", "overview", "genre_ids", "poster_path"]]
    df = pd.concat([df, temp_df], ignore_index=True)



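One variation on this loop, sketched under assumptions: collect the rows in a plain list, skip pages whose payload has no `results` key, and build the DataFrame once at the end. `fetch_page` below is a hypothetical stand-in that returns fake data so the sketch runs without network access; in the notebook it would wrap `requests.get(...).json()`.

```python
def fetch_page(page):
    """Hypothetical stand-in for the real API call; returns fake payloads."""
    if page == 2:
        # Simulate an error payload that lacks the "results" key.
        return {"status_message": "simulated failure"}
    return {
        "results": [
            {"id": page * 10 + k, "title": f"Movie {page}-{k}"} for k in range(3)
        ]
    }


def collect(pages):
    """Gather result rows across pages and remember which pages failed."""
    rows, failed = [], []
    for page in pages:
        results = fetch_page(page).get("results")
        if results is None:
            failed.append(page)
            continue
        rows.extend(results)
    return rows, failed


rows, failed = collect(range(1, 4))
print(len(rows), failed)  # 6 [2]
```

Passing `rows` to `pd.DataFrame(rows)` afterwards would give one frame without concatenating on every iteration, and `failed` lists the pages worth retrying.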

### 7) Copying the DataFrame
A copy of the DataFrame is created to preserve the original data while performing preprocessing.


In [None]:
new_df=df.copy()

### 8) Mapping Genre IDs to Names
Here we convert genre IDs to their corresponding genre names (e.g., 18 → Drama).


In [None]:
genre_mapping = {
    28: "Action", 12: "Adventure", 16: "Animation", 35: "Comedy", 80: "Crime",
    99: "Documentary", 18: "Drama", 10751: "Family", 14: "Fantasy", 36: "History",
    27: "Horror", 10402: "Music", 9648: "Mystery", 10749: "Romance",
    878: "Science Fiction", 10770: "TV Movie", 53: "Thriller", 10752: "War", 37: "Western"
}
# Skip IDs missing from the mapping so no None values end up in the list
new_df['genre_ids'] = new_df['genre_ids'].apply(lambda ids: [genre_mapping[g] for g in ids if g in genre_mapping])
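A small sketch of how the lookup behaves when an ID is missing from the dictionary: `dict.get` returns `None`, while a filtered comprehension skips the ID. The ID `99999` is made up for the example, and the dictionary is a subset of the one above.

```python
genre_mapping = {18: "Drama", 80: "Crime", 27: "Horror"}  # subset of the full mapping

ids = [18, 80, 99999]  # 99999 is a made-up ID that is not in the mapping

# dict.get returns None for unknown keys.
with_get = [genre_mapping.get(g) for g in ids]
print(with_get)  # ['Drama', 'Crime', None]

# Filtering keeps only IDs that have a name.
filtered = [genre_mapping[g] for g in ids if g in genre_mapping]
print(filtered)  # ['Drama', 'Crime']

# Joining only works when every element is a string.
print(" ".join(filtered))  # Drama Crime
```

This matters later in the pipeline because the genre names are joined into a single string, and `" ".join` raises a `TypeError` on a `None` entry.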



### 9) Dropping Poster Path Column
The `poster_path` column is dropped from the new DataFrame as it’s not needed for text-based similarity calculations.


In [None]:
new_df.drop(columns="poster_path",inplace=True)

### 10) Splitting Overview into Words
The `overview` column is split into individual words to prepare for text processing.


In [None]:
new_df["overview"]=new_df["overview"].apply(lambda x:x.split())

### 11) Combining Overview and Genres
A new "tags" column is created by combining the `overview` and `genre_ids` lists to form a single text feature.


In [None]:
new_df["tags"]=new_df['overview']+new_df['genre_ids']

### 12) Dropping Unused Columns
The `overview` and `genre_ids` columns are dropped, since their content now lives in `tags`.


In [None]:
new_df.drop(columns=["overview","genre_ids"],inplace=True)

### 13) Joining Tags into Strings
The `tags` column is converted from a list to a single string, with elements separated by spaces.


In [None]:
new_df['tags']=new_df['tags'].apply(lambda x:" ".join(x))

In [None]:
new_df.head()

Unnamed: 0,id,title,tags
0,278,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...
1,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o..."
2,240,The Godfather Part II,In the continuing saga of the Corleone crime f...
3,424,Schindler's List,The true story of how businessman Oskar Schind...
4,389,12 Angry Men,The defense and the prosecution have rested an...
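To make steps 10 through 13 concrete, here is a minimal sketch of the same transformation on a single movie using plain Python values. The overview is a shortened piece of the Shawshank text from the API response above.

```python
overview = "Imprisoned in the 1940s for the double murder of his wife"
genres = ["Drama", "Crime"]

words = overview.split()      # step 10: split the overview into words
tags_list = words + genres    # step 11: list concatenation appends the genre names
tags = " ".join(tags_list)    # step 13: back to one space-separated string

print(tags)
```

Because both columns hold lists at that point, `+` concatenates them element by element, which is why the genre names end up at the tail of each `tags` string.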


### 14) Downloading NLTK WordNet
The WordNet resource from NLTK is downloaded for lemmatization.

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

### 15) Applying Lemmatization
A lemmatization function is defined using NLTK’s `WordNetLemmatizer` to normalize words in the `tags` column. Because `lemmatize` is called without a part-of-speech argument, each word is treated as a noun (e.g., "skills" → "skill").

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemma(text):
    y = []
    for i in text.split():
        y.append(lemmatizer.lemmatize(i))  # Lemmatizes assuming words are nouns
    return " ".join(y)

new_df['tags'] = new_df['tags'].apply(lemma)


In [None]:
new_df['tags'][0]

'Imprisoned in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begin a new life at the Shawshank prison, where he put his accounting skill to work for an amoral warden. During his long stretch in prison, Dufresne come to be admired by the other inmate -- including an older prisoner named Red -- for his integrity and unquenchable sense of hope. Drama Crime'

### 16) Importing CountVectorizer
The `CountVectorizer` from scikit-learn is imported for feature extraction and configured to keep the 10,000 most frequent terms while dropping English stop words.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv=CountVectorizer(max_features=10000,stop_words="english")

### 17) Transforming Tags to Vectors
The `CountVectorizer` is fitted on the `tags` column and the text is transformed into a dense NumPy array of word counts in a single `fit_transform` call.


In [None]:
vectors= cv.fit_transform(new_df["tags"]).toarray()
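To show the idea behind count vectors, here is a simplified bag-of-words sketch using only the standard library. It is a stand-in for illustration, not scikit-learn's implementation: the whitespace tokenizer, the tiny stop-word set, and the sample documents are all assumptions.

```python
from collections import Counter

docs = ["the prison banker of prison hope", "a crime family crime saga"]
stop_words = {"the", "a", "of"}  # tiny illustrative list, not scikit-learn's

# Lowercase, split on whitespace, and drop stop words.
tokenized = [[w for w in d.lower().split() if w not in stop_words] for d in docs]

# The vocabulary is every distinct remaining word, in sorted order.
vocabulary = sorted({w for doc in tokenized for w in doc})

# Each document becomes one row of counts, one column per vocabulary term.
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['banker', 'crime', 'family', 'hope', 'prison', 'saga']
print(vectors)     # [[1, 0, 0, 1, 2, 0], [0, 2, 1, 0, 0, 1]]
```

Each row has the same length as the vocabulary, which is what lets two movies be compared position by position in the next step.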

### 18) Computing Cosine Similarity
The `cosine_similarity` function is used to calculate similarity scores between movie vectors.


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vectors)

In [None]:
similarity

array([[1.        , 0.13041013, 0.11595421, ..., 0.        , 0.        ,
        0.02742042],
       [0.13041013, 1.        , 0.50279332, ..., 0.        , 0.        ,
        0.        ],
       [0.11595421, 0.50279332, 1.        , ..., 0.        , 0.        ,
        0.02782074],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.10012523,
        0.03721615],
       [0.        , 0.        , 0.        , ..., 0.10012523, 1.        ,
        0.        ],
       [0.02742042, 0.        , 0.02782074, ..., 0.03721615, 0.        ,
        1.        ]])
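For intuition about the numbers in that matrix, here is a worked sketch of cosine similarity in plain Python: the dot product of two vectors divided by the product of their lengths. The helper `cosine` and the toy vectors are made up for the example, and returning `0.0` for a zero vector is a convention chosen here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


a = [3, 4, 0]
b = [4, 3, 0]
c = [0, 0, 5]

print(cosine(a, a))  # 1.0  (identical direction)
print(cosine(a, b))  # 0.96 (24 / (5 * 5))
print(cosine(a, c))  # 0.0  (no shared non-zero positions)
```

This mirrors the pattern in the output above: the diagonal is 1 because every movie is compared with itself, and movies with no vocabulary terms in common score 0.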

### 19) Defining Recommendation Function
This recommendation function looks up a movie by its exact title and prints the five titles with the highest cosine similarity, skipping the first entry in the ranking (the movie itself).


In [None]:
def recommend(movie):
    index = new_df[new_df['title'] == movie].index[0]
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])
    for i in distances[1:6]:
        print(new_df.iloc[i[0]].title)

In [None]:
recommend('A Passage to India')

Life of Pi
Gandhi
Lagaan: Once Upon a Time in India
Victoria & Abdul
Dilwale Dulhania Le Jayenge
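A variation worth considering, sketched on toy data: return the titles as a list instead of printing them, and return an empty list when the title is not found (the lookup with `.index[0]` raises an `IndexError` in that case). The function name `top_similar`, the titles, and the similarity values are all made up for the example.

```python
def top_similar(titles, similarity, movie, k=5):
    """Return up to k titles most similar to `movie`, or [] if it is unknown."""
    if movie not in titles:
        return []
    index = titles.index(movie)
    # Pair each column position with its score and sort by score, highest first.
    scored = sorted(enumerate(similarity[index]), key=lambda pair: pair[1], reverse=True)
    # Exclude the query movie by position rather than assuming it sorts first.
    return [titles[i] for i, _ in scored if i != index][:k]


titles = ["A", "B", "C", "D"]
similarity = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]

print(top_similar(titles, similarity, "A", k=2))  # ['C', 'D']
print(top_similar(titles, similarity, "Z"))       # []
```

Returning a list makes the function easier to reuse from a web app or a test, since the caller decides how to display the results.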


### 20) Saving Data with Pickle
The preprocessed DataFrame (`new_df`), the similarity matrix, and the original `df` (which still holds `poster_path`) are saved as pickle files for future use.

In [None]:
import pickle

In [None]:
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(new_df, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)

In [None]:
with open('photo.pkl', 'wb') as f:
    pickle.dump(df, f)
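For completeness, a minimal sketch of the matching load step, shown as a round trip on a toy object written to a temporary directory. The payload and file name are made up for the example; the notebook's files would be read the same way with `pickle.load`.

```python
import os
import pickle
import tempfile

# Toy payload standing in for the DataFrame and similarity matrix.
payload = {"titles": ["The Godfather", "Gandhi"], "scores": [[1.0, 0.5], [0.5, 1.0]]}

path = os.path.join(tempfile.mkdtemp(), "example.pkl")

# Write in binary mode; the context manager closes the file afterwards.
with open(path, "wb") as f:
    pickle.dump(payload, f)

# Read it back, again in binary mode.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == payload)  # True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.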

### 21) Updating Poster Paths
The `poster_path` column in `df` is updated by prepending the TMDB image base URL, so each entry becomes a full poster URL.


In [None]:
df['poster_path']=df['poster_path'].apply(lambda x: "https://image.tmdb.org/t/p/original" + str(x))
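One edge case to keep in mind: if a movie has no poster, `str(None)` turns the missing value into the text `None` and the URL ends in `originalNone`. Below is a minimal sketch of a helper that returns `None` instead; the name `poster_url` is hypothetical.

```python
IMAGE_BASE = "https://image.tmdb.org/t/p/original"


def poster_url(path):
    """Build a full poster URL, or return None when the path is missing."""
    if not isinstance(path, str) or not path:
        return None
    return IMAGE_BASE + path


print(poster_url("/iLgRu4hhSr6V1uManX6ukDriiSc.jpg"))
print(poster_url(None))  # None
```

A helper like this could be passed to `apply` in place of the lambda, leaving missing posters as empty values that are easy to filter out.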

### 22) Defining Enhanced Recommendation Function
This updated recommendation function outputs both the movie titles and their corresponding poster URLs for the top 5 similar movies.


In [None]:
def recommend(movie):
    index = new_df[new_df['title'] == movie].index[0]
    distances = sorted(list(enumerate(similarity[index])),reverse=True,key = lambda x: x[1])
    for i in distances[1:6]:
        print(new_df.iloc[i[0]].title)
        print(df['poster_path'][i[0]])

The enhanced recommendation function is tested with *A Passage to India*, displaying recommended titles and poster URLs.


In [None]:
recommend('A Passage to India')

Life of Pi
https://image.tmdb.org/t/p/original/iLgRu4hhSr6V1uManX6ukDriiSc.jpg
Gandhi
https://image.tmdb.org/t/p/original/rOXftt7SluxskrFrvU7qFJa5zeN.jpg
Lagaan: Once Upon a Time in India
https://image.tmdb.org/t/p/original/yNX9lFRAFeNLNRIXdqZK9gYrYKa.jpg
Victoria & Abdul
https://image.tmdb.org/t/p/original/uIzQ8zZ0rqjqqJUIpeeovtTryAa.jpg
Dilwale Dulhania Le Jayenge
https://image.tmdb.org/t/p/original/2CAL2433ZeIihfX1Hb2139CX0pW.jpg


🔍 **Example Output**: For *A Passage to India*, the system recommends movies like *Life of Pi* and *Gandhi*, along with their poster URLs for a visual touch.

This project showcases how NLP and machine learning can create engaging movie recommendations. The code is available on [my GitHub repository](https://github.com/snehangshu2002/Movie-Recommendation)—check it out and share your feedback!
#DataScience #Python #MachineLearning #RecommendationSystems #Movies #TMDB #NLP