# 🎬 MovieMatch AI: Data Analysis & Recommendation Engine
### Developed by: Vineeth Tangedudona
### Project Status: ✅ Completed & Deployed

## 📝 Project Overview
### This notebook contains the full data pipeline for the MovieMatch AI system. Starting from a raw TMDB movie dataset, I filtered out adult titles and entries without an overview, kept the 25,000 most popular movies, and built a content-based recommendation engine on top of them. The goal was to turn messy metadata into TF-IDF vectors that can be compared with cosine similarity.

## 🛠️ What’s inside this Notebook?
### Feature Engineering: Creating a "Bag of Words" (Tags) for every movie.

### NLP Pipeline: Implementing TF-IDF Vectorization to turn text into numerical data.

### Similarity Modeling: Calculating Cosine Similarity to find the closest matches for any given film.
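
As a rough illustration of how the three steps fit together, here is a minimal sketch on a toy corpus. The `toy_tags` strings and the `toy_*` names are made up for this example and are not rows from the TMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy "tags", one string per movie (not taken from the dataset).
toy_tags = [
    "superhero action alien technology",
    "superhero action crime detective",
    "romance comedy wedding",
]

# Turn each tag string into a TF-IDF vector (sparse matrix, one row per movie).
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_tags)

# Cosine similarity of the first movie against every movie, including itself.
toy_scores = cosine_similarity(toy_matrix[0], toy_matrix).flatten()
print(toy_scores.round(3))
```

The first movie scores 1.0 against itself, a positive value against the second movie (they share "superhero" and "action"), and 0 against the third, which shares no terms with it.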

## 🚀 Interactive Live Demo
### I have deployed this model as a live web application!

### 👉 [Click here to test the Recommender App](https://vineeth-movie-recommender.streamlit.app/)

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/tmdb-movies-dataset-2023-930k-movies/TMDB_movie_dataset_v11.csv


In [2]:
import pandas as pd
import numpy as np 
import nltk
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv('/kaggle/input/tmdb-movies-dataset-2023-930k-movies/TMDB_movie_dataset_v11.csv')

# Filtering: keep non-adult movies that have a description
df = df[~df['adult']]
df = df.dropna(subset=['overview'])

# Keeping top 25,000 popular movies for memory efficiency and quality
df = df.sort_values(by='popularity', ascending=False).head(25000).reset_index(drop=True)

df.head(4)

,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,565770,Blue Beetle,7.139,1023,Released,2023-08-16,124818235,128,False,/1syW9SNna38rSl9fnXwc9fP7POW.jpg,...,Blue Beetle,Recent college grad Jaime Reyes returns home f...,2994.357,/mXLOHHc1Zeuwsl4xYKjKh2280oL.jpg,Jaime Reyes is a superhero whether he likes it...,"Action, Science Fiction, Adventure","Warner Bros. Pictures, The Safran Company, DC ...",United States of America,"English, Portuguese, Spanish","armor, superhero, family relationships, family..."
1,980489,Gran Turismo,8.068,702,Released,2023-08-09,114800000,135,False,/xFYpUmB01nswPgbzi8EOCT1ZYFu.jpg,...,Gran Turismo,The ultimate wish-fulfillment tale of a teenag...,2680.593,/51tqzRtKMMZEYUpSYkrUE7v9ehm.jpg,From gamer to racer.,"Action, Drama, Adventure","PlayStation Productions, 2.0 Entertainment, Co...",United States of America,"English, German, Japanese","based on true story, racing, based on video ga..."
2,968051,The Nun II,6.545,365,Released,2023-09-06,231200000,110,False,/53z2fXEKfnNg2uSOPss2unPBGX1.jpg,...,The Nun II,"In 1956 France, a priest is violently murdered...",1692.778,/c9kVD7W8CT5xe4O3hQ7bFWwk68U.jpg,Confess your sins.,"Horror, Mystery, Thriller","New Line Cinema, Atomic Monster, The Safran Co...",United States of America,"English, French","france, bullying, sequel, religion, demon, got..."
3,615656,Meg 2: The Trench,6.912,2034,Released,2023-08-02,384056482,116,False,/5mzr6JZbrqnqD8rCEvPhuCE5Fw2.jpg,...,Meg 2: The Trench,An exploratory dive into the deepest depths of...,1567.273,/4m1Au3YkjqsxF8iwQy0fPYSxE0h.jpg,Back for seconds.,"Action, Science Fiction, Horror","Apelles Entertainment, Warner Bros. Pictures, ...","China, United States of America",English,"based on novel or book, sequel, shark, kaiju, ..."


In [4]:
df.shape
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    25000 non-null  int64  
 1   title                 25000 non-null  object 
 2   vote_average          25000 non-null  float64
 3   vote_count            25000 non-null  int64  
 4   status                25000 non-null  object 
 5   release_date          24971 non-null  object 
 6   revenue               25000 non-null  int64  
 7   runtime               25000 non-null  int64  
 8   adult                 25000 non-null  bool   
 9   backdrop_path         24226 non-null  object 
 10  budget                25000 non-null  int64  
 11  homepage              7461 non-null   object 
 12  imdb_id               24335 non-null  object 
 13  original_language     25000 non-null  object 
 14  original_title        25000 non-null  object 
 15  overview           

id                          0
title                       0
vote_average                0
vote_count                  0
status                      0
release_date               29
revenue                     0
runtime                     0
adult                       0
backdrop_path             774
budget                      0
homepage                17539
imdb_id                   665
original_language           0
original_title              0
overview                    0
popularity                  0
poster_path               155
tagline                  8333
genres                    135
production_companies     1341
production_countries      482
spoken_languages          194
keywords                 3466
dtype: int64

In [5]:
# Create a 'tags' column by combining the relevant text features.
# NaNs are filled with empty strings so the literal text 'nan' never ends up in the tags.
# Note: this also lowercases the original columns in place.
df['genres'] = df['genres'].fillna('').str.lower()
df['keywords'] = df['keywords'].fillna('').str.lower()
df['overview'] = df['overview'].fillna('').str.lower()

# Combine the text in this order: genres + keywords + overview (plot)
df['tags'] = df['genres'] + " " + df['keywords'] + " " + df['overview']

# Cleaning: collapse repeated whitespace and strip the ends
# (the inputs are already lowercase)
df['tags'] = df['tags'].str.replace(r'\s+', ' ', regex=True).str.strip()

print("Sample Tags for Movie 1:")
print(df['tags'].iloc[0] + "....")

Sample Tags for Movie 1:
action, science fiction, adventure armor, superhero, family relationships, family, high tech, job hunting, mexican american, aftercreditsstinger, duringcreditsstinger, immigrant family, college graduate, dc extended universe (dceu), alien technology, brother sister relationship, latino recent college grad jaime reyes returns home full of aspirations for his future, only to find that home is not quite as he left it. as he searches to find his purpose in the world, fate intervenes when jaime unexpectedly finds himself in possession of an ancient relic of alien biotechnology: the scarab.....


In [6]:
# Download the necessary dictionary for lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

def clean_and_lemmatize(text):
    # 1. Remove non-alphabetic characters (keep only letters)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # 2. Tokenize (split) and Lemmatize each word
    words = text.split()
    lemmed_words = [lemmatizer.lemmatize(w) for w in words]
    # 3. Join back into a single string
    return " ".join(lemmed_words)

df['tags'] = df['tags'].apply(clean_and_lemmatize)

df['tags'].head(4)

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


0    action science fiction adventure armor superhe...
1    action drama adventure based on true story rac...
2    horror mystery thriller france bullying sequel...
3    action science fiction horror based on novel o...
Name: tags, dtype: object
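
One side effect of step 1 in `clean_and_lemmatize` is worth seeing in isolation: deleting punctuation outright glues hyphenated words together. The sketch below uses the same regex pattern on a made-up sample string; `strip_non_letters` and `space_non_letters` are hypothetical helper names, and the second one is only a possible alternative, not what the notebook does.

```python
import re

def strip_non_letters(text):
    # Same pattern as step 1 of clean_and_lemmatize: delete non-letters.
    return re.sub(r'[^a-zA-Z\s]', '', text)

def space_non_letters(text):
    # Hypothetical alternative: replace non-letters with a space,
    # then normalize whitespace so hyphenated words stay separate tokens.
    return " ".join(re.sub(r'[^a-zA-Z\s]', ' ', text).split())

sample = "wish-fulfillment tale, dc extended universe (dceu)"
stripped = strip_non_letters(sample)
spaced = space_non_letters(sample)
print(stripped)  # wishfulfillment tale dc extended universe dceu
print(spaced)    # wish fulfillment tale dc extended universe dceu
```

Whether the merged token or the two separate tokens gives better recommendations depends on the data, so this is a trade-off to evaluate rather than a clear fix.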

In [7]:
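# Build TF-IDF vectors: drop English stop words and keep the 5,000 most frequent terms
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
# Sparse matrix with one row per movie and one column per term
tfidf_matrix = tfidf.fit_transform(df['tags'])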

In [8]:
def recommend(movie_title):
    # 1. Substring search (case-insensitive): look for titles that contain the user's text
    # (e.g., "Godfather" will find "The Godfather"). regex=False treats the input literally.
    matches = df[df['title'].str.contains(movie_title, case=False, na=False, regex=False)]

    # 2. If no match is found, return a message instead of a DataFrame
    if matches.empty:
        return "⚠️ Movie not found. Please try a different or more specific name!"

    # 3. Pick the most popular movie among the matching titles
    # (after reset_index, the index label equals the row position)
    idx = matches.sort_values(by='popularity', ascending=False).index[0]

    # 4. Math: calculate similarity for just this one movie (memory efficient)
    target_vector = tfidf_matrix[idx]
    scores = cosine_similarity(target_vector, tfidf_matrix).flatten()

    # 5. Get the top 5 most similar movies, explicitly excluding the query movie
    ranked = np.argsort(scores)[::-1]
    top_indices = [i for i in ranked if i != idx][:5]

    # Display the actual title found
    print(f"--- Recommendations for: {df.loc[idx, 'title']} ---")
    return df.iloc[top_indices][['title', 'genres', 'vote_average']]


recommend("batman")

--- Recommendations for: The Batman ---


,title,genres,vote_average
19348,The Batman - Part II,"drama, crime, action",0.0
7143,"Batman: The Long Halloween, Part One","animation, mystery, action, crime",7.504
2680,Batman: Under the Red Hood,"science fiction, crime, action, animation, mys...",7.75
6208,"Batman: The Long Halloween, Part Two","animation, mystery, action, crime",7.459
20952,The Awakener,"thriller, action, crime, drama",6.266
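
The "memory efficient" comment in step 4 can be backed by some quick arithmetic. The sketch below assumes dense `float64` scores and compares a full 25,000 × 25,000 similarity matrix with a single row. It then checks on a small random matrix (made up for this example) that the single-row call gives the same values as the matching row of the full matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Back-of-the-envelope memory cost, assuming dense float64 scores.
n_movies = 25_000
bytes_per_float64 = 8
full_matrix_bytes = n_movies * n_movies * bytes_per_float64
single_row_bytes = n_movies * bytes_per_float64
print(f"full matrix: {full_matrix_bytes / 1e9:.1f} GB")  # full matrix: 5.0 GB
print(f"single row:  {single_row_bytes / 1e3:.0f} KB")   # single row:  200 KB

# Sanity check on a tiny random sparse matrix (stand-in for tfidf_matrix).
rng = np.random.default_rng(0)
toy_matrix = sparse.csr_matrix(rng.random((6, 4)))
full_scores = cosine_similarity(toy_matrix)
row_scores = cosine_similarity(toy_matrix[2], toy_matrix).flatten()
print(np.allclose(full_scores[2], row_scores))  # True
```

Computing one row per query trades a small amount of work at request time for not having to hold or store the full matrix.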


In [9]:
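import pickle

# Save the lookup table used to display results
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(df[['title', 'genres', 'vote_average']], f)

# Save the TF-IDF matrix used to compute similarities
with open('tfidf_matrix.pkl', 'wb') as f:
    pickle.dump(tfidf_matrix, f)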
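
The notebook does not show how the deployed app reads these two files, so the following is only a sketch of one way the artifacts could be loaded and queried. It round-trips toy stand-ins through a temporary directory. The toy data and every variable name other than the two file names are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the real DataFrame and TF-IDF matrix (hypothetical values).
toy_movies = pd.DataFrame({'title': ['A', 'B', 'C'],
                           'genres': ['action', 'action, crime', 'romance'],
                           'vote_average': [7.0, 6.5, 5.9]})
toy_matrix = sparse.csr_matrix(np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    movies_path = os.path.join(tmp_dir, 'movie_list.pkl')
    matrix_path = os.path.join(tmp_dir, 'tfidf_matrix.pkl')

    # Write the artifacts the same way the cell above does.
    with open(movies_path, 'wb') as f:
        pickle.dump(toy_movies, f)
    with open(matrix_path, 'wb') as f:
        pickle.dump(toy_matrix, f)

    # Load them back, as a separate app process might.
    with open(movies_path, 'rb') as f:
        loaded_movies = pickle.load(f)
    with open(matrix_path, 'rb') as f:
        loaded_matrix = pickle.load(f)

# Rank every other movie by similarity to the movie in row 0.
query_idx = 0
scores = cosine_similarity(loaded_matrix[query_idx], loaded_matrix).flatten()
ranked = [i for i in np.argsort(scores)[::-1] if i != query_idx]
recommended_titles = loaded_movies.iloc[ranked]['title'].tolist()
print(recommended_titles)  # ['B', 'C']
```

Since the saved DataFrame keeps the popularity-sorted row order but not the `popularity` column, an app working from these files can take the first matching row to get the most popular match. Pickle files are also sensitive to library versions, so the loading environment should use pandas and SciPy versions compatible with the ones used here.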