# Movie Recommendation System


## Import Necessary Libraries and Load Datasets
In this cell, we import the libraries we need for our project. 
- `pandas` is used for data manipulation and analysis.
- `json` helps us work with JSON data.
- `numpy` is used for numerical operations.
- `TfidfVectorizer` from `sklearn` helps us convert text data into numerical format.

We then load two datasets: one containing movie credits and another containing movie details. 
Finally, we display the first few rows of the credits dataset to see what it looks like.

In [None]:
# Import necessary libraries and load datasets
import pandas as pd
import json
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
credits = pd.read_csv("datasets/tmdb_5000_credits.csv")
movies = pd.read_csv("datasets/tmdb_5000_movies.csv")
credits.head()

## Print the Shapes of the Credits and Movies DataFrames
Here, we print the dimensions (number of rows and columns) of the credits and movies datasets. 
This helps us understand how much data we are working with.

In [None]:
# Print the shapes of the credits and movies dataframes
print("Credits:", credits.shape)
print("Movies Dataframe:", movies.shape)

## Merge Credits and Movies DataFrames on 'id' and Rename 'movie_id' to 'id'
In this cell, we combine the credits and movies datasets into one. 
We rename the 'movie_id' column in the credits dataset to 'id' so that we can merge them based on this common column. 
After merging, we display the first few rows of the combined dataset to check the result.
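
Before relying on the merged table, it can help to confirm that the keys line up. The sketch below uses two small made-up frames (`toy_movies` and `toy_credits` are illustrative names, not part of the dataset) and pandas' `validate` and `indicator` options to surface duplicate or unmatched ids. It is one possible sanity check, not a step the notebook performs.

```python
import pandas as pd

# Toy stand-ins for the two TMDB tables (values are made up for illustration)
toy_movies = pd.DataFrame({"id": [1, 2, 3], "original_title": ["A", "B", "C"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 4], "cast": ["[]", "[]", "[]"]})

# Rename the key so both frames share the 'id' column
toy_credits = toy_credits.rename(columns={"movie_id": "id"})

# validate="one_to_one" raises MergeError if either side repeats an id;
# indicator=True adds a '_merge' column showing where each row came from
checked = toy_movies.merge(toy_credits, on="id", how="outer",
                           validate="one_to_one", indicator=True)

# Rows found in only one table; the default inner merge would drop them
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["id", "_merge"]])
```

An empty `unmatched` frame on the real data would suggest that the inner merge keeps every movie.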

In [None]:
# Rename 'movie_id' to 'id' in credits, then merge credits and movies on 'id'
credits_column_renamed = credits.rename(columns={"movie_id": "id"})
movies_merge = movies.merge(credits_column_renamed, on='id')
print(movies_merge.head())

## Clean the Merged Data by Selecting Relevant Columns and Processing JSON Fields
Here, we select only the columns we need from the merged dataset and copy 'id' into an additional 'movie_id' column. 
The 'cast', 'crew', and 'genres' columns are stored as JSON-encoded strings, so we parse each one into a Python list; missing (non-string) values become empty lists. 
Finally, we save the cleaned data to a JSON file and display the first few rows of the cleaned dataset.
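
As a rough illustration of what the parsing step produces, the sketch below parses one made-up 'cast' cell and keeps only the names. The helper names `parse_json_list` and `names_only` are hypothetical, and the code assumes each entry is an object with a `name` key. It also tolerates malformed JSON, which the notebook's lambda does not.

```python
import json

# A made-up 'cast' cell: a JSON-encoded list of objects
raw_cast = '[{"name": "Actor One", "order": 0}, {"name": "Actor Two", "order": 1}]'

def parse_json_list(cell):
    """Return the parsed list, or [] for missing or malformed cells."""
    if not isinstance(cell, str):
        return []  # NaN and other non-string values
    try:
        parsed = json.loads(cell)
    except json.JSONDecodeError:
        return []  # text that is not valid JSON
    return parsed if isinstance(parsed, list) else []

def names_only(entries, limit=3):
    """Keep the 'name' field of the first `limit` entries that have one."""
    return [e["name"] for e in entries[:limit] if "name" in e]

cast_names = names_only(parse_json_list(raw_cast))
print(cast_names)
```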

In [None]:
# Clean the merged data by selecting relevant columns and processing JSON fields
columns_to_select = ['id', 'original_title', 'overview', 'cast', 'crew', 'release_date', 'genres', 'vote_average', 'runtime', 'original_language']
movies_cleaned = movies_merge[columns_to_select].copy()
movies_cleaned['movie_id'] = movies_cleaned['id']
movies_cleaned['cast'] = movies_cleaned['cast'].apply(lambda x: json.loads(x) if isinstance(x, str) else [])
movies_cleaned['crew'] = movies_cleaned['crew'].apply(lambda x: json.loads(x) if isinstance(x, str) else [])
movies_cleaned['genres'] = movies_cleaned['genres'].apply(lambda x: json.loads(x) if isinstance(x, str) else [])
movies_cleaned.to_json('src/data/movies_cleaned.json', orient='records')
print("Cleaned data saved to src/data/movies_cleaned.json")
print(movies_cleaned.head())

## Load Dataset and Prepare TF-IDF Vectorizer
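In this cell, we reload the original movies CSV (`datasets/tmdb_5000_movies.csv`) rather than the cleaned JSON file saved in the previous step. 
We check that the 'overview' column exists, as it is essential for our analysis, and fill any missing overviews with an empty string. 
Then, we set up a TF-IDF vectorizer that converts the overview text into numerical features: it removes English stop words, builds terms from one to three consecutive words, and ignores terms that appear in fewer than three overviews. 
Fitting it on the 'overview' column produces a sparse matrix with one row per movie and one column per term.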
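
To make the vectorizer's output more concrete, here is a minimal sketch on three made-up overviews. The corpus and the names `toy_overviews`, `toy_tfv`, and `toy_matrix` are illustrative. The settings are simplified (no `min_df`, bigrams only) so that such a tiny corpus still yields a vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up overviews, not taken from the dataset
toy_overviews = [
    "a space crew travels through a wormhole",
    "a crew of astronauts is stranded in space",
    "a chef opens a restaurant in paris",
]

# Unigrams and bigrams with English stop words removed
toy_tfv = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
toy_matrix = toy_tfv.fit_transform(toy_overviews)

# One row per overview, one column per term kept in the vocabulary
print(toy_matrix.shape)
# vocabulary_ maps each term to its column index
print(sorted(toy_tfv.vocabulary_)[:5])
```

Terms shared by the first two overviews, such as "space" and "crew", give those rows overlapping non-zero columns, which is what the similarity step later picks up.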

In [None]:
# Load dataset and prepare TF-IDF Vectorizer
movies_cleaned_df = pd.read_csv('datasets/tmdb_5000_movies.csv')

if 'overview' not in movies_cleaned_df.columns:
    raise ValueError("'overview' column not found in the dataset!")

movies_cleaned_df['overview'] = movies_cleaned_df['overview'].fillna('')
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 3),
                      stop_words='english')

tfv_matrix = tfv.fit_transform(movies_cleaned_df['overview'])
print(tfv_matrix)
print(tfv_matrix.shape)

## Compute the Sigmoid Kernel for the TF-IDF Matrix and Save the Similarity Matrix
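Here, we calculate the similarity between movies based on their overviews using the sigmoid kernel, which scores a pair of TF-IDF vectors x and y as tanh(gamma * <x, y> + coef0). We keep scikit-learn's default values for `gamma` and `coef0`. 
The result is a square similarity matrix with one row and one column per movie, so entry (i, j) tells us how similar movie i is to movie j. 
We then save this similarity matrix to a JSON file for later use.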
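
The sketch below checks, on three made-up vectors, that `sigmoid_kernel` with its defaults matches tanh(gamma * <x, y> + coef0) with gamma = 1 / n_features and coef0 = 1. The array `X` is an illustrative stand-in for TF-IDF rows, not real data.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Three made-up, L2-normalised row vectors standing in for TF-IDF rows
X = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Defaults: gamma = 1 / n_features, coef0 = 1
K = sigmoid_kernel(X, X)

# The same values computed by hand from the formula
gamma = 1.0 / X.shape[1]
K_manual = np.tanh(gamma * (X @ X.T) + 1.0)

print(np.allclose(K, K_manual))
print(K.round(3))
```

Because tanh is increasing and gamma is positive, ranking by these scores gives the same order as ranking by the dot product, which on L2-normalised rows is the cosine similarity. With a large vocabulary the default gamma is very small, so the raw scores are likely to sit close together even when the ranking is meaningful.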

In [None]:
# Compute the sigmoid kernel for the TF-IDF matrix and save the similarity matrix
from sklearn.metrics.pairwise import sigmoid_kernel
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
with open('data/similarity_matrix.json', 'w') as f:
    json.dump(sig.tolist(), f)
print("Similarity matrix shape:", sig.shape)
print("Similarity matrix saved to data folder")

## Ensure Unique Movie Titles and Create a Mapping of Titles to Indices
In this cell, we drop rows with duplicate titles from the cleaned dataset, keeping the first occurrence, so that each title is unique. 
We create a mapping that links each movie title to its row index in `movies_cleaned`. 
This mapping lets us look up a movie's row in the similarity matrix when we want recommendations, which relies on `movies_cleaned` and the TF-IDF input having the same row order. 
Finally, we save this mapping to a JSON file.
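
A small made-up example shows what the mapping looks like when a title repeats. The frame `toy` and the series `toy_indices` are illustrative names.

```python
import pandas as pd

# Made-up titles with one duplicate
toy = pd.DataFrame({"original_title": ["Alpha", "Beta", "Alpha", "Gamma"]})

# Keep the first row for each title; the surviving index labels are the
# original row positions because `toy` has a default RangeIndex
toy_unique = toy.drop_duplicates(subset="original_title")
toy_indices = pd.Series(toy_unique.index, index=toy_unique["original_title"])

print(toy_indices.to_dict())  # {'Alpha': 0, 'Beta': 1, 'Gamma': 3}
```

The second "Alpha" (row 2) has no entry of its own, so a lookup by title always resolves to the first occurrence.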

In [None]:
# Ensure unique movie titles and create a mapping of titles to indices
movies_cleaned_unique = movies_cleaned.drop_duplicates(subset='original_title')
indices = pd.Series(movies_cleaned_unique.index, index=movies_cleaned_unique['original_title']).drop_duplicates()
indices.to_json('data/indices.json', orient='index')
print("Indices mapping saved to data folder")
print(indices.head())

## Define a Function to Give Movie Recommendations Based on Title
Here, we define a function called `give_recomendations` that takes a movie title as input and returns a DataFrame of recommended movies. 
If the title is not found in our mapping, it prints a message and returns `None`. 
Otherwise, it sorts the similarity scores for the input movie and returns the rows of the 10 most similar movies, excluding the input movie itself.
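
The ranking step can also be written with NumPy. The sketch below is one possible variant, not the notebook's function. The helper `top_k_similar` and the matrix `toy_sim` are hypothetical, and the helper excludes the query movie by position rather than assuming it sorts first.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10):
    """Return positions of the k highest-scoring entries in row `idx`, excluding idx."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                       # never recommend the query itself
    k = min(k, len(scores) - 1)                 # cannot return more than the other rows
    order = np.argsort(-scores, kind="stable")  # descending; ties keep original order
    return order[:k].tolist()

# Made-up symmetric similarity matrix for four movies
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
print(top_k_similar(toy_sim, idx=0, k=2))  # [2, 3]
```

Excluding by position matters when several movies tie with the query's own score, for example when overviews are empty.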

In [None]:
# Define a function to give movie recommendations based on title
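def give_recomendations(title, sig=sig):
    # Return None when the title is missing from the title-to-index mapping
    if title not in indices.index:
        print(f"Title '{title}' not found in indices.")
        return None
    idx = indices[title]
    # Pair every movie position with its similarity to the input movie
    sig_scores = list(enumerate(sig[idx]))
    # Drop the input movie explicitly instead of assuming it sorts first
    sig_scores = [s for s in sig_scores if s[0] != idx]
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)
    sig_scores = sig_scores[:10]  # Keep the 10 highest-scoring movies
    movie_indices = [i[0] for i in sig_scores]
    return movies_cleaned.iloc[movie_indices]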

## Finally Get Recommendations for a Specific Movie and Print Relevant Details
In this cell, we use the function we defined earlier to get recommendations for the movie 'Interstellar'. 
We then print out the original title, overview, cast, and crew of the recommended movies to see the details.

In [None]:
# Get recommendations for a specific movie and print relevant details
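recommended_movies = give_recomendations('Interstellar')  # Input your movie title here
# give_recomendations returns None when the title is not in the mapping
if recommended_movies is not None:
    print(recommended_movies[['original_title', 'overview', 'cast', 'crew']])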

## Conclusion: How the Movie Recommendation Model Works

In this notebook, we have built a movie recommendation system that suggests similar movies based on their overviews. Here's a brief overview of how the model works:

1. **Data Preparation**: We start by importing necessary libraries and loading two datasets: one containing movie credits and another with movie details. We then clean and process this data to ensure it's in a usable format.

2. **Feature Extraction**: The key feature for our recommendations is the 'overview' of each movie. We use a technique called TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text in the 'overview' column into numerical vectors. This allows us to quantify the content of each movie.

3. **Similarity Calculation**: Using the TF-IDF matrix, we calculate the similarity between movies based on their overviews. We employ a method called the sigmoid kernel, which helps us determine how closely related two movies are based on their textual descriptions.

4. **Recommendation Generation**: When a user inputs a movie title, our model looks up its index in the title-to-index mapping. It then retrieves the similarity scores for that movie and sorts them to find the top 10 most similar movies, excluding the input movie itself.

5. **Output**: Finally, the model returns the rows for the recommended movies, from which we print details such as the original title, overview, cast, and crew.

By following these steps, our recommendation system suggests movies whose overviews use similar wording to the one the user is interested in. It relies only on overview text and does not use ratings or viewing history, and this notebook does not measure the quality of its recommendations.