# Movie Recommendation System


## Import Necessary Libraries and Load Datasets
In this cell, we import the libraries we need for our project. 
- `pandas` is used for data manipulation and analysis.
- `json` helps us work with JSON data.
- `numpy` is used for numerical operations.
- `TfidfVectorizer` from `sklearn` converts text into TF-IDF feature vectors.

We then load two datasets: one containing movie credits and another containing movie details. 
Finally, we display the first few rows of the credits dataset to see what it looks like.

In [6]:
# Import necessary libraries and load datasets
import pandas as pd
import json
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
credits = pd.read_csv("src/datasets/tmdb_5000_credits.csv")
movies = pd.read_csv("src/datasets/tmdb_5000_movies.csv")
credits.head()

| | movie_id | title | cast | crew |
|---|---|---|---|---|
| 0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 285 | Pirates of the Caribbean: At World's End | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
| 2 | 206647 | Spectre | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |
| 3 | 49026 | The Dark Knight Rises | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... |
| 4 | 49529 | John Carter | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |


## Print the Shapes of the Credits and Movies DataFrames
Here, we print the dimensions (number of rows and columns) of the credits and movies datasets. 
This helps us understand how much data we are working with.

In [7]:
# Print the shapes of the credits and movies dataframes
print("Credits:", credits.shape)
print("Movies Dataframe:", movies.shape)

Credits: (4803, 4)
Movies Dataframe: (4803, 20)


## Merge Credits and Movies DataFrames on 'id' and Rename 'movie_id' to 'id'
In this cell, we combine the credits and movies datasets into one. 
We rename the 'movie_id' column in the credits dataset to 'id' so that we can merge them based on this common column. 
After merging, we display the first few rows of the combined dataset to check the result.

In [8]:
# Merge credits and movies dataframes on 'id' and rename 'movie_id' to 'id'
credits_column_renamed = credits.rename(columns={"movie_id": "id"})
movies_merge = movies.merge(credits_column_renamed, on='id')
print(movies_merge.head())

      budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  300000000  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  245000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  250000000  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  260000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  [{"id": 1463, "name": "culture clash"}, {"id":...                en   
1  [{"id": 270, "name": "ocean"}, {"id": 726, "na...                en   
2  [{"id": 470, "nam
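
The rename-then-merge step can be sketched on two small made-up frames. The toy rows, the `toy_*` names, and the `validate` check are illustrative assumptions, not part of the original notebook. The sketch shows that a column present in both frames (here `title`) comes back with `_x`/`_y` suffixes, and that the merged rows follow the order of the left frame.

```python
import pandas as pd

# Toy stand-ins for the two TMDB files (values are made up for illustration).
toy_movies = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["A", "B", "C"], "budget": [10, 20, 30]}
)
toy_credits = pd.DataFrame(
    {"movie_id": [3, 1, 2], "title": ["C", "A", "B"], "cast": ["[]", "[]", "[]"]}
)

# Rename the key column so both frames share 'id', then merge on it.
toy_merged = toy_movies.merge(
    toy_credits.rename(columns={"movie_id": "id"}),
    on="id",
    validate="one_to_one",  # assumption: each id appears once per frame
)

print(toy_merged.columns.tolist())
print(toy_merged["id"].tolist())
```

Keeping the left frame's row order matters in this notebook, because recommendations are later looked up by row position.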

## Clean the Merged Data by Selecting Relevant Columns and Processing JSON Fields
Here, we select only the columns we need from the merged dataset. 
We also process the 'cast', 'crew', and 'genres' columns, which are in JSON format, to convert them into Python lists. 
Finally, we save the cleaned data to a JSON file and display the first few rows of the cleaned dataset.

In [11]:
# Clean the merged data by selecting relevant columns and processing JSON fields
columns_to_select = ['id', 'original_title', 'overview', 'cast', 'crew', 'release_date', 'genres', 'vote_average', 'runtime', 'original_language']
movies_cleaned = movies_merge[columns_to_select].copy()
movies_cleaned['movie_id'] = movies_cleaned['id']
movies_cleaned['cast'] = movies_cleaned['cast'].apply(lambda x: json.loads(x) if isinstance(x, str) else [])
movies_cleaned['crew'] = movies_cleaned['crew'].apply(lambda x: json.loads(x) if isinstance(x, str) else [])
movies_cleaned['genres'] = movies_cleaned['genres'].apply(lambda x: json.loads(x) if isinstance(x, str) else [])
movies_cleaned.to_json('src/data/movies_cleaned.json', orient='records')
print("Cleaned data saved to data folder")
print(movies_cleaned.head())

Cleaned data saved to data folder
       id                            original_title  \
0   19995                                    Avatar   
1     285  Pirates of the Caribbean: At World's End   
2  206647                                   Spectre   
3   49026                     The Dark Knight Rises   
4   49529                               John Carter   

                                            overview  \
0  In the 22nd century, a paraplegic Marine is di...   
1  Captain Barbossa, long believed to be dead, ha...   
2  A cryptic message from Bond’s past sends him o...   
3  Following the death of District Attorney Harve...   
4  John Carter is a war-weary, former military ca...   

                                                cast  \
0  [{'cast_id': 242, 'character': 'Jake Sully', '...   
1  [{'cast_id': 4, 'character': 'Captain Jack Spa...   
2  [{'cast_id': 1, 'character': 'James Bond', 'cr...   
3  [{'cast_id': 2, 'character': 'Bruce Wayne / Ba...   
4  [{'cast_id': 5,
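
As a minimal sketch of what the JSON parsing does to a single cell, the example below parses a shortened, illustrative cast string and pulls out the actor names. The helper name `parse_json_field`, the sample string, and the use of a `name` key are assumptions made for illustration.

```python
import json

# A shortened, illustrative string in the same shape as the 'cast' column.
raw_cast = (
    '[{"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington"}, '
    '{"cast_id": 3, "character": "Neytiri", "name": "Zoe Saldana"}]'
)

def parse_json_field(value):
    """Return the parsed list for a JSON string, or an empty list otherwise."""
    return json.loads(value) if isinstance(value, str) else []

cast = parse_json_field(raw_cast)
names = [member["name"] for member in cast]  # assumes each entry has a 'name' key

print(names)
print(parse_json_field(float("nan")))  # missing values fall back to []
```

The `isinstance` check is what keeps missing values (which pandas stores as `NaN` floats) from raising inside `json.loads`.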

## Load Dataset and Prepare TF-IDF Vectorizer
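In this cell, we reload the original movies file (`tmdb_5000_movies.csv`); note that this is the raw CSV, not the cleaned JSON file saved in the previous step. 
We check that the 'overview' column exists, as it is essential for our analysis, and fill any missing values in it with an empty string. 
Then, we set up a TF-IDF Vectorizer, which converts each overview into a numerical vector. It uses word n-grams of one to three words, removes English stop words, and ignores terms that appear in fewer than three overviews (`min_df=3`). 
Because recommendations are later looked up by row position in the merged data, this step relies on the merged DataFrame having the same row order as this CSV.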

In [13]:
# Load dataset and prepare TF-IDF Vectorizer
movies_cleaned_df = pd.read_csv('src/datasets/tmdb_5000_movies.csv')

if 'overview' not in movies_cleaned_df.columns:
    raise ValueError("'overview' column not found in the dataset!")

movies_cleaned_df['overview'] = movies_cleaned_df['overview'].fillna('')
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 3),
                      stop_words='english')

tfv_matrix = tfv.fit_transform(movies_cleaned_df['overview'])
print(tfv_matrix)
print(tfv_matrix.shape)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 127220 stored elements and shape (4803, 10417)>
  Coords	Values
  (0, 147)	0.30913114032500133
  (0, 1514)	0.20118856027389753
  (0, 5658)	0.2610479764815685
  (0, 2634)	0.2818968058308857
  (0, 5972)	0.27473323883894724
  (0, 6543)	0.2959108637414297
  (0, 9717)	0.2443539591791674
  (0, 5907)	0.17992707015426693
  (0, 9393)	0.24144219475319856
  (0, 3582)	0.2178825775762986
  (0, 6447)	0.25667782792456906
  (0, 7055)	0.26867652924731417
  (0, 431)	0.21084762228452017
  (0, 1670)	0.27815198887096043
  (0, 148)	0.30913114032500133
  (1, 1408)	0.25331222077109966
  (1, 5397)	0.21777873762971253
  (1, 973)	0.3197826358277298
  (1, 2318)	0.2189240483549797
  (1, 1806)	0.21046232712343155
  (1, 5263)	0.13330162173427934
  (1, 4205)	0.3080300379356447
  (1, 2872)	0.32324825936267076
  (1, 2848)	0.21556897928560056
  (1, 9609)	0.33551759750863897
  :	:
  (4802, 5303)	0.14458839172229582
  (4802, 3238)	0.16238922347130244
  (4802, 6

## Compute the Sigmoid Kernel for the TF-IDF Matrix and Save the Similarity Matrix
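Here, we calculate the similarity between movies based on their overviews using the sigmoid kernel, which scores a pair of TF-IDF vectors `x` and `y` as `tanh(gamma * (x · y) + coef0)`. With scikit-learn's defaults, `gamma` is 1 divided by the number of features and `coef0` is 1. 
This creates a 4803 × 4803 similarity matrix that tells us how similar each movie is to every other movie. 
We then save this similarity matrix to a JSON file for later use.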

In [15]:
# Compute the sigmoid kernel for the TF-IDF matrix and save the similarity matrix
from sklearn.metrics.pairwise import sigmoid_kernel
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
with open('src/data/similarity_matrix.json', 'w') as f:
    json.dump(sig.tolist(), f)
print("Similarity matrix shape:", sig.shape)
print("Similarity matrix saved to data folder")

Similarity matrix shape: (4803, 4803)
Similarity matrix saved to data folder
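
As a quick check of what `sigmoid_kernel` computes, the sketch below compares it with the formula `tanh(gamma * X·Xᵀ + coef0)` written out by hand, using scikit-learn's default `gamma = 1 / n_features` and `coef0 = 1`. The small matrix `X` is made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Small made-up feature matrix: 3 items, 4 features, rows of unit length.
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.6, 0.8, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

# scikit-learn defaults: gamma = 1 / n_features, coef0 = 1.
gamma = 1.0 / X.shape[1]
manual = np.tanh(gamma * (X @ X.T) + 1.0)

kernel = sigmoid_kernel(X, X)
print(np.allclose(kernel, manual))
```

Because `tanh` is monotonically increasing, ranking movies by this kernel gives the same order as ranking them by the plain dot product of their TF-IDF vectors; the kernel only rescales the scores.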


## Ensure Unique Movie Titles and Create a Mapping of Titles to Indices
In this cell, we remove any duplicate movie titles from the cleaned dataset to ensure each title is unique. 
We create a mapping that links each movie title to its index in the dataset. 
This mapping will help us quickly find a movie's index when we want to get recommendations. 
Finally, we save this mapping to a JSON file.

In [16]:
# Ensure unique movie titles and create a mapping of titles to indices
movies_cleaned_unique = movies_cleaned.drop_duplicates(subset='original_title')
indices = pd.Series(movies_cleaned_unique.index, index=movies_cleaned_unique['original_title'])
indices.to_json('src/data/indices.json', orient='index')
print("Indices mapping saved to data folder")
print(indices.head())

Indices mapping saved to data folder
original_title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64
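
The title-to-index mapping can be sketched on a toy frame with one repeated title. The rows and the `toy_*` names are made up for illustration.

```python
import pandas as pd

# Toy frame with a repeated title (made-up rows).
toy = pd.DataFrame({"original_title": ["Alpha", "Beta", "Alpha", "Gamma"]})

# Keep the first occurrence of each title, then map title -> original row label.
toy_unique = toy.drop_duplicates(subset="original_title")
toy_indices = pd.Series(toy_unique.index, index=toy_unique["original_title"])

print(toy_indices.to_dict())
```

Note that the surviving rows keep their original labels ("Gamma" still maps to 3, not 2). That is what keeps the mapping aligned with the rows of the similarity matrix, which was computed over all rows, duplicates included.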


## Define a Function to Give Movie Recommendations Based on Title
Here, we define a function called `give_recomendations` that takes a movie title as input and returns the rows of the cleaned dataset for the recommended movies. 
If the title is not found in our mapping, it prints a message and returns `None`. 
Otherwise, the function looks up the precomputed similarity scores for the input movie and returns the top 10 most similar movies, excluding the input movie itself.

In [17]:
# Define a function to give movie recommendations based on title
def give_recomendations(title, sig=sig):
    if title not in indices.index:
        print(f"Title '{title}' not found in indices.")
        return None
    idx = indices[title]
    # Pair each movie's position with its score, dropping the input movie itself
    sig_scores = [(i, score) for i, score in enumerate(sig[idx]) if i != idx]
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)
    sig_scores = sig_scores[:10]  # Keep the 10 highest-scoring movies
    movie_indices = [i[0] for i in sig_scores]
    return movies_cleaned.iloc[movie_indices]
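
The top-k selection at the heart of the function can also be sketched with NumPy alone. The helper name `top_k_similar` and the toy matrix are hypothetical; the sketch masks the input item explicitly rather than assuming it always sorts first, which would not hold for a movie with an empty overview, for example.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10):
    """Return positions of the k highest-scoring items for idx, excluding idx."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                       # never recommend the item itself
    order = np.argsort(-scores, kind="stable")  # highest score first
    return order[:k].tolist()

# Toy 4x4 similarity matrix (made up); item 0 is closest to 2, then 1, then 3.
toy_sim = np.array([[1.0, 0.3, 0.9, 0.1],
                    [0.3, 1.0, 0.2, 0.4],
                    [0.9, 0.2, 1.0, 0.5],
                    [0.1, 0.4, 0.5, 1.0]])

print(top_k_similar(toy_sim, 0, k=2))
```

The returned positions can be passed to `DataFrame.iloc` in the same way the notebook's function does.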

## Finally Get Recommendations for a Specific Movie and Print Relevant Details
In this cell, we use the function we defined earlier to get recommendations for the movie 'Interstellar'. 
We then print out the original title, overview, cast, and crew of the recommended movies to see the details.

In [18]:
# Get recommendations for a specific movie and print relevant details
recommended_movies = give_recomendations('Interstellar')
print(recommended_movies[['original_title', 'overview', 'cast', 'crew']])

                         original_title  \
1709                         キャプテンハーロック   
1352                            Gattaca   
643                       Space Cowboys   
268                       Stuart Little   
220                          Prometheus   
4353                  The Green Inferno   
4176  Battle for the Planet of the Apes   
2260                    All Good Things   
2648                     Winnie Mandela   
1373                The English Patient   

                                               overview  \
1709  Space Pirate Captain Harlock and his fearless ...   
1352  Science fiction drama about a future society i...   
643   Frank Corvin, ‘Hawk’ Hawkins, Jerry O'Neill an...   
268   The adventures of a heroic and debonair stalwa...   
220   A team of explorers discover a clue to the ori...   
4353  A group of student activists travel from New Y...   
4176  The fifth and final episode in the Planet of t...   
2260  Newly-discovered facts, court records and spec..

## Conclusion: How the Movie Recommendation Model Works

In this notebook, we have built a movie recommendation system that suggests similar movies based on their overviews. Here's a brief overview of how the model works:

1. **Data Preparation**: We start by importing necessary libraries and loading two datasets: one containing movie credits and another with movie details. We then clean and process this data to ensure it's in a usable format.

2. **Feature Extraction**: The key feature for our recommendations is the 'overview' of each movie. We use a technique called TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text in the 'overview' column into numerical vectors. This allows us to quantify the content of each movie.

3. **Similarity Calculation**: Using the TF-IDF matrix, we calculate the similarity between movies based on their overviews. We employ a method called the sigmoid kernel, which helps us determine how closely related two movies are based on their textual descriptions.

4. **Recommendation Generation**: When a user inputs a movie title, our model checks for its index in the dataset. It then retrieves the similarity scores for that movie and sorts them to find the top 10 most similar movies, excluding the input movie itself.

5. **Output**: Finally, the model returns a list of recommended movies along with their details, such as the original title, overview, cast, and crew.

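By following these steps, our recommendation system suggests movies whose overviews share vocabulary with the overview of the movie the user is interested in. As the 'Interstellar' example shows, the results include science-fiction titles such as Gattaca, Prometheus, and Space Cowboys alongside less obviously related ones such as Stuart Little and Winnie Mandela, because the model sees only the overview text and no other information about the movies.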