## Movie Recommendation System

### Introduction

In this notebook, we build a content-based movie recommendation system. Recommender systems help users discover new content tailored to their preferences. A content-based recommender compares the characteristics of items and suggests items that resemble the ones a user already likes. Here, the system takes a movie the user has enjoyed and suggests other movies with similar content and features.

We use a dataset containing information about thousands of movies, including genres, keywords, cast, crew, and overviews. By combining these attributes into a single text description per movie, we can represent each movie's content as a vector and recommend the movies whose vectors are closest to it.

The main steps of our approach include:

- **Data Preprocessing:** We will clean and structure the dataset, extracting relevant information from JSON-like objects and preparing the textual data for analysis.
- **Feature Extraction:** Using techniques such as tokenization and stemming, we will normalize the textual data so that variants of the same word map to the same token.
- **Vectorization:** We will transform the normalized text into numerical bag-of-words vectors, creating a representation of each movie's content that can be used for similarity calculations.
- **Similarity Calculation:** By computing cosine similarity between movie vectors, we can measure how closely related two movies are in terms of content.
- **Recommendation Generation:** Based on the computed similarity scores, we will implement a recommendation function that suggests movies similar to a user's input.
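
The steps above can be sketched end to end on a toy example. This is a minimal pure-Python illustration, not the notebook's implementation: the titles and tag strings are made up, and `bag_of_words`, `cosine`, and `most_similar` are hypothetical helper names standing in for the `CountVectorizer` and `cosine_similarity` calls used later.

```python
import math
from collections import Counter

# Made-up tag strings standing in for the real per-movie "tags" column.
toy_tags = {
    "Space Raiders": "space alien war pilot",
    "Alien Dawn": "alien space invasion war",
    "Love in Paris": "romance paris love",
}

def bag_of_words(text):
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

toy_vectors = {title: bag_of_words(tags) for title, tags in toy_tags.items()}

def most_similar(title, k=2):
    """Rank every other title by cosine similarity to `title`."""
    scores = [(other, cosine(toy_vectors[title], toy_vectors[other]))
              for other in toy_vectors if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

print(most_similar("Space Raiders"))
# [('Alien Dawn', 0.75), ('Love in Paris', 0.0)]
```

The two science-fiction titles share three of their four tokens, so they score 0.75, while the romance title shares none and scores 0.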

#### Import necessary libraries and modules

In [2]:
import numpy as np
import pandas as pd
import ast
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

#### Read movie and credits data from CSV files

Data downloaded from: https://www.kaggle.com/datasets/gazu468/tmdb-10000-movies-dataset

In [None]:
movies = pd.read_csv('10000 Movies Data.csv')
credits = pd.read_csv('10000 Credits Data.csv')

#### Remove the 'Unnamed: 0' column from both DataFrames

In [None]:
movies.drop(columns='Unnamed: 0', inplace=True)
credits.drop(columns='Unnamed: 0', inplace=True)

#### Merge movie and credits data on 'Movie_id' and 'title'

In [5]:
df = movies.merge(credits, on=['Movie_id', 'title'])

#### Select relevant columns and drop rows with missing values

In [6]:
# Chain dropna() so the result is a new DataFrame rather than an in-place edit of a slice
df = df[['Movie_id', 'title', 'Genres', 'Keywords', 'overview', 'Cast', 'Crew']].dropna()

#### Rename columns for clarity

In [None]:
df.rename(columns={"Movie_id": "id", "Genres": "genres", "Keywords": "keywords", "Cast": "cast", "Crew": "crew"}, inplace=True)

#### Function to convert JSON-like objects to lists of names

In [7]:
def convert_base(obj):
    L = []
    for i in ast.literal_eval(obj):
        L.append(i['name'])
    return L
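
To make the parsing step concrete, here is a small sketch of what `ast.literal_eval` does with one cell. The string below is a hypothetical value shaped like an entry of the `Genres` column, not a row copied from the dataset, and `safe_names` is an illustrative helper (not part of the notebook) that returns an empty list when a cell cannot be parsed.

```python
import ast

# Hypothetical cell value, shaped like an entry of the 'Genres' column.
raw_genres = "[{'id': 28, 'name': 'Action'}, {'id': 878, 'name': 'Science Fiction'}]"

parsed = ast.literal_eval(raw_genres)        # str -> list of dicts
names = [entry['name'] for entry in parsed]  # keep only the 'name' field
print(names)  # ['Action', 'Science Fiction']

def safe_names(cell):
    """Like the loop above, but return [] if the cell is not a valid literal."""
    try:
        entries = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return [entry['name'] for entry in entries]

print(safe_names("[{'id': 1, 'name'"))  # [] (truncated input)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it does not execute arbitrary code embedded in the CSV.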

#### Function to extract top 5 cast members

In [None]:
def convert_cast(obj):
    """Return the names of the first five cast members listed."""
    return [member['name'] for member in ast.literal_eval(obj)[:5]]

#### Function to extract director's name

In [None]:
def convert_director(obj):
    L = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L

#### Apply conversion functions to relevant columns

In [8]:
df.genres = df.genres.apply(convert_base)
df.keywords = df.keywords.apply(convert_base)
df.cast = df.cast.apply(convert_cast)
df.crew = df.crew.apply(convert_director)
df.overview = df.overview.apply(lambda x:x.split())

#### Remove spaces from individual elements in lists

In [9]:
df.genres = df.genres.apply(lambda x:[i.replace(' ','') for i in x])
df.keywords = df.keywords.apply(lambda x:[i.replace(' ','') for i in x])
df.cast = df.cast.apply(lambda x:[i.replace(' ','') for i in x])
df.crew = df.crew.apply(lambda x:[i.replace(' ','') for i in x])
# overview was already split on whitespace, so its tokens contain no spaces to remove
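
The reason for removing spaces is that the tags are later split on whitespace, so a multi-word name would otherwise break into separate tokens that can match unrelated people. The sketch below illustrates this with two example names that share a first name; `tokens` is a hypothetical helper used only for this illustration.

```python
# Two example names that share a first name.
cast_a = ["Sam Worthington"]
cast_b = ["Sam Mendes"]

def tokens(names, collapse):
    """Return the set of whitespace-separated tokens for a list of names."""
    if collapse:
        names = [name.replace(" ", "") for name in names]
    return set(" ".join(names).split())

# Without collapsing, the two lists appear to overlap on 'Sam'.
print(tokens(cast_a, collapse=False) & tokens(cast_b, collapse=False))  # {'Sam'}

# With spaces removed, each person becomes one distinct token.
print(tokens(cast_a, collapse=True) & tokens(cast_b, collapse=True))    # set()
```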

#### Combine different lists into a single 'tags' column

In [10]:
df['tags'] = df.genres + df.keywords + df.overview + df.cast + df.crew

#### Create a new DataFrame with relevant columns

In [11]:
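# reset_index returns a new frame (no SettingWithCopyWarning) whose labels match row positions
new_df = df[['id', 'title', 'tags']].reset_index(drop=True)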

#### Join tags into a single string and convert to lowercase

In [None]:
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))  # join the tokens with single spaces
new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())

#### Initialize Porter Stemmer and define a stemming helper

In [12]:
ps = PorterStemmer()

def stem(text):
    """Stem every whitespace-separated token in text and rejoin with spaces."""
    return ' '.join(ps.stem(word) for word in text.split())

#### Apply stemming to 'tags' column

In [13]:
new_df.tags = new_df.tags.apply(stem)

#### Initialize CountVectorizer with specified parameters

In [14]:
cv = CountVectorizer(max_features=10000, stop_words='english')

#### Convert 'tags' into bag-of-words vectors

In [None]:
vectors = cv.fit_transform(new_df.tags).toarray()

#### Calculate cosine similarity between vectors

In [None]:
similarity = cosine_similarity(vectors)
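
For intuition about what the similarity matrix contains, cosine similarity can be computed by scaling each row to unit length and taking pairwise dot products. The sketch below does this with NumPy on a made-up 3 x 4 count matrix. It illustrates the formula only and is not a replacement for `cosine_similarity`; in particular, it assumes no row is all zeros.

```python
import numpy as np

# Toy count matrix: 3 "movies" x 4 vocabulary terms (values are made up).
toy_counts = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 2],
], dtype=float)

# Scale each row to unit length, then take pairwise dot products.
row_norms = np.linalg.norm(toy_counts, axis=1, keepdims=True)
unit_rows = toy_counts / row_norms
toy_similarity = unit_rows @ unit_rows.T

print(np.round(toy_similarity, 3))
```

The result is a symmetric matrix with ones on the diagonal: entry `[i, j]` is the similarity between movie `i` and movie `j`, and each movie is maximally similar to itself.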

#### Function to recommend movies based on similarity

In [15]:
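def recommend(movie):
    # Rows of `similarity` are positional, so find the row position, not the index label
    matches = np.flatnonzero((new_df.title == movie).to_numpy())
    if len(matches) == 0:
        return []  # title not found in the dataset
    movie_pos = matches[0]
    # Sort by score, highest first; the first entry is normally the movie itself, so skip it
    movies_list = sorted(enumerate(similarity[movie_pos]), reverse=True, key=lambda x: x[1])[1:6]

    # Map row positions back to titles
    return [new_df['title'].iloc[i] for i, _ in movies_list]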
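
Slicing `[1:6]` relies on the queried movie sorting first in its own row. If another movie has identical tags and therefore the same score, that assumption may not hold. One possible alternative, sketched below on a made-up similarity row, is to exclude the query by position and select the top scores with `heapq.nlargest`, which also avoids sorting the whole row. `top_k` is a hypothetical helper, not part of the notebook.

```python
import heapq

# Made-up similarity row for a query movie stored at position 0.
scores = [1.0, 0.12, 0.57, 0.03, 0.44, 0.31, 0.80]
query_position = 0

def top_k(row, skip, k=5):
    """Positions of the k highest scores, excluding position `skip`."""
    candidates = ((pos, score) for pos, score in enumerate(row) if pos != skip)
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [pos for pos, _ in best]

print(top_k(scores, query_position, k=3))  # [6, 2, 4]
```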

#### Save cleaned DataFrame and similarity matrix as pickle files

In [19]:
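# Use context managers so both files are flushed and closed after writing
with open('movie_dict.pkl', 'wb') as f:
    pickle.dump(new_df.to_dict(), f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)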
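
A downstream script can load these files back with `pickle.load`. The sketch below shows the round trip on a small stand-in dictionary with made-up values, written to a temporary directory so that it does not touch the real files. The nested `{column: {row_label: value}}` shape mirrors what `DataFrame.to_dict()` produces by default, and passing such a dictionary to `pd.DataFrame(...)` rebuilds the frame.

```python
import os
import pickle
import tempfile

# Stand-in for new_df.to_dict(): {column: {row_label: value}} (toy values).
toy_movie_dict = {
    "id": {0: 101, 1: 102},
    "title": {0: "Space Raiders", 1: "Alien Dawn"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movie_dict.pkl")
    with open(path, "wb") as fh:
        pickle.dump(toy_movie_dict, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

titles = list(restored["title"].values())
print(titles)  # ['Space Raiders', 'Alien Dawn']
```

Pickle files can execute code when loaded, so only load ones that come from a trusted source.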