# **Introduction**
A recommendation system suggests items that a user is likely to be interested in: restaurants, jobs, video games to play or buy, movies, and so on. You have probably seen one already: "You might like this" under an article, "Similar movies to XYZ" on Netflix, or "Recommended videos" on YouTube. In this kernel I build a content-based movie recommendation system.

### About Dataset
These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages. This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.


This dataset consists of the following files:

* movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue, release dates, languages, production countries and companies.
* keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.
* credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of a stringified JSON Object.
* links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.
* links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.
* ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.

### Acknowledgements

This dataset is an ensemble of data collected from TMDB and GroupLens.
The Movie Details, Credits and Keywords have been collected from the TMDB Open API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows.

The Movie Links and Ratings have been obtained from the Official GroupLens website.


# **Import Dataset and Libraries**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

movies = pd.read_csv('/kaggle/input/the-movies-dataset/movies_metadata.csv', parse_dates=True, low_memory=False)

In [None]:
movies.head()

In [None]:
# First thing to do is to drop unnecessary columns.
movies = movies.drop(['belongs_to_collection','homepage','spoken_languages','video','popularity','tagline'], axis=1)

In [None]:
# check missing values
movies.isnull().sum()

In [None]:
# handle missing values
movies['original_language'] = movies['original_language'].fillna(value='en')
movies['status'] = movies['status'].fillna(value='Released')

revenue_mean = movies['revenue'].mean()
movies['revenue'] = movies['revenue'].fillna(value=revenue_mean)

runtime_mean = round(movies['runtime'].mean(),1)
movies['runtime'] = movies['runtime'].fillna(value=runtime_mean)
movies = movies.dropna()

In [None]:
# check missing values for the last time
movies.isnull().sum()

> Much better.

## Feature Engineering

In [None]:
# change id dtypes of movies dataframe to int
movies['id'] = movies['id'].astype('int')

# change release_date dtypes into datetime
movies['release_date'] = pd.to_datetime(movies['release_date'])

# create new feature/column named year_release_date and month_release_date, and extract it from release_date
movies['year_release_date'] = movies['release_date'].dt.year
movies['month_release_date'] = movies['release_date'].dt.month

In [None]:
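# parse the stringified lists safely; literal_eval only evaluates Python literals
from ast import literal_eval

# define a function to clean values
def clean_keywords(keywords, val_key='name'):
    items = literal_eval(keywords)
    if len(items):
        # join the requested field of every entry, e.g. "Animation, Comedy"
        return ', '.join(item[val_key] for item in items)
    else:
        return 'No data'

# get the director's name from the crew feature.
def get_director(x):
    for i in literal_eval(x):
        if i['job'] == 'Director':
            return i['name']
    # reached only when no crew member has the job 'Director'
    return 'No data'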
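
To make the parsing step concrete, here is a minimal sketch of how a stringified list of dictionaries can be turned into a comma-separated string. The sample value and the helper name `names_from_stringified_list` are made up for illustration; only the general shape (a list of dicts with a `name` key) follows the dataset description.

```python
import ast

# Hypothetical sample value, shaped like one cell of the `genres` column
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def names_from_stringified_list(raw, key="name"):
    """Parse a stringified list of dicts and join the values stored under `key`."""
    items = ast.literal_eval(raw)  # evaluates literals only, never arbitrary code
    return ", ".join(item[key] for item in items) if items else "No data"

print(names_from_stringified_list(raw))
print(names_from_stringified_list("[]"))
```

Using `ast.literal_eval` rather than `eval` means a malformed or malicious cell raises an error instead of executing code.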

In [None]:
# clean feature genres, production_companies, production_countries values
cols = ['genres', 'production_companies', 'production_countries']

for col in cols:
    movies[col] = movies[col].apply(lambda x: clean_keywords(x))
    
movies.head()

# **Data Exploration & Visualization**
Let's explore this dataset for a bit.

## Distribution of Movies Released Each Year

In [None]:
plt.figure(figsize=(20,5))
ax = sns.countplot(data=movies, x='year_release_date')
plt.title('Distribution of Movies Released Each Year')
plt.xlabel('Year')
plt.ylabel('Total')
plt.xticks(rotation=90)
plt.show()

> The number of movies released per year increases significantly from 1991 to 2015.

## Distribution of Movies Released Each Month

In [None]:
sns.countplot(data=movies, x='month_release_date')
plt.title('Distribution of Movies Released Each Month')
plt.xlabel('Month')
plt.ylabel('Total')
plt.show()

> January appears to be the month with the most movie releases.

## Top 20 Movies with the Highest Vote Average

In [None]:
top_movies = movies[['title','year_release_date', 'genres','vote_average']].sort_values(by=['vote_average','year_release_date'], ascending=False)
top_movies.head(20)

> Have you watched any of the movies above?

## Statistical Summary

In [None]:
movies.describe().transpose()

> - The average revenue per movie is \\$11,548,474.45, and the highest revenue is \\$2,787,965,087.
> - The average runtime is 95 minutes (1 hour, 35 minutes), and the maximum runtime is 1,256 minutes (20 hours, 56 minutes).

## Feature Correlation Matrix

In [None]:
movies_corr = movies.corr(numeric_only=True)
# optional: keep only strong correlations
# movies_corr = movies_corr[((movies_corr >= .5) | (movies_corr <= -.5)) & (movies_corr != 1.000)]
sns.heatmap(movies_corr, annot=True, cmap='crest')
plt.show()

> Revenue and vote_count show a strong correlation. Movies with higher revenue tend to receive more votes. Note that vote_count is the number of votes, not the rating itself (vote_average).

# **Recommendation System**

There are three main types of recommendation systems:
   1. **Content-based filtering** uses the features of an item to recommend other, similar items. Example: if you watch The Amazing World of Gumball, YouTube will recommend more clips from Cartoon Network.
   2. **Collaborative filtering** recommends items based on the preferences of similar users. Example: users A, B and C all like superhero action movies. If users A and B like The Avengers, then user C is likely to enjoy it too.
   3. **Hybrid recommender systems** combine content-based and collaborative filtering to create more personalized recommendations.

In this kernel, I'm applying content-based filtering to create a movie recommendation system.
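
To make the content-based idea concrete, here is a minimal sketch that ranks items purely by how many tags they share. The catalogue, the titles and the use of Jaccard similarity are assumptions chosen for illustration; the notebook itself uses TF-IDF and count vectors with cosine similarity.

```python
# Hypothetical toy catalogue: title -> set of genre tags (not taken from the dataset)
catalogue = {
    "Movie A": {"Action", "Adventure", "Science Fiction"},
    "Movie B": {"Action", "Science Fiction"},
    "Movie C": {"Romance", "Drama"},
    "Movie D": {"Action", "Drama"},
}

def jaccard(a, b):
    """Share of tags two items have in common: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

def most_similar(title, k=2):
    """Rank every other item by tag overlap with `title` and keep the top k."""
    scores = [(other, jaccard(catalogue[title], tags))
              for other, tags in catalogue.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

print(most_similar("Movie A"))
```

No user ratings are involved: the ranking depends only on item features, which is what separates content-based from collaborative filtering.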

In [None]:
# Import keywords and credits dataset 
keywords = pd.read_csv('/kaggle/input/the-movies-dataset/keywords.csv')
credits = pd.read_csv('/kaggle/input/the-movies-dataset/credits.csv')

# clean keywords, cast and crew feature with clean_keywords function
keywords['keywords'] = keywords['keywords'].apply(lambda x: clean_keywords(x))
credits['cast'] = credits['cast'].apply(lambda x: clean_keywords(x))
credits['director'] = credits['crew'].apply(lambda x: get_director(x))

# merge keywords and credits dataset to movies 
movies = movies.merge(keywords, how='left', on='id')
movies = movies.merge(credits, how='left', on='id')
movies.head()

#### **Movie recommendation using overview of the movie**

In [None]:
# Import TF-IDF Vectorizer Object and linear_kernel from sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# define TF-IDF Vectorizer Object, to remove all english stop words.
tfidf = TfidfVectorizer(stop_words='english')

# replace NaN with an empty string
movies['overview'] = movies['overview'].fillna('')

# construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies['overview'])

# check the shape of tfidf matrix
# print('TF-IDF Shape:', tfidf_matrix.shape)

# calculate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

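# construct a reverse map of indices and movie titles,
# keeping the first row for each duplicated title so a lookup returns a single index
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]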
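
A note on why `linear_kernel` yields cosine similarity here: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and for unit-length vectors the plain dot product equals the cosine. The sketch below checks that identity with hand-made vectors; the numbers are hypothetical and stand in for two rows of term weights.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical raw term weights for two overviews
doc_a = [3.0, 0.0, 4.0]
doc_b = [1.0, 2.0, 2.0]

# Dot product of the normalized vectors vs. cosine of the raw vectors
dot_of_unit_vectors = dot(l2_normalize(doc_a), l2_normalize(doc_b))
print(math.isclose(dot_of_unit_vectors, cosine(doc_a, doc_b)))
```

Skipping the explicit normalization step is what makes the linear kernel cheaper than a general cosine computation on already-normalized rows.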

In [None]:
# define a function with title as an input argument and list similar movies as output
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies (position 0 is the movie itself, so skip it)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movies[['title','year_release_date']].iloc[movie_indices]

In [None]:
get_recommendations('The Dark Knight Rises')
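
The full similarity matrix has one entry per pair of movies, so its size grows quadratically with the dataset. One possible alternative, sketched here under the assumption that only a single movie is queried at a time, is to score that one row against all rows on demand. The mini-corpus and the helper `top_k_for_row` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical mini-corpus standing in for movies['overview']
overviews = [
    "a masked vigilante protects the city from a terrorist",
    "a vigilante hunts criminals in the city at night",
    "two friends open a bakery in a small town",
    "a terrorist threatens the city and a detective responds",
]

matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)

def top_k_for_row(idx, k=2):
    """Score row `idx` against every row without building the full n x n matrix."""
    scores = linear_kernel(matrix[idx], matrix).ravel()  # one score per document
    scores[idx] = -1.0                                    # exclude the queried row itself
    return np.argsort(scores)[::-1][:k]                   # indices of the k highest scores

print(top_k_for_row(0))
```

This trades a little time per query for a large saving in memory, since only one row of scores exists at any moment.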

#### **Movie Recommendation using genres, keywords and credits**
In this section, we try to improve the quality of the recommender by using movie metadata: genres, keywords, cast and director. These fields were already cleaned into comma-separated strings above, so they can be combined directly.


Now create a "metadata soup": a single string per movie containing all the metadata that we want to feed to our vectorizer (genres, keywords, cast and director).

In [None]:
def create_soup(x):
    # x is a single row; join its metadata fields into one comma-separated string
    return x['genres'] + ', ' + x['keywords'] + ', ' + x['cast'] + ', ' + x['director']

movies['soup'] = movies.apply(create_soup, axis=1)

For this section we use CountVectorizer() instead of TF-IDF because we do not want to down-weight a name or keyword just because it appears in many movies.
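
The difference in weighting can be sketched with a few lines of plain Python. The tokenised documents are hypothetical; the IDF formula is the smoothed one that `TfidfVectorizer` applies by default (`smooth_idf=True`), namely `ln((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter

# Hypothetical tokenised "soups" for three movies
docs = [
    ["action", "nolan", "bale"],
    ["action", "whedon", "downey"],
    ["action", "nolan", "dicaprio"],
]

n_docs = len(docs)
# document frequency: in how many documents each token occurs
doc_freq = Counter(token for doc in docs for token in set(doc))

def smoothed_idf(token):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1

for token in ["action", "nolan", "whedon"]:
    # a raw count gives each of these tokens the same weight (1) inside a document;
    # TF-IDF multiplies that count by the IDF printed here
    print(token, doc_freq[token], round(smoothed_idf(token), 3))
```

With raw counts, a director who appears in many movies counts as much as one who appears once, which is the behaviour wanted for the metadata soup.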

In [None]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies['soup'])

# calculate the Cosine Similarity matrix based on the count_matrix
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

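# reset index movies dataframe and construct a reverse map of indices and movie titles,
# keeping the first row for each duplicated title so a lookup returns a single index
movies = movies.reset_index(drop=True)
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]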

Since we have already built the **get_recommendations()** function, we can reuse it by passing cosine_sim2 as the second argument.

In [None]:
get_recommendations('The Avengers', cosine_sim2)

References:
1. https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system
2. https://www.kaggle.com/code/dgoenrique/a-simple-movie-tv-show-recommendation-system
3. https://www.kaggle.com/code/rounakbanik/movie-recommender-systems/notebook

## **Thank you!**