# Movies Recommendation Model
---
- Recommends movies based on a user's viewing patterns.

- `content based` -> recommends titles similar to what the user has watched before
- `popularity based` -> recommends titles based on the popularity of the content (movies)
- `collaborative based` -> groups people based on their watching patterns

![workflow](../images/workflow.png)

Movies Data -> Data Preprocessing and Analysis -> Feature Extraction -> Similarity Score between movies (Cosine Similarity) -> List of Movies

- Cosine Similarity (measures the relationship between two feature vectors; here, between two movies)
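
To make the similarity measure concrete, here is a minimal sketch that computes it by hand with NumPy. The helper name `cosine` and the example vectors are illustrative assumptions, not part of the notebook; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing in the same direction score close to 1,
# vectors with no shared components score 0.
same_direction = cosine([1, 2, 0], [2, 4, 0])
orthogonal = cosine([1, 0, 0], [0, 1, 0])
print(same_direction, orthogonal)
```

Because the score depends only on direction, a long overview and a short one can still be rated as similar when they use the same words in similar proportions.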

In [87]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

import difflib  # find the closest match to a movie name
from sklearn.feature_extraction.text import TfidfVectorizer  # convert textual data to numerical data
from sklearn.metrics.pairwise import cosine_similarity  # pairwise similarity scores

warnings.filterwarnings('ignore') 

In [88]:
movies = pd.read_csv('../Datasets/movies.csv')
movies.drop(columns=['index'], inplace=True)
# 'homepage' has 3091 null values (most of the column), so drop it
movies.drop(columns=['homepage'], inplace=True)

In [89]:
movies.head(1)

| | budget | genres | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | cast | crew | director |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | Action Adventure Fantasy Science Fiction | 19995 | culture clash future space war space colony so... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | [{"iso_3166_1": "US", "name": "United States o... | ... | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | Sam Worthington Zoe Saldana Sigourney Weaver S... | [{'name': 'Stephen E. Rivkin', 'gender': 0, 'd... | James Cameron |


In [90]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4775 non-null   object 
 2   id                    4803 non-null   int64  
 3   keywords              4391 non-null   object 
 4   original_language     4803 non-null   object 
 5   original_title        4803 non-null   object 
 6   overview              4800 non-null   object 
 7   popularity            4803 non-null   float64
 8   production_companies  4803 non-null   object 
 9   production_countries  4803 non-null   object 
 10  release_date          4802 non-null   object 
 11  revenue               4803 non-null   int64  
 12  runtime               4801 non-null   float64
 13  spoken_languages      4803 non-null   object 
 14  status                4803 non-null   object 
 15  tagline              

In [91]:
movies.shape

(4803, 22)

In [92]:
movies.isnull().sum()

budget                    0
genres                   28
id                        0
keywords                412
original_language         0
original_title            0
overview                  3
popularity                0
production_companies      0
production_countries      0
release_date              1
revenue                   0
runtime                   2
spoken_languages          0
status                    0
tagline                 844
title                     0
vote_average              0
vote_count                0
cast                     43
crew                      0
director                 30
dtype: int64

## Selecting the relevant features for movie recommendation

In [93]:
# based on content 
features = ['keywords', 'cast', 'genres', 'director', 'tagline', 'overview', 'title']
print("Features selected for recommendation system:", features)

Features selected for recommendation system: ['keywords', 'cast', 'genres', 'director', 'tagline', 'overview', 'title']


In [94]:
for feature in features:
    movies[feature] = movies[feature].fillna('') 
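
The null-filling step matters because the features are later joined as strings. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration only) to show that joining a row containing a missing value fails, while the filled frame joins cleanly.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second row has a missing genre.
toy = pd.DataFrame({
    "genres": ["Action", np.nan],
    "tagline": ["Enter the World of Pandora.", "Some tagline"],
})

# A missing value is not a string, so ' '.join(row) raises TypeError.
try:
    toy.apply(lambda row: " ".join(row), axis=1)
    join_failed = False
except TypeError:
    join_failed = True

# After replacing missing values with '', the join works on every row.
combined = toy.fillna("").apply(lambda row: " ".join(row), axis=1)
print(join_failed, combined.tolist())
```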

In [95]:
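def combine_features(features_list):
    # join the selected text columns of each movie into one string
    temp_df = movies[features_list].copy()
    return temp_df.apply(lambda row: ' '.join(row), axis=1)

# named combined_features because the vectorizer cell below expects that name
combined_features = combine_features(features_list=features)
combined_features.head()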

0    culture clash future space war space colony so...
1    ocean drug abuse exotic island east india trad...
2    spy based on novel secret agent sequel mi6 Dan...
3    dc comics crime fighter terrorist secret ident...
4    based on novel mars medallion space travel pri...
dtype: object

In [96]:
# combined_features = movies['keywords'] + ' ' + movies['cast'] + ' ' + movies['genres'] + ' ' + movies['director'] + ' ' + movies['tagline'] + ' ' + movies['overview']

# combined_features.shape

In [97]:
# convert the text data into numerical data
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(combined_features)
tfidf_matrix.shape

(4803, 30592)
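
The shape above reads as (number of movies, number of distinct tokens). A minimal sketch on three made-up strings (the `docs` list is an assumption for illustration) shows the same structure at a size that can be checked by eye, using the default `TfidfVectorizer` settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical "combined feature" strings.
docs = [
    "space war space colony",
    "spy secret agent",
    "space travel mars",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# One row per document, one column per distinct token.
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))

# With the default norm='l2', every row has unit length.
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))
print(row_norms.ravel())
```

Repeating a word within one document ("space" in the first string) raises its weight in that row, while appearing in many documents lowers its weight overall.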

## Cosine Similarity Score (Similarity Confidence Value)

In [98]:
cosine_sim = cosine_similarity(tfidf_matrix)
print(cosine_sim.shape)

(4803, 4803)
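
The square matrix holds one score for every pair of movies. The sketch below checks a few properties on made-up strings (the `docs` list is an assumption): the matrix is symmetric, each movie scores 1 against itself, and texts with no shared words score 0. It also shows that, for L2-normalized TF-IDF rows, scikit-learn's `linear_kernel` gives the same values as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical combined-feature strings for three movies.
docs = ["crime family mafia", "crime family drama", "space travel mars"]

tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)   # normalizes rows, then takes dot products
dot = linear_kernel(tfidf)       # plain dot products

print(np.round(sim, 3))
print(np.allclose(sim, dot))
```

Entry `sim[i, j]` is the score between movie `i` and movie `j`, which is why a single row can be ranked to find the nearest neighbours of one title.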


In [99]:
# movie name to look up (may be misspelled)
movies_name = "God father"

In [100]:
# creating the list of all movie names in the dataset
movies_list = movies['title'].tolist()
# getting the close match of the movie name
close_match = difflib.get_close_matches(movies_name, movies_list)
close_match

['The Godfather', 'The Last Godfather', 'Goldfinger']
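
`difflib.get_close_matches` returns an empty list when nothing clears its similarity cutoff, so indexing the result with `[0]` can fail for an unusual query. A minimal guarded sketch follows; the helper name `best_title_match` and the short `titles` list are assumptions for illustration.

```python
import difflib

# A few titles standing in for the full title list.
titles = ["The Godfather", "The Last Godfather", "Goldfinger", "Avatar"]

def best_title_match(query, candidates, cutoff=0.6):
    """Return the closest title, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

found = best_title_match("God father", titles)
missing = best_title_match("zzzzqqqq", titles)
print(found, missing)
```

The `cutoff` of 0.6 and the default `n=3` are the library defaults; raising the cutoff makes matching stricter at the cost of rejecting more typos.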

In [101]:
# getting the index of the movie with the close match
movie_index = movies[movies.title == close_match[0]].index[0]
movie_index

3337

In [102]:
# getting the similarity score of all the movies with the selected movie
similarity_score = list(enumerate(cosine_sim[movie_index]))
# sorting the movies based on the similarity score
sorted_similar_movies = sorted(similarity_score, key=lambda x: x[1], reverse=True)
sorted_similar_movies[:10]

[(3337, 1.0000000000000002),
 (2731, 0.3400721247063389),
 (867, 0.24579758343491187),
 (1873, 0.14066696731815023),
 (1225, 0.12270123122479634),
 (1525, 0.12086680678462663),
 (877, 0.11764345466464356),
 (3125, 0.11451810418839659),
 (2038, 0.11145789029578734),
 (2553, 0.11055655989419616)]

In [103]:
# taking the top 5 entries (the first one is the queried movie itself)
recommended_movies = sorted_similar_movies[0:5]
# printing the titles of these movies
print("Movies recommended for you:")
for movie in recommended_movies:
    print(movies.iloc[movie[0]].title)

Movies recommended for you:
The Godfather
The Godfather: Part II
The Godfather: Part III
Blood Ties
Mickey Blue Eyes
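
The steps above can be gathered into one function. This is a sketch under stated assumptions, not the notebook's own code: the `recommend` name, the `toy_movies` frame, and its `text` column are hypothetical, and unlike the cell above it skips the queried movie so that it is not recommended back to the user.

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical miniature dataset with pre-combined feature text.
toy_movies = pd.DataFrame({
    "title": ["The Godfather", "The Godfather: Part II", "Avatar", "Goldfinger"],
    "text": [
        "mafia crime family italian american",
        "mafia crime family sequel italian american",
        "space colony alien future war",
        "spy secret agent gold",
    ],
})

def recommend(query, frame, top_n=2):
    """Return up to top_n titles most similar to the closest match of query."""
    titles = frame["title"].tolist()
    matches = difflib.get_close_matches(query, titles, n=1)
    if not matches:
        return []  # no title is close enough to the query
    idx = titles.index(matches[0])
    sim = cosine_similarity(TfidfVectorizer().fit_transform(frame["text"]))
    ranked = sorted(enumerate(sim[idx]), key=lambda pair: pair[1], reverse=True)
    # Drop the queried movie itself before taking the top results.
    return [titles[i] for i, _ in ranked if i != idx][:top_n]

top_pick = recommend("God father", toy_movies, top_n=1)
print(top_pick)
```

For repeated queries on the full dataset, the TF-IDF matrix and similarity matrix would normally be computed once outside the function rather than on every call.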
