### Movie Recommendation System

In [141]:
import numpy as np
import pandas as pd

In [142]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [143]:
movies.head(1)

| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | [{"iso_3166_1": "US", "name": "United States o... | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |


In [144]:
credits.head(1)

| | movie_id | title | cast | crew |
|---|---|---|---|---|
| 0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |


In [145]:
# merge the movies and credits tables on their shared 'title' column
movies = movies.merge(credits,on='title')
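
Merging on `title` assumes that titles are unique. A minimal sketch with made-up toy rows (not taken from the TMDB files) shows how a repeated title multiplies rows, and how joining on the numeric ID avoids that. In the real data the ID column is called `id` in `movies` and `movie_id` in `credits`, as the `head()` outputs above show.

```python
import pandas as pd

# Toy data (hypothetical): two different films share the title "Batman".
toy_movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})

by_title = toy_movies.merge(toy_credits, on="title")
by_id = toy_movies.merge(toy_credits, left_on="id", right_on="movie_id")

print(len(by_title))  # 5 rows: each "Batman" movie matches both "Batman" credits
print(len(by_id))     # 3 rows: one per film
```

Joining on the ID leaves two title columns (`title_x`, `title_y`), so one of them would need to be dropped or renamed afterwards.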

#### Tags for the system:
genres, id, keywords, title, overview, cast, crew

In [146]:
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

In [147]:
movies.head(1)

| | movie_id | title | overview | genres | keywords | cast | crew |
|---|---|---|---|---|---|---|---|
| 0 | 19995 | Avatar | In the 22nd century, a paraplegic Marine is di... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 1463, "name": "culture clash"}, {"id":... | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |


In [148]:
movies.columns

Index(['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew'], dtype='object')

### >> pre-processing

In [149]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [150]:
movies.dropna(inplace=True)
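
`dropna` removes rows but keeps the original index labels, so labels and row positions no longer match afterwards. That matters later whenever a label is used to index a NumPy array built from the DataFrame. A small sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B", "C"], "overview": ["x", np.nan, "z"]})
toy = toy.dropna()

print(list(toy.index))            # [0, 2]: label 1 is gone
toy = toy.reset_index(drop=True)  # relabel rows 0..n-1 so labels equal positions
print(list(toy.index))            # [0, 1]
```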

In [151]:
movies.duplicated().sum()

np.int64(0)

In [152]:
movies.iloc[0].genres
# this returns a string, so it has to be converted into a real Python list of dictionaries

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [153]:
import ast
def convert(obj):
    L = []
    for val in ast.literal_eval(obj):
        L.append(val['name'])
    return L
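
As a quick check, the same parsing can be tried on the genre string shown above. This sketch repeats the `convert` logic as a list comprehension on that one sample value.

```python
import ast

raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

# literal_eval safely evaluates the string as a Python literal (here a list of dicts)
parsed = ast.literal_eval(raw)
names = [item["name"] for item in parsed]
print(names)  # ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
```

`ast.literal_eval` only accepts literals (strings, numbers, lists, dicts and so on), so it does not execute arbitrary code the way `eval` would.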

In [154]:
movies['genres'] = movies['genres'].apply(convert)

In [155]:
movies['keywords'] = movies['keywords'].apply(convert)

In [156]:
def convertTopThreeCast(obj):
    # keep only the names of the first three cast members listed
    return [member['name'] for member in ast.literal_eval(obj)[:3]]


In [157]:
movies['cast'] = movies['cast'].apply(convertTopThreeCast) 

In [158]:
def crewDirectorName(obj):
    L = []
    for val in ast.literal_eval(obj):
        if val['job']=='Director':
            L.append(val['name'])
            break
    return L

In [159]:
movies['crew'] = movies['crew'].apply(crewDirectorName)

In [160]:
# split the overview into a list of words so it can be concatenated with the other list columns
movies['overview'] = movies['overview'].apply(lambda x: x.split())

In [161]:
# remove spaces inside each entry so that multi-word values (e.g. "Science Fiction") become single tokens
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(" ", "") for i in x])
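
Removing the spaces keeps multi-word names together as single tokens once the text is vectorized. A small sketch with scikit-learn's `CountVectorizer` on two made-up strings illustrates the difference:

```python
from sklearn.feature_extraction.text import CountVectorizer

# With the space, the default tokenizer splits the genre into two unrelated words.
with_space = CountVectorizer().fit(["Science Fiction"])
print(sorted(with_space.vocabulary_))     # ['fiction', 'science']

# Without the space, the genre stays one token.
without_space = CountVectorizer().fit(["ScienceFiction"])
print(sorted(without_space.vocabulary_))  # ['sciencefiction']
```

The same reasoning applies to people: without this step, two different actors who share a first name would share a token.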

In [162]:
movies['tags'] = movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']

In [163]:
# take an explicit copy so the assignments below modify new_df itself and do not raise SettingWithCopyWarning
new_df = movies[['movie_id','title','tags']].copy()

In [164]:
new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))

In [165]:
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())


In [166]:
new_df.head(1)

| | movie_id | title | tags |
|---|---|---|---|
| 0 | 19995 | Avatar | in the 22nd century, a paraplegic marine is di... |


### >> Vectorization

Vectorization turns each movie into a vector; the distance between two vectors tells us how similar or different the two movies are.

How does it work?
Each movie (or user) becomes a vector of numbers based on features such as genre, actors, etc.
We then compare these vectors using measures such as:

Cosine similarity (based on the angle between the vectors)
Euclidean distance (the straight-line distance between them)

Interpretation:
Smaller distance / higher similarity score → the movies are more similar
Larger distance / lower similarity score → the movies are less similar

To build intuition, try it with n vectors of 2 dimensions each (an n×2 array) and plot them on a graph.
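
To make the two measures concrete, here is a small NumPy sketch with three made-up 2-D vectors (the numbers are illustrative, not from the dataset). `a` and `b` point in the same direction but differ in length; `c` is perpendicular to `a`.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])   # same direction as a, twice as long
c = np.array([2.0, -1.0])  # perpendicular to a

def cosine_sim(u, v):
    # cosine of the angle between u and v
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    # straight-line distance between the end points of u and v
    return float(np.linalg.norm(u - v))

print(cosine_sim(a, b), euclidean(a, b))  # cosine is 1 (up to rounding), distance is about 2.236
print(cosine_sim(a, c), euclidean(a, c))  # cosine is 0, distance is about 3.162
```

Cosine similarity ignores vector length, which suits word-count vectors: a longer description should not by itself make two movies look different.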


#### Techniques for converting text into numeric vectors
Bag of words, TF-IDF, one-hot encoding (mostly 0s and 1s), embeddings

### Combining the tags into one corpus and keeping the 5000 most common words, ignoring stop words

In [167]:
import nltk 
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer

In [168]:
ps = PorterStemmer()
cv = CountVectorizer(max_features=5000, stop_words='english')

In [169]:
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)
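
Stemming collapses inflected forms onto a shared root so that they count as the same feature. A short sketch with NLTK's `PorterStemmer` (the sample words are chosen for illustration; the function body is a compact equivalent of `stem` above):

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem(text):
    # stem each whitespace-separated token and join the results back together
    return " ".join(ps.stem(word) for word in text.split())

print(stem("loved loving loves"))  # love love love
print(stem("paraplegic marine"))   # parapleg marin
```

Stems are not always dictionary words (`parapleg`, `marin`); that is fine here because they are only used as matching keys.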

In [170]:
new_df['tags'] = new_df['tags'].apply(stem)


In [171]:
# matrix of word counts built from the stemmed tags (one row per movie)
vectors = cv.fit_transform(new_df['tags']).toarray()

#### Now we calculate the cosine similarity between each movie vector and every other movie vector

In [172]:
from sklearn.metrics.pairwise import cosine_similarity

In [173]:
similarity = cosine_similarity(vectors)
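
`cosine_similarity` also accepts the sparse matrix that `fit_transform` returns, so the `.toarray()` step is optional as far as this call is concerned. A small sketch on a made-up corpus checks that both routes give the same matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["space marine alien planet", "space alien war", "romantic comedy paris"]

sparse_vectors = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix
dense_vectors = sparse_vectors.toarray()                # NumPy array

sim_sparse = cosine_similarity(sparse_vectors)
sim_dense = cosine_similarity(dense_vectors)

print(np.allclose(sim_sparse, sim_dense))  # True
print(sim_dense.round(2))                  # 3x3 matrix with 1s on the diagonal
```

With 4806 rows and at most 5000 columns the dense array is manageable, but keeping the matrix sparse saves memory on larger corpora.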

In [174]:
similarity.shape

(4806, 4806)

In [175]:
print(similarity[0])
print(similarity[1])
print(similarity[2])

[1.         0.08346223 0.0860309  ... 0.04499213 0.         0.        ]
[0.08346223 1.         0.06063391 ... 0.02378257 0.         0.02615329]
[0.0860309  0.06063391 1.         ... 0.02451452 0.         0.        ]


### >> main function

### Now we define a recommend function that returns the five movies most similar to a given one
We pass in a movie title, look up its position in the similarity matrix, sort that row of similarity scores in descending order, and take entries 2 to 6 (the first entry is the movie itself).

In [176]:
new_df

| | movie_id | title | tags |
|---|---|---|---|
| 0 | 19995 | Avatar | in the 22nd century, a parapleg marin is dispa... |
| 1 | 285 | Pirates of the Caribbean: At World's End | captain barbossa, long believ to be dead, ha c... |
| 2 | 206647 | Spectre | a cryptic messag from bond’ past send him on a... |
| 3 | 49026 | The Dark Knight Rises | follow the death of district attorney harvey d... |
| 4 | 49529 | John Carter | john carter is a war-weary, former militari ca... |
| ... | ... | ... | ... |
| 4804 | 9367 | El Mariachi | el mariachi just want to play hi guitar and ca... |
| 4805 | 72766 | Newlyweds | a newlyw couple' honeymoon is upend by the arr... |
| 4806 | 231617 | Signed, Sealed, Delivered | "signed, sealed, delivered" introduc a dedic q... |
| 4807 | 126186 | Shanghai Calling | when ambiti new york attorney sam is sent to s... |


In [196]:
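def recommendTopFiveSimilar(movie):
    # Use the row position, not the index label: dropna() left gaps in the labels
    # (they run up to 4807), while similarity has 4806 rows indexed by position.
    movie_index = np.flatnonzero(new_df['title'].values == movie)[0]
    distances = similarity[movie_index]
    # Sort by similarity, highest first; skip the first entry (normally the movie itself) and keep the next five.
    movies_list = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:6]

    for i in movies_list:
        print(new_df.iloc[i[0]]['title'])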
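
A variant that returns the titles instead of printing them can be easier to test and reuse. The sketch below uses a hypothetical helper name, `recommend_titles`, and a tiny made-up `titles`/`sim` pair in place of `new_df` and `similarity`; it relies on `np.argsort` rather than sorting an enumerated list.

```python
import numpy as np

def recommend_titles(movie, titles, sim, k=5):
    """Return the k titles most similar to `movie` (hypothetical helper)."""
    titles = np.asarray(titles)
    matches = np.flatnonzero(titles == movie)
    if matches.size == 0:
        return []                                  # unknown title: nothing to recommend
    pos = matches[0]                               # row position of the first match
    order = np.argsort(-sim[pos], kind="stable")   # positions sorted by similarity, highest first
    order = order[order != pos][:k]                # drop the movie itself, keep k
    return titles[order].tolist()

titles = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
print(recommend_titles("A", titles, sim, k=2))  # ['C', 'D']
print(recommend_titles("Z", titles, sim))       # []
```

Excluding the query by position rather than by slicing off the first result also behaves sensibly when another row has a similarity of exactly 1.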

In [199]:
recommendTopFiveSimilar('Avatar')

Aliens vs Predator: Requiem
Aliens
Falcon Rising
Independence Day
Titan A.E.


In [201]:
recommendTopFiveSimilar('Batman Begins')

The Dark Knight
Batman
Batman
The Dark Knight Rises
10th & Wolf


In [202]:
recommendTopFiveSimilar('Gandhi')

Gandhi, My Father
Guiana 1838
The Wind That Shakes the Barley
Mr. Turner
A Passage to India
