### Movie Recommendation System

In [2]:
import numpy as np
import pandas as pd

In [3]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [4]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [5]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [6]:
# merge the movies and credits tables on their shared 'title' column
movies = movies.merge(credits,on='title')
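
One thing to watch with a title-based join: if two different films share a title, `merge(on='title')` pairs every matching row on the left with every matching row on the right, so those films appear more than once. The sketch below uses small made-up frames (`toy_movies`, `toy_credits`, and all their values are hypothetical, not rows from the TMDB files) to compare a join on title with a join on the ID columns.

```python
import pandas as pd

# Hypothetical toy data: two different films share the title "Batman".
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title: each "Batman" row matches both "Batman" credits rows.
by_title = toy_movies.merge(toy_credits, on="title")

# Joining on the ID columns keeps one row per film.
by_id = toy_movies.merge(toy_credits, left_on="id", right_on="movie_id")

print(len(by_title))  # 5 rows: 2 x 2 for "Batman" plus 1 for "Avatar"
print(len(by_id))     # 3 rows
```

Whether this matters for the real data depends on how many titles repeat; running `movies['title'].duplicated().sum()` before the merge is a quick way to check.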

#### Columns to use as tags for the system
genres, id, keywords, title, overview, cast, crew

In [7]:
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

In [8]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [9]:
movies.columns

Index(['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew'], dtype='object')

### >> Pre-processing

In [10]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [11]:
movies.dropna(inplace=True)

In [12]:
movies.duplicated().sum()

np.int64(0)

In [13]:
movies.iloc[0].genres
# This returns a string, so we have to convert it into a real Python list of dictionaries.

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [14]:
import ast

def convert(obj):
    # Parse the string into a list of dicts, then keep only each 'name' value.
    L = []
    for val in ast.literal_eval(obj):
        L.append(val['name'])
    return L
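
To see what this does to a single value, here is a self-contained sketch that applies the same logic to the `genres` string shown above. The function name `extract_names` is a stand-in for `convert`, used only so the snippet runs on its own.

```python
import ast

def extract_names(obj):
    # Parse the string into a list of dicts, then collect each 'name' value.
    return [item["name"] for item in ast.literal_eval(obj)]

raw_genres = (
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
    '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
)

print(type(raw_genres).__name__)  # str
print(extract_names(raw_genres))  # ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
```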

In [15]:
movies['genres'] = movies['genres'].apply(convert)

In [16]:
movies['keywords'] = movies['keywords'].apply(convert)

In [17]:
def convertTopThreeCast(obj):
    # Keep the names of the first three cast entries, in the order listed.
    L = []
    for stringVal in ast.literal_eval(obj)[:3]:
        L.append(stringVal['name'])
    return L


In [18]:
movies['cast'] = movies['cast'].apply(convertTopThreeCast) 

In [19]:
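def crewDirectorName(obj):
    # Return the first crew member whose job is 'Director' as a one-item list,
    # or an empty list if no director is listed.
    L = []
    for val in ast.literal_eval(obj):
        if val['job']=='Director':
            L.append(val['name'])
            break
    return L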

In [20]:
movies['crew'] = movies['crew'].apply(crewDirectorName)

In [21]:
movies['overview'] = movies['overview'].apply(lambda x: x.split())

In [22]:
movies['genres'] = movies['genres'].apply(lambda x :[i.replace(" ", "") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x :[i.replace(" ", "") for i in x])
movies['cast'] = movies['cast'].apply(lambda x :[i.replace(" ", "") for i in x])
movies['crew'] = movies['crew'].apply(lambda x :[i.replace(" ", "") for i in x])
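
A likely reason for stripping the spaces (this is an assumption; the notebook does not state it) is that the tags are later split into words, so a name like "Sam Worthington" would otherwise contribute a separate "sam" token that matches any other "Sam". The sketch below illustrates this with two example names; `tokens` is a hypothetical helper, not part of the notebook.

```python
def tokens(names, strip_spaces):
    # Turn a list of names into the set of lowercase word tokens they produce.
    out = set()
    for name in names:
        if strip_spaces:
            out.add(name.replace(" ", "").lower())
        else:
            out.update(name.lower().split())
    return out

film_a = ["Sam Worthington"]
film_b = ["Sam Mendes"]

shared_with_spaces = tokens(film_a, False) & tokens(film_b, False)
shared_without_spaces = tokens(film_a, True) & tokens(film_b, True)

print(shared_with_spaces)     # {'sam'}
print(shared_without_spaces)  # set()
```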

In [23]:
movies['tags'] = movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']

In [24]:
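# .copy() gives an independent frame, so the column assignments below do not
# trigger pandas' SettingWithCopyWarning.
new_df = movies[['movie_id','title','tags']].copy()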

In [None]:
new_df['tags'] = new_df['tags'].apply(lambda x:" ".join(x))

In [None]:
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())

In [28]:
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."


### >> Vectorization

In [30]:
new_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

In [31]:
new_df['tags'][1]

"captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley goreverbinski"

#### Vectorization and distance
The distance between two vectors tells us how similar or different two movies are.

How does it work?
Each movie (or user) becomes a vector of numbers based on features such as genre, actors, etc.
We then measure how close these vectors are, using measures such as:

- Cosine similarity (based on the angle between the vectors)
- Euclidean distance (the straight-line distance between them)

Interpretation:

- Smaller distance / higher similarity score → movies are more similar
- Larger distance / lower similarity score → movies are less similar

To build intuition, try this with a small example of n vectors in 2 dimensions (an n × 2 array) and plot them on a graph.
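
The two measures listed above can be computed by hand for a few 2-dimensional vectors. This is a minimal sketch: the vectors and their meaning (an "action" score and a "romance" score) are made up for illustration, and the helper functions are written out here instead of being taken from a library.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between a and b: 1 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # Straight-line distance between the two points.
    return float(np.linalg.norm(a - b))

# Hypothetical feature vectors: (action score, romance score).
movie_a = np.array([3.0, 1.0])
movie_b = np.array([6.0, 2.0])  # same direction as movie_a, twice as long
movie_c = np.array([1.0, 3.0])  # different direction, similar length

cos_ab = cosine_similarity(movie_a, movie_b)
cos_ac = cosine_similarity(movie_a, movie_c)
dist_ab = euclidean_distance(movie_a, movie_b)
dist_ac = euclidean_distance(movie_a, movie_c)

print(round(cos_ab, 3), round(cos_ac, 3))    # 1.0 0.6
print(round(dist_ab, 3), round(dist_ac, 3))  # 3.162 2.828
```

Note that the two measures disagree here: by angle, `movie_a` is closest to `movie_b`, while by straight-line distance it is closer to `movie_c`. Cosine similarity ignores vector length, which is one reason it is often preferred for word-count vectors.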


#### Techniques for converting text into vectors
Bag of words, TF-IDF, one-hot encoding (sparse vectors of 0s and 1s), embeddings
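
As a rough illustration of the first technique in the list, bag of words, the sketch below builds count vectors by hand for three short, made-up tag strings. The names `docs`, `vocabulary`, and `bag_of_words` are hypothetical; in practice a library class such as scikit-learn's `CountVectorizer` does the same job, and this section does not show which technique the notebook goes on to use.

```python
from collections import Counter

# Hypothetical tag strings, shortened for readability.
docs = [
    "action adventure space alien",
    "adventure pirate ship ocean",
    "space alien space war",
]

# The vocabulary is every distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(text, vocabulary):
    # Count how often each vocabulary word occurs in the text.
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
# ['action', 'adventure', 'alien', 'ocean', 'pirate', 'ship', 'space', 'war']
for vec in vectors:
    print(vec)
# [1, 1, 1, 0, 0, 0, 1, 0]
# [0, 1, 0, 1, 1, 1, 0, 0]
# [0, 0, 1, 0, 0, 0, 2, 1]
```

Each movie is now a vector of the same length, so the similarity measures described earlier can be applied directly.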