# Recommender Systems

Recommender systems, also known as recommendation systems or recommendation engines, are software applications or algorithms that give users personalized suggestions. The suggestions can cover products, movies, music, books, news articles, or other items of interest. The goal is to help users discover relevant items that they might not have found on their own.

Recommender systems are usually grouped into two main approaches:

1. Content-Based Recommender Systems:
   - Content-based recommenders analyze the characteristics of items and user profiles to make recommendations.
   - They recommend items that are similar in content to those that a user has shown interest in.
   - For example, in a content-based movie recommendation system, if a user has previously liked action movies, the system will recommend other action movies based on their content characteristics (e.g., genre, actors, director).

2. Collaborative Filtering Recommender Systems:
   - Collaborative filtering methods make recommendations based on the behavior and preferences of users within a community or group.
   - They can be further divided into two subtypes: user-based and item-based collaborative filtering.
   - User-based collaborative filtering recommends items to a user based on the preferences and behaviors of similar users.
   - Item-based collaborative filtering recommends items that are similar to the ones the user has already shown interest in, where similarity is derived from how users interact with the items rather than from item content.
   
Additionally, there are hybrid recommender systems that combine both content-based and collaborative filtering approaches to improve recommendation accuracy and coverage.
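
To make the content-based idea from the list above concrete, here is a minimal sketch in plain Python. The catalogue, the tag sets, and the helper names `jaccard` and `recommend` are made up for illustration. Jaccard overlap is only one of several possible similarity measures, and it is not necessarily the one used later in this project.

```python
# Toy catalogue: each movie is described by a set of content tags (made-up data).
catalogue = {
    "Movie A": {"action", "space", "alien"},
    "Movie B": {"action", "space", "war"},
    "Movie C": {"romance", "comedy"},
}


def jaccard(a, b):
    """Share of tags two items have in common (0 = nothing shared, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(liked, catalogue, top_n=2):
    """Rank every other item by its tag overlap with the liked item."""
    scores = {
        title: jaccard(catalogue[liked], tags)
        for title, tags in catalogue.items()
        if title != liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("Movie A", catalogue))  # ['Movie B', 'Movie C']
```

A user who liked "Movie A" is pointed to "Movie B" first because the two share two of their four distinct tags, while "Movie C" shares none.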


Recommender systems have wide-ranging applications in e-commerce, entertainment, social media, and information retrieval. They are essential for helping users find products, content, and services tailored to their preferences, ultimately enhancing user satisfaction and engagement.

# Stages of Project

Project flow stages:
    
1. Data gathering

2. Preprocessing

3. Model

4. Deploy

5. Website

# Data Gathering and basic EDA

Dataset: TMDB 5000 Movie Dataset
    
Link: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv 

In [1]:
import numpy as np
import pandas as pd

In [2]:
movies = pd.read_csv(
    "/Users/rajatchauhan/Desktop/Machine Learning Notes/Projects/3. Movie Recommender System/tmdb_5000_movies.csv")
credits = pd.read_csv(
    "/Users/rajatchauhan/Desktop/Machine Learning Notes/Projects/3. Movie Recommender System/tmdb_5000_credits.csv")

In [3]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
movies.shape

(4803, 20)

In [5]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [6]:
credits.shape

(4803, 4)

In [7]:
credits["cast"][0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

This is a long list of dictionaries, one per cast member, each with the following fields:

    1. cast_id
    2. character
    3. credit_id
    4. gender
    5. id
    6. name
    7. order

In [8]:
credits["crew"][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

It is striking how many people are involved in making a film. This is again a long list of dictionaries, with fields like:
    
    1. credit_id
    2. department 
      (There are various departments like: Editing, Art, Sound, Production, Directing, Writing, Visual Effects, Costume & Make-Up, Crew and so on)
    3. gender
    4. id
    5. job
    6. name

## Merging these two datasets together:

In [9]:
movies = movies.merge(credits, on = "title")
movies.shape

(4809, 23)
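
The merged frame has 4809 rows, six more than either input (4803). One plausible explanation is that a few titles occur more than once, because a merge pairs every matching row on the left with every matching row on the right. The toy sketch below uses made-up rows (`movies_toy` and `credits_toy` are hypothetical stand-ins) to illustrate the effect and to show that joining on the id columns avoids it. Whether `id` and `movie_id` line up for every row of the real data is an assumption worth checking.

```python
import pandas as pd

# Made-up rows: two different movies share the title "Batman".
movies_toy = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]}
)
credits_toy = pd.DataFrame(
    {
        "movie_id": [1, 2, 3],
        "title": ["Batman", "Batman", "Avatar"],
        "cast": ["cast of 1", "cast of 2", "cast of 3"],
    }
)

# Merging on a non-unique key pairs each "Batman" with every other "Batman".
by_title = movies_toy.merge(credits_toy, on="title")
print(len(by_title))  # 5 rows from 3 movies

# Merging on the unique id keeps one row per movie.
by_id = movies_toy.merge(credits_toy, left_on="id", right_on="movie_id")
print(len(by_id))  # 3

# validate= makes pandas raise instead of silently duplicating rows.
try:
    movies_toy.merge(credits_toy, on="title", validate="one_to_one")
except pd.errors.MergeError as err:
    print("merge key is not unique:", err)
```

Note that merging on the id columns leaves both `title` columns in the result (suffixed `_x` and `_y`), so one of them would need to be dropped afterwards.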

In [10]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


## Choosing relevant columns

We will build a content-based recommender.

It works on the basis of tags: each movie is described by a set of tags, and movies with similar tags are treated as similar.

So we need to identify the columns that describe a movie from the user's perspective, since those are the ones we can turn into useful tags.

In [11]:
movies["keywords"][0]

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [12]:
movies["original_language"].value_counts()

en    4510
fr      70
es      32
zh      27
de      27
hi      19
ja      16
it      14
ko      12
cn      12
ru      11
pt       9
da       7
sv       5
nl       4
fa       4
th       3
he       3
ta       2
cs       2
ro       2
id       2
ar       2
vi       1
sl       1
ps       1
no       1
ky       1
hu       1
pl       1
af       1
nb       1
tr       1
is       1
xx       1
te       1
el       1
Name: original_language, dtype: int64

In [13]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')

Let us go through the columns one by one and decide which are relevant for us:

 1. budget: (no - unlikely to have any direct effect on user preferences)
 2. genres: (yes - very important)
 3. homepage: (no)
 4. id: (no - this is the TMDB id, but the credits data carries the same id as movie_id, which we keep instead)
 5. keywords: (yes - these are basically tags, so again very important)
 6. original_language: (no - it is mostly just English)
 7. original_title: (no - we can use the title column instead)
 8. overview: (yes - the overview can tell us how similar two movies are)
 9. popularity: (no - a numeric column, not useful for creating tags)
 10. production_companies: (no)
 11. production_countries: (no)
 12. release_date: (no - although it could be useful, as different age groups may prefer older or newer movies)
 13. revenue: (no)
 14. runtime: (no)
 15. spoken_languages: (no)
 16. status: (no)
 17. tagline: (no - taglines can be very vague, so we keep the overview column instead)
 18. title: (yes)
 19. vote_average: (no - numeric column)
 20. vote_count: (no - numeric column)
 21. movie_id: (yes - this is the TMDB id, so it will be used to fetch movie images from the TMDB site)
 22. cast: (yes - important, as we often pick movies based on our favourite actors)
 23. crew: (yes - we often pick movies based on our favourite directors)

In [14]:
movies = movies[["movie_id", "title", "overview", "genres", "keywords", "cast", "crew"]]
movies

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...,...
4804,9367,El Mariachi,El Mariachi just wants to play his guitar and ...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 5616, ""name"": ""united states\u2013mexi...","[{""cast_id"": 1, ""character"": ""El Mariachi"", ""c...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4805,72766,Newlyweds,A newlywed couple's honeymoon is upended by th...,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",[],"[{""cast_id"": 1, ""character"": ""Buzzy"", ""credit_...","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4806,231617,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...","[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...","[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...","[{""cast_id"": 8, ""character"": ""Oliver O\u2019To...","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4807,126186,Shanghai Calling,When ambitious New York attorney Sam is sent t...,[],[],"[{""cast_id"": 3, ""character"": ""Sam"", ""credit_id...","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


In [15]:
movies.shape

(4809, 7)

# Data Preprocessing

- Checking for missing values

In [16]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

We can drop these 3 records where overview is missing

In [17]:
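# Reassign rather than use inplace=True: movies is a column subset, and in-place edits on it can trigger SettingWithCopyWarning
movies = movies.dropna()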

In [18]:
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

- Checking for duplicates

In [19]:
movies.duplicated().sum()

0

# Feature extraction

We need to convert the existing movies dataframe into these three columns:
    
    1. movie_id
    2. title
    3. tags
    
    
Tags would be formed by merging the remaining columns: overview + genres + keywords + cast + crew
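
As a rough sketch of where this is heading, the function below builds a single tag string for one movie once every column has been turned into a list of words. The helper name `build_tags`, the sample row, and the lowercasing step are assumptions made for illustration. The `crew` entry stands for whatever crew-derived list ends up being used.

```python
def build_tags(row):
    """Concatenate the list-valued columns of one movie into a single tag string."""
    parts = (
        row["overview"]
        + row["genres"]
        + row["keywords"]
        + row["cast"]
        + row["crew"]
    )
    # Lowercasing is an assumption: it makes "Action" and "action" the same tag.
    return " ".join(parts).lower()


# Made-up row in the shape the columns are expected to have after preprocessing.
row = {
    "overview": ["A", "marine", "on", "Pandora"],
    "genres": ["Action", "ScienceFiction"],
    "keywords": ["spacewar"],
    "cast": ["SamWorthington"],
    "crew": ["JamesCameron"],
}

print(build_tags(row))
```

Because a DataFrame row supports the same `row["column"]` lookups, a function of this shape could also be passed to `DataFrame.apply` with `axis=1`.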

### 1. Genres column

In [20]:
movies["genres"][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

Note that this value is a string, not a Python list.

The string is in JSON format, so we can parse it with the json module.

In [21]:
import json

In [22]:
data = json.loads(movies["genres"][0])
data

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [23]:
def convert(string):
    """Parse a JSON string holding a list of dicts and return their "name" values."""
    data = json.loads(string)
    names = [item["name"] for item in data]
    return names

In [24]:
convert(movies["genres"][0])

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

In [25]:
movies["genres"].apply(convert)

0       [Action, Adventure, Fantasy, Science Fiction]
1                        [Adventure, Fantasy, Action]
2                          [Action, Adventure, Crime]
3                    [Action, Crime, Drama, Thriller]
4                [Action, Adventure, Science Fiction]
                            ...                      
4804                        [Action, Crime, Thriller]
4805                                [Comedy, Romance]
4806               [Comedy, Drama, Romance, TV Movie]
4807                                               []
4808                                    [Documentary]
Name: genres, Length: 4806, dtype: object

In [26]:
movies["genres"].apply(convert)[0]

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

So, this is working fine
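
The function works here because every cell holds valid JSON. If a cell were missing or malformed, `json.loads` would raise. A more defensive variant could look like the sketch below. The name `safe_convert` is hypothetical, and returning an empty list on bad input is an assumption about the desired behaviour.

```python
import json


def safe_convert(value):
    """Like convert, but return [] for missing or malformed input instead of raising."""
    if not isinstance(value, str):
        return []  # e.g. NaN from an empty cell
    try:
        data = json.loads(value)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []  # valid JSON, but not the expected list of dicts
    return [item["name"] for item in data if isinstance(item, dict) and "name" in item]


print(safe_convert('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'))
print(safe_convert(float("nan")))
print(safe_convert("not json"))
```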

In [27]:
movies["genres"] = movies["genres"].apply(convert)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


### 2. Keywords Column

In [28]:
movies["keywords"][0]

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

Here too we can apply the same function to extract the keyword names.

In [29]:
movies['keywords'] = movies["keywords"].apply(convert)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


### 3. Cast column

In [30]:
movies["cast"][0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

Here too we can extract names, but only for the top 3 cast members.

We will make a small change to our convert function:

In [31]:
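def convert3(string):
    """Return the names of the first three entries in a JSON list string."""
    data = json.loads(string)
    names = [item["name"] for item in data]
    names = names[:3]  # keep only the first three cast members
    return names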

In [32]:
convert3(movies["cast"][0])

['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver']
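
Taking the first three entries relies on the list already being stored in billing order. Each cast entry also carries an `order` field, so a variant that sorts on it explicitly could look like the sketch below. The name `top_cast` is hypothetical, and reading `order` as the billing position is an assumption based on the sample shown earlier.

```python
import json


def top_cast(string, n=3):
    """Return the names of the n cast members with the lowest "order" value."""
    data = json.loads(string)
    ordered = sorted(data, key=lambda member: member["order"])
    return [member["name"] for member in ordered[:n]]


# Made-up input: the same kind of entries as above, deliberately shuffled.
sample = json.dumps(
    [
        {"name": "Sigourney Weaver", "order": 2},
        {"name": "Sam Worthington", "order": 0},
        {"name": "Stephen Lang", "order": 3},
        {"name": "Zoe Saldana", "order": 1},
    ]
)

print(top_cast(sample))  # ['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver']
```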

In [33]:
movies["cast"] = movies["cast"].apply(convert3)
movies

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
...,...,...,...,...,...,...,...
4804,9367,El Mariachi,El Mariachi just wants to play his guitar and ...,"[Action, Crime, Thriller]","[united states–mexico barrier, legs, arms, pap...","[Carlos Gallardo, Jaime de Hoyos, Peter Marqua...","[{""credit_id"": ""52fe44eec3a36847f80b280b"", ""de..."
4805,72766,Newlyweds,A newlywed couple's honeymoon is upended by th...,"[Comedy, Romance]",[],"[Edward Burns, Kerry Bishé, Marsha Dietlein]","[{""credit_id"": ""52fe487dc3a368484e0fb013"", ""de..."
4806,231617,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...","[Comedy, Drama, Romance, TV Movie]","[date, love at first sight, narration, investi...","[Eric Mabius, Kristin Booth, Crystal Lowe]","[{""credit_id"": ""52fe4df3c3a36847f8275ecf"", ""de..."
4807,126186,Shanghai Calling,When ambitious New York attorney Sam is sent t...,[],[],"[Daniel Henney, Eliza Coupe, Bill Paxton]","[{""credit_id"": ""52fe4ad9c3a368484e16a36b"", ""de..."


### 4. Crew column

Let us look at the crew column. We are interested in finding the director of each movie.

In [34]:
movies["crew"][0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

So, we just want the name from the entries where job is "Director"

In [35]:
json.loads(movies["crew"][0])

[{'credit_id': '52fe48009251416c750aca23',
  'department': 'Editing',
  'gender': 0,
  'id': 1721,
  'job': 'Editor',
  'name': 'Stephen E. Rivkin'},
 {'credit_id': '539c47ecc3a36810e3001f87',
  'department': 'Art',
  'gender': 2,
  'id': 496,
  'job': 'Production Design',
  'name': 'Rick Carter'},
 {'credit_id': '54491c89c3a3680fb4001cf7',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Sound Designer',
  'name': 'Christopher Boyes'},
 {'credit_id': '54491cb70e0a267480001bd0',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Supervising Sound Editor',
  'name': 'Christopher Boyes'},
 {'credit_id': '539c4a4cc3a36810c9002101',
  'department': 'Production',
  'gender': 1,
  'id': 1262,
  'job': 'Casting',
  'name': 'Mali Finn'},
 {'credit_id': '5544ee3b925141499f0008fc',
  'department': 'Sound',
  'gender': 2,
  'id': 1729,
  'job': 'Original Music Composer',
  'name': 'James Horner'},
 {'credit_id': '52fe48009251416c750ac9c3',
  'department': 'Directing',
  

In [36]:
def fetch_director_name(string):
    """Return the names of all crew members whose job is "Director"."""
    data = json.loads(string)
    name = []
    for dictionary in data:
        if dictionary["job"] == "Director":
            name.append(dictionary["name"])
    return name

In [37]:
fetch_director_name(movies["crew"][0])

['James Cameron']

In [38]:
movies["director"] = movies["crew"].apply(fetch_director_name)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,director
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",[Gore Verbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",[Sam Mendes]
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",[Christopher Nolan]
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",[Andrew Stanton]


We do not require the crew column now

In [39]:
movies.drop(columns = "crew", inplace = True)

In [40]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,director
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


### 5. Overview Column

In [41]:
movies["overview"][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

This is a plain string. We can convert this column to a list of words as well, so that we can concatenate it with the other columns.

In [42]:
movies["overview"][0].split()

['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.']

In [43]:
movies["overview"] = movies["overview"].apply(lambda x: x.split())
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,director
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


In [44]:
movies["overview"][0]

['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.']
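
Note that a plain `split()` keeps punctuation attached to words, as in `'century,'` and `'civilization.'` above, so `'century,'` and `'century'` would count as different tokens. This may or may not matter for the later steps. If it does, a regex-based tokenizer is one option. The sketch below is an optional alternative and not what this notebook does, and the helper name `tokenize` is hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, dropping surrounding punctuation."""
    # Letters and digits, optionally joined by an inner apostrophe or hyphen.
    return re.findall(r"[a-z0-9]+(?:['’-][a-z0-9]+)*", text.lower())


print(tokenize("In the 22nd century, a paraplegic Marine is dispatched."))
```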

## Transformation

In [45]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,director
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


We need to remove the spaces inside multi-word entries so that each name or phrase becomes a single tag:

    Example: Sam Worthington should become the single tag SamWorthington
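
The reason for collapsing the spaces shows up once the tags are split into words: a name such as "Sam Worthington" would otherwise become two separate tokens, and "Sam" would then match every other Sam in the dataset. A small pure-Python sketch of the idea (the `collapse_spaces` helper and the two name lists are illustrative, not part of the notebook):

```python
def collapse_spaces(tags):
    """Remove the spaces inside each tag so it survives word-level splitting."""
    return [tag.replace(" ", "") for tag in tags]

names_a = ["Sam Worthington", "Zoe Saldana"]
names_b = ["Sam Mendes"]

# Without collapsing, splitting on whitespace makes the two lists share "Sam".
words_a = set(" ".join(names_a).split())
words_b = set(" ".join(names_b).split())
shared_before = words_a & words_b

# After collapsing, each full name is one token and nothing is shared.
shared_after = set(collapse_spaces(names_a)) & set(collapse_spaces(names_b))

print(shared_before)  # {'Sam'}
print(shared_after)   # set()
```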

In [46]:
movies["genres"].apply(lambda x: [i.replace(" ", "") for i in x] )

0       [Action, Adventure, Fantasy, ScienceFiction]
1                       [Adventure, Fantasy, Action]
2                         [Action, Adventure, Crime]
3                   [Action, Crime, Drama, Thriller]
4                [Action, Adventure, ScienceFiction]
                            ...                     
4804                       [Action, Crime, Thriller]
4805                               [Comedy, Romance]
4806               [Comedy, Drama, Romance, TVMovie]
4807                                              []
4808                                   [Documentary]
Name: genres, Length: 4806, dtype: object

In [47]:
movies["genres"] = movies["genres"].apply(lambda x: [i.replace(" ", "") for i in x] )
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,director
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]


In [48]:
movies["keywords"] = movies["keywords"].apply(lambda x: [i.replace(" ", "") for i in x] )
movies["cast"] = movies["cast"].apply(lambda x: [i.replace(" ", "") for i in x] )
movies["director"] = movies["director"].apply(lambda x: [i.replace(" ", "") for i in x] )
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,director
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


Cool, we have removed the spaces now

## Creating tags column

We create it by concatenating all five list columns:

In [49]:
movies["tags"] = movies["overview"] + movies["genres"] + movies["keywords"] + movies["cast"] + movies["director"]
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,director,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."


In [50]:
movies["tags"][0]

['In',
 'the',
 '22nd',
 'century,',
 'a',
 'paraplegic',
 'Marine',
 'is',
 'dispatched',
 'to',
 'the',
 'moon',
 'Pandora',
 'on',
 'a',
 'unique',
 'mission,',
 'but',
 'becomes',
 'torn',
 'between',
 'following',
 'orders',
 'and',
 'protecting',
 'an',
 'alien',
 'civilization.',
 'Action',
 'Adventure',
 'Fantasy',
 'ScienceFiction',
 'cultureclash',
 'future',
 'spacewar',
 'spacecolony',
 'society',
 'spacetravel',
 'futuristic',
 'romance',
 'space',
 'alien',
 'tribe',
 'alienplanet',
 'cgi',
 'marine',
 'soldier',
 'battle',
 'loveaffair',
 'antiwar',
 'powerrelations',
 'mindandsoul',
 '3d',
 'SamWorthington',
 'ZoeSaldana',
 'SigourneyWeaver',
 'JamesCameron']

Now we keep only the three columns we need: `movie_id`, `title` and `tags`.

In [51]:
df = movies[["movie_id", "title", "tags"]]
df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


### 1. Converting to string

We will convert this tags list back to a string

In [52]:
df["tags"].apply(lambda x: " ".join(x))

0       In the 22nd century, a paraplegic Marine is di...
1       Captain Barbossa, long believed to be dead, ha...
2       A cryptic message from Bond’s past sends him o...
3       Following the death of District Attorney Harve...
4       John Carter is a war-weary, former military ca...
                              ...                        
4804    El Mariachi just wants to play his guitar and ...
4805    A newlywed couple's honeymoon is upended by th...
4806    "Signed, Sealed, Delivered" introduces a dedic...
4807    When ambitious New York attorney Sam is sent t...
4808    Ever since the second grade when he first saw ...
Name: tags, Length: 4806, dtype: object

In [53]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# (df was created as a column selection of movies)
df = df.assign(tags=df["tags"].apply(lambda x: " ".join(x)))
df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [54]:
df["tags"][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

Cool, this is a string that we have just made

## Defining Problem statement

We will ask the user for a movie name and, based on that movie, suggest similar movies.

The suggestions are based on how similar the movies' `tags` are.

One simple approach would be to count how many words two tag strings have in common, but text vectorization is a better option: it turns every movie into a numeric vector, so similarity can be computed for all pairs of movies at once.

Once the text vectors are ready, we find the 5 vectors closest to the chosen movie's vector and recommend those movies to the user.
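
To make the "count the shared words" idea concrete, here is a minimal sketch of it on three made-up tag strings (the strings and the `shared_word_count` helper are illustrative assumptions, not data or code from the notebook). It works, but it compares one pair at a time and ignores how often a word occurs, which is part of the motivation for moving to vectors.

```python
def shared_word_count(tags_a, tags_b):
    """Number of distinct words that appear in both tag strings."""
    return len(set(tags_a.split()) & set(tags_b.split()))

# Made-up tag strings for illustration only
toy_tags = {
    "Space Movie": "alien space marine battle",
    "Alien Sequel": "alien space soldier battle",
    "Love Story": "romance paris wedding",
}

query = "Space Movie"
scores = {
    title: shared_word_count(toy_tags[query], tags)
    for title, tags in toy_tags.items()
    if title != query
}
best_match = max(scores, key=scores.get)

print(scores)      # {'Alien Sequel': 3, 'Love Story': 0}
print(best_match)  # Alien Sequel
```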

## Text Preprocessing

Before that, we need some text preprocessing:

- 1. Lowercasing
- 2. Removing tokens that are not alphanumeric
- 3. Removing stop words and punctuation
- 4. Stemming

We have the function for this from the email classifier project

In [55]:
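import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
nltk.download('stopwords')
# Note: nltk.word_tokenize additionally needs NLTK's Punkt tokenizer data to be installed.

# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words("english"))

def transform_text(text):
    # 1. Lowercase and split into tokens
    text = text.lower()
    word_list = nltk.word_tokenize(text)

    # 2. Keep only purely alphanumeric tokens (drops punctuation and tokens such as "war-weary")
    new_list = [i for i in word_list if i.isalnum()]

    # 3. Remove stop words and punctuation
    new_list2 = [i for i in new_list if i not in stop_words and i not in string.punctuation]

    # 4. Stem each remaining token
    new_list3 = [ps.stem(i) for i in new_list2]

    return " ".join(new_list3)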

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rajatchauhan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
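
To see what the first three steps do without installing NLTK, here is a rough standard-library sketch. It is a simplification, not the function above: the stop-word set is a tiny hand-picked sample, tokens are split on whitespace rather than with `word_tokenize`, and stemming is left out entirely. The name `toy_transform` is hypothetical.

```python
import re

# Tiny illustrative stop-word list; NLTK's English list is much longer
TOY_STOP_WORDS = {"a", "an", "and", "is", "the", "to", "in", "on", "but"}

def toy_transform(text):
    """Lowercase, keep alphanumeric tokens, drop stop words (no stemming)."""
    tokens = re.findall(r"\S+", text.lower())             # lowercase, split on whitespace
    tokens = [t.strip(".,!?;:") for t in tokens]          # peel punctuation off the edges
    tokens = [t for t in tokens if t.isalnum()]           # keep alphanumeric tokens only
    return [t for t in tokens if t not in TOY_STOP_WORDS] # drop stop words

result = toy_transform("In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora")
print(result)
# ['22nd', 'century', 'paraplegic', 'marine', 'dispatched', 'moon', 'pandora']
```

Stemming would then shorten these tokens further, for example `century` to `centuri`, as the real output of `transform_text` in the next cells shows.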


In [56]:
df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."


In [57]:
df["tags"].apply(transform_text)

0       22nd centuri parapleg marin dispatch moon pand...
1       captain barbossa long believ dead come back li...
2       cryptic messag bond past send trail uncov sini...
3       follow death district attorney harvey dent bat...
4       john carter former militari captain inexplic t...
                              ...                        
4804    el mariachi want play guitar carri famili trad...
4805    newlyw coupl honeymoon upend arriv respect sis...
4806    sign seal deliv introduc dedic quartet civil s...
4807    ambiti new york attorney sam sent shanghai ass...
4808    ever sinc second grade first saw extraterrestr...
Name: tags, Length: 4806, dtype: object

In [58]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(tags=df["tags"].apply(transform_text))
df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,22nd centuri parapleg marin dispatch moon pand...
1,285,Pirates of the Caribbean: At World's End,captain barbossa long believ dead come back li...
2,206647,Spectre,cryptic messag bond past send trail uncov sini...
3,49026,The Dark Knight Rises,follow death district attorney harvey dent bat...
4,49529,John Carter,john carter former militari captain inexplic t...


## Vectorization

Count Vectorizer is a feature extraction technique commonly used in natural language processing (NLP) and machine learning. It converts a collection of text documents (a corpus) into a matrix of word counts, a numerical format that can be used for tasks such as text classification, clustering, or information retrieval. It is an implementation of the Bag of Words (BoW) model.
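
The bag-of-words idea can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: it splits on whitespace only, and the `bag_of_words` function and the toy documents are made up for the example.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    """Return (vocabulary, count vectors) for whitespace-separated documents."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Keep the most frequent words, then use alphabetical order for the columns
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    vectors = [[doc.split().count(word) for word in vocab] for doc in docs]
    return vocab, vectors

toy_docs = ["alien space battl alien", "space marin battl", "romanc wed space"]
vocab, vectors = bag_of_words(toy_docs, max_features=3)

print(vocab)    # ['alien', 'battl', 'space']
print(vectors)  # [[2, 1, 1], [0, 1, 1], [0, 0, 1]]
```

Each row is one document and each column counts one vocabulary word, which is the same layout as the array produced by `cv.fit_transform(...).toarray()` in the cells that follow.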

In [59]:
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
cv = CountVectorizer(max_features=5000)

In [61]:
cv.fit_transform(df["tags"]).toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [62]:
vectors = cv.fit_transform(df["tags"]).toarray()

We can see which 5000 words the Count Vectorizer picked using the `get_feature_names_out` method (it replaces `get_feature_names`, which was removed in scikit-learn 1.2):

In [63]:
cv.get_feature_names_out().tolist()



['007',
 '10',
 '100',
 '11',
 '12',
 '13',
 '14',
 '15',
 '150',
 '16',
 '17',
 '18',
 '18th',
 '18thcenturi',
 '1910',
 '1920',
 '1930',
 '1940',
 '1950',
 '1960',
 '1970',
 '1980',
 '1990',
 '1999',
 '19th',
 '19thcenturi',
 '20',
 '2003',
 '2009',
 '20th',
 '24',
 '25',
 '30',
 '300',
 '3d',
 '40',
 '50',
 '60',
 '70',
 'aaron',
 'aaroneckhart',
 'abandon',
 'abbi',
 'abduct',
 'abigailbreslin',
 'abil',
 'abl',
 'aboard',
 'aborigin',
 'absenc',
 'absurd',
 'abus',
 'academ',
 'academi',
 'accept',
 'access',
 'accid',
 'accident',
 'acclaim',
 'accompani',
 'accomplish',
 'account',
 'accus',
 'ace',
 'achiev',
 'acquaint',
 'across',
 'act',
 'action',
 'actionhero',
 'activ',
 'activist',
 'actor',
 'actress',
 'actual',
 'ad',
 'adam',
 'adammckay',
 'adamsandl',
 'adamshankman',
 'adapt',
 'add',
 'addict',
 'addit',
 'adjust',
 'admir',
 'admit',
 'adolesc',
 'adopt',
 'ador',
 'adrienbrodi',
 'adrift',
 'adult',
 'adultanim',
 'adulteri',
 'adulthood',
 'advanc',
 'advantag

In [64]:
len(cv.get_feature_names_out())

5000

## Similarity Score Approach

Now we have a vector for the tags of every movie.
These vectors live in a 5000-dimensional space.

To find out which vectors are close to each other, we need a measure of distance (or similarity) between every pair of vectors.

For distance we have two options

1. **Euclidean Distance:** The Euclidean distance, also known as L2 norm, is a common distance metric in vector spaces. It measures the straight-line distance between two points in space and is calculated as the square root of the sum of the squared differences between corresponding elements of the vectors.

2. **Cosine Similarity:** Cosine similarity measures the cosine of the angle between two vectors. It is often used for text or high-dimensional data, as it is robust to the vector's length. A cosine similarity of 1 indicates that the vectors are pointing in the same direction, while -1 means they are pointing in opposite directions.

Euclidean distance is sensitive to the length of the vectors (a long tag string and a short one look far apart even when they use the same words) and becomes a less reliable measure in high-dimensional data (the curse of dimensionality), so for text or other high-dimensional data it is generally advised to use cosine similarity.

Cosine similarity is the cosine of the angle between the two vectors.

The smaller the angle, the closer the two vectors are and the higher the cosine similarity.
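
The difference between the two measures can be checked on a few hand-made count vectors. The sketch below uses only the standard library; the vectors and the helper names `euclidean_distance` and `cosine_sim` are illustrative and not part of the notebook.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1, 1, 0]   # word counts of a short tag string
long_doc = [3, 3, 0]    # same words, each used three times
other_doc = [0, 0, 2]   # completely different words

print(euclidean_distance(short_doc, long_doc))  # about 2.83: looks "far" despite identical wording
print(cosine_sim(short_doc, long_doc))          # about 1.0: same direction
print(cosine_sim(short_doc, other_doc))         # 0.0: no shared words
```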

In [65]:
from sklearn.metrics.pairwise import cosine_similarity

For our count vectors, which have no negative entries, cosine_similarity returns a number between 0 and 1:

0 means least similar and 1 means most similar (vectors pointing in exactly the same direction).

In [66]:
vectors.shape

(4806, 5000)

In [67]:
cosine_similarity(vectors)

array([[1.        , 0.0836242 , 0.08119979, ..., 0.06362848, 0.02414023,
        0.        ],
       [0.0836242 , 1.        , 0.08827348, ..., 0.02305715, 0.        ,
        0.        ],
       [0.08119979, 0.08827348, 1.        , ..., 0.04477737, 0.        ,
        0.        ],
       ...,
       [0.06362848, 0.02305715, 0.04477737, ..., 1.        , 0.03993615,
        0.02068572],
       [0.02414023, 0.        , 0.        , ..., 0.03993615, 1.        ,
        0.02354408],
       [0.        , 0.        , 0.        , ..., 0.02068572, 0.02354408,
        1.        ]])

In [68]:
cosine_similarity(vectors).shape

(4806, 4806)

Cool, this is a 4806 by 4806 matrix, which is why we learned about matrices.

So we have the similarity score of each of the 4806 movies with every one of the 4806 movies (including itself, which is why the diagonal is 1).
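
A matrix of this shape is not small. As a back-of-the-envelope sketch, assuming each entry is stored as an 8-byte float64 (an assumption about the array's dtype, which the notebook does not print), its size in memory works out as follows:

```python
n_movies = 4806        # rows and columns of the similarity matrix
bytes_per_entry = 8    # assumption: float64 entries

total_bytes = n_movies * n_movies * bytes_per_entry
print(total_bytes)                  # 184781088
print(round(total_bytes / 1e6, 1))  # 184.8 (megabytes)
```

This is worth keeping in mind when the matrix is saved to disk in the pickle section.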

In [69]:
similarity_score_matrix = cosine_similarity(vectors)

## Function to find out top 5 similar movies

We need the index of the movie

In [70]:
df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,22nd centuri parapleg marin dispatch moon pand...
1,285,Pirates of the Caribbean: At World's End,captain barbossa long believ dead come back li...
2,206647,Spectre,cryptic messag bond past send trail uncov sini...
3,49026,The Dark Knight Rises,follow death district attorney harvey dent bat...
4,49529,John Carter,john carter former militari captain inexplic t...


In [71]:
df[df["title"] == "Avatar"].index[0]

0

In [72]:
df[df["title"] == "Batman Begins"].index[0]

119

In [73]:
similarity_score_matrix[0]

array([1.        , 0.0836242 , 0.08119979, ..., 0.06362848, 0.02414023,
       0.        ])

In [74]:
similarity_score_matrix[119]

array([0.04358136, 0.04737794, 0.09200874, ..., 0.09012301, 0.02051525,
       0.02125256])

### Enumerate Function

Now we can sort the similarity scores in descending order to find the top 5 movies:

In [75]:
sorted(similarity_score_matrix[0], reverse= True)

[1.0000000000000004,
 0.2466466911319812,
 0.24419314525275215,
 0.2401922307076307,
 0.24019223070763068,
 0.2342057229768425,
 0.2311250817605121,
 0.23007892341722033,
 0.21982600255746473,
 0.21977383072747692,
 0.21571674297647797,
 0.20965696734438366,
 0.20814536170751838,
 0.20672455764868075,
 0.20587905489225483,
 0.2051282051282051,
 0.2051282051282051,
 0.20131905799006777,
 0.20033416898825337,
 0.20033416898825337,
 0.20033416898825335,
 0.20016019225635892,
 0.20016019225635892,
 0.20006252931214213,
 0.19814848097530421,
 0.19795764159656265,
 0.1971041319963609,
 0.19693159036084404,
 0.19611613513818404,
 0.19611613513818402,
 0.19611613513818402,
 0.19535451620220173,
 0.19535451620220173,
 0.1921537845661046,
 0.1913897505877382,
 0.1900371558963217,
 0.18842228790639834,
 0.18685673434682065,
 0.18681617943926832,
 0.18566888396533887,
 0.18490006540840973,
 0.18427434427242972,
 0.18367958959266126,
 0.1815682598006407,
 0.1815682598006407,
 0.18156825980064067,
 

But we do not want to lose the index positions of the movies, which are required to fetch the movie titles. `enumerate` pairs each score with its position:

In [76]:
list(enumerate(similarity_score_matrix[0]))

[(0, 1.0000000000000004),
 (1, 0.08362420100070908),
 (2, 0.081199794294115),
 (3, 0.07252377242938948),
 (4, 0.18685673434682065),
 (5, 0.12503908082008883),
 (6, 0.0),
 (7, 0.1601281538050871),
 (8, 0.06163335513613657),
 (9, 0.0915018021743355),
 (10, 0.09883324222148016),
 (11, 0.09548198320525772),
 (12, 0.08920515501750789),
 (13, 0.039722906114947866),
 (14, 0.12988108336653278),
 (15, 0.06150692760785473),
 (16, 0.07502344849205331),
 (17, 0.13797289239745258),
 (18, 0.12773807700531709),
 (19, 0.08200923681047297),
 (20, 0.07161148740394328),
 (21, 0.11322770341445956),
 (22, 0.05947010334500526),
 (23, 0.08492077756084468),
 (24, 0.05264981264926564),
 (25, 0.05008354224706334),
 (26, 0.14322297480788657),
 (27, 0.18156825980064065),
 (28, 0.11503946170861017),
 (29, 0.060522753266880225),
 (30, 0.06933752452815364),
 (31, 0.14617633655117152),
 (32, 0.0806970043558403),
 (33, 0.09078412990032037),
 (34, 0.0),
 (35, 0.08492077756084468),
 (36, 0.14978617237881955),
 (37, 0.08

In [77]:
sorted(list(enumerate(similarity_score_matrix[0])), reverse= True)

[(4805, 0.0),
 (4804, 0.02414022747926338),
 (4803, 0.06362847629757777),
 (4802, 0.04622501635210243),
 (4801, 0.02120949209919259),
 (4800, 0.0),
 (4799, 0.050015632328035534),
 (4798, 0.020846909961254163),
 (4797, 0.0),
 (4796, 0.0),
 (4795, 0.0),
 (4794, 0.0),
 (4793, 0.053376051268362375),
 (4792, 0.0),
 (4791, 0.0),
 (4790, 0.05413319619607667),
 (4789, 0.06419407387663695),
 (4788, 0.0),
 (4787, 0.02001601922563589),
 (4786, 0.0),
 (4785, 0.021398024625545648),
 (4784, 0.04441155916843275),
 (4783, 0.0),
 (4782, 0.023357091793352585),
 (4781, 0.055749467333806056),
 (4780, 0.020336295869552105),
 (4779, 0.0),
 (4778, 0.0),
 (4777, 0.09656090991705352),
 (4776, 0.0),
 (4775, 0.05128205128205128),
 (4774, 0.0),
 (4773, 0.027066598098038335),
 (4772, 0.022875450543583874),
 (4771, 0.03774256780481986),
 (4770, 0.0),
 (4769, 0.0),
 (4768, 0.0),
 (4767, 0.03494282789073061),
 (4766, 0.01660451604622393),
 (4765, 0.0),
 (4764, 0.0),
 (4763, 0.0),
 (4762, 0.0),
 (4761, 0.0253184841770

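But this sorts by the index, because tuples are compared by their first element. We need to sort by the score instead, using the `key` argument: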

In [78]:
sorted(list(enumerate(similarity_score_matrix[0])), reverse= True, key= lambda x:x[1])

[(0, 1.0000000000000004),
 (2409, 0.2466466911319812),
 (1204, 0.24419314525275215),
 (507, 0.2401922307076307),
 (539, 0.24019223070763068),
 (1216, 0.2342057229768425),
 (778, 0.2311250817605121),
 (260, 0.23007892341722033),
 (1194, 0.21982600255746473),
 (1920, 0.21977383072747692),
 (61, 0.21571674297647797),
 (2999, 0.20965696734438366),
 (2786, 0.20814536170751838),
 (3730, 0.20672455764868075),
 (1831, 0.20587905489225483),
 (322, 0.2051282051282051),
 (942, 0.2051282051282051),
 (973, 0.20131905799006777),
 (151, 0.20033416898825337),
 (4192, 0.20033416898825337),
 (495, 0.20033416898825335),
 (1444, 0.20016019225635892),
 (4048, 0.20016019225635892),
 (91, 0.20006252931214213),
 (972, 0.19814848097530421),
 (582, 0.19795764159656265),
 (2971, 0.1971041319963609),
 (1089, 0.19693159036084404),
 (172, 0.19611613513818404),
 (74, 0.19611613513818402),
 (83, 0.19611613513818402),
 (300, 0.19535451620220173),
 (2204, 0.19535451620220173),
 (466, 0.1921537845661046),
 (47, 0.191389

So the movies most similar to Avatar are at indices 2409, 1204 and 507.

Which movies are these?

In [79]:
df["title"][2409]

'Aliens'

In [80]:
df["title"][1204]

'Predators'

In [81]:
df["title"][507]

'Independence Day'

We need the top 5 movies only

In [82]:
sorted(list(enumerate(similarity_score_matrix[0])), reverse= True, key= lambda x:x[1])[1:6]

[(2409, 0.2466466911319812),
 (1204, 0.24419314525275215),
 (507, 0.2401922307076307),
 (539, 0.24019223070763068),
 (1216, 0.2342057229768425)]
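
The same pattern (enumerate, sort by score, slice) can be tried on a toy list of scores. The sketch below also shows `heapq.nlargest` from the standard library as one possible alternative that avoids sorting the whole list; the scores are made up for illustration.

```python
import heapq

# Made-up similarity scores of one movie against six movies (position 0 is itself)
toy_scores = [1.0, 0.10, 0.35, 0.0, 0.42, 0.27]

# Pattern used in the notebook: pair with positions, sort by score, skip the movie itself
by_sort = sorted(enumerate(toy_scores), reverse=True, key=lambda x: x[1])[1:4]

# Alternative: ask for the 4 largest pairs directly, then drop the first one
by_heap = heapq.nlargest(4, enumerate(toy_scores), key=lambda x: x[1])[1:]

print(by_sort)  # [(4, 0.42), (2, 0.35), (5, 0.27)]
print(by_heap)  # [(4, 0.42), (2, 0.35), (5, 0.27)]
```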

We can now wrap the steps above in a function:

In [83]:
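def recommend(movie):
    # Use the positional row number: the similarity matrix is positional,
    # while df's index labels have gaps (4806 rows, labels up to 4808).
    movie_pos = df["title"].tolist().index(movie)
    scores = similarity_score_matrix[movie_pos]
    # Sort (position, score) pairs by score and skip the movie itself at rank 0
    top_5_movies_list = sorted(enumerate(scores), reverse=True, key=lambda x: x[1])[1:6]

    for pos, _ in top_5_movies_list:
        print(df["title"].iloc[pos])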

In [84]:
recommend("Avatar")

Aliens
Predators
Independence Day
Titan A.E.
Aliens vs Predator: Requiem


In [85]:
recommend("Batman Begins")

The Dark Knight
The Dark Knight Rises
Batman
Batman v Superman: Dawn of Justice
Batman


That is so cool, this is working like magic.
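
A variant that returns the titles instead of printing them, and that handles unknown titles explicitly, may be easier to reuse in an app. The sketch below uses plain lists and a made-up 4x4 similarity matrix so that it runs on its own; `recommend_titles`, its parameters and the toy titles are hypothetical, not part of the notebook.

```python
def recommend_titles(movie, titles, similarity, k=5):
    """Return the k titles most similar to `movie`, or [] if the title is unknown."""
    if movie not in titles:
        return []
    pos = titles.index(movie)
    # Pair every other position with its score, so the movie itself is never returned
    scored = [(i, s) for i, s in enumerate(similarity[pos]) if i != pos]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [titles[i] for i, _ in scored[:k]]

# Made-up data for illustration
toy_titles = ["Space Saga", "Alien Hunt", "Jungle Hunt", "Love Letters"]
toy_similarity = [
    [1.00, 0.25, 0.24, 0.02],
    [0.25, 1.00, 0.30, 0.01],
    [0.24, 0.30, 1.00, 0.00],
    [0.02, 0.01, 0.00, 1.00],
]

print(recommend_titles("Space Saga", toy_titles, toy_similarity, k=2))  # ['Alien Hunt', 'Jungle Hunt']
print(recommend_titles("Unknown", toy_titles, toy_similarity))          # []
```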

## Using Pickle

Exporting the movie names for creating a drop-down list:

In [86]:
df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,22nd centuri parapleg marin dispatch moon pand...
1,285,Pirates of the Caribbean: At World's End,captain barbossa long believ dead come back li...
2,206647,Spectre,cryptic messag bond past send trail uncov sini...
3,49026,The Dark Knight Rises,follow death district attorney harvey dent bat...
4,49529,John Carter,john carter former militari captain inexplic t...
...,...,...,...
4804,9367,El Mariachi,el mariachi want play guitar carri famili trad...
4805,72766,Newlyweds,newlyw coupl honeymoon upend arriv respect sis...
4806,231617,"Signed, Sealed, Delivered",sign seal deliv introduc dedic quartet civil s...
4807,126186,Shanghai Calling,ambiti new york attorney sam sent shanghai ass...


In [87]:
import pickle

In [88]:
with open("movies.pkl", "wb") as f:
    pickle.dump(df, f)

In [89]:
with open("recommend.pkl", "wb") as f:
    pickle.dump(recommend, f)

In [90]:
with open("similarity_score_matrix.pkl", "wb") as f:
    pickle.dump(similarity_score_matrix, f)
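
Reading the files back in another program is the mirror image of writing them. The following is a minimal round-trip sketch with a made-up stand-in for the similarity matrix and a temporary file, so it does not touch the files written above; the names `toy_matrix` and `toy_similarity.pkl` are illustrative.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the similarity matrix
toy_matrix = [[1.0, 0.2], [0.2, 1.0]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_similarity.pkl")

    # Write: the with-block closes the file even if dump() raises
    with open(path, "wb") as f:
        pickle.dump(toy_matrix, f)

    # Read it back, as a separate app would do
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == toy_matrix)  # True
```

One caveat: pickle stores a function such as `recommend` only as a reference to its module and name, not its code, so loading `recommend.pkl` only works in a program that defines `recommend` under the same module and name. Only load pickle files that come from a trusted source, since unpickling can execute arbitrary code.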