# Machine Learning
*Note: this notebook is where the data was analyzed and the recommendation system was built.*
## Topic: Movie recommendation system


In [78]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import ast
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
import pickle 

import warnings


In [79]:
warnings.filterwarnings("ignore")

## Datasets



In [80]:
movies =pd.read_csv("/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv")
credits =pd.read_csv("/kaggle/input/tmdb-movie-metadata/tmdb_5000_credits.csv")

In [81]:
movies.shape

(4803, 20)

In [82]:
credits.shape

(4803, 4)

## Merging Datasets

In [83]:
movies=movies.merge(credits, on='title')
movies.shape

(4809, 23)
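The row count grows from 4803 to 4809 after the merge, which suggests that a few titles occur more than once: merging on a non-unique key pairs every matching row on the left with every matching row on the right. The sketch below reproduces the effect on toy data. The frames `movies_toy` and `credits_toy` are made up for illustration, and merging on the id columns is shown only as a possible alternative, not as what this notebook does.

```python
import pandas as pd

# Two different films share the title "Batman" in this toy data.
movies_toy = pd.DataFrame({"id": [1, 2, 3],
                           "title": ["Batman", "Batman", "Avatar"]})
credits_toy = pd.DataFrame({"movie_id": [1, 2, 3],
                            "title": ["Batman", "Batman", "Avatar"],
                            "cast": ["a", "b", "c"]})

# Merging on title: each "Batman" row matches both "Batman" credits rows.
by_title = movies_toy.merge(credits_toy, on="title")

# Merging on the unique id keeps one row per film.
by_id = movies_toy.merge(credits_toy.drop(columns="title"),
                         left_on="id", right_on="movie_id")

print(len(by_title))  # 5
print(len(by_id))     # 3
```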

In [84]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## Dataset overview

In [85]:
print('Number of records:',movies.shape[0])
print('_ _ _ _ _')
print('Number of features:',movies.shape[1])
print('_ _ _ _ _')
print(movies.info())

Number of records: 4809
_ _ _ _ _
Number of features: 23
_ _ _ _ _
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoke

In [86]:
movies.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
budget,4809.0,29027800.0,40704730.0,0.0,780000.0,15000000.0,40000000.0,380000000.0
id,4809.0,57120.57,88653.37,5.0,9012.0,14624.0,58595.0,459488.0
popularity,4809.0,21.49166,31.80337,0.0,4.66723,12.92159,28.35053,875.5813
revenue,4809.0,82275110.0,162837900.0,0.0,0.0,19170000.0,92913170.0,2787965000.0
runtime,4807.0,106.8823,22.60254,0.0,94.0,103.0,118.0,338.0
vote_average,4809.0,6.092514,1.193989,0.0,5.6,6.2,6.8,10.0
vote_count,4809.0,690.3317,1234.187,0.0,54.0,235.0,737.0,13752.0
movie_id,4809.0,57120.57,88653.37,5.0,9012.0,14624.0,58595.0,459488.0


## Truncated Dataframe
A content-based recommendation system like the one we're building needs features that we can turn into tags for comparing films.
For example, a movie's budget is not useful to the recommender: someone who likes Interstellar will not necessarily like other
high-budget films, such as Marvel movies.


**COLUMNS TO BE KEPT**

**title**

**overview - for content based similarity**

**genres**

**keywords - basically tags to describe and recommend similar movies, this will be useful in creating our system.**

**production_companies - some companies stick to producing certain types of movies, like Pixar or Marvel Studios.**

**cast - we often recommend movies on the basis of actors**

**crew - we often recommend movies based on directors, among other crew members**

In [87]:
movies=movies[['movie_id','title','overview','genres','keywords','production_companies','cast','crew']]
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,production_companies,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [88]:
movies.shape

(4809, 8)

## Preprocessing our Data

Now we will preprocess our data by checking for null values and duplicated rows. We can also see that in the columns from 'genres'
through 'crew', the names we need for creating tags are tucked away inside lists of dictionaries. We will parse these columns to
retrieve those names.


In [89]:
movies.isnull().sum()

movie_id                0
title                   0
overview                3
genres                  0
keywords                0
production_companies    0
cast                    0
crew                    0
dtype: int64

In [90]:
movies.dropna(inplace=True)

In [91]:
movies.duplicated().sum()

0

In [92]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4806 entries, 0 to 4808
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   movie_id              4806 non-null   int64 
 1   title                 4806 non-null   object
 2   overview              4806 non-null   object
 3   genres                4806 non-null   object
 4   keywords              4806 non-null   object
 5   production_companies  4806 non-null   object
 6   cast                  4806 non-null   object
 7   crew                  4806 non-null   object
dtypes: int64(1), object(7)
memory usage: 337.9+ KB


## Column conversion

We will use the literal_eval function from the ast (Abstract Syntax Trees) module to write functions that parse the relevant
columns and retrieve the attributes our system needs.

> The ast module provides a way to parse and analyze Python code. It can be used to transform code, check
for errors, or extract information about the code.

> The literal_eval function evaluates a string containing a Python literal (e.g., a string, tuple, list,
dictionary, number, or boolean value) and returns the corresponding Python object.
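As a small illustration of the point above, the sketch below parses a shortened genres string with `ast.literal_eval` and shows that a string containing a function call is rejected instead of being executed. The variable names are chosen for this example only.

```python
import ast

# A shortened version of one 'genres' cell: a string, not yet a list.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw)            # now a real list of dicts
names = [entry["name"] for entry in parsed]
print(names)

# Anything that is not a plain literal raises ValueError and is never run.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

The cells in this dataset look like JSON, so `json.loads` would be another way to parse them; `literal_eval` works here because these particular strings are also valid Python literals.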


## Genres and Keywords

An example of what genres look like:


In [93]:
movies['genres'][0]

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

Now to extract the genre names and keywords:

In [94]:
def convert(obj):
    li=[]
    for i in ast.literal_eval(obj):
        li.append(i['name'])
    return li

In [95]:
movies['genres']=movies['genres'].apply(convert)
movies['genres'][0:6]

0    [Action, Adventure, Fantasy, Science Fiction]
1                     [Adventure, Fantasy, Action]
2                       [Action, Adventure, Crime]
3                 [Action, Crime, Drama, Thriller]
4             [Action, Adventure, Science Fiction]
5                     [Fantasy, Action, Adventure]
Name: genres, dtype: object

In [96]:
movies['keywords'] = movies['keywords'].apply(convert)
movies['keywords'][0:6]

0    [culture clash, future, space war, space colon...
1    [ocean, drug abuse, exotic island, east india ...
2    [spy, based on novel, secret agent, sequel, mi...
3    [dc comics, crime fighter, terrorist, secret i...
4    [based on novel, mars, medallion, space travel...
5    [dual identity, amnesia, sandstorm, love of on...
Name: keywords, dtype: object

In [97]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,production_companies,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## Production Companies

In [98]:
movies['production_companies'][0]

'[{"name": "Ingenious Film Partners", "id": 289}, {"name": "Twentieth Century Fox Film Corporation", "id": 306}, {"name": "Dune Entertainment", "id": 444}, {"name": "Lightstorm Entertainment", "id": 574}]'

In [99]:
def convert_prod(obj):
    # ast.literal_eval raises an exception if the input is not a valid Python
    # literal, so malformed strings are never executed as code (unlike eval).
    li = []
    counter = 0
    for i in ast.literal_eval(obj):
        if counter < 4:  # keep at most the first four production companies
            li.append(i['name'])
            counter += 1
    return li

In [100]:
movies['production_companies'] = movies['production_companies'].apply(convert_prod)
movies['production_companies'][0:6]

0    [Ingenious Film Partners, Twentieth Century Fo...
1    [Walt Disney Pictures, Jerry Bruckheimer Films...
2                     [Columbia Pictures, Danjaq, B24]
3    [Legendary Pictures, Warner Bros., DC Entertai...
4                               [Walt Disney Pictures]
5    [Columbia Pictures, Laura Ziskin Productions, ...
Name: production_companies, dtype: object

## Cast

In [101]:
movies['cast'][0][:500]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "c'

In [102]:
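def convert_cast(obj):
    li = []
    counter = 0
    for i in ast.literal_eval(obj):
        if counter < 3:  # keep only the first three cast members
            li.append(i['name'])
            # `counter=+1` assigned +1 on every pass, so the cap was never reached
            # and the outputs recorded below still list the full cast.
            counter += 1
    return li
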

In [103]:
movies['cast']=movies['cast'].apply(convert_cast)
movies['cast'][0:6]

0    [Sam Worthington, Zoe Saldana, Sigourney Weave...
1    [Johnny Depp, Orlando Bloom, Keira Knightley, ...
2    [Daniel Craig, Christoph Waltz, Léa Seydoux, R...
3    [Christian Bale, Michael Caine, Gary Oldman, A...
4    [Taylor Kitsch, Lynn Collins, Samantha Morton,...
5    [Tobey Maguire, Kirsten Dunst, James Franco, T...
Name: cast, dtype: object

## Crew

In [104]:
movies['crew'][0][:500]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0,'

In [105]:
def convert_crew(obj):
    crew_set=set()
    crew_list = []
    
    for i in ast.literal_eval(obj):
        if i['job'] in ['Director' ,'Screenplay','Producer']:
            name=i['name']
            if name not in crew_set:
                crew_set.add(name)
                crew_list.append(name)
                
    return crew_list

In [106]:
movies['crew']=movies['crew'].apply(convert_crew)
movies['crew'][0:6]

0                          [James Cameron, Jon Landau]
1    [Gore Verbinski, Jerry Bruckheimer, Ted Elliot...
2    [Sam Mendes, John Logan, Barbara Broccoli, Rob...
3    [Charles Roven, Christopher Nolan, Jonathan No...
4    [Andrew Stanton, Colin Wilson, Jim Morris, Lin...
5    [Sam Raimi, Laura Ziskin, Avi Arad, Alvin Sarg...
Name: crew, dtype: object

In [107]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,production_companies,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Ingenious Film Partners, Twentieth Century Fo...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[James Cameron, Jon Landau]"
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Walt Disney Pictures, Jerry Bruckheimer Films...","[Johnny Depp, Orlando Bloom, Keira Knightley, ...","[Gore Verbinski, Jerry Bruckheimer, Ted Elliot..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Columbia Pictures, Danjaq, B24]","[Daniel Craig, Christoph Waltz, Léa Seydoux, R...","[Sam Mendes, John Logan, Barbara Broccoli, Rob..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Legendary Pictures, Warner Bros., DC Entertai...","[Christian Bale, Michael Caine, Gary Oldman, A...","[Charles Roven, Christopher Nolan, Jonathan No..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",[Walt Disney Pictures],"[Taylor Kitsch, Lynn Collins, Samantha Morton,...","[Andrew Stanton, Colin Wilson, Jim Morris, Lin..."


## Overview

This converts each movie overview into a list of words, in other words, tokens. This will help us measure similarities between
movies.

In [108]:
movies['overview']=movies['overview'].apply(lambda x:x.split())
movies['overview'][0:6]

0    [In, the, 22nd, century,, a, paraplegic, Marin...
1    [Captain, Barbossa,, long, believed, to, be, d...
2    [A, cryptic, message, from, Bond’s, past, send...
3    [Following, the, death, of, District, Attorney...
4    [John, Carter, is, a, war-weary,, former, mili...
5    [The, seemingly, invincible, Spider-Man, goes,...
Name: overview, dtype: object

The dataframe is now much easier to read.
We save a copy of it for use in the website.

In [109]:
cleaned_movies=movies.copy()
cleaned_movies.to_csv('cleaned_movies.csv')

## Feature Transformation

Now we will remove the spaces inside each value in 'genres', 'keywords', 'production_companies', 'cast', and 'crew'.
The purpose of this is to produce exactly one tag per value instead of two or more.
For example, 'Science Fiction' becomes 'ScienceFiction'.

In [110]:
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(' ', '') for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(' ', '') for i in x])
movies['production_companies'] = movies['production_companies'].apply(lambda x:[i.replace(' ','') for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(' ', '') for i in x])
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(' ', '') for i in x])

In [111]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,production_companies,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[IngeniousFilmPartners, TwentiethCenturyFoxFil...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...","[JamesCameron, JonLandau]"
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[WaltDisneyPictures, JerryBruckheimerFilms, Se...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...","[GoreVerbinski, JerryBruckheimer, TedElliott, ..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[ColumbiaPictures, Danjaq, B24]","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...","[SamMendes, JohnLogan, BarbaraBroccoli, Robert..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[LegendaryPictures, WarnerBros., DCEntertainme...","[ChristianBale, MichaelCaine, GaryOldman, Anne...","[CharlesRoven, ChristopherNolan, JonathanNolan..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...",[WaltDisneyPictures],"[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...","[AndrewStanton, ColinWilson, JimMorris, Lindse..."
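To see why joining names into a single tag matters, the sketch below compares two hypothetical people lists with and without the spaces removed. The helper `tokens` and the sample lists are made up for this illustration; the names themselves appear in the data above.

```python
people_a = ["Sam Worthington", "Zoe Saldana"]
people_b = ["Sam Mendes"]

def tokens(names, join_names):
    """Lower-case the names and split them into a set of word tokens."""
    text = " ".join(n.replace(" ", "") if join_names else n for n in names)
    return set(text.lower().split())

# With spaces kept, the two lists share the token 'sam' by accident.
shared_split = tokens(people_a, False) & tokens(people_b, False)

# With spaces removed, each person is one tag and nothing overlaps.
shared_joined = tokens(people_a, True) & tokens(people_b, True)

print(shared_split)   # {'sam'}
print(shared_joined)  # set()
```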


## Creating our final dataframe
Creating a tags column that joins overview, genres, keywords, production_companies, cast, and crew

In [112]:
movies['tags']=movies['overview']+movies['genres']+movies['keywords']+movies['production_companies']+movies['cast']+movies['crew']
movies

Unnamed: 0,movie_id,title,overview,genres,keywords,production_companies,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[IngeniousFilmPartners, TwentiethCenturyFoxFil...","[SamWorthington, ZoeSaldana, SigourneyWeaver, ...","[JamesCameron, JonLandau]","[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[WaltDisneyPictures, JerryBruckheimerFilms, Se...","[JohnnyDepp, OrlandoBloom, KeiraKnightley, Ste...","[GoreVerbinski, JerryBruckheimer, TedElliott, ...","[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send...","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[ColumbiaPictures, Danjaq, B24]","[DanielCraig, ChristophWaltz, LéaSeydoux, Ralp...","[SamMendes, JohnLogan, BarbaraBroccoli, Robert...","[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[LegendaryPictures, WarnerBros., DCEntertainme...","[ChristianBale, MichaelCaine, GaryOldman, Anne...","[CharlesRoven, ChristopherNolan, JonathanNolan...","[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...",[WaltDisneyPictures],"[TaylorKitsch, LynnCollins, SamanthaMorton, Wi...","[AndrewStanton, ColinWilson, JimMorris, Lindse...","[John, Carter, is, a, war-weary,, former, mili..."
...,...,...,...,...,...,...,...,...,...
4804,9367,El Mariachi,"[El, Mariachi, just, wants, to, play, his, gui...","[Action, Crime, Thriller]","[unitedstates–mexicobarrier, legs, arms, paper...",[ColumbiaPictures],"[CarlosGallardo, JaimedeHoyos, PeterMarquardt,...","[RobertRodriguez, CarlosGallardo]","[El, Mariachi, just, wants, to, play, his, gui..."
4805,72766,Newlyweds,"[A, newlywed, couple's, honeymoon, is, upended...","[Comedy, Romance]",[],[],"[EdwardBurns, KerryBishé, MarshaDietlein, Cait...","[EdwardBurns, WilliamRexer, AaronLubin]","[A, newlywed, couple's, honeymoon, is, upended..."
4806,231617,"Signed, Sealed, Delivered","[""Signed,, Sealed,, Delivered"", introduces, a,...","[Comedy, Drama, Romance, TVMovie]","[date, loveatfirstsight, narration, investigat...","[FrontStreetPictures, MuseEntertainmentEnterpr...","[EricMabius, KristinBooth, CrystalLowe, GeoffG...","[HarveyKahn, ScottSmith]","[""Signed,, Sealed,, Delivered"", introduces, a,..."
4807,126186,Shanghai Calling,"[When, ambitious, New, York, attorney, Sam, is...",[],[],[],"[DanielHenney, ElizaCoupe, BillPaxton, AlanRuc...",[DanielHsia],"[When, ambitious, New, York, attorney, Sam, is..."


## Final Dataframe
Since the newly created tags column already contains all the information needed for our recommendation system, our final
dataframe will contain only this column, alongside the movie_id and title columns.

In [113]:
movies_df=movies[['movie_id','title','tags']].copy()  # copy, so later column assignments do not act on a slice of movies
movies_df

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."
...,...,...,...
4804,9367,El Mariachi,"[El, Mariachi, just, wants, to, play, his, gui..."
4805,72766,Newlyweds,"[A, newlywed, couple's, honeymoon, is, upended..."
4806,231617,"Signed, Sealed, Delivered","[""Signed,, Sealed,, Delivered"", introduces, a,..."
4807,126186,Shanghai Calling,"[When, ambitious, New, York, attorney, Sam, is..."


Now we convert each list in the tags column to a single string using the join function.

In [114]:
movies_df['tags'] = movies_df['tags'].apply(lambda x:' '.join(x))
movies_df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [115]:
movies_df['tags']=movies_df['tags'].apply(lambda x:x.lower())

In [116]:
movies_df['tags'][0]

'in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d ingeniousfilmpartners twentiethcenturyfoxfilmcorporation duneentertainment lightstormentertainment samworthington zoesaldana sigourneyweaver stephenlang michellerodriguez giovanniribisi joeldavidmoore cchpounder wesstudi lazalonso dileeprao mattgerald seananthonymoran jasonwhyte scottlawrence kellykilgour jamespatrickpitt seanpatrickmurphy peterdillon kevindorman kelsonhenderson davidvanhorn jacobtomuri michaelblain-rozgay joncurry lukehawker woodyschultz petermensah soniayee jahnelcurfman ilramchoi kylawarren lisaroumain debrawilson chrismala taylorkibby jodielandau julielamm cullenb.madden

## Preparing our system

The CountVectorizer class from sklearn converts a collection of text documents into a matrix of token counts, with one row per
movie and one column per token.
We set max_features=5000, which keeps only the 5000 most frequent tokens across all tags; we picked that number because our
dataframe holds information for roughly 5000 movies. We set stop_words='english' because our tags are in English, so the
vectorizer ignores words that add little meaning to a sentence, such as 'the' and 'and'.

In [117]:
cv=CountVectorizer(max_features=5000, stop_words='english')
vectors=cv.fit_transform(movies_df['tags']).toarray()
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [118]:
vectors.shape

(4806, 5000)
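The same idea on a scale small enough to read by eye: the sketch below vectorizes two made-up tag strings. The names `toy_tags`, `toy_cv`, and `toy_vectors` are illustrative, and `max_features` is left out because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_tags = [
    "the marine fights the alien",
    "an alien tribe protects the moon",
]

# English stop words ('the', 'an') are dropped before counting.
toy_cv = CountVectorizer(stop_words="english")
toy_vectors = toy_cv.fit_transform(toy_tags).toarray()

# Column order follows the alphabetically sorted vocabulary.
toy_vocab = sorted(toy_cv.vocabulary_)
print(toy_vocab)    # ['alien', 'fights', 'marine', 'moon', 'protects', 'tribe']
print(toy_vectors)  # one row per document, one column per token
```

Each row is the count vector for one document; 'alien' is the only column with a 1 in both rows, which is what the similarity step later picks up on.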

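This shows the first 101 feature names. The vocabulary is sorted alphanumerically, not by frequency.

In [119]:
# On scikit-learn 1.2 and later, get_feature_names() has been removed; use get_feature_names_out() instead.
cv.get_feature_names()[:101]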

['000',
 '10',
 '11',
 '12',
 '13',
 '14',
 '1492pictures',
 '15',
 '16',
 '17',
 '18',
 '18th',
 '19',
 '1930s',
 '1940s',
 '1950s',
 '1960s',
 '1970s',
 '1980s',
 '19th',
 '19thcentury',
 '20',
 '20th',
 '21lapsentertainment',
 '24',
 '25',
 '2929productions',
 '30',
 '3d',
 '40',
 '40acres',
 '50',
 'aaron',
 'aaroneckhart',
 'aarontaylor',
 'aasifmandvi',
 'abandoned',
 'abducted',
 'abigailbreslin',
 'ability',
 'able',
 'abrams',
 'abuse',
 'abusive',
 'academy',
 'accept',
 'accepts',
 'access',
 'accident',
 'accidentally',
 'accompanied',
 'account',
 'accused',
 'ace',
 'act',
 'action',
 'actions',
 'activist',
 'activities',
 'actor',
 'actors',
 'actress',
 'acts',
 'actual',
 'actually',
 'adam',
 'adambrody',
 'adamgoldberg',
 'adamlefevre',
 'adammckay',
 'adams',
 'adamsandler',
 'adamscott',
 'adamshankman',
 'adaptation',
 'addict',
 'addiction',
 'adewaleakinnuoye',
 'adopted',
 'adoption',
 'adrienbrody',
 'adult',
 'adultery',
 'adulthood',
 'adults',
 'advantage'

## Stemming Features

We will use the PorterStemmer class from the NLTK (Natural Language Toolkit) library to reduce words to their stems.
This prevents words that mean the same thing, like 'actions' and 'action', from being counted as different words.
> The NLTK (Natural Language Toolkit) library is a widely used Python toolkit for Natural Language Processing. It is a
powerful tool for preprocessing text data ahead of further analysis, for instance in recommendation systems.


> The PorterStemmer strips common suffixes from words, leaving only the word stem, hence
the name.

In [120]:
ps=PorterStemmer()

In [121]:
def stemming(text):
    li=[]
    for i in text.split():
        li.append(ps.stem(i))
        
    return ' '.join(li)

In [122]:
movies_df['tags']=movies_df['tags'].apply(stemming)
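One thing to keep in mind: `vectors` was built in cell 117, before the tags were stemmed, so the stemmed tags only influence the similarity scores if the vectorizer is fit again on them (for example by re-running `cv.fit_transform(movies_df['tags'])` after this cell). The sketch below shows, on made-up strings, how stemming shrinks the vocabulary; `toy_docs` and `stem_text` are names invented for this example.

```python
from nltk.stem.porter import PorterStemmer

toy_stemmer = PorterStemmer()
toy_docs = ["action actions", "accept accepts"]

def stem_text(text):
    """Stem every whitespace-separated word and join the result back up."""
    return " ".join(toy_stemmer.stem(word) for word in text.split())

# Vocabulary before stemming: four distinct tokens.
raw_vocab = {word for doc in toy_docs for word in doc.split()}

# Vocabulary after stemming: plural forms collapse onto their stems.
stemmed_docs = [stem_text(doc) for doc in toy_docs]
stemmed_vocab = {word for doc in stemmed_docs for word in doc.split()}

print(sorted(raw_vocab))      # ['accept', 'accepts', 'action', 'actions']
print(sorted(stemmed_vocab))  # ['accept', 'action']
```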

## Similarities

Using the cosine_similarity function from sklearn, we obtain the cosine similarity between every pair of movie vectors, that is,
the cosine of the angle between them. Cosine similarity is frequently used in natural language processing and machine learning
to compare documents, text, or other high-dimensional data. The smaller the angle (and the closer the score is to 1), the more
similar the two movies are.
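For intuition, the score for two vectors a and b is their dot product divided by the product of their lengths. The sketch below computes it by hand for three made-up count vectors; the function `cosine` and the vectors are illustrative and are not part of the notebook's pipeline.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical token counts over a four-word vocabulary.
film_a = [1, 1, 1, 0]
film_b = [1, 1, 0, 1]   # shares two tokens with film_a
film_c = [0, 0, 0, 1]   # shares no tokens with film_a

score_ab = cosine(film_a, film_b)  # 2 / (sqrt(3) * sqrt(3)) = 2/3
score_ac = cosine(film_a, film_c)  # 0.0, no overlap at all
print(round(score_ab, 4), score_ac)
```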

In [123]:
similarity=cosine_similarity(vectors)
similarity

array([[1.        , 0.06897007, 0.04828045, ..., 0.02272727, 0.02548236,
        0.        ],
       [0.06897007, 1.        , 0.07325794, ..., 0.02299002, 0.        ,
        0.        ],
       [0.04828045, 0.07325794, 1.        , ..., 0.02414023, 0.        ,
        0.        ],
       ...,
       [0.02272727, 0.02299002, 0.02414023, ..., 1.        , 0.07644708,
        0.05025189],
       [0.02548236, 0.        , 0.        , ..., 0.07644708, 1.        ,
        0.05634362],
       [0.        , 0.        , 0.        , ..., 0.05025189, 0.05634362,
        1.        ]])

In [124]:
similarity.shape

(4806, 4806)

Each of the 4806 movies is compared against all 4806 movies (including itself), which gives a 4806 x 4806 matrix.

Here we enumerate and sort the similarities in descending order to get the top 5 similar movies.

Enumerating pairs each score with its position, so we still know which movie it belongs to after sorting.


Using the lambda function, we sort by the second value in each tuple, that being the similarity score. The slice [1:6] skips the
first entry, which is the movie itself with a score of 1.


In [125]:
sorted(list(enumerate(similarity[0])), reverse=True, key=lambda x:x[1])[1:6]

[(1216, 0.2528558164964056),
 (539, 0.24140227479263376),
 (507, 0.23162743094465488),
 (1920, 0.22305671869347435),
 (582, 0.21774708517784636)]

## Recommendation Function 

Finally, we have prepared our dataset for use, and we can build our movie recommendation system.

In [126]:
def recommend(movie):
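    # Look up the row position, not the index label: dropna() left gaps in the
    # labels, while similarity (and .iloc below) are indexed by position.
    movies_index=np.flatnonzero(movies_df['title'].to_numpy()==movie)[0]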
    distances=similarity[movies_index]
    movie_list=sorted(list(enumerate(distances)), reverse=True , key=lambda x:x[1])[1:6]
    
    
    for i in movie_list:
        print(movies_df.iloc[i[0]].title)
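The lookup-sort-slice pattern used in `recommend` can be tried on a tiny, hand-written similarity matrix. The sketch below is a simplified variant, not the notebook's function: it returns a list instead of printing, returns an empty list for unknown titles, and excludes the query movie by position. All names and scores are made up.

```python
titles = ["Avatar", "Aliens", "Titanic", "Alien"]

# Hypothetical scores: row i holds movie i's similarity to every movie.
similarity_toy = [
    [1.0, 0.6, 0.1, 0.5],
    [0.6, 1.0, 0.0, 0.8],
    [0.1, 0.0, 1.0, 0.0],
    [0.5, 0.8, 0.0, 1.0],
]

def recommend_toy(title, top_n=2):
    """Return the top_n titles most similar to `title`, best first."""
    if title not in titles:
        return []
    position = titles.index(title)
    scored = sorted(enumerate(similarity_toy[position]),
                    key=lambda pair: pair[1], reverse=True)
    # Drop the query movie itself, then keep the best matches.
    return [titles[i] for i, _ in scored if i != position][:top_n]

print(recommend_toy("Avatar"))   # ['Aliens', 'Alien']
print(recommend_toy("Unknown"))  # []
```

Excluding the query by position rather than by slicing off the first result avoids dropping the wrong entry when another movie ties with a score of 1.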

## Example 1

In [127]:
recommend('Avatar')

Aliens vs Predator: Requiem
Titan A.E.
Independence Day
Lifeforce
Battle: Los Angeles


## Example 2

In [131]:
recommend('Batman Begins')

The Dark Knight
The Dark Knight Rises
Batman
Batman v Superman: Dawn of Justice
Amidst the Devil's Wings


## Pickling
Now we will pickle our final dataframe and our similarity matrix, which holds the scores behind our recommendations. These files will be used to create our website.

In [129]:
with open('movies_dict.pk1','wb') as f:  # the with block closes the file once the dump finishes
    pickle.dump(movies_df.to_dict(), f)

In [130]:
with open('similarity.pk1','wb') as f:  # the with block closes the file once the dump finishes
    pickle.dump(similarity, f)
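On the website side, the pickled objects have to be read back in. The sketch below shows a round trip with a small stand-in dictionary written to a temporary directory; the file name `toy_movies.pkl` and the dictionary contents are assumptions made for this example, not the files produced above.

```python
import os
import pickle
import tempfile

# Same nesting that DataFrame.to_dict() produces: {column: {index: value}}.
toy_dict = {"title": {0: "Avatar", 1: "Spectre"}}

path = os.path.join(tempfile.mkdtemp(), "toy_movies.pkl")

# Write the object, closing the file as soon as the dump finishes.
with open(path, "wb") as f:
    pickle.dump(toy_dict, f)

# Read it back; the restored object equals the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_dict)  # True
```

Unpickling can run arbitrary code, so only load pickle files that you created yourself or otherwise trust.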