# TF-IDF Movie Recommender

The goal is to build a simple movie recommender based on document vectorization, specifically TF-IDF, which down-weights words that appear in many documents (such as stopwords).

Some key steps:
- Combine each movie's data into a single string
- Transform the strings using TF-IDF
- Assume the query is always an existing movie in the database
    - e.g. query = "Scream 3"; recommend other movies based on it
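
To make the down-weighting idea concrete, here is a minimal sketch on three made-up one-line documents. The `toy_docs` strings and the `toy_` names are illustrative assumptions, not part of the movie dataset. With scikit-learn's default settings, a word that appears in every document should end up with a lower IDF weight than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration): "the" occurs in every document,
# "killer" in only one.
toy_docs = [
    "the killer stalks the town",
    "the spy saves the world",
    "the pirate sails the ocean",
]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_docs)

# vocabulary_ maps each term to its column; idf_ holds one weight per column.
idf = {term: toy_vectorizer.idf_[col]
       for term, col in toy_vectorizer.vocabulary_.items()}

print("idf('the')    =", idf["the"])
print("idf('killer') =", idf["killer"])
```

The common word gets the smaller weight, which is why frequent filler words contribute little to the similarity scores later on.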

In [229]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

### Reading in our Movie Dataset

In [230]:
df = pd.read_csv('tmdb_5000_movies.csv')

In [231]:
df.head()

,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [232]:
df.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

**Let's think about which features could be relevant!**
- genres
- keywords
- overview
- release_date
- tagline

In [233]:
test_row = df.iloc[0]

We first define the aggregator function, which extracts the relevant features as strings and combines them into a single document per movie for use in our vectorizer.

In [234]:
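def aggregator(row):
    """Combine a movie's text features into one space-separated document."""
    # genres and keywords are stored as JSON lists of {"id": ..., "name": ...}
    genre_dicts = json.loads(row['genres'])
    genre_strs = ' '.join(x['name'] for x in genre_dicts)

    keyword_dicts = json.loads(row['keywords'])
    keyword_strs = ' '.join(x['name'] for x in keyword_dicts)

    # Note: str() turns a missing value (NaN) into the literal token 'nan'
    return ' '.join([genre_strs, keyword_strs, str(row['overview']),
                     str(row['tagline']), str(row['release_date'])])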
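
One thing to watch: `str()` on a missing value produces the literal string `nan`, which would then be treated as an ordinary word by the vectorizer. If any of these columns contain missing values, a variant along the following lines could skip them instead. This is only a sketch; `aggregator_safe` and `toy_row` are hypothetical names, and the row contents (including the ids) are made up for illustration.

```python
import json
import pandas as pd

def aggregator_safe(row):
    """Hypothetical variant of aggregator that leaves out missing fields."""
    genres = ' '.join(x['name'] for x in json.loads(row['genres']))
    keywords = ' '.join(x['name'] for x in json.loads(row['keywords']))
    parts = [genres, keywords]
    for col in ('overview', 'tagline', 'release_date'):
        # pd.notna is False for NaN/None, so those fields are skipped
        if pd.notna(row[col]):
            parts.append(str(row[col]))
    return ' '.join(parts)

# Made-up row with a missing tagline
toy_row = pd.Series({
    'genres': '[{"id": 27, "name": "Horror"}]',
    'keywords': '[{"id": 1, "name": "mask"}]',
    'overview': 'A killer returns.',
    'tagline': float('nan'),
    'release_date': '2000-02-03',
})

print(aggregator_safe(toy_row))
```

Swapping this in for `aggregator` would change the documents, so the recommendations could differ from the ones shown in this notebook.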

In [235]:
documents = df.apply(aggregator, axis=1)

In [236]:
# Map each title to its row position in df (and therefore in the TF-IDF matrix)
index = pd.Series(df.index, index=df['original_title'])
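
A title-to-position lookup like this quietly assumes that titles are unique. If two movies share a title, the lookup returns several positions rather than one. The sketch below shows the effect on a made-up list of titles and one possible way to handle it (keeping the first occurrence); the `toy_` names are assumptions for illustration only.

```python
import pandas as pd

# Made-up titles; 'Batman' appears twice on purpose
toy_titles = pd.Series(['Scream', 'Batman', 'Batman'])
toy_index = pd.Series(toy_titles.index, index=toy_titles)

# A duplicated title yields a Series of positions, not a single integer
print(toy_index['Batman'])

# One option: keep only the first occurrence of each title
toy_index_unique = toy_index[~toy_index.index.duplicated(keep='first')]
print(toy_index_unique['Batman'])
```

Whether keeping the first match is the right policy depends on the use case; disambiguating by release year would be another option.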

In [237]:
vectorizer = TfidfVectorizer()

Fitting the vectorizer and transforming the documents into a TF-IDF matrix

In [238]:
X = vectorizer.fit_transform(documents)

In [239]:
X

<4803x23798 sparse matrix of type '<class 'numpy.float64'>'
	with 286478 stored elements in Compressed Sparse Row format>
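
The output above says there is one row per movie (4803) and one column per distinct term (23798), with only the non-zero weights stored. As a quick back-of-the-envelope check, using nothing but the three numbers printed in that output:

```python
# Numbers copied from the sparse matrix summary above
n_docs, n_terms, n_stored = 4803, 23798, 286478

density = n_stored / (n_docs * n_terms)   # fraction of non-zero cells
avg_terms_per_doc = n_stored / n_docs     # distinct terms per movie, on average

print(f"density: {density:.4%}")
print(f"average distinct terms per document: {avg_terms_per_doc:.1f}")
```

So only about a quarter of a percent of the cells are non-zero, and each movie's document contains roughly 60 distinct terms, which is why a sparse format is a good fit here.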

In [240]:
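def recommender(title):
    """Return the 5 titles whose TF-IDF vectors are most similar to `title`."""
    movie_idx = index[title]
    query = X[movie_idx]
    scores = cosine_similarity(query, X)  # shape (1, n_movies)
    # Sort descending and skip position 0, which is normally the query itself
    top_5_idx = np.argsort(scores)[0][::-1][1:6]
    top_5_movies = [index.index[i] for i in top_5_idx]
    return top_5_movies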

In [241]:
print(recommender('Scream 3'))

['Scream', 'Scream 2', 'Disaster Movie', '2:13', 'Sorority Row']
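
The recommender relies on two assumptions: the title exists in the index, and the query movie itself always sorts into first place. A slightly more defensive version might check for unknown titles and mask out the query by position instead. The following is a self-contained sketch on a made-up four-movie corpus; `recommend_safe`, the `toy_` names, and the titles and descriptions are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus for illustration
toy_titles = ['Slasher A', 'Slasher B', 'Space Opera', 'Pirate Tale']
toy_docs = [
    'horror masked killer stalks teens',
    'horror masked killer returns sequel',
    'science fiction space fleet battle',
    'adventure pirate ship ocean treasure',
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)
toy_index = pd.Series(range(len(toy_titles)), index=toy_titles)

def recommend_safe(title, n=1):
    """Return up to n titles most similar to `title`, or [] if it is unknown."""
    if title not in toy_index.index:
        return []
    movie_idx = toy_index[title]
    scores = cosine_similarity(toy_X[movie_idx], toy_X).ravel()
    scores[movie_idx] = -np.inf  # exclude the query itself explicitly
    top = np.argsort(scores)[::-1][:n]
    return [toy_titles[i] for i in top]

print(recommend_safe('Slasher A'))
print(recommend_safe('Not In The Database'))
```

Masking by position avoids depending on how ties are ordered when another movie has an identical document.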
