# Movie Recommender
<a id='Movie_Recommender'></a> 

## Table of Contents<a id='Table_of_Contents'></a>
* [1 Movie Recommender](#1_Movie_Recommender)
    * [1.1 Table of Contents](#1.1_Table_of_Contents)
    * [1.2 Introduction](#1.2_Introduction)
    * [1.3 Imports](#1.3_Imports)
    * [1.4 Load Data](#1.4_Load_Data)

## Introduction<a id='1.2_Introduction'></a>

Does the recommended section provide satisfaction? Are the movie suggestions popping up elevate anyone’s mood? Do the advertisements appearing in between Instagram posts catch anybody’s attention? Is there a system that gives better movie suggestions? 

From the moment an individual unlocks his/her phone till the moment he/she puts it away, every action being committed is tracked and monitored, stored into data, that is confidential information sold to big corporations. Pew Research states that 81% of Americans own smartphones creating a push for relevant and accurate promotions that will influence a user’s decision-making process and attract customers. With complex supervision, people with smartphones are being schemed to stay on their cellular devices for longer periods of time to generate more revenue. Why? Companies with money can afford exceptional marketing. According to Business Wire, 60% of consumers click on cellphone ads every week suggesting how prevalent mobile advertising has become. Given these points, everyone is simply surrounded by recommender systems; people are constantly getting pitched ideas. From a simple Google search to a YouTube video to a Netflix original and to a song on Spotify, there exists a machine that advises its user based off of abundant data. 

## Imports<a id='1.3_Imports'></a>

In [1]:
#import modules and their subpackages required for analysis
import ast
import pandas as pd
import matplotlib.pyplot as plt 
from datetime import datetime
import numpy as np
from ast import literal_eval

## Load Data<a id='1.4_Load_Data'></a>

In [2]:
#load data into variable
movies = pd.read_csv('C:/Users/sathw/OneDrive/Desktop/Springboard_work/IMDB-Movie-Ratings/rawdata/movies_metadata.csv', index_col = 'title', low_memory = False, nrows = 46000)
credits = pd.read_csv('C:/Users/sathw/OneDrive/Desktop/Springboard_work/IMDB-Movie-Ratings/rawdata/credits.csv')
keywords = pd.read_csv('C:/Users/sathw/OneDrive/Desktop/Springboard_work/IMDB-Movie-Ratings/rawdata/keywords.csv')

In [3]:
def clean_col(df,col):
    """ 
    @ params: dataframe, column
    removes irrelevant information and outputs only list of the names in each dictionary
    """
    df[col] = df[col].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else[])

In [4]:
cols = ['genres', 'production_countries', 'spoken_languages', 'belongs_to_collection']
for col in cols:
    clean_col(movies, col)

In [5]:
movies = movies[~movies.vote_count.isnull()]

In [6]:
movies = movies.drop_duplicates(subset=['original_title', 'release_date'])

In [7]:
movies['release_date'] = pd.to_datetime(movies['release_date'], errors = 'coerce')

In [8]:
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [9]:
clean_col(credits, 'cast')
clean_col(credits, 'crew')
clean_col(keywords, 'keywords')

In [10]:
credits.id = credits.id.astype(int)
movies['id'] = movies['id'].astype(int)
keywords['id'] = keywords.id.astype(int)

In [11]:
movies = movies.merge(credits, on = 'id')
movies = movies.merge(keywords, on='id')

In [12]:
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,runtime,spoken_languages,status,tagline,video,vote_average,vote_count,cast,crew,keywords
0,False,[],30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,81.0,[English],Released,,False,7.7,5415.0,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[John Lasseter, Joss Whedon, Andrew Stanton, J...","[jealousy, toy, boy, friendship, friends, riva..."
1,False,[],65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,104.0,"[English, Français]",Released,Roll the dice and unleash the excitement!,False,6.9,2413.0,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Larry J. Franco, Jonathan Hensleigh, James Ho...","[board game, disappearance, based on children'..."
2,False,[],0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,101.0,[English],Released,Still Yelling. Still Fighting. Still Ready for...,False,6.5,92.0,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Howard Deutch, Mark Steven Johnson, Mark Stev...","[fishing, best friend, duringcreditsstinger, o..."
3,False,[],16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,127.0,[English],Released,Friends are the people who let you be yourself...,False,6.1,34.0,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Forest Whitaker, Ronald Bass, Ronald Bass, Ez...","[based on novel, interracial relationship, sin..."
4,False,[],0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,106.0,[English],Released,Just When His World Is Back To Normal... He's ...,False,5.7,173.0,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[Alan Silvestri, Elliot Davis, Nancy Meyers, N...","[baby, midlife crisis, confidence, aging, daug..."


In [13]:
movies.overview = movies.overview.fillna(" ")
movies.tagline = movies.tagline.fillna(" ")

Had to only work with a fraction of the dataset because lack of computing power.

In [14]:
movies['details'] = movies.overview + movies.tagline

small_movies = movies.sample(frac = 0.01, axis =0)

small_movies.details = small_movies.details.fillna(" ")
small_movies.details = ','.join(map(str, small_movies.details)) 
small_movies.cast = small_movies.cast.apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
small_movies.cast = ','.join(map(str, small_movies.cast)) 
small_movies.keywords = small_movies.keywords.apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
small_movies.keywords = ','.join(map(str, small_movies.keywords))  
small_movies.genres =','.join(map(str, small_movies.genres))  

#small_movies.cast = small_movies.cast.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
#small_movies.keywords=small_movies.keywords.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
#small_movies.genres=small_movies.genres.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
small_movies.cast = small_movies.cast.apply(lambda x: x[:3] if len (x) >= 3 else x)

small_movies['mix'] = small_movies.details + small_movies.keywords + small_movies.cast + small_movies.genres
#small_movies['mix'] = small_movies.mix.apply(lambda x: " ".join(x))

In [15]:
small_movies.applymap(lambda x: isinstance(x,list)).all()

adult                    False
belongs_to_collection     True
budget                   False
genres                   False
homepage                 False
id                       False
imdb_id                  False
original_language        False
original_title           False
overview                 False
popularity               False
poster_path              False
production_companies     False
production_countries      True
release_date             False
revenue                  False
runtime                  False
spoken_languages          True
status                   False
tagline                  False
video                    False
vote_average             False
vote_count               False
cast                     False
crew                      True
keywords                 False
details                  False
mix                      False
dtype: bool

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0,stop_words = 'english')
tfidf_matrix = vectorizer.fit_transform(small_movies.mix)  ## Even astype(str) would work
print(tfidf_matrix.toarray())

[[0.01174591 0.00195765 0.00195765 ... 0.00195765 0.00195765 0.00195765]
 [0.01174591 0.00195765 0.00195765 ... 0.00195765 0.00195765 0.00195765]
 [0.01174591 0.00195765 0.00195765 ... 0.00195765 0.00195765 0.00195765]
 ...
 [0.01174591 0.00195765 0.00195765 ... 0.00195765 0.00195765 0.00195765]
 [0.01174591 0.00195765 0.00195765 ... 0.00195765 0.00195765 0.00195765]
 [0.01174591 0.00195765 0.00195765 ... 0.00195765 0.00195765 0.00195765]]


In [None]:
from sklearn.metrics.pairwise import linear_kernel
# cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
## recommender function
# Generate mapping between titles and index
indices = pd.Series(movies.index, index=movies['original_title']).drop_duplicates()

def get_recs(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 20 most similar movies
    sim_scores = sim_scores[1:20]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 20 most similar movies
    return movies['original_title'].iloc[movie_indices]

In [None]:
print(get_recs('Toy Story',cosine_sim, indices))