# 4 Pre-processing Data

## 4.1 Data Source

### 4.1.1 Importing packages

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split

# packages for NLP
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# packages required to calculate jaccard similarity
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform

# Normalizing data
from sklearn.preprocessing import Normalizer

# regex
import re

# importing non-negative matrix factorization
from sklearn.decomposition import NMF


### 4.1.2 Importing Data

In [None]:
# importing data
movies_metadata=pd.read_csv('/content/drive/MyDrive/machine_learning_projects/movie_recommendation_system/data/movies_metadata_cleaned.csv')



### 4.1.3 Formatting

Due to computing power limitations, I am unable to run a model on all 45,418 movies, so I will select roughly half of them. Initially, I was going to sample the movies at random; however, this left a lot of obscure movies. Since I was not familiar with many of the movies that were left, I was unable to judge how well the recommendation system performed. Thus, I decided to select the 22,709 most popular movies. Popularity will be measured by the number of votes rather than the average rating, to avoid depending on people's opinions of a movie. The more votes a movie has, the more people have seen it, so even a movie that was poorly received should be reasonably well known if it has enough votes.

Selecting movies by number of votes does mean that obscure movies will be left out; thus, the recommendation system will not perform well for people who enjoy those types of movies. However, I believe most people prefer well-known movies, which is part of why those movies are well known. To cater to the largest number of people, I decided to go with the better-known movies.
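
To make the computing constraint more concrete, here is a rough back-of-the-envelope sketch of how much memory a dense movie-by-movie similarity matrix needs. It assumes 8-byte float64 entries (the NumPy default) and ignores any temporary copies; the helper name `dense_matrix_gib` is made up for illustration. The two sizes are the full data set and the 21,740 titles that remain after the vote-count filter applied later in this section.

```python
def dense_matrix_gib(n_movies, bytes_per_value=8):
    """Approximate size in GiB of a dense n x n matrix."""
    return n_movies * n_movies * bytes_per_value / 1024**3

# all movies in the cleaned metadata vs. the subset kept after filtering
for n in (45418, 21740):
    print(f"{n} movies -> about {dense_matrix_gib(n):.1f} GiB")
```

Because the size grows with the square of the number of movies, halving the data set cuts the matrix to roughly a quarter of its size.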

In [None]:
movies_metadata.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,imdb_id,original_language,overview,release_date,spoken_languages,title,vote_average,...,Mystery,War,Foreign,Music,Documentary,Western,production_companies_list,spoken_languages_list,release_year,outlier
0,0,0,862,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,"[{'iso_639_1': 'en', 'name': 'English'}]",Toy Story,7.7,...,0,0,0,0,0,0,['Pixar Animation Studios'],['en'],1995,non-outlier
1,1,1,8844,tt0113497,en,When siblings Judy and Peter discover an encha...,1995-12-15,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Jumanji,6.9,...,0,0,0,0,0,0,"['TriStar Pictures', 'Teitler Film', 'Intersco...","['en', 'fr']",1995,non-outlier
2,2,2,15602,tt0113228,en,A family wedding reignites the ancient feud be...,1995-12-22,"[{'iso_639_1': 'en', 'name': 'English'}]",Grumpier Old Men,6.5,...,0,0,0,0,0,0,"['Warner Bros.', 'Lancaster Gate']",['en'],1995,non-outlier
3,3,3,31357,tt0114885,en,"Cheated on, mistreated and stepped on, the wom...",1995-12-22,"[{'iso_639_1': 'en', 'name': 'English'}]",Waiting to Exhale,6.1,...,0,0,0,0,0,0,['Twentieth Century Fox Film Corporation'],['en'],1995,non-outlier
4,4,4,11862,tt0113041,en,Just when George Banks has recovered from his ...,1995-02-10,"[{'iso_639_1': 'en', 'name': 'English'}]",Father of the Bride Part II,5.7,...,0,0,0,0,0,0,"['Sandollar Productions', 'Touchstone Pictures']",['en'],1995,non-outlier


In [None]:
movies_metadata.shape[0]

45418

Since 50 percent of the movies have fewer than 10 votes, I will select all movies with a vote count greater than 10.

In [None]:
# selecting all movies with a vote count greater than 10
movies_metadata_small=movies_metadata[movies_metadata['vote_count']>10]
movies_metadata_small.shape

# delete movies_metadata
del movies_metadata

# saving movies_metadata_small
movies_metadata_small.to_csv('/content/drive/MyDrive/machine_learning_projects/movie_recommendation_system/data/movies_metadata_small.csv')
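
The 50 percent figure corresponds to the median of vote_count. A minimal sketch of how such a threshold could be checked and applied, using a tiny made-up data frame in place of movies_metadata (the titles and counts are invented for illustration):

```python
import pandas as pd

# toy stand-in for movies_metadata; values are invented for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'vote_count': [2, 5, 9, 11, 150, 3000],
})

# the median splits the movies into a lower and an upper half by vote count
median_votes = toy['vote_count'].median()

# keep only movies with strictly more votes than the threshold
toy_small = toy[toy['vote_count'] > median_votes].copy()
print(median_votes, toy_small['title'].tolist())
```

Taking a `.copy()` of the filtered frame is optional, but it makes later in-place changes to the subset unambiguous.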

## 4.2 Content Based Recommendation System

### 4.2.1 Genre based Model

The first model will use only the genre to make a recommendation; therefore, the assumption is that a person chooses to watch a movie because they enjoy a particular genre. Although this assumption is likely true to an extent, its recommendations will be limited because people enjoy movies based on more than just genre, and they like more than one genre. Although the recommendation will be limited, it will provide a good base model.

Since the genre data is binary, I will use Jaccard similarity for this first model.
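
Jaccard similarity is the number of genres two movies share divided by the number of genres that either movie has. As a small sketch, the genre flags of three movies from the table above (restricted to six genre columns to keep it short) can be compared with the same scipy functions used in this notebook:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

genres = ['Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy', 'Romance']
toy = pd.DataFrame(
    [[1, 1, 1, 0, 0, 0],   # Toy Story
     [0, 0, 1, 1, 1, 0],   # Jumanji
     [0, 1, 0, 0, 0, 1]],  # Grumpier Old Men
    index=['Toy Story', 'Jumanji', 'Grumpier Old Men'],
    columns=genres,
)

# pdist returns the Jaccard *distance*; similarity is 1 - distance
distance = squareform(pdist(toy.to_numpy(dtype=bool), metric='jaccard'))
similarity = pd.DataFrame(1 - distance, index=toy.index, columns=toy.index)
print(similarity.round(2))
```

Toy Story and Jumanji share one genre (Family) out of five distinct genres between them, which gives 1/5 = 0.2.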

In [None]:
# remove all columns except genre columns
drop_columns=['Unnamed: 0', 'Unnamed: 0.1','imdb_id', 'original_language', 'overview', 'release_date', 'release_year','outlier','spoken_languages_list', 'spoken_languages', 'vote_average', 'vote_count', 'genres_list', 'production_companies_list', 'id']
movies_metadata_genre = movies_metadata_small.drop(columns=drop_columns)

movies_metadata_genre.columns

Index(['title', 'Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History',
       'Mystery', 'War', 'Foreign', 'Music', 'Documentary', 'Western'],
      dtype='object')

In [None]:
# remove movies_metadata_small from memory
del movies_metadata_small

In [None]:
# collecting movie titles for index
movies_index = movies_metadata_genre['title']
movies_metadata_genre = movies_metadata_genre.drop(columns = ['title'])

movies_metadata_genre.columns

Index(['Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy', 'Romance',
       'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History', 'Mystery',
       'War', 'Foreign', 'Music', 'Documentary', 'Western'],
      dtype='object')

In [None]:
# setting movie title as index
movies_metadata_genre.index = movies_index

movies_metadata_genre.head()

title,Animation,Comedy,Family,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,History,Mystery,War,Foreign,Music,Documentary,Western
Toy Story,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Jumanji,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
Grumpier Old Men,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
Waiting to Exhale,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
Father of the Bride Part II,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# saving movies_

#### 4.2.1.1 Calculating Jaccard Similarity

In [None]:
# calculating jaccard distance for all movies
movies_metadata_genre_distance = pdist(movies_metadata_genre, metric = 'jaccard')

In [None]:
# putting the distances into a square matrix
movies_metadata_genre_df= squareform(movies_metadata_genre_distance)

In [None]:
# removes movies_metadata_genre_distance
del movies_metadata_genre_distance

In [None]:
# converting to a dataframe
movies_metadata_genre_df=pd.DataFrame(movies_metadata_genre_df, index=movies_index, columns = movies_index)

In [None]:
movies_metadata_genre_df.head()

title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,The Final Storm,In a Heartbeat,"Blood, Sweat and Tears",To Be Fat Like Me,Cadet Kelly,The Man with the Rubber Head,The Devilish Tenant,The One-Man Band,Mom,Robin Hood
Toy Story,0.0,0.8,0.75,0.8,0.666667,1.0,0.75,0.833333,1.0,1.0,...,1.0,0.25,1.0,0.75,0.666667,0.75,0.75,1.0,1.0,1.0
Jumanji,0.8,0.0,1.0,1.0,1.0,1.0,1.0,0.6,0.8,0.8,...,1.0,0.833333,1.0,0.75,1.0,0.75,0.75,0.8,1.0,1.0
Grumpier Old Men,0.75,1.0,0.0,0.333333,0.5,1.0,0.0,1.0,1.0,1.0,...,1.0,0.5,1.0,1.0,0.5,0.666667,0.666667,1.0,1.0,0.75
Waiting to Exhale,0.8,1.0,0.333333,0.0,0.666667,0.833333,0.333333,0.833333,1.0,1.0,...,1.0,0.6,0.666667,0.75,0.666667,0.75,0.75,1.0,0.8,0.5
Father of the Bride Part II,0.666667,1.0,0.5,0.666667,0.0,1.0,0.5,1.0,1.0,1.0,...,1.0,0.75,1.0,1.0,0.0,0.5,0.5,1.0,1.0,1.0


In [None]:
# calculating jaccard similarity
movies_metadata_genre_similarity = 1- movies_metadata_genre_df

In [None]:
# remove movies_metadata_genre_df
del movies_metadata_genre_df

In [None]:
# checking the similarity matrix
movies_metadata_genre_similarity.head()

title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,The Final Storm,In a Heartbeat,"Blood, Sweat and Tears",To Be Fat Like Me,Cadet Kelly,The Man with the Rubber Head,The Devilish Tenant,The One-Man Band,Mom,Robin Hood
Toy Story,1.0,0.2,0.25,0.2,0.333333,0.0,0.25,0.166667,0.0,0.0,...,0.0,0.75,0.0,0.25,0.333333,0.25,0.25,0.0,0.0,0.0
Jumanji,0.2,1.0,0.0,0.0,0.0,0.0,0.0,0.4,0.2,0.2,...,0.0,0.166667,0.0,0.25,0.0,0.25,0.25,0.2,0.0,0.0
Grumpier Old Men,0.25,0.0,1.0,0.666667,0.5,0.0,1.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.5,0.333333,0.333333,0.0,0.0,0.25
Waiting to Exhale,0.2,0.0,0.666667,1.0,0.333333,0.166667,0.666667,0.166667,0.0,0.0,...,0.0,0.4,0.333333,0.25,0.333333,0.25,0.25,0.0,0.2,0.5
Father of the Bride Part II,0.333333,0.0,0.5,0.333333,1.0,0.0,0.5,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,1.0,0.5,0.5,0.0,0.0,0.0


In [None]:
movies_metadata_genre_similarity.index

Index(['Toy Story', 'Jumanji', 'Grumpier Old Men', 'Waiting to Exhale',
       'Father of the Bride Part II', 'Heat', 'Sabrina', 'Tom and Huck',
       'Sudden Death', 'GoldenEye',
       ...
       'The Final Storm', 'In a Heartbeat', 'Blood, Sweat and Tears',
       'To Be Fat Like Me', 'Cadet Kelly', 'The Man with the Rubber Head',
       'The Devilish Tenant', 'The One-Man Band', 'Mom', 'Robin Hood'],
      dtype='object', name='title', length=21740)

#### 4.2.1.2 Finding Similar movies based on genre

In [None]:
# function to find similar movies and provide recommendation
def similar_movies(x, data, k=None):
    """Take a movie title and return its [k: optional] most similar movies."""
    # guard against titles that are not in the similarity matrix
    if x not in data.index:
        raise KeyError(f"'{x}' is not in the database")

    # similarity scores for movie x, sorted from most to least similar
    movies_rec = data.loc[x].sort_values(ascending=False)

    # with k=None, return the full ranking
    if k is None:
        return movies_rec
    return movies_rec.iloc[:k]
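
Because every movie has a similarity of 1 with itself, the queried title always appears in its own results. A hedged sketch of a variant that leaves it out is shown below; the name `similar_movies_excluding_self` and the toy matrix are made up for illustration, and the sketch assumes titles are unique in the index (duplicate titles would need extra handling).

```python
import pandas as pd

def similar_movies_excluding_self(title, data, k=10):
    """Hypothetical variant of similar_movies that leaves out the query title."""
    if title not in data.index:
        raise KeyError(f"{title!r} is not in the similarity matrix")
    # drop the query movie's own entry before ranking
    scores = data.loc[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(k)

names = ['Toy Story', 'Jumanji', 'Grumpier Old Men']
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.25],
     [0.2, 1.0, 0.0],
     [0.25, 0.0, 1.0]],
    index=names, columns=names,
)
result = similar_movies_excluding_self('Toy Story', toy_sim, k=2)
print(result)
```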

In [None]:
similar_movies('Old Yeller', movies_metadata_genre_similarity, 10)

title
Dances with Wolves       1.00
True Grit                1.00
Gold                     1.00
Old Yeller               1.00
Man in the Wilderness    1.00
Into the West            1.00
The Timber               1.00
The Indian Fighter       0.75
Northwest Passage        0.75
Shenandoah               0.75
Name: Old Yeller, dtype: float64

In [None]:
similar_movies('Toy Story', movies_metadata_genre_similarity, 50)

title
Toy Story                                               1.0
The SpongeBob SquarePants Movie                         1.0
Meet the Deedles                                        1.0
Garfield Gets Real                                      1.0
Meet the Robinsons                                      1.0
Scooby-Doo! and the Samurai Sword                       1.0
Scooby-Doo! And the Legend of the Vampire               1.0
Looney Tunes: Back in Action                            1.0
Scooby-Doo! and the Loch Ness Monster                   1.0
Big Top Scooby-Doo!                                     1.0
Lilo & Stitch 2: Stitch has a Glitch                    1.0
Scooby-Doo Goes Hollywood                               1.0
The Flintstones & WWE: Stone Age Smackdown              1.0
One Froggy Evening                                      1.0
Tom and Jerry: Shiver Me Whiskers                       1.0
Barnyard                                                1.0
Mutant Pumpkins from Outer Space  

In [None]:
# checking similarity between Toy Story and Toy Story 2
movies_metadata_genre_similarity['Toy Story']['Toy Story 2']

1.0

#### 4.2.1.3 Genre Summary

Although recommendations based solely on genre are very basic, the model does make some relevant suggestions. For example, it recommends Lion King and Scooby-Doo for Toy Story, which are animated kids' movies. Because so many movies have a similarity of 1, you have to go through a lot of them before reaching one of the other Toy Story movies. Thus, many close matches are buried among the ties.
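
One simple way to order tied results, shown here only as an illustration and not as what this notebook does, is to sort by similarity first and break ties with a second column such as vote_count. The scores and vote counts below are invented:

```python
import pandas as pd

# invented similarity scores and vote counts for illustration
candidates = pd.DataFrame(
    {'similarity': [1.0, 1.0, 1.0, 0.75],
     'vote_count': [120, 4000, 35, 9000]},
    index=['Movie A', 'Movie B', 'Movie C', 'Movie D'],
)

# sort by similarity first, then use vote_count to order the ties
ranked = candidates.sort_values(['similarity', 'vote_count'], ascending=[False, False])
print(ranked)
```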

### 4.2.2 Genre and popularity

Since the genre-only model produced so many movies with a similarity of 1, I will add vote_average as a feature, so that if a person watches a popular movie, the system will recommend other popular movies in the same genre.

In [None]:
# To save room remove movies_metadata_genre_similarity
del movies_metadata_genre_similarity

In [None]:
del movies_metadata_genre

In [None]:
# importing data
movies_metadata=pd.read_csv('/content/drive/MyDrive/machine_learning_projects/movie_recommendation_system/data/movies_metadata_cleaned.csv')



In [None]:
# selecting all movies with a vote count greater than 10
movies_metadata_small=movies_metadata[movies_metadata['vote_count']>10]

In [None]:
# remove all columns except genre and vote_average columns
drop_columns=['Unnamed: 0', 'Unnamed: 0.1','imdb_id', 'original_language', 'overview', 'release_date', 'release_year','outlier','spoken_languages_list', 'spoken_languages', 'vote_count', 'genres_list', 'production_companies_list', 'id']
movies_metadata_genre_vote = movies_metadata_small.drop(columns=drop_columns)

In [None]:
# remove movies_metadata
del movies_metadata

In [None]:
# collecting movie titles for index
movies_index = movies_metadata_genre_vote['title']
movies_metadata_genre_vote = movies_metadata_genre_vote.drop(columns = ['title'])

In [None]:
movies_metadata_genre_vote.columns

Index(['vote_average', 'Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History',
       'Mystery', 'War', 'Foreign', 'Music', 'Documentary', 'Western'],
      dtype='object')

In [None]:
movies_metadata_genre_vote.index=movies_index

In [None]:
movies_metadata_genre_vote.head()

title,vote_average,Animation,Comedy,Family,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,History,Mystery,War,Foreign,Music,Documentary,Western
Toy Story,7.7,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Jumanji,6.9,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
Grumpier Old Men,6.5,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
Waiting to Exhale,6.1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
Father of the Bride Part II,5.7,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### 4.2.2.1 Normalizing data

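Since the vote average ranges between 1 and 10 while the genre columns are 0 or 1, I need to normalize the data so that the distance calculations do not put more weight on vote average than on the dummy variables. Note that Normalizer rescales each row (each movie) to unit length rather than rescaling each column, so vote_average still holds most of each row's weight in the output below.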
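
The sketch below illustrates what the two kinds of scaling do on three rows taken from the table above (with only five genre columns, to keep it short). It checks that row-wise normalization leaves cosine similarity unchanged, and shows column-wise scaling with `MinMaxScaler` as one possible alternative. The alternative is an assumption for illustration, not the approach used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# columns: vote_average, Animation, Comedy, Family, Adventure, Fantasy
X = np.array([
    [7.7, 1, 1, 1, 0, 0],   # Toy Story
    [6.9, 0, 0, 1, 1, 1],   # Jumanji
    [5.7, 0, 1, 0, 0, 0],   # Father of the Bride Part II
])

# Normalizer divides each row by its own L2 norm
row_normalized = Normalizer().fit_transform(X)

# cosine similarity already ignores row length, so the scores are unchanged
same_scores = np.allclose(cosine_similarity(X), cosine_similarity(row_normalized))

# MinMaxScaler instead rescales each column to the [0, 1] range
column_scaled = MinMaxScaler().fit_transform(X)
print(row_normalized[0, 0], same_scores, column_scaled[:, 0])
```

With column-wise scaling, vote_average ends up on the same 0 to 1 scale as the genre flags, which is closer to the goal stated above.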

In [None]:
# initialize the normalizer
norm = Normalizer()

# fit and transform
norm_fit = norm.fit_transform(movies_metadata_genre_vote)

# creating a data frame
movies_genre_vote_transform = pd.DataFrame(norm_fit, columns = movies_metadata_genre_vote.columns, 
                                          index = movies_metadata_genre_vote.index)
movies_genre_vote_transform.head()

title,vote_average,Animation,Comedy,Family,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,History,Mystery,War,Foreign,Music,Documentary,Western
Toy Story,0.975622,0.126704,0.126704,0.126704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.969909,0.0,0.0,0.140566,0.140566,0.140566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grumpier Old Men,0.97714,0.0,0.150329,0.0,0.0,0.0,0.150329,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Waiting to Exhale,0.961973,0.0,0.1577,0.0,0.0,0.0,0.1577,0.1577,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Father of the Bride Part II,0.984957,0.0,0.172799,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# remove movies_metadata_genre_vote
del movies_metadata_genre_vote

#### 4.2.2.2 Cosine Similarity

In [None]:
# calculating cosine similarity
movies_metadata_genre_vote_similarity = cosine_similarity(movies_genre_vote_transform)

In [None]:
# converting to data frame
movies_metadata_genre_vote_similarity_df = pd.DataFrame(movies_metadata_genre_vote_similarity,
                                                        index = movies_genre_vote_transform.index,
                                                        columns = movies_genre_vote_transform.index)
movies_metadata_genre_vote_similarity_df.head()

title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,The Final Storm,In a Heartbeat,"Blood, Sweat and Tears",To Be Fat Like Me,Cadet Kelly,The Man with the Rubber Head,The Devilish Tenant,The One-Man Band,Mom,Robin Hood
Toy Story,1.0,0.964075,0.972366,0.958503,0.98284,0.944289,0.971115,0.936891,0.930569,0.943668,...,0.858261,0.992997,0.96524,0.963177,0.981995,0.975548,0.973092,0.942726,0.943668,0.933477
Jumanji,0.964075,1.0,0.947737,0.933026,0.955319,0.938759,0.945621,0.958352,0.949497,0.958742,...,0.853235,0.959385,0.959588,0.960347,0.952457,0.971724,0.969526,0.958102,0.938142,0.92801
Grumpier Old Men,0.972366,0.947737,1.0,0.987396,0.988418,0.945758,0.99995,0.916312,0.932016,0.945136,...,0.859596,0.985166,0.966742,0.940253,0.987947,0.980096,0.978027,0.944193,0.945136,0.960163
Waiting to Exhale,0.958503,0.933026,0.987396,1.0,0.974752,0.950901,0.987481,0.929475,0.91755,0.930465,...,0.846254,0.972148,0.974681,0.956008,0.974445,0.966138,0.964264,0.929537,0.953577,0.97336
Father of the Bride Part II,0.98284,0.955319,0.988418,0.974752,1.0,0.953324,0.987465,0.923642,0.939473,0.952697,...,0.866473,0.97779,0.974476,0.947775,0.999867,0.990688,0.988957,0.951747,0.952697,0.942408


In [None]:
# removes movies_genre_vote_transform
del movies_genre_vote_transform

In [None]:
del norm
del norm_fit

#### 4.2.2.3 Finding similar movies

In [None]:
similar_movies('Toy Story', movies_metadata_genre_vote_similarity_df, 10)

title
Toy Story                            1.000000
One Froggy Evening                   0.999996
Toy Story 3                          0.999996
Rabbit Fire                          0.999996
The Wrong Trousers                   0.999996
A Charlie Brown Christmas            0.999984
There Once Was a Dog                 0.999984
Monsters, Inc.                       0.999984
A Close Shave                        0.999984
Scooby-Doo! and the Samurai Sword    0.999962
Name: Toy Story, dtype: float64

In [None]:
similar_movies('Toy Story', movies_metadata_genre_vote_similarity_df, 50)

title
Toy Story                                               1.000000
One Froggy Evening                                      0.999996
Toy Story 3                                             0.999996
Rabbit Fire                                             0.999996
The Wrong Trousers                                      0.999996
A Charlie Brown Christmas                               0.999984
There Once Was a Dog                                    0.999984
Monsters, Inc.                                          0.999984
A Close Shave                                           0.999984
Scooby-Doo! and the Samurai Sword                       0.999962
Cosmic Scrat-tastrophe                                  0.999962
Creature Comforts                                       0.999932
Toy Story of Terror!                                    0.999932
Toy Story 2                                             0.999932
Scooby-Doo! Camp Scare                                  0.999932
Banana             

In [None]:
del movies_metadata_genre_vote_similarity

#### 4.2.2.4 Genre and Vote Average Summary

The genre and vote average recommendation did better: it recommended Toy Story 3 in the top 10 similar movies to Toy Story, and it recommended two other Toy Story movies in the top 50. However, many movies still have a similarity score close to 1.

### 4.2.3 Genre and Keyword

Although the recommendation system with both genre and vote average was able to recommend a sequel to Toy Story in the top 10 results, it did miss Toy Story 2, which may be a result of the lower popularity of the second movie. Also, since it only uses genre and popularity, there are a lot of results to go through. To help the system better recommend sequels and more similar movies, I am adding keywords as a feature.

Since the keywords are stored as a string representation of a list of dictionaries, I will first have to extract just the keyword names as strings.
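
As an alternative to pattern matching, a string in this format can be parsed back into Python objects with `ast.literal_eval` from the standard library. The following is a minimal sketch under that assumption; the helper name `keyword_names` is hypothetical, and the second entry of the sample string is invented for illustration.

```python
import ast

def keyword_names(raw):
    """Hypothetical helper: parse the stringified list of dicts and return the names."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # missing or malformed entries yield no keywords
    if not isinstance(parsed, list):
        return []
    return [item['name'] for item in parsed if isinstance(item, dict) and 'name' in item]

# first entry follows the format shown in the data; the second entry is invented
sample = "[{'id': 818, 'name': 'based on novel'}, {'id': 0, 'name': 'made-up example'}]"
print(keyword_names(sample))
```

Parsing the structure keeps keyword names of any length, including names with three or more words or with hyphens.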

#### 4.2.3.1 Importing Data

In [None]:
# remove the last recommendation system
del movies_metadata_genre_vote_similarity_df

NameError: ignored

In [None]:
# importing data
movies_metadata=pd.read_csv('/content/drive/MyDrive/machine_learning_projects/movie_recommendation_system/data/movies_metadata_cleaned.csv')



In [None]:
# function to find similar movies and provide recommendation
def similar_movies(x, data, k=None):
    """Take a movie title and return its [k: optional] most similar movies."""
    # guard against titles that are not in the similarity matrix
    if x not in data.index:
        raise KeyError(f"'{x}' is not in the database")

    # similarity scores for movie x, sorted from most to least similar
    movies_rec = data.loc[x].sort_values(ascending=False)

    # with k=None, return the full ranking
    if k is None:
        return movies_rec
    return movies_rec.iloc[:k]

In [None]:
# reducing the set to more popular movies
movies_metadata_small=movies_metadata[movies_metadata['vote_count']>10]
movies_metadata_small.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,imdb_id,original_language,overview,release_date,spoken_languages,title,vote_average,...,Mystery,War,Foreign,Music,Documentary,Western,production_companies_list,spoken_languages_list,release_year,outlier
0,0,0,862,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,"[{'iso_639_1': 'en', 'name': 'English'}]",Toy Story,7.7,...,0,0,0,0,0,0,['Pixar Animation Studios'],['en'],1995,non-outlier
1,1,1,8844,tt0113497,en,When siblings Judy and Peter discover an encha...,1995-12-15,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Jumanji,6.9,...,0,0,0,0,0,0,"['TriStar Pictures', 'Teitler Film', 'Intersco...","['en', 'fr']",1995,non-outlier
2,2,2,15602,tt0113228,en,A family wedding reignites the ancient feud be...,1995-12-22,"[{'iso_639_1': 'en', 'name': 'English'}]",Grumpier Old Men,6.5,...,0,0,0,0,0,0,"['Warner Bros.', 'Lancaster Gate']",['en'],1995,non-outlier
3,3,3,31357,tt0114885,en,"Cheated on, mistreated and stepped on, the wom...",1995-12-22,"[{'iso_639_1': 'en', 'name': 'English'}]",Waiting to Exhale,6.1,...,0,0,0,0,0,0,['Twentieth Century Fox Film Corporation'],['en'],1995,non-outlier
4,4,4,11862,tt0113041,en,Just when George Banks has recovered from his ...,1995-02-10,"[{'iso_639_1': 'en', 'name': 'English'}]",Father of the Bride Part II,5.7,...,0,0,0,0,0,0,"['Sandollar Productions', 'Touchstone Pictures']",['en'],1995,non-outlier


In [None]:
# removes movies_metadata
del movies_metadata

In [None]:
# importing keywords
keywords=pd.read_csv('/content/drive/MyDrive/machine_learning_projects/movie_recommendation_system/data/keywords_cleaned.csv')
keywords.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,keywords
0,0,0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,1,1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,2,2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,3,3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,4,4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


#### 4.2.3.2 Formatting Data

In [None]:
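# dropping unnecessary columns
# (reassigning instead of using inplace=True, because movies_metadata_small is a filtered slice)
movies_metadata_small = movies_metadata_small.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])
keywords = keywords.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])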

# merging movies_metadata_small with keywords
movies_metadata_genre_keyword=movies_metadata_small.merge(keywords, how='left', on='id')
movies_metadata_genre_keyword.columns

Index(['id', 'imdb_id', 'original_language', 'overview', 'release_date',
       'spoken_languages', 'title', 'vote_average', 'vote_count',
       'genres_list', 'Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History',
       'Mystery', 'War', 'Foreign', 'Music', 'Documentary', 'Western',
       'production_companies_list', 'spoken_languages_list', 'release_year',
       'outlier', 'keywords'],
      dtype='object')

In [None]:
# removes movies_metadata_small and keywords
del keywords
del movies_metadata_small

In [None]:
# removing unnecessary columns from the new data frame
drop_columns=['id', 'imdb_id', 'original_language', 'overview', 'release_date',
       'spoken_languages', 'vote_average', 'vote_count',
       'genres_list', 'production_companies_list', 'spoken_languages_list', 'release_year',
       'outlier']

movies_metadata_genre_keyword.drop(columns=drop_columns, inplace=True)

movies_metadata_genre_keyword.columns

Index(['title', 'Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History',
       'Mystery', 'War', 'Foreign', 'Music', 'Documentary', 'Western',
       'keywords'],
      dtype='object')

In [None]:
del drop_columns

In [None]:
# setting the title as index
movies_index=movies_metadata_genre_keyword['title']
movies_metadata_genre_keyword.index=movies_index

# drop title column
movies_metadata_genre_keyword.drop(columns=['title'], inplace=True)

# checking the results
movies_metadata_genre_keyword.head()

title,Animation,Comedy,Family,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,History,Mystery,War,Foreign,Music,Documentary,Western,keywords
Toy Story,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
Jumanji,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
Grumpier Old Men,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
Waiting to Exhale,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
Father of the Bride Part II,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [None]:
del movies_index

#### 4.2.3.3 Extracting Keywords

In [None]:
# inspecting the keywords column
movies_metadata_genre_keyword.loc[:,'keywords']

title
Toy Story                       [{'id': 931, 'name': 'jealousy'}, {'id': 4290,...
Jumanji                         [{'id': 10090, 'name': 'board game'}, {'id': 1...
Grumpier Old Men                [{'id': 1495, 'name': 'fishing'}, {'id': 12392...
Waiting to Exhale               [{'id': 818, 'name': 'based on novel'}, {'id':...
Father of the Bride Part II     [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...
                                                      ...                        
The Man with the Rubber Head    [{'id': 10124, 'name': 'laboratory'}, {'id': 1...
The Devilish Tenant             [{'id': 11320, 'name': 'tenant'}, {'id': 15480...
The One-Man Band                                                               []
Mom                             [{'id': 155794, 'name': 'physical abuse'}, {'i...
Robin Hood                                                                     []
Name: keywords, Length: 21740, dtype: object

In [None]:
# convert keywords column to a string
movies_metadata_genre_keyword['keywords']=movies_metadata_genre_keyword['keywords'].astype(str)

In [None]:
# function extracts keywords
def extract_keywords(row):
    """Take the string in the keywords column and return the keyword names as one comma-separated string."""
    # note: this pattern only captures names made of one or two words of letters, digits, or underscores;
    # names with three or more words (e.g. 'based on novel') or with hyphens are skipped
    keywords = re.findall(r"'name':\s+'(\w+\s*\w+)'", row)
    return ", ".join(keywords)

# creates a new column keywords_list
movies_metadata_genre_keyword['keywords_list'] = movies_metadata_genre_keyword['keywords'].apply(extract_keywords)

# inspect the results
movies_metadata_genre_keyword['keywords_list']

title
Toy Story                       jealousy, toy, boy, friendship, friends, rival...
Jumanji                         board game, disappearance, new home, recluse, ...
Grumpier Old Men                fishing, best friend, duringcreditsstinger, ol...
Waiting to Exhale               interracial relationship, single mother, divor...
Father of the Bride Part II     baby, midlife crisis, confidence, aging, daugh...
                                                      ...                        
The Man with the Rubber Head    laboratory, mad scientist, disembodied head, s...
The Devilish Tenant                                           tenant, silent film
The One-Man Band                                                                 
Mom                                                physical abuse, sexual assault
Robin Hood                                                                       
Name: keywords_list, Length: 21740, dtype: object

To use the TfidfVectorizer, each movie's keywords need to be a single string. The default tokenizer ignores punctuation, so the commas do not have to be removed separately.

In [None]:
# drop the keywords column
movies_metadata_genre_keyword.drop(columns=['keywords'], inplace=True)

In [None]:
# convert keywords_list to a string; missing values are filled first,
# because astype(str) would turn NaN into the literal string 'nan'
movies_metadata_genre_keyword['keywords']=movies_metadata_genre_keyword['keywords_list'].fillna('').astype(str)

In [None]:
# check the results
movies_metadata_genre_keyword['keywords']

title
Toy Story                       jealousy, toy, boy, friendship, friends, rival...
Jumanji                         board game, disappearance, new home, recluse, ...
Grumpier Old Men                fishing, best friend, duringcreditsstinger, ol...
Waiting to Exhale               interracial relationship, single mother, divor...
Father of the Bride Part II     baby, midlife crisis, confidence, aging, daugh...
                                                      ...                        
The Man with the Rubber Head    laboratory, mad scientist, disembodied head, s...
The Devilish Tenant                                           tenant, silent film
The One-Man Band                                                                 
Mom                                                physical abuse, sexual assault
Robin Hood                                                                       
Name: keywords, Length: 21740, dtype: object

In [None]:
# drop the keywords_list column
movies_metadata_genre_keyword.drop(columns=['keywords_list'], inplace=True)

#### 4.2.3.4 Creating TfidfVectorizer for Keywords

In [None]:
tfidfvec=TfidfVectorizer()

tfidfvec_data=tfidfvec.fit_transform(movies_metadata_genre_keyword['keywords'])

tfidfvec.get_feature_names_out()

array(['10', '11', '1500s', ..., '카운트다운', '하울링', '형사'], dtype=object)

In [None]:
tfidfvec_data_df=pd.DataFrame(tfidfvec_data.toarray(), columns=tfidfvec.get_feature_names_out())

tfidfvec_data_df.index=movies_metadata_genre_keyword.index

tfidfvec_data_df.head()



title,10,11,1500s,15th,16th,17th,18th,1910s,1920s,1930s,...,超级妈妈,감시자들,관상,돈의,소원,연애,오싹한,카운트다운,하울링,형사
Toy Story,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grumpier Old Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Waiting to Exhale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Father of the Bride Part II,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
del tfidfvec
del tfidfvec_data

In [None]:
# removing keyword column, so I can merge it with the tfidfvec data frame
keyword_genre_df = movies_metadata_genre_keyword.drop(columns=['keywords'])

# combining genre and tfidf data for each movie
keyword_genre_df = keyword_genre_df.merge(tfidfvec_data_df, how='left', left_index=True, right_index=True)

keyword_genre_df.head()

title,Animation,Comedy,Family,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,...,超级妈妈,감시자들,관상,돈의,소원,연애,오싹한,카운트다운,하울링,형사
#1 Cheerleader Camp,0,1,0,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#Horror,0,0,0,0,0,0,1,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
$5 a Day,0,1,0,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
$50K and a Call Girl: A Love Story,0,0,0,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
$9.99,1,0,0,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
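One thing worth checking with an index-based merge is whether the index is unique. If any titles repeat (remakes often share a title), every matching pair of rows is combined, so the result can have more rows than either input. The sketch below uses two tiny, hypothetical frames to show the effect, along with a row-by-row alternative that is only valid when both frames are in the same row order.

```python
import pandas as pd

# Hypothetical frames with a repeated title, standing in for the genre and tf-idf tables
titles = pd.Index(["Robin Hood", "Robin Hood", "Toy Story"], name="title")
left = pd.DataFrame({"Comedy": [1, 0, 1]}, index=titles)
right = pd.DataFrame({"toy": [0.0, 0.0, 0.7]}, index=titles)

# Index-based merge: each "Robin Hood" row pairs with both "Robin Hood" rows on the right
merged = left.merge(right, how="left", left_index=True, right_index=True)
print(left.index.is_unique, len(left), len(merged))

# Row-by-row alternative: join on position, then restore the titles
aligned = left.reset_index(drop=True).join(right.reset_index(drop=True))
aligned.index = left.index
print(len(aligned))
```

Checking `index.is_unique` before merging is a cheap way to find out whether this applies to the data at hand.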


In [None]:
del tfidfvec_data_df
del movies_metadata_genre_keyword

Since tfidf creates entries that are between zero and one and the genre data consists of dummy variables, I do not need to normalize this data.

In [None]:
# ensuring every value across the 9313 columns is less than or equal to 1
(keyword_genre_df.max()<=1).all()

In [None]:
# ensuring every value is greater than or equal to 0
(keyword_genre_df.min()>=0).all()
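A pitfall to keep in mind with this kind of range check is that `len()` applied to a boolean Series only counts its entries (here, the number of columns); it does not test whether the comparison holds. The sketch below shows the difference on a small, hypothetical frame in which one column deliberately exceeds 1.

```python
import pandas as pd

# Hypothetical frame: column "b" contains a value outside [0, 1]
df = pd.DataFrame({"a": [0.0, 0.5], "b": [1.0, 3.0]})

# len() counts the columns, whatever the comparison result is
n_columns = len(df.max() <= 1)

# .all() actually checks the condition for every column
upper_ok = bool((df.max() <= 1).all())
lower_ok = bool((df.min() >= 0).all())

print(n_columns, upper_ok, lower_ok)
```

Note that the lower bound is checked against the column minimum, not the maximum.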

#### 4.2.3.5 Non-negative Matrix Factorization

When I tried to calculate cosine similarity on the full feature matrix, my system kept crashing, so I decided to reduce the number of dimensions by using non-negative matrix factorization.

In [None]:
# initiating NMF
nmf=NMF(n_components=10)

In [None]:
# saving movie titles
movie_names=keyword_genre_df.index

# fitting it to the data
keyword_genre=nmf.fit_transform(keyword_genre_df)



In [None]:
# creating a data frame and setting the movie titles as the index
keyword_genre=pd.DataFrame(keyword_genre, index=movie_names)
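As a small illustration of what the factorization produces, the sketch below runs NMF on a random, hypothetical matrix. The values of `n_components`, `init`, `random_state`, and `max_iter` are assumptions chosen for the example; the notebook itself only sets `n_components=10`.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative feature matrix: 6 "movies" x 5 features
rng = np.random.default_rng(0)
X = rng.random((6, 5))

# random_state makes the result reproducible; max_iter is raised to help convergence
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)

W = nmf.fit_transform(X)  # one low-dimensional row per movie
H = nmf.components_       # how each component loads on the original features

print(W.shape, H.shape)
print(nmf.reconstruction_err_)  # error of approximating X by W @ H
```

Fewer components save memory but discard more detail, so movies that differ only in rare keywords may end up with nearly identical rows in `W`.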

In [None]:
# deleting the old NMF
del nmf

In [None]:
# deleting the old keyword_genre_df
del keyword_genre_df
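As an alternative to reducing dimensions, the similarity can be computed for one movie at a time against a sparse feature matrix, which avoids building the full similarity matrix (for 21,740 movies that matrix has roughly 473 million entries). The sketch below is a minimal illustration under that assumption; the `titles` and `docs` lists are hypothetical, and genre columns are left out for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles and keyword strings
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
docs = ["toy friendship rivalry", "toy friendship", "board game", "fishing friend"]

# fit_transform returns a scipy sparse matrix; it is never converted to a dense array
X = TfidfVectorizer().fit_transform(docs)

# Similarity of a single movie against all movies
query = titles.index("Movie A")
scores = cosine_similarity(X[query], X).ravel()

# Indices of the highest-scoring movies, best first
top = np.argsort(-scores)[:3]
print([(titles[i], round(float(scores[i]), 3)) for i in top])
```

The trade-off is that similarities are recomputed on each request instead of being looked up in a precomputed table.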

#### 4.2.3.6 Cosine Similarity

In [None]:
# calculating cosine similarity
genre_keyword_similarity=cosine_similarity(keyword_genre)
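Cosine similarity compares only the direction of two vectors, not their length. The sketch below uses three hypothetical 3-component vectors to show that two rows pointing the same way score 1 even when their magnitudes differ, while rows with no shared components score 0.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-component embeddings
vectors = np.array([
    [1.0, 0.0, 2.0],  # movie A
    [2.0, 0.0, 4.0],  # movie B: same direction as A, twice the length
    [0.0, 3.0, 0.0],  # movie C: shares no component with A or B
])

# Pairwise similarities; entry [i, j] compares row i with row j
sim = cosine_similarity(vectors)
print(sim.round(3))
```

With only a handful of components per movie, many rows can point in almost the same direction, which is one plausible reason for a large number of scores at or near 1.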

In [None]:
del keyword_genre


In [None]:
# saving cosine similarity as a data frame
genre_keyword_similarity_df=pd.DataFrame(genre_keyword_similarity, columns=movie_names, index=movie_names)
genre_keyword_similarity_df.head()

title,#1 Cheerleader Camp,#Horror,$5 a Day,$50K and a Call Girl: A Love Story,$9.99,'71,'Tis the Season for Love,'Twas the Night Before Christmas,(500) Days of Summer,(Dis)Honesty: The Truth About Lies,...,À nos amours,À propos de Nice,Æon Flux,Çalgı Çengi,È arrivato mio fratello,Él,Ödipussi,Üvegtigris,Želary,’Round Midnight
#1 Cheerleader Camp,1.0,0.273733,1.0,0.593959,0.446598,0.27931,0.0,0.0,0.747262,0.67288,...,0.325254,0.285074,0.001995,0.804495,0.804495,0.273782,0.804495,0.804495,0.330333,0.59472
#Horror,0.273733,1.0,0.273733,0.460861,0.344379,0.563324,0.0,0.000126,0.204124,0.282813,...,0.252397,0.163359,0.001357,0.0,0.0,0.575789,0.0,0.0,0.25631,0.460944
$5 a Day,1.0,0.273733,1.0,0.593959,0.446598,0.27931,0.0,0.0,0.747262,0.67288,...,0.325254,0.285074,0.001995,0.804495,0.804495,0.273782,0.804495,0.804495,0.330333,0.59472
$50K and a Call Girl: A Love Story,0.593959,0.460861,0.593959,1.0,0.740977,0.470252,0.0,0.0,0.442884,0.465361,...,0.547603,0.192512,0.000394,0.0,0.0,0.460945,0.0,0.0,0.556155,0.999977
$9.99,0.446598,0.344379,0.446598,0.740977,1.0,0.355373,0.004378,0.671403,0.33593,0.724642,...,0.409425,0.623971,0.010158,0.008065,0.008065,0.344575,0.008065,0.008065,0.415737,0.741842


In [None]:
# removing genre_keyword_similarity
del genre_keyword_similarity

In [None]:
# deleting movie names
del movie_names

#### 4.2.3.7 Finding Similar movies

In [None]:
similar_movies('Toy Story', genre_keyword_similarity_df, 20)

title
Toy Story                                               1.0
Cloudy with a Chance of Meatballs                       1.0
Barnyard                                                1.0
Free Birds                                              1.0
Animals United                                          1.0
Tom and Jerry: Shiver Me Whiskers                       1.0
Phineas and Ferb the Movie: Across the 2nd Dimension    1.0
Garfield Gets Real                                      1.0
Saving Santa                                            1.0
Scooby-Doo! and the Loch Ness Monster                   1.0
Banana                                                  1.0
Anina                                                   1.0
Scooby-Doo Goes Hollywood                               1.0
Mr. Magoo's Christmas Carol                             1.0
Hop                                                     1.0
Scooby-Doo! And the Legend of the Vampire               1.0
Scooby-Doo! Camp Scare            

In [None]:
similar_movies('Toy Story 2', genre_keyword_similarity_df, 20)

title
Toy Story 2                                   1.0
Robots                                        1.0
Cloudy with a Chance of Meatballs             1.0
Looney Tunes: Back in Action                  1.0
Surf's Up                                     1.0
Chicken Run                                   1.0
Toy Story                                     1.0
Monsters, Inc.                                1.0
The SpongeBob SquarePants Movie               1.0
Anina                                         1.0
Dug's Special Mission                         1.0
Cosmic Scrat-tastrophe                        1.0
Banana                                        1.0
Happiness Is a Warm Blanket, Charlie Brown    1.0
Bartok the Magnificent                        1.0
Big Top Scooby-Doo!                           1.0
Barbie as The Princess & the Pauper           1.0
The Prince and the Pauper                     1.0
An All Dogs Christmas Carol                   1.0
Rabbit Fire                                 

In [None]:
genre_keyword_similarity_df.loc['Toy Story',['Toy Story 2', 'Toy Story 3']]

title
Toy Story 2    1.000000
Toy Story 3    0.999999
Name: Toy Story, dtype: float64

In [None]:
del genre_keyword_similarity_df 

#### 4.2 Genre and Keyword Summary

The recommendations for Toy Story do not include any Toy Story sequel in the top 20; however, the system does return Toy Story as a recommendation for Toy Story 2. Toy Story and Toy Story 2 have a cosine similarity of 1, while Toy Story and Toy Story 3 have a cosine similarity of 0.999999. The individual scores are reasonable, but many movies tie at or near 1, which suggests the system is still scoring mostly on genre. With that many ties, the order within the top 20 is effectively arbitrary, which is likely why the sequels are missing from the Toy Story list despite their near-perfect scores.

## 4.3 Collaborative Based Recommendation System

For the collaborative filtering, I will only use the same movies I used for the content-based model, so the models are comparable.

### 4.3.1 Importing Data

In [None]:
movies_metadata_small=pd.read_csv('/content/drive/MyDrive/machine_learning_projects/movie_recommendation_system/data/movies_metadata_small.csv')

In [None]:
movies_metadata_small.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,id,imdb_id,original_language,overview,release_date,spoken_languages,title,...,Mystery,War,Foreign,Music,Documentary,Western,production_companies_list,spoken_languages_list,release_year,outlier
0,0,0,0,862,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,"[{'iso_639_1': 'en', 'name': 'English'}]",Toy Story,...,0,0,0,0,0,0,['Pixar Animation Studios'],['en'],1995,non-outlier
1,1,1,1,8844,tt0113497,en,When siblings Judy and Peter discover an encha...,1995-12-15,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Jumanji,...,0,0,0,0,0,0,"['TriStar Pictures', 'Teitler Film', 'Intersco...","['en', 'fr']",1995,non-outlier
2,2,2,2,15602,tt0113228,en,A family wedding reignites the ancient feud be...,1995-12-22,"[{'iso_639_1': 'en', 'name': 'English'}]",Grumpier Old Men,...,0,0,0,0,0,0,"['Warner Bros.', 'Lancaster Gate']",['en'],1995,non-outlier
3,3,3,3,31357,tt0114885,en,"Cheated on, mistreated and stepped on, the wom...",1995-12-22,"[{'iso_639_1': 'en', 'name': 'English'}]",Waiting to Exhale,...,0,0,0,0,0,0,['Twentieth Century Fox Film Corporation'],['en'],1995,non-outlier
4,4,4,4,11862,tt0113041,en,Just when George Banks has recovered from his ...,1995-02-10,"[{'iso_639_1': 'en', 'name': 'English'}]",Father of the Bride Part II,...,0,0,0,0,0,0,"['Sandollar Productions', 'Touchstone Pictures']",['en'],1995,non-outlier


### 4.3.2 Formatting

In [None]:
# columns to drop
drop_columns=['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'imdb_id',
       'original_language', 'overview', 'release_date', 'spoken_languages',
       'vote_average', 'vote_count', 'genres_list', 'production_companies_list', 'spoken_languages_list', 'release_year',
       'outlier']

# inspect the available columns
movies_metadata_small.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'id', 'imdb_id',
       'original_language', 'overview', 'release_date', 'spoken_languages',
       'title', 'vote_average', 'vote_count', 'genres_list', 'Animation',
       'Comedy', 'Family', 'Adventure', 'Fantasy', 'Romance', 'Drama',
       'Action', 'Crime', 'Thriller', 'Horror', 'History', 'Mystery', 'War',
       'Foreign', 'Music', 'Documentary', 'Western',
       'production_companies_list', 'spoken_languages_list', 'release_year',
       'outlier'],
      dtype='object')