# 4 Pre-processing Data

## 4.1 Data Source

### 4.1.1 Importing packages

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split

# packages for NLP
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# packages required to calculate jaccard similarity
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform

# Normalizing data
from sklearn.preprocessing import Normalizer

# saving data to a file
from library.sb_utils import save_file


### 4.1.2 Importing Data

In [2]:
# importing data
movies_metadata=pd.read_csv('../data/movies_metadata_cleaned.csv')



### 4.1.3 Formatting

Due to computing power limitations, I am unable to run a model on all 45,418 movies, so I will select roughly half of them. Initially, I was going to randomly sample the movies; however, this left a lot of obscure movies. Since I was not familiar with many of the movies that remained, I was unable to judge how well the recommendation system performed. Thus, I decided to select approximately the 22,709 most popular movies. Popularity is measured by the number of votes rather than the average rating, to avoid depending on people's opinions of a movie. The more votes a movie has, the more people have seen it, so even a movie that did not perform well should be reasonably well known if it has enough votes.

Selecting the movies based on number of votes does mean that obscure movies will be left out; thus, the recommendation system will not perform well for people who enjoy those types of movies. However, I feel the majority of people prefer well-known movies, which is why those movies are well known. To cater to the largest number of people, I decided to go with the more well-known movies.

In [3]:
movies_metadata.head()

,budget,id,imdb_id,original_language,overview,release_date,revenue,spoken_languages,title,vote_average,...,Thriller,Horror,History,Mystery,War,Foreign,Music,Documentary,Western,production_companies_list
0,30000000,862,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,373554033.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Toy Story,7.7,...,0,0,0,0,0,0,0,0,0,['Pixar Animation Studios']
1,65000000,8844,tt0113497,en,When siblings Judy and Peter discover an encha...,1995-12-15,262797249.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Jumanji,6.9,...,0,0,0,0,0,0,0,0,0,"['TriStar Pictures', 'Teitler Film', 'Intersco..."
2,0,15602,tt0113228,en,A family wedding reignites the ancient feud be...,1995-12-22,0.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Grumpier Old Men,6.5,...,0,0,0,0,0,0,0,0,0,"['Warner Bros.', 'Lancaster Gate']"
3,16000000,31357,tt0114885,en,"Cheated on, mistreated and stepped on, the wom...",1995-12-22,81452156.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Waiting to Exhale,6.1,...,0,0,0,0,0,0,0,0,0,['Twentieth Century Fox Film Corporation']
4,0,11862,tt0113041,en,Just when George Banks has recovered from his ...,1995-02-10,76578911.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Father of the Bride Part II,5.7,...,0,0,0,0,0,0,0,0,0,"['Sandollar Productions', 'Touchstone Pictures']"


In [4]:
movies_metadata.shape[0]

45418

In [5]:
# randomly sample half of the movies (without replacement)
n = movies_metadata.shape[0] // 2

# DataFrame.sample draws distinct rows, whereas np.random.randint
# can return the same index more than once
movies_metadata_small = movies_metadata.sample(n=n)
movies_metadata_small.shape

(22709, 31)

Since 50 percent of the movies have fewer than 10 votes, I will select all movies with a vote count greater than 10.

In [6]:
# selecting all movies with a vote count greater than 10
movies_metadata_small=movies_metadata[movies_metadata['vote_count']>10]
movies_metadata_small.shape

(21740, 31)
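
The threshold above can be checked against the median vote count before filtering. The following is a minimal sketch on a hypothetical six-row frame (`toy` and its values are made up for illustration and are not taken from `movies_metadata`); it only shows the shape of the check, not the actual figures.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_metadata
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "vote_count": [2, 5, 10, 11, 250, 4000],
})

# Median vote count: half of the titles fall at or below this value
median_votes = toy["vote_count"].median()

# Keep only titles with strictly more than 10 votes, as in the cell above
popular = toy[toy["vote_count"] > 10]

print(median_votes)
print(popular["title"].tolist())
```

Because the filter is strict (`> 10`), a movie with exactly 10 votes is dropped.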

## 4.2 Content Based Recommendation System

### 4.2.1 Genre based Model

The first model will use only the genre to make a recommendation; therefore, the assumption is that a person chooses to watch a movie because they enjoy a particular genre. Although this assumption is likely true to an extent, its recommendations will be limited because people enjoy movies based on more than just genre, and they like more than one genre. Although the recommendation will be limited, it will provide a good base model.

Since the genre data is binary, I will use Jaccard similarity for this first model.
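
For two movies, Jaccard similarity is the number of genres they share divided by the number of genres that either one has. As a small hand check, the sketch below uses the genre flags shown later for Toy Story and Jumanji, restricted to the five genres where either movie has a 1 (the variable names are illustrative only).

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Columns: Animation, Comedy, Family, Adventure, Fantasy
toy_story = np.array([1, 1, 1, 0, 0], dtype=bool)
jumanji = np.array([0, 0, 1, 1, 1], dtype=bool)

# Similarity = |intersection| / |union|
shared = np.logical_and(toy_story, jumanji).sum()   # Family only
either = np.logical_or(toy_story, jumanji).sum()    # all five genres
similarity = shared / either

# SciPy returns the Jaccard *distance*, which is 1 - similarity
distance = jaccard(toy_story, jumanji)

print(similarity, distance)
```

Genres that both movies lack do not enter the calculation, which is why dropping the all-zero columns here does not change the result.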

In [7]:
# remove all columns except genre columns
drop_columns=['budget', 'imdb_id', 'original_language', 'overview', 'release_date', 'revenue', 'spoken_languages', 'vote_average', 'vote_count', 'genres_list', 'production_companies_list', 'id']
movies_metadata_genre = movies_metadata_small.drop(columns=drop_columns)

In [8]:
# collecting movie titles for index
movies_index = movies_metadata_genre['title']
movies_metadata_genre = movies_metadata_genre.drop(columns = ['title'])

movies_metadata_genre.columns

Index(['Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy', 'Romance',
       'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History', 'Mystery',
       'War', 'Foreign', 'Music', 'Documentary', 'Western'],
      dtype='object')

In [9]:
# setting movie title as index
movies_metadata_genre.index = movies_index

movies_metadata_genre.head()

title,Animation,Comedy,Family,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,History,Mystery,War,Foreign,Music,Documentary,Western
Toy Story,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Jumanji,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
Grumpier Old Men,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
Waiting to Exhale,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
Father of the Bride Part II,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### 4.2.1.1 Calculating Jaccard Similarity

In [28]:
# calculating jaccard distance for all movies
movies_metadata_genre_distance = pdist(movies_metadata_genre, metric = 'jaccard')

In [None]:
# putting the distances into a square matrix
movies_metadata_genre_df= squareform(movies_metadata_genre_distance)

In [30]:
# converting to a dataframe
movies_metadata_genre_df=pd.DataFrame(movies_metadata_genre_df, index=movies_index, columns = movies_index)

In [31]:
movies_metadata_genre_df.head()

title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,The Final Storm,In a Heartbeat,"Blood, Sweat and Tears",To Be Fat Like Me,Cadet Kelly,The Man with the Rubber Head,The Devilish Tenant,The One-Man Band,Mom,Robin Hood
Toy Story,0.0,0.8,0.75,0.8,0.666667,1.0,0.75,0.833333,1.0,1.0,...,1.0,0.25,1.0,0.75,0.666667,0.75,0.75,1.0,1.0,1.0
Jumanji,0.8,0.0,1.0,1.0,1.0,1.0,1.0,0.6,0.8,0.8,...,1.0,0.833333,1.0,0.75,1.0,0.75,0.75,0.8,1.0,1.0
Grumpier Old Men,0.75,1.0,0.0,0.333333,0.5,1.0,0.0,1.0,1.0,1.0,...,1.0,0.5,1.0,1.0,0.5,0.666667,0.666667,1.0,1.0,0.75
Waiting to Exhale,0.8,1.0,0.333333,0.0,0.666667,0.833333,0.333333,0.833333,1.0,1.0,...,1.0,0.6,0.666667,0.75,0.666667,0.75,0.75,1.0,0.8,0.5
Father of the Bride Part II,0.666667,1.0,0.5,0.666667,0.0,1.0,0.5,1.0,1.0,1.0,...,1.0,0.75,1.0,1.0,0.0,0.5,0.5,1.0,1.0,1.0


In [32]:
# calculating jaccard similarity
movies_metadata_genre_similarity = 1- movies_metadata_genre_df

In [33]:
# checking the similarity matrix
movies_metadata_genre_similarity.head()

title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,The Final Storm,In a Heartbeat,"Blood, Sweat and Tears",To Be Fat Like Me,Cadet Kelly,The Man with the Rubber Head,The Devilish Tenant,The One-Man Band,Mom,Robin Hood
Toy Story,1.0,0.2,0.25,0.2,0.333333,0.0,0.25,0.166667,0.0,0.0,...,0.0,0.75,0.0,0.25,0.333333,0.25,0.25,0.0,0.0,0.0
Jumanji,0.2,1.0,0.0,0.0,0.0,0.0,0.0,0.4,0.2,0.2,...,0.0,0.166667,0.0,0.25,0.0,0.25,0.25,0.2,0.0,0.0
Grumpier Old Men,0.25,0.0,1.0,0.666667,0.5,0.0,1.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.5,0.333333,0.333333,0.0,0.0,0.25
Waiting to Exhale,0.2,0.0,0.666667,1.0,0.333333,0.166667,0.666667,0.166667,0.0,0.0,...,0.0,0.4,0.333333,0.25,0.333333,0.25,0.25,0.0,0.2,0.5
Father of the Bride Part II,0.333333,0.0,0.5,0.333333,1.0,0.0,0.5,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,1.0,0.5,0.5,0.0,0.0,0.0
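
The distance-to-similarity steps above can be reproduced on a three-movie example. This sketch uses the genre flags shown earlier for three of the movies, restricted to five genre columns for brevity; `toy` is a hypothetical stand-in for `movies_metadata_genre`.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

toy = pd.DataFrame(
    [[1, 1, 1, 0, 0],   # Toy Story
     [0, 0, 1, 1, 1],   # Jumanji
     [0, 1, 0, 0, 0]],  # Father of the Bride Part II
    index=["Toy Story", "Jumanji", "Father of the Bride Part II"],
    columns=["Animation", "Comedy", "Family", "Adventure", "Fantasy"],
).astype(bool)

# pdist returns a condensed vector with n*(n-1)/2 pairwise distances
condensed = pdist(toy.to_numpy(), metric="jaccard")

# squareform expands it to a symmetric n x n matrix with zeros on the diagonal
square = squareform(condensed)

# similarity = 1 - distance
similarity = pd.DataFrame(1 - square, index=toy.index, columns=toy.index)
print(similarity.round(6))
```

The condensed form stores each pair once, so for the full data set the square matrix is roughly twice as large in memory as the `pdist` output.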


In [34]:
movies_metadata_genre_similarity.index

Index(['Toy Story', 'Jumanji', 'Grumpier Old Men', 'Waiting to Exhale',
       'Father of the Bride Part II', 'Heat', 'Sabrina', 'Tom and Huck',
       'Sudden Death', 'GoldenEye',
       ...
       'The Final Storm', 'In a Heartbeat', 'Blood, Sweat and Tears',
       'To Be Fat Like Me', 'Cadet Kelly', 'The Man with the Rubber Head',
       'The Devilish Tenant', 'The One-Man Band', 'Mom', 'Robin Hood'],
      dtype='object', name='title', length=21740)

#### 4.2.1.2 Finding Similar movies based on genre

In [74]:
# function to find similar movies and provide recommendation
def similar_movies(x, data, k=None):
    """Take a movie title and return its [k: optional] most similar movies."""
    # membership test; Index.any() does not check whether x is present
    if x not in data.index:
        print('The movie is not in the database')
        return None

    # similarity scores between x and every movie, highest first
    movies_rec = data.loc[x].sort_values(ascending=False)

    # return everything when k is None, otherwise only the top k
    return movies_rec if k is None else movies_rec[:k]

In [75]:
similar_movies('Old Yeller', movies_metadata_genre_similarity, 10)

title
Old Yeller               1.00
The Timber               1.00
Man in the Wilderness    1.00
True Grit                1.00
Gold                     1.00
Into the West            1.00
Dances with Wolves       1.00
El Topo                  0.75
Far and Away             0.75
Seven Men from Now       0.75
Name: Old Yeller, dtype: float64
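
Note that the query movie itself always appears first with a score of 1. A possible variant, sketched below under the assumption that titles are unique in the index, drops the query from its own results and raises an error for unknown titles. The name `top_k_similar` and the 3x3 matrix are hypothetical and only serve the illustration.

```python
import pandas as pd

def top_k_similar(title, similarity, k=5):
    """Return the k highest-scoring titles for `title`, excluding itself."""
    if title not in similarity.index:
        raise KeyError(f"{title!r} is not in the similarity matrix")
    # Drop the query so it does not occupy the first slot
    scores = similarity.loc[title].drop(title)
    return scores.nlargest(k)

# Hypothetical similarity matrix
sim = pd.DataFrame(
    [[1.0, 0.2, 0.5],
     [0.2, 1.0, 0.0],
     [0.5, 0.0, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

result = top_k_similar("A", sim, k=2)
print(result)
```

If the index contains duplicate titles, `similarity.loc[title]` returns a DataFrame rather than a Series, so this sketch would need a unique key such as the movie `id` instead.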

In [76]:
similar_movies('Toy Story', movies_metadata_genre_similarity, 40)

title
Toy Story                                               1.0
The Wrong Trousers                                      1.0
Phineas and Ferb the Movie: Across the 2nd Dimension    1.0
Happiness Is a Warm Blanket, Charlie Brown              1.0
Banana                                                  1.0
Lorenzo                                                 1.0
The Emoji Movie                                         1.0
The Lion King 1½                                        1.0
Open Season 3                                           1.0
Scooby-Doo! And the Legend of the Vampire               1.0
Scooby-Doo! and the Samurai Sword                       1.0
Cloudy with a Chance of Meatballs                       1.0
Garfield                                                1.0
Leroy & Stitch                                          1.0
Cosmic Scrat-tastrophe                                  1.0
A Close Shave                                           1.0
Doug's 1st Movie                  

#### 4.2.1.3 Genre Summary

Although recommendations based solely on genre are very basic, the model does produce some relevant suggestions. For example, it recommends The Lion King 1½ and Scooby-Doo for Toy Story, which are animated kids' movies. However, because so many movies have a similarity of exactly 1, you have to go through a lot of movies before it recommends one of the other Toy Story movies. Thus, a lot of really close movies are missed.
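
One way to quantify the tie problem is to count how many movies score exactly 1 against the query. The sketch below does this on a hypothetical score vector (the titles and values are invented); on the real data the same expression could be applied to a row of `movies_metadata_genre_similarity`.

```python
import pandas as pd

# Hypothetical similarity scores for one query movie
scores = pd.Series(
    [1.0, 1.0, 1.0, 0.75, 0.5],
    index=["Query", "Tie A", "Tie B", "Near", "Far"],
)

# Drop the query itself, then count perfect matches
others = scores.drop("Query")
n_perfect = int((others == 1.0).sum())

print(n_perfect)
```

With only 18 binary genre columns, any two movies with identical genre sets tie at 1, so the ordering among them is arbitrary.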

### 4.2.2 Genre and popularity

Since the genre-only model produced so many movies with a similarity of 1, I will also include vote_average, so that if a person watches a popular movie, the model will recommend other popular movies in the same genre.

In [55]:
# remove all columns except genre and vote_average columns
drop_columns=['budget', 'imdb_id', 'original_language', 'overview', 'release_date', 'revenue', 'spoken_languages', 'vote_count', 'genres_list', 'production_companies_list', 'id']
movies_metadata_genre_vote = movies_metadata_small.drop(columns=drop_columns)

In [56]:
# collecting movie titles for index
movies_index = movies_metadata_genre_vote['title']
movies_metadata_genre_vote = movies_metadata_genre_vote.drop(columns = ['title'])

In [57]:
movies_metadata_genre_vote.columns

Index(['vote_average', 'Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History',
       'Mystery', 'War', 'Foreign', 'Music', 'Documentary', 'Western'],
      dtype='object')

In [59]:
movies_metadata_genre_vote.index=movies_index

In [60]:
movies_metadata_genre_vote.head()

title,vote_average,Animation,Comedy,Family,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,History,Mystery,War,Foreign,Music,Documentary,Western
Toy Story,7.7,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Jumanji,6.9,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
Grumpier Old Men,6.5,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
Waiting to Exhale,6.1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0
Father of the Bride Part II,5.7,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### 4.2.2.1 Normalizing data

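Since the vote average ranges between 1 and 10 while the genre dummy variables are either 0 or 1, I need to normalize the data. Normalizing the data is meant to ensure the distance calculations do not put more weight on vote average than on the dummy variables. Note that the Normalizer used here rescales each row to unit length rather than rescaling each column, so vote average remains the largest entry in every row after the transform.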

In [66]:
# initialize the normalizer
norm = Normalizer()

# fit and transform
norm_fit = norm.fit_transform(movies_metadata_genre_vote)

# creating a data frame
movies_genre_vote_transform = pd.DataFrame(norm_fit, columns = movies_metadata_genre_vote.columns, 
                                          index = movies_metadata_genre_vote.index)
movies_genre_vote_transform.head()

title,vote_average,Animation,Comedy,Family,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,History,Mystery,War,Foreign,Music,Documentary,Western
Toy Story,0.975622,0.126704,0.126704,0.126704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.969909,0.0,0.0,0.140566,0.140566,0.140566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grumpier Old Men,0.97714,0.0,0.150329,0.0,0.0,0.0,0.150329,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Waiting to Exhale,0.961973,0.0,0.1577,0.0,0.0,0.0,0.1577,0.1577,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Father of the Bride Part II,0.984957,0.0,0.172799,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
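
The transformed values above are consistent with row-wise scaling: each row is divided by its own L2 norm. The sketch below contrasts this with a column-wise scaler on two rows taken from the table (vote average plus three genre flags). `MinMaxScaler` is shown only as a point of comparison and is an assumption, not something the notebook uses.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, MinMaxScaler

# Columns: vote_average, Animation, Comedy, Family
X = np.array([
    [7.7, 1.0, 1.0, 1.0],   # Toy Story
    [5.7, 0.0, 1.0, 0.0],   # Father of the Bride Part II
])

# Row-wise: every row is rescaled to have L2 norm 1
row_scaled = Normalizer().fit_transform(X)

# Column-wise: every column is mapped onto [0, 1]
col_scaled = MinMaxScaler().fit_transform(X)

print(row_scaled.round(6))
print(col_scaled)
```

After row-wise scaling the first column is still close to 1 in every row, so it continues to dominate; a column-wise scaler is the usual choice when the goal is to put features on a comparable scale.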


#### 4.2.2.2 Cosine Similarity

In [68]:
# calculating cosine similarity
movies_metadata_genre_vote_similarity = cosine_similarity(movies_metadata_genre_vote)

numpy.ndarray

In [69]:
# converting to data frame
movies_metadata_genre_vote_similarity_df = pd.DataFrame(movies_metadata_genre_vote_similarity,
                                                        index = movies_metadata_genre_vote.index,
                                                        columns = movies_metadata_genre_vote.index)
movies_metadata_genre_vote_similarity_df.head()

title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,The Final Storm,In a Heartbeat,"Blood, Sweat and Tears",To Be Fat Like Me,Cadet Kelly,The Man with the Rubber Head,The Devilish Tenant,The One-Man Band,Mom,Robin Hood
Toy Story,1.0,0.964075,0.972366,0.958503,0.98284,0.944289,0.971115,0.936891,0.930569,0.943668,...,0.858261,0.992997,0.96524,0.963177,0.981995,0.975548,0.973092,0.942726,0.943668,0.933477
Jumanji,0.964075,1.0,0.947737,0.933026,0.955319,0.938759,0.945621,0.958352,0.949497,0.958742,...,0.853235,0.959385,0.959588,0.960347,0.952457,0.971724,0.969526,0.958102,0.938142,0.92801
Grumpier Old Men,0.972366,0.947737,1.0,0.987396,0.988418,0.945758,0.99995,0.916312,0.932016,0.945136,...,0.859596,0.985166,0.966742,0.940253,0.987947,0.980096,0.978027,0.944193,0.945136,0.960163
Waiting to Exhale,0.958503,0.933026,0.987396,1.0,0.974752,0.950901,0.987481,0.929475,0.91755,0.930465,...,0.846254,0.972148,0.974681,0.956008,0.974445,0.966138,0.964264,0.929537,0.953577,0.97336
Father of the Bride Part II,0.98284,0.955319,0.988418,0.974752,1.0,0.953324,0.987465,0.923642,0.939473,0.952697,...,0.866473,0.97779,0.974476,0.947775,0.999867,0.990688,0.988957,0.951747,0.952697,0.942408
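
As a hand check of the matrix above, cosine similarity is the dot product of two rows divided by the product of their lengths. The sketch below recomputes the Toy Story and Jumanji entry from their vote averages and genre flags (all-zero genre columns are omitted since they contribute nothing), and also shows that scaling each row to unit length first leaves the result unchanged. The helper `cosine` is defined here for illustration.

```python
import numpy as np

# Columns: vote_average, Animation, Comedy, Family, Adventure, Fantasy
toy_story = np.array([7.7, 1, 1, 1, 0, 0], dtype=float)
jumanji = np.array([6.9, 0, 0, 1, 1, 1], dtype=float)

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Cosine similarity on the raw rows
raw = cosine(toy_story, jumanji)

# Same computation after scaling each row to unit length
unit = cosine(toy_story / np.linalg.norm(toy_story),
              jumanji / np.linalg.norm(jumanji))

print(round(raw, 6), round(unit, 6))
```

Because the product of the two vote averages (about 53) is much larger than the single shared genre flag, the score stays close to 1 for most pairs, which matches the narrow range of values in the table.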


#### 4.2.2.3 Finding similar movies

In [77]:
similar_movies('Toy Story', movies_metadata_genre_vote_similarity_df, 10)

title
Toy Story                    1.000000
One Froggy Evening           0.999996
Rabbit Fire                  0.999996
Toy Story 3                  0.999996
The Wrong Trousers           0.999996
There Once Was a Dog         0.999984
Monsters, Inc.               0.999984
A Close Shave                0.999984
A Charlie Brown Christmas    0.999984
Cosmic Scrat-tastrophe       0.999962
Name: Toy Story, dtype: float64

In [79]:
similar_movies('Toy Story', movies_metadata_genre_vote_similarity_df, 50)

title
Toy Story                                               1.000000
One Froggy Evening                                      0.999996
Rabbit Fire                                             0.999996
Toy Story 3                                             0.999996
The Wrong Trousers                                      0.999996
There Once Was a Dog                                    0.999984
Monsters, Inc.                                          0.999984
A Close Shave                                           0.999984
A Charlie Brown Christmas                               0.999984
Cosmic Scrat-tastrophe                                  0.999962
Scooby-Doo! and the Samurai Sword                       0.999962
Toy Story of Terror!                                    0.999932
Scooby-Doo! Camp Scare                                  0.999932
Creature Comforts                                       0.999932
Toy Story 2                                             0.999932
Banana             

#### 4.2.2.4 Genre and Vote Average Summary

The genre and vote average recommendation did better because it recommended Toy Story 3 among the top 10 movies similar to Toy Story. However, there are still a lot of movies with a similarity score close to 1.

## 4.3 Saving Data

In [80]:
datapath = '../data'

save_file(movies_metadata_small, 'movies_metadata_small.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "../data\movies_metadata_small.csv"
