# Top 5 Movie Recommendation System
#### Portfolio Project - Robert Rindos

## This Recommendation System Will Provide a Top 5 List of Recommended Movies and TV Shows. Made possible by the Rotten Tomatoes API
   - Tableau visualizations of this dataset are available here:
   https://public.tableau.com/views/PortfolioProject_16274849785760/Story1?:language=en-US&publish=yes&:display_count=n&:origin=viz_share_link
   - The dataset comes from Kaggle.com, where a user scraped the Rotten Tomatoes API to create two datasets:
   https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset
   - Recommendations are based on the highest similarity scores relative to the user's favorite movie.
   
   - This score measures how similar each movie is to the movie the user provides.
   
   - Similarity scores are calculated from text shared across categories such as genres, directors, authors, actors, and production companies.
   
   - The Natural Language Processing (NLP) techniques of Bag-of-Words and count vectorization are used to build the word-count matrix.
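
As a rough illustration of the idea in the list above, here is a minimal sketch on three made-up "movies". The token strings (`directora`, `actorx`, and so on) are invented for the example and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up "movies", each described by a bag of cleaned tokens
toy_bows = [
    "comedy animation directora actorx",  # movie 0
    "comedy animation directora actory",  # movie 1: shares 3 of 4 tokens with movie 0
    "horror directorb actorz",            # movie 2: shares nothing with movie 0
]

toy_counts = CountVectorizer().fit_transform(toy_bows)  # sparse word-count matrix
toy_sim = cosine_similarity(toy_counts)                 # 3 x 3 similarity matrix

# Movie 0 vs movie 1 scores 0.75; movie 0 vs movie 2 scores 0.0
print(toy_sim.round(2))
```

The more tokens two movies share, the closer their score is to 1; movies with no tokens in common score 0.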
   
## How This Algorithm Can be Tweaked for Desired Business Strategies
   
   - Audience and critic ratings can increase or decrease the similarity scores proportionally.
   
   - Streaming release dates can be applied to similarity scores by favoring newer releases over older ones.
   
   - Algorithm can be expanded upon with additional user data to find more personalized recommendations.
   
   - This script was created to be scalable, and even includes a modularized main() function at the end with examples.
   
   - The script can be integrated into a data pipeline with minor changes to the .py file.
   
   - The count matrix can be trimmed to reduce the number of computations the cosine similarity function has to perform. Trimming keeps only the terms that are seen more than X times in the entire dataset, which allows the function to be tuned for faster processing without sacrificing much accuracy.
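
One possible way to favor newer releases, sketched here purely as an assumption: multiply each similarity score by a weight that halves every few years. The helper name `recency_weight`, the 10-year half-life, and the reference date are all hypothetical and are not part of the original script.

```python
import pandas as pd

def recency_weight(release_dates, half_life_years=10.0, today="2021-07-01"):
    """Hypothetical helper: weight in (0, 1] that halves every `half_life_years`."""
    dates = pd.to_datetime(pd.Series(release_dates), errors="coerce")
    age_years = (pd.Timestamp(today) - dates).dt.days / 365.25
    weights = 0.5 ** (age_years.clip(lower=0) / half_life_years)
    # Assumption: movies with a missing date get the lowest observed weight
    return weights.fillna(weights.min())

demo_dates = pd.Series(["2021-07-01", "2011-07-01", None])
demo_weights = recency_weight(demo_dates)
print(demo_weights)
```

In the real dataset the input would presumably be `movies['streaming_release_date']`, and the resulting weights could be multiplied into the similarity scores before sorting.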

## Importing Libraries

In [20]:
import pandas as pd
import numpy as np
import string
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler, Normalizer

## Defining and Opening GitHub URL for .csv file
- Opens csv file from my personal GitHub repository and loads it into a pandas dataframe.

In [2]:
url = 'https://raw.githubusercontent.com/robertrindos/Recommendation-System/main/rotten_tomatoes_movies.csv'
movies = pd.read_csv(url)
movies.head(3)

|  | rotten_tomatoes_link | movie_title | movie_info | critics_consensus | content_rating | genres | directors | authors | actors | original_release_date | ... | production_company | tomatometer_status | tomatometer_rating | tomatometer_count | audience_status | audience_rating | audience_count | tomatometer_top_critics_count | tomatometer_fresh_critics_count | tomatometer_rotten_critics_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | m/0814255 | Percy Jackson & the Olympians: The Lightning T... | Always trouble-prone, the life of teenager Per... | Though it may seem like just another Harry Pot... | PG | Action & Adventure, Comedy, Drama, Science Fic... | Chris Columbus | Craig Titley, Chris Columbus, Rick Riordan | Logan Lerman, Brandon T. Jackson, Alexandra Da... | 2010-02-12 | ... | 20th Century Fox | Rotten | 49.0 | 149.0 | Spilled | 53.0 | 254421.0 | 43 | 73 | 76 |
| 1 | m/0878835 | Please Give | Kate (Catherine Keener) and her husband Alex (... | Nicole Holofcener's newest might seem slight i... | R | Comedy | Nicole Holofcener | Nicole Holofcener | Catherine Keener, Amanda Peet, Oliver Platt, R... | 2010-04-30 | ... | Sony Pictures Classics | Certified-Fresh | 87.0 | 142.0 | Upright | 64.0 | 11574.0 | 44 | 123 | 19 |
| 2 | m/10 | 10 | A successful, middle-aged Hollywood songwriter... | Blake Edwards' bawdy comedy may not score a pe... | R | Comedy, Romance | Blake Edwards | Blake Edwards | Dudley Moore, Bo Derek, Julie Andrews, Robert ... | 1979-10-05 | ... | Waner Bros. | Fresh | 67.0 | 24.0 | Spilled | 53.0 | 14684.0 | 2 | 16 | 8 |


## Explore the List of Titles for Yourself!

In [3]:
# Uncomment (#) to view entire list (works best on Jupyter Notebook)
#for x in range(len(movies['movie_title'])):
#    print(movies['movie_title'].iloc[x])
#print(movies['movie_title'].head(5))
#print(movies['movie_title'].tail(5))

## NA Value Exploration
- The function below prints each column name along with the number of NA values in that column.
- The 'critics_consensus' column is missing the most values, with 8,578 NAs. That is roughly half of the 17,712 movies available, so this column will not be very helpful when creating our model.
- This model does not use all of the rating data provided. It uses 'tomatometer_rating' because it has the fewest NAs among the rating columns and is the overall rating given by Rotten Tomatoes.

In [4]:
def print_orig_columns_na(movies):
    movies=pd.DataFrame(movies)
    print(movies.isna().sum())
    print("movies.csv length: ", len(movies))
print_orig_columns_na(movies)

rotten_tomatoes_link                   0
movie_title                            0
movie_info                           321
critics_consensus                   8578
content_rating                         0
genres                                19
directors                            194
authors                             1542
actors                               352
original_release_date               1166
streaming_release_date               384
runtime                              314
production_company                   499
tomatometer_status                    44
tomatometer_rating                    44
tomatometer_count                     44
audience_status                      448
audience_rating                      296
audience_count                       297
tomatometer_top_critics_count          0
tomatometer_fresh_critics_count        0
tomatometer_rotten_critics_count       0
dtype: int64
movies.csv length:  17712
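
To make the "roughly half" comparison easier to read off, the counts can also be expressed as percentages. The helper below is a small sketch (the name `na_summary` is made up here), shown on a tiny invented frame rather than the real `movies` data.

```python
import pandas as pd

def na_summary(df):
    """Return NA counts and NA percentage per column, sorted by most missing."""
    summary = pd.DataFrame({
        "na_count": df.isna().sum(),
        "na_pct": (df.isna().mean() * 100).round(1),
    })
    return summary.sort_values("na_count", ascending=False)

# Tiny made-up frame standing in for `movies`
demo = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
demo_summary = na_summary(demo)
print(demo_summary)
```

Calling `na_summary(movies)` on the real dataframe should give the same counts as above, with the share of missing values alongside.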


## Selecting Desired Word-Count Features

In [5]:
text_features = movies[['genres', 'directors', 'authors', 'actors', 'production_company']]
text_features.head(3)

|  | genres | directors | authors | actors | production_company |
|---|---|---|---|---|---|
| 0 | Action & Adventure, Comedy, Drama, Science Fic... | Chris Columbus | Craig Titley, Chris Columbus, Rick Riordan | Logan Lerman, Brandon T. Jackson, Alexandra Da... | 20th Century Fox |
| 1 | Comedy | Nicole Holofcener | Nicole Holofcener | Catherine Keener, Amanda Peet, Oliver Platt, R... | Sony Pictures Classics |
| 2 | Comedy, Romance | Blake Edwards | Blake Edwards | Dudley Moore, Bo Derek, Julie Andrews, Robert ... | Waner Bros. |


# Defining Functions for Algorithm

### 1. clean_text_features()
- Cleans the text data within the given columns.
- Replaces NAs with blank spaces.
- The order of the .replace() calls matters: spaces are removed before commas are turned into spaces, so each person or company becomes a single unique token.
- Ex: actor 'Sarah Jessica Parker' would be read by the model as 'sarahjessicaparker' instead of 'Sarah', 'Jessica', and 'Parker'. Likewise, we would rather have the unique term '20thcenturyfox' instead of more common terms like '20th', 'Century', and 'Fox'.
- As a result, people who merely share a first or last name are not treated as similar by the model.

In [6]:
def clean_text_features(text):
    clean_features = text.fillna(' ', inplace=False)
    for col in clean_features:
        clean_features[col] = clean_features[col].apply(lambda x: str(x).replace(' ', ''))
        clean_features[col] = clean_features[col].apply(lambda x: str(x).replace(',', ' '))
        clean_features[col] = clean_features[col].apply(lambda x: str(x).lower())
        clean_features[col] = clean_features[col].apply(lambda x: str(x).replace('&', ''))
        clean_features[col] = clean_features[col].apply(lambda x: str(x).replace('.', ''))
    return clean_features

clean_text = clean_text_features(text_features)
clean_text

|  | genres | directors | authors | actors | production_company |
|---|---|---|---|---|---|
| 0 | actionadventure comedy drama sciencefictionfan... | chriscolumbus | craigtitley chriscolumbus rickriordan | loganlerman brandontjackson alexandradaddario ... | 20thcenturyfox |
| 1 | comedy | nicoleholofcener | nicoleholofcener | catherinekeener amandapeet oliverplatt rebecca... | sonypicturesclassics |
| 2 | comedy romance | blakeedwards | blakeedwards | dudleymoore boderek julieandrews robertwebber ... | wanerbros |
| 3 | classics drama | sidneylumet | reginaldrose | martinbalsam johnfiedler leejcobb egmarshall j... | criterioncollection |
| 4 | actionadventure drama kidsfamily | richardfleischer | earlfelton | jamesmason kirkdouglas paullukas peterlorre ro... | disney |
| ... | ... | ... | ... | ... | ... |
| 17707 | drama musicalperformingarts | luisvaldez | luisvaldez | danielvaldez edwardjamesolmos charlesaidman ty... | mcauniversalhomevideo |
| 17708 | actionadventure animation comedy | byronhoward richmoore jaredbush | jaredbush philjohnston | jksimmons kristenbell octaviaspencer alantudyk... | waltdisneyanimationstudios |
| 17709 | actionadventure arthouseinternational classics... |  |  | anthonyquinn alanbates irenepapas lilakedrova ... | fox |
| 17710 | classics drama | cyendfield cyrilendfield | cyendfield johnprebble | stanleybaker jackhawkins ullajacobsson jamesbo... | paramountpictures |
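
To see the effect of the replacement order on a single value, here is a small sketch that applies the same steps with pandas string methods. The helper name `clean_column` and the two demo values are made up for illustration; it is meant as an equivalent of the per-column logic, not a replacement for the function above.

```python
import pandas as pd

def clean_column(col):
    """Vectorized sketch of the per-column steps in clean_text_features()."""
    return (
        col.fillna(" ")
           .astype(str)
           .str.replace(" ", "", regex=False)   # glue multi-word names together
           .str.replace(",", " ", regex=False)  # commas become token separators
           .str.lower()
           .str.replace("&", "", regex=False)
           .str.replace(".", "", regex=False)
    )

demo_actors = pd.Series(["Sarah Jessica Parker, J.K. Simmons", None])
cleaned_demo = clean_column(demo_actors).tolist()
print(cleaned_demo)  # ['sarahjessicaparker jksimmons', '']
```

If the comma replacement ran first, the later space removal would merge every name in the cell into one long token, which is why spaces are stripped before commas are converted.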


### 2. create_bow()
- Creates an aggregated bag-of-words (BoW) list for each movie
- Includes all desired word-count features

In [8]:
def create_bow(clean_text, feature_list):
    # Join the cleaned feature columns into one space-separated string per movie
    bow_list = clean_text[feature_list].agg(' '.join, axis=1)
    # Note: repeated tokens are kept, e.g. a person credited as both director
    # and author appears twice in that movie's BoW (see the output below).
    return bow_list

feature_list = ['genres', 'directors', 'authors', 'actors', 'production_company']
bow_list = create_bow(clean_text, feature_list)
bow_list

0        actionadventure comedy drama sciencefictionfan...
1        comedy nicoleholofcener nicoleholofcener cathe...
2        comedy romance blakeedwards blakeedwards dudle...
3        classics drama sidneylumet reginaldrose martin...
4        actionadventure drama kidsfamily richardfleisc...
                               ...                        
17707    drama musicalperformingarts luisvaldez luisval...
17708    actionadventure animation comedy byronhoward r...
17709    actionadventure arthouseinternational classics...
17710    classics drama cyendfield cyrilendfield cyendf...
17711    actionadventure arthouseinternational drama do...
Length: 17712, dtype: object
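
The output above still contains repeated tokens within a movie (for example, `nicoleholofcener` appears twice for movie 1). If deduplication is wanted, one possible approach is sketched below; `dedupe_tokens` is a hypothetical helper, and applying it would change the term counts and therefore the matrix shapes reported later.

```python
def dedupe_tokens(bow):
    """Drop repeated tokens from one space-separated BoW string, keeping first-seen order."""
    return " ".join(dict.fromkeys(bow.split()))

example_bow = "comedy nicoleholofcener nicoleholofcener catherinekeener"
deduped = dedupe_tokens(example_bow)
print(deduped)  # comedy nicoleholofcener catherinekeener

# Hypothetical usage on the full Series: bow_list.apply(dedupe_tokens)
```

Whether to deduplicate is a design choice: keeping the repeat gives extra weight to someone who both wrote and directed a movie.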

### 3. count_vectorize()
- Creates a word-count matrix from an aggregated BoW object

In [9]:
def count_vectorize(bow_list):
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(bow_list)
    count_transform = count_vectorizer.transform(bow_list)
    soup_count_array = count_transform.toarray()
    count_matrix = pd.DataFrame(soup_count_array)
    return count_matrix

count_matrix = count_vectorize(bow_list)
print(count_matrix.shape)

(17712, 223172)


### 223,172 Unique Terms Across the Dataset!
- This will require a lot of computing power and should be trimmed!

### 4. count_vect_trim()
- Keeps only the terms that appear more than x times throughout the dataset.
- This makes it possible to tune the balance between computing time and accuracy.

In [11]:
def count_vect_trim(count_matrix, count_treshold):
    compact_matrix = count_matrix.loc[:,(count_matrix.sum(axis=0) > count_treshold)]
    return compact_matrix

count_treshold = 5
compact_matrix = count_vect_trim(count_matrix, count_treshold)
print(compact_matrix.shape)

(17712, 16125)


### Removing terms that are only seen once in the dataset reduces the number of terms from 223,172 to 70,469.
- Depending on your computer, this may still be too much. To work around this, the threshold can be raised, trading some accuracy for faster processing.
- For this example, I set the threshold to select only terms seen more than 5 times. This outputs a matrix with 16,125 terms.
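
The dense 17,712 x 223,172 matrix is what makes this step memory-hungry. As a sketch of an alternative, the same "more than x times" rule can be applied while the counts are still a SciPy sparse matrix. The function name `count_vectorize_trimmed` and the demo strings are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def count_vectorize_trimmed(bow_list, count_threshold):
    """Hypothetical sparse variant of count_vectorize() followed by count_vect_trim()."""
    counts = CountVectorizer().fit_transform(bow_list)    # SciPy sparse matrix
    term_totals = np.asarray(counts.sum(axis=0)).ravel()  # total count per term
    keep = term_totals > count_threshold                  # same rule as count_vect_trim()
    return counts[:, keep]

demo_bows = ["comedy drama actora", "comedy actora actorb", "comedy horror actorc"]
trimmed = count_vectorize_trimmed(demo_bows, count_threshold=1)
print(trimmed.shape)  # (3, 2): only 'actora' and 'comedy' occur more than once
```

`cosine_similarity` accepts sparse input, so the trimmed matrix could be passed to it directly without calling `.toarray()`.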

### 5. cosine_similarity()
- This function is provided by the scikit-learn library.

In [12]:
cosine_sim = cosine_similarity(compact_matrix, compact_matrix)
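
With 17,712 movies, the full similarity matrix has 17,712 x 17,712 entries, which is roughly 2.5 GB as float64. When only one movie's recommendations are needed, a lighter option is to compute just that movie's row. The sketch below uses a small made-up matrix to show that both approaches give the same scores.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

demo_matrix = np.array([[1, 1, 0],
                        [1, 0, 1],
                        [0, 0, 2]])
idx = 0  # row position of the chosen movie

full = cosine_similarity(demo_matrix)                             # n x n matrix
row_only = cosine_similarity(demo_matrix[[idx]], demo_matrix)[0]  # length-n vector

print(np.allclose(full[idx], row_only))  # True
```

The trade-off is that the single-row version has to be recomputed for each new movie, while the full matrix is computed once and reused.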

### 6. get_sim_scores()
- Given a movie title and the cosine_sim scores, this function retrieves the similarity scores between the chosen movie and every movie in the dataset.

In [43]:
def get_sim_scores(chosen_movie, titles, cosine_sim):
    title=chosen_movie.replace(' ', '').lower()
    titles_clean=titles.apply(lambda x: str(x).replace(' ','').lower())
    titles_clean=pd.Index(titles_clean)
    idx=titles_clean.get_loc(title)
    sim_list = list(enumerate(cosine_sim[idx]))
    return sim_list

chosen_movie = 'zootopia'
titles = movies['movie_title']
sim_list = get_sim_scores(chosen_movie, titles, cosine_sim)
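
One thing to be aware of: `pd.Index.get_loc` returns a slice or boolean mask rather than a single integer when several titles clean to the same string, and raises `KeyError` when nothing matches. A more defensive lookup could look like the sketch below; `find_title_indices` and the demo titles are hypothetical.

```python
import pandas as pd

def find_title_indices(chosen_movie, titles):
    """Hypothetical helper: positions of every title matching chosen_movie, ignoring case and spaces."""
    key = chosen_movie.replace(" ", "").lower()
    cleaned = titles.astype(str).str.replace(" ", "", regex=False).str.lower()
    matches = [pos for pos, title in enumerate(cleaned) if title == key]
    if not matches:
        raise KeyError(f"No title matching {chosen_movie!r}")
    return matches

demo_titles = pd.Series(["Zootopia", "Hot Rod", "hot rod", "Pulp Fiction"])
found = find_title_indices("Hot Rod", demo_titles)
print(found)  # [1, 2]
```

The caller can then decide how to handle duplicates, for example by taking the first match or asking the user to pick a release year.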

### 7. apply_scalers() (Work in Progress)
- Scales similarity scores by rating.
- Above-average ratings will increase the similarity score.
- Below-average ratings will decrease the similarity score.

In [41]:
#ratings = movies[['tomatometer_rating']]
#scaler = StandardScaler()
#def apply_scalers(sim_list, ratings, scaler):
#    scaler.fit(ratings)
#    std_ratings = scaler.transform(ratings)
#    for idx in range(len(sim_list)):
#       sim_list[idx] = sim_list[idx] * (std_ratings[idx] / 10)
#    return sim_list
#sim_list_scaled = apply_scalers(sim_list, ratings, scaler)
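
Since `sim_list` holds `(index, score)` tuples, the rating adjustment needs to unpack each pair before scaling the score. The sketch below shows one way this could work; it is not the author's finished method. The function name `apply_rating_weights`, the `strength` parameter, and the choice to leave movies with missing ratings unchanged are all assumptions.

```python
import numpy as np

def apply_rating_weights(sim_list, ratings, strength=0.1):
    """Hypothetical: nudge each similarity score by its movie's standardized rating."""
    ratings = np.asarray(ratings, dtype=float)
    mean, std = np.nanmean(ratings), np.nanstd(ratings)  # assumes ratings are not all equal
    # Missing rating -> z-score of 0, so that movie's score is left unchanged
    z = np.where(np.isnan(ratings), 0.0, (ratings - mean) / std)
    return [(idx, float(score * (1 + strength * z[idx]))) for idx, score in sim_list]

demo_sim = [(0, 1.0), (1, 0.5), (2, 0.5)]
demo_ratings = [90.0, 30.0, np.nan]
scaled = apply_rating_weights(demo_sim, demo_ratings)
print(scaled)
```

The adjusted values are no longer bounded by 1, so they are best read as ranking scores rather than cosine similarities.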

### 8. top_5_list()
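- Retrieves the highest similarity scores for a given movie. Six rows are returned because the chosen movie normally ranks first as its own best match, followed by its top 5 recommendations.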

In [None]:
output_info = movies[['movie_title','tomatometer_rating','directors','actors']]
def top_5_list(sim_list, output_info):
    sim_list = sorted(sim_list, key=lambda x: x[1], reverse=True)
    top_5 = sim_list[0:6]
    top_5_indices = [i[0] for i in top_5]
    output = output_info.fillna(' ',inplace=False)
    return output.iloc[top_5_indices].reset_index(drop=True)
top_5 = top_5_list(sim_list, output_info)
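
If the chosen movie should not appear in its own recommendations, it can be filtered out by index instead of relying on it sorting first. A minimal sketch, with `top_n_indices` as a hypothetical helper and made-up scores:

```python
def top_n_indices(sim_list, chosen_idx, n=5):
    """Hypothetical: indices of the n most similar movies, skipping the chosen movie itself."""
    ranked = sorted(sim_list, key=lambda pair: pair[1], reverse=True)
    return [idx for idx, _ in ranked if idx != chosen_idx][:n]

demo_sim = [(0, 1.0), (1, 0.2), (2, 0.9), (3, 0.4)]
best = top_n_indices(demo_sim, chosen_idx=0, n=2)
print(best)  # [2, 3]
```

The returned indices could then be passed to `output_info.iloc[...]` in the same way `top_5_list()` does.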

# Main() Function


In [45]:
def main():
    #### Defining Variables ####
    # Defining GitHub URL to read .csv file
    url = 'https://raw.githubusercontent.com/robertrindos/Recommendation-System/main/rotten_tomatoes_movies.csv'
    movies = pd.read_csv(url)
    
    # Defining text features for Word-Count Matrix
    text_feature_list = ['genres', 'directors', 'authors', 'actors', 'production_company']
    text_features = movies[text_feature_list]
    
    # Defining movie titles
    titles = movies['movie_title']
    
    # Defining Ratings
    ratings = movies[['tomatometer_rating']]
    
    # Defining what info will be shown with recommendations
    output_info = movies[['movie_title','tomatometer_rating','directors','actors']]
    
    # Defining movie to find recommendations for
    chosen_movie = 'zootopia'
    
    #### Functions Deployed ####
    # 1 - Text Cleaning
    clean_text = clean_text_features(text_features)
    
    # 2 - Creates Bag-of-Words
    bow_list = create_bow(clean_text,text_feature_list)
    
    # 3 - Creates a Word-Count Matrix
    count_matrix = count_vectorize(bow_list)
    
    # 4 - Trims Matrix for Less Computations
    count_treshold = 5
    compact_matrix = count_vect_trim(count_matrix, count_treshold)
    
    # 5 - Cosine Similarity
    cosine_sim = cosine_similarity(compact_matrix, compact_matrix)
    
    # 6 - Retrieves a list of similarity scores for the chosen movie
    sim_list = get_sim_scores(chosen_movie, titles, cosine_sim)
    
    # 7 - Applies scaled version of 'Tomatometer_rating' to the similarity scores list
    #scaler = StandardScaler()
    #sim_list_scaled = apply_scalers(sim_list, ratings, scaler)
    
    # 8 - Sorts and retrieves a top 5 list of highest similarity scores
    top_5 = top_5_list(sim_list, output_info)
    print(top_5)
    return titles, cosine_sim

if __name__ == "__main__":
    titles, cosine_sim = main()

                                         movie_title tomatometer_rating  \
0                                           Zootopia               98.0   
1                          Ralph Breaks the Internet               88.0   
2                                     Wreck-it Ralph               87.0   
3  Minuscule: Valley of the Lost Ants (Minuscule ...               90.0   
4                                   Capture the Flag               48.0   
5                                          Mind Game              100.0   

                              directors  \
0  Byron Howard, Rich Moore, Jared Bush   
1             Phil Johnston, Rich Moore   
2                            Rich Moore   
3           Hélène Giraud, Thomas Szabo   
4                          Enrique Gato   
5                         Masaaki Yuasa   

                                              actors  
0  J.K. Simmons, Kristen Bell, Octavia Spencer, A...  
1  John C. Reilly, Sarah Silverman, Gal Gadot, Ja...  
2  John

### Test 1 - Pulp Fiction
- In this test, all 5 recommendations are directed by Quentin Tarantino, a personal favorite of mine. I wasn't sure what "Four Rooms" was, but after some research I learned that Tarantino was one of its directors.
- With more tweaking of the rating scaler, the similarity score for "Four Rooms" could be lowered, since it has a very low rating of only 13.

In [53]:
movie_1 = 'Pulp Fiction'
sim_list = get_sim_scores(movie_1, titles, cosine_sim)
top_5 =  top_5_list(sim_list, output_info)
top_5

|  | movie_title | tomatometer_rating | directors | actors |
|---|---|---|---|---|
| 0 | Pulp Fiction | 92.0 | Quentin Tarantino | John Travolta, Samuel L. Jackson, Uma Thurman,... |
| 1 | Reservoir Dogs | 92.0 | Quentin Tarantino | Harvey Keitel, Chris Penn, Tim Roth, Michael M... |
| 2 | Four Rooms | 13.0 | Allison Anders, Robert Rodriguez, Alexandre Ro... | Tim Roth, Valeria Golino, Madonna, Alicia Witt... |
| 3 | Jackie Brown | 87.0 | Quentin Tarantino | Pam Grier, Samuel L. Jackson, Robert Forster, ... |
| 4 | Kill Bill: Volume 1 | 85.0 | Quentin Tarantino | Uma Thurman, Lucy Liu, Vivica A. Fox, David Ca... |
| 5 | Kill Bill: Volume 2 | 84.0 | Quentin Tarantino | Uma Thurman, David Carradine, Michael Madsen, ... |


### Test 2 - Zootopia
- This test performed very well, as all the recommended movies are kid-friendly and animated.
- The list includes movies with similar directors and actors.
- All movies have a high rating except "Capture the Flag", which has a low rating of 48.
- After some research, I found that "Mind Game" is an experimental animated movie from Japan and seems to be in Japanese. This could be addressed with additional 'language' data.

In [54]:
movie_1 = 'zootopia'
sim_list = get_sim_scores(movie_1, titles, cosine_sim)
top_5 =  top_5_list(sim_list, output_info)
top_5

|  | movie_title | tomatometer_rating | directors | actors |
|---|---|---|---|---|
| 0 | Zootopia | 98.0 | Byron Howard, Rich Moore, Jared Bush | J.K. Simmons, Kristen Bell, Octavia Spencer, A... |
| 1 | Ralph Breaks the Internet | 88.0 | Phil Johnston, Rich Moore | John C. Reilly, Sarah Silverman, Gal Gadot, Ja... |
| 2 | Wreck-it Ralph | 87.0 | Rich Moore | John C. Reilly, Sarah Silverman, Jack McBrayer... |
| 3 | Minuscule: Valley of the Lost Ants (Minuscule ... | 90.0 | Hélène Giraud, Thomas Szabo |  |
| 4 | Capture the Flag | 48.0 | Enrique Gato | Dani Rovira, Michelle Jenner, Carme Calvell, J... |
| 5 | Mind Game | 100.0 | Masaaki Yuasa | Sayaka Maeda, Seiko Takuma, Tomomitsu Yamaguch... |


### Test 3 - Hot Rod
- This comedy is a personal favorite of mine.
- This test also performed very well, with great recommendations, 3 out of 5 of them including Andy Samberg.
- One questionable result is "Capture the Flag", which landed in the top 5 for both "Hot Rod" and "Zootopia". It makes sense for "Zootopia", but not for "Hot Rod".

In [55]:
movie_1 = 'hot rod'
sim_list = get_sim_scores(movie_1, titles, cosine_sim)
top_5 =  top_5_list(sim_list, output_info)
top_5

|  | movie_title | tomatometer_rating | directors | actors |
|---|---|---|---|---|
| 0 | Hot Rod | 39.0 | Akiva Schaffer | Andy Samberg, Jorma Taccone, Isla Fisher, Bill... |
| 1 | Fun Size | 25.0 |  |  |
| 2 | Popstar: Never Stop Never Stopping | 79.0 | Akiva Schaffer, Jorma Taccone | Andy Samberg, Jorma Taccone, Akiva Schaffer, S... |
| 3 | Capture the Flag | 48.0 | Enrique Gato | Dani Rovira, Michelle Jenner, Carme Calvell, J... |
| 4 | The Lonely Island Presents: The Unauthorized B... | 100.0 | Mike Diva, Akiva Schaffer | Andy Samberg, Akiva Schaffer, Jorma Taccone, H... |
| 5 | 30 Minutes or Less | 45.0 | Ruben Fleischer | Jesse Eisenberg, Danny McBride (IV), Aziz Ansa... |
