### Importing Packages

In [244]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

# Adding more data

One of the main complaints we received was that there was very little data, so we are adding more in the hope of providing more comprehensive recommendations. We found datasets for Netflix, Amazon, Disney+, and Hulu, all made by the same person and sharing the same formatting, so the four can be combined into one larger dataset.

In [245]:
netflix_df = pd.read_csv('netflix_titles.csv')
amazon_df = pd.read_csv('amazon_prime_titles.csv')
disney_df = pd.read_csv('disney_plus_titles.csv')
hulu_df = pd.read_csv('hulu_titles.csv')

- Getting the column information for each dataset and checking every column's data type and non-null count

In [246]:
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [247]:
amazon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9668 entries, 0 to 9667
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       9668 non-null   object
 1   type          9668 non-null   object
 2   title         9668 non-null   object
 3   director      7586 non-null   object
 4   cast          8435 non-null   object
 5   country       672 non-null    object
 6   date_added    155 non-null    object
 7   release_year  9668 non-null   int64 
 8   rating        9331 non-null   object
 9   duration      9668 non-null   object
 10  listed_in     9668 non-null   object
 11  description   9668 non-null   object
dtypes: int64(1), object(11)
memory usage: 906.5+ KB


In [248]:
disney_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       1450 non-null   object
 1   type          1450 non-null   object
 2   title         1450 non-null   object
 3   director      977 non-null    object
 4   cast          1260 non-null   object
 5   country       1231 non-null   object
 6   date_added    1447 non-null   object
 7   release_year  1450 non-null   int64 
 8   rating        1447 non-null   object
 9   duration      1450 non-null   object
 10  listed_in     1450 non-null   object
 11  description   1450 non-null   object
dtypes: int64(1), object(11)
memory usage: 136.1+ KB


In [249]:
hulu_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3073 entries, 0 to 3072
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       3073 non-null   object 
 1   type          3073 non-null   object 
 2   title         3073 non-null   object 
 3   director      3 non-null      object 
 4   cast          0 non-null      float64
 5   country       1620 non-null   object 
 6   date_added    3045 non-null   object 
 7   release_year  3073 non-null   int64  
 8   rating        2553 non-null   object 
 9   duration      2594 non-null   object 
 10  listed_in     3073 non-null   object 
 11  description   3069 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 288.2+ KB


We can see that all four share the same formatting, so we can combine them.

First, however, we add a feature recording which service each show/movie came from. This lets us recommend titles from the same streaming service, which should increase the likelihood that the user actually watches them.

In [250]:
netflix_df['netflix'] = 1
amazon_df['amazon'] = 1
disney_df['disney'] = 1
hulu_df['hulu'] = 1

In [251]:
df = pd.concat([netflix_df,amazon_df,disney_df,hulu_df])

In [252]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,netflix,amazon,disney,hulu
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",1.0,,,
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",1.0,,,
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,1.0,,,
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",1.0,,,
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,1.0,,,


The origin features are NaN for every row that came from a different service, so we fill those values with 0s

In [253]:
df['netflix'] = df['netflix'].fillna(0)
df['amazon'] = df['amazon'].fillna(0)
df['disney'] = df['disney'].fillna(0)
df['hulu'] = df['hulu'].fillna(0)
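
The tagging, concatenation, and zero-fill steps above can also be written as one pass. This is only a minimal sketch on toy frames: the `frames` dict and its contents are illustrative stand-ins, not the real datasets, and `ignore_index=True` is an optional choice that the cells above do not make.

```python
import pandas as pd

# Toy stand-ins for the service DataFrames (illustrative only).
frames = {
    'netflix': pd.DataFrame({'title': ['A', 'B']}),
    'hulu': pd.DataFrame({'title': ['B', 'C']}),
}

# Tag each frame with its service, then stack them with a fresh 0..n-1 index.
tagged = [frame.assign(**{name: 1}) for name, frame in frames.items()]
combined = pd.concat(tagged, ignore_index=True)

# Rows from other services are NaN in a service column; turn them into 0.
service_cols = list(frames)
combined[service_cols] = combined[service_cols].fillna(0).astype(int)
print(combined)
```

Casting to `int` keeps the indicators as 0/1 integers rather than the `float64` values that `fillna` alone leaves behind.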

- Checking the shape: 22998 rows and 16 columns (the 12 original features plus the four origin features)

In [254]:
df.shape

(22998, 16)

Since we combined four datasets, it is reasonable to assume that some titles appear in more than one of them

In [255]:
df[df.duplicated(['title'])].shape

(883, 16)
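
The count above tells us how many repeats exist, but not which titles they are. A small sketch on made-up rows, showing how `keep=False` lists every copy of a repeated title (the toy data is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Alpha', 'Gamma'],
    'netflix': [1, 1, 0, 0],
    'hulu': [0, 0, 1, 1],
})

# keep=False flags every row whose title appears more than once,
# whereas the default (keep='first') leaves the first occurrence unflagged.
overlap = toy[toy.duplicated('title', keep=False)].sort_values('title')
print(overlap)
```

Note that a shared title does not guarantee the same work (remakes can reuse a name), so a listing like this is worth a quick look before merging.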

- Checking the dtypes of the columns

In [256]:
df.dtypes

show_id          object
type             object
title            object
director         object
cast             object
country          object
date_added       object
release_year      int64
rating           object
duration         object
listed_in        object
description      object
netflix         float64
amazon          float64
disney          float64
hulu            float64
dtype: object

- Checking the number of null values in each column

In [257]:
df.isnull().sum()

show_id             0
type                0
title               0
director         8259
cast             5321
country         11499
date_added       9554
release_year        0
rating            864
duration          482
listed_in           0
description         4
netflix             0
amazon              0
disney              0
hulu                0
dtype: int64

- Printing the unique values of the rating column, which holds the audience rating of each movie or show.

In [258]:
print('Types of ratings:',df['rating'].unique())

Types of ratings: ['PG-13' 'TV-MA' 'PG' 'TV-14' 'TV-PG' 'TV-Y' 'TV-Y7' 'R' 'TV-G' 'G'
 'NC-17' '74 min' '84 min' '66 min' 'NR' nan 'TV-Y7-FV' 'UR' '13+' 'ALL'
 '18+' '16+' '7+' 'TV-NR' 'UNRATED' '16' 'AGES_16_' 'AGES_18_' 'ALL_AGES'
 'NOT_RATE' 'NOT RATED' '2 Seasons' '93 min' '4 Seasons' '136 min'
 '91 min' '85 min' '98 min' '89 min' '94 min' '86 min' '3 Seasons'
 '121 min' '88 min' '101 min' '1 Season' '83 min' '100 min' '95 min'
 '92 min' '96 min' '109 min' '99 min' '75 min' '87 min' '67 min' '104 min'
 '107 min' '103 min' '105 min' '119 min' '114 min' '82 min' '90 min'
 '130 min' '110 min' '80 min' '6 Seasons' '97 min' '111 min' '81 min'
 '49 min' '45 min' '41 min' '73 min' '40 min' '36 min' '39 min' '34 min'
 '47 min' '65 min' '37 min' '78 min' '102 min' '129 min' '115 min'
 '112 min' '61 min' '106 min' '76 min' '77 min' '79 min' '157 min'
 '28 min' '64 min' '7 min' '5 min' '6 min' '127 min' '142 min' '108 min'
 '57 min' '118 min' '116 min' '12 Seasons' '71 min']


We see that the ratings also include durations and are generally messy, with several labels that all essentially mean the same thing
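
One way to spot the misplaced values is a pattern match on the rating column. The sketch below uses made-up rows and assumes that a rating such as `'74 min'` really is the title's duration entered in the wrong column; that assumption should be checked against the data before relying on it.

```python
import pandas as pd

toy = pd.DataFrame({
    'rating': ['PG-13', '74 min', '2 Seasons', None],
    'duration': ['90 min', None, None, '1 Season'],
})

# A value such as '74 min' or '2 Seasons' is a duration, not a rating.
looks_like_duration = toy['rating'].str.match(r'^\d+ (?:min|Seasons?)$', na=False)

# Move the value over only where the duration slot is empty, then blank the rating.
fill = looks_like_duration & toy['duration'].isna()
toy.loc[fill, 'duration'] = toy.loc[fill, 'rating']
toy.loc[looks_like_duration, 'rating'] = None
print(toy)
```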

- In this snippet we count the movies and list their unique durations.

In [259]:
print('Number of Movies:',
      df[df['type'] == 'Movie']['duration'].count(),
      '\nMovie Durations:',
      df[df['type'] == 'Movie']['duration'].unique())

Number of Movies: 15999 
Movie Durations: ['90 min' '91 min' '125 min' '104 min' '127 min' '67 min' '94 min'
 '161 min' '61 min' '166 min' '147 min' '103 min' '97 min' '106 min'
 '111 min' '110 min' '105 min' '96 min' '124 min' '116 min' '98 min'
 '23 min' '115 min' '122 min' '99 min' '88 min' '100 min' '102 min'
 '93 min' '95 min' '85 min' '83 min' '113 min' '13 min' '182 min' '48 min'
 '145 min' '87 min' '92 min' '80 min' '117 min' '128 min' '119 min'
 '143 min' '114 min' '118 min' '108 min' '63 min' '121 min' '142 min'
 '154 min' '120 min' '82 min' '109 min' '101 min' '86 min' '229 min'
 '76 min' '89 min' '156 min' '112 min' '107 min' '129 min' '135 min'
 '136 min' '165 min' '150 min' '133 min' '70 min' '84 min' '140 min'
 '78 min' '64 min' '59 min' '139 min' '69 min' '148 min' '189 min'
 '141 min' '130 min' '138 min' '81 min' '132 min' '123 min' '65 min'
 '68 min' '66 min' '62 min' '74 min' '131 min' '39 min' '46 min' '38 min'
 '126 min' '155 min' '159 min' '137 min' '12 min' '273 m

- We do the same for the shows in our dataset

In [260]:
print('Number of Shows:',
      df[df['type'] == 'TV Show']['duration'].count(),
      '\nShow Durations:',
      df[df['type'] == 'TV Show']['duration'].unique())

Number of Shows: 6517 
Show Durations: ['2 Seasons' '1 Season' '9 Seasons' '4 Seasons' '5 Seasons' '3 Seasons'
 '6 Seasons' '7 Seasons' '10 Seasons' '8 Seasons' '17 Seasons'
 '13 Seasons' '15 Seasons' '12 Seasons' '11 Seasons' '29 Seasons'
 '19 Seasons' '21 Seasons' '14 Seasons' '32 Seasons' '16 Seasons'
 '23 Seasons' '20 Seasons' '30 Seasons' '22 Seasons' '25 Seasons'
 '34 Seasons' '26 Seasons']


<h4>Observations</h4>

<ul>
    <li>A lot of null values, especially for director, cast and country</li>
    <li>Rating isn't the popularity, but the suggested audience of a movie/show</li>
    <li>Listed In is essentially the genre list</li>
    <li>Duration is kept in two different metrics, time for movies and seasons for shows</li>
    <li>The movie/show rating feature is very messy</li>
</ul>

# Preprocessing & Visualization

<h3>General</h3>

- Dropping Director, Cast, and Country, due to the amount of null values and how unlikely it is that they will have a significant effect on future models

In [261]:
df = df.drop(['director', 'cast', 'country'], axis=1)

- Dropping duration column, as two different metrics are used depending on whether the item is a show or a movie. Another solution would be to split the dataset into movies and shows, but we want to be able to recommend either.

In [262]:
df = df.drop('duration', axis=1)

- Dropping date added, since it has little relevance

In [263]:
df = df.drop('date_added', axis=1)

In [264]:
df.isnull().sum()

show_id           0
type              0
title             0
release_year      0
rating          864
listed_in         0
description       4
netflix           0
amazon            0
disney            0
hulu              0
dtype: int64

- Renaming two-word column names for ease

In [265]:
df = df.rename({'release_year':'year'}, axis=1)
df = df.rename({'listed_in':'genre'}, axis=1)
df = df.rename({'show_id':'id'}, axis=1)

In [266]:
df[df['rating'].isnull()]

Unnamed: 0,id,type,title,year,rating,genre,description,netflix,amazon,disney,hulu
5989,s5990,Movie,13TH: A Conversation with Oprah Winfrey & Ava ...,2017,,Movies,Oprah Winfrey sits down with director Ava DuVe...,1.0,0.0,0.0,0.0
6827,s6828,TV Show,Gargantia on the Verdurous Planet,2013,,"Anime Series, International TV Shows","After falling through a wormhole, a space-dwel...",1.0,0.0,0.0,0.0
7312,s7313,TV Show,Little Lunch,2015,,"Kids' TV, TV Comedies","Adopting a child's perspective, this show take...",1.0,0.0,0.0,0.0
7537,s7538,Movie,My Honor Was Loyalty,2015,,Dramas,"Amid the chaos and horror of World War II, a c...",1.0,0.0,0.0,0.0
0,s1,Movie,The Grand Seduction,2014,,"Comedy, Drama",A small fishing village must procure a local d...,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
2544,s2545,Movie,28 Hotel Rooms,2013,,Drama,On a business trip a married woman and a novel...,0.0,0.0,0.0,1.0
2632,s2633,Movie,(Dub) Dragon Age: Dawn of the Seeker,2012,,Anime,A brash young Seeker - Cassandra - is accused ...,0.0,0.0,0.0,1.0
2697,s2698,TV Show,The Driver,2017,,"Documentaries, Lifestyle & Culture",Jeffrey Earnhardt is the grandson of seven-tim...,0.0,0.0,0.0,1.0
2874,s2875,TV Show,NASA Television Documentaries,2015,,"Documentaries, Science & Technology",NASA's Vision: To reach for new heights and re...,0.0,0.0,0.0,1.0


There is a substantial number of rows with no rating, along with rows that are explicitly marked as unrated. Since rating is an important feature when recommending shows, we cannot get rid of the column. Because these rows are otherwise complete, we keep them and leave their rating empty

<h3>RS Processing</h3>

- One hot encoding the features, preparing dataframe for recommendation system

Unlike our first iteration, we are going to use rating and type as well

In [267]:
df_enc = df.copy()

<h4>First, the ratings</h4>

In [268]:
rating_enc = pd.get_dummies(df['rating'], prefix='Rated ')

In [269]:
rating_enc.columns = rating_enc.columns.str.replace('_', '')

In [270]:
rating_enc.columns

Index(['Rated 1 Season', 'Rated 100 min', 'Rated 101 min', 'Rated 102 min',
       'Rated 103 min', 'Rated 104 min', 'Rated 105 min', 'Rated 106 min',
       'Rated 107 min', 'Rated 108 min',
       ...
       'Rated TV-14', 'Rated TV-G', 'Rated TV-MA', 'Rated TV-NR',
       'Rated TV-PG', 'Rated TV-Y', 'Rated TV-Y7', 'Rated TV-Y7-FV',
       'Rated UNRATED', 'Rated UR'],
      dtype='object', length=105)

First, we drop any ratings that aren't actually ratings (66 min, 1 season, etc.)

In [271]:
rating_enc = rating_enc.drop(['Rated 1 Season', 'Rated 100 min', 'Rated 101 min', 'Rated 102 min',
       'Rated 103 min', 'Rated 104 min', 'Rated 105 min', 'Rated 106 min',
       'Rated 107 min', 'Rated 108 min', 'Rated 109 min', 'Rated 110 min', 'Rated 111 min', 'Rated 112 min',
       'Rated 114 min', 'Rated 115 min', 'Rated 116 min', 'Rated 118 min',
       'Rated 119 min', 'Rated 12 Seasons', 'Rated 121 min', 'Rated 127 min',
       'Rated 129 min', 'Rated 130 min', 'Rated 136 min',
       'Rated 142 min', 'Rated 157 min', 'Rated 2 Seasons',
       'Rated 28 min', 'Rated 3 Seasons', 'Rated 34 min', 'Rated 36 min',
       'Rated 37 min', 'Rated 39 min', 'Rated 4 Seasons', 'Rated 40 min',
       'Rated 41 min', 'Rated 45 min', 'Rated 47 min', 'Rated 49 min',
       'Rated 5 min', 'Rated 57 min', 'Rated 6 Seasons', 'Rated 6 min',
       'Rated 61 min', 'Rated 64 min', 'Rated 65 min', 'Rated 66 min',
       'Rated 67 min', 'Rated 7 min', 'Rated 7+', 'Rated 71 min',
       'Rated 73 min', 'Rated 74 min', 'Rated 75 min', 'Rated 76 min',
       'Rated 77 min', 'Rated 78 min', 'Rated 79 min', 'Rated 80 min',
       'Rated 81 min', 'Rated 82 min', 'Rated 83 min', 'Rated 84 min',
       'Rated 85 min', 'Rated 86 min', 'Rated 87 min', 'Rated 88 min',
       'Rated 89 min', 'Rated 90 min', 'Rated 91 min', 'Rated 92 min',
       'Rated 93 min', 'Rated 94 min', 'Rated 95 min', 'Rated 96 min',
       'Rated 97 min', 'Rated 98 min', 'Rated 99 min'], axis=1)
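
The hand-written list above could also be built from a pattern, which avoids missing a column if the data changes. A minimal sketch on a hypothetical subset of the column names:

```python
import pandas as pd

# Hypothetical subset of the encoded column names.
cols = pd.Index(['Rated 1 Season', 'Rated 100 min', 'Rated 12 Seasons',
                 'Rated 7+', 'Rated PG', 'Rated TV-14'])

# Matches 'Rated <number> min' and 'Rated <number> Season(s)' only.
is_duration = cols.str.match(r'^Rated \d+ (?:min|Seasons?)$')
duration_cols = cols[is_duration]
print(list(duration_cols))
```

The result could then be passed to `drop(duration_cols, axis=1)`. Unlike the list above, this pattern leaves `'Rated 7+'` in place, since it reads as an age rating rather than a duration.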

Dropping the unrated columns, since they don't tell us anything about the show/movie

In [272]:
rating_enc = rating_enc.drop(['Rated NR','Rated UR','Rated NOT RATED','Rated NOTRATE','Rated UNRATED','Rated TV-NR'], axis=1)

Merging ratings that convey the same information (e.g. 16, 16+, and AGES_16_)

In [273]:
# Fold both variants into 'Rated 16+' before dropping them
rating_enc.loc[rating_enc['Rated 16'] == 1, 'Rated 16+'] = 1
rating_enc.loc[rating_enc['Rated AGES16'] == 1, 'Rated 16+'] = 1
rating_enc = rating_enc.drop(['Rated 16','Rated AGES16'], axis=1)

In [274]:
rating_enc.loc[rating_enc['Rated AGES18'] == 1, 'Rated 18+'] = 1
rating_enc = rating_enc.drop('Rated AGES18', axis=1)

In [275]:
rating_enc.loc[rating_enc['Rated ALL'] == 1, 'Rated G'] = 1
rating_enc.loc[rating_enc['Rated ALLAGES'] == 1, 'Rated G'] = 1
rating_enc = rating_enc.drop(['Rated ALL','Rated ALLAGES'], axis=1)

In [276]:
rating_enc.columns

Index(['Rated 13+', 'Rated 16+', 'Rated 18+', 'Rated G', 'Rated NC-17',
       'Rated PG', 'Rated PG-13', 'Rated R', 'Rated TV-14', 'Rated TV-G',
       'Rated TV-MA', 'Rated TV-PG', 'Rated TV-Y', 'Rated TV-Y7',
       'Rated TV-Y7-FV'],
      dtype='object')
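
An alternative to merging columns after encoding is to normalize the labels first and encode once. The sketch below is an assumption-laden illustration: the `canonical` mapping is hypothetical and simply mirrors the merges above, and `None` is used to mark labels that carry no rating.

```python
import pandas as pd

ratings = pd.Series(['16', '16+', 'AGES_16_', 'AGES_18_',
                     'ALL', 'ALL_AGES', 'NR', 'PG'])

# Hypothetical mapping mirroring the merges above; None marks "no rating".
canonical = {
    '16': '16+', 'AGES_16_': '16+',
    'AGES_18_': '18+',
    'ALL': 'G', 'ALL_AGES': 'G',
    'NR': None,
}

# Labels that are not in the mapping are kept as they are.
clean = ratings.map(lambda r: canonical.get(r, r))

# Missing values get no column because dummy_na defaults to False.
encoded = pd.get_dummies(clean, prefix='Rated', prefix_sep=' ')
print(list(encoded.columns))
```

Setting `prefix_sep=' '` produces names like `Rated PG` directly, so the underscore clean-up step is not needed in this variant.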

<h4>Next, the type</h4>

In [277]:
type_enc = pd.get_dummies(df['type'], prefix='type')

In [278]:
type_enc.columns

Index(['type_Movie', 'type_TV Show'], dtype='object')

In [279]:
type_enc = type_enc.rename({'type_Movie':'Movie',
                            'type_TV Show':'Show'}, axis=1)

<h4>Lastly, the genre of the movie/show</h4>

In [280]:
genre_enc = df.filter(items=['genre'])  # items expects a list of column names
# genre_enc = genre_enc.join(pd.concat([genre_enc['genre'].str.get_dummies(sep=',')])).drop('genre', axis=1)

In [281]:
genre_enc = df['genre'].str.replace(' ', '').str.get_dummies(sep=',')

In [282]:
genre_enc.shape

(22998, 120)
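
To make the encoding step concrete, here is a small sketch of `str.get_dummies` on three made-up genre strings. The second variant, which splits on `', '` instead of stripping spaces, assumes the separator is always a comma followed by one space; that is not verified against the full data.

```python
import pandas as pd

genres = pd.Series(["Comedy, Drama", "Kids' TV, TV Comedies", "Drama"])

# Same approach as the cell above: strip spaces, then split on commas.
encoded = genres.str.replace(' ', '').str.get_dummies(sep=',')
print(list(encoded.columns))

# Variant that keeps the spaces inside each genre name.
spaced = genres.str.get_dummies(sep=', ')
print(list(spaced.columns))
```

Keeping the spaces would avoid some of the renaming later on, at the cost of breaking if any row uses a different separator.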

In [283]:
df.shape

(22998, 11)

In [284]:
list(genre_enc.columns)

['Action',
 'Action&Adventure',
 'Action-Adventure',
 'AdultAnimation',
 'Adventure',
 'Animals&Nature',
 'Animation',
 'Anime',
 'AnimeFeatures',
 'AnimeSeries',
 'Anthology',
 'Arthouse',
 'Arts',
 'Biographical',
 'BlackStories',
 'BritishTVShows',
 'Buddy',
 'Cartoons',
 'Children&FamilyMovies',
 'Classic&CultTV',
 'ClassicMovies',
 'Classics',
 'Comedies',
 'Comedy',
 'ComingofAge',
 'ConcertFilm',
 'Cooking&Food',
 'Crime',
 'CrimeTVShows',
 'CultMovies',
 'Dance',
 'Disaster',
 'Documentaries',
 'Documentary',
 'Docuseries',
 'Drama',
 'Dramas',
 'Entertainment',
 'Faith&Spirituality',
 'FaithandSpirituality',
 'Family',
 'Fantasy',
 'Fitness',
 'GameShow/Competition',
 'GameShows',
 'Health&Wellness',
 'Historical',
 'History',
 'Horror',
 'HorrorMovies',
 'IndependentMovies',
 'International',
 'InternationalMovies',
 'InternationalTVShows',
 'Kids',
 "Kids'TV",
 'KoreanTVShows',
 'LGBTQ',
 'LGBTQ+',
 'LGBTQMovies',
 'LateNight',
 'Latino',
 'Lifestyle',
 'Lifestyle&Culture',


There is a lot to fix here, so I will go alphabetically through the list to do the following:
- Merge redundant categories, especially those that exist only to separate movies from shows
- Rename features so that they read better in an output
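
The merge pattern used in the cells that follow (set the target to 1 wherever a redundant column is 1, then drop the redundant column) can be wrapped in a helper. `merge_into` is a hypothetical name introduced here for illustration; it is not part of the notebook.

```python
import pandas as pd

def merge_into(frame, sources, target):
    """Fold one-hot columns `sources` into `target`, then drop the sources."""
    frame = frame.copy()
    present = [c for c in sources if c in frame.columns]
    if target not in frame.columns:
        frame[target] = 0
    # A row is positive in the target if it was positive in any merged column.
    frame[target] = frame[[target] + present].max(axis=1)
    return frame.drop(columns=present)

toy = pd.DataFrame({
    'Drama': [1, 0, 0],
    'Dramas': [0, 1, 0],
    'TVDramas': [0, 0, 1],
    'Comedy': [0, 0, 0],
})
toy = merge_into(toy, ['Dramas', 'TVDramas'], 'Drama')
print(toy)
```

A helper like this makes it harder to assign to the wrong column or to forget the drop, which are the easiest mistakes to make when each merge is written out by hand.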

In [285]:
# Turning positives in the combined Action&Adventure categories into positives in the separate Action and Adventure categories
genre_enc.loc[genre_enc['Action-Adventure'] == 1, 'Action'] = 1
genre_enc.loc[genre_enc['Action-Adventure'] == 1, 'Adventure'] = 1


genre_enc.loc[genre_enc['Action&Adventure'] == 1, 'Action'] = 1
genre_enc.loc[genre_enc['Action&Adventure'] == 1, 'Adventure'] = 1

genre_enc = genre_enc.drop(['Action-Adventure','Action&Adventure'], axis=1)

In [286]:
genre_enc = genre_enc.rename({'AdultAnimation':'Adult Animation'}, axis=1)

In [287]:
genre_enc = genre_enc.rename({'Animals&Nature':'Animals & Nature'}, axis=1)

In [288]:
# There are so many anime features, I will merge them

genre_enc.loc[genre_enc['AnimeFeatures'] == 1, 'Anime'] = 1
genre_enc.loc[genre_enc['AnimeSeries'] == 1, 'Anime'] = 1


genre_enc = genre_enc.drop(['AnimeSeries','AnimeFeatures'], axis=1)

In [289]:
genre_enc = genre_enc.rename({'BlackStories':'Black Stories'}, axis=1)

In [290]:
genre_enc = genre_enc.rename({'BritishTVShows':'British'}, axis=1)

In [291]:
genre_enc = genre_enc.rename({'Children&FamilyMovies':'Family'}, axis=1)

In [292]:
#Merging classic/cult into classics feature
genre_enc.loc[genre_enc['Classic&CultTV'] == 1, 'Classics'] = 1
genre_enc.loc[genre_enc['ClassicMovies'] == 1, 'Classics'] = 1

genre_enc = genre_enc.drop(['Classic&CultTV','ClassicMovies'], axis=1)

In [293]:
genre_enc = genre_enc.rename({'ComingofAge':'Coming of Age'}, axis=1)

In [294]:
genre_enc = genre_enc.rename({'ConcertFilm':'Concert'}, axis=1)

In [295]:
genre_enc = genre_enc.rename({'Cooking&Food':'Cooking'}, axis=1)

In [296]:
#Merging CrimeTV with Crime
genre_enc.loc[genre_enc['CrimeTVShows'] == 1, 'Crime'] = 1
genre_enc = genre_enc.drop(['CrimeTVShows'], axis=1)

In [297]:
genre_enc = genre_enc.rename({'CultMovies':'Cult'}, axis=1)

In [298]:
# Merging all of the documentary features
genre_enc.loc[genre_enc['Documentaries'] == 1, 'Documentary'] = 1
genre_enc.loc[genre_enc['Docuseries'] == 1, 'Documentary'] = 1

genre_enc = genre_enc.drop(['Documentaries','Docuseries'], axis=1)

In [299]:
# Merging Drama with Dramas
genre_enc.loc[genre_enc['Dramas'] == 1, 'Drama'] = 1
genre_enc = genre_enc.drop(['Dramas'], axis=1)

In [300]:
# Merging Faith&Spirituality features into one called Faith
genre_enc = genre_enc.rename({'Faith&Spirituality':'Faith'}, axis=1)

genre_enc.loc[genre_enc['FaithandSpirituality'] == 1, 'Faith'] = 1

genre_enc = genre_enc.drop('FaithandSpirituality', axis=1)

In [301]:
# Merging gameshow features into one called Game Show
genre_enc = genre_enc.rename({'GameShow/Competition':'Game Show'}, axis=1)

genre_enc.loc[genre_enc['GameShows'] == 1, 'Game Show'] = 1

genre_enc = genre_enc.drop('GameShows', axis=1)

In [302]:
genre_enc = genre_enc.rename({'Health&Wellness':'Health'}, axis=1)

In [303]:
# Merging historical and history
genre_enc.loc[genre_enc['Historical'] == 1, 'History'] = 1

genre_enc = genre_enc.drop('Historical', axis=1)

In [304]:
# Merging horrors
genre_enc.loc[genre_enc['HorrorMovies'] == 1, 'Horror'] = 1

genre_enc = genre_enc.drop('HorrorMovies', axis=1)

In [305]:
genre_enc = genre_enc.rename({'IndependentMovies':'Independent'}, axis=1)

In [306]:
# Merging all of the international features
genre_enc.loc[genre_enc['InternationalMovies'] == 1, 'International'] = 1
genre_enc.loc[genre_enc['InternationalTVShows'] == 1, 'International'] = 1

genre_enc = genre_enc.drop(['InternationalMovies','InternationalTVShows'], axis=1)

In [307]:
# Merging kids features
genre_enc.loc[genre_enc["Kids'TV"] == 1, 'Kids'] = 1

genre_enc = genre_enc.drop("Kids'TV", axis=1)

In [308]:
# Merging all LGBTQ features
genre_enc.loc[genre_enc['LGBTQ'] == 1, 'LGBTQ+'] = 1
genre_enc.loc[genre_enc['LGBTQMovies'] == 1, 'LGBTQ+'] = 1

genre_enc = genre_enc.drop(['LGBTQ','LGBTQMovies'], axis=1)

In [309]:
genre_enc = genre_enc.rename({'LateNight':'Late Night'}, axis=1)

In [310]:
# Merging lifestyle features
genre_enc.loc[genre_enc['Lifestyle&Culture'] == 1, 'Lifestyle'] = 1

genre_enc = genre_enc.drop('Lifestyle&Culture', axis=1)

In [311]:
genre_enc = genre_enc.rename({'MilitaryandWar':'Military & War'}, axis=1)

In [312]:
# Dropping Movies, it's redundant when we have a type feature
genre_enc = genre_enc.drop(['Movies'], axis=1)

In [313]:
# Merging features into music
genre_enc.loc[genre_enc['Concert'] == 1, 'Music'] = 1
genre_enc.loc[genre_enc['MusicVideosandConcerts'] == 1, 'Music'] = 1

genre_enc = genre_enc.drop(['Concert','MusicVideosandConcerts'], axis=1)

In [314]:
# Merging features into musical
genre_enc.loc[genre_enc['Music&Musicals'] == 1, 'Musical'] = 1

genre_enc = genre_enc.drop('Music&Musicals', axis=1)

In [315]:
# Merging Police/Cop into Crime
genre_enc.loc[genre_enc['Police/Cop'] == 1, 'Crime'] = 1

genre_enc = genre_enc.drop(['Police/Cop'], axis=1)

In [316]:
# Merging reality
genre_enc.loc[genre_enc['RealityTV'] == 1, 'Reality'] = 1

genre_enc = genre_enc.drop(['RealityTV'], axis=1)

In [317]:
# Merging romances, and adding romcoms to comedy
genre_enc.loc[genre_enc['RomanticComedy'] == 1, 'Comedy'] = 1

genre_enc.loc[genre_enc['RomanticComedy'] == 1, 'Romance'] = 1
genre_enc.loc[genre_enc['RomanticMovies'] == 1, 'Romance'] = 1
genre_enc.loc[genre_enc['RomanticTVShows'] == 1, 'Romance'] = 1

genre_enc = genre_enc.drop(['RomanticComedy','RomanticMovies','RomanticTVShows'], axis=1)

In [318]:
genre_enc = genre_enc.rename({'Sci-Fi&Fantasy':'Sci-Fi & Fantasy'}, axis=1)

In [319]:
# Adding Animals to Nature category
genre_enc = genre_enc.rename({'Science&NatureTV':'Nature'}, axis=1)

genre_enc.loc[genre_enc['Animals & Nature'] == 1, 'Nature'] = 1

In [320]:
# Renaming to Sci-Fi and adding titles in this category to the Sci-Fi & Fantasy category
genre_enc = genre_enc.rename({'ScienceFiction':'Sci-Fi'}, axis=1)
genre_enc.loc[genre_enc['Sci-Fi'] == 1, 'Sci-Fi & Fantasy'] = 1

In [321]:
# Dropping redundant feature Series
genre_enc = genre_enc.drop(['Series'], axis=1)

In [322]:
# Adding sketch comedy movies/shows to comedy category
genre_enc.loc[genre_enc['SketchComedy'] == 1, 'Comedy'] = 1

In [323]:
# Adding SoapOpera/Melodrama to Drama
genre_enc.loc[genre_enc['SoapOpera/Melodrama'] == 1, 'Drama'] = 1

In [324]:
genre_enc = genre_enc.rename({'Spanish-LanguageTVShows':'Spanish'}, axis=1)

In [325]:
genre_enc = genre_enc.rename({'SpecialInterest':'Special Interest'}, axis=1)

In [326]:
# Merging Sports
genre_enc.loc[genre_enc['SportsMovies'] == 1, 'Sports'] = 1

genre_enc = genre_enc.drop(['SportsMovies'], axis=1)

In [327]:
genre_enc = genre_enc.rename({'Spy/Espionage':'Spy'}, axis=1)

In [328]:
# Merging stand-up features and adding these to the comedy category
genre_enc = genre_enc.rename({'StandUp':'Stand Up'}, axis=1)

genre_enc.loc[genre_enc['Stand-UpComedy'] == 1, 'Stand Up'] = 1
genre_enc.loc[genre_enc['Stand-UpComedy&TalkShows'] == 1, 'Stand Up'] = 1

genre_enc.loc[genre_enc['Stand Up'] == 1, 'Comedy'] = 1

genre_enc = genre_enc.drop(['Stand-UpComedy'], axis=1)

In [329]:
# Putting all TV categories into their respective general categories
genre_enc.loc[genre_enc['TVAction&Adventure'] == 1, 'Action'] = 1
genre_enc.loc[genre_enc['TVAction&Adventure'] == 1, 'Adventure'] = 1
genre_enc = genre_enc.drop('TVAction&Adventure', axis=1)

genre_enc.loc[genre_enc['TVComedies'] == 1, 'Comedy'] = 1
genre_enc = genre_enc.drop('TVComedies', axis=1)

genre_enc.loc[genre_enc['TVDramas'] == 1, 'Drama'] = 1
genre_enc = genre_enc.drop('TVDramas', axis=1)

genre_enc.loc[genre_enc['TVHorror'] == 1, 'Horror'] = 1
genre_enc = genre_enc.drop('TVHorror', axis=1)

genre_enc.loc[genre_enc['TVMysteries'] == 1, 'Mystery'] = 1
genre_enc = genre_enc.drop('TVMysteries', axis=1)

genre_enc.loc[genre_enc['TVSci-Fi&Fantasy'] == 1, 'Sci-Fi & Fantasy'] = 1
genre_enc = genre_enc.drop('TVSci-Fi&Fantasy', axis=1)

genre_enc.loc[genre_enc['TVThrillers'] == 1, 'Thriller'] = 1
genre_enc = genre_enc.drop('TVThrillers', axis=1)

genre_enc.loc[genre_enc['TeenTVShows'] == 1, 'Teen'] = 1
genre_enc = genre_enc.drop('TeenTVShows', axis=1)

In [330]:
# Dropping redundant TV feature
genre_enc = genre_enc.drop(['TVShows'], axis=1)

In [331]:
# Merging everything talk show
genre_enc = genre_enc.rename({'TalkShow':'Talk Show'}, axis=1)

genre_enc.loc[genre_enc['Late Night'] == 1, 'Talk Show'] = 1
genre_enc.loc[genre_enc['TalkShowandVariety'] == 1, 'Talk Show'] = 1
genre_enc.loc[genre_enc['Stand-UpComedy&TalkShows'] == 1, 'Talk Show'] = 1

genre_enc = genre_enc.drop(['Late Night','TalkShowandVariety','Stand-UpComedy&TalkShows'], axis=1)

In [332]:
# Merging Thriller
genre_enc.loc[genre_enc['Thrillers'] == 1, 'Thriller'] = 1

genre_enc = genre_enc.drop('Thrillers', axis=1)

In [333]:
genre_enc = genre_enc.rename({'YoungAdultAudience':'Young Adult'}, axis=1)

In [334]:
# Merging this with earlier lifestyle feature
genre_enc.loc[genre_enc['andCulture'] == 1, 'Lifestyle'] = 1

genre_enc = genre_enc.drop('andCulture', axis=1)

In [335]:
list(genre_enc.columns)

['Action',
 'Adult Animation',
 'Adventure',
 'Animals & Nature',
 'Animation',
 'Anime',
 'Anthology',
 'Arthouse',
 'Arts',
 'Biographical',
 'Black Stories',
 'British',
 'Buddy',
 'Cartoons',
 'Family',
 'Classics',
 'Comedies',
 'Comedy',
 'Coming of Age',
 'Cooking',
 'Crime',
 'Cult',
 'Dance',
 'Disaster',
 'Documentary',
 'Drama',
 'Entertainment',
 'Faith',
 'Family',
 'Fantasy',
 'Fitness',
 'Game Show',
 'Health',
 'History',
 'Horror',
 'Independent',
 'International',
 'Kids',
 'KoreanTVShows',
 'LGBTQ+',
 'Latino',
 'Lifestyle',
 'Medical',
 'Military & War',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Parody',
 'Reality',
 'Romance',
 'Sci-Fi & Fantasy',
 'Nature',
 'Science&Technology',
 'Sci-Fi',
 'Sitcom',
 'SketchComedy',
 'SoapOpera/Melodrama',
 'Spanish',
 'Special Interest',
 'Sports',
 'Spy',
 'Stand Up',
 'Superhero',
 'Survival',
 'Suspense',
 'Talk Show',
 'Teen',
 'Thriller',
 'Travel',
 'Unscripted',
 'Variety',
 'Western',
 'Young Adult']

In [336]:
genre_enc.shape

(22998, 74)

In [337]:
df_enc = pd.concat([df, rating_enc, type_enc], axis=1)
df_enc = df_enc.drop(['rating', 'type', 'genre'], axis=1)

In [338]:
df_enc.head()

Unnamed: 0,id,title,year,description,netflix,amazon,disney,hulu,Rated 13+,Rated 16+,...,Rated R,Rated TV-14,Rated TV-G,Rated TV-MA,Rated TV-PG,Rated TV-Y,Rated TV-Y7,Rated TV-Y7-FV,Movie,Show
0,s1,Dick Johnson Is Dead,2020,"As her father nears the end of his life, filmm...",1.0,0.0,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,s2,Blood & Water,2021,"After crossing paths at a party, a Cape Town t...",1.0,0.0,0.0,0.0,0,0,...,0,0,0,1,0,0,0,0,0,1
2,s3,Ganglands,2021,To protect his family from a powerful drug lor...,1.0,0.0,0.0,0.0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,s4,Jailbirds New Orleans,2021,"Feuds, flirtations and toilet talk go down amo...",1.0,0.0,0.0,0.0,0,0,...,0,0,0,1,0,0,0,0,0,1
4,s5,Kota Factory,2021,In a city of coaching centers known to train I...,1.0,0.0,0.0,0.0,0,0,...,0,0,0,1,0,0,0,0,0,1


Lastly, we must combine duplicate rows that occur from using several datasets

In [339]:
agg_func = {'id':'first', 
            'year':'first', 
            'description':'first',
            'netflix':'sum',
            'amazon':'sum', 
            'disney':'sum',
            'hulu':'sum', 
            'Rated 13+':'sum', 
            'Rated 16+':'sum', 
            'Rated 18+':'sum', 
            'Rated G':'sum', 
            'Rated NC-17':'sum',
            'Rated PG':'sum', 
            'Rated PG-13':'sum', 
            'Rated R':'sum', 
            'Rated TV-14':'sum', 
            'Rated TV-G':'sum',
            'Rated TV-MA':'sum', 
            'Rated TV-PG':'sum',
            'Rated TV-Y':'sum', 
            'Rated TV-Y7':'sum',
            'Rated TV-Y7-FV':'sum', 
            'Movie':'sum',
            'Show':'sum'}

In [340]:
df_enc = df_enc.groupby(df_enc['title']).aggregate(agg_func)

In [341]:
df_enc.reset_index(inplace=True)
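
The aggregation dictionary above can also be generated from the column names, which helps if more encoded columns are added later. A minimal sketch on toy rows; using `'max'` instead of `'sum'` is a deliberate substitution here, not what the cell above does.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Alpha', 'Alpha', 'Beta'],
    'year': [2019, 2019, 2020],
    'netflix': [1, 0, 1],
    'hulu': [0, 1, 0],
    'Rated PG': [1, 1, 0],
})

meta_cols = ['year']  # descriptive columns: keep the first value seen
flag_cols = [c for c in toy.columns if c not in meta_cols + ['title']]

# 'max' keeps flags binary; 'sum' would count how many duplicates had the flag.
agg_func = {**{c: 'first' for c in meta_cols},
            **{c: 'max' for c in flag_cols}}
merged = toy.groupby('title', as_index=False).agg(agg_func)
print(merged)
```

With `'sum'`, the toy title `Alpha` would end up with `Rated PG` equal to 2, which weights that feature more heavily in a cosine similarity. Whether that is desirable is a modelling choice.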

In [342]:
df_enc

Unnamed: 0,title,id,year,description,netflix,amazon,disney,hulu,Rated 13+,Rated 16+,...,Rated R,Rated TV-14,Rated TV-G,Rated TV-MA,Rated TV-PG,Rated TV-Y,Rated TV-Y7,Rated TV-Y7-FV,Movie,Show
0,"""Mixed Up""",s5548,2020,"""Mixed Up"" examines casual factors that make u...",0.0,1.0,0.0,0.0,0,1,...,0,0,0,0,0,0,0,0,1,0
1,"""The Paramedic Angel""",s5978,2021,The tragedy of a loving family man and paramed...,0.0,1.0,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,#Alive,s2037,2020,"As a grisly virus rampages a city, a lone man ...",1.0,0.0,0.0,0.0,0,0,...,0,0,0,1,0,0,0,0,1,0
3,#AnneFrank - Parallel Stories,s2305,2019,"Through her diary, Anne Frank's story is retol...",1.0,0.0,0.0,0.0,0,0,...,0,1,0,0,0,0,0,0,1,0
4,#FriendButMarried,s2482,2018,"Pining for his high school crush for years, a ...",1.0,0.0,0.0,0.0,0,0,...,0,0,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22110,忍者ハットリくん,s6178,2012,"Hailing from the mountains of Iga, Kanzo Hatto...",1.0,0.0,0.0,0.0,0,0,...,0,0,0,0,0,0,1,0,0,1
22111,海的儿子,s4915,2016,"Two brothers start a new life in Singapore, wh...",1.0,0.0,0.0,0.0,0,0,...,0,1,0,0,0,0,0,0,0,1
22112,마녀사냥,s7102,2015,Four Korean celebrity men and guest stars of b...,1.0,0.0,0.0,0.0,0,0,...,0,0,0,1,0,0,0,0,0,1
22113,반드시 잡는다,s5023,2017,After people in his town start turning up dead...,1.0,0.0,0.0,0.0,0,0,...,0,0,0,1,0,0,0,0,1,0


# Recommendation System

In [343]:
# Dropping the values that cannot be used in cosine similarity
df_rs = df_enc.drop(['id','title','year','description'],axis=1)

In [345]:
cosine_sim = cosine_similarity(df_rs, df_rs)

MemoryError: Unable to allocate 3.64 GiB for an array with shape (22115, 22115) and data type float64
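
The full pairwise matrix has 22115 x 22115 entries, which is what the error above is about. Since recommendations only need the rows for the queried titles, one option is to compute just those rows. The sketch below uses plain NumPy on a tiny stand-in matrix; `similarity_to` is a hypothetical helper, and it is meant to give the same values as the corresponding rows of the `cosine_similarity` output.

```python
import numpy as np

def similarity_to(query_rows, matrix):
    """Cosine similarity of matrix[query_rows] against every row of matrix."""
    matrix = np.asarray(matrix, dtype=np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = matrix / norms
    # Result has shape (len(query_rows), n_rows) rather than (n_rows, n_rows).
    return unit[query_rows] @ unit.T

# Stand-in for df_rs: rows are titles, columns are binary features.
features = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
])

scores = similarity_to([0], features)[0]
print(scores)
```

For three query titles this would need a 3 x 22115 array instead of the full square matrix. Using `float32` halves the memory again compared with the default `float64`.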

In [346]:
def get_recommendations(title_list, cosine_sim_matrix, df, top_n=10):
    avg_sim_scores = np.zeros(len(df))
    n_found = 0

    for title in title_list:
        # Check if the title exists in the dataset
        if title not in df['title'].values:
            print("Warning: '{}' not found in the dataset.".format(title))
            continue
        
        # Get the row position of the title (df must have a default 0..n-1 index)
        idx = df[df['title'] == title].index[0]
        sim_scores = cosine_sim_matrix[idx]
        
        # Add the similarity scores to the running total
        avg_sim_scores += sim_scores
        n_found += 1

    # Nothing to rank if none of the titles were found
    if n_found == 0:
        return df['title'].iloc[:0]

    # Average the similarity scores across the titles that were actually found
    avg_sim_scores /= n_found
    movie_indices = np.argsort(avg_sim_scores)[::-1][:top_n]

    recommendations = df.iloc[movie_indices]['title']

    return recommendations

movie_titles = ["Henry Danger", "Inception", "Stranger Things"]
recommendations = get_recommendations(movie_titles, cosine_sim, df_enc)
print(recommendations)

5803                         El Barco
12837                   North & South
10479                    Little Lunch
14519                    Red vs. Blue
14982    Russell Peters vs. the World
4215               Comedy Bang! Bang!
10012                       LOST SONG
15744                         Shtisel
7706                   Happy 300 Days
7708                        Happy And
Name: title, dtype: object
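
Because a title is always maximally similar to itself, a ranking like the one above can return the input titles as their own recommendations. One possible guard, sketched here on made-up scores, is to mask the query positions before sorting (the variable names are illustrative):

```python
import numpy as np

scores = np.array([1.0, 0.9, 0.2, 0.8])  # made-up averaged similarity scores
query_idx = [0]                          # positions of the input titles

# Push the query titles to the bottom of the ranking.
ranked_scores = scores.copy()
ranked_scores[query_idx] = -np.inf

top = np.argsort(ranked_scores)[::-1][:2]
print(top)
```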


In [352]:
get_recommendations(['Food Wars!: Shokugeki no Soma'],cosine_sim,df_enc)

9956                  L.A.’s Finest
8146                      Hollywood
18723                 The Magicians
16117                        Somos.
4648                      Damnation
5892     Ellen DeGeneres: Relatable
21055                         Vexed
21062               Victim Number 8
16125                 Song Exploder
2993                    Borderliner
Name: title, dtype: object

In [350]:
df_enc[df_enc['title'].str.contains('Food Wars')]

Unnamed: 0,title,id,year,description,netflix,amazon,disney,hulu,Rated 13+,Rated 16+,...,Rated R,Rated TV-14,Rated TV-G,Rated TV-MA,Rated TV-PG,Rated TV-Y,Rated TV-Y7,Rated TV-Y7-FV,Movie,Show
6615,Food Wars!,s2782,2015,Souma is a teenage chef who is always looking ...,0.0,0.0,0.0,1.0,0,0,...,0,0,0,1,0,0,0,0,0,1
6616,Food Wars!: Shokugeki no Soma,s1909,2016,Young chef Soma enters the prestigious Totsuki...,1.0,0.0,0.0,0.0,0,0,...,0,0,0,1,0,0,0,0,0,1
