# Movie/TV Show Recommendation System

## Introduction
- Recommender systems (RSs) are everywhere. Amazon, Netflix, Spotify, YouTube, and many other services and apps we use every day run some kind of recommendation engine in the backend.

- RSs help users find items they are interested in, which can increase engagement: if a platform suggests items of interest, users tend to spend more time on it.

- There are many techniques and strategies for building modern, powerful RSs, and the right choice depends on the application domain.
- This system uses content-based techniques.

## In this kernel, we'll build a Movie/TV Show Recommendation System using these datasets:

- Paramount+ Movies and TV Shows
- Disney+ Movies and TV Shows
- AppleTV+ Movies and TV Shows
- Amazon Prime Movies and TV Shows
- HBO Max Movies and TV Shows
- Netflix Movies and TV Shows

In [1]:
pwd

'/Users/mac/Documents/recommendation project'

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import missingno as msno
%matplotlib inline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [3]:
amazon_titles = pd.read_csv("amazon.csv")
appletv_titles = pd.read_csv("appletv.csv")
disney_titles = pd.read_csv("disney.csv")
hbo_titles = pd.read_csv("hbo.csv")
netflix_titles = pd.read_csv("netflix.csv")
paramount_titles = pd.read_csv("paramount.csv")

In [5]:
amazon_titles.shape

(10873, 15)

In [14]:
# Concatenate the six platform catalogues into one DataFrame
titles = pd.concat([amazon_titles, appletv_titles, disney_titles, hbo_titles, netflix_titles, paramount_titles],axis=0).reset_index()

In [15]:
titles.drop(['index'], axis=1, inplace=True)
titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,tm87233,It's a Wonderful Life,MOVIE,A holiday favourite for generations... George...,1946,PG,130,"['drama', 'family', 'fantasy', 'romance', 'com...",['US'],,tt0038650,8.6,467766.0,27.611,8.261
1,tm143047,Duck Soup,MOVIE,Rufus T. Firefly is named president/dictator o...,1933,,69,"['comedy', 'war']",['US'],,tt0023969,7.8,60933.0,9.013,7.357
2,tm83884,His Girl Friday,MOVIE,"Hildy, the journalist former wife of newspaper...",1940,,92,"['drama', 'romance', 'comedy']",['US'],,tt0032599,7.8,60244.0,14.759,7.433
3,ts20945,The Three Stooges,SHOW,The Three Stooges were an American vaudeville ...,1934,TV-PG,19,"['comedy', 'family']",['US'],26.0,tt0850645,8.5,1149.0,15.424,7.6
4,tm5012,Red River,MOVIE,Headstrong Thomas Dunson starts a thriving Tex...,1948,,133,"['western', 'drama', 'romance', 'action']",['US'],,tt0040724,7.8,32210.0,12.4,7.4


# Data cleaning

In [16]:
# Check for duplicate rows
titles[titles.duplicated()].head(5)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
11321,tm57241,Never Been Kissed,MOVIE,"Josie Geller, a baby-faced junior copywriter a...",1999,PG-13,107,"['comedy', 'drama', 'romance']",['US'],,tt0151738,6.0,93238.0,17.42,6.18
11387,ts22130,Rolie Polie Olie,SHOW,Rolie Polie Olie was a children's television s...,1998,TV-Y,21,"['animation', 'comedy', 'family', 'fantasy', '...","['CA', 'FR', 'US', 'GB']",6.0,tt0172049,6.3,3012.0,13.848,6.6
11586,tm98015,The Last Song,MOVIE,A drama centered on a rebellious girl who is s...,2010,PG,107,"['drama', 'romance', 'music']",['US'],,tt1294226,6.0,89378.0,15.081,7.242
11613,ts22233,Shake It Up,SHOW,Best pals CeCe and Rocky dream of dancing star...,2010,TV-G,25,"['comedy', 'family']",['US'],3.0,tt0453993,8.0,88.0,41.672,7.8
11822,ts7273,Doc McStuffins,SHOW,A young African-American girl aspires to be a ...,2012,TV-G,22,"['animation', 'family', 'fantasy', 'music']",['US'],6.0,tt1710295,6.6,2551.0,35.228,5.8


In [17]:
# Dropping duplicates
titles.drop_duplicates(inplace=True)
titles.shape

(23362, 15)

In [18]:
# Helper that prints the dataset info and the number of missing values per column
def information_func(titles):
    print('dataset info')
    titles.info()
    print('-----'*10)
    null = titles.isnull().sum()
    print("missing values:\n", null)
    print('-----'*10)

In [19]:
information_func(titles)

dataset info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23362 entries, 0 to 25245
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    23362 non-null  object 
 1   title                 23362 non-null  object 
 2   type                  23362 non-null  object 
 3   description           23167 non-null  object 
 4   release_year          23362 non-null  int64  
 5   age_certification     11381 non-null  object 
 6   runtime               23362 non-null  int64  
 7   genres                23362 non-null  object 
 8   production_countries  23362 non-null  object 
 9   seasons               5623 non-null   float64
 10  imdb_id               21412 non-null  object 
 11  imdb_score            20804 non-null  float64
 12  imdb_votes            20744 non-null  float64
 13  tmdb_popularity       22642 non-null  float64
 14  tmdb_score            20362 non-null  float64
dtypes: flo

### Handling the 'genres' and 'production_countries' columns
These two columns hold list values stored as strings, so we reduce each one to a single value: its first entry.

In [20]:
# Strip brackets and quotes, then keep the first entry as the single value.
# An explicit regex=True avoids the ambiguity of replacing a bare '[' pattern.
titles['genres'] = titles['genres'].str.replace(r"[\[\]']", '', regex=True)
titles['genre'] = titles['genres'].str.split(',').str[0]


titles['production_countries'] = titles['production_countries'].str.replace(r"[\[\]']", '', regex=True)
titles['production_country'] = titles['production_countries'].str.split(',').str[0]




In [22]:
titles.drop(['genres','production_countries'],axis=1,inplace=True)
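
The bracket stripping above keeps only the first genre and country of each title. As a minimal alternative sketch, assuming the raw columns hold Python-style list strings such as `"['drama', 'romance']"` (as the `head()` output suggests), the standard-library `ast.literal_eval` can parse them into real lists. The toy DataFrame `raw` and the helper `first_or_nan` are hypothetical names used only for illustration.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the raw 'genres' column.
raw = pd.DataFrame({'genres': ["['drama', 'romance']", "['comedy']", "[]"]})

# Parse each string into an actual Python list.
raw['genre_list'] = raw['genres'].apply(ast.literal_eval)


def first_or_nan(items):
    """Return the first list element, or NaN for an empty list."""
    return items[0] if items else np.nan


raw['genre'] = raw['genre_list'].apply(first_or_nan)
print(raw[['genre_list', 'genre']])
```

With this variant an empty list maps straight to NaN, and the full list stays available if more than the first genre is wanted.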

In [28]:
titles['genre'].unique()

array(['drama', 'comedy', 'western', 'romance', 'action', 'fantasy',
       'horror', 'thriller', 'documentation', 'music', 'crime', '', 'war',
       'reality', 'scifi', 'history', 'family', 'animation', 'sport',
       'european'], dtype=object)

In [29]:
titles['genre'] = titles['genre'].replace('', np.nan)
titles['production_country'] = titles['production_country'].replace('',np.nan)


In [30]:
titles.genre.isnull().sum()

367

# Seasons column

- Movies have no seasons, so `seasons` is NaN for MOVIE rows. We replace those NaN values with 0 seasons.


In [31]:
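# Assign the result back; an in-place fillna on a column selection may not update the DataFrame in recent pandas versions
titles['seasons'] = titles['seasons'].fillna(0)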
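
As a quick sanity check of the assumption above, the missing `seasons` values can be counted per `type` before filling. This is only a sketch on a hypothetical toy frame (`toy` is not the notebook's data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: movies carry NaN seasons, shows carry a count.
toy = pd.DataFrame({
    'type': ['MOVIE', 'SHOW', 'MOVIE', 'SHOW'],
    'seasons': [np.nan, 3.0, np.nan, 1.0],
})

# Count the missing 'seasons' values for each title type.
missing_by_type = toy['seasons'].isna().groupby(toy['type']).sum()
print(missing_by_type)

# Fill by assignment so the DataFrame itself is updated.
toy['seasons'] = toy['seasons'].fillna(0)
```

If the SHOW count were non-zero on the real data, filling with 0 would also mark some shows as having no seasons.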

# Missing Data

In [32]:
titles.isnull().sum()

id                        0
title                     0
type                      0
description             195
release_year              0
age_certification     11981
runtime                   0
seasons                   0
imdb_id                1950
imdb_score             2558
imdb_votes             2618
tmdb_popularity         720
tmdb_score             3000
genre                   367
production_country     1049
dtype: int64

In [33]:
titles.drop(['imdb_id','age_certification'], axis=1,inplace=True)

In [34]:
titles.dropna(inplace=True)

In [35]:
titles.shape

(18374, 13)

In [36]:
titles.head()

Unnamed: 0,id,title,type,description,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,genre,production_country
0,tm87233,It's a Wonderful Life,MOVIE,A holiday favourite for generations... George...,1946,130,0.0,8.6,467766.0,27.611,8.261,drama,US
1,tm143047,Duck Soup,MOVIE,Rufus T. Firefly is named president/dictator o...,1933,69,0.0,7.8,60933.0,9.013,7.357,comedy,US
2,tm83884,His Girl Friday,MOVIE,"Hildy, the journalist former wife of newspaper...",1940,92,0.0,7.8,60244.0,14.759,7.433,drama,US
3,ts20945,The Three Stooges,SHOW,The Three Stooges were an American vaudeville ...,1934,19,26.0,8.5,1149.0,15.424,7.6,comedy,US
4,tm5012,Red River,MOVIE,Headstrong Thomas Dunson starts a thriving Tex...,1948,133,0.0,7.8,32210.0,12.4,7.4,western,US


# Content Based Recommender


- In this kernel, we build a recommendation system based on title descriptions. We compute pairwise similarity scores between the descriptions of all movies (and, separately, all TV shows) and recommend the titles whose descriptions are most similar to the chosen one.
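
To make the idea concrete, here is a minimal sketch on a hypothetical three-description corpus (`docs` is invented for illustration). Each description becomes a TF-IDF vector, and cosine similarity measures how much vocabulary two descriptions share.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions: two westerns and one unrelated title.
docs = [
    "A cowboy drives cattle across Texas",
    "A rancher leads a cattle drive in Texas",
    "Two friends open a bakery in Paris",
]

# Turn each description into a TF-IDF vector, ignoring English stop words.
tfidf_toy = TfidfVectorizer(stop_words='english')
matrix = tfidf_toy.fit_transform(docs)

# Pairwise cosine similarity between all descriptions.
sim = cosine_similarity(matrix)
print(sim.round(2))
```

The two western descriptions share the words "cattle" and "texas", so they score higher with each other than with the bakery description, which shares no informative words with them.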

In [37]:
# streaming_platform column: list the platforms on which each title is available.
# Sets make each membership test fast; the dict order fixes the order of names in each list.
platform_ids = {
    'amazon': set(amazon_titles['id']),
    'appletv': set(appletv_titles['id']),
    'disney+': set(disney_titles['id']),
    'hbomax': set(hbo_titles['id']),
    'netflix': set(netflix_titles['id']),
    'paramount+': set(paramount_titles['id']),
}
lt = [
    [name for name, ids in platform_ids.items() if title_id in ids]
    for title_id in titles['id']
]

In [38]:
titles['streaming_platform'] = lt
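
As an alternative sketch, the platform could be recorded before concatenation and then collected per `id` with a group-by. The toy catalogues and the names `catalogues`, `tagged`, and `platforms_by_id` are hypothetical.

```python
import pandas as pd

# Hypothetical toy catalogues; only the 'id' column matters here.
catalogues = {
    'amazon': pd.DataFrame({'id': ['tm1', 'tm2']}),
    'paramount+': pd.DataFrame({'id': ['tm2', 'tm3']}),
}

# Tag every row with its platform, then stack the catalogues.
tagged = pd.concat(
    [df.assign(platform=name) for name, df in catalogues.items()],
    ignore_index=True,
)

# Collect the platforms of each id into a list.
platforms_by_id = tagged.groupby('id')['platform'].agg(list)
print(platforms_by_id)
```

Mapping the `id` column of the cleaned table through `platforms_by_id` would then attach the lists without an explicit loop.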

In [40]:
titles.head()

Unnamed: 0,id,title,type,description,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,genre,production_country,streaming_platform
0,tm87233,It's a Wonderful Life,MOVIE,A holiday favourite for generations... George...,1946,130,0.0,8.6,467766.0,27.611,8.261,drama,US,[amazon]
1,tm143047,Duck Soup,MOVIE,Rufus T. Firefly is named president/dictator o...,1933,69,0.0,7.8,60933.0,9.013,7.357,comedy,US,[amazon]
2,tm83884,His Girl Friday,MOVIE,"Hildy, the journalist former wife of newspaper...",1940,92,0.0,7.8,60244.0,14.759,7.433,drama,US,"[amazon, paramount+]"
3,ts20945,The Three Stooges,SHOW,The Three Stooges were an American vaudeville ...,1934,19,26.0,8.5,1149.0,15.424,7.6,comedy,US,[amazon]
4,tm5012,Red River,MOVIE,Headstrong Thomas Dunson starts a thriving Tex...,1948,133,0.0,7.8,32210.0,12.4,7.4,western,US,"[amazon, paramount+]"


# Movie and show data

In [41]:
# Split by type and renumber rows from 0 so positions match the TF-IDF matrix rows
movies = titles[titles['type'] == 'MOVIE'].copy().reset_index(drop=True)

shows = titles[titles['type'] == 'SHOW'].copy().reset_index(drop=True)

In [42]:
movies.head()

Unnamed: 0,id,title,type,description,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,genre,production_country,streaming_platform
0,tm87233,It's a Wonderful Life,MOVIE,A holiday favourite for generations... George...,1946,130,0.0,8.6,467766.0,27.611,8.261,drama,US,[amazon]
1,tm143047,Duck Soup,MOVIE,Rufus T. Firefly is named president/dictator o...,1933,69,0.0,7.8,60933.0,9.013,7.357,comedy,US,[amazon]
2,tm83884,His Girl Friday,MOVIE,"Hildy, the journalist former wife of newspaper...",1940,92,0.0,7.8,60244.0,14.759,7.433,drama,US,"[amazon, paramount+]"
3,tm5012,Red River,MOVIE,Headstrong Thomas Dunson starts a thriving Tex...,1948,133,0.0,7.8,32210.0,12.4,7.4,western,US,"[amazon, paramount+]"
4,tm82253,The Best Years of Our Lives,MOVIE,It's the hope that sustains the spirit of ever...,1947,171,0.0,8.1,66209.0,16.056,7.838,drama,US,[amazon]


In [45]:
# Define a TF-IDF vectorizer object.
# stop_words='english' removes common English stop words such as 'the' and 'a'.
tfidf = TfidfVectorizer(stop_words='english')

# Build the TF-IDF matrices by fitting and transforming the descriptions.
# Each fit_transform call refits the vectorizer, so movies and shows get separate vocabularies.
tfidf_matrix_movies = tfidf.fit_transform(movies['description'])
tfidf_matrix_shows = tfidf.fit_transform(shows['description'])

#Output the shape of tfidf_matrix
print(f'Shape for Movies: {tfidf_matrix_movies.shape}')
print(f'Shape for Shows: {tfidf_matrix_shows.shape}')

Shape for Movies: (13831, 35074)
Shape for Shows: (4543, 19365)


In [46]:
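# Compute the cosine similarity matrices.
# TfidfVectorizer L2-normalises each row by default, so the dot product from linear_kernel equals cosine similarity.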
cosine_sim_movies = linear_kernel(tfidf_matrix_movies, tfidf_matrix_movies)
cosine_sim_shows = linear_kernel(tfidf_matrix_shows, tfidf_matrix_shows)
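
`linear_kernel` computes plain dot products, so it only matches cosine similarity when the rows have unit length. `TfidfVectorizer` normalises rows that way by default (`norm='l2'`). The following sketch checks the equivalence on hypothetical toy text (`toy_docs` is invented for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical descriptions used only for this check.
toy_docs = [
    "space crew explores a distant planet",
    "a detective hunts a killer in the city",
    "astronauts land on a distant planet",
]

# Default settings: rows of the TF-IDF matrix are L2-normalised.
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

dot = linear_kernel(toy_matrix, toy_matrix)   # raw dot products
cos = cosine_similarity(toy_matrix)           # explicit cosine similarity

# True when both matrices agree up to floating-point tolerance.
same = np.allclose(dot, cos)
print(same)
```

If the vectorizer were created with `norm=None`, the two results would generally differ and `cosine_similarity` would be the appropriate call.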

In [51]:
# Now we create a way to identify the index of a movie/show in our data, given its title.
indices_movies = pd.Series(movies.index, index=movies['title'])
indices_shows = pd.Series(shows.index, index=shows['title'])


In [53]:
indices_movies.head()

title
It's a Wonderful Life          0
Duck Soup                      1
His Girl Friday                2
Red River                      3
The Best Years of Our Lives    4
dtype: int64
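
Titles are not guaranteed to be unique, so a lookup in such a Series can return either a single position or several. A small sketch on hypothetical data (`toy_movies` and `toy_indices` are invented names) shows both cases:

```python
import pandas as pd

# Hypothetical toy data with one repeated title.
toy_movies = pd.DataFrame({
    'title': ['Heat', 'Rob Roy', 'Heat'],
    'release_year': [1995, 1995, 1986],
})
toy_indices = pd.Series(toy_movies.index, index=toy_movies['title'])

unique_hit = toy_indices['Rob Roy']   # a title that occurs once: scalar position
repeated_hit = toy_indices['Heat']    # a title that occurs twice: Series of positions

print(type(unique_hit).__name__)
print(repeated_hit.tolist())
```

This is why the title lookup needs a branch that asks the user to choose when a title matches more than one row.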

In [54]:
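def get_title(title, indices):
    """
    Look up `title` in the `indices` Series and return its row position.
    If several entries share the title, ask the user to pick one.
    Returns None when the title is not found.
    """

    try:
        index = indices[title]
    except KeyError:
        print("\n  Title not found")
        return None

    # A unique title yields a scalar; a repeated title yields a Series.
    if isinstance(index, (int, np.integer)):
        return index

    # List the candidates from the DataFrame that `indices` was built from.
    data = shows if indices is indices_shows else movies
    print("Select a title: ")
    for i, pos in enumerate(index):
        print(f"{i} - {data['title'].iloc[pos]}", end=' ')
        print(f"({data['release_year'].iloc[pos]})")
    rt = int(input())
    return index.iloc[rt]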

In [55]:
# Functions that accept a movie/show title as input and print the 10 most similar titles.


In [56]:
def get_recommendations_movie(title, cosine_sim=cosine_sim_movies):   
    
    # get_title returns the row position of the chosen movie, or None if not found
    idx = get_title(title, indices_movies)
    if idx is None:
        return
      
    print(f"Title: {movies['title'].iloc[idx]} |  Year: {movies['release_year'].iloc[idx]}")

    print('**' * 40)

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies; position 0 is skipped because it is normally the movie itself
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    print(movies[['title', 'release_year','streaming_platform']].iloc[movie_indices])

    print('**' * 40)


    

In [57]:
def get_recommendations_show(title, cosine_sim=cosine_sim_shows):  
    
    # get_title returns the row position of the chosen show, or None if not found
    idx = get_title(title, indices_shows)
    if idx is None:
        return
      
    print(f"Title: {shows['title'].iloc[idx]} |  Year: {shows['release_year'].iloc[idx]}")

    print('**' * 40)

    # Get the pairwise similarity scores of all shows with that show
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the shows based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar shows; position 0 is skipped because it is normally the show itself
    sim_scores = sim_scores[1:11]

    # Get the shows indices
    shows_indices = [i[0] for i in sim_scores]

    print(shows[['title', 'release_year','streaming_platform']].iloc[shows_indices])

    print('**' * 40)
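
The two functions above differ only in the DataFrame, index Series, and similarity matrix they use. As a sketch, one helper could serve both and compute a single row of similarities on demand instead of storing the full square matrix. The function `top_n_similar` and the toy data are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_n_similar(idx, tfidf_matrix, n=10):
    """Return positions of the n rows most similar to row idx, excluding idx."""
    # One row of similarities, computed on demand.
    scores = linear_kernel(tfidf_matrix[idx:idx + 1], tfidf_matrix).ravel()
    # Exclude the query row explicitly rather than assuming it sorts first.
    scores[idx] = -np.inf
    # A stable sort on the negated scores gives descending order.
    order = np.argsort(-scores, kind='stable')
    return order[:n]


# Hypothetical toy catalogue.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'description': [
        "cattle drive across the Texas plains",
        "a rancher defends his cattle in Texas",
        "a pastry chef opens a bakery in Paris",
        "rival bakers compete in a Paris bakery contest",
    ],
})
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy['description'])

nearest = top_n_similar(0, toy_matrix, n=1)
print(toy['title'].iloc[nearest].tolist())
```

Excluding the query row by position also covers the case where another title has an identical description and could otherwise take the first slot.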

In [58]:
get_recommendations_movie('Red River')

Title: Red River |  Year: 1948
********************************************************************************
                           title  release_year    streaming_platform
596              The Dude Ranger          1934              [amazon]
1060          High School Caesar          1960  [amazon, paramount+]
725                 Roarin' Lead          1936  [amazon, paramount+]
257            Randy Rides Alone          1934  [amazon, paramount+]
2113    Godkiller: Walk Among Us          2010              [amazon]
581           Hop-a-long Cassidy          1935  [amazon, paramount+]
13205                   Big Jake          1971          [paramount+]
2922                    Grinders          2011              [amazon]
708    Springtime in the Rockies          1937  [amazon, paramount+]
13342                    Rob Roy          1995          [paramount+]
********************************************************************************


In [59]:
get_recommendations_show('The Three Stooges')

Title: The Three Stooges |  Year: 1934
********************************************************************************
                                 title  release_year streaming_platform
1616                     Tom and Jerry          1940           [hbomax]
4205                           KaBlam!          1996       [paramount+]
1558                  Welcome to Earth          2021          [disney+]
1547                Star Wars: Visions          2021          [disney+]
1185                 Schoolhouse Rock!          1973          [disney+]
3979                   Feels Like Ishq          2021          [netflix]
3844  Rilakkuma's Theme Park Adventure          2022          [netflix]
3406                   Paava Kadhaigal          2020          [netflix]
504            The New Yorker Presents          2016           [amazon]
1229                           Bonkers          1993          [disney+]
********************************************************************************


# In this project, I built a recommendation system using data from several streaming platforms.
- I obtained the top 10 recommendations for a movie and for a show with a content-based method that ranks titles by the cosine similarity of their descriptions.