In this notebook, I use cosine similarity to generate movie recommendations from the movie descriptions provided in the dataset.

# Load Dataset

First, we load the dataset and explore its columns. The dataset contains a `description` column, which holds a short summary of each title. We'll build the recommendations from the contents of this `description` column.

In [None]:
import pandas as pd
netflix_titles = pd.read_csv("../input/netflix-shows/netflix_titles.csv")
print(netflix_titles.columns)
print(netflix_titles.head())

# User Input

You can pass your movie(s) of interest here. In this version the titles must match the dataset exactly, including case; I plan to loosen this (case-insensitive, more flexible title matching) in later versions. For multiple movies, list the titles separated by commas. The titles are stored in a list named `user_movietitles`. Only titles that appear in the dataset can be used as the basis for recommendations.
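The looser matching mentioned above could look roughly like the sketch below. It is not part of this notebook's pipeline: `match_titles` is a hypothetical helper, the `cutoff` value is an arbitrary assumption, and the fuzzy step uses `difflib` from the standard library.

```python
import difflib

def match_titles(user_titles, catalog_titles, cutoff=0.8):
    """Map each user-typed title to the closest catalog title, ignoring case."""
    # Lowercased lookup table that maps back to the catalog's original spelling
    lookup = {title.lower(): title for title in catalog_titles}
    matched = []
    for raw in user_titles:
        key = raw.strip().lower()
        if key in lookup:
            # Exact match once case and surrounding spaces are ignored
            matched.append(lookup[key])
            continue
        # Otherwise fall back to the closest fuzzy match above the cutoff
        close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        if close:
            matched.append(lookup[close[0]])
    return matched

# Small stand-in catalog; with the real data this would be netflix_titles['title']
catalog = ['Sierra Burgess Is A Loser', 'Narcos', 'Explained', 'The Mind of a Chef']
print(match_titles(['narcos', 'sierra burgess is a loser', 'Explaned'], catalog))
```

Titles that have no close match are dropped silently here; a later version might prefer to report them back to the user.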

In [None]:
user_id = '1'
user_movietitles = ['Sierra Burgess Is A Loser', 'Narcos', 'Explained', 'The Mind of a Chef']
status = ['Watched' for x in range(len(user_movietitles))]

To make further development easier, I created a copy of the dataframe that is meant to store input movies from multiple users, so it also contains `user_id` and `status` columns with dummy contents. For now it only stores the movies passed in `user_movietitles`.

In [None]:
# Use .copy() so that adding columns does not trigger pandas' SettingWithCopyWarning
df_user = netflix_titles.loc[netflix_titles['title'].isin(user_movietitles)].copy()
df_user['user_id'] = user_id
# Assign a scalar so this still works if a title is missing from (or duplicated in) the dataset
df_user['status'] = 'Watched'
print(df_user[['user_id', 'title', 'status', 'description']])

# Recommendation System

Because the summaries in the `description` column are text data, we first import the packages we need: NLTK for stop words and scikit-learn for TF-IDF. Next, we remove stop words and tokenize the descriptions, lowercasing them so that the same token written in different cases is not treated as different tokens. I haven't added any lemmatization to these tokens. We do this for both (i) the whole Netflix dataset (loaded as the `netflix_titles` dataframe) and (ii) the user movies dataframe (`df_user`). We then compute the similarity between the two, here using cosine similarity. The larger the cosine similarity value, the more similar the movie summaries. The resulting `similarity` object is a NumPy array with one row per user movie and one column per title in the dataset.
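Before running this on the full dataset, a toy example may help show what the scores mean. The three descriptions below are made up for illustration, and the sketch uses scikit-learn's built-in `'english'` stop word list rather than NLTK's so that it runs without extra downloads.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up descriptions, for illustration only
toy_descriptions = [
    "A chef travels the world to explore food and cooking.",
    "A documentary series about food, cooking and famous chefs.",
    "A drug cartel rises to power in Colombia.",
]

# Assumption: the built-in 'english' stop word list stands in for NLTK's list
toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_tfidf = toy_vectorizer.fit_transform(toy_descriptions)

# Compare the first description against all three (itself included)
toy_similarity = cosine_similarity(toy_tfidf[0], toy_tfidf)
print(toy_similarity.round(2))
```

The first value is the description compared with itself, so it is 1. The second description shares the words "food" and "cooking" with the first and gets a score above 0, while the third shares no non-stop words and scores 0.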

In [None]:
# Remove stop words and build TF-IDF vectors for 'description' in both netflix_titles and df_user
# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Recent scikit-learn versions expect stop_words as a list, not a set
stop_words = stopwords.words('english')
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words)
# Fit the vocabulary on the full dataset, then reuse it for the user's movies
tfidf_netflix_titles = tfidf_vectorizer.fit_transform(netflix_titles['description'].str.lower()).toarray()
tfidf_user_df = tfidf_vectorizer.transform(df_user['description'].str.lower()).toarray()
similarity = cosine_similarity(tfidf_user_df, tfidf_netflix_titles)

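Sort each row of `similarity` and collect the 5 most similar movies as our recommendations. We take the 6 largest values because the best match is normally the input movie itself.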

In [None]:
for similarity_x in similarity:
    # Positions in the Series line up with the default integer index of netflix_titles
    similarity_x = pd.Series(similarity_x)
    # Take 6 matches: the top one is expected to be the input movie itself
    top_recommendations = similarity_x.nlargest(6)
    netflix_id = top_recommendations.index.tolist()
    print('*Recommended for you from:')
    print('Title: ' + str(netflix_titles.loc[netflix_id[0], 'title']))
    print('Desc : ' + str(netflix_titles.loc[netflix_id[0], 'description']))

    print('\n*Top 5 recommendations:')
    titles_only = []
    # Skip position 0 (the input movie) and list the remaining matches
    for rec_id in netflix_id[1:]:
        titles_only.append(str(netflix_titles.loc[rec_id, 'title']))
        print('Title: ' + str(netflix_titles.loc[rec_id, 'title']))
        print('Desc : ' + str(netflix_titles.loc[rec_id, 'description']))
        print('\n')

# Improvement Ideas

As the results show, some recommended movies are still not very similar to the input. For example, Richard Pryor: Live in Concert is recommended to users who watched Explained. I think adding sub-genres to this movie dataset would help. Taking Richard Pryor: Live in Concert as an example, we could add `Music` as a second-tier genre for this movie. We could then add a new score based on whether the recommended movie matches the input movie's first-tier and second-tier genres. Finally, we could combine cosine similarity (scaled from 0 to 1) and genre similarity (scaled from 0 to 1) into a weighted score. Simply put, `final_score = 0.5 * (cosine_similarity + genre_similarity)`. The larger the `final_score`, the more similar the movie is to the user's input movie.
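As a rough sketch of that combined score: the dataset has no genre tiers yet, so the example below substitutes a simple Jaccard overlap between the comma-separated genre strings as the genre similarity. That substitution, the helper names `genre_set`, `genre_similarity` and `final_score`, and the input values are all assumptions for illustration.

```python
def genre_set(listed_in):
    """Split a comma-separated genre string into a set of genre names."""
    return {g.strip() for g in listed_in.split(',') if g.strip()}

def genre_similarity(listed_in_a, listed_in_b):
    """Jaccard overlap between two genre strings, between 0 and 1."""
    a, b = genre_set(listed_in_a), genre_set(listed_in_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def final_score(cosine_sim, genre_sim):
    """Equal-weight average of description similarity and genre similarity."""
    return 0.5 * (cosine_sim + genre_sim)

# Made-up inputs: a cosine similarity of 0.30 and two titles with no shared genre
genre_sim = genre_similarity('Docuseries, Science & Nature TV', 'Stand-Up Comedy')
print(final_score(0.30, genre_sim))
```

With equal weights, a title that shares no genre with the input can score at most 0.5, so it would tend to rank below titles that overlap in genre. Different weights could be used if the descriptions should count for more.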

Here's the list of unique genres currently available. Can you see how the sub-genres could be improved?

In [None]:
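# Collect every distinct genre listed in the 'listed_in' column
genres_all = list(netflix_titles.listed_in.unique())
genres = []
for x in genres_all:
    # Split on commas and trim surrounding spaces, keeping spaces inside genre names
    x1 = [g.strip() for g in x.split(',')]
    genres.extend(x1)

genres = sorted(set(genres))
print(genres)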

Anyway, I sometimes use this recommender myself as a complement to Netflix's in-app recommendations. I haven't used any sophisticated techniques here, but I found that it works well enough. Happy binge-watching!