# Kaggle: Netflix Movies and TV shows

Author: Katherine Zhang 

Start date: April 6, 2020

![Netflix logo](https://www.mercurynews.com/wp-content/uploads/2020/01/netflixlogo.jpg?w=877)

## **Background:**

This dataset consists of TV shows and movies available on Netflix as of 2020. It was collected from Flixable, a third-party Netflix search engine, and can be downloaded from https://www.kaggle.com/shivamb/netflix-shows.

## **Motivation:**

- Understanding what content is available in different countries
- Determining whether Netflix has increasingly focused on TV shows rather than movies in recent years
- Identifying similar content by matching text-based features


In [None]:
import nltk
import re
import pandas as pd
import numpy as np
import seaborn as sns 
import collections as c
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## 1. Data preprocessing

In [None]:
df_netflix = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
df_netflix.head(3)

In [None]:
len(df_netflix.drop_duplicates())

In [None]:
len(df_netflix[df_netflix['type'] == 'Movie'])

There are 6234 movies and TV shows in this Netflix dataset. About 68% of them are movies and 32% are TV shows.
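The split between movies and TV shows can also be read off in a single step. Below is a minimal sketch on a made-up four-row frame (`toy` is a hypothetical stand-in for `df_netflix`, and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for df_netflix; only the 'type' column matters here.
toy = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

# normalize=True returns the fraction of rows per category instead of raw counts.
type_share = toy["type"].value_counts(normalize=True)
print(type_share)
```

Applied to the real data, the same call on `df_netflix['type']` should give the movie and TV show proportions directly.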

In [None]:
list(df_netflix[df_netflix['date_added'].isna()]['title'])

In [None]:
len(list(df_netflix[df_netflix['date_added'].isna()]['title']))

We notice that the column 'date_added' has 11 NA values and further research shows that these 11 shows are not available on Netflix anymore. For example, 'Friends' left Netflix in 2020. 

We will drop those shows which are no longer available on Netflix. 

In [None]:
# Copy the filtered frame so that later column assignments do not act on a view
df_netflix = df_netflix[df_netflix['date_added'].notna()].copy()

In [None]:
# Fill missing values and standardize the datetime column
df_netflix['rating'] = df_netflix['rating'].fillna("")
df_netflix['director'] = df_netflix['director'].fillna("")
# Strip stray whitespace before parsing so that every date string parses consistently
df_netflix['date_added'] = pd.to_datetime(df_netflix['date_added'].str.strip())
df_netflix['year_added'] = df_netflix['date_added'].dt.year

## 2. EDA

### **Objective 1: Understanding what content is available in different countries**

Currently Netflix is available in 113 countries and regions. Let's see which country has the most content. 

In [None]:
country_count = c.Counter(", ".join(df_netflix['country'].dropna()).split(", "))
top_ten_countries = country_count.most_common(10)
country = [val[0] for val in top_ten_countries][::-1]
show_count = [val[1] for val in top_ten_countries][::-1]
trace1 = go.Bar(y=country, x=show_count, orientation="h", name="", marker=dict(color='#3498DB'))
data = [trace1]
layout = go.Layout(title="Top 10 countries with most content on Netflix", height=400, width=700, legend=dict(x=0.1, y=1.1, orientation="h"))
fig = go.Figure(data, layout=layout)
fig.show()
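Because one title can list several countries in a single comma-separated string, the counting step has to split those strings first. The cell above does this with `Counter`; an alternative that stays in pandas is sketched below on made-up data (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one row lists two countries, one row has no country.
toy = pd.DataFrame({"country": ["United States, France", "Japan", None, "France"]})

country_counts = (
    toy["country"]
    .dropna()            # ignore titles without a country
    .str.split(", ")     # "United States, France" -> ["United States", "France"]
    .explode()           # one row per (title, country) pair
    .value_counts()      # titles per country, most frequent first
)
print(country_counts)
```

Either way, a title that lists several countries is counted once for each of them.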

Let's also look at the top genres on Netflix.

In [None]:
top_ten_genre = c.Counter((", ".join(df_netflix['listed_in'])).split(", ")).most_common(10)
genre = [val[0] for val in top_ten_genre][::-1]
count = [val[1] for val in top_ten_genre][::-1]
trace2 = go.Bar(y=genre, x=count, orientation="h", name="", marker=dict(color='#3498DB'))
data = [trace2]
layout = go.Layout(title="Top 10 genres on Netflix", height=400, width=700, legend=dict(x=0.1, y=1.1, orientation="h"))
fig = go.Figure(data, layout=layout)
fig.show()

We are also interested in the top genres by country, to see whether viewer preferences differ between countries.

In [None]:
def df_by_country(df, country):
  '''
  Returns the rows of df whose 'country' field mentions a specific country
  Input: A dataframe and a selected country name
  Output: A filtered dataframe (rows with a missing country are dropped)
  '''
  drop_country_na = df[df['country'].notna()]
  # regex=False treats the country name as a literal substring
  return drop_country_na[drop_country_na['country'].str.contains(country, regex=False)]

In [None]:
def top_genre_by_country(df, country):
  # Use the df argument rather than the global df_netflix
  genre_counter = c.Counter(", ".join(df_by_country(df, country)['listed_in']).split(", ")).most_common(10)
  genre = [val[0] for val in genre_counter][::-1]
  count = [val[1] for val in genre_counter][::-1]
  return genre, count

fig = make_subplots(rows=2, cols=2, horizontal_spacing=0.4,
      subplot_titles=("USA",'France', 'Japan', 'South Korea'))
country = ['United States', 'France', 'Japan', 'South Korea']
colors = ['#AF7AC5', '#76D7C4', '#EC7063', '#F4D03F']
position = [(1,1), (1,2), (2,1), (2,2)]

for i in range(len(country)):
  genre, count = top_genre_by_country(df_netflix, country[i])
  fig.add_trace(go.Bar(y=genre, x=count, 
                       orientation="h", name="", 
                       marker=dict(color=colors[i])), position[i][0],  position[i][1])

# Set the layout once, after all traces are added; each panel shows the top 10 genres
fig.update_layout(showlegend=False, height = 650, width = 900, title_text="Top 10 genres by country")
  
fig.show()

As the bar plots above show, the top 10 genres of Netflix shows vary from country to country, which may reflect differences in audience taste. For example, American and French viewers may favor `Drama` and `Comedies` over `Anime series`, in contrast to Japanese viewers, whereas viewers in South Korea favor `Korean TV shows` the most.

In addition, we want to look at the language of shows on Netflix by examining the genres they are listed in. We found two tags that refer specifically to the language of a show: `Spanish-Language TV Shows` and `Korean TV Shows`.

The pie chart below compares the titles carrying each of these two tags with all remaining titles. Most titles carry neither tag, and we treat that remainder as English. This is only a rough proxy: the dataset has no language column, and both tags apply to TV shows only.

In [None]:
netflix_spanish = sum(df_netflix['listed_in'].str.contains('Spanish'))
netflix_korean = sum(df_netflix['listed_in'].str.contains('Korean'))
# Everything without either tag is counted as English (an approximation)
netflix_english = len(df_netflix) - netflix_spanish - netflix_korean

labels = ['Spanish','Korean','English']
# One value per label, computed from the data
values = [netflix_spanish, netflix_korean, netflix_english]
colors = ['#F5B041', '#73C6B6', '#5DADE2']

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_traces(marker=dict(colors=colors), hoverinfo = 'skip')
fig.update_layout(title_text ='Percentage of Netflix content in English, Korean and Spanish')
fig.show()

### **Objective 2: Determine whether Netflix has increasingly focused on TV shows rather than movies in recent years.**

In [None]:
tv_add_count = df_netflix[df_netflix['type'] == 'TV Show'].groupby('year_added').size().reset_index(name = 'added_count').iloc[0:9,:]
movie_add_count = df_netflix[df_netflix['type'] == 'Movie'].groupby('year_added').size().reset_index(name = 'added_count').iloc[0:12,:]
trace1 = go.Scatter(x=tv_add_count['year_added'], y=tv_add_count['added_count'], name="TV Shows", marker=dict(color="#3498DB"))
trace2 = go.Scatter(x=movie_add_count["year_added"], y=movie_add_count['added_count'], name="Movies", marker=dict(color="#EC7063"))
data = [trace1, trace2]
layout = go.Layout(title="TV shows vs. Movies added over the years", width = 600, legend=dict(x=0.1, y=1.1, orientation="h"))
fig = go.Figure(data, layout=layout)
fig.show()

We observe that more movies than TV shows were added to Netflix in every year shown. In 2019, Netflix added 1546 movies and 803 TV shows. So there is no strong evidence that Netflix has shifted its focus from movies to TV shows.
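Raw counts answer whether movies still outnumber TV shows, but a shift in focus would show up more clearly in the share of TV shows among each year's additions. A minimal sketch of that check, using invented numbers (`toy` is hypothetical and does not reflect the real dataset):

```python
import pandas as pd

# Hypothetical toy data: 3 titles added in 2018, 4 titles added in 2019.
toy = pd.DataFrame({
    "year_added": [2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie", "TV Show", "TV Show"],
})

# normalize="index" makes each row (year) sum to 1, giving shares instead of counts.
share_by_year = pd.crosstab(toy["year_added"], toy["type"], normalize="index")
tv_share = share_by_year["TV Show"]
print(tv_share)
```

A TV share that rises steadily over the years would support the hypothesis even while movies remain the majority, so it is worth looking at alongside the raw counts.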

## 3. Content-based recommendation of Netflix shows

One of Netflix's core businesses is offering personalized show recommendations to its audience. This article [here](https://help.netflix.com/en/node/100639) explains at a very high level how Netflix's recommendation system works, and lists a number of factors that Netflix uses in it.

- **Viewer-level factors**: viewing history, ratings the viewer has given to other shows, the time of day they use Netflix, the duration of each active session, the devices they watch on, and other Netflix viewers with similar tastes and preferences.

- **Text-based features of the shows**: content, genre, categories, actors, release year. 
 
The dataset does not include any viewer-level factors, so we will focus on the text-based features to build our recommendation system. **The goal** here is to recommend 5 shows based on a list of shows that the viewer has already watched, using the following features:

- `director`
- `description`
- `listed_in`
- `rating`


### **Data preparation** 



In this section, we first create a variable `aggregated_text`, which is the concatenation of four variables: `description` + `listed_in` + `rating` + `director`. We lowercase the text, then remove punctuation and English stopwords.

In [None]:
# Reset the index so that row labels match row positions in the TF-IDF matrix built later
df_netflix = df_netflix[df_netflix['title'].notna()].reset_index(drop=True)
# Join the four text fields, separated by spaces (including between rating and director)
df_netflix['aggregated_text'] = df_netflix['description'].str.lower() + " " + df_netflix['listed_in'].str.lower() + " " + df_netflix['rating'].str.lower() + " " + df_netflix['director'].str.lower()
corpus_tokenized = list(df_netflix['aggregated_text'].str.split(" "))
stopwords_list = set(stopwords.words("english"))
clean_corpus = []

for sentence in corpus_tokenized:
  s = []
  for word in sentence:
    # Strip punctuation, then skip empty tokens and stopwords
    clean_word = re.sub(r'[^\w\s]','', word)
    if clean_word and clean_word not in stopwords_list:
      s.append(clean_word)
  clean_corpus.append(" ".join(s))

### **TF-IDF vectorizer**

We use a TF-IDF (*Term Frequency-Inverse Document Frequency*) vectorizer to turn the raw text into a matrix of TF-IDF features. I chose the TF-IDF vectorizer over CountVectorizer because raw word counts do not take into account how frequent a word is across documents. For instance, a word like "man" might appear in the descriptions of many shows, so its large count carries little information in the encoded vectors.

- Term Frequency: This summarizes how often a given word appears within a document.

- Inverse Document Frequency: This downscales words that appear a lot across documents.
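To make the effect of the inverse document frequency concrete, here is a minimal sketch on a made-up three-document corpus (the `docs` list is invented for illustration and is not taken from the dataset). The word "man" appears in every document, while "alien" appears in only one:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus: "man" occurs in all three documents, "alien" in one.
docs = [
    "man fights alien",
    "man loves woman",
    "man robs bank",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Column positions of the two words in the TF-IDF matrix
man = tfidf_vec.vocabulary_["man"]
alien = tfidf_vec.vocabulary_["alien"]

# Raw counts treat both words the same in the first document (one occurrence each),
# while TF-IDF gives the rarer word "alien" a larger weight than the common "man".
print(counts[0, count_vec.vocabulary_["man"]], counts[0, count_vec.vocabulary_["alien"]])
print(tfidf[0, man], tfidf[0, alien])
```

This is the behaviour we want for show descriptions: words shared by many titles contribute less to the similarity than words that are specific to a few.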

We use cosine similarity to measure how similar two shows are in terms of their text-based features: description, rating, genres and director. The larger the cosine similarity, the closer the vector representations of the two shows are.
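The recommendation function in this notebook calls `linear_kernel` (a plain dot product) rather than `cosine_similarity`. The two agree here because `TfidfVectorizer` L2-normalizes each row by default. A small sketch on invented descriptions (the `docs` list is hypothetical) illustrates both the equivalence and the ranking behaviour:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical descriptions: the first two share most words, the third shares none.
docs = [
    "teen drama vampire romance",
    "teen drama werewolf romance",
    "cooking competition reality show",
]

matrix = TfidfVectorizer().fit_transform(docs)  # rows have unit L2 norm by default

# Similarity of the first document to every document in the corpus
cos = cosine_similarity(matrix[0], matrix).flatten()
dot = linear_kernel(matrix[0], matrix).flatten()

print(cos)
print(np.allclose(cos, dot))
```

The first entry is the document compared with itself, which is why the recommendation function skips the top-ranked result.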

In [None]:
tfidf_vectorizer = TfidfVectorizer().fit_transform(clean_corpus)

In [None]:
def get_recommendation(show_list, vectorizer):
  '''
  Returns: 
        A df with the top 5 most similar shows, their genres and similarity scores
  Input: 
        show_list: A list of shows that the user has already watched
        vectorizer: The TF-IDF matrix of the corpus (the output of fit_transform)
  '''
  title, scores, genre = [], [], []
  for show_name in show_list:
    # Positional row of the show, so it lines up with the rows of the TF-IDF matrix
    show_index = np.flatnonzero((df_netflix['title'] == show_name).to_numpy())[0]
    # TF-IDF rows are L2-normalized by default, so the dot product equals cosine similarity
    cosine_similarities = linear_kernel(vectorizer[show_index], vectorizer).flatten()
    # Take the 6 highest scores and skip the first one, which is the show itself
    similar_show_index = cosine_similarities.argsort()[:-7:-1][1:]
    title += list(df_netflix['title'].iloc[similar_show_index])
    genre += list(df_netflix['listed_in'].iloc[similar_show_index])
    scores += list(cosine_similarities[similar_show_index])

  df = pd.DataFrame(data = {'Title': title, 
                            'Genre': genre,
                            'Cosine_similarity': scores})

  # Drop shows already watched, then keep the best-scoring row for each title
  df = df[~df['Title'].isin(show_list)].sort_values('Cosine_similarity', ascending = False)
  df = df.drop_duplicates(subset='Title')
  top_five_list = df.iloc[0:5, :]
    
  return top_five_list

Let's take a look at an example. Given that I have watched three TV shows on Netflix (Stranger Things, The Vampire Diaries and Sense8), which 5 shows would our recommender suggest to me?

### **TF-IDF Vectorizer with unigram**

In [None]:
watched_shows = ['Stranger Things', 'The Vampire Diaries', 'Sense8']
get_recommendation(watched_shows, tfidf_vectorizer)

Next, we want to look at how the vectorizer performs when we use **unigrams, bigrams and trigrams** instead of only unigrams. An n-gram is a sequence of n words: a 2-gram (or bigram) is a two-word sequence such as “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence such as “please turn your” or “turn your homework”.
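The n-grams that scikit-learn extracts for a given `ngram_range` can be inspected directly. The sketch below runs the analyzer of a `CountVectorizer` on the example phrase (the variable names are illustrative, not part of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that turns a raw string into n-gram tokens.
analyze = CountVectorizer(ngram_range=(1, 3)).build_analyzer()
ngrams = analyze("please turn your homework")

for gram in ngrams:
    print(gram)
```

With `ngram_range=(1, 3)` every unigram, bigram and trigram becomes its own feature, so the vocabulary grows much larger than in the unigram-only setting.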

As we increase the context size, we observe that the cosine similarity decreases, but the `Genre` values become more similar, or nearly identical, to the genres of the shows in the given list.
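One plausible reason for the lower scores is that bigrams and trigrams add many features that two texts rarely share, which enlarges the vectors without adding much overlap. The sketch below illustrates this on two invented descriptions (the `docs` list and the helper `similarity` are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions that differ in a single word
docs = [
    "teen drama vampire romance",
    "teen drama werewolf romance",
]

def similarity(ngram_range):
    """Cosine similarity between the two documents for a given n-gram range."""
    matrix = TfidfVectorizer(ngram_range=ngram_range).fit_transform(docs)
    return cosine_similarity(matrix[0], matrix[1])[0, 0]

sim_unigram = similarity((1, 1))
sim_ngram = similarity((1, 3))
print(sim_unigram, sim_ngram)
```

The single differing word breaks every bigram and trigram that contains it, so the n-gram score comes out lower even though the texts are the same. This means scores from the two settings are not directly comparable; only the ranking within one setting is.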

### **TF-IDF Vectorizer with unigram, bigram and trigram** 

In [None]:
tfidf_vectorizer = TfidfVectorizer(ngram_range = (1, 3)).fit_transform(clean_corpus)
get_recommendation(watched_shows, tfidf_vectorizer)

Personally, I found the shows recommended by the second approach more interesting. In real-world applications, companies usually use A/B testing to determine which recommendation system is better at suggesting what viewers like and at improving user engagement on the platform.