# Netflix Movies and Shows
Hello everybody! Welcome to my notebook, where today we will be analysing Netflix's various movies and TV shows.

**Please upvote if you find this helpful!**

<img src="https://sayingimages.com/wp-content/uploads/to-theperson-netflix-memes.png" width="400px"/>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import geopandas as gpd # world map with country names and ISO codes
import plotly.express as px # interactive charts
from collections import Counter
from sklearn.metrics import pairwise # cosine similarity for the recommender
from sklearn.feature_extraction.text import CountVectorizer

# List every file available under the read-only /kaggle/input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
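df = pd.read_csv('../input/netflix-shows/netflix_titles.csv').drop('show_id', axis=1)
# Mark missing countries and dates as 'NaN' first; filling everything with ''
# beforehand would leave nothing for these two calls to replace
df['country'] = df['country'].fillna('NaN')
df['date_added'] = df['date_added'].fillna('NaN')
df = df.fillna('')
df.head()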

### Description of features
* type         - type of entertainment: "Movie" or "TV Show"
* title        - name of movie or show
* director     - name of director
* cast         - main actors in the title
* country      - country where the entertainment is from
* date_added   - date the title was added to Netflix
* release_year - year of release
* rating       - age rating for title
* duration     - length of movie or show: minutes or seasons
* listed_in    - genres the title is listed under (comma-separated)
* description  - general explanation of story
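
Two kinds of column need a little parsing before they can be analysed: `duration` mixes minutes and seasons, while `country`, `cast` and `listed_in` can hold several comma-separated values. Here is a minimal sketch of both steps, using made-up example strings rather than rows from the dataset. The helper names `parse_duration` and `split_values` are hypothetical and only used for illustration.

```python
def parse_duration(text):
    """Split a duration such as '90 min' or '2 Seasons' into (number, unit)."""
    number, unit = text.split(' ', 1)
    return int(number), unit


def split_values(text):
    """Split a comma-separated field into a list, ignoring empty entries."""
    return [part.strip() for part in text.split(',') if part.strip()]


print(parse_duration('90 min'))
print(parse_duration('2 Seasons'))
print(split_values('United States, India'))
print(split_values(''))
```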

In [None]:
def pie(data, title='', min_lim=0, max_lim=0):
    """Pie chart of the most frequent values in `data`, from rank min_lim up to max_lim."""
    if max_lim==0:
        max_lim=len(data)
    count = pd.Series(Counter(data)).sort_values(ascending=False)[min_lim:max_lim]
    data = pd.DataFrame({title:count.keys(), 'num':count})
    fig = px.pie(data, title, 'num')
    fig.update_layout(legend_title=dict(text=title))
    fig.show()

def bar(data, col_name='', title='', ascending=False):
    """Bar chart of counts; `title` labels the x-axis and `col_name` the y-axis.

    `data` is either an iterable of values to count or a dict of ready-made counts.
    Despite its name, `ascending=True` sorts the bars from largest to smallest.
    """
    count = pd.Series(Counter(data))
    if ascending:
        count = count.sort_values(ascending=False)
    data = pd.DataFrame({title:count.keys(), col_name:count})
    fig = px.bar(data, title, col_name, color=col_name)
    fig.show()

def uniques(col):
    """Flatten a comma-separated column of `df` into one list (repeats are kept)."""
    return [k for i in df[col] for k in i.split(', ')]

# Type of entertainment

Most of Netflix's catalogue is movies: there are more than twice as many movies as TV shows.

In [None]:
pie(df['type'], title='Type of entertainment')

# Directors

After skipping the most frequent entry (titles with no director listed), the most active directors are the duo Raul Campos and Jan Suter, followed by Marcus Raboy and Jay Karas.

In [None]:
pie(df['director'], 'directors', min_lim=1, max_lim=11)

# Countries

Among the ten most frequent country entries shown in the chart, the US accounts for almost half (45.8%) of the movies and TV shows, followed by India with 13% and the UK with 10%.

In [None]:
country_list = uniques('country')
pie(country_list, title='country', max_lim=10)
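
A pie chart limited to the top entries, as with `max_lim=10` here, reports each slice as a share of the slices shown rather than of all titles. The sketch below uses invented counts (not the real dataset) to show how far the two figures can differ.

```python
from collections import Counter

# Invented example: one list entry per (title, country) pair
countries = ['A'] * 5 + ['B'] * 3 + ['C'] * 2 + ['D'] * 6 + ['E'] * 4

counts = Counter(countries)
top = counts.most_common(2)          # what a top-2 pie chart would show

total_all = sum(counts.values())
total_top = sum(n for _, n in top)

for name, n in top:
    share_of_all = 100 * n / total_all
    share_of_top = 100 * n / total_top
    print(f'{name}: {share_of_all:.1f}% of all entries, {share_of_top:.1f}% of the chart')
```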

# Age Rating

Among the nine most common ratings shown in the chart, TV-MA accounts for more than a third of titles (37.5%), followed by TV-14 with about a quarter and TV-PG with 10.5%.

In [None]:
pie(df['rating'], title='rating', max_lim=9)

# Number of movies/shows per month

The months in which the most titles were added are October, November and December, while February, May and June saw the fewest additions.

In [None]:
# The first word of each date is the month; strip() guards against stray leading spaces
months = [i.strip().split(' ')[0] for i in df['date_added'] if i not in ('', 'NaN')]
count = Counter(months)
# Build the dict in calendar order so the bars appear from January to December
month_names = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
months = {name: count[name] for name in month_names}
bar(months, 'Movies/Shows', 'Months')
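
Taking the first word of the date works as long as every value looks like `Month day, year`. A more defensive sketch, assuming that format and using made-up dates, parses each string with `datetime.strptime` and skips anything that does not match. The helper name `month_name` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def month_name(date_text):
    """Return the month name of a 'Month day, year' string, or None if it does not parse."""
    try:
        return datetime.strptime(date_text.strip(), '%B %d, %Y').strftime('%B')
    except ValueError:
        return None


# Made-up dates, including a leading space and an empty value
dates = ['September 9, 2019', ' January 1, 2020', '', 'January 15, 2018']
month_counts = Counter(m for m in map(month_name, dates) if m is not None)
print(month_counts)
```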

# Amount of entertainment released over the years

The number of movies and TV shows by release year grows sharply after 2010 and peaks in 2018.

In [None]:
bar(df['release_year'][df['release_year']>1970], 'Amount released', 'Years')

# Most common genres

International Movies, Dramas and Comedies are the most common genres on Netflix.

In [None]:
genre_list = uniques('listed_in')
bar(genre_list, 'Movies/Shows', 'Genre', ascending=True)

# Average duration of movies over the years

The average movie length declined gradually through the 2000s. The highest yearly average falls in 1964, although this could be due to incomplete data for that year.

In [None]:
# Keep only movies, whose duration is given in minutes (e.g. "90 min")
is_movie = df['duration'].str.endswith(' min', na=False)
movies = df.loc[is_movie, ['release_year', 'duration']].copy()
movies['minutes'] = movies['duration'].str.split(' ').str[0].astype(int)

# Average length per release year, rounded to the nearest minute;
# years with no movies are left out instead of being plotted as 0
avg_minutes = movies.groupby('release_year')['minutes'].mean().round().astype(int)

bar(avg_minutes.to_dict(), 'Average duration (mins)', 'Year')

# Common genres of entertainment over the decades

Here we can see which genres were the most common in each decade from the 1960s to the 2020s.

In [None]:
for start in range(1960, 2030, 10):
    # Titles released in this decade, e.g. 1960 to 1969
    decade = df[(df['release_year'] >= start) & (df['release_year'] < start + 10)]
    genres = [g for entry in decade['listed_in'] for g in entry.split(', ')]
    # Count each genre and order the bars from most to least common
    data = pd.Series(Counter(genres)).sort_values(ascending=False)
    bar(data.to_dict(), 'Amount of Entertainment in '+str(start)+'s', 'Genre')
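
Grouping by decade can also be done arithmetically with integer division, which buckets every row in a single pass. This is a minimal sketch on invented rows, not on the dataset itself.

```python
from collections import Counter, defaultdict

# Invented (release_year, listed_in) pairs standing in for dataset rows
rows = [
    (1994, 'Dramas, Comedies'),
    (1999, 'Dramas'),
    (2015, 'Comedies, International Movies'),
    (2018, 'Dramas, International Movies'),
]

genres_by_decade = defaultdict(Counter)
for year, listed_in in rows:
    decade = (year // 10) * 10          # 1994 -> 1990, 2018 -> 2010
    genres_by_decade[decade].update(listed_in.split(', '))

for decade in sorted(genres_by_decade):
    print(f'{decade}s:', genres_by_decade[decade].most_common())
```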

# Number of movies/shows around the world

The US has released the most titles (2682), followed by India with 875 and the UK with 561.

In [None]:
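# One entry per (title, country) pair, since a title can list several countries
countries = [c for entry in df['country'] for c in entry.split(', ')]

title = 'Number of movies/shows'
# gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0,
# so this call needs an older GeoPandas release
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.at[4, 'name'] = 'United States'  # match the country name used in the dataset
world.at[174, 'iso_a3'] = 'KOS'
name_to_iso = world.set_index('name')['iso_a3']

# Count every listed country; names missing from the map are grouped under 'NaN'
data = Counter(name_to_iso.reindex(countries).fillna('NaN'))
data = pd.DataFrame({'country':data.keys(), title:data.values()})

# One ISO code per title (its first listed country), aligned row by row with df
# so that the per-country averages in the later cells group the right titles
first_country = df['country'].str.split(', ').str[0]
df['iso_a3'] = name_to_iso.reindex(first_country).fillna('NaN').to_numpy()

fig = px.choropleth(data, locations='country', color=title, title=title)
fig.show()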

# Average number of actors per country

Globally, the average number of actors per movie or TV show is around 7. Azerbaijan has the highest average at 25, followed by Botswana with 14 and Latvia with 13.

In [None]:
# Number of cast members per title; titles with no cast listed count as 0
df['cast_num'] = [len(i.split(', ')) if i not in ('', 'NaN') else 0 for i in df['cast']]
title = 'Average number of actors'
# Mean cast size per country code
cast_count = df.groupby('iso_a3')['cast_num'].mean()
cast_count = pd.DataFrame({'country':cast_count.index, title:cast_count.values})
fig = px.choropleth(cast_count, locations='country', color=title, title=title)
fig.show()

# Countries using genres

The next visualisations loop over the five most common genres and show what percentage of each country's titles is listed under each of them.

In [None]:
top_genres = pd.Series(Counter(genre_list)).sort_values(ascending=False)[:5].keys()
for genre in top_genres:
    title = 'Percent of '+genre
    # 1 if the title is listed under this genre, otherwise 0
    df['genre_count'] = [i.split(', ').count(genre) for i in df['listed_in']]
    # Mean of the 0/1 flag per country, scaled to a percentage
    genre_pct = df.groupby('iso_a3')['genre_count'].mean() * 100
    genre_count = pd.DataFrame({'country':genre_pct.index, title:genre_pct.values})
    fig = px.choropleth(genre_count, locations='country', color=title, title=title)
    fig.show()
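
The percentage for each country is the mean of a 0/1 flag per title, scaled by 100. Here is a small sketch on invented rows; the column names mirror the notebook's, but the data does not come from the dataset, and `has_genre` is a hypothetical column name.

```python
import pandas as pd

toy = pd.DataFrame({
    'iso_a3': ['USA', 'USA', 'USA', 'USA', 'IND', 'IND'],
    'listed_in': ['Dramas', 'Comedies', 'Dramas, Comedies', 'Documentaries',
                  'Dramas, International Movies', 'Dramas'],
})

genre = 'Dramas'
# True when the genre appears in the title's comma-separated genre list
toy['has_genre'] = toy['listed_in'].apply(lambda s: genre in s.split(', '))

# The mean of a boolean column is the fraction of True values
percent = toy.groupby('iso_a3')['has_genre'].mean() * 100
print(percent)
```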

# Countries with age ratings

The final visualisations go over the five most common age ratings and show what percentage of each country's titles carries each of them.

In [None]:
top_ratings = pd.Series(Counter(df['rating'])).sort_values(ascending=False)[:5].keys()
for rating in top_ratings:
    title = 'Percent of '+rating
    # Exact match: str.count would also match substrings, e.g. 'R' inside 'NR'
    df['rating_count'] = (df['rating'] == rating).astype(int)
    # Mean of the 0/1 flag per country, scaled to a percentage
    rating_pct = df.groupby('iso_a3')['rating_count'].mean() * 100
    rating_count = pd.DataFrame({'country':rating_pct.index, title:rating_pct.values})
    fig = px.choropleth(rating_count, locations='country', color=title, title=title)
    fig.show()
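
A note on flagging titles by rating: Python's `str.count` matches substrings, so an exact comparison is the safer test. The short illustration below uses a hand-written list of rating labels, chosen only as an example.

```python
ratings = ['R', 'NR', 'TV-Y', 'TV-Y7', 'PG', 'PG-13']

target = 'R'
# Substring matching also catches ratings that merely contain the target
substring_hits = [r for r in ratings if r.count(target) > 0]
# Exact comparison keeps only the rating itself
exact_hits = [r for r in ratings if r == target]

print(substring_hits)
print(exact_hits)
```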

# Netflix Recommender

Finally, we will build a simple content-based system that recommends Netflix shows and movies similar to a title the user names.

Full credit for this recommender system goes to the amazing **Netflix Visualizations, Recommendation, EDA** notebook by **Niharika Pandit** at https://www.kaggle.com/niharika41298/netflix-visualizations-recommendation-eda. Definitely worth a read.
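
The idea behind this kind of recommender is to turn each title's text into a vector of word counts and to compare titles by the cosine of the angle between their vectors. Below is a minimal pure-Python sketch of that computation on made-up texts. scikit-learn's `CountVectorizer` tokenises and filters stop words differently, so this illustrates the idea rather than reproducing the notebook's scores.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up "soups" standing in for the combined text of three titles
soups = {
    'Show A': 'karate rivalry drama comedy',
    'Show B': 'karate tournament drama',
    'Show C': 'cooking competition reality',
}

query = 'Show A'
scores = {name: cosine_similarity(soups[query], text)
          for name, text in soups.items() if name != query}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Titles that share no words score 0, and identical texts score 1, so sorting by this score ranks the closest matches first.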

In [None]:
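def clean_data(x):
    # Lowercase and remove spaces so that a full name becomes a single token
    return str.lower(x.replace(' ', ''))
def create_soup(x):
    # Join all text features of a title into one string for vectorising
    return x['title'] + ' ' + x['director'] + ' ' + x['cast'] + ' ' + x['listed_in'] + ' ' + x['description']

new_df = df.fillna('')
features = ['title', 'director', 'cast', 'listed_in', 'description']
new_df = new_df[features].copy()

# Keep the words of the description separate: removing its spaces would fuse
# them into long tokens that rarely match another title
for feature in ['title', 'director', 'cast', 'listed_in']:
    new_df[feature] = new_df[feature].apply(clean_data)
new_df['description'] = new_df['description'].str.lower()
new_df['soup'] = new_df.apply(create_soup, axis=1)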

In [None]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(new_df['soup'])

cos_sim = pairwise.cosine_similarity(count_matrix, count_matrix)
new_df = new_df.reset_index()
indices = pd.Series(new_df.index, index=new_df['title'])

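def recommend(title, cos_sim=cos_sim):
    """Return the 10 titles most similar to `title` as (title, score) pairs."""
    title = title.replace(' ', '').lower()
    idx = indices[title]
    # A duplicated title makes the lookup return a Series; use the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    sim_scores = list(enumerate(cos_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x:x[1], reverse=True)
    # Drop the queried title itself, then keep the 10 best matches
    sim_scores = [s for s in sim_scores if s[0] != idx][:10]
    movie_indices = [i[0] for i in sim_scores]
    return list(zip(df['title'].iloc[movie_indices], [i[1] for i in sim_scores]))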

In [None]:
recommend('Cobra Kai')

In [None]:
recommend('Friends')

In [None]:
recommend('Trevor Noah: Afraid of the Dark')

<img src="https://www.barnorama.com/wp-content/uploads/2019/08/Just-one-more-Netflix-meme-2.jpg" width="400px"/>

## Thank you for reading this notebook.
## If you enjoyed this notebook and found it helpful, please give it an upvote and provide feedback, as it would help me make more of these.