<h1 style='text-align: center'> An Exploratory Data Analysis on </h1>
<img width="512" alt="Netflix 2015 logo" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Netflix_2015_logo.svg/512px-Netflix_2015_logo.svg.png">

In this notebook, we are going to take a look at Netflix Movies and TV Shows. The dataset is available [here](https://www.kaggle.com/shivamb/netflix-shows).

# Getting Started

Let's first import the required packages and take a quick look at the data.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import itertools
from collections import Counter
import json
import numpy as np

colors = ['rgb(103,0,31)','rgb(178,24,43)','#d31c23','rgb(255,82,82)','rgb(255,123,123)']

with open('../input/country-outlines/countries.geo.json', 'r') as f:
    countries = json.load(f)

In [None]:
df = pd.read_csv('../input/netflix-shows/netflix_titles.csv', index_col='show_id', parse_dates=['date_added', 'release_year'])
df.head()

# Quick Stats

In [None]:
#DIRECTOR DATA

count_directors = df.director.count()
unique_directors = df.director.nunique()
most_occuring_directors = df.director.value_counts().head(3)

#CAST/ACTOR DATA

#convert cast into list of strings to separate list of casts into individuals
#'João Miguel, Bianca Comparato, Michel Gomes' -> ['João Miguel', 'Bianca Comparato', 'Michel Gomes']
list_of_lists = [x.split(', ') for x in df.cast.dropna().tolist()]
flat_list = [actor for cast in list_of_lists for actor in cast]

count_actors = len(flat_list)
unique_actors = len(set(flat_list))
most_occuring_actors = pd.DataFrame({'actor':flat_list}).value_counts().head(3)

__There are...__
* 7787 Movies/TV Shows on Netflix
* 4049 different directors
* 32881 unique actors

__The top 3 directors are...__
    1. Raúl Campos, Jan Suter    18 different Movies/TV Shows
    2. Marcus Raboy              16 different Movies/TV Shows
    3. Jay Karas                 14 different Movies/TV Shows
    
__The top 3 actors are...__
    1. Anupam Kher         42 occurrences
    2. Shah Rukh Khan      35 occurrences
    3. Naseeruddin Shah    30 occurrences
    


_**Note:** Not all data for directors and cast/actors were available._
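
The split-and-flatten step for the cast column can also be written with pandas' `str.split` and `explode`. The following is only a minimal sketch on a made-up toy frame (the actor names are placeholders, not taken from the dataset), so the numbers it prints say nothing about Netflix.

```python
import pandas as pd

# Toy stand-in for the `cast` column (placeholder names, not real data)
toy = pd.DataFrame({
    "cast": ["Actor A, Actor B", "Actor B, Actor C", None, "Actor B"],
})

# Split each string into a list, then give every actor a row of their own
actors = toy["cast"].dropna().str.split(", ").explode()

count_actors = len(actors)                   # total appearances
unique_actors = actors.nunique()             # distinct names
top_actors = actors.value_counts().head(1)   # most frequent actor

print(count_actors, unique_actors)
print(top_actors)
```

Rows without cast information are dropped before splitting, which mirrors the `dropna()` call in the cell above.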

# Movie vs TV Show - What can Netflix offer?

In [None]:
t1 = df.type.value_counts()

fig = go.Figure()
fig.add_pie(name='', 
            values=t1.values, 
            labels=t1.index, 
            #text=t1.index,
            hovertemplate='<b>%{label}</b><br>Percentage: %{percent}<br>Total: %{value}',
            )
fig.update_traces(marker=dict(line=dict(color='#000000', width=2),
                               colors=['rgb(178,24,43)','rgb(103,0,31)']
                             ))
fig.update_layout(title={
                        'text':'Movie vs TV Show: Pie Chart',
#                         'xanchor':'center',
#                         'yanchor':'top',
#                         'x':0.0,
#                         'y':0.9
                        },
                  font_size=18)
fig.show()

#BREAKING BAD AND THE MATRIX
#df.query('title == "Breaking Bad"')
#df.query('title == "The Matrix"')

* a bit more than two thirds of Netflix's catalog consists of Movies
* this does not take into account that TV Shows tend to run much longer than Movies; each title is counted as 1 <br> Example:<br>'Breaking Bad' has 5 seasons and takes a total of 2 days and 14 hours to watch<br>'The Matrix' takes only 136 minutes to finish 
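
To get a feeling for how much this counting hides, here is a small back-of-the-envelope calculation. It only uses the two runtimes quoted above and makes no claim about any other title.

```python
# Runtimes quoted in the text, converted to minutes
breaking_bad_minutes = (2 * 24 + 14) * 60   # 2 days and 14 hours
matrix_minutes = 136

ratio = breaking_bad_minutes / matrix_minutes

print(f"Breaking Bad: {breaking_bad_minutes} min, The Matrix: {matrix_minutes} min")
print(f"Watch time ratio: about {ratio:.0f} to 1")
```

So a single TV Show entry in the pie chart can stand for many times the watch time of a single Movie entry.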

In [None]:
t2 = df.groupby(df['date_added'].map(lambda x:x.year))['type'].agg('describe')
t2['movies'] = t2.freq
t2.drop(columns=['top', 'freq', 'unique'], inplace=True)
t2['tv_shows'] = t2['count'] - t2.movies

#TO SEE DATA FOR 2021, COMMENT OUT THIS LINE
t2 = t2.drop(2021)

fig = go.Figure()
fig.add_scatter(x=t2.index, 
                y=t2.tv_shows, 
                fill='tonexty',
                name='TV Shows',
                line_color='rgb(103,0,31)'
               )
fig.add_scatter(x=t2.index, 
                y=t2.movies, 
                fill='tonexty',
                name='Movies',
                line_color='rgb(178,24,43)'
               )
fig.add_scatter(x=t2.index, 
                y=t2['count'], 
                line_color='black', 
                line_dash='dash',
                opacity=.5,
                name='Total',
               )
fig.update_traces(mode='lines')
fig.update_layout(title_text='New Content added over time',
                  title_font_size=24,
                  xaxis_title='Year',
                  yaxis_title='New Content added',
                  hovermode="x unified")

fig.show()

* more new Movies are added each year than TV Shows (as we could have guessed from the pie chart)
* the general trend is that more new content is added every year 
* in 2020, the numbers drop, probably because the COVID-19 pandemic made it harder to produce new content

_**Note:** Data from 2021 is excluded since we are just at the beginning of the year. If you want to see data from 2021, follow the instructions in the code snippet._
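
One caveat about the cell above: it reads `freq` from `describe` as the number of Movies, which only holds if Movie is the most frequent type in every single year. A more explicit way to count both types per year is `pd.crosstab`. The sketch below runs on a made-up toy frame (dates and types are invented for illustration), in which TV Shows are deliberately the majority in one year.

```python
import pandas as pd

# Toy data: invented dates and types, including one missing date
toy = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["2019-01-05", "2019-03-01", "2019-07-12", "2020-02-02", "2020-05-20", None]
    ),
    "type": ["Movie", "TV Show", "TV Show", "Movie", "Movie", "Movie"],
})

# Ignore titles without a date, then count each type per year
dated = toy.dropna(subset=["date_added"])
per_year = pd.crosstab(dated["date_added"].dt.year, dated["type"])
per_year["Total"] = per_year.sum(axis=1)

print(per_year)
```

The resulting columns `Movie`, `TV Show` and `Total` could be passed to `add_scatter` in the same way as `t2.movies`, `t2.tv_shows` and `t2['count']`.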

# Rating - What is Netflix's Target Group?

... or more like Age Restrictions.<br>
If you are like me and have no idea what 'TV-PG' or any of those ratings listed in the data is supposed to mean, here you go:<br><br>
**TV-MA** – This program is intended to be viewed by adults and therefore may be unsuitable for children under 17.<br>
**TV-14** – This program contains some material that many parents would find unsuitable for children under 14 years of age.<br>
**TV-PG** – Parental guidance is recommended; these programs may be unsuitable for younger children.<br>
**R** – Quite similar to TV-MA: viewers under 17 require an accompanying parent or adult guardian.<br>
**PG-13** – Parental Guidance: some material may be inappropriate for children under 13.<br>

That should be all we need for now.

In [None]:
t3 = df.rating.value_counts()[:5][::-1]

fig = go.Figure()
fig.add_bar(x=t3.values,
            y=t3.index,
            orientation='h',
            marker_color=colors[:5][::-1],
            hovertemplate='Rating: %{y}<br>Total: %{x}',
            name='')
fig.update_layout(title_text='Rating of Netflix Content - Top 5',
                  title_font_size=24,
                  xaxis_title='Total',
                  yaxis_title=''
                 )
fig.show()

* Most content has some kind of age restriction
* Most content is meant for adults (or persons older than 17)
* A lot of the remaining content is not suited for children under 14 years 
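
To put a number on statements like these, the ratings can be mapped to a minimum age and summed up. The sketch below is only an illustration: the `MIN_AGE` mapping is my own reading of the rating descriptions above (R is treated like TV-MA, TV-PG gets no fixed age), and the `ratings` list is a made-up sample, not the real distribution.

```python
from collections import Counter

# Assumed minimum ages, derived from the rating descriptions in this notebook
MIN_AGE = {"TV-MA": 17, "R": 17, "TV-14": 14, "PG-13": 13, "TV-PG": 0}

# Made-up sample of the `rating` column
ratings = ["TV-MA", "TV-MA", "TV-14", "TV-PG", "R", "PG-13", "TV-MA", "TV-14"]

counts = Counter(ratings)
total = sum(counts.values())

# Share of titles that are not suited for children under 14
share_14_plus = sum(n for r, n in counts.items() if MIN_AGE[r] >= 14) / total

print(f"{share_14_plus:.0%} of the toy sample is rated 14+")
```

On the real data, `ratings` would be replaced by `df.rating.dropna()`, and ratings missing from `MIN_AGE` would need an entry first.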

# Release Year - How old is the Content?

One thing that I need to point out before looking at the data: <br>
Release Year means the year in which the title was first released (available for sale), **not** the year in which it was added to Netflix.

In [None]:
t4 = df.groupby('release_year')['type'].count()
t4.index = t4.index.year

oldest_netflix_content = df.query('release_year == "1925"')

fig = go.Figure()
fig.add_bar(x=t4.index,
            y=t4.values,
            marker_color='#d31c23',
            hovertemplate='Release Year: %{x}<br>Total: %{y}',
            name='')
fig.update_layout(title_text='Release Year of Netflix content',
                  title_font_size=24,
                  xaxis_title='Year',
                  yaxis_title='Total')
fig.show()

* Most of Netflix's content has an official release date between 2016 and 2020, with a peak in 2018
* The oldest piece of entertainment Netflix can provide is from 1925 and is called 'Pioneers: First Women Filmmakers'
* The distribution is left-skewed: most titles sit in recent years, with a long tail towards older ones. An explanation is that content can nowadays be produced, distributed and seen far more easily

_**Note:** I didn't want to use a log scale for this bar plot since I wanted to point out the enormous growth in production. You can still zoom in on earlier years, e.g. around 1980, to see the differences there._

# Categories - What is most popular?

We are talking about the `listed_in` column here, but I am going to refer to its values as Categories.
To avoid confusion: a Movie can be listed in several categories, and 'international' means that the movie was produced in, e.g., Spain and therefore reflects Spanish culture.

In [None]:
list_cats = df.listed_in.str.split(', ').tolist()
flatten = itertools.chain.from_iterable(list_cats)
cats_counter = dict(Counter(flatten))
cats_counter = {k:v for k,v in sorted(cats_counter.items(), key=lambda e:e[1], reverse=True)}
cat_list = list(cats_counter.keys())[:10][::-1]
num_list = list(cats_counter.values())[:10][::-1]


fig = go.Figure()
fig.add_bar(x=num_list,
            y=cat_list,
            orientation='h',
            marker_color='#d31c23',
            hovertemplate='Category: %{y}<br>Total: %{x}',
            name='')
fig.update_layout(title_text='Top 10 Categories',
                  title_font_size=24,
                  xaxis_title_text='Total')
fig.show()

* the 'International Movies' category contains more Netflix content than any other
* Netflix offers a huge variety of categories
* Within each of the top 10 categories, one can choose from over 500 different Movies/TV Shows
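
As a side note, the sort-and-slice step in the cell above can be shortened with `Counter.most_common`. This is a minimal sketch on a made-up list of `listed_in` strings, so the counts are purely illustrative.

```python
from collections import Counter
from itertools import chain

# Made-up examples of `listed_in` values
listed_in = [
    "Dramas, International Movies",
    "Comedies, International Movies",
    "Dramas",
    "International Movies, Thrillers",
]

# Split every entry and count each category once per title
counter = Counter(chain.from_iterable(s.split(", ") for s in listed_in))

# most_common(n) returns the n largest (category, count) pairs, largest first
top2 = counter.most_common(2)

# Reverse so the largest bar ends up at the top of a horizontal bar chart
cat_list = [cat for cat, _ in top2][::-1]
num_list = [num for _, num in top2][::-1]

print(top2)
```

With `most_common(10)` this would yield the same kind of `cat_list` and `num_list` that the plot above expects.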

# Countries - Where are the films produced?

Let's see in which countries the Movies/TV Shows are mainly produced, which is often (but not always) also where the plots take place.<br>For simplicity, I have transformed the data such that each title only keeps the country where it was mainly produced, i.e. the first country listed. 

In [None]:
# Some data in the json has missing woeids
woeids = {'Norway':23424910, 'France':23424819, 'Netherlands':23424909, 'Australia':23424748, 'United Kingdom':23424975}
country_id_map = {}
for feature in countries['features']:
    if feature['properties']['geounit'] in woeids:
        feature['properties']['woe_id'] = woeids[feature['properties']['geounit']]
    feature['id'] = feature['properties']['woe_id']
    country_id_map[feature['properties']['geounit']] = feature['id']

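# Rename keys so that they match the country names used in the dataset
country_id_map['United States'] = country_id_map.pop('United States of America')
country_id_map['Hong Kong'] = country_id_map.pop('Hong Kong S.A.R.')
# Drop titles whose main country has no matching entry in the GeoJSON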
df.drop(index='s340', inplace=True) #Mauritius
df.drop(index='s392', inplace=True) #Soviet Union
df.drop(index=['s497', 's2982', 's6821'], inplace=True) #Serbia
df.drop(index=['s2758'], inplace=True) #West Germany
df.drop(index=['s5975'], inplace=True) #Cyprus
df.drop(index=['s6760'], inplace=True) #Somalia

# The country where a title was mainly produced (first entry of the country list)
df['most_prominent_country'] = df.country.apply(lambda x: x.split(',')[0] if isinstance(x, str) else x)
df['country_id'] = df['most_prominent_country'].apply(lambda x: country_id_map[x] if isinstance(x, str) else x)

# Titles without a country are left out by value_counts
t5 = df.value_counts(['most_prominent_country', 'country_id']).reset_index(name='count')
t5['logcount'] = np.log10(t5['count'])


fig = go.Figure()
fig.add_choropleth(geojson=countries,
                   locations=t5.country_id,
                   colorscale='Reds',
                   z=t5.logcount,
                   customdata=t5[['most_prominent_country', 'count']],
                   hovertemplate='<b>%{customdata[0]}</b><br><br>Log10 produced: %{z}'+
                                 '<br>Total produced: %{customdata[1]:.0f}',
                   name=''
                   )
fig.update_layout(title_text='Produced content for Netflix (Log10 scaled)',
                  title_font_size=24,
                  )
fig.show()

* Most of Netflix's content is produced in the United States with 2883 Movies/TV Shows (India is second with 956)
* Content on Netflix comes from all over the world
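
The cell above removes titles by hard-coded `show_id` when their country is missing from the GeoJSON. An alternative is to let pandas handle unknown names through `Series.map`, which yields `NaN` instead of raising a `KeyError`. The sketch below uses a hypothetical two-entry `toy_id_map` (the ids are placeholders, not real WOEIDs) and a made-up `country` column.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened name -> id mapping (placeholder ids)
toy_id_map = {"United States": 1, "India": 2}

# Made-up `country` values, including an unmapped name and a missing value
toy = pd.DataFrame({
    "country": ["United States, India", "India", "United States", "Soviet Union", None],
})

# Keep only the first listed country and strip stray whitespace
first_country = toy["country"].str.split(",").str[0].str.strip()

# Unknown names and missing values both become NaN
country_id = first_country.map(toy_id_map)

# Count only the titles that could be mapped, then log-scale for the colour axis
counts = first_country[country_id.notna()].value_counts()
logcount = np.log10(counts)

print(counts)
print(logcount.round(3))
```

The log10 scaling is what keeps the map readable: with raw counts, a single very large country would take up nearly the whole colour range.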

# Conclusions

* Netflix has become a global player over the past years
* Content is very diverse, so there is something to watch for everyone

<h2 style='text-align: center'>Thank you for reading this notebook to the end!<br>Feel free to upvote and leave a comment.</h2><h4 style='text-align: center'>Also please tell me what I could've done better...</h4>
