# Analysis of Netflix shows

In this analysis, I'll be using the Netflix database of movies and shows as of 2019. 

I am interested in learning:
1. How much content is being released each month
2. The breakdown between shows and movies
3. How the breakdown of shows and movies has changed over time
4. Which country produces the most movies/shows
5. Who are the cast members that appear most frequently
6. How to best visualize the data


In [None]:
import pandas as pd
import datetime as dt
import os
import numpy as np

import plotly
import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px
from plotly.subplots import make_subplots


In [None]:
%%capture
!pip install --upgrade pip
!pip install --upgrade plotly

In [None]:
netflix_df = pd.read_csv('../input/netflix-shows/netflix_titles.csv')

In [None]:
netflix_df.head()

In [None]:
netflix_df['date_added'].isnull().any()

In [None]:
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'], errors='coerce')

In [None]:
netflix_df.dtypes

In [None]:
#plotly.offline.init_notebook_mode (connected = True)

In [None]:
df = netflix_df.groupby(pd.Grouper(key='date_added',freq='M'))['show_id'].count().reset_index()

fig1 = go.Figure()

fig1.add_trace(go.Scatter(x=df['date_added'], y=df['show_id'],
                    mode='lines+markers',
                    name=''))

#fig1 = px.line(df,x='date_added',y='show_id', )

fig1.update_layout(template='simple_white', title='Shows and Movies Added per Month')

fig1.show()

In [None]:
movies_series_count_year = netflix_df.groupby([pd.Grouper(key='date_added',freq='YS'),'type'])['show_id'].count().reset_index()

df = movies_series_count_year[movies_series_count_year['date_added'] < dt.datetime(2020,1,1) ]

lst = movies_series_count_year['type'].unique().tolist()

fig2 = go.Figure()

for i in lst:
    df2 = df[df['type'] == i]
    fig2.add_trace(go.Scatter(x=df2['date_added'], y=df2['show_id'],
                    mode='lines+markers',
                    name=i))

fig2.update_layout(template='simple_white', hovermode="x unified", title='Shows and Movies Added per Year')

fig2.show()

The amount of content being added is growing every year. The chart shows absolute counts, though, so it's hard to tell how the mix of movies and TV shows is changing. Let's now look at each type's share of the total.

In [None]:
Movie_or_Series_pivot = df.pivot(columns='type',index='date_added', values='show_id').fillna(0)

In [None]:
df = Movie_or_Series_pivot.div(Movie_or_Series_pivot.sum(axis=1), axis=0)

df = df[df.index >= dt.datetime(2016,1,1)]

fig3 = go.Figure()
fig3.add_trace(go.Scatter(x=df.index, y=df['Movie'], name='Movie'))

fig3.add_trace(go.Scatter(x=df.index, y=df['TV Show'], name='TV Show'))

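# The y values are fractions between 0 and 1, so format the axis ticks as percentages
fig3.update_layout(template='simple_white', hovermode="x unified", title='Movies and TV Shows as a Percent of Total Content', yaxis=dict(title='% of Total', tickformat='.0%'))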

fig3.show()


It's interesting to see that TV shows increased as a share of content in 2019, although the share is still not as high as it was in 2016. There was definitely a focus on adding movies from 2016 to 2018.
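The share calculation behind the chart above can be sketched on a tiny table. The counts below are made up purely for illustration (they are not taken from the Netflix data), and `toy_counts` and `toy_share` are hypothetical names.

```python
import pandas as pd

# Made-up yearly counts, used only to illustrate the calculation
toy_counts = pd.DataFrame(
    {"Movie": [30, 60], "TV Show": [20, 20]},
    index=pd.to_datetime(["2016-01-01", "2017-01-01"]),
)

# Divide each row by its own total so that every year sums to 1
toy_share = toy_counts.div(toy_counts.sum(axis=1), axis=0)
print(toy_share)
```

Passing `axis=0` to `div` aligns the row totals with the row index, so each year is normalized by its own total rather than by a column total.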

In [None]:
movies_series_two_years = movies_series_count_year[movies_series_count_year['date_added'].dt.year.isin([2018,2019])]
movies_series_two_years

In [None]:
movies_series_two_years.pivot(index='date_added',columns='type', values='show_id')

In [None]:
df = movies_series_count_year[movies_series_count_year['date_added'].dt.year.isin([2019])]

fig4 = px.pie(df, values='show_id', names='type')

fig4.update_layout(title_text="Movies and TV Shows Released in 2019")
    
fig4.show()

One could also visualize the data with a pie chart. However, the earlier line chart is clean and clearly shows both proportion and trend. A lot of people default to pie charts to show a percentage of a whole, but pie charts are hard to read. Even if you just wanted to highlight a single year, an easier way to do that would be to simply call out the percentage like this:

<h1>65.8%</h1>
<h3>of titles released in 2019 were movies.</h3>
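A callout like this can be generated from the counts rather than typed by hand. The helper below is a minimal sketch: `share_callout` is a hypothetical function, and the counts passed to it are made up for illustration.

```python
def share_callout(counts, label):
    """Return (share, html) for `label` as a fraction of the total in `counts`."""
    total = sum(counts.values())
    share = counts[label] / total
    # Format the fraction as a percentage with one decimal place
    html = f"<h1>{share:.1%}</h1>\n<h3>of titles were of type {label}.</h3>"
    return share, html


# Made-up counts, not taken from the Netflix data
toy = {"Movie": 2, "TV Show": 1}
share, html = share_callout(toy, "Movie")
print(html)
```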

Comparing two pie charts compounds the problem. The reader has to jump back and forth and spend a lot of energy trying to judge the size of each slice. Take a look at the breakdown of Movies and TV Shows in 2018 and 2019:

In [None]:
df #.iloc[0]

In [None]:
df = movies_series_two_years.pivot(index='date_added',columns='type', values='show_id')

fig4 = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])

# Each pie is one year (row 0 is 2018, row 1 is 2019), so name the traces by year
fig4.add_trace(go.Pie(labels=df.columns, values=df.iloc[0], name="2018"), 1, 1)
fig4.add_trace(go.Pie(labels=df.columns, values=df.iloc[1], name="2019"), 1, 2)


# Use `hole` to create a donut-like pie chart
fig4.update_traces(hole=.4, hoverinfo="label+percent+name")

fig4.update_layout(title_text="Movies and Series Released", 
                   annotations=[dict(text='2018', x=0.20, y=0.5, font_size=20, showarrow=False),
                                dict(text='2019', x=0.80, y=0.5, font_size=20, showarrow=False)])
    
fig4.show()

As a reader, I'm not sure exactly what I should take away from these pie charts. There are a couple of trends one could point out:
* Movies are decreasing as a percentage of content
* Conversely, TV Shows are increasing as a percentage of content

Now, let's look at countries. The `country` column holds the country or countries where the movie or show was produced.
The first thing we have to clean up is the fact that multiple countries are stored as a single comma-separated string. I'm going to split the column on the comma, concat the new columns onto the original dataframe, then melt the dataframe to have one country per cell.
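The split, concat, and melt steps described above can be sketched on a three-row frame. The data here is made up, `toy`, `parts`, and `long_form` are hypothetical names, and the `.str.strip()` call is an addition in this sketch to remove the space left after each comma.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "country": ["United States, India", "Japan", None],
})

# One column per country position; strip the space left after each comma
parts = toy["country"].str.split(",", expand=True).apply(lambda col: col.str.strip())
parts.columns = [f"Country{i}" for i in parts.columns]

# Wide to long: one row per (show_id, country) pair, dropping the empty cells
long_form = (
    pd.concat([toy, parts], axis=1)
    .melt(id_vars=["show_id"], value_vars=list(parts.columns),
          var_name="Country Number", value_name="Country Produced")
    .dropna(subset=["Country Produced"])
)
print(long_form)
```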

In [None]:
# Sort rows by a calculated value: the length of the country string, longest first
netflix_df.loc[netflix_df['country'].str.len().sort_values(ascending=False).index]

In [None]:
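# Split on the comma, then strip the leading space so ' India' and 'India' count as the same country
countries_expanded = netflix_df['country'].str.split(',', expand=True).apply(lambda col: col.str.strip())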
countries_expanded.columns = ['Country'+str(i) for i in countries_expanded.columns]

countries_expanded_concat = pd.concat([netflix_df,countries_expanded], axis=1)
countries_expanded_concat

In [None]:
year_country_produced_df = pd.melt(countries_expanded_concat, id_vars=['show_id','release_year'], value_vars=countries_expanded.columns, var_name='Country Number', value_name='Country Produced').dropna()

In [None]:
year_country_produced_df

In [None]:
year_country_produced_df_grouped = year_country_produced_df.groupby(['release_year','Country Produced'])['show_id'].count().reset_index()

In [None]:
df = year_country_produced_df_grouped[year_country_produced_df_grouped['release_year'] > 2015]
fig = px.treemap(df, path=['release_year', 'Country Produced'], values='show_id', title='Country where the Movie or Show was Produced')
fig.show()

Alternatively, I can break up the column of countries and turn it into a list.

In [None]:
unique_countries = [val.strip() for sublist in netflix_df.country.dropna().str.split(",").tolist() for val in sublist]

In [None]:
country_summary = pd.DataFrame(unique_countries,columns=['country']).value_counts().reset_index().rename(columns={0:'count'})
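A more compact route to the same kind of per-country count is `Series.explode`, which turns each list element into its own row. This is a sketch of an alternative technique on made-up data, not the approach used in the cells above; `toy` and `counts` are hypothetical names.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.Series(["United States, India", "India", None], name="country")

counts = (
    toy.dropna()
    .str.split(",")          # each cell becomes a list of countries
    .explode()               # one row per list element
    .str.strip()             # remove the space left after each comma
    .value_counts()
    .rename_axis("country")
    .reset_index(name="count")
)
print(counts)
```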

Horizontal bar charts allow for easy reading of the data labels.

In [None]:
df = country_summary.sort_values(by='count', ascending=True)

fig = px.bar(df, x='count', y='country', orientation='h')

fig.update_layout(template='simple_white', height=1200)

fig.show()

Because the data is so skewed, I can use a log scale on the x-axis to make the countries where only a few movies/shows were produced easier to see.

In [None]:
#Using a log scale 
fig = px.scatter(df, x='count', y='country', orientation='h', log_x=True)

fig.update_layout(template='simple_white', height=1000)

fig.show()
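To see why the log scale helps with skewed counts, compare where a few values would land on a linear axis and on a log axis. The counts below are made up, and the positions are simply fractions of the axis length.

```python
import numpy as np

# Made-up, heavily skewed counts for illustration only
toy_counts = np.array([1, 3, 10, 100, 3000])

# Position along a linear axis, as a fraction of the largest value
linear_pos = toy_counts / toy_counts.max()

# Position along a log axis that spans 1 to the largest value
log_pos = np.log10(toy_counts) / np.log10(toy_counts.max())

for count, lin, lg in zip(toy_counts, linear_pos, log_pos):
    print(f"{count:>5}  linear={lin:.3f}  log={lg:.3f}")
```

On the linear axis every small count is squeezed against zero, while the log axis spreads them out across the chart.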

I can also show the data as a pie chart. Because there are a ton of countries, I actually don't mind a pie chart here. Plotly also has a few key formatting features that make this much easier to work with. For one, data labels are scaled to the size of the slice, so they don't clutter up the screen. If I want to see more, I can hover over the slice. The slices are also ordered from most to least, so I don't have to guess which is the larger slice.

In [None]:
fig = px.pie(df, 
             values='count', 
             names='country', 
             title='Country where the Movie or Show was Produced', 
             hover_data=['count'], 
             labels={'count':'Number of Shows'})

fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

We can apply the same data manipulation to look at the actors that appear in the most movies/shows.

In [None]:
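# Split on the comma, then strip the leading space so each cast member has a single spelling
cast_expanded = netflix_df['cast'].str.split(',', expand=True).apply(lambda col: col.str.strip())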
cast_expanded.columns = ['cast_member_'+str(i) for i in cast_expanded.columns]

cast_expanded_concat = pd.concat([netflix_df,cast_expanded], axis=1)

year_cast_df = pd.melt(cast_expanded_concat, id_vars=['show_id','release_year'], value_vars=cast_expanded.columns, var_name='Cast Order', value_name='Cast Member').dropna()

year_cast_df

In [None]:
year_cast_df[year_cast_df['release_year'] == 2019].groupby('Cast Member')['show_id'].count().reset_index().sort_values(by='show_id', ascending=False)

In [None]:
df = year_cast_df[year_cast_df['release_year'] == 2019].groupby('Cast Member')['show_id'].count().reset_index().sort_values(by='show_id', ascending=False).head(50).sort_values(by='show_id', ascending=True)

fig = px.scatter(df, x='show_id', y='Cast Member', orientation ='h')

# fig = px.pie(df, 
#              values='show_id', 
#              names='Cast Member', 
#              title='Top 50 Cast members of 2019', 
#              hover_data=['show_id'],
#              labels={'show_id':'Number of Shows'})

#fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(template='simple_white', height=1000)

fig.show()

In [None]:
# Save the Plotly chart as an HTML fragment (mode 'a' appends if the file already exists)
with open('treemap.html', 'a') as f:
    f.write(fig.to_html(full_html=False, include_plotlyjs='cdn'))