# Has Netflix increasingly focused on TV rather than movies in recent years?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import collections
import re
%matplotlib inline

In [None]:
netflix_data = pd.read_csv("../input/netflix-shows/netflix_titles.csv")
netflix_data.head()

In [None]:
netflix_data.info()

Columns in which every value is unique add nothing to this analysis. Therefore, let's drop ```show_id```
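
A quick way to spot such identifier-like columns is to compare each column's number of distinct values with the number of rows. This is only a sketch on a made-up toy frame (`toy` and its values are assumptions, not the real `netflix_data`):

```python
import pandas as pd

# Toy frame standing in for netflix_data (values are invented for illustration)
toy = pd.DataFrame({
    "show_id": [101, 102, 103, 104],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2018, 2019, 2018, 2020],
})

# A column is identifier-like when every row holds a different value
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)
```

On the real data the same check would probably also flag ```title```, which this notebook keeps on purpose for visualization.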

In [None]:
netflix_data.drop("show_id", axis=1, inplace=True)
netflix_data.head()

### Univariate 
* Categorical: 
    * [Count plot](https://seaborn.pydata.org/generated/seaborn.countplot.html)
* Continuous:
    * [Histograms](http://seaborn.pydata.org/tutorial/distributions.html#histograms)
    * [Kernel density estimation plot](http://seaborn.pydata.org/tutorial/distributions.html#kernel-density-estimaton)
    * [Box plots](http://seaborn.pydata.org/generated/seaborn.boxplot.html?highlight=boxplot#seaborn.boxplot) 

### Bivariate
* Categorical x categorical 
    * [Heat map of contingency table](http://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap) 
    * [Multiple bar plots](http://seaborn.pydata.org/tutorial/categorical.html?highlight=bar%20plot#bar-plots) 
* Categorical x continuous 
    * [Box plots](http://seaborn.pydata.org/generated/seaborn.boxplot.html#seaborn.boxplot) of continuous for each category
    * [Violin plots](http://seaborn.pydata.org/examples/simple_violinplots.html) of continuous distribution for each category
    * Overlaid [histograms](http://seaborn.pydata.org/tutorial/distributions.html#histograms) (if 3 or fewer categories)
* Continuous x continuous 
    * [Scatter plots](http://seaborn.pydata.org/examples/marginal_ticks.html?highlight=scatter) 
    * [Hexbin plots](http://seaborn.pydata.org/tutorial/distributions.html#hexbin-plots)
    * [Joint kernel density estimation plots](http://seaborn.pydata.org/tutorial/distributions.html#kernel-density-estimation)
    * [Correlation matrix heatmap](http://seaborn.pydata.org/examples/network_correlations.html?highlight=correlation)
    
### Multivariate 
* [Pairwise bivariate figures/ scatter matrix](http://seaborn.pydata.org/tutorial/distributions.html#visualizing-pairwise-relationships-in-a-dataset)

## Univariate Analysis
### ```type```
- This is categorical data, so a bar chart (count plot) is appropriate

In [None]:
# Vertical bar chart

sns.countplot(x="type", data=netflix_data, palette="Blues_d")

In [None]:
# Horizontal bar chart

sns.countplot(y="type", data=netflix_data, palette="Blues_d")

#### Analysis: 
- The classes are imbalanced: there are almost twice as many movies as TV shows. Does this mean that Netflix is focusing more on movies? <br>
- Let's see how many movies and TV shows are published each year -> <font color='red'>Do this in Bivariate analysis</font> 
- The next column is ```title```, whose values are all unique. We could drop it, but let's keep it for visualization purposes

### ```director```
- This is again categorical data.
- We can look at the following:
    - Number of movies/shows released by a director
    - Did a director work on both movies and shows -> <font color='red'>Do this in Bivariate analysis</font>
    - Who are the top 10 directors?

In [None]:
# Number of movies/shows released by a director
sns.countplot(x="director", data= netflix_data)

Most directors have fewer than 5 titles, and we have almost 6000 directors, so plotting all of them doesn't bring any insight. Instead, look only at the top 15 or so directors - <font color="red">Do this in bivariate analysis</font>
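
As a rough sketch of how the top directors could be pulled out without plotting everyone, `value_counts()` followed by `head()` is usually enough. The toy frame below is an assumption for illustration; on the real data one would use `netflix_data["director"]` and `head(15)`:

```python
import pandas as pd

# Toy data standing in for the director column (names are placeholders)
toy = pd.DataFrame({"director": ["A", "B", "A", None, "C", "A", "B"]})

# value_counts() ignores missing directors and sorts by frequency, highest first
top_directors = toy["director"].value_counts().head(2)
print(top_directors)
```

The resulting Series can be passed to a bar plot, for example `sns.barplot(x=top_directors.values, y=top_directors.index)`, which keeps the chart readable.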

### ```cast```
- This is categorical data, but each observation holds multiple categories

In [None]:
# number of unique actors (strip the spaces left over from splitting on ",")
netflix_data["cast"] = netflix_data["cast"].str.split(",")
netflix_data["cast"].explode().str.strip().nunique()

In [None]:
netflix_data["cast"].explode()

#### Analysis: Visualizing them directly runs into the same problem as ```director```. Therefore, we shall group them based on ```listed_in```
<font color="red">Do this in Bivariate analysis</font>

### ```country```
- Country is again a list of values, like ```cast```
- Identify which country produced the most content
- Which country produced more movies & shows -> <font color="red">Do this in Bivariate analysis</font>

In [None]:
# number of nulls
netflix_data.country.isna().sum()

In [None]:
country_data = netflix_data[netflix_data['country'].notna()].copy()  # copy so later column assignments don't act on a view
country_data.country.isna().sum()

In [None]:
# Split the comma-separated country string into a list of countries
country_data["country"] = country_data["country"].str.split(",")
country_data.country.explode().nunique()

We have 173 unique countries. Which country produced the most titles?

In [None]:
country_data.country.explode()

In [None]:
country_data.country.explode().value_counts()

So, the top 5 contributors appear to be US, India, UK, US and Canada.
Notice that US appears twice. That is because there is a space before US in the 2nd occurrence.

In [None]:
# Remove the spaces
countries = country_data.country.explode()
countries = [country.strip() for country in countries]
counter = collections.Counter(countries)
# print(counter)
print(counter.most_common(5))

Now the top 5 contributors are US, India, UK, Canada and France
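
The same clean-up can also be done entirely in pandas by chaining `str.split`, `explode` and `str.strip` before counting. This is a minimal sketch on invented rows, not a replacement for the `Counter` approach above:

```python
import pandas as pd

# Toy rows imitating the comma-separated country column (values are made up)
toy = pd.DataFrame({
    "country": ["United States, India", "India", None, "United Kingdom, United States"]
})

counts = (
    toy["country"]
    .dropna()           # drop rows without a country
    .str.split(",")     # "A, B" -> ["A", " B"]
    .explode()          # one row per country
    .str.strip()        # " B" -> "B", so duplicates collapse
    .value_counts()
)
print(counts)
```

Without the `str.strip()` step, `" India"` and `"India"` would be counted as two different countries.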

In [None]:
top_countries = counter.most_common(5)
type(top_countries)

In [None]:
# Visualize top countries
top_countries_df = pd.DataFrame(top_countries, columns=['country','count'])
top_countries_df

In [None]:
sns.barplot(x="country", y="count", data=top_countries_df)

### ```date_added```
- Is more content released during the holiday/festive seasons?

In [None]:
# Create a month column (strip stray spaces; missing or unparseable dates become NaT)
netflix_data["month"] = pd.to_datetime(netflix_data["date_added"].str.strip(), errors="coerce").dt.month_name()
netflix_data.head()

In [None]:
plot = sns.countplot(x="month", data=netflix_data)
plot.set_xticklabels(plot.get_xticklabels(), rotation=40,  ha="right")

We can see that more content is released from October to January    
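
A seasonal pattern like this is easier to read when the months appear in calendar order rather than in order of first appearance. The sketch below assumes dates formatted like `"January 5, 2020"` (an assumption about the column, stated in the `format` argument) and uses invented rows:

```python
import calendar
import pandas as pd

# Toy dates in the assumed "Month D, YYYY" format, including a stray space and a null
toy = pd.DataFrame({
    "date_added": ["December 1, 2019", " January 5, 2020", "October 12, 2018",
                   None, "January 20, 2019"]
})

added = pd.to_datetime(toy["date_added"].str.strip(),
                       format="%B %d, %Y", errors="coerce")

# calendar.month_name[0] is an empty string, so skip it: January ... December
month_order = list(calendar.month_name)[1:]

# Count titles per month and force calendar order, filling empty months with 0
month_counts = (
    added.dt.month_name()
    .value_counts()
    .reindex(month_order, fill_value=0)
)
print(month_counts)
```

With seaborn, the same ordering could be requested directly through the `order` argument, e.g. `sns.countplot(x="month", data=netflix_data, order=month_order)`.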

### ```release_year```

In [None]:
print(netflix_data.release_year.min())
print(netflix_data.release_year.max())

I have data from 1925 to 2020, i.e. spanning 95 years. Do I have data for all the years? 

In [None]:
netflix_data.release_year.nunique()

But I have data for only 72 distinct years. 
- Which years are missing?
- What might be the reason?

In [None]:
type(netflix_data.release_year.sort_values())

In [None]:
def find_missing_years(years):
    # `years` must be sorted; a set makes the membership checks fast
    present = set(years)
    return [x for x in range(years[0], years[-1] + 1) if x not in present]

In [None]:
years = netflix_data.release_year.sort_values().tolist()
missing_years = find_missing_years(years)
missing_years

Are values missing in the ```release_year``` column for these years, or do we simply have no titles from these years?

In [None]:
netflix_data[netflix_data["release_year"].isin(missing_years)]

- So, the dataset simply has no titles for the missing years, which fall in the period 1926 to 1961.
- That leaves 59 years of data (1962 to 2020), which we can group by decade
- Check which year has the most content
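
One compact way to group years by decade is integer division, which avoids maintaining a lookup table. This is only a sketch: the labels it produces (`"2010s"` etc.) are an assumption and differ from the `"2010-2020"` style labels used in the cells below.

```python
import pandas as pd

# A few example release years (chosen to hit the decade boundaries)
years = pd.Series([1925, 1968, 1970, 2009, 2010, 2019, 2020], name="release_year")

# 1968 // 10 = 196, and 196 * 10 = 1960, i.e. the first year of the decade
decade_start = (years // 10) * 10
decade_label = decade_start.astype(str) + "s"
print(decade_label.tolist())
```

Note that with this scheme 2020 starts a new decade, so it is worth deciding explicitly whether 2020 titles belong with 2010-2019 for the purposes of this analysis.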

In [None]:
netflix_data["release_year"][0]

In [None]:
decades = {
    "1960-1970": np.arange(1960, 1970, 1),
    "1970-1980": np.arange(1970, 1980, 1),
    "1980-1990": np.arange(1980, 1990, 1),
    "1990-2000": np.arange(1990, 2000, 1),
    "2000-2010": np.arange(2000, 2010, 1),
    "2010-2020": np.arange(2010, 2021, 1)  # include 2020, the latest release year in the data
}
decades

In [None]:
netflix_data.release_year[0]

In [None]:
year = 2019
for d,y in decades.items():
    if year in y:
        print(d)

In [None]:
# Map each release year to its decade label; years outside the ranges stay NaN
year_to_decade = {year: label for label, years in decades.items() for year in years}
netflix_data["decade"] = netflix_data["release_year"].map(year_to_decade)

In [None]:
netflix_data.head()

In [None]:
netflix_data.decade.unique()

In [None]:
plot = sns.countplot(x="decade", data=netflix_data)
plot.set_xticklabels(plot.get_xticklabels(), rotation=40,  ha="right")

In [None]:
decade_df = netflix_data[netflix_data["decade"].notna()]
decade_df.decade.isna().sum()

In [None]:
plt.hist(decade_df.decade, density=True, bins=decade_df.decade.nunique())  # one bin per decade; `density=False` would plot raw counts
plt.ylabel('density')
plt.xlabel('decade');

Clearly the data is skewed towards recent years: there is very little content before 2010. Since we are mostly considering the decade 2010-2020, where the data is reasonably well behaved, no further processing is applied.

In [None]:
# Which year has highest content
year_counter = collections.Counter(netflix_data.release_year)
year_counter.most_common(5)

In [None]:
highest_content_years_df = pd.DataFrame(year_counter.most_common(5), columns=['year','count'])
highest_content_years_df

In [None]:
# Visualize the highest years
sns.barplot(x="year", y="count", data=highest_content_years_df)

The most content was released in 2018

In [None]:
sns.lineplot(x=netflix_data.release_year.value_counts().index, y=netflix_data.release_year.value_counts())

We can see that the growth in content started around 2000 and accelerated rapidly from 2010.

### ```rating``` 
- Categorical variable
- Which rating categories have the most content?

In [None]:
netflix_data.rating.nunique()

In [None]:
netflix_data.rating.unique()

In [None]:
plot = sns.barplot(x=netflix_data.rating.value_counts().index, y=netflix_data.rating.value_counts())
plot.set_xticklabels(plot.get_xticklabels(), rotation=40,  ha="right")

The most common rating is "TV-MA", i.e. content intended for mature audiences

### ```duration```
- Movie durations are commonly believed to have shortened over the years. Visualize this change.
- Generally, Indian movies are longer than US,UK movies. -> <font color="red">Do this in Bivariate Analysis</font>


In [None]:
netflix_data.duration.unique()

```duration``` holds string values. Some observations (probably TV shows) give the number of seasons rather than a running time. 
<br><br>
So, visualize the duration for movies and the number of seasons for TV shows during <font color="red">Bivariate analysis</font>
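
To make that split concrete, the number can be pulled out of the string with a regular expression and then routed to a minutes or a seasons column depending on ```type```. The column names `minutes` and `seasons` are hypothetical, and the rows are invented:

```python
import pandas as pd

# Toy rows imitating the two kinds of duration strings
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "1 Season", "125 min", "3 Seasons"],
})

# Grab the leading number from strings such as "90 min" or "3 Seasons"
amount = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)

# Keep the number only where it has the right meaning; elsewhere it becomes NaN
toy["minutes"] = amount.where(toy["type"] == "Movie")
toy["seasons"] = amount.where(toy["type"] == "TV Show")
print(toy)
```

Keeping the two quantities in separate columns avoids accidentally averaging minutes together with season counts.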

### ```listed_in```
- This is similar to ```country``` where we have list of strings

In [None]:
# Check for nulls
netflix_data.listed_in.isna().sum()

In [None]:
netflix_data["listed_in"] = netflix_data["listed_in"].str.split(",")
netflix_data.listed_in.explode().nunique()

We have 72 distinct category values

In [None]:
netflix_data.listed_in.explode().unique()

Notice that there are duplicate categories because of the whitespace characters

In [None]:
# Remove the spaces
categories = netflix_data.listed_in.explode()
categories = [category.strip() for category in categories]
cat_counter = collections.Counter(categories)
# print(counter)
print(cat_counter.most_common(5))

In [None]:
len(set(categories))

In [None]:
categories_df = pd.DataFrame(cat_counter.most_common(5), columns=['category','count'])
categories_df

In [None]:
plot = sns.barplot(x="category", y ="count", data=categories_df)
plot.set_xticklabels(plot.get_xticklabels(), rotation=40,  ha="right")

Now we have 42 distinct categories, of which the most common ones are
- International Movies
- Dramas
- Comedies
- International TV Shows
- Documentaries

### ```description```

In [None]:
netflix_data.description[0]

This column is a one-line description of the movie/show. It adds little value unless we go into NLP techniques. So, we can drop this column

In [None]:
netflix_data.drop("description", axis=1, inplace=True)
netflix_data.head()

## Bivariate Analysis
### How many movies and TV shows are published each year
- We need the ```type``` and ```release_year``` variables, where ```type``` is categorical and ```release_year``` is continuous. So, we can try box plots or violin plots
- In the plots below, the x-axis holds the continuous value and the y-axis the categories
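
Box plots show how release years are distributed within each type, but not the actual number of titles per year. If the counts themselves are of interest, a contingency table is one option. The sketch below uses `pd.crosstab` on invented rows:

```python
import pandas as pd

# Toy rows: three titles from 2017 and three from 2018
toy = pd.DataFrame({
    "release_year": [2017, 2017, 2017, 2018, 2018, 2018],
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
})

# Rows are years, columns are types, cells are title counts
titles_per_year = pd.crosstab(toy["release_year"], toy["type"])
print(titles_per_year)
```

Such a table can be drawn directly, for instance with `titles_per_year.plot(kind="bar")`.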

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x="release_year", y="type", data=netflix_data, palette="Set3")

Slice the data for decade 2010-2020

In [None]:
yearly_type_data = netflix_data[netflix_data["decade"] == "2010-2020"]
yearly_type_data.head()

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x="release_year", y="type", data=yearly_type_data, palette="Set3")

#### We see that more movies than TV shows were produced in the decade 2010 to 2020. 
But a TV show can have multiple seasons. Should we consider each season equal to one movie? 

In [None]:
# Movies count as 1; for TV shows, use the number of seasons parsed from `duration`
seasons = netflix_data["duration"].str.extract(r"(\d+)", expand=False).astype(float)
netflix_data["multiplier"] = np.where(netflix_data["type"] == "Movie", 1, seasons)

In [None]:
netflix_data.tail()

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x="release_year", y="type", data=yearly_type_data, palette="Set3")

In [None]:
netflix_data.multiplier = pd.to_numeric(netflix_data['multiplier'])
sample = netflix_data.groupby(["release_year", "type"]).agg({'multiplier': 'sum'}).reset_index()
sample

In [None]:
# Keep only the years from 2010 onwards
sample = sample[sample["release_year"] >= 2010]
sample

In [None]:
g = sns.catplot(x="release_year", y="multiplier", hue="type", data=sample,
                height=6, kind="bar", palette="muted")
g.despine(left=True)

If we consider 1 season = 1 movie, then until 2017 Netflix's focus was more on movies. But from 2018 onwards, we can say that its focus is more on TV shows. 
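
The year in which the balance tips can also be read off a table instead of a chart. The following sketch pivots season-weighted totals by year and type and marks the leading type per year. The numbers are invented, and the `leader` column is a hypothetical helper:

```python
import pandas as pd

# Toy season-weighted rows: a movie has multiplier 1, a show its season count
toy = pd.DataFrame({
    "release_year": [2017, 2017, 2017, 2018, 2018],
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "multiplier": [1, 1, 1, 1, 3],
})

# Sum the multipliers per year and type, one column per type
weighted = toy.pivot_table(index="release_year", columns="type",
                           values="multiplier", aggfunc="sum", fill_value=0)

# For each year, the type with the larger weighted total (ties go to the first column)
weighted["leader"] = weighted.idxmax(axis=1)
print(weighted)
```

Whether one season really is comparable to one movie remains a modelling assumption, so the result is best read as a sensitivity check.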

### Do Indian movies run longer than English ones?
- This is again categorical (```country```) vs continuous (```duration```)
- Since we don't have a running time for TV shows, we focus only on movies

In [None]:
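duration = country_data.explode("country")
# Strip the spaces left over from splitting on "," so names match `top_countries`
duration["country"] = duration["country"].str.strip()
duration.head()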

In [None]:
# delete nulls in country
duration = duration[duration["country"].notna()]
duration.country.isna().sum()

In [None]:
# Keep only movies released in or after 2010
duration = duration[(duration["type"] == "Movie") & (duration["release_year"] >= 2010)]
duration.type.unique()

In [None]:
top_countries = ["United States", "India", "United Kingdom", "Canada", "France"]

In [None]:
# Select only top countries
duration = duration.query("country in @top_countries")
duration.country.unique()

In [None]:
duration["duration"] = duration["duration"].str.replace("min", "", regex=False).str.strip()
duration.head()

In [None]:
#convert the duration col to int
duration["duration"] = duration["duration"].astype(int)

In [None]:
#Get the average duration for each country
duration = duration.groupby('country', as_index=False)['duration'].mean()
duration

In [None]:
sns.barplot(x="country", y="duration", data=duration)

In [None]:
sns.lineplot(x="country", y="duration", data=duration)

Yes, Indian movies do run longer on average. The effect of songs, perhaps!
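
Since a mean can be pulled around by a handful of very long or very short films, it may be worth looking at the median and the sample size as well. This is a sketch on invented durations, not the notebook's actual result:

```python
import pandas as pd

# Toy movie durations in minutes (values are made up)
toy = pd.DataFrame({
    "country": ["India", "India", "India", "United States", "United States"],
    "duration": [150, 130, 140, 95, 105],
})

# Several summary statistics per country in one call
summary = toy.groupby("country")["duration"].agg(["count", "mean", "median"])
print(summary)
```

If mean and median agree and the counts are reasonably large, the comparison between countries is on firmer ground.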

### Did a director work on both movies and TV shows?
- Both are categorical data

In [None]:
grouped_directors =  netflix_data.groupby(["director","type"]).size().nlargest(15).reset_index()

Let's consider the top 15 directors

In [None]:
grouped_directors

We can see that the top 15 directors worked on Movies alone.

In [None]:
grouped_directors =  netflix_data.groupby(["director","type"]).size().reset_index()
TV_dirs = grouped_directors[grouped_directors["type"]=="TV Show"]
movie_dirs = grouped_directors[grouped_directors["type"]=="Movie"]

In [None]:
pd.merge(TV_dirs, movie_dirs, on='director')

This is the list of people who worked on both TV shows and movies.
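
An alternative to the merge, sketched here on placeholder names, is to count how many distinct types each director has worked on and keep those with two:

```python
import pandas as pd

# Toy rows; director names are placeholders, one row has no director
toy = pd.DataFrame({
    "director": ["A", "A", "B", "C", "C", None],
    "type": ["Movie", "TV Show", "Movie", "TV Show", "TV Show", "Movie"],
})

# Number of distinct content types per director (rows without a director are skipped)
types_per_director = toy.groupby("director")["type"].nunique()

# Directors who appear with both "Movie" and "TV Show"
both = types_per_director[types_per_director == 2].index.tolist()
print(both)
```

This gives one entry per director, whereas the merged table carries the separate movie and show counts side by side. Which one is more useful depends on what is plotted next.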

### Number of shows & movies each country produced
- Let's consider the top 5 countries: US, India, UK, Canada and France

In [None]:
# Expand `country` into one row per country and strip the spaces left by splitting on ","
countries = country_data.explode("country")
countries["country"] = countries["country"].str.strip()
countries

In [None]:
top_countries

In [None]:
countries = countries.query("country in @top_countries")
countries

In [None]:
countries.country.unique()

In [None]:
countries = countries.groupby(["country", "type"]).size().reset_index()
countries

In [None]:
countries.columns=["country", "type", "count"]
countries

In [None]:
g = sns.catplot(x="country", y="count", hue="type", data=countries,
                height=6, kind="bar", palette="muted")
g.despine(left=True)

# Conclusion:
##### - The top 5 countries that produced the most content are US, India, UK, Canada and France
##### - Most of the content is released during the months of October to January
##### - Netflix content started increasing in the 2000s and increased rapidly from 2010, with the highest content in 2018 so far. 
##### - The majority of the content is related to "TV-MA" and "TV-14" ratings.
##### - Most of the content is listed in "International Movies", "Dramas" and "Comedies"
##### - In the decade 2010-2020, more movies than TV shows were released. But a TV show can have multiple seasons. If we count each season as one movie, then from 2018 onwards Netflix appears more invested in TV shows than in movies. 
##### - Indian movies run longer than movies from the other top countries.
##### - There are a few directors who worked on both movies and TV shows.