# Netflix Movies & TV Shows Analysis

Analyzing movies and TV shows on Netflix released over the past 100 years.


# Introduction
We'll use the Netflix titles dataset for our analysis. It contains information regarding movies and shows released on Netflix, and you can find the raw data at https://www.kaggle.com/shivamb/netflix-shows

There are several options for getting the dataset into Jupyter. Here I'm adding the CSV file of the dataset manually through Kaggle.


Let's load the CSV file using the Pandas library. We'll use the name `netflix_raw_df` for the data frame, to indicate that this is unprocessed data which we might clean, filter and modify to prepare a data frame that's ready for analysis.

In [None]:
import pandas as pd

In [None]:
netflix_raw_df = pd.read_csv('../input/netflix-shows/netflix_titles.csv')

In [None]:
import numpy as np

`read_csv` reads a CSV file into a data frame.

Let's have a look at the dataset we are going to analyse.

In [None]:
netflix_raw_df

The dataset contains 6234 titles, with 12 columns of information about each.
Some cells contain NaN, which simply means that no value was provided for that cell.
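To quantify the missing cells rather than eyeballing them, one option is `isna().sum()`. The sketch below uses a tiny made-up frame (`sample_df` and its values are hypothetical, not the Netflix data) so that it runs on its own; the same call should work on `netflix_raw_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample standing in for the real dataset.
sample_df = pd.DataFrame({
    'title': ['Title A', 'Title B', 'Title C'],
    'director': ['Director X', np.nan, np.nan],
    'country': ['India', 'United States', np.nan],
})

# Count the missing (NaN) cells in each column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

Columns with a high count here are the ones where later conclusions rest on fewer titles.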

Let's see the columns given in the dataset.

In [None]:
netflix_raw_df.columns

The `shape` attribute gives the size of the dataset as (rows, columns).

In [None]:
netflix_raw_df.shape

# Data Preparation & Cleaning

Let's view some basic information about the data frame.

In [None]:
netflix_raw_df.info()

Let's now view some basic statistics about the numeric columns.



In [None]:
netflix_raw_df.describe()

While the given dataset contains a wealth of information, we'll limit our analysis to certain areas.
Hence we drop the description and duration columns.

In [None]:
selected_columns = [
    'show_id',
    'type',
    'title',
    'director',
    'country',
    'date_added',
    'release_year',
    'rating',
    'listed_in',
    'cast'
    
]

In [None]:
netflix_df = netflix_raw_df[selected_columns].copy()

A new data frame has been created with the same data, excluding the duration and description columns.
It has 10 columns instead of 12.

In [None]:
netflix_df

The listed_in column holds the genres of each title. Below are the distinct values it takes in the data frame.

In [None]:
netflix_df.listed_in.unique()

A listed_in value can combine several genres, but to simplify our analysis, we'll blank out the rows whose listed_in value contains more than one genre (a comma or an ampersand).

In [None]:
netflix_df.where(~(netflix_df.listed_in.str.contains(',', na=False)), np.nan, inplace=True)

In [None]:
netflix_df.where(~(netflix_df.listed_in.str.contains('&', na=False)), np.nan, inplace=True)

Let's review the listed_in column after making these changes.

In [None]:
netflix_df.listed_in.unique()
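One thing worth checking about the `where` calls above: when the condition is a Series, `DataFrame.where` applies it row by row, so every column in a matching row becomes NaN, not only listed_in. The sketch below contrasts that with masking a single column, using a made-up frame (`toy_df` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical sample: the second row lists more than one genre.
toy_df = pd.DataFrame({
    'title': ['Title A', 'Title B', 'Title C'],
    'listed_in': ['Documentaries', 'Dramas, Comedies', 'Docuseries'],
})

is_multi = toy_df.listed_in.str.contains(',', na=False)

# Row-wise masking: the whole second row becomes NaN.
row_masked = toy_df.where(~is_multi)

# Column-only masking: the title survives, only listed_in is blanked.
col_masked = toy_df.copy()
col_masked['listed_in'] = toy_df.listed_in.where(~is_multi)

print(row_masked)
print(col_masked)
```

Which variant is appropriate depends on whether multi-genre titles should still count in the later per-year, per-type and per-rating tallies.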

Now we have seen our data and done some initial cleaning.

Let's save and commit our work before continuing.

In [None]:
# Select a project name
project='Netflix_shows_survey'

In [None]:
# Install the Jovian library
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
jovian.commit(project=project)

# Exploratory Analysis and Visualization
Before we can ask interesting questions about Netflix's activity over the years, it would help to understand where Netflix has grown or declined, and in what areas. It's important to explore these variables in order to understand how representative the analysis is of the Netflix data.

Let's begin by importing `matplotlib.pyplot` and `seaborn`.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')


# Number of shows released in a year

Let's look at the number of shows released on Netflix in each year.
First, let's count the shows year by year.

In [None]:
shows_fname=netflix_df.release_year.value_counts()

In [None]:
shows_fname

We can visualize this information using a bar chart.

In [None]:
plt.figure(figsize=(15,8))
plt.xticks(rotation=75)
plt.title('SHOWS PER YEAR')
plt.xlabel('Year')
plt.ylabel('No. of shows')
sns.barplot(x=shows_fname.index, y=shows_fname.values);  # recent seaborn versions require x= and y= keywords

It appears that the number of releases has grown enormously over the years.
Growth was slow at first, but it accelerated dramatically in later years.
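Keep in mind that `value_counts` orders its result by frequency, not by year, so a chronological view needs an extra `sort_index`. A small sketch with made-up years (`years` is a hypothetical series, not the dataset column):

```python
import pandas as pd

# Hypothetical release years for six titles.
years = pd.Series([2017, 2016, 2017, 2015, 2017, 2016], name='release_year')

# value_counts sorts by count, largest first.
by_count = years.value_counts()

# sort_index reorders the same counts chronologically.
by_year = by_count.sort_index()

print(by_count)
print(by_year)
```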

# Number of shows of a particular genre (listed_in)

The distribution of shows across genres is another important factor to look at, and we can use a countplot to visualize it.

In [None]:
shows_per_category_counts = netflix_df.listed_in.value_counts()

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(y=netflix_df.listed_in)
plt.xticks(rotation=75);
plt.title('shows_per_genre')
plt.ylabel(None);

It can be concluded that documentaries are the most common genre on Netflix, followed by stand-up comedy and kids' TV.

# Relation between genre (listed_in) and type

This visualisation shows which genre falls under which type.

In [None]:
types_fname=netflix_df.type.value_counts()

In [None]:
types_fname

So there are 2 types: Movie and TV Show.
For the visualisation we will use a scatter plot.

In [None]:
sns.scatterplot(x='type', y='listed_in', data=netflix_df)  # recent seaborn versions require x= and y= keywords
plt.xlabel("type")
plt.ylabel("category");

The plot makes it easy to see which genre belongs to which type.
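The same genre-to-type relationship can also be read as a table of counts, which a scatter plot of two categorical columns cannot show. One way to sketch this is `pd.crosstab`; the frame below is made up (`genre_type_df` and its rows are hypothetical).

```python
import pandas as pd

# Hypothetical sample of titles with a type and a single genre each.
genre_type_df = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show', 'Movie', 'TV Show'],
    'listed_in': ['Documentaries', 'Stand-Up Comedy', "Kids' TV",
                  'Documentaries', 'Docuseries'],
})

# Rows are genres, columns are types, cells are title counts.
genre_by_type = pd.crosstab(genre_type_df['listed_in'], genre_type_df['type'])
print(genre_by_type)
```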

# Top directors (in terms of no. of shows)

The aim is to find the directors who have directed more than one show on Netflix.
We'll look at the top 10 such directors.

In [None]:
director_f=netflix_df.director.value_counts().head(10)

In [None]:
director_f

This can be shown with a bar graph.

In [None]:
plt.figure(figsize=(15,10))
plt.xticks(rotation=75)
plt.title('director vs no. of shows')
plt.xlabel('director')
plt.ylabel('No. of shows')
sns.barplot(x=director_f.index, y=director_f.values);  # recent seaborn versions require x= and y= keywords

Hence, the top directors and the number of shows each has directed can be read off the bar plot above.

# Movies or shows

This shows the number of TV shows and movies released on Netflix in a particular year.

In [None]:
Movie_df = netflix_df[netflix_df.type == 'Movie']
TV_Show_df = netflix_df[netflix_df.type == 'TV Show']

Two separate data frames have been created for movies and TV shows.
Their release years can be plotted using a histogram.

In [None]:
plt.title('type vs year')

plt.hist([Movie_df.release_year, TV_Show_df.release_year], 
         
         stacked=True);

plt.legend(['Movie', 'TV_Show']);

It can be clearly concluded that there is no year in which the number of TV shows is equal to or larger than the number of movies.
This makes Netflix a movie-dominant platform.
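Since a histogram groups several years into each bin, the per-year claim can be double-checked with an exact table of counts. A minimal sketch, assuming one row per title with a release year and a type (`year_type_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sample: seven titles across two release years.
year_type_df = pd.DataFrame({
    'release_year': [2016, 2016, 2016, 2017, 2017, 2017, 2017],
    'type': ['Movie', 'Movie', 'TV Show', 'Movie', 'TV Show', 'Movie', 'Movie'],
})

# One row per year, one column per type, cells are title counts.
per_year = (year_type_df
            .groupby(['release_year', 'type'])
            .size()
            .unstack(fill_value=0))

# Years in which TV shows match or outnumber movies.
tv_not_behind = per_year[per_year['TV Show'] >= per_year['Movie']]

print(per_year)
print(tv_not_behind)
```

An empty `tv_not_behind` on the real data would confirm the statement above year by year.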

Let's save and upload our work before continuing.

In [None]:
import jovian

In [None]:
jovian.commit(project=project)

## Asking and Answering Questions

We've already gained several insights about the shows on Netflix in general, simply by exploring individual columns of the dataset. Let's ask some specific questions, and try to answer them using data frame operations and interesting visualizations.



Q1: Which category of shows is most common on Netflix? What percentage of Netflix do they make up?

The question can be answered by first finding the most common genre and its count, and then computing the percentage.
Please note that we have not considered titles with multiple genres here.


In [None]:
netflix_df.listed_in.value_counts()

Documentaries lead with a count of 299 on Netflix. The percentage can be computed as:

In [None]:
doc_pr=299/6234*100

In [None]:
doc_pr

So documentaries account for about 4.8% of the titles on Netflix.
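The figures 299 and 6234 are typed in by hand above; they could instead be derived from the data so that they stay correct if the cleaning changes. A rough sketch on a made-up genre column (`genres` is hypothetical), which also shows how the choice of denominator affects the result:

```python
import numpy as np
import pandas as pd

# Hypothetical genre column; NaN marks titles whose genre was blanked out.
genres = pd.Series(
    ['Documentaries', 'Documentaries', 'Stand-Up Comedy', np.nan,
     np.nan, "Kids' TV", 'Documentaries', np.nan],
    name='listed_in',
)

genre_counts = genres.value_counts()   # NaN values are not counted
top_genre = genre_counts.idxmax()
top_count = genre_counts.max()

# Share of all rows (the denominator used in the text above).
share_of_all_rows = top_count / len(genres) * 100

# Share of rows that still have a genre label.
share_of_labelled = top_count / genres.notna().sum() * 100

print(top_genre, top_count, share_of_all_rows, share_of_labelled)
```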

# Q2: What changes are observed over the years in Netflix production? Has the rate of production increased or decreased?

This can be answered by counting the number of shows released on Netflix every year. Only then can we see whether it has increased or decreased over the years.

In [None]:
nt = netflix_df.release_year.value_counts().sort_index()  # sort by year so the line runs chronologically

In [None]:
plt.xlabel('year')
plt.ylabel('count')
plt.plot(nt);


The line graph above gives us a rough idea of the growth. Production clearly increased: the rate was low initially but rose drastically over time, and Netflix peaked in 2017 with the maximum number of releases.
After 2017 the rate decreases, and unevenly.
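To put numbers on the rise and fall described above, one option is to take year-over-year differences of the yearly counts. A minimal sketch with made-up counts (the values in `titles_per_year` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical titles-per-year counts, already in chronological order.
titles_per_year = pd.Series({2014: 10, 2015: 20, 2016: 40, 2017: 50, 2018: 45})

# Change relative to the previous year; the first year has no predecessor.
yearly_change = titles_per_year.diff()

# Year with the most releases.
peak_year = titles_per_year.idxmax()

print(yearly_change)
print(peak_year)
```

Positive values in `yearly_change` mark growth and negative values mark decline.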

# Q3: What share of the shows released on Netflix over time are movies?

This is about the percentage of movies among everything released on Netflix over the years.

In [None]:
total_produced_df=Movie_df.count()+TV_Show_df.count()

In [None]:
movies_produced_percentages = (Movie_df.title.count() * 100/ total_produced_df.type)

In [None]:
movies_produced_percentages

About 70% of the shows on Netflix are movies.
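A shorter route to the same kind of percentage is `value_counts(normalize=True)`, which returns each type's share directly. The sketch below uses a made-up column (`types` is hypothetical) chosen to give a 70/30 split.

```python
import pandas as pd

# Hypothetical type column: 7 movies and 3 TV shows.
types = pd.Series(
    ['Movie', 'Movie', 'TV Show', 'Movie', 'TV Show',
     'Movie', 'Movie', 'TV Show', 'Movie', 'Movie'],
    name='type',
)

# normalize=True returns fractions of the non-missing rows; scale to percent.
type_share = types.value_counts(normalize=True) * 100
movie_share = type_share['Movie']

print(type_share)
```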

# Q4: How many shows got the 'TV-Y7-FV' rating on Netflix over the given period?

First, take a look at the ratings that are given to the shows.

In [None]:
netflix_df.rating.unique()

In [None]:
rating_df = netflix_df[netflix_df.rating == 'TV-Y7-FV']

In [None]:
rating_df.title.count()

So, according to the data, 46 shows on Netflix have been given the TV-Y7-FV rating to date.
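If counts for several ratings are needed, filtering one rating at a time gets repetitive. An alternative sketch is to count every rating once and look values up, using `Series.get` with a default for ratings that never occur (`ratings` below is a hypothetical column):

```python
import pandas as pd

# Hypothetical rating column for five titles.
ratings = pd.Series(['TV-MA', 'TV-Y7-FV', 'TV-14', 'TV-Y7-FV', 'R'], name='rating')

# Count every rating once.
rating_counts = ratings.value_counts()

# Look up individual ratings; fall back to 0 when a rating is absent.
fv_count = rating_counts.get('TV-Y7-FV', 0)
nc17_count = rating_counts.get('NC-17', 0)

print(fv_count, nc17_count)
```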

# Q5: What is the count of the different ratings given to movies and TV shows separately?

This can be answered with a countplot of rating, split by movie and TV show.

In [None]:
order =  ['G', 'TV-Y', 'TV-G', 'PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG', 'PG-13', 'TV-14', 'R', 'NC-17', 'TV-MA']
plt.figure(figsize=(15,7))
g = sns.countplot(x=netflix_df.rating, hue=netflix_df.type, order=order, palette="pastel");  # x= keyword required by recent seaborn
plt.title("Ratings for Movies & TV Shows")
plt.xlabel("Rating")
plt.ylabel("Total Count")
plt.show()

Separate plots of the ratings for movies and the ratings for TV shows:

In [None]:
fig, ax = plt.subplots(1,2, figsize=(19, 5))
# x= keyword required by recent seaborn versions
g1 = sns.countplot(x=Movie_df.rating, order=order,palette="Set2", ax=ax[0]);
g1.set_title("Ratings for Movies")
g1.set_xlabel("Rating")
g1.set_ylabel("Total Count")
g2 = sns.countplot(x=TV_Show_df.rating, order=order,palette="Set2", ax=ax[1]);
g2.set(yticks=np.arange(0,1600,200))
g2.set_title("Ratings for TV Shows")
g2.set_xlabel("Rating")
g2.set_ylabel("Total Count")
plt.show()

Let us save and upload our work to Jovian before continuing.

In [None]:
import jovian

In [None]:
jovian.commit(project=project)

## Inferences and Conclusion

The growth of Netflix over the years has been tremendous. The company took certain approaches in its marketing strategy to break into new markets around the world. Based on an article from Business Insider, Netflix had about 158 million subscribers worldwide, with 60 million from the US and almost 98 million internationally. Netflix's original subscriber base was solely in the United States following its IPO. A large part of its success was due to the decision to expand to international markets. Its popular markets shape which content the company prioritizes for release. In this case, we can see that a good number of international movies and TV shows were added over the years as part of Netflix's global expansion.

# References
* https://www.kaggle.com/shivamb/netflix-shows
* https://jovian.ml/vaishaligoyal878/Netflix_shows_survey


In [None]:
import jovian

In [None]:
jovian.commit(project=project)
