# Netflix Movies and TV Shows


## Preliminary Wrangling

> This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

Dataset Source:  https://www.kaggle.com/shivamb/netflix-shows

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.stats import norm

%matplotlib inline

In [None]:
# load the dataset into a pandas DataFrame

df=pd.read_csv('../input/netflix-shows/netflix_titles.csv')
df.head()

In [None]:
#Check shape of the dataset
df.shape

In [None]:
#Describe the quantitative (numeric) columns
df.describe()

In [None]:
#Check Data information
df.info()

In [None]:
# Check null data
df.isnull().sum()

>>### Some columns contain null values that need to be filled
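Before deciding how to fill the gaps, it can help to see what share of each column is missing. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not the Netflix data) to show one way to turn the null counts into percentages.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up)
toy = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "cast": ["x", "y", None, "z"],
    "rating": ["TV-MA", "R", "PG", "G"],
})

# isnull() gives a boolean frame; its column means are the missing shares
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Columns with a large missing share are candidates for a placeholder value, while columns with only a few missing rows can be handled by dropping those rows.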

In [None]:
# Handle NaN values
# fill missing directors with "No Director"
df['director'] = df['director'].fillna("No Director")

# fill missing cast with "No Cast"
df['cast'] = df['cast'].fillna("No Cast")

# fill missing countries with "Country Unavailable"
df['country'] = df['country'].fillna("Country Unavailable")

# drop the remaining rows with NaN values, which are not used in the analysis
df.dropna(inplace=True)

In [None]:
#Check null data again to confirm the fill worked
df.isnull().sum()

In [None]:
# convert date_added from string to datetime64
df['date_added'] = pd.to_datetime(df['date_added'])

# Extract month, day name and year from date_added after the conversion
df['added_month'] = df['date_added'].dt.month
df['added_day_name'] = df['date_added'].dt.day_name()
df['added_year'] = df['date_added'].dt.year

In [None]:
#Check Data information 
df.info()

>#### Check for duplicated data

In [None]:

df.duplicated().sum()

>>#### There are no duplicate rows in our dataset

In [None]:
#Check first five rows from dataset
df.head()

In [None]:
# Create two new data frames, one for movies and one for TV shows
df_movies=df.query("type=='Movie'")
df_tvshow=df.query("type=='TV Show'")
# Check the number of movies and TV shows
print(df_movies.type.value_counts())
print(df_tvshow.type.value_counts())

In [None]:
#check how many titles each country (or country combination) produced
df['country'].value_counts()

In [None]:
# Check Rating values
df.rating.value_counts()

In [None]:
#create a dictionary mapping ratings to age groups, see https://en.wikipedia.org/wiki/TV_Parental_Guidelines
rate_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

In [None]:
# create a new column with the rating mapped to an age group
df['target_ages'] = df['rating'].replace(rate_ages)
#Check the unique values of the new column
df['target_ages'].unique()

In [None]:
#Describe the data frame that holds only movies

df_movies.describe()

>>### Note: about 50% of all movies were released between 2016 and 2020, and about 50% of the movies on Netflix were added between 2018 and 2020

In [None]:
#Describe the data frame that holds only TV shows
df_tvshow.describe()

>>### Note: about 50% of all TV shows were released between 2017 and 2020, and about 50% of the TV shows on Netflix were added between 2018 and 2020
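The two notes above are read off the `50%` row of `describe()`, which is the median: half of the titles fall at or after that year. A minimal sketch with made-up release years (the `years` values are an assumption, not taken from the dataset) shows the connection.

```python
import pandas as pd

# Made-up release years standing in for a release_year column
years = pd.Series([2005, 2012, 2016, 2017, 2018, 2019])

# The 50% row of describe() is the median (the 0.5 quantile)
median_year = years.quantile(0.5)

# Share of titles released at or after the median year
share_from_median = (years >= median_year).mean()
print(median_year, share_from_median)
```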

In [None]:
#Check correlation coefficients between the numeric columns only
df.select_dtypes('number').corr()

### What is the structure of your dataset?

> The raw dataset has 6234 rows and 12 columns.

### What is/are the main feature(s) of interest in your dataset?
I'm most interested in figuring out:
> Which country produces the most movies and TV shows?

> Which type of content (movies or TV shows) is more common on Netflix?

> When did Netflix start to increase its content?

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> The rating of the content by target audience ('Kids', 'Older Kids', 'Teens', 'Adults').

> I expect that the type of content will have the strongest effect on my investigation.

> Release years and added years, to figure out in which years the content started to increase.


# *Univariate Exploration*



#  <span style="color:red">1. Which type is more common, Movies or TV Shows?</span>

In [None]:
# Count titles by type
con=df['type'].value_counts()

explode = (0, 0.2)  # only "explode" the 2nd slice (TV Show)
plt.figure(figsize=(12,7))

plt.pie(con,labels=con.index,startangle=90,autopct='%1.1f%%',counterclock=False,explode=explode,shadow=True)
plt.title('Proportion of Movies and TV Shows on Netflix')
plt.axis('square')
plt.show()

>>###  <span style="color:blue">68.4% of the titles are movies and 31.6% are TV shows, so most of the content is movies.</span>
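The percentages shown on the pie chart can also be computed directly. This is a small sketch on a made-up `types` series (an assumption for illustration); on the real data the same call would be applied to `df['type']`.

```python
import pandas as pd

# Made-up content types standing in for df['type']
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns fractions instead of counts
share = types.value_counts(normalize=True).mul(100).round(1)
print(share)
```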

In [None]:
#Check how many seasons are most common for TV shows
plt.figure(figsize=(10,7))

df_tvshow.duration.value_counts().plot(kind='bar')
plt.xlabel('Seasons')
plt.ylabel('Frequency')
plt.title('TV Show Seasons');


>>### <span style="color:blue">Most TV shows have one season</span>
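Because `duration` is stored as text such as `"1 Season"`, the bars are ordered by frequency rather than by season count. One possible way to order them numerically, sketched on a made-up `duration` series (an assumption, not the real column), is to extract the digits first.

```python
import pandas as pd

# Made-up durations in the same text format as the TV show rows
duration = pd.Series(
    ["1 Season", "2 Seasons", "1 Season", "10 Seasons", "2 Seasons", "1 Season"]
)

# Pull out the leading number and convert it to an integer
seasons = duration.str.extract(r"(\d+)", expand=False).astype(int)

# Count shows per season number, ordered by number of seasons
season_counts = seasons.value_counts().sort_index()
print(season_counts)
```

Calling `season_counts.plot(kind='bar')` would then draw the bars from 1 season upward.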

In [None]:
# Check which movie durations are most common
plt.figure(figsize=(10,7))
df_movies.duration.value_counts().plot()
plt.xticks(rotation=90);
plt.xlabel('Duration Of Movies (MIN)')
plt.ylabel('Frequency')
plt.title('Frequency of duration');

>>### <span style="color:blue">The most common movie duration is 90 minutes</span>

In [None]:
# extract the number of minutes and fit a normal distribution; the average movie runs about 100 minutes
plt.figure(figsize=(10,7))

# distplot is deprecated in newer seaborn releases but still accepts fit=norm
minutes = df_movies['duration'].str.extract(r'(\d+)', expand=False).astype(int)
sb.distplot(minutes,fit=norm,kde=False,color='k')
plt.xlabel('Movie Duration (min)')
plt.ylabel('Density')
plt.title('Distribution of Movie Duration');


>>## <span style="color:blue">The average movie duration is about 100 minutes, so Netflix should pay attention to duration when adding new movies.</span>
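The "most common" duration and the "average" duration are different statistics (mode and mean), which is why the two findings above can both hold. A minimal sketch on a made-up `duration` series (an assumption for illustration) makes the difference explicit.

```python
import pandas as pd

# Made-up movie durations in the same text format as the dataset
duration = pd.Series(["90 min", "100 min", "90 min", "120 min"])

# Extract the minutes as integers
minutes = duration.str.extract(r"(\d+)", expand=False).astype(int)

mean_minutes = minutes.mean()            # arithmetic average
median_minutes = minutes.median()        # middle value
mode_minutes = minutes.mode().iloc[0]    # most frequent value
print(mean_minutes, median_minutes, mode_minutes)
```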

#  <span style="color:red">2. Which country produces the most Netflix content?</span> 

## We need to separate all countries before analyzing them.

In [None]:
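# split multi-country entries, then strip the space left after each comma
# so that ' India' and 'India' are counted as the same country
all_countries = df.set_index('title').country.str.split(',', expand=True).stack().reset_index(level=1, drop=True).str.strip()
all_countries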

### Removing titles with no countries available.

In [None]:
all_countries=all_countries[all_countries !='Country Unavailable']

## Top countries producing Movies and TV Shows

In [None]:
base_color = sb.color_palette()[1]
plt.figure(figsize=(12,7))

sb.countplot(y = all_countries, order=all_countries.value_counts().index[:15],color=base_color)
plt.xlabel('Number of Movies/TV Shows Produced')
plt.ylabel('Country')
plt.title('Top Countries Producing Movies/TV Shows')

plt.show()

>## <span style="color:blue">From the horizontal bar chart we can see that the United States, India, and the United Kingdom are the top 3 countries producing content on Netflix.</span>
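The country counts behind the bar chart can be reproduced without plotting. The sketch below is one possible approach using `Series.explode`; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Made-up titles; some list several countries in one string
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", "Country Unavailable"],
})

countries = (
    toy["country"]
    .str.split(",")   # one list of countries per title
    .explode()        # one row per (title, country) pair
    .str.strip()      # remove the space left after each comma
)

# Drop the placeholder used for missing countries
countries = countries[countries != "Country Unavailable"]
top = countries.value_counts()
print(top)
```

Without the `str.strip()` step, `" India"` and `"India"` would be counted as two different countries.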

In [None]:
## Check the five-number summary and outliers of the release years for movies and TV shows
plt.boxplot(df['release_year']);


# The outliers here are not errors; they are valid release dates of older content, not anomalous or wrong data

In [None]:
#Plotting a histogram to check whether the data is skewed

bins=np.arange(1985,2020+5,5)
plt.hist(df['release_year'],bins=bins);
plt.xlabel('Release Year')
plt.ylabel('Number of Movies/TV Shows Released')

### <span style="color:blue">The histogram is left-skewed (mean < median < mode), which indicates that content production has been increasing since 1990, with the most releases around 2019</span>
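The skew read off the histogram can be checked numerically: for a left-skewed variable the mean sits below the median and the sample skewness is negative. This is a small sketch on made-up years (the `years` values are an assumption, not the dataset).

```python
import pandas as pd

# Made-up release years with a long tail toward older titles
years = pd.Series([1990, 2005, 2012, 2016, 2017, 2018, 2018, 2019, 2019, 2019])

mean_year = years.mean()
median_year = years.median()
skewness = years.skew()   # negative values indicate a left skew
print(mean_year, median_year, skewness)
```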

In [None]:
## Check the five-number summary and outliers of the years in which movies and TV shows were added to Netflix

plt.boxplot(df['added_year']);
plt.xlabel('added_year')

>## <span style="color:blue">The same holds for added_year: the outliers are not errors, they are valid years in which content was added to Netflix, not anomalous or wrong data.</span>

In [None]:
#Plotting a histogram to check whether the data is skewed
bins=np.arange(2014,2022,1)  # one bin per year, so 2020 is not merged into the 2019 bin
plt.hist(df['added_year'],bins=bins);
plt.xlabel('Added Year')
plt.ylabel('Number of Movies/TV Shows Added');


>### <span style="color:blue">The histogram is left-skewed (mean < median < mode), which indicates that added content has been increasing since 2015, with the most additions in 2019</span>

# *Bivariate Exploration*

## <span style="color:red">3. On which day of the week is the most content (Movies and TV Shows) added?</span>

In [None]:
#Plotting a seaborn countplot to show the number of titles added on each weekday, split by type
fig, ax = plt.subplots(figsize=(10,10))

sb.countplot(data=df,x='added_day_name',hue='type',ax=ax)
plt.xticks(rotation=15);
plt.xlabel('Weekday Name')
plt.title('Relation Between Day and Type of content')


>>## <span style="color:blue">Friday is the day on which the most movies and TV shows are added</span>
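The counts behind the weekday chart can be tabulated, with the days listed in calendar order rather than in order of appearance. The following is a sketch on a made-up `toy` frame; `weekday_order` is a helper name introduced here for illustration.

```python
import pandas as pd

# Made-up rows; 2019-11-01 and 2019-11-08 are Fridays, 2019-11-05 is a Tuesday
toy = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["2019-11-01", "2019-11-08", "2019-11-05", "2019-11-01"]
    ),
    "type": ["Movie", "TV Show", "Movie", "Movie"],
})
toy["added_day_name"] = toy["date_added"].dt.day_name()

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count titles per weekday and type, then put the days in calendar order
by_day = (
    toy.groupby(["added_day_name", "type"]).size()
    .unstack(fill_value=0)
    .reindex(weekday_order, fill_value=0)
)
print(by_day)
```

The same list could be passed as `order=weekday_order` to `sb.countplot` so the bars follow the week.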

# <span style="color:red">4. Which target age group is the most common for movies and TV shows?</span>

In [None]:
plt.figure(figsize=(15,9))
sb.countplot(data = df, x = 'target_ages', hue = 'type')
#plt.xticks(rotation = 20);

>>## <span style="color:blue">For movies, the Adults group is the most common; for TV shows, the Adults and Teens groups are the most common.</span>
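The target-age counts depend on every rating code being present in the mapping dictionary. A hedged sketch of how this could be checked: `Series.map` returns NaN for any code missing from the dictionary, whereas `replace` leaves it unchanged. The `toy` frame, the shortened dictionary, and the code `"XX"` are made up for illustration.

```python
import pandas as pd

# Shortened mapping, for illustration only
rate_ages = {"TV-MA": "Adults", "R": "Adults", "TV-14": "Teens",
             "PG": "Older Kids", "TV-Y": "Kids"}

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["R", "TV-MA", "TV-14", "TV-MA", "XX"],  # "XX" is a made-up unmapped code
})

# map() yields NaN where the rating is not a key of the dictionary
toy["target_ages"] = toy["rating"].map(rate_ages)
unmapped = toy.loc[toy["target_ages"].isna(), "rating"].unique()

# Counts per age group and content type (rows with NaN are left out)
table = pd.crosstab(toy["target_ages"], toy["type"])
print(unmapped)
print(table)
```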

# <span style="color:red">5. In which year was the most content (Movies and TV Shows) released?</span>

In [None]:
plt.figure(figsize=(12,10))
sb.set(style="whitegrid")

ax = sb.countplot(y='release_year', data=df, palette="Set3", order=df['release_year'].value_counts().index[0:15])

>>### <span style="color:blue">2018 is the year with the most released content.</span>

# <span style="color:red">6. In which year was the most content added to Netflix?</span>

In [None]:
plt.figure(figsize=(12,10))

ax = sb.countplot(y='added_year', data=df, palette="Set3", order=df['added_year'].value_counts().index[0:15])


>>### 2019 is the year with the most content added to Netflix

# *Multivariate Exploration*

# <span style="color:red"> 7. What is the relation between release years and added years for Movies and TV Shows on Netflix?</span>

In [None]:
#Plotting lines to show the number of movies and TV shows released per year
plt.figure(figsize=(15,9))
plt.subplot(1, 2, 1)

df.groupby('release_year')["type"].count().plot(label="Total Movies/TV Shows")
df_movies.groupby('release_year')["type"].count().plot(label="Total Movies")
df_tvshow.groupby('release_year')["type"].count().plot(label="Total TV Shows")
plt.title("Movies/TV Shows Release Years")
plt.xlim([1990, 2022])
plt.legend()

#Plotting lines to show the number of movies and TV shows added to Netflix per year, to compare with the release plot

plt.subplot(1, 2, 2)
df.groupby('added_year')["type"].count().plot(label="Total Movies/TV Shows")
df_movies.groupby('added_year')["type"].count().plot(label="Total Movies")
df_tvshow.groupby('added_year')["type"].count().plot(label="Total TV Shows")
plt.title("Movies/TV Shows Added Years")
plt.legend()


>>###  <span style="color:blue">Production of movies and TV shows increased from 2000 and peaked in 2018, while the content added to Netflix increased from 2014 and peaked in 2019.</span>
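The per-year lines can also be built as a single table, which makes the peak year easy to read off. This is a minimal sketch on a made-up `toy` frame (an assumption, not the Netflix data); the `Total` column is added here for illustration.

```python
import pandas as pd

# Made-up rows standing in for the type and added_year columns
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "added_year": [2018, 2019, 2019, 2019, 2018],
})

# One row per year, one column per content type
per_year = toy.groupby(["added_year", "type"]).size().unstack(fill_value=0)
per_year["Total"] = per_year.sum(axis=1)

# Year with the largest total number of added titles
peak_year = per_year["Total"].idxmax()
print(per_year)
print(peak_year)
```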

In [None]:
# Plotting a scatter of release years against added years on Netflix
g = sb.FacetGrid(df,hue="type",height=7)
g.map_dataframe(sb.scatterplot, x="added_year", y="release_year")
g.set_axis_labels('Added year','Release year')
g.add_legend();

>>### <span style="color:blue">Production of movies and TV shows increased from 2000 and peaked in 2018, while the content added to Netflix increased from 2014 and peaked in 2019.</span>

## <span style="color:brown">Conclusions from the dataset</span>

> Most content is rated for adults

> Netflix has more movies than TV shows

> Most TV shows have one season

> The average movie duration is about 100 minutes

> Production of movies and TV shows increased from 2000 and peaked in 2018, while the content added to Netflix increased from 2014 and peaked in 2019.


