1- <a href="#intro">Introduction</a>

2- <a href="#dw">Data Wrangling</a>
> <li><a href="#l">Loading data</a></li>
  <li><a href="#down">Downloading IMDB Data</a></li>
  <li><a href="#view">Viewing the Data</a></li>
  <li><a href="#merge">Merging IMDB Data</a></li>
  <li><a href="#clean">Cleaning</a></li>
  <li><a href="#merge2">Merging into 1 Dataframe</a></li>
  <li><a href="#clean2">Preparing data for EDA</a></li>


3- <a href="#eda">Exploratory Data Analysis (EDA) and Visualizations</a>

<a id='intro'></a>
# Introduction
What is this data about?

This data covers the TV shows and movies on Netflix. It contains:
* show ID, title, type (TV show or movie)
* director, cast
* country of the show or movie
* release year and the date it was added to Netflix
* rating, duration, listed in (the category) and description



In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import os
import time
import requests
import tarfile 
import matplotlib.pyplot as plt
%matplotlib inline

<a id='dw'></a>
# Data Wrangling


<a id='l'></a>
## Loading The Netflix data 

In [None]:
path = '../input/netflix-shows/netflix_titles.csv'
net_df = pd.read_csv(path)

In [None]:
net_df.head()

**The Netflix data is useful, but it would be better with ratings from IMDB,
so I downloaded the IMDB ratings data.**

<a id='down'></a>
## Downloading the IMDB data

In [None]:
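urls = ['https://datasets.imdbws.com/title.ratings.tsv.gz',
        'https://datasets.imdbws.com/title.basics.tsv.gz']
for url in urls:
    # stream the response so the large file is written in chunks instead of held in memory
    r = requests.get(url, stream=True)
    r.raise_for_status()  # stop early if the download failed
    with open(url.split('/')[-1], 'wb') as fd:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            fd.write(chunk)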

In [None]:
# pandas infers gzip compression from the .gz extension, so the files can be loaded without unzipping them first
basics_df = pd.read_csv('title.basics.tsv.gz',sep='\t')
ratings_df = pd.read_csv('title.ratings.tsv.gz',sep='\t')

In [None]:
basics_df.head()

In [None]:
ratings_df.head()

<a id='view'></a>
## Viewing process:
**Assessing the data to start building insights and to find out what needs cleaning**

In [None]:
#check for duplicated rows in the data
print(ratings_df.duplicated().sum(),basics_df.duplicated().sum())

<a id='merge'></a>
## Merging IMDB Data
The data looks clean, so it's time to merge the ratings with the titles data.

The basics file is used to link the ratings with the TV shows and movies.

In [None]:
rated_titles = pd.merge(basics_df.set_index('tconst'), ratings_df.set_index('tconst'), left_index=True, right_index=True).drop_duplicates()
rated_titles.sample(5)
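
A minimal sketch of the join above, using toy frames with made-up rows rather than the real IMDB files. It shows that an inner merge on `tconst` keeps only the titles that have a rating. The `validate` argument is an optional extra check I am assuming is useful here: it raises an error if `tconst` is not unique on both sides.

```python
import pandas as pd

# Toy stand-ins for basics_df and ratings_df (made-up rows)
basics = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "primaryTitle": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt001", "tt003"],
    "averageRating": [7.1, 8.4],
})

# Inner join: titles without a rating (tt002) are dropped
merged = basics.merge(ratings, on="tconst", how="inner", validate="one_to_one")
print(merged)
```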

In [None]:
rated_titles.info()

In [None]:
net_df.sample(5)

In [None]:
net_df.info()

<a id='clean'></a>
## Cleaning the data to merge it
Before merging this data with the Netflix data, it needs some cleaning and a few data type changes:
* all titles need to be in lower case
* missing values in the year column need to be handled

In [None]:
rated_titles_clean = rated_titles.copy()
net_clean = net_df.copy()

In [None]:
#lower case titles
net_clean['title']= net_clean['title'].str.lower()
rated_titles_clean['primaryTitle'] = rated_titles_clean['primaryTitle'].str.lower()
rated_titles_clean['originalTitle'] = rated_titles_clean['originalTitle'].str.lower()

In [None]:
# remove rows with a missing startYear; IMDB marks missing values with the string '\N',
# so dropna() alone will not detect them
rated_titles_clean = rated_titles_clean[rated_titles_clean.startYear.apply(lambda x: str(x).isnumeric())].copy()
rated_titles_clean['startYear'] = rated_titles_clean['startYear'].astype(int)
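
An alternative to the filter above, shown only as a sketch on a toy TSV with made-up rows: if the `\N` marker is declared as a missing value when the file is read, pandas parses it as NaN and `dropna` works as usual. The names `sample_tsv`, `toy` and `clean` are hypothetical.

```python
import io
import pandas as pd

# Toy TSV in the IMDB layout; the last row has a missing year written as \N
sample_tsv = "tconst\tprimaryTitle\tstartYear\ntt001\tA\t1999\ntt002\tB\t\\N\n"

# Declare \N as a missing-value marker while reading
toy = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["\\N"])
n_missing = int(toy["startYear"].isna().sum())

# Now dropna detects the missing year, and the column converts cleanly to int
clean = toy.dropna(subset=["startYear"]).copy()
clean["startYear"] = clean["startYear"].astype(int)
print(n_missing, clean)
```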

In [None]:
rated_titles_clean.info()

In [None]:
print(net_clean.columns,"\n",rated_titles_clean.columns)

<a id='merge2'></a>
## Merging the netflix and IMDB data
* After cleaning both data frames and adjusting their dtypes, they are ready to be merged. I will merge on the title and the release year (called start year in the IMDB dataset).

In [None]:
df = pd.merge(net_clean, rated_titles_clean, left_on=['title','release_year'], right_on=['primaryTitle','startYear'])
df.head()
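
To see how many Netflix titles actually find an IMDB match, a left merge with `indicator=True` can be used as a check. This is only a sketch on toy frames with made-up titles and ratings; the frame names `netflix` and `imdb` are hypothetical stand-ins for `net_clean` and `rated_titles_clean`.

```python
import pandas as pd

# Toy stand-ins with made-up rows
netflix = pd.DataFrame({
    "title": ["show a", "show b", "show c"],
    "release_year": [2017, 2018, 2020],
})
imdb = pd.DataFrame({
    "primaryTitle": ["show a", "show a", "show b"],
    "startYear": [2017, 2019, 2018],
    "averageRating": [8.0, 5.0, 7.5],
})

# A left merge keeps every Netflix row; _merge says whether a match was found
checked = netflix.merge(
    imdb,
    how="left",
    left_on=["title", "release_year"],
    right_on=["primaryTitle", "startYear"],
    indicator=True,
)
match_counts = checked["_merge"].value_counts()
print(match_counts)
```

Matching on the year as well as the title keeps same-named titles from different years apart ("show a" from 2019 is not matched here).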

In [None]:
df.info()

<a id='clean2'></a>
## Assessing and Cleaning after merge

### Columns to delete:
* show_id
* originalTitle
* primaryTitle (there is a title column already)
* type (because titleType is more detailed)
* endYear
* duration

### Missing values:
* Directors: often missing because TV shows usually have show creators instead, and each episode may have a different director
* Cast members
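
One way to check the point about directors is to compare the share of missing `director` values per type. The sketch below uses a toy frame with made-up rows; the `type` labels are assumed to follow the Netflix file.

```python
import pandas as pd

# Toy frame with made-up rows; None marks a missing director
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "director": ["Some Director", None, None, None],
})

# Share of rows with a missing director, per type
missing_share = toy["director"].isna().groupby(toy["type"]).mean()
print(missing_share)
```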

In [None]:
df_clean =df.copy()
df_clean.head()

### Dropping and Renaming columns

In [None]:
df_clean.drop(columns=['show_id','originalTitle','type','primaryTitle','endYear','duration'],inplace=True)

In [None]:
# renaming columns
df_clean.rename(columns={'titleType':'type','isAdult':'is_adult',
                         'startYear':'start_year','runtimeMinutes':'runtime',
                         'averageRating':'average_rating',
                         'numVotes':'num_votes'},inplace =True)
df_clean.info()

In [None]:
df_clean.runtime.unique()

In [None]:
df_clean.date_added.isnull().sum()

In [None]:
df_clean.is_adult.unique()

In [None]:
# keep only rows whose runtime is a number, then convert the column to int
df_clean = df_clean[df_clean.runtime.apply(lambda x: str(x).isnumeric())].copy()
df_clean['runtime'] = df_clean['runtime'].astype(int)

In [None]:
# is_adult holds 0/1 flags, so convert it to a boolean column
df_clean['is_adult'] = df_clean['is_adult'].astype(int).astype(bool)

In [None]:
# drop rows without a date_added, then keep only the year the title was added
df_clean = df_clean[df_clean.date_added.notna()].reset_index(drop=True)
# strip any stray whitespace before parsing the dates
df_clean['date_added'] = pd.to_datetime(df_clean['date_added'].str.strip())
df_clean['year_added'] = df_clean['date_added'].dt.year.astype(int)
df_clean.drop(columns=['date_added'],inplace=True)
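
A small sketch of the date step on toy values. It assumes the dates are written like "September 25, 2021" and may carry stray whitespace; the explicit `format` string is my assumption, not something taken from the dataset documentation.

```python
import pandas as pd

# Toy date_added column (made-up values), including a missing entry
toy = pd.DataFrame({"date_added": [" September 25, 2021", "January 1, 2020", None]})

# Drop missing dates, strip whitespace, then parse with an explicit format
toy = toy.dropna(subset=["date_added"]).reset_index(drop=True)
parsed = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y")
toy["year_added"] = parsed.dt.year
print(toy)
```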

In [None]:
df_clean.to_csv('clean_df.csv',index = False)

<a id='eda'></a>
# Exploratory Data Analysis (EDA) and Visualizations


In [None]:
df = pd.read_csv('clean_df.csv')
df.head()

### Top 10 popular shows on Netflix

In [None]:
#the shows with the most num_votes are treated as the most popular
votes = df.query('type == ["tvSeries", "tvEpisode","tvSpecial","tvMiniSeries","tvShort"] ')
top10 = votes.sort_values(by=['num_votes'],ascending = False)
top10[['title','num_votes']][:10]

In [None]:
plt.figure(figsize=[7,7])
plt.title('Top 10')
sns.barplot(x='num_votes',y='title',palette="vlag",data=top10[:10]);

### Are the top 10 good shows?

Rearranging the top 10 most popular shows by their rating

In [None]:
#Arranging the popular shows according to their rating
top10_2 = top10[:10].sort_values(by=['average_rating'],ascending = False)
top10_2[['title','num_votes','average_rating']]

In [None]:
plt.figure(figsize=[7,7])
plt.title('Top 10 re-arranged')
sns.barplot(x='average_rating',y='title',palette="light:#5A9",data=top10_2);


In [None]:
rating = df.query('type == ["tvSeries", "tvEpisode","tvSpecial","tvMiniSeries","tvShort"] ')
top10_3 = rating.sort_values(by=['average_rating'],ascending = False)
top10_3[['title','num_votes','average_rating']][:10]

In [None]:
plt.figure(figsize=[7,7])
plt.title('Top 10 rating')
sns.barplot(x='average_rating',y='title',palette="flare",data=top10_3[:10]);

In [None]:
df.type.value_counts().to_frame()

### Is Netflix for TV shows or movies?

In [None]:
# Pie chart comparing the number of movies to TV shows on Netflix
labels = 'movies', 'tv-shows'
n_movies = 4058  # hard-coded movie count; update it if the data changes
sizes = [n_movies, len(df) - n_movies]
plt.figure(figsize = [7,7])
plt.pie(sizes, labels=labels, autopct='%1.1f%%',explode=[0.05,0.1],shadow=True)
plt.title('Content on Netflix');
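
The two pie sizes could also be computed from the `type` column instead of being typed in by hand. This is a sketch on a toy frame with made-up rows. It assumes that every type outside the TV list used in the queries above counts as a movie, which may not hold for every IMDB title type.

```python
import pandas as pd

# TV title types, as used in the queries above
TV_TYPES = ["tvSeries", "tvEpisode", "tvSpecial", "tvMiniSeries", "tvShort"]

# Toy stand-in for df (made-up rows)
toy = pd.DataFrame({"type": ["movie", "movie", "tvSeries", "tvMiniSeries", "movie"]})

# Count TV rows and everything else (assumed here to be movies)
is_tv = toy["type"].isin(TV_TYPES)
sizes = [int((~is_tv).sum()), int(is_tv.sum())]
print(sizes)
```

The resulting `sizes` list has the same order as the `labels` tuple ('movies', 'tv-shows'), so it can be passed to `plt.pie` directly.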