<div style="position: relative;margin:auto;">
    <div style="font-size:30px; background: #2b2e4a; text-align:center; border-radius: 8px; padding: 10px; width: 500px;">
        <h1>Netflix - EDA </h1>
    </div>
</div>


In this project, we analyze the data on movies and TV shows available on Netflix. The analysis addresses the following questions:
- How much Netflix content has been produced in each country?
- How many movies and how many TV shows are there?
- What categories of content are available on Netflix? Which categories have the most and the least content?
- How is Netflix content distributed across ratings?
- When was the content produced, and when was it added to Netflix?
- Which age groups does the content on Netflix target?
- Which actors are featured most often in Netflix content?
- How long are the movies and TV shows on Netflix?

We will answer these and similar questions in this project.

### Data Loading

In [None]:
# IMPORT THE NECESSARY PACKAGES
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import plotly.express as px
from collections import Counter

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [None]:
df_path = "../input/netflix-shows/netflix_titles.csv"

df = pd.read_csv(df_path)
df_copy = df.copy()

### Data Viewing

In [None]:
df.head(5)

In [None]:
# Sort rows from oldest to newest based on the "date_added" column
df['date_added'] = pd.to_datetime(df['date_added'])
df = df.sort_values('date_added')

In [None]:
# Analyzing NaN values
def check_nan_values(dataset):
    # Count missing values per column of the DataFrame that was passed in
    for col in dataset:
        print("- {} = {}".format(col, dataset[col].isnull().sum()))

check_nan_values(df)

### Preparing data for analysis

In [None]:
# deleting unnecessary columns
del df['show_id']

In [None]:
# fill missing rating values with the most frequent rating
df['rating'] = df['rating'].fillna(value=df['rating'].value_counts().idxmax())

In [None]:
# drop the rows with NaN in the date_added column (10 rows)
df.dropna(subset=['date_added'], inplace=True)

In [None]:
# changing the values of the director from NaN to "unknown"
df['director'] = df['director'].fillna("unknown")

In [None]:
# changing the cast values from NaN to "unknown"
df['cast'] = df['cast'].fillna("unknown")

In [None]:
check_nan_values(df)

In [None]:
# Browsing unique countries
df.country.unique()[10:20]

In [None]:
# changing the country values from NaN to "other"
df.country = df.country.fillna("other")

A string problem arises here. Many movies and TV shows are associated with more than one country, but the data set stores those countries as a single comma-separated string rather than as a list, so it is not clear which movie or series belongs to which country. To fix this, I will split each string into a list of strings.

The same problem exists in the "listed_in" and "cast" columns, so I will apply the same method to them.
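As a minimal sketch of the idea on made-up rows (the `toy` frame and its values are hypothetical, not taken from the data set), splitting a comma-separated column into lists makes each country countable on its own. The use of `explode` here is an alternative illustration, not the method used in the cells of this notebook:

```python
import pandas as pd

# Hypothetical toy rows that mimic the comma-separated "country" column
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "other", "France,Belgium, Germany"],
})

# Split each string on commas and strip stray whitespace from every item
toy["country"] = toy["country"].str.split(",").apply(
    lambda items: [item.strip() for item in items]
)

# One row per (title, country) pair, which can then be counted directly
exploded = toy.explode("country").reset_index(drop=True)
country_counts = exploded["country"].value_counts()
print(country_counts)
```

Stripping whitespace at split time avoids treating "India" and " India" as different countries later on.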

In [None]:
def fix_country_col(data):
    new_col = []
    for row in data["country"]:
        new_col.append(row.split(","))
    return new_col

def fix_cast_col(data):
    new_col = []
    for row in data["cast"]:
        new_col.append(row.split(","))
    return new_col

def fix_listed_in_col(data):
    new_col = []
    for row in data["listed_in"]:
        new_col.append(row.lower().replace("&",",").replace("tv","").split(","))
    return new_col

df['country'] = fix_country_col(df)
df['listed_in'] = fix_listed_in_col(df)
df['cast'] = fix_cast_col(df)

In [None]:
# I don't need detailed date in "date_added" column. 
# I am converting the format from "year-month-day" to "year" format.
df['date_added'] = [col.strftime('%Y') for col in df['date_added']]

In [None]:
df.head(5)

## What are the types of content available on Netflix? Comparison.

In [None]:
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot

# Name the columns explicitly so the result does not depend on the pandas version
types = df['type'].value_counts().rename_axis('type').reset_index(name='count')

trace = go.Pie(labels=types['type'], values=types['count'],
               pull=[0.1, 0], marker=dict(colors=["#fed049", "#007580"]),
               title="Netflix Content Types")
fig = go.Figure([trace])
fig.show()

The chart compares the number of movie titles with the number of TV show titles. It would be wrong, however, to read it as a direct comparison of the amount of content in each type. Most series run for more than one season, and seasons are not counted here. For example, The Walking Dead has 9 seasons, but neither its seasons nor its episodes are reflected in the chart above. The chart only counts distinct titles.
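To make the difference between counting titles and counting seasons concrete, here is a small sketch on made-up rows. It assumes that the `duration` column stores TV shows as text such as "9 Seasons" or "1 Season"; that format, like the `toy` frame itself, is an assumption for illustration:

```python
import pandas as pd

# Hypothetical rows; the "N Season(s)" duration format is an assumption
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "120 min", "9 Seasons", "1 Season"],
})

# Counting titles: every row counts once, as in the pie chart
title_counts = toy["type"].value_counts()

# Counting seasons: extract the leading number from the TV show durations
tv = toy[toy["type"] == "TV Show"].copy()
tv["seasons"] = tv["duration"].str.extract(r"(\d+)", expand=False).astype(int)
total_seasons = int(tv["seasons"].sum())

print(title_counts)
print("Total seasons:", total_seasons)
```

In this toy example the two types have the same number of titles, yet the TV shows account for ten seasons, which the title count alone does not show.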

## What are the categories of content? How much content has been produced in each category?

In [None]:
def get_categories(data):
    categories = {}
    for listed_in in data['listed_in']:
        for category in listed_in:
            category = category.lower().strip()
            if category in categories: # increase current category count
                categories[category] = categories[category] + 1
            else: # create new category in categories object
                categories[category] = 1
    return pd.DataFrame(categories.values(), index= categories.keys())

categories = get_categories(df).reset_index()
categories.columns = ["category", "count"]

In [None]:
sorted_category = categories.sort_values(by="count")
trace = go.Bar(x=sorted_category['count'], y=sorted_category['category'], orientation="h",
               marker_color='MediumPurple')
# The chart shows categories, so the title names categories rather than countries
layout = go.Layout(title="Categories with most content", height=700,
                   legend=dict(x=0.1, y=1.1, orientation="h"))
fig = go.Figure([trace], layout=layout)
fig.show()

A TV series or movie can belong to multiple genres. For example, a horror movie can also fall into the thriller category. That is why I counted every category a title is listed under, so a title with several categories contributes to each of them. The result is a ranking of categories across all Netflix content, with movies and TV shows combined.
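The same per-category counting can be sketched more compactly with `collections.Counter`. The `toy_listed_in` list below is hypothetical and only stands in for the list-valued "listed_in" column:

```python
from collections import Counter

# Hypothetical titles, each tagged with one or more categories
toy_listed_in = [
    ["horror movies", " thrillers"],
    ["Thrillers", "dramas"],
    ["dramas"],
]

# Normalize each tag, then count it once per title it appears on
category_counts = Counter(
    category.strip().lower()
    for listed_in in toy_listed_in
    for category in listed_in
)
total_tags = sum(category_counts.values())

print(category_counts.most_common())
```

Because a title contributes to every category it carries, the counts add up to more than the number of titles (five tags for three titles here).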

## How much content was added to Netflix each year? I want to examine TV shows and movies separately.

In [None]:
# Name the columns explicitly so the result does not depend on the pandas version
movies = df[df["type"]=="Movie"]['date_added'].value_counts().rename_axis('year').reset_index(name='count')
tv_shows = df[df["type"]=="TV Show"]['date_added'].value_counts().rename_axis('year').reset_index(name='count')

# sorting by years
movies = movies.sort_values(by="year")
tv_shows = tv_shows.sort_values(by="year")

trace1 = go.Bar(x=movies['year'],
                    y=movies['count'],
                    name="Movies",
                    marker_color='MediumPurple')
trace2 = go.Bar(x=tv_shows['year'],
                    y=tv_shows['count'],
                    name="TV Shows",
                    marker_color='DarkSlateGrey')
layout = go.Layout(title="Number of content additions by year", height=500)
fig = go.Figure([trace1,trace2], layout=layout)
fig.show()

## What are the actual release years of the uploaded content?

In [None]:
# Name the columns explicitly so the result does not depend on the pandas version
movies = df[df["type"]=="Movie"]['release_year'].value_counts().rename_axis('year').reset_index(name='count')
tv_shows = df[df["type"]=="TV Show"]['release_year'].value_counts().rename_axis('year').reset_index(name='count')

# sorting by years
movies = movies.sort_values(by="year")
tv_shows = tv_shows.sort_values(by="year")

trace1 = go.Scatter(x=movies['year'],
                    y=movies['count'],
                    name="Movies",
                    marker_color='MediumPurple')
trace2 = go.Scatter(x=tv_shows['year'],
                    y=tv_shows['count'],
                    name="TV Shows",
                    marker_color='DarkSlateGrey')
layout = go.Layout(title="Production years of content", height=500)
fig = go.Figure([trace1,trace2], layout=layout)
fig.show()

### How much content has been produced or published in each country?

In [None]:
# Load the country-code lookup table once instead of on every function call
country_codes = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv')

def get_country_code(country):
    # Handle names that are spelled differently in, or missing from, the lookup table
    if country == "South Korea":
        return "KOR"
    elif country == "North Korea":
        return "PRK"
    elif country == "West Germany" or country == "East Germany":
        return "DEU"
    elif country == "Bahamas":
        return "BHM"
    elif country == "Soviet Union":
        return "RUS"
    else:
        # Return the first matching code, or None if the country is not in the table
        match = country_codes.loc[country_codes["COUNTRY"] == country, "CODE"]
        return match.iloc[0] if not match.empty else None


def get_countries(data):
    countries = {}
    for cs in data['country']:
        for country in cs:
            if country == "other":
                continue
            country = country.strip()
            if country in countries: # increase current country count
                countries[country][0] += 1
            else: # add the country only if a code is known for it
                code = get_country_code(country)
                if code is not None:
                    countries[country] = [1, code]

    return pd.DataFrame(countries.values(), index=countries.keys())

countries = get_countries(df).reset_index()
countries.columns = ["country", "count", "code"]

In [None]:
fig = px.choropleth(countries, 
                    locations="code",
                    color="count",
                    hover_name="country",
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()

### What are the ratings of content on Netflix, and how many titles have each rating?

In [None]:
ratings = df.groupby("rating").size().reset_index()
ratings.columns = ["rating", "size"]

trace = go.Bar(x=ratings['rating'],
               y=ratings['size'],
               marker_color='MediumPurple')
layout = go.Layout(title="Ratings", height=500)
fig = go.Figure([trace], layout=layout)
fig.show()

### Which actors appear in the most content?

In [None]:
def get_cast(data):
    casts = {}
    for c in data['cast']:
        for cast in c:
            cast = cast.lower().strip()
            if cast in casts: # increase current cast count
                casts[cast] = casts[cast] + 1
            else: # create new cast in casts object
                casts[cast] = 1
    return pd.DataFrame(casts.values(), index= casts.keys())

casts = get_cast(df).reset_index()
casts.columns = ["cast", "count"]
sorted_cast = casts[casts["cast"] != "unknown"].sort_values(by="count", ascending=False)

In [None]:
top20_casts = sorted_cast[0:20]

fig = px.funnel(top20_casts, x="count", y="cast", color='count')
fig.show()

In [None]:
df

### What is the duration of the movies? 

In [None]:
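# Work on a copy so the assignments below do not write to a slice of df
movies = df[df['type'] == "Movie"].copy()
movies['duration'] = movies['duration'].str.replace(' min','')
movies['duration'] = movies['duration'].astype(str).astype(int)

sns.set(style="darkgrid")
# fill replaces the deprecated shade argument (seaborn 0.11 and later)
sns.kdeplot(data=movies['duration'], fill=True)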

### What are the 10 top-rated movies on Netflix?

In [None]:
imdb_ratings=pd.read_csv('/kaggle/input/imdb-extensive-dataset/IMDb ratings.csv',usecols=['weighted_average_vote'])
imdb_titles=pd.read_csv('/kaggle/input/imdb-extensive-dataset/IMDb movies.csv', usecols=['title','year','genre'])

ratings = pd.DataFrame({'Title':imdb_titles.title,
                    'Release Year':imdb_titles.year,
                    'Rating': imdb_ratings.weighted_average_vote,
                    'Genre':imdb_titles.genre})
ratings.drop_duplicates(subset=['Title','Release Year','Rating'], inplace=True)

# dropna returns a new DataFrame, so the result has to be assigned back
ratings = ratings.dropna()
# match IMDb ratings to Netflix titles by title text only
joint_data=ratings.merge(df_copy,left_on='Title',right_on='title',how='inner')
joint_data=joint_data.sort_values(by='Rating', ascending=False)

In [None]:
top_rated=joint_data[0:10]
fig =px.sunburst(
    top_rated,
    path=['title','country'],
    values='Rating',
    color='Rating')
fig.show()