<div style="position: relative;margin:auto;">
    <div style="font-size:30px; background: #2b2e4a; text-align:center; border-radius: 8px; padding: 10px; width: 500px;">
        <h1>Netflix - EDA </h1>
    </div>
</div>


In this project, we analyze the Netflix catalogue of movies and TV shows. The analysis addresses the following questions:
- How much Netflix content was produced in each country?
- How many titles are movies and how many are TV shows?
- What categories of content are available on Netflix? Which categories have the most and the fewest titles?
- How is Netflix content distributed across ratings?
- When was the content on Netflix released, and when was it added to the platform?
- Which age groups does Netflix content target?
- Which actors appear most often in Netflix content?
- How long are the movies and TV shows on Netflix?

We will answer these and similar questions in this project.

### Data Loading

In [None]:
# IMPORT THE NECESSARY PACKAGES
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import plotly.express as px
from collections import Counter

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [None]:
df_path = "../input/netflix-shows/netflix_titles.csv"

df = pd.read_csv(df_path)
df_copy = df.copy()

### Data Viewing

In [None]:
df.head(5)

In [None]:
# Parse "date_added" as datetimes and sort rows from oldest to newest.
# Stray whitespace is stripped first so that all values share one date format.
df['date_added'] = pd.to_datetime(df['date_added'].str.strip())
df = df.sort_values('date_added')

In [None]:
# Count the NaN values in each column
def check_nan_values(dataset):
    for col in dataset:
        # Use the dataset passed in, not the global df
        print("- {} = {}".format(col, dataset[col].isnull().sum()))

check_nan_values(df)
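
The same per-column count can also be obtained without a loop. The following is a minimal sketch on a small made-up frame (the `toy` data and the `nan_counts` / `nan_share` names are illustrative assumptions, not part of the Netflix data set); it also reports the share of missing rows, which can help decide whether to fill or drop a column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up)
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan],
    "rating": ["TV-MA", "PG", np.nan],
    "title": ["x", "y", "z"],
})

# Number of missing values per column, most affected columns first
nan_counts = toy.isnull().sum().sort_values(ascending=False)

# Percentage of rows that are missing in each column
nan_share = (toy.isnull().mean() * 100).round(1)

print(nan_counts)
print(nan_share)
```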

### Preparing data for analysis

In [None]:
# deleting unnecessary columns
del df['show_id']

In [None]:
# Fill NaN values in the "rating" column with the most frequent rating
df['rating'] = df['rating'].fillna(value=df['rating'].value_counts().idxmax())
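
To make the step above concrete, here is a small sketch of what filling with the most frequent value does. The `ratings` series is made-up example data; only `value_counts`, `idxmax`, and `fillna` from the cell above are used.

```python
import numpy as np
import pandas as pd

# Made-up ratings with one missing entry
ratings = pd.Series(["TV-MA", "PG", "TV-MA", np.nan, "R"])

# value_counts() skips NaN by default, so idxmax() returns the most common real rating
most_common = ratings.value_counts().idxmax()

# Replace the missing entry with that rating
filled = ratings.fillna(most_common)
print(most_common)
print(filled.tolist())
```

Note that this kind of imputation slightly inflates the count of the most common rating, which is usually acceptable when only a handful of rows are missing.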

In [None]:
# Drop the rows with NaN in the "date_added" column (10 rows)
df.dropna(subset=['date_added'],inplace=True)

In [None]:
# changing the values of the director from NaN to "unknown"
df['director'] = df['director'].fillna("unknown")

In [None]:
# changing the cast values from NaN to "unknown"
df['cast'] = df['cast'].fillna("unknown")

In [None]:
check_nan_values(df)

In [None]:
# Browsing unique countries
df.country.unique()[10:20]

In [None]:
# changing the country values from NaN to "other"
df.country = df.country.fillna("other")

A string problem arises here. Many movies and TV shows are associated with more than one country. However, the data set stores these countries as a single comma-separated string rather than as a list, so individual countries cannot be matched or counted directly. To fix this, I will convert each string into a list of strings.

The same problem exists in the "listed_in" and "cast" columns, so I will apply the same method to them.

In [None]:
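def fix_country_col(data):
    # Split "A, B" into ["A", "B"], stripping the space left after each comma
    new_col = []
    for row in data["country"]:
        new_col.append([name.strip() for name in row.split(",")])
    return new_col


def fix_cast_col(data):
    # Same treatment for the comma-separated cast names
    new_col = []
    for row in data["cast"]:
        new_col.append([name.strip() for name in row.split(",")])
    return new_col

def fix_listed_in_col(data):
    # Lowercase, treat "&" as a separator, drop the "tv" token, then split
    new_col = []
    for row in data["listed_in"]:
        parts = row.lower().replace("&", ",").replace("tv", "").split(",")
        new_col.append([part.strip() for part in parts])
    return new_col

df['country'] = fix_country_col(df)
df['listed_in'] = fix_listed_in_col(df)
df['cast'] = fix_cast_col(df)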
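
Once a column holds lists, one possible way to count titles per country is `DataFrame.explode`, which produces one row per list element. The sketch below uses a made-up `toy` frame (titles and countries are illustrative assumptions) and is not necessarily how the later analysis in this notebook proceeds.

```python
import pandas as pd

# Made-up titles; "country" is a comma-separated string, as in the raw data
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "country": ["United States, India", "India", "other"],
})

# Split on commas and strip the space that follows each comma
toy["country"] = toy["country"].str.split(",").apply(
    lambda names: [name.strip() for name in names]
)

# One row per (title, country) pair
per_country = toy.explode("country")

# Number of titles associated with each country
country_counts = per_country["country"].value_counts()
print(country_counts)
```

Stripping matters here: without it, "India" and " India" would be counted as two different countries.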

In [None]:
# I don't need detailed date in "date_added" column. 
# I am converting the format from "year-month-day" to "year" format.
df['date_added'] = [col.strftime('%Y') for col in df['date_added']]
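
The cell above stores the year as a string. If numeric years are preferred (for example, for sorting or arithmetic), the `.dt` accessor offers an alternative. This is a small sketch on made-up dates; the names `dates`, `year_as_text`, and `year_as_int` are illustrative.

```python
import pandas as pd

# Made-up dates standing in for the parsed "date_added" column
dates = pd.Series(pd.to_datetime(["2019-11-01", "2020-01-15", "2020-06-30"]))

# Vectorized equivalent of the list comprehension: years as strings
year_as_text = dates.dt.strftime("%Y")

# Years as integers instead
year_as_int = dates.dt.year

print(year_as_text.tolist())
print(year_as_int.tolist())
```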