# EDA on Netflix Dataset

## **`Table of Contents`**

1. [**Introduction**](#1)
   1. [**1.1 Project Description**]
   2. [**1.2 Data Description**]
2. [**Acquiring and Loading Data**](#2)
   1. [**2.1 Importing Libraries:**]
   2. [**2.2 Loading Data**]
   3. [**2.3 Basic Data Exploration**]
   4. [**2.4 Areas to Fix**]
3. [**Data Preprocessing**](#3)
   1. [**3.1 Pre-processing Details:**]
   2. [**3.2 Rename Columns**]
   3. [**3.3 Drop Redundant Columns**]
   4. [**3.4 Changing Data Types**]
   5. [**3.5 Dropping Duplicates**]
   6. [**3.6 Handling Missing Values**]
   7. [**3.7 Handling Unreasonable Data Ranges**]
   8. [**3.8 Feature Engineering / Transformation**]
4. [**Data Analysis**](#4)
   1. [**4.1 Exploring `Column Name`**]
5. [**Conclusion**](#5)
   1. [**5.1 Insights**]
   2. [**5.2 Suggestions**]
   3. [**5.3 Possible Next Steps**]
6. [**Epilogue**](#6) 
   1. [**6.1 References**]
   2. [**6.2 Information about the Author:**]
   3. [**6.3 Versioning**]

---

# **`1. Introduction`**
 


[<img src="https://storage.googleapis.com/kaggle-datasets-images/434238/824878/30c0ef57882454a0419a348088aa2306/dataset-thumbnail.jpg?t=2019-12-04-06-00-44">]

##  **1.1 Project Description**

**Goal/Purpose:** 

This project is an Exploratory Data Analysis (EDA) of the dataset `netflix_titles.csv`, which contains information about movies and TV shows available on the Netflix platform. The goal is to provide insights into the dataset that can support content recommendation, trend analysis, and business strategy.

This notebook guides you through the EDA process and helps you practice your data analysis skills. By exploring and analyzing this dataset, you can learn how to extract valuable information, identify patterns, and draw meaningful conclusions. It can also serve as a reference for conducting EDA on similar datasets.

<p>&nbsp;</p>

**Questions to be Answered:**



**`Question No 01:`** What is the distribution of the "type" column (movie vs. TV show) over the years?

**`Question No 02:`** Which countries have the most content available on Netflix, and how has this changed over time?

**`Question No 03:`** What are the most common genres or "listed_in" categories for movies and TV shows, and how do they differ?

**`Question No 04:`** How has the average duration of movies and TV shows changed over the years?

**`Question No 05:`** Which directors or actors have the most titles in the dataset, and are there any trends or patterns in their content?

**`Question No 06:`** Are there any relationships between the rating, duration, and genre of the Netflix titles?

**`Question No 07:`** How does the distribution of release years differ between movies and TV shows?

**`Question No 08:`** Can you identify any seasonal or monthly patterns in the release of new content on Netflix?

**`Question No 09:`** Are there any notable differences in the distribution of titles between the "Documentaries", "Dramas", and "Comedies" genres?

**`Question No 10:`** Can you find any significant correlations between the variables in the dataset, and how might these insights be useful for content recommendation or marketing strategies?

<p>&nbsp;</p>

**Assumptions:** 

- The dataset is complete and accurate, with no major missing data or errors.
- The date_added column accurately represents when the title was added to Netflix.
- The metadata (director, cast, country, etc.) is reliable and up-to-date.
  
**Methodology:**

- Perform data cleaning and preprocessing (handle missing values, convert data types, etc.)
- Conduct univariate and bivariate analysis to understand the distribution and relationships between variables.
- Explore the data using visualization techniques (e.g., histograms, scatter plots, bar charts) to identify patterns and trends.
- Perform statistical analysis (e.g., correlation, regression) to uncover deeper insights.
- Document the findings and insights in a clear and organized manner.

**Scope:**

- This project will focus on analyzing the Netflix titles dataset and providing insights that can be useful for content providers, subscribers, and researchers.
- The analysis will be limited to the information available in the provided dataset and will not include any external data sources.
- The goal is to practice EDA skills and provide a comprehensive understanding of the Netflix content landscape.

<p>&nbsp;</p>

## **1.2 Data Description**

**Content:** 

This dataset is a CSV (Comma-Separated Values) file with 8,807 rows and 12 columns, containing information about movies and TV shows available on the Netflix platform.

**Description of Attributes:**

| Column       | Description                                                                                                    |
| ------------ | -------------------------------------------------------------------------------------------------------------- |
| show_id      | Unique identifier for each title                                                                           |
| type         | Indicates whether the title is a movie or a TV show                                                       |
| title        | Title of the movie or TV show                                                                              |
| director     | Director of the movie or TV show                                                                           |
| cast         | Actors in the movie or TV show                                                                             |
| country      | Country where the title was produced                                                                       |
| date_added   | Date the title was added to Netflix                                                                        |
| release_year | Year the title was originally released                                                                     |
| rating       | Content rating of the title                                                                                |
| duration     | Runtime of the movie or number of seasons for TV shows                                                     |
| listed_in    | Genres or categories the title belongs to                                                                  |
| description  | Brief description of the plot or premise of the title                                                      |

**Acknowledgements:**

This dataset covers titles available on Netflix and is hosted on [Kaggle](https://www.kaggle.com/datasets/shivamb/netflix-shows).

---

# **`2. Acquiring & Loading Data`**
 

In [2]:
# 2.0 Interactive Output Function
import ipywidgets as widgets
from IPython.display import display
from ipywidgets import interact

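def print_interact(show_summary, df):
    # `show_summary` must accept a `df` keyword argument;
    # `widgets.fixed` passes `df` through without creating a control for it
    interact(show_summary, df=widgets.fixed(df))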

## **2.1 Importing Libraries:**

In [3]:
# Data manipulation
import datetime
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

## **2.2 Loading Data**

In [4]:
# Load the dataset into a DataFrame
df = pd.read_csv('../../datasets/netflix_titles.csv')

## **2.3 Basic Data Exploration**

In [5]:
# Show rows and columns count
print(f"Rows count: {df.shape[0]}\nColumns count: {df.shape[1]}")

Rows count: 8807
Columns count: 12


In [6]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [7]:
df.tail()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


### *2.3.1 Check Data Types*

In [8]:
# Show data types
def info_of_data(df):
    # `interact` calls this function with `df` as a keyword argument
    df.info()
print_interact(info_of_data, df)



interactive(children=(Output(),), _dom_classes=('widget-interact',))

- `release_year` is an **integer**.
- `date_added` is stored as a string but should be a **datetime**.
- All remaining columns are **strings** (`object` dtype).

### *2.3.2 Check Missing Data*

In [9]:
# Print percentage of missing values

def missing_values(df):
    # Count missing values per column, largest first
    missing_counts = df.isnull().sum().sort_values(ascending=False)
    if missing_counts.sum():
        print('---- Percentage of Missing Values (%) -----')
        print(missing_counts[missing_counts > 0] / len(df) * 100)
    else:
        print('None')
print_interact(missing_values, df)

interactive(children=(Output(),), _dom_classes=('widget-interact',))

### *2.3.3 Check for Duplicate Rows*

In [10]:
# Show number of duplicated rows
print(f"No. of entirely duplicated rows: {df.duplicated().sum()}")

# Show duplicated rows
df[df.duplicated()]

No. of entirely duplicated rows: 0


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


### *2.3.4 Check Uniqueness of Data*

In [11]:
print('---- No of Unique Values -----')
df.nunique()

---- No of Unique Values -----


show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64

In [12]:
# Print 100 / (number of unique values) per column; a lower % means the column has more distinct values
print('---- Percentage Similarity of Values (%) -----')
print(100/df.nunique().sort_values())

---- Percentage Similarity of Values (%) -----
type            50.000000
rating           5.882353
release_year     1.351351
duration         0.454545
listed_in        0.194553
country          0.133690
date_added       0.056593
director         0.022085
cast             0.013001
description      0.011396
show_id          0.011355
title            0.011355
dtype: float64


### *2.3.5 Check Data Range*

In [13]:
# Print summary statistics (display explicitly, since only the last expression in a cell is shown)
display(df.describe(include='all'))

# Richer summary using the skimpy library
from skimpy import skim
skim(df)

### *2.3.6 Checking Value Counts of Categorical Columns*

In [14]:
# checking value counts of df for categorical columns   
print('---- Value Counts -----')
for col in df.select_dtypes(include='object').columns:
    if(df[col].nunique()<40):
        print(f'---- {col} ----')
        print(df[col].value_counts())

---- Value Counts -----
---- type ----
type
Movie      6131
TV Show    2676
Name: count, dtype: int64
---- rating ----
rating
TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
NC-17          3
UR             3
74 min         1
84 min         1
66 min         1
Name: count, dtype: int64


In [15]:
for col in df.select_dtypes(include='object').columns:
    print(f'---- {col} ----')
    if df[col].nunique() < 40:
        print(df[col].value_counts(), '\n')
    else:
        print('more than 40 unique values\n')

---- show_id ----
more than 40 unique values

---- type ----
type
Movie      6131
TV Show    2676
Name: count, dtype: int64 

---- title ----
more than 40 unique values

---- director ----
more than 40 unique values

---- cast ----
more than 40 unique values

---- country ----
more than 40 unique values

---- date_added ----
more than 40 unique values

---- rating ----
rating
TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
NC-17          3
UR             3
74 min         1
84 min         1
66 min         1
Name: count, dtype: int64 

---- duration ----
more than 40 unique values

---- listed_in ----
more than 40 unique values

---- description ----
more than 40 unique values





## **2.4 Areas to Fix**

**Data Types**
- The `duration` column should be numeric and use a single scale.
- The `date_added` column should be of datetime dtype.
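
The two data type fixes listed above could be sketched as follows. This is a minimal illustration on a small stand-in DataFrame, not the notebook's final preprocessing. The names `sample`, `duration_value`, and `duration_unit` are assumptions made for this example. Because minutes and seasons cannot be converted into each other, the sketch keeps the unit in a separate column rather than forcing one scale.

```python
import pandas as pd

# Tiny stand-in for the real data; values mimic the formats seen in df.head()
sample = pd.DataFrame({
    "date_added": ["September 25, 2021", " July 1, 2019", None],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})

# Parse dates: strip stray whitespace first, turn unparseable values into NaT
sample["date_added"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Split "90 min" / "2 Seasons" into a number and a unit
parts = sample["duration"].str.extract(
    r"(?P<duration_value>\d+)\s*(?P<duration_unit>\w+)"
)
sample["duration_value"] = pd.to_numeric(parts["duration_value"])

# Normalise "Seasons" to "Season" so each unit has a single spelling
sample["duration_unit"] = parts["duration_unit"].str.replace(r"s$", "", regex=True)
```

Movies and TV shows can then be analysed separately by filtering on `duration_unit`.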

**Missing Data**

| column_name | NA | NA % |
|----------------|:--:|:----:|
| show_id | 0 | 0 |
| type | 0 | 0 |
| title | 0 | 0 |
| director | 2634 | 29.91 |
| cast | 825 | 9.37 |
| country | 831 | 9.44 |
| date_added | 10 | 0.11 |
| rating | 4 | 0.05 |
| duration | 3 | 0.03 |
| listed_in | 0 | 0 |
| description | 0 | 0 |
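
A summary like the table above can be built in one step instead of being copied by hand. The sketch below uses a small stand-in DataFrame; `sample` and `na_summary` are names assumed for this example.

```python
import pandas as pd

# Stand-in data with a few gaps
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["PG", "R", None, "G"],
})

# Missing values per column, as a count and as a percentage of all rows
na_count = sample.isna().sum()
na_summary = pd.DataFrame({
    "NA": na_count,
    "NA %": (na_count / len(sample) * 100).round(2),
})
na_summary.index.name = "column_name"
```

Running the same lines on `df` should yield one row per column, including columns without missing values.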

**Duplicate Rows**
- There are no entirely duplicated rows.

**Uniqueness of Data**

| column_name | 100 / nunique (%) |
|----------------|:--:|
| type | 50.000000 |
| rating | 5.882353 |
| release_year | 1.351351 |
| duration | 0.454545 |
| listed_in | 0.194553 |
| country | 0.133690 |
| date_added | 0.056593 |
| director | 0.022085 |
| cast | 0.013001 |
| description | 0.011396 |
| show_id | 0.011355 |
| title | 0.011355 |

---

# **`3. Data Preprocessing`**

## **3.1 Pre-processing Details:**

- Renaming Columns
- Drop Redundant Columns
- Changing Data Types
- Dropping Duplicates
- Handling Missing Values
- Handling Unreasonable Data Ranges
- Feature Engineering / Transformation

## **3.2 Rename Columns**

In [30]:
# # Rename columns to snake_case
# df = clean_columns(df, replace={})

In [13]:
# # Rename columns
# columns_to_rename = {}
# df.rename(columns=columns_to_rename, inplace=True)

In [14]:
# # Verify columns are renamed
# df.columns

## **3.3 Drop Redundant Columns**

In [15]:
# # Check the proportion of the most frequent value in each column
# print('---- Frequency of the Mode (%) -----')
# mode_dict = {col: (df[col].value_counts().iat[0] / df[col].size * 100) for col in df.columns}
# mode_series = pd.Series(mode_dict)
# mode_series

In [16]:
# # Show the value frequency of each column greater than the mode's threshold
# threshold = 80
# for col in mode_series[mode_series > threshold].index:
#     print(df[col].value_counts(dropna=False))
#     print()

In [17]:
# # Drop columns (specify columns to drop)
# cols_to_drop = []
# df.drop(columns=cols_to_drop, axis=1, inplace=True)

In [18]:
# # Verify columns dropped
# assert all(col not in df.columns for col in cols_to_drop)

In [19]:
# # Drop columns (specify column indices to drop)
# df.drop(df.columns[a:b], axis=1, inplace=True)

In [20]:
# # Verify columns dropped
# assert all(col not in df.columns for col in df.columns[a:b])

In [21]:
# # Drop columns (specify columns to keep)
# cols_to_keep = []
# df = df[cols_to_keep]

In [22]:
# # Verify columns dropped
# assert all(col in df.columns for col in cols_to_keep)

## **3.4 Changing Data Types**

In [23]:
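# # Convert columns to the right data types
# import pandas.api.types as ptypes  # needed for the `ptypes` calls below
# df[col] = df[col].astype('string')
# df[col] = df[col].astype('int')
# df[col] = pd.to_datetime(df[col])  # `infer_datetime_format` is deprecated since pandas 2.0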

# # Convert to categorical datatype
# col_cat = ptypes.CategoricalDtype(categories=['A', 'B', 'C'], ordered=True)
# df['col_cat'] = df['col_cat'].astype(col_cat)

In [24]:
# # Verify conversion
# assert ptypes.is_string_dtype(df[col])
# assert ptypes.is_numeric_dtype(df[col])
# cols_to_check = []
# assert all(ptypes.is_datetime64_any_dtype(df[col]) for col in cols_to_check)

## **3.5 Dropping Duplicates**

In [25]:
# # Drop entirely duplicated rows
# df.drop_duplicates(inplace=True, ignore_index=True)

In [26]:
# # Verify rows dropped
# assert df.duplicated().sum()==0

## **3.6 Handling Missing Values**
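
One possible way to handle the gaps found in section 2.4 is sketched below, as an illustration rather than the strategy this notebook commits to. It assumes that the columns with many gaps (`director`, `cast`, `country`) are worth keeping with an explicit placeholder, while the few rows missing `date_added` can be dropped. The placeholder `"Unknown"` and the names `sample`, `high_na_cols`, and `cleaned` are assumptions.

```python
import pandas as pd

# Stand-in data with the same kinds of gaps as the real dataset
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None, "Julien Leclercq", None],
    "cast": [None, "Ama Qamata", "Sami Bouajila", None],
    "country": ["United States", "South Africa", None, None],
    "date_added": [
        "September 25, 2021", "September 24, 2021", None, "September 24, 2021",
    ],
})

# Columns with many gaps: keep the rows, label the gap explicitly
high_na_cols = ["director", "cast", "country"]
cleaned = sample.copy()
cleaned[high_na_cols] = cleaned[high_na_cols].fillna("Unknown")

# Columns with very few gaps: dropping the affected rows loses little data
cleaned = cleaned.dropna(subset=["date_added"]).reset_index(drop=True)
```

Whether a placeholder or row removal is appropriate depends on the question being answered. For example, counting titles per director should exclude the `"Unknown"` group.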

## **3.7 Handling Unreasonable Data Ranges**

In [27]:
# # Drop affected rows (drop=True avoids adding the old index as a new column)
# df = df.loc[~((df['A'] == 0) | (df['B'] > 100))].reset_index(drop=True)

In [28]:
# # Verify rows dropped
# len(df)
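
The value counts in section 2.3.6 show three `rating` entries that look like runtimes (`74 min`, `84 min`, `66 min`), and `duration` has three missing values. This suggests, but does not prove, that these values landed in the wrong column. A minimal sketch of a guarded fix, on stand-in data with assumed names `sample` and `misplaced`:

```python
import pandas as pd

# Stand-in rows: two ratings that are really runtimes, with duration missing
sample = pd.DataFrame({
    "rating": ["PG-13", "74 min", "TV-MA", "84 min"],
    "duration": ["90 min", None, "2 Seasons", None],
})

# Only touch rows where `rating` looks like a runtime AND `duration` is empty
misplaced = (
    sample["rating"].str.fullmatch(r"\d+ min", na=False)
    & sample["duration"].isna()
)

# Move the value to `duration`, then mark `rating` as missing
sample.loc[misplaced, "duration"] = sample.loc[misplaced, "rating"]
sample.loc[misplaced, "rating"] = pd.NA
```

Requiring an empty `duration` keeps the step from overwriting any existing runtime.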

## **3.8 Feature Engineering / Transformation**

In [29]:
# # Get unique values of interested columns
# cols = []
# pd.unique(df[cols].values.ravel('k'))  # argument 'k' lists the values in the order of the cols 

In [30]:
# # Create custom function
# # Google style docstrings
# # https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
# def custom_function(param1: int, param2: str) -> bool:
#     """Example function with PEP 484 type annotations.

#     Args:
#         param1: The first parameter.
#         param2: The second parameter.

#     Returns:
#         The return value. True for success, False otherwise.

#     """

In [31]:
# # Apply function to multiple columns
# cols = []
# df_updated = df.copy()
# df_updated[cols] = df_updated[cols].applymap(custom_function)

# # Create new aggregated boolean column
# df_updated['bool'] = df_updated[cols].any(axis=1, skipna=False)
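
For this dataset, one transformation that may help with the genre questions is splitting the comma-separated `listed_in` column into one row per genre. The sketch below runs on stand-in data. The names `sample`, `genres`, `genre`, and `genre_counts` are assumptions for this example.

```python
import pandas as pd

# Stand-in rows using the comma-separated format seen in df.head()
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "listed_in": [
        "Documentaries",
        "International TV Shows, TV Dramas, TV Mysteries",
        "Comedies, Horror Movies",
    ],
})

# Split on commas, give each genre its own row, then strip whitespace
genres = (
    sample.assign(genre=sample["listed_in"].str.split(","))
    .explode("genre")
    .assign(genre=lambda d: d["genre"].str.strip())
    .reset_index(drop=True)
)

# Titles per genre, separately for movies and TV shows
genre_counts = genres.groupby(["type", "genre"]).size()
```

The same pattern should also apply to the `cast` and `country` columns, which use the same comma-separated format.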

---

# **`4. Data Analysis`**
 

Here is where your analysis begins. You can add different sections based on your project goals.
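
As an illustration of what such a section might contain, here is a minimal sketch for Question No 01 (movie vs. TV show over the years). It assumes that "over the years" refers to the year a title was added and that `date_added` has already been converted to datetime. The stand-in data and the names `sample`, `type_by_year`, and `year_added` are assumptions.

```python
import pandas as pd

# Stand-in data; `date_added` is assumed to be datetime already
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "date_added": pd.to_datetime(
        ["2019-11-20", "2019-07-01", "2020-01-11", "2021-09-25", "2021-09-24"]
    ),
})

# Count titles per (year added, type); missing combinations become 0
type_by_year = pd.crosstab(sample["date_added"].dt.year, sample["type"])
type_by_year.index.name = "year_added"
```

Calling `type_by_year.plot(kind="bar", stacked=True)` on the result would give a stacked bar chart per year.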

## **4.1 Exploring `Column Name`**

In [32]:
# Code and visualization

**Observations**
- Ob 1
- Ob 2
- Ob 3

---

# **`5. Conclusion`**
 



## **5.1 Insights**
State the insights/outcomes of your project or notebook.

## **5.2 Suggestions**

Make suggestions based on insights.

## **5.3 Possible Next Steps**
Areas to expand on:
- (if there is any)

---

# **`6. Epilogue`**

## **6.1 References**

This is how we use inline citation[<sup id="fn1-back">[1]</sup>](#fn1).

[<span id="fn1">1.</span>](#fn1-back) _Author (date)._ Title. Available at: https://website.com (Accessed: Date). 

> Use [https://www.citethisforme.com/](https://www.citethisforme.com/) to create citations.

---

## **6.2 Information about the Author:**

[<img src="https://media.licdn.com/dms/image/D4D03AQH8PR9DDb3VxQ/profile-displayphoto-shrink_200_200/0/1713280211622?e=2147483647&v=beta&t=5TpzxNZJRmU3_zjNLoRb-O2V9amv1-1rwM5OczG01ZY" width="20%">](https://www.facebook.com/groups/codanics/permalink/1872283496462303/ "Image")


**Mr. Shaheer Ali**

BS Computer Science\
[Youtube channel](https://www.youtube.com/channel/UCUTphw52izMNv9W6AOIFGJA)\
[Twitter](https://twitter.com/__shaheerali190)\
[Linkedin](https://www.linkedin.com/in/shaheer-ali-2761aa303/)\
[github](https://github.com/shaheeralics)\
[Kaggle](https://www.kaggle.com/shaheerali197)\
[Portfolio Website](https://shaheer.kesug.com)

## **6.3 Versioning**
Notebook and insights by Mr. Shaheer Ali.
- Version: 1.5.0
- Date: 2023-05-15