# Netflix Data Cleaning, Analysis and Visualization

---

This project uses a cleaned Netflix titles dataset containing information about movies and TV shows added to Netflix between roughly 2008 and 2021, with original release years ranging from 1925 to 2021. The dataset is intended for practicing data cleaning, exploratory analysis, and visualization with Python, SQL, and dashboarding tools.



## Dataset Description


*   **Domain:** Streaming content analytics (Netflix catalog).
*   **Size:** 8,790 rows and 10 columns.
*   **Scope:** Titles available on Netflix with metadata about type, date added, country, rating, duration, and genres.

### Main columns:

*   show_id: Unique identifier for each title.
*   type: Movie or TV Show.
*   title: Name of the content.
*   director: Director of the title.
*   country: Country or countries of production.
*   date_added: Date when the title was added to Netflix.
*   release_year: Original release year of the title.
*   rating: Maturity rating (e.g., TV-MA, TV-14, PG-13).
*   duration: Duration (minutes for movies or seasons for TV shows).
*   listed_in: Genre(s) / categories (comma-separated).

The dataset is already largely cleaned (duplicates and nulls have been handled), but it can still be processed further for deeper analysis and ML use cases.
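As an illustration of the further processing mentioned above, here is a minimal sketch that splits `duration` into a number and a unit and expands the comma-separated `listed_in` field into one row per genre. It runs on a few sample rows copied from the notebook's `head()` output rather than the full file, and the names `sample`, `duration_value`, `duration_unit`, and `genres` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Sample rows copied from the notebook's head() output; not the full dataset.
sample = pd.DataFrame({
    "title": ["Dick Johnson Is Dead", "Midnight Mass", "Sankofa"],
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["90 min", "1 Season", "125 min"],
    "listed_in": [
        "Documentaries",
        "TV Dramas, TV Horror, TV Mysteries",
        "Dramas, Independent Movies, International Movies",
    ],
})

# Split strings such as "90 min" or "1 Season" into a numeric value and a unit.
parts = sample["duration"].str.extract(
    r"^(?P<duration_value>\d+)\s+(?P<duration_unit>\w+)$"
)
sample["duration_value"] = parts["duration_value"].astype(int)
sample["duration_unit"] = parts["duration_unit"]

# Build one row per (title, genre) pair so genres can be counted directly.
genres = (
    sample.assign(genre=sample["listed_in"].str.split(", "))
    .explode("genre")[["title", "genre"]]
    .reset_index(drop=True)
)

print(sample[["title", "duration_value", "duration_unit"]])
print(genres)
```

Keeping the unit in its own column avoids mixing minutes and seasons in a single numeric column, which would otherwise distort any average or histogram of `duration_value`.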

---


# Problem Statement

Streaming platforms like Netflix must understand their content library composition and viewer offerings to optimize catalog strategy, regional content mix, and content acquisition decisions. This dataset provides historical information about titles on Netflix, including type, genres, country of origin, release year, and when they were added to the platform.

Using this data, the problem is to analyze how Netflix’s catalog has evolved over time (by type, genre, country, and rating), identify patterns such as dominant genres or key content-producing countries, and prepare the data for potential machine learning applications like recommendations and trend prediction. The project focuses on robust data cleaning, exploratory analysis, and insightful visualizations that help understand Netflix’s content strategy and catalog trends.

# Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

In [None]:
Netflix_data = pd.read_csv('/content/netflix1.csv')

In [None]:
Netflix_data.head()

| | show_id | type | title | director | country | date_added | release_year | rating | duration | listed_in |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | United States | 9/25/2021 | 2020 | PG-13 | 90 min | Documentaries |
| 1 | s3 | TV Show | Ganglands | Julien Leclercq | France | 9/24/2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... |
| 2 | s6 | TV Show | Midnight Mass | Mike Flanagan | United States | 9/24/2021 | 2021 | TV-MA | 1 Season | TV Dramas, TV Horror, TV Mysteries |
| 3 | s14 | Movie | Confessions of an Invisible Girl | Bruno Garotti | Brazil | 9/22/2021 | 2021 | TV-PG | 91 min | Children & Family Movies, Comedies |
| 4 | s8 | Movie | Sankofa | Haile Gerima | United States | 9/24/2021 | 1993 | TV-MA | 125 min | Dramas, Independent Movies, International Movies |


In [None]:
Netflix_data.shape

(8790, 10)

# Data Cleaning

In [None]:
# Count null values in each column
print(Netflix_data.isnull().sum())

show_id         0
type            0
title           0
director        0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
dtype: int64


In [None]:
print(Netflix_data.isnull().value_counts())

show_id  type   title  director  country  date_added  release_year  rating  duration  listed_in
False    False  False  False     False    False       False         False   False     False        8790
Name: count, dtype: int64
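`isnull()` only detects true missing values (`NaN`/`None`). Because the dataset description says nulls were already handled, it is possible that some were replaced with placeholder strings, which `isnull()` would not flag. The sketch below counts blank strings and placeholder tokens on a toy frame. The token list and the toy values are hypothetical assumptions for illustration, not confirmed contents of this dataset.

```python
import pandas as pd

# Toy frame for illustration only; the placeholder values are hypothetical.
toy = pd.DataFrame({
    "director": ["Kirsten Johnson", "Unknown", "  "],
    "country": ["United States", "France", "N/A"],
})

# Hypothetical placeholder tokens; adjust after inspecting the real columns.
placeholders = ["", "unknown", "n/a", "none"]

# Normalize whitespace and case before comparing against the token list.
normalized = toy.apply(lambda col: col.str.strip().str.lower())
placeholder_counts = normalized.isin(placeholders).sum()

print(placeholder_counts)
```

On the real data, the same check could be restricted to the text columns with `Netflix_data.select_dtypes(include="object")`.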


In [None]:
Netflix_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8790 entries, 0 to 8789
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8790 non-null   object
 1   type          8790 non-null   object
 2   title         8790 non-null   object
 3   director      8790 non-null   object
 4   country       8790 non-null   object
 5   date_added    8790 non-null   object
 6   release_year  8790 non-null   int64 
 7   rating        8790 non-null   object
 8   duration      8790 non-null   object
 9   listed_in     8790 non-null   object
dtypes: int64(1), object(9)
memory usage: 686.8+ KB


In [None]:
# Count fully duplicated rows
Netflix_data.duplicated().sum()

np.int64(0)

In [None]:
Netflix_data.drop_duplicates(inplace=True)

In [None]:
Netflix_data.shape

(8790, 10)
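The duplicate check above compares entire rows, so two records that differ in a single character count as distinct. A stricter variant, sketched here on a toy frame, checks only the key column. It assumes `show_id` is meant to be unique per title, as the column description suggests. The toy rows, including the trailing space in the third title, are hypothetical.

```python
import pandas as pd

# Toy frame: the third row repeats show_id "s3" but has a trailing space in
# the title, so a full-row comparison does not treat it as a duplicate.
toy = pd.DataFrame({
    "show_id": ["s1", "s3", "s3"],
    "title": ["Dick Johnson Is Dead", "Ganglands", "Ganglands "],
})

full_row_dupes = toy.duplicated().sum()
id_dupes = toy.duplicated(subset=["show_id"]).sum()

print("Full-row duplicates:", full_row_dupes)
print("Duplicate show_id values:", id_dupes)
```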

The checks above show no missing values and no duplicate rows, so the dataset needs no further cleaning at this stage. Let's analyze and visualize it.
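One optional preparation step before any time-based analysis: the `info()` output shows `date_added` stored as `object` (plain strings). The sketch below converts it to a datetime and derives year and month columns, using sample values from the `head()` output. The `%m/%d/%Y` format is inferred from values such as `9/25/2021`, and the names `year_added`, `month_added`, and `counts` are illustrative assumptions.

```python
import pandas as pd

# Sample values copied from the notebook's head() output.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "date_added": ["9/25/2021", "9/24/2021", "9/24/2021", "9/22/2021"],
})

# Parse month/day/year strings into proper datetime values.
sample["date_added"] = pd.to_datetime(sample["date_added"], format="%m/%d/%Y")

# Derive columns that are convenient for grouping and plotting.
sample["year_added"] = sample["date_added"].dt.year
sample["month_added"] = sample["date_added"].dt.month

# Number of titles added per year and content type.
counts = sample.groupby(["year_added", "type"]).size()
print(counts)
```

On the full dataset, the same conversion would apply to `Netflix_data["date_added"]`. Passing `errors="coerce"` to `pd.to_datetime` would turn any unparseable value into `NaT` instead of raising an error.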