Loading the dataset

In [26]:
import pandas as pd

In [27]:
df = pd.read_csv('../data/netflix_titles.csv')
df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [28]:
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

Rows: 8807, Columns: 12


Checking for missing values

In [30]:
print("\nMissing values:")
df.isnull().sum()


Missing values:
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
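
Raw counts are easier to judge next to the share of rows they represent. The sketch below reports both per column. It runs on a small toy frame rather than the real file, and the names `toy` and `missing` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["PG", "TV-MA", None, "R"],
})

# Count and percentage of missing values per column, worst first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Applied to `df`, the same pattern would show how large a share of rows the 2634 missing directors represent.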

Handling missing values

In [16]:
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
df['rating'] = df['rating'].fillna('Unrated')
df['duration'] = df['duration'].fillna('Unknown')

In [17]:
print("Missing values after cleaning:")
print(df.isnull().sum())

Missing values after cleaning:
show_id          0
type             0
title            0
director         0
cast             0
country          0
date_added      10
release_year     0
rating           0
duration         0
listed_in        0
description      0
dtype: int64
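
The five column-by-column assignments above can also be written as one `fillna` call with a dictionary of per-column defaults. This is a minimal sketch on toy data, and `fill_values` and `cleaned` are hypothetical names. Columns not listed in the dictionary, such as `date_added`, are left untouched, which matches the 10 remaining missing dates in the output above.

```python
import pandas as pd

# Toy frame with gaps in several columns (illustrative values only).
toy = pd.DataFrame({
    "director": ["X", None],
    "cast": [None, "Y"],
    "rating": [None, "PG"],
    "date_added": ["September 25, 2021", None],
})

# One placeholder per column; columns not listed here keep their NaN values.
fill_values = {"director": "Unknown", "cast": "Unknown", "rating": "Unrated"}
cleaned = toy.fillna(value=fill_values)

print(cleaned.isnull().sum())
```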


Fixing data types (date_added)

In [18]:
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

In [19]:
print(df['date_added'].head(10))
print("Missing dates:", df['date_added'].isna().sum())

0   2021-09-25
1   2021-09-24
2   2021-09-24
3   2021-09-24
4   2021-09-24
5   2021-09-24
6   2021-09-24
7   2021-09-24
8   2021-09-24
9   2021-09-24
Name: date_added, dtype: datetime64[ns]
Missing dates: 98
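
The count of missing dates rose from 10 before parsing to 98 after it, so `errors='coerce'` silently turned 88 non-empty strings into NaT. One plausible cause, stated here as an assumption and not verified against the file, is stray whitespace around some date strings. The sketch below strips whitespace and passes an explicit format, using toy values; `raw` and `parsed` are illustrative names.

```python
import pandas as pd

# Toy series: one clean date, one with a leading space, one truly missing.
raw = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip surrounding whitespace first, then parse with an explicit format
# so that every row is held to the same "Month day, Year" pattern.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print("Missing dates:", parsed.isna().sum())
```

If the same change were applied to `df['date_added']`, comparing the missing count before and after parsing would confirm whether whitespace was in fact the cause.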


Cleaning and splitting the duration column

In [21]:
df[['duration_int', 'duration_type']] = df['duration'].str.extract(r'(\d+)\s*(\w+)')
df['duration_int'] = pd.to_numeric(df['duration_int'], errors='coerce')

In [22]:
print(df[['duration', 'duration_int', 'duration_type']].head(10))

    duration  duration_int duration_type
0     90 min          90.0           min
1  2 Seasons           2.0       Seasons
2   1 Season           1.0        Season
3   1 Season           1.0        Season
4  2 Seasons           2.0       Seasons
5   1 Season           1.0        Season
6     91 min          91.0           min
7    125 min         125.0           min
8  9 Seasons           9.0       Seasons
9    104 min         104.0           min
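
`duration_int` prints as a float (`90.0`) because rows where the pattern does not match, such as the `'Unknown'` placeholder, become NaN, and a plain integer column cannot hold NaN. A possible refinement, sketched on toy data, is to cast to pandas' nullable `Int64` dtype and to merge the `Season`/`Seasons` spellings into one unit. Neither step is in the original notebook, and `parts` is a hypothetical name.

```python
import pandas as pd

# Toy durations, including the 'Unknown' placeholder used for missing values.
toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season", "Unknown"]})

# Named groups give the extracted columns their final names directly.
parts = toy["duration"].str.extract(r"(?P<duration_int>\d+)\s*(?P<duration_type>\w+)")

# Nullable integer dtype keeps whole numbers whole and stores gaps as <NA>.
parts["duration_int"] = pd.to_numeric(parts["duration_int"], errors="coerce").astype("Int64")

# Treat 'Season' and 'Seasons' as the same unit.
parts["duration_type"] = parts["duration_type"].replace({"Seasons": "Season"})

print(parts)
```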


Adding release decade and month

In [23]:
df['release_decade'] = (df['release_year'] // 10) * 10
df['month_added'] = df['date_added'].dt.month_name()

In [24]:
print(df[['release_year', 'release_decade', 'date_added', 'month_added']].head(10))

   release_year  release_decade date_added month_added
0          2020            2020 2021-09-25   September
1          2021            2020 2021-09-24   September
2          2021            2020 2021-09-24   September
3          2021            2020 2021-09-24   September
4          2021            2020 2021-09-24   September
5          2021            2020 2021-09-24   September
6          2021            2020 2021-09-24   September
7          1993            1990 2021-09-24   September
8          2021            2020 2021-09-24   September
9          2021            2020 2021-09-24   September
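
Both derived columns behave predictably on edge cases: floor division maps any year to the start of its decade, and rows whose `date_added` is NaT get a missing `month_added`. The toy example below illustrates this. The extra `month_num` column is a hypothetical addition that may help when months need to be sorted in calendar order.

```python
import pandas as pd

# Toy frame with one missing date (illustrative values only).
toy = pd.DataFrame({
    "release_year": [1993, 2020, 2021],
    "date_added": pd.to_datetime(["2021-09-24", None, "2019-01-01"]),
})

# Floor-divide by 10, then multiply back: 1993 -> 1990, 2021 -> 2020.
toy["release_decade"] = (toy["release_year"] // 10) * 10

# Month name for display; NaT rows stay missing.
toy["month_added"] = toy["date_added"].dt.month_name()

# Numeric month (hypothetical extra column) for calendar-order sorting.
toy["month_num"] = toy["date_added"].dt.month

print(toy)
```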


Saving the cleaned dataset

In [25]:
df.to_csv('../data/netflix_cleaned.csv', index=False)

In [31]:
pd.read_csv('../data/netflix_cleaned.csv').head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,duration_int,duration_type,release_decade,month_added
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",90.0,min,2020,September
1,s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2.0,Seasons,2020,September
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Unknown,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,1.0,Season,2020,September
3,s4,TV Show,Jailbirds New Orleans,Unknown,Unknown,Unknown,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",1.0,Season,2020,September
4,s5,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,2.0,Seasons,2020,September
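
CSV stores no dtype information, so a plain `pd.read_csv` returns `date_added` as text again. A minimal sketch of reloading with `parse_dates` follows. It round-trips a toy frame through an in-memory buffer in place of the real file, and `buffer` and `reloaded` are illustrative names.

```python
import io

import pandas as pd

# Toy frame with a datetime column, including one missing date.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "date_added": pd.to_datetime(["2021-09-25", None]),
})

# Write to an in-memory buffer in place of '../data/netflix_cleaned.csv'.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when the file is read back.
reloaded = pd.read_csv(buffer, parse_dates=["date_added"])

print(reloaded.dtypes)
```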
