## Task 1: Data Cleaning and Preprocessing of Netflix Movies and TV Shows Dataset

*Objective*: Clean and prepare a raw dataset that contains nulls, duplicates, and inconsistent formats.

In [142]:
# importing the required libraries

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

*Dataset Overview*

The Netflix dataset consists of 8,807 titles, including both Movies and TV Shows, enriched with metadata like title, director, cast, country, date_added, release_year, rating, duration, and listed_in. This dataset offers a rich snapshot of Netflix’s global content strategy over time.

In [None]:
# importing the dataset

netflix_data = pd.read_csv(r'netflix_titles.csv')

In [172]:
# first look at the imported dataset

# top rows of the dataset
netflix_data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,day_of_week_added,duration_minutes
0,s1,Movie,dick johnson is dead,kirsten johnson,Not Available,united states,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2021,9,Saturday,90.0
1,s2,TV Show,blood & water,unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",south africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021,9,Friday,2.0
2,s3,TV Show,ganglands,julien leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",unknown,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,2021,9,Friday,1.0
3,s4,TV Show,jailbirds new orleans,unknown,Not Available,unknown,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",2021,9,Friday,1.0
4,s5,TV Show,kota factory,unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",india,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,2021,9,Friday,2.0


In [173]:
# last rows of the dataset
netflix_data.tail()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,day_of_week_added,duration_minutes
8802,s8803,Movie,zodiac,david fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",united states,2019-11-20,2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",2019,11,Wednesday,158.0
8803,s8804,TV Show,zombie dumb,unknown,Not Available,unknown,2019-07-01,2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",2019,7,Monday,2.0
8804,s8805,Movie,zombieland,ruben fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",united states,2019-11-01,2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,2019,11,Friday,88.0
8805,s8806,Movie,zoom,peter hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",united states,2020-01-11,2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",2020,1,Saturday,88.0
8806,s8807,Movie,zubaan,mozez singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",india,2019-03-02,2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...,2019,3,Saturday,111.0


In [174]:
# finding the shape of the dataset

netflix_data.shape

(8702, 16)

In [175]:
# finding the duplicate rows

netflix_data.duplicated().sum()



np.int64(0)
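
`duplicated()` with no arguments only flags rows that match in every column, so titles that differ only in spacing or letter case are not counted. The sketch below shows one way to look for such near-duplicates. The toy frame and the choice of `title` plus `release_year` as the comparison key are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame with made-up rows; the first two titles differ only in case and spacing.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Zoom", "zoom ", "Zodiac"],
    "release_year": [2006, 2006, 2007],
})

# Exact duplicates: every column must match.
exact = toy.duplicated().sum()

# Near-duplicates: normalize the title, then compare on a subset of columns.
normalized = toy.assign(title=toy["title"].str.strip().str.lower())
near = normalized.duplicated(subset=["title", "release_year"]).sum()

print("Exact duplicate rows:", exact)
print("Near-duplicate rows:", near)
```

Whether near-duplicates should be dropped depends on the data; here they are only counted.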

In [176]:
# finding the basic info of the dataset

netflix_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8702 entries, 0 to 8806
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   show_id            8702 non-null   object        
 1   type               8702 non-null   object        
 2   title              8702 non-null   object        
 3   director           8702 non-null   object        
 4   cast               8702 non-null   object        
 5   country            8702 non-null   object        
 6   date_added         8702 non-null   datetime64[ns]
 7   release_year       8702 non-null   int64         
 8   rating             8702 non-null   object        
 9   duration           8702 non-null   object        
 10  listed_in          8702 non-null   object        
 11  description        8702 non-null   object        
 12  year_added         8702 non-null   int64         
 13  month_added        8702 non-null   int32         
 14  day_of_week_a

In [177]:
# checking the null values in the dataset

netflix_data.isnull().sum()

show_id              0
type                 0
title                0
director             0
cast                 0
country              0
date_added           0
release_year         0
rating               0
duration             0
listed_in            0
description          0
year_added           0
month_added          0
day_of_week_added    0
duration_minutes     0
dtype: int64

### *Data Wrangling*

In [178]:
# Replacing the null values in the director column
# (assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)

netflix_data['director'] = netflix_data['director'].fillna('Unknown')
print('Null values in director column:', netflix_data['director'].isnull().sum())

# Replacing the null values in the cast column

netflix_data['cast'] = netflix_data['cast'].fillna('Not Available')
print('Null values in cast column:', netflix_data['cast'].isnull().sum())

# Replacing the null values in the country column

netflix_data['country'] = netflix_data['country'].fillna('Unknown')
print('Null values in country column:', netflix_data['country'].isnull().sum())



Null values in director column: 0
Null values in cast column: 0
Null values in country column: 0
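
The three replacements above can also be written as a single call by passing a column-to-placeholder dictionary to `DataFrame.fillna`. This is a minimal sketch on a toy frame with made-up values; the placeholders mirror the ones used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the raw data (values are made up for illustration).
toy = pd.DataFrame({
    "director": ["Kirsten Johnson", None],
    "cast": [None, "Ama Qamata"],
    "country": ["United States", None],
})

# One placeholder per column, applied in a single call that returns a new frame.
placeholders = {"director": "Unknown", "cast": "Not Available", "country": "Unknown"}
filled = toy.fillna(placeholders)

print(filled.isnull().sum())
```

Columns that are not listed in the dictionary are left untouched.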


In [179]:
# dropping the rows that have null values in date_added, rating, or duration

netflix_data.dropna(subset=['date_added','rating', 'duration'], inplace=True)

# extra safety check: keep only the rows that still have a date_added value
netflix_data = netflix_data[netflix_data['date_added'].notna()]

In [180]:
netflix_data.isnull().sum()

show_id              0
type                 0
title                0
director             0
cast                 0
country              0
date_added           0
release_year         0
rating               0
duration             0
listed_in            0
description          0
year_added           0
month_added          0
day_of_week_added    0
duration_minutes     0
dtype: int64

In [181]:
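# formatting the date_added column
# with errors='coerce', values that do not match the format exactly (e.g. stray whitespace) become NaT

netflix_data['date_added'] = pd.to_datetime(netflix_data['date_added'], format='%B %d, %Y', errors='coerce')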
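
Because the parse uses a strict `format` together with `errors='coerce'`, any value that does not match the pattern is silently turned into `NaT` rather than raising an error. The sketch below illustrates this on two made-up strings, one of which has a leading space, and shows that stripping whitespace first is one possible way to keep such rows. It is an illustration only and does not change the results reported in this notebook.

```python
import pandas as pd

# Made-up sample: the second value carries a leading space.
raw_dates = pd.Series(["September 25, 2021", " August 4, 2017"])

# Strict format + errors='coerce': the value that does not match becomes NaT.
strict = pd.to_datetime(raw_dates, format="%B %d, %Y", errors="coerce")

# Stripping whitespace first lets both values parse.
stripped = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")

print("NaT after strict parse:", strict.isna().sum())
print("NaT after strip + parse:", stripped.isna().sum())
```

Counting `NaT` values right after the conversion is a cheap way to see how many rows a strict format would discard.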



In [182]:
# Standardize column names in the dataset

netflix_data.columns = netflix_data.columns.str.strip().str.lower().str.replace(' ', '_')

In [183]:
# Standardize text values in the dataset

netflix_data['type'] = netflix_data['type'].str.strip().map({
    'Movie': 'Movie',
    'TV Show': 'TV Show',
    'Tv Show': 'TV Show',
})
netflix_data['title'] = netflix_data['title'].str.strip().str.lower()
netflix_data['director'] = netflix_data['director'].str.strip().str.lower()
netflix_data['cast'] = netflix_data['cast'].str.strip()
netflix_data['country'] = netflix_data['country'].str.strip().str.lower()
netflix_data['rating'] = netflix_data['rating'].str.strip().str.upper()
netflix_data['duration'] = netflix_data['duration'].str.strip()
netflix_data['listed_in'] = netflix_data['listed_in'].str.strip()
netflix_data['description'] = netflix_data['description'].str.strip()
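
One detail of the `type` standardization above: `Series.map` with a dictionary returns `NaN` for any label that is not a key in the dictionary. The following sketch, using made-up labels, shows how the unmapped values could be listed before they are relied on.

```python
import pandas as pd

# Same mapping as in the cell above.
type_map = {"Movie": "Movie", "TV Show": "TV Show", "Tv Show": "TV Show"}

# Made-up labels; 'tv show' is deliberately absent from the mapping.
labels = pd.Series([" Movie", "Tv Show", "tv show"])

mapped = labels.str.strip().map(type_map)

# Unmapped labels become NaN, so list them before relying on the column.
unmapped = labels[mapped.isna()]
print("Unmapped labels:", unmapped.tolist())
```

If the list is not empty, the mapping can be extended, or the affected rows can be reviewed by hand.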


In [184]:

# creating 4 new features for data exploration
netflix_data['year_added'] = netflix_data['date_added'].dt.year.astype(int)
netflix_data['month_added'] = netflix_data['date_added'].dt.month
netflix_data['day_of_week_added'] = netflix_data['date_added'].dt.day_name()

# numeric part of duration: minutes for Movies, number of seasons for TV Shows
netflix_data['duration_minutes'] = netflix_data['duration'].str.extract(r'(\d+)').astype(float)
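
Since `duration` holds minutes for Movies and a season count for TV Shows, a single numeric column mixes two units. A possible refinement, sketched here on a toy frame, keeps the two apart. The column names `movie_minutes` and `tv_seasons` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with one Movie and one TV Show (values taken from the head() output).
toy = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons"],
})

# expand=False returns a Series rather than a one-column DataFrame.
value = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)

# Keep the number only where the unit applies; other rows become NaN.
toy["movie_minutes"] = value.where(toy["type"] == "Movie")
toy["tv_seasons"] = value.where(toy["type"] == "TV Show")

print(toy)
```

With separate columns, an average runtime is computed over Movies only, without season counts pulling the number down.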


In [185]:
netflix_data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added,day_of_week_added,duration_minutes
0,s1,Movie,dick johnson is dead,kirsten johnson,Not Available,united states,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2021,9,Saturday,90.0
1,s2,TV Show,blood & water,unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",south africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021,9,Friday,2.0
2,s3,TV Show,ganglands,julien leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",unknown,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,2021,9,Friday,1.0
3,s4,TV Show,jailbirds new orleans,unknown,Not Available,unknown,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",2021,9,Friday,1.0
4,s5,TV Show,kota factory,unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",india,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,2021,9,Friday,2.0


In [186]:
# save the cleaned DataFrame to a new CSV file
netflix_data.to_csv("netflix_titles_cleaned.csv", index=False)
print("Cleaned data saved as 'netflix_titles_cleaned.csv'")

Cleaned data saved as 'netflix_titles_cleaned.csv'


In [187]:
netflix_data.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'year_added', 'month_added', 'day_of_week_added', 'duration_minutes'],
      dtype='object')

Data Cleaning & Preprocessing Summary:

* The raw dataset consists of 8,807 records with 12 columns. After cleaning and preprocessing, the cleaned dataset consists of 8,702 records with 16 columns.

* Missing values in key columns (director, cast, country, date_added, etc.) were handled as follows:

    * Replaced with 'Unknown' or 'Not Available' for categorical features.

    * date_added was converted to datetime64 for time-based insights after dropping the null values.

* Text normalization was applied using .str.strip(), .str.lower(), or .str.upper() for consistency.

* New features (year_added, month_added, day_of_week_added, and duration_minutes) were engineered for deeper insights.

* Key fields were converted to appropriate types (e.g., datetime64 for date_added and int64 for year_added).

* Saved the cleaned data as netflix_titles_cleaned.csv for future reference.
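
The points in this summary can be turned into a few quick checks on the cleaned frame. The helper below is a hypothetical sketch, not part of the original notebook, and is run here on a toy frame rather than the real data.

```python
import pandas as pd

def check_cleaned(df: pd.DataFrame) -> dict:
    """Return a few summary checks for a cleaned frame (hypothetical helper)."""
    return {
        "rows": len(df),
        "null_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "date_is_datetime": pd.api.types.is_datetime64_any_dtype(df["date_added"]),
    }

# Toy frame with made-up rows that mimic a few cleaned columns.
toy = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "date_added": pd.to_datetime(["2021-09-25", "2021-09-24"]),
    "year_added": [2021, 2021],
})

report = check_cleaned(toy)
print(report)
```

When the saved CSV is loaded again, date_added comes back as text unless it is parsed on read, for example with `pd.read_csv("netflix_titles_cleaned.csv", parse_dates=["date_added"])`.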