# Netflix Dataset Cleaning – Task 1 (Data Analyst Internship)

**Objective:** Clean and preprocess the Netflix Movies and TV Shows dataset by handling missing values, checking for duplicates, standardizing inconsistent formatting, and normalizing column names.

**Tools Used:** Python, Pandas


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('netflix_titles.csv')

In [3]:
df.head()

  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                   cast        country  \
0                                                   NaN  United States   
1     Ama Qamata, Khosi Ngema, Gail Mabal

In [5]:
df.isnull().sum().sort_values(ascending=False)

director        2634
country          831
cast             825
date_added        10
rating             4
duration           3
show_id            0
type               0
title              0
release_year       0
listed_in          0
description        0
dtype: int64
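
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a toy frame; the `sample` data and the `missing_pct` name are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the Netflix frame; the values are made up.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, "Y", None],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = (
    sample.isnull().mean().mul(100).round(1).sort_values(ascending=False)
)
print(missing_pct)
```

On the real frame, the same expression applied to `df` would rank the columns by the percentage of missing rows.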

In [6]:
df['director'] = df['director'].fillna('Not Specified')

In [7]:
df['cast'] = df['cast'].fillna('Unknown')

In [8]:
df['country'] = df['country'].fillna('Unknown')

In [9]:
df['date_added'] = df['date_added'].fillna('Not Added')

In [10]:
df['rating'] = df['rating'].fillna('Not Rated')

In [11]:
df['duration'] = df['duration'].fillna('Unknown')

In [12]:
df.isnull().sum().sort_values(ascending=False)

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64
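
The per-column `fillna` calls above could also be written as a single call that takes a column-to-placeholder mapping. This is a sketch on toy data, not the notebook's own code; the `sample` frame and the `placeholders` name are assumptions, and `date_added` is left out here because it is converted to a datetime later.

```python
import pandas as pd

# Toy stand-in for the Netflix frame; the values are illustrative only.
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None],
    "cast": [None, "Ama Qamata"],
    "country": ["United States", None],
    "rating": ["PG-13", None],
    "duration": [None, "2 Seasons"],
})

# One placeholder per column, mirroring the values used in the notebook.
placeholders = {
    "director": "Not Specified",
    "cast": "Unknown",
    "country": "Unknown",
    "rating": "Not Rated",
    "duration": "Unknown",
}

# DataFrame.fillna accepts a dict keyed by column name.
filled = sample.fillna(placeholders)
print(filled.isnull().sum().sum())
```

Keeping the placeholders in one dictionary makes it easier to see at a glance which sentinel value each column receives.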

In [13]:
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates}")

Duplicate rows: 0
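
If `show_id` is unique for every row, as the sample rows suggest, a full-row `duplicated()` check cannot flag a title that was entered twice under different IDs. The sketch below checks a subset of columns instead. It runs on toy data, and the choice of `type`, `title`, and `release_year` as the key is an assumption rather than something the notebook specifies.

```python
import pandas as pd

# Toy data: rows s1 and s2 describe the same title under different IDs.
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Example Film", "Example Film", "Example Series"],
    "release_year": [2007, 2007, 2021],
})

# Full-row comparison: the differing show_id hides the repeat.
full_row_dupes = int(sample.duplicated().sum())

# Comparison on an assumed business key only.
key_dupes = int(sample.duplicated(subset=["type", "title", "release_year"]).sum())

print(f"Full-row duplicates: {full_row_dupes}, key duplicates: {key_dupes}")
```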


In [14]:
df['country'] = df['country'].str.strip().str.title()

In [15]:
df['rating'] = df['rating'].str.strip().str.upper()

In [16]:
df['duration'] = df['duration'].str.strip()

In [17]:
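# Revert the 'Not Added' placeholder from cell 9 to a missing value so the column can be parsed as dates
df['date_added'] = df['date_added'].replace('Not Added', pd.NaT)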

In [18]:
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

In [19]:
print(df['date_added'].dtype)

datetime64[ns]


In [20]:
df['date_added'].head()

0   2021-09-25
1   2021-09-24
2   2021-09-24
3   2021-09-24
4   2021-09-24
Name: date_added, dtype: datetime64[ns]
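
With `errors='coerce'`, any value that fails to parse silently becomes `NaT`, so it is worth counting the missing dates after the conversion. The sketch below also strips whitespace first, on the assumption (not checked in the notebook) that some date strings may carry leading or trailing spaces. The toy values and the explicit `"%B %d, %Y"` format are illustrative choices.

```python
import pandas as pd

# Toy values: a clean date, a date with a leading space, and the placeholder.
raw = pd.Series(["September 25, 2021", " August 4, 2017", "Not Added"])

# Strip whitespace, then parse with an explicit format.
# Values that do not match the format become NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
print(f"Unparsed dates: {parsed.isna().sum()}")
```

Comparing the `NaT` count against the number of originally missing dates would show whether the coercion dropped any values beyond the placeholders.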

In [21]:
# Normalize column names to snake_case (they already are, so this leaves them unchanged)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print(df.columns.tolist())

['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
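
Since this dataset's headers are already lowercase with underscores, the effect of the normalization is easier to see on messier names. A small sketch with hypothetical headers (none of these messy names come from the actual file):

```python
import pandas as pd

# Hypothetical messy headers, for illustration only.
messy = pd.DataFrame(columns=[" Show ID", "Release Year ", "listed_in"])

# Same chain as in the notebook: trim, lowercase, replace spaces with underscores.
messy.columns = messy.columns.str.strip().str.lower().str.replace(" ", "_")

print(messy.columns.tolist())
```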


In [22]:
df.to_csv('netflix_cleaned.csv', index=False)
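
CSV files do not store dtypes, so `date_added` comes back as plain text unless it is parsed again on load. The following round-trip sketch uses a toy frame and a temporary file; the `sample_cleaned.csv` file name and the sample values are placeholders, not the notebook's data.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one real date and one missing date.
sample = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "date_added": pd.to_datetime(["2021-09-25", None]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_cleaned.csv")
    sample.to_csv(path, index=False)
    # parse_dates restores the datetime dtype when reading the file back.
    reloaded = pd.read_csv(path, parse_dates=["date_added"])

print(reloaded.dtypes)
```

The same `parse_dates=["date_added"]` argument could be passed when reloading `netflix_cleaned.csv` in a later analysis step.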