## Data Cleaning and Preprocessing of Netflix Movies and TV Shows

We applied the following preprocessing steps to get the dataset ready for analysis:

First, we identified missing values with the .isnull() method and removed the affected rows with .dropna().

We checked for completely empty columns with data.isnull().all(); none were found, so no columns had to be dropped. We also checked for duplicate rows and found none.

The country column was standardized so that abbreviations and spelling variants (for example, "usa" or "uk") map to a single full country name.

We reformatted the date_added column to the dd-mm-yyyy format for better readability.

All column headers were converted to lowercase to maintain a uniform naming style.

Additionally, we converted the date_added column to a datetime type, the rating column to a categorical type, and the duration column to a nullable integer type, so that each column is stored with an appropriate data type.

After these cleaning and preprocessing steps, we were left with a cleaner and more structured dataset, ready for analysis.

In [30]:
import pandas as pd

data = pd.read_csv(r"C:\Users\91982\Downloads\netflix_titles 3.csv", encoding='utf-8')


In [31]:
data

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [32]:
data.isnull()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,False,False,False,False,True,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False,False,False
2,False,False,False,False,False,True,False,False,False,False,False,False
3,False,False,False,True,True,True,False,False,False,False,False,False
4,False,False,False,True,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,False,False,False,False,False,False,False,False,False,False,False,False
8803,False,False,False,True,True,True,False,False,False,False,False,False
8804,False,False,False,False,False,False,False,False,False,False,False,False
8805,False,False,False,False,False,False,False,False,False,False,False,False
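The boolean frame above is hard to scan at this size. A per-column count is usually easier to read. The following is a minimal sketch on a small made-up frame; the `sample` data is an assumption for illustration and is not taken from the Netflix file.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample standing in for the real dataset
sample = pd.DataFrame({
    "title": ["a", "b", "c"],
    "director": ["x", np.nan, np.nan],
    "country": [np.nan, "india", "ghana"],
})

# Number of missing values in each column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Share of missing values in each column, as a percentage
missing_pct = sample.isnull().mean() * 100
print(missing_pct.round(1))
```

The same two calls can be run on `data` to see which columns are responsible for most of the missing values.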


In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [35]:
# Identify columns in which every value is missing
empty_cols = data.columns[data.isnull().all()]
print("Empty columns:", empty_cols.tolist())

Empty columns: []
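Since the real dataset has no fully empty columns, the check above returns an empty list. To show what the check does when such a column exists, here is a small sketch on a made-up frame; `toy` and its `notes` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one fully empty column and one partially empty column
toy = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "notes": [np.nan, np.nan],   # every value missing
    "director": ["x", np.nan],   # only some values missing
})

# Same expression as in the cell above: columns where all values are null
empty = toy.columns[toy.isnull().all()].tolist()
print(empty)

# Equivalent one-step removal of fully empty columns
trimmed = toy.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```

Only `notes` is flagged; `director` is kept because it still holds at least one value.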


In [36]:
# Drop the empty columns (a no-op here, since none were found)
data.drop(columns=empty_cols, inplace=True)

In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [38]:
# Drop every row that has at least one missing value
data.dropna(inplace=True)
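Calling `dropna()` with no arguments removes a row if any single field is missing. According to the `info()` output above, `director` alone is missing in 2,634 of the 8,807 rows, so this step discards a large share of the data. The sketch below contrasts that strict approach with a more lenient one on a made-up frame; the `toy` data, the choice of `rating` as the required field, and the `"unknown"` placeholder are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "director": ["x", np.nan, "y", np.nan],
    "rating": ["pg", "r", np.nan, "tv-ma"],
})

# Strict: drop every row that has any missing value
strict = toy.dropna()

# Lenient: drop only rows missing a required field,
# then fill the remaining gaps with a placeholder
lenient = toy.dropna(subset=["rating"]).fillna({"director": "unknown"})

print(len(toy), len(strict), len(lenient))
```

Which variant is appropriate depends on the analysis; the strict version is simpler but keeps only fully populated rows.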

In [39]:
data.isnull()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
7,False,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False,False
12,False,False,False,False,False,False,False,False,False,False,False,False
24,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8801,False,False,False,False,False,False,False,False,False,False,False,False
8802,False,False,False,False,False,False,False,False,False,False,False,False
8804,False,False,False,False,False,False,False,False,False,False,False,False
8805,False,False,False,False,False,False,False,False,False,False,False,False


In [40]:
# checking for duplicate rows 
duplicates = data[data.duplicated()]
print(f"Number of duplicate rows: {len(duplicates)}") 

Number of duplicate rows: 0
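`duplicated()` with no arguments only flags rows that match in every column. If `show_id` differs for each row, two entries for the same title would never be flagged this way. A possible complementary check compares a subset of columns instead; the sketch below uses a made-up frame, and the choice of key columns is an assumption.

```python
import pandas as pd

# Hypothetical sample: the first two rows describe the same title
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["movie", "movie", "tv show"],
    "title": ["zoom", "zoom", "zoom"],
    "release_year": [2006, 2006, 2006],
})

# Exact duplicates across all columns
full_dupes = int(toy.duplicated().sum())

# Duplicates when only the descriptive key columns are compared
key_dupes = int(toy.duplicated(subset=["type", "title", "release_year"]).sum())

print(full_dupes, key_dupes)
```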


In [41]:
# Normalize the text columns: trim surrounding whitespace and lowercase
text_cols = ['type', 'title', 'director', 'cast', 'country', 'rating', 'duration', 'listed_in', 'description']

for col in text_cols:
    data[col] = data[col].astype(str).str.strip().str.lower()

In [42]:
# Lowercase and clean initial text
data['country'] = data['country'].astype(str).str.lower().str.strip()

# Function to clean and standardize full country names
def standardize_country(cell):
    if pd.isnull(cell) or cell in ['nan', '']:
        return 'Unknown'

    # Mapping of common misspellings or variants to standardized full names
    country_map = {
        'us': 'united states',
        'u.s.': 'united states',
        'usa': 'united states',
        'nited states': 'united states',
        'united kin': 'united kingdom',
        'united kin...': 'united kingdom',
        'uk': 'united kingdom',
        'england': 'united kingdom',
        'south korea': 'republic of korea',
        'republic of korea': 'republic of korea',
        'north korea': "democratic people's republic of korea",
        'russia': 'russian federation',
        'ivory coast': 'côte d’ivoire',
        'viet nam': 'vietnam',
        'brasil': 'brazil',
        'uae': 'united arab emirates'
    }

    # Handle multiple countries
    countries = [c.strip() for c in cell.split(',')]
    cleaned = [country_map.get(c, c) for c in countries]
    return ', '.join(cleaned)

# Apply the standardization
data['country'] = data['country'].apply(standardize_country)

data['country'] = data['country'].str.title()
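To make the behaviour of the mapping easier to check, here is a reduced, self-contained sketch of the same idea applied to a few hand-written values. `COUNTRY_MAP`, `standardize`, and the sample strings are illustrative assumptions; unlike the function above, this version also drops empty fragments left by stray commas.

```python
import pandas as pd

# Reduced, illustrative subset of the variant-to-name mapping
COUNTRY_MAP = {"usa": "united states", "uk": "united kingdom"}

def standardize(cell: str) -> str:
    """Map each comma-separated country to its standard name, in title case."""
    parts = [p.strip() for p in cell.lower().split(",")]
    # Skip empty fragments produced by leading or trailing commas
    parts = [COUNTRY_MAP.get(p, p) for p in parts if p]
    return ", ".join(parts).title()

sample = pd.Series(["USA, India", " uk ", "United States,"])
result = sample.apply(standardize)
print(result.tolist())
```

Running a handful of known variants through the function like this is a quick way to confirm the mapping before applying it to the whole column.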

In [43]:
data

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
7,s8,movie,sankofa,haile gerima,"kofi ghanaba, oyafunmike ogunlano, alexandra d...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,tv-ma,125 min,"dramas, independent movies, international movies","on a photo shoot in ghana, an american model s..."
8,s9,tv show,the great british baking show,andy devonshire,"mel giedroyc, sue perkins, mary berry, paul ho...",United Kingdom,"September 24, 2021",2021,tv-14,9 seasons,"british tv shows, reality tv",a talented batch of amateur bakers face off in...
9,s10,movie,the starling,theodore melfi,"melissa mccarthy, chris o'dowd, kevin kline, t...",United States,"September 24, 2021",2021,pg-13,104 min,"comedies, dramas",a woman adjusting to life after a loss contend...
12,s13,movie,je suis karl,christian schwochow,"luna wedler, jannis niewöhner, milan peschel, ...","Germany, Czech Republic","September 23, 2021",2021,tv-ma,127 min,"dramas, international movies",after most of her family is murdered in a terr...
24,s25,movie,jeans,s. shankar,"prashanth, aishwarya rai bachchan, sri lakshmi...",India,"September 21, 2021",1998,tv-14,166 min,"comedies, international movies, romantic movies",when the father of the man she loves insists t...
...,...,...,...,...,...,...,...,...,...,...,...,...
8801,s8802,movie,zinzana,majid al ansari,"ali suliman, saleh bakri, yasa, ali al-jabri, ...","United Arab Emirates, Jordan","March 9, 2016",2015,tv-ma,96 min,"dramas, international movies, thrillers",recovering alcoholic talal wakes up inside a s...
8802,s8803,movie,zodiac,david fincher,"mark ruffalo, jake gyllenhaal, robert downey j...",United States,"November 20, 2019",2007,r,158 min,"cult movies, dramas, thrillers","a political cartoonist, a crime reporter and a..."
8804,s8805,movie,zombieland,ruben fleischer,"jesse eisenberg, woody harrelson, emma stone, ...",United States,"November 1, 2019",2009,r,88 min,"comedies, horror movies",looking to survive in a world taken over by zo...
8805,s8806,movie,zoom,peter hewitt,"tim allen, courteney cox, chevy chase, kate ma...",United States,"January 11, 2020",2006,pg,88 min,"children & family movies, comedies","dragged from civilian life, a former superhero..."


In [44]:
# Parse date_added (e.g. "September 24, 2021"); unparseable values become NaT
data['date_added'] = pd.to_datetime(data['date_added'], errors='coerce')

# Format as dd-mm-yyyy; note that strftime returns strings, not datetimes
data['date_added'] = data['date_added'].dt.strftime('%d-%m-%Y')

In [45]:
data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
7,s8,movie,sankofa,haile gerima,"kofi ghanaba, oyafunmike ogunlano, alexandra d...","United States, Ghana, Burkina Faso, United Kin...",24-09-2021,1993,tv-ma,125 min,"dramas, independent movies, international movies","on a photo shoot in ghana, an american model s..."
8,s9,tv show,the great british baking show,andy devonshire,"mel giedroyc, sue perkins, mary berry, paul ho...",United Kingdom,24-09-2021,2021,tv-14,9 seasons,"british tv shows, reality tv",a talented batch of amateur bakers face off in...
9,s10,movie,the starling,theodore melfi,"melissa mccarthy, chris o'dowd, kevin kline, t...",United States,24-09-2021,2021,pg-13,104 min,"comedies, dramas",a woman adjusting to life after a loss contend...
12,s13,movie,je suis karl,christian schwochow,"luna wedler, jannis niewöhner, milan peschel, ...","Germany, Czech Republic",23-09-2021,2021,tv-ma,127 min,"dramas, international movies",after most of her family is murdered in a terr...
24,s25,movie,jeans,s. shankar,"prashanth, aishwarya rai bachchan, sri lakshmi...",India,21-09-2021,1998,tv-14,166 min,"comedies, international movies, romantic movies",when the father of the man she loves insists t...
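After `strftime`, the column holds plain strings again, and a dd-mm-yyyy string such as 01-11-2019 is ambiguous if it is parsed later without a stated format. The sketch below walks through the round trip on two hand-written dates; the sample values are assumptions chosen to show the ambiguity.

```python
import pandas as pd

# Two dates in the dataset's original "Month D, YYYY" style
raw = pd.Series(["November 1, 2019", "January 11, 2020"])

# Parse, then format as dd-mm-yyyy strings
parsed = pd.to_datetime(raw, errors="coerce")
formatted = parsed.dt.strftime("%d-%m-%Y")
print(formatted.tolist())

# Convert back to datetime, stating the format so day and month cannot be swapped
restored = pd.to_datetime(formatted, format="%d-%m-%Y")
print(restored.dt.month.tolist())
```

With the format given explicitly, the restored values match the originally parsed dates.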


In [46]:
#Rename column headers to be clean and uniform (e.g., lowercase, no spaces).
data.columns = data.columns.str.strip().str.lower().str.replace(' ', '_')
# the column headers are already in clean and uniform format

In [47]:
# Check whether each column has an appropriate data type
print(data.dtypes)

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


In [48]:
# date_added now holds dd-mm-yyyy strings, so state the format explicitly;
# otherwise ambiguous values such as 01-11-2019 may be read month-first
data['date_added'] = pd.to_datetime(data['date_added'], format='%d-%m-%Y', errors='coerce')
data['rating'] = data['rating'].astype('category')
# Keep only the leading number; the unit (min or seasons) is discarded
data['duration'] = data['duration'].str.extract(r'(\d+)', expand=False).astype('Int64')
data['show_id'] = data['show_id'].astype(str)
data['description'] = data['description'].astype(str)


In [49]:
print(data.dtypes)

show_id                 object
type                    object
title                   object
director                object
cast                    object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                category
duration                 Int64
listed_in               object
description             object
dtype: object
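The benefit of the categorical type for a column with few distinct values, such as `rating`, can be measured with `memory_usage`. This is a minimal sketch on synthetic data; the four rating labels and the repetition count are assumptions, and the exact byte counts will vary by platform and pandas version.

```python
import pandas as pd

# Synthetic column: four distinct labels repeated many times
ratings = pd.Series(["tv-ma", "pg-13", "r", "tv-14"] * 1000, dtype=object)
as_category = ratings.astype("category")

# deep=True includes the memory held by the Python string objects
object_bytes = ratings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# Each distinct label is stored once; rows hold small integer codes
print(as_category.cat.categories.tolist())
```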


In [50]:
data.to_csv('cleaned_dataset.csv', index=False)
data.to_excel('cleaned_dataset.xlsx', index=False)
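CSV files store text only, so the datetime and categorical types set above are not preserved in `cleaned_dataset.csv`. They can be restored when the file is read back. The sketch below uses an in-memory buffer instead of a file, with a made-up two-row frame; both are assumptions for illustration. Note also that `to_excel` needs an Excel writer engine such as openpyxl to be installed.

```python
import io

import pandas as pd

# Hypothetical two-row frame with the same dtypes as the cleaned columns
df = pd.DataFrame({
    "date_added": pd.to_datetime(["2021-09-24", "2019-11-01"]),
    "rating": pd.Categorical(["tv-ma", "r"]),
})

# Write to an in-memory CSV, then rewind the buffer for reading
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Re-apply the dtypes on load
restored = pd.read_csv(
    buffer,
    parse_dates=["date_added"],
    dtype={"rating": "category"},
)
print(restored.dtypes)
```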