# Assignment 2: Data Cleaning (Clean Data Checker)
- Group Number: A-129
- Name: Uzair Mohiuddin
- Student Number: 8737165

## Dataset: Netflix Movies and TV Shows
- Dataset: Netflix Movies and TV Shows
- Source: [Kaggle Link](https://www.kaggle.com/datasets/shivamb/netflix-shows)
- Rows and Columns: 8807 rows x 12 columns

### Description
- **Name**: Netflix Movies and TV Shows
- **Author**: Shivam Bansal
- **Purpose**: Listings of movies and TV shows that appear on Netflix. Netflix has over 8000 movies and TV shows, and this dataset includes details about each, such as cast, directors, ratings, etc.
- **Shape**: 8807 rows x 12 columns
- **Features of Dataset**:
  | Feature      | Categorical/Numerical | Description                              |   
  | ------------ | --------------------- | ---------------------------------------- |
  | show_id      |  Categorical (object) | Unique ID for every Movie/TV Show        |
  | type         |  Categorical (object) | Identifier (either Movie or TV Show)     |
  | title        |  Categorical (object) | Title of Movie/TV Show                   |
  | director     |  Categorical (object) | Director of the Movie                    | 
  | cast         |  Categorical (object) | Actors involved in Movie/TV Show         |
  | country      |  Categorical (object) | Country where it was produced            |
  | date_added   |  Categorical (object) | Date added to Netflix                    |
  | release_year | Numerical (int64)     | Actual release year of Movie/TV Show     |
  | rating       |  Categorical (object) | TV Rating of the Movie/TV Show           |
  | duration     |  Categorical (object) | Total Duration (minutes or # of seasons) |
  | listed_in    |  Categorical (object) | Genre                                    |
  | description  |  Categorical (object) | Summary description                      |

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from IPython.display import display, Markdown

In [2]:
# URL for dataset
netflix_dataset_url = "https://raw.githubusercontent.com/uzaaaiiir/jupyter/refs/heads/main/intro_ds_assignments/assignment2/netflix_titles.csv"

# Load dataset
df = pd.read_csv(netflix_dataset_url)

In [3]:
# Retrieve shape of dataset
df.shape

(8807, 12)

In [4]:
# Get list of features and descriptions.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   show_id       8807 non-null   object 
 1   type          8807 non-null   object 
 2   title         8807 non-null   object 
 3   director      6173 non-null   object 
 4   cast          7982 non-null   object 
 5   country       7976 non-null   object 
 6   date_added    8797 non-null   object 
 7   release_year  8806 non-null   float64
 8   rating        8803 non-null   object 
 9   duration      8804 non-null   object 
 10  listed_in     8807 non-null   object 
 11  description   8807 non-null   object 
dtypes: float64(1), object(11)
memory usage: 825.8+ KB


In [5]:
'''
describe() to retrieve numerical attributes.

Numerical attributes are: release_year
'''
df.describe()

Unnamed: 0,release_year
count,8806.0
mean,2014.182489
std,8.82135
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2029.0


In [6]:
'''
Categorical attributes of the dataset.

Categorical attributes are: show_id, type, title, director, cast, country, date_added, rating, duration, listed_in, description
'''
df.describe(include="object")

Unnamed: 0,show_id,type,title,director,cast,country,date_added,rating,duration,listed_in,description
count,8807,8807,8807,6173,7982,7976,8797,8803,8804,8807,8807
unique,8806,2,8807,4528,7692,748,1771,17,220,514,8775
top,s450,Movie,Zubaan,Rajiv Chilaka,David Attenborough,United States,"January 1, 2020",TV-MA,1 Season,"Dramas, International Movies","Paranormal activity at a lush, abandoned prope..."
freq,2,6131,1,19,19,2818,109,3207,1793,362,4


In [7]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020.0,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021.0,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021.0,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021.0,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021.0,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


# 1. Data Type Check
**Description**: The data type check verifies that the data in a field has the correct data type. For example, if an attribute is expected to be a date, we would check that the column is stored as a datetime type rather than as plain text.

In [8]:
# Parameters for the checker

attributes = [
    'show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
    'release_year', 'rating', 'duration', 'listed_in', 'description'
]

test_attribute = 'type'

In [9]:
# Checker Code
# Convert 'date_added' to datetime
df_copy = df.copy(deep=True)
df_copy['date_added'] = pd.to_datetime(df_copy['date_added'], errors='coerce')

expected_types = {
    'show_id': 'object',
    'type': 'object',
    'title': 'object',
    'director': 'object',
    'cast': 'object',
    'country': 'object',
    'date_added': 'datetime64[ns]',
    'release_year': 'int64',
    'rating': 'object',
    'duration': 'object',
    'listed_in': 'object',
    'description': 'object'
}

errors = {}
column_type = df_copy[test_attribute].dtype.name

if (column_type != expected_types[test_attribute]):
    errors[test_attribute] = column_type

print(errors)

{}


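# Results
No type error was detected for the tested attribute: `type` has dtype `object`, as expected. Because the checker compares one attribute per run, this result covers `type` only.
- Note: `df.info()` above reports `release_year` as `float64` (one value is missing), so testing `release_year` against the expected `int64` would be reported as a mismatch.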
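The checker above compares a single attribute per run. As a minimal sketch of how the same comparison could be applied to every column at once, the example below uses a small made-up frame; the `toy` data and the `find_type_mismatches` helper are illustrative assumptions and are not part of the assignment code.

```python
import pandas as pd

def find_type_mismatches(frame, expected):
    """Return {column: actual dtype name} for columns whose dtype differs from the expected one."""
    return {
        col: frame[col].dtype.name
        for col, want in expected.items()
        if frame[col].dtype.name != want
    }

# Hypothetical toy data: the missing year forces the column to float64.
toy = pd.DataFrame({
    "type": pd.Series(["Movie", "TV Show", "Movie"], dtype="object"),
    "release_year": [2020, float("nan"), 2017],
})
expected = {"type": "object", "release_year": "int64"}

mismatches = find_type_mismatches(toy, expected)
print(mismatches)
```

Returning a dictionary keeps the same shape as the `errors` dictionary used in the checker cell, so an empty result still means that no mismatch was found.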

# 2. Range Check
**Description**: The range check verifies that the data falls within a specified range. The range covers the minimum and maximum values an attribute can have. For example, the salary of an employee shouldn't be below 0.

In [10]:
# Parameters for the Checker
attributes = ['date_added', 'release_year']

test_attribute = 'release_year'

# Minimum value (for date_added, use format 'YYYY-MM-DD', example='2015-01-01')
# minimum = '2015-01-01'
minimum = 1950

# Maximum value (for date_added, format as YYYY-MM-DD, example='2024-02-24')
# maximum = '2024-02-24'
maximum = 2025

In [11]:
# Range Checker
df_copy = df.copy(deep=True)
df_copy['date_added'] = pd.to_datetime(df_copy['date_added'], errors='coerce')

out_of_range = df_copy[(df_copy[test_attribute] < minimum) | (df_copy[test_attribute] > maximum)]

out_of_range[['title', test_attribute]]

Unnamed: 0,title,release_year
450,The Twilight Saga: New Moon,2029.0
1331,Five Came Back: The Reference Films,1945.0
4250,Pioneers: First Women Filmmakers*,1925.0
7219,Know Your Enemy - Japan,1945.0
7294,Let There Be Light,1946.0
7575,Nazi Concentration Camps,1945.0
7743,Pioneers of African-American Cinema,1946.0
7790,Prelude to War,1942.0
7930,San Pietro,1945.0
8205,The Battle of Midway,1942.0


# Results
There are 16 data points with a release year before 1950, and 1 data point with a release year after 2025 (row 450, The Twilight Saga: New Moon, 2029). Three examples, with row number, title, and release_year:
| row | title | release_year |  
| --- | ----- | -------------|
| 1331|Five Came Back: The Reference Films | 1945 |
| 7219 | Know Your Enemy - Japan | 1945 |
| 8640 | Tunisian Victory | 1944 |
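Counting the two kinds of violation separately makes it harder to overlook one of them. The sketch below shows one way to do that on a hypothetical series of years; the `range_violations` helper and the sample values are assumptions made for illustration.

```python
import pandas as pd

def range_violations(series, minimum, maximum):
    """Return the values below the minimum and the values above the maximum."""
    below = series[series < minimum]
    above = series[series > maximum]
    return below, above

# Hypothetical sample; the NaN is skipped because comparisons with NaN are False.
years = pd.Series([1945.0, 2017.0, 2029.0, float("nan"), 1950.0])
below, above = range_violations(years, 1950, 2025)

print(len(below), "below the minimum;", len(above), "above the maximum")
```

Values equal to the minimum or maximum are treated as valid here, which matches the strict `<` and `>` comparisons in the checker cell.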

# 3. Format Check
**Description**: The format check verifies the format of the data of an attribute. For example, if we want the date attribute to have a specific format, we would check for that. 

In [12]:
# Parameters for checker

attributes = ['date_added']
test_attribute = 'date_added'

In [13]:
# Format checker
df_copy = df.copy(deep=True)

# Expected Format: September 21, 2021
expected_format = "%B %d, %Y"

df_copy['date_added'] = df_copy['date_added'].str.strip().replace(r'\s+', ' ', regex=True)
df_copy['date_parsed'] = pd.to_datetime(df_copy['date_added'], format=expected_format, errors='coerce')
invalid_format = df_copy[df_copy['date_parsed'].isna() & df_copy['date_added'].notna()]

invalid_format[['show_id', 'title', 'date_added', 'release_year']]

Unnamed: 0,show_id,title,date_added,release_year
34,s35,Tayo and Little Wizards,2021-09-12,2020.0
35,s36,The Father Who Moves Mountains,2021-09-17,2021.0
38,s39,Birth of the Dragon,2021-09-16,2017.0
374,s375,Flower Girl,2021-01-01,2013.0


# Results
There are 4 entries whose dates do not match the expected format (for example, "September 1, 2024"). All four are written as `YYYY-MM-DD` instead:
| row | show_id | title | date_added | release_year |
| -- | -- | -- | -- | -- |
| 34 | s35 | Tayo and Little Wizards | 2021-09-12 | 2020 |
| 35 | s36 | The Father Who Moves Mountains | 2021-09-17 | 2021 |
| 38 | s39 | Birth of the Dragon | 2021-09-16 | 2017 |
| 374 | s375 | Flower Girl | 2021-01-01 | 2013 |
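To see why these four entries fail, it can help to test individual strings against the expected format. The sketch below uses only the standard library; the `matches_format` helper and the sample strings are assumptions for illustration, and `%B` relies on English month names being available in the active locale.

```python
from datetime import datetime

EXPECTED_FORMAT = "%B %d, %Y"  # e.g. "September 21, 2021"

def matches_format(text, fmt=EXPECTED_FORMAT):
    """Return True if the string parses with the given strptime format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

# Hypothetical sample strings: two in the expected format, two not.
samples = ["September 25, 2021", "2021-09-12", "Sept 25, 2021", "January 1, 2020"]
results = {s: matches_format(s) for s in samples}

for text, ok in results.items():
    print(f"{text!r}: {'ok' if ok else 'invalid format'}")
```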

# 4. Consistency Check
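**Description**: The consistency check verifies that related attributes agree with each other. For example, a title's release year should not be later than the year it was added, and the duration format should match the type (minutes for a Movie, seasons for a TV Show).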

In [14]:
# Parameters
attribute_pairs = {
    'date_added': 'release_year',
    'type': 'duration',
}

test_attribute = 'date_added'

# Patterns for Movie and TV duration
duration_movie = r'^\d+\s*min$'
duration_tv = r'^\d+\s*Seasons?$'

In [15]:
# Consistency Checker 
df_copy = df.copy(deep=True)

df_copy['date_parsed'] = pd.to_datetime(df_copy['date_added'], errors='coerce')
df_copy['release_year'] = pd.to_numeric(df_copy['release_year'], errors='coerce')
    
invalid_release_dates = df_copy[(df_copy['date_parsed'].notna()) & (df_copy['release_year'] > df_copy['date_parsed'].dt.year)]
invalid_release_dates

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,date_parsed
435,s436,TV Show,Touch Your Heart,,"Lee Dong-wook, Yoo In-na, Lee Sang-woo, Son Su...",,"July 20, 2021",2022.0,TV-MA,1 Season,"Crime TV Shows, International TV Shows, Romant...","Hoping to make a comeback after a bad scandal,...",2021-07-20
444,s445,TV Show,Naomi Osaka,Garrett Bradley,,,"July 16, 2021",2023.0,TV-14,1 Season,Docuseries,This intimate series follows Naomi Osaka as sh...,2021-07-16
450,s450,Movie,The Twilight Saga: New Moon,Chris Weitz,"Kristen Stewart, Robert Pattinson, Taylor Laut...",United States,"July 16, 2021",2029.0,PG-13,131 min,"Dramas, Romantic Movies",Still reeling from the departure of vampire Ed...,2021-07-16
1551,s1552,TV Show,Hilda,,"Bella Ramsey, Ameerah Falzon-Ojo, Oliver Nelso...","United Kingdom, Canada, United States","December 14, 2020",2021.0,TV-Y7,2 Seasons,Kids' TV,"Fearless, free-spirited Hilda finds new friend...",2020-12-14
1696,s1697,TV Show,Polly Pocket,,"Emily Tennant, Shannon Chan-Kent, Kazumi Evans...","Canada, United States, Ireland","November 15, 2020",2021.0,TV-Y,2 Seasons,Kids' TV,After uncovering a magical locket that allows ...,2020-11-15
2920,s2921,TV Show,Love Is Blind,,"Nick Lachey, Vanessa Lachey",United States,"February 13, 2020",2021.0,TV-MA,1 Season,"Reality TV, Romantic TV Shows",Nick and Vanessa Lachey host this social exper...,2020-02-13
3168,s3169,TV Show,Fuller House,,"Candace Cameron Bure, Jodie Sweetin, Andrea Ba...",United States,"December 6, 2019",2020.0,TV-PG,5 Seasons,TV Comedies,The Tanner family’s adventures continue as DJ ...,2019-12-06
3287,s3288,TV Show,Maradona in Mexico,,Diego Armando Maradona,"Argentina, United States, Mexico","November 13, 2019",2020.0,TV-MA,1 Season,"Docuseries, Spanish-Language TV Shows","In this docuseries, soccer great Diego Maradon...",2019-11-13
3369,s3370,TV Show,BoJack Horseman,,"Will Arnett, Aaron Paul, Amy Sedaris, Alison B...",United States,"October 25, 2019",2020.0,TV-MA,6 Seasons,TV Comedies,Meet the most beloved sitcom horse of the '90s...,2019-10-25
3433,s3434,TV Show,The Hook Up Plan,,"Marc Ruchmann, Zita Hanrot, Sabrina Ouazani, J...",France,"October 11, 2019",2020.0,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...","When Parisian Elsa gets hung up on her ex, her...",2019-10-11


In [16]:
# Consistency Checker 
df_copy = df.copy(deep=True)

invalid_movies = df_copy[(df_copy['type'] == "Movie") & (~df_copy['duration'].str.match(duration_movie, na=False))]
invalid_tv_shows = df_copy[(df_copy['type'] == "TV Show") & (~df_copy['duration'].str.match(duration_tv, na=False))]
invalid = pd.concat([invalid_movies, invalid_tv_shows])
invalid

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5541,s5542,Movie,Louis C.K. 2017,Louis C.K.,Louis C.K.,United States,"April 4, 2017",2017.0,74 min,,Movies,"Louis C.K. muses on religion, eternal love, gi..."
5794,s5795,Movie,Louis C.K.: Hilarious,Louis C.K.,Louis C.K.,United States,"September 16, 2016",2010.0,84 min,,Movies,Emmy-winning comedy writer Louis C.K. brings h...
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,"August 15, 2016",2015.0,66 min,,Movies,The comic puts his trademark hilarious/thought...


# Results
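There are 17 results where the release year is later than the year the title was added. This is inconsistent because a title cannot be added to Netflix before it is released, so the release year should never exceed the year of `date_added`. These are some examples:
| row | show_id | title | date_added | release_year |
| -- | -- | -- | -- | -- |
| 435 | s436 | Touch Your Heart | July 20, 2021 | 2022 |
| 444 | s445 | Naomi Osaka | July 16, 2021 | 2023 |
| 450 | s450 | The Twilight Saga: New Moon | July 16, 2021 | 2029 |

The `type`/`duration` check flags 3 movies (rows 5541, 5794, and 5813) whose `duration` is empty because the running time was entered in the `rating` column instead.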
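The two duration patterns from the parameters cell can be tried on individual values to confirm what they accept. This is a small standard-library sketch; the `duration_is_consistent` helper and the sample values are assumptions for illustration, and a missing duration is treated as inconsistent, as in the checker cell.

```python
import re

# Same patterns as in the parameters cell, keyed by type.
DURATION_PATTERNS = {
    "Movie": re.compile(r"^\d+\s*min$"),
    "TV Show": re.compile(r"^\d+\s*Seasons?$"),
}

def duration_is_consistent(show_type, duration):
    """Return True if the duration string has the form expected for the given type."""
    pattern = DURATION_PATTERNS.get(show_type)
    if pattern is None or not isinstance(duration, str):
        return False  # unknown type or missing duration
    return pattern.match(duration) is not None

print(duration_is_consistent("Movie", "90 min"))
print(duration_is_consistent("TV Show", "2 Seasons"))
print(duration_is_consistent("Movie", "1 Season"))
print(duration_is_consistent("Movie", None))
```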

# 5. Uniqueness Check
**Description**: The uniqueness check verifies that data for a specific attribute has unique values. For example, we would want to ensure an ID is not entered into the dataset more than once, or it would fail the uniqueness check. 

In [17]:
# Parameters
attributes = ['show_id', 'title', 'description']
test_attribute = 'show_id'

In [18]:
# Uniqueness check
df_copy = df.copy(deep=True)
duplicated = df_copy[df_copy.duplicated(subset=[test_attribute], keep=False)]
duplicated[['show_id']]

Unnamed: 0,show_id
449,s450
450,s450


# Results
There is one duplicated value for the attribute `show_id`: rows 449 and 450 both have `s450`. Note: This duplicate was introduced by me manually.

|row|	show_id|
|--|--|
|449|	s450|
|450|	s450|
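The checker uses `keep=False`, which is why both rows appear in the output. The sketch below contrasts it with the default behaviour on a hypothetical series of IDs (the sample values are assumptions).

```python
import pandas as pd

# Hypothetical IDs with one repeated value.
ids = pd.Series(["s449", "s450", "s450", "s452"])

# keep=False marks every copy of a repeated value.
all_copies = ids[ids.duplicated(keep=False)]

# The default (keep='first') marks only the later copies.
later_copies = ids[ids.duplicated()]

print(all_copies.index.tolist())
print(later_copies.index.tolist())
print(ids.is_unique)
```

Listing every copy is useful for inspection, while the default is the more natural choice when deciding which rows to drop.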


# 6. Presence Check
**Description**: The presence check verifies that all mandatory fields are populated, and not blank.

In [19]:
# Parameters
attributes = ['show_id', 'title', 'type', 'release_year', 'director']

test_attribute = 'director'

In [20]:
# Checker
df_copy = df.copy(deep=True)

missing = df_copy[df_copy[test_attribute].isna()]
missing

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021.0,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021.0,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021.0,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
10,s11,TV Show,"Vendetta: Truth, Lies and The Mafia",,,,"September 24, 2021",2021.0,TV-MA,1 Season,"Crime TV Shows, Docuseries, International TV S...","Sicily boasts a bold ""Anti-Mafia"" coalition. B..."
14,s15,TV Show,Crime Stories: India Detectives,,,,"September 22, 2021",2021.0,TV-MA,1 Season,"British TV Shows, Crime TV Shows, Docuseries",Cameras following Bengaluru police on the job ...
...,...,...,...,...,...,...,...,...,...,...,...,...
8795,s8796,TV Show,Yu-Gi-Oh! Arc-V,,"Mike Liscio, Emily Bauer, Billy Bob Thompson, ...","Japan, Canada","May 1, 2018",2015.0,TV-Y7,2 Seasons,"Anime Series, Kids' TV",Now that he's discovered the Pendulum Summonin...
8796,s8797,TV Show,Yunus Emre,,"Gökhan Atalay, Payidar Tüfekçioglu, Baran Akbu...",Turkey,"January 17, 2017",2016.0,TV-PG,2 Seasons,"International TV Shows, TV Dramas","During the Mongol invasions, Yunus Emre leaves..."
8797,s8798,TV Show,Zak Storm,,"Michael Johnston, Jessica Gee-George, Christin...","United States, France, South Korea, Indonesia","September 13, 2018",2016.0,TV-Y7,3 Seasons,Kids' TV,Teen surfer Zak Storm is mysteriously transpor...
8800,s8801,TV Show,Zindagi Gulzar Hai,,"Sanam Saeed, Fawad Khan, Ayesha Omer, Mehreen ...",Pakistan,"December 15, 2016",2012.0,TV-PG,1 Season,"International TV Shows, Romantic TV Shows, TV ...","Strong-willed, middle-class Kashaf and carefre..."


# Results
For `release_year`, there is one record where the value is null, although it should be present for every record:
| row | show_id | title | release_year |
| -- | -- | -- | -- |
| 509 | s510 | Ask the StoryBots | NaN |

For `director` (the run shown above), there are 2634 rows where the value is NaN, although every Netflix movie or TV show should have a director.
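Since the check is meant to catch blank values as well as nulls, a summary over all mandatory fields could look like the sketch below. The `toy` frame is made up for illustration, and treating whitespace-only strings as blank is an assumption rather than something the checker above does.

```python
import pandas as pd

# Hypothetical data: one null director, one whitespace-only director, one null year.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "director": ["Kirsten Johnson", None, "   "],
    "release_year": [2020.0, float("nan"), 2021.0],
})
mandatory = ["show_id", "director", "release_year"]

# Number of null values per mandatory column.
null_counts = toy[mandatory].isna().sum()

# Rows where the director is null, empty, or whitespace only.
blank_director = toy["director"].fillna("").str.strip().eq("")

print(null_counts)
print(blank_director.tolist())
```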

# 7. Length Check
**Description**: The length check verifies that data for an attribute has the specified number of characters. For example, we would want a password to have a specific length. 

In [21]:
# Parameters
attributes = ['show_id', 'title', 'release_year']

constraints = {
    'show_id': (2, 10), # between 2 and 10 characters
    'title': (2, 200),
    'release_year': (4, 6)
}

test_attribute = 'release_year'

In [22]:
# Length checker
df_copy = df.copy(deep=True)

df_copy[test_attribute] = df_copy[test_attribute].astype(str)
min_len, max_len = constraints.get(test_attribute, (None, None))
invalid = df_copy[
    (df_copy[test_attribute].str.len() < min_len) | 
    (df_copy[test_attribute].str.len() > max_len)
]
invalid


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
509,s510,TV Show,Ask the StoryBots,,"Judy Greer, Erin Fitzgerald, Fred Tatasciore, ...",United States,"July 6, 2021",,TV-Y,3 Seasons,Kids' TV,Five curious little creatures track down the a...


# Results
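For `release_year` (the run shown above), row 509 is flagged: its missing value became the 3-character string `nan` after `astype(str)`, which is below the minimum length of 4.

For `title`, there are 4 records whose title is only 1 character long. We would expect the title of any movie or TV show to be between 2 and 200 characters long. Sample:
| row | show_id | title |
| -- | -- | -- |
| 2069 | s2070 | H |
| 5958 | s5959 | 9 |
| 7155 | s7156 | K |
| 7687 | s7688 | P |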
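The length rule for text columns can be sketched with `str.len()` and `between()`. The titles below are a hypothetical sample, and the bounds mirror the `(2, 200)` constraint from the parameters cell.

```python
import pandas as pd

# Hypothetical titles; two of them are a single character long.
titles = pd.Series(["H", "Zubaan", "9", "Dick Johnson Is Dead"])
min_len, max_len = 2, 200

lengths = titles.str.len()

# between() is inclusive on both ends by default.
invalid_titles = titles[~lengths.between(min_len, max_len)]

print(invalid_titles)
```

Missing values are not covered by this sketch and are better handled by the presence check.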

# 8. Look-up Check
**Description**: The look-up check verifies that the data for an attribute has acceptable values by comparing it against the limited set of values the attribute can take.

In [23]:
# Parameters

attributes = ['rating', 'type']

valid = {
    "type": ['Movie', 'TV Show'],
    # Based on standard ratings - found here: https://www.spectrum.net/support/tv/tv-and-movie-ratings-descriptions
    'rating': ['TV-Y', 'TV-Y7', 'TV-Y7-FV', 'TV-G', 'TV-PG', 'TV-14', 'TV-MA', 'G', 'PG', 'PG-13', 'R', 'NC-17']
}

test_attribute = 'type'


In [24]:
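# Look up checker
# Start from a fresh copy so that changes made by earlier checkers do not carry over
df_copy = df.copy(deep=True)
valid_values = valid.get(test_attribute)

invalid = df_copy[~df_copy[test_attribute].isin(valid_values)]
invalid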

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


# Results
For the `rating` attribute, there are 90 rows whose values are not in the list of standard TV and movie ratings. For example, some rows have `NR`, and others hold a duration such as `74 min` in place of a rating. Examples:
| row | show_id | title | rating |
| -- | -- | -- | -- |
| 5541 | s5542 | Louis C.K. 2017 | 74 min |
| 5794 | s5795 | Louis C.K.: Hilarious | 84 min |
| 5813 | s5814 | Louis C.K.: Live at the Comedy Store | 66 min |
| 5971 | s5972 | (T)ERROR | NR |

For the type attribute, all records fell into the categories.
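Note that `isin` returns `False` for missing values, so a look-up check also counts nulls as invalid. The sketch below separates the two cases on a hypothetical series; the shortened list of valid ratings and the sample values are assumptions for illustration.

```python
import pandas as pd

# Shortened list of valid ratings, for illustration only.
valid_ratings = ["TV-MA", "PG-13", "R"]

# Hypothetical sample: two unexpected values and one missing value.
ratings = pd.Series(["TV-MA", "74 min", None, "NR", "PG-13"])

not_allowed = ~ratings.isin(valid_ratings)  # True for unexpected values and for nulls
missing = ratings.isna()

# Values that are present but not in the look-up list.
invalid_values = ratings[not_allowed & ~missing]

print(int(not_allowed.sum()), "rows fail the look-up;", int(missing.sum()), "of them are missing values")
print(invalid_values.tolist())
```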

# 9. Exact Duplicate Check
**Description**: The exact duplicate check verifies if there are exact duplicate rows in the dataset.

In [25]:
# Duplicate checker
df_copy = df.copy(deep=True)

# Flag rows that repeat an earlier row in every column (show_id included)
duplicates = df_copy.duplicated()
df_copy[duplicates]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


# Results
There are no exact duplicates found.
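Because `duplicated()` compares every column, two rows with the same content but different IDs are not reported as exact duplicates. The sketch below shows the difference on a made-up frame; comparing without `show_id` is a variant suggested here as an assumption, not something the checker above does.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 differ only in show_id.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Zubaan", "Zubaan", "Hilda"],
    "release_year": [2015, 2015, 2018],
})

# Full-row comparison, as in the checker above.
exact = toy[toy.duplicated(keep=False)]

# Same comparison with the ID column left out.
content_columns = [c for c in toy.columns if c != "show_id"]
same_content = toy[toy.duplicated(subset=content_columns, keep=False)]

print(len(exact), "exact duplicates;", len(same_content), "rows with identical content")
```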

# 10. Near Duplicate Check
**Description**: The near duplicate check verifies if there are near duplicates found in the dataset, where the rows only differ by one or two attributes.

In [26]:
# Parameters - columns that, taken together, should identify a single title

near_duplicate_columns = ['title', 'release_year', 'rating', 'country']

In [27]:
# Near duplicate checker: rows that share the same values in all of the columns above
df_copy = df.copy(deep=True)
near_duplicates = df_copy[df_copy.duplicated(subset=near_duplicate_columns, keep=False)].sort_values(by=near_duplicate_columns)
near_duplicates

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description


# Results
No near duplicates were found.
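Near duplicates often differ only in capitalization or stray whitespace, which a plain comparison of `title` does not catch. The sketch below normalizes the title before comparing; this normalization step, the `title_key` column, and the sample rows are assumptions for illustration and were not part of the check above.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are the same title entered differently.
toy = pd.DataFrame({
    "title": ["Love Is Blind", "love is blind ", "Hilda", "Polly Pocket"],
    "release_year": [2021, 2021, 2021, 2021],
})

# Build a comparison key: trim whitespace and lowercase the title.
normalized = toy.assign(title_key=toy["title"].str.strip().str.lower())

near = normalized[normalized.duplicated(subset=["title_key", "release_year"], keep=False)]

print(near[["title", "release_year"]])
```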

# Conclusion
We performed a data checker analysis on a Netflix Movies and TV Shows dataset to verify the data based on different criteria such as length, presence, uniqueness, and more. The results are listed in the above sections.

## References
- Week 4 Part 1 Material
- ChatGPT Queries such as:
  - "How do we write a regex for a format check?"
  - "What is the pandas syntax to verify all rows for an attribute have a value?"