# **Clean Data Checking: Netflix Movies and TV Shows**

**Group Number:** 97  
**Members:**  
Roy Rui #300176548  
Jiayi Ma #300263220
 

# **Dataset I: Netflix Movies and TV Shows Dataset**  
**Author**: Shivam Bansal  
**Ref**: [Source](https://www.kaggle.com/datasets/shivamb/netflix-shows)  
**Shape**: **12 Columns, 8807 Rows**  

## **Description**  
The **Netflix Titles Dataset** contains metadata about TV shows and movies available on Netflix. The dataset provides detailed information such as **title, director, cast, country of production, release year, rating, and duration**. It is widely used for exploratory data analysis (EDA), data cleaning, and recommendation system development.

| Feature         | Description  | Data Type   |
|----------------|--------------|-------------|
| show_id        | Unique identifier for each title | Categorical |
| type           | Type of content (Movie or TV Show) | Categorical |
| title         | Title of the movie or TV show | Categorical |
| director       | Name of the director(s) | Categorical |
| cast          | List of main actors | Categorical |
| country        | Country where the content was produced | Categorical |
| date_added     | Date when the content was added to Netflix | Categorical |
| release_year   | Year of content release | Numerical |
| rating         | Maturity rating (e.g., TV-MA, PG-13) | Categorical |
| duration       | Duration in minutes (movies) or number of seasons (TV shows) | Categorical |
| listed_in      | Genre categories (e.g., Drama, Comedy) | Categorical |
| description    | Brief description of the title | Categorical |

This dataset is useful for **content analysis, trends in streaming media, and audience preferences**, and is particularly valuable for **clean data checking, missing value analysis, and feature engineering**.

---


**General Imports**

In [558]:
# Import necessary libraries
import pandas as pd
import IPython.display as display
from datetime import datetime

# Load the Netflix dataset
df = pd.read_csv('dataset1/netflix_titles.csv')


---

## **1. Data Type Errors**

### **Cell 1 – Description**  
Data type errors occur when column values do not match their expected data types. This can cause:
- Inaccurate calculations 
- Sorting and filtering issues
- Unexpected behaviors 
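
To make the sorting issue concrete, here is a minimal sketch using three made-up dates in the dataset's `Month D, YYYY` style (the sample values are hypothetical, not taken from the dataset). Sorted as text they come out in alphabetical order; sorted after parsing they come out in chronological order.

```python
import pandas as pd

# Hypothetical sample values written in the "Month D, YYYY" style
dates = pd.Series(["September 24, 2021", "April 4, 2022", "December 23, 2018"])

# Sorting the raw strings is alphabetical, not chronological
sorted_as_text = dates.sort_values().tolist()

# Parsing to datetime first gives a chronological order
parsed = pd.to_datetime(dates, format="%B %d, %Y")
sorted_as_dates = parsed.sort_values().dt.strftime("%Y-%m-%d").tolist()

print(sorted_as_text)
print(sorted_as_dates)
```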



### **Cell 2 – Parameters for the Checker**

In [559]:
expected_dtypes = {
    "show_id": "object",  # Unique ID, stored as a string
    "type": "object",  # Categorical: Movie/TV Show
    "title": "object",  # Movie/TV Show title, text
    "director": "object",  # Names, text
    "cast": "object",  # List of actors, text
    "country": "object",  # Country names, text
    "date_added": "datetime64[ns]",  # Should be converted to a datetime format
    "release_year": "int64",  # Should be converted before checking
    "rating": "object",  # Categorical text (e.g., PG, TV-MA)
    "duration": "object",  # Contains both text and numbers
    "listed_in": "object",  # Genre categories, text
    "description": "object"  # Movie/TV Show description, text
}


### **Cell 3 – Checker Code**

In [560]:
# Function to check if actual data types match expected data types
def check_data_types(df, expected_dtypes):
    mismatches = {}
    for column, expected_type in expected_dtypes.items():
        actual_type = df[column].dtype
        if str(actual_type) != expected_type:
            mismatches[column] = actual_type
    return mismatches

# Run the data type check
data_type_issues = check_data_types(df, expected_dtypes)
print(data_type_issues)


{'date_added': dtype('O')}


### **Cell 4 – Report of Findings**

#### **Summary of Data Type Errors**
After running the data type validation, the following column(s) were found to have incorrect data types:

| Column Name  | Expected Type   | Actual Type |
|-------------|----------------|------------|
| `date_added` | `datetime64[ns]` | `object` |

#### **Impact of Data Type Errors**
- **`date_added` should be in datetime format** to allow for proper time-based analysis, such as filtering by year or month.
- Since it is currently stored as a string (`object`), date-related operations behave as text operations: sorting is alphabetical rather than chronological, and filtering by date range does not work correctly.

#### **Next Steps**
- Convert `date_added` to a `datetime64` format to ensure consistency in date-related operations.
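
One possible way to do this conversion is sketched below. It assumes that every non-missing value follows the `Month D, YYYY` pattern once surrounding whitespace is removed; the helper name `parse_date_added` and the sample values are illustrative only.

```python
import pandas as pd

def parse_date_added(values: pd.Series) -> pd.Series:
    """Strip surrounding whitespace, then parse 'Month D, YYYY' strings.

    Values that are missing or do not match the format become NaT.
    """
    cleaned = values.str.strip()
    return pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Hypothetical sample: a clean value, a value with a leading space, a missing value
sample = pd.Series(["September 24, 2021", " August 4, 2017", None])
parsed = parse_date_added(sample)
print(parsed)
```

Passing an explicit `format` keeps the parsing rule visible, and `errors="coerce"` turns anything unexpected into `NaT` so it can be counted afterwards instead of raising an error.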





---



## **2. Range Errors**  

### **Cell 1 – Description**  
Range errors occur when numerical values fall outside a reasonable or predefined range. These errors can arise due to data entry mistakes, inconsistencies in data sources, or missing value replacements.  


### **Cell 2 – Parameters for the Checker**

In [561]:
# Define expected value ranges for numerical attributes
expected_ranges = {
    "release_year": (1900, 2025),  # Release year should be within this range  
}


### **Cell 3 – Checker Code**

In [562]:
# Coerce release_year to a numeric dtype (unparseable values become NaN)
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')

# Function to check if numerical values fall within the expected range
def check_range_errors(df, expected_ranges):
    out_of_range = {}
    for column, (min_val, max_val) in expected_ranges.items():
        invalid_values = df[(df[column] < min_val) | (df[column] > max_val)][column].tolist()
        if invalid_values:
            out_of_range[column] = invalid_values
    return out_of_range


# Run the range check
range_issues = check_range_errors(df, expected_ranges)
print("Range issues: ")
print(range_issues)


Range issues: 
{}


### **Cell 4 – Report of Findings: Range Errors**

#### **Summary of Range Errors**
After running the range validation, **no issues** were found in the dataset.

#### **Insights:**
- `release_year` is the only column checked, and all of its values fall within the expected range of **1900 to 2025**.
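
The upper bound of 2025 is hard-coded in the parameters. As a sketch of an alternative, the bound could be derived from the current year, under the assumption that a release year later than the current year is implausible. The helper `out_of_range` and the sample years below are hypothetical.

```python
from datetime import date

import pandas as pd

def out_of_range(values: pd.Series, low: int, high: int) -> pd.Series:
    """Return the values that fall outside the inclusive range [low, high]."""
    numeric = pd.to_numeric(values, errors="coerce")
    # NaN compares False against both bounds, so missing values are not flagged here
    return numeric[(numeric < low) | (numeric > high)]

upper_bound = date.today().year  # assumption: no title is released after this year
sample_years = pd.Series([1925, 2021, 1800, 3020, None])
flagged = out_of_range(sample_years, 1900, upper_bound)
print(flagged.tolist())
```

Missing values pass this check silently, so they would need to be counted separately.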


---


## **3. Format Errors**

### **Cell 1 – Description**
Format errors occur when values do not follow the expected structure or pattern. This section checks for:

- **Unexpected categorical values**: `type` must be either `Movie` or `TV Show`, and `rating` must match a standard TV/movie classification.
- **Unexpected date formats**: `date_added` should follow the `Month D, YYYY` pattern (e.g., `September 24, 2021`).


### **Cell 2 – Parameters for the Checker**


In [563]:
# Define expected categorical values for formatting
expected_categories = {
    "type": {"Movie", "TV Show"},  # Should be either Movie or TV Show
    "rating": {"G", "TV-G", "PG", "PG-13", "TV-Y", "TV-Y7", "TV-Y7-FV", "TV-PG",
               "TV-14", "R", "NR", "TV-MA", "NC-17", "UR"},  # Expected ratings
}

# Define expected date format pattern for 'date_added'
expected_date_pattern = r"^[A-Za-z]+\s\d{1,2},\s\d{4}$"  # e.g., "September 24, 2021"


### **Cell 3 – Checker Code**

In [564]:
def check_categorical_values(df, expected_categories):
    # Check whether categorical columns contain unexpected values.
    # Missing values (NaN) are not in the expected set, so they are reported too.
    invalid_categories = {}
    for column, valid_values in expected_categories.items():
        if valid_values:  # Skip columns with no expected values defined
            unexpected_values = df[~df[column].isin(valid_values)][column].unique()
            if len(unexpected_values) > 0:
                invalid_categories[column] = unexpected_values
    return invalid_categories


def check_date_format_issues(df):
    # Dictionary to store errors
    errors = {}

    
    # Treat 'date_added' as a string for pattern matching.
    # Note: astype(str) turns missing values into the literal string "nan",
    # so the isna() filter below no longer excludes them and they are reported too.
    df["date_added"] = df["date_added"].astype(str)

    # Identify rows where 'date_added' does NOT match the expected format
    raw_date_values = df[
        (~df["date_added"].str.match(expected_date_pattern, na=True)) &  # Mismatched formats
        (df["date_added"].str.strip() != "") &  # Exclude empty strings
        (~df["date_added"].isna())  # No effect after astype(str); see note above
    ]

    # Store the incorrect formats in the errors dictionary
    if not raw_date_values.empty:
        errors["date_added_raw"] = raw_date_values[["date_added"]].head(20)

    return errors


# Run the categorical value check
category_issues = check_categorical_values(df, expected_categories)
print("Category issues: ")
print(category_issues)

# Run the date format consistency check
date_format_issues = check_date_format_issues(df)
print("Date format issues: ")
print(date_format_issues)


Category issues: 
{'rating': array(['74 min', '84 min', '66 min', nan], dtype=object)}
Date format issues: 
{'date_added_raw':               date_added
6066                 nan
6079      August 4, 2017
6174                 nan
6177   December 23, 2018
6213   December 15, 2018
6279        July 1, 2017
6304       July 26, 2019
6318        May 26, 2016
6357    November 1, 2019
6361    December 2, 2017
6368      March 15, 2019
6393     October 1, 2019
6451   December 15, 2017
6456        July 1, 2017
6457      August 4, 2017
6460       April 4, 2017
6519   December 28, 2016
6549      March 31, 2018
6560    February 1, 2019
6603     January 1, 2018}


### **Cell 4 – Report of Findings: Format Errors**

The following column(s) have **incorrect category values**:

| Column Name  | Unexpected Values |
|-------------|------------------|
| `rating`    | `'74 min', '84 min', '66 min'` |


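The check also flags missing (`nan`) ratings, but those are missing values rather than format errors. The three affected rows are shown below: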

In [565]:
incorrect_ratings = df[df['rating'].isin(['74 min', '84 min', '66 min'])]
display.display(incorrect_ratings)

|      | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|------|---------|------|-------|----------|------|---------|------------|--------------|--------|----------|-----------|-------------|
| 5541 | s5542 | Movie | Louis C.K. 2017 | Louis C.K. | Louis C.K. | United States | April 4, 2017 | 2017 | 74 min | NaN | Movies | Louis C.K. muses on religion, eternal love, gi... |
| 5794 | s5795 | Movie | Louis C.K.: Hilarious | Louis C.K. | Louis C.K. | United States | September 16, 2016 | 2010 | 84 min | NaN | Movies | Emmy-winning comedy writer Louis C.K. brings h... |
| 5813 | s5814 | Movie | Louis C.K.: Live at the Comedy Store | Louis C.K. | Louis C.K. | United States | August 15, 2016 | 2015 | 66 min | NaN | Movies | The comic puts his trademark hilarious/thought... |


The following column(s) have **incorrectly formatted dates**:

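| Column Name  | Unexpected Format Examples | Issue |
|-------------|---------------------------|-------|
| `date_added` | `' August 4, 2017'`, `' December 23, 2018'`, `' March 31, 2018'` | Leading space before the month |

The first 20 flagged values are listed below. The `'nan'` entries are missing dates that the checker converted to strings; they are not formatting errors.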



In [566]:
print(date_format_issues["date_added_raw"]["date_added"].tolist())

['nan', ' August 4, 2017', 'nan', ' December 23, 2018', ' December 15, 2018', ' July 1, 2017', ' July 26, 2019', ' May 26, 2016', ' November 1, 2019', ' December 2, 2017', ' March 15, 2019', ' October 1, 2019', ' December 15, 2017', ' July 1, 2017', ' August 4, 2017', ' April 4, 2017', ' December 28, 2016', ' March 31, 2018', ' February 1, 2019', ' January 1, 2018']


#### **Insights:**
- The `rating` column contains **duration values** (e.g., `'74 min'`). In all three affected rows `duration` is empty, which suggests the duration was entered in the `rating` column by mistake.
- Some `date_added` values have a **leading space**, so the column does not follow one consistent format and date parsing may fail for those rows.
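
The `'nan'` entries in the output above come from converting the column to strings before matching. A minimal sketch of a variant that leaves the input untouched and reports only non-missing values is shown below; `find_bad_date_formats` is a hypothetical helper, and the sample values are made up.

```python
import pandas as pd

DATE_PATTERN = r"^[A-Za-z]+\s\d{1,2},\s\d{4}$"  # e.g. "September 24, 2021"

def find_bad_date_formats(values: pd.Series, pattern: str = DATE_PATTERN) -> pd.Series:
    """Return the non-missing values that do not match the pattern.

    The input Series is not modified, so missing values stay missing.
    """
    present = values.dropna()
    matches = present.astype(str).str.match(pattern)
    return present[~matches]

# Hypothetical sample: valid, leading space, missing, and ISO-style values
sample = pd.Series(["September 24, 2021", " August 4, 2017", None, "2019-07-26"])
bad = find_bad_date_formats(sample)
print(bad.tolist())
```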

#### **Next Steps:**
- Strip surrounding whitespace from `date_added`, then convert all values to a standard **datetime format (`YYYY-MM-DD`)**.
- Investigate why **duration values** appear in the `rating` column.
- Clean the `rating` column by ensuring only valid rating categories are included.
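
If the investigation confirms that the durations were simply entered in the wrong column, one possible repair is sketched below. This is an assumption about the cause, not a confirmed fix; `move_misplaced_durations` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def move_misplaced_durations(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where 'N min' values in `rating` are moved to an empty `duration`."""
    fixed = df.copy()
    # Only touch rows where rating looks like a duration AND duration is missing
    misplaced = (
        fixed["rating"].str.fullmatch(r"\d+ min", na=False)
        & fixed["duration"].isna()
    )
    fixed.loc[misplaced, "duration"] = fixed.loc[misplaced, "rating"]
    fixed.loc[misplaced, "rating"] = None  # the true rating is unknown
    return fixed

# Hypothetical toy rows: one misplaced duration, one valid row
toy = pd.DataFrame({
    "title": ["Example Special", "Example Series"],
    "rating": ["74 min", "TV-MA"],
    "duration": [None, "2 Seasons"],
})
fixed = move_misplaced_durations(toy)
print(fixed)
```

The repaired rows end up with a missing `rating`, which would then be handled together with the other missing values.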

---

## **4. Consistency Errors**

### **Cell 1 – Description**  
Consistency errors occur when data values contradict expected logical constraints. These issues can lead to incorrect analysis and misinterpretation of results. Common examples include:  
- **Dates occurring in the future** (e.g., `date_added` should not be later than today's date).  
- **Unrealistic durations** (e.g., a movie duration below 0 minutes is impossible).  
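
The two checks can be illustrated on a small made-up frame. The rows below are hypothetical, and the sketch flags durations `<= 0` rather than `< 0`, on the assumption that a zero-minute movie is also invalid; a pattern such as `(\d+) min` never captures a minus sign, so a strict `< 0` test cannot fire on extracted values.

```python
import pandas as pd

# Hypothetical rows: one valid, one added in the future, one with zero length
toy = pd.DataFrame({
    "title": ["Past title", "Future title", "Zero-length title"],
    "date_added": pd.to_datetime(["2021-09-24", "2100-01-01", "2020-01-01"]),
    "duration": ["90 min", "2 Seasons", "0 min"],
})

# Minutes are extracted for movies only; TV shows ("N Seasons") become NaN
minutes = toy["duration"].str.extract(r"(\d+) min")[0].astype(float)

future = toy[toy["date_added"] > pd.Timestamp.today()]
non_positive = toy[minutes <= 0]  # NaN compares False, so TV shows are skipped

print(future["title"].tolist())
print(non_positive["title"].tolist())
```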
 
### **Cell 2 – Parameters for the Checker**


In [567]:
# Define expected consistency checks
consistency_checks = {
    "date_added": datetime.today(),  # Should not be in the future
    "duration": 0  # Minimum movie duration in minutes; lower values are flagged
}

### **Cell 3 – Checker Code**

In [568]:
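# Convert 'date_added' to datetime. Strip the leading spaces found in the
# format check first so those values are parsed rather than coerced to NaT.
df["date_added"] = pd.to_datetime(df["date_added"].str.strip(), errors='coerce')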

# Extract duration values for minutes only
df["duration_mins"] = df["duration"].str.extract(r'(\d+) min')[0].astype(float)

# Function to check for consistency errors
def check_consistency_errors(df, consistency_checks):
    errors = {}

    # Check if 'date_added' contains future dates
    future_dates = df[df["date_added"] > consistency_checks["date_added"]]
    if not future_dates.empty:
        errors["date_added"] = future_dates
        
    # Check if 'duration' for movies is below the configured minimum (0 minutes)
    short_movies = df[(df["duration_mins"] < consistency_checks["duration"]) & df["duration"].str.contains("min", na=False)]
    if not short_movies.empty:
        errors["duration"] = short_movies

    return errors

# Run the consistency check
consistency_issues = check_consistency_errors(df, consistency_checks)
print(consistency_issues)


{}


### **Cell 4 – Report of Findings: Consistency Errors**

No consistency issues were identified in this dataset.

---

## **5. Uniqueness Errors**

### **Cell 1 – Description**
Uniqueness errors occur when values that are expected to be unique contain duplicates. These errors can cause inconsistencies in data analysis and may indicate data integrity issues.

- **Duplicate `show_id` values**: Each entry should have a unique `show_id`, as it serves as the primary identifier for movies and TV shows.
- **Duplicate `title` values**: Multiple entries may legitimately share a title (e.g., different versions or releases), so this check flags titles that are repeated exactly.  
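
To separate legitimate repeats from likely duplicates, a title could be compared together with other columns. The sketch below uses made-up rows and a composite key of `title`, `type`, and `release_year`; the choice of key is an assumption, not something the dataset defines.

```python
import pandas as pd

# Hypothetical rows: "Echo" is a movie and a later TV show; "Drift" is entered twice
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "title": ["Echo", "Echo", "Drift", "Drift"],
    "release_year": [2019, 2021, 2020, 2020],
})

# keep=False marks every member of a duplicate group, not just the later ones
same_title = toy[toy.duplicated(subset=["title"], keep=False)]
same_key = toy[toy.duplicated(subset=["title", "type", "release_year"], keep=False)]

print(same_title["show_id"].tolist())
print(same_key["show_id"].tolist())
```

The title-only check flags all four rows, while the composite key narrows the result to the two `Drift` entries that agree on every key column.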

### **Cell 2 – Parameters for the Checker**
n/a (this checker takes no parameters)

### **Cell 3 – Checker Code**


In [569]:
# Define a function to check uniqueness errors
def check_uniqueness_errors(df):
    errors = {}

    # Check for duplicate show_id values
    duplicate_show_ids = df[df.duplicated(subset=["show_id"], keep=False)]
    if not duplicate_show_ids.empty:
        errors["duplicate_show_id"] = duplicate_show_ids

    # Check for duplicate title values
    duplicate_titles = df[df.duplicated(subset=["title"], keep=False)]
    if not duplicate_titles.empty:
        errors["duplicate_title"] = duplicate_titles

    return errors

# Run the uniqueness check
uniqueness_issues = check_uniqueness_errors(df)

# Print results
print("Uniqueness issues:")
print(uniqueness_issues)


Uniqueness issues:
{}


### **Cell 4 – Report of Findings: Uniqueness Errors**

No uniqueness issues were identified in this dataset.

---