# **Clean Data Checking: Netflix Movies and TV Shows**

**Group Number:** 97  
**Members:**  
Roy Rui #300176548  
Jiayi Ma #300263220
 

# **Dataset I: Netflix Movies and TV Shows Dataset**  
**Author**: Shivam Bansal  
**Ref**: [Source](https://www.kaggle.com/datasets/shivamb/netflix-shows)  
**Shape**: **12 Columns, 8807 Rows**  

## **Description**  
The **Netflix Titles Dataset** contains metadata about TV shows and movies available on Netflix. The dataset provides detailed information such as **title, director, cast, country of production, release year, rating, and duration**. It is widely used for exploratory data analysis (EDA), data cleaning, and recommendation system development.

| Feature         | Description  | Data Type   |
|----------------|--------------|-------------|
| show_id        | Unique identifier for each title | Categorical |
| type           | Type of content (Movie or TV Show) | Categorical |
| title         | Title of the movie or TV show | Categorical |
| director       | Name of the director(s) | Categorical |
| cast          | List of main actors | Categorical |
| country        | Country where the content was produced | Categorical |
| date_added     | Date when the content was added to Netflix | Date (stored as text) |
| release_year   | Year of content release | Numerical |
| rating         | Maturity rating (e.g., TV-MA, PG-13) | Categorical |
| duration       | Duration in minutes (movies) or number of seasons (TV shows) | Categorical |
| listed_in      | Genre categories (e.g., Drama, Comedy) | Categorical |
| description    | Brief description of the title | Categorical |

This dataset is useful for **content analysis, trends in streaming media, and audience preferences**, and is particularly valuable for **clean data checking, missing value analysis, and feature engineering**.


**General Imports**

In [72]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as display

# Load the Netflix dataset
df = pd.read_csv('dataset1/netflix_titles.csv')


## **Data Type Errors**

### **Cell 1 – Description**  
Data type errors occur when column values do not match their expected data types. This can cause:
- Inaccurate calculations 
- Sorting and filtering issues
- Unexpected behaviors 
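
As a small illustration of the first two consequences, the sketch below uses a made-up `years_as_text` Series (hypothetical values, not taken from the Netflix data) in which numbers are stored as strings. It is only meant to show how sorting and arithmetic behave differently before and after conversion.

```python
import pandas as pd

# Hypothetical values for illustration only: numbers stored as text
years_as_text = pd.Series(["2019", "987", "2021"])

# String comparison is lexicographic, so "987" sorts after "2021"
sorted_as_text = years_as_text.sort_values().tolist()

# Adding two string values concatenates them instead of summing
concatenated = years_as_text.iloc[0] + years_as_text.iloc[1]

# After conversion to a numeric dtype, sorting and arithmetic are numeric
years_as_int = pd.to_numeric(years_as_text)
sorted_as_int = years_as_int.sort_values().tolist()
total = int(years_as_int.sum())

print(sorted_as_text)
print(sorted_as_int)
print(concatenated, total)
```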



### **Cell 2 – Parameters for the Checker**

In [73]:
expected_dtypes = {
    "show_id": "object",  # Unique ID, stored as a string
    "type": "object",  # Categorical: Movie/TV Show
    "title": "object",  # Movie/TV Show title, text
    "director": "object",  # Names, text
    "cast": "object",  # List of actors, text
    "country": "object",  # Country names, text
    "date_added": "datetime64[ns]",  # Should be converted to a datetime format
    "release_year": "int64",  # Should be converted before checking
    "rating": "object",  # Categorical text (e.g., PG, TV-MA)
    "duration": "object",  # Contains both text and numbers
    "listed_in": "object",  # Genre categories, text
    "description": "object"  # Movie/TV Show description, text
}


### **Cell 3 – Checker Code**

In [74]:
# Function to check if actual data types match expected data types
def check_data_types(df, expected_dtypes):
    mismatches = {}
    for column, expected_type in expected_dtypes.items():
        actual_type = df[column].dtype
        if str(actual_type) != expected_type:
            mismatches[column] = actual_type
    return mismatches

# Run the data type check
data_type_issues = check_data_types(df, expected_dtypes)
print(data_type_issues)


{'date_added': dtype('O')}


### **Cell 4 – Report of Findings**

#### **Summary of Data Type Errors**
After running the data type validation, the following column(s) were found to have incorrect data types:

| Column Name  | Expected Type   | Actual Type |
|-------------|----------------|------------|
| `date_added` | `datetime64[ns]` | `object` |

#### **Impact of Data Type Errors**
- **`date_added` should be in datetime format** to allow for proper time-based analysis, such as filtering by year or month.
- Because it is currently stored as a string (`object`), comparisons are alphabetical rather than chronological, so sorting and date-range filters can return incorrect results.

#### **Next Steps**
- Convert `date_added` to `datetime64[ns]` (for example with `pd.to_datetime`) so that date-related operations behave consistently.
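
One possible way to carry out this conversion is sketched below on a small hypothetical `sample` frame rather than the real data. The format string `"%B %d, %Y"` is inferred from dates such as `April 4, 2017` shown elsewhere in this notebook, and the stray leading space and the unparseable entry are assumptions added to show how such cases would be handled.

```python
import pandas as pd

# Hypothetical sample in the "Month D, YYYY" style; not the real records
sample = pd.DataFrame({
    "date_added": ["April 4, 2017", " September 16, 2016", None, "not a date"]
})

# Strip stray whitespace, then parse; entries that cannot be parsed become NaT
sample["date_added"] = pd.to_datetime(
    sample["date_added"].str.strip(),
    format="%B %d, %Y",
    errors="coerce",
)

# Count how many entries could not be converted
unparsed = int(sample["date_added"].isna().sum())

print(sample["date_added"].dtype)
print(unparsed)
```

With `errors="coerce"`, failed conversions are not raised as errors, so counting the resulting `NaT` values is a reasonable follow-up before relying on the column.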









## **Range Errors**  

### **Cell 1 – Description**  
Range errors occur when numerical values fall outside a reasonable or predefined range. These errors can arise from data entry mistakes, inconsistencies between data sources, or placeholder values substituted for missing data.  


### **Cell 2 – Parameters for the Checker**

In [75]:
# Define expected value ranges for numerical attributes
expected_ranges = {
    "release_year": (1900, 2025),  # Release year should be within this range  
}


### **Cell 3 – Checker Code**

In [76]:
# Coerce release_year to a numeric dtype; values that cannot be parsed become NaN
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')

# Function to check if numerical values fall within the expected range
def check_range_errors(df, expected_ranges):
    out_of_range = {}
    for column, (min_val, max_val) in expected_ranges.items():
        invalid_values = df[(df[column] < min_val) | (df[column] > max_val)][column].tolist()
        if invalid_values:
            out_of_range[column] = invalid_values
    return out_of_range


# Run the range check
range_issues = check_range_errors(df, expected_ranges)
print("Range issues: ")
print(range_issues)


Range issues: 
{}


### **Cell 4 – Report of Findings: Range Errors**

#### **Summary of Range Errors**
After running the range validation, **no issues** were found in the dataset.

#### **Insights:**
- `release_year` is the only numerical column that was checked.
- All `release_year` values fall within the expected range of **1900 to 2025**.
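
One caveat worth noting: comparisons against `NaN` evaluate to `False`, so any value that `pd.to_numeric(..., errors='coerce')` turned into `NaN` would pass the range check unnoticed. The sketch below, using a hypothetical helper `summarize_range_check` and made-up values, shows one way to count missing or unparseable entries separately from out-of-range ones.

```python
import pandas as pd

def summarize_range_check(series, min_val, max_val):
    """Count in-range, out-of-range, and missing values separately."""
    numeric = pd.to_numeric(series, errors="coerce")
    in_range = numeric.between(min_val, max_val)  # inclusive; NaN gives False
    missing = numeric.isna()
    return {
        "in_range": int(in_range.sum()),
        "out_of_range": int((~in_range & ~missing).sum()),
        "missing_or_unparseable": int(missing.sum()),
    }

# Hypothetical values for illustration only
toy_years = pd.Series([2017, 1850, "unknown", 2030, None])
summary = summarize_range_check(toy_years, 1900, 2025)
print(summary)
```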


## **Format Errors**

### **Cell 1 – Description**
Format errors occur when categorical values do not follow the expected structure or pattern. These errors can arise due to:
- **Unexpected categorical values** (e.g., a rating value that does not match standard TV/movie classifications).
- **Inconsistent formatting** (e.g., differences in capitalization, spacing, or abbreviations).
- **Incorrectly formatted categorical values** that require standardization for consistency.
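
To make the distinction between these cases concrete, the sketch below separates values that differ from a valid category only by capitalization or spacing from values that are genuinely unexpected. The helper name `split_unexpected_values` and the toy data are hypothetical and are not part of the checker used in this notebook.

```python
import pandas as pd

def split_unexpected_values(series, valid_values):
    """Separate formatting variants of valid values from truly unexpected ones."""
    # Map a normalized key (trimmed, case-insensitive) to the canonical spelling
    canonical = {v.strip().casefold(): v for v in valid_values}
    variants, unexpected = {}, []
    for value in series.dropna().unique():
        if value in valid_values:
            continue  # already in the expected form
        key = str(value).strip().casefold()
        if key in canonical:
            variants[value] = canonical[key]  # fixable by standardization
        else:
            unexpected.append(value)  # needs manual investigation
    return variants, unexpected

# Hypothetical values for illustration only
toy_type = pd.Series(["Movie", "movie ", "TV SHOW", "Documentary", None])
variants, unexpected = split_unexpected_values(toy_type, {"Movie", "TV Show"})
print(variants)
print(unexpected)
```

Missing values are dropped here on purpose, so that they can be reported as missing data rather than as format errors.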




### **Cell 2 – Parameters for the Checker**

In [77]:
# Define expected categorical values for formatting
expected_categories = {
    "type": {"Movie", "TV Show"},  # Should be either Movie or TV Show
    "rating": {"G", "TV-G", "PG", "PG-13", "TV-Y", "TV-Y7", "TV-Y7-FV", "TV-PG",
               "TV-14", "R", "NR", "TV-MA", "NC-17", "UR"},  # Expected ratings
}

### **Cell 3 – Checker Code**

In [78]:
def check_categorical_values(df, expected_categories):
    invalid_categories = {}
    for column, valid_values in expected_categories.items():
        if valid_values:  # Skip columns with None (diverse categories)
            unexpected_values = df[~df[column].isin(valid_values)][column].unique()
            if len(unexpected_values) > 0:
                invalid_categories[column] = unexpected_values
    return invalid_categories

category_issues = check_categorical_values(df, expected_categories)
print("Category issues: ")
print(category_issues)

Category issues: 
{'rating': array(['74 min', '84 min', '66 min', nan], dtype=object)}


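### **Cell 4 – Report of Findings: Format Errors**

The following column(s) have **incorrect category values**:

| Column Name  | Unexpected Values |
|-------------|------------------|
| `rating`    | `'74 min', '84 min', '66 min'` |

The checker output also lists `nan` for `rating` because `isin` treats missing values as outside the expected set; those entries are missing values rather than format errors. No unexpected values were found in `type`.

The affected rows are shown below: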

In [79]:
incorrect_ratings = df[df['rating'].isin(['74 min', '84 min', '66 min'])]
display.display(incorrect_ratings)

|      | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|------|---------|------|-------|----------|------|---------|------------|--------------|--------|----------|-----------|-------------|
| 5541 | s5542 | Movie | Louis C.K. 2017 | Louis C.K. | Louis C.K. | United States | April 4, 2017 | 2017 | 74 min | NaN | Movies | Louis C.K. muses on religion, eternal love, gi... |
| 5794 | s5795 | Movie | Louis C.K.: Hilarious | Louis C.K. | Louis C.K. | United States | September 16, 2016 | 2010 | 84 min | NaN | Movies | Emmy-winning comedy writer Louis C.K. brings h... |
| 5813 | s5814 | Movie | Louis C.K.: Live at the Comedy Store | Louis C.K. | Louis C.K. | United States | August 15, 2016 | 2015 | 66 min | NaN | Movies | The comic puts his trademark hilarious/thought... |


#### **Insights:**
- The `rating` column contains **duration values** (e.g., `'74 min'`), which should not be there.
- In all three rows the `duration` field is empty, which suggests that the duration was **entered in the wrong column** and the actual rating is missing.

#### **Next Steps:**
- Investigate why **duration values** appear in the `rating` column (for example, a column shift in the source data).
- Move the incorrect entries from `rating` to `duration`, which is empty for these rows.
- Clean the `rating` column so that only valid rating categories remain.
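
A minimal sketch of the reassignment step is shown below. It runs on a hypothetical `toy` frame that mirrors the pattern found above rather than on the real records, and it assumes that a `rating` value of the form `"<number> min"` paired with an empty `duration` is a misplaced duration.

```python
import pandas as pd

# Hypothetical rows mirroring the pattern found above (not the real records)
toy = pd.DataFrame({
    "rating": ["74 min", "PG-13", "66 min", "TV-MA"],
    "duration": [None, "90 min", None, "2 Seasons"],
})

# Rows where rating looks like a duration and duration itself is empty
misplaced = toy["rating"].str.fullmatch(r"\d+ min", na=False) & toy["duration"].isna()

# Move the value into duration, then mark the rating as missing
toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]
toy.loc[misplaced, "rating"] = pd.NA

print(toy)
```

Requiring `duration` to be empty keeps the fix conservative: a row with a suspicious `rating` but an existing `duration` is left untouched for manual review.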