# **Clean Data Checking: Netflix Movies and TV Shows**

**Group Number:** 97  
**Members:**  
Roy Rui #300176548  
Jiayi Ma #300263220
 

# **Dataset I: Netflix Movies and TV Shows Dataset**  
**Author**: Shivam Bansal  
**Ref**: [Source](https://www.kaggle.com/datasets/shivamb/netflix-shows)  
**Shape**: **12 Columns, 8807 Rows**  

## **Description**  
The **Netflix Titles Dataset** contains metadata about TV shows and movies available on Netflix. The dataset provides detailed information such as **title, director, cast, country of production, release year, rating, and duration**. It is widely used for exploratory data analysis (EDA), data cleaning, and recommendation system development.

| Feature        | Description | Data Type |
|----------------|-------------|-----------|
| show_id        | Unique identifier for each title | Categorical |
| type           | Type of content (Movie or TV Show) | Categorical |
| title          | Title of the movie or TV show | Text |
| director       | Name of the director(s) | Categorical |
| cast           | List of main actors | Categorical |
| country        | Country where the content was produced | Categorical |
| date_added     | Date when the content was added to Netflix | Date (stored as text) |
| release_year   | Year of content release | Numerical |
| rating         | Maturity rating (e.g., TV-MA, PG-13) | Categorical |
| duration       | Duration in minutes (movies) or number of seasons (TV shows) | Categorical |
| listed_in      | Genre categories (e.g., Drama, Comedy) | Categorical |
| description    | Brief description of the title | Text |

This dataset is useful for **content analysis, trends in streaming media, and audience preferences**, and is particularly valuable for **clean data checking, missing value analysis, and feature engineering**.


**General Imports**

In [9]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Netflix dataset
df = pd.read_csv('dataset1/netflix_titles.csv')


## **Data Type Errors**

### **Cell 1 – Description**  
Data type errors occur when column values do not match their expected data types. This can cause:
- **Inaccurate calculations** (e.g., `release_year` stored as a string instead of an integer).
- **Sorting and filtering issues** (e.g., dates stored as text).
- **Unexpected behaviors** when performing analysis.
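
The first two issues in the list above can be reproduced on a few hand-made values. This is a minimal sketch using hypothetical sample values (not rows from the Netflix file), and it assumes dates are written as month-name strings such as `April 1, 2021`:

```python
import pandas as pd

# Hypothetical sample values for illustration only (not taken from the dataset)
years_as_text = pd.Series(["2019", "2021"], dtype="object")
dates_as_text = pd.Series(["January 5, 2020", "April 1, 2021"])

# Summing text concatenates the strings instead of adding numbers
text_sum = years_as_text.sum()
numeric_sum = years_as_text.astype("int64").sum()

# Sorting text is alphabetical, so "April ..." comes before "January ..."
text_order = dates_as_text.sort_values().tolist()

# Parsing first gives chronological order: January 2020 precedes April 2021
parsed = pd.to_datetime(dates_as_text, format="%B %d, %Y")
date_order = parsed.sort_values().tolist()

print(text_sum, numeric_sum)
print(text_order)
print(date_order)
```

In both cases the operation runs without raising an error, which is why these mismatches are easy to miss without an explicit check.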

### **Objective**  
- Identify columns where data types are incorrect.
- Convert them to appropriate types for proper analysis.


### **Cell 2 – Parameters for the Checker**

In [None]:
expected_dtypes = {
    "show_id": "object",  # Unique ID, stored as a string
    "type": "object",  # Categorical: Movie/TV Show
    "title": "object",  # Movie/TV Show title, text
    "director": "object",  # Names, text
    "cast": "object",  # List of actors, text
    "country": "object",  # Country names, text
    "date_added": "datetime64[ns]",  # Should be converted to a datetime format
    "release_year": "int64",  # Four-digit year, expected to load as an integer
    "rating": "object",  # Categorical text (e.g., PG, TV-MA)
    "duration": "object",  # Contains both text and numbers
    "listed_in": "object",  # Genre categories, text
    "description": "object"  # Movie/TV Show description, text
}


### **Cell 3 – Checker Code**

In [11]:
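# Function to check if actual data types match expected data types
def check_data_types(df, expected_dtypes):
    """Return {column: actual dtype} for every column whose dtype differs from the expected one.

    A column listed in expected_dtypes but absent from df is reported with the value None.
    """
    mismatches = {}
    for column, expected_type in expected_dtypes.items():
        if column not in df.columns:
            mismatches[column] = None  # Column is missing entirely
            continue
        actual_type = df[column].dtype
        # Compare by dtype name, e.g. "object", "int64", "datetime64[ns]"
        if str(actual_type) != expected_type:
            mismatches[column] = actual_type
    return mismatches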

# Run the data type check
data_type_issues = check_data_types(df, expected_dtypes)
print(data_type_issues)


{'date_added': dtype('O')}


### **Cell 4 – Report of Findings**

#### **Summary of Data Type Errors**
After running the data type validation, the following column(s) were found to have incorrect data types:

| Column Name  | Expected Type    | Actual Type |
|--------------|------------------|-------------|
| `date_added` | `datetime64[ns]` | `object`    |

#### **Impact of Data Type Errors**
- **`date_added` should be in datetime format** to allow for proper time-based analysis, such as filtering by year or month.
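- Because it is currently stored as a string (`object`), date operations compare text rather than dates: sorting is alphabetical instead of chronological, and date-range filters or `.dt` accessors (such as extracting the year or month) are unavailable until the column is converted.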

#### **Next Steps**
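- Convert `date_added` to a `datetime64` dtype (for example, with `pd.to_datetime`) to ensure consistency in date-related operations.
- Re-run the checker after the conversion to confirm that no mismatches remain.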
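
One possible way to carry out the conversion is sketched below. The helper name `convert_date_added` is hypothetical, the sample rows are made up for illustration, and the format string assumes values look like `September 25, 2021`, possibly with stray leading whitespace; adjust it if the actual file differs.

```python
import pandas as pd

def convert_date_added(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with date_added parsed to a datetime64 dtype."""
    converted = frame.copy()
    # Remove stray leading/trailing whitespace before parsing
    cleaned = converted["date_added"].str.strip()
    # errors="coerce" turns missing or unparseable values into NaT instead of raising
    converted["date_added"] = pd.to_datetime(
        cleaned, format="%B %d, %Y", errors="coerce"
    )
    return converted

# Hypothetical sample rows (not taken from the dataset)
sample = pd.DataFrame(
    {"date_added": ["September 25, 2021", " August 4, 2017", None]}
)
result = convert_date_added(sample)

print(pd.api.types.is_datetime64_any_dtype(result["date_added"]))
print(result["date_added"].isna().sum())
```

Using `errors="coerce"` keeps the conversion from failing on bad rows, at the cost of silently producing `NaT`; comparing the count of missing values before and after the conversion shows whether any non-empty values failed to parse. The sketch checks for any `datetime64` dtype rather than a specific unit, so the unit should be verified separately if the checker expects exactly `datetime64[ns]`.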







