# 01 - Data Cleaning Notebook - Isailton 

- Project: Machine Learning
- Dataset: Student performance (student-mat.csv)
- Team MIKE WHEELER (Safina, Charles, Isailton)

## Step 0 — Import Libraries & Load Data

In [None]:
# Step 0 - Import required libraries
# pandas: data manipulation and analysis
# numpy: numerical computations

import pandas as pd
import numpy as np

In [None]:
# Step 0 - Load the dataset using a RELATIVE PATH
# This ensures the code works for all team members after git push
# The dataset uses ';' as a separator

DATA_PATH = "../data/student-mat.csv"

df = pd.read_csv(DATA_PATH, sep=";")

# Display first rows to confirm successful loading
df.head()
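
A quick way to confirm that the separator is right is to look at the column count: with the wrong separator, pandas parses each row into a single column. The sketch below illustrates this on a small in-memory sample. The rows are made up for the example and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical sample using the same ';'-separated layout as the dataset
sample = "school;sex;age\nGP;F;18\nGP;M;17\n"

# Default separator is ',' so every row collapses into one column
wrong = pd.read_csv(io.StringIO(sample))

# Passing sep=";" splits the rows into the expected three columns
right = pd.read_csv(io.StringIO(sample), sep=";")

print("default sep:", wrong.shape)
print("sep=';':    ", right.shape)
```

If `df.shape` after loading reports a single column, the separator is the first thing to check.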

## Step 1 — Check Shape of the Data

In [None]:
# Step 1 - Check the shape of the dataset
# This shows how many rows (observations) and columns (features) we have

df.shape

# Explanation:
# - Rows → number of students
# - Columns → number of features

## Step 2 — Standardize Column Names

In [None]:
# Step 2 - Display original column names
df.columns

In [None]:
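# Step 2 - Rename columns to snake_case (the naming style PEP 8 uses for variables)
# - Strip leading/trailing whitespace
# - Convert to lowercase
# - Replace spaces with underscores

df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)  # literal replacement, not a regex
)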

# Confirm column names were updated
df.columns

# Why this matters:
# - Standardized column names improve readability and prevent coding errors.
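
To see what the renaming chain does, here is a minimal sketch on an empty DataFrame with deliberately messy headers. The header names are hypothetical and chosen only to exercise each step.

```python
import pandas as pd

# Hypothetical messy headers, for illustration only
demo = pd.DataFrame(columns=[" School ", "Mother Job", "G3"])

demo.columns = (
    demo.columns
    .str.strip()                         # " School " -> "School"
    .str.lower()                         # "School"   -> "school"
    .str.replace(" ", "_", regex=False)  # "mother job" -> "mother_job"
)

print(list(demo.columns))  # ['school', 'mother_job', 'g3']
```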

## Step 3 — Check Data Types

In [None]:
# Step 3 - Inspect data types of each column
# This helps decide how to clean and preprocess each feature

df.dtypes

# Explanation:
# - object → text columns (here, string-valued categorical variables)
# - int64 / float64 → numerical variables

## Step 4 — Check for Missing (NaN) Values

In [None]:
# Step 4 - Count missing values per column

df.isna().sum()

In [None]:
# Step 4 - Sort missing values for easier inspection

df.isna().sum().sort_values(ascending=False)

# Explanation:
# - If every count is 0, no imputation is needed and Steps 7.2 and 8.2
#   act only as safeguards. Check the output above before assuming this.
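
When missing values do exist, a count and a percentage per column are easier to read side by side. The following is a minimal sketch on made-up data. The `demo` frame and its values are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in "age" and one in "school"
demo = pd.DataFrame({
    "age": [18, np.nan, 17, 16],
    "school": ["GP", "GP", None, "MS"],
    "g3": [10, 12, 9, 15],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": demo.isna().sum(),
    "pct_missing": (demo.isna().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, largest first
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```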

## Step 5 — Check and Remove Duplicates

In [None]:
# Step 5 - Check how many duplicated rows exist

df.duplicated().sum()

In [None]:
# Step 5 - Remove duplicated rows (if any)

df = df.drop_duplicates()

# Why this is important:
# - Duplicates can bias model training and evaluation.
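
Before dropping duplicates it can help to look at them and to record how many rows were removed. The sketch below shows one way to do that on made-up rows. Resetting the index afterwards is an optional choice here, not something the notebook above does.

```python
import pandas as pd

# Hypothetical data: the (GP, 18) row appears three times
demo = pd.DataFrame({
    "school": ["GP", "GP", "MS", "GP"],
    "age": [18, 18, 17, 18],
})

# keep=False marks every member of a duplicate group, not just the repeats
dup_rows = demo[demo.duplicated(keep=False)]
print(dup_rows)

n_before = len(demo)
deduped = demo.drop_duplicates().reset_index(drop=True)  # tidy 0..n-1 index
print(f"Removed {n_before - len(deduped)} duplicated rows")
```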

## Step 6 — Split Dataset into Categorical and Numerical Features

In [None]:
# Step 6 - Identify categorical columns (non-numeric)

categorical_cols = df.select_dtypes(include="object").columns

# Step 6 - Identify numerical columns
# "number" matches every numeric dtype, not only int64 and float64

numerical_cols = df.select_dtypes(include="number").columns

categorical_cols, numerical_cols

# Explanation:
# - Different data types require different preprocessing strategies.
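
Splitting by dtype alone can miss categorical variables that are stored as integer codes. A rough way to surface candidates is to count distinct values per numeric column, as in this sketch. The data, the column names, and the `MAX_LEVELS` threshold are all assumptions for illustration, and the result is a list to review by hand rather than a decision.

```python
import pandas as pd

# Hypothetical rows: "medu" is an integer-coded level, "absences" is a count
demo = pd.DataFrame({
    "medu": [0, 1, 2, 4, 3, 2],
    "absences": [0, 4, 10, 2, 6, 25],
    "school": ["GP", "GP", "MS", "GP", "MS", "GP"],
})

MAX_LEVELS = 5  # arbitrary cut-off chosen for this example

# Distinct values per numeric column
n_unique = demo.select_dtypes(include="number").nunique()

# Numeric columns with few distinct values may be coded categories
coded_candidates = n_unique[n_unique <= MAX_LEVELS].index.tolist()
print(coded_candidates)  # ['medu']
```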

## Step 7 — Clean Categorical Features

### Step 7.1 — Explore Unique Values (EDA)

In [None]:
# Step 7.1 - Explore unique values in each categorical column
# This helps identify inconsistencies or typos

for col in categorical_cols:
    print(f"Column: {col}")
    print(df[col].unique())
    print("-" * 40)

# Explanation:
# - Listing unique values is a basic EDA check: it reveals unexpected labels,
#   inconsistent casing, and typos before any encoding is applied.
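
`unique()` lists the labels but not how often each one occurs. If the distribution matters, `value_counts()` is a natural companion. Here is a minimal sketch on a made-up column. The values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column
demo = pd.DataFrame({"mjob": ["teacher", "other", "other", "health", "other"]})

# Absolute frequency of each label (dropna=False also counts missing values)
counts = demo["mjob"].value_counts(dropna=False)

# Relative frequency, rounded for display
shares = demo["mjob"].value_counts(normalize=True, dropna=False).round(2)

print(counts)
print(shares)
```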



### Step 7.2 — Handle Missing Values in Categorical Columns

In [None]:
# Step 7.2 - Fill missing categorical values with the mode (most frequent value)

for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

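# Why mode?
# - It is a simple default that keeps every row and introduces no new category.
# - Caveat: it inflates the share of the most common category, so it is
#   best suited to columns with only a few missing values.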
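
The effect of mode imputation on category shares can be checked directly. The sketch below uses a made-up series with two gaps. It also guards against an empty `mode()` result, which happens when a column is entirely missing. That guard is an addition for the example and is not in the cell above.

```python
import pandas as pd

# Hypothetical column with two missing values
internet = pd.Series(["yes", "yes", "no", None, None], name="internet")

# Shares among the observed values only (missing values are excluded)
before = internet.value_counts(normalize=True)

# mode() returns a Series; it is empty if every value is missing
mode_value = internet.mode()
filled = internet.fillna(mode_value.iloc[0]) if not mode_value.empty else internet

# Shares after imputation: the most frequent label gains weight
after = filled.value_counts(normalize=True)

print(before.round(2))
print(after.round(2))
```

In this toy case the share of `"yes"` rises from about 0.67 to 0.80, which shows why heavy use of mode imputation deserves a second look.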

## Step 8 — Clean Numerical Features

### Step 8.1 — Descriptive Statistics (EDA)

In [None]:
# Step 8.1 - Generate summary statistics for numerical features

df[numerical_cols].describe()

# What we learn here:
# - Min / Max values
# - Mean & median (the 50% row)
# - Potential outliers (min or max far outside the 25%-75% range)
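
One common way to turn "potential outliers" into a concrete check is the 1.5 × IQR rule. The sketch below applies it to a made-up series. The values are hypothetical and the rule is a convention rather than something this notebook prescribes. Flagged values are candidates to inspect, not values to delete automatically.

```python
import pandas as pd

# Hypothetical absence counts with one extreme value
absences = pd.Series([0, 2, 4, 4, 6, 8, 10, 75], name="absences")

# Quartiles and interquartile range
q1, q3 = absences.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = absences[(absences < lower) | (absences > upper)]
print(outliers.tolist())  # [75]
```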

### Step 8.2 — Handle Missing Numerical Values

In [None]:
# Step 8.2 - Fill missing numerical values with the median
# Median is robust against outliers

for col in numerical_cols:
    df[col] = df[col].fillna(df[col].median())
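
To illustrate why the median is preferred here, the sketch below compares mean and median as fill values on a made-up series with one extreme value. The numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one gap and one extreme value
absences = pd.Series([0, 2, 4, np.nan, 6, 75], name="absences")

mean_fill = absences.mean()      # pulled upward by the extreme value
median_fill = absences.median()  # stays near the typical values

filled = absences.fillna(median_fill)

print(f"mean:   {mean_fill:.1f}")    # mean:   17.4
print(f"median: {median_fill:.1f}")  # median: 4.0
```

Filling the gap with the mean would insert a value far larger than most observations, while the median inserts one that looks like a typical row.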

## Step 9 — Final Data Validation

In [None]:
# Step 9 - Final overview of the cleaned dataset

df.info()

In [None]:
# Step 9 - Ensure no missing values remain

df.isna().sum().sum()

# Expected result: 0
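
The visual checks in this step can also be written as assertions, so that a rerun fails loudly if something regresses. The following is a minimal sketch. `validate_clean` is a hypothetical helper written for this example, and the three checks are one possible selection.

```python
import pandas as pd


def validate_clean(frame: pd.DataFrame) -> None:
    """Raise AssertionError if a basic cleaning guarantee is violated."""
    assert frame.isna().sum().sum() == 0, "missing values remain"
    assert not frame.duplicated().any(), "duplicated rows remain"
    assert frame.columns.is_unique, "duplicate column names"


# Hypothetical clean frame: passes silently
demo = pd.DataFrame({"school": ["GP", "MS"], "age": [18, 17]})
validate_clean(demo)
print("All checks passed")
```

In the notebook, the call would be `validate_clean(df)` placed just before saving.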

## Step 10 — Save Cleaned Dataset

In [None]:
# Step 10 - Save the cleaned dataset for next steps (feature engineering & modeling)

df.to_csv("../data/01_students_mat_cleaned_isailton.csv", index=False)

# Why this step matters:
# - Keeps cleaning separate from modeling and ensures reproducibility.
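
One detail worth noting: `to_csv` writes comma-separated output by default, so the cleaned file is read back without `sep=";"`. The sketch below does a round trip in a temporary directory to confirm that a saved file reloads with the same shape and columns. The frame and the file name are made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned frame
demo = pd.DataFrame({"school": ["GP", "MS"], "age": [18, 17]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "demo_cleaned.csv"  # throwaway file name
    demo.to_csv(out_path, index=False)         # comma-separated by default
    reloaded = pd.read_csv(out_path)           # no sep=";" needed here

same_shape = reloaded.shape == demo.shape
same_columns = list(reloaded.columns) == list(demo.columns)
print(same_shape, same_columns)
```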