# Week 3: Data Cleaning - Guided Exercise

**Duration:** ~30 minutes  
**Dataset:** Education Statistics from the Colombian Ministry of Education (datos.gov.co)  
**Rows:** 482 (dirty) | **Columns:** 37  
**Goal:** Clean this dataset by fixing 5 types of data quality issues  

---

**How to use this notebook:**

1. Read each explanation cell carefully
2. Run the code cell below it (Shift + Enter)
3. Read the "What just happened" follow-up
4. Answer the questions or complete the "Your Turn" exercises

Think of this entire process as doing laundry: we need to sort, wash, dry, and fold our data before it is ready to wear.

## Setup

We start by importing our tools and loading the CSV file. We also make a copy of the original data so we can compare before and after at the end.

**Run the cell below** to load the dataset.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/educacion_estadisticas.csv')

# Keep a copy of the original so we can compare at the end
df_original = df.copy()

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")

The dataset has 482 rows and 37 columns. Each row represents one Colombian department in one year (2011-2024), with education indicators like enrollment rates, dropout rates, and coverage.

---

## Section 1: Data Inspection

Before cleaning anything, we need to understand what we are working with. Think of this as **opening the laundry bag** and checking what is inside before turning on the washing machine. You would not throw everything in without looking first.

This is the **inspection ritual**: a set of commands you run at the start of every data project.

### 1.1 Shape and Column Names

We use `df.shape` to see how many rows and columns exist, and `df.columns.tolist()` to see the names of all columns.

**Run the cell below.**

In [None]:
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")
print()
print("Column names:")
print(df.columns.tolist())

**What to notice:** The column names are in Spanish (as expected from a Colombian government dataset). Key columns include:
- `ano` = year
- `departamento` = department (geographic region)
- `poblacion_5_16` = population aged 5-16
- `desercion` = dropout rate
- `cobertura_neta` = net enrollment coverage
- `aprobacion` / `reprobacion` = approval / failure rates

### 1.2 First Rows

Looking at the actual data helps us spot problems that column names alone cannot reveal. `df.head()` shows the first 5 rows.

**Run the cell below.**

In [None]:
df.head()

### 1.3 Data Types

Every column has a data type that tells pandas how to store and process the values. We use `df.dtypes` to see them.

**Look for suspicious types:** a year stored as `float64` instead of `int64`, or a population number stored as `object` (text).

**Run the cell below.**

In [None]:
df.dtypes

**What just happened:** pandas tells us the type of each column.

**Key findings:**
- `ano` is `float64` (decimal) but years should be whole numbers (`int64`). Years like 2023.0 look odd.
- `poblacion_5_16` is `object` (text) even though it should be a number. This means some values contain characters that prevent pandas from reading them as numbers.
- Most rate columns are `float64`, which is correct for percentages.
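
To see how a single stray value can change a whole column's type, here is a minimal sketch on a made-up three-row CSV. The column name `poblacion` and the values are invented for illustration; they are not read from the real file.

```python
import io
import pandas as pd

# Hypothetical mini-CSV: one year is blank, one population is the text "sin dato"
raw = io.StringIO(
    'ano,poblacion\n'
    '2011,"394,574"\n'
    ',1200\n'
    '2013,sin dato\n'
)
toy = pd.read_csv(raw)

# A blank cell becomes NaN, and NaN forces a column of whole numbers to float64
print(toy['ano'].dtype)

# Text such as "sin dato" or "394,574" makes pandas keep the whole column as text
print(pd.api.types.is_numeric_dtype(toy['poblacion']))
```

One missing year is enough to turn the column into floats, and one non-numeric entry is enough to turn a column into text.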

### 1.4 Missing Values

`isnull().sum()` counts how many NaN (missing) values each column has. We sort the result so the worst offenders appear first.

**Run the cell below.**

In [None]:
missing = df.isnull().sum()
missing_only = missing[missing > 0].sort_values(ascending=False)

print(f"Columns with missing values: {len(missing_only)} of {len(df.columns)}")
print(f"Total missing cells: {missing.sum()} of {df.shape[0] * df.shape[1]:,}")
print()
print(missing_only)

### 1.5 Statistical Summary

`df.describe()` gives us count, mean, std, min, max, and percentiles for every numeric column. This helps spot outliers and confirms what we learned from `isnull()`.

**Run the cell below.**

In [None]:
df.describe().round(2)

**What just happened:** The `count` row shows how many non-null values each column has. Columns where `count` is less than 482 (our total rows) have missing data. Also look at `min` and `max`: do any values look impossible? We will come back to this in Section 6.

### Inspection Summary

From our inspection, we found **5 data quality issues** to fix:

| # | Issue | Where |
|---|-------|-------|
| 1 | Missing values (NaN) | Multiple columns, especially `sedes_conectadas_a_internet`, `tamano_promedio_grupo` |
| 2 | Wrong data types | `ano` is float64, `poblacion_5_16` is object |
| 3 | Duplicate rows | 482 rows but only 462 expected (20 extra) |
| 4 | Text inconsistencies | `departamento` has too many unique values |
| 5 | Invalid values | Some percentages might be negative or above 100 |

### QUESTION 1

Based on the inspection you just ran, **list 3 specific problems** you noticed. For each one, note:
- Which column is affected
- What the problem is
- How you spotted it (which command revealed it)

*Double-click this cell and write your answer below:*

1. ...
2. ...
3. ...

---

## Section 2: Missing Values

**What is NaN?** NaN stands for "Not a Number." It is pandas' way of saying "this value is missing." Think of it like reaching into your sock drawer and finding... nothing. The sock is not there. It is not zero socks (that would mean you counted and found none). It is "unknown."

**Why do missing values appear?** Data can be missing because:
- It was never collected (a survey question left blank)
- It was lost during processing (a system error)
- It does not apply (internet connectivity data before internet existed in schools)

Like socks without a pair: you need to decide whether to find a replacement, toss them, or accept the mismatch.

### 2.1 Missing Value Percentages

Before deciding what to do, we need to know how bad the problem is. Let's calculate the percentage of missing values per column.

**Run the cell below.**

In [None]:
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

print("Missing value percentages:")
print(missing_pct)

### 2.2 The Decision Framework

Now we apply the decision framework from class:

| Missing % | Action | Reasoning |
|-----------|--------|----------|
| > 50% | Consider dropping the column | More gaps than data |
| < 5% | Safe to drop the rows | Losing very few rows |
| 5-50% | Fill with an appropriate value | Too many rows to lose, need to estimate |

Applying this to our columns:
- `sedes_conectadas_a_internet`, `tamano_promedio_grupo`: ~50% missing. Fill with 0 ("not reported").
- `departamento`: ~2-3% missing. Drop those rows (critical identifier, cannot guess).
- Rate columns (`desercion`, `cobertura_neta`, `aprobacion`, `reprobacion`): 8-11% missing. Fill with median.
- `poblacion_5_16`: ~10% missing. We will fix this in Section 3 (it also has type problems).
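
The table above can be sketched as a small function. This is only an illustration: `suggest_action` is a hypothetical helper, the toy frame is made up, and the thresholds are rules of thumb rather than fixed laws (note that we chose to fill the ~50% columns instead of dropping them).

```python
import numpy as np
import pandas as pd

def suggest_action(missing_pct: float) -> str:
    """Hypothetical helper: map a missing percentage to the action in the table."""
    if missing_pct == 0:
        return "nothing to do"
    if missing_pct > 50:
        return "consider dropping the column"
    if missing_pct < 5:
        return "drop the rows"
    return "fill with an appropriate value"

# Made-up frame with 100 rows and different amounts of missing data
n = 100
toy = pd.DataFrame({
    'mostly_empty': [np.nan] * 60 + [1.0] * 40,
    'some_gaps':    [np.nan] * 10 + [1.0] * 90,
    'few_gaps':     [np.nan] * 2 + [1.0] * 98,
    'complete':     [1.0] * n,
})

# isnull().mean() is the fraction of missing cells per column
pct = toy.isnull().mean() * 100
for col, value in pct.items():
    print(f"{col}: {value:.1f}% missing -> {suggest_action(value)}")
```

A function like this only suggests a starting point. The final decision still depends on what the column means.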

### 2.3 Fill Count-Based Columns with 0

For `sedes_conectadas_a_internet` (% schools with internet) and `tamano_promedio_grupo` (average class size), the data only exists through 2017. After that, it was not reported. Filling with 0 means "no data available for this period."

**Run the cell below.**

In [None]:
print(f"Before fillna:")
print(f"  sedes_conectadas_a_internet NaN: {df['sedes_conectadas_a_internet'].isnull().sum()}")
print(f"  tamano_promedio_grupo NaN:       {df['tamano_promedio_grupo'].isnull().sum()}")

df['sedes_conectadas_a_internet'] = df['sedes_conectadas_a_internet'].fillna(0)
df['tamano_promedio_grupo'] = df['tamano_promedio_grupo'].fillna(0)

print(f"\nAfter fillna:")
print(f"  sedes_conectadas_a_internet NaN: {df['sedes_conectadas_a_internet'].isnull().sum()}")
print(f"  tamano_promedio_grupo NaN:       {df['tamano_promedio_grupo'].isnull().sum()}")

**What just happened:** We replaced all NaN values in those two columns with 0. The missing count went from ~240 to 0 for each.

**Why 0 is appropriate here but NOT for rates:** These columns hold measurements that simply were not reported after 2017, and in this exercise 0 acts as a "not available" marker. Remember that these zeros are placeholders, so leave them out of any average you compute for either column. Filling a dropout rate with 0 would be misleading: 0% dropout means "nobody dropped out" (a strong claim), not "we don't know."

### 2.4 Drop Rows Where `departamento` is Missing

The `departamento` column is a critical identifier. A row without a department is like a letter without an address: useless. We cannot guess which department it belongs to, so we remove those rows.

**Run the cell below.**

In [None]:
rows_before = len(df)

df = df.dropna(subset=['departamento'])

rows_after = len(df)
print(f"Rows before: {rows_before}")
print(f"Rows after:  {rows_after}")
print(f"Removed:     {rows_before - rows_after} rows (missing departamento)")

### 2.5 Fill Rate Columns with the Median

For columns that represent rates or percentages (dropout rate, coverage, approval, etc.), we fill with the **median**. The median is the middle value when all values are sorted. It is better than the mean because it is not distorted by extreme outliers.

**Why not fill with 0?** A 0% dropout rate means "nobody dropped out." That is very different from "we don't know." The median says: "if we had to guess, the most typical value is probably close to this."
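
A quick sketch with made-up numbers shows both ideas: the median resists an outlier, and filling with 0 shifts the "typical" value while filling with the median does not. The values below are invented, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up dropout rates (%): one extreme value and one missing value
rates = pd.Series([2.0, 3.0, 4.0, 50.0, np.nan])

print(rates.mean())    # 14.75, pulled up by the outlier
print(rates.median())  # 3.5

filled_zero = rates.fillna(0)
filled_median = rates.fillna(rates.median())

print(filled_zero.median())    # 3.0, the fake 0 moved the middle value
print(filled_median.median())  # 3.5, unchanged
```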

**Run the cell below.**

In [None]:
rate_columns = [
    'tasa_matriculacion_5_16',
    'cobertura_neta', 'cobertura_neta_transicion', 'cobertura_neta_primaria',
    'cobertura_neta_secundaria', 'cobertura_neta_media',
    'cobertura_bruta', 'cobertura_bruta_transicion', 'cobertura_bruta_primaria',
    'cobertura_bruta_secundaria', 'cobertura_bruta_media',
    'desercion', 'desercion_transicion', 'desercion_primaria',
    'desercion_secundaria', 'desercion_media',
    'aprobacion', 'aprobacion_transicion', 'aprobacion_primaria',
    'aprobacion_secundaria', 'aprobacion_media',
    'reprobacion', 'reprobacion_transicion', 'reprobacion_primaria',
    'reprobacion_secundaria', 'reprobacion_media',
    'repitencia', 'repitencia_transicion', 'repitencia_primaria',
    'repitencia_secundaria', 'repitencia_media',
]

for col in rate_columns:
    n_missing = df[col].isnull().sum()
    if n_missing > 0:
        median_val = df[col].median()
        df[col] = df[col].fillna(median_val)
        print(f"{col}: filled {n_missing} NaN with median {median_val:.2f}")

print(f"\nTotal NaN remaining in rate columns: {df[rate_columns].isnull().sum().sum()}")

**What just happened:** Each rate column's missing values were replaced with that column's median. For example, if the median dropout rate is 3.86%, all missing dropout values now say 3.86%. This is a reasonable assumption: "if we don't know the value, assume it is typical."

### 2.6 Verify: How Many Missing Values Remain?

Let's check how our cleaning is going so far.

**Run the cell below.**

In [None]:
remaining = df.isnull().sum()
remaining = remaining[remaining > 0]

if len(remaining) == 0:
    print("No missing values remain!")
else:
    print(f"Columns still with missing values: {len(remaining)}")
    print(remaining)

**What to expect:** The only column with missing values should be `poblacion_5_16`. We will fix it in Section 3, because its problem is not just missing values but also wrong data types (it has commas and the text "sin dato").

---

### QUESTION 2

Why would filling `desercion` (dropout rate) with 0 be misleading? What does 0% dropout actually mean vs. "no data"?

*Double-click this cell and write your answer below:*

...

---

## Section 3: Data Type Issues

Imagine sorting your laundry and finding **a shoe in the shirt pile**. It does not belong there, and it will cause problems in the wash. That is what happens when a number is stored as text: pandas cannot do math with it.

There are three main types we care about:
- `int64`: whole numbers (years, counts)
- `float64`: decimal numbers (rates, percentages)
- `object`: text/strings (names, categories)

We found two type problems:
1. `ano` (year): stored as `float64` (2023.0) instead of `int64` (2023)
2. `poblacion_5_16` (population): stored as `object` (text) instead of `int64`

### 3.1 Inspect `ano`

Let's see what the `ano` column looks like right now.

**Run the cell below.**

In [None]:
print(f"dtype: {df['ano'].dtype}")
print(f"\nSample values: {df['ano'].head(10).tolist()}")

### 3.2 Fix `ano`: Convert to Integer

Years should be whole numbers. The CSV had mixed formats like "2011" and "2011.0", which caused pandas to read the column as float. We fix it by first ensuring all values are numeric with `pd.to_numeric()`, then converting to int.

**Run the cell below.**

In [None]:
print(f"Before: ano dtype = {df['ano'].dtype}")
print(f"Sample: {df['ano'].head(5).tolist()}")

df['ano'] = pd.to_numeric(df['ano'], errors='coerce').fillna(0).astype(int)

print(f"\nAfter: ano dtype = {df['ano'].dtype}")
print(f"Sample: {df['ano'].head(5).tolist()}")

**What just happened:** The years changed from 2023.0 (float) to 2023 (integer). Clean and correct.

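**The NaN-before-int trap:** If `ano` had NaN values and we tried `astype(int)` directly, pandas would raise an error, because NaN cannot be stored in a regular integer column. The `fillna(0)` step prevents that, at the cost of turning any unknown year into 0. Always handle NaN before converting to int.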
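
Here is a minimal sketch of the trap on a made-up Series. It also shows pandas' nullable `Int64` dtype as one possible alternative to `fillna(0)`; this notebook does not use that alternative, so treat it as optional background.

```python
import numpy as np
import pandas as pd

years = pd.Series([2011.0, np.nan, 2013.0])

# Direct conversion fails because NaN has no integer representation
try:
    years.astype(int)
except ValueError as err:
    print(f"astype(int) raised: {type(err).__name__}")

# Approach used in this notebook: fill first, then convert (0 marks "unknown year")
filled = years.fillna(0).astype(int)
print(filled.tolist())  # [2011, 0, 2013]

# Alternative: the nullable integer dtype keeps the gap as <NA>
nullable = years.astype('Int64')
print(nullable.isna().sum())  # 1
```

With `fillna(0)` the missing year turns into a real-looking number, so remember that a year of 0 means "unknown."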

### 3.3 Inspect `poblacion_5_16`

This column is text (`object`) instead of numeric. Let's look at the raw values to understand why.

**Run the cell below.**

In [None]:
print(f"Current dtype: {df['poblacion_5_16'].dtype}")
print(f"\nSample values (10 random):")
print(df['poblacion_5_16'].dropna().sample(10, random_state=42).tolist())

**What to notice:** Some values have commas (like "394,574") and some say "sin dato" (Spanish for "no data"). Pandas cannot convert these to numbers automatically, so it stored the whole column as text.

### 3.4 Fix `poblacion_5_16`: Clean String, Then Convert

The fix is a 2-step process:
1. **Remove the commas** with `str.replace(',', '')`
2. **Convert to numeric** with `pd.to_numeric(errors='coerce')` so "sin dato" becomes NaN instead of crashing

Then we fill the remaining NaN with 0 and convert to int.

**Run the cell below.**

In [None]:
# Step 1: Remove commas, then convert to numeric
df['poblacion_5_16'] = pd.to_numeric(
    df['poblacion_5_16'].astype(str).str.replace(',', ''),
    errors='coerce'
)

print(f"After to_numeric: dtype = {df['poblacion_5_16'].dtype}")
print(f"NaN count: {df['poblacion_5_16'].isnull().sum()}")

# Step 2: Fill NaN with 0 and convert to int
df['poblacion_5_16'] = df['poblacion_5_16'].fillna(0).astype(int)

print(f"\nFinal dtype: {df['poblacion_5_16'].dtype}")
print(f"NaN remaining: {df['poblacion_5_16'].isnull().sum()}")
print(f"Sample: {df['poblacion_5_16'].head(5).tolist()}")

**What just happened:**
- Commas were removed: "394,574" became "394574"
- `pd.to_numeric()` converted valid numbers to float64
- "sin dato" and NaN values became NaN (the `errors='coerce'` flag replaces anything it cannot convert with NaN, instead of crashing)
- Finally, we filled NaN with 0 and converted to int
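
The same steps can be followed value by value on a tiny made-up Series. This sketch skips `astype(str)` because the toy values are already text or missing, and it passes `regex=False` so the comma is treated as a plain character.

```python
import pandas as pd

raw = pd.Series(['394,574', 'sin dato', None, '1200'])

# Remove thousands separators; the missing entry stays missing
no_commas = raw.str.replace(',', '', regex=False)

# Anything that is not a number ("sin dato") becomes NaN
as_number = pd.to_numeric(no_commas, errors='coerce')
print(as_number.tolist())   # [394574.0, nan, nan, 1200.0]

# Fill the gaps with 0 so the column can be stored as integers
final = as_number.fillna(0).astype(int)
print(final.tolist())       # [394574, 0, 0, 1200]
```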

### YOUR TURN 1

Verify the type fixes worked. Write code to print the `dtype` and 5 sample values for both `ano` and `poblacion_5_16`.

In [None]:
# Your code here: print dtype and sample values for 'ano' and 'poblacion_5_16'


---

## Section 4: Duplicates

Imagine you are folding laundry and you count the **same shirt twice**. Your count is now wrong. That is exactly what duplicate rows do: they inflate counts and distort averages. If a department appears twice for the same year with identical data, every calculation using that data is biased.

### 4.1 Count Duplicates

We use `duplicated().sum()` to count how many duplicate rows exist.

**Run the cell below.**

In [None]:
n_dupes = df.duplicated().sum()
print(f"Duplicate rows found: {n_dupes}")

### 4.2 See the Duplicates

Let's look at the actual duplicate rows. Using `keep=False` marks ALL copies (both the "original" and the "duplicate") so we can see them side by side.
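
Before running it on the real data, here is a small made-up example of how the default behavior differs from `keep=False`. The three rows are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'ano': [2011, 2011, 2011],
    'departamento': ['ANTIOQUIA', 'ANTIOQUIA', 'BOYACA'],
    'desercion': [3.1, 3.1, 4.2],
})

# Default keep='first': only the second copy is flagged
print(toy.duplicated().tolist())            # [False, True, False]

# keep=False: every row that has a twin is flagged
print(toy.duplicated(keep=False).tolist())  # [True, True, False]
```

This is why the number of rows shown with `keep=False` is larger than the number of rows that will be removed.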

**Run the cell below.**

In [None]:
dupes = df[df.duplicated(keep=False)].sort_values(['departamento', 'ano'])
print(f"Total rows involved in duplicates: {len(dupes)}")
print()
print(dupes[['ano', 'departamento', 'poblacion_5_16', 'desercion']].head(20).to_string())

### 4.3 Remove Duplicates

`drop_duplicates()` keeps the first occurrence of each row and removes the rest.

**Run the cell below.**

In [None]:
rows_before = len(df)

df = df.drop_duplicates()

rows_after = len(df)
print(f"Rows before: {rows_before}")
print(f"Rows after:  {rows_after}")
print(f"Removed:     {rows_before - rows_after} duplicate rows")
print(f"Duplicates remaining: {df.duplicated().sum()}")

**What just happened:** The duplicate rows were removed. We kept one copy of each and deleted the rest. The 20 duplicates were exact copies injected into the dataset.

### QUESTION 3

When might a duplicate row be **valid** and NOT an error? Give one example from the real world.

*Double-click this cell and write your answer below:*

...

---

## Section 5: Text Inconsistencies

Imagine you are folding laundry but your labels are **inconsistent**: one shirt says "Blue", another says "BLUE", another says "  blue  ". To you, they are the same color. But to pandas, these are three completely different values. This breaks any grouping or counting operation.

### 5.1 Explore the Problem

Colombia has 32 departments plus the capital district, Bogota D.C. Counting the national aggregate row, we expect about 34 unique values. Let's see how many unique values our `departamento` column actually has.

**Run the cell below.**

In [None]:
print(f"Unique department values: {df['departamento'].nunique()}")
print(f"(We expect about 34)")
print(f"\nAll unique values (sorted):")
for val in sorted(df['departamento'].unique()):
    print(f"  '{val}'")

**What just happened:** We see far more unique values than expected. The same department appears in multiple forms:
- "Antioquia", "ANTIOQUIA", "  Antioquia  ", "antioquia" are all the same department
- Some have leading/trailing spaces
- Some have accents stripped ("Narino" vs "Nariño")

To pandas, every single variation is a completely different string.

### 5.2 Standardize Text

The fix: convert everything to uppercase and remove extra whitespace with `str.upper().str.strip()`.

**Run the cell below.**

In [None]:
before_nunique = df['departamento'].nunique()

df['departamento'] = df['departamento'].str.upper().str.strip()

after_nunique = df['departamento'].nunique()

print(f"Unique values before: {before_nunique}")
print(f"Unique values after:  {after_nunique}")
print(f"Reduced by:           {before_nunique - after_nunique} values")

**What just happened:** All department names were converted to uppercase and extra spaces were removed. The number of unique values dropped significantly. Now "Antioquia", "ANTIOQUIA", "  antioquia  " are all just "ANTIOQUIA".
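
Uppercasing and stripping do not touch accents, so names that differ only by an accent would still count as separate values. One possible way to handle that, sketched here with Python's standard `unicodedata` module, is to remove the accent marks. `strip_accents` is a hypothetical helper and is not part of this notebook's cleaning steps.

```python
import unicodedata
import pandas as pd

def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining accent marks, e.g. 'ñ' -> 'n'."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# Made-up variants of the same department name
names = pd.Series(['Nariño', 'Narino', '  NARIÑO '])

cleaned = names.str.upper().str.strip().map(strip_accents)
print(cleaned.unique().tolist())  # ['NARINO']
```

Removing accents is a trade-off: it merges the variants, but the stored name no longer matches the official spelling. The sketch also assumes there are no missing names, since the helper expects a string.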

### 5.3 Check for New Duplicates

Here is a subtle but important lesson: **cleaning one thing can reveal new problems.** Rows that looked different before ("Antioquia" vs "ANTIOQUIA" for the same year) are now identical after uppercasing. We need to check for duplicates again.

**Run the cell below.**

In [None]:
new_dupes = df.duplicated().sum()
print(f"New duplicates after text standardization: {new_dupes}")

if new_dupes > 0:
    rows_before = len(df)
    df = df.drop_duplicates()
    rows_after = len(df)
    print(f"Removed {rows_before - rows_after} additional duplicates")
    print(f"Final row count: {rows_after}")
else:
    print("No new duplicates created. Good.")

**Lesson:** Data cleaning is iterative. Fixing one problem (text inconsistency) can create another (new duplicates). Always verify after each step.
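
The effect is easy to reproduce on two made-up rows that differ only in how the department is written:

```python
import pandas as pd

toy = pd.DataFrame({
    'ano': [2011, 2011],
    'departamento': ['Antioquia', ' ANTIOQUIA '],
    'desercion': [3.1, 3.1],
})

# The two labels differ, so pandas sees two different rows
dupes_before = toy.duplicated().sum()
print(dupes_before)  # 0

toy['departamento'] = toy['departamento'].str.upper().str.strip()

# After standardizing, the rows are identical
dupes_after = toy.duplicated().sum()
print(dupes_after)   # 1
```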

### YOUR TURN 2

After standardization, verify how many unique departments we have now. Print the sorted list of unique department names. Does the count look reasonable for Colombia?

Write your code below.

In [None]:
# Your code here: print nunique() and sorted list of unique departments


---

## Section 6: Invalid Values

So far we have fixed missing values, wrong types, duplicates, and text inconsistencies. But there is one more problem hiding in the data: **values that exist but are impossible.**

Think of it this way: you finished your laundry, everything is folded and sorted. But then you notice a shirt labeled "Size -3" and another labeled "Size 250." Those sizes do not exist. The labels are wrong.

In our dataset, percentage columns (dropout rate, net coverage, approval, etc.) must be between 0 and 100. A dropout rate of -5% is impossible. So is a net coverage of 150%. (Gross coverage can legitimately exceed 100 because it also counts students outside the official age range, so the `cobertura_bruta` columns are left out of this check.) These values passed all our previous checks because they are not missing, they are not the wrong type, and they are not duplicates. **They are just wrong.**

Catching these requires **domain knowledge**: knowing what valid values look like for your specific data.

### 6.1 Check Min and Max of Percentage Columns

Let's use `describe()` on just the percentage columns to check their minimum and maximum values. Any `min` below 0 or `max` above 100 is suspicious.

**Run the cell below.**

In [None]:
percentage_cols = [
    'desercion', 'desercion_primaria', 'desercion_secundaria', 'desercion_media',
    'cobertura_neta', 'cobertura_neta_primaria', 'cobertura_neta_secundaria', 'cobertura_neta_media',
    'aprobacion', 'reprobacion',
]

summary = df[percentage_cols].describe().loc[['min', 'max']].round(2)
print("Min and Max for percentage columns:")
print(summary.to_string())

**What just happened:** Look at the `min` row. Do you see any negative values? Now look at the `max` row. Do you see anything above 100? Those are invalid values that should not exist in percentage columns.

### 6.2 Count the Invalid Values

Let's count exactly how many values are below 0 and above 100 in each column.

**Run the cell below.**

In [None]:
negatives = df[percentage_cols].lt(0).sum()
over_100 = df[percentage_cols].gt(100).sum()

print("Negative values (< 0) per column:")
print(negatives[negatives > 0].to_string())
print(f"\nTotal negative values: {negatives.sum()}")

print("\n" + "="*40)

print("\nValues over 100 per column:")
print(over_100[over_100 > 0].to_string())
print(f"\nTotal values over 100: {over_100.sum()}")

### 6.3 See the Invalid Rows

Let's look at the actual rows with invalid values so we can understand the scope of the problem.

**Run the cell below.**

In [None]:
# Find rows with any negative percentage value
mask_negative = (df[percentage_cols] < 0).any(axis=1)
mask_over_100 = (df[percentage_cols] > 100).any(axis=1)

print(f"Rows with negative values: {mask_negative.sum()}")
if mask_negative.sum() > 0:
    print(df.loc[mask_negative, ['ano', 'departamento'] + percentage_cols].to_string())

print(f"\nRows with values > 100: {mask_over_100.sum()}")
if mask_over_100.sum() > 0:
    print(df.loc[mask_over_100, ['ano', 'departamento'] + percentage_cols].to_string())

### 6.4 Fix Invalid Values

We will replace invalid values with NaN, then fill them with the column median (the same strategy we used for missing values). This treats an impossible value as if it had never been recorded.

**Run the cell below.**

In [None]:
total_fixed = 0

for col in percentage_cols:
    # Count invalid values
    invalid_mask = (df[col] < 0) | (df[col] > 100)
    n_invalid = invalid_mask.sum()
    
    if n_invalid > 0:
        # Replace invalid with NaN
        df.loc[invalid_mask, col] = np.nan
        
        # Fill with median
        median_val = df[col].median()
        df[col] = df[col].fillna(median_val)
        
        print(f"{col}: fixed {n_invalid} invalid values (replaced with median {median_val:.2f})")
        total_fixed += n_invalid

print(f"\nTotal invalid values fixed: {total_fixed}")

**What just happened:** We found values that were mathematically present but logically impossible (negative percentages, percentages above 100). We replaced them with NaN and then filled with the median, just like we do with missing values.

**Key takeaway:** Domain knowledge is essential for data cleaning. Without knowing that dropout rates must be between 0 and 100, we would never catch these errors. The `describe()` function is your friend here: always check `min` and `max` against what you know about the data.
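
Domain rules like "rates live between 0 and 100" can be written down once and reused. The sketch below assumes a hypothetical helper, `count_out_of_range`, and a made-up toy frame; the limits dictionary is where your domain knowledge goes.

```python
import pandas as pd

def count_out_of_range(frame: pd.DataFrame, limits: dict) -> pd.Series:
    """Hypothetical helper: count values outside [low, high] for each listed column."""
    counts = {}
    for col, (low, high) in limits.items():
        # Missing values are dropped first so they are not counted as invalid;
        # between() is inclusive on both ends
        values = frame[col].dropna()
        counts[col] = int((~values.between(low, high)).sum())
    return pd.Series(counts)

toy = pd.DataFrame({
    'desercion': [3.5, -5.0, 4.1, None],
    'cobertura_neta': [85.0, 150.0, 101.0, 90.0],
})
limits = {'desercion': (0, 100), 'cobertura_neta': (0, 100)}

result = count_out_of_range(toy, limits)
print(result)
```

In this toy frame, `desercion` has one invalid value and `cobertura_neta` has two.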

### 6.5 Verify the Fix

Let's confirm that all percentage values are now within the valid 0-100 range.

**Run the cell below.**

In [None]:
print("After fix - Min and Max for percentage columns:")
print(df[percentage_cols].describe().loc[['min', 'max']].round(2).to_string())

remaining_invalid = (df[percentage_cols].lt(0).sum().sum() + df[percentage_cols].gt(100).sum().sum())
print(f"\nInvalid values remaining: {remaining_invalid}")

### QUESTION 4

Which approach is better for invalid percentage values:

**(a)** Replace with NaN and fill with the median (what we did), or  
**(b)** Clip to the valid range (set negatives to 0 and values >100 to 100)?

Think about what each approach assumes. When might clipping be better? When might the median approach be better?

*Double-click this cell and write your answer below:*

...

---

## Section 7: Summary

We have completed all 5 cleaning steps. Let's compare the original dataset with our cleaned version to see the full impact of our work.

**Run the cell below.**

In [None]:
print("=" * 55)
print("  BEFORE CLEANING (original)")
print("=" * 55)
print(f"  Rows:                {len(df_original)}")
print(f"  Total NaN:           {df_original.isnull().sum().sum()}")
print(f"  Duplicates:          {df_original.duplicated().sum()}")
print(f"  Unique departments:  {df_original['departamento'].nunique()}")
print(f"  ano dtype:           {df_original['ano'].dtype}")
print(f"  poblacion dtype:     {df_original['poblacion_5_16'].dtype}")

print()

print("=" * 55)
print("  AFTER CLEANING")
print("=" * 55)
print(f"  Rows:                {len(df)}")
print(f"  Total NaN:           {df.isnull().sum().sum()}")
print(f"  Duplicates:          {df.duplicated().sum()}")
print(f"  Unique departments:  {df['departamento'].nunique()}")
print(f"  ano dtype:           {df['ano'].dtype}")
print(f"  poblacion dtype:     {df['poblacion_5_16'].dtype}")

print()
print("=" * 55)
print("  INVALID VALUES CHECK")
print("=" * 55)
pct_cols_check = ['desercion', 'cobertura_neta', 'aprobacion', 'reprobacion']
for col in pct_cols_check:
    print(f"  {col}: min={df[col].min():.2f}, max={df[col].max():.2f}")

### The 5-Step Cleaning Workflow

Here is the workflow you just practiced. Use these exact steps in every data project:

| Step | What | Key Commands |
|------|------|--------------|
| 1. **Inspect** | Understand the data before touching it | `df.shape`, `df.dtypes`, `df.isnull().sum()`, `df.describe()` |
| 2. **Handle missing** | Decide: drop, fill with 0, or fill with median | `fillna(0)`, `fillna(median)`, `dropna(subset=...)` |
| 3. **Fix types** | Numbers stored as text, floats that should be int | `pd.to_numeric(errors='coerce')`, `astype(int)` |
| 4. **Standardize text, remove duplicates** | Inconsistent labels and exact copies that inflate counts | `str.upper().str.strip()`, `duplicated().sum()`, `drop_duplicates()` |
| 5. **Validate values** | Impossible values based on domain knowledge | `describe()` min/max, boolean masks |

**Remember:** Cleaning is iterative. Fixing one problem can create another (like text standardization creating new duplicates). Always verify after each step.
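
As a compact recap, the workflow can be sketched as one function applied to a made-up five-row frame. `clean` is a hypothetical condensed version, not the notebook's actual code: it covers only three columns, and it treats invalid rates as missing before the single median fill.

```python
import numpy as np
import pandas as pd

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical condensed version of the cleaning steps for a toy frame."""
    # Handle missing: drop rows without the identifier
    out = frame.dropna(subset=['departamento']).copy()
    # Fix types: years as integers
    out['ano'] = pd.to_numeric(out['ano'], errors='coerce').fillna(0).astype(int)
    # Standardize text, then remove exact duplicates
    out['departamento'] = out['departamento'].str.upper().str.strip()
    out = out.drop_duplicates().reset_index(drop=True)
    # Validate values: treat impossible rates as missing, then fill with the median
    invalid = (out['desercion'] < 0) | (out['desercion'] > 100)
    out.loc[invalid, 'desercion'] = np.nan
    out['desercion'] = out['desercion'].fillna(out['desercion'].median())
    return out

toy = pd.DataFrame({
    'ano': [2011.0, 2011.0, 2012.0, 2012.0, 2013.0],
    'departamento': ['Antioquia', ' ANTIOQUIA ', 'Boyaca', None, 'Boyaca'],
    'desercion': [3.0, 3.0, -5.0, 4.0, 5.0],
})

result = clean(toy)
print(result)
```

Five messy rows become three clean ones: the row without a department is dropped, the two spellings of Antioquia collapse into one row, and the impossible -5.0 is replaced by the median of the remaining valid rates.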

This dataset is now ready for analysis.

### FINAL REFLECTION

Write 2-3 sentences:

1. What was the most surprising thing you found during cleaning?
2. Which of the 5 steps do you think is most important? Why?
3. How would you apply these steps to your project dataset from datos.gov.co?

*Double-click this cell and write your answer below:*

...