# Week 4: In-Class Exercise - Data Filtering & Selection

## Objective
Practice filtering techniques using the same Education Statistics dataset you cleaned last week.

## Time: ~30 minutes

## What you'll practice:
1. **Single conditions** with comparison operators
2. **Combining conditions** with & (AND), | (OR), ~ (NOT)
3. **Convenience methods**: .isin(), .between(), .str.contains()
4. **Building real questions** from data

## Dataset
The same Education Statistics dataset from the Colombian Ministry of Education (MEN) that you cleaned in Week 3.

---

## Setup
Run this cell to load libraries, load the dataset, and apply the quick clean from Week 3.

In [None]:
import pandas as pd
import numpy as np

# Load the Education Statistics dataset
df = pd.read_csv('../data/educacion_estadisticas.csv')

# Quick clean (applying what we learned in Week 3)
df['departamento'] = df['departamento'].str.upper().str.strip()
# Missing years become 0 so the column can be cast to int (0 is a placeholder, not a real year)
df['ano'] = df['ano'].fillna(0).astype(int)

df = df.drop_duplicates()
df = df.dropna(subset=['departamento'])

print(f"Dataset loaded and cleaned: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Years: {sorted(df['ano'].unique())}")
print(f"Departments: {df['departamento'].nunique()} unique")

df.head()

---

## Part 1: Single Conditions (8 minutes)

Start with simple, one-condition filters using comparison operators.
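
It can help to see what a comparison returns before using it as a filter. The sketch below uses a tiny made-up DataFrame (the values are illustrative, not taken from the MEN dataset) to show that a comparison produces one True/False value per row, and that indexing with that mask keeps only the True rows.

```python
import pandas as pd

# Toy data, made up for illustration (not the MEN dataset)
toy = pd.DataFrame({
    'departamento': ['ANTIOQUIA', 'CHOCO', 'META'],
    'ano': [2023, 2022, 2023],
})

# The comparison yields one boolean per row
mask = toy['ano'] == 2023
print(mask.tolist())  # [True, False, True]

# Indexing with the mask keeps only the rows where it is True
print(toy[mask]['departamento'].tolist())  # ['ANTIOQUIA', 'META']

# True counts as 1, so summing the mask counts the matching rows
print(int(mask.sum()))  # 2
```

`mask.sum()` and `len(toy[mask])` give the same count, so either can be used to answer "how many rows match?".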

### Task 1.1: Filter by exact value
Get all rows where the year (`ano`) is 2023.

**Pattern:** `df[df['column'] == value]`

Print how many rows you found.

In [None]:
# YOUR CODE HERE
# Step 1: Create the filter
df_2023 = df[df['ano'] == 2023]
# Step 2: Print the number of rows
print(f'Rows in 2023: {len(df_2023)}')
# Step 3: Display the first few rows
df_2023.head()


### Task 1.2: Filter with greater than
Find all rows where the total dropout rate (`desercion`) is greater than 5%.

**Question:** How many department-year combinations have a dropout rate above 5%?

In [None]:
# YOUR CODE HERE
df_high_dropout = df[df['desercion'] > 5]
print(f'Rows with dropout > 5%: {len(df_high_dropout)}')


### Task 1.3: Filter with less than
Find all rows where net coverage (`cobertura_neta`) is less than 70%.

**Question:** Which departments fall below 70% coverage? Look at the `departamento` column in your result.

In [None]:
# YOUR CODE HERE
df_low_coverage = df[df['cobertura_neta'] < 70]
print('Departments with coverage < 70%:')
print(df_low_coverage['departamento'].unique())


---

## Part 2: Combining Conditions (8 minutes)

Now combine multiple conditions using `&` (AND), `|` (OR), and `~` (NOT).

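**Remember:** Always wrap each condition in parentheses! `&` and `|` bind more tightly than comparison operators such as `==` and `>`, so without parentheses Python evaluates the pieces in the wrong order.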
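
To make the parentheses rule concrete, here is a minimal sketch on made-up data (not the MEN dataset). The parenthesized filter works, while the unparenthesized one fails because Python tries to evaluate `2023 & toy['desercion']` first. The exact exception type can depend on the column dtype and pandas version, so the sketch catches both likely types.

```python
import pandas as pd

# Toy data, made up for illustration (not the MEN dataset)
toy = pd.DataFrame({
    'departamento': ['ANTIOQUIA', 'CHOCO', 'META'],
    'ano': [2023, 2023, 2022],
    'desercion': [3.1, 5.8, 6.2],
})

# With parentheses: each comparison is evaluated first, then combined
ok = toy[(toy['ano'] == 2023) & (toy['desercion'] > 5)]
print(ok['departamento'].tolist())  # ['CHOCO']

# Without parentheses: `&` is applied before `==` and `>`, which fails
failed = False
try:
    toy[toy['ano'] == 2023 & toy['desercion'] > 5]
except (TypeError, ValueError) as err:
    failed = True
    print(f'Unparenthesized filter raised {type(err).__name__}')
```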

### Task 2.1: AND condition
Find rows where the year is 2023 AND the dropout rate is above 5%.

**Pattern:** `df[(condition1) & (condition2)]`

**Question:** Which departments had high dropout in 2023?

In [None]:
# YOUR CODE HERE
# Show: departamento, ano, desercion columns
df_2023_high_dropout = df[(df['ano'] == 2023) & (df['desercion'] > 5)]
df_2023_high_dropout[['departamento', 'ano', 'desercion']]


### Task 2.2: OR condition
Find all rows from either Antioquia OR Valle del Cauca.

**Pattern:** `df[(condition1) | (condition2)]`

**Hint:** The department names are uppercase (we converted them in Setup), so write the values you compare against in uppercase too.

In [None]:
# YOUR CODE HERE
df_ant_valle = df[(df['departamento'] == 'ANTIOQUIA') | (df['departamento'] == 'VALLE DEL CAUCA')]
df_ant_valle


### Task 2.3: NOT condition
Get all rows EXCEPT those where `departamento` is "NACIONAL" (the national aggregate, not an actual department).

**Two ways to do this:**
- `df[df['departamento'] != 'NACIONAL']`
- `df[~(df['departamento'] == 'NACIONAL')]`

Try both and verify you get the same number of rows.
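
For a single value, `!=` is usually the shorter option. `~` becomes more useful when the condition has no built-in opposite, such as `.isin()`. The sketch below, on made-up data, checks that both methods agree and then uses `~` to exclude a list of values. The `excluded` list is a hypothetical choice for illustration only.

```python
import pandas as pd

# Toy data, made up for illustration (not the MEN dataset)
toy = pd.DataFrame({
    'departamento': ['NACIONAL', 'ANTIOQUIA', 'META', 'CHOCO'],
    'desercion': [3.5, 3.1, 4.2, 5.8],
})

# Both ways of writing NOT select exactly the same rows
not_nac1 = toy[toy['departamento'] != 'NACIONAL']
not_nac2 = toy[~(toy['departamento'] == 'NACIONAL')]
print(not_nac1.equals(not_nac2))  # True

# `~` negates any boolean mask, including one built with .isin()
excluded = ['NACIONAL', 'META']  # hypothetical list, for illustration
kept = toy[~toy['departamento'].isin(excluded)]
print(kept['departamento'].tolist())  # ['ANTIOQUIA', 'CHOCO']
```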

In [None]:
# YOUR CODE HERE
# Method 1: using !=
df_not_nac1 = df[df['departamento'] != 'NACIONAL']

# Method 2: using ~
df_not_nac2 = df[~(df['departamento'] == 'NACIONAL')]

# Compare the row counts
print(f'Method 1 rows: {len(df_not_nac1)}')
print(f'Method 2 rows: {len(df_not_nac2)}')


---

## Part 3: Convenience Methods (7 minutes)

Use `.isin()`, `.between()`, and `.str.contains()` to write cleaner filters.

### Task 3.1: .isin() - Match a list
Find all rows from the following departments: ANTIOQUIA, BOGOTA D.C., VALLE DEL CAUCA, CUNDINAMARCA.

**Pattern:** `df[df['column'].isin([list_of_values])]`

In [None]:
# YOUR CODE HERE
target_deps = ['ANTIOQUIA', 'BOGOTA D.C.', 'VALLE DEL CAUCA', 'CUNDINAMARCA']
df_selected = df[df['departamento'].isin(target_deps)]
df_selected


### Task 3.2: .between() - Numeric range
Find all rows where the approval rate (`aprobacion`) is between 90 and 100 (inclusive).

**Pattern:** `df[df['column'].between(low, high)]`
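
By default `.between()` includes both endpoints, which is what this task asks for. The sketch below uses made-up approval rates to show the default behavior, the equivalent two-condition filter, and the `inclusive` argument. String values for `inclusive` such as `'neither'` assume pandas 1.3 or later.

```python
import pandas as pd

# Made-up approval rates, for illustration only
rates = pd.Series([89.9, 90.0, 95.5, 100.0])

# Default: both endpoints are included
print(rates.between(90, 100).tolist())  # [False, True, True, True]

# Equivalent filter written with two comparisons
same = (rates >= 90) & (rates <= 100)
print(same.equals(rates.between(90, 100)))  # True

# Exclude both endpoints (assumes pandas >= 1.3)
strict = rates.between(90, 100, inclusive='neither')
print(strict.tolist())  # [False, False, True, False]
```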

In [None]:
# YOUR CODE HERE
df_high_approval = df[df['aprobacion'].between(90, 100)]
df_high_approval


### Task 3.3: .str.contains() - Text pattern
Find all rows where the department name contains "SANTANDER".

**Pattern:** `df[df['column'].str.contains('text', na=False)]`

**Question:** Which departments match? (Hint: there should be more than one.)
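
Two details of `.str.contains()` are easy to miss: how missing values are handled, and the fact that the pattern is treated as a regular expression by default. The sketch below uses a made-up Series of names to illustrate both. `'BOGOTA DXCX'` is a deliberately fake value, included only to show the regex effect.

```python
import pandas as pd

# Made-up names, for illustration; 'BOGOTA DXCX' is deliberately fake
names = pd.Series([
    'SANTANDER', 'NORTE DE SANTANDER', None, 'BOGOTA D.C.', 'BOGOTA DXCX',
])

# na=False makes the missing value count as "no match"
mask = names.str.contains('SANTANDER', na=False)
print(mask.tolist())  # [True, True, False, False, False]

# By default the pattern is a regex, so '.' matches any character
as_regex = names.str.contains('D.C.', na=False)
print(as_regex.tolist())  # [False, False, False, True, True]

# regex=False matches the text literally
literal = names.str.contains('D.C.', na=False, regex=False)
print(literal.tolist())  # [False, False, False, True, False]
```

For plain words like "SANTANDER" the two modes behave the same. `regex=False` only matters when the search text contains characters such as `.`, `(`, or `+`.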

In [None]:
# YOUR CODE HERE
# Print the unique department names in your result
df_santander = df[df['departamento'].str.contains('SANTANDER', na=False)]
print(df_santander['departamento'].unique())


---

## Part 4: Real Questions (5 minutes)

Use everything you've learned to answer these questions about the data.

### Task 4.1: Complex filter
Find departments in 2023 where:
- Net coverage is above 80% AND
- Dropout rate is below 3%

These are the "high-performing" departments. Show the `departamento`, `cobertura_neta`, and `desercion` columns.
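
When a filter has three or more conditions, one option is to store each condition in a named mask and combine them afterwards. This is a sketch of that style on made-up data. The variable names (`is_2023`, `high_coverage`, `low_dropout`) are illustrative choices, not part of the exercise.

```python
import pandas as pd

# Toy data, made up for illustration (not the MEN dataset)
toy = pd.DataFrame({
    'departamento': ['ANTIOQUIA', 'CHOCO', 'META', 'ANTIOQUIA'],
    'ano': [2023, 2023, 2023, 2022],
    'cobertura_neta': [88.0, 65.0, 82.0, 87.0],
    'desercion': [2.5, 6.0, 3.4, 2.8],
})

# One named mask per condition
is_2023 = toy['ano'] == 2023
high_coverage = toy['cobertura_neta'] > 80
low_dropout = toy['desercion'] < 3

# Combine the masks and pick the columns to show in one .loc call
cols = ['departamento', 'cobertura_neta', 'desercion']
high_perf = toy.loc[is_2023 & high_coverage & low_dropout, cols]
print(high_perf['departamento'].tolist())  # ['ANTIOQUIA']

# Each mask can be inspected on its own, which helps when debugging
print(int(is_2023.sum()), int(high_coverage.sum()), int(low_dropout.sum()))  # 3 3 2
```

Named masks need no extra parentheses when combined, because each comparison has already been evaluated.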

In [None]:
# YOUR CODE HERE
df_high_perf = df[(df['ano'] == 2023) & (df['cobertura_neta'] > 80) & (df['desercion'] < 3)]
df_high_perf[['departamento', 'cobertura_neta', 'desercion']]


### Task 4.2: Try .query()
Rewrite the filter from Task 4.1 using `.query()` instead of boolean indexing.

**Pattern:** `df.query('condition1 and condition2')`

Verify you get the same number of rows.
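
Inside a `.query()` string, column names are written without quotes and conditions are joined with `and` / `or`. Python variables can be referenced with an `@` prefix. The sketch below uses made-up data, and the variables `year` and `min_coverage` are illustrative. It compares the result against the boolean-indexing version.

```python
import pandas as pd

# Toy data, made up for illustration (not the MEN dataset)
toy = pd.DataFrame({
    'departamento': ['ANTIOQUIA', 'CHOCO', 'META', 'ANTIOQUIA'],
    'ano': [2023, 2023, 2023, 2022],
    'cobertura_neta': [88.0, 65.0, 82.0, 87.0],
    'desercion': [2.5, 6.0, 3.4, 2.8],
})

year = 2023          # illustrative threshold variables
min_coverage = 80

# @name pulls the value of a Python variable into the query string
by_query = toy.query(
    'ano == @year and cobertura_neta > @min_coverage and desercion < 3'
)

# The same filter with boolean indexing
by_mask = toy[
    (toy['ano'] == year)
    & (toy['cobertura_neta'] > min_coverage)
    & (toy['desercion'] < 3)
]

print(by_query.equals(by_mask))  # True
print(by_query['departamento'].tolist())  # ['ANTIOQUIA']
```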

In [None]:
# YOUR CODE HERE
df_high_perf_query = df.query('ano == 2023 and cobertura_neta > 80 and desercion < 3')
print(f'Rows matching: {len(df_high_perf_query)}')
df_high_perf_query[['departamento', 'cobertura_neta', 'desercion']]


---

## Bonus: Design Your Own Filter (if time permits)

Think of a question you could ask about the education data. Write the filter to answer it.

Example questions:
- Which departments had improving dropout rates (2022 lower than 2020)?
- Which departments in the Llanos region (Arauca, Casanare, Meta, Vichada) have the best coverage?
- What was the range of approval rates in 2024?
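
The first example question compares two years for the same department, which a row-by-row filter cannot do directly. One possible approach, sketched here on made-up numbers, is to reshape the data with `.pivot()` so that each year becomes a column, then filter as usual. This goes slightly beyond this week's material and assumes each department-year pair appears only once.

```python
import pandas as pd

# Toy data, made up for illustration (not the MEN dataset)
toy = pd.DataFrame({
    'departamento': ['ANTIOQUIA', 'ANTIOQUIA', 'CHOCO', 'CHOCO', 'META', 'META'],
    'ano': [2020, 2022, 2020, 2022, 2020, 2022],
    'desercion': [3.8, 3.1, 5.0, 5.6, 4.4, 4.0],
})

# One row per department, one column per year
wide = toy.pivot(index='departamento', columns='ano', values='desercion')

# Keep departments whose 2022 dropout rate is lower than in 2020
improved = wide[wide[2022] < wide[2020]]
print(improved.index.tolist())  # ['ANTIOQUIA', 'META']
```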

In [None]:
# YOUR QUESTION: What are the dropout rates for Antioquia over the available years?
# YOUR CODE HERE
df_antioquia = df[df['departamento'] == 'ANTIOQUIA']
print(df_antioquia[['ano', 'desercion']].sort_values('ano'))


---

## Summary

In this exercise you practiced:

1. **Single conditions** - `==`, `!=`, `>`, `<`, `>=`, `<=`
2. **Combining conditions** - `&` (AND), `|` (OR), `~` (NOT)
3. **Convenience methods** - `.isin()`, `.between()`, `.str.contains()`
4. **Clean syntax** - `.query()` as an alternative

Key rules:
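- Use `&` `|` `~` (NOT `and` `or` `not`) when combining boolean masks; inside a `.query()` string, `and` / `or` / `not` are fine
- Always wrap each condition in **parentheses** when combining with `&` or `|`
- Add `na=False` to `.str.contains()` so that missing values count as non-matches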

**Next:** Complete the workshop notebook for more complex filtering scenarios and analysis.