# Week 4 Workshop: Data Filtering & Selection
## Traffic Accident Vehicles in Colombia

Practice filtering techniques to answer analytical questions about vehicles involved in traffic accidents.

**Duration:** ~2 hours

**Objectives:**
- Filter rows using comparison operators
- Combine conditions with `&` (AND), `|` (OR), `~` (NOT)
- Use `.isin()`, `.between()`, `.str.contains()`
- Use `.query()` for clean syntax
- Translate analytical questions into filter code

---

## Setup
Run this cell to load the dataset and see the overview.

In [None]:
import pandas as pd
import numpy as np

# Load the Traffic Accident Vehicles dataset
df = pd.read_csv('../data/vehiculos_accidentes.csv')

# Quick overview
print(f"Dataset: {df.shape[0]} rows x {df.shape[1]} columns")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nVehicle types: {df['tipo_vehiculo'].nunique()} unique")
print(f"Departments: {df['departamento_accidente'].nunique()} unique")
print(f"Severity levels: {df['gravedad_accidente'].unique().tolist()}")
print(f"Brands: {df['marca_vehiculo'].nunique()} unique")
print("\nTop 5 vehicle types:")
print(df['tipo_vehiculo'].value_counts().head())
df.head(3)

---

## Part 1: Single Condition Filters (20 minutes)

Practice the basic filtering pattern: `df[df['column'] operator value]`
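
As a minimal sketch of this pattern, the toy DataFrame below uses made-up values (it is not the workshop dataset; only the column names are borrowed). It shows that the comparison first produces a boolean Series, and that indexing with that Series keeps only the rows marked `True`.

```python
import pandas as pd

# Toy data with made-up values; only the column names mirror the workshop dataset
toy = pd.DataFrame({
    "tipo_vehiculo": ["MOTOCICLETA", "AUTOMOVIL", "BUS", "MOTOCICLETA"],
    "edad_vehiculo": [3, 18, 12, 7],
})

# Step 1: the comparison yields one True/False value per row
mask = toy["edad_vehiculo"] > 10
print(mask.tolist())  # [False, True, True, False]

# Step 2: indexing with the mask keeps only the rows where it is True
older = toy[mask]
print(len(older))  # 2
```

Writing `toy[toy["edad_vehiculo"] > 10]` does both steps in one line; splitting them is mainly useful while debugging a filter.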

### Task 1.1: Filter by vehicle type
Get all records where the vehicle is a **MOTOCICLETA** (motorcycle).

Print the number of rows and display the first 5.

In [None]:
# YOUR CODE HERE
# Hint: df[df['tipo_vehiculo'] == 'MOTOCICLETA']


### Task 1.2: Filter by comparison
Find vehicles with age greater than 15 years (`edad_vehiculo > 15`).

**Question:** How many old vehicles were involved in accidents? What are the most common brands among them?

In [None]:
# YOUR CODE HERE
# Hint: after filtering, try result['marca_vehiculo'].value_counts().head() to see the top brands


### Task 1.3: Filter by exact text
Get all accidents that occurred in the department **ATLANTICO**.

Sort the results by `fecha_accidente` and display the first 10 rows.

In [None]:
# YOUR CODE HERE
# Hint: after filtering, use .sort_values('fecha_accidente') and .head(10)


### Task 1.4: Filter with not equal
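Get all records where the severity is NOT "CON HERIDOS". If the dataset has only two severity levels, this leaves only the fatal accidents (`CON MUERTOS`); check the severity levels printed in Setup to confirm.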

Print the count and compare it to the total dataset size.

In [None]:
# YOUR CODE HERE
# print(f"Total records: {len(df)}")
# fatal = ...
# print(f"Fatal accidents: {len(fatal)}")
# print(f"Percentage: {len(fatal) / len(df) * 100:.1f}%")


---

## Part 2: Combining Conditions (30 minutes)

Combine multiple conditions to answer more specific questions.

**Rules:**
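- Use `&` (AND), `|` (OR), `~` (NOT) to combine conditions; Python's `and`, `or`, `not` keywords do not work element-wise on a Series
- Always wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`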
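
The two rules can be illustrated with a small sketch on made-up data (the values are invented for illustration). It combines two parenthesized conditions with `&`, `|`, and `~`, and then shows that the `and` keyword fails on a Series.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "tipo_vehiculo": ["MOTOCICLETA", "AUTOMOVIL", "MOTOCICLETA", "BUS", "AUTOMOVIL"],
    "edad_vehiculo": [12, 15, 4, 20, 2],
})

is_moto = toy["tipo_vehiculo"] == "MOTOCICLETA"
is_old = toy["edad_vehiculo"] > 10

both = toy[(is_moto) & (is_old)]       # motorcycle AND older than 10
either = toy[(is_moto) | (is_old)]     # motorcycle OR older than 10
neither = toy[~((is_moto) | (is_old))] # NOT (motorcycle OR older than 10)
print(len(both), len(either), len(neither))  # 1 4 1

# The keyword `and` needs a single True/False for the whole Series,
# which pandas refuses to provide
raised = False
try:
    toy[is_moto and is_old]
except ValueError:
    raised = True
print(raised)  # True
```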

### Task 2.1: Two AND conditions
Find motorcycles (`MOTOCICLETA`) involved in fatal accidents (`CON MUERTOS`).

**Question:** How many fatal motorcycle accidents are in the dataset?

In [None]:
# YOUR CODE HERE
# Pattern: df[(condition1) & (condition2)]


### Task 2.2: Three AND conditions
Find **AUTOMOVIL** vehicles, in **BOGOTA D.C.**, with severity **CON HERIDOS**.

Show the brand (`marca_vehiculo`) and model year (`modelo_vehiculo`) columns of the results.

In [None]:
# YOUR CODE HERE
# Hint: after filtering, select columns with result[['marca_vehiculo', 'modelo_vehiculo']]


### Task 2.3: OR and .isin()
Find all accidents involving either **CHEVROLET**, **TOYOTA**, or **MAZDA** vehicles.

Try both approaches:
1. Using `|` (OR) operators
2. Using `.isin()`

Verify both give the same number of rows.
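
To show what "the same number of rows" looks like in practice, here is a sketch on toy data with placeholder brand names (deliberately not the brands from the task). Both approaches select the same rows; `.isin()` simply scales better as the list grows.

```python
import pandas as pd

# Toy data with made-up values; the brands are placeholders for illustration
toy = pd.DataFrame({"marca_vehiculo": ["RENAULT", "KIA", "NISSAN", "RENAULT", "FORD"]})

# Approach 1: chain equality checks with |
with_or = toy[(toy["marca_vehiculo"] == "RENAULT") | (toy["marca_vehiculo"] == "KIA")]

# Approach 2: pass the accepted values as a list to .isin()
with_isin = toy[toy["marca_vehiculo"].isin(["RENAULT", "KIA"])]

print(len(with_or), len(with_isin))  # 3 3
print(with_or.equals(with_isin))     # True
```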

In [None]:
# YOUR CODE HERE

# Approach 1: Using | (OR)


# Approach 2: Using .isin()


# Verify both have the same number of rows


### Task 2.4: Mixed AND + OR
Find accidents in **ANTIOQUIA** or **VALLE DEL CAUCA** where the vehicle is older than 10 years.

**Hint:** Combine `.isin()` with another condition:
```python
df[(df['col'].isin([...])) & (df['other_col'] > value)]
```
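
Grouping matters when AND and OR are mixed. The sketch below uses made-up data and placeholder department names, and compares the `.isin()` form with a version that chains `|` and `&` without grouping. Because `&` binds more tightly than `|`, the ungrouped version answers a different question.

```python
import pandas as pd

# Toy data with made-up values; departments are placeholders for illustration
toy = pd.DataFrame({
    "departamento_accidente": ["CUNDINAMARCA", "SANTANDER", "BOLIVAR", "SANTANDER", "CUNDINAMARCA"],
    "edad_vehiculo": [14, 6, 22, 11, 3],
})
dep = toy["departamento_accidente"]
age = toy["edad_vehiculo"]

# (department is one of the two) AND (age > 10)
grouped = toy[(dep.isin(["CUNDINAMARCA", "SANTANDER"])) & (age > 10)]
print(len(grouped))  # 2

# Without grouping the OR, this reads as A | (B & C):
# every CUNDINAMARCA row is kept regardless of age
ungrouped = toy[(dep == "CUNDINAMARCA") | (dep == "SANTANDER") & (age > 10)]
print(len(ungrouped))  # 3
```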

In [None]:
# YOUR CODE HERE


### Task 2.5: NOT with .isin()
Get all accident records EXCLUDING motorcycles (`MOTOCICLETA`) and buses (`BUS`).

**Pattern:** `df[~df['col'].isin([list])]`
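
A short sketch of the exclusion pattern on made-up data (the excluded types here are placeholders, not the ones from the task). One detail worth knowing: a missing value is not in the list, so `~...isin()` keeps rows where the column is missing.

```python
import pandas as pd

# Toy data with made-up values, including one missing vehicle type
toy = pd.DataFrame({"tipo_vehiculo": ["CAMION", "AUTOMOVIL", None, "TAXI", "CAMIONETA"]})

# Keep every row whose type is NOT in the exclusion list
rest = toy[~toy["tipo_vehiculo"].isin(["CAMION", "TAXI"])]
print(len(rest))  # 3

# The row with the missing type survives the filter
print(rest["tipo_vehiculo"].isna().sum())  # 1

# dropna=False makes the missing value visible in the counts
print(rest["tipo_vehiculo"].value_counts(dropna=False))
```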

Print the count and the remaining vehicle types.

In [None]:
# YOUR CODE HERE
# After filtering, try: result['tipo_vehiculo'].value_counts() to see remaining types


---

## Part 3: Convenience Methods (20 minutes)

Practice `.between()`, `.str.contains()`, `.str.startswith()`, and `.query()` for common filtering patterns.

### Task 3.1: .between() for model year
Find vehicles with model year between **2015 and 2020** (inclusive).

**Pattern:** `df[df['col'].between(low, high)]`
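
The sketch below, on made-up model years, shows that `.between()` includes both endpoints by default and is equivalent to a `>=` / `<=` pair. The `inclusive="neither"` argument assumes pandas 1.3 or newer.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({"modelo_vehiculo": [2012, 2015, 2018, 2020, 2021]})

# Default behaviour: both endpoints are included
in_range = toy[toy["modelo_vehiculo"].between(2016, 2020)]
print(in_range["modelo_vehiculo"].tolist())  # [2018, 2020]

# Equivalent filter written with two comparisons
same = toy[(toy["modelo_vehiculo"] >= 2016) & (toy["modelo_vehiculo"] <= 2020)]
print(in_range.equals(same))  # True

# Excluding both endpoints (assumes pandas >= 1.3)
strict = toy[toy["modelo_vehiculo"].between(2015, 2020, inclusive="neither")]
print(strict["modelo_vehiculo"].tolist())  # [2018]
```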

Print how many records match and show the distribution of model years in the result.

In [None]:
# YOUR CODE HERE
# Hint: after filtering, use result['modelo_vehiculo'].value_counts().sort_index()


### Task 3.2: .between() for age range
Get vehicles aged **0 to 5 years** (relatively new vehicles).

Print the count and the percentage of the total dataset.

In [None]:
# YOUR CODE HERE


### Task 3.3: .str.contains() for text search
Find transit authorities (`autoridad_de_transito`) whose name contains **"BOGOTA"**.

Print the unique authority names that match.

**Remember:** Always add `na=False` to handle potential NaN values.
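
A sketch of why `na=False` helps, using made-up authority names and a different search term than the task. With `na=False`, a missing name counts as "no match", so the mask contains only `True`/`False` and can be used safely for indexing. The `case=False` argument is an optional extra shown here for illustration.

```python
import pandas as pd

# Toy data with made-up values, including one missing name
toy = pd.DataFrame({
    "autoridad_de_transito": [
        "SECRETARIA DE MOVILIDAD CALI",
        None,
        "TRANSITO CALI",
        "TRANSITO PASTO",
    ],
})

# na=False treats the missing name as a non-match
mask = toy["autoridad_de_transito"].str.contains("CALI", na=False)
print(mask.tolist())  # [True, False, True, False]

matches = toy[mask]["autoridad_de_transito"].unique().tolist()
print(matches)  # ['SECRETARIA DE MOVILIDAD CALI', 'TRANSITO CALI']

# case=False ignores upper/lower case in the search term
mask_ci = toy["autoridad_de_transito"].str.contains("cali", case=False, na=False)
print(mask_ci.sum())  # 2
```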

In [None]:
# YOUR CODE HERE
# Pattern: df[df['col'].str.contains('text', na=False)]


### Task 3.4: .str.startswith()
Find vehicle brands that start with **"CH"** (Chevrolet, Changan, Chery, etc.).

Print the unique brand names that match and the total number of records.

**Pattern:** `df[df['col'].str.startswith('text', na=False)]`

In [None]:
# YOUR CODE HERE


### Task 3.5: .query() method
Rewrite this boolean indexing filter using `.query()`:

```python
df[(df['edad_vehiculo'] > 10) & (df['departamento_accidente'] == 'ANTIOQUIA') & (df['tipo_vehiculo'] == 'MOTOCICLETA')]
```

Verify both approaches give the same number of rows.
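
For reference, here is a sketch of `.query()` syntax on made-up data with a different filter than the one in this task. Inside the query string, column names are written without quotes, text values take quotes, `and` / `or` / `not` are allowed, and `@name` refers to a Python variable. The variable `min_age` is a hypothetical name used only for this example.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "tipo_vehiculo": ["BUS", "BUS", "TAXI", "BUS"],
    "edad_vehiculo": [4, 9, 12, 2],
})

min_age = 3  # hypothetical threshold, referenced in the query string with @

# Boolean indexing version
by_mask = toy[(toy["tipo_vehiculo"] == "BUS") & (toy["edad_vehiculo"] >= min_age)]

# .query() version of the same filter
by_query = toy.query("tipo_vehiculo == 'BUS' and edad_vehiculo >= @min_age")

print(len(by_mask), len(by_query))  # 2 2
print(by_mask.equals(by_query))     # True
```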

In [None]:
# Boolean indexing version (given):
result_bool = df[
    (df['edad_vehiculo'] > 10) &
    (df['departamento_accidente'] == 'ANTIOQUIA') &
    (df['tipo_vehiculo'] == 'MOTOCICLETA')
]

# YOUR CODE HERE - .query() version:
# result_query = df.query('...')

# Verify:
# print(f"Boolean indexing: {len(result_bool)} rows")
# print(f".query(): {len(result_query)} rows")


---

## Part 4: Analytical Questions (30 minutes)

Now use your filtering skills to answer real questions about traffic safety in Colombia. For each question:
1. Write the filter
2. Display relevant columns or value counts
3. Write a 1-2 sentence interpretation of the result

### Question 4.1: Fatal accidents by vehicle type
Which vehicle type is most involved in **fatal accidents** (`CON MUERTOS`)?

Filter for fatal accidents and count by `tipo_vehiculo`. Sort from highest to lowest.

In [None]:
# YOUR CODE HERE
# Hint: filter first, then use ['tipo_vehiculo'].value_counts()


**Your interpretation:** *(Write 1-2 sentences about what you observe)*


### Question 4.2: Motorcycle accidents by department
Which departments have the most **motorcycle** accidents?

Filter for `MOTOCICLETA` and count by `departamento_accidente`. Show the top 10.

In [None]:
# YOUR CODE HERE


**Your interpretation:** *(Write 1-2 sentences)*


### Question 4.3: Vehicle age and severity
Are older vehicles more likely to be in **severe** (fatal) accidents?

Calculate the average `edad_vehiculo` for each `gravedad_accidente` level.

**Hint:** Filter for each severity level separately and compute `.mean()`, or use `df.groupby('gravedad_accidente')['edad_vehiculo'].mean()`.
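
The two approaches in the hint give the same numbers. The sketch below checks this on made-up data; the real averages will of course differ.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "gravedad_accidente": ["CON HERIDOS", "CON MUERTOS", "CON HERIDOS", "CON MUERTOS"],
    "edad_vehiculo": [4, 10, 8, 14],
})

# Approach 1: filter one severity level, then take the mean
mean_fatal = toy[toy["gravedad_accidente"] == "CON MUERTOS"]["edad_vehiculo"].mean()
print(mean_fatal)  # 12.0

# Approach 2: let groupby compute the mean for every level at once
means = toy.groupby("gravedad_accidente")["edad_vehiculo"].mean()
print(means)

# Both approaches agree for the fatal group
print(means["CON MUERTOS"] == mean_fatal)  # True
```

A difference in average age on its own does not show that older vehicles cause more severe accidents; it is only a first descriptive comparison.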

In [None]:
# YOUR CODE HERE


**Your interpretation:** *(Write 1-2 sentences)*


### Question 4.4: Your own question
Design and answer your own analytical question about the traffic accident data.

Ideas:
- Which brands are most involved in fatal accidents in a specific department?
- How do accident patterns differ between Medellin and Bogota?
- What is the age profile of motorcycles vs. cars in accidents?
- Which municipalities have the most accidents with heavy vehicles (CAMION, TRACTOCAMION)?

**Your question:** *(Write it here)*


In [None]:
# YOUR CODE HERE


**Your interpretation:** *(Write 1-2 sentences)*


---

## Part 5: Reflection

Answer these questions about your experience with data filtering.

### Reflection 1
Which filtering method did you find most useful: boolean indexing, .isin(), .between(), .str.contains(), or .query()? Why?

**Your answer:**


### Reflection 2
What questions do you want to answer about **YOUR** project dataset using filtering? List at least 3 questions.

**Your answer:**
1. 
2. 
3. 


### Reflection 3
Why is it important to clean the data (Week 3) BEFORE filtering it (Week 4)? What problems could arise if you filter dirty data?

**Your answer:**


---

## Summary

In this workshop you practiced:

| Skill | Methods |
|-------|--------|
| Single conditions | `==`, `!=`, `>`, `<`, `>=`, `<=` |
| Combining conditions | `&` (AND), `\|` (OR), `~` (NOT) |
| Match a list | `.isin([list])` |
| Numeric range | `.between(low, high)` |
| Text pattern | `.str.contains()`, `.str.startswith()` |
| Clean syntax | `.query('expression')` |

These skills are essential for Milestone 1: you need to filter your project dataset to focus on specific subsets of data for analysis.

**Next week:** Exploratory Data Analysis (EDA) - using GroupBy and statistics to discover patterns.

---

*Week 4 - Data Analytics Course - Universidad Cooperativa de Colombia*