# Cleaning Data

## Prerequisites
- Intro to pandas
- Boolean selection
- Indexing

## Outcomes
- Use string methods to clean data
- Drop missing data
- Apply cleaning methods to real datasets

## Setup

In [None]:
import pandas as pd
import numpy as np

## Why Clean Data?

- A significant proportion of time in data projects is spent on **data cleaning**, not on the analysis itself
- pandas provides powerful data cleaning tools

## Sample Dataset

In [None]:
df = pd.DataFrame({
    "numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"],
    "nums": ["23", "24", "18", "14", np.nan, "XYZ", "35"],
    "colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"],
    "other_column": [0, 1, 0, 2, 1, 0, 2]
})
df

## Problem: Computing Mean

What happens if we try to compute the mean of `numbers`?

In [None]:
# This will raise an error!
# df["numbers"].mean()

**Error:** `TypeError: Could not convert #23#24... to numeric`

The `#` symbol prevents conversion to numeric type!
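
To see the failure without stopping the notebook, the call can be wrapped in `try`/`except`. This is a minimal sketch on a shortened, stand-alone copy of the column (the name `numbers` is only for illustration), and the exact wording of the message depends on the pandas version:

```python
import pandas as pd

# Shortened copy of the "numbers" column: strings, not numbers
numbers = pd.Series(["#23", "#24", "#18"])

raised = False
try:
    numbers.mean()
except TypeError as err:
    # pandas cannot average strings, so it raises a TypeError
    raised = True
    print(f"TypeError: {err}")
```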

## String Methods: The Slow Way

We could loop through rows (but this is slow)...

In [None]:
%%time

# Iterate over all rows, cleaning one value at a time
for index_value, column_values in df.iterrows():
    clean_number = int(column_values["numbers"].replace("#", ""))
    df.at[index_value, "numbers_loop"] = clean_number

## String Methods: The Fast Way

Use the `.str` accessor to apply a string method to an entire column at once!

In [None]:
%%time

# Much faster: roughly 2x to 500x, depending on DataFrame size
# regex=False treats "#" as a literal character rather than a pattern
df["numbers_str"] = df["numbers"].str.replace("#", "", regex=False)

**Key Point:** `.str` methods operate on entire columns at once (vectorized operations)
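
One practical side effect of working on the whole column is that `.str` methods pass missing values through as missing, whereas calling `.replace` on a `NaN` inside a hand-written loop would fail because a float has no such method. A small sketch on an illustrative Series (`raw` is not part of the dataset above):

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing entry
raw = pd.Series(["#23", np.nan, "#18"])

# The missing entry stays missing; the other entries lose their "#"
cleaned = raw.str.replace("#", "", regex=False)
print(cleaned.tolist())
```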

## More String Methods

In [None]:
# Check if colors contain 'p'
df["colors"].str.contains("p")

In [None]:
# Capitalize colors
df["colors"].str.capitalize()
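
Because `.str.contains` returns a boolean Series, it combines naturally with boolean selection. The sketch below rebuilds the `colors` values as a stand-alone Series so that it runs on its own; `has_p` and `selected` are illustrative names:

```python
import pandas as pd

colors = pd.Series(["green", "red", "yellow", "orange", "purple", "blue", "pink"])

# Boolean mask: True where the color name contains the letter "p"
has_p = colors.str.contains("p")

# Keep only the entries where the mask is True
selected = colors[has_p]
print(selected.tolist())  # ['purple', 'pink']
```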

## Type Conversions

After removing `#`, we still have strings!

Use `pd.to_numeric()` to convert to numbers:

In [None]:
df["numbers_numeric"] = pd.to_numeric(df["numbers_str"])
df.dtypes
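
Note that `pd.to_numeric()` only succeeds when every entry can be parsed. As a sketch on an illustrative Series (`values` is not part of the dataset above), a single unparseable string makes the default call raise a `ValueError`; the function's `errors` argument controls this behavior:

```python
import pandas as pd

# Illustrative column with one entry that is not a number
values = pd.Series(["23", "24", "XYZ"])

raised = False
try:
    pd.to_numeric(values)
except ValueError as err:
    # By default, an unparseable entry aborts the whole conversion
    raised = True
    print(f"ValueError: {err}")
```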

## Using astype()

Convert to different data types:

In [None]:
# Convert to string
df["numbers_numeric"].astype(str)

In [None]:
# Convert to float
df["numbers_numeric"].astype(float)

## Missing Data

Our dataset has missing values:

In [None]:
df

## Detecting Missing Data

Use `.isnull()` to find missing values:

In [None]:
df.isnull()

## Missing Data by Column/Row

In [None]:
# Any missing data in each column?
df.isnull().any(axis=0)

In [None]:
# Any missing data in each row?
df.isnull().any(axis=1)
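
Since `True` counts as 1 when summed, the same boolean frame can also be used to count missing values rather than only detect them. A minimal sketch on a small illustrative frame (`frame` and its contents are made up for this example):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "nums": ["23", np.nan, "35"],
    "colors": ["green", np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = frame.isnull().sum()

# Total number of missing values in the whole frame
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print(total_missing)  # 3
```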

## Handling Missing Data

Two main approaches:

1. **Exclusion:** Drop missing data (`.dropna()`)
2. **Imputation:** Fill in replacement values (`.fillna()`)
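
The two approaches can be compared side by side on a small numeric Series. This is only a sketch: `scores` is illustrative data, and filling with the mean is one possible imputation choice among many:

```python
import numpy as np
import pandas as pd

scores = pd.Series([10.0, np.nan, 30.0])

# Exclusion: the missing entry is removed, leaving two values
dropped = scores.dropna()

# Imputation: .mean() skips NaN by default, so the gap is filled with 20.0
filled = scores.fillna(scores.mean())

print(dropped.tolist())  # [10.0, 30.0]
print(filled.tolist())   # [10.0, 20.0, 30.0]
```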

## Drop Missing Data

In [None]:
# Drop all rows containing missing data
df.dropna()

## Fill Missing Data

In [None]:
# Fill with a specific value
df.fillna(value=100)

## Forward/Backward Fill

In [None]:
# Use next valid observation
df.bfill()

In [None]:
# Use previous valid observation
df.ffill()

## Case Study: Chipotle Orders

Real data from a New York Times article about Chipotle orders:

- Nearly 2,000 orders
- Information on items and prices

In [None]:
url = "https://datascience.quantecon.org/assets/data/chipotle_raw.csv.zip"
chipotle = pd.read_csv(url)
chipotle.head()

## Exercise: Chipotle Analysis

Use this data to answer:

1. What is the average price of an item with chicken?
2. What is the average price of an item with steak?
3. Did chicken or steak produce more revenue (total)?
4. How many missing items are in this dataset?

**Hint:** You'll need to clean the `item_price` column first!

## Performance Comparison

Let's compare loop vs `.str` methods on larger data:

In [None]:
test = pd.DataFrame({"floats": np.round(100*np.random.rand(100000), 2)})
test["strings"] = test["floats"].astype(str) + "%"
test.head()

## Loop Method (Slow)

In [None]:
%%time

# Clean one row at a time
for index_value, column_values in test.iterrows():
    clean_number = column_values["strings"].replace("%", "")
    test.at[index_value, "numbers_loop"] = clean_number

## String Method (Fast)

In [None]:
%%time
# regex=False treats "%" as a literal character rather than a pattern
test["numbers_str_method"] = test["strings"].str.replace("%", "", regex=False)

In [None]:
# Verify they're the same
test["numbers_str_method"].equals(test["numbers_loop"])
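
The `%%time` magic only works inside Jupyter. Outside a notebook, a similar comparison can be sketched with `time.perf_counter`. The names `test_small`, `loop_values`, and `str_values` are illustrative, the row count is reduced so the loop finishes quickly, and the measured times will vary by machine:

```python
import time

import numpy as np
import pandas as pd

n_rows = 10_000  # smaller than the 100,000 rows above, to keep the loop short
test_small = pd.DataFrame({"floats": np.round(100 * np.random.rand(n_rows), 2)})
test_small["strings"] = test_small["floats"].astype(str) + "%"

# Loop method: clean one row at a time
start = time.perf_counter()
loop_values = []
for _, row in test_small.iterrows():
    loop_values.append(row["strings"].replace("%", ""))
loop_seconds = time.perf_counter() - start

# String method: clean the whole column in one call
start = time.perf_counter()
str_values = test_small["strings"].str.replace("%", "", regex=False)
str_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, .str: {str_seconds:.4f}s")
```

Both approaches produce the same cleaned strings; only the time taken differs.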

## Key Takeaways

1. Use the `.str` accessor for vectorized string operations
2. Use `pd.to_numeric()` and `.astype()` for type conversions
3. Handle missing data with `.dropna()` or `.fillna()`
4. Clean data before analysis!
5. Vectorized operations are much faster than loops

## Practice Exercises

**Exercise 1:** Convert `"#39"` to a number

**Exercise 2:** Create `colors_upper` column with uppercase colors

**Exercise 3:** Convert `nums` column to numeric (handle errors!)

**Exercise 4:** Analyze the Chipotle dataset