# Cleaning Data

## Prerequisites
- Intro to pandas
- Boolean selection
- Indexing

## Outcomes
- Use string methods to clean data
- Drop missing data
- Apply cleaning methods to real datasets

## Setup

In [23]:
import pandas as pd
import numpy as np

## Why Clean Data?

- A large share of the time in data projects goes to **data cleaning**, not to the analysis itself
- pandas provides powerful data cleaning tools

## Sample Dataset

In [26]:
df = pd.DataFrame({
    "numbers": ["#23", "#24", "#18", "#14", "#12", "#10", "#35"],
    "nums": ["23", "24", "18", "14", np.nan, "XYZ", "35"],
    "colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"],
    "other_column": [0, 1, 0, 2, 1, 0, 2]
})
df

# df.info() prints its report directly, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   numbers       7 non-null      object
 1   nums          6 non-null      object
 2   colors        7 non-null      object
 3   other_column  7 non-null      int64
dtypes: int64(1), object(3)
memory usage: 356.0+ bytes


## Problem: Computing Mean

What happens if we try to compute the mean of `numbers`?

In [29]:
# This will raise an error!
df["numbers"].mean()

TypeError: Could not convert string '#23#24#18#14#12#10#35' to numeric

The `numbers` column holds strings (`object` dtype), and the `#` prefix keeps pandas from treating its values as numbers.
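The run-together string in the error message hints at what seems to happen internally: to take a mean, pandas first sums the column, and summing strings joins them end to end. A minimal sketch on a small stand-in Series (the name `labels` and the explicit `dtype=object` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the first three values of df["numbers"]
labels = pd.Series(["#23", "#24", "#18"], dtype=object)

# Summing strings concatenates them instead of adding numbers
total = labels.sum()
print(total)
```

The joined string cannot be divided by the number of rows, which is why the mean fails.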

## String Methods: The Slow Way

We could loop through rows (but this is slow)...

In [39]:
%%time

# Iterate over all rows, unpacking each (index, row) pair directly
for index_value, column_values in df.iterrows():
    clean_number = int(column_values["numbers"].replace("#", ""))
    df.at[index_value, "numbers_loop"] = clean_number

df

CPU times: total: 0 ns
Wall time: 992 μs


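|   | numbers | nums | colors | other_column | numbers_loop |
|---|---------|------|--------|--------------|--------------|
| 0 | #23 | 23 | green | 0 | 23.0 |
| 1 | #24 | 24 | red | 1 | 24.0 |
| 2 | #18 | 18 | yellow | 0 | 18.0 |
| 3 | #14 | 14 | orange | 2 | 14.0 |
| 4 | #12 | NaN | purple | 1 | 12.0 |
| 5 | #10 | XYZ | blue | 0 | 10.0 |
| 6 | #35 | 35 | pink | 2 | 35.0 |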


## String Methods: The Fast Way

Use the `.str` accessor to apply string methods to an entire column at once!

In [42]:
%%time

# Much faster than the loop on large data: roughly 2-500x, depending on DataFrame size
df["numbers_str"] = df["numbers"].str.replace("#", "")

df

CPU times: total: 0 ns
Wall time: 922 μs


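|   | numbers | nums | colors | other_column | numbers_loop | numbers_str |
|---|---------|------|--------|--------------|--------------|-------------|
| 0 | #23 | 23 | green | 0 | 23.0 | 23 |
| 1 | #24 | 24 | red | 1 | 24.0 | 24 |
| 2 | #18 | 18 | yellow | 0 | 18.0 | 18 |
| 3 | #14 | 14 | orange | 2 | 14.0 | 14 |
| 4 | #12 | NaN | purple | 1 | 12.0 | 12 |
| 5 | #10 | XYZ | blue | 0 | 10.0 | 10 |
| 6 | #35 | 35 | pink | 2 | 35.0 | 35 |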


**Key Point:** `.str` methods operate on entire columns at once (vectorized operations)
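Because every `.str` method returns a new Series, several cleaning steps can be chained in one expression. The sketch below uses a made-up Series named `raw` (not part of the dataset above) with stray spaces, mixed case, and a `#` prefix:

```python
import pandas as pd

# Hypothetical messy labels: stray spaces, mixed case, and a '#' prefix
raw = pd.Series(["  #Green", "#RED ", " #yellow "])

# Each .str call returns a new Series, so the calls can be chained
clean = raw.str.strip().str.replace("#", "", regex=False).str.lower()
print(clean.tolist())
```

Passing `regex=False` makes it explicit that `#` is matched literally rather than as a pattern.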

## More String Methods

In [71]:
# Check if colors contain 'p'
df["colors"].str.contains("p")


0    False
1    False
2    False
3    False
4     True
5    False
6     True
Name: colors, dtype: bool

In [46]:
# Capitalize colors
df["colors"].str.capitalize()


0     Green
1       Red
2    Yellow
3    Orange
4    Purple
5      Blue
6      Pink
Name: colors, dtype: object
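The boolean Series returned by `.str.contains()` can also be used for boolean selection. A minimal sketch, rebuilding two of the sample columns in a separate frame (the names `colors` and `p_colors` are chosen here for illustration):

```python
import pandas as pd

colors = pd.DataFrame({
    "colors": ["green", "red", "yellow", "orange", "purple", "blue", "pink"],
    "other_column": [0, 1, 0, 2, 1, 0, 2],
})

# True where the color name contains the letter 'p'
has_p = colors["colors"].str.contains("p", regex=False)

# Use the boolean Series as a mask to keep only the matching rows
p_colors = colors[has_p]
print(p_colors)
```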

## Type Conversions

After removing `#`, we still have strings!

Use `pd.to_numeric()` to convert to numbers:

In [50]:
print(df.dtypes)

# Store the converted values in a new column, numbers_numeric
df["numbers_numeric"] = pd.to_numeric(df["numbers_str"])
df.dtypes

numbers          object
nums             object
colors           object
other_column      int64
numbers_loop    float64
numbers_str      object
dtype: object


numbers             object
nums                object
colors              object
other_column         int64
numbers_loop       float64
numbers_str         object
numbers_numeric      int64
dtype: object
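By default, `pd.to_numeric()` raises an error when it meets a value it cannot parse. Passing `errors="coerce"` replaces such values with `NaN` instead. The sketch below uses a made-up Series named `messy`, not the lesson's DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing clean numbers, a missing value, and junk text
messy = pd.Series(["23", "24", np.nan, "XYZ", "35"])

# errors="coerce" turns anything unparseable into NaN instead of raising
converted = pd.to_numeric(messy, errors="coerce")
print(converted)
```

The result is a float column, since `NaN` is a floating-point value.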

## Using astype()

Convert to different data types:

In [52]:
# Convert to string
df["numbers_numeric"].astype(str)

0    23
1    24
2    18
3    14
4    12
5    10
6    35
Name: numbers_numeric, dtype: object

In [53]:
# Convert to float
df["numbers_numeric"].astype(float)

0    23.0
1    24.0
2    18.0
3    14.0
4    12.0
5    10.0
6    35.0
Name: numbers_numeric, dtype: float64
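By default, `.astype()` raises if any single value cannot be converted. A small sketch with two made-up Series, `clean` and `dirty` (both hypothetical, with `dtype=object` set explicitly to mirror the string columns above):

```python
import pandas as pd

clean = pd.Series(["23", "24", "18"], dtype=object)
dirty = pd.Series(["23", "XYZ", "18"], dtype=object)

# Every string is a valid integer, so the conversion succeeds
as_int = clean.astype(int)

# One bad value is enough to make the whole conversion fail
try:
    dirty.astype(int)
    failed = False
except ValueError:
    failed = True

print(as_int.tolist(), failed)
```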

## Missing Data

Our dataset has a missing value in the `nums` column (row 4):

In [14]:
df

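|   | numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric |
|---|---------|------|--------|--------------|--------------|-------------|-----------------|
| 0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
| 1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
| 2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
| 3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
| 4 | #12 | NaN | purple | 1 | 12.0 | 12 | 12 |
| 5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
| 6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |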


## Detecting Missing Data

Use `.isnull()` to find missing values:

In [55]:
df.isnull()

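|   | numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric |
|---|---------|------|--------|--------------|--------------|-------------|-----------------|
| 0 | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False |
| 4 | False | True | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False |
| 6 | False | False | False | False | False | False | False |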


## Missing Data by Column/Row

In [64]:
# Does any column contain missing data?
# (Drop the final .any() to get one True/False per column.)
df.isnull().any(axis=0).any()


np.True_

In [65]:
# Does any row contain missing data?
# (Drop the final .any() to get one True/False per row.)
df.isnull().any(axis=1).any()

np.True_
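To see which columns or rows are affected, rather than a single yes/no answer, the reduction can stop one step earlier. A minimal sketch on a small made-up frame named `sample` (not the lesson's DataFrame):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "nums": ["23", np.nan, "XYZ"],
    "colors": ["green", "purple", "blue"],
})

missing_by_column = sample.isnull().any(axis=0)  # one flag per column
missing_by_row = sample.isnull().any(axis=1)     # one flag per row
missing_counts = sample.isnull().sum()           # number of NaNs per column

print(missing_by_column)
print(missing_by_row)
print(missing_counts)
```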

## Handling Missing Data

Two main approaches:

1. **Exclusion:** Drop missing data (`.dropna()`)
2. **Imputation:** Fill in estimated or substitute values (`.fillna()`)

## Drop Missing Data

In [66]:
# Drop all rows containing missing data
df.dropna()

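|   | numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric |
|---|---------|------|--------|--------------|--------------|-------------|-----------------|
| 0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
| 1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
| 2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
| 3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
| 5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
| 6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |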
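`.dropna()` returns a new DataFrame and can be narrowed with `subset` or pointed at columns with `axis=1`. The sketch below uses a made-up frame named `sample` with hypothetical column names:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "id": [1, 2, 3],
    "nums": [23.0, np.nan, 35.0],
    "colors": ["green", "purple", np.nan],
})

# Drop rows only when 'nums' is missing; NaN in other columns is ignored
rows_kept = sample.dropna(subset=["nums"])

# Drop every column that contains at least one missing value
cols_kept = sample.dropna(axis=1)

print(rows_kept)
print(cols_kept)
```

The original `sample` is left unchanged, so the result has to be assigned to a name to be kept.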


## Fill Missing Data

In [67]:
# Fill with a specific value
df.fillna(value=100)

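|   | numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric |
|---|---------|------|--------|--------------|--------------|-------------|-----------------|
| 0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
| 1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
| 2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
| 3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
| 4 | #12 | 100 | purple | 1 | 12.0 | 12 | 12 |
| 5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
| 6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |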
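A constant such as 100 is rarely a meaningful fill value. One common alternative for a numeric column is to fill with the column's own mean. A minimal sketch on a made-up Series named `prices`:

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, np.nan, 20.0, 30.0])

# .mean() skips NaN by default, so the fill value here is 20.0
filled = prices.fillna(prices.mean())
print(filled.tolist())
```

Whether the mean is a sensible stand-in depends on the data; this only illustrates the mechanics.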


## Forward/Backward Fill

In [68]:
# Use next valid observation
df.bfill()

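|   | numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric |
|---|---------|------|--------|--------------|--------------|-------------|-----------------|
| 0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
| 1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
| 2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
| 3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
| 4 | #12 | XYZ | purple | 1 | 12.0 | 12 | 12 |
| 5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
| 6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |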


In [69]:
# Use previous valid observation
df.ffill()

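|   | numbers | nums | colors | other_column | numbers_loop | numbers_str | numbers_numeric |
|---|---------|------|--------|--------------|--------------|-------------|-----------------|
| 0 | #23 | 23 | green | 0 | 23.0 | 23 | 23 |
| 1 | #24 | 24 | red | 1 | 24.0 | 24 | 24 |
| 2 | #18 | 18 | yellow | 0 | 18.0 | 18 | 18 |
| 3 | #14 | 14 | orange | 2 | 14.0 | 14 | 14 |
| 4 | #12 | 14 | purple | 1 | 12.0 | 12 | 12 |
| 5 | #10 | XYZ | blue | 0 | 10.0 | 10 | 10 |
| 6 | #35 | 35 | pink | 2 | 35.0 | 35 | 35 |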
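Forward and backward fill cannot repair every gap: a missing value with no valid neighbor in the fill direction stays missing. A small sketch on a made-up Series named `readings`:

```python
import numpy as np
import pandas as pd

readings = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Forward fill copies the previous valid value; the leading NaN has none
forward = readings.ffill()

# Backward fill copies the next valid value; the trailing NaN has none
backward = readings.bfill()

print(forward.tolist())
print(backward.tolist())
```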


## Case Study: Chipotle Orders

Real data from a New York Times article about Chipotle orders:

- Nearly 2,000 orders
- Information on items and prices

In [75]:
url = "https://datascience.quantecon.org/assets/data/chipotle_raw.csv.zip"
chipotle = pd.read_csv(url)
chipotle.head(10)

np.sum(chipotle['item_name'].str.contains("Chicken"))


np.int64(0)

## Exercise: Chipotle Analysis

Use this data to answer:

1. What is the average price of an item with chicken?
2. What is the average price of an item with steak?
3. Did chicken or steak produce more revenue (total)?
4. How many missing items are in this dataset?

**Hint:** You'll need to clean the `item_price` column first!
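As a starting point for the hint, here is a sketch of cleaning currency strings. It assumes prices are stored as text like `"$2.39 "`, with a leading dollar sign and trailing space; that format and the three sample values are assumptions, so check the real column before reusing it.

```python
import pandas as pd

# Hypothetical prices in the "$2.39 " style; the real column may differ
prices = pd.Series(["$2.39 ", "$3.39 ", "$16.98 "])

# Remove the literal dollar sign, trim whitespace, then convert to floats
cleaned = pd.to_numeric(
    prices.str.replace("$", "", regex=False).str.strip()
)
print(cleaned)
```

`regex=False` matters here because `$` has a special meaning in regular expressions.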

## Performance Comparison

Let's compare loop vs `.str` methods on larger data:

In [19]:
test = pd.DataFrame({"floats": np.round(100*np.random.rand(100000), 2)})
test["strings"] = test["floats"].astype(str) + "%"
test.head()

|   | floats | strings |
|---|--------|---------|
| 0 | 38.44 | 38.44% |
| 1 | 10.3 | 10.3% |
| 2 | 72.4 | 72.4% |
| 3 | 10.86 | 10.86% |
| 4 | 65.1 | 65.1% |


## Loop Method (Slow)

In [20]:
%%time

for index_value, column_values in test.iterrows():
    clean_number = column_values["strings"].replace("%", "")
    test.at[index_value, "numbers_loop"] = clean_number

CPU times: total: 2.33 s
Wall time: 7.7 s


## String Method (Fast)

In [21]:
%%time
test["numbers_str_method"] = test["strings"].str.replace("%", "")

CPU times: total: 0 ns
Wall time: 19.9 ms


In [22]:
# Verify they're the same
test["numbers_str_method"].equals(test["numbers_loop"])

True
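`%%time` is an IPython cell magic, so it only works in a notebook or IPython session. In a plain Python script the same comparison can be sketched with `time.perf_counter`. The version below is a variation, not the cells above: it uses a smaller made-up frame (`small`, 2,000 rows) and collects the loop results in a list.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
small = pd.DataFrame({"floats": np.round(100 * rng.random(2_000), 2)})
small["strings"] = small["floats"].astype(str) + "%"

# Row-by-row loop
start = time.perf_counter()
looped = []
for _, row in small.iterrows():
    looped.append(row["strings"].replace("%", ""))
loop_seconds = time.perf_counter() - start

# Vectorized .str method
start = time.perf_counter()
vectorized = small["strings"].str.replace("%", "", regex=False)
str_seconds = time.perf_counter() - start

# Both approaches should produce the same cleaned strings
same = vectorized.tolist() == looped
print(f"loop: {loop_seconds:.4f} s, .str: {str_seconds:.4f} s, same: {same}")
```

Exact timings vary by machine and pandas version, so treat the printed numbers as indicative only.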

## Key Takeaways

1. Use the `.str` accessor for vectorized string operations
2. Use `pd.to_numeric()` and `.astype()` for type conversions
3. Handle missing data with `.dropna()` or `.fillna()`
4. Clean data before analysis!
5. Vectorized operations are much faster than loops

## Practice Exercises

**Exercise 1:** Convert `"#39"` to a number

**Exercise 2:** Create `colors_upper` column with uppercase colors

**Exercise 3:** Convert `nums` column to numeric (handle errors!)

**Exercise 4:** Analyze the Chipotle dataset