# Data Wrangling Practice

Each challenge below simulates **its own tiny data set** and asks you to perform one essential data-wrangling task in `pandas`.  
For every question:

1. **Run the first code cell** to generate the data frame `df`.  
2. Add your own code cell(s) to solve the task.  
3. If you’re stuck, expand the **Instructor solution** to reveal one possible answer.


Turn this notebook in on Blackboard. You can complete it in Google Colab.


In [2]:
import numpy as np, pandas as pd, re
from datetime import datetime
rng_global = np.random.default_rng(30)  


## Question 01 — Inspect Data Structure

Use `df.head()`, `df.info()` and `df.describe()` to quickly understand its shape, data types and basic statistics.

In [3]:
# --- Generate synthetic data for Question 1 ---
rng = np.random.default_rng(41)

df = pd.DataFrame({
    'age': rng.integers(18, 60, size=15),
    'height_cm': rng.normal(170, 10, size=15).round(1),
    'city': rng.choice(['NY', 'LA', 'CHI'], size=15)
})


In [4]:
print(df.head())
df.info()
df.describe()

   age  height_cm city
0   45      156.4   LA
1   58      156.9  CHI
2   52      162.8   NY
3   50      181.9   NY
4   47      178.9  CHI
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        15 non-null     int64  
 1   height_cm  15 non-null     float64
 2   city       15 non-null     object 
dtypes: float64(1), int64(1), object(1)
memory usage: 492.0+ bytes


Unnamed: 0,age,height_cm
count,15.0,15.0
mean,43.133333,169.873333
std,12.211626,9.686549
min,23.0,156.4
25%,32.0,163.65
50%,47.0,167.3
75%,52.5,178.85
max,58.0,189.3


#### Why would you want to inspect your data in this way? 

`df.head()` allows you to see a portion of the data, `df.info()` allows you to see what kind of data is stored, and `df.describe()` allows you to see some summary statistics about the data to get an idea of what the data looks like overall.

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

print(df.head())
df.info()
df.describe()

```
</details>


## Question 02 — Handle Missing Values

Count how many entries in column **`income`** are missing and then fill them with the column median.

In [5]:
# --- Generate synthetic data for Question 2 ---
rng = np.random.default_rng(42)

df = pd.DataFrame({
    'id': range(1, 21),
    'age': rng.integers(18, 65, size=20),
    'income': rng.normal(60000, 15000, size=20).round(2)
})
# inject missing
miss = rng.choice(df.index, size=5, replace=False)
df.loc[miss, 'income'] = np.nan

df

missing_count = df['income'].isna().sum()
print(missing_count)
median_income = df['income'].median(skipna=True)
# Assign the result back: fillna(..., inplace=True) on a column selection is
# deprecated and will no longer modify `df` in pandas 3.0
df['income'] = df['income'].fillna(median_income)

5


#### List a pro and a con of filling in missing values with the median

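- **Pro:** you keep every row instead of dropping all rows with missing values.
- **Con:** the filled values may not represent the true values or population accurately, and they make the column look less spread out than it really is.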
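
One way to see the con in practice is to compare summary statistics before and after filling. The sketch below uses a made-up `demo` series (not the `df` generated above), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with two missing entries (illustration only)
demo = pd.Series([30000.0, 45000.0, np.nan, 60000.0, np.nan, 120000.0])

filled = demo.fillna(demo.median())

# The median is preserved, but the standard deviation shrinks because the
# imputed values all sit at the centre of the distribution.
print("median before/after:", demo.median(), filled.median())
print("std before/after:   ", round(demo.std(), 1), round(filled.std(), 1))
```

If preserving the spread matters for your analysis, this is a reason to consider dropping rows or using a different imputation strategy.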

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

missing = df['income'].isna().sum()
print('Missing:', missing)
median_inc = df['income'].median()
df['income'] = df['income'].fillna(median_inc)

```
</details>


## Question 03 — Convert String Numbers → Numeric

You can find predictable characters, and remove them relatively easily. Here is an example. 

In [6]:
df = pd.DataFrame({
    'code': ['AB-123', 'CD-456', 'EF-789', 'GH-000']
})
print("Before:")
print(df)

## EXAMPLE:
df['code'] = df['code'].str.replace('-', '', regex=True)

print("\nAfter:")
print(df)

Before:
     code
0  AB-123
1  CD-456
2  EF-789
3  GH-000

After:
    code
0  AB123
1  CD456
2  EF789
3  GH000


Now, `salary` is stored as strings like '`$45,000`'. Convert it to a numeric column **in dollars** (45000).

In [7]:
# --- Generate synthetic data for Question 3 ---
rng = np.random.default_rng(43)

def fmt(x): return f"${x:,.0f}"
df = pd.DataFrame({
    'name': [f'Emp{i}' for i in range(8)],
    'salary': [fmt(s) for s in rng.integers(30000, 90000, size=8)]
})

print(df)

df['salary'] = df['salary'].replace("[$,]", "", regex=True).astype(float)
print(df.head())

   name   salary
0  Emp0  $60,319
1  Emp1  $69,137
2  Emp2  $54,078
3  Emp3  $32,626
4  Emp4  $64,629
5  Emp5  $31,201
6  Emp6  $46,594
7  Emp7  $80,352
   name   salary
0  Emp0  60319.0
1  Emp1  69137.0
2  Emp2  54078.0
3  Emp3  32626.0
4  Emp4  64629.0


#### In the `Series.str.replace` function, there is an argument called `regex`. Please state what it means and when you would want it to be true or false.

A regex (regular expression) is a pattern language for matching character sequences in text. With `regex=True`, the pattern argument (`pat`) is interpreted as a regular expression, so `[$,]` matches either a `$` or a `,`. With `regex=False`, the pattern is matched as a literal string. Use `True` when you need pattern matching (character classes, repetition, groups) and `False` when you want to replace an exact substring, especially one containing characters such as `$` or `.` that have a special meaning in a regex.
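
To make the difference concrete, here is a small sketch that applies the same pattern both ways. The series `s` is a made-up example, chosen so that one entry contains the pattern text itself.

```python
import pandas as pd

# Hypothetical strings: a formatted salary and the pattern text itself
s = pd.Series(["$45,000", "[$,]"])

# regex=True: "[$,]" is a character class matching either "$" or ","
as_regex = s.str.replace("[$,]", "", regex=True)

# regex=False: "[$,]" is matched literally, character for character
as_literal = s.str.replace("[$,]", "", regex=False)

print(as_regex.tolist())    # ['45000', '[]']
print(as_literal.tolist())  # ['$45,000', '']
```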

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

df['salary'] = (
    df['salary']
      .str.replace('[$,]', '', regex=True)
      .astype(float)
)

```
</details>


## Question 04 — Parse Dates & Extract Month

Column `date_str` is text. Convert it to datetime and create a new column `month` (YYYY-MM). Try to solve this one by first defining the problem clearly for yourself and then searching pandas for functions that get you to the outcome (there are many solutions).

In [8]:
# --- Generate synthetic data for Question 4 ---
rng = np.random.default_rng(44)

dates = pd.date_range('2024-01-01', periods=20, freq='7D')
rng.shuffle(dates.values)
df = pd.DataFrame({'date_str': dates.astype(str)})

df.head()

Unnamed: 0,date_str
0,2024-04-01
1,2024-03-04
2,2024-02-26
3,2024-01-29
4,2024-03-18


In [9]:
df['date'] = pd.to_datetime(df['date_str'])
df['month'] = df['date'].dt.to_period('M')
print(df)

      date_str       date    month
0   2024-04-01 2024-04-01  2024-04
1   2024-03-04 2024-03-04  2024-03
2   2024-02-26 2024-02-26  2024-02
3   2024-01-29 2024-01-29  2024-01
4   2024-03-18 2024-03-18  2024-03
5   2024-01-01 2024-01-01  2024-01
6   2024-05-13 2024-05-13  2024-05
7   2024-03-11 2024-03-11  2024-03
8   2024-04-15 2024-04-15  2024-04
9   2024-04-29 2024-04-29  2024-04
10  2024-01-22 2024-01-22  2024-01
11  2024-05-06 2024-05-06  2024-05
12  2024-01-15 2024-01-15  2024-01
13  2024-04-22 2024-04-22  2024-04
14  2024-03-25 2024-03-25  2024-03
15  2024-02-05 2024-02-05  2024-02
16  2024-01-08 2024-01-08  2024-01
17  2024-04-08 2024-04-08  2024-04
18  2024-02-12 2024-02-12  2024-02
19  2024-02-19 2024-02-19  2024-02


#### Try to find another way to solve this problem without using Series properties or pandas functions.
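
One possible approach uses only the standard library. The sketch below assumes every string follows the `YYYY-MM-DD` layout seen in `date_str`; the list `date_strings` is a small hypothetical sample rather than the full column.

```python
from datetime import datetime

# Hypothetical sample in the same 'YYYY-MM-DD' layout as `date_str`
date_strings = ["2024-04-01", "2024-03-04", "2024-02-26"]

# Option 1: parse with the standard library, then format as YYYY-MM
months_parsed = [
    datetime.strptime(s, "%Y-%m-%d").strftime("%Y-%m") for s in date_strings
]

# Option 2: the layout is fixed-width, so the first 7 characters are the month
months_sliced = [s[:7] for s in date_strings]

print(months_parsed)  # ['2024-04', '2024-03', '2024-02']
```

Option 1 also validates the dates (it raises `ValueError` on malformed input), while Option 2 is faster but trusts the format blindly.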

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

df['date'] = pd.to_datetime(df['date_str'])
df['month'] = df['date'].dt.to_period('M')

```
</details>


## Question 05 — Remove Duplicate Rows

In a live system, customers can update their profile multiple times. Your DataFrame `df` records **every** profile save, with:

- `customer_id` (int)  
- `name` (string)  
- `updated_at` (datetime)  

Because you only ever need each customer’s **most recent** profile, drop older entries and **keep the row with the latest** `updated_at` for each `customer_id`.

In [12]:
# --- Generate synthetic data for Question 05 ---
rng = np.random.default_rng(45)
ids        = list(range(1, 11)) * 2
timestamps = (pd.to_datetime('2025-06-01') 
              + pd.to_timedelta(rng.integers(0, 30, size=20), unit='D'))

# Assign "Names" vs. “updated” names
occ    = {}
names  = []
for uid in ids:
    occ[uid] = occ.get(uid, 0) + 1
    if occ[uid] == 1:
        names.append(f"Customer{uid}")
    else:
        names.append(f"Customer{uid}_updated")

df = pd.DataFrame({
    'customer_id': ids,
    'name':         names,
    'updated_at':   timestamps
})
df = df.sort_values('updated_at').reset_index(drop=True)
df.head()

df_lastupdate = df.drop_duplicates(subset="customer_id", keep="last")
df_lastupdate


Unnamed: 0,customer_id,name,updated_at
8,9,Customer9_updated,2025-06-17
11,10,Customer10_updated,2025-06-21
12,7,Customer7,2025-06-22
13,3,Customer3,2025-06-22
14,6,Customer6,2025-06-23
15,4,Customer4_updated,2025-06-24
16,5,Customer5_updated,2025-06-24
17,2,Customer2_updated,2025-06-24
18,8,Customer8,2025-06-25
19,1,Customer1,2025-06-28


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python
df_latest = (df.drop_duplicates(subset=['customer_id'], keep='last'))
df_latest
```
</details>

## Question 06 — Filter Rows by Condition

You want to analyze data from students who passed the class and had a score greater than 70%. Write a command to do that. Filter rows where `score > 70` **and** `passed == True`.

In [16]:
# --- Generate synthetic data for Question 6 ---
rng = np.random.default_rng(46)

df = pd.DataFrame({
    'student': [f'S{i}' for i in range(15)],
    'score': rng.integers(40, 100, size=15)
})
df['passed'] = df['score'] >= 60

df.head()

df_filtered = df[(df['score'] > 70) & (df['passed'] == True)]
df_filtered.head()

Unnamed: 0,student,score,passed
1,S1,94,True
4,S4,72,True
7,S7,77,True
9,S9,96,True
10,S10,88,True


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

filtered = df[(df['score'] > 70) & (df['passed'])]

```
</details>


## Question 07 — Subset Columns with `.loc`

Create a new DataFrame containing only `student` and `score` columns.

In [20]:
# --- Generate synthetic data for Question 7 ---
rng = np.random.default_rng(47)

df = pd.DataFrame({
    'student': [f'S{i}' for i in range(8)],
    'score': rng.integers(50, 100, size=8),
    'age': rng.integers(14, 18, size=8)
})

df.head()

df_new = df.loc[:, ['student', 'score']]
df_new

Unnamed: 0,student,score
0,S0,54
1,S1,87
2,S2,77
3,S3,87
4,S4,71
5,S5,73
6,S6,55
7,S7,55


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

sub = df.loc[:, ['student', 'score']]

```
</details>


## Question 08 — Group & Aggregate

Compute the **mean score per class**, then add a new column `class_mean` to `df` so that each student’s row shows the average score for their respective class.


In [None]:
# --- Generate synthetic data for Question 8 ---
rng = np.random.default_rng(48)

df = pd.DataFrame({
    'student': [f'S{i}' for i in range(30)],
    'class':   rng.choice(['A','B','C'], size=30),
    'score':   rng.integers(50, 100, size=30)
})

df.head()

mean_per_class = df.groupby('class')['score'].mean()

df['class_score_mean'] = df.groupby('class')['score'].transform('mean')
df

0     89
1     67
2     54
3     83
4     99
5     85
6     84
7     86
8     60
9     65
10    67
11    51
14    75
19    69
21    85
Name: score, dtype: int64


Unnamed: 0,student,class,score,class_score_mean
0,S0,A,89,79.125
1,S1,B,67,74.083333
2,S2,B,54,74.083333
3,S3,B,83,74.083333
4,S4,C,99,71.3
5,S5,B,85,74.083333
6,S6,A,84,79.125
7,S7,C,86,71.3
8,S8,C,60,71.3
9,S9,B,65,74.083333


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python
# Just the mean part
group_mean = df.groupby('class')['score'].mean()
# Annotating each row
df['class_mean'] = df.groupby('class')['score'].transform('mean')

df.head()
```
</details>


## Question 09 — Merge / Join Two Tables

Merge orders with a **`product`** lookup to bring in the `category` column. (i.e., your final data frame should be "order_id", "product_id", "qty", and "category")

In [36]:
# --- Generate synthetic data for Question 9 ---
rng = np.random.default_rng(49)

orders = pd.DataFrame({
    'order_id': range(1,11),
    'product_id': rng.integers(1,6,size=10),
    'qty': rng.integers(1,5,size=10)
})
products = pd.DataFrame({
    'product_id': range(1,6),
    'category': ['Electronics','Clothing','Sport','Home','Toys']
})


print(orders.head())

merged = orders.merge(products, on='product_id')

print(merged.head())

   order_id  product_id  qty
0         1           1    3
1         2           2    1
2         3           5    2
3         4           3    4
4         5           1    2
   order_id  product_id  qty     category
0         1           1    3  Electronics
1         2           2    1     Clothing
2         3           5    2         Toys
3         4           3    4        Sport
4         5           1    2  Electronics


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

merged = orders.merge(products, on='product_id', how='left')

```
</details>


## Question 10 — Concatenate DataFrames

Combine `df1` and `df2` **vertically** (row-wise) and reset the index. (Basically, `df1` goes on top of `df2`, and you then want to ignore the original indexing; "ignore" is a hint about which argument to use.)

In [37]:
# --- Generate synthetic data for Question 10 ---
rng = np.random.default_rng(50)

df1 = pd.DataFrame({'id': range(1,6), 'val': rng.integers(0,50,size=5)})
df2 = pd.DataFrame({'id': range(6,11), 'val': rng.integers(0,50,size=5)})

df1.head()

concatenated = pd.concat([df1, df2], ignore_index=True)

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

combined = pd.concat([df1, df2], ignore_index=True)

```
</details>


## Question 11 — Pivot Wider

A **pivot table** is a way to **summarize** your data by computing an aggregate (e.g. sum, mean) for each combination of one categorical variable for rows and another for columns. When you “pivot wider,” you turn one of your categories into new columns, converting a long-form table into a wide-form one.

**Your Task**

- Use `class` as the **row index**.
- Spread `gender` into **column headers**.
- Compute the **mean** of `score` for each (`class`,`gender`) pair.


Basically, your table should look like the following, where `,` denotes a column break:
- row 1: gender, F , M
- row 2: class , ,
- row 3: A,  75.62,  81.5 
...


In [40]:
# --- Generate synthetic data for Question 11 ---
rng = np.random.default_rng(51)

df = pd.DataFrame({
    'student': [f'S{i}' for i in range(40)],
    'class':   rng.choice(list('ABC'), size=40),
    'gender':  rng.choice(['F','M'], size=40),
    'score':   rng.integers(50, 100, size=40)
})

df.head()

table = df.pivot_table(values='score', index='class', columns='gender', aggfunc='mean')
table

gender,F,M
class,Unnamed: 1_level_1,Unnamed: 2_level_1
A,75.625,81.5
B,72.857143,64.6
C,72.1,79.833333


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

pivot = df.pivot_table(index='class', columns='gender', values='score', aggfunc='mean')

```
</details>


## Question 12 — Melt (Wide → Long)

Transform the wide table with `Q1`…`Q4` sales columns into long format with `quarter` and `sales` columns. A "long" format table in this case would have the following column names, where `,` separates the columns:
- year , quarter , sales

In [44]:
# --- Generate synthetic data for Question 12 ---
rng = np.random.default_rng(52)

df = pd.DataFrame({
    'year': [2023, 2024],
    'Q1': rng.integers(200,400,size=2),
    'Q2': rng.integers(200,400,size=2),
    'Q3': rng.integers(200,400,size=2),
    'Q4': rng.integers(200,400,size=2)
})

df.head()

long = df.melt(id_vars=['year'], value_vars=['Q1', 'Q2', 'Q3', 'Q4'], var_name='quarter', value_name='sales')
long

Unnamed: 0,year,quarter,sales
0,2023,Q1,392
1,2024,Q1,323
2,2023,Q2,251
3,2024,Q2,314
4,2023,Q3,293
5,2024,Q3,303
6,2023,Q4,358
7,2024,Q4,201


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

tidy = df.melt(id_vars='year', value_vars=['Q1','Q2','Q3','Q4'],
               var_name='quarter', value_name='sales')

```
</details>


## Question 13 — Detect Outliers with IQR

Identify rows in `df` where `value` falls more than 1.5×IQR below the first quartile or above the third quartile. Print out all the outlier values.

In [46]:
# --- Generate synthetic data for Question 13 ---
rng = np.random.default_rng(53)

df = pd.DataFrame({'value': rng.normal(0,1,size=100)})
# inject a few outliers
df.loc[rng.choice(df.index,size=4,replace=False),'value'] = rng.normal(8,0.5,size=4)

df.head()

quantile_75 = df['value'].quantile(0.75)
quantile_25 = df['value'].quantile(0.25)
iqr = quantile_75 - quantile_25

outlier_high = quantile_75 + 1.5 * iqr
outlier_low = quantile_25 - 1.5 * iqr

outliers = df[(df['value'] < outlier_low) | (df['value'] > outlier_high)]
outliers

Unnamed: 0,value
16,7.468513
28,3.641682
30,-2.78482
46,8.249273
82,8.264442
86,8.161956


#### How do you calculate the IQR? 

The IQR (interquartile range) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

#### Why would we use that over the standard deviation? 

Just like the mean, standard deviation can be influenced greatly by outliers, making it less helpful to use standard deviation to detect outliers.

#### The mean is easier to calculate, can I mix the IQR with the mean? Why or why not?

The IQR is built from quartiles, so it pairs naturally with the median rather than the mean. The mean is itself pulled toward outliers, so combining it with the IQR is likely to be less effective for detecting them.
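
A small experiment can illustrate why quartile-based statistics are preferred here. The sketch below uses hypothetical values (not the `df` from this question) and compares the four statistics with and without a single extreme point.

```python
import pandas as pd

# Hypothetical values: the second series adds a single extreme point
clean = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
with_outlier = pd.concat([clean, pd.Series([100])], ignore_index=True)

def summary(s: pd.Series) -> dict:
    """Return mean, std, median and IQR for a numeric Series."""
    q1, q3 = s.quantile([0.25, 0.75])
    return {"mean": s.mean(), "std": s.std(), "median": s.median(), "iqr": q3 - q1}

a = summary(clean)
b = summary(with_outlier)
print(a)
print(b)
```

The mean and standard deviation jump sharply once the extreme point is added, while the median and IQR barely move.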

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

q1, q3 = df['value'].quantile([0.25,0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5*iqr, q3 + 1.5*iqr
outliers = df[(df['value'] < lo) | (df['value'] > hi)]

```
</details>


## Question 14 — Cap Extreme Values

Clip `value` to stay within the IQR bounds you computed previously.

In [47]:
# --- Generate synthetic data for Question 14 ---
rng = np.random.default_rng(54)

df = pd.DataFrame({'value': rng.normal(0,1,size=100)})
q1, q3 = df['value'].quantile([0.25,0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5*iqr, q3 + 1.5*iqr

df.head()

df['value'] = df['value'].clip(lower=lo, upper=hi)

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

df['value'] = df['value'].clip(lo, hi)

```
</details>


#### What could you do to save your data to make sure you can go back to it if you didn't want to clip after all? 

Save the clipped DataFrame to a new DataFrame and ensure the old one is not modified.
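
Two common ways to do that are sketched below. The names `df_demo`, `lo_demo` and `hi_demo` are hypothetical stand-ins for `df`, `lo` and `hi` from the cell above.

```python
import pandas as pd

# Hypothetical data and bounds, standing in for `df`, `lo` and `hi`
df_demo = pd.DataFrame({"value": [-5.0, -0.3, 0.1, 0.8, 6.0]})
lo_demo, hi_demo = -2.0, 2.0

# Option 1: keep an untouched copy before modifying anything
df_backup = df_demo.copy()

# Option 2: write the clipped values to a new column instead of overwriting
df_demo["value_clipped"] = df_demo["value"].clip(lower=lo_demo, upper=hi_demo)

print(df_demo)
```

Note that plain assignment (`df_backup = df_demo`) would not create a copy; both names would refer to the same DataFrame.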

## Question 15 — Bin Continuous Variable

Create a new column `age_band` with 10-year bins (0–9, 10–19, …). I would use `pd.cut`, but there are multiple valid answers here.

In [50]:
# --- Generate synthetic data for Question 15 ---
rng = np.random.default_rng(55)

df = pd.DataFrame({'age': rng.integers(0, 90, size=25)})

df.head()

pd.cut(df['age'], bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
       labels=['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89'],
       right=False)

0     80-89
1     70-79
2     60-69
3     70-79
4     10-19
5     10-19
6     70-79
7     20-29
8     50-59
9     40-49
10    30-39
11    60-69
12    40-49
13    60-69
14    10-19
15    80-89
16    70-79
17      0-9
18    60-69
19    40-49
20    30-39
21    30-39
22    30-39
23    50-59
24      0-9
Name: age, dtype: category
Categories (9, object): ['0-9' < '10-19' < '20-29' < '30-39' ... '50-59' < '60-69' < '70-79' < '80-89']

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

bins = list(range(0, 101, 10))
labels = [f'{b}-{b+9}' for b in bins[:-1]]
df['age_band'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

```
</details>


## Question 16 — Regex Extract

Extract the 6-digit product code from `text` into a new column `prod_id`.

In [53]:
# --- Generate synthetic data for Question 16 ---
rng = np.random.default_rng(56)

codes = [f'ID-{rng.integers(100000,999999)}-X' for _ in range(12)]
phrases = [f'Order {c} received' for c in codes]
df = pd.DataFrame({'text': phrases})

df.head()

df['prod_id'] = df['text'].str.extract(r"ID-(\d{6})-X")

df

Unnamed: 0,text,prod_id
0,Order ID-430359-X received,430359
1,Order ID-753512-X received,753512
2,Order ID-456514-X received,456514
3,Order ID-115212-X received,115212
4,Order ID-955826-X received,955826
5,Order ID-982719-X received,982719
6,Order ID-654011-X received,654011
7,Order ID-752143-X received,752143
8,Order ID-127768-X received,127768
9,Order ID-589790-X received,589790


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

df['prod_id'] = df['text'].str.extract(r'(\d{6})')

```
</details>


## Question 17 — Clean Phone Numbers

Standardise `phone` to the format `XXX-XXX-XXXX` (digits only).

In [54]:
# --- Generate synthetic data for Question 17 ---
rng = np.random.default_rng(57)

raw = ['(617) 555-{:04d}'.format(n) for n in rng.integers(0,9999,size=10)]
df = pd.DataFrame({'phone': raw})

df.head()

df['phone'] = df['phone'].str.replace(r"[() -]", "", regex=True)
df['phone'] = df['phone'].str.replace(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True)
df

Unnamed: 0,phone
0,617-555-0651
1,617-555-6840
2,617-555-8197
3,617-555-4682
4,617-555-8782
5,617-555-8411
6,617-555-9339
7,617-555-3402
8,617-555-2664
9,617-555-3937


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

df['phone'] = (df['phone']
                 .str.replace('[^0-9]', '', regex=True)
                 .str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True))

```
</details>


## Question 18 — Resample Time Series

Compute the **weekly** sum of `sales`. You can do this multiple ways, but there is an easy function built into pandas data frames (i.e., look into `resample`).

In [63]:
# --- Generate synthetic data for Question 18 ---
rng = np.random.default_rng(58)

idx = pd.date_range('2025-01-01', periods=90, freq='D')
df = pd.DataFrame({'sales': rng.integers(10, 40, size=len(idx))}, index=idx)

df.head()

weekly_sales = df['sales'].resample("W").sum()
weekly_sales

2025-01-05    145
2025-01-12    178
2025-01-19    169
2025-01-26    212
2025-02-02    190
2025-02-09    180
2025-02-16    155
2025-02-23    168
2025-03-02    164
2025-03-09    176
2025-03-16    187
2025-03-23    165
2025-03-30    205
2025-04-06     24
Freq: W-SUN, Name: sales, dtype: int64

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

weekly = df['sales'].resample('W').sum()

```
</details>


## Question 19 — Rolling Mean

Add a `7d_avg` column containing the 7-day rolling mean of `sales`. Use pandas for this as well.

In [65]:
# --- Generate synthetic data for Question 19 ---
rng = np.random.default_rng(59)

idx = pd.date_range('2025-02-01', periods=60, freq='D')
df = pd.DataFrame({'sales': rng.integers(20, 60, size=len(idx))}, index=idx)

df.head()

rolling = df['sales'].rolling(window=7, min_periods=1).mean()
rolling

2025-02-01    51.000000
2025-02-02    47.000000
2025-02-03    47.666667
2025-02-04    49.000000
2025-02-05    48.200000
2025-02-06    44.166667
2025-02-07    45.285714
2025-02-08    43.000000
2025-02-09    41.571429
2025-02-10    38.428571
2025-02-11    34.000000
2025-02-12    31.285714
2025-02-13    35.857143
2025-02-14    31.285714
2025-02-15    34.000000
2025-02-16    37.571429
2025-02-17    41.714286
2025-02-18    42.000000
2025-02-19    45.857143
2025-02-20    46.000000
2025-02-21    49.571429
2025-02-22    47.142857
2025-02-23    44.142857
2025-02-24    43.000000
2025-02-25    48.000000
2025-02-26    43.857143
2025-02-27    40.000000
2025-02-28    39.285714
2025-03-01    36.857143
2025-03-02    39.000000
2025-03-03    35.571429
2025-03-04    30.571429
2025-03-05    30.000000
2025-03-06    29.857143
2025-03-07    28.142857
2025-03-08    28.714286
2025-03-09    29.000000
2025-03-10    30.285714
2025-03-11    32.428571
2025-03-12    34.428571
2025-03-13    35.857143
2025-03-14    35

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

df['7d_avg'] = df['sales'].rolling(window=7, min_periods=1).mean()

```
</details>


#### Can you solve this without pandas? Write your own function to calculate a rolling 7-day mean on sales
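
One possible pure-Python answer is sketched below. It assumes a trailing window in which early positions average whatever values are available (the same behaviour as `min_periods=1` in the pandas version); the `sales` list is a small sample used only for illustration.

```python
def rolling_mean(values, window=7):
    """Trailing rolling mean over a list of numbers.

    Early positions average the values seen so far, mirroring
    min_periods=1 in the pandas version.
    """
    result = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            running_sum -= values[i - window]  # drop the value that left the window
        count = min(i + 1, window)
        result.append(running_sum / count)
    return result

sales = [51, 43, 49, 53, 45, 24, 52, 35]  # sample daily sales (illustration only)
print(rolling_mean(sales))
```

To apply it to the DataFrame, pass a plain list, for example `rolling_mean(df['sales'].tolist())`, since the function indexes by position.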

## Question 20 — Cross-Tabulation

A **cross-tabulation** (or **contingency table**) displays the **counts** of observations for each combination of two categorical variables. It’s incredibly useful for spotting patterns at a glance.

### What you start with (long form)

Imagine your DataFrame `df` looks like this:

| index | gender | purchased |
|:-----:|:------:|:---------:|
|   0   |   F    |     1     |
|   1   |   M    |     0     |
|   2   |   F    |     1     |
|   3   |   M    |     1     |
|   4   |   F    |     0     |
|  ...  |  ...   |    ...    |

Each row is one customer’s gender and whether they made a purchase.

### What you end up with (wide form)

After a cross-tab, you get a small table of counts:

| gender \ purchased |   0   |   1   |
|:------------------:|:-----:|:-----:|
|         F          | count of F&0 | count of F&1 |
|         M          | count of M&0 | count of M&1 |

For example, if F customers made 8 “no-purchase” (0) and 12 “yes-purchase” (1), you’d see:

| gender \ purchased |   0   |   1   |
|:------------------:|:-----:|:-----:|
|         F          |   8   |   12  |
|         M          |   5   |   15  |

---

In [66]:
# --- Generate synthetic data for Question 20 ---
rng = np.random.default_rng(60)

df = pd.DataFrame({
    'gender': rng.choice(['F','M'], size=50),
    'purchased': rng.choice([0,1], size=50, p=[0.4,0.6])
})

df.head()

crosstab = pd.crosstab(df['gender'], df['purchased'])
crosstab

purchased,0,1
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
F,9,15
M,13,13


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

ct = pd.crosstab(df['gender'], df['purchased'])

```
</details>


## Question 21 — One-Hot Encode Categorical

A **dummy variable** (or **one-hot encoding**) turns each category of a categorical column into its own 0/1 indicator column. This is useful for feeding categories into models that expect numeric inputs. This encoding is used widely across the sciences and machine learning.

### What you start with

| index | color  |
|:-----:|:-------|
|   0   | red    |
|   1   | blue   |
|   2   | green  |
|   3   | red    |

### What you end up with

| index | color_blue | color_green | color_red |
|:-----:|:----------:|:-----------:|:---------:|
|   0   |     0      |      0      |     1     |
|   1   |     1      |      0      |     0     |
|   2   |     0      |      1      |     0     |
|   3   |     0      |      0      |     1     |


In [72]:
# --- Generate synthetic data for Question 21 ---
rng = np.random.default_rng(61)

df = pd.DataFrame({'color': rng.choice(['red','blue','green'], size=12)})

df.head()

dummies = pd.get_dummies(df['color'], prefix='color')
dummies


Unnamed: 0,color_blue,color_green,color_red
0,True,False,False
1,True,False,False
2,False,True,False
3,False,True,False
4,True,False,False
5,False,True,False
6,False,True,False
7,False,True,False
8,False,False,True
9,True,False,False


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

dummies = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df, dummies], axis=1)

```
</details>


## Question 22 — Explode List Column

When you have a column of lists, `pd.DataFrame.explode()` converts each element of the list into its own row, duplicating the rest of the row’s data.  

### What you start with (wide format)

| index | post_id | tags           |
|:-----:|:-------:|----------------|
|   0   |    1    | [ml, ai]       |
|   1   |    2    | [cv]           |
|   2   |    3    | [nlp, ai, cv]  |

### What you end up with (long format)

| index | post_id | tags |
|:-----:|:-------:|:----:|
|   0   |    1    | ml   |
|   1   |    1    | ai   |
|   2   |    2    | cv   |
|   3   |    3    | nlp  |
|   4   |    3    | ai   |
|   5   |    3    | cv   |

In [75]:
# --- Generate synthetic data for Question 22 ---
rng = np.random.default_rng(63)

choices = ['ml','ai','cv','nlp']
df = pd.DataFrame({
    'post_id': range(1,6),
    'tags': [rng.choice(choices, size=rng.integers(1,4), replace=False).tolist()
             for _ in range(5)]
})

df.head()

exploded = df.explode('tags')
exploded

Unnamed: 0,post_id,tags
0,1,ai
0,1,nlp
1,2,ai
1,2,cv
1,2,ml
2,3,ai
2,3,ml
3,4,ai
4,5,ai


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

exploded = df.explode('tags')

```
</details>


## Question 23 — Wide ↔ Long Round-Trip

When you have several columns that all represent the same kind of measurement (here, daily temperatures), you can **melt** the table into a “long” tidy format, then **pivot** it back to the original “wide” shape once you’re done processing.

### What you start with (wide format)

| city | Mon | Tue | Wed |
|:----:|:---:|:---:|:---:|
|  NY  |  10 |   5 |   8 |
|  LA  |  15 |  12 |  14 |

### What you get after melting (long format)

| city | day | temp |
|:----:|:---:|:----:|
|  NY  | Mon |  10  |
|  NY  | Tue |   5  |
|  NY  | Wed |   8  |
|  LA  | Mon |  15  |
|  LA  | Tue |  12  |
|  LA  | Wed |  14  |

### What you end up with after pivoting back (wide format)

| city | Mon | Tue | Wed |
|:----:|:---:|:---:|:---:|
|  NY  |  10 |   5 |   8 |
|  LA  |  15 |  12 |  14 |



#### What is "tidy" format? (Please research this if you are unsure, and explain.)

"Tidy" data requires:
- one variable per column
- one case per row
- one value per cell

In [84]:
# --- Generate synthetic data for Question 23 ---
rng = np.random.default_rng(64)

df = pd.DataFrame({
    'city': ['NY','LA'],
    'Mon': rng.integers(-5,30,size=2),
    'Tue': rng.integers(-5,30,size=2),
    'Wed': rng.integers(-5,30,size=2)
})

df.head()

print(df)

tidy = df.melt(id_vars=['city'], value_vars=['Mon', 'Tue', 'Wed'], var_name='day', value_name='temp')
print(tidy)

wide = tidy.pivot(index='city', columns='day', values='temp')
print(wide)

  city  Mon  Tue  Wed
0   NY   12   25    3
1   LA   28   13   -3
  city  day  temp
0   NY  Mon    12
1   LA  Mon    28
2   NY  Tue    25
3   LA  Tue    13
4   NY  Wed     3
5   LA  Wed    -3
day   Mon  Tue  Wed
city               
LA     28   13   -3
NY     12   25    3


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

long = df.melt(id_vars='city', var_name='day', value_name='temp')
wide = long.pivot(index='city', columns='day', values='temp').reset_index()

```
</details>


## Question 24 — Package Steps into a Function

Write `clean(df)` that **drops duplicates** and converts `date_str` to datetime.

In [89]:
# --- Generate synthetic data for Question 24 ---
rng = np.random.default_rng(65)

df = pd.DataFrame({
    'id': [1,1,2,3,3],
    'date_str': pd.date_range('2024-05-01', periods=5).astype(str)
})

df.head()

def clean(df: pd.DataFrame) -> pd.DataFrame:
    ret = df.drop_duplicates(subset='id').copy()
    ret['date_str'] = pd.to_datetime(ret['date_str'])  # convert the de-duplicated copy, not the outer df
    return ret

clean_df = clean(df)
clean_df

Unnamed: 0,id,date_str
0,1,2024-05-01
2,2,2024-05-03
3,3,2024-05-04


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

def clean(d):
    d = d.drop_duplicates(subset=['id']).copy()
    d['date'] = pd.to_datetime(d['date_str'])
    return d

clean_df = clean(df)

```
</details>


## Question 25 — Tidy a Real-World CSV (Medium-Hard)

**Dataset:** Monthly passenger counts for transatlantic air travel, 1958–1960  

**Your tasks:**
1. **Load** the CSV directly from the URL into a DataFrame `df_raw` using `pd.read_csv()`.  
2. Notice it’s in **wide** form, with columns: `Month`, `"1958"`, `"1959"`, `"1960"`.  
3. **Melt** it into **long** form with columns:
   - `month` (e.g. `"JAN"`)
   - `year` (e.g. `1958`)
   - `passengers` (e.g. `340`)
4. **Convert** `year` to integer, and optionally parse `month` into a proper datetime (e.g. first day of each month).




In [99]:
# --- Load and peek at the raw data ---
import pandas as pd

url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"

df_raw = pd.read_csv(url)  # task 1: load directly from the URL

df_melted = df_raw.melt(id_vars='Month', var_name='Year', value_name='Passengers')
df_melted['Year'] = df_melted['Year'].str.extract(r"(\d{4})").astype(int)
df_melted

Unnamed: 0,Month,Year,Passengers
0,JAN,1958,340
1,FEB,1958,318
2,MAR,1958,362
3,APR,1958,348
4,MAY,1958,363
5,JUN,1958,435
6,JUL,1958,491
7,AUG,1958,505
8,SEP,1958,404
9,OCT,1958,359


<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

# 1) See exactly what your columns are
print(df_raw.columns.tolist())
# e.g. ['Month', '"1958"', '"1959"', '"1960"']

# 2) Strip out the extra quotes from the header names
df_raw.columns = (
    df_raw.columns
          .str.replace('"', '', regex=False)  # remove literal quote characters
          .str.strip()                        # trim any whitespace
)

# 3) Melt from wide to long form
df_long = df_raw.melt(
    id_vars=['Month'],
    value_vars=['1958','1959','1960'],
    var_name='year',
    value_name='passengers'
)

# 4) Clean up
df_long['year'] = df_long['year'].astype(int)
df_long.rename(columns={'Month':'month'}, inplace=True)

df_long.head()

```
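The optional part of task 4 (parsing `month` into a proper datetime) is not covered above. A minimal sketch follows, assuming the melted frame has the `month`, `year` and `passengers` columns from the task description. The three-row sample is built by hand from the 1958 values shown in the output, and `month_numbers` and the new `date` column are illustrative names.

```python
import pandas as pd

# Hand-built sample in the melted layout (values from the 1958 output above)
df_long = pd.DataFrame({
    'month': ['JAN', 'FEB', 'MAR'],
    'year': [1958, 1958, 1958],
    'passengers': [340, 318, 362],
})

# Explicit mapping avoids relying on locale-dependent month-name parsing
abbreviations = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
                 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
month_numbers = {abbr: number for number, abbr in enumerate(abbreviations, start=1)}

# pd.to_datetime can assemble dates from year / month / day columns
df_long['date'] = pd.to_datetime(pd.DataFrame({
    'year': df_long['year'],
    'month': df_long['month'].map(month_numbers),
    'day': 1,  # first day of each month
}))

print(df_long)
```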
</details>


## Question 26 — Chipotle Orders

**Dataset:** Chipotle orders  

Your tasks:  
1. Load the TSV from the URL into `df_raw` using `pd.read_csv(..., sep='\t')`.  
2. Clean `item_price` by removing the leading `$` and converting it to `float`.  
3. Replace missing values in `choice_description` with `"No extras"`.  
4. Strip the surrounding `[]`, remove any single‐quotes, and split the `choice_description` string into a Python `list`.  
5. Use `explode()` to expand each extra into its own row and assign the result to `df_tidy`.  



In [103]:
import pandas as pd

url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"

# import data
df_raw = pd.read_csv(url, sep='\t')  # task 1: load the TSV from the URL

# clean item_price
df_raw['item_price'] = df_raw['item_price'].str.replace('$', '', regex=False).astype(float)  # regex=False: treat '$' as a literal character

# replace missing values
df_raw['choice_description'] = df_raw['choice_description'].fillna("No extras")

# remove brackets from descriptions
df_raw['choice_description'] = df_raw['choice_description'].str.replace(r"[\[\]]", '', regex=True)
df_raw['choice_description'] = df_raw['choice_description'].apply(lambda x: x.split(', '))
df_tidy = df_raw.explode('choice_description').reset_index(drop=True)

print(df_raw)
print(df_tidy)

      order_id  quantity                              item_name  \
0            1         1           Chips and Fresh Tomato Salsa   
1            1         1                                   Izze   
2            1         1                       Nantucket Nectar   
3            1         1  Chips and Tomatillo-Green Chili Salsa   
4            2         2                           Chicken Bowl   
...        ...       ...                                    ...   
4617      1833         1                          Steak Burrito   
4618      1833         1                          Steak Burrito   
4619      1834         1                     Chicken Salad Bowl   
4620      1834         1                     Chicken Salad Bowl   
4621      1834         1                     Chicken Salad Bowl   

                                     choice_description  item_price  
0                                           [No extras]        2.39  
1                                          [Clementine]

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

# 1) Clean item_price
df_raw['item_price'] = (
    df_raw['item_price']
      .str.replace('$', '', regex=False)
      .astype(float)
)

# 2) Fill missing choice_description
df_raw['choice_description'] = df_raw['choice_description'] \
                                   .fillna('No extras')

# 3) Remove brackets & split
df_raw['choice_description'] = (
    df_raw['choice_description']
      .str.strip('[]')
      .str.replace("'", '', regex=False)
      .str.split(', ')
)

# 4) Explode into tidy format
df_tidy = df_raw.explode('choice_description').reset_index(drop=True)

df_tidy.head()

```
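The two bracket-removal approaches seen in this question are not equivalent when a description contains nested brackets: `.str.strip('[]')` only trims the ends of the string, while a regex replace removes every bracket. The sketch below illustrates the difference on a hypothetical three-row sample; the strings only imitate the bracketed format and are not taken from the real file.

```python
import pandas as pd

# Hypothetical mini-sample; the strings only imitate the bracketed format
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'choice_description': [
        None,
        '[Clementine]',
        '[Salsa, [Black Beans, Rice]]',
    ],
})

filled = orders['choice_description'].fillna('No extras')

# strip('[]') trims brackets only at the two ends of each string
stripped = filled.str.strip('[]').str.split(', ')

# the character-class regex removes every bracket, including inner ones
cleaned = filled.str.replace(r'[\[\]]', '', regex=True).str.split(', ')

orders_tidy = (
    orders.assign(choice_description=cleaned)
          .explode('choice_description')   # one row per extra
          .reset_index(drop=True)
)

print(stripped.iloc[2])
print(cleaned.iloc[2])
print(orders_tidy)
```

If the real data contains nested lists like the third row here, the regex version gives cleaner items after `explode()`.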
</details>


## Question 27 — Fetch & Tidy Stock Market Data (Hard)

Download daily **Open**, **High**, **Low**, **Close**, and **Volume** data for tickers **AAPL**, **MSFT**, and **GOOGL** between **2025-01-01** and **2025-06-30**. The DataFrame `df_raw` returned by `yfinance` has a **MultiIndex** on its columns (first level: metric, second level: ticker). Your job is to reshape it into a tidy “long” table with one row per date–ticker combination and a separate column for each metric. Depending on your approach, the column order may not match the schematic below exactly.

**Before (wide form with MultiIndex columns)**

| Date       | Open (AAPL) | Open (MSFT) | Open (GOOGL) | High (AAPL) | … | Volume (GOOGL) |
|------------|-------------|-------------|--------------|-------------|---|----------------|
| 2025-01-02 | 248.33      | 423.90      | 190.20       | 248.50      | … | 20,370,800     |
| 2025-01-03 | 242.77      | 430.12      | 190.92       | 243.59      | … | 17,452,000     |
| …          | …           | …           | …            | …           | … | …              |

**After (long/tidy form)**

| date       | ticker | Open   | High   | Low    | Close  | Adj Close | Volume   |
|------------|--------|--------|--------|--------|--------|-----------|----------|
| 2025-01-02 | AAPL   | 248.33 | 248.50 | 241.24 | 243.26 | 243.26    | 55,740,700 |
| 2025-01-02 | MSFT   | 423.90 | 424.44 | 413.26 | 416.98 | 416.98    | 16,896,500 |
| 2025-01-02 | GOOGL  | 190.20 | 191.55 | 187.06 | 188.98 | 188.98    | 20,370,800 |
| 2025-01-03 | AAPL   | 242.77 | 243.59 | 241.31 | 242.77 | 242.77    | 40,244,100 |
| …          | …      | …      | …      | …      | …      | …         | …         |



In [None]:
# Hint, you'll need the following:

%pip install yfinance --upgrade --quiet

import yfinance as yf
import pandas as pd

<details>
<summary><strong>Instructor solution (click to reveal)</strong></summary>

```python

tickers = ["AAPL", "MSFT", "GOOGL"]
df_raw = yf.download(tickers, start="2025-01-01", end="2025-06-30")
df_raw.head()

# 1) Stack the ticker level into the row index
df_tidy = (
    df_raw
      .stack(level=1)                         # move the second column level (ticker) into the rows
      .rename_axis(index=['date', 'ticker'])  # name both index levels, whatever yfinance called them
      .reset_index()                          # turn the date & ticker index levels into columns
)

# 2) Drop any leftover name on the column axis (e.g. 'Price' in recent yfinance releases)
df_tidy.columns.name = None

# Now `df_tidy` has the columns 'date', 'ticker', 'Open', 'High', 'Low', 'Close', 'Volume'
# (plus 'Adj Close' if the download was made with auto_adjust=False)
df_tidy.head()


```
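To experiment with the reshape without a network connection, the same pattern can be tried on a small hand-made frame. This is only a sketch: the numbers are made up, only two metrics and two tickers are included, and the names `wide` and `tidy` are illustrative.

```python
import pandas as pd

# Made-up numbers; the column layout mirrors (metric, ticker)
dates = pd.to_datetime(['2025-01-02', '2025-01-03'])
columns = pd.MultiIndex.from_product([['Open', 'Close'], ['AAPL', 'MSFT']])
wide = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]],
    index=dates,
    columns=columns,
)

tidy = (
    wide
      .stack(level=1)                         # ticker level moves from columns to rows
      .rename_axis(index=['date', 'ticker'])  # name the two row-index levels
      .reset_index()                          # index levels become ordinary columns
)

print(tidy)
```

Each date–ticker pair ends up on its own row, with one column per metric. Depending on the pandas version, `stack` may emit a `FutureWarning` and may order the metric columns differently.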
</details>
