# Homework 1.5 | Panel Data (Wide Format) — Solutions

*Homework is designed to both test your knowledge and challenge you to apply familiar concepts in new applications. Answer clearly and completely. You are welcome and encouraged to work in groups so long as your work is your own. Submit your figures and answers to [Gradescope](https://www.gradescope.com).*

In [None]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# File Path
file_path = 'https://tayweid.github.io/econ-0150/parts/part-1-5/data/'

## Q1. Transforming Marriage Rates

The following questions are based on **crude marriage rates** in `marriage_rates.csv` — numbers of marriages per one thousand inhabitants — in 1990 and 2019. Each row represents a different European country.

In [None]:
# Load Data
marriage = pd.read_csv(file_path + 'marriage_rates.csv')
marriage.head()

a) Create a new column of absolute change in marriage rates from 1990 to 2019. Compute the **absolute change** in the marriage rate by subtracting the old value from the new value. Which country in this dataset has the largest absolute change between these years? (*Hint: do not take an absolute value; treat 1 as larger than -2.*)

In [None]:
# Create absolute change column
marriage['absolute_change'] = marriage['2019'] - marriage['1990']

# Find country with largest absolute change
marriage.sort_values('absolute_change', ascending=False).head()

**Answer:** After sorting by `absolute_change` in descending order, the country in the first row has the largest absolute change. Because the sign is kept, that value is either the biggest increase or, if no country increased, the smallest decrease. Countries like Iceland or Hungary may show increases, while most European countries show decreases.

b) Create a new column of relative change in marriage rates from 1990 to 2019. Compute the **relative change** as the ratio resulting from dividing the absolute change by the old value. Which country in this dataset has the largest relative change between these years?

In [None]:
# Create relative change column
marriage['relative_change'] = marriage['absolute_change'] / marriage['1990']

# Find country with largest relative change
marriage.sort_values('relative_change', ascending=False).head()

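**Answer:** After sorting by `relative_change` in descending order, the country in the first row has the largest relative change. Relative change is the change as a proportion of the 1990 value (multiply by 100 to read it as a percentage), so a country that started from a low rate can rank higher here than it does in part (a).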
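To see why parts (a) and (b) can give different answers, here is a minimal sketch on made-up numbers. The `toy` table, its `Country` column, and its values are hypothetical and are not taken from `marriage_rates.csv`.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    '1990': [10.0, 4.0, 8.0],
    '2019': [12.0, 5.5, 6.0],
})

# Absolute change: new value minus old value (sign is kept)
toy['absolute_change'] = toy['2019'] - toy['1990']

# Relative change: absolute change divided by the old value
toy['relative_change'] = toy['absolute_change'] / toy['1990']

# idxmax returns the row label of the largest value in each column
largest_abs = toy.loc[toy['absolute_change'].idxmax(), 'Country']
largest_rel = toy.loc[toy['relative_change'].idxmax(), 'Country']

print(toy)
print('Largest absolute change:', largest_abs)
print('Largest relative change:', largest_rel)
```

In this toy table, country A gains the most in absolute terms (+2.0), but country B gains the most relative to where it started (+1.5 on a base of 4.0).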

c) Create a scatterplot comparing marriage rates in 1990 (x-axis) vs 2019 (y-axis). Add a 45-degree line to show where countries would fall if their marriage rate stayed the same. Which countries are above the line? What does being above the line mean?

In [None]:
# Scatterplot with 45-degree line
sns.scatterplot(data=marriage, x='1990', y='2019')  # data= keyword also works on older seaborn versions
plt.plot([0, 12], [0, 12], color='red', linestyle='--', label='No change')
plt.xlabel('Marriage Rate 1990')
plt.ylabel('Marriage Rate 2019')
plt.title('Marriage Rates: 1990 vs 2019')
plt.legend()

In [None]:
# Find countries above the line (increased marriage rates)
marriage[marriage['2019'] > marriage['1990']]

**Answer:** Countries above the 45-degree line (like Iceland and Hungary) have higher marriage rates in 2019 than in 1990 — their marriage rates increased. Being above the line means the 2019 value is greater than the 1990 value.
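One possible variation, sketched here on hypothetical values: instead of hard-coding the endpoints of the 45-degree line, `Axes.axline` (available in matplotlib 3.3 and later) draws a line through a point with a given slope that spans whatever the axis limits turn out to be. The `toy` table below is invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    '1990': [10.0, 4.0, 8.0],
    '2019': [12.0, 5.5, 6.0],
})

fig, ax = plt.subplots()
ax.scatter(toy['1990'], toy['2019'])

# Line through the origin with slope 1: points on it have 2019 == 1990
diagonal = ax.axline((0, 0), slope=1, color='red', linestyle='--', label='No change')

ax.set_xlabel('Marriage Rate 1990')
ax.set_ylabel('Marriage Rate 2019')
ax.legend()

# Points above the line are exactly the rows where the 2019 value is larger
above = toy[toy['2019'] > toy['1990']]
n_above = len(above)
print(n_above)
```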

## Q2. Comparing Coffee Production Across Time

Using the dataset `coffee_prod_in_years.csv`, which provides information on coffee production in different countries between 1961 and 2023:

In [None]:
# Load Data
coffee = pd.read_csv(file_path + 'coffee_prod_in_years.csv')
coffee.head()

a) Generate a scatter plot comparing coffee production in 1961 and 2023 for each country. Include a 45-degree line to help identify which countries increased or decreased production. Are there outlier countries? Briefly suggest why these countries might stand out.

In [None]:
# Scatterplot with 45-degree line
sns.scatterplot(data=coffee, x='1961', y='2023')  # data= keyword also works on older seaborn versions

# Add 45-degree line
max_val = max(coffee['1961'].max(), coffee['2023'].max())
plt.plot([0, max_val], [0, max_val], color='red', linestyle='--', label='No change')

plt.xlabel('Coffee Production 1961')
plt.ylabel('Coffee Production 2023')
plt.title('Coffee Production: 1961 vs 2023')
plt.legend()

**Answer:** Brazil stands out as a clear outlier with by far the highest production in both years. Vietnam is another outlier — it had virtually no production in 1961 but became the second-largest producer by 2023 (due to government policies promoting coffee after the Doi Moi reforms). Most countries cluster near the origin with modest production levels.

b) How many countries increased their coffee production between 1961 and 2023? How many decreased? (*Hint: you can count points above vs below the 45-degree line, or create a column for the change and filter.*)

In [None]:
# Create change column
coffee['change'] = coffee['2023'] - coffee['1961']

# Count increases and decreases
increased = coffee[coffee['change'] > 0]
decreased = coffee[coffee['change'] < 0]

print(f"Countries that increased: {len(increased)}")
print(f"Countries that decreased: {len(decreased)}")

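**Answer:** Read the counts from the printed output. A country with identical production in both years, or with a missing value in either year, is counted in neither group, because comparing `NaN` with 0 evaluates to `False`. Most coffee-producing countries have increased production since 1961, though some traditional producers may have decreased.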
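If the two counts do not add up to the number of rows, the remainder is made up of unchanged or missing values. The sketch below, on a hypothetical `toy` table (not the real coffee data), shows one way to account for every row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D', 'E'],
    '1961': [100.0, 50.0, 20.0, np.nan, 30.0],
    '2023': [300.0, 10.0, 20.0, 40.0, 90.0],
})
toy['change'] = toy['2023'] - toy['1961']

# Boolean masks sum to counts because True counts as 1
n_increased = int((toy['change'] > 0).sum())
n_decreased = int((toy['change'] < 0).sum())
n_unchanged = int((toy['change'] == 0).sum())
n_missing = int(toy['change'].isna().sum())

print(n_increased, n_decreased, n_unchanged, n_missing)

# Every row falls into exactly one of the four groups
assert n_increased + n_decreased + n_unchanged + n_missing == len(toy)
```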

## Q3. Coffee Consumption Boxplots

Using the dataset `Coffee_Per_Cap.csv`, which contains coffee consumption in kilograms per capita for 34 coffee-importing countries from 1990 to 2019:

In [None]:
# Load Data
percap = pd.read_csv(file_path + 'Coffee_Per_Cap.csv', index_col=0)
percap[['Code', '1999', '2004', '2009', '2014', '2019']].head()

a) Create a multi-boxplot showing the distribution of coffee consumption per capita for the years 1999, 2004, 2009, 2014, and 2019. Use horizontal orientation for easier reading.

In [None]:
# Multi-boxplot
sns.boxplot(data=percap[['1999', '2004', '2009', '2014', '2019']], orient='h', whis=(0, 100))  # whis=(0, 100): whiskers span min to max
plt.xlabel('Coffee Consumption (kg per capita)')
plt.title('Coffee Consumption Per Capita by Year')

b) Based on your boxplots, in which 5-year period did the median coffee consumption increase the most? Explain how you can tell from the boxplot.

**Answer:** The median increased the most between 2009 and 2014. You can tell by looking at the vertical line inside each box (the median) — the largest jump in the median position occurs between the 2009 and 2014 boxplots.
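A visual reading of the medians can be cross-checked numerically. The following is a minimal sketch, assuming a table with one column per year; the `toy` values are invented and say nothing about the real `Coffee_Per_Cap.csv` data.

```python
import pandas as pd

years = ['1999', '2004', '2009', '2014', '2019']

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    '1999': [2.0, 4.0, 6.0],
    '2004': [2.5, 4.2, 6.5],
    '2009': [2.4, 4.3, 7.0],
    '2014': [3.0, 5.0, 7.5],
    '2019': [3.1, 5.2, 9.0],
})

# Median of each year column, in chronological order
medians = toy[years].median()

# diff() gives each median minus the previous one; the first entry is NaN
median_change = medians.diff()

# Label of the year that ends the period with the largest increase
largest_jump_ends = median_change.idxmax()

print(medians)
print(median_change)
print('Largest median increase ends in:', largest_jump_ends)
```

Running the same three lines on `percap[years]` would give the actual medians to compare against the plot.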

c) Which year had the largest range (difference between maximum and minimum) in coffee consumption? What might explain this?

**Answer:** 2019 has the largest range. Because the plot sets `whis=(0, 100)`, the whiskers run from the minimum to the maximum, so the year whose whisker tips are furthest apart has the largest range. This could be explained by diverging consumption patterns: some countries (like the Nordic countries) have very high per-capita consumption while others remain low, and this gap may have widened over time as coffee culture grew unevenly across countries.
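The range can also be computed directly rather than read off the whiskers. This is a sketch on a hypothetical `toy` table with three year columns; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    '1999': [1.0, 4.0, 8.0],
    '2009': [1.5, 4.5, 9.0],
    '2019': [1.2, 5.0, 10.0],
})

# One row for the minimum and one for the maximum of each year column
summary = toy.agg(['min', 'max'])

# Range = maximum minus minimum, computed per year
ranges = summary.loc['max'] - summary.loc['min']
widest_year = ranges.idxmax()

print(ranges)
print('Year with the largest range:', widest_year)
```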