In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw5.ipynb")

# HW5: Natural Experiments and Demand Estimation

## Scenario

You are a new seller at the Fulton Fish Market in New York City, specializing in **Whiting** (a white fish). You supply fresh fish to restaurants across the city. Your goal is to maximize profits by setting the optimal price.

To set the right price, you need to know the **price elasticity of demand**. If demand is elastic, lowering prices raises revenue because volume grows proportionally more than the price falls. If it is inelastic, raising prices raises revenue because volume falls proportionally less than the price rises.

You have collected historical transaction data (`hw5_data.csv`) containing prices, quantities sold, weather conditions at sea, and buyer characteristics. You will use this data to estimate the demand curve for Whiting.

In [None]:
import polars as pl
import numpy as np
from plotnine import ggplot, aes, geom_point, geom_line, labs, theme_minimal, theme, scale_color_brewer
import statsmodels.formula.api as smf

## Part 1: The Naive Approach

Let's start by loading the data and exploring the relationship between price and quantity.

In [None]:
# Load the data
df = pl.read_csv("hw5_data.csv")

# Create log-transformed variables
df = df.with_columns([
    pl.col("price").log().alias("log_price"),
    pl.col("quantity_sold").log().alias("log_quantity")
])

df.head(10)

In [None]:
# Visualize the relationship between log price and log quantity
# Color by buyer race to see if there are different patterns
(
    ggplot(df, aes(x='log_price', y='log_quantity', color='buyer_race'))
    + geom_point(alpha=0.6, size=2)
    + scale_color_brewer(type='qual', palette='Set1')
    + labs(
        title="Price vs Quantity by Buyer Type (Log-Log Scale)",
        x="Log Price",
        y="Log Quantity",
        color="Buyer Race"
    )
    + theme_minimal()
    + theme(figure_size=(10, 6))
)

**Question 1 (10 pts):** Run an Ordinary Least Squares (OLS) regression to estimate demand.

- **Dependent Variable:** `log_quantity`
- **Independent Variable:** `log_price`
- **Controls:** Day of the week dummy variables (`mon`, `tues`, `wed`, `thurs`)

Save the coefficient on `log_price` as `elasticity_ols`.

*Hints:*
- *Use `statsmodels.formula.api` (imported as `smf`) with `smf.ols()`. Convert the polars dataframe to pandas using `.to_pandas()`.*
- *Extract coefficients using `model.params['coefficient_name']`*

In [None]:
# Convert to pandas for statsmodels
df_pd = df.to_pandas()

# Fit the OLS model using formula API
ols_model = ...

# Extract the elasticity (coefficient on log_price)
elasticity_ols = ...
print(f"Naive OLS Elasticity: {elasticity_ols:.4f}\n")

ols_model.summary()

In [None]:
grader.check("q1")

### Understanding Elasticity and Marginal Thinking

In a log-log demand model ($\log(Q) = \alpha + \epsilon \cdot \log(P)$), the coefficient $\epsilon$ represents the **price elasticity of demand**: the percentage change in quantity for a 1% change in price.
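
To see why the slope in a log-log regression can be read as an elasticity, here is a minimal sketch using made-up numbers rather than the homework data. It builds a constant-elasticity curve $Q = A \cdot P^{\epsilon}$ with assumed values for `A` and `true_eps`, then fits a line to the logged values with a small hand-written helper (`ols_slope` is a hypothetical name, not a library function).

```python
import math

# Hypothetical constant-elasticity demand curve: Q = A * P**eps
A = 500.0         # assumed scale parameter
true_eps = -1.2   # assumed elasticity

prices = [0.5, 0.75, 1.0, 1.5, 2.0]
quantities = [A * p ** true_eps for p in prices]

log_p = [math.log(p) for p in prices]
log_q = [math.log(q) for q in quantities]


def ols_slope(x, y):
    """Slope of a one-regressor least-squares fit of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


slope = ols_slope(log_p, log_q)
intercept = sum(log_q) / len(log_q) - slope * sum(log_p) / len(log_p)

print(f"Slope on log price: {slope:.4f} (true elasticity: {true_eps})")
print(f"Intercept: {intercept:.4f} (log A: {math.log(A):.4f})")
```

Because this toy curve has no noise, the fitted slope reproduces `true_eps` and the intercept reproduces $\log(A)$, up to floating-point rounding. With real data the slope is only an estimate.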

**Key insight:** The elasticity tells you whether raising prices will increase or decrease your revenue.

- If $|\epsilon| < 1$ (**inelastic demand**): A 1% price increase causes quantity to fall by *less than* 1%. Revenue *increases* when you raise prices.

- If $|\epsilon| > 1$ (**elastic demand**): A 1% price increase causes quantity to fall by *more than* 1%. Revenue *decreases* when you raise prices.

- If $|\epsilon| = 1$ (**unit elastic**): Revenue stays the same regardless of price changes.

**The marginal decision:** Should you raise or lower your price from current levels?
- Inelastic demand ($|\epsilon| < 1$) → **Raise prices** to increase revenue
- Elastic demand ($|\epsilon| > 1$) → **Lower prices** to increase revenue
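
The three cases above can be checked with a line of arithmetic. The sketch below uses hypothetical elasticities (not estimates from the data) and the linear approximation $\%\Delta Q \approx \epsilon \times \%\Delta P$; `revenue_ratio` is an illustrative helper name.

```python
def revenue_ratio(elasticity, pct_price_change):
    """New revenue divided by old revenue, using the linear
    approximation %dQ ≈ elasticity * %dP."""
    price_factor = 1 + pct_price_change
    quantity_factor = 1 + elasticity * pct_price_change
    return price_factor * quantity_factor


# Hypothetical elasticities, one for each case in the list above
cases = [("inelastic", -0.5), ("unit elastic", -1.0), ("elastic", -1.5)]

for label, eps in cases:
    ratio = revenue_ratio(eps, 0.01)  # a 1% price increase
    print(f"{label:>12} (eps={eps:+.1f}): revenue changes by {(ratio - 1) * 100:+.3f}%")
```

In the unit-elastic case the linear approximation leaves a tiny second-order term (about -0.01%). On an exact constant-elasticity curve with $\epsilon = -1$, revenue would not change at all.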

**Question 2 (5 pts):** Based on your OLS result and the framework above, how would you interpret the price elasticity of demand?

**Options:**
- **A)** The demand is elastic ($|\epsilon| > 1$), suggesting we should lower prices to increase revenue.
- **B)** The demand is perfectly inelastic ($\epsilon = 0$), meaning price has no impact on sales.
- **C)** The demand appears inelastic ($|\epsilon| < 1$), suggesting we could raise prices to increase revenue.
- **D)** The relationship is positive, suggesting a Giffen good.

Store your answer as a single letter string (e.g., `"A"`).

In [None]:
q2_answer = ...  # "A", "B", "C", or "D"
q2_answer

In [None]:
grader.check("q2")

## Part 2: The Natural Experiment

### The Problem with OLS

You suspect the OLS result is biased. In market data, supply and demand shift simultaneously. High prices might be *caused* by high demand, which obscures the true negative relationship between price and quantity demanded.

To fix this, you need an **Instrumental Variable (Z)**: a "natural experiment" that shifts one curve (supply or demand) but *not* the other. To trace out the demand curve, the instrument must shift supply while leaving demand unchanged.

### A Potential Instrument

Your dataset includes **wind speed** measured at sea. Let's explore whether this could serve as a valid instrument.

In [None]:
# Plot: Wind Speed vs Average Price
wind_price = df.group_by("wind_speed").agg(
    pl.col("price").mean().alias("avg_price")
).sort("wind_speed")

(
    ggplot(wind_price, aes(x='wind_speed', y='avg_price'))
    + geom_point(size=3)
    + geom_line()
    + labs(
        title="Wind Speed vs Average Fish Price",
        x="Wind Speed (knots)",
        y="Average Price ($/lb)"
    )
    + theme_minimal()
    + theme(figure_size=(10, 5))
)

<!-- BEGIN QUESTION -->

**Question 3 (10 pts, manually graded):** For an instrumental variable to be valid, it must satisfy two conditions:

1. **Relevance:** The instrument must be correlated with the endogenous variable (price).
2. **Exclusion:** The instrument must affect the outcome (quantity) *only* through its effect on price, not directly.

Based on the plot above and your economic intuition, explain why `wind_speed` could be a valid instrument for estimating the demand curve. Address both conditions.

*Write your answer in the markdown cell below.*

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Part 3: Two-Stage Least Squares (2SLS)

You will now implement the IV estimator using Two-Stage Least Squares (2SLS).

### The Two Stages

1. **First Stage:** Regress the endogenous variable (`log_price`) on the instrument (`wind_speed`) and controls. Save the fitted values.
2. **Second Stage:** Regress the outcome (`log_quantity`) on the fitted values from stage 1 and controls. The coefficient on the fitted price is the causal elasticity.
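
The two stages above can be sketched on simulated data where the true elasticity is known. Everything in this example is an assumption made for illustration: the supply and demand equations, the parameter values, and the helper `fit_line`. The simulated `wind` variable is a standardized stand-in for a supply shifter, and the day-of-week controls are left out for brevity.

```python
import random

random.seed(0)

TRUE_ELASTICITY = -1.0  # assumed demand slope in logs
SUPPLY_SLOPE = 1.0      # assumed supply slope in logs
N = 5000


def fit_line(x, y):
    """Return (intercept, slope) from a one-regressor least-squares fit."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


wind, log_price, log_quantity = [], [], []
for _ in range(N):
    z = random.gauss(0, 1)    # supply shifter (stand-in for wind speed)
    u = random.gauss(0, 1)    # unobserved demand shock
    v = random.gauss(0, 0.5)  # unobserved supply shock
    # Demand: q = 8 + TRUE_ELASTICITY * p + u
    # Supply: q = 8 + SUPPLY_SLOPE * p - z + v
    # Equating the two and solving gives the equilibrium log price:
    p = (u + z - v) / (SUPPLY_SLOPE - TRUE_ELASTICITY)
    q = 8 + TRUE_ELASTICITY * p + u
    wind.append(z)
    log_price.append(p)
    log_quantity.append(q)

# Naive OLS: regress log quantity directly on log price
_, ols_slope = fit_line(log_price, log_quantity)

# First stage: regress log price on the instrument, keep fitted values
fs_intercept, fs_slope = fit_line(wind, log_price)
predicted_log_price = [fs_intercept + fs_slope * z for z in wind]

# Second stage: regress log quantity on the fitted log price
_, iv_slope = fit_line(predicted_log_price, log_quantity)

print(f"True elasticity:   {TRUE_ELASTICITY:.3f}")
print(f"Naive OLS slope:   {ols_slope:.3f}")
print(f"First-stage slope: {fs_slope:.3f}")
print(f"2SLS slope:        {iv_slope:.3f}")
```

In this simulation the demand shock `u` moves price and quantity in the same direction, so the naive slope is pulled away from the true value, while the two-stage slope lands close to it. One caveat when running the stages by hand: the point estimate is the 2SLS estimate, but the standard errors printed by the second-stage regression are not valid, because they treat the fitted prices as if they were observed data.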

**Question 4 (10 pts):** Run the First Stage regression.

- **Dependent Variable:** `log_price`
- **Independent Variable:** `wind_speed`
- **Controls:** `mon`, `tues`, `wed`, `thurs`

Store the fitted values in a new column called `predicted_log_price` in the pandas dataframe.

In [None]:
# First Stage: Regress log_price on wind_speed and controls
first_stage = ...

# Store fitted values
df_pd['predicted_log_price'] = ...

first_stage.summary()

In [None]:
grader.check("q4")

**Question 5 (15 pts):** Run the Second Stage regression to estimate the causal demand elasticity.

- **Dependent Variable:** `log_quantity`
- **Independent Variable:** `predicted_log_price` (from the first stage)
- **Controls:** `mon`, `tues`, `wed`, `thurs`

Save the coefficient on `predicted_log_price` as `elasticity_iv`.

*Hint: Extract coefficients using `model.params['coefficient_name']`*

In [None]:
# Second Stage: Regress log_quantity on predicted_log_price and controls
second_stage = ...

# Extract the IV elasticity
elasticity_iv = ...

print("Comparison of Elasticity Estimates")
print("=" * 40)
print(f"Naive OLS Elasticity: {elasticity_ols:.4f}")
print(f"IV Elasticity:        {elasticity_iv:.4f}")
print()

second_stage.summary()

In [None]:
grader.check("q5")

**Question 6 (5 pts):** Compare the OLS elasticity with the IV estimate. What does the difference tell you about the bias in the naive model?

**Options:**
- **A)** The IV estimate is *more negative* (more elastic) than the OLS estimate. The OLS model was biased toward zero (underestimating price sensitivity) because high demand days pushed up prices.
- **B)** The IV estimate is *less negative* (more inelastic) than OLS. The OLS model overestimated price sensitivity.
- **C)** The estimates are nearly identical, meaning endogeneity was not a problem.
- **D)** The IV estimate is positive, which contradicts economic theory.

Store your answer as a single letter string.

In [None]:
q6_answer = ...  # "A", "B", "C", or "D"
q6_answer

In [None]:
grader.check("q6")

## Part 4: Pricing Implications

Now let's apply the marginal thinking framework from Part 1 to make a concrete pricing decision.

Recall:
- If demand is **inelastic** ($|\epsilon| < 1$): raising prices increases revenue
- If demand is **elastic** ($|\epsilon| > 1$): raising prices decreases revenue

In [None]:
# Current market conditions
avg_price = df['price'].mean()
current_quantity = 10000  # Assume we currently sell 10,000 lbs

print(f"Current average price: ${avg_price:.2f} per lb")
print(f"Current quantity sold: {current_quantity:,} lbs")
print(f"Current revenue: ${avg_price * current_quantity:,.2f}")

**Question 7 (15 pts):** Calculate the revenue impact of a 10% price increase under both elasticity estimates.

Assume:
- Current quantity sold is **10,000 lbs**
- Marginal costs are **zero** (for simplicity, we focus on revenue)

If price increases by 10%, the percentage change in quantity is approximately:
$$\%\Delta Q \approx \epsilon \times \%\Delta P = \epsilon \times 10\%$$

Calculate:
1. `revenue_status_quo`: Current revenue (price × quantity)
2. `revenue_ols`: Revenue after 10% price increase, using OLS elasticity to predict new quantity
3. `revenue_iv`: Revenue after 10% price increase, using IV elasticity to predict new quantity

*Hint: New quantity = current_quantity × (1 + ε × 0.10), New price = avg_price × 1.10*
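
The formula in the hint is a linear approximation, and Question 7 should be answered with it. As a side note, the sketch below compares it with the exact response on a constant-elasticity curve, $Q_{new} = Q \times (1 + \%\Delta P)^{\epsilon}$, using a hypothetical elasticity of -0.8 (not an estimate from the data). The function names are illustrative.

```python
def quantity_linear(q0, elasticity, pct_price_change):
    """Linear approximation: %dQ ≈ elasticity * %dP."""
    return q0 * (1 + elasticity * pct_price_change)


def quantity_exact(q0, elasticity, pct_price_change):
    """Exact response on a constant-elasticity curve Q = A * P**elasticity."""
    return q0 * (1 + pct_price_change) ** elasticity


q0 = 10_000  # current quantity from the assumptions above
eps = -0.8   # hypothetical elasticity, for illustration only

for dp in (0.01, 0.10, 0.50):
    lin = quantity_linear(q0, eps, dp)
    exact = quantity_exact(q0, eps, dp)
    print(f"{dp:>4.0%} price increase: linear {lin:,.0f} lbs, "
          f"exact {exact:,.0f} lbs, gap {exact - lin:,.0f} lbs")
```

The two formulas agree closely for small price changes and drift apart as the change grows, which is why the approximation is usually reserved for modest changes such as 10%.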

In [None]:
# Status quo revenue
revenue_status_quo = ...

# Price increase
price_increase = 0.10  # 10%
new_price = avg_price * (1 + price_increase)

# New quantity under OLS elasticity
quantity_change_ols = ...
new_quantity_ols = ...
revenue_ols = ...

# New quantity under IV elasticity
quantity_change_iv = ...
new_quantity_iv = ...
revenue_iv = ...

print("Revenue Comparison: 10% Price Increase")
print("=" * 50)
print("\nStatus Quo:")
print(f"  Price: ${avg_price:.2f}, Quantity: {current_quantity:,} lbs")
print(f"  Revenue: ${revenue_status_quo:,.2f}")

print(f"\nUsing OLS Elasticity ({elasticity_ols:.2f}):")
print(f"  Price: ${new_price:.2f}, Quantity: {new_quantity_ols:,.0f} lbs")
print(f"  Revenue: ${revenue_ols:,.2f}")
print(f"  Change: ${revenue_ols - revenue_status_quo:+,.2f} ({(revenue_ols/revenue_status_quo - 1)*100:+.1f}%)")

print(f"\nUsing IV Elasticity ({elasticity_iv:.2f}):")
print(f"  Price: ${new_price:.2f}, Quantity: {new_quantity_iv:,.0f} lbs")
print(f"  Revenue: ${revenue_iv:,.2f}")
print(f"  Change: ${revenue_iv - revenue_status_quo:+,.2f} ({(revenue_iv/revenue_status_quo - 1)*100:+.1f}%)")

In [None]:
grader.check("q7")

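**Question 8 (5 pts):** Based on your calculations, which statement is correct about the profitability of a 10% price increase? Because marginal costs are assumed to be zero, a price increase is profitable exactly when it raises revenue above the status quo.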

**Options:**
- **A)** Profitable under OLS estimate, profitable under IV estimate
- **B)** Profitable under OLS estimate, NOT profitable under IV estimate
- **C)** NOT profitable under OLS estimate, profitable under IV estimate
- **D)** NOT profitable under OLS estimate, NOT profitable under IV estimate

Store your answer as a single letter string.

In [None]:
q8_answer = ...  # "A", "B", "C", or "D"
q8_answer

In [None]:
grader.check("q8")

## Part 5: Price Discrimination by Buyer Type

Looking back at the scatter plot, notice that the data record buyer race (Asian vs. White buyers). This raises an interesting question: do different buyer types have different price sensitivities?

If so, you might be able to increase revenue by charging different prices to different groups (though this raises important ethical considerations).

**Question 9 (15 pts):** Re-run the OLS demand regression, but now include:
- `buyer_race` as a control variable
- An interaction term between `buyer_race` and `log_price`

In statsmodels formula syntax, use `log_price * buyer_race` to include both main effects and interaction.

Note: Statsmodels will treat `asian` as the base category (alphabetically first), so it estimates:
- `log_price`: elasticity for Asian buyers (base group)
- `buyer_race[T.white]`: level shift for white buyers
- `log_price:buyer_race[T.white]`: difference in elasticity for white buyers

Save the model as `ols_interaction`.

*Hint: Extract coefficients using `model.params['coefficient_name']`*

In [None]:
# OLS with buyer_race interaction
ols_interaction = ...

ols_interaction.summary()

In [None]:
# Compute elasticities by buyer type
elasticity_asian = ols_interaction.params['log_price']
elasticity_diff = ols_interaction.params['log_price:buyer_race[T.white]']
elasticity_white = elasticity_asian + elasticity_diff

print("Elasticity by Buyer Type")
print("=" * 40)
print(f"Asian buyers (base): {elasticity_asian:.4f}")
print(f"White buyers:        {elasticity_white:.4f}")
print(f"Difference:          {elasticity_diff:.4f}")

In [None]:
grader.check("q9")

**Question 10 (10 pts):** Based on the regression results, if a seller wanted to maximize revenue through price discrimination, what would the estimates suggest?

*Consider: A group with more elastic demand (more negative elasticity) is more price-sensitive, so a profit-maximizing seller would charge them a lower price.*

**Options:**
- **A)** Set a lower price for Asian buyers (they have more elastic demand)
- **B)** Set a higher price for Asian buyers (they have less elastic demand)
- **C)** Set the same price for all buyers (no significant difference in elasticity)

Store your answer as a single letter string.

In [None]:
q10_answer = ...  # "A", "B", or "C"
q10_answer

In [None]:
grader.check("q10")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Once you have the zip file, upload the **entire** zip file to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)