# Lab 2 - Regression and Impact Evaluation
- **Author:** Emily Aiken ([emilyaiken@berkeley.edu](mailto:emilyaiken@berkeley.edu)) (based on past labs by Qutub Khan Vajihi and Dimitris Papadimitriou)
- **Date:** February 2, 2022
- **Course:** INFO 251: Applied machine learning

### Topics:
1. Univariate regression
2. Multivariate regression
    - Dummy variables
    - Interaction terms
3. Differences-in-differences

### References: 
 * [Statsmodels](http://www.statsmodels.org/stable/example_formulas.html#loading-modules-and-functions) 
 * [Interpreting regression coefficients](https://dss.princeton.edu/online_help/analysis/interpreting_regression.htm)
 * [Card and Krueger (1994)](https://davidcard.berkeley.edu/papers/njmin-aer.pdf)

### Import libraries

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  
%matplotlib inline  

# The packages you'll need for regression models
import statsmodels.api as sm
import statsmodels.formula.api as smf

### 1. Load the data

Card and Krueger (1994) collected survey data on employment in fast food restaurants in New Jersey and Pennsylvania in 1992. The data for today's lab uses a subset of the variables they collected.

- *UNIQUE_ID*: Unique ID for the restaurant interviewed
- *PERIOD*: 0 for pre-period (March 1992), 1 for post-period (December 1992)
- *STATE*: 0 for Pennsylvania, 1 for New Jersey
- *REGION*: Region code: 1 = Southern NJ, 2 = Central NJ, 3 = Northern NJ, 4 = Northeast Philly suburbs, 5 = Easton area, 6 = NJ Shore
- *CHAIN*: Chain restaurant code: 1 = Burger King, 2 = KFC, 3 = Roy Rogers, 4 = Wendy's
- *EMP*: Number of employees (full-time or part-time)
- *CO_OWNED*: 1 if company-owned
- *BONUS*: 1 if employees get a signing bonus
- *HRSOPEN*: Hours open per day, up to 24
- *NREGS*: Number of registers

In [None]:
df = pd.read_csv('fastfood.csv')
df.head()

In [None]:
df.tail()

In [None]:
pre = df[df['PERIOD'] == 0].copy()

### 2. Univariate Regression

Linear regression gives us a concise summary of one variable as a function of one or more other variables through two types of parameters: the slope(s) and the intercept. To review linear regression, we'll start by exploring the relationship between employment and the number of hours a restaurant is open in the pre-period.

#### 2.1 Exploratory analysis

In [None]:
# Find the correlation between daily hours open and number of employees.
np.corrcoef(pre['HRSOPEN'], pre['EMP'])[0][1]

**Question:** There's a strong positive correlation between the two. Does that mean that opening hours are what's driving the number of employees?

In [None]:
fig = plt.figure(figsize=(8, 5))
plt.scatter(pre['HRSOPEN'], pre['EMP'], alpha=.2)
plt.xlabel('Daily Hours Open', fontsize='large')
plt.ylabel('Number of Employees', fontsize='large')
plt.title('Hours Open vs. Employees', fontsize='x-large')
plt.xlim(6.9, 24.1)
plt.show()

#### 2.2 Estimating a regression with np.polyfit

In [None]:
# Estimate the regression
x, y = pre['HRSOPEN'].values, pre['EMP'].values # x is the input variable, y is the output variable
slope, intercept = np.polyfit(x, y, 1) # 1 is the degree

# Scatterplot with the regression line
fig = plt.figure(figsize=(8, 5))
plt.scatter(pre['HRSOPEN'], pre['EMP'], alpha=.2)
plt.plot(x, slope*x + intercept, color='darkgrey')
plt.xlabel('Daily Hours Open', fontsize='large')
plt.ylabel('Number of Employees', fontsize='large')
plt.title('Hours Open vs. Employees', fontsize='x-large')
plt.xlim(6.9, 24.1)
plt.show()

#### 2.3 Interpretation of the slope and intercept

In [None]:
print('The slope of the line is %.2f' % slope)

**Question**: How would you interpret this value?

In [None]:
print('The intercept of the line is %.2f' % intercept)

**Question**: How would you interpret this value?
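As a way to check your interpretation, the sketch below recomputes the slope and intercept by hand on synthetic data (not the fast food data; the variable names and the "true" values 2.0 and 0.5 are made up for illustration). With a single input variable, the OLS slope is the covariance of x and y divided by the variance of x, and the intercept is whatever makes the line pass through the point of means.

```python
import numpy as np

# Synthetic data for illustration only: y = 2 + 0.5 * x + noise
rng = np.random.default_rng(0)
x = rng.uniform(7, 24, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

# Closed-form OLS estimates for a single input variable
slope_manual = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept_manual = y.mean() - slope_manual * x.mean()

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope_fit, intercept_fit = np.polyfit(x, y, 1)

print('manual:  slope=%.3f, intercept=%.3f' % (slope_manual, intercept_manual))
print('polyfit: slope=%.3f, intercept=%.3f' % (slope_fit, intercept_fit))
```

The slope is the expected change in y for a one-unit increase in x; the intercept is the predicted y when x equals zero, which may lie outside the range of the observed data.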

#### 2.4 Estimating a regression with statsmodels

In [None]:
# Syntax option 1
x, y = pre['HRSOPEN'].values, pre['EMP'].values # x is the input variable, y is the output variable
x = sm.add_constant(x) # Add a constant for the intercept term
model1 = sm.OLS(y, x).fit() # Note the order of y followed by x!
print(model1.summary())

In [None]:
# Syntax option 2
model2 = smf.ols(formula='EMP ~ HRSOPEN', data=pre).fit() # Automatically includes the intercept term
print(model2.summary())
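Beyond printing the full summary table, the fitted results object exposes individual quantities as attributes. The sketch below shows a few of them on synthetic data; the DataFrame `toy` and its columns `x` and `y` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only: y = 2 + 0.5 * x + noise
rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': rng.uniform(7, 24, size=300)})
toy['y'] = 2.0 + 0.5 * toy['x'] + rng.normal(0, 1, size=300)

results = smf.ols(formula='y ~ x', data=toy).fit()

print(results.params)      # Coefficients, indexed by 'Intercept' and 'x'
print(results.bse)         # Standard errors of the coefficients
print(results.pvalues)     # p-values for the null that each coefficient is zero
print(results.conf_int())  # 95% confidence intervals (lower and upper bound)
print(results.rsquared)    # Share of the variance in y explained by the model
```

Pulling values out this way is handy when you want to reuse a coefficient in a plot or compare several models in one table.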

### 2. Categorical Data

Now, we'll experiment with categorical data by examining the relationship between EMP (the number of employees) and CHAIN (the fast food chain category) in the pre-period.

In [None]:
# Check unique values of CHAIN
pre['CHAIN'].unique()

In [None]:
# Get dummy variables for CHAIN
dummy_pre = pd.get_dummies(pre, columns=['CHAIN']) # Pandas' default is not to drop a column
dummy_pre.head()

In [None]:
# Get dummy variables for CHAIN and drop one column
dummy_pre = pd.get_dummies(pre, columns=['CHAIN'], drop_first=True) # Drop a column
dummy_pre.head()

In [None]:
# Regression with a dummy variable: Syntax option 1
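# dtype=float keeps the dummies numeric; recent pandas versions return booleans by default,
# which sm.OLS may reject once they are mixed with the float constant column
x = pd.get_dummies(pre[['CHAIN']], columns=['CHAIN'], drop_first=True, dtype=float)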
x = sm.add_constant(x)
y = pre['EMP']
print(sm.OLS(y, x).fit().summary())

**QUESTION**: How should we interpret regression coefficients when one dummy variable is dropped?

In [None]:
# Regression with a dummy variable: Syntax option 2

# Statsmodels formula API automatically drops one of the dummies
print(smf.ols(formula='EMP ~ C(CHAIN)', data=pre).fit().summary()) 

In [None]:
# Alternative: drop the constant
print(smf.ols(formula='EMP ~ C(CHAIN) - 1', data=pre).fit().summary()) 
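To see what dropping the constant changes, the sketch below fits both specifications on synthetic data (the column names `group` and `y` are hypothetical). Without a constant, the formula API keeps a dummy for every category, so each coefficient is simply that category's mean. The two specifications describe the same model and produce identical fitted values; only the parameterization differs.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only
rng = np.random.default_rng(3)
toy = pd.DataFrame({'group': rng.integers(1, 5, size=400)})  # Codes 1, 2, 3, 4
toy['y'] = 10.0 + 1.5 * toy['group'] + rng.normal(0, 1, size=400)

with_const = smf.ols('y ~ C(group)', data=toy).fit()     # Reference category + constant
no_const = smf.ols('y ~ C(group) - 1', data=toy).fit()   # One coefficient per category

group_means = toy.groupby('group')['y'].mean()
print(no_const.params)   # One coefficient per group
print(group_means)       # The same numbers, computed directly

# Both parameterizations give the same predictions
same_fit = np.allclose(with_const.fittedvalues, no_const.fittedvalues)
print(same_fit)
```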

### 3. Multivariate Regression

Now let's look at some other covariates: the number of registers, whether or not employees get a bonus, and the region.

In [None]:
# Syntax 1
X = sm.add_constant(pre[['HRSOPEN', 'NREGS', 'BONUS']]) # X is capitalized since it's now a matrix
y = pre['EMP']
model = sm.OLS(y, X).fit()
print(model.summary())

In [None]:
# Syntax 2
model = smf.ols(formula='EMP ~ HRSOPEN + NREGS + C(REGION)', data=pre).fit()
print(model.summary())

**QUESTION**: Interpret each of the regression coefficients.

#### 3.1 Interaction Terms

Interaction terms are used to (1) expand the set of hypotheses and/or controls in a regression, and (2) model relationships in more complex econometric models (e.g. differences-in-differences, instrumental variables).

In [None]:
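# In formula syntax, HRSOPEN * BONUS expands to HRSOPEN + BONUS + HRSOPEN:BONUS,
# so the main effects do not need to be listed separately
model1 = smf.ols(formula='EMP ~ HRSOPEN * BONUS', data=pre).fit()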
print(model1.summary())

**QUESTION**: Interpret each coefficient.
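The sketch below may help with the interpretation. It uses synthetic data with a continuous variable `x` and a 0/1 variable `d` (both hypothetical names, with made-up coefficients). With an interaction between a continuous variable and a dummy, the coefficient on `x` is the slope when `d` is 0, and the slope when `d` is 1 is the sum of the `x` and `x:d` coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only:
# the slope of x is 0.5 when d == 0 and 0.5 + 0.3 = 0.8 when d == 1
rng = np.random.default_rng(4)
n = 500
toy = pd.DataFrame({
    'x': rng.uniform(7, 24, size=n),
    'd': rng.integers(0, 2, size=n),
})
toy['y'] = (1.0 + 0.5 * toy['x'] + 2.0 * toy['d']
            + 0.3 * toy['x'] * toy['d'] + rng.normal(0, 1, size=n))

# 'x * d' is shorthand for 'x + d + x:d'; both formulas fit the same model
short = smf.ols('y ~ x * d', data=toy).fit()
long = smf.ols('y ~ x + d + x:d', data=toy).fit()
print(short.params)

slope_d0 = short.params['x']                       # Slope of x when d == 0
slope_d1 = short.params['x'] + short.params['x:d'] # Slope of x when d == 1

# Cross-check: a separate regression on the d == 1 rows gives the same slope
sub = toy[toy['d'] == 1]
slope_d1_separate = np.polyfit(sub['x'], sub['y'], 1)[0]
print(slope_d0, slope_d1, slope_d1_separate)
```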

### 4. Differences-in-differences

In [None]:
# Visual assessment of the dif-in-dif
control_pre = df[(df['PERIOD'] == 0) & (df['STATE'] == 0)]['EMP'].mean()
treatment_pre = df[(df['PERIOD'] == 0) & (df['STATE'] == 1)]['EMP'].mean()
control_post = df[(df['PERIOD'] == 1) & (df['STATE'] == 0)]['EMP'].mean()
treatment_post = df[(df['PERIOD'] == 1) & (df['STATE'] == 1)]['EMP'].mean()

fig = plt.figure(figsize=(8, 5))
plt.scatter([0, 1], [control_pre, control_post], s=200)
plt.plot([0, 1], [control_pre, control_post], label='Control (Pennsylvania)')
plt.scatter([0, 1], [treatment_pre, treatment_post], s=200)
plt.plot([0, 1], [treatment_pre, treatment_post], label='Treatment (New Jersey)')
plt.legend(loc='best', fontsize='large')
plt.xlabel('Time (Pre vs. Post)', fontsize='large')
plt.ylabel('Average Employees', fontsize='large')
plt.title('Employees Over Time', fontsize='x-large')
plt.show()


In [None]:
# TODO: Use a differences-in-differences specification to estimate the impact of the increase in the
# minimum wage in New Jersey (STATE == 1) between the pre-period (PERIOD == 0) and the post-period (PERIOD == 1).

# Remember the dif-in-dif formula: Y = B0 + B1*Time + B2*Intervention + B3*(Time*Intervention)

**QUESTION**: Interpret the regression coefficients in terms of the impact of the minimum wage change on employment.

**QUESTION**: Why might you want to add control variables to this regression?
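For reference, here is a minimal sketch of the dif-in-dif mechanics on synthetic data. The columns `post` and `treated` and the assumed effect size of 2.0 are hypothetical and unrelated to the fast food data. With no control variables, the coefficient on the interaction term (B3) equals the "difference of differences" of the four group means, which the code verifies by hand.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only
rng = np.random.default_rng(5)
n = 800
toy = pd.DataFrame({
    'post': rng.integers(0, 2, size=n),     # Time: 0 = pre, 1 = post
    'treated': rng.integers(0, 2, size=n),  # Intervention: 0 = control, 1 = treatment
})
true_effect = 2.0  # Assumed treatment effect used to generate the data
toy['y'] = (20.0 - 1.0 * toy['post'] + 3.0 * toy['treated']
            + true_effect * toy['post'] * toy['treated']
            + rng.normal(0, 2, size=n))

# Y = B0 + B1*Time + B2*Intervention + B3*(Time*Intervention)
did = smf.ols('y ~ post + treated + post:treated', data=toy).fit()

# The same estimate from the four group means
means = toy.groupby(['treated', 'post'])['y'].mean()
did_by_hand = ((means.loc[(1, 1)] - means.loc[(1, 0)])
               - (means.loc[(0, 1)] - means.loc[(0, 0)]))

print(did.params['post:treated'], did_by_hand)
```

Once control variables are added to the formula, the interaction coefficient no longer matches the raw difference of means exactly, but it keeps the same interpretation conditional on the controls.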