#  Causal Inference

Of course. Here is a README.md file drafted from your notes.

-----

# The Causal Effect of Warm Temperature on Bike-Share Trip Duration

This repository contains the analysis for a study investigating the causal effect of warm weather on the duration of bike-share trips using data from New York City's Citi Bike program.

## Research Question & Hypothesis

This study investigates the causal effect of warm temperatures on bike-share trip duration.

**Hypothesis**: Warmer temperatures causally increase the trip duration of casual customers (leisure riders/tourists) for bike-share services, while having a less significant effect on subscribers (commuters).

-----

## Methodology

### Research Design

This research employs a **fixed-effects difference-in-differences (DiD)** model to isolate the causal impact of temperature on riding behavior.

  * **Treatment Group**: Casual customers, who are more likely to use bike-share services for leisure on warmer days.
  * **Control Group**: Subscribers (commuters), who are assumed to have more rigid commuting patterns regardless of temperature.
  * **Treatment Variable (X)**: `warm_day`, a binary indicator for days where the average temperature is above 65°F.
  * **Outcome Variable (Y)**: `avg_trip_minutes`, the average daily trip duration in minutes for each user type.

### Regression Model

The core of the analysis is the following regression model, where the coefficient `δ` on the interaction term represents the DiD estimator:

```
AvgTripMinutes = β₀ + β₁(warm_day) + β₂(is_customer) + δ(warm_day * is_customer) + C(holiday) + C(day_of_week) + C(neighborhood) + C(wind_speed) + C(precipitation) + ε
```

### Causal Diagram (DAG)

The Directed Acyclic Graph (DAG) below illustrates the assumed causal relationships. Temperature exogenously affects Trip Duration, with User Type modifying this relationship. Day of the Week is a potential confounder.

-----

## Data

### Data Sources and Variables

This study uses daily-level aggregated data from two public sources for the year 2018:

1.  **NYC Citi Bike Trip Data**
2.  **NOAA Global Surface Summary of the Day (GSOD)**

**Key Variables:**

  * **Outcome (`avg_trip_minutes`)**: The average daily trip duration for a given user type.
  * **Treatment (`warm_day`)**: A binary variable indicating if the mean temperature was above 65°F.
  * **Controls**: `precipitation`, `wind_speed`, `holiday`, `day_of_week`.
  * **Fixed Effects**: The model includes fixed effects for `day_of_week` and `neighborhood` to account for unobserved, time-invariant characteristics.

### Pre-Treatment Balance

A balance test revealed pre-treatment differences between the control and treatment groups. This suggests a violation of the parallel trends assumption, making the inclusion of control variables essential for isolating a precise causal estimate.

-----

## Validity

### Internal Validity

The primary threat to internal validity is **omitted variable bias** from factors correlated with both temperature and riding behavior (e.g., precipitation, wind speed, local events).

  * **Mitigation Strategy**: This threat is addressed by including these factors as control variables in the regression and incorporating fixed effects for `neighborhood`, `day_of_week`, and `holiday`. Since temperature is an exogenous variable that individuals cannot influence, self-selection bias is not a primary concern.

### External Validity

The findings of this study may be specific to **New York City**. The results might not generalize to other cities with different climates, topographies, demographics, or bike-share user compositions (e.g., a lower tourist-to-commuter ratio).

In [24]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

## Questions for OH 

1. Do we usually do data cleaning for causal inference? 

## Test Regression Discontinuity

In [None]:
import pandas as pd

# Assume 'df' is your DataFrame with the columns:
# 'avg_trip_minutes' and 'day_mean_temperature'

# 1. Choose a plausible cutoff temperature
cutoff = 30

# 2. Create a new column to identify data above or below the cutoff
df['is_below_cutoff'] = (df['day_mean_temperature'] <= cutoff)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assume 'df' is your DataFrame with the columns:
# 'avg_trip_minutes' and 'day_mean_temperature'

# 1. Change the cutoff temperature to 30°F
cutoff = 30

# 2. Create the new column based on the 30°F cutoff
df['is_above_cutoff'] = (df['day_mean_temperature'] >= cutoff)

# Create the regression discontinuity plot
sns.lmplot(
    data=df,
    x='day_mean_temperature',
    y='avg_trip_minutes',
    hue='is_above_cutoff',
    height=7,
    aspect=1.5
)

# Add a vertical line at the new cutoff point
plt.axvline(x=cutoff, color='red', linestyle='--')

# Update the title and labels for clarity
plt.title('Regression Discontinuity at 30°F')
plt.xlabel('Day Mean Temperature (°F)')
plt.ylabel('Average Trip Minutes')

# Display the plot
plt.show()

## 1.a.

**Answer**

From the balance tests, it confirms that pre-treatment differences exist. "children", "nonwhite", "finc", "earn", "age", "ed", "work" are all statistically significant with the treatment. 

This violates the parallel trends assumption needed for DiD, and since treatment is endogenous to those factors. Not controlling for them will be a thread to a precise causal estimate. 

In [25]:
# Import the data 
df = pd.read_csv('../data/trips.csv')

# Define POST as 1994 and after
df['warm'] = (df['day_mean_temperature'] >= 60).astype(int)

# Define TREATED group as respondent who has 1 or more children
df['treated'] = (df['usertype'] == 'Customer').astype(int)

df.head()

Unnamed: 0,usertype,zip_code_start,borough_start,neighborhood_start,zip_code_end,borough_end,neighborhood_end,start_day,stop_day,day_mean_temperature,...,trip_minutes,trip_count,unique_bikes_used,total_trip_minutes,avg_trip_minutes,median_trip_minutes,min_trip_minutes,max_trip_minutes,warm,treated
0,Subscriber,10065,Manhattan,Upper East Side,10168,Manhattan,Gramercy Park and Murray Hill,2023-03-22,2023-03-22,37.7,...,10,4,4,44.95,11.2375,11.066667,8.383333,13.933333,0,0
1,Subscriber,10024,Manhattan,Upper West Side,10022,Manhattan,Gramercy Park and Murray Hill,2023-01-07,2023-01-07,9.7,...,20,2,2,37.266667,18.633333,15.016667,15.016667,22.25,0,0
2,Subscriber,10023,Manhattan,Upper West Side,10035,Manhattan,East Harlem,2023-01-13,2023-01-13,43.6,...,20,3,3,63.316667,21.105556,21.233333,20.766667,21.316667,0,0
3,Subscriber,10001,Manhattan,Chelsea and Clinton,10199,Manhattan,Chelsea and Clinton,2023-01-31,2023-01-31,24.9,...,10,12,12,67.883333,5.656944,5.266667,4.6,7.633333,0,0
4,Subscriber,11103,Queens,Northwest Queens,11101,Queens,Northwest Queens,2023-02-22,2023-02-22,48.7,...,20,6,6,103.066667,17.177778,17.116667,15.366667,19.283333,0,0


In [26]:
# create holiday dummy variable based on US federal holidays 1 if not 0 
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2018-01-01', end='2018-12-31')
df['start_day'] = pd.to_datetime(df['start_day'])
df['holiday'] = df['start_day'].dt.normalize().isin(holidays).astype(int)

# create day of week variable as numeric and string
df['day_of_week_num'] = df['start_day'].dt.dayofweek
df['day_of_week'] = df['start_day'].dt.day_name()

  # create holiday dummy variable based on US federal holidays 1 if not 0  

In [27]:
df.columns

Index(['usertype', 'zip_code_start', 'borough_start', 'neighborhood_start',
       'zip_code_end', 'borough_end', 'neighborhood_end', 'start_day',
       'stop_day', 'day_mean_temperature', 'day_mean_wind_speed',
       'day_total_precipitation', 'trip_minutes', 'trip_count',
       'unique_bikes_used', 'total_trip_minutes', 'avg_trip_minutes',
       'median_trip_minutes', 'min_trip_minutes', 'max_trip_minutes', 'warm',
       'treated', 'holiday', 'day_of_week_num', 'day_of_week'],
      dtype='object')

In [28]:
df[cols].describe()

Unnamed: 0,holiday,day_of_week_num,zip_code_start,day_mean_wind_speed,day_total_precipitation
count,730088.0,730088.0,730088.0,730088.0,730088.0
mean,0.0,2.935758,10276.190185,4.589558,0.132375
std,0.0,2.008973,468.591262,2.101441,0.285154
min,0.0,0.0,10001.0,1.0,0.0
25%,0.0,1.0,10012.0,2.8,0.0
50%,0.0,3.0,10024.0,4.5,0.0
75%,0.0,5.0,10168.0,5.8,0.11
max,0.0,6.0,11238.0,11.5,1.68


In [29]:
# Run a balance test for each variable in this
# 1992 subsample, e.g., by regressing each variable on the indicator for treated. 
# Discuss what your results imply from an identification perspective

cols = ['holiday', 'day_of_week_num', 'zip_code_start', 'day_mean_wind_speed', 'day_total_precipitation']

# Balance tests
balance_results = {}
for var in cols:
    model = smf.ols(f"{var} ~ treated", data=df).fit()
    balance_results[var] = model.summary().tables[1]  # Coefficients table

# Print balance results
for var, result in balance_results.items():
    print(f"\nBalance test for {var}:")
    print(result)

  return 1 - self.ssr/self.centered_tss
  return self.mse_model/self.mse_resid
  llf = -nobs2*np.log(2*np.pi) - nobs2*np.log(ssr / nobs) - nobs2
  dw = np.sum(diff_resids**2, axis=axis) / np.sum(resids**2, axis=axis)



Balance test for holiday:
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept           0          0        nan        nan           0           0
treated             0          0        nan        nan           0           0

Balance test for day_of_week_num:
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.8589      0.003   1071.964      0.000       2.854       2.864
treated        0.3393      0.006     60.566      0.000       0.328       0.350

Balance test for zip_code_start:
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.028e+04      0.623   1.65e+04      0.000    1.03e+04    1.03e+04
treated      -34.0984      1.309   

## 1.b. 

**Answer**

Earlier, we detected pre-treatment differences between the two groups. Specifically, single-mothers are 0.1326 less likely to work compared to non-mothers. However, from the graph, we can identify a fairly parallel trends of employment between the two groups. 

In [30]:
# Calculate the average percent of observations working (i.e., work ==1) 
# by year and treatment status (mothers vs. non).

avg_work = df.groupby(['year', 'treated'])['work'].mean().reset_index()
avg_work['treated'] = avg_work['treated'].map({0: 'Non-mothers', 
                                               1: 'Single-mothers'}) 


# Plot average work separately against year for treated (red), and
# untreated (blue). Do you think the parallel trends assum
sns.lineplot(data=avg_work, x='year', y='work', hue='treated', palette=['blue', 'red'])
plt.title('Average Work by Year and Treatment Status')
plt.xlabel('Year')
plt.ylabel('Average Work')
plt.legend(title='Group')
plt.show()


KeyError: 'year'

In [None]:
print(avg_work)

    year         treated      work
0   1991     Non-mothers  0.583032
1   1991  Single-mothers  0.460053
2   1992     Non-mothers  0.571566
3   1992  Single-mothers  0.438920
4   1993     Non-mothers  0.571144
5   1993  Single-mothers  0.437547
6   1994     Non-mothers  0.590909
7   1994  Single-mothers  0.464032
8   1995     Non-mothers  0.574236
9   1995  Single-mothers  0.508127
10  1996     Non-mothers  0.552480
11  1996  Single-mothers  0.502636


## 1.c. 

Use a linear probability model to calculate the DID estimate of the effect of the EITC expansion on whether a respondent was working (i.e., work ==1) for the whole sample by estimating the model \
\
𝑤𝑜𝑟𝑘! = 𝛼 + 𝛽*𝑝𝑜𝑠𝑡 + 𝛽*𝑡𝑟𝑒𝑎𝑡𝑒d + 𝛽*𝑝𝑜𝑠𝑡*𝑡𝑟𝑒𝑎𝑡𝑒d + 𝜀 \
\
first as is, 
and then with fixed effects in state and year. Interpret your results for both regressions, and comment on any differences.

**Answer**

After adding the state and year fixed-effects, the coefficients of treated and post*treated change slightly. This means that the state-level patterns and seasonal shocks on employment did bias the estimates before. 

In [None]:
# DiD estimate of the effect of the EITC expansion on whether a respondent was working
# (i.e., work ==1) for the whole sample
model = smf.ols("avg_trip_minutes ~ warm + treated + warm:treated", data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:       avg_trip_minutes   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     175.2
Date:                Mon, 29 Sep 2025   Prob (F-statistic):          1.44e-113
Time:                        23:22:54   Log-Likelihood:            -6.2179e+06
No. Observations:              730088   AIC:                         1.244e+07
Df Residuals:                  730084   BIC:                         1.244e+07
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       30.1350      1.878     16.051   

In [None]:
# DiD estimate with fixed effects in state and year
model_fe = smf.ols("avg_trip_minutes ~ warm + treated + warm:treated + C(zip_code_start) + C(day_of_week)", data=df).fit()
print(model_fe.summary())

                            OLS Regression Results                            
Dep. Variable:       avg_trip_minutes   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     19.43
Date:                Mon, 29 Sep 2025   Prob (F-statistic):          1.28e-244
Time:                        23:25:55   Log-Likelihood:            -6.2174e+06
No. Observations:              730088   AIC:                         1.244e+07
Df Residuals:                  730015   BIC:                         1.244e+07
Df Model:                          72                                         
Covariance Type:            nonrobust                                         
                                  coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------
Intercept         

## 1.d. 

Why might you be concerned about inference, i.e., the calculation of your standard errors, from
these results? What could you do to alleviate these concerns? Implement at least one of the
techniques from Bertrand Duflo and Mullainathan (2004) and report on the change?

**Answer**

Since we know, a state-level or year-level patterns and shocks can affect all the observations in that group. The earlier naive regression treated the each observation as independence. This made the standard errors smaller and underestimated the p-value (false significance). 

So one way to alleviate this concern is to apply clusered standard errors to account for these effects in state and year-level. 

After including the clusted SE, the estimate does not change, but the clustered SE gets larger (0.012 to 0.017) after considering the correlation.

In [None]:
# DID regression with clustered standard errors at the state level
model = smf.ols("avg_trip_minutes ~ warm + treated + warm:treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["zip_code_start"]}
)
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:       avg_trip_minutes   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     17.71
Date:                Mon, 29 Sep 2025   Prob (F-statistic):           1.89e-08
Time:                        23:27:27   Log-Likelihood:            -6.2179e+06
No. Observations:              730088   AIC:                         1.244e+07
Df Residuals:                  730084   BIC:                         1.244e+07
Df Model:                           3                                         
Covariance Type:              cluster                                         
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       30.1350      1.227     24.566   