Yannis Bilias, "Sequential Testing of Duration Data: The Case of
Pennsylvania 'Reemployment Bonus' Experiment", Journal of Applied
Econometrics, Vol. 15, No. 6, 2000, pp. 575-594.

Description of the data:

The data file used for this paper was extracted from the master file of the
study.  The master file and documentation may be obtained from the

  U.S. Department of Labor,
  200 Constitution Avenue, NW
  Washington, DC 20210, USA     

Relevant information can be found in the Final Report:
  Corson, W., Decker, P., Dunstan, S., and Kerachsky, S. (1992),
  "Pennsylvania 'reemployment bonus' demonstration: Final Report",
  Unemployment Insurance Occasional Paper 92-1,
  (Washington D.C.: US Dept of Labor, Employment and Training Administration).

Our extract has 13913 observations on 23 variables (some of which are
dummies constructed from the original definitions).  These data are in the
file penn_jae.dat, which is an ASCII file in DOS format that is zipped in
bilias-data.zip.
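
As a minimal sketch of loading the extract, the snippet below reads a whitespace-delimited member directly out of a zip archive with pandas. The helper name read_dat_from_zip is hypothetical. Two things are assumed rather than confirmed by this description: that the file starts with a header row, and that penn_jae.dat sits at the top level of the archive. The demonstration builds a tiny in-memory archive with made-up values instead of using the real data.

```python
import io
import zipfile

import pandas as pd


def read_dat_from_zip(archive, member):
    """Read a whitespace-delimited text file stored inside a zip archive.

    `archive` may be a path or a file-like object. Assumes the first line
    of the member is a header row (an assumption, not stated in the README).
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as fh:
            # sep=r"\s+" treats any run of whitespace as one delimiter;
            # DOS line endings (\r\n) are handled by the parser.
            return pd.read_csv(fh, sep=r"\s+")


# Toy archive with made-up values, standing in for bilias-data.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("penn_jae.dat", "abdt tg  inuidur1\r\n10824 0 18\r\n10635 2 7\r\n")
buf.seek(0)

toy = read_dat_from_zip(buf, "penn_jae.dat")
print(toy)
```

For the real file, the same helper would be called with the path to bilias-data.zip in place of the in-memory buffer.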

The 23 variables (columns) of the datafile utilized in the article
may be described as follows:

abdt:  chronological time of enrollment of each claimant
       in the Pennsylvania reemployment bonus experiment.

tg:  indicates the treatment group (bonus amount - qualification period)
     of each claimant. 
       if tg=0, then claimant enrolled in the control group
       if tg=1, then claimant enrolled in the group 1, and so on.
     (for definitions of each group see the article or the
      Final Report).

inuidur1: a measure of length (in weeks) of the first spell of 
          unemployment; this measure was used in the empirical 
          analysis of this article.
          This is a constructed variable; the following is a quote
          from the documentation:
   "This variable reflected the number of weeks in the claimant's
    initial UI duration using a break-in-payments definition of a spell.
    If a claimant did not collect any weeks of UC, INUIDUR1 was set to 1
    because he/she must have signed for at least a waiting week in order
    to have been selected for the demonstration.  If a claimant had a gap
    of at least 7 weeks between the AB_DT and the first claim week paid,
    INUIDUR1 was also set to 1 to capture the waiting week.  Otherwise,
    the initial UI duration was deemed to have ended if there was a break
    in payments of at least 3 weeks' duration.  In this instance, INUIDUR1
    was set equal to the duration of the spell up to the break, plus one
    for the waiting week.  For all other cases, INUIDUR1 equalled the
    length of the spell plus one for the waiting week."
                                                                
inuidur2: a second measure for the length (in weeks) of 
          the first spell of unemployment;
          it was not used in our data analysis.
             
female: dummy variable; it indicates if the claimant's sex 
        is female (=1) or male (=0).

black: dummy variable; it indicates a person of black race (=1).

hispanic: dummy variable; it indicates a person of hispanic race (=1).

othrace: dummy variable; it indicates a non-white, non-black, non-hispanic
         person (=1).

dep:  the number of dependents of each claimant, top-coded at 2:
      it is equal to 2 if the claimant has 2 or more dependents,
      and to 0 or 1 otherwise.

q1-q6: six dummy variables indicating the quarter of experiment
       during which each claimant enrolled.

recall: takes the value of 1 if the claimant answered "yes" when
        asked if he/she had any expectation of being recalled.

agelt35: takes the value of 1 if the claimant's age is less
         than 35 and 0 otherwise.

agegt54: takes the value of 1 if the claimant's age is more
         than 54 and 0 otherwise.

durable: it takes the value of 1 if the occupation
         of the claimant was in the sector of durable manufacturing
         and 0 otherwise.

nondurable: it takes the value of 1 if the occupation
            of the claimant was in the sector of nondurable 
            manufacturing and 0 otherwise.

lusd: it takes the value of 1 if the claimant filed
      in Coatesville, Reading, or Lancaster and 0 otherwise.
      These three sites were considered to be located
      in areas characterized by low unemployment rate and
      short duration of unemployment.

husd: it takes the value of 1 if the claimant filed
      in Lewistown, Pittston, or Scranton and 0 otherwise.
      These three sites were considered to be located
      in areas characterized by high unemployment rate and
      short duration of unemployment.

muld: it takes the value of 1 if the claimant filed
      in Philadelphia-North, Philadelphia-Uptown, McKeesport, 
      Erie, or Butler and 0 otherwise.
      These five sites were considered to be located
      in areas characterized by moderate unemployment rate and
      long duration of unemployment.

In [30]:
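import pandas as pd
# Note: sep=' ' splits on every single space, so repeated or trailing spaces in
# the header produce the extra 'Unnamed: N' columns visible in the output below
# and can shift values relative to their labels (muld is empty in the rows shown).
df = pd.read_csv("/Users/pranjal/Desktop/Causal-Inference/data/penn_jae.dat", sep=' ')
df.head()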

,abdt,tg,inuidur1,inuidur2,female,black,hispanic,othrace,dep,q1,...,recall,agelt35,agegt54,durable,nondurable,lusd,husd,muld,Unnamed: 24,Unnamed: 25
0,10824,0,18,18,0,0,0,0,2,0,...,0,0,0,0,0,1,0,,,
1,10635,2,7,3,0,0,0,0,0,0,...,1,0,0,0,1,0,0,,,
2,10551,5,18,6,1,0,0,0,0,0,...,0,1,0,0,0,0,0,,,
3,10824,0,1,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,,,
4,10747,0,27,27,0,0,0,0,0,0,...,0,0,0,0,1,0,0,,,
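
The 'Unnamed: N' columns in the output above are what pandas produces when a header field is empty. One pattern that would generate them, sketched below on a made-up five-column file, is a doubled space in the header line combined with sep=' '. Whether penn_jae.dat has exactly this layout is an assumption that has not been checked against the file. The sketch also shows that sep=r'\s+' avoids the problem on the toy input.

```python
import io

import pandas as pd

# Toy stand-in for the data file: note the double space before "q5" in the
# header, while the data rows use single spaces (made-up values).
raw = "abdt tg q4  q5 q6\n10824 0 0 1 0\n10635 2 1 0 0\n"

# A single-space separator turns the doubled space into an empty header
# field, which pandas names "Unnamed: 3"; every later label is then shifted
# one position relative to the values and the last column is left empty.
naive = pd.read_csv(io.StringIO(raw), sep=" ")

# Treating any run of whitespace as one delimiter keeps labels and values
# aligned.
clean = pd.read_csv(io.StringIO(raw), sep=r"\s+")

print(list(naive.columns))
print(naive)
print(list(clean.columns))
print(clean)
```

In the naive read, the values that belong to q5 end up under "Unnamed: 3" and q6 is all NaN, which mirrors the empty muld column in the output above.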


In [31]:
df.tg.value_counts()

0    3354
2    2428
3    1885
5    1831
4    1745
1    1385
6    1285
Name: tg, dtype: int64

In [32]:
# Keep only the control group (tg=0) and treatment group 4
df = df[df.tg.isin([0, 4])]

In [33]:
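outcome = 'inuidur1'
treatment = 'tg'
# Covariates: every remaining column except the unused second duration measure,
# muld, and the two trailing 'Unnamed' columns
rest = list(df.drop([outcome, treatment, 'inuidur2', 'muld', 'Unnamed: 24', 'Unnamed: 25'], axis=1).columns)
# Reorder so that the outcome and treatment come first
df = df[[outcome] + [treatment] + rest]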

In [34]:
df.isnull().sum()

inuidur1       0
tg             0
abdt           0
female         0
black          0
hispanic       0
othrace        0
dep            0
q1             0
q2             0
q3             0
q4             0
Unnamed: 13    0
q5             0
q6             0
recall         0
agelt35        0
agegt54        0
durable        0
nondurable     0
lusd           0
husd           0
dtype: int64

In [35]:
# Drop rows with missing values, then split into outcome, treatment and covariates
df = df.dropna()
y = df[outcome]
d = df[treatment]
x = df[rest].astype('float')
print(df.shape)
df.head()

(5099, 22)


,inuidur1,tg,abdt,female,black,hispanic,othrace,dep,q1,q2,...,Unnamed: 13,q5,q6,recall,agelt35,agegt54,durable,nondurable,lusd,husd
0,18,0,10824,0,0,0,0,2,0,0,...,1,0,0,0,0,0,0,0,1,0
3,1,0,10824,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
4,27,0,10747,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
11,9,4,10607,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
12,27,0,10831,0,0,0,0,1,0,0,...,1,0,0,0,1,1,0,1,0,0


In [36]:
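# Simple comparison of means.
# Note: d is tg, coded 0 (control) or 4 (treatment group 4), so the slope below is
# per unit of tg; the difference in group means is 4 times the slope.
import numpy as np
import statsmodels.api as sm
mod = sm.OLS(y, sm.add_constant(np.c_[d], prepend=False))
res = mod.fit()
print(res.summary())
# Treatment coefficient and its standard error (first regressor, since prepend=False)
print(np.asarray(res.params)[0])
print(np.asarray(res.bse)[0])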

                            OLS Regression Results                            
Dep. Variable:               inuidur1   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     7.132
Date:                Sun, 04 Dec 2022   Prob (F-statistic):            0.00759
Time:                        21:57:41   Log-Likelihood:                -19253.
No. Observations:                5099   AIC:                         3.851e+04
Df Residuals:                    5097   BIC:                         3.852e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.2081      0.078     -2.671      0.0
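
Because tg enters the regression above with the values 0 and 4, the reported slope is on a per-unit-of-tg scale. The sketch below illustrates this on synthetic data (the sample size, the effect of -0.8 and the noise level are made up): the OLS slope on a 0/1 indicator equals the difference in group means, while the slope on the 0/4 coding is one quarter of it. The helper ols_slope is hypothetical and uses NumPy least squares rather than statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
tg = rng.choice([0, 4], size=n)  # group label coded 0 or 4, as in the filtered data
# Synthetic outcome: made-up baseline of 13 weeks and effect of -0.8 weeks.
y = 13.0 - 0.8 * (tg == 4) + rng.normal(0.0, 3.0, size=n)


def ols_slope(regressor, outcome):
    """Slope from an OLS fit of outcome on regressor plus a constant."""
    design = np.column_stack([regressor, np.ones_like(outcome)])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[0]


diff_in_means = y[tg == 4].mean() - y[tg == 0].mean()
slope_raw = ols_slope(tg.astype(float), y)            # regressor coded 0/4
slope_binary = ols_slope((tg == 4).astype(float), y)  # regressor coded 0/1

print(diff_in_means, slope_binary, 4 * slope_raw)
```

All three printed numbers agree up to floating-point error, so either coding can be used as long as the scale is kept in mind when reading the coefficient.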

In [37]:
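# Pooled regression adjustment: regress y on d and the covariates x.
# As above, the coefficient on d (first column) is per unit of tg (0 or 4).
import statsmodels.api as sm
mod = sm.OLS(y, sm.add_constant(np.c_[d, x], prepend=False))
res = mod.fit()
print(res.summary())
# Treatment coefficient and its standard error (first regressor)
print(np.asarray(res.params)[0])
print(np.asarray(res.bse)[0])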

                            OLS Regression Results                            
Dep. Variable:               inuidur1   R-squared:                       0.038
Model:                            OLS   Adj. R-squared:                  0.034
Method:                 Least Squares   F-statistic:                     9.908
Date:                Sun, 04 Dec 2022   Prob (F-statistic):           1.23e-30
Time:                        21:57:42   Log-Likelihood:                -19158.
No. Observations:                5099   AIC:                         3.836e+04
Df Residuals:                    5078   BIC:                         3.850e+04
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.1778      0.077     -2.314      0.0
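
To illustrate what pooled regression adjustment does, the sketch below compares the unadjusted and covariate-adjusted OLS estimates on synthetic data with a randomized binary treatment. Everything here is made up for illustration (the true effect of -0.8, the three covariates, the helper ols_with_se), and the standard errors are the classical homoskedastic ones, matching the "nonrobust" covariance type in the output above. With a randomized treatment both estimates target the same effect, and adjusting for a predictive covariate mainly reduces the standard error.

```python
import numpy as np


def ols_with_se(design, outcome):
    """OLS coefficients and classical (homoskedastic) standard errors."""
    n, k = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ outcome
    resid = outcome - design @ beta
    sigma2 = resid @ resid / (n - k)  # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, se


rng = np.random.default_rng(1)
n = 5000
d = rng.integers(0, 2, size=n).astype(float)  # randomized 0/1 treatment
x = rng.normal(size=(n, 3))                   # covariates; only the first matters
y = 10.0 - 0.8 * d + x @ np.array([3.0, 0.0, 0.0]) + rng.normal(size=n)

const = np.ones((n, 1))
b_unadj, se_unadj = ols_with_se(np.column_stack([d, const]), y)
b_adj, se_adj = ols_with_se(np.column_stack([d, x, const]), y)

print("unadjusted:", b_unadj[0], se_unadj[0])
print("adjusted:  ", b_adj[0], se_adj[0])
```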

In [43]:
import numpy as np
from doubleml import DoubleMLData, DoubleMLPLR
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

np.random.seed(1234)
dml_data_bonus = DoubleMLData(df, y_col=outcome,
                              d_cols=treatment,
                              x_cols=list(rest))
print(dml_data_bonus)

# Random-forest learners for the nuisance functions: ml_l for E[Y|X], ml_m for E[D|X]
learner = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=6)
ml_l_bonus = clone(learner)
ml_m_bonus = clone(learner)

# Lasso learners and a non-orthogonal score; defined here but not used in the fit below
learner = LassoCV()
ml_l_sim = clone(learner)
ml_m_sim = clone(learner)
def non_orth_score(y, d, l_hat, m_hat, g_hat, smpls):
    u_hat = y - g_hat
    psi_a = -np.multiply(d, d)
    psi_b = np.multiply(d, u_hat)
    return psi_a, psi_b

# Partially linear regression model with the default score
np.random.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR(dml_data_bonus, ml_l_bonus, ml_m_bonus)
obj_dml_plr_bonus.fit()
print(obj_dml_plr_bonus)


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['abdt', 'female', 'black', 'hispanic', 'othrace', 'dep', 'q1', 'q2', 'q3', 'q4', 'Unnamed: 13', 'q5', 'q6', 'recall', 'agelt35', 'agegt54', 'durable', 'nondurable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099

------------------ DataFrame info    ------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5099 entries, 0 to 13911
Columns: 22 entries, inuidur1 to husd
dtypes: int64(22)
memory usage: 916.2 KB


------------------ Data summary      ------------------
Outcome variable: inuidur1
Treatment variable(s): ['tg']
Covariates: ['abdt', 'female', 'black', 'hispanic', 'othrace', 'dep', 'q1', 'q2', 'q3', 'q4', 'Unnamed: 13', 'q5', 'q6', 'recall', 'agelt35', 'agegt54', 'durable', 'nondurable', 'lusd', 'husd']
Instrument variable(s): None
No. Observations: 5099

------------------ Score & algorithm ------------------
Score funct
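
The idea behind the partially linear regression fit above can be sketched with plain NumPy: predict y and d from x on held-out folds, then regress the outcome residuals on the treatment residuals. This is not DoubleML's implementation. The helper names fit_predict_ols and cross_fit_plr are hypothetical, ordinary least squares is swapped in for the random forests, and the data are synthetic with a made-up effect of -0.7.

```python
import numpy as np


def fit_predict_ols(x_train, t_train, x_test):
    """Fit OLS of t on x (with intercept) on the training fold, predict on the test fold."""
    a_train = np.column_stack([x_train, np.ones(len(x_train))])
    a_test = np.column_stack([x_test, np.ones(len(x_test))])
    coef, *_ = np.linalg.lstsq(a_train, t_train, rcond=None)
    return a_test @ coef


def cross_fit_plr(y, d, x, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in y = theta*d + g(x) + noise."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    u = np.empty(n)  # outcome residuals  y - l_hat(x)
    v = np.empty(n)  # treatment residuals d - m_hat(x)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        u[test_idx] = y[test_idx] - fit_predict_ols(x[train], y[train], x[test_idx])
        v[test_idx] = d[test_idx] - fit_predict_ols(x[train], d[train], x[test_idx])
    theta = (v @ u) / (v @ v)
    eps = u - theta * v
    # Sandwich-type standard error for the residual-on-residual regression.
    se = np.sqrt(np.mean(v**2 * eps**2) / np.mean(v**2) ** 2 / n)
    return theta, se


rng = np.random.default_rng(2)
n = 4000
x = rng.normal(size=(n, 4))
d = 0.5 * x[:, 0] + rng.normal(size=n)             # treatment depends on x
y = -0.7 * d + 2.0 * x[:, 1] + rng.normal(size=n)  # made-up effect of -0.7

theta_hat, se_hat = cross_fit_plr(y, d, x)
print(theta_hat, se_hat)
```

Replacing fit_predict_ols with a more flexible learner gives the general recipe; the cross-fitting loop and the final residual regression stay the same.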

In [44]:
# DML regression - still yields an unbiased estimate of the ATE
from econml.dml import LinearDML
est = LinearDML(random_state=9)
est.fit(y, d, X=None, W=x)
est.summary()

Coefficient Results:  X is None, please call intercept_inference to learn the constant!


,point_estimate,stderr,zstat,pvalue,ci_lower,ci_upper
cate_intercept,-0.177,0.076,-2.311,0.021,-0.327,-0.027
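
As a quick arithmetic check on the table above, the z-statistic and interval can be recomputed from the displayed point estimate and standard error. The sketch assumes a 95% normal interval (critical value 1.96) and that tg entered the model coded 0 or 4, so that the implied contrast between treatment group 4 and the control group is 4 times the coefficient. Because the inputs are rounded to three decimals, the recomputed values differ slightly from the ones printed in the table.

```python
point_estimate = -0.177  # cate_intercept from the table above
stderr = 0.076           # its reported standard error

z = point_estimate / stderr
ci_lower = point_estimate - 1.96 * stderr  # assumes a 95% normal interval
ci_upper = point_estimate + 1.96 * stderr

# Assumption: tg is coded 0 or 4, so the group-4-versus-control contrast
# is 4 times the per-unit coefficient.
effect_weeks = 4 * point_estimate

print(round(z, 3), round(ci_lower, 3), round(ci_upper, 3), round(effect_weeks, 3))
```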
