34 changes: 34 additions & 0 deletions README.md
@@ -31,6 +31,7 @@ For information on use cases and background material on causal inference and het
- [Usage Examples](#usage-examples)
- [Estimation Methods](#estimation-methods)
- [Interpretability](#interpretability)
- [Causal Model Selection and Cross-Validation](#causal-model-selection-and-cross-validation)
- [Inference](#inference)
- [For Developers](#for-developers)
- [Running the tests](#running-the-tests)
@@ -416,6 +417,39 @@ See the <a href="#references">References</a> section for more details.
mdl, _ = scorer.ensemble([mdl for _, mdl in models])
```

</details>

<details>
<summary>First Stage Model Selection (click to expand)</summary>

First stage models can be selected either by passing cross-validated models (e.g. `sklearn.linear_model.LassoCV`) to EconML's estimators, or by performing first stage model selection outside of EconML and passing in the selected models. Unless you are selecting among a large set of hyperparameters, choosing the first stage models externally is the preferred approach due to its statistical and computational advantages.

```Python
from econml.dml import LinearDML
from sklearn import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

cv_model = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid={
        "max_depth": [3, None],
        "n_estimators": (10, 30, 50, 100, 200),
        "max_features": (2, 4, 6),
    },
    cv=5,
)

# Option 1: first stage model selection within EconML
# This is more direct, but computationally and statistically less efficient
est = LinearDML(model_y=cv_model, model_t=cv_model)

# Option 2: first stage model selection outside of EconML
# This is the most efficient, but requires boilerplate code
model_t = clone(cv_model).fit(W, T).best_estimator_
model_y = clone(cv_model).fit(W, Y).best_estimator_
est = LinearDML(model_y=model_y, model_t=model_t)
```
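
The resulting estimator is then used like any other EconML estimator. A minimal usage sketch (not part of the snippet above, and assuming `Y`, `T`, and `W` are pre-existing arrays):

```Python
# Minimal usage sketch: Y, T, W are assumed to be already-defined arrays
est.fit(Y, T, W=W)              # fit the CATE model with the selected first stage models
point = est.effect()            # point estimate of the treatment effect
lb, ub = est.effect_interval()  # confidence interval from the linear final stage
```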


</details>

### Inference
26 changes: 24 additions & 2 deletions doc/spec/estimation/dml.rst
@@ -430,19 +430,41 @@ Usage FAQs

.. testcode::

    from econml.dml import SparseLinearDML
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV
    first_stage = lambda: GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': [3, None],
            'n_estimators': (10, 30, 50, 100, 200),
            'max_features': (2, 4, 6)
        }, cv=10, n_jobs=-1, scoring='neg_mean_squared_error'
    )
    est = SparseLinearDML(model_y=first_stage(), model_t=first_stage())

Alternatively, you can pick the best first stage models outside of the EconML framework and pass the selected models to EconML.
This can save runtime and computational resources. It is also statistically more stable, since all of the data is used for model
selection rather than only the training data within each cross-fitting fold. E.g.:

.. testcode::

    from econml.dml import LinearDML
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV
    first_stage = lambda: GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': [3, None],
            'n_estimators': (10, 30, 50, 100, 200),
            'max_features': (2, 4, 6)
        }, cv=10, n_jobs=-1, scoring='neg_mean_squared_error'
    )
    model_y = first_stage().fit(X, Y).best_estimator_
    model_t = first_stage().fit(X, T).best_estimator_
    est = LinearDML(model_y=model_y, model_t=model_t)
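
For completeness, a minimal follow-up sketch (assuming `Y`, `T`, and `X` are defined as in the surrounding examples), showing that the estimator is then fit and queried as usual:

.. testcode::

    est.fit(Y, T, X=X)
    point = est.effect(X)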


- **How do I select the hyperparameters of the final model (if any)?**

You can use cross-validated classes for the final model too. Our default debiased lasso performs cross validation
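
As a hedged illustration (not part of this diff), passing a cross-validated final model to the generic `DML` estimator could look as follows; the first stage models here are purely illustrative:

.. testcode::

    from econml.dml import DML
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LassoCV
    # LassoCV chooses the final model's regularization strength by cross validation;
    # fit_intercept=False since the estimator adds its own CATE intercept feature.
    est = DML(model_y=RandomForestRegressor(),
              model_t=RandomForestRegressor(),
              model_final=LassoCV(fit_intercept=False))
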
33 changes: 33 additions & 0 deletions doc/spec/estimation/dr.rst
@@ -431,6 +431,39 @@ Usage FAQs
est.fit(y, T, X=X, W=W)
point = est.effect(X, T0=T0, T1=T1)

Alternatively, you can pick the best first stage models outside of the EconML framework and pass the selected models to EconML.
This can save runtime and computational resources. It is also statistically more stable, since all of the data is used for model
selection rather than only the training data within each cross-fitting fold. E.g.:

.. testcode::

    import numpy as np
    from econml.drlearner import DRLearner
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    model_reg = lambda: GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': [3, None],
            'n_estimators': (10, 50, 100)
        }, cv=5, n_jobs=-1, scoring='neg_mean_squared_error'
    )
    model_clf = lambda: GridSearchCV(
        estimator=RandomForestClassifier(min_samples_leaf=10),
        param_grid={
            'max_depth': [3, None],
            'n_estimators': (10, 50, 100)
        }, cv=5, n_jobs=-1, scoring='neg_mean_squared_error'
    )
    XW = np.hstack([X, W])
    model_regression = model_reg().fit(XW, y).best_estimator_
    model_propensity = model_clf().fit(XW, T).best_estimator_
    est = DRLearner(model_regression=model_regression,
                    model_propensity=model_propensity,
                    model_final=model_regression, cv=5)
    est.fit(y, T, X=X, W=W)
    point = est.effect(X, T0=T0, T1=T1)


- **What if I have many treatments?**

The method allows for multiple discrete (categorical) treatments and will estimate a CATE model for each treatment.
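For instance, a hedged sketch (assuming, purely for illustration, a categorical treatment `T` taking values 0, 1, 2 together with arrays `y`, `X`, `W` as in the examples above):

.. testcode::

    from econml.drlearner import DRLearner
    est = DRLearner()
    est.fit(y, T, X=X, W=W)
    # one CATE model is fit per non-baseline treatment level;
    # any two levels can be compared through T0/T1
    effect_1_vs_0 = est.effect(X, T0=0, T1=1)
    effect_2_vs_1 = est.effect(X, T0=1, T1=2)
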
649 changes: 649 additions & 0 deletions notebooks/Choosing First Stage Models.ipynb
