## Implementation and Heckit

Base example: Example 17.5 from "Introductory Econometrics: A Modern Approach" (Wooldridge)

Source: http://www.upfie.net/downloads17.html

In [10]:
## Importing libraries
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats
from statsmodels.formula.api import probit

In [11]:
## econometric_functions
import os
import sys

sCaminhoEconometria = "/Users/vinicius/Meu Drive/UnB/Econometria"
sys.path.append(os.path.abspath(sCaminhoEconometria))

import econometric_functions as ef
from econometric_functions import ols_reg

## Heckit: 17.5

In [5]:
## Data
df = woo.dataWoo('mroz')
df

Unnamed: 0,inlf,hours,kidslt6,kidsge6,age,educ,wage,repwage,hushrs,husage,...,faminc,mtr,motheduc,fatheduc,unem,city,exper,nwifeinc,lwage,expersq
0,1,1610,1,0,32,12,3.3540,2.65,2708,34,...,16310.0,0.7215,12,7,5.0,0,14,10.910060,1.210154,196
1,1,1656,0,2,30,12,1.3889,2.65,2310,30,...,21800.0,0.6615,7,7,11.0,1,5,19.499981,0.328512,25
2,1,1980,1,3,35,12,4.5455,4.04,3072,40,...,21040.0,0.6915,12,7,5.0,0,15,12.039910,1.514138,225
3,1,456,0,3,34,12,1.0965,3.25,1920,53,...,7300.0,0.7815,7,7,5.0,0,6,6.799996,0.092123,36
4,1,1568,1,2,31,14,4.5918,3.60,2000,32,...,27300.0,0.6215,12,14,9.5,1,7,20.100058,1.524272,49
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
748,0,0,0,2,40,13,,0.00,3020,43,...,28200.0,0.6215,10,10,9.5,1,5,28.200001,,25
749,0,0,2,3,31,12,,0.00,2056,33,...,10000.0,0.7715,12,12,7.5,0,14,10.000000,,196
750,0,0,0,0,43,12,,0.00,2383,43,...,9952.0,0.7515,10,3,7.5,0,4,9.952000,,16
751,0,0,0,0,60,12,,0.00,1705,55,...,24984.0,0.6215,12,12,14.0,1,15,24.983999,,225


Binary participation / selection variable: `inlf` (== 1 if the woman is in the labor force)

### Step 1: Use all observations to estimate a Probit model for participation

In [6]:
## Using the helper from econometric_functions to estimate the model
# Formula
formProbit = 'inlf ~ educ + exper + I(exper**2) + nwifeinc + age + kidslt6 + kidsge6'
# Model
reg_probit = ef.probit_logit(formula=formProbit, data=df)

                          Probit Regression Results                           
Dep. Variable:                   inlf   No. Observations:                  753
Model:                         Probit   Df Residuals:                      745
Method:                           MLE   Df Model:                            7
Date:                Sat, 26 Feb 2022   Pseudo R-squ.:                  0.2206
Time:                        12:58:50   Log-Likelihood:                -401.30
converged:                       True   LL-Null:                       -514.87
Covariance Type:            nonrobust   LLR p-value:                 2.009e-45
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         0.2701      0.509      0.531      0.596      -0.728       1.269
educ              0.1309      0.025      5.183      0.000       0.081       0.180
exper             0.1233      0.019     

In [7]:
## Getting the fitted values
vPredParticipacao = reg_probit.fittedvalues

## Computing the inverse Mills ratio and adding it to the DataFrame
df['inv_mills'] = stats.norm.pdf(vPredParticipacao) / stats.norm.cdf(vPredParticipacao)
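The direct ratio `pdf / cdf` is fine for the moderate index values of this example, but both terms underflow to zero for very negative indices. A minimal sketch of a more robust version, assuming only NumPy and SciPy, works in logs instead. The function name `inverse_mills` is hypothetical and is not part of `econometric_functions`.

```python
import numpy as np
from scipy import stats


def inverse_mills(index):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z), evaluated in logs for stability."""
    index = np.asarray(index, dtype=float)
    # exp(log phi(z) - log Phi(z)) avoids the 0/0 that pdf/cdf produces for very negative z
    return np.exp(stats.norm.logpdf(index) - stats.norm.logcdf(index))


# lambda(0) = phi(0) / Phi(0), i.e. the standard normal density at zero divided by 0.5
print(inverse_mills(0.0))
# Still finite where the naive ratio would underflow
print(inverse_mills(-40.0))
```

For the probit index of this example the two formulas should agree, so the swap only matters when some observations have a participation probability very close to zero.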

### Step 2: regress wages on the usual specification plus the inverse Mills ratio

IMPORTANT: use only the subset of women who work!
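For reference, the reason the inverse Mills ratio enters the wage equation can be written out under the standard Heckit assumptions. The notation is assumed here and does not come from the notebook: $s$ is the selection indicator (`inlf`), $z$ collects the probit regressors, $x$ the wage regressors, and $u$, $v$ are the errors of the wage and selection equations.

$$
\begin{aligned}
y &= x\beta + u, \qquad s = 1[z\gamma + v \ge 0], \qquad v \sim \mathrm{Normal}(0, 1), \qquad E(u \mid v) = \rho v \\
E(y \mid z, s = 1) &= x\beta + \rho\,\lambda(z\gamma), \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\end{aligned}
$$

The second line holds only for $s = 1$, which is why the regression is restricted to working women. When $\rho = 0$ the extra term vanishes, so the t statistic on `inv_mills` serves as a test for sample selection.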

In [8]:
## Formula
formSalariosHeckit = 'lwage ~ educ + exper + I(exper**2) + inv_mills'

## Model
modHeckit = ef.ols_reg(formSalariosHeckit, df, subset=(df['inlf'] == 1))

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.157
Model:                            OLS   Adj. R-squared:                  0.149
Method:                 Least Squares   F-statistic:                     19.69
Date:                Sat, 26 Feb 2022   Prob (F-statistic):           7.14e-15
Time:                        12:58:56   Log-Likelihood:                -431.57
No. Observations:                 428   AIC:                             873.1
Df Residuals:                     423   BIC:                             893.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.5781      0.307     -1.885

Because the p-value on the inverse Mills ratio is large (its coefficient is statistically insignificant), there appears to be no evidence of sample selection.

Setting statistical significance aside, the positive coefficient on the inverse Mills ratio would suggest positive selection: women who work would earn more than their observable characteristics alone predict.

In [9]:
## Viewing the function's help
ef.ols_reg?

Signature: ef.ols_reg(formula, data, subset=None, cov='unadjusted')
Docstring:
Fits a standard OLS model with the corresponding covariance matrix using an R-style formula (y ~ x1 + x2...).
To compute without an intercept, use -1 or 0 in the formula.
Remember to use mod = ols_reg(...).
For generalized and weighted estimation, see statsmodels documentation or the first version of this file.
:param formula: patsy formula (R style)
:param data: dataframe containing the data
:param subset: only use a subset of the data? Defaults to None (all data)
    Must be in the form of `subset=(df['subset_column'] == 1)`.
:param cov : str
    unadjusted: common standard errors
    robust: HC1 standard errors
    cluster or clustered: clustered standard errors (must specify group)
:return : statsmodel
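For readers without access to `econometric_functions`, the `subset` and `cov='robust'` options described in this docstring correspond to plain statsmodels calls. The sketch below is an assumption about what such a helper might wrap, not the actual source of `ols_reg`. The simulated columns `y`, `x`, and `keep` are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data, purely illustrative
rng = np.random.default_rng(42)
demo = pd.DataFrame({"x": rng.normal(size=200)})
demo["y"] = 1.0 + 0.5 * demo["x"] + rng.normal(size=200)
demo["keep"] = (rng.uniform(size=200) > 0.3).astype(int)

# Boolean mask in the same form the docstring asks for
mask = demo["keep"] == 1

# Common (non-robust) standard errors on the subset
res_plain = smf.ols("y ~ x", data=demo[mask]).fit()
# HC1 heteroskedasticity-robust standard errors on the same subset
res_hc1 = smf.ols("y ~ x", data=demo[mask]).fit(cov_type="HC1")

print(res_plain.bse)
print(res_hc1.bse)
```

The point estimates are identical in both fits; only the estimated covariance matrix, and therefore the standard errors and p-values, changes.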

### Creating a Heckit function

In [18]:
def heckit(formula_probit, formula_model, data, subset_model, cov='unadjusted'):
    """
    Performs the Heckit procedure for sample selection correction.
    The procedure is done through (1) a probit estimation for the selection variable using all available
    data and (2) an OLS regression using the 'selected' data and a formula containing the Inverse Mills Ratio (λ).
    Remember to use modHeckit = heckit(...)!

    :param formula_probit: patsy/R formula for the probit model on the selection variable;
    :param formula_model: patsy/R formula for the model on the 'selected' data;
    :param data: dataframe containing the data
    :param subset_model: boolean mask selecting the observations used in step (2).
        Must be in the form of `(df['subset_column'] == 1)`.
    :param cov : str
        unadjusted: common standard errors
        robust: HC1 standard errors
        cluster or clustered: clustered standard errors (must specify group)
    :return : statsmodels model instance
    """

    ## Fitting the probit model
    mod_probit = probit(formula_probit, data).fit()

    ## Calculating predicted values
    predicted_values = mod_probit.fittedvalues

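    ## Calculating λ and adding it as a column to the DataFrame passed in (not the global df)
    data['inv_mills'] = stats.norm.pdf(predicted_values) / stats.norm.cdf(predicted_values)

    ## Appending inv_mills to formula_model, unless the formula already contains it
    if 'inv_mills' not in formula_model:
        formula_model += ' + inv_mills'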

    ## Fitting the ols_model
    mod_heckit = ols_reg(formula_model, data, subset_model, cov)
    return mod_heckit
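A variant that depends only on statsmodels and does not write to the caller's DataFrame can be useful for checking the procedure on data where the answer is known. The following is a sketch under stated assumptions, not a replacement for the function above. `heckit_two_step` and the simulated columns `s`, `x`, `z`, and `y` are hypothetical names, the error correlation is set to 0.7 so that selection is positive by construction, and the probit is assumed to drop no rows for missing values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats


def heckit_two_step(formula_probit, formula_model, data, selected):
    """Two-step Heckit sketch; returns (probit results, second-stage OLS results)."""
    work = data.copy()  # leave the caller's DataFrame untouched
    # Step 1: probit for the selection indicator on all observations
    res_probit = smf.probit(formula_probit, data=work).fit(disp=0)
    # Fitted linear index of the probit (assumes no rows were dropped for missing values)
    index = np.asarray(res_probit.fittedvalues)
    work["inv_mills"] = stats.norm.pdf(index) / stats.norm.cdf(index)
    # Step 2: OLS on the selected sample with the inverse Mills ratio added
    res_ols = smf.ols(formula_model + " + inv_mills", data=work[selected]).fit()
    return res_probit, res_ols


# Simulated data with positive selection: E(u | v) = 0.7 * v
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
z = rng.normal(size=n)  # affects selection only (exclusion restriction)
v = rng.normal(size=n)  # selection error
u = 0.7 * v + np.sqrt(1 - 0.7**2) * rng.normal(size=n)  # outcome error
s = (0.5 * x + z + v > 0).astype(int)
y = np.where(s == 1, 1.0 + 2.0 * x + u, np.nan)  # outcome observed only when s == 1
sim = pd.DataFrame({"s": s, "x": x, "z": z, "y": y})

res_probit, res_heckit = heckit_two_step("s ~ x + z", "y ~ x", sim, sim["s"] == 1)
print(res_heckit.params)
print(res_heckit.pvalues["inv_mills"])
```

In this simulation the coefficient on `inv_mills` should be close to the simulated 0.7, up to sampling noise. The second-stage standard errors are the ordinary OLS ones, which ignore that λ is itself estimated; they are appropriate for testing the null of no selection but not, in general, for inference once selection is present.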

In [19]:
## Testing
modHeckit = heckit(formProbit, formSalariosHeckit, df, (df['inlf'] == 1))

Optimization terminated successfully.
         Current function value: 0.532938
         Iterations 5
                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.157
Model:                            OLS   Adj. R-squared:                  0.149
Method:                 Least Squares   F-statistic:                     19.69
Date:                Sat, 26 Feb 2022   Prob (F-statistic):           7.14e-15
Time:                        13:15:20   Log-Likelihood:                -431.57
No. Observations:                 428   AIC:                             873.1
Df Residuals:                     423   BIC:                             893.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
--------------------------