<h1 style="text-align: center; color: purple;" markdown="1">Econ 320 Python Lab Regression Analysis and Qualitative Regressors </h1>
<h2 style="text-align: center; color: purple;" markdown="1">Handout # 11 </h2>

Many variables of interest are qualitative rather than quantitative: gender, race, marital status, level of education, occupation, region, etc. Qualitative information is usually represented in regressions by binary or dummy variables, which can only take the value zero or one. 

**The set up**

In [15]:
import wooldridge as woo
import numpy as np
import pandas as pd
import os
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats

from stargazer.stargazer import Stargazer
from IPython.core.display import HTML

## Dummy variables 

Dummy variables can be used as regressors just like any other variable. The coefficient of a single dummy variable added to a regression represents the difference in the intercept between groups; see Wooldridge (2019, Section 7.2).
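To see why the dummy coefficient is an intercept shift, here is a minimal sketch on simulated data. The names `sim`, `y` and `d` and the numbers used to generate the data are made up for illustration; they do not come from wage1. With a dummy as the only regressor, the intercept should equal the mean of the base group and the dummy coefficient should equal the difference in group means.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical): d is a 0/1 dummy, y is the outcome
rng = np.random.default_rng(0)
n = 200
d = rng.integers(0, 2, size=n)
y = 1.5 - 0.3 * d + rng.normal(scale=0.5, size=n)
sim = pd.DataFrame({'y': y, 'd': d})

fit = smf.ols('y ~ d', data=sim).fit()

# Group means computed directly from the data
mean0 = sim.loc[sim['d'] == 0, 'y'].mean()
mean1 = sim.loc[sim['d'] == 1, 'y'].mean()

print(fit.params['Intercept'], mean0)   # intercept vs. mean of the d == 0 group
print(fit.params['d'], mean1 - mean0)   # dummy coefficient vs. difference in means
```

Once other regressors are added, the dummy coefficient is still an intercept shift, but it is measured holding those regressors fixed, so it no longer equals the raw difference in means.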

Let's use an example in which we estimate a wage equation and investigate the wage differences by gender. Once we have the dummy variable, we just need to include it in our regression formula. 

We will use the wage1 dataset from Wooldridge. First we want to check how our variable of interest is distributed. Gender is stored in the variable female, a dummy variable that takes the value 1 if the individual is female and 0 if male. 

We are going to use the function `pd.crosstab(index=variable, columns=column names)` to see the distribution of gender in our data. This creates a frequency table with the number of women and men in the sample.

In [16]:
# load wage1 data from wooldridge package
wage1 = woo.dataWoo('wage1')


pd.crosstab(index=wage1['female'],  # Make a crosstab
            columns="count")      # Name the count column

| female | count |
|--------|-------|
| 0 | 274 |
| 1 | 252 |


We are interested in the wage differences by gender. We will estimate the following regression equations.

Model 1 $$log(wage) = \beta_0 + \beta_1*female + \beta_2*educ + \beta_3*exper + \beta_4*tenure$$
Model 2 & 3 Restrict the data to men only and to women only. Within each subsample female is constant, so it drops out of the equation $$log(wage) = \beta_0 + \beta_1*educ + \beta_2*exper + \beta_3*tenure$$

Model 4 Interact education, experience and tenure with female 

$$log(wage) = \beta_0 + \beta_1*female + \beta_2*educ + \beta_3*exper + \beta_4*tenure + \\ \beta_5*educ*female + \beta_6*exper*female + \beta_7*tenure*female$$

* Run the regression that estimates the equation above
* First by using the variable female as a regressor

In [17]:
m1 = smf.ols(formula='np.log(wage) ~ female + educ + exper + tenure', data=wage1)
m1 = m1.fit()
# print regression table:
#m1.summary()

In [18]:
# You could filter the data first and run two separate regressions, but it is
# more compact to pass a boolean mask to the subset argument of smf.ols
m2 = smf.ols(formula='np.log(wage) ~ educ + exper + tenure', 
             data=wage1, subset=(wage1['female'] == 0)).fit()
m2.summary()

# Same regression restricted to women
m3 = smf.ols(formula='np.log(wage) ~ educ + exper + tenure', 
             data=wage1, subset=(wage1['female'] == 1)).fit()
m3.summary()

# To reproduce the split-sample coefficients in one pooled model,
# interact every regressor with the female variable
m4 = smf.ols(formula=' np.log(wage) ~ educ*female + exper*female + tenure*female ',
             data=wage1).fit()
m4.summary()

# Put these models in a Stargazer table and set the order of the covariates

models = Stargazer([m1, m2, m3, m4])
models.title('Regression on Wages')
models.custom_columns(['All', 'Only men', 'Only women', 'Interaction'], [1, 1, 1, 1])
models.covariate_order(['Intercept', 'female' , 'educ' , 'exper', 'tenure', 'educ:female', 'exper:female','tenure:female'])
HTML(models.render_html())


# Now, what can you say about the coefficients for the dummy variable female?

Dependent variable: np.log(wage)

| | (1) All | (2) Only men | (3) Only women | (4) Interaction |
|---|---|---|---|---|
| Intercept | 0.501*** | 0.322** | 0.356** | 0.322** |
| | (0.102) | (0.139) | (0.141) | (0.135) |
| female | -0.301*** | | | 0.034 |
| | (0.037) | | | (0.199) |
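The link between the split-sample regressions and the interaction model can be checked on simulated data. The sketch below is only an illustration: the data frame `sim`, the column `lwage` and the coefficients used to generate it are invented, and only one regressor is used to keep it short. If every regressor is interacted with the dummy, the pooled model should reproduce the coefficients of the two separate regressions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical names and coefficients, not wage1)
rng = np.random.default_rng(1)
n = 400
female = rng.integers(0, 2, size=n)
educ = rng.integers(8, 19, size=n)
lwage = (0.4 + 0.09 * educ - 0.1 * female - 0.01 * educ * female
         + rng.normal(scale=0.4, size=n))
sim = pd.DataFrame({'lwage': lwage, 'educ': educ, 'female': female})

# Separate regressions for each group
men = smf.ols('lwage ~ educ', data=sim[sim['female'] == 0]).fit()
women = smf.ols('lwage ~ educ', data=sim[sim['female'] == 1]).fit()

# One pooled regression with female interacted with the regressor
pooled = smf.ols('lwage ~ educ * female', data=sim).fit()
b = pooled.params

# Base group (men): coefficients appear directly
print(b['Intercept'], men.params['Intercept'])
print(b['educ'], men.params['educ'])
# Women: base coefficient plus the corresponding female term
print(b['Intercept'] + b['female'], women.params['Intercept'])
print(b['educ'] + b['educ:female'], women.params['educ'])
```

The point estimates match, but the standard errors generally do not, because the pooled model estimates a single error variance for both groups.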


## Dummy variables and arithmetic expressions in a regression  

We can run another regression with the following formula 

$$log(wage) = \beta_0 + \beta_1*married + \beta_2*female + \beta_3*married*female + \beta_4*educ + \beta_5*exper + \beta_6*exper^2 + \beta_7*tenure + \beta_8*tenure^2$$

Notice that we are adding the dummy variables married and female, their interaction, and two squared variables to the regression. 

The dummy variables are added as they are because they take the value 1 for the category of interest and 0 for the other. 

When you want to add variables that are arithmetic operations on other variables, you do not need to create a separate variable; you can add them directly in the formula by wrapping the expression in `I(...)`.
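As a quick check of what `I()` does, the sketch below fits the same quadratic twice on simulated data: once with the square computed inside the formula and once with a squared column created by hand. The data frame `sim` and the column names `lwage`, `exper` and `expersq` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical): a quadratic experience profile
rng = np.random.default_rng(2)
n = 300
exper = rng.integers(1, 40, size=n)
lwage = 1.0 + 0.04 * exper - 0.0007 * exper**2 + rng.normal(scale=0.3, size=n)
sim = pd.DataFrame({'lwage': lwage, 'exper': exper})

# Option 1: compute the square inside the formula with I()
fit_I = smf.ols('lwage ~ exper + I(exper**2)', data=sim).fit()

# Option 2: create the squared variable by hand first
sim['expersq'] = sim['exper'] ** 2
fit_col = smf.ols('lwage ~ exper + expersq', data=sim).fit()

print(fit_I.params)
print(fit_col.params)
```

The two fits give the same numbers; only the coefficient label differs. The wrapper matters because, without `I()`, operators such as `*` and `**` are read as formula operators rather than as arithmetic.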

* Run the new regression that estimates the new equation with tenure and experience squared 
* Run another regression with an interaction term of female and education

In [19]:
reg = smf.ols(formula='np.log(wage) ~ married*female + educ + exper +'
              'I(exper**2) + tenure +I(tenure**2)', data=wage1)
results = reg.fit()
#results.summary()

reg1 = smf.ols(formula='np.log(wage) ~ married + educ + female + I(educ*female) + exper +'
              'I(exper**2) + tenure +I(tenure**2)', data=wage1)
results1 = reg1.fit()
#results1.summary()

reg2 = smf.ols(formula='np.log(wage) ~ married + educ*female + exper +'
              'I(exper**2) + tenure +I(tenure**2)', data=wage1)
results2 = reg2.fit()
#results2.summary()

model4 = Stargazer([results, results1, results2])
model4.covariate_order(['Intercept','female' , 'educ' , 'exper', 'tenure', 'married',
                        'married:female' , 'educ:female', 'I(exper ** 2)', 'I(tenure ** 2)', 'I(educ * female)'])
HTML(model4.render_html())

Dependent variable: np.log(wage)

| | (1) | (2) | (3) |
|---|---|---|---|
| Intercept | 0.321*** | 0.390*** | 0.390*** |
| | (0.100) | (0.119) | (0.119) |
| female | -0.110** | -0.220 | -0.220 |
| | (0.056) | (0.168) | (0.168) |
| educ | 0.079*** | 0.081*** | 0.081*** |


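> The other option is to use only the interaction: with `educ*female` there is no need to include the variables on their own, because the formula interface adds `educ` and `female` automatically. This is why columns (2) and (3) report the same estimates.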
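A small sketch of how the `*` operator is expanded, using a tiny made-up data frame (`toy` and its values are purely illustrative). The models are only constructed, not fitted; `exog_names` lists the columns of the design matrix that the formula produces.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Tiny made-up data set, only needed so the design matrix can be built
toy = pd.DataFrame({'lwage': [1.2, 1.9, 1.1, 1.6, 1.4, 2.0],
                    'educ': [12, 16, 10, 14, 12, 18],
                    'female': [0, 1, 1, 0, 1, 0]})

# educ*female is shorthand for educ + female + educ:female
names_star = smf.ols('lwage ~ educ*female', data=toy).exog_names
names_long = smf.ols('lwage ~ educ + female + educ:female', data=toy).exog_names

print(names_star)
print(names_long)
```

Writing only `educ:female` would include the interaction term without the two main effects, which is a different model.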

## Boolean Variables

To store qualitative yes-or-no information, Python uses **Boolean variables**. Instead of being transformed into 0/1 dummy variables, they can be used directly as regressors; in the output their coefficient is then named `varname[T.True]`. These variables are treated such that **True=1** and **False=0**.

Below we take the female dummy variable, recode it as a boolean variable, and include it in the regression. 


In [20]:
# Create the boolean variable from the female dummy 
wage1['isfemale'] = (wage1['female'] == 1)

wage1['isfemale'].value_counts()
#wage1[['isfemale']].describe() 

False    274
True     252
Name: isfemale, dtype: int64

## Regression with a boolean variable

In [21]:
# regression with boolean variable:

m6 = smf.ols(formula='np.log(wage) ~ isfemale + educ + exper + tenure', data=wage1)
m6 = m6.fit()


m6s = Stargazer([m6])
m6s.covariate_order(['Intercept','isfemale[T.True]' , 'educ' , 'exper', 'tenure'])
m6s.rename_covariates({'isfemale[T.True]': 'Female:True'})
HTML(m6s.render_html())

Dependent variable: np.log(wage)

| | (1) |
|---|---|
| Intercept | 0.501*** |
| | (0.102) |
| Female:True | -0.301*** |
| | (0.037) |
| educ | 0.087*** |
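The equivalence between a 0/1 dummy and its boolean version can also be verified on simulated data. In this sketch the data frame `sim` and the coefficients used to generate `lwage` are invented for illustration; only the naming pattern `isfemale[T.True]` is taken from the output above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical names and coefficients)
rng = np.random.default_rng(3)
n = 250
female = rng.integers(0, 2, size=n)
educ = rng.integers(8, 19, size=n)
lwage = 0.5 + 0.08 * educ - 0.3 * female + rng.normal(scale=0.4, size=n)
sim = pd.DataFrame({'lwage': lwage, 'educ': educ, 'female': female})
sim['isfemale'] = (sim['female'] == 1)   # boolean copy of the dummy

fit_dummy = smf.ols('lwage ~ female + educ', data=sim).fit()
fit_bool = smf.ols('lwage ~ isfemale + educ', data=sim).fit()

# Same estimate, different label
print(fit_dummy.params['female'])
print(fit_bool.params['isfemale[T.True]'])
```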


## Categorical variables


When estimating a linear regression in Python using **statsmodels**, you can easily treat any variable as categorical by wrapping it in the function `C()` in the formula. The **ols** function will then add *g-1* dummy variables if the variable has *g* categories. By default the first category is left out as the reference category. 

When you use a categorical variable with many categories, you have to choose a reference category: this is the omitted category that avoids perfect collinearity. By default Python leaves out the first category, but we can pass a second argument to `C()` to provide a new reference group `somegroup`, using **Treatment("somegroup")**. 
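Here is a minimal sketch of both points on simulated data, assuming a made-up categorical variable `group` with three categories `a`, `b` and `c`. It shows that three categories produce two dummies, and that changing the reference group relabels the coefficients without changing the fit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical): outcome shifts by group
rng = np.random.default_rng(4)
n = 300
group = rng.choice(['a', 'b', 'c'], size=n)
effect = pd.Series(group).map({'a': 0.0, 'b': 0.2, 'c': -0.1}).to_numpy()
y = 1.0 + effect + rng.normal(scale=0.3, size=n)
sim = pd.DataFrame({'y': y, 'group': group})

# Default: the first category in sorted order ('a') is the reference
fit_a = smf.ols('y ~ C(group)', data=sim).fit()
# Same model with 'c' as the reference group
fit_c = smf.ols('y ~ C(group, Treatment("c"))', data=sim).fit()

print(fit_a.params)   # Intercept plus dummies for b and c
print(fit_c.params)   # Intercept plus dummies for a and b
print(fit_a.rsquared, fit_c.rsquared)
```

Each coefficient is a difference relative to the reference group, so the coefficient on `a` with reference `c` is minus the coefficient on `c` with reference `a`.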

The code below shows how categorical variables are used.

* Table of categories and frequencies for two categorical variables, gender and occupation
* What type of variable is occupation?
* Regression with dummies for many categories from a categorical variable 

In [22]:
CPS1985 = pd.read_csv('/Users/gavinmason/Downloads/CPS1985.csv')
# rename variable to make outputs more compact:
CPS1985['oc'] = CPS1985['occupation']
CPS1985['gender'].describe()

count      534
unique       2
top       male
freq       289
Name: gender, dtype: object

In [23]:
# table of categories and frequencies for two categorical variables:
pd.crosstab(CPS1985['gender'], columns='count')

| gender | count |
|--------|-------|
| female | 245 |
| male | 289 |


In [24]:
freq_occupation = pd.crosstab(CPS1985['oc'], columns='count')
freq_occupation

| oc | count |
|----|-------|
| management | 55 |
| office | 97 |
| sales | 38 |
| services | 83 |
| technical | 105 |
| worker | 156 |


In [25]:
# directly using categorical variables in regression formula:
m7 = smf.ols(formula='np.log(wage) ~ education + experience + C(oc)', data=CPS1985)
m7 = m7.fit()

# print regression table:
m7s = Stargazer([m7])

HTML(m7s.render_html())

Dependent variable: np.log(wage)

| | (1) |
|---|---|
| `C(oc)[T.office]` | -0.291*** |
| | (0.078) |
| `C(oc)[T.sales]` | -0.369*** |
| | (0.096) |
| `C(oc)[T.services]` | -0.397*** |


### Choosing a new reference category


In [26]:
# rerun regression with different reference category:
reg_newref = smf.ols(formula='np.log(wage) ~ education + experience + '
                             'C(gender, Treatment("male")) + '
                             'C(oc,Treatment("technical"))', data=CPS1985)
m8 = reg_newref.fit()

# print regression table:
m8s = Stargazer([m8])
HTML(m8s.render_html())

Dependent variable: np.log(wage)

| | (1) |
|---|---|
| `C(gender, Treatment("male"))[T.female]` | -0.224*** |
| | (0.042) |
| `C(oc, Treatment("technical"))[T.management]` | 0.010 |
| | (0.074) |
| `C(oc, Treatment("technical"))[T.office]` | -0.197*** |


# ANOVA tables 

When working with categorical variables, polynomials or other specifications, the influence of one variable is captured by several regressors. In our example below the effect of occupation is captured by five regressors, one for each of its dummy variables. 

Our model is of the form:

$$log(wage) = \beta_0 + \beta_1* education + \beta_2*experience + \\  \beta_3*gender + \beta_4*office + \beta_5*sales + \beta_6*services + \beta_7*technical  + \beta_8*worker + u $$

The significance of occupation can be assessed using an F test of 

$$ H_0: \beta_4 = \beta_5 = \beta_6 = \beta_7 = \beta_8 = 0.$$

A type II ANOVA (analysis of variance) table does exactly this for each variable in the model and displays the results in a clearly arranged table. **statsmodels** implements this in the function `anova_lm`.



In [27]:
# run regression:
reg = smf.ols(
    formula='np.log(wage) ~ education + experience + gender + occupation',
    data=CPS1985)
results = reg.fit()

# print regression table:
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | np.log(wage) | R-squared: | 0.318 |
| Model: | OLS | Adj. R-squared: | 0.307 |
| Method: | Least Squares | F-statistic: | 30.57 |
| Date: | Fri, 18 Nov 2022 | Prob (F-statistic): | 2.55e-39 |
| Time: | 16:59:24 | Log-Likelihood: | -313.8 |
| No. Observations: | 534 | AIC: | 645.6 |
| Df Residuals: | 525 | BIC: | 684.1 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.9050 | 0.172 | 5.272 | 0.000 | 0.568 | 1.242 |
| gender[T.male] | 0.2238 | 0.042 | 5.298 | 0.000 | 0.141 | 0.307 |
| occupation[T.office] | -0.2073 | 0.078 | -2.670 | 0.008 | -0.360 | -0.055 |
| occupation[T.sales] | -0.3601 | 0.094 | -3.846 | 0.000 | -0.544 | -0.176 |
| occupation[T.services] | -0.3626 | 0.082 | -4.430 | 0.000 | -0.523 | -0.202 |
| occupation[T.technical] | -0.0101 | 0.074 | -0.136 | 0.892 | -0.155 | 0.135 |
| occupation[T.worker] | -0.1525 | 0.076 | -1.998 | 0.046 | -0.303 | -0.003 |
| education | 0.0759 | 0.010 | 7.545 | 0.000 | 0.056 | 0.096 |
| experience | 0.0119 | 0.002 | 7.089 | 0.000 | 0.009 | 0.015 |

| | | | |
|---|---|---|---|
| Omnibus: | 22.197 | Durbin-Watson: | 1.887 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 48.572 |
| Skew: | -0.187 | Prob(JB): | 2.84e-11 |
| Kurtosis: | 4.429 | Cond. No. | 257.0 |


 > See the ANOVA table below: the column df indicates that the test for occupation uses 5 parameters. All other variables enter the table with a single parameter.

In [28]:
# ANOVA table:
table_anova = sm.stats.anova_lm(results, typ=2)
table_anova

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| gender | 5.414018 | 1.0 | 28.067296 | 1.727015e-07 |
| occupation | 7.152529 | 5.0 | 7.416013 | 9.805485e-07 |
| education | 10.980589 | 1.0 | 56.92545 | 2.010374e-13 |
| experience | 9.695055 | 1.0 | 50.261001 | 4.365391e-12 |
| Residual | 101.269451 | 525.0 | | |



# Numeric variables into categories

Sometimes we need to turn numerical variables into categories, because a linear relation with the dependent variable seems implausible, because the interpretation is inconvenient, or simply because we want a different interpretation. 

In the example below the variable `rank` is the rank of the law school, a number between 1 and 175. We would like to compare schools across the groups listed below.

|School Rank | 
|-----------| 
|top 10 |
|11-25 |
|26-40 |
|41-60 |
|61-100 | 
|above 100 | 


In the code below we create a variable for these categories. First we define the cut points, and then we create a new categorical variable based on these cut points using `pd.cut`. 
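Before applying the cut points to the data, it can help to see how `pd.cut` treats values that fall exactly on a boundary. The sketch below uses a handful of made-up ranks (the series `ranks` is not taken from lawsch85) with the same cut points and labels as the code that follows.

```python
import pandas as pd

# A few made-up ranks that sit on or next to the cut points
ranks = pd.Series([1, 10, 11, 25, 60, 61, 100, 101, 175])
cutpts = [0, 10, 25, 40, 60, 100, 175]
labels = ['top 10', '(10,25]', '(25,40]', '(40,60]', '(60,100]', '(100,175]']

# By default each bin excludes its left edge and includes its right edge
cats = pd.cut(ranks, bins=cutpts, labels=labels)
print(pd.DataFrame({'rank': ranks, 'category': cats}))
```

A rank of 10 lands in the first group and a rank of 60 in `(40,60]`, which is why the lowest cut point is 0 rather than 1: a school ranked 1 would otherwise fall outside the first bin.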

In [29]:
lawsch85 = woo.dataWoo('lawsch85')

# define cut points for the rank:
cutpts = [0, 10, 25, 40, 60, 100, 175]

# create categorical variable containing ranges for the rank:
lawsch85['rc'] = pd.cut(lawsch85['rank'], bins= cutpts, 
                       labels=['top 10', '(10,25]', '(25,40]',
                                '(40,60]', '(60,100]', '(100,175]'])

# display frequencies:
freq = pd.crosstab(lawsch85['rc'], columns='count')
freq

| rc | count |
|----|-------|
| top 10 | 10 |
| (10,25] | 16 |
| (25,40] | 13 |
| (40,60] | 18 |
| (60,100] | 37 |
| (100,175] | 62 |


Estimate the following equation $$ log(salary)= \beta_0 +\beta_1*rankcat + \beta_2*LSAT + \beta_3*GPA + \beta_4*log(libvol) + \beta_5*log(cost)$$ where rankcat stands for the school-rank categories stored in `rc`. But first, set the reference category for the school ranking. 

>  Choose the reference category: we want the last group, (100,175], as the reference, so we pass `Treatment("(100,175]")` as the second argument of `C()` in the formula.

In [30]:
# run regression:
reg = smf.ols(formula='np.log(salary) ~ C(rc, Treatment("(100,175]")) +'
              'LSAT + GPA + np.log(libvol)+ np.log(cost)',
              data=lawsch85)
results = reg.fit()

# print regression table
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | np.log(salary) | R-squared: | 0.911 |
| Model: | OLS | Adj. R-squared: | 0.905 |
| Method: | Least Squares | F-statistic: | 143.2 |
| Date: | Fri, 18 Nov 2022 | Prob (F-statistic): | 9.45e-62 |
| Time: | 16:59:25 | Log-Likelihood: | 146.45 |
| No. Observations: | 136 | AIC: | -272.9 |
| Df Residuals: | 126 | BIC: | -243.8 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 9.1653 | 0.411 | 22.277 | 0.000 | 8.351 | 9.979 |
| `C(rc, Treatment("(100,175]"))[T.top 10]` | 0.6996 | 0.053 | 13.078 | 0.000 | 0.594 | 0.805 |
| `C(rc, Treatment("(100,175]"))[T.(10,25]]` | 0.5935 | 0.039 | 15.049 | 0.000 | 0.515 | 0.672 |
| `C(rc, Treatment("(100,175]"))[T.(25,40]]` | 0.3751 | 0.034 | 11.005 | 0.000 | 0.308 | 0.443 |
| `C(rc, Treatment("(100,175]"))[T.(40,60]]` | 0.2628 | 0.028 | 9.399 | 0.000 | 0.207 | 0.318 |
| `C(rc, Treatment("(100,175]"))[T.(60,100]]` | 0.1316 | 0.021 | 6.254 | 0.000 | 0.090 | 0.173 |
| LSAT | 0.0057 | 0.003 | 1.858 | 0.066 | -0.000 | 0.012 |
| GPA | 0.0137 | 0.074 | 0.185 | 0.854 | -0.133 | 0.161 |
| np.log(libvol) | 0.0364 | 0.026 | 1.398 | 0.165 | -0.015 | 0.088 |

| | | | |
|---|---|---|---|
| Omnibus: | 9.419 | Durbin-Watson: | 1.926 |
| Prob(Omnibus): | 0.009 | Jarque-Bera (JB): | 20.478 |
| Skew: | 0.1 | Prob(JB): | 3.57e-05 |
| Kurtosis: | 4.89 | Cond. No. | 8980.0 |


# Categorical dependent variables 

When you have a categorical dependent variable with two outcomes, you can use a regular OLS model, which is then a linear probability model (LPM), or you can use a logit or probit model.

The Python code for these last two models is shown below. Your y variable must be binary, 0 or 1.

**Estimate logit model:**

>`reg_logit = smf.logit(formula='y ~ x1 + x2 + ...+ xn',
                      data=mydata)`

`disp=0` avoids printing out information during the estimation:

>`results_logit = reg_logit.fit(disp=0)`


**Estimate probit model:**

>`reg_probit = smf.probit(formula='y ~ x1 + x2 + ...+ xn',
                      data=mydata)`

>`results_probit = reg_probit.fit(disp=0)`
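The templates above can be tried out on simulated data. In the sketch below the data frame `mydata`, the single regressor `x1` and the coefficients used to generate `y` are assumptions made for illustration; the same binary outcome is fitted with an LPM, a logit and a probit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical): binary y generated from a logistic model
rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-0.5 + 1.0 * x1)))
y = rng.binomial(1, p)
mydata = pd.DataFrame({'y': y, 'x1': x1})

lpm = smf.ols('y ~ x1', data=mydata).fit()
results_logit = smf.logit('y ~ x1', data=mydata).fit(disp=0)
results_probit = smf.probit('y ~ x1', data=mydata).fit(disp=0)

# The three sets of coefficients are on different scales
print(lpm.params)
print(results_logit.params)
print(results_probit.params)

# Predicted probabilities are easier to compare across models
print(results_logit.predict(mydata.head()))
print(results_probit.predict(mydata.head()))
```

Logit and probit predictions always lie between 0 and 1, whereas the fitted values of the LPM are not restricted to that range.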


&nbsp;
<hr />
<p style="font-family:palatino; text-align: center;font-size: 15px">ECON320 Python Programming Laboratory</p>
<p style="font-family:palatino; text-align: center;font-size: 15px">Professor <em> Paloma Lopez de mesa Moyano</em></p>
<p style="font-family:palatino; text-align: center;font-size: 15px"><span style="color: #6666FF;"><em>paloma.moyano@emory.edu</em></span></p>

<p style="font-family:palatino; text-align: center;font-size: 15px">Department of Economics</p>
<p style="font-family:palatino; text-align: center; color: #012169;font-size: 15px">Emory University</p>

&nbsp;