# Multiple Dummy Variables (First Example)

### Intro and objectives

### In this lab you will learn:
1. Examples of regression models with multiple dummy variables.
2. How to fit these regression models in Python.


## What I hope you'll get out of this lab
* The feeling that you'll "know where to start" when you need to fit a regression model that includes several dummy variables.
* Examples of regression models with dummy variables.
* How to interpret the results obtained.

In [2]:
!pip install wooldridge
import wooldridge as woo
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np



# Example 1. Log Hourly Wage, marital status and gender

#### Let us estimate a model that allows for wage differences among four groups: married men, married women, single men, and single women. 

#### We have four categories, therefore we need a base category and three dummy variables.

#### We select single men as the base category.
#### We need three dummies: married male, married female, and single female.


#### We first create these three dummies and then include them in the model.
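
The model that these dummies enter can be written out as follows. The symbols ($\beta_j$, $\delta_j$, $u$) are notation assumed here for illustration; the regressors are the ones used in the regression fitted later in this lab.

$$
\log(wage) = \beta_0 + \delta_1\, marrmale + \delta_2\, marrfemale + \delta_3\, singfemale + \beta_1\, educ + \beta_2\, exper + \beta_3\, exper^2 + \beta_4\, tenure + \beta_5\, tenure^2 + u
$$

For a single man all three dummies are zero, so $\beta_0$ is his intercept, and each $\delta_j$ measures the shift in the intercept for the corresponding group relative to single men.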


### We use the WAGE1 data, which has n = 526 individuals

In [3]:
Wages = woo.dataWoo('wage1')


In [4]:
Wages.head()

,wage,educ,exper,tenure,nonwhite,female,married,numdep,smsa,northcen,...,trcommpu,trade,services,profserv,profocc,clerocc,servocc,lwage,expersq,tenursq
0,3.1,11,2,0,0,1,0,2,1,0,...,0,0,0,0,0,0,0,1.131402,4,0
1,3.24,12,22,2,0,1,1,3,1,0,...,0,0,1,0,0,0,1,1.175573,484,4
2,3.0,11,2,0,0,0,0,2,0,0,...,0,1,0,0,0,0,0,1.098612,4,0
3,6.0,8,44,28,0,0,1,0,1,0,...,0,0,0,0,0,1,0,1.791759,1936,784
4,5.3,12,7,2,0,0,1,1,0,0,...,0,0,0,0,0,0,0,1.667707,49,4


In [5]:
Wages.describe()

,wage,educ,exper,tenure,nonwhite,female,married,numdep,smsa,northcen,...,trcommpu,trade,services,profserv,profocc,clerocc,servocc,lwage,expersq,tenursq
count,526.0,526.0,526.0,526.0,526.0,526.0,526.0,526.0,526.0,526.0,...,526.0,526.0,526.0,526.0,526.0,526.0,526.0,526.0,526.0,526.0
mean,5.896103,12.562738,17.01711,5.104563,0.102662,0.479087,0.608365,1.043726,0.722433,0.250951,...,0.043726,0.287072,0.10076,0.258555,0.36692,0.1673,0.140684,1.623268,473.435361,78.15019
std,3.693086,2.769022,13.57216,7.224462,0.303805,0.500038,0.48858,1.261891,0.448225,0.433973,...,0.20468,0.452826,0.301298,0.438257,0.482423,0.373599,0.348027,0.531538,616.044772,199.434664
min,0.53,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.634878,1.0,0.0
25%,3.33,12.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.202972,25.0,0.0
50%,4.65,12.0,13.5,2.0,0.0,0.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.536867,182.5,4.0
75%,6.88,14.0,26.0,7.0,0.0,1.0,1.0,2.0,1.0,0.75,...,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.928619,676.0,49.0
max,24.98,18.0,51.0,44.0,1.0,1.0,1.0,6.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.218076,2601.0,1936.0


In [6]:
type(Wages)

pandas.core.frame.DataFrame

In [8]:
Wages.columns

Index(['wage', 'educ', 'exper', 'tenure', 'nonwhite', 'female', 'married',
       'numdep', 'smsa', 'northcen', 'south', 'west', 'construc', 'ndurman',
       'trcommpu', 'trade', 'services', 'profserv', 'profocc', 'clerocc',
       'servocc', 'lwage', 'expersq', 'tenursq'],
      dtype='object')

## We need to create the three new dummies: marrmale, marrfemale and singfemale

In [21]:
## We set marrmale=1 if the observation corresponds to a married male, and 0 otherwise.
Wages['marrmale']=0
Wages.loc[(Wages['female']==0) & (Wages['married']==1),'marrmale']=1

In [23]:
## We set marrfemale=1 if the observation corresponds to a married female, and 0 otherwise.
Wages['marrfemale']=0
Wages.loc[(Wages['female']==1) & (Wages['married']==1),'marrfemale']=1

In [25]:
## We set singfemale=1 if the observation corresponds to a single female, and 0 otherwise.
Wages['singfemale']=0
Wages.loc[(Wages['female']==1) & (Wages['married']==0),'singfemale']=1
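
As a hedged alternative to the three cells above, the same dummies can be built in one step each by casting a boolean mask to integers, and then checked for consistency. This is a minimal sketch on a small hypothetical DataFrame (`toy`) that stands in for the `female` and `married` columns of `Wages`; the names `toy`, `dummy_sum`, `n_base` and `group_counts` are illustrative and not part of the original lab.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'female' and 'married' columns of Wages
toy = pd.DataFrame({
    "female":  [0, 0, 1, 1, 0, 1],
    "married": [1, 0, 1, 0, 1, 0],
})

# A boolean mask cast to int gives a 0/1 dummy in a single step
toy["marrmale"] = ((toy["female"] == 0) & (toy["married"] == 1)).astype(int)
toy["marrfemale"] = ((toy["female"] == 1) & (toy["married"] == 1)).astype(int)
toy["singfemale"] = ((toy["female"] == 1) & (toy["married"] == 0)).astype(int)

# Each row should belong to at most one dummy group;
# rows where all three dummies are zero form the base group (single men)
dummy_sum = toy[["marrmale", "marrfemale", "singfemale"]].sum(axis=1)
n_base = int((dummy_sum == 0).sum())

# Cross-tabulating the original indicators shows the size of each of the four groups
group_counts = pd.crosstab(toy["female"], toy["married"])
print(group_counts)
print("single men (base group):", n_base)
```

Running the same checks on `Wages` instead of `toy` is a quick way to confirm that the three dummies are mutually exclusive before fitting the model.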

In [27]:
Wages.sample(10)

,wage,educ,exper,tenure,nonwhite,female,married,numdep,smsa,northcen,...,profserv,profocc,clerocc,servocc,lwage,expersq,tenursq,marrmale,marrfemale,singfemale
428,3.35,12,9,1,0,0,1,0,1,1,...,0,0,0,0,1.20896,81,1,1,0,0
191,6.25,16,9,2,0,0,1,1,0,1,...,1,1,0,0,1.832582,81,4,1,0,0
34,4.68,12,3,0,0,1,0,0,1,0,...,0,0,1,0,1.543298,9,0,0,0,1
257,3.0,13,3,1,0,0,0,0,1,0,...,0,0,0,0,1.098612,9,1,0,0,0
373,3.13,12,6,5,0,1,1,1,1,0,...,1,0,1,0,1.141033,36,25,0,1,0
328,4.79,12,12,3,0,1,1,3,1,1,...,1,1,0,0,1.56653,144,9,0,1,0
114,4.2,14,33,16,0,1,1,0,1,0,...,1,1,0,0,1.435084,1089,256,0,1,0
484,2.9,12,39,1,0,1,1,0,0,0,...,1,0,0,1,1.064711,1521,1,0,1,0
466,3.35,15,3,1,0,1,1,2,1,0,...,0,0,1,0,1.20896,9,1,0,1,0
117,3.64,12,7,2,0,0,0,0,1,0,...,1,0,0,1,1.291984,49,4,0,0,0


In [28]:
# We specify a linear model for log(wage) with the three dummies and the controls.
# We use the Wages DataFrame (WAGE1) as the empirical dataset.

reg = smf.ols(formula='np.log(wage) ~ marrmale + marrfemale + singfemale + educ + exper + np.power(exper, 2) + tenure + np.power(tenure, 2)', data=Wages)

In [29]:
# We fit the model
results = reg.fit()


In [30]:
b = results.params
print(f'b: \n{b}\n')

b: 
Intercept              0.321378
marrmale               0.212676
marrfemale            -0.198268
singfemale            -0.110350
educ                   0.078910
exper                  0.026801
np.power(exper, 2)    -0.000535
tenure                 0.029088
np.power(tenure, 2)   -0.000533
dtype: float64



In [32]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:           np.log(wage)   R-squared:                       0.461
Model:                            OLS   Adj. R-squared:                  0.453
Method:                 Least Squares   F-statistic:                     55.25
Date:                Sun, 20 Nov 2022   Prob (F-statistic):           1.28e-64
Time:                        11:58:58   Log-Likelihood:                -250.96
No. Observations:                 526   AIC:                             519.9
Df Residuals:                     517   BIC:                             558.3
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               0.3214    

## How do we interpret the equation?

#### The R-squared is relatively large (0.461): the model explains about 46% of the variation in log(wage).
#### The F-statistic is large (55.25) with a p-value close to zero. Therefore the regressors are jointly statistically significant.

#### All coefficients are statistically significant at the 5% level, based on their t-statistics and related p-values.

#### To interpret the coefficients on the dummy variables, we must remember that the base group is single males. 

#### Also please bear in mind that we are fitting a log-level model when interpreting each coefficient.

#### The estimates on the three dummy variables measure the approximate proportionate difference in wage relative to single males, holding the other variables fixed.

#### Married men are estimated to earn about 21.3% more than single men, holding levels of education, experience, and tenure fixed. 

#### A married woman earns a predicted 19.8% less than a single man with the same levels of the other variables.

#### A single woman earns a predicted 11.0% less than a single man with the same levels of the other variables.
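
The percentages quoted above read each coefficient directly as a proportionate change, which is an approximation in a log-level model. A minimal sketch of the exact conversion, 100·(exp(b) − 1), is shown below. The coefficients are copied from the `results.params` output above; the helper `exact_pct` and the other variable names are illustrative assumptions, not part of the original lab.

```python
import math

# Dummy coefficients copied from the results.params output above
coefs = {
    "marrmale": 0.212676,
    "marrfemale": -0.198268,
    "singfemale": -0.110350,
}

def exact_pct(b):
    """Exact percentage difference implied by a log-level coefficient b."""
    return 100 * (math.exp(b) - 1)

# Compare the quick approximation (100*b) with the exact figure for each group
exact = {name: exact_pct(b) for name, b in coefs.items()}
for name, b in coefs.items():
    print(f"{name}: approx {100 * b:.1f}%, exact {exact[name]:.1f}%")

# Two non-base groups can be compared by differencing their coefficients,
# for example married women relative to single women
women_gap = exact_pct(coefs["marrfemale"] - coefs["singfemale"])
print(f"married vs single women: exact {women_gap:.1f}%")
```

The gap between the approximation and the exact figure grows with the size of the coefficient, so it matters most for the married-male estimate. The comparison between married and single women is only a point estimate: its standard error cannot be read off the output above, and one way to obtain it would be to refit the model with a different base group.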
