Source: https://www.practicaldatascience.org/html/exercises/Exercise_statsmodels.html

In this exercise, we will attempt to use this data to answer the following questions:

1. Do mothers who smoke tend to give birth to babies with lower weights than mothers who do not smoke?

2. What is a likely range for the difference in birth weights for smokers and non-smokers?

3. Is there any evidence that the association between smoking and birth weight differs by mother’s race? If so, characterize those differences.

4. Are there other interesting associations with birth weight that are worth mentioning?

__(1) Load the data “smoking.csv”, which includes information on both biometrics of infants at birth, and information on mothers (variables prefixed with the letter “m”), from this MIDS repo. (Yup, I’m giving you CLEAN DATA! I think this is the only time I’ve done this in this course! Enjoy it. :)).__

In [3]:
!curl -LJO https://media.githubusercontent.com/media/nickeubank/MIDS_Data/master/smoking.csv



In [4]:
import pandas as pd
df = pd.read_csv('smoking.csv')

In [5]:
df.head()

Unnamed: 0,id,date,gestation,bwt.oz,parity,mrace,mage,med,mht,mpregwt,inc,smoke
0,4604,1598,148,116,7,7,28,1,66,135,2,0
1,7435,1527,181,110,7,7,27,1,64,133,1,0
2,7722,1563,204,55,11,7,35,3,65,140,6,0
3,2026,1503,225,132,4,7,28,2,67,148,3,0
4,3553,1638,233,105,4,7,34,3,61,130,3,0


In [26]:
# Replace '.' with '_' in column names (regex=False treats '.' literally)
df.columns = df.columns.str.replace('.', '_', regex=False)

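A note on the pattern argument: in a regular expression `.` matches any character, so it matters whether the replacement treats `'.'` literally. The sketch below uses a hypothetical two-column index rather than the real data, and uses Python's `re` module to show what the regex reading of `.` would do.

```python
import re
import pandas as pd

# Hypothetical column index, for illustration only
cols = pd.Index(["bwt.oz", "mage"])

# regex=False treats "." as a literal dot
renamed = cols.str.replace(".", "_", regex=False)
print(list(renamed))  # ['bwt_oz', 'mage']

# Read as a regular expression, "." matches every character
as_regex = re.sub(".", "_", "bwt.oz")
print(as_regex)  # ______
```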


__(2) Start by plotting the relationship between infant weight at birth and gestation (length of pregnancy, in days, at time of birth) for both children of mothers who smoke and children of mothers who do not. Limit attention to children who reach at least 225 days of gestation (there aren’t really any observations below that for parents who smoke, so we don’t get common support). Does it seem like birthweights tend to be lower for the children of parents who smoke at a given gestational period?__

In [27]:
df.describe()

Unnamed: 0,id,date,gestation,bwt_oz,parity,mrace,mage,med,mht,mpregwt,inc,smoke
count,869.0,869.0,869.0,869.0,869.0,869.0,869.0,869.0,869.0,869.0,869.0,869.0
mean,6032.418872,1536.423475,278.50748,118.360184,1.952819,2.995397,27.294591,2.932106,64.069045,128.478711,3.681243,0.463751
std,2241.559842,106.950655,15.698968,18.050756,1.881595,3.111962,5.708005,1.434031,2.533612,20.778424,2.284667,0.498971
min,15.0,1350.0,148.0,55.0,0.0,0.0,15.0,0.0,53.0,87.0,0.0,0.0
25%,5477.0,1444.0,272.0,108.0,1.0,0.0,23.0,2.0,62.0,113.0,2.0,0.0
50%,6734.0,1540.0,279.0,119.0,2.0,2.0,26.0,2.0,64.0,125.0,3.0,0.0
75%,7587.0,1627.0,286.0,129.0,3.0,7.0,31.0,4.0,66.0,140.0,5.0,1.0
max,9263.0,1714.0,338.0,174.0,11.0,9.0,45.0,7.0,72.0,220.0,9.0,1.0


In [30]:
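import altair as alt
# Keep births with at least 225 days of gestation, as the prompt specifies
cond = df.gestation >= 225

alt.Chart(df[cond]).mark_circle(size=60).encode(
    x='gestation',
    y='bwt_oz',
    color='smoke:N'  # treat the 0/1 smoking flag as a category
).interactive()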

__(3) Now check this relationship using statsmodels. Regress birthweight on gestational period and whether the infant’s mother smoked.__

NOTE: you may hit a problem because of the name of one of your columns. You should probably be able to guess the problem given your experience with Python.

In [31]:
import statsmodels.api as sm

In [36]:
# Add constant
df = sm.add_constant(df, prepend=False).copy()

In [38]:
df.columns

Index(['id', 'date', 'gestation', 'bwt_oz', 'parity', 'mrace', 'mage', 'med',
       'mht', 'mpregwt', 'inc', 'smoke', 'const'],
      dtype='object')

In [57]:
mod = sm.OLS(df['bwt_oz'], df[['smoke','gestation','const']])
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                 bwt_oz   R-squared:                       0.207
Model:                            OLS   Adj. R-squared:                  0.205
Method:                 Least Squares   F-statistic:                     113.2
Date:                Sat, 07 Aug 2021   Prob (F-statistic):           2.15e-44
Time:                        17:00:18   Log-Likelihood:                -3645.8
No. Observations:                 869   AIC:                             7298.
Df Residuals:                     866   BIC:                             7312.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
smoke         -8.2049      1.096     -7.483      0.0
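The same regression can also be written with the statsmodels formula interface, which is where the column-name issue from the note shows up: inside a formula, `bwt.oz` would be read as Python attribute access rather than as a column name. The sketch below runs on synthetic stand-in data (made-up numbers, not `smoking.csv`), so its estimates mean nothing; it only shows the calls. `conf_int()` gives a range for the smoking coefficient, which speaks to question 2, and `Q()` is a formula helper for quoting awkward column names instead of renaming them.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for smoking.csv: made-up numbers, NOT the real data
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "gestation": rng.normal(279, 15, n).round(),
    "smoke": rng.integers(0, 2, n),
})
demo["bwt_oz"] = (
    -10 + 0.45 * demo["gestation"] - 8 * demo["smoke"] + rng.normal(0, 15, n)
)

# Formula version of the regression; an intercept is added automatically
res_f = smf.ols("bwt_oz ~ gestation + smoke", data=demo).fit()

# 95% confidence interval for the smoking coefficient (alpha=0.05 is the default)
low, high = res_f.conf_int().loc["smoke"]
print(f"smoke: {res_f.params['smoke']:.2f}, 95% CI [{low:.2f}, {high:.2f}]")

# With the original dotted name, Q() quotes the column instead of renaming it
raw = demo.rename(columns={"bwt_oz": "bwt.oz"})
res_q = smf.ols('Q("bwt.oz") ~ gestation + smoke', data=raw).fit()
```

Both fits estimate the same coefficients; only the way the outcome column is referenced differs.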

__(4) Now let’s expand our model to also take into account mothers’ pregnancy weight and race (make sure to treat race as a categorical variable! If you’re rusty on categoricals and indicator variables, here’s a little refresher.).__

In [58]:
mod = sm.OLS(df['bwt_oz'], df[['smoke','gestation','mpregwt','const']])
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                 bwt_oz   R-squared:                       0.228
Model:                            OLS   Adj. R-squared:                  0.226
Method:                 Least Squares   F-statistic:                     85.33
Date:                Sat, 07 Aug 2021   Prob (F-statistic):           2.28e-48
Time:                        17:01:00   Log-Likelihood:                -3634.1
No. Observations:                 869   AIC:                             7276.
Df Residuals:                     865   BIC:                             7295.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
smoke         -7.9690      1.083     -7.355      0.0
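The cell above adds `mpregwt` but not race. One way to include race as a categorical variable is the formula interface's `C()` wrapper, which expands a column into indicator variables and drops one level as the reference group. This is a minimal sketch on synthetic stand-in data: the numbers and the set of race codes are made up for illustration, so only the syntax carries over to the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for smoking.csv: made-up numbers, NOT the real data
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "gestation": rng.normal(279, 15, n).round(),
    "smoke": rng.integers(0, 2, n),
    "mpregwt": rng.normal(128, 20, n).round(),
    "mrace": rng.choice([0, 6, 7, 8, 9], n),  # hypothetical race codes
})
demo["bwt_oz"] = (
    -30 + 0.45 * demo["gestation"] - 8 * demo["smoke"]
    + 0.1 * demo["mpregwt"] + rng.normal(0, 15, n)
)

# C(mrace) tells the formula parser to treat the numeric codes as categories
res_cat = smf.ols(
    "bwt_oz ~ smoke + gestation + mpregwt + C(mrace)", data=demo
).fit()

# One coefficient per non-reference race level
race_terms = res_cat.params.filter(like="C(mrace)")
print(race_terms)
```

Each race coefficient is a difference in expected birth weight relative to the omitted reference level, holding the other regressors fixed.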

__(5) Now let’s test for whether there is an interaction between the mother’s race and the effect of smoking.__

In [62]:
df['is_white'] = (df['mrace'] <= 5).astype(int)

In [70]:
mod = sm.OLS(df['bwt_oz'], df[['smoke','is_white','gestation','mpregwt','const']])
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                 bwt_oz   R-squared:                       0.242
Model:                            OLS   Adj. R-squared:                  0.238
Method:                 Least Squares   F-statistic:                     68.95
Date:                Sun, 08 Aug 2021   Prob (F-statistic):           1.12e-50
Time:                        13:42:53   Log-Likelihood:                -3626.4
No. Observations:                 869   AIC:                             7263.
Df Residuals:                     864   BIC:                             7287.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
smoke         -8.4379      1.081     -7.806      0.0
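The model in the cell above includes `is_white` only as a main effect. Testing whether the smoking association differs by race needs a product term as well. A minimal sketch with the formula interface follows, where `smoke * is_white` expands to both main effects plus the interaction `smoke:is_white`. The data are synthetic stand-ins with made-up effect sizes, so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for smoking.csv: made-up numbers, NOT the real data
rng = np.random.default_rng(2)
n = 500
demo = pd.DataFrame({
    "gestation": rng.normal(279, 15, n).round(),
    "smoke": rng.integers(0, 2, n),
    "mpregwt": rng.normal(128, 20, n).round(),
    "is_white": rng.integers(0, 2, n),
})
demo["bwt_oz"] = (
    -30 + 0.45 * demo["gestation"] + 0.1 * demo["mpregwt"]
    - 10 * demo["smoke"] + 3 * demo["is_white"]
    + 5 * demo["smoke"] * demo["is_white"] + rng.normal(0, 15, n)
)

# smoke * is_white = smoke + is_white + smoke:is_white
res_int = smf.ols(
    "bwt_oz ~ smoke * is_white + gestation + mpregwt", data=demo
).fit()

print(res_int.params[["smoke", "is_white", "smoke:is_white"]])
# The p-value on the product term is the test for an interaction
print(res_int.pvalues["smoke:is_white"])
```

In this parameterization the `smoke` coefficient is the smoking difference for mothers with `is_white == 0`, and `smoke:is_white` is how much that difference changes for mothers with `is_white == 1`.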

__(6) Using post-regression test syntax (not by running a new regression on a subpopulation), recover the coefficient and t-statistic for whether smoking reduces birth weight for white mothers. How does this coefficient compare to that for non-white mothers?__

The reduction in birth weight associated with smoking for white mothers appears to be about 40% of the reduction for non-white mothers.

Not sure how to use the function `statsmodels.regression.linear_model.OLSResults.t_test` to do that. If you do I would love to hear from you :)
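One possible way to do this with `t_test`, sketched on synthetic stand-in data (made-up numbers, not `smoking.csv`): in a model with a `smoke:is_white` product term, the smoking coefficient for white mothers is the sum of the `smoke` and `smoke:is_white` coefficients, and `t_test` accepts a contrast vector that picks out that sum. The model and variable names below are assumptions for illustration, not the fit from the cells above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for smoking.csv: made-up numbers, NOT the real data
rng = np.random.default_rng(3)
n = 500
demo = pd.DataFrame({
    "gestation": rng.normal(279, 15, n).round(),
    "smoke": rng.integers(0, 2, n),
    "mpregwt": rng.normal(128, 20, n).round(),
    "is_white": rng.integers(0, 2, n),
})
demo["bwt_oz"] = (
    -30 + 0.45 * demo["gestation"] + 0.1 * demo["mpregwt"]
    - 10 * demo["smoke"] + 3 * demo["is_white"]
    + 5 * demo["smoke"] * demo["is_white"] + rng.normal(0, 15, n)
)
res_int = smf.ols(
    "bwt_oz ~ smoke * is_white + gestation + mpregwt", data=demo
).fit()

# Contrast vector: 1 on smoke and on smoke:is_white, 0 elsewhere
names = list(res_int.params.index)
contrast = np.zeros(len(names))
contrast[names.index("smoke")] = 1
contrast[names.index("smoke:is_white")] = 1

# Tests H0: smoke + smoke:is_white = 0 without refitting on a subsample
tt = res_int.t_test(contrast)
white_effect = float(np.squeeze(tt.effect))
white_t = float(np.squeeze(tt.tvalue))
print(tt)

# For comparison, the non-white smoking coefficient is the plain smoke term
nonwhite_effect = res_int.params["smoke"]
```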

__(7) Now let’s use this model to predict some values. Let’s generate some hypothetical newborns:__
```
newborns = pd.DataFrame({'smoke': [True, True, False, False],
                         'white': [True, False, True, False],
                         'gestation': [253, 300, 248, 287],
                         'mpregwt': [132, 129, 140, 139]})
```
Using the model you ran above with gestation, smoke, mpregwt, white, and the interaction of white and smoke, predict birth weights for these newborns.

In [71]:
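newborns = pd.DataFrame({'smoke': [True, True, False, False],
                         'white': [True, False, True, False],
                         'gestation': [253, 300, 248, 287],
                         'mpregwt': [132, 129, 140, 139],
                         'const': [1, 1, 1, 1]})

In [78]:
# res was fit on plain columns (no formula), so predict() matches columns by
# position, not by name: pass them in the order used when fitting
# (smoke, is_white, gestation, mpregwt, const).
# Note that this model has no white-by-smoke interaction term.
res.predict(newborns[['smoke', 'white', 'gestation', 'mpregwt', 'const']])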

0    105.397537
1    119.084095
2    112.903194
3    123.642341
dtype: object
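The prompt asks for predictions from a model that includes the interaction of white and smoke. A minimal sketch of that workflow with the formula interface follows, fit on synthetic stand-in data (made-up numbers, not `smoking.csv`), so the predicted values are not comparable to the ones above. With a formula model, `predict` looks columns up by name, so the `white` column is renamed to the assumed regressor name `is_white` and the booleans are cast to integers to match the training columns.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for smoking.csv: made-up numbers, NOT the real data
rng = np.random.default_rng(4)
n = 500
demo = pd.DataFrame({
    "gestation": rng.normal(279, 15, n).round(),
    "smoke": rng.integers(0, 2, n),
    "mpregwt": rng.normal(128, 20, n).round(),
    "is_white": rng.integers(0, 2, n),
})
demo["bwt_oz"] = (
    -30 + 0.45 * demo["gestation"] + 0.1 * demo["mpregwt"]
    - 10 * demo["smoke"] + 3 * demo["is_white"]
    + 5 * demo["smoke"] * demo["is_white"] + rng.normal(0, 15, n)
)
res_int = smf.ols(
    "bwt_oz ~ smoke * is_white + gestation + mpregwt", data=demo
).fit()

# Hypothetical newborns from the prompt
newborns = pd.DataFrame({'smoke': [True, True, False, False],
                         'white': [True, False, True, False],
                         'gestation': [253, 300, 248, 287],
                         'mpregwt': [132, 129, 140, 139]})

# Match the training column name and dtype; no constant column is needed
new_x = newborns.rename(columns={"white": "is_white"}).astype(int)
preds = res_int.predict(new_x)
print(preds)
```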