Risk factors for low birthweight
================================

This notebook illustrates a logistic regression analysis using Python Statsmodels. The data record low birth weight status and potential risk factors for a sample of newborns. Each record in the data set corresponds to one infant. The primary variable of interest is whether the baby was born with low birth weight, defined as a birth weight of less than 2500 grams. The other variables in the data set may be used as predictors of a baby being born with low birth weight.

A description of the data is [here](http://vincentarelbundock.github.io/Rdatasets/doc/COUNT/lbw.html).  The data can be downloaded [here](http://vincentarelbundock.github.io/Rdatasets/csv/COUNT/lbw.csv).

These are the standard import statements:

In [0]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

Next we read in the data set and check its size.

In [0]:
data = pd.read_csv("lbw.csv")
print(data.shape)

The `head` method displays the first few rows of the data set, so we know what we are working with:

In [0]:
print(data.head())

It's optional, but the first column is not needed so we can drop it.

In [0]:
del data["Unnamed: 0"]

It's a good idea to check the data types to make sure there are no surprises.

In [0]:
print(data.dtypes)

We can also check whether any data are missing.

In [0]:
pd.isnull(data).sum()

In [0]:
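# Recode race from integer codes to labels so that the model formula
# treats it as a categorical variable
data["race"] = data["race"].replace({1: "white", 2: "black", 3: "other"})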

We don't have a lot of information about this data set, but we can see that the frequency of low birth weight is over 30%, whereas in the general population it is less than 10%.  Thus the data set is not representative of the general population.  It may be from a case/control study, or from a study of a high risk population.

In [0]:
data.low.mean()

Now we can fit a logistic regression model containing additive effects for all covariates. The `GLM` class fits many types of generalized linear models (GLMs). Choosing the `Binomial` family, whose default link is the logit, makes it a logistic regression.

In [0]:
model1 = sm.GLM.from_formula("low ~ age + smoke + race + lwt + ptl + ht + ui + ftv", family=sm.families.Binomial(), data=data)
result1 = model1.fit()
print(result1.summary())

There are various ways to select a more parsimonious model from the large model we fit above. Here we will do a manual "backward elimination", dropping the variable with the smallest absolute Z-score and refitting the model. We repeat this process until a stopping point is reached (discussed further below).

In [0]:
# drop ftv
model2 = sm.GLM.from_formula("low ~ age + smoke + race + lwt + ptl + ht + ui", family=sm.families.Binomial(), data=data)
result2 = model2.fit()
print(result2.summary())

In [0]:
# drop age
model3 = sm.GLM.from_formula("low ~ smoke + race + lwt + ptl + ht + ui", family=sm.families.Binomial(), data=data)
result3 = model3.fit()
print(result3.summary())

In [0]:
# combine other and black race
data["white_race"] = (data.race == "white")
model4 = sm.GLM.from_formula("low ~ smoke + white_race + lwt + ptl + ht + ui", family=sm.families.Binomial(), data=data)
result4 = model4.fit()
print(result4.summary())

In [0]:
# drop ptl
model5 = sm.GLM.from_formula("low ~ smoke + white_race + lwt + ht + ui", family=sm.families.Binomial(), data=data)
result5 = model5.fit()
print(result5.summary())

In [0]:
# drop ui
model6 = sm.GLM.from_formula("low ~ smoke + white_race + lwt + ht", family=sm.families.Binomial(), data=data)
result6 = model6.fit()
print(result6.summary())

In [0]:
# drop ht
model7 = sm.GLM.from_formula("low ~ smoke + white_race + lwt", family=sm.families.Binomial(), data=data)
result7 = model7.fit()
print(result7.summary())
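The manual steps above follow a simple loop, which can be sketched as below. This is only an illustration under stated assumptions: `backward_eliminate`, `fit_z` and the toy Z-scores are hypothetical names and made-up numbers, not part of Statsmodels or of the analysis above, and the sketch does not cover the step where the race categories were merged rather than dropped.

```python
def backward_eliminate(terms, fit_z, threshold=2.0):
    """Drop the weakest term until every remaining |z| is at least `threshold`.

    terms : starting list of term names
    fit_z : hypothetical callable that refits the model on a list of terms
            and returns a dict mapping each term to its Z-score
    """
    terms = list(terms)
    dropped = []
    while terms:
        z = fit_z(terms)
        # The weakest term is the one whose Z-score is closest to zero
        weakest = min(terms, key=lambda t: abs(z[t]))
        if abs(z[weakest]) >= threshold:
            break  # stopping point: all remaining terms are strong enough
        dropped.append(weakest)
        terms.remove(weakest)
    return terms, dropped


# Toy stand-in for a model fit: fixed, made-up Z-scores (not from the data)
toy_z = {"a": 3.1, "b": -0.4, "c": 1.3, "d": -2.6}

def toy_fit(terms):
    return {t: toy_z[t] for t in terms}

kept, dropped = backward_eliminate(["a", "b", "c", "d"], toy_fit)
print(kept, dropped)
```

In a real analysis, `fit_z` could wrap a call to `sm.GLM.from_formula` and read the Z-scores from the `tvalues` attribute of the fitted result. A categorical term such as `race` contributes several coefficients, so some rule for summarizing them into one score per term would be needed.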

To help select a model, we can calculate the AIC and BIC for each model.

In [0]:
[x.aic for x in (result1, result2, result3, result4, result5, result6, result7)]

In [0]:
[x.bic for x in (result1, result2, result3, result4, result5, result6, result7)]

In terms of AIC, `model4` is the best. If our goal is to have all Z-scores greater than 2 in absolute value, we would go with `model6`, which is the same model selected by BIC. Each of these model selection statistics has strengths and weaknesses. There is no automatic way to decide which model is "best".
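For reference, the usual likelihood-based definitions of the two criteria can be written out directly. The sketch below uses made-up log-likelihoods, parameter counts and sample size, purely to show how the penalties differ; none of the numbers come from the fits above. Depending on the Statsmodels version, the `bic` attribute of GLM results may be computed from the deviance rather than the log-likelihood, so its values need not match this formula.

```python
import math

def aic(llf, k):
    """Akaike information criterion: 2*k - 2*loglik."""
    return 2 * k - 2 * llf

def bic(llf, k, n):
    """Likelihood-based Bayesian information criterion: k*log(n) - 2*loglik."""
    return k * math.log(n) - 2 * llf

# Made-up values for illustration only (not taken from the models above)
n = 200                          # hypothetical sample size
llf_small, k_small = -105.0, 4   # smaller model: log-likelihood, parameter count
llf_big, k_big = -103.5, 6       # larger model

print(aic(llf_small, k_small), aic(llf_big, k_big))
print(bic(llf_small, k_small, n), bic(llf_big, k_big, n))
```

Both criteria trade fit against complexity, and lower values are better. BIC charges `log(n)` per parameter instead of 2, so once `n` exceeds about 7 it penalizes extra parameters more heavily and tends to favor smaller models, as seen above.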

__Non-multiplicative effect of maternal weight__

Now suppose that our main interest is in the effects of maternal age (`age`) and maternal weight (`lwt`). These are quantitative variables, so they may not be accurately represented by the models presented above, which only allow linear effects on the log odds (multiplicative effects on the odds). We can further explore the effect of maternal weight using splines.
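To see why a linear effect on the log odds is a multiplicative effect on the odds, consider the small sketch below. The intercept and slope are hypothetical values chosen for illustration, not estimates from the models above.

```python
import math

intercept = 1.0   # hypothetical intercept on the log odds scale
beta = -0.015     # hypothetical change in log odds per pound of maternal weight

def odds(weight):
    """Odds of the outcome under a model that is linear in the log odds."""
    return math.exp(intercept + beta * weight)

# Odds ratio for a 10-pound increase, starting from three different weights
ratios = [odds(w + 10) / odds(w) for w in (100, 130, 160)]
print(ratios)

# The ratio is the same at every starting weight and equals exp(10 * beta)
print(math.exp(10 * beta))
```

Under such a model the odds ratio for a fixed increase in weight cannot depend on the starting weight. A spline term relaxes exactly this restriction.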

In [0]:
model8 = sm.GLM.from_formula("low ~ smoke + white_race + bs(lwt, df=4) + ht + ui", family=sm.families.Binomial(), data=data)
result8 = model8.fit()
print(result8.summary())

Next we plot the log odds for maternal weight relative to the median maternal weight.  The effect that is graphed is for non-smoking white women with no hypertension (`ht=0`) and no uterine irritability (`ui=0`).

In [0]:
# Use nine rows as a template, one for each percentile (10th, ..., 90th)
df = data.iloc[0:9,:].copy()
df["smoke"] = 0
df["white_race"] = 1

# tolist is required due to a numpy bug, now fixed
lwt = np.percentile(np.asarray(data.lwt), np.arange(10, 91, 10).tolist())

df["lwt"] = lwt
df["ht"] = 0
df["ui"] = 0

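# Linear predictor (log odds of low birth weight).  Recent statsmodels
# versions deprecate `linear=True` in favor of `which="linear"`.
lpr = result8.predict(exog=df, linear=True)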

# Log odds ratios relative to median maternal weight
lor = lpr - lpr[4]

plt.grid(True)
plt.plot(lwt, lor, '-o')
plt.xlim(98, 172)
plt.xlabel("Maternal weight (lbs)", size=15)
plt.ylabel("Log odds relative to median", size=15)

We see that while low maternal weight is a risk factor, maternal weight above the median confers no advantage relative to the median.

In [0]:
OR = np.exp(lor)

plt.clf()
plt.grid(True)
plt.plot(lwt, OR, '-o')
plt.xlim(98, 172)
plt.xlabel("Maternal weight (lbs)", size=15)
plt.ylabel("Odds ratio", size=15)

__Effect of maternal age__

We can now revisit the effect of maternal age, which was very weak as a main effect (i.e. as a linear effect on the log odds scale). We will model age using splines to capture possible nonlinear effects. We will also control for the other factors that were found above to have effects.

In [0]:
model9 = sm.GLM.from_formula("low ~ smoke + bs(age, df=3) + white_race + lwt + ht + ui", family=sm.families.Binomial(), data=data)
result9 = model9.fit()

df = data.iloc[0:9,:].copy()
df["smoke"] = 0
df["white_race"] = 1

# tolist is required due to a numpy bug, now fixed
age = np.percentile(np.asarray(data.age), np.arange(10, 91, 10).tolist())

df["lwt"] = data.lwt.mean()
df["age"] = age
df["ht"] = 0
df["ui"] = 0

# Logit probabilities
lpr = result9.predict(exog=df, linear=True)

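# Rebuild the design matrix for the prediction grid, using the design
# information stored with the fitted model
import patsy
dexog = patsy.dmatrix(model9.data.orig_exog.design_info.builder, df)

# The variance of each fitted logit is x' V x, where x is a row of the
# design matrix and V is the covariance matrix of the coefficients
vcov = result9.cov_params()
va = [np.dot(x, np.dot(vcov, x)) for x in dexog]
va = np.asarray(va)
sd = np.sqrt(va)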

plt.grid(True)
plt.plot(age, lpr, '-o')
plt.fill_between(age, lpr-2*sd, lpr+2*sd, color='grey', alpha=0.6)
plt.xlabel("Maternal age", size=15)
plt.ylabel("Logit probability", size=15)

There is still no evidence of a maternal age effect.
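The band plotted above is on the logit scale. If a band on the probability scale is wanted, one common approach is to apply the inverse logit to the endpoints, which keeps the limits between 0 and 1. The sketch below uses made-up logit estimates and standard errors, not values from `result9`.

```python
import math

def expit(x):
    """Inverse logit: map a log odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logit estimates and standard errors, for illustration only
logits = [-1.2, -0.9, -1.0]
ses = [0.40, 0.25, 0.35]

# Transform the estimate and the +/- 2 SE limits to the probability scale
bands = [(expit(eta - 2 * se), expit(eta), expit(eta + 2 * se))
         for eta, se in zip(logits, ses)]

for lower, center, upper in bands:
    print(round(lower, 3), round(center, 3), round(upper, 3))
```

Because the inverse logit is nonlinear, the resulting band is generally not symmetric around the fitted probability.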