Regularized regression, clustering, and dimension reduction in sklearn 
============================================

sklearn is a "machine learning" library for Python.  Here we will demonstrate a few of its features using the National Health and Nutrition Examination Survey (NHANES) data.

First we import some libraries:

In [0]:
import sklearn
import pandas as pd
import patsy
import matplotlib.pyplot as plt
import numpy as np

Raw data for various waves of NHANES can be obtained from the CDC web site: http://www.cdc.gov/nchs/nhanes.htm.

The files are in SAS XPORT format, and can be read as follows:

In [0]:
demog = pd.read_sas("DEMO_F.XPT")
bpx = pd.read_sas("BPX_F.XPT")
bmx = pd.read_sas("BMX_F.XPT")

Next we merge the three files using the unique subject identifier, SEQN.

In [0]:
data = pd.merge(demog, bpx, on="SEQN")
data = pd.merge(data, bmx, on="SEQN")
print(data.columns)
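
To make the merge step concrete, here is a minimal sketch on made-up data.  The two small frames and their values are invented for illustration; only the SEQN key and the column names come from NHANES.  Note that `pd.merge` performs an inner join by default, so subjects that are missing from either file are dropped.

```python
import pandas as pd

# Toy stand-ins for two NHANES files; the values are invented.
demog_toy = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 51, 28]})
bpx_toy = pd.DataFrame({"SEQN": [2, 3, 4], "BPXSY1": [118.0, 126.0, 110.0]})

# on="SEQN" is shorthand for left_on="SEQN", right_on="SEQN".
# validate="one_to_one" raises an error if SEQN is duplicated in either frame.
merged_toy = pd.merge(demog_toy, bpx_toy, on="SEQN", validate="one_to_one")
print(merged_toy)
```

Only subjects 2 and 3 appear in both toy frames, so only they survive the merge.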

Next we form a smaller data set including some variables of possible interest.  We then tabulate the frequency with which each of these variables has missing values.

In [0]:
df = data[["RIDAGEYR", "BPXSY1", "RIAGENDR", "BMXBMI", "BMXWT", "BMXRECUM", "BMILEG", "BMXARML", "BMXHT", "BMXWAIST", "BMXTRI", "DMDMARTL"]]
df.isnull().mean()

Since we will need complete cases below, we drop the two variables with high rates of missing values, then drop all cases with any missing values on the remaining variables.

In [0]:
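# Drop the two sparse columns, then keep complete cases only.  The .copy()
# gives an independent frame, so later column assignments do not trigger
# pandas' SettingWithCopyWarning.
df = df.drop(columns=["BMILEG", "BMXRECUM"]).dropna().copy()
df.shape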
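
As a small illustration of the two steps above, the following sketch computes per-column missing rates and then keeps complete cases.  The frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data with some missing values.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
})

# isnull() gives a boolean frame; the column means are the missing rates.
miss_rate = toy.isnull().mean()

# dropna() keeps only the rows with no missing values in any column.
complete = toy.dropna()

print(miss_rate)
print(complete.shape)
```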

We need an indicator variable for gender.  In NHANES, RIAGENDR is coded 1 for male and 2 for female.

In [0]:
df["FEMALE"] = (df["RIAGENDR"] == 2).astype(int)

Regularized regression
-------------------------

Regularized regression is very well developed in sklearn.  Here we will demonstrate two approaches: ridge regression and the Lasso.

sklearn does not handle formulas, but we can use Patsy to process the formulas, then use the resulting design matrices in sklearn.

In [0]:
y, x = patsy.dmatrices("BPXSY1 ~ 0 + RIDAGEYR + FEMALE + BMXBMI + BMXWT", data=df, return_type='dataframe')
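
The `0 +` term in the formula suppresses the intercept column that Patsy would otherwise add.  The sketch below shows the difference on a small invented data frame (`toy` and its column names are hypothetical).

```python
import pandas as pd
import patsy

# Invented data for illustration.
toy = pd.DataFrame({"y": [1.0, 2.0, 3.0], "a": [0.5, 1.5, 2.5], "b": [1.0, 0.0, 1.0]})

# Default formula: Patsy adds an "Intercept" column of ones.
y_with, x_with = patsy.dmatrices("y ~ a + b", data=toy, return_type="dataframe")

# With "0 +", the design matrix contains only the named predictors.
y_no, x_no = patsy.dmatrices("y ~ 0 + a + b", data=toy, return_type="dataframe")

print(list(x_with.columns))
print(list(x_no.columns))
```

Dropping the intercept is reasonable here because the data are centered before fitting, and the models below are fit with `fit_intercept=False`.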

Regularization penalties depend on the scale of each variable, so the predictors and outcome are usually standardized before fitting a regularized regression model.

In [0]:
xnames = x.columns
x = np.asarray(x)
y = np.asarray(y)
y = (y - y.mean()) / y.std()
x = (x - x.mean(0)) / x.std(0)
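
The manual standardization above can also be done with sklearn's `StandardScaler`.  The sketch below checks, on simulated data, that the two approaches agree; both use the population standard deviation (dividing by n rather than n - 1).  The array `x_toy` is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Simulated predictors with nonzero mean and non-unit variance.
rng = np.random.default_rng(0)
x_toy = rng.normal(loc=5.0, scale=3.0, size=(50, 3))

# Manual standardization, as in the cell above.
x_manual = (x_toy - x_toy.mean(0)) / x_toy.std(0)

# The same transformation using sklearn.
x_scaled = StandardScaler().fit_transform(x_toy)

print(np.allclose(x_manual, x_scaled))
```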

Ridge regression
------------------

Now we are ready to fit a sequence of ridge regression models over a grid of regularization parameters.

In [0]:
from sklearn import linear_model
clf = linear_model.Ridge(fit_intercept=False)
alphas = np.linspace(0, 40, 100)

coefs = []
for a in alphas:
    clf.set_params(alpha=a*len(y))
    clf.fit(x, y)
    coefs.append(clf.coef_.ravel())

coefs = np.asarray(coefs) 
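
sklearn's `Ridge` minimizes the residual sum of squares plus `alpha` times the sum of squared coefficients.  Because the first term is a sum rather than a mean, it grows with the sample size, which is one reason to scale `alpha` by `len(y)` as above.  Without an intercept, the minimizer has the closed form (X'X + alpha I)^{-1} X'y.  The sketch below checks this on simulated data; the data and the value of `alpha` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Simulated data; the true coefficients are arbitrary.
rng = np.random.default_rng(0)
n, p = 200, 3
x_toy = rng.normal(size=(n, p))
y_toy = x_toy @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=n)

alpha = 10.0
fit = Ridge(alpha=alpha, fit_intercept=False).fit(x_toy, y_toy)

# Closed-form ridge solution: solve (X'X + alpha*I) b = X'y.
closed_form = np.linalg.solve(x_toy.T @ x_toy + alpha * np.eye(p), x_toy.T @ y_toy)

print(np.allclose(fit.coef_, closed_form))
```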

It is common to visualize the results by plotting the coefficients against the regularization parameter.

In [0]:
ax = plt.gca()
for j,c in enumerate(coefs.T):
    plt.plot(alphas, c, label=xnames[j])
ha,lb = ax.get_legend_handles_labels()
plt.legend(ha, lb, loc="upper right")
plt.xlabel('alpha', size=16)
plt.ylabel('Coefficients', size=16)

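Next we repeat the ridge analysis with a larger set of predictors, adding height, waist circumference, and triceps skinfold thickness.

In [0]: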
y, x = patsy.dmatrices("BPXSY1 ~ 0 + RIDAGEYR + FEMALE + BMXBMI + BMXWT + BMXHT + BMXWAIST + BMXTRI", data=df, return_type='dataframe')

In [0]:
xnames = x.columns
x = np.asarray(x)
y = np.asarray(y)
y = (y - y.mean()) / y.std()
x = (x - x.mean(0)) / x.std(0)

In [0]:
clf = linear_model.Ridge(fit_intercept=False)

alphas = np.linspace(0, 4, 100)
coefs = []
for a in alphas:
    clf.set_params(alpha=a*len(y))
    clf.fit(x, y)
    coefs.append(clf.coef_.ravel())

coefs = np.asarray(coefs) 

In [0]:
ax = plt.gca()
for j,c in enumerate(coefs.T):
    plt.plot(alphas, c, label=xnames[j])
ha,lb = ax.get_legend_handles_labels()
plt.legend(ha, lb, loc="upper right")
plt.xlabel('alpha', size=16)
plt.ylabel('Coefficients', size=16)
plt.xlim(0, 4)

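Lasso regression
------------------

The Lasso replaces the ridge penalty with an L1 penalty, which can shrink some coefficients exactly to zero.  We fit it to the same standardized data over a grid of regularization parameters, then plot the coefficient paths.

In [0]:
clf = linear_model.Lasso(fit_intercept=False)

alphas = np.linspace(0.000001, 0.0001, 100)
coefs = []
for a in alphas:
    clf.set_params(alpha=a*len(y))
    clf.fit(x, y)
    coefs.append(clf.coef_.ravel())

coefs = np.asarray(coefs)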

In [0]:
ax = plt.gca()
for j,c in enumerate(coefs.T):
    plt.plot(alphas, c, label=xnames[j])
ha,lb = ax.get_legend_handles_labels()
plt.legend(ha, lb, loc="upper right")
plt.xlabel('alpha', size=16)
plt.ylabel('Coefficients', size=16)

Principal Components Analysis
---------------------------------

To illustrate PCA, we can look at five related body measures.

In [0]:
from sklearn.decomposition import PCA

dfx = df.loc[(df.RIDAGEYR >= 30) & (df.RIDAGEYR <= 40)]
x = dfx[["BMXBMI", "BMXWT", "BMXHT", "BMXWAIST", "BMXTRI"]]

pca = PCA(n_components=2)
# fit_transform fits the model and returns the scores in a single step
scores = pca.fit_transform(x)
print(pca.explained_variance_ratio_)
ixf = np.flatnonzero(dfx.FEMALE == 1)
ixm = np.flatnonzero(dfx.FEMALE == 0)
plt.plot(scores[ixf, 0], scores[ixf, 1], 'o', color='orange', alpha=0.2)
plt.plot(scores[ixm, 0], scores[ixm, 1], 'o', color='purple', alpha=0.2)
plt.xlabel("PC 1", size=16)
plt.ylabel("PC 2", size=16)
pca.components_
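
sklearn's `PCA` centers each variable but does not rescale it, so variables with large variances tend to dominate the leading components.  The components are the right singular vectors of the centered data matrix (up to sign), and the explained variance ratios are the squared singular values divided by their sum.  The sketch below checks both facts on simulated data; `x_toy` is invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated correlated data.
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

pca_toy = PCA(n_components=2).fit(x_toy)

# Singular value decomposition of the column-centered data.
xc = x_toy - x_toy.mean(0)
u, s, vt = np.linalg.svd(xc, full_matrices=False)
ratio = s**2 / np.sum(s**2)

# Compare absolute values, since the sign of each component is arbitrary.
print(np.allclose(pca_toy.explained_variance_ratio_, ratio[:2]))
print(np.allclose(np.abs(pca_toy.components_), np.abs(vt[:2])))
```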

As a second example, we can look at three systolic blood pressure measurements (replicate measures taken on the same assessment visit).  These are quite strongly correlated.

In [0]:
dfx = data[["BPXSY1", "BPXSY2", "BPXSY3"]].dropna()
np.corrcoef(dfx.T)

In [0]:
pca = PCA(n_components=3)
# fit_transform fits the model and returns the scores in a single step
scores = pca.fit_transform(dfx)
print(pca.explained_variance_ratio_)
print(pca.components_)

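For each component, we plot the SBP profiles of the five subjects with the lowest scores (blue) and the five with the highest scores (red), then show a histogram of the scores.

In [0]:
for k in range(3):
    # Order the subjects by their score on component k
    ii = np.argsort(scores[:, k])
    plt.figure()
    plt.gca().set_xticks([0, 1, 2])
    plt.xlabel("SBP measurement number", size=16)
    plt.ylabel("SBP", size=16)
    plt.title("PC " + str(k+1))
    for j in range(5):
        plt.plot(dfx.iloc[ii[j], :].values, color='blue')     # lowest scores
        plt.plot(dfx.iloc[ii[-j-1], :].values, color='red')   # highest scores
    # Distribution of the scores on component k
    plt.figure()
    plt.hist(scores[:, k])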

K-means
--------

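K-means partitions the observations into a fixed number of clusters.  Here we cluster the same five body measures used in the first PCA example, again restricted to subjects aged 30 to 40.

In [0]: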
from sklearn.cluster import KMeans

dfx = df.loc[(df.RIDAGEYR >= 30) & (df.RIDAGEYR <= 40)]
x = dfx[["BMXBMI", "BMXWT", "BMXHT", "BMXWAIST", "BMXTRI"]]

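# K-means starts from random centers; fixing the seed makes the result reproducible
km = KMeans(n_clusters=4, n_init=10, random_state=0)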
rslt = km.fit(x)
clcent = pd.DataFrame(km.cluster_centers_, columns=x.columns)
print(clcent)
print(pd.Series(km.predict(x)).value_counts())
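
K-means is based on Euclidean distances, so variables measured on larger scales carry more weight in the clustering.  One option, sketched below on simulated data, is to standardize the variables before clustering.  The two simulated groups, the seed, and the choice of two clusters are assumptions made for illustration, and the groups are deliberately well separated, so the example only shows the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two well-separated simulated groups; the second variable has a much larger scale.
rng = np.random.default_rng(0)
g1 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(50, 2))
g2 = rng.normal(loc=[10.0, 1000.0], scale=[1.0, 100.0], size=(50, 2))
x_toy = np.vstack([g1, g2])

# Put both variables on a common scale before clustering.
z = StandardScaler().fit_transform(x_toy)

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(z)
labels = km_toy.labels_

# Cluster sizes, and the cluster centers in standardized units.
print(np.bincount(labels))
print(km_toy.cluster_centers_)
```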