
# **Regression models**


We explore relationships among different variables of a dataset.

The data used in this demo is a table of brain features computed with the [FreeSurfer](https://surfer.nmr.mgh.harvard.edu/) segmentation software. We analyze a subsample of the large number of features generated by FreeSurfer for the [ABIDE I](http://fcon_1000.projects.nitrc.org/indi/abide/) data cohort.

We will use [pandas](https://pandas.pydata.org/), [matplotlib](https://matplotlib.org/) and [scikit-learn](https://scikit-learn.org/stable/) (sklearn). All these libraries are already installed on Colab.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Read the dataset


We read a CSV file into a pandas DataFrame.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
dataset_file = "/content/gdrive/MyDrive/CMEPDA_MedPhys_datasets/FEATURES/Brain_MRI_FS_ABIDE/FS_features_ABIDE_males_someGlobals.csv"
# check and modify the path of the FS_features_ABIDE_males_someGlobals.csv file you downloaded in your drive
df = pd.read_csv(dataset_file, sep=';')
df.head()

In [None]:
df.columns

# Linear regression with sklearn
## (One predictive variable)

We can hypothesize that there is a linear dependence, for example, of cortical thickness on the subject's age.



In [None]:
plt.scatter(df.AGE_AT_SCAN, df.lh_MeanThickness, color='black', marker='.')

plt.xlabel('Age [y]')
plt.ylabel(' lh_MeanThickness [mm]')

plt.show()

The question is: can we predict a person's age from their cortical thickness?

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lin_reg = LinearRegression()

In [None]:
lin_reg

In [None]:
X_feat = pd.DataFrame(data=df, columns=['lh_MeanThickness'])
Y_ = df.AGE_AT_SCAN

In [None]:
model = lin_reg.fit(X_feat, Y_)

In [None]:
print(model)

In [None]:
[x for x  in dir(model) if not x.startswith('_')]

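`.score` returns the coefficient of determination R^2 of the prediction. The best possible score is 1.0. Since it is computed here on the same data used for the fit, it measures goodness of fit rather than predictive performance on new subjects.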

In [None]:
model.score(X_feat, Y_)
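To make the meaning of R^2 concrete, the sketch below recomputes it by hand as 1 - SS_res / SS_tot and compares it with `.score`. It runs on synthetic data: the arrays `x_demo` and `y_demo` and the slope, intercept and noise level used to generate them are assumptions for illustration, not values from the ABIDE table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical values, not the ABIDE features)
rng = np.random.default_rng(0)
x_demo = rng.uniform(2.0, 3.0, size=200).reshape(-1, 1)
y_demo = 60.0 - 15.0 * x_demo.ravel() + rng.normal(0.0, 2.0, size=200)

demo_model = LinearRegression().fit(x_demo, y_demo)
y_demo_fit = demo_model.predict(x_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_demo_fit) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, demo_model.score(x_demo, y_demo))
```

The two printed numbers should agree up to floating-point rounding.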

In [None]:
X_feat

In [None]:
model.coef_

In [None]:
model.intercept_

In [None]:
Y_fit=model.predict(X_feat)
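For a linear model, `predict` simply evaluates the fitted line: the feature values times `coef_`, plus `intercept_`. The sketch below checks this on synthetic data; `x_toy`, `y_toy` and the numbers used to generate them are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical values, not the ABIDE features)
rng = np.random.default_rng(1)
x_toy = rng.uniform(2.0, 3.0, size=(50, 1))
y_toy = 60.0 - 15.0 * x_toy[:, 0] + rng.normal(0.0, 2.0, size=50)

toy_model = LinearRegression().fit(x_toy, y_toy)

# Evaluate the fitted line by hand: y = X @ coef_ + intercept_
y_by_hand = x_toy @ toy_model.coef_ + toy_model.intercept_

print(np.allclose(y_by_hand, toy_model.predict(x_toy)))
```

The same expression works with several predictors, where `coef_` holds one coefficient per column.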

We can easily compute the root mean squared error (RMSE). It is expressed in the same units as the target (here, years), which makes it a convenient value to report.

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
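# RMSE as the square root of the MSE; the squared=False argument was removed in recent scikit-learn releases
mean_squared_error(Y_, Y_fit) ** 0.5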
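As a sanity check, the RMSE can also be written out from its definition, the square root of the mean of the squared residuals. The sketch below uses two small made-up arrays (`y_true_demo` and `y_pred_demo` are hypothetical, not taken from the dataset) and compares the hand computation with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted ages, for illustration only
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 33.0, 37.0])

# RMSE from its definition: sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# Same quantity via scikit-learn: square root of the MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)
```

Here the squared residuals are 4, 4, 9 and 9, so both values equal the square root of 6.5.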

Plot the data points and the fitted line


In [None]:
plt.scatter(X_feat, Y_,  color='black', marker='.')
plt.plot(X_feat, Y_fit, color='blue', linewidth=3)

plt.xlabel(f'{X_feat.columns[0]} [mm]')
plt.ylabel('Age [y]')

plt.show()

# Linear regression model (Multiple predictors)

In [None]:
lin_reg_M = LinearRegression()

In [None]:
#X_feat_M = pd.DataFrame(data=df, columns=['lh_MeanThickness', 'rh_MeanThickness'])
X_feat_M = df.loc[:,'lh_MeanThickness':'TotalGrayVol']
Y_ = df.AGE_AT_SCAN

In [None]:
X_feat_M.columns

In [None]:
model_M = lin_reg_M.fit(X_feat_M, Y_)

In [None]:
model_M.score(X_feat_M, Y_)

In [None]:
model_M.coef_


In [None]:
model_M.intercept_

In [None]:
Y_fit_M=model_M.predict(X_feat_M)

In [None]:
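# RMSE as the square root of the MSE; the squared=False argument was removed in recent scikit-learn releases
mean_squared_error(Y_, Y_fit_M) ** 0.5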

In [None]:
plt.scatter(Y_, Y_fit_M, color='black', marker='.')

plt.xlabel('Age [y]')
plt.ylabel('Predicted Age [y]')

plt.show()

In [None]:
plt.scatter(X_feat_M.lh_MeanThickness, Y_,  color='black', marker='.')
plt.scatter(X_feat_M.lh_MeanThickness, Y_fit_M,  color='blue', marker='.')

plt.xlabel('lh_MeanThickness [mm]')
plt.ylabel('Age [y]')

plt.show()

3D scatter plot of two predictors against the true age (black) and the fitted age (blue)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')


ax.scatter(X_feat_M.lh_MeanThickness, X_feat_M.rh_MeanThickness, Y_, color='black')
ax.scatter(X_feat_M.lh_MeanThickness, X_feat_M.rh_MeanThickness, Y_fit_M, color='blue')

#ax.scatter(X_feat_M.lh_MeanThickness, X_feat_M.TotalGrayVol, Y_, color='black')
#ax.scatter(X_feat_M.lh_MeanThickness, X_feat_M.TotalGrayVol, Y_fit_M, color='blue')

ax.set_xlabel('lh_MeanThickness [mm]')
ax.set_ylabel('rh_MeanThickness [mm]')
ax.set_zlabel('Age [y]')

plt.show()


An interactive plot using the [plotly](https://plotly.com/python/) library

In [None]:
# Import dependencies

import plotly
import plotly.graph_objs as go

# Configure the traces for data points and fit.
trace_data = go.Scatter3d(
    x=X_feat_M.lh_MeanThickness,
    y=X_feat_M.rh_MeanThickness,
#    y=X_feat_M.TotalGrayVol, # you can try with different features, e.g. TotalGrayVol
    z=Y_,
    mode='markers',
    marker={
        'size': 3,
        'opacity': 0.8,
    },
    name='Data points'
)

trace_fit = go.Scatter3d(
    x=X_feat_M.lh_MeanThickness,
    y=X_feat_M.rh_MeanThickness,
#    y=X_feat_M.TotalGrayVol,
    z=Y_fit_M,
    mode='markers',
    marker={
        'size': 3,
        'opacity': 0.8,
    },
    name='Fitted data'
)
# Configure the layout.
layout = go.Layout(
    margin={'l': 0, 'r': 0, 'b': 0, 't': 0},
    scene= {
    "xaxis":{'title':'lh_MeanThickness'},
    "yaxis":{'title':'rh_MeanThickness'},  # must match the feature plotted on y
#    "yaxis":{'title':'TotalGrayVol'},
    "zaxis":{'title':'Age'}
    }
)

data = [trace_data, trace_fit]

plot_figure = go.Figure(data=data, layout=layout)

# Render the plot.
plot_figure.show()

# Polynomial models

In [None]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
x_ = 'lh_MeanThickness'
y_ = 'AGE_AT_SCAN'
X, y = df[x_].to_numpy(), df[y_].to_numpy()
plt.scatter(X, y)
plt.xlabel(x_)
plt.ylabel(y_)

x_values = np.arange(X.min(), X.max(), 0.1).reshape(-1,1)

colors = ['green', 'yellow', 'red', 'pink', 'blue']
for i in range(0, 5):
    model = Pipeline([('scaler', StandardScaler()),
                      ('poly', PolynomialFeatures(degree=i)),
                      ('linear', LinearRegression(fit_intercept=False))])
    model = model.fit(X.reshape(-1,1), y)
    y_hat = model.predict(X.reshape(-1,1))
    # RMSE as the square root of the MSE (squared=False was removed in recent scikit-learn releases)
    rmse = mean_squared_error(y, y_hat) ** 0.5

    plt.plot(x_values, model.predict(x_values),
             color=colors[i],
             linewidth=2.0,
             label=f"Poly_{i} - RMSE: {rmse:.2f}")

    print(f'RMSE of Poly_{i} is {rmse:.2f} y')

plt.legend()
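The pipeline above fits a polynomial by first expanding the single predictor into its powers and then running an ordinary linear regression on the expanded columns. The sketch below shows that expansion for degree 2 on a tiny made-up array (`x_small` is hypothetical). Because `PolynomialFeatures` adds a constant column of ones by default, the regression step can use `fit_intercept=False` without losing the intercept; with degree 0 only that constant column remains, so the model reduces to a constant prediction.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Tiny made-up predictor, one column, for illustration only
x_small = np.array([[1.0], [2.0], [3.0]])

# Columns of the result are x^0 (bias), x^1 and x^2
expanded = PolynomialFeatures(degree=2).fit_transform(x_small)

print(expanded)
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```

Note that the RMSE printed in the loop is computed on the fitting data, so it can only decrease or stay equal as the degree grows; a lower value there does not by itself indicate a better model for new subjects.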


# Conclusions

Many studies can be done by modeling the age trends of brain characteristics. Abnormal trends could characterize pathological conditions.
To learn more about the brain morphometric patterns in autism spectrum disorder (ASD) across the lifespan, you can read the study by Van Rooij D, *et al.*, [ENIGMA-ASD](http://enigma.ini.usc.edu/ongoing/enigma-asd-working-group/) working group, [*Cortical and subcortical brain morphometry differences between patients with autism spectrum disorder and healthy individuals across the lifespan: Results from the ENIGMA ASD working group*](https://ajp.psychiatryonline.org/doi/10.1176/appi.ajp.2017.17010100), American Journal of Psychiatry 2018, 175 (4), pp. 359-369. doi: 10.1176/appi.ajp.2017.17010100.