<a href="https://colab.research.google.com/github/retico/cmepda_medphys/blob/master/L8_code/Lecture8_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data exploration and analysis in Pandas

We'll have a quick look at [pandas](https://pandas.pydata.org/), one of the most widely used Python data analysis libraries, and at [seaborn](https://seaborn.pydata.org/), a high-level API to matplotlib for statistical data visualization. Both libraries are already installed on Colab.

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Read the dataset


A DataFrame is basically a container for tabular data that exposes lots of methods to process it. One-dimensional data series are stored, instead, in pandas Series.
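
As a minimal sketch of the relation between the two types, the toy table below (made-up values, not taken from the dataset used in this lecture) shows that selecting a single column gives a Series, while selecting a list of columns gives a DataFrame.

```python
import pandas as pd

# Toy table with made-up values (not taken from the ABIDE dataset).
toy = pd.DataFrame({
    "subject": ["s01", "s02", "s03"],
    "age": [12.5, 30.1, 18.0],
})

# One column label -> Series; a list of column labels -> DataFrame.
age_series = toy["age"]
age_frame = toy[["age"]]

print(type(age_series).__name__)  # Series
print(type(age_frame).__name__)   # DataFrame
print(toy.shape)                  # (3, 2)
```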

Pandas offers plenty of readers out of the box.

In [0]:
[x for x in dir(pd) if 'read' in x ]

Let's read a CSV file as a pandas DataFrame.


In [0]:
dataset_url = "https://raw.githubusercontent.com/retico/cmepda_medphys/master/L8_code/FS_features_ABIDE_males_someGlobals.csv"

In [0]:
df = pd.read_csv(dataset_url)
df.head()

In [0]:
print(type(df), type(df.FILE_ID))

In [0]:
df.columns

Since it is hard to remember what DX_GROUP=1 actually means, let's make this column more readable.

First, we select the DX_GROUP column; then we apply a function to each of its elements.

In [0]:
df['DX_GROUP'] = df.DX_GROUP.apply(lambda x: 'Controls' if x==-1 else 'ASD')
df.head()

In [0]:
df.DX_GROUP.unique()
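
A possible alternative to `apply` with a lambda is `Series.map` with a dictionary. The sketch below uses a toy Series and assumes the same two codes as the cell above (-1 and 1). Unlike the lambda, which labels every value other than -1 as 'ASD', `map` turns any code missing from the dictionary into NaN, so unexpected values are easier to spot.

```python
import pandas as pd

# Hypothetical codes mirroring the cell above: -1 -> Controls, 1 -> ASD.
codes = pd.Series([1, -1, -1, 1], name="DX_GROUP")
labels = {-1: "Controls", 1: "ASD"}

# map() looks each value up in the dictionary; codes that are not keys become NaN.
readable = codes.map(labels)
print(readable.tolist())  # ['ASD', 'Controls', 'Controls', 'ASD']
```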

# Boxplots

In [0]:
boxplot = df.boxplot(column=['AGE_AT_SCAN'], by='DX_GROUP', showfliers=False)
boxplot.set_title('Box plot of subject\'s age')
boxplot.get_figure().suptitle('');

The provenance site is the first part of the FILE_ID (the text before the first underscore), so we can extract it and store it in a new column.

In [0]:
df['ProvenanceSite'] = df.FILE_ID.apply(lambda x: x.split('_')[0])
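
The same extraction can also be written with the vectorized `.str` accessor instead of `apply`. This is only a sketch on made-up identifiers; the `<site>_<number>` pattern is an assumption about how FILE_ID is built.

```python
import pandas as pd

# Made-up identifiers following a "<site>_<number>" pattern (an assumption about FILE_ID).
file_ids = pd.Series(["SiteA_0001", "SiteB_0002", "SiteA_0003"], name="FILE_ID")

# .str exposes vectorized string methods; .str[0] picks the first piece of each split.
site = file_ids.str.split("_").str[0]
print(site.tolist())  # ['SiteA', 'SiteB', 'SiteA']
```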

In [0]:
boxplot = df.boxplot(column=['AGE_AT_SCAN'], by='ProvenanceSite', showfliers=False)
boxplot.set_title('Box plot of subject\'s age')
boxplot.get_figure().suptitle('');
boxplot.set_ylabel('Age [y]')

boxplot.set_xticklabels(labels=boxplot.get_xticklabels(), rotation=50);

Seaborn interoperates well with pandas DataFrames and allows us to refer to columns by label.

In [0]:
sns_boxplot = sns.boxplot(x='ProvenanceSite', y='AGE_AT_SCAN', data=df)
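# Rotate the tick labels of this plot (reusing the labels of the previous
# pandas plot could mislabel the boxes if the site order differs)
sns_boxplot.tick_params(axis='x', labelrotation=50)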
sns_boxplot.grid()

# Grouping

Data can be grouped by one or more features, and each group can be summarized with an aggregation function.

In [0]:
df.groupby(by='AGE_AT_SCAN')

In [0]:
df.groupby(by='AGE_AT_SCAN')['FILE_ID']

In [0]:
provenance_counts = df.groupby('ProvenanceSite')['FILE_ID'].count()
provenance_counts
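
As the first two cells suggest, `groupby` on its own only builds a GroupBy object; nothing is computed until an aggregation is applied. A small sketch on made-up data, computing several summaries at once with `agg`:

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({
    "site": ["A", "A", "B", "B", "B"],
    "age": [10.0, 14.0, 20.0, 22.0, 27.0],
})

# One row per group, one column per aggregation function.
summary = toy.groupby("site")["age"].agg(["count", "mean", "max"])
print(summary)
```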

It is quite easy to obtain a bar plot from a pandas Series.

In [0]:
provenance_counts.plot(kind='bar')

Now let's try to create a stacked bar plot that also shows how the subjects of each site split by DX_GROUP.

In [0]:
stack = df.groupby(['ProvenanceSite', 'DX_GROUP'])['FILE_ID'].count()
stack

In [0]:
unstacked = stack.unstack('DX_GROUP')
unstacked

In [0]:
unstacked.plot(kind='bar', stacked=True)
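
For plain counts, `pd.crosstab` is a possible shortcut for the group-count-unstack sequence. The sketch below, on made-up data, compares the two: `unstack` leaves combinations that never occur as NaN, while `crosstab` fills them with 0.

```python
import pandas as pd

# Toy data with made-up values; site B has no ASD subject.
toy = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B"],
    "group": ["ASD", "Controls", "ASD", "Controls", "Controls"],
    "id": ["a1", "a2", "a3", "b1", "b2"],
})

# Count per (site, group), then pivot the 'group' level into columns.
counts = toy.groupby(["site", "group"])["id"].count().unstack("group")

# crosstab counts the co-occurrences of the two columns directly.
table = pd.crosstab(toy["site"], toy["group"])
print(counts)
print(table)
```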

# Slicing

Ranges of rows and/or columns can be selected either by label, with `loc`, or by integer position, with `iloc`.

In [0]:
df.columns

In [0]:
selected_feat = df.loc[:,'lh_MeanThickness':'rhCortexVol']
selected_feat.head()

In [0]:
df.iloc[:,5:9].head()
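
One detail worth keeping in mind: label slices with `loc` include the end label, while positional slices with `iloc` exclude the stop index, as in plain Python. A minimal sketch on a toy DataFrame with made-up column names:

```python
import pandas as pd

toy = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=["a", "b", "c", "d"],
)

by_label = toy.loc[:, "b":"c"]    # label slice: both 'b' and 'c' are included
by_position = toy.iloc[:, 1:3]    # positional slice: columns 1 and 2, stop excluded

print(list(by_label.columns))        # ['b', 'c']
print(by_label.equals(by_position))  # True
```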

Let's see what our selection looks like!

In [0]:
sns.pairplot(df.iloc[:,5:9])

# Finding the outliers

In [0]:
import scipy.stats

In [0]:
data = df.iloc[:,5:11]
data.head()

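We use the z-score to flag outliers. For a value $x$ in a column with mean $\bar{x}$ and standard deviation $\sigma$, it is defined as:

$z(x) = \frac{x - \bar{x}}{\sigma}$.

Values with $|z| > 3$ (more than 3$\sigma$ away from the mean) are treated as outliers. The cell below keeps only the rows in which every selected feature has $|z| < 3$.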

In [0]:
data[(abs(scipy.stats.zscore(data)) < 3).all(axis=1)]
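
To make the filter explicit, here is a sketch of the same rule written with pandas only, on made-up numbers. It uses the population standard deviation (`ddof=0`), which is the default of `scipy.stats.zscore`; pandas' own `std()` defaults to `ddof=1`.

```python
import pandas as pd

# Made-up measurements: eleven ordinary values and one extreme one.
toy = pd.DataFrame(
    {"x": [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 100.0]}
)

# Column-wise z-score with the population standard deviation (ddof=0).
z = (toy - toy.mean()) / toy.std(ddof=0)

# Keep the rows whose |z| is below 3 in every column.
kept = toy[(z.abs() < 3).all(axis=1)]
print(len(toy), len(kept))  # 12 11
```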

## T-test

Is there any significant difference in the AGE_AT_SCAN and FIQ features between the two diagnostic categories?

In [0]:
df_ASD = df[df.DX_GROUP == 'ASD']
df_CTR = df[df.DX_GROUP == 'Controls']

In [0]:
ttest_res = scipy.stats.ttest_ind(df_ASD.AGE_AT_SCAN, df_CTR.AGE_AT_SCAN)
ttest_res

In [0]:
ttest_res = scipy.stats.ttest_ind(df_ASD.FIQ, df_CTR.FIQ)
ttest_res
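
By default `ttest_ind` assumes equal variances in the two groups. If that assumption is doubtful, Welch's variant can be requested with `equal_var=False`, and `nan_policy='omit'` skips missing values. The sketch below uses synthetic samples, not the ABIDE data; whether these options are needed for the real features is left as an open question.

```python
import numpy as np
from scipy import stats

# Synthetic samples (seeded for reproducibility), not the ABIDE data.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=10.0, size=200)
group_b = rng.normal(loc=110.0, scale=20.0, size=150)

# equal_var=False selects Welch's t-test, which does not assume equal variances;
# nan_policy='omit' ignores missing values instead of propagating NaN.
res = stats.ttest_ind(group_a, group_b, equal_var=False, nan_policy="omit")
print(res.statistic, res.pvalue)
```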

Let's try with a new index of left-right asymmetry in mean thickness.

In [0]:
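def LR(data):
    """Add the left-right asymmetry index of mean thickness as column 'LR'.

    Note: the input DataFrame is modified in place and also returned.
    """
    diff = data.lh_MeanThickness - data.rh_MeanThickness
    mean = 0.5 * (data.rh_MeanThickness + data.lh_MeanThickness)
    data['LR'] = diff / mean
    return data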

In [0]:
data = LR(df)
data.head()

In [0]:
color = data.DX_GROUP.apply(lambda x:'blue' if x == 'ASD' else 'red')
ax = data.plot(x='AGE_AT_SCAN', y='LR', kind='scatter', color=color);
#ax.grid()

In [0]:
data.DX_GROUP.unique()

In [0]:
LR_ASD = data[data['DX_GROUP'] == 'ASD']['LR']
LR_CTR = data[data['DX_GROUP'] == 'Controls']['LR']

In [0]:
scipy.stats.ttest_ind(LR_ASD, LR_CTR)

In [0]:
boxplot = df.boxplot(column=['LR'], by='DX_GROUP', showfliers=False)
boxplot.set_title('Box plot of subject\'s LR asymmetry index')
boxplot.get_figure().suptitle('');
boxplot.set_ylabel('LR')

boxplot.set_xticklabels(labels=boxplot.get_xticklabels());

We can compute the effect size in terms of Cohen's d index.

In [0]:
d_cohen = (LR_ASD.mean()-LR_CTR.mean())/data['LR'].std()
d_cohen
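
The cell above divides by the standard deviation of the whole sample, which also contains the spread between the two groups. Cohen's d is more commonly computed with the pooled within-group standard deviation; a sketch of that variant follows, where `cohen_d` is a hypothetical helper (not part of pandas or scipy) and the numbers are made up.

```python
import pandas as pd

def cohen_d(a: pd.Series, b: pd.Series) -> float:
    """Cohen's d using the pooled sample standard deviation (hypothetical helper)."""
    n_a, n_b = len(a), len(b)
    # Series.var() uses ddof=1, i.e. the unbiased sample variance.
    pooled_var = ((n_a - 1) * a.var() + (n_b - 1) * b.var()) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / pooled_var ** 0.5

# Made-up values for illustration: both groups have variance 4, means differ by 1.
a = pd.Series([2.0, 4.0, 6.0])
b = pd.Series([1.0, 3.0, 5.0])
print(cohen_d(a, b))  # 0.5
```

With the notebook's variables this would be called as `cohen_d(LR_ASD, LR_CTR)`.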

# Find the correlation

In [0]:
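# numeric_only=True skips string columns such as FILE_ID (needed with pandas >= 2.0)
data.drop('SEX', axis=1).corr(numeric_only=True)

In [0]:
sns.heatmap(data.drop('SEX', axis=1).corr(numeric_only=True));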

In [0]:
res = scipy.stats.pearsonr(data['rhCortexVol'], data['LR'])
res

# Regression model

In [0]:
import sklearn
from sklearn.linear_model import LinearRegression

In [0]:
lin_reg = LinearRegression()

In [0]:
lin_reg

In [0]:
X_feat = data[['lh_MeanThickness', 'rh_MeanThickness']]
Y_ = data.AGE_AT_SCAN

In [0]:
model = lin_reg.fit(X_feat, Y_)

In [0]:
print(model)

In [0]:
[x for x  in dir(model) if not x.startswith('_')]

In [0]:
model.score(X_feat, Y_)
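
For a regressor, `score` returns the coefficient of determination $R^2$ computed on the data passed to it, and the fitted parameters are available as `coef_` and `intercept_`. A minimal sketch on synthetic, noise-free data (made-up values, so the fit is exact):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2*x1 - x2 + 3 exactly (made-up, no noise).
X = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 2.0, 1.0]})
y = 2 * X["x1"] - X["x2"] + 3

model = LinearRegression().fit(X, y)

print(model.coef_)        # fitted slopes, one per feature
print(model.intercept_)   # fitted offset
print(model.score(X, y))  # R^2 on the given data
```

Since the score here is computed on the same data used for the fit, it says nothing about how the model would perform on new subjects.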