<a href="https://colab.research.google.com/github/retico/cmepda_medphys/blob/master/L6_code/Lecture6_demo3_data_exploration_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data exploration and analysis**

We explore a data sample, create descriptive plots to represent it, and carry out a basic statistical data analysis.

The data used in this demo are a table of brain features computed with the [FreeSurfer](https://surfer.nmr.mgh.harvard.edu/) segmentation software. We analyze a subsample of the large number of features generated by FreeSurfer for the [ABIDE I](http://fcon_1000.projects.nitrc.org/indi/abide/) data cohort.

We will have a quick look at [pandas](https://pandas.pydata.org/), one of the most widely used Python data analysis libraries; we will use [matplotlib](https://matplotlib.org/), a comprehensive library for data visualization, and [seaborn](https://seaborn.pydata.org/), a high-level API built on matplotlib for statistical data visualization. For statistical data analysis we will introduce the [scipy](https://www.scipy.org/) library.

All these libraries are already installed on Colab. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats

# Reading the dataset


We have to read a data table where each row corresponds to a different subject and the columns contain descriptive characteristics of each subject (e.g. age, IQ, morphometric brain features).
A pandas DataFrame is a suitable object for working with tabular data: it is a container that exposes many methods to process tables. Tabular data are stored in pandas DataFrames, whereas one-dimensional data series are stored in pandas Series.

Pandas offers plenty of readers out of the box.

In [None]:
[x for x in dir(pd) if 'read' in x ]
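
As a minimal sketch of how one of these readers behaves, the cell below parses a small CSV table held in a string instead of a file. The three rows and their values are made up for illustration; only the column names mimic the dataset used in this demo.

```python
import io
import pandas as pd

# Hypothetical three-row table, inlined so the example runs without any file.
csv_text = """FILE_ID,AGE_AT_SCAN,FIQ
SiteA_0000001,12.5,104
SiteA_0000002,30.1,-9999
SiteB_0000003,18.0,97
"""

# read_csv accepts a path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # column types inferred by the reader
```

The same call works unchanged when a file path is passed instead of the in-memory buffer.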

Run this cell to mount your Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!ls /content/drive/MyDrive/cmepda_medphys_dataset/FEATURES/Brain_MRI_FS_ABIDE/

Let us read a csv file as a pandas DataFrame. 


In [None]:
dataset_file = "/content/drive/MyDrive/cmepda_medphys_dataset/FEATURES/Brain_MRI_FS_ABIDE/FS_features_ABIDE_males_someGlobals.csv"
# check and modify the path of the FS_features_ABIDE_males_someGlobals.csv file you downloaded in your drive

In [None]:
df = pd.read_csv(dataset_file)
df.head()

In [None]:
df.head(10)

In [None]:
df.tail()

Check the size and shape of the DataFrame

In [None]:
df.size

In [None]:
df.shape

To access a single column:

In [None]:
df.FILE_ID

In [None]:
df['FILE_ID']

We can check that df is a DataFrame, whereas df.FILE_ID, which is a single column, is a pandas Series

In [None]:
print(type(df), type(df.FILE_ID))

To access a single row:

In [None]:
df[0:1]

In [None]:
df[df.FILE_ID=='Caltech_0051461']

We can easily add columns to a DataFrame; for example, we can add a column containing data derived from the other columns

In [None]:
df['dummy'] = df.DX_GROUP +1
df.head()

and we can delete it

In [None]:
del df['dummy']

We can apply functions to the column values. 

For example, since it is hard to remember what DX_GROUP=1 actually means, we can make this column more readable.

First, we select the DX_GROUP column, then we apply a function to all its elements to convert the number to a meaningful label.

In [None]:
df['DX_GROUP'] = df.DX_GROUP.apply(lambda x: 'Controls' if x==-1 else 'ASD')
df.head()

In [None]:
df.DX_GROUP.unique()
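
An alternative to the lambda, sketched here on a made-up Series, is to pass a dictionary to `Series.map`. The coding (1 for ASD, -1 for Controls) is assumed from the lambda used above, and the four values are hypothetical.

```python
import pandas as pd

# Hypothetical diagnostic codes, using the same coding as the lambda above
codes = pd.Series([1, -1, -1, 1])

# Each value is looked up in the dictionary; codes missing from it become NaN
labels = codes.map({1: "ASD", -1: "Controls"})
print(labels.tolist())
```

A design note: with a dictionary, an unexpected code shows up as NaN instead of being silently labelled 'ASD' by the `else` branch.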

We can carry out several operations and computations on the elements of the DataFrame, e.g. we can count the number of entries with a certain label:

In [None]:
print(df[df.DX_GROUP=='ASD'].FILE_ID.count())
print(df[df.DX_GROUP=='Controls'].FILE_ID.count())

or we can compute the average age of a sample/subsample:

In [None]:
df[df.DX_GROUP=='Controls'].AGE_AT_SCAN.mean()
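
The counts and averages above can also be obtained for all labels at once. The sketch below uses a made-up five-row table (the ages are hypothetical) with `value_counts` and `groupby`.

```python
import pandas as pd

# Hypothetical toy table with the same column names as the dataset
toy = pd.DataFrame({
    "DX_GROUP": ["ASD", "Controls", "ASD", "Controls", "Controls"],
    "AGE_AT_SCAN": [10.0, 12.0, 14.0, 16.0, 20.0],
})

counts = toy.DX_GROUP.value_counts()                   # entries per label
mean_age = toy.groupby("DX_GROUP").AGE_AT_SCAN.mean()  # mean age per label
print(counts)
print(mean_age)
```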

Slicing: to select ranges of rows and/or columns it is possible to use either labels or indices.

In [None]:
df.columns

using column labels (.loc)

In [None]:
selected_feat = df.loc[:,'lh_MeanThickness':'rhCortexVol']
selected_feat.head()

using column indices (.iloc)

In [None]:
df.iloc[:,5:9].head()

In [None]:
df.iloc[1:4,:].head()
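
One detail worth keeping in mind is that label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position. A small sketch on a made-up table (row and column names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=["r0", "r1", "r2", "r3"],
)

by_label = toy.loc["r1":"r2", "a":"b"]   # both end labels are included
by_position = toy.iloc[1:3, 0:2]         # end positions are excluded
print(by_label.equals(by_position))
```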

# Representing data



## Histograms

In [None]:
df.hist('AGE_AT_SCAN')

In [None]:
df.hist(['AGE_AT_SCAN','FIQ'])

By convention, in this dataset missing values in the FIQ column are indicated as either -9999 or 0

In [None]:
df.FIQ[:20]

We can mask these entries:

In [None]:
df[df.FIQ>0].hist('FIQ')
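
Instead of filtering rows every time, the sentinel values can be turned into NaN, which pandas skips by default in most statistics. A minimal sketch on made-up FIQ values:

```python
import pandas as pd

# Hypothetical FIQ values, with -9999 and 0 marking missing entries
fiq = pd.Series([104, -9999, 97, 0, 121], name="FIQ")

# Keep values where the condition holds, replace the others with NaN
fiq_clean = fiq.where(fiq > 0)
print(fiq_clean.isna().sum())   # number of missing entries
print(fiq_clean.mean())         # NaN values are ignored
```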

## Plots

Let's have a look at the DataFrame methods for data plotting

In [None]:
[x for x in dir(df) if 'plot' in x ]

In [None]:
df.AGE_AT_SCAN[:30].plot(kind='bar')

## Boxplots

Boxplots are very useful in analysis reports. For example, we can visually check whether two cohorts are matched for some parameter (e.g. age).

In [None]:
boxplot = df.boxplot(column=['AGE_AT_SCAN'], by='DX_GROUP', showfliers=False)
boxplot.set_title('Box plot of subject\'s age at scan')
boxplot.get_figure().suptitle('');

If we want, for example, to group subjects according to the acquisition site, we have to retrieve this information from the DataFrame. We find out that the site name is part of the FILE_ID.

We can use the .split method available for the strings, and use it on the DataFrame column elements by defining a suitable lambda function. 

In [None]:
df.FILE_ID

In [None]:
df.FILE_ID[0].split('_')[0]

We add the "Site" column to the DataFrame

In [None]:
df['Site'] = df.FILE_ID.apply(lambda x: x.split('_')[0])
df.head()
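
The same result can be obtained without a lambda through the vectorized `.str` accessor of a Series. The sketch below uses made-up identifiers that only mimic the `Site_number` pattern of FILE_ID.

```python
import pandas as pd

# Hypothetical identifiers following the "<site>_<number>" pattern
file_id = pd.Series(["SiteA_0000001", "SiteB_0000002", "SiteA_0000003"])

# .str.split splits every element; .str[0] takes the first piece of each
site = file_id.str.split("_").str[0]
print(site.tolist())
```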

We can make a boxplot representing the age values for each acquisition site.

In [None]:
boxplot = df.boxplot(column=['AGE_AT_SCAN'], by='Site', showfliers=False)
boxplot.set_title('Box plot of subject\'s age at scan')
boxplot.get_figure().suptitle('');
boxplot.set_ylabel('Age [y]')

boxplot.set_xticklabels(labels=boxplot.get_xticklabels(), rotation=50);

The [seaborn](https://seaborn.pydata.org/) API interoperates well with pandas DataFrame and allows us to refer to columns by label

In [None]:
import seaborn as sns
sns_boxplot = sns.boxplot(x='Site', y='AGE_AT_SCAN', data=df)
# rotate the labels of this axes; reusing the labels of the pandas boxplot could mislabel the sites
sns_boxplot.tick_params(axis='x', labelrotation=50)
sns_boxplot.grid()

We can visually compare the measured values across different sites.

In [None]:
df.columns

In [None]:
import seaborn as sns
sns_boxplot = sns.boxplot(x='Site', y='lh_MeanThickness', data=df)
# rotate the labels of this axes; reusing the labels of the pandas boxplot could mislabel the sites
sns_boxplot.tick_params(axis='x', labelrotation=50)
sns_boxplot.grid()

## Grouping

Data can be grouped by feature and visualized according to a given aggregation function

In [None]:
site_counts = df.groupby('Site').count()
site_counts

In [None]:
site_counts = df.groupby('Site')['FILE_ID'].count()
site_counts

It is quite easy to obtain a bar plot from a pandas Series

In [None]:
site_counts.plot(kind='bar', title='Number of subjects per site')


Now let's create a bar plot broken down by DX_GROUP, showing how many ASD subjects and how many controls are available at each site. With `stacked=False` the two groups are drawn side by side; set `stacked=True` to stack them.

In [None]:
stack = df.groupby(['Site', 'DX_GROUP'])['FILE_ID'].count()
stack

In [None]:
unstacked = stack.unstack('DX_GROUP')
unstacked
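
To see what `groupby` followed by `unstack` does, here is a minimal sketch on a made-up five-row table. It uses `size()` instead of counting a specific column, and `fill_value=0` so that an empty Site/group combination shows 0 instead of NaN; both are choices made for this illustration.

```python
import pandas as pd

# Hypothetical toy table: site B has no controls
toy = pd.DataFrame({
    "Site": ["A", "A", "A", "B", "B"],
    "DX_GROUP": ["ASD", "Controls", "Controls", "ASD", "ASD"],
})

# Rows per (Site, DX_GROUP) pair, then DX_GROUP moved from index to columns
table = toy.groupby(["Site", "DX_GROUP"]).size().unstack("DX_GROUP", fill_value=0)
print(table)
```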

In [None]:
unstacked.plot(kind='bar', stacked=False, title='Number of subjects per site')

We can select a number of columns (data slicing)

In [None]:
df.columns

In [None]:
selected_feat = df.loc[:,'lh_MeanThickness':'rhCortexVol']
selected_feat.head()

Let's see what our selection looks like with the seaborn pairplot!

In [None]:
sns.pairplot(selected_feat)

# Basic data analysis

To carry out basic data analysis, we use the [SciPy](https://www.scipy.org/scipylib/index.html) library, which provides many user-friendly and efficient numerical routines, such as routines for numerical integration, interpolation, optimization, linear algebra, and statistics.

In [None]:
import scipy.stats

## Finding outliers in the distributions

In [None]:
df.columns

We select the 7 columns reporting the brain measures

In [None]:
data = df.iloc[:,5:12]

We use the z-score as a criterion to determine the presence of outliers.

Z-score is defined as:

$z(x) = \frac{x - \bar{x}}{\sigma}$. 

Data points with an absolute z-score above 3 (more than 3$\sigma$ away from the mean) are considered outliers of the distribution.

In [None]:
df_no_outliers=df[(abs(scipy.stats.zscore(data)) < 3).all(axis=1)]  
# .all(axis=1) is True only for the rows where every selected column has |z| < 3
df_no_outliers.shape
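
The z-score filter can be checked on synthetic data where the outlier is known in advance. In this sketch the 200x2 array of normally distributed values and the planted value 15.0 are made up for illustration.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
values[0, 1] = 15.0   # plant one obvious outlier in the first row

# zscore works column by column (axis=0 is the default)
z = scipy.stats.zscore(values)

# keep a row only if all of its columns are within 3 sigma
keep = (np.abs(z) < 3).all(axis=1)
print(keep[0], keep.sum())
```

Note that the outlier itself inflates the standard deviation used in the z-score, so the criterion becomes less sensitive when several extreme values are present.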

In [None]:
sns_boxplot = sns.boxplot(x='Site', y='AGE_AT_SCAN', data=df)
# rotate the labels of this axes; reusing the labels of the pandas boxplot could mislabel the sites
sns_boxplot.tick_params(axis='x', labelrotation=50)
sns_boxplot.grid()

In [None]:
sns_boxplot = sns.boxplot(x='Site', y='AGE_AT_SCAN', data=df_no_outliers)
# rotate the labels of this axes; reusing the labels of the pandas boxplot could mislabel the sites
sns_boxplot.tick_params(axis='x', labelrotation=50)
sns_boxplot.grid()

## Statistical analysis

Is there any significant difference in the AGE_AT_SCAN and FIQ features between the two diagnostic categories?

In [None]:
df_ASD = df[df.DX_GROUP == 'ASD']
df_CTR = df[df.DX_GROUP == 'Controls']

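First of all, we have to check whether our data are normally distributed. We can use `scipy.stats.normaltest`, which tests the null hypothesis that a sample comes from a normal distribution. A small p-value means that normality is rejected, in which case a non-parametric test such as the Mann-Whitney U test is preferable to the t-test.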


In [None]:
k2, p_asd = scipy.stats.normaltest(df_ASD.AGE_AT_SCAN)
k2, p_ctr = scipy.stats.normaltest(df_CTR.AGE_AT_SCAN)
p_asd, p_ctr

In [None]:
test_res = scipy.stats.mannwhitneyu(df_ASD.AGE_AT_SCAN, df_CTR.AGE_AT_SCAN)
test_res
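
The decision logic described above can be sketched as follows on synthetic data. The two samples, the seed, and the 0.05 threshold are assumptions made for this illustration, not part of the dataset analysis.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(1)
# Hypothetical skewed samples (exponential), standing in for two groups
group_a = rng.exponential(scale=10.0, size=100)
group_b = rng.exponential(scale=10.0, size=100)

alpha = 0.05   # assumed significance threshold
_, p_a = scipy.stats.normaltest(group_a)
_, p_b = scipy.stats.normaltest(group_b)

if min(p_a, p_b) < alpha:
    # normality rejected for at least one group: use a non-parametric test
    result = scipy.stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
else:
    result = scipy.stats.ttest_ind(group_a, group_b)
print(result.pvalue)
```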

We can do the same analysis for the FIQ.
First we have to remove from the DataFrames the entries with a missing FIQ (coded as -9999 or 0).

In [None]:
df_ASD = df_ASD[df_ASD.FIQ >0]
df_CTR = df_CTR[df_CTR.FIQ >0]

In [None]:
k2, p_asd = scipy.stats.normaltest(df_ASD.FIQ)
k2, p_ctr = scipy.stats.normaltest(df_CTR.FIQ)
p_asd, p_ctr

In this case we can run either the t-test or the Wilcoxon-Mann-Whitney test:

In [None]:
test_res = scipy.stats.ttest_ind(df_ASD.FIQ, df_CTR.FIQ)
test_res

In [None]:
test_res = scipy.stats.mannwhitneyu(df_ASD.FIQ, df_CTR.FIQ)
test_res

Let us define a new index of left-right asymmetry of the mean cortical thickness

In [None]:
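def LR(data):
  # Left-right asymmetry index: (lh - rh) divided by the mean of lh and rh.
  # Note: the 'LR' column is added in place to the DataFrame passed in.
  lr_index = data.lh_MeanThickness - data.rh_MeanThickness
  lr_index /= 0.5*(data.rh_MeanThickness + data.lh_MeanThickness)
  data['LR'] = lr_index
  return data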

In [None]:
data = LR(df)
data.head()

In [None]:
color = data.DX_GROUP.apply(lambda x:'blue' if x == 'ASD' else 'red')
ax = data.plot(x='AGE_AT_SCAN', y='LR', kind='scatter', color=color);
ax.grid()

In [None]:
data.DX_GROUP.unique()

In [None]:
LR_ASD = data[data['DX_GROUP'] == 'ASD']['LR']
LR_CTR = data[data['DX_GROUP'] == 'Controls']['LR']

In [None]:
k2, p_asd = scipy.stats.normaltest(LR_ASD)
k2, p_ctr = scipy.stats.normaltest(LR_CTR)
p_asd, p_ctr

In [None]:
scipy.stats.mannwhitneyu(LR_ASD, LR_CTR)

In [None]:
boxplot = df.boxplot(column=['LR'], by='DX_GROUP', showfliers=False)
boxplot.set_title('Box plot of subject\'s LR asymmetry index')
boxplot.get_figure().suptitle('');
boxplot.set_ylabel('LR')

boxplot.set_xticklabels(labels=boxplot.get_xticklabels());

We can compute the effect size in terms of Cohen's *d* index. 

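A population effect size based on means is usually computed as follows, where $\sigma$ is the standard deviation (estimated in the cell below from the whole sample):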

$d = \frac{\mu_1-\mu_2}{\sigma}$




In [None]:
d_cohen = (LR_ASD.mean()-LR_CTR.mean())/data['LR'].std()
d_cohen
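
A common variant, sketched here as an alternative to the cell above, estimates $\sigma$ with the pooled standard deviation of the two groups. The helper name `cohen_d_pooled` and the two small samples are hypothetical.

```python
import numpy as np

def cohen_d_pooled(x, y):
    """Cohen's d using the pooled (unbiased) standard deviation of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    # weighted average of the two sample variances
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / (n_x + n_y - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

d = cohen_d_pooled([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(d)
```

The pooled estimate uses only the within-group spread, whereas the standard deviation of the whole sample also includes the difference between the group means.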

## Correlations among variables

Let's have a look at the data from a correlation perspective: can we spot any relationship? 

In [None]:
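# keep numeric columns only: recent pandas versions raise an error on string columns
data.drop('SEX', axis=1).select_dtypes('number').corr()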

In [None]:
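# keep numeric columns only: recent pandas versions raise an error on string columns
sns.heatmap(data.drop('SEX', axis=1).select_dtypes('number').corr());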

`scipy.stats.pearsonr` returns the Pearson correlation coefficient and the corresponding p-value

In [None]:
res = scipy.stats.pearsonr(data['rhCortexVol'], data['lhCortexVol'])
res

In [None]:
res = scipy.stats.pearsonr(data['TotalGrayVol'], data['LR'])
res
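
As a quick sanity check of what `pearsonr` computes, the sketch below compares its coefficient with the off-diagonal element of `numpy.corrcoef` on two made-up, nearly proportional arrays.

```python
import numpy as np
import scipy.stats

# Hypothetical data: y is roughly 2*x plus small deviations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r, p = scipy.stats.pearsonr(x, y)     # coefficient and p-value
r_manual = np.corrcoef(x, y)[0, 1]    # same coefficient from the correlation matrix
print(r, p, r_manual)
```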

# Permutation test

In [None]:
import numpy as np

In [None]:
Avg_obs_diff = LR_ASD.mean()-LR_CTR.mean()
Avg_obs_diff 

In [None]:
n_perm = 1000
n_examples=LR_ASD.shape[0]+LR_CTR.shape[0]
n_examples

In [None]:
LR_all = np.append(LR_ASD, LR_CTR)

In [None]:
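n_asd = LR_ASD.shape[0]
Avg_diff_perm = np.empty(n_perm)
for i in range(n_perm):
    # shuffle the pooled values: group labels are assigned at random
    perm_i = np.random.permutation(LR_all)
    avg_A = perm_i[:n_asd].mean()            # first n_asd values play the role of the ASD group
    avg_B = perm_i[n_asd:n_examples].mean()  # the remaining values play the role of the controls
    Avg_diff_perm[i] = avg_A - avg_B
Avg_diff_perm.shape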

We obtained an array with the differences between the means of the LR asymmetries of the two groups under the null hypothesis.

We can plot the histogram:

In [None]:
_ = plt.hist(Avg_diff_perm, 25, histtype='step')

We add to the histogram a vertical red line indicating the measured difference between the means of the LR asymmetry of the ASD and CTR groups (i.e. the difference in the mean values obtained with the correct group labels)

In [None]:
plt.hist(Avg_diff_perm, 25, histtype='step')
plt.axvline(Avg_obs_diff, linestyle='--', color='red')

## Evaluation of the empirical $p$-value

How many of the differences obtained under the null hypothesis are larger than the observed one? That proportion gives the empirical $p$-value:

$p = \frac{r+1}{N+1}$

where $N$ is the number of permutations and $r$ is the number of permutations for which $t_i > t_{obs}$. Counting only $t_i > t_{obs}$ gives a one-sided $p$-value; comparing absolute values, $|t_i| > |t_{obs}|$, gives the two-sided one.

We add 1 to both the numerator and the denominator to avoid underestimating the $p$-value, which in this way can never be exactly zero (see the reference "Permutation p-values should never be zero: calculating exact P-values when permutations are randomly drawn" https://pubmed.ncbi.nlm.nih.gov/21044043/ )


In [None]:
Avg_diff_perm[abs(Avg_diff_perm) > abs(Avg_obs_diff)].shape[0]

In [None]:
r = Avg_diff_perm[Avg_diff_perm > Avg_obs_diff].shape[0]
p_value = (r + 1 )/ (n_perm +1)
if r == 0:
  print(f'The p value is p < {p_value:.3f}')
else:
  print(f'The p value is p = {p_value:.3f}')
if p_value < 0.05:
  print('The difference between the mean LR asymmetry index of the two groups is statistically significant! ')
else:
  print('The null hypothesis cannot be rejected')
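
The whole procedure can be wrapped in a single function. The sketch below is one possible two-sided version, not the notebook's own implementation: the function name, the seed argument, and the toy samples are hypothetical, and it counts permutations that are at least as extreme as the observed difference (`>=`), whereas the cells above use a strict `>`.

```python
import numpy as np

def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Two-sided empirical p-value for the difference between the means of a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled = np.concatenate([a, b])
    observed = a.mean() - b.mean()
    r = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)   # random reassignment of group labels
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            r += 1
    return (r + 1) / (n_perm + 1)

# Two clearly separated hypothetical groups
p_sep = permutation_p_value(np.arange(10, 18), np.arange(0, 8))
# Two identical groups: the observed difference is zero
p_same = permutation_p_value([1, 2, 3], [1, 2, 3])
print(p_sep, p_same)
```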

# Conclusions

We've had an extremely quick overview of data exploration, visualization and statistical analysis methods.

To learn more, for example, about possible alterations of left-right brain asymmetry in ASD, you can read the recent work by
Postema MC, *et al.*, [ENIGMA ASD](http://enigma.ini.usc.edu/ongoing/enigma-asd-working-group/) working group, [*Altered structural brain asymmetry in autism spectrum disorder in a study of 54 datasets*](https://www.nature.com/articles/s41467-019-13005-8), Nat Commun. 2019 10(1):4958. doi: 10.1038/s41467-019-13005-8.