# Preparing RS-fMRI data for SBM analysis

This Python notebook is the first step in the analysis of RS-fMRI data with hierarchical stochastic block models. It is part of the analyses underlying the dissertation "Topic modelling for the stratification of neurological patients", written by W. Van Echelpoel (WVE) under the supervision of prof. D. Marinazzo (DM) (Ghent University). The data were provided by DM as a folder structure containing the results of a 268-ROI parcellation of RS-fMRI data (see further).

The notebook has been developed for this specific data structure, but it can be adapted to load other data sets (e.g., a 278 parcellation). If the original data is not available, one can start directly from the (partially) pre-processed data (correlations between ROI-pairs). Visualisations of the conventional analyses are included to give insight into the data, yet the final graphs were created in R, for which separate scripts are available ('S03_SupervisedClustering.R').

## Prepare environment with modules

In [None]:
import os
import scipy.io # Explicit submodule import, needed for scipy.io.loadmat
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

In [None]:
# Identify working directory for data
# os.getcwd() # To check working directory
os.chdir(os.path.dirname(os.getcwd())) # Move out of 'Scripts'-folder
os.chdir ('Data') # Move into 'Data'-folder

## Data of 268 parcellation
This section is only relevant if the original raw data is available. If this is not the case, one should move directly to the first subsection (contrasting means).

As a first step, the data of the 268 parcellation is examined. This folder contains 259 subfolders with participant data (note that this number differs from the 'demo.csv' file, which mentions 260). The data is stored in separate MATLAB files that have to be opened one by one.

For each participant, 152 measurements are provided for each of the 268 ROIs (the arrays have 278 columns, but the last 10 are empty). From this data, the Pearson correlation coefficients between the time series of each pair of ROIs are calculated, which increases the number of columns from 268 ROIs to 268 × 267 / 2 = 35,778 ROI-pairs.
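
As an illustration of this step, the following minimal sketch runs the same calculation on synthetic data (random numbers standing in for one participant's time series; the variable names `ts`, `mx`, `iu` and `v` are chosen for this example only). It shows how a 152 × 268 array becomes a vector of 35,778 coefficients, one per unordered ROI-pair.

```python
import numpy as np

# Synthetic stand-in for one participant: 152 time points x 268 ROIs
rng = np.random.default_rng(0)
n_time, n_roi = 152, 268
ts = rng.standard_normal((n_time, n_roi))

# np.corrcoef treats rows as variables, hence the transpose (ROIs x time points)
mx = np.corrcoef(ts.T)

# Keep every unordered ROI-pair once: strict upper triangle, diagonal excluded
iu = np.triu_indices(n_roi, k=1)
v = mx[iu]

n_pairs = n_roi * (n_roi - 1) // 2
print(mx.shape, v.shape, n_pairs)  # (268, 268) (35778,) 35778
```

The upper triangle is traversed row by row, so element `k` of the vector belongs to the (zero-based) ROI indices `(iu[0][k], iu[1][k])`. This is the same order as the nested loop that builds the ROI reference list further on.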

In [None]:
# Start with analysis of ROI of 268 parcellation (278: see further)
os.chdir ('01 Raw Data') # Move into folder with raw data
os.chdir ('ts268')
os.chdir ('processed') # Only for ts268 data

In [None]:
# List all participant folders
folder_list = [f for f in os.listdir(os.getcwd()) if os.path.isdir(f)]

In [None]:
# Create lists to store the standard deviations (per ROI) and the correlation vectors
roiSTD = []
corrList = []

In [None]:
# Loop over all folders
for folder in folder_list:
    # Go deeper in folder list
    os.chdir(folder)
    os.chdir('fmri_rest')
    
    # Read in data + remove last 10 columns ('na')    
    data_roi = scipy.io.loadmat(file_name = 'data_ROI_268')
    data = [row[:268] for row in data_roi['data_ROI']]
    
    # Extract standard deviation of time series
    roiSTD.append(list(pd.DataFrame(data).std()))
    
    # Pearson product-moment correlation coefficients
    mx_correl = np.corrcoef(np.transpose(data))
    
    # Extract coefficients off-diagonal (here upper triangular part)
    v_correl = mx_correl[np.triu_indices(len(mx_correl), k = 1)]
    
    # Append to list
    corrList.append(v_correl)
    
    # Move up two folders to allow loop to continue
    os.chdir(os.path.dirname(os.getcwd()))
    os.chdir(os.path.dirname(os.getcwd()))

# Move three folders up to exit folder with raw data (back to 'Data'-folder)
os.chdir(os.path.dirname(os.getcwd()))
os.chdir(os.path.dirname(os.getcwd()))
os.chdir(os.path.dirname(os.getcwd()))

In [None]:
# Create reference list for ROI-pairs
df_roiRef, n_pair = [], 0
for i in range(len(mx_correl)):
    for j in range(i + 1, len(mx_correl)):
        df_roiRef.append([i, j, 'Pair' + str(n_pair+1)])
        n_pair += 1

In [None]:
df_roiRef = pd.DataFrame(df_roiRef, columns = ['Region 1', 'Region 2', 'Pair'])

In [None]:
# Turn correlation list into dataframe with indices
corrList = pd.DataFrame(corrList, columns = ['Pair' + str(i+1) for i in range(len(corrList[0]))])
corrList.index = folder_list

# Turn STD list into dataframe with indices (one column per ROI, not per ROI-pair)
roiSTD = pd.DataFrame(roiSTD, columns = ['ROI' + str(i+1) for i in range(len(roiSTD[0]))])
roiSTD.index = folder_list

In [None]:
# Identify columns (ROI-pairs) with missing data and exclude from data
v_exclude = corrList.isnull().sum()[corrList.isnull().sum() > 0].index.tolist()
corrList_NoNa = corrList.loc[:, ~corrList.columns.isin(v_exclude)]

In [None]:
# Number of ROI-pairs to exclude to obtain a complete matrix
len(v_exclude)

In [None]:
# Save data frames
df_roiRef.to_csv('./02 Cleaned data/D_ROIReferenceList.txt', sep = ';')
corrList.to_csv('./02 Cleaned data/D_PearsonCoefficient.txt', sep = ';')
corrList_NoNa.to_csv('./02 Cleaned data/D_PearsonCoefficient_NoNa.txt', sep = ';')

Aside from storing the data for subsequent analysis, it is worthwhile to look at the data itself. The main focus is on the Pearson coefficients (as per the initial idea of the study). Attention is given to (1) the contrast of the category means for all ROI-pairs, (2) PCA of the Pearson coefficient data, and (3) hierarchical clustering.

The visualisations in this notebook are meant to give insight into the data and into the results of the more conventional clustering techniques. The category means are used directly by R to create figures for the report, while PCA and hierarchical clustering are redone in R (and the associated visual representations are used for the report). These steps are taken in the R-script 'S03_SupervisedClustering.R'. Hence, the visualisations here serve only to make this notebook a more stand-alone analysis.

### Contrasting category means

In [None]:
# Read in data (if necessary), else update name
corrDF = pd.read_csv('./02 Cleaned data/D_PearsonCoefficient_NoNa.txt', sep = ";", index_col = 0)
# corrDF = pd.DataFrame(corrList_NoNa)

# Add column with category (group) info; rows are assumed to be ordered by category (120 + 50 + 49 + 40 = 259)
corrDF = corrDF.assign(Category = [1]*120 + [2]*50 + [3]*49 + [4]*40)
corrDF.iloc[:2,:]

In [None]:
# Calculate mean Pearson coefficient per pair per category
corrMeans = corrDF.groupby(['Category']).mean()
corrMeans

In [None]:
# Plot Pearson coefficient, from highest to lowest for category 1
corrMeans.transpose().sort_values(by = 1,ascending = False).plot()

In [None]:
# Plot mean Pearson coefficients, as contrast between two groups (x and y can be changed)
corrMeans.transpose().plot.scatter(x = 3, y = 4)

In [None]:
# Save the data (e.g., for in R)
corrMeans.to_csv('./02 Cleaned data/D_MeanPerCategory.txt', sep = ';')

### PCA

In [None]:
# Read in data (if necessary), else update name
# corrDF = pd.read_csv('./02 Cleaned data/D_PearsonCoefficient_NoNa.txt', sep = ";", index_col = 0)
corrDF = pd.DataFrame(corrList_NoNa)

In [None]:
# Perform PCA with 2 components
pca_corr = PCA(n_components = 2)
pcCorr = pca_corr.fit_transform(corrDF)
pcCorr_Df = pd.DataFrame(data = pcCorr
                         , columns = ['principal component 1', 'principal component 2'])

In [None]:
# Plot PCA
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of RS-fMRI data of ROI coefficients",fontsize=20)
targets = ['HC', 'SCH', 'BD', 'ADHD']
colors = ['r', 'b', 'y', 'g'] # Colours in plotting order, matching 'targets'
# Positional row boundaries of the four categories (120, 50, 49 and 40 participants);
# .iloc is used because .loc slices include the end label and would shift every group by one row
bounds = [0, 120, 170, 219, len(pcCorr_Df)]
for k in range(len(targets)):
    rows = slice(bounds[k], bounds[k + 1])
    plt.scatter(pcCorr_Df.iloc[rows, 0], pcCorr_Df.iloc[rows, 1], c = colors[k], s = 50)

plt.legend(targets,prop={'size': 15})

### Hierarchical clustering

In [None]:
# Read in data (if necessary), else update name
# corrDF = pd.read_csv('./02 Cleaned data/D_PearsonCoefficient_NoNa.txt', sep = ";", index_col = 0)
corrDF = pd.DataFrame(corrList_NoNa)

In [None]:
# Calculate Ward linkage with Euclidean distance
linkage_data = linkage(corrDF, method='ward', metric='euclidean')
dendrogram(linkage_data)

# plt.rcParams['figure.dpi'] = 400 # To upgrade output graph
plt.show()

In [None]:
# Determine cluster membership, building on graph (3 clusters)
v_cluster = fcluster(linkage_data, 3, criterion = 'maxclust')

# Add information to dataframe
corrDF = corrDF.assign(Category = [1]*120 + [2]*50 + [3]*49 + [4]*40, Cluster = v_cluster)

In [None]:
# Construct crosstable to derive clustering
pd.crosstab(index=corrDF['Category'], columns=corrDF['Cluster'])

In [None]:
# Save the data (e.g., for in R)
corrDF.to_csv('./02 Cleaned data/D_HierarchicalClustering.txt', sep = ';')

### Selection of ROI-pairs based on correlation

The original number of variables (i.e., ROI-pairs) is high, which might affect subsequent parameter inference and overall interpretability. Hence, a reduction of the number of ROI-pairs through a correlation analysis is considered. Because calculating a complete correlation matrix takes a long time, an alternative approach was used. More specifically, the ROI-pairs correlated with the first ROI-pair were identified and stored for later removal. Then, the same was done for the second ROI-pair (unless it had already been identified as correlated with the first ROI-pair), and so on. A different ordering of the ROI-pairs may well result in a different selection of ROI-pairs to be removed, yet this has not been checked within the framework of this study.
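
The procedure described above, and its dependence on column order, can be illustrated with a small sketch on toy data. The function name `greedy_filter` and the toy columns `A`, `B` and `C` are hypothetical and only serve this example; the logic is intended to mirror the column-wise loop used in this section, not to replace it.

```python
import numpy as np
import pandas as pd

def greedy_filter(df, limit):
    """Return the columns flagged for removal by a first-come greedy correlation filter."""
    dropped = set()
    cols = list(df.columns)
    for i, c1 in enumerate(cols):
        if c1 in dropped:
            continue  # A column that is already flagged is not used as a reference
        for c2 in cols[i + 1:]:
            if c2 in dropped:
                continue
            if abs(df[c1].corr(df[c2])) > limit:
                dropped.add(c2)  # The later column of a correlated pair is removed
    return dropped

# Toy data: B is a noisy copy of A, C is independent of both
rng = np.random.default_rng(1)
a = rng.standard_normal(200)
toy = pd.DataFrame({
    'A': a,
    'B': a + 0.1 * rng.standard_normal(200),
    'C': rng.standard_normal(200),
})

print(greedy_filter(toy, 0.75))                    # {'B'}: A comes first, so B is removed
print(greedy_filter(toy[['B', 'A', 'C']], 0.75))   # {'A'}: B comes first, so A is removed
```

In this toy case the number of retained columns is the same for both orderings and only their identity changes. With chains of partially correlated columns, the number of retained columns could differ as well.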

In [None]:
# Read in data (if necessary), else update name
# corrDF = pd.read_csv('./02 Cleaned data/D_PearsonCoefficient.txt', sep = ";", index_col = 0)
# corrDF_NoNa = pd.read_csv('./02 Cleaned data/D_PearsonCoefficient_NoNa.txt', sep = ";", index_col = 0)
corrDF = pd.DataFrame(corrList)
corrDF_NoNa = pd.DataFrame(corrList_NoNa)

In [None]:
# Define correlation threshold (MANUALLY!): index 0 selects 0.5, index 1 selects 0.75
# ROI-pairs with an absolute correlation above this threshold are eliminated
n_lmt = [0.5, 0.75][1]

In [None]:
# Use column-wise approach to exclude pairs (data including missing values)
red = set()
for p1 in range(corrDF.shape[1]):
  print('--Column ' + str(p1 + 1) + ' of ' + str(corrDF.shape[1]) + '--')
  if corrDF.columns[p1] in red:
    continue # Pair already flagged for removal, so not used as reference
  for p2 in range(p1 + 1, corrDF.shape[1]):
    if corrDF.columns[p2] not in red:
      n_corr = corrDF.iloc[:,p1].corr(corrDF.iloc[:,p2])
      # print(corrDF.columns[p1] + ' & ' + corrDF.columns[p2] + ': ' + str(n_corr))
      if abs(n_corr) > n_lmt:
        red.add(corrDF.columns[p2])

In [None]:
# Use same column-wise approach to exclude pairs (data without missing values)
red_NoNa = set()
for p1 in range(corrDF_NoNa.shape[1]):
  print('--Column ' + str(p1 + 1) + ' of ' + str(corrDF_NoNa.shape[1]) + '--')
  if corrDF_NoNa.columns[p1] in red_NoNa:
    continue # Pair already flagged for removal, so not used as reference
  for p2 in range(p1 + 1, corrDF_NoNa.shape[1]):
    if corrDF_NoNa.columns[p2] not in red_NoNa:
      n_corr = corrDF_NoNa.iloc[:,p1].corr(corrDF_NoNa.iloc[:,p2])
      # print(corrDF_NoNa.columns[p1] + ' & ' + corrDF_NoNa.columns[p2] + ': ' + str(n_corr))
      if abs(n_corr) > n_lmt:
        red_NoNa.add(corrDF_NoNa.columns[p2])

In [None]:
# Remove (strongly) correlated pairs from dataframe
corrSel = corrDF.drop(red, axis = 1)
corrSel_NoNa = corrDF_NoNa.drop(red_NoNa, axis = 1)
print('Start: ' + str(corrDF.shape) + ' - Reduced: ' + str(corrSel.shape))
print('Start: ' + str(corrDF_NoNa.shape) + ' - Reduced: ' + str(corrSel_NoNa.shape))

In [None]:
# Save the data (e.g., for in R)
corrSel.to_csv('./02 Cleaned data/D_SelectedPairs_' + str(round(100*n_lmt)) + '.txt', sep = ';')
corrSel_NoNa.to_csv('./02 Cleaned data/D_SelectedPairs_NoNa_' + str(round(100*n_lmt)) + '.txt', sep = ';')