# ALS Patients Clustering

Amyotrophic Lateral Sclerosis (ALS) is a progressive neurodegenerative disease that affects nerve cells in the brain and spinal cord, leading to the gradual loss of muscle control and difficulties with speaking, swallowing, and breathing. It primarily targets motor neurons, which are responsible for controlling voluntary muscle movements.

In this project, I applied the K-Means clustering algorithm to group patients based on multiple clinical and physiological features. The goal is to identify patterns in patient data that may support early detection, better treatment, and an improved understanding of the disease.

## Importing libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import seaborn as sns


## Loading the Data

In [None]:
# set the option of displaying all the columns
pd.set_option('display.max_columns', None)

# load data
df = pd.read_csv(r"C:\Users\sanas\Desktop\Public health projects\Datasets\als_data.csv")

# display few rows of data
df.head()

## Understanding the Data

In [None]:
# display the shape of data
df.shape

This appears to be a complex, structured dataset: medium-sized and high-dimensional. The features are not raw inputs but engineered biomedical features. Many variables are aggregated as per-patient summary statistics (min, max, range, slope, ...).

In [None]:
# display the columns
df.columns

In [None]:
# overview of the data
df.info()

In [None]:
# check for duplicates
df.duplicated().sum()

In [None]:
# check for null values
nulls = df.isnull().sum()
nulls

The dataset does not include duplicates or missing data.

In [None]:
print("Minimum Age: ", df['Age_mean'].min())
print("Maximum Age: ", df['Age_mean'].max())

In [None]:
# Age distribution of patients

In [None]:
plt.hist(df['Age_mean'], bins=30)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

The histogram shows the distribution of the patients' mean age. Ages range from about 20 to 80 years, and most patients are between 54 and 65 years old. The distribution is roughly bell-shaped, with no clear skewness or outliers, so it is reasonable to use as-is when clustering ALS patients.
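The visual impression of a bell-shaped distribution can also be checked numerically. The sketch below is only illustrative: `describe_shape` is a hypothetical helper, and `synthetic_age` is randomly generated data standing in for `df['Age_mean']`, not the real dataset. It reports the sample skewness and the number of points outside the usual 1.5 × IQR fences.

```python
import numpy as np
import pandas as pd


def describe_shape(values: pd.Series) -> dict:
    """Return the sample skewness and the number of IQR-rule outliers."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return {"skewness": float(values.skew()), "n_outliers": n_outliers}


# Synthetic stand-in for df['Age_mean'] (assumption: not the real data)
rng = np.random.default_rng(0)
synthetic_age = pd.Series(rng.normal(loc=55, scale=11, size=2000), name="Age_mean")

age_shape = describe_shape(synthetic_age)
print(age_shape)
```

On the real data, calling `describe_shape(df['Age_mean'])` would give a skewness close to 0 if the distribution is as symmetric as the histogram suggests.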

## Data Cleansing

In [None]:
# remove data not relevant to the patient's ALS condition:

In [None]:
# drop 'ID' and 'SubjectID': they are only used for identification and carry no information for the model
df1 = df.drop(columns=['ID', 'SubjectID'])

In [None]:
df1.head()

## Data Preprocessing and Modeling

### Apply a standard scaler to the data

In [None]:
# initialize the standard scaler
scaler = StandardScaler()

# apply the scaler to the whole dataframe, since all the features are numeric
scaled_array = scaler.fit_transform(df1)

# convert the scaled array back to a dataframe
scaled_df = pd.DataFrame(scaled_array, columns = df1.columns, index= df1.index)

# print few rows of the scaled df:
print(scaled_df.head())

### Apply PCA to scaled data

In [None]:
# Initialize PCA with 2 components
pca = PCA(n_components=2)

# Fit and transform the scaled data
pca_transformed = pca.fit_transform(scaled_df)

# Create a DataFrame for the PCA-transformed data
pca_df = pd.DataFrame(data=pca_transformed, columns=['PCA1', 'PCA2'])

# print few rows of the transformed scaled dataframe
print(pca_df.head())
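Because only two components are kept, it can help to check how much of the total variance they retain before clustering on them. The sketch below uses scikit-learn's `explained_variance_ratio_` attribute on a synthetic matrix; `features`, `scaled` and `pca_check` are illustrative names, and the data is randomly generated, not the ALS dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (assumption: not the ALS data)
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
features = latent @ mixing + 0.1 * rng.normal(size=(300, 10))
scaled = StandardScaler().fit_transform(features)

# Fit a 2-component PCA and read the share of variance each component explains
pca_check = PCA(n_components=2).fit(scaled)
per_component = pca_check.explained_variance_ratio_
retained = float(per_component.sum())

print("Variance explained per component:", per_component.round(3))
print("Total variance retained by 2 components:", round(retained, 3))
```

In the notebook itself, the same information is available as `pca.explained_variance_ratio_` once `pca` has been fitted. A low total would mean the 2-D clusters reflect only a small part of the original feature space.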

In [None]:
# Create a plot of the cluster silhouette score versus the number of clusters.

In [None]:
# define the range for the number of clusters to evaluate
cluster_range = range(2, 11)

# create an empty list to store silhouette scores
silhouette_scores = []

# perform K-means clustering for each number of clusters and calculate the silhouette score
for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(pca_df)
    silhouette_avg = silhouette_score(pca_df, cluster_labels)
    silhouette_scores.append(silhouette_avg)

In [None]:
# plot the silhouette scores Vs the range of clusters
plt.figure(figsize=(10,6))
plt.plot(cluster_range, silhouette_scores, linestyle='-', marker='o', color='b')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores Vs Number of Clusters')
plt.xticks(cluster_range)
plt.grid(True)
plt.show()

The optimal number of clusters is 2, because the silhouette score reaches its maximum, around 0.38, at 2 clusters.
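Rather than reading the best value off the plot, the number of clusters with the highest silhouette score can be selected programmatically. This is a minimal sketch on synthetic 2-D points generated with `make_blobs`, which stand in for the PCA-transformed data. The names `points`, `candidate_ks` and `best_k` are illustrative, and `n_init=10` is set explicitly here as an assumption so the result does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D points standing in for the PCA-transformed data (assumption)
centers = np.array([[-5.0, 0.0], [5.0, 0.0]])
points, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=42)

# Score each candidate number of clusters with the average silhouette
candidate_ks = list(range(2, 7))
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores.append(silhouette_score(points, labels))

# Pick the candidate with the highest silhouette score
best_k = candidate_ks[int(np.argmax(scores))]
print("Best k by silhouette:", best_k)
```

Applied to the notebook's own lists, the equivalent selection would be `list(cluster_range)[int(np.argmax(silhouette_scores))]`.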

In [None]:
# Fit a K-means model to the data with the optimal number of clusters chosen

In [None]:
# initialize the KMeans model
kmeans = KMeans(n_clusters=2, random_state=42)

# train the model
kmeans.fit(pca_df)

# get the clusters labels
cluster_labels = kmeans.labels_



In [None]:
cluster_labels

In [None]:
# Make a scatterplot of the PCA-transformed data, coloring each point by its cluster value.

In [None]:
# create the column 'cluster' including the cluster labels
pca_df['cluster'] = cluster_labels

In [None]:
pca_df.head()

In [None]:
# create the scatterplot
plt.figure(figsize=(10,6))
sns.scatterplot(data= pca_df, x= 'PCA1', y='PCA2', palette='Set1', hue='cluster', s=100, edgecolor='k' )
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.title('Scatterplot of PCA transformed data colored by cluster values')
plt.legend(title='Cluster')
plt.show()

- The scatter plot highlights the structure and patterns within the data. The data is represented by the two principal components, PCA1 and PCA2, each of which is a linear combination of the original features.

- We can see two groups, suggesting that the clustering algorithm has found meaningful groupings in the data. However, along the x-axis (PCA1), the points near PCA1 = 0 form the boundary between the two clusters. The slight overlap at this boundary indicates that patients in this region share some features across the two clusters.

- Additionally, some points lie far from the main clusters and may be outliers. In the context of this dataset, they might represent unusual cases of ALS, such as patients with rare genetic mutations, atypical symptoms, or distinct disease progressions, and could warrant further investigation.



## Translating Clusters into Meaning

### 1- Cluster Summary

In [None]:
df['cluster'] = cluster_labels

cluster_summary = df.groupby('cluster').mean()

cluster_summary


### 2- Interpretation

The most important variable in ALS progression is "ALSFRS_slope", which represents the rate of decline over time, i.e. how fast the disease progresses.

- ALSFRS_slope is -0.43 in cluster 0, showing slow progression, while it is -1.03 in cluster 1, indicating rapid progression, more than twice as fast.

The functional scores confirm this:
- ALSFRS_Total_median is around 31 in cluster 0 versus about 22 in cluster 1, highlighting a lower functional status in cluster 1.
- respiratory_min is about 2.3 lower in cluster 1 than in cluster 0, indicating respiratory decline in this group.
- leg_min is around 1.28 in cluster 1 versus 3.67 in cluster 0, highlighting a marked decline in motor function in cluster 1.

So, we identified two distinct ALS progression phenotypes: cluster 0 is a slow-progression ALS group and cluster 1 is a fast-progression ALS group.  
The rapid-progression cluster shows a marked decline in motor function and worse respiratory function, suggesting a high-risk group.
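The comparison of progression rates is simple arithmetic on the two cluster means quoted above. The short sketch below makes it explicit. The variable names are illustrative, and the values are copied from the interpretation rather than recomputed from the data.

```python
# Cluster means of ALSFRS_slope as quoted in the interpretation
slope_slow = -0.43  # cluster 0
slope_fast = -1.03  # cluster 1

# Ratio of the two rates of decline
progression_ratio = slope_fast / slope_slow
print(f"Cluster 1 declines about {progression_ratio:.1f}x as fast as cluster 0")
```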




### 3- Identify the rapid progression group (top 20 high-risk patients)

In [None]:

# Rapid progression group
rapid_group = df[df['cluster'] == 1]

# Sort by ALSFRS_slope (most negative first)
rapid_extreme = rapid_group.sort_values('ALSFRS_slope')

rapid_extreme[['SubjectID','ALSFRS_slope']].head(20)


- These are the top 20 high-risk patients (the fastest progressors).

### 4- Identify outlier patients

Now we measure each patient's distance from their cluster center to find the patients who don't fit well in their cluster. These outliers are not automatically a cause for concern: they can represent atypical phenotypes, fast-progressing patients, or measurement errors.

In [None]:

# distance to each cluster center
distances = kmeans.transform(pca_df.drop(columns=['cluster']))

# minimum distance to assigned cluster
df['Distance_to_center'] = np.min(distances, axis=1)

# Top 5% farthest (above the 95th percentile of the distance)
threshold = np.percentile(df['Distance_to_center'], 95)
outliers = df[df['Distance_to_center'] > threshold]



In [None]:
outliers[['SubjectID','cluster','Distance_to_center']].sort_values("Distance_to_center", ascending=False)

These are the 5% of patients farthest from their cluster center. They may represent atypical or new ALS phenotypes, fast-progressing patients, or measurement errors, and investigating them could lead to the discovery of new biomarkers.
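The threshold above is a single 95th percentile computed over all patients, so a more spread-out cluster can contribute most of the flagged points. A per-cluster variant is sketched below as one possible alternative, not as the method used in this notebook. The `demo` frame is synthetic data standing in for the `cluster` and `Distance_to_center` columns, and `per_cluster_cutoff` and `per_cluster_outliers` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'cluster' and 'Distance_to_center' columns (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cluster": np.repeat([0, 1], [600, 400]),
    "Distance_to_center": np.concatenate([
        rng.gamma(shape=2.0, scale=1.0, size=600),  # tighter cluster
        rng.gamma(shape=2.0, scale=3.0, size=400),  # more spread-out cluster
    ]),
})

# 95th percentile of the distance, computed separately inside each cluster
per_cluster_cutoff = demo.groupby("cluster")["Distance_to_center"].transform(
    lambda d: d.quantile(0.95)
)

# Keep the patients whose distance exceeds their own cluster's cutoff
per_cluster_outliers = demo[demo["Distance_to_center"] > per_cluster_cutoff]

print(per_cluster_outliers["cluster"].value_counts().sort_index())
```

With this variant, each cluster contributes roughly 5% of its own members, regardless of how dispersed it is.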


In conclusion, we identified a fast-progression ALS subgroup for early proactive management, and flagged the 5% of patients farthest from their assigned cluster center for closer review or further investigation.