# Case Study 01: Clustering of Doctors' Notes
---

The first case study in the ALADA course is the problem of clustering medical notes using measures of similarity between vectors. We use a Kaggle dataset for this purpose. The first step to running this notebook is to download the dataset and copy it into the data directory `data/case_study_01/`.

Run the following two cells to import all the necessary libraries and to create the folder for the data files.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import numpy as np
import pathlib
# import pandas as pd
import re
import polars as pl

import matplotlib.pyplot as plt
from matplotlib import rc
rc('font',**{'family':'sans-serif','sans-serif':['Helvetica']})
plt.rcParams['svg.fonttype'] = 'none'

import sys
from alada import chap01 as ch01
from alada import casestudy01 as cs01

# Create data folder for the case study
cs01.create_data_folder()

## 0. Getting the data

Link to the kaggle dataset: [https://www.kaggle.com/datasets/gauravmodi/doctors-notes/data](https://www.kaggle.com/datasets/gauravmodi/doctors-notes/data)

Download the `reports.csv` file, and copy it into the folder `data/case_study_01/`. After you have done this, run the following cell and check if it reports success.

In [3]:
cs01.check_dataset()

'Success! You can run this notebook.'

## 1. What does this dataset have?

The data file is a `.csv` file. If you do not know what that means, then take a look [here](https://en.wikipedia.org/wiki/Comma-separated_values).

The file has two columns: `medical_specialty` and `report`.
- The `medical_specialty` column contains the specialty of the doctor who wrote the report, and
- the `report` column contains the actual text of the report written down by the doctor for a particular patient.

For the purpose of this case study, we will cluster using only the `report` column; the `medical_specialty` column is used only to evaluate the clustering results, so that we know how well our clustering algorithm has done.

We have learned about clustering n-vectors, which are objects containing numbers. How does one cluster text data?

We do this by first converting the text data into n-vectors that represent, at least partially, the information contained in the text. There are many ways to do this, but we will make use of a simple technique called `count vectorization`.

**Count Vectorization** is a technique that converts text data into n-vectors by counting the number of times each token in a chosen set of `tokens` (chosen words) appears in the text. Tokens are words that are either chosen from the text or supplied by a specialist in the field of study. Consider the following example:

**Text to convert:** "The quick brown fox jumps over the lazy dog. The dog is very lazy, but the fox is quick. The dog is friendly though."

**Tokens:** ["quick", "brown", "fox", "jumps", "lazy", "dog"]

We simply count the number of times each token appears in the text. The vector representation of the text is then: [2, 1, 2, 1, 2, 3]
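The counting in this example can be reproduced in a few lines of Python. This is a minimal sketch: the function name `count_vectorize` is hypothetical and chosen only for illustration, and the text is converted to lower case before counting so that it matches the lower-case tokens.

```python
import numpy as np

def count_vectorize(text: str, tokens: list) -> np.ndarray:
    """Count how often each token occurs (as a substring) in the lower-cased text."""
    text = text.lower()
    return np.array([text.count(tok) for tok in tokens])

text = ("The quick brown fox jumps over the lazy dog. "
        "The dog is very lazy, but the fox is quick. "
        "The dog is friendly though.")
tokens = ["quick", "brown", "fox", "jumps", "lazy", "dog"]

vec = count_vectorize(text, tokens)
print(vec)  # [2 1 2 1 2 3]
```

Each entry of the resulting 6-vector is the count of the token at the same position in the token list.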

Now, let's read the reports.csv file and see what it has.

In [26]:
# Data folder and file
datadir = "data/case_study_01"
datafile = (datadir / pathlib.Path("reports.csv")).as_posix()
tokensfile = (datadir / pathlib.Path("tokens.txt")).as_posix()

# Read the medical records file.
medrec = pl.read_csv(datafile)

# Let's view the first five rows using the .head() function on the Polars dataframe.
medrec.head()

medical_specialty,report
str,str
"""Cardiovascular / Pulmonary""","""2-D M-MODE: , ,1. Left atrial…"
"""Cardiovascular / Pulmonary""","""1. The left ventricular cavit…"
"""Cardiovascular / Pulmonary""","""2-D ECHOCARDIOGRAM,Multiple vi…"
"""Cardiovascular / Pulmonary""","""DESCRIPTION:,1. Normal cardia…"
"""Cardiovascular / Pulmonary""","""2-D STUDY,1. Mild aortic steno…"


In [27]:
# Let's now count the number of entries we have for each medical speciality.
medrec['medical_specialty'].value_counts()

medical_specialty,count
str,u32
"""Gastroenterology""",224
"""Cardiovascular / Pulmonary""",371
"""Surgery""",1088
"""Neurology""",223
"""Radiology""",273


We see that the text in the `report` column contains both upper and lower case letters. We will convert all the text to lower case to make it easier to work with. We will not modify the `report` column, but create a new column `report_lower` that contains the lower case text.

In [28]:
# Convert a str column in Polars to lower case, and add it as a new column.
medrec = medrec.with_columns(
    report_lower=pl.col("report").str.to_lowercase()
)
medrec.head()

medical_specialty,report,report_lower
str,str,str
"""Cardiovascular / Pulmonary""","""2-D M-MODE: , ,1. Left atrial…","""2-d m-mode: , ,1. left atrial…"
"""Cardiovascular / Pulmonary""","""1. The left ventricular cavit…","""1. the left ventricular cavit…"
"""Cardiovascular / Pulmonary""","""2-D ECHOCARDIOGRAM,Multiple vi…","""2-d echocardiogram,multiple vi…"
"""Cardiovascular / Pulmonary""","""DESCRIPTION:,1. Normal cardia…","""description:,1. normal cardia…"
"""Cardiovascular / Pulmonary""","""2-D STUDY,1. Mild aortic steno…","""2-d study,1. mild aortic steno…"


We can use the lower case text in the `report_lower` column to convert the text into n-vectors for clustering the reports. But **what tokens do we use for this purpose?** 

The file `tokens.txt` in the `data/case_study_01/` folder contains a list of tokens that we will use for this case study. We will not worry about how these tokens were chosen, but we will use them to convert the text data into n-vectors. This is a text file with each token separated by a comma.

Let's now read this file and look at the tokens.

In [33]:
with open(tokensfile, "r") as f:
    tokens = f.read().split(',')
print(tokens)
# The number of tokens.
print(f"Number of tokens: {len(tokens):3d}")

['impress', 'palpitation', 'gallop', 'compare', 'nonsignificant', 'collect', 'meckel', 'heartburn', 'fundoplication', 'contributory', 'rating', 'psychological', 'huntington', 'epilepsy', 'tended', 'relate', 'sprain', 'cuboid', 'predict', 'version', 'reoperative', 'proximate', 'plication', 'ladder', 'essure']
Number of tokens:  25


We can now use these tokens to convert a given text into an n-vector (a 25-vector, to be exact). To do this, we will write our own function that takes in a text and the tokens, and returns the n-vector representation of the text. Run the following code so that we can use it in the rest of the notebook.

In [35]:
def count_tokens(note: str, tokens: list) -> np.ndarray:
    """Count how many times each token occurs (as a substring) in a medical note."""
    return np.array([note.count(tok) for tok in tokens])

Let's try this function on some of the reports in the dataset.

In [113]:
# Get 10 random row numbers for medrec
rows = map(int, np.random.randint(0, len(medrec), 10))
for _row in rows:
    _nvec = count_tokens(medrec[_row, "report_lower"], tokens)
    print(f"Row: {_row:4d}", _nvec)

Row: 1637 [0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 6 0]
Row:  957 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
Row:   22 [0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 1 0 1 0 1]
Row: 1229 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Row: 1188 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 1 0 0 0]
Row:  547 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0]
Row: 2121 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 4]
Row: 1898 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Row: 1971 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Row: 1793 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0]
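One detail worth keeping in mind is that `str.count` counts substring occurrences, so a token such as `rating` is also counted inside a longer word such as `operating`. Because several tokens look like word stems (for example `impress` or `relate`), substring matching may well be intentional here. The sketch below, using a hypothetical helper `count_whole_words` that is not part of the notebook, shows how whole-word counting with a regular expression would differ.

```python
import re
import numpy as np

def count_whole_words(note: str, tokens: list) -> np.ndarray:
    """Count occurrences of each token as a whole word, not as a substring."""
    return np.array([
        len(re.findall(rf"\b{re.escape(tok)}\b", note))
        for tok in tokens
    ])

note = "operating room rating was reviewed; the rating is fine"

substring_count = note.count("rating")                    # also matches "operating"
whole_word_count = count_whole_words(note, ["rating"])    # matches "rating" only

print(substring_count)   # 3
print(whole_word_count)  # [2]
```

Whole-word matching would in turn miss inflected forms such as `ratings`, so neither choice is strictly better; they simply produce different vectors.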


### 1.2 Clustering medical reports through `kmeans`

We now assume that we only have access to the medical reports, and not to the specialty each one corresponds to. We count how often each token appears in each report, which gives one n-vector per report. These vectors are then used to measure the similarity between reports and to cluster them.

The Surgery specialty has far more entries (1088) than any other specialty, so we remove it from the data before clustering.

In [6]:
datadir = pathlib.Path("data")
outdir = pathlib.Path("output")
outdir.mkdir(exist_ok=True)


In [7]:
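# This cell uses pandas, which is not imported at the top of the notebook.
import pandas as pd

# Read the tokens from the first line of the data file: the text after the
# last ':' on that line is split on commas.
datafile = (datadir / pathlib.Path("reports.csv")).as_posix()
with open(datafile, "r") as f:
    tokens = [_s.strip() for _s in f.readline().split(':')[-1].split(",")]

# Read the medical records file, skipping the first two lines.
medrec = pd.read_csv(datafile, skiprows=2)
# Make the reports column lower case
medrec["report_lower"] = medrec["report"].str.lower()
# Show the number of entries for each medical specialty.
medrec["medical_specialty"].value_counts()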


In [8]:
# Drop the Surgery reports, then get the token vector for each remaining medical note.
medrec = medrec[medrec["medical_specialty"] != "Surgery"].reset_index(drop=True)
tokvec = np.array([count_tokens(_v, tokens) for _v in medrec["report_lower"].values])
tokvec.shape


In [21]:
k = 4
# The chap01 module was imported as ch01 at the top of the notebook.
km = ch01.KMeans(X=tokvec, k=k)
cm, ca, j = km.fit()

In [None]:
# Find the medical specialty for each cluster
medrec["cluster"] = ca[-1]
# Print value counts for different clusters.
for _c in range(k):
    print(f"Cluster {_c}: {np.linalg.norm(cm[-1][_c]): .3f}")
    print(medrec[medrec["cluster"] == _c]["medical_specialty"].value_counts())
    print()
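The per-cluster value counts printed above can also be collected into a single table of clusters against specialties. The following is a rough sketch on made-up toy labels, not on the notebook's data; the helper `cluster_label_table` and the "purity" summary (the fraction of items that carry the most common label of their cluster) are illustrative additions and not part of the `alada` package.

```python
import numpy as np

def cluster_label_table(clusters, labels):
    """Return (cluster_ids, label_names, counts), where counts[i, j] is the
    number of items in cluster i that carry label j."""
    cluster_ids, ci = np.unique(clusters, return_inverse=True)
    label_names, li = np.unique(labels, return_inverse=True)
    counts = np.zeros((len(cluster_ids), len(label_names)), dtype=int)
    # Add 1 to the (cluster, label) cell of every item.
    np.add.at(counts, (ci, li), 1)
    return cluster_ids, label_names, counts

# Toy data (made up for illustration only).
clusters = np.array([0, 0, 1, 1, 1, 2])
labels = np.array(["Neurology", "Neurology", "Radiology",
                   "Radiology", "Neurology", "Radiology"])

ids, names, counts = cluster_label_table(clusters, labels)
purity = counts.max(axis=1).sum() / counts.sum()

print(names)
print(counts)
print(f"Purity: {purity:.3f}")
```

A purity close to 1 would mean that each cluster is dominated by a single specialty; a low value would mean the clusters mix specialties.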

In [None]:
plt.plot(cm[-1].T + np.array([0, 1, 2, 3]));

### Clustering using k-means


In [None]:
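def get_cluster_assignment(tvec, mvec):
    """Return the index of the mean (row of mvec) closest to the vector tvec."""
    return np.argmin(np.linalg.norm(mvec - tvec, axis=1))

def get_cluster_mean(tokvec, clustassign):
    """Return the mean of the vectors in each non-empty cluster, one row per cluster."""
    return np.array([np.mean(tokvec[clustassign == _c, :], axis=0)
                     for _c in np.unique(clustassign)])

def get_j_clust(tokvec, mvec, clustassign):
    """Return the sum of squared distances between each vector and the mean of its cluster."""
    # Row _i of mvec is paired with the _i-th distinct label in clustassign.
    return np.sum([
        np.sum(np.square(np.linalg.norm(tokvec[clustassign == _c, :] - mvec[_i, :], axis=1)))
        for _i, _c in enumerate(np.unique(clustassign))
    ])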
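The three helper functions above correspond to the two alternating steps of k-means and to its objective. As a reading aid, they can be written as follows. The symbols are notation assumed here rather than taken from the notebook: $x_1, \ldots, x_N$ are the rows of `tokvec`, $z_1, \ldots, z_k$ are the rows of `mvec`, $c_i$ is the entry of `clustassign` for $x_i$, and $G_j = \{ i : c_i = j \}$ is the set of vectors assigned to cluster $j$.

$$
c_i = \operatorname*{argmin}_{j = 1, \ldots, k} \lVert x_i - z_j \rVert,
\qquad
z_j = \frac{1}{|G_j|} \sum_{i \in G_j} x_i,
\qquad
J_{\text{clust}} = \sum_{i=1}^{N} \lVert x_i - z_{c_i} \rVert^2
$$

The first expression is what `get_cluster_assignment` computes for one vector, the second is what `get_cluster_mean` computes for each non-empty cluster, and the third is the value returned by `get_j_clust`. Note that, as coded, `get_j_clust` is a plain sum and is not divided by $N$.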


In [None]:
k = 3
# Random starting means.
mvec = tokvec[list(np.random.randint(0, tokvec.shape[0], k)), :]

# Cluster assignment for the random means.
clustassign = np.array([get_cluster_assignment(_tvec, mvec) for _tvec in tokvec])

# Compute J_clust
J_clust_curr = get_j_clust(tokvec, mvec, clustassign)

# Iterate now.
n_iter = 20
for i in range(n_iter):
    # Update the means.
    mvec = get_cluster_mean(tokvec, clustassign)
    
    # Update the cluster assignment.
    clustassign = np.array([get_cluster_assignment(_tvec, mvec) for _tvec in tokvec])
    
    # Update J_clust
    J_clust_prev = J_clust_curr
    J_clust_curr = get_j_clust(tokvec, mvec, clustassign)

    print(f"Iteration: {i:2d}, J_clust: {J_clust_curr:6.2f}, Change: {100 * (J_clust_curr - J_clust_prev) / J_clust_prev:.2f}%")



### Demo of the `KMeans` class from the `chap01` module

In [None]:
# Generate sample data
_x1 = np.random.randn(100, 2) + np.array([4, 4])
_x2 = np.random.randn(100, 2)
X = np.vstack([_x1, _x2])
# Randomly reorder the rows
np.random.shuffle(X)

In [None]:
k = 2
# The chap01 module was imported as ch01 at the top of the notebook.
km = ch01.KMeans(X, k)
cm, ca, j = km.fit()

In [None]:
# Plot the k-mean algorithm evolution.
colors = ["tab:blue", "tab:red", "tab:green", "tab:orange", "tab:purple", "tab:brown", "tab:pink", "tab:gray", "tab:olive", "tab:cyan"]
n = len(cm)
m = (n // 5) + 1
fig = plt.figure(figsize=(15, 2.5 * m))
for i in range(n):
    ax = fig.add_subplot(m, 5, i + 1)
    for _k in range(k):
        ax.scatter(X[ca[i] == _k, 0], X[ca[i] == _k, 1], s=25, color=colors[_k],
                   marker="x", alpha=0.2)
    for _k in range(k):
        ax.scatter(cm[i][_k, 0], cm[i][_k, 1], s=50, marker="o", alpha=1, 
                   color=colors[_k], edgecolors="black") 
    ax.set_title(f"Iter: {i + 1}: J = {j[i]:.1f}")

#### Similarity problem

In [None]:
X = np.array([
    [167, 102, 36.6],
    [180, 87, 26.9],
    [177, 78, 24.9],
    [152, 76, 32.9],
]).T

# Compute the distance between the points.
np.array([[np.linalg.norm(_v1 - _v2) for _v2 in X.T] for _v1 in X.T])

In [None]:
X1 = X * np.array([0.01, 1, 1]).reshape(-1, 1)
# Compute the distance between the points.
np.array([[np.linalg.norm(_v1 - _v2) for _v2 in X1.T] for _v1 in X1.T])
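The two cells above compute the same distance matrix before and after rescaling the first feature by 0.01 (for example, a change of units). The sketch below repeats that comparison in a self-contained form, using NumPy broadcasting instead of a nested list comprehension; the helper name `pairwise_distances` is hypothetical, and the points are stored one per row here rather than one per column.

```python
import numpy as np

def pairwise_distances(P: np.ndarray) -> np.ndarray:
    """Euclidean distances between all pairs of rows of P (one point per row)."""
    diff = P[:, None, :] - P[None, :, :]   # shape (n, n, d)
    return np.linalg.norm(diff, axis=-1)   # shape (n, n)

# Same four points as above, one point per row.
P = np.array([
    [167, 102, 36.6],
    [180, 87, 26.9],
    [177, 78, 24.9],
    [152, 76, 32.9],
])

D = pairwise_distances(P)

# Rescale only the first feature, as in the cell above.
scale = np.array([0.01, 1, 1])
D1 = pairwise_distances(P * scale)

print(np.round(D, 2))
print(np.round(D1, 2))
```

Comparing the two matrices shows that Euclidean distance depends on the units chosen for each feature: shrinking one feature reduces its contribution to every distance, so the remaining features dominate.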

##### Angle similarity

In [None]:
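def angle_between(v1, v2):
    """Angle between v1 and v2 in degrees; the cosine is clipped to [-1, 1] to guard against round-off."""
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1, 1)))

# Compute the angle (in degrees) between each pair of points, rounded to two decimals.
np.round(100 * np.array([[angle_between(_v1, _v2) for _v2 in X.T] for _v1 in X.T])) / 100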

In [None]:
# Compute the angle (in degrees) between each pair of rescaled points, rounded to two decimals.
np.round(100 * np.array([[angle_between(_v1, _v2) for _v2 in X1.T] for _v1 in X1.T])) / 100
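A small two-dimensional example may help in reading the two angle matrices above. It is only a sketch with made-up vectors, and `angle_deg` is a stand-alone helper written for this illustration. It shows that the angle between two vectors does not change when each vector is multiplied by its own positive constant, but does change when a single feature is rescaled.

```python
import numpy as np

def angle_deg(v1, v2) -> float:
    """Angle between v1 and v2 in degrees."""
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# Original angle.
theta = angle_deg(a, b)

# Multiplying each vector by a positive constant leaves the angle unchanged.
theta_uniform = angle_deg(3.0 * a, 0.5 * b)

# Rescaling only the first feature changes the angle.
scale = np.array([0.01, 1.0])
theta_feature = angle_deg(a * scale, b * scale)

print(f"{theta:.2f} {theta_uniform:.2f} {theta_feature:.2f}")
```

In this example the first two angles are both 45 degrees, while the third is close to 90 degrees, because after rescaling the second vector points almost entirely along the second axis.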

In [None]:
X

In [None]:
X1

In [None]:
np.round(100 * np.array([v / np.linalg.norm(v) for v in X.T]).T) / 100

In [None]:
np.round(100 * np.array([v / np.linalg.norm(v) for v in X1.T]).T) / 100