<a href="https://colab.research.google.com/github/stumbi/mir_nlp/blob/main/exercise_05_classificator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

  <div>
    <h1 align="center">Exercise 05 - Medical Information Retrieval 2023</h1>
  </div>
  <br />

Today we move on to a machine learning approach for text classification.

## Text classification <a class="anchor" id="first"></a>

Over the next three weeks we will focus on machine learning approaches to our classification task. Feel free to use any tool that helps you, as long as you can explain exactly what is happening and why it is useful. Since you know the preprocessing steps from the past weeks and can apply them, we now want you to use them to develop a machine learning model for our classification problem.

### Requirements
* The notebook should run **without any error**, given that all packages are installed and the dataset is loaded. When we test it, we will adapt the path definitions and may install necessary packages.
* Your training/validation script should only use the train split we give you.

### Evaluation
* For evaluation, you can use the function `test_model_performance` in this notebook, which reports accuracy, precision, recall and F1-score. If you choose to use it, the predicted labels have to be one-hot encoded: your model should output a vector of probabilities over the classes, in which the highest entry is then set to 1 and all others to 0.

### Your tasks

* Perform an exploratory data analysis
* Develop a preprocessing pipeline
* Train and test one or several machine learning models
* Evaluate the algorithms with a metric of your choice 
* Visualize the outcome

* Prepare a presentation (or present this notebook) of around 10 minutes for our last session (6th of June)


You can start from here. To have a comparable evaluation between each group, we give you a fixed train and test split.

In [1]:
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.0


In [2]:
from google.colab import drive
drive.mount('/content/drive')

!pip install datasets
!pip install transformers
!pip install --upgrade accelerate
### loading the dataset ###

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
import plotly.express as px
import plotly.graph_objects as go
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

import evaluate
from evaluate import evaluator

from collections import Counter

from datasets import Dataset, ClassLabel
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, pipeline

nltk.download('stopwords')

df = pd.read_csv('/content/drive/MyDrive/Uni/3. Semester/MIR/DATA/mtsamples_clean.csv')

### creating train and test split ###

from sklearn.model_selection import train_test_split

_X = df['transcription']
_y = df['medical_specialty']
_y_one_hot = pd.get_dummies(_y)

X, X_test, y_one_hot, _ = train_test_split(_X, _y_one_hot, test_size=0.2, random_state=123)
_, _, y_classes, _ = train_test_split(_X, _y, test_size=0.2, random_state=123)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Use only `X` together with `y_one_hot` or `y_classes` for training in the rest of the notebook. After the whole notebook has run, there should be a prediction from your model that was created with `X_test` as input. Each prediction has to be a vector of length 40.
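
A prediction in that format could be produced as sketched below. The sketch assumes a hypothetical `probs` array of shape `(n_samples, 40)` holding class scores; the helper name `to_one_hot` is not part of this notebook, and the columns are assumed to follow the order of `_y_one_hot.columns`.

```python
import numpy as np

def to_one_hot(probs, n_classes=40):
    """Set the highest-scoring entry of each row to 1 and all others to 0."""
    probs = np.asarray(probs)
    winners = np.argmax(probs, axis=1)            # index of the best class per row
    return np.eye(n_classes, dtype=int)[winners]  # always n_classes columns

# toy scores for 3 samples (random numbers, not real model output)
rng = np.random.default_rng(0)
probs = rng.random((3, 40))
probs = probs / probs.sum(axis=1, keepdims=True)  # rows sum to 1

y_pred = to_one_hot(probs)
print(y_pred.shape)
```

Indexing into `np.eye` keeps all 40 columns even if some class is never predicted, which `pd.get_dummies` on the argmax result does not guarantee.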

In [4]:
def test_model_performance(y_pred):
    _, _, _, y_test = train_test_split(_X, _y_one_hot, test_size=0.2, random_state=123)

    # set highest to 1 and rest to 0
    #y_pred = np.argmax(y_pred, axis=1)

    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Precision: ', precision_score(y_test, y_pred, average='weighted'))
    print('Recall: ', recall_score(y_test, y_pred, average='weighted'))
    print('F1: ', f1_score(y_test, y_pred, average='weighted'))

In [5]:
### performance of a random guesser ###

y_pred_dummy = np.zeros((len(X_test), 40))
# random predictions with sum 1
y_pred_dummy = y_pred_dummy + np.random.rand(y_pred_dummy.shape[0], y_pred_dummy.shape[1])
y_pred_dummy = y_pred_dummy / y_pred_dummy.sum(axis=1).reshape(-1, 1)

# set only highest value to 1 and rest to 0
y_pred_dummy = np.argmax(y_pred_dummy, axis=1)
y_pred_dummy = pd.get_dummies(y_pred_dummy).values

test_model_performance(y_pred_dummy)

Accuracy:  0.025
Precision:  0.09825801467568375
Recall:  0.025
F1:  0.03229924457422364
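
The accuracy of 0.025 is what uniform guessing over 40 classes should give on average, namely 1/40, regardless of how imbalanced the true labels are. The small simulation below illustrates this on synthetic labels (random integers, not the course data); the sample size and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(123)
n_samples, n_classes = 100_000, 40

y_true = rng.integers(0, n_classes, size=n_samples)   # synthetic ground truth
y_guess = rng.integers(0, n_classes, size=n_samples)  # uniform random guesses

chance_accuracy = (y_true == y_guess).mean()
print(f"expected: {1 / n_classes:.3f}, simulated: {chance_accuracy:.3f}")
```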


# Data Exploration

In [6]:
### your code ###
df.head()

Unnamed: 0,medical_specialty,transcription
0,Allergy / Immunology,"SUBJECTIVE:, This 23-year-old white female pr..."
1,Bariatrics,"PAST MEDICAL HISTORY:, He has difficulty climb..."
2,Bariatrics,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ..."
3,Cardiovascular / Pulmonary,"2-D M-MODE: , ,1. Left atrial enlargement wit..."
4,Cardiovascular / Pulmonary,1. The left ventricular cavity size and wall ...


In [7]:
eng_stopwords = stopwords.words('english')

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

In [8]:
fig = px.histogram(df, x="medical_specialty")
fig.show()

In [9]:
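# replace non-word characters and digits with spaces (\W already covers whitespace), then lowercase
df['cleaned'] = df['transcription'].apply(lambda row: re.sub(r"[\W\d]", ' ', str(row)).lower())
df['cleaned'] = df['cleaned'].apply(lambda x: x.split())
df['stopwords_removed'] = df['cleaned'].apply(lambda x: [word for word in x if word not in eng_stopwords])
# join the word list before tokenizing; str(x) would tokenize the list's repr, brackets and quotes included
df['tokenized'] = df['stopwords_removed'].apply(lambda x: tokenizer(' '.join(x)))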
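
To see what the cleaning steps do to a single text, here is a minimal sketch on one made-up sentence. The tiny `toy_stopwords` set stands in for NLTK's English stopword list and the function name `clean` is hypothetical; neither appears elsewhere in this notebook.

```python
import re

# tiny stand-in for nltk's English stopword list (assumption, for illustration)
toy_stopwords = {"this", "a", "with", "of", "the", "is"}

def clean(text):
    # replace every non-word character and digit with a space, then lowercase
    text = re.sub(r"[\W\d]", " ", str(text)).lower()
    words = text.split()
    # drop the stopwords, keep the remaining words in order
    return [w for w in words if w not in toy_stopwords]

sample = "SUBJECTIVE:, This 23-year-old female presents with a complaint of allergies."
print(clean(sample))
```

Punctuation and the age "23" disappear, so "23-year-old" is reduced to the two words "year" and "old".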


In [10]:
len_of_txts = df.tokenized.map(len)

fig = go.Figure(data=[go.Histogram(x=len_of_txts)])

fig.show()

In [11]:
df['word_count'] = len_of_txts
sub_df = df[['medical_specialty', 'word_count']]

sub_df.groupby(by=['medical_specialty']).sum()

Unnamed: 0_level_0,word_count
medical_specialty,Unnamed: 1_level_1
Allergy / Immunology,1897
Autopsy,5697
Bariatrics,4210
Cardiovascular / Pulmonary,98801
Chiropractic,7478
Consult - History and Phy.,172938
Cosmetic / Plastic Surgery,8433
Dentistry,7849
Dermatology,6948
Diets and Nutritions,2166


In [12]:
full_text = []

for i in df['stopwords_removed']:
    full_text += i



Counter(full_text).most_common()

[('patient', 24208),
 ('right', 11587),
 ('left', 11258),
 ('history', 9509),
 ('normal', 7526),
 ('procedure', 7463),
 ('placed', 7028),
 ('well', 6611),
 ('pain', 5976),
 ('mg', 4375),
 ('x', 4357),
 ('noted', 4348),
 ('also', 4337),
 ('time', 4287),
 ('c', 4132),
 ('using', 4123),
 ('blood', 3956),
 ('performed', 3953),
 ('skin', 3798),
 ('without', 3732),
 ('anesthesia', 3707),
 ('incision', 3601),
 ('used', 3554),
 ('removed', 3532),
 ('year', 3506),
 ('room', 3502),
 ('old', 3463),
 ('diagnosis', 3212),
 ('general', 3064),
 ('artery', 3027),
 ('anterior', 2932),
 ('taken', 2816),
 ('back', 2731),
 ('disease', 2685),
 ('past', 2674),
 ('chest', 2631),
 ('mm', 2615),
 ('examination', 2578),
 ('position', 2576),
 ('two', 2568),
 ('dr', 2562),
 ('area', 2557),
 ('one', 2494),
 ('lower', 2462),
 ('cm', 2445),
 ('fashion', 2445),
 ('neck', 2424),
 ('negative', 2374),
 ('present', 2368),
 ('made', 2352),
 ('upper', 2324),
 ('l', 2323),
 ('pressure', 2284),
 ('closed', 2279),
 ('good', 2

### Calculating the class weights

In [13]:
P = (df['medical_specialty'].value_counts() / len(df))
class_weights = np.sqrt(P.pow(-1))

class_weights

 Surgery                           2.128893
 Consult - History and Phy.        3.112553
 Cardiovascular / Pulmonary        3.665811
 Orthopedic                        3.752558
 Radiology                         4.279177
 General Medicine                  4.393308
 Gastroenterology                  4.662058
 Neurology                         4.734664
 SOAP / Chart / Progress Notes     5.487664
 Obstetrics / Gynecology           5.589611
 Urology                           5.624877
 Discharge Summary                 6.803458
 ENT - Otolaryngology              7.142143
 Neurosurgery                      7.292520
 Hematology - Oncology             7.452815
 Ophthalmology                     7.760729
 Nephrology                        7.855956
 Emergency Room Reports            8.164149
 Pediatrics - Neonatal             8.450697
 Pain Management                   8.979367
 Psychiatry / Psychology           9.711887
 Office Notes                      9.900485
 Podiatry                       
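
The weights above are the square root of the inverse class frequency, w_c = sqrt(1 / P_c), which dampens plain inverse-frequency weighting. A minimal sketch on a made-up label list (the labels and counts are illustrative, not taken from the dataset):

```python
import math
from collections import Counter

labels = ["Surgery"] * 6 + ["Radiology"] * 3 + ["Podiatry"] * 1  # toy data

counts = Counter(labels)
n_total = len(labels)

# P_c = count_c / n_total, so sqrt(1 / P_c) = sqrt(n_total / count_c)
class_weights = {name: math.sqrt(n_total / k) for name, k in counts.items()}

for name, weight in class_weights.items():
    print(f"{name:<10} {weight:.3f}")
```

Rare classes receive larger weights. Such weights could, for example, be passed to a weighted loss; the training cells in this notebook do not use them.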

# Classifier

In [14]:
X = df['stopwords_removed'].apply(lambda x: ' '.join(x))
y = df['medical_specialty']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# clf = MultiLabelZeroShotGPTClassifier(max_labels=2)
# clf.fit(X_train, y_train)

# labels = clf.predict(X_test)

# accuracy_score(y_test, labels)

In [15]:
print('Number of individual classes')
print(len(set(df['medical_specialty'])))


40


In [34]:
# sort the class names so the id assignment is deterministic (set order is not)
class_names = sorted(set(df['medical_specialty']))
label2id = {item: idx for idx, item in enumerate(class_names)}
id2label = {idx: item for idx, item in enumerate(class_names)}

In [16]:
bert_df_train = pd.DataFrame()
bert_df_test = pd.DataFrame()
bert_df_train['sentence'] = X_train
bert_df_train['label'] = y_train
bert_df_test['sentence'] = X_test
bert_df_test['label'] = y_test

In [17]:
# Dataset and Dataloader


train_dataset = Dataset.from_pandas(bert_df_train)
test_dataset = Dataset.from_pandas(bert_df_test)

In [18]:
# load the tokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

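# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# BioBERT's tokenizer already defines '[PAD]'; a pad token only has to be added for models such as GPT-2
# tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# use the same id order as label2id/id2label so the model config and the training labels agree
labels = ClassLabel(names=[id2label[i] for i in range(len(id2label))])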

# define the tokenization as a function

def preprocess_function(examples):
    tokens = tokenizer(examples["sentence"], truncation=True, padding=True, max_length=400)
    
    tokens['label'] = labels.str2int(examples['label'])
    return tokens

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/3349 [00:00<?, ? examples/s]

Map:   0%|          | 0/1650 [00:00<?, ? examples/s]

In [62]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


model = AutoModelForSequenceClassification.from_pretrained("dmis-lab/biobert-v1.1", num_labels=40, id2label=id2label, label2id=label2id)
model = model.to('cuda')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dmis-lab/biobert-v1.1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [68]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=5e-5,
    per_device_train_batch_size=20,
    per_device_eval_batch_size=20,
    num_train_epochs=10,
    weight_decay=0.01,
    gradient_accumulation_steps=16,
    dataloader_num_workers=4,
    gradient_checkpointing=True,
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [66]:
trainer.train()


This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.







TrainOutput(global_step=100, training_loss=2.3511219787597657, metrics={'train_runtime': 1106.3352, 'train_samples_per_second': 30.271, 'train_steps_per_second': 0.09, 'total_flos': 6559663694764800.0, 'train_loss': 2.3511219787597657, 'epoch': 9.52})
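
The reported `global_step=100` and `'epoch': 9.52` can be reproduced from the training settings. The sketch below assumes a single device and that one optimizer step is counted per 16 accumulated batches, with the incomplete group at the end of each epoch not counted; this matches the logged numbers but is an interpretation, not something stated in the notebook.

```python
import math

n_train = 3349      # training examples (from the Map output)
batch_size = 20     # per_device_train_batch_size
accum_steps = 16    # gradient_accumulation_steps
epochs = 10         # num_train_epochs

effective_batch = batch_size * accum_steps             # examples per optimizer step
batches_per_epoch = math.ceil(n_train / batch_size)    # mini-batches per epoch
updates_per_epoch = batches_per_epoch // accum_steps   # optimizer steps per epoch
total_updates = updates_per_epoch * epochs
epochs_seen = total_updates * accum_steps / batches_per_epoch

print(effective_batch, total_updates, round(epochs_seen, 2))
```

With an effective batch of 320 examples, the model receives only about ten weight updates per epoch, which is worth keeping in mind when reading the training loss.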

In [67]:
task_evaluator = evaluator("text-classification")


eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=tokenized_test,
    input_column='sentence',
    label_column='label',
    tokenizer=tokenizer,
    label_mapping=label2id,
    metric="accuracy",
)

print(eval_results)


None of the inputs have requires_grad=True. Gradients will be None



{'accuracy': 0.3606060606060606, 'total_time_in_seconds': 48.658779804000005, 'samples_per_second': 33.909604939669315, 'latency_in_seconds': 0.02949016957818182}


In [None]:
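# the model has no .predict() method; the Trainer returns the test-set logits instead
y_pred_logits = trainer.predict(tokenized_test).predictions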