<a href="https://colab.research.google.com/github/stumbi/mir_nlp/blob/main/exercise_05_classificator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

  <div>
    <h1 align="center">Exercise 05 - Medical Information Retrieval 2023</h1>
  </div>
  <br />

Today we move on to a machine learning approach for text classification.

## Text classification <a class="anchor" id="first"></a>

Over the next 3 weeks we focus on machine learning approaches to our classification task. Feel free to use any tool that helps you, as long as you can explain exactly what is happening and why it is useful. You know the preprocessing steps from the past weeks and are able to apply them, so we now want you to use them to develop a machine learning model for our classification problem.

### Requirements
* The notebook should run **without any error**, given that all packages are installed and the dataset is loaded. (When we test it, we will adapt path definitions and may install necessary packages.)
* Your training/validation script should only use the train split we give you.

### Evaluation
* For evaluation, you can use the function `test_model_performance` in this notebook, which reports accuracy, precision, recall and F1-score. If you choose to use it, the predicted labels have to be one-hot encoded: your model should output a vector of probabilities over the classes for each sample, and the highest entry is then set to 1 and all others to 0.
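
The conversion from class probabilities to one-hot predictions could look roughly like the sketch below. The helper name `probabilities_to_one_hot` and the toy probabilities are assumptions made for illustration; they are not part of the exercise code.

```python
import numpy as np

def probabilities_to_one_hot(probs):
    """Return a 0/1 matrix with a single 1 per row at the most probable class."""
    probs = np.asarray(probs)
    one_hot = np.zeros(probs.shape, dtype=int)
    # argmax picks the most probable class per row (the first index on ties)
    one_hot[np.arange(probs.shape[0]), probs.argmax(axis=1)] = 1
    return one_hot

# toy example with 2 samples and 3 classes (the exercise uses 40 classes)
example = [[0.1, 0.7, 0.2],
           [0.5, 0.3, 0.2]]
print(probabilities_to_one_hot(example))
```

Because the output has the same shape as the input, every class keeps its own column even when it is never predicted.
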
### Your tasks

* Make an exploratory data analysis
* Develop a preprocessing pipeline
* Train and test one or several machine learning models
* Evaluate the algorithms with a metric of your choice 
* Visualize the outcome

* Prepare a presentation (or present this notebook) of around 10 minutes for our last session (6th of June)


You can start from here. To make the evaluation comparable across groups, we give you a fixed train and test split.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

!pip install datasets
!pip install transformers
!pip install --upgrade accelerate
!pip install evaluate
### loading the dataset ###

In [3]:
import pandas as pd
import numpy as np
import re
import torch
from torch import nn
import nltk
from nltk.corpus import stopwords
import plotly.express as px
import plotly.graph_objects as go
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

import evaluate
from evaluate import evaluator

from collections import Counter

from datasets import Dataset, ClassLabel
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, pipeline, DataCollatorWithPadding, AutoTokenizer

nltk.download('stopwords')

df = pd.read_csv('/content/drive/MyDrive/Uni/3. Semester/MIR/DATA/mtsamples_clean.csv')

### creating train and test split ###


_X = df['transcription']
_y = df['medical_specialty']
_y_one_hot = pd.get_dummies(_y)

X, X_test, y_one_hot, _ = train_test_split(_X, _y_one_hot, test_size=0.2, random_state=123)
_, _, y_classes, _ = train_test_split(_X, _y, test_size=0.2, random_state=123)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Only use `X` and `y_one_hot` or `y_classes` for training purposes in the rest of the notebook. After running the whole notebook, your model should have produced predictions for `X_test`. Each prediction has to be a vector of length 40.

In [4]:
def test_model_performance(y_pred):
    """Print accuracy, precision, recall and F1 for one-hot predictions on the fixed test split."""
    # recreate the fixed split to recover the one-hot test labels
    _, _, _, y_test = train_test_split(_X, _y_one_hot, test_size=0.2, random_state=123)

    # y_pred must already be one-hot encoded (highest value set to 1, rest to 0)
    # and have the same shape as y_test

    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Precision: ', precision_score(y_test, y_pred, average='weighted'))
    print('Recall: ', recall_score(y_test, y_pred, average='weighted'))
    print('F1: ', f1_score(y_test, y_pred, average='weighted'))

In [5]:
### performance of a random guesser ###

y_pred_dummy = np.zeros((len(X_test), 40))
# random predictions with sum 1
y_pred_dummy = y_pred_dummy + np.random.rand(y_pred_dummy.shape[0], y_pred_dummy.shape[1])
y_pred_dummy = y_pred_dummy / y_pred_dummy.sum(axis=1).reshape(-1, 1)

# set only highest value to 1 and rest to 0
# (indexing np.eye keeps all 40 columns even if a class is never drawn)
y_pred_dummy = np.argmax(y_pred_dummy, axis=1)
y_pred_dummy = np.eye(40, dtype=int)[y_pred_dummy]

test_model_performance(y_pred_dummy)

Accuracy:  0.025
Precision:  0.09825801467568375
Recall:  0.025
F1:  0.03229924457422364


# Data Exploration

In [6]:
### your code ###
df.head()

| Unnamed: 0 | medical_specialty | transcription |
|---|---|---|
| 0 | Allergy / Immunology | SUBJECTIVE:, This 23-year-old white female pr... |
| 1 | Bariatrics | PAST MEDICAL HISTORY:, He has difficulty climb... |
| 2 | Bariatrics | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... |
| 3 | Cardiovascular / Pulmonary | 2-D M-MODE: , ,1. Left atrial enlargement wit... |
| 4 | Cardiovascular / Pulmonary | 1. The left ventricular cavity size and wall ... |


In [8]:
fig = px.histogram(df, x="medical_specialty")
fig.show()

In [10]:
len_of_txts = df.tokenized.map(len)

fig = go.Figure(data=[go.Histogram(x=len_of_txts)])

fig.show()

In [11]:
df['word_count'] = len_of_txts
sub_df = df[['medical_specialty', 'word_count']]

sub_df.groupby(by=['medical_specialty']).sum()

| medical_specialty | word_count |
|---|---|
| Allergy / Immunology | 1897 |
| Autopsy | 5697 |
| Bariatrics | 4210 |
| Cardiovascular / Pulmonary | 98801 |
| Chiropractic | 7478 |
| Consult - History and Phy. | 172938 |
| Cosmetic / Plastic Surgery | 8433 |
| Dentistry | 7849 |
| Dermatology | 6948 |
| Diets and Nutritions | 2166 |


In [7]:
eng_stopwords = stopwords.words('english')

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

In [9]:
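# replace non-word characters, digits and whitespace with spaces, then lowercase
df['cleaned'] = df['transcription'].apply(lambda row: re.sub(r"[\W\d\s]", ' ', str(row)).lower())
# split each text into a list of words
df['cleaned'] = df['cleaned'].apply(lambda x: x.split())
# drop English stop words
df['stopwords_removed'] = df['cleaned'].apply(lambda x: [word for word in x if word not in eng_stopwords])
# tokenize the joined words with spaCy (str(x) would tokenize the list's repr, brackets and quotes included)
df['tokenized'] = df['stopwords_removed'].apply(lambda x: tokenizer(' '.join(x)))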
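
To see what these cleaning steps do to a single text, here is a minimal standalone sketch. The small `STOPWORDS` set is a hypothetical stand-in for NLTK's English stop word list, and the sample sentence is made up rather than taken from the dataset.

```python
import re

# hypothetical stand-in for nltk's English stop word list
STOPWORDS = {"the", "a", "is", "of", "with", "this", "and"}

def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, replace non-word characters and digits with spaces, drop stop words."""
    # \W matches every non-word character (whitespace included), \d matches digits
    words_only = re.sub(r"[\W\d]", " ", str(text)).lower()
    return [word for word in words_only.split() if word not in stopwords]

# made-up sample in the style of a transcription
sample = "HISTORY:, The patient is a 45-year-old male with chest pain."
print(clean_text(sample))
```
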


In [210]:
full_text = []

# flatten the stop-word-free word lists of all documents into one list
for i in df['stopwords_removed']:
    full_text += i

Counter(full_text).most_common(25)

[('patient', 24208),
 ('right', 11587),
 ('left', 11258),
 ('history', 9509),
 ('normal', 7526),
 ('procedure', 7463),
 ('placed', 7028),
 ('well', 6611),
 ('pain', 5976),
 ('mg', 4375),
 ('x', 4357),
 ('noted', 4348),
 ('also', 4337),
 ('time', 4287),
 ('c', 4132),
 ('using', 4123),
 ('blood', 3956),
 ('performed', 3953),
 ('skin', 3798),
 ('without', 3732),
 ('anesthesia', 3707),
 ('incision', 3601),
 ('used', 3554),
 ('removed', 3532),
 ('year', 3506)]

### Calculating the class weights

In [211]:
P = (df['medical_specialty'].value_counts() / len(df))
class_weights = np.sqrt(P.pow(-1))
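
The weighting used here gives each class the weight $w_c = \sqrt{1 / P_c}$, where $P_c$ is the relative frequency of class $c$. The following sketch reproduces that formula on toy labels; the function name `inverse_sqrt_class_weights` and the label counts are assumptions for illustration only.

```python
import math
from collections import Counter

def inverse_sqrt_class_weights(labels):
    """Weight each class by sqrt(1 / relative frequency)."""
    counts = Counter(labels)
    total = len(labels)
    # 1 / (count / total) == total / count
    return {label: math.sqrt(total / count) for label, count in counts.items()}

# toy labels (made up): 16 notes of a frequent class, 4 notes of a rare class
toy_labels = ["Surgery"] * 16 + ["Dentistry"] * 4
weights = inverse_sqrt_class_weights(toy_labels)
print(weights)
```

In this toy case the rare class is four times less frequent but only receives twice the weight, because the square root dampens the inverse frequency.
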

# Classifier

In [208]:
print('Number of individual classes:', len(set(df['medical_specialty'])))

Number of individual classes: 40


### Data Preparation for Transformer Learning

- creating label to id mapping
- creating train, validation and test dataset
- loading model and tokenizer
- creating custom tokenizer pipeline
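
A label-to-id mapping is easiest to reproduce when the class names are put into a fixed order first, since iterating over a Python `set` of strings is not guaranteed to give the same order in every run. A minimal sketch, with the hypothetical helper `build_label_maps` and made-up toy labels:

```python
def build_label_maps(labels):
    """Create label<->id mappings with a reproducible (sorted) class order."""
    classes = sorted(set(labels))
    label_to_id = {label: idx for idx, label in enumerate(classes)}
    id_to_label = {idx: label for label, idx in label_to_id.items()}
    return label_to_id, id_to_label

# toy labels for illustration
toy = ["Dentistry", "Autopsy", "Bariatrics", "Autopsy"]
toy_label2id, toy_id2label = build_label_maps(toy)
print(toy_label2id)
print(toy_id2label)
```
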

In [34]:
# sort the class names so that both mappings use the same, reproducible order
classes = sorted(set(df['medical_specialty']))
label2id = {item : idx for idx, item in enumerate(classes)}
id2label = {idx : item for idx, item in enumerate(classes)}

In [217]:
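X = df['stopwords_removed'].apply(lambda x: ' '.join(x))
y = df['medical_specialty']

# 50 % training data; note that this re-splits the whole dataframe
# and overwrites X and X_test from the fixed split above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# split the remaining half evenly into validation and test data (25 % each)
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, test_size=0.5, random_state=42)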

In [218]:
# one dataframe per split with the column names used for tokenization below
bert_df_train = pd.DataFrame({'sentence': X_train, 'label': y_train})
bert_df_val = pd.DataFrame({'sentence': X_val, 'label': y_val})
bert_df_test = pd.DataFrame({'sentence': X_test, 'label': y_test})

In [219]:
train_dataset = Dataset.from_pandas(bert_df_train)
val_dataset = Dataset.from_pandas(bert_df_val)
test_dataset = Dataset.from_pandas(bert_df_test)

In [220]:
# load the tokenizer (the BioBERT tokenizer already defines the [PAD] token)

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

# map class names to integer ids in the same order as label2id
labels = ClassLabel(names=list(label2id.keys()))

# define the tokenization as a function

def preprocess_function(examples):
    tokens = tokenizer(examples["sentence"], truncation=True, padding=True, max_length=512)
    
    tokens['label'] = labels.str2int(examples['label'])
    return tokens

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)


In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


model = AutoModelForSequenceClassification.from_pretrained("dmis-lab/biobert-v1.1", num_labels=40, id2label=id2label, label2id=label2id)
model = model.to('cuda')

### Transformer Fine Tuning 

- creating custom loss function, due to unbalanced training dataset
- define compute metrics function
- setting training arguments
- setting trainer arguments
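
To make the effect of the class weights on the loss concrete, the sketch below computes a weighted cross-entropy by hand in plain Python. It follows the weighted mean that `torch.nn.CrossEntropyLoss` documents for a `weight` tensor with the default `reduction='mean'`: the weighted per-sample losses are divided by the sum of the weights of the target classes. The function name and the toy numbers are assumptions for illustration.

```python
import math

def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean cross-entropy: sum(w_y * -log p_y) / sum(w_y)."""
    weighted_losses = []
    used_weights = []
    for row, target in zip(logits, targets):
        # numerically stable log-softmax value for the target class
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in row))
        log_prob = row[target] - log_sum_exp
        weighted_losses.append(-class_weights[target] * log_prob)
        used_weights.append(class_weights[target])
    return sum(weighted_losses) / sum(used_weights)

# toy batch: two samples, two classes; class 1 is treated as the rare class
toy_logits = [[2.0, 0.0], [0.0, 0.0]]
toy_targets = [0, 1]
print(weighted_cross_entropy(toy_logits, toy_targets, [1.0, 1.0]))  # unweighted mean
print(weighted_cross_entropy(toy_logits, toy_targets, [1.0, 3.0]))  # rare class counts more
```

With the larger weight on class 1, the poorly predicted second sample dominates the loss, which is the intended effect on an unbalanced dataset.
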

In [281]:
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments that newer transformers releases may pass
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get('logits').to(torch.float)

        # class-weighted cross-entropy; the weights are ordered by label id
        weights = torch.tensor(class_weights.loc[list(label2id.keys())].values,
                               dtype=torch.float, device=logits.device)
        loss_fct = nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(logits.view(-1, 40), labels.view(-1))

        return (loss, outputs) if return_outputs else loss

def compute_metrics(eval_preds):
    # evaluate.load takes one metric name; its second positional argument
    # is a configuration name, not a second metric
    metric = evaluate.load("accuracy")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [282]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=10,
    weight_decay=0.01,
    gradient_accumulation_steps=16,
    dataloader_num_workers=2,
    gradient_checkpointing=True,
    evaluation_strategy="epoch"
)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


In [283]:
trainer.train()





| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 0 | No log | 2.077339 | 0.3344 |


TrainOutput(global_step=39, training_loss=1.098311791053185, metrics={'train_runtime': 283.2496, 'train_samples_per_second': 8.823, 'train_steps_per_second': 0.138, 'total_flos': 513241609420800.0, 'train_loss': 1.098311791053185, 'epoch': 1.0})

### Evaluate Training

- printing accuracy
- calculating f1 score
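
For reference, the weighted F1 score averages the per-class F1 values, weighting each class by its share of the true labels. The plain-Python sketch below is meant to mirror what `f1_score(..., average='weighted')` computes; the function name `weighted_f1` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # true positives and number of predictions for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n_true / total) * f1
    return score

# toy labels: class "A" has three true samples, class "B" has one
toy_true = ["A", "A", "A", "B"]
toy_pred = ["A", "A", "B", "B"]
print(weighted_f1(toy_true, toy_pred))
```
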

In [288]:
task_evaluator = evaluator("text-classification")

metric = evaluate.load("accuracy")
eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=tokenized_test,
    input_column='sentence',
    label_column='label',
    tokenizer=tokenizer,
    label_mapping=label2id,
    metric=metric
)

print(eval_results)

In [302]:
from nltk.tokenize import word_tokenize
# nltk.download('popular')

# keep only the first 450 words of each text, aiming to stay within the model's
# 512-token input limit (subword tokenization can still yield more tokens than words)
sentences = [i['sentence'] for i in tokenized_test]
shortened_sentences = []
for i in sentences:
    tokenized_sentence = word_tokenize(i)
    shortened = tokenized_sentence[:450]
    shortened_sentence = ' '.join(shortened)
    shortened_sentences.append(shortened_sentence)

In [301]:
classifier = pipeline("text-classification", model=model.to('cpu'), tokenizer=tokenizer)
predictions = classifier(shortened_sentences)
y_pred = [y['label'] for y in predictions]

print('F1: ', f1_score(y_test, y_pred, average='weighted'))

F1:  0.3674074074074074
