# Using Natural Language Processing to Identify Adverse Drug Events on an AI/HPC Cluster

Adapted from https://towardsdatascience.com/using-nlp-to-identify-adverse-drug-events-ades-7a0194f1966a
Instead of running the ML pipeline as Hugging Face Processing jobs on Amazon SageMaker, we migrated the pipeline to a private cluster.

Motivated by family members' health issues, we are exploring NLP to encourage medical researchers in our community to explore and deploy machine learning pipelines for the Thai language and Thai medical records.

## Preconfigure Environment 

In [None]:
# Shell commands recorded while setting up the environment (run in a terminal)
$ conda update -n base -c defaults conda
$ conda create -n nlp
$ conda activate nlp
$ conda install jupyter

$ conda install -c conda-forge scikit-learn
$ conda install -c huggingface -c conda-forge datasets
$ conda install pytorch pytorch-cuda=11.6 -c pytorch -c nvidia  


## Load Dataset

In [None]:
import pandas as pd
from datasets import load_dataset

In [None]:
#from datasets import get_dataset_infos
#get_dataset_infos('rotten_tomatoes')

In [None]:
from datasets import load_dataset
import pandas as pd
dataset = load_dataset('ade_corpus_v2', 'Ade_corpus_v2_classification')
df = pd.DataFrame(dataset['train'])
df.sample(5, random_state=124)

In [None]:
# Fraction of examples labelled 1 (ADE-related) in the full dataset
df['label'].sum()/len(df)
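
The ratio above is the share of positive examples, which shows how imbalanced the two classes are. A minimal sketch of the same calculation in plain Python follows, generalized to any set of label values. The `label_fractions` helper and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def label_fractions(labels):
    """Return the share of each label value in `labels` as a dict."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy labels: 1 = ADE-related sentence, 0 = not related
toy_labels = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
fractions = label_fractions(toy_labels)
print(fractions)
```

For binary 0/1 labels, `fractions[1]` is the same quantity as `sum(labels) / len(labels)`.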

In [None]:
%%bash
python preprocess.py --dataset-name ade_corpus_v2 --datasubset-name Ade_corpus_v2_classification \
             --model-name distilbert-base-uncased --train-ratio 0.7 --val-ratio 0.15
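
`preprocess.py` is not shown in this notebook. As a rough sketch of what `--train-ratio 0.7 --val-ratio 0.15` could mean, the snippet below shuffles example indices and cuts them into train, validation, and test parts, with the remaining 15% going to the test split. The `split_indices` function, the seed, and the rounding rule are assumptions for illustration, not the script's actual logic.

```python
import random


def split_indices(n_examples, train_ratio, val_ratio, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test lists."""
    if not 0 < train_ratio + val_ratio <= 1:
        raise ValueError("train_ratio + val_ratio must be in (0, 1]")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_examples * train_ratio)
    n_val = round(n_examples * val_ratio)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left over
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000, train_ratio=0.7, val_ratio=0.15)
print(len(train_idx), len(val_idx), len(test_idx))
```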

In [None]:
%%bash 
ls 

## Exploring Data

In [None]:
# explore dataset
import pandas as pd
from datasets import load_from_disk

In [None]:
dataset = load_from_disk('./training')

In [None]:
df_train = pd.DataFrame(dataset)

In [None]:
df_train.head(10)

In [None]:
# Fraction of examples labelled 1 (ADE-related) in the training split
df_train['label'].sum()/len(df_train)

In [None]:
df_train.iloc[0]

## Training Model

In [None]:
# Training

training_input_path = './training'
val_input_path = './validation'  # matches SM_CHANNEL_VAL and --val_dir used for training
output_path = './training_output'

In [None]:
%%bash
export SM_OUTPUT_DATA_DIR='./training_output'
export SM_MODEL_DIR='distilbert-base-uncased'
export SM_NUM_GPUS='8'
export SM_CHANNEL_TRAIN='./training'
export SM_CHANNEL_VAL='./validation'

python train.py --epochs 2 \
                 --train_batch_size 8 \
                 --model_name distilbert-base-uncased \
                 --output_data_dir ./training_output \
                 --n_gpus 8 \
                 --training_dir ./training \
                 --val_dir ./validation
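
The `SM_*` variables mimic the environment that SageMaker provides to a training script. `train.py` is not shown here, so the following is only a sketch of how such a script might combine command-line flags with environment-variable defaults using `argparse`. The `parse_training_args` function, the `--model_dir` flag, and all default values are assumptions.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Parse training flags, falling back to SM_* environment variables."""
    env = os.environ
    parser = argparse.ArgumentParser()
    # Hyperparameters (the defaults here are illustrative assumptions)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    # Paths and hardware: the environment supplies defaults, flags override them
    parser.add_argument("--output_data_dir", type=str,
                        default=env.get("SM_OUTPUT_DATA_DIR", "./training_output"))
    parser.add_argument("--model_dir", type=str,
                        default=env.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--n_gpus", type=int,
                        default=int(env.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--training_dir", type=str,
                        default=env.get("SM_CHANNEL_TRAIN", "./training"))
    parser.add_argument("--val_dir", type=str,
                        default=env.get("SM_CHANNEL_VAL", "./validation"))
    # Ignore unknown flags, e.g. ones injected by a distributed launcher
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_training_args(["--epochs", "2", "--train_batch_size", "8", "--n_gpus", "8"])
print(args.epochs, args.train_batch_size, args.n_gpus)
```

With this pattern an explicit flag always wins over the exported variable, so exporting the variables and passing the same values as flags is redundant but harmless.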

## Train on 8 GPUs

In [None]:
%%bash
# Remove the leftover /tmp/test-clm directory, then launch one training process per GPU
rm -rf /tmp/test-clm
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node 8 train.py \
    --epochs 2 --train_batch_size 8 \
    --model_name distilbert-base-uncased --output_data_dir ./training_output \
    --n_gpus 8 --training_dir ./training --val_dir ./validation
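
Launching 8 processes changes how many optimizer steps one epoch takes. The sketch below works through that arithmetic, assuming `--train_batch_size` is a per-process batch size and each process receives a distinct shard of the data (as a distributed sampler would arrange). Both are assumptions about `train.py`, and the example count is hypothetical.

```python
import math


def steps_per_epoch(n_examples, per_process_batch_size, n_processes):
    """Optimizer steps needed to see every example once."""
    global_batch = per_process_batch_size * n_processes
    return math.ceil(n_examples / global_batch)


n_examples = 10_000  # hypothetical training-set size
single = steps_per_epoch(n_examples, per_process_batch_size=8, n_processes=1)
multi = steps_per_epoch(n_examples, per_process_batch_size=8, n_processes=8)
print(single, multi)
```

Under these assumptions the global batch grows from 8 to 64 examples, so an epoch needs roughly one eighth as many steps.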


### It is amazingly fast compared to training on 32 CPU cores!

## Inference

In [None]:
text = "I got a rash from taking aspirin"
# text = 'I watched football and got really excited'


In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./distilbert-base-uncased/")
inputs = tokenizer(text, return_tensors="pt")

In [None]:
from transformers import AutoModelForSequenceClassification
import torch
model = AutoModelForSequenceClassification.from_pretrained("./distilbert-base-uncased/")
with torch.no_grad():
    logits = model(**inputs).logits


In [None]:
# Pick the class with the highest logit and map its id to the label name
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
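
The `argmax` above gives only the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The following is a minimal plain-Python sketch using hypothetical logit values. In the notebook itself, `torch.softmax(logits, dim=-1)` would do the same on the real tensor.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the two classes [not related, ADE-related]
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(predicted, probs)
```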