<H1 style="text-align: center;">NLP for One Health</H1>
<h3 style="text-align: center;">From BERT to ChatGPT</h3>

|   |   |   |   |
|---|---|---|---|
| <img src="https://mood-h2020.eu/wp-content/uploads/2020/10/logo_Mood_texte-dessous_CMJN_vecto-300x136.jpg" alt="mood"/> | <img src="https://www.murdoch.edu.au/ResourcePackages/Murdoch2021/assets/dist/images/logo.svg" alt="murdoch" /> | <img src="https://www.umr-tetis.fr/images/logo-header-tetis.png" alt="tetis"/> | <img src="https://www.inrae.fr/themes/custom/inrae_socle/logo.svg" alt="INRAE" /> |

Speakers:

- **Rémy DECOUPES** - Research engineer UMR TETIS / INRAE
- **Maguelonne TEISSEIRE** - Prof. UMR TETIS / INRAE

------------------------

# Chapter 1. BERT
BERT ("[Bidirectional Encoder Representations from Transformers - Devlin et al. - 2018](https://arxiv.org/abs/1810.04805)"), from Google Research, is an open-source pre-trained language model. It is built on the encoder part of the Transformer architecture introduced in "[Attention is all you need - Vaswani et al. - 2017](https://arxiv.org/abs/1706.03762)".

BERT-base was pre-trained on:
+ English Wikipedia (2.5 billion words)
+ BooksCorpus (0.8 billion words)

On two tasks:
+ Self-masking (masked language modeling)
+ Next sentence prediction
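
To make the first pre-training task concrete, here is a minimal sketch of how a masked training example can be built: some tokens are hidden and the model has to recover them. It is simplified on purpose: whitespace tokenization, a 15% masking rate as in the BERT paper, and none of the paper's random or unchanged replacements. `mask_tokens` is a hypothetical helper, not part of any library.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide roughly `mask_rate` of the tokens (hypothetical helper, simplified)."""
    rng = random.Random(seed)
    # Always mask at least one token so there is something to predict.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in masked_positions:
        labels[pos] = masked[pos]  # what the model has to recover
        masked[pos] = "[MASK]"
    return masked, labels

tokens = "optimal health for people , animals and our environment".split()
masked, labels = mask_tokens(tokens)
print(masked)   # the input given to the model
print(labels)   # position -> original token, used as training target
```

During pre-training the loss is computed only on the masked positions, which is why the sketch returns them separately.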

## 1.1 Introduction: Transformers
Transformers is a Python library from Hugging Face that makes it easy to work with BERT-like models.




In [None]:
# installation
!pip install transformers
!pip install torch
!pip install scikit-learn

In [None]:
# load BERT models
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

## 1.2 NLP tasks with BERT
Let's use the `pipeline` helper from transformers on common NLP tasks.
### 1.2.1 Self-masking
Predict a token that has been masked inside a sentence.

In [None]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("One Health is an approach calling for the collaborative efforts of multiple disciplines working locally, nationally, and globally, to attain optimal health for people, [MASK] and our environment.")

### 1.2.2 Next sentence prediction
The aim of this NLP task is to tell whether the second sentence could plausibly follow the first one.

In [None]:
from transformers import BertTokenizer, BertForNextSentencePrediction
from torch.nn.functional import softmax


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
nsp_model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

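def next_sentence_prediction(sentence_1, sentence_2):
    # Encode both sentences as one pair: [CLS] sentence_1 [SEP] sentence_2 [SEP]
    encoding = tokenizer(sentence_1, sentence_2, return_tensors='pt')
    # The first element of the model output holds the two NSP logits
    logits = nsp_model(**encoding)[0]
    probs = softmax(logits, dim=1)
    # Index 0 is the probability that sentence_2 follows sentence_1
    return probs[0][0].item()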


In [None]:
sentence_1 = "The One Health Initiative is an interdisciplinary movement to create collaborations between animal, human, and environmental health"
sentence_2 = "The aim is to better and more rapidly respond to outbreaks and newly emerging zoonoses and diseases."

score = next_sentence_prediction(sentence_1, sentence_2)
print(score)

sentence_3 = "Murdoch University is located in Perth, Western Australia"

score = next_sentence_prediction(sentence_1, sentence_3)
print(score)
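
The score printed above is a softmax over two logits. Here is a minimal sketch of that step in plain Python, with made-up logits (illustrative values, not model outputs). In the Hugging Face implementation, index 0 stands for "sentence 2 follows sentence 1" and index 1 for "sentence 2 is a random sentence".

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits only: [is_next, not_next]
example_logits = [4.0, -2.0]
probs = softmax(example_logits)
print(f"P(sentence 2 follows sentence 1) = {probs[0]:.3f}")
```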

### 1.2.3 Named Entity Recognition
NER is very useful for extracting specific information. In epidemiological surveillance, we want to extract the pathogen, the host and the location of an outbreak from news articles.

In [None]:
from transformers import pipeline

ner = pipeline('ner', model='dslim/bert-base-NER')
ner("2 swans found dead in Dordogne")
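
The NER model tags individual tokens with the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` is outside any entity. Here is a minimal sketch of how such token-level tags can be merged into entity spans. The tokens and tags are hand-written for illustration, not actual model output, and `group_entities` is a hypothetical helper.

```python
def group_entities(tokens, tags):
    """Merge BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the previous one, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tag (or an inconsistent "I-" tag): close the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["Dead", "swans", "found", "in", "Western", "Australia"]
tags = ["O", "O", "O", "O", "B-LOC", "I-LOC"]
entities = group_entities(tokens, tags)
print(entities)
```

In practice the `ner` pipeline can do this grouping itself through its `aggregation_strategy` argument.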

## 1.3 Diving into the model representations
### 1.3.1 Inputs: Natural texts to vectors
Inputs / Tokenization: the text is first split into tokens, and each token is replaced by its id in the vocabulary.

BERT has a vocabulary of about 30,000 tokens.

In [None]:
from transformers import BertTokenizerFast, BertForTokenClassification
from torch.nn import functional as F
import torch
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased', return_dict = True)

In [None]:
text = "2 swans found dead in Dordogne"
inputs = tokenizer.encode_plus(text, return_tensors = "pt")

print(f"Inputs: {inputs} \n")
print(f"Inputs_ids: {inputs['input_ids']} \n")
print(f"Inputs word ids: {inputs.word_ids()} \n")
print(f"Inputs to words: {tokenizer.tokenize(text)} \n")

Let's focus on two points:

+ **Sub-tokenization**: BERT only knows ~30K tokens, so an unknown word has to be split into sub-tokens. For example, **Dordogne** becomes [do, ##rdo, ##gne].
+ **Special tokens**: two tokens have been added, with ids 101 and 102. They mark the start (`[CLS]`) and the end (`[SEP]`) of the input.
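
The sub-tokenization step can be sketched as a greedy longest-match-first search, which is how WordPiece splits a word at inference time. The vocabulary below is a toy one (an assumption for illustration) containing only the pieces needed here, and `wordpiece_tokenize` is a hypothetical helper rather than the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (assumption): only the pieces needed for this example.
toy_vocab = {"swans", "dead", "do", "##rdo", "##gne"}
print(wordpiece_tokenize("dordogne", toy_vocab))
print(wordpiece_tokenize("swans", toy_vocab))
```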

In [None]:
print(f"Token ID = 2079, 20683, 10177: {tokenizer.convert_ids_to_tokens([2079, 20683, 10177])} \n")

print("\n")
# The input is two ids longer than the tokenized text because of the special tokens
print(f"Shape of inputs: {inputs['input_ids'].shape} | Number of tokens: {len(tokenizer.tokenize(text))} \n")
print(f"Token ID = 101: {tokenizer.convert_ids_to_tokens(101)} \n")
print(f"Token ID = 102: {tokenizer.convert_ids_to_tokens(102)} \n")

### 1.3.2 Embedding
An embedding is a vector representation of the semantic information carried by a text.

Each token in the vocabulary has a vector representation learned during BERT's pre-training.

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
embedding_matrix = model.embeddings.word_embeddings.weight

print(f"BERT embedding matrix: {embedding_matrix} \n\n Matrix shape: {embedding_matrix.shape}")

In [None]:
print(f"Embeddings of Swan (id=26699): {embedding_matrix[26699]}\n\n")
print(f"Embeddings of Dordogne (id=2079, 20683, 10177): {embedding_matrix[[2079, 20683, 10177]]} \n")

However, the strength of BERT is to capture the **contextualized** semantics of a word in its sentence. To explore these embeddings, let's look at the representation produced by the last layer: **last_hidden_states**.

In [None]:
from transformers import BertModel, BertTokenizer
import torch
import math 
model_name = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

input_text = "2 swans found dead in Dordogne"
input_ids = tokenizer.encode(input_text)
input_ids = torch.tensor([input_ids])

with torch.no_grad():
    last_hidden_states = model(input_ids)[0]
last_hidden_states_mean = last_hidden_states.mean(1)

print(f"Embedding shape: {last_hidden_states.shape}\n\n")
print(f"Embeddings of Swan inside its sentence (word_id=2): {last_hidden_states[0, 2, :]}\n\n")

# Static (pre-trained) vectors come from embedding_matrix, contextual ones from last_hidden_states.
# torch.dist returns the Euclidean (L2) distance between two tensors.
static_embeddings = embedding_matrix.detach()
euclidean_distance = torch.dist(static_embeddings[26699], last_hidden_states[0, 2, :]).item()
print(f"Euclidean distance between Swan (in pre-trained) and Swan in our sentence: {euclidean_distance}")
euclidean_distance = torch.dist(static_embeddings[2757], last_hidden_states[0, 4, :]).item()
print(f"Euclidean distance between dead (in pre-trained) and dead in our sentence: {euclidean_distance}")
euclidean_distance = torch.dist(last_hidden_states[0, 4, :], last_hidden_states[0, 2, :]).item()
print(f"Euclidean distance between dead and Swan in our sentence: {euclidean_distance}")
euclidean_distance = torch.dist(static_embeddings[2757], static_embeddings[26699]).item()
print(f"Euclidean distance between dead and Swan in pre-trained: {euclidean_distance}")
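
The comparisons above rely on the Euclidean distance between two vectors. Here is a minimal sketch on toy 3-dimensional vectors (made-up numbers; the vectors of `bert-base-uncased` have 768 dimensions). It also computes cosine similarity, another common way to compare embeddings, which ignores vector length.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

static_vec = [1.0, 0.0, 2.0]
contextual_vec = [2.0, 0.0, 4.0]  # same direction, twice as long

print(euclidean_distance(static_vec, contextual_vec))
print(cosine_similarity(static_vec, contextual_vec))
```

The two vectors are far apart in Euclidean terms yet point in the same direction, so the choice of measure matters when interpreting the numbers.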

### 1.3.3 Visualize sentences embedding
The meaning of a sentence is carried by the embeddings of its tokens. We propose here to visualize it:

+ Retrieve the embedding of a whole sentence (rather than of each token) using [sentence embedding](https://github.com/UKPLab/sentence-transformers). We are going to use **all-mpnet-base-v2**, the model ranked first by the [Sentence BERT benchmark](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models).
+ Reduce the dimension with [UMAP](https://umap-learn.readthedocs.io/en/latest/) so that the embeddings can be plotted on a 2-D figure
+ Visualize it interactively with [plotly.express](https://plotly.com/python/plotly-express/)
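
One common way to go from token-level vectors to a single sentence vector is mean pooling: average the token vectors while ignoring padding positions. Here is a minimal sketch with toy 2-dimensional vectors. `mean_pool` is a hypothetical helper, and the exact pooling used by a given sentence-embedding model should be checked on its model card.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding has mask 0)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    n_real_tokens = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            n_real_tokens += 1
            for i, value in enumerate(vector):
                totals[i] += value
    return [t / n_real_tokens for t in totals]

token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
attention_mask = [1, 1, 0]
sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)
```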

In [None]:
!pip install -U sentence-transformers
!pip install umap-learn
!pip install pandas
!pip install plotly
!pip install --upgrade nbformat

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')

In [None]:
import pandas as pd

my_sentences_dict = [
    {"sentence": "One Health is an interdisciplinary approach that recognizes the interconnectedness of human, animal, and environmental health.", "category": "one-health"},
    {"sentence": "The One Health concept emphasizes the need for collaboration across sectors to address complex health challenges.", "category": "one-health"},
    {"sentence": "A One Health approach is essential for understanding and preventing emerging infectious diseases.", "category": "one-health"},
    {"sentence": "One Health research aims to improve the health and well-being of both people and animals.", "category": "one-health"},
    {"sentence": "One Health initiatives promote a holistic and integrated approach to health and disease.", "category": "one-health"},
    {"sentence": "One Health is a global initiative that aims to improve public health and environmental sustainability.", "category": "one-health"},
    {"sentence": "The One Health framework emphasizes the importance of collaboration, communication, and coordination across different fields.", "category": "one-health"},
    {"sentence": "One Health research has contributed to our understanding of zoonotic diseases such as Ebola and COVID-19.", "category": "one-health"},
    {"sentence": "One Health approaches can help us address complex health challenges such as antimicrobial resistance.", "category": "one-health"},
    {"sentence": "One Health recognizes the interconnectedness of human, animal, and environmental health and aims to promote the health and well-being of all three.", "category": "one-health"},

    {"sentence": "The workshop on AI and One Health brought together experts from diverse fields to explore new opportunities for collaboration.", "category": "workshop-ai-onehealth"},
    {"sentence": "The AI and One Health workshop focused on using advanced technologies to improve public health outcomes.", "category": "workshop-ai-onehealth"},
    {"sentence": "The workshop on AI and One Health provided a platform for discussing the ethical and social implications of using AI in healthcare.", "category": "workshop-ai-onehealth"},
    {"sentence": "The AI and One Health workshop explored new ways to use data and analytics to improve health outcomes for both people and animals.", "category": "workshop-ai-onehealth"},
    {"sentence": "The workshop on AI and One Health highlighted the importance of interdisciplinary collaboration in addressing complex health challenges.", "category": "workshop-ai-onehealth"},
    
    {"sentence": "Murdoch University is a public research university located in Perth, Western Australia.", "category": "murdoch-uni"},
    {"sentence": "The campus of Murdoch University is situated on a large nature reserve, providing a unique and tranquil environment for learning and research.", "category": "murdoch-uni"},
    {"sentence": "Murdoch University is known for its strong programs in veterinary science and animal health.", "category": "murdoch-uni"},
    {"sentence": "The School of Health Sciences at Murdoch University offers a range of programs in nursing, psychology, and public health.", "category": "murdoch-uni"},
    {"sentence": "Murdoch University has a strong commitment to sustainability and environmental stewardship, reflected in its research and operations.", "category": "murdoch-uni"}
]
df = pd.DataFrame(my_sentences_dict)
print(df.head(2))

sentence_embeddings = model.encode(df["sentence"].values)

print(f"Shape of my list of sentences: {sentence_embeddings.shape}")

In [None]:
import umap

umap_emb = umap.UMAP(n_components=2, random_state=42).fit_transform(sentence_embeddings)
df['umap-x'] = umap_emb[:, 0]
df['umap-y'] = umap_emb[:, 1]

In [None]:
import plotly.express as px

fig = px.scatter(df, x='umap-x', y='umap-y', custom_data=["sentence"], color="category", width=800, height=800)
fig.update_traces(
    hovertemplate="<br>".join([
        "UMAP x: %{x}",
        "UMAP y: %{y}",
        "Sentence: %{customdata[0]}",
    ])
)

fig.show()

## 1.4 Fine-tuning
**Use case**: Event-based epidemiological surveillance for AMR (Antimicrobial Resistance)

The aim of this section is to fine-tune a pre-trained BERT model into a text classifier for news (press articles), with three classes:

+ "New Information": the news article deals with an outbreak
+ "General Information": the press article is on topic (it talks about AMR) but does not mention any new emergence
+ "Not relevant": the press article is off-topic

The data sources are:

+ [ProMED](https://promedmail.org/): run by the International Society for Infectious Diseases (ISID) since 1994; expert moderators provide written commentary (context) on the reported press articles.
+ [HealthMap](https://www.healthmap.org): an automated and curated aggregator of a broad range of data sources (Twitter, Google News, Baidu and ProMED).
+ [MedISys](https://medisys.newsbrief.eu/medisys/homeedition/en/home.html): a web-based information monitoring system developed by the European Commission.
+ [PADI-web](https://padi-web.cirad.fr/): partly developed by researchers from UMR TETIS and used by the French epidemic intelligence team in animal health. It is an automated tool that monitors the Google News aggregator in sixteen languages.

The data is accessible at the following link: [MOOD - News AMR dataset - Hackathon 2022](https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/MPNSPH). It comes from the [MOOD](https://mood-h2020.eu/time-for-a-mood-hack-antimicrobial-resistance-hackathon/) project.


![MOOD](https://mood-h2020.eu/wp-content/uploads/2020/10/logo_Mood_cmjn_black-1.jpg)

### 1.4.1 Download the data and upload it to your notebook

1. Download the data [from the following link](https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/MPNSPH) (giving your personal information is not mandatory; you can simply click on the accept button)
2. Extract the 4 files beginning with `D1_`
3. Upload them to Colab

In [None]:
from google.colab import files
uploaded = files.upload()

### 1.4.2 Prepare the data for training

In [None]:
import io
import pandas as pd

df = pd.DataFrame()
df_tmp = pd.DataFrame()

for filename in uploaded:
  df_tmp = pd.read_csv(io.StringIO(uploaded[filename].decode("utf-8")), sep = ",")
  df = pd.concat([df, df_tmp], ignore_index=True)

# filter out "don't know"
df = df[df["Classification1"] != "Don't know"]

# filter out rows without a title
df = df[df["Title"].notna()]

In [None]:
import datasets
from datasets import Dataset, DatasetDict
from datasets import ClassLabel

df["label_name"] = df["Classification1"]
df["label"] = pd.Categorical(df["label_name"], ordered=True).codes

mapLabels = pd.DataFrame(df.groupby(['label_name', 'label']).count())
#drop count column
# mapLabels.drop(['news'], axis = 1, inplace = True)
label2Index = mapLabels.to_dict(orient='index')
index2label = {}
for key in label2Index:
  print (f"{key[1]} -> {key[0]}")
  index2label[key[1]] = key[0]
print("\n")
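
The label encoding above can also be pictured without pandas. This minimal sketch assumes that the three class names listed at the start of this section are the exact strings found in the `Classification1` column, which should be checked against the data. Ids are assigned in alphabetical order, which is also what `pd.Categorical` does by default for strings.

```python
# Assumed class names; check them against the actual dataset.
label_names = ["New Information", "General Information", "Not relevant"]

# Sorted names -> consecutive integer ids, and the reverse mapping.
label2id = {name: idx for idx, name in enumerate(sorted(label_names))}
id2label = {idx: name for name, idx in label2id.items()}

print(label2id)
print(id2label)
```

Such dictionaries can be passed as `label2id` and `id2label` when loading the classification model, so that predictions come back with readable names.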

In [None]:
dataset = Dataset.from_pandas(df[["Title", "label"]])
dataset = dataset.train_test_split(test_size=0.1)
print(dataset)

### 1.4.3 Prepare the model for training

In [None]:
pretrained_model = 'bert-base-uncased'

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

tokenizer = AutoTokenizer.from_pretrained(pretrained_model)

def tokenize_function(batch):
    try:
        # Pad or truncate every title to exactly 128 tokens so that examples
        # coming from different map batches can be stacked by the Trainer
        return tokenizer(batch['Title'], padding='max_length', truncation=True, max_length=128)
    except Exception:
        # Show the offending batch, then stop instead of silently returning None
        print(f"error with Title: {batch['Title']}")
        print(f"error with Title len: {len(batch['Title'])}")
        raise

tokenized_datasets = dataset.map(tokenize_function, batched=True)
# shuffle the datasets
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)

model = AutoModelForSequenceClassification.from_pretrained(pretrained_model, num_labels=3)
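
What `padding` and `truncation` do to a batch can be sketched in plain Python: every sequence of token ids is cut or filled to the same length, and an attention mask records which positions hold real tokens. `pad_and_truncate` is a hypothetical helper and the ids are made up, except for 101, 102 and the padding id 0, which are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Cut or fill every sequence to exactly `max_length` ids."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        # Truncation. A real tokenizer also keeps the final [SEP] token
        # when truncating; this sketch does not.
        ids = ids[:max_length]
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token
    return input_ids, attention_mask

batch = [[101, 7, 8, 102], [101, 5, 6, 7, 8, 9, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)
print(input_ids)
print(attention_mask)
```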

### 1.4.4 Train the model!

In [None]:
from transformers import TrainingArguments, Trainer
import numpy as np

def compute_metrics(eval_pred):
    # Accuracy is computed directly with NumPy because `datasets.load_metric`
    # is no longer available in recent releases of `datasets`
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = {"accuracy": float((predictions == labels).mean())}
    print(accuracy)
    return accuracy

# Note: recent transformers releases name the evaluation argument `eval_strategy`
training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="epoch",
                                  save_strategy="epoch",
                                  load_best_model_at_end=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
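
The metric computed by `compute_metrics` is plain accuracy: take the highest-scoring class for each example and count how often it matches the reference label. Here is a minimal sketch with made-up logits for three classes (illustrative values only).

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(logits, labels):
    """Share of examples whose top-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    n_correct = sum(p == y for p, y in zip(predictions, labels))
    return n_correct / len(labels)

# Illustrative values only: 4 examples, 3 classes.
example_logits = [
    [2.1, 0.3, -1.0],   # predicted class 0
    [0.2, 1.7, 0.1],    # predicted class 1
    [-0.5, 0.4, 2.2],   # predicted class 2
    [1.5, 0.9, 0.0],    # predicted class 0
]
example_labels = [0, 1, 2, 1]
print(accuracy(example_logits, example_labels))
```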

In [None]:
trainer.train()

## 1.5 Text generation with BERT

In [None]:
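from transformers import pipeline

# BERT is an encoder trained to fill in masked tokens, not to write text from left to right,
# so the generated continuation is expected to be poor
text_generation = pipeline('text-generation', model='bert-base-uncased')
text_generation("One Health is an approach calling for the collaborative efforts of multiple")

In [None]:
# bert-base-uncased is not a sequence-to-sequence model either,
# so this call is expected to fail or to return unusable text
qa = pipeline("text2text-generation", model='bert-base-uncased')
qa("What does 'one health concept' mean ?")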