<H1 style="text-align: center;">NLP for One Health</H1>
<h3 style="text-align: center;">From BERT to ChatGPT</h3>

The use of artificial intelligence (AI) in One Health has the potential to revolutionize disease detection and outbreak response. Natural Language Processing (NLP) is a subfield of AI that enables computers to understand and analyze large amounts of text data, including social media posts, news articles or medical records. 

In this practical session, we will explore how NLP can support the early detection of outbreaks and the monitoring of crisis situations by mining newspapers and social media. Specifically, we will start with BERT, a powerful pre-trained language model that can be fine-tuned for specific NLP tasks. We will then move on to ChatGPT, a widely used large language model that generates human-like responses to natural language inputs.

Using these two models, we will discuss practical examples of how NLP can be applied in the One Health context, including case studies of outbreak detection and social media monitoring. Participants will have the opportunity to work with BERT and ChatGPT in hands-on exercises to gain practical experience with these tools. By the end of the session, participants will have a deeper understanding of how NLP can be used in the One Health context and will be equipped to apply these techniques in their own work.


|   |   |   |   |
|---|---|---|---|
| <img src="https://mood-h2020.eu/wp-content/uploads/2020/10/logo_Mood_texte-dessous_CMJN_vecto-300x136.jpg" alt="mood"/> | <img src="https://www.murdoch.edu.au/ResourcePackages/Murdoch2021/assets/dist/images/logo.svg" alt="murdoch" /> | <img src="https://www.umr-tetis.fr/images/logo-header-tetis.png" alt="tetis"/> | <img src="https://www.inrae.fr/themes/custom/inrae_socle/logo.svg" alt="INRAE" /> |

Speaker: **Rémy DECOUPES** - Research engineer UMR TETIS / INRAE

------------------------

# 1. BERT
[BERT: Bidirectional Encoder Representations from Transformers (Devlin et al., 2018)](https://arxiv.org/abs/1810.04805), from Google Research, is an open-source pre-trained language model. BERT is built on the encoder part of the Transformer architecture introduced in [Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762).
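
The core operation of the Transformer is scaled dot-product attention: each token builds its new representation as a weighted average of all tokens, with weights derived from dot products. The sketch below is a minimal pure-Python illustration on made-up 2-D vectors. In a real Transformer the queries, keys and values are learned linear projections of the input; here the same toy vectors are reused for all three, which is a simplifying assumption.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # New representation: weighted average of the value vectors
        out = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy 2-D vectors for three tokens (made-up numbers, not BERT values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
for w in weights:
    print([round(p, 3) for p in w])
```

Each row of `weights` sums to 1, so every output vector is a mix of the value vectors of all tokens in the sequence.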

BERT-base was pre-trained on:
+ English Wikipedia (2.5 billion words)
+ BooksCorpus (0.8 billion words).

On two tasks:
+ Self-masking (masked language modeling)
+ Next sentence prediction
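
To make the self-masking task concrete, here is a minimal sketch of how training examples can be built: some tokens are hidden and the model has to recover them. The function name `mask_tokens` and the seed are illustrative assumptions. The original BERT recipe selects about 15% of the tokens and does not always replace them with `[MASK]`; this sketch always uses `[MASK]` for simplicity.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens (simplified masked language modeling).

    Returns the masked sequence and, for each position, the token to predict
    (None when the position is not masked and is ignored by the loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # nothing to predict here
    return masked, labels

tokens = "one health links people , animals and our environment".split()
# A higher probability than 15% is used so that the short example shows some masks
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```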

## 1.1 Transformers
Transformers is a Python library from Hugging Face that makes it easy to work with BERT-like models.




In [3]:
# installation
!pip install transformers
!pip install torch
!pip install scikit-learn



In [1]:
# load BERT models
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## 1.2 NLP tasks with BERT
Let's use transformers' pipeline on common NLP tasks
### 1.2.1 Self-masking
Predict a token masked inside a sentence

In [1]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("One Health is an approach calling for the collaborative efforts of multiple disciplines working locally, nationally, and globally, to attain optimal health for people, [MASK] and our environment.")



[{'score': 0.4010227620601654,
  'token': 4279,
  'token_str': 'communities',
  'sequence': 'one health is an approach calling for the collaborative efforts of multiple disciplines working locally, nationally, and globally, to attain optimal health for people, communities and our environment.'},
 {'score': 0.09270986169576645,
  'token': 4176,
  'token_str': 'animals',
  'sequence': 'one health is an approach calling for the collaborative efforts of multiple disciplines working locally, nationally, and globally, to attain optimal health for people, animals and our environment.'},
 {'score': 0.06399711221456528,
  'token': 2945,
  'token_str': 'families',
  'sequence': 'one health is an approach calling for the collaborative efforts of multiple disciplines working locally, nationally, and globally, to attain optimal health for people, families and our environment.'},
 {'score': 0.06004488468170166,
  'token': 2740,
  'token_str': 'health',
  'sequence': 'one health is an approach callin

### 1.2.2 Next sentence prediction
The aim of this NLP task is to predict whether the second sentence can plausibly follow the first one.
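
For this task BERT receives both sentences packed into a single sequence, with segment ids telling the model which tokens belong to which sentence. The sketch below shows that layout on already-tokenized input; `build_pair_input` is a hypothetical helper written for illustration, not a function of the `transformers` library (the real tokenizer does this for you).

```python
def build_pair_input(tokens_a, tokens_b):
    """Lay out a sentence pair the way BERT expects it.

    [CLS] sentence A [SEP] sentence B [SEP]
    Segment id 0 covers [CLS], sentence A and the first [SEP];
    segment id 1 covers sentence B and the final [SEP].
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

pair_tokens, pair_types = build_pair_input(["one", "health"], ["it", "works"])
print(pair_tokens)  # ['[CLS]', 'one', 'health', '[SEP]', 'it', 'works', '[SEP]']
print(pair_types)   # [0, 0, 0, 0, 1, 1, 1]
```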

In [18]:
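from transformers import BertTokenizer, BertForNextSentencePrediction
from torch.nn.functional import softmax
import torch


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
nsp_model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

def next_sentence_prediction(sentence_1, sentence_2):
    """Return the probability that sentence_2 follows sentence_1."""
    # Encode both sentences as one pair: [CLS] sentence_1 [SEP] sentence_2 [SEP]
    encoding = tokenizer(sentence_1, sentence_2, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        logits = nsp_model(**encoding)[0]
    probs = softmax(logits, dim=1)
    # Index 0 is the "sentence_2 is the continuation of sentence_1" class
    return probs[0][0].item()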


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [19]:
sentence_1 = "The One Health Initiative is an interdisciplinary movement to create collaborations between animal, human, and environmental health"
sentence_2 = "The aim is  to better and more rapidly respond to outbreaks and newly emerging zoonoses and diseases."

score = next_sentence_prediction(sentence_1, sentence_2)
print(score)

sentence_3 = "Murdoch University is located in Perth, Western Australia"

score = next_sentence_prediction(sentence_1, sentence_3)
print(score)

0.9999943971633911
7.740520959487185e-05
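
The scores printed above are probabilities obtained by applying a softmax to the two logits returned by the model (class 0: the second sentence follows, class 1: it does not). Here is a minimal pure-Python sketch of that last step; the logit values are hypothetical and only chosen to show how a large gap between the two logits yields a probability close to 1.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (is next sentence, is not next sentence)
logits = [6.0, -6.0]
p_is_next, p_not_next = softmax(logits)
print(p_is_next, p_not_next)
```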


### 1.2.3 Named Entity Recognition
Named Entity Recognition (NER) is very useful for extracting specific information. In epidemiological surveillance, we want to extract the pathogen, the host and the location of an outbreak from news articles.

In [21]:
from transformers import pipeline

ner = pipeline('ner', model='dslim/bert-base-NER')
ner("2 swans found dead in Dordogne")



[{'entity': 'B-LOC',
  'score': 0.9993818,
  'index': 8,
  'word': 'Do',
  'start': 22,
  'end': 24},
 {'entity': 'I-LOC',
  'score': 0.8579382,
  'index': 9,
  'word': '##rdo',
  'start': 24,
  'end': 27},
 {'entity': 'B-LOC',
  'score': 0.69484633,
  'index': 10,
  'word': '##gne',
  'start': 27,
  'end': 30}]
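
In the output above, the single place name is returned as three word pieces. A minimal sketch of how they can be merged back into one entity using the character offsets follows; `merge_wordpieces` is a hypothetical helper written for this example. The `transformers` NER pipeline also accepts an `aggregation_strategy` argument for grouping tokens, which is probably the better option in practice.

```python
def merge_wordpieces(text, entities):
    """Merge consecutive '##' word pieces into one entity span."""
    merged = []
    for ent in entities:
        label = ent["entity"].split("-")[-1]            # "B-LOC" -> "LOC"
        is_continuation = ent["word"].startswith("##")
        if merged and is_continuation and ent["start"] == merged[-1]["end"]:
            # Same word as the previous piece: extend the span
            merged[-1]["end"] = ent["end"]
            merged[-1]["scores"].append(ent["score"])
        else:
            merged.append({"label": label, "start": ent["start"],
                           "end": ent["end"], "scores": [ent["score"]]})
    return [
        {"label": s["label"],
         "word": text[s["start"]:s["end"]],
         "score": sum(s["scores"]) / len(s["scores"])}   # mean of the piece scores
        for s in merged
    ]

text = "2 swans found dead in Dordogne"
# Pipeline output copied from the cell above
entities = [
    {"entity": "B-LOC", "score": 0.9993818, "word": "Do", "start": 22, "end": 24},
    {"entity": "I-LOC", "score": 0.8579382, "word": "##rdo", "start": 24, "end": 27},
    {"entity": "B-LOC", "score": 0.69484633, "word": "##gne", "start": 27, "end": 30},
]
locations = merge_wordpieces(text, entities)
print(locations)
```

Averaging the scores of the pieces is one possible choice among others (taking the first or the maximum score would also be reasonable).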

## 1.3 Diving into the model representations
### 1.3.1 Inputs: Natural texts to vectors
Inputs / Tokenization.

BERT has a vocabulary of about 30,000 tokens (30,522 for `bert-base-uncased`).

In [22]:
from transformers import BertTokenizerFast, BertForTokenClassification
from torch.nn import functional as F
import torch
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased', return_dict = True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

In [24]:
text = "2 swans found dead in Dordogne"
inputs = tokenizer.encode_plus(text, return_tensors = "pt")

print(f"Inputs: {inputs} \n")
print(f"Inputs_ids: {inputs['input_ids']} \n")
print(f"Inputs word ids: {inputs.word_ids()} \n")
print(f"Inputs to words: {tokenizer.tokenize(text)} \n")

Inputs: {'input_ids': tensor([[  101,  1016, 26699,  2179,  2757,  1999,  2079, 20683, 10177,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])} 

Inputs_ids: tensor([[  101,  1016, 26699,  2179,  2757,  1999,  2079, 20683, 10177,   102]]) 

Inputs word ids: [None, 0, 1, 2, 3, 4, 5, 5, 5, None] 

Inputs to words: ['2', 'swans', 'found', 'dead', 'in', 'do', '##rdo', '##gne'] 



Let's focus on two points:

+ **Sub-tokenization**: as BERT only knows ~30K tokens, unknown words have to be split into sub-tokens. For example, **Dordogne** becomes [do, ##rdo, ##gne].
+ **Special tokens**: two tokens have been added, 101 at the start of the sequence and 102 at the end.
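
Both points can be sketched in a few lines of pure Python. BERT's WordPiece tokenizer splits a word greedily, always taking the longest prefix found in the vocabulary and marking the following pieces with `##`. The vocabulary below is a tiny toy set chosen for this sentence (an assumption, not the real 30K vocabulary), and `wordpiece` and `encode` are illustrative helper names.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter candidate
        if piece is None:
            return [unk]                       # no piece matches: unknown word
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Lowercase, split on spaces, apply WordPiece and add the special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

# Toy vocabulary (hypothetical subset of the real one)
toy_vocab = {"2", "swans", "found", "dead", "in", "do", "##rdo", "##gne"}
print(encode("2 swans found dead in Dordogne", toy_vocab))
# ['[CLS]', '2', 'swans', 'found', 'dead', 'in', 'do', '##rdo', '##gne', '[SEP]']
```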

In [33]:
print(f"Token ID = 2079, 20683, 10177: {tokenizer.convert_ids_to_tokens([2079, 20683, 10177])}")

print("\n")
print(f"Shape of inputs: {inputs['input_ids'].shape} | Number of tokens (without special tokens): {len(tokenizer.tokenize(text))}")
print(f"Token ID = 101: {tokenizer.convert_ids_to_tokens(101)}")
print(f"Token ID = 102: {tokenizer.convert_ids_to_tokens(102)}")

Token ID = 2079, 20683, 10177: ['do', '##rdo', '##gne']


Shape of inputs: torch.Size([1, 10]) | Number of tokens (without special tokens): 8
Token ID = 101: [CLS]
Token ID = 102: [SEP]


### 1.3.2 Embedding
An embedding is a vector representation of the semantic information carried by a text.

Each known token has a vector representation from the pre-trained BERT model.

In [5]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
embedding_matrix = model.embeddings.word_embeddings.weight

print(f"BERT embedding matrix: {embedding_matrix} \n\n Matrix shape: {embedding_matrix.shape}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BERT embedding matrix: Parameter containing:
tensor([[-0.0102, -0.0615, -0.0265,  ..., -0.0199, -0.0372, -0.0098],
        [-0.0117, -0.0600, -0.0323,  ..., -0.0168, -0.0401, -0.0107],
        [-0.0198, -0.0627, -0.0326,  ..., -0.0165, -0.0420, -0.0032],
        ...,
        [-0.0218, -0.0556, -0.0135,  ..., -0.0043, -0.0151, -0.0249],
        [-0.0462, -0.0565, -0.0019,  ...,  0.0157, -0.0139, -0.0095],
        [ 0.0015, -0.0821, -0.0160,  ..., -0.0081, -0.0475,  0.0753]],
       requires_grad=True) 

 Matrix shape: torch.Size([30522, 768])


In [34]:
print(f"Embeddings of Swan (id=26699): {embedding_matrix[26699]}\n\n")
print(f"Embeddings of Dordogne (id=2079, 20683, 10177): {embedding_matrix[[2079, 20683, 10177]]}\n")

Embeddings of Swan (id=26699): tensor([ 0.0468, -0.0246, -0.0598, -0.0030, -0.0016, -0.0671, -0.0588, -0.0971,
        -0.0382, -0.0254, -0.0825, -0.0955, -0.1219, -0.0466, -0.0663,  0.0075,
        -0.0503, -0.1312,  0.0393, -0.0481, -0.0599, -0.0542, -0.0561, -0.0083,
        -0.0670, -0.0430, -0.0506, -0.0610, -0.0210, -0.0383, -0.0438, -0.0586,
        -0.0556, -0.0072, -0.0692,  0.0132, -0.0468, -0.0959,  0.0071, -0.0555,
        -0.0735, -0.0285, -0.0796, -0.0525,  0.0290, -0.0085,  0.0139, -0.0831,
         0.0227, -0.0705, -0.0141, -0.0019, -0.0500, -0.0995, -0.0804, -0.1143,
         0.0064, -0.0393, -0.0772, -0.0838, -0.0026, -0.0778,  0.0286,  0.0003,
         0.0084, -0.0638, -0.0006, -0.0951, -0.0569, -0.0372,  0.0146,  0.0021,
        -0.0527, -0.1213, -0.1014, -0.0221, -0.0598,  0.0375, -0.0032, -0.0867,
        -0.0258,  0.0044,  0.0641, -0.0378, -0.0644, -0.0262, -0.0090, -0.1085,
        -0.1133, -0.0132, -0.1233,  0.0030, -0.0634, -0.0621, -0.0803,  0.0014,
         

However, the strength of BERT is that it captures the **contextualized** semantics of a word within its sentence. To explore these embeddings, let's look at the representation produced by the last layer, **last_hidden_states**.

In [13]:
from transformers import BertModel, BertTokenizer
import torch
import math 
model_name = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

input_text = "2 swans found dead in Dordogne"
input_ids = tokenizer.encode(input_text)
input_ids = torch.tensor([input_ids])

with torch.no_grad():
    last_hidden_states = model(input_ids)[0]
last_hidden_states_mean = last_hidden_states.mean(1)

print(f"Embedding shape: {last_hidden_states.shape}\n\n")
print(f"Embeddings of Swan inside its sentence (word_id=2): {last_hidden_states[0, 2, :]}\n\n")
# Token index 2 is "swans" and index 4 is "dead" (index 0 is [CLS], index 1 is "2").
# embedding_matrix is the static (context-free) embedding table loaded in section 1.3.2.
def euclidean(a, b):
    """Euclidean (L2) distance between two 1-D tensors, returned as a float."""
    return torch.dist(a, b).item()

print(f"Euclidean distance between Swan (in pre-trained) and Swan in our sentence: {euclidean(embedding_matrix[26699], last_hidden_states[0, 2, :])}")
print(f"Euclidean distance between dead (in pre-trained) and dead in our sentence: {euclidean(embedding_matrix[2757], last_hidden_states[0, 4, :])}")
print(f"Euclidean distance between dead and Swan in our sentence: {euclidean(last_hidden_states[0, 4, :], last_hidden_states[0, 2, :])}")
print(f"Euclidean distance between dead and Swan in pre-trained: {euclidean(embedding_matrix[2757], embedding_matrix[26699])}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Embedding shape: torch.Size([1, 10, 768])


Embeddings of Swan inside its sentence (word_id=2): tensor([ 9.0190e-01,  9.0987e-02, -8.5999e-01,  1.2963e-01,  2.7044e-01,
        -9.6713e-02, -3.0147e-01, -2.9795e-01, -1.0199e-01, -3.4741e-01,
         1.0293e-01, -8.1673e-02, -1.5974e-01,  1.9850e-03, -4.1525e-01,
        -1.0670e-01, -1.6164e-01,  1.1252e-01,  4.5947e-02,  1.3988e-01,
        -9.7612e-02, -3.4728e-01,  8.7718e-02,  1.1722e-01,  2.4149e-01,
        -1.2030e-02, -1.5382e-01,  4.6136e-01, -2.2699e-01, -1.4936e-01,
         5.7801e-01,  6.1131e-01,  4.1870e-01,  8.8015e-01, -1.7126e-01,
         2.7909e-01, -5.0713e-01, -3.0916e-01,  4.5754e-01, -1.0244e-02,
         2.5422e-02,  8.3125e-03,  7.3103e-02, -3.9795e-01,  2.6188e-01,
         1.8706e-01,  6.5635e-01, -4.7362e-01,  9.5162e-01, -1.3094e-01,
        -1.9189e-01,  7.5922e-01, -4.4014e-01, -6.7426e-02, -3.2155e-01,
         5.1992e-02,  1.0292e-01, -3.4766e-01,  3.4522e-01,  2.6228e-01,
        -4.2734e-02, -4.3913

### 1.3.3 Visualize sentences embedding
The meaning of a sentence is carried by the embeddings of its tokens. Here we propose to visualize it:

+ Retrieve one embedding per sentence (rather than one per token) using [sentence embedding](https://github.com/UKPLab/sentence-transformers). We are going to use **all-mpnet-base-v2**, ranked first in the [Sentence BERT benchmark](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models) at the time of writing.
+ Reduce dimension to plot embeddings on a 2-D figure using [UMAP](https://umap-learn.readthedocs.io/en/latest/)
+ Visualize it interactively with [plotly.express](https://plotly.com/python/plotly-express/)
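
For the first step, a common way to go from token vectors to a single sentence vector is to average them while ignoring padding positions (mean pooling). The sketch below illustrates this on made-up numbers; `mean_pool` is a hypothetical helper, and the pooling actually used by a given sentence-embedding model should be checked in its documentation.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens and one padding token (toy 2-D vectors)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```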

In [7]:
!pip install -U sentence-transformers
!pip install umap-learn
!pip install pandas
!pip install plotly
!pip install --upgrade nbformat


In [1]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')



In [2]:
my_sentences = [
    "The One Health approach emphasizes the interconnectivity of human, animal, and environmental health.",
    "The COVID-19 pandemic has underscored the importance of One Health in understanding and preventing zoonotic diseases.",
    "Collaboration between healthcare professionals, veterinarians, and environmental experts is essential for effective implementation of One Health.",
    "The workshop on AI and One Health will explore the potential of artificial intelligence to improve health outcomes for humans, animals, and the environment.",
    "The workshop will take place in Murdoch University"
]

sentence_embeddings = model.encode(my_sentences)

print(f"Shape of my list of sentences: {sentence_embeddings.shape}")

Shape of my list of sentences: (5, 768)
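
Once each sentence is a 768-dimensional vector, a usual way to compare two sentences is the cosine similarity of their vectors. Below is a minimal pure-Python sketch on toy vectors; `cosine_similarity` is written here for illustration and is not taken from the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D vectors standing in for sentence embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

With the embeddings computed above, `cosine_similarity(sentence_embeddings[0], sentence_embeddings[1])` should work the same way, since each row can be iterated like a list.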


In [4]:
import umap
import pandas as pd

df = pd.DataFrame(my_sentences)

umap_emb = umap.UMAP(n_components=2, random_state=42).fit_transform(sentence_embeddings)
df['umap-x'] = umap_emb[:, 0]
df['umap-y'] = umap_emb[:, 1]



In [7]:
import plotly.express as px

fig = px.scatter(df, x='umap-x', y='umap-y', width=800, height=800)
fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [3]:
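from transformers import pipeline

# BERT is an encoder pre-trained with masked language modeling, not a left-to-right
# generator: used for text generation, it produces a degenerate continuation.
text_generation = pipeline('text-generation', model='bert-base-uncased')
text_generation("One Health is an approach calling for the collaborative efforts of multiple")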

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'generated_text': 'One Health is an approach calling for the collaborative efforts of multiple a a a a a a a a'}]