<H1 style="text-align: center;">NLP for One Health</H1>
<h3 style="text-align: center;">From BERT to ChatGPT</h3>

The use of artificial intelligence (AI) in One Health has the potential to revolutionize disease detection and outbreak response. Natural Language Processing (NLP) is a subfield of AI that enables computers to understand and analyze large amounts of text data, including social media posts, news articles or medical records. 

In this practical session, we will explore how NLP can support early outbreak detection and crisis monitoring by mining newspapers and social media. We will start with BERT, a pre-trained language model that can be fine-tuned for specific NLP tasks. We will then move on to ChatGPT, a widely used conversational large language model that generates human-like responses to natural language inputs.

Using these two models, we will discuss practical examples of how NLP can be applied in the One Health context, including case studies of outbreak detection and social media monitoring. Participants will work with BERT and ChatGPT in hands-on exercises. By the end of the session, they will have a deeper understanding of how NLP can be used in the One Health context and the skills to apply these techniques in their own work.


|   |   |   |   |
|---|---|---|---|
| <img src="https://mood-h2020.eu/wp-content/uploads/2020/10/logo_Mood_texte-dessous_CMJN_vecto-300x136.jpg" alt="mood"/> | <img src="https://www.murdoch.edu.au/ResourcePackages/Murdoch2021/assets/dist/images/logo.svg" alt="murdoch" /> | <img src="https://www.umr-tetis.fr/images/logo-header-tetis.png" alt="tetis"/> | <img src="https://www.inrae.fr/themes/custom/inrae_socle/logo.svg" alt="INRAE" /> |

Speaker: **Rémy DECOUPES** - Research engineer UMR TETIS / INRAE

------------------------

# 1. BERT
BERT ("[Bidirectional Encoder Representations from Transformers - Devlin et al - 2018](https://arxiv.org/abs/1810.04805)") from Google Research is an open-source pre-trained language model. It is built on the encoder part of the Transformer architecture introduced in "[Attention is all you need - Vaswani et al - 2017](https://arxiv.org/abs/1706.03762)".

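BERT-base was pre-trained on:
+ English Wikipedia (2.5 billion words)
+ BooksCorpus (0.8 billion words).

With two training objectives:
+ Self-masking (also called masked language modeling): predict tokens that have been hidden in the input
+ Next sentence prediction: decide whether one sentence follows another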
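
As a rough illustration of the masking objective, the sketch below applies the corruption rule described in the BERT paper: about 15% of token positions are selected, and of those 80% are replaced by `[MASK]`, 10% by a random token and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, the whitespace tokenization and the independent per-position sampling are simplifying assumptions made for this example; real pre-training works on WordPiece token IDs.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) following an 80/10/10 masking rule.

    labels[i] holds the original token at selected positions and None elsewhere,
    so a model would only be scored on the selected positions.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels

tokens = "optimal health for people , animals and our environment".split()
toy_vocab = sorted(set(tokens))  # hypothetical vocabulary used for random replacement
# A higher mask_prob than 0.15 is used here only so that a short sentence gets some masks
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.3)
print(corrupted)
print(labels)
```

During pre-training the model receives `corrupted` as input and is asked to recover the non-`None` entries of `labels`.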

## 1.1 Transformers
Transformers is a Python library from Hugging Face that makes it easy to work with BERT-like models.




In [9]:
# installation
!pip install transformers
!pip install torch



In [1]:
# load BERT models
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## 1.2 NLP tasks with BERT
Let's use the `pipeline` helper from transformers on common NLP tasks.
### 1.2.1 Self-masking
Predict a token that has been masked inside a sentence.

In [1]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("One Health is an approach calling for the collaborative efforts of multiple disciplines working locally, nationally, and globally, to attain optimal health for people, [MASK] and our environment.")



[{'score': 0.4010227620601654,
  'token': 4279,
  'token_str': 'communities',
  'sequence': 'one health is an approach calling for the collaborative efforts of multiple disciplines working locally, nationally, and globally, to attain optimal health for people, communities and our environment.'},
 {'score': 0.09270986169576645,
  'token': 4176,
  'token_str': 'animals',
  'sequence': 'one health is an approach calling for the collaborative efforts of multiple disciplines working locally, nationally, and globally, to attain optimal health for people, animals and our environment.'},
 {'score': 0.06399711221456528,
  'token': 2945,
  'token_str': 'families',
  'sequence': 'one health is an approach calling for the collaborative efforts of multiple disciplines working locally, nationally, and globally, to attain optimal health for people, families and our environment.'},
 {'score': 0.06004488468170166,
  'token': 2740,
  'token_str': 'health',
  'sequence': 'one health is an approach callin

### 1.2.2 Next sentence prediction
This task estimates whether the second sentence is a plausible continuation of the first one.

In [18]:
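import torch
from transformers import BertTokenizer, BertForNextSentencePrediction
from torch.nn.functional import softmax


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
nsp_model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

def next_sentence_prediction(sentence_1, sentence_2):
    """Return the probability that sentence_2 follows sentence_1."""
    # Encode the pair as: [CLS] sentence_1 [SEP] sentence_2 [SEP]
    encoding = tokenizer(sentence_1, sentence_2, return_tensors='pt')
    with torch.no_grad():  # no gradients are needed for inference
        logits = nsp_model(**encoding).logits
    probs = softmax(logits, dim=1)
    # Index 0 is the "sentence_2 is the continuation" class, index 1 is "random sentence"
    return probs[0][0].item()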


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [19]:
sentence_1 = "The One Health Initiative is an interdisciplinary movement to create collaborations between animal, human, and environmental health"
sentence_2 = "The aim is  to better and more rapidly respond to outbreaks and newly emerging zoonoses and diseases."

score = next_sentence_prediction(sentence_1, sentence_2)
print(score)

sentence_3 = "Murdoch University is located in Perth, Western Australia"

score = next_sentence_prediction(sentence_1, sentence_3)
print(score)

0.9999943971633911
7.740520959487185e-05
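
The two scores printed above are probabilities obtained by applying a softmax to the two logits returned by the next-sentence head. A minimal pure-Python sketch of that step is shown below. The helper name `softmax_list` and the logit values are made up for illustration and are not taken from the model.

```python
import math

def softmax_list(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [is_next, not_next]
probs = softmax_list([6.0, -6.0])
print(f"P(is next) = {probs[0]:.6f}")
print(f"P(not next) = {probs[1]:.6f}")
```

A large gap between the two logits pushes one probability close to 1 and the other close to 0, which is why the scores above look so extreme.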


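### 1.2.3 Named Entity Recognition
NER is very useful for extracting specific information. In epidemiological surveillance, we want to extract the pathogen, the host and the location of an outbreak from news articles. The general-purpose model used below only knows four entity types (persons, organisations, locations and miscellaneous), so it can find a location but has no label for a host or a pathogen.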

In [21]:
from transformers import pipeline

ner = pipeline('ner', model='dslim/bert-base-NER')
ner("2 swans found dead in Dordogne")



[{'entity': 'B-LOC',
  'score': 0.9993818,
  'index': 8,
  'word': 'Do',
  'start': 22,
  'end': 24},
 {'entity': 'I-LOC',
  'score': 0.8579382,
  'index': 9,
  'word': '##rdo',
  'start': 24,
  'end': 27},
 {'entity': 'B-LOC',
  'score': 0.69484633,
  'index': 10,
  'word': '##gne',
  'start': 27,
  'end': 30}]
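
The raw output above is given per word-piece: `Do`, `##rdo` and `##gne` are three fragments of the single location "Dordogne". The `ner` pipeline accepts an `aggregation_strategy` argument that can group pieces for you. The sketch below shows the idea in plain Python on the output printed above. The helper name `merge_wordpieces` and its rule (glue every piece starting with `##` onto the previous one, keep the first tag, average the scores) are assumptions for illustration and are simpler than what the pipeline does.

```python
def merge_wordpieces(predictions):
    """Glue '##' continuation pieces onto the preceding piece."""
    entities = []
    for pred in predictions:
        word = pred["word"]
        if word.startswith("##") and entities:
            current = entities[-1]
            current["word"] += word[2:]          # drop the '##' prefix
            current["end"] = pred["end"]         # extend the character span
            current["scores"].append(pred["score"])
        else:
            entities.append({
                "entity": pred["entity"].split("-")[-1],  # 'B-LOC' -> 'LOC'
                "word": word,
                "start": pred["start"],
                "end": pred["end"],
                "scores": [pred["score"]],
            })
    # Summarise each entity with the mean score of its pieces
    return [
        {"entity": e["entity"], "word": e["word"], "start": e["start"],
         "end": e["end"], "score": sum(e["scores"]) / len(e["scores"])}
        for e in entities
    ]

# Output copied from the cell above
raw = [
    {"entity": "B-LOC", "score": 0.9993818, "index": 8, "word": "Do", "start": 22, "end": 24},
    {"entity": "I-LOC", "score": 0.8579382, "index": 9, "word": "##rdo", "start": 24, "end": 27},
    {"entity": "B-LOC", "score": 0.69484633, "index": 10, "word": "##gne", "start": 27, "end": 30},
]
merged = merge_wordpieces(raw)
print(merged[0]["entity"], merged[0]["word"], merged[0]["start"], merged[0]["end"])
# prints: LOC Dordogne 22 30
```

The character offsets `start` and `end` can then be used to slice the original sentence and recover the location exactly as it was written.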

## 1.3 Diving into the model representation
### 1.3.1 Inputs: Natural text to vectors
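
Before BERT can turn text into vectors, each word is split into word-pieces taken from a fixed vocabulary, and each piece is mapped to an integer ID that indexes an embedding vector. The sketch below imitates the greedy longest-match-first splitting used by WordPiece, on a tiny hand-made vocabulary. The vocabulary, the IDs and the function name `wordpiece` are hypothetical; the real `bert-base-uncased` vocabulary has about 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into word-pieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary: token -> ID
toy_vocab_ids = {"[UNK]": 0, "swan": 1, "##s": 2, "do": 3, "##rdo": 4, "##gne": 5, "dead": 6}

for word in ["swans", "dordogne", "dead", "zoonosis"]:
    pieces = wordpiece(word, toy_vocab_ids)
    print(word, pieces, [toy_vocab_ids[p] for p in pieces])
```

With this toy vocabulary, "dordogne" is split into `do`, `##rdo` and `##gne`, the same fragments that appear in the NER output, while a word with no matching pieces falls back to `[UNK]`.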

In [3]:
text_generation = pipeline('text-generation', model='bert-base-uncased')
text_generation("One Health is an approach calling for the collaborative efforts of multiple")

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'generated_text': 'One Health is an approach calling for the collaborative efforts of multiple a a a a a a a a'}]