### The code and data are adapted from:  https://medium.com/@vitalshchutski/french-nlp-entamez-le-camembert-avec-les-librairies-fast-bert-et-transformers-14e65f84c148

In [None]:
# !conda install torch
# !pip install fast-bert==1.9.1
# !mkdir model
# !mkdir finetuned_model
# manually download & install apex => https://github.com/NVIDIA/apex

In [1]:
import torch
from fast_bert.data_cls import BertDataBunch 
from fast_bert.learner_cls import BertLearner
from fast_bert.data_lm import BertLMDataBunch
from fast_bert.learner_lm import BertLMLearner
from fast_bert.metrics import fbeta, roc_auc
from fast_bert.prediction import BertClassificationPredictor
from pathlib import Path
import pandas as pd
import logging


device_cuda = torch.device("cuda")

### Fixing Windows encodings (attempt failed)

In [2]:
import sys
sys.getdefaultencoding()

'utf-8'

In [3]:
import locale
if locale.getpreferredencoding().upper() != 'UTF-8': 
    locale.setlocale(locale.LC_ALL, ('fr', 'utf-8'))
locale.getpreferredencoding()

'UTF-8'

In [4]:
sys.getfilesystemencoding()

'utf-8'
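
The three checks above only report the interpreter's defaults. A less fragile approach on Windows, sketched here under the assumption that the data files are UTF-8, is to pass `encoding` explicitly on every read and write so that the locale setting no longer matters. The file name and sample string below are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample: French text with accented characters.
sample = "préconisation : aucune intervention n'est nécessaire"

# Write and read back with an explicit encoding instead of the locale default.
path = os.path.join(tempfile.mkdtemp(), "sample_utf8.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

with open(path, "r", encoding="utf-8") as f:
    restored = f.read()

# True when the accented characters survive the round trip.
print(restored == sample)
```

The `pd.read_csv(..., encoding='utf-8')` and `open(..., encoding='utf-8')` calls elsewhere in this notebook follow the same idea.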

### Data preparation

In [5]:
DATA_PATH = Path('./data/')
LOG_PATH = Path('./logs/')
MODEL_PATH = Path('./model/')
LABEL_PATH = Path('./labels/')

In [6]:
# Create the logger
logfile = 'logfile.txt'

logging.basicConfig(
    level=logging.INFO,  # available levels: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET
    format='%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
    datefmt='%m/%d/%Y %H:%M:%S',
    handlers=[
        logging.FileHandler(logfile, 'w', 'utf-8'),
        logging.StreamHandler(sys.stdout)
    ])

logger = logging.getLogger()

In [7]:
df = pd.read_csv('./data/AA_xml_manul - 0-200_0-403.csv',encoding = 'utf-8')

In [8]:
# Drop rows whose Disease label is missing (NaN)
df = df[~df['Disease'].isnull()]


In [29]:
import numpy as np
df['Disease'] = df['Disease'].apply(np.int64)

In [9]:
df.tail(10)

Unnamed: 0,report_text,source_name,Bioagressor,Disease
392,doryphore des larves de doryphores ont été obs...,AA_PDT_Ile_de_France_2006_005,1,0.0
393,puceronson note l'apparition des premiers puce...,AA_PDT_Ile_de_France_2006_005,0,0.0
394,détermination du seuil de traitement :réalisez...,AA_PDT_Ile_de_France_2006_005,0,0.0
395,protection du maïs semence édité avec la colla...,AA_GC_Midi_Pyrenees_1993_013,0,0.0
396,situation il est possible d'observer les flétr...,AA_GC_Midi_Pyrenees_1993_013,1,0.0
397,préconisation toute intervention est aujourd'h...,AA_GC_Midi_Pyrenees_1993_013,0,0.0
398,situation - prévision le vol est réalisé à plu...,AA_GC_Midi_Pyrenees_1993_013,0,0.0
399,préconisationune intervention ne se justifiera...,AA_GC_Midi_Pyrenees_1993_013,0,0.0
400,situation - prévisiondes colonies de metopolop...,AA_GC_Midi_Pyrenees_1993_013,0,0.0
401,préconisationaucune intervention n'est nécessa...,AA_GC_Midi_Pyrenees_1993_013,0,0.0


In [31]:
val_set = df.sample(frac=0.2, replace=False, random_state=42)
train_set = df.drop(index = val_set.index)
print('Number of comments in val_set:', len(val_set))
print('Number of comments in train_set:', len(train_set))
val_set.to_csv('./data/val_set.csv',encoding = 'utf-8')
train_set.to_csv('./data/train_set.csv',encoding = 'utf-8')

Number of comments in val_set: 80
Number of comments in train_set: 321
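
The split sizes follow from the row count: `sample(frac=0.2)` on 401 rows draws 80 of them (0.2 × 401 = 80.2, rounded), and the remaining 321 form the training set. The sketch below reproduces that arithmetic with the standard library only. It shuffles indices with `random` rather than calling pandas, so the selected rows will differ from the notebook's.

```python
import random

n_rows = 401   # rows left after dropping NaN labels (80 + 321 in the output above)
frac = 0.2

n_val = round(frac * n_rows)

# Fixed seed for repeatability; this is not the same generator pandas uses.
rng = random.Random(42)
indices = list(range(n_rows))
val_idx = set(rng.sample(indices, n_val))
train_idx = [i for i in indices if i not in val_idx]

print(len(val_idx), len(train_idx))
```

Dropping the validation indices from the full index, as `df.drop(index=val_set.index)` does, guarantees that the two sets are disjoint.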


In [12]:
labels = df.columns[2:4].to_list() 
with open('./labels/labels.txt', 'w',encoding = 'utf-8') as f:
    for i in labels:
        f.write(i + "\n")
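
Selecting the label columns by position depends on the exact column order of the CSV, including any unnamed index column that `read_csv` picks up. A more defensive variant, sketched below, names the columns explicitly. It assumes the two targets are `Bioagressor` and `Disease`, as the table above suggests, and writes to a temporary directory rather than `./labels/`.

```python
import os
import tempfile

# Assumption: these are the two target columns shown in the table above.
labels = ["Bioagressor", "Disease"]

label_dir = tempfile.mkdtemp()   # stand-in for ./labels/
label_file = os.path.join(label_dir, "labels.txt")

# One label per line, written with an explicit encoding.
with open(label_file, "w", encoding="utf-8") as f:
    f.write("\n".join(labels) + "\n")

# Read the file back to confirm its contents.
with open(label_file, encoding="utf-8") as f:
    written = f.read().splitlines()

print(written)
```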

In [17]:
#df_texts = pd.read_csv('./data/bsv_chunk256_raw_1001-1200.csv',encoding = 'utf-8')
#all_texts = df_texts['report_text'].to_list()
#all_texts = [str(t.encode('utf-8')) for t in all_texts] #that's the key to make it work under windows! => but it is a devilish hack
#print('Number of text blocks:', len(all_texts))

Number of text blocks: 2801


In [10]:
df_texts = pd.read_csv('./data/raw_xml_bsv_0-200.csv')

### Text cleaning

In [11]:
import nltk
import re

# Cast every entry to a string (NaN becomes the string 'nan')
df_texts['report_text'] = df_texts['report_text'].astype(str)
# Keep only words longer than 3 characters; this also blanks out the 'nan' entries
df_texts['report_text'] = df_texts['report_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
# Make all text lowercase
df_texts['report_text'] = df_texts['report_text'].apply(lambda x: x.lower())
# Delete stop-words => to be tested later
#stopwords = nltk.corpus.stopwords.words('french')
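
To make the effect of these steps concrete, the same transformations can be applied to a single string. The helper name `clean_text` and the sample sentence are illustrative and not part of the notebook.

```python
def clean_text(text, min_len=4):
    """Keep words of at least `min_len` characters, then lowercase the result."""
    kept = [w for w in str(text).split() if len(w) >= min_len]
    return " ".join(kept).lower()


sample = "Les premiers pucerons sont capturés en bacs jaunes."
cleaned = clean_text(sample)
print(cleaned)

# A missing value becomes the 3-character string 'nan' and is filtered out.
empty = clean_text(float("nan"))
print(repr(empty))
```

Note that the length filter also removes short function words such as "les" and "en", so the language model is fine-tuned on text that is no longer fully grammatical.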


In [12]:
all_texts = df_texts['report_text'].to_list()[500:]
print('Number of text blocks:', len(all_texts))

Number of text blocks: 2297


In [13]:
all_texts[-9:]

['pucerons cendresintervenir avec aphicide nécessaire plus colonies',
 'betteraves',
 'pucerons jaunisse',
 'situationles premiers pucerons sont capturés bacs jaunes.',
 "préconisationles betteraves sont sensibles stade feuilles jusqu'à couverture sol. premier traitement sera effectuer prochain réchauffement particulièrement parcelles n'ayant reçu microgranulés semis ayant atteint stade feuilles. utiliser spécialité base pyréthrinoïde.",
 'principaux aphicides foliaires homologues betteraves* nouveaux produits contact, inhalation, ingestion',
 "note commune acta agpm inra atrazinel'atrazine désherbant utilisé france essentiellement cultures maïs pour l'entretien zones cultivées (voies ferrées, bordures routes, berges, etc...). suite utilisations, quelquefois observé résidus d'atrazine dans eaux supérieurs normes communautaires. donc souhaitable cette campagne promouvoir conditions d'emploi ('atrazjne visant modérer apports.",
 "culture mais-» eviter applications prélevée. préférer trai

### Creating the LMDataBunch

In [14]:
databunch_lm = BertLMDataBunch.from_raw_corpus(
                    data_dir=DATA_PATH,
                    text_list=all_texts,
                    tokenizer='camembert-base',
                    batch_size_per_gpu=4, #was 16
                    max_seq_length=256, #was 512
                    multi_gpu=False,
                    model_type='camembert-base',
                    logger=logger)

12/10/2020 14:52:08 - INFO - root -   Formatting corpus for data\lm_train.txt


12/10/2020 14:52:08 - INFO - root -   Formatting corpus for data\lm_val.txt


12/10/2020 14:52:09 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model from cache at C:\Users\Raja/.cache\torch\transformers\3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59
12/10/2020 14:52:09 - INFO - root -   Loading features from cached file data\lm_cache\cached_camembert-base_train_256
12/10/2020 14:52:09 - INFO - root -   Loading features from cached file data\lm_cache\cached_camembert-base_dev_256


### Creating the LMLearner

In [16]:
lm_learner = BertLMLearner.from_pretrained_model(
                            dataBunch=databunch_lm,
                            pretrained_path='camembert-base',
                            output_dir=MODEL_PATH,
                            metrics=[],
                            device=device_cuda,
                            logger=logger,
                            multi_gpu=False,
                            logging_steps=50,
                            is_fp16=False) #was true with gpu

12/10/2020 14:53:34 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-config.json from cache at C:\Users\Raja/.cache\torch\transformers\5152a7b8b97da26abdad9b3babb600e77c52a002331ea52a9eaf96ea8b31ef8f.5bd7a9a60b9a2d311368226259eaf870cfb2248e0752f28b444ec112977cf8fc
12/10/2020 14:53:34 - INFO - transformers.configuration_utils -   Model config CamembertConfig {
  "architectures": [
    "CamembertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 5,
  "eos_token_id": 6,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "camembert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 32005
}

12/10

In [17]:
lm_learner.fit(epochs=10, #was 30
            lr=1e-4,
            validate=True,
            schedule_type="warmup_cosine",
            optimizer_type="adamw")

12/10/2020 14:53:57 - INFO - root -   ***** Running training *****
12/10/2020 14:53:57 - INFO - root -     Num examples = 961
12/10/2020 14:53:57 - INFO - root -     Num Epochs = 10
12/10/2020 14:53:57 - INFO - root -     Total train batch size (w. parallel, distributed & accumulation) = 4
12/10/2020 14:53:57 - INFO - root -     Gradient Accumulation steps = 1
12/10/2020 14:53:57 - INFO - root -     Total optimization steps = 2410
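
The step count in the log is consistent with the other reported values: 961 examples in batches of 4 give 241 batches per epoch (the last one partial), and 10 epochs give 2410 optimizer steps. A quick check, assuming one optimizer step per batch since gradient accumulation is 1:

```python
import math

num_examples = 961
batch_size = 4
epochs = 10
grad_accum_steps = 1

# The last, partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size) // grad_accum_steps
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```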


12/10/2020 14:54:14 - INFO - root -   Running evaluation
12/10/2020 14:54:14 - INFO - root -   Num examples = 99
12/10/2020 14:54:14 - INFO - root -   Validation Batch size = 8


12/10/2020 14:54:15 - INFO - root -   eval_loss after step 50: 0.23925103705662948: 
12/10/2020 14:54:15 - INFO - root -   eval_perplexity after step 50: 1.2702974081039429: 
12/10/2020 14:54:15 - INFO - root -   lr after step 50: 9.989383241845838e-05
12/10/2020 14:54:15 - INFO - root -   train_loss after step 50: 2.7681958866119385
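
The logged perplexity is the exponential of the logged evaluation loss, which can be verified from the two step-50 values above:

```python
import math

eval_loss = 0.23925103705662948          # eval_loss after step 50
logged_perplexity = 1.2702974081039429   # eval_perplexity after step 50

perplexity = math.exp(eval_loss)
print(perplexity, abs(perplexity - logged_perplexity))
```

The two numbers agree to about seven significant digits; the small remaining gap is what single-precision arithmetic in the training loop would be expected to produce.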




12/10/2020 14:54:31 - INFO - root -   Running evaluation
12/10/2020 14:54:31 - INFO - root -   Num examples = 99
12/10/2020 14:54:31 - INFO - root -   Validation Batch size = 8


12/10/2020 14:54:33 - INFO - root -   eval_loss after step 100: 0.21115561173512384: 
12/10/2020 14:54:33 - INFO - root -   eval_perplexity after step 100: 1.2351045608520508: 
12/10/2020 14:54:33 - INFO - root -   lr after step 100: 9.957578053604837e-05
12/10/2020 14:54:33 - INFO - root -   train_loss after step 100: 2.3340562272071836
12/10/2020 14:54:49 - INFO - root -   Running evaluation
12/10/2020 14:54:49 - INFO - root -   Num examples = 99
12/10/2020 14:54:49 - INFO - root -   Validation Batch size = 8


12/10/2020 14:54:51 - INFO - root -   eval_loss after step 150: 0.21081974529303038: 
12/10/2020 14:54:51 - INFO - root -   eval_perplexity after step 150: 1.2346898317337036: 
12/10/2020 14:54:51 - INFO - root -   lr after step 150: 9.904719502473634e-05
12/10/2020 14:54:51 - INFO - root -   train_loss after step 150: 2.309780428409576
12/10/2020 14:55:07 - INFO - root -   Running evaluation
12/10/2020 14:55:07 - INFO - root -   Num examples = 99
12/10/2020 14:55:07 - INFO - root -   Validation Batch size = 8


12/10/2020 14:55:09 - INFO - root -   eval_loss after step 200: 0.199887589766429: 
12/10/2020 14:55:09 - INFO - root -   eval_perplexity after step 200: 1.221265435218811: 
12/10/2020 14:55:09 - INFO - root -   lr after step 200: 9.831032063033726e-05
12/10/2020 14:55:09 - INFO - root -   train_loss after step 200: 2.1669001817703246
12/10/2020 14:55:22 - INFO - root -   Running evaluation
12/10/2020 14:55:22 - INFO - root -   Num examples = 99
12/10/2020 14:55:22 - INFO - root -   Validation Batch size = 8


12/10/2020 14:55:24 - INFO - root -   eval_loss after epoch 1: 0.21822101565507743: 
12/10/2020 14:55:24 - INFO - root -   eval_perplexity after epoch 1: 1.2438619136810303: 
12/10/2020 14:55:24 - INFO - root -   lr after epoch 1: 9.755282581475769e-05
12/10/2020 14:55:24 - INFO - root -   train_loss after epoch 1: 2.3389858165717223
12/10/2020 14:55:24 - INFO - root -   
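
The learning rates logged so far match a plain cosine decay from the peak `lr=1e-4` to zero over the 2410 total steps, with no visible warmup phase. The function below is a sketch of that formula written from the logged numbers, not the library's scheduler code; `cosine_lr` is a name chosen here.

```python
import math


def cosine_lr(step, total_steps=2410, peak_lr=1e-4):
    """Cosine decay from peak_lr at step 0 down to 0 at total_steps."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


lr_step_50 = cosine_lr(50)     # logged after step 50: 9.989383241845838e-05
lr_epoch_1 = cosine_lr(241)    # logged after epoch 1: 9.755282581475769e-05
print(lr_step_50, lr_epoch_1)
```

Under this schedule the rate reaches half its peak at the midpoint of training (step 1205) and approaches zero by the final step.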

12/10/2020 14:55:27 - INFO - root -   Running evaluation
12/10/2020 14:55:27 - INFO - root -   Num examples = 99
12/10/2020 14:55:27 - INFO - root -   Validation Batch size = 8


12/10/2020 14:55:28 - INFO - root -   eval_loss after step 250: 0.20181022125941056: 
12/10/2020 14:55:28 - INFO - root -   eval_perplexity after step 250: 1.2236157655715942: 
12/10/2020 14:55:28 - INFO - root -   lr after step 250: 9.736828663974527e-05
12/10/2020 14:55:28 - INFO - root -   train_loss after step 250: 2.0521333241462707
12/10/2020 14:55:44 - INFO - root -   Running evaluation
12/10/2020 14:55:44 - INFO - root -   Num examples = 99
12/10/2020 14:55:44 - INFO - root -   Validation Batch size = 8


12/10/2020 14:55:46 - INFO - root -   eval_loss after step 300: 0.19090593778170073: 
12/10/2020 14:55:46 - INFO - root -   eval_perplexity after step 300: 1.2103456258773804: 
12/10/2020 14:55:46 - INFO - root -   lr after step 300: 9.62250935917808e-05
12/10/2020 14:55:46 - INFO - root -   train_loss after step 300: 1.9913123011589051
12/10/2020 14:56:02 - INFO - root -   Running evaluation
12/10/2020 14:56:02 - INFO - root -   Num examples = 99
12/10/2020 14:56:02 - INFO - root -   Validation Batch size = 8


12/10/2020 14:56:04 - INFO - root -   eval_loss after step 350: 0.17624248564243317: 
12/10/2020 14:56:04 - INFO - root -   eval_perplexity after step 350: 1.1927272081375122: 
12/10/2020 14:56:04 - INFO - root -   lr after step 350: 9.488559628808939e-05
12/10/2020 14:56:04 - INFO - root -   train_loss after step 350: 1.961831169128418
12/10/2020 14:56:20 - INFO - root -   Running evaluation
12/10/2020 14:56:20 - INFO - root -   Num examples = 99
12/10/2020 14:56:20 - INFO - root -   Validation Batch size = 8


12/10/2020 14:56:22 - INFO - root -   eval_loss after step 400: 0.17745191088089576: 
12/10/2020 14:56:22 - INFO - root -   eval_perplexity after step 400: 1.194170594215393: 
12/10/2020 14:56:22 - INFO - root -   lr after step 400: 9.335548317623957e-05
12/10/2020 14:56:22 - INFO - root -   train_loss after step 400: 1.9115200734138489
12/10/2020 14:56:38 - INFO - root -   Running evaluation
12/10/2020 14:56:38 - INFO - root -   Num examples = 99
12/10/2020 14:56:38 - INFO - root -   Validation Batch size = 8


12/10/2020 14:56:40 - INFO - root -   eval_loss after step 450: 0.1513343109534337: 
12/10/2020 14:56:40 - INFO - root -   eval_perplexity after step 450: 1.1633855104446411: 
12/10/2020 14:56:40 - INFO - root -   lr after step 450: 9.164125219257418e-05
12/10/2020 14:56:40 - INFO - root -   train_loss after step 450: 1.9525598120689391
12/10/2020 14:56:50 - INFO - root -   Running evaluation
12/10/2020 14:56:50 - INFO - root -   Num examples = 99
12/10/2020 14:56:50 - INFO - root -   Validation Batch size = 8


12/10/2020 14:56:52 - INFO - root -   eval_loss after epoch 2: 0.1481938877931008: 
12/10/2020 14:56:52 - INFO - root -   eval_perplexity after epoch 2: 1.1597377061843872: 
12/10/2020 14:56:52 - INFO - root -   lr after epoch 2: 9.045084971874738e-05
12/10/2020 14:56:52 - INFO - root -   train_loss after epoch 2: 1.9443070933037279
12/10/2020 14:56:52 - INFO - root -   

12/10/2020 14:56:58 - INFO - root -   Running evaluation
12/10/2020 14:56:58 - INFO - root -   Num examples = 99
12/10/2020 14:56:58 - INFO - root -   Validation Batch size = 8


12/10/2020 14:56:59 - INFO - root -   eval_loss after step 500: 0.16651911231187674: 
12/10/2020 14:56:59 - INFO - root -   eval_perplexity after step 500: 1.1811860799789429: 
12/10/2020 14:56:59 - INFO - root -   lr after step 500: 8.975018316740278e-05
12/10/2020 14:56:59 - INFO - root -   train_loss after step 500: 1.8591113138198851
12/10/2020 14:57:16 - INFO - root -   Running evaluation
12/10/2020 14:57:16 - INFO - root -   Num examples = 99
12/10/2020 14:57:16 - INFO - root -   Validation Batch size = 8


12/10/2020 14:57:17 - INFO - root -   eval_loss after step 550: 0.17550056485029367: 
12/10/2020 14:57:17 - INFO - root -   eval_perplexity after step 550: 1.1918426752090454: 
12/10/2020 14:57:17 - INFO - root -   lr after step 550: 8.769030690972262e-05
12/10/2020 14:57:17 - INFO - root -   train_loss after step 550: 1.8182248163223267
12/10/2020 14:57:34 - INFO - root -   Running evaluation
12/10/2020 14:57:34 - INFO - root -   Num examples = 99
12/10/2020 14:57:34 - INFO - root -   Validation Batch size = 8


12/10/2020 14:57:35 - INFO - root -   eval_loss after step 600: 0.1876297088769766: 
12/10/2020 14:57:35 - INFO - root -   eval_perplexity after step 600: 1.206386685371399: 
12/10/2020 14:57:35 - INFO - root -   lr after step 600: 8.547037110275579e-05
12/10/2020 14:57:35 - INFO - root -   train_loss after step 600: 1.7828229904174804
12/10/2020 14:57:52 - INFO - root -   Running evaluation
12/10/2020 14:57:52 - INFO - root -   Num examples = 99
12/10/2020 14:57:52 - INFO - root -   Validation Batch size = 8


12/10/2020 14:57:53 - INFO - root -   eval_loss after step 650: 0.1637064000734916: 
12/10/2020 14:57:53 - INFO - root -   eval_perplexity after step 650: 1.1778684854507446: 
12/10/2020 14:57:53 - INFO - root -   lr after step 650: 8.309980315513444e-05
12/10/2020 14:57:53 - INFO - root -   train_loss after step 650: 1.7412176465988158
12/10/2020 14:58:10 - INFO - root -   Running evaluation
12/10/2020 14:58:10 - INFO - root -   Num examples = 99
12/10/2020 14:58:10 - INFO - root -   Validation Batch size = 8


12/10/2020 14:58:11 - INFO - root -   eval_loss after step 700: 0.16880887632186597: 
12/10/2020 14:58:11 - INFO - root -   eval_perplexity after step 700: 1.1838937997817993: 
12/10/2020 14:58:11 - INFO - root -   lr after step 700: 8.058867016549372e-05
12/10/2020 14:58:11 - INFO - root -   train_loss after step 700: 1.7518587851524352
12/10/2020 14:58:19 - INFO - root -   Running evaluation
12/10/2020 14:58:19 - INFO - root -   Num examples = 99
12/10/2020 14:58:19 - INFO - root -   Validation Batch size = 8


12/10/2020 14:58:20 - INFO - root -   eval_loss after epoch 3: 0.16114349090136015: 
12/10/2020 14:58:20 - INFO - root -   eval_perplexity after epoch 3: 1.1748535633087158: 
12/10/2020 14:58:20 - INFO - root -   lr after epoch 3: 7.938926261462366e-05
12/10/2020 14:58:20 - INFO - root -   train_loss after epoch 3: 1.7688201685664071
12/10/2020 14:58:20 - INFO - root -   

12/10/2020 14:58:29 - INFO - root -   Running evaluation
12/10/2020 14:58:29 - INFO - root -   Num examples = 99
12/10/2020 14:58:29 - INFO - root -   Validation Batch size = 8


12/10/2020 14:58:31 - INFO - root -   eval_loss after step 750: 0.17748419825847334: 
12/10/2020 14:58:31 - INFO - root -   eval_perplexity after step 750: 1.1942092180252075: 
12/10/2020 14:58:31 - INFO - root -   lr after step 750: 7.794763617049124e-05
12/10/2020 14:58:31 - INFO - root -   train_loss after step 750: 1.6648080801963807
12/10/2020 14:58:47 - INFO - root -   Running evaluation
12/10/2020 14:58:47 - INFO - root -   Num examples = 99
12/10/2020 14:58:47 - INFO - root -   Validation Batch size = 8


12/10/2020 14:58:49 - INFO - root -   eval_loss after step 800: 0.1689751950594095: 
12/10/2020 14:58:49 - INFO - root -   eval_perplexity after step 800: 1.1840907335281372: 
12/10/2020 14:58:49 - INFO - root -   lr after step 800: 7.518791685780768e-05
12/10/2020 14:58:49 - INFO - root -   train_loss after step 800: 1.7063796496391297
12/10/2020 14:59:05 - INFO - root -   Running evaluation
12/10/2020 14:59:05 - INFO - root -   Num examples = 99
12/10/2020 14:59:05 - INFO - root -   Validation Batch size = 8


12/10/2020 14:59:07 - INFO - root -   eval_loss after step 850: 0.18472264592464155: 
12/10/2020 14:59:07 - INFO - root -   eval_perplexity after step 850: 1.2028847932815552: 
12/10/2020 14:59:07 - INFO - root -   lr after step 850: 7.232123193644957e-05
12/10/2020 14:59:07 - INFO - root -   train_loss after step 850: 1.7173336195945739
12/10/2020 14:59:23 - INFO - root -   Running evaluation
12/10/2020 14:59:23 - INFO - root -   Num examples = 99
12/10/2020 14:59:23 - INFO - root -   Validation Batch size = 8


12/10/2020 14:59:25 - INFO - root -   eval_loss after step 900: 0.17355679548703706: 
12/10/2020 14:59:25 - INFO - root -   eval_perplexity after step 900: 1.189528226852417: 
12/10/2020 14:59:25 - INFO - root -   lr after step 900: 6.935975536662253e-05
12/10/2020 14:59:25 - INFO - root -   train_loss after step 900: 1.6696666765213013
12/10/2020 14:59:42 - INFO - root -   Running evaluation
12/10/2020 14:59:42 - INFO - root -   Num examples = 99
12/10/2020 14:59:42 - INFO - root -   Validation Batch size = 8


12/10/2020 14:59:43 - INFO - root -   eval_loss after step 950: 0.16008633719040796: 
12/10/2020 14:59:43 - INFO - root -   eval_perplexity after step 950: 1.1736122369766235: 
12/10/2020 14:59:43 - INFO - root -   lr after step 950: 6.631606366053506e-05
12/10/2020 14:59:43 - INFO - root -   train_loss after step 950: 1.6666693186759949
12/10/2020 14:59:48 - INFO - root -   Running evaluation
12/10/2020 14:59:48 - INFO - root -   Num examples = 99
12/10/2020 14:59:48 - INFO - root -   Validation Batch size = 8


12/10/2020 14:59:49 - INFO - root -   eval_loss after epoch 4: 0.17239553653276884: 
12/10/2020 14:59:49 - INFO - root -   eval_perplexity after epoch 4: 1.1881476640701294: 
12/10/2020 14:59:49 - INFO - root -   lr after epoch 4: 6.545084971874738e-05
12/10/2020 14:59:49 - INFO - root -   train_loss after epoch 4: 1.6890901643705565
12/10/2020 14:59:49 - INFO - root -   

12/10/2020 15:00:01 - INFO - root -   Running evaluation
12/10/2020 15:00:01 - INFO - root -   Num examples = 99
12/10/2020 15:00:01 - INFO - root -   Validation Batch size = 8


12/10/2020 15:00:03 - INFO - root -   eval_loss after step 1000: 0.18290485785557672: 
12/10/2020 15:00:03 - INFO - root -   eval_perplexity after step 1000: 1.2007001638412476: 
12/10/2020 15:00:03 - INFO - root -   lr after step 1000: 6.320308247368286e-05
12/10/2020 15:00:03 - INFO - root -   train_loss after step 1000: 1.5859255409240722
12/10/2020 15:00:19 - INFO - root -   Running evaluation
12/10/2020 15:00:19 - INFO - root -   Num examples = 99
12/10/2020 15:00:19 - INFO - root -   Validation Batch size = 8


12/10/2020 15:00:21 - INFO - root -   eval_loss after step 1050: 0.17946990407430208: 
12/10/2020 15:00:21 - INFO - root -   eval_perplexity after step 1050: 1.1965829133987427: 
12/10/2020 15:00:21 - INFO - root -   lr after step 1050: 6.003403171342563e-05
12/10/2020 15:00:21 - INFO - root -   train_loss after step 1050: 1.611977367401123
12/10/2020 15:00:37 - INFO - root -   Running evaluation
12/10/2020 15:00:37 - INFO - root -   Num examples = 99
12/10/2020 15:00:37 - INFO - root -   Validation Batch size = 8


12/10/2020 15:00:39 - INFO - root -   eval_loss after step 1100: 0.17629954906610343: 
12/10/2020 15:00:39 - INFO - root -   eval_perplexity after step 1100: 1.1927952766418457: 
12/10/2020 15:00:39 - INFO - root -   lr after step 1100: 5.682236939796337e-05
12/10/2020 15:00:39 - INFO - root -   train_loss after step 1100: 1.6183380424976348
12/10/2020 15:00:56 - INFO - root -   Running evaluation
12/10/2020 15:00:56 - INFO - root -   Num examples = 99
12/10/2020 15:00:56 - INFO - root -   Validation Batch size = 8


12/10/2020 15:00:57 - INFO - root -   eval_loss after step 1150: 0.1856249410372514: 
12/10/2020 15:00:57 - INFO - root -   eval_perplexity after step 1150: 1.2039706707000732: 
12/10/2020 15:00:57 - INFO - root -   lr after step 1150: 5.3581734504126494e-05
12/10/2020 15:00:57 - INFO - root -   train_loss after step 1150: 1.5558024454116821
12/10/2020 15:01:14 - INFO - root -   Running evaluation
12/10/2020 15:01:14 - INFO - root -   Num examples = 99
12/10/2020 15:01:14 - INFO - root -   Validation Batch size = 8


12/10/2020 15:01:16 - INFO - root -   eval_loss after step 1200: 0.18070483666199905: 
12/10/2020 15:01:16 - INFO - root -   eval_perplexity after step 1200: 1.198061466217041: 
12/10/2020 15:01:16 - INFO - root -   lr after step 1200: 5.032588904668851e-05
12/10/2020 15:01:16 - INFO - root -   train_loss after step 1200: 1.5782347190380097
12/10/2020 15:01:17 - INFO - root -   Running evaluation
12/10/2020 15:01:17 - INFO - root -   Num examples = 99
12/10/2020 15:01:17 - INFO - root -   Validation Batch size = 8


12/10/2020 15:01:19 - INFO - root -   eval_loss after epoch 5: 0.1828500651396238: 
12/10/2020 15:01:19 - INFO - root -   eval_perplexity after epoch 5: 1.2006343603134155: 
12/10/2020 15:01:19 - INFO - root -   lr after epoch 5: 5e-05
12/10/2020 15:01:19 - INFO - root -   train_loss after epoch 5: 1.5794877852641696
12/10/2020 15:01:19 - INFO - root -   

12/10/2020 15:01:34 - INFO - root -   Running evaluation
12/10/2020 15:01:34 - INFO - root -   Num examples = 99
12/10/2020 15:01:34 - INFO - root -   Validation Batch size = 8


12/10/2020 15:01:35 - INFO - root -   eval_loss after step 1250: 0.1820717637355511: 
12/10/2020 15:01:35 - INFO - root -   eval_perplexity after step 1250: 1.1997002363204956: 
12/10/2020 15:01:35 - INFO - root -   lr after step 1250: 4.7068659635173026e-05
12/10/2020 15:01:35 - INFO - root -   train_loss after step 1250: 1.5057056283950805
12/10/2020 15:01:52 - INFO - root -   Running evaluation
12/10/2020 15:01:52 - INFO - root -   Num examples = 99
12/10/2020 15:01:52 - INFO - root -   Validation Batch size = 8


12/10/2020 15:01:53 - INFO - root -   eval_loss after step 1300: 0.17298371631365556: 
12/10/2020 15:01:53 - INFO - root -   eval_perplexity after step 1300: 1.1888467073440552: 
12/10/2020 15:01:53 - INFO - root -   lr after step 1300: 4.382387875634591e-05
12/10/2020 15:01:53 - INFO - root -   train_loss after step 1300: 1.4909101474285125
12/10/2020 15:02:10 - INFO - root -   Running evaluation
12/10/2020 15:02:10 - INFO - root -   Num examples = 99
12/10/2020 15:02:10 - INFO - root -   Validation Batch size = 8


12/10/2020 15:02:11 - INFO - root -   eval_loss after step 1350: 0.17381364221756274: 
12/10/2020 15:02:11 - INFO - root -   eval_perplexity after step 1350: 1.1898337602615356: 
12/10/2020 15:02:11 - INFO - root -   lr after step 1350: 4.0605326031748645e-05
12/10/2020 15:02:11 - INFO - root -   train_loss after step 1350: 1.4846127665042876
12/10/2020 15:02:28 - INFO - root -   Running evaluation
12/10/2020 15:02:28 - INFO - root -   Num examples = 99
12/10/2020 15:02:28 - INFO - root -   Validation Batch size = 8


12/10/2020 15:02:30 - INFO - root -   eval_loss after step 1400: 0.17594042305762952: 
12/10/2020 15:02:30 - INFO - root -   eval_perplexity after step 1400: 1.1923670768737793: 
12/10/2020 15:02:30 - INFO - root -   lr after step 1400: 3.742666969973463e-05
12/10/2020 15:02:30 - INFO - root -   train_loss after step 1400: 1.422969514131546
12/10/2020 15:02:45 - INFO - root -   Running evaluation
12/10/2020 15:02:45 - INFO - root -   Num examples = 99
12/10/2020 15:02:45 - INFO - root -   Validation Batch size = 8


12/10/2020 15:02:46 - INFO - root -   eval_loss after epoch 6: 0.17240192339970514: 
12/10/2020 15:02:46 - INFO - root -   eval_perplexity after epoch 6: 1.1881552934646606: 
12/10/2020 15:02:46 - INFO - root -   lr after epoch 6: 3.4549150281252636e-05
12/10/2020 15:02:46 - INFO - root -   train_loss after epoch 6: 1.4835398313415495
12/10/2020 15:02:46 - INFO - root -   

12/10/2020 15:02:48 - INFO - root -   Running evaluation
12/10/2020 15:02:48 - INFO - root -   Num examples = 99
12/10/2020 15:02:48 - INFO - root -   Validation Batch size = 8


12/10/2020 15:02:49 - INFO - root -   eval_loss after step 1450: 0.1734319799221479: 
12/10/2020 15:02:49 - INFO - root -   eval_perplexity after step 1450: 1.1893798112869263: 
12/10/2020 15:02:49 - INFO - root -   lr after step 1450: 3.430140857051675e-05
12/10/2020 15:02:49 - INFO - root -   train_loss after step 1450: 1.503297140598297
12/10/2020 15:03:06 - INFO - root -   Running evaluation
12/10/2020 15:03:06 - INFO - root -   Num examples = 99
12/10/2020 15:03:06 - INFO - root -   Validation Batch size = 8


12/10/2020 15:03:07 - INFO - root -   eval_loss after step 1500: 0.16765423806814048: 
12/10/2020 15:03:07 - INFO - root -   eval_perplexity after step 1500: 1.1825276613235474: 
12/10/2020 15:03:07 - INFO - root -   lr after step 1500: 3.124281470072597e-05
12/10/2020 15:03:07 - INFO - root -   train_loss after step 1500: 1.4533905601501464
12/10/2020 15:03:24 - INFO - root -   Running evaluation
12/10/2020 15:03:24 - INFO - root -   Num examples = 99
12/10/2020 15:03:24 - INFO - root -   Validation Batch size = 8


12/10/2020 15:03:25 - INFO - root -   eval_loss after step 1550: 0.17829671502113342: 
12/10/2020 15:03:25 - INFO - root -   eval_perplexity after step 1550: 1.1951799392700195: 
12/10/2020 15:03:25 - INFO - root -   lr after step 1550: 2.8263877030925277e-05
12/10/2020 15:03:25 - INFO - root -   train_loss after step 1550: 1.3928670382499695
12/10/2020 15:03:42 - INFO - root -   Running evaluation
12/10/2020 15:03:42 - INFO - root -   Num examples = 99
12/10/2020 15:03:42 - INFO - root -   Validation Batch size = 8


12/10/2020 15:03:43 - INFO - root -   eval_loss after step 1600: 0.17693207126397353: 
12/10/2020 15:03:43 - INFO - root -   eval_perplexity after step 1600: 1.1935499906539917: 
12/10/2020 15:03:43 - INFO - root -   lr after step 1600: 2.5377246225433303e-05
12/10/2020 15:03:43 - INFO - root -   train_loss after step 1600: 1.4360487461090088
12/10/2020 15:04:00 - INFO - root -   Running evaluation
12/10/2020 15:04:00 - INFO - root -   Num examples = 99
12/10/2020 15:04:00 - INFO - root -   Validation Batch size = 8


12/10/2020 15:04:01 - INFO - root -   eval_loss after step 1650: 0.15615930465551522: 
12/10/2020 15:04:01 - INFO - root -   eval_perplexity after step 1650: 1.169012427330017: 
12/10/2020 15:04:01 - INFO - root -   lr after step 1650: 2.259518094870693e-05
12/10/2020 15:04:01 - INFO - root -   train_loss after step 1650: 1.3898078250885009
12/10/2020 15:04:13 - INFO - root -   Running evaluation
12/10/2020 15:04:13 - INFO - root -   Num examples = 99
12/10/2020 15:04:13 - INFO - root -   Validation Batch size = 8


12/10/2020 15:04:15 - INFO - root -   eval_loss after epoch 7: 0.15136038683927977: 
12/10/2020 15:04:15 - INFO - root -   eval_perplexity after epoch 7: 1.1634159088134766: 
12/10/2020 15:04:15 - INFO - root -   lr after epoch 7: 2.061073738537635e-05
12/10/2020 15:04:15 - INFO - root -   train_loss after epoch 7: 1.423371732482277
12/10/2020 15:04:15 - INFO - root -   

12/10/2020 15:04:19 - INFO - root -   Running evaluation
12/10/2020 15:04:19 - INFO - root -   Num examples = 99
12/10/2020 15:04:19 - INFO - root -   Validation Batch size = 8


12/10/2020 15:04:21 - INFO - root -   eval_loss after step 1700: 0.1564323638494198: 
12/10/2020 15:04:21 - INFO - root -   eval_perplexity after step 1700: 1.169331669807434: 
12/10/2020 15:04:21 - INFO - root -   lr after step 1700: 1.9929495806431025e-05
12/10/2020 15:04:21 - INFO - root -   train_loss after step 1700: 1.4191709518432618
12/10/2020 15:04:38 - INFO - root -   Running evaluation
12/10/2020 15:04:38 - INFO - root -   Num examples = 99
12/10/2020 15:04:38 - INFO - root -   Validation Batch size = 8


12/10/2020 15:04:39 - INFO - root -   eval_loss after step 1750: 0.14960667147086218: 
12/10/2020 15:04:39 - INFO - root -   eval_perplexity after step 1750: 1.1613773107528687: 
12/10/2020 15:04:39 - INFO - root -   lr after step 1750: 1.739151117239385e-05
12/10/2020 15:04:39 - INFO - root -   train_loss after step 1750: 1.4018377029895783
12/10/2020 15:04:56 - INFO - root -   Running evaluation
12/10/2020 15:04:56 - INFO - root -   Num examples = 99
12/10/2020 15:04:56 - INFO - root -   Validation Batch size = 8


12/10/2020 15:04:57 - INFO - root -   eval_loss after step 1800: 0.16188797125449547: 
12/10/2020 15:04:57 - INFO - root -   eval_perplexity after step 1800: 1.1757285594940186: 
12/10/2020 15:04:57 - INFO - root -   lr after step 1800: 1.4992005114218805e-05
12/10/2020 15:04:57 - INFO - root -   train_loss after step 1800: 1.42687264919281
12/10/2020 15:05:14 - INFO - root -   Running evaluation
12/10/2020 15:05:14 - INFO - root -   Num examples = 99
12/10/2020 15:05:14 - INFO - root -   Validation Batch size = 8


12/10/2020 15:05:15 - INFO - root -   eval_loss after step 1850: 0.15195928972501022: 
12/10/2020 15:05:15 - INFO - root -   eval_perplexity after step 1850: 1.1641128063201904: 
12/10/2020 15:05:15 - INFO - root -   lr after step 1850: 1.2741167622109556e-05
12/10/2020 15:05:15 - INFO - root -   train_loss after step 1850: 1.445525529384613
12/10/2020 15:05:32 - INFO - root -   Running evaluation
12/10/2020 15:05:32 - INFO - root -   Num examples = 99
12/10/2020 15:05:32 - INFO - root -   Validation Batch size = 8


12/10/2020 15:05:34 - INFO - root -   eval_loss after step 1900: 0.15428919746325567: 
12/10/2020 15:05:34 - INFO - root -   eval_perplexity after step 1900: 1.1668282747268677: 
12/10/2020 15:05:34 - INFO - root -   lr after step 1900: 1.0648557334985309e-05
12/10/2020 15:05:34 - INFO - root -   train_loss after step 1900: 1.3089814209938049
12/10/2020 15:05:43 - INFO - root -   Running evaluation
12/10/2020 15:05:43 - INFO - root -   Num examples = 99
12/10/2020 15:05:43 - INFO - root -   Validation Batch size = 8


12/10/2020 15:05:44 - INFO - root -   eval_loss after epoch 8: 0.15421479997726587: 
12/10/2020 15:05:44 - INFO - root -   eval_perplexity after epoch 8: 1.1667414903640747: 
12/10/2020 15:05:44 - INFO - root -   lr after epoch 8: 9.549150281252633e-06
12/10/2020 15:05:44 - INFO - root -   train_loss after epoch 8: 1.3772374700708507
12/10/2020 15:05:44 - INFO - root -   

12/10/2020 15:05:51 - INFO - root -   Running evaluation
12/10/2020 15:05:51 - INFO - root -   Num examples = 99
12/10/2020 15:05:51 - INFO - root -   Validation Batch size = 8


12/10/2020 15:05:53 - INFO - root -   eval_loss after step 1950: 0.14776954513329726: 
12/10/2020 15:05:53 - INFO - root -   eval_perplexity after step 1950: 1.1592457294464111: 
12/10/2020 15:05:53 - INFO - root -   lr after step 1950: 8.723060947777777e-06
12/10/2020 15:05:53 - INFO - root -   train_loss after step 1950: 1.353246694803238
12/10/2020 15:06:10 - INFO - root -   Running evaluation
12/10/2020 15:06:10 - INFO - root -   Num examples = 99
12/10/2020 15:06:10 - INFO - root -   Validation Batch size = 8


12/10/2020 15:06:11 - INFO - root -   eval_loss after step 2000: 0.14849575608968735: 
12/10/2020 15:06:11 - INFO - root -   eval_perplexity after step 2000: 1.1600878238677979: 
12/10/2020 15:06:11 - INFO - root -   lr after step 2000: 6.972855472274853e-06
12/10/2020 15:06:11 - INFO - root -   train_loss after step 2000: 1.3674042105674744
12/10/2020 15:06:28 - INFO - root -   Running evaluation
12/10/2020 15:06:28 - INFO - root -   Num examples = 99
12/10/2020 15:06:28 - INFO - root -   Validation Batch size = 8


12/10/2020 15:06:29 - INFO - root -   eval_loss after step 2050: 0.14699742541863367: 
12/10/2020 15:06:29 - INFO - root -   eval_perplexity after step 2050: 1.158350944519043: 
12/10/2020 15:06:29 - INFO - root -   lr after step 2050: 5.405373511777934e-06
12/10/2020 15:06:29 - INFO - root -   train_loss after step 2050: 1.351279706954956
12/10/2020 15:06:46 - INFO - root -   Running evaluation
12/10/2020 15:06:46 - INFO - root -   Num examples = 99
12/10/2020 15:06:46 - INFO - root -   Validation Batch size = 8


12/10/2020 15:06:48 - INFO - root -   eval_loss after step 2100: 0.14764318214013025: 
12/10/2020 15:06:48 - INFO - root -   eval_perplexity after step 2100: 1.1590992212295532: 
12/10/2020 15:06:48 - INFO - root -   lr after step 2100: 4.027271697041252e-06
12/10/2020 15:06:48 - INFO - root -   train_loss after step 2100: 1.3563277101516724
12/10/2020 15:07:04 - INFO - root -   Running evaluation
12/10/2020 15:07:04 - INFO - root -   Num examples = 99
12/10/2020 15:07:04 - INFO - root -   Validation Batch size = 8


12/10/2020 15:07:06 - INFO - root -   eval_loss after step 2150: 0.14931805661091438: 
12/10/2020 15:07:06 - INFO - root -   eval_perplexity after step 2150: 1.1610422134399414: 
12/10/2020 15:07:06 - INFO - root -   lr after step 2150: 2.844402417536374e-06
12/10/2020 15:07:06 - INFO - root -   train_loss after step 2150: 1.345567079782486
12/10/2020 15:07:12 - INFO - root -   Running evaluation
12/10/2020 15:07:12 - INFO - root -   Num examples = 99
12/10/2020 15:07:12 - INFO - root -   Validation Batch size = 8


12/10/2020 15:07:13 - INFO - root -   eval_loss after epoch 9: 0.1482222005724907: 
12/10/2020 15:07:13 - INFO - root -   eval_perplexity after epoch 9: 1.1597706079483032: 
12/10/2020 15:07:13 - INFO - root -   lr after epoch 9: 2.4471741852423237e-06
12/10/2020 15:07:13 - INFO - root -   train_loss after epoch 9: 1.3617282262481594
12/10/2020 15:07:13 - INFO - root -   

12/10/2020 15:07:24 - INFO - root -   Running evaluation
12/10/2020 15:07:24 - INFO - root -   Num examples = 99
12/10/2020 15:07:24 - INFO - root -   Validation Batch size = 8


12/10/2020 15:07:25 - INFO - root -   eval_loss after step 2200: 0.14868027086441332: 
12/10/2020 15:07:25 - INFO - root -   eval_perplexity after step 2200: 1.160301923751831: 
12/10/2020 15:07:25 - INFO - root -   lr after step 2200: 1.861788968090683e-06
12/10/2020 15:07:25 - INFO - root -   train_loss after step 2200: 1.327067906856537
12/10/2020 15:07:42 - INFO - root -   Running evaluation
12/10/2020 15:07:42 - INFO - root -   Num examples = 99
12/10/2020 15:07:42 - INFO - root -   Validation Batch size = 8


12/10/2020 15:07:43 - INFO - root -   eval_loss after step 2250: 0.14754786170445955: 
12/10/2020 15:07:43 - INFO - root -   eval_perplexity after step 2250: 1.1589887142181396: 
12/10/2020 15:07:43 - INFO - root -   lr after step 2250: 1.0836042164448945e-06
12/10/2020 15:07:43 - INFO - root -   train_loss after step 2250: 1.311480085849762
12/10/2020 15:08:00 - INFO - root -   Running evaluation
12/10/2020 15:08:00 - INFO - root -   Num examples = 99
12/10/2020 15:08:00 - INFO - root -   Validation Batch size = 8


12/10/2020 15:08:01 - INFO - root -   eval_loss after step 2300: 0.1509187542475187: 
12/10/2020 15:08:01 - INFO - root -   eval_perplexity after step 2300: 1.1629021167755127: 
12/10/2020 15:08:01 - INFO - root -   lr after step 2300: 5.131528823220099e-07
12/10/2020 15:08:01 - INFO - root -   train_loss after step 2300: 1.4004107570648194
12/10/2020 15:08:18 - INFO - root -   Running evaluation
12/10/2020 15:08:18 - INFO - root -   Num examples = 99
12/10/2020 15:08:18 - INFO - root -   Validation Batch size = 8


12/10/2020 15:08:20 - INFO - root -   eval_loss after step 2350: 0.14751078073795026: 
12/10/2020 15:08:20 - INFO - root -   eval_perplexity after step 2350: 1.1589457988739014: 
12/10/2020 15:08:20 - INFO - root -   lr after step 2350: 1.5285750326325954e-07
12/10/2020 15:08:20 - INFO - root -   train_loss after step 2350: 1.3645135688781738
12/10/2020 15:08:36 - INFO - root -   Running evaluation
12/10/2020 15:08:36 - INFO - root -   Num examples = 99
12/10/2020 15:08:36 - INFO - root -   Validation Batch size = 8


12/10/2020 15:08:38 - INFO - root -   eval_loss after step 2400: 0.14972286614087912: 
12/10/2020 15:08:38 - INFO - root -   eval_perplexity after step 2400: 1.1615122556686401: 
12/10/2020 15:08:38 - INFO - root -   lr after step 2400: 4.248146830060362e-09
12/10/2020 15:08:38 - INFO - root -   train_loss after step 2400: 1.3905822920799256
12/10/2020 15:08:41 - INFO - root -   Running evaluation
12/10/2020 15:08:41 - INFO - root -   Num examples = 99
12/10/2020 15:08:41 - INFO - root -   Validation Batch size = 8


12/10/2020 15:08:42 - INFO - root -   eval_loss after epoch 10: 0.14774326120431608: 
12/10/2020 15:08:42 - INFO - root -   eval_perplexity after epoch 10: 1.1592152118682861: 
12/10/2020 15:08:42 - INFO - root -   lr after epoch 10: 0.0
12/10/2020 15:08:42 - INFO - root -   train_loss after epoch 10: 1.363910719319498
12/10/2020 15:08:42 - INFO - root -   



(2410, 1.6330479007538918)
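
The `lr` values in the log above look like a cosine decay over the 2410 optimization steps. The sketch below reproduces them under two assumptions that are inferred from the logged numbers rather than read from the training call: a peak learning rate of 1e-4 and no warmup. `cosine_lr` is a hypothetical helper written for illustration, not a fast-bert function.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Compare with the "lr after step ..." lines in the log (base_lr=1e-4 is assumed).
for step in (1950, 2400, 2410):
    print(step, cosine_lr(step, total_steps=2410, base_lr=1e-4))
```

With these assumptions the helper gives roughly 8.72e-06 at step 1950 and 4.25e-09 at step 2400, close to the logged values, and reaches zero at the final step.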

In [18]:
lm_learner.validate()

12/10/2020 15:22:29 - INFO - root -   Running evaluation
12/10/2020 15:22:29 - INFO - root -   Num examples = 99
12/10/2020 15:22:29 - INFO - root -   Validation Batch size = 8


{'loss': 0.14940655346100146, 'perplexity': 1.1611449718475342}
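
The two numbers returned by `validate()` are related: the perplexity is the exponential of the evaluation loss. A quick check with the values above (small differences in the last digits are expected because the learner computes in float32):

```python
import math

eval_loss = 0.14940655346100146  # 'loss' returned by lm_learner.validate()
perplexity = math.exp(eval_loss)
print(perplexity)  # compare with the 'perplexity' entry above
```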

In [19]:
lm_learner.save_model()

12/10/2020 15:23:16 - INFO - transformers.configuration_utils -   Configuration saved in model\model_out\config.json
12/10/2020 15:23:18 - INFO - transformers.modeling_utils -   Model weights saved in model\model_out\pytorch_model.bin


In [20]:
del lm_learner

### Creating the databunch for classification

In [21]:
databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='camembert-base',
                          train_file='train_set.csv',
                          val_file='val_set.csv',
                          label_file='labels.txt',
                          text_col='report_text',
                          label_col=['Bioagressor','Disease'],
                          batch_size_per_gpu=8,
                          max_seq_length=256,
                          multi_gpu=False,
                          multi_label=True,
                          model_type='camembert-base')

12/10/2020 15:23:46 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-config.json from cache at C:\Users\Raja/.cache\torch\transformers\5152a7b8b97da26abdad9b3babb600e77c52a002331ea52a9eaf96ea8b31ef8f.5bd7a9a60b9a2d311368226259eaf870cfb2248e0752f28b444ec112977cf8fc
12/10/2020 15:23:46 - INFO - transformers.configuration_utils -   Model config CamembertConfig {
  "architectures": [
    "CamembertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 5,
  "eos_token_id": 6,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "camembert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 32005
}

12/10

### Creating the Learner

In [22]:
metrics = [{'name': 'fbeta', 'function': fbeta}, {'name': 'roc_auc', 'function': roc_auc}]
OUTPUT_DIR = Path('./finetuned_model')
WGTS_PATH = Path('model/model_out/pytorch_model.bin')

In [23]:
# fast-bert pos_weight issue: downgrading to 1.9.1 solves the problem
cl_learner = BertLearner.from_pretrained_model(
                        databunch,
                        pretrained_path='model/model_out',
                        metrics=metrics,
                        device=device_cuda,
                        logger=logger,
                        output_dir=OUTPUT_DIR,
                        finetuned_wgts_path=WGTS_PATH,
                        warmup_steps=300,
                        multi_gpu=False,
                        multi_label=True,
                        is_fp16=False,  # set to True only when running on CUDA
                        logging_steps=50)

12/10/2020 15:24:20 - INFO - transformers.configuration_utils -   loading configuration file model/model_out\config.json
12/10/2020 15:24:20 - INFO - transformers.configuration_utils -   Model config CamembertConfig {
  "architectures": [
    "CamembertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 5,
  "eos_token_id": 6,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "camembert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 32005
}

12/10/2020 15:24:20 - INFO - transformers.modeling_utils -   loading weights file model/model_out\pytorch_model.bin
- This IS expected if you are initializing CamembertForMultiLabelSequenceClassification from the checkpoint of a model train

In [24]:
cl_learner.fit(epochs=10,  # was 30
            lr=2e-5,
            validate=True,
            schedule_type="warmup_cosine",
            optimizer_type="adamw")
# TODO: why does this fail? (see the ValueError in the output below)

12/10/2020 15:24:46 - INFO - root -   ***** Running training *****
12/10/2020 15:24:46 - INFO - root -     Num examples = 322
12/10/2020 15:24:46 - INFO - root -     Num Epochs = 10
12/10/2020 15:24:46 - INFO - root -     Total train batch size (w. parallel, distributed & accumulation) = 8
12/10/2020 15:24:46 - INFO - root -     Gradient Accumulation steps = 1
12/10/2020 15:24:46 - INFO - root -     Total optimization steps = 410


12/10/2020 15:24:52 - INFO - root -   Running evaluation
12/10/2020 15:24:52 - INFO - root -     Num examples = 80
12/10/2020 15:24:52 - INFO - root -     Batch size = 16


ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
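
This message appears to be the one scikit-learn raises when an array passed to a metric contains NaN or infinite values, so either the labels or the predicted scores reaching the metric are not finite. One thing worth ruling out first is a missing or non-numeric label in the CSV files. The sketch below is only a diagnostic idea: `find_non_finite` is a hypothetical helper, and the sample rows are made up to stand in for `val_set.csv`.

```python
import csv
import io
import math

def find_non_finite(rows, columns):
    """Return (row_index, column, raw_value) for each missing or non-finite value."""
    problems = []
    for index, row in enumerate(rows):
        for column in columns:
            raw = row.get(column, "")
            try:
                value = float(raw)
            except (TypeError, ValueError):
                # Empty or non-numeric cell
                problems.append((index, column, raw))
                continue
            if not math.isfinite(value):
                problems.append((index, column, raw))
    return problems

# Toy data standing in for val_set.csv (made up for illustration).
sample_csv = io.StringIO(
    "report_text,Bioagressor,Disease\n"
    "texte a,1,0\n"
    "texte b,0,\n"
    "texte c,0,nan\n"
)
rows = list(csv.DictReader(sample_csv))
problems = find_non_finite(rows, ["Bioagressor", "Disease"])
print(problems)
```

If the label files turn out to be clean, the next suspect would be NaN values in the model outputs themselves.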

In [27]:
cl_learner.validate()

{'loss': 0.688255774974823, 'fbeta': 0.4375, 'roc_auc': 0.42315026697177727}
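
A loss of about 0.688 is very close to ln 2 (about 0.693). Assuming the multi-label loss is a mean binary cross-entropy, that is the value obtained by a model that outputs a probability of 0.5 for every label, which would fit with the ROC AUC sitting below 0.5. The helper `mean_bce` below is a hypothetical illustration of that arithmetic, not the library's loss function.

```python
import math

def mean_bce(probabilities, targets):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)

# A model that always answers 0.5, whatever the true label is.
chance_loss = mean_bce([0.5, 0.5, 0.5, 0.5], [0, 1, 1, 0])
print(chance_loss, math.log(2))
```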

In [28]:
cl_learner.save_model()

In [37]:
del cl_learner

### Predictions

In [30]:
predictor = BertClassificationPredictor(
                model_path='finetuned_model/model_out',
                label_path='labels/',
                multi_label=True,
                model_type='camembert-base',
                do_lower_case=False)

In [31]:
# case disease: 0, bioagressor: 1 - leafhopper (cicadelle)
predictor.predict("Cicadelles La cicadelle Edwardsiana est toujours observée sur les parcelles en été")



[('disease', 0.336669921875), ('bioagressor', 0.277587890625)]

In [32]:
predictor.predict("election américane")

[('disease', 0.33740234375), ('bioagressor', 0.280029296875)]

In [33]:
# case disease: 1, bioagressor: 0 - downy mildew (mildiou)
predictor.predict("vigilant : en particulier vis-à-vis du mildiou, de l’oïdium et de la bactériose / cladosporiose. Les prévisions pour les prochains jours restent peu favorables à l’expression de la bactériose et la cladosporiose, si elles se confirment. Mildiou (Pseudoperonospora cubensis) : Le modèle annonce un risque élevé pour toutes les dates de plantation avec les données de la station de Thurageau. Avec les données de la station de Maulay, les plantations en S25 et S26 montrent un risque modéré. Le risque est un peu plus élevé dans le sud de la Charente- Maritime que dans le Poitou. Niveau de risque Faible Moyen Élevé Très élevé Indice : Log (Nb de taches/unité de surface) -14 à -9 -9 à -4 -4 +4 Équivalent en unité de surface 1 tâche par hectare par 100 m2 1 tâche par 100 m2 par m2 1 tâche par m2à 1 % de surface atteinte 1 % à 100 % de surface atteinte Évaluation du risque : les conditions restent favorables à ce microorganisme (qui n’est pas un champignon, mais proche d’une algue). Les BSV sont disponibles en accès direct sur le site  (rubrique : Nos publications - Bulletin de santé du végétal) ou par abonnement en ligne gratuit sur le site  BSV CULTURES LÉGUMIÈRES DE PLEIN")

[('disease', 0.33935546875), ('bioagressor', 0.283935546875)]

In [34]:
# case disease: 0, bioagressor: 0 - text about Trump
predictor.predict("L'avance de Donald Trump dans cet Etat où 4,9 millions d'électeurs ont voté, a fondu vendredi 6 novembre. Le candidat républicain compte")

[('disease', 0.3359375), ('bioagressor', 0.27490234375)]

In [35]:
# case disease: 0, bioagressor: 0 - text about carrot fly but no risk
predictor.predict("mouche de la carotte :ajouter trichloronate")

[('disease', 0.33544921875), ('bioagressor', 0.280029296875)]

In [29]:
# case disease: 0, bioagressor: 0 - powdery mildew (oïdium): no intervention => not good
predictor.predict("oïdium : pas d'intervention dans l'immediat les conditions très chaudes et l'absence de rosées nocturnes sont défavorables à cette maladie. aucun symptôme n'a encore été observé.")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [30]:
# case disease: 1, bioagressor: 0 - powdery mildew (oïdium) => good
predictor.predict("oïdium du pommierce champignon est à l'heure actuelle en pleine fructification. il est nécessaire d'ajouter un antioïdium aux bouillies pour assurer'la protection du jeune feuillage.")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [31]:
# case disease: 0, bioagressor: 0 - downy mildew (mildiou): no intervention => not good
predictor.predict("mildiou : pas d'intervention dans l'immediat les conditions très chaudes et l'absence de rosées nocturnes sont défavorables à cette maladie. aucun symptôme n'a encore été observé.")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [32]:
# case disease: 0, bioagressor: 0 - aphids (pucerons): the damage threshold is not reached => missed
predictor.predict("pucerons les pucerons verts sont toujours présents dans certaines parcelles. on note également la présence de sitobion avenae, relativement moins fréquente, mais avec des colonies de plusieurs dizaines d'individus. le temps frais actuel n'est pas favorable aux prédateurs, mais on observe un nombre important de pucerons «momifiés» par la ponte de parasitoïdes. le seuil de nuisibilité n'est atteint dans aucune parcelle du réseau")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [33]:
# case disease: 0, bioagressor: 0 - no keyword
predictor.predict("par jour. Sur les autres secteurs, en production de carotte, les captures sont nulles. Évaluation du risque")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [34]:
# case disease: 0, bioagressor: 0 - contains the keyword "mildiou"
predictor.predict("7 Action pilotée par le ministère chargé de l’agriculture mildiou, avec l’appui financier de l’Office National de l’Eau et des Milieux Aquatiques (ONEMA), par les crédits issus de la redevance pour pollutions diffuses attribués au")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [35]:
# case disease: 0, bioagressor: 0
predictor.predict("financement du plan Ecophyto. Ce bulletin est rédigé par l'ACPEL avec la collaboration de référents par culture (techniciens des Chambres d'Agriculture de la Charente, de la Charente-Maritime et d'Indre et Loire et de la Vienne) sur la base d'observations réalisées par des producteurs et techniciens : Charentes- Alliance, les entreprises de production de melon, la coopérative AGROLEG, la coopérative UNIRE, des producteurs d'Agrobio Poitou-Charentes. Ce bulletin est réalisé à partir d'observations ponctuelles. Il a pour vocation de donner une tendance de la situation sanitaire régionale. Celle-ci ne peut être transposée telle quelle dans les parcelles de production légumières (conditions très variables). La Chambre Régionale d'Agriculture de Poitou-Charentes et le rédacteur dégagent toute responsabilité quant aux décisions prises par les producteurs pour la protection de leurs cultures. Elle les invite à prendre ces décisions sur la base des observation s qu'ils auront réalisées dans leurs parcelles. Les")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [36]:
# case disease: 0, bioagressor: 0 - French crop usage
predictor.predict("Ensemble de plantes cultivées pour leurs fruits ou leurs graines riches en matières grasses (lipides). De ces fruits et graines sont extrait une huile à usage alimentaire humaine, alimentaire animal ou industriel. Les résidus de l'extraction constituent les tourteaux utilisés pour l'alimentation animale.")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [37]:
# case disease: 0, bioagressor: 0 - French crop usage
predictor.predict("Orge semé après le 1er février, principalement de mars à mai. Orge de printemps est toujours à deux rangs")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [38]:
# case disease: 1, bioagressor: 0 - tweet - yellows (jaunisse)
predictor.predict("des parcelles qui ont eu une croissance presque nulle depuis cet été en majeur partie dû à la jaunisse.")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [39]:
# case disease: 1, bioagressor: 0 - tweet - drought (sécheresse)
predictor.predict("Parfois,on nous demande. Que faites-vous,par cette sécheresse?")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [40]:
# case disease: 0, bioagressor: 1 - tweet - aphids (pucerons)

predictor.predict("Attention pucerons dans les blés")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]

In [41]:
predictor.predict("Favorisée par les conditions climatiques,l'activité du Carpocapse est toujours intense et il est conseillé de renouveler la protection des fruits en effectuant un nouveau traitement dès réception de ce bulletin.")

[('disease', 0.1798095703125), ('bioagressor', 0.0885009765625)]
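
The scores barely move from one input to the next, and in the second run they are identical for every text, which suggests the classifier is not using the input at all. A rough way to quantify this is to look at the spread of each label's score across several predictions. `score_spread` is a hypothetical helper; the numbers are the outputs of cells `In [31]` to `In [35]` of the first run.

```python
def score_spread(predictions):
    """Map each label to (max - min) of its score across all predictions."""
    scores = {}
    for prediction in predictions:
        for label, score in prediction:
            scores.setdefault(label, []).append(score)
    return {label: max(values) - min(values) for label, values in scores.items()}

# Outputs of the first five predict() calls of the first run.
first_run = [
    [('disease', 0.336669921875), ('bioagressor', 0.277587890625)],
    [('disease', 0.33740234375), ('bioagressor', 0.280029296875)],
    [('disease', 0.33935546875), ('bioagressor', 0.283935546875)],
    [('disease', 0.3359375), ('bioagressor', 0.27490234375)],
    [('disease', 0.33544921875), ('bioagressor', 0.280029296875)],
]
spread = score_spread(first_run)
print(spread)
```

A spread below 0.01 on texts as different as a downy mildew bulletin and an election headline means no threshold could separate the positive cases from the negative ones.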