### Named Entity Recognition (NER) is a Natural Language Processing (NLP) technique used to identify and extract named entities from text. Named entities are words or phrases that refer to specific things with a name or title, such as people, organizations, locations, dates, and times.

### To use the NLAS-multi corpus for training an NER model for our project, we follow these steps:

 1. Set up the environment and download/load the NLAS-multi corpus data.
 2. Preprocess the data into the format needed for training the model (Pre-processing).
 3. Convert the text and labels into a numerical format that can be fed into a machine learning model (Feature Extraction).
 4. Use spaCy to build and train the NER model (Model Building and Training).
 5. Evaluate the trained model and demonstrate how to use it for NER tasks (Model Evaluation).
 6. Lastly, develop a user-friendly interface to interact with the chatbot.

In [46]:
# Set up the environment by importing the libraries we need
import json
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import re

# Load the NLAS-multi corpus. It's a JSON file, so we read it as UTF-8 to keep accented characters intact
with open('nlas-multi.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
data

{'eng': {'0': {'topic': 'Euthanasia',
   'stance': 'in favor',
   'argumentation scheme': 'position to know',
   'argument': '{\n  "major premise": "Medical professionals are in position to know about the treatment options available for terminally ill patients.",\n  "minor premise": "Many medical professionals argue that euthanasia is a humane option for terminally ill patients who are experiencing unbearable suffering and have little hope for recovery.",\n  "conclusion": "Euthanasia can be a morally justifiable option for terminally ill patients who are experiencing unbearable suffering and have little hope for recovery."\n}',
   'label': 'yes'},
  '1': {'topic': 'Mandatory vaccination in pandemic',
   'stance': 'against',
   'argumentation scheme': 'expert opinion',
   'argument': '{\n  "major premise": "Dr. John Smith is an expert in medical ethics containing proposition that mandatory vaccination in pandemic undermines the value of human life.",\n  "minor premise": "Dr. Smith asser

# Data Cleaning and Pre-processing
 We preprocess the data to suit the format needed for training our model, extracting sentences and tags for an NER model. This means pulling out the relevant parts of each entry (the topic, stance, argumentation scheme, and the parts of the argument) and converting them into a format suitable for NER training.
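
Each entry's `argument` field is itself a JSON-encoded string, as the sample output above shows. A minimal sketch of pulling out its parts, assuming every well-formed argument is a JSON object of named premises and a conclusion (the `sample_entry` below is abridged from the first English entry, and `extract_argument_parts` is a hypothetical helper name):

```python
import json

# Abridged copy of the first English entry shown in the corpus output above
sample_entry = {
    "topic": "Euthanasia",
    "stance": "in favor",
    "argumentation scheme": "position to know",
    "argument": '{\n  "major premise": "Medical professionals are in position to know about the treatment options available for terminally ill patients.",\n  "conclusion": "Euthanasia can be a morally justifiable option."\n}',
    "label": "yes",
}


def extract_argument_parts(entry):
    """Return (part_name, sentence) pairs from an entry's JSON-encoded argument."""
    try:
        parsed = json.loads(entry["argument"])
    except json.JSONDecodeError:
        return []  # skip malformed arguments instead of failing
    if not isinstance(parsed, dict):
        return []
    return [(name, text) for name, text in parsed.items() if isinstance(text, str)]


parts = extract_argument_parts(sample_entry)
for name, text in parts:
    print(name, "->", text)
```

Parsing the string as JSON avoids missing sentences that contain escaped quotes, which a pattern such as `"([^"]+)"` would cut short.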

In [48]:
# Extract the entries from the nested dictionary for each language
entries_eng = list(data['eng'].values())
entries_esp = list(data['esp'].values())

# Initialize lists for sentences and tags
sentences = []
tags = []

print(len(entries_eng))
print(len(entries_esp))

4000
4000


### Training one NER model on both English (ENG) and Spanish (ESP) data would combine the two languages and so increase the overall dataset size.
### With a total length of 1893 for English and 1917 for Spanish, it is realistic to train a model for each language separately. We can then combine them during deployment.

# Prepare data for ENG

In [76]:
# Regular expression to extract the argument parts from the dataset
arg_pattern = re.compile(r'"(major premise|minor premise|conclusion)":\s*"([^"]+)"')

# Start from empty lists so that re-running this cell does not append duplicates
sentences = []
tags = []

for entry in entries_eng:
    argument = entry['argument']
    matches = arg_pattern.findall(str(argument))

    for part, sentence in matches:
        # Tokenize the sentence on whitespace
        words = sentence.split()
        sentences.append(words)

        # Create tags (for simplicity, assuming everything is 'O' since we don't have actual NER tags)
        # In a real scenario, this should be the actual BIO tags
        sentence_tags = ['O'] * len(words)
        tags.append(sentence_tags)
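
The placeholder tags above are all `O`. As a rough sketch of what real BIO tags would look like, assuming entity annotations were available as character offsets (the example sentence, the offsets and the `PER`/`TOPIC` labels are invented for illustration and are not part of NLAS-multi; `bio_tags` is a hypothetical helper):

```python
def bio_tags(sentence, entities):
    """Whitespace-tokenize `sentence` and derive BIO tags from (start, end, label) character spans."""
    tokens, tags = [], []
    pos = 0
    for word in sentence.split():
        start = sentence.index(word, pos)  # character offset of this token
        end = start + len(word)
        pos = end
        tag = "O"
        for ent_start, ent_end, label in entities:
            # A token belongs to an entity only if it lies fully inside the span
            if start >= ent_start and end <= ent_end:
                tag = ("B-" if start == ent_start else "I-") + label
                break
        tokens.append(word)
        tags.append(tag)
    return tokens, tags


sentence = "Dr. John Smith is an expert in medical ethics"
entities = [(4, 14, "PER"), (31, 45, "TOPIC")]  # hypothetical annotations
tokens, tags = bio_tags(sentence, entities)
print(list(zip(tokens, tags)))
```

Because the tokens come from a plain `split()`, punctuation stays attached to words, so a token such as `Smith,` would extend past the entity span and be tagged `O` in this sketch.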

In [78]:
# Build vocabulary and tag indices
words = set(word for sentence in sentences for word in sentence)
tags_set = set(tag for tag_seq in tags for tag in tag_seq)

word2idx = {w: i + 2 for i, w in enumerate(words)}
word2idx["PAD"] = 0
word2idx["UNK"] = 1

tag2idx = {t: i for i, t in enumerate(tags_set)}
tag2idx["PAD"] = len(tag2idx)
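
One detail worth noting about the cell above: iterating over a Python `set` of strings does not guarantee the same order between interpreter runs, so `word2idx` can assign different indices each time the notebook is restarted. A small sketch of a reproducible alternative, assuming that sorting the vocabulary is acceptable (the toy `sentences` and `tags` are illustrative stand-ins for the real lists):

```python
# Toy stand-ins for the real `sentences` and `tags` lists
sentences = [["Euthanasia", "can", "be", "justifiable"],
             ["Vaccination", "can", "be", "mandatory"]]
tags = [["O", "O", "O", "O"], ["O", "O", "O", "O"]]

# Sorting makes the index assignment independent of set iteration order
vocab = sorted({w for sentence in sentences for w in sentence})
word2idx = {"PAD": 0, "UNK": 1}
word2idx.update({w: i + 2 for i, w in enumerate(vocab)})

tag_names = sorted({t for tag_seq in tags for t in tag_seq})
tag2idx = {t: i for i, t in enumerate(tag_names)}
tag2idx["PAD"] = len(tag2idx)

print(word2idx)
print(tag2idx)
```

A stable mapping matters as soon as a trained model is saved and reused in a new session, because the embedding rows are tied to these indices.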

In [80]:
# Convert sentences and tags to sequences of indices
X = [[word2idx.get(w, word2idx["UNK"]) for w in sentence] for sentence in sentences]
y = [[tag2idx[t] for t in tag_seq] for tag_seq in tags]

In [82]:
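# Pad sequences
max_len = max(len(s) for s in sentences)
X = pad_sequences(X, padding='post', maxlen=max_len)
# Pad tags with the PAD index; the default value 0 would be read as the 'O' tag
y = pad_sequences(y, padding='post', maxlen=max_len, value=tag2idx["PAD"])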
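
To make the padding step concrete, here is a plain-Python sketch of post-padding, assuming word sequences are padded with the `PAD` word index (0) and tag sequences with the `PAD` tag index. `pad_post` is a hypothetical helper written for illustration only; it is not the Keras implementation and it does not truncate.

```python
def pad_post(sequences, max_len, pad_value):
    """Right-pad each sequence with pad_value up to max_len (no truncation)."""
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Toy index sequences standing in for the real X and y
X_ids = [[5, 3, 8], [4, 2]]
y_ids = [[0, 0, 0], [0, 0]]
tag2idx = {"O": 0, "PAD": 1}

max_len = max(len(seq) for seq in X_ids)
X_padded = pad_post(X_ids, max_len, pad_value=0)
y_padded = pad_post(y_ids, max_len, pad_value=tag2idx["PAD"])

print(X_padded)  # [[5, 3, 8], [4, 2, 0]]
print(y_padded)  # [[0, 0, 0], [0, 0, 1]]
```

Padding tags with a dedicated index keeps padded positions distinguishable from real `O` tokens. With `pad_sequences`, the padding value is set through its `value` argument.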

In [84]:
print(len(X))
print(len(y))

27942
9314


# Feature Extraction

In [58]:
# Convert tags to categorical (one-hot encoding)
y = [to_categorical(i, num_classes=len(tag2idx)) for i in y]

# Split the data
X_train_eng, X_test_eng, y_train_eng, y_test_eng = train_test_split(X, y, test_size=0.2)

ValueError: Found input variables with inconsistent numbers of samples: [9314, 0]
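
The `ValueError` above means that `X` and `y` held different numbers of samples when the split ran, which can happen when notebook cells are executed out of order. A minimal sketch of a check that could run before splitting, assuming `X` and `y` are sequences of per-sentence index lists (`check_aligned` is a hypothetical helper):

```python
def check_aligned(X, y):
    """Raise ValueError if X and y differ in sample count or per-sample length."""
    if len(X) != len(y):
        raise ValueError(f"X has {len(X)} samples but y has {len(y)}")
    for i, (x_row, y_row) in enumerate(zip(X, y)):
        if len(x_row) != len(y_row):
            raise ValueError(f"sample {i}: {len(x_row)} tokens but {len(y_row)} tags")
    return len(X)


# Aligned toy data passes and returns the sample count
n_samples = check_aligned([[2, 3], [4, 5]], [[0, 0], [0, 0]])
print(n_samples)

# Misaligned toy data is reported with a readable message
try:
    check_aligned([[2, 3], [4, 5]], [[0, 0]])
except ValueError as err:
    message = str(err)
    print(message)
```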

# Prepare data for ESP

In [8]:
# Regular expression to extract the argument parts from the dataset
arg_pattern = re.compile(r'"(major premise|minor premise|conclusion)":\s*"([^"]+)"')

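# Note: `sentences` and `tags` are not reset here, so the Spanish sentences are
# appended to whatever those lists already contain
for entry in entries_esp: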
    argument = entry['argument']
    matches = arg_pattern.findall(str(argument))
    
    for part, sentence in matches:
        # Tokenize the sentence
        words = sentence.split()
        sentences.append(words)
        
        # Create tags (for simplicity, assuming everything is 'O' since we don't have actual NER tags)
        # In a real scenario, this should be the actual BIO tags
        sentence_tags = ['O'] * len(words)
        tags.append(sentence_tags)

In [9]:
# Build vocabulary and tag indices
words = set(word for sentence in sentences for word in sentence)
tags_set = set(tag for tag_seq in tags for tag in tag_seq)

word2idx = {w: i + 2 for i, w in enumerate(words)}
word2idx["PAD"] = 0
word2idx["UNK"] = 1

tag2idx = {t: i for i, t in enumerate(tags_set)}
tag2idx["PAD"] = len(tag2idx)

In [10]:
# Convert sentences and tags to sequences of indices
X = [[word2idx.get(w, word2idx["UNK"]) for w in sentence] for sentence in sentences]
y = [[tag2idx[t] for t in tag_seq] for tag_seq in tags]

In [28]:
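# Pad sequences
max_len = max(len(s) for s in sentences)
X = pad_sequences(X, padding='post', maxlen=max_len)
# Pad tags with the PAD index; the default value 0 would be read as the 'O' tag
y = pad_sequences(y, padding='post', maxlen=max_len, value=tag2idx["PAD"])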

# Feature Extraction

In [12]:
# Convert tags to categorical (one-hot encoding)
y = [to_categorical(i, num_classes=len(tag2idx)) for i in y]

# Split the data
X_train_esp, X_test_esp, y_train_esp, y_test_esp = train_test_split(X, y, test_size=0.2)

# Model Building and Training

In [30]:
# We'll build and train a Bi-LSTM model using TensorFlow/Keras.

# Build the Model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, InputLayer

# Parameters
input_dim = len(word2idx)
output_dim = 50  # Embedding output dimension
input_length = X_train_eng.shape[1]  # Input sequence length (padded length of the training data)
n_tags = len(tag2idx)


# Use tf.keras.Input to define the input layer
input_layer = tf.keras.Input(shape=(input_length,))
embedding_layer = tf.keras.layers.Embedding(input_dim=input_dim, output_dim=output_dim, mask_zero=True)(input_layer)
bi_lstm_layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(embedding_layer)
output_layer = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_tags, activation="softmax"))(bi_lstm_layer)
model = tf.keras.Model(inputs=input_layer, outputs=output_layer)


# # Model definition
# model = Sequential()
# model.add(InputLayer(shape=(input_length)))
# model.add(Embedding(input_dim=input_dim, output_dim=output_dim, input_length=input_length, mask_zero=True))
# model.add(Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1)))
# model.add(TimeDistributed(Dense(n_tags, activation="softmax")))
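
As a worked check on the architecture above, the number of trainable parameters can be computed by hand and compared with `model.summary()`. The sketch below uses the layer sizes from the code (embedding dimension 50, 100 LSTM units per direction); the vocabulary size and tag count are made-up illustrative values, since the real ones depend on the corpus, and `bilstm_tagger_params` is a hypothetical helper.

```python
def bilstm_tagger_params(vocab_size, emb_dim, lstm_units, n_tags):
    """Parameter counts for Embedding -> Bidirectional(LSTM) -> Dense(softmax)."""
    embedding = vocab_size * emb_dim
    # One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
    lstm_one_direction = 4 * (emb_dim * lstm_units + lstm_units * lstm_units + lstm_units)
    bilstm = 2 * lstm_one_direction
    # The dense layer sees the concatenated forward and backward states
    dense = (2 * lstm_units) * n_tags + n_tags
    return {
        "embedding": embedding,
        "bilstm": bilstm,
        "dense": dense,
        "total": embedding + bilstm + dense,
    }


# vocab_size and n_tags are illustrative assumptions, not values from the corpus
counts = bilstm_tagger_params(vocab_size=10_000, emb_dim=50, lstm_units=100, n_tags=2)
print(counts)
```

With these numbers the embedding dominates the total, which is typical for a tagger with a large vocabulary and a small tag set.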

In [32]:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

model.summary()

In [36]:
# Train the Model

history = model.fit(X_train_eng, np.array(y_train_eng), validation_split=0.1, batch_size=32, epochs=1, verbose=1)

210/210 ━━━━━━━━━━━━━━━━━━━━ 9s 43ms/step - accuracy: 1.0000 - loss: 1.3594e-06 - val_accuracy: 1.0000 - val_loss: 1.1648e-06


# Model Evaluation

In [38]:
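# Note: every training tag is the placeholder 'O', so a perfect accuracy is expected here
# and says nothing about how well the model would recognize real entities
loss, accuracy = model.evaluate(X_test_eng, np.array(y_test_eng))
print(f"Loss: {loss}, Accuracy: {accuracy}")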

59/59 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - accuracy: 1.0000 - loss: 1.1450e-06
Loss: 1.2095075589968474e-06, Accuracy: 1.0
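
Token accuracy says little about NER quality, because most tokens are `O`. Once real BIO tags are available, entity-level precision, recall and F1 are the usual measures. A minimal sketch, assuming gold and predicted tags are BIO-encoded lists of equal length (the example tags are invented, and both helpers are hypothetical):

```python
def bio_to_spans(tags):
    """Return the set of (start, end, label) entity spans in a BIO sequence (end exclusive)."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def entity_prf(gold_tags, pred_tags):
    """Exact-match entity precision, recall and F1 for one tagged sentence."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
print(entity_prf(gold, pred))

# A model that only ever predicts 'O' finds no entities at all
print(entity_prf(gold, ["O"] * len(gold)))
```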


In [118]:
# Create a reverse dictionary to map indices back to tags
idx2tag = {i: t for t, i in tag2idx.items()}

# Function to decode predictions
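def predict(sentence):
    # Convert the list of tokens to indices
    sentence_idx = [word2idx.get(w, word2idx["UNK"]) for w in sentence]
    # Pad to the sequence length the model was built with
    sentence_idx = pad_sequences([sentence_idx], maxlen=model.input_shape[1], padding='post')
    
    # Predict one tag distribution per position and keep the most likely tag
    pred = model.predict(sentence_idx)
    pred_tags = [idx2tag[np.argmax(tag)] for tag in pred[0]]
    
    # zip stops at the last real token, so padded positions are dropped
    return list(zip(sentence, pred_tags))

# Example usage. predict expects a list of tokens, so the sentence is split first.
# The recorded output below came from passing the whole sentence as a single list element,
# which the model treats as one unknown token.
sentence = "Apple is looking at buying U.K. startup for $1 billion .".split()
predicted_tags = predict(sentence)
print(predicted_tags)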

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
[('Apple is looking at buying U.K. startup for $1 billion.', 'O')]


# Testing code here
### 1. Train a Custom Named Entity Recognition Model Using spaCy

In [9]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import spacy
import json
# import scispacy
from spacy import displacy
from spacy.tokens import Doc, DocBin
from spacy.util import filter_spans
from tqdm import tqdm

In [11]:
nlp = spacy.load("en_core_web_lg")
doc = nlp('Elephants are found in Africa and India')
doc.ents

(Africa, India)

In [37]:
with open(r'C:\pythonclass\Data Science\Datasets\Deep learning datasets\Corona2.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
    
# Let see a sample of our data. The data is a nested json object. Let’s see the keys. Let’s view what each key contains.
print(data['examples'][1].keys())
print(data['examples'][1])

dict_keys(['id', 'content', 'metadata', 'annotations', 'classifications'])
{'id': '487c93e3-0d45-4088-a378-cf3a01c8953d', 'content': 'Diarrhea, also spelled diarrhoea, is the condition of having at least three loose, liquid, or watery bowel movements each day.[2] It often lasts for a few days and can result in dehydration due to fluid loss.[2] Signs of dehydration often begin with loss of the normal stretchiness of the skin and irritable behaviour.[2] This can progress to decreased urination, loss of skin color, a fast heart rate, and a decrease in responsiveness as it becomes more severe.[2] Loose but non-watery stools in babies who are exclusively breastfed, however, are normal.[2]', 'metadata': {}, 'annotations': [{'id': '28601a42-c8a9-44e2-aeea-8939cb1db1a9', 'tag_id': '03eb3e50-d4d8-4261-a60b-fa5aee5deb4a', 'end': 382, 'start': 364, 'example_id': '487c93e3-0d45-4088-a378-cf3a01c8953d', 'tag_name': 'MedicalCondition', 'value': 'loss of skin color', 'correct': None, 'human_annotatio

In [17]:
# We’ll extract the content and annotations and use them to train a blank spaCy model later on in this tutorial.

train_data = []
for example in data['examples']:
    ent_dict = {}
    ent_dict['text'] = example['content']
    ent_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        end = annotation['end']
        label = annotation['tag_name'].upper()
        ent_dict['entities'].append((start, end, label))
    train_data.append(ent_dict)

print(train_data[1])

{'text': 'Diarrhea, also spelled diarrhoea, is the condition of having at least three loose, liquid, or watery bowel movements each day.[2] It often lasts for a few days and can result in dehydration due to fluid loss.[2] Signs of dehydration often begin with loss of the normal stretchiness of the skin and irritable behaviour.[2] This can progress to decreased urination, loss of skin color, a fast heart rate, and a decrease in responsiveness as it becomes more severe.[2] Loose but non-watery stools in babies who are exclusively breastfed, however, are normal.[2]', 'entities': [(364, 382, 'MEDICALCONDITION'), (0, 8, 'MEDICALCONDITION'), (94, 116, 'MEDICALCONDITION'), (178, 189, 'MEDICALCONDITION'), (221, 232, 'MEDICALCONDITION'), (23, 32, 'MEDICALCONDITION'), (409, 435, 'MEDICALCONDITION'), (386, 401, 'MEDICALCONDITION')]}
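
Before training on character offsets it helps to confirm that each `(start, end)` pair really slices out the annotated text, since reading the file with the wrong encoding can shift offsets. A small sketch, assuming each annotation carries `start`, `end`, `tag_name` and `value` keys as in the sample above (the toy `example` is invented and `misaligned_annotations` is a hypothetical helper):

```python
def misaligned_annotations(example):
    """Return the annotations whose offsets do not slice out their recorded value."""
    text = example["content"]
    return [
        annotation
        for annotation in example["annotations"]
        if text[annotation["start"]:annotation["end"]] != annotation["value"]
    ]


# Invented example in the same shape as the dataset entries
example = {
    "content": "Diarrhea can result in dehydration.",
    "annotations": [
        {"start": 0, "end": 8, "tag_name": "MedicalCondition", "value": "Diarrhea"},
        {"start": 23, "end": 34, "tag_name": "MedicalCondition", "value": "dehydration"},
        # Deliberately wrong offsets to show what the check reports
        {"start": 9, "end": 13, "tag_name": "MedicalCondition", "value": "result"},
    ],
}

bad = misaligned_annotations(example)
print([(a["start"], a["end"], a["value"]) for a in bad])
```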


In [19]:
# Let’s see what we were able to extract.

train_data[1]['text']
train_data[1]['entities']

# Let’s view a sample train data.

train_data[5]

{'text': "Hantaviruses, usually found in rodents and shrews, were discovered in two species of bats. The Mouyassué virus (MOUV) was isolated from banana pipistrelle bats captured near Mouyassué village in Cote d'Ivoire, West Africa. The Magboi virus was isolated from hairy slit-faced bats found near the Magboi River in Sierra Leone in 2011. They are single-stranded, negative sense, RNA viruses in the Bunyaviridae family.[29][30][31][32]",
 'entities': [(0, 12, 'PATHOGEN'),
  (394, 406, 'PATHOGEN'),
  (227, 239, 'PATHOGEN'),
  (95, 110, 'PATHOGEN')]}

In [21]:
# We need to initialize a blank spaCy model to train our custom NER model. We need to use a DocBin object for training. 
# This is the form that can be saved to disk because it is in a binary format.

nlp = spacy.blank('en')
doc_bin = DocBin()

In [23]:
for train_doc in tqdm(train_data):
    text = train_doc['text']
    labels = train_doc['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode='contract')
        if span is None:
            print('No Span found')
        else:
            ents.append(span)
    final_ents = filter_spans(ents)
    doc.ents = final_ents
    doc_bin.add(doc)
    
doc_bin.to_disk('custom_ner.spacy')

100%|█████████████████████████████████████████████████████████████████████████████████| 31/31 [00:00<00:00, 258.51it/s]

No Span found (repeated 27 times)

spaCy training expects a configuration file, which we can generate with the quickstart widget in the training section of the spaCy website.

Copy the generated config, paste it into a new file in the same directory, and name it base_config.cfg.

Then run this command so that spaCy uses base_config.cfg to create a complete config.cfg file.

!python -m spacy init fill-config base_config.cfg config.cfg

Next, we run this command, which trains the model and saves the best model into the output folder. Note that the same file is passed as both training and development data, so the scores reported during training are measured on data the model has already seen.

!python -m spacy train config.cfg --output ./output --paths.train ./custom_ner.spacy --paths.dev ./custom_ner.spacy

In [25]:
# The output folder contains two directories, model-best and model-last. We use model-best for the best result.

# We can now load the model back into our notebook and use it to identify these custom entities in any medical text.
# We trained the model to recognize only three entities: MEDICALCONDITION, PATHOGEN and MEDICINE.

custom_med_NER = spacy.load('output/model-best')

In [27]:
# Let’s try it on a text that it has not seen previously. I’ll copy a paragraph from Wikipedia to test our model. We save it in a variable.
doc1 = """Frequent coughing usually indicates the presence of a disease. Many viruses and bacteria benefit,
from an evolutionary perspective, by causing the host to cough, which helps to spread the disease to new hosts. 
Most of the time, irregular coughing is caused by a respiratory tract infection but can also be triggered by choking, 
smoking, air pollution,[1] asthma, gastroesophageal reflux disease, 
post-nasal drip, chronic bronchitis, lung tumors, heart failure and medications such as 
angiotensin-converting-enzyme inhibitors (ACE inhibitors) and beta blockers.[2]
Treatment should target the cause; for example, smoking cessation or discontinuing ACE inhibitors.
Cough suppressants such as codeine or dextromethorphan are frequently prescribed, but have been 
demonstrated to have little effect. Other treatment options may target airway inflammation or 
may promote mucus expectoration. As it is a natural protective reflex, suppressing the cough 
reflex might have damaging effects, especially if the cough is productive."""

In [29]:
# We pass the text to our model. We can choose colors for each label and pass them into displacy.render.
result = custom_med_NER(doc1)
colors= {'PATHOGEN': '#D49137', 'MEDICINE': '#BE398D', 'MEDICALCONDITION': '#F07857'}
options = {'colors': colors}
spacy.displacy.render(result, style='ent', options=options, jupyter=True)

### 2. spaCy Named Entity Recognition

In [3]:
import spacy
# !python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")
nlp

<spacy.lang.en.English at 0x243a1e05010>

In [5]:
doc = nlp("Donad Trump was President of USA")
doc

Donad Trump was President of USA

In [7]:
type(doc)

spacy.tokens.doc.Doc

In [9]:
doc.ents

(Donad Trump, USA)

In [11]:
doc.ents[0], type(doc.ents[0])

(Donad Trump, spacy.tokens.span.Span)

In [13]:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

In [37]:
import json
# https://www.kaggle.com/datasets/finalepoch/medical-ner 
with open(r'C:\pythonclass\Data Science\Datasets\Deep learning datasets\Corona2.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
data['examples'][0]
# with open('nlas-multi2.json', 'r', encoding='utf-8') as f:
#     data = json.load(f)
# data['eng']['0']

{'id': '18c2f619-f102-452f-ab81-d26f7e283ffe',
 'content': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]",
 'metadata': {},
 'annotations': [{'id': '0825a1

In [39]:
data['examples'][0].keys()

dict_keys(['id', 'content', 'metadata', 'annotations', 'classifications'])

In [41]:
data['examples'][0]['content']

"While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]"

In [43]:
data['examples'][0]['annotations'][0]

{'id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
 'tag_id': 'c06bd022-6ded-44a5-8d90-f17685bb85a1',
 'end': 371,
 'start': 360,
 'example_id': '18c2f619-f102-452f-ab81-d26f7e283ffe',
 'tag_name': 'Medicine',
 'value': 'Diosmectite',
 'correct': None,
 'human_annotations': [{'timestamp': '2020-03-21T00:24:32.098000Z',
   'annotator_id': 1,
   'tagged_token_id': '0825a1bf-6a6e-4fa2-be77-8d104701eaed',
   'name': 'Ashpat123',
   'reason': 'exploration'}],
 'model_annotations': []}

In [45]:
training_data = []
for example in data['examples']:
  temp_dict = {}
  temp_dict['text'] = example['content']
  temp_dict['entities'] = []
  for annotation in example['annotations']:
    start = annotation['start']
    end = annotation['end']
    label = annotation['tag_name'].upper()
    temp_dict['entities'].append((start, end, label))
  training_data.append(temp_dict)
  
print(training_data[0])

{'text': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]", 'entities': [(360, 371, 'MEDICINE'), (383, 408, 'MEDICINE'), (104, 112, 'MEDICALCONDITION'), (679,

In [47]:
training_data[0]['text']

"While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]"

In [49]:
training_data[0]['entities']

[(360, 371, 'MEDICINE'),
 (383, 408, 'MEDICINE'),
 (104, 112, 'MEDICALCONDITION'),
 (679, 689, 'MEDICINE'),
 (6, 23, 'MEDICINE'),
 (25, 37, 'MEDICINE'),
 (461, 470, 'MEDICALCONDITION'),
 (577, 589, 'MEDICINE'),
 (853, 865, 'MEDICALCONDITION'),
 (188, 198, 'MEDICINE'),
 (754, 762, 'MEDICALCONDITION'),
 (870, 880, 'MEDICALCONDITION'),
 (823, 833, 'MEDICINE'),
 (852, 853, 'MEDICALCONDITION'),
 (461, 469, 'MEDICALCONDITION'),
 (535, 543, 'MEDICALCONDITION'),
 (692, 704, 'MEDICINE'),
 (563, 571, 'MEDICALCONDITION')]
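
Some of the spans listed above overlap, for example `(461, 470)` and `(461, 469)`. spaCy does not allow overlapping spans in `doc.ents`, which is why the notebook applies `filter_spans` before setting them. The following plain-Python sketch approximates that behaviour on `(start, end, label)` tuples by preferring longer spans. It is an illustration with a hypothetical helper name, not spaCy's implementation.

```python
def filter_overlaps(spans):
    """Keep non-overlapping (start, end, label) spans, preferring longer ones."""
    chosen = []
    taken = set()  # character positions already covered by a kept span
    # Longest spans first; ties are broken by the earlier start offset
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        positions = range(start, end)
        if not any(p in taken for p in positions):
            chosen.append((start, end, label))
            taken.update(positions)
    return sorted(chosen)


# A few of the spans from the entity list above
spans = [
    (461, 470, "MEDICALCONDITION"),
    (461, 469, "MEDICALCONDITION"),
    (852, 853, "MEDICALCONDITION"),
    (853, 865, "MEDICALCONDITION"),
    (577, 589, "MEDICINE"),
]
kept = filter_overlaps(spans)
print(kept)
```

Here `(461, 469)` is dropped because the longer `(461, 470)` covers it, while `(852, 853)` and `(853, 865)` both survive because end offsets are exclusive and the two spans only touch.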

In [51]:
training_data[0]['text'][360:371]

'Diosmectite'

In [53]:
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en") # load a new spacy model
doc_bin = DocBin()

In [55]:
from spacy.util import filter_spans

for training_example  in tqdm(training_data): 
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text) 
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents 
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

100%|█████████████████████████████████████████████████████████████████████████████████| 31/31 [00:00<00:00, 133.20it/s]

Skipping entity (repeated 16 times)

In [61]:
!python -m spacy init fill-config base_config.cfg config.cfg

[+] Auto-filled config with all values
[+] Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [63]:
!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy

[i] Saving to output directory: .
[i] Using CPU
[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    153.29    0.55    0.91    0.39    0.01
  7     200       1304.43   4815.29   63.16   71.29   56.69    0.63
 14     400         85.76   1278.00   91.08   87.91   94.49    0.91
 22     600       1396.69    576.02   92.16   87.59   97.24    0.92
 30     800         81.40    370.15   96.51   95.04   98.03    0.97
 40    1000        103.16    339.44   97.08   96.14   98.03    0.97
 51    1200         67.72    226.11   98.43   98.43   98.43    0.98
 65    1400         70.46    154.69   98.02   98.80   97.24    0.98
 81    1600        100.47    184.97   98.82   98.82   98.82    0.99
101    1800        172.03    254.97   98.
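
The training command above passes `train.spacy` as both training and development data, so the reported scores are measured on examples the model has already seen. A minimal sketch of holding out a development set before building the `DocBin` files, assuming a simple shuffled split is adequate for 31 examples (`split_train_dev`, the 80/20 ratio and the seed are illustrative choices, and the toy `training_data` stands in for the real list):

```python
import random


def split_train_dev(examples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and return (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Toy stand-in with the same length as the dataset used above
training_data = [{"text": f"example {i}", "entities": []} for i in range(31)]

train_examples, dev_examples = split_train_dev(training_data)
print(len(train_examples), len(dev_examples))
```

Each list would then be converted to its own `DocBin` and saved as `train.spacy` and `dev.spacy`, matching the file names suggested in the `init fill-config` output.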

In [65]:
nlp_ner = spacy.load("model-best")

In [67]:
doc = nlp_ner("MEDICAL REPORT. The Sheriff as a condition of granting sick leave with pay, may require medical evidence of sickness or injury acceptable to the Sheriff’s Office when the employee is absent for more than three consecutive working days or when the agency/department head determines within his/her discretion that there are indications of excessive use of sick leave or sick leave abuse. A diagnosis is not required as medical evidence of sickness or injury unless it is reasonable to believe that the employee’s condition may endanger the health or safety of other employees and/or the public.")
colors = {"PATHOGEN": "#F67DE3", "MEDICINE": "#7DF6D9", "MEDICALCONDITION":"#a6e22d"}
options = {"colors": colors} 

spacy.displacy.render(doc, style="ent", options= options, jupyter=True)