# **5: Train 2 NLP Models**
1. From Blank NLP model
2. From Existing NLP model "en_core_web_sm"
* Split the training corpus into train and test sets.
* Train the two models with the training set
* Test the two models with the test set
    * Also test the default untrained NLP Model
* Compare the performances
<hr>


# Getting the Training Corpus

`drive.mount('/content/drive/')` mounts Google Drive in the Google Colab notebook, so that files on the drive can be read by path.


In [None]:
#Mounting Google Drive

from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


The code below reads the training corpus from a CSV file, converts the DataFrame rows into a list, and parses each annotation, which the CSV stores as a string, back into a dictionary with `ast.literal_eval`. The two `print` calls confirm the type change from `str` to `dict`.
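The annotation-parsing step can be tried on a single row. This is a minimal sketch: the sample text and its offsets are made up, and the `{'entities': [(start, end, label)]}` layout is an assumption based on the format commonly used for spaCy NER training, so the real corpus may differ.

```python
import ast

# Hypothetical row as it might come out of the CSV: the annotation is a string.
sample_text = "The Doctor signs the form."
raw_annotation = "{'entities': [(4, 10, 'PERSON')]}"

# literal_eval only evaluates Python literals (dicts, lists, tuples, strings,
# numbers), which makes it safer than eval() for data read from a file.
parsed = ast.literal_eval(raw_annotation)

start, end, label = parsed["entities"][0]
print(type(raw_annotation).__name__, "->", type(parsed).__name__)
print(sample_text[start:end], label)
```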

In [None]:
#Getting the training corpus as tc
import pandas as pd
tc= pd.read_csv('/content/drive/MyDrive/Final (Completed)/training_corpus.csv')

#Converting the dataframe into list
all_train_data= list(tc.values)
print(type(all_train_data[0][1]))

#Converting the annotations into dictionary
import ast
for i in range(0,len(all_train_data)):
    all_train_data[i][1]=ast.literal_eval(str(all_train_data[i][1]))
print(type(all_train_data[0][1]))

<class 'str'>
<class 'dict'>


The code below loads the entity labels from `labels.csv` and prints them together with their count. These eight labels are the entity types that the NER component of each model will be trained to predict.

In [None]:
#Getting the labels
labels= pd.read_csv('/content/drive/MyDrive/Final (Completed)/labels.csv')
print([l for l in labels['0']],':',len(labels))

['DOCUMENT', 'PROCESS', 'PERSON', 'PORTAL', 'MOBILE_APP', 'CSC', 'PERMISSION', 'ORG'] : 8


The code below splits the corpus into two subsets: train_data and test_data. The training subset is used to train the models, and the test subset is used to evaluate them on data they have not seen. With test_size=0.2, 20% of the examples (258 of 1,290) are held out for testing. The train_test_split function shuffles the data before splitting. Because no stratify argument is passed, the split is purely random and does not guarantee that the distribution of labels is preserved. The random_state value is the seed for the random split, so the same split is produced on every run.

In [None]:
#Splitting the Training Corpus into Train and Test Sets
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(all_train_data, test_size=0.2, random_state=42)
print("Train Data:",len(train_data),"\nTest Data:",len(test_data))

Train Data: 1032 
Test Data: 258
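To make the effect of the seed concrete, here is a plain-Python sketch of a seeded 80/20 split. It is not scikit-learn's implementation; `seeded_split` is a hypothetical helper, and the rounding of the test size is an assumption chosen to reproduce the 1032/258 split above.

```python
import random

def seeded_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut off the test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# 1290 stands in for the number of rows in the training corpus.
train_a, test_a = seeded_split(range(1290))
train_b, test_b = seeded_split(range(1290))

print(len(train_a), len(test_a))
print("same split on every run:", test_a == test_b)
```

Like the call in the notebook, this sketch does not stratify: it makes no attempt to keep the proportion of each label equal in the two subsets.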


# Getting the Two NLP Models to Train

In [None]:
import spacy
# A Blank Model
nlp1= spacy.blank('en')
nlp1.add_pipe('ner')
nlp1.begin_training()

# An Existing NER Model
nlp2= spacy.load('en_core_web_sm')

# Getting pipeline component
ner1= nlp1.get_pipe("ner") 
ner2= nlp2.get_pipe("ner") 

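# Adding labels to the 'ner1' and 'ner2'
# Iterate over the values in column '0'; iterating over the DataFrame itself would yield column names
for label in labels['0']:
    ner1.add_label(label)
    ner2.add_label(label)

#Disable pipeline components that don't need to change in nlp2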
pipe_exceptions=["ner","trf_wordpiecer","trf_tok2vec"]
unaffected_pipes=[pipe for pipe in nlp2.pipe_names if pipe not in pipe_exceptions]



First, we have created a blank model for the English language by calling the spacy.blank() function and passing the 'en' parameter. We have stored this model in a variable called nlp1.

Next, we have added a named entity recognition (NER) pipeline component to the nlp1 model by calling the nlp1.add_pipe('ner') function, and then called nlp1.begin_training(). Despite its name, this call does not train anything: it initializes the model weights. In spaCy v3 it is a deprecated alias of nlp1.initialize().

We have also loaded the pre-trained English pipeline 'en_core_web_sm', which already contains a trained NER component, into the nlp2 variable by calling the spacy.load() function.

To get the NER pipeline component of both models, we have called the nlp1.get_pipe("ner") and nlp2.get_pipe("ner") functions respectively and stored their results in the ner1 and ner2 variables.

We have then added the custom labels to both ner1 and ner2 by iterating through the list of labels and calling the ner1.add_label(label) and ner2.add_label(label) functions.

Finally, we have built two lists for the nlp2 model. pipe_exceptions contains the names of the pipeline components that should stay active during training: "ner", plus "trf_wordpiecer" and "trf_tok2vec". The latter two are names of transformer components from older spaCy pipelines; they are not expected to appear in en_core_web_sm, so they have no effect here. unaffected_pipes contains the names of all other pipeline components in nlp2. This list is not used in this cell; the training cell below builds an equivalent list, other_pipes, and disables those components while training.

The purpose of this code is to set up two English NLP models, one blank and one pre-trained, and prepare them for training on a Named Entity Recognition task. The blank model is trained from scratch, while the pre-trained model is fine-tuned on the specific task. The same labels are added to both models so that they predict the same entity types. The pipe_exceptions and unaffected_pipes lists separate the components of the `nlp2` model that will be trained from those that will be left unchanged.
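The filtering step can be seen in isolation with a plain list. The component names below are an assumption about what `nlp2.pipe_names` typically contains for a spaCy v3 `en_core_web_sm` pipeline; print `nlp2.pipe_names` to see the real list.

```python
# Assumed component names, for illustration only.
pipe_names = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]

# Components that should stay active while training.
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# Everything else is a candidate for disabling during training.
unaffected_pipes = [pipe for pipe in pipe_names if pipe not in pipe_exceptions]
print(unaffected_pipes)
```

With these assumed names, only "ner" is filtered out, because the two "trf_" entries never match anything in the list.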

In [None]:
# Train the NER component
import random
from spacy.util import minibatch, compounding
from spacy.training.example import Example
n_iter = 10
other_pipes = [pipe for pipe in nlp2.pipe_names if pipe != 'ner']
with nlp2.disable_pipes(*other_pipes):
    optimizer1 = nlp1.create_optimizer()
    optimizer2 = nlp2.create_optimizer()
    for i in range(n_iter):
        random.shuffle(train_data)
        losses1 = {}
        losses2 = {}
        batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            for j in range(0,len(texts)):
                doc1 = nlp1.make_doc(texts[j])
                doc2 = nlp2.make_doc(texts[j])
                example1 = Example.from_dict(doc1, annotations[j])
                example2 = Example.from_dict(doc2, annotations[j])
                nlp1.update([example1], sgd=optimizer1, drop=0.35, losses=losses1)
                nlp2.update([example2], sgd=optimizer2, drop=0.35, losses=losses2)
        print(f'nlp1_Epoch {i}: {losses1}')
        print(f'nlp2_Epoch {i}: {losses2}')
# Save the trained model to disk
nlp1.to_disk('/content/drive/MyDrive/Final (Completed)/nlp_model_1')
nlp2.to_disk('/content/drive/MyDrive/Final (Completed)/nlp_model_2')



nlp1_Epoch 0: {'ner': 1806.2281800729045}
nlp2_Epoch 0: {'ner': 1184.56735712692}
nlp1_Epoch 1: {'ner': 564.6411237148457}
nlp2_Epoch 1: {'ner': 202.92248622796677}
nlp1_Epoch 2: {'ner': 584.2392605513691}
nlp2_Epoch 2: {'ner': 282.7223886777114}
nlp1_Epoch 3: {'ner': 314.5579534688138}
nlp2_Epoch 3: {'ner': 104.00410918707257}
nlp1_Epoch 4: {'ner': 299.6152072470789}
nlp2_Epoch 4: {'ner': 182.89037431958937}
nlp1_Epoch 5: {'ner': 298.85909949279613}
nlp2_Epoch 5: {'ner': 107.56560717462732}
nlp1_Epoch 6: {'ner': 312.6654001138403}
nlp2_Epoch 6: {'ner': 104.67959599115633}
nlp1_Epoch 7: {'ner': 372.6034761933004}
nlp2_Epoch 7: {'ner': 95.4047239584839}
nlp1_Epoch 8: {'ner': 299.4500198138731}
nlp2_Epoch 8: {'ner': 123.33520294560097}
nlp1_Epoch 9: {'ner': 163.24802787621647}
nlp2_Epoch 9: {'ner': 36.878218374097415}


This code trains two named entity recognition (NER) models, nlp1 and nlp2, on the training data in train_data. Training runs for n_iter epochs, and the data is shuffled at the beginning of each epoch. Within an epoch, the minibatch function divides the training data into batches whose sizes are drawn from compounding(4.0, 32.0, 1.001): the size starts at 4, is multiplied by 1.001 after every batch, and is capped at 32. Because the schedule is re-created in every epoch and the growth factor is so small, the batch size stays close to 4 throughout. A larger factor makes the growth visible: with (1, 32, 2) the batch sizes would be 1, 2, 4, 8, 16, 32.
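The batching schedule can be reproduced without spaCy. The two generators below are simplified stand-ins written for this illustration (`compounding_sketch` and `minibatch_sketch` are hypothetical names, not spaCy functions), assuming that the size is multiplied by the factor after each batch and truncated to an integer when a batch is cut.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value, stop = float(start), float(stop)
    while True:
        yield min(value, stop)
        value *= compound

def minibatch_sketch(items, sizes):
    """Cut items into consecutive batches whose lengths come from sizes."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch

# Doubling schedule: the growth and the cap are easy to see.
doubling = list(islice(compounding_sketch(1, 32, 2), 7))
print(doubling)

# Schedule used in the notebook, applied to 1032 placeholder items.
schedule = compounding_sketch(4.0, 32.0, 1.001)
batch_lengths = [len(b) for b in minibatch_sketch(range(1032), schedule)]
print(batch_lengths[0], max(batch_lengths), sum(batch_lengths))
```

Under these assumptions, one pass over 1,032 items never produces a batch longer than 5, which is why the cap of 32 plays no role in this training run.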

For each batch, the code separates the texts from the annotations using the zip function. For each text in the batch, two spaCy Doc objects are created, one for each model, using the make_doc function. Then an Example object is created from each Doc and its corresponding annotations using the Example.from_dict method. Finally, the update method of each model is called with a single Example object, the model's optimizer, a dropout rate of 0.35, and the dictionary in which the losses of the current epoch are accumulated. Because update is called once per example rather than once per batch, the weights are updated after every example, and the batch size only determines how the examples are grouped.

At the end of each epoch, the accumulated loss of each model is printed. After all epochs are completed, the trained models are saved to disk using the to_disk method.

nlp1 and nlp2 are the two separate pipelines created in the previous cell, each with its own NER component and its own optimizer. Only nlp2 has other pipeline components, and these are disabled while training; nlp1 contains nothing but the NER component.

# Code Break Down
**`nlp1.update([example1], sgd=optimizer1, drop=0.35, losses=losses1)`**<br>
**`nlp1`**: This is the NLP model instance that we want to update.<br>
**`[example1]`**: This is the list of training examples used for the update, here a single Example object. An Example pairs the Doc on which the model makes its predictions with a reference Doc that carries the gold-standard annotations, in this case the entity spans and their labels.<br>
**`sgd=optimizer1`**: This specifies the optimizer used for updating the model weights. Despite the parameter name, it does not have to be plain SGD; here it is the optimizer object returned by nlp1.create_optimizer().<br>
**`drop=0.35`**: This argument sets the dropout rate to 0.35. Dropout is a technique used to prevent overfitting in deep learning models, where a proportion of neurons are randomly dropped out during training to prevent the model from relying too heavily on any one neuron or set of neurons.<br>
**`losses=losses1`**: This is a dictionary that accumulates the loss per pipeline component. Each call adds its loss to the value stored under 'ner', so at the end of an epoch losses1 holds the total loss for that epoch.<br>
Overall, this command updates the nlp1 model instance with one example using optimizer1, with a dropout rate of 0.35, and keeps track of the loss values in the losses1 object.

# Testing the Models along with the Default NLP Model - Performance Evaluation

In [None]:
#Test the NLP Models
#Default NLP Model
nlp3= spacy.load('en_core_web_sm') 
#Adding labels to default NLP Model
ner3= nlp3.get_pipe("ner") 
#Iterate over the label values in column '0', not over the DataFrame's column names
for label in labels['0']:
    ner3.add_label(label)
#Testing on test_data
texts, annotations = zip(*test_data)
example1=[]
example2=[]
example3=[]
for i in range(0,len(texts)):
    doc1 = nlp1.make_doc(texts[i]) #Blank
    doc2 = nlp2.make_doc(texts[i]) #Existing
    doc3 = nlp3.make_doc(texts[i]) #Default
    example1.append(Example.from_dict(doc1, annotations[i]))
    example2.append(Example.from_dict(doc2, annotations[i]))
    example3.append(Example.from_dict(doc3, annotations[i]))
scores1=nlp1.evaluate(example1) #Blank
scores2=nlp2.evaluate(example2) #Existing
scores3=nlp3.evaluate(example3) #Default



This code evaluates three models on the test data in test_data: the two trained NER models, nlp1 and nlp2, and the default en_core_web_sm model, nlp3, which has not been trained on our corpus. The zip function is used to separate the texts from the annotations of the test examples.

A loop then iterates over each text in the test data. For each text, three Doc objects are created, one for each model, using the make_doc function. Then an Example object is created from each Doc and its corresponding annotations using the Example.from_dict method. These Example objects are stored in three separate lists called example1, example2 and example3.

After all examples in the test data have been processed, the evaluate method is called on each model with its corresponding list of Example objects. This method runs the model on the texts, compares the predictions with the annotations, and returns a dictionary of scores. The dictionary contains the overall precision, recall and F1-score for entities (ents_p, ents_r, ents_f), the same metrics for each entity label (ents_per_type), and the processing speed.

The scores are stored in the scores1, scores2 and scores3 variables, respectively, which can be used to compare the performance of the three models on the test data.
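How entity-level precision and recall arise from predictions can be sketched with sets of spans. This assumes exact-match scoring, where a predicted entity counts as correct only if its start, end and label all match an annotated entity; `precision_recall` is a hypothetical helper and the spans are made up.

```python
def precision_recall(gold, predicted):
    """Compare two collections of (start, end, label) spans by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Made-up annotations for one text: three annotated entities, two predictions.
gold_spans = [(4, 11, "PERSON"), (35, 46, "PORTAL"), (50, 72, "CSC")]
predicted_spans = [(4, 11, "PERSON"), (35, 46, "ORG")]

precision, recall = precision_recall(gold_spans, predicted_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The second prediction has the right boundaries but the wrong label, so it counts as an error for both metrics.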

In [None]:
#Dropping the Columns with no values
score_df1= (pd.DataFrame.from_dict(scores1)).dropna(axis=1) #Blank
score_df2= (pd.DataFrame.from_dict(scores2)).dropna(axis=1) #Existing
score_df3= (pd.DataFrame.from_dict(scores3)).dropna(axis=1) #Default
#Renaming the required columns
score_df1.rename(columns = {'ents_p':'ents_p1', 'ents_r':'ents_r1', 'ents_f':'ents_f1', 'speed':'speed1'}, inplace = True)
score_df2.rename(columns = {'ents_p':'ents_p2', 'ents_r':'ents_r2', 'ents_f':'ents_f2', 'speed':'speed2'}, inplace = True)
score_df3.rename(columns = {'ents_p':'ents_p3', 'ents_r':'ents_r3', 'ents_f':'ents_f3', 'speed':'speed3'}, inplace = True)
#Dropping the irrelevant Columns
score_df1.drop(['token_acc','token_p','token_r','token_f','ents_per_type'], axis=1,inplace=True)
score_df2.drop(['token_acc','token_p','token_r','token_f','ents_per_type'], axis=1,inplace=True)
score_df3.drop(['token_acc','token_p','token_r','token_f','ents_per_type'], axis=1,inplace=True)

In [None]:
#NEW
res_score=pd.concat([score_df1,score_df2,score_df3],axis=1,join='inner')
res_score

Unnamed: 0,ents_p1,ents_r1,ents_f1,speed1,ents_p2,ents_r2,ents_f2,speed2,ents_p3,ents_r3,ents_f3,speed3
CSC,0.989529,0.992126,0.990826,15324.516909,0.996078,1.0,0.998035,4242.044031,0.341373,0.24147,0.282859,4338.991815
DOCUMENT,0.989529,0.992126,0.990826,15324.516909,0.996078,1.0,0.998035,4242.044031,0.341373,0.24147,0.282859,4338.991815
MOBILE_APP,0.989529,0.992126,0.990826,15324.516909,0.996078,1.0,0.998035,4242.044031,0.341373,0.24147,0.282859,4338.991815
ORG,0.989529,0.992126,0.990826,15324.516909,0.996078,1.0,0.998035,4242.044031,0.341373,0.24147,0.282859,4338.991815
PERMISSION,0.989529,0.992126,0.990826,15324.516909,0.996078,1.0,0.998035,4242.044031,0.341373,0.24147,0.282859,4338.991815
PERSON,0.989529,0.992126,0.990826,15324.516909,0.996078,1.0,0.998035,4242.044031,0.341373,0.24147,0.282859,4338.991815
PORTAL,0.989529,0.992126,0.990826,15324.516909,0.996078,1.0,0.998035,4242.044031,0.341373,0.24147,0.282859,4338.991815
PROCESS,0.989529,0.992126,0.990826,15324.516909,0.996078,1.0,0.998035,4242.044031,0.341373,0.24147,0.282859,4338.991815
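Every row of the table above holds the same numbers because the overall scores are scalars that pandas repeats for each label when the dictionary is turned into a DataFrame; the per-label figures live in the `ents_per_type` entry, which the earlier cell drops. The sketch below shows one way to read that entry, assuming it maps each label to a dictionary with the keys `p`, `r` and `f`. The numbers are placeholders, not results from this notebook, and `per_label_rows` is a hypothetical helper.

```python
def per_label_rows(scores):
    """Return (label, precision, recall, f1) tuples sorted by label."""
    per_type = scores.get("ents_per_type") or {}
    return [(label, m["p"], m["r"], m["f"]) for label, m in sorted(per_type.items())]

# Placeholder scores dictionary, shaped like the output of nlp.evaluate().
example_scores = {
    "ents_p": 0.9,
    "ents_r": 0.8,
    "ents_f": 0.85,
    "ents_per_type": {
        "PORTAL": {"p": 1.0, "r": 0.5, "f": 0.67},
        "CSC": {"p": 0.8, "r": 1.0, "f": 0.89},
    },
}

for label, p, r, f in per_label_rows(example_scores):
    print(f"{label:<8} p={p:.2f} r={r:.2f} f={f:.2f}")
```

Applied to scores1, scores2 and scores3, the same function would show how each model performs on each of the eight labels.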


In [None]:
#Getting Average of all the scores
avg_val=[]
for v in res_score:
    total=0  #named 'total' to avoid shadowing the built-in sum()
    for i in res_score[v].values:
        total= total+i
    avg_val.append(total/len(res_score))

#For NLP Model 1 - [Trained on Blank Model]
avg_precision_1=str(round(avg_val[0],4))+" ("+str(round(avg_val[0]*100,2))+" %)"
avg_recall_1=str(round(avg_val[1],4))+" ("+str(round(avg_val[1]*100,2))+" %)"
avg_f1_score_1=str(round(avg_val[2],4))+" ("+str(round(avg_val[2]*100,2))+" %)"
avg_speed_1=round(avg_val[3],2)

#For NLP Model 2 - [Trained on Pre-existing Model]
avg_precision_2=str(round(avg_val[4],4))+" ("+str(round(avg_val[4]*100,2))+" %)"
avg_recall_2=str(round(avg_val[5],4))+" ("+str(round(avg_val[5]*100,2))+" %)"
avg_f1_score_2=str(round(avg_val[6],4))+" ("+str(round(avg_val[6]*100,2))+" %)"
avg_speed_2=round(avg_val[7],2)

#For Default NLP Model 
avg_precision_3=str(round(avg_val[8],4))+" ("+str(round(avg_val[8]*100,2))+" %)"
avg_recall_3=str(round(avg_val[9],4))+" ("+str(round(avg_val[9]*100,2))+" %)"
avg_f1_score_3=str(round(avg_val[10],4))+" ("+str(round(avg_val[10]*100,2))+" %)"
avg_speed_3=round(avg_val[11],2)

#Creating a DataFrame for all the three Scores
data = {'Metric': ['Precision', 'Recall', 'F1 Score', 'Speed (examples processed/sec)'],
        'Default NLP Model': [avg_precision_3, avg_recall_3, avg_f1_score_3, avg_speed_3],
        'NLP Model 1 -[Blank Base]': [avg_precision_1, avg_recall_1, avg_f1_score_1, avg_speed_1],
        'NLP Model 2 -[Pre-trained Base]': [avg_precision_2, avg_recall_2, avg_f1_score_2, avg_speed_2]}
performance_df = pd.DataFrame(data)
from tabulate import tabulate
print("Performance Evaluation\n",tabulate(performance_df, headers='keys', tablefmt='psql'))

Performance Evaluation
 +----+--------------------------------+---------------------+-----------------------------+-----------------------------------+
|    | Metric                         | Default NLP Model   | NLP Model 1 -[Blank Base]   | NLP Model 2 -[Pre-trained Base]   |
|----+--------------------------------+---------------------+-----------------------------+-----------------------------------|
|  0 | Precision                      | 0.3414 (34.14 %)    | 0.9895 (98.95 %)            | 0.9961 (99.61 %)                  |
|  1 | Recall                         | 0.2415 (24.15 %)    | 0.9921 (99.21 %)            | 1.0 (100.0 %)                     |
|  2 | F1 Score                       | 0.2829 (28.29 %)    | 0.9908 (99.08 %)            | 0.998 (99.8 %)                    |
|  3 | Speed (examples processed/sec) | 4338.99             | 15324.52                    | 4242.04                           |
+----+--------------------------------+---------------------+-----------------------------+-----------------------------------+

**Precision**: Precision is the ratio of correctly predicted entities to the total number of predicted entities. It measures how many of the model's predictions were actually correct. For example, if the model predicts 100 entities and 85 of them are correct, then the precision is 85%.

**Recall**: Recall is the ratio of correctly predicted entities to the total number of entities actually present in the annotated data. It measures how many of the actual entities the model found. For example, if the test data contains 100 annotated entities and the model correctly identifies 92 of them, then the recall is 92%.

**F1 score**: The F1 score is the harmonic mean of precision and recall, calculated as 2*(precision * recall)/(precision+recall). It ranges from 0 to 1, where a score of 1 represents perfect precision and recall, and a score of 0 represents the worst possible performance.

**Speed**: Speed measures how fast the model processes text. spaCy's evaluate method reports it as the number of words processed per second, so the "examples processed/sec" label in the table above should be read as words per second.
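The F1 formula can be checked against the precision and recall values reported in the table above. This is a small arithmetic sketch; `f1_score` is a helper written for this check.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for the two trained models.
f1_model_1 = f1_score(0.989529, 0.992126)
f1_model_2 = f1_score(0.996078, 1.0)

print(round(f1_model_1, 4), round(f1_model_2, 4))
```

Both results agree with the ents_f1 and ents_f2 columns to the precision shown there.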


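NLP Model 1 was trained from scratch on the 1,032 training examples only. Its high scores suggest that the eight entity types follow patterns regular enough to be learned from this corpus, and that the test set, which is drawn from the same corpus, closely resembles the training data.

NLP Model 2 starts from en_core_web_sm, which has been pre-trained on a large corpus of text, and is then fine-tuned on the same training examples. It scores slightly higher than NLP Model 1 on all three metrics (an F1 score of 99.8% versus 99.08%). The default model, which has never been trained on the custom labels, reaches an F1 score of only 28.29%.

In addition to these metrics, the comparison also includes a measure of speed. NLP Model 1 has the highest speed (15324.52), followed by the default NLP model (4338.99) and NLP Model 2 (4242.04). A likely reason is that the blank pipeline contains only the NER component, whereas the two pipelines based on en_core_web_sm also run their other components during evaluation.

Overall, this comparison shows that training on the task-specific corpus matters most: both trained models far outperform the default model. Starting from the pre-trained pipeline gives a small further gain in accuracy, at the cost of reduced speed. The choice of model will depend on the specific needs of the application, balancing the need for accuracy with the need for speed.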





## Conclusion
The Precision, Recall and F1 Score of NLP Model 2, which is trained on the existing "en_core_web_sm" model, are higher than those of NLP Model 1, which is trained from the blank spaCy "en" model.<br>
Hence, we shall use NLP Model 2.
#End Of File


In [None]:
import spacy
from spacy import displacy
from IPython.display import display, HTML

nlp= spacy.load('/content/drive/MyDrive/Final (Completed)/nlp_model_2')
t="The Patient checks eligibility through PMJAY Kiosk or EHCP Registration Desk at any Empanelled Hospital. \
The BIS verifies the eligibility of the Patient and decides if he or she is a PMJAY Beneficiary. \
The Doctor decides if the Patient needs Hospitalization, Medication or further Diagnostics."
doc=nlp(t)
# jupyter=False makes render() return the markup instead of displaying it directly
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(html))
