# NER Pipeline with BERT and Glove Embedding in Spark NLP

In this tutorial, we demonstrate how to build a NER pipeline using BERT and GloVe embeddings.

Resources:
1. [Blog](https://towardsdatascience.com/named-entity-recognition-ner-with-bert-in-spark-nlp-874df20d1d77)
2. [Models Hub](https://nlp.johnsnowlabs.com/models?q=bert&task=Embeddings)
3. [Original Code](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/blogposts/3.NER_with_BERT.ipynb)
4. [Webinar](https://events.johnsnowlabs.com/webinar-state-of-the-art-named-entity-recognition-using-bert?hsCtaTracking=c5bfb98f-bf4e-4b81-92d0-4ad149f9da05%7C47b0af91-4c83-401d-9163-2417863ed82b)

## 1. Import libraries and download datasets

In [1]:
import sparknlp
import numpy as np
import pandas as pd
from sparknlp.annotator import *
from sparknlp.base import *

In [2]:
# if you have GPU
# spark = sparknlp.start(gpu=True)
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.0.1
Apache Spark version:  3.1.1


### 1.2. Dataset summary

The [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) shared task concerns language-independent named entity recognition. It concentrates on four types of named entities: persons (PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC) that do not belong to the previous three groups.

The CoNLL-2003 data files contain four columns separated by a single space. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. A word with tag O is not part of a phrase. In the original dataset, the chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE; only when two phrases of the same type immediately follow each other does the first word of the second phrase get the tag B-TYPE to show that it starts a new phrase. This is the IOB1 scheme. Note that the copy of the dataset used in this tutorial follows the IOB2 tagging scheme instead, in which the first word of every phrase is tagged B-TYPE.
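
To make the difference between the two tagging schemes concrete, here is a minimal sketch that converts one sentence's tags from IOB1 to IOB2. The function name `iob1_to_iob2` and the example tag sequence are made up for illustration; they are not part of Spark NLP or of the dataset files.

```python
def iob1_to_iob2(tags):
    """Convert the tags of one sentence from IOB1 to IOB2."""
    converted = []
    prev_type = None  # entity type of the previous token, None if it was "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            prev_type = None
            continue
        prefix, entity_type = tag.split("-", 1)
        # In IOB1 an I- tag opens a phrase whenever the previous token was
        # outside any phrase or belonged to a phrase of a different type.
        if prefix == "I" and prev_type != entity_type:
            prefix = "B"
        converted.append(f"{prefix}-{entity_type}")
        prev_type = entity_type
    return converted


# Hypothetical IOB1 sequence: two adjacent PER phrases at the end
iob1_tags = ["I-ORG", "O", "I-MISC", "O", "I-PER", "I-PER", "B-PER", "O"]
print(iob1_to_iob2(iob1_tags))
```

Existing B- tags are left untouched, so running the function on data that is already IOB2 returns it unchanged.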

You can annotate your own data in the CoNLL format and then train a custom NER model in Spark NLP.
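
As a rough sketch of what producing such a file could look like, the helper below renders already-annotated sentences as CoNLL-2003 text. The function name `to_conll` is hypothetical, and the three tokens are copied from the first lines of `eng.train`; the result would still need to be saved to a file before a CoNLL reader can load it.

```python
def to_conll(sentences):
    """Render sentences as CoNLL-2003 text: one 'word POS chunk NER' line per token."""
    lines = ["-DOCSTART- -X- -X- O", ""]
    for sentence in sentences:
        for word, pos, chunk, ner in sentence:
            lines.append(f"{word} {pos} {chunk} {ner}")
        lines.append("")  # a blank line ends each sentence
    return "\n".join(lines)


# One (partial) sentence, each token given as (word, POS, chunk, NER)
my_sentences = [
    [
        ("EU", "NNP", "B-NP", "B-ORG"),
        ("rejects", "VBZ", "B-VP", "O"),
        ("German", "JJ", "B-NP", "B-MISC"),
    ]
]

print(to_conll(my_sentences))
```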

In [3]:
from urllib.request import urlretrieve

urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.train',
           'eng.train')

urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.testa',
           'eng.testa')

('eng.testa', <http.client.HTTPMessage at 0x7ff38dd9b490>)

In [4]:
with open("eng.train") as f:
    c = f.read()

print(c[:200])

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Black
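
The raw text printed above can also be parsed by hand, which helps when checking your own annotated files before handing them to Spark NLP. The sketch below is a plain-Python illustration, not the Spark NLP reader; `parse_conll` is a hypothetical helper, and the sample string repeats the first sentence shown in the output.

```python
def parse_conll(text):
    """Split CoNLL-2003 text into sentences of (word, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and -DOCSTART- markers separate sentences and documents
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
"""

sentences = parse_conll(sample)
print([(word, ner) for word, _, _, ner in sentences[0]])
```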


## 2. Building Bert NER pipeline

### 2.1. Reading annotated data into Spark NLP CoNLL reader

In [5]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, './eng.train')
training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 7, BR...|[{pos, 0, 7, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [None]:
training_data.printSchema()

In [7]:
training_data.count()

14041

In [8]:
training_data.select("label.result","label.embeddings").show()

+--------------------+--------------------+
|              result|          embeddings|
+--------------------+--------------------+
|[B-ORG, O, B-MISC...|[[], [], [], [], ...|
|      [B-PER, I-PER]|            [[], []]|
|          [B-LOC, O]|            [[], []]|
|[O, B-ORG, I-ORG,...|[[], [], [], [], ...|
|[B-LOC, O, O, O, ...|[[], [], [], [], ...|
|[O, O, O, O, O, O...|[[], [], [], [], ...|
|[O, O, O, O, O, O...|[[], [], [], [], ...|
|[O, O, O, O, O, O...|[[], [], [], [], ...|
|[B-PER, O, B-MISC...|[[], [], [], [], ...|
|[O, B-PER, O, O, ...|[[], [], [], [], ...|
|[B-MISC, O, O, B-...|[[], [], [], [], ...|
|                 [O]|                [[]]|
|[O, B-LOC, O, B-L...|[[], [], [], [], ...|
|[O, B-ORG, O, O, ...|[[], [], [], [], ...|
|[O, O, O, O, O, O...|[[], [], [], [], ...|
|[B-MISC, O, O, O,...|[[], [], [], [], ...|
|[O, O, O, O, O, O...|[[], [], [], [], ...|
|[B-LOC, O, O, O, ...|[[], [], [], [], ...|
|[B-LOC, O, O, O, ...|[[], [], [], [], ...|
|[O, O, O, O, O, O...|[[], [], [

In [9]:
from sparknlp.training import CoNLL

test_data = CoNLL().readDataset(spark, './eng.testa')
test_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[{document, 0, 64...|[{document, 0, 64...|[{token, 0, 6, CR...|[{pos, 0, 6, NNP,...|[{named_entity, 0...|
|   LONDON 1996-08-30|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 5, LO...|[{pos, 0, 5, NNP,...|[{named_entity, 0...|
|West Indian all-r...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 3, We...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



### 2.2. Loading Bert Embedding Model

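The BertEmbeddings() annotator takes the sentence and token columns and writes BERT embeddings to the bert column, one vector per token. BERT base models produce 768-dimensional vectors; the small_bert_L2_128 model used in this tutorial has a hidden size of 128, so each token is mapped to a 128-dimensional vector.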

More Bert models can be found at [Models Hub](https://nlp.johnsnowlabs.com/models?q=bert&task=Embeddings). 

In [10]:
# we use BERT Tiny
bert_annotator = BertEmbeddings.pretrained('small_bert_L2_128', 'en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert")\
 .setBatchSize(8)

small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


#### Adding Bert Embeddings to the test dataset, so it can be used for evaluation during training

In [11]:
test_data = bert_annotator.transform(test_data)
test_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|                bert|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[{document, 0, 64...|[{document, 0, 64...|[{token, 0, 6, CR...|[{pos, 0, 6, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|   LONDON 1996-08-30|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 5, LO...|[{pos, 0, 5, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|West Indian all-r...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 3, We...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|[{word_embeddings...|
+--------------------+--------------------+--------------------+--------------------+--------------------+

In [12]:
test_data.count()

3250

#### Saving test data to local disk for evaluation during model training

In [14]:
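# save the test dataset with embeddings for evaluation during training
# overwrite mode lets this cell be re-run without a "path already exists" error
test_data.write.mode("overwrite").parquet("test_withEmbeds.parquet")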

In [15]:
test_data.select("bert.result","bert.embeddings",'label.result').show()

+--------------------+--------------------+--------------------+
|              result|          embeddings|              result|
+--------------------+--------------------+--------------------+
|[cricket, -, leic...|[[-1.6099558, 0.5...|[O, O, B-ORG, O, ...|
|[london, 1996-08-30]|[[-0.66074246, 0....|          [B-LOC, O]|
|[west, indian, al...|[[-1.2108909, 0.9...|[B-MISC, I-MISC, ...|
|[their, stay, on,...|[[-0.93976337, 0....|[O, O, O, O, O, O...|
|[after, bowling, ...|[[-1.1267805, 1.1...|[O, O, B-ORG, O, ...|
|[trailing, by, 21...|[[-1.8359267, 0.4...|[O, O, O, O, B-OR...|
|[essex, ,, howeve...|[[-1.2150192, 0.2...|[B-ORG, O, O, O, ...|
|[hussain, ,, cons...|[[-1.6078961, 0.5...|[B-PER, O, O, O, ...|
|[by, the, close, ...|[[-1.8683753, 1.1...|[O, O, O, B-ORG, ...|
|[at, the, oval, ,...|[[-1.8740947, 0.6...|[O, O, B-LOC, O, ...|
|[he, was, well, b...|[[-1.660714, 1.22...|[O, O, O, O, O, B...|
|[derbyshire, kept...|[[-1.1823795, 0.2...|[B-ORG, O, O, O, ...|
|[australian, tom,...|[[-

### 2.3. Defining NerDLApproach() Annotator:

* setInputCols(["sentence", "token", "bert"]) : the columns that the NER model uses to generate features.
* setLabelColumn("label") : the target column, where the ground truth (original label) is stored.
* setOutputCol("ner") : the column the predictions are written to.
* setMaxEpochs(1) : the number of training epochs.
* setVerbose(1) : the level of logging during training.
* setValidationSplit(0.2) : the proportion of the training dataset that is held out and validated against the model after each epoch. The value must be between 0.0 and 1.0; the default is 0.0, which turns validation off.
* setEvaluationLogExtended(True) : whether to extend the validation logs with the time and the evaluation of each label. The default is False.
* setEnableOutputLogs(True) : whether to write logs to the log folder. When set to True, the logs and training metrics are written to a folder in the home folder.
* setIncludeConfidence(True) : whether to include confidence scores in the annotation metadata.
* setTestDataset("test_withEmbeds.parquet") : the path to the test data. If set, statistics are calculated on it during training and recorded in the log file at each epoch when setVerbose is 1. The test data is also in CoNLL format, but with embeddings added by bert_annotator and saved to disk as shown above. You don't have to set this if you don't need to evaluate your model on an unseen test set in Spark.

In [16]:
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "bert"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(5)\
  .setLr(0.001)\
  .setPo(0.005)\
  .setBatchSize(32)\
  .setEvaluationLogExtended(True) \
  .setEnableOutputLogs(True)\
  .setTestDataset("test_withEmbeds.parquet")

pipeline = Pipeline(
    stages = [
    bert_annotator,
    nerTagger
  ])

You can also set the learning rate (setLr), the learning rate decay coefficient (setPo), the batch size (setBatchSize), and the dropout rate (setDropout). Please see the [official API documentation](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ner/dl/NerDLApproach.html) for the entire list.
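
To get a feel for how a decay coefficient interacts with the learning rate, the sketch below assumes an inverse-time schedule of the form lr / (1 + po * epoch). That formula is an assumption made for illustration and is not confirmed by this tutorial, so check the Spark NLP source or documentation before relying on it. The values 0.001 and 0.005 are the ones passed to setLr and setPo above.

```python
def decayed_learning_rate(lr, po, epoch):
    """Inverse-time decay. ASSUMPTION: the exact formula used by NerDLApproach
    is not stated in this tutorial; this form is only illustrative."""
    return lr / (1.0 + po * epoch)


# lr and po as configured in the cell above, over the 5 training epochs
for epoch in range(5):
    print(epoch, decayed_learning_rate(lr=0.001, po=0.005, epoch=epoch))
```

Under this assumption, a po as small as 0.005 lowers the learning rate by only about 2% over five epochs, so the schedule stays close to constant.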

### 2.4 Model Training

In [17]:
%%time

ner_model = pipeline.fit(training_data.limit(1000))

CPU times: user 58.5 ms, sys: 99.7 ms, total: 158 ms
Wall time: 2min 43s


In [18]:
# let's save our trained NER model on disk
# so we can load it in a new session or move it to another location
# since we fit NerDL model inside the pipeline, we can access it via stages
ner_model.stages[1].write().overwrite().save('./NER_bert_20200418')

In [19]:
test_data.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|                bert|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[{document, 0, 64...|[{document, 0, 64...|[{token, 0, 6, CR...|[{pos, 0, 6, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|   LONDON 1996-08-30|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 5, LO...|[{pos, 0, 5, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|West Indian all-r...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 3, We...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|Their stay on top...|[{document, 0, 20...|[{document, 0, 20...|[{token, 0, 4, Th...|[{pos, 0, 4, PRP$...|

### 2.5. Evaluation of the trained model

In [20]:
# feed only the sentence, token, and label columns from the test dataset;
# the pipeline recomputes the bert embeddings itself
predictions = ner_model.transform(test_data.select("sentence", "token", "label"))
predictions.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|            sentence|               token|               label|                bert|                 ner|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[{document, 0, 64...|[{token, 0, 6, CR...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
|[{document, 0, 16...|[{token, 0, 5, LO...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
|[{document, 0, 18...|[{token, 0, 3, We...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [21]:
predictions.select('token.result','label.result','ner.result').show(truncate=40)

+----------------------------------------+----------------------------------------+----------------------------------------+
|                                  result|                                  result|                                  result|
+----------------------------------------+----------------------------------------+----------------------------------------+
|[CRICKET, -, LEICESTERSHIRE, TAKE, OV...|   [O, O, B-ORG, O, O, O, O, O, O, O, O]|       [O, O, O, O, O, O, O, O, O, O, O]|
|                    [LONDON, 1996-08-30]|                              [B-LOC, O]|                              [B-LOC, O]|
|[West, Indian, all-rounder, Phil, Sim...|[B-MISC, I-MISC, O, B-PER, I-PER, O, ...|[O, B-MISC, O, O, I-PER, O, O, O, O, ...|
|[Their, stay, on, top, ,, though, ,, ...|[O, O, O, O, O, O, O, O, O, O, O, O, ...|[O, O, O, O, O, O, O, O, O, O, O, O, ...|
|[After, bowling, Somerset, out, for, ...|[O, O, B-ORG, O, O, O, O, O, O, O, O,...|[O, O, I-PER, O, O, O, O, O, O, O, O,...|


In [22]:
predictions.printSchema()

root
 |-- sentence: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (n

In [23]:
import pyspark.sql.functions as F

predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction_results")).show(truncate=False)

+--------------+------------+------------------+
|token         |ground_truth|prediction_results|
+--------------+------------+------------------+
|CRICKET       |O           |O                 |
|-             |O           |O                 |
|LEICESTERSHIRE|B-ORG       |O                 |
|TAKE          |O           |O                 |
|OVER          |O           |O                 |
|AT            |O           |O                 |
|TOP           |O           |O                 |
|AFTER         |O           |O                 |
|INNINGS       |O           |O                 |
|VICTORY       |O           |O                 |
|.             |O           |O                 |
|LONDON        |B-LOC       |B-LOC             |
|1996-08-30    |O           |O                 |
|West          |B-MISC      |O                 |
|Indian        |I-MISC      |B-MISC            |
|all-rounder   |O           |O                 |
|Phil          |B-PER       |O                 |
|Simmons       |I-PE

#### Convert to Pandas

In [24]:
import pandas as pd

df = predictions.select('token.result','label.result','ner.result').toPandas()
df

| | result | result.1 | result.2 |
|---|---|---|---|
| 0 | [CRICKET, -, LEICESTERSHIRE, TAKE, OVER, AT, T... | [O, O, B-ORG, O, O, O, O, O, O, O, O] | [O, O, O, O, O, O, O, O, O, O, O] |
| 1 | [LONDON, 1996-08-30] | [B-LOC, O] | [B-LOC, O] |
| 2 | [West, Indian, all-rounder, Phil, Simmons, too... | [B-MISC, I-MISC, O, B-PER, I-PER, O, O, O, O, ... | [O, B-MISC, O, O, I-PER, O, O, O, O, O, O, O, ... |
| 3 | [Their, stay, on, top, ,, though, ,, may, be, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, B-ORG,... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 4 | [After, bowling, Somerset, out, for, 83, on, t... | [O, O, B-ORG, O, O, O, O, O, O, O, O, B-LOC, I... | [O, O, I-PER, O, O, O, O, O, O, O, O, B-PER, O... |
| ... | ... | ... | ... |
| 3245 | [But, the, prices, may, move, in, a, close, ra... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] |
| 3246 | [Brokers, said, blue, chips, like, IDLC, ,, Ba... | [O, O, O, O, O, B-ORG, O, B-ORG, I-ORG, O, B-O... | [O, O, O, O, O, O, O, B-LOC, O, O, B-LOC, O, O... |
| 3247 | [They, said, there, was, still, demand, for, b... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 3248 | [The, DSE, all, share, price, index, closed, 2... | [O, B-ORG, O, O, O, O, O, O, O, O, O, O, O, O,... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
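
Once the labels and predictions are available as Python lists, per-label scores can be computed without Spark. The following is a minimal token-level sketch; `token_level_scores` is a hypothetical helper, and token-level scoring is a simplification, since NER is usually evaluated at the entity level. The two rows used as input are the first two rows of the predictions shown above.

```python
from collections import Counter


def token_level_scores(gold_rows, pred_rows):
    """Return {label: (precision, recall, f1)} computed token by token."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_rows, pred_rows):
        for g, p in zip(gold, pred):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1  # p was predicted but is wrong
                fn[g] += 1  # g was expected but missed
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# First two rows of the label and ner columns shown above
gold_rows = [["O", "O", "B-ORG", "O", "O", "O", "O", "O", "O", "O", "O"], ["B-LOC", "O"]]
pred_rows = [["O"] * 11, ["B-LOC", "O"]]

scores = token_level_scores(gold_rows, pred_rows)
for label, (precision, recall, f1) in scores.items():
    print(f"{label:6s} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With the Pandas frame above, the same function could be called on the label and prediction columns converted to lists.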


### 2.6 Building Prediction Pipeline

In [25]:
from pyspark.ml import Pipeline

document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

bert = BertEmbeddings.pretrained('small_bert_L2_128', 'en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert")\
 .setCaseSensitive(True)

loaded_ner_model = NerDLModel.load("NER_bert_20200418")\
 .setInputCols(["sentence", "token", "bert"])\
 .setOutputCol("ner")

converter = NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        bert,
        loaded_ner_model, # NerDL model trained above on Bert embeddings
        converter])

small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


In [26]:
empty_data = spark.createDataFrame([['']]).toDF("text")
empty_data.show()

+----+
|text|
+----+
|    |
+----+



In [27]:
prediction_model = ner_prediction_pipeline.fit(empty_data)

In [28]:
text = "Peter Parker is a nice guy and lives in New York."
sample_data = spark.createDataFrame([[text]]).toDF("text")
sample_data.show()

+--------------------+
|                text|
+--------------------+
|Peter Parker is a...|
+--------------------+



In [29]:
preds = prediction_model.transform(sample_data)
preds.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+
|                text|            document|            sentence|               token|                bert|                 ner|ner_span|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+
|Peter Parker is a...|[{document, 0, 48...|[{document, 0, 48...|[{token, 0, 4, Pe...|[{word_embeddings...|[{named_entity, 0...|      []|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+



In [30]:
preds.select('ner.result').take(1)

[Row(result=['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])]

In [31]:
preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
.select(F.expr("entities['0']").alias("chunk"),
        F.expr("entities['1'].entity").alias("entity")).show(truncate=False)

+-----+------+
|chunk|entity|
+-----+------+
+-----+------+



## 3. Building NER Pipeline with Glove Embedding 

### 3.1 NER Model with Glove Embeddings

In [32]:
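# load the default pretrained GloVe model (glove_100d) explicitly by name
glove = WordEmbeddingsModel.pretrained('glove_100d', 'en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("glove")\
 .setCaseSensitive(False)

# re-read the test set, add GloVe embeddings, and save it for evaluation during training
test_data = CoNLL().readDataset(spark, './eng.testa')
test_data = glove.transform(test_data.limit(1000))
test_data.write.mode("overwrite").parquet("test_withGloveEmbeds.parquet")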

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [33]:
nerTagger.setInputCols(["sentence", "token", "glove"])
nerTagger.setTestDataset("test_withGloveEmbeds.parquet")

glove_pipeline = Pipeline(
    stages = [
    glove,
    nerTagger
  ])

In [34]:
%%time

ner_model_v3 = glove_pipeline.fit(training_data.limit(1000))

CPU times: user 48.5 ms, sys: 52.7 ms, total: 101 ms
Wall time: 1min 49s


In [35]:
# let's save our trained NER model on disk
# so we can load it in a new session or move it to another location
# since we fit NerDL model inside the pipeline, we can access it via stages
ner_model_v3.stages[1].write().overwrite().save('./NER_glove_20200418')

In [36]:
predictions_v3 = ner_model_v3.transform(test_data.limit(10))

# test_data.sample(False,0.1,0)
predictions_v3.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).show(truncate=False)

+--------------+------------+----------+
|token         |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET       |O           |O         |
|-             |O           |O         |
|LEICESTERSHIRE|B-ORG       |B-ORG     |
|TAKE          |O           |O         |
|OVER          |O           |O         |
|AT            |O           |O         |
|TOP           |O           |O         |
|AFTER         |O           |O         |
|INNINGS       |O           |O         |
|VICTORY       |O           |O         |
|.             |O           |O         |
|LONDON        |B-LOC       |B-LOC     |
|1996-08-30    |O           |O         |
|West          |B-MISC      |B-LOC     |
|Indian        |I-MISC      |O         |
|all-rounder   |O           |O         |
|Phil          |B-PER       |B-PER     |
|Simmons       |I-PER       |I-PER     |
|took          |O           |O         |
|four          |O           |O         |
+--------------+------------+----------+
only showing top

### 3.2 Using glove embeddings pipeline

In [37]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

# The original code loads NER_bert_20200219
loaded_ner_model = NerDLModel.load("NER_glove_20200418")\
 .setInputCols(["sentence", "token", "glove"])\
 .setOutputCol("ner")

converter = NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_span")

glove_ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        glove,
        loaded_ner_model,
        converter])

In [38]:
empty_data = spark.createDataFrame([['']]).toDF("text")
empty_data.show()

+----+
|text|
+----+
|    |
+----+



In [39]:
glove_prediction_model = glove_ner_prediction_pipeline.fit(empty_data)

In [40]:
text = "Peter Parker is a nice guy and lives in New York."
sample_data = spark.createDataFrame([[text]]).toDF("text")
sample_data.show()

+--------------------+
|                text|
+--------------------+
|Peter Parker is a...|
+--------------------+



In [41]:
preds = glove_prediction_model.transform(sample_data)
preds.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|               glove|                 ner|            ner_span|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Peter Parker is a...|[{document, 0, 48...|[{document, 0, 48...|[{token, 0, 4, Pe...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 11, P...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [42]:
preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
.select(F.expr("entities['0']").alias("chunk"),
        F.expr("entities['1'].entity").alias("entity")).show(truncate=False)

+------------+------+
|chunk       |entity|
+------------+------+
|Peter Parker|PER   |
|York        |LOC   |
+------------+------+



## 4. Pretrained NER Pipelines
### 4.1. Pretrained pipeline: "recognize_entities_dl" 

In [43]:
from sparknlp.pretrained import PretrainedPipeline

pretrained_pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')

#onto_recognize_entities_sm
#explain_document_dl

recognize_entities_dl download started this may take some time.
Approx size to download 160.1 MB
[OK!]


In [44]:
#text = "The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."
text = "Peter Parker is a nice guy and lives in New York."
result = pretrained_pipeline.annotate(text)

list(zip(result['token'], result['ner']))

[('Peter', 'B-PER'),
 ('Parker', 'I-PER'),
 ('is', 'O'),
 ('a', 'O'),
 ('nice', 'O'),
 ('guy', 'O'),
 ('and', 'O'),
 ('lives', 'O'),
 ('in', 'O'),
 ('New', 'B-LOC'),
 ('York', 'I-LOC'),
 ('.', 'O')]
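
The token-level tags above can be grouped into entity chunks, which is roughly the role NerConverter plays in the pipelines built earlier. The sketch below is a simplified plain-Python version: `tags_to_chunks` is a hypothetical helper, and joining tokens with a single space is an assumption that ignores the original character offsets.

```python
def tags_to_chunks(tokens, tags):
    """Group IOB2-tagged tokens into (chunk_text, entity_type) pairs."""
    chunks = []
    words, entity = [], None
    for token, tag in zip(tokens, tags):
        # A B- tag, or an I- tag of a different type, starts a new chunk
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != entity)
        if tag == "O" or starts_new:
            if words:
                chunks.append((" ".join(words), entity))
            words, entity = [], None
        if tag != "O":
            if not words:
                entity = tag[2:]
            words.append(token)
    if words:  # flush a chunk that runs to the end of the sentence
        chunks.append((" ".join(words), entity))
    return chunks


# Token and tag pairs copied from the output above
tokens = ["Peter", "Parker", "is", "a", "nice", "guy", "and", "lives", "in", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "O", "B-LOC", "I-LOC", "O"]

print(tags_to_chunks(tokens, tags))
# [('Peter Parker', 'PER'), ('New York', 'LOC')]
```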

### 4.2. Pretrained pipeline: "explain_document_dl" 

In [45]:
pretrained_pipeline2 = PretrainedPipeline('explain_document_dl', lang='en')

explain_document_dl download started this may take some time.
Approx size to download 169.3 MB
[OK!]


In [46]:
#text = "The Mona Lisa is a 16th centry oil painting created by Leonrdo. It's held at the Louvre in Paris."
text = "Peter Parker is a nice guy and lives in New York."

result2 = pretrained_pipeline2.annotate(text)

# line up every annotation type per token
list(zip(result2['token'],  result2['checked'], result2['pos'], result2['ner'],  result2['lemma'],  result2['stem']))

[('Peter', 'Peter', 'NNP', 'B-PER', 'Peter', 'peter'),
 ('Parker', 'Parker', 'NNP', 'I-PER', 'Parker', 'parker'),
 ('is', 'is', 'VBZ', 'O', 'be', 'i'),
 ('a', 'a', 'DT', 'O', 'a', 'a'),
 ('nice', 'nice', 'JJ', 'O', 'nice', 'nice'),
 ('guy', 'guy', 'NN', 'O', 'guy', 'gui'),
 ('and', 'and', 'CC', 'O', 'and', 'and'),
 ('lives', 'lives', 'NNS', 'O', 'life', 'live'),
 ('in', 'in', 'IN', 'O', 'in', 'in'),
 ('New', 'New', 'NNP', 'B-LOC', 'New', 'new'),
 ('York', 'York', 'NNP', 'I-LOC', 'York', 'york'),
 ('.', '.', '.', 'O', '.', '.')]

## 5. Performance Comparison of NER models

In [47]:
import pandas as pd

# take the first sentence from each prediction set and line the labels up per token
tokens = np.array(predictions.select('token.result').take(1))[0][0]
ground = np.array(predictions.select('label.result').take(1))[0][0]
label_bert_0 = np.array(predictions.select('ner.result').take(1))[0][0]
#label_bert_2 = np.array(predictions_v2.select('ner.result').take(1))[0][0]
label_glove = np.array(predictions_v3.select('ner.result').take(1))[0][0]

pd.DataFrame({'token': tokens, 'ground': ground, 'label_bert_0': label_bert_0, 'label_glove': label_glove})
              #'label_bert_2':label_bert_2,

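| | token | ground | label_bert_0 | label_glove |
|---|---|---|---|---|
| 0 | CRICKET | O | O | O |
| 1 | - | O | O | O |
| 2 | LEICESTERSHIRE | B-ORG | O | B-ORG |
| 3 | TAKE | O | O | O |
| 4 | OVER | O | O | O |
| 5 | AT | O | O | O |
| 6 | TOP | O | O | O |
| 7 | AFTER | O | O | O |
| 8 | INNINGS | O | O | O |
| 9 | VICTORY | O | O | O |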
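
The side-by-side table can be reduced to one number per model. The sketch below computes token accuracy for the ten tokens shown above, with the label lists copied from the table; `token_accuracy` is a hypothetical helper. Ten tokens from a single sentence are far too few to rank the models, so treat this only as a pattern to apply to the full prediction sets.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted label equals the ground truth."""
    matches = sum(g == p for g, p in zip(gold, pred))
    return matches / len(gold)


# Labels for the ten tokens listed in the table above
ground = ["O", "O", "B-ORG", "O", "O", "O", "O", "O", "O", "O"]
label_bert_0 = ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
label_glove = ["O", "O", "B-ORG", "O", "O", "O", "O", "O", "O", "O"]

print("bert :", token_accuracy(ground, label_bert_0))
print("glove:", token_accuracy(ground, label_glove))
```

Because most tokens are tagged O, accuracy stays high even when every entity is missed, which is why per-label precision and recall are usually more informative for NER.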


## 6. Using your own custom Word Embedding