<a href="https://colab.research.google.com/github/weichen-liao/Qard_lwc/blob/main/Qard_Case_Study_SparkOCR%2BSpacyNER_CamembertNER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description
This notebook presents a Spark-based implementation for extracting company entities (ORG) from a PDF.

The OCR is based on Spark-OCR: https://github.com/JohnSnowLabs/spark-ocr-workshop

Spark OCR extracts text from PDF files with relatively good accuracy. However, it is not free to use.

Two NER methods are tried and compared: spaCy (French model) and camembert-ner.

Conclusions are given at the end of this notebook.



### Upload the files: PDF & licence key (JSON)

In [2]:
from google.colab import files
uploaded = files.upload()

Saving test1.pdf to test1.pdf


In [3]:
!ls

sample_data  spark_nlp_for_healthcare_spark_ocr_3346.json  test1.pdf


### Read licence key

In [6]:
import os
import json

with open('spark_nlp_for_healthcare_spark_ocr_3346.json') as f:
    license_keys = json.load(f)

secret = license_keys['SPARK_OCR_SECRET']
os.environ['SPARK_OCR_LICENSE'] = license_keys['SPARK_OCR_LICENSE']
os.environ['JSL_OCR_LICENSE'] = license_keys['SPARK_OCR_LICENSE']
version = secret.split("-")[0]
print ('Spark OCR Version:', version)

Spark OCR Version: 3.8.0


### Install Dependencies

In [None]:
# Install Java
!apt-get update
!apt-get install -y openjdk-8-jdk
!java -version

# Install pyspark, SparkOCR, and SparkNLP
!pip install --ignore-installed -q pyspark==2.4.4
# Install Spark OCR from PyPI using the secret
!python -m pip install --upgrade spark-ocr==$version  --extra-index-url https://pypi.johnsnowlabs.com/$secret
# or install from local path
# %pip install --user ../../python/dist/spark-ocr-[version].tar.gz
!pip install --ignore-installed -q spark-nlp==2.5.2

# install spacy
! pip install spacy
! python -m spacy download en_core_web_sm
! python -m spacy download fr_core_news_sm

# install transformer for camembert-ner NER
! pip install transformers
! pip install sentencepiece

### Import Libraries

In [8]:
import pandas as pd
import numpy as np
import os

#Pyspark Imports
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

# Necessary imports from Spark OCR library
from sparkocr import start
from sparkocr.transformers import *
from sparkocr.enums import *
from sparkocr.utils import display_image, to_pil_image
from sparkocr.metrics import score
import pkg_resources

# import sparknlp packages
from sparknlp.annotator import *
from sparknlp.base import *

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

### Construct the OCR pipeline

In [13]:
pdf_to_image = PdfToImage() \
            .setInputCol("content") \
            .setOutputCol("image_raw") \
            .setKeepInput(True)

# Transform image to the binary color model
binarizer = ImageBinarizer() \
            .setInputCol("image_raw") \
            .setOutputCol("image") \
            .setThreshold(130)
# Run OCR for each region
ocr = ImageToText() \
            .setInputCol("image") \
            .setOutputCol("text") \
            .setIgnoreResolution(False) \
            .setPageSegMode(PageSegmentationMode.SPARSE_TEXT) \
            .setConfidenceThreshold(60)

#Render text with positions to Pdf document.
textToPdf = TextToPdf() \
            .setInputCol("positions") \
            .setInputImage("image") \
            .setInputText("text") \
            .setOutputCol("pdf") \
            .setInputContent("content")
# OCR pipeline
pipeline = PipelineModel(stages=[
            pdf_to_image,
            binarizer,
            ocr,
            textToPdf
        ])

### Start Spark Session

In [9]:
spark = start(secret=secret)
spark

Spark version: 3.0.2
Spark NLP version: 2.5.2
Spark OCR version: 3.8.0



### Load the pdf

In [10]:
image_df = spark.read.format("binaryFile").load('test1.pdf').cache()
image_df.show()

+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|file:/content/tes...|2021-11-22 00:16:55|208711|[25 50 44 46 2D 3...|
+--------------------+-------------------+------+--------------------+



### Run OCR pipeline on every page

In [15]:
result = pipeline.transform(image_df).cache()
result_arr = []
for r in result.distinct().collect():
  for page in r.text:
    result_arr.append(page)

### Spacy NER

In [27]:
import spacy
from spacy import displacy
import fr_core_news_sm

nlp = fr_core_news_sm.load()
for i, text in enumerate(result_arr):
  print('----------------------------', 'page', i, '----------------------------')
  doc = nlp(text)
  displacy.render(doc, style='ent',jupyter=True, options={'ents': ['ORG', 'PRODUCT']})


---------------------------- page 0 ----------------------------


---------------------------- page 1 ----------------------------


---------------------------- page 2 ----------------------------


---------------------------- page 3 ----------------------------


---------------------------- page 4 ----------------------------


---------------------------- page 5 ----------------------------


---------------------------- page 6 ----------------------------


---------------------------- page 7 ----------------------------


---------------------------- page 8 ----------------------------


---------------------------- page 9 ----------------------------


### camembert-ner NER
camembert-ner is a NER model fine-tuned from CamemBERT on the wikiner-fr dataset (~170,634 sentences). It was validated on email/chat data, where it outperformed other models. In particular, the model seems to work better on entities that don't start with an upper-case letter.

In [20]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner")

nlp_tf = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")

In [24]:
# Run camembert-ner on each page and print only the ORG entities
for i, text in enumerate(result_arr):
  print('------------------', 'page', i, '------------------')
  entities = nlp_tf(text)
  for entity in entities:
    if entity['entity_group'] == 'ORG':
      print(entity)

------------------ page 0 ------------------
{'entity_group': 'ORG', 'score': 0.8939309, 'word': 'SOCIETE A RESPONSABILITE LIMITEE T', 'start': 0, 'end': 35}
------------------ page 1 ------------------
{'entity_group': 'ORG', 'score': 0.85159683, 'word': 'TUR', 'start': 1242, 'end': 1246}
{'entity_group': 'ORG', 'score': 0.7602225, 'word': 'MAN', 'start': 1248, 'end': 1252}
{'entity_group': 'ORG', 'score': 0.98780596, 'word': 'S.A.R.L', 'start': 1442, 'end': 1450}
------------------ page 2 ------------------
{'entity_group': 'ORG', 'score': 0.81165755, 'word': 'Société', 'start': 33, 'end': 41}
{'entity_group': 'ORG', 'score': 0.8630512, 'word': 'Société', 'start': 260, 'end': 268}
{'entity_group': 'ORG', 'score': 0.9033751, 'word': 'Société', 'start': 317, 'end': 325}
{'entity_group': 'ORG', 'score': 0.7249498, 'word': 'Associés', 'start': 657, 'end': 666}
{'entity_group': 'ORG', 'score': 0.8871137, 'word': 'Société', 'start': 709, 'end': 717}
{'entity_group': 'ORG', 'score': 0.30150

# Conclusions

In the given 10-page PDF, the expected company is TUR-MAN, which camembert-ner does find. However, both NER methods have low precision: many false positives are returned, such as the generic term "Société". The predictions are not satisfying, but the notebook at least presents a way to extract company names from PDFs within the PySpark framework.
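
One cheap way to reduce the false positives seen in the camembert-ner output is to post-filter the entities. The sketch below is only an illustration, not part of the original notebook: the function name `filter_org_entities`, the stop list `GENERIC_TERMS`, and the `0.75` score threshold are all assumptions chosen from the sample output above and would need tuning on more documents.

```python
from typing import Dict, Iterable, List

# Hypothetical stop list: generic legal terms that appeared as ORG
# false positives in the sample output (assumption, not exhaustive).
GENERIC_TERMS = {"société", "associés", "s.a.r.l", "sarl"}


def filter_org_entities(
    entities: Iterable[Dict],
    min_score: float = 0.75,  # arbitrary threshold, chosen for illustration
    generic_terms: Iterable[str] = GENERIC_TERMS,
) -> List[Dict]:
    """Keep ORG entities that are confident enough and not generic terms."""
    stop = {term.lower() for term in generic_terms}
    kept = []
    for ent in entities:
        if ent.get("entity_group") != "ORG":
            continue  # only company-like entities are of interest
        if ent["score"] < min_score:
            continue  # drop low-confidence predictions
        # Normalise case and a trailing dot before the stop-list lookup
        word = ent["word"].strip().lower().rstrip(".")
        if word in stop:
            continue  # drop generic legal vocabulary
        kept.append(ent)
    return kept
```

Applied to the page 1 entities printed above, this keeps `TUR` and `MAN` and drops `S.A.R.L`. It does not merge the two fragments back into `TUR-MAN`; that would need the page text and the `start`/`end` offsets.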

# To improve

1. Spark OCR may not be the best option for OCR, as it is not free; the combination of pdf2image and pytesseract could be a good alternative.
2. Text preprocessing could be introduced to make the extracted text cleaner and hence increase the NER accuracy.
3. Translating FR into EN might improve the NER, but would increase the cost at the same time.
4. Training a customized NER model on similar PDFs could greatly improve the NER accuracy, but annotation is needed.
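
As an illustration of point 2, here is a minimal sketch of a text-cleaning step that could run on each page before NER. The function name `clean_ocr_text` and the specific rules are assumptions, not something the notebook implements; which rules actually help would have to be checked against real OCR output.

```python
import re
import unicodedata


def clean_ocr_text(text: str) -> str:
    """Light clean-up of OCR output before NER (illustrative rules only)."""
    # Normalise compatibility characters, e.g. the ligature "ﬁ" becomes "fi"
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break: "socié-\nté" -> "société".
    # Caveat: a genuinely hyphenated name split at its hyphen would lose it.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop lines that are empty after trimming
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

It could be applied as `result_arr = [clean_ocr_text(page) for page in result_arr]` before either NER cell. Since cleaning changes the text, entity `start`/`end` offsets would then refer to the cleaned pages.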