# Visual Document Classifier v2 training

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/VisualDocumentClassifierTraining/SparkOCRVisualDocumentClassifierv2Training.ipynb)

## Set license and AWS keys

You need to specify:
- secret
- license
- AWS credentials

### Option #1 - define in this cell

In [1]:
import os

secret = ""
version = secret.split("-")[0]

os.environ['JSL_OCR_LICENSE'] = ""
os.environ["AWS_ACCESS_KEY_ID"] = ""
os.environ["AWS_SECRET_ACCESS_KEY"] = ""

### Option #2 - provide spark_ocr.json file

In [None]:
import json, os
import sys

if 'google.colab' in sys.modules:
    from google.colab import files

    if 'spark_ocr.json' not in os.listdir():
      license_keys = files.upload()
      os.rename(list(license_keys.keys())[0], 'spark_ocr.json')

with open('spark_ocr.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
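
Note that the cell above only creates Python variables; unlike Option #1, it does not export anything to the process environment. A minimal sketch of bridging the two, assuming the JSON file uses the same key names as the environment variables in Option #1 (the key names and the `export_license_keys` helper are illustrative assumptions, not part of Spark OCR):

```python
import os

# Assumed key names: they mirror the environment variables set in Option #1.
# Check your own spark_ocr.json for the actual keys.
EXPECTED_KEYS = ("JSL_OCR_LICENSE", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def export_license_keys(license_keys, environ=os.environ, expected=EXPECTED_KEYS):
    """Copy known keys into `environ`; return the names that were missing or empty."""
    missing = []
    for name in expected:
        value = license_keys.get(name)
        if value:
            environ[name] = str(value)
        else:
            missing.append(name)
    return missing


# Demo with placeholder values and a plain dict standing in for os.environ.
demo_env = {}
demo_missing = export_license_keys(
    {"JSL_OCR_LICENSE": "placeholder-license", "AWS_ACCESS_KEY_ID": "placeholder-id"},
    environ=demo_env,
)
print("Missing keys:", demo_missing)
```

In the notebook you would call `export_license_keys(license_keys)` right after loading the JSON file and stop early if the returned list is not empty.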

## Install Spark-OCR

This step is needed only on Colab. In other environments, prepare the environment accordingly.

In [3]:
# Installing Dependencies
%pip install --upgrade git+https://github.com/JohnSnowLabs/detectron2.git@frozen_sparkocr
%pip install --upgrade git+https://github.com/JohnSnowLabs/transformers.git@layoutlmv2_onnx

%pip install pyyaml
%pip install datasets==1.18.2

#%pip install spark-ocr==$version --extra-index-url=https://pypi.johnsnowlabs.com/$secret --upgrade
# The PyPI package is named scikit-learn; the "sklearn" alias is deprecated.
%pip install scikit-learn



## Download demo datasets

This step downloads a demo dataset. For your own data, put your images in one folder and prepare a labelling txt file as in the example.
The instructions here use the command line; you can also download and unzip these files manually.

In [None]:
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/datasets/visual_doc_classifier/LayoutLM.v2.voc.txt
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/datasets/visual_doc_classifier/rvl_cdip_tmp_preprocessed.zip
!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/datasets/visual_doc_classifier/rvl_cdip_tmp.zip
!unzip rvl_cdip_tmp_preprocessed.zip
!unzip rvl_cdip_tmp.zip
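
If `wget` and `unzip` are not available, the same steps can be done from Python. A minimal sketch using only the standard library; the helper names (`fetch`, `extract`, `download_demo_data`) are hypothetical, and the URLs are the ones from the cell above:

```python
import urllib.request
import zipfile
from pathlib import Path

BASE_URL = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/datasets/visual_doc_classifier/"
FILES = ["LayoutLM.v2.voc.txt", "rvl_cdip_tmp_preprocessed.zip", "rvl_cdip_tmp.zip"]


def fetch(name, dest_dir="."):
    """Download one file unless it is already present; return its local path."""
    target = Path(dest_dir) / name
    if not target.exists():
        urllib.request.urlretrieve(BASE_URL + name, str(target))
    return target


def extract(archive, dest_dir="."):
    """Unzip `archive` into `dest_dir` and return the archive member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()


def download_demo_data(dest_dir="."):
    """Fetch all demo files and unpack the zip archives (needs network access)."""
    for name in FILES:
        path = fetch(name, dest_dir)
        if path.suffix == ".zip":
            extract(path, dest_dir)
```

Calling `download_demo_data()` in a notebook cell should leave the same files in the working directory as the shell commands above.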

## Start Spark session with Spark OCR

In [1]:
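from sparkocr import start

# Path to the folder containing the Spark OCR jar; adjust it to your setup.
spark_ocr_jar_path = "../../../target/scala-2.12/"
spark = start(jar_path=spark_ocr_jar_path)

# Enable Arrow-based data transfer between the JVM and Python.
# "spark.sql.execution.arrow.enabled" is the older name of the same setting,
# deprecated since Spark 3.0 in favor of the "pyspark" variant.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark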

Spark version: 3.3.0
Spark NLP version: 4.0.0
Spark NLP for Healthcare version: 4.0.0
Spark OCR version: 4.0.2



### Define labels

In [2]:
labels = ["advertisement",
          "budget",
          "email",
          "file_folder",
          "form",
          "handwritten",
          "invoice",
          "letter",
          "memo",
          "news_article",
          "presentation",
          "questionnaire",
          "resume",
          "scientific_publication",
          "scientific_report",
          "specification"]

## Option #1: Preprocessing your own data

Images for classification should be placed in one folder ("./rvl_cdip_tmp" in this case).

The labels file should be placed in the same folder. The format is one record per row, with the file path and the label separated by a space:

```
file1.jpg 1
file2.jpg 2
```
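
The labels file can be generated rather than written by hand. A minimal sketch, assuming a hypothetical file name `labels.txt` (the text above does not name the file, so check what `DatasetReader` expects) and integer label ids as in the example rows:

```python
from pathlib import Path


def write_labels_file(records, folder, filename="labels.txt"):
    """Write one `file_path label` row per record.

    `filename` is a hypothetical default, not a name confirmed by Spark OCR.
    """
    lines = []
    for file_path, label in records:
        if " " in file_path:
            # A space in the path would make the row ambiguous to parse.
            raise ValueError(f"file path contains a space: {file_path!r}")
        lines.append(f"{file_path} {label}")
    target = Path(folder) / filename
    target.write_text("\n".join(lines) + "\n")
    return target


def read_labels_file(path):
    """Parse the file back into (file_path, label) pairs for a sanity check."""
    pairs = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            file_path, label = line.rsplit(" ", 1)
            pairs.append((file_path, int(label)))
    return pairs
```

Reading the file back with `read_labels_file` before running the pipeline is a cheap way to catch malformed rows.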

In [3]:
from sparkocr.transformers import *

df = DatasetReader.readDataset("./rvl_cdip_tmp", spark)
display(df.select("content", "act_label").limit(1))

content,act_label
[49 49 2A 00 34 E...,4


### Repartition your data
To make better use of your cluster, you may need to repartition the input dataframe.

In [4]:
df = df.repartition(8)
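
The value 8 above is just an example. As a rough starting point, the Spark tuning guide recommends about 2-3 tasks per CPU core; the sketch below turns that rule of thumb into a number. The `suggest_partitions` helper is hypothetical, and the right value for your data may differ:

```python
def suggest_partitions(total_cores, tasks_per_core=2, minimum=1):
    """Return a partition count based on a tasks-per-core rule of thumb."""
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be positive")
    return max(minimum, total_cores * tasks_per_core)


# Example: a 4-core machine with the default of 2 tasks per core.
print(suggest_partitions(4))

# In the notebook, the core count could be taken from the running session:
# df = df.repartition(suggest_partitions(spark.sparkContext.defaultParallelism))
```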

In [5]:
from sparkocr.transformers import *
from sparkocr.enums import *
from pyspark.ml import PipelineModel

binary_to_image = BinaryToImage()\
    .setOutputCol("image") \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

img_to_hocr = ImageToHocr()\
    .setInputCol("image")\
    .setOutputCol("hocr")\
    .setIgnoreResolution(False)\
    .setOcrParams(["preserve_interword_spaces=0"])

tokenizer = HocrTokenizer()\
    .setInputCol("hocr")\
    .setOutputCol("token")

# OCR pipeline
pipeline1 = PipelineModel(stages=[
    binary_to_image,
    img_to_hocr,
    tokenizer
])

df = pipeline1.transform(df).cache()
df = df.withColumnRenamed("image", "orig_image")
display(df.select("act_label", "pagenum", "exception", "hocr", "token"))

act_label,pagenum,exception,hocr,token
4,0,,<div class='ocr...,"[{token, 0, 2, in..."


In [None]:
from sparkocr.utils import get_vocabulary_dict

vocab_file = "LayoutLM.v2.voc.txt"
vocab = get_vocabulary_dict(vocab_file, ",")

doc_class = VisualDocumentClassifierV2() \
    .setInputCols(["token", "orig_image"]) \
    .setOutputCol("label")
doc_class.setVocabulary(vocab)

df = doc_class.getPreprocessedDataset(
    df,
    [1, 3, 224, 224]
).cache()

In [None]:
df

In [None]:
df.select("path", "input_ids", "bbox", "image", "attention_mask", "token_type_ids", "act_label").write.parquet("preprocessed_dataset")

## Option #2: Use preprocessed datasets
Datasets can also be loaded in a preprocessed state. You will typically want a separate cluster environment for the preprocessing, as it can take a long time (a number of hours).
See this notebook, located in the same folder as the current one:

Spark-ocr visual doc classifier v2 preprocessing on databricks.ipynb

In [19]:
df = DatasetReader.readPreprocessedDataset("./rvl_cdip_tmp_preprocessed", spark)

display(df.limit(1))

input_ids,bbox,image,attention_mask,token_type_ids,act_label
"[101, 13169, 1051...","[0, 0, 0, 0, 42, ...","[255, 255, 255, 2...","[1.0, 1.0, 1.0, 1...","[0, 0, 0, 0, 0, 0...",scientific_report


If the dataset contains string labels, substitute them with integer ids.

In [20]:
from sparkocr.transformers import *
from pyspark.sql.functions import udf


label2id = {k: v for v, k in enumerate(labels)}
df = df.withColumn('act_label', udf(lambda x: label2id[x])('act_label'))
display(df.limit(1))

input_ids,bbox,image,attention_mask,token_type_ids,act_label
"[101, 13169, 1051...","[0, 0, 0, 0, 42, ...","[255, 255, 255, 2...","[1.0, 1.0, 1.0, 1...","[0, 0, 0, 0, 0, 0...",14
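
The mapping itself is plain Python and can be checked without Spark. A minimal sketch that builds both directions of the mapping from the `labels` list defined earlier and fails loudly on an unknown label (the `to_id` helper is hypothetical):

```python
labels = ["advertisement", "budget", "email", "file_folder", "form",
          "handwritten", "invoice", "letter", "memo", "news_article",
          "presentation", "questionnaire", "resume",
          "scientific_publication", "scientific_report", "specification"]

# The id of a label is its position in the list.
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}


def to_id(label):
    """Map a string label to its integer id, with a readable error if unknown."""
    try:
        return label2id[label]
    except KeyError:
        raise ValueError(f"unknown label: {label!r}") from None


print(to_id("scientific_report"), id2label[4])
```

One thing to be aware of: `udf` called without an explicit return type produces a string column in PySpark. If an integer column is required downstream, passing `IntegerType()` from `pyspark.sql.types` as the second argument to `udf` is the usual way to get one.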


## Training


### Dataframe of preprocessed data
Whichever of the options above you used to generate your data, the next step is to run the training stage.

In [None]:
from sparkocr.transformers import *

trainer = VisualDocumentClassifierV2()
trainer.set_train_param_model_save_path("new_model")
trainer.set_train_param_vocab_path("LayoutLM.v2.voc.txt")
trainer.set_train_param_spark(spark)
trainer.set_train_param_num_train_epochs(2)
trainer.set_train_param_useGPU(False)
trainer.setLabels(labels)

doc_class = trainer.fit(df)