In [24]:
# flake8: noqa
import warnings
import os

# Suppress noisy requests warnings.
warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

# Extracting Information from Documents using Ray Datasets

In this example, we will show you how to run optical character recognition (OCR) on a set of documents and analyze the resulting text with the natural language processing library spaCy. To make it more interesting, we will run the analysis on the [LightShot](https://www.kaggle.com/datasets/datasnaek/lightshot) dataset. It is a large publicly available OCR dataset with a wide variety of different documents, all of them screen shots of various forms. It is easy to replace that dataset with your own data and adapt the example to your own use cases!

## Overview

This tutorial will cover:
 - Creating a Ray Dataset that represents the images in the dataset
 - Running the computationally expensive OCR process on each image in the dataset in parallel
 - Filtering out images that actually contain text
 - Performing various NLP operations on the text

## Walkthrough

Let's start by preparing the dependencies and downloading the dataset. You can download the dataset at [LightShot](https://www.kaggle.com/datasets/datasnaek/lightshot). Install the OCR software `tesseract` and extract the `archive.zip` file with the following commands:

In [None]:
!sudo apt-get install tesseract-ocr
%pip install pytesseract
!sudo apt-get install -y unzip unrar
!unzip archive.zip
!unrar x LightShot13k.rar ~/LightShot13k/

Let's now import Ray and initialize a local Ray cluster. If you want to run OCR at a very large scale, you should run this workload on a multi-node cluster.

In [1]:
# Import ray and initialize a local Ray cluster.
import ray
ray.init()

2022-06-20 13:29:54,153	INFO packaging.py:323 -- Pushing file package 'gcs://_ray_pkg_6e55412652e4563957625efaa4c5f526.zip' (71.00MiB) to Ray cluster...
2022-06-20 13:29:54,951	INFO packaging.py:332 -- Successfully pushed file package 'gcs://_ray_pkg_6e55412652e4563957625efaa4c5f526.zip'.


RayContext(dashboard_url='127.0.0.1:8265', python_version='3.8.5', ray_version='3.0.0.dev0', ray_commit='736c7b13c4dcfb3f1b748124184e293ffab14bf8', address_info={'node_ip_address': '172.31.87.92', 'raylet_ip_address': '172.31.87.92', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-06-20_13-19-15_484107_157/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-06-20_13-19-15_484107_157/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2022-06-20_13-19-15_484107_157', 'metrics_export_port': 63693, 'gcs_address': '172.31.87.92:9031', 'address': '172.31.87.92:9031', 'node_id': '0f5b31d3795dc5b8fbe66daf22f92515c0c4a99309c9c32e1b0c7860'})

### Running the OCR software on the data

We first create a list `files` of absolute paths of the file names and then convert it into a Ray Dataset with the `ray.data.from_items` function. We can now run the `.map` function on this dataset of file names to run the actual OCR process on each file and convert the screen shots into text. Note that we only store the paths of the filenames and the OCR'ed text in the dataset to keep the size of the dataset manageble. If you have large binary blobs like images or videos, it can be beneficial to store them outside of the dataset and store only the extracted information in the dataset.

In [None]:
import os
from glob import glob
import pytesseract

files = glob(os.path.expanduser("~/LightShot13k/LightShot13k/*"))
ds = ray.data.from_items(files)

def perform_ocr(path):
    return {"path": path, "text": pytesseract.image_to_string(path)}

results = ds.map(perform_ocr)

Let us have a look at some of the data points with the `take` function.

In [22]:
import ray
ray.init()
results = ray.data.read_parquet("/mnt/shared_storage/pcmoritz/LightShot13k_output/")

2022-06-20 13:47:43,292	INFO packaging.py:323 -- Pushing file package 'gcs://_ray_pkg_87e304282a9906dcafa5ea9e85e906f9.zip' (71.00MiB) to Ray cluster...
2022-06-20 13:47:44,083	INFO packaging.py:332 -- Successfully pushed file package 'gcs://_ray_pkg_87e304282a9906dcafa5ea9e85e906f9.zip'.
Metadata Fetch Progress:   0%|          | 0/33 [00:00<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:00<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:00<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:00<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:00<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:00<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:00<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:00<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:00<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:01<?, ?it/s]
Metadata Fetch Progress:   0%|          | 0/33 [00:01<?, ?it/s]
Metada

### Process the extracted text data with spaCy

This is the part where the fun begins. Depending on your task there will be different needs for post processing, for example:
- If you are scanning books or articles you might want to separate the text out into sections and paragraphs.
- If you are scanning forms, receipts or checks, you might want to extract the different items listed, as well as extra information for those items like the price, or the total amount listed on the receipt or check.
- If you are scanning legal documents, you might want to extract information like the type of document, who is mentioned in the document and more semantic information about what the document claims.
- If you are scanning medical records, you might want to extract the patient name and the treatment history.

In our specific example, let's try to determine all the documents in the LightShot dataset that are chat protocols and extract named entities in those documents. We will extract this data with spaCy. Let's first make sure the libraries are installed:

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install spacy_langdetect

This is some code to determine the language of a piece of text:

In [19]:
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

nlp = spacy.load('en_core_web_sm')

@Language.factory("language_detector")
def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp.add_pipe('language_detector', last=True)
nlp("This is an English sentence. Ray rocks!")._.language

{'language': 'en', 'score': 0.9999965581999258}

It gives both the language and a confidence score for that language.

In order to run the code on the dataset, we have to use Ray Dataset's built in support for actors since the `nlp` object is not serializable and we want to avoid having to recreate it for each individual sentence:

In [25]:
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

class SpacyBatchInference:
    def __init__(self):
        self.nlp = spacy.load('en_core_web_sm')
        @Language.factory("language_detector")

        def get_lang_detector(nlp, name):
           return LanguageDetector()

        self.nlp.add_pipe('language_detector', last=True)

    def __call__(self, row):
        doc = self.nlp(row["value"])
        return doc._.language

results.limit(10).map(SpacyBatchInference, compute="actors")

Map:   0%|          | 0/1 [00:00<?, ?it/s](pid=6597) 2022-06-20 13:48:17.654922: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(pid=6597) 2022-06-20 13:48:17.654973: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Map Progress (1 actors 1 pending): 100%|██████████| 1/1 [00:05<00:00,  5.52s/it]


Dataset(num_blocks=1, num_rows=10, schema={language: string, score: double})

(pid=6639) 2022-06-20 13:48:21.950568: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(pid=6639) 2022-06-20 13:48:21.950614: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


We can now get language statistics over the whole dataset:

In [26]:
results.map(SpacyBatchInference, compute="actors").groupby("language").count().show()

Read->Map:   0%|          | 0/200 [00:00<?, ?it/s](pid=6812) 2022-06-20 13:49:10.979206: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(pid=6812) 2022-06-20 13:49:10.979262: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Map Progress (1 actors 1 pending):   0%|          | 1/200 [00:05<19:49,  5.98s/it](pid=6854) 2022-06-20 13:49:15.131031: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(pid=6854) 2022-06-20 13:49:15.131083: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Map Progress (2 actors 1 pending):   2%|▏       

{'language': 'UNKNOWN', 'count()': 2815}
{'language': 'af', 'count()': 109}
{'language': 'ca', 'count()': 268}
{'language': 'cs', 'count()': 13}
{'language': 'cy', 'count()': 80}
{'language': 'da', 'count()': 33}
{'language': 'de', 'count()': 281}
{'language': 'en', 'count()': 5640}
{'language': 'es', 'count()': 453}
{'language': 'et', 'count()': 82}
{'language': 'fi', 'count()': 32}
{'language': 'fr', 'count()': 168}
{'language': 'hr', 'count()': 143}
{'language': 'hu', 'count()': 57}
{'language': 'id', 'count()': 128}
{'language': 'it', 'count()': 139}
{'language': 'lt', 'count()': 17}
{'language': 'lv', 'count()': 12}
{'language': 'nl', 'count()': 982}
{'language': 'no', 'count()': 56}
