
Bring back OCR in a future release #101

Open
vinayak-mehta opened this issue Sep 11, 2018 · 11 comments
@vinayak-mehta (Collaborator) commented Sep 11, 2018

The experimental OCR version exists before commit 9753889. It used Tesseract (via pyocr). ocropy looked promising the last time I checked; opening this issue for discussion and experiments around OCR.
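For reference, a minimal sketch of the pyocr + Tesseract path (these calls are my assumption of the general shape, not the removed code; page.png is a hypothetical page image):

import pyocr
import pyocr.builders
from PIL import Image

tool = pyocr.get_available_tools()[0]  # e.g. Tesseract, if installed
text = tool.image_to_string(
    Image.open('page.png'),            # hypothetical page image
    lang='eng',
    builder=pyocr.builders.TextBuilder(),
)
print(text)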

An earlier issue around the same topic: #14

@wanghaisheng commented Oct 28, 2018

As a simple trick, we can just use ocrmypdf to convert an image-based PDF into a searchable PDF.
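A minimal sketch of that trick from Python (file names are placeholders; --force-ocr matches the flag used in the Docker script further down):

import subprocess

# ocrmypdf adds a Tesseract text layer to an image-based PDF.
subprocess.run(['ocrmypdf', '--force-ocr', 'scanned.pdf', 'searchable.pdf'],
               check=True)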

@vinayak-mehta (Collaborator, Author) commented Oct 30, 2018

A good test of whether ocrmypdf would work: convert all the PDFs used inside the tests to image-based ones, use ocrmypdf to convert those back to text-based ones, and then run the tests. @wanghaisheng Would you like to look into this? Otherwise, I'll do it later.

@wanghaisheng commented Oct 30, 2018

@vinayak-mehta I definitely want to help with this. I can set up a Docker ocrmypdf environment and convert all the files under https://github.com/socialcopsdev/camelot/tree/master/tests/files.
Questions:
Step 1: can we transform these PDF files into text-based ones?
Step 2: compare the text extracted from the text-based ones against ground-truth text, to gauge the reliability of the OCR engine used by ocrmypdf (see the sketch below).
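A hedged sketch for step 2 (file names are hypothetical): score the extracted text against the ground truth with difflib:

import difflib

with open('ground_truth.txt') as f:
    truth = f.read()
with open('ocr_output.txt') as f:
    ocr = f.read()

# A ratio close to 1.0 means the OCR output closely matches the ground truth.
ratio = difflib.SequenceMatcher(None, truth, ocr).ratio()
print('similarity: {:.2%}'.format(ratio))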

@vinayak-mehta (Collaborator, Author) commented Oct 30, 2018

@wanghaisheng You don't need to convert all the PDFs inside the tests/files folder to image-based ones, just the ones that are being used inside the Python tests.

  1. You can probably use Ghostscript/ImageMagick to convert the PDFs into image-based ones in one go.
  2. Then convert all the newly created image-based PDFs into text-based ones using ocrmypdf, again in one go.
  3. Run the tests on the text-based PDFs converted with ocrmypdf. (A sketch of this pipeline follows below.)
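A rough sketch of steps 1 and 2 (the ImageMagick flags and file names are assumptions): rasterizing at 300 DPI strips the text layer, and ocrmypdf puts one back:

import subprocess

src = 'tests/files/foo.pdf'        # hypothetical test file
image_based = 'foo_image.pdf'      # step 1 output: image-based copy
text_based = 'foo_ocr.pdf'         # step 2 output: OCRed copy

# ImageMagick: rasterize the PDF so it loses its text layer.
subprocess.run(['convert', '-density', '300', src, image_based], check=True)
# ocrmypdf: add a Tesseract text layer back.
subprocess.run(['ocrmypdf', '--force-ocr', image_based, text_based], check=True)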
@vinayak-mehta (Collaborator, Author) commented Oct 30, 2018

If the tests pass, we should update the README and docs to guide users to convert their image-based PDFs to text-based ones using ocrmypdf.

@wanghaisheng commented Oct 31, 2018

@vinayak-mehta By "the tests", do you mean test_common.py?

@vinayak-mehta (Collaborator, Author) commented Oct 31, 2018

@wanghaisheng commented Nov 7, 2018

Step 1: set up ocrmypdf

docker pull jbarlow83/ocrmypdf

Then tag it to give it a more convenient name, just ocrmypdf:

docker tag jbarlow83/ocrmypdf ocrmypdf

Step 2: batch-process the test files with this script, using the ocrmypdf Docker container:

#!/usr/bin/env python3
# Contributed by github.com/Enantiomerie

# The script takes 2 arguments:
# 1. source dir with *.pdf - default is the location of the script
# 2. move dir where *.pdf and *_OCR.pdf are moved to

import logging
import os
import shutil
import subprocess
import sys
import time

script_dir = os.path.dirname(os.path.realpath(__file__))
timestamp = time.strftime("%Y-%m-%d-%H%M_")
log_file = os.path.join(script_dir, timestamp + 'ocrmypdf.log')
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s',
                    filename=log_file, filemode='w')

# An absolute path keeps os.walk stable and gives docker -v a valid mount.
start_dir = os.path.abspath(sys.argv[1]) if len(sys.argv) > 1 else script_dir
move_dir = sys.argv[2]  # required: where the processed files are moved

for dir_name, subdirs, file_list in os.walk(start_dir):
    logging.info(dir_name)
    for filename in file_list:
        if os.path.splitext(filename)[1].lower() != '.pdf':
            continue
        full_path = os.path.join(dir_name, filename)
        file_noext = os.path.splitext(filename)[0]
        timestamp_OCR = time.strftime("%Y-%m-%d-%H%M_OCR_")
        filename_OCR = timestamp_OCR + file_noext + '.pdf'
        docker_mount = dir_name + ':/home/docker'
        # The diskstation needs a user:group docker:docker; find its uid:gid
        # with `id docker` and pass it via the -u flag. docker:docker also
        # needs rw rights on the source dir. The script runs as root via cron.
        cmd = ['docker', 'run', '--rm', '-v', docker_mount, '-u=1030:65538',
               'jbarlow83/ocrmypdf', '--force-ocr', filename, filename_OCR]
        logging.info(cmd)
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        logging.info(proc.stdout.read())
        full_path_OCR = os.path.join(dir_name, filename_OCR)
        os.chmod(full_path_OCR, 0o666)
        os.chmod(full_path, 0o666)
        shutil.move(full_path_OCR, move_dir)
        shutil.move(full_path, os.path.join(move_dir, 'no_ocr'))
logging.info('Finished.')

Step 3: run the Camelot test suite against the OCRed copies (a sketch follows).
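A minimal sketch for step 3 (assuming the OCRed copies have replaced the originals under tests/files): run the existing suite programmatically:

import pytest

# Runs the tests mentioned earlier in this thread against the converted files.
pytest.main(['tests/test_common.py', '-v'])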

@Prady96 commented Jan 30, 2019

Hi,

ocrmypdf does not work every time, but you can make it work for some cases by adding this piece of code. Running an image through it made some tables that were encrypted, and lines that were not being detected, parsable.

  1. Convert the PDF page into an image using Ghostscript/ImageMagick.

  2. Then run this script:

import math

import cv2

# Detect line segments on the page image and redraw them so table rulings
# survive the round trip through OCR.
img = cv2.imread('Page-5.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# OpenCV's LSD (Line Segment Detector); 0 selects the no-refinement mode.
lsd = cv2.createLineSegmentDetector(0)
dlines = lsd.detect(gray)

for dline in dlines[0]:
    x0 = int(round(dline[0][0]))
    y0 = int(round(dline[0][1]))
    x1 = int(round(dline[0][2]))
    y1 = int(round(dline[0][3]))
    # Redraw the segment; a scalar 255 on a BGR image draws in blue.
    cv2.line(img, (x0, y0), (x1, y1), 255, 1, cv2.LINE_AA)

    # Print the segment length.
    a = (x0 - x1) * (x0 - x1)
    b = (y0 - y1) * (y0 - y1)
    print(math.sqrt(a + b))

cv2.imwrite('page5_lines.png', img)
  3. A new image is now formed with blue lines drawn over the detected segments. Now try running ocrmypdf again.

Disclaimer: this will not work in every case, but give it a try if you are out of options.
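A hedged follow-up sketch: ocrmypdf can take a single image as input if you pass its DPI, so the line-enhanced image can be fed back in directly (treat the flag and file names as assumptions):

import subprocess

# Re-run OCR on the enhanced page image, producing a searchable PDF.
subprocess.run(['ocrmypdf', '--image-dpi', '300', '--force-ocr',
                'page5_lines.png', 'page5_ocr.pdf'], check=True)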

@CartierPierre commented Apr 15, 2019

Am I right that Camelot uses word boundaries for stream extraction (the Nurminen algorithm?), and that OCR tools like pytesseract produce word boundaries too? It doesn't look that hard to include scanned PDFs in Camelot.

@vinayak-mehta (Collaborator, Author) commented Apr 20, 2019

@CartierPierre I'll check out pytesseract. A lot of pre-processing steps also need to be added before the image can finally be passed into an OCR engine. This page provides a good overview: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
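For the word-boundary angle, a small sketch (my own, not Camelot code): pytesseract can emit word-level bounding boxes:

import pytesseract
from PIL import Image

# Hypothetical scanned page; image_to_data returns word-level boxes and text.
data = pytesseract.image_to_data(Image.open('page.png'),
                                 output_type=pytesseract.Output.DICT)
for word, x, y, w, h in zip(data['text'], data['left'], data['top'],
                            data['width'], data['height']):
    if word.strip():
        print(word, (x, y, x + w, y + h))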
