# Text to image - OCR (optical character recognition) experiments

## Main Idea

    The goal of this project is to experiment with different ways to implement image-to-text. The motivation is personal, in order to speed up data entry for a team schedule updater. 
    
    The system from where the original text comes from is isolated from other computers, so it's necessary to photograph and re-type the original text in a Trello notebook. Currently, a commercial app is being used, downloaded from the playstore. theapp, however, is full of ads, and the speed of it is lacking. 
    
    There are 2 ways I will attempt to implement here:
    - Pytesseract: a ready-to-use package that I will import
    - My own model: based on what I learn from using pytesseract, I will attempt to develop my own model. 

### Pytesseract

Tesseract is a package built for OCR, and can very easily downloaded through pip. It is probably the easiest way to implement OCR in a project, and has support for many languages. Each language package can be download from their github repository, and installed individually. The library comes with english as a default

- Tesseract documentation: https://github.com/tesseract-ocr/tessdoc
- Tesseract language compatibility by version and code for each language: https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-different-versions.md
- Code to download languages other than english:
    - Specific Language (3 letters):
            apt-get install tesseract-ocr-YOUR_LANG_CODE
    - All languages:
            apt-get install tesseract-ocr-all
    

In [2]:
!pip install -r requirements.txt

[31mERROR: Could not find a version that satisfies the requirement pdftocairo (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for pdftocairo[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [4]:
from PIL import Image
import pytesseract
from glob import glob
import os
import re
import pandoc

In [3]:
# For google Colab
!apt update
!apt-get install libnss3 libnss3-dev
!apt-get install libcairo2-dev libjpeg-dev libgif-dev
!apt-get install cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev

!wget https://poppler.freedesktop.org/poppler-21.09.0.tar.xz;
!tar -xvf poppler-21.09.0.tar.xz;

!mkdir -p poppler-21.09.0/build && \

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease
[33m0% [Connecting to archive.ubuntu.com (185.125.190.39)] [Connecting to security.[0m                                                                               Ign:2 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64  InRelease
[33m0% [Connecting to archive.ubuntu.com (185.125.190.39)] [Connecting to security.[0m                                                                               Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Hit:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64  Release
Get:5 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Hit:6 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease
Get:9 http://archive.ubuntu.com/ubuntu focal-up

In [4]:
# for colab
!  apt-get install tesseract-ocr-all

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tesseract-ocr-all is already the newest version (4.1.1-2build2).
0 upgraded, 0 newly installed, 0 to remove and 23 not upgraded.


In [5]:
# for colab
!mkdir -p poppler-21.09.0/build && \
cd poppler-21.09.0 && \
cmake  -DCMAKE_BUILD_TYPE=Release   \
       -DCMAKE_INSTALL_PREFIX=/usr  \
       -DTESTDATADIR=$PWD/testfiles \
       -DENABLE_UNSTABLE_API_ABI_HEADERS=ON && \
make && \
make install


  No source or binary directory provided.  Both will be assumed to be the
  become a fatal error in future CMake releases.

[0m
  

   No test data found in $testdatadir.
   You will not be able to run 'make test' successfully.

  

   The test data is not included in the source packages
   and is also not part of the main git repository. Instead,
   you can checkout the test data from its own git
   repository with:

  

     git clone git://git.freedesktop.org/git/poppler/test

  

   You should checkout the test data as a sibling of your
   poppler source folder or specify the location of your
   checkout with -DTESTDATADIR=/path/to/checkoutdir/test.
    

[0m
[0m-- Package Qt6Core or Qt6Gui or Qt6Widgets or Qt6Test not found[0m
-- Checking for module 'gobject-introspection-1.0'
--   No package 'gobject-introspection-1.0' found
-- Checking for modules 'gtk+-3.0>=3.22;gdk-pixbuf-2.0>=2.36'
--   No package 'gtk+-3.0' found
--   No package 'gdk-pixbuf-2.0' found
-- Could NOT find G

In [7]:
# First trial attempt, uncomment to use manually

# Format home/vls/code/vsattamini/tst.jpeg from home to the end
#dire=input("Type full path to image you want to convert to text\n (format is with forward slashes):")
#im = Image.open(dire)

#language=input("Type the language of the scanned document (3 letter ID):")
#text = pytesseract.image_to_string(im, lang=language)
#filename=input("Type output file name:)")
#file1 = open(filename+"-"+language+".txt","w")
#file1.write(text)
#file1.close()
#print("Done!") 

### Extra: Full document processing

Another opportunity to use this function presented itself as I was working on the original project. The following section was made as an addendum to the one above. I just needed to be able to send a PDF I found online to my kindle. The PDF wasn't in text, but in scanned images of the original book

I first used a 3rd party app to transform my PDF into images, and then looped the above model on the images. I then exported it in plain text and plan to convert it into EPUB to read.

In [5]:
# Necessary imports
from glob import glob
from pdf2image import convert_from_path
import os



In [10]:
# pop_path = os.path.abspath(poppler.__file__)

In [11]:
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

In [13]:
# For google colab only
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# os.chdir allows you to change directories, like cd in the Terminal
os.chdir('/content/drive/MyDrive/Colab Notebooks')

Mounted at /content/drive


In [6]:
CURRENT_DIRECTORY = os.getcwd()
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [7]:
# import module
def pdf_to_images(file):
    os.mkdir(f'{file}')
    pages = convert_from_path(f"{CURRENT_DIRECTORY}/{file}.pdf")
    for i in range(len(pages)):
        pages[i].save(f"{file}/{file} - Page {i+1}.jpg", 'JPEG')

In [8]:
def get_paths(file_name):
    files = glob(f'{file_name}/*.jpg')
    files.sort()
    return files

In [9]:
def extract_and_join_jpg(files, final_file_name, langua="eng"):
    final_text = []
    for i, path in enumerate(files):
        dire= CURRENT_DIRECTORY+"/"+path
        im = Image.open(dire)
        text = pytesseract.image_to_string(im, lang=langua)

        final_text.append(text)
        if i % 10 == 0:
            print(f"Image {i} done,{len(files) - i+1} images to go...")
    final_text = " ".join(final_text)
    
    filename= final_file_name
    file1 = open(filename+"-"+langua+".txt","w")
    file1.write(final_text)
    file1.close()
    return final_text, print("Done! Check your file")

In [11]:
def txt_checker(input_text):
    pattern = re.compile(r"[A-Za-z0-9]+\.txt", re.IGNORECASE)
    matching = re.search(pattern, input_text)
    if matching:
        return True
    return False
    return 'Pattern Checked'

In [13]:
def convert_txt(txt, format ='epub'):
    if txt_checker(txt):
        with open(txt, "r") as file:
            txt_string = file.read()
        txt_document = pandoc.read(txt_string)
        pandoc.write(txt_document, format=format)
        print(txt_document)
    else:
        txt_document = pandoc.read(txt)
        pandoc.write(txt_document, format=format)
        print(txt_document)
    return print(f'File saved as {format}')


In [15]:
def full_extraction_process(file_name, langua='eng', convert=True, format='epub'):
    pdf_to_images(file_name)
    files = get_paths(file_name)
    final_txt = extract_and_join_jpg(files,file_name)
    if convert:
        convert_txt(final_txt, format=format)
    return print('Fuly done')

In [23]:
# Testing on new book
full_extraction_process("Foucault_Michel_Madness_and_Civilization_A_History_of_Insanity_in_the_Age_of_Reason")

FileExistsError: ignored

In [16]:
with open("Foucault_Michel_Madness_and_Civilization_A_History_of_Insanity_in_the_Age_of_Reason-eng.txt", "r") as file:
    string_test = file.read()

In [36]:
string_test = string_test.replace("-\n","")

In [37]:
string_test



In [38]:
os.chdir('/content/drive/MyDrive/Colab Notebooks')

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Colab Notebooks'

In [39]:
string_test_2 = pandoc.read(string_test)
pandoc.write(doc=string_test_2,file='Foucault_Michel_Madness_and_Civilization_A_History_of_Insanity_in_the_Age_of_Reason-eng.epub', format='epub')

b'PK\x03\x04\x14\x00\x02\x08\x00\x00N\xa5iVoa\xab,\x14\x00\x00\x00\x14\x00\x00\x00\x08\x00\x00\x00mimetypeapplication/epub+zipPK\x03\x04\x14\x00\x02\x08\x08\x00N\xa5iV\x10\xadP\xf1\xac\x00\x00\x00\xfb\x00\x00\x00\x16\x00\x00\x00META-INF/container.xml]\x8e\xc1\n\xc20\x10D\xef~E\xd8\xab\xd4\xe8MBSA\xd0\xb3\x07\xfb\x011\xddj0\xd9\rM*\xfa\xf7\xa6=\x14\xf180o\xde\xd4\x87w\xf0\xe2\x85CrL\x1av\x9b-\x08$\xcb\x9d\xa3\xbb\x86\xf6z\xae\xf6phV\xb5e\xca\xc6\x11\x0e\x7f\xddBS\xd20\x0e\xa4\xd8$\x97\x14\x99\x80Ie\xab8"ul\xc7\x80\x94\xd5\\S\xcb\x084+!\xea\x819\xf7\xcec\x9a\xd2O\x16\xfd\xe8}\x15M~h8]\xda\xa3\x9c\xb8\xb2\xb2\xe1\xd8\x83\x08\xd89S\xe5OD\r&F\xef\xac\xc9\xe5\x8fd\xbc\xc5T(\xfb4w\\\x17!\x089k\xe4\x8f\xa7\x96\xcb\x87\xe6\x0bPK\x03\x04\x14\x00\x02\x08\x08\x00N\xa5iVwv\xf8\xd2w\x00\x00\x00\xa0\x00\x00\x00-\x00\x00\x00META-INF/com.apple.ibooks.display-options.xml]\x8d\xb1\x0e\xc2 \x10@\xf7~\xc5\xe5F\x93J\xdd\x1c\x80n~\x81\xce\xa6)\x87\xb9\xa4\x1c\x04\xd0\xd4\xbf\x17\xad.\xae//\xef\xe9q\r\x0b<(\x

In [40]:
# Testing function that was implemented later
convert_txt('Foucault_Michel_Madness_and_Civilization_A_History_of_Insanity_in_the_Age_of_Reason-eng.txt')

Output hidden; open in https://colab.research.google.com to view.

In [None]:
txt_document = pandoc.read('Foucault_Michel_Madness_and_Civilization_A_History_of_Insanity_in_the_Age_of_Reason.txt')

In [None]:
txt_document

In [None]:
os.mkdir('convert_from_path')

In [None]:
"Foucault_Michel_Madness_and_Civilization_A_History_of_Insanity_in_the_Age_of_Reason"

## Model Creation

In the following section, I will attempt to create a model from scratch, training and tuning a model based on papers found in PapersWithCode and Arxiv

In [None]:
# Importing Dataset from Hugging Face
from datasets import load_dataset


dataset = load_dataset("naver-clova-ix/synthdog-en")