# Image to text - OCR (optical character recognition) experiments

## Main Idea

The goal of this project is to experiment with different ways to implement image-to-text. The motivation is personal: I want to speed up data entry for a team schedule updater.

The system the original text comes from is isolated from other computers, so the text has to be photographed and re-typed into a Trello notebook. Currently I use a commercial app downloaded from the Play Store. The app, however, is full of ads and slow.

I will attempt two approaches here:
- Pytesseract: a ready-to-use package that I will import
- My own model: based on what I learn from using pytesseract, I will attempt to develop my own model

### Pytesseract

Tesseract is an OCR engine, and pytesseract, its Python wrapper, can be installed easily through pip (the Tesseract engine itself must be installed separately). It is probably the easiest way to implement OCR in a project, and it supports many languages. Each language pack can be downloaded from the project's GitHub repositories and installed individually. English is included by default.

- Tesseract documentation: https://github.com/tesseract-ocr/tessdoc
- Tesseract language compatibility by version and code for each language: https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-different-versions.md
- Commands to install languages other than English (Debian/Ubuntu):
    - Specific language (3-letter code):
            apt-get install tesseract-ocr-YOUR_LANG_CODE
    - All languages:
            apt-get install tesseract-ocr-all
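
Tesseract also accepts several languages at once, written as codes joined with `+` (for example `eng+por`), and pytesseract can list the language packs it finds on the machine. The following is a minimal sketch, assuming pytesseract 0.3.7 or newer for `get_languages`. `build_lang_arg` and `missing_languages` are hypothetical helper names used for illustration, not part of the library.

```python
def build_lang_arg(codes):
    """Join language codes into the 'eng+por' form Tesseract expects."""
    cleaned = [c.strip() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def missing_languages(codes, installed=None):
    """Return the requested codes that are not installed.

    If `installed` is not given, ask Tesseract (the engine must be installed).
    """
    if installed is None:
        import pytesseract  # lazy import so the helpers load without Tesseract
        installed = pytesseract.get_languages(config="")
    return [c for c in codes if c not in installed]


print(build_lang_arg(["eng", "por"]))  # eng+por
```

The joined string can be passed as the `lang` argument of `pytesseract.image_to_string`, and `missing_languages` can be called first to find out which packs still need installing.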
    

In [1]:
from PIL import Image
import pytesseract

In [1]:


# Expected path format: home/vls/code/vsattamini/tst.jpeg (from home to the end)
dire = input("Type full path to image you want to convert to text\n (format is with forward slashes):")
im = Image.open(dire)

language = input("Type the language of the scanned document (3 letter ID):")
text = pytesseract.image_to_string(im, lang=language)
filename = input("Type output file name:")
# Write the recognized text to <filename>-<language>.txt
with open(filename + "-" + language + ".txt", "w", encoding="utf-8") as file1:
    file1.write(text)
print("Done!")

Type full path to image you want to convert to text
 (format is with forward slashes):d


FileNotFoundError: [Errno 2] No such file or directory: 'd'
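
The traceback above is what `Image.open` raises when the typed path does not point to a file. One way to fail earlier with a clearer message is to check the path before opening it. This is a sketch only, and `resolve_image_path` is a hypothetical helper, not part of PIL or pytesseract.

```python
from pathlib import Path


def resolve_image_path(raw):
    """Strip whitespace and quotes, expand '~', and check that the file exists."""
    path = Path(raw.strip().strip('"\'')).expanduser()
    if not path.is_file():
        raise FileNotFoundError(
            f"No image found at '{path}'. Check the path and try again."
        )
    return path
```

It would be used as `im = Image.open(resolve_image_path(dire))`, so a mistyped path is reported before any OCR work starts.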

### Extra: Full document processing

Another opportunity to use this function presented itself while I was working on the original project, so the following section is an addendum to the one above. I needed to send a PDF I found online to my Kindle, but the PDF contained scanned images of the original book rather than text.

I first used a third-party app to turn the PDF into images, then looped the OCR code above over those images. I then exported the result as plain text, which I plan to convert to EPUB for reading.

In [2]:
# Necessary imports
from glob import glob
from pdf2image import convert_from_path
import os
import poppler 




In [3]:
pop_path = os.path.abspath(poppler.__file__)

In [4]:
os.path.abspath(poppler.__file__)

'/home/vls/.pyenv/versions/3.8.12/envs/lewagon/lib/python3.8/site-packages/poppler/__init__.py'

In [5]:
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

In [6]:
CURRENT_DIRECTORY = os.getcwd()
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [7]:
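# Convert each page of <file>.pdf into a JPEG inside a folder named after the file
def pdf_to_images(file):
    os.makedirs(file, exist_ok=True)
    pages = convert_from_path(os.path.join(CURRENT_DIRECTORY, f"{file}.pdf"))
    for i, page in enumerate(pages):
        # Zero-pad the page index so a plain string sort keeps pages in order
        page.save(f"{file}/{file} - Page {i:04d}.jpg", 'JPEG')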

In [8]:
def get_paths(file_name):
    files = glob(f'{file_name}/*.jpg')
    files.sort()
    return files
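
`files.sort()` compares file names as strings, so unpadded page numbers come out as 0, 10, 2 rather than 0, 2, 10, and the OCR text would be joined out of order. Below is a sketch of sorting by the numeric page index instead, which works whether or not the names are zero-padded. `page_number` is a hypothetical helper that assumes file names ending in `Page <n>.jpg`.

```python
import re


def page_number(path):
    """Extract the trailing page index from names like 'book - Page 12.jpg'."""
    match = re.search(r"Page (\d+)\.jpg$", path)
    return int(match.group(1)) if match else -1


paths = ["book/book - Page 10.jpg", "book/book - Page 2.jpg", "book/book - Page 0.jpg"]
print(sorted(paths))                   # string order: Page 0, Page 10, Page 2
print(sorted(paths, key=page_number))  # numeric order: Page 0, Page 2, Page 10
```

Inside `get_paths`, the same key could be passed as `files.sort(key=page_number)`.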

In [9]:
def extract_and_join_jpg(files, final_file_name, langua="eng"):
    final_text = []
    for i, path in enumerate(files):
        # Paths from get_paths are relative to the working directory
        dire = os.path.join(CURRENT_DIRECTORY, path)
        im = Image.open(dire)
        text = pytesseract.image_to_string(im, lang=langua)

        final_text.append(text)
        if i % 10 == 0:
            print(f"Image {i} done, {len(files) - i - 1} images to go...")
    final_text = " ".join(final_text)

    # Write all pages to <final_file_name>-<langua>.txt
    with open(final_file_name + "-" + langua + ".txt", "w", encoding="utf-8") as file1:
        file1.write(final_text)
    print("Done! Check your file")
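
The same loop can also run on page images held in memory, which would skip the intermediate JPEG files. The sketch below assumes the OCR function is passed in as a parameter (defaulting to `pytesseract.image_to_string`) and that pages are joined with a blank line so page boundaries survive in the text. `ocr_pages` and its parameters are hypothetical names, not part of pytesseract or pdf2image.

```python
def ocr_pages(pages, ocr=None, lang="eng", separator="\n\n"):
    """Run OCR over an iterable of page images and join the page texts.

    `ocr` is any callable taking (image, lang=...). It defaults to
    pytesseract.image_to_string.
    """
    if ocr is None:
        import pytesseract  # lazy import: only needed for the default OCR
        ocr = pytesseract.image_to_string
    texts = [ocr(page, lang=lang).strip() for page in pages]
    return separator.join(texts)
```

Since `convert_from_path` returns a list of PIL images, a call such as `ocr_pages(convert_from_path("book.pdf"), lang="eng")` should work without writing any images to disk, at the cost of holding every page in memory at once.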

In [10]:
def full_extraction_process(file_name, langua='eng'):
    pdf_to_images(file_name)
    files = get_paths(file_name)
    # Pass the language through so it is not silently reset to English
    extract_and_join_jpg(files, file_name, langua)
    print('Fully done')

In [None]:
# Testing on new book
full_extraction_process("Foucault_Michel_Madness_and_Civilization_A_History_of_Insanity_in_the_Age_of_Reason")


## Model Creation

In the following section, I will attempt to create a model from scratch, training and tuning it based on papers found on Papers with Code and arXiv.

In [None]:
# Importing Dataset from Hugging Face
from datasets import load_dataset


dataset = load_dataset("naver-clova-ix/synthdog-en")
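
To train on this dataset, each image needs its target text. The sketch below assumes each record stores its annotation as a JSON string of the form `{"gt_parse": {"text_sequence": "..."}}`. That layout is an assumption and should be checked against a real record. `text_from_ground_truth` is a hypothetical helper, and the sample string is made up.

```python
import json


def text_from_ground_truth(ground_truth):
    """Pull the target text out of a JSON annotation string.

    Assumes the layout {"gt_parse": {"text_sequence": "..."}}; adjust it if
    the dataset's schema differs.
    """
    parsed = json.loads(ground_truth)
    return parsed["gt_parse"]["text_sequence"]


sample = '{"gt_parse": {"text_sequence": "hello world"}}'  # made-up example record
print(text_from_ground_truth(sample))  # hello world
```

With the loaded dataset, this might be called as `text_from_ground_truth(dataset["train"][0]["ground_truth"])`, assuming a `train` split and a `ground_truth` column exist.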