# Text to image - OCR (optical character recognition) experiments

## Main Idea

The goal of this project is to experiment with different ways to implement image-to-text. The motivation is personal, in order to speed up data entry for a team schedule updater. 
    
The system from where the original text comes from is isolated from other computers, so it's necessary to photograph and re-type the original text in a Trello notebook. Currently, a commercial app is being used, downloaded from the playstore. theapp, however, is full of ads, and the speed of it is lacking. 
    
There are 2 ways I will attempt to implement here:
  - Pytesseract: a ready-to-use package that I will import
  - My own model: based on what I learn from using pytesseract, I will attempt to develop my own model. 

Tesseract is a package built for OCR, and can very easily downloaded through pip. It is probably the easiest way to implement OCR in a project, and has support for many languages. Each language package can be download from their github repository, and installed individually. The library comes with english as a default

- Tesseract documentation: https://github.com/tesseract-ocr/tessdoc
- Tesseract language compatibility by version and code for each language: https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-different-versions.md
- Code to download languages other than english:
    - Specific Language (3 letters):
            apt-get install tesseract-ocr-YOUR_LANG_CODE
    - All languages:
            apt-get install tesseract-ocr-all
    

### Package imports and installing requirements

In [None]:
!pip install -r requirements.txt

In [2]:
# Import image and text file manipulation packages
from PIL import Image
import pytesseract
from glob import glob
import os
import re
import pandoc

In [None]:
# For google Colab, qhich requires installation on a specific instance
!apt update
!apt-get install libnss3 libnss3-dev
!apt-get install libcairo2-dev libjpeg-dev libgif-dev
!apt-get install cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev

!wget https://poppler.freedesktop.org/poppler-21.09.0.tar.xz;
!tar -xvf poppler-21.09.0.tar.xz;

!mkdir -p poppler-21.09.0/build && \

In [None]:
# for colab
!  apt-get install tesseract-ocr-all

In [None]:
# for colab
!mkdir -p poppler-21.09.0/build && \
cd poppler-21.09.0 && \
cmake  -DCMAKE_BUILD_TYPE=Release   \
       -DCMAKE_INSTALL_PREFIX=/usr  \
       -DTESTDATADIR=$PWD/testfiles \
       -DENABLE_UNSTABLE_API_ABI_HEADERS=ON && \
make && \
make install


### Pytesseract - initial experiment

In [None]:
# First trial attempt, uncomment to use manually

# Format home/vls/code/vsattamini/tst.jpeg from home to the end
#dire=input("Type full path to image you want to convert to text\n (format is with forward slashes):")
#im = Image.open(dire)

#language=input("Type the language of the scanned document (3 letter ID):")
#text = pytesseract.image_to_string(im, lang=language)
#filename=input("Type output file name:)")
#file1 = open(filename+"-"+language+".txt","w")
#file1.write(text)
#file1.close()
#print("Done!") 

### Pytesseract - Full document processing

Another opportunity to use this function presented itself as I was working on the original project. The following section was made as an addendum to the one above. I just needed to be able to send a PDF I found online to my kindle. The PDF wasn't in text, but in scanned images of the original book

I first used a 3rd party app to transform my PDF into images, and then looped the above model on the images. I then exported it in plain text and plan to convert it into EPUB to read.

In [3]:
# Necessary imports
from glob import glob
from pdf2image import convert_from_path, pdfinfo_from_path
import os

In [3]:
# pop_path = os.path.abspath(poppler.__file__)

In [4]:
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

In [5]:
# For google colab only
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# os.chdir allows you to change directories, like cd in the Terminal
os.chdir('/content/drive/MyDrive/Colab Notebooks')

ModuleNotFoundError: No module named 'google.colab'

In [6]:
# Estabilishing current directory to access and create files
CURRENT_DIRECTORY = os.getcwd()
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [7]:
# Original function was incredibly memory intensive and killed the original kernel. Batch processing was implemented and is
# highly recommended. 
def pdf_to_images(file, batch=True, verbose=True):
    '''
    Converts a pdf file into a series of images, each representing a page in the original file. The objective is to be able to 
    extract text from photocopies/scans. The core funtion loops through batches of 10 pages and saves these as images in a 
    folder it creates with the same name as the original file.
    ---
    file: Name of the pdf file
    batch: if the function will process the image in batches or not. Defaults to True which is HIGHLY recommended
    verbose: gives updates on the process 
    '''
    os.mkdir(f'{file}')
    pdf_name = f"{file}.pdf"
    info = pdfinfo_from_path(f"{CURRENT_DIRECTORY}/{pdf_name}", userpw=None)
    maxPages = info["Pages"]
    if batch:
        for page in range(1, maxPages+1, 10) : 
            pages = convert_from_path(f"{CURRENT_DIRECTORY}/{pdf_name}", dpi=200, first_page=page, last_page = min(page+10-1,maxPages))
            for i in range(len(pages)):
                pages[i].save(f"{file}/{file} - Page {page+i}.jpg", 'JPEG')
            if verbose:
                print(f'Starting page{page}')
    else:
        pages = convert_from_path(f"{CURRENT_DIRECTORY}/{pdf_name}")
        for i in range(len(pages)):
            pages[i].save(f"{file}/{file} - Page {i+1}.jpg", 'JPEG')
            if i % 10 == 0:
                print(f'Starting page {page}')
    return print('Image Conversion done')

In [19]:
def get_image_paths(folder_name):
    '''
    Simple utility function that looks for all the file names for extracted images. It looks in a folder in the same
    directory and extracts paths for all jpgs.
    ---
    folder_name: name of folder you want to look in
    '''
    files = glob(f'{folder_name}/*.jpg')
    files_dict= {}
    num_in_title = 0
    for letter in folder_name:
        if letter.isnumeric():
            num_in_title += 1
    num_in_title = num_in_title*2
    for file in files:
        page_number = []
        for letter in file:
            if letter.isnumeric():
                page_number.append(letter)
        page_number = "".join(page_number)
        page_number = page_number[num_in_title:]
        page_number = int(page_number)
        files_dict[page_number] = file
    files_dict_2 = dict(sorted(files_dict.items()))
    files = list(files_dict_2.values())            
    return files

In [10]:
def extract_and_join_jpg(files, final_file_name, langua="eng", verbose=True):
    '''
    Takes a list of image paths and extracts the text using tesseract OCR, creating a .txt file.
    ---
    files: listof image paths
    final_file_name: name that is assigned to the .txt file
    langua: specifies what language the file is in, which helps tesseract more correctly identify the text
    verbose: gives a report on the progress of the text extraction every 10 images
    '''
    final_text = []
    for i, path in enumerate(files):
        dire= CURRENT_DIRECTORY+"/"+path
        im = Image.open(dire)
        text = pytesseract.image_to_string(im, lang=langua)
        final_text.append(text)
        if verbose:
            if i % 10 == 0:
                print(f"Image {i} done,{len(files) - i+1} images to go...")
    final_text = " ".join(final_text)
    
    file1 = open(final_file_name+"-"+langua+".txt","w")
    file1.write(final_text)
    file1.close()
    return final_text, print(".txt Done! Check your files")

In [11]:
def txt_checker(input_text):
    '''
    Uses regex to test if string is a .txt file name or path
    '''
    pattern = re.compile(r"[A-Za-z0-9]+\.txt", re.IGNORECASE)
    matching = re.search(pattern, input_text)
    if matching:
        return True
    return False

In [16]:
def convert_txt(txt, filler_name=None, formats ='epub'):
    '''
    Takes a string and checks if it is a .txt file. If so, imports the contents of the txt file as a string. If not, assumes
    that the string is the main text. it then converts this string into a specified format (default is epub) using pandoc
    --- 
    txt: string
    formats: the final format you want to convert the text into. defaults to epub
    
    '''
    if txt_checker(txt):
        with open(txt, "r") as file:
            txt_string = file.read()
        txt_string = txt_string.replace("-\n","")
        txt_document = pandoc.read(txt_string)
        pandoc.write(doc=txt_document,file=f'{txt[-4]}.epub', formats=formats)
    else:
        txt_string = txt.replace("-\n","")
        txt_document = pandoc.read(txt_string)
        pandoc.write(doc=txt_document,file=f'{filler_name}.epub', format=formats)
    return print(f'File saved as {formats}')

In [17]:
def full_extraction_process(file_name, langua='eng', convert=True, formats='epub', batch=True, verbose=True):
    '''
    Is a function that joins the pdf_to_images, get_image_paths, extract_and_join_jpg functions and (optionally) the 
    convert_txt function. Transforms a pdf into images, extracts the text from those images, consolidates it into a .txt file
    and then optionally converts the .txt file into a diffferent format (defaults to epub)
    ---
    file_name: the name of the pdf input
    langua: the language use in tesseract for OCR
    convert: if the text file will be converted or not
    formats: format of the final version is there is a conversion. Defaults to epub
    batch: if the image exraction will be done in batches of 10
    '''
    filler_name = file_name
    pdf_to_images(file_name, batch=batch,verbose=verbose)
    files = get_image_paths(file_name)
    final_txt, _ = extract_and_join_jpg(files,file_name,verbose=verbose)
    if convert:
        convert_txt(final_txt, formats=formats, filler_name=filler_name)
    return print('Fully done')

In [18]:
# Testing on new book
full_extraction_process("foucault-the-history-of-sexuality-volume-1")

Starting page1
Starting page11
Starting page21
Starting page31
Starting page41
Starting page51
Starting page61
Starting page71
Starting page81
Starting page91
Starting page101
Starting page111
Starting page121
Starting page131
Starting page141
Starting page151
Starting page161
Image Conversion done
27
68
93
142
153
61
14
5
59
138
92
80
1
76
121
114
90
82
136
84
161
11
158
49
47
16
118
21
83
77
162
151
50
55
64
24
3
113
30
43
104
111
46
160
39
119
134
143
8
37
164
19
17
13
88
25
131
85
108
79
144
122
125
159
42
127
115
149
44
33
141
106
150
102
4
7
12
38
72
71
129
34
89
140
29
31
70
53
75
103
60
36
73
28
137
22
10
32
6
154
81
145
78
117
105
35
97
45
124
98
126
123
95
133
65
9
2
107
146
109
58
74
101
18
91
94
116
156
163
56
100
23
157
87
86
52
128
62
41
54
48
110
69
112
20
120
135
152
155
67
139
40
57
148
66
51
96
15
26
63
130
132
147
99
Image 0 done,165 images to go...
Image 10 done,155 images to go...
Image 20 done,145 images to go...
Image 30 done,135 images to go...
Image 40 done,125

## Model Creation

In the following section, I will attempt to create a model from scratch, training and tuning a model based on papers found in PapersWithCode and Arxiv

In [None]:
# Importing Dataset from Hugging Face
from datasets import load_dataset


dataset = load_dataset("naver-clova-ix/synthdog-en")