# Arabic-OCR

*A Digital Humanities Exercise: Batch converting folders of Arabic PDFs to plaintext files.*

[David J. Thomas](mailto:dave.a.base@gmail.com), [thePort.us](http://thePort.us)<br />

This exercise converts a batch of PDFs of Arabic texts into plain text files using OCR (optical character recognition). There is no single Python library that handles every step, so we will layer our approach using several. Arabic OCR libraries for Python work on images, not PDF files. Because of this, we first have to convert each PDF file into its own folder of images, one image per page, using a library called `pdf2image`. Then, for each PDF, we go into its folder of images, run OCR on each page, and rebuild the results in a text file, which we output to another folder.

1. Establish the filepaths to our desired directories (input 'PDF' and output 'img' and 'txt' folders)
2. Establish the contents of the PDF folder
3. Use `pdf2image` to convert each PDF to a subfolder of images, each of which will be stored inside `img` folder
4. On each subfolder inside `img` (equivalent to a PDF) scan each image to read text
5. Combine text of every image for a particular PDF into an output file inside `txt` folder

*Libraries we will use*
* `os` - Built-in, very important library for working with the operating system; it lets you write code that works across operating systems
* `shutil` - Built-in (no `pip` install needed), utilities for working with files; we are only going to use it to quickly clean up used and temporary files
* `pdf2image` - Installed via `pip`, library for converting a given PDF into a folder of images
* `natsort` - Installed via `pip`, library to make it easy to naturally sort lists of things (in our case filenames)
* `ArabicOcr` - Installed via `pip`, library for reading individual images and extracting the plain text of Arabic
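
To see why natural sorting matters for page images, here is a minimal sketch that uses only the standard library. The helper `natural_key` and the sample filenames are hypothetical and exist only for illustration; in the exercise itself `natsort.natsorted()` does this job.

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so 'page-10' sorts after 'page-2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', name)]


# hypothetical page image names, one per PDF page
pages = ['page-10.jpg', 'page-2.jpg', 'page-1.jpg']

plain_sorted = sorted(pages)                     # compares character by character
natural_sorted = sorted(pages, key=natural_key)  # compares embedded numbers as numbers

print(plain_sorted)
print(natural_sorted)
```

With plain sorting, page 10 lands between pages 1 and 2, which would scramble the order of the rebuilt text.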

Add intro explanations here....

* setting up dev env and text editor for windows/mac
* os library link
* explain importing modules

### Step 1: Building directory paths and empty folders

_Functions called in following code_

* `os.getcwd()` - Returns the current working directory (the directory the script or notebook was launched from, which is often but not always where the script is located) [More Info](https://www.tutorialspoint.com/python/os_getcwd.htm)
* `os.path.abspath()` - Takes a relative path and returns the corresponding absolute path, resolved against the current working directory (e.g. 'sample.jpg' becomes '/Users/sampleuser/project/sample.jpg' when run from '/Users/sampleuser/project' on macOS). Note that it does not expand '~'; that is the job of `os.path.expanduser()` [More Info](https://www.geeksforgeeks.org/python-os-path-abspath-method-with-example/)
* `os.path.join()` - Takes two or more string path segments and joins them together, often used to combine a folder path with a filename inside it [More Info](https://www.geeksforgeeks.org/python-os-path-join-method/)
* `os.path.exists()` - Takes a string path and returns True/False depending on whether a file or folder exists there. See also the similar functions `os.path.isdir()` and `os.path.isfile()` [More Info](https://www.geeksforgeeks.org/python-check-if-a-file-or-directory-exists/)
* `os.mkdir()` - Takes a string containing the desired path and makes a new directory there [More Info](https://www.geeksforgeeks.org/python-os-mkdir-method/)
* `shutil.rmtree()` - Takes a string path to a folder and deletes that folder and everything inside it; we will use it to clean up unwanted files [More Info](https://www.geeksforgeeks.org/delete-an-entire-directory-tree-using-python-shutil-rmtree-method/)
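
Before using these functions on the real project folders, it can help to try them somewhere harmless. The following is a small sketch that works inside a temporary directory; the folder name `sample` and the variable names are made up for the demonstration.

```python
import os
import shutil
import tempfile

# work inside a throwaway directory so nothing real is touched
base = tempfile.mkdtemp()

# os.path.join() inserts the right separator for the current operating system
sample_dir = os.path.join(base, 'sample')
existed_before_mkdir = os.path.exists(sample_dir)   # nothing has been created yet

os.mkdir(sample_dir)
existed_after_mkdir = os.path.isdir(sample_dir)

# os.path.abspath() resolves a relative path against the current working directory
relative_result = os.path.abspath('sample.jpg')
expected_result = os.path.join(os.getcwd(), 'sample.jpg')

# shutil.rmtree() removes the directory and everything inside it
shutil.rmtree(base)
existed_after_rmtree = os.path.exists(sample_dir)

print(existed_before_mkdir, existed_after_mkdir, existed_after_rmtree)
print(relative_result)
```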

In [24]:
# importing python libraries, the import statement loads everything from the specified library
import os
import shutil

"""Fully capitalizing variable names is standard ONLY for 'global constants', that is, variables that do not change
and are available everywhere (not inside a function). This is common for things like settings"""
# getting local directory...
CURRENT_DIRECTORY = os.getcwd()
print('Local root directory: ' + CURRENT_DIRECTORY)

""" Building the path to the folder of PDFs: os.path.join() combines the current directory with the name of the
subdirectory ('pdf'). We use os.path.join() rather than something like CURRENT_DIRECTORY + '/' + 'some_subfolder_name'
because os.path.join() works on multiple operating systems. Windows uses backslashes in filepaths and macOS/Linux
use forward slashes, so this is important"""
# combining for full path to folder of pdfs
PDF_DIRECTORY = os.path.join(CURRENT_DIRECTORY, 'pdf')
print('PDF subfolder Path: ' + PDF_DIRECTORY)

# repeat path building for the (currently nonexistent) 'img' and 'txt' output folders
IMG_DIRECTORY = os.path.join(CURRENT_DIRECTORY, 'img')
print('Image subfolder Path: ' + IMG_DIRECTORY)
TXT_DIRECTORY = os.path.join(CURRENT_DIRECTORY, 'txt')
print('TXT subfolder Path: ' + TXT_DIRECTORY)

# if img directory doesn't exist, make it
if not os.path.exists(IMG_DIRECTORY): # this is the same as saying if os.path.exists(IMG_DIRECTORY) == False
    os.mkdir(IMG_DIRECTORY)
    print('Image subfolder created')
# otherwise (if script was run before and folder exists) delete the folder and make a new empty one
else:
    shutil.rmtree(IMG_DIRECTORY)
    os.mkdir(IMG_DIRECTORY)
    print('Image folder re-initialized')

# if txt directory doesn't exist, make it
if not os.path.exists(TXT_DIRECTORY):
    os.mkdir(TXT_DIRECTORY)
    print('Text subfolder created')
# otherwise (if script was run before and folder exists) delete the folder and make a new empty one
else:
    shutil.rmtree(TXT_DIRECTORY)
    os.mkdir(TXT_DIRECTORY)
    print('Text folder re-initialized')

print('Step completed')

Local root directory: /Users/davidthomas/git/arabic-ocr
PDF subfolder Path: /Users/davidthomas/git/arabic-ocr/pdf
Image subfolder Path: /Users/davidthomas/git/arabic-ocr/img
TXT subfolder Path: /Users/davidthomas/git/arabic-ocr/txt
Image folder re-initialized
Text folder re-initialized
Step completed
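
The "create the folder, or wipe and recreate it" pattern appears twice above, once for `img` and once for `txt`. One possible way to avoid the repetition is a small helper function. This is only a sketch: the name `reset_directory` is hypothetical, and the demonstration runs in a temporary directory rather than the real project folder.

```python
import os
import shutil
import tempfile


def reset_directory(path):
    """Delete `path` if it already exists, then create it again as an empty folder."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.mkdir(path)
    return path


# demonstration in a temporary location rather than the real project folder
demo_root = tempfile.mkdtemp()
demo_img = os.path.join(demo_root, 'img')

reset_directory(demo_img)                              # first run: folder is created
open(os.path.join(demo_img, 'old.jpg'), 'w').close()   # leftover from a "previous run"
reset_directory(demo_img)                              # second run: folder is emptied
remaining = os.listdir(demo_img)
print(remaining)

shutil.rmtree(demo_root)
```

In the exercise this would be called as `reset_directory(IMG_DIRECTORY)` and `reset_directory(TXT_DIRECTORY)`. Keep in mind that it deletes whatever is in those folders without asking.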


### Step 2: Building a List of Files in the PDF folder

_Functions called in the following code_

* `"".endswith()` - Function attached to every Python string that takes another string and returns True if the first string ends with it. See the similar function `"".startswith()`
* `os.walk()` - Receives a string path and walks through that folder and every subfolder inside it (multiple levels deep). For each folder visited it yields three things: the folder's path (`root`), a list of subfolder names (`dirs`), and a list of filenames (`files`). For the slightly unusual way you loop through the results, see the example below and More Info [More Info](https://www.tutorialspoint.com/python/os_walk.htm)

In [25]:
# create empty list to add pdf filepaths to
pdf_filepaths = []

# for every folder inside the PDF folder, os.walk() gives us its path (root), its subfolder names, and its filenames
for root, dirs, files in os.walk(PDF_DIRECTORY):
    # loop through the list of filenames found in this folder
    for file in files:
        # only add if a pdf file
        if file.endswith('.pdf'):
            # file is only the FILENAME, we must build full filepath by joining root with file in os.path.join()
            file_fullpath = os.path.join(root, file)
            # now add the full filepath to the list of files
            pdf_filepaths.append(file_fullpath)

# loop list of filepaths and print each out
for pdf_filepath in pdf_filepaths:
    print(pdf_filepath)

/Users/davidthomas/git/arabic-ocr/pdf/Al-Wansharisi Excerpt.pdf
/Users/davidthomas/git/arabic-ocr/pdf/Al-Tijani.pdf
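
The same walk can be wrapped in a reusable function. The sketch below is one possible variant, not the exercise's own code: the name `find_pdfs` is hypothetical, and it adds two assumptions of its own, namely matching the extension case-insensitively (so `SCAN.PDF` is also found) and returning the paths in sorted order.

```python
import os
import shutil
import tempfile


def find_pdfs(folder):
    """Return a sorted list of full paths to every PDF under `folder`, at any depth."""
    found = []
    for root, dirs, files in os.walk(folder):
        for name in files:
            # lower() lets 'SCAN.PDF' match as well as 'scan.pdf'
            if name.lower().endswith('.pdf'):
                found.append(os.path.join(root, name))
    return sorted(found)


# build a small throwaway tree to search
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, 'nested'))
for relative in ['a.pdf', 'notes.txt', os.path.join('nested', 'B.PDF')]:
    open(os.path.join(demo, relative), 'w').close()

# show the results relative to the demo folder to keep the output short
found_names = [os.path.relpath(path, demo) for path in find_pdfs(demo)]
print(found_names)
shutil.rmtree(demo)
```

If you adopt the case-insensitive match, any later step that strips the `.pdf` extension from the filename needs to cope with upper-case extensions too.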


### Step 3: Converting PDFs to Images

_Functions called in the following code_

* `os.path.basename()` - Receives a path and gives the last element of the path, whether a filename or deepest subfolder [More Info](https://www.geeksforgeeks.org/python-os-path-basename-method/)
* `"".replace()` - Receives two strings, first contains text to 'select', second with the text that replaces selected text [More Info](https://www.geeksforgeeks.org/python-string-replace/)
* `convert_from_path()` - Receives the filepath of a PDF plus optional keyword arguments; we pass `output_folder` (the folder where the page images are saved) and `fmt` (the image format) [More Info](https://pypi.org/project/pdf2image/)
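
One thing to be aware of: `.replace('.pdf', '')` removes every occurrence of `.pdf` in the filename, not just the one at the end. An alternative from the standard library is `os.path.splitext()`, which splits off only the final extension. The sketch below compares the two; the function name `image_folder_for` and the sample filenames are hypothetical.

```python
import os


def image_folder_for(pdf_path, img_root):
    """Work out the image subfolder for a PDF: <img_root>/<PDF filename without extension>."""
    filename = os.path.basename(pdf_path)
    stem, extension = os.path.splitext(filename)   # e.g. ('Al-Tijani', '.pdf')
    return os.path.join(img_root, stem)


example = image_folder_for(os.path.join('pdf', 'Al-Tijani.pdf'), 'img')
print(example)

# a filename that happens to contain '.pdf' twice
tricky = 'my.pdf.notes.pdf'
via_replace = tricky.replace('.pdf', '')       # removes both occurrences
via_splitext = os.path.splitext(tricky)[0]     # removes only the final extension
print(via_replace)
print(via_splitext)
```

For ordinary filenames such as the two PDFs in this exercise, both approaches give the same result.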

In [26]:
# expanding on import statement, "from X import Y" syntax imports selected subunits from the specified library
from pdf2image import convert_from_path


def convert_pdf_to_img(input_filepath):
    """Function receives a filepath to a PDF, calculates the corresponding output subfolder for the images
    based on the PDF filename, and then converts the PDF to a series of images stored in that output subfolder."""
    
    # use os.path.basename() to get PDF filename (instead of full path) to calculate name of new subfolder
    pdf_name = os.path.basename(input_filepath)
    # we need to remove the .pdf at the end, this is easy with Python's built-in string .replace() function
    pdf_name = pdf_name.replace('.pdf', '')
    # this is the final output path of all the images of the given PDF
    output_folderpath = os.path.join(IMG_DIRECTORY, pdf_name)
    # if no subfolder already exists for the output, make it
    if not os.path.exists(output_folderpath):
        os.mkdir(output_folderpath)
    print('Converting ' + input_filepath)
    # perform conversion on input_filepath, outputting to output_folderpath, in jpeg format
    convert_from_path(input_filepath, output_folder=output_folderpath, fmt='jpeg')
    print('Successful!')
    

# loop through PDFs and call conversion function on each NOTE: CURRENTLY STOPS AFTER 1st PDF FOR TESTING PURPOSES
for pdf_filepath in pdf_filepaths:
    convert_pdf_to_img(pdf_filepath)
    # REMOVE THIS BREAK TO CONVERT ALL PDFS
    break


Converting /Users/davidthomas/git/arabic-ocr/pdf/Al-Wansharisi Excerpt.pdf
Successful!


### Step 4: Getting List of Filepaths for Each Image Subfolder

_Functions called in the following code_

* `os.listdir()` - Takes a string path and returns a list of the names of all files and folders found *directly* inside the path specified; it does not go down multiple levels like `os.walk()` [More Info](https://www.geeksforgeeks.org/python-os-listdir-method/)
* `os.path.isdir()` - Takes a string path and returns True if it points to a folder and False otherwise; see the similar function `os.path.isfile()` [More Info](https://www.geeksforgeeks.org/python-os-path-isdir-method/)
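
To make the "directly inside" point concrete, here is a small sketch run against a throwaway folder tree. The folder and file names are invented for the demonstration.

```python
import os
import shutil
import tempfile

# throwaway tree: two folders (one with a deeper subfolder) and a loose file
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, 'book-a', 'deeper'))
os.mkdir(os.path.join(demo, 'book-b'))
open(os.path.join(demo, 'readme.txt'), 'w').close()

# os.listdir() only reports names directly inside `demo`, so 'deeper' is not listed
top_level = sorted(os.listdir(demo))

# keep only the entries that are folders; isdir() needs the full path, not just the name
folders = sorted(name for name in os.listdir(demo)
                 if os.path.isdir(os.path.join(demo, name)))

print(top_level)
print(folders)
shutil.rmtree(demo)
```

The results are sorted here because `os.listdir()` makes no promise about the order in which it returns names.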

In [27]:
# we need to get a list of all subdirs inside our img folder (should be one for each PDF), start with an empty list
pdf_img_dirs = []

# first let's loop through the contents of what is directly inside our img folder
for sub_item in os.listdir(IMG_DIRECTORY):
    # since sub_item is only the name, we need to build the full filepath with os.path.join()
    sub_item_path = os.path.join(IMG_DIRECTORY, sub_item)
    # checks to see if sub_item is a directory or file
    if os.path.isdir(sub_item_path):
        # if so add it to list of folders
        pdf_img_dirs.append(sub_item_path)


# Loop and print each folder found
print('Images found at...')
for pdf_img_dir in pdf_img_dirs:
    print(pdf_img_dir)

Images found at...
/Users/davidthomas/git/arabic-ocr/img/Al-Wansharisi Excerpt


### Step 5: Defining Function to Convert a Single Folder

_Functions called in the following code_

* `arabicocr.arabic_ocr()` - Performs OCR: takes 2 strings, an image filepath for input and an image filepath for the error-checking output image, and RETURNS a list of results containing the words recognized. [More Info](https://pypi.org/project/ArabicOcr/)

In [29]:
# import the library call for natural sorter and arabic ocr
import natsort
from ArabicOcr import arabicocr


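def gather_folder(folderpath):
    """Function receives the path to one PDF's folder of images and prints the image filenames in natural order."""
    # list everything directly inside the folder (names only, not full paths)
    img_files = os.listdir(folderpath)
    # natural sort so that page 10 comes after page 9 rather than straight after page 1
    sorted_img_files = natsort.natsorted(img_files)
    for sorted_file in sorted_img_files:
        print(sorted_file)


# call the function on each PDF's folder of images
for pdf_img_dir in pdf_img_dirs:
    gather_folder(pdf_img_dir)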

ValueError: invalid literal for int() with base 10: '314e3cd9-d120-4e0a-9ed6-22df5bf31cf4-4'
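
The remaining work for this step (run OCR on each page in order, then combine the text into one output file) might be sketched as follows. This is an outline under stated assumptions, not a finished implementation. The names `folder_to_text`, `natural_key`, and `fake_recognise` are hypothetical. The OCR call is passed in as a function so that the sketch runs without any OCR library installed; in the real exercise that function would wrap `arabicocr.arabic_ocr()` and return the words for one page. The standard-library `natural_key` stands in for `natsort`.

```python
import os
import re
import shutil
import tempfile


def natural_key(name):
    """Sort key that compares embedded numbers as numbers (stdlib stand-in for natsort)."""
    return [int(part) if part.isdigit() else part for part in re.split(r'(\d+)', name)]


def folder_to_text(folderpath, output_path, recognise_page):
    """Run `recognise_page` on every image in page order and write one line of text per page."""
    page_names = sorted(os.listdir(folderpath), key=natural_key)
    # utf-8 so that Arabic characters are written correctly on every operating system
    with open(output_path, 'w', encoding='utf-8') as output_file:
        for page_name in page_names:
            words = recognise_page(os.path.join(folderpath, page_name))
            output_file.write(' '.join(words) + '\n')
    return page_names


def fake_recognise(image_path):
    """Placeholder recogniser: pretends each page contains three words."""
    return ['text', 'from', os.path.basename(image_path)]


# throwaway folder with three empty "page images"
demo = tempfile.mkdtemp()
pages_dir = os.path.join(demo, 'pages')
os.mkdir(pages_dir)
for name in ['scan-10.jpg', 'scan-2.jpg', 'scan-1.jpg']:
    open(os.path.join(pages_dir, name), 'w').close()

order = folder_to_text(pages_dir, os.path.join(demo, 'out.txt'), fake_recognise)
with open(os.path.join(demo, 'out.txt'), encoding='utf-8') as result_file:
    lines = result_file.read().splitlines()
print(order)
shutil.rmtree(demo)
```

Passing the recogniser in as an argument also makes it easy to test the file handling separately from the much slower OCR step.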