Add intro explanations here....

* setting up dev env and text editor for windows/mac
* os library link
* relative and abspath explanation
* explain importing modules

## Step 1: Building Directory Paths and Empty Folders

_Functions called in the following code_

* `os.getcwd()` - Returns the current working directory as an absolute path. This is the directory the script or notebook was launched from, which is often, but not always, the directory where the script is located [More Info](https://www.tutorialspoint.com/python/os_getcwd.htm)
* `os.path.abspath()` - Takes a path and returns the corresponding absolute path, resolved against the current working directory (e.g. 'pdf/sample.pdf' becomes '/Users/sampleuser/project/pdf/sample.pdf' on OSX when the working directory is '/Users/sampleuser/project'). Note that it does not expand '~'; that is the job of os.path.expanduser() [More Info](https://www.geeksforgeeks.org/python-os-path-abspath-method-with-example/)
* `os.path.join()` - Takes two or more string path segments and joins them with the correct separator for the operating system, often used to combine a folder path with a filename inside it [More Info](https://www.geeksforgeeks.org/python-os-path-join-method/)
* `os.path.exists()` - Takes a string path and returns True if a file or folder exists there, False otherwise. See also the similar functions os.path.isdir() and os.path.isfile() [More Info](https://www.geeksforgeeks.org/python-check-if-a-file-or-directory-exists/)
* `os.mkdir()` - Takes a string containing the desired path and makes a new directory there. The parent directory must already exist, and an error is raised if the directory itself already exists [More Info](https://www.geeksforgeeks.org/python-os-mkdir-method/)
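
To make the difference between relative and absolute paths concrete, here is a small sketch of how these path functions behave. The names `pdf`, `sample.pdf`, and `sample.jpg` are placeholders chosen for illustration; no such files need to exist, because these functions only manipulate path strings.

```python
import os

# A relative path: it is interpreted relative to the current working directory
relative_path = os.path.join('pdf', 'sample.pdf')  # hypothetical file name

# os.path.abspath() resolves the path against the current working directory
absolute_path = os.path.abspath(relative_path)
print(absolute_path)

# Building the same path by hand from os.getcwd() gives the same result
expected = os.path.normpath(os.path.join(os.getcwd(), relative_path))
print(absolute_path == expected)

# abspath() does not expand '~'; os.path.expanduser() handles the home directory
home_path = os.path.expanduser(os.path.join('~', 'sample.jpg'))
print(home_path)
```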

In [1]:
# importing Python libraries; the import statement loads the specified library and makes it available under its name
import os

# Getting the current working directory; os.getcwd() already returns an absolute path
current_directory = os.getcwd()
# os.path.abspath() leaves an absolute path unchanged (apart from normalizing it), so this is only a safeguard
current_directory = os.path.abspath(current_directory)
print('Local root directory (absolute): ' + current_directory)

""" Building the path to the folder of PDFs: os.path.join() combines the current directory with the name of the
subdirectory ('pdf'). The reason to use os.path.join() rather than something like
current_directory + '/' + 'some_subfolder_name' is that os.path.join() works on multiple operating systems:
Windows uses backslashes in filepaths, while OSX/Linux use forward slashes."""

# combining for full path to folder of pdfs
pdf_directory = os.path.join(current_directory, 'pdf')
print('PDF subfolder Path: ' + pdf_directory)

# repeat path building for the (currently nonexistent) 'img' and 'txt' output folders
img_directory = os.path.join(current_directory, 'img')
print('Image subfolder Path: ' + img_directory)
txt_directory = os.path.join(current_directory, 'txt')
print('TXT subfolder Path: ' + txt_directory)

# if img directory doesn't exist, make it
if not os.path.exists(img_directory): # this is the same as saying if os.path.exists(img_directory) == False
    os.mkdir(img_directory)
    print('Image subfolder created')
    
# if txt directory doesn't exist, make it
if not os.path.exists(txt_directory):
    os.mkdir(txt_directory)
    print('TXT subfolder created')
    
print('Step completed')

Local root directory (absolute): /Users/davidthomas/git/arabic-ocr
PDF subfolder Path: /Users/davidthomas/git/arabic-ocr/pdf
Image subfolder Path: /Users/davidthomas/git/arabic-ocr/img
TXT subfolder Path: /Users/davidthomas/git/arabic-ocr/txt
Step completed
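
The check-then-create pattern used in this step can also be written with `os.makedirs()` and its `exist_ok` parameter. The sketch below is an alternative, not the approach the step itself uses, and it runs inside a throwaway temporary folder (an assumption made so the example does not touch the project directory).

```python
import os
import tempfile

# Use a temporary folder so the sketch leaves the real project untouched
with tempfile.TemporaryDirectory() as base_directory:
    created = []
    for name in ('img', 'txt'):
        folder = os.path.join(base_directory, name)
        # exist_ok=True means no error is raised if the folder already exists
        os.makedirs(folder, exist_ok=True)
        # calling it a second time is therefore harmless
        os.makedirs(folder, exist_ok=True)
        created.append(os.path.isdir(folder))
    print(created)
```

Unlike `os.mkdir()`, `os.makedirs()` also creates any missing parent folders along the way.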


## Step 2: Building a List of Files in the PDF folder

_Functions called in the following code_

* `"".endswith()` - Method available on every Python string that takes another string and returns True if the first string ends with it (the comparison is case-sensitive). See the similar method "".startswith()
* `os.walk()` - Receives a string path and returns a generator that walks the folder tree from the top down. For each folder (including subfolders at any depth) it yields a tuple of three items: the folder's path, a list of its subfolder names, and a list of its filenames. For how to loop through these tuples, see the example below and More Info [More Info](https://www.tutorialspoint.com/python/os_walk.htm)

In [2]:
# create empty list to add pdf filepaths to
pdf_filepaths = []

# for each folder inside the PDF folder, os.walk() yields its path (root), its subfolder names (dirs), and its filenames (files)
for root, dirs, files in os.walk(pdf_directory):
    # loop through the list of filenames in the current folder
    for file in files:
        # only add if a pdf file
        if file.endswith('.pdf'):
            # file is only the FILENAME, we must build full filepath by joining root with file in os.path.join()
            file_abspath = os.path.join(root, file)
            # now add the full filepath to the list of files
            pdf_filepaths.append(file_abspath)

# loop list of filepaths and print each out
for pdf_filepath in pdf_filepaths:
    print(pdf_filepath)

/Users/davidthomas/git/arabic-ocr/pdf/Al-Wansharisi Excerpt.pdf
/Users/davidthomas/git/arabic-ocr/pdf/Al-Tijani.pdf
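
The same walk can be wrapped in a reusable function. The sketch below is one possible variant: `find_pdfs` is a hypothetical helper name, the file names are made up, and the lowercase comparison is an added assumption so that files such as `SCAN.PDF` are matched too.

```python
import os
import tempfile


def find_pdfs(folder):
    """Hypothetical helper: return the sorted full paths of all PDFs under folder."""
    matches = []
    for root, dirs, files in os.walk(folder):
        for name in files:
            # lower() makes the extension check case-insensitive
            if name.lower().endswith('.pdf'):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Demonstration on a temporary folder containing made-up, empty files
with tempfile.TemporaryDirectory() as demo_directory:
    nested = os.path.join(demo_directory, 'nested')
    os.mkdir(nested)
    for path in (os.path.join(demo_directory, 'a.pdf'),
                 os.path.join(nested, 'B.PDF'),
                 os.path.join(demo_directory, 'notes.txt')):
        open(path, 'w').close()
    # show paths relative to the demo folder to keep the printout short
    found = [os.path.relpath(p, demo_directory) for p in find_pdfs(demo_directory)]
    print(found)
```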


## Step 3: Converting PDFs to Images

_Functions called in the following code_

* `os.path.basename()` - Receives a path and gives the last element of the path, whether a filename or deepest subfolder [More Info](https://www.geeksforgeeks.org/python-os-path-basename-method/)
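* `"".replace()` - Receives two strings: the first is the text to find, the second is the text to put in its place. Every occurrence is replaced, and the result is returned as a new string [More Info](https://www.geeksforgeeks.org/python-string-replace/)
* `convert_from_path()` - Receives the filepath of a PDF plus optional keyword arguments, such as `output_folder` (where to save the resulting image files) and `fmt` (the image format), and returns a list of images, one per page [More Info](https://pypi.org/project/pdf2image/)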

In [15]:
# expanding on import statement, "from X import Y" syntax imports selected subunits from the specified library
from pdf2image import convert_from_path


def convert_pdf_to_img(input_filepath):
    """Function receives a filepath to a PDF, calculates the corresponding output subfolder for the images
    based on the PDF filename, and then converts the PDF to a series of images stored in that output subfolder."""
    
    # use os.path.basename() to get PDF filename (instead of full path) to calculate name of new subfolder
    pdf_name = os.path.basename(input_filepath)
    # we need to remove the '.pdf' at the end; Python's built-in string .replace() method does this
    # (note that it removes every '.pdf' in the name, not only the final one)
    pdf_name = pdf_name.replace('.pdf', '')
    # this is the final output path of all the images of the given PDF
    output_folderpath = os.path.join(img_directory, pdf_name)
    # if no subfolder already exists for the output, make it
    if not os.path.exists(output_folderpath):
        os.mkdir(output_folderpath)
    print('Converting ' + input_filepath)
    # perform conversion on input_filepath, outputting to output_folderpath, in JPEG format
    convert_from_path(input_filepath, output_folder=output_folderpath, fmt='jpeg')
    print('Successful!')
    

# loop through PDFs and call conversion function on each NOTE: CURRENTLY STOPS AFTER 1st PDF FOR TESTING PURPOSES
for pdf_filepath in pdf_filepaths:
    convert_pdf_to_img(pdf_filepath)
    # REMOVE THIS BREAK TO CONVERT ALL PDFS
    break


Converting /Users/davidthomas/git/arabic-ocr/pdf/Al-Wansharisi Excerpt.pdf
Successful!
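
Because `.replace()` removes every occurrence of `'.pdf'`, a filename that contains that text more than once loses more than its extension. The sketch below compares it with `os.path.splitext()`, which splits off only the final extension. The filename used here is invented for illustration.

```python
import os

# hypothetical filename that contains '.pdf' twice
example_path = os.path.join('pdf', 'report.pdf.backup.pdf')
file_name = os.path.basename(example_path)

# replace() removes every occurrence of '.pdf', not just the extension
name_via_replace = file_name.replace('.pdf', '')
# splitext() returns (name, extension) and strips only the final extension
name_via_splitext = os.path.splitext(file_name)[0]

print(name_via_replace)
print(name_via_splitext)
```

For ordinary filenames such as `Al-Tijani.pdf` both approaches give the same result, so the difference only matters for unusual names.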


## Step 4: Running OCR on Images

_Functions called in the following code_

*