# OrchestrAit
## Mortgage Packet Sorting Made Easy

### Overview
This notebook is a proof of concept showing how scripting combined with a trained ML model can automate loan packet sorting. The scripts handle folder creation and routing, while the model reads each document's text to determine its category. Keep your company's current workflow in mind as you see how files and folders are created. The following questions will need to be answered when implementing OrchestrAit in your workflow:

1. What is the naming convention of your current loan packets?
2. Where are your unprocessed loan packets currently stored, and in what format do they arrive?
3. What is the current file structure of your processed loan packets? What format are they stored in?
4. What programs need access to these processed loan packets? (E.g. Salesforce)

### Getting Started

Below we import the necessary Python libraries and load the trained model that we will run on our sample files.

In [34]:
import tensorflow as tf
import numpy as np
import os
import pandas as pd
import PyPDF2
import shutil
import pathlib
from pdf2image import convert_from_path
import pytesseract
import cv2
import nltk
from transformers import BertTokenizer

nltk.download('punkt')  # Download the required resource
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract"

[nltk_data] Downloading package punkt to C:\Users\Tim
[nltk_data]     Williams\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [35]:
model=tf.keras.models.load_model('mortgage_doc_identifier_v5')

In [36]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 512)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 512)]        0           []                               
                                                                                                  
 bert (Custom>TFBertMainLayer)  {'last_hidden_state  109482240   ['input_ids[0][0]',              
                                ': (None, 512, 768)               'attention_mask[0][0]']         
                                , 'pooler_output':                                                
                                (None, 768)}                                                

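To begin, we define the loan document types. The list can hold as many types as the model was trained on, and its order matters: each type's position must match the class index the model outputs, because the predicted index is later looked up against this DataFrame's index.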

In [37]:
doc_type_df=pd.DataFrame({'doc_type': ['Foreclosure', 'General','Mortgage','Note','Origination','Title']})

### Support Functions
Next we define a few functions that automatically create the folder structure we want for each loan packet. We also split each loan packet into its individual pages to prepare for analysis and routing.

In [38]:
def subdir_maker(full_file_path, final_output_path):

    # Extract the file name from the path
    file_name = os.path.basename(full_file_path)

    # Remove the file extension (if any)
    file_name = os.path.splitext(file_name)[0]

    # Create a folder with the file name
    folder_path = os.path.join(final_output_path, file_name)
    os.makedirs(folder_path)
    
    
    # create a list of these folders and iterate through the list to create new subdirs
    folder_list=doc_type_df.doc_type.tolist()
    for folder in folder_list:
        new_subfolder_path = os.path.join(folder_path, folder)
        os.makedirs(new_subfolder_path)
    
    # return the root folder path so that the split_pdf function can put the files that need to be sorted here.
    return folder_path, file_name
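
As written, `subdir_maker` raises `FileExistsError` when a packet's folder is already present. One possible way to make folder creation safe to re-run is sketched below. It is not the notebook's implementation: `make_packet_dirs`, `DOC_TYPES` and the sample file name are hypothetical, and the sketch only covers folder creation, not what should happen to pages that were already sorted.

```python
import tempfile
from pathlib import Path

# Same categories as doc_type_df in this notebook.
DOC_TYPES = ['Foreclosure', 'General', 'Mortgage', 'Note', 'Origination', 'Title']


def make_packet_dirs(packet_file, output_root, doc_types=DOC_TYPES):
    """Create <output_root>/<packet name>/<doc type>/ folders (hypothetical helper)."""
    packet_name = Path(packet_file).stem  # file name without its extension
    packet_dir = Path(output_root) / packet_name
    for doc_type in doc_types:
        # parents=True also creates packet_dir; exist_ok=True tolerates re-runs
        (packet_dir / doc_type).mkdir(parents=True, exist_ok=True)
    return packet_dir, packet_name


# Demonstrate in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as root:
    packet_dir, name = make_packet_dirs('Loan_123.pdf', root)
    make_packet_dirs('Loan_123.pdf', root)  # a second call does not raise
    created = sorted(p.name for p in packet_dir.iterdir())
    print(name, created)
```

Whether re-processing should overwrite, skip, or version an existing packet depends on your workflow, so treat this as a starting point only.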

In [39]:
def split_pdf_template(input_path, final_output_path):
    doc_list = []
    # Iterate over each packet in the source folder
    for original_name in os.listdir(input_path):
        # Open the packet under its original name on disk
        file_path = os.path.join(input_path, original_name)
        # Use underscores instead of spaces in the output folder and page names
        safe_name = original_name.replace(' ', '_')

        # TODO: handle the case where the packet's output folder already exists
        new_folder, name = subdir_maker(safe_name, final_output_path)

        with open(file_path, 'rb') as file:
            read_pdf = PyPDF2.PdfReader(file)

            for page_number, page in enumerate(read_pdf.pages, start=1):
                pdf_writer = PyPDF2.PdfWriter()
                pdf_writer.add_page(page)

                # Write each page straight into the packet folder
                page_name = f"{name}_{page_number}.pdf"
                page_path = os.path.join(new_folder, page_name)
                with open(page_path, 'wb') as output_file:
                    pdf_writer.write(output_file)
                # Record (folder, page file name, OCR text) for the DataFrame
                doc_list.append((new_folder, page_name, pdf_reader(page_path)))

    return doc_list
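
The page naming rule used above (spaces become underscores, then `_<page number>.pdf` is appended to the packet name) can be isolated in a few lines. This is only an illustrative sketch: `page_output_name` and the sample file name are hypothetical and not part of the notebook.

```python
from pathlib import Path


def page_output_name(packet_file, page_number):
    """Build a per-page file name from the packet name (hypothetical helper)."""
    stem = Path(packet_file).stem.replace(' ', '_')  # drop extension, no spaces
    return f"{stem}_{page_number}.pdf"


names = [page_output_name('Sample Packet.pdf', n) for n in (1, 2, 3)]
print(names)
```

If your packets follow a different naming convention, this is the one place that would need to change.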

In [7]:
def pdf_reader(file):
    # Render every page of the PDF to an image at 400 dpi
    pages = convert_from_path(file, dpi=400)
    page_texts = []
    for page in pages:
        image = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2BGR)
        # Grayscale, Gaussian blur, Otsu's threshold
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        blur = cv2.GaussianBlur(gray, (3, 3), 0)
        thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

        # Morph open to remove noise and invert image
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
        opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
        invert = 255 - opening
        page_texts.append(pytesseract.image_to_string(invert, lang='eng', config='--psm 6'))
    # Join the text of all pages and flatten line breaks
    return ' '.join(page_texts).replace('\n', ' ')
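
To make the thresholding step less opaque, here is a pure-Python illustration of the idea behind Otsu's method and the `THRESH_BINARY_INV` rule, run on a handful of made-up grayscale values. It is a teaching sketch rather than OpenCV's implementation, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the 0-255 cut that maximizes between-class variance (Otsu's criterion)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        weight_low += histogram[level]        # pixels at or below this level
        sum_low += level * histogram[level]
        weight_high = total - weight_low      # pixels above this level
        if weight_low == 0:
            continue
        if weight_high == 0:
            break
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binary_inverse(pixels, threshold):
    """Mimic the THRESH_BINARY_INV rule: values above the threshold become 0, the rest 255."""
    return [0 if p > threshold else 255 for p in pixels]


# Made-up sample: three dark "ink" pixels and three bright "paper" pixels.
pixels = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(pixels)
mask = binary_inverse(pixels, t)     # ink becomes 255, paper becomes 0
restored = [255 - v for v in mask]   # same flip as `invert = 255 - opening`
print(t, mask, restored)
```

The inverse threshold makes the text the foreground for the morphological opening; the final `255 - opening` flips the image back to dark text on a light background before it is passed to Tesseract.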

In [8]:
def doc_to_dataframe(source_path, output_path):
    doc_list=split_pdf_template(source_path, output_path)
    doc_list_df=pd.DataFrame(doc_list, columns=['doc_path', 'doc_name','doc_content'])
    doc_list_df=doc_list_df.dropna()
    return doc_list_df

### Execute our Support Functions for New Files and Folders

Enter the folder that holds the original packet files and the folder where the processed files should go.

In [28]:
sample_file = input('Enter the folder location of your source packet files')
output_path = input('Enter the folder location where you want to store your processed packet files')

Enter the folder location of your source packet files C:\Users\Tim Williams\Desktop\python_playground\unfiltered_forms\actual
Enter the folder location where you want to store your processed packet files C:\Users\Tim Williams\Desktop\python_playground\unfiltered_forms\final


We split each page's text into fixed-size sequences of words so that the model, which accepts at most 512 tokens per input, can ingest and analyze large documents (10K+ words).

In [40]:
def segment_text(text, max_length):
    word_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+') # tokenize words
    tokens = word_tokenizer.tokenize(text)
    segments = []
    current_segment = []
    for token in tokens:
        current_segment.append(token)
        if len(current_segment) == max_length:
            segments.append(current_segment)
            current_segment = []
    # If there are remaining tokens at the end, add them as the last segment
    if current_segment:
        segments.append(current_segment)
    return segments
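
The same chunking can be tried without NLTK. The sketch below assumes that `re.findall` with the same `\w+` pattern is an acceptable stand-in for the tokenizer used above; `segment_words` and the small segment size are hypothetical and chosen only to keep the output readable.

```python
import re


def segment_words(text, max_length):
    """Split text into lists of at most max_length word tokens."""
    tokens = re.findall(r'\w+', text)  # same \w+ pattern as the notebook's tokenizer
    # Slice the token list into consecutive chunks; the last one may be shorter
    return [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]


sample = "THIS NOTE CONTAINS PROVISIONS ALLOWING FOR CHANGES IN MY INTEREST RATE"
segments = segment_words(sample, 4)
print(segments)
```

One thing to keep in mind when choosing the segment size: segments are counted in words, while the BERT tokenizer counts word pieces and can split a single word into several tokens. A 512-word segment may therefore be truncated at the 512-token limit, so a smaller segment size might be worth testing.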

In [11]:
# Create a new DataFrame to store segmented text
def word_segmenter(df, seg_size):
    columns = ['doc_path', 'doc_name', 'doc_content', 'doc_segments']
    rows = []

    # Iterate over each document in df and emit one row per segment
    for _, row in df.iterrows():
        for segment in segment_text(row['doc_content'], seg_size):
            rows.append((row['doc_path'], row['doc_name'],
                         row['doc_content'], ' '.join(segment)))

    # Build the result once instead of concatenating inside the loop
    return pd.DataFrame(rows, columns=columns)

In [12]:
# Load the tokenizer once rather than on every call
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

def prep_data(text):
    # Pad or truncate every segment to exactly 512 tokens
    tokens = tokenizer.encode_plus(text, max_length=512,
                                   truncation=True, padding='max_length',
                                   add_special_tokens=True, return_token_type_ids=False,
                                   return_tensors='tf')
    # The dtype should match what the model's input layers expect
    return {
        'input_ids': tf.cast(tokens['input_ids'], tf.float64),
        'attention_mask': tf.cast(tokens['attention_mask'], tf.float64)
    }
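
What `padding='max_length'` and `truncation=True` do to a sequence can be shown with plain lists. This is a simplified sketch, not the tokenizer's implementation: the token ids are made up, `pad_or_truncate` is a hypothetical name, `pad_id=0` is an assumption, and the real tokenizer also keeps its special tokens in place when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    padding = [pad_id] * (max_length - len(kept))    # padding
    input_ids = kept + padding
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([7, 8, 9], 6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), 6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every input therefore has the fixed length the model's `input_ids` and `attention_mask` layers expect, and the mask tells the model which positions carry text.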

In [13]:
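def ml_sort_data(df, seg_size):
    # Split each page's text into segments of seg_size words
    doc_segments_df = word_segmenter(df, seg_size)
    # Tokenize each segment into the input_ids / attention_mask dict the model takes
    doc_segments_df['tokenized_segments'] = doc_segments_df['doc_segments'].apply(prep_data)
    # Score each segment; [0] takes the single row of class scores
    doc_segments_df['probability'] = [model.predict(i)[0] for i in doc_segments_df.tokenized_segments]
    # Sum the segment scores per page, then pick the highest-scoring class index
    agg_probs_df = pd.DataFrame(doc_segments_df.groupby(['doc_path', 'doc_name'])['probability'].sum()).reset_index()
    agg_probs_df['doc_type_no'] = agg_probs_df.probability.apply(lambda x: np.argmax(x))
    # Look up the label for each class index (row position in doc_type_df)
    agg_probs_df = agg_probs_df.merge(doc_type_df, left_on='doc_type_no', right_on=doc_type_df.index)
    # Move each page file into the subfolder named after its predicted type
    for index, row in agg_probs_df.iterrows():
        shutil.move(os.path.join(row.doc_path, row.doc_name), os.path.join(row.doc_path, row.doc_type))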
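
The aggregation step (sum the per-segment scores, then take the index of the largest total) can be illustrated with plain lists. The scores below are made up and `classify_page` is a hypothetical helper; only the label order is taken from `doc_type_df`.

```python
# Same order as doc_type_df; the position is the class index.
DOC_TYPES = ['Foreclosure', 'General', 'Mortgage', 'Note', 'Origination', 'Title']


def classify_page(segment_scores, doc_types=DOC_TYPES):
    """Sum per-segment score vectors and return the label with the highest total."""
    totals = [sum(column) for column in zip(*segment_scores)]
    if len(totals) != len(doc_types):
        raise ValueError("score vector length must match the number of doc types")
    best_index = max(range(len(totals)), key=totals.__getitem__)
    return doc_types[best_index], totals


# Made-up scores for one page that was split into two segments.
scores = [
    [0.10, 0.10, 0.20, 0.50, 0.05, 0.05],
    [0.05, 0.05, 0.60, 0.20, 0.05, 0.05],
]
label, totals = classify_page(scores)
print(label, totals)
```

Within a single page, summing and averaging pick the same class, since the average is just the sum divided by the number of segments.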

In [48]:
doc_df = doc_to_dataframe(sample_file, output_path)
doc_df.info() # Quick sanity check shows how many documents we have and if there are any empty rows in our dataframe
doc_df.head(10)

incorrect startxref pointer(3)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88 entries, 0 to 87
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   doc_path     88 non-null     object
 1   doc_name     88 non-null     object
 2   doc_content  88 non-null     object
dtypes: object(3)
memory usage: 2.2+ KB


| | doc_path | doc_name | doc_content |
|---|---|---|---|
| 0 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_1.pdf | wo ) oy : eo _ 4 : 3 ADJUSTABLE RATE NOTE é ; ... |
| 1 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_2.pdf | / ° : ON . ; < . Co \| D 4, INTEREST RATE AND M... |
| 2 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_3.pdf | ” SO wy] 7. BORROWER'S FAILURE TO PAY AS REQU... |
| 3 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_4.pdf | : _ ‘ , - A . & . Transfer of the Property or ... |
| 4 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_5.pdf | casoc . wih * . i . Finance America, LLC 16802... |
| 5 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_6.pdf | . . oe \| Loan number: 9800078868 Borrower: R... |
| 6 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_7.pdf | a, \| cg Note Allonge Borrower:JOHN R. KAISER A... |
| 7 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_8.pdf | (pH4079 - a |
| 8 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_9.pdf | ) Borrower:JOHN R. KAISER AND ROSEMARIE T. KAI... |
| 9 | C:\Users\Tim Williams\Desktop\python_playgroun... | COLFIL-7-18-2016-Loan_7000087874-102130061_10.pdf | [24047 |
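
The preview shows pages (rows 7 and 9) whose OCR text is only a few characters long. `dropna()` keeps such rows because they are non-null strings, so a separate check may be useful. The sketch below works on tuples shaped like the entries returned by `split_pdf_template`; `flag_sparse_pages`, the `min_words` cutoff and the sample entries are hypothetical.

```python
def flag_sparse_pages(doc_list, min_words=5):
    """Return the page names whose OCR text has fewer than min_words words."""
    return [name for _folder, name, text in doc_list
            if len(text.split()) < min_words]


# Made-up entries in the (folder, page file name, OCR text) shape.
sample_doc_list = [
    ('packet_dir', 'packet_1.pdf', 'ADJUSTABLE RATE NOTE THIS NOTE CONTAINS PROVISIONS'),
    ('packet_dir', 'packet_2.pdf', '(pH4079 - a'),
    ('packet_dir', 'packet_3.pdf', '[24047'),
]
sparse = flag_sparse_pages(sample_doc_list)
print(sparse)
```

Pages flagged this way are likely blank or back-of-page scans; whether to route them to a review folder or sort them anyway is a workflow decision.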


In [46]:
doc_df.doc_content[0]

'wo ) oy : eo _ 4 : 3 ADJUSTABLE RATE NOTE é ; - ° (LIBOR Index - Rate Caps) . : . MIN 100052300401777767 THIS NOTE CONTAINS PROVISIONS ALLOWING FOR CHANGES IN MY INTEREST RATE AND MY MONTHLY PAYMENT. THIS NOTE LIMITS THE AMOUNT MY INTEREST RATE CAN CHANGE AT ANY ONE TIME AND THE MAXIMUM RATE I MUST PAY. . 04/24/04 IRVINE: CA sO, [Date] [City] {State] 2 FRANKLIN PL, MASSAPEQUA, NY 11758-7015 | [Property Address] ~ 1. BORROWER’S PROMISE TO PAY ; In return for a loan that I have received, [ promiseto payU.S.$ 415,000.00 (this amount is called “Principal"), plus interest, to the order of the Lender. The Lender is Finance America, LLC ; . I will make all payments under this Note in the form of cash, check or money order. I understand that the Lender may transfer this Note. The Lender or anyone who takes this Note by transfer and who is entitled to receive payments under this Note is called the "Note Holder.” 2. INTEREST | Interest will be charged on unpaid principal until the full amount o

In [49]:
ml_sort_data(doc_df, 512)

