# Recommender Systems in a Large-Scale Document-Based Data Extraction Application

*Mauricio Alarcon <rmalarc@msn.com>*

## Introduction

This project explores the role and implementation of recommender systems in a document processing application. The case considered here is an information extraction application that supports automatic data extraction from large document collections, such as legal contracts and other types of commercial agreements.

## Application Workflow

At a high level, the flow of the application is:

1. Upload the document
2. Perform OCR as needed
3. **Detect the document type\***
4. **Parse and extract the defined data elements\***
5. Return the document with the extracted data points for review

\* These are the steps where recommender systems can support the application.
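
The flow above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `Document`, `process`, and the `ocr`, `classify`, and `extractors_by_type` parameters are hypothetical names introduced here, not part of the application described in this project.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Document:
    """Hypothetical container for an uploaded document."""
    raw: bytes
    text: Optional[str] = None          # None means OCR is still needed
    doc_type: str = "Unknown"
    datapoints: Dict[str, str] = field(default_factory=dict)


# An extractor takes the document text and returns a value, or None if not found.
Extractor = Callable[[str], Optional[str]]


def process(
    doc: Document,
    ocr: Callable[[bytes], str],
    classify: Callable[[str], str],
    extractors_by_type: Dict[str, Dict[str, Extractor]],
) -> Document:
    # Step 2: perform OCR only when the upload has no text layer.
    if doc.text is None:
        doc.text = ocr(doc.raw)
    # Step 3: detect the document type.
    doc.doc_type = classify(doc.text)
    # Step 4: run only the extractors defined for the detected type.
    for name, extract in extractors_by_type.get(doc.doc_type, {}).items():
        value = extract(doc.text)
        if value is not None:
            doc.datapoints[name] = value
    # Step 5: return the document with its extracted data points for review.
    return doc
```

Keeping the classifier and the extractor registry as parameters makes steps 3 and 4 easy to swap out, which is where the recommenders discussed next would plug in.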

## Recommender Systems in the Application Pipeline

Let's explore the following areas where recommender systems can support the application:

### Document Classification

The goal of this collaborative filtering recommender is to produce a document classification as soon as a user uploads a document. Keep in mind that detecting the document type is a prerequisite for triggering the subsequent data extraction tasks. We don't want to attempt to extract a birth date from a lease agreement, but we do want to extract that data element whenever the application finds a birth certificate.

Traditional document classification systems are based on supervised machine learning algorithms such as logistic regression and naive Bayes classification, among others. These algorithms require an extensive labeled dataset before they can start doing their job.

What if you have an interactive application that needs to generate a document classification based on limited user interaction, with no prior training dataset?

We could use one of the traditional classification algorithms by first building a training dataset from several records of user interaction and only then generating predictions. This approach adds latency, because the system cannot generate predictions until a rich training dataset exists. That dependency often makes these algorithms impractical for interactive use.

A general implementation of a recommender-based alternative is presented here: https://github.com/rmalarc/DATA643/blob/master/src/main/scala/week4/project4-code.ipynb. That project implements a low-latency document-document recommender system that minimizes prediction latency by building a lean training dataset prospectively from user interaction.
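
To make the idea concrete, here is a minimal sketch of a document-document recommender that learns one label at a time. It is not the code from the linked notebook: it assumes bag-of-words cosine similarity as a stand-in for the similarity measure, and the `DocumentTypeRecommender` class, its methods, and the `threshold` value are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> Counter:
    """Lower-case word counts; a deliberately simple document representation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class DocumentTypeRecommender:
    """Hypothetical recommender: suggests the type of the most similar labeled document."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold  # assumed cut-off, would need tuning in practice
        self.labeled: List[Tuple[Counter, str]] = []

    def record_label(self, text: str, doc_type: str) -> None:
        # Every user-assigned type immediately becomes usable for prediction.
        self.labeled.append((tokenize(text), doc_type))

    def recommend(self, text: str) -> str:
        vec = tokenize(text)
        best_score, best_type = 0.0, "Unknown"
        for known_vec, doc_type in self.labeled:
            score = cosine(vec, known_vec)
            if score > best_score:
                best_score, best_type = score, doc_type
        return best_type if best_score >= self.threshold else "Unknown"


rec = DocumentTypeRecommender()
first = "This lease agreement is made between landlord and tenant"
print(rec.recommend(first))  # Unknown (nothing has been labeled yet)
rec.record_label(first, "Lease")
print(rec.recommend("Lease agreement between landlord and tenant for apartment 4B"))  # Lease
```

A single user interaction is enough for the next similar upload to be recognized, which is the low-latency property the section is after. Documents that resemble nothing seen so far stay `Unknown` until a user assigns a type.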

### Data Element Definition Propagation

As previously discussed, document type detection directs the application flow towards certain data extraction subroutines. However, these subroutines can be applicable to other document types. For instance, you may find dates of birth mentioned not just in birth certificates but also in a variety of other documents.

The goal here is to help the application propagate these data-extraction tasks to other documents in a context-aware collaborative filtering fashion.
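
One way to picture this propagation is as neighborhood-based collaborative filtering over a "document type × data element" table. The sketch below is an assumption-laden illustration, not an implementation from this project: `suggest_elements`, the Jaccard similarity, and the sample document types and element names are all hypothetical, and the only "context" it uses is the set of elements already defined for each type.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets of data elements."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_elements(
    target_type: str,
    elements_by_type: Dict[str, Set[str]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Suggest data elements for target_type, borrowed from similar document types."""
    target = elements_by_type.get(target_type, set())
    scores: Dict[str, float] = {}
    for other_type, elements in elements_by_type.items():
        if other_type == target_type:
            continue
        sim = jaccard(target, elements)
        if sim == 0.0:
            continue  # unrelated document types contribute nothing
        for element in elements - target:
            # Each similar type "votes" for its elements, weighted by similarity.
            scores[element] = scores.get(element, 0.0) + sim
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical extraction definitions captured from user interaction.
elements_by_type = {
    "birth certificate": {"full_name", "date_of_birth", "place_of_birth"},
    "passport application": {"full_name", "date_of_birth", "nationality"},
    "lease agreement": {"landlord", "tenant", "monthly_rent"},
    "employment contract": {"full_name", "start_date"},
}

for element, score in suggest_elements("employment contract", elements_by_type):
    print(element, score)
# date_of_birth 0.5
# nationality 0.25
# place_of_birth 0.25
```

Here the date-of-birth extractor is proposed for employment contracts because two related document types already use it, while the lease-specific elements are never suggested.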



## Application Demonstration

Let's see the implementation of the document classification collaborative filtering recommender in action.

1. Document Upload
![Document Upload](https://raw.githubusercontent.com/rmalarc/DATA643/master/src/main/scala/finalproject/1.png)

2. Document Appears as Unknown
![Unknown Doc](https://raw.githubusercontent.com/rmalarc/DATA643/master/src/main/scala/finalproject/2.png)

3. User Assigns Document Type
![New Doc Type](https://raw.githubusercontent.com/rmalarc/DATA643/master/src/main/scala/finalproject/3.png)

4. New Document is Uploaded
![New Document Upload](https://raw.githubusercontent.com/rmalarc/DATA643/master/src/main/scala/finalproject/4.png)

5. Document is Recognized as Having the Same Type
![New Document Recognized](https://raw.githubusercontent.com/rmalarc/DATA643/master/src/main/scala/finalproject/5.png)


## Conclusions

* Collaborative filtering techniques appear to be valuable for interactive applications where low latency is desired and little to no training data is available
* Machine learning application objectives have been met with collaborative filtering recommender systems
* Additional value remains to be gained by enabling information-extraction subroutine propagation