### Document AI

Please note that much of the work for this session will be done in Snowsight rather than in this notebook.

You may want to create a new window to do your work and keep this window open for instructions.

We will walk through how to train and leverage a Document Extraction Model using Document AI. We will use this model to convert unstructured Inspection Reports into structured, easy-to-analyze rows and columns all within the Snowflake AI Data Cloud.

* Document AI Documentation
* Snowflake's Arctic-TILT LLM for Document AI Documentation

First, we'll do a little additional setup.

In [None]:
-- assume the accountadmin role
USE ROLE accountadmin;
USE DATABASE DOCAI;

-- create the raw_doc schema
CREATE OR REPLACE SCHEMA RAW_DOC;

-- create the doc_ai stage
CREATE OR REPLACE STAGE RAW_DOC.DOC_AI
    DIRECTORY = (ENABLE = TRUE)
    ENCRYPTION =  (TYPE = 'SNOWFLAKE_SSE');

-- create the inspection_reports stage
CREATE OR REPLACE STAGE RAW_DOC.INSPECTION_REPORTS
    DIRECTORY = (ENABLE = TRUE)
    ENCRYPTION =  (TYPE = 'SNOWFLAKE_SSE');


-- create the tb_doc_ai role
CREATE OR REPLACE ROLE tb_doc_ai;

-- grant document ai privileges
GRANT DATABASE ROLE SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR TO ROLE tb_doc_ai;

-- grant doc_ai warehouse privileges
GRANT USAGE, OPERATE ON WAREHOUSE DOCAI_WH TO ROLE tb_doc_ai;

-- grant tb_doc_ai database privileges
GRANT ALL ON DATABASE DOCAI TO ROLE tb_doc_ai;
GRANT ALL ON SCHEMA DOCAI.RAW_DOC TO ROLE tb_doc_ai;
GRANT CREATE STAGE ON SCHEMA DOCAI.RAW_DOC TO ROLE tb_doc_ai;
GRANT CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE ON SCHEMA DOCAI.RAW_DOC TO ROLE tb_doc_ai;
GRANT ALL ON ALL STAGES IN SCHEMA DOCAI.RAW_DOC TO ROLE tb_doc_ai;

-- set my_user_var variable to equal the logged-in user
SET my_user_var = (SELECT  '"' || CURRENT_USER() || '"' );

-- grant the tb_doc_ai role to the logged-in user
GRANT ROLE tb_doc_ai TO USER identifier($my_user_var);
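
Before moving on, it can help to confirm that the setup took effect. The checks below are an optional sketch rather than part of the original walkthrough; they use only standard `SHOW` commands and assume the object names created above.

```sql
-- Optional sanity checks (not part of the original setup script).

-- Privileges held by the new role; the DOCUMENT_INTELLIGENCE_CREATOR
-- database role and the DOCAI_WH grants should appear in the output.
SHOW GRANTS TO ROLE tb_doc_ai;

-- Users and roles that have been granted tb_doc_ai;
-- the logged-in user should be listed.
SHOW GRANTS OF ROLE tb_doc_ai;

-- Both stages (DOC_AI and INSPECTION_REPORTS) should exist in the schema.
SHOW STAGES IN SCHEMA DOCAI.RAW_DOC;
```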

For use in a future step, we will need to download two separate .zip files that contain a Train (5 PDFs) and a Full (11 PDFs) set of Inspection Reports.

Please use the two links below to download these .zip files. Once downloaded, unzip the files into a location that can easily be accessed in our next steps.

Training Set [ZIP](https://github.com/Snowflake-Labs/sfquickstarts/blob/master/site/sfguides/src/tasty_bytes_extracting_insights_with_docai/assets/inspection_reports_train.zip)

Full Set [ZIP](https://github.com/Snowflake-Labs/sfquickstarts/blob/master/site/sfguides/src/tasty_bytes_extracting_insights_with_docai/assets/inspection_reports_full.zip)

### Uploading our Inspection Reports to our Stage

**Navigating to our Stage**

Within the Snowsight interface, navigate to Data -> Databases and then search for DOCAI. From there navigate to the RAW_DOC schema and the INSPECTION_REPORTS stage.



In the top-right corner, click the + Files button and either drop or browse to the unzipped Full set of Inspection Reports downloaded earlier. From there, click Upload.

This will kick off our file upload, and you will soon see our Inspection Report PDFs within the Stage.
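
If you prefer a command-line upload over the Snowsight UI, the same files could be staged with `PUT` from a client such as SnowSQL (`PUT` cannot be run from a Snowsight worksheet). This is a sketch only: the local path `/tmp/inspection_reports_full/` is a hypothetical unzip location, and it assumes the active role has write access to the stage.

```sql
-- Hypothetical local path; replace with wherever the Full set was unzipped.
-- AUTO_COMPRESS = FALSE keeps the PDFs uncompressed in the stage.
PUT file:///tmp/inspection_reports_full/*.pdf @DOCAI.RAW_DOC.INSPECTION_REPORTS
    AUTO_COMPRESS = FALSE;

-- Refresh the directory table so newly staged files are visible to DIRECTORY().
ALTER STAGE DOCAI.RAW_DOC.INSPECTION_REPORTS REFRESH;

-- List what is staged; the Full set should contribute 11 PDFs.
SELECT RELATIVE_PATH, SIZE, LAST_MODIFIED
FROM DIRECTORY(@DOCAI.RAW_DOC.INSPECTION_REPORTS)
ORDER BY RELATIVE_PATH;
```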

### Creating our Document AI Build
Within the Snowsight interface, please switch your role to TB_DOC_AI and navigate to AI & ML -> Document AI. From there, click the + Build button.

Within the New Build screen, enter the following:

* Build Name: INSPECTION_REPORT_EXTRACTION
* Choose Database: DOCAI
* Choose Schema: RAW_DOC


### Uploading our Training Documents
After the build creation is successful, we will land in our Document AI Build Details window, where we can begin our document upload by clicking the Upload documents button.

Within the Upload documents screen, drop or browse to the unzipped Training set of Inspection Reports downloaded earlier. From there, click Upload.

### Specifying our Values and Questions

Once the upload process is complete, please click Define value so we can begin to specify value names and questions we want to ask the model.

To begin defining our values to extract from our documents, click the Add value button in the right-hand window.

From here, please enter the Value and Question pairs listed below, one at a time. For each pair, complete the following check before clicking Add value to enter the next pair:

**Did the Model extract the Value correctly?**

* If Yes - Click the check-box to indicate this value was extracted correctly.
* If No - Delete the provided value and enter the correct value.

**Value | Question**

* TRUCK_ID: What is the Truck Identifier?
* DATE: What is the Date?
* PIC_PRESENT: Was the Person in charge present (Y or N)?
* FOOD_PROPER_TEMP: Was the Food received at the proper temperature (Y or N)?
* VEHICLE_RUNS_WELL: Did the Vehicle run and was it in well maintained condition?

For demonstration purposes, we are only extracting 5 values; however, please feel free to add more.

Please see [Question optimization for extracting information with Document AI](https://docs.snowflake.com/en/user-guide/snowflake-cortex/document-ai/optimizing-questions) for best practices.



### Reviewing our Test Documents
After completing our initial document review in the previous step, we will now review the model's initial extraction results for our remaining test documents.

To begin, please move on to the second document by clicking the arrow at the bottom of the screen.


Once the next document appears, please repeat the following check for each Value and Question pair.

**Did the Model extract the Value correctly?**

* If Yes - Click the check-box to indicate this value was extracted correctly.
* If No - Delete the provided value and enter the correct value.

After completing the review of all documents, please navigate back to the Document AI UI by clicking the arrow next to Documents review.

Next navigate to the Build Details tab.

### Training our Model
Using the Model accuracy tile, we will now train our model by clicking the Train model button. Within the Start training pop-up, click Start Training. Training can take around 20 minutes.

For more on Document AI training time estimation, please visit our [Document AI documentation](https://docs.snowflake.com/en/user-guide/snowflake-cortex/document-ai/prepare-model-build#training-time-estimation).

In our case, it may take about 10 minutes for training to complete. I've already done the training, so you can follow along with me until your training is completed.

When training is complete, you will see "Trained" next to the model name indicating it is ready to be leveraged.


### Using our Document AI Model against our Inspection Reports

In the last section, we walked through training our Inspection Report Extraction model in Snowflake. We will now use that model to extract our values from the full set of documents we uploaded to our stage earlier.



In [None]:
USE ROLE tb_doc_ai;
USE WAREHOUSE DOCAI_WH;
USE DATABASE DOCAI;
USE SCHEMA RAW_DOC;

In [None]:
-- Using a LIST command, let's first take a look at the files that we staged earlier and want to use our extraction model against.

LIST @inspection_reports;

To begin our extraction, let's use our model and the [PREDICT](https://docs.snowflake.com/en/sql-reference/classes/classification/methods/predict) method against one of those staged files by executing the next query.



In [None]:
SELECT inspection_report_extraction!PREDICT(GET_PRESIGNED_URL(@inspection_reports, '02.13.2022.5.pdf'));
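
The result is a single JSON object. As a rough sketch, and assuming the output layout described in the Document AI documentation (a `__documentMetadata.ocrScore` field plus one array of `{score, value}` entries per defined value), individual pieces could be pulled out like this:

```sql
-- Assumed layout (not shown in this notebook):
--   { "__documentMetadata": {"ocrScore": ...},
--     "TRUCK_ID": [ {"score": ..., "value": ...} ], ... }
WITH one_doc AS (
    SELECT inspection_report_extraction!PREDICT(
               GET_PRESIGNED_URL(@inspection_reports, '02.13.2022.5.pdf')
           ) AS ir_object
)
SELECT
    -- OCR confidence for the whole document (bracket notation)
    ir_object['__documentMetadata']['ocrScore']::FLOAT AS ocr_score,
    -- first (top) answer returned for TRUCK_ID
    ir_object:TRUCK_ID[0].value::STRING                AS truck_id,
    -- model confidence in that answer
    ir_object:TRUCK_ID[0].score::FLOAT                 AS truck_id_score
FROM one_doc;
```

Note that this calls `PREDICT` again for the same file, so it is best treated as a one-off inspection query.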


### Extraction for all Documents

Our extracted object looks great, but before we begin to flatten it out, let's create a raw table based on the extraction from all our staged documents. Please execute the next query, which may take around 2 minutes and should return the message `Table IR_RAW successfully created.`

In [None]:
CREATE TABLE IF NOT EXISTS ir_raw
COMMENT = '{"origin":"sf_sit-is", "name":"voc", "version":{"major":1, "minor":0}, "attributes":{"is_quickstart":1, "source":"sql", "vignette":"docai"}}'
AS
SELECT inspection_report_extraction!PREDICT(GET_PRESIGNED_URL(@inspection_reports, RELATIVE_PATH)) AS ir_object
FROM DIRECTORY(@inspection_reports);


In [None]:
-- Before moving on let's take a look at our raw, extracted results.

SELECT * FROM ir_raw;
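
To spot extractions worth a second look, the raw objects can be unpivoted into one row per value along with its confidence score. This sketch assumes each value is stored as an array of `{score, value}` entries next to a top-level `__documentMetadata` key; that layout comes from the documented output format rather than from this notebook.

```sql
-- One row per (document, value name), lowest-confidence answers first.
SELECT
    f.key                         AS value_name,
    f.value[0]['value']::STRING   AS extracted_value,
    f.value[0]['score']::FLOAT    AS score
FROM ir_raw,
     LATERAL FLATTEN(input => ir_object) f
WHERE f.key <> '__documentMetadata'   -- skip the metadata entry
ORDER BY score ASC;
```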

### Flattening our Extracted Object

Everything that was extracted for each document is present in the ir_object column. To make analysis easier, let's complete the Quickstart by using Snowflake's Semi-Structured Data Support to show how we can extract columns from our object.

In production, a Data Engineer would typically flatten and promote this data downstream through our Medallion Architecture using objects like Dynamic Tables or Views.

Please execute the next query in which we will use [Dot Notation](https://docs.snowflake.com/en/user-guide/querying-semistructured#dot-notation) to:

* Extract and normalize Date to a consistent format
* Extract TRUCK_ID
* Extract PIC_PRESENT and FOOD_PROPER_TEMP
* Convert Y to Pass, N to Fail and X to Not Observed
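
A minimal sketch of what such a flattening query might look like is shown here. It rests on several assumptions that are not confirmed by this notebook: each value is stored as an array of `{score, value}` entries, dates are written as `MM/DD/YYYY` or `MM-DD-YYYY`, and the answers are the single letters Y, N, or X. The output column names are illustrative.

```sql
-- Sketch only; see the assumptions stated above.
SELECT
    ir_object:TRUCK_ID[0].value::STRING AS truck_id,
    -- Normalize '-' separators to '/', then parse; unparseable dates become NULL.
    TRY_TO_DATE(
        REPLACE(ir_object:DATE[0].value::STRING, '-', '/'),
        'MM/DD/YYYY'
    ) AS inspection_date,
    -- Map Y/N/X to readable labels; any other answer yields NULL.
    CASE ir_object:PIC_PRESENT[0].value::STRING
        WHEN 'Y' THEN 'Pass'
        WHEN 'N' THEN 'Fail'
        WHEN 'X' THEN 'Not Observed'
    END AS pic_present,
    CASE ir_object:FOOD_PROPER_TEMP[0].value::STRING
        WHEN 'Y' THEN 'Pass'
        WHEN 'N' THEN 'Fail'
        WHEN 'X' THEN 'Not Observed'
    END AS food_proper_temp
FROM ir_raw
ORDER BY truck_id;
```

Wrapping this `SELECT` in a View or Dynamic Table would be one way to promote the flattened result downstream.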

### We're done!

We have moved from unstructured PDFs to easy-to-digest tabular results in a matter of minutes, all within the Snowflake AI Data Cloud.