# Document AI

### Processing the Snowflake Quarterly Earnings Infographic
Now let's look at the second Document AI project you created. This project extracts the metrics from the quarterly earnings infographic. The output is a structured dataset.

-   Run through the cells and read the instructions in this notebook.
-   Run the next cell to import the Python packages used.

In [None]:
# Import python packages
import streamlit as st
import pandas as pd

from snowflake.snowpark.functions import *
from snowflake.snowpark.types import *

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()


##### Here we have key metrics in images. You will use Document AI to extract them. The result should be a table of metrics over time. The first cell below loads an image of each infographic along with the relative path where it is stored.

The second cell, in SQL, creates a table which parses the key information we have requested from the **INFOGRAPHIC** model.

The third cell, in Python, extracts the data from the **JSON object** into a more readable form using **Snowpark DataFrames**.

In [None]:
infographics = session.sql('''SELECT *, GET_PRESIGNED_URL('@DOCUMENT_AI.INFOGRAPHICS',RELATIVE_PATH) IMAGE FROM DIRECTORY(@DOCUMENT_AI.INFOGRAPHICS)''')

infographics = infographics.select('IMAGE','RELATIVE_PATH').order_by('RELATIVE_PATH')

st.dataframe(
    infographics,
    column_config={
        "IMAGE": st.column_config.ImageColumn(
            "Preview Image", help="Snowflake infographic"
        )
    },
    hide_index=True,
)

In [None]:
CREATE TABLE IF NOT EXISTS DOCUMENT_AI.INFOGRAPHICS AS

SELECT RELATIVE_PATH,'@DOCUMENT_AI.INFOGRAPHICS' AS STAGE, 

DOCUMENT_AI.INFOGRAPHICS!PREDICT(GET_PRESIGNED_URL('@DOCUMENT_AI.INFOGRAPHICS',RELATIVE_PATH),1) METADATA

FROM DIRECTORY (@DOCUMENT_AI.INFOGRAPHICS);

SELECT GET_PRESIGNED_URL('@DOCUMENT_AI.INFOGRAPHICS',RELATIVE_PATH)IMAGE,RELATIVE_PATH,METADATA FROM DOCUMENT_AI.INFOGRAPHICS;

In [None]:
earnings_data = session.sql('''SELECT GET_PRESIGNED_URL('@DOCUMENT_AI.INFOGRAPHICS',RELATIVE_PATH)IMAGE,RELATIVE_PATH,METADATA FROM DOCUMENT_AI.INFOGRAPHICS''')

earnings_formatted = earnings_data.with_column('1M_CUSTOMERS',col('METADATA')['TOTAL_M_CUSTOMERS'][0]['value'].astype(StringType()))
earnings_formatted = earnings_formatted.with_column('GLOBAL_2000_CUSTOMERS',col('METADATA')['GLOBAL_2000_CUSTOMERS'][0]['value'].astype(StringType()))
earnings_formatted = earnings_formatted.with_column('NET_REVENUE_RETENTION',col('METADATA')['NET_REVENUE_RETENTION'][0]['value'].astype(StringType()))
earnings_formatted = earnings_formatted.with_column('PRODUCT_REVENUE',col('METADATA')['PROD_REVENUE'][0]['value'].astype(StringType()))
earnings_formatted = earnings_formatted.with_column('TOTAL_CUSTOMERS',col('METADATA')['TOTAL_CUSTOMERS'][0]['value'].astype(StringType()))
earnings_formatted = earnings_formatted.with_column('DATE_OF_REPORT',col('METADATA')['DATE_OF_REPORT'][0]['value'].astype(StringType()))
earnings_formatted = earnings_formatted.with_column('MARKETPLACE_LISTINGS',col('METADATA')['MARKETPLACE_LISTINGS'][0]['value'].astype(StringType()))
earnings_formatted = earnings_formatted.with_column('QUARTER',col('METADATA')['QUARTER'][0]['value'].astype(StringType()))


st.dataframe(
    earnings_formatted,
    column_config={
        "IMAGE": st.column_config.ImageColumn(
            "Preview Image", help="Snowflake infographic"
        )
    },
    hide_index=True,
)
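
The column expressions above all follow the same path, `METADATA[field][0]['value']`. As a minimal plain-Python sketch of that lookup, assume a hypothetical response in which each field holds a list of `score`/`value` candidates. The sample values and scores are invented for illustration, and `first_value` is a hypothetical helper, not part of Snowpark:

```python
import json

# Hypothetical example of one METADATA value; the numbers are invented.
sample_metadata = json.loads("""
{
  "QUARTER": [{"score": 0.99, "value": "Q1"}],
  "TOTAL_CUSTOMERS": [{"score": 0.97, "value": "6,322"}],
  "DATE_OF_REPORT": [{"score": 0.95, "value": "April 30, 2022"}]
}
""")

def first_value(metadata, field):
    """Return the 'value' of the first candidate for a field, or None if absent."""
    candidates = metadata.get(field) or []
    return candidates[0].get("value") if candidates else None

# One flat row per document, one entry per requested field.
FIELDS = ["QUARTER", "TOTAL_CUSTOMERS", "DATE_OF_REPORT", "MARKETPLACE_LISTINGS"]
row = {field: first_value(sample_metadata, field) for field in FIELDS}
print(row)
```

A field the model did not return comes back as `None` in this sketch, much as a missing key in a VARIANT column yields NULL.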


#### Use Cortex Complete for a simple date formatter

Cortex Complete can be used to help with data engineering. We have a date field in which the dates are stored in various formats, as you can see in the table. You may want to retrain the model with new infographics for better accuracy. This time, however, we will use **Cortex Complete** to return the date in a consistent format. The model used to answer the question is **reka-flash**.

In [None]:
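def cortex_date(date_string):
    # Ask the model to rewrite a date (string literal or Column) as YYYY-MM-DD.
    # The trailing ': ' keeps the instruction separate from the date text.
    return call_function('SNOWFLAKE.CORTEX.COMPLETE','reka-flash',
                         concat(lit('return the following which can be parsed as a date in this format YYYY-MM-DD.  ONLY RETURN THE RESULT: '),
                                lit(date_string)))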

session.create_dataframe([{'DATE':'test'}]).with_column('DATE',cortex_date('April 30, 2022'))
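
Because a language model may wrap its answer in extra words or whitespace, it can be worth checking the reply before casting it to a date. Below is a minimal sketch of such a check in plain Python, assuming the reply is available as a string. `extract_iso_date` is a hypothetical helper, not part of Cortex or Snowpark:

```python
import re
from datetime import date
from typing import Optional

# Matches the shape YYYY-MM-DD; validity is checked separately below.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_iso_date(reply: str) -> Optional[date]:
    """Return the first valid YYYY-MM-DD date found in a model reply, else None."""
    for match in ISO_DATE.finditer(reply):
        try:
            return date.fromisoformat(match.group())
        except ValueError:
            continue  # e.g. 2022-13-45 has the right shape but is not a real date
    return None

print(extract_iso_date(" 2022-04-30\n"))
print(extract_iso_date("The date is 2022-04-30."))
print(extract_iso_date("unknown"))
```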

### Keep only Numeric Characters
Remember that although Cortex Complete can help with lots of data engineering problems, there are also lots of powerful and simple built-in functions that perform logical operations more efficiently. In this case, we use the **REGEXP_REPLACE** function to ensure that number fields contain only digits.

In [None]:
CREATE OR REPLACE FUNCTION DOCUMENT_AI.NUMBERS(input_string STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
    REGEXP_REPLACE(input_string, '[^0-9]', '')
$$;
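
The pattern `[^0-9]` removes every character that is not a digit, which includes thousands separators, percent signs and decimal points. The small Python sketch below shows the effect, using `re.sub` as a stand-in for the SQL function and invented sample strings. `keep_decimal` is a hypothetical variant for values where the decimal point matters:

```python
import re

def numbers(input_string: str) -> str:
    """Mirror REGEXP_REPLACE(input_string, '[^0-9]', ''): keep digits only."""
    return re.sub(r"[^0-9]", "", input_string)

def keep_decimal(input_string: str) -> str:
    """Hypothetical variant: keep digits and decimal points."""
    return re.sub(r"[^0-9.]", "", input_string)

# Invented sample strings, similar in style to values read from an infographic.
for sample in ["6,322 customers", "131%", "$394.4M"]:
    print(sample, "->", numbers(sample), "|", keep_decimal(sample))
```

With digits only, a value such as `$394.4M` becomes `3944`, so a later cast to a decimal type would not restore the original decimal place.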

Let's apply this to our structured data results. Notice that the Cortex Complete function and the simple custom SQL function are used together in a **single query**.

In [None]:
earnings_formatted_s = session.table('DOCUMENT_AI.INFOGRAPHICS')
earnings_formatted_2 = earnings_formatted_s.with_column('IMAGE',call_function('GET_PRESIGNED_URL',
                                                                              lit('@DOCUMENT_AI.INFOGRAPHICS'),
                                                                              col('RELATIVE_PATH'))).cache_result()
# Clean each numeric field once with DOCUMENT_AI.NUMBERS, then cast it to its target type
earnings_formatted_2 = earnings_formatted_2.with_column('TOTAL_M_CUSTOMERS',
            call_function('DOCUMENT_AI.NUMBERS',col('METADATA')['TOTAL_M_CUSTOMERS'][0]['value']).astype(IntegerType()))\
.with_column('GLOBAL_2000_CUSTOMERS',call_function('DOCUMENT_AI.NUMBERS',col('METADATA')['GLOBAL_2000_CUSTOMERS'][0]['value']).astype(IntegerType()))\
.with_column('NET_REVENUE_RETENTION',call_function('DOCUMENT_AI.NUMBERS',col('METADATA')['NET_REVENUE_RETENTION'][0]['value']).astype(IntegerType()))\
.with_column('PRODUCT_REVENUE',call_function('DOCUMENT_AI.NUMBERS',col('METADATA')['PROD_REVENUE'][0]['value'].astype(StringType())).astype(DecimalType(6,1)))\
.with_column('TOTAL_CUSTOMERS',call_function('DOCUMENT_AI.NUMBERS',col('METADATA')['TOTAL_CUSTOMERS'][0]['value'].astype(StringType())).astype(IntegerType()))\
.with_column('DATE_OF_REPORT',cortex_date(col('METADATA')['DATE_OF_REPORT'][0]['value']).astype(DateType()))\
.with_column('MARKETPLACE_LISTINGS',call_function('DOCUMENT_AI.NUMBERS',col('METADATA')['MARKETPLACE_LISTINGS'][0]['value']).astype(IntegerType()))\
.with_column('QUARTER',col('METADATA')['QUARTER'][0]['value'].astype(StringType()))\
.with_column('YEAR',year('DATE_OF_REPORT'))\
.drop('RELATIVE_PATH','STAGE','METADATA')

earnings_formatted_2.write.mode('overwrite').save_as_table("DOCUMENT_AI.EARNINGS_INFOGRAPHIC_PARSED")

earnings_formatted_2 = session.table('DOCUMENT_AI.EARNINGS_INFOGRAPHIC_PARSED')

earnings_formatted_2

Finally, we will view the Snowflake reporting information in a Streamlit app.

In [None]:
st.markdown('### Recent Earnings infographics for SNOW')

st.markdown('#### KEY METRICS')
# Convert to pandas once and reuse the result for all three charts
earnings_pd = earnings_formatted_2.to_pandas()
col1,col2,col3 = st.columns(3)
with col1:
    st.markdown('#### PRODUCT REVENUE')
    st.line_chart(earnings_pd,x='QUARTER',y='PRODUCT_REVENUE', color='YEAR')
with col2:
    st.markdown('#### TOTAL CUSTOMERS')
    st.line_chart(earnings_pd,x='QUARTER',y='TOTAL_CUSTOMERS', color='YEAR')
with col3:
    st.markdown('#### MARKETPLACE LISTINGS')
    st.line_chart(earnings_pd,x='QUARTER',y='MARKETPLACE_LISTINGS',color='YEAR')
st.divider()
st.markdown('#### ALL EXTRACTED DATA')
st.dataframe(
    earnings_formatted_2,
    column_config={
        "IMAGE": st.column_config.ImageColumn(
            "Preview Image", help="Snowflake infographic"
        )
    },
    hide_index=True,
)


st.divider()
st.markdown('#### ORIGINAL INFOGRAPHIC')
col1, col2 = st.columns(2)


with col1:
    selected_year = st.selectbox('Choose Year:',earnings_formatted_2.select('YEAR').distinct().sort(col('YEAR').desc()))
with col2:
    selected_quarter = st.selectbox('Choose Quarter:',earnings_formatted_2.filter(col('YEAR')==selected_year).select('QUARTER').distinct().sort('QUARTER'))





st.image(earnings_formatted_2.filter((col('QUARTER')==selected_quarter)
                                  &(col('YEAR')==selected_year)).select('IMAGE').limit(1).collect()[0][0])


So we have processed the infographic documents and extracted structured data into a table.  Let's now get more data and proceed to the **Sound Transcripts** section.