# PROCESSING DOCUMENTS

You should now have built three models using the Document AI model creation tool:

- **GP_NOTES** 

- **COVID_VACCINATION_CONSENT_FORM**

- **RESEARCH_PAPERS**

Now, the next stage is to process multiple documents using these models.

## MODEL 1: PROCESS GP NOTES

Run the query below to view all the scanned GP notes currently held in a Snowflake stage.

In [None]:
select BUILD_SCOPED_FILE_URL('@DOCUMENT_AI.GP_NOTES',RELATIVE_PATH), * from directory(@DOCUMENT_AI.GP_NOTES)

### Process Documents from Document AI
You will now use the model you created earlier to process these documents. Each document produces metadata in the **DOC_META** column, which contains all the fields that were defined in the model.

In [None]:
CREATE TABLE IF NOT EXISTS DEFAULT_SCHEMA.GP_NOTES_RAW AS

SELECT *, DEFAULT_DATABASE.DOCUMENT_AI.GP_NOTES!PREDICT(
  GET_PRESIGNED_URL(@DEFAULT_DATABASE.DOCUMENT_AI.GP_NOTES, RELATIVE_PATH), 1) DOC_META
FROM DIRECTORY(@DEFAULT_DATABASE.DOCUMENT_AI.GP_NOTES);

SELECT * FROM DEFAULT_SCHEMA.GP_NOTES_RAW

Let's now make this more readable as a structured table. Note that the GP notes have been returned as a list; any field can contain a list. The SQL below keeps only the first value for every field except the **GP_NOTES** column, which is kept whole because it contains multiple values.

In [None]:
CREATE OR REPLACE TABLE DEFAULT_SCHEMA.REPORTS_STRUCTURED AS 
select RELATIVE_PATH,

DOC_META:__documentMetadata:ocrScore OCR_SCORE,
DOC_META:NHS_NUMBER[0]:value::text NHS_NUMBER,
DOC_META:APPOINTMENT_DATE[0]:value::text APPOINTMENT_DATE,
DOC_META:CONSULTANT[0]:value::text CONSULTANT,
DOC_META:NOTES GP_NOTES,
DOC_META:PATIENT_NAME[0]:value::text PATIENT_NAME,
DOC_META:PRACTICE_ADDRESS[0]:value::text PRACTICE_ADDRESS,
DOC_META:PRACTICE_PHONE[0]:value::text PRACTICE_PHONE,
DOC_META:PRESCRIPTION[0]:value::text PRESCRIPTION
from DEFAULT_SCHEMA.GP_NOTES_RAW;

SELECT * FROM DEFAULT_SCHEMA.REPORTS_STRUCTURED
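
Each Document AI field comes back as a list of answers, each carrying a `value` and a `score`, which is why the SQL above indexes `[0]` for single-valued fields. The sketch below performs the same lookup in Python on a hand-written sample. The sample payload and the `first_value` helper are hypothetical and only illustrate the shape of the data; they are not real model output.

```python
import json

# Hypothetical DOC_META payload, shaped like the fields queried above.
sample_doc_meta = json.loads("""
{
  "__documentMetadata": {"ocrScore": 0.93},
  "PATIENT_NAME": [{"score": 0.98, "value": "Jane Example"}],
  "NOTES": [
    {"score": 0.91, "value": "Patient reports mild headache."},
    {"score": 0.88, "value": "Advised rest and fluids."}
  ]
}
""")


def first_value(doc_meta, field):
    """Return the first extracted value for a field, or None if absent."""
    answers = doc_meta.get(field) or []
    return answers[0].get("value") if answers else None


# Single-valued field: take the first answer only.
patient_name = first_value(sample_doc_meta, "PATIENT_NAME")

# Multi-valued field: keep every answer, as the GP_NOTES column does.
all_notes = [answer["value"] for answer in sample_doc_meta["NOTES"]]

# A field the model did not return yields None rather than an error.
missing = first_value(sample_doc_meta, "PRESCRIPTION")
```

In the SQL, a missing field behaves similarly: the path expression returns NULL instead of failing.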

## DATA CLEANING

We will now extract just the note text and combine it into a single field, with each note on a separate line. The **REDUCE** function folds an array into a single value by applying a lambda expression to each element in turn.

In [None]:
SELECT * EXCLUDE GP_NOTES,
       REDUCE(o.GP_NOTES,
              '',
              (acc, val) -> val:value || '\n' || acc) GP_NOTES
  FROM DEFAULT_SCHEMA.REPORTS_STRUCTURED o;
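
To see what the lambda does, the same fold can be sketched in Python with `functools.reduce`. The `notes` list is a hypothetical stand-in for the `DOC_META:NOTES` array. Because the SQL lambda places each new value in front of the accumulator, the notes come out in reverse order; appending to the accumulator instead would preserve the original order.

```python
from functools import reduce

# Hypothetical NOTES array, shaped like DOC_META:NOTES.
notes = [
    {"score": 0.91, "value": "first note"},
    {"score": 0.88, "value": "second note"},
]

# Same shape as the SQL lambda: val:value || '\n' || acc, starting from ''.
prepended = reduce(lambda acc, val: val["value"] + "\n" + acc, notes, "")

# Appending to the accumulator keeps the notes in their original order.
appended = reduce(lambda acc, val: acc + val["value"] + "\n", notes, "")
```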

Another inconsistent column is the appointment date. Because the notes originate from a handwritten form, the dates do not follow a consistent format. A large language model may be able to help decipher them.

In [None]:
CREATE TABLE IF NOT EXISTS DEFAULT_SCHEMA.GP_NOTES_CLEANED AS
SELECT REPLACE(NHS_NUMBER,' ','') NHS_NUMBER,

SNOWFLAKE.CORTEX.COMPLETE('claude-3-5-sonnet',

CONCAT('Convert the following date into a standardised date format such as YYYY-MM-DD: ',
            APPOINTMENT_DATE,'. ONLY RETURN THE DATE WITH NO COMMENTARY'
            )) APPOINTMENT_DATE, 


* EXCLUDE (GP_NOTES,APPOINTMENT_DATE, OCR_SCORE,NHS_NUMBER),
       REDUCE(o.GP_NOTES,
              '',
              (acc, val) -> val:value || '\n' || acc) GP_NOTES
  FROM DEFAULT_SCHEMA.REPORTS_STRUCTURED o;

  SELECT * FROM DEFAULT_SCHEMA.GP_NOTES_CLEANED
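
An LLM can occasionally return extra words or an impossible date, so it may be worth checking the cleaned column before relying on it. The sketch below is one possible check in Python; the `parse_iso_date` helper is hypothetical and is not part of the lab.

```python
from datetime import datetime


def parse_iso_date(text):
    """Return a date if text is a valid year-month-day date, else None."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d").date()
    except (ValueError, AttributeError):
        # ValueError: not a valid date; AttributeError: text is None.
        return None


clean = parse_iso_date("2024-03-05")
chatty = parse_iso_date("Here is the date: 2024-03-05")
impossible = parse_iso_date("2024-13-01")
```

Rows where the helper returns `None` could be routed for manual review rather than loaded as dates.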

## MODEL 2: PROCESS COVID TRIAL CONSENT FORMS

Let's now run the model to process the trial consent forms.

In [None]:
CREATE TABLE IF NOT EXISTS DEFAULT_SCHEMA.COVID_TRIAL_CONSENT_RAW AS 
SELECT 
SIZE, 
LAST_MODIFIED,
RELATIVE_PATH, 
DEFAULT_DATABASE.DOCUMENT_AI.COVID_VACCINATION_CONSENT_FORM!PREDICT(
  GET_PRESIGNED_URL(@DEFAULT_DATABASE.DOCUMENT_AI.CONSENT_FORMS, 
  RELATIVE_PATH), 1) DOC_META
FROM DIRECTORY(@DEFAULT_DATABASE.DOCUMENT_AI.CONSENT_FORMS);

SELECT * FROM DEFAULT_SCHEMA.COVID_TRIAL_CONSENT_RAW

### Flattening out the data and creating a structured table

In [None]:
-- Create a view over the processed information
CREATE OR REPLACE VIEW DEFAULT_DATABASE.DEFAULT_SCHEMA.COVID_VACCINE_CONSENTS AS 
SELECT DOC_META,SIZE,LAST_MODIFIED,
DOC_META:CHILDBEARING_F[0]:value::text "Female Patient Childbearing Age (Y/N)"
,DOC_META:CONSENT[0]:value::text "CONSENT (Y/N)" 
,DOC_META:DATE_BIRTH[0]:value::text "Date of Birth"
,DOC_META:GENDER[0]:value::text "Gender"
,DOC_META:GP_NAME[0]:value::text "Name of Surgery"
,DOC_META:INJECTION[0]:value::text "Injection Arm"
,DOC_META:NHS_NUMBER[0]:value::text "NHS Number"
,DOC_META:PATIENT_NAME[0]:value::text "Patient Name"
,DOC_META:REASON[0]:value::text "Reason for no vaccine"
,DOC_META:Brand_vaccine[0]:value::text "Vaccination Brand"
,relative_path
FROM DEFAULT_SCHEMA.COVID_TRIAL_CONSENT_RAW;

-- checking the table
SELECT * FROM DEFAULT_SCHEMA.COVID_VACCINE_CONSENTS;

### VISUALISING DOCUMENT PROCESSING QUALITY SCORES

In [None]:
-- Create a view with all values and scores
CREATE OR REPLACE VIEW DEFAULT_SCHEMA.covid_ocr_score2
AS
WITH 
-- First part reads the model output (JSON in DOC_META plus file metadata) from the consents view
temp as(
    SELECT * FROM DEFAULT_SCHEMA.COVID_VACCINE_CONSENTS
)
-- Second part extracts the values and the scores from the JSON into columns
SELECT

RELATIVE_PATH AS file_name
, SIZE AS file_size
, last_modified
, GET_PRESIGNED_URL(@DEFAULT_DATABASE.DOCUMENT_AI.CONSENT_FORMS, RELATIVE_PATH) snowflake_file_url
, DOC_META AS JSON
, json:__documentMetadata.ocrScore::FLOAT AS ocrScore
, json:CHILDBEARING_F[0]:value::STRING as Female_Patient_Childbearing
, json:CHILDBEARING_F[0]:score::FLOAT AS CHILDBEARING_F_score
, json:CONSENT[0]:value::STRING as CONSENT
, json:CONSENT[0]:score::FLOAT AS CONSENT_score
, json:DATE_BIRTH[0]:value::STRING as DATE_BIRTH
, json:DATE_BIRTH[0]:score::FLOAT AS DATE_BIRTH_score
, json:GENDER[0]:value::STRING as GENDER
, json:GENDER[0]:score::FLOAT AS GENDER_score
, json:GP_NAME[0]:value::STRING as GP_NAME
, json:GP_NAME[0]:score::FLOAT AS GP_NAME_score
, json:INJECTION[0]:value::STRING as INJECTION
, json:INJECTION[0]:score::FLOAT AS INJECTION_score
, json:NHS_NUMBER[0]:value::STRING as NHS_NUMBER
, json:NHS_NUMBER[0]:score::FLOAT AS NHS_NUMBER_score
, json:PATIENT_NAME[0]:value::STRING as PATIENT_NAME
, json:PATIENT_NAME[0]:score::FLOAT AS PATIENT_NAME_score
, json:REASON[0]:value::STRING as REASON
, json:REASON[0]:score::FLOAT AS REASON_score
FROM temp;

SELECT * FROM DEFAULT_SCHEMA.covid_ocr_score2
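
The score columns make it possible to flag extractions that deserve a second look. The sketch below shows one way to do that in Python for a single row. The `row` values and the `fields_needing_review` helper are hypothetical, and the 0.5 threshold simply mirrors the `threshold_score` value used later in this notebook.

```python
# Hypothetical row from the scores view: column name -> confidence score.
row = {"CONSENT_SCORE": 0.97, "GENDER_SCORE": 0.42, "REASON_SCORE": None}


def fields_needing_review(scores, threshold=0.5):
    """List score columns that are missing or at/below the threshold."""
    return sorted(
        name
        for name, score in scores.items()
        if score is None or score <= threshold
    )


flagged = fields_needing_review(row)
```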

Finally, we display the scores view in a Streamlit visualisation.

In [None]:
from snowflake.snowpark.context import get_active_session
import streamlit as st
session = get_active_session()
table = session.table('DEFAULT_SCHEMA.covid_ocr_score2')

st.markdown('### DATA QUALITY SCORES')
col1,col2,col3,col4 = st.columns(4)
with col1:
    st.markdown('INJECTION')
    st.bar_chart(table, y='INJECTION_SCORE',x='FILE_NAME', color='#fc8702',x_label='',y_label='')
with col2:
    st.markdown('GP NAME')
    st.bar_chart(table, y='GP_NAME_SCORE',x='FILE_NAME', color='#7bcbff',x_label='',y_label='')
with col3:
    st.markdown('GENDER')
    st.bar_chart(table, y='GENDER_SCORE',x='FILE_NAME', color ='#ba78e5',x_label='',y_label='')

with col4:
    st.markdown('REASON')
    st.bar_chart(table, y='REASON_SCORE',x='FILE_NAME', color='#fc8702',x_label='',y_label='')

col1,col2,col3,col4 = st.columns(4)
with col1:
    st.markdown('NHS NUMBER')
    st.bar_chart(table, y='NHS_NUMBER_SCORE',x='FILE_NAME', color='#7bcbff',x_label='',y_label='')
with col2:
    st.markdown('PATIENT NAME')
    st.bar_chart(table, y='PATIENT_NAME_SCORE',x='FILE_NAME', color ='#ba78e5',x_label='',y_label='')

with col3:
    st.markdown('DATE OF BIRTH')
    st.bar_chart(table, y='DATE_BIRTH_SCORE',x='FILE_NAME', color='#7bcbff',x_label='',y_label='')
with col4:
    st.markdown('CONSENT')
    st.bar_chart(table, y='CONSENT_SCORE',x='FILE_NAME', color ='#ba78e5',x_label='',y_label='')

## MODEL 3: PROCESS RESEARCH PAPERS

This model extracts basic information about each document. Once we have retrieved these results, we will use **SNOWFLAKE.CORTEX.PARSE_DOCUMENT** to extract all the text from each document.

## Extract Key Data from Document AI
As before, we extract the Document AI specific fields. Note that two of the fields contain lists.

In [None]:
CREATE TABLE IF NOT EXISTS DEFAULT_SCHEMA.RESEARCH_PAPERS_RAW AS 
SELECT *, DEFAULT_DATABASE.DOCUMENT_AI.RESEARCH_PAPERS!PREDICT(
  GET_PRESIGNED_URL(@DEFAULT_DATABASE.DOCUMENT_AI.RESEARCH_PAPERS, RELATIVE_PATH), 2) DOC_META
FROM DIRECTORY(@DEFAULT_DATABASE.DOCUMENT_AI.RESEARCH_PAPERS);

SELECT * FROM DEFAULT_SCHEMA.RESEARCH_PAPERS_RAW

Now the table is formatted to extract the fields into columns. Where a field contains a list of answers, the **REDUCE** function is used to extract only the field values. These lists are useful for the search service.

In [None]:
CREATE OR REPLACE TABLE DEFAULT_SCHEMA.RESEARCH_PAPERS_PARSED AS 
select RELATIVE_PATH,

DOC_META:__documentMetadata:ocrScore OCR_SCORE,

STRTOK_TO_ARRAY(
REDUCE(DOC_META:AUTHOR,
              '',
              (acc, val) -> val:value || ',' || acc),',') AUTHORS,

DOC_META:TITLE_OF_PAPER[0]:value::text TITLE_OF_PAPER,
DOC_META:DATE_PUBLICATION[0]:value::text DATE_PUBLICATION,
DOC_META:PUBLICATION_LOCATION[0]:value::text PUBLICATION_LOCATION,

STRTOK_TO_ARRAY(
REDUCE(DOC_META:MORBIDITIES,
              '',
              (acc, val) -> val:value || ',' || acc),',') MORBIDITIES
from DEFAULT_SCHEMA.RESEARCH_PAPERS_RAW;

SELECT * FROM DEFAULT_SCHEMA.RESEARCH_PAPERS_PARSED
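
One thing to keep in mind with the join-then-split approach is that any value that itself contains a comma is broken into several array elements. The Python sketch below reproduces the effect on a hypothetical `authors` list; it illustrates the pattern only and does not call Snowflake.

```python
# Hypothetical AUTHOR answers, shaped like DOC_META:AUTHOR.
authors = [{"value": "Smith, J."}, {"value": "A. Jones"}]

# Join with ',' (as the REDUCE lambda does), then split on ',' and drop
# empty tokens (as STRTOK_TO_ARRAY does).
joined = ""
for answer in authors:
    joined = answer["value"] + "," + joined
via_split = [token for token in joined.split(",") if token]

# Taking the values directly keeps each name intact and in order.
direct = [answer["value"] for answer in authors]
```

If the author names in your documents contain commas, a different delimiter or a direct array transformation may be a safer choice.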

Let's now create a new table which adds a column containing all the text from each document. The **SNOWFLAKE.CORTEX.PARSE_DOCUMENT** function is used for this. There are two mode options: **LAYOUT** extracts the document's content and layout as markdown, while **OCR** extracts only the text.

In [None]:
CREATE TABLE IF NOT EXISTS DEFAULT_SCHEMA.RESEARCH_PAPERS_ALL_DATA AS 

select * exclude (layout, OCR_SCORE) from (

SELECT *, 

    build_stage_file_url(@DEFAULT_DATABASE.DOCUMENT_AI.RESEARCH_PAPERS,RELATIVE_PATH) URL,
    
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT (
                                '@DEFAULT_DATABASE.DOCUMENT_AI.RESEARCH_PAPERS',
                                RELATIVE_PATH,
                                {'mode': 'LAYOUT'} )  AS LAYOUT, LAYOUT:content::text CONTENT, LAYOUT:metadata:pageCount PAGE_COUNT
        
FROM DEFAULT_SCHEMA.RESEARCH_PAPERS_PARSED);


CREATE OR REPLACE VIEW DOCUMENT_AI.RESEARCH_PAPERS_ALL_DATA AS 
SELECT * FROM DEFAULT_SCHEMA.RESEARCH_PAPERS_ALL_DATA;

SELECT * FROM DOCUMENT_AI.RESEARCH_PAPERS_ALL_DATA



Now you will view the parsed documents.

In [None]:
# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T



st.title("Research Papers")
session = get_active_session()

side_letters = session.table('DEFAULT_SCHEMA.RESEARCH_PAPERS_ALL_DATA').select('RELATIVE_PATH')#.filter(F.col('RELATIVE_PATH').like('ANALYST_REPORTS%'))
file_id = st.selectbox('Select Report:', side_letters)
doc_details = session.table('DEFAULT_SCHEMA.RESEARCH_PAPERS_ALL_DATA').filter(F.col('RELATIVE_PATH')==file_id).limit(1)
doc_detailsspd = doc_details.to_pandas()


st.markdown('#### Report Details')
col1,col2 = st.columns(2)

with col1:
    st.markdown(f'''__Report Date:__ {doc_detailsspd.DATE_PUBLICATION.iloc[0]}''')
    st.markdown(f'''__Title of Paper:__ {doc_detailsspd.TITLE_OF_PAPER.iloc[0]}''')
    st.markdown(f'''__Author(s):__ {doc_details.select(F.array_to_string(F.col('AUTHORS'),F.lit(','))).collect()[0][0]}''')
    
with col2:
    st.markdown(f'''__Publication Location:__ {doc_detailsspd.PUBLICATION_LOCATION.iloc[0]}''')
    st.markdown(f'''__Morbidities Mentioned:__ {doc_details.select(F.array_to_string(F.col('MORBIDITIES'),F.lit(','))).collect()[0][0]}''')
    

# New Section 
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark import functions as snow_funcs

import pypdfium2 as pdfium
from datetime import datetime

# Write directly to the app


doc_ai_context = "DEFAULT_DATABASE.DOCUMENT_AI"
doc_ai_source_table = "RESEARCH_PAPERS_ALL_DATA"
doc_ai_source_verify_table = "RESEARCH_PAPERS_ALL_DATA"
doc_ai_doc_stage = "RESEARCH_PAPERS"

# Dict that has the name of the columns that needs to be verified, it has the column name of the column 
# with value and column with the score
value_dict = {
    "OPERATOR_VALUE": {
        "VAL_COL": "OPERATOR_VALUE",
        "SCORE_COL": "OPERATOR_SCORE"
    }
}

# The minimum score needed to not be verified
threshold_score = 0.5

# HELPER FUNCTIONS
# Function to generate filter to only get the rows that are missing values or have a score below the threshold
def generate_filter(col_dict:dict,  score_val:float): #score_cols:list, score_val:float, val_cols:list):
    
    filter_exp = ''

    # For each column
    for col in col_dict:
        # Create the filter on score threshold or missing value
        if len(filter_exp) > 0:
                filter_exp += ' OR '
        filter_exp += f'(({col_dict[col]["SCORE_COL"]} <= {score_val} ) OR ({col_dict[col]["VAL_COL"]} IS NULL))'

    if len(filter_exp) > 0:
       filter_exp = f'({filter_exp}) AND ' 
    
    # Filter out documents already verified
    filter_exp  += 'verification_date is null'
    return filter_exp

# Generates a column list for counting, per column, the number of documents that are missing a value
# or have a score at or below the threshold
def count_missing_select(col_dict:dict, score_val:float):
    select_list = []

    for col in col_dict:
        col_exp = (snow_funcs.sum(
                          snow_funcs.iff(
                                    (
                                        (snow_funcs.col(col_dict[col]["VAL_COL"]).is_null())
                                        | 
                                        (snow_funcs.col(col_dict[col]["SCORE_COL"]) <= score_val)
                                    ), 1,0
                              )
                      ).as_(col)
                )
        select_list.append(col_exp)
        
    return select_list

# Function to display a pdf page
def display_pdf_page():
    pdf = st.session_state['pdf_doc']
    page = pdf[st.session_state['pdf_page']]
            
    bitmap = page.render(
                    scale = 8, 
                    rotation = 0,
            )
    pil_image = bitmap.to_pil()
    st.image(pil_image)

# Function to move to the next PDF page
def next_pdf_page():
    if st.session_state.pdf_page + 1 >= len(st.session_state['pdf_doc']):
        st.session_state.pdf_page = 0
    else:
        st.session_state.pdf_page += 1

# Function to move to the previous PDF page
def previous_pdf_page():
    if st.session_state.pdf_page > 0:
        st.session_state.pdf_page -= 1

# Function to get the name of all documents that need verification
def get_documents(doc_df):
    
    lst_docs = [dbRow[0] for dbRow in doc_df.collect()]
    # Add a default None value
    lst_docs.insert(0, None)
    return lst_docs

# MAIN

# Get the table with all documents with extracted values
df_agreements = session.table(f"{doc_ai_context}.{doc_ai_source_table}")

# Get the documents we have already verified
df_validated_docs = session.table(f"{doc_ai_context}.{doc_ai_source_verify_table}")

# Join
df_all_docs = df_agreements.join(df_validated_docs,on='RELATIVE_PATH', how='left', lsuffix = '_L', rsuffix = '_R')

# Keep only unverified documents that have missing values or a score at or below the threshold
validate_filter = generate_filter(value_dict, threshold_score)
df_validate_docs = df_all_docs.filter(validate_filter)
#col1, col2 = st.columns(2)
#col1.metric(label="Total Documents", value=df_agreements.count())
#col2.metric(label="Documents Needing Validation", value=df_validate_docs.count())

# Get the number of documents by value that needs verifying
#select_list = count_missing_select(value_dict, threshold_score)
#df_verify_counts = df_validate_docs.select(select_list)
#verify_cols = df_verify_counts.columns

#st.subheader("Number of documents needing validation by extraction value")
#st.bar_chart(data=df_verify_counts.unpivot("needs_verify", "check_col", verify_cols), x="CHECK_COL", y="NEEDS_VERIFY")

# Verification section
st.divider()
col1, col2 = st.columns(2)
with col1:
    st.markdown('#### RAW PDF STORED IN FILE STORE')
    with st.container():
        # If we have selected a document
        if file_id:        
        # Display the extracted values
            df_doc = df_validate_docs.filter(snow_funcs.col("FILE_NAME") == file_id)
            if 'pdf_page' not in st.session_state:
                st.session_state['pdf_page'] = 0
            if 'pdf_url' not in st.session_state:
                st.session_state['pdf_url'] = file_id    
            if 'pdf_doc' not in st.session_state or st.session_state['pdf_url'] != file_id:
                pdf_stream = session.file.get_stream(f"@{doc_ai_context}.{doc_ai_doc_stage}/{file_id}")
                pdf = pdfium.PdfDocument(pdf_stream)
                st.session_state['pdf_doc'] = pdf
                st.session_state['pdf_url'] = file_id
                st.session_state['pdf_page'] = 0
                
            # Page navigation: each control sits in its own column
            nav_col1, nav_col2, nav_col3 = st.columns(3)
            with nav_col1:
                st.button("⏮️ Previous", on_click=previous_pdf_page)
            with nav_col2:
                st.write(f"page {st.session_state['pdf_page'] + 1} of {len(st.session_state['pdf_doc'])} pages")
            with nav_col3:
                st.button("Next ⏭️", on_click=next_pdf_page)

            display_pdf_page()

# Right-hand column: the text extracted by PARSE_DOCUMENT
with col2:
    st.markdown('#### EXTRACTED TEXT FROM PDFS')
    with st.container(height=800):
        st.markdown(doc_detailsspd.CONTENT.iloc[0])


### CORTEX SUMMARISE
You will notice that some of these reports are very long. Let's use **SNOWFLAKE.CORTEX.SUMMARIZE** to summarise them.

In [None]:
SELECT TITLE_OF_PAPER, SNOWFLAKE.CORTEX.SUMMARIZE(CONTENT) SUMMARY FROM DEFAULT_SCHEMA.RESEARCH_PAPERS_ALL_DATA limit 1;

You will now create a table which combines all the document metadata with a summary for each research paper.

In [None]:
-- Formatting the summarised reports to use with the search service

CREATE OR REPLACE TABLE DEFAULT_SCHEMA.SUMMARISED_RESEARCH_PAPERS AS 
SELECT 
RELATIVE_PATH,
AUTHORS,
TITLE_OF_PAPER,
DATE_PUBLICATION,
PUBLICATION_LOCATION,
'SUMMARY' CONTENT, 
SPLIT_PART(RELATIVE_PATH,'/',1)::text DOCUMENT,
SUMMARY TEXT
FROM
(
SELECT * EXCLUDE CONTENT,SNOWFLAKE.CORTEX.SUMMARIZE(CONTENT) SUMMARY FROM DEFAULT_SCHEMA.RESEARCH_PAPERS_ALL_DATA);

SELECT * FROM DEFAULT_SCHEMA.SUMMARISED_RESEARCH_PAPERS

In [None]:
select * from DEFAULT_SCHEMA.RESEARCH_PAPERS_ALL_DATA

Although a high-level summary now exists, the user may need to drill into the details, which could be spread across documents of many pages. We will now 'chunk' each paper into bite-sized pieces, producing one row of data per chunk. The function **SPLIT_TEXT_RECURSIVE_CHARACTER** is used for this.

In [None]:
CREATE OR REPLACE TABLE DEFAULT_SCHEMA.DETAILED_RESEARCH_PAPERS_CHUNKED_AND_SUMMARISED AS 

SELECT * FROM
(
SELECT 
RELATIVE_PATH,
AUTHORS,
TITLE_OF_PAPER,
DATE_PUBLICATION,
PUBLICATION_LOCATION,
'DETAILED' CONTENT, 
SPLIT_PART(RELATIVE_PATH,'/',1)::text DOCUMENT,
VALUE::TEXT TEXT


FROM


DEFAULT_SCHEMA.RESEARCH_PAPERS_ALL_DATA,

LATERAL FLATTEN(SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(CONTENT,'none',600,50,['\n\n', ' '])))

UNION 

SELECT * FROM DEFAULT_SCHEMA.SUMMARISED_RESEARCH_PAPERS;

SELECT * FROM DEFAULT_SCHEMA.DETAILED_RESEARCH_PAPERS_CHUNKED_AND_SUMMARISED
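
To give a feel for what recursive character splitting does, here is a simplified Python sketch that uses the same parameters as the SQL call (chunk size 600, overlap 50, separators `'\n\n'` and `' '`). It is an illustration under stated assumptions and not Snowflake's implementation: the `split_recursive` and `chunk_text` helpers are hypothetical, pieces are rejoined with a single space, and the overlap is taken as the tail of the previous chunk.

```python
def split_recursive(text, limit, separators):
    """Break text into pieces of at most `limit` characters."""
    if len(text) <= limit:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-width cuts.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    pieces = []
    # Split on the coarsest separator, then recurse on oversized parts.
    for part in text.split(separators[0]):
        pieces.extend(split_recursive(part, limit, separators[1:]))
    return pieces


def chunk_text(text, chunk_size=600, overlap=50, separators=("\n\n", " ")):
    """Greedy chunker: merge small pieces, then prepend an overlap tail.

    Assumes chunk_size is larger than overlap.
    """
    limit = chunk_size - overlap
    merged, current = [], ""
    for piece in split_recursive(text, limit, list(separators)):
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= limit:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    # Each chunk after the first starts with the tail of the previous one.
    return [
        (merged[i - 1][-overlap:] if i and overlap else "") + chunk
        for i, chunk in enumerate(merged)
    ]


sample = " ".join(f"word{i}" for i in range(200))
chunks = chunk_text(sample, chunk_size=100, overlap=10)
```

Smaller chunks tend to give more precise search hits, while the overlap reduces the chance that a sentence is cut off from its context at a chunk boundary.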

## CREATE A SEARCH SERVICE
Now you will create a search service so that a **Cortex Agent** can search the papers and provide answers to your health-related questions.

In [None]:
CREATE OR REPLACE  CORTEX SEARCH SERVICE DEFAULT_SCHEMA.RESEARCH_PAPERS
  ON TEXT
  ATTRIBUTES AUTHORS,TITLE_OF_PAPER,DATE_PUBLICATION,CONTENT,DOCUMENT
  WAREHOUSE = DEFAULT_WH
  TARGET_LAG = '1 hour'
  COMMENT = 'SEARCH SERVICE FOR RESEARCH_PAPERS'
  AS SELECT * FROM DEFAULT_SCHEMA.DETAILED_RESEARCH_PAPERS_CHUNKED_AND_SUMMARISED;

We have now created a search service that enables an agent to answer questions about the research papers. The structured datasets can also be used by the Cortex Agent, but first we need to bring this data together through a **semantic model**. Structured data will not only contain the structured results from Document AI; it will also contain more traditionally created datasets. This lab includes a synthetic dataset summarising the general health of millions of people located in Great Britain, which is a good place to start. It would also be useful to synthesise more information about sample patients, such as blood pressure history and medical reports. Let's try out **Cortex Playground** to see the capabilities of the built-in LLMs.