# AI SQL
AISQL is an easy-to-use, composable query language that handles both text and multimodal data through familiar SQL syntax. Its high-performance batch engine uses query optimization to process queries faster and at lower cost than traditional AI solutions. Structured and unstructured data can be ingested natively into unified multimodal tables, so all of your data can be analyzed with familiar SQL.

## AI SQL Benefits
1. Easy-to-use SQL syntax for building AI pipelines without complex coding
2. High-performance processing across all modalities (text, image, audio)
3. Lower-cost batch processing through an optimized architecture that supports faster and larger batch jobs


## Three Major Functions
### AI_FILTER
An AI-powered SQL operator for semantic filtering. You can use the same syntax on a single table (for filtering) or to join multiple tables on a semantic relationship.
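
A minimal sketch of both uses. The `reviews` and `products` data below is made up for illustration and is not part of this lab; the results depend on the model, so none are shown.

```sql
-- Filtering: keep only the rows for which the LLM judges the predicate true.
-- (Toy data; table and column names are illustrative only.)
WITH reviews(review_text) AS (
    SELECT * FROM VALUES
        ('The new phone has a great camera.'),
        ('My laptop fan is too loud.')
)
SELECT review_text
FROM reviews
WHERE AI_FILTER(PROMPT('Is this review positive? {0}', review_text));

-- Semantic join: keep a pair of rows only when the predicate holds for it.
WITH reviews(review_text) AS (
    SELECT * FROM VALUES
        ('The new phone has a great camera.'),
        ('My laptop fan is too loud.')
),
products(product_name) AS (
    SELECT * FROM VALUES ('phone'), ('laptop')
)
SELECT r.review_text, p.product_name
FROM reviews r
JOIN products p
  ON AI_FILTER(PROMPT('Is the review "{0}" about a {1}?', r.review_text, p.product_name));
```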

### AI_AGG/AI_SUMMARIZE_AGG
Aggregate functions that analyze many rows at once, guided by a user prompt. Because they are not limited to a single context window, they can examine more records than would fit into one LLM call.
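
As a rough illustration, AI_AGG can be used like any other aggregate with GROUP BY. The `feedback` data here is invented for the example and is not part of this lab.

```sql
-- Toy data; table and column names are illustrative only.
WITH feedback(product, comment_text) AS (
    SELECT * FROM VALUES
        ('widget', 'Battery lasts all day.'),
        ('widget', 'The case scratches easily.'),
        ('gadget', 'Setup took five minutes.')
)
-- One LLM-written summary per product, built from all of that product's rows.
SELECT product,
       AI_AGG(comment_text, 'Summarize the main points customers raise') AS summary
FROM feedback
GROUP BY product;
```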

### AI_CLASSIFY
A function for classifying multimodal data into categories that you supply.
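
A small sketch on text input, assuming made-up headlines and an illustrative category list. AI_CLASSIFY returns an object whose `labels` field holds the chosen categories.

```sql
-- Toy data; the headlines and categories are illustrative only.
WITH news(headline) AS (
    SELECT * FROM VALUES
        ('Bank reports record quarterly profit'),
        ('Regulator opens inquiry into lender')
)
SELECT headline,
       AI_CLASSIFY(headline, ['earnings', 'merger', 'regulation']):labels AS labels
FROM news;
```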

## AI-SQL Functions For Equity Research

Snowflake's AI-powered functions apply LLMs at the column level, much like traditional database operators.

In this lab we will do the following:

1. Parse PDF documents into text using [PARSE_DOCUMENT](https://docs.snowflake.com/en/sql-reference/functions/parse_document-snowflake-cortex)
2. Extract entities using Snowflake [structured output](https://docs.snowflake.com/en/user-guide/snowflake-cortex/complete-structured-outputs)
3. Map entities to S&P 500 tickers using a Top-K join and an AI join ([AI_FILTER (PrPr)](https://docs.snowflake.com/LIMITEDACCESS/snowflake-cortex/ai_filter-snowflake-cortex))
4. Summarize research insights across multiple articles for a given ticker (using [AI_AGG (PrPr)](https://docs.snowflake.com/LIMITEDACCESS/snowflake-cortex/ai_agg))

This lab uses 7 PDF documents containing equity research in various forms. Each PDF contains bulleted data, table data, and text data. We will use the PARSE_DOCUMENT function to load this information into a table so that we can feed it into Cortex. We then extract the entities mentioned in those PDF documents into a table and join them to stock tickers using a marketplace listing of common stock tickers and company names. A Top-K join produces the initial candidates, and AI_FILTER then applies LLM logic to confirm the matches and clean up those results. Finally, we will summarize the findings on one of the stocks using AI_AGG.


Download the article PDFs from the repo's `data/EquityDOCS` folder, then upload them to a stage called `@equitydocs`.
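
If the stage does not exist yet, it could be created along these lines. This is a sketch rather than the lab's official setup: it assumes an internal stage with a directory table (the later cells read from `DIRECTORY(@equitydocs)`) and server-side encryption, which PARSE_DOCUMENT needs in order to read the files.

```sql
-- Assumed setup: internal stage with a directory table and server-side encryption.
CREATE STAGE IF NOT EXISTS equitydocs
    DIRECTORY = (ENABLE = TRUE)
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- After uploading the PDFs, refresh the directory table and confirm the files are visible.
ALTER STAGE equitydocs REFRESH;
SELECT relative_path, size FROM DIRECTORY(@equitydocs);
```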

First, let's parse the research documents into text using the [PARSE_DOCUMENT](https://docs.snowflake.com/en/sql-reference/functions/parse_document-snowflake-cortex) function, reading from the stage that contains the PDFs.

In [None]:
-- PDFs were uploaded into stage @equitydocs
-- (note: create the stage with server-side encryption; PARSE_DOCUMENT cannot read client-side encrypted files)
use role sysadmin;

CREATE OR REPLACE TABLE raw_docs_text AS SELECT
    relative_path, 
    GET_PRESIGNED_URL(@equitydocs, relative_path) as scoped_file_url, 
    TO_VARIANT(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@equitydocs, relative_path, {'mode': 'layout'})) as raw_text_dict,
    raw_text_dict:content as raw_text
FROM DIRECTORY(@equitydocs);


Let's review what the parsing did. It pulled the information out of each PDF into a table, retained some of the bullet structure, and read content from the charts as well.

In [None]:
select * from raw_docs_text;

Now we can extract each company and its sentiment from the documents we just parsed, using [STRUCTURED OUTPUT](https://docs.snowflake.com/en/user-guide/snowflake-cortex/complete-structured-outputs).

In [None]:
CREATE OR REPLACE TABLE ENTITY_EXTRACTION_EXAMPLE as 
select *,
    AI_COMPLETE(
        model => 'openai-gpt-5-mini',
        prompt => 'You are tasked with extracting companies from a research article. For each company that is identified, extract the "company" and the "sentiment" with which the company was referenced:\n\n' || RAW_TEXT,
        response_format => {
            'type':'json',
            'schema':{
                'type' : 'object',
                'properties': {
                    'company_sentiment': {
                        'type': 'array',
                        'items': {
                            'type': 'object',
                            'properties': {
                                'company': {'type': 'string'},
                                'sentiment': {'type': 'string'}
                            },
                            'required':['company','sentiment'],
                            'additionalProperties':false
                        }
                    }
                },
                'required':['company_sentiment'],
                'additionalProperties':false
            }
        }
    ) as extraction,
    extraction:company_sentiment::array as list_company_sentiment,
    AI_COMPLETE(
        model => 'openai-gpt-5-mini',
        prompt => PROMPT('Summarize this article into a few sentences {0}', RAW_TEXT)) as summary
    from raw_docs_text
    limit 3;  -- processes only 3 documents; remove the LIMIT to process all 7
    --QUALIFY ROW_NUMBER() OVER (ORDER BY raw_text) = 3;

-- Extracts companies and their sentiment from article. Takes 1m 6s for one article (gpt5). 
-- For all 7 it takes 6m 2s (gpt5), 56s (gpt5-mini), 46s (gpt5-nano)

Here we can see the sentiment for each company in the returned array. The extraction column holds the JSON object produced by the statement above. Double-click a row in the extraction column to see the full values.

In [None]:
select * from ENTITY_EXTRACTION_EXAMPLE;

Let's use the marketplace listing we brought in during setup to get a list of common tickers. If you skipped that step, please refer to the setup documentation for the lab.

In [None]:
CREATE OR REPLACE TABLE TICKERS_LIST as
select distinct company_name, ticker
FROM S__P_500_BY_DOMAIN_AND_AGGREGATED_BY_TICKERS_SAMPLE.DATAFEEDS.SP_500;

First we need to flatten this JSON object so that each company gets its own row. This will make the joins easier.
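
To see what the flattening does in isolation, here is a small sketch on a hand-written JSON array. The company names and sentiments are placeholders, not values from the lab's documents.

```sql
-- Toy input: one row holding an array of two {company, sentiment} objects.
WITH sample AS (
    SELECT PARSE_JSON('[{"company": "Acme", "sentiment": "positive"},
                        {"company": "Globex", "sentiment": "negative"}]') AS list_company_sentiment
)
-- LATERAL FLATTEN emits one output row per array element, exposed as f.value.
SELECT
    f.value:company::STRING   AS company,
    f.value:sentiment::STRING AS sentiment
FROM sample,
    LATERAL FLATTEN(INPUT => sample.list_company_sentiment) AS f;
```

The single input row becomes two rows, one per company, which is the shape the join steps need.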

In [None]:
create or replace view flattened_extraction as 
SELECT 
    relative_path as file_name,
    RAW_TEXT,
    summary,
    flattened.value:company::STRING AS Company,
    flattened.value:sentiment::STRING AS Sentiment,
    list_company_sentiment 
FROM 
    entity_extraction_example,
    LATERAL FLATTEN(INPUT => list_company_sentiment) AS flattened;

In [None]:
select * from flattened_extraction;

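Let's join this flattened table to our tickers list using Top-K logic and see our results. Every extracted company is scored against every ticker name with AI_SIMILARITY, and only the five highest-scoring candidates are kept for each company in each file.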
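
The Top-K pattern itself (cross join, score, then `QUALIFY ROW_NUMBER()`) can be sketched on toy data. To keep the sketch independent of any LLM call, it swaps in `JAROWINKLER_SIMILARITY`, a plain string-similarity function, in place of AI_SIMILARITY; the data and the choice of K = 2 are illustrative only.

```sql
-- Toy data; names are illustrative only.
WITH extracted(company) AS (
    SELECT * FROM VALUES ('Goldman'), ('Apple')
),
tickers(company_name, ticker) AS (
    SELECT * FROM VALUES
        ('Goldman Sachs Group Inc', 'GS'),
        ('Apple Inc', 'AAPL'),
        ('Applied Materials Inc', 'AMAT')
)
SELECT
    e.company,
    t.company_name,
    t.ticker,
    JAROWINKLER_SIMILARITY(e.company, t.company_name) AS sim_score  -- stand-in for AI_SIMILARITY
FROM extracted e
CROSS JOIN tickers t
-- Keep the 2 best-scoring candidates for each extracted company.
QUALIFY ROW_NUMBER() OVER (PARTITION BY e.company ORDER BY sim_score DESC) <= 2;
```

Because every extracted name keeps K candidates regardless of how good they are, some of the surviving pairs will be poor matches; that is the gap the later AI_FILTER step closes.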

In [None]:
create or replace table top_candidates as 
SELECT c.*, 
    d.*, 
    AI_SIMILARITY(c.company, d.company_name) as sim_score
FROM  flattened_extraction c
CROSS JOIN TICKERS_LIST d
QUALIFY row_number() OVER (PARTITION BY company, file_name
                         ORDER BY sim_score DESC) <= 5;

Looking at the results, we can see the Top-K join did its best to match a ticker to each company name. It is also easy to see the flaw in this logic: it pairs some companies that have nothing to do with each other.

In [None]:
select 
    company as extracted, 
    company_name as mapped_company,
    sim_score,
    ticker as mapped_ticker 
from top_candidates;

Let's add some AI logic to improve the matches and clean up our results. We will use [AI_FILTER (PuPr)](https://docs.snowflake.com/en/sql-reference/functions/ai_filter) for the most accurate results.

A hint for the syntax here is `AI_FILTER(CONCAT('text', columnname, ' text', columnname, ' text ', columnname))`. The cell below builds the same kind of prompt with `PROMPT()` and numbered placeholders.

In [None]:
//ENTITY DISAMBIGUATION - USING AI FILTER TO JOIN SUMMARY/EXTRACTION FROM ARTICLE TO S&P 500 Companies

create or replace table matched_candidates as 
SELECT 
    file_name, 
    raw_text, 
    summary, 
    company as extracted, 
    company_name as mapped_company, 
    ticker as mapped_ticker
FROM top_candidates
WHERE 1=1
and raw_text is not null 
and company_name is not null 
and summary is not null 
and AI_FILTER(PROMPT('Are all of these three referring to the same company?\nSummary:{0}, Company name:{1}, Another Company Name:{2}', summary, company_name, company))
ORDER BY FILE_NAME;

select * from matched_candidates;

Let's use another new AISQL function, [AI_AGG (PrPr)](https://docs.snowflake.com/LIMITEDACCESS/snowflake-cortex/ai_agg), to aggregate insights across multiple documents for a specific ticker.

In [None]:
create or replace table AIAGG_RESULTS as (
select 
    mapped_ticker,
    count(*) as count_research,
    AI_AGG(raw_text,'Please help me summarize in bullet points any discussions relevant to Goldman Sachs') as Aggregate_finding
    --AI_AGG applies the prompt across the raw_text of every research article matched to the ticker and returns one bulleted summary in the Aggregate_finding column
from matched_candidates
where mapped_ticker = 'GS'
group by mapped_ticker);

Select * from aiagg_results;
