# Airbyte SQL Index Guide

This guide shows how to run SQL queries, both raw and generated from natural language, against a Snowflake database populated by Airbyte.

In [1]:
# Uncomment to enable debugging.

# import logging
# import sys

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

### Airbyte ingestion

Here we show how to ingest data from Zendesk into a Snowflake db using Airbyte.

![airbyte_1.png](attachment:airbyte_1.png)

Let's create a new connection that syncs our Zendesk tickets into a Snowflake database.

![zendesk_1.png](attachment:zendesk_1.png)

![zendesk_2.png](attachment:zendesk_2.png)

![zendesk_3.png](attachment:zendesk_3.png)

![snowflake_1-2.png](attachment:snowflake_1-2.png)

![snowflake_2.png](attachment:snowflake_2.png)

Choose the streams you want to sync.

![airbyte_4.png](attachment:airbyte_4.png)

![airbyte_8.png](attachment:airbyte_8.png)

![airbyte_7.png](attachment:airbyte_7.png)

Sync your data.

![airbyte_5.png](attachment:airbyte_5.png)

### Snowflake-SQLAlchemy version fix

This is a workaround that lets snowflake-sqlalchemy run against an incompatible SQLAlchemy version.

Taken from https://github.com/snowflakedb/snowflake-sqlalchemy/issues/380#issuecomment-1470762025

In [2]:
# Hack to make snowflake-sqlalchemy work until they patch it

def snowflake_sqlalchemy_20_monkey_patches():
    import sqlalchemy.util.compat

    # make strings always return unicode strings
    sqlalchemy.util.compat.string_types = (str,)
    sqlalchemy.types.String.RETURNS_UNICODE = True

    import snowflake.sqlalchemy.snowdialect

    snowflake.sqlalchemy.snowdialect.SnowflakeDialect.returns_unicode_strings = True

    # make has_table() support the `info_cache` kwarg
    # (snowflake.sqlalchemy.snowdialect is already imported above)

    def has_table(self, connection, table_name, schema=None, info_cache=None):
        """
        Checks if the table exists
        """
        return self._has_object(connection, "TABLE", table_name, schema)

    snowflake.sqlalchemy.snowdialect.SnowflakeDialect.has_table = has_table

# Usage: call this function before creating an engine.
try:
    snowflake_sqlalchemy_20_monkey_patches()
except ImportError as e:
    # Only a missing package points to an install problem; keep the original cause attached.
    raise ValueError("Please run `pip install snowflake-sqlalchemy`") from e

### Define database

We define the Snowflake URI, which is used below to create the SQLAlchemy engine behind the SQL database.

In [3]:
snowflake_uri = 'snowflake://<user_login_name>:<password>@<account_identifier>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>'
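
To avoid hard-coding credentials in the notebook, the URI above could be assembled from environment variables. The following is a minimal sketch, not part of the original guide: the helper name `build_snowflake_uri` and the `SNOWFLAKE_*` variable names are hypothetical, and it assumes that special characters in the user name and password need to be percent-encoded before they go into a URI.

```python
import os
from urllib.parse import quote_plus


def build_snowflake_uri(env=os.environ):
    """Assemble a Snowflake SQLAlchemy URI from environment-style settings.

    The SNOWFLAKE_* names are hypothetical; rename them to match your setup.
    """
    # Percent-encode values that may contain characters such as '@' or '/'.
    user = quote_plus(env["SNOWFLAKE_USER"])
    password = quote_plus(env["SNOWFLAKE_PASSWORD"])
    account = env["SNOWFLAKE_ACCOUNT"]
    database = env["SNOWFLAKE_DATABASE"]
    schema = env["SNOWFLAKE_SCHEMA"]
    warehouse = quote_plus(env["SNOWFLAKE_WAREHOUSE"])
    role = quote_plus(env["SNOWFLAKE_ROLE"])
    return (
        f"snowflake://{user}:{password}@{account}/{database}/{schema}"
        f"?warehouse={warehouse}&role={role}"
    )
```

With the variables set, `snowflake_uri = build_snowflake_uri()` would replace the literal string; passing a plain dict instead of `os.environ` makes the helper easy to try out.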


First, we connect with SQLAlchemy directly to check that the database is reachable.

In [4]:
from sqlalchemy import select, create_engine, MetaData, Table

# Reflect the synced table to view its current contents.
engine = create_engine(snowflake_uri)
metadata = MetaData()
table = Table(
    "ZENDESK_TICKETS",
    metadata,
    autoload_with=engine,  # load the column definitions from the database
)
stmt = select(table)


with engine.connect() as connection:
    results = connection.execute(stmt).fetchone()
    print(results)
    print(results._mapping.keys())


(False, 'test case', '[]', datetime.datetime(2022, 7, 18, 16, 59, 13, tzinfo=<UTC>), 'test to', None, None, 'question', '{\n  "channel": "web",\n  "source": {\n    "from": {},\n    "rel": null,\n    "to": {}\n  }\n}', True, datetime.datetime(2022, 7, 18, 18, 1, 37, tzinfo=<UTC>), None, '[]', None, 134, None, 1658167297, 'test case', None, '[]', False, '{\n  "score": "offered"\n}', 360786799676, 'low', '[]', 'https://d3v-airbyte.zendesk.com/api/v2/tickets/134.json', '[]', 360000358316, 360000084116, '[]', None, '[]', 360033549136, True, None, False, 'new', 360786799676, 'abd39a87-b1f9-4390-bf8b-cf3c288b1f74', datetime.datetime(2023, 6, 9, 0, 25, 23, 501000, tzinfo=pytz.FixedOffset(-420)), datetime.datetime(2023, 6, 9, 0, 38, 20, 440000, tzinfo=<UTC>), '6577ef036668746df889983970579a55', '02522a2b2726fb0a03bb19f2d8d9524d')
RMKeyView(['from_messaging_channel', 'subject', 'email_cc_ids', 'created_at', 'description', 'custom_status_id', 'external_id', 'type', 'via', 'allow_attachments', 'up

### Build Index

We then build the SQL Index (`SQLStructStoreIndex`).

In [5]:
from llama_index import SQLStructStoreIndex, SQLDatabase, VectorStoreIndex
from llama_index.indices.struct_store import SQLContextContainerBuilder

sql_database = SQLDatabase(engine)
context_builder = SQLContextContainerBuilder(sql_database)
table_schema_index = context_builder.derive_index_from_context(
    VectorStoreIndex,
)
query_str = "When was the last zendesk ticket created?"
context_builder.query_index_for_context(table_schema_index, query_str, store_context_str=True)
context_container = context_builder.build_context_container()
index = SQLStructStoreIndex(
    sql_database=sql_database,
    sql_context_container=context_container,
)

Note that we pass `context_container` so that the prompt only includes the schemas of the tables most relevant to the query.
Otherwise, the prompt may grow too large for the LLM's context window.
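
The idea of restricting the prompt to relevant tables can be illustrated with a toy sketch. This is not how `SQLContextContainerBuilder` works internally: it swaps the vector index for simple word overlap, and the helper names, the second table, and the shortened schema summaries are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_relevant_tables(question, table_schemas, top_k=1):
    """Return the top_k table names sharing the most words with the question."""
    question_tokens = tokenize(question)
    scored = [
        (len(question_tokens & tokenize(f"{name} {schema}")), name)
        for name, schema in table_schemas.items()
    ]
    # Highest overlap first; ties are broken alphabetically by table name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical schema summaries; only zendesk_tickets appears in this guide.
table_schemas = {
    "zendesk_tickets": "subject description created_at updated_at status priority",
    "zendesk_users": "name email role time_zone",
}

relevant = pick_relevant_tables(
    "When was the last zendesk ticket created?", table_schemas
)
# Only the selected tables contribute their schema text to the prompt context.
context_str = "\n".join(
    f"Table '{name}' has columns: {table_schemas[name]}" for name in relevant
)
print(context_str)
```

Only the schema of the selected table ends up in `context_str`, which keeps the prompt small regardless of how many tables the database contains.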

### Query Index

We first show how to execute a raw SQL query, which runs directly against the table.

In [6]:
query_engine = index.as_query_engine(
    query_mode="sql"
)
response = query_engine.query("SELECT created_at FROM ZENDESK_TICKETS limit 5")

In [7]:
from IPython.display import Markdown, display
display(Markdown(f"<b>{response}</b>"))

<b>[(datetime.datetime(2022, 7, 18, 16, 59, 13, tzinfo=<UTC>),), (datetime.datetime(2021, 9, 1, 11, 59, 40, tzinfo=<UTC>),), (datetime.datetime(2021, 9, 1, 12, 0, 29, tzinfo=<UTC>),), (datetime.datetime(2022, 7, 18, 15, 22, 39, tzinfo=<UTC>),), (datetime.datetime(2021, 9, 1, 12, 0, 25, tzinfo=<UTC>),)]</b>

We then show a natural language query, which is translated to a SQL query under the hood with our text-to-SQL prompt.

In [8]:

query_engine = index.as_query_engine()
display(Markdown(f"<b>{context_container.context_str}</b>"))
response = query_engine.query(query_str)

<b>
Table 'zendesk_tickets' has the relevant column for the query: created_at (TIMESTAMP_TZ). The full schema of the table is: 

Table 'zendesk_tickets' has columns: from_messaging_channel (BOOLEAN), subject (VARCHAR(16777216)), email_cc_ids (VARIANT), created_at (TIMESTAMP_TZ), description (VARCHAR(16777216)), custom_status_id (DECIMAL(38, 0)), external_id (VARCHAR(16777216)), type (VARCHAR(16777216)), via (VARIANT), allow_attachments (BOOLEAN), updated_at (TIMESTAMP_TZ), problem_id (DECIMAL(38, 0)), follower_ids (VARIANT), due_at (TIMESTAMP_TZ), id (DECIMAL(38, 0)), assignee_id (DECIMAL(38, 0)), generated_timestamp (DECIMAL(38, 0)), raw_subject (VARCHAR(16777216)), forum_topic_id (DECIMAL(38, 0)), custom_fields (VARIANT), allow_channelback (BOOLEAN), satisfaction_rating (VARIANT), submitter_id (DECIMAL(38, 0)), priority (VARCHAR(16777216)), collaborator_ids (VARIANT), url (VARCHAR(16777216)), tags (VARIANT), brand_id (DECIMAL(38, 0)), ticket_form_id (DECIMAL(38, 0)), sharing_agreement_ids (VARIANT), group_id (DECIMAL(38, 0)), followup_ids (VARIANT), organization_id (DECIMAL(38, 0)), is_public (BOOLEAN), recipient (VARCHAR(16777216)), has_incidents (BOOLEAN), status (VARCHAR(16777216)), requester_id (DECIMAL(38, 0)), _airbyte_ab_id (VARCHAR(16777216)), _airbyte_emitted_at (TIMESTAMP_TZ), _airbyte_normalized_at (TIMESTAMP_TZ), _airbyte_zendesk_tickets_hashid (VARCHAR(32)), _airbyte_unique_key (VARCHAR(32)) and foreign keys: .</b>

In [9]:
display(Markdown(f"<b>{response}</b>"))

<b> The last Zendesk ticket was created on September 19th, 2022 at 14:53:49 UTC.</b>

In [10]:
# you can also fetch the raw result from SQLAlchemy! 
response.extra_info["result"]

[(datetime.datetime(2022, 9, 19, 14, 53, 49, tzinfo=<UTC>),)]
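
The natural-language flow above (build a prompt from the schema and the question, get SQL back, run it) can be sketched end to end without any external service. This is a rough illustration under stated assumptions: the prompt template is hypothetical and is not the text-to-SQL prompt the library ships, `fake_llm` is a stand-in that returns one hard-coded query, and an in-memory SQLite table replaces Snowflake. The sample timestamps are copied from the outputs shown in this guide.

```python
import sqlite3

# Hypothetical prompt template, for illustration only.
PROMPT_TEMPLATE = (
    "Given the schema below, write a SQL query that answers the question.\n"
    "Schema: {schema}\n"
    "Question: {question}\n"
    "SQL:"
)


def fake_llm(prompt):
    """Stand-in for an LLM call; always returns the same hard-coded query."""
    return "SELECT MAX(created_at) FROM zendesk_tickets"


def answer_question(connection, schema, question, llm=fake_llm):
    """Build the prompt, ask the model for SQL, run it, and return (sql, rows)."""
    prompt = PROMPT_TEMPLATE.format(schema=schema, question=question)
    sql = llm(prompt)
    rows = connection.execute(sql).fetchall()
    return sql, rows


# In-memory SQLite stands in for Snowflake. ISO-formatted text timestamps
# sort chronologically, so MAX() picks the most recent one.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE zendesk_tickets (created_at TEXT)")
connection.executemany(
    "INSERT INTO zendesk_tickets VALUES (?)",
    [("2021-09-01 11:59:40",), ("2022-07-18 16:59:13",), ("2022-09-19 14:53:49",)],
)

sql, rows = answer_question(
    connection,
    schema="zendesk_tickets(created_at TEXT)",
    question="When was the last zendesk ticket created?",
)
print(sql)
print(rows)
```

In the real query engine, the raw rows are additionally summarized into a natural-language answer, while the raw rows remain available through `response.extra_info["result"]` as shown above.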