<div id="singlestore-header" style="display: flex; background-color: rgba(255, 167, 103, 0.25); padding: 5px;">
    <div id="icon-image" style="width: 90px; height: 90px;">
        <img width="100%" height="100%" src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/crystal-ball.png" />
    </div>
    <div id="text" style="padding: 5px; margin-left: 10px;">
        <div id="badge" style="display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%">SingleStore Notebooks</div>
        <h1 style="font-weight: 500; margin: 8px 0 0 4px;">Building a Generative AI Application with Vertex AI and SingleStoreDB</h1>
    </div>
</div>

## Document Ingestion

Welcome to this guide on building a generative AI application using Google Cloud's Vertex AI and SingleStoreDB. It provides step-by-step instructions, code explanations, and best practices.

## Overview

Vertex AI, a Google Cloud product, offers an integrated suite of machine learning tools for building, deploying, and scaling AI models. SingleStoreDB is a fast, scalable, SQL-compliant relational database. By combining Vertex AI's machine learning capabilities with SingleStoreDB's storage and retrieval, we can build AI applications that respond to user queries in real time.

### What You'll Learn

- Setting up your environment with the necessary packages and credentials.
- Fetching and processing data to be used in our AI models.
- Storing and managing data efficiently using SingleStoreDB.
- Leveraging the power of Vertex AI for real-time data processing and insights.
- Building a retrieval-based QA system to answer user queries.

### Prerequisites

- Basic knowledge of Python programming.
- Familiarity with Google Cloud services and SQL databases.
- An active Google Cloud account.
- A SingleStoreDB hosted or self-managed instance.

**Let's dive in and start building!**

In [1]:
%pip install --quiet google-cloud-aiplatform langchain github-clone
%pip install --quiet unstructured unstructured[pdf] pytesseract
%pip install --quiet singlestoredb

## Authentication

The next step is to authenticate the session with Google Cloud. Running the following cell prompts you to log in with your Google Cloud credentials; follow the instructions to complete the login. The cell relies on the `google.colab` module, so it only works in a Colab runtime.

In [2]:
from google.colab import auth as google_auth

google_auth.authenticate_user()

## Import modules

In [3]:
# Vertex AI
import vertexai
from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingModel, TextGenerationModel

# Langchain
from langchain.llms import VertexAI
from langchain.vectorstores import SingleStoreDB

## Obtaining a dataset

The dataset used here is composed of public data provided by the IRS regarding the 2023 tax season.

You can download the dataset to your computer and explore it by following [this link](https://drive.google.com/file/d/1mdDHBnSWwDbMo2xyRk9gxUAswhyb9uKw/view?usp=drive_link).

After the dataset is downloaded, the contents will be ingested into SingleStore.

Document processing consists of chunking the documents with LangChain's text splitters and generating embeddings with the Vertex AI `textembedding-gecko` model.

In [4]:
FILE_URL = "https://github.com/datagabe/hollywood/raw/main/sample_tax_information.zip"

# Download the archive, then extract it into ./dataset
# (-p and -o make the cell safe to re-run)
!wget {FILE_URL} -O dataset.zip
!mkdir -p dataset
!unzip -o dataset.zip -d dataset

## Loading Data from a Directory

Once the dataset has been downloaded and unzipped, LangChain's `DirectoryLoader` loads the documents from the `dataset` directory so that they can be split into chunks and ingested into SingleStoreDB.

In [5]:
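import unstructured  # imported only to fail early if the package is missing
from langchain.document_loaders import DirectoryLoader

# Load every file found in ./dataset as a LangChain Document
loader = DirectoryLoader('dataset')

docs = loader.load()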
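
Before loading, it can help to check what the archive actually contained. The following is a minimal sketch, not part of the original notebook: `count_files_by_suffix` is a hypothetical helper that walks a directory recursively and counts files per extension. It may count files that the loader itself would skip.

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(directory: str) -> Counter:
    """Count regular files under `directory` (recursively), keyed by lowercase suffix."""
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "<none>"
            counts[path.suffix.lower() or "<none>"] += 1
    return counts


# Only report when the dataset directory from the previous cell exists
if Path("dataset").is_dir():
    print(count_files_by_suffix("dataset"))
```

If the counts look wrong (for example, no PDF files at all), the download or unzip step is the first place to check.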

## Splitting the Data

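To process the data more efficiently, we split the loaded content into smaller chunks. The `RecursiveCharacterTextSplitter` class tries a sequence of separators (paragraph breaks, line breaks, then spaces) until each chunk fits within `chunk_size` characters, and repeats up to `chunk_overlap` characters between neighboring chunks so that context is not lost at the boundaries.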

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=50)
all_splits = text_splitter.split_documents(docs)
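
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators); it is a plain fixed-window splitter, and `split_with_overlap` is a hypothetical name introduced only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. With the notebook's settings (`chunk_size=2000`, `chunk_overlap=50`) the same idea applies at a larger scale.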

## Setting Up SingleStoreDB with Vertex AI Embeddings

For efficient storage and retrieval, we use SingleStoreDB as a vector store together with Vertex AI embeddings. The following cell initializes the Vertex AI SDK, generates an embedding for each chunk, and ingests the chunks into SingleStoreDB.

In [7]:
from langchain.embeddings import VertexAIEmbeddings

# Initialize the Vertex AI SDK; set `project` to your Google Cloud project ID
aiplatform.init(project="", location="us-central1")

# Embed each chunk with Vertex AI and ingest the chunks into SingleStoreDB
embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@003")
vectorstore = SingleStoreDB.from_documents(documents=all_splits, embedding=embeddings)
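
The cell above passes no connection details, so the vector store needs to find them elsewhere. The sketch below assumes, as the LangChain documentation describes, that the connection is read from the `SINGLESTOREDB_URL` environment variable in the form `user:password@host:port/database`. `build_singlestore_url` is a hypothetical helper, every value shown is a placeholder, and the percent-escaping assumes the client decodes it when parsing the URL.

```python
import os
from urllib.parse import quote


def build_singlestore_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a `user:password@host:port/database` connection string.

    The user and password are percent-escaped so that characters such as
    '@' or ':' inside them do not break URL parsing.
    """
    return f"{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{database}"


# Placeholder credentials for illustration only
url = build_singlestore_url("admin", "p@ss:word", "localhost", 3306, "tax_docs")
print(url)

# Uncomment after replacing the placeholders, and run before creating the vector store:
# os.environ["SINGLESTOREDB_URL"] = url
```

In an environment that already provides the connection, this step may be unnecessary.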

<div id="singlestore-footer" style="background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px"></div>
<div><img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png" style="padding: 0px; margin: 0px; height: 24px"/></div>