<a href="https://colab.research.google.com/github/xtreamgit/06_web/blob/master/Fully_Scalable_Q%26A_System_with_Cassandra_Backend.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search QA Quickstart

Set up a simple Question-Answering system with LangChain and CassIO, using Cassandra as the Vector Database.

## Colab-specific setup

Make sure you have a database, and be ready to upload the Secure Connect Bundle and supply the Token string
(see [Pre-requisites](https://cassio.org/start_here/#vector-database) on cassio.org for details). Remember that you need a **custom Token** with the [Database Administrator](https://awesome-astra.github.io/docs/pages/astra/create-token/) role.

Likewise, ensure you have the necessary secret for the LLM provider of your choice: you'll be asked to input it shortly
(see [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org for details).

_Note: this notebook is modified from the CassIO documentation. Visit [this page on cassio.org](https://cassio.org/frameworks/langchain/qa-basic/) for the original._


In [1]:
# install required dependencies
! pip install -q --progress-bar off \
    "git+https://github.com/hemidactylus/langchain@updated-full-preview--lab#egg=langchain&subdirectory=libs/langchain" \
    "cassio>=0.1.1" \
    "google-cloud-aiplatform>=1.25.0" \
    "jupyter>=1.0.0" \
    "openai==0.27.7" \
    "python-dotenv==1.0.0" \
    "tensorflow-cpu==2.12.0" \
    "tiktoken==0.4.0" \
    "transformers>=4.29.2"
exit()

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for langchain (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
tensorflow 2.14.0 requires keras<2.15,>=2.14.0, but you have keras 2.12.0 which is incompatible.
tensorflow 2.14.0 requires tensorboard<2.15,>=2.14, but you have tensorboard 2.12.3 which is incompatible.
tensorflow 2.14.0 requires tensorflow-estimator<2.15,>=2.14.0, but you have tensorflow-estimator 2.12.0 which is incompatible.

⚠️ **Do not mind a "Your session crashed..." message you may see.**

This is expected: the `exit()` call at the end of the install cell restarts the kernel so that it picks up the correct dependency versions. _You can now proceed with the notebook._

In [None]:
# Input your database keyspace name:
ASTRA_DB_KEYSPACE = input('Your Astra DB Keyspace name (e.g. cassio_tutorials): ')

Your Astra DB Keyspace name (e.g. cassio_tutorials): pg_vsearch


In [None]:
# Input your Astra DB token string, the one starting with "AstraCS:..."
from getpass import getpass
ASTRA_DB_APPLICATION_TOKEN = getpass('Your Astra DB Token ("AstraCS:..."): ')

Your Astra DB Token ("AstraCS:..."): ··········
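
Since the token is typed into a hidden prompt, a paste error is easy to miss. The following is a minimal sketch of a sanity check, assuming only what the prompt states: that a valid token starts with `AstraCS:`. The helper name `looks_like_astra_token` is hypothetical, and the check says nothing about whether the token is actually accepted by the database.

```python
def looks_like_astra_token(token: str) -> bool:
    """Return True if the string has the 'AstraCS:' prefix followed by some content."""
    prefix = "AstraCS:"
    token = token.strip()
    return token.startswith(prefix) and len(token) > len(prefix)


# Example with obviously fake placeholder values (never hard-code a real token):
print(looks_like_astra_token("AstraCS:placeholder:placeholder"))  # True
print(looks_like_astra_token("my-password"))                      # False
```

In the notebook, one could call `looks_like_astra_token(ASTRA_DB_APPLICATION_TOKEN)` right after the prompt and re-run the cell if it returns `False`.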


### Astra DB Secure Connect Bundle

Please upload the Secure Connect Bundle zipfile to connect to your Astra DB instance.

The Secure Connect Bundle is needed to establish a secure connection to the database.
Click [here](https://awesome-astra.github.io/docs/pages/astra/download-scb/#c-procedure) for instructions on how to download it from Astra DB.

In [None]:
# Upload your Secure Connect Bundle zipfile:
import os
from google.colab import files


print('Please upload your Secure Connect Bundle')
uploaded = files.upload()
if uploaded:
    astraBundleFileTitle = list(uploaded.keys())[0]
    ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
    )

Please upload your Secure Connect Bundle


Saving secure-connect-db-paulgraham.zip to secure-connect-db-paulgraham.zip
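
Uploading the wrong file only surfaces later, when the driver tries to connect. As a hedged sketch, the uploaded path can be checked with the standard-library `zipfile` module before going further. The helper name `check_bundle` is hypothetical, and the check only confirms that the file is a readable zip archive, not that it is a valid Secure Connect Bundle.

```python
import zipfile


def check_bundle(path: str) -> list[str]:
    """Return the names of the files inside the zip archive at `path`.

    Raises ValueError if the file is missing or is not a valid zip archive.
    """
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path!r} is not a valid zip archive")
    with zipfile.ZipFile(path) as bundle:
        return bundle.namelist()
```

In this notebook the call would be `check_bundle(ASTRA_DB_SECURE_BUNDLE_PATH)`.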


In [None]:
# colab-specific override of helper functions
from cassandra.cluster import (
    Cluster,
)
from cassandra.auth import PlainTextAuthProvider


def getCQLSession(mode='astra_db'):
    if mode == 'astra_db':
        cluster = Cluster(
            cloud={
                "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
            },
            auth_provider=PlainTextAuthProvider(
                "token",
                ASTRA_DB_APPLICATION_TOKEN,
            ),
        )
        astraSession = cluster.connect()
        return astraSession
    else:
        raise ValueError('Unsupported CQL Session mode')

def getCQLKeyspace(mode='astra_db'):
    if mode == 'astra_db':
        return ASTRA_DB_KEYSPACE
    else:
        raise ValueError('Unsupported CQL Session mode')

### LLM Provider

In the cell below you can choose between **GCP Vertex AI**, **OpenAI** and **Azure OpenAI** for your LLM services.
(See [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org for more details).

Make sure you set the `llmProvider` variable and supply the corresponding access secrets in the following cell.

In [None]:
# Choose your LLM provider: 'OpenAI', 'GCP_VertexAI' or 'Azure_OpenAI'
# (the corresponding secrets are requested in the next cell)
llmProvider = 'OpenAI'


In [None]:
import os
from getpass import getpass
if llmProvider == 'OpenAI':
    apiSecret = getpass(f'Your secret for LLM provider "{llmProvider}": ')
    os.environ['OPENAI_API_KEY'] = apiSecret
elif llmProvider == 'GCP_VertexAI':
    # we need a json file
    print(f'Please upload your Service Account JSON for the LLM provider "{llmProvider}":')
    from google.colab import files
    uploaded = files.upload()
    if uploaded:
        vertexAIJsonFileTitle = list(uploaded.keys())[0]
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.join(os.getcwd(), vertexAIJsonFileTitle)
    else:
        raise ValueError(
            'No file uploaded. Please re-run the cell.'
        )
elif llmProvider == 'Azure_OpenAI':
    # a few parameters must be input (the API key is read with getpass so it is not echoed)
    apiSecret = getpass(f'Your API Key for LLM provider "{llmProvider}": ')
    os.environ['AZURE_OPENAI_API_KEY'] = apiSecret
    apiBase = input('The "Base URL" for your models (e.g. "https://YOUR-RESOURCE-NAME.openai.azure.com"): ')
    os.environ['AZURE_OPENAI_API_BASE'] = apiBase
    apiLLMDepl = input('The name of your LLM Deployment: ')
    os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'] = apiLLMDepl
    apiLLMModel = input('The name of your LLM Model (e.g. "gpt-4"): ')
    os.environ['AZURE_OPENAI_LLM_MODEL'] = apiLLMModel
    apiEmbDepl = input('The name for your Embeddings Deployment: ')
    os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'] = apiEmbDepl
    apiEmbModel = input('The name of your Embedding Model (e.g. "text-embedding-ada-002"): ')
    os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'] = apiEmbModel

    # The following is probably not going to change for some time...
    os.environ['AZURE_OPENAI_API_VERSION'] = '2023-03-15-preview'
else:
    raise ValueError('Unknown/unsupported LLM Provider')

Your secret for LLM provider "OpenAI": ··········
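
Each provider branch above relies on a different set of environment variables. The sketch below collects those names (taken from the cell above) and reports which ones are still unset. The mapping `REQUIRED_ENV_VARS` and the helper `missing_env_vars` are hypothetical names introduced for illustration only.

```python
import os

# Environment variables each provider branch sets in this notebook.
REQUIRED_ENV_VARS = {
    "OpenAI": ["OPENAI_API_KEY"],
    "GCP_VertexAI": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "Azure_OpenAI": [
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_API_BASE",
        "AZURE_OPENAI_LLM_DEPLOYMENT",
        "AZURE_OPENAI_LLM_MODEL",
        "AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT",
        "AZURE_OPENAI_EMBEDDINGS_MODEL",
        "AZURE_OPENAI_API_VERSION",
    ],
}


def missing_env_vars(provider, env=None):
    """Return the required variables that are unset or empty for `provider`."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_ENV_VARS:
        raise ValueError("Unknown/unsupported LLM Provider")
    return [name for name in REQUIRED_ENV_VARS[provider] if not env.get(name)]
```

Calling `missing_env_vars(llmProvider)` after the secrets cell should return an empty list if every prompt was answered.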


In [None]:
# retrieve the text of a few documents that will be indexed in the vector store
! mkdir -p texts
! curl https://raw.githubusercontent.com/jerryjliu/llama_index/main/examples/paul_graham_essay/data/paul_graham_essay.txt --output texts/paul_graham_essay.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 75047  100 75047    0     0   366k      0 --:--:-- --:--:-- --:--:--  366k


### Colab preamble completed

The following cells constitute the demo notebook proper.

# Vector Similarity Search QA Quickstart

_**NOTE:** this uses Cassandra's "Vector Similarity Search" capability.
Make sure you are connecting to a vector-enabled database for this demo._
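
To give an intuition for what "similarity" means here, the following sketch computes cosine similarity between two vectors in plain Python. It assumes cosine similarity as the metric, which is one common choice; the database may be configured with a different metric and uses an index rather than comparing vectors one by one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(round(cosine_similarity([1.0, 0.0], [2.0, 0.0]), 6))  # 1.0 (same direction)
print(round(cosine_similarity([1.0, 0.0], [0.0, 3.0]), 6))  # 0.0 (orthogonal)
```

Text chunks whose embedding vectors point in nearly the same direction as the query's vector score close to 1 and are considered the most relevant.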

In [None]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader

The following line imports the Cassandra flavor of a LangChain vector store:

In [None]:
from langchain.vectorstores.cassandra import Cassandra

A database connection is needed to access Cassandra. The following assumes
that a _vector-search-capable Astra DB instance_ is available. Adjust as needed.

In [None]:
# creation of the DB connection
cqlMode = 'astra_db'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(138953527602832) 8322abed-c007-43de-9e8a-5dddb00a380a-us-east1.db.astra.datastax.com:29042:4826ba60-d274-4b3b-944f-51d42fe5ae61> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


Both an LLM and an embedding function are required.

Below is the logic to instantiate the LLM and embeddings of choice. We chose to leave it in the notebooks for clarity.

In [None]:
import os
# creation of the LLM resources


if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from Vertex AI')
elif llmProvider == 'OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'open_ai'
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
elif llmProvider == 'Azure_OpenAI':
    os.environ['OPENAI_API_TYPE'] = 'azure'
    os.environ['OPENAI_API_VERSION'] = os.environ['AZURE_OPENAI_API_VERSION']
    os.environ['OPENAI_API_BASE'] = os.environ['AZURE_OPENAI_API_BASE']
    os.environ['OPENAI_API_KEY'] = os.environ['AZURE_OPENAI_API_KEY']
    from langchain.llms import AzureOpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = AzureOpenAI(temperature=0, model_name=os.environ['AZURE_OPENAI_LLM_MODEL'],
                      engine=os.environ['AZURE_OPENAI_LLM_DEPLOYMENT'])
    myEmbedding = OpenAIEmbeddings(model=os.environ['AZURE_OPENAI_EMBEDDINGS_MODEL'],
                                   deployment=os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT'])
    print('LLM+embeddings from Azure OpenAI')
else:
    raise ValueError('Unknown LLM provider.')

LLM+embeddings from OpenAI


## A minimal example

The following is a minimal usage of the Cassandra vector store. The store is created and filled at once, and is then queried to retrieve relevant parts of the indexed text, which are then stuffed into a prompt finally used to answer a question.
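
The "stuffing" step mentioned above can be pictured with a small sketch: the retrieved chunks are joined into one context block and placed in front of the question. The function name `build_stuffed_prompt` and the template wording are assumptions made for illustration; LangChain uses its own prompt template internally.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuffed_prompt(
    "What did the author paint?",
    ["I liked painting still lives.", "I wrote lots of essays."],
))
```

The resulting string is what gets sent to the LLM, which is why the number and size of retrieved chunks must fit within the model's context window.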

The following creates an "index creator", which knows about the type of vector store, the embedding to use and how to preprocess the input text:

_(Note: stores built with different embedding functions will need different tables. This is why we append the `llmProvider` name to the table name in the next cell.)_

In [None]:
table_name = 'vs_test1_' + llmProvider

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Cassandra,
    embedding=myEmbedding,
    text_splitter=CharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=0,
    ),
    vectorstore_kwargs={
        'session': session,
        'keyspace': keyspace,
        'table_name': table_name,
    },
)
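
To make the `chunk_size` setting above more concrete, here is a simplified sketch of separator-based splitting: the text is cut at blank lines and the pieces are greedily packed into chunks of at most `chunk_size` characters. This is an illustration under those assumptions, not LangChain's `CharacterTextSplitter` implementation, which also handles overlap and other details.

```python
def split_text(text, chunk_size=400, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most `chunk_size` characters.

    A single piece longer than `chunk_size` is kept whole (it is not cut further).
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_text("aaaa\n\nbbbb\n\ncccc", chunk_size=10))  # ['aaaa\n\nbbbb', 'cccc']
```

Each resulting chunk is embedded and stored as one row, so a smaller `chunk_size` means more rows and more embedding calls.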

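Load a local text file. The first cell below targets Paul Graham's essay "How to Do Great Work" (http://paulgraham.com/greatwork.html), which this notebook does not download: save it as `texts/greatwork.txt` yourself before using that loader. The second cell targets the essay fetched earlier. Whichever cell runs last determines the `loader` used in the following steps.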

In [None]:
loader = TextLoader('texts/greatwork.txt', encoding='utf8')

In [None]:
loader = TextLoader('texts/paul_graham_essay.txt', encoding='utf8')

This takes a few seconds to run, as it must calculate embedding vectors for a number of chunks of the input text:

In [None]:
# Note: certain LLM providers need a workaround to evaluate batch embeddings
#       (as done below: subdocuments are embedded and added one at a time).
#       As of 2023-06-29, Azure OpenAI would error with:
#           "InvalidRequestError: Too many inputs. The max number of inputs is 1"
if llmProvider == 'Azure_OpenAI':
    from langchain.indexes.vectorstore import VectorStoreIndexWrapper
    docs = loader.load()
    subdocs = index_creator.text_splitter.split_documents(docs)
    #
    print(f'subdocument {0} ...', end=' ')
    vs = index_creator.vectorstore_cls.from_documents(
        subdocs[:1],
        index_creator.embedding,
        **index_creator.vectorstore_kwargs,
    )
    print('done.')
    for sdi, sd in enumerate(subdocs[1:]):
        print(f'subdocument {sdi+1} ...', end=' ')
        vs.add_texts(texts=[sd.page_content], metadatas=[sd.metadata])
        print('done.')
    #
    index = VectorStoreIndexWrapper(vectorstore=vs)

In [None]:
if llmProvider != 'Azure_OpenAI':
    index = index_creator.from_loaders([loader])
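
The Azure workaround above amounts to embedding with a batch size of 1. A generic version of that idea is sketched below: the inputs are sliced into small batches and the embedding function is called once per batch. The names `batched`, `embed_all` and `fake_embed` are hypothetical, and `fake_embed` is only a stand-in for a real embedding call.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(texts, embed_batch, batch_size=1):
    """Embed `texts` by calling `embed_batch` on one small batch at a time."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed(batch):
    """Stand-in embedding function: maps each text to a 1-dimensional vector."""
    return [[float(len(text))] for text in batch]


print(embed_all(["a", "bb", "ccc"], fake_embed))  # [[1.0], [2.0], [3.0]]
```

With `batch_size=1` no request ever carries more than one input, which is the limit quoted in the error message above.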



### Check what's on DB

By way of demonstration, if you were to directly read the rows stored in your database table, this is what you would now find there (not that you'll ever _have to_, for LangChain and CassIO provide an abstraction on top of that):

In [None]:
cqlSelect = f'SELECT * FROM {keyspace}.{table_name} LIMIT 5;'  # (Not a production-optimized query ...)
rows = session.execute(cqlSelect)
for row_i, row in enumerate(rows):
    print(f'\nRow {row_i}:')
    # depending on the cassIO version, the underlying Cassandra table can have different structure ...
    try:
        # you are using the new cassIO 0.1.0+ : congratulations :)
        print(f'    row_id:            {row.row_id}')
        print(f'    vector:            {str(row.vector)[:64]} ...')
        print(f'    body_blob:         {row.body_blob[:64]} ...')
        print(f'    metadata_s:        {row.metadata_s}')
    except AttributeError:
        # Please upgrade your cassIO to the latest version ...
        print(f'    document_id:      {row.document_id}')
        print(f'    embedding_vector: {str(row.embedding_vector)[:64]} ...')
        print(f'    document:         {row.document[:64]} ...')
        print(f'    metadata_blob:    {row.metadata_blob}')

print('\n...')


Row 0:
    row_id:            a09800c9527a44f2aa97e5525a939a07
    vector:            [-0.0021464480087161064, -0.010701625607907772, 0.00421805959194 ...
    body_blob:         Over the next several years I wrote lots of essays about all kin ...
    metadata_s:        {'source': 'texts/paul_graham_essay.txt'}

Row 1:
    row_id:            ab2b7b7ce93e4f5c92382d7f1d349953
    vector:            [-0.01360295433551073, -0.01684478111565113, 0.02700249850749969 ...
    body_blob:         It's not necessarily a bad sign if work is a struggle, any more  ...
    metadata_s:        {'source': 'texts/greatwork.txt'}

Row 2:
    row_id:            780cb6069fd34c4faa363c99e4510001
    vector:            [0.0032079373486340046, 0.01262128446251154, 0.00697820913046598 ...
    body_blob:         [18] The principles defining a religion have to be mistaken. Oth ...
    metadata_s:        {'source': 'texts/greatwork.txt'}

Row 3:
    row_id:            712357b9bac249828c893549475a3b01
    vector:  
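
The `SELECT` in the cell above interpolates the keyspace and table names directly into the query string, and the keyspace comes from user input. As a hedged sketch, the names can be validated first against the pattern for unquoted CQL identifiers (a letter followed by letters, digits or underscores). The helper `build_select` is hypothetical and does not check for reserved keywords.

```python
import re

# Unquoted CQL identifiers: a letter followed by letters, digits or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def build_select(keyspace, table_name, limit=5):
    """Build a SELECT statement after validating the identifiers and the limit."""
    for name in (keyspace, table_name):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a valid unquoted CQL identifier: {name!r}")
    if type(limit) is not int or limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {keyspace}.{table_name} LIMIT {limit};"


print(build_select("cassio_tutorials", "vs_test1_OpenAI"))
# SELECT * FROM cassio_tutorials.vs_test1_OpenAI LIMIT 5;
```

The returned string can be passed to `session.execute(...)` exactly as `cqlSelect` is in the cell above.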

### Ask a question, get an answer

In [None]:
query = "What are the qualities of doing great work?"
index.query(query, llm=llm)

" Doing great work means doing something important so well that you expand people's ideas of what's possible. It needs to have three qualities: it has to be something you have a natural aptitude for, that you have a deep interest in, and that offers scope to do great work."

## Spawning a "retriever" from the index

You just saw how easily you can plug a Cassandra-backed Vector Index into a full question-answering LangChain pipeline.

You can just as easily work at a slightly lower level: the following code spawns a `VectorStoreRetriever` from the index for manual [retrieval](https://python.langchain.com/en/latest/modules/indexes/retrievers.html) of documents related to a given query text. The results are instances of LangChain's `Document` class.

In [None]:
retriever = index.vectorstore.as_retriever(search_kwargs={
    'k': 2,
})
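
The `k` parameter above sets how many documents the retriever returns. The toy sketch below shows the idea with a handful of in-memory vectors: every entry is scored against the query and the `k` best are kept. `ToyDocument` and `top_k` are hypothetical stand-ins, the dot product is used as the score for simplicity, and a real vector store uses an index instead of scoring every row.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)


def top_k(query_vector, entries, k=2):
    """Return the `k` documents whose vectors have the largest dot product with the query.

    `entries` is a list of (vector, ToyDocument) pairs.
    """
    def score(entry):
        vector, _ = entry
        return sum(q * v for q, v in zip(query_vector, vector))

    return [doc for _, doc in heapq.nlargest(k, entries, key=score)]


entries = [
    ([1.0, 0.0], ToyDocument("about painting", {"source": "toy"})),
    ([0.0, 1.0], ToyDocument("about startups", {"source": "toy"})),
    ([0.7, 0.7], ToyDocument("about essays", {"source": "toy"})),
]
print([doc.page_content for doc in top_k([1.0, 0.0], entries, k=2)])
# ['about painting', 'about essays']
```

Raising `k` gives the LLM more context to draw from, at the cost of a longer prompt.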

In [None]:
retriever.get_relevant_documents(
    "What did the author like looking at?"
)

[Document(page_content='I liked painting still lives because I was curious about what I was seeing. In everyday life, we aren\'t consciously aware of much we\'re seeing. Most visual perception is handled by low-level processes that merely tell your brain "that\'s a water droplet" without telling you details like where the lightest and darkest points are, or "that\'s a bush" without telling you the shape and position of every leaf. This is a feature of brains, not a bug. In everyday life it would be distracting to notice every leaf on every bush. But when you have to paint something, you have to look more closely, and when you do there\'s a lot to see. You can still be noticing new things after days of trying to paint something people usually take for granted, just as you can after days of trying to write an essay about something people usually take for granted.', metadata={'source': 'texts/paul_graham_essay.txt'}),
 Document(page_content="One of the most conspicuous patterns I've notic

## What now?

This demo is hosted [here](https://cassio.org/frameworks/langchain/qa-basic/) at cassio.org.

Discover the other ways you can integrate
Cassandra/Astra DB with your ML/GenAI needs,
right **within [your favorite framework](https://cassio.org/frameworks/langchain/about/)**.