# Using a Vector Database
This notebook's implementation is based on the [LangChain tutorial](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/weaviate).

## Setting up and using a local vector database

Weaviate is an open-source vector database. It stores data objects together with vector embeddings produced by your ML models, and is designed to scale to billions of data objects.

This notebook shows how to use functionality related to the Weaviate vector database.

There are two examples in this notebook. It populates two database 'tables' (Weaviate indexes): one with content from a speech, and one from an empty file. We then run the same query against both tables and compare the results that come back.



### Set up your environment and start the Docker infrastructure

You can override the default environment variables in the configuration cell below.

To start the Docker containers, open a terminal window and run:

```sh
docker-compose up -d && docker-compose logs -f
```
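
`docker-compose up -d` can return before Weaviate is ready to accept requests, so it may help to check readiness before running the cells below. This is a minimal sketch, assuming the default `http://localhost:8080` address used later in this notebook and Weaviate's `/v1/.well-known/ready` readiness probe. `weaviate_is_ready` is a hypothetical helper, not part of the Weaviate client.

```python
import urllib.error
import urllib.request


def weaviate_is_ready(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Return True if the readiness probe answers with HTTP 200, False otherwise."""
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL.
        return False
```

Calling `weaviate_is_ready(WEAVIATE_URL)` after the configuration cell has run gives a quick yes/no answer without needing the Weaviate client installed.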

In [36]:
# Setup Azure OpenAI environment variables.
# ADD YOUR API KEY HERE if this has not been set in your environment.
# CHANGE API BASE if you are not using the FinTechX Azure OpenAI API endpoint.


from os import getenv

AZURE_OPENAI_API_KEY = getenv("AZURE_OPENAI_API_KEY") or '' #Your API Key
AZURE_OPENAI_API_BASE = getenv("AZURE_OPENAI_API_BASE") or 'https://fintechx-oai-eus.openai.azure.com/' 
AZURE_OPENAI_API_VERSION = getenv("AZURE_OPENAI_API_VERSION") or '2023-03-15-preview'
AZURE_OPENAI_API_TYPE = getenv("AZURE_OPENAI_API_TYPE") or 'azure'
AZURE_OPENAI_DEPLOYMENT_DAVINCI = getenv("AZURE_OPENAI_DEPLOYMENT_DAVINCI") or 'davinci'
AZURE_OPENAI_DEPLOYMENT_EMBEDDINGS = getenv("AZURE_OPENAI_DEPLOYMENT_EMBEDDINGS") or 'embeddings'
AZURE_OPENAI_MODEL_NAME_EMBEDDINGS = getenv("AZURE_OPENAI_MODEL_NAME_EMBEDDINGS") or 'text-embedding-ada-002'

# configure weaviate client
WEAVIATE_URL = getenv('WEAVIATE_URL') or 'http://localhost:8080'
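
Because the API key falls back to an empty string, a missing key would only surface later, when the first embeddings request fails. The sketch below shows one way to catch that up front. `missing_settings` is a hypothetical helper and the values are placeholders, not real credentials.

```python
def missing_settings(settings: dict) -> list:
    """Return the sorted names of settings whose value is empty or None."""
    return sorted(name for name, value in settings.items() if not value)


# Placeholder values for illustration only.
example = {
    "AZURE_OPENAI_API_KEY": "",
    "AZURE_OPENAI_API_BASE": "https://example.invalid/",
    "WEAVIATE_URL": "http://localhost:8080",
}
print(missing_settings(example))  # ['AZURE_OPENAI_API_KEY']
```

In the notebook, the same check could be applied to a dictionary built from the variables defined in the cell above.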


In [37]:
# install the weaviate client python package

! pip install weaviate-client



In [38]:
# configure the embeddings client with the Azure OpenAI settings

from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    deployment=AZURE_OPENAI_DEPLOYMENT_EMBEDDINGS,
    model=AZURE_OPENAI_MODEL_NAME_EMBEDDINGS,
    openai_api_base=AZURE_OPENAI_API_BASE,
    openai_api_type=AZURE_OPENAI_API_TYPE,
    openai_api_key=AZURE_OPENAI_API_KEY,
)


In [56]:
# Load data from file and split file into right-sized chunks 
# for embeddings processing

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

loader = TextLoader("./data/state_of_the_union.txt")
documents = loader.load()
docs = text_splitter.split_documents(documents)

empty_loader = TextLoader("./data/empty.txt")
empty_documents = empty_loader.load()
empty_docs = text_splitter.split_documents(empty_documents)


print(f'Number of records created from the speech document: {len(docs)}')
print(f'Number of records created from the empty document: {len(empty_docs)}')

Number of records created from the speech document: 42
Number of records created from the empty document: 1
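
To make the chunking step more concrete, here is a rough sketch of size-limited splitting, assuming pieces are separated by blank lines and packed greedily up to `chunk_size` characters. It is a simplified illustration of the idea, not LangChain's implementation, and `split_into_chunks` is a hypothetical name.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut mid-piece.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
        else:
            chunks.append(current)  # close the current chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("aaaa\n\nbbbb\n\ncccc", chunk_size=10))  # ['aaaa\n\nbbbb', 'cccc']
```

Smaller chunks give more precise matches at query time, while larger chunks keep more context together in each record.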


In [40]:
# create weaviate vector store

from langchain.vectorstores import Weaviate

print(f'connecting to {WEAVIATE_URL}')

for doc in docs:
    # print(doc)
    # from_documents expects a list as its first argument, so we need to wrap the doc in a list
    db_StateOfTheUnion_index = Weaviate.from_documents(
        [doc], 
        embeddings, 
        weaviate_url=WEAVIATE_URL,
        index_name='StateOfTheUnionIndex',
        by_text=False
    )


connecting to http://localhost:8080


In [58]:
for doc in empty_docs:
    db_empty_index = Weaviate.from_documents(
        [doc], 
        embeddings, 
        weaviate_url=WEAVIATE_URL,
        index_name='EmptyIndex',
        by_text=False
    )

In [62]:
# Basic queries on weaviate db
# the similarity search will return relevant vector database records ("documents") 
# based on the query

query = "What did the president say about anti-viral drugs?"
documents_found = db_StateOfTheUnion_index.similarity_search(query)
print(f'Number of documents found: {len(documents_found)}')
# print(documents_found)

documents_found_with_score = db_StateOfTheUnion_index.similarity_search_with_score(query, by_text=False)
# print(documents_found_with_score[0])
print(f'Number of documents found: {len(documents_found_with_score)}')

# running the same query on the empty index
documents_found_using_empty_index = db_empty_index.similarity_search(query)
print(f'Number of documents found in the empty index: {len(documents_found_using_empty_index)}')


Number of documents found: 4
Number of documents found: 4
Number of documents found in the empty index: 2
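
The counts above make more sense once you see what a similarity search does: it ranks stored vectors by closeness to the query vector and returns the top `k`, whether or not the closest records are actually relevant. The sketch below is a brute-force illustration, assuming cosine similarity and `k=4` (which matches the four results returned above). Real vector databases typically use approximate indexes instead, and `cosine_similarity` and `top_k` are hypothetical helpers with toy two-dimensional vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=4):
    """Return the k (text, score) pairs most similar to query_vector, best first.

    records is a list of (text, vector) pairs.
    """
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in records]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: real embeddings have hundreds or thousands of dimensions.
toy_records = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
best = top_k([1.0, 0.0], toy_records, k=2)
print([text for text, _ in best])  # ['a', 'c']
```

Because the ranking is relative, an index always returns its nearest records, which is why inspecting the scores from `similarity_search_with_score` is useful.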


In [67]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import AzureOpenAI


llm = AzureOpenAI(
    temperature=0.0,
    deployment_name=AZURE_OPENAI_DEPLOYMENT_DAVINCI,
    openai_api_base=AZURE_OPENAI_API_BASE,
    openai_api_key=AZURE_OPENAI_API_KEY,
    best_of=3,
)


chain_using_SOTU_index = RetrievalQAWithSourcesChain.from_chain_type(
    llm, chain_type="stuff", retriever=db_StateOfTheUnion_index.as_retriever()
)

chain_using_SOTU_index(
    {"question": "What did the president say about anti-viral drugs?"},
    return_only_outputs=False,
)


{'question': 'What did the president say about anti-viral drugs?',
 'answer': ' The president said that they have anti-viral treatments, such as the Pfizer pill, which reduces the chances of ending up in the hospital by 90%, and that they have launched the "Test to Treat" initiative so people can get tested at a pharmacy and receive antiviral pills on the spot at no cost.\n',
 'sources': './data/state_of_the_union.txt'}
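
The `"stuff"` chain type places every retrieved document into a single prompt alongside the question. The sketch below illustrates that idea only. The prompt wording is an assumption for illustration, not LangChain's actual template, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question, documents):
    """Build one prompt containing every retrieved document.

    documents is a list of (content, source) pairs.
    """
    context_blocks = [f"Content: {content}\nSource: {source}" for content, source in documents]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the extracts below and cite the sources you used. "
        "If the extracts do not contain the answer, say that you don't know.\n\n"
        f"{context}\n\n"
        f"QUESTION: {question}\n"
        "FINAL ANSWER:"
    )


# Placeholder documents for illustration only.
prompt = build_stuff_prompt(
    "What is covered?",
    [("First extract.", "a.txt"), ("Second extract.", "b.txt")],
)
print(prompt)
```

Since all retrieved text goes into one prompt, this approach works best when the retrieved chunks fit comfortably within the model's context window.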

In [68]:
# Running the same query on the empty index in Weaviate

chain_using_empty_index = RetrievalQAWithSourcesChain.from_chain_type(
    llm, chain_type="stuff", retriever=db_empty_index.as_retriever()
)

chain_using_empty_index(
    {"question": "What did the president say about anti-viral drugs?"},
    return_only_outputs=False,
)

{'question': 'What did the president say about anti-viral drugs?',
 'answer': " I don't know.\n",
 'sources': './data/empty.txt'}