<div id="singlestore-header" style="display: flex; background-color: rgba(235, 249, 245, 0.25); padding: 5px;">
    <div id="icon-image" style="width: 90px; height: 90px;">
        <img width="100%" height="100%" src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/database.png" />
    </div>
    <div id="text" style="padding: 5px; margin-left: 10px;">
        <div id="badge" style="display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%">SingleStore Notebooks</div>
        <h1 style="font-weight: 500; margin: 8px 0 0 4px;">A Deep Dive Into Vector Databases</h1>
    </div>
</div>

<div class="alert alert-block alert-warning">
    <b class="fa fa-solid fa-exclamation-circle"></b>
    <div>
        <p><b>Note</b></p>
        <p> This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to <tt>Start</tt> using the left nav. You can also use your existing Standard or Premium workspace with this Notebook. </p>
    </div>
</div>

**Author:** Ayush Pai, Georgia Tech
Run this notebook in Google Collab: https://colab.research.google.com/drive/1SJRg3UEANd7yKkhHGjlMD7fKkwegSkzW?usp=sharing

### What is a vector database?
- Vector databases are designed to store, index and retrieve data points with multiple dimensions, which are commonly referred to as vectors.
- Traditional databases handle data (like numbers and strings) organized in tables, while vector databases are specifically designed for managing data represented in multi-dimensional vector space.
- Highly suitable for AI and machine learning applications, where data often takes the form of vectors like image embeddings, text embeddings or other types of feature vectors.

### What are vector embeddings?
- Embeddings are numerical representations of real-world objects that AI systems use to understand complex knowledge domains in a similar fashion to humans.
- For language, embeddings represent the semantic meaning of words in relation with each other.





**Required Installations**

In [None]:
!pip install openai numpy pandas singlestoredb langchain==0.1.8 langchain-community==0.0.21 langchain-core==0.1.25 langchain-openai==0.0.6

## Vector Embedding Example

In this example, we demonstrate a rule based system that generates vector embeddings based on a word. The embedding that we generate contains 5 main features:
- Length of word
- Number of vowels in the word (normalized to the length of the word)
- Whether the word starts with a vowel (1) or not (0)
- Whether the word ends with a vowel (1) or not (0)
- Percentage of consonants in the word

This is a simple implementation of a **rule** based system to demonstrate the essence of what vector embedding models do. However, they utlize neural networks that are trained on vast datasets to learn key features and self-corrects using gradient descent.

In [1]:
def word_to_vector(word):
    # Define some basic rules for our vector components
    vector = [0] * 5  # Initialize a vector of 5 dimensions

    # Rule 1: Length of the word (normalized to a max of 10 characters for simplicity)
    vector[0] = len(word) / 10

    # Rule 2: Number of vowels in the word (normalized to the length of the word)
    vowels = 'aeiou'
    vector[1] = sum(1 for char in word if char in vowels) / len(word)

    # Rule 3: Whether the word starts with a vowel (1) or not (0)
    vector[2] = 1 if word[0] in vowels else 0

    # Rule 4: Whether the word ends with a vowel (1) or not (0)
    vector[3] = 1 if word[-1] in vowels else 0

    # Rule 5: Percentage of consonants in the word
    vector[4] = sum(1 for char in word if char not in vowels and char.isalpha()) / len(word)

    return vector

# Example usage
word = "example"
vector = word_to_vector(word)
print(f"Word: {word}\nVector: {vector}")


Word: example
Vector: [0.7, 0.42857142857142855, 1, 1, 0.5714285714285714]


## Vector Similarity Example

In this example, we demonstrate a way to determine the similarity between two vectors. There are many techniques to find the similiarity between two vectors but one of the most popular ways is using **cosine similarity**. Consine similarity is the the dot product between the two vectors divided by the product of the vector's normals (magnitudes).

This is just an example to show how vector databases search for similar vectors. The fundamental problem with a system like this is our rule-based embedding because it does not give us a semantic understanding of the word/sentences/paragraphs. Instead, it gives us a classification of a single word's structure.

In [17]:
import numpy as np

def cosine_similarity(vector_a, vector_b):
    # Calculate the dot product of vectors
    dot_product = np.dot(vector_a, vector_b)
    # Calculate the norm (magnitude) of each vector
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    # Calculate cosine similarity
    similarity = dot_product / (norm_a * norm_b)
    return similarity

# Example usage
word1 = "example"
word2 = "sample"
vector1 = word_to_vector(word1)
vector2 = word_to_vector(word2)

# Calculate and print cosine similarity
similarity_score = cosine_similarity(vector1, vector2)
print(f"Cosine similarity between '{word1}' and '{word2}': {similarity_score}")

Cosine similarity between 'example' and 'sample': 0.8108320947993962


## Embedding Models

In order to generate semantic understanding of language within vectors, embedding models are required. Embedding models are trained on vast corpus of language data. Training embedding models starts by initializing word embeddings with random vectors. Each word in the vocabulary is assigned a vector of real numbers. They use neural networks trained on large datasets to predict a word from its context (Continuous Bag of Words model) or to predict the context given a word (Skip-Gram model). During training, the model adjusts the word vectors to minimize some loss function, often related to the likelihood of observing a word given its context (or vice versa) through gradient descent.

Examples of embedding models include Word2Vec, GloVe, BERT, OpenAI text-embedding.


In [21]:
from google.colab import userdata
OPENAI_KEY = userdata.get('openai')

from openai import OpenAI
client = OpenAI(api_key=OPENAI_KEY)

def openAIEmbeddings(input):
  response = client.embeddings.create(
      input="input",
      model="text-embedding-3-small"
  )
  return response.data[0].embedding

print(openAIEmbeddings("Golden Retreiver"))

[0.021182004362344742, 0.028040677309036255, 0.002601268468424678, -0.04792807251214981, 0.03217240795493126, -0.05652207136154175, 0.004018107894808054, 0.020892782136797905, -0.02341313846409321, -0.006104631349444389, 0.04492568224668503, 0.014846684411168098, -0.021085597574710846, -0.02345445565879345, 0.02131972834467888, -0.017986798658967018, 0.020217934623360634, -0.013999679125845432, 0.0007320909644477069, 0.04555921256542206, 0.05663225054740906, 0.02087900973856449, 0.03341192752122879, -0.005643255077302456, 0.005898044910281897, -0.024941878393292427, -0.011493096128106117, 0.030354445800185204, 0.026470618322491646, -0.03991251438856125, -0.028564028441905975, -0.0387556292116642, 0.016361651942133904, -0.025974811986088753, 0.03894844278693199, -0.006059871055185795, -0.012395191006362438, 0.023206552490592003, -0.024514934048056602, -0.021939488127827644, -0.028591573238372803, -0.021650267764925957, 0.009923039004206657, 0.024335891008377075, -0.01329728588461876, 0.

As you can see, this is a huge vector! Over 1000 dimensions just in this one vector. This is why it is important for us to have good dimensionality reduction techniques during the similarity searches.

## Creating a vector database with SingleStoreDB

In the following code we create a vector datbase with SingleStoreDB. We utilize Langchain to chunk and split the raw text into documents and use the OpenAI embeddings model to generate the vector embeddings. We then take the raw documents and embeddings and create a table with the columns "docs" and "embeddings".

To test this out, we perform a similarity search based on a query and it returns the most similar document in the vector database.

In [54]:
import openai
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.singlestoredb import SingleStoreDB
from openai import OpenAI
import os
from google.colab import userdata
import pandas as pd

os.environ["SINGLESTOREDB_URL"] = "ayush:Test1234@svc-3482219c-a389-4079-b18b-d50662524e8a-shared-dml.aws-virginia-6.svc.singlestore.com:3333/database_79fb0"


# Load and process documents
loader = TextLoader("michael_jackson.txt") # use your own document

documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Generate embeddings and create a document search database
embeddings = OpenAIEmbeddings(api_key=userdata.get('openai'))

# Create Vector Database
vector_database = SingleStoreDB.from_documents(docs, embeddings, table_name="mjackson") # create your own table

query = "How old was Michael Jackson when he died?"
docs = vector_database.similarity_search(query)
print(docs[0].page_content)


Death
Main article: Death of Michael Jackson
Jackson's Star with flowers and notes on it
Fans placed flowers and notes on Jackson's star on the Hollywood Walk of Fame on the day of his death
On June 25, 2009, less than three weeks before his concert residency was due to begin in London, with all concerts sold out, Jackson died from cardiac arrest, caused by a propofol and benzodiazepine overdose.[307][308] Conrad Murray, his personal physician, had given Jackson various medications to help him sleep at his rented mansion in Holmby Hills, Los Angeles. Paramedics received a 911 call at 12:22 pm Pacific time (19:22 UTC) and arrived three minutes later.[309][310] Jackson was not breathing and CPR was performed.[311] Resuscitation efforts continued en route to Ronald Reagan UCLA Medical Center, and for more than an hour after Jackson's arrival there, but were unsuccessful,[312][313] and Jackson was pronounced dead at 2:26 pm Pacific time (21:26 UTC).[314][315]

Murray had administered propo

## Retrieval Augmented Generation System

RAG combines large language models with a retrieval mechanism to search a database for relevant information before generating responses. It utilizes real-world data from retrieved documents to ground responses, enhancing factual accuracy and reducing hallucinations. Documents are vectorized using embeddings and stored in a vector database for efficient retrieval. SingleStoreDB serves as a great vector database. The user query is converted into a vector, and a vector search is performed in the database to find documents relevant to that specific query. The system returns the documents with the highest relevance scores, which are then fed to the chatbot for generating informed responses.





In [59]:
import os
import openai
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.singlestoredb import SingleStoreDB
from openai import OpenAI

# Set up API keys and database URL

os.environ["SINGLESTOREDB_URL"] = "<username:password@url:port/database_name>"
client = OpenAI(api_key=userdata.get('openai'))

# Load and process documents
loader = TextLoader("michael_jackson.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Generate embeddings and create a document search database
embeddings = OpenAIEmbeddings(api_key=userdata.get('openai'))
docsearch = SingleStoreDB.from_documents(docs, embeddings, table_name="mjackson")

# Chat loop
while True:
    # Get user input
    user_query = input("\nYou: ")

    # Check for exit command
    if user_query.lower() in ['quit', 'exit']:
        print("Exiting chatbot.")
        break

    # Perform similarity search
    docs = docsearch.similarity_search(user_query)
    if docs:
        context = docs[0].page_content

        # Generate response using OpenAI GPT-4
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Context: " + context},
                {"role": "user", "content": user_query}
            ],
            stream=True,
            max_tokens=500,
        )

        # Output the response
        print("AI: ", end="")
        for chunk in response:
            if chunk.choices[0].delta.content is not None:
                print(chunk.choices[0].delta.content, end="")

    else:
        print("AI: Sorry, I couldn't find relevant information.")


<div id="singlestore-footer" style="background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px"></div>
<div><img src="https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png" style="padding: 0px; margin: 0px; height: 24px"/></div>