## Vector Stores

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: November 18, 2023

---

### SOURCES: 

- [Vector Stores](https://js.langchain.com/docs/modules/data_connection/vectorstores/)

- [What is a Vector Database?](https://www.pinecone.io/learn/vector-database/)

- [What are vector embeddings](https://www.pinecone.io/learn/vector-embeddings-for-developers/)

- [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)

- [RAG pattern](https://vitalflux.com/retrieval-augmented-generation-rag-llm-examples/)

### OBJECTIVES

- Explain use cases for vector embeddings
- Explain the value of a vector store
- Provide one example of how ANN works

### CONCEPTS

- Vector embedding
- Similarity of embeddings
- Vector database
- Approximate Nearest Neighbor (ANN)
- Projection Matrix

---

### Background on Vector Embeddings


Early Natural Language Processing (NLP) classifiers used presence or count of words in documents as predictors.

Keywords on their own lose the context in which they appear, and that context is often important.
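To make the loss of context concrete, here is a minimal sketch of a count-based representation. The two sentences and the helper name `count_vector` are illustrative assumptions, not taken from the sources above.

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Return word counts for `text`, in the order given by `vocabulary`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]
a = count_vector("dog bites man", vocabulary)
b = count_vector("man bites dog", vocabulary)

print(a)       # [1, 1, 1]
print(b)       # [1, 1, 1]
print(a == b)  # True: the two sentences differ in meaning but not in counts
```

Both sentences map to the same predictor values, so a classifier built on counts alone cannot tell them apart.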

A large leap forward came from using vector representations (**embeddings**) of documents.

The embeddings are vectors of fixed size like 64 or 128; values are floats.

All kinds of media are now embedded: documents, videos, images, etc.

Word embedding examples:

<img src="./embed_examples.png" width=300>

Sometimes the elements are interpretable. 

In this example, the elements for dog, puppy, and cat have similar sign and direction in some columns.

---

### Objects as Vectors

In the figure below, each object is projected into 2D space as a vector.  
There is a notion of object similarity which can be measured by distance between points.  
The light blue objects (represented as points) are more similar to each other than to the other objects.

<img src="./vector_space.png" width=300>

### Similarity

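Two embeddings can be compared using a similarity score like [*cosine similarity*](https://en.wikipedia.org/wiki/Cosine_similarity), the cosine of the angle between the two vectors. It ranges from -1 (opposite direction) to 1 (same direction).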

<img src="./cosine_sim.png" width=300>
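A minimal sketch of cosine similarity in plain Python. The three vectors are made-up 3-dimensional values chosen for illustration; real embeddings come from a trained model and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|).

    Assumes both vectors are non-zero and have the same length.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example.
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
car = [-0.7, 0.1, 0.9]

print(cosine_similarity(dog, puppy))  # close to 1
print(cosine_similarity(dog, car))    # negative
```

Because the score depends only on the angle, scaling a vector by a positive constant does not change its similarity to other vectors.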


### Training and Use

Embeddings are typically formed by training a neural network on the data and taking a hidden layer, often the last one.

This layer is a vector which encodes rich information about the input.

One of the earliest models was [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)

After objects are represented as vectors, they can be stored and reused later.


The flow looks like this:

```
raw data -> embedding model -> vector embedding
```
---

### Storage

The vectors can be stored in a traditional database (relational or NoSQL)...  
...but specialized databases have emerged to efficiently store, compare, and search on embeddings.

These are called **vector stores** or **vector databases**  

Examples:

- [Pinecone](https://www.pinecone.io/)
- [OpenSearch](https://aws.amazon.com/opensearch-service/)

We will look at a Pinecone demo in this module

**Benefits of a vector database**

- Optimized storage and querying

- Vector databases can store metadata associated with each vector entry. Users can then query the database using additional metadata filters for finer-grained queries.

- Real-time data updates

- Designed to scale with growing data volumes and user demands, with better support for distributed and parallel processing

- Easier integration with other components of a data processing ecosystem, such as ETL pipelines (like Spark) and analytics tools
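The metadata filtering mentioned in the list above can be sketched with a toy in-memory store. The class `ToyVectorStore` and its `upsert` and `query` methods are hypothetical names for this example and are not the API of Pinecone or OpenSearch. The sketch also compares the query to every stored vector, which a real vector database would avoid.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

class ToyVectorStore:
    """Hypothetical in-memory store mapping id -> (vector, metadata)."""

    def __init__(self):
        self.items = {}

    def upsert(self, item_id, vector, metadata):
        # Insert a new entry or overwrite an existing one.
        self.items[item_id] = (vector, metadata)

    def query(self, vector, top_k=2, where=None):
        """Return the top_k (id, score) pairs whose metadata matches `where`."""
        where = where or {}
        scored = []
        for item_id, (vec, meta) in self.items.items():
            if all(meta.get(key) == value for key, value in where.items()):
                scored.append((item_id, cosine_similarity(vector, vec)))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

store = ToyVectorStore()
store.upsert("a", [1.0, 0.0], {"type": "article", "year": 2023})
store.upsert("b", [0.9, 0.1], {"type": "video", "year": 2023})
store.upsert("c", [0.0, 1.0], {"type": "article", "year": 2021})

results = store.query([1.0, 0.0], top_k=2, where={"type": "article"})
print([item_id for item_id, _ in results])  # ['a', 'c']
```

Entry `b` is close to the query but is excluded because its metadata does not match the filter.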

---

### Vector Databases use Approximate Search

In vector databases, we apply a similarity metric to find the vectors that are most similar to our query.

It can be too costly to compare the query vector to each content vector, so approximations are often made.

A vector database uses a combination of different algorithms to implement *Approximate Nearest Neighbor (ANN)* search.  
Algorithms use techniques like projection, hashing, quantization, or graph-based search.  
See [here](https://www.pinecone.io/learn/vector-database/) for details.

A vector database indexes its vectors using an algorithm such as *Random Projection*.

**Random Projection**

Based on this important lemma:  
[*Johnson–Lindenstrauss lemma*](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma):  
A set of points in a high-dimensional space can be embedded into a space of much lower dimension  
while nearly preserving the distances between points.

This can be done by multiplying the high-dimensional dataset (n points by d dimensions) by a random d-by-k matrix, where k is much smaller than d.

From [Bingham and Mannila](https://cs-people.bu.edu/evimaria/cs565/kdd-rp.pdf)


<img src='random_proj.png'>


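Because the dimensionality of the data is reduced, searching the projected vectors is significantly faster than searching in the original high-dimensional space.

Computing a random projection is also much cheaper than computing PCA, because the projection matrix is generated randomly rather than derived from the data.

Projection quality depends on properties of the projection matrix, such as the target dimension and the distribution of its entries.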
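A minimal sketch of random projection in plain Python, assuming a Gaussian random matrix (one common choice; other distributions are also used). The dimensions `d = 500` and `k = 100`, the number of points, and the helper names are illustrative assumptions.

```python
import math
import random

def random_projection_matrix(d, k, rng):
    """d x k matrix with i.i.d. N(0, 1/k) entries.

    The 1/sqrt(k) scaling preserves squared lengths in expectation.
    """
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]

def project(x, R):
    """Multiply a length-d vector x by the d x k matrix R, giving a length-k vector."""
    k = len(R[0])
    return [sum(x[i] * R[i][j] for i in range(len(x))) for j in range(k)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)  # fixed seed so the run is repeatable
d, k = 500, 100         # original and reduced dimensions (illustrative)
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(5)]

R = random_projection_matrix(d, k, rng)
projected = [project(p, R) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    euclidean(projected[i], projected[j]) / euclidean(points[i], points[j])
    for i in range(len(points))
    for j in range(i + 1, len(points))
]
print(min(ratios), max(ratios))  # both should be reasonably close to 1
```

Ratios near 1 mean pairwise distances are nearly preserved, so a nearest-neighbor search in the 100-dimensional space approximates one in the original 500-dimensional space. Increasing `k` tightens the ratios at the cost of more computation per comparison.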

---

### Vector Databases Need to Provide High Performance and Fault Tolerance

As the amount of data increases, more nodes are required.

More nodes mean a higher risk that some node fails.

To ensure high performance and fault tolerance, vector databases use sharding and replication.

*Sharding* - partitioning data across multiple nodes. Similar vectors might be stored in the same partition.

*Replication* - creating multiple copies of data across different nodes
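A minimal sketch of how sharding and replication can fit together, assuming simple hash-based placement. The function names, the node count, and the replication factor are hypothetical. Note that hashing an id does not co-locate similar vectors; a store that wants that would partition by vector content instead.

```python
import hashlib

def shard_for(vector_id, num_shards):
    """Map an id to a shard using a stable hash.

    Python's built-in hash() for strings is salted per process, so it is avoided here.
    """
    digest = hashlib.sha256(vector_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def replica_nodes(shard, num_nodes, replication_factor):
    """Place a shard's copies on consecutive nodes, wrapping around at the end."""
    return [(shard + offset) % num_nodes for offset in range(replication_factor)]

num_nodes = 4
for vector_id in ["doc-1", "doc-2", "doc-3"]:
    shard = shard_for(vector_id, num_shards=num_nodes)
    nodes = replica_nodes(shard, num_nodes, replication_factor=2)
    print(vector_id, "-> shard", shard, "stored on nodes", nodes)
```

With a replication factor of 2, each shard survives the failure of any single node, and reads for that shard can be spread across both copies.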

---

### Vector Embedding Use Cases

Some important use cases are:

- **Search** - Embeddings can represent deeper attributes of an object than keywords can.  
  They can be much more effective at returning good search results. A typical flow:  
  - The user query is embedded (using a specific embedding model)
  - Each piece of content was embedded earlier (using that same embedding model)
  - Similarity between the query embedding and each content embedding is calculated
  - The highest-scoring matches are selected
  - Any relevant filters are applied
  - The top results are returned
  
- **Question answering** - better search allows for better ability to answer questions  
  
  
- **Recommendation** - better search allows for more relevant recommendations  
   Example: Given attribute information for users and items, recommend items with similar vectors
  
- **Generative AI** - GenAI models can produce new content, and they can power chatbots.  
  One major risk is *hallucination*: if a request covers material the model wasn't sufficiently trained on, it may return plausible-sounding nonsense.  
  A popular approach now is *RAG* (retrieval-augmented generation).   
  
  
RAG does this:
  - Embed the query
  - Search for the most similar content embeddings
  - Include the matching content in the user prompt: "Based on the content below, tell me about neural networks."  
  - The large language model (LLM) constructs and returns a result based on prompt + context
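The retrieval and prompt-building steps above can be sketched as follows. The function `toy_embed` is a stand-in for a real embedding model (it only counts words over a fixed vocabulary), the two documents are invented, and the final call to an LLM is left out, so this is an illustration of the flow under those assumptions rather than a working RAG system.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, drop simple punctuation, split on whitespace."""
    return text.lower().replace(".", "").replace("?", "").split()

def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

documents = [
    "Neural networks are built from layers of connected units.",
    "Sharding partitions data across multiple nodes.",
]
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})

# Content is embedded ahead of time with the same model used for queries.
doc_vectors = [toy_embed(doc, vocabulary) for doc in documents]

# Embed the query and search for the most similar content embedding.
question = "Tell me about neural networks."
query_vector = toy_embed(question, vocabulary)
scores = [cosine_similarity(query_vector, vec) for vec in doc_vectors]
best = documents[scores.index(max(scores))]

# Include the matching content in the prompt that would be sent to the LLM.
prompt = (
    "Based on the content below, tell me about neural networks.\n\n"
    f"Content:\n{best}"
)
print(prompt)
```

In a real pipeline the exhaustive scoring loop would be replaced by a query to the vector database, and `prompt` would be passed to the LLM.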
  
**RAG Architecture Diagram**

<img src="./rag.png" width=600>

---

### Conclusions

The use cases for vector embeddings continue to expand in exciting ways

Generative AI has massively increased the use of, and interest in, vector databases

Some common patterns like RAG have emerged

These patterns bring together different components like storage and deep learning

---