## Text embeddings

The `Embeddings` class is designed for interfacing with text embedding models. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.), and this class provides a standard interface for all of them.

Embeddings create a vector representation of a piece of text. This lets us reason about text in a vector space and do things like semantic search, where we look for the pieces of text whose vectors are most similar to the vector of a query.
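
To make "most similar in the vector space" concrete, here is a minimal sketch that ranks texts by cosine similarity, one common similarity measure. The 3-dimensional vectors are hand-made placeholders chosen for illustration, not output from a real embedding model, and the query vector is an assumed embedding of a question about fruit.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors standing in for real embeddings.
corpus = {
    "I like to eat bananas.": [0.9, 0.1, 0.0],
    "The dog ate my homework.": [0.1, 0.9, 0.1],
    "It is sunny outside.": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # assumed embedding of a fruit-related question

# Rank the texts from most to least similar to the query vector.
ranked = sorted(
    corpus,
    key=lambda text: cosine_similarity(corpus[text], query_vector),
    reverse=True,
)
best_match = ranked[0]
print(best_match)
```

Real embedding vectors have hundreds or thousands of dimensions, but the ranking step works the same way.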

The base `Embeddings` class in LangChain provides two methods: `embed_documents` for embedding documents and `embed_query` for embedding a query. The former takes multiple texts as input, while the latter takes a single text. They are separate methods because some embedding providers embed documents (the texts to be searched over) differently from queries (the search query itself).
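
The shape of this two-method interface can be sketched with a toy class. `LetterCountEmbeddings` is a hypothetical stand-in written for this example, not a LangChain class, and its letter-count "embedding" is only there to keep the code dependency-free.

```python
from typing import List


class LetterCountEmbeddings:
    """Toy stand-in (not a LangChain class) exposing the same two methods."""

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def _embed(self, text: str) -> List[float]:
        # One dimension per letter: how often that letter occurs in the text.
        lowered = text.lower()
        return [float(lowered.count(letter)) for letter in self.ALPHABET]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Many texts in, one vector per text out.
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        # One text in, one vector out. A real provider may treat queries
        # differently from documents; this toy version does not.
        return self._embed(text)


toy_model = LetterCountEmbeddings()
doc_vectors = toy_model.embed_documents(
    ["I like to eat bananas.", "It is sunny outside."]
)
query_vector = toy_model.embed_query("bananas")
print(len(doc_vectors), len(doc_vectors[0]), len(query_vector))
```

Because both methods return vectors of the same length, a query vector can be compared directly against any document vector.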

## Text embeddings model

Let's start by initializing our embeddings model. 

In [1]:
from langchain.embeddings import OpenAIEmbeddings

# By default, the API key is read from the OPENAI_API_KEY environment variable
embeddings_model = OpenAIEmbeddings()

You can embed a list of texts with a single call to `embed_documents`.

In [2]:
texts = [
    "I like to eat bananas.",
    "The dog ate my homework.",
    "It is sunny outside.",
]

embeddings = embeddings_model.embed_documents(texts)

In [3]:
print(f"Dimensions: {len(embeddings[0])}")
print(f"First 10 values: {embeddings[0][:10]}")

Dimensions: 1536
First 10 values: [-0.014765617668302448, -0.02597355510632489, 0.014292918368914687, 0.003741162189046819, 0.006266368650924281, -0.012140895429651924, -0.01944284730227332, -0.028685352281771787, -0.0017337472216025901, -0.011425627573981148]


You can also embed a single text with `embed_query`.

In [4]:
embedded_query = embeddings_model.embed_query(
    "What was the name mentioned in the conversation?"
)
embedded_query[:5]

[0.005356039053113689,
 -0.000552351843204413,
 0.03886180874395785,
 -0.002978327498861753,
 -0.008890430821354354]

## Vector stores

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors. At query time, you embed the unstructured query and retrieve the stored vectors that are "most similar" to the embedded query. A vector store takes care of storing the embedded data and performing the vector search for you.
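
The store-then-search cycle described above can be sketched in a few lines. `ToyVectorStore`, `toy_embed`, and `VOCABULARY` are hypothetical names invented for this example. The method names mirror the ones used later in this notebook, but this is a brute-force illustration under those assumptions, not how Chroma is implemented.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical fixed vocabulary; a real embedding model learns its own features.
VOCABULARY = ["banana", "dog", "homework", "sunny", "eat", "outside"]


def toy_embed(text: str) -> List[float]:
    """Count how often each vocabulary term appears in the text."""
    lowered = text.lower()
    return [float(lowered.count(term)) for term in VOCABULARY]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat all-zero vectors as dissimilar


class ToyVectorStore:
    """In-memory store that ranks (text, vector) pairs by cosine similarity."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self._embed = embed
        self._entries: List[Tuple[str, List[float]]] = []

    def add_texts(self, texts: List[str]) -> None:
        # Embed each text once, at indexing time, and keep the vector.
        for text in texts:
            self._entries.append((text, self._embed(text)))

    def similarity_search_by_vector(self, vector: List[float], k: int = 1) -> List[str]:
        # Brute force: score every stored vector against the query vector.
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(entry[1], vector),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

    def similarity_search(self, query: str, k: int = 1) -> List[str]:
        # Embed the query, then reuse the vector search.
        return self.similarity_search_by_vector(self._embed(query), k)


store = ToyVectorStore(toy_embed)
store.add_texts(
    ["I like to eat bananas.", "The dog ate my homework.", "It is sunny outside."]
)
results = store.similarity_search("Who ate the homework?")
print(results[0])
```

Production vector stores typically replace the brute-force scan with an approximate nearest-neighbor index, but the interface stays the same: add embedded data, then search by query or by vector.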

We will use Chroma, a vector database that runs on your local machine as a library.

In [1]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# Load the raw text from a PDF file
documents = PyPDFLoader("./data/drr-qa-system.pdf").load()

# Split the text into chunks of at most 500 characters, with no overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
    # length_function = len,
    # is_separator_regex = False,
    # separators=[" ", "\n", "\n\n"],
)
documents = text_splitter.split_documents(documents)

# Create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(
    model_name="all-MiniLM-L6-v2",
)

# Store the embeddings in a vector store
db = Chroma.from_documents(documents, embedding_function)

In [2]:
print(documents[0].page_content)

Developing A Question-Answering System for
Community-Centric Disaster Preparedness
Rixdon Niño R. Mape
Project Description
In times of disaster, the availability of precise and locally-tailored information can be the difference
between safety and peril. One fundamental problem communities face is obtaining up-to-the-minute,
relevant information specific to their needs. Traditional question-answering systems excel at drawing
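
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a deliberately simplified splitter. The `split_text` function below is a hypothetical helper that cuts purely by character position. `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, which this sketch ignores.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


no_overlap = split_text("abcdefghij", chunk_size=4)
with_overlap = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With overlap, neighboring chunks share characters, so a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk.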


You can search for similar documents using a query.

In [4]:
query = "What is hallucination?"
docs = db.similarity_search(query)
print(docs[0].page_content)

appropriate information to users. These LLMs are akin to sophisticated software that can compre-
hend human language and generate responses as a person might. However, LLMs, despite their in-
telligence, are prone to a critical limitation—they may generate plausible but false information when
faced with queries beyond their training scope, a phenomenon known as ‘hallucination’ . In the context
of disaster response, such misinformation could lead to dire consequences.


You can also search for documents similar to a given embedding vector with `similarity_search_by_vector`, which accepts an embedding vector as a parameter instead of a string.

In [5]:
embedding_vector = embedding_function.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

appropriate information to users. These LLMs are akin to sophisticated software that can compre-
hend human language and generate responses as a person might. However, LLMs, despite their in-
telligence, are prone to a critical limitation—they may generate plausible but false information when
faced with queries beyond their training scope, a phenomenon known as ‘hallucination’ . In the context
of disaster response, such misinformation could lead to dire consequences.


**TODO**: Add a section on passing a Chroma client into LangChain

**TODO**: Add a section on filtering results by metadata