# LangChain and pgvector: Up and Running
[LangChain](https://langchain.com) is one of the most popular frameworks for building LLM applications and agents. This notebook is an introduction to building LLM applications with the LangChain framework, using PostgreSQL and pgvector as a vector database for embeddings data.

We'll use the example of creating a chatbot to answer questions about the blog posts from the Timescale blog to illustrate the following concepts:
- How to prepare your documents for insertion into PostgreSQL and pgvector using LangChain's TextSplitter document transformer.
- How to create embeddings from your data using the OpenAI embeddings model and insert them into PostgreSQL and pgvector.
- How to use embeddings retrieved from a vector database to augment LLM generation. 

This is a great first step for more advanced LangChain projects in Python -- for example, creating a chatbot for your company documentation or an application to answer questions from uploaded PDFs.

Let's get started!

## Setup and Configuration
- Sign up for an OpenAI Developer Account and create an API Key. See [OpenAI's developer platform](https://platform.openai.com/overview).
- Install Python.
- Install and configure a Python virtual environment. We recommend [Pyenv](https://github.com/pyenv/pyenv).
- Install the requirements for this notebook using the following command:

```
pip install -r requirements.txt
```

Or if you already have langchain installed, run: 
```pip install --upgrade langchain```

Protip: Add your OpenAI API Key to the provided shell script (named ```setenv.sh```) and run it to set your environment variables.

In [1]:

pip install --upgrade langchain

Note: you may need to restart the kernel to use updated packages.


In [2]:
import openai
import os
import pandas as pd
import numpy as np
import json
import tiktoken
import psycopg2
import ast
import pgvector
import math
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector


In [3]:
import os
import openai
# Run export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY...
# Get openAI api key by reading local .env file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
OPENAI_API_KEY  = os.environ['OPENAI_API_KEY']

Next, we need a way for LangChain to interact with PostgreSQL and pgvector. This is achieved by importing the PGVector class from the langchain.vectorstores package as follows.

In [3]:
from langchain.vectorstores.pgvector import PGVector

Next we'll construct our connection string for LangChain to connect to our PostgreSQL database.

Because LangChain uses SQLAlchemy to connect to SQL databases like PostgreSQL, we need to create our connection string programmatically, reading each of the components of the string (host, database name, password, port, etc.) from our environment variables.

In this example, we'll use a PostgreSQL database with pgvector installed that's hosted on Timescale. You can create your own cloud PostgreSQL database in minutes [at this link](https://console.cloud.timescale.com/signup) to follow along. If you're using a Timescale database, you can find all this information in the "Cheat Sheet" file you download when you first create your new database service.

Alternatively, you can also use a local PostgreSQL database if you prefer.

Protip: Add your database credentials to the provided shell script (named ```setenv.sh```)  and run it to set your environment variables.

In [9]:
# Build the PGVector Connection String from params
import os
# Found in the credential cheat-sheet or "Connection Info" in the Timescale console
# In terminal, run: export VAR_NAME=value for each of the values below
host= os.environ['TIMESCALE_HOST']
port= os.environ['TIMESCALE_PORT']
user= os.environ['TIMESCALE_USER']
password= os.environ['TIMESCALE_PASSWORD']
dbname= os.environ['TIMESCALE_DBNAME']

# We use postgresql rather than postgres in the conn string since LangChain uses sqlalchemy under the hood
# Append ?sslmode=require to the string below if your database requires SSL connections
CONNECTION_STRING = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"

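Indexing `os.environ` directly raises a `KeyError` as soon as one variable is unset, and a password containing characters such as `@` or `/` can break the URL. The sketch below is one way to guard against both. The helper name `build_connection_string` and the example values are hypothetical and not part of the original notebook; `quote_plus` comes from the Python standard library.

```python
import os
from urllib.parse import quote_plus

# The same variable names used in the cell above
REQUIRED_VARS = (
    "TIMESCALE_HOST",
    "TIMESCALE_PORT",
    "TIMESCALE_USER",
    "TIMESCALE_PASSWORD",
    "TIMESCALE_DBNAME",
)


def build_connection_string(env=None):
    """Build a SQLAlchemy-style PostgreSQL URL (hypothetical helper)."""
    env = os.environ if env is None else env
    # Report every missing variable at once instead of failing on the first
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Percent-encode the user and password so special characters stay valid in a URL
    user = quote_plus(env["TIMESCALE_USER"])
    password = quote_plus(env["TIMESCALE_PASSWORD"])
    host = env["TIMESCALE_HOST"]
    port = env["TIMESCALE_PORT"]
    dbname = env["TIMESCALE_DBNAME"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"


# Made-up values, for illustration only
example_env = {
    "TIMESCALE_HOST": "db.example.com",
    "TIMESCALE_PORT": "5432",
    "TIMESCALE_USER": "tsdbadmin",
    "TIMESCALE_PASSWORD": "p@ss/word",
    "TIMESCALE_DBNAME": "tsdb",
}
print(build_connection_string(example_env))
```

Calling `build_connection_string()` with no argument reads from the real environment, so it could stand in for the manual lookups above.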

## Part 1: Split a CSV file into smaller chunks while preserving associated metadata
In this section, we will parse our CSV file into smaller chunks for similarity search and retrieval, with help from LangChain's TokenTextSplitter.

First, let's take a look at the CSV file we'll be working with:

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('blog_posts_data.csv')
df.head()

| | title | content | url |
|---|---|---|---|
| 0 | How to Build a Weather Station With Elixir, Ne... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-to-build-a-... |
| 1 | CloudQuery on Using PostgreSQL for Cloud Asset... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/cloudquery-on-u... |
| 2 | How a Data Scientist Is Building a Time-Series... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-a-data-scie... |
| 3 | How Conserv Safeguards History: Building an En... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-conserv-saf... |
| 4 | How Messari Uses Data to Open the Cryptoeconom... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-messari-use... |


As shown above, this is a CSV file of blog posts about Timescale use cases. 

Ordinarily, we would use the LangChain [CSVLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/csv) to load the contents of a CSV file. In this case, however, we need to pre-process the content column so that each piece of text we embed stays within the token limit of the OpenAI embeddings API.

We also need a way to split the text of the content column while retaining the metadata associated with that text (i.e., the blog title and URL).

LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

We'll use LangChain's [Token Text Splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token) to help us split up the content column of our CSV into chunks of a specified token amount. You can alternatively use the [Recursive Character Text Splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter) if you'd rather split text by number of characters than by tokens.

We will split the text into chunks of around 512 tokens, with an overlap of 20% (103 tokens) between consecutive chunks.

In [2]:
import tiktoken
from langchain.text_splitter import TokenTextSplitter

# We need to split the text into chunks of 512 tokens, with 20% token overlap
text_splitter = TokenTextSplitter(chunk_size=512,chunk_overlap=103)
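To make the overlap arithmetic concrete, here is a minimal sliding-window sketch over a plain list of integers standing in for token IDs. It illustrates the general idea only and is not LangChain's exact implementation; the function name `sliding_window_chunks` is hypothetical.

```python
import math


def sliding_window_chunks(tokens, chunk_size=512, chunk_overlap=103):
    """Split a token list into overlapping windows (illustrative sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new chunk starts this many tokens after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the final window reached the end of the text
        start += step
    return chunks


# 20% of 512 is 102.4, which rounds up to the 103 used above
overlap = math.ceil(0.20 * 512)

# Integers stand in for real token IDs
tokens = list(range(1200))
chunks = sliding_window_chunks(tokens, chunk_size=512, chunk_overlap=overlap)
print(overlap, [len(chunk) for chunk in chunks])
```

With these settings, every chunk after the first begins 409 tokens after the previous one, so the last 103 tokens of a chunk reappear at the start of the next.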


In [3]:
# Helper func: calculate number of tokens
def num_tokens_from_string(string: str, encoding_name = "cl100k_base") -> int:
    if not string:
        return 0
    # Returns the number of tokens in a text string
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [4]:
#list for smaller chunked text and metadata
new_list = []
# Create a new list by splitting up text into token sizes of around 512 tokens
for i in range(len(df.index)):
    text = df['content'][i]
    token_len = num_tokens_from_string(text)
    if token_len <= 512:
        new_list.append([df['title'][i], df['content'][i], df['url'][i]])
    else:
        #split text into 512 token chunks using text splitter
        split_text = text_splitter.split_text(text)
        for j in range(len(split_text)):
            new_list.append([df['title'][i], split_text[j], df['url'][i]])
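The key point of the loop above is that every chunk carries the title and URL of the post it came from. The sketch below shows the same pattern on plain dictionaries, with a whitespace-based splitter as a stand-in for `TokenTextSplitter`. The names `chunk_rows` and `split_every_n_words` and the sample row are hypothetical.

```python
from typing import Callable, Dict, List


def split_every_n_words(text: str, n: int = 5) -> List[str]:
    """Stand-in splitter: groups of n whitespace-separated words."""
    words = text.split()
    pieces = [" ".join(words[i:i + n]) for i in range(0, len(words), n)]
    # Keep empty or whitespace-only text as a single piece
    return pieces or [text]


def chunk_rows(rows: List[Dict[str, str]],
               split_fn: Callable[[str], List[str]]) -> List[Dict[str, str]]:
    """Split each row's content and copy its metadata onto every piece."""
    chunked = []
    for row in rows:
        for piece in split_fn(row["content"]):
            chunked.append({"title": row["title"], "content": piece, "url": row["url"]})
    return chunked


# Made-up row, for illustration only
rows = [
    {
        "title": "Example post",
        "content": "one two three four five six seven eight nine ten eleven twelve",
        "url": "https://example.com/post",
    }
]
for chunk in chunk_rows(rows, split_every_n_words):
    print(chunk["title"], "|", chunk["content"])
```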

Let's take a look at how the content looks after being split:

In [5]:
df_new = pd.DataFrame(new_list, columns=['title', 'content', 'url'])
df_new.head()

| | title | content | url |
|---|---|---|---|
| 0 | How to Build a Weather Station With Elixir, Ne... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-to-build-a-... |
| 1 | How to Build a Weather Station With Elixir, Ne... | .One of the motivating factors for this book w... | https://www.timescale.com/blog/how-to-build-a-... |
| 2 | How to Build a Weather Station With Elixir, Ne... | showing various graphs for various weather da... | https://www.timescale.com/blog/how-to-build-a-... |
| 3 | How to Build a Weather Station With Elixir, Ne... | Some of the articles that helped us get start... | https://www.timescale.com/blog/how-to-build-a-... |
| 4 | CloudQuery on Using PostgreSQL for Cloud Asset... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/cloudquery-on-u... |


In [6]:
#Quick check on how many items in our new list
print(len(new_list))

173


In [7]:
#Optional: save to new csv for easy reloading
df_new.to_csv('blog_posts_data_chunked.csv', index=False)

## Part 2: Insert embeddings into PostgreSQL and pgvector

Now that we have our original CSV split up into smaller chunks and the associated metadata preserved, we will use the LangChain [Pandas Data Frame Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/pandas_dataframe) to load data from our new pandas data frame and insert it into our PostgreSQL database with pgvector installed.

Note that we must specify which column in the Data Frame contains the text that we'll create embeddings for.

In [None]:
#load documents from Pandas dataframe for insertion into database
from langchain.document_loaders import DataFrameLoader

# page_content_column is the name of the dataframe column that contains the text we'll create embeddings for
loader = DataFrameLoader(df_new, page_content_column = 'content')
docs = loader.load()

We'll use the OpenAI embeddings model for our documents, so let's import the OpenAIEmbeddings class from the langchain.embeddings package and create an instance of it.

This instance can be used to generate embeddings for text data using the OpenAI API.


In [None]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

Before we create embeddings for all the data in our DataFrame, let's briefly overview how creating an embedding works. 

Here's how we create an embedding for a string:

In [None]:
# Create OpenAI embedding using LangChain's OpenAIEmbeddings class
query_string = "PostgreSQL is my favorite database"
embed = embeddings.embed_query(query_string)
print(len(embed)) # Should be 1536, the dimensionality of the OpenAI model's embeddings
print(embed[:5]) # Should be a list of floats

Now for the main event: we'll connect to our PostgreSQL database and store the documents we loaded along with their embeddings.

Thanks to LangChain, creating the embeddings and storing the data in our PostgreSQL database is a one-command operation!

We pass in the following arguments:
- ```documents```: The documents we loaded from the Pandas Data Frame.
- ```embedding```: Our instance of the OpenAI embeddings class, which is the model we'll use to create the embeddings.
- ```collection_name```: The name of the collection our embeddings and metadata will be stored under (PGVector keeps collections in its own tables, so this is a label rather than a literal table name).
- ```distance_strategy```: The distance strategy we want to use to calculate the distance between vectors. In our case, we'll use cosine distance.
- ```connection_string```: The connection string to our PostgreSQL database which we constructed in the setup section

In [None]:
# Create a PGVector instance to house the documents and embeddings
from langchain.vectorstores.pgvector import DistanceStrategy
db = PGVector.from_documents(
    documents= docs,
    embedding = embeddings,
    collection_name= "blog_posts",
    distance_strategy = DistanceStrategy.COSINE,
    connection_string=CONNECTION_STRING)
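For intuition about the cosine distance strategy chosen above: cosine distance is one minus the cosine similarity of two vectors, so it depends on their direction and not their length. A minimal pure-Python sketch follows; the function name is illustrative, and pgvector computes this inside the database.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Same direction, different length: distance 0
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
# Perpendicular vectors: distance 1
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
# Opposite directions: distance 2
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```

Smaller values mean more similar vectors, which is why a similarity search returns the rows with the lowest distance to the query embedding.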

Now that our data is in the database, let's perform a similarity search to fetch the documents most similar to a query:

In [None]:
from langchain.schema import Document

# Query for which we want to find semantically similar documents
query = "Tell me about how Edeva uses Timescale?"

#Fetch the k=3 most similar documents
docs =  db.similarity_search(query, k=3)

The query on our database returns a list of LangChain Documents. Let's learn how to interact with those documents below:

In [None]:
type(docs)

In [None]:
type(docs[0])

In [None]:
# Interact with a document returned from the similarity search on pgvector
doc = docs[0]

# Access the document's content
doc_content = doc.page_content
# Access the document's metadata object
doc_metadata = doc.metadata

print("Content snippet: " + doc_content[:500])
print("Document title: " + doc_metadata['title'])
print("Document url: " + doc_metadata['url'])

## Part 3: Question Answering with Retrieval Augmented Generation
Next, let's tie together everything we've learned and build a simple example of using LangChain for question answering, using an LLM from OpenAI and the documents from our database that are most relevant to the question.

This technique is called Retrieval Augmented Generation and works as follows:
- Create an embedding vector for the user question.
- Use pgvector to perform a vector similarity search and retrieve the k nearest neighbors to the question embedding from our database of embedding vectors representing the blog content. In our example, we’ll use k=3, finding the three most similar embedding vectors and associated content.
- Supply the content retrieved from the database as additional context to the model and ask it to perform a completion task to answer the user question.
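The first two steps above can be sketched without any external services. In the toy example below, `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts words from a tiny vocabulary), and an in-memory sort stands in for the pgvector search. The sample chunks are made up for illustration.

```python
import math
from collections import Counter

# Tiny made-up vocabulary; a real embedding model learns its own features
VOCAB = ["timescale", "postgresql", "weather", "aggregates", "elixir"]


def toy_embed(text):
    """Stand-in embedding: count how often each vocabulary word appears."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as maximally dissimilar
    return 1.0 if norms == 0 else 1.0 - dot / norms


def retrieve(question, chunks, k=3):
    """Steps 1 and 2: embed the question, then return the k nearest chunks."""
    question_vector = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_distance(question_vector, toy_embed(c)))
    return ranked[:k]


chunks = [
    "continuous aggregates in timescale",
    "building a weather station with elixir",
    "postgresql for cloud assets",
    "timescale and postgresql",
]
question = "how do continuous aggregates work in timescale"
for chunk in retrieve(question, chunks, k=2):
    print(chunk)
```

In the notebook, the OpenAI embeddings model plays the role of `toy_embed` and pgvector performs the ranking inside PostgreSQL.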

To more easily retrieve documents from our PostgreSQL vector database, we'll use a LangChain [retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/). 

In LangChain, a retriever is an interface that returns documents given an unstructured query. A retriever's main purpose is only to return (or retrieve) documents. 

We will use a [vector store-backed retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/how_to/vectorstore), which is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the Vector Store class to make it conform to the Retriever interface. It uses the search methods implemented by a vector store, like similarity search and maximal marginal relevance (MMR), to query the texts in the vector store.

In [None]:
# Create retriever from database
# We specify the number of results we want to retrieve (k=3)
retriever = db.as_retriever(
    search_kwargs={"k": 3}
    )

Next, we'll import the LLM we want to use to generate a response to our question. In this case we'll use [OpenAI's GPT-3.5 model](https://platform.openai.com/docs/models/) with a 16k token context window, so that we won't have any trouble fitting in retrieved documents as context in addition to the user question.

In [None]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature = 0.0, model = 'gpt-3.5-turbo-16k')

Next up, we'll use one of the most useful chains in LangChain, the [Retrieval Q+A chain](https://python.langchain.com/docs/modules/chains/popular/vector_db_qa), which is used for question answering over a vector database (also known as a vector store or index).


We'll combine it with a [stuff chain](https://python.langchain.com/docs/modules/chains/document/stuff) which takes a list of documents, inserts them all into a prompt (_stuffs_ them in) and passes that prompt to an LLM.
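Conceptually, "stuffing" is plain string assembly: join the retrieved documents and place them in a prompt next to the question. The sketch below shows the idea. The prompt wording and the name `build_stuff_prompt` are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Insert every document into a single prompt (illustrative wording)."""
    # Separate the documents with blank lines so they stay distinguishable
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Made-up document snippets, for illustration only
documents = [
    "First retrieved chunk of blog content.",
    "Second retrieved chunk of blog content.",
]
prompt = build_stuff_prompt("What do the chunks say?", documents)
print(prompt)
```

Because every document goes into one prompt, this approach only works when the retrieved chunks and the question fit inside the model's context window together.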

And for the final ingredient, let's formulate a question we want the model to answer with help from the documents in our database, and pass it to our chain to process.

In [None]:
from langchain.chains import RetrievalQA
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever,
    verbose=True,
)

query =  "How does Edeva use continuous aggregates?"
response = qa_stuff.run(query)

from IPython.display import Markdown, display
display(Markdown(response))

## Bonus: Cite your sources with LangChain and pgvector for RAG

For even more advanced functionality, you might want your answer to include the sources used to give users peace of mind. Here's how you can do that with the RetrievalQA chain using the ```return_source_documents``` argument:

In [None]:
# New chain to return context and sources
qa_stuff_with_sources = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever,
    return_source_documents=True,
    verbose=True,
)

query =  "How does Edeva use continuous aggregates?"

# To run the query, we use a different syntax since we're returning more than just the response text
responses = qa_stuff_with_sources({"query": query})

Let's inspect the source documents that got returned:

In [None]:
responses["source_documents"]

And finally, let's print out the result with the source documents cited:

In [None]:
source_documents = responses["source_documents"]
source_content = [doc.page_content for doc in source_documents]
source_metadata = [doc.metadata for doc in source_documents]

# Construct a single string with the LLM output and the source titles and urls
def construct_result_with_sources():
    result = responses['result']
    result += "\n\n"
    result += "Sources used:"
    for i in range(len(source_content)):
        result += "\n\n"
        result += source_metadata[i]['title']
        result += "\n\n"
        result += source_metadata[i]['url']
    return result

In [None]:
display(Markdown(construct_result_with_sources()))
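Because several chunks can come from the same blog post, the list of sources may repeat a title and URL. As a hedged variation on the function above, the sketch below takes the answer and metadata as arguments and lists each URL once. The name `format_answer_with_sources` and the sample metadata are hypothetical.

```python
def format_answer_with_sources(answer, source_metadata):
    """Append a de-duplicated Markdown list of sources to the answer."""
    lines = [answer, "", "Sources used:"]
    seen_urls = set()
    for meta in source_metadata:
        # Skip chunks whose post has already been listed
        if meta["url"] in seen_urls:
            continue
        seen_urls.add(meta["url"])
        lines.append(f"- [{meta['title']}]({meta['url']})")
    return "\n".join(lines)


# Made-up metadata: two chunks from the same post and one from another
example_metadata = [
    {"title": "Post A", "url": "https://example.com/a"},
    {"title": "Post A", "url": "https://example.com/a"},
    {"title": "Post B", "url": "https://example.com/b"},
]
print(format_answer_with_sources("Answer text", example_metadata))
```

With the chain output from earlier, this could be called as `format_answer_with_sources(responses["result"], [doc.metadata for doc in responses["source_documents"]])`.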

The cite-your-sources functionality is useful because it can help explain unexpected responses from the model, which can occur when highly similar but irrelevant documents are retrieved from the database.

### Next steps:
- Check out the [Conversational Retrieval QA Chain](https://python.langchain.com/docs/modules/chains/popular/chat_vector_db) to learn how to add memory and use what you learned above in a chatbot conversation setting.
- Use [Chainlit](https://github.com/Chainlit/chainlit) to build your own LLM chatbot in Python (Timescale tutorial coming soon!)
- Learn about how pgvector finds approximate nearest neighbors in this blog post: [What Are ivfflat Indexes in pgvector and How Do They Work](https://www.timescale.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work/)