In [1]:
# install langchain (version 0.0.191)
!pip install langchain==0.0.191
# install chromadb
!pip install chromadb==0.3.29
# install tiktoken
!pip install tiktoken
# install beautifulsoup4
!pip install beautifulsoup4



# Task 1: Load Data

To embed and store data, we need to provide LangChain with Documents. This is easy to do in LangChain thanks to Document Loaders. In our case, we're targeting documentation hosted on Read the Docs, for which there is a dedicated loader, `ReadTheDocsLoader`. In the folder `rtdocs`, you'll find all the HTML files from the [LangChain documentation](https://python.langchain.com/en/latest/index.html).

```bash
wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/
```

In a bash console execute this code:
```bash
unzip contents.zip
```

Our first task is to load these HTML files as documents that we can use with LangChain: we're going to use the `ReadTheDocsLoader`. It reads the directory containing all HTML files and transforms them into Document objects.

`ReadTheDocsLoader` reads each HTML file, strips the HTML tags to keep only the text, and returns it as a Document. At the end of this task, we'll have a list of Documents: one Document per HTML file. The cells below apply the same pattern to a PDF instead: `PyPDFLoader` returns one Document per page, stored in the variable `pages`.
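
To make the "strip the tags, keep the text" idea concrete, here is a minimal standard-library sketch. It is not LangChain's implementation: the `Document` dataclass, the `html_to_document` helper, and the sample path are hypothetical stand-ins that only mimic the `page_content` / `metadata` shape shown later in this notebook.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_document(html, source):
    """Drop the tags, normalise whitespace, and wrap the text in a Document."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return Document(page_content=text, metadata={"source": source})


doc = html_to_document(
    "<h1>Chains</h1>\n<p>A chain links <b>calls</b>.</p>",
    "rtdocs/chains.html",  # hypothetical file name
)
print(doc.page_content)
```

A real loader also has to deal with navigation bars, scripts, and encodings, which this sketch ignores.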

In [2]:
!pip install pypdf



In [3]:
# Import PyPDFLoader
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("building-the-ai-bank-of-the-future.pdf")
pages = loader.load()

In [4]:
print("Size raw documents: ",len(pages))

Size raw documents:  66


# Task 2: Slice the documents into smaller chunks

We have now turned each source file into a Document (here, one per PDF page). These Documents may be very long and potentially too large to embed in full. It's also good practice to avoid embedding large documents:
- long documents often contain several concepts. Retrieval will be easier if each concept is indexed separately;
- retrieved documents will be injected in a prompt, so keeping them short will keep the prompt small.

LangChain has a collection of tools to do this:
[Text Splitters](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html).

We'll use the most straightforward one:
the [Recursive Character Text Splitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).

*The `recursive text splitter` will recursively reduce the input by splitting it by paragraph, then sentences, then words as needed until the chunk is small enough.*
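
The recursive idea can be sketched in plain Python. This is a simplified illustration, not the LangChain implementation: the function name `recursive_split` and the separator list are assumptions, and chunk overlap is left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most chunk_size characters.

    Tries the coarsest separator first (paragraphs) and only falls back
    to finer ones (lines, sentences, words) for pieces that are too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph. It has two sentences.\n\nSecond paragraph is here."
chunks = recursive_split(sample, chunk_size=30)
print(chunks)
```

The second paragraph fits in one chunk and is kept whole, while the first is too long and gets split at the sentence boundary.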

In [5]:
# Import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create the text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# Split the documents
documents = splitter.split_documents(pages)
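
The `chunk_overlap` argument used above makes consecutive chunks share some text, so a sentence cut at a chunk boundary still appears whole in one of them. The sketch below shows the idea with a plain fixed-size sliding window; `sliding_chunks` is a hypothetical helper and much simpler than what the splitter actually does.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(windows)  # each window repeats the last two characters of the previous one
```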

In [6]:
print("Size documents: ",len(documents))

Size documents:  258


In [7]:
documents[0]

Document(page_content='© Getty ImagesGlobal Banking Practice\nBuilding the AI bank \nof the future\nMay 2021', metadata={'source': 'building-the-ai-bank-of-the-future.pdf', 'page': 0})

# Task 3: Count tokens and get a cost estimate of embedding

We're ready to embed our documents. Before we do so, we'd like to get an idea of how large they are and how much embedding them will cost. To do so, we'll use the [`tiktoken`](https://github.com/openai/tiktoken) library. `tiktoken` encodes strings of text into tokens and decodes them back. In our case, we're mostly interested in how many tokens our documents translate to.

> 💡 To better understand what a token is to GPT, head to [OpenAI's Tokenizer page](https://platform.openai.com/tokenizer) where you can see how a text translates to tokens.

Prices for different models in OpenAI can be found on their [pricing page](https://openai.com/pricing).

Prices for different models in Azure OpenAI can be found on their [pricing page](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/).

In [8]:
# Import tiktoken
import tiktoken

# Create an encoder
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")

# Count tokens in each document
doc_tokens = [len(encoder.encode(doc.page_content)) for doc in documents]

# Calculate the sum of all token counts
total_tokens = sum(doc_tokens)

# Calculate a cost estimate
cost = (total_tokens/1000) * 0.0004
print(f"Total tokens: {total_tokens} - cost: ${cost:.2f}")

Total tokens: 46467 - cost: $0.02
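
The arithmetic behind that estimate, written as a small reusable helper. The function name `embedding_cost` is hypothetical, and the default rate of $0.0004 per 1,000 tokens is simply the value hard-coded in the cell above; prices change, so check the pricing pages before relying on it.

```python
def embedding_cost(total_tokens, usd_per_1k_tokens=0.0004):
    """Estimate the embedding cost in USD for a given number of tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens


# Token count taken from the output above
estimate = embedding_cost(46467)
print(f"Estimated cost: ${estimate:.4f}")
```

Passing a different `usd_per_1k_tokens` lets you compare models or providers without touching the token count.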


# Task 4: Embed the documents and store embeddings in the vector database

We'll want to save the embeddings into a database. LangChain can take care of all that using a [Vector Store](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html).

There are plenty of vector stores to choose from (see the [full list](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html)). Today we'll use [Chroma](https://docs.trychroma.com/), but you could use any other, as they share the same interface in LangChain. You may need to try several of them to see which best fits your use case: some vector stores have specific features (such as multimodal or multilingual support), so be sure to check them out.

Chroma is simple to use and can be persisted to disk. If you do not wish to embed the full set of documents yourself, feel free to skip this step and use the provided folder `chroma-data-langchain-docs`: we've already embedded all documents and persisted them in this folder.

In [9]:
# set the environment variables needed for openai package to know to reach out to azure
import os

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://classbi-openai-02.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "<your-azure-openai-api-key>"  # never commit a real key
os.environ["OPENAI_API_VERSION"] = "2023-03-15-preview"

In [36]:
!pip install openai==0.28.1


In [10]:
# Import chroma
from langchain.vectorstores import Chroma

# Import OpenAIEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings

# Create the embedding function
embedding_function = OpenAIEmbeddings(deployment="text-embedding-ada-002",chunk_size = 1)

In [11]:
# Test the embedding function

input_text = "This is for demonstration."
outcome = embedding_function.embed_query(input_text)
print(outcome)
print(len(outcome))

[-0.012424956075847149, 0.010575136169791222, 0.0013741046423092484, -0.009136387147009373, -0.008970632217824459, 0.014228364452719688, -0.008141860365867615, 0.001561407232657075, -0.0069749485701322556, -0.023987988010048866, 0.008705425076186657, 0.008930851705372334, -0.01381729356944561, -0.0033681311178952456, -0.0032322125043720007, 0.004899702500551939, 0.016243938356637955, -0.016601968556642532, 0.014162062667310238, -0.029650161042809486, -0.012710053473711014, 0.013452633284032345, 0.00537044508382678, 0.012729944661259651, -0.029411476105451584, -0.003858764423057437, 0.020275088027119637, -0.026865486055612564, 0.022900639101862907, -0.02088506519794464, 0.0019989991560578346, 0.0057682557962834835, 0.013671429827809334, -0.04102754965424538, -0.011151961982250214, -0.011311085894703865, 0.012716683559119701, -0.02241000533103943, -0.0032869114074856043, 0.004617919679731131, -0.011748678050935268, 0.01739758998155594, 0.006364972330629826, -0.027077652513980865, -0.0121

In [12]:
# Create a database from the documents and embedding function
db = Chroma.from_documents(documents=documents, embedding=embedding_function, persist_directory="my-pdf-embeddings")

In [13]:
# Persist the data to disk
db.persist()

In [14]:
db.get().keys()

dict_keys(['ids', 'embeddings', 'documents', 'metadatas'])

In [15]:
db.get()['documents'][0]

'BIS Papers No 4 19Graph 2\nExpense to assets ratios and asset size of banks in 1998\n0246810\n56789 1 0 1 1 1 2GGG\nGG\nG\nG\nGGG\nG\nGGG\nGG\nG\nGGGG\nG\nG\nGG\nGG\nGGG\nG\nGG\nG\nG\nGGG\nG\nG\n048121620\n56789 1 0 1 1 1 2GG\nG\nG GGG\nG\nGG\nGG\nGG\nG\nGGG\nG\nGG\nG\nGG\nGG\nGG\nGGG\nG GG\n0246810\n34567891 0G\nG\nGGG\nGG\nGG\nGGG\nG\nGGG\nG G\nGGGG\nG\nGGG\nG\nTotal assets of bank4\nTotal operating expense-to-total assets ratio Total operating expense-to-total assets ratio Total operating expense-to-total assets ratioAsia1\nLatin America2\nOther countries3\nNote:  Bank sample based on the five largest private domestic banks in the respective regions.\n1  China, Hong Kong, India, Korea, Malaysia, the Philippines, Singapore and Thailand.   2  Argentina, Brazil, Chile, Colombia,\nMexico, Peru and Venezuela.   3  The Czech Republic, Hungary, Israel, Poland, Saudi Arabia and South Africa.   4  In US dollar\nterms expressed logarithmically.\nSource:  Fitch-IBCA.'

## Alternative: use the provided embeddings

We have already executed the step above to embed all documents and stored the result in the `chroma-data-langchain-docs` folder. Instead of embedding all the documents yourself, you can use these embeddings at no cost.

The result of this step is the same as the step above, but it does not call the OpenAI API and costs nothing.

In [17]:
# Import chroma
from langchain.vectorstores import Chroma

# Import OpenAIEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings

# Create the embedding function
embedding = OpenAIEmbeddings(deployment="text-embedding-ada-002",chunk_size = 1)

# Load the database from existing embeddings
db = Chroma(persist_directory="my-pdf-embeddings", embedding_function=embedding)

# Task 5: Query the vector database

Now that we have a vector database, we can query it. A vector database stores embeddings (vectors) and lets us search through them using the K-Nearest Neighbors algorithm (or a variation of it). When we query it, the following happens:
1. Embed the text query to obtain a vector. It is crucial that this embedding is made using the same embedding technique that was used to embed the documents;
2. Calculate the distance (or similarity) between the query vector and all other vectors;
3. Sort results by similarity;
4. Return the most similar documents.
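
The steps above can be sketched in plain Python as a brute-force nearest-neighbor search. The three-dimensional vectors are made up for illustration (real embeddings have many more dimensions), the names `cosine_distance` and `nearest` are hypothetical, and cosine distance is an assumption: the metric a given vector store uses may differ.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; lower means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec, store, k=2):
    """Score every stored vector against the query and keep the k closest."""
    scored = [(text, cosine_distance(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy "database" of (text, embedding) pairs with made-up vectors
store = [
    ("banks and AI", [0.9, 0.1, 0.0]),
    ("cooking pasta", [0.0, 0.2, 0.9]),
    ("digital banking", [0.8, 0.3, 0.1]),
]

# Step 1 would embed the query text; here the query vector is given directly
query_vec = [1.0, 0.0, 0.0]
top = nearest(query_vec, store, k=2)
for text, distance in top:
    print(f"{distance:.3f}  {text}")
```

Because the score is a distance, results are sorted in ascending order, which matches the ascending scores printed by the query cells below.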

To do this with LangChain, we can use the `.similarity_search_with_score()` method of the database.

In [19]:
# Call the `similarity_search_with_score` method on `db`
results = db.similarity_search_with_score("what is the future of banks?")

In [20]:
for (doc, score) in results:
    print('score', score)
    print(doc.page_content)
    print('-----------------')
    break

score 0.26023781299591064
Global Banking & Securities
AI bank of the future: Can 
banks meet the AI challenge?
Artificial intelligence technologies are increasingly integral to the world we  
live in, and banks need to deploy these technologies at scale to remain  
relevant. Success requires a holistic transformation spanning multiple layers 
of the organization.
September  2020© Getty Imagesby Suparna Biswas, Brant Carson, Violet Chung, Shwaitang Singh, and Renny Thomas 
4
-----------------


In [21]:
# Print the results
for (doc, score) in results:
    print('score', score)
    print(doc.page_content)
    print('-----------------')

score 0.26023781299591064
Global Banking & Securities
AI bank of the future: Can 
banks meet the AI challenge?
Artificial intelligence technologies are increasingly integral to the world we  
live in, and banks need to deploy these technologies at scale to remain  
relevant. Success requires a holistic transformation spanning multiple layers 
of the organization.
September  2020© Getty Imagesby Suparna Biswas, Brant Carson, Violet Chung, Shwaitang Singh, and Renny Thomas 
4
-----------------
score 0.27077344059944153
Banking is at a pivotal moment. Technology 
disruption and consumer shifts are laying the basis 
for a new S-curve for banking business models, 
and the COVID-19 pandemic has accelerated 
these trends. Building upon this momentum, 
the advancement of artificial-intelligence (AI) 
technologies within financial services offers banks 
the potential to increase revenue at lower cost by 
engaging and serving customers in radically new 
ways, using a new business model we call “

# Task 6: Create a QA chain

Let's put it all together into a chat-like application. We want the user to ask a question, then search for relevant documents. We'll then create a prompt that includes the documents and the question so GPT can answer it (if possible).

First, we'll query the database as in the previous step, this time using `.similarity_search()`:

```python
question = "show an example of adding memory to a chain"
context_docs = db.similarity_search(question)
```

Next, we will create a prompt that contains the question and the relevant documents:

> You can think of a `PromptTemplate` as an f-string in Python: values in curly braces are placeholders that are replaced by the values we pass when running the chain.

```python
prompt = PromptTemplate(
    template="""Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
        <context>
        {context}
        </context>
Question: {question}
Helpful Answer:""",
    input_variables=["context", "question"]
)
```
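
The f-string analogy can be tried without LangChain at all. The sketch below uses Python's built-in `str.format` on a shortened version of the template; the context and question strings are made-up examples.

```python
# A shortened template with the same two placeholders
template = (
    "<context>\n{context}\n</context>\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

# Filling the placeholders is what happens each time the chain runs
filled = template.format(
    context="Banks need to deploy AI at scale.",
    question="What is the future of banks?",
)
print(filled)
```

A `PromptTemplate` adds checks on top of this, such as declaring the expected `input_variables` up front.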

To call the LLM with this prompt, we need to create an `LLMChain` and pass it an LLM and the prompt:

```python
llm = ChatOpenAI(temperature=0)
qa_chain = LLMChain(llm=llm, prompt=prompt)
```

We can now call our chain like so:

```python
qa_chain({"context": "<the context>", "question": "<the question>"})
```

This will return a dict with a `text` key containing the LLM response.

In [22]:
# Import
from langchain.prompts import PromptTemplate
from langchain.chains.llm import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.chat_models import AzureChatOpenAI
from langchain.schema import HumanMessage

# Set the question variable
question = "What is the future of banks"

# Query the database and store the results as `context_docs`
context_docs = db.similarity_search(question)

# Create a prompt with 2 variables: `context` and `question`
prompt = PromptTemplate(
    template="""Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

<context>
{context}
</context>

Question: {question}
Helpful Answer, formatted in markdown:""",
    input_variables=["context", "question"]
)

# Create an LLM with AzureChatOpenAI
llm = AzureChatOpenAI(
    openai_api_base="https://classbi-openai-02.openai.azure.com/",
    openai_api_version="2023-03-15-preview",
    deployment_name="gpt-35-turbo",
    openai_api_key=os.environ["OPENAI_API_KEY"],  # read the key set earlier instead of hard-coding it
    openai_api_type="azure",
)

In [23]:
# Create the chain
qa_chain = LLMChain(llm=llm, prompt=prompt)

# Call the chain
result = qa_chain({
    "question": question,
    "context": "\n".join([doc.page_content for doc in context_docs])
})

# Print the result
print(result["text"])

The future of banks lies in the adoption and integration of artificial intelligence technologies at scale. This will enable them to engage and serve customers in new ways, leading to deeper customer relationships, expanded market share, and stronger financial performance. The advancement of AI technologies within financial services offers banks the potential to increase revenue at lower cost, paving the way for a new business model known as "the AI bank of the future." This model will require a holistic transformation spanning multiple layers of the organization, and banks will need an AI-and-analytics capability stack that delivers intelligent, personalized solutions and distinctive experiences at scale in real time.
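
The chain call above joins every retrieved document into the context. Since Task 2 noted that retrieved text is injected into the prompt and should stay small, one option is to cap the context size before calling the chain. The sketch below is a hypothetical helper, not part of LangChain: `Doc` stands in for a Document, and the budget is counted in characters for simplicity (a token budget using `tiktoken` would be more precise).

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document (only the field used here)."""
    page_content: str


def build_context(docs, max_chars):
    """Join document texts with newlines, stopping before the budget is exceeded."""
    kept, used = [], 0
    for doc in docs:
        extra = len(doc.page_content) + (1 if kept else 0)  # +1 for the newline
        if used + extra > max_chars:
            break
        kept.append(doc.page_content)
        used += extra
    return "\n".join(kept)


docs = [
    Doc("AI banks."),
    Doc("New S-curve."),
    Doc("A much longer passage about cost."),
]
context = build_context(docs, max_chars=25)
print(context)
```

Because the search results are ordered from most to least similar, dropping documents from the end removes the least relevant text first.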
