<a href="https://colab.research.google.com/github/winterForestStump/thesis/blob/main/notebooks/gte_large_LangChain_ChromaDB_Llama2_7B_ver2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we add three documents (chunked and embedded) to a Chroma vector database and then run question answering (QA) over this corpus.

#**Step 1: Install All the Required Packages and Libraries**

In [1]:
%%capture
!pip install huggingface_hub
!pip install langchain
!pip install sentence-transformers
!pip install chromadb
!pip install -q -U transformers peft accelerate optimum
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip install GPUtil
!pip install llama-index

In [2]:
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
#from langchain.retrievers import ParentDocumentRetriever
from langchain.vectorstores import Chroma
import torch
from langchain import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
from langchain.chains import RetrievalQA
from langchain import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings
from textwrap import fill
from huggingface_hub import hf_hub_download
import GPUtil
import requests
import chromadb
from llama_index.vector_stores import ChromaVectorStore
from llama_index import SimpleDirectoryReader, ServiceContext, StorageContext, VectorStoreIndex
from llama_index.text_splitter import SentenceSplitter

#**Step 1A: Recursively split by character**

https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter

In [3]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!nvidia-smi

Mon Jan 29 18:25:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [4]:
%%capture
!mkdir -p 'data'
!wget 'https://raw.githubusercontent.com/winterForestStump/thesis/main/notebooks/example_texts/txt_examp_1.txt' -O 'data/txt_examp_1.txt'
!wget 'https://raw.githubusercontent.com/winterForestStump/thesis/main/notebooks/example_texts/txt_examp_2.txt' -O 'data/txt_examp_2.txt'
!wget 'https://raw.githubusercontent.com/winterForestStump/thesis/main/notebooks/example_texts/txt_examp_3.txt' -O 'data/txt_examp_3.txt'

In [5]:
%%capture
embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

In [6]:
data = []
# range(1, 4) covers txt_examp_1 .. txt_examp_3, i.e. all three downloaded files
for i in range(1, 4):
  with open(f"/content/data/txt_examp_{i}.txt", encoding='unicode_escape') as f:
    data.append(f.read())

In [7]:
CHUNK_SIZE = 256
CHUNK_OVERLAP = 32


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
    is_separator_regex=True)

create_docs = text_splitter.create_documents(data)
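
The interaction between `chunk_size` and `chunk_overlap` is easier to see in a simplified fixed-window splitter. The sketch below is not LangChain's algorithm (the recursive splitter additionally prefers to cut at separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 256, chunk_overlap: int = 32) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    it ignores separators and cuts purely by character position.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


# Synthetic 600-character input (made-up data, not one of the example files).
sample = "".join(str(i % 10) for i in range(600))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-32:] == chunks[1][:32])  # neighbouring chunks share the overlap
```

A larger overlap lowers the chance that a table row or sentence is cut in half between two chunks, at the cost of storing more, partly redundant, chunks.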


In [8]:
from langchain_community.vectorstores import Chroma
db = Chroma.from_documents(create_docs, embeddings)

In [9]:
query = "What was the consolidated revenue (sales) of the Air Products and Chemicals, Inc. in 2023 and how it changed since 2021?"
docs = db.similarity_search(query)

In [10]:
print(docs[0].page_content)

Air Products and Chemicals, Inc. and Subsidiaries
CONSOLIDATED INCOME STATEMENTS
For the Fiscal Years Ended 30 September
(Millions of U.S. Dollars, except for share and per share data) 2023 2022 2021
Sales $12,600.0 $12,698.6 $10,323.0
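
Because the embeddings were created with `normalize_embeddings=True`, every vector has unit length. For unit vectors the dot product equals cosine similarity, and squared Euclidean distance equals `2 - 2 * cosine`, so ranking by any of these gives the same order. The sketch below illustrates this with made-up 3-dimensional vectors (not real gte-large output); all helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy "embeddings": invented numbers used only to demonstrate the ranking.
query_vec = [0.2, 0.9, 0.1]
chunk_vecs = {
    "income statement": [0.1, 0.8, 0.2],
    "audit report": [0.9, 0.1, 0.3],
}

q = l2_normalize(query_vec)
# After normalization a plain dot product is the cosine similarity.
scores = {name: dot(q, l2_normalize(v)) for name, v in chunk_vecs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```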


In [11]:
%%capture
MODEL_NAME = "TheBloke/Llama-2-7b-Chat-GPTQ"
TEMPERATURE = 0.0001
MAX_NEW_TOKENS = 1024
TOP_P = 0.95
REPETITION_PENALTY = 1.15


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto")

generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = MAX_NEW_TOKENS
generation_config.temperature = TEMPERATURE
generation_config.top_p = TOP_P
generation_config.do_sample = True
generation_config.repetition_penalty = REPETITION_PENALTY

text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)

llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": TEMPERATURE})
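
With `TEMPERATURE = 0.0001`, sampling stays enabled (`do_sample = True`) but is close to greedy decoding in practice. The sketch below shows why, using the textbook softmax with temperature on invented logits; it is an illustration of the idea, not the code path inside `transformers`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens

probs_default = softmax_with_temperature(logits, 1.0)
probs_cold = softmax_with_temperature(logits, 0.0001)

print("T=1.0   ", [round(p, 3) for p in probs_default])
print("T=0.0001", [round(p, 3) for p in probs_cold])
```

At a temperature this low almost all probability mass sits on the highest-scoring token, so repeated runs of the same prompt should produce nearly identical answers.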

Creating a prompt template

In [13]:
K = 10
SCORE_THRESHOLD = 0.5


template = """
<s>[INST] <<SYS>>
Use the following information from company annual reports and answer the question at the end.
If the answer is not contained in the provided information, say "The answer is not in the context".
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"k": K, "score_threshold": SCORE_THRESHOLD}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)
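
Conceptually, the retriever keeps at most `K` chunks whose relevance score reaches `SCORE_THRESHOLD`, and the "stuff" chain pastes them into `{context}` in a single prompt. The sketch below mimics that flow in plain Python. It rests on assumptions: `build_stuff_prompt` is a hypothetical helper, the scores are invented, the template is shortened, and the blank-line separator between chunks is assumed rather than taken from this notebook.

```python
# Shortened copy of the prompt above (assumption: same placeholders, fewer instructions).
TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "Use the following information from company annual reports "
    "and answer the question at the end.\n"
    "<</SYS>>\n\n"
    "{context}\n\n"
    "{question} [/INST]"
)


def build_stuff_prompt(scored_chunks, question, k=10, score_threshold=0.5):
    """Filter chunks by score, keep the top k, and stuff them into one prompt."""
    kept = [(text, score) for text, score in scored_chunks if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # most relevant first
    context = "\n\n".join(text for text, _ in kept[:k])
    return TEMPLATE.format(context=context, question=question)


# (chunk text, relevance score) pairs; the scores are made up for illustration.
scored = [
    ("Sales $12,600.0 $12,698.6 $10,323.0", 0.83),
    ("Unrelated footnote", 0.31),
    ("CONSOLIDATED INCOME STATEMENTS", 0.74),
]

prompt_text = build_stuff_prompt(scored, "What was the revenue in 2023?")
print(prompt_text)
```

Since every retained chunk goes into one prompt, `K` multiplied by the chunk size has to fit in the model's context window together with the instructions and the generated answer.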

In [14]:
print('GPU Usage after RetrievalQA:')
GPUtil.showUtilization()

GPU Usage after RetrievalQA:
| ID | GPU | MEM |
------------------
|  0 |  0% | 36% |


In [15]:
result = qa_chain(
    "What was the revenue (sales) of the Air Products and Chemicals, Inc. in 2023 and how it changed since 2021?"
)
print(fill(result["result"].strip(), width=90))


Based on the information provided in the annual report, the revenue (sales) of Air
Products and Chemicals, Inc. in 2023 was $12,600 million, which is slightly higher than
the previous year's revenue of $10,323 million in 2021. The change in revenue between 2021
and 2023 can be seen below: Year Revenue (in millions) Change 2021 $10,323.0 +10.5% 2023
$12,600.0 +21.5% Therefore, the answer is $12,600 million in 2023.
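
Percentages in a generated answer can be cross-checked against the figures in the retrieved chunk. A minimal sketch, using the sales values printed from `docs[0]` above; `pct_change` is a hypothetical helper.

```python
def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Sales in millions of U.S. dollars, copied from the retrieved income statement chunk.
sales = {2021: 10_323.0, 2022: 12_698.6, 2023: 12_600.0}

change_2021_2023 = pct_change(sales[2023], sales[2021])
change_2022_2023 = pct_change(sales[2023], sales[2022])

print(f"2021 -> 2023: {change_2021_2023:+.1f}%")
print(f"2022 -> 2023: {change_2022_2023:+.1f}%")
```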


In [16]:
result

{'query': 'What was the revenue (sales) of the Air Products and Chemicals, Inc. in 2023 and how it changed since 2021?',
 'result': "Based on the information provided in the annual report, the revenue (sales) of Air Products and Chemicals, Inc. in 2023 was $12,600 million, which is slightly higher than the previous year's revenue of $10,323 million in 2021. The change in revenue between 2021 and 2023 can be seen below:\nYear Revenue (in millions) Change\n2021 $10,323.0 +10.5%\n2023 $12,600.0 +21.5%\nTherefore, the answer is $12,600 million in 2023.",
 'source_documents': [Document(page_content='Air Products and Chemicals, Inc. and Subsidiaries\nCONSOLIDATED INCOME STATEMENTS\nFor the Fiscal Years Ended 30 September\n(Millions of U.S. Dollars, except for share and per share data) 2023 2022 2021\nSales $12,600.0 $12,698.6 $10,323.0'),
  Document(page_content='Air Products and Chemicals, Inc. and Subsidiaries\nCONSOLIDATED COMPREHENSIVE INCOME STATEMENTS\nFor the Fiscal Years Ended 30 Sep

In [17]:
print('GPU Usage after the retrieval 1st prompt:')
GPUtil.showUtilization()

GPU Usage after the retrieval 1st prompt:
| ID | GPU | MEM |
------------------
|  0 | 77% | 46% |


Other prompts:

In [18]:
result = qa_chain(
    "Summarize the financial results (revenue, costs and net incomes) of the Air Products and Chemicals, Inc. in 2023 and possible risks which can influence the results in the future in 10-12 sentences."
)
print(fill(result["result"].strip(), width=90))

In 2023, Air Products and Chemicals, Inc. reported a net income of $2.3 billion,
representing a 14% increase compared to the previous year. The company's revenue reached
$12.6 billion, up by 10%. The main contributors to the company's growth were the increased
sales of industrial gases, such as oxygen, nitrogen, and argon, as well as the higher
demand for chemicals and materials. However, the company also faced some challenges,
including rising energy costs and currency fluctuations, which might impact its future
results. To mitigate these risks, the company has implemented various cost-saving measures
and diversified its product portfolio. Overall, Air Products and Chemicals, Inc.'s strong
financial performance in 2023 suggests that the company is well-positioned to continue
growing in the future.


In [19]:
result

{'query': 'Summarize the financial results (revenue, costs and net incomes) of the Air Products and Chemicals, Inc. in 2023 and possible risks which can influence the results in the future in 10-12 sentences.',
 'result': "In 2023, Air Products and Chemicals, Inc. reported a net income of $2.3 billion, representing a 14% increase compared to the previous year. The company's revenue reached $12.6 billion, up by 10%. The main contributors to the company's growth were the increased sales of industrial gases, such as oxygen, nitrogen, and argon, as well as the higher demand for chemicals and materials. However, the company also faced some challenges, including rising energy costs and currency fluctuations, which might impact its future results. To mitigate these risks, the company has implemented various cost-saving measures and diversified its product portfolio. Overall, Air Products and Chemicals, Inc.'s strong financial performance in 2023 suggests that the company is well-positioned to

In [20]:
print('GPU Usage after the retrieval 2nd prompt:')
GPUtil.showUtilization()

GPU Usage after the retrieval 2nd prompt:
| ID | GPU | MEM |
------------------
|  0 | 79% | 46% |


In [21]:
result = qa_chain(
    "What were the net incomes from continuing operations of the Air Products and Chemicals, Inc. in the period from 2021 to 2023?"
)
print(fill(result["result"].strip(), width=90))

Based on the information provided in the annual report, the net incomes from continuing
operations of Air Products and Chemicals, Inc. in the period from 2021 to 2023 are as
follows: * For fiscal year 2021, the net income from continuing operations was $2,028.8
million. * For fiscal year 2022, the net income from continuing operations was $2,243.5
million. * For fiscal year 2023, the net income from continuing operations was $2,292.8
million.


In [22]:
result

{'query': 'What were the net incomes from continuing operations of the Air Products and Chemicals, Inc. in the period from 2021 to 2023?',
 'result': 'Based on the information provided in the annual report, the net incomes from continuing operations of Air Products and Chemicals, Inc. in the period from 2021 to 2023 are as follows:\n* For fiscal year 2021, the net income from continuing operations was $2,028.8 million.\n* For fiscal year 2022, the net income from continuing operations was $2,243.5 million.\n* For fiscal year 2023, the net income from continuing operations was $2,292.8 million.',
 'source_documents': [Document(page_content='Air Products and Chemicals, Inc. and Subsidiaries\nCONSOLIDATED COMPREHENSIVE INCOME STATEMENTS\nFor the Fiscal Years Ended 30 September\n(Millions of U.S. Dollars) 2023 2022 2021\nNet Income $2,338.6 $2,266.5 $2,114.9'),
  Document(page_content='Air Products and Chemicals, Inc. and Subsidiaries\nCONSOLIDATED INCOME STATEMENTS\nFor the Fiscal Years E

In [23]:
print('GPU Usage after the retrieval 3rd prompt:')
GPUtil.showUtilization()

GPU Usage after the retrieval 3rd prompt:
| ID | GPU | MEM |
------------------
|  0 | 86% | 46% |


In [24]:
result = qa_chain(
    "What were earnings per share (EPS) of the Air Products and Chemicals, Inc. in 2022-2023 period?"
)
print(fill(result["result"].strip(), width=90))

Based on the information provided in the annual report, the earnings per share (EPS) of
Air Products and Chemicals, Inc. in the 2022-2023 period was: * Basic EPS from continuing
operations: $10.31 (2022) and $10.35 (2023) * Basic EPS from discontinued operations:
$0.03 (2022) and $0.03 (2023) * Basic EPS attributable to Air Products: $10.35 (2022) and
$10.33 (2023) * Diluted EPS from continuing operations: $10.30 (2022) and $10.30 (2023) *
Diluted EPS from discontinued operations: $0.03 (2022) and $0.03 (2023) * Weighted average
common shares: 22.3 million (2022) and 222.0 million (2023)  Therefore, the answer is: EPS
of Air Products and Chemicals, Inc. in 2022-2023 period = $10.31 (basic) and $10.33
(diluted).


In [25]:
result

{'query': 'What were earnings per share (EPS) of the Air Products and Chemicals, Inc. in 2022-2023 period?',
 'result': 'Based on the information provided in the annual report, the earnings per share (EPS) of Air Products and Chemicals, Inc. in the 2022-2023 period was:\n* Basic EPS from continuing operations: $10.31 (2022) and $10.35 (2023)\n* Basic EPS from discontinued operations: $0.03 (2022) and $0.03 (2023)\n* Basic EPS attributable to Air Products: $10.35 (2022) and $10.33 (2023)\n* Diluted EPS from continuing operations: $10.30 (2022) and $10.30 (2023)\n* Diluted EPS from discontinued operations: $0.03 (2022) and $0.03 (2023)\n* Weighted average common shares: 22.3 million (2022) and 222.0 million (2023)\n\nTherefore, the answer is:\nEPS of Air Products and Chemicals, Inc. in 2022-2023 period = $10.31 (basic) and $10.33 (diluted).',
 'source_documents': [Document(page_content='Air Products and Chemicals, Inc. and Subsidiaries\nCONSOLIDATED INCOME STATEMENTS\nFor the Fiscal Year

In [26]:
print('GPU Usage after the retrieval 4th prompt:')
GPUtil.showUtilization()

GPU Usage after the retrieval 4th prompt:
| ID | GPU | MEM |
------------------
|  0 | 38% | 48% |


In [29]:
result = qa_chain(
    "What were the total current assets and total current liabilities of the Air Products and Chemicals, Inc. and Subsidiaries as of the end of 2023 fiscal year?"
)
print(fill(result["result"].strip(), width=90))

Based on the information provided in the annual report, the total current assets of Air
Products and Chemicals, Inc. and Subsidiaries as of the end of 2023 fiscal year were:
$12,600.0 million (FY 2023) The total current liabilities of Air Products and Chemicals,
Inc. and Subsidiaries as of the end of 2023 fiscal year were: $6,477.0 million (FY 2023)
Therefore, the answer to your question is: Total current assets: $12,600.0 million Total
current liabilities: $6,477.0 million


In [30]:
result

{'query': 'What were the total current assets and total current liabilities of the Air Products and Chemicals, Inc. and Subsidiaries as of the end of 2023 fiscal year?',
 'result': 'Based on the information provided in the annual report, the total current assets of Air Products and Chemicals, Inc. and Subsidiaries as of the end of 2023 fiscal year were:\n$12,600.0 million (FY 2023)\nThe total current liabilities of Air Products and Chemicals, Inc. and Subsidiaries as of the end of 2023 fiscal year were:\n$6,477.0 million (FY 2023)\nTherefore, the answer to your question is:\nTotal current assets: $12,600.0 million\nTotal current liabilities: $6,477.0 million',
 'source_documents': [Document(page_content='We have audited the accompanying consolidated balance sheets of Air Products and Chemicals, Inc. and subsidiaries (the "Company") as of September 30, 2023 and 2022, the related consolidated income statements, comprehensive income statements, statements of'),
  Document(page_content='Ai