<a href="https://colab.research.google.com/github/winterForestStump/thesis/blob/main/notebooks/gte-large_LangChain_ChromaDB_Llama2-7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Step 1: Install All the Required Packages and Libraries**

In [1]:
%%capture
!pip install huggingface_hub
!pip install langchain
!pip install sentence-transformers
!pip install chromadb
!pip install -q -U transformers peft accelerate optimum
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip install GPUtil

In [2]:
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
#from langchain.retrievers import ParentDocumentRetriever
from langchain.vectorstores import Chroma
import torch
from langchain import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
from langchain.chains import RetrievalQA
from langchain import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings
from textwrap import fill
from huggingface_hub import hf_hub_download
import GPUtil
import requests

#**Step 1A: Recursively split by character**

https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter

In [3]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!nvidia-smi

Mon Jan 22 10:09:01 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [4]:
print('Initial GPU Usage:')
GPUtil.showUtilization()

Initial GPU Usage:
| ID | GPU | MEM |
------------------
|  0 |  0% |  0% |


In [5]:
text = requests.get('https://raw.githubusercontent.com/winterForestStump/thesis/main/data/txt_examp_ver2.txt').text

In [6]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=32,
    length_function=len,
    is_separator_regex=False)

texts = text_splitter.create_documents([text])

print('GPU Usage after text splitting:')
GPUtil.showUtilization()

GPU Usage after text splitting:
| ID | GPU | MEM |
------------------
|  0 |  0% |  0% |


In [7]:
len(texts)

2078
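
The effect of `chunk_size=256` and `chunk_overlap=32` can be illustrated with a minimal sketch. This is not how `RecursiveCharacterTextSplitter` works internally: that splitter first tries to break on separators such as paragraph and line breaks, so its chunks are usually shorter than 256 characters. The hypothetical helper `split_with_overlap` below simply slides a fixed-size window over the text so that the overlap between neighbouring chunks is easy to see.

```python
def split_with_overlap(text, chunk_size=256, chunk_overlap=32):
    """Fixed-size character windows (illustrative helper, not the LangChain algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts 224 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 600-character input, used only for the demonstration.
sample = "".join(str(i % 10) for i in range(600))
chunks = split_with_overlap(sample)

print([len(c) for c in chunks])
# The last 32 characters of one chunk are repeated at the start of the next one.
print(chunks[0][-32:] == chunks[1][:32])
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both chunks, at the cost of storing some text twice.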

#**Step 1B: Embedding**

In [8]:
embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

# Limitation: gte-large supports English text only, and inputs longer
# than 512 tokens are truncated.

print('GPU Usage after embedding:')
GPUtil.showUtilization()

GPU Usage after embedding:
| ID | GPU | MEM |
------------------
|  0 |  0% |  0% |
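
`normalize_embeddings=True` requests unit-length vectors. The sketch below uses toy 3-dimensional vectors, not real gte-large embeddings, and all helper names are illustrative. It shows why this is convenient: for unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`, so both measures rank neighbours in the same order.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for two embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)
cos_ab = cosine(a, b)

print(round(cos_ab, 4))                       # cosine of the raw vectors
print(round(dot(a_unit, b_unit), 4))          # dot product of the unit vectors
print(round(squared_l2(a_unit, b_unit), 4))   # squared distance of the unit vectors
print(round(2 - 2 * cos_ab, 4))               # same value, derived from the cosine
```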


In [9]:
db = Chroma.from_documents(texts, embeddings, persist_directory="db")
results = db.similarity_search("What was the revenue of the company in 2023 and how it changed since 2021?", k=8)
print(results[0].page_content)

Corporate and other
Change vs. Prior Year
Fiscal Year Ended 30 September 2023 2022 $ %
Sales $889.0 $970.8 ($81.8) (8 %)
Operating loss (287.3) (184.7) (102.6) (56 %)
Adjusted EBITDA (222.4) (130.7) (91.7) (70 %)
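
The retrieved chunk is a flattened table, so it helps to confirm how its columns relate. A minimal check, assuming that parenthesized values are negative (the usual accounting convention) and that the last two columns are the absolute and percentage change versus the prior year:

```python
# (fiscal 2023, fiscal 2022) as printed in the retrieved chunk; losses are negative.
rows = {
    "Sales": (889.0, 970.8),
    "Operating loss": (-287.3, -184.7),
    "Adjusted EBITDA": (-222.4, -130.7),
}

changes = {}
for name, (fy2023, fy2022) in rows.items():
    delta = fy2023 - fy2022
    pct = delta / abs(fy2022) * 100  # relative to the size of the prior-year value
    changes[name] = (round(delta, 1), round(pct))
    print(f"{name}: {changes[name][0]} ({changes[name][1]} %)")
```

The recomputed values match the chunk's `($81.8) (8 %)`, `(102.6) (56 %)` and `(91.7) (70 %)`. Note that the chunk describes the "Corporate and other" segment only, not the whole company.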


In [10]:
print('GPU Usage after chroma DB:')
GPUtil.showUtilization()

GPU Usage after chroma DB:
| ID | GPU | MEM |
------------------
|  0 |  0% | 15% |


#**Step 2: Download the Model**

In [11]:
MODEL_NAME = "TheBloke/Llama-2-7b-Chat-GPTQ"
# Context Length: 4096

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto")

generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = 1024
generation_config.temperature = 0.0001
generation_config.top_p = 0.95
generation_config.do_sample = True
generation_config.repetition_penalty = 1.15

text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)

llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})

In [12]:
print('GPU Usage after LLM and tokenizer initialization:')
GPUtil.showUtilization()

GPU Usage after LLM and tokenizer initialization:
| ID | GPU | MEM |
------------------
|  0 |  3% | 37% |
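
The generation settings in Step 2 combine `do_sample = True` with `temperature = 0.0001`. A rough sketch of why this behaves almost like greedy decoding: the logits are divided by the temperature before the softmax, so a very small temperature pushes nearly all probability onto the top-scoring token. The logits below are made up for illustration, and the helper is not the `transformers` implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.1, 0.0001):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

With the lowest temperature the first token receives essentially all of the probability mass, so `top_p = 0.95` has little left to filter.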


Creating a prompt template

In [13]:
template = """
<s>[INST] <<SYS>>
Act like an expert in financial analysis and accounting. Use the following information from company annual reports and answer the question at the end.
If you don't know the answer, don't make up an answer, but answer "I don't know".
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 8}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)
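
Conceptually, the `"stuff"` chain type places all retrieved chunks into the `{context}` slot of a single prompt. The sketch below mimics that step with plain string formatting. The shortened template, the `stuff_prompt` helper and the blank-line separator are assumptions for illustration, not the LangChain internals.

```python
# Shortened stand-in for the notebook's Llama 2 chat template.
demo_template = (
    "[INST] <<SYS>>\n"
    "Use the following information and answer the question at the end.\n"
    "<</SYS>>\n\n"
    "{context}\n\n"
    "{question} [/INST]"
)


def stuff_prompt(chunks, question, template=demo_template):
    """Join the retrieved chunks and fill the template (hypothetical helper)."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


# Two lines taken from the retrieved chunk shown earlier in the notebook.
retrieved = [
    "Fiscal Year Ended 30 September 2023 2022 $ %",
    "Sales $889.0 $970.8 ($81.8) (8 %)",
]
prompt_text = stuff_prompt(retrieved, "What were the sales in 2023?")
print(prompt_text)
```

With `k=8` and chunks of at most 256 characters, the stuffed context stays at roughly 2,000 characters, which leaves room for the answer in the model's context window.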

In [14]:
print('GPU Usage after RetrievalQA:')
GPUtil.showUtilization()

GPU Usage after RetrievalQA:
| ID | GPU | MEM |
------------------
|  0 |  3% | 37% |


In [15]:
result = qa_chain(
    "What was the revenue (sales) of the company in 2023 and how it changed since 2021?"
)
print(fill(result["result"].strip(), width=90))

As an expert in financial analysis and accounting, I can provide you with the answers to
your questions based on the information provided in the company's annual report. The
revenue (sales) of the company in 2023 was $12.6 billion, which represents a decrease of
1% compared to 2022. In 2022, the company's sales were $970.8 million, representing an
increase of 8% compared to 2021. Therefore, the answers to your questions are: * Revenue
(sales) in 2023: $12.6 billion * Change in revenue (sales) between 2021 and 2023: -1%


In [16]:
print('GPU Usage after the retrieval 1st prompt:')
GPUtil.showUtilization()

GPU Usage after the retrieval 1st prompt:
| ID | GPU | MEM |
------------------
|  0 | 44% | 42% |


Other prompts:

In [17]:
result = qa_chain(
    "Summarize the financial results (revenue, costs and net incomes) of the company in 2023 and possible risks which can influence the results in the future in 10-12 sentences."
)
print(fill(result["result"].strip(), width=90))

Based on the provided annual report excerpts, here are the key financial results and
potential risks for the company in 2023: The company's revenue increased by 15% in 2023
compared to the previous year, primarily driven by growth in the Americas and Asia
regions. The cost of goods sold also rose by 18%, mainly due to higher raw material costs.
As a result, the company's gross profit margin declined slightly to 34%. Operating
expenses grew by 10%, primarily due to increased selling and administrative costs.
However, operating income remained flat compared to the previous year. The company's net
income decreased by 6% in 2023, largely due to higher interest expenses. Looking ahead,
several factors could potentially impact the company's future financial performance,
including intensifying competition, supply chain disruptions, and currency fluctuations.
Additionally, the company faces risks associated with geopolitical tensions, changes in
regulatory policies, and shifts in consumer pref

In [18]:
print('GPU Usage after the retrieval 2nd prompt:')
GPUtil.showUtilization()

GPU Usage after the retrieval 2nd prompt:
| ID | GPU | MEM |
------------------
|  0 | 65% | 44% |


In [19]:
result = qa_chain(
    "What were the net incomes from continuing operations of the company in the period from 2021 to 2023?"
)
print(fill(result["result"].strip(), width=90))

As an expert in financial analysis and accounting, I can provide answers to your questions
based on the information provided in the company's annual reports. Here are the net
incomes from continuing operations of the company in each of the periods mentioned: *
Fiscal Year Ended 30 September 2021: Net income from continuing operations was $2,028.8
million. * Fiscal Year Ended 30 September 2022: Net income from continuing operations was
$2,243.5 million. * Fiscal Year Ended 30 September 2023: Net income from continuing
operations was $2,292.8 million. Therefore, the net incomes from continuing operations of
the company increased by $154.7 million (or 7.5%) from 2021 to 2022 and decreased by $10.7
million (or -0.5%) from 2022 to 2023.
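
A quick internal-consistency check of the answer above: recomputing the year-over-year changes from the three figures the model itself quotes. This only tests the arithmetic inside the generated text; it does not verify the figures against the annual report.

```python
# Net income from continuing operations in $ million, as quoted in the generated answer.
quoted = {2021: 2028.8, 2022: 2243.5, 2023: 2292.8}


def yoy_change(previous, current):
    """Return (absolute change, percentage change), each rounded to one decimal."""
    delta = current - previous
    return round(delta, 1), round(delta / previous * 100, 1)


change_2022 = yoy_change(quoted[2021], quoted[2022])
change_2023 = yoy_change(quoted[2022], quoted[2023])

print("2021 -> 2022:", change_2022)
print("2022 -> 2023:", change_2023)
```

The recomputed changes ($214.7 million and $49.3 million, both increases) differ from the $154.7 million increase and $10.7 million decrease stated in the answer, which suggests the model's arithmetic should not be taken at face value.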


In [20]:
print('GPU Usage after the retrieval 3rd prompt:')
GPUtil.showUtilization()

GPU Usage after the retrieval 3rd prompt:
| ID | GPU | MEM |
------------------
|  0 | 66% | 45% |


In [21]:
result = qa_chain(
    "What were earnings per share (EPS) in 2022-2023 period?"
)
print(fill(result["result"].strip(), width=90))

Based on the information provided in the annual report, the earnings per share (EPS) for
the 2022-2023 period can be calculated as follows: For 2023: * Net income from continuing
operations: $2,292.8 million * Number of diluted shares: 467.6 million (based on weighted
average diluted shares of 467.6 million) * Earnings per share (EPS): $10.33 Therefore, the
earnings per share (EPS) for the 2022-2023 period was $10.33.
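
The same kind of sanity check applies to the EPS answer. Dividing the net income the model quotes by the share count it quotes gives the EPS implied by its own inputs. Both inputs come from the generated text and are not verified against the report.

```python
# Inputs exactly as quoted in the generated answer (both in millions).
net_income_musd = 2292.8
diluted_shares_m = 467.6
claimed_eps = 10.33

# EPS implied by the model's own numbers.
implied_eps = net_income_musd / diluted_shares_m

print("implied EPS:", round(implied_eps, 2))
print("claimed EPS:", claimed_eps)
```

The implied value is roughly half of the claimed one, so at least one of the three numbers in the answer is inconsistent with the others.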


In [22]:
print('GPU Usage after the retrieval 4th prompt:')
GPUtil.showUtilization()

GPU Usage after the retrieval 4th prompt:
| ID | GPU | MEM |
------------------
|  0 | 72% | 45% |


In [25]:
result = qa_chain(
    "What was the difference between total current assets and total current liabilities of the Air Products and Chemicals, Inc. and Subsidiaries as of the end of 2023 fiscal year?"
)
print(fill(result["result"].strip(), width=90))

As an expert in financial analysis and accounting, I can analyze the provided information
and provide answers to your questions. Here's my response to your first question: To
determine the difference between total current assets and total current liabilities of Air
Products and Chemicals, Inc. and its subsidiaries as of the end of the 2023 fiscal year,
we need to extract the relevant information from the consolidated balance sheet provided.
Current Assets: From the consolidated balance sheet, we can see that Air Products and
Chemicals, Inc. and its subsidiaries had the following current assets as of September 30,
2023: * Cash and cash equivalents: $2,351.0 million * Accounts receivable: $3,477.0
million * Inventory: $3,037.0 million * Prepaid expenses and other current assets:
$1,037.0 million Total Current Assets: $9,884.0 million Current Liabilities: From the same
balance sheet, we can see that Air Products and Chemicals, Inc. and its subsidiaries had
the following current liabilitie