# **Bloomberg GPT** - A RAG system with Generative AI

## **Imports**

I used LangChain to:
- Open the PDF and turn it into text pages
- Create the chunks and define the chunking parameters
- Store the chunks, already embedded with OpenAI, in a vector database (Chroma)
- Retrieve the chunks from the vector store to answer the questions

I used OpenAI to:
- Create the chunk embeddings
- Create the embedding of the question to look for similar chunks (using embedding similarity)
- Access LLMs that take the question and generate an answer

In [1]:
!pip install langchain openai chromadb tiktoken pypdf sentence_transformers -U langchain-community --quiet

In [2]:
import os

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains.retrieval_qa.base import RetrievalQA

from langchain.prompts import PromptTemplate

In [3]:
os.environ["OPENAI_API_KEY"] = ""

## **Functions to prepare the text to make the questions**

load_pdf - opens the PDF and extracts the text

create_smart_chunks - divides the whole content into chunks, with overlap to help maintain context

store_in_vector_db - takes each chunk created, embeds it with OpenAI, and stores it in a vector database

process_pdf_for_rag - runs all the above functions sequentially

In [7]:
def load_pdf(pdf_path):
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()
    return pages

def create_smart_chunks(pages, chunk_size=800, chunk_overlap=200):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    chunks = text_splitter.split_documents(pages)

    #print(f"Number of chunks: {len(chunks)}")

    # for i, chunk in enumerate(chunks[:30]):
    #     print(f"Chunk {i + 1}:")
    #     print(chunk)
    #     print("-" * 80)

    return chunks

def store_in_vector_db(chunks, vector_db_path='./chroma_store'):

    embedding_model = OpenAIEmbeddings(model='text-embedding-3-small', disallowed_special=())  # Allow all special tokens
    vectorstore = Chroma.from_documents(chunks, embedding=embedding_model, persist_directory=vector_db_path)
    vectorstore.persist()
    return vectorstore

def process_pdf_for_rag(pdf_path):

    pages = load_pdf(pdf_path)
    chunks = create_smart_chunks(pages)
    vectorstore = store_in_vector_db(chunks)

    return vectorstore
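
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size windows that share characters with their neighbours. It is a simplification: RecursiveCharacterTextSplitter splits on separators (paragraphs, lines, spaces) before falling back to characters, so its chunks will not match these exactly. The name sliding_chunks is hypothetical and not part of LangChain.

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(demo, chunk_size=10, chunk_overlap=4)
for piece in demo_chunks:
    print(piece)
```

Each window starts chunk_size - chunk_overlap characters after the previous one, so the tail of one chunk is repeated at the head of the next. That repetition is what helps a sentence cut at a chunk boundary keep some of its context.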


## **Answer Generation**

- Set the file to use (the BloombergGPT PDF)

- Execute the functions above to prepare the data

- Call an LLM (gpt-4o-mini, in this case) and set it up (access to the vector store, number of chunks retrieved to build the answer, etc.)

- Frame the questions and generate the answers

- See the similarity score of each chunk retrieved for a question
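
As a rough illustration of what the "stuff" chain type does with the retrieved chunks, the sketch below concatenates them into a single prompt together with the question. The function name build_stuff_prompt and the prompt wording are assumptions made up for this example; they are not LangChain's actual template.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Put every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for what the retriever would return
demo_prompt = build_stuff_prompt(
    "What is BloombergGPT?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(demo_prompt)
```

Because all chunks go into one prompt, the number of chunks retrieved (k) times the chunk size has to fit in the model's context window.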

In [14]:
# FILE
pdf_path = "/content/bloomberggpt.pdf"

# VECTOR DATABASE
vectorstore = process_pdf_for_rag(pdf_path)

# Question and Answers
llm = ChatOpenAI(model="gpt-4o-mini")

qa_chain = RetrievalQA.from_chain_type(
                    llm,
                    chain_type="stuff",
                    retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
                    return_source_documents=True,
                    chain_type_kwargs={"verbose": False})

question = "summarize the BloombergGPT article"
answer = qa_chain.invoke(input=question)['result']
answer

#number of chunks
pages = load_pdf(pdf_path)
chunks = create_smart_chunks(pages)
print(f"Number of chunks: {len(chunks)}")

# # Similarity of retrieved chunks
# chunk_and_score = vectorstore.similarity_search_with_score(question, k=3)
# # print(chunk_and_score)

# for chunk, score in chunk_and_score:
#   print(f"Text: '{chunk.page_content}'")
#   print()
#   print(f"Score: {score}")
#   print("-" * 40)


Number of chunks stored: 362


## **Experiments: Questions to test the model**

I wrote a set of questions and defined the ideal answer/context for each one so I could test the model's accuracy. The question/answer pairs are split into 2 categories:

- General content questions: test the ability to answer from the article's general content

- Table content questions: test the ability to answer from information in the tables


#### Questions

In [15]:
general_content_questions = [
  "Summarize the BloombergGPT article",
  "Explain BloombergGPT model architecture. in plain text, no formulas",
  "What is BloombergGPT and why was it developed?",
  "Tell me about BloombergGPT hardware stack"
  #"How is the dataset for BloombergGPT different from general-purpose LLM datasets?",
  #"What are the main financial tasks that BloombergGPT is optimized for?",
  #"What were some of the challenges encountered during the training of BloombergGPT?",
  #"What are the key benefits of using a domain-specific model like BloombergGPT in the financial industry?",
  #"How does BloombergGPT perform in financial tasks compared to other models like GPT-3?",
  #"What are the top three data sources by token contribution"
  ]

# table_questions = [
#   "According to Table 1, what percentage of BloombergGPT's training data comes from public datasets?",
#   "Which dataset contributes the highest number of tokens in the financial-specific data as shown in Table 2?",
#   "What are the top three data sources by token contribution in the public datasets, as per Table 1?",
#   "From Table 3, how does BloombergGPT's tokenizer compare to that of GPT-NeoX in terms of tokens in 'The Pile' dataset?",
#   "What is the total percentage of tokens from financial-specific datasets used in training BloombergGPT, according to Table 1?"
#   ]

#### Reference Answers/Context

In [17]:
reference_general_content_answers = [
    "The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg’s extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on stan- dard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our model- ing choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.",
    "Our model is a decoder-only causal language model based on BLOOM (Scao et al., 2022). We present an overview of the architecture, with full details in Appendix A. The model contains 70 layers of transformer decoder blocks defined as follows: h ̄l = hl−1 + SA(LN(hl−1)) hl = h ̄l + FFN(LN(h ̄l)) where SA is multi-head self-attention, LN is layer-normalization, and FFN is a feed-forward network with 1-hidden layer. Inside FFN, the non-linear function is GELU (Hendrycks and Gimpel, 2016). ALiBi positional encoding is applied through additive biases at the self- attention component of the transformer network (Le Scao et al., 2022). The input token embeddings are tied to the linear mapping before the final softmax. Following Le Scao et al. (2022) and first used in Dettmers et al. (2022), the model has an additional layer normalization after token embeddings, formally: h ̄1 = LNem(h0) + SA(LN(LNem(h0))), where h0 is the initial token embedding and LNem is the new component of embedding layer- normalization. Notice that the second term includes two consecutive layer-normalizations.",
    "We train BloombergGPT, a 50 billion parameter language model that supports a wide range of tasks within the financial industry. Rather than building a general-purpose LLM, or a small LLM exclusively on domain-specific data, we take a mixed approach. General 3 models cover many domains, are able to perform at a high level across a wide variety of tasks, and obviate the need for specialization during training time. However, results from existing domain-specific models show that general models cannot replace them. At Bloomberg, we support a very large and diverse set of tasks, well served by a general model, but the vast majority of our applications are within the financial domain, better served by a specific model. For that reason, we set out to build a model that achieves best-in-class results on financial benchmarks, while also maintaining competitive performance on general-purpose LLM benchmarks.",
    "Hardware Stack. We use the Amazon SageMaker service provided by AWS to train and evaluate BloombergGPT. We use the latest version available at the time of training and train on a total of 64 p4d.24xlarge instances. Each p4d.24xlarge instance has 8 NVIDIA 40GB A100 GPUs with NVIDIA NVSwitch intra-node connections (600 GB/s) and NVIDIA GPUDirect using AWS Elastic Fabric Adapter (EFA) inter-node connections (400 Gb/s). This yields a total of 512 40GB A100 GPUs. For quick data access, we use Amazon FSX for Lustre, which supports up to 1000 MB/s read and write throughput per TiB storage unit."
]

#### Answers generated by the model

In [18]:
models_general_content_answers = []

for question in general_content_questions:
  answer = qa_chain.invoke(input=question)['result']
  models_general_content_answers.append(answer)

for answer in models_general_content_answers:
  print(answer)
  print("-" * 50)

BloombergGPT is a specialized large language model (LLM) designed for financial applications, comprising 50 billion parameters and trained on a comprehensive dataset called "FinPile," which includes a mix of financial documents and general-purpose texts. The model aims to improve interactions with financial data, making it easier to generate valid Bloomberg Query Language (BQL) queries and assist journalists in crafting news articles and newsletters. 

BloombergGPT demonstrates strong performance on various financial tasks and outperforms many existing models, including some larger ones, while maintaining competitive results on general NLP benchmarks. Its development benefited from insights gained from existing models and emphasized the importance of high-quality training datasets. The article discusses the model's architecture, training methodology, and evaluation results, while also addressing ethical considerations and limitations in the use of such technology in the finance sector.

#### Chunks retrieved and their similarity score for each question:

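The LangChain method similarity_search_with_score() takes the question as input and retrieves the most similar chunks. With Chroma, the score it returns is a distance, so the lower the score, the more similar the chunk. Which distance is used (squared L2 by default, or cosine) depends on how the Chroma collection is configured.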
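
The relation between a distance score and cosine similarity can be checked with a small sketch. It assumes unit-length embedding vectors, for which squared L2 distance equals 2 - 2 * cosine similarity, so both give the same ranking. The vectors and helper names below are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy "embeddings": one close to the query, one pointing elsewhere
query = normalize([1.0, 2.0, 3.0])
near = normalize([1.0, 2.0, 2.5])
far = normalize([3.0, -1.0, 0.5])

for name, vec in [("near", near), ("far", far)]:
    cos = cosine_similarity(query, vec)
    dist = squared_l2(query, vec)
    print(f"{name}: cosine={cos:.4f}  squared_l2={dist:.4f}  2-2*cos={2 - 2 * cos:.4f}")
```

The closer chunk has the higher cosine similarity and the lower distance, which is why a distance score has to be read in the opposite direction from a similarity score.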

In [19]:
for quest in general_content_questions:
  chunk_and_score = vectorstore.similarity_search_with_score(quest, k=4)

  print(f'Question: {quest}')
  print()

  for i in range(min(3, len(chunk_and_score))):
      chunk = chunk_and_score[i]
      print(f'Chunk {i + 1}: {chunk[0].page_content}')
      print()
      print(f'Score: {chunk[1]}')
      print("-" * 50)

Question: Summarize the BloombergGPT article

Chunk 1: specialization.
Generation of Bloomberg Query Language. One use case for BloombergGPT is to
make interactions with financial data more natural. An existing way to retrieve data is via
the Bloomberg Query Language (BQL). BQL can be used to interact with different classes
of securities, each with its own fields, functions, and parameters. BQL is an incredibly
powerful but complex tool. As we show in Figure 4, BloombergGPT can be utilized to
make BQL more accessible by transforming natural language queries into valid BQL.
Suggestion of News Headlines. Other use cases that are well supported are in the news
space. Since it is trained on many news articles, it can be used for many news applications
and assist journalists in their day-to-day work. For example, when constructing newsletters,

Score: 0.700397789478302
--------------------------------------------------
Chunk 2: results, such as OPT (Zhang et al., 2022a), did not match the p

#### Metric 1: **BLEU**

BLEU measures how many of the n-grams in the model's answer also appear in the reference answer. With the weights parameter, you determine how much emphasis (weight) you give to 1-grams, 2-grams, and so on. Since I only provided context and not actual, correct answers, I put more weight on 1-grams, because it is important that some single, important words from the context appear in the model's answer.

As a rough rule of thumb, a score between 0.3 and 0.5 is generally considered good.
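
To show how the n-gram weights enter the score, here is a minimal pure-Python sketch of BLEU for a single reference: clipped n-gram precisions, combined as a weighted geometric mean and multiplied by a brevity penalty. It is a simplification that applies no smoothing (unlike the smoothing method used with NLTK in this notebook) and returns 0 when a weighted precision is 0. The function names and the toy sentences are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def simple_bleu(reference, candidate, weights=(0.5, 0.5)):
    precisions = [modified_precision(reference, candidate, n + 1)
                  for n in range(len(weights))]
    # Without smoothing, a zero precision makes the geometric mean zero
    if any(p == 0 for p, w in zip(precisions, weights) if w > 0):
        return 0.0
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions) if w > 0)
    # Brevity penalty: penalize candidates shorter than the reference
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


reference = "the cat sat on the mat".split()
candidate = "the cat sat on the rug".split()

print(f"equal weights:  {simple_bleu(reference, candidate, (0.5, 0.5)):.4f}")
print(f"1-gram heavier: {simple_bleu(reference, candidate, (0.7, 0.3)):.4f}")
```

In this example the 1-gram precision (5/6) is higher than the 2-gram precision (4/5), so shifting weight toward 1-grams raises the score. The same effect is why 1-gram-heavy weights are more forgiving when the reference is context rather than an exact answer.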

In [23]:
#!pip install nltk --quiet

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

#nltk.download('punkt')

reference_tokens = [nltk.word_tokenize(answer) for answer in reference_general_content_answers]
generated_tokens = [nltk.word_tokenize(answer) for answer in models_general_content_answers]

bleu_scores = []

for ref, gen in zip(reference_tokens, generated_tokens):

    weights = (0.70, 0.20, 0.05, 0.05)
    smoothing_function = SmoothingFunction().method1
    reference_list = [ref]

    bleu_score = sentence_bleu(
        reference_list,
        gen,
        weights=weights,
        smoothing_function=smoothing_function,
        auto_reweigh=False
    )

    bleu_scores.append(bleu_score)

for i, (score, question) in enumerate(zip(bleu_scores, general_content_questions)):
    print(f'Question {i+1}: {question}')
    print(f"BLEU score: {score:.6f}")
    print("-" * 50)


Question 1: Summarize the BloombergGPT article
BLEU score: 0.173698
--------------------------------------------------
Question 2: Explain BloombergGPT model architecture. in plain text, no formulas
BLEU score: 0.155016
--------------------------------------------------
Question 3: What is BloombergGPT and why was it developed?
BLEU score: 0.161842
--------------------------------------------------
Question 4: Tell me about BloombergGPT hardware stack
BLEU score: 0.703667
--------------------------------------------------


#### Metric 2: **ROUGE**

ROUGE looks at how many of the unigrams, bigrams, and longer sequences from the reference answer (the context, in this case) are covered by the generated content. It is more suitable for the purpose of this model, given the lack of correct reference answers.
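
The sketch below shows, in plain Python, how ROUGE-N and ROUGE-L precision, recall, and F-measure can be computed from token overlap and the longest common subsequence. It is a simplification that splits on whitespace and applies no stemming, so its numbers will differ from the rouge_score package. The function names and toy sentences are made up for illustration.

```python
from collections import Counter


def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=1):
    """Precision, recall and F-measure of overlapping n-grams."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    overlap = sum((ref & cand).values())  # counts clipped to the smaller side
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall, f_measure(precision, recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    lcs = lcs_length(reference, candidate)
    precision = lcs / max(len(candidate), 1)
    recall = lcs / max(len(reference), 1)
    return precision, recall, f_measure(precision, recall)


reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

for name, (p, r, f) in [("ROUGE-1", rouge_n(reference, candidate, 1)),
                        ("ROUGE-2", rouge_n(reference, candidate, 2)),
                        ("ROUGE-L", rouge_l(reference, candidate))]:
    print(f"{name}: precision={p:.4f} recall={r:.4f} f={f:.4f}")
```

A short answer that only repeats words from the reference gets perfect precision but low recall, which is why the F-measure is a reasonable single number to report.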

In [25]:
#!pip install rouge-score --quiet

from rouge_score import rouge_scorer
import nltk

#nltk.download('punkt')

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

reference = reference_general_content_answers
generated = models_general_content_answers

rouge_scores = []

for ref, gen in zip(reference, generated):
    score = scorer.score(ref, gen)
    rouge_scores.append(score)

for i, (score, question) in enumerate(zip(rouge_scores, general_content_questions)):
    print(f'Question {i+1}: {question}')
    print()
    print(f"ROUGE score for question {i + 1}:")
    print(f"ROUGE-1: {score['rouge1'].fmeasure:.4f}")
    print(f"ROUGE-2: {score['rouge2'].fmeasure:.4f}")
    print(f"ROUGE-L: {score['rougeL'].fmeasure:.4f}")
    print("-" * 50)


Question 1: Summarize the BloombergGPT article

ROUGE score for question 1:
ROUGE-1: 0.4587
ROUGE-2: 0.1108
ROUGE-L: 0.2141
--------------------------------------------------
Question 2: Explain BloombergGPT model architecture. in plain text, no formulas

ROUGE score for question 2:
ROUGE-1: 0.3494
ROUGE-2: 0.1212
ROUGE-L: 0.2048
--------------------------------------------------
Question 3: What is BloombergGPT and why was it developed?

ROUGE score for question 3:
ROUGE-1: 0.3934
ROUGE-2: 0.0726
ROUGE-L: 0.2164
--------------------------------------------------
Question 4: Tell me about BloombergGPT hardware stack

ROUGE score for question 4:
ROUGE-1: 0.8458
ROUGE-2: 0.6633
ROUGE-L: 0.6965
--------------------------------------------------


#### Metric 3: **BERT score**

BERTScore uses contextual embeddings from BERT to compare the semantic meaning of the reference/context and the model's answer, matching tokens across both texts. Scores typically fall between 0 and 1. The closer to 1, the more similar in meaning they are.
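
The token matching behind BERTScore can be sketched without a model: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and the two are combined into F1. The 2-dimensional vectors below are made-up stand-ins for contextual token embeddings, and the function names are hypothetical. The bert_score package also offers options such as idf weighting that this sketch leaves out.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Precision, recall and F1 from greedy token-to-token matching."""
    # Precision: how well each candidate token is covered by the reference
    precision = sum(max(cosine(c, r) for r in reference_vecs)
                    for c in candidate_vecs) / len(candidate_vecs)
    # Recall: how well each reference token is covered by the candidate
    recall = sum(max(cosine(r, c) for c in candidate_vecs)
                 for r in reference_vecs) / len(reference_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up token embeddings: the reference has one token the candidate lacks
reference_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]

bs_precision, bs_recall, bs_f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(f"Precision: {bs_precision:.4f}")
print(f"Recall: {bs_recall:.4f}")
print(f"F1: {bs_f1:.4f}")
```

Here every candidate token has an exact counterpart in the reference, so precision is 1, while the unmatched reference token pulls recall down. A gap between precision and recall therefore hints that the answer leaves out part of the reference.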

In [28]:
#!pip install bert_score --quiet

from bert_score import score

references = reference_general_content_answers
models_answers = models_general_content_answers

# Compute BERTScore
P, R, F1 = score(models_answers, references, lang='en', verbose=False)


for i in range(len(references)):
    print(f'Question {i+1}: {general_content_questions[i]}')
    print()
    print(f"Generated Answer {i + 1}:")
    print(f"Precision: {P[i]:.4f}")
    print(f"Recall: {R[i]:.4f}")
    print(f"F1: {F1[i]:.4f}")
    print("-" * 50)


Question 1: Summarize the BloombergGPT article

Generated Answer 1:
Precision: 0.8734
Recall: 0.8650
F1: 0.8692
--------------------------------------------------
Question 2: Explain BloombergGPT model architecture. in plain text, no formulas

Generated Answer 2:
Precision: 0.8546
Recall: 0.7881
F1: 0.8200
--------------------------------------------------
Question 3: What is BloombergGPT and why was it developed?

Generated Answer 3:
Precision: 0.8541
Recall: 0.8626
F1: 0.8583
--------------------------------------------------
Question 4: Tell me about BloombergGPT hardware stack

Generated Answer 4:
Precision: 0.9583
Recall: 0.9498
F1: 0.9540
--------------------------------------------------


#### Metric 4: **Cosine Similarity**

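It embeds the whole reference/context answer and the whole generated answer, then compares their semantic similarity by the cosine of the angle between the two embeddings. Here each embedding is the [CLS] token vector from bert-base-uncased, and the tokenizer truncates inputs longer than the model's maximum length (512 tokens). Cosine similarity ranges from -1 to 1. The closer to 1, the more similar in meaning they are.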

In [30]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

reference_answers = reference_general_content_answers
model_answers = models_general_content_answers

similarity_scores = []

for ref, gen in zip(reference_answers, model_answers):
    # Tokenize
    inputs_ref = tokenizer(ref, return_tensors='pt', truncation=True, padding=True)

    # Context. Embeddings
    with torch.no_grad():
        outputs_ref = model(**inputs_ref)
    ref_embedding = outputs_ref.last_hidden_state[:, 0, :].numpy()

    # Tokenize
    inputs_gen = tokenizer(gen, return_tensors='pt', truncation=True, padding=True)

    # Context. Embeddings
    with torch.no_grad():
        outputs_gen = model(**inputs_gen)
    gen_embedding = outputs_gen.last_hidden_state[:, 0, :].numpy()

    similarity_score = cosine_similarity(ref_embedding, gen_embedding)
    similarity_scores.append(similarity_score[0][0])

for i, (score, question) in enumerate(zip(similarity_scores, general_content_questions)):
    print(f'Question {i+1}: {question}')
    print(f"Cosine Similarity between both answers {i + 1}: {score:.4f}")
    print("-" * 50)


Question 1: Summarize the BloombergGPT article
Cosine Similarity between both answers 1: 0.8349
--------------------------------------------------
Question 2: Explain BloombergGPT model architecture. in plain text, no formulas
Cosine Similarity between both answers 2: 0.7369
--------------------------------------------------
Question 3: What is BloombergGPT and why was it developed?
Cosine Similarity between both answers 3: 0.8595
--------------------------------------------------
Question 4: Tell me about BloombergGPT hardware stack
Cosine Similarity between both answers 4: 0.9034
--------------------------------------------------
