<a href="https://colab.research.google.com/github/winterForestStump/thesis/blob/main/notebooks/experiment_Coca_Cola_chain_filter_flashrerank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip -q install langchain chromadb --quiet
%pip -q install sentence_transformers --quiet
%pip -q install -U FlagEmbedding --quiet
%pip install huggingface_hub --quiet
%pip install -q -U peft accelerate optimum --quiet
%pip install transformers==4.37.2 --quiet # downgrading needed to avoid AttributeError: 'LlamaRotaryEmbedding' object has no attribute 'cos_cached'
%pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --quiet
%pip install GPUtil --quiet
%pip install unstructured --quiet
%pip install --upgrade langsmith langchainhub --quiet
%pip install jq --quiet
%pip install tqdm --quiet
%pip install numpy==1.24.4 --quiet
%pip install --upgrade --quiet  flashrank

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import JSONLoader

import chromadb

from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.storage._lc_store import create_kv_docstore
from langchain.storage.file_system import LocalFileStore
from langchain.document_loaders import TextLoader

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate

from operator import itemgetter

from langchain.chains import RetrievalQA
from langchain import PromptTemplate
from langchain import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
import torch
import GPUtil
import pandas as pd
import os
from tqdm import tqdm

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

BGE models, available on the Hugging Face Hub, are among the strongest open-source embedding models. They are developed by the Beijing Academy of Artificial Intelligence (BAAI), a private non-profit organization engaged in AI research and development.

In [None]:
model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},  # run the embedding model on the GPU
    encode_kwargs=encode_kwargs
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
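
The comment on `normalize_embeddings` above can be checked with a small sketch: once two vectors are scaled to unit length, their plain dot product equals their cosine similarity. The helper names and the toy three-dimensional vectors are assumptions made for illustration, not real BGE output.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (the effect normalize_embeddings=True is meant to have)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy "embeddings" (hypothetical values)
query_vec = [3.0, 4.0, 0.0]
chunk_vec = [1.0, 2.0, 2.0]

cos = cosine_similarity(query_vec, chunk_vec)
dot_of_normalized = dot(l2_normalize(query_vec), l2_normalize(chunk_vec))
print(cos, dot_of_normalized)  # the two values agree up to floating-point error
```

This is why a vector store that ranks by inner product behaves like a cosine-similarity search when the embeddings are normalized.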


### File Directory:
This section covers how to load all documents in a directory.

Under the hood, `DirectoryLoader` uses `UnstructuredFileLoader` by default.

In [None]:
# Define the metadata extraction function.
def metadata_function(record: dict, metadata: dict) -> dict:

    metadata["cik"] = record.get("cik")
    metadata["company"] = record.get("company")
    metadata["filing_type"] = record.get("filing_type")
    metadata["filing_date"] = record.get("filing_date")
    metadata["period_of_report"] = record.get("period_of_report")
    metadata["state_location"] = record.get("state_location")
    metadata["fiscal_year_end"] = record.get("fiscal_year_end")
    metadata["htm_filing_link"] = record.get("htm_filing_link")
    metadata["filename"] = record.get("filename")

    return metadata
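
The function above has the `(record, metadata)` signature that `JSONLoader` accepts as its `metadata_func` argument. A minimal sketch of what such a call does, using a shortened copy of the function and a hypothetical record (the field values and the pre-filled `source` and `seq_num` entries are assumptions for illustration):

```python
def metadata_function(record: dict, metadata: dict) -> dict:
    """Copy selected filing fields from a JSON record into the document metadata."""
    for key in ("cik", "company", "filing_type", "filing_date"):
        # .get() returns None when the record lacks the field
        metadata[key] = record.get(key)
    return metadata

# Hypothetical record; the real filings carry more fields
sample_record = {"cik": "0000000000", "company": "COCA COLA CO", "filing_type": "10-K"}

# Metadata assumed to be pre-filled by the loader before the function is called
base_metadata = {"source": "example.json", "seq_num": 1}

enriched = metadata_function(sample_record, base_metadata)
print(enriched)
```

Because missing fields become `None` rather than raising an error, a record with an incomplete header still loads. The `company` field copied here is the one used later to filter retrieval results.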

### Questions Creation:

In [None]:
questions = pd.read_fwf("https://raw.githubusercontent.com/winterForestStump/thesis/main/questions/questions_ver2.txt", names=['question'])
#questions = pd.read_csv("https://raw.githubusercontent.com/winterForestStump/financebench/main/financebench_sample_150.csv")
questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  35 non-null     object
dtypes: object(1)
memory usage: 408.0+ bytes


In [None]:
# Here you need to enter the company name
company = 'COCA COLA CO'
questions['question'] = questions['question'].str.replace('company', company)

In [None]:
questions['question'][0]

'What is the total revenue generated by the COCA COLA CO and how has the revenue changed over the past few years?'
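
One caveat of plain substring replacement is that it also rewrites `company` inside longer words. The sketch below compares it with a whole-word replacement. The two question templates are hypothetical and are not taken from the questions file.

```python
import re

company = "COCA COLA CO"

# Hypothetical question templates using the same 'company' placeholder
templates = [
    "What is the company's net income?",
    "What do the accompanying notes say about the company?",
]

# Plain substring replacement, as in the cell above
plain = [t.replace("company", company) for t in templates]

# Whole-word replacement: \b only matches at word boundaries,
# so 'accompanying' is left intact while "company's" is still replaced
whole_word = [re.sub(r"\bcompany\b", company, t) for t in templates]

for before, after in zip(plain, whole_word):
    print(before)
    print(after)
```

If none of the questions contains such a word, both approaches give the same result.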

When splitting documents for retrieval, there are often conflicting goals:

* You may want small documents, so that their embeddings most accurately reflect their meaning. If a document is too long, its embedding can lose meaning.
* You want documents long enough that the context of each chunk is retained.

The `ParentDocumentRetriever` strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks, then looks up the parent ids for those chunks and returns those larger documents.

Note that “parent document” refers to the document that a small chunk originated from. This can either be the whole raw document OR a larger chunk.

Sometimes the full documents are too big to retrieve as they are. In that case, we first split the raw documents into larger chunks and then split those into smaller chunks. We index the smaller chunks, but on retrieval we return the larger chunks (still not the full documents).
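
The two-level scheme described above can be sketched in plain Python. This is not how `ParentDocumentRetriever` is implemented: fixed-width slicing stands in for `RecursiveCharacterTextSplitter`, a substring test stands in for embedding similarity search, and all function names are hypothetical.

```python
def split_text(text, chunk_size):
    """Naive fixed-width splitter (a stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_index(document, parent_size=2000, child_size=256):
    """Return the parent store and a list of (child_text, parent_id) pairs."""
    parents = dict(enumerate(split_text(document, parent_size)))
    children = [
        (child, parent_id)
        for parent_id, parent in parents.items()
        for child in split_text(parent, child_size)
    ]
    return parents, children

def retrieve_parents(query, parents, children):
    """Match the query against the small chunks, then return their de-duplicated parents."""
    hit_ids = []
    for child, parent_id in children:
        # Substring match stands in for a vector similarity search
        if query in child and parent_id not in hit_ids:
            hit_ids.append(parent_id)
    return [parents[pid] for pid in hit_ids]

# Synthetic document with one searchable phrase in the middle
document = "a" * 2100 + " net revenue " + "b" * 2500
parents, children = build_index(document)
results = retrieve_parents("net revenue", parents, children)
print(len(parents), len(children), [len(r) for r in results])
```

The search runs over the 256-character chunks, but the caller receives the 2000-character parent that contains the match, which gives the language model more surrounding context.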

In [None]:
persistent_client = chromadb.PersistentClient('/content/drive/MyDrive/Thesis/chromadb')
collection = persistent_client.get_or_create_collection("Reports")

fs = LocalFileStore('/content/drive/MyDrive/Thesis/reports_store_location')
store = create_kv_docstore(fs)

# This text splitter is used to create the parent documents - The big chunks
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents - The small chunks
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(client = persistent_client,
                     collection_name="Reports",
                     embedding_function=bge_embeddings,
                     persist_directory='/content/drive/MyDrive/Thesis/chromadb')

vectorstore.persist()

In [None]:
# Number of parent chunks retrieved
NUM_PAR_CHUNKS = 6

big_chunks_retriever = ParentDocumentRetriever(
    # The underlying vectorstore to use to store small chunks and their embedding vectors
    vectorstore=vectorstore,
    # The storage interface for the parent documents
    docstore=store,
    # The text splitter to use to create child documents.
    child_splitter=child_splitter,
    # The text splitter to use to create parent documents.
    parent_splitter=parent_splitter,
    search_kwargs={'filter': {'company': company}, 'k': NUM_PAR_CHUNKS}
)

# By default the search_type is 'similarity'; 'mmr' and 'similarity_score_threshold' are also available

In [None]:
N_DOCS_RETURN = 2

compressor = FlashrankRerank(top_n = N_DOCS_RETURN)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=big_chunks_retriever
)
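
The retrieve-then-rerank pattern configured above (fetch `NUM_PAR_CHUNKS` candidates, keep the best `N_DOCS_RETURN`) can be sketched as follows. The word-overlap score is only a toy stand-in for the cross-encoder model that FlashRank runs, and the six passages are hypothetical.

```python
def overlap_score(query, passage):
    """Toy relevance score: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words)

def rerank(query, passages, top_n):
    """Score every retrieved passage against the query and keep the top_n best."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:top_n]

# Six hypothetical parent chunks, as a first-stage retriever might return them
retrieved = [
    "Open commodity derivatives and hedging instruments",
    "Net operating revenues increased in 2021",
    "Consolidated balance sheets and total equity",
    "Total net operating revenues by segment",
    "Report of the independent auditors",
    "Share repurchase activity during the period",
]

top = rerank("total net operating revenues", retrieved, top_n=2)
print(top)
```

Retrieving more candidates than are finally kept lets the reranker promote a relevant chunk that the embedding search ranked low, while the prompt still stays short.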

In [None]:
store_data = vectorstore.get()  # fetch once; the collection is large
metadatas = store_data['metadatas']  # avoid shadowing the built-in all()
doc_ids = [m['doc_id'] for m in metadatas]
ciks = [m['cik'] for m in metadatas]

print(f"The number of unique companies: {len(set(ciks))}")
print(f"The number of parent documents: {len(set(doc_ids))}")
print(f"The number of child documents: {len(store_data['documents'])}")

The number of unique companies: 32
The number of parent documents: 17561
The number of child documents: 356275


### Creating the Pipeline:

Explanation of the parameters from the `generation_config`:
* `max_new_tokens`: the maximum number of tokens that can be generated in the output.
* `do_sample`: when set to `True`, enables probabilistic sampling from the model's distribution over possible next tokens, which introduces randomness and variety into the generated text. When set to `False`, the model always picks the most likely next token, giving deterministic and less varied outputs.
* `temperature`: controls how much randomness is introduced into the sampling process. A lower value (closer to 0) concentrates probability on the most likely tokens, resulting in less random outputs, while a higher value (closer to 1 and above) encourages more randomness and diversity.
* `top_p`: controls nucleus sampling, which samples only from the smallest set of most probable tokens whose cumulative probability reaches the threshold `top_p`. It helps generate text that is both diverse and coherent by excluding very low-probability tokens that could make the text nonsensical.
* `top_k`: limits the sampling pool to the `k` most likely next tokens, further narrowing the set of tokens the model considers at each step.
* `repetition_penalty`: discourages the model from repeating the same tokens or phrases. A value greater than 1 penalizes, and thus reduces, the likelihood of tokens that have already appeared.
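
A minimal sketch of how the parameters listed above reshape a next-token distribution. It is a simplified illustration on a hypothetical four-token vocabulary, not the `transformers` implementation. The function names, the logit values and the `top_k=3`, `top_p=0.80` settings are assumptions chosen so that each filter visibly removes a token.

```python
import math

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the score of tokens that already appeared (one common formulation)."""
    out = list(logits)
    for i in set(seen_token_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a low temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized sampling pool

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary

sharp = softmax_with_temperature(logits, temperature=0.0001)
flat = softmax_with_temperature(logits, temperature=1.0)
candidates = filter_top_k_top_p(flat, top_k=3, top_p=0.80)
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 3], penalty=1.10)

print(sharp)
print(candidates)
print(penalized)
```

With a temperature as low as `0.0001`, essentially all probability mass lands on the single most likely token, so sampling behaves almost like greedy decoding even though `do_sample` is `True`.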

In [None]:
MODEL_NAME = "TheBloke/Llama-2-7b-Chat-GPTQ"
TEMPERATURE = 0.0001
MAX_NEW_TOKENS = 2048
TOP_P = 0.90
REPETITION_PENALTY = 1.10

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto")

generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = MAX_NEW_TOKENS
generation_config.temperature = TEMPERATURE
generation_config.top_p = TOP_P
generation_config.do_sample = True
generation_config.repetition_penalty = REPETITION_PENALTY


text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)

llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": TEMPERATURE})

### Set Up the Chat Prompt Template
A chat prompt template structures the interaction with the LLM. It includes placeholders for the context and the question, which are filled in dynamically when the chain runs.

In [None]:
template = """
<s>[INST] <<SYS>>
Use the following information from company annual reports and answer the question at the end.
If the answer is not contained in the provided information or if there is NO context at all, say "The answer is not in the context".
<</SYS>>

{context}

{question} [/INST]
"""

prompt = ChatPromptTemplate.from_template(template)
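
To see what the model receives once the placeholders are filled, the same template can be formatted with plain `str.format`. The chunk texts are hypothetical, and joining them with blank lines is an assumption of this sketch: the chain itself hands the retrieved documents to the prompt as they are.

```python
# Same template text as in the cell above
template = """
<s>[INST] <<SYS>>
Use the following information from company annual reports and answer the question at the end.
If the answer is not contained in the provided information or if there is NO context at all, say "The answer is not in the context".
<</SYS>>

{context}

{question} [/INST]
"""

# Hypothetical retrieved chunks and question, for illustration only
chunks = ["Hypothetical chunk A about revenue.", "Hypothetical chunk B about costs."]
question = "What is the total revenue generated by the COCA COLA CO?"

filled_prompt = template.format(context="\n\n".join(chunks), question=question)
print(filled_prompt)
```

The `[/INST]` marker closes the instruction block in the Llama 2 chat format, so everything the model generates comes after it.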


### Chain Construction:
* The `compression_retriever` fetches relevant information for the query: it wraps `big_chunks_retriever` and reranks the retrieved parent chunks with FlashRank.
* `RunnablePassthrough.assign()` passes its input along unchanged, adding or overwriting only the keys it is given.
* The `llm` variable represents the Hugging Face pipeline; it takes the formatted prompt from the previous step and produces an answer.
* `StrOutputParser()` parses raw model output into a string. It is imported above but not used in this chain, because `HuggingFacePipeline` already returns a string.

The following code defines a pipeline for a question-answering system with retrieval augmentation.

It starts by taking a question and uses it both directly as the question and as input to the reranking retriever (`compression_retriever`, which wraps `big_chunks_retriever`) to fetch relevant context.

The retrieved context and the original question are then passed through `RunnablePassthrough.assign`, which keeps the context intact for later reference.

Finally, the response is generated by the question-answering model `llm`, which takes the formatted prompt, consisting of the context and the question, and produces an answer. The chain returns the response together with the context and the question.
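
The data flow described above can be mimicked with plain functions, which may make the shape of the chain's input and output easier to follow. Everything here is a hypothetical stand-in: `fake_retriever`, `fake_prompt` and `fake_llm` only imitate the roles of the retriever, the prompt template and the model, and the fake model echoes the prompt before its answer.

```python
def fake_retriever(question):
    """Stand-in for the reranking retriever: returns a list of context chunks."""
    return [f"chunk relevant to: {question}"]

def fake_prompt(inputs):
    """Stand-in for the prompt template: merges context and question into one string."""
    return f"{inputs['context']}\n\n{inputs['question']} [/INST]\n"

def fake_llm(prompt_text):
    """Stand-in for the model: echoes the prompt, then appends an answer."""
    return prompt_text + "The answer is not in the context"

def run_chain(inputs):
    # Step 1: send the question to the retriever and keep the question itself
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: pass the state through unchanged (context is assigned to itself)
    state = {**state, "context": state["context"]}
    # Step 3: generate the response and return it alongside its inputs
    return {
        "response": fake_llm(fake_prompt(state)),
        "context": state["context"],
        "question": state["question"],
    }

output = run_chain({"question": "What is the total revenue?"})
print(output["response"])
```

Returning the context next to the response makes it possible to check afterwards which chunks each answer was based on.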

In [None]:
chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter('question')}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | llm, "context": itemgetter("context"), "question": itemgetter('question')}
)

print('GPU Usage:')
GPUtil.showUtilization()

GPU Usage:
| ID | GPU | MEM |
------------------
|  0 |  4% | 29% |


### Chain Invocation:
This invocation triggers the entire sequence of operations defined in the chain. The retriever searches for relevant information, which is then passed along with the question through the prompt and into the Hugging Face model. The model generates a response based on the inputs it receives.

In [None]:
from tqdm import tqdm

results_list = []

for i in tqdm(range(len(questions))):
    response = chain.invoke({"question": questions['question'][i]})
    results_list.append(pd.DataFrame({
        'question': [response['question']],
        'response': [response['response'].split('[/INST]\n')[1]],
        'context': [response['context']]
    }))

results = pd.concat(results_list, ignore_index=True)

100%|██████████| 35/35 [13:00<00:00, 22.29s/it]
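
The loop above takes the answer with `split('[/INST]\n')[1]`, which raises an `IndexError` if the marker is ever missing from the generated text. A slightly more defensive variant is sketched below. The helper name `extract_answer` is hypothetical, and unlike the loop it keeps the text after the last marker rather than the segment after the first one.

```python
def extract_answer(generated_text, marker="[/INST]"):
    """Return the text after the last marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()

# Hypothetical model outputs
with_marker = "some context\n\nSome question? [/INST]\nAccording to the document provided, ..."
without_marker = "The answer is not in the context"

print(extract_answer(with_marker))
print(extract_answer(without_marker))
```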


In [None]:
results

Unnamed: 0,question,response,context
0,What is the total revenue generated by the COC...,"According to the document provided, the total ...","[page_content=""Open commodity derivatives that..."
1,What is the COCA COLA CO's cost of goods sold ...,"According to the document provided, the COCA C...",[page_content='Consolidated Balance Sheets\nCo...
2,What is the COCA COLA CO's gross profit margin...,The answer is in the context! According to the...,"[page_content=""Gross Profit Margin\nAs a resul..."
3,What are the COCA COLA CO's major operating ex...,Based on the information provided in the docum...,"[page_content=""Open commodity derivatives that..."
4,What is the COCA COLA CO's operating income an...,The answer to your question is not directly pr...,[page_content='Because of its inherent limitat...
5,What is the COCA COLA CO's net income for the ...,"According to the document provided, The Coca-C...",[page_content='Consolidated Balance Sheets\nCo...
6,What is the COCA COLA CO's earnings per share ...,"According to the provided document, the COCA C...",[page_content='Period Total Number of\nShares ...
7,What is the COCA COLA CO's cash flow generated...,"Based on the provided information, the COCA CO...","[page_content='1,905\nTOTAL EQUITY\n18,977\n23..."
8,How much has the COCA COLA CO invested in capi...,The answer is not in the context. There is no ...,"[page_content='During 2020, proceeds from disp..."
9,What is the COCA COLA CO's total outstanding d...,According to the information provided in the d...,"[page_content=""THE COCA-COLA COMPANY AND SUBSI..."


In [None]:
results.to_json('results_Coca_Cola_rerank_6-2_filter.json')