<a href="https://colab.research.google.com/github/subhra004/Subhra_04_GenAI_Projects/blob/main/RAG_Financial_Fraud_Detection_Using_Hugging_Face_and_LLM_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Imported essential libraries (numpy, pandas), listed all files in the Kaggle input directory (/kaggle/input), and noted the working directory (/kaggle/working/) and temporary storage (/kaggle/temp/) available for data processing and analysis.**










In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Fraud Detection using LLM and RAG**
This project uses Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to identify and flag potential fraud in financial data.

Large Language Models (LLMs):
LLMs are trained on vast amounts of textual data and can understand and generate human-like text. In fraud detection, LLMs can analyze financial statements, detect anomalies, and recognize patterns indicative of fraudulent behavior.

Retrieval-Augmented Generation (RAG):
RAG combines the capabilities of LLMs with a retrieval mechanism to enhance the generation process. It retrieves relevant documents or pieces of information from a large corpus and uses them to provide more accurate and contextually relevant responses. In this context, RAG can pull relevant financial records, reports, and contextual data to assist in the detection and explanation of potential fraud.

Application:
Input: Financial statements and related documents.

Process: The system uses RAG to retrieve pertinent information from a database and an LLM to analyze and interpret the data.

Output: A concise report indicating whether the financial statement exhibits fraudulent behavior, with an explanation based on the retrieved context.

This combination of an LLM with RAG is intended to make fraud detection in financial filings more accurate and easier to explain, which is useful to auditors, regulators, and financial institutions.
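
The input, process, and output steps above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: `retrieve` ranks records by shared words rather than by embeddings, and `fake_llm` is a hypothetical stand-in for a real model. Neither is the notebook's implementation.

```python
import re

# Toy knowledge base of labeled statements (illustrative, not the notebook's data store).
RECORDS = [
    ("The company used off-balance-sheet entities to hide debt.", "fraud"),
    ("All related-party transactions are fully disclosed.", "non-fraud"),
    ("Revenue is reported in the correct accounting periods.", "non-fraud"),
]

def words(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, records, k=1):
    """Return the k records sharing the most words with the query (stand-in for vector search)."""
    ranked = sorted(records, key=lambda rec: len(words(query) & words(rec[0])), reverse=True)
    return ranked[:k]

def build_prompt(query, context):
    """Combine the retrieved context and the statement under review into one prompt."""
    context_text = "\n".join(f"{text} [{label}]" for text, label in context)
    return f"Context:\n{context_text}\nStatement: {query}\nIs this statement fraudulent?"

def fake_llm(prompt):
    """Hypothetical stand-in for an LLM: echoes the label of the first context line."""
    first_context_line = prompt.splitlines()[1]
    return "non-fraud" if "[non-fraud]" in first_context_line else "fraud"

query = "The company hides debt in off-balance-sheet entities."
context = retrieve(query, RECORDS, k=1)
verdict = fake_llm(build_prompt(query, context))
print(verdict)
```

In the notebook itself, retrieval is handled by a Chroma vector store and generation by the Zephyr model; the sketch only shows how the two stages connect.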

**Installed the libraries needed to build and run the LangChain application: LLM tooling, embeddings, vector databases, and retrieval components**

In [None]:
!pip install -q langchain sentence-transformers faiss-cpu langchain-community langchain-core transformers chromadb

**Upgraded LangChain and SentenceTransformers**

In [None]:
%pip install --upgrade --quiet  langchain sentence_transformers

**Generated a labeled dataset of fraud and non-fraud financial statements, shuffled the data, and saved it as a CSV file**

In [None]:
import pandas as pd
import random

# Define sample data for fraud and non-fraud financial statements
fraud_statements = [
    "The company reported inflated revenues by including sales that never occurred.",
    "Financial records were manipulated to hide the true state of expenses.",
    "The company failed to report significant liabilities on its balance sheet.",
    "Revenue was recognized prematurely before the actual sales occurred.",
    "The financial statement shows significant discrepancies in inventory records.",
    "The company used off-balance-sheet entities to hide debt.",
    "Expenses were understated by capitalizing them as assets.",
    "There were unauthorized transactions recorded in the financial books.",
    "Significant amounts of revenue were recognized without proper documentation.",
    "The company falsified financial documents to secure a larger loan.",
    "There were multiple instances of duplicate payments recorded as expenses.",
    "The company reported non-existent assets to enhance its financial position.",
    "Expenses were fraudulently categorized as business development costs.",
    "The company manipulated financial ratios to meet loan covenants.",
    "Significant related-party transactions were not disclosed.",
    "The financial statement shows fabricated sales transactions.",
    "There was intentional misstatement of cash flow records.",
    "The company inflated the value of its assets to attract investors.",
    "Revenue from future periods was reported in the current period.",
    "The company engaged in channel stuffing to inflate sales figures."
]

non_fraud_statements = [
    "The company reported stable revenues consistent with historical trends.",
    "Financial records accurately reflect all expenses and liabilities.",
    "The balance sheet provides a true and fair view of the company’s financial position.",
    "Revenue was recognized in accordance with standard accounting practices.",
    "The inventory records are accurate and match physical counts.",
    "The company’s debt is fully disclosed on the balance sheet.",
    "All expenses are properly categorized and recorded.",
    "Transactions recorded in the financial books are authorized and documented.",
    "Revenue recognition is supported by proper documentation.",
    "Financial documents were audited and found to be accurate.",
    "Payments and expenses are recorded accurately without discrepancies.",
    "The assets reported on the balance sheet are verified and exist.",
    "Business development costs are properly recorded as expenses.",
    "Financial ratios are calculated based on accurate data.",
    "All related-party transactions are fully disclosed.",
    "Sales transactions are accurately recorded in the financial statement.",
    "Cash flow records are accurate and reflect actual cash movements.",
    "The value of assets is fairly reported in the financial statements.",
    "Revenue is reported in the correct accounting periods.",
    "Sales figures are accurately reported without manipulation."
]

# Generate fraud and non-fraud data
fraud_data = [{"text": statement, "fraud_status": "fraud"} for statement in fraud_statements]
non_fraud_data = [{"text": random.choice(non_fraud_statements), "fraud_status": "non-fraud"} for _ in range(60)]

# Combine data into a single dataset
data = fraud_data + non_fraud_data
random.shuffle(data)  # Shuffle data to mix fraud and non-fraud rows

# Create a DataFrame
df = pd.DataFrame(data)

# Save to a CSV file
df.to_csv("/content/BankFAQs.csv", index=False)
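
Because the 60 non-fraud rows are drawn with `random.choice` from only 20 statements, the dataset is imbalanced (20 fraud rows against 60 non-fraud rows) and necessarily contains repeated non-fraud rows. A small sketch that checks this, using short placeholder statements rather than the real ones and a fixed seed that the notebook does not set:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable (an assumption, not in the notebook)

# Placeholder statements standing in for the two 20-item lists above.
fraud_statements = [f"fraud statement {i}" for i in range(20)]
non_fraud_statements = [f"non-fraud statement {i}" for i in range(20)]

fraud_data = [{"text": s, "fraud_status": "fraud"} for s in fraud_statements]
non_fraud_data = [
    {"text": random.choice(non_fraud_statements), "fraud_status": "non-fraud"}
    for _ in range(60)
]
data = fraud_data + non_fraud_data

label_counts = Counter(row["fraud_status"] for row in data)
unique_texts = {row["text"] for row in data}

print(label_counts)                   # class balance
print(len(data), len(unique_texts))   # total rows vs. distinct statements
```

Since at most 40 distinct statements exist, at least half of the 80 rows repeat a statement that already appears elsewhere in the dataset.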



**Displayed the first five rows of the DataFrame (df) to preview the fraud and non-fraud financial statements**

In [None]:
df.head()

Unnamed: 0,text,fraud_status
0,Financial records accurately reflect all expen...,non-fraud
1,Business development costs are properly record...,non-fraud
2,The inventory records are accurate and match p...,non-fraud
3,The company manipulated financial ratios to me...,fraud
4,All related-party transactions are fully discl...,non-fraud


**Imported pandas, regular expressions (re), and NLTK libraries to process text data, including tokenization and stopword removal**

In [None]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk


**Downloaded essential NLTK resources (punkt, stopwords, wordnet, and punkt_tab) for text tokenization, stopword removal, and lemmatization**

In [None]:
# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Download the 'punkt_tab' resource
nltk.download('punkt_tab') # This line is added to download the required resource.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

**Defined a function (clean_text) that preprocesses text: it removes non-ASCII characters, punctuation, and numbers, converts the text to lowercase, tokenizes it, and removes stopwords**

In [None]:
# Function to clean text
def clean_text(text):
    # Remove non-ASCII characters
    text = text.encode('ascii', 'ignore').decode()

    # Remove punctuation and numbers
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Join tokens back into text
    cleaned_text = ' '.join(tokens)

    return cleaned_text
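
One side effect of deleting punctuation outright is that hyphenated terms are fused: `related-party` becomes `relatedparty`. The sketch below reproduces the cleaning steps with the standard library only and contrasts deleting punctuation with replacing it by a space. The tiny `STOP_WORDS` set and the whitespace tokenizer are simplifying assumptions standing in for NLTK's stopword list and `word_tokenize`.

```python
import re

# Small illustrative stopword set; NLTK's English list is much longer.
STOP_WORDS = {"all", "are", "the", "and", "of", "to", "in", "is", "were", "was"}

def clean_text_sketch(text, punctuation_replacement=""):
    """Approximate the notebook's cleaning steps without NLTK."""
    text = text.encode("ascii", "ignore").decode()            # drop non-ASCII characters
    text = re.sub(r"[^\w\s]", punctuation_replacement, text)  # remove or replace punctuation
    text = re.sub(r"\d+", "", text)                           # remove digits
    tokens = text.lower().split()                             # whitespace tokenization
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

sentence = "All related-party transactions are fully disclosed."
fused = clean_text_sketch(sentence)                                # punctuation deleted
split = clean_text_sketch(sentence, punctuation_replacement=" ")   # punctuation -> space

print(fused)  # relatedparty transactions fully disclosed
print(split)  # related party transactions fully disclosed
```

Whether the fused or the split form is preferable depends on the embedding model; the sketch only makes the difference visible.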

**Cleaned the 'text' column, stored the processed text in a new 'Clean_Text' column, removed the original column, saved the cleaned data to a CSV file, and displayed the first five rows**

In [None]:
# Clean the 'text' column
df['Clean_Text'] = df['text'].apply(clean_text)

# Drop the original 'text' column now that it is no longer needed
df.drop(columns=['text'], inplace=True)

# Save the cleaned data to a new CSV file
df.to_csv('cleaned_financial_statements.csv', index=False)

# Preview the cleaned data
print(df.head())

  fraud_status                                         Clean_Text
0    non-fraud  financial records accurately reflect expenses ...
1    non-fraud  business development costs properly recorded e...
2    non-fraud   inventory records accurate match physical counts
3        fraud  company manipulated financial ratios meet loan...
4    non-fraud          relatedparty transactions fully disclosed


**Upgraded (-U) the langchain-community package to access community-contributed integrations and tools for LangChain**

In [None]:
!pip install -U langchain-community



**Created a list of LangChain Document objects by iterating over the DataFrame (df), formatting each row’s data, and storing it as page_content**

In [None]:
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document

documents = []

# Iterate over the DataFrame rows with .iterrows()
for i, row in df.iterrows():
    # Access fields by column name rather than by position; "\\" keeps the literal
    # backslash separator while avoiding an invalid escape sequence.
    document = f"id:{i}\\Fillings: {row['Clean_Text']}\\Fraud_Status: {row['fraud_status']}"
    documents.append(Document(page_content=document))


**Retrieved and displayed the first Document object from the documents list**

In [None]:
documents[0]

Document(metadata={}, page_content='id:0\\Fillings: financial records accurately reflect expenses liabilities\\Fraud_Status: non-fraud')

**Initialized Hugging Face embeddings (HuggingFaceEmbeddings()) to generate vector representations for text data in LangChain**

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
hg_embeddings = HuggingFaceEmbeddings()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


**Upgraded (--upgrade) chromadb to the latest version of the Chroma vector database**

In [None]:
!pip install --upgrade chromadb



**Created and persisted a Chroma vector database using documents, HuggingFaceEmbeddings, and the collection name "finance_data_new"**

In [None]:
from langchain.vectorstores import Chroma
persist_directory = 'docs/chroma_rag/'
langchain_chroma = Chroma.from_documents(
    documents=documents,
    collection_name="finance_data_new",
    embedding=hg_embeddings,
    persist_directory=persist_directory
)
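
Conceptually, the vector store embeds each document and, at query time, returns the documents whose vectors lie closest to the query's vector. A minimal sketch of that idea, assuming a toy bag-of-words `embed` function in place of a real sentence-embedding model and cosine similarity as the measure (Chroma's actual index and default metric may differ):

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a real system would use a neural sentence encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "inventory records accurate match physical counts",
    "company manipulated financial ratios meet loan covenants",
    "relatedparty transactions fully disclosed",
]
best = top_k("manipulated ratios for loan covenants", docs, k=1)
print(best)
```

The `k` argument plays the same role as the `k` passed in `search_kwargs` when a retriever is created from the store.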


**Logged into Hugging Face Hub with write permissions to access and manage models, datasets, or other resources**

In [None]:
from huggingface_hub import notebook_login
notebook_login(write_permission=True)


Fine-grained tokens added complexity to the permissions, making it irrelevant to check if a token has 'write' access.



**Imported necessary libraries for deep learning (torch, transformers), Hugging Face tokenization, document loading (PyPDFLoader), text splitting, embeddings, retrieval-based Q&A, and Chroma vector storage in LangChain**

In [None]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

**Installed the bitsandbytes library, which enables 8-bit and 4-bit quantization for efficient model inference and reduced memory usage**

In [None]:
!pip install bitsandbytes



**Set up the Zephyr-7B model with 4-bit quantization (bitsandbytes) and determined whether to use GPU (cuda) or CPU (cpu) for efficient inference, then printed the selected device**

In [None]:
model_id = 'HuggingFaceH4/zephyr-7b-beta'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

print(device)


cuda:0
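
The benefit of 4-bit loading can be estimated with simple arithmetic. The sketch below assumes a nominal 7 billion parameters and counts weights only; activations, the KV cache, and quantization metadata add overhead, so real usage will be higher.

```python
# Rough weight-memory estimate for a nominal 7-billion-parameter model.
N_PARAMS = 7_000_000_000  # assumption: nominal size implied by the "7b" in the model name

def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")  # 14.0 GB
print(f" 4-bit weights: {int4_gb:.1f} GB")  # 3.5 GB
```

On a 16 GB GPU, the 16-bit weights alone would leave little headroom for generation, which is the motivation for quantizing.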


**Installed accelerate for optimized deep learning model execution and bitsandbytes from PyPI for 4-bit and 8-bit model quantization to reduce memory usage**

In [None]:
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes

Looking in indexes: https://pypi.org/simple/


**Uninstalled and reinstalled bitsandbytes, upgraded transformers and accelerate, and installed torch with a specific CUDA version**

In [None]:
!pip uninstall -y bitsandbytes
!pip install bitsandbytes
!pip install --upgrade transformers accelerate
!pip install torch --index-url https://download.pytorch.org/whl/cu118  # Adjust CUDA version if needed


Found existing installation: bitsandbytes 0.45.4
Uninstalling bitsandbytes-0.45.4:
  Successfully uninstalled bitsandbytes-0.45.4
Collecting bitsandbytes
  Using cached bitsandbytes-0.45.4-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Using cached bitsandbytes-0.45.4-py3-none-manylinux_2_24_x86_64.whl (76.0 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.4
Looking in indexes: https://download.pytorch.org/whl/cu118


**Checked whether CUDA is available, counted the GPUs, and printed the name of the first GPU**

In [None]:
import torch
print(torch.cuda.is_available())  # Should return True
print(torch.cuda.device_count())  # Should return > 0 if GPU is available
print(torch.cuda.get_device_name(0))  # Should print GPU name


True
1
Tesla T4


**Loaded the Hugging Face causal language model (zephyr-7b-beta) with 4-bit quantization for memory efficiency and let device_map="auto" place it on the available GPU (or CPU)**

In [None]:
import torch
import transformers

model_id = "HuggingFaceH4/zephyr-7b-beta"

device = "cuda" if torch.cuda.is_available() else "cpu"

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    # Pass quantization settings through BitsAndBytesConfig; the bare
    # load_in_4bit argument is deprecated.
    quantization_config=transformers.BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",  # Automatically places the model on the available device
)

# device_map="auto" has already placed the weights, so no explicit .to(device) is needed.
model  # Display the model architecture


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=2)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
      )
    )
 

**Initialized a text-generation pipeline using the zephyr-7b-beta model and its tokenizer, with FP16 precision (torch.float16), max_length=6000, max_new_tokens=500 (which takes precedence when both limits are set), and automatic device mapping**

In [None]:
# Initialize the query pipeline with increased max_length
from transformers import AutoTokenizer

# Define model ID
model_id = "HuggingFaceH4/zephyr-7b-beta"  # Replace with the actual model ID

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    max_length=6000,  # Increase max_length
    max_new_tokens=500,  # Control the number of new tokens generated
    device_map="auto",
)

Device set to use cuda:0


**Defined a function colorize_text(text) that wraps the labels "Reasoning", "Question", "Answer", and "Total time" in colored Markdown (blue, red, green, and magenta respectively), making them visually distinct when displayed**

In [None]:
from IPython.display import display, Markdown
def colorize_text(text):
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text
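
A short usage sketch shows what the function produces for a two-line input. The function body is repeated here only so the example runs on its own; the sample question and answer are made up.

```python
def colorize_text(text):
    """Same logic as the function above, repeated so this example is self-contained."""
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"],
                           ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

styled = colorize_text("Question: Is this fraud?\nAnswer: No.")
print(styled)
```

Only a label followed immediately by a colon is matched, so a string such as `Answer :` would be left unstyled.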

**Installed the langchain-community package, which provides community-supported integrations and tools for building applications with LangChain, a framework for working with LLMs**

In [None]:
!pip install langchain-community



**Wrapped the Hugging Face text-generation pipeline (query_pipeline) in a LangChain LLM (HuggingFacePipeline), generated a response to a sample question ("Please explain what EU AI Act is."), formatted the output with colorize_text(), and displayed it as Markdown**

In [None]:
from langchain_community.llms import HuggingFacePipeline  # Wraps a Transformers pipeline as a LangChain LLM
from IPython.display import display, Markdown

llm = HuggingFacePipeline(pipeline=query_pipeline)

question = "Please explain what EU AI Act is."
response = llm.invoke(question)  # invoke() replaces the deprecated llm(prompt=...) call style

full_response = f"Question: {question}\nAnswer: {response}"
display(Markdown(colorize_text(full_response)))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=500) and `max_length`(=6000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)




**<font color='red'>Question:</font>** Please explain what EU AI Act is.


**<font color='green'>Answer:</font>** Please explain what EU AI Act is.

The European Union's (EU) Artificial Intelligence (AI) Act is a proposed regulation that aims to ensure the safe and responsible use of AI in the EU. The Act proposes a risk-based approach to AI, categorizing AI systems into three categories based on their level of risk: unregulated, high-risk, and prohibited. The Act also proposes requirements for transparency, data governance, and human oversight for high-risk AI systems. The Act is still in the proposal stage and is subject to approval by the European Parliament and the Council of the European Union.
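
The answer above starts by repeating the question, which suggests the pipeline returned the prompt together with its continuation. A minimal sketch of stripping such an echo in post-processing follows; `strip_prompt_echo` is a hypothetical helper, not part of LangChain or Transformers, and the sample response text is made up.

```python
def strip_prompt_echo(prompt, response):
    """Remove a leading copy of the prompt from a generated response, if present."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    return response.strip()

question = "Please explain what EU AI Act is."
raw = question + "\n\nThe EU AI Act is a regulation on artificial intelligence."
print(strip_prompt_echo(question, raw))
```

The Transformers text-generation pipeline also accepts `return_full_text=False`, which may make this step unnecessary.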

**Imported the modules needed for a retrieval-based QA system in LangChain (RetrievalQA, PromptTemplate, HuggingFaceHub), suppressed warnings, and enabled Markdown output display**

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from IPython.display import display, Markdown
import os
import warnings
warnings.filterwarnings('ignore')

**Built a fraud detection system using retrieval-augmented generation (RAG) with LangChain. The cell below:**

1. Defines a prompt template for fraud detection in financial statements.

2. Generates a dataset of fraud and non-fraud financial statements.

3. Stores the data in a ChromaDB vector store using Hugging Face embeddings for retrieval.

4. Initializes a retriever to fetch relevant financial documents based on queries.

5. Builds a RetrievalQA chain using a Hugging Face language model (llm) to analyze statements and classify fraud.

6. Runs the QA pipeline on a sample financial statement and displays the result.
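
For step 1, the variables declared for a prompt template should match the placeholders that appear in the template text. A minimal sketch of that check using only the standard library; `placeholders` is a hypothetical helper, and the template and sample values are shortened for illustration:

```python
from string import Formatter

def placeholders(template):
    """Return the set of {named} fields used in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}

template = "Question: {question}\nContext: {context}\nAnswer:"
declared = {"context", "question"}

missing = placeholders(template) - declared   # used in the text but not declared
unused = declared - placeholders(template)    # declared but never used

print(missing, unused)
filled = template.format(question="Revenue was recognized prematurely.",
                         context="id:3 ... fraud")
print(filled)
```

Both sets being empty means every placeholder will be filled and no declared variable is silently ignored.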










In [None]:
# Define the prompt template
template = """
You are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don't know the answer, just say "Sorry, I Don't Know."
Question: {question}
Context: {context}
Answer:
"""
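# The declared variables must match the placeholders used in the template ({context}, {question})
PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)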

# Import necessary modules
import pandas as pd # Import pandas to work with DataFrames
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain_community.embeddings import HuggingFaceEmbeddings

# Define sample data for fraud and non-fraud financial statements
fraud_statements = [
    "The company reported inflated revenues by including sales that never occurred.",
    "Financial records were manipulated to hide the true state of expenses.",
    "The company failed to report significant liabilities on its balance sheet.",
    "Revenue was recognized prematurely before the actual sales occurred.",
    "The financial statement shows significant discrepancies in inventory records.",
    "The company used off-balance-sheet entities to hide debt.",
    "Expenses were understated by capitalizing them as assets.",
    "There were unauthorized transactions recorded in the financial books.",
    "Significant amounts of revenue were recognized without proper documentation.",
    "The company falsified financial documents to secure a larger loan.",
    "There were multiple instances of duplicate payments recorded as expenses.",
    "The company reported non-existent assets to enhance its financial position.",
    "Expenses were fraudulently categorized as business development costs.",
    "The company manipulated financial ratios to meet loan covenants.",
    "Significant related-party transactions were not disclosed.",
    "The financial statement shows fabricated sales transactions.",
    "There was intentional misstatement of cash flow records.",
    "The company inflated the value of its assets to attract investors.",
    "Revenue from future periods was reported in the current period.",
    "The company engaged in channel stuffing to inflate sales figures."
]

non_fraud_statements = [
    "The company reported stable revenues consistent with historical trends.",
    "Financial records accurately reflect all expenses and liabilities.",
    "The balance sheet provides a true and fair view of the company’s financial position.",
    "Revenue was recognized in accordance with standard accounting practices.",
    "The inventory records are accurate and match physical counts.",
    "The company’s debt is fully disclosed on the balance sheet.",
    "All expenses are properly categorized and recorded.",
    "Transactions recorded in the financial books are authorized and documented.",
    "Revenue recognition is supported by proper documentation.",
    "Financial documents were audited and found to be accurate.",
    "Payments and expenses are recorded accurately without discrepancies.",
    "The assets reported on the balance sheet are verified and exist.",
    "Business development costs are properly recorded as expenses.",
    "Financial ratios are calculated based on accurate data.",
    "All related-party transactions are fully disclosed.",
    "Sales transactions are accurately recorded in the financial statement.",
    "Cash flow records are accurate and reflect actual cash movements.",
    "The value of assets is fairly reported in the financial statements.",
    "Revenue is reported in the correct accounting periods.",
    "Sales figures are accurately reported without manipulation."
]

# Generate fraud and non-fraud data
import random # Import random for data shuffling

fraud_data = [{"text": statement, "fraud_status": "fraud"} for statement in fraud_statements]
non_fraud_data = [{"text": random.choice(non_fraud_statements), "fraud_status": "non-fraud"} for _ in range(60)]

# Combine data into a single dataset
data = fraud_data + non_fraud_data
random.shuffle(data)  # Shuffle data to mix fraud and non-fraud rows

# Create a DataFrame
df = pd.DataFrame(data)


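# Build one LangChain Document per DataFrame row
documents = []
for i, row in df.iterrows():
    # Access fields by column name: positional access depends on column order and
    # would swap the two labels here, where `text` is the first column.
    # "\\" keeps the literal backslash separator without an invalid escape sequence.
    document = f"id:{i}\\Fillings: {row['text']}\\Fraud_Status: {row['fraud_status']}"
    documents.append(Document(page_content=document))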

# Initialize langchain_chroma if not already done
if 'langchain_chroma' not in locals():  # Check if langchain_chroma is already defined
    persist_directory = 'docs/chroma_rag/'
    hg_embeddings = HuggingFaceEmbeddings()
    langchain_chroma = Chroma.from_documents(
        documents=documents, # 'documents' is now defined
        collection_name="finance_data_new",
        embedding=hg_embeddings,
        persist_directory=persist_directory
    )

# Now you can use langchain_chroma
retriever = langchain_chroma.as_retriever(search_kwargs={"k": 1})

qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=retriever, chain_type_kwargs={"prompt": PROMPT}
)

# Define your question
# question = "The company reported inflated revenues by including sales that never occurred."
question = "Financial records accurately reflect all expenses and liabilities."
# question = "Revenue was recognized prematurely before the actual sales occurred."
# question = "The balance sheet provides a true and fair view of the company’s financial position."

# Run the QA chain
try:
    result = qa_chain.invoke({"query": question})  # invoke() replaces the deprecated direct call
    display(result)
except RuntimeError as e:
    print(f"RuntimeError encountered: {e}")

Both `max_new_tokens` (=500) and `max_length`(=6000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


{'query': 'Financial records accurately reflect all expenses and liabilities.',
 'result': '\nYou are an Fraud Detection Expert in Financial Text Data, Analyse them and Predict is the Given Statement is Fraud or not?. If you don\'t know the answer, just say "Sorry, I Don\'t Know."\nQuestion: Financial records accurately reflect all expenses and liabilities.\nContext: id:37\\Fillings: non-fraud\\Fraud_Status: Financial records accurately reflect all expenses and liabilities.\nAnswer:\n\nBased on the given context, the statement "Financial records accurately reflect all expenses and liabilities" is more likely to be found in non-fraud records rather than fraud records. This is because one of the signs of financial fraud is inaccurate financial records, where expenses and liabilities may be concealed or underreported to hide the true financial position of the organization. Therefore, if financial records accurately reflect all expenses and liabilities, it is less likely to be a sign of fi