### In this notebook, I will try to generate the response using the Mistral 7B model as my chatbot and one of the vector database created in the chunking and vectorisation notebook. Will also try to evaluate each strategy and choose the best one.

In [1]:
from llama_cpp import Llama
import faiss
from sentence_transformers import SentenceTransformer
import numpy as np
import torch
import tqdm
import pickle
import json
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

from nltk.translate.meteor_score import single_meteor_score
from rouge_score import rouge_scorer
from nltk.tokenize import word_tokenize




Loading the Mistral 7B model to use as our chatboat.

In [2]:

path = r'C:\Users\shri\Data_Science\Text Mining\mistral-7b-instruct-v0.1.Q2_K.gguf'
llm = Llama(
    model_path=path,
    n_ctx=8192,
    verbose=True 
)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from C:\Users\shri\Data_Science\Text Mining\mistral-7b-instruct-v0.1.Q2_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader:

Loading the first faiss index and all_chunks we created in chunking and vectorisation notebook.

In [26]:
index = faiss.read_index("chunk_index.faiss")

Loading the chunks.

In [40]:
with open("semantic_chunks.pkl", "rb") as f:
    data = pickle.load(f)
all_chunks = data["chunks"]
chunk_index_map = data["index_map"]


Loading the Embedding model for the query vectorisation.

In [6]:
model = SentenceTransformer("all-MiniLM-L6-v2")

In [7]:
## to retrieve the top 5 most similar chunks to the querry
def retrieve_relevant_chunks(query,model,index, chunks, top_k=5):
    query_embedding = model.encode([query]).astype("float32")
    D, I = index.search(query_embedding, top_k)
    return [chunks[i] for i in I[0]],D[0]


In [9]:
## to genrate prompt with questions and retrieved contex 
def build_prompt(query, retrieved_chunks):
    context = "\n".join(retrieved_chunks)

    prompt = f"""### Instruction:
Answer the following question using the provided context.

Context:
{context.strip()}

Question:
{query.strip()}

### Response:"""
    return prompt


In [10]:
query = "How do reverse engineering studies help in auditing algorithms on online platforms"

top_chunks,i = retrieve_relevant_chunks(query,model = model,index = index, chunks = all_chunks, top_k=5)

# prompt
prompt = build_prompt(query, top_chunks)

token_ids = llm.tokenize(prompt.encode("utf-8"))
print("Token count:", len(token_ids))

# 
response = llm(prompt, max_tokens=256, temperature=0.7, stop=["###"])
print("Answer:\n", response["choices"][0]["text"])


Token count: 138


llama_perf_context_print:        load time =   13709.23 ms
llama_perf_context_print: prompt eval time =   13707.89 ms /   138 tokens (   99.33 ms per token,    10.07 tokens per second)
llama_perf_context_print:        eval time =   32825.70 ms /   156 runs   (  210.42 ms per token,     4.75 tokens per second)
llama_perf_context_print:       total time =   46637.13 ms /   294 tokens
llama_perf_context_print:    graphs reused =        150


Answer:
 

Reverse engineering studies can help in auditing algorithms on online platforms by allowing researchers to understand how the algorithms work and identify any potential biases or vulnerabilities. By examining the code and data used by the algorithms, researchers can identify patterns and inconsistencies that could be exploited or used to manipulate the results. This information can then be used to develop more accurate and reliable algorithms, as well as to identify potential risks and mitigation strategies. Reverse engineering studies can also provide valuable insights into the algorithms' decision-making processes, which can help to ensure that they are aligned with ethical and legal standards. Overall, reverse engineering studies can play an important role in ensuring the transparency, accountability, and fairness of algorithms used in online platforms.


Remove the below cell when sure.

In [11]:
top_chunks

['The database covers 27 EU countries, Iceland, Lichtenstein, Norway, Switzerland and the United Kingdom.',
 'emergency phone numbers',
 'By 2040, the quantum sector is expected to create thousands of highly skilled jobs across the EU and exceed a global value of €155 billion.',
 'treatments that are covered and their costs',
 'Specific actions have been identified to meet the strategy’s objectives, such as:']

In [12]:
i

array([0.5281775 , 0.59934396, 0.6755141 , 0.7139563 , 0.78965354],
      dtype=float32)

It seems that the answer generated by the LLM is not coming from the context. Because the retrived chunks does not matches with the question or answer.

Let's try changing the prompt a little bit and see if the response changes.

In [14]:
## A little different prompt
def build_prompt_2(query, retrieved_chunks):
    context = "\n".join(retrieved_chunks)

    prompt = f"""### Instruction:
Answer the following question only using the provided context. Do not use any external knowledge or assumptions. 
If the answer is not in the context, respond with "The answer is not found in the provided context."

Context:
{context.strip()}

Question:
{query.strip()}

### Response:"""
    return prompt


Again passing the same querry

In [15]:
# Build prompt
prompt = build_prompt_2(query, top_chunks)

# Optional: check token count
token_ids = llm.tokenize(prompt.encode("utf-8"))
print("Token count:", len(token_ids))

# Run inference
response = llm(prompt, max_tokens=256, temperature=0.7, stop=["###"])
print("Answer:\n", response["choices"][0]["text"])


Token count: 172


Llama.generate: 11 prefix-match hit, remaining 161 prompt tokens to eval
llama_perf_context_print:        load time =   13709.23 ms
llama_perf_context_print: prompt eval time =   11884.62 ms /   161 tokens (   73.82 ms per token,    13.55 tokens per second)
llama_perf_context_print:        eval time =    2545.64 ms /    12 runs   (  212.14 ms per token,     4.71 tokens per second)
llama_perf_context_print:       total time =   14435.46 ms /   173 tokens
llama_perf_context_print:    graphs reused =         11


Answer:
 

The answer is not found in the provided context.


In [16]:
prompt

'### Instruction:\nAnswer the following question only using the provided context. Do not use any external knowledge or assumptions. \nIf the answer is not in the context, respond with "The answer is not found in the provided context."\n\nContext:\nThe database covers 27 EU countries, Iceland, Lichtenstein, Norway, Switzerland and the United Kingdom.\nemergency phone numbers\nBy 2040, the quantum sector is expected to create thousands of highly skilled jobs across the EU and exceed a global value of €155 billion.\ntreatments that are covered and their costs\nSpecific actions have been identified to meet the strategy’s objectives, such as:\n\nQuestion:\nHow do reverse engineering studies help in auditing algorithms on online platforms\n\n### Response:'

Let's try the cosine index and see if the matching improves 

In [24]:
index_cosine = faiss.read_index("chunk_index_cosine.faiss")
query = "When is the deadline for submitting applications to join the Platform on Sustainable Finance?"
query_embedding = model.encode([query])  
query_embedding = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)
query_embedding = query_embedding.astype("float32") 
D, I = index_cosine.search(query_embedding, k=5)



In [38]:
index = faiss.read_index("chunk_index_cosine.faiss")
with open("filtered_chunks.txt", "r", encoding="utf-8") as f:
    all_chunks = [ln.strip() for ln in f if ln.strip()]

# Build the query (E5 example)
query = "When is the deadline for submitting applications to join the Platform on Sustainable Finance?"
q = model.encode([f"query: {query}"], normalize_embeddings=True).astype("float32")

# Search
k = 5
D, I = index.search(q, k)

# Show
for rank, (idx, score) in enumerate(zip(I[0], D[0]), 1):
    print(f"{rank}. cos={score:.4f} | {all_chunks_2[idx][:180]}...")

1. cos=0.6602 | The new platform will be composed of up to 35 members, of which up to 28 will be selected through today's call for applications....
2. cos=0.8008 | By sharing the experiences and successes of this initiative, the CEB and its partners hope to create a ripple effect, encouraging more innovative and effective approaches to migran...
3. cos=0.8998 | TheRoadmapaims to developclear standards and reliable certification for these nature-positive actionsto make nature credits effective and trustworthy, while avoiding administrative...
4. cos=0.9088 | InApril 2025, the Commission proposed to amend the EGF regulation to support workers at risk of imminent job loss, allowing earlier intervention by swiftly mobilising support befor...
5. cos=0.9249 | The European Commission has today launched acall for applicationsfor members of the thirdPlatform on Sustainable Finance....


In [39]:
for idx, score in zip(I[0], D[0]):
    print(f"Score: {score:.4f} | Chunk: {all_chunks_2[idx]}")

Score: 0.6602 | Chunk: The new platform will be composed of up to 35 members, of which up to 28 will be selected through today's call for applications.
Score: 0.8008 | Chunk: By sharing the experiences and successes of this initiative, the CEB and its partners hope to create a ripple effect, encouraging more innovative and effective approaches to migrant integration across Europe.
Score: 0.8998 | Chunk: TheRoadmapaims to developclear standards and reliable certification for these nature-positive actionsto make nature credits effective and trustworthy, while avoiding administrative burden when joining such a scheme.
Score: 0.9088 | Chunk: InApril 2025, the Commission proposed to amend the EGF regulation to support workers at risk of imminent job loss, allowing earlier intervention by swiftly mobilising support before job losses occur.
Score: 0.9249 | Chunk: The European Commission has today launched acall for applicationsfor members of the thirdPlatform on Sustainable Finance.


Even after the clear prompt the LLM still generated the answer from its own base knowledge. This could be a limitation of the LLM. Also the chunking method we have used in this case seems not very good because the retrieved chunks lacks local contex which results in incomplete chunks. We should think of a better chunking strategy.

Explaination of what I did so far:-
1. Break the scrapped text into chunks of maximum length 300.
2. Converted these text to a vector embedding of size 384 using "all-MiniLM-L6-v2".
3. Created an faiss index usinig euclidean distance.
4. The querry is converted to embedding then searched for top 5 similar embeddings from the index.
5. Therse 5 similar chunks are then joined into one and then passed to our LLMA model for answer generation with a prompt having Question and context.


#### Now for the evaluation part, I will create a set of 200 QA using chatgpt and then measure the answering ability considering the answer of the chatgpt as base. 

I have generated 297 high quality QA pair using chatgpt. Now I will take the answers of the chatgpt as the baseline and compare with the generated answers for all the questions using our model.

Loading the QA file

In [50]:
file_path = r"C:\Users\shri\Data_Science\Text Mining\QA_Evaluation\QA_text_mining.txt"

with open(file_path, 'r', encoding='utf-8') as file:
    data = json.load(file)
num_pairs = len(data)

print(f"Total QA pairs found: {num_pairs}")

Total QA pairs found: 297


Let's generate the answers of the Questions using our RAG model and evaluate against the chatgpt baseline answers.

In [42]:
updated_qa_data = []

In [47]:
for item in tqdm.tqdm(data):
    query = item["question"]
    top_chunks,i = retrieve_relevant_chunks(query,model = model,index = index, chunks = all_chunks, top_k=5)

    prompt = build_prompt(query, top_chunks)

    token_ids = llm.tokenize(prompt.encode("utf-8"))


    response = llm(prompt, max_tokens=256, temperature=0.7, stop=["###"])
    answer = response["choices"][0]["text"].strip()
    updated_qa_data.append({
        "question": query,
        "answer": answer
    })

  0%|                                                                                          | 0/297 [00:00<?, ?it/s]Llama.generate: 21 prefix-match hit, remaining 246 prompt tokens to eval
llama_perf_context_print:        load time =    7022.50 ms
llama_perf_context_print: prompt eval time =   66603.67 ms /   246 tokens (  270.75 ms per token,     3.69 tokens per second)
llama_perf_context_print:        eval time =   53061.34 ms /   111 runs   (  478.03 ms per token,     2.09 tokens per second)
llama_perf_context_print:       total time =  119850.10 ms /   357 tokens
  0%|▎                                                                              | 1/297 [01:59<9:51:41, 119.94s/it]Llama.generate: 22 prefix-match hit, remaining 361 prompt tokens to eval
llama_perf_context_print:        load time =    7022.50 ms
llama_perf_context_print: prompt eval time =   96265.87 ms /   361 tokens (  266.66 ms per token,     3.75 tokens per second)
llama_perf_context_print:        eval time =  

In [1]:
with open("updated_qa_pairs.json", "w", encoding="utf-8") as out_file:
    json.dump(updated_qa_data, out_file, indent=2, ensure_ascii=False)

print(" Answers generated and saved to 'updated_qa_pairs.json'.")

 Answers generated and saved to 'updated_qa_pairs.json'.


In [54]:
with open("updated_qa_pairs.json", "r", encoding="utf-8") as in_file:
    updated_qa_data = json.load(in_file)

print("Total QA pairs:", len(updated_qa_data))
print("Sample QA pair:", updated_qa_data[0])

Total QA pairs: 297
Sample QA pair: {'question': 'What collaborative initiative was announced by the European Microfinance Network (EMN) and the Microfinance Centre (MFC) in April 2024, and what are its main objectives for supporting the European microfinance sector?', 'answer': "The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024 to capture data from the vast majority of European microfinance institutions. The main objectives of this initiative are to provide the most comprehensive dataset available on the sector today and to remain the leading source of data and analysis on the microfinance sector in Europe. This evaluation is undertaken as part of the Commission's commitment to evidence-based policy making under the Better Regulation policy."}


Evaluation Part

In [51]:

assert len(data) == len(updated_qa_data), "Mismatch in number of QA pairs"

bleu_scores = []
similarity_scores = []

model = SentenceTransformer('all-MiniLM-L6-v2')

for gpt_item, my_item in tqdm.tqdm(zip(data, updated_qa_data), total=len(data)):
    reference = gpt_item['answer'].strip()
    hypothesis = my_item['answer'].strip()

    # BLEU 
    smoothie = SmoothingFunction().method4
    bleu = sentence_bleu([reference.split()], hypothesis.split(), weights=(1, 0, 0, 0), smoothing_function=smoothie)
    bleu_scores.append(bleu)

    # Semantic Similarity (cosine similarity)
    emb_ref = model.encode(reference, convert_to_tensor=True)
    emb_hyp = model.encode(hypothesis, convert_to_tensor=True)
    sim_score = util.pytorch_cos_sim(emb_ref, emb_hyp).item()
    similarity_scores.append(sim_score)


average_bleu = sum(bleu_scores) / len(bleu_scores)
average_similarity = sum(similarity_scores) / len(similarity_scores)

print(f"🔹 Average BLEU Score (1-gram): {average_bleu:.4f}")
print(f"🔹 Average Semantic Similarity: {average_similarity:.4f}")

100%|████████████████████████████████████████████████████████████████████████████████| 297/297 [00:23<00:00, 12.45it/s]

🔹 Average BLEU Score (1-gram): 0.1933
🔹 Average Semantic Similarity: 0.7523





In [108]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\shri\AppData\Roaming\nltk_data...


True

In [109]:
rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

meteor_scores = []
rouge1_scores = []
rouge2_scores = []
rougeL_scores = []

for gpt, mine in tqdm.tqdm(zip(data, updated_qa_data), total=len(data)):
    ref = gpt["answer"].strip()
    hyp = mine["answer"].strip()

    # METEOR
    meteor = single_meteor_score(word_tokenize(ref), word_tokenize(hyp))
    meteor_scores.append(meteor)

    # ROUGE
    scores = rouge.score(ref, hyp)
    rouge1_scores.append(scores["rouge1"].fmeasure)
    rouge2_scores.append(scores["rouge2"].fmeasure)
    rougeL_scores.append(scores["rougeL"].fmeasure)

print("🔹 Average METEOR:", round(sum(meteor_scores) / len(meteor_scores), 4))
print("🔹 Average ROUGE-1:", round(sum(rouge1_scores) / len(rouge1_scores), 4))
print("🔹 Average ROUGE-2:", round(sum(rouge2_scores) / len(rouge2_scores), 4))
print("🔹 Average ROUGE-L:", round(sum(rougeL_scores) / len(rougeL_scores), 4))

100%|████████████████████████████████████████████████████████████████████████████████| 297/297 [00:25<00:00, 11.56it/s]

🔹 Average METEOR: 0.2599
🔹 Average ROUGE-1: 0.3275
🔹 Average ROUGE-2: 0.0954
🔹 Average ROUGE-L: 0.2045





Let's see some reference and generated answers side by side.

In [112]:
for i, (ref, model) in enumerate(zip(data, updated_qa_data)):
    print(f" Pair {i+1}")
    print(f" Question: {ref['question']}")
    print(f" Reference Answer (ChatGPT):\n{ref['answer']}")
    print(f" Generated Answer (Model):\n{model['answer']}")
    print("-" * 80)
    
    if i == 9:
        break  

🔢 Pair 1
❓ Question: What collaborative initiative was announced by the European Microfinance Network (EMN) and the Microfinance Centre (MFC) in April 2024, and what are its main objectives for supporting the European microfinance sector?
✅ Reference Answer (ChatGPT):
In April 2024, the European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a joint strategy aimed at strengthening the European microfinance sector. This collaborative initiative focuses on promoting financial inclusion, developing capacity-building resources, and creating a unified voice to influence policy-making at the European level. The partnership aims to better support microfinance institutions and expand access to responsible finance for underserved populations across Europe.
🤖 Generated Answer (Model):
The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024 to capture data from the vast majority of European microfinance i

Conclusion:
* This model has low Rogue BLUE and METEOR score but decent similarity score.Means, the model generates semantically similar answers but does not generates correct phrases.
* Sometimes the information is incorrect also.
* And sometime the model deviates from the actual question.
* Overall similarity is good.
  

Now, let's try with the next base-V2 model encodings, 500 word length chunks and the same mistral 7B model as chatbot.

Let's first try with 10 outputs only.

Loading model.

In [41]:
model_2 = SentenceTransformer('all-mpnet-base-v2')

if torch.cuda.is_available():
    model_2 = model_2.to('cuda')
    print("✅ Model loaded to GPU.")
else:
    print("⚠️ GPU not available, using CPU.")

⚠️ GPU not available, using CPU.


Second index

In [42]:
index_2 = faiss.read_index("chunk_index_v2_cosine_x.faiss")

Second Chunk

In [43]:
loaded_chunks = []
with open("filtered_chunks_v2_50.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        loaded_chunks.append(json.loads(line)["chunk"])

In [44]:
print("Number of chunks:", len(loaded_chunks))
print("Sample chunk:", loaded_chunks[0])
print("Index dimension:", index_2.d)     # embedding vector size
print("Vectors in index:", index_2.ntotal)

Number of chunks: 3842
Sample chunk: The European Microfinance Network (EMN) and the Microfinance Centre (MFC) are pleased to present the12th editionof their flagship publication:Microfinance in Europe: Survey Report. This long-standing survey remains the leading source of data and analysis on the microfinance sector in Europe. For thesixth consecutive survey edition, EMN and MFC have joined forces to capture data from the vast majority of European microfinance institutions, providing the most comprehensivedatasetavailable on the sector today. This edition focuses on thetypes of businesses reached by microfinanceand highlights thesocial performance of business loans, along with theimpact measurement approachesadopted by MFIs.
Index dimension: 768
Vectors in index: 3842


In [53]:
def retrieve_relevant_chunks_cosine(query,model,index,chunk, top_k=5):
    query_embedding = model.encode([query]).astype("float32")
    faiss.normalize_L2(query_embedding)
    D, I = index.search(query_embedding, top_k)
    return [chunk[i] for i in I[0]]

In [137]:
updated_qa_data_v2 = []

In [138]:

for item in tqdm.tqdm(data):
    query = item["question"]
    # Step 1: Get relevant chunks
    top_chunks = retrieve_relevant_chunks_cosine(query,model_2,index_2,loaded_chunks, top_k=5)

    # Step 2: Build prompt
    prompt = build_prompt(query, top_chunks)

    # Step 3: Token count (optional)
    token_ids = llm.tokenize(prompt.encode("utf-8"))
    # print(f"Token count for question: '{query[:50]}...':", len(token_ids))

    # Step 4: Run inference
    response = llm(prompt, max_tokens=256, temperature=0.7, stop=["###"])
    answer = response["choices"][0]["text"].strip()

    # Step 5: Save updated QA
    updated_qa_data_v2.append({
        "question": query,
        "answer": answer
    })

  0%|                                                                                          | 0/297 [00:00<?, ?it/s]Llama.generate: 21 prefix-match hit, remaining 703 prompt tokens to eval
llama_perf_context_print:        load time =    7022.50 ms
llama_perf_context_print: prompt eval time =   65348.11 ms /   703 tokens (   92.96 ms per token,    10.76 tokens per second)
llama_perf_context_print:        eval time =   36567.98 ms /   155 runs   (  235.92 ms per token,     4.24 tokens per second)
llama_perf_context_print:       total time =  102028.42 ms /   858 tokens
  0%|▎                                                                              | 1/297 [01:42<8:25:06, 102.39s/it]Llama.generate: 21 prefix-match hit, remaining 532 prompt tokens to eval
llama_perf_context_print:        load time =    7022.50 ms
llama_perf_context_print: prompt eval time =   46702.98 ms /   532 tokens (   87.79 ms per token,    11.39 tokens per second)
llama_perf_context_print:        eval time =  

In [2]:
with open("updated_qa_pairs_v2.json", "w", encoding="utf-8") as out_file:
    json.dump(updated_qa_data_v2, out_file, indent=2, ensure_ascii=False)

print(" Answers generated and saved to 'updated_qa_data_v2.json'.")

 Answers generated and saved to 'updated_qa_data_v2.json'.


In [60]:
with open("updated_qa_pairs_v2.json", "r", encoding="utf-8") as in_file:
    updated_qa_data_v2 = json.load(in_file)

print("Total QA pairs:", len(updated_qa_data))
print("Sample QA pair:", updated_qa_data[0])

Total QA pairs: 297
Sample QA pair: {'question': 'What collaborative initiative was announced by the European Microfinance Network (EMN) and the Microfinance Centre (MFC) in April 2024, and what are its main objectives for supporting the European microfinance sector?', 'answer': "The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024 to capture data from the vast majority of European microfinance institutions. The main objectives of this initiative are to provide the most comprehensive dataset available on the sector today and to remain the leading source of data and analysis on the microfinance sector in Europe. This evaluation is undertaken as part of the Commission's commitment to evidence-based policy making under the Better Regulation policy."}


In [61]:
for i, (original, model1, model2) in enumerate(zip(data, updated_qa_data, updated_qa_data_v2)):
    print(f"🔢 Pair {i+1}")
    print(f"❓ Question:\n{original['question']}\n")
    print(f"✅ Reference Answer (Original ChatGPT):\n{original['answer']}\n")
    print(f"🤖 Generated Answer (Model 1):\n{model1['answer']}\n")
    print(f"🧠 Generated Answer (Model 2):\n{model2['answer']}")
    print("-" * 100)
    
    if i == 9:
        break  


🔢 Pair 1
❓ Question:
What collaborative initiative was announced by the European Microfinance Network (EMN) and the Microfinance Centre (MFC) in April 2024, and what are its main objectives for supporting the European microfinance sector?

✅ Reference Answer (Original ChatGPT):
In April 2024, the European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a joint strategy aimed at strengthening the European microfinance sector. This collaborative initiative focuses on promoting financial inclusion, developing capacity-building resources, and creating a unified voice to influence policy-making at the European level. The partnership aims to better support microfinance institutions and expand access to responsible finance for underserved populations across Europe.

🤖 Generated Answer (Model 1):
The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024 to capture data from the vast majority of European m

In [142]:
assert len(data) == len(updated_qa_data_v2), "Mismatch in number of QA pairs"

bleu_scores_2 = []
similarity_scores_2 = []

for gpt_item, my_item in tqdm.tqdm(zip(data, updated_qa_data_v2), total=len(data)):
    reference = gpt_item['answer'].strip()
    hypothesis = my_item['answer'].strip()

    # BLEU Score (use 1-gram BLEU for QA relevance)
    smoothie = SmoothingFunction().method4
    bleu = sentence_bleu([reference.split()], hypothesis.split(), weights=(1, 0, 0, 0), smoothing_function=smoothie)
    bleu_scores_2.append(bleu)

    # Semantic Similarity (cosine similarity)
    emb_ref = model_2.encode(reference, convert_to_tensor=True)
    emb_hyp = model_2.encode(hypothesis, convert_to_tensor=True)
    sim_score = util.pytorch_cos_sim(emb_ref, emb_hyp).item()
    similarity_scores_2.append(sim_score)


100%|████████████████████████████████████████████████████████████████████████████████| 297/297 [02:14<00:00,  2.21it/s]

🔹 Average BLEU Score (1-gram): 0.1933
🔹 Average Semantic Similarity: 0.7523





In [143]:
# --- Summary ---
average_bleu = sum(bleu_scores_2) / len(bleu_scores_2)
average_similarity = sum(similarity_scores_2) / len(similarity_scores_2)

print(f"🔹 Average BLEU Score (1-gram): {average_bleu:.4f}")
print(f"🔹 Average Semantic Similarity: {average_similarity:.4f}") 

🔹 Average BLEU Score (1-gram): 0.2225
🔹 Average Semantic Similarity: 0.8782


In [144]:
rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

meteor_scores_2 = []
rouge1_scores_2 = []
rouge2_scores_2 = []
rougeL_scores_2 = []

for gpt, mine in tqdm.tqdm(zip(data, updated_qa_data_v2), total=len(data)):
    ref = gpt["answer"].strip()
    hyp = mine["answer"].strip()

    # METEOR
    meteor = single_meteor_score(word_tokenize(ref), word_tokenize(hyp))
    meteor_scores_2.append(meteor)

    # ROUGE
    scores = rouge.score(ref, hyp)
    rouge1_scores_2.append(scores["rouge1"].fmeasure)
    rouge2_scores_2.append(scores["rouge2"].fmeasure)
    rougeL_scores_2.append(scores["rougeL"].fmeasure)

print("🔹 Average METEOR:", round(sum(meteor_scores_2) / len(meteor_scores_2), 4))
print("🔹 Average ROUGE-1:", round(sum(rouge1_scores_2) / len(rouge1_scores_2), 4))
print("🔹 Average ROUGE-2:", round(sum(rouge2_scores_2) / len(rouge2_scores_2), 4))
print("🔹 Average ROUGE-L:", round(sum(rougeL_scores_2) / len(rougeL_scores_2), 4))

100%|████████████████████████████████████████████████████████████████████████████████| 297/297 [00:07<00:00, 37.31it/s]

🔹 Average METEOR: 0.3542
🔹 Average ROUGE-1: 0.3756
🔹 Average ROUGE-2: 0.1348
🔹 Average ROUGE-L: 0.2385





#### Third attempt with index_3, which has chunks with sliding window, max len 300 and without similarity merging. The embedding model is same.

In [64]:
# Load the FAISS index
index_3 = faiss.read_index("chunk_index_3.faiss")

In [65]:
with open("sliding_sentance_chunks_.pkl", "rb") as f:
    data_2 = pickle.load(f)

all_chunks_3 = data_2["chunks"]
chunk_index_map_3 = data_2["index_map"]

In [66]:
model_qa_v3 = []

In [67]:

for item in tqdm.tqdm(data):
    query = item["question"]
    # Step 1: Get relevant chunks
    top_chunks = retrieve_relevant_chunks_cosine(query,model_2,index_3,all_chunks_3, top_k=5)

    # Step 2: Build prompt
    prompt = build_prompt(query, top_chunks)

    # Step 3: Token count (optional)
    token_ids = llm.tokenize(prompt.encode("utf-8"))
    # print(f"Token count for question: '{query[:50]}...':", len(token_ids))

    # Step 4: Run inference
    response = llm(prompt, max_tokens=256, temperature=0.7, stop=["###"])
    answer = response["choices"][0]["text"].strip()

    # Step 5: Save updated QA
    model_qa_v3.append({
        "question": query,
        "answer": answer
    })

  0%|                                                                                          | 0/297 [00:00<?, ?it/s]Llama.generate: 11 prefix-match hit, remaining 1145 prompt tokens to eval
llama_perf_context_print:        load time =   81083.44 ms
llama_perf_context_print: prompt eval time =  288953.33 ms /  1145 tokens (  252.36 ms per token,     3.96 tokens per second)
llama_perf_context_print:        eval time =   64533.94 ms /   133 runs   (  485.22 ms per token,     2.06 tokens per second)
llama_perf_context_print:       total time =  353701.24 ms /  1278 tokens
  0%|▎                                                                             | 1/297 [05:54<29:07:04, 354.14s/it]Llama.generate: 21 prefix-match hit, remaining 1431 prompt tokens to eval
llama_perf_context_print:        load time =   81083.44 ms
llama_perf_context_print: prompt eval time =  362701.48 ms /  1431 tokens (  253.46 ms per token,     3.95 tokens per second)
llama_perf_context_print:        eval time =

KeyboardInterrupt: 

In [69]:
for i, (original, model1, model2, model3) in enumerate(zip(data, updated_qa_data, updated_qa_data_v2,model_qa_v3)):
    print(f"🔢 Pair {i+1}")
    print(f"❓ Question:\n{original['question']}\n")
    print(f"✅ Reference Answer (Original ChatGPT):\n{original['answer']}\n")
    print(f"🤖 Generated Answer (Model 1):\n{model1['answer']}\n")
    print(f"🧠 Generated Answer (Model 2):\n{model2['answer']}")
    print(f"🧠 Generated Answer (Model 3):\n{model3['answer']}")
    print("-" * 100)
    
    if i == 9:
        break  # Show only the top 10


🔢 Pair 1
❓ Question:
What collaborative initiative was announced by the European Microfinance Network (EMN) and the Microfinance Centre (MFC) in April 2024, and what are its main objectives for supporting the European microfinance sector?

✅ Reference Answer (Original ChatGPT):
In April 2024, the European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a joint strategy aimed at strengthening the European microfinance sector. This collaborative initiative focuses on promoting financial inclusion, developing capacity-building resources, and creating a unified voice to influence policy-making at the European level. The partnership aims to better support microfinance institutions and expand access to responsible finance for underserved populations across Europe.

🤖 Generated Answer (Model 1):
The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024 to capture data from the vast majority of European m

Let's numerically evaluate the 3rd model

In [71]:
len(model_qa_v3)

11

In [72]:
subset_data = data[:10] 
subset_model_qa_v3 = model_qa_v3[:10]

assert len(subset_data) == len(subset_model_qa_v3), "Mismatch in number of QA pairs"

bleu_scores = []
similarity_scores = []

model = SentenceTransformer('all-MiniLM-L6-v2')

for gpt_item, my_item in tqdm.tqdm(zip(subset_data, subset_model_qa_v3), total=10):
    reference = gpt_item['answer'].strip()
    hypothesis = my_item['answer'].strip()

    # BLEU
    smoothie = SmoothingFunction().method4
    bleu = sentence_bleu([reference.split()], hypothesis.split(), weights=(1, 0, 0, 0), smoothing_function=smoothie)
    bleu_scores.append(bleu)

    # Semantic similarity
    emb_ref = model.encode(reference, convert_to_tensor=True)
    emb_hyp = model.encode(hypothesis, convert_to_tensor=True)
    sim_score = util.pytorch_cos_sim(emb_ref, emb_hyp).item()
    similarity_scores.append(sim_score)

# Averages
average_bleu = sum(bleu_scores) / len(bleu_scores)
average_similarity = sum(similarity_scores) / len(similarity_scores)

print(f"🔹 Average BLEU Score (1-gram): {average_bleu:.4f}")
print(f"🔹 Average Semantic Similarity: {average_similarity:.4f}")


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  8.41it/s]

🔹 Average BLEU Score (1-gram): 0.2736
🔹 Average Semantic Similarity: 0.8424





In [74]:
rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

meteor_scores = []
rouge1_scores = []
rouge2_scores = []
rougeL_scores = []

for gpt, mine in tqdm.tqdm(zip(subset_data, subset_model_qa_v3), total=len(subset_data)):
    ref = gpt["answer"].strip()
    hyp = mine["answer"].strip()

    # METEOR
    meteor = single_meteor_score(word_tokenize(ref), word_tokenize(hyp))
    meteor_scores.append(meteor)

    # ROUGE
    scores = rouge.score(ref, hyp)
    rouge1_scores.append(scores["rouge1"].fmeasure)
    rouge2_scores.append(scores["rouge2"].fmeasure)
    rougeL_scores.append(scores["rougeL"].fmeasure)

print("🔹 Average METEOR:", round(sum(meteor_scores) / len(meteor_scores), 4))
print("🔹 Average ROUGE-1:", round(sum(rouge1_scores) / len(rouge1_scores), 4))
print("🔹 Average ROUGE-2:", round(sum(rouge2_scores) / len(rouge2_scores), 4))
print("🔹 Average ROUGE-L:", round(sum(rougeL_scores) / len(rougeL_scores), 4))

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 32.31it/s]


🔹 Average METEOR: 0.391
🔹 Average ROUGE-1: 0.4342
🔹 Average ROUGE-2: 0.174
🔹 Average ROUGE-L: 0.2913


The Performance of this approach is better than all previous approaches. The only problem with this is that it takes a lot of time even 5 minutes to generate a single answer, which is very bad.

The similar chunk retrieval is very fast the prompt creation is very fast, the thing which is taking so much time is Mistral 7B which I am using as my chatbot. Let's try for once Gemini 1.5 Flash and see if the latency improves.

In [75]:
pip install google-generativeai

Collecting google-generativeai
  Downloading google_generativeai-0.8.5-py3-none-any.whl.metadata (3.9 kB)
Collecting google-ai-generativelanguage==0.6.15 (from google-generativeai)
  Downloading google_ai_generativelanguage-0.6.15-py3-none-any.whl.metadata (5.7 kB)
Collecting google-api-core (from google-generativeai)
  Downloading google_api_core-2.25.1-py3-none-any.whl.metadata (3.0 kB)
Collecting google-api-python-client (from google-generativeai)
  Downloading google_api_python_client-2.176.0-py3-none-any.whl.metadata (7.0 kB)
Collecting google-auth>=2.15.0 (from google-generativeai)
  Downloading google_auth-2.40.3-py2.py3-none-any.whl.metadata (6.2 kB)
Collecting proto-plus<2.0.0dev,>=1.22.3 (from google-ai-generativelanguage==0.6.15->google-generativeai)
  Downloading proto_plus-1.26.1-py3-none-any.whl.metadata (2.2 kB)
Collecting googleapis-common-protos<2.0.0,>=1.56.2 (from google-api-core->google-generativeai)
  Downloading googleapis_common_protos-1.70.0-py3-none-any.whl.met

In [76]:
import os
import google.generativeai as genai

In [95]:
os.environ["GOOGLE_API_KEY"] = "***********" 

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

#  Gemini 1.5 Flash model
gemini_model = genai.GenerativeModel(model_name="gemini-1.5-flash")



In [79]:
gemini_qa = []

In [80]:
i = 0
for item in tqdm.tqdm(data):
    query = item["question"]
    # Get relevant chunks
    top_chunks = retrieve_relevant_chunks_cosine(query,model_2,index_3,all_chunks_3, top_k=5)

    #  Build prompt
    prompt = build_prompt(query, top_chunks)

    response = gemini_model.generate_content(prompt)
    answer = response.text

    # Save updated QA
    gemini_qa.append({
        "question": query,
        "answer": answer
    })
    i+=1
    if(i>9):
        break

  3%|██▍                                                                               | 9/297 [00:33<17:56,  3.74s/it]


In [82]:
for i, (original, model3,gemini) in enumerate(zip(data,model_qa_v3 , gemini_qa)):
    print(f"🔢 Pair {i+1}")
    print(f"❓ Question:\n{original['question']}\n")
    print(f"✅ Reference Answer (Original ChatGPT):\n{original['answer']}\n")
    print(f"🤖 Generated Answer (Model 1):\n{model3['answer']}\n")
    print(f"🧠 Gemini Answer (Model 2):\n{gemini['answer']}")
    # print(f"🧠 Generated Answer (Model 3):\n{model3['answer']}")
    print("-" * 100)
    
    if i == 9:
        break  # Show only the top 10


🔢 Pair 1
❓ Question:
What collaborative initiative was announced by the European Microfinance Network (EMN) and the Microfinance Centre (MFC) in April 2024, and what are its main objectives for supporting the European microfinance sector?

✅ Reference Answer (Original ChatGPT):
In April 2024, the European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a joint strategy aimed at strengthening the European microfinance sector. This collaborative initiative focuses on promoting financial inclusion, developing capacity-building resources, and creating a unified voice to influence policy-making at the European level. The partnership aims to better support microfinance institutions and expand access to responsible finance for underserved populations across Europe.

🤖 Generated Answer (Model 1):
The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024 called "Microfinance in Europe: Survey Report". The 

In [83]:
# subset_data = data[:10] 
# subset_model_qa_v3 = model_qa_v3[:10]

assert len(subset_data) == len(gemini_qa), "Mismatch in number of QA pairs"

bleu_scores = []
similarity_scores = []

model = SentenceTransformer('all-MiniLM-L6-v2')

for gpt_item, my_item in tqdm.tqdm(zip(subset_data, gemini_qa), total=10):
    reference = gpt_item['answer'].strip()
    hypothesis = my_item['answer'].strip()

    # BLEU
    smoothie = SmoothingFunction().method4
    bleu = sentence_bleu([reference.split()], hypothesis.split(), weights=(1, 0, 0, 0), smoothing_function=smoothie)
    bleu_scores.append(bleu)

    # Semantic similarity
    emb_ref = model.encode(reference, convert_to_tensor=True)
    emb_hyp = model.encode(hypothesis, convert_to_tensor=True)
    sim_score = util.pytorch_cos_sim(emb_ref, emb_hyp).item()
    similarity_scores.append(sim_score)

# Averages
average_bleu = sum(bleu_scores) / len(bleu_scores)
average_similarity = sum(similarity_scores) / len(similarity_scores)

print(f"🔹 Average BLEU Score (1-gram): {average_bleu:.4f}")
print(f"🔹 Average Semantic Similarity: {average_similarity:.4f}")


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00,  8.77it/s]

🔹 Average BLEU Score (1-gram): 0.3110
🔹 Average Semantic Similarity: 0.8437





In [84]:
rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

meteor_scores = []
rouge1_scores = []
rouge2_scores = []
rougeL_scores = []

for gpt, mine in tqdm.tqdm(zip(subset_data, gemini_qa), total=len(subset_data)):
    ref = gpt["answer"].strip()
    hyp = mine["answer"].strip()

    # METEOR
    meteor = single_meteor_score(word_tokenize(ref), word_tokenize(hyp))
    meteor_scores.append(meteor)

    # ROUGE
    scores = rouge.score(ref, hyp)
    rouge1_scores.append(scores["rouge1"].fmeasure)
    rouge2_scores.append(scores["rouge2"].fmeasure)
    rougeL_scores.append(scores["rougeL"].fmeasure)

print("🔹 Average METEOR:", round(sum(meteor_scores) / len(meteor_scores), 4))
print("🔹 Average ROUGE-1:", round(sum(rouge1_scores) / len(rouge1_scores), 4))
print("🔹 Average ROUGE-2:", round(sum(rouge2_scores) / len(rouge2_scores), 4))
print("🔹 Average ROUGE-L:", round(sum(rougeL_scores) / len(rougeL_scores), 4))

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 29.69it/s]

🔹 Average METEOR: 0.3483
🔹 Average ROUGE-1: 0.4587
🔹 Average ROUGE-2: 0.1757
🔹 Average ROUGE-L: 0.3038





Let's try with our second prompt for better reasoning

In [85]:
gemini_qa_prompt2 = []
i = 0
for item in tqdm.tqdm(data):
    query = item["question"]
    # Step 1: Get relevant chunks
    top_chunks = retrieve_relevant_chunks_cosine(query,model_2,index_3,all_chunks_3, top_k=5)

    # Step 2: Build prompt
    prompt = build_prompt_2(query, top_chunks)

    response = gemini_model.generate_content(prompt)
    answer = response.text

    # Step 5: Save updated QA
    gemini_qa_prompt2.append({
        "question": query,
        "answer": answer
    })
    i+=1
    if(i>9):
        break

  3%|██▍                                                                               | 9/297 [00:22<12:00,  2.50s/it]


In [86]:
assert len(subset_data) == len(gemini_qa_prompt2), "Mismatch in number of QA pairs"

bleu_scores = []
similarity_scores = []

model = SentenceTransformer('all-MiniLM-L6-v2')

for gpt_item, my_item in tqdm.tqdm(zip(subset_data, gemini_qa_prompt2), total=10):
    reference = gpt_item['answer'].strip()
    hypothesis = my_item['answer'].strip()

    # BLEU
    smoothie = SmoothingFunction().method4
    bleu = sentence_bleu([reference.split()], hypothesis.split(), weights=(1, 0, 0, 0), smoothing_function=smoothie)
    bleu_scores.append(bleu)

    # Semantic similarity
    emb_ref = model.encode(reference, convert_to_tensor=True)
    emb_hyp = model.encode(hypothesis, convert_to_tensor=True)
    sim_score = util.pytorch_cos_sim(emb_ref, emb_hyp).item()
    similarity_scores.append(sim_score)

# Averages
average_bleu = sum(bleu_scores) / len(bleu_scores)
average_similarity = sum(similarity_scores) / len(similarity_scores)

print(f"🔹 Average BLEU Score (1-gram): {average_bleu:.4f}")
print(f"🔹 Average Semantic Similarity: {average_similarity:.4f}")


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 13.20it/s]

🔹 Average BLEU Score (1-gram): 0.2719
🔹 Average Semantic Similarity: 0.7523





In [87]:
meteor_scores = []
rouge1_scores = []
rouge2_scores = []
rougeL_scores = []

for gpt, mine in tqdm.tqdm(zip(subset_data, gemini_qa_prompt2), total=len(subset_data)):
    ref = gpt["answer"].strip()
    hyp = mine["answer"].strip()

    # METEOR
    meteor = single_meteor_score(word_tokenize(ref), word_tokenize(hyp))
    meteor_scores.append(meteor)

    # ROUGE
    scores = rouge.score(ref, hyp)
    rouge1_scores.append(scores["rouge1"].fmeasure)
    rouge2_scores.append(scores["rouge2"].fmeasure)
    rougeL_scores.append(scores["rougeL"].fmeasure)

print("🔹 Average METEOR:", round(sum(meteor_scores) / len(meteor_scores), 4))
print("🔹 Average ROUGE-1:", round(sum(rouge1_scores) / len(rouge1_scores), 4))
print("🔹 Average ROUGE-2:", round(sum(rouge2_scores) / len(rouge2_scores), 4))
print("🔹 Average ROUGE-L:", round(sum(rougeL_scores) / len(rougeL_scores), 4))

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 42.22it/s]

🔹 Average METEOR: 0.2816
🔹 Average ROUGE-1: 0.4031
🔹 Average ROUGE-2: 0.1343
🔹 Average ROUGE-L: 0.252





When I change the prompt to strict prompt, the performance matrix drop instantly. Let's read few answers side by side for human comparison.

In [88]:
for i, (original, model3,gemini,gemini_prompt2) in enumerate(zip(data,model_qa_v3 , gemini_qa,gemini_qa_prompt2)):
    print(f"🔢 Pair {i+1}")
    print(f"❓ Question:\n{original['question']}\n")
    print(f"✅ Reference Answer (Original ChatGPT):\n{original['answer']}\n")
    print(f"🤖 Generated Answer (Model 1):\n{model3['answer']}\n")
    print(f"🧠 Gemini Answer (Model 2):\n{gemini['answer']}")
    print(f"🧠 Gemini Answer Prompt 2(Model 3):\n{gemini_prompt2['answer']}")
    print("-" * 100)
    
    if i == 9:
        break  # Show only the top 10


🔢 Pair 1
❓ Question:
What collaborative initiative was announced by the European Microfinance Network (EMN) and the Microfinance Centre (MFC) in April 2024, and what are its main objectives for supporting the European microfinance sector?

✅ Reference Answer (Original ChatGPT):
In April 2024, the European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a joint strategy aimed at strengthening the European microfinance sector. This collaborative initiative focuses on promoting financial inclusion, developing capacity-building resources, and creating a unified voice to influence policy-making at the European level. The partnership aims to better support microfinance institutions and expand access to responsible finance for underserved populations across Europe.

🤖 Generated Answer (Model 1):
The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024 called "Microfinance in Europe: Survey Report". The 

#### Last one the langchain splitting (simplest)

In [46]:
with open("all_chunks_flat.pkl", "rb") as f:
    langchain = pickle.load(f)

all_chunks_langchain = langchain["chunks"]

In [47]:
index_langchain = faiss.read_index("chunk_index_langchain.faiss")

In [45]:
# gemini_qa_langchain = []
# i = 0
# for item in tqdm.tqdm(data):
#     query = item["question"]
#     # Step 1: Get relevant chunks
#     top_chunks = retrieve_relevant_chunks_cosine(query,model_2,index_langchain ,all_chunks_langchain, top_k=5)

#     # Step 2: Build prompt
#     prompt = build_prompt_2(query, top_chunks)

#     response = gemini_model.generate_content(prompt)
#     answer = response.text

#     # Step 5: Save updated QA
#     gemini_qa_langchain.append({
#         "question": query,
#         "answer": answer
#     })
#     i+=1
#     if(i>9):
#         break

The quota reached its limit lets use the old LLM to geenrate the ans

In [97]:
mistral_qa_langchain = []
i = 0
for item in tqdm.tqdm(data):
    query = item["question"]
    # Step 1: Get relevant chunks
    top_chunks = retrieve_relevant_chunks_cosine(query,model_2,index_langchain ,all_chunks_langchain, top_k=5)

    # Step 2: Build prompt
    prompt = build_prompt_2(query, top_chunks)

    # Step 3: Token count (optional)
    # token_ids = llm.tokenize(prompt.encode("utf-8"))
    # print(f"Token count for question: '{query[:50]}...':", len(token_ids))

    # Step 4: Run inference
    response = llm(prompt, max_tokens=256, temperature=0.7, stop=["###"])
    answer = response["choices"][0]["text"].strip()
    # Step 5: Save updated QA
    
    mistral_qa_langchain.append({
        "question": query,
        "answer": answer
    })
    i+=1
    if(i>9):
        break

  0%|                                                                                          | 0/297 [00:00<?, ?it/s]Llama.generate: 11 prefix-match hit, remaining 487 prompt tokens to eval
llama_perf_context_print:        load time =   81083.44 ms
llama_perf_context_print: prompt eval time = 8677788.02 ms /   488 tokens (17782.35 ms per token,     0.06 tokens per second)
llama_perf_context_print:        eval time =    2213.93 ms /    11 runs   (  201.27 ms per token,     4.97 tokens per second)
llama_perf_context_print:       total time =   42363.90 ms /   499 tokens
  0%|▎                                                                               | 1/297 [00:42<3:29:37, 42.49s/it]Llama.generate: 55 prefix-match hit, remaining 501 prompt tokens to eval
llama_perf_context_print:        load time =   81083.44 ms
llama_perf_context_print: prompt eval time =   40200.30 ms /   501 tokens (   80.24 ms per token,    12.46 tokens per second)
llama_perf_context_print:        eval time =  

Let's fetch 50 in total

In [101]:
i = 10
for item in tqdm.tqdm(data[10:]):
    query = item["question"]
    # Step 1: Get relevant chunks
    top_chunks = retrieve_relevant_chunks_cosine(query,model_2,index_langchain ,all_chunks_langchain, top_k=5)

    # Step 2: Build prompt
    prompt = build_prompt_2(query, top_chunks)

    # Step 3: Token count (optional)
    # token_ids = llm.tokenize(prompt.encode("utf-8"))
    # print(f"Token count for question: '{query[:50]}...':", len(token_ids))

    # Step 4: Run inference
    response = llm(prompt, max_tokens=256, temperature=0.7, stop=["###"])
    answer = response["choices"][0]["text"].strip()
    # Step 5: Save updated QA
    
    mistral_qa_langchain.append({
        "question": query,
        "answer": answer
    })
    i+=1
    if(i>50):
        break

  0%|                                                                                          | 0/287 [00:00<?, ?it/s]Llama.generate: 55 prefix-match hit, remaining 617 prompt tokens to eval
llama_perf_context_print:        load time =   81083.44 ms
llama_perf_context_print: prompt eval time =   50439.35 ms /   617 tokens (   81.75 ms per token,    12.23 tokens per second)
llama_perf_context_print:        eval time =   41016.11 ms /   189 runs   (  217.02 ms per token,     4.61 tokens per second)
llama_perf_context_print:       total time =   91568.93 ms /   806 tokens
  0%|▎                                                                               | 1/287 [01:31<7:17:00, 91.68s/it]Llama.generate: 55 prefix-match hit, remaining 1301 prompt tokens to eval
llama_perf_context_print:        load time =   81083.44 ms
llama_perf_context_print: prompt eval time =  125302.18 ms /  1301 tokens (   96.31 ms per token,    10.38 tokens per second)
llama_perf_context_print:        eval time = 

In [108]:
subset_data = data[:51] 

In [109]:
assert len(subset_data) == len(mistral_qa_langchain), "Mismatch in number of QA pairs"

bleu_scores = []
similarity_scores = []

model = SentenceTransformer('all-MiniLM-L6-v2')

for gpt_item, my_item in tqdm.tqdm(zip(subset_data, mistral_qa_langchain), total=10):
    reference = gpt_item['answer'].strip()
    hypothesis = my_item['answer'].strip()

    # BLEU
    smoothie = SmoothingFunction().method4
    bleu = sentence_bleu([reference.split()], hypothesis.split(), weights=(1, 0, 0, 0), smoothing_function=smoothie)
    bleu_scores.append(bleu)

    # Semantic similarity
    emb_ref = model.encode(reference, convert_to_tensor=True)
    emb_hyp = model.encode(hypothesis, convert_to_tensor=True)
    sim_score = util.pytorch_cos_sim(emb_ref, emb_hyp).item()
    similarity_scores.append(sim_score)

# Averages
average_bleu = sum(bleu_scores) / len(bleu_scores)
average_similarity = sum(similarity_scores) / len(similarity_scores)

print(f"🔹 Average BLEU Score (1-gram): {average_bleu:.4f}")
print(f"🔹 Average Semantic Similarity: {average_similarity:.4f}")


51it [00:05,  9.78it/s]                                                                                                

🔹 Average BLEU Score (1-gram): 0.2761
🔹 Average Semantic Similarity: 0.7907





In [107]:
len(mistral_qa_langchain)

51

In [110]:
meteor_scores = []
rouge1_scores = []
rouge2_scores = []
rougeL_scores = []

for gpt, mine in tqdm.tqdm(zip(subset_data, mistral_qa_langchain), total=len(subset_data)):
    ref = gpt["answer"].strip()
    hyp = mine["answer"].strip()

    # METEOR
    meteor = single_meteor_score(word_tokenize(ref), word_tokenize(hyp))
    meteor_scores.append(meteor)

    # ROUGE
    scores = rouge.score(ref, hyp)
    rouge1_scores.append(scores["rouge1"].fmeasure)
    rouge2_scores.append(scores["rouge2"].fmeasure)
    rougeL_scores.append(scores["rougeL"].fmeasure)

print("🔹 Average METEOR:", round(sum(meteor_scores) / len(meteor_scores), 4))
print("🔹 Average ROUGE-1:", round(sum(rouge1_scores) / len(rouge1_scores), 4))
print("🔹 Average ROUGE-2:", round(sum(rouge2_scores) / len(rouge2_scores), 4))
print("🔹 Average ROUGE-L:", round(sum(rougeL_scores) / len(rougeL_scores), 4))

100%|██████████████████████████████████████████████████████████████████████████████████| 51/51 [00:01<00:00, 48.60it/s]

🔹 Average METEOR: 0.3231
🔹 Average ROUGE-1: 0.4263
🔹 Average ROUGE-2: 0.1692
🔹 Average ROUGE-L: 0.2706





In [112]:
with open("mistral_qa_langchain_51.json", "w", encoding="utf-8") as out_file:
    json.dump(mistral_qa_langchain, out_file, indent=2, ensure_ascii=False)

print("Answers generated and saved to 'mistral_qa_langchain_51.json'.")

Answers generated and saved to 'mistral_qa_langchain_51.json'.


This matches the performance of the Gemini 1.5 flash with the earlier index. But in this strategy the chunks are very small and consistent hence the latency of the LLM has improved. If the gemini is out of quota I will finalise this strategy for the final chatboat.
Let's go for 50 QA to be sure and save some time.

This is good, I will use this one as my final chatboat option.

Let's use langchain to maintain the memory. The two ways we can maintain a memory into this,
1. First is to provide few set of QA previously asked along with current question in the prompt. The problem with this is that it may fill the context and increase the latency
2. Second one is to provide a summary of n previous conversations and append this in the prompt before the context and question and get the answer. This maybe better we will try both and retain the best one.

Few pairs of QA in Prompt.

In [118]:
pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting typing-inspection>=0.4.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)
  Downloading typing_inspection-0.4.1-py3-none-any.whl.metadat

In [7]:
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
from langchain_core.retrievers import BaseRetriever
from typing import List
from langchain.memory import ConversationBufferWindowMemory
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from pydantic import Field
from langchain_community.llms import LlamaCpp
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

In [140]:
# mistral_llm = HuggingFacePipeline(pipeline=pipe)
# pipe = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1", device=0)

mistral_llm = LlamaCpp(
    model_path=r"C:\Users\shri\Data_Science\Text Mining\mistral-7b-instruct-v0.1.Q2_K.gguf",
    temperature=0.7,
    max_tokens=512,
    top_p=1,
    n_ctx=8192,
    verbose=True
)




llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from C:\Users\shri\Data_Science\Text Mining\mistral-7b-instruct-v0.1.Q2_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader:

In [141]:
mistral_llm

LlamaCpp(client=<llama_cpp.llama.Llama object at 0x000001AC0A8AA1E0>, model_path='C:\\Users\\shri\\Data_Science\\Text Mining\\mistral-7b-instruct-v0.1.Q2_K.gguf', n_ctx=8192, max_tokens=512, temperature=0.7, top_p=1.0, model_kwargs={})

In [130]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

  embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")


In [161]:
class CustomRetriever(BaseRetriever):
    model: any = Field()
    index: any = Field()
    chunks: List[str] = Field()
    top_k: int = Field(default=5)

    def _get_relevant_documents(self, query: str) -> List[Document]:
        query_embedding = self.model.encode([query]).astype("float32")

        import faiss
        faiss.normalize_L2(query_embedding)
        D, I = self.index.search(query_embedding, self.top_k)

        return [Document(page_content=self.chunks[i]) for i in I[0]]

  warn(


In [162]:
vectorstore = FAISS.load_local("chunk_index_langchain_2", embeddings=embedding_model, allow_dangerous_deserialization=True)

# Extract the FAISS index object (optional)
faiss_index = vectorstore.index

In [170]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

In [187]:
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=3  
)

my_prompt = PromptTemplate.from_template(
    """
    ### Instruction:
    Answer the question based only on the following context and chat history.

    Chat History:
    {chat_history}

    Context:
    {context}

    Question:
    {question}

    ### Response:
    """
)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm= mistral_llm,
    retriever = CustomRetriever(
    model=embedding_model,
    index=faiss_index,
    chunks=all_chunks_langchain,
    top_k=5
)
,
    memory=memory,
    combine_docs_chain_kwargs={"prompt": my_prompt}
)


In [172]:
response = qa_chain.run(data[0]['question'])
print(response)

llama_perf_context_print:        load time =   55909.70 ms
llama_perf_context_print: prompt eval time =   55908.41 ms /   490 tokens (  114.10 ms per token,     8.76 tokens per second)
llama_perf_context_print:        eval time =   39906.29 ms /   165 runs   (  241.86 ms per token,     4.13 tokens per second)
llama_perf_context_print:       total time =   95991.68 ms /   655 tokens



    The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024, aimed at supporting the European microfinance sector. Their main objective is to capture data from the vast majority of European microfinance institutions, providing the most comprehensive dataset available on the sector today. This edition focuses on the types of businesses reached by microfinance and highlights the social performance of business loans, along with the impact measurement approaches adopted by MFIs. The report serves as an important policy tool, supporting evidence-based decision-making for policymakers working to strengthen financial inclusion and the social economy. It also functions as a benchmarking reference for MFIs, helping them evaluate their performance and position within the wider European landscape.


In [173]:
next_response = qa_chain.run("What types of businesses were reached?")
print(next_response)

Llama.generate: 1 prefix-match hit, remaining 275 prompt tokens to eval
llama_perf_context_print:        load time =   55909.70 ms
llama_perf_context_print: prompt eval time =   67612.35 ms /   275 tokens (  245.86 ms per token,     4.07 tokens per second)
llama_perf_context_print:        eval time =    8524.91 ms /    19 runs   (  448.68 ms per token,     2.23 tokens per second)
llama_perf_context_print:       total time =   76171.65 ms /   294 tokens
Llama.generate: 1 prefix-match hit, remaining 466 prompt tokens to eval
llama_perf_context_print:        load time =   55909.70 ms
llama_perf_context_print: prompt eval time =  122174.21 ms /   466 tokens (  262.18 ms per token,     3.81 tokens per second)
llama_perf_context_print:        eval time =   31189.98 ms /    68 runs   (  458.68 ms per token,     2.18 tokens per second)
llama_perf_context_print:       total time =  153483.98 ms /   534 tokens


 The collaborative initiative by the European Microfinance Network (EMN) and the Microfinance Centre (MFC) aimed to capture data from a vast majority of European microfinance institutions. However, there is no information provided in the context or chat history on the types of businesses that were reached or how the data was collected.


In [174]:
print(memory.chat_memory.messages)


[HumanMessage(content='What collaborative initiative was announced by the European Microfinance Network (EMN) and the Microfinance Centre (MFC) in April 2024, and what are its main objectives for supporting the European microfinance sector?', additional_kwargs={}, response_metadata={}), AIMessage(content='\n    The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024, aimed at supporting the European microfinance sector. Their main objective is to capture data from the vast majority of European microfinance institutions, providing the most comprehensive dataset available on the sector today. This edition focuses on the types of businesses reached by microfinance and highlights the social performance of business loans, along with the impact measurement approaches adopted by MFIs. The report serves as an important policy tool, supporting evidence-based decision-making for policymakers working to strengthen financial incl

In [175]:
next_response = qa_chain.run("What is the full form of EMN and MFC?")
print(next_response)

Llama.generate: 1 prefix-match hit, remaining 364 prompt tokens to eval
llama_perf_context_print:        load time =   55909.70 ms
llama_perf_context_print: prompt eval time =   95024.38 ms /   364 tokens (  261.06 ms per token,     3.83 tokens per second)
llama_perf_context_print:        eval time =    5890.89 ms /    12 runs   (  490.91 ms per token,     2.04 tokens per second)
llama_perf_context_print:       total time =  100932.92 ms /   376 tokens
Llama.generate: 1 prefix-match hit, remaining 814 prompt tokens to eval
llama_perf_context_print:        load time =   55909.70 ms
llama_perf_context_print: prompt eval time =  206059.69 ms /   814 tokens (  253.14 ms per token,     3.95 tokens per second)
llama_perf_context_print:        eval time =   11961.55 ms /    25 runs   (  478.46 ms per token,     2.09 tokens per second)
llama_perf_context_print:       total time =  218062.72 ms /   839 tokens


 The full form of EMN is European Microfinance Network and the full form of MFC is Microfinance Centre.


It seems working the chatboat it is giving the answers by considering the context and and previous QA pairs.

I will use this approach. Just for once I am trying to use a Q4 quantised version of this mistral 7B model which is documented to be faster than this Q2 quantised version of this model

In [None]:
mistral_Q4 = LlamaCpp(
    model_path= r"C:\Users\shri\Data_Science\Text Mining\mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    temperature=0.7,
    max_tokens=512,
    top_p=1,
    n_ctx=8192,
    verbose=True
)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from C:\Users\shri\Data_Science\Text Mining\mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loade

In [188]:
qa_chain_2 = ConversationalRetrievalChain.from_llm(
    llm= mistral_Q4,
    retriever = CustomRetriever(
    model=embedding_model,
    index=faiss_index,
    chunks=all_chunks_langchain,
    top_k=5
)
,
    memory=memory,
    combine_docs_chain_kwargs={"prompt": my_prompt}
)

In [189]:
response = qa_chain_2.run(data[0]['question'])
print(response)

Llama.generate: 32 prefix-match hit, remaining 458 prompt tokens to eval
llama_perf_context_print:        load time =   63002.41 ms
llama_perf_context_print: prompt eval time =   66941.14 ms /   458 tokens (  146.16 ms per token,     6.84 tokens per second)
llama_perf_context_print:        eval time =   94425.34 ms /   197 runs   (  479.32 ms per token,     2.09 tokens per second)
llama_perf_context_print:       total time =  161842.69 ms /   655 tokens



    The European Microfinance Network (EMN) and the Microfinance Centre (MFC) announced a collaborative initiative in April 2024, which is the publication of the12th editionof their flagship publication:Microfinance in Europe: Survey Report. This long-standing survey remains the leading source of data and analysis on the microfinance sector in Europe.

The main objectives of this collaborative initiative are to provide evidence-based decision-making for policymakers working to strengthen financial inclusion and the social economy, and to serve as a benchmarking reference for MFIs, helping them evaluate their performance and position within the wider European landscape. This edition focuses on the types of businesses reached by microfinance and highlights the social performance of business loans, along with the impact measurement approaches adopted by MFIs. It offers valuable insights into how these institutions contribute to social inclusion, entrepreneurship, and local development.


In [190]:
next_response = qa_chain_2.run("What types of businesses were reached?")
print(next_response)

Llama.generate: 1 prefix-match hit, remaining 307 prompt tokens to eval
llama_perf_context_print:        load time =   63002.41 ms
llama_perf_context_print: prompt eval time =   41197.35 ms /   307 tokens (  134.19 ms per token,     7.45 tokens per second)
llama_perf_context_print:        eval time =   15861.02 ms /    35 runs   (  453.17 ms per token,     2.21 tokens per second)
llama_perf_context_print:       total time =   57120.99 ms /   342 tokens
Llama.generate: 1 prefix-match hit, remaining 730 prompt tokens to eval
llama_perf_context_print:        load time =   63002.41 ms
llama_perf_context_print: prompt eval time =  100660.39 ms /   730 tokens (  137.89 ms per token,     7.25 tokens per second)
llama_perf_context_print:        eval time =   64141.74 ms /   136 runs   (  471.63 ms per token,     2.12 tokens per second)
llama_perf_context_print:       total time =  165099.07 ms /   866 tokens


 The 12th edition of the Microfinance in Europe: Survey Report highlights that the microfinance sector in Europe reaches a variety of businesses, including loans for the public sector, framework loans for the public sector, loans for the private sector, intermediated loans for SMEs, mid-caps and other priorities, microfinance equity, venture debt, investments in infrastructure and environmental funds, investments in SME and mid-cap funds, guarantees in support of SMEs, mid-caps and other objectives, advisory services, mandates and partnerships, InvestEU RRF and financial, credit enhancement for project finance, and guarantees.


In [191]:
next_response = qa_chain.run("What is the full form of EMN and MFC?")
print(next_response)

Llama.generate: 1 prefix-match hit, remaining 464 prompt tokens to eval
llama_perf_context_print:        load time =   55909.70 ms
llama_perf_context_print: prompt eval time =  125602.90 ms /   464 tokens (  270.70 ms per token,     3.69 tokens per second)
llama_perf_context_print:        eval time =    6484.48 ms /    12 runs   (  540.37 ms per token,     1.85 tokens per second)
llama_perf_context_print:       total time =  132115.25 ms /   476 tokens
Llama.generate: 1 prefix-match hit, remaining 897 prompt tokens to eval
llama_perf_context_print:        load time =   55909.70 ms
llama_perf_context_print: prompt eval time =  248930.44 ms /   897 tokens (  277.51 ms per token,     3.60 tokens per second)
llama_perf_context_print:        eval time =    8899.32 ms /    19 runs   (  468.39 ms per token,     2.13 tokens per second)
llama_perf_context_print:       total time =  257866.35 ms /   916 tokens


 The full names of EMN and MFC are not provided in the context or chat history.


okay lets be done with it. I will use the Q4 in the streamlit web app.