## Baseline Model
This notebook is based on the LlamaIndex documentation example at https://docs.llamaindex.ai/en/stable/examples/chat_engine/chat_engine_best/.
It implements a naive RAG pipeline with only the basic components: the raw corpus (the WASHNORM report) is loaded with SimpleDirectoryReader, an index is built with VectorStoreIndex, and a query or chat engine is built on top of that index.

In [None]:
!pip install llama-index llama-index-llms-huggingface-api llama-index-llms-openai "huggingface_hub[inference]" llama-index-embeddings-huggingface

Necessary imports

In [10]:
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

In [11]:
from dotenv import load_dotenv
import os

load_dotenv()

hf_token = os.getenv("HUGGING_FACE_TOKEN")
open_ai_token = os.getenv("OPEN_AI_API_KEY")
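
If either variable is missing from the `.env` file, `os.getenv` silently returns `None`, and the problem may only surface later when the model is called. A minimal sketch of an early check, where `require_env` is a hypothetical helper and not part of the notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check the .env file.")
    return value


# Possible use in the notebook (assumes load_dotenv() has already been called):
# hf_token = require_env("HUGGING_FACE_TOKEN")
```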

The LLM is initialized here.

In [29]:
llm = HuggingFaceInferenceAPI(model_name="meta-llama/Llama-3.2-1B-Instruct", token=hf_token)
# llm_2 = HuggingFaceInferenceAPI(model_name="HuggingFaceH4/zephyr-7b-alpha")

Settings.llm = llm

The data is loaded in, and the index is built on only a subset of the loaded documents.

In [13]:
# loading in data
reader = SimpleDirectoryReader(input_dir='/Users/mac/project_in_cs/data')
data = reader.load_data()

In [14]:
small_dataset = data[16:28]
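
The slice is positional: `data[16:28]` takes the 17th to 28th loaded documents. As an alternative sketch, the same pages could be selected by their label rather than by position. `select_pages` is a hypothetical helper, and it assumes every document exposes a `metadata` dict with a `page_label` key, as the documents printed in this notebook do.

```python
from types import SimpleNamespace


def select_pages(documents, first: int, last: int):
    """Keep documents whose numeric page_label lies in [first, last]."""
    selected = []
    for doc in documents:
        label = str(doc.metadata.get("page_label", ""))
        # Front matter is sometimes labelled with roman numerals; skip those.
        if label.isdigit() and first <= int(label) <= last:
            selected.append(doc)
    return selected


# Stand-in objects for illustration only; in the notebook, pass `data` instead.
fake_docs = [SimpleNamespace(metadata={"page_label": str(n)}) for n in range(1, 31)]
subset = select_pages(fake_docs, 17, 28)
print(len(subset))
```

Selecting by label keeps the subset stable if the reader ever returns the pages in a different order.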

In [108]:
small_dataset

[Document(id_='c3beb57b-ed71-44ae-a005-9c94cb64ed21', embedding=None, metadata={'page_label': '17', 'file_name': 'washnorm_report.pdf', 'file_path': '/Users/mac/project_in_cs/data/washnorm_report.pdf', 'file_type': 'application/pdf', 'file_size': 9122373, 'creation_date': '2024-11-17', 'last_modified_date': '2024-11-17'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text=' \n    \n \n \n \n16 \n \nWater, Sanitation, Hygiene  \nNational Outcome Routine Mapping 2021 Conclusions \n \nEXECUTIVE SUMMARY \nI. Overview \nhe W ater Sanitation and Hygiene  National Outcome Routine Mapping ( WASHNORM) is an \nannual national household and facility -based survey encompassing a comprehensive range of \nkey outcome indicators and parameters related to the WASH sector . 

In [15]:
index = VectorStoreIndex.from_documents(small_dataset)

For a back-and-forth conversation, use the commented-out first line, which builds a chat engine. Here the RAG is built as a query engine instead, because the questions in the test data set are asked one by one, independently of each other.

In [31]:
# chat_engine = index.as_chat_engine(llm=llm, verbose=True, chat_mode='context')
# The name chat_engine is kept, but the object built here is a query engine.
chat_engine = index.as_query_engine(llm=llm, verbose=True)

First test

In [60]:
# chat_engine.chat_repl()  # only applies when a chat engine is built
# A query engine is called with .query(), not .chat()
response = chat_engine.query("What does WASHNORM mean?")

In [64]:
from torcheval.metrics.functional import bleu_score

# strip() returns a new string, so the result has to be assigned
given_answer = response.response.strip()
llm_answer = [given_answer]
correct_answer = ["WASHNORM means Water Sanitation and Hygiene National Outcome Routine Mapping"]

In [104]:
given_answer = ["WASHNORM stands for Water, Sanitation, Hygiene, National Outcome Routine Mapping"]

In [105]:
bleu_score(given_answer,correct_answer)

tensor(0.2778)

In [89]:
# using an example from bleu score documentation to have a basis of comparison
bleu_score(["a squirrel is eating a nut"],["the squirrel is eating a tasty nut"])

tensor(0.4548)
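
To see where these numbers come from, BLEU can be recomputed by hand. The sketch below assumes whitespace tokenization, uniform weights over 1- to 4-grams, and the standard brevity penalty; `sentence_bleu` is a name introduced here, not a torcheval function. Under those assumptions it gives about 0.4548 for the squirrel example and about 0.2778 for the WASHNORM pair.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """BLEU for one candidate/reference pair with uniform n-gram weights."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize candidates that are not longer than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


print(round(sentence_bleu("a squirrel is eating a nut",
                          "the squirrel is eating a tasty nut"), 4))
```

In the WASHNORM pair, tokens such as `Water,` keep their trailing comma and so do not match `Water` in the reference. Only 5 of the 10 unigrams and 1 of the 7 four-grams match, which explains the lower score despite the two sentences meaning the same thing.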

Load the test data (the question-and-answer pairs generated in the 'qa_extraction' notebook). The RAG responses are then added to the same dictionary.

In [1]:
import pickle
with open('test_data.pkl','rb') as f:
    test_data = pickle.load(f)
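
The cells that follow read the keys `question`, `answers`, and `difficulty` from each entry. A minimal sketch of a sanity check for that shape is given below; `check_test_data` and the sample entries are illustrative assumptions, not the contents of `test_data.pkl`.

```python
REQUIRED_KEYS = {"question", "answers", "difficulty"}


def check_test_data(test_data: dict) -> list:
    """Return the ids of entries that lack a required key or have an unexpected difficulty."""
    bad = []
    for idx, entry in test_data.items():
        if not REQUIRED_KEYS.issubset(entry) or entry["difficulty"] not in (1, 2, 3):
            bad.append(idx)
    return bad


# Made-up entries for illustration only.
sample = {
    1: {
        "question": "What does WASHNORM mean?",
        "answers": "WASHNORM means Water Sanitation and Hygiene National Outcome Routine Mapping",
        "difficulty": 1,
    },
    2: {"question": "An entry without answers or difficulty"},
}
print(check_test_data(sample))
```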

In [33]:
# Generate LLM responses and add them to the dictionary.
# The loop variable is named `entry` so it does not overwrite `data` from the loading cell.
for idx, entry in test_data.items():
    response = chat_engine.query(entry["question"])
    entry["llm_response"] = response.response
    print(idx)

1
...
90
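
Each iteration of the loop above is a call to a remote inference API, so a single failure would stop the run partway through. A hedged sketch of a more forgiving loop follows; `answer_all` and its `ask` argument are hypothetical, with `ask` standing in for something like `lambda q: chat_engine.query(q).response`.

```python
def answer_all(test_data: dict, ask, max_attempts: int = 3) -> list:
    """Fill in 'llm_response' for every entry; return the ids that still failed."""
    failed = []
    for idx, entry in test_data.items():
        if "llm_response" in entry:
            continue  # already answered, e.g. on an earlier run
        last_error = None
        for _ in range(max_attempts):
            try:
                entry["llm_response"] = ask(entry["question"])
                break
            except Exception as exc:  # broad on purpose: the error type depends on the backend
                last_error = exc
        else:
            # The inner loop never reached `break`, so every attempt raised.
            failed.append(idx)
            print(f"question {idx} failed: {last_error}")
    return failed
```

Because answered entries are skipped, the function can be called again on the same dictionary to retry only the questions that failed.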


In [34]:
# Split the test data by difficulty level (1 = easy, 2 = medium, 3 = hard)
easy = {key: value for key, value in test_data.items() if value['difficulty'] == 1}
medium = {key: value for key, value in test_data.items() if value['difficulty'] == 2}
hard = {key: value for key, value in test_data.items() if value['difficulty'] == 3}
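
The three comprehensions can also be written as a single pass that groups entries by their `difficulty` value. This is only a sketch of an alternative; `group_by_difficulty` and the sample entries are made up for illustration.

```python
from collections import defaultdict


def group_by_difficulty(test_data: dict) -> dict:
    """Map each difficulty value to the sub-dictionary of entries that have it."""
    groups = defaultdict(dict)
    for idx, entry in test_data.items():
        groups[entry["difficulty"]][idx] = entry
    return dict(groups)


# Made-up entries for illustration only.
sample = {1: {"difficulty": 1}, 2: {"difficulty": 3}, 3: {"difficulty": 1}}
groups = group_by_difficulty(sample)
print({level: sorted(entries) for level, entries in groups.items()})
```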

In [43]:
from torcheval.metrics.functional import bleu_score

Evaluate the overall BLEU score of the generated text for the easy, medium, and hard questions.

In [50]:
easy_answers = [value['answers'] for value in easy.values()]
easy_llm_response = [value['llm_response'] for value in easy.values()]

In [53]:
easy_bleu_score = bleu_score(easy_llm_response,easy_answers)
easy_bleu_score

tensor(0.4111)

In [54]:
medium_answers = [value['answers'] for value in medium.values()]
medium_llm_response = [value['llm_response'] for value in medium.values()]

medium_bleu_score = bleu_score(medium_llm_response,medium_answers)
medium_bleu_score

tensor(0.2275)

In [55]:
hard_answers = [value['answers'] for value in hard.values()]
hard_llm_response = [value['llm_response'] for value in hard.values()]

hard_bleu_score = bleu_score(hard_llm_response, hard_answers)
hard_bleu_score

tensor(0.1408)
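
When `bleu_score` is given lists, the result is, as far as I understand torcheval's implementation, a corpus-level score: n-gram matches and lengths are pooled over all question/answer pairs before the precisions are computed. That is not the same as averaging one score per question. The sketch below illustrates the pooling, assuming whitespace tokens and uniform weights up to 4-grams; `corpus_bleu` is a name introduced here and the sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n: int = 4) -> float:
    """Corpus-level BLEU: pool clipped matches and lengths over all pairs."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for candidate, reference in zip(candidates, references):
        cand, ref = candidate.split(), reference.split()
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or min(matches) == 0:
        return 0.0
    log_mean = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_mean)


# Made-up sentences: the second candidate is too short to contain any 4-gram.
candidates = ["the cat sat on the mat", "rain is wet"]
references = ["the cat sat on the mat", "the rain is wet"]
print(corpus_bleu(candidates, references))          # both pairs pooled
print(corpus_bleu(candidates[1:], references[1:]))  # second pair on its own
```

Here the pooled score stays positive even though the second pair scores 0 on its own, so a few very short answers do not zero out a whole difficulty level.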