# A local, CPU-based question-answering bot over all 7 Harry Potter books using Llama 2

Approach:

* Store the text of all 7 books as smaller chunks in Haystack's in-memory document store.
* Use a retriever to fetch the chunk most relevant to the user's query.
* Pass this chunk to the Llama 2 model as context, along with the query, to get an answer.

In [1]:
#pip install farm-haystack
#pip install farm-haystack[inference]
#!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz
#!tar -xvf xpdf-tools-linux-4.04.tar.gz && cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin
#pip install farm-haystack[preprocessing]

## Creating an in-memory document store

In [2]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True,embedding_dim=512)

### Reading Data

#### Data source: https://kvongcmehsanalibrary.wordpress.com/portfolio/1128/

In [3]:
doc_dir = "/app/hp_data/"

In [4]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]

In [5]:
files_to_index

['/app/hp_data//harry-potter-and-the-deathly-hallows-j.k.-rowling.pdf',
 '/app/hp_data//harry-potter-and-the-half-blood-prince-j.k.-rowling.pdf',
 '/app/hp_data//5harry-potter-and-the-order-of-the-phoenix-1.pdf',
 '/app/hp_data//harry-potter-chamber-of-secret.pdf',
 '/app/hp_data//harry-potter-and-the-goblet-of-fire.pdf',
 '/app/hp_data//harry-potter-sorcerers-stone.pdf',
 '/app/hp_data//harrypotter 3.pdf']

#### Converting files into Haystack's document format

In [6]:
from haystack.utils import convert_files_to_docs


all_docs = convert_files_to_docs(dir_path=doc_dir)


pdftotext version 4.04 [www.xpdfreader.com]
Copyright 1996-2022 Glyph & Cog, LLC


### Sample document

In [7]:
all_docs[0].content



* At this point each book is a single Haystack document.

* Haystack retrieves the most relevant document for the query it is given. If each book were stored as a single document, a query would return an entire book.

### Creating smaller chunks with a sliding window, so that sentences stay complete and context is not lost

In [8]:
# Sliding window approach
from haystack.nodes import PreProcessor
docs_sliding_window=[]
preprocessor_sliding_window = PreProcessor(split_overlap=50, split_length=200, split_respect_sentence_boundary=True)
for i in range(len(all_docs)):
    docs_sliding_window.append(preprocessor_sliding_window.process(all_docs[i]))


We found one or more sentences whose word count is higher than the split length.


In [9]:
docs_sliding_window = [item for sublist in docs_sliding_window for item in sublist]
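
The sliding-window split can be pictured with a small, library-free sketch. This is not Haystack's `PreProcessor` implementation (which, with `split_respect_sentence_boundary=True`, also keeps sentences whole); the function name `sliding_window_chunks` and the toy input are assumptions made for illustration. It only shows how a window of 200 words with an overlap of 50 words advances through the text.

```python
def sliding_window_chunks(text, split_length=200, split_overlap=50):
    """Split text into overlapping word windows (illustrative only)."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: 500 numbered "words" instead of a real book
toy_text = " ".join(f"w{i}" for i in range(500))
toy_chunks = sliding_window_chunks(toy_text)
print(len(toy_chunks))            # number of windows
print(toy_chunks[1].split()[0])   # first word of the second window
```

Because consecutive windows share 50 words, a sentence that straddles a chunk boundary still appears intact in at least one chunk, which is the point of the overlap.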

### Writing documents into the store

In [10]:
document_store.write_documents(docs_sliding_window)
#indexing_pipeline = TextIndexingPipeline(document_store)

Updating BM25 representation...: 100% 16003/16003 [00:00<00:00, 21545.62 docs/s]


### Initializing the Retriever


In [11]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store, top_k=1)
# with top_k=1 the retriever returns only the single most relevant chunk for the query
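
To give a feel for what BM25 ranking does, here is a minimal pure-Python sketch. It is not Haystack's implementation: the whitespace tokenizer, the `k1=1.5` and `b=0.75` values, the non-negative IDF variant, and the names `bm25_scores` and `retrieve_top_k` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_top_k(query, docs, top_k=1):
    """Return the top_k documents ranked by BM25 score."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:top_k]]


toy_docs = [
    "harry caught the golden snitch",
    "snape brewed a potion in the dungeon",
    "hagrid fed the dragon",
]
print(retrieve_top_k("who brewed the potion", toy_docs))
```

BM25 is purely lexical: it rewards chunks that share rare words with the query, so a question phrased with different vocabulary from the book text may retrieve a less useful chunk.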

In [12]:
import time
import copy
import asyncio
import requests
from llama_cpp import Llama
## using llama_cpp library to load quantised Llama 2 model

### Model link: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

In [13]:

llm = Llama(model_path="/app/llama-2-7b-chat.Q8_0.gguf")
print("Model loaded!")



llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /app/llama-2-7b-chat.Q8_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q8_0     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q8_0     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q8_0     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q8_0     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q8_0     [  4096,  4096,     1,     

Model loaded!


....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size = 76.38 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


## First example

In [23]:
query="who was albus Dumbledore?"

In [24]:
context=retriever.retrieve(query)[0].content

* The retriever returns the chunk most relevant to the query.

In [25]:
context

'Gellert was there in the house when it\nhappened, and he came back to my house all of a dither, told me he\nwanted to go home the next day. Terribly distressed, you know. So I\narranged a Portkey and that was the last I saw of him.\n“Albus was beside himself at Ariana’s death. It was so dreadful for\nthose two brothers. They had lost everybody except each other. No\nwonder tempers ran a little high. Aberforth blamed Albus, you know, as\x0cpeople will under these dreadful circumstances. But Aberforth always\ntalked a little madly, poor boy. All the same, breaking Albus’s nose at the\nfuneral was not decent. It would have destroyed Kendra to see her sons\nfighting like that, across her daughter’s body. A shame Gellert could not\nhave stayed for the funeral. . . . He would have been a comfort to Albus,\nat least. . . .”\nThis dreadful coffin-side brawl, known only to those few who\nattended Ariana Dumbledore’s funeral, raises several questions. Why\nexactly did Aberforth Dumbledore blame

In [26]:
# create a text prompt
prompt=f"""[INST] <<SYS>>
{context}
<</SYS>>
{query+ "please answer the query in maximum 3 sentences"} [/INST]"""


In [27]:
len(prompt.split())

205
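
The whitespace word count is only a rough proxy for prompt size: the model log above reports `n_ctx = 512`, and the timing output for this prompt reports 341 prompt tokens for these 205 words. A small helper can keep the prompt within a word budget before it is sent to the model. This is a sketch under stated assumptions: `build_prompt`, `max_context_words` and the default budget of 200 are hypothetical names and values, and the template simply mirrors the one used in the cell above, with a space added between the query and the instruction.

```python
def build_prompt(context, query, max_context_words=200,
                 instruction="please answer the query in maximum 3 sentences"):
    """Assemble a Llama 2 chat-style prompt, trimming the context to a word budget."""
    words = context.split()
    if len(words) > max_context_words:
        # Trimming re-joins words with single spaces, so line breaks are lost
        context = " ".join(words[:max_context_words])
    return (
        "[INST] <<SYS>>\n"
        f"{context}\n"
        "<</SYS>>\n"
        f"{query} {instruction} [/INST]"
    )


toy_prompt = build_prompt(
    "Albus was beside himself at Ariana's death.",
    "who was albus Dumbledore?",
)
print(toy_prompt)
print(len(toy_prompt.split()))  # rough size check, in words rather than tokens
```

A word budget is a blunt instrument; counting tokens with the model's own tokenizer would be more accurate, at the cost of an extra call per query.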

In [28]:
# generate a response (takes several seconds)
output = llm(prompt)
# display the response
print(output["choices"][0]["text"])


Llama.generate: prefix-match hit


  Albus Dumbledore was a wizard and the younger brother of Aberforth Dumbledore. He was known to be a kind and compassionate person, but his grief over the death of his sister Ariana led to a confrontation with his brother at her funeral. Despite their differences, Albus and Aberforth were the only remaining family members of their parents, Gellert and Kendra Dumbledore.



llama_print_timings:        load time = 14426.80 ms
llama_print_timings:      sample time =    32.37 ms /    92 runs   (    0.35 ms per token,  2842.05 tokens per second)
llama_print_timings: prompt eval time = 13707.46 ms /   341 tokens (   40.20 ms per token,    24.88 tokens per second)
llama_print_timings:        eval time = 20833.49 ms /    91 runs   (  228.94 ms per token,     4.37 tokens per second)
llama_print_timings:       total time = 34775.36 ms
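
The timing lines above can be turned into throughput figures with a few lines of parsing. The sketch below is an illustration, not part of `llama_cpp`: the regular expression and the `parse_timings` helper are assumptions written against the log format shown in this notebook, and the sample log is copied from the output above.

```python
import re

# Matches e.g. "llama_print_timings: eval time = 20833.49 ms / 91 runs"
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Map each timing name to (milliseconds, count or None)."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        count = match.group("count")
        timings[match.group("name")] = (
            float(match.group("ms")),
            int(count) if count else None,
        )
    return timings


sample_log = """
llama_print_timings:        load time = 14426.80 ms
llama_print_timings:      sample time =    32.37 ms /    92 runs   (    0.35 ms per token,  2842.05 tokens per second)
llama_print_timings: prompt eval time = 13707.46 ms /   341 tokens (   40.20 ms per token,    24.88 tokens per second)
llama_print_timings:        eval time = 20833.49 ms /    91 runs   (  228.94 ms per token,     4.37 tokens per second)
llama_print_timings:       total time = 34775.36 ms
"""

timings = parse_timings(sample_log)
eval_ms, eval_tokens = timings["eval"]
prompt_ms, prompt_tokens = timings["prompt eval"]
print(round(eval_tokens / (eval_ms / 1000), 2))      # generation tokens per second
print(round(prompt_tokens / (prompt_ms / 1000), 2))  # prompt tokens per second
```

In this run, reading the 341-token prompt takes about 13.7 s and generating 91 tokens takes about 20.8 s, so both the retrieved context length and the answer length contribute noticeably to the roughly 35 s total on CPU.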


# Samples

In [20]:
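def get_answer(query):
    """Retrieve the best-matching chunk and ask Llama 2 to answer from it."""
    # top_k=1, so the first result is the most relevant chunk
    context = retriever.retrieve(query)[0].content

    # create a text prompt: the chunk goes in the system block, the query in the user turn
    # (the f-string body is left unindented so no stray spaces end up in the prompt)
    prompt = f"""[INST] <<SYS>>
{context}
<</SYS>>
{query + " please answer the query in maximum 3 sentences"} [/INST]"""

    # generate a response (takes several seconds)
    output = llm(prompt)

    # display the response
    print(output["choices"][0]["text"])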



In [21]:
query="who killed albus Dumbledore?"
get_answer(query)

Llama.generate: prefix-match hit


  Severus Snape killed Albus Dumbledore, as revealed in the passage. Snape had been secretly working for Dumbledore but was forced to kill him when Dumbledore discovered his true loyalties. This event is a pivotal moment in the story and has significant consequences for the characters and their relationships.



llama_print_timings:        load time = 14426.80 ms
llama_print_timings:      sample time =    27.88 ms /    72 runs   (    0.39 ms per token,  2582.22 tokens per second)
llama_print_timings: prompt eval time = 16297.16 ms /   343 tokens (   47.51 ms per token,    21.05 tokens per second)
llama_print_timings:        eval time = 16069.35 ms /    71 runs   (  226.33 ms per token,     4.42 tokens per second)
llama_print_timings:       total time = 32551.49 ms


In [22]:
query="whom did snape really love?"
get_answer(query)

Llama.generate: prefix-match hit


  Snape's true love was Lily Potter, Harry's mother. Snape had a deep and unwavering affection for her, which he nursed throughout his life despite his complicated feelings towards the Potters. Their love was pure and enduring, despite the fact that they were from different worlds and had different social standings.



llama_print_timings:        load time = 14426.80 ms
llama_print_timings:      sample time =    29.45 ms /    77 runs   (    0.38 ms per token,  2614.60 tokens per second)
llama_print_timings: prompt eval time = 16894.66 ms /   353 tokens (   47.86 ms per token,    20.89 tokens per second)
llama_print_timings:        eval time = 17126.18 ms /    76 runs   (  225.34 ms per token,     4.44 tokens per second)
llama_print_timings:       total time = 34226.05 ms


In [29]:
query="which are the 7 horcrux?"
get_answer(query)

Llama.generate: prefix-match hit


  The seven Horcruxes created by Lord Voldemort are:
1. Tom Riddle's diary
2. Marvolo Gaunt's ring
3. Salazar Slytherin's locket
4. Helga Hufflepuff's cup
5. Nagini, the snake
6. The one that was once Ravenclaw's
7. The one that was once Gryffindor's (not confirmed)



llama_print_timings:        load time = 14426.80 ms
llama_print_timings:      sample time =    40.12 ms /   104 runs   (    0.39 ms per token,  2592.35 tokens per second)
llama_print_timings: prompt eval time = 17342.37 ms /   347 tokens (   49.98 ms per token,    20.01 tokens per second)
llama_print_timings:        eval time = 22701.13 ms /   103 runs   (  220.40 ms per token,     4.54 tokens per second)
llama_print_timings:       total time = 40316.79 ms
