# SabIA
SabIA, an innovative RAG (Retrieval Augmented Generation) pipeline, is the outcome of collaboration between the Software Innovation Laboratory (LIS), a joint initiative of the Pontifical Catholic University of Rio Grande do Sul (PUCRS) and HP Inc.

## Setup

If an error occurs during the installation of Llama CPP, use the following commands in your terminal within AI Studio:

1. Update the list of available packages and their versions:
   `sudo apt update`

2. Install essential packages for compiling:
   `sudo apt install build-essential`

In [1]:
!apt update
!apt install build-essential -y

Get:1 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:2 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB][33m[33m
Get:4 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1076 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB][33m[33m
Get:8 http://security.ubuntu.com/ubuntu jammy-security/multiverse amd64 Packages [44.6 kB]
Get:9 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [1951 kB]
Get:10 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1561 kB]33m[33m
Get:11 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB][0m[33m
Get:12 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB][0m[33m

In [1]:
!pip install tiktoken
!pip install langchain
!pip install chromadb
!pip install PyMuPDF
#!pip install llama-cpp-python



Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m0m
[?25hInstalling collected packages: tiktoken
Successfully installed tiktoken-0.6.0
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.2[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython -m pip install --upgrade pip[0m
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.2[0m[39;49m -> [0m[32;49m24.0[0m
[

## Importing components
Within the SabIA code, importing specific components is crucial for executing a chain. These components play fundamental roles in enabling the system's full functionality, ensuring that the chain process is carried out successfully



In [1]:
from framework_classes.composed_chain import ComposedChain
from framework_classes.memory import Memory
from framework_classes.message import Message

from src.tokenizers.sabia_tokenizer import SabIATokenizer
from src.models.sabia_embeddings_model import SabIAEmbeddingsModel
from src.vectordbs.chroma_vectordb import ChromaVectorDB
from src.chains.demo_loader_chain import DemoLoaderChain
from src.chains.demo_query_chain import DemoQueryChain
import time

## Model loading and chain execution
First, we load the model along with the tokenization pipeline. Next, we specify an absolute path within a list aiming to capture the desired file for performing embeddings. Finally, we initiate a chain of operations to carry out the execution pipeline.

In [2]:
start_time = time.time()

tokenizer = SabIATokenizer()
embedding = SabIAEmbeddingsModel(tokenizer=tokenizer)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Tokenizer and embedding loading: {elapsed_time}")


ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Quadro P6000, compute capability 6.1, VMM: yes
llama_model_loader: loaded meta data with 15 key-value pairs and 291 tensors from /home/jovyan/datafabric/Llama7b/LlamaChat7b/llama-2-7b/ggml-model-f16.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llam

Tokenizer and embedding loading: 657.393717288971


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'general.file_type': '1', 'tokenizer.ggml.model': 'llama', 'llama.attention.head_count_kv': '32', 'llama.attention.head_count': '32', 'llama.rope.dimension_count': '128', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.feed_forward_length': '11008', 'llama.embedding_length': '4096', 'general.name': 'LLaMA v2', 'llama.block_count': '32', 'llama.context_length': '4096', 'general.architecture': 'llama'}
Using fallback chat format: None


In [3]:
vectordb = ChromaVectorDB(embedding_model=embedding)

docs_paths = ["/home/jovyan/local/ds-experiments/Temp/SabIA/docs/ZDocs-20231910.pdf"]

loader_chain = DemoLoaderChain(vectordb=vectordb, tokenizer=tokenizer, docs_paths=docs_paths)


In [5]:
start_time = time.time()
loader_chain.run()
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Loader chain time: {elapsed_time}")

Loader chain time: 0.009295940399169922


In [None]:
start_time = time.time()

query_chain = DemoQueryChain(vectordb=vectordb)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Query loading time: {elapsed_time}")

## Inference

In [None]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp
import time

In [2]:


input = "Hello, my dear friend"
message = Message("User", input)
history = query_chain.memory.get_history()        
content = query_chain.prompt.get_prompt(message, history)

print(history)
#print(content)
prediction = self.model.predict(content)+" "
start = prediction.index("Answer: <answer>")+16
prediction = prediction[start:prediction.find("User", start)].strip()
        
self.memory.add_message(message)
self.memory.add_message(Message("Assistant", prediction))

print(prediction)

NameError: name 'query_chain' is not defined

In [None]:
start_time = time.time()
prediction = query_chain.model.predict(content)+" "
print(prediction)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Runtime: {elapsed_time}")

  warn_deprecated(


In [9]:
with open("content.txt", "w") as file:
    file.write(content)

In [None]:
start_time = time.time()
print(query_chain.run("What is AI Studio?"))
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Runtime: {elapsed_time}")


  warn_deprecated(


In [None]:
while True:
    question = input("Enter your question (or 'exit' to close): ")
    if question.lower() != 'exit':
        print(query_chain.run(question))
    else:
        print("Session closed...")
        break

## Model Service

In [7]:
class SabIAQueryModel(mlflow.pyfunc.PythonModel):

    def load_context(self, context):
        from framework_classes.composed_chain import ComposedChain
        from src.tokenizers.sabia_tokenizer import SabIATokenizer
        from src.models.sabia_embeddings_model import SabIAEmbeddingsModel
        from src.vectordbs.chroma_vectordb import ChromaVectorDB
        from src.chains.demo_loader_chain import DemoLoaderChain
        from src.chains.demo_query_chain import DemoQueryChain


        self.tokenizer = SabIATokenizer()
        embedding = SabIAEmbeddingsModel(tokenizer=self.tokenizer)
        self.vectordb = ChromaVectorDB(embedding_model=embedding)

        docs_paths = context.artifacts["docs_paths"]
        
        loader_chain = DemoLoaderChain(vectordb=self.vectordb, tokenizer=self.tokenizer, docs_paths=docs_paths)
        loader_chain.run()

        self.query_chain = DemoQueryChain(vectordb=self.vectordb)

    def predict(self, context, model_input):
        question = model_input['question'][0]  
        answer = self.query_chain.run(question)
        return answer

    @classmethod
    def log_model(cls, model_name, docs_paths, demo_folder):
        input_schema = Schema([ColSpec("string", "question")])
        output_schema = Schema([ColSpec("string", "answer")])
        signature = ModelSignature(inputs=input_schema, outputs=output_schema)
        
        artifacts = {"docs_paths": docs_paths, "demo": demo_folder}

        mlflow.pyfunc.log_model(
            artifact_path=model_name,
            python_model=cls(),
            artifacts=artifacts,
            signature=signature
        )

mlflow.set_experiment(experiment_name='SabIA_Query_Service')

with mlflow.start_run(run_name='Document_Query_Run') as run:

    docs_paths = ["/mnt/d/SabIA/ds-experiments-nemo-experiments-for-demo/Temp/SabIA/docs/ZDocs-20231910.pdf"]
    demo_folder = "demo_folder_path"
    model_name = 'sabia_query_model'
    
    SabIAQueryModel.log_model(model_name=model_name, docs_paths=docs_paths, demo_folder=demo_folder)
    mlflow.register_model(model_uri=f"runs:/{run.info.run_id}/{model_name}", name=model_name)


NameError: name 'mlflow' is not defined