# Welcome

Authors:
- Célien Donzé, research assistant at Haute Ecole Arc Ingénierie, Switzerland
- Jonathan Guerne, research assistant at Haute Ecole Arc Ingénierie, Switzerland
- Henrique Marques Reis, research assistant at Haute Ecole Arc Ingénierie, Switzerland
- Pedro Costa, Co-Founder and CTO at Lumind, Switzerland

## Package installation

In [None]:
!pip install langchain langchain-community faiss-cpu pymupdf pypdf sentence_transformers rich wget

## Imports

In [None]:
import json
import os
import zipfile

import wget
from langchain.chains import LLMChain, RetrievalQA
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms.ollama import Ollama
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from rich.console import Console

# Shared rich console used for pretty-printing results
console = Console()

## Downloading the PDFs

In [None]:
# Download the zip archive containing the PDFs
url = "https://www.dropbox.com/scl/fo/xhqjzofiqnbmraxksgvlh/AAoL_WMBFOYDuipk5T_tTus?rlkey=qbbcvw4gbw6bpxkeijt6m94kt&st=yhap82wh&dl=1"
filename = wget.download(url, ".")

zip_file_path = f"./{filename}"
extract_folder = "data/PDFs"

# Create the extract folder if it doesn't exist
os.makedirs(extract_folder, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(extract_folder)

print("PDF files downloaded and extracted successfully.")

## Documentation

- [langchain](https://python.langchain.com/v0.1/docs/get_started/introduction/)
- [Ollama website](https://ollama.com/)

## Constants

In [None]:
OLLAMA_ADDRESS = "http://XXX.XX.XX.XX:11434"
LLM_NAME = "gemma"

# start step 1

## Connecting to LLM

In [None]:
llm = Ollama(
    model=LLM_NAME,
    base_url=OLLAMA_ADDRESS,
    temperature=0.1,  # explained in the "What is temperature?" section
    stop=["<end_of_turn>"],
)

## Creating a prompt

A prompt is generally divided into two parts: the context and the question.

The context provides the information that the model will use to generate its answer, while the question specifies what the model is expected to respond to.

In a prompt, special tokens delimit the different sections. For Gemma, these are `<start_of_turn>` and `<end_of_turn>`.

Additionally, LangChain requires markers indicating where to insert the user's question and the context retrieved from documents. For the question, this marker is `{question}`; the context marker is introduced in step 3.

Gemma prompt template:

```html
<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
```
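
To make the two kinds of markers concrete, the sketch below fills a `{question}` placeholder with plain `str.format` and wraps the result in Gemma's turn markers. The names `GEMMA_TEMPLATE` and `build_gemma_prompt` are hypothetical and only illustrate the substitution step; in this notebook, LangChain's `PromptTemplate` plays that role.

```python
# Minimal sketch: fill a {question} placeholder inside Gemma turn markers.
# GEMMA_TEMPLATE and build_gemma_prompt are hypothetical names used for
# illustration; they are not part of LangChain or Ollama.
GEMMA_TEMPLATE = (
    "<start_of_turn>user\n"
    "{question}<end_of_turn>\n"
    "<start_of_turn>model\n"
)


def build_gemma_prompt(question: str) -> str:
    """Return the template with the user's question substituted in."""
    return GEMMA_TEMPLATE.format(question=question)


example_prompt = build_gemma_prompt("What is the capital of Switzerland?")
print(example_prompt)
```

The text ends with an open `model` turn, so the model is expected to continue from there and stop at `<end_of_turn>`.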

In [None]:
template = """<start_of_turn>user
You are a helpful assistant that answers the question in detail.

Human input: {question}<end_of_turn>
<start_of_turn>model
"""

prompt = PromptTemplate(input_variables=["question"], template=template)

## Creating the chain and starting a conversation

In [None]:
conversation = LLMChain(
    llm=llm,
    # verbose=True, # uncomment if you want to see more information about the chain
    prompt=prompt
)

In [None]:
result = conversation.invoke(input="What is the capital of Switzerland?")
console.print(result.get("text"))

# end step 1

# start step 2

## Loading a PDF

In [None]:
DATA_DIR = os.path.join("./", "data")
PDF_DIR = os.path.join(DATA_DIR, "PDFs")
VECTORSTORES_DIR = os.path.join(DATA_DIR, "vectorstores")

In [None]:
loader = PyPDFDirectoryLoader(PDF_DIR)
doc = ...

## Embedding a PDF in a vectorstore

In [None]:
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
EMBEDDING_MODEL_NAME = "BAAI/bge-large-en-v1.5"

<div>
<img src="chunk_overlap_size_scheme.png" width="800"/>
</div>
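
The relationship between chunk size and chunk overlap can be sketched with a fixed sliding window over characters. This is a deliberately simplified illustration: `sliding_window_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` behaves differently in that it also tries to break on separators such as paragraphs and lines.

```python
from typing import List


# Simplified fixed-window splitter to illustrate chunk size and overlap.
# Hypothetical helper; not how RecursiveCharacterTextSplitter works internally.
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one. With `CHUNK_SIZE = 500` and `CHUNK_OVERLAP = 100`, the same idea means neighbouring chunks can share up to 100 characters, which helps keep sentences that straddle a boundary retrievable.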

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=..., chunk_overlap=...
)

model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embedding_model = HuggingFaceBgeEmbeddings(
    model_name=EMBEDDING_MODEL_NAME, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

In [None]:
all_splits = text_splitter.split_documents(...)
vectorstore = ...


In [None]:
vectorstore.save_local(VECTORSTORES_DIR)

# end step 2

# start step 3

## Loading a vectorstore

In [None]:
vectorstore = ...

## What is temperature?

The temperature parameter of a large language model (LLM) controls the randomness of the model's output.

A lower temperature value (closer to 0) makes the model more deterministic, favoring higher-probability words and resulting in more predictable and repetitive text.

A higher temperature value (around 1 or above) increases randomness, allowing for more creative and diverse responses by giving less probable words a better chance of being chosen.

Adjusting the temperature helps balance between coherence and creativity in the generated text.
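
A common way to apply temperature is to divide the model's raw scores (logits) by it before the softmax step. The sketch below assumes that standard formulation and uses made-up scores for three candidate tokens; it does not show how Ollama or Gemma implement sampling internally.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
toy_logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(toy_logits, temperature=0.1)
warm = softmax_with_temperature(toy_logits, temperature=1.0)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])

# Draw one token index according to the "warm" distribution.
rng = random.Random(0)
sampled_index = rng.choices(range(len(toy_logits)), weights=warm, k=1)[0]
```

At temperature 0.1 almost all of the probability mass lands on the highest-scoring token, while at 1.0 the two other tokens keep a noticeable share.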

## New prompt

In RAG, we need another marker to indicate where the retrieved information (the context) should be inserted. For this, we use the variable named `{context}`.
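
As an illustration of what happens to `{context}` at run time, the sketch below joins a few retrieved chunks and substitutes them, together with the question, into a template. `RAG_TEMPLATE`, `fill_rag_prompt`, the wording of the template, and the toy chunks are all hypothetical; the sketch also assumes chunks are simply joined with blank lines and omits the Gemma turn markers for brevity.

```python
# Sketch of how a RAG prompt is assembled from retrieved chunks.
# RAG_TEMPLATE and fill_rag_prompt are hypothetical names; the wording of
# the template is illustrative only.
RAG_TEMPLATE = (
    "Use the context below to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)


def fill_rag_prompt(chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(context=context, question=question)


toy_chunks = [
    "Bern is the federal city of Switzerland.",
    "Switzerland has 26 cantons.",
]
filled = fill_rag_prompt(toy_chunks, "How many cantons does Switzerland have?")
print(filled)
```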

In [None]:
prompt = """..."""


prompt_template = PromptTemplate(input_variables=[...],template=...)

## Creating the chain

In [None]:
# Number of top-k chunks to retrieve from the vectorstore
NB_RETRIEVED_CHUNKS = 8

In [None]:
rqa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": NB_RETRIEVED_CHUNKS}),
    chain_type_kwargs={"prompt": ...},
    input_key=...,  # same as the question variable in the prompt
    output_key="answer",
    return_source_documents=True,
)
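
Conceptually, the retriever embeds the question and returns the `k` chunks whose embeddings are closest to it. The sketch below shows that idea with cosine similarity on toy 2-D vectors; `cosine_similarity` and `top_k` are hypothetical helpers, and FAISS itself relies on optimized index structures whose distance metric depends on how the index is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings standing in for chunk vectors (hypothetical values).
toy_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [0.9, 0.1]
best = top_k(toy_query, toy_vectors, k=2)
print(best)  # [0, 1]
```

A larger `k` gives the LLM more context to draw on, at the cost of a longer prompt and a higher chance of including loosely related chunks.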

## Chatting with a PDF

In [None]:
result = ...
result

## Embellishing the output

In [None]:
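def prepare_document(x):
    # Keep only the file name of the source path (None stays None)
    return x if x is None else os.path.basename(x)


def prepare_page(x):
    # Page metadata is zero-based; convert it to a one-based page number
    return x if x is None else int(x) + 1


def prepare_source(x):
    # Flatten a retrieved document into a JSON-serializable dict
    return {
        "document": prepare_document(x.metadata.get("source", None)),
        "page": prepare_page(x.metadata.get("page", None)),
        "chunk": x.page_content,
    }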

In [None]:
console.print(result.get("answer"))

In [None]:
sources = [prepare_source(x) for x in result["source_documents"]]
console.print(json.dumps(sources, indent=1), highlight=False)