# Welcome

Authors:
- Célien Donzé, research assistant at Haute Ecole Arc Ingénierie, Switzerland
- Jonathan Guerne, research assistant at Haute Ecole Arc Ingénierie, Switzerland
- Henrique Marques Reis, research assistant at Haute Ecole Arc Ingénierie, Switzerland
- Pedro Costa, Co-Founder and CTO at Lumind, Switzerland

## Package installation

In [46]:
!pip install langchain langchain-community faiss-cpu pymupdf pypdf sentence_transformers rich wget


## Imports

In [47]:
# Third-party integrations live in langchain-community (installed above);
# the equivalent langchain.* import paths are deprecated.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain.chains import RetrievalQA, LLMChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama
import os
import json
import zipfile

import wget
from rich.console import Console

console = Console()

## Downloading the PDFs

In [48]:
# Download the zip archive containing the PDFs
url = "https://www.dropbox.com/scl/fo/xhqjzofiqnbmraxksgvlh/AAoL_WMBFOYDuipk5T_tTus?rlkey=qbbcvw4gbw6bpxkeijt6m94kt&st=yhap82wh&dl=1"
filename = wget.download(url, ".")

zip_file_path = f"./{filename}"
extract_folder = "data/PDFs"

# Create the extract folder if it doesn't exist
os.makedirs(extract_folder, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_folder)

print("Pdf file downloaded successfully.")

Pdf file downloaded successfully.


## Documentation

- [langchain](https://python.langchain.com/v0.1/docs/get_started/introduction/)
- [Ollama website](https://ollama.com/)

## Constants

In [49]:
OLLAMA_ADDRESS = "http://157.26.83.14:11434"
# OLLAMA_ADDRESS = "http://165.1.69.123:11434"
LLM_NAME="gemma"

# start step 1

## Connecting to LLM

In [50]:
llm = Ollama(
    model=LLM_NAME,
    base_url=OLLAMA_ADDRESS,
    temperature=0.1,  # Will be explained later
    stop=["<end_of_turn>"],
)

## Creating a prompt

A prompt is generally divided into two parts: the context and the question.

The context provides the information that the model will use to generate its answer, while the question specifies what the model is expected to respond to.

In a prompt, special tokens delimit the different sections. For instance, in Gemma, these are `<start_of_turn>` and `<end_of_turn>`.

Additionally, LangChain requires markers indicating where to insert the user's question and the context retrieved from documents. For the question, the marker is `{question}`.

Gemma prompt template:

```html
<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
```

In [51]:
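# Gemma expects a role ("user" or "model") after each <start_of_turn>;
# the prompt ends with an open model turn so the LLM writes the answer.
template = """<start_of_turn>user
You are a helpful assistant that answers the question in detail.

Human input: {question}<end_of_turn>
<start_of_turn>model
"""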

prompt = PromptTemplate(input_variables=["question"], template=template)
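
As a rough illustration of how the `{question}` and `{context}` markers get filled, the sketch below uses plain Python `str.format` instead of LangChain itself. The template text, the `fill_prompt` helper, and the sample values are made up for this example; `PromptTemplate` performs a comparable substitution with its default f-string format, but its internals are not reproduced here.

```python
# Hypothetical template for illustration only; it follows the Gemma
# turn markers (<start_of_turn>, <end_of_turn>) described above.
GEMMA_STYLE_TEMPLATE = (
    "<start_of_turn>user\n"
    "{context}\n\n"
    "Question: {question}<end_of_turn>\n"
    "<start_of_turn>model\n"
)


def fill_prompt(template: str, **values: str) -> str:
    """Replace every {name} marker in the template with the matching value."""
    return template.format(**values)


filled = fill_prompt(
    GEMMA_STYLE_TEMPLATE,
    context="Bern is the seat of the Swiss federal government.",
    question="What is the capital of Switzerland?",
)
print(filled)
```

The string that reaches the model no longer contains any `{...}` marker: each one has been replaced by its value.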

## Creating the chain and starting a conversation

In [52]:
conversation = LLMChain(
    llm=llm,
    # verbose=True, # uncomment if you want to see more information about the chain
    prompt=prompt
)

In [53]:
result = conversation.invoke(input="What is the capital of Switzerland?")
console.print(result.get("text"))

# end step 1

# start step 2

## Loading a PDF

In [54]:
DATA_DIR = os.path.join("./", "data")
PDF_DIR = os.path.join(DATA_DIR, "PDFs")
VECTORSTORES_DIR = os.path.join(DATA_DIR, "vectorstores")

In [55]:
loader = PyPDFDirectoryLoader(PDF_DIR)
doc = loader.load()

## Embedding a PDF in a vectorstore

In [56]:
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
EMBEDDING_MODEL_NAME = "BAAI/bge-large-en-v1.5"

<div>
<img src="chunk_overlap_size_scheme.png" width="800"/>
</div>
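
The scheme above can be approximated with a fixed-size sliding window. This is a simplified sketch, not what `RecursiveCharacterTextSplitter` does internally (it prefers to cut on separators such as paragraph and line breaks); the `split_with_overlap` function and the tiny sizes are assumptions chosen so the overlap is easy to see.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last 4 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.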

In [57]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)

model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embedding_model = HuggingFaceBgeEmbeddings(
    model_name=EMBEDDING_MODEL_NAME, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)



In [58]:
all_splits = text_splitter.split_documents(doc)
vectorstore = FAISS.from_documents(documents=all_splits, embedding=embedding_model)


In [59]:
vectorstore.save_local(VECTORSTORES_DIR)
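
To give an intuition for what the vectorstore is used for later, here is a toy nearest-neighbour search over hand-written 3-dimensional vectors. The vectors, the labels, and the `top_k` helper are invented for illustration; real embeddings from `BAAI/bge-large-en-v1.5` have far more dimensions, and FAISS uses optimized index structures instead of this linear scan. With unit-length vectors (which is what `normalize_embeddings: True` produces), the dot product equals cosine similarity, and ranking by Euclidean distance gives the same order.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(query, documents, k):
    """Return the labels of the k documents most similar to the query."""
    q = normalize(query)
    scored = []
    for label, vector in documents.items():
        similarity = sum(a * b for a, b in zip(q, normalize(vector)))
        scored.append((similarity, label))
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Made-up "embeddings": each axis loosely stands for one topic.
toy_vectors = {
    "weather": [0.9, 0.1, 0.0],
    "transport": [0.1, 0.9, 0.1],
    "taxes": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.3, 0.0]

print(top_k(query_vector, toy_vectors, k=2))
# ['weather', 'transport']
```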

# end step 2

# start step 3

## Loading a vectorstore

In [60]:
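# The saved index includes a pickle file, so loading it can execute arbitrary code.
# Only set allow_dangerous_deserialization=True for files you created yourself.
vectorstore = FAISS.load_local(
    VECTORSTORES_DIR, embedding_model, allow_dangerous_deserialization=True
)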

## What is temperature?

The temperature parameter of a large language model (LLM) controls the randomness of the model's output.

A lower temperature value (closer to 0) makes the model more deterministic, favoring higher-probability words and resulting in more predictable and repetitive text.

A higher temperature value (closer to 1, or above) increases randomness, allowing for more creative and diverse responses by giving less probable words a better chance of being chosen.

Adjusting the temperature helps balance between coherence and creativity in the generated text.
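
The effect can be seen on a small example. A common formulation divides the model's raw scores (logits) by the temperature before applying softmax; the sketch below assumes that formulation, and the three logits are made-up values, not output from Gemma.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate words

for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At 0.1 almost all of the probability goes to the highest-scoring word, while at 2.0 the three words are much closer to each other, so sampling picks the less likely ones more often.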

## New prompt

In RAG, we need another marker to indicate where the retrieved information (the context) should be inserted. For this, we use the variable named `{context}`.

In [61]:
# Gemma only knows the "user" and "model" roles, so the instructions, the
# context, and the question share one user turn; the open model turn at the
# end is where the LLM writes its answer.
prompt = """<start_of_turn>user
Use the following pieces of context to answer the question at the end.
Don't try to make up an answer and only use the information you know.
Use three sentences maximum and keep the answer as concise as possible.
You must answer in English.
{context}

Question: {question}<end_of_turn>
<start_of_turn>model
"""


prompt_template = PromptTemplate(input_variables=["context", "question"], template=prompt)

## Creating the chain

In [78]:
# Top k of chunks to retrieve from the vectorstore
NB_RETRIEVED_CHUNKS = 8

In [79]:
rqa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": NB_RETRIEVED_CHUNKS}),
        chain_type_kwargs={"prompt": prompt_template},
        input_key="question", # same as the variable in the prompt
        output_key="answer",
        return_source_documents=True,
    )
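
Conceptually, the chain retrieves the top chunks for the question, joins them into one block of text, and places that block at the `{context}` marker. The sketch below mimics only that last step in plain Python. The `build_rag_prompt` helper, the short template, and the `"\n\n"` separator are assumptions for illustration, not the actual `RetrievalQA` implementation; the sample chunks are shortened from the PDF text used in this notebook.

```python
def build_rag_prompt(template, chunks, question, separator="\n\n"):
    """Join the retrieved chunks and insert them, with the question, into the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Hypothetical, shortened template (no model-specific turn markers).
toy_template = "Use the context to answer.\n{context}\n\nQuestion: {question}"

# Stand-ins for the chunks a retriever would return.
retrieved_chunks = [
    "The Grisons is one of the areas with the most sunshine in Switzerland.",
    "Public transport: www.rhb.ch | www.sbb.ch",
]

rag_prompt = build_rag_prompt(
    toy_template, retrieved_chunks, "Is there a lot of sunshine in the Grisons?"
)
print(rag_prompt)
```

Retrieving more chunks gives the model more material to draw on, but it also makes the prompt longer, which is the trade-off behind the number of retrieved chunks chosen above.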

## Chatting with a PDF

In [86]:
result = rqa_chain.invoke("Can I fly to Graubunden and is it there a lot of sunshine ?")
result

{'question': 'Can I fly to Graubunden and is it there a lot of sunshine ?',
 'answer': 'Graubunden can be reached by flight, with Zurich Airport being conveniently connected to the public transport network. The region enjoys plenty of sunshine, making it ideal for outdoor activities and a warm climate.',
 'source_documents': [Document(page_content='24Tips and Information \nCulture and leisure-time activities\nwww.graubuenden.ch\nWeather forecasts\nwww.wetter-graubuenden.ch \nPublic transport\nwww.rhb.ch | www.sbb.ch4. QUALITY OF LIFE AND LIFESTYLE\nHealth and Safety\nQuality of life: plenty of sunshine, no fog\nGood weather and a favourable climate improve wellbeing. \nThe Grisons is one of the areas with the most sunshine in \nSwitzerland and thanks to the long hours of sun and the \nmild climate, it is also the warmest vine growing area in the', metadata={'source': 'data/PDFs/Quality_of_life_in_the_Grisons.pdf', 'page': 4}),
  Document(page_content='How to get there\nFlying to Graubu

## Embellishing the output

In [87]:
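def prepare_document(x):
    """Keep only the file name of the source path (None stays None)."""
    return x if x is None else os.path.basename(x)

def prepare_page(x):
    """Convert the 0-based page index to a 1-based page number (None stays None)."""
    return x if x is None else int(x) + 1

def prepare_source(x):
    """Summarize a retrieved Document as a JSON-serializable dict."""
    return {
        "document": prepare_document(x.metadata.get("source", None)),
        "page": prepare_page(x.metadata.get("page", None)),
        "chunk": x.page_content,
    }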

In [88]:
console.print(result.get("answer"))

In [89]:
sources = [prepare_source(x) for x in result["source_documents"]]
console.print(json.dumps(sources, indent=1),highlight=False)