In [183]:
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv(), override=True)

True

In [184]:
!pip install langchain-core

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




To use PineconeVectorStore, the "partner" libraries must be installed.

In [185]:
!pip install -qU langchain-pinecone pinecone-notebooks


Long texts need to be split into chunks.

LangChain provides document loaders for the various document types.

In [186]:
with open("files/churcill_speach.txt") as f:
    churcill_speach = f.read()

Using RecursiveCharacterTextSplitter, I split the text into chunks of at most 100 characters with a 20-character overlap.

In [187]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

chunks = text_splitter.create_documents([churcill_speach])
print(chunks[0])
print(chunks[1])
print(chunks[2])

page_content='Winston Churchill Speech - We Shall Fight on the Beaches\nWe Shall Fight on the Beaches\nJune 4, 1940'
page_content='June 4, 1940\nHouse of Commons'
page_content='From the moment that the French defenses at Sedan and on the Meuse were broken at the end of the'
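
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a naive fixed-width splitter. It is not what RecursiveCharacterTextSplitter does internally (that class also tries to break on separators such as newlines, which is why the chunks above have different lengths); the function name `split_with_overlap` is hypothetical and only illustrates the overlap idea.

```python
def split_with_overlap(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list[str]:
    """Naive splitter: consecutive chunks share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 250-character text, used only for illustration
sample = "".join(str(i % 10) for i in range(250))
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])
```

The last 20 characters of each piece are repeated at the start of the next one, so a sentence cut at a chunk boundary is still partly visible in both chunks.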


In [188]:
print(f"Got {len(chunks)} chunks")

Got 300 chunks


## TIKTOKEN (for OpenAI models)
I define a function to check how many tokens the chunks would produce with the OpenAI embedding model (which we will not actually use later).

In [189]:
import tiktoken

def print_embedding_cost(texts):
    # Count the tokens the OpenAI embedding model would see across all chunks
    enc = tiktoken.encoding_for_model("text-embedding-ada-002")
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f"Number of tokens: {total_tokens}")


In [190]:
print_embedding_cost(chunks)

Number of tokens: 4820
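
The function above reports a token count rather than a cost. As a rough sketch, the count can be turned into a price estimate once a per-token rate is known. The rate used below is an assumption for illustration only and should be checked against the current OpenAI pricing page; `estimate_embedding_cost` is a hypothetical helper, not part of tiktoken.

```python
def estimate_embedding_cost(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """Convert a token count into an estimated price in USD."""
    return total_tokens / 1000 * usd_per_1k_tokens


# ASSUMPTION: illustrative rate, not taken from this notebook
ASSUMED_USD_PER_1K_TOKENS = 0.0001

# 4820 is the token count reported by the cell above
cost = estimate_embedding_cost(4820, ASSUMED_USD_PER_1K_TOKENS)
print(f"Estimated cost: ${cost:.6f}")
```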


Instead of using OpenAI as the embedding model (text-embedding-ada-002), I try a sentence-transformers model.

First I need to install the library with

pip install -q sentence-transformers

## EMBEDDING with HuggingFace

### HuggingFaceEmbeddings

The base HuggingFaceEmbeddings class is a generic wrapper around any HuggingFace model for embeddings.<br>
All embedding models on Hugging Face should work. You can refer to the embeddings leaderboard for more recommendations.

This class depends on the sentence-transformers package, which you can install with `pip install sentence-transformers`.

In [191]:
#!pip install -qU langchain-huggingface

In [192]:
from langchain_community.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-MiniLM-L6-v2"
#model_name='DeepMount00/Anita'  # Specialized for the Italian language
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [193]:
model

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={'device': 'cpu'}, encode_kwargs={'normalize_embeddings': False}, multi_process=False, show_progress=False)

### Test

I compute the embedding of the first chunk of the text processed earlier.<br>
I then read the dimension of the resulting embedding vector (which depends on the model).<br>
This dimension is stored in a variable that is used later to create the Pinecone index.

In [194]:
embeddings = model.embed_query(chunks[0].page_content)
embedding_dim = len(embeddings)
embedding_dim

384

## Pinecone

After importing the libraries, I create a Pinecone client.

In [195]:
import pinecone
from langchain_community.vectorstores import Pinecone

pc = pinecone.Pinecone()
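
Since `pinecone.Pinecone()` is called without arguments, the client falls back to the `PINECONE_API_KEY` environment variable, presumably loaded from the `.env` file by `load_dotenv` at the top of the notebook. A small sanity check like the following can make a missing key easier to diagnose; `missing_env_vars` is a hypothetical helper written for this sketch.

```python
import os


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(["PINECONE_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```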

I create a Pinecone index. Since the free plan allows only one index, I must first delete the existing one.

In [196]:
for i in pc.list_indexes().names():
    print("Deleting all indexes")
    pc.delete_index(i)
    print(f"Deleted index {i}")

Deleting all indexes
Deleted index churcill


I re-create the index with a dimension that matches the embedding dimension obtained above.

In [197]:
index_name = "churcill"

if index_name not in pc.list_indexes().names():
    print(f"Creating index {index_name}")
    pc.create_index(
        name=index_name,
        dimension=embedding_dim,
        metric="cosine",
        spec=pinecone.PodSpec(
            environment="gcp-starter"
        )
    )
    print("Index created")

Creating index churcill
Index created


## Creating the Pinecone vector store

The PineconeVectorStore class provided by LangChain can be used to interact with Pinecone indexes.<br>
It’s important to remember that you must have an existing Pinecone index before you can create a PineconeVectorStore object.

To initialize a PineconeVectorStore object, you must provide:
- the name of the Pinecone index
- an Embeddings object initialized through LangChain.

There are two general approaches to initializing a PineconeVectorStore object:
- Initialize without adding records
- Initialize while adding records

See https://docs.pinecone.io/integrations/langchain

In [198]:
import os
from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore(index_name=index_name, embedding=model)

## Inserting the documents into the vector store

In [199]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(chunks))]

vectorstore.add_documents(documents=chunks, ids=uuids)

['d7ffe132-edcd-4c13-838a-5adf084045a8',
 '1443c9f0-94d8-46e4-bf61-de24f776e333',
 'ef6ea78d-c068-4b3c-8688-6ccf22b1b5b3',
 '542832f2-d325-47e8-b536-96670e20c250',
 '8d252fbe-7b73-4c39-8f72-d4a3a25504a3',
 '2a700138-c8d3-47e9-a034-3e4d9a8573f0',
 '7d7a8050-b87c-4dd5-8515-6e91226e2fab',
 'd07763dd-c30e-49b4-bdd1-49f1bb2ef547',
 '66a8c90e-da03-4563-b2c1-146e1a25cb17',
 '676bb008-ea67-4490-b6aa-3982ff154491',
 '64499a62-e4ff-4b49-bc52-402cf7d22d22',
 '90209b63-2bc3-4b22-b46d-6b8edfe4a9c3',
 '4f394077-7595-4fec-8c49-162e570ee902',
 '2c453581-452b-4932-ac4f-8f5eabb0836d',
 '8c7d6512-0b7b-437c-874b-ea06662271a7',
 '94e540b2-fb5f-4947-9c55-71208373cc87',
 '1a899313-4ad8-4aa4-a920-95d237128770',
 '275013ab-1f42-41c4-8bbb-ffe828ced55f',
 'dde43138-0bb0-4ef5-92cb-9812b4176adf',
 'f0f72db2-58c8-4aa8-b150-268f74ca5247',
 '28078d88-af3d-41aa-943b-bdfc36525b29',
 'c895b796-824a-421d-a464-82e715308ee7',
 'aa976b6a-f79a-4059-813a-218fd15abe86',
 '151ad8c8-eb8e-4387-9258-b9f0b7d8b55e',
 'ab30dfb3-795f-
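
The random `uuid4` IDs used above change on every run, so re-running the cell against an existing index would insert the same chunks again under new IDs. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to derive the IDs deterministically from the chunk position and content with `uuid5`, so that a re-run overwrites the same records. The helper name `stable_ids` is hypothetical.

```python
from uuid import NAMESPACE_URL, uuid5


def stable_ids(texts, namespace=NAMESPACE_URL):
    """Derive a reproducible ID for each text from its position and content."""
    # The index is included so that identical chunks still get distinct IDs
    return [str(uuid5(namespace, f"{i}:{text}")) for i, text in enumerate(texts)]


ids = stable_ids(["first chunk", "second chunk", "first chunk"])
print(ids)
```

With the objects from this notebook, the call would then look like `vectorstore.add_documents(documents=chunks, ids=stable_ids([c.page_content for c in chunks]))`.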

To create a vector store from an index that already exists on Pinecone:

In [200]:
vector_store = Pinecone.from_existing_index(index_name="churcill", embedding=model)

## Similarity search

Once the knowledge base has been created, it can be queried with similarity search.

In [201]:
query = "where should I fight?"

result = vector_store.similarity_search(query)

# Show the results most similar to the query text according to the index metric (cosine)
for r in result:
    print(r.page_content)

front, now on that, fighting
I return to the Army. In the long series of very fierce battles, now on this front, now on that,
shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and
streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a
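
To illustrate what the cosine metric measures, here is a brute-force sketch that ranks a few toy vectors against a query vector. It is only an illustration: Pinecone performs the search server-side on the 384-dimensional embeddings and does not expose its internals here. The function names are hypothetical, and `k=4` mirrors the four results returned above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]


# Toy 2-dimensional vectors, used only for illustration
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))
```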


Once the results most similar to the query have been retrieved, they can be passed to an LLM so that it produces the final answer in natural language.

One option is to call an open-source model exposed through the OpenAI-compatible server of llama.cpp (see the dedicated project).

In [202]:
from langchain.chains import RetrievalQA

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    api_key='545454545',
    base_url='http://localhost:8000/v1',
)

I then need to create a dedicated chain for RAG. To do so, I create a retriever from the vector store populated with the input text and its embeddings.

Once the retriever is available, I build the chain, passing both the LLM that will produce the final answer and the retriever.

In [207]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})

chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
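
Conceptually, the "stuff" chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python; the prompt wording and the helper name `build_stuff_prompt` are assumptions made for illustration and are not the template LangChain actually uses.

```python
def build_stuff_prompt(question: str, documents: list[str], separator: str = "\n\n") -> str:
    """Concatenate the retrieved chunks and wrap them in a question-answering prompt."""
    context = separator.join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Two chunks returned by the similarity search above, used as sample context
retrieved = [
    "shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and",
    "streets, we shall fight in the hills; we shall never surrender, and even if, which I do not for a",
]
prompt = build_stuff_prompt("where should I fight?", retrieved)
print(prompt)
```

Because every chunk is pasted into one prompt, this approach only works while `k` chunks of 100 characters fit comfortably in the model's context window, which is the case here with `k=3`.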

I define the question and invoke the chain with it.

In [204]:
query = "where should I fight?"

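# chain.run is deprecated in recent LangChain versions; invoke returns a dict whose "result" key holds the answer
answer = chain.invoke({"query": query})["result"]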

In [205]:
print(answer)

 I can see that you are referring to a quote from a famous speech by Winston Churchill during World War II. The quote mentions several places where battles were fought during the war, including the beaches, landing grounds, and fields. However, it does not specify where you should fight in this particular situation. If you have a specific question or need further clarification, please let me know.


In [206]:
query = "what about the french army?"
answer = chain.invoke({"query": query})["result"]
print(answer)

 Based on the context provided, it seems that during a battle between the British and French Armies, the French First Army was captured and held by the British. However, without additional information, I cannot confirm whether the French Army as a whole was involved in the battle or if this refers specifically to the French First Army.
