# Progress Report 2

A01740032 Ricardo Morales Bustillos
A01793554 Jesús Esteiner Alonso Moreno
A01793556 Johan Andrés Castro Gómez

The main goal of the project is to implement RAG (Retrieval-Augmented Generation) and Graph RAG techniques to address a current limitation of Large Language Models (LLMs): their answers lose relevance when a question involves information that was not part of their training data. As part of the project, we will build a chatbot that lets users access relevant information extracted from a knowledge base.

In this phase of the project, we set up the first components of the data pipeline: **loading the document that contains the source information**, **integrating a pretrained embedding model**, and **building a vectorized knowledge base**.

## Encoder model and similarity search

In [21]:
#encoder_model_name = 'sentence-transformers/all-MiniLM-L12-v2'
encoder_model_name = 'avsolatorio/GIST-Embedding-v0'

In [22]:
from langchain_community.embeddings import HuggingFaceEmbeddings

encoder = HuggingFaceEmbeddings(
    model_name=encoder_model_name,
    model_kwargs={"device": "cpu"},  # run the embedding model on CPU
)



In [25]:
embeddings = encoder.embed_query("How are you?")

In [26]:
print(len(embeddings))

768


In [27]:
import numpy as np

q = encoder.embed_query("What is earnings?")
z1 = encoder.embed_query(
    'In finance, "earnings" refer to the net profits of a company after deducting all expenses from its revenues. It indicates a company profitability and is essential for assessing its financial health. Earnings are often reported quarterly or annually.'
)  # from wikipedia
z2 = encoder.embed_query(
    'A smartphone is a handheld electronic device that combines the functions of a mobile phone and a computer, enabling features such as internet browsing, app usage, multimedia consumption, and communication.'
)  # from wikipedia

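# Cosine similarity between the query and the related passage
print(np.dot(q, z1) / (np.linalg.norm(q) * np.linalg.norm(z1)))

# Cosine similarity between the query and the unrelated passage
print(np.dot(q, z2) / (np.linalg.norm(q) * np.linalg.norm(z2)))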

0.9030871092616961
0.5814253937178707
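
The two expressions above repeat the same cosine-similarity formula. As a minimal sketch, the computation could be wrapped in a helper. The name `cosine_similarity` and the zero-vector handling are assumptions for illustration and are not part of the notebook. Toy vectors stand in for real embeddings.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumption: treat a zero vector as having no similarity to anything
        return 0.0
    return float(np.dot(a, b) / denom)


# Toy vectors standing in for embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

With the notebook's variables, the same helper would be called as `cosine_similarity(q, z1)` and `cosine_similarity(q, z2)`.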


## Document loading and text splitting

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loaders = [
    PyPDFLoader("goog023-alphabet-2023-annual-report-web-1.pdf"),
]
pages = []
for loader in loaders:
    pages.extend(loader.load())

In [3]:
pages

[Document(page_content='Alphabet 2023 Annual Report', metadata={'source': 'goog023-alphabet-2023-annual-report-web-1.pdf', 'page': 0}),
 Document(page_content='Directors\nLarry Page\nCo-Founder\nSergey Brin\nCo-Founder\nSundar Pichai\nChief Executive OfficerAlphabet and Google \nJohn L. Hennessy\nChair of the Board of DirectorsFormer PresidentStanford University\nFrances H. Arnold\nLinus Pauling Professor of Chemical Engineering, Bioengineering and BiochemistryCalifornia Institute of Technology\nR. Martin “Marty” Chávez\nPartner and Vice Chairman Sixth Street Partners\nL. John Doerr\nGeneral Partner and ChairKleiner Perkins\nRoger W. Ferguson Jr.\nFormer President and Chief Executive Officer TIAA \nK. Ram Shriram\nManaging PartnerSherpalo Ventures\nRobin L. Washington\nFormer Executive Vice President and Chief Financial OfficerGilead SciencesDirectors and executive \nofficers as of January 2024\nStockholder information\nFor further information about \nAlphabet Inc., contact:\nInvestor 

In [16]:
print(pages[2].page_content)

1 Annual Report 2023 To everyone around 
the world who uses our 
products, our employees, 
and our partners:
It’s a time for some gratitude, and a moment to reflect.
I’ve been thinking a lot about how far technology has 
come over the last 25 years and how people adapt to 
it. Years ago, when I was studying in the U.S., my dad – 
who was back in India – got his first email address. 
I was really excited to have a faster (and cheaper) 
way to communicate with him, so I sent a message.
And then I waited … and waited. It was two full days 
before I got this reply:
“Dear Mr. Pichai, email received. All is well.”
Perplexed by the delay and the formality, I called him 
up to see what happened. He told me that someone 
at his work had to bring up the email on their office 
computer, print it out, and then deliver it to him. 
My dad dictated a response, which the guy wrote 
down and eventually typed up to send back to me.
Fast-forward to a few months ago: I was with my 
teenage son. He saw som

In [29]:

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

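# Measure chunk length with the encoder's own tokenizer
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=AutoTokenizer.from_pretrained(encoder_model_name),
    chunk_size=256,  # target maximum length of each chunk, in tokens
    chunk_overlap=32,  # tokens shared between consecutive chunks
    strip_whitespace=True,
)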

docs = text_splitter.split_documents(pages)



In [31]:
print(len(pages))  # number of PDF pages loaded
print(len(docs))  # number of chunks after splitting

108
451
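
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally, since that splitter prefers to break on separators such as paragraphs and newlines. The function name `sliding_window_chunks` and the fake token list are assumptions for illustration only.

```python
def sliding_window_chunks(tokens, chunk_size=256, chunk_overlap=32):
    """Split a token list into windows of at most chunk_size tokens,
    each sharing chunk_overlap tokens with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: made-up tokens stand in for real tokenizer output
tokens = [f"tok{i}" for i in range(10)]
chunks = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=1)
```

Here each window starts three tokens after the previous one, so the last token of one chunk is repeated as the first token of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.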


## Vectorized knowledge base

In [32]:
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

faiss_db = FAISS.from_documents(
    docs, encoder, distance_strategy=DistanceStrategy.COSINE
)

In [33]:
faiss_db

<langchain_community.vectorstores.faiss.FAISS at 0x29fe6ab60>

In [34]:
def similarity_search(question: str, k: int = 3) -> str:
    """Return the text of the k chunks most similar to the question, joined by newlines."""
    retrieved_docs = faiss_db.similarity_search(question, k=k)
    context = "".join(doc.page_content + "\n" for doc in retrieved_docs)
    return context

In [39]:
print(similarity_search("what was the stock performance?", k=3))

graph tracks the performance of a $100 investment in our common stock and in each index (with the reinvestment of all 
dividends) from December 31, 2018, to December 31, 2023. The returns shown are based on historical results and are not intended to suggest future performance.
24 Alphabet  2023 Annual Report
The graph below matches Alphabet Inc. Class C’s cumulative five-year total stockholder return on capital stock with the 
cumulative total returns of the S&P 500 index, the NASDAQ Composite index, and the RDG Internet Composite index. The 
graph tracks the performance of a $100 investment in our Class C capital stock and in each index (with the reinvestment of 
all dividends) from December 31, 2018, to December 31, 2023. The returns shown are based on historical results and are not intended to suggest future performance.
Comparison	of	 Cumulative 5-YearTotal	Return*
ALPHABET INC. CLASS C CAPITAL STOCKAmong Alphabet Inc., the S&P 500 Index, the
NASDAQ Composite Index, and the RDG Int
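
Conceptually, the retrieval step ranks stored chunk vectors by their similarity to the query vector and keeps the best `k`. The brute-force sketch below illustrates that idea with NumPy. It makes no claim about how the FAISS index is implemented, and the name `top_k_cosine` and the toy 2-D vectors are assumptions for illustration.

```python
import numpy as np


def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k rows of doc_vecs
    with the highest cosine similarity to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # After normalization, the dot product equals the cosine similarity
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D "embeddings", made up for illustration
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
results = top_k_cosine([1.0, 0.1], doc_vecs, k=2)
```

In this toy example, the vector pointing in almost the same direction as the query ranks first, followed by the diagonal one.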

## Conclusions

In this progress report, we built the pipeline needed to load and process text from PDF files. The text is split into smaller chunks, which are then vectorized with a pretrained embedding model. From these vectors, we built a vectorized knowledge base that supports efficient organization and retrieval of information. With this infrastructure, we can identify the relevant context that the Large Language Model (LLM) will use to answer questions more accurately.
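
As a rough sketch of how the retrieved context might later be handed to the LLM, the snippet below assembles a prompt string. The function `build_prompt`, the prompt wording, and the placeholder context are all assumptions for illustration. The notebook does not yet define this step.

```python
def build_prompt(question: str, context: str) -> str:
    """Combine retrieved context and a user question into one prompt string."""
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# Placeholder context; in the notebook it would come from similarity_search(...)
prompt = build_prompt(
    "What was the stock performance?",
    "<retrieved chunks go here>",
)
```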

With this deliverable, we complete the development of the 'Retriever' component of the RAG (Retrieval-Augmented Generation) system.