# Components of a RAG application


# 1 - Document loaders

`Document loaders` are components designed to make loading, manipulating, and managing documents easier and more efficient.

LangChain provides a number of built-in `document loaders` for loading text data from common sources such as files, databases, and APIs. It is also possible to create custom `document loaders` for specific data sources.

## 1.1 - Why use them?

* `Data source isolation`: They let us decouple the data sources from the NLP app.
* `Flexibility`: LangChain ships a long list of ready-made loaders for a wide variety of data sources.
* `Efficiency`: Document loaders can be optimized to load text data efficiently.

## 1.2 - Types

* Files
    - HTML
        > `UnstructuredHTMLLoader`
        > `WebBaseLoader`
    - CSV
    - JSON
    - PDF
* Databases
    - MongoDB
    - AWS S3
    - Pandas DataFrame
* APIs
    - Wikipedia
    - Twitter (X)
    - Telegram
    - GitHub


## 1.3 - Custom document loaders

It is also possible to create custom document loaders for specific data sources. The signatures below are TypeScript, as used by LangChain.js. In this case, we have several options:
* Extend the `BaseDocumentLoader` class: The general approach, suitable for any type of document:
    
    abstract class BaseDocumentLoader implements DocumentLoader {
        abstract load(): Promise<Document[]>;
    }
* Extend the `TextLoader` class: Intended for text files that follow a specific format.
    
    abstract class TextLoader extends BaseDocumentLoader {
        abstract parse(raw: string): Promise<string[]>;
    }

* Extend the `BufferLoader` class: Intended for binary files. `BufferLoader` takes care of reading the file, so all we have to do is implement the parsing method.

    abstract class BufferLoader extends BaseDocumentLoader {
        abstract parse(
            raw: Buffer,
            metadata: Document['metadata']
        ): Promise<Document[]>;
    }
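
The same load/parse pattern can be sketched in plain Python. This is only an illustration under stated assumptions: `Document` here is a hypothetical stand-in dataclass (not LangChain's class), and `DelimitedTextLoader` and its `delimiter` parameter are invented names for the example.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class DelimitedTextLoader:
    """Hypothetical custom loader: one Document per block separated by `delimiter`."""

    def __init__(self, file_path, delimiter="\n---\n"):
        self.file_path = Path(file_path)
        self.delimiter = delimiter

    def parse(self, raw):
        # Format-specific step: split the raw text and drop empty blocks.
        return [block.strip() for block in raw.split(self.delimiter) if block.strip()]

    def load(self):
        # Generic step: read the file, then wrap each parsed block in a Document.
        raw = self.file_path.read_text(encoding="utf-8")
        return [
            Document(
                page_content=block,
                metadata={"source": str(self.file_path), "block": index},
            )
            for index, block in enumerate(self.parse(raw))
        ]
```

Keeping `parse` separate from `load` mirrors the `TextLoader` idea above: the base behaviour reads the file, and only the format-specific parsing needs to change.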


## 1.4 - Examples

### 1.4.1 - PDF loaders

`pip install pymupdf`


In [21]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./assets/")

doc = loader.load()

In [22]:
from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("assets/SA_InteractiveBenefitsChart.pdf")

ValueError: File path assets/SA_InteractiveBenefitsChart.pdf is not a valid file or url

In [24]:
from langchain_community.document_loaders import UnstructuredPDFLoader
"""
    pip install pdf2image
    pip install unstructured
    pip install pillow_heif
    pip install opencv-python
    pip install unstructured_inference
    pip install pytesseract
"""
loader = UnstructuredPDFLoader("assets/SA_InteractiveBenefitsChart.pdf")

In [29]:
pdf_docs = loader.load()
len(pdf_docs)

[Errno 2] No such file or directory: 'assets/SA_InteractiveBenefitsChart.pdf'
PDF text extraction failed, skip text extraction...


PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

### 1.4.2 - `WebBaseLoader`

In [33]:
from langchain_community.document_loaders import WebBaseLoader


web_loader = WebBaseLoader("https://github.com/OpenWebinarsNet/Desarrollo-de-Aplicaciones-impulsadas-por-LangChain-en-Python")
web_docs = web_loader.load()

In [34]:
len(web_docs)

1

In [35]:
web_docs[0].metadata

{'source': 'https://github.com/OpenWebinarsNet/Desarrollo-de-Aplicaciones-impulsadas-por-LangChain-en-Python',
 'title': 'GitHub - OpenWebinarsNet/Desarrollo-de-Aplicaciones-Impulsadas-por-LangChain-en-Python',
 'description': 'Contribute to OpenWebinarsNet/Desarrollo-de-Aplicaciones-Impulsadas-por-LangChain-en-Python development by creating an account on GitHub.',
 'language': 'en'}

In [36]:
web_docs[0].page_content

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGitHub - OpenWebinarsNet/Desarrollo-de-Aplicaciones-Impulsadas-por-LangChain-en-Python\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n\nToggle navigation\n\n\n\n\n\n\n\n\n\n\n          Sign in\n        \n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n        Product\n        \n\n\n\n\n\n\n\n\n\n\n\n\nActions\n        Automate any workflow\n      \n\n\n\n\n\n\n\nPackages\n        Host and manage packages\n      \n\n\n\n\n\n\n\nSecurity\n        Find and fix vulnerabilities\n      \n\n\n\n\n\n\n\nCodespaces\n        Instant dev environments\n      \n\n\n\n\n\n\n\nCopilot\n        Write better code with AI\n      \n\n\n\n\n\n\n\nCode review\n        Manage code changes\n      \n\n\n\n\n\n\n\nIssues\n        Plan and track work\n      \n\n\n\n\n\n\n\nDiscussions\n        Collaborate outs

Although this loader can extract the text directly from an HTML page with a single line of code, the resulting text may need cleaning before it is used for training or for inference with an LLM.

For example, in this case we should at least collapse the excessive line breaks.
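
A minimal sketch of that cleanup, using only the standard library. The function name `clean_page_text` is an assumption made for this example, and real pages may need more aggressive filtering (navigation menus, footers, and so on).

```python
import re


def clean_page_text(raw):
    """Trim each line and collapse runs of blank lines into a single newline."""
    # Remove leading/trailing spaces on every line (whitespace-only lines become empty).
    lines = [line.strip() for line in raw.splitlines()]
    text = "\n".join(lines)
    # Collapse two or more consecutive newlines into one.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


sample = "\n\n\nGitHub\n\n\n\nSkip to content\n\n   Sign in\n        \n\n"
print(clean_page_text(sample))
```

In the notebook, the same function could be applied to `web_docs[0].page_content` before splitting the text into chunks.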


### 1.4.3 - `WikipediaLoader`

`pip install wikipedia`

Here we send a query to Wikipedia (as if we were using the site's search box) and retrieve at most the first two documents.

In [37]:
from langchain_community.document_loaders import WikipediaLoader

wiki_docs = WikipediaLoader(query="Naruto Shippuden", load_max_docs=2).load()

len(wiki_docs)

2

In [38]:
wiki_docs[0].metadata

{'title': 'Naruto (TV series)',
 'summary': "Naruto is a Japanese anime television series based on Masashi Kishimoto's manga series of the same name. The story follows Naruto Uzumaki, a young ninja who seeks recognition from his peers and dreams of becoming the Hokage, the leader of his village. Just like the manga, the anime series is divided into two separate parts: the first series retains the original manga's title and is set in Naruto's pre-teen years. The second series, a direct sequel titled Naruto: Shippuden, takes place during his teens. Both anime series were animated by Pierrot, produced by Aniplex and licensed in North America by Viz Media.\nThe first anime series aired on TV Tokyo and ran for 220 episodes from October 2002 to February 2007; an English dub produced by Viz Media aired on Cartoon Network and YTV from September 2005 to December 2009. The second series, Shippuden, also aired on TV Tokyo and ran for 500 episodes from February 2007 to March 2017. The English dub 

In [39]:
wiki_docs[1].metadata

{'title': 'List of Naruto: Shippuden episodes',
 'summary': "Naruto: Shippuden is an anime television series mainly adapted from Part II of Masashi Kishimoto's original Naruto manga series, with exactly 500 episodes. It is set two and a half years after the original series in the Naruto universe, following the teenage ninja Naruto Uzumaki and his allies. The series is directed by Hayato Date, and produced by Pierrot and TV Tokyo. It began broadcasting on February 15, 2007, on TV Tokyo, and concluded on March 23, 2017.On January 2, 2009, Viz Media and Crunchyroll provided eight English subtitled Naruto: Shippuden episodes on the official Naruto website. Later the following 2 weeks, Viz began providing subtitled versions of the latest Naruto: Shippuden episodes a week after they first aired in Japan, with a new episode being added to the Naruto website each subsequent Thursday. On July 24, 2009, Viz Media announced that the series would be released on the iTunes Store. The first DVD rele

In [40]:
wiki_docs[0].page_content

"Naruto is a Japanese anime television series based on Masashi Kishimoto's manga series of the same name. The story follows Naruto Uzumaki, a young ninja who seeks recognition from his peers and dreams of becoming the Hokage, the leader of his village. Just like the manga, the anime series is divided into two separate parts: the first series retains the original manga's title and is set in Naruto's pre-teen years. The second series, a direct sequel titled Naruto: Shippuden, takes place during his teens. Both anime series were animated by Pierrot, produced by Aniplex and licensed in North America by Viz Media.\nThe first anime series aired on TV Tokyo and ran for 220 episodes from October 2002 to February 2007; an English dub produced by Viz Media aired on Cartoon Network and YTV from September 2005 to December 2009. The second series, Shippuden, also aired on TV Tokyo and ran for 500 episodes from February 2007 to March 2017. The English dub of Shippuden was broadcast on Disney XD in t

In [41]:
wiki_docs[1].page_content

'Naruto: Shippuden is an anime television series mainly adapted from Part II of Masashi Kishimoto\'s original Naruto manga series, with exactly 500 episodes. It is set two and a half years after the original series in the Naruto universe, following the teenage ninja Naruto Uzumaki and his allies. The series is directed by Hayato Date, and produced by Pierrot and TV Tokyo. It began broadcasting on February 15, 2007, on TV Tokyo, and concluded on March 23, 2017.On January 2, 2009, Viz Media and Crunchyroll provided eight English subtitled Naruto: Shippuden episodes on the official Naruto website. Later the following 2 weeks, Viz began providing subtitled versions of the latest Naruto: Shippuden episodes a week after they first aired in Japan, with a new episode being added to the Naruto website each subsequent Thursday. On July 24, 2009, Viz Media announced that the series would be released on the iTunes Store. The first DVD release of the series in North America was released on Septembe

# 2 - Text splitter

`Text splitters` are components designed to split the loaded text into chunks that fit in the context window of an embeddings model or an LLM.

LangChain provides several text splitters for general text, HTML, and code.

## 2.1 - Why use them?

At first glance, splitting text into small chunks may seem simple. However, there is a lot of potential complexity, because we generally want to keep semantically related pieces of text together. What "semantically related" means can depend on the type of text (the text of a novel is not the same as Python code).

A `text splitter` has two main characteristics:
* How the text is split
* How the chunk size is measured

## 2.2 - Types
* `RecursiveCharacterTextSplitter`
* `HTMLHeaderTextSplitter`
* `MarkdownHeaderTextSplitter`
* `Code`
* `Token`
* `CharacterTextSplitter`

## 2.3 - Examples

### 2.3.1 - `RecursiveCharacterTextSplitter`

This text splitter is the recommended one for generic text. It tries the separators in order until the chunks are small enough.

It is parameterized by a list of characters. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of trying to keep all paragraphs (then sentences, then words) together as long as possible, since those are generally the most semantically related pieces of text.
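
To make the "try each separator in order" idea concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores overlap, and it drops the separators at chunk boundaries.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters.

    Pieces are split on the first separator and greedily merged back together;
    any piece that is still too long is split again with the next separator.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: fall back to the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7))
```

The paragraph break is tried first, so `ddd eee` survives as a unit, while the longer first paragraph falls back to splitting on spaces.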

In [46]:
text = wiki_docs[0].page_content

print(text)

Naruto is a Japanese anime television series based on Masashi Kishimoto's manga series of the same name. The story follows Naruto Uzumaki, a young ninja who seeks recognition from his peers and dreams of becoming the Hokage, the leader of his village. Just like the manga, the anime series is divided into two separate parts: the first series retains the original manga's title and is set in Naruto's pre-teen years. The second series, a direct sequel titled Naruto: Shippuden, takes place during his teens. Both anime series were animated by Pierrot, produced by Aniplex and licensed in North America by Viz Media.
The first anime series aired on TV Tokyo and ran for 220 episodes from October 2002 to February 2007; an English dub produced by Viz Media aired on Cartoon Network and YTV from September 2005 to December 2009. The second series, Shippuden, also aired on TV Tokyo and ran for 500 episodes from February 2007 to March 2017. The English dub of Shippuden was broadcast on Disney XD in the

In [47]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

recursive_text_splitter = RecursiveCharacterTextSplitter(
    # We use a very small chunk size just to demonstrate the functionality
    chunk_size = 100,  # maximum characters
    chunk_overlap = 20,
    length_function=len,
    is_separator_regex=False,
)

In [48]:
chunks = recursive_text_splitter.split_text(text)

for i in range(0,3):
    print(f'Chunk #{i}\n')
    print(chunks[i])
    print('\n')

Chunk #0

Naruto is a Japanese anime television series based on Masashi Kishimoto's manga series of the same


Chunk #1

series of the same name. The story follows Naruto Uzumaki, a young ninja who seeks recognition from


Chunk #2

recognition from his peers and dreams of becoming the Hokage, the leader of his village. Just like
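
The effect of `chunk_overlap` can be seen in isolation with a character-based sliding window. This is only a sketch under simplifying assumptions: the function name is hypothetical, and the recursive splitter applies overlap at separator boundaries, which is why its chunks above start on whole words.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Fixed-size character chunks where consecutive chunks share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Each chunk repeats the last two characters of the previous one, so context that straddles a boundary appears in both chunks.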




### 2.3.2 - Split by tokens using `tiktoken`

`pip install tiktoken`

In [50]:
from langchain.text_splitter import TokenTextSplitter

tiktoken_text_splitter = TokenTextSplitter(
    chunk_size = 100,
    chunk_overlap=0
)



In [52]:
chunks = tiktoken_text_splitter.split_text(text)

for i in range(0,2):
    print(f'Chunk #{i}\n')
    print(chunks[i])
    print('\n')

Chunk #0

Naruto is a Japanese anime television series based on Masashi Kishimoto's manga series of the same name. The story follows Naruto Uzumaki, a young ninja who seeks recognition from his peers and dreams of becoming the Hokage, the leader of his village. Just like the manga, the anime series is divided into two separate parts: the first series retains the original manga's title and is set in Naruto's pre-teen years. The second series, a direct sequel titled Naruto: Sh


Chunk #1

ippuden, takes place during his teens. Both anime series were animated by Pierrot, produced by Aniplex and licensed in North America by Viz Media.
The first anime series aired on TV Tokyo and ran for 220 episodes from October 2002 to February 2007; an English dub produced by Viz Media aired on Cartoon Network and YTV from September 2005 to December 2009. The second series, Shippuden, also aired on TV Tokyo and ran for 500 episodes from February 2007 to March 2017. The




### 2.3.3 - Split code

### Python

In [53]:
from langchain.text_splitter import (
    Language,
    RecursiveCharacterTextSplitter
)

PYTHON_CODE = """
def hello_world():
    print("hello world!")
    
# LLama a la función
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)

python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='def hello_world():\n    print("hello world!")'),
 Document(page_content='# LLama a la función\nhello_world()')]

### Markdown

In [54]:
markdown_text = """
🤔 What is LangChain?
LangChain is a framework for developing applications powered by large language models (LLMs).

For these applications, LangChain simplifies the entire application lifecycle:

Open-source libraries: Build your applications using LangChain's modular building blocks and components. Integrate with hundreds of third-party providers.
Productionization: Inspect, monitor, and evaluate your apps with LangSmith so that you can constantly optimize and deploy with confidence.
Deployment: Turn any chain into a REST API with LangServe.
Open-source libraries
langchain-core: Base abstractions and LangChain Expression Language.
langchain-community: Third party integrations.
Some integrations have been further split into partner packages that only rely on langchain-core. Examples include langchain_openai and langchain_anthropic.
langchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.
[LangGraph](https://python.langchain.com/docs/langgraph): A library for building robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph.
Productionization:
LangSmith: A developer platform that lets you debug, test, evaluate, and monitor chains built on any LLM framework and seamlessly integrates with LangChain.
Deployment:
LangServe: A library for deploying LangChain chains as REST APIs."""

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, 
    chunk_size=60,
    chunk_overlap=0
)

md_docs = md_splitter.create_documents([markdown_text])
md_docs

[Document(page_content='🤔 What is LangChain?'),
 Document(page_content='LangChain is a framework for developing applications'),
 Document(page_content='powered by large language models (LLMs).'),
 Document(page_content='For these applications, LangChain simplifies the entire'),
 Document(page_content='application lifecycle:'),
 Document(page_content='Open-source libraries: Build your applications using'),
 Document(page_content="LangChain's modular building blocks and components."),
 Document(page_content='Integrate with hundreds of third-party providers.'),
 Document(page_content='Productionization: Inspect, monitor, and evaluate your apps'),
 Document(page_content='with LangSmith so that you can constantly optimize and'),
 Document(page_content='deploy with confidence.'),
 Document(page_content='Deployment: Turn any chain into a REST API with LangServe.'),
 Document(page_content='Open-source libraries'),
 Document(page_content='langchain-core: Base abstractions and LangChain Expressi

# 3 - Embeddings model

The `Embeddings` class is designed to interact with text embedding models. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.), and this class provides a standard interface for all of them.

The base `Embeddings` class in LangChain provides two methods:
* `embed_documents`: Embeds documents. Takes a list of texts as input and returns one vector per text.
* `embed_query`: Embeds a query. Takes a single text as input and returns a single vector.

The reason for having two separate methods is that some embedding models behave differently depending on whether the input is a single query text or a batch of documents.
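
The shape of this interface can be sketched with a toy model in plain Python. `ToyEmbeddings` is a hypothetical class invented for this example; it only mimics the two method names and return shapes, and its hashed bag-of-words vectors carry no real semantic meaning.

```python
import hashlib
import math


class ToyEmbeddings:
    """Hypothetical stand-in exposing the same two methods as the Embeddings interface."""

    def __init__(self, dim=8):
        self.dim = dim

    def _embed(self, text):
        # Hash every word into one of `dim` buckets, then L2-normalise the counts.
        vector = [0.0] * self.dim
        for word in text.lower().split():
            bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % self.dim
            vector[bucket] += 1.0
        norm = math.sqrt(sum(value * value for value in vector)) or 1.0
        return [value / norm for value in vector]

    def embed_documents(self, texts):
        # A list of texts in, a list of vectors out.
        return [self._embed(text) for text in texts]

    def embed_query(self, text):
        # A single text in, a single vector out.
        return self._embed(text)


model = ToyEmbeddings(dim=8)
vectors = model.embed_documents(["Hola Mundo", "Hello World"])
print(len(vectors), len(vectors[0]))
```

In this toy both methods share one code path; a real provider is free to treat queries and documents differently, which is what the two-method design allows.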



## 3.1 - OpenAI


In [55]:
from os import getenv
from dotenv import load_dotenv
#from langchain_openai import OpenAI
from langchain_openai.embeddings import OpenAIEmbeddings


load_dotenv()



openai_embeddings_model = OpenAIEmbeddings(
    model = "text-embedding-3-large",
    openai_api_key = getenv('OPEN_API_KEY', '')
)


In [56]:
text = "Hola Mundo"
embedded_query = openai_embeddings_model.embed_query(text)

print(len(embedded_query))  # number of dimensions of the resulting embedding
print(embedded_query[:5])

3072
[0.008390028175443806, 0.010463886144756681, 0.007829723084401498, 0.021000538386177416, 0.018439142353808877]


In [58]:
text1 = "Hola Mundo"
text2 = "Hello World"

texts = [text1, text2]
embedded_docs = openai_embeddings_model.embed_documents(texts)
print(len(embedded_docs))  # number of vectors returned
print(len(embedded_docs[0]))  # dimensionality of the resulting embeddings
print(embedded_docs[0][:5])

2
3072
[0.008404222021682468, 0.010419780022282311, 0.007865769512188701, 0.020999639488354007, 0.018394694948444697]


## 3.2 - HuggingFace

### 3.2.1 - API

We can run the models remotely through Hugging Face's open API (although for serious workloads we would have to pay).

In [None]:
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings

huggingface_embeddings_model = HuggingFaceInferenceAPIEmbeddings(
    api_key=getenv("API_HUGGING", ""),
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)

embedded_query = huggingface_embeddings_model.embed_query(text)
print(len(embedded_query))  # number of dimensions
print(embedded_query[:5])

### 3.2.2 - Local

We can also run the embeddings model locally (the result should be the same).

`pip install sentence_transformers`

In [59]:
from langchain.embeddings import HuggingFaceEmbeddings

local_huggingface_embeddings_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
embedded_query = local_huggingface_embeddings_model.embed_query(text)
print(len(embedded_query))
print(embedded_query[:5])


384
[-0.04199906066060066, 0.13960984349250793, 0.007648400496691465, -0.008638311177492142, -0.030084870755672455]


# 4 - Vector database

There are many vector databases. Here we demonstrate the functionality with Chroma, an open-source option that does not require an external API because it runs locally (perfect for testing).

To explore all the options LangChain offers, the best place to go is the documentation.

`pip install chromadb`


## 4.1 - Example with Chroma

In [70]:

from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

tiktoken_text_splitter = TokenTextSplitter(
    chunk_size = 100,
    chunk_overlap=0
)

chunks = tiktoken_text_splitter.split_documents(wiki_docs)

print(f'Loaded {len(chunks)} text chunks')

# Load the embeddings model
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Store the chunks in a Chroma instance that lives in RAM
db = Chroma.from_documents(chunks, embeddings_model)

Loaded 20 text chunks


Once we have built the db with the relevant text chunks, we can query it in two ways:
* `db.similarity_search()`: Directly with the query text (LangChain automatically calls the model and generates the embedding vector)
* `db.similarity_search_by_vector()`: With the embedding vector (generated manually beforehand)
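
Conceptually, a search by vector ranks the stored chunks by how close their embeddings are to the query embedding. The sketch below does this by brute force in plain Python. Cosine similarity is an assumed metric here (the one Chroma actually uses depends on its configuration), and `top_k` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the texts of the `k` stored (text, vector) pairs closest to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 2-dimensional "embeddings", made up for the example.
stored = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])]
print(top_k([1.0, 0.1], stored, k=2))
```

A vector database avoids this linear scan by indexing the vectors, but the ranking idea is the same.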

In [71]:
# Query directly with the text
query = "when is Naruto from?"
docs = db.similarity_search(query)

print(docs[0].page_content)

 December 19, 2009.
"Naruto: Shippuden". Toonzone. 


In [72]:
query = "when is Naruto from"
query_embedding_vector = embeddings_model.embed_query(query)
docs = db.similarity_search_by_vector(query_embedding_vector)

print(docs[0].page_content)

 December 19, 2009.
"Naruto: Shippuden". Toonzone. 


# Next

In the next lesson, we will look at ways to extend the functionality of our RAG system.