# Ingestion Pipeline

This notebook demonstrates how to use the `IngestionPipeline` when building RAG applications.

[Ingestion Pipeline](https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/)

## Installation

In [None]:
!pip install llama-index llama-index-vector-stores-qdrant

## Set API Key

In [None]:
import nest_asyncio

nest_asyncio.apply()

import os

os.environ["OPENAI_API_KEY"] = "sk-..."

## Download Data

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-04-26 13:35:44--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-04-26 13:35:44 (8.36 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



## Load Data

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

## Ingestion Pipeline - Apply Transformations

In [None]:
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline, IngestionCache

### Text Splitters

In [None]:
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
    ]
)
nodes = pipeline.run(documents=documents)

In [None]:
nodes[0]

TextNode(id_='c6856f07-73bc-44ce-bd0b-5e27271f9f0f', embedding=None, metadata={'file_path': '/content/data/paul_graham/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-04-26', 'last_modified_date': '2024-04-26'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='244aec5e-98e0-48d1-81fd-9c12c2fe4c5c', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '/content/data/paul_graham/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-04-26', 'last_modified_date': '2024-04-26'}, hash='952e9dc1a243648316292b0771f0f024a059072e500f7da0092671800767f5
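To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows over a token list. It is a simplified model and not the `TokenTextSplitter` implementation: it assumes whitespace-separated words as tokens, and the helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Assumption: whitespace-separated words stand in for real tokenizer tokens.
tokens = "a b c d e f g h i j".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, which is the role the 100-token overlap plays in the pipeline above.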

### Text Splitter + Metadata Extractor

In [None]:
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
    ]
)
nodes = pipeline.run(documents=documents)

100%|██████████| 5/5 [00:01<00:00,  3.71it/s]


In [None]:
nodes[0].metadata["document_title"]

'From Painting to Programming: A Journey through Writing, AI, and Fine Arts'

### Text Splitter + Metadata Extractor + OpenAI Embedding

In [None]:
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
        OpenAIEmbedding(),
    ]
)
nodes = pipeline.run(documents=documents)

100%|██████████| 5/5 [00:01<00:00,  4.31it/s]


In [None]:
nodes[0].metadata["document_title"]

'Journeys in Writing, Programming, and Art: Exploring the Evolution of Artificial Intelligence and the Intersection of Technology and Creativity'

In [None]:
nodes[0]

TextNode(id_='0a6d8435-cc5c-4100-b12c-e22d175190c6', embedding=[0.004466439131647348, -0.01828564889729023, -0.007774787023663521, -0.02322954684495926, 0.005550032947212458, 0.034214481711387634, -0.02435377426445484, -0.005089505575597286, -0.017676126211881638, -0.024462133646011353, 0.02787545509636402, 0.03118041716516018, 1.217588214785792e-05, -0.0046763853169977665, -0.0014975607628002763, 0.01750004291534424, 0.022349126636981964, 0.014804602600634098, 0.003238930134102702, -0.01607782579958439, -0.02095399796962738, -0.009826842695474625, 0.008926105685532093, -0.007693517487496138, 0.006928229238837957, -0.0003453955869190395, 0.019382787868380547, -0.04445444419980049, -0.003735013073310256, -0.014249260537326336, 0.01731041446328163, -0.008208224549889565, -0.006948546506464481, -0.006609923206269741, -0.032670360058546066, -0.0017185123870149255, -0.014587883837521076, -0.005929290782660246, 0.011431916616857052, -0.012617097236216068, 0.03632748872041702, 0.0266699567437

## Cache

In [None]:
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
    ]
)
nodes = pipeline.run(documents=documents)

100%|██████████| 5/5 [00:01<00:00,  4.76it/s]


In [None]:
# Persist the cache to disk, then load it back into a new cache object
pipeline.cache.persist("./llama_cache.json")
new_cache = IngestionCache.from_persist_path("./llama_cache.json")

In [None]:
new_pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
    ],
    cache=new_cache,
)

### Now it will run almost instantly thanks to the cache.

This is especially useful when the pipeline extracts metadata or creates embeddings, because both steps call a model and are slow to recompute.

In [None]:
nodes = new_pipeline.run(documents=documents)

Now let's add embeddings to the pipeline. The node parsing and title extraction results are loaded from the cache, so only the OpenAI embeddings are computed in this run.

In [None]:
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
        OpenAIEmbedding(),
    ],
    cache=new_cache,
)
nodes = pipeline.run(documents=documents)
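The behaviour described here (earlier steps are served from the cache and only the newly added step is computed) can be modelled with a small per-step cache. This is a sketch under stated assumptions, not the library's implementation: the key is assumed to be a hash of a step's input plus its name and configuration, and `step_key`, `run_pipeline`, and the three stand-in steps are hypothetical.

```python
import hashlib
import json


def step_key(items, step_name, step_config):
    """Hash a step's input together with its name and configuration."""
    payload = json.dumps([items, step_name, step_config], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_pipeline(items, steps, cache, log):
    """Run (name, config, fn) steps in order, reusing cached outputs on a key match."""
    for name, config, fn in steps:
        key = step_key(items, name, config)
        if key in cache:
            log.append((name, "hit"))
            items = cache[key]
        else:
            log.append((name, "miss"))
            items = fn(items)
            cache[key] = items
    return items


# Hypothetical stand-ins for the splitter, title extractor, and embedding model.
def split(docs):
    return [d[i:i + 4] for d in docs for i in range(0, len(d), 4)]


def add_title(chunks):
    return ["title|" + c for c in chunks]


def embed(chunks):
    return [[float(len(c))] for c in chunks]


split_step = ("split", {"chunk_size": 4}, split)
title_step = ("title", {}, add_title)
embed_step = ("embed", {}, embed)

cache = {}
docs = ["abcdefgh"]

first_log = []
run_pipeline(docs, [split_step, title_step], cache, first_log)

second_log = []
result = run_pipeline(docs, [split_step, title_step, embed_step], cache, second_log)

print(first_log)
print(second_log)
```

In this model, changing a step's configuration or its input changes the key, so that step and every step after it is recomputed.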

In [None]:
# Persist the cache (now including embeddings) to disk, then load it back
pipeline.cache.persist("./nodes_embedding.json")
nodes_embedding_cache = IngestionCache.from_persist_path(
    "./nodes_embedding.json"
)

In [None]:
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
        OpenAIEmbedding(),
    ],
    cache=nodes_embedding_cache,
)

# Every step is loaded from the cache because the transformations are unchanged.
nodes = pipeline.run(documents=documents)

In [None]:
nodes[0].text

'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack

## RAG using Ingestion Pipeline

In [None]:
import qdrant_client

from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(
    client=client, collection_name="llama_index_vector_store"
)
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
        OpenAIEmbedding(),
    ],
    cache=nodes_embedding_cache,
    vector_store=vector_store,
)
# Ingest directly into a vector db
nodes = pipeline.run(documents=documents)

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store)

In [None]:
query_engine = index.as_query_engine()

In [None]:
response = query_engine.query("What did paul graham do growing up?")

print(response)

Paul Graham skipped a step in the evolution of computers and went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting to him.
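As a rough picture of what happens at query time, the sketch below ranks stored chunks by similarity to a query vector and keeps the best matches as context for the LLM. It assumes cosine similarity and uses invented 3-dimensional toy vectors and chunk labels. A real vector store such as Qdrant uses indexed search over the actual embeddings, so treat this only as an illustration of the idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Return the texts of the k stored items most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Assumption: toy vectors standing in for real embeddings of three chunks.
stored = [
    ("chunk about programming", [0.9, 0.1, 0.0]),
    ("chunk about painting", [0.1, 0.9, 0.0]),
    ("chunk about startups", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.0]

best = top_k(query_vector, stored, k=2)
print(best)
```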


## Custom Transformations

Implementing a custom transformation is straightforward.

Let's add a transformation that removes special characters from the text before the embeddings are generated.

The main requirement is that a transformation takes a list of nodes as input and returns a (possibly modified) list of nodes.

In [None]:
from llama_index.core.schema import TransformComponent
import re


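class TextCleaner(TransformComponent):
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            # Keep only ASCII letters, digits, and spaces; everything else
            # (punctuation, newlines, non-ASCII characters) is dropped.
            node.text = re.sub(r"[^0-9A-Za-z ]", "", node.text)
        return nodes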


pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TextCleaner(),
        OpenAIEmbedding(),
    ],
    cache=nodes_embedding_cache,
)

nodes = pipeline.run(documents=documents)

In [None]:
nodes[0].text

'What I Worked OnFebruary 2021Before college the two main things I worked on outside of school were writing and programming I didnt write essays I wrote what beginning writers were supposed to write then and probably still are short stories My stories were awful They had hardly any plot just characters with strong feelings which I imagined made them deepThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called data processing This was in 9th grade so I was 13 or 14 The school districts 1401 happened to be in the basement of our junior high school and my friend Rich Draves and I got permission to use it It was like a mini Bond villains lair down there with all these alienlooking machines  CPU disk drives printer card reader  sitting up on a raised floor under bright fluorescent lightsThe language we used was an early version of Fortran You had to type programs on punch cards then stack them in the card reader and press a button to loa
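The output above shows words fused across paragraph breaks ("OnFebruary", "deepThe") because the pattern in `TextCleaner` also removes newlines. The sketch below reproduces that on a short sample and shows one possible variation that collapses whitespace to single spaces first. The variation is an assumption for illustration, not part of the original notebook.

```python
import re

sample = "What I Worked On\n\nFebruary 2021\n\nI didn't write essays."

# Pattern used by TextCleaner: newlines are not in the allowed set, so they vanish.
strict = re.sub(r"[^0-9A-Za-z ]", "", sample)

# Hypothetical variation: turn every whitespace run into one space,
# then strip the remaining special characters.
collapsed = re.sub(r"\s+", " ", sample)
gentle = re.sub(r"[^0-9A-Za-z ]", "", collapsed)

print(strict)
print(gentle)
```

Whether the fused words matter depends on the embedding model and tokenizer, so it is worth inspecting a few cleaned nodes before embedding a large corpus.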