# Explore LlamaIndex

## Install Libraries

In [1]:
!pipenv install llama-index \
    llama-index-vector-stores-pinecone \
    llama-index-readers-file

Loading .env environment variables...
Courtesy Notice:
Pipenv found itself running within a virtual environment, so it will
automatically use that environment, instead of creating its own for any
project. You can set PIPENV_IGNORE_VIRTUALENVS=1 to force pipenv to ignore
that environment and create its own instead.
To activate this project's virtualenv, run pipenv shell.
Alternatively, run a command inside the virtualenv with pipenv run.
Installing llama-index...
✔ Installation Succeeded
Installing llama-index-vector-stores-pinecone...
✔ Installation Succeeded
Installing llama-index-readers-file...
✔ Installation Succeeded
Installing dependencies from Pipfile.lock (e902c7)...
All dependencies are now up-to-date!

## Load Data

In [46]:
from pathlib import Path
from typing import Dict, List, Optional, Union

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class PyMuPDFReader_v2(BaseReader):
    """Read PDF files using PyMuPDF library."""

    def load_data(
        self,
        file_path: Union[Path, str],
        metadata: bool = True,
        extra_info: Optional[Dict] = None,
    ) -> List[Document]:
        """Loads list of documents from PDF file and also accepts extra information in dict format."""
        return self.load(file_path, metadata=metadata, extra_info=extra_info)

    def load(
        self,
        file_path: Union[Path, str],
        metadata: bool = True,
        extra_info: Optional[Dict] = None,
    ) -> List[Document]:
        """Loads list of documents from PDF file and also accepts extra information in dict format.

        Args:
            file_path (Union[Path, str]): file path of PDF file (accepts string or Path).
            metadata (bool, optional): if metadata to be included or not. Defaults to True.
            extra_info (Optional[Dict], optional): extra information related to each document in dict format. Defaults to None.

        Raises:
            TypeError: if extra_info is not a dictionary.
            TypeError: if file_path is not a string or Path.

        Returns:
            List[Document]: list of documents.
        """
        import pymupdf

        # check if file_path is a string or Path
        if not isinstance(file_path, str) and not isinstance(file_path, Path):
            raise TypeError("file_path must be a string or Path.")

        # open PDF file
        doc = pymupdf.open(file_path)

        # if extra_info is provided, check that it is a dictionary
        if extra_info is not None and not isinstance(extra_info, dict):
            raise TypeError("extra_info must be a dictionary.")

        # if metadata is True, add file-level metadata to each document
        if metadata:
            # copy so the caller's dictionary is not mutated
            extra_info = dict(extra_info or {})
            extra_info["total_pages"] = len(doc)
            extra_info["file_path"] = str(file_path)
            # keep only the metadata fields that are actually populated
            filtered_metadata = {k: v for k, v in doc.metadata.items() if v is not None and v != ""}
            print(filtered_metadata)
            extra_info = dict(extra_info, **filtered_metadata)

            # return one Document per page, tagged with its 1-based page number
            return [
                Document(
                    text=page.get_text(),
                    extra_info=dict(extra_info, page=f"{page.number + 1}"),
                )
                for page in doc
            ]

        else:
            # no file-level metadata: pass through extra_info unchanged
            return [
                Document(text=page.get_text(), extra_info=extra_info or {})
                for page in doc
            ]


In [47]:
loader = PyMuPDFReader_v2()
documents = loader.load_data(file_path=Path('../documents/FAST--Fast-Architecture-Sensit-542611fc-2992-4086-a9db-0d34117f512c.pdf'))
documents

{'format': 'PDF 1.3', 'title': 'FAST: fast architecture sensitive tree search on modern CPUs and GPUs', 'author': 'Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, Pradeep Dubey', 'keywords': 'compression, cpu, data-level parallelism, gpu, thread-level parallelism, tree search', 'producer': 'Mac OS X 10.5.8 Quartz PDFContext', 'creationDate': "D:20100617231448Z00'00'", 'modDate': "D:20100617231448Z00'00'"}


[Document(id_='3854e77d-58b2-4a85-ab10-badb92a63e6e', embedding=None, metadata={'total_pages': 12, 'file_path': '../documents/FAST--Fast-Architecture-Sensit-542611fc-2992-4086-a9db-0d34117f512c.pdf', 'format': 'PDF 1.3', 'title': 'FAST: fast architecture sensitive tree search on modern CPUs and GPUs', 'author': 'Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, Pradeep Dubey', 'keywords': 'compression, cpu, data-level parallelism, gpu, thread-level parallelism, tree search', 'producer': 'Mac OS X 10.5.8 Quartz PDFContext', 'creationDate': "D:20100617231448Z00'00'", 'modDate': "D:20100617231448Z00'00'", 'page': '1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='FAST: Fast Architecture Sensitive Tree Search\non Modern CPUs and GPUs\nChangkyu Kim†, Jatin Chh

In [48]:
import re

def clean_up_text(content: str) -> str:
    """
    Remove unwanted characters and patterns in text input.

    :param content: Text input.
    
    :return: Cleaned version of original text input.
    """

    # Fix hyphenated words broken by newline
    content = re.sub(r'(\w+)-\n(\w+)', r'\1\2', content)

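    # Remove specific unwanted patterns and characters.
    # Note: "\\n" is read by `re` as a newline, so line breaks are deleted here
    # rather than replaced with a space; this fuses words that were split across lines.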
    unwanted_patterns = [
        "\\n", "  —", "——————————", "—————————", "—————",
        r'\\u[\dA-Fa-f]{4}', r'\uf075', r'\uf0b7'
    ]
    for pattern in unwanted_patterns:
        content = re.sub(pattern, "", content)

    # Fix improperly spaced hyphenated words and normalize whitespace
    content = re.sub(r'(\w)\s*-\s*(\w)', r'\1-\2', content)
    content = re.sub(r'\s+', ' ', content)

    return content

# Clean the text of every document in place
cleaned_docs = []
for d in documents:
    cleaned_text = clean_up_text(d.text)
    d.set_content(cleaned_text)
    cleaned_docs.append(d)

# Inspect output
cleaned_docs[0].get_content()

'FAST: Fast Architecture Sensitive Tree Searchon Modern CPUs and GPUsChangkyu Kim†, Jatin Chhugani†, Nadathur Satish†, Eric Sedlar⋆, Anthony D. Nguyen†,Tim Kaldewey⋆, Victor W. Lee†, Scott A. Brandt⋄, and Pradeep Dubey†changkyu.kim@intel.com†Throughput Computing Lab,Intel Corporation⋆Special Projects Group,Oracle Corporation⋄University of California,Santa CruzABSTRACTIn-memory tree structured index search is a fundamental databaseoperation. Modern processors provide tremendous computing powerby integrating multiple cores, each with wide vector units. Therehas been much work to exploit modern processor architectures fordatabase primitives like scan, sort, join and aggregation. However,unlike other primitives, tree search presents signiﬁcant challengesdue to irregular and unpredictable data accesses in tree traversal.In this paper, we present FAST, an extremely fast architecturesensitive layout of the index tree. FAST is a binary tree logicallyorganized to optimize for architecture featu
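The output above contains fused words such as `Searchon` and `databaseoperation`. A likely cause is that newlines are deleted instead of being replaced with a space. The following is a minimal sketch on a made-up sample string (the sample text and variable names are illustrative, not taken from the PDF) that contrasts the two behaviours using only the standard `re` module:

```python
import re

# Hypothetical sample: one word hyphenated across a line break, one plain line break.
sample = "In-memory tree search is a funda-\nmental database\noperation."

# Rejoin words that were hyphenated across a line break.
joined = re.sub(r"(\w+)-\n(\w+)", r"\1\2", sample)

# Deleting the remaining newlines outright fuses neighbouring words.
deleted = re.sub(r"\n", "", joined)

# Collapsing whitespace (including newlines) to a single space keeps word boundaries.
spaced = re.sub(r"\s+", " ", joined)

print(deleted)
print(spaced)
```

If separated words are preferred for embedding, one option is to drop the `"\\n"` entry from `unwanted_patterns` and rely on the final `\s+` normalization; whether that changes retrieval quality on this paper has not been checked here.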

In [49]:
cleaned_docs[0].metadata


{'total_pages': 12,
 'file_path': '../documents/FAST--Fast-Architecture-Sensit-542611fc-2992-4086-a9db-0d34117f512c.pdf',
 'format': 'PDF 1.3',
 'title': 'FAST: fast architecture sensitive tree search on modern CPUs and GPUs',
 'author': 'Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, Pradeep Dubey',
 'keywords': 'compression, cpu, data-level parallelism, gpu, thread-level parallelism, tree search',
 'producer': 'Mac OS X 10.5.8 Quartz PDFContext',
 'creationDate': "D:20100617231448Z00'00'",
 'modDate': "D:20100617231448Z00'00'",
 'page': '1'}

## Ingestion Pipeline

In [50]:
from dotenv import load_dotenv
load_dotenv()

import os
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

In [62]:
import time
from pinecone.grpc import PineconeGRPC
from pinecone import ServerlessSpec

from llama_index.vector_stores.pinecone import PineconeVectorStore

# Initialize connection to Pinecone
pc = PineconeGRPC(api_key=PINECONE_API_KEY)
index_name = "research-gpt"

# Recreate the index from scratch: delete any existing index with this name
# (this discards its stored vectors), then create a new one below

if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

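# The dimension must match the length of the embedding vectors (1536 here)
pc.create_index(
    index_name,
    dimension=1536,
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

# Poll until the new index reports that it is ready
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)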

# Initialize your index 
pinecone_index = pc.Index(index_name)

# Initialize VectorStore
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# print(f"Connected to Pinecone index: {index_name}")
# print(f"Index summary:\n {pinecone_index.describe_index_stats()}")
# vector_store.model_json_schema()
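The readiness loop in the cell above polls indefinitely if the index never becomes ready. A minimal sketch of a bounded alternative follows; `wait_until` is a hypothetical helper written for this notebook (it is not part of the Pinecone client), and the default timeout is an arbitrary assumption:

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> None:
    """Poll `predicate` until it returns True, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Intended usage with the objects defined in the cell above:
# wait_until(lambda: pc.describe_index(index_name).status["ready"])
```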

In [63]:
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.ingestion import IngestionPipeline

# This will be the model we use both for Node parsing and for vectorization
embed_model = OpenAIEmbedding(api_key=OPENAI_API_KEY)

# Define the initial pipeline
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        embed_model,
    ],
    vector_store=vector_store,
)

pipeline.run(documents=cleaned_docs)

Upserted vectors:   0%|          | 0/45 [00:00<?, ?it/s]

[TextNode(id_='0b4167d4-d58b-4b7b-bfb1-6883645caf6e', embedding=[-0.02250235341489315, 0.004352264106273651, -0.012009604834020138, -0.026677187532186508, 0.006815416272729635, 0.015599962323904037, -0.02933516539633274, -0.017033321782946587, -0.028416702523827553, -0.034261468797922134, 0.02326774038374424, 0.025814387947320938, -0.021180324256420135, -0.0054342420771718025, 0.005611672531813383, 0.007417288143187761, 0.021472562104463577, -8.050688484217972e-05, -0.0004466202517505735, 0.0033850944600999355, -0.018633674830198288, -0.0039834873750805855, -0.007431203965097666, 0.005204625893384218, -0.00576127041131258, 0.009831734001636505, 0.02735907770693302, -0.019078990444540977, -0.0386311300098896, -0.021110743284225464, 0.02008095011115074, -0.030531950294971466, -0.012204430997371674, 0.004498383495956659, -0.017993533983826637, -0.004901950713247061, 0.00930292159318924, -0.01998353749513626, 0.02154214307665825, -0.01196785643696785, 0.022432774305343628, -0.0088088996708
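To give some intuition for `breakpoint_percentile_threshold=95`: the semantic splitter starts a new chunk where the embedding distance between neighbouring sentence groups is unusually large compared with the rest of the document. The sketch below illustrates that idea in plain Python with precomputed distances. It is a simplified illustration, not LlamaIndex's implementation; the function names and the toy distances are assumptions.

```python
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_at_breakpoints(
    sentences: List[str], distances: List[float], pct: float = 95.0
) -> List[str]:
    """Split where the distance to the next sentence exceeds the percentile.

    distances[i] is the (precomputed) distance between sentences[i] and
    sentences[i + 1], so len(distances) == len(sentences) - 1.
    """
    if not sentences:
        return []
    if len(sentences) < 2:
        return [sentences[0]]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large jump in meaning: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy example: only the jump between "c" and "d" stands out.
print(split_at_breakpoints(["a", "b", "c", "d", "e"], [0.1, 0.1, 0.9, 0.1]))
```

A lower percentile yields more, smaller chunks; a higher one yields fewer, larger chunks.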

In [65]:
pinecone_index.describe_index_stats()


{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 45}},
 'total_vector_count': 45}

## Query 

In [71]:
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever

# Instantiate VectorStoreIndex object from your vector_store object
vector_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

# Grab 5 search results
retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=5)

# Query vector DB
answer = retriever.retrieve('What is the paper\'s contribution?')

# Inspect results
print([i.get_content() for i in answer])


['A value of 1 imples no false positives. Evenfor very low entropy (choices per key ≤4), we perform relativelylow extra computation, with total work less than 1.9X as comparedto no false positives. For all other entropies, we measure less than5% excess work, signifying the effectiveness of our low cost partialkey computation. A similar overhead of around 5.2% is obesrvedfor the TPC-H data using our scheme, with the competing schemereporting around 12X increase.Figure 10 shows the relative throughput with varying key sizes(and ﬁxed entropy per byte: log2(10) bits). The number of keysfor each case is varied so that the total tree size of the uncompressed keys is ∼1GB. All numbers are normalized to the throughput achieved using 4-byte keys. Without our compression scheme,the througput reduces to around 50% for 16-byte keys, and as lowas 30% and 18% for key sizes 64 and 128 bytes respectively. Thisis due to the reduced effectiveness of cache lines read from mainmemory, and therefore increa
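Conceptually, `similarity_top_k=5` asks the vector store for the five stored vectors most similar to the embedded query. The sketch below shows a brute-force version of that ranking, assuming cosine similarity (an assumption: the metric is not set explicitly when the index is created above). A managed vector database would typically use an approximate index rather than a full scan; the function names and toy vectors are illustrative only.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: List[float], vectors: Dict[str, List[float]], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs with the highest similarity to the query."""
    scored = [(id_, cosine_similarity(query, vec)) for id_, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings" keyed by node id.
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = top_k([1.0, 0.0], store, k=2)
print(results)
```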

In [75]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

# Pass in your retriever from above, which is configured to return the top 5 results
synth = get_response_synthesizer(streaming=True)
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=synth)

# Now you query:
llm_response = query_engine.query('What is the summary of the results?')

llm_response.print_response_stream()

The study compared the search performance of tree search algorithms on CPUs, GPUs, and the Intel Many-Core Architecture Platform (MICA). Initially, unoptimized GPU search outperformed unoptimized CPU search by 8X for large trees, but with proper architecture optimizations, CPUs were only 1.7X slower on large trees and 2X faster on smaller trees. The MICA platform showed significant speedups over CPUs and GPUs for both small and large trees, leveraging a combination of large cache and high compute/bandwidth capabilities. Additionally, the study highlighted the impact of compression techniques on improving search performance, particularly for CPUs by reducing memory bandwidth bottlenecks.

In [73]:
# Collect the text of the source nodes the response was synthesized from
llm_response_source_nodes = [i.get_content() for i in llm_response.source_nodes]

llm_response_source_nodes

['PlatformPeak GFlopsPeak BWTotal FrequencyCore i7103.03012.8GTX 280933.3141.739Table 1: Peak compute (GFlops), bandwidth (GB/sec), and total frequency (Cores * GHz) on the Core i7 and the GTX 280.Figure 6:Normalized search time with various architectural optimization techniques (lower is faster). The fastest reported performanceon CPUs [28] and GPUs [2] is also shown (for comparison).(core count · frequency) of the two platforms are shown in Table 1.We generate 32-bit (key, rid) tuples, with both keys and rids generated randomly. The tuples are sorted based on the key value andwe vary the number of tuples from 64K to 64M6. ',
 'The search keysare also 32-bit wide, and generated uniformly at random. Randomsearch keys exercise the worst case for index tree search with nocoherence between tree traversals of subsequent queries.We ﬁrst show the impact of various architecture techniques onsearch performance for both CPUs and GPUs and compare searchperformance with the best reported number o