# Vector store and index in LlamaIndex

When preparing data for LlamaIndex, it helps to understand the utilities it provides. This section covers how `StorageContext`, `ServiceContext`, `Document`, and `VectorStoreIndex` work together and how they are applied in the indexing process.

## ServiceContext

- Purpose: The `ServiceContext` bundles the resources that the LlamaIndex index and query classes share during indexing and querying, such as the LLM, the embedding model, and the text splitter. Passing the same context to both stages keeps queries consistent with how the data was indexed.
- Source Code: [ServiceContext Source](https://github.com/run-llama/llama_index/blob/main/llama_index/service_context.py)
- Parameters: While the `ServiceContext` supports various parameters (see the full list [here](https://github.com/run-llama/llama_index/blob/2b74fccadb87701eff91bf4ce315829fc3fd9e62/llama_index/service_context.py#L85)), we focus on the most commonly used ones to configure the LLM, embedding model, and text chunking method:
  - `embed_model`: The embedding model used for vectorizing the text.
  - `llm`: The Large Language Model used, which by default is OpenAI's model (details [here](https://github.com/run-llama/llama_index/blob/2b74fccadb87701eff91bf4ce315829fc3fd9e62/llama_index/llms/utils.py#L16)).
  - `text_splitter`: The method used to chunk the text. We use a `SentenceSplitter` with a `chunk_size` of 600 tokens and a `chunk_overlap` of 90 tokens. These values were chosen from the distribution of document sizes in the dataset, so that documents split into manageable chunks while consecutive chunks share enough text to preserve context.
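
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a sliding-window splitter. It is not how `SentenceSplitter` is implemented (the real splitter also respects sentence boundaries and uses a tokenizer); `split_with_overlap` is a hypothetical helper, and plain integers stand in for tokens.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most `chunk_size` tokens,
    where consecutive windows share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Toy input: integers stand in for real tokenizer output (an assumption).
tokens = list(range(1500))
chunks = split_with_overlap(tokens, chunk_size=600, chunk_overlap=90)
print([len(c) for c in chunks])
```

With these values each new chunk starts 510 tokens after the previous one, so the last 90 tokens of a chunk reappear at the start of the next.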

## Document

- Purpose: The `Document` class converts and stores documents in a format compatible with LlamaIndex. It also lets you attach metadata to improve searchability within the documents.
- Source Code: [Document Source](https://github.com/run-llama/llama_index/blob/2b74fccadb87701eff91bf4ce315829fc3fd9e62/llama_index/schema.py#L590)
- Metadata: We enrich the documents with metadata such as `communities`, `industry`, `interest`, and `title`, which supports more targeted searches and better-organized indexing.
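
As a rough illustration of why metadata helps, the sketch below narrows a set of documents by a metadata value before any similarity search would run. `ToyDocument` and `filter_by_metadata` are hypothetical stand-ins written for this example, not LlamaIndex classes, and the sample texts are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document: text plus free-form metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, key, value):
    """Keep documents whose metadata[key] equals `value` or, for collections, contains it."""
    kept = []
    for doc in docs:
        found = doc.metadata.get(key)
        if found == value or (isinstance(found, (list, tuple, set)) and value in found):
            kept.append(doc)
    return kept


# Invented sample data for illustration only.
docs = [
    ToyDocument("Intro to solar panels", {"communities": ["energy"], "title": "Solar"}),
    ToyDocument("Hospital scheduling", {"communities": ["health", "ops"], "title": "Scheduling"}),
    ToyDocument("Wind farm maintenance", {"communities": ["energy", "ops"], "title": "Wind"}),
]
energy_docs = filter_by_metadata(docs, "communities", "energy")
print([d.metadata["title"] for d in energy_docs])
```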

## StorageContext

- Purpose: The `StorageContext` is a container for the stores that LlamaIndex uses to hold nodes, indices, and vectors (the document store, index store, and vector store). It is also the object through which an index is persisted to and reloaded from disk.
- Source Code: [StorageContext Source](https://github.com/run-llama/llama_index/blob/main/llama_index/storage/storage_context.py)

## VectorStoreIndex

- Purpose: The `VectorStoreIndex` is a fundamental component of a retrieval-augmented generation (RAG) system. It stores the embedding vectors of text chunks and, at query time, retrieves the chunks whose vectors are most similar to the query.
- Source Code: [VectorStoreIndex Source](https://github.com/run-llama/llama_index/blob/main/llama_index/indices/vector_store/base.py)
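
The core idea behind vector retrieval can be sketched in a few lines: keep one embedding per chunk and rank chunks by cosine similarity to the query embedding. This is a toy model under stated assumptions, not the `VectorStoreIndex` implementation; `ToyVectorStore` and the three-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store mapping node IDs to embeddings."""

    def __init__(self):
        self._vectors = {}  # node_id -> embedding

    def add(self, node_id, embedding):
        self._vectors[node_id] = list(embedding)

    def query(self, query_embedding, top_k=2):
        """Return the `top_k` (node_id, score) pairs, best match first."""
        scored = [
            (cosine_similarity(query_embedding, vec), node_id)
            for node_id, vec in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(node_id, score) for score, node_id in scored[:top_k]]


store = ToyVectorStore()
store.add("chunk-a", [1.0, 0.0, 0.0])
store.add("chunk-b", [0.9, 0.1, 0.0])
store.add("chunk-c", [0.0, 0.0, 1.0])
results = store.query([1.0, 0.0, 0.0], top_k=2)
print([node_id for node_id, _ in results])
```

A real embedding model such as the one used later in this notebook produces vectors with hundreds of dimensions, but the ranking step follows the same principle.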

## Indexing Process

The indexing process converts documents into nodes and stores them so that they can be retrieved efficiently. Each document is split into chunks by the `text_splitter`, and each chunk is then vectorized with the `embed_model`. The resulting nodes are stored in a `VectorStoreIndex`. The code below walks through this process, from setting up the `text_splitter` and `embed_model` to creating and populating the `VectorStoreIndex`.


In [None]:
# Import paths below follow the pre-0.10 llama_index package layout used in this notebook.
import os

import pandas as pd
from llama_index.schema import Document
from llama_index.node_parser import SentenceSplitter, get_leaf_nodes, get_root_nodes
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage import StorageContext
from llama_index import load_index_from_storage, VectorStoreIndex, ServiceContext

In [None]:
model = "sentence-transformers/all-MiniLM-L6-v2"
embed_model = HuggingFaceEmbedding(model_name=model)

In [None]:
path_project = 'XX'
folder_vector = 'XX'

In [None]:
os.makedirs(os.path.join(path_project, folder_vector), exist_ok=True)

We can either build a new index from scratch or load an existing one from persistent storage.


### Building a New Storage

- Setting Up the Service Context: Initially, `service_context_with_splitter` is defined using `ServiceContext.from_defaults`. This context is configured with the embedding model (`embed_model`), the large language model (`llm`), and the text splitter (`text_splitter`) which chunks the text data.
- Creating Documents: Next, a loop runs through the dataframe (`df`), creating a `Document` object for each row. Each `Document` holds the text from the `formatted_card` column and metadata built from the `clean_communities` column (stored under the `communities` key). These documents are collected in a list named `docs`.
- Building the Index: The `VectorStoreIndex.from_documents` function is then used to create an index `index_with_splitter` from the docs. This index is built within the provided `service_context_with_splitter`.
- Persistence: Finally, the `index_with_splitter.storage_context.persist` method is called to save the index to a designated directory (here, `os.path.join(path_project, folder_vector)`). Persisting the index makes it reusable and avoids rebuilding it from scratch in the future.

### Loading from Persistent Storage

- Rebuilding Storage Context: The `StorageContext.from_defaults` function is used to rebuild the `storage_context` from the persistence directory (`os.path.join(path_project, folder_vector)`). This step prepares the environment to access the previously saved index.
- Loading the Index: The `load_index_from_storage` function is then called with the rebuilt `storage_context` and `service_context_with_splitter`. It loads the previously saved index `index_with_splitter` from the persistent storage. This process is much faster than rebuilding the index from scratch, especially for large datasets.
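
The build-or-load pattern described above can be sketched independently of LlamaIndex. The example below uses a JSON file as a stand-in for the persisted index; `build_index`, `load_or_build`, and the file name `toy_index.json` are hypothetical. Unlike the notebook, which uses an explicit flag, this sketch decides by checking whether the persisted file exists.

```python
import json
import os
import tempfile


def build_index(texts):
    """Stand-in for the expensive step (chunking and embedding)."""
    return {f"node-{i}": {"text": t, "length": len(t)} for i, t in enumerate(texts)}


def load_or_build(persist_dir, texts):
    """Load a persisted index if one exists; otherwise build it and persist it."""
    index_path = os.path.join(persist_dir, "toy_index.json")
    if os.path.exists(index_path):
        with open(index_path, encoding="utf-8") as f:
            return json.load(f), "loaded"
    index = build_index(texts)
    os.makedirs(persist_dir, exist_ok=True)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index, "built"


# A temporary directory keeps the example from touching real project paths.
with tempfile.TemporaryDirectory() as tmp:
    persist_dir = os.path.join(tmp, "vector_store")
    first_index, first_status = load_or_build(persist_dir, ["alpha", "beta"])
    second_index, second_status = load_or_build(persist_dir, ["alpha", "beta"])

print(first_status, second_status)
```

Checking for existing files removes the need to flip a flag by hand, at the cost of silently reusing a stale index if the source data has changed since it was persisted.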

In [None]:
df = pd.read_parquet(os.path.join(path_project, "XX.parquet")).reset_index(drop=True)

In [None]:
create_persistent = False  # True: build and persist a new index; False: load the persisted one
text_splitter = SentenceSplitter(chunk_size=600, chunk_overlap=90)

# NOTE: `llm` is not defined in this notebook; it is assumed to be an LLM
# instance created beforehand.
service_context_with_splitter = ServiceContext.from_defaults(
    embed_model=embed_model, llm=llm, text_splitter=text_splitter
)
persist_dir = os.path.join(path_project, folder_vector)

if create_persistent:
    # Create one Document per dataframe row
    docs = []
    for i in range(len(df)):
        docs.append(
            Document(
                text=df["formatted_card"][i],  # change to the column holding the text
                metadata={
                    "communities": list(df["clean_communities"][i]),  # add tags
                },
            )
        )

    # Build the index with the specified embedding model and splitter
    index_with_splitter = VectorStoreIndex.from_documents(
        docs, service_context=service_context_with_splitter
    )

    # Persist the index to disk
    index_with_splitter.storage_context.persist(persist_dir=persist_dir)
else:
    # Rebuild the storage context from disk
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)

    # Load the index
    index_with_splitter = load_index_from_storage(
        storage_context, service_context=service_context_with_splitter
    )


## How to navigate the docstore


When working with the document store in LlamaIndex, it helps to understand how it is structured. Each document is split into smaller units called nodes, which are the units that get indexed and retrieved.

In [None]:
len(index_with_splitter.docstore.docs)

To understand what's contained within these nodes, you can fetch a specific node and examine its contents:


In [None]:
first_key = next(iter(index_with_splitter.docstore.docs))
first_key

In [None]:
node_relationship = index_with_splitter.docstore.get_document(first_key)
node_relationship

This node holds not just the text but also the metadata attached when the `Document` was created (here, `communities`). This information adds context to the results of your search operations.

In [None]:
node_relationship.text

Nodes within the document store are not isolated; they are interconnected, forming a comprehensive network of text. Exploring these relationships allows you to navigate through the document store systematically.
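
The idea of linked nodes can be pictured with a small model in which every node records the document it came from and the node that follows it. This is a simplified sketch, not the LlamaIndex data model: `ToyNode`, `nodes_of_document`, and `reconstruct` are hypothetical names, and the sample chunks do not overlap, whereas real chunks produced with a `chunk_overlap` would repeat some text when concatenated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyNode:
    node_id: str
    text: str
    source_doc_id: str             # document this chunk was split from
    next_id: Optional[str] = None  # following chunk in the same document


# Invented sample data: three chunks of one document plus an unrelated chunk.
docstore = {
    "n1": ToyNode("n1", "LlamaIndex splits documents ", "doc-1", next_id="n2"),
    "n2": ToyNode("n2", "into nodes ", "doc-1", next_id="n3"),
    "n3": ToyNode("n3", "and links them.", "doc-1"),
    "n4": ToyNode("n4", "Unrelated text.", "doc-2"),
}


def nodes_of_document(store, source_doc_id):
    """All nodes that were split from the given document, in insertion order."""
    return [n for n in store.values() if n.source_doc_id == source_doc_id]


def reconstruct(store, first_node_id):
    """Follow the next-node links from a starting node and join the text."""
    parts = []
    current = store.get(first_node_id)
    while current is not None:
        parts.append(current.text)
        current = store.get(current.next_id) if current.next_id else None
    return "".join(parts)


start = docstore["n2"]
siblings = nodes_of_document(docstore, start.source_doc_id)
print([n.node_id for n in siblings])
print(reconstruct(docstore, "n1"))
```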

In [None]:
node_relationship.relationships

To reconstruct a complete document or explore related content, start from a node's source relationship, which identifies the document the node was split from, and then fetch the other nodes that belong to that document:

In [None]:
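from llama_index.schema import NodeRelationship

# SOURCE (value '1') links a node to the document it was split from:
# https://docs.llamaindex.ai/en/stable/api/llama_index.schema.NodeRelationship.html#id2
parent = [rel.node_id for key, rel in node_relationship.relationships.items() if key == NodeRelationship.SOURCE]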
parent

Now that we have the ID of the source document (stored in `parent`), we can find the related nodes and reconstruct the document.

In [None]:
source = index_with_splitter.docstore.get_ref_doc_info(ref_doc_id=parent[0])
source

In [None]:
all_nodes = index_with_splitter.docstore.get_nodes(node_ids=source.node_ids)

In [None]:
# Print the text of every node that belongs to the document
for node in all_nodes:
    print(node.text)