# Access Control demo with LlamaCloud

## Setup

Install the core packages and download the files. You will need to upload these documents to LlamaCloud.

In [None]:
!pip install llama-index
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-index-question-gen-openai
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-parse

Next, configure OpenAI and LlamaParse. The OpenAI LLM is used for response synthesis.

In [1]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio
nest_asyncio.apply()

In [2]:
import os
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = ""

In [3]:
# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = ""
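The empty strings in the two cells above are placeholders. As a minimal sketch, a small guard can fail early with a clear message when a key was never filled in, instead of surfacing later as an authentication error. The helper name `require_env` is hypothetical and not part of any LlamaIndex or LlamaCloud package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

One way to use it would be to call `require_env("LLAMA_CLOUD_API_KEY")` and `require_env("OPENAI_API_KEY")` right before creating the client.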

In [4]:
from llama_cloud.client import LlamaCloud

client = LlamaCloud(token=os.environ["LLAMA_CLOUD_API_KEY"])

## Set Up SharePoint Programmatically
For more details on how to set up the permissions, see our documentation [here](https://docs.cloud.llamaindex.ai/llamacloud/integrations/data_sources/sharepoint).

### Create Data Source

In [12]:
from llama_cloud.types import CloudSharepointDataSource

ds = {
    'name': '<your-data-source-name>',
    'source_type': 'MICROSOFT_SHAREPOINT', 
    'component': CloudSharepointDataSource(
        site_name='<site_name>',
        folder_path='<folder_path>',  # optional
        client_id='<client_id>',
        client_secret='<client_secret>',
        tenant_id='<tenant_id>',
    )
}

data_source = client.data_sources.create_data_source(request=ds)

### Setup Transformations/Embeddings Configs

In [13]:
# Embedding config
embedding_config = {
    'type': 'OPENAI_EMBEDDING',
    'component': {
        'api_key': os.environ["OPENAI_API_KEY"], # editable
        'model_name': 'text-embedding-ada-002' # editable
    }
}

# Transformation auto config
transform_config = {
    'mode': 'auto',
    'config': {
        'chunk_size': 1024, # editable
        'chunk_overlap': 20 # editable
    }
}
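To give a feel for what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window sketch over a list of tokens. It is only an illustration under the assumption that both values are counted in tokens. The splitter LlamaCloud actually runs in `auto` mode is not shown in this notebook and may behave differently (for example, by respecting sentence boundaries). The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=1024, chunk_overlap=20):
    """Split `tokens` into windows of at most `chunk_size` items.

    Each window starts `chunk_size - chunk_overlap` items after the previous
    one, so consecutive windows share `chunk_overlap` items.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Small values make the overlap easy to see.
small = chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
# small == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

A larger overlap repeats more context across neighbouring chunks at the cost of producing more chunks for the same document.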

### Create Pipeline

In [15]:
pipeline = {
    'name': 'test-pipeline',
    'embedding_config': embedding_config,
    'transform_config': transform_config,
}

pipeline = client.pipelines.upsert_pipeline(request=pipeline)

### Add Sharepoint Data Source to Pipeline

In [16]:
data_sources = [
  {
    'data_source_id': data_source.id,
    'sync_interval': 43200.0 # Optional, scheduled sync frequency in seconds. In this case, every 12 hours.
  }
]

pipeline_data_sources = client.pipelines.add_data_sources_to_pipeline(pipeline.id, request=data_sources)
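The `sync_interval` value is expressed in seconds. As a small sketch, the standard library's `timedelta` can derive it from a more readable unit. The helper name `sync_interval_seconds` is hypothetical.

```python
from datetime import timedelta


def sync_interval_seconds(**duration) -> float:
    """Convert a duration given as timedelta keyword arguments to seconds."""
    return timedelta(**duration).total_seconds()


every_12_hours = sync_interval_seconds(hours=12)  # 43200.0
```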

### Sync Pipeline
This triggers a run of the data source attached to the pipeline.


In [19]:
client.pipelines.sync_pipeline(pipeline.id)

Pipeline(id='477e2c1b-6a31-4f95-8b8c-f86a736cef30', created_at=datetime.datetime(2025, 2, 6, 1, 10, 43, 365914, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 2, 6, 1, 10, 43, 365914, tzinfo=datetime.timezone.utc), name='test-pipeline', project_id='177261ea-1af1-4a47-9029-15ed38c0cea8', embedding_model_config_id=None, pipeline_type=<PipelineType.MANAGED: 'MANAGED'>, managed_pipeline_id=None, embedding_config=PipelineEmbeddingConfig_OpenaiEmbedding(component=OpenAiEmbedding(model_name='text-embedding-ada-002', embed_batch_size=10, num_workers=None, additional_kwargs={}, api_key='********MUwA', api_base='https://api.openai.com/v1', api_version='', max_retries=10, timeout=60.0, default_headers=None, reuse_client=True, dimensions=None, class_name='OpenAIEmbedding'), type='OPENAI_EMBEDDING'), configured_transformations=[], config_hash=PipelineConfigurationHashes(embedding_config_hash='e23a92ac03041136e80e731de4f1f15b4aeb5a5f24c3c2ea4d', parsing_config_hash='e0d357e5d44a8e

## Define LlamaCloud Chunk Retriever over Documents

In this section we define a chunk-level LlamaCloud Retriever over these documents.

The chunk-level LlamaCloud retriever is our default retriever that returns chunks via hybrid search + reranking.

In [20]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
import os

index = LlamaCloudIndex(
  name=pipeline.name,
  project_id=pipeline.project_id,
  api_key=os.environ["LLAMA_CLOUD_API_KEY"],
  organization_id="04db4a56-04e3-43c5-aef5-0f39f1653dc8"
)

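### Define chunk retriever

The chunk-level retriever searches the index and reranks the results, keeping the top `rerank_top_n=5` chunks. The metadata filter restricts retrieval to chunks whose `allowed_siteUser_ids` metadata includes the given user ID, which is where access control is enforced.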

In [26]:
from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
)

from llama_index.llms.openai import OpenAI

from llama_index.core.query_engine import RetrieverQueryEngine

# Resolve the user ID / group ID through SharePoint or your own database here.
# Note: you can also filter by the user's groups.
FILTER_BY_USER_ID = "11" # editable

filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="allowed_siteUser_ids", operator=FilterOperator.IN, value=[FILTER_BY_USER_ID]
        ),
    ]
)

chunk_retriever = index.as_retriever(
    retrieval_mode="chunks",
    rerank_top_n=5,
    filters=filters
)

llm = OpenAI(model="gpt-4o-mini")
query_engine_chunk = RetrieverQueryEngine.from_args(
    chunk_retriever, 
    llm=llm,
    response_mode="tree_summarize"
)
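To make the effect of the filter concrete, here is a plain-Python sketch of the access check, run locally on made-up chunk metadata. It assumes that `allowed_siteUser_ids` holds a list of user ID strings on each chunk and that a chunk passes when the querying user's ID appears in that list. The real filtering happens inside LlamaCloud, and its exact semantics may differ. The `allowed_siteGroup_ids` key, the helper names, and the sample data are all hypothetical.

```python
def is_visible(chunk_metadata, user_id, group_ids=()):
    """Return True if the user, or one of the user's groups, is allowed to see the chunk."""
    allowed_users = set(chunk_metadata.get("allowed_siteUser_ids", []))
    # Hypothetical key, used here only to illustrate group-based filtering.
    allowed_groups = set(chunk_metadata.get("allowed_siteGroup_ids", []))
    return user_id in allowed_users or bool(allowed_groups.intersection(group_ids))


def filter_chunks(chunks, user_id, group_ids=()):
    """Keep only the chunks the user is allowed to see."""
    return [c for c in chunks if is_visible(c["metadata"], user_id, group_ids)]


# Made-up sample data for illustration only.
sample_chunks = [
    {"text": "report_a", "metadata": {"allowed_siteUser_ids": ["11", "12"]}},
    {"text": "report_b", "metadata": {"allowed_siteUser_ids": ["12"]}},
    {"text": "report_c", "metadata": {"allowed_siteUser_ids": [], "allowed_siteGroup_ids": ["5"]}},
]

visible_to_user = [c["text"] for c in filter_chunks(sample_chunks, "11")]
# visible_to_user == ["report_a"]
visible_with_group = [c["text"] for c in filter_chunks(sample_chunks, "11", group_ids=["5"])]
# visible_with_group == ["report_a", "report_c"]
```

Because the filter is applied at retrieval time, chunks the user cannot see never reach the LLM, so they cannot leak into the synthesized answer.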

## Build an Agent

In this section we build an agent that takes the chunk-level query engine as a tool. The agent decides whether to call the tool, and with what input, depending on the nature of the question.

In [22]:
from llama_index.core.tools import FunctionTool, ToolMetadata, QueryEngineTool


# this variable tells the agent specific properties about your document.
doc_metadata_extra_str = """\
Each document represents a complete 10K report for a given year (e.g. Apple in 2019). 
Here's an example of relevant documents:
1. apple_2019.pdf
"""

tool_chunk_description = f"""\
Synthesizes an answer to your question by feeding in a relevant chunk as context. Best used for questions that are more pointed in nature.
Do NOT use if the question seems to require a general summary of any given document. Use the doc_query_engine instead for that purpose.

Below we give details on the format of each document:
{doc_metadata_extra_str}
"""

tool_chunk = QueryEngineTool(
    query_engine=query_engine_chunk,
    metadata=ToolMetadata(
        name="chunk_query_engine",
        description=tool_chunk_description
    ),
)

In [23]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner
from llama_index.llms.openai import OpenAI

llm_agent = OpenAI(model="gpt-4o")
agent = FunctionCallingAgentWorker.from_tools(
    [tool_chunk], llm=llm_agent, verbose=True
).as_agent()

In [24]:
response = agent.chat("Tell me the revenue for Apple in 2019?")

Added user message to memory: Tell me the revenue for Apple in 2019?
=== Calling Function ===
Calling function: chunk_query_engine with args: {"input": "Apple revenue in 2019"}
=== Function Output ===
Apple's total net sales in 2019 amounted to $260.174 billion, which represented a 2% decrease compared to 2018. The revenue was primarily driven by various product categories, including iPhone, Mac, iPad, Wearables, Home and Accessories, and Services.
=== LLM Response ===
Apple's total net sales in 2019 amounted to $260.174 billion, which represented a 2% decrease compared to 2018.


In [25]:
response = agent.chat("Tell me the revenue for Apple in 2020?")

Added user message to memory: Tell me the revenue for Apple in 2020?
=== Calling Function ===
Calling function: chunk_query_engine with args: {"input": "Apple revenue in 2020"}
=== Function Output ===
Apple's total net sales in 2020 amounted to $274.5 billion, reflecting a 6% increase compared to 2019. The revenue was primarily driven by higher sales in Services and Wearables, Home and Accessories.
=== LLM Response ===
Apple's total net sales in 2020 amounted to $274.5 billion, reflecting a 6% increase compared to 2019.
