# Custom Pipeline

## Install requirements

In [6]:
import sys
from typing import Iterator

from langchain_core.documents import Document

sys.path.append("src")

In [7]:
! pip install -r requirements.txt

Looking in indexes: https://int.repositories.cloud.sap/artifactory/api/pypi/build-milestones-pypi/simple/, https://int.repositories.cloud.sap/artifactory/api/pypi/proxy-deploy-releases-hyperspace-pypi/simple
Obtaining langchain-rage from git+https://github.tools.sap/AI-BUS/rage-langchain#egg=langchain-rage (from -r requirements.txt (line 4))
  Updating ./.venv/src/langchain-rage clone
  Running command git fetch -q --tags
  Running command git reset --hard -q 993611ab9c40f3dfa8bf68215375fa5def9f7ec8
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: langchain-rage
  Building editable for langchain-rage (pyproject.toml) ... done
  Created wheel for langchain-rage: filename=langchain_rage-0.0.1-0.editable-py3-none-any.whl size=177

In [8]:
# Uncomment the lines below to enable verbose logging.
# import logging
#
# logging.basicConfig(level=logging.INFO)

## RAGe Setup

### Create RAGe Service Instance

Follow this guide to create the service instances for both `identity` and `document-grounding`. Download the service keys and keep them in a safe place. You may skip the parts of the wiki related to `curl`; all you need are the service keys.

⚠️ **DO NOT UPLOAD YOUR SERVICE KEYS TO GITHUB** ⚠️

Update the two variables below to point to your downloaded service keys.
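
Before the keys are passed to the client, it can help to confirm that both files exist and parse as JSON. The following is a minimal pre-flight sketch; the helper name `check_service_key` is hypothetical and not part of `langchain_rage`, and it deliberately makes no assumption about which fields a service key contains.

```python
import json
from pathlib import Path


def check_service_key(path: str) -> dict:
    """Hypothetical helper: load a service key file and fail early with a clear message."""
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"Service key not found: {key_file}")
    with key_file.open(encoding="utf-8") as handle:
        key = json.load(handle)
    # Only the overall shape is checked; the field names inside the key are not assumed.
    if not isinstance(key, dict) or not key:
        raise ValueError(f"Service key is empty or not a JSON object: {key_file}")
    return key
```

Once the two key variables are set, calling `check_service_key` on each of them surfaces a wrong path immediately. Listing the key file names in `.gitignore` is one way to keep them out of version control.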

### Setup RAGe Client

In [9]:
from langchain_rage.rage_clients.vector.client import VectorClient

dg_key = "./key-document-grounding-joule.json"
dg_identity_key = "./key-document-grounding-joule-identity.json"

vector_client = VectorClient.create_from_service_keys(
    path_document_grounding_key=dg_key, path_identity_key=dg_identity_key
)

### Create RAGe Collection

In [10]:
from langchain_rage.rage_clients.vector.model import MetaData as RAGeMetaData

collection_name = "joule-document-test"

# Do not call vector_client.delete_all_collections() - this will break Joule due to deleting internal data.

# Clean up all custom documents, if required
vector_client.delete_collection_by_metadata("type", "custom")

# Clean up any existing collection with the same name
vector_client.delete_all_collections_by_name(collection_name)

# type=custom enables Joule to use the collection
collection_id = vector_client.create_collection(collection_name, metadata=[RAGeMetaData(key="type", value=["custom"])])

print(f"Created collection: '{collection_id}'")

Created collection: '52f41940-e499-4650-b461-33299fb5453b'


## Downloading the dataset

To bring your own dataset, put your files into the directory named by `path` and comment out the download code below.

In [11]:
path = "tmp"

In [12]:
from dataclasses import dataclass

# Just a helper class for the download.
@dataclass
class DocSource:
    url: str
    target_file_name: str
    

urls = [
    # DocSource(url="https://help.sap.com/doc/c31b38b32a5d4e07a4488cb0f8bb55d9/CLOUD/en-US/f17fa8568d0448c685f2a0301061a6ee.pdf", target_file_name="Service Guide SAP AI Core.pdf"),
    DocSource(url="https://arxiv.org/pdf/2405.00200", target_file_name="arxiv-2405.00200.pdf"),   
]

In [13]:
import shutil
import os

import requests

if os.path.exists(path):
    shutil.rmtree(path)
os.makedirs(path)

for doc_source in urls:
    # A timeout keeps the cell from hanging indefinitely on an unresponsive server.
    response = requests.get(doc_source.url, allow_redirects=True, timeout=60)
    response.raise_for_status()
    filename = os.path.join(path, doc_source.target_file_name)
    with open(filename, "wb") as file:
        file.write(response.content)
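
A download can return HTTP 200 and still deliver something other than the expected file, for example an HTML login or error page. As an optional sanity check, the sketch below looks for the `%PDF-` signature at the start of every `.pdf` file in the dataset directory. The helper names `looks_like_pdf` and `find_non_pdf_files` are made up for this example.

```python
import os


def looks_like_pdf(file_path: str) -> bool:
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with open(file_path, "rb") as file:
        return file.read(5) == b"%PDF-"


def find_non_pdf_files(directory: str) -> list[str]:
    """List files named *.pdf in `directory` whose content does not start like a PDF."""
    suspicious = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if not os.path.isfile(full_path):
            continue
        if name.lower().endswith(".pdf") and not looks_like_pdf(full_path):
            suspicious.append(full_path)
    return suspicious
```

Running `find_non_pdf_files(path)` after the download loop should return an empty list when every PDF arrived intact.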

## Processing the dataset


### Setup the Document Loader

In [14]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(path=str(path))

### Configure Chunking Strategy

In [15]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Compare to production pipeline here:
# * https://github.tools.sap/AI-BUS/rage-pipeline-api/blob/aeefda0814283c2a0ae4b0a6fb777cfa0315b54d/srv/services/pipeline-service.ts#L152-L152
# * https://github.tools.sap/AI-BUS/rage-pipeline-steps/blob/dccce7dd5c4f0787e0526750017fc63d917e39fa/src/processor.py#L171-L171
chunker = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
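
To build intuition for `chunk_size`, `chunk_overlap`, and `add_start_index`, the sketch below implements a plain sliding window over characters. This is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[dict]:
    """Cut `text` into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With the settings used in this notebook, a 2,500-character text produces windows starting at offsets 0, 900, and 1800, and the last 100 characters of each window reappear at the start of the next one.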

### Process & Upload

In [17]:
from src.rage_custom_pipeline.joule_support import (
    convert_langchain_document_to_rage_chunk,
    convert_rage_chunks_to_rage_document,
)

raw_documents = loader.load()

rage_documents = []

for raw_document in raw_documents:
    langchain_chunks = chunker.split_documents([raw_document])
    for langchain_chunk in langchain_chunks:
        # NB: Unlike the production pipeline, every _chunk_ is converted into an individual
        # RAGe document. This is useful as Joule will only process one chunk per RAGe document,
        # hence we pretend here that each chunk is a separate document.
        # This is particularly useful if only very few documents are ingested.
        rage_chunk = convert_langchain_document_to_rage_chunk(langchain_chunk)
        
        document_metadata = {
            "title": raw_document.metadata["source"]
        }
        rage_document = convert_rage_chunks_to_rage_document([rage_chunk], metadata=document_metadata)
        rage_documents.append(rage_document)

document_ids = vector_client.create_documents(
    collection_id=collection_id, documents=rage_documents
)        
print(f"Uploaded documents: {document_ids}")

Uploaded documents: ['42edd240-cd0b-47e1-af0c-9fa6bbea4213', '5b6ae87c-ebc3-4240-8401-fc2a4d04d80e', 'b8f8db2b-4072-472a-b014-3afe0247e4d9', '33c3c917-77ec-4bc0-82a9-30e5d3e718f1', '01ace981-8443-4a0a-9044-1792936d98a8', '0a6ffb3a-1132-4087-b2f9-313044a0defe', 'b9e86c43-13d7-4c6f-b484-6d87dbb06b9f', '062e4e3a-6a31-44cb-887c-7b7d75937511', '8c0e7673-badd-40af-ab89-e1dc7bd0dcf2', '61188f21-8bd8-4e0e-8a44-7517f3a65c83', 'b04ef852-3e9c-463f-8e56-d12e56bee11a', '2f6eb3a3-6ceb-4176-9259-4e2bf5102c54', 'b49ff332-8420-4889-9359-e257b6040b22', 'd8992d29-f75c-490b-90ff-93f0565bd0a3', '8e04def3-cb7b-425f-9f32-b9f58ef24c59', '2f4153e4-0a83-4904-af36-c9001bd8fb57', '082cace2-ca97-46aa-9eaf-6ec59c4b3fd8', 'b35fefd4-d4c6-4e34-b649-940b80516ed8', '5d068a1a-1d88-4ee7-b748-dea39bbdda03', 'cd870f48-d257-482f-b538-73d074277ecc', '211c38f0-a9a2-46a4-8d58-55aa92acd843', '17961e44-ad76-458a-8eb8-35428784f304', '7ae0e8e6-8f0f-429b-921f-9c38c7d3ae62', '57447945-02db-49e9-80d2-b9efded777a9', '66767983-056e-4faf
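
The loop above uses `raw_document.metadata["source"]` as the title, which is assumed here to be the file path as reported by the loader (for example `tmp/arxiv-2405.00200.pdf`). If a shorter title is preferred, a small helper can reduce it to the file name and optionally append a chunk number. This is an optional variation, and `readable_title` is a hypothetical name rather than part of the pipeline.

```python
from pathlib import PurePath
from typing import Optional


def readable_title(source: str, chunk_number: Optional[int] = None) -> str:
    """Reduce a source path to its file name, optionally tagged with a chunk number."""
    name = PurePath(source).name or source  # fall back to the raw value if there is no file name
    if chunk_number is None:
        return name
    return f"{name} (chunk {chunk_number})"
```

Passing the result as the `"title"` entry of `document_metadata` would keep the directory prefix out of the title.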

## Retrieval

To test the retrieval, we will query the collection with a simple question.

### Setup RAGe Retrieval Client

In [18]:
from langchain_rage.rage_clients.retrieval.client import RetrievalClient

retrieval_client = RetrievalClient.create_from_service_keys(
    path_document_grounding_key=dg_key, path_identity_key=dg_identity_key
)

### Example Query

In [19]:
query = "What is in-context learning?"

In [20]:
from pprint import pprint

from langchain_rage.rage_clients.retrieval.model import SearchConfiguration, SearchFilter, Search

# Restrict the search to the collection created above and request at most five chunks.
search_configuration = SearchConfiguration(maxChunkCount=5)
search_filter = SearchFilter(
    dataRepositories=[collection_id],
    dataRepositoryType="vector",
    searchConfiguration=search_configuration,
)
search = Search(query=query, filters=[search_filter])
response = retrieval_client.query(search)

pprint(response.json(), width=200)

{'results': [{'filterId': 'b7abea59-7b7e-4b59-9b3f-370ac2408149',
              'results': [{'dataRepository': {'documents': [{'chunks': [{'content': '1\n'
                                                                                    '\n'
                                                                                    'Introduction\n'
                                                                                    '\n'
                                                                                    'When a few examples are provided in-context, large language models can perform many tasks with reasonable '
                                                                                    'accuracy. While questions remain about the exact mechanism behind this phenomena (Min et al., 2022b; von Oswald '
                                                                                    'et al., 2023), this paradigm of in-context learning (ICL) has seen widespread adoption i
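
The raw response is deeply nested, so it can be easier to inspect only the chunk texts. The sketch below walks the structure visible in the output above (`results` → `results` → `dataRepository` → `documents` → `chunks` → `content`). It assumes that nesting holds for every entry and skips anything that is missing; `iter_chunk_contents` is a hypothetical helper, not part of `langchain_rage`.

```python
from typing import Any, Iterator


def iter_chunk_contents(payload: dict[str, Any]) -> Iterator[str]:
    """Yield the `content` of every chunk found in a retrieval response payload."""
    for filter_result in payload.get("results", []):
        for repository_result in filter_result.get("results", []):
            repository = repository_result.get("dataRepository", {})
            for document in repository.get("documents", []):
                for chunk in document.get("chunks", []):
                    content = chunk.get("content")
                    if content is not None:
                        yield content
```

For example, `for content in iter_chunk_contents(response.json()): print(content[:200])` prints the first 200 characters of each retrieved chunk.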

## Joule Testing

This section of the notebook requires a functional Joule test environment, with the command line client pointing to the correct Joule instance. The Joule instance must be provisioned in the same BTP subaccount as the RAGe service.

In [21]:
! joule dialog sap_digital_assistant "What can you tell me about in-context learning?"

Messages:
{
  "type": "text",
  "content": "In-context learning (ICL) is a paradigm where large language models are provided with a few examples in a specific context to perform various tasks. It has gained popularity due to its ease of implementation, low computational cost, and the ability to reuse a single model across tasks [[1]](https://www.sap.com/). ICL has been studied extensively, and it has been shown that performance continues to improve with a larger number of demonstrations, especially for datasets with large label spaces [[2]](https://www.sap.com/). It is less sensitive to random input shuffling and can outperform other methods like example retrieval and fine-tuning in certain scenarios [[2]](https://www.sap.com/).",
  "markdown": true
}

{
  "type": "list",
  "content": {
    "displayInPanel": true,
    "panel": {
      "collapsed": true
    },
    "title": "Source Documents",
    "enableDetailView": false,
    "elements": [
      {
        "title": {


![](img/screenshot-joule.png)