# Scheme Document Agent

In [1]:
%load_ext autoreload
%autoreload 2

## step 1 - retrieve urls

The external URLs cannot be reached from this environment, so this step is skipped and the save call below stays commented out.

In [2]:
import utils
urls = [
    'https://www.mastercard.com.au/en-au/business/overview/support/interchange.html',
    'https://www.visa.com.au/about-visa/interchange.html',
    'https://www.auspayplus.com.au/brands/eftpos-interchange-fees'
    ]
# utils.save_confluence_pages(urls)
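
Because the pages have to be saved by hand, a predictable file name per URL keeps later steps simple. The summary file seen in step 3 (`www.auspayplus.com.au_brands_eftpos-interchange-fees.txt`) suggests that host and path are joined with underscores. The helper below is a minimal sketch of that naming scheme; `url_to_filename` is a hypothetical name and is not known to be part of `utils`.

```python
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Hypothetical helper: flatten a URL into a base file name (no extension added)."""
    parsed = urlparse(url)
    # Drop leading/trailing slashes so the name has no empty segments.
    path = parsed.path.strip("/")
    flat = f"{parsed.netloc}_{path}" if path else parsed.netloc
    # Path separators are not valid in file names, so replace them.
    return flat.replace("/", "_")


# Example: derive a base name for each page in the list above.
example_urls = [
    "https://www.visa.com.au/about-visa/interchange.html",
    "https://www.auspayplus.com.au/brands/eftpos-interchange-fees",
]
names = [url_to_filename(u) for u in example_urls]
```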

## step 2 - summarise

In [3]:
from summariser import Summariser

chunk0_query="""this is chunk #0 of the document, summarise the content in detail to include all contents
                and memorise the doc name and numbered list number in your context 
                to be reused when processing subsequent chunks.
                Make sure all your output is in utf-8 encoding."""
chunk_query="""of the document which continues from previous chunks, 
            share the context from chunk #0 for doc name and listed number
            and continue to summarise the content about the same doc in detail, and include all contents.
            Note that this chunk is a portion of the whole document information only.
            Make sure all your output is in utf-8 encoding."""
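
`chunk_query` starts mid-sentence ("of the document which continues..."), which suggests it is meant to be prefixed with the chunk number at run time. The sketch below shows one way the two queries could be combined per chunk. This is an assumption about how `Summariser` uses them, not its confirmed behaviour, and `build_chunk_prompt` is a hypothetical name.

```python
def build_chunk_prompt(index: int, first_query: str, later_query: str) -> str:
    """Hypothetical: pick the query for chunk `index` and prefix later chunks with their number."""
    if index == 0:
        # The first query already names itself as chunk #0.
        return first_query
    # Later chunks get a "this is chunk #N" prefix so the sentence reads completely.
    return f"this is chunk #{index} {later_query}"


# Example with shortened stand-ins for the two queries above.
prompt_0 = build_chunk_prompt(0, "this is chunk #0 of the document, ...", "of the document which continues ...")
prompt_3 = build_chunk_prompt(3, "this is chunk #0 of the document, ...", "of the document which continues ...")
```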

### summarise pdf

In [5]:
system_prompt="""The documents are manuals / guides to schemes (i.e. Mastercard / MC, Visa, eftpos) interchange fees.
Capture all the contents in the document and convert them into Markdown format.
            """
suma_pdf = Summariser('input/scheme_fees/summary', system_prompt, chunk0_query, chunk_query)
suma_pdf.summarise_pdf_directory('input/scheme_fees/pdfs')

summarise_pdf got 517 chunks, table / doc name: Mastercard Interchage Manual.pdf


InternalServerError: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>


In [5]:
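# The Mastercard PDF is too large to summarise in one run and hit the 504 timeout above,
# so write the results to file every 10 chunks and allow resuming from a given chunk.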
system_prompt="""The documents are manuals / guides to schemes (i.e. Mastercard / MC, Visa, eftpos) interchange fees.
Capture all the contents in the document and convert them into Markdown format.
            """
suma_pdf = Summariser('input/scheme_fees/summary', system_prompt, chunk0_query, chunk_query)
suma_pdf.continue_summarise_pdf('input/scheme_fees/pdfs/202504-Mastercard Interchage Manual.pdf',
                                prev_chunk=0)

summarise_pdf got 517 chunks, table / doc name: Mastercard Interchage Manual.pdf
skipping chunks [0, 0).
...written chunks up to 0 into file successfully.
...written chunks up to 10 into file successfully.
...written chunks up to 20 into file successfully.
...written chunks up to 30 into file successfully.
...written chunks up to 40 into file successfully.
...written chunks up to 50 into file successfully.
...written chunks up to 60 into file successfully.
...written chunks up to 70 into file successfully.
...written chunks up to 80 into file successfully.
...written chunks up to 90 into file successfully.
...written chunks up to 100 into file successfully.
...written chunks up to 110 into file successfully.
...written chunks up to 120 into file successfully.
...written chunks up to 130 into file successfully.
...written chunks up to 140 into file successfully.
...written chunks up to 150 into file successfully.
...written chunks up to 160 into file successfully.
...written chunks up t
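
The log above shows results being flushed at chunks 0, 10, 20 and so on, so a timeout loses at most the chunks since the last flush. The following is a minimal sketch of that checkpoint-and-resume pattern, not the actual `continue_summarise_pdf` implementation. `summarise_with_checkpoints`, `summarise_chunk` and `save_every` are hypothetical names; only `prev_chunk` appears in the original call.

```python
from pathlib import Path
from typing import Callable, Sequence


def summarise_with_checkpoints(
    chunks: Sequence[str],
    summarise_chunk: Callable[[int, str], str],
    out_path: Path,
    prev_chunk: int = 0,
    save_every: int = 10,
) -> int:
    """Summarise chunks[prev_chunk:], appending to out_path at every checkpoint.

    Returns the number of chunk summaries written during this call.
    """
    pending = []  # summaries produced since the last flush
    written = 0
    last = len(chunks) - 1
    for i in range(prev_chunk, len(chunks)):
        pending.append(summarise_chunk(i, chunks[i]))
        # Flush at chunks 0, save_every, 2*save_every, ... and at the final chunk.
        if i % save_every == 0 or i == last:
            with open(out_path, "a", encoding="utf-8") as f:
                f.write("\n".join(pending) + "\n")
            written += len(pending)
            pending.clear()
    return written
```

If a run fails after the "up to 160" message, restarting with `prev_chunk=161` would continue from the first unsaved chunk, because the file is opened in append mode.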

### summarise html

In [11]:
system_prompt="""The documents are manuals / guides to schemes (i.e. Mastercard / MC, Visa, eftpos) interchange fees.
Capture all the contents in the document and convert them into Markdown format.
Use page contents only, ignore stylesheets, scripts or html meta tags.
            """
suma_html = Summariser('input/scheme_fees/summary', system_prompt, chunk0_query, chunk_query)
suma_html.summarise_html_directory('input/scheme_fees/html')

summarise_html, got 46 chunks, table / doc name: Mastercard Interchange Fees


## step 3 - embedding

Make sure every summary file is UTF-8 encoded before embedding.

In [12]:
from pathlib import Path
import utils

directory = Path("input/scheme_fees/summary")
out_dir = directory / "utf"
out_dir.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists

for file_path in directory.glob("*.txt"):
    filename = utils.get_basename_without_extension(file_path)
    print(f"processing file {file_path}...")
    try:
        # Strict decode: a file that is not valid UTF-8 is reported and skipped.
        raw = file_path.read_bytes()  # avoid shadowing the built-in `bytes`
        text = raw.decode('utf-8')
    except UnicodeDecodeError as e:
        print(f"decoding error: {e}. Skipping file {file_path}")
        continue
    with open(out_dir / f"{filename}_utf8.txt", mode='w', encoding='utf-8') as f:
        f.write(text)
    print(f"Converted {file_path} to UTF-8.")

processing file input\scheme_fees\summary\202406-Visa interregional interchange guide.txt...
Converted input\scheme_fees\summary\202406-Visa interregional interchange guide.txt to UTF-8.
processing file input\scheme_fees\summary\202503-Visa Interchange Manual.txt...
Converted input\scheme_fees\summary\202503-Visa Interchange Manual.txt to UTF-8.
processing file input\scheme_fees\summary\202504-Mastercard Interchage Manual.txt...
Converted input\scheme_fees\summary\202504-Mastercard Interchage Manual.txt to UTF-8.
processing file input\scheme_fees\summary\202504-Mastercard Interregional Programs.txt...
Converted input\scheme_fees\summary\202504-Mastercard Interregional Programs.txt to UTF-8.
processing file input\scheme_fees\summary\www.auspayplus.com.au_brands_eftpos-interchange-fees.txt...
decoding error: 'utf-8' codec can't decode byte 0x92 in position 13044: invalid start byte. Skipping file input\scheme_fees\summary\www.auspayplus.com.au_brands_eftpos-interchange-fees.txt
processin

In [16]:
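file_path = Path("input/scheme_fees/summary/www.auspayplus.com.au_brands_eftpos-interchange-fees.txt")
filename = utils.get_basename_without_extension(file_path)
# This file failed the strict UTF-8 decode above (byte 0x92), so read it with an explicit
# legacy encoding instead of the platform default. cp1252 is an assumption: 0x92 is the
# right single quotation mark in that code page.
text = file_path.read_text(encoding='cp1252')

with open(f"input/scheme_fees/summary/utf/{filename}_utf8.txt", mode='w', encoding='utf-8') as f:
    f.write(text)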
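
The two cells above handle the failing file by hand. A small fallback decoder could cover both cases in one pass. This is a sketch under the assumption that any non-UTF-8 summary is Windows-1252 (`cp1252`), which fits the 0x92 byte reported above but is not confirmed for other files. `decode_with_fallback` is a hypothetical name.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    # cp1252 leaves a few byte values undefined, so even the fallback can fail.
    raise ValueError(f"could not decode bytes with any of {encodings}")


# Example: byte 0x92 is invalid as a UTF-8 start byte but is a right single quote in cp1252.
text, used = decode_with_fallback(b"merchant\x92s bank")
```

Returning the encoding that worked makes it easy to log which files needed the fallback, so they can be checked by eye before embedding.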


In [6]:
emb_model = 'text-embedding-3-large_v1'

In [17]:
from embedder import Embedder

emb = Embedder(mem_path='memory_scheme', model=emb_model)
emb.embed_directory('input/scheme_fees/summary/utf', embed_common=False)

## step 4 - Q & A

In [19]:
from chat_bot import BotFactory, Chat_Bot

cb = BotFactory.bot(bot=BotFactory.available_bots[1])

In [21]:
prompt = """what is DARE"""
answer = cb.chat(prompt)
print("\nAnswer: ", answer)

process_tool_calls returning: finish_reason= stop
follow_up_response: ChatCompletion(id='chatcmpl-BSFm1XDlEbIlQ8VQ9lOWsTuiT17Ev', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='In the context of Visa and Mastercard interchange fee programs, **DARE** typically stands for **Domestic Acquirer Reimbursement Expense**.\n\n### What is DARE?\n- **DARE** is a term used by Visa (and sometimes by Mastercard) to refer to the **interchange reimbursement fee** that is paid by the acquirer (the merchant’s bank) to the issuer (the cardholder’s bank) for domestic transactions.\n- It is essentially the **domestic interchange fee**—the cost that acquirers pay to issuers for processing card transactions within the same country.\n\n### Where is DARE used?\n- You will see DARE referenced in Visa interchange fee tables, rules, and guides, often as a column or label for the domestic interchange rate applicable to a particular transaction type or merchant 

In [435]:
cb.chat_history

[{'role': 'user', 'content': 'what is DARE'},
 {'role': 'assistant',
  'content': 'DARE stands for "Data Acquisition and Repository Environment." It is a database system used to store and manage various types of transactional and merchant-related data. In the context of the provided information, DARE holds detailed information about merchants, terminals, transactions, and other related data specific to the operations of the Commonwealth Bank of Australia (CBA).\n\n### Key Entities in DARE Database:\n1. **Acquirer**: CBA acts as the acquirer for its merchants.\n2. **Customer**: Individuals or organizations who have purchased or subscribed to CBA\'s product offerings.\n3. **Merchant or Merchant Facility**: CBA customers who subscribe to CBA Merchant Products. Details are stored in the `DARE.daredbo.Facilities` table.\n4. **Terminal**: Devices that can read cards and accept transactions. Data is stored in the `DARE.daredbo.Terminals` table.\n5. **Merchant Product**: Products offered by CB