## Exploring the banking dataset using AutoLabel

#### Set up the API keys for the providers you want to use

In [None]:
%env OPENAI_API_KEY=""
%env ANTHROPIC_API_KEY=""
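Before running anything, it can help to confirm that the keys are actually visible to the Python process. The helper below is a small sketch and not part of AutoLabel; `missing_keys` is a hypothetical name, and it assumes that a blank or quote-only value means the key has not been set.

```python
import os

def missing_keys(names, env=None):
    """Return the names whose value is unset or blank in the environment."""
    env = os.environ if env is None else env
    # Treat values made only of whitespace or quote characters as missing.
    return [n for n in names if not env.get(n, "").strip().strip("\"'")]

# Example with an explicit mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "placeholder-value", "ANTHROPIC_API_KEY": ""}
print(missing_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"], example_env))
```

Calling `missing_keys([...])` without the second argument checks the real environment of the notebook kernel.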

### Get the correct data

First, let's make sure we download the correct data. The data is currently hosted on S3; we plan to upload it to Hugging Face datasets as well in the future.

In [12]:
!aws s3 cp s3://refuel-benchmarking data/ --recursive

download: s3://refuel-benchmarking/banking_seed.csv to data/banking_seed.csv
download: s3://refuel-benchmarking/conll2003_seed.csv to data/conll2003_seed.csv
download: s3://refuel-benchmarking/emotion_seed.csv to data/emotion_seed.csv
download: s3://refuel-benchmarking/banking_test.csv to data/banking_test.csv
download: s3://refuel-benchmarking/civil_comments_seed.csv to data/civil_comments_seed.csv
download: s3://refuel-benchmarking/company_seed.csv to data/company_seed.csv
download: s3://refuel-benchmarking/emotion_test.csv to data/emotion_test.csv
download: s3://refuel-benchmarking/ledgar_seed.csv to data/ledgar_seed.csv
download: s3://refuel-benchmarking/conll2003_test.csv to data/conll2003_test.csv
download: s3://refuel-benchmarking/civil_comments_test.csv to data/civil_comments_test.csv
download: s3://refuel-benchmarking/medqa_seed.csv to data/medqa_seed.csv
download: s3://refuel-benchmarking/pubmed_qa_seed.csv to data/pubmed_qa_seed.csv
download: s3://refuel-benchmarking/sciq_se
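Once the files are downloaded, a quick look at the label distribution is a reasonable sanity check. This sketch uses only the standard library and assumes the CSV has a `label` column, as the dataset schema in this notebook does. `label_counts` is a hypothetical helper, and the inline sample stands in for `data/banking_seed.csv`.

```python
import csv
import io
from collections import Counter

def label_counts(csv_file, label_column="label"):
    """Count how often each label appears in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)

# Inline sample; for the real file use: open("data/banking_seed.csv", newline="")
sample = io.StringIO(
    "example,label\n"
    "I want to close my account,terminate_account\n"
    "Why am I being charged when I withdraw cash?,cash_withdrawal_charge\n"
    "Please close my account,terminate_account\n"
)
counts = label_counts(sample)
print(counts.most_common(1))
```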

### Run the labeler with your own seed examples

#### Create a dataset config
This config lists all the possible labels for the banking dataset; the model must choose exactly one label from this list. In the dataset schema, the input columns are the columns passed to the labeler as input, and the label column is the column that holds the ground-truth label, which is the value the library tries to generate.

In [2]:
dataset_config = {
    "labels_list": [
        "activate_my_card",
        "age_limit",
        "apple_pay_or_google_pay",
        "atm_support",
        "automatic_top_up",
        "balance_not_updated_after_bank_transfer",
        "balance_not_updated_after_cheque_or_cash_deposit",
        "beneficiary_not_allowed",
        "cancel_transfer",
        "card_about_to_expire",
        "card_acceptance",
        "card_arrival",
        "card_delivery_estimate",
        "card_linking",
        "card_not_working",
        "card_payment_fee_charged",
        "card_payment_not_recognised",
        "card_payment_wrong_exchange_rate",
        "card_swallowed",
        "cash_withdrawal_charge",
        "cash_withdrawal_not_recognised",
        "change_pin",
        "compromised_card",
        "contactless_not_working",
        "country_support",
        "declined_card_payment",
        "declined_cash_withdrawal",
        "declined_transfer",
        "direct_debit_payment_not_recognised",
        "disposable_card_limits",
        "edit_personal_details",
        "exchange_charge",
        "exchange_rate",
        "exchange_via_app",
        "extra_charge_on_statement",
        "failed_transfer",
        "fiat_currency_support",
        "get_disposable_virtual_card",
        "get_physical_card",
        "getting_spare_card",
        "getting_virtual_card",
        "lost_or_stolen_card",
        "lost_or_stolen_phone",
        "order_physical_card",
        "passcode_forgotten",
        "pending_card_payment",
        "pending_cash_withdrawal",
        "pending_top_up",
        "pending_transfer",
        "pin_blocked",
        "receiving_money",
        "Refund_not_showing_up",
        "request_refund",
        "reverted_card_payment?",
        "supported_cards_and_currencies",
        "terminate_account",
        "top_up_by_bank_transfer_charge",
        "top_up_by_card_charge",
        "top_up_by_cash_or_cheque",
        "top_up_failed",
        "top_up_limits",
        "top_up_reverted",
        "topping_up_by_card",
        "transaction_charged_twice",
        "transfer_fee_charged",
        "transfer_into_account",
        "transfer_not_received_by_recipient",
        "transfer_timing",
        "unable_to_verify_identity",
        "verify_my_identity",
        "verify_source_of_funds",
        "verify_top_up",
        "virtual_card_not_working",
        "visa_or_mastercard",
        "why_verify_identity",
        "wrong_amount_of_cash_received",
        "wrong_exchange_rate_for_cash_withdrawal"
    ],
    "dataset_schema": {
        "input_columns": [
            "example"
        ],
        "label_column": "label"
    },
    "seed_examples": [
        {
            "example": "Is it free to transfer money or is there a fee?",
            "label": "transfer_fee_charged"
        },
        {
            "example": "I need my PIN, where is it?",
            "label": "get_physical_card"
        },
        {
            "example": "Can my salary be received and transferred to my current currency in my country?",
            "label": "receiving_money"
        },
        {
            "example": "Why isn't my purchase exchange rate correct?",
            "label": "card_payment_wrong_exchange_rate"
        },
        {
            "example": "Why was I charged a fee on a cash withdrawal?",
            "label": "cash_withdrawal_charge"
        }
    ]
}
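With 77 labels, a typo in a seed label is easy to miss. The sketch below checks that every inline seed example has the columns named in the schema and uses a label from `labels_list`. `check_dataset_config` is a hypothetical helper written for this notebook, not an AutoLabel function, and the toy config is shortened for illustration.

```python
def check_dataset_config(config):
    """Return a list of human-readable problems found in a dataset config."""
    problems = []
    labels = set(config["labels_list"])
    schema = config["dataset_schema"]
    required = set(schema["input_columns"]) | {schema["label_column"]}
    seeds = config.get("seed_examples", [])
    if isinstance(seeds, str):
        # A string is treated as a path to a seed file; nothing to check here.
        return problems
    for i, seed in enumerate(seeds):
        missing = required - set(seed)
        if missing:
            problems.append(f"seed {i}: missing columns {sorted(missing)}")
        label = seed.get(schema["label_column"])
        if label is not None and label not in labels:
            problems.append(f"seed {i}: unknown label {label!r}")
    return problems

toy_config = {
    "labels_list": ["cash_withdrawal_charge", "transfer_fee_charged"],
    "dataset_schema": {"input_columns": ["example"], "label_column": "label"},
    "seed_examples": [
        {"example": "Why was I charged a fee on a cash withdrawal?",
         "label": "cash_withdrawal_charge"},
        {"example": "Is there a fee?", "label": "transfer_fee"},  # typo on purpose
    ],
}
print(check_dataset_config(toy_config))
```

Running the same check on the full `dataset_config` above should return an empty list if every seed label is spelled exactly as it appears in the label list.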

#### Create a task config

This config defines the format of the final prompt. The prompt sent to the model is made up of the following parts:

{prefix_prompt} (Describes the role or capabilities of the model)\
{task_prompt} (Describes the task, e.g. QA or classification: what the model should do and what the input looks like)\
{output_prompt} (Specifies exactly what the model should output)\
{seed_examples_prompt} (If seed examples are provided, they are inserted here to show the model some sample outputs)\
{current_example} (The example to label, with the input and output columns combined in the same way as for the seed examples)

{example_prompt_template} (Contains the input and output column names between {} and defines how a single example is shown to the model)
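To make the ordering concrete, here is a rough sketch of how such parts could be joined into one prompt. It is an illustration only and not AutoLabel's implementation: `build_prompt` and its arguments are hypothetical names, and the output instruction and example template strings are made up for the example.

```python
def build_prompt(prefix_prompt, task_prompt, output_prompt,
                 seed_examples, current_example, example_prompt_template):
    """Join the prompt parts in the order described above."""
    # Seed examples show input and label; the current example leaves the label blank.
    shots = [example_prompt_template.format(**seed) for seed in seed_examples]
    current = example_prompt_template.format(**current_example, label="")
    parts = [prefix_prompt, task_prompt, output_prompt, *shots, current]
    return "\n\n".join(part.strip() for part in parts if part.strip())

labels = ["cash_withdrawal_charge", "transfer_fee_charged"]
task_prompt = (
    "Your job is to correctly label the provided input example into one of "
    "the following {num_labels} categories.\nCategories:\n{labels_list}"
).format(num_labels=len(labels), labels_list="\n".join(labels))

prompt = build_prompt(
    prefix_prompt="You are an expert at understanding twitter complaints.",
    task_prompt=task_prompt,
    output_prompt="Answer with exactly one category name.",  # made-up instruction
    seed_examples=[{"example": "Is there a fee for transfers?",
                    "label": "transfer_fee_charged"}],
    current_example={"example": "Why was I charged at the ATM?"},
    example_prompt_template="Example: {example}\nOutput: {label}",  # made-up template
)
print(prompt)
```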

In [3]:
task_config = {
    "project_name": "BankingClassification",
    "task_type": "classification",
    "prefix_prompt": "You are an expert at understanding twitter complaints."
}

#### Use an LLM Config

In [4]:
llm_config = {
    "provider_name": "openai",
    "model_name": "gpt-3.5-turbo",
    "has_logprob": False
}

Instead of building the configs above as Python variables, you can also pass the path to a config file, e.g. 'examples/configs/dataset_configs/banking_classification.json'.
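If you prefer files, one way to produce them is to dump the dictionaries to JSON. The sketch below writes a config to a temporary directory and reads it back to confirm the round trip. The file name is arbitrary, and the exact file layout AutoLabel expects is an assumption here (the same keys as the dictionary).

```python
import json
import tempfile
from pathlib import Path

llm_config = {
    "provider_name": "openai",
    "model_name": "gpt-3.5-turbo",
    "has_logprob": False,
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "llm_config.json"  # arbitrary file name
    config_path.write_text(json.dumps(llm_config, indent=4))
    # Python's False is written as JSON false and read back as False.
    loaded = json.loads(config_path.read_text())

print(loaded == llm_config)
```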

## Run the model

First, do a dry run with the configuration above to get an estimate of the cost of the run.

We label only 100 examples here to keep the notebook quick. Adjust max_items as required, or omit it to run the labeler on the whole dataset.

In [7]:
from autolabel.labeler import LabelingAgent

In [8]:
o = LabelingAgent(task_config, llm_config)
o.plan('data/banking_test.csv', dataset_config, max_items = 100)

2023-05-10 20:40:55 botocore.credentials INFO: Found credentials in shared credentials file: ~/.aws/credentials
100%|██████████| 20/20 [00:00<00:00, 44.24it/s]

Total Estimated Cost: $0.135
Number of examples to label: 100
Average cost per example: $0.00135


A prompt example:

You are an expert at understanding twitter complaints.
Your job is to correctly label the provided input example into one of the following 77 categories.
Categories:
activate_my_card
age_limit
apple_pay_or_google_pay
atm_support
automatic_top_up
balance_not_updated_after_bank_transfer
balance_not_updated_after_cheque_or_cash_deposit
beneficiary_not_allowed
cancel_transfer
card_about_to_expire
card_acceptance
card_arrival
card_delivery_estimate
card_linking
card_not_working
card_payment_fee_charged
card_payment_not_recognised
card_payment_wrong_exchange_rate
card_swallowed
cash_withdrawal_charge
cash_withdrawal_not_recognised
change_pin
compromised_card
contactless_not_working
country_support
declined_card_payment
declined_cash_withdrawal
declined_transfer
direct_debit_payment_not_recognised
disposable_card_limits
edit_personal_details
exchange_charge
exchange_rate
excha
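The dry-run figures can be extrapolated by simple arithmetic. The sketch below scales the cost reported above to a larger run. `project_cost` is a hypothetical helper, the 3,000-row size is a made-up example, and the estimate assumes the remaining rows have prompts of similar length.

```python
def project_cost(estimated_cost, num_examples, target_examples):
    """Scale a dry-run cost estimate linearly to a different number of examples."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    per_example = estimated_cost / num_examples
    return per_example * target_examples

# Figures from the dry run above: $0.135 for 100 examples.
projected = project_cost(0.135, 100, 3000)
print(f"${projected:.2f}")
```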




Now run the model for real to generate labels for the banking dataset. The computed metrics are printed at the end of the run.

In [18]:
labels, df, metrics_list = o.run('data/banking_test.csv', dataset_config, max_items = 100)

 65%|██████▌   | 13/20 [00:46<00:24,  3.46s/it]2023-05-08 21:09:59 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID f61e4cf38f0f3b2deda15518704b6526 in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
 90%|█████████ | 18/20 [01:39<00:12,  6.07s/it]2023-05-08 21:10:53 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID be04aafd01d90738a846809dfa502a84 in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
100%|██████████| 20/20 [02:20<00:00,  7.05s/it]

Metric: support: [(100, 'index=0')]
Metric: threshold: [(-inf, 'index=0')]
Metric: accuracy: [(0.7, 'index=0')]
Metric: completion_rate: [(1.0, 'index=0')]
Total number of failures: 0





## Analyze the dataframe

In [20]:
df

Unnamed: 0,example,label,new_task_llm_labeled_successfully,new_task_llm_label
0,I want to close my account,terminate_account,yes,terminate_account
1,It seems I was overcharged when I used an ATM ...,wrong_exchange_rate_for_cash_withdrawal,yes,cash_withdrawal_charge
2,I have a direct debit transaction I have not s...,direct_debit_payment_not_recognised,yes,direct_debit_payment_not_recognised
3,How much does it cost in fees to use your card?,order_physical_card,yes,card_payment_fee_charged
4,There is an extra $1 charge,extra_charge_on_statement,yes,extra_charge_on_statement
...,...,...,...,...
95,Is my salary eligible for this?,receiving_money,yes,receiving_money
96,My top-up was cancelled; will I receive a refund?,top_up_reverted,yes,pending_top_up
97,What do I need to do to change the address on ...,edit_personal_details,yes,edit_personal_details
98,Why am I being charged when I withdraw cash?,cash_withdrawal_charge,yes,cash_withdrawal_charge


In [21]:
df[df['label'] != df['new_task_llm_label']]

Unnamed: 0,example,label,new_task_llm_labeled_successfully,new_task_llm_label
1,It seems I was overcharged when I used an ATM ...,wrong_exchange_rate_for_cash_withdrawal,yes,cash_withdrawal_charge
3,How much does it cost in fees to use your card?,order_physical_card,yes,card_payment_fee_charged
7,"How can I fix my card, it got declined twice.",declined_transfer,yes,declined_card_payment
12,"I do not remember purchasing anything for 1£, ...",extra_charge_on_statement,yes,card_payment_not_recognised
15,"This is URGENT, I typed the wrong payment info...",cancel_transfer,yes,reverted_card_payment?
18,what does pending mean?,pending_card_payment,yes,pending_transfer
19,Is it possible for me to get money out in a di...,receiving_money,yes,cash_withdrawal_charge
24,i want to track the card you sent,card_arrival,yes,card_delivery_estimate
27,My card hasn't arrived yet.,card_arrival,yes,card_delivery_estimate
30,my phone was taken! can you place cancel my ac...,lost_or_stolen_phone,yes,terminate_account


## Change the prompt
Analyze the dataframe generated above and adjust the prompt to try to get better results.

In [22]:
task_config = {
    "project_name": "BankingClassification",
    "task_type": "classification",
    "prefix_prompt": "You are an expert at understanding twitter complaints.",
    "task_prompt": "Your job is to correctly label the provided input example into one of the following {num_labels} categories.\nCategories:\n{labels_list}\n Pay attention to ATM transfer and debit card payments more seriously and different from credit card payments."
}

In [23]:
o = LabelingAgent(task_config, llm_config)
o.plan('data/banking_test.csv', dataset_config, max_items = 100)

100%|██████████| 20/20 [00:00<00:00, 295.38it/s]

Total Estimated Cost: $0.143
Number of examples to label: 100
Average cost per example: $0.00143


A prompt example:

You are an expert at understanding twitter complaints.
Your job is to correctly label the provided input example into one of the following 77 categories.
Categories:
activate_my_card
age_limit
apple_pay_or_google_pay
atm_support
automatic_top_up
balance_not_updated_after_bank_transfer
balance_not_updated_after_cheque_or_cash_deposit
beneficiary_not_allowed
cancel_transfer
card_about_to_expire
card_acceptance
card_arrival
card_delivery_estimate
card_linking
card_not_working
card_payment_fee_charged
card_payment_not_recognised
card_payment_wrong_exchange_rate
card_swallowed
cash_withdrawal_charge
cash_withdrawal_not_recognised
change_pin
compromised_card
contactless_not_working
country_support
declined_card_payment
declined_cash_withdrawal
declined_transfer
direct_debit_payment_not_recognised
disposable_card_limits
edit_personal_details
exchange_charge
exchange_rate
excha




In [24]:
labels, df, metrics_list = o.run('data/banking_test.csv', dataset_config, max_items = 100)

  5%|▌         | 1/20 [00:03<01:14,  3.91s/it]2023-05-08 21:43:13 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 40f0434b57e9c81a4fb00c997dee99b9 in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
100%|██████████| 20/20 [01:45<00:00,  5.28s/it]

Metric: support: [(100, 'index=0')]
Metric: threshold: [(-inf, 'index=0')]
Metric: accuracy: [(0.7, 'index=0')]
Metric: completion_rate: [(1.0, 'index=0')]
Total number of failures: 0





In [25]:
df[df['label'] != df['new_task_llm_label']]

Unnamed: 0,example,label,new_task_llm_labeled_successfully,new_task_llm_label
1,It seems I was overcharged when I used an ATM ...,wrong_exchange_rate_for_cash_withdrawal,yes,cash_withdrawal_charge
3,How much does it cost in fees to use your card?,order_physical_card,yes,card_payment_fee_charged
7,"How can I fix my card, it got declined twice.",declined_transfer,yes,declined_card_payment
12,"I do not remember purchasing anything for 1£, ...",extra_charge_on_statement,yes,card_payment_not_recognised
13,Will I be charged if I use European bank card ...,top_up_by_card_charge,yes,top_up_by_bank_transfer_charge
15,"This is URGENT, I typed the wrong payment info...",cancel_transfer,yes,reverted_card_payment?
18,what does pending mean?,pending_card_payment,yes,pending_transfer
19,Is it possible for me to get money out in a di...,receiving_money,yes,cash_withdrawal_not_recognised
24,i want to track the card you sent,card_arrival,yes,card_delivery_estimate
27,My card hasn't arrived yet.,card_arrival,yes,card_delivery_estimate


## Example selector
Let's try an example selector. We use the larger seed set so that the selector has more examples to choose from.

In [21]:
dataset_config = {
    "labels_list": [
        "activate_my_card",
        "age_limit",
        "apple_pay_or_google_pay",
        "atm_support",
        "automatic_top_up",
        "balance_not_updated_after_bank_transfer",
        "balance_not_updated_after_cheque_or_cash_deposit",
        "beneficiary_not_allowed",
        "cancel_transfer",
        "card_about_to_expire",
        "card_acceptance",
        "card_arrival",
        "card_delivery_estimate",
        "card_linking",
        "card_not_working",
        "card_payment_fee_charged",
        "card_payment_not_recognised",
        "card_payment_wrong_exchange_rate",
        "card_swallowed",
        "cash_withdrawal_charge",
        "cash_withdrawal_not_recognised",
        "change_pin",
        "compromised_card",
        "contactless_not_working",
        "country_support",
        "declined_card_payment",
        "declined_cash_withdrawal",
        "declined_transfer",
        "direct_debit_payment_not_recognised",
        "disposable_card_limits",
        "edit_personal_details",
        "exchange_charge",
        "exchange_rate",
        "exchange_via_app",
        "extra_charge_on_statement",
        "failed_transfer",
        "fiat_currency_support",
        "get_disposable_virtual_card",
        "get_physical_card",
        "getting_spare_card",
        "getting_virtual_card",
        "lost_or_stolen_card",
        "lost_or_stolen_phone",
        "order_physical_card",
        "passcode_forgotten",
        "pending_card_payment",
        "pending_cash_withdrawal",
        "pending_top_up",
        "pending_transfer",
        "pin_blocked",
        "receiving_money",
        "Refund_not_showing_up",
        "request_refund",
        "reverted_card_payment?",
        "supported_cards_and_currencies",
        "terminate_account",
        "top_up_by_bank_transfer_charge",
        "top_up_by_card_charge",
        "top_up_by_cash_or_cheque",
        "top_up_failed",
        "top_up_limits",
        "top_up_reverted",
        "topping_up_by_card",
        "transaction_charged_twice",
        "transfer_fee_charged",
        "transfer_into_account",
        "transfer_not_received_by_recipient",
        "transfer_timing",
        "unable_to_verify_identity",
        "verify_my_identity",
        "verify_source_of_funds",
        "verify_top_up",
        "virtual_card_not_working",
        "visa_or_mastercard",
        "why_verify_identity",
        "wrong_amount_of_cash_received",
        "wrong_exchange_rate_for_cash_withdrawal"
    ],
    "dataset_schema": {
        "input_columns": [
            "example"
        ],
        "label_column": "label"
    },
    "seed_examples": 'data/banking_seed.csv'
}

In [10]:
task_config = {
    "project_name": "BankingClassification",
    "task_type": "classification",
    "prefix_prompt": "You are an expert at understanding twitter complaints.",
    "example_selector": {
        "strategy": "semantic_similarity",
        "num_examples": 4
    }
}
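Conceptually, a semantic-similarity selector embeds every seed example, embeds the incoming example, and keeps the closest seeds as few-shot examples. The sketch below illustrates that idea with a toy bag-of-words vector and cosine similarity. It is not how AutoLabel implements the strategy (a real selector would use a learned embedding model); `embed`, `cosine`, and `select_examples` are hypothetical names.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_examples(query, seeds, num_examples=4):
    """Return the num_examples seeds most similar to the query text."""
    q = embed(query)
    ranked = sorted(seeds, key=lambda s: cosine(q, embed(s["example"])), reverse=True)
    return ranked[:num_examples]

seeds = [
    {"example": "Why was I charged a fee on a cash withdrawal?",
     "label": "cash_withdrawal_charge"},
    {"example": "I need my PIN, where is it?", "label": "get_physical_card"},
    {"example": "Is it free to transfer money or is there a fee?",
     "label": "transfer_fee_charged"},
]
top = select_examples("Why am I being charged when I withdraw cash?", seeds, num_examples=1)
print(top[0]["label"])
```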

In [19]:
o = LabelingAgent(task_config, llm_config)
o.plan('data/banking_test.csv', dataset_config, max_items = 100)

100%|██████████| 20/20 [00:25<00:00,  1.26s/it]

Total Estimated Cost: $0.135
Number of examples to label: 100
Average cost per example: $0.00135


A prompt example:

You are an expert at understanding twitter complaints.
Your job is to correctly label the provided input example into one of the following 77 categories.
Categories:
activate_my_card
age_limit
apple_pay_or_google_pay
atm_support
automatic_top_up
balance_not_updated_after_bank_transfer
balance_not_updated_after_cheque_or_cash_deposit
beneficiary_not_allowed
cancel_transfer
card_about_to_expire
card_acceptance
card_arrival
card_delivery_estimate
card_linking
card_not_working
card_payment_fee_charged
card_payment_not_recognised
card_payment_wrong_exchange_rate
card_swallowed
cash_withdrawal_charge
cash_withdrawal_not_recognised
change_pin
compromised_card
contactless_not_working
country_support
declined_card_payment
declined_cash_withdrawal
declined_transfer
direct_debit_payment_not_recognised
disposable_card_limits
edit_personal_details
exchange_charge
exchange_rate
excha




In [12]:
labels, df, metrics_list = o.run('data/banking_test.csv', dataset_config, max_items = 100)

 55%|█████▌    | 11/20 [00:56<00:46,  5.21s/it]2023-05-08 21:56:37 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID f1c4efdc2a5594e8337ab31cb4793037 in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
2023-05-08 21:57:13 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 12284b58d016616f380c6761328cf12a in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
 60%|██████    | 12/20 [02:11<03:29, 26.18s/it]2023-05-08 21:57:49 openai INFO: error_code=None error_message='That model is currently over

Metric: support: [(100, 'index=0')]
Metric: threshold: [(-inf, 'index=0')]
Metric: accuracy: [(0.74, 'index=0')]
Metric: completion_rate: [(1.0, 'index=0')]
Total number of failures: 0





## Confidence estimation
Let's see whether the model does well at confidence estimation. If the model knows what it doesn't know, we can trade completion rate for accuracy: by keeping only the confident predictions we label fewer examples, but get a higher accuracy on the ones we do label.

In [22]:
task_config = {
    "project_name": "BankingClassification",
    "task_type": "classification",
    "prefix_prompt": "You are an expert at understanding twitter complaints.",
    "example_selector": {
        "strategy": "semantic_similarity",
        "num_examples": 4
    },
    "compute_confidence": "True"
}
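The trade-off described above can be made concrete: keep only predictions whose confidence is at or above a threshold, then measure how many rows remain (completion rate) and how many of those are right (accuracy). This is a minimal sketch with made-up scores; `accuracy_at_threshold` is a hypothetical helper and not the metric code AutoLabel runs.

```python
def accuracy_at_threshold(confidences, correct, threshold):
    """Return (completion_rate, accuracy) for predictions kept at a threshold."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    completion_rate = len(kept) / len(confidences)
    # Accuracy is undefined when nothing is kept, so report None in that case.
    accuracy = sum(kept) / len(kept) if kept else None
    return completion_rate, accuracy

# Made-up confidence scores and correctness flags for ten predictions.
confidences = [0.95, 0.91, 0.88, 0.80, 0.75, 0.60, 0.55, 0.40, 0.35, 0.20]
correct = [True, True, True, True, False, True, False, False, True, False]

for threshold in (0.0, 0.5, 0.8):
    rate, acc = accuracy_at_threshold(confidences, correct, threshold)
    print(f"threshold={threshold}: completion_rate={rate:.2f}, accuracy={acc:.2f}")
```

In this toy data, raising the threshold lowers the completion rate and raises the accuracy, which is the behaviour we hope to see if the confidence scores are informative.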

In [25]:
o = LabelingAgent(task_config, llm_config)
o.plan('data/banking_test.csv', dataset_config, max_items = 100)

100%|██████████| 20/20 [00:24<00:00,  1.23s/it]

Total Estimated Cost: $0.135
Number of examples to label: 100
Average cost per example: $0.00135


A prompt example:

You are an expert at understanding twitter complaints.
Your job is to correctly label the provided input example into one of the following 77 categories.
Categories:
activate_my_card
age_limit
apple_pay_or_google_pay
atm_support
automatic_top_up
balance_not_updated_after_bank_transfer
balance_not_updated_after_cheque_or_cash_deposit
beneficiary_not_allowed
cancel_transfer
card_about_to_expire
card_acceptance
card_arrival
card_delivery_estimate
card_linking
card_not_working
card_payment_fee_charged
card_payment_not_recognised
card_payment_wrong_exchange_rate
card_swallowed
cash_withdrawal_charge
cash_withdrawal_not_recognised
change_pin
compromised_card
contactless_not_working
country_support
declined_card_payment
declined_cash_withdrawal
declined_transfer
direct_debit_payment_not_recognised
disposable_card_limits
edit_personal_details
exchange_charge
exchange_rate
excha




In [26]:
labels, df, metrics_list = o.run('data/banking_test.csv', dataset_config, max_items = 100)

2023-05-08 22:21:12.180 | INFO     | refuel_oracle.oracle:annotate:122 - Task run already exists.
There is an existing task with following details: id='4220919502' created_at=datetime.datetime(2023, 5, 8, 22, 12, 21, 695444) task_id='fe18616c502136eecf138fd6f0fbd7a9' dataset_id='8d146e04eeed04671abdece0f40e0469' current_index=65 output_file='data/banking_test_labeled.csv' status=<TaskStatus.ACTIVE: 'active'> error=None metrics=None
Evaluating the existing task...


  avg = a.mean(axis, **keepdims_kw)
  ret = ret.dtype.type(ret / rcount)


Metric: auroc: 0.9310009718172984
Metric: support: [(70, 'index=0'), (0, 'index=1'), (1, 'index=2'), (19, 'index=3'), (20, 'index=4'), (39, 'index=5'), (40, 'index=6'), (41, 'index=7'), (43, 'index=8'), (50, 'index=9'), (52, 'index=10'), (53, 'index=11'), (55, 'index=12'), (57, 'index=13'), (70, 'index=14')]
Metric: threshold: [(-inf, 'index=0'), (3.7169041911318668, 'index=1'), (2.7169041911318668, 'index=2'), (2.7095740903214547, 'index=3'), (2.709052961047316, 'index=4'), (2.6906020486746898, 'index=5'), (2.6903324708649907, 'index=6'), (2.687156017308222, 'index=7'), (2.6787913093093048, 'index=8'), (2.630019572879165, 'index=9'), (2.5956249750404234, 'index=10'), (2.5799757948748283, 'index=11'), (2.568780087012188, 'index=12'), (2.5532547033659325, 'index=13'), (2.2109919215757605, 'index=14')]
Metric: accuracy: [(0.7, 'index=0'), (nan, 'index=1'), (1.0, 'index=2'), (1.0, 'index=3'), (0.95, 'index=4'), (0.9743589743589743, 'index=5'), (0.95, 'index=6'), (0.9512195121951219, 'inde

100%|██████████| 7/7 [00:48<00:00,  6.95s/it]


Metric: auroc: 0.9205333333333333
Metric: support: [(100, 'index=0'), (0, 'index=1'), (1, 'index=2'), (29, 'index=3'), (30, 'index=4'), (35, 'index=5'), (36, 'index=6'), (61, 'index=7'), (62, 'index=8'), (64, 'index=9'), (66, 'index=10'), (76, 'index=11'), (79, 'index=12'), (80, 'index=13'), (82, 'index=14'), (85, 'index=15'), (100, 'index=16')]
Metric: threshold: [(-inf, 'index=0'), (3.716918815608781, 'index=1'), (2.716918815608781, 'index=2'), (2.7095342984271964, 'index=3'), (2.709052961047316, 'index=4'), (2.7074403430995178, 'index=5'), (2.70737698174731, 'index=6'), (2.6906020486746898, 'index=7'), (2.6903324708649907, 'index=8'), (2.6864060122848614, 'index=9'), (2.6787913093093048, 'index=10'), (2.6220833554670904, 'index=11'), (2.5956249750404234, 'index=12'), (2.5799757948748283, 'index=13'), (2.568780087012188, 'index=14'), (2.5532547033659325, 'index=15'), (2.2109919215757605, 'index=16')]
Metric: accuracy: [(0.75, 'index=0'), (nan, 'index=1'), (1.0, 'index=2'), (1.0, 'ind

  avg = a.mean(axis, **keepdims_kw)
  ret = ret.dtype.type(ret / rcount)


## Chain of thought reasoning

Chain of thought requires an explanation for every seed example. Since we have neither the explanations nor the domain knowledge to write them, we can use the model to generate the explanations and then pass them to the model as part of the prompt.

In [32]:
dataset_config = {
    "labels_list": [
        "activate_my_card",
        "age_limit",
        "apple_pay_or_google_pay",
        "atm_support",
        "automatic_top_up",
        "balance_not_updated_after_bank_transfer",
        "balance_not_updated_after_cheque_or_cash_deposit",
        "beneficiary_not_allowed",
        "cancel_transfer",
        "card_about_to_expire",
        "card_acceptance",
        "card_arrival",
        "card_delivery_estimate",
        "card_linking",
        "card_not_working",
        "card_payment_fee_charged",
        "card_payment_not_recognised",
        "card_payment_wrong_exchange_rate",
        "card_swallowed",
        "cash_withdrawal_charge",
        "cash_withdrawal_not_recognised",
        "change_pin",
        "compromised_card",
        "contactless_not_working",
        "country_support",
        "declined_card_payment",
        "declined_cash_withdrawal",
        "declined_transfer",
        "direct_debit_payment_not_recognised",
        "disposable_card_limits",
        "edit_personal_details",
        "exchange_charge",
        "exchange_rate",
        "exchange_via_app",
        "extra_charge_on_statement",
        "failed_transfer",
        "fiat_currency_support",
        "get_disposable_virtual_card",
        "get_physical_card",
        "getting_spare_card",
        "getting_virtual_card",
        "lost_or_stolen_card",
        "lost_or_stolen_phone",
        "order_physical_card",
        "passcode_forgotten",
        "pending_card_payment",
        "pending_cash_withdrawal",
        "pending_top_up",
        "pending_transfer",
        "pin_blocked",
        "receiving_money",
        "Refund_not_showing_up",
        "request_refund",
        "reverted_card_payment?",
        "supported_cards_and_currencies",
        "terminate_account",
        "top_up_by_bank_transfer_charge",
        "top_up_by_card_charge",
        "top_up_by_cash_or_cheque",
        "top_up_failed",
        "top_up_limits",
        "top_up_reverted",
        "topping_up_by_card",
        "transaction_charged_twice",
        "transfer_fee_charged",
        "transfer_into_account",
        "transfer_not_received_by_recipient",
        "transfer_timing",
        "unable_to_verify_identity",
        "verify_my_identity",
        "verify_source_of_funds",
        "verify_top_up",
        "virtual_card_not_working",
        "visa_or_mastercard",
        "why_verify_identity",
        "wrong_amount_of_cash_received",
        "wrong_exchange_rate_for_cash_withdrawal"
    ],
    "dataset_schema": {
        "input_columns": [
            "example"
        ],
        "label_column": "label"
    },
    "seed_examples": [
        {
            "example": "Is it free to transfer money or is there a fee?",
            "label": "transfer_fee_charged"
        },
        {
            "example": "I need my PIN, where is it?",
            "label": "get_physical_card"
        },
        {
            "example": "Can my salary be received and transferred to my current currency in my country?",
            "label": "receiving_money"
        },
        {
            "example": "Why isn't my purchase exchange rate correct?",
            "label": "card_payment_wrong_exchange_rate"
        },
        {
            "example": "Why was I charged a fee on a cash withdrawal?",
            "label": "cash_withdrawal_charge"
        }
    ]
}

In [33]:
task_config = {
    "project_name": "BankingClassification",
    "task_type": "classification",
    "prefix_prompt": "You are an expert at understanding twitter complaints.",
    "chain_of_thought": "True"
}

In [34]:
o = LabelingAgent(task_config, llm_config)
o.plan('data/banking_test.csv', dataset_config, max_items = 100)

  0%|          | 0/5 [00:00<?, ?it/s]

2023-05-08 22:31:40.287 | INFO     | refuel_oracle.oracle:generate_explanations:423 - Chain of thought requires explanations for seed examples. Generating explanations for seed examples.


100%|██████████| 5/5 [00:13<00:00,  2.80s/it]
100%|██████████| 20/20 [00:00<00:00, 244.74it/s]

Total Estimated Cost: $0.179
Number of examples to label: 100
Average cost per example: $0.00179


A prompt example:

You are an expert at understanding twitter complaints.
Your job is to correctly label the provided input example into one of the following 77 categories.
Categories:
activate_my_card
age_limit
apple_pay_or_google_pay
atm_support
automatic_top_up
balance_not_updated_after_bank_transfer
balance_not_updated_after_cheque_or_cash_deposit
beneficiary_not_allowed
cancel_transfer
card_about_to_expire
card_acceptance
card_arrival
card_delivery_estimate
card_linking
card_not_working
card_payment_fee_charged
card_payment_not_recognised
card_payment_wrong_exchange_rate
card_swallowed
cash_withdrawal_charge
cash_withdrawal_not_recognised
change_pin
compromised_card
contactless_not_working
country_support
declined_card_payment
declined_cash_withdrawal
declined_transfer
direct_debit_payment_not_recognised
disposable_card_limits
edit_personal_details
exchange_charge
exchange_rate
excha




In [35]:
labels, df, metrics_list = o.run('data/banking_test.csv', dataset_config, max_items = 100)

100%|██████████| 5/5 [00:00<00:00, 108660.73it/s]
 25%|██▌       | 5/20 [01:24<04:07, 16.53s/it]

2023-05-08 22:33:56.520 | INFO     | refuel_oracle.tasks.base:parse_json_llm_response:131 - Error parsing LLM response: Let's think step by step.
This is not a complaint or issue related to any of the provided categories. It is a general question about the difference between two types of delivery options. Therefore, it does not fit into any of the categories and cannot be labeled.. AttributeError("'NoneType' object has no attribute 'strip'")


 80%|████████  | 16/20 [04:47<01:16, 19.11s/it]2023-05-08 22:37:33 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 289fb6d43caca91739ab3cbe2192a97f in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
100%|██████████| 20/20 [06:34<00:00, 19.71s/it]

Metric: support: [(99, 'index=0')]
Metric: threshold: [(-inf, 'index=0')]
Metric: accuracy: [(0.696969696969697, 'index=0')]
Metric: completion_rate: [(0.99, 'index=0')]
Total number of failures: 0
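
The numbers above are consistent with accuracy being computed only over the examples that produced a parseable label: 99 of the 100 requested items completed (completion rate 0.99), and 69 / 99 ≈ 0.697. A minimal sketch of that relationship, assuming an unparseable response is represented as `None`; the helper name and the toy data are hypothetical and not part of the library:

```python
from typing import List, Optional, Tuple


def completion_and_accuracy(
    predictions: List[Optional[str]], gold: List[str]
) -> Tuple[float, float, int]:
    """Return (completion_rate, accuracy, support).

    Hypothetical helper: a prediction of None stands for a response that
    could not be parsed into a label. Accuracy is computed only over the
    completed (non-None) predictions.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    completed = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    support = len(completed)
    completion_rate = support / len(gold) if gold else 0.0
    correct = sum(1 for p, g in completed if p == g)
    accuracy = correct / support if support else 0.0
    return completion_rate, accuracy, support


# Toy data mirroring the run above: 100 items, 1 unparseable, 69 correct.
gold = ["card_arrival"] * 100
predictions = ["card_arrival"] * 69 + ["card_linking"] * 30 + [None]
completion_rate, accuracy, support = completion_and_accuracy(predictions, gold)
print(completion_rate, accuracy, support)
```

Under this reading, a higher completion rate does not by itself raise accuracy; it only widens the set of examples that accuracy is measured on.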





## Notes

A few things I did not get to try, but would have liked to, in order to boost the completion rate while keeping accuracy high:

1. Use chain-of-thought reasoning together with a semantic similarity example selector
2. Increase the number of seed examples to roughly one example per class
3. Find the annotator guidelines and pass them to the model as part of the prompt
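
Item 2 could be approached by filtering the seed set so that every label is represented once. A minimal sketch, assuming the seed rows are dicts with `example` and `label` keys (the column names used in the dataset config); the helper name and the toy rows are hypothetical:

```python
from typing import Dict, Iterable, List


def one_seed_per_label(
    rows: Iterable[Dict[str, str]], labels: List[str]
) -> List[Dict[str, str]]:
    """Keep the first seed row seen for each label in `labels`."""
    wanted = set(labels)
    chosen: Dict[str, Dict[str, str]] = {}
    for row in rows:
        label = row["label"]
        if label in wanted and label not in chosen:
            chosen[label] = row
    # Follow the order of `labels`; labels without any seed row are skipped.
    return [chosen[label] for label in labels if label in chosen]


# Toy seed rows (made up for illustration).
seed_rows = [
    {"example": "How do I activate my card?", "label": "activate_my_card"},
    {"example": "My card has not arrived.", "label": "card_arrival"},
    {"example": "Please activate my new card.", "label": "activate_my_card"},
]
balanced = one_seed_per_label(
    seed_rows, ["activate_my_card", "age_limit", "card_arrival"]
)
print([row["label"] for row in balanced])
```

Labels that end up without a seed row (here `age_limit`) are easy to spot by comparing the result against the label list, which shows where more seed data would be needed.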

Trying out all the bells and whistles together:

In [36]:
dataset_config = {
    "labels_list": [
        "activate_my_card",
        "age_limit",
        "apple_pay_or_google_pay",
        "atm_support",
        "automatic_top_up",
        "balance_not_updated_after_bank_transfer",
        "balance_not_updated_after_cheque_or_cash_deposit",
        "beneficiary_not_allowed",
        "cancel_transfer",
        "card_about_to_expire",
        "card_acceptance",
        "card_arrival",
        "card_delivery_estimate",
        "card_linking",
        "card_not_working",
        "card_payment_fee_charged",
        "card_payment_not_recognised",
        "card_payment_wrong_exchange_rate",
        "card_swallowed",
        "cash_withdrawal_charge",
        "cash_withdrawal_not_recognised",
        "change_pin",
        "compromised_card",
        "contactless_not_working",
        "country_support",
        "declined_card_payment",
        "declined_cash_withdrawal",
        "declined_transfer",
        "direct_debit_payment_not_recognised",
        "disposable_card_limits",
        "edit_personal_details",
        "exchange_charge",
        "exchange_rate",
        "exchange_via_app",
        "extra_charge_on_statement",
        "failed_transfer",
        "fiat_currency_support",
        "get_disposable_virtual_card",
        "get_physical_card",
        "getting_spare_card",
        "getting_virtual_card",
        "lost_or_stolen_card",
        "lost_or_stolen_phone",
        "order_physical_card",
        "passcode_forgotten",
        "pending_card_payment",
        "pending_cash_withdrawal",
        "pending_top_up",
        "pending_transfer",
        "pin_blocked",
        "receiving_money",
        "Refund_not_showing_up",
        "request_refund",
        "reverted_card_payment?",
        "supported_cards_and_currencies",
        "terminate_account",
        "top_up_by_bank_transfer_charge",
        "top_up_by_card_charge",
        "top_up_by_cash_or_cheque",
        "top_up_failed",
        "top_up_limits",
        "top_up_reverted",
        "topping_up_by_card",
        "transaction_charged_twice",
        "transfer_fee_charged",
        "transfer_into_account",
        "transfer_not_received_by_recipient",
        "transfer_timing",
        "unable_to_verify_identity",
        "verify_my_identity",
        "verify_source_of_funds",
        "verify_top_up",
        "virtual_card_not_working",
        "visa_or_mastercard",
        "why_verify_identity",
        "wrong_amount_of_cash_received",
        "wrong_exchange_rate_for_cash_withdrawal"
    ],
    "dataset_schema": {
        "input_columns": [
            "example"
        ],
        "label_column": "label"
    },
    "seed_examples": 'data/banking_seed.csv'
}

In [41]:
task_config = {
    "project_name": "BankingClassification",
    "task_type": "classification",
    "prefix_prompt": "You are an expert at understanding twitter complaints.",
    "example_selector": {
        "strategy": "semantic_similarity",
        "num_examples": 4
    },
    "compute_confidence": "True",
    "chain_of_thought": "True"
}
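
With `strategy` set to `semantic_similarity`, the few-shot examples in each prompt are presumably the seed examples closest to the input in embedding space. The sketch below illustrates the general idea only: it swaps in a toy bag-of-words vector with cosine similarity for a real embedding model, and none of the function names come from the library.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(
    query: str, seeds: List[Tuple[str, str]], num_examples: int = 4
) -> List[Tuple[str, str]]:
    """Return the `num_examples` seed (text, label) pairs most similar to `query`."""
    query_vec = embed(query)
    ranked = sorted(
        seeds, key=lambda seed: cosine(query_vec, embed(seed[0])), reverse=True
    )
    return ranked[:num_examples]


# Toy seed set (made up for illustration).
seeds = [
    ("How do I activate my new card", "activate_my_card"),
    ("My card has not arrived yet", "card_arrival"),
    ("Why was I charged a fee for my transfer", "transfer_fee_charged"),
    ("I forgot my passcode", "passcode_forgotten"),
    ("The ATM swallowed my card", "card_swallowed"),
]
selected = select_examples("my new card still has not arrived", seeds, num_examples=2)
print([label for _, label in selected])
```

The point of selecting by similarity rather than at random is that, with 77 classes and only 4 examples per prompt, the examples shown are more likely to come from the classes the input could plausibly belong to.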

With chain of thought enabled, increase the maximum number of tokens the LLM is allowed to generate, because the explanation it produces can sometimes exceed the default max token limit.

In [44]:
llm_config = {
    "provider_name": "openai",
    "model_name": "gpt-3.5-turbo",
    "has_logprob": False,
    "model_params": {
        "max_tokens": 1000, # This increases the maximum tokens
    }
}

In [45]:
o = LabelingAgent(task_config, llm_config)
o.plan('data/banking_test.csv', dataset_config, max_items = 100)

100%|██████████| 199/199 [00:00<00:00, 1434134.87it/s]
100%|██████████| 20/20 [00:29<00:00,  1.49s/it]

Total Estimated Cost: $0.362
Number of examples to label: 100
Average cost per example: $0.00362


A prompt example:

You are an expert at understanding twitter complaints.
Your job is to correctly label the provided input example into one of the following 77 categories.
Categories:
activate_my_card
age_limit
apple_pay_or_google_pay
atm_support
automatic_top_up
balance_not_updated_after_bank_transfer
balance_not_updated_after_cheque_or_cash_deposit
beneficiary_not_allowed
cancel_transfer
card_about_to_expire
card_acceptance
card_arrival
card_delivery_estimate
card_linking
card_not_working
card_payment_fee_charged
card_payment_not_recognised
card_payment_wrong_exchange_rate
card_swallowed
cash_withdrawal_charge
cash_withdrawal_not_recognised
change_pin
compromised_card
contactless_not_working
country_support
declined_card_payment
declined_cash_withdrawal
declined_transfer
direct_debit_payment_not_recognised
disposable_card_limits
edit_personal_details
exchange_charge
exchange_rate
excha




In [46]:
labels, df, metrics_list = o.run('data/banking_test.csv', dataset_config, max_items = 100)

100%|██████████| 199/199 [00:00<00:00, 1608220.61it/s]
 20%|██        | 4/20 [01:34<06:23, 23.99s/it]
2023-05-08 23:10:44 openai INFO: error_code=None error_message='That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 23c095ac766b84d7caf88df9a89b769a in your message.)' error_param=None error_type=server_error message='OpenAI API error received' stream_error=False
100%|██████████| 20/20 [08:09<00:00, 24.49s/it]


Metric: auroc: 0.8120978120978121
Metric: support: [(100, 'index=0'), (0, 'index=1'), (1, 'index=2'), (16, 'index=3'), (17, 'index=4'), (18, 'index=5'), (20, 'index=6'), (24, 'index=7'), (25, 'index=8'), (26, 'index=9'), (28, 'index=10'), (37, 'index=11'), (38, 'index=12'), (43, 'index=13'), (44, 'index=14'), (51, 'index=15'), (52, 'index=16'), (54, 'index=17'), (56, 'index=18'), (60, 'index=19'), (61, 'index=20'), (65, 'index=21'), (66, 'index=22'), (68, 'index=23'), (70, 'index=24'), (71, 'index=25'), (72, 'index=26'), (74, 'index=27'), (75, 'index=28'), (77, 'index=29'), (78, 'index=30'), (80, 'index=31'), (85, 'index=32'), (86, 'index=33'), (100, 'index=34')]
Metric: threshold: [(-inf, 'index=0'), (3.71441431254104, 'index=1'), (2.71441431254104, 'index=2'), (2.5839988367995232, 'index=3'), (2.5772682289443654, 'index=4'), (2.571900894553442, 'index=5'), (2.546606833621545, 'index=6'), (2.529675078736434, 'index=7'), (2.529153546390658, 'index=8'), (2.5165535885857393, 'index=9'), 

  avg = a.mean(axis, **keepdims_kw)
  ret = ret.dtype.type(ret / rcount)
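
One reading of the `support` and `threshold` lists above is that each index pairs a confidence threshold with the number of examples whose confidence is at or above it: the threshold of `-inf` keeps all 100 examples, while the highest threshold keeps none. A minimal sketch of that kind of sweep, under that assumption; the helper name and the toy scores are hypothetical and this is not the library's metric code:

```python
from typing import List, Tuple


def accuracy_at_thresholds(
    confidences: List[float], correct: List[bool], thresholds: List[float]
) -> List[Tuple[float, int, float]]:
    """For each threshold, return (threshold, support, accuracy).

    Support counts the examples with confidence >= threshold; accuracy is
    measured over those examples only and is NaN when none are kept.
    """
    results = []
    for threshold in thresholds:
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
        support = len(kept)
        accuracy = sum(kept) / support if support else float("nan")
        results.append((threshold, support, accuracy))
    return results


# Toy confidence scores and correctness flags (made up for illustration).
confidences = [2.7, 2.5, 2.5, 1.9, 1.2]
correct = [True, True, False, True, False]
sweep = accuracy_at_thresholds(confidences, correct, [float("-inf"), 2.6, 2.0, 3.7])
for threshold, support, accuracy in sweep:
    print(threshold, support, accuracy)
```

Read this way, the sweep shows the trade-off behind the completion-rate goal: raising the threshold keeps fewer examples but, if confidence is informative, a larger share of them is correct.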



## Self consistency

Self consistency raises the temperature, and with it the randomness, while generating explanations, which lets the model explore multiple reasoning paths for the same input. A majority vote is then taken over the labels produced by those generations. The LLM config below samples 5 reasoning paths (`"n": 5`) at a temperature of 0.7 and takes the majority vote over them.

In [48]:
llm_config = {
    "provider_name": "openai",
    "model_name": "gpt-3.5-turbo",
    "has_logprob": False,
    "model_params": {
        "max_tokens": 1000, # This increases the maximum tokens
        "temperature": 0.7,
        "n": 5 # This runs self consistency
    }
}
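
The voting step of self consistency can be sketched as follows. This illustrates the majority vote only, assuming each of the `n` generations has already been parsed into a label (or `None` when parsing failed); the function name is hypothetical and this is not the library's implementation.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(labels: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None label, or None if there is none.

    Ties are broken in favor of the label that appeared first, which is
    how Counter.most_common orders elements with equal counts.
    """
    votes = Counter(label for label in labels if label is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Five toy generations for one input (made up for illustration).
generations = [
    "card_arrival",
    "card_delivery_estimate",
    "card_arrival",
    None,  # a generation whose response could not be parsed
    "card_arrival",
]
print(majority_vote(generations))
```

Dropping unparseable generations before voting means a single failed parse no longer leaves the example unlabeled, as long as at least one of the other generations yields a label.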