# 🤖 Chatbot Testing Framework (Jupyter Notebook)

This notebook provides an interactive interface for testing both OpenAI and BERT chatbots across various radio communication scenarios.

**Features:**
- API key setup for personal use
- Interactive scenario and model selection
- Test execution with per-run progress messages
- Results table and Plotly charts (score distribution and per-scenario averages)
- Export to JSON and CSV

---

## 1. Setup & Imports

In [1]:
import os
import json
import time
import pandas as pd
from datetime import datetime
from test_scenarios import ScenarioTester, ModelType
import ipywidgets as widgets
from IPython.display import display, clear_output
import plotly.graph_objects as go
from plotly.subplots import make_subplots
print('✅ All libraries imported!')

✅ All libraries imported!
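Before running the rest of the notebook, it can help to confirm that every package imported in the setup cell is actually installed. This is a minimal standard-library sketch; the helper name `missing_packages` is hypothetical and not part of `test_scenarios`.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Top-level modules imported by the setup cell (test_scenarios is the project's own module)
REQUIRED = ["pandas", "ipywidgets", "IPython", "plotly", "openai", "test_scenarios"]

gone = missing_packages(REQUIRED)
print("Missing:", gone if gone else "none")
```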


## 2. Set OpenAI API Key (Personal Use)

In [2]:
# Set your OpenAI API key (personal use only).
# Read it from the OPENAI_API_KEY environment variable so the key is never saved in the notebook.
import openai
openai.api_key = os.environ.get('OPENAI_API_KEY')
if not openai.api_key:
    raise RuntimeError('OPENAI_API_KEY is not set; export it before starting Jupyter.')
print('✅ API Key set!')

✅ API Key set!
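To check which key the session picked up without echoing the whole secret into the saved cell output, a small masking helper can be used. This is a hypothetical helper (`mask_secret`), not part of the OpenAI SDK.

```python
import os


def mask_secret(secret, visible=4):
    """Show only the first and last `visible` characters of a secret string."""
    if not secret:
        return "<not set>"
    if len(secret) <= 2 * visible:
        # Too short to reveal any characters safely
        return "*" * len(secret)
    return f"{secret[:visible]}...{secret[-visible:]}"


print("OPENAI_API_KEY:", mask_secret(os.environ.get("OPENAI_API_KEY")))
```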


## 3. Initialize Testing Framework

In [3]:
tester = None
try:
    tester = ScenarioTester()
    print(f'✅ Loaded {len(tester.scenarios)} scenarios!')
    for scenario in tester.scenarios:
        print(f'  - {scenario.name} ({scenario.level})')
except Exception as e:
    print(f'❌ Error: {e}')
    print('You can still use BERT-only testing.')

Using device: cpu
==> Load processed file into database: RC SLP - inputs to the AI BOT
==> Load processed file into database: RCA - AI-Chatbot for conversational training
==> Load processed file into database: sample
==> Load processed file into database: SBST RC Comms SLP
==> Load processed file into database: SBST_RC_sim_talk_grp
Using device: cpu
==> Load processed file into database: Acronym
==> Load processed file into database: Alphabet
==> Load processed file into database: Catch_Phrases-List_A1
==> Load processed file into database: Common_Catch_Phrases
==> Load processed file into database: conversations
==> Load processed file into database: Words


✅ Loaded 9 scenarios!
  - Scenario 2.1 (Beginner)
  - Scenario 2.2 (Beginner)
  - Scenario 2.3 (Beginner)
  - Scenario 6.1 (Intermediate)
  - Scenario 6.2 (Intermediate)
  - Scenario 6.3 (Intermediate)
  - Scenario 14.1 (Advanced)
  - Scenario 14.2 (Advanced)
  - Scenario 14.3 (Advanced)
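The scenarios come in three difficulty levels (Beginner, Intermediate, Advanced). A minimal sketch of grouping them by level, for example to pre-select a single difficulty in the next cell. It assumes only that each scenario exposes `.name` and `.level`, as the loop above uses; the `Scenario` dataclass is a hypothetical stand-in so the snippet runs without `test_scenarios`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Scenario:
    # Hypothetical stand-in for the objects in tester.scenarios
    name: str
    level: str


def group_by_level(scenarios):
    """Map each level to the scenario names at that level, preserving load order."""
    groups = defaultdict(list)
    for s in scenarios:
        groups[s.level].append(s.name)
    return dict(groups)


demo = [Scenario("Scenario 2.1", "Beginner"),
        Scenario("Scenario 6.1", "Intermediate"),
        Scenario("Scenario 2.2", "Beginner")]
print(group_by_level(demo))
# {'Beginner': ['Scenario 2.1', 'Scenario 2.2'], 'Intermediate': ['Scenario 6.1']}
```

In the notebook, `tuple(group_by_level(tester.scenarios)['Beginner'])` could serve as the initial `value` of the `SelectMultiple` widget.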


## 4. Scenario & Model Selection

In [4]:
if tester:
    scenario_names = [s.name for s in tester.scenarios]
    scenario_select = widgets.SelectMultiple(
        options=scenario_names,
        value=scenario_names[:2],
        description='Scenarios:',
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='50%')
    )
    openai_chk = widgets.Checkbox(value=True, description='OpenAI (GPT-4)')
    bert_chk = widgets.Checkbox(value=True, description='BERT')
    runs_slider = widgets.IntSlider(value=3, min=1, max=10, step=1, description='Runs:')
    display(scenario_select, openai_chk, bert_chk, runs_slider)
else:
    print('❌ No scenarios available.')

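The widget values define a test matrix: every selected scenario is run against every selected model `num_runs` times, so the total number of tests is the product of the three counts. The same matrix can be sketched with `itertools.product`. The helper `build_run_plan` is hypothetical, and plain strings stand in for the `ModelType` members.

```python
from itertools import product


def build_run_plan(scenario_names, models, num_runs):
    """Return (scenario, model, run) triples in the same order as nested scenario/model/run loops."""
    if not models:
        raise ValueError("Select at least one model.")
    # product varies the last iterable fastest, matching the innermost loop over runs
    return list(product(scenario_names, models, range(1, num_runs + 1)))


plan = build_run_plan(["Scenario 2.1", "Scenario 2.2"], ["OpenAI", "BERT"], 3)
print(len(plan))  # 2 scenarios x 2 models x 3 runs = 12
print(plan[:2])   # [('Scenario 2.1', 'OpenAI', 1), ('Scenario 2.1', 'OpenAI', 2)]
```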

## 5. Run Tests

In [7]:
results = []
if tester:
    selected_scenarios = [s for s in tester.scenarios if s.name in scenario_select.value]
    selected_models = []
    if openai_chk.value:
        selected_models.append(ModelType.OPENAI)
    if bert_chk.value:
        selected_models.append(ModelType.BERT)
    num_runs = runs_slider.value
    total = len(selected_scenarios) * len(selected_models) * num_runs
    print(f'Running {total} tests...')
    for scenario in selected_scenarios:
        for model in selected_models:
            for run in range(1, num_runs+1):
                print(f'{scenario.name} | {model.value} | Run {run}')
                if model == ModelType.OPENAI:
                    result = tester.run_openai_test(scenario, run)
                else:
                    result = tester.run_bert_test(scenario, run)
                if result:
                    results.append(result)
                time.sleep(0.2)
    print('✅ Testing complete!')
else:
    print('❌ Cannot run tests.')

Running 6 tests...
Scenario 2.1 | OpenAI | Run 1
Running OpenAI test for Scenario 2.1 - Run 1
==> Delete 1 items in history
==> Update experiment id
Error in OpenAI test: 

You tried to access openai.Embedding, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface. 

Alternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`

A detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742

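The OpenAI runs fail because code reached through `tester.run_openai_test` (presumably in `test_scenarios` or a module it imports) still uses `openai.Embedding`, which `openai>=1.0` removed. The error message offers two fixes: run `openai migrate`, or pin `openai==0.28`. A sketch for checking which SDK major version is installed, plus the v1-style call that replaces `openai.Embedding.create`. The helper names and the default model `text-embedding-3-small` are assumptions; use whatever model `test_scenarios` actually requests.

```python
from importlib.metadata import PackageNotFoundError, version


def is_legacy_openai(version_str):
    """True for 0.x SDK versions, which still provide openai.Embedding."""
    return int(version_str.split(".")[0]) < 1


def installed_openai_version():
    """Return the installed openai package version, or None if it is not installed."""
    try:
        return version("openai")
    except PackageNotFoundError:
        return None


def embed_text(text, model="text-embedding-3-small"):
    """openai>=1.0 equivalent of openai.Embedding.create(...)['data'][0]['embedding'].

    The model name is an assumption; OpenAI() reads OPENAI_API_KEY from the environment.
    """
    from openai import OpenAI  # imported lazily so this cell runs without the SDK

    client = OpenAI()
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding


v = installed_openai_version()
print("openai version:", v)
if v is not None and not is_legacy_openai(v):
    print("openai>=1.0 detected: code using openai.Embedding must be migrated, or pin openai==0.28.")
```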

## 6. Results Table & Visualization

In [8]:
def create_comparison_table(results):
    scenario_data = {}
    for result in results:
        if result.scenario_name not in scenario_data:
            scenario_data[result.scenario_name] = {'OpenAI': [], 'BERT': []}
        if result.model_type == ModelType.OPENAI:
            scenario_data[result.scenario_name]['OpenAI'].append(result.total_score)
        else:
            scenario_data[result.scenario_name]['BERT'].append(result.total_score)
    table_data = []
    for scenario_name, scores in scenario_data.items():
        openai_scores = scores['OpenAI']
        bert_scores = scores['BERT']
        openai_avg = sum(openai_scores) / len(openai_scores) if openai_scores else 0
        bert_avg = sum(bert_scores) / len(bert_scores) if bert_scores else 0
        table_data.append({
            'Scenario': scenario_name,
            'OpenAI Scores': ' | '.join([f'{s:.1f}' for s in openai_scores]) if openai_scores else 'N/A',
            'OpenAI Avg': f'{openai_avg:.1f}' if openai_scores else 'N/A',
            'BERT Scores': ' | '.join([f'{s:.1f}' for s in bert_scores]) if bert_scores else 'N/A',
            'BERT Avg': f'{bert_avg:.1f}' if bert_scores else 'N/A'
        })
    return pd.DataFrame(table_data)

if results:
    df = create_comparison_table(results)
    display(df)
    # Plotly chart
    plot_data = []
    for r in results:
        plot_data.append({'Scenario': r.scenario_name, 'Model': r.model_type.value, 'Score': r.total_score})
    pdf = pd.DataFrame(plot_data)
    if not pdf.empty:
        fig = make_subplots(rows=1, cols=2, subplot_titles=('Score Distribution', 'Average by Scenario'))
        for model in pdf['Model'].unique():
            fig.add_trace(go.Box(y=pdf[pdf['Model']==model]['Score'], name=model), row=1, col=1)
        avg = pdf.groupby(['Scenario','Model'])['Score'].mean().reset_index()
        for model in avg['Model'].unique():
            fig.add_trace(go.Bar(x=avg[avg['Model']==model]['Scenario'], y=avg[avg['Model']==model]['Score'], name=model), row=1, col=2)
        fig.update_layout(height=400, showlegend=True)
        fig.show()
else:
    print('No results to show.')

| | Scenario | OpenAI Scores | OpenAI Avg | BERT Scores | BERT Avg |
|---|---|---|---|---|---|
| 0 | Scenario 2.1 | 0.0 \| 0.0 \| 0.0 | 0.0 | 0.0 \| 0.0 \| 0.0 | 0.0 |
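With several runs per scenario, the number of runs and the spread of scores matter as much as the mean. In the output above, the OpenAI runs logged errors but still appear as 0.0 scores, so an average alone can hide failures. A minimal standard-library sketch that reports count, mean and sample standard deviation per (scenario, model) pair; `summarize_scores` is hypothetical, and the demo scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_scores(rows):
    """rows: iterable of (scenario, model, score). Returns {(scenario, model): (n, mean, stdev)}.

    stdev needs at least two values, so a single run reports a spread of 0.0.
    """
    grouped = defaultdict(list)
    for scenario, model, score in rows:
        grouped[(scenario, model)].append(score)
    summary = {}
    for key, scores in grouped.items():
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[key] = (len(scores), mean(scores), spread)
    return summary


# In the notebook: rows = [(r.scenario_name, r.model_type.value, r.total_score) for r in results]
demo_rows = [("Scenario 2.1", "BERT", 6.0), ("Scenario 2.1", "BERT", 8.0),
             ("Scenario 2.1", "OpenAI", 7.5)]
for (scenario, model), (n, avg, sd) in summarize_scores(demo_rows).items():
    print(f"{scenario} | {model} | n={n} | mean={avg:.1f} | sd={sd:.2f}")
```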


## 7. Export Results

In [None]:
if results:
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    # Export JSON
    with open(f'test_results_{timestamp}.json', 'w', encoding='utf-8') as f:
        json.dump([r.__dict__ for r in results], f, indent=2, ensure_ascii=False)
    print(f'✅ Exported JSON: test_results_{timestamp}.json')
    # Export CSV
    df.to_csv(f'comparison_table_{timestamp}.csv', index=False)
    print(f'✅ Exported CSV: comparison_table_{timestamp}.csv')
else:
    print('No results to export.')
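`json.dump` cannot encode `Enum` members or `datetime` objects by itself. If the result objects carry such fields (for example `model_type`, which the code above treats as a `ModelType` member), a `default=` hook can convert them. A minimal sketch; `to_jsonable`, the `Model` enum and the demo record (including its `started` field) are hypothetical stand-ins.

```python
import json
from datetime import datetime
from enum import Enum


class Model(Enum):
    # Hypothetical stand-in for test_scenarios.ModelType
    OPENAI = "OpenAI"
    BERT = "BERT"


def to_jsonable(obj):
    """json `default` hook: called only for objects json cannot encode itself."""
    if isinstance(obj, Enum):
        return obj.value
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


record = {"scenario_name": "Scenario 2.1", "model_type": Model.OPENAI,
          "started": datetime(2024, 1, 1, 12, 0), "total_score": 0.0}
print(json.dumps(record, default=to_jsonable))
# {"scenario_name": "Scenario 2.1", "model_type": "OpenAI", "started": "2024-01-01T12:00:00", "total_score": 0.0}
```

In the export cell this would become `json.dump([vars(r) for r in results], f, indent=2, ensure_ascii=False, default=to_jsonable)`.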