This notebook explores: 
* The relative usage of each tool  
* The overall accuracy of each tool
* The bias of each tool towards yes/no answers 
* The relationship between confidence and accuracy for each tool
* The diversity of tools and prompts used for different questions

TLDR: 
1. Some tools have been used far more than others, and the settings have changed over time. Both make it harder to compare the tools' performance; creating the benchmark will help.
2. Some tools are biased towards one answer (e.g. they answer yes twice as often as no). This is surprising, since we would have expected a 50:50 split.
3. There is no discernible relationship between confidence and accuracy. This is perhaps evidence that the confidence-based output format we ask the tools for isn't working well.
4. (Still to be shown for the script data, but true for autocast.) Some tools, especially Claude, output equal probabilities for yes and no more often than others, which means they place fewer bets. Perhaps this helps to maintain higher accuracy (while reducing recall).

### imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option("display.precision", 2)
pd.set_option('display.max_columns', None)

plt.style.use('ggplot')

### Functions

In [2]:
def accuracy(data):
    """Return the counts and accuracy (%) of votes that match the current answer."""
    correct_answers_mask = data["currentAnswer"] == data["vote"]
    n_answers = correct_answers_mask.count()  # non-missing comparisons only
    n_answers_success = correct_answers_mask.sum()

    # accuracy is undefined (NaN) when there are no predictions to score
    if n_answers == 0:
        acc = float("nan")
    else:
        acc = n_answers_success / n_answers * 100

    return pd.Series({"n_correct": n_answers_success, "n_pred": n_answers, "accuracy": acc})


def acc_per_tool(group, col: str, conf: float):
    """Accuracy per group, restricted to the rows where `col` equals `conf`."""
    return group.apply(lambda x: accuracy(x[x[col] == conf]))


def gen_stats(df, group, col: str):
    """Accuracy per group for each distinct value of `col`, plus a "total" over all rows."""
    stats = {f"{col}_{prob}": acc_per_tool(group, col, prob) for prob in sorted(df[col].unique())}
    stats["total"] = group.apply(accuracy)
    return pd.concat(stats.values(), axis=1, keys=stats.keys())
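
To make the shape of the `gen_stats` output concrete, here is a minimal sketch on a made-up three-row frame. The helpers are restated in compact form so the block runs on its own; the toy data and the tool names `tool-a` and `tool-b` are assumptions for illustration, not values from the dataset.

```python
import pandas as pd


def accuracy(data):
    # share of rows (in %) where the vote matches the current answer
    mask = data["currentAnswer"] == data["vote"]
    n_pred, n_correct = mask.count(), mask.sum()
    acc = n_correct / n_pred * 100 if n_pred else float("nan")
    return pd.Series({"n_correct": n_correct, "n_pred": n_pred, "accuracy": acc})


def gen_stats(df, group, col):
    # one block of (n_correct, n_pred, accuracy) columns per distinct value of `col`
    stats = {
        f"{col}_{value}": group.apply(lambda x, v=value: accuracy(x[x[col] == v]))
        for value in sorted(df[col].unique())
    }
    stats["total"] = group.apply(accuracy)
    return pd.concat(stats, axis=1)


# made-up data: tool-a is right once out of twice, tool-b is right once out of once
toy = pd.DataFrame({
    "tool": ["tool-a", "tool-a", "tool-b"],
    "currentAnswer": ["Yes", "No", "No"],
    "vote": ["Yes", "Yes", "No"],
    "confidence": [0.5, 0.9, 0.9],
})
toy_stats = gen_stats(toy, toy.groupby("tool"), "confidence")
print(toy_stats)
```

The result has one row per tool and two-level columns: the first level is the confidence bucket (or `total`), the second is `n_correct`, `n_pred` or `accuracy`. A tool with no predictions in a bucket gets `n_pred` of 0 and an undefined accuracy.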

### load data

In [8]:
dataset = pd.read_csv("./data/dataset.csv")
print(f"Full dataset shape: {dataset.shape}")

# only error == False; copy so the casts below modify an independent frame
dataset = dataset[dataset["error"] == False].copy()
print(f"Dataset shape without errors: {dataset.shape}")

# the prompt text is stored in the "prompt" column (see the column listing below)
str_cols = ("id", "currentAnswer", "title", "request_id", "prompt", "tool", "nonce", "vote")
for col in str_cols:
    dataset[col] = dataset[col].astype("string")
dataset.head()

Full dataset shape: (33204, 15)

In [6]:
dataset

Unnamed: 0,id,currentAnswer,title,request_id,request_block,prompt,tool,nonce,deliver_block,p_yes,p_no,confidence,info_utility,vote,win_probability
0,0x0094fa304017d5c2b355790e2976f769ea600492,No,Will the Hisense U8K be considered a top-tier ...,1429730407779530824523722231071959771311408049...,29544655,"With the given question ""Will the Hisense U8K ...",prediction-online,c6366b3f-eff5-4533-8dd9-d653b281b29d,29577379,0.60,0.40,0.8,0.5,Yes,0.60
1,0x0094fa304017d5c2b355790e2976f769ea600492,No,Will the Hisense U8K be considered a top-tier ...,1695055931594747475916883029584567955775422500...,29545478,"With the given question ""Will the Hisense U8K ...",prediction-online,1eed33a5-a3f0-41c4-beae-23e9022ffe22,29576660,0.60,0.40,0.8,0.7,Yes,0.60
2,0x0094fa304017d5c2b355790e2976f769ea600492,No,Will the Hisense U8K be considered a top-tier ...,5972945302788386668720465960403202339977906500...,29546230,"With the given question ""Will the Hisense U8K ...",prediction-online,dd376ef9-eb2c-4d9f-8a5a-cf9ae8deb0b3,29576574,0.60,0.40,0.8,0.7,Yes,0.60
3,0x0094fa304017d5c2b355790e2976f769ea600492,No,Will the Hisense U8K be considered a top-tier ...,1043402953919313937539182160739114840263108832...,29546982,"With the given question ""Will the Hisense U8K ...",prediction-online,91096f15-5e3b-4bf1-8178-f17f1efcf639,29576448,0.70,0.30,0.8,0.6,Yes,0.70
4,0x0094fa304017d5c2b355790e2976f769ea600492,No,Will the Hisense U8K be considered a top-tier ...,9433232780766388309643050548812272093999565778...,29547744,"With the given question ""Will the Hisense U8K ...",prediction-online,92321968-7888-4877-b33f-22fa4755fbc2,29576351,0.65,0.35,0.9,0.8,Yes,0.65
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33199,0xffd6f34459ff26040e9cf1d9e4d9aaa7026a9683,No,Will YouTube's subscribe button light up whene...,4263190901304231495850761705989642389854368362...,30547221,Please take over the role of a Data Scientist ...,prediction-offline-sme,e595320f-2fe5-4054-a78a-2ec876fb18d2,30567380,0.60,0.40,0.8,0.5,Yes,0.60
33200,0xffd6f34459ff26040e9cf1d9e4d9aaa7026a9683,No,Will YouTube's subscribe button light up whene...,8369494795383516235624157123761657514877000913...,30547280,Please take over the role of a Data Scientist ...,prediction-offline,90b24122-766f-4311-b1a4-a63bcd3cb02f,30547290,0.80,0.20,0.9,0.5,Yes,0.80
33201,0xffd6f34459ff26040e9cf1d9e4d9aaa7026a9683,No,Will YouTube's subscribe button light up whene...,7918465905476728959297694493743499879185014792...,30548108,Please take over the role of a Data Scientist ...,claude-prediction-online,6e8ca36d-0035-47ad-8103-29800cf026d6,30548137,0.20,0.80,0.6,0.3,No,0.80
33202,0xffd6f34459ff26040e9cf1d9e4d9aaa7026a9683,No,Will YouTube's subscribe button light up whene...,7726801200361002068276204884331940960539584313...,30548848,Please take over the role of a Data Scientist ...,claude-prediction-offline,d36cc85a-52cf-4feb-9f51-6c29b5b88656,30548864,0.30,0.70,0.5,0.0,No,0.70


In [None]:
dataset.columns

```
id - bet id
currentAnswer - current winning answer (??)
title - the question
request_id - request for processing by Mech
request_block - the block in which the job was requested
prompt - prompt given by the Trader
tool - tool requested by the trader
nonce - nonce
deliver_block - block in which the result was delivered back to the trader
p_yes - probability of yes by the tool
p_no - probability of no by the tool
confidence - confidence by the tool
info_utility - utility of the additional info given to the LLM (??)
vote - the trader's position
win_probability - win probability as predicted by the tool (same as p_yes or p_no)
```

- is actual winning answer captured?

In [None]:
dataset.shape

In [None]:
dataset.describe()

### Normalize confidence

In [None]:
dataset["confidence"].unique()

In [None]:
# number of rows with confidence below 0.5
print(f"Number of rows with confidence below 0.5: {dataset[dataset['confidence'] < 0.5].shape[0]}")

# drop rows with confidence below 0.5
dataset = dataset[dataset["confidence"] >= 0.5].copy()

In [None]:
# bucket confidence into 0.5, 0.6, 0.7, 0.8, 0.9
dataset['confidence'] = dataset['confidence'].apply(lambda x: round(x, 1))

# if confidence is 1 --> 0.9
dataset['confidence'] = dataset['confidence'].apply(lambda x: 0.9 if x == 1.0 else x)

In [None]:
dataset.loc[(dataset["confidence"] >= 0.9) & (dataset["confidence"] < 1), "confidence"] = 0.9
dataset.loc[dataset["confidence"] == 0.85, "confidence"] = 0.8
dataset.loc[dataset["confidence"] == 0.75, "confidence"] = 0.7
dataset["confidence"].unique()
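
One caveat about the bucketing above: Python's `round(0.75, 1)` returns `0.8`, so once the rounding step has run there are no `0.75` or `0.85` values left for the later `.loc` assignments to remap. If the intent is to floor every confidence into its bucket (0.75 to 0.7, 0.85 to 0.8, 1.0 to 0.9), a possible alternative is `pd.cut` with left-closed bins. This is a sketch under that assumption about the intent, and `bucket_confidence` is a hypothetical helper name.

```python
import pandas as pd


def bucket_confidence(conf: pd.Series) -> pd.Series:
    """Floor confidences into the buckets 0.5, 0.6, 0.7, 0.8, 0.9."""
    edges = [0.5, 0.6, 0.7, 0.8, 0.9, float("inf")]
    labels = [0.5, 0.6, 0.7, 0.8, 0.9]
    # right=False closes each bin on the left: [0.5, 0.6), ..., [0.9, inf)
    # values below 0.5 fall outside every bin and become NaN
    return pd.cut(conf, bins=edges, labels=labels, right=False).astype(float)


example = pd.Series([0.5, 0.55, 0.65, 0.75, 0.85, 0.95, 1.0])
bucketed = bucket_confidence(example)
print(bucketed.tolist())  # [0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9]
```

Because the bin edges are compared directly, the result does not depend on how a particular half-way value happens to round.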

### Tool use

In [None]:
tool_use = dataset['tool'].value_counts().sort_index()
total = tool_use.sum()

# bar chart of tool use, labelled with the count and the share of all requests
fig, ax = plt.subplots()
ax.bar(tool_use.index, tool_use.values)
ax.tick_params(axis='x', rotation=90)
ax.set_ylabel('Tool Use')
ax.set_xlabel('Tool')
ax.set_title('Tool Use')
for i, v in enumerate(tool_use.values):
    ax.text(i, v + 3, str(v), ha='center')
    ax.text(i, v + 1200, f'{v / total * 100:.2f}%', ha='center')

plt.show()


The chart above shows a strong tendency to choose the prediction-online and prediction-online-sme tools.


In [None]:
# plot p_yes for every tool
dataset.groupby('tool')['p_yes'].mean().plot(kind='bar', title='p_yes for every tool', ylabel='p_yes', xlabel='tool', rot=90)

In [None]:
dataset.groupby('tool')['p_no'].mean().plot(kind='bar', title='p_no for every tool', ylabel='p_no', xlabel='tool', rot=90)
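
The mean `p_yes` and `p_no` plots do not show how often a tool declines to pick a side, which is what TLDR point 4 is about. A rough way to measure that is the share of rows per tool where `p_yes` equals `p_no`. The sketch below assumes the `tool`, `p_yes` and `p_no` columns described in this notebook; `tie_share`, the tolerance and the toy data are assumptions for illustration.

```python
import pandas as pd


def tie_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of predictions per tool where p_yes equals p_no."""
    # the probabilities are floats, so compare with a small tolerance
    is_tie = (df["p_yes"] - df["p_no"]).abs() < 1e-9
    return is_tie.groupby(df["tool"]).mean().sort_values(ascending=False)


# made-up data: tool-a ties on one of two rows, tool-b never ties
toy = pd.DataFrame({
    "tool": ["tool-a", "tool-a", "tool-b", "tool-b"],
    "p_yes": [0.5, 0.7, 0.6, 0.8],
    "p_no": [0.5, 0.3, 0.4, 0.2],
})
ties = tie_share(toy)
print(ties)
```

On the real data this would be called as `tie_share(dataset)`; a high value for a tool means it often returns no usable bet.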

In [None]:
tool_vote = dataset.groupby(['tool', 'vote'])['vote'].count().unstack(fill_value=0)

fig, ax = plt.subplots()
# plot Yes and No votes for every tool as side-by-side bars
ax = tool_vote.plot(kind='bar', stacked=False, ax=ax, rot=90, title='Yes and No votes for every tool')

# Data labels: one per bar, centred on the bar itself (needs matplotlib >= 3.4)
for container in ax.containers:
    ax.bar_label(container)

ax.set_ylabel('Votes')
ax.set_xlabel('Tool')

From above, it seems like the prediction-online tool may have a large bias towards yes answers.
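
To put a number on the apparent yes bias, one option is the share of Yes votes per tool together with a z-score against an even split. This is only a rough sketch: it uses the normal approximation to the binomial, it treats rows as independent even though many rows share the same question, and a skew is only a bias if the true answers are not skewed the same way. `yes_bias` and the toy data are hypothetical.

```python
import pandas as pd


def yes_bias(df: pd.DataFrame) -> pd.DataFrame:
    """Yes share per tool and its z-score against a 50:50 split."""
    counts = pd.crosstab(df["tool"], df["vote"])
    # make sure both columns exist even if a tool never voted one way
    counts = counts.reindex(columns=["Yes", "No"], fill_value=0)
    n = counts["Yes"] + counts["No"]
    out = pd.DataFrame({"n": n, "yes_share": counts["Yes"] / n})
    # under p = 0.5 the standard error of the share is 0.5 / sqrt(n)
    out["z_vs_half"] = (out["yes_share"] - 0.5) / (0.5 / n ** 0.5)
    return out


# made-up data: tool-a votes Yes 3 times out of 4, tool-b splits evenly
toy = pd.DataFrame({
    "tool": ["tool-a"] * 4 + ["tool-b"] * 4,
    "vote": ["Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"],
})
bias = yes_bias(toy)
print(bias)
```

Comparing `yes_share` with the share of Yes outcomes in `currentAnswer` would show whether a tool is biased or simply tracking a skewed set of questions.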

The Claude-based tools are the most accurate. However, they are not chosen nearly as much as the main two. 

### Check the percentage of wins vs confidence for all the tools

In [None]:
tools_group = dataset.groupby("tool")
tools_stats_per_conf = gen_stats(df=dataset, group=tools_group, col="confidence")
display(tools_stats_per_conf)

In [None]:
tools_stats_per_conf.loc[:, (slice(None), "accuracy")].sort_values(by=("total", "accuracy"), ascending=False)

In [None]:
tools = tools_stats_per_conf.index

fig, axes = plt.subplots(nrows=len(tools), ncols=1, figsize=(10, 40))

for i, tool in enumerate(tools):
    ax = axes[i]
    # rows: n_correct / n_pred / accuracy; columns: confidence buckets plus "total"
    per_conf = tools_stats_per_conf.loc[tool].unstack(level=0)
    tool_stats = per_conf.loc["accuracy"]
    n_pred = per_conf.loc["n_pred"]
    n_correct = per_conf.loc["n_correct"]
    ax.bar(tool_stats.index, tool_stats.values)
    ax.set_title(tool)
    ax.set_ylabel("Accuracy")
    ax.set_xlabel("Confidence")

    for j, v in enumerate(tool_stats.values):
        # annotate each bar with accuracy, n_pred and n_correct (positional lookup)
        ax.text(j, v + 3, f"accuracy: {round(v, 2)}%", ha='center')
        ax.text(j, v + 10, f"n_pred: {n_pred.iloc[j]}", ha='center')
        ax.text(j, v + 17, f"n_correct: {n_correct.iloc[j]}", ha='center')

plt.tight_layout()

The charts above show that, for each tool, higher confidence does not clearly translate into higher accuracy.
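
The same comparison can be made in tabular form by setting the stated confidence against the observed accuracy in each bucket. The sketch below assumes the `tool`, `confidence`, `currentAnswer` and `vote` columns used in this notebook and treats confidence as if it were meant to be a probability of being correct, which is an assumption rather than something the tools promise. `calibration_table` and the toy data are hypothetical.

```python
import pandas as pd


def calibration_table(df: pd.DataFrame) -> pd.DataFrame:
    """Observed accuracy per (tool, confidence) bucket and its gap to the confidence."""
    table = (
        df.assign(correct=df["currentAnswer"] == df["vote"])
        .groupby(["tool", "confidence"])["correct"]
        .agg(n_pred="count", accuracy="mean")
        .reset_index()
    )
    # positive gap: over-confident; negative gap: under-confident
    table["gap"] = table["confidence"] - table["accuracy"]
    return table


# made-up data: tool-a is right 1 of 2 times at 0.9 and 1 of 1 times at 0.5
toy = pd.DataFrame({
    "tool": ["tool-a", "tool-a", "tool-a"],
    "confidence": [0.9, 0.9, 0.5],
    "currentAnswer": ["Yes", "No", "No"],
    "vote": ["Yes", "Yes", "No"],
})
calib = calibration_table(toy)
print(calib)
```

For a well-calibrated tool the gaps would sit close to zero in every bucket; buckets with a small `n_pred` should be read with caution.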

In [None]:
vote_analysis = dataset['vote'].value_counts()
print(vote_analysis)

In [None]:
# keep only the ids (bets) that have more than 50 rows
id_counts = dataset['id'].value_counts()
sel = list(id_counts[id_counts > 50].index)
sel_dataset = dataset[dataset['id'].isin(sel)]
sel_dataset['id'].nunique()

In [None]:
tools_per_id = sel_dataset.groupby('id')['tool'].nunique()

tools_per_id = tools_per_id.value_counts().sort_index()


fig, ax = plt.subplots()
ax.bar(tools_per_id.index, tools_per_id.values)
ax.set_ylabel('Number of IDs')
ax.set_xlabel('Number of Tools')

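# place each label at its bar's x position (the number of tools), not at an assumed 1..n
for n_tools, n_ids in tools_per_id.items():
    ax.text(n_tools, n_ids, str(n_ids), ha='center')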

plt.title('Number of Tools per ID. Total IDs: ' + str(sel_dataset['id'].nunique()))

The chart above shows how many different tools were used for each id/title. Most ids were served by only 1 or 2 tools.


In [None]:
dataset[['id', 'currentAnswer']].drop_duplicates()['currentAnswer'].value_counts()

In [None]:
vote_counts_per_id = dataset.groupby('id')['vote'].value_counts().unstack(fill_value=0)
vote_counts_per_id['yes_no_ratio'] = vote_counts_per_id['Yes'] / (vote_counts_per_id['No'] +1)

In [None]:
vote_counts_per_id['yes_perc'] = vote_counts_per_id['Yes']/(vote_counts_per_id['Yes'] + vote_counts_per_id['No'])
vote_counts_per_id['no_perc'] = vote_counts_per_id['No']/(vote_counts_per_id['Yes'] + vote_counts_per_id['No'])

In [None]:
vote_counts_per_id['yes_perc'].min(), vote_counts_per_id['yes_perc'].max(), vote_counts_per_id['yes_perc'].mean(), vote_counts_per_id['yes_perc'].median()

In [None]:
vote_counts_per_id['no_perc'].min(), vote_counts_per_id['no_perc'].max(), vote_counts_per_id['no_perc'].mean(), vote_counts_per_id['no_perc'].median()

In [None]:
dataset.groupby('id')['prompt'].nunique().value_counts().plot(kind='bar', title='Number of prompts per ID', ylabel='Number of IDs', xlabel='Number of prompts')

The chart above shows, for each id/title, how many distinct prompts came from the trader. It seems that the default prompt is rarely changed.