# EDA and Data Cleaning

An adaptation of Coco Cao Jinglu's [EDA notebook](https://github.com/coco-cao-jinglu/w210-active-prompt/blob/main/eda.ipynb).

In [19]:
# utility
from copy import deepcopy
from dateutil import parser
from datetime import timedelta
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

# data manipulation
import pandas as pd
import numpy as np

# display
from IPython.display import display
pd.set_option('display.max_colwidth', 290)

## Data

In [20]:
DATA_DIR = "../data"

In [21]:
autocast_questions = pd.read_json(f'{DATA_DIR}/autocast_questions.json')
autocast_questions.head(2)

Unnamed: 0,question,id,background,publish_time,close_time,tags,source_links,prediction_count,forecaster_count,answer,choices,status,qtype,crowd
0,What will the end-of-day closing value for the dollar against the renminbi be on 1 January 2016?,G1,"Outcome will be determined by the end-of-day closing value reported by Bloomberg, at http://www.bloomberg.com/quote/usdcny:cur. For historical trends, see http://www.bloomberg.com/quote/usdcny:cur/chart. For more information on China's economy see http://www.theworldin.com/article/10492.",2015-09-01 13:49:29.860000+00:00,2016-01-01 17:00:01+00:00,"[Finance, Economic Indicators]","[http://ftalphaville.ft.com/2015/08/17/2137329/what-are-chinese-capital-controls-really-part-2/, http://www.investmentweek.co.uk/investment-week/analysis/2427669/why-investors-need-to-remain-mindful-of-a-more-flexible-renminbi-regime, http://www.bbc.com/news/business-34825542, https://...",1549.0,385,D,"[Less than 6.30, Between 6.30 and 6.35, inclusive, More than 6.35 but less than 6.40, 6.40 or more]",Resolved,mc,"[{'timestamp': '2015-09-01 00:00:00+00:00', 'forecast': [0.33330000000000004, 0.25, 0.25, 0.25]}, {'timestamp': '2015-09-02 00:06:24.261000+00:00', 'forecast': [0.24, 0.24, 0.24, 0.28]}, {'timestamp': '2015-09-03 02:50:49.320000+00:00', 'forecast': [0.16, 0.22, 0.16, 0.46]}, {'timestam..."
1,How many seats will the Justice and Development Party (AKP) win in Turkey's snap elections?,G2,"The Justice and Development Party (AKP) failed to win a single-party majority in June's general election for the first time since 2002. After negotiations aimed at forming a coalition government collapsed, snap elections have been scheduled for 1 November. For more information see: www...",2015-09-01 13:54:25.050000+00:00,2015-11-01 22:00:20+00:00,"[Elections and Referenda, Non-US Politics]","[http://www.al-monitor.com/pulse/originals/2015/10/turkey-military-is-not-willing-to-intervene-politics-for-now.html, http://www.thestar.com.my/News/World/2015/10/22/Erdogan-seen-with-little-choice-but-to-share-power-after-Turkish-vote/, http://www.bbc.com/news/world-europe-34479873, h...",567.0,194,A,"[A majority, A plurality, Not a plurality]",Resolved,mc,"[{'timestamp': '2015-09-01 00:00:00+00:00', 'forecast': [0.33330000000000004, 0.33330000000000004, 0.33330000000000004]}, {'timestamp': '2015-09-01 23:56:04.467000+00:00', 'forecast': [0.28, 0.71, 0.01]}, {'timestamp': '2015-09-02 00:45:01.353000+00:00', 'forecast': [0.2, 0.79, 0.01]},..."


In [22]:
print('event coverage:')
print(min(autocast_questions['publish_time']))
print(max(autocast_questions['publish_time']))

print('\n')
print('status count')
display(autocast_questions.groupby('status').agg(n_questions=('id', 'nunique')).reset_index())

print('\n')
print('labels count')
l_all_tags = list(autocast_questions['tags'])
l_all_tags = [a for sublist in l_all_tags for a in sublist]
l_all_tags = list(set(l_all_tags))
print('number of unique tags = ' + str(len(l_all_tags)))
l_n_questions = []
for tag in l_all_tags:
    col_tag_count = autocast_questions['tags'].apply(lambda x: 1 if tag in x else 0)
    n_tag_count = sum(col_tag_count)
    l_n_questions.append(n_tag_count)
df_labels_count = pd.DataFrame({'tag': l_all_tags,
                                'n_questions': l_n_questions})
display(df_labels_count.sort_values(by = 'n_questions',
                                    ascending=False).reset_index())

event coverage:
2015-09-01 13:49:29.860000+00:00
2022-06-17 16:48:53.986000+00:00


status count


Unnamed: 0,status,n_questions
0,Active,2389
1,Closed,387
2,Resolved,3723




labels count
number of unique tags = 248


Unnamed: 0,index,tag,n_questions
0,0,Effective Altruism,500
1,69,Security and Conflict,456
2,172,Business,431
3,51,Novel Coronavirus (Covid-19),428
4,1,Politics – US,372
...,...,...,...
243,2,Computing – Operating Systems,2
244,144,Patents – General,1
245,40,Patents,1
246,114,Tree coverage loss,1
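
The per-tag counts above can also be computed without the explicit loop by exploding the list-valued `tags` column. The following is a minimal sketch on a toy frame: the three rows are made up for illustration, and it assumes a tag never appears twice within one question, which is what makes it equivalent to the loop.

```python
import pandas as pd

# Toy stand-in for autocast_questions (hypothetical rows, not real data)
toy_questions = pd.DataFrame({
    "id": ["Q1", "Q2", "Q3"],
    "tags": [["Business", "Finance"], ["Business"], ["Patents"]],
})

# One row per (question, tag) pair, then count how often each tag occurs
tag_counts = (
    toy_questions.explode("tags")["tags"]
    .value_counts()
    .rename("n_questions")
)
print(tag_counts)
```

On the real data, `toy_questions` would be replaced by `autocast_questions`.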


A peek at the planned pipeline

- Input: question
- Step 1: enrich with domain-specific language
- Step 2: few-shot sample selection of relevant examples
- Output: prediction

In [23]:
autocast_questions.sample()

Unnamed: 0,question,id,background,publish_time,close_time,tags,source_links,prediction_count,forecaster_count,answer,choices,status,qtype,crowd
5923,When will used/pre-owned RTX30 series Nvidia GPUs suitable for deep learning sell below retail price?,M9561,"As of January 2022, RTX30 series Nvidia GPUs suitable for deep learning are out of stock at all online retailers as part of the 2020–2022 global chip shortage. As long as used/pre-owned prices are greater than retail prices, scalpers will take advantage of the free arbitrage opportunit...",2022-02-11 05:00:00+00:00,2024-03-01 14:45:00+00:00,"[Computer Science – Computer Graphics, Computing – Computers]","[https://en.wikipedia.org/wiki/2020%E2%80%932022_global_chip_shortage#Graphics_cards_and_gaming_PCs, https://www.tomshardware.com/news/gpu-pricing-index, https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_30_series]",40.0,14,,,Active,num,"[{'timestamp': '2022-02-11 05:26:59.465888+00:00', 'forecast': 0.455}, {'timestamp': '2022-02-11 07:07:41.718873+00:00', 'forecast': 0.5814}, {'timestamp': '2022-02-11 21:15:07.007799+00:00', 'forecast': 0.39704000000000006}, {'timestamp': '2022-02-11 21:18:04.647134+00:00', 'forecast'..."


In [24]:
print('crowd source prediction')
l_preds = [x['forecast'] for x in list(autocast_questions[autocast_questions['id'] == 'M10154']['crowd'])[0]]
print('n_predictions = ' + str(len(l_preds)))
print('average = ' + str(sum(l_preds)/len(l_preds)))
print('median = ' + str(np.median(l_preds)))

crowd source prediction
n_predictions = 101
average = 0.08900990099009909
median = 0.09
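
The same summary is needed repeatedly later on, so it can be wrapped in a small helper. This is a sketch only: `summarise_forecasts` is a hypothetical name, the input list is made up, and it assumes scalar forecasts (as for t/f and num questions).

```python
from collections import Counter

import numpy as np


def summarise_forecasts(forecasts):
    """Return count, mean, median and most frequent value of scalar forecasts."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    # most_common(1) returns [(value, count)] for the most frequent value
    mode_value, _ = Counter(forecasts).most_common(1)[0]
    return {
        "n": len(forecasts),
        "mean": float(np.mean(forecasts)),
        "median": float(np.median(forecasts)),
        "mode": mode_value,
    }


example = summarise_forecasts([0.05, 0.09, 0.09, 0.12])
print(example)
```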


In [25]:
# pipeline step 1. contextual enriching
print('background supplied by dataset:')
print(list(autocast_questions[autocast_questions['id'] == 'M10154']['background'])[0])

# 1.1 extract and summarise relevant info from the background supplied by dataset
# retrieval QA from Langchain library

# 1.2 external dataset (pending) and extract and summarise relevant info

background supplied by dataset:
According to Wikipedia: "A nuclear and radiation accident is defined by the International Atomic Energy Agency (IAEA) as "an event that has led to significant consequences to people, the environment or the facility. Examples include lethal effect to individuals, large radioactivity release to the environment, reactor core melt." The prime example of a "major nuclear accident" is one in which a reactor core is damaged and significant amounts of radioactive isotopes are released, such as in the Chernobyl disaster in 1986 and Fukushima nuclear disaster in 2011. Russian military forces seized Chernobyl during the first day of the Ukrainian invasion as well as Zaporizhzhia, the largest nuclear plant of its kind in Europe, during the seventh day. Fighting near nuclear power plants could possibly mean an increased risk of a serious radiation incident. Thus we ask: Will there be a serious radiation incident at any nuclear plant in Ukraine by 2024? The question w

In [26]:
autocast_questions.shape

(6532, 14)

In [27]:
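# check whether the crowd prediction is a good proxy for the truth, using resolved questions
# (.copy() so that the column assignments below do not write into a slice of autocast_questions)
df_resolved = autocast_questions[autocast_questions['status'] == 'Resolved'].copy()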

# only keep crowd predictions that are made before the resolved events' closing time
df_resolved['crowd'] = df_resolved.apply(lambda row: [k for k in row['crowd'] if k['timestamp'] <= row['close_time']],
                                         axis = 1)

# eliminate the cases where there are too few crowd predictions
df_resolved['n_crowd_predictions'] = df_resolved['crowd'].apply(lambda x: len(x))
print('minimal number of crowd predictions = ' + str(min(df_resolved['n_crowd_predictions'])))
print('keeping only resolved samples with n_predictions >= 10')
df_resolved = df_resolved[df_resolved['n_crowd_predictions'] >= 10].reset_index()
print('remaining resolved samples = ' + str(len(df_resolved)))

# get the average, median and majority predictions

# collapse every forecast to a single number
# scalar forecasts (t/f and num questions) are returned unchanged
# list forecasts (mc questions: one probability per choice) are replaced by their mean,
# which is roughly 1/n_choices because the probabilities sum to about 1
# input: timestamp-forecast pair as one dictionary item in the crowd predictions
# output: transformed dic item (modified in place)
def helper_integify(dic_item):
    if isinstance(dic_item['forecast'], (int, float)):
        return dic_item
    else:
        dic_item['forecast'] = sum(dic_item['forecast'])/len(dic_item['forecast'])
        return dic_item
    
df_resolved['crowd'] = df_resolved['crowd'].apply(lambda x: [helper_integify(k) for k in x])
col_all_crowd_predictions = df_resolved['crowd'].apply(lambda x: [k['forecast'] for k in x])
df_resolved['avg_pred'] = col_all_crowd_predictions.apply(lambda x: sum(x)/len(x))
df_resolved['median_pred'] = col_all_crowd_predictions.apply(lambda x: np.median(x))
df_resolved['majority_pred'] = col_all_crowd_predictions.apply(lambda x: max(set(x), key = x.count))

# transform the correct answer to a number
# input: row of df_resolved
# return: quantified answer
def helper_quantify_answer(row):
    answer = row['answer']
    if isinstance(answer, (int, float)):
        return answer
    # only call .lower() once we know the answer is a string
    if isinstance(answer, str):
        if answer.lower() == 'yes':
            return 1
        if answer.lower() == 'no':
            return 0
        if len(answer) == 1:
            dic_alphabet_number = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10}
            # return row['choices'][dic_alphabet_number[answer.lower()]]
            return dic_alphabet_number[answer.lower()] # zero-based index of the correct choice for MCQ questions
    print('exception cases. row answer = ' + str(answer))
    return None
    
df_resolved['answer_numeric'] = df_resolved.apply(lambda row: helper_quantify_answer(row),
                                                  axis = 1)

# test if any is a good proxy for truth
display(df_resolved[['answer_numeric', 'avg_pred', 'median_pred', 'majority_pred', 'qtype']].head())
# for t/f questions, answer_numeric is 0 for false and 1 for true, and the crowd-prediction-derived numbers lie between 0 and 1
# for mcq questions, answer_numeric is the zero-based index of the correct choice (e.g. B translates to 1),
# while the crowd-prediction-derived numbers are means of per-choice probabilities, so the two are on different scales

# therefore we have to discuss the cases separately
# difference thresholds set for T/F, MCQ and numeric questions respectively
THRESHOLD_TF = 0.2
THRESHOLD_MCQ = 0.1
THRESHOLD_NUM = 0.1

# input
# row: df_resolved row
# col_pred: the column name for the specific prediction column we are testing against
def helper_check_correct(row, col_pred, threshold_tf = THRESHOLD_TF, threshold_mcq = THRESHOLD_MCQ, threshold_num = THRESHOLD_NUM):
    if row['qtype'] == 'mc':
        return abs(row[col_pred] - row['answer_numeric']) <= threshold_mcq
    elif row['qtype'] == 't/f':
        return abs(row[col_pred] - row['answer_numeric']) <= threshold_tf
    else:  # row['qtype'] == 'num'
        return abs(1 - row[col_pred]/(row['answer_numeric'] + 0.0000001)) <= threshold_num

dic_accuracy_rate = {}
for col in ['avg_pred', 'median_pred', 'majority_pred']:
    df_resolved[col + '_correction'] = df_resolved.apply(lambda row: helper_check_correct(row, col),
                                                         axis = 1)
    dic_accuracy_rate[col] = len(df_resolved[df_resolved[col + '_correction'] == 1])/len(df_resolved)

dic_accuracy_rate

# next steps
# - restrict to questions of narrower scope
# - cross validation
# - calculate error bars & positive predictive value in addition to accuracy rates for the three metrics

minimal number of crowd predictions = 2
keeping only resolved samples with n_predictions >= 10
remaining resolved samples = 3743


Unnamed: 0,answer_numeric,avg_pred,median_pred,majority_pred,qtype
0,3.0,0.250054,0.25,0.25,mc
1,0.0,0.333333,0.333333,0.333333,mc
2,1.0,0.22002,0.15,0.15,t/f
3,1.0,0.738228,0.74,0.85,t/f
4,0.0,0.083066,0.05,0.1,t/f


{'avg_pred': 0.313117819930537,
 'median_pred': 0.346513491851456,
 'majority_pred': 0.36548223350253806}
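
For `mc` questions the table above compares a choice index (`answer_numeric`) with an averaged probability, so the two columns are on different scales. One alternative, which is not what this notebook does, is to keep each forecast as a per-choice probability vector and compare the arg-max choice with the answer index. The sketch below assumes unmodified list forecasts of the kind shown in the raw `crowd` column; `predicted_choice` is a hypothetical helper.

```python
import numpy as np


def predicted_choice(crowd):
    """Index of the choice with the highest mean probability across forecasts."""
    probs = np.array([item["forecast"] for item in crowd], dtype=float)
    return int(probs.mean(axis=0).argmax())


# Two per-choice forecast vectors for a four-choice question
toy_crowd = [
    {"timestamp": "2015-09-02", "forecast": [0.24, 0.24, 0.24, 0.28]},
    {"timestamp": "2015-09-03", "forecast": [0.16, 0.22, 0.16, 0.46]},
]
answer_index = "abcd".index("d")  # answer 'D' maps to index 3
is_correct = predicted_choice(toy_crowd) == answer_index
print(is_correct)
```

With this representation no threshold is needed for `mc` questions: the prediction either matches the answer index or it does not.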

In [28]:
# find suitable subsets where crowd predictions may be a good proxy
# method 1. by labels
# find labels with at least n_resolved_min eligible resolved questions and at least n_open_min eligible open questions
n_resolved_min = 150
n_open_min = 100

def helper_count_tag_samples(df, tag):
    df_with_tag = deepcopy(df)
    df_with_tag['tag_flag'] = df_with_tag['tags'].apply(lambda x: 1 if tag in x else 0)
    return sum(df_with_tag['tag_flag'])

def helper_get_eligible_samples(df, n_predictions_min):
    df = df.copy()  # avoid writing into a slice of the caller's dataframe
    # only keep crowd predictions made before the closing time
    df['crowd'] = df.apply(lambda row: [k for k in row['crowd'] if k['timestamp'] <= row['close_time']],
                                         axis = 1)
    # eliminate the cases where there are too few crowd predictions
    df['n_crowd_predictions'] = df['crowd'].apply(lambda x: len(x))
    df = df[df['n_crowd_predictions'] >= n_predictions_min].reset_index()
    return df

l_good_tags = []
df_open = helper_get_eligible_samples(autocast_questions[autocast_questions['status'] == 'Active'], 10)
for tag in l_all_tags:
    if helper_count_tag_samples(df_resolved, tag) >= n_resolved_min and helper_count_tag_samples(df_open, tag) >= n_open_min:
        l_good_tags.append(tag)

print(l_good_tags)

# test if crowd prediction is a good proxy for either case
THRESHOLD_TF = 0.2
THRESHOLD_MCQ = 0.1

def helper_crowd_prediction_calculations(df_resolved, tag):
    df_tag = deepcopy(df_resolved)
    df_tag['whether_tag'] = df_tag['tags'].apply(lambda x: 1 if tag in x else 0)
    df_tag = df_tag[df_tag['whether_tag'] == 1]
    df_tag['answer_numeric'] = df_tag.apply(lambda row: helper_quantify_answer(row),
                                                  axis = 1)
    
    dic_accuracy_rate = {}
    for col in ['avg_pred', 'median_pred', 'majority_pred']:
        df_tag[col + '_correction'] = df_tag.apply(lambda row: helper_check_correct(row, col),
                                                         axis = 1)
        dic_accuracy_rate[col] = len(df_tag[df_tag[col + '_correction'] == 1])/len(df_tag)

    return dic_accuracy_rate


for tag in l_good_tags:
    dic_accuracy_rate = helper_crowd_prediction_calculations(df_resolved, tag)
    print(tag)
    print(dic_accuracy_rate)

['Politics – US', 'Business']
Politics – US
{'avg_pred': 0.3670212765957447, 'median_pred': 0.42021276595744683, 'majority_pred': 0.48404255319148937}
Business
{'avg_pred': 0.2596491228070175, 'median_pred': 0.2912280701754386, 'majority_pred': 0.3192982456140351}
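
The per-tag accuracy rates above rest on fairly small samples, and the notes in the previous cell already call for error bars. A minimal sketch of a normal-approximation interval for an accuracy rate follows. `accuracy_with_error_bar` is a hypothetical helper and the counts in the example are made up; for very small samples a Wilson interval would be a safer choice.

```python
import math


def accuracy_with_error_bar(n_correct, n_total, z=1.96):
    """Accuracy and the half-width of a normal-approximation confidence interval."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    # binomial standard error sqrt(p * (1 - p) / n), scaled by z (1.96 for ~95%)
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    return p, half_width


p, half_width = accuracy_with_error_bar(n_correct=30, n_total=100)
print(f"accuracy = {p:.3f} +/- {half_width:.3f}")
```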


In [29]:
# submission by time/trend
# N_MIN_CROWD_PRED = 10
N_MIN_CROWD_PRED = 5

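# first, cross-validation to make sure we're not using p-hacking-like techniques
# note: valid_ratio is the fraction of rows that goes to the main split; the remainder is used for validation
def cross_validation(df, valid_ratio = 0.8):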
    df_test = df.sample(frac= valid_ratio)
    df_valid = df.drop(df_test.index)
    return df_test, df_valid

def test_res_cross_validated(df, test_func, valid_ratio = 0.8):
    df_test, df_valid = cross_validation(df, valid_ratio)
    return {'main result': test_func(df_test),
            'validation result': test_func(df_valid)}

# test idea 1: crowd predictions within pre-determined time period before closing date
def get_predictions_pre_closure(df, time_period):
    df_filtered = df.copy()

    def filter_row(row, time_period):
        close_time = parser.parse(str(row['close_time']))
        if close_time.tzinfo is not None:  # if 'close_time' is timezone-aware
            close_time = close_time.replace(tzinfo=None)  # convert 'close_time' to a timezone-naive datetime
        start_time = close_time - time_period
        filtered_forecasts = []
        for forecast in row['crowd']:
            try:
                timestamp = parser.parse(forecast['timestamp'])
                if timestamp.tzinfo is not None:  # if 'timestamp' is timezone-aware
                    timestamp = timestamp.replace(tzinfo=None)  # convert 'timestamp' to a timezone-naive datetime
                if start_time <= timestamp <= close_time:
                    filtered_forecasts.append(forecast)
            except (ValueError, OverflowError):
                pass  # Ignore timestamps that can't be parsed or are out of bounds
        return filtered_forecasts

    df_filtered['crowd'] = df_filtered.apply(lambda row: filter_row(row, time_period), axis=1)
    
    return df_filtered

def get_eligible_events_only(df, n_min_crowd_predictions = N_MIN_CROWD_PRED):
    l_n_crowd_preds = df['crowd'].apply(lambda x: len(x))
    df_filtered = df[l_n_crowd_preds >= n_min_crowd_predictions].reset_index()
    return df_filtered

def get_crowd_prediction_res(df, threshold_tf = THRESHOLD_TF, threshold_mcq = THRESHOLD_MCQ):
    df_resolved = deepcopy(df)
    df_resolved['crowd'] = df_resolved['crowd'].apply(lambda x: [helper_integify(k) for k in x])
    col_all_crowd_predictions = df_resolved['crowd'].apply(lambda x: [k['forecast'] for k in x])
    df_resolved['avg_pred'] = col_all_crowd_predictions.apply(lambda x: sum(x)/len(x))
    df_resolved['median_pred'] = col_all_crowd_predictions.apply(lambda x: np.median(x))
    df_resolved['majority_pred'] = col_all_crowd_predictions.apply(lambda x: max(set(x), key = x.count))
    
    df_resolved['answer_numeric'] = df_resolved.apply(lambda row: helper_quantify_answer(row),
                                                  axis = 1)
    
    dic_accuracy_rate = {}
    for col in ['avg_pred', 'median_pred', 'majority_pred']:
        df_resolved[col + '_correction'] = df_resolved.apply(lambda row: helper_check_correct(row, col, threshold_tf=threshold_tf, threshold_mcq=threshold_mcq),
                                                         axis = 1)
        dic_accuracy_rate[col] = len(df_resolved[df_resolved[col + '_correction'] == 1])/len(df_resolved)

    return dic_accuracy_rate

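# find the smallest window (in days) that leaves enough eligible active and resolved questions
# df_pre_days is recomputed on every iteration so that both counts reflect the current window
MAX_DAYS_DELTA = 365  # assumed upper bound so that the search always terminates
n_days_delta = 30
while n_days_delta <= MAX_DAYS_DELTA:
    df_pre_days = get_predictions_pre_closure(autocast_questions, timedelta(days=n_days_delta))
    n_open = len(helper_get_eligible_samples(df_pre_days[df_pre_days['status'] == 'Active'], 10))
    n_resolved = len(helper_get_eligible_samples(df_pre_days[df_pre_days['status'] == 'Resolved'], 10))
    if n_open >= n_open_min and n_resolved >= n_resolved_min:
        break
    n_days_delta += 1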

print('min n_days for time_delta = ' + str(n_days_delta))

for days in [n_days_delta, n_days_delta+5, n_days_delta+10]:
    df_pre_days = get_predictions_pre_closure(autocast_questions, timedelta(days=days))
    df_filtered = get_eligible_events_only(df_pre_days)
    print('time_delta = ' + str(days) + ' days:')
    res = get_crowd_prediction_res(df_filtered[df_filtered['status'] == 'Resolved'])
    print(res)


KeyboardInterrupt: 
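
The window filter in this cell parses every timestamp with `dateutil`, which is slow when repeated for many window sizes. A possible alternative is sketched below using `pd.to_datetime`. `forecasts_in_window` is a hypothetical helper, the toy forecasts are made up, and timezone-naive values are assumed to be in UTC.

```python
from datetime import timedelta

import pandas as pd


def forecasts_in_window(crowd, close_time, window):
    """Keep forecasts whose timestamp lies in [close_time - window, close_time]."""
    close = pd.Timestamp(close_time)
    # Assumption: naive close times are UTC; aware ones are converted to UTC
    close = close.tz_localize("UTC") if close.tzinfo is None else close.tz_convert("UTC")
    start = close - window
    kept = []
    for item in crowd:
        # errors="coerce" turns unparseable timestamps into NaT instead of raising
        ts = pd.to_datetime(item["timestamp"], utc=True, errors="coerce")
        if pd.isna(ts):
            continue
        if start <= ts <= close:
            kept.append(item)
    return kept


toy_crowd = [
    {"timestamp": "2016-01-01 00:00:00+00:00", "forecast": 0.5},
    {"timestamp": "2016-02-20 12:00:00+00:00", "forecast": 0.4},
    {"timestamp": "not a date", "forecast": 0.1},
]
kept = forecasts_in_window(toy_crowd, "2016-03-01 00:00:00+00:00", timedelta(days=30))
print(kept)
```

Parsing each question's timestamps once and reusing them across window sizes would avoid most of the repeated work.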

In [30]:
# test idea 2: select combination of labels

# first, get labels for which the crowd predictions may be a good truth proxy
# columns: tag, highest accuracy metric, n_samples for eligible resolved events, n_samples for eligible active events
l_highest_accuracy_metric = []
l_n_samples_resolved = []
l_n_samples_open = []

# N_MIN_CROWD_PRED = 10
# min_n_samples_resolved_combi = 40
# min_n_samples_open_combi = 30
# min_highest_accu_value = 0.85
N_MIN_CROWD_PRED = 5
min_n_samples_resolved_combi = 30
min_n_samples_open_combi = 20
min_highest_accu_value = 0.7

def helper_get_tag_events(df, tag):
    l_tag_existing = df['tags'].apply(lambda x: 1 if tag in x else 0)
    return df[l_tag_existing == 1].reset_index()


def helper_get_eligible_samples_v2(df, n_predictions_min):
    if len(df) == 0:
        return df 
    else:
        df['crowd'] = df.apply(lambda row: [k for k in row['crowd'] if k['timestamp'] <= row['close_time']],
                                         axis = 1)
        # eliminate the cases where there are too few crowd predictions
        df['n_crowd_predictions'] = df['crowd'].apply(lambda x: len(x))
        df = df[df['n_crowd_predictions'] >= n_predictions_min].reset_index(drop=True)
        return df


for tag in l_all_tags:
    df_tag = helper_get_tag_events(autocast_questions, tag)
    df_tag_active = helper_get_eligible_samples_v2(df_tag[df_tag['status'] == 'Active'], N_MIN_CROWD_PRED)
    df_tag_resolved = helper_get_eligible_samples_v2(df_tag[df_tag['status'] == 'Resolved'], N_MIN_CROWD_PRED)
    l_n_samples_open.append(len(df_tag_active))
    l_n_samples_resolved.append(len(df_tag_resolved))
    if len(df_tag_resolved) == 0:
        l_highest_accuracy_metric.append(None)
    else:
        dic_res = get_crowd_prediction_res(df_tag_resolved)
        max_accu = max(dic_res.values())
        l_highest_accuracy_metric.append({dict((v,k) for k,v in dic_res.items())[max_accu]:max_accu})


df_res = pd.DataFrame({'tag': l_all_tags,
                       'highest accuracy metric': l_highest_accuracy_metric,
                       'n_samples_resolved': l_n_samples_resolved,
                       'n_samples_open': l_n_samples_open})
df_res['highest accuracy value'] = df_res['highest accuracy metric'].apply(lambda x: list(x.values())[0] if x is not None else None)
df_res.sort_values(by = ['highest accuracy value', 'n_samples_resolved', 'n_samples_open'],
                   ascending=False,
                   inplace=True)
display(df_res)


display(df_res[(df_res['highest accuracy value'] >= min_highest_accu_value) & (df_res['n_samples_open'] >= min_n_samples_open_combi) & (df_res['n_samples_resolved'] >= min_n_samples_resolved_combi)])
# with the relaxed thresholds above only one label passes, which is not enough for a combination either

Unnamed: 0,tag,highest accuracy metric,n_samples_resolved,n_samples_open,highest accuracy value
53,Animal Charity Evaluators Strategy,{'majority_pred': 1.0},6,2,1.0
113,Social — Social Movements,{'majority_pred': 1.0},2,15,1.0
35,Physical Sciences – Chemistry,{'majority_pred': 1.0},1,5,1.0
59,Series – Self-Resolving Questions,{'majority_pred': 1.0},1,3,1.0
40,Patents,{'majority_pred': 1.0},1,0,1.0
...,...,...,...,...,...
36,Education,,0,3,
70,Series — Animal welfare — prediction party,,0,3,
73,Government Investment,,0,1,
114,Tree coverage loss,,0,0,


Unnamed: 0,tag,highest accuracy metric,n_samples_resolved,n_samples_open,highest accuracy value
69,Security and Conflict,{'median_pred': 0.7031630170316302},411,45,0.703163


In [31]:
l_tags_of_interest = ['Series — Forecasting AI Progress', 'Economy – US – Economic Indicators', 'Economy – US',
                      'Novel Coronavirus (Covid-19)', 'Security and Conflict']
l_filter = autocast_questions['tags'].apply(lambda x: 1 if any(tag in x for tag in l_tags_of_interest) else 0)
autocast_questions[l_filter == 1].head()

Unnamed: 0,question,id,background,publish_time,close_time,tags,source_links,prediction_count,forecaster_count,answer,choices,status,qtype,crowd
5,Will Iran release Jason Rezaian before 31 October 2016?,G7,"For details of the case involving Jason Rezaian, the Washington Post correspondent being held in Iran, see: www. nytimes. com/2015/07/29/world/middleeast/irans-trial-of-jason-rezaian-illustrates-perils-faced-by-reporters. html www. cbsnews. com/news/lawyer-jason-rezaian-iran-free-wash...",2015-09-01 14:07:22.960000+00:00,2016-01-16 20:00:32+00:00,"[Foreign Policy, Security and Conflict]","[http://www.nytimes.com/2015/10/13/world/middleeast/jason-rezaian-washington-post-conviction-iran.html, http://www.nytimes.com/2015/10/17/opinion/what-iran-fears-from-reporters-like-jason-rezaian-and-me.html, http://www.businessinsider.com.au/iran-deal-jason-rezaian-2015-11, http://ira...",1283.0,423,yes,"[yes, no]",Resolved,t/f,"[{'timestamp': '2015-09-01 00:00:00+00:00', 'forecast': 0.5}, {'timestamp': '2015-09-02 00:28:05.032000+00:00', 'forecast': 0.15}, {'timestamp': '2015-09-02 01:24:29.625000+00:00', 'forecast': 0.15}, {'timestamp': '2015-09-02 07:42:27.854000+00:00', 'forecast': 0.29}, {'timestamp': '20..."
6,"Will North Korea launch a land based missile with the capacity to reach Alaska, Hawaii, or the continental United States before 1 January 2017?",G8,"A launch for military or testing purposes would count, i.e., www.nytimes.com/2012/12/24/world/asia/north-korean-rocket-had-military-purpose-seoul-says.html. The success of the launch, and the actual distance traveled, are irrelevant.",2015-09-01 14:10:57.372000+00:00,2016-02-06 22:00:38+00:00,[Security and Conflict],"[http://www.ctvnews.ca/world/north-korea-carries-out-long-range-rocket-test-1.2767400, http://nypost.com/2015/09/15/north-korea-says-it-has-restarted-all-nuclear-bomb-fuel-plants/, http://www.upi.com/Top_News/World-News/2015/11/12/North-Korea-collapse-is-unrealistic-magical-thinking-ex...",1582.0,517,yes,"[yes, no]",Resolved,t/f,"[{'timestamp': '2015-09-01 00:00:00+00:00', 'forecast': 0.5}, {'timestamp': '2015-09-02 01:41:35.573000+00:00', 'forecast': 0.2}, {'timestamp': '2015-09-02 17:45:43.718000+00:00', 'forecast': 0.25}, {'timestamp': '2015-09-03 06:47:21.372000+00:00', 'forecast': 0.25}, {'timestamp': '201..."
9,Will Congress pass a resolution disapproving the Joint Comprehensive Plan of Action?,G11,"In accordance with the Iran Nuclear Agreement Review Act, President Obama submitted the Joint Comprehensive Plan of Action (JCPoA) to Congress on 19 July 2015. Congress has sixty days to review the legislation and vote to approve or disapprove the agreement, but the President retains t...",2015-09-01 14:25:13.530000+00:00,2015-09-18 04:05:02+00:00,"[Foreign Policy, Security and Conflict, US Policy]","[http://www.blumenthal.senate.gov/newsroom/press/release/blumenthal-announces-support-for-iran-nuclear-agreement, http://abcnews.go.com/Politics/wireStory/democrat-mikulski-34th-senator-support-iran-nuke-deal-33479080, https://www.washingtonpost.com/blogs/plum-line/wp/2015/09/04/hey-de...",301.0,148,A,"[No, Yes, but the resolution will be vetoed by the President and the veto will stand, Yes, and the resolution will become law]",Resolved,mc,"[{'timestamp': '2015-09-01 00:00:00+00:00', 'forecast': 0.33330000000000004}, {'timestamp': '2015-09-01 18:32:57.629000+00:00', 'forecast': 0.3333333333333333}, {'timestamp': '2015-09-01 18:39:18.109000+00:00', 'forecast': 0.3333333333333333}, {'timestamp': '2015-09-01 18:58:21.218000+..."
13,Will Bashar al-Assad cease to be President of Syria before 1 March 2017?,G15,"A number of military setbacks have weakened Bashar al-Assad's grip on power and called into question the the commitment of his external supporters (BBC, NBC News). In the event that Assad reportedly disappears or flees the capital, the administrator will observe a three week waiting pe...",2015-09-01 14:37:52.668000+00:00,2017-02-28 00:00:00,"[Leader Entry/Exit, Security and Conflict]","[http://www.moderntimesworkplace.com/archives/ericsess/sessvol3/Ackoffp417.opd.pdf, http://www.reuters.com/article/us-mideast-crisis-russia-syria-envoy-idUSKCN0VR240, http://smallwarsjournal.com/blog/lawrence-and-his-message, http://www.msn.com/en-us/news/world/russia-says-it-doesnt-mi...",5509.0,1511,no,"[yes, no]",Resolved,t/f,"[{'timestamp': '2015-09-01 00:00:00+00:00', 'forecast': 0.5}, {'timestamp': '2015-09-02 01:38:44.069000+00:00', 'forecast': 0.325}, {'timestamp': '2015-09-03 11:46:03.901000+00:00', 'forecast': 0.325}, {'timestamp': '2015-09-04 17:56:35.831000+00:00', 'forecast': 0.30000000000000004}, ..."
15,Will Iran's President Hassan Rouhani meet Saudi Arabia's King Salman bin Abdulaziz Al Saud before 1 September 2016?,G17,"There is no indication that Rouhani and King Salman have met and tensions between the two countries remain high. Although Saudi Arabia and Iran continue to compete for influence in the Middle East, they have common interests in stabilizing oil prices, controlling domestic unrest, and f...",2015-09-07 13:07:57.732000+00:00,2016-08-31 17:00:56+00:00,"[Economic Policy, Foreign Policy, Security and Conflict]","[http://www.aljazeera.com/indepth/opinion/2015/12/china-opens-naval-base-africa-151222141545988.html, http://www.politico.com/story/2015/09/obama-salman-white-house-213304, http://www.msn.com/en-us/news/world/saudis-cut-ties-with-iran-following-shiite-cleric-execution/ar-AAgjyCH?li=BBn...",2081.0,784,no,"[yes, no]",Resolved,t/f,"[{'timestamp': '2015-09-08 00:00:00+00:00', 'forecast': 0.5}, {'timestamp': '2015-09-08 22:30:42.676000+00:00', 'forecast': 0.25}, {'timestamp': '2015-09-09 13:41:40.512000+00:00', 'forecast': 0.21}, {'timestamp': '2015-09-09 19:41:52.522000+00:00', 'forecast': 0.21}, {'timestamp': '20..."


In [32]:
autocast_questions.shape

(6532, 14)

In [33]:
THRESHOLD_TF = 0.2
THRESHOLD_MCQ = 0.1
THRESHOLD_NUM = 0.1

def cal_boundary(qtype, value, dic_threshold, boundary):
    if qtype == 'mc' or qtype == 't/f':
        if boundary == 'up':
            return value + dic_threshold[qtype]
        else:
            return value - dic_threshold[qtype]
    else: #qtype = num
        if boundary == 'up':
            return value * (1 + dic_threshold[qtype])
        else:
            return value * (1 - dic_threshold[qtype])
        
dic_threshold = {'t/f': THRESHOLD_TF,
                 'mc': THRESHOLD_MCQ,
                 'num': THRESHOLD_NUM}


l_tags_of_interest = ['Series — Forecasting AI Progress', 'Economy – US – Economic Indicators', 'Economy – US',
                      'Novel Coronavirus (Covid-19)', 'Security and Conflict']
l_filter = autocast_questions['tags'].apply(lambda x: 1 if any(tag in x for tag in l_tags_of_interest) else 0)

# df_filtered = autocast_questions[l_filter == 1]
df_filtered = autocast_questions.copy()

df_filtered['crowd'] = df_filtered['crowd'].apply(lambda x: [helper_integify(k) for k in x])
col_all_crowd_predictions = df_filtered['crowd'].apply(lambda x: [k['forecast'] for k in x])
df_filtered['avg_pred'] = col_all_crowd_predictions.apply(lambda x: sum(x)/len(x))
df_filtered['median_pred'] = col_all_crowd_predictions.apply(lambda x: np.median(x))
df_filtered['majority_pred'] = col_all_crowd_predictions.apply(lambda x: max(set(x), key = x.count))
df_filtered['pred taken'] = df_filtered['majority_pred']
df_filtered['acceptable pred lower boundary'] = df_filtered.apply(lambda row: cal_boundary(row['qtype'], row['majority_pred'], dic_threshold, 'lower'),
                                                                  axis = 1)
df_filtered['acceptable pred upper boundary'] = df_filtered.apply(lambda row: cal_boundary(row['qtype'], row['majority_pred'], dic_threshold, 'up'),
                                                                  axis = 1)

In [34]:
df_filtered.shape

(6532, 20)
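
The acceptance boundaries computed by `cal_boundary` are additive for `t/f` and `mc` questions and multiplicative for `num` questions. A vectorised sketch of the same rule on a toy frame follows. The rows are made up, and clipping the `t/f` and `mc` boundaries to [0, 1] is an added assumption (those predictions are treated as probabilities), not something the notebook does.

```python
import numpy as np
import pandas as pd

thresholds = {"t/f": 0.2, "mc": 0.1, "num": 0.1}

# Toy predictions, one per question type (hypothetical values)
toy = pd.DataFrame({"qtype": ["t/f", "mc", "num"], "pred": [0.15, 0.25, 40.0]})
tol = toy["qtype"].map(thresholds)
is_num = toy["qtype"].eq("num")

# Multiplicative band for num questions, additive band otherwise
lower = np.where(is_num, toy["pred"] * (1 - tol), toy["pred"] - tol)
upper = np.where(is_num, toy["pred"] * (1 + tol), toy["pred"] + tol)

# Assumption: t/f and mc predictions are probabilities, so keep them in [0, 1]
toy["lower"] = np.where(is_num, lower, np.clip(lower, 0.0, 1.0))
toy["upper"] = np.where(is_num, upper, np.clip(upper, 0.0, 1.0))
print(toy)
```

Without the clipping step the `t/f` lower boundary for a prediction of 0.15 would be -0.05.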

In [35]:
df_filtered.head(3)

Unnamed: 0,question,id,background,publish_time,close_time,tags,source_links,prediction_count,forecaster_count,answer,choices,status,qtype,crowd,avg_pred,median_pred,majority_pred,pred taken,acceptable pred lower boundary,acceptable pred upper boundary
0,What will the end-of-day closing value for the dollar against the renminbi be on 1 January 2016?,G1,"Outcome will be determined by the end-of-day closing value reported by Bloomberg, at http://www.bloomberg.com/quote/usdcny:cur. For historical trends, see http://www.bloomberg.com/quote/usdcny:cur/chart. For more information on China's economy see http://www.theworldin.com/article/10492.",2015-09-01 13:49:29.860000+00:00,2016-01-01 17:00:01+00:00,"[Finance, Economic Indicators]","[http://ftalphaville.ft.com/2015/08/17/2137329/what-are-chinese-capital-controls-really-part-2/, http://www.investmentweek.co.uk/investment-week/analysis/2427669/why-investors-need-to-remain-mindful-of-a-more-flexible-renminbi-regime, http://www.bbc.com/news/business-34825542, https://...",1549.0,385,D,"[Less than 6.30, Between 6.30 and 6.35, inclusive, More than 6.35 but less than 6.40, 6.40 or more]",Resolved,mc,"[{'timestamp': '2015-09-01 00:00:00+00:00', 'forecast': 0.270825}, {'timestamp': '2015-09-02 00:06:24.261000+00:00', 'forecast': 0.25}, {'timestamp': '2015-09-03 02:50:49.320000+00:00', 'forecast': 0.25}, {'timestamp': '2015-09-04 20:23:05.381000+00:00', 'forecast': 0.25}, {'timestamp'...",0.250054,0.25,0.25,0.25,0.15,0.35
1,How many seats will the Justice and Development Party (AKP) win in Turkey's snap elections?,G2,"The Justice and Development Party (AKP) failed to win a single-party majority in June's general election for the first time since 2002. After negotiations aimed at forming a coalition government collapsed, snap elections have been scheduled for 1 November. For more information see: www...",2015-09-01 13:54:25.050000+00:00,2015-11-01 22:00:20+00:00,"[Elections and Referenda, Non-US Politics]","[http://www.al-monitor.com/pulse/originals/2015/10/turkey-military-is-not-willing-to-intervene-politics-for-now.html, http://www.thestar.com.my/News/World/2015/10/22/Erdogan-seen-with-little-choice-but-to-share-power-after-Turkish-vote/, http://www.bbc.com/news/world-europe-34479873, h...",567.0,194,A,"[A majority, A plurality, Not a plurality]",Resolved,mc,"[{'timestamp': '2015-09-01 00:00:00+00:00', 'forecast': 0.33330000000000004}, {'timestamp': '2015-09-01 23:56:04.467000+00:00', 'forecast': 0.3333333333333333}, {'timestamp': '2015-09-02 00:45:01.353000+00:00', 'forecast': 0.3333333333333333}, {'timestamp': '2015-09-03 02:44:41.145000+...",0.333333,0.333333,0.333333,0.333333,0.233333,0.433333
2,Will there be an initial public offering on either the Shanghai Stock Exchange or the Shenzhen Stock Exchange before 1 January 2016?,G4,"China suspended initial public offerings (IPOs) in early July (http://www. bloomberg. com/news/articles/2015-07-04/china-stock-brokers-set-up-19-billion-fund-to-stem-market-rout , http://www. reuters. com/article/2015/07/05/us-china-markets-brokerage-pledge-idUSKCN0PE08E20150705 , http...",2015-09-01 13:58:30.138000+00:00,2015-11-30 14:00:15+00:00,[Finance],"[http://atimes.com/2015/11/china-will-allow-suspended-ipos-to-launch/, http://www.bloomberg.com/news/articles/2015-12-03/asian-futures-signal-more-stock-losses-on-draghi-disappointment, https://www.gjopen.com/comments/comments/50556, http://www.businessinsider.com/china-is-about-to-unl...",545.0,148,yes,"[yes, no]",Resolved,t/f,"[{'timestamp': '2015-09-01 00:00:00+00:00', 'forecast': 0.5}, {'timestamp': '2015-09-01 19:14:09.548000+00:00', 'forecast': 0.645}, {'timestamp': '2015-09-02 01:33:43.927000+00:00', 'forecast': 0.425}, {'timestamp': '2015-09-02 01:34:42.860000+00:00', 'forecast': 0.425}, {'timestamp': ...",0.22002,0.15,0.15,0.15,-0.05,0.35
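The third row above has a lower boundary of -0.05, which falls outside the range of a probability. If the boundaries are meant to be probabilities, clipping them to [0, 1] is one option. The sketch below shows this on a toy frame whose values are copied from the output above; whether the notebook's `cal_boundary` should clip at all is an assumption, not something the original code does.

```python
import pandas as pd

# Toy frame mirroring the three rows shown above (values copied from the output).
toy = pd.DataFrame({
    "acceptable pred lower boundary": [0.15, 0.233333, -0.05],
    "acceptable pred upper boundary": [0.35, 0.433333, 0.35],
})

bound_cols = ["acceptable pred lower boundary", "acceptable pred upper boundary"]

# Assumption: the boundaries are probabilities, so values outside [0, 1] are clipped.
clipped = toy.copy()
clipped[bound_cols] = clipped[bound_cols].clip(lower=0.0, upper=1.0)
print(clipped)
```

Clipping changes the width of the acceptance band for predictions near 0 or 1, so it may or may not be what the downstream evaluation wants.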


In [37]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6532 entries, 0 to 6531
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype              
---  ------                          --------------  -----              
 0   question                        6532 non-null   object             
 1   id                              6532 non-null   object             
 2   background                      6532 non-null   object             
 3   publish_time                    6532 non-null   datetime64[ns, UTC]
 4   close_time                      6532 non-null   object             
 5   tags                            6532 non-null   object             
 6   source_links                    6532 non-null   object             
 7   prediction_count                6532 non-null   float64            
 8   forecaster_count                6532 non-null   int64              
 9   answer                          3748 non-null   object             
 10  choices     

In [36]:
save_path = f"{DATA_DIR}/filtered_events_xl.json"

# save df_filtered to json
df_filtered.to_json(save_path, orient="records", lines=True)
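A file written with `orient="records", lines=True` is read back with the same arguments. The sketch below round-trips a small, made-up stand-in for `df_filtered` through a temporary file; the toy columns and the file name `toy_events.json` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_filtered, with one list-valued column like `tags`.
toy = pd.DataFrame({
    "id": ["G1", "G2"],
    "tags": [["Finance", "Economic Indicators"], ["Elections and Referenda"]],
    "majority_pred": [0.25, 0.333333],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_events.json")
    # One JSON object per line, matching the save call above.
    toy.to_json(path, orient="records", lines=True)
    reloaded = pd.read_json(path, orient="records", lines=True)

print(reloaded.dtypes)
```

Column dtypes, timestamps in particular, may not survive the round trip unchanged, so it is worth calling `.info()` on the reloaded frame before relying on it.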

## Accuracy Refinement

In [31]:
# out of the tags that have accuracy between 0.7 and 0.8, further refine the accuracy with tag combinations
N_MIN_CROWD_PRED = 5
min_n_samples_resolved_combi = 20
min_n_samples_open_combi = 10
min_highest_accu_value = 0.7

# helper function: given a df, produce all tags present
def get_all_tags(df):
    l_tags = list(df['tags'])
    l_tags = [item for sublist in l_tags for item in sublist]
    return list(set(l_tags))

l_tags_combi = []
l_highest_accuracy_metric = []
l_n_samples_resolved = []
l_n_samples_open = []

for tag in ['Economy – US', 'Novel Coronavirus (Covid-19)', 'Security and Conflict']:
    df_tag = helper_get_tag_events(autocast_questions, tag)
    l_tags = get_all_tags(df_tag)
    l_tags.remove(tag)
    for another_tag in l_tags:
        df_tagged = helper_get_tag_events(df_tag, another_tag)
        df_tag_active = helper_get_eligible_samples_v2(df_tagged[df_tagged['status'] == 'Active'], N_MIN_CROWD_PRED)
        df_tag_resolved = helper_get_eligible_samples_v2(df_tagged[df_tagged['status'] == 'Resolved'], N_MIN_CROWD_PRED)
        if len(df_tag_active) >= min_n_samples_open_combi and len(df_tag_resolved) >= min_n_samples_resolved_combi:
            dic_res = get_crowd_prediction_res(df_tag_resolved)
            max_accu = max(dic_res.values())
        
            l_tags_combi.append([tag, another_tag])
            l_highest_accuracy_metric.append({dict((v,k) for k,v in dic_res.items())[max_accu]:max_accu})
            l_n_samples_open.append(len(df_tag_active))
            l_n_samples_resolved.append(len(df_tag_resolved))

df_res = pd.DataFrame({'tag': l_tags_combi,
                       'highest accuracy metric': l_highest_accuracy_metric,
                       'n_samples_resolved': l_n_samples_resolved,
                       'n_samples_open': l_n_samples_open})
df_res['highest accuracy value'] = df_res['highest accuracy metric'].apply(lambda x: list(x.values())[0] if x is not None else None)
df_res.sort_values(by = ['highest accuracy value', 'n_samples_resolved', 'n_samples_open'],
                   ascending=False,
                   inplace=True)
display(df_res)

Unnamed: 0,tag,highest accuracy metric,n_samples_resolved,n_samples_open,highest accuracy value
0,"[Security and Conflict, Foreign Policy]",{'median_pred': 0.6057692307692307},104,31,0.605769
1,"[Security and Conflict, Non-US Politics]",{'majority_pred': 0.5294117647058824},51,21,0.529412
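The cell above finds the best-scoring aggregation method by inverting the results dict and looking up the maximum value. A more direct way to express the same lookup is `max` with a key function. The sketch below uses made-up scores shaped like the dict that `get_crowd_prediction_res` appears to return. It also applies the 0.7 threshold, which the cell declares as `min_highest_accu_value` but never uses; applying it here is an assumption about the intent.

```python
# Hypothetical accuracy scores keyed by aggregation method (illustrative values).
dic_res = {"avg_pred": 0.58, "median_pred": 0.61, "majority_pred": 0.55}

# Key with the largest value; on ties, max() returns the first such key.
best_metric = max(dic_res, key=dic_res.get)
best = {best_metric: dic_res[best_metric]}

# Assumed use of the threshold declared (but unused) in the cell above.
MIN_HIGHEST_ACCU_VALUE = 0.7
passes_threshold = dic_res[best_metric] >= MIN_HIGHEST_ACCU_VALUE

print(best, passes_threshold)
```

On ties the two approaches differ: the inverted-dict lookup keeps the last key with the maximum value, while `max` returns the first.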


In [42]:
# suggestion by Puya: develop a systematic methodology for context enrichment
# e.g. entropy increment

'''
Some retrieval ideas for extracting the parts of the dataset's background text that are most relevant to the question:
1. cosine similarity (between question and context) as a proxy for relevance
2. NER & coreference resolution
3. MIC (maximal information coefficient), a measure of dependence between two variables
4. LDA
'''

# example for testing
example = autocast_questions[autocast_questions['id'] == 'M10154']
context = list(example['background'])[0]
question = list(example['question'])[0]
print('question:\n' + question)
print('context:\n' + context)

# 1. cosine similarity
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased-finetuned-mrpc')
model = AutoModel.from_pretrained('bert-base-cased-finetuned-mrpc')

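def get_embedding(text):
    # Mean-pool the last hidden state into a single (1, hidden_size) vector.
    # Truncate so that long sentences stay within BERT's 512-token limit.
    input_ids = tokenizer.encode(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        embeddings = model(input_ids).last_hidden_state.mean(dim=1).numpy()
    return embeddings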

Q_embedding = get_embedding(question)
C_sentences = context.split('. ')
C_embeddings = [get_embedding(sentence) for sentence in C_sentences]

# Compute the cosine similarity between Q and each sentence in C
# (cosine_similarity returns a 1x1 array here, so take the scalar)
similarities = [cosine_similarity(Q_embedding, sentence_embedding)[0, 0] for sentence_embedding in C_embeddings]
# Find the index of the sentence with the highest similarity
most_similar_sentence_index = np.argmax(similarities)
# Print the most similar sentence
print('cosine similarity retrieval result:')
print(C_sentences[most_similar_sentence_index])


question:
Will there be a serious radiation incident at any nuclear plant in Ukraine by 2024?
context:
According to Wikipedia: "A nuclear and radiation accident is defined by the International Atomic Energy Agency (IAEA) as "an event that has led to significant consequences to people, the environment or the facility. Examples include lethal effect to individuals, large radioactivity release to the environment, reactor core melt." The prime example of a "major nuclear accident" is one in which a reactor core is damaged and significant amounts of radioactive isotopes are released, such as in the Chernobyl disaster in 1986 and Fukushima nuclear disaster in 2011. Russian military forces seized Chernobyl during the first day of the Ukrainian invasion as well as Zaporizhzhia, the largest nuclear plant of its kind in Europe, during the seventh day. Fighting near nuclear power plants could possibly mean an increased risk of a serious radiation incident. Thus we ask: Will there be a serious rad

Some weights of the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Thus we ask: Will there be a serious radiation incident at any nuclear plant in Ukraine by 2024? The question will be resolved positively if at any time between March 4, 2022 and December 31, 2023 the International Atomic Energy Agency reports, in connection with any nuclear power plant within the borders of Ukraine – as they stood in December 2021 – an accident of level 5, 6 or 7 of the International Nuclear and Radiological Event Scale.
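For comparison with the BERT-based retrieval above, a bag-of-words cosine similarity gives a dependency-free baseline for the same idea (rank context sentences by similarity to the question). This is a minimal sketch and not the notebook's method: the helper names `bow_vector`, `cosine` and `rank_sentences` are hypothetical, the context is an abridged version of the example above, and word counts stand in for learned embeddings.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lower-cased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(question, context):
    """Return (score, sentence) pairs sorted from most to least similar."""
    # Split after sentence-ending punctuation rather than on '. ' only.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    q_vec = bow_vector(question)
    scored = [(cosine(q_vec, bow_vector(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


question = "Will there be a serious radiation incident at any nuclear plant in Ukraine by 2024?"
# Abridged from the example context above.
context = (
    "Russian military forces seized Chernobyl during the first day of the Ukrainian invasion. "
    "Fighting near nuclear power plants could possibly mean an increased risk of a serious radiation incident. "
    "Thus we ask: Will there be a serious radiation incident at any nuclear plant in Ukraine by 2024?"
)

ranked = rank_sentences(question, context)
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```

As in the BERT run, the top hit is the sentence that restates the question, which suggests filtering out near-duplicates of the question before ranking.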


In [None]:
# pipeline step 2. few-shot sample selection

# external dataset (possibly online search) of relevant historical events

# 2.1 sample selection method to select the most performance-improving samples
# e.g. select the samples that individually increase the uncertainty of the results the most (measured by the variance between the top X answers)
# the method would be greedy:
# it guarantees the best individual performance, but guarantees nothing about the performance of the combination

# 2.2 sample ordering method to select the best way to order the samples
# e.g. the same uncertainty method
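The greedy selection idea in step 2.1 can be sketched as follows. Everything here is an assumption made for illustration: `predict_fn` is a hypothetical callable that takes a list of few-shot samples and returns answer probabilities for the target question, the stub outputs are made up, and the scoring follows the comment's wording (change in variance among the top X answers) without claiming it is the right sign convention.

```python
from statistics import pvariance


def top_x_variance(probs, x=2):
    """Variance among the x largest answer probabilities (the proxy named above)."""
    top = sorted(probs, reverse=True)[:x]
    return pvariance(top) if len(top) > 1 else 0.0


def select_greedy(candidates, predict_fn, k, x=2):
    """Score each candidate on its own and keep the k highest-scoring ones."""
    baseline = top_x_variance(predict_fn([]), x)
    scored = [(top_x_variance(predict_fn([c]), x) - baseline, c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


# Stub model for illustration only: maps each sample set to fixed probabilities.
fake_outputs = {
    (): [0.5, 0.3, 0.2],
    ("event_a",): [0.7, 0.2, 0.1],
    ("event_b",): [0.45, 0.35, 0.2],
    ("event_c",): [0.6, 0.25, 0.15],
}


def predict_fn(shots):
    return fake_outputs[tuple(shots)]


selected = select_greedy(["event_a", "event_b", "event_c"], predict_fn, k=2)
print(selected)
```

Because each candidate is scored in isolation, the selected set can still perform poorly in combination, which is the limitation the comment above points out.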

In [None]:
# some ideas to explore
# 1. to solve data leakage: machine unlearning
# 2. model forgetting memorised examples 
# We identify nondeterminism as a potential explanation, showing that deterministically trained models do not forget. Our results suggest that examples seen early when training with extremely large datasets—for instance those examples used to pre-train a model—may observe privacy benefits at the expense of examples seen later.
# https://arxiv.org/pdf/2207.00099.pdf
# 3. another idea: even for already resolved cases, measure rate of hallucination as a proxy for forgetting?

In [None]:
# to check whether the crowd predictions are a good proxy for the truth,
# analyse the resolved cases
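One way to start this analysis is to compare the final crowd forecast with the resolved answer for true/false questions. The sketch below assumes records shaped like the dataset rows shown earlier, with each t/f forecast being the crowd's probability of "yes"; that interpretation, the 0.5 threshold, the helper name `crowd_accuracy` and the toy records are all assumptions.

```python
def crowd_accuracy(records, threshold=0.5):
    """Share of resolved t/f questions whose final crowd forecast matches the answer."""
    hits, total = 0, 0
    for rec in records:
        if rec["status"] != "Resolved" or rec["qtype"] != "t/f" or not rec["crowd"]:
            continue
        # Assumption: the forecast is the crowd's probability of "yes".
        final_forecast = rec["crowd"][-1]["forecast"]
        predicted = "yes" if final_forecast >= threshold else "no"
        hits += predicted == rec["answer"]
        total += 1
    return hits / total if total else float("nan")


# Made-up records for illustration; not taken from the dataset.
toy_records = [
    {"status": "Resolved", "qtype": "t/f", "answer": "yes", "crowd": [{"forecast": 0.5}, {"forecast": 0.8}]},
    {"status": "Resolved", "qtype": "t/f", "answer": "no", "crowd": [{"forecast": 0.3}]},
    {"status": "Resolved", "qtype": "t/f", "answer": "yes", "crowd": [{"forecast": 0.4}]},
    {"status": "Active", "qtype": "t/f", "answer": None, "crowd": [{"forecast": 0.9}]},
    {"status": "Resolved", "qtype": "mc", "answer": "A", "crowd": [{"forecast": [0.6, 0.4]}]},
]

accuracy = crowd_accuracy(toy_records)
print(f"crowd accuracy on resolved t/f questions: {accuracy:.3f}")
```

Using only the final forecast favours the crowd, since late forecasts are made with the most information; averaging over the forecast history would be a stricter test.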