# Measuring Data Quality

https://www.aclweb.org/anthology/2020.fever-1.6.pdf

Contents:

1. Prepare the data
2. Cue Productivity and Coverage\
2.1. Create a balanced dataset \
2.2. Calculate applicability, productivity, coverage, utility metrics
3. Dataset-weighted Cue Information\
3.1. Prepare skip-grams \
3.2. Calculate weighted cue information on the skip-grams

# Prepare the data

The data comes from a CSV dump exported with Adminer.

In [7]:
# !python -m pip install pymysql --user

In [8]:
import pandas as pd
import os
from collections import Counter, OrderedDict
import json

In [9]:
!ls ../

ColBERT    claims_quality_metrics  student_teacher
README.md  getadd.sh		   test.py
anserini   pretraining		   xlmr_to_longformer


In [10]:
path = '/mnt/data/factcheck/fever/data-cs/fever-data'
files = [os.path.join(path, file) for file in os.listdir(path) if file.endswith(".jsonl")]

In [11]:
files

['/mnt/data/factcheck/fever/data-cs/fever-data/test.jsonl',
 '/mnt/data/factcheck/fever/data-cs/fever-data/train.jsonl',
 '/mnt/data/factcheck/fever/data-cs/fever-data/dev.jsonl']

In [12]:
# sql = open('../fcheck.sql').read()

# connection = sqlite3.connect(":memory:")  # create DB in RAM
# cursor = connection.cursor()

# df = pd.read_sql(sql, connection)

In [13]:
# conn = pymysql.connect(host='127.0.0.1',
#                        port=int(3306),
#                        user='fcheck',
#                        password='gVLKsbh9OK9sdgATY6DKcc17Do8p0g',
#                        database='fcheck')

# df = pd.read_sql_query("SELECT * FROM user", conn)

In [14]:
!ls /home/ryparmar/db

4-skip-unigram-dci.csv	claim_knowledge.csv  paragraph.csv	      user.csv
article.csv		evidence.csv	     paragraph_knowledge.csv
average_route_time.csv	export_uni_cv.csv    time_spent.csv
claim.csv		label.csv	     unigram-dci.csv


In [15]:
claims = pd.read_csv('/home/ryparmar/db/claim.csv')
claims.dropna(subset=['id', 'claim'], inplace=True)
claims.head()

Unnamed: 0,id,user,claim,paragraph,mutated_from,mutation_type,ners,sandbox,labelled,created_at,updated_at
0,2.0,3,Ben Sheets první zápas v MLB prohrál.,,,,,,,,
2,3.0,3,Baseballový tým USA porazil na olympijských hr...,,,,,,,,
4,4.0,3,USA na olympijských hrách v Sydney získalo v b...,,,,,,,,
6,5.0,3,ArcelorMittal Ostrava přispěje zaměstnancům na...,,,,,,,,
8,6.0,3,Věra Breiová je mluvčí společnosti ArcelorMitt...,172.0,,,,0.0,0.0,1604445000.0,1604445000.0


In [16]:
len(claims)

2259

In [17]:
claims['claim'].value_counts(sort=True)

Dánský premiér bude usilovat o pozici generálního tajemníka Severoatlantické aliance.          2
Robert Pelikán vyjádřil znepokojení nad Okamurovými výroky.                                    2
Maďarský prezident Sólyom neodcestoval 21. sprna 2009 na svou soukromou cestu do Slovenska.    2
Posádka raketoplánu Columbie čítala 7 lidí.                                                    2
Nejen Lidovci volají po odvolání Tomia Okamury z pozice místopředsedy Sněmovny.                2
                                                                                              ..
Ukrajina někdejšího gruzínského prezidenta vyhostila do Varšavy.                               1
Druhém kolo voleb bylo pro Kuncaře fatální.                                                    1
Dánský premiér Anders Fogh Rasmussen odmítl kandidovat na post generálního tajemníka NATO.     1
Pirátská kopie filmu Osm hrozných unikla na internet. ↵                                        1
Česká skupina Energo-PRO chce 

In [18]:
claims[claims.claim == 'Nejen Lidovci volají po odvolání Tomia Okamury z pozice místopředsedy Sněmovny.']

Unnamed: 0,id,user,claim,paragraph,mutated_from,mutation_type,ners,sandbox,labelled,created_at,updated_at
443,405.0,66,Nejen Lidovci volají po odvolání Tomia Okamury...,,,,,,,,
450,410.0,66,Nejen Lidovci volají po odvolání Tomia Okamury...,891.0,405.0,specific,"Tomia Okamury,Sněmovny",0.0,1.0,1606490000.0,1606490000.0


In [19]:
labels = pd.read_csv('/home/ryparmar/db/label.csv')
labels.head()

Unnamed: 0,id,user,claim,label,sandbox,oracle,flag,condition,created_at,updated_at,deleted
0,2,5,8,SUPPORTS,0,0,0,,1604492144,1604492144,0
1,3,5,2,SUPPORTS,0,0,0,,1604492158,1604492158,0
2,4,5,9,SUPPORTS,0,0,0,,1604492193,1604492193,0
3,5,6,4,SUPPORTS,0,0,0,,1604514332,1604514332,0
4,7,7,3,SUPPORTS,0,0,0,,1604589242,1604589242,0


In [20]:
len(labels)

1644

In [21]:
labels.claim.value_counts(sort=True)

545     6
136     6
142     6
644     5
303     5
       ..
1140    1
1139    1
1137    1
1136    1
2       1
Name: claim, Length: 1054, dtype: int64

In [22]:
labels[labels.claim == 545]

Unnamed: 0,id,user,claim,label,sandbox,oracle,flag,condition,created_at,updated_at,deleted
703,709,9,545,SUPPORTS,0,0,0,,1607202400,1607202400,0
829,835,62,545,SUPPORTS,0,1,0,,1607346370,1607346370,0
932,938,37,545,SUPPORTS,0,0,0,,1607367394,1607367394,0
1123,1129,40,545,SUPPORTS,0,0,0,,1607383646,1607383646,0
1226,1232,36,545,SUPPORTS,0,0,0,,1607416261,1607416261,0
1607,1613,17,545,SUPPORTS,0,0,0,,1607513079,1607513079,0


In [23]:
claims[claims.id==545]

Unnamed: 0,id,user,claim,paragraph,mutated_from,mutation_type,ners,sandbox,labelled,created_at,updated_at
596,545.0,62,Zásilkový obchod v ČR oproti západní Evropě a ...,15297,539,specific,"ČR,USA,Evropě",0.0,1.0,1606816000.0,1606816000.0


In [24]:
labels['label'].value_counts()

SUPPORTS           811
REFUTES            587
NOT ENOUGH INFO    195
Name: label, dtype: int64

In [25]:
print(f"#claims: {len(claims.claim.unique())} #labels: {len(labels.claim.unique())}")

#claims: 2247 #labels: 1054


In [26]:
# overwrite the label id with the claim id so that labels can be joined to claims on 'id'
labels.id = labels.claim
df = claims.merge(labels[['id', 'label']], on='id', how='left')
df = df.dropna(subset=['claim', 'label'])
df = df[['id', 'claim', 'label']]
df.reset_index(inplace=True, drop=True)

# del claims, labels

In [27]:
df.head()

Unnamed: 0,id,claim,label
0,2.0,Ben Sheets první zápas v MLB prohrál.,SUPPORTS
1,3.0,Baseballový tým USA porazil na olympijských hr...,SUPPORTS
2,4.0,USA na olympijských hrách v Sydney získalo v b...,SUPPORTS
3,7.0,ArcelorMittal Ostrava finančně podpoří odvykán...,SUPPORTS
4,7.0,ArcelorMittal Ostrava finančně podpoří odvykán...,SUPPORTS


In [28]:
df.count()

id       1593
claim    1593
label    1593
dtype: int64

In [29]:
print(f"Unique claims: {len(df.claim.unique())}\nTotal labels: {len(df)}")

Unique claims: 1018
Total labels: 1593


In [30]:
df.head(10)

Unnamed: 0,id,claim,label
0,2.0,Ben Sheets první zápas v MLB prohrál.,SUPPORTS
1,3.0,Baseballový tým USA porazil na olympijských hr...,SUPPORTS
2,4.0,USA na olympijských hrách v Sydney získalo v b...,SUPPORTS
3,7.0,ArcelorMittal Ostrava finančně podpoří odvykán...,SUPPORTS
4,7.0,ArcelorMittal Ostrava finančně podpoří odvykán...,SUPPORTS
5,8.0,Německá firma Bühler Motor investovala v Hradc...,SUPPORTS
6,9.0,Německá firma Bühler Motor má v Hradci Králové...,SUPPORTS
7,11.0,Německá firma Bühler Motor vyrábí elektropohony.,SUPPORTS
8,22.0,Společnost Bühler Motor investovala v Hradci K...,SUPPORTS
9,23.0,Christof Furtwängler je jednatelem firmy Bühle...,SUPPORTS


# Cue Productivity and Coverage

## Sample a balanced dataset

This approach assumes a dataset that is balanced with regard to label frequency. On an imbalanced dataset, a cue's productivity would be dominated by the most frequent label, not because the cue is actually more likely to appear in claims with that label, but simply because that label is more frequent overall.

The paper undersamples the majority classes and repeats the sampling 10 times to obtain a more robust estimate. Here I compute the metrics both across 10 such samples (in the manner of 10-fold CV) and on all the data.
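
As a minimal sketch of the repeated undersampling described above, the following pure-Python version draws several balanced subsets with a fixed seed. The function name `balanced_samples`, the `seed` parameter, and the toy rows are assumptions made for illustration; the notebook itself uses `DataFrame.sample` without a seed.

```python
import random
from collections import defaultdict


def balanced_samples(rows, num_samples=10, seed=0):
    """Draw `num_samples` balanced subsets from (claim, label) pairs.

    Every label is undersampled to the size of the rarest label.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    by_label = defaultdict(list)
    for claim, label in rows:
        by_label[label].append((claim, label))
    size = min(len(group) for group in by_label.values())  # minority-class size
    samples = []
    for _ in range(num_samples):
        sample = []
        for label in sorted(by_label):
            # sample without replacement within each label
            sample.extend(rng.sample(by_label[label], size))
        samples.append(sample)
    return samples


# Toy, imbalanced data: 8 SUPPORTS, 5 REFUTES, 2 NOT ENOUGH INFO
rows = ([(f"s{i}", "SUPPORTS") for i in range(8)]
        + [(f"r{i}", "REFUTES") for i in range(5)]
        + [(f"n{i}", "NOT ENOUGH INFO") for i in range(2)])
samples = balanced_samples(rows, num_samples=3)
```

Note that the rarest class is included in full in every sample, so only the majority classes vary between samples.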

In [31]:
SAMPLE_SIZE = min(df['label'].value_counts())
NUM_SAMPLES = 10
SAMPLES = []
for i in range(NUM_SAMPLES):
    df_to_join = []
    for label in df.label.unique():
        df_to_join.append(df[df.label == label].sample(SAMPLE_SIZE)[['claim', 'label']])

    SAMPLES.append(pd.concat(df_to_join).reset_index(drop=True))

In [32]:
SAMPLES[0].head()

Unnamed: 0,claim,label
0,Internetový obchod v ČR je ve srovnání se zápa...,SUPPORTS
1,25 členů dětského parlamentu bude nápomocno mě...,SUPPORTS
2,Radnice v Postoloprtech zachraňuje místní kapli.,SUPPORTS
3,Festival amatérských filmů se tradičně pořádá ...,SUPPORTS
4,"Když se objeví křeče z přehřátí, je dobré zůst...",SUPPORTS


## Cues are represented by unigrams and bigrams

A cue is a bias in the data that lets an ML model make its decision based on the cue alone. For example, the word 'not' in English data might occur in claims with the following label distribution: 80% REFUTES, 5% SUPPORTS, 15% NOT ENOUGH INFO. That looks like a bias toward the REFUTES class: a model would predict REFUTES whenever 'not' is present, which is undesirable.
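
To make the idea concrete, here is a small sketch that computes the label distribution of a single cue over a handful of invented English claims. The helper `label_distribution` and the toy claims are hypothetical and are not part of the dataset.

```python
from collections import Counter


def label_distribution(claims, cue):
    """Share of each label among the claims that contain `cue` as a token."""
    counts = Counter(label for text, label in claims
                     if cue in text.lower().split())
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Invented toy claims, for illustration only
toy_claims = [
    ("The film was not released in 2010", "REFUTES"),
    ("The river does not flow north", "REFUTES"),
    ("She did not win the award", "REFUTES"),
    ("The bridge is not open", "SUPPORTS"),
    ("The city hosts a festival", "SUPPORTS"),
]
dist = label_distribution(toy_claims, "not")
print(dist)
```

A distribution that is heavily skewed toward one label, as here, is what marks a token as a potential cue.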

In [33]:
LABELS = len(df['label'].unique())

In [34]:
# all data
unigrams_all = [[ii.strip('.') for ii in c.split()] for c in df['claim'] if isinstance(c, str)]
# unigrams = [([ii.strip('.') for ii in c.split()], df['label'][i]) for i, c in enumerate(df['claim']) if isinstance(c, str)]

bigrams_all = [[i.strip('.') + ' ' + ii.strip('.') 
            for i, ii in zip(c.split()[:-1], c.split()[1:])] 
           for c in df['claim'] if isinstance(c, str)]

# per samples / k-fold cross validation like
unigrams, bigrams = [], []
for sample in SAMPLES:
    unigrams.append([[ii.strip('.') for ii in c.split()] for c in sample['claim'] if isinstance(c, str)])
    bigrams.append([[i.strip('.') + ' ' + ii.strip('.')
                     for i, ii in zip(c.split()[:-1], c.split()[1:])]
                    for c in sample['claim'] if isinstance(c, str)])

In [35]:
unigrams[0][:2]

[['Internetový',
  'obchod',
  'v',
  'ČR',
  'je',
  've',
  'srovnání',
  'se',
  'západní',
  'Evropou',
  'nižší'],
 ['25',
  'členů',
  'dětského',
  'parlamentu',
  'bude',
  'nápomocno',
  'městským',
  'radním']]

In [36]:
bigrams[0][:2]

[['Internetový obchod',
  'obchod v',
  'v ČR',
  'ČR je',
  'je ve',
  've srovnání',
  'srovnání se',
  'se západní',
  'západní Evropou',
  'Evropou nižší'],
 ['25 členů',
  'členů dětského',
  'dětského parlamentu',
  'parlamentu bude',
  'bude nápomocno',
  'nápomocno městským',
  'městským radním']]

## Let's also try wordpieces as cues

In [37]:
import transformers

tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-multilingual-cased')

In [50]:
wordpieces_all = [tokenizer.tokenize(c.rstrip('.')) for c in df['claim'] if isinstance(c, str)]

In [51]:
len(wordpieces_all)

1593

In [52]:
wordpieces_all

[['Ben', 'She', '##ets', 'první', 'zápas', 'v', 'MLB', 'pro', '##hr', '##ál'],
 ['Baseball',
  '##ový',
  'tým',
  'USA',
  'por',
  '##azil',
  'na',
  'olympijských',
  'hrách',
  'v',
  'Sydney',
  'hráč',
  '##e',
  'Ku',
  '##by'],
 ['USA',
  'na',
  'olympijských',
  'hrách',
  'v',
  'Sydney',
  'získal',
  '##o',
  'v',
  'baseball',
  '##u',
  'zlato',
  '##u',
  'medaili'],
 ['Arc',
  '##elor',
  '##M',
  '##itt',
  '##al',
  'Ostrava',
  'fina',
  '##n',
  '##čně',
  'pod',
  '##po',
  '##ří',
  'od',
  '##vy',
  '##kání',
  'ko',
  '##u',
  '##ření',
  'u',
  'svých',
  'za',
  '##mě',
  '##st',
  '##nan',
  '##ců'],
 ['Arc',
  '##elor',
  '##M',
  '##itt',
  '##al',
  'Ostrava',
  'fina',
  '##n',
  '##čně',
  'pod',
  '##po',
  '##ří',
  'od',
  '##vy',
  '##kání',
  'ko',
  '##u',
  '##ření',
  'u',
  'svých',
  'za',
  '##mě',
  '##st',
  '##nan',
  '##ců'],
 ['N',
  '##ě',
  '##me',
  '##cká',
  'firma',
  'B',
  '##ühle',
  '##r',
  'Motor',
  'in',
  '##vest',
  '##o

In [53]:
wordpieces = []
for sample in SAMPLES:
    wordpieces.append([tokenizer.tokenize(claim.rstrip('.')) 
                       for claim in sample['claim'] if isinstance(claim, str)])

In [54]:
# for simplicity, reuse the `bigrams` variable for wordpieces;
# the per-sample "bigram" results (RES_BIG) are therefore computed on wordpieces
bigrams = wordpieces

## Cue Applicability = the absolute number of claims in the dataset that contain the cue, irrespective of their label
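
A compact sketch of the same count on toy data: each cue is counted at most once per claim, which is why every claim is turned into a set first. The function name `applicability` and the toy token lists are assumptions for illustration.

```python
from collections import Counter


def applicability(tokenized_claims):
    """Number of claims each cue appears in (not the number of occurrences)."""
    counts = Counter()
    for tokens in tokenized_claims:
        counts.update(set(tokens))  # set(): count a cue once per claim
    return counts


toy = [
    ["v", "Praze", "v", "roce"],  # 'v' occurs twice but counts once
    ["v", "roce", "2009"],
    ["Praha", "je", "město"],
]
app = applicability(toy)
print(app["v"], app["roce"], app["Praha"])
```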

In [38]:
# all data
tmp = []
for ii in [set(i) for i in unigrams_all]:
    tmp += ii
applicability_uni_all = Counter(tmp)

tmp = []
for ii in [set(i) for i in bigrams_all]:
    tmp += ii
applicability_big_all = Counter(tmp)


# per samples / k-fold cross validation like
applicability_uni = []
for u in unigrams:
    tmp = []
    for ii in [set(i) for i in u]:
        tmp += ii
    applicability_uni.append(Counter(tmp))

applicability_big = []
for b in bigrams:
    tmp = []
    for ii in [set(i) for i in b]:
        tmp += ii
    applicability_big.append(Counter(tmp))

In [39]:
applicability_uni[0]

Counter({'nižší': 2,
         'v': 185,
         'Internetový': 2,
         'se': 88,
         'srovnání': 3,
         'západní': 4,
         'Evropou': 3,
         've': 31,
         'obchod': 4,
         'je': 61,
         'ČR': 10,
         'bude': 21,
         'parlamentu': 5,
         'dětského': 1,
         'městským': 1,
         '25': 1,
         'nápomocno': 1,
         'radním': 1,
         'členů': 1,
         'Postoloprtech': 3,
         'zachraňuje': 5,
         'kapli': 7,
         'Radnice': 11,
         'místní': 2,
         'od': 14,
         'pořádá': 2,
         '1960': 1,
         'Festival': 2,
         'roku': 19,
         'tradičně': 1,
         'filmů': 4,
         'amatérských': 1,
         'dále': 1,
         'dobré': 1,
         'zůstat': 1,
         'Když': 3,
         'předešlé': 1,
         'objeví': 3,
         'z': 44,
         'přehřátí,': 3,
         'horku': 3,
         'věnovat': 1,
         'aktivitě': 2,
         'a': 43,
         'křeče': 5,
     

## Cue Productivity = the relative frequency of the most common label among the claims that contain the cue

With 3 possible labels, productivity lies in the range [1/3, 1].

In practical terms, productivity is the chance that a model labels a claim correctly by assigning it the most common label of a given cue in the claim.
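
As a worked example, productivity can be computed directly from the per-label counts of one cue. The helper `productivity` and the counts below are hypothetical; the first call uses an invented, strongly skewed cue and the second a cue seen once with each label.

```python
def productivity(label_counts):
    """Share of the most common label among the claims containing the cue."""
    total = sum(label_counts.values())  # equals the cue's applicability
    return max(label_counts.values()) / total


# Hypothetical counts for a strongly biased cue
p_skewed = productivity({"REFUTES": 80, "SUPPORTS": 5, "NOT ENOUGH INFO": 15})
# A cue seen once with each label sits at the lower bound of 1/3
p_uniform = productivity({"REFUTES": 1, "SUPPORTS": 1, "NOT ENOUGH INFO": 1})
print(p_skewed, p_uniform)
```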

In [40]:
productivity_uni_all = []
productivity_big_all = []

productivity_uni = []
productivity_big = []

# from collections import Counter
# from itertools import islice

# def count_ngrams(iterable,n=2):
#     return Counter(zip(*[islice(iterable,i,None) for i in range(n)]))

In [41]:
# counts = {}
# for i, words in enumerate([set(i) for i in unigrams]):
#     for w in words:
#         if w not in counts:
#             counts[w] = {}
#         if df['label'][i] not in counts[w]:
#             counts[w][df['label'][i]] = 1
#         else:
#             counts[w][df['label'][i]] += 1

In [42]:
def get_max(values: dict) -> (str, int):
    label, max_count = None, 0
    for k, v in values.items():
        if v > max_count:
            max_count = v
            label = k
    return label, max_count

In [43]:
# all data
counts_uni_all = {}  # label distribution for each cue
for i, words in enumerate([set(i) for i in unigrams_all]):
    for w in words:
        if w not in counts_uni_all:
            counts_uni_all[w] = {}
        if df['label'][i] not in counts_uni_all[w]:
            counts_uni_all[w][df['label'][i]] = 1
        else:
            counts_uni_all[w][df['label'][i]] += 1
max_counts_uni_all = {k: get_max(v) for k, v in counts_uni_all.items()}

counts_big_all = {}  # label distribution for each cue
for i, words in enumerate([set(i) for i in bigrams_all]):
    for w in words:
        if w not in counts_big_all:
            counts_big_all[w] = {}
        if df['label'][i] not in counts_big_all[w]:
            counts_big_all[w][df['label'][i]] = 1
        else:
            counts_big_all[w][df['label'][i]] += 1
max_counts_big_all = {k: get_max(v) for k, v in counts_big_all.items()}


# per samples / k-fold cross validation like
max_counts_uni = []
for s, sample in enumerate(SAMPLES):
    counts = {}
    for i, words in enumerate([set(i) for i in unigrams[s]]):
        for w in words:
            if w not in counts:
                counts[w] = {}
            if sample['label'][i] not in counts[w]:
                counts[w][sample['label'][i]] = 1
            else:
                counts[w][sample['label'][i]] += 1
    
    max_counts_uni.append({k: get_max(v) for k, v in counts.items()})
    
max_counts_big = []
for s, sample in enumerate(SAMPLES):
    counts = {}
    for i, words in enumerate([set(i) for i in bigrams[s]]):
        for w in words:
            if w not in counts:
                counts[w] = {}
            if sample['label'][i] not in counts[w]:
                counts[w][sample['label'][i]] = 1
            else:
                counts[w][sample['label'][i]] += 1

    max_counts_big.append({k: get_max(v) for k, v in counts.items()})

In [44]:
max_counts_uni[0]

{'nižší': ('SUPPORTS', 1),
 'v': ('SUPPORTS', 67),
 'Internetový': ('SUPPORTS', 1),
 'se': ('REFUTES', 33),
 'srovnání': ('SUPPORTS', 1),
 'západní': ('SUPPORTS', 2),
 'Evropou': ('SUPPORTS', 1),
 've': ('NOT ENOUGH INFO', 13),
 'obchod': ('SUPPORTS', 2),
 'je': ('REFUTES', 26),
 'ČR': ('SUPPORTS', 5),
 'bude': ('SUPPORTS', 9),
 'parlamentu': ('SUPPORTS', 3),
 'dětského': ('SUPPORTS', 1),
 'městským': ('SUPPORTS', 1),
 '25': ('SUPPORTS', 1),
 'nápomocno': ('SUPPORTS', 1),
 'radním': ('SUPPORTS', 1),
 'členů': ('SUPPORTS', 1),
 'Postoloprtech': ('REFUTES', 2),
 'zachraňuje': ('NOT ENOUGH INFO', 4),
 'kapli': ('REFUTES', 4),
 'Radnice': ('NOT ENOUGH INFO', 5),
 'místní': ('SUPPORTS', 1),
 'od': ('SUPPORTS', 6),
 'pořádá': ('SUPPORTS', 1),
 '1960': ('SUPPORTS', 1),
 'Festival': ('SUPPORTS', 1),
 'roku': ('SUPPORTS', 8),
 'tradičně': ('SUPPORTS', 1),
 'filmů': ('SUPPORTS', 2),
 'amatérských': ('SUPPORTS', 1),
 'dále': ('SUPPORTS', 1),
 'dobré': ('SUPPORTS', 1),
 'zůstat': ('SUPPORTS', 1),


In [45]:
# all data
productivity_uni_all = {k: v[1] / applicability_uni_all[k] for k, v in max_counts_uni_all.items()} 
productivity_uni_all = OrderedDict(sorted(productivity_uni_all.items(), key=lambda kv: kv[1], reverse=True))

productivity_big_all = {k: v[1] / applicability_big_all[k] for k, v in max_counts_big_all.items()} 
productivity_big_all = OrderedDict(sorted(productivity_big_all.items(), key=lambda kv: kv[1], reverse=True))


# per samples / k-fold cross validation like
productivity_uni = [
                    {k: v[1] / applicability_uni[i][k] for k, v in max_counts_uni[i].items()} 
                    for i in range(NUM_SAMPLES)]
productivity_uni = [
                    OrderedDict(sorted(productivity_uni[i].items(), key=lambda kv: kv[1], reverse=True)) 
                    for i in range(NUM_SAMPLES)]

productivity_big = [
                    {k: v[1] / applicability_big[i][k] for k, v in max_counts_big[i].items()} 
                    for i in range(NUM_SAMPLES)]
productivity_big = [
                    OrderedDict(sorted(productivity_big[i].items(), key=lambda kv: kv[1], reverse=True)) 
                    for i in range(NUM_SAMPLES)]

## Cue Coverage = applicability of a cue / total number of claims

In [46]:
# all data
coverage_uni_all = {k: v / len(df) for k, v in applicability_uni_all.items()}
sorted_d = OrderedDict(sorted(coverage_uni_all.items(), key=lambda kv: kv[1], reverse=True))

coverage_big_all = {k: v / len(df) for k, v in applicability_big_all.items()}
sorted_d = OrderedDict(sorted(coverage_big_all.items(), key=lambda kv: kv[1], reverse=True))

# per samples / k-fold cross validation like
coverage_uni = [
                {k: v / len(SAMPLES[i]) for k, v in applicability_uni[i].items()} 
                for i in range(NUM_SAMPLES)]

sorted_d = [
            OrderedDict(sorted(coverage_uni[i].items(), key=lambda kv: kv[1], reverse=True)) 
            for i in range(NUM_SAMPLES)]

coverage_big = [
                {k: v / len(SAMPLES[i]) for k, v in applicability_big[i].items()} 
                for i in range(NUM_SAMPLES)]

sorted_d = [
            OrderedDict(sorted(coverage_big[i].items(), key=lambda kv: kv[1], reverse=True)) 
            for i in range(NUM_SAMPLES)]

## Cue Utility (for an ML algorithm: the higher the utility, the easier the decision)

Utility is the metric to use when comparing the metrics above between datasets. A cue is only useful to a machine learning model if productivity_k > 1 / m, where m is the number of possible labels (here m = 3: SUPPORTS, REFUTES, NOT ENOUGH INFO), so utility is defined as productivity_k - 1 / m.
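
A short sketch of coverage and utility as plain functions. The function names are assumptions; the example values are chosen to match numbers that appear in this notebook: a cue with productivity 1.0, and a cue present in a single claim of a balanced sample of 585 claims (3 labels x 195 claims).

```python
def coverage(applicability, num_claims):
    """Fraction of claims that contain the cue."""
    return applicability / num_claims


def utility(productivity, num_labels=3):
    """Productivity above the 1/m chance level for m labels."""
    return productivity - 1 / num_labels


u = utility(1.0)      # a perfectly productive cue
c = coverage(1, 585)  # a cue found in one claim of a 585-claim sample
print(round(u, 6), round(c, 6))
```

A cue at chance level (productivity 1/3 with three labels) has a utility of zero, which is what makes the value comparable across datasets with different numbers of labels.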

In [47]:
# all data
utility_uni_all = {k: v - 1/LABELS for k, v in productivity_uni_all.items()}

utility_big_all = {k: v - 1/LABELS for k, v in productivity_big_all.items()}


# per samples / k-fold cross validation like
utility_uni = [{k: v - 1/LABELS for k, v in productivity_uni[i].items()} 
              for i in range(NUM_SAMPLES)]

utility_big = [{k: v - 1/LABELS for k, v in productivity_big[i].items()} 
              for i in range(NUM_SAMPLES)]

In [51]:
utility_uni[0]['dále']

0.6666666666666667

## Results

How to read the results:
- productivity = how strong the potential bias is; with our 3 labels, 1/3 corresponds to pure chance
- utility = productivity adjusted for cross-dataset comparison
- coverage = how widespread the cue is
- harmonic mean = harmonic mean of productivity and coverage (as computed in the code); the higher the harmonic mean, the higher the risk of bias in the data
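
The harmonic mean used in the result tables can be checked by hand. The sketch below applies the same formula as the notebook's `lambda` (2 / (1/productivity + 1/coverage)) to the all-data values reported for the unigram 'v'; the function name `harmonic_mean` is an assumption.

```python
def harmonic_mean(productivity, coverage):
    """Harmonic mean of productivity and coverage, both assumed to be > 0."""
    return 2 / (1 / productivity + 1 / coverage)


# All-data values reported for the unigram 'v'
hm_v = harmonic_mean(0.545267, 0.305085)
print(round(hm_v, 4))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a cue only ranks highly when it is both productive and widespread.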

### Unigrams

In [66]:
res_uni = []
for i in range(NUM_SAMPLES):
    tmp = pd.DataFrame.from_dict(productivity_uni[i], orient='index', columns=['productivity']).join(
        [pd.DataFrame.from_dict(utility_uni[i], orient='index', columns=['utility']),
         pd.DataFrame.from_dict(coverage_uni[i], orient='index', columns=['coverage'])])

    tmp['harmonic_mean'] = tmp.apply(lambda x: 2 / (1/x['productivity'] + 1/x['coverage']), axis=1)
    res_uni.append(tmp)

In [67]:
RES = res_uni[0]
for i in range(1, NUM_SAMPLES):
    RES = RES.add(res_uni[i], fill_value=0)
RES = RES.div(NUM_SAMPLES)

In [68]:
RES.sort_values('harmonic_mean', ascending=False).round(4)[:20]

Unnamed: 0,productivity,utility,coverage,harmonic_mean
v,0.3619,0.0286,0.3007,0.3283
se,0.384,0.0506,0.1458,0.2109
na,0.3716,0.0383,0.1258,0.1878
V,0.3943,0.0609,0.1156,0.1782
je,0.3987,0.0654,0.1021,0.1624
a,0.4501,0.1168,0.0821,0.1386
z,0.4441,0.1108,0.0723,0.1242
o,0.467,0.1337,0.0463,0.0841
ve,0.4912,0.1579,0.046,0.0837
za,0.4467,0.1134,0.045,0.0817


In [69]:
RES.sort_values('productivity', ascending=False)[:10]

Unnamed: 0,productivity,utility,coverage,harmonic_mean
bezpečí,1.0,0.666667,0.001709,0.003413
"Asii,",1.0,0.666667,0.001709,0.003413
zvláštní,1.0,0.666667,0.001709,0.003413
premiéra,1.0,0.666667,0.001709,0.003413
odstraňuje,1.0,0.666667,0.001709,0.003413
Brazílii,1.0,0.666667,0.003419,0.006814
Jihomoravská,1.0,0.666667,0.001709,0.003413
Opavou,1.0,0.666667,0.003419,0.006814
unikla,1.0,0.666667,0.001709,0.003413
mimosoudní,1.0,0.666667,0.001709,0.003413


In [70]:
RES.sort_values('coverage', ascending=False)[:10]

Unnamed: 0,productivity,utility,coverage,harmonic_mean
v,0.361934,0.028601,0.300684,0.328295
se,0.383962,0.050629,0.145812,0.210905
na,0.371599,0.038266,0.125812,0.187843
V,0.394264,0.060931,0.115556,0.178248
je,0.398745,0.065412,0.102051,0.162355
a,0.450088,0.116755,0.082051,0.138633
z,0.444085,0.110752,0.072308,0.124166
o,0.467028,0.133695,0.046325,0.084122
ve,0.49119,0.157857,0.045983,0.083687
za,0.446729,0.113395,0.044957,0.081652


In [71]:
# most common label (and its count) for the cue 'v' in each sample
for fold in max_counts_uni:
    for k,v in fold.items():
        if k == 'v': print(v)

('SUPPORTS', 70)
('SUPPORTS', 59)
('SUPPORTS', 66)
('SUPPORTS', 63)
('SUPPORTS', 68)
('SUPPORTS', 69)
('REFUTES', 62)
('REFUTES', 64)
('SUPPORTS', 58)
('NOT ENOUGH INFO', 58)


In [72]:
RES_ALL = pd.DataFrame.from_dict(productivity_uni_all, orient='index', columns=['productivity']).join(
        [pd.DataFrame.from_dict(utility_uni_all, orient='index', columns=['utility']),
         pd.DataFrame.from_dict(coverage_uni_all, orient='index', columns=['coverage'])])

RES_ALL['harmonic_mean'] = RES_ALL.apply(lambda x: 2 / (1/x['productivity'] + 1/x['coverage']), axis=1)

In [73]:
RES_ALL.sort_values('harmonic_mean', ascending=False)

Unnamed: 0,productivity,utility,coverage,harmonic_mean
v,0.545267,0.211934,0.305085,0.391256
se,0.531915,0.198582,0.147520,0.230981
na,0.526066,0.192733,0.132454,0.211625
V,0.550562,0.217228,0.111739,0.185774
je,0.472050,0.138716,0.101067,0.166489
...,...,...,...,...
72,1.000000,0.666667,0.000628,0.001255
vyšší,1.000000,0.666667,0.000628,0.001255
dolů,1.000000,0.666667,0.000628,0.001255
stavět,1.000000,0.666667,0.000628,0.001255


### Bigrams

In [74]:
RES_BIG_ALL = pd.DataFrame.from_dict(productivity_big_all, orient='index', columns=['productivity']).join(
        [pd.DataFrame.from_dict(utility_big_all, orient='index', columns=['utility']),
         pd.DataFrame.from_dict(coverage_big_all, orient='index', columns=['coverage'])])

RES_BIG_ALL['harmonic_mean'] = RES_BIG_ALL.apply(lambda x: 2 / (1/x['productivity'] + 1/x['coverage']), axis=1)

In [75]:
RES_BIG_ALL.sort_values('harmonic_mean', ascending=False)

Unnamed: 0,productivity,utility,coverage,harmonic_mean
v Praze,0.512195,0.178862,0.025738,0.049012
v roce,0.615385,0.282051,0.024482,0.047091
v Polsku,0.514286,0.180952,0.021971,0.042142
od roku,0.515152,0.181818,0.020716,0.039830
americké protiraketové,0.600000,0.266667,0.018832,0.036519
...,...,...,...,...
v Jemenu,1.000000,0.666667,0.000628,0.001255
pro výstavbu,1.000000,0.666667,0.000628,0.001255
"železárny, cementárny",1.000000,0.666667,0.000628,0.001255
celky pro,1.000000,0.666667,0.000628,0.001255


In [76]:
res_big = []
for i in range(NUM_SAMPLES):
    tmp = pd.DataFrame.from_dict(productivity_big[i], orient='index', columns=['productivity']).join(
        [pd.DataFrame.from_dict(utility_big[i], orient='index', columns=['utility']),
         pd.DataFrame.from_dict(coverage_big[i], orient='index', columns=['coverage'])])

    tmp['harmonic_mean'] = tmp.apply(lambda x: 2 / (1/x['productivity'] + 1/x['coverage']), axis=1)
    res_big.append(tmp)

In [77]:
RES_BIG = res_big[0]
for i in range(1, NUM_SAMPLES):
    RES_BIG = RES_BIG.add(res_big[i], fill_value=0)
RES_BIG = RES_BIG.div(NUM_SAMPLES)

In [78]:
RES_BIG.sort_values('harmonic_mean', ascending=False).round(4)[:20]

Unnamed: 0,productivity,utility,coverage,harmonic_mean
v,0.3571,0.0237,0.3889,0.3722
z,0.3897,0.0564,0.1985,0.2629
.,0.3878,0.0545,0.1559,0.2222
##y,0.3795,0.0461,0.1504,0.2151
se,0.384,0.0506,0.1458,0.2109
za,0.4115,0.0782,0.139,0.2075
V,0.3815,0.0481,0.1427,0.2069
na,0.3695,0.0362,0.1311,0.1934
",",0.3942,0.0609,0.1265,0.1909
##m,0.4052,0.0718,0.1058,0.1676


In [79]:
RES_BIG.sort_values('productivity', ascending=False)[:10]

Unnamed: 0,productivity,utility,coverage,harmonic_mean
tym,1.0,0.666667,0.005128,0.010204
Slovensku,1.0,0.666667,0.001709,0.003413
##ora,1.0,0.666667,0.001709,0.003413
Oscara,1.0,0.666667,0.001709,0.003413
rest,1.0,0.666667,0.001709,0.003413
Rod,1.0,0.666667,0.001709,0.003413
##bul,1.0,0.666667,0.001709,0.003413
Sei,1.0,0.666667,0.001709,0.003413
trad,1.0,0.666667,0.002393,0.004772
br,1.0,0.666667,0.002906,0.005792


In [80]:
RES_BIG.sort_values('coverage', ascending=False)[:10]

Unnamed: 0,productivity,utility,coverage,harmonic_mean
v,0.357075,0.023742,0.388889,0.37216
z,0.389727,0.056394,0.198462,0.262864
.,0.38779,0.054457,0.155897,0.222187
##y,0.379465,0.046132,0.150427,0.215083
se,0.383962,0.050629,0.145812,0.210905
V,0.381455,0.048121,0.142735,0.206908
za,0.411532,0.078198,0.138974,0.207538
na,0.369509,0.036176,0.131111,0.193434
",",0.394207,0.060873,0.126496,0.190938
##m,0.405169,0.071836,0.105812,0.167648


# Dataset-weighted Cue Information

## Calculate skip-grams

http://www.lrec-conf.org/proceedings/lrec2006/pdf/357_pdf.pdf

“Insurgents killed in ongoing fighting.”

Bi-grams = {insurgents killed, killed in, in ongoing, ongoing fighting}.

2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

Tri-grams = {insurgents killed in, killed in ongoing, in ongoing fighting}.

2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.
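
The sets above can be reproduced with a few lines of standard-library Python. This is a sketch that does not use NLTK; the function name `k_skip_ngrams` is an assumption, and it enumerates every n-gram whose total number of skipped tokens is at most k.

```python
from itertools import combinations


def k_skip_ngrams(tokens, n, k):
    """All n-grams over `tokens` that skip at most k tokens in total."""
    grams = []
    for i in range(len(tokens) - n + 1):
        head = tokens[i]
        # the remaining n-1 tokens come from the next n-1+k positions
        window = tokens[i + 1 : i + n + k]
        for tail in combinations(window, n - 1):
            grams.append((head, *tail))
    return grams


tokens = "insurgents killed in ongoing fighting".split()
skip_bigrams = [" ".join(g) for g in k_skip_ngrams(tokens, n=2, k=2)]
skip_trigrams = [" ".join(g) for g in k_skip_ngrams(tokens, n=3, k=2)]
print(len(skip_bigrams), len(skip_trigrams))
```

The pair "insurgents fighting" is missing from the 2-skip-bi-grams because it would require skipping three tokens.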

In [89]:
from nltk.util import skipgrams
import math

In [90]:
# all data
unigrams_all = [[ii.strip('.') for ii in c.split()] for c in df['claim'] if isinstance(c, str)]
# unigrams = [([ii.strip('.') for ii in c.split()], df['label'][i]) for i, c in enumerate(df['claim']) if isinstance(c, str)]

# bigrams_all = [[i.strip('.') + ' ' + ii.strip('.') 
#                 for i, ii in zip(c.split()[:-1], c.split()[1:])] 
#                 for c in df['claim'] if isinstance(c, str)]

# trigrams_all = [[i.strip('.') + ' ' + ii.strip('.') + ' ' + iii.strip('.') 
#                 for i, ii, iii in zip(c.split()[:-2], c.split()[1:-1], c.split()[2:])] 
#                 for c in df['claim'] if isinstance(c, str)]

In [158]:
SKIPS = [0, 1, 2, 3, 4]
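def get_skipgram_counts(cue_rep: str, skip: int):
    """
        See equation (3) in the paper
        skip_label == |D_cue=k and D_class=i|
        skip_total == |D_cue=k|
        total == total number of skipgrams
        
        cue_rep = cue representation: 'bigram' or 'trigram'
        skip = maximum number of skipped tokens
        
        if skip == 4, the skipgrams function generates all skipgrams with 0, 1, 2, 3 and 4 skipped tokens
        
        Returns:
            skip_label: dict mapping skipgram -> {label: count}
            skip_total: dict mapping skipgram -> number of occurrences
            total: total number of skipgram occurrences
            skip_doc_freq: dict mapping skipgram -> number of distinct claims containing it
            total_docs: number of claims
    """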
    skip_label, skip_total, total = {}, {}, 0
    skip_doc_freq, total_docs = {}, len(df.claim)
    rep2int = {'bigram': 2, 'trigram': 3}
    if cue_rep in ['bigram', 'trigram']:
        for i, claim in enumerate(df.claim):
            for skipgram in skipgrams(claim.split(), rep2int[cue_rep], skip):
                skipgram = " ".join(list(skipgram))
                # Count skipgrams per cue
                if skipgram in skip_total:
                    skip_total[skipgram] += 1
                else:
                    skip_total[skipgram] = 1
                if skipgram in skip_label:
                    if df.label[i] in skip_label[skipgram]:
                        skip_label[skipgram][df.label[i]] += 1
                    else:
                        skip_label[skipgram][df.label[i]] = 1
                else:
                    skip_label[skipgram] = {df.label[i]: 1}
                total += 1
                
                # Count document frequency per cue
                if skipgram in skip_doc_freq:
                    skip_doc_freq[skipgram].add(i)
                else:
                    skip_doc_freq[skipgram] = set([i])
                
        # Count the distinct docs         
        for k, v in skip_doc_freq.items():
            skip_doc_freq[k] = len(v)
    else:
        print("Cue representation not valid. Only bigram / trigram are valid.")
    return skip_label, skip_total, total, skip_doc_freq, total_docs

In [159]:
bi_skip_label, bi_skip_total, total, bi_skip_df, total_docs = get_skipgram_counts('bigram', 4)
tri_skip_label, tri_skip_total, total, tri_skip_df, total_docs = get_skipgram_counts('trigram', 4)

In [160]:
len(bi_skip_label) == len(bi_skip_total) == total

False

In [161]:
len(bi_skip_label)

24549

In [162]:
total

115430

In [185]:
def compute_normalised_dist(numerator: dict, denominator: dict):
    """Returns the normalised distribution over labels for each cue"""
    return {cue: 
            {label: count / total for label, count in numerator[cue].items()} 
            for cue, total in denominator.items()}


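def entropy(x: dict):
    """Returns sum(p * log10(p)), i.e. the negative of the Shannon entropy (base 10)"""
    return sum([v * math.log(v, 10) for k, v in x.items()])


def lambda_h(N: dict):
    """Information-based factor: 1 - H(N); entropy() already returns -H"""
    h = {k: 1 + entropy(v) for k, v in N.items()}
    return h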


def lambda_f(s: int, doc_freq_per_cue: dict, total_docs: int):
    """
    Frequency-based scaling factor,
    equivalent to the normalised/scaled document frequency of a cue
    (document frequency = the number of documents in which the cue is present)
    """
    f = {k: math.pow((v / total_docs), (1/s)) for k, v in doc_freq_per_cue.items()}
    return f


def DCI(lamh: dict, lamf: dict):
    """Combines both factors per cue as sqrt(lambda_h * lambda_f)"""
    dci = {k: math.sqrt(vh * lamf[k]) for k, vh in lamh.items()}
    return dci

In [197]:
N = compute_normalised_dist(bi_skip_label, bi_skip_total)
lambh = lambda_h(N)
lambf = lambda_f(3, bi_skip_df, total_docs)
dci = DCI(lambh, lambf)

In [198]:
dci['ozonová díra']

0.2541078455237247
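
The value above can be reproduced by hand for a single cue. The sketch below folds the three steps (normalised label distribution, `lambda_h`, `lambda_f`) into one hypothetical helper, `dci_for_cue`. It assumes that 'ozonová díra' was seen once with each label, that its document frequency is 3 (an assumption inferred from those three occurrences), and that `total_docs` is 1593, the number of rows in `df`.

```python
import math


def dci_for_cue(label_counts, doc_freq, total_docs, s=3):
    """DCI of one cue: sqrt(lambda_h * lambda_f), as in the functions above."""
    total = sum(label_counts.values())
    probs = [n / total for n in label_counts.values()]
    # information-based factor: 1 - H, with base-10 entropy
    lam_h = 1 + sum(p * math.log(p, 10) for p in probs)
    # frequency-based factor: s-th root of the relative document frequency
    lam_f = (doc_freq / total_docs) ** (1 / s)
    return math.sqrt(lam_h * lam_f)


value = dci_for_cue(
    {"SUPPORTS": 1, "NOT ENOUGH INFO": 1, "REFUTES": 1},
    doc_freq=3,       # assumed: one claim per occurrence
    total_docs=1593,  # number of labelled claims in df
)
print(round(value, 4))
```

A uniform label distribution gives the lowest possible `lambda_h` for three labels (1 - log10(3), about 0.523), so this cue sits at the bottom of the DCI ranking among cues with the same document frequency.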

In [199]:
dci = sorted(dci.items(), key=lambda kv: kv[1], reverse=True)

In [201]:
dci[:20]

[('Bühler Motor', 0.46539105318290536),
 ('hlavním městě', 0.45609097785104474),
 ('se v', 0.44383368472024554),
 ('V hlavním', 0.43637797057691785),
 ('v roce', 0.4337396859638323),
 ('v a', 0.43073199232234977),
 ('Společnost Bühler', 0.42814000822537124),
 ('Společnost Motor', 0.42814000822537124),
 ('se z', 0.42814000822537124),
 ('Motor výrobní', 0.42202462140113994),
 ('V městě', 0.42202462140113994),
 ('V se', 0.4192649005125955),
 ('v v', 0.41388391647892814),
 ('Bühler výrobní', 0.4138208500999247),
 ('od roku', 0.4088054353935084),
 ('křeče je', 0.4082637974977277),
 ('90 miliónů', 0.4047128967633357),
 ('firmy Bühler', 0.4047128967633357),
 ('V byl', 0.4047128967633357),
 ('Andrzej rakovinu.', 0.4047128967633357)]

In [202]:
pd.DataFrame(dci, columns =['Cue', 'DCI']) 

Unnamed: 0,Cue,DCI
0,Bühler Motor,0.465391
1,hlavním městě,0.456091
2,se v,0.443834
3,V hlavním,0.436378
4,v roce,0.433740
...,...,...
24544,firma chce,0.254108
24545,firma koupit,0.254108
24546,Air France,0.254108
24547,Díky ozonová,0.254108


In [177]:
N['ozonová díra']

{'SUPPORTS': 0.3333333333333333,
 'NOT ENOUGH INFO': 0.3333333333333333,
 'REFUTES': 0.3333333333333333}

In [196]:
dci['ozonová díra']

TypeError: list indices must be integers or slices, not str

In [178]:
a = math.log(1/3, 10) * (1/3)
a

-0.15904041823988746

In [179]:
3 * a

-0.4771212547196624

In [180]:
1 - (3*a)

1.4771212547196624

In [181]:
for k,v in lambf.items():
    if v < 0:
        print(k, v)

In [182]:
for k,v in lambh.items():
    if v < 0:
        print(k, v)

v v -0.06039520745013127
v Králové -0.06085694715802137
v asi -0.039720770839917874
Hradci Králové -0.011404264707351786
Králové v -0.004242473054076434
Záplavy v -0.054920167986144186
Záplavy Třebíči -0.054920167986144186
Záplavy jsou -0.054920167986144186
Záplavy vyloučeny. -0.054920167986144186
v jsou -0.07899220787758332
v vyloučeny. -0.054920167986144186
Třebíči jsou -0.054920167986144186
Třebíči vyloučeny. -0.054920167986144186
jsou vyloučeny. -0.054920167986144186
v je -0.054920167986144186
V České -0.06085694715802137
V republice -0.011404264707351564
V než -0.09861228866810956
České republice -0.054920167986144186
republice než -0.09861228866810956
V Polsku -0.039720770839917874
V se -0.0018639073312898269
Andrzej zemřel -0.09005965871078381
Žulawski zemřel -0.009613758174038312
zemřel ve -0.07755632706680093
v ve -0.039720770839917874
Žulawski na -0.043353426942290385
Ve se -0.054920167986144186
druhé polovině -0.09861228866810956
druhé roku -0.09861228866810956
polovině roku

Note: with the base-10 `entropy` defined in In [185], `lambh` cannot be negative for three labels (its minimum is 1 - log10(3) ≈ 0.523). The negative values above, which bottom out at 1 - ln(3) ≈ -0.0986, are consistent with an earlier run of this cell that used the natural logarithm.

In [183]:
bi_skip_label['ozonová díra']

{'SUPPORTS': 1, 'NOT ENOUGH INFO': 1, 'REFUTES': 1}

In [147]:
for k,v in lambh.items():
    if bi_skip_total[k] > 10 and v > 0.8:
        print(k, v)

Bühler Motor 0.9168568916127294
Společnost Bühler 0.8882479343209676
Společnost Motor 0.8882479343209676
na území 0.8043237532297054
V hlavním 1.0
hlavním městě 0.9104519301976715
v a 0.842693953748972
se z 0.8882479343209676


In [114]:
bi_skip_label['Bühler Motor']

{'SUPPORTS': 20, 'NOT ENOUGH INFO': 1}