# RumourEval-2019 - A Heterogeneous Example 

This notebook gives a brief example of how to use pyconversations, particularly for a collection of multi-platform data. This dataset was selected for its multi-platform nature, its public availability, and the fact that it was distributed (by and large) as a raw data dump (with some minor structural annotations, though these are unnecessary when using pyconversations).

Specifically, this tutorial uses the data from [SemEval-2019 Task 7: RumourEval](https://aclanthology.org/S19-2147/), which you will need a copy of if you wish to follow this tutorial exactly.
Information on how to obtain a copy of this data is available at the [CodaLab Competition page](https://competitions.codalab.org/competitions/19938).

Though this dataset can be read without pyconversations, the package provides several features out of the box (redaction, segmentation, feature generation) that make it valuable when dealing with data like this.
For example, one could develop a reader that splits up conversations according to their file organization or, using pyconversations, one could write a much simpler file reader and let pyconversations handle the "heavy lifting" of segmentation.

To get a clearer view of what this dataset's construction was all about, here's the abstract from the task's paper:

```
Since the first RumourEval shared task in 2017, interest in automated claim validation has greatly increased, as the danger of “fake news” has become a mainstream concern. However automated support for rumour verification remains in its infancy. It is therefore important that a shared task in this area continues to provide a focus for effort, which is likely to increase. Rumour verification is characterised by the need to consider evolving conversations and news updates to reach a verdict on a rumour’s veracity. As in RumourEval 2017 we provided a dataset of dubious posts and ensuing conversations in social media, annotated both for stance and veracity. The social media rumours stem from a variety of breaking news stories and the dataset is expanded to include Reddit as well as new Twitter posts. There were two concrete tasks; rumour stance prediction and rumour verification, which we present in detail along with results achieved by participants. We received 22 system submissions (a 70% increase from RumourEval 2017) many of which used state-of-the-art methodology to tackle the challenges involved.
```

In [1]:
import json
import pprint

from glob import glob
from tqdm import tqdm

In [2]:
# Update this to your local data directory
DATA_PATH = '/Users/hsh28/data/rumoureval2019/'  
TRAIN_PATH = DATA_PATH + 'rumoureval-2019-training-data/'
TEST_PATH = DATA_PATH + 'rumoureval-2019-test-data/'

# Answer keys, we can use these to extract some annotations
# and tag data with their associated split too
TRAIN_KEY = json.load(open(TRAIN_PATH + 'train-key.json'))
DEV_KEY = json.load(open(TRAIN_PATH + 'dev-key.json'))
FINAL = json.load(open(DATA_PATH + 'final-eval-key.json'))

TRAIN_KEY.keys(), DEV_KEY.keys(), FINAL.keys()

(dict_keys(['subtaskaenglish', 'subtaskbenglish']),
 dict_keys(['subtaskaenglish', 'subtaskbenglish']),
 dict_keys(['subtaskaenglish', 'subtaskbenglish', 'subtaskbdanish', 'subtaskbrussian']))

In [3]:
from pyconversations.convo import Conversation
from pyconversations.message import RedditPost
from pyconversations.message import Tweet

## Reading data

Now that we've loaded the annotations and pyconversations, let's dive into reading the files and verifying the post counts and annotation splits.

Here, let's write two small helper functions to bulk read a file and extract either all tweets or all Reddit posts:

In [4]:
def read_tweets(path):
    # Parse every tweet found in one raw JSON file, closing the file afterwards
    with open(path) as f:
        return Tweet.parse_raw(json.load(f), lang_detect=True)

In [5]:
def read_reddit_posts(path):
    with open(path) as f:
        raw = json.load(f)
    # A file holds either a single JSON object or a list of them
    if isinstance(raw, dict):
        return RedditPost.parse_raw(raw, lang_detect=True)
    return [y for x in raw for y in RedditPost.parse_raw(x, lang_detect=True)]

All pyconversations messages should be stored in a `Conversation` object for maximal functionality.
Here, we'll exhibit this bulk reading by making minimal assumptions about the data (other than that it's raw data from either Twitter or Reddit) and, provided we tag posts appropriately, we can easily verify that the data has been loaded properly.

In [6]:
data_set = Conversation()

In [7]:
# reading Twitter data 
for tweet_path in tqdm([x for pat in ('source-tweet', 'replies') for x in glob(TRAIN_PATH + f'twitter-english/*/*/{pat}/*.json')], desc='reading train+dev'):
    for t in read_tweets(tweet_path):
        subject = tweet_path.split('/')[-3]
        tstr = str(t.uid)
        if tstr in TRAIN_KEY['subtaskaenglish']:
            t.add_tag('split=TRAIN')
            t.add_tag(f'taskA={TRAIN_KEY["subtaskaenglish"][tstr]}')
            if tstr in TRAIN_KEY['subtaskbenglish']:
                t.add_tag(f'taskB={TRAIN_KEY["subtaskbenglish"][tstr]}')
        elif tstr in DEV_KEY['subtaskaenglish']:
            t.add_tag('split=DEV')
            t.add_tag(f'taskA={DEV_KEY["subtaskaenglish"][tstr]}')
            if tstr in DEV_KEY['subtaskbenglish']:
                t.add_tag(f'taskB={DEV_KEY["subtaskbenglish"][tstr]}')
        elif tstr in FINAL['subtaskaenglish']:
            t.add_tag('split=TEST')
            t.add_tag(f'taskA={FINAL["subtaskaenglish"][tstr]}')
            if tstr in FINAL['subtaskbenglish']:
                t.add_tag(f'taskB={FINAL["subtaskbenglish"][tstr]}')
            
        t.add_tag(subject)

        data_set.add_post(t)
        
for tweet_path in tqdm([x for pat in ('source-tweet', 'replies') for x in glob(TEST_PATH + f'twitter-en-test-data/*/*/{pat}/*.json')], desc='reading test'):
    for t in read_tweets(tweet_path):
        subject = tweet_path.split('/')[-3]
        tstr = str(t.uid)
        if tstr in FINAL['subtaskaenglish']:
            t.add_tag('split=TEST')
            t.add_tag(f'taskA={FINAL["subtaskaenglish"][tstr]}')
            if tstr in FINAL['subtaskbenglish']:
                t.add_tag(f'taskB={FINAL["subtaskbenglish"][tstr]}')
            
        t.add_tag(subject)
        
        data_set.add_post(t)

reading train+dev: 100%|██████████| 5568/5568 [00:13<00:00, 415.21it/s]
reading test: 100%|██████████| 1066/1066 [00:02<00:00, 501.15it/s]
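
The branches above repeat the same lookup against each answer key. Here is a minimal sketch of how that lookup could be factored out, assuming only the key layout shown earlier (`subtaskaenglish` and `subtaskbenglish` dictionaries keyed by string IDs). `tags_for` and the toy keys are hypothetical names used for illustration; they are not part of pyconversations.

```python
def tags_for(uid, keys):
    """Return the tag strings for a post ID, given (split_name, key) pairs.

    The first key whose subtask A annotations contain the ID decides the
    split, mirroring the if/elif chain used for the Twitter data.
    """
    uid = str(uid)  # answer keys store IDs as strings
    for split, key in keys:
        if uid in key['subtaskaenglish']:
            tags = [f'split={split}', f'taskA={key["subtaskaenglish"][uid]}']
            if uid in key['subtaskbenglish']:
                tags.append(f'taskB={key["subtaskbenglish"][uid]}')
            return tags
    return []  # unannotated post: no split or task tags


# Toy keys standing in for TRAIN_KEY / DEV_KEY (IDs and labels are made up)
toy_train = {
    'subtaskaenglish': {'1': 'support', '2': 'comment'},
    'subtaskbenglish': {'1': 'true'},
}
toy_dev = {'subtaskaenglish': {'3': 'deny'}, 'subtaskbenglish': {}}
toy_keys = [('TRAIN', toy_train), ('DEV', toy_dev)]

print(tags_for(1, toy_keys))
print(tags_for('3', toy_keys))
print(tags_for(99, toy_keys))
```

With the real keys, each returned string would be passed to `t.add_tag(...)` before adding the post to `data_set`.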


In [8]:
# reading Reddit data 
for reddit_path in tqdm(
    list(glob(TRAIN_PATH + 'reddit-training-data/*/raw.json')) + 
    [x for pat in ('source-tweet', 'replies') for x in glob(TRAIN_PATH + f'reddit-training-data/*/{pat}/*.json')],
    desc='reading train'
):
    for t in read_reddit_posts(reddit_path):
        t.add_tag('split=TRAIN')
        if t.uid in TRAIN_KEY['subtaskaenglish']:
            t.add_tag(f'taskA={TRAIN_KEY["subtaskaenglish"][t.uid]}')
        
        if t.uid in TRAIN_KEY['subtaskbenglish']:
            t.add_tag(f'taskB={TRAIN_KEY["subtaskbenglish"][t.uid]}')
        
        data_set.add_post(t)
        
for reddit_path in tqdm(
    list(glob(TRAIN_PATH + 'reddit-dev-data/*/raw.json')) + 
    [x for pat in ('source-tweet', 'replies') for x in glob(TRAIN_PATH + f'reddit-dev-data/*/{pat}/*.json')],
    desc='reading dev'
):
    for t in read_reddit_posts(reddit_path):
        t.add_tag('split=DEV')

        if t.uid in DEV_KEY['subtaskaenglish']:
            t.add_tag(f'taskA={DEV_KEY["subtaskaenglish"][t.uid]}')
        
        if t.uid in DEV_KEY['subtaskbenglish']:
            t.add_tag(f'taskB={DEV_KEY["subtaskbenglish"][t.uid]}')
            
        data_set.add_post(t)
        
for reddit_path in tqdm(
    list(glob(TEST_PATH + 'reddit-test-data/*/raw.json')) + 
    [x for pat in ('source-tweet', 'replies') for x in glob(TEST_PATH + f'reddit-test-data/*/{pat}/*.json')],
    desc='reading test'
):
    for t in read_reddit_posts(reddit_path):
        t.add_tag('split=TEST')
        
        if t.uid in FINAL['subtaskaenglish']:
            t.add_tag(f'taskA={FINAL["subtaskaenglish"][t.uid]}')
        
        if t.uid in FINAL['subtaskbenglish']:
            t.add_tag(f'taskB={FINAL["subtaskbenglish"][t.uid]}')
        
        data_set.add_post(t)

reading train: 100%|██████████| 728/728 [00:05<00:00, 130.30it/s]
reading dev: 100%|██████████| 446/446 [00:03<00:00, 116.94it/s]
reading test: 100%|██████████| 761/761 [00:04<00:00, 173.45it/s]


## Verifying Correctness

In the [original paper](https://aclanthology.org/S19-2147.pdf), Tables 3 and 4 show us the anticipated data in this dataset. 
Here, we just have some printouts that verify the correctness of these counts.
Notice in this portion:

* how we use the filter operation (on a `Conversation`) to consider subsets of the original
* how we use the segment operation to split out disjoint threads
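
Conceptually, splitting a collection into disjoint threads amounts to finding the connected components of the reply graph. The sketch below illustrates that idea with plain dictionaries. It is not pyconversations' implementation, and `segment_threads` and the `reply_to` mapping are hypothetical names introduced for this example.

```python
from collections import defaultdict


def segment_threads(reply_to):
    """Group post IDs into disjoint threads.

    reply_to maps each post ID to the set of IDs it replies to (possibly
    empty). Two posts share a thread when a chain of reply links connects them.
    """
    # Build an undirected adjacency list, ignoring parents that were never loaded
    neighbours = defaultdict(set)
    for pid, parents in reply_to.items():
        neighbours[pid]  # make sure isolated posts get an entry
        for parent in parents:
            if parent in reply_to:
                neighbours[pid].add(parent)
                neighbours[parent].add(pid)

    seen, threads = set(), []
    for start in reply_to:
        if start in seen:
            continue
        # Iterative depth-first search collects one connected component
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        threads.append(component)
    return threads


# Toy reply graph: 'y' replies to a post that is not in the collection
toy = {'a': set(), 'b': {'a'}, 'c': {'b'}, 'x': set(), 'y': {'missing'}}
threads = segment_threads(toy)
print(len(threads), threads)
```

In the toy graph, the reply whose parent is absent ends up in a thread of its own, which is one way a thread count can come out higher than expected.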

In [10]:
print('Twitter Train (including DEV)')

train = data_set.filter(by_platform={'Twitter'}, by_tags={'split=TRAIN'}) + data_set.filter(by_platform={'Twitter'}, by_tags={'split=DEV'})

print('\tSubtask A')
print(f'\t\tTotal messages: {len(train.posts)} (5568 in paper)')

train_supp = train.filter(by_tags={'taskA=support'})
print(f'\t\tSupport messages: {len(train_supp.posts)} (1004 in paper)')

train_deny = train.filter(by_tags={'taskA=deny'})
print(f'\t\tDeny messages: {len(train_deny.posts)} (415 in paper)')

train_quest = train.filter(by_tags={'taskA=query'})
print(f'\t\tQuery messages: {len(train_quest.posts)} (464 in paper)')

train_comm = train.filter(by_tags={'taskA=comment'})
print(f'\t\tComment messages: {len(train_comm.posts)} (3685 in paper)')

print('\tSubtask B')
print(f'\t\tTotal threads: {len(train.segment())} (325 in paper)')

train_true = train.filter(by_tags={'taskB=true'})
print(f'\t\tTrue threads: {len(train_true.posts)} (145 in paper)')

train_false = train.filter(by_tags={'taskB=false'})
print(f'\t\tFalse threads: {len(train_false.posts)} (74 in paper)')

train_unv = train.filter(by_tags={'taskB=unverified'})
print(f'\t\tUnverified threads: {len(train_unv.posts)} (106 in paper)')

print('Twitter Test')

test = data_set.filter(by_platform={'Twitter'}, by_tags={'split=TEST'})

print('\tSubtask A')
print(f'\t\tTotal messages: {len(test.posts)} (1066 in paper)')

test_supp = test.filter(by_tags={'taskA=support'})
print(f'\t\tSupport messages: {len(test_supp.posts)} (141 in paper)')

test_deny = test.filter(by_tags={'taskA=deny'})
print(f'\t\tDeny messages: {len(test_deny.posts)} (92 in paper)')

test_quest = test.filter(by_tags={'taskA=query'})
print(f'\t\tQuery messages: {len(test_quest.posts)} (62 in paper)')

test_comm = test.filter(by_tags={'taskA=comment'})
print(f'\t\tComment messages: {len(test_comm.posts)} (771 in paper)')

print('\tSubtask B')
print(f'\t\tTotal threads: {len(test.segment())} (56 in paper)')

test_true = test.filter(by_tags={'taskB=true'})
print(f'\t\tTrue threads: {len(test_true.posts)} (22 in paper)')

test_false = test.filter(by_tags={'taskB=false'})
print(f'\t\tFalse threads: {len(test_false.posts)} (30 in paper)')

test_unv = test.filter(by_tags={'taskB=unverified'})
print(f'\t\tUnverified threads: {len(test_unv.posts)} (4 in paper)')

Twitter Train (including DEV)
	Subtask A
		Total messages: 5568 (5568 in paper)
		Support messages: 1004 (1004 in paper)
		Deny messages: 415 (415 in paper)
		Query messages: 464 (464 in paper)
		Comment messages: 3685 (3685 in paper)
	Subtask B
		Total threads: 327 (325 in paper)
		True threads: 145 (145 in paper)
		False threads: 74 (74 in paper)
		Unverified threads: 106 (106 in paper)
Twitter Test
	Subtask A
		Total messages: 1066 (1066 in paper)
		Support messages: 141 (141 in paper)
		Deny messages: 92 (92 in paper)
		Query messages: 62 (62 in paper)
		Comment messages: 771 (771 in paper)
	Subtask B
		Total threads: 61 (56 in paper)
		True threads: 22 (22 in paper)
		False threads: 30 (30 in paper)
		Unverified threads: 4 (4 in paper)


In [11]:
print('Reddit Train (including DEV)')

train = data_set.filter(by_platform={'Reddit'}, by_tags={'split=TRAIN'}) + data_set.filter(by_platform={'Reddit'}, by_tags={'split=DEV'})

print('\tSubtask A')
print(f'\t\tTotal messages: {len(train.posts)} (1134 in paper)')

train_supp = train.filter(by_tags={'taskA=support'})
print(f'\t\tSupport messages: {len(train_supp.posts)} (23 in paper)')

train_deny = train.filter(by_tags={'taskA=deny'})
print(f'\t\tDeny messages: {len(train_deny.posts)} (45 in paper)')

train_quest = train.filter(by_tags={'taskA=query'})
print(f'\t\tQuery messages: {len(train_quest.posts)} (51 in paper)')

train_comm = train.filter(by_tags={'taskA=comment'})
print(f'\t\tComment messages: {len(train_comm.posts)} (1015 in paper)')

print('\tSubtask B')
print(f'\t\tTotal threads: {len(train.segment())} (40 in paper)')

train_true = train.filter(by_tags={'taskB=true'})
print(f'\t\tTrue threads: {len(train_true.posts)} (9 in paper)')

train_false = train.filter(by_tags={'taskB=false'})
print(f'\t\tFalse threads: {len(train_false.posts)} (24 in paper)')

train_unv = train.filter(by_tags={'taskB=unverified'})
print(f'\t\tUnverified threads: {len(train_unv.posts)} (7 in paper)')

print('Reddit Test')

test = data_set.filter(by_platform={'Reddit'}, by_tags={'split=TEST'})

print('\tSubtask A')
print(f'\t\tTotal messages: {len(test.posts)} (806 in paper)')

test_supp = test.filter(by_tags={'taskA=support'})
print(f'\t\tSupport messages: {len(test_supp.posts)} (16 in paper)')

test_deny = test.filter(by_tags={'taskA=deny'})
print(f'\t\tDeny messages: {len(test_deny.posts)} (54 in paper)')

test_quest = test.filter(by_tags={'taskA=query'})
print(f'\t\tQuery messages: {len(test_quest.posts)} (31 in paper)')

test_comm = test.filter(by_tags={'taskA=comment'})
print(f'\t\tComment messages: {len(test_comm.posts)} (705 in paper)')

print('\tSubtask B')
print(f'\t\tTotal threads: {len(test.segment())} (25 in paper)')

test_true = test.filter(by_tags={'taskB=true'})
print(f'\t\tTrue threads: {len(test_true.posts)} (9 in paper)')

test_false = test.filter(by_tags={'taskB=false'})
print(f'\t\tFalse threads: {len(test_false.posts)} (10 in paper)')

test_unv = test.filter(by_tags={'taskB=unverified'})
print(f'\t\tUnverified threads: {len(test_unv.posts)} (6 in paper)')

Reddit Train (including DEV)
	Subtask A
		Total messages: 1134 (1134 in paper)
		Support messages: 23 (23 in paper)
		Deny messages: 45 (45 in paper)
		Query messages: 51 (51 in paper)
		Comment messages: 1015 (1015 in paper)
	Subtask B
		Total threads: 40 (40 in paper)
		True threads: 9 (9 in paper)
		False threads: 24 (24 in paper)
		Unverified threads: 7 (7 in paper)
Reddit Test
	Subtask A
		Total messages: 761 (806 in paper)
		Support messages: 16 (16 in paper)
		Deny messages: 9 (54 in paper)
		Query messages: 31 (31 in paper)
		Comment messages: 705 (705 in paper)
	Subtask B
		Total threads: 25 (25 in paper)
		True threads: 9 (9 in paper)
		False threads: 10 (10 in paper)
		Unverified threads: 6 (6 in paper)
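
The per-label counts above can also be tabulated in a single pass. This is a minimal sketch over plain sets of tag strings, assuming tags of the form `taskA=<label>` as added during loading. `count_labels` and the toy data are hypothetical, and how tags are read back from a pyconversations post is not covered here.

```python
from collections import Counter


def count_labels(tag_sets, task='taskA'):
    """Count the labels recorded for one task across an iterable of tag sets."""
    prefix = task + '='
    counts = Counter()
    for tags in tag_sets:
        for tag in tags:
            if tag.startswith(prefix):
                counts[tag[len(prefix):]] += 1
    return counts


# Toy tag sets (labels are made up); the last post carries no task label
toy_tags = [
    {'split=TEST', 'taskA=comment'},
    {'split=TEST', 'taskA=deny', 'taskB=false'},
    {'split=TEST', 'taskA=comment'},
    {'split=TEST'},
]

print(count_labels(toy_tags))
print(count_labels(toy_tags, task='taskB'))
```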


### Reflection

Not bad! It appears that we've loaded just about everything described in the paper, except:
* A chunk of Reddit `deny` comments (45 in total) from the test set
* Some linkages in the Twitter data appear to be fragmented, leading to increased thread counts (I believe this may be the ill-formed thread referenced on line 114 [here](https://github.com/kochkinaelena/RumourEval2019/blob/master/preprocessing/preprocessing_tweets.py))
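
One way to probe fragmented linkages is to list the replies whose parent was never loaded. The following is a sketch over a plain mapping from post ID to parent IDs. `dangling_replies` and `reply_to` are hypothetical names, and extracting reply links from pyconversations posts is left out because that part of the API is not shown in this tutorial.

```python
def dangling_replies(reply_to):
    """Return {post_id: missing_parent_ids} for posts whose parents are absent."""
    loaded = set(reply_to)
    missing = {}
    for pid, parents in reply_to.items():
        absent = set(parents) - loaded
        if absent:
            missing[pid] = absent
    return missing


# Toy data: post 3 replies to post 7, which is not in the collection
toy_links = {1: set(), 2: {1}, 3: {7}}
print(dangling_replies(toy_links))
```

Any post reported this way starts its own thread during segmentation, so each one is a candidate explanation for an extra thread.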

We can also briefly enumerate through our keys and see if there is anything missing when inspecting the data through this lens:

In [12]:
def output_missing_ids(key, label):
    print(label)
    for task, annots in key.items():
        for pid in annots:
            if (pid.isnumeric() and int(pid) not in data_set.posts) or (not pid.isnumeric() and pid not in data_set.posts):
                print('\t', pid)

In [13]:
output_missing_ids(TRAIN_KEY, 'train')
output_missing_ids(DEV_KEY, 'dev')
output_missing_ids(FINAL, 'test')

train
dev
test


Interestingly enough, it doesn't seem as though we have any missing messages (per these supplied annotation keys)! 
I also haven't found any additional data included with this data release. 
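
To narrow down the Reddit `deny` discrepancy, one could count labels directly in an answer key, split by the shape of the ID. The sketch below assumes, as `output_missing_ids` does, that numeric IDs are tweets and everything else is a Reddit post. `key_label_counts` is a hypothetical helper and the toy key is made up.

```python
from collections import Counter


def key_label_counts(annotations):
    """Count labels per platform in one subtask's {post_id: label} mapping."""
    counts = {'Twitter': Counter(), 'Reddit': Counter()}
    for pid, label in annotations.items():
        # Assumption: tweet IDs are purely numeric, Reddit IDs are not
        platform = 'Twitter' if pid.isnumeric() else 'Reddit'
        counts[platform][label] += 1
    return counts


# Toy stand-in for something like FINAL['subtaskaenglish']
toy_key = {
    '1001': 'deny',
    '1002': 'comment',
    'e5abc1': 'deny',
    'e5abc2': 'deny',
    'e5abc3': 'query',
}
counts = key_label_counts(toy_key)
print(counts['Twitter'])
print(counts['Reddit'])
```

If `key_label_counts(FINAL['subtaskaenglish'])['Reddit']['deny']` came back as 9 rather than 54, the gap would lie in the released key rather than in the loading code.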

Hopefully this speeds up your handling of heterogeneous platform data! If you notice anything off about this tutorial, submit a bug report [here](https://github.com/hunter-heidenreich/pyconversations).