In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%load_ext autotime

time: 331 µs


In [3]:
%cd ..

/Users/rubenbroekx/Documents/Projects/twitter-sentiment-classifier/twitter_sentiment_classifier
time: 1.29 ms


# Data Parsing

This script reads the raw annotated dataset and converts it to a cleaner format. Information that isn't necessary or is prone to change later in the pipeline (e.g. sentence encodings) is removed as well.

The input of this script is the raw `annotations.jsonl` file used by the Prodigy annotation tool. Since this file contains several deprecated fields, it is cleaned and transformed, resulting in the `tweets_annotated.jsonl` file.

**Note: Since the raw data isn't provided, you will not be able to run this script on your local machine.**

In [4]:
import json
import os
from tqdm import tqdm
from collections import Counter

time: 7.53 ms


In [5]:
# Read in the data; malformed lines (due to OOM on EC2) are reported and skipped
with open(os.path.expanduser('~/data/twitter/annotations.jsonl'), 'r') as f:
    annotations = []
    for line in f:
        try:
            annotations.append(json.loads(line))
        except json.JSONDecodeError:
            print(' -> JSONDecoderError')
print(f"Loaded in {len(annotations)} annotations")

 -> JSONDecoderError
Loaded in 49576 annotations
time: 9.88 s
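
The tolerant loading used above can be packaged as a small helper that also reports how many lines were skipped. This is a minimal sketch on toy input; the function name `load_jsonl_lenient` and the toy lines are assumptions for illustration and are not part of the original pipeline.

```python
import json
from typing import Iterable, List, Tuple


def load_jsonl_lenient(lines: Iterable[str]) -> Tuple[List[dict], int]:
    """Parse JSONL lines, skipping blank or malformed ones (hypothetical helper)."""
    records, n_skipped = [], 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            n_skipped += 1  # e.g. a line truncated by an interrupted write
    return records, n_skipped


# Toy input: two valid records and one truncated line
toy_lines = ['{"id": 1, "text": "a"}\n', '{"id": 2, "te', '{"id": 3, "text": "c"}\n']
records, n_skipped = load_jsonl_lenient(toy_lines)
print(f"Loaded {len(records)} records, skipped {n_skipped}")
```

Since an open file is itself an iterable of lines, the same helper could be called on a file handle directly.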


In [7]:
# Extend with extra annotations performed locally
with open(os.path.expanduser('~/data/twitter/extra/annotations_extra.jsonl'), 'r') as f:
    annotations += [json.loads(line) for line in f.readlines()]
print(f"Loaded in {len(annotations)} annotations")

Loaded in 51576 annotations
time: 443 ms


## Fields

The raw data has plenty of fields:
- **id** The unique tweet ID of the sample
- **created_at** The creation date of the tweet
- **text** The cleaned text of the tweet
- **text_raw** The raw text of the tweet 
- **truncated** Flag if the tweet is truncated or not
- **is_quote** Flag if the tweet was a quote or not
- **quoted_lang** The language of the quoted tweet, if it exists
- **quoted_tweet** The processed text of the quoted tweet, if it exists
- **quoted_tweet_raw** The raw text of the quoted tweet, if it exists
- **quote_count** Number of times the tweet (of **id**) is quoted
- **is_reply** Flag if the tweet is a reply itself
- **replied_tweet_id** The ID of the tweet to which this tweet replies
- **reply_count** Number of replies on the tweet
- **retweet_count** Number of times the tweet is retweeted
- **favorite_count** Number of times the tweet is favorited
- **hashtags** The hashtags present in the tweet
- **user_followers** The number of followers the tweet's creator has
- **user_friends** The number of friends the tweet's creator has
- **user_verified** Flag if the creator is a verified Twitter user
- **user_tweet_count** Number of tweets sent by the creator during the account's lifetime
- **user_created_at** Account creation date
- **features** Sentence embedding
- **prediction** Prediction of the model that was used during training
- **_input_hash** Input-hash ID of the tweet, as defined by Prodigy
- **_task_hash** Task-hash ID of the tweet, as defined by Prodigy
- **meta** Meta-data attached to the tweet
- **_session_id** Session-ID, indicating which annotator was annotating
- **answer** Indication if the tweet is accepted or not
- **annotators** All annotators of the tweet
- **sentiment** Annotated sentiment
- **__label** Annotated sentiment label
- **flagged** Flag indicating that the annotator flagged the tweet
- **validation** Flag indicating that the tweet has been flagged before

## Update data

Update the data according to the following rules:
1. Anonymise annotators; substitute `_session_id` names with anonymised representations (such as unique letters)
2. Update the `annotators` field with the new annotator names and correct missing annotators (since they were added incrementally)
3. Make the assigned labels easily readable
4. Assign the `flag` label to tweets that were flagged by at least one annotator
5. Flag disagreement in the tweet's annotations
6. Remove all redundant fields

### 1. Anonymise annotators

Substitute `_session_id` names with anonymised representations, in the form of unique letters.

This step creates the `annotator` field.

In [8]:
# Map the raw Prodigy session IDs onto anonymised annotator letters
ANNOTATOR_MAP = {
    'tweet_annotations-hanne': 'H',
    'tweet_annotations-aiko': 'A',
    'tweet_annotations-ilya': 'I',
    'tweet_annotations-ruben': 'R',
    'tweet_annotations-ilya?': 'S',
    'tweet_annotations-shoera?': 'S',
    'tweet_annotations-shoera': 'O',
}

def update_annotator(sample):
    """Assign anonymised annotator labels to each tweet."""
    sample['annotator'] = ANNOTATOR_MAP[sample['_session_id']]

time: 556 µs


In [9]:
for a in tqdm(annotations, desc="Updating annotator names"):
    update_annotator(a)

Updating annotator names: 100%|██████████| 51576/51576 [00:00<00:00, 356315.87it/s]

time: 178 ms





In [10]:
counter = Counter()
for a in annotations:
    counter[a['annotator']] += 1

print("Annotator overview:")
for annotator, count in sorted(counter.items(), key=lambda x: x[1], reverse=True):
    print(f" - {annotator}: {count}")

Annotator overview:
 - H: 19663
 - A: 10973
 - O: 6534
 - I: 6519
 - S: 5726
 - R: 2161
time: 30.3 ms
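
Because `update_annotator` indexes `ANNOTATOR_MAP` directly, a session ID that is missing from the map raises a `KeyError`. A quick check before mapping can list such IDs up front. The sketch below uses a toy mapping and toy samples; `find_unmapped_sessions`, `toy_map` and the session names are hypothetical and only serve as illustration.

```python
from collections import Counter

# Hypothetical mapping and samples, for illustration only
toy_map = {'session-x': 'X', 'session-y': 'Y'}
toy_samples = [
    {'_session_id': 'session-x'},
    {'_session_id': 'session-z'},
    {'_session_id': 'session-z'},
]


def find_unmapped_sessions(samples, mapping):
    """Count the session IDs that have no entry in the mapping."""
    return Counter(
        s['_session_id'] for s in samples if s['_session_id'] not in mapping
    )


unmapped = find_unmapped_sessions(toy_samples, toy_map)
print(dict(unmapped))
```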


### 2. Update annotators

Update the `annotators` field with the new annotator names and correct missing annotators (since they were added incrementally).

This step creates the `annotators` field.

In [11]:
# Initialise empty annotators memory
annotators_memory = {}

def update_annotators_memory(sample):
    """Update the annotators memory with the annotator of the given tweet."""
    s_id = sample['id']
    if s_id not in annotators_memory:
        annotators_memory[s_id] = set()
    annotators_memory[s_id].add(sample['annotator'])
    
# Initialise the annotator memory
for a in tqdm(annotations, desc="Creating annotators memory"):
    update_annotators_memory(a)

Creating annotators memory: 100%|██████████| 51576/51576 [00:00<00:00, 86112.39it/s]

time: 601 ms





In [12]:
# Give overview of number of annotators
counter = Counter()
for v in annotators_memory.values():
    counter[len(v)] += 1

print("#annotators: #tweets")
for n_annotators, n_tweets in sorted(counter.items()):
    print(f"{n_annotators:^11}: {n_tweets:^7}")

#annotators: #tweets
     1     :  46098 
     2     :   740  
     3     :   709  
     4     :   106  
     5     :   66   
time: 21.7 ms


In [13]:
def assign_annotators(sample):
    """Assign the correct annotators to the sample."""
    sample['annotators'] = sorted(annotators_memory[sample['id']])

time: 460 µs


In [14]:
for a in tqdm(annotations, desc="Assigning annotators"):
    assign_annotators(a)

Assigning annotators: 100%|██████████| 51576/51576 [00:00<00:00, 470163.29it/s]

time: 112 ms





### 3. Readable labels

Make the assigned labels easily readable.

This step creates the `accept` and `label` fields.

In [15]:
def update_label(sample):
    """Assign the correct / readable sentiment label to the sample."""
    if sample['answer'] != 'accept' or not sample['sentiment']:
        sample['accept'] = False
        sample['label'] = 'REJECT'
    else:
        sample['accept'] = True
        sample['label'] = sample['sentiment']

time: 579 µs


In [16]:
for a in tqdm(annotations, desc="Updating labels"):
    update_label(a)

Updating labels: 100%|██████████| 51576/51576 [00:00<00:00, 931445.50it/s]

time: 57.4 ms





In [17]:
counter = Counter()
for a in annotations:
    counter[a['label']] += 1

print("Label overview:")
for label, count in sorted(counter.items(), key=lambda x: x[1], reverse=True):
    print(f" - {label}: {count}")

Label overview:
 - NEUTRAL: 19083
 - NEGATIVE: 15483
 - POSITIVE: 15210
 - REJECT: 1800
time: 33.1 ms


In [18]:
counter = Counter()
for a in annotations:
    counter[a['accept']] += 1

print("Accept overview:")
for accept, count in sorted(counter.items(), key=lambda x: x[1], reverse=True):
    print(f" - {accept}: {count}")

Accept overview:
 - True: 49776
 - False: 1800
time: 51.3 ms


### 4. Flag tweets

Assign the `flag` label to tweets that were flagged by at least one annotator.

This step creates the `flag` field.

In [19]:
# Collect the IDs of all tweets flagged by at least one annotator
flagged_ids = set()
for a in annotations:
    if a.get('flagged'):
        flagged_ids.add(a['id'])

print(f"Total of {len(flagged_ids)} unique flagged tweets")

Total of 1028 unique flagged tweets
time: 20.9 ms


In [20]:
def check_flag(sample):
    """Assign flag label to the tweet if necessary."""
    sample['flag'] = sample['id'] in flagged_ids

time: 478 µs


In [21]:
for a in tqdm(annotations, desc="Assigning flagging flag"):
    check_flag(a)

Assigning flagging flag: 100%|██████████| 51576/51576 [00:00<00:00, 1016275.52it/s]

time: 52.8 ms





In [22]:
counter = Counter()
for a in annotations:
    counter[a['flag']] += 1

print("Flagged overview:")
for flag, count in sorted(counter.items(), key=lambda x: x[1], reverse=True):
    print(f" - {flag}: {count}")

Flagged overview:
 - False: 49232
 - True: 2344
time: 30.7 ms
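
The number of flagged records (2344) is larger than the number of unique flagged tweets (1028) because `flag` is assigned per tweet ID: every record sharing the ID of a flagged tweet receives `flag = True`, including records from annotators who did not flag it themselves. The toy sketch below illustrates this propagation; the sample data and the `toy_` names are made up for the example.

```python
# Toy records: tweet 1 is annotated twice, but flagged by only one annotator
toy_samples = [
    {'id': 1, 'annotator': 'A', 'flagged': True},
    {'id': 1, 'annotator': 'B'},  # same tweet, not flagged by B
    {'id': 2, 'annotator': 'A', 'flagged': False},
]

# Collect flagged tweet IDs, then propagate the flag to every record of that tweet
toy_flagged_ids = {s['id'] for s in toy_samples if s.get('flagged')}
for s in toy_samples:
    s['flag'] = s['id'] in toy_flagged_ids

print(len(toy_flagged_ids), sum(s['flag'] for s in toy_samples))
```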


### 5. Flag disagreement

Flag the tweets that are annotated by multiple annotators but don't have matching labels.

This step creates the `agreement` field.

In [23]:
def flag_disagreement(sample):
    """Flag tweets that are annotated differently by multiple annotators."""
    # Disagreement only possible if multiple annotators
    if len(sample['annotators']) == 1: 
        sample['agreement'] = True
        return
    
    # Check if annotated differently
    labels = {a['label'] for a in annotations if a['id'] == sample['id']}
    sample['agreement'] = len(labels) == 1

time: 583 µs


In [24]:
for a in tqdm(annotations, desc="Flagging disagreements"):
    flag_disagreement(a)

Flagging disagreements: 100%|██████████| 51576/51576 [01:10<00:00, 727.10it/s] 

time: 1min 10s





In [25]:
counter = Counter()
for a in annotations:
    counter[a['agreement']] += 1

print("Agreement overview:")
for agreement, count in sorted(counter.items(), key=lambda x: x[1], reverse=True):
    print(f" - {agreement}: {count}")

Agreement overview:
 - True: 48801
 - False: 2775
time: 30.2 ms
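
The disagreement check above rescans the full `annotations` list for every multi-annotator sample, which explains the runtime of over a minute. A possible alternative is to group the labels per tweet ID once and then look them up. The following is a minimal sketch on toy data, not the notebook's implementation; `flag_disagreement_grouped` is a hypothetical name.

```python
from collections import defaultdict


def flag_disagreement_grouped(samples):
    """Set 'agreement' on each sample after grouping the labels per tweet ID once."""
    labels_by_id = defaultdict(set)
    for s in samples:
        labels_by_id[s['id']].add(s['label'])
    for s in samples:
        # Agreement holds when all records of this tweet share a single label
        s['agreement'] = len(labels_by_id[s['id']]) == 1


toy_samples = [
    {'id': 1, 'label': 'POSITIVE'},
    {'id': 1, 'label': 'NEGATIVE'},
    {'id': 2, 'label': 'NEUTRAL'},
    {'id': 3, 'label': 'NEUTRAL'},
    {'id': 3, 'label': 'NEUTRAL'},
]
flag_disagreement_grouped(toy_samples)
print([s['agreement'] for s in toy_samples])
```

One behavioural difference to keep in mind: this sketch does not short-circuit on `len(sample['annotators']) == 1`, so a tweet labelled twice by the same annotator with different labels would be marked as a disagreement here, whereas the version above marks it as agreement.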


### 6. Remove redundant

Remove all the redundant fields from the tweets.

**Note:** After this step, the previous steps cannot be executed anymore.

In [26]:
REDUNDANT = {
    'validation',  # Covered by 'annotators'
    'flagged',  # Covered by 'flag'
    '__label',  # Covered by 'label'
    'sentiment',  # Covered by 'label'
    '_session_id',  # Covered by 'annotator'
    'answer',  # Covered by 'accept'
    'features',  # Changes later in the pipeline
    'prediction',  # Changes later in the pipeline
    '_input_hash',  # Irrelevant for future use
    '_task_hash',  # Irrelevant for future use
    'meta',  # Irrelevant for future use
}

def remove_redundant_fields(sample):
    """Remove all the redundant fields from the tweet."""
    for redundant in REDUNDANT:
        sample.pop(redundant, None)

time: 464 µs


In [27]:
for a in tqdm(annotations, desc="Removing redundant fields"):
    remove_redundant_fields(a)

Removing redundant fields: 100%|██████████| 51576/51576 [00:00<00:00, 61225.86it/s]

time: 844 ms





### Final tweet

Show an example of the final annotated tweet-form.

In [28]:
annotations[0]

{'id': 772094208530321400,
 'created_at': '2016-09-03 15:29:38',
 'text': 'Los gaan pfff 2013 memories',
 'text_raw': 'Los gaan pfff 2013 memories',
 'truncated': False,
 'is_quote': False,
 'quoted_lang': '',
 'quoted_tweet': '',
 'quoted_tweet_raw': '',
 'quote_count': 0,
 'is_reply': False,
 'replied_tweet_id': None,
 'reply_count': 0,
 'retweet_count': 0,
 'favorite_count': 0,
 'hashtags': [],
 'user_followers': 1101,
 'user_friends': 835,
 'user_verified': False,
 'user_tweet_count': 34401,
 'user_created_at': '2015-03-18 13:09:06',
 'annotators': ['H'],
 'annotator': 'H',
 'accept': True,
 'label': 'POSITIVE',
 'flag': True,
 'agreement': True}

time: 2.34 ms


## Store

Store the results in `tweets_annotated.jsonl`. The raw `annotations.jsonl` file is left untouched.

In [29]:
with open(os.path.expanduser('~/data/twitter/tweets_annotated.jsonl'), 'w') as f:
    f.write('\n'.join([json.dumps(a) for a in annotations])+'\n')

time: 612 ms
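
A simple way to gain confidence in the stored file is to read it back and compare it with the in-memory records. The sketch below does this round trip on toy records in a temporary directory, so it does not touch the real data; the file name `toy.jsonl` and the records are assumptions for illustration.

```python
import json
import os
import tempfile

toy_records = [{'id': 1, 'label': 'POSITIVE'}, {'id': 2, 'label': 'REJECT'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.jsonl')

    # Write one JSON document per line
    with open(path, 'w') as f:
        for record in toy_records:
            f.write(json.dumps(record) + '\n')

    # Read the file back, one JSON document per line
    with open(path, 'r') as f:
        reloaded = [json.loads(line) for line in f]

print(reloaded == toy_records)
```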
