# PyConversations in Application -- SemEval 2019 Task 7a
## Classifying the Support Labels of Rumours

This notebook is a short tutorial on using features extracted with PyConversations in a machine learning pipeline. 
Here, we apply PyConversations to sub-task A of [SemEval 2019 Task 7 - RumourEval](https://aclanthology.org/S19-2147.pdf).
The goal of the task is to classify whether each comment is (S)upporting, (D)enying, (Q)uerying, or (C)ommenting on a rumour (hence SDQC as the short name for this task type).
This notebook takes a simplistic stab at the task using only descriptive features from PyConversations (i.e., no semantic vectors are added to the features that PyConversations provides).

In [1]:
import numpy as np

from sklearn.feature_selection import SelectFromModel

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

from sklearn.preprocessing import LabelEncoder

from tqdm import tqdm

from pyconversations.convo import Conversation
from pyconversations.feature_extraction import ConversationVectorizer
from pyconversations.feature_extraction import PostVectorizer
from pyconversations.feature_extraction import UserVectorizer

from RumourEval import load_rumoureval

In [2]:
# fit a label encoder to the label types
classes = ['comment', 'support', 'deny', 'query']

le = LabelEncoder()
le.fit(classes)

le.transform(classes)

array([0, 3, 1, 2])
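
The output above is not in list order because `LabelEncoder` sorts the labels before assigning integer codes. A minimal sketch of how to inspect that mapping and decode integer predictions back into label strings (the name `code_to_label` is only illustrative):

```python
from sklearn.preprocessing import LabelEncoder

classes = ['comment', 'support', 'deny', 'query']
le = LabelEncoder().fit(classes)

# classes_ holds the labels in sorted order; the position is the integer code
code_to_label = {i: str(c) for i, c in enumerate(le.classes_)}
print(code_to_label)

# inverse_transform maps integer codes (e.g., model predictions) back to strings
decoded = [str(c) for c in le.inverse_transform([0, 1, 2, 3])]
print(decoded)
```

With this encoding, `comment` is 0, `deny` is 1, `query` is 2, and `support` is 3, which is how the numeric rows in the classification reports later in this notebook should be read.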

In [3]:
# replace with the path where this data lives on your machine
DATA_PATH = '/Users/hsh28/data/rumoureval2019/'

In [4]:
# see previous tutorial if you have questions about where this code came from!
dataset = load_rumoureval(DATA_PATH)
len(dataset.posts)

reading train+dev: 100%|██████████| 5568/5568 [00:13<00:00, 420.29it/s]
reading test: 100%|██████████| 1066/1066 [00:02<00:00, 477.60it/s]
reading train: 100%|██████████| 728/728 [00:05<00:00, 136.65it/s]
reading dev: 100%|██████████| 446/446 [00:03<00:00, 117.57it/s]
reading test: 100%|██████████| 761/761 [00:04<00:00, 170.10it/s]


8534

In [5]:
def get_split(dataset, split):
    """
    Simple function for splitting out the different cuts of the data
    """
    return Conversation(posts={pid: dataset.posts[pid] for pid in dataset.filter(by_tags={f'split={split}'})})

In [6]:
# split the posts
keys = ['train', 'dev', 'test']
splits = {k: get_split(dataset, k.upper()) for k in keys}

{s: len(splits[s].posts) for s in splits}

{'train': 5217, 'dev': 1485, 'test': 1827}

In [34]:
# build different vectorizers

# norm = None  # NOT RECOMMENDED
# norm = 'standard'
norm = 'minmax'
# norm = 'mean'

cvec = ConversationVectorizer(normalization=norm)
pvec = PostVectorizer(normalization=norm)
uvec = UserVectorizer(normalization=norm)
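
The vectorizers' internals are not shown in this notebook, so the following is only a sketch of the conventional definitions that the `'minmax'` and `'standard'` options are assumed to correspond to. It uses plain NumPy on made-up data. The statistics come from the training split only and are then reused on the other splits, which is why the vectorizers are fit on `train` below.

```python
import numpy as np

# made-up feature matrices standing in for real vectorizer output
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
dev = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# statistics are computed on the training split only
lo, hi = train.min(axis=0), train.max(axis=0)
mu, sigma = train.mean(axis=0), train.std(axis=0)

def minmax(x):
    """Rescale each column so the training range maps onto [0, 1]."""
    return (x - lo) / (hi - lo)

def standard(x):
    """Shift and scale each column to zero mean and unit variance on train."""
    return (x - mu) / sigma

train_mm = minmax(train)
dev_mm = minmax(dev)  # dev values can land outside [0, 1]
train_std = standard(train)

print(train_mm.min(), train_mm.max())
print(dev_mm.min(), dev_mm.max())
```

Skipping normalization leaves features on very different scales, which tends to hurt gradient-based linear models such as the `SGDClassifier` used later.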

In [35]:
# split out data at the conversational level for more feature availability
convos = {s: splits[s].segment() for s in splits}
{s: len(convos[s]) for s in convos}

{'train': 327, 'dev': 40, 'test': 86}

In [36]:
# fit the vectorizers to the training split for normalization
k = 'train'
cvec.fit(convos[k])
pvec.fit(convos[k])
uvec.fit(convos[k])

ConvVec: Fitting by conversations: 100%|██████████| 327/327 [05:25<00:00,  1.01it/s]  
PostVec: Fitting by conversations: 100%|██████████| 327/327 [03:32<00:00,  1.54it/s]  
UserVec: Fitting by user: 100%|██████████| 3427/3427 [04:58<00:00, 11.48it/s] 


<pyconversations.feature_extraction.extractors.UserVectorizer at 0x1235dffd0>

In [37]:
# transform all splits of the dataset into the vector and ID mappings 

cvs = {}
pvs = {}
uvs = {}
cids = {}
pids = {}
uids = {}

for k, cxs in convos.items():
    v, i = cvec.transform(cxs, include_ids=True)
    cvs[k] = v
    cids[k] = i
    
    v, i = pvec.transform(cxs, include_ids=True)
    pvs[k] = v
    pids[k] = i
    
    v, i = uvec.transform(cxs, include_ids=True)
    uvs[k] = v
    uids[k] = i

ConvVec: Transforming by conversations: 100%|██████████| 327/327 [05:22<00:00,  1.01it/s]  
PostVec: Transforming by conversations: 100%|██████████| 327/327 [03:32<00:00,  1.54it/s]  
UserVec: Transforming by users: 100%|██████████| 3427/3427 [05:08<00:00, 11.11it/s] 
ConvVec: Transforming by conversations: 100%|██████████| 40/40 [04:53<00:00,  7.35s/it]
PostVec: Transforming by conversations: 100%|██████████| 40/40 [04:08<00:00,  6.21s/it]
UserVec: Transforming by users: 100%|██████████| 1022/1022 [04:48<00:00,  3.54it/s]
ConvVec: Transforming by conversations: 100%|██████████| 86/86 [08:56<00:00,  6.24s/it]  
PostVec: Transforming by conversations: 100%|██████████| 86/86 [06:12<00:00,  4.33s/it]  
UserVec: Transforming by users: 100%|██████████| 1277/1277 [07:46<00:00,  2.74it/s] 


In [45]:
def build_input(cxs, pvs, cvs, uvs, pids, cids, uids):
    """
    This function builds vectors for each post for SDQC classification.
    Here, we naively produce vectors that contain:
    * all features for the post in question
    * all features for the conversation the post is in
    * all features for the user who wrote the post
    
    This may not be the best feature set but is here as a demonstration 
    of how one might construct more complex feature vectors beyond 
    simple vectorization using built-in vectorizers.

    For example, it might be advantageous to use information about the parent post
    or source post and their users as well.
    """
    pdim = pvs.shape[1]
    cdim = cvs.shape[1]
    udim = uvs.shape[1]
    
    # @ post-level
    xs = np.zeros((len(pids), pdim + cdim + udim)) # (post, post_vec + convo_vec + author_of_post_vec)
    ys = np.zeros(len(pids))  # (post,); posts without a taskA tag keep the default 0 ('comment')
    
    for cx in tqdm(cxs, desc='Building XY-pairs'):
        cid = cx.convo_id
        for pid in cx.posts:
            ix = pids[(cid, pid)]
            px = cx.posts[pid]
            user = px.author
            
            # place post vector
            xs[ix, :pdim] = pvs[ix, :]
            off = pdim
            
            # place conversation vector
            xs[ix, off:cdim + off] = cvs[cids[cid], :]
            off += cdim
            
            # place author's user vector
            xs[ix, off:] = uvs[uids[user], :]
            
            for t in px.tags:
                if 'taskA' in t:
                    lx = t.split('taskA=')[-1]
                    ys[ix] = le.transform([lx])[0]
    return xs, ys
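
The core of `build_input` is a row-wise concatenation: each post row is extended with the row of its conversation and the row of its author. A minimal sketch of that idea on toy arrays, using NumPy indexing instead of an explicit loop (the arrays and the `post_to_convo` / `post_to_user` index maps are hypothetical stand-ins, not PyConversations output):

```python
import numpy as np

# toy stand-ins: 4 posts, 2 conversations, 3 users
pvs = np.arange(8, dtype=float).reshape(4, 2)          # one row per post
cvs = np.array([[10.0], [20.0]])                       # one row per conversation
uvs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])   # one row per user

# for each post row: the row index of its conversation and of its author
post_to_convo = np.array([0, 0, 1, 1])
post_to_user = np.array([0, 1, 2, 0])

# integer-array indexing repeats conversation/user rows so they align with posts
xs = np.hstack([pvs, cvs[post_to_convo], uvs[post_to_user]])
print(xs.shape)
print(xs[2])
```

The same pattern would extend to parent-post or source-post features: build one more index array per extra block and add it to the `np.hstack` call.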

In [39]:
# Construct XY pairs for each data split from vectorized data and maps
xs = {}
ys = {}
for k in convos:
    x, y = build_input(convos[k], pvs[k], cvs[k], uvs[k], pids[k], cids[k], uids[k])
    xs[k] = x
    ys[k] = y
    
    print(k, xs[k].shape, ys[k].shape)

Building XY-pairs: 100%|██████████| 327/327 [00:00<00:00, 1534.97it/s]
Building XY-pairs: 100%|██████████| 40/40 [00:00<00:00, 768.27it/s]
Building XY-pairs: 100%|██████████| 86/86 [00:00<00:00, 1330.22it/s]

train (5217, 2308) (5217,)
dev (1485, 2308) (1485,)
test (1827, 2308) (1827,)

In [40]:
# feature selection on train: fit a simple model and drop unhelpful features
# note: scikit-learn >= 1.1 names this loss 'log_loss'; the old name 'log' was removed in 1.3
k = 'train'
model = SGDClassifier(loss='log', eta0=1e-4, learning_rate='adaptive', n_jobs=-1, random_state=0, max_iter=10_000)
selector = SelectFromModel(estimator=model).fit(xs[k], ys[k])

In [41]:
# print the selected threshold and count the retained features
selector.threshold_, selector.get_support().sum()

(0.02911557573016249, 774)
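
The threshold shown above was not set by hand. When no `threshold` is passed and the estimator is not L1-penalized, `SelectFromModel` keeps the features whose importance is at least the mean importance, where importance for a linear model is the L1 norm of each feature's coefficients across classes. A small sketch on synthetic data that reproduces the mask by hand (`LogisticRegression` is swapped in here only to keep the example quick and version-independent):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# synthetic 3-class problem, unrelated to RumourEval
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

selector = SelectFromModel(LogisticRegression(max_iter=1000)).fit(X, y)

# per-feature importance: sum of absolute coefficients across classes
importances = np.abs(selector.estimator_.coef_).sum(axis=0)

# default cut-off (no explicit threshold, no L1 penalty): the mean importance
manual_mask = importances >= importances.mean()

print(selector.threshold_, importances.mean())
print(selector.get_support().sum(), manual_mask.sum())
```

Passing `threshold='median'` or a scaled string such as `'0.5*mean'` to `SelectFromModel` is one way to keep more or fewer features than this default does.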

In [42]:
# trim x-data to recommended feature set
xs_trimmed = {
    k: selector.transform(xs[k])
    for k in xs
}

{k: xs_trimmed[k].shape for k in xs_trimmed}

{'train': (5217, 774), 'dev': (1485, 774), 'test': (1827, 774)}

In [43]:
# a simple dev-based hyperparameter selection approach
# (on scikit-learn >= 1.3, use 'log_loss' in place of 'log' in the list below)
best_s = 0
best_eta = None
best_loss = None

k = 'train'

for loss in ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron']:
    for eta in tqdm([1e-4, 1e-3, 1e-2, 1e-1]):
        model = SGDClassifier(loss=loss, eta0=eta, learning_rate='adaptive', n_jobs=-1, random_state=0, max_iter=10_000)
        model.fit(xs_trimmed[k], ys[k])
        dev_preds = model.predict(xs_trimmed['dev'])
        s = f1_score(ys['dev'], dev_preds, average='macro')
    
        if s > best_s:
            best_s = s
            best_eta = eta
            best_loss = loss
        
best_s, best_eta, best_loss

100%|██████████| 4/4 [00:02<00:00,  1.50it/s]
100%|██████████| 4/4 [00:04<00:00,  1.03s/it]
100%|██████████| 4/4 [00:04<00:00,  1.13s/it]
100%|██████████| 4/4 [01:16<00:00, 19.10s/it]
100%|██████████| 4/4 [00:03<00:00,  1.25it/s]


(0.4024571153448402, 0.0001, 'perceptron')
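
The selection loop scores models with macro-averaged F1 rather than accuracy. With classes as imbalanced as these, the difference matters: macro F1 weights every class equally, so a model that ignores a minority class is penalized. A toy illustration with made-up labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# toy imbalanced labels: 8 examples of class 0, 2 of class 1
y_true = [0] * 8 + [1] * 2
# a degenerate model that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(f'accuracy={acc:.2f} macro-F1={macro:.2f} weighted-F1={weighted:.2f}')
```

Here accuracy is 0.80 while macro F1 is only about 0.44, because the minority class contributes an F1 of zero to the average.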

In [44]:
# re-fit on train with the best hyperparameters and observe results
model = SGDClassifier(loss=best_loss, eta0=best_eta, learning_rate='adaptive', n_jobs=-1, random_state=0, max_iter=10_000)
model.fit(xs_trimmed['train'], ys['train'])

train_preds = model.predict(xs_trimmed['train'])
dev_preds = model.predict(xs_trimmed['dev'])
test_preds = model.predict(xs_trimmed['test'])

print('='*50)
print('train')
print(classification_report(ys['train'], train_preds))
print('='*50)
print('dev')
print(classification_report(ys['dev'], dev_preds))
print('='*50)
print('test')
print(classification_report(ys['test'], test_preds))
print('='*50)

train
              precision    recall  f1-score   support

         0.0       0.81      0.82      0.82      3519
         1.0       0.27      0.17      0.21       378
         2.0       0.48      0.52      0.50       395
         3.0       0.55      0.60      0.57       925

    accuracy                           0.71      5217
   macro avg       0.53      0.53      0.52      5217
weighted avg       0.70      0.71      0.71      5217

dev
              precision    recall  f1-score   support

         0.0       0.85      0.87      0.86      1181
         1.0       0.00      0.00      0.00        82
         2.0       0.55      0.47      0.51       120
         3.0       0.19      0.30      0.24       102

    accuracy                           0.75      1485
   macro avg       0.40      0.41      0.40      1485
weighted avg       0.74      0.75      0.74      1485

test
              precision    recall  f1-score   support

         0.0       0.85      0.79      0.82      1476
      

Though the scores are low, they are better than half of the submissions to the original SemEval competition! Also, recall that this very, very simple pipeline uses no semantic information (e.g., word vectors), so these results are not too shabby!
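
The reports above identify classes by their numeric codes (0.0 to 3.0), which makes them harder to read. A minimal sketch of how the fitted label encoder could supply readable row names through `target_names`. The labels and predictions here are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['comment', 'support', 'deny', 'query'])

# made-up integer labels and predictions, for illustration only
y_true = np.array([0, 0, 1, 2, 3, 3])
y_pred = np.array([0, 0, 0, 2, 3, 0])

report = classification_report(
    y_true, y_pred,
    labels=[0, 1, 2, 3],              # fix the row order to the encoder's codes
    target_names=list(le.classes_),   # comment, deny, query, support
    zero_division=0,                  # score never-predicted classes as 0
)
print(report)
```

In this notebook `ys` holds floats because it is allocated with `np.zeros`, so casting with `.astype(int)` before building the report keeps the codes aligned with `le.classes_`.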