# Baseline: HeteroGraphConv

This notebook runs HeteroGraphConv, the baseline graph model that Nielsen and McConville (2022) used for the supervised learning tasks on the MuMiN dataset.

Note: Much of the code used here comes from the repository the authors used to run the models for their paper: https://github.com/MuMiN-dataset/mumin-baseline. That code is included via `git submodule`.

In [1]:
# Load the autoreload extension
%load_ext autoreload
%autoreload 2

In [2]:
# Import libraries for this notebook
from mumin import MuminDataset, save_dgl_graph
import pandas as pd
from pathlib import Path

In [3]:
# Import the claim classification training script (the wildcard import brings `claim_classification` into scope)
from src.train.scripts.claim_classification import *

/home/ericm/Repos/mumin-graph-attention/src/train/scripts/../../mumin-baseline/src/


Using backend: pytorch
2022-04-15 16:16:03.693130: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/ros/melodic/lib
2022-04-15 16:16:03.693173: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


## Load, preview, and prepare data

The data consists of 20 `pandas` dataframes (see the `README.md` under `data/` for how to retrieve them). Seven contain node/entity data (tweet, claim, article, image, user, hashtag, reply), while the other 13 contain the edges/relationships between these entities.
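To make the layout concrete, here is a minimal sketch with toy stand-ins for the real tables. It assumes, as the cells further down suggest, that node tables are keyed by entity name and edge tables by `(source, relation, target)` triples with `src` and `tgt` columns. The dictionaries `nodes` and `rels` and all values in them are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real tables; every value here is invented.
nodes = {
    "claim": pd.DataFrame({"label": ["misinformation", "factual"]}),
    "tweet": pd.DataFrame({"text": ["tweet a", "tweet b", "tweet c"]}),
}
rels = {
    # Edge tables are keyed by (source type, relation name, target type).
    ("tweet", "discusses", "claim"): pd.DataFrame(
        {"src": [0, 1, 2], "tgt": [0, 0, 1]}
    ),
}

# One row per entity in each node table.
for node_type, df in nodes.items():
    print(f"{node_type}: {len(df)} rows")

# One row per edge in each relation table.
for (src_type, rel_type, tgt_type), df in rels.items():
    print(f"{src_type} -[{rel_type}]-> {tgt_type}: {len(df)} edges")

num_tables = len(nodes) + len(rels)
```

In the real dataset the same count over `dataset.nodes` and `dataset.rels` should give the 20 dataframes mentioned above.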

In the original work, the authors export the data to a Deep Graph Library (DGL) graph. For consistency, we do the same here.

In [4]:
# Select size (small, medium, or large)
size = 'small'
#size = 'medium'
#size = 'large'

In [4]:
# Load (already compiled) dataset
dataset = MuminDataset(twitter_bearer_token=None, dataset_path=f'data/mumin-{size}.zip')
dataset.compile()
dataset.add_embeddings()

2022-04-15 15:13:33,980 [INFO] Loading dataset


MuminDataset(num_nodes=386,542, num_relations=472,489, size='small', compiled=True, bearer_token_available=False)

In [5]:
# Export to DGL (save to file)
save_dgl_graph(dataset.to_dgl(), Path(f'dgl-graph-{size}.bin'))

2022-04-15 15:25:30,577 [INFO] Outputting to DGL
Using backend: pytorch


In [6]:
# Collect the list of node/entity types
node_list = list(dataset.nodes.keys())

In [7]:
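# Print information about each node/entity type
# (note: dropna(inplace=True) removes rows with missing values from the dataset's own dataframes)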
for node in node_list:
    dataset.nodes[node].dropna(inplace=True)
    print(node)
    print("    len() =", len(dataset.nodes[node]))
    print("    cols =", dataset.nodes[node].columns.to_list())
    print()

claim
    len() = 2100
    cols = ['embedding', 'label', 'reviewers', 'date', 'language', 'keywords', 'cluster_keywords', 'cluster', 'train_mask', 'val_mask', 'test_mask', 'reviewer_emb']

tweet
    len() = 4101
    cols = ['tweet_id', 'text', 'created_at', 'lang', 'source', 'num_retweets', 'num_replies', 'num_quote_tweets', 'text_emb', 'lang_emb']

user
    len() = 153912
    cols = ['user_id', 'verified', 'protected', 'created_at', 'username', 'description', 'url', 'name', 'num_followers', 'num_followees', 'num_tweets', 'num_listed', 'location', 'description_emb']

image
    len() = 1016
    cols = ['url', 'pixels', 'width', 'height', 'pixels_emb']

article
    len() = 1452
    cols = ['url', 'title', 'content', 'title_emb', 'content_emb']

hashtag
    len() = 28182
    cols = ['tag']

reply
    len() = 180106
    cols = ['tweet_id', 'text', 'created_at', 'lang', 'source', 'num_retweets', 'num_replies', 'num_quote_tweets', 'text_emb', 'lang_emb']



In [8]:
# Collect the list of edge/relation types
edge_list = list(dataset.rels.keys())

In [10]:
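# Print information about each edge/relation type
# (note: dropna(inplace=True) removes rows with missing values from the dataset's own dataframes)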
for edge in edge_list:
    dataset.rels[edge].dropna(inplace=True)
    print(edge)
    print("    len() =", len(dataset.rels[edge]))
    print("    cols =", dataset.rels[edge].columns.to_list())
    print()

('tweet', 'discusses', 'claim')
    len() = 5083
    cols = ['src', 'tgt']

('tweet', 'mentions', 'user')
    len() = 1121
    cols = ['src', 'tgt']

('tweet', 'has_image', 'image')
    len() = 1024
    cols = ['src', 'tgt']

('tweet', 'has_hashtag', 'hashtag')
    len() = 2307
    cols = ['src', 'tgt']

('tweet', 'has_article', 'article')
    len() = 1899
    cols = ['src', 'tgt']

('reply', 'reply_to', 'tweet')
    len() = 90101
    cols = ['src', 'tgt']

('reply', 'quote_of', 'tweet')
    len() = 101203
    cols = ['src', 'tgt']

('user', 'posted', 'tweet')
    len() = 4101
    cols = ['src', 'tgt']

('user', 'posted', 'reply')
    len() = 180106
    cols = ['src', 'tgt']

('user', 'mentions', 'user')
    len() = 2825
    cols = ['src', 'tgt']

('user', 'has_hashtag', 'hashtag')
    len() = 50743
    cols = ['src', 'tgt']

('user', 'retweeted', 'tweet')
    len() = 13434
    cols = ['src', 'tgt']

('user', 'follows', 'user')
    len() = 18542
    cols = ['src', 'tgt']



## Task 1: claim classification

“Given a claim and its surrounding subgraph extracted from social media, predict whether or not the claim is misinformation or factual”

This is a **node prediction** task on the knowledge graph.
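As a rough illustration of what node prediction involves here, the sketch below uses the `label`, `train_mask`, `val_mask`, and `test_mask` columns listed for the claim table above. It assumes the masks are boolean and that each claim belongs to exactly one split; the toy `claims` dataframe and its values are invented, and this is not the training code from the authors' repository.

```python
import pandas as pd

# Toy claim table; the column names follow the claim dataframe printed earlier,
# the values are invented for illustration.
claims = pd.DataFrame({
    "label": ["misinformation", "misinformation", "factual",
              "misinformation", "factual"],
    "train_mask": [True, True, True, False, False],
    "val_mask": [False, False, False, True, False],
    "test_mask": [False, False, False, False, True],
})

# Number of claims in each split (assumes boolean masks).
split_sizes = {
    split: int(claims[f"{split}_mask"].sum())
    for split in ("train", "val", "test")
}

# Class balance among the training claims.
train_balance = claims.loc[claims["train_mask"], "label"].value_counts().to_dict()

# Sanity check: every claim should fall into exactly one split.
mask_columns = ["train_mask", "val_mask", "test_mask"]
one_split_each = bool((claims[mask_columns].sum(axis=1) == 1).all())

print(split_sizes, train_balance, one_split_each)
```

A check like this is a cheap way to spot class imbalance or overlapping splits before starting a long training run.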

In [6]:
claim_classification(model="hgc", size=size)

Training:   0%|          | 0/300 [00:00<?, ?it/s]



KeyboardInterrupt: 

## Task 2: tweet classification

“Given a source tweet that has not yet been fact checked, predict whether or not the tweet discusses a claim whose verdict is misinformation or factual”

This is an **edge prediction** task on the knowledge graph.
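One way to read the task description is that a tweet takes its label from the claim at the other end of its `('tweet', 'discusses', 'claim')` edge. The sketch below shows that join on toy data. It assumes `src` and `tgt` are row positions in the tweet and claim tables, and that a tweet linked to several claims takes the label of the first one; both are assumptions made for illustration, and the dataframes are invented.

```python
import pandas as pd

# Toy tables; all values are invented for illustration.
claims = pd.DataFrame({"label": ["misinformation", "factual"]})
tweets = pd.DataFrame({"text": ["tweet a", "tweet b", "tweet c"]})
discusses = pd.DataFrame({"src": [0, 1, 2], "tgt": [0, 0, 1]})

# Attach to each edge the label of the claim it points to
# (assumes `tgt` holds row positions in the claim table).
edge_labels = discusses.merge(
    claims[["label"]], left_on="tgt", right_index=True, how="left"
)

# Keep one label per tweet (assumption: the first linked claim decides).
tweet_labels = edge_labels.drop_duplicates(subset="src").set_index("src")["label"]

# Tweets without a `discusses` edge would end up with a missing label.
tweets["label"] = tweet_labels.reindex(tweets.index)
print(tweets)
```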