<a href="https://colab.research.google.com/github/verma-saloni/Thesis-Work/blob/main/09_10_22_again_politifact_pytorch_biggraph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install the libraries needed to train PyTorch-BigGraph (PBG) embeddings.

In [1]:
!pip -qq install jsonlines

In [2]:
!git clone -qb working https://github.com/verma-saloni/PyTorch-BigGraph

In [3]:
%cd PyTorch-BigGraph/
!pip -qq install .
%cd /content/

/content/PyTorch-BigGraph
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
  Building wheel for torchbiggraph (setup.py) ... done
/content


In [4]:
from google.colab import drive
drive.mount('/gdrive')

from pathlib import Path
base_dir = Path("/gdrive/MyDrive/ResearchFND")
assert base_dir.exists()

Mounted at /gdrive


## Data prep

In [5]:
import pandas as pd
import ast
import os
import json
import jsonlines
import numpy as np
import torch

import IPython.display as ipd

Read the aggregated PolitiFact dataset.


In [6]:
df = pd.read_csv(base_dir/'politifact_agg.csv', index_col=0)
df.head(2)

| Unnamed: 0 | title | text | tweets | retweets | label | url | tweet_ids | num_retweets | log_num_retweets | num_tweets | log_num_tweets |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Actress Emma Stone ‘For the first time in his... | | [] | ['1020554564334964741', '1020817527046197248',... | fake | | [] | 2911 | 7.976595 | 0 | 0.0 |
| 1 | Breaking President Trump makes English the of... | | [] | [] | fake | | [] | 0 | 0.0 | 0 | 0.0 |


In [7]:
with open(base_dir/'t2u.json') as f:
    t2u = json.load(f)

with open(base_dir/'users_info.json') as f:
    users_info = json.load(f)

In [8]:
df['tweets'] = df.tweets.map(ast.literal_eval)
users_tweeted = df.tweets.map(lambda x: [int(e['user_id']) for e in x])

In [9]:
df['retweets'] = df.retweets.map(ast.literal_eval)
users_retweeted = df.retweets.map(lambda x: [t2u[str(e)] for e in x if (str(e) in t2u) ])
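
The `tweets` and `retweets` columns come out of the CSV as stringified Python lists, which is why `ast.literal_eval` is applied before any user lookup. The sketch below shows that parsing step and the filtered lookup on made-up values. It assumes `t2u` maps tweet IDs (as strings) to user IDs, which is what the cell above suggests; `raw_cell` and `sample_t2u` are illustrative names, not data from the notebook.

```python
import ast

# Illustrative stand-ins for one CSV cell and for the tweet-to-user mapping.
raw_cell = "['111', '222', '333']"
sample_t2u = {"111": 9001, "333": 9003}

# literal_eval only evaluates Python literals, so it is safer than eval here.
retweet_ids = ast.literal_eval(raw_cell)

# Keep only the retweets whose author is known, as the cell above does.
retweeting_users = [sample_t2u[str(t)] for t in retweet_ids if str(t) in sample_t2u]
print(retweeting_users)
```

Retweets without an entry in the mapping are silently dropped, so the number of edges can be smaller than `num_retweets`.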

In [10]:
len(users_tweeted), sum(users_tweeted.map(len) > 0)

(894, 149)

In [11]:
len(users_retweeted), sum(users_retweeted.map(len) > 0)

(894, 22)

In [12]:
follow_src = []
follow_dst = []
with jsonlines.open(base_dir/"followers.jsonl") as reader:
    for line in reader:
        v = line["user_id"]
        for u in line["followers"]:
            follow_src.append(u)
            follow_dst.append(v)

In [13]:
with jsonlines.open(base_dir/"following.jsonl") as reader:
    for line in reader:
        u = line["user_id"]
        for v in line["following"]:
            follow_src.append(u)
            follow_dst.append(v)

In [14]:
# Add follow edges for the "retweet" users, using the follower and friend lists in users_info
for u, info in users_info.items():
    u = int(u)
    for v in info['followers']:
        follow_src.append(v)
        follow_dst.append(u)
    for v in info['friends']:
        follow_src.append(u)
        follow_dst.append(v)
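
The follow edges are collected from three sources, so the same (follower, followed) pair may be appended more than once. Whether duplicates matter depends on the training setup. If they are unwanted, one possible approach is to drop repeated pairs before writing the edge file. `dedupe_edges` below is a hypothetical helper, not part of the notebook, and it assumes the IDs use one consistent type (all `int` or all `str`).

```python
def dedupe_edges(src, dst):
    """Return (src, dst) lists with repeated pairs removed, keeping first-seen order."""
    seen = set()
    out_src, out_dst = [], []
    for u, v in zip(src, dst):
        if (u, v) not in seen:
            seen.add((u, v))
            out_src.append(u)
            out_dst.append(v)
    return out_src, out_dst

# Illustrative IDs only: the pair (1, 10) appears twice.
demo_src = [1, 2, 1, 3]
demo_dst = [10, 10, 10, 11]
print(dedupe_edges(demo_src, demo_dst))
```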

In [15]:
tweet_src = []
tweet_dst = []

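# One edge per (user, article) pair; the Series index identifies the article.
# Series.iteritems() was removed in pandas 2.0, so use .items() instead.
for article_id, user_ids in users_tweeted.items():
    for user_id in user_ids:
        tweet_src.append(user_id)
        tweet_dst.append(article_id)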

In [16]:
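# Retweets are added to the same edge lists as tweets.
# Series.iteritems() was removed in pandas 2.0, so use .items() instead.
for article_id, user_ids in users_retweeted.items():
    for user_id in user_ids:
        tweet_src.append(user_id)
        tweet_dst.append(article_id)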

In [17]:
with open('edges.txt', 'w') as f:
    for src, dst in zip(follow_src, follow_dst):
        f.write(f"{src}\t{dst}\tfollows\n")
    for src, dst in zip(tweet_src, tweet_dst):
        f.write(f"{src}\t{dst}\ttwitted\n")

In [18]:
!head -n 5 edges.txt

983553159057498114	961251714857828357	follows
988529911873921024	961251714857828357	follows
961635897929359360	961251714857828357	follows
159717173	961251714857828357	follows
4737344780	961251714857828357	follows
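
The first lines of the file only show `follows` edges, so it can be worth checking how many edges of each relation type were written. A minimal sketch, assuming the three-column tab-separated layout written above; `count_relations` is an illustrative helper, and the sample lines are made up.

```python
from collections import Counter

def count_relations(lines):
    """Count edges per relation type in tab-separated 'src, dst, relation' lines."""
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            counts[parts[2]] += 1
    return counts

# Illustrative lines in the same layout as edges.txt.
sample_lines = ["1\t2\tfollows\n", "3\t2\tfollows\n", "1\t0\ttwitted\n"]
print(count_relations(sample_lines))
```

On the real file the same function can be called as `count_relations(open("edges.txt"))`.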


In [19]:
cp edges.txt $base_dir/

## Training

In [20]:
pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [21]:
import random
from pathlib import Path

import attr
import pkg_resources
from torchbiggraph.config import add_to_sys_path, ConfigFileLoader
from torchbiggraph.converters.importers import convert_input_data, TSVEdgelistReader
from torchbiggraph.converters.utils import download_url, extract_gzip, extract_tar
from torchbiggraph.eval import do_eval
from torchbiggraph.train import train
from torchbiggraph.util import (
    set_logging_verbosity,
    setup_logging,
    SubprocessInitializer,
)

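Create a local `data` directory for the PBG input files and point `fpath` at the `edges.txt` file saved to Drive above. Each line of that file is one edge: a Twitter user (by ID) following another user, or a user who tweeted or retweeted an article.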


In [22]:
data_dir = Path('./data')
data_dir.mkdir(parents=True, exist_ok=True)
fpath = base_dir/'edges.txt'

Split the edges into training and validation files. The split is random and 80/20, and it is done separately for the `follows` edges and the tweet edges so that both relation types appear in each file.

In [23]:
import random

def split_edges(input_file, pct=0.8, train_file='data/train_edges.txt', valid_file='data/valid_edges.txt'):
    """Randomly split an edge file into train and validation files.

    A fraction `pct` of each relation type goes to `train_file`,
    and the remainder goes to `valid_file`.
    """
    with open(input_file) as f:
        lines = f.readlines()

    # Separate the two relation types so each one is split on its own.
    follow_edges, tweet_edges = [], []
    for line in lines:
        if line.strip().endswith('follows'):
            follow_edges.append(line)
        else:
            tweet_edges.append(line)

    random.shuffle(follow_edges)
    random.shuffle(tweet_edges)
    follow_split, tweet_split = int(pct*len(follow_edges)), int(pct*len(tweet_edges))
    train_edges = follow_edges[:follow_split] + tweet_edges[:tweet_split]
    valid_edges = follow_edges[follow_split:] + tweet_edges[tweet_split:]

    with open(train_file, 'w') as f:
        f.writelines(train_edges)

    with open(valid_file, 'w') as f:
        f.writelines(valid_edges)

In [24]:
split_edges(fpath)
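
Because `random.shuffle` is called without a seed, every run produces a different split, which makes metrics harder to compare across runs. The sketch below shows one way to get a repeatable per-relation split with a local seeded generator. `split_by_relation` and its `seed` parameter are assumptions added for illustration, not part of the notebook, and the demo edges are made up.

```python
import random

def split_by_relation(lines, pct=0.8, seed=0):
    """Shuffle and split edge lines per relation type so each split keeps every type."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    groups = {}
    for line in lines:
        rel = line.rstrip("\n").split("\t")[-1]
        groups.setdefault(rel, []).append(line)
    train, valid = [], []
    for rel_lines in groups.values():
        rng.shuffle(rel_lines)
        cut = int(pct * len(rel_lines))
        train.extend(rel_lines[:cut])
        valid.extend(rel_lines[cut:])
    return train, valid

# Illustrative edges: 10 "follows" and 5 "twitted".
demo = [f"{i}\t0\tfollows\n" for i in range(10)] + [f"{i}\t1\ttwitted\n" for i in range(5)]
train, valid = split_by_relation(demo)
print(len(train), len(valid))
```

Grouping by the last column also means a new relation type would be split correctly without changing the function.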

In [25]:
loader = ConfigFileLoader()
config = loader.load_config('PyTorch-BigGraph/torchbiggraph/examples/configs/politifact_config.py', [])
set_logging_verbosity(0)
subprocess_init = SubprocessInitializer()
# The GitHub gist used verbosity 1, in case the extra logs were needed later to analyse connections.
# It is set to 0 here to cut unnecessary output in the following cells.
# subprocess_init.register(setup_logging, 1)
subprocess_init.register(setup_logging, 0)
subprocess_init.register(add_to_sys_path, loader.config_dir.name)
input_edge_paths = [data_dir/'train_edges.txt', data_dir/'valid_edges.txt']
output_train_path, output_test_path = config.edge_paths
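
The contents of `politifact_config.py` are not shown in this notebook. A PBG config file is a Python module that defines `get_torchbiggraph_config()` and returns a dictionary. The sketch below is a hypothetical config that is merely consistent with the logs in this notebook: two entity types (`user`, `article`), two relation types, 30 epochs, and the `data/train_partitioned` and `data/test_partitioned` edge paths. Everything else (`entity_path`, `checkpoint_path`, `dimension`, `operator`, `comparator`, `lr`) is an assumption and may differ from the real file.

```python
def get_torchbiggraph_config():
    """Hypothetical PBG config; values marked 'assumed' are not taken from the notebook."""
    config = dict(
        # I/O locations
        entity_path="data",                     # assumed
        edge_paths=["data/train_partitioned", "data/test_partitioned"],
        checkpoint_path="model/politifact",     # assumed
        # Graph structure: one partition per entity type, as in the logs
        entities={
            "user": {"num_partitions": 1},
            "article": {"num_partitions": 1},
        },
        relations=[
            {"name": "follows", "lhs": "user", "rhs": "user", "operator": "none"},     # operator assumed
            {"name": "twitted", "lhs": "user", "rhs": "article", "operator": "none"},  # operator assumed
        ],
        dynamic_relations=False,
        # Model and training
        dimension=100,      # assumed
        comparator="dot",   # assumed
        num_epochs=30,
        lr=0.1,             # assumed
    )
    return config

print(sorted(get_torchbiggraph_config()))
```

The relation names have to match the third column of the edge files, which is why `follows` and `twitted` are used here.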

Convert the two TSV edge files into PBG's input format. This builds the entity and relation-type dictionaries and writes the converted edges to the paths listed in the config.


In [26]:
convert_input_data(
    config.entities,
    config.relations,
    config.entity_path,
    config.edge_paths,
    input_edge_paths,
    TSVEdgelistReader(lhs_col=0, rhs_col=1, rel_col=2),
    dynamic_relations=config.dynamic_relations,
)

[2022-09-10 11:36:06.463218] Using the 2 relation types given in the config
[2022-09-10 11:36:06.466648] Searching for the entities in the edge files...
[2022-09-10 11:36:08.639378] Entity type user:
[2022-09-10 11:36:08.640631] - Found 678041 entities
[2022-09-10 11:36:08.646778] - Removing the ones with fewer than 1 occurrences...
[2022-09-10 11:36:08.901279] - Left with 678041 entities
[2022-09-10 11:36:08.911562] - Shuffling them...
[2022-09-10 11:36:09.731474] Entity type article:
[2022-09-10 11:36:09.732871] - Found 171 entities
[2022-09-10 11:36:09.737945] - Removing the ones with fewer than 1 occurrences...
[2022-09-10 11:36:09.740792] - Left with 171 entities
[2022-09-10 11:36:09.742400] - Shuffling them...
[2022-09-10 11:36:09.750680] Preparing counts and dictionaries for entities and relation types:
[2022-09-10 11:36:09.752502] - Writing count of entity type user and partition 0
[2022-09-10 11:36:10.586361] - Writing count of entity type article and partition 0
[2022-09-10 1

In [27]:
train_config = attr.evolve(config, edge_paths=[output_train_path])
train(train_config, subprocess_init=subprocess_init)

INFO:torchbiggraph:Loading entity counts...
INFO:torchbiggraph:Creating workers...
INFO:torchbiggraph:Initializing global model...
INFO:torchbiggraph:Starting epoch 1 / 30, edge path 1 / 1, edge chunk 1 / 1
INFO:torchbiggraph:Edge path: data/train_partitioned
INFO:torchbiggraph:still in queue: 0
INFO:torchbiggraph:Swapping partitioned embeddings None ( 0 , 0 )
INFO:torchbiggraph:Loading partitioned embeddings from checkpoint
INFO:torchbiggraph:( 0 , 0 ): Stats before training: loss:  453.605 , pos_rank:  994.635 , mrr:  0.003828 , r1:  0.0005276 , r10:  0.00420431 , r50:  0.024352 , auc:  0.499357 , count:  30326
INFO:torchbiggraph:( 0 , 0 ): Training stats: loss:  23.375 , reg:  0 , violators_lhs:  44.4689 , violators_rhs:  45.0932 , count:  576212
INFO:torchbiggraph:( 0 , 0 ): Stats after training: loss:  439.466 , pos_rank:  835.509 , mrr:  0.0371225 , r1:  0.0234287 , r10:  0.0659995 , r50:  0.117259 , auc:  0.584416 , count:  30326
INFO:torchbiggraph:( 0 , 0 ): bucket 1 / 1 : Trai

In [28]:
eval_config = attr.evolve(config, edge_paths=[output_test_path])
do_eval(eval_config, subprocess_init=subprocess_init)

INFO:torchbiggraph:Starting edge path 1 / 1 (data/test_partitioned)
INFO:torchbiggraph:( 0 , 0 ): Processed 151636 edges in 1.9 s (0.079M/sec); load time: 0.45 s
INFO:torchbiggraph:Stats for edge path 1 / 1, bucket ( 0 , 0 ): loss:  27.5657 , pos_rank:  46.443 , mrr:  0.112117 , r1:  0.0646647 , r10:  0.165558 , r50:  0.552514 , auc:  0.538025 , count:  151636
INFO:torchbiggraph:
INFO:torchbiggraph:Stats for edge path 1 / 1: loss:  27.5657 , pos_rank:  46.443 , mrr:  0.112117 , r1:  0.0646647 , r10:  0.165558 , r50:  0.552514 , auc:  0.538025 , count:  151636
INFO:torchbiggraph:
INFO:torchbiggraph:
INFO:torchbiggraph:Stats: loss:  27.5657 , pos_rank:  46.443 , mrr:  0.112117 , r1:  0.0646647 , r10:  0.165558 , r50:  0.552514 , auc:  0.538025 , count:  151636
INFO:torchbiggraph:
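
The evaluation log reports ranking metrics. As far as the names indicate, `pos_rank` is the mean rank of the true edge among its sampled negatives, `mrr` is the mean reciprocal rank, and `r1`, `r10`, `r50` are the fractions of true edges ranked in the top 1, 10, and 50. The sketch below computes these quantities from a list of ranks under that reading. `ranking_metrics` is an illustrative helper, not PBG code, the ranks are made up, and `auc` is not covered.

```python
def ranking_metrics(ranks, ks=(1, 10, 50)):
    """Summarise 1-based ranks of true edges among their candidate negatives."""
    n = len(ranks)
    stats = {
        "pos_rank": sum(ranks) / n,             # mean rank (lower is better)
        "mrr": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank (higher is better)
    }
    for k in ks:
        # Fraction of true edges ranked within the top k.
        stats[f"r{k}"] = sum(r <= k for r in ranks) / n
    return stats

demo_ranks = [1, 2, 4, 100]  # illustrative ranks only
print(ranking_metrics(demo_ranks))
```

A single badly ranked edge moves `pos_rank` a lot but barely changes `mrr`, which is one reason both are reported.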


In [29]:
!torchbiggraph_export_to_tsv \
    'PyTorch-BigGraph/torchbiggraph/examples/configs/politifact_config.py' \
    --entities-output entity_embeddings.tsv \
    --relation-types-output relation_types_parameters.tsv

Loading relation types and entities...
Initializing model...
Loading model check point...
Writing entity embeddings...
Reading embeddings for entity type user partition 0 from checkpoint...
Writing embeddings for entity type user partition 0 to output file...
- Processed 5000/678041 entities so far...
- Processed 10000/678041 entities so far...
- Processed 15000/678041 entities so far...
- Processed 20000/678041 entities so far...
- Processed 25000/678041 entities so far...
- Processed 30000/678041 entities so far...
- Processed 35000/678041 entities so far...
- Processed 40000/678041 entities so far...
- Processed 45000/678041 entities so far...
- Processed 50000/678041 entities so far...
- Processed 55000/678041 entities so far...
- Processed 60000/678041 entities so far...
- Processed 65000/678041 entities so far...
- Processed 70000/678041 entities so far...
- Processed 75000/678041 entities so far...
- Processed 80000/678041 entities so far...
- Processed 85000/678041 entities so 
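
Once the export finishes, the embeddings can be read back for downstream use. The sketch below assumes each line of `entity_embeddings.tsv` holds an entity identifier followed by tab-separated float components. Check the first lines of the exported file to confirm that layout. `read_embeddings_tsv` is an illustrative helper, and the sample values are made up.

```python
import io

def read_embeddings_tsv(handle):
    """Parse 'entity<TAB>v1<TAB>v2...' lines into a dict of entity -> list of floats."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Illustrative two-entity file with 3-dimensional vectors.
sample = io.StringIO("123\t0.1\t-0.2\t0.3\n456\t0.0\t0.5\t-0.5\n")
embeddings = read_embeddings_tsv(sample)
print(len(embeddings), len(embeddings["123"]))
```

With the real file, pass `open("entity_embeddings.tsv")` instead of the in-memory sample. For several hundred thousand entities, a NumPy array would be more memory-efficient than nested lists.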

In [30]:
# PBG source and documentation, including the training explanations: https://github.com/facebookresearch/PyTorch-BigGraph#training