
# CDCR Corpus Builder

This notebook can be used to generate the CD$^2$CR corpus for non-commercial 
personal and academic use.

The notebook can be run inside a Google Colab environment and will automatically
install the necessary dependencies.

The output is a set of JSON files compatible with the [Cattan et al. state-of-the-art CDCR model](https://github.com/ariecattan/coref) and, optionally, a CoNLL-formatted 
text file for use with other models.



In [1]:
!pip install bs4 requests pyrouge tqdm stanford-corenlp numpy spacy newspaper3k
!python -m spacy download en_core_web_sm

Collecting pyrouge
  Downloading https://files.pythonhosted.org/packages/11/85/e522dd6b36880ca19dcf7f262b22365748f56edc6f455e7b6a37d0382c32/pyrouge-0.1.3.tar.gz (60kB)
     |████████████████████████████████| 61kB 3.7MB/s 
Collecting stanford-corenlp
  Downloading https://files.pythonhosted.org/packages/31/a9/695357743b55c08e74e46fe72579ca2a2559fa9b196d9f2035339af89b94/stanford_corenlp-3.9.2-py2.py3-none-any.whl
Collecting newspaper3k
  Downloading https://files.pythonhosted.org/packages/d7/b9/51afecb35bb61b188a4b44868001de348a0e8134b4dfa00ffc191567c4b9/newspaper3k-0.2.8-py3-none-any.whl (211kB)
     |████████████████████████████████| 215kB 5.8MB/s 
Collecting corenlp-protobuf>=3.8.0
  Downloading https://files.pythonhosted.org/packages/78/93/cc40d521cf6635fffa400b62799ddc761159302643d400cee72bd910efa9/corenlp_protobuf-3.8.0-py2.py3-none-any.whl
Collecting cssselect>=0.9.2
  Downloading https://files.pythonhosted.org/packages/3b/d4/3b5c17f00cce85b9a1e6f91096e1cc8e8ede2

We now download and install the [Grenander et al. de-biasing news summarizer](https://github.com/mgrenander/banditsum-kl), which generates the news summaries as in our paper.

In [2]:
!git clone https://github.com/mgrenander/banditsum-kl
!gdown https://drive.google.com/uc?id=1-E8IakncMDn5DkSl4hZXbg332ISwpjHG && mv /content/banditsum_kl_model.pt /content/banditsum-kl/model/
!gdown https://drive.google.com/uc?id=1QCrb4bpPP7ldpbEthWYRh4hMFOAzTSPP && mv /content/vocab_100d.p /content/banditsum-kl/data/vocab
# Stanford CoreNLP is only needed if you use other parts of banditsum
#!wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip && unzip stanford-corenlp-latest.zip
# Use %env rather than !export: each ! command runs in its own subshell, so an export would not persist
#%env CORENLP_HOME=/content/stanford-corenlp-4.1.0/

Cloning into 'banditsum-kl'...
remote: Enumerating objects: 86, done.
remote: Counting objects: 100% (86/86), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 86 (delta 46), reused 55 (delta 23), pack-reused 0
Unpacking objects: 100% (86/86), done.
Downloading...
From: https://drive.google.com/uc?id=1-E8IakncMDn5DkSl4hZXbg332ISwpjHG
To: /content/banditsum_kl_model.pt
266MB [00:02, 107MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QCrb4bpPP7ldpbEthWYRh4hMFOAzTSPP
To: /content/vocab_100d.p
165MB [00:02, 69.6MB/s]


Next we use `webgetter.py` to download and store the full text of the news articles whose URLs are listed in `news_urls.json`. The output is written to `news_content.json` incrementally, so if you stop the process it will continue from where it left off when rerun.

We enforce a 2-second wait between HTTP requests (to avoid saturating the websites we are downloading from), but you can change this with the `--wait` argument below.
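
The resume-and-throttle behaviour described here can be sketched roughly as follows. This is not the actual `webgetter.py`: the function `fetch_incrementally` and its `fetch` callback are hypothetical names used only for illustration, and the real script may be structured differently.

```python
import json
import os
import time


def fetch_incrementally(urls, store_path, fetch, wait=2.0):
    """Fetch each URL once, saving after every download so a rerun can resume.

    `fetch` is a caller-supplied function mapping a URL to its article text
    (a hypothetical stand-in for whatever webgetter.py does internally).
    """
    # Load whatever a previous (possibly interrupted) run already stored.
    if os.path.exists(store_path):
        with open(store_path, "r") as f:
            content = json.load(f)
    else:
        content = {}

    for url in urls:
        if url in content:
            # Already downloaded in an earlier run: no request, no wait.
            continue
        content[url] = fetch(url)
        # Write the store back immediately so progress survives interruption.
        with open(store_path, "w") as f:
            json.dump(content, f, indent=2)
        # Pause between requests to avoid saturating the remote site.
        time.sleep(wait)

    return content
```

Saving after every article costs some extra disk writes, but it means an interrupted run loses at most the article that was in flight.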

In [231]:
!python webgetter.py ./news_urls.json ./news_content.json --wait 2

[1, 2, 4, 5, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 24, 29, 31, 41, 44, 45, 46, 48, 49, 53, 55, 57, 60, 61, 63, 64, 65, 66, 67, 69, 70, 71, 72, 73, 77, 78, 79, 80, 81, 83, 86, 87, 88, 89, 93, 94, 105, 106, 109, 110, 111, 115, 116, 118, 119, 122, 123, 124, 125, 127, 128, 131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 161, 162, 168, 169, 173, 174, 175, 176, 180, 181, 182, 184, 185, 186, 189, 190, 195, 204, 205, 213, 214, 215, 220, 222, 223, 224, 225, 226, 227, 228, 229, 230, 232, 233, 234, 237, 238, 239, 240, 242]
[1/243] Fetch content for http://m.bbc.co.uk/news/health-17221910...
[0.0s per article - est. 0 minutes remaining]
Skipping content for existing doc http://m.bbc.co.uk/news/health-17221910
[2/243] Fetch content for http://m.bbc.co.uk/news/health-17398746...
[0.0s per article - est. 0 minutes remaining]
Skipping content for existing doc http://m.bbc.co.uk/news/health-17398746


We now import and initialize the Grenander et al. summarisation model with default configuration values.

In [185]:
import sys

sys.path.append('/content/banditsum-kl/src')

import helper
import pickle
import argparse

VOCAB_FILE = "banditsum-kl/data/vocab/vocab_100d.p"

with open(VOCAB_FILE, 'rb') as f:
    vocab = pickle.load(f, encoding='latin1')

args = argparse.Namespace()
args.vocab_size=len(vocab.word_list)
args.hidden=200
args.embedding_dim = 100
args.position_size = 500
args.position_dim = 50
args.word_input_size = 100
args.sent_input_size = 2 * args.hidden
args.word_LSTM_hidden_units = args.hidden
args.sent_LSTM_hidden_units = args.hidden
args.pretrained_embedding = vocab.embedding
args.word2id = vocab.w2i
args.id2word = vocab.i2w
args.rl_sample_size=20
args.epsilon=0.1
args.max_num_sents=3
args.kl_method='none'
args.kl_weight=0.0095
args.model_file = "banditsum-kl/model/banditsum_kl_model.pt"

We define some helper functions for carrying out the summarisation itself.

In [5]:
import torch
from model import SimpleRNN

def convert_tokens_to_ids(doc, args):
    max_len = len(max(doc, key=lambda x: len(x)))
    sent_list = []
    for i in range(len(doc)):
        words = doc[i]
        sent = [args.word2id[word] if word in args.word2id else 1 for word in words]
        sent += [0 for _ in range(max_len - len(sent))]  # this is to pad at the end of each sequence
        sent_list.append(sent)
    return torch.tensor(sent_list).long()

def init_model(args):
    rewards = {"train": None, "train_single": None, "dev": None, "dev_single": None}
    model = SimpleRNN(args, rewards)
    model.cuda()
    checkpoint = torch.load(args.model_file)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    
    return model

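def summarize(text, spacy_nlp, model):
    # Split the text into sentences and tokens with spaCy
    doc = spacy_nlp(text)

    # Keep only sentences with more than one non-punctuation token
    sentwords = []
    for sent in doc.sents:
        words = [word.text for word in sent if not word.is_punct]
        if len(words) > 1:
            sentwords.append(words)

    # `args` is the global configuration namespace defined in an earlier cell
    doc_ids = convert_tokens_to_ids(sentwords, args)

    # Run the model without tracking gradients to select sentence indices
    with torch.no_grad():
        summary_idx = model(doc_ids.cuda())

    # Join the selected sentences into the summary text
    sents = [sent for i, sent in enumerate(doc.sents) if i in summary_idx]
    summ = " ".join([s.text for s in sents])

    return summ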
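
To make the padding in `convert_tokens_to_ids` concrete, here is a dependency-free sketch of the same idea using plain lists instead of a tensor. The helper name `pad_token_ids` and the tiny `word2id` vocabulary are invented for illustration; the ids 0 (padding) and 1 (unknown word) mirror the values hard-coded in the function above.

```python
def pad_token_ids(doc, word2id, unk_id=1, pad_id=0):
    """Map tokenised sentences to equal-length rows of ids."""
    # Every row is padded to the length of the longest sentence.
    max_len = max(len(words) for words in doc)
    rows = []
    for words in doc:
        # Words missing from the vocabulary fall back to the unknown id.
        ids = [word2id.get(word, unk_id) for word in words]
        ids += [pad_id] * (max_len - len(ids))
        rows.append(ids)
    return rows


# Hypothetical vocabulary for illustration only.
word2id = {"genes": 2, "influence": 3, "ability": 4}
print(pad_token_ids([["genes", "influence", "ability"], ["unknown", "genes"]], word2id))
# [[2, 3, 4], [1, 2, 0]]
```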

We load the news text from `news_content.json` and the metadata from `news_urls.json`, which tells the summarizer which options to use and what the checksum of the final summary should be in order to match the original corpus.
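
As a rough illustration of how the per-article options and the checksum fit together, the sketch below isolates the individual steps. The field names (`strip_specialchars`, `strip_newlines`, `strip_whitespace`, `str_insert`, `sha256`) and their defaults are the ones this notebook reads; the helper names `preprocess`, `apply_inserts` and `matches_checksum` are hypothetical. In the notebook, the cleaning happens before summarisation and the string inserts are applied to the summary afterwards.

```python
import hashlib
import re

# Characters outside these Latin ranges are dropped when strip_specialchars is set.
NON_LATIN = re.compile(r'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]')


def preprocess(content, news_obj):
    """Clean article text according to the options stored with its URL."""
    if news_obj.get('strip_specialchars', True):
        content = NON_LATIN.sub('', content)
    if news_obj.get('strip_newlines', True):
        content = content.replace("\n\n", " ")
    if news_obj.get('strip_whitespace', False):
        # Collapse any run of whitespace into a single space.
        content = re.sub(r"\s+", " ", content)
    return content


def apply_inserts(summary, news_obj):
    """Re-insert strings at fixed character offsets, lowest offset first."""
    for insert in sorted(news_obj.get('str_insert', []), key=lambda x: x['loc']):
        summary = summary[:insert['loc']] + insert['str'] + summary[insert['loc']:]
    return summary


def matches_checksum(summary, news_obj):
    """True when the summary hashes to the checksum recorded for the corpus."""
    digest = hashlib.sha256(summary.encode('utf8')).hexdigest()
    return digest == news_obj['sha256']
```

Because the comparison is on a SHA-256 digest, even a single differing whitespace character makes a summary count as a mismatch, which is why the cleaning options are recorded per article.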

In [338]:
import json

with open("news_urls.json",'r') as f:
  news_urls = json.load(f)

with open("news_content.json","r") as f:
  news_content = json.load(f)

The summarisation and spaCy NLP models are loaded and prepared.

In [8]:
import torch
import spacy

model = init_model(args)
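# Load the model installed in the first cell; the 'en' shortcut is not available in newer spaCy releases
nlp = spacy.load('en_core_web_sm')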


The summarisation process is carried out for each news article.

In [340]:
import hashlib
import re
from tqdm.auto import tqdm


# Reuse matches from an earlier run of this cell if present; otherwise start empty
match = globals().get('match', {})
nomatch = []
nocontent = []

for id, news_obj in tqdm(news_urls.items()):

  if news_obj['url'] in match:
    continue

  content = news_content.get(news_obj['url'], "")

  #if news_obj['url'] == 'https://web.archive.org/web/20160820032329/http://www.telegraph.co.uk/news/health/news/9633402/Closed-drug-trials-leave-patients-at-risk-and-doctors-in-the-dark.html':
  #  news_obj['str_insert'] = [{"str":" ", "loc":161}]
  

  #if news_obj['url'] in nomatch and "theguardian.com" in news_obj['url']:
  #  news_obj['strip_specialchars'] = False

  if content != "":
    if news_obj.get('strip_specialchars',True):
      content = re.sub(r'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]', '', content) 

    if news_obj.get('strip_newlines', True):
      content = content.replace("\n\n"," ")
    #else:
    #  print(news_obj['url'])
  
    if news_obj.get('strip_whitespace',False):
      content = re.sub(r"[^\S]+", " ", content)

    summary = summarize(content, nlp, model).replace("\n","").strip()

    for insert in sorted(news_obj.get("str_insert", []), key=lambda x:x['loc'], reverse=False):
      summary = summary[:insert['loc']] + insert['str'] + summary[insert['loc']:]

    # if news_obj['url'] == 'https://www.theguardian.com/science/2015/jul/23/genes-influence-academic-ability-across-all-subjects-latest-study-shows':
    #   print(summary)
    #   break
    h = hashlib.new('sha256', summary.encode('utf8'))

    if h.hexdigest() == news_obj['sha256']:
      print(f"Match {news_obj['url']}")
      match[news_obj['url']] = summary
    else:
      nomatch.append(news_obj['url'])
  else:
    nocontent.append(news_obj['url'])



Match https://www.theguardian.com/science/2015/jul/23/genes-influence-academic-ability-across-all-subjects-latest-study-shows



In [341]:
len(match)

171

In [218]:
len(nocontent)

0

In [342]:
len(nomatch)

72

In [325]:
nomatch

['https://web.archive.org/web/20190308160419/http://news.bbc.co.uk/2/hi/science/nature/4640420.stm',
 'https://www.eurekalert.org/pub_releases/2020-04/ibri-ndi042120.php',
 'https://www.eurekalert.org/pub_releases/2020-04/kauo-bpe040720.php',
 'https://www.eurekalert.org/pub_releases/2020-04/tiot-ftw042420.php',
 'https://www.eurekalert.org/pub_releases/2020-04/uoc-qru042120.php',
 'https://www.theguardian.com/science/2015/jul/23/genes-influence-academic-ability-across-all-subjects-latest-study-shows',
 'https://www.theguardian.com/science/2016/apr/25/musical-play-may-boost-understanding-and-long-term-learning-in-babies',
 'https://www.theguardian.com/science/2016/jul/20/updated-map-of-the-human-brain-hailed-as-a-scientific-tour-de-force',
 'https://www.theguardian.com/science/2016/jun/08/doctors-edge-closer-to-creating-babies-with-dna-from-three-people',
 'https://www.theguardian.com/science/2016/may/02/could-these-newly-discovered-planets-orbiting-an-ultracool-dwarf-host-life',
 'htt

In [150]:
from newspaper import Article
a = Article(url=url)
a.download()
a.parse()
a.text

'Tiny droplets of saliva that are sprayed into the air when people speak may be sufficient to spread coronavirus, according to US government scientists who say the finding could help control the outbreak.\n\nResearchers at the US National Institutes of Health (NIH) in Maryland found that talking released thousands of fine droplets into the air that could pose a risk to others if the speaker were infected with the virus.\n\nThe scientists used laser imaging and high-speed videography to show how thousands of droplets that are too small to see with the naked eye are emitted in normal speech, even in short phrases such as “stay healthy”.\n\nThe work is preliminary and has not been peer-reviewed or published, but in a report the scientists claim the findings may have “vital implications” for containing the pandemic.\n\n“If speaking and oral fluid viral load proves to be a major mechanism of Sars-CoV-2 [the official name of the virus] transmission, wearing any kind of cloth mouth cover in p

In [62]:
from importlib import reload

import webgetter
reload(webgetter)

<module 'webgetter' from '/content/webgetter.py'>

In [170]:
import requests
from bs4 import BeautifulSoup
r = requests.get(url)

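# Name the parser explicitly so the result does not depend on which parsers are installed
bs = BeautifulSoup(r.text, 'html.parser')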

bs.select('.content__standfirst p')

[<p>US scientists say findings add to case for wearing masks in public to control outbreak</p>]

In [317]:
eureka_sel =  ".entry p"
guardian_sel = ".content__standfirst, .content__article-body p"
nyt_sel = ".StoryBodyCompanionColumn p"

content = webgetter.get_body_text(url, guardian_sel)
#doc = nlp(webgetter.get_body_text(url, guardian_sel, join_str=""))
#sentences = [sent.text.strip() for sent in doc.sents]
#content = "".join(sentences)

#content
#content = re.sub(r'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]', '', content)
#content = content.replace("\n\n"," ")
#content = re.sub(r"[\s]+", " ", content)

content

'Around 60% of differences in GCSE results can be explained by genetic factors, with the same genes responsible for maths, science and the humanities You may feel you are just not a maths person, or that you have a special gift for languages, but scientists have shown that the genes influencing numerical skills are the same ones that determine abilities in reading, arts and humanities. The study suggests that if you have an academic Achilles heel, environmental factors such as a teaching are more likely to be to blame. The findings add to growing evidence that school performance has a large heritable component, with around 60% of the differences in pupil’s GCSE results being explained by genetic factors. Although scientists are yet to pinpoint specific genes, the latest work, published in the journal Scientific Reports, suggests that the same ones are involved across subjects. Robert Plomin, a professor of genetics at King’s College London and the study’s senior author, said: “We found

In [314]:
url='https://www.theguardian.com/science/2015/jul/23/genes-influence-academic-ability-across-all-subjects-latest-study-shows'

In [327]:
content = news_content[url]
#content = a.text
#doc = nlp(content)
#sentences = [sent.text.strip() for sent in doc.sents]
#content = content.replace("\n\n"," ")
#content = re.sub(r'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]', '', content)
#content = re.sub(r"[\s]+", " ", content)
#content = re.sub(r"[\'\"]")

#content = " ".join(sentences)
content

'Around 60% of differences in GCSE results can be explained by genetic factors, with the same genes responsible for maths, science and the humanities You may feel you are just not a maths person, or that you have a special gift for languages, but scientists have shown that the genes influencing numerical skills are the same ones that determine abilities in reading, arts and humanities. The study suggests that if you have an academic Achilles heel, environmental factors such as a teaching are more likely to be to blame. The findings add to growing evidence that school performance has a large heritable component, with around 60% of the differences in pupil’s GCSE results being explained by genetic factors. Although scientists are yet to pinpoint specific genes, the latest work, published in the journal Scientific Reports, suggests that the same ones are involved across subjects. Robert Plomin, a professor of genetics at King’s College London and the study’s senior author, said: “We found

In [328]:
newsum = summarize(content, nlp, model)
#newsum = re.sub(r"[^\S]+", " ", newsum).strip()
print(newsum)

Around 60% of differences in GCSE results can be explained by genetic factors, with the same genes responsible for maths, science and the humanities You may feel you are just not a maths person, or that you have a special gift for languages, but scientists have shown that the genes influencing numerical skills are the same ones that determine abilities in reading, arts and humanities. The study suggests that if you have an academic Achilles heel, environmental factors such as a teaching are more likely to be to blame. The findings add to growing evidence that school performance has a large heritable component, with around 60% of the differences in pupil’s


In [329]:
oldsum = """Around 60% of differences in GCSE results can be explained by genetic factors, with the same genes responsible for maths, science and the humanities You may feel you are just not a maths person, or that you have a special gift for languages, but scientists have shown that the genes influencing numerical skills are the same ones that determine abilities in reading, arts and humanities. The study suggests that if you have an academic Achilles heel, environmental factors such as a teaching are more likely to be to blame. The findings add to growing evidence that school performance has a large heritable component, with around 60% of the differences in pupil’s"""
oldsum = re.sub(r"[^\S]+", " ", oldsum)
print(oldsum)

Around 60% of differences in GCSE results can be explained by genetic factors, with the same genes responsible for maths, science and the humanities You may feel you are just not a maths person, or that you have a special gift for languages, but scientists have shown that the genes influencing numerical skills are the same ones that determine abilities in reading, arts and humanities. The study suggests that if you have an academic Achilles heel, environmental factors such as a teaching are more likely to be to blame. The findings add to growing evidence that school performance has a large heritable component, with around 60% of the differences in pupil’s


In [330]:
len(oldsum)

663

In [331]:
len(newsum)

663

In [332]:
import hashlib
h1 = hashlib.new('sha256', newsum.encode('utf8')).hexdigest()
h2 = hashlib.new('sha256',oldsum.encode('utf8')).hexdigest()

h1 == h2

True

In [335]:
h1

'6dcfce2a698e4dace5cbaf44d916b7d380e4d298ac7b439ebfdfd40664d15cb2'

In [337]:
item['sha256']

'465b7af306327e0092c9002e156fa46aaae5bf9b44364823ee3b46d9b979ec66'

In [None]:
for url in nomatch:

  if 'theguardian.com' in url:
    news_content[url] = ""

  for id,item in news_urls.items():
    if item['url'] == url:
      item['legacy'] = True
      break

with open("news_urls.json",'w') as f:
  json.dump(news_urls, f, indent=2)
  
with open("news_content.json","w") as f:
  json.dump(news_content, f, indent=2)

In [None]:
with open("news_urls.json",'w') as f:
  json.dump(news_urls, f, indent=2)

In [198]:
for id, item in news_urls.items():

  if item['url'] in nomatch and 'theguardian' in item['url']:
    news_content[item['url']] = ""
    news_urls[id]['legacy'] = False

with open("news_urls.json",'w') as f:
  json.dump(news_urls, f, indent=2)

with open("news_content.json","w") as f:
  json.dump(news_content, f, indent=2)