Media Frames Corpus 4.0
=======================

This dataset represents version 4.0 of the Media Frames Corpus, released on March 5, 2021. It contains labeled and unlabeled articles on six issues from 14 newspapers covering the years 1990-2014, though some issues have broader coverage.

## File structure

Each of the six sub-directories contains files associated with one of the six issues.

For each issue, there are two .json files. One contains the text of each article, along with basic metadata (newspaper source and date). The other contains the annotations for a subset of articles, as well as the text of each article as it was annotated (which occasionally differs from the original text because of minor modifications by the annotators).

The text of each article was shortened to 225 words, rounded up to the end of the paragraph. The article ID and the word "PRIMARY" were also prepended to the beginning of the text.
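The truncation rule can be illustrated with a short sketch. This is not the authors' preprocessing code: it assumes that paragraphs are separated by blank lines and that words are whitespace-delimited tokens, and the function name `truncate_to_paragraph` is hypothetical.

```python
def truncate_to_paragraph(text, max_words=225):
    """Keep whole paragraphs until at least `max_words` words are included.

    Assumption: paragraphs are separated by blank lines and words are
    whitespace-delimited; the corpus authors' exact rule may differ.
    """
    kept = []
    n_words = 0
    for paragraph in text.split("\n\n"):
        kept.append(paragraph)
        n_words += len(paragraph.split())
        if n_words >= max_words:
            break  # the limit fell inside (or at the end of) this paragraph
    return "\n\n".join(kept)


# Toy example: three paragraphs of 4 words each, limit of 6 words.
toy = "a b c d\n\ne f g h\n\ni j k l"
shortened = truncate_to_paragraph(toy, max_words=6)
```

With a limit of 6 words, the cut falls inside the second paragraph, so the first two paragraphs are kept in full. Whether the prepended ID and "PRIMARY" lines count toward the limit is not stated, so the sketch ignores that question.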

The codes.json file in this directory contains a mapping from numeric codes to tones and frames.
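A minimal sketch of looking up a code, assuming that codes.json is a flat JSON object whose keys are the numeric codes written as strings such as "6.2" (JSON object keys are always strings). The helper names `load_codes` and `code_to_label` are hypothetical, and the labels in the example mapping are placeholders, not values from the dataset.

```python
import json


def load_codes(path):
    """Load the code-to-label mapping. Assumes a flat JSON object."""
    with open(path, "r") as f:
        return json.load(f)


def code_to_label(codes, code):
    """Look up a numeric code such as 6.2 in a mapping keyed by strings."""
    key = str(float(code))  # 6 -> "6.0", 6.2 -> "6.2"
    return codes.get(key, "unknown code " + key)


# Placeholder mapping for illustration only; real labels come from codes.json.
example_codes = {"6.0": "frame A", "6.2": "frame A (primary)"}
label = code_to_label(example_codes, 6.2)
```

If the keys in codes.json are formatted differently (for example "6" rather than "6.0"), the key construction in `code_to_label` would need to be adjusted.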

## Annotations

In most cases, annotated articles were annotated by at least two annotators. There are three categories of annotations: relevancy, framing, and tone. For each category, the json file contains the annotations from each annotator. Note that different annotators annotated different categories in some cases, and not all categories have been annotated for all articles. 
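Because coverage varies by article, it can help to check how many annotators contributed to each category before using a record. The sketch below assumes that each labeled record has an "annotations" field mapping category names to a dict keyed by annotator. The function name, the annotator names, and the "tone" key in the example are illustrative assumptions.

```python
def annotators_per_category(record):
    """Return {category: number of annotators} for one labeled article.

    Assumes record["annotations"] maps each category name to a dict
    keyed by annotator.
    """
    return {
        category: len(by_annotator)
        for category, by_annotator in record.get("annotations", {}).items()
    }


# Hypothetical record: annotator names and the "tone" key are illustrative.
example_record = {
    "annotations": {
        "framing": {"annotator_1": [], "annotator_2": []},
        "tone": {"annotator_1": []},
    }
}
counts = annotators_per_category(example_record)
```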

Relevancy (irrelevant) is simply a judgment (by each annotator) as to whether or not the article is relevant for the issue (as defined in the paper).

Tone is a judgment as to whether the article is neutral, or is (implicitly or explicitly) pro or anti towards the default position on the issue. (For example, for immigration, "pro" suggests a pro-immigration position. Note that for gun control, these were annotated as pro or anti gun, so "anti" is actually pro gun control.)

The framing annotations identify spans of text that cue each of 15 framing dimensions. There are also special codes for the "primary" frame of the article and for the headline, although these were not always used precisely as intended. In general, annotators mark the primary frame on the word "PRIMARY" at the start of the article.

For framing, the start and end of each span are given as 0-based character indices into the text.
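A span can be recovered by slicing the article text. The sketch below assumes that each framing annotation is a dict with "code", "start", and "end" keys and that "end" is exclusive, as in Python slices. The example text and offsets are made up for illustration.

```python
def extract_span(text, annotation):
    """Return the text covered by one framing annotation.

    Assumes `annotation` has integer "start" and "end" keys and that
    "end" is exclusive, matching Python slice semantics.
    """
    return text[annotation["start"]:annotation["end"]]


# Illustrative text in the corpus layout (ID, "PRIMARY", then the article).
example_text = "XYZ-1\n\nPRIMARY\n\nA headline"
example_annotation = {"code": 6.2, "start": 7, "end": 14}
span = extract_span(example_text, example_annotation)
```

Offsets refer to the text as it was annotated, so slicing a different version of the article text may give slightly shifted spans.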

In addition to the detailed annotations, an attempt was made to resolve the consensus of the annotators for the (primary) tone, primary frame, headline frame, and whether or not the article is irrelevant, each of which occurs as an additional field for each article.

## Sampling issues

For most issues, the .json file with unlabeled articles contains all articles on the issue from the sources used, based on an external database, as described in the original paper. The annotated articles are approximately a random sample from the full set of articles (following de-duplication). There are a few exceptions, however.

First, the availability of some sources changed over time, so the composition of sources shifts slightly for some issues.

Second, gun control was later extended, such that articles from 2015-2018 are over-represented among the annotated articles.

Third, the data collection for climate change unfortunately failed, and this was not noticed until later, so the unlabeled articles for that issue do not include the full set of articles that should have been collected. As such, this issue is less comprehensive than the others, and any patterns over time should be treated with skepticism for this one issue.
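One way to see these effects is to compare the distribution of publication years between the annotated and unlabeled articles. The sketch below assumes that each article record carries an integer "fulldate" field in YYYYMMDD form. The function name and the example records are hypothetical.

```python
from collections import Counter


def articles_per_year(articles):
    """Count articles by publication year.

    Assumes each record has an integer "fulldate" in YYYYMMDD form
    (e.g. 19831230).
    """
    return Counter(record["fulldate"] // 10000 for record in articles.values())


# Hypothetical records for illustration only.
example_articles = {
    "a1": {"fulldate": 20150304},
    "a2": {"fulldate": 20151120},
    "a3": {"fulldate": 19990101},
}
year_counts = articles_per_year(example_articles)
```

Running the same count on both files for an issue and comparing the per-year proportions would show where the annotated sample departs from the full set.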

## A note on sources

Fourteen named newspapers are listed among the sources in this dataset. Most of these are self-explanatory (e.g., "new york times"), but a few require clarification: "daily news" is the New York Daily News. "herald-sun" includes articles from both the Durham Herald Sun and the Chapel Hill Herald, which are affiliated. Also, note that the St. Petersburg Times was renamed to the Tampa Bay Times in 2012. These two sources represent the same paper, but the individual names have been preserved in this dataset.
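For analyses by newspaper, the two names of the same paper may need to be merged. The following is a sketch only: the exact source strings used in the data ("st. petersburg times" and "tampa bay times") are assumptions and should be checked against the files, and `normalize_source` is a hypothetical helper.

```python
# Assumed source strings: the exact spellings in the data should be checked.
SOURCE_ALIASES = {
    "st. petersburg times": "tampa bay times",
}


def normalize_source(source):
    """Map renamed papers to one canonical name; leave other sources as-is."""
    cleaned = source.strip().lower()
    return SOURCE_ALIASES.get(cleaned, cleaned)


canonical = normalize_source("St. Petersburg Times")
```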

## Citation

To cite this dataset, please cite the following paper:

```
@inproceedings{card.2015,
  title = {The Media Frames Corpus: Annotations of Frames Across Issues},
  author = {Dallas Card and Amber E. Boydstun and Justin H. Gross and Philip Resnik and Noah A. Smith},
  year = {2015},
  booktitle={Proceedings of ACL},
  url={https://www.aclweb.org/anthology/P15-2072/},
} 
```



In [5]:
base_url = "../../mfc_v4.0/"
subdirs = ["climate", "deathpenalty", "guncontrol", "immigration", "samesex", "tobacco"]

In [6]:
import json


all_texts = []

for ext in subdirs:
    with open(base_url + ext + "/" + ext + "_all_with_duplicates.json", "r") as f: 
        all_texts.append(json.load(f))


In [7]:
sum(len(topic.keys()) for topic in all_texts)

173719

In [8]:
from sys import getsizeof

In [9]:
getsizeof(all_texts)

128
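
Note that `sys.getsizeof` reports only the shallow size of the list object itself, not the size of the dictionaries it references, which is why the figure above is so small. A rough recursive estimate could be sketched as follows. `deep_sizeof` is a hypothetical helper, and its result is an approximation.

```python
from sys import getsizeof


def deep_sizeof(obj, seen=None):
    """Roughly estimate the memory of nested dicts, lists, tuples, and sets.

    Each object is counted once, tracked by id(); this is an approximation.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size


nested = [{"text": "x" * 1000}]
shallow, deep = getsizeof(nested), deep_sizeof(nested)
```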

In [10]:
all_annotations = []

for ext in subdirs:
    with open(base_url + ext + "/" + ext + "_labeled.json", "r") as f: 
        all_annotations.append(json.load(f))

In [12]:
print(list(all_texts[1].items())[0])  # (article ID, record) pair
print(list(all_annotations[1].items())[0])

('death_penalty_1', {'day': 30, 'fulldate': 19831230, 'month': 12, 'source': 'new york times', 'text': "DEA-1\n\nPRIMARY\n\nIN THE NATION; EXCUSE IT, PLEASE\n\nOn Dec.  2, I began an article with the suggestion that in executing Robert Sullivan, ''the state of Florida may well have killed an innocent man.''\n\nMy intended point - from which I do not retreat - was that enough ''doubts and questions'' had been raised as to whether Mr. Sullivan actually was guilty of murder ''at least to have precluded a death penalty - particularly since, under Florida law, such a sentence can only be imposed by a separate tribunal deciding whether the facts of a crime warrant capital punishment.''\n\nThis article led many readers to the conclusion that I was, in fact, arguing that Mr. Sullivan was innocent. For example, I enumerated in some detail the ''doubts and questions'' about his guilt, but I wrote only that ''some other evidence, of course, tended to support the guilty verdict.'' Numerous readers

In [14]:
for i, ext in enumerate(subdirs):

    n_texts = len(all_texts[i])
    n_annotations = len(all_annotations[i])

    # The text files hold every article and only a subset is annotated, so a mismatch is expected
    if n_texts != n_annotations:
        print("ERROR: Number of keys for " + ext + " does not match between text and annotations")
        print("Number of texts: ", n_texts)
        print("Number of articles with annotations: ", n_annotations)


ERROR: Number of keys for climate does not match between text and annotations
Number of texts:  18595
Number of articles with annotations:  5155
ERROR: Number of keys for deathpenalty does not match between text and annotations
Number of texts:  34568
Number of articles with annotations:  6398
ERROR: Number of keys for guncontrol does not match between text and annotations
Number of texts:  22355
Number of articles with annotations:  10383
ERROR: Number of keys for immigration does not match between text and annotations
Number of texts:  54782
Number of articles with annotations:  6757
ERROR: Number of keys for samesex does not match between text and annotations
Number of texts:  14200
Number of articles with annotations:  10583
ERROR: Number of keys for tobacco does not match between text and annotations
Number of texts:  29219
Number of articles with annotations:  5274


In [23]:
# Build a single dict of all articles, recording each article's topic and framing annotations
# Note: records are shared with all_texts, so this also modifies all_texts in place

article_dict = {}

for i, ext in enumerate(subdirs):

    for article, record in all_texts[i].items():
        record["topic"] = ext
        record["annotations"] = []

        # Pool the framing spans from every annotator into one flat list
        labeled = all_annotations[i].get(article, {})
        for annotations_by_author in labeled.get("annotations", {}).get("framing", {}).values():
            record["annotations"].extend(annotations_by_author)

        article_dict[article] = record
            




In [24]:
len(article_dict.keys())

173719

In [25]:
annotated_articles = {key: record for key, record in article_dict.items() if record["annotations"]}
unannotated_articles = {key: record for key, record in article_dict.items() if not record["annotations"]}

In [26]:
len(annotated_articles)

32014

In [27]:
len(unannotated_articles)

141705

In [28]:
with open(base_url + "annotated_articles.json", "w") as f:
    json.dump(annotated_articles, f)
    
with open(base_url + "unannotated_articles.json", "w") as f:
    json.dump(unannotated_articles, f)

In [29]:
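def get_article_spans(corpus, article):
    # Group the annotated text spans of one article by frame code
    out = {}

    for annotation in corpus[article]["annotations"]:
        # Offsets refer to the text as annotated, so spans may be shifted slightly in this text
        span = corpus[article]["text"][annotation['start']:annotation['end']]
        out.setdefault(annotation['code'], []).append(span)

    return out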


In [30]:
get_article_spans(annotated_articles, list(annotated_articles.keys())[0])

{6.2: ['PRIMARY', '\nPRIMARY'],
 15.1: ['Too much hot air', 'Too much hot air'],
 13.0: ['The White House moved boldly Monday to go it alone on braking climate change, leaving behind a gridlocked Congress.',
  'The Obama administration is acting to regulate carbon dioxide - the main greenhouse gas driving up global temperatures - as the deadly pollutant it is',
  'In 2010, Congress rejected a similar model, starting an unconscionable stalemate that left the President with no choice but to take solo action.',
  'Obama',
  'the President',
  'Gov. Cuomo',
  'The White House moved boldly Monday to go it alone on braking climate change, leaving behind a gridlocked Congress.\n\nThe Obama administration is acting to regulate carbon dioxide -',
  ' In 2010, Congress rejected a similar model, starting an unconscionable stalemate that left the President with no choice but to take solo action.',
  'Obama',
  'as Gov. Cuomo stalls any action to permit extraction.'],
 6.0: ['New EPA rules zero in 