Media Cloud: Sources and Collections
====================================

At this point you should be ready to query Media Cloud for data. **This notebook demonstrates how to browse and download information about the media sources and collections within the Media Cloud system**. It explores some of the API methods under the hood of our [Source Manager tool](https://sources.mediacloud.org), which is used to browse and administer sources and collections in our system.

Media Cloud is a suite of web tools to support research into media coverage online. The underlying database has 1.5 billion stories (as of early 2020). Every open-web news story is connected to a `media` source. Sources are grouped together into collections (via `tags`). Our primary collections are [geography-based](https://sources.mediacloud.org/#/collections/country-and-state) (at the national and province/state level).

We regularly scrape RSS feeds from a small set of our sources (around 60k as of early 2020). We are slowly rolling out the ability to ingest stories from news sources via their sitemap files (the hard part is determining which URLs are news story pages and which are not). Other stories are discovered and added via spidering links or finding a share of a news URL on social media. We don't advise using our entire database because it is skewed towards the topics of investigations that we, and collaborating researchers, have done. You can mitigate this by using the aforementioned geographic collections.

Our Python API exposes a few methods that are particularly helpful for looking at sources, their associated metadata, and collections: 

* `mediaList`: useful to search for media, or page through all the media in a collection ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2medialist))
* `media`: all metadata about one media source, by `media_id` ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2mediasingle))
* `feedList`: page through any RSS feeds associated with a media source ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2feedslist))
* `tagSet`: collections are grouped into `tag_sets` ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2tag_setssingle))
* `tagList`: list all the collections in a `tag_set` ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2tag_setslist))
* `tag`: information about a collection ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2tagssingle))

## Setup

In [6]:
# Grab your API key from the environment variable and create a client for talking to Media Cloud
import os, mediacloud.api
from dotenv import load_dotenv
from IPython.display import JSON
load_dotenv()  # load config from .env file
mc = mediacloud.api.MediaCloud('')
mediacloud.__version__

'3.12.5'
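
The client above is created with an empty key string. As a minimal sketch of reading the key from the environment instead, the helper below fails loudly when the key is missing. The variable name `MC_API_KEY` and the helper `read_api_key` are assumptions for illustration; use whatever name you chose in your `.env` file.

```python
import os

def read_api_key(env_var="MC_API_KEY"):
    """Return the API key stored in `env_var` (a hypothetical name), or fail loudly."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError("Set {} in your environment or .env file".format(env_var))
    return key

# Usage, after load_dotenv() has run:
# mc = mediacloud.api.MediaCloud(read_api_key())
```

Failing early here tends to be easier to debug than an authentication error on the first API call.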

## Searching for Media Sources

You can search for specific media, or media matching a set of criteria.

In [8]:
# try to find a media source based on its URL
matching_sources = mc.mediaList(name_like='hindustantimes', sort='num_stories')
matching_sources

[{'is_healthy': True,
  'is_monitored': False,
  'media_id': 39872,
  'media_source_tags': [{'description': 'Large list of all sites collected by the Europe Media Monitor project (http://emm.newsbrief.eu). Added in October of 2012. Includes anywhere from five to dozens of sources from almost every country.  This is our main set for broad coverage of international mainstream media.',
    'label': 'Europe Media Monitor',
    'media_id': 39872,
    'show_on_media': True,
    'show_on_stories': None,
    'tag': 'europe_media_monitor_20121015',
    'tag_set': 'collection',
    'tag_sets_id': 5,
    'tagged_date': None,
    'tags_id': 8876474},
   {'description': None,
    'label': None,
    'media_id': 39872,
    'show_on_media': False,
    'show_on_stories': None,
    'tag': 'webnews',
    'tag_set': 'emm_type',
    'tag_sets_id': 554,
    'tagged_date': None,
    'tags_id': 8876475},
   {'description': None,
    'label': None,
    'media_id': 39872,
    'show_on_media': False,
    'show_o

The first thing you'll notice is that our sources are rather noisy. We are trying to move to a model where each top-level domain is a media source (with a handful of exceptions), but we haven't finished that work yet. So for now it can be useful to sort these results by how much content they produce (i.e. `sort='num_stories'`).

### Searching for Media by Metadata
Media sources have lots and lots of `tags` on them. Tags are used to represent many things in Media Cloud. In this case two relevant uses are:
* tags are used to cluster media sources in collections
* tags are used to add metadata to media sources - these are helpfully parsed out for you by the API client in the `metadata` property

In [9]:
# use metadata tags to find media published in India in English
TAG_PUBLISHED_IN_INDIA = 9353533
TAG_PUBLISHED_IN_MOSTLY_ENGLISH = 9361422
# this `tags_id_X` syntax is a little hokey, but it is what we built quickly
# `tags_id_X` clauses are AND'ed together, while the array of values for each are OR'ed together
indian_english_sources = mc.mediaList(tags_id_1=[TAG_PUBLISHED_IN_INDIA],
                                      tags_id_2=[TAG_PUBLISHED_IN_MOSTLY_ENGLISH],
                                      sort='num_stories')
[m['url'] for m in indian_english_sources]

['http://timesofindia.com/',
 'http://expressindia.com/',
 'http://www.thehindu.com/',
 'http://www.hindu.com',
 'http://newindianexpress.com/#spider',
 'http://news18.com/',
 'https://economictimes.indiatimes.com/',
 'http://indianexpress.com/',
 'http://economictimes.indiatimes.com/et-now',
 'http://articles.economictimes.indiatimes.com',
 'http://ibnlive.in.com',
 'https://www.news18.com/',
 'http://profit.ndtv.com/',
 'http://www.business-standard.com',
 'https://news.google.com/news/?ned=hi_in',
 'http://www.frontline.in/#spider',
 'https://news.google.com/news/?ned=te_in',
 'http://www.newkerala.com#spider',
 'https://news.google.com/news/?ned=ml_in',
 'http://epaper.freepressjournal.in/t/8345/Bhopal/']
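
The AND/OR behaviour of the `tags_id_X` clauses described in the cell above can be illustrated locally. This is only a sketch of the semantics, not how the server implements the query; `matches_tag_clauses` and `OTHER_LANGUAGE_TAG` are hypothetical names introduced for the example.

```python
def matches_tag_clauses(media_tag_ids, *clauses):
    """True when the source has at least one tag from every clause.

    Each clause is a list of tags_id values (OR'ed together);
    the clauses themselves are AND'ed together.
    """
    tag_ids = set(media_tag_ids)
    return all(tag_ids.intersection(clause) for clause in clauses)

TAG_PUBLISHED_IN_INDIA = 9353533
TAG_PUBLISHED_IN_MOSTLY_ENGLISH = 9361422
OTHER_LANGUAGE_TAG = 111  # made-up id, for illustration only

both = [TAG_PUBLISHED_IN_INDIA, TAG_PUBLISHED_IN_MOSTLY_ENGLISH]
only_country = [TAG_PUBLISHED_IN_INDIA, OTHER_LANGUAGE_TAG]

# India AND English: matches
print(matches_tag_clauses(both, [TAG_PUBLISHED_IN_INDIA], [TAG_PUBLISHED_IN_MOSTLY_ENGLISH]))  # True
# India AND English, but the source lacks the English tag: no match
print(matches_tag_clauses(only_country, [TAG_PUBLISHED_IN_INDIA], [TAG_PUBLISHED_IN_MOSTLY_ENGLISH]))  # False
# India AND (English OR other language): matches
print(matches_tag_clauses(only_country, [TAG_PUBLISHED_IN_INDIA],
                          [TAG_PUBLISHED_IN_MOSTLY_ENGLISH, OTHER_LANGUAGE_TAG]))  # True
```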

### Paging Through Media Lists & Saving Result CSVs
But you probably want to page through results to see all the matching sources in our system. `mediaList` supports that through the `last_media_id` param.

In [10]:
# page through a list of media list results
def all_media_list(**kwargs):
    last_media_id = None
    more_results = True
    matching_media = []
    while more_results:
        media_page = mc.mediaList(**kwargs, last_media_id=last_media_id)
        print("  got a page of {} matching media".format(len(media_page)))
        if len(media_page) == 0:
            more_results = False
        else:
            matching_media += media_page
            last_media_id = media_page[-1]['media_id']
    return matching_media
all_indian_english_sources = all_media_list(tags_id_1=[TAG_PUBLISHED_IN_INDIA],
                                            tags_id_2=[TAG_PUBLISHED_IN_MOSTLY_ENGLISH],
                                            sort='num_stories',
                                            rows=100)
"found {} matching sources total".format(len(all_indian_english_sources))

  got a page of 100 matching media
  got a page of 100 matching media
  got a page of 73 matching media
  got a page of 24 matching media
  got a page of 10 matching media
  got a page of 3 matching media
  got a page of 0 matching media


'found 310 matching sources total'
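
Note that the output above includes pages shorter than the requested `rows` (73, 24, 10, 3) before the final empty page, so a short page should not be treated as the end of the results. The same "keep asking until a page comes back empty" loop recurs for media, feeds, and tags in this notebook, so it could be factored into one helper. The sketch below is one possible shape; `fetch_all` and the fake page function are hypothetical and exist only so the example runs offline.

```python
def fetch_all(fetch_page, id_field):
    """Collect every record from a paged endpoint.

    fetch_page: callable taking the last id seen (None for the first page)
                and returning a list of dicts
    id_field:   key holding each record's id, e.g. 'media_id'
    """
    results = []
    last_id = None
    while True:
        page = fetch_page(last_id)
        if not page:  # only an empty page signals the end
            return results
        results.extend(page)
        last_id = page[-1][id_field]

# Stand-in for the API so the sketch runs offline: 7 fake records, 3 per page
FAKE_RECORDS = [{"media_id": i} for i in range(1, 8)]

def fake_media_page(last_media_id, rows=3):
    start = last_media_id or 0
    return [r for r in FAKE_RECORDS if r["media_id"] > start][:rows]

all_records = fetch_all(fake_media_page, "media_id")
print(len(all_records))  # 7
```

With the real client, the page function could be a small lambda wrapping the call used earlier, for example `lambda last: mc.mediaList(tags_id_1=[TAG_PUBLISHED_IN_INDIA], rows=100, last_media_id=last)`.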

In [16]:
# and you may want to save that as a CSV, like Source Manager lets you do online
fieldnames = ['media_id', 'url', 'name',
              'pub_country', 'pub_state', 'primary_language', 'subject_country', 'media_type',
              'public_notes', 'stories_per_day', 'first_story']
for m in all_indian_english_sources: # do some data prep to make it easy to output
    m['pub_country'] = m['metadata']['pub_country']['tag'] if m['metadata']['pub_country'] else None
    m['pub_state'] = m['metadata']['pub_state']['tag'] if m['metadata']['pub_state'] else None
    m['primary_language'] = m['metadata']['language']['tag'] if m['metadata']['language'] else None
    m['subject_country'] = m['metadata']['about_country']['tag'] if m['metadata']['about_country'] else None
    m['media_type'] = m['metadata']['media_type']['tag'] if m['metadata']['media_type'] else None
    m['stories_per_day'] = m['num_stories_90']
    m['first_story'] = m['start_date']
# and write a CSV
import csv
with open('media-list.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for s in all_indian_english_sources:
        writer.writerow(s)
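
The data-prep loop above repeats the same conditional for every metadata field. A small helper is one way to tidy that up. This is a sketch that assumes each `metadata` entry is either `None` or a dict with a `tag` key, as the loop above implies; `metadata_tag` and the sample record are hypothetical.

```python
def metadata_tag(media, key):
    """Return the `tag` value of one metadata entry, or None when it is unset."""
    entry = (media.get("metadata") or {}).get(key)
    return entry["tag"] if entry else None

# Hand-made sample shaped like the records above; the values are illustrative only
sample_media = {
    "media_id": 1,
    "metadata": {
        "pub_country": {"tag": "example_country_tag"},
        "pub_state": None,
    },
}

print(metadata_tag(sample_media, "pub_country"))  # example_country_tag
print(metadata_tag(sample_media, "pub_state"))    # None
```

Each line of the loop then becomes something like `m['pub_country'] = metadata_tag(m, 'pub_country')`.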

## Media Source Feeds
Media Sources are created manually, or automatically generated by our system when a story is ingested from a domain we have not seen before. In the latter situation, a placeholder inactive RSS feed is generated to maintain database consistency (this feed usually has _"#spidered"_ on the end of its URL). For the limited number of sources that we ingest from daily, we have manually and automatically created RSS feeds (see our [`feed_seeker` package](https://github.com/mitmedialab/feed_seeker)).
```
media source
  ↳ feed
    ↳ story
```

In [18]:
# learn about the first result from above, which is our canonical one for the Hindustan Times
hindustan_times = mc.media(matching_sources[0]['media_id'])
hindustan_times

{'is_healthy': True,
 'is_monitored': False,
 'media_id': 39872,
 'media_source_tags': [{'description': 'Large list of all sites collected by the Europe Media Monitor project (http://emm.newsbrief.eu). Added in October of 2012. Includes anywhere from five to dozens of sources from almost every country.  This is our main set for broad coverage of international mainstream media.',
   'label': 'Europe Media Monitor',
   'media_id': 39872,
   'show_on_media': True,
   'show_on_stories': None,
   'tag': 'europe_media_monitor_20121015',
   'tag_set': 'collection',
   'tag_sets_id': 5,
   'tagged_date': None,
   'tags_id': 8876474},
  {'description': None,
   'label': None,
   'media_id': 39872,
   'show_on_media': False,
   'show_on_stories': None,
   'tag': 'webnews',
   'tag_set': 'emm_type',
   'tag_sets_id': 554,
   'tagged_date': None,
   'tags_id': 8876475},
  {'description': None,
   'label': None,
   'media_id': 39872,
   'show_on_media': False,
   'show_on_stories': None,
   'tag': 

In [19]:
# list all the feeds associated with this media source
hindistan_times_feeds = mc.feedList(media_id=hindustan_times['media_id'], rows=200)
hindistan_times_feeds

[{'active': False,
  'feeds_id': 130764,
  'last_attempted_download_time': '2016-12-18 05:05:50.831552-05:00',
  'last_new_story_time': '2015-09-16 20:28:37.564327-04:00',
  'last_successful_download_time': '2015-10-11 12:26:20-04:00',
  'media_id': 39872,
  'name': 'News Stories - Hindustan Times',
  'type': 'syndicated',
  'url': 'http://www.hindustantimes.com/RSSFeed/News.aspx'},
 {'active': False,
  'feeds_id': 130765,
  'last_attempted_download_time': '2016-12-18 05:05:50.831552-05:00',
  'last_new_story_time': '2015-09-22 23:34:58.379944-04:00',
  'last_successful_download_time': '2015-10-10 15:45:26-04:00',
  'media_id': 39872,
  'name': 'Latest India News Stories - Hindustan Times',
  'type': 'syndicated',
  'url': 'http://www.hindustantimes.com/RSSFeed/India.aspx'},
 {'active': False,
  'feeds_id': 130766,
  'last_attempted_download_time': '2016-12-18 05:05:50.831552-05:00',
  'last_new_story_time': '2015-09-23 07:36:05.052711-04:00',
  'last_successful_download_time': '2015-1

In [20]:
# but only some of these are active ones that we ingest news from every day
active_feeds = [f for f in hindistan_times_feeds if f['active']]
"{}/{} of the feeds are checked for new stories each day".format(len(active_feeds), len(hindistan_times_feeds))

'27/41 of the feeds are checked for new stories each day'
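
Beyond a single active/inactive count, it can help to break feeds down by `type` as well. The sketch below uses only the `type` and `active` fields visible in the feed records above; `summarize_feeds` and the sample feeds are hypothetical.

```python
from collections import Counter

def summarize_feeds(feeds):
    """Count feeds by (type, active) so stale feeds are easy to spot."""
    return Counter((f["type"], bool(f["active"])) for f in feeds)

# Hand-made sample using the `type` and `active` fields shown above
sample_feeds = [
    {"feeds_id": 1, "type": "syndicated", "active": True},
    {"feeds_id": 2, "type": "syndicated", "active": False},
    {"feeds_id": 3, "type": "syndicated", "active": False},
]

for (feed_type, active), count in sorted(summarize_feeds(sample_feeds).items()):
    print(feed_type, "active" if active else "inactive", count)
```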

In [21]:
# fetch all the feeds and dump to a csv
def all_source_feeds(media_id):
    more_feeds = True
    last_feeds_id = None
    all_feeds = []
    while more_feeds:
        feed_page = mc.feedList(media_id, last_feeds_id=last_feeds_id, rows=100)
        print("  fetched a page of {} feeds".format(len(feed_page)))
        if len(feed_page) == 0:
            more_feeds = False
        else:
            all_feeds += feed_page
            last_feeds_id = feed_page[-1]['feeds_id']
    return all_feeds
all_hindistan_times_feeds = all_source_feeds(media_id=hindustan_times['media_id'])
# dump to CSV
fieldnames = ['feeds_id', 'active', 'type', 'media_id', 'name', 'url']
filename = 'media-{}-feeds.csv'.format(hindustan_times['media_id'])
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for s in all_hindistan_times_feeds:
        writer.writerow(s)
print("Wrote {} feeds to {}".format(len(all_hindistan_times_feeds), filename))

  fetched a page of 41 feeds
  fetched a page of 0 feeds
Wrote 41 feeds to media-39872-feeds.csv


## Collections
Media Sources are grouped together into collections. These are implemented internally as `tags`. Collections are grouped together as `tag_sets` for convenience of internal system operations. This can be confusing for humans to navigate. We have tons of collections, but the geographic ones are the most useful place to start investigating things.

In [22]:
# all the geographic collections were seeded from http://www.abyznewslinks.com in 2018; since then we have cleaned and augmented them
GEOGRAPHIC_COLLECTIONS_TAG_SET = 15765102
mc.tagSet(GEOGRAPHIC_COLLECTIONS_TAG_SET)

{'description': 'Tags in this set indicate that the media source covers a certain geographic area',
 'label': 'Geographic Collections',
 'name': 'geographic_collection',
 'show_on_media': False,
 'show_on_stories': False,
 'tag_sets_id': 15765102}

In [23]:
# list the collections in this tag set
geographic_collections = mc.tagList(tag_sets_id=GEOGRAPHIC_COLLECTIONS_TAG_SET)
[c['label'] for c in geographic_collections]

['Canada - National',
 'Saint Martin (French part) - National',
 'Sao Tome and Principe - National',
 'Cambodia - National',
 'Switzerland - National',
 'Ethiopia - National',
 'Aruba - National',
 'Swaziland - National',
 'Svalbard and Jan Mayen - National',
 'Congo, The Democratic Republic of the - National',
 'Argentina - National',
 'Bolivia, Plurinational State of - National',
 'Burkina Faso - National',
 'Bahrain - National',
 'Saudi Arabia - National',
 'Rwanda - National',
 'South Georgia and the South Sandwich Islands - National',
 'Japan - National',
 'American Samoa - National',
 'Northern Mariana Islands - National']

In [24]:
# page through a list of all of the collections in this tag_set, using the `last_tags_id` parameter
def all_tags_in_tag_set(tag_sets_id):
    more_tags = True
    last_tags_id = None
    all_tags = []
    while more_tags:
        tag_page = mc.tagList(tag_sets_id=tag_sets_id, rows=500, last_tags_id=last_tags_id)
        print("  got a page of {} tags".format(len(tag_page)))
        if len(tag_page) == 0:
            more_tags = False
        else:
            all_tags += tag_page
            last_tags_id = tag_page[-1]['tags_id']
    return all_tags

all_geographic_collections = all_tags_in_tag_set(GEOGRAPHIC_COLLECTIONS_TAG_SET)
"Found {} total geographic collections".format(len(all_geographic_collections))

  got a page of 500 tags
  got a page of 500 tags
  got a page of 500 tags
  got a page of 37 tags
  got a page of 0 tags


'Found 1537 total geographic collections'

In [25]:
# this isn't encoded into a hierarchy, but there are some conventions here:
# 1. each country's national and state/local collections start with the country name
# 2. each country's province/state level collections include the country's alpha2 code in their tag (e.g. `geo_ES-`)
# for instance, this finds all the collections related to Spain
spain_collections = [c for c in all_geographic_collections if c['label'] and (c['label'].startswith('Spain') or c['tag'].startswith('geo_ES-'))]
[c['label'] for c in spain_collections]

['Spain - National',
 'Spain - State & Local',
 'Madrid, Spain - State & Local',
 'Andalucía, Spain - State & Local',
 'Aragón, Spain - State & Local',
 'Asturias, Principado de, Spain - State & Local',
 'Illes Balears, Spain - State & Local',
 'Canarias, Spain - State & Local',
 'Cantabria, Spain - State & Local',
 'Castilla-La Mancha, Spain - State & Local',
 'Castilla y León, Spain - State & Local',
 'Catalunya, Spain - State & Local',
 'Ceuta, Spain - State & Local',
 'Extremadura, Spain - State & Local',
 'Galicia, Spain - State & Local',
 'La Rioja, Spain - State & Local',
 'Melilla, Spain - State & Local',
 'Murcia, Spain - State & Local',
 'Navarra / Nafarroa, Spain - State & Local',
 'País Vasco / Euskal Herria, Spain - State & Local',
 'Valenciana, Comunidad / Valenciana, Comunitat, Spain - State & Local']
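
The two naming conventions used in the cell above generalize to other countries. The sketch below wraps the same filter in a function; `collections_for_country` is a hypothetical helper, and the sample tag strings are invented for the example (only the `geo_ES-` prefix comes from the cell above).

```python
def collections_for_country(collections, country_name, alpha2):
    """Pick collections whose label starts with the country name
    or whose tag starts with 'geo_<ALPHA2>-'."""
    tag_prefix = "geo_{}-".format(alpha2.upper())
    return [
        c for c in collections
        if c.get("label")
        and (c["label"].startswith(country_name)
             or (c.get("tag") or "").startswith(tag_prefix))
    ]

# Hand-made sample; the tag strings are illustrative, not real Media Cloud tags
sample_collections = [
    {"label": "Spain - National", "tag": "example_national_tag"},
    {"label": "Madrid, Spain - State & Local", "tag": "geo_ES-M"},
    {"label": "Japan - National", "tag": "example_other_tag"},
    {"label": None, "tag": "geo_ES-X"},  # unlabeled collections are skipped
]

print([c["label"] for c in collections_for_country(sample_collections, "Spain", "es")])
# ['Spain - National', 'Madrid, Spain - State & Local']
```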

## Reference: Media Source Metadata Tags
Here is a quick utility to let you generate a list of all the possible values for each type of media metadata (aka the `tags` in each media metadata `tag_set`). These constants are available in `mediacloud.tags`.

In [26]:
fieldnames = ['tags_id', 'label', 'tag', 'description']
for metadata_tag_sets_id in mediacloud.tags.METADATA_TAG_SETS:
    tag_set = mc.tagSet(metadata_tag_sets_id)
    print("{}:".format(tag_set['label']))
    tags_in_set = all_tags_in_tag_set(metadata_tag_sets_id)
    filename = 'metadata-{}-tags.csv'.format(tag_set['name'])
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        for s in tags_in_set:
            writer.writerow(s)
    print("  Wrote {} tags to {}".format(len(tags_in_set), filename))

Publication Country:
  got a page of 245 tags
  got a page of 0 tags
  Wrote 245 tags to metadata-pub_country-tags.csv
Publication State:
  got a page of 500 tags
  got a page of 500 tags
  got a page of 190 tags
  got a page of 0 tags


UnicodeEncodeError: 'charmap' codec can't encode character '\u010d' in position 12: character maps to <undefined>
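
The `UnicodeEncodeError` above is raised when writing to a file opened without an explicit `encoding`: Python falls back to the platform default (the `'charmap'` codec suggests a Windows code page), which cannot represent characters such as `\u010d` (č). The media-list CSV cell earlier avoids this by passing `encoding='utf-8'`. Here is a minimal sketch of the same fix; `write_tags_csv` and the sample row are hypothetical.

```python
import csv
import os
import tempfile

def write_tags_csv(filename, tags, fieldnames):
    """Write tag dicts to a CSV, forcing UTF-8 so any label can be encoded."""
    with open(filename, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(tags)

# Hand-made sample row containing the character from the error message
sample_tags = [{"tags_id": 1, "label": "\u010d example", "tag": "example_tag", "description": None}]
path = os.path.join(tempfile.mkdtemp(), "metadata-example-tags.csv")
write_tags_csv(path, sample_tags, ["tags_id", "label", "tag", "description"])

# Read the file back with the same encoding to confirm the round trip
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["label"])
```

Applying the same `encoding='utf-8'` argument to the `open(filename, 'w', newline='')` call in the loop above should let it run through all of the metadata tag sets.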