Media Cloud: Measuring Attention
================================

At this point you should be ready to query Media Cloud for data. You can use Boolean query syntax; [read our query guide](https://mediacloud.org/support/query-guide) for more details about the exact syntax (it runs an [Elasticsearch search](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html) under the hood). **This notebook demonstrates how to quickly measure attention paid to an issue by the media**.

Studying media attention is critical for understanding how much readers are exposed to an issue, and has a long tradition. Media Cloud supports investigating attention within individual sources we track, or within collections of sources. We have wide global coverage with both national-level and regional/state-level collections for most countries. You can [browse our geographic collections](https://search.mediacloud.org/collections/news/geographic) to see more.

Our Python API exposes three methods that are particularly helpful for studying attention:

* `story_count`: return the total number of stories in our database matching your query
* `story_count_over_time`: return the total number of stories in our database matching your query _by day_
* `story_list`: page through the actual stories that match your query

In [3]:
# Set up your API key and import needed things
import datetime
import datetime as dt
import os
from importlib.metadata import version

import bokeh.io
import mediacloud.api
from IPython.display import JSON
from rich import print as pp

# from dotenv import load_dotenv

bokeh.io.reset_output()
# bokeh.io.output_notebook()

# Look for the key in the environment first, then fall back to a local key file
KEY_FILE = "../config/media.cloud.key"
MC_API_KEY = os.getenv("MEDIA_CLOUD_API_KEY")
if MC_API_KEY is None:
    try:
        with open(KEY_FILE) as f:
            MC_API_KEY = f.read().strip()
        pp("[bold green][SUCCESS] MC API Key found.[/bold green]")
    except FileNotFoundError:
        pp(
            f"[bold red][ERROR] MC API key not found. Check ENV 'MEDIA_CLOUD_API_KEY' or file '{KEY_FILE}'[/bold red]"
        )
else:
    pp("[bold green][SUCCESS] MC API Key found.[/bold green]")
search_api = mediacloud.api.SearchApi(MC_API_KEY)
pp(f"[gray][INFO] Using Media Cloud python client v{version('mediacloud')}[/gray]")

## Listing Stories

Story counts are fine, but often what you really want is the stories themselves. Note that **we cannot provide story content** due to copyright restrictions. However, you can get a list of all the URLs and then fetch them yourself. We can also return word counts down to the story level (see the "language" notebook for more info on that).

In [18]:
# grab the most recent stories about this issue
stories, _ = search_api.story_list(my_query, start_date, end_date)
stories[:3]

[{'id': 'dee7470dce016f1590d166919e2a959d1aff6e5e8105d911baf583abb3ad8025',
  'media_name': 'teachingexpertise.com',
  'media_url': 'teachingexpertise.com',
  'title': '50 Celebratory Earth Day Books For Kids',
  'publish_date': datetime.date(2023, 11, 2),
  'url': 'https://www.teachingexpertise.com/books/earth-day-books-for-kids/',
  'language': 'en',
  'indexed_date': datetime.datetime(2024, 7, 17, 7, 24, 16, 813213)},
 {'id': '81046b94a255a5a39c1b8996b95b111801de42f5fdd4e6387c89b394bbd01201',
  'media_name': 'auswaertiges-amt.de',
  'media_url': 'auswaertiges-amt.de',
  'title': 'UNIDAS: Together for women’s rights and democracy',
  'publish_date': datetime.date(2023, 11, 2),
  'url': 'https://www.auswaertiges-amt.de/en/aussenpolitik/regionaleschwerpunkte/lateinamerika/unidas-ni-una-menos/2518476',
  'language': 'en',
  'indexed_date': datetime.datetime(2024, 6, 20, 2, 15, 40, 254150)},
 {'id': '6bb05f429910b01ae4185c7d487971c061984ded44e880b80d0f8cf776058c18',
  'media_name': 'swar
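
Once you have a list of story dictionaries like the one above, a couple of small helpers make it easier to work with. The following is a minimal sketch, not part of the Media Cloud client: `stories_per_source` and `unique_urls` are hypothetical helper names, and the sample data is made up. It only relies on the `media_name` and `url` keys visible in the output above.

```python
from collections import Counter


def stories_per_source(stories):
    """Count stories by their 'media_name' field, most frequent first."""
    return Counter(s["media_name"] for s in stories).most_common()


def unique_urls(stories):
    """Return story URLs in first-seen order, dropping duplicates."""
    return list(dict.fromkeys(s["url"] for s in stories))


# Made-up sample that uses the same keys as the real results
sample = [
    {"media_name": "example.com", "url": "https://example.com/a"},
    {"media_name": "example.com", "url": "https://example.com/b"},
    {"media_name": "example.org", "url": "https://example.org/c"},
    {"media_name": "example.com", "url": "https://example.com/a"},
]
print(stories_per_source(sample))
print(unique_urls(sample))
```

With real results you would pass `stories` (or a fully paged list) instead of `sample`. The URL list is what you would hand to your own fetching code.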

If you want to list ALL the stories matching, you need to page through the results. This is accomplished via the `pagination_token` parameter. This code snippet pages through all the stories in a query.

In [None]:
# let's fetch all the stories matching our query on one day
all_stories = []
more_stories = True
pagination_token = None
while more_stories:
    page, pagination_token = search_api.story_list(
        my_query,
        dt.date(2023, 11, 29),
        dt.date(2023, 11, 30),
        collection_ids=[US_NATIONAL_COLLECTION],
        pagination_token=pagination_token,
    )
    all_stories += page
    more_stories = pagination_token is not None
len(all_stories)

As you may have noted, this can take a while for long time periods. If you look closely you'll notice that it can't be easily parallelized, because each call needs the pagination token returned by the previous one. A workaround is to divide your query up by time and run the pieces in parallel, for example one query per day. This can speed up the response. Also, **just contact us directly if you are trying to do larger data dumps, or if you run up against your API quota**.
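
The split-by-day workaround can be sketched roughly as follows. This is an illustration under several assumptions, not an official recipe: `daily_windows`, `fetch_window`, and `fetch_all_parallel` are hypothetical helpers, and `fetch_page` is a callable you supply that wraps `search_api.story_list`. Each window mirrors the one-day call used above (a day and the following day). Because adjacent windows share a boundary date, results are de-duplicated by story `id`. Whether the client can safely be shared across threads is also an assumption; if in doubt, create one client per worker.

```python
import datetime as dt
from concurrent.futures import ThreadPoolExecutor


def daily_windows(start, end):
    """Yield (day, next_day) pairs from start up to, but not including, end."""
    day = start
    while day < end:
        yield day, day + dt.timedelta(days=1)
        day += dt.timedelta(days=1)


def fetch_window(fetch_page, start, end):
    """Page through one window; fetch_page(start, end, token) -> (page, next_token)."""
    stories, token = [], None
    while True:
        page, token = fetch_page(start, end, token)
        stories.extend(page)
        if token is None:
            return stories


def fetch_all_parallel(fetch_page, start, end, max_workers=4):
    """Fetch every daily window in a thread pool and merge results by story id."""
    windows = list(daily_windows(start, end))
    by_id = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for stories in pool.map(lambda w: fetch_window(fetch_page, *w), windows):
            for story in stories:
                by_id.setdefault(story["id"], story)
    return list(by_id.values())


# Hypothetical wiring to the client used in this notebook:
# def fetch_page(start, end, token):
#     return search_api.story_list(
#         my_query, start, end,
#         collection_ids=[US_NATIONAL_COLLECTION],
#         pagination_token=token,
#     )
# all_stories = fetch_all_parallel(fetch_page, dt.date(2023, 11, 1), dt.date(2023, 11, 30))
```

Keep `max_workers` small: parallel requests still count against your API quota.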

### Writing a CSV of Story Data

What you probably want is a CSV of all this story data. Here's a quick example of dumping that data to a CSV (like our Search tool does).

In [None]:
import csv

fieldnames = [
    "id",
    "publish_date",
    "title",
    "url",
    "language",
    "media_name",
    "media_url",
    "indexed_date",
]
with open("story-list.csv", "w", newline="") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for s in all_stories:
        writer.writerow(s)

In [None]:
# and let's make sure it worked by loading it up as a pandas DataFrame
import pandas

df = pandas.read_csv("story-list.csv")
df.head()
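
Since this notebook is about attention, one natural next step is to count stories per day straight from the CSV. Here is a minimal sketch using only the standard library. `count_by_publish_date` is a hypothetical helper, and it assumes the `publish_date` column holds the ISO date strings that `csv.DictWriter` produces from `datetime.date` values.

```python
import csv
from collections import Counter


def count_by_publish_date(csv_path):
    """Return a Counter mapping each publish_date string to its story count."""
    with open(csv_path, newline="") as f:
        return Counter(row["publish_date"] for row in csv.DictReader(f))


# Example (assumes story-list.csv was written by the cell above):
# for day, n in sorted(count_by_publish_date("story-list.csv").items()):
#     print(day, n)
```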