# News API Quickstart

The News API has two main functions: `fetch_urls()`, which returns a generator over the fetched results, and `fetch_urls_to_file()`, which writes the fetched results to a .jsonl file.

Both functions require these two arguments:
- `urls`: a dictionary or a list of dictionaries. Each dictionary should have a `url` key and a URL string as its value.
    - To pass along extra info to the output, you can add (JSON serializable) key-value pairs to the input dictionary.
- `fetch_function`: one of the following functions:
    - `request_active_url()`: request the URL directly from the server that currently hosts it.
    - `request_archived_url()`: request the oldest archived version of the URL from the Internet Archive's Wayback Machine.
    - `fetch_url()`: call `request_active_url()` first. If that fails, call `request_archived_url()` as a fallback.
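
The fallback behaviour described for `fetch_url()` can be sketched as follows. This is only an illustration of the pattern, not the library's implementation: `fetch_with_fallback`, `fake_active` and `fake_archived` are hypothetical names, and the stand-in fetchers return hand-written dictionaries instead of making network requests.

```python
from typing import Callable, Dict


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], Dict],
    fallback: Callable[[str], Dict],
) -> Dict:
    """Try `primary`; if its result reports a fetch error, try `fallback`."""
    result = primary(url)
    if result.get("fetch_error"):
        result = fallback(url)
    return result


# Hypothetical stand-ins that only mimic the shape of the output (no network).
def fake_active(url: str) -> Dict:
    return {"original_url": url, "fetch_error": True,
            "FETCH_FUNCTION": "request_active_url"}


def fake_archived(url: str) -> Dict:
    return {"original_url": url, "fetch_error": False,
            "FETCH_FUNCTION": "request_archived_url"}


result = fetch_with_fallback("http://example.com/article", fake_active, fake_archived)
print(result["FETCH_FUNCTION"])  # request_archived_url
```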

`fetch_urls_to_file()` also requires `path` and `filename` arguments.


Every `fetch_function` returns a stringified JSON object with the following keys. If additional key-value pairs are included in the input dictionary, they are added to the output as well.

- `article_maintext` (str): main text of the article extracted by [news-please](https://github.com/fhamborg/news-please)
- `original_url` (str): the input URL
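- `resolved_url` (str): the final URL of the response (`response_url`). If the request fails, it is replaced with a placeholder URL that encodes the type of error, for example:
    - `http://example.com/__CLIENT_ERROR__`
    - `http://example.com/__CONNECTIONPOOL_ERROR__`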
- `resolved_domain` (str): domain of `resolved_url`
- `resolved_netloc` (str): network location of `resolved_url`
- `standardized_url` (str): netloc + path + query of `resolved_url`
    - Common analytics-related prefixes and query parameters are removed. The URL is also lower-cased.
- `is_generic_url` (bool): whether the standardized URL is likely a generic URL that does not point to a specific article's webpage. If `True`, `article_maintext` and `resolved_text` should probably be excluded as noisy data.
- `response_code` (int): response status code (an empty string if the request failed, as in the `request_active_url` example below)
- `response_reason` (str): response status code reason (an empty string if the request failed)
- `fetch_error` (bool): `True` if the HTTP request failed, `False` if it succeeded
- `resolved_text` (str): the HTML returned by the server. This is useful when news-please fails to extract `article_maintext` and custom extraction logic is needed.
- `FETCH_FUNCTION` (str): "request_active_url" or "request_archived_url"
- `FETCH_AT` (str): timezone-aware UTC timestamp of the fetch in ISO 8601 format, e.g. "2021-11-05T23:25:15.611729+00:00"
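
As a rough sketch of how these keys might be used downstream, the snippet below parses stringified results and keeps only records that fetched successfully and are not flagged as generic. The helper name `usable_records` and the sample records are made up for illustration; they are not part of the library or real output.

```python
import json


def usable_records(raw_results):
    """Yield parsed records that fetched successfully and are not generic URLs."""
    for raw in raw_results:
        record = json.loads(raw)
        if record.get("fetch_error") or record.get("is_generic_url"):
            continue
        yield record


# Illustrative records with only the keys needed here (not real output).
sample = [
    json.dumps({"original_url": "http://example.com/a",
                "fetch_error": False, "is_generic_url": False}),
    json.dumps({"original_url": "http://example.com/b",
                "fetch_error": True, "is_generic_url": False}),
    json.dumps({"original_url": "http://example.com/",
                "fetch_error": False, "is_generic_url": True}),
]

kept = [r["original_url"] for r in usable_records(sample)]
print(kept)  # ['http://example.com/a']
```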


In [1]:
import json
import os
import urlexpander

In [2]:
dir_out = os.path.join('..', 'examples', 'output')

In [3]:
def filter_keys(fetched, exclude_keys=('resolved_text', 'article_maintext')):
    """Remove keys that hold the actual HTML/article text.

    Args:
        fetched (dict): a parsed fetch result
        exclude_keys (iterable of str): keys to drop

    Returns:
        dict: a copy of `fetched` without the excluded keys
    """
    # build a new dict so the input is left unchanged
    return {k: v for k, v in fetched.items() if k not in exclude_keys}

## Example URLs

The first example, a Breitbart URL, includes only the required `url` key. \
The second example, a One America News URL, adds an extra key-value pair (`outlet`), which is passed along to the output.

In [4]:
examples = [
    {"url": "http://feedproxy.google.com/~r/breitbart/~3/bh9JQvQPihk/"},
    {
        "url": "http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/",
        "outlet": "One America News",
    },
]
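
If the URLs start out as plain strings, they need to be wrapped in dictionaries first. A minimal sketch, assuming a hypothetical helper `to_url_dicts` (not part of the library) that attaches the same extra key-value pairs to every URL and checks that the result is JSON serializable:

```python
import json


def to_url_dicts(urls, **extra):
    """Wrap plain URL strings in dictionaries with a `url` key plus shared extras."""
    if "url" in extra:
        raise ValueError("'url' is reserved for the URL itself")
    wrapped = [{"url": u, **extra} for u in urls]
    json.dumps(wrapped)  # raises TypeError if an extra value is not JSON serializable
    return wrapped


inputs = to_url_dicts(
    ["http://example.com/a", "http://example.com/b"],
    batch="demo",  # illustrative extra key, passed along like `outlet` above
)
print(inputs[0])  # {'url': 'http://example.com/a', 'batch': 'demo'}
```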


## Fetch URLs (generator)

When `fetch_function=urlexpander.fetch_url`, we first try to retrieve the article with a direct server request. If it fails, we try to fetch an archived version.

In [5]:
# generator
g_ftc = urlexpander.fetch_urls(urls=examples, fetch_function=urlexpander.fetch_url)

# fetch
r_ftc = [json.loads(r) for r in g_ftc]

# filter out keys with actual text
r_ftc = [filter_keys(r) for r in r_ftc]

url 0, fetch_url: http://feedproxy.google.com/~r/breitbart/~3/bh9JQvQPihk/
url 1, fetch_url: http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/


In [6]:
print(f"Fetched {len(r_ftc)} URLs.")

Fetched 2 URLs.


In the Breitbart example, the direct request to the server succeeds. Since `fetch_error` is `False`, the fallback request to the archive is not triggered.

In [7]:
# returns from the first attempt
r_ftc[0]

{'original_url': 'http://feedproxy.google.com/~r/breitbart/~3/bh9JQvQPihk/',
 'resolved_url': 'https://www.breitbart.com/radio/2017/08/15/raheem-kassam-no-go-zones-statue-destruction-muslim-migration-left-wants-erase-america/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+breitbart+%28Breitbart+News%29',
 'resolved_domain': 'breitbart.com',
 'resolved_netloc': 'www.breitbart.com',
 'standardized_url': 'www.breitbart.com/radio/2017/08/15/raheem-kassam-no-go-zones-statue-destruction-muslim-migration-left-wants-erase-america',
 'is_generic_url': False,
 'response_code': 200,
 'response_reason': 'OK',
 'fetch_error': False,
 'FETCH_FUNCTION': 'request_active_url',
 'FETCH_AT': '2021-11-05T23:25:15.611729+00:00'}
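
The output above shows how `standardized_url` relates to `resolved_url`: the scheme, the `utm_*` query parameters and the trailing slash are gone, and the result is lower-cased. The sketch below reproduces that relationship with the standard library. It is an approximation inferred from the example outputs, not the library's actual code: `standardize_sketch` is a hypothetical name, and treating only the `utm_` prefix as analytics-related is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

TRACKING_PREFIXES = ("utm_",)  # assumption: the library's real list is not shown


def standardize_sketch(resolved_url: str) -> str:
    """Return netloc + path + remaining query, lower-cased, without tracking params."""
    parts = urlsplit(resolved_url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    standardized = parts.netloc + parts.path.rstrip("/")
    if kept:
        standardized += "?" + urlencode(kept)
    return standardized.lower()


resolved = (
    "https://www.breitbart.com/radio/2017/08/15/raheem-kassam-no-go-zones-"
    "statue-destruction-muslim-migration-left-wants-erase-america/"
    "?utm_source=feedburner&utm_medium=feed"
    "&utm_campaign=Feed%3A+breitbart+%28Breitbart+News%29"
)
print(standardize_sketch(resolved))
```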

In the One America News example, the retrieved content comes from the fallback request to the Internet Archive's Wayback Machine (`FETCH_FUNCTION: 'request_archived_url'`). This means that the first attempt, the direct request to the server, failed.

In [8]:
# The HTML is stored in `resolved_text` and the extracted article is stored in `article_maintext`.
# Due to copyright, these two keys are filtered out before displaying the output.
r_ftc[1]

{'outlet': 'One America News',
 'original_url': 'http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/',
 'resolved_url': 'http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/',
 'resolved_domain': 'oann.com',
 'resolved_netloc': 'www.oann.com',
 'standardized_url': 'www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2',
 'is_generic_url': False,
 'response_code': 200,
 'response_reason': 'OK',
 'fetch_error': False,
 'FETCH_FUNCTION': 'request_archived_url',
 'FETCH_AT': '2021-11-05T23:25:43.460667+00:00'}

To illustrate the two steps more clearly, we can retrieve the second example with `fetch_function=urlexpander.request_active_url` and `fetch_function=urlexpander.request_archived_url` separately.

In [9]:
# generators
g_exp = urlexpander.fetch_urls(urls=examples[1], fetch_function=urlexpander.request_active_url)
g_wbm = urlexpander.fetch_urls(urls=examples[1], fetch_function=urlexpander.request_archived_url)

# fetch
r_exp = [json.loads(r) for r in g_exp][0]
r_wbm = [json.loads(r) for r in g_wbm][0]

# filter out keys with actual text
r_exp = filter_keys(r_exp)
r_wbm = filter_keys(r_wbm)

url 0, request_active_url: http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/
url 0, request_archived_url: http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/


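The direct request with `request_active_url` fails (`fetch_error` is `True`, and `resolved_url` holds a `__CLIENT_ERROR__` placeholder). Within `fetch_url`, this failure is what triggers the fallback request to the archive.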

In [10]:
r_exp

{'outlet': 'One America News',
 'original_url': 'http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/',
 'resolved_url': 'http://oann.com/__CLIENT_ERROR__',
 'resolved_domain': 'oann.com',
 'resolved_netloc': 'oann.com',
 'standardized_url': 'oann.com/__client_error__',
 'is_generic_url': False,
 'response_code': '',
 'response_reason': '',
 'fetch_error': True,
 'FETCH_FUNCTION': 'request_active_url',
 'FETCH_AT': '2021-11-05T23:25:57.292192+00:00'}
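
Note that in the failed result above `response_code` is an empty string rather than an integer. When results are loaded into typed storage, it may help to normalize the value first. A minimal sketch, using a hypothetical helper `response_code_or_none` and hand-written records:

```python
from typing import Optional


def response_code_or_none(record: dict) -> Optional[int]:
    """Return the status code as an int, or None when it is missing or empty."""
    code = record.get("response_code")
    if code in ("", None):
        return None
    return int(code)


# Hand-written records with only the keys needed here.
failed = {"response_code": "", "fetch_error": True}
succeeded = {"response_code": 200, "fetch_error": False}

print(response_code_or_none(failed))     # None
print(response_code_or_none(succeeded))  # 200
```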

With `request_archived_url`, the `response_code` and `response_reason` indicate that the Wayback Machine has an archived version available. Apart from the `FETCH_AT` timestamp, this is the same output we got with `fetch_function=urlexpander.fetch_url`.

In [11]:
r_wbm

{'outlet': 'One America News',
 'original_url': 'http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/',
 'resolved_url': 'http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/',
 'resolved_domain': 'oann.com',
 'resolved_netloc': 'www.oann.com',
 'standardized_url': 'www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2',
 'is_generic_url': False,
 'response_code': 200,
 'response_reason': 'OK',
 'fetch_error': False,
 'FETCH_FUNCTION': 'request_archived_url',
 'FETCH_AT': '2021-11-05T23:26:08.241936+00:00'}

## Fetch URLs and store the fetched content in a .jsonl file

In [12]:
# set filename
fn_ftc = "news_api_examples.jsonl"

In [13]:
# write to file
urlexpander.fetch_urls_to_file(
    urls=examples,
    fetch_function=urlexpander.fetch_url,
    path=dir_out,
    filename=fn_ftc,
    write_mode="a",
)


url 0, fetch_url: http://feedproxy.google.com/~r/breitbart/~3/bh9JQvQPihk/
url 1, fetch_url: http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/


In [14]:
# read from file
g_ftc_file = urlexpander.load_fetched_from_file(path=dir_out, filename=fn_ftc)

In [15]:
# parse each stringified JSON object into a dict
r_ftc_file = [json.loads(r) for r in g_ftc_file]

In [16]:
# filter out keys with actual text
r_ftc_file = [filter_keys(r) for r in r_ftc_file]

In [17]:
print(f"Loaded fetched content for {len(r_ftc_file)} URLs from {fn_ftc}.")

Loaded fetched content for 2 URLs from news_api_examples.jsonl.
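
A .jsonl file stores one JSON object per line, so it can also be read and written with the standard library alone. The sketch below assumes that `write_mode="a"` behaves like Python's append file mode, in which case running the write cell twice would leave two records per URL in the file. The helpers `append_jsonl` and `read_jsonl` are hypothetical and independent of the library; the last step keeps only the most recent record for each `original_url`.

```python
import json
import os
import tempfile


def append_jsonl(path, records):
    """Append each record to `path` as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Yield one parsed record per non-empty line of `path`."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    append_jsonl(path, [{"original_url": "http://example.com/a", "run": 1}])
    append_jsonl(path, [{"original_url": "http://example.com/a", "run": 2}])  # second run
    loaded = list(read_jsonl(path))

# Later records overwrite earlier ones with the same URL.
latest = {r["original_url"]: r for r in loaded}
print(len(loaded), len(latest))  # 2 1
```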


In [18]:
r_ftc_file[0]

{'original_url': 'http://feedproxy.google.com/~r/breitbart/~3/bh9JQvQPihk/',
 'resolved_url': 'https://www.breitbart.com/radio/2017/08/15/raheem-kassam-no-go-zones-statue-destruction-muslim-migration-left-wants-erase-america/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+breitbart+%28Breitbart+News%29',
 'resolved_domain': 'breitbart.com',
 'resolved_netloc': 'www.breitbart.com',
 'standardized_url': 'www.breitbart.com/radio/2017/08/15/raheem-kassam-no-go-zones-statue-destruction-muslim-migration-left-wants-erase-america',
 'is_generic_url': False,
 'response_code': 200,
 'response_reason': 'OK',
 'fetch_error': False,
 'FETCH_FUNCTION': 'request_active_url',
 'FETCH_AT': '2021-11-05T23:26:17.441628+00:00'}

In [19]:
r_ftc_file[1]

{'outlet': 'One America News',
 'original_url': 'http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/',
 'resolved_url': 'http://www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2/',
 'resolved_domain': 'oann.com',
 'resolved_netloc': 'www.oann.com',
 'standardized_url': 'www.oann.com/pm-abe-to-send-message-japan-wont-repeat-war-atrocities-2',
 'is_generic_url': False,
 'response_code': 200,
 'response_reason': 'OK',
 'fetch_error': False,
 'FETCH_FUNCTION': 'request_archived_url',
 'FETCH_AT': '2021-11-05T23:26:41.223514+00:00'}