# Search the articles

This notebook searches the [Media Cloud](https://mediacloud.org/) database for the articles used in the project.
The following parameters have to be customized:

1. `PATH_TO_DATA` is the `Path` object pointing to the folder where the constructed DataFrames are saved;
2. `MY_KEY` is the API key that every Media Cloud user receives after signing up; for more info go [here](https://github.com/mediacloud/backend/blob/master/doc/api_2_0_spec/api_2_0_spec.md#authentication);
3. `MIN_LEGTH` is the minimum length a word must have to be considered for the word vector;
4. `MIN_FREQUENCY` is the minimum frequency a word must reach in at least one article to be considered for the word vector;
5. `MAX_ARTICLES` sets the maximum number of articles to search;
6. `N_THREADS` sets the number of threads used to parallelize some of the procedures.
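
A key written directly in the notebook is easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The variable name `MC_API_KEY` and the helper `read_api_key` are assumptions made for illustration; they are not part of Media Cloud.

```python
import os


def read_api_key(env_var="MC_API_KEY"):
    """Return the API key stored in the given environment variable.

    `MC_API_KEY` is a hypothetical name chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            "Set the {} environment variable to your Media Cloud key".format(env_var)
        )
    return key
```

With the variable set, `MY_KEY = read_api_key()` gives a value that can be passed to `mediacloud.api.MediaCloud`.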

In [None]:
import os
import json
import pandas as pd
from pathlib import Path
from tqdm.notebook import tqdm
from joblib import Parallel, delayed
import mediacloud.api
from IPython.display import JSON
import datetime
import matplotlib.pyplot as plt
from collections import Counter

# The following parameters have to be customized
PATH_TO_DATA = Path('../data')
# Media Cloud account keys
LOLLO_KEY_1 = 'fa108cf51bdb186f9f037bc196d0183b18b24caac3158416a858b5a9b58dc143'
LOLLO_KEY_2 = '66aa9cf8dbd642b0e47f6811764cbe451a84d9429b8d2b3647c97c0af8fd40f5'
DANI_KEY = '00692be452b478dc158269f890533127ceb444b9f0cc05411ad154f67d55fec1'
# for joblib multithreading
N_THREADS = -1

Media tags for restricting queries.

In [None]:
ITALY_M_TAG = 'tags_id_media:38380117'
US_M_TAG = 'tags_id_media:38379429'
UK_M_TAG = 'tags_id_media:38381111'
FRANCE_M_TAG = 'tags_id_media:38379799'
GERMANY_M_TAG = 'tags_id_media:38379816'
SPAIN_M_TAG = 'tags_id_media:38002034'

Filters for language.
Recall that the Word2Vec model used in the following steps has been trained only on English words.

In [None]:
IT_LANG = 'language:it'
EN_LANG = 'language:en'
FR_LANG = 'language:fr'
DE_LANG = 'language:de'
SP_LANG = 'language:es'  # the ISO 639-1 code for Spanish is "es"
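
The media tags and language filters are plain query clauses, so they can be joined to a keyword expression with `AND`. The sketch below shows one way to do that. `build_query` is a hypothetical helper, not a Media Cloud function; it wraps the keyword part in parentheses so that an `OR` inside it cannot absorb the filters.

```python
def build_query(terms, *filters):
    """Join a boolean keyword expression with any number of filter clauses.

    Hypothetical helper: it only concatenates strings with AND and
    parenthesizes the keyword part.
    """
    clauses = ["({})".format(terms)]
    clauses.extend(f for f in filters if f)
    return " AND ".join(clauses)


# Same values as the constants defined in the cells above
EN_LANG = "language:en"
UK_M_TAG = "tags_id_media:38381111"

print(build_query("capitol AND hill AND (assault OR mob)", EN_LANG))
print(build_query("attack AND norway", EN_LANG, UK_M_TAG))
```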

Function for getting info about articles.

In [None]:
# TODO: limit max number of stories
# TODO: check for repeated stories
def all_matching_stories(mc_client, q, fq):
    """
    Return all the stories matching a query within Media Cloud. Page through the results automatically.
    :param mc_client: a `mediacloud.api.MediaCloud` object instantiated with your API key already
    :param q: your boolean query
    :param fq: your date range query
    :return: a list of media cloud story items
    """
    last_id = 0
    more_stories = True
    stories = []
    while more_stories:
        page = mc_client.storyList(q, fq, last_processed_stories_id=last_id, rows=500, sort='processed_stories_id')
        print("  got one page with {} stories".format(len(page)))
        if len(page) == 0:
            more_stories = False
        else:
            stories += page  # keep the whole page: the next request starts after its last story
            last_id = page[-1]['processed_stories_id']
    return stories
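
The two TODO items above (limiting the number of stories and checking for repeated ones) could be handled inside the paging loop. The following is a sketch under stated assumptions: `FakeClient` is a stand-in written only to exercise the loop without network access, and `matching_stories_limited` with its `max_stories` parameter is a hypothetical variant, not part of the `mediacloud` package.

```python
class FakeClient:
    """Stand-in for the Media Cloud client, used only to exercise the paging loop."""

    def __init__(self, n_stories):
        self._stories = [
            {"stories_id": 1000 + i, "processed_stories_id": i + 1}
            for i in range(n_stories)
        ]

    def storyList(self, q, fq, last_processed_stories_id=0, rows=20, sort=None):
        # Return up to `rows` stories whose id is strictly greater than the last one seen
        newer = [s for s in self._stories
                 if s["processed_stories_id"] > last_processed_stories_id]
        return newer[:rows]


def matching_stories_limited(mc_client, q, fq, max_stories=None, rows=500):
    """Page through the results, skipping repeated stories and stopping at `max_stories`."""
    last_id = 0
    seen = set()
    stories = []
    while max_stories is None or len(stories) < max_stories:
        page = mc_client.storyList(q, fq, last_processed_stories_id=last_id,
                                   rows=rows, sort="processed_stories_id")
        if not page:
            break
        for story in page:
            if story["stories_id"] in seen:
                continue  # already collected
            seen.add(story["stories_id"])
            stories.append(story)
            if max_stories is not None and len(stories) >= max_stories:
                break
        last_id = page[-1]["processed_stories_id"]
    return stories


client = FakeClient(12)
print(len(matching_stories_limited(client, "q", "fq", rows=5)))
print(len(matching_stories_limited(client, "q", "fq", max_stories=7, rows=5)))
```

The first call collects all 12 fake stories over three pages; the second stops as soon as 7 have been collected.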

Function for processing info from stories.

In [None]:
def process_info(story):
    return pd.DataFrame(
            [{
                'article_id': story['stories_id'],
                'timestamp': str(story['publish_date']),
                'source': story['media_id'],
            }]
        )

Creating the output folders if they don't exist.

In [None]:
files = [
    'all_stories.json',
    'word_matrix.json',
    ]
stories = [
    'world_russia',
    'world_norway',
    'world_capitol_hill',
]
os.makedirs(PATH_TO_DATA, exist_ok=True)
for story in stories:
    os.makedirs(PATH_TO_DATA/story, exist_ok=True)

Instantiate the Media Cloud client and get some info about the status of the account.

In [None]:
mc = mediacloud.api.MediaCloud(DANI_KEY)
print('Media cloud version '+str(mediacloud.__version__))
# make sure your connection and API key work by asking for the high-level system statistics
# and print it out as a nice json tree - we'll use this later (only works in Jupyter Lab)
JSON(mc.stats())
# italy collection 38380117

Query definition.
The following cells run the queries that retrieve the article info and the word matrix used in the project.
Some statistics are then plotted.
The filters have to be customized to select the stories of interest.
Keep in mind that the epidemic model works well for events that spread rapidly (popular, interesting) and then die out; for this reason one has to pay attention not only to the topic but also to the time window.
For more info about constructing the query look [here](https://github.com/mediacloud/backend/blob/master/doc/api_2_0_spec/api_2_0_spec.md#query-parameters-5).
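
Each query cell follows the same pattern: if the JSON file is already on disk it is loaded, otherwise the data are fetched from Media Cloud and saved. That pattern can be sketched as a small helper. `load_or_fetch` and `dummy_fetch` are hypothetical names introduced for this illustration; the dummy function stands in for the Media Cloud calls.

```python
import json
import tempfile
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the JSON stored at `path`; if the file is missing, call `fetch()` and save its result."""
    path = Path(path)
    if path.exists():
        with open(path, "r") as json_file:
            return json.load(json_file)
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as json_file:
        json.dump(data, json_file)
    return data


# Usage with a dummy fetch function standing in for the Media Cloud calls
calls = []


def dummy_fetch():
    calls.append(1)
    return [{"stories_id": 1}]


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "world_russia" / "all_stories.json"
    first = load_or_fetch(target, dummy_fetch)   # file missing: fetches and saves
    second = load_or_fetch(target, dummy_fetch)  # file present: loads from disk

print(first == second, len(calls))
```

The fetch function is called only once, so a re-run of the notebook does not repeat a query whose results are already saved.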

Russia school shooting.

In [None]:
# world_russia_query = '(russia AND school AND shooting) OR (russia AND scuola AND sparatoria) OR (russie AND école AND fusillade) OR (rusia AND colegio AND tiroteo) OR  (russland AND schule AND (angriff OR schießen))'
world_russia_query = '(russia AND school AND shooting) AND '+EN_LANG
start_date = datetime.date(2021, 5, 10)
end_date = datetime.date(2021, 6, 10)
# build the date clause up front: it is also needed below for the word matrix
date_range = mc.dates_as_query_clause(start_date, end_date)
if not Path(PATH_TO_DATA/stories[0]/files[0]).exists():
    story_count = mc.storyCount(world_russia_query, date_range)['count']
    print('Media Cloud found {} stories'.format(story_count)) # 4322
    all_stories = all_matching_stories(
        mc,
        world_russia_query,
        date_range)
    with open(PATH_TO_DATA/stories[0]/files[0], 'x') as json_file:
        json.dump(all_stories, json_file)
else:
    with open(PATH_TO_DATA/stories[0]/files[0], 'r') as json_file:
        all_stories = json.load(json_file)
print('Processing {} stories'.format(len(list(all_stories))))
stories_iterator = tqdm(
    list(all_stories),
    leave=True,
    unit='stories',
)
articles_info = Parallel(n_jobs=N_THREADS)(delayed(process_info)(i) for i in stories_iterator)
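info_df = pd.concat(articles_info, axis=0).reset_index(drop=True)
info_df['timestamp'] = pd.to_datetime(info_df.timestamp)
# sort chronologically and renumber the rows, so that row 0 is the earliest article
info_df = info_df.sort_values(by='timestamp').reset_index(drop=True)
del articles_info
print(info_df.head())
# days elapsed since the earliest article
first_timestamp = info_df['timestamp'].iloc[0]
info_df['time_diff'] = info_df['timestamp'].map(lambda x: round((x-first_timestamp).total_seconds()/3600/24))
fig = plt.figure(figsize=(15,8))
# plot the counts against the actual day offsets, so days without articles do not shift the x axis
daily_counts = Counter(info_df['time_diff'])
plt.plot(list(daily_counts.keys()), list(daily_counts.values()))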
plt.xlabel('time (days)')
plt.ylabel('articles published')
plt.grid()
plt.show()
if not Path(PATH_TO_DATA/stories[0]/files[1]).exists():
    print('Getting word matrix')
    stories_words = mc.storyWordMatrix(
        world_russia_query,
        date_range,
        rows=len(list(all_stories)))
    with open(PATH_TO_DATA/stories[0]/files[1], 'w') as json_file:
        json.dump(stories_words, json_file)
else:
    with open(PATH_TO_DATA/stories[0]/files[1], 'r') as json_file:
        stories_words = json.load(json_file)  # load the cached word matrix without overwriting all_stories
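
The plot in the cell above counts articles per day since the first one. The sketch below shows the same idea with the standard library only, and additionally reports days without articles as zero. `articles_per_day` is a hypothetical helper, and it bins by calendar day, whereas the cell rounds the elapsed time to the nearest day.

```python
from collections import Counter
from datetime import datetime


def articles_per_day(timestamps):
    """Count articles per calendar-day offset from the earliest one, including empty days."""
    if not timestamps:
        return []
    first_day = min(t.date() for t in timestamps)
    offsets = Counter((t.date() - first_day).days for t in timestamps)
    return [offsets.get(day, 0) for day in range(max(offsets) + 1)]


sample = [
    datetime(2021, 5, 11, 9, 0),
    datetime(2021, 5, 11, 18, 30),
    datetime(2021, 5, 12, 7, 15),
    datetime(2021, 5, 14, 12, 0),
]
print(articles_per_day(sample))  # [2, 1, 0, 1]
```

Filling the empty days keeps the x axis aligned with real time, which matters when comparing the curve with a model that evolves day by day.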

Norway terrorist attack.

In [None]:
# world_norway_query = '(attack AND norway) OR (attacco AND norvegia) OR (ataque AND noruega) OR (attacke AND norwegen) OR (attaque AND norvège)'
world_norway_query = '(attack AND norway) AND '+EN_LANG
start_date = datetime.date(2021, 7, 21)
end_date = datetime.date(2021, 8, 21)
# build the date clause up front: it is also needed below for the word matrix
date_range = mc.dates_as_query_clause(start_date, end_date)
if not Path(PATH_TO_DATA/stories[1]/files[0]).exists():
    story_count = mc.storyCount(world_norway_query, date_range)['count']
    print('Media Cloud found {} stories'.format(story_count)) # 3875
    all_stories = all_matching_stories(
        mc,
        world_norway_query,
        date_range)
    with open(PATH_TO_DATA/stories[1]/files[0], 'x') as json_file:
        json.dump(all_stories, json_file)
else:
    with open(PATH_TO_DATA/stories[1]/files[0], 'r') as json_file:
        all_stories = json.load(json_file)
print('Processing {} stories'.format(len(list(all_stories))))
stories_iterator = tqdm(
    list(all_stories),
    leave=True,
    unit='stories',
)
articles_info = Parallel(n_jobs=N_THREADS)(delayed(process_info)(i) for i in stories_iterator)
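info_df = pd.concat(articles_info, axis=0).reset_index(drop=True)
info_df['timestamp'] = pd.to_datetime(info_df.timestamp)
# sort chronologically and renumber the rows, so that row 0 is the earliest article
info_df = info_df.sort_values(by='timestamp').reset_index(drop=True)
del articles_info
print(info_df.head())
# days elapsed since the earliest article
first_timestamp = info_df['timestamp'].iloc[0]
info_df['time_diff'] = info_df['timestamp'].map(lambda x: round((x-first_timestamp).total_seconds()/3600/24))
fig = plt.figure(figsize=(15,8))
# plot the counts against the actual day offsets, so days without articles do not shift the x axis
daily_counts = Counter(info_df['time_diff'])
plt.plot(list(daily_counts.keys()), list(daily_counts.values()))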
plt.xlabel('time (days)')
plt.ylabel('articles published')
plt.grid()
plt.show()
if not Path(PATH_TO_DATA/stories[1]/files[1]).exists():
    print('Getting word matrix')
    stories_words = mc.storyWordMatrix(
        world_norway_query,
        date_range,
        rows=len(list(all_stories)))
    with open(PATH_TO_DATA/stories[1]/files[1], 'w') as json_file:
        json.dump(stories_words, json_file)
else:
    with open(PATH_TO_DATA/stories[1]/files[1], 'r') as json_file:
        stories_words = json.load(json_file)  # load the cached word matrix without overwriting all_stories

Capitol Hill mob.

In [None]:
# world_capitol_hill_query = 'capitol AND hill AND (assault OR mob OR asalto OR agression OR angriff OR assalto)'
world_capitol_hill_query = 'capitol AND hill AND (assault OR mob) AND '+EN_LANG
start_date = datetime.date(2021, 1, 5)
end_date = datetime.date(2021, 2, 5)
# build the date clause up front: it is also needed below for the word matrix
date_range = mc.dates_as_query_clause(start_date, end_date)
if not Path(PATH_TO_DATA/stories[2]/files[0]).exists():
    story_count = mc.storyCount(world_capitol_hill_query, date_range)['count']
    print('Media Cloud found {} stories'.format(story_count)) # 45004
    all_stories = all_matching_stories(
        mc,
        world_capitol_hill_query,
        date_range)
    with open(PATH_TO_DATA/stories[2]/files[0], 'x') as json_file:
        json.dump(all_stories, json_file)
else:
    with open(PATH_TO_DATA/stories[2]/files[0], 'r') as json_file:
        all_stories = json.load(json_file)
print('Processing {} stories'.format(len(list(all_stories))))
stories_iterator = tqdm(
    list(all_stories),
    leave=True,
    unit='stories',
)
articles_info = Parallel(n_jobs=N_THREADS)(delayed(process_info)(i) for i in stories_iterator)
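info_df = pd.concat(articles_info, axis=0).reset_index(drop=True)
info_df['timestamp'] = pd.to_datetime(info_df.timestamp)
# sort chronologically and renumber the rows, so that row 0 is the earliest article
info_df = info_df.sort_values(by='timestamp').reset_index(drop=True)
del articles_info
print(info_df.head())
# days elapsed since the earliest article
first_timestamp = info_df['timestamp'].iloc[0]
info_df['time_diff'] = info_df['timestamp'].map(lambda x: round((x-first_timestamp).total_seconds()/3600/24))
fig = plt.figure(figsize=(15,8))
# plot the counts against the actual day offsets, so days without articles do not shift the x axis
daily_counts = Counter(info_df['time_diff'])
plt.plot(list(daily_counts.keys()), list(daily_counts.values()))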
plt.xlabel('time (days)')
plt.ylabel('articles published')
plt.grid()
plt.show()
if not Path(PATH_TO_DATA/stories[2]/files[1]).exists():
    print('Getting word matrix')
    stories_words = mc.storyWordMatrix(
        world_capitol_hill_query,
        date_range,
        rows=len(list(all_stories)))
    with open(PATH_TO_DATA/stories[2]/files[1], 'w') as json_file:
        json.dump(stories_words, json_file)
else:
    with open(PATH_TO_DATA/stories[2]/files[1], 'r') as json_file:
        stories_words = json.load(json_file)  # load the cached word matrix without overwriting all_stories

Charlie Hebdo terrorist attack.

In [None]:
# TODO: proper filtering to reduce the number of stories
query = 'charlie hebdo AND '+UK_M_TAG
start_date = datetime.date(2015, 1, 7)
end_date = datetime.date(2015, 2, 7)
date_range = mc.dates_as_query_clause(start_date, end_date)
story_count = mc.storyCount(query,
                            date_range)['count']
print('Media Cloud found {} stories'.format(story_count)) # WORLD 141156, ITALY 999, FRANCE 4310, US 3613, UK 69

In [None]:
story_to_elaborate = 0
with open(PATH_TO_DATA/stories[story_to_elaborate]/files[0]) as json_file:
    all_stories = json.load(json_file)
# JSON(all_stories)
with open(PATH_TO_DATA/stories[story_to_elaborate]/files[1]) as json_file:
    stories_words = json.load(json_file)
# JSON(stories_words)