# Search the articles

This notebook searches the [Media Cloud](https://mediacloud.org/) database for the articles used in the project.
The following customizable parameters have to be set:

1. `PATH_TO_DATA` is the `Path` object for the folder where the constructed DataFrames are saved;
2. `MY_KEY` is the API key that every Media Cloud user receives after signing up; for more info see [here](https://github.com/mediacloud/backend/blob/master/doc/api_2_0_spec/api_2_0_spec.md#authentication);
3. `MIN_LENGTH` is the minimum length a word must have to be considered for the word vector;
4. `MIN_FREQUENCY` is the minimum frequency a word must reach in at least one article to be considered for the word vector;
5. `MAX_ARTICLES` sets the maximum number of articles to search;
6. `N_THREADS` sets the number of threads used to parallelize some of the procedures.
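
Since `MY_KEY` is a personal credential, one option is to read it from the environment instead of writing it in the notebook. A minimal sketch, assuming a hypothetical environment variable called `MEDIACLOUD_API_KEY` (the name is chosen here for illustration and is not defined by Media Cloud):

```python
import os


def load_api_key(var_name='MEDIACLOUD_API_KEY'):
    """Return the API key stored in the environment variable `var_name`.

    `MEDIACLOUD_API_KEY` is a hypothetical name chosen for this sketch.
    """
    key = os.environ.get(var_name, '').strip()
    if not key:
        raise RuntimeError('Set the {} environment variable first'.format(var_name))
    return key
```

With the variable exported in the shell that starts Jupyter, the configuration cell could then use `MY_KEY = load_api_key()`.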

In [28]:
import json
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm.notebook import tqdm
from joblib import Parallel, delayed


# Constants, NOT TO BE CHANGED
MEDIA_CLOUD_URL = 'https://api.mediacloud.org'
STORIES_SINGLE = '/api/v2/stories_public/single/'
STORIES_WORD_MATRIX = '/api/v2/stories_public/word_matrix/'
# These following have to be customized
PATH_TO_DATA = Path('../data')
MY_KEY = '66aa9cf8dbd642b0e47f6811764cbe451a84d9429b8d2b3647c97c0af8fd40f5'
MIN_LENGTH = 0
MIN_FREQUENCY = 10
MAX_ARTICLES = 3000
N_THREADS = 64

In [2]:
import mediacloud.api
from IPython.display import JSON
import datetime

MY_KEY = 'fa108cf51bdb186f9f037bc196d0183b18b24caac3158416a858b5a9b58dc143'
mc = mediacloud.api.MediaCloud(MY_KEY)
mediacloud.__version__
# make sure your connection and API key work by asking for the high-level system statistics
mc.stats()
# or print it out as a nice json tree - we'll use this later (only works in Jupyter Lab)


JSON(mc.stats())
# italy collection 38380117

<IPython.core.display.JSON object>

In [3]:
ita_query = 'assalto capitol hill and tags_id_media:34412372'#tags_id_media:38380117'
us_query = 'capitol hill assault and tags_id_media:34412234'
start_date = datetime.date(2021, 1, 5)
end_date = datetime.date(2021, 2, 5)
date_range = mc.dates_as_query_clause(start_date, end_date)

In [None]:
story_count = mc.storyCount(us_query,
                            date_range)['count']
print(story_count)
story_count = mc.storyCount(ita_query,
                            date_range)['count']
print(story_count)
# by default this matrix checks 1000 stories, but you can do up to 100,000 via the `rows` parameter
# stories = mc.storyWordMatrix(us_query,#ita_query,
                            #  date_range, rows=story_count)

Function for retrieving all the stories that match a query.

In [4]:
def all_matching_stories(mc_client, q, fq):
    """
    Return all the stories matching a query within Media Cloud. Page through the results automatically.
    :param mc_client: a `mediacloud.api.MediaCloud` object instantiated with your API key already
    :param q: your boolean query
    :param fq: your date range query
    :return: a list of media cloud story items
    """
    last_id = 0
    more_stories = True
    stories = []
    while more_stories:
        page = mc_client.storyList(q, fq, last_processed_stories_id=last_id, rows=500, sort='processed_stories_id')
        print("  got one page with {} stories".format(len(page)))
        if len(page) == 0:
            more_stories = False
        else:
            stories += page
            last_id = page[-1]['processed_stories_id']
    return stories
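
The paging loop can be tried offline before spending API calls. The sketch below repeats the same loop (without the progress print) and runs it against `FakeClient`, a hypothetical stand-in written for this example that only mimics the `storyList` arguments used above; it is not part of the `mediacloud` package.

```python
def all_matching_stories(mc_client, q, fq, rows=500):
    """Same paging loop as above, without the progress print."""
    last_id = 0
    stories = []
    while True:
        page = mc_client.storyList(q, fq, last_processed_stories_id=last_id,
                                   rows=rows, sort='processed_stories_id')
        if len(page) == 0:
            return stories
        stories += page
        last_id = page[-1]['processed_stories_id']


class FakeClient:
    """Hypothetical stand-in for `mediacloud.api.MediaCloud` (no network access)."""

    def __init__(self, n_stories):
        self._stories = [{'processed_stories_id': i} for i in range(1, n_stories + 1)]

    def storyList(self, q, fq, last_processed_stories_id=0, rows=20, sort=None):
        # Return up to `rows` stories whose id is greater than the last one seen
        newer = [s for s in self._stories
                 if s['processed_stories_id'] > last_processed_stories_id]
        return newer[:rows]


stories = all_matching_stories(FakeClient(1913), 'any query', 'any date range')
print(len(stories))  # 1913
```

The loop stops on the first empty page, so a result set whose size is an exact multiple of `rows` costs one extra request.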

In [5]:
#all_stories = all_matching_stories(mc, us_query, date_range)
#print(len(all_stories))
all_stories = all_matching_stories(mc, ita_query, date_range)
print(len(all_stories))
all_stories

  got one page with 500 stories
  got one page with 500 stories
  got one page with 500 stories
  got one page with 413 stories
  got one page with 0 stories
1913


[{'ap_syndicated': False,
  'collect_date': '2021-01-05 02:10:14.000905',
  'feeds': None,
  'guid': 'https://www.repubblica.it/spettacoli/people/2021/01/05/news/robert_duvall_compleanno-280973039/?rss',
  'language': 'it',
  'media_id': 41552,
  'media_name': 'repubblica',
  'media_url': 'http://www.repubblica.it/',
  'processed_stories_id': 2225035435,
  'publish_date': '2021-01-05 01:43:17',
  'stories_id': 1815167122,
  'story_tags': [],
  'title': 'Robert Duvall, l\'eterno "consigliori" che ha rubato la scena alle star',
  'url': 'https://www.repubblica.it/spettacoli/people/2021/01/05/news/robert_duvall_compleanno-280973039/?rss',
  'word_count': None,
  'metadata': {'date_guess_method': None,
   'extractor_version': None,
   'geocoder_version': None,
   'nyt_themes_version': None}},
 {'ap_syndicated': False,
  'collect_date': '2021-01-05 05:41:41.657120',
  'feeds': None,
  'guid': 'agi:agi:10909276',
  'language': 'it',
  'media_id': 300123,
  'media_name': 'Ultime Notizie Onlin

In [29]:
with open(PATH_TO_DATA/'all_stories.json', 'x') as json_file:
    json.dump(all_stories, json_file)

The following cell runs the query that retrieves the word matrix used in the project.
The filters have to be customized to select the stories of interest.
Keep in mind that the epidemic model works well for events that spread rapidly (popular, interesting ones) and then die out, so attention must be paid not only to the topic but also to the time window.
For more info about constructing the query, look [here](https://github.com/mediacloud/backend/blob/master/doc/api_2_0_spec/api_2_0_spec.md#query-parameters-5).
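
As an illustration of customizing the filter, the search terms and the collection can be kept separate and joined by a small helper. This is only a sketch: `build_query` is a hypothetical helper, and the parentheses and the uppercase `AND` follow the usual Solr conventions, which is an assumption to check against the specification linked above.

```python
def build_query(terms, collection_id):
    """Restrict free-text `terms` to the media sources of one collection."""
    # Assumption: Solr-style syntax, with an uppercase boolean operator
    return '({}) AND tags_id_media:{}'.format(terms, collection_id)


ita_query = build_query('assalto capitol hill', 34412372)
print(ita_query)
```

Keeping the collection id as a separate argument makes it easy to run the same terms against another collection.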

In [31]:
# by default this matrix checks 1000 stories, but you can do up to 100,000 via the `rows` parameter
stories_words = mc.storyWordMatrix(ita_query,
                                   date_range,
                                   rows=len(all_stories))
stories_words

{'word_list': [['intanto', 'intanto'],
  ['tarrio', 'tarrio'],
  ['votat', 'votate'],
  ['quando', 'quando'],
  ['è', 'è'],
  ['banner', 'banner'],
  ['vota', 'vota'],
  ['dandolo', 'dandolo'],
  ['fiamm', 'fiamme'],
  ['si', 'si'],
  ['tipo', 'tipo'],
  ['ammesso', 'ammesso'],
  ['un', 'un'],
  ['oggi', 'oggi'],
  ['finita', 'finita'],
  ['di', 'di'],
  ['strappato', 'strappato'],
  ['ha', 'ha'],
  ['fort', 'forte'],
  ['aver', 'averli'],
  ['matter', 'matter'],
  ['all', 'alle'],
  ['live', 'lives'],
  ['mai', 'mai'],
  ["dall'edificio", "dall'edificio"],
  ['fino', 'fino'],
  ['black', 'black'],
  ['guardia', 'guardia'],
  ['sarà', 'sarà'],
  ['temono', 'temono'],
  ['dispiegata', 'dispiegata'],
  ['nazional', 'nazionale'],
  ['disordini', 'disordini'],
  ['la', 'la'],
  ['presunti', 'presunti'],
  ['spot', 'spot'],
  ['ascoltata', 'ascoltata'],
  ['piacerà', 'piacerà'],
  ['deputati', 'deputati'],
  ['aiuterà', 'aiuterà'],
  ['note', 'note'],
  ['americana', 'americana'],
  ['risul

In [34]:
with open(PATH_TO_DATA/'stories_words.json', 'w') as json_file:
    json.dump(stories_words, json_file)

The following cell builds the word vector DataFrame using the `pandas` library.
The DataFrame has a column (`article_id`) with the id of the article, plus one column per word, named after that word.
For 3k articles this should take TODO.
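
The conversion from the sparse word matrix to dense rows can be sketched on toy data. The example assumes the response has the shape seen in this notebook: `word_list` holds `[stem, term]` pairs and `word_matrix` maps each story id to a `{word index: count}` dictionary with string keys. The toy ids and counts are made up.

```python
def dense_rows(word_list, word_matrix):
    """Turn the sparse word matrix into one dense row per article."""
    # Column names: 'article_id' followed by the stem of each word
    columns = ['article_id'] + [stem for stem, _term in word_list]
    rows = []
    for story_id, counts in word_matrix.items():
        row = [0] * len(word_list)
        for word_index, count in counts.items():
            # Word indices arrive as strings, so convert them before indexing
            row[int(word_index)] = count
        rows.append([story_id] + row)
    return columns, rows


# Toy data in the assumed shape of the word matrix response
toy = {
    'word_list': [['intanto', 'intanto'], ['votat', 'votate'], ['fiamm', 'fiamme']],
    'word_matrix': {'101': {'0': 1, '2': 3}, '102': {'1': 2}},
}
columns, rows = dense_rows(toy['word_list'], toy['word_matrix'])
```

The resulting `rows` and `columns` can be passed directly to `pd.DataFrame(rows, columns=columns)`.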

In [35]:
words_df = pd.DataFrame()
n_words = len(stories_words['word_list'])
# Column names: the article id followed by the first entry of each word pair
columns = ['article_id'] + [word[0] for word in stories_words['word_list']]
keys_iterator = tqdm(
    stories_words['word_matrix'].keys(),
    leave=True,
    unit='articles',
)
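def process(key):
    # Build one dense row: the article id followed by the count of each word
    word_vector_dict = stories_words['word_matrix'][key]
    word_vector = np.zeros(n_words)
    for kkey in word_vector_dict.keys():
        # The keys are word indices stored as strings; int() avoids the risks of eval()
        word_vector[int(kkey)] = word_vector_dict[kkey]
    word_vector = [key] + list(word_vector)
    return pd.DataFrame([word_vector], columns=columns)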
    
results = Parallel(n_jobs=N_THREADS)(delayed(process)(i) for i in keys_iterator)

  0%|          | 0/1913 [00:00<?, ?articles/s]

In [39]:
words_df = pd.concat(results, axis=0, ignore_index=True)
#del results
words_df.head()

Unnamed: 0,article_id,intanto,tarrio,votat,quando,è,banner,vota,dandolo,fiamm,...,l'oligarchia,dopò,gravano,sovrappongono,ascoltatori,sopravvivrà,paragonando,attaccavano,stento,un'audi
0,1815167122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1815303448,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1815314004,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1815387215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1815537720,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The following cell drops the columns of the words that should not be counted in the word vectors. As written, it drops only the words containing digits or special characters: the check on `MIN_LENGTH` is commented out and `MIN_FREQUENCY` is not applied.
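
A filter that also applies the two parameters could be written as a single predicate. The sketch below is one possible reading, not the notebook's implementation: `keep_word` is a hypothetical helper, and `MIN_FREQUENCY` is interpreted as the count a word must reach in at least one article.

```python
SPECIAL_CHARACTERS = set('.\\!@#$%^&*()+?_=,<>/')


def keep_word(word, counts, min_length, min_frequency):
    """Decide whether the column for `word` should be kept.

    `counts` holds the occurrences of the word in each article.
    """
    if len(word) < min_length:
        return False
    if any(ch.isdigit() or ch in SPECIAL_CHARACTERS for ch in word):
        return False
    # The word must reach `min_frequency` in at least one article
    return max(counts, default=0) >= min_frequency
```

For example, `keep_word('guardia', [0, 12, 3], 3, 10)` keeps the word, while `keep_word('covid19', [20], 3, 10)` rejects it because of the digits.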

In [67]:
cols_iterator = tqdm(
    words_df.columns[1:],
    leave=True,
    unit='columns',
)

special_characters = r".\!@#$%^&*()+?_=,<>/"

def has_numbers(inputString):
    return any(char.isdigit() for char in inputString)

def has_special_chars(inputString):
    return any(c in special_characters for c in inputString)

def process_drop(col):
    # tmp = pd.to_numeric(words_df[col])
    # if len(col) < MIN_LENGTH or len(tmp[tmp>0]) == 0 or has_numbers(col) or has_special_chars(col):
    if has_numbers(col) or has_special_chars(col):
        return col
    
cols = Parallel(n_jobs=N_THREADS)(delayed(process_drop)(i) for i in cols_iterator)
# cols

  0%|          | 0/41566 [00:00<?, ?columns/s]

Clean the list of columns to drop (removing the empty placeholders) to avoid errors while dropping them.

In [69]:
cols = [x for x in cols if (str(x) != 'nan' and str(x) != 'NaN' and str(x) != '' and str(x) != 'None')]
words_df = words_df.drop(columns=cols)
words_df['article_id'] = pd.to_numeric(words_df.article_id)
# del cols
words_df.head()

Unnamed: 0,article_id,intanto,tarrio,votat,quando,è,banner,vota,dandolo,fiamm,...,l'oligarchia,dopò,gravano,sovrappongono,ascoltatori,sopravvivrà,paragonando,attaccavano,stento,un'audi
0,1815167122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1815537720,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1815303448,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1815314004,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1815387215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The following cell builds the article info DataFrame using the `pandas` library.
The data about each article is taken from the story records returned by Media Cloud.
The DataFrame has a column (`article_id`) with the id of the article.
Column `timestamp` carries the date on which the article was published.
Column `source` carries the id of the media source that published the article.
The articles then have to be sorted in chronological order.
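
The extraction and the chronological sort can be sketched without `pandas` on a few story records, trimmed here to the three fields that are used. `article_info` is a hypothetical helper written for this example.

```python
from datetime import datetime


def article_info(story):
    """Extract the three fields used for the info table from one story record."""
    return {
        'article_id': story['stories_id'],
        # `publish_date` looks like '2021-01-05 01:43:17', optionally with microseconds
        'timestamp': datetime.fromisoformat(story['publish_date']),
        'source': story['media_id'],
    }


stories = [
    {'stories_id': 1815303448, 'publish_date': '2021-01-05 04:58:00', 'media_id': 300123},
    {'stories_id': 1815167122, 'publish_date': '2021-01-05 01:43:17', 'media_id': 41552},
    {'stories_id': 1843800908, 'publish_date': '2021-02-05 13:30:07.636862', 'media_id': 38859},
]
info = sorted((article_info(s) for s in stories), key=lambda row: row['timestamp'])
print([row['article_id'] for row in info])  # [1815167122, 1815303448, 1843800908]
```

Parsing the timestamps before sorting matters because the strings do not all have the same format (some carry microseconds).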

In [7]:
list(all_stories)[0]

{'ap_syndicated': False,
 'collect_date': '2021-01-05 02:10:14.000905',
 'feeds': None,
 'guid': 'https://www.repubblica.it/spettacoli/people/2021/01/05/news/robert_duvall_compleanno-280973039/?rss',
 'language': 'it',
 'media_id': 41552,
 'media_name': 'repubblica',
 'media_url': 'http://www.repubblica.it/',
 'processed_stories_id': 2225035435,
 'publish_date': '2021-01-05 01:43:17',
 'stories_id': 1815167122,
 'story_tags': [],
 'title': 'Robert Duvall, l\'eterno "consigliori" che ha rubato la scena alle star',
 'url': 'https://www.repubblica.it/spettacoli/people/2021/01/05/news/robert_duvall_compleanno-280973039/?rss',
 'word_count': None,
 'metadata': {'date_guess_method': None,
  'extractor_version': None,
  'geocoder_version': None,
  'nyt_themes_version': None}}

In [24]:
#all_stories = all_matching_stories(mc, ita_query, date_range)
print("Stories found {}".format(len(all_stories)))
stories_iterator = tqdm(
    list(all_stories),
    leave=True,
    unit='stories',
)
def process_info(story):
    return pd.DataFrame(
            [{
                'article_id': story['stories_id'],
                'timestamp': str(story['publish_date']),
                'source': story['media_id'],
            }]
        )
articles_info = Parallel(n_jobs=N_THREADS)(delayed(process_info)(i) for i in stories_iterator)
#articles_info

Stories found 1913


  0%|          | 0/1913 [00:00<?, ?stories/s]

In [25]:
info_df = pd.concat(articles_info, axis=0, ignore_index=True)
info_df['timestamp'] = pd.to_datetime(info_df.timestamp)
info_df = info_df.sort_values(by='timestamp')
#del articles_info
info_df.head()

Unnamed: 0,article_id,timestamp,source
0,1815167122,2021-01-05 01:43:17,41552
4,1815537720,2021-01-05 03:49:27,39834
1,1815303448,2021-01-05 04:58:00,300123
2,1815314004,2021-01-05 04:58:00,39740
3,1815387215,2021-01-05 06:56:36,300147


It is then necessary to filter out of the info DataFrame the stories that were not processed in the word matrix.
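
The same check can be expressed with a set lookup, which avoids rebuilding a list for every id. A minimal sketch on toy lists; `ids_missing_from` is a hypothetical helper, and both lists are assumed to hold ids of the same type (for example all integers).

```python
def ids_missing_from(info_ids, matrix_ids):
    """Return the ids in `info_ids` that have no row in the word matrix."""
    known = set(matrix_ids)  # set membership avoids scanning a list for every id
    return [i for i in info_ids if i not in known]


info_ids = [1815167122, 1815303448, 1815314004]
matrix_ids = [1815303448, 1815167122]
print(ids_missing_from(info_ids, matrix_ids))  # [1815314004]
```

The returned list contains only the ids to drop, so its length is the number of stories that will be removed.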

In [64]:
print("Stories found by querying info {}".format(len(all_stories)))
print("Stories found by word matrix info {}".format(len(list(stories_words['word_matrix'].keys()))))
print("Stories to drop {}".format(len(all_stories)-len(list(stories_words['word_matrix'].keys()))))
ids_iterator = tqdm(
    list(info_df['article_id']),
    leave=True,
    unit='ids',
)
def process_ids(id):
    if id in list(words_df['article_id']):
        return None
    else:
        return id
ids_to_drop = Parallel(n_jobs=N_THREADS)(delayed(process_ids)(i) for i in ids_iterator)
len(ids_to_drop)

Stories found by querying info 1913
Stories found by word matrix info 1913
Stories to drop 0


  0%|          | 0/1913 [00:00<?, ?ids/s]

1913

In [65]:
info_df = info_df[~info_df['article_id'].isin([i for i in ids_to_drop if i is not None])]
info_df

Unnamed: 0,article_id,timestamp,source
0,1815167122,2021-01-05 01:43:17.000000,41552
4,1815537720,2021-01-05 03:49:27.000000,39834
1,1815303448,2021-01-05 04:58:00.000000,300123
2,1815314004,2021-01-05 04:58:00.000000,39740
3,1815387215,2021-01-05 06:56:36.000000,300147
...,...,...,...
1901,1843599484,2021-02-05 09:45:53.000000,40216
1902,1843692039,2021-02-05 10:26:43.000000,40216
1903,1843800908,2021-02-05 13:30:07.636862,38859
1906,1844581078,2021-02-05 21:14:00.000000,334276


The word DataFrame must then be given the same row order.
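
Reordering by index position assumes that both DataFrames share the same index. Matching the rows by article id is an alternative that does not rely on that assumption. A sketch with plain dictionaries and made-up ids; `reorder_rows` is a hypothetical helper:

```python
def reorder_rows(rows_by_id, ordered_ids):
    """Return the rows of `rows_by_id` in the order given by `ordered_ids`."""
    return [rows_by_id[article_id] for article_id in ordered_ids]


word_rows = {101: [1, 0, 3], 102: [0, 2, 0], 103: [0, 0, 1]}
chronological_ids = [102, 103, 101]
print(reorder_rows(word_rows, chronological_ids))  # [[0, 2, 0], [0, 0, 1], [1, 0, 3]]
```

With `pandas`, the same idea would be to index the word DataFrame by `article_id` and select the rows in the order of `info_df['article_id']`.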

In [66]:
info_df = info_df.sort_values(by='timestamp')
words_df = words_df.reindex(info_df.index.tolist())
#words_df

Finally, the DataFrames are saved in the data folder defined by `PATH_TO_DATA`.

In [43]:
words_df.to_csv(PATH_TO_DATA/'words_dataframe.csv')
info_df.to_csv(PATH_TO_DATA/'info_dataframe.csv')