# Search the articles

This notebook will search the articles for the project exploiting the [Media Cloud](https://mediacloud.org/) database.
One have to set the customizable parameters:

1. `PATH_TO_DATA` is the Path object for saving the constructed DataFrames;
2. `MY_KEY` parameter is the user key every Media Cloud user has been provided once signed up, for more info go [here](https://github.com/mediacloud/backend/blob/master/doc/api_2_0_spec/api_2_0_spec.md#authentication);
3. `MIN_LEGTH` parameter is the minimiun length that a word must have to be considered for the word vector;
4. `MIN_FREQUENCY` parameter is the minimum frequency a word must appear in at least one article to be considered for the word vector;
5. `MAX_ARTICLES` parameter set the maximum number of articles to search;
6. `N_THREADS` parameter set the number of threads for parallelizing some of the procedures.

In [6]:
import requests
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm
from joblib import Parallel, delayed
import time


# Constants, NOT TO BE CHANGED
MEDIA_CLOUD_URL = 'https://api.mediacloud.org'
STORIES_SINGLE = '/api/v2/stories_public/single/'
STORIES_WORD_MATRIX = '/api/v2/stories_public/word_matrix/'
# These following have to be customized
PATH_TO_DATA = Path('../data')
MY_KEY = '66aa9cf8dbd642b0e47f6811764cbe451a84d9429b8d2b3647c97c0af8fd40f5'
MIN_LENGTH = 0
MIN_FREQUENCY = 10
MAX_ARTICLES = 1000
N_THREADS = 12

The following cell will run the query that will get the word matrix (used in the project).
One has to customize the filters for getting what he/she is interested in.
Remind that the epidemic model works fine for events that have a rapid spreading (popular, interesting) and then die, because of that one has to pay attention not only to the argument but also to the time windows.
For more info about constructing the query look [here](https://github.com/mediacloud/backend/blob/master/doc/api_2_0_spec/api_2_0_spec.md#query-parameters-5).

In [8]:
start = time.time()
# Initialize articles' ids list
stories_id = []
# Filter to pass to the query
params = {
    # numer of articles per query
    'rows': MAX_ARTICLES,
    # filtering the language of the article to have consistent words
    'q': 'capitol hill assault',
    # main filters: arguments and so on
    'fq': 'publish_date:[2021-01-05T23:59:59.999Z TO 2021-04-05T23:59:59.999Z]',
    # personal key to pass to the query
    'key': MY_KEY,
}
end = time.time()
print('Build request, overall time elapsed: {}'.format(end - start))
# Get word matrix of the query
r = requests.get(
    MEDIA_CLOUD_URL+STORIES_WORD_MATRIX,
    params=params,
    headers={'Accept': 'application/json'})
end = time.time()
print('Sent and retrieved request, overall time elapsed: {}'.format(end - start))
# Retrieve the results to be analyzed
stories_word_info = r.json()
print('Found {} articles'.format(len(stories_word_info)))
end = time.time()
print('Overall time elapsed: {}'.format(end - start))

Build request, overall time elapsed: 0.00010442733764648438
Sent and retrieved request, overall time elapsed: 32.85556960105896
Found 2 articles
Overall time elapsed: 32.99162268638611


The following cell will build the words vector DataFrame using `pandas` library.
The DataFrame will have a column (`id`) with the id of the article, and also one column per word named with that word.
For 3k articles this should take TODO.

In [15]:
words_df = pd.DataFrame()
n_words = len(stories_word_info['word_list'])
columns = [stories_word_info['word_list'][i][0] for i in range(len(stories_word_info['word_list']))]
columns.insert(0,'article_id')
keys_iterator = tqdm(
    stories_word_info['word_matrix'].keys(),
    leave=True,
    unit='articles',
)
def process(key):
    word_vector_dict = stories_word_info['word_matrix'][key]
    word_vector = np.zeros(n_words)
    for kkey in word_vector_dict.keys():
        word_vector[eval(kkey)] = word_vector_dict[kkey]
    word_vector = list([key])+list(word_vector)
    return pd.DataFrame([word_vector], columns=columns)
    
results = Parallel(n_jobs=N_THREADS)(delayed(process)(i) for i in keys_iterator)

100%|██████████| 997/997 [03:11<00:00,  5.22keys/s]


Unnamed: 0,article_id,stay,video,plaza,jan,juvenil,month,northeastern,nation,ireland,...,amiri,flaccus,semi-retir,szelag,best-cas,cordoned-off,dearborn,blocked-off,cort,soul-search
0,1816284987,0.0,4.0,0.0,0.0,0.0,2.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1816399963,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1816432915,1.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1816531981,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1816681233,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
992,1925068603,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
993,1927466638,5.0,0.0,0.0,0.0,0.0,6.0,0.0,10.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
994,1928459370,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
995,1929004953,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
words_df = pd.concat(results, axis=0).reset_index().drop('index', 1).drop('level_0', 1)
#del results
words_df

Unnamed: 0,article_id,stay,video,plaza,jan,juvenil,month,northeastern,nation,ireland,...,amiri,flaccus,semi-retir,szelag,best-cas,cordoned-off,dearborn,blocked-off,cort,soul-search
0,1816284987,0.0,4.0,0.0,0.0,0.0,2.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1816399963,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1816432915,1.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1816531981,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1816681233,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
992,1925068603,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
993,1927466638,5.0,0.0,0.0,0.0,0.0,6.0,0.0,10.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
994,1928459370,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
995,1929004953,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The following cell will drop out the column corresponding to the word that should not be counted for the words vectors as set by the parameters `MIN_LENGTH` and `MIN_FREQUENCY`.

In [18]:
cols_iterator = tqdm(
    words_df.columns[1:],
    leave=True,
    unit='columns',
)

special_characters = ".\!@#$%^&*()-+?_=,<>/"

def has_numbers(inputString):
    return any(char.isdigit() for char in inputString)

def has_special_chars(inputString):
    return any(c in special_characters for c in inputString)

def process_drop(col):
    tmp = pd.to_numeric(words_df[col])
    #if len(col) < MIN_LENGTH or len(tmp[tmp>0]) == 0 or has_numbers(col) or has_special_chars(col):
    if has_numbers(col):# or has_special_chars(col):
        return col
    
cols = Parallel(n_jobs=N_THREADS)(delayed(process_drop)(i) for i in cols_iterator)
cols

100%|██████████| 50847/50847 [00:13<00:00, 3666.60columns/s]


[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,

In [19]:
cols = [x for x in cols if (str(x) != 'nan' and str(x) != 'NaN' and str(x) != '' and str(x) != 'None')]
words_df = words_df.drop(cols, 1)
#del cols
words_df

Unnamed: 0,article_id,stay,video,plaza,jan,juvenil,month,northeastern,nation,ireland,...,amiri,flaccus,semi-retir,szelag,best-cas,cordoned-off,dearborn,blocked-off,cort,soul-search
0,1816284987,0.0,4.0,0.0,0.0,0.0,2.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1816399963,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1816432915,1.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1816531981,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1816681233,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
992,1925068603,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
993,1927466638,5.0,0.0,0.0,0.0,0.0,6.0,0.0,10.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
994,1928459370,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
995,1929004953,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The following cell will build the article info DataFrame using `pandas` library.
Data about articles will be extracted by usign a query to Media Cloud for the single story.
The DataFrame will have a column (`id`) with the id of the article.
Column `timestamp` will carry the data in which the article has been published.
Columns `source` will carry the media id of the media in which the article has been published.
It is necessary to reordering in cronological order the articles.

In [20]:
article_id_iterator = tqdm(
    words_df['article_id'],
    leave=True,
    unit='columns',
)
def process_info(article_id):
    r = requests.get(
        MEDIA_CLOUD_URL+STORIES_SINGLE+str(article_id),
        params={'key': MY_KEY},
        headers={'Accept': 'application/json'}
    )
    story = r.json()[0]
    return pd.DataFrame(
            [{
                'article_id': article_id,
                'timestamp': str(story['publish_date'])[:10],
                'source': story['media_id'],
            }]
        )
articles_info = Parallel(n_jobs=N_THREADS)(delayed(process_info)(i) for i in article_id_iterator)
articles_info

100%|██████████| 997/997 [01:21<00:00, 12.29columns/s]


[   article_id   timestamp  source
 0  1816284987  2021-01-06  152133,
    article_id   timestamp  source
 0  1816399963  2021-01-06  121175,
    article_id   timestamp  source
 0  1816432915  2021-01-06   90123,
    article_id   timestamp  source
 0  1816531981  2021-01-06   98039,
    article_id   timestamp  source
 0  1816681233  2021-01-06   70300,
    article_id   timestamp  source
 0  1816696057  2021-01-06  239578,
    article_id   timestamp  source
 0  1816701081  2021-01-06   18364,
    article_id   timestamp  source
 0  1816728190  2021-01-06      74,
    article_id   timestamp  source
 0  1816734031  2021-01-06  280031,
    article_id   timestamp  source
 0  1816773278  2021-01-06  195373,
    article_id   timestamp  source
 0  1816802115  2021-01-06    5721,
    article_id   timestamp  source
 0  1816806402  2021-01-06  661229,
    article_id   timestamp  source
 0  1816818727  2021-01-06  119212,
    article_id   timestamp  source
 0  1816840585  2021-01-06   27590,
    ar

In [21]:
info_df = pd.concat(articles_info, axis=0).reset_index().drop('index', 1)
info_df['timestamp'] = pd.to_datetime(info_df.timestamp)
info_df = info_df.sort_values(by='timestamp')
#del articles_info
info_df

Unnamed: 0,article_id,timestamp,source
0,1816284987,2021-01-06,152133
38,1817375539,2021-01-06,663002
34,1817237705,2021-01-06,1708
29,1817150705,2021-01-06,307431
22,1817006484,2021-01-06,92482
...,...,...,...
982,1897143334,2021-04-04,18350
984,1897556452,2021-04-05,390379
983,1897427444,2021-04-05,661229
986,1897867369,2021-04-05,212407


The same order must be given to the other DataFrame obviously.

In [22]:
words_df = words_df.reindex(info_df.index.tolist())
words_df

Unnamed: 0,article_id,stay,video,plaza,jan,juvenil,month,northeastern,nation,ireland,...,amiri,flaccus,semi-retir,szelag,best-cas,cordoned-off,dearborn,blocked-off,cort,soul-search
0,1816284987,0.0,4.0,0.0,0.0,0.0,2.0,0.0,7.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38,1817375539,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34,1817237705,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29,1817150705,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,1817006484,0.0,7.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
982,1897143334,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
984,1897556452,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
983,1897427444,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
986,1897867369,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Finally, the DataFrames are save in the data folder, defined by `PATH_TO_DATA`.

In [None]:
words_df.to_csv(PATH_TO_DATA/'words_dataframe.csv')
info_df.to_csv(PATH_TO_DATA/'info_dataframe.csv')