# Search the articles

This notebook will search the articles for the project exploiting the [Media Cloud](https://mediacloud.org/) database.
One have to set the customizable parameters:

1. `PATH_TO_DATA` is the Path object for saving the constructed DataFrames;
2. `MY_KEY` parameter is the user key every Media Cloud user has been provided once signed up, for more info go [here](https://github.com/mediacloud/backend/blob/master/doc/api_2_0_spec/api_2_0_spec.md#authentication);
3. `MIN_LEGTH` parameter is the minimiun length that a word must have to be considered for the word vector;
4. `MIN_FREQUENCY` parameter is the minimum frequency a word must appear in at least one article to be considered for the word vector;
5. `MAX_ARTICLES` parameter set the maximum number of articles to search;
6. `N_THREADS` parameter set the number of threads for parallelizing some of the procedures.

In [8]:
import requests
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm
from joblib import Parallel, delayed

# Constants, NOT TO BE CHANGED
MEDIA_CLOUD_URL = 'https://api.mediacloud.org'
STORIES_SINGLE = '/api/v2/stories_public/single/'
STORIES_WORD_MATRIX = '/api/v2/stories_public/word_matrix/'
# These following have to be customized
PATH_TO_DATA = Path('../data')
MY_KEY = '66aa9cf8dbd642b0e47f6811764cbe451a84d9429b8d2b3647c97c0af8fd40f5'
MIN_LENGTH = 10
MIN_FREQUENCY = 10
MAX_ARTICLES = 3000
N_THREADS = 12

The following cell will run the query that will get the word matrix (used in the project).
One has to customize the filters for getting what he/she is interested in.
Remind that the epidemic model works fine for events that have a rapid spreading (popular, interesting) and then die, because of that one has to pay attention not only to the argument but also to the time windows.
For more info about constructing the query look [here](https://github.com/mediacloud/backend/blob/master/doc/api_2_0_spec/api_2_0_spec.md#query-parameters-5).

In [24]:
%%timeit
# Initialize articles' ids list
stories_id = []
# Filter to pass to the query
params = {
    # numer of articles per query
    'rows': MAX_ARTICLES,
    # filtering the language of the article to have consistent words
    'q': 'capitol hill assault',
    # main filters: arguments and so on
    'fq': 'publish_date:[2021-01-05T23:59:59.999Z TO 2021-04-05T23:59:59.999Z]',
    # personal key to pass to the query
    'key': MY_KEY,
}
# Get word matrix of the query
r = requests.get(
    MEDIA_CLOUD_URL+STORIES_WORD_MATRIX,
    params=params,
    headers={'Accept': 'application/json'})
# Retrieve the results to be analyzed
stories_word_info = r.json()
print('Found {} articles'.format(len(stories_word_info)))

Found 2 articles
Found 2 articles
Found 2 articles
Found 2 articles
Found 2 articles
Found 2 articles
Found 2 articles
Found 2 articles
995 ms ± 82.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The following cell will build the words vector DataFrame using `pandas` library.
The DataFrame will have a column (`id`) with the id of the article, and also one column per word named with that word.
For 3k articles this should take TODO.

In [19]:
words_df = pd.DataFrame()
n_words = len(stories_word_info['word_list'])
columns = [stories_word_info['word_list'][i][0] for i in range(len(stories_word_info['word_list']))]
columns.insert(0,'article_id')
keys_iterator = tqdm(
    stories_word_info['word_matrix'].keys(),
    leave=True,
    unit='keys',
)
def process(key):
    word_vector_dict = stories_word_info['word_matrix'][key]
    word_vector = np.zeros(n_words)
    for kkey in word_vector_dict.keys():
        word_vector[eval(kkey)] = word_vector_dict[kkey]
    word_vector = list([key])+list(word_vector)
    return pd.DataFrame([word_vector], columns=columns)
    
results = Parallel(n_jobs=N_THREADS)(delayed(process)(i) for i in keys_iterator)
words_df = pd.concat(results, axis=0).reset_index().drop('index', 1)
del results
words_df

100%|██████████| 3/3 [00:00<00:00, 2233.79keys/s]


Unnamed: 0,article_id,twitter,photo,awami,globe,asia,brad,mission,critic,share,...,spread,governor,week,"800,000",risk-avers,congress,mlb,philip,nonprofit,unfortun
0,1865247844,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1873690975,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1897917137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The following cell will drop out the column corresponding to the word that should not be counted for the words vectors as set by the parameters `MIN_LENGTH` and `MIN_FREQUENCY`.

In [21]:
cols_iterator = tqdm(
    words_df.columns[1:],
    leave=True,
    unit='columns',
)

def process_drop(col):
    tmp = pd.to_numeric(words_df[col])
    if len(col) < MIN_LENGTH or len(tmp[tmp>0]) == 0:
        return col
    
cols = Parallel(n_jobs=N_THREADS)(delayed(process_drop)(i) for i in cols_iterator)
cols = [x for x in cols if (str(x) != 'nan' and str(x) != 'NaN' and str(x) != '' and str(x) != 'None')]
words_df.drop(cols, 1)
del cols
words_df

100%|██████████| 448/448 [00:00<00:00, 3148.53columns/s]


Unnamed: 0,article_id,twitter,photo,awami,globe,asia,brad,mission,critic,share,...,spread,governor,week,"800,000",risk-avers,congress,mlb,philip,nonprofit,unfortun
0,1865247844,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1873690975,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1897917137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The following cell will build the article info DataFrame using `pandas` library.
Data about articles will be extracted by usign a query to Media Cloud for the single story.
The DataFrame will have a column (`id`) with the id of the article.
Column `timestamp` will carry the data in which the article has been published.
Columns `source` will carry the media id of the media in which the article has been published.

In [23]:
article_id_iterator = tqdm(
    words_df['article_id'],
    leave=True,
    unit='columns',
)
def process_info(article_id):
    r = requests.get(
        MEDIA_CLOUD_URL+STORIES_SINGLE+str(article_id),
        params={'key': MY_KEY},
        headers={'Accept': 'application/json'}
    )
    story = r.json()[0]
    return pd.DataFrame(
            [{
                'id': article_id,
                'timestamp': str(story['publish_date'])[:10],
                'source': story['media_id'],
            }]
        )
articles_info = Parallel(n_jobs=N_THREADS)(delayed(process_info)(i) for i in article_id_iterator)
info_df = pd.concat(articles_info, axis=0).reset_index().drop('index', 1)
del articles_info
info_df


100%|██████████| 3/3 [00:00<00:00, 4022.67columns/s]


Unnamed: 0,id,timestamp,source
0,1865247844,2021-02-27,320927
1,1873690975,2021-03-08,179329
2,1897917137,2021-04-05,77362


Finally, the DataFrames are save in the data folder, defined by `PATH_TO_DATA`.

In [None]:
words_df.to_csv(PATH_TO_DATA/'words_dataframe.csv')
info_df.to_csv(PATH_TO_DATA/'info_dataframe.csv')