# Create articles matrix

This notebook builds the article matrix by processing the .json files returned by the search for the articles of interest.
The article matrix has one row per article and 300 columns, one for each component of the article vector. The article vectors are built with a Word2Vec model that represents every word as a 300-component vector.
Each article is represented by the sum of the 300-component vectors of its words, weighted by their tf-idf scores.
The model chosen for the representation is the [Google News Word2Vec](https://code.google.com/archive/p/word2vec/) model.
A dataframe containing the features needed by the dynamical model (`article_id`, `timestamp`, `source`) is also returned.
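
Written out, the weighting computed by the code in this notebook is the following, where the symbols are introduced here only for illustration: $\mathrm{tf}_{d,w}$ is the raw count of word $w$ in article $d$, $N$ is the number of articles, $n_w$ is the number of articles containing $w$, and $\mathbf{v}_w$ is the 300-component vector of $w$ (a zero vector when the model does not contain the word).

```latex
$$
\mathbf{a}_d \;=\; \sum_{w} \mathrm{tf}_{d,w}\,\log\!\left(\frac{N}{n_w}\right)\mathbf{v}_w
$$
```

A word that occurs in every article has $n_w = N$, so its weight is $\log 1 = 0$ and it does not contribute to any article vector.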

In [1]:
import json
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm.notebook import tqdm
from joblib import Parallel, delayed
from functools import partial
from gensim.models.word2vec import KeyedVectors

## The following settings have to be customized
PATH_TO_DATA = Path('../data')
# number of joblib workers (-1 uses all available CPU cores)
N_THREADS = -1

Function for extracting the metadata of an article from the .json.

In [2]:
def process_info(story):
    return pd.DataFrame(
            [{
                'article_id': story['stories_id'],
                'timestamp': str(story['publish_date']),
                'source': story['media_id'],
            }]
        )
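
As a possible alternative to concatenating many one-row dataframes, the same three fields can be collected into a list of dicts and converted once. This is only a sketch: the records below are made up, and `info_df` is a hypothetical name.

```python
import pandas as pd

# Made-up records mimicking the three fields read by `process_info`.
all_stories = [
    {'stories_id': 3, 'publish_date': '2021-07-21 00:19:04', 'media_id': 84097},
    {'stories_id': 1, 'publish_date': '2021-07-21 00:00:00', 'media_id': 396984},
    {'stories_id': 2, 'publish_date': '2021-07-22 08:30:00', 'media_id': 125334},
]

# Build one DataFrame from a list of dicts instead of one frame per story.
info_df = pd.DataFrame(
    [
        {
            'article_id': story['stories_id'],
            'timestamp': str(story['publish_date']),
            'source': story['media_id'],
        }
        for story in all_stories
    ]
)

# Parse the timestamps and sort chronologically with a clean 0..n-1 index.
info_df['timestamp'] = pd.to_datetime(info_df['timestamp'])
info_df = info_df.sort_values(by='timestamp').reset_index(drop=True)
print(info_df)
```

Building the frame in one call avoids the repeated index value that appears when single-row frames are concatenated and then reset.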

Function for processing the word occurrences of the articles from the .json.

In [3]:
def process_article_word_matrix_json(article_words_occurences, n_words: int):
    words_occurences = np.zeros(n_words)
    for key, value in article_words_occurences.items():
        # keys are word indices stored as strings; int() avoids evaluating arbitrary text
        words_occurences[int(key)] = value
    return words_occurences
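
To illustrate the conversion, here is a minimal sketch on a made-up entry. It assumes, as the indexing in the function above implies, that each key is a word index stored as a string; `to_dense_counts` is a hypothetical name used only here.

```python
import numpy as np

def to_dense_counts(sparse_counts, n_words):
    """Expand {'word_index': count} into a dense vector of length n_words."""
    dense = np.zeros(n_words)
    for key, value in sparse_counts.items():
        dense[int(key)] = value  # JSON object keys are always strings
    return dense

# Hypothetical entry: word 0 occurs twice and word 3 once, in a 5-word vocabulary.
example = {'0': 2, '3': 1}
dense = to_dense_counts(example, n_words=5)
print(dense)
```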

Functions for filtering the words. They turn out to be unnecessary, since the Word2Vec model has no representation for the words that would be excluded.

In [4]:
# raw string: keeps the backslash literal without an invalid-escape warning
special_characters = r".\!@#$%^&*()+?_=,<>/"

def has_numbers(inputString):
    return any(char.isdigit() for char in inputString)

def has_special_chars(inputString):
    return any(c in special_characters for c in inputString)

def process_drop_columns(col):
    # tmp = pd.to_numeric(words_df[col])
    # if len(col) < MIN_LENGTH or len(tmp[tmp>0]) == 0 or has_numbers(col) or has_special_chars(col):
    if has_numbers(col) or has_special_chars(col):
        return col

Load the Word2Vec model.

In [None]:
google_news_word2vec = KeyedVectors.load_word2vec_format(PATH_TO_DATA/'word2vec-google-news-300.gz', binary=True)

Get the .json files.

In [23]:
files = [
    'all_stories.json',
    'word_matrix.json',
    'np_docvs_norm.npz',
    'dists_triu.csv',
    'info_df.csv',
    ]
stories = [
    'world_russia',
    'world_norway',
    'world_capitol_hill',
]
story_to_elaborate = 1
with open(PATH_TO_DATA/stories[story_to_elaborate]/files[0]) as json_file:
    all_stories = json.load(json_file)
with open(PATH_TO_DATA/stories[story_to_elaborate]/files[1]) as json_file:
    stories_words = json.load(json_file)

Get the matrix of word occurrences in the articles, the first step before computing the article matrix.

In [24]:
n_words = len(stories_words['word_list'])

articles_iter = tqdm(
    stories_words['word_matrix'].values(),
    leave=True,
    unit='articles',
)
fn = partial(process_article_word_matrix_json, n_words=n_words)
results = np.array(Parallel(n_jobs=N_THREADS)(delayed(fn)(i) for i in articles_iter))
# results

  0%|          | 0/2599 [00:00<?, ?articles/s]

Filter the loaded model down to the vocabulary that actually appears in the articles.

In [25]:
all_words = [entry[0] for entry in stories_words['word_list']]
google_news_word2vec = google_news_word2vec.vectors_for_all(all_words)

Function for computing the actual article matrix.
The per-article step is parallelized with joblib and uses all CPU cores when `N_THREADS` is -1; the word-vector lookup runs serially.

In [9]:
def create_articles_matrix(articles_words, model):
    # total number of articles
    n_articles = articles_words.shape[0]
    # number of articles containing that word for every word
    art_per_word = np.array([np.sum(articles_words[:,i]>0) for i in range(articles_words.shape[1])])
    # all words idf
    words_idf = np.log(n_articles/art_per_word)
    
    ## get all words vectors
    words_iterator = tqdm(
        all_words,
        leave=True,
        unit='words',
    )
    # look up one word vector (zeros if the word is not in the model)
    def get_word_vector(word):
        try:
            word_vector = model.get_vector(word)
            try:
                assert np.isfinite(word_vector).all()
            except AssertionError:
                print(word_vector)
        except KeyError:
            word_vector = [0]*300
        return np.array(word_vector)
    words_vectors = np.array([get_word_vector(word) for word in words_iterator])

    ## get the articles vectors
    # instantiate the article iterator
    articles_iterator = tqdm(
        articles_words,
        leave=True,
        unit='articles',
    )
    # function to parallelize 
    def get_article_vector(article):
        article_vector = np.zeros((1, 300))
        for i, word_vector in enumerate(words_vectors):
            ## using tf-idf as weight
            # occurences of word in the article
            tf = article[i]
            # if there are some
            if tf > 0:
                # tf-idf of word in article
                weight = tf*words_idf[i]
                # add with weight this word vector to whole article vector
                article_vector = article_vector + word_vector*weight
        return article_vector
    list_of_docvs = Parallel(n_jobs=N_THREADS)(delayed(get_article_vector)(i) for i in articles_iterator)
    
    return np.array(list_of_docvs).squeeze()
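
The weighted sum accumulated word by word in `get_article_vector` can also be expressed as a single matrix product. The sketch below checks that equivalence on synthetic data; the array names and sizes are assumptions made for the example, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: 5 articles, 7 words, 4-dimensional word vectors.
counts = rng.integers(0, 3, size=(5, 7)).astype(float)
counts[0, :] = 1.0  # make sure every word occurs in at least one article
word_vectors = rng.normal(size=(7, 4))

n_articles = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of articles containing each word
idf = np.log(n_articles / doc_freq)

# Loop form, mirroring the accumulation in get_article_vector.
looped = np.zeros((n_articles, word_vectors.shape[1]))
for a in range(n_articles):
    for w in range(counts.shape[1]):
        if counts[a, w] > 0:
            looped[a] += counts[a, w] * idf[w] * word_vectors[w]

# The same weighted sum as one matrix product.
vectorized = (counts * idf) @ word_vectors
print(np.allclose(looped, vectorized))
```

The product form relies on every word occurring in at least one article, so that all idf values are finite; a word with zero document frequency would need to be masked out first.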

Get the actual articles matrix.

In [26]:
articles_matrix = create_articles_matrix(np.array(results), google_news_word2vec)
articles_matrix.shape

  0%|          | 0/188697 [00:00<?, ?words/s]

  0%|          | 0/2599 [00:00<?, ?articles/s]

(2599, 300)

Get the dataframe with the important features of the articles.

In [27]:
articles_info_iter = tqdm(
    all_stories,
    leave=True,
    unit='stories',
)

articles_info_df = pd.concat(Parallel(n_jobs=N_THREADS)(delayed(process_info)(i) for i in articles_info_iter), axis=0).reset_index()
articles_info_df['timestamp'] = pd.to_datetime(articles_info_df.timestamp)
articles_info_df = articles_info_df.sort_values(by='timestamp')
articles_info_df.head()

  0%|          | 0/2611 [00:00<?, ?stories/s]

Unnamed: 0,index,article_id,timestamp,source
52,0,1996434927,2021-07-21 00:00:00,396984
2,0,1996327005,2021-07-21 00:00:00,125334
3,0,1996330114,2021-07-21 00:00:00,69934
58,0,1996453424,2021-07-21 00:00:00,40268
1,0,1996202810,2021-07-21 00:19:04,84097


Check that the articles retrieved in the two .json files are the same.
The two sets of articles (the one from `*_all_stories.json`, and the one from `*_word_matrix.json`) will be filtered to contain the same articles in the same order.

Ids for articles from `*_word_matrix.json`.

In [28]:
articles_ids_word_matrix = np.array([int(a) for a in stories_words['word_matrix'].keys()])
articles_ids_word_matrix.shape

(2599,)

Number of article ids from `*_all_stories.json`.

In [29]:
len(articles_info_df['article_id'])

2611

Number of article ids from `*_word_matrix.json`.

In [30]:
len(articles_ids_word_matrix)

2599

Ids to be removed from the dataframe.

In [31]:
remove_from_df = list(set(articles_info_df['article_id']) - set(articles_ids_word_matrix))
remove_from_df

[2006074272,
 2008001829,
 2005206662,
 2003482725,
 2002532521,
 2019848970,
 2012380297,
 2007440684,
 2006087149,
 2001692657,
 1997648210,
 1999354035,
 2005497172,
 1997714996,
 2015712887,
 2023509146,
 2011491326,
 2021793311]

Ids to be removed from matrix.

In [32]:
remove_from_matrix = list(set(articles_ids_word_matrix) - set(articles_info_df['article_id']))
remove_from_matrix

[2001486402, 2007730531, 2015181549, 1997922062, 2132215122, 2021374618]
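
The two set differences can also be computed directly on arrays, together with boolean masks that select the rows to keep on each side. This is a minimal sketch on made-up ids; `ids_info` and `ids_matrix` are hypothetical names standing in for the two id collections.

```python
import numpy as np

# Made-up ids standing in for the two sources.
ids_info = np.array([10, 11, 12, 13, 14])   # ids from the stories file
ids_matrix = np.array([12, 10, 15, 14])     # ids from the word-matrix file

# Ids present on one side only (returned sorted).
only_in_info = np.setdiff1d(ids_info, ids_matrix)
only_in_matrix = np.setdiff1d(ids_matrix, ids_info)

# Boolean masks selecting the rows shared by both sides, in their original order.
keep_info = np.isin(ids_info, ids_matrix)
keep_matrix = np.isin(ids_matrix, ids_info)

print(only_in_info, only_in_matrix)
print(ids_info[keep_info], ids_matrix[keep_matrix])
```

A mask such as `keep_matrix` can index both the id array and the rows of the article matrix, which keeps the two in step.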

Deletion from dataframe.

In [33]:
articles_info_df = articles_info_df[~articles_info_df['article_id'].isin(remove_from_df)]
len(articles_info_df['article_id'])

2593

Deletion from matrix.

In [34]:
indices = [articles_ids_word_matrix.tolist().index(i) for i in remove_from_matrix]
articles_ids_word_matrix = np.delete(articles_ids_word_matrix, indices, axis=0)
articles_matrix = np.delete(articles_matrix, indices, axis=0)
articles_matrix.shape

(2593, 300)

Reorder the articles in the matrix so that they follow the same order (ascending in time) as the dataframe.

In [35]:
new_indices = [articles_ids_word_matrix.tolist().index(i) for i in articles_info_df['article_id']]
# fancy indexing builds a new array, so no row is overwritten before it is read
new_articles_matrix = articles_matrix[new_indices]
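
For a larger collection, the repeated `list.index` lookups can be replaced by a dictionary from id to row. The sketch below uses made-up ids and values, and all names in it are hypothetical.

```python
import numpy as np

# Made-up data: matrix rows follow `matrix_ids`, the target order is `df_ids`.
matrix_ids = [12, 10, 14]
matrix = np.array([[12.0, 12.5], [10.0, 10.5], [14.0, 14.5]])
df_ids = [10, 12, 14]

# Map each id to its current row, then gather the rows in the target order.
row_of = {article_id: row for row, article_id in enumerate(matrix_ids)}
order = [row_of[article_id] for article_id in df_ids]
reordered = matrix[order]
print(reordered)
```

Gathering with an index list reads every row from the source array, whereas copying row by row inside a single array can overwrite a row before it has been read.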

Normalize and save the article matrix using numpy methods.

In [36]:
# NOTE: the rows normalized here come from `articles_matrix` (word-matrix order),
# not from the time-ordered `new_articles_matrix`
np_docvs_norm = (articles_matrix / np.sqrt((articles_matrix ** 2).sum(-1))[..., np.newaxis]).astype('float')
print("Shape of normalized matrix is {}.".format(np_docvs_norm.shape))
print("Sum of normalized matrix is {}.".format(np.sum(np_docvs_norm)))
print("Max={}; Min={}.".format(np.max(np_docvs_norm), np.min(np_docvs_norm)))
np.savez(PATH_TO_DATA/stories[story_to_elaborate]/files[2], np_docvs_norm)
np_docvs_norm.shape

Shape of normalized matrix is (2593, 300).
Sum of normalized matrix is -3708.213049179245.
Max=0.24483098695949793; Min=-0.21940790958173134.


(2593, 300)
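
The normalization divides each row by its Euclidean norm, so every article vector ends up with unit length. As a sketch of the same operation with a guard for all-zero rows (an edge case assumed here for illustration, using a made-up matrix):

```python
import numpy as np

# Made-up matrix with one all-zero row to show the edge case.
m = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]])

norms = np.linalg.norm(m, axis=1, keepdims=True)
# Divide only where the norm is non-zero; zero rows stay zero instead of NaN.
normalized = np.divide(m, norms, out=np.zeros_like(m), where=norms > 0)

print(normalized)
print(np.linalg.norm(normalized, axis=1))
```

An article whose words are all missing from the model would produce such a zero row, and a plain division would turn it into NaN values.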

Compute and save the upper-triangular cosine similarity matrix (stored in the `dists` variables).

In [37]:
dists = np.dot(np_docvs_norm, np_docvs_norm.T).astype('float')
dists_triu = np.triu(dists, k=1)
np.savetxt(PATH_TO_DATA/stories[story_to_elaborate]/files[3], dists_triu, delimiter=',')
print("Shape of similarity matrix is {}.".format(dists_triu.shape))
print("Sum of similarity matrix is {}.".format(np.sum(dists_triu)))
print("Max={}; Min={}.".format(np.max(dists_triu), np.min(dists_triu)))
dists_triu

Shape of similarity matrix is (2593, 2593).
Sum of similarity matrix is 2402656.978175076.
Max=1.0000000000000007; Min=0.0.


array([[0.        , 0.74970341, 0.8454205 , ..., 0.87679063, 0.80614634,
        0.83195407],
       [0.        , 0.        , 0.76036679, ..., 0.78907547, 0.71995947,
        0.68784944],
       [0.        , 0.        , 0.        , ..., 0.84411785, 0.75879691,
        0.74960482],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.83620062,
        0.86417294],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.87251212],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
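
Because the rows have unit length, their dot products are cosine similarities. When only the pairwise values are needed, they can be read with `np.triu_indices` rather than from the zero-filled square matrix. A minimal sketch on three made-up unit vectors:

```python
import numpy as np

# Three made-up unit vectors.
unit = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

sims = unit @ unit.T                           # cosine similarity for unit-length rows
rows, cols = np.triu_indices(len(unit), k=1)   # index pairs with i < j
pairs = sims[rows, cols]                       # one value per unordered pair

for i, j, s in zip(rows, cols, pairs):
    print(i, j, round(float(s), 3))
```

Statistics taken over the output of `np.triu` also include the zero-filled diagonal and lower triangle, so a minimum of 0.0 there does not by itself describe the actual pairs.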

Save the important features dataframe.

In [38]:
with open(PATH_TO_DATA/stories[story_to_elaborate]/files[4], 'w') as csv_file:
    articles_info_df.to_csv(csv_file)
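
A downstream notebook might read the saved files back as sketched below. The data and the temporary directory are made up for the example; the file names reuse those listed in `files`.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # np.savez stores an array passed positionally under the default key 'arr_0'.
    matrix = np.arange(6, dtype=float).reshape(2, 3)
    np.savez(tmp / 'np_docvs_norm.npz', matrix)
    with np.load(tmp / 'np_docvs_norm.npz') as data:
        keys = list(data.keys())
        loaded = data['arr_0']

    # to_csv writes the index as the first column; index_col=0 restores it.
    info = pd.DataFrame({'article_id': [11, 10], 'source': [7, 8]}, index=[1, 0])
    info.to_csv(tmp / 'info_df.csv')
    restored = pd.read_csv(tmp / 'info_df.csv', index_col=0)

print(keys, loaded.shape)
print(restored)
```

Passing the array to `np.savez` as a keyword argument instead would store it under that keyword's name rather than `arr_0`.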