# Topic modeling of Portuguese-language news

# Problem Definition



We pull 400 Portuguese-language articles from NewsAPI, retrieved by searching for a few distinct keywords (e.g., "congress", "politics", "economy", "sports"), and train a Latent Dirichlet Allocation (LDA) model on them to identify the dominant topic of each article.

The work presented here was started as part of my Capstone Project proposal (Flora, 2021).

# Solution Specification

I decided to use NewsAPI's data because it is easy to obtain and immediately connected to the topic at hand. The code below pulls the data from the API and stores it in a dataframe we can use later to train scikit-learn models.

Fortunately, NewsAPI indexes articles in Portuguese, so I requested news articles matching four sets of keywords: one for politics, one for economic news, one for news about the "president" (in addition to the common names of the three most recent presidents of Brazil), and one for sports.

After pulling 400 articles from the API, I could put their contents (the body of the text alone) into a list and tokenize and lemmatize them with relative ease. The steps below walk through this process.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import (
    NMF, 
    LatentDirichletAllocation,
    TruncatedSVD
)
import requests, gensim, spacy, json, nltk
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV
from nltk.tokenize import wordpunct_tokenize
from nltk.metrics import ConfusionMatrix
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer, WordNetLemmatizer
import matplotlib.pyplot as plt
from functools import reduce
from io import StringIO
from time import time
import pandas as pd
import numpy as np
import re


In [2]:
# !python3 -m spacy download pt_core_news_sm
import pt_core_news_sm
nlp = pt_core_news_sm.load()

# Load data

In [3]:
# Get stopwords for Portuguese and add a prominent missing stopword ("ser", "to be")
# Full list of stopwords printed in appendix
pt_stopwords = nltk.corpus.stopwords.words('portuguese')
en_stopwords = nltk.corpus.stopwords.words('english')
html_elements = ['li', 'ul']
pt_stopwords.append("ser")
nltk_stopwords = pt_stopwords + en_stopwords + html_elements

## NewsAPI

In [4]:
# Define API key and URLs for requests
headers = {'Authorization': '843f966d9f6d465bba288803803ee132'}
everything_news_url = 'https://newsapi.org/v2/everything'
sources_url = 'https://newsapi.org/v2/sources'

In [5]:
# headlines_payload = {'category': 'general', 'country': 'br, pt', "pageSize":100}
everything_politics_payload = {'q': 'congresso, política', 'language': 'pt', 'sortBy': 'popularity',\
                               'from':'2021-03-25', "pageSize":100}
everything_president_payload = {'q': "presidente, Dilma, Temer, Bolsonaro", 'language': 'pt', 'sortBy': 'popularity',\
                               'from':'2021-03-25', "pageSize":100}
everything_economy_payload = {'q': "economia", 'language': 'pt', 'sortBy': 'popularity',\
                             'from':'2021-03-25', "pageSize":100}
everything_sports_payload = {'q': 'esporte', 'language': 'pt', 'sortBy': 'popularity',
                            'from':'2021-03-25', "pageSize":100}
sources_payload= {'category': 'general', 'language': 'pt'}
sports_sources_payload = {'category': 'sports', 'language': 'pt'}

In [6]:
resp_sources = requests.get(url = sources_url,\
                           headers = headers,\
                           params = sources_payload)
resp_sports_sources = requests.get(url = sources_url,\
                           headers = headers,\
                           params = sports_sources_payload)

In [7]:
response_politics = requests.get(url=everything_news_url,\
                                 headers=headers, params=everything_politics_payload)
response_president = requests.get(url=everything_news_url,\
                                  headers=headers, params=everything_president_payload)
response_economy = requests.get(url=everything_news_url,\
                                  headers=headers, params=everything_economy_payload)
response_sports = requests.get(url=everything_news_url,\
                                  headers=headers, params=everything_sports_payload)

In [8]:
# response.json() already returns plain Python dicts, so no JSON round trip is needed
dict_pol = response_politics.json()
dict_pres = response_president.json()
dict_econ = response_economy.json()
dict_sports = response_sports.json()

In [9]:
dict_sources = resp_sources.json()
dict_sports_sources = resp_sports_sources.json()

In [10]:
dicts = [dict_pol, dict_pres, dict_econ, dict_sports]

for d in dicts:
    for article in d['articles']:
        article['source_name'] = article['source']['name']
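
The cells above assume that every request succeeded. As a minimal sketch of a more defensive version, the hypothetical helper `extract_articles` below checks the `status` field and flattens the nested source name in one pass. The sample payload is made up and only mirrors the response shape printed in the appendix; the `message` field on error payloads is an assumption about the API.

```python
def extract_articles(payload):
    """Return article dicts from a NewsAPI-style payload, adding 'source_name'.

    Hypothetical helper for illustration; not part of NewsAPI or requests.
    """
    if payload.get("status") != "ok":
        # Assumption: error payloads carry a human-readable 'message' field
        raise ValueError(f"Request failed: {payload.get('message', 'unknown error')}")
    articles = []
    for article in payload.get("articles", []):
        flat = dict(article)  # shallow copy, so the original payload is untouched
        flat["source_name"] = (article.get("source") or {}).get("name")
        articles.append(flat)
    return articles


# Made-up payload that mirrors the structure shown in the appendix
sample_payload = {
    "status": "ok",
    "totalResults": 1,
    "articles": [
        {"source": {"id": None, "name": "Example Source"},
         "title": "Example title",
         "content": "Example content"},
    ],
}

sample_articles = extract_articles(sample_payload)
print(sample_articles[0]["source_name"])
```

Failing early on a bad response avoids a less informative `KeyError` on `d['articles']` later in the pipeline.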

In [11]:
pol_articles = dict_pol['articles']
pres_articles = dict_pres['articles']
econ_articles = dict_econ['articles']
sports_articles = dict_sports['articles']

In [12]:
df_pol = pd.read_json(StringIO(json.dumps(pol_articles)))
df_pres = pd.read_json(StringIO(json.dumps(pres_articles)))
df_econ = pd.read_json(StringIO(json.dumps(econ_articles)))
df_sports = pd.read_json(StringIO(json.dumps(sports_articles)))

In [13]:
df_pol.columns.values.tolist()

['source',
 'author',
 'title',
 'description',
 'url',
 'urlToImage',
 'publishedAt',
 'content',
 'source_name']

In [14]:
df_pol['category'] = 0
df_pres['category'] = 1
df_econ['category'] = 2
df_sports['category'] = 3

CAT_TO_VAL = {'politics':0,'president':1,'economy':2,'sports':3}

In [15]:
# Have an idea of the content in the data
df_pres['content']

0     O Alto Comando das forças armadas\r\n manda um...
1     Foto: André Coelho/Bloomberg via Getty Images\...
2     Seis substituições de ministros na maior remod...
3     <ul><li>Mariana Alvim</li><li>Da BBC News Bras...
4     <ul><li>Thais Carrança</li><li>Da BBC News Bra...
                            ...                        
95    O presidente do Brasil, Jair Bolsonaro, discur...
96    O procurador-geral da República Augusto Aras e...
97    RIO - Uma recusa oficial a rezar uma missa pel...
98    Um dia após ter conseguido na Justiça a amplia...
99    BRASÍLIA - O presidente Jair Bolsonaro poderá ...
Name: content, Length: 100, dtype: object

In [16]:
dfs = [df_pol, df_econ, df_pres, df_sports]

df = pd.concat(dfs, axis=0)
df = df.sample(frac=1).reset_index(drop=True)
df

Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content,source_name,category
0,"{'id': None, 'name': 'B9.com.br'}",Pedro Strazza,Criaram uma planilha para acompanhar a gincana...,Explodiu nas redes sociais no fim da última qu...,https://www.b9.com.br/141227/criaram-uma-plani...,https://assets.b9.com.br/wp-content/uploads/20...,2021-03-26T14:14:50Z,Explodiu nas redes sociais no fim da última qu...,B9.com.br,2
1,"{'id': None, 'name': 'Tecmundo.com.br'}",Nilton Kleina,AppLock: bloqueie aplicativos no seu celular A...,"Em certas ocasiões, o padrão de segurança conv...",https://www.tecmundo.com.br/software/214261-ap...,https://img.ibxk.com.br/2021/03/21/21182129600...,2021-03-27T11:00:01Z,"Em certas ocasiões, o padrão de segurança conv...",Tecmundo.com.br,2
2,"{'id': None, 'name': 'Tecmundo.com.br'}",Nilton Kleina,EUA prometem cortar 50% das emissões de carbon...,"O presidente dos Estados Unidos, Joe Biden, of...",https://www.tecmundo.com.br/ciencia/215998-eua...,https://img.ibxk.com.br/2021/04/22/22091604149...,2021-04-22T13:00:02Z,"O presidente dos Estados Unidos, Joe Biden, of...",Tecmundo.com.br,2
3,"{'id': None, 'name': 'Catracalivre.com.br'}",Redação,Professor improvisa aula em caminhão e leva ed...,Em meios aos desafios provocados pela pandemia...,https://catracalivre.com.br/cidadania/professo...,https://catracalivre.com.br/wp-content/uploads...,2021-03-29T15:34:28Z,Últimas notícias:\r\nUtilizamos cookies essenc...,Catracalivre.com.br,0
4,"{'id': None, 'name': 'Tecmundo.com.br'}",Thiago Simões,NEOGEO POCKET COLOR SELECTION Vol.1 peca pela ...,O mercado de videogames resolveu reativar os j...,https://www.tecmundo.com.br/voxel/215351-neoge...,https://img.ibxk.com.br/2021/04/01/01142116271...,2021-04-11T22:00:01Z,O mercado de videogames resolveu reativar os j...,Tecmundo.com.br,3
...,...,...,...,...,...,...,...,...,...,...
395,"{'id': None, 'name': 'Tecnoblog.net'}",Victor Hugo Silva,"Receita quer imposto sobre livros, e isso deve...","Para defender cobrança de imposto, Receita Fed...",https://tecnoblog.net/429453/receita-quer-impo...,https://tecnoblog.net/wp-content/uploads/2019/...,2021-04-07T22:16:38Z,A Receita Federal voltou a defender a cobrança...,Tecnoblog.net,2
396,"{'id': None, 'name': 'BBC News'}",https://www.facebook.com/bbcnews,Matt Gaetz: escândalo sexual ameaça um dos pri...,Congressista republicano da Flórida pode ter t...,https://www.bbc.com/portuguese/internacional-5...,https://ichef.bbci.co.uk/news/1024/branded_por...,2021-04-06T11:51:00Z,<ul><li>Anthony Zurcher - @awzurcheron</li><li...,BBC News,0
397,"{'id': None, 'name': 'Uol.com.br'}",UOL,Advogado de Flávio Bolsonaro | CGU vê risco de...,Uma auditoria da CGU (Controladoria Geral da U...,https://noticias.uol.com.br/politica/ultimas-n...,https://conteudo.imguol.com.br/c/noticias/f3/2...,2021-04-10T21:52:03Z,Uma auditoria da CGU (Controladoria Geral da U...,Uol.com.br,1
398,"{'id': None, 'name': 'Terra.com.br'}",Gazeta Esportiva,Felipe Conceição comemora evolução do Cruzeiro...,Depois da vitória por 2 a 0 contra o Coimbra n...,https://www.terra.com.br/esportes/cruzeiro/fel...,https://p2.trrsf.com/image/fget/cf/1200/628/mi...,2021-04-07T23:17:35Z,Depois da vitória por 2 a 0 contra o Coimbra n...,Terra.com.br,3


In [17]:
X = df['content']
# Not immediately useful for unsupervised learning, but for planned
# supervised-learning extensions
y = df['category']

In [18]:
df.to_csv('news_api.csv')

In [19]:
# Collapse runs of whitespace (including \r\n) into single spaces and remove single quotes
data = X.values.tolist()
data = [re.sub(r'\s+', ' ', t) for t in data]
data = [re.sub(r"\'", "", t) for t in data]

In [20]:
# First 3 data points
data[:3]

['Explodiu nas redes sociais no fim da última quinta-feira (25) uma gincana de calouros da Fundação Getúlio Vargas (FGV) que envolveu todo tipo de celebridade após o que aparentemente foi um erro geral… [+2925 chars]',
 'Em certas ocasiões, o padrão de segurança convencional do seu celular Android pode não ser o suficiente e o usuário pode querer uma camada adicional de proteção que vai além de senha, PIN, biometria … [+3355 chars]',
 'O presidente dos Estados Unidos, Joe Biden, oficializou nesta quinta-feira (22) uma nova e ousada meta do país para reduzir os impactos das mudanças climáticas. Em comunicado, Biden confirmou que o p… [+992 chars]']

## Data preparation steps

The code below is heavily based on the work of Chen (2018). 

We first clean the data by collapsing runs of whitespace (including escaped line breaks) and removing single quotes from the articles. Stopwords are passed as arguments to the models later, but removing them could also be part of this step.

After cleaning, we tokenize the data, as seen below: each article is broken into smaller pieces, in this case lowercase words with punctuation and accents stripped (Chen, 2018).
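
NewsAPI truncates the `content` field and appends a marker such as `[+2925 chars]`, and some articles carry HTML tags such as `<ul><li>`; both are visible in the samples printed above. Left in place, the marker contributes the token `chars` to every truncated article. The sketch below shows one way the cleaning step could strip both. The helper `clean_content` and its regular expressions are assumptions derived from those samples, not part of the original pipeline.

```python
import re

# Patterns are assumptions based on the samples printed earlier in this notebook
TRUNCATION_MARKER = re.compile(r"…?\s*\[\+\d+ chars\]\s*$")
HTML_TAG = re.compile(r"<[^>]+>")


def clean_content(text):
    """Strip the trailing truncation marker and HTML tags, then collapse whitespace."""
    text = TRUNCATION_MARKER.sub("", text)
    text = HTML_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# Example string assembled from fragments of the samples above
example = "<ul><li>Da BBC News Brasil</li></ul> Biden confirmou que o p… [+992 chars]"
cleaned = clean_content(example)
print(cleaned)
```

The cut-off word fragment at the end of the text remains; dropping it would be a further design choice.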

In [21]:
def text_to_words(article):
    for sentence in article:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

data = list(text_to_words(data))
print(data[:1])
words = data

[['explodiu', 'nas', 'redes', 'sociais', 'no', 'fim', 'da', 'ultima', 'quinta', 'feira', 'uma', 'gincana', 'de', 'calouros', 'da', 'fundacao', 'getulio', 'vargas', 'fgv', 'que', 'envolveu', 'todo', 'tipo', 'de', 'celebridade', 'apos', 'que', 'aparentemente', 'foi', 'um', 'erro', 'geral', 'chars']]


We proceed by lemmatizing the data, that is, reducing each inflected word to its dictionary form (lemma). This removes redundancy from the data by merging words that share a meaning and differ only in inflection, and it requires language-specific resources. We use the `spacy` package to do so, since it has good support for lemmatization in Portuguese, while NLTK and other packages do not.

In [22]:
def lemmatization(data, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each tokenized article, keeping only tokens with an allowed POS tag."""
    data_lemmatized = []
    for sent in data:
        doc = nlp(" ".join(sent))
        # Replace the '-PRON-' placeholder lemma with an empty string
        lemmas = [token.lemma_ if token.lemma_ not in ['-PRON-'] else ''
                  for token in doc if token.pos_ in allowed_postags]
        data_lemmatized.append(" ".join(lemmas))
    return data_lemmatized

In [23]:
# Initialize spacy 'pt' model
nlp = spacy.load("pt_core_news_sm", disable=['parser', 'ner'])

# Perform lemmatization, keeping only nouns and verbs
data_lemmatized = lemmatization(words, allowed_postags=['NOUN', 'VERB'])
print(data_lemmatized[:3])

['explodir rede fim ultimar feirar gincana calouro fundacao envolver tipo celebridade apo errar chars', 'ocasioes padrao seguranca celular android poder poder querer camada protecao alar senha biometria', 'presidente estar unir oficializar feirar ousar meter pai reduzir impacto mudancas comunicar confirmar chars']


### Vectorization

The first use of scikit-learn here is to build a document-word matrix using `CountVectorizer`. We pass the stopword list built earlier, which mixes Portuguese and English stopwords. Only words that appear in at least 10 articles (`min_df=10`) and have at least three characters are kept in the vocabulary.
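
Note that an integer `min_df` is a threshold on document frequency, the number of articles containing a word, and not on the total number of occurrences. The pure-Python sketch below illustrates the difference on a made-up toy corpus; it is not scikit-learn's implementation, and the helper `document_frequency` and the threshold of 2 are chosen only for illustration.

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(set(tokens))  # set(): a document counts each term at most once
    return counts


# Toy corpus, made up for illustration
docs = [
    ["presidente", "presidente", "presidente"],
    ["presidente", "economia"],
    ["futebol", "clube"],
]

df_counts = document_frequency(docs)
min_df = 2  # hypothetical threshold; the notebook uses min_df=10
vocabulary = sorted(term for term, n in df_counts.items() if n >= min_df)
print(df_counts["presidente"], vocabulary)
```

Here "presidente" occurs four times in total but has a document frequency of 2, which is the number that `min_df` is compared against.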

In [24]:
# Use sklearn's CountVectorizer to vectorize the lemmatized data
vectorizer = CountVectorizer(analyzer='word',       
                             min_df=10,
                             stop_words = nltk_stopwords,
                             lowercase=True,
                             token_pattern='[a-zA-Z0-9]{3,}')

data_vectorized = vectorizer.fit_transform(data_lemmatized)


### Train LDA model

We proceed to use the vectorized data to train both a Non-negative Matrix Factorization model and a Latent Dirichlet Allocation model using the news data we obtained. While I will not use the NMF model further, it could be used for an extension of the current analysis, allowing us to extract insights from language data by getting some of the most explanatory features in high-dimensional data.

I performed a grid search over the parameters of the LDA model, varying the number of components (topics) that the model extracts and its learning decay (`learning_decay`).
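
Grid search fits one model per parameter combination and per cross-validation fold. When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for `LatentDirichletAllocation` is an approximate log-likelihood. The sketch below only enumerates the grid used in the search that follows, to make the cost explicit; the fold count is an assumption, since the default depends on the scikit-learn version.

```python
from itertools import product

# Same grid as the one passed to GridSearchCV in this notebook
search_params = {
    "n_components": [10, 15, 20, 25, 30],
    "learning_decay": [0.5, 0.75, 0.95],
}

names = list(search_params)
candidates = [dict(zip(names, values))
              for values in product(*search_params.values())]

n_folds = 5  # assumption: depends on the scikit-learn version and the `cv` argument
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```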

In [25]:
# Train an NMF model for comparison with LDA
nmf_model = NMF(n_components = 20, random_state = 2021,
              alpha=.1, l1_ratio=.5)

nmf_output = nmf_model.fit_transform(data_vectorized)



In [26]:
print("Reconstruction error ", nmf_model.reconstruction_err_)

Reconstruction error  26.409832884164427


In [27]:
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .75, .95]}

lda = LatentDirichletAllocation(max_iter = 10,random_state = 2021)

model = GridSearchCV(lda, param_grid=search_params)

# Perform grid search on the possible LDA models

model.fit(data_vectorized)

GridSearchCV(error_score='raise', estimator=LatentDirichletAllocation(),
             param_grid={'learning_decay': [0.5, 0.7, 0.9],
                         'n_topics': [5, 10, 15, 20]},
             return_train_score='warn')

# Testing and Analysis

The best model found by the grid search reaches a mean cross-validated log-likelihood of about -2138, much better than a default model (log-likelihood around -8000). It uses 10 components, so we can identify the 10 most prominent topics and the top words associated with each of them.

In [28]:
# Best model
best_lda_model = model.best_estimator_
print("Best model parameters: ", model.best_params_)
print("Best log-likelihood score: ", model.best_score_)

Best model parameters:  {'learning_decay': 0.5, 'n_components': 10}
Best log-likelihood score:  -2138.068831754471


In [29]:
lda_output = best_lda_model.transform(data_vectorized)

topic_names = ["Topic " + str(i) for i in range(best_lda_model.n_components)]
article_names = ["Article " + str(i) for i in range(len(data))]

# Put matrix in a dataframe with the appropriate labels
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topic_names, index = article_names)

# Extract the dominant topic for each document and put it in a column
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic["dominant_topic"] = dominant_topic
df_document_topic

Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9,dominant_topic
Article 0,0.02,0.02,0.02,0.54,0.02,0.02,0.02,0.02,0.02,0.30,3
Article 1,0.42,0.02,0.02,0.22,0.02,0.02,0.02,0.22,0.02,0.02,0
Article 2,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.85,0.02,8
Article 3,0.01,0.01,0.01,0.01,0.91,0.01,0.01,0.01,0.01,0.01,4
Article 4,0.37,0.37,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0
...,...,...,...,...,...,...,...,...,...,...,...
Article 395,0.52,0.03,0.03,0.03,0.27,0.03,0.03,0.03,0.03,0.03,0
Article 396,0.01,0.01,0.01,0.01,0.01,0.91,0.01,0.01,0.01,0.01,5
Article 397,0.52,0.03,0.03,0.27,0.03,0.03,0.03,0.03,0.03,0.03,0
Article 398,0.03,0.03,0.37,0.03,0.03,0.03,0.37,0.03,0.03,0.03,2


The table above gives the probability that each article in our dataset belongs to each of the 10 topics. The topic with the highest probability for an article is its dominant topic. We can also extract the words most strongly associated with each topic from the model, as shown below.
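
One detail worth noting: the dominant topic is computed on probabilities that were already rounded to two decimals, so ties can appear, and `np.argmax` resolves a tie in favor of the lowest topic index. The pure-Python sketch below reproduces that behavior on two rows copied from the table above; the helper `dominant_topic` is hypothetical and only mimics the tie-breaking rule.

```python
def dominant_topic(probabilities):
    """Index of the largest value; ties go to the lowest index, like np.argmax."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Rows copied from the document-topic table above
article_0 = [0.02, 0.02, 0.02, 0.54, 0.02, 0.02, 0.02, 0.02, 0.02, 0.30]
article_4 = [0.37, 0.37, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03]

print(dominant_topic(article_0))  # 3
print(dominant_topic(article_4))  # 0, although Topic 1 is equally likely after rounding
```

Taking the argmax of the unrounded model output instead would make such ties much less likely.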

In [30]:
df_topic_keywords = pd.DataFrame(best_lda_model.components_)

df_topic_keywords.columns = vectorizer.get_feature_names()
df_topic_keywords.index = topic_names
df_topic_keywords.head()

| | acordar | agenciar | ano | anunciar | apo | apresentar | armar | aumentar | bbc | brasil | ... | superliga | supremo | tecnologia | temer | ter | terca | tres | utilizar | voce | voltar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Topic 0 | 0.100007 | 2.099941 | 36.922568 | 0.100004 | 0.100013 | 0.100004 | 0.1 | 0.100002 | 0.1 | 0.100006 | ... | 0.100001 | 0.100001 | 0.100005 | 0.100003 | 0.100012 | 0.100002 | 4.04666 | 0.100007 | 0.100006 | 11.099996 |
| Topic 1 | 0.1 | 2.10008 | 0.100006 | 0.100004 | 1.468048 | 14.099966 | 0.100019 | 0.100002 | 0.100001 | 1.654864 | ... | 0.100004 | 0.100005 | 0.1 | 2.10002 | 0.119536 | 0.100008 | 0.1 | 0.1 | 0.1 | 0.100019 |
| Topic 2 | 0.100007 | 0.100008 | 0.100009 | 0.10001 | 0.100002 | 0.100007 | 0.100001 | 0.1 | 0.100001 | 0.100042 | ... | 0.100039 | 0.1 | 0.100005 | 0.100006 | 0.100009 | 14.09995 | 0.1 | 0.1 | 0.100006 | 2.099935 |
| Topic 3 | 0.100006 | 0.100004 | 0.100006 | 34.099957 | 20.053013 | 0.100001 | 0.100001 | 3.100004 | 0.1 | 0.100008 | ... | 14.099938 | 0.100002 | 0.1 | 0.1 | 0.100009 | 0.100002 | 0.100004 | 0.100004 | 0.100004 | 0.100011 |
| Topic 4 | 24.099949 | 0.1 | 0.100007 | 0.100007 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 3.742706 | ... | 0.100001 | 0.100002 | 13.099983 | 0.1 | 0.100002 | 0.100002 | 0.1 | 13.100127 | 11.784818 | 0.100002 |


Finally, we can see the top words associated with each topic by sorting the words according to their weight in that topic. This also gives us, the researchers, a good idea of how to label each topic, should that be necessary.

In [31]:
def show_topics(vectorizer=vectorizer, lda_model = best_lda_model, n_words = 10):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model = best_lda_model)

df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

| | Word 0 | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Topic 0 | chars | ano | poder | empresar | sera | voltar | criar | dar | publicar | feirar |
| Topic 1 | marcar | apresentar | portugal | publicar | governar | presidente | economia | feirar | agenciar | temer |
| Topic 2 | futebol | mundo | feirar | partir | chars | esporte | terca | semana | clube | sera |
| Topic 3 | anunciar | chars | apo | presidente | clube | superliga | exercitar | querer | feirar | fazer |
| Topic 4 | acordar | chars | utilizar | tecnologia | continuar | politicar | voce | concordar | condicoes | passar |
| Topic 5 | ler | foto | creditar | legendar | bbc | paulo | chars | brasilia | agenciar | presidente |
| Topic 6 | ministrar | dizer | chars | silvar | forcar | armar | defeso | fernando | dia | tres |
| Topic 7 | presidente | chars | jair | governar | supremo | lei | stf | seguranca | feirar | dia |
| Topic 8 | chars | presidente | dar | pai | estar | brasil | iniciar | pedir | estao | rousseff |
| Topic 9 | chars | ter | pandemia | governar | crise | covid | chegar | economia | semana | partir |


From the table above, we can see that there are well-defined topics around:
- The Brazilian president's handling of the COVID-19 pandemic.
- The budget that is going through Congress in Brazil at the moment.
- The European football Super League, which was canceled earlier this week.

These results show how we can use LDA models to extract topics and gain insight into what receives the most coverage in newspapers. A planned extension to this work will use the domain from which each news article comes as a source of information, as I try to grapple with source bias in my analysis.

# References

Chen, Y. (2018). How to generate an LDA topic model for text analysis. Retrieved from https://medium.com/@yanlinc/how-to-build-a-lda-topic-model-using-from-text-601cdcbfd3a6

Flora, T. (2021). Exploratory analysis of sentiment and bias in Portuguese-language news. Retrieved from https://gist.github.com/TiagoFlora/f532bd2aeaa35fe4a6ef82106d352bf2

Grisel, O., Buitinck, L., & Yau, C. K. (n.d.). Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation. Retrieved from https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html

# Appendix
## Stopwords

In [32]:
print(nltk_stopwords)

['de', 'a', 'o', 'que', 'e', 'é', 'do', 'da', 'em', 'um', 'para', 'com', 'não', 'uma', 'os', 'no', 'se', 'na', 'por', 'mais', 'as', 'dos', 'como', 'mas', 'ao', 'ele', 'das', 'à', 'seu', 'sua', 'ou', 'quando', 'muito', 'nos', 'já', 'eu', 'também', 'só', 'pelo', 'pela', 'até', 'isso', 'ela', 'entre', 'depois', 'sem', 'mesmo', 'aos', 'seus', 'quem', 'nas', 'me', 'esse', 'eles', 'você', 'essa', 'num', 'nem', 'suas', 'meu', 'às', 'minha', 'numa', 'pelos', 'elas', 'qual', 'nós', 'lhe', 'deles', 'essas', 'esses', 'pelas', 'este', 'dele', 'tu', 'te', 'vocês', 'vos', 'lhes', 'meus', 'minhas', 'teu', 'tua', 'teus', 'tuas', 'nosso', 'nossa', 'nossos', 'nossas', 'dela', 'delas', 'esta', 'estes', 'estas', 'aquele', 'aquela', 'aqueles', 'aquelas', 'isto', 'aquilo', 'estou', 'está', 'estamos', 'estão', 'estive', 'esteve', 'estivemos', 'estiveram', 'estava', 'estávamos', 'estavam', 'estivera', 'estivéramos', 'esteja', 'estejamos', 'estejam', 'estivesse', 'estivéssemos', 'estivessem', 'estiver', 'estiv

## Example of data object used

In [33]:
pretty_json_output = json.dumps(response_politics.json(), indent=4)
print(pretty_json_output)

{
    "status": "ok",
    "totalResults": 1410,
    "articles": [
        {
            "source": {
                "id": null,
                "name": "The Intercept"
            },
            "author": "Lucas Rezende",
            "title": "Militares ficar\u00e3o abra\u00e7ados a Bolsonaro at\u00e9 o fim do governo",
            "description": "Mesmo que surjam novos atritos, generais est\u00e3o de cabe\u00e7a na miss\u00e3o. Conhe\u00e7a 4 cen\u00e1rios para o futuro \u2013 nenhum \u00e9 bom para a democracia.\nThe post Militares ficar\u00e3o abra\u00e7ados a Bolsonaro at\u00e9 o fim do governo appeared first on The Intercept.",
            "url": "https://theintercept.com/2021/04/13/militares-ficarao-abracados-a-bolsonaro-ate-o-fim-do-governo/",
            "urlToImage": "https://theintercept.com/wp-uploads/sites/1/2021/04/vozes-milbols.jpg",
            "publishedAt": "2021-04-13T17:00:55Z",
            "content": "Foto: Andr\u00e9 Coelho/Bloomberg via Getty Images\r\nA demiss\u0

In [34]:
dict_pres['articles'][:3]

[{'source': {'id': None, 'name': 'The Intercept'},
  'author': 'Rafael Moro Martins, Leandro Demori',
  'title': 'Imprensa dá voz à farsa de que generais se descolaram de Bolsonaro, mas militares seguem afundados no governo',
  'description': 'Forças Armadas tentam saída à francesa da festa de terror que ajudaram a criar. Digitais dos quartéis estão nos mais de 317 mil mortos da pandemia.\nThe post Imprensa dá voz à farsa de que generais se descolaram de Bolsonaro, mas militares seguem afundados no …',
  'url': 'https://theintercept.com/2021/03/30/imprensa-farsa-militares-governo-bolsonaro/',
  'urlToImage': 'https://theintercept.imgix.net/wp-uploads/sites/1/2021/03/destt-1.jpg?auto=compress%2Cformat&q=90&fit=crop&w=1200&h=800',
  'publishedAt': '2021-03-30T22:25:20Z',
  'content': 'O Alto Comando das forças armadas\r\n manda um recado a Bolsonaro: não cederá ao golpismo e nem irá politizar os quartéis, avisa a imprensa brasileira num uníssono narrativo que não se ouvia desde os á… [+1

In [35]:
pretty_sources = json.dumps(resp_sources.json(), indent=4)
print(pretty_sources)

{
    "status": "ok",
    "sources": [
        {
            "id": "blasting-news-br",
            "name": "Blasting News (BR)",
            "description": "Descubra a se\u00e7\u00e3o brasileira da Blasting News, a primeira revista feita pelo  p\u00fablico, com not\u00edcias globais e v\u00eddeos independentes. Junte-se a n\u00f3s e torne- se um rep\u00f3rter.",
            "url": "https://br.blastingnews.com",
            "category": "general",
            "language": "pt",
            "country": "br"
        },
        {
            "id": "globo",
            "name": "Globo",
            "description": "S\u00f3 na globo.com voc\u00ea encontra tudo sobre o conte\u00fado e marcas do Grupo Globo. O melhor acervo de v\u00eddeos online sobre entretenimento, esportes e jornalismo do Brasil.",
            "url": "http://www.globo.com/",
            "category": "general",
            "language": "pt",
            "country": "br"
        },
        {
            "id": "google-news-br",
      