# Topic modeling of Portuguese-language news

# Problem Definition



We use news articles available through the Newscatcher API that the service associates with a few distinct keywords (e.g. "congress, politics, economy, sports"), and train a Latent Dirichlet Allocation (LDA) model on the 2,434 pulled articles, trying to identify the dominant topic in each of them.

The work presented here was started as part of my Capstone Project proposal (Flora, 2021).

# Solution Specification

I decided to use Newscatcher's data because it is easy to obtain and immediately connected to the topic at hand. The articles were pulled from the API and stored in a CSV file, which the code below loads into a dataframe we can use later to train scikit-learn models.

With the articles pulled from the API, I could put their contents (the body of the text alone) into a list, then tokenize and lemmatize them with relative ease. The steps below walk through that process.
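As a rough sketch of the pulling step, the snippet below builds a search request and flattens a decoded response into rows that could feed a dataframe. The endpoint URL, the `x-api-key` header, the `page` and `page_size` parameters, and the `articles` key of the response are assumptions about the Newscatcher API and should be checked against its documentation; `build_search_request` and `extract_rows` are hypothetical helper names. No request is actually sent here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint; verify against the provider's documentation
SEARCH_URL = "https://api.newscatcherapi.com/v2/search"


def build_search_request(api_key, query, lang="pt", page=1, page_size=100):
    """Build (but do not send) a search request for one page of results."""
    params = {"q": query, "lang": lang, "page": page, "page_size": page_size}
    # Assumed authentication header name
    return Request(f"{SEARCH_URL}?{urlencode(params)}", headers={"x-api-key": api_key})


def extract_rows(payload, fields=("title", "published_date", "clean_url", "summary")):
    """Keep only the selected fields of each article in a decoded JSON payload."""
    return [{f: art.get(f) for f in fields} for art in payload.get("articles", [])]


request = build_search_request("YOUR_API_KEY", "Economia OR Politica OR Crime")
print(request.full_url)

# Made-up payload standing in for a real response
sample_payload = json.loads('{"articles": [{"title": "Exemplo", "clean_url": "exame.com"}]}')
rows = extract_rows(sample_payload)
print(rows)
```

The list of dictionaries returned by `extract_rows` can be passed directly to `pd.DataFrame`.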

In [None]:
#### Choose to predetermine a number of topics and API-specified topic ####
n_topics = None
select_coverage = None
technique = "LDA"
dt_extension = '10-Dec-2021-17:27'

In [None]:
!python3 -m spacy download pt_core_news_lg
!pip install gensim
!pip install wordcloud
!python -m nltk.downloader stopwords
!pip install pyldavis
import pt_core_news_lg
nlp = pt_core_news_lg.load()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import (
    NMF, 
    LatentDirichletAllocation,
    TruncatedSVD
)
import requests, gensim, spacy, json, nltk
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# warnings.filterwarnings('ignore')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import GridSearchCV
from nltk.tokenize import wordpunct_tokenize
from wordcloud import WordCloud, STOPWORDS
from nltk.metrics import ConfusionMatrix
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
from functools import reduce
from io import StringIO
from time import time
import pandas as pd
import numpy as np
import re
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# Load data

In [None]:
# Get stopwords for Portuguese and add a prominent missing stopword ("ser", "to be")
# Full list of stopwords printed in appendix
pt_stopwords = nltk.corpus.stopwords.words('portuguese')
en_stopwords = nltk.corpus.stopwords.words('english')
html_elements = ['li', 'ul']
pt_stopwords.append("ser")
nltk_stopwords = pt_stopwords + en_stopwords + html_elements


## Newscatcher dataset

In [None]:
df = pd.read_csv('text-added-news.csv', index_col=0)
df.head()


Unnamed: 0,title,author,published_date,published_date_precision,link,clean_url,excerpt,summary,rights,rank,...,country,language,authors,media,is_opinion,twitter_account,_score,_id,used_params,text
0,Violência em mercados ilegais: a importância d...,Da Redação,2021-12-07 17:00:50,full,https://exame.com/blog/impacto-social/violenci...,exame.com,Delegar aos agentes de mercado a função de vig...,Leila Pereira* e Rafael Pucci**\nA extração il...,exame.com,8779,...,unknown,pt,['Da Redação'],https://exame.com/wp-content/uploads/2017/09/o...,False,@exame,17.560171,c5eee2892be8d9b5ef4caa314ae90a43,"{'q': 'Economia OR Politica OR Crime', 'lang':...",Leila Pereira* e Rafael Pucci**\n\nA extração ...
3,Procuradoria arquiva investigações sobre empre...,Aguirre Talento,2021-12-01 13:25:43,full,https://oglobo.globo.com/economia/procuradoria...,globo.com,PGR apontou que não há indícios de crimes por ...,BRASÍLIA — A Procuradoria-Geral da República (...,globo.com,784,...,PT,pt,['Aguirre Talento'],https://ogimg.infoglobo.com.br/in/25293709-04c...,False,@JornalOGlobo,15.093987,bc6f77336a925f55871888ed0e152f00,"{'q': 'Economia OR Politica OR Crime', 'lang':...",BRASÍLIA — A Procuradoria-Geral da República (...
4,PGR arquiva apuração sobre Guedes e Campos Net...,Reuters,2021-12-01 22:13:06,full,https://esportes.yahoo.com/noticias/pgr-arquiv...,yahoo.com,BRASÍLIA (Reuters) - A Procuradoria-Geral da R...,BRASÍLIA (Reuters) - A Procuradoria-Geral da R...,yahoo.com,30,...,US,pt,['Reuters'],https://s.yimg.com/uu/api/res/1.2/v6j5VOJOBuD3...,False,@YahooBr,14.962729,0f35ddf111ffde9aede8ea4a40426047,"{'q': 'Economia OR Politica OR Crime', 'lang':...",BRASÍLIA (Reuters) - A Procuradoria-Geral da R...
5,PGR arquiva apuração sobre Guedes e Campos Net...,01/12/2021 19h13,2021-12-01 22:13:06,full,https://noticias.uol.com.br/ultimas-noticias/r...,uol.com.br,BRASÍLIA (Reuters) - A Procuradoria-Geral da R...,BRASÍLIA (Reuters) - A Procuradoria-Geral da R...,uol.com.br,657,...,BR,pt,[],https://conteudo.imguol.com.br/c/_layout/v3/lo...,False,UOLNoticias @UOL,14.953114,67874b1834bab4b87f4fbd61c7466ef6,"{'q': 'Economia OR Politica OR Crime', 'lang':...",BRASÍLIA (Reuters) - A Procuradoria-Geral da R...
6,Cerca de 20% da população do Rio compra produt...,Lucas Vettorazzo,2021-12-03 21:30:38,full,https://veja.abril.com.br/blog/radar/cerca-de-...,abril.com.br,Levantamento da Fecomércio-RJ mostrou que gast...,"Pesquisa da Fecomércio-RJ mostrou que 20,5% da...",abril.com.br,1997,...,BR,pt,['Lucas Vettorazzo'],https://veja.abril.com.br/wp-content/uploads/2...,False,@VEJA,14.827173,faf879f5a6f088ea8d29f0e8983af576,"{'q': 'Economia OR Politica OR Crime', 'lang':...","Notas exclusivas sobre política, negócios e en..."


In [None]:
X = df['text'].astype(str)
# Not immediately useful for unsupervised learning, but for planned
# supervised-learning extensions
y = df['topic']


In [None]:
# Collapse runs of whitespace (including newlines) into one space and remove single quotes
data = X.values.tolist()
data = [re.sub(r'\s+', ' ', t) for t in data]
data = [re.sub(r"\'", "", t) for t in data]


In [None]:
# First 3 data points
data[:3]


['Leila Pereira* e Rafael Pucci** A extração ilegal de recursos naturais é um problema que aflige diversos países em desenvolvimento, como Colômbia, Brasil e República Democrática do Congo. Em parte, a existência de ilegalidade nesse contexto advém da falta de capacidade governamental para vigiar e punir efetivamente quem opera fora da lei. No Brasil, esse problema se manifesta de forma particularmente evidente no caso da mineração de ouro ilícita na Amazônia, pois é custoso, para o governo, o monitoramento de milhares de garimpeiros espalhados pela imensa floresta. Governos, todavia, têm outras ferramentas à sua disposição para coibir mercados ilegais. Uma alternativa bastante utilizada é o que chamamos, em nosso estudo, de monitoramento privado. Neste caso, autoridades delegam aos próprios agentes do mercado a tarefa de vigiar e denunciar potenciais atividades ilícitas. Este é o caso, por exemplo, de se atribuir aos compradores de ouro bruto a responsabilidade por verificar a legalid

## Data preparation steps

The code below is heavily based on the work of Chen (2018). 

We first clean the data by collapsing whitespace (including newlines) and removing single quotes from the articles. Stopwords are passed as an argument to the vectorizer later, but removing them could also be a part of this step.
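As a minimal sketch of this cleaning step, the snippet below applies the same two regular expressions to a made-up string. The sample text and the helper name `clean_text` are assumptions for illustration and are not taken from the dataset.

```python
import re


def clean_text(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r"\s+", " ", text)
    # Drop single quotes
    return re.sub(r"'", "", text)


sample = "BRASÍLIA (Reuters)\n\nA 'PGR' arquivou\ta apuração."
cleaned = clean_text(sample)
print(cleaned)
```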

After cleaning, we tokenize the data, as seen below: each article is broken into smaller pieces, in this case lowercase words with accents stripped (Chen, 2018).

In [None]:
def text_to_words(articles):
    # Lowercase, tokenize, and strip accents (deacc=True) from each article
    for article in articles:
        yield gensim.utils.simple_preprocess(str(article), deacc=True)

data = list(text_to_words(data))
print(data[:1])
words = data

[['leila', 'pereira', 'rafael', 'pucci', 'extracao', 'ilegal', 'de', 'recursos', 'naturais', 'um', 'problema', 'que', 'aflige', 'diversos', 'paises', 'em', 'desenvolvimento', 'como', 'colombia', 'brasil', 'republica', 'democratica', 'do', 'congo', 'em', 'parte', 'existencia', 'de', 'ilegalidade', 'nesse', 'contexto', 'advem', 'da', 'falta', 'de', 'capacidade', 'governamental', 'para', 'vigiar', 'punir', 'efetivamente', 'quem', 'opera', 'fora', 'da', 'lei', 'no', 'brasil', 'esse', 'problema', 'se', 'manifesta', 'de', 'forma', 'particularmente', 'evidente', 'no', 'caso', 'da', 'mineracao', 'de', 'ouro', 'ilicita', 'na', 'amazonia', 'pois', 'custoso', 'para', 'governo', 'monitoramento', 'de', 'milhares', 'de', 'garimpeiros', 'espalhados', 'pela', 'imensa', 'floresta', 'governos', 'todavia', 'tem', 'outras', 'ferramentas', 'sua', 'disposicao', 'para', 'coibir', 'mercados', 'ilegais', 'uma', 'alternativa', 'bastante', 'utilizada', 'que', 'chamamos', 'em', 'nosso
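To make the effect of this step concrete, here is a library-free approximation of what `simple_preprocess(..., deacc=True)` does to a sentence. It is a sketch under the assumption that gensim's default token length limits (2 to 15 characters) apply; the helper name `simple_tokens` is hypothetical and the regular expression is a simplification of gensim's own tokenizer.

```python
import re
import unicodedata


def simple_tokens(text, min_len=2, max_len=15):
    # Decompose accented characters and drop the combining marks (ç -> c, ã -> a)
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and keep runs of letters only
    tokens = re.findall(r"[^\W\d_]+", stripped.lower())
    # Discard very short and very long tokens
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = simple_tokens("A extração ilegal de recursos naturais é um problema.")
print(tokens)
```

Single-letter words such as "A" and "é" are dropped by the length filter, which is why they are absent from the tokenized output above.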

We proceed by lemmatizing the data, that is, reducing each inflected word to its dictionary form (its lemma). This process requires language-specific resources and removes redundancy from our data by merging words that share a meaning but differ in spelling. We use the `spacy` package to do so, since it has good support for lemmatization in Portuguese, while NLTK and other packages do not.

In [None]:
def lemmatization(data, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    # Join each tokenized article back into text, run spaCy on it, and keep
    # the lemmas of tokens whose part-of-speech tag is in allowed_postags
    data_lemmatized = []
    for sent in data:
        doc = nlp(" ".join(sent))
        data_lemmatized.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return data_lemmatized


In [None]:
# Initialize spacy 'pt_core_news_lg' model
nlp = spacy.load("pt_core_news_lg", disable=['parser', 'ner'])

# Perform lemmatization keeping only nouns and verbs
data_lemmatized = lemmatization(words, allowed_postags=['NOUN', 'VERB'])
print(data_lemmatized[:3])

['pucci extracao recurso problema afligir paises desenvolvimento congo partir existencia ilegalidade contexto falto capacidade vigiar punir operar lei problema manifesto formar casar mineracao ourar ilicita governar milhar garimpeiro espalhar florestar governo ter ferramenta disposicao coibir mercar alternativo utilizar chamar estudar monitoramento casar autoridade delegar proprios agente mercar tarefar vigiar denunciar atividades ilicitas casar exemplo atribuir comprador ourar verificar legalidade metal adquirir idear comprador verificar vendedor possuir permissoes lavrar garimpeiro ourar extraido casar facam verificacao comprador governar executar politicar atividade garimpeiro dificuldade vender ourar reduzir custar monitoramento autoridade fato so precisar garantir comprador fazer partir vez monitorarem locar extracao ourar florestar evidenciar entanto tipo politicar funcionar combater mercar estudar focar alteracao lei praticar inviabilizar monitoramen

### Vectorization

The first use of scikit-learn here is to build a document-word matrix using `TfidfVectorizer`. We pass the stopword list we built before from a mix of Portuguese and English stopwords. Only tokens of at least 3 alphanumeric characters that appear in at least 3 articles (`min_df=3`) are kept in the vocabulary.
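The vocabulary filter can be illustrated without scikit-learn. The sketch below is a simplified stand-in for the vectorizer's vocabulary selection, assuming the same token pattern and a document-frequency threshold; `build_vocabulary` is a hypothetical helper and the four documents are made up.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9]{3,}")


def build_vocabulary(docs, stop_words=(), min_df=3):
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, regardless of repetitions
        terms = set(TOKEN_PATTERN.findall(doc.lower())) - set(stop_words)
        doc_freq.update(terms)
    # Keep terms that appear in at least min_df documents
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = [
    "senado aprovar orcamento",
    "senado votar orcamento ser",
    "governo senado orcamento orcamento",
    "governo economia ir",
]
vocabulary = build_vocabulary(docs, stop_words=["ser"], min_df=3)
print(vocabulary)
```

Note that `orcamento` is kept because it occurs in three documents, not because it occurs four times in total, and `governo` is dropped because it occurs in only two.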

In [None]:
# Use sklearn's TfidfVectorizer to vectorize the lemmatized data
vectorizer = TfidfVectorizer(analyzer='word',       
                             min_df=3,
                             stop_words = nltk_stopwords,
                             lowercase=True,
                             token_pattern='[a-zA-Z0-9]{3,}')

data_vectorized = vectorizer.fit_transform(data_lemmatized)



### Train LDA and NMF model

We proceed to use the vectorized data to train both a Non-negative Matrix Factorization model and a Latent Dirichlet Allocation model using the news data we obtained. While I will not use the NMF model further, it could be used for an extension of the current analysis, allowing us to extract insights from language data by getting some of the most explanatory features in high-dimensional data.

I performed a grid search over the parameters of the LDA model, varying the number of components (topics) that the model extracts and its learning decay (`learning_decay`), the parameter that controls the learning rate in online learning.
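To give a sense of the size of this search, the sketch below enumerates the parameter combinations with the standard library. It assumes the 5-fold split that `GridSearchCV` uses by default when `cv` is not set; each candidate is scored with the estimator's own `score` method, which for LDA is an approximate log-likelihood.

```python
from itertools import product

search_params = {
    "n_components": [10, 15, 20, 25, 30],
    "learning_decay": [0.5, 0.75, 0.95],
}
n_folds = 5  # assumption: GridSearchCV's default number of folds

# One dictionary per combination of parameter values
names = list(search_params)
candidates = [dict(zip(names, values)) for values in product(*search_params.values())]

n_fits = len(candidates) * n_folds  # model fits before the final refit
print(len(candidates), n_fits)
print(candidates[0])
```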

In [None]:
# Train an NMF model for comparison with LDA
nmf_model = NMF(n_components = 20, random_state = 2021,
              alpha=.1, l1_ratio=.5)

nmf_output = nmf_model.fit_transform(data_vectorized)


In [None]:
print("Reconstruction error ", nmf_model.reconstruction_err_)

Reconstruction error  45.635883807515675
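For NMF's default Frobenius loss, the reconstruction error printed above is the Frobenius norm of the difference between the data matrix and the product of the two learned factors. A minimal library-free sketch with toy matrices follows; the values and the helper name `frobenius_error` are made up for illustration.

```python
import math


def frobenius_error(X, W, H):
    """Frobenius norm of X - W @ H for small dense matrices given as lists of rows."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            # Entry (i, j) of the product W @ H
            approx = sum(W[i][k] * H[k][j] for k in range(len(H)))
            total += (x_ij - approx) ** 2
    return math.sqrt(total)


X = [[1.0, 0.0], [0.0, 1.0]]   # toy document-word matrix
W = [[1.0], [0.0]]             # document-topic factor (one topic)
H = [[1.0, 0.0]]               # topic-word factor
error = frobenius_error(X, W, H)
print(error)
```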

In [None]:
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .75, .95]}

lda = LatentDirichletAllocation(max_iter = 10,random_state = 2021)

model = GridSearchCV(lda, param_grid=search_params)

# Perform grid search on the possible LDA models

model.fit(data_vectorized)


GridSearchCV(estimator=LatentDirichletAllocation(random_state=2021),
             param_grid={'learning_decay': [0.5, 0.75, 0.95],
                         'n_components': [10, 15, 20, 25, 30]})

# Testing and Analysis

The best log-likelihood score obtained by the grid search above is about -49,638, reached with 10 components and a learning decay of 0.5 (see the output below). Using the best model, we can identify both the dominant topic of each article and the top words associated with each of the 10 topics.

In [None]:
# Best model
best_lda_model = model.best_estimator_
print("Best model parameters: ", model.best_params_)
print("Best log-likelihood score: ", model.best_score_)

Best model parameters:  {'learning_decay': 0.5, 'n_components': 10}
Best log-likelihood score:  -49638.28890341027

In [None]:
lda_output = best_lda_model.transform(data_vectorized)

topic_names = ["Topic " + str(i) for i in range(best_lda_model.n_components)]
article_names = ["Article " + str(i) for i in range(len(data))]

# Put matrix in a dataframe with the appropriate labels
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topic_names, index = article_names)

# Extract the dominant topic for each document and put it in a column
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic["dominant_topic"] = dominant_topic
df_document_topic


Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9,dominant_topic
Article 0,0.01,0.01,0.01,0.01,0.04,0.01,0.01,0.01,0.87,0.01,8
Article 1,0.01,0.01,0.14,0.01,0.01,0.01,0.01,0.01,0.78,0.01,8
Article 2,0.01,0.01,0.19,0.01,0.01,0.01,0.01,0.01,0.72,0.01,8
Article 3,0.01,0.01,0.19,0.01,0.01,0.01,0.01,0.01,0.72,0.01,8
Article 4,0.01,0.07,0.01,0.01,0.01,0.01,0.01,0.01,0.82,0.01,8
...,...,...,...,...,...,...,...,...,...,...,...
Article 2430,0.01,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.90,0.01,8
Article 2431,0.01,0.01,0.09,0.01,0.01,0.01,0.01,0.01,0.84,0.01,8
Article 2432,0.01,0.01,0.01,0.07,0.01,0.01,0.01,0.01,0.85,0.01,8
Article 2433,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.92,0.01,8


The table above gives the probability that each article in our dataset belongs to each of the 10 topics, so each row is a distribution over topics. The topic with the highest probability in a row is called that article's dominant topic. We can also extract the words most strongly associated with each topic from the model, as shown below.
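The dominant-topic rule is simply an argmax over each row. A small standard-library sketch, using the first three rows of the table above as input (the helper name `dominant_topic` is hypothetical):

```python
from collections import Counter

# Rows copied from the document-topic table above (Topic 0 ... Topic 9)
doc_topic = {
    "Article 0": [0.01, 0.01, 0.01, 0.01, 0.04, 0.01, 0.01, 0.01, 0.87, 0.01],
    "Article 1": [0.01, 0.01, 0.14, 0.01, 0.01, 0.01, 0.01, 0.01, 0.78, 0.01],
    "Article 2": [0.01, 0.01, 0.19, 0.01, 0.01, 0.01, 0.01, 0.01, 0.72, 0.01],
}


def dominant_topic(weights):
    # Index of the largest weight; ties resolve to the first such index
    return max(range(len(weights)), key=weights.__getitem__)


dominant = {name: dominant_topic(w) for name, w in doc_topic.items()}
topic_counts = Counter(dominant.values())
print(dominant)
print(topic_counts)
```

Counting the dominant topics over the whole dataset in the same way shows how evenly, or unevenly, the articles are spread across topics.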

In [None]:
df_topic_keywords = pd.DataFrame(best_lda_model.components_)

df_topic_keywords.columns = vectorizer.get_feature_names()
df_topic_keywords.index = topic_names
df_topic_keywords.head()


|  | abadiania | abaixar | abalar | abalo | abandonar | abandono | abarcar | abastar | abastecer | abastecimento | ... | zemmour | zerada | zerar | zhao | zhu | zinho | zollinger | zonar | zoom | zoraida |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Topic 0 | 1.839985 | 0.100027 | 0.100003 | 0.1 | 0.100017 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | ... | 0.1 | 0.1 | 0.1 | 0.1 | 0.10001 | 0.1 | 0.1 | 0.100011 | 0.1 | 0.1 |
| Topic 1 | 0.1 | 0.100005 | 0.100001 | 0.1 | 0.100009 | 0.1 | 0.100004 | 0.100029 | 0.100004 | 0.1 | ... | 0.1 | 0.1 | 0.1 | 0.1 | 0.100004 | 0.1 | 0.1 | 0.100001 | 0.1 | 0.1 |
| Topic 2 | 0.1 | 0.1 | 0.100002 | 0.100169 | 0.100004 | 0.1 | 0.1 | 0.100001 | 0.100001 | 0.100002 | ... | 0.1 | 0.10002 | 0.1 | 0.1 | 0.100003 | 0.1 | 0.102922 | 0.100005 | 0.100001 | 0.1 |
| Topic 3 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | ... | 0.1 | 0.1 | 0.1 | 0.361708 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Topic 4 | 0.1 | 0.1 | 0.100001 | 0.1 | 0.100002 | 0.1 | 0.100005 | 0.100002 | 0.100003 | 0.1 | ... | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.100017 | 0.1 | 0.1 |


Finally, we can see the top words associated with each topic by sorting the vocabulary according to its weight in each topic. This also gives us, the researchers, a good idea of how to label each topic if that were necessary.
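The sorting step can be sketched in plain Python. The vocabulary and topic-word weights below are hypothetical values chosen for illustration, not output of the fitted model, and `top_words` is a hypothetical helper.

```python
def top_words(weights, vocabulary, n_words=3):
    # Sort vocabulary indices by descending weight and keep the first n_words
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in order[:n_words]]


# Hypothetical vocabulary and topic-word weights, for illustration only
vocabulary = ["aprovar", "churrasco", "orcamento", "senado", "video"]
topic_word = [
    [4.2, 0.1, 6.5, 9.3, 0.1],  # a budget-like topic
    [0.1, 3.7, 0.1, 0.2, 5.0],  # a mixed topic
]

top_per_topic = [top_words(weights, vocabulary) for weights in topic_word]
for i, words in enumerate(top_per_topic):
    print(f"Topic {i}:", words)
```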

In [None]:
def show_topics(vectorizer=vectorizer, lda_model = best_lda_model, n_words = 10):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model = best_lda_model)

df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords


|  | Word 0 | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Topic 0 | fazendeiro | caseiro | entear | corumba | opep | ciclista | churrasco | gameleira | abadiania | chapeu |
| Topic 1 | imperador | limite | kwh | aquisicoes | listagem | empenhar | ambrosio | otis | cabrito | gastronomia |
| Topic 2 | senado | relator | emendar | precatorios | senador | aprovar | proposto | ministrar | orcamento | texto |
| Topic 3 | delator | delacao | saudade | asfixiar | medicamentar | consorcios | ips | mecanico | efetuaram | psiquiatro |
| Topic 4 | navegador | ater | suportar | video | atualizacao | atualizar | considerar | ver | contratacoes | bueiro |
| Topic 5 | amanda | cavar | cova | jaguariuna | rodear | colunista | camarote | jurere | junho | franciane |
| Topic 6 | verde | chanceler | coalizao | eat | difamacao | deliveroo | mitigacao | cilindrar | tabata | triplice |
| Topic 7 | doce | mocao | fisco | brde | alavancarmos | submarino | demolidor | mcu | destituicao | russia |
| Topic 8 | ano | ter | dizer | poder | governar | economia | presidente | fazer | crime | dia |
| Topic 9 | coalizao | chanceler | coligacao | ostra | verde | maurilio | olaf | pecresse | fdp | jairinho |


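From the table above, we can see that some topics are well defined, while others are harder to interpret:
- Topic 2 (`senado`, `relator`, `precatorios`, `orcamento`) covers the budget that is going through congress in Brazil at the moment.
- Topics 6 and 9 (`coalizao`, `chanceler`, `verde`, `olaf`, `fdp`) appear to cover coalition politics in Europe.
- Topic 8 (`governar`, `economia`, `presidente`, `crime`) collects the general vocabulary of the search keywords and is the dominant topic of every article shown in the document-topic table.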

These results show how we can use LDA models to extract topics and gain insight into what receives the most coverage in newspapers. A planned extension to this work will use the domain from which each news item comes as a source of information, as I try to grapple with source bias in my analysis.

## Visualizations

In [None]:
pyLDAvis.sklearn.prepare(best_lda_model, data_vectorized, vectorizer)


In [None]:
# Inspired by the following Kaggle notebook:
# https://www.kaggle.com/rajmehra03/topic-modelling-using-lda-and-lsa-in-sklearn

vocab = vectorizer.get_feature_names()

def draw_wc(index, vocab=vocab, model=best_lda_model, n_words=50):
    # Rank the vocabulary by its weight in the chosen topic and keep the top words
    comp = model.components_[index]
    sorted_words = sorted(zip(vocab, comp), key=lambda x: x[1], reverse=True)[:n_words]
    imp_words_topic = " ".join(word for word, _ in sorted_words)

    # Render the top words as a word cloud
    wordcloud = WordCloud(width=600, height=400).generate(imp_words_topic)
    plt.figure(figsize=(5, 5))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout()
    plt.show()



In [None]:
draw_wc(0)

# References

Chen, Y. (2018). How to generate an LDA topic model for text analysis. Retrieved from https://medium.com/@yanlinc/how-to-build-a-lda-topic-model-using-from-text-601cdcbfd3a6.

Flora, T. (2021). Exploratory analysis of sentiment and bias in Portuguese-language news. Retrieved from https://gist.github.com/TiagoFlora/f532bd2aeaa35fe4a6ef82106d352bf2.

Grisel, O., Buitinck, L., & Yau, C. K. (n.d.). Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation. Retrieved from https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html.

# Appendix
## Stopwords

In [None]:
print(nltk_stopwords)

['de', 'a', 'o', 'que', 'e', 'é', 'do', 'da', 'em', 'um', 'para', 'com', 'não', 'uma', 'os', 'no', 'se', 'na', 'por', 'mais', 'as', 'dos', 'como', 'mas', 'ao', 'ele', 'das', 'à', 'seu', 'sua', 'ou', 'quando', 'muito', 'nos', 'já', 'eu', 'também', 'só', 'pelo', 'pela', 'até', 'isso', 'ela', 'entre', 'depois', 'sem', 'mesmo', 'aos', 'seus', 'quem', 'nas', 'me', 'esse', 'eles', 'você', 'essa', 'num', 'nem', 'suas', 'meu', 'às', 'minha', 'numa', 'pelos', 'elas', 'qual', 'nós', 'lhe', 'deles', 'essas', 'esses', 'pelas', 'este', 'dele', 'tu', 'te', 'vocês', 'vos', 'lhes', 'meus', 'minhas', 'teu', 'tua', 'teus', 'tuas', 'nosso', 'nossa', 'nossos', 'nossas', 'dela', 'delas', 'esta', 'estes', 'estas', 'aquele', 'aquela', 'aqueles', 'aquelas', 'isto', 'aquilo', 'estou', 'está', 'estamos', 'estão', 'estive', 'esteve', 'estivemos', 'estiveram', 'estava', 'estávamos', 'estavam', 'estivera', 'estivéramos', 'esteja', 'estejamos', 'estejam', 'estivesse', 'estivéssemos', 'estivessem', 'estiver', 'estiv
