# Natural Language Processing Applied to Public Management

Class 5 (08/06): Named Entity Recognition, Text Classification, and Topic Modeling

Recognition of mentions of people, places, organizations, and other entities. Text classification. Topic identification. Latent Dirichlet Allocation.



## Environment setup

In [24]:
%%capture
# Download the data (the %%capture cell magic must be the first line of the cell)
!git clone https://github.com/samuelbarbosaa/oficina_nlp.git

In [25]:
import pickle
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

plt.rcParams['figure.figsize'] = [10, 5]

In [26]:
# Load the preprocessed texts
with open('/content/oficina_nlp/data/textos_padronizados.pkl', 'rb') as fp:
  textos_proposicoes = pickle.load(fp)

textos_proposicoes

| siglaTipoProjeto | numero | ano | texto_padr | texto |
| --- | --- | --- | --- | --- |
| PL | 502 | 1999 | projeto lei autorizar executivo criar concessã... | PROJETO DE LEI Nº 502/99 Autoriza o Poder Exec... |
| PL | 2126 | 2002 | projeto lei alterar dispositivo lei dezembro c... | PROJETO DE LEI Nº 2.126/2002 Altera dispositiv... |
| PL | 2127 | 2002 | projeto lei alterar redação art lei dezembro d... | PROJETO DE LEI Nº 2.127/2002 Altera a redação ... |
| PL | 120 | 2003 | projeto lei projeto lei instituir medalha méri... | PROJETO DE LEI Nº 120/2003 (EX-PROJETO DE LEI ... |
| PL | 1519 | 2004 | projeto lei dispor gratuidade tranporte coleti... | PROJETO DE LEI Nº 1.519/2004 Dispõe sobre a gr... |
| PL | ... | ... | ... | ... |
| PL | 3614 | 2022 | projeto lei autorizar executivo doar município... | # Projeto de Lei nº 3.614/2022\n\nAutoriza o P... |
| PEC | 81 | 2022 | proposta emenda constituição acrescentar art i... | # Proposta de Emenda à Constituição nº 81/2022... |
| PL | 3622 | 2022 | projeto lei declarar utilidade público institu... | # Projeto de Lei nº 3.622/2022\n\nDeclara de u... |
| PL | 3624 | 2022 | projeto lei denominação viaduto localizar alça... | # Projeto de Lei nº 3.624/2022\n\nDá denominaç... |


Subjects that each proposition refers to (*labels*/classes):

In [27]:
assuntos = pd.read_csv("/content/oficina_nlp/data/assunto_proposicoes.csv").drop_duplicates().set_index(["siglaTipoProjeto", "numero", "ano"])
assuntos

| siglaTipoProjeto | numero | ano | assuntoGeral |
| --- | --- | --- | --- |
| PEC | 1 | 1972 | Administração Municipal |
| PEC | 1 | 1972 | Constituição Estadual |
| PEC | 1 | 1975 | Assembleia Legislativa do Estado de Minas Gera... |
| PEC | 1 | 1975 | Constituição Estadual |
| PEC | 1 | 1975 | Deputado Estadual |
| ... | ... | ... | ... |
| PL | 3776 | 2022 | Transporte e Trânsito |
| PL | 3778 | 2022 | Patrimônio Cultural |
| PL | 3779 | 2022 | Evento |
| PL | 2383 | 2020 | Educação |


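We will use the LSA matrix as the predictor variables (*features*). LSA is obtained by applying a truncated SVD to the TF-IDF matrix, so the TF-IDF matrix is built first.

TF-IDF matrix: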

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

metodo = TfidfVectorizer()
modelo = metodo.fit(textos_proposicoes["texto_padr"])
TF_IDF = modelo.transform(textos_proposicoes["texto_padr"])
TF_IDF

<4534x23841 sparse matrix of type '<class 'numpy.float64'>'
	with 729992 stored elements in Compressed Sparse Row format>
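To make the TF-IDF weighting concrete, here is a minimal sketch on a made-up three-document corpus (the corpus and the names `corpus_exemplo`, `vetorizador`, and `matriz` are illustrative assumptions, not workshop data). It shows that a term appearing in fewer documents receives a larger weight than a term shared by several documents, all else being equal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not taken from the workshop data).
corpus_exemplo = [
    "projeto lei autorizar executivo doar imovel",
    "projeto lei declarar utilidade publico",
    "proposta emenda constituicao",
]

vetorizador = TfidfVectorizer()
matriz = vetorizador.fit_transform(corpus_exemplo)

# vocabulary_ maps each term to its column index in the matrix.
indice = vetorizador.vocabulary_
print(matriz.shape)  # (number of documents, number of distinct terms)

# In the first document, "doar" occurs in one document of the corpus and
# "projeto" in two, so "doar" gets the larger TF-IDF weight.
peso_doar = matriz[0, indice["doar"]]
peso_projeto = matriz[0, indice["projeto"]]
print(peso_doar, peso_projeto)
```

The same reading applies to the 4534 x 23841 matrix above: one row per proposition, one column per distinct term, with most entries equal to zero.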

LSA matrix:

In [29]:
from sklearn.decomposition import TruncatedSVD

tsvd = TruncatedSVD(n_components=10)
M_LSA = tsvd.fit_transform(TF_IDF)

In [30]:
M_LSA.shape

(4534, 10)
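The value `n_components=10` is a modelling choice. One way to judge it, sketched below on a made-up corpus, is to inspect the `explained_variance_ratio_` attribute of the fitted `TruncatedSVD`. The corpus, the choice of 2 components, and the names `corpus_exemplo`, `svd_exemplo`, and `lsa_exemplo` are assumptions for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not taken from the workshop data).
corpus_exemplo = [
    "projeto lei autorizar executivo doar imovel municipio",
    "projeto lei autorizar executivo criar concessao",
    "projeto lei declarar utilidade publico instituto",
    "proposta emenda constituicao acrescentar artigo",
]

tfidf_exemplo = TfidfVectorizer().fit_transform(corpus_exemplo)

# n_components cannot exceed the number of terms; 2 is an arbitrary choice here.
svd_exemplo = TruncatedSVD(n_components=2, random_state=0)
lsa_exemplo = svd_exemplo.fit_transform(tfidf_exemplo)

# One row per document, one column per latent component.
print(lsa_exemplo.shape)

# Share of the TF-IDF variance captured by each component, and in total.
razao = svd_exemplo.explained_variance_ratio_
print(razao, razao.sum())
```

If the total is small on the real data, increasing `n_components` is one option to explore; fixing `random_state` also keeps the components reproducible between runs.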

## Building the matrix $X$ of predictor variables

In [31]:
X = pd.DataFrame(M_LSA, index=textos_proposicoes.index)
X

| siglaTipoProjeto | numero | ano | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PL | 502 | 1999 | 0.197148 | -0.070334 | 0.006180 | 0.019315 | -0.007822 | -0.055337 | 0.009857 | 0.040396 | -0.044711 | 0.034256 |
| PL | 2126 | 2002 | 0.219005 | -0.116131 | -0.043454 | 0.021444 | -0.030288 | -0.184952 | -0.005094 | 0.107220 | 0.101947 | -0.071822 |
| PL | 2127 | 2002 | 0.291456 | -0.146212 | -0.025173 | 0.046497 | -0.045563 | -0.253030 | -0.006564 | 0.150605 | 0.136953 | -0.102723 |
| PL | 120 | 2003 | 0.147777 | -0.015505 | -0.011531 | -0.023840 | -0.023328 | -0.003911 | -0.020818 | 0.022958 | -0.017735 | 0.005108 |
| PL | 1519 | 2004 | 0.260409 | -0.120940 | -0.012898 | -0.050121 | -0.068699 | -0.028392 | -0.008942 | 0.151119 | 0.056900 | -0.138414 |
| PL | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PL | 3614 | 2022 | 0.294968 | 0.036816 | 0.351056 | -0.014532 | 0.008926 | 0.066146 | 0.047059 | -0.022350 | 0.025295 | 0.017744 |
| PEC | 81 | 2022 | 0.260494 | -0.059573 | -0.031587 | -0.046052 | -0.095933 | -0.003337 | -0.064729 | -0.004338 | -0.071300 | 0.024842 |
| PL | 3622 | 2022 | 0.277894 | 0.212062 | -0.073688 | -0.005789 | -0.024179 | 0.035966 | 0.010046 | -0.003185 | 0.006937 | -0.008978 |
| PL | 3624 | 2022 | 0.108521 | 0.021439 | 0.000908 | 0.001291 | -0.000109 | -0.015583 | 0.023233 | 0.041107 | 0.008079 | 0.026790 |


In [32]:
dados_classificacao = X.join(assuntos, how="inner")
dados_classificacao

| siglaTipoProjeto | numero | ano | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | assuntoGeral |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PEC | 1 | 2019 | 0.187110 | -0.041069 | -0.027076 | -0.022069 | -0.104861 | -0.064377 | -0.194709 | -0.033675 | -0.094415 | 0.017330 | Administração Estadual |
| PEC | 1 | 2019 | 0.187110 | -0.041069 | -0.027076 | -0.022069 | -0.104861 | -0.064377 | -0.194709 | -0.033675 | -0.094415 | 0.017330 | Constituição Estadual |
| PEC | 1 | 2019 | 0.187110 | -0.041069 | -0.027076 | -0.022069 | -0.104861 | -0.064377 | -0.194709 | -0.033675 | -0.094415 | 0.017330 | Pessoal |
| PEC | 1 | 2019 | 0.187110 | -0.041069 | -0.027076 | -0.022069 | -0.104861 | -0.064377 | -0.194709 | -0.033675 | -0.094415 | 0.017330 | Pessoal Militar |
| PEC | 2 | 2019 | 0.188398 | -0.058256 | -0.014387 | -0.002127 | -0.088998 | -0.062617 | -0.175627 | -0.038617 | -0.081991 | 0.015682 | Constituição Estadual |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PRE | 167 | 2022 | 0.219018 | -0.063685 | 0.007136 | 0.311005 | 0.070929 | 0.053633 | -0.206210 | 0.085730 | -0.134410 | -0.024340 | Saúde Pública |
| PRE | 171 | 2022 | 0.215951 | -0.055345 | -0.001155 | 0.318780 | 0.079152 | 0.060715 | -0.194591 | 0.083473 | -0.144922 | -0.021876 | Calamidade Pública |
| PRE | 171 | 2022 | 0.215951 | -0.055345 | -0.001155 | 0.318780 | 0.079152 | 0.060715 | -0.194591 | 0.083473 | -0.144922 | -0.021876 | Finanças Públicas |
| PRE | 171 | 2022 | 0.215951 | -0.055345 | -0.001155 | 0.318780 | 0.079152 | 0.060715 | -0.194591 | 0.083473 | -0.144922 | -0.021876 | Municípios e Desenvolvimento Regional |


In [33]:
X = dados_classificacao.drop(columns=["assuntoGeral"])
Y = dados_classificacao[["assuntoGeral"]]
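Because `assuntos` can list several subjects for the same proposition, the inner join repeats that proposition's feature row once per subject (PEC 1/2019 appears four times above, with identical features and four different labels). A classifier that outputs a single label per row cannot be right for all of those copies, which limits the accuracy it can reach. The sketch below reproduces the effect on a tiny made-up example; the index values, the single feature column `f0`, and the names `X_exemplo`, `assuntos_exemplo`, and `junto` are assumptions, not workshop data.

```python
import pandas as pd

nomes = ["siglaTipoProjeto", "numero", "ano"]

# Hypothetical features: one row per (made-up) proposition.
X_exemplo = pd.DataFrame(
    {"f0": [0.19, 0.26]},
    index=pd.MultiIndex.from_tuples(
        [("XX", 1, 2000), ("XX", 2, 2000)], names=nomes
    ),
)

# Hypothetical labels: the first proposition has two subjects.
assuntos_exemplo = pd.DataFrame(
    {"assuntoGeral": ["Pessoal", "Evento", "Evento"]},
    index=pd.MultiIndex.from_tuples(
        [("XX", 1, 2000), ("XX", 1, 2000), ("XX", 2, 2000)], names=nomes
    ),
)

# The join repeats the features of ("XX", 1, 2000) once per subject.
junto = X_exemplo.join(assuntos_exemplo, how="inner")
print(junto)

# Number of labels (and therefore duplicated feature rows) per proposition.
rotulos_por_proposicao = junto.groupby(level=nomes).size()
print(rotulos_por_proposicao)
```

Running the same `groupby(level=...).size()` on `dados_classificacao` would show how many propositions in the real data carry more than one subject.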

Splitting the data into training and test sets:

In [34]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
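A plain random split can place copies of the same proposition (identical features, different subjects) in both the training and the test set. One possible alternative, which this notebook does not use, is a group-aware split such as scikit-learn's `GroupShuffleSplit`, where every row of a group goes to the same side. The sketch below uses made-up arrays; `grupos` stands in for a hypothetical per-proposition identifier.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 8 rows, 2 features, 4 groups (one group per proposition).
X_exemplo = np.arange(16).reshape(8, 2)
y_exemplo = np.array(["A", "B", "A", "C", "B", "B", "C", "A"])
grupos = np.array([0, 0, 1, 1, 1, 2, 3, 3])

# Keep all rows of a group on the same side of the split.
divisor = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
idx_treino, idx_teste = next(divisor.split(X_exemplo, y_exemplo, groups=grupos))

grupos_treino = set(grupos[idx_treino])
grupos_teste = set(grupos[idx_teste])

# No group appears on both sides.
print(grupos_treino & grupos_teste)
```

With the notebook's data, the natural group key would be the (`siglaTipoProjeto`, `numero`, `ano`) index, although how to encode it as a group array is left open here.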

## Method 01: *Random Forest*

In [35]:
from sklearn.ensemble import RandomForestClassifier

metodo = RandomForestClassifier(n_estimators=200, random_state=0)
# Y_train is a single-column DataFrame; ravel() passes the labels as a 1-D array
modelo = metodo.fit(X_train, Y_train.values.ravel())


In [36]:
Y_predito = modelo.predict(X_test)
Y_predito

array(['Crédito', 'Estabelecimento Penal', 'Conselho Estadual', ...,
       'Pessoal', 'Defesa do Consumidor', 'Segurança Pública'],
      dtype=object)

In [37]:
print(Y_predito.shape, Y_test.shape)

(2045,) (2045, 1)


### Evaluating the model

In [38]:
from sklearn.metrics import classification_report, accuracy_score

# Accuracy on the training set, for comparison with the test-set accuracy computed next
accuracy_score(Y_train, modelo.predict(X_train))

0.5047700587084148

In [39]:
print(f"Overall model accuracy: {accuracy_score(Y_test, Y_predito)}")
print("* The share of cases in which the model predicts the category correctly.")

Overall model accuracy: 0.16674816625916872
* The share of cases in which the model predicts the category correctly.
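An accuracy figure is easier to interpret next to a trivial baseline. A minimal sketch, assuming a made-up label set, uses scikit-learn's `DummyClassifier` to always predict the most frequent training class; the arrays and the names `baseline` and `acuracia_baseline` are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels; the features are ignored by the dummy model.
X_treino = np.zeros((6, 1))
y_treino = np.array(["Pessoal", "Pessoal", "Pessoal", "Evento", "Evento", "Artesanato"])
X_teste = np.zeros((4, 1))
y_teste = np.array(["Pessoal", "Evento", "Pessoal", "Artesanato"])

# Always predicts the most frequent class seen in training ("Pessoal").
baseline = DummyClassifier(strategy="most_frequent").fit(X_treino, y_treino)
acuracia_baseline = accuracy_score(y_teste, baseline.predict(X_teste))
print(acuracia_baseline)  # 0.5: two of the four test labels are "Pessoal"
```

Fitting the same baseline on `X_train`, `Y_train` would indicate how much of the reported accuracy comes from class frequencies alone.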


Report by category:

*   Precision: share of the model's predictions for a category that are correct [TP / (TP+FP)]
*   Recall: share of the true instances of a category that the model identified [TP / (TP+FN)]
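The two formulas above can be checked by hand on a small example. The sketch below counts TP, FP, and FN for one category on made-up labels and compares the result with scikit-learn's `precision_score` and `recall_score`; the label lists are assumptions for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted labels for six propositions.
y_verdadeiro = ["Evento", "Evento", "Evento", "Evento", "Pessoal", "Pessoal"]
y_previsto = ["Evento", "Evento", "Pessoal", "Pessoal", "Evento", "Pessoal"]

categoria = "Evento"
pares = list(zip(y_verdadeiro, y_previsto))

# Count true positives, false positives, and false negatives for the category.
tp = sum(v == categoria and p == categoria for v, p in pares)
fp = sum(v != categoria and p == categoria for v, p in pares)
fn = sum(v == categoria and p != categoria for v, p in pares)

precisao = tp / (tp + fp)  # 2 / 3
recall = tp / (tp + fn)    # 2 / 4

print(tp, fp, fn)
print(precisao, recall)

# The same values computed by scikit-learn for the chosen category.
print(precision_score(y_verdadeiro, y_previsto, pos_label=categoria))
print(recall_score(y_verdadeiro, y_previsto, pos_label=categoria))
```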

In [40]:
print(classification_report(Y_test, Y_predito))

                                                                          precision    recall  f1-score   support

                                                                  (LGBT)       0.00      0.00      0.00         4
                                                       Acordo Financeiro       0.00      0.00      0.00         1
                                                  Administração Estadual       0.02      0.03      0.02        40
                                                 Administração Municipal       0.00      0.00      0.00         4
                                                   Administração Pública       0.07      0.06      0.06        16
                                                            Agropecuária       0.00      0.00      0.00        22
                                                             Alimentação       0.00      0.00      0.00        18
                                                              Artesanato       0.00    



## Method 02: Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression

In [42]:
clf_LR = LogisticRegression(multi_class="multinomial")

In [43]:
# Y_train is a single-column DataFrame; ravel() passes the labels as a 1-D array
fit_LR = clf_LR.fit(X_train, Y_train.values.ravel())


In [44]:
Y_pred_LR = fit_LR.predict(X_test)

In [45]:
print(f"Overall model accuracy: {accuracy_score(Y_test, Y_pred_LR)}")
print("* The share of cases in which the model predicts the category correctly.")

Overall model accuracy: 0.2723716381418093
* The share of cases in which the model predicts the category correctly.
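With many categories, and with propositions that legitimately belong to more than one of them, exact-match accuracy can understate how useful the predicted probabilities are. One complementary measure, not computed in this notebook, is top-k accuracy: a prediction counts as correct when the true class is among the k most probable ones. The sketch below uses synthetic data from `make_classification`; the dataset, `k=3`, and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, top_k_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem standing in for the real features and subjects.
X_sint, y_sint = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sint, y_sint, test_size=0.2, random_state=0, stratify=y_sint
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Top-1 accuracy is the usual accuracy; top-3 also accepts the 2nd and 3rd guesses.
top1 = accuracy_score(y_te, clf.predict(X_te))
top3 = top_k_accuracy_score(y_te, clf.predict_proba(X_te), k=3, labels=clf.classes_)
print(top1, top3)
```

On the notebook's data, the equivalent call would pass `fit_LR.predict_proba(X_test)` and `labels=fit_LR.classes_`.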


In [46]:
print(classification_report(Y_test, Y_pred_LR))

                                                                          precision    recall  f1-score   support

                                                                  (LGBT)       0.00      0.00      0.00         4
                                                       Acordo Financeiro       0.00      0.00      0.00         1
                                                  Administração Estadual       0.19      0.12      0.15        40
                                                 Administração Municipal       0.00      0.00      0.00         4
                                                   Administração Pública       0.00      0.00      0.00        16
                                                            Agropecuária       0.00      0.00      0.00        22
                                                             Alimentação       0.00      0.00      0.00        18
                                                              Artesanato       0.00    

