---
# Objective: Train a doc2vec model and evaluate how well it predicts the category: Fruit, Animal, or Grain.


**1- Description of the problem or task:**

Predict the category (fruits, animals, or grains) of a question from its text.

**2- Description of the AI solution:**
Supervised training of a classifier for the question categories (3 classes), using roughly 16,000 questions on the topics fruits, animals, and grains.

**3- Data source:**


Embrapa question-and-answer books
https://mais500p500r.sct.embrapa.br/view/index.php4

**4- Independent variables:** 
questions

**5- Dependent variable:** 
Categories: fruits, animals, and grains


-----------------------------------------------------------------
Notebook author: Wellington Rangel
Date: 23/11/2021

In [1]:
# initialize the seed and import some libraries
random_seed=42

import numpy as np
import random
import os

os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(random_seed)
random.seed(random_seed)

In [2]:
from IPython.display import clear_output
!pip install ftfy
!pip install gensim=='3.8.3'
!pip install git+https://github.com/felipemaiapolo/legalnlp
clear_output()

In [3]:
import re
import ftfy
from nltk.tokenize import word_tokenize
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from legalnlp.clean_functions import *

In [4]:
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import nltk
from sklearn.pipeline import Pipeline

In [5]:

##########################################
# external libs
##########################################
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import common_texts
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
%cd /content/drive/MyDrive/Colab Notebooks/projeto/

/content/drive/MyDrive/Colab Notebooks/projeto


In [35]:
# Load the data
df = pd.read_csv('dados_agrupados.csv',  sep=',', low_memory=False,encoding='latin-1' )


In [36]:
#df['texto'] = df['Pergunta'] + " " + df['Resposta'] 

In [37]:
df['texto'] = df['Pergunta']

In [38]:
df.head()

Unnamed: 0,Numero,Livro,Capitulo,Pergunta,Resposta,Target,Target_final,Target_final1,Target_final2,Target_final3,texto
0,1,Pera,Generalidades,Qual é o centro de origem da pereira?,São citados três centros de origem da pereira:...,Pera,Frutas,,Frutas,,Qual é o centro de origem da pereira?
1,2,Pera,Generalidades,Qual é o centro de origem mais importante?,O centro do Oriente Médio é considerado de imp...,Pera,Frutas,,Frutas,,Qual é o centro de origem mais importante?
2,3,Pera,Generalidades,Como ocorreu a disseminação da pereira pelo mu...,"Com base em estudos bioquímicos, verificou-se ...",Pera,Frutas,,Frutas,,Como ocorreu a disseminação da pereira pelo mu...
3,4,Pera,Generalidades,Quais são as espécies de pereira mais cultivad...,"Na Europa, na América do Norte, na América do ...",Pera,Frutas,,Frutas,,Quais são as espécies de pereira mais cultivad...
4,5,Pera,Generalidades,Quando a pereira foi introduzida no Brasil?,Não há relatos na literatura sobre a introduçã...,Pera,Frutas,,Frutas,,Quando a pereira foi introduzida no Brasil?


In [39]:
df = df[['texto', 'Target_final']]

In [40]:
df = df[df['texto'].notnull()]


In [41]:
# Reassign instead of using inplace=True, since df is already a filtered slice
df = df.dropna()

In [42]:
df.head()

Unnamed: 0,texto,Target_final
0,Qual é o centro de origem da pereira?,Frutas
1,Qual é o centro de origem mais importante?,Frutas
2,Como ocorreu a disseminação da pereira pelo mu...,Frutas
3,Quais são as espécies de pereira mais cultivad...,Frutas
4,Quando a pereira foi introduzida no Brasil?,Frutas


In [43]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [44]:
stopwords = nltk.corpus.stopwords.words('portuguese')


In [45]:
type(df["texto"])

pandas.core.series.Series

In [46]:
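# Remove Portuguese stopwords. The match is case-sensitive and runs before lowercasing,
# so capitalized words (e.g. "O") are not removed at this step.
df["texto"] = df["texto"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords)]))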

In [47]:
df["texto"] = df["texto"].str.lower()
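
Because the stopword filter above runs before lowercasing, a capitalized stopword does not match the lowercase list. A minimal sketch of the alternative order (lowercase first, then filter) is shown below. The helper name `remover_stopwords` and the tiny stopword set are assumptions for illustration only; the notebook itself uses NLTK's Portuguese list.

```python
def remover_stopwords(texto, stopwords):
    """Lowercase first, then drop stopwords (hypothetical helper)."""
    palavras = texto.lower().split()
    return ' '.join(p for p in palavras if p not in stopwords)

# Tiny illustrative stopword set; the notebook uses NLTK's Portuguese list.
stopwords_exemplo = {'o', 'que', 'a', 'de', 'da', 'é'}

print(remover_stopwords('O que é a origem da pereira?', stopwords_exemplo))
# prints: origem pereira?
```

Switching to this order would change the cleaned text, so the vectors and scores reported further down would no longer apply as-is.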

In [48]:
df.columns = [ 'text',  'label']

In [49]:
df.head()

Unnamed: 0,text,label
0,qual centro origem pereira?,Frutas
1,qual centro origem importante?,Frutas
2,como ocorreu disseminação pereira mundo?,Frutas
3,quais espécies pereira cultivadas mundo?,Frutas
4,quando pereira introduzida brasil?,Frutas


In [50]:
print("Number of missing values: ", df.isna().sum().sum())

Number of missing values:  0


In [51]:
#df['text'] = df['text'].apply(lambda x: clean(x))

## Training the Model

In [52]:
DEFAULT_RANDOM_STATE = 42

In [53]:
def remover_caracteres_especiais(t):
  return re.sub(r'[!,.?()\[\]\n]', r' ', t)
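
As a quick illustration, the function replaces each listed punctuation character (and newlines) with a space, so that a later `.split()` yields clean tokens. The sample sentence comes from the data preview above.

```python
import re

def remover_caracteres_especiais(t):
    # Replace the listed punctuation characters and newlines with a space
    return re.sub(r'[!,.?()\[\]\n]', r' ', t)

tokens = remover_caracteres_especiais('qual centro origem pereira?').split()
print(tokens)
# prints: ['qual', 'centro', 'origem', 'pereira']
```

Characters outside the bracketed set, such as hyphens, colons, and semicolons, are left untouched.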

In [54]:
# One TaggedDocument per sentence; sentences from the same question share the tag i
documents = [
    TaggedDocument(remover_caracteres_especiais(s).split(), [i])
    for i, doc in enumerate(df['text'])
    for s in sent_tokenize(doc)
]
model = Doc2Vec(documents, vector_size=100, window=5, min_count=3, workers=1, epochs=10, seed=DEFAULT_RANDOM_STATE)

With the model trained, we can inspect the nearest neighbors of a few words.

In [55]:
model.wv.similar_by_word("pereira")

[('associados', 0.9963803291320801),
 ('dessa', 0.9962798357009888),
 ('polinizadoras', 0.9961004257202148),
 ('estruturas', 0.9958788752555847),
 ('nutricionais', 0.9951620697975159),
 ('vírus-da-manchaanelar', 0.99495929479599),
 ('usos', 0.9949545860290527),
 ('leprose', 0.9948598146438599),
 ('hospedeiros', 0.9948444366455078),
 ('favoráveis', 0.9948288798332214)]

In [56]:
model.wv.similar_by_word("maçã")

[('sul', 0.997516393661499),
 ('nordeste', 0.997467041015625),
 ('pera', 0.9965601563453674),
 ('importador', 0.9955219030380249),
 ('exportado', 0.9943673610687256),
 ('ameixeira', 0.9943341016769409),
 ('centro-oeste', 0.9941427111625671),
 ('pessegueiro', 0.9940899610519409),
 ('américa', 0.9940505027770996),
 ('estimados', 0.9940185546875)]

In [57]:
model.wv.similar_by_word("pera")

[('mundo', 0.9988070130348206),
 ('pessegueiro', 0.9973428249359131),
 ('uvas', 0.9972939491271973),
 ('estimados', 0.9970382452011108),
 ('ameixeira', 0.9969502687454224),
 ('maçã', 0.9965601563453674),
 ('diferentes', 0.9963815808296204),
 ('exportado', 0.9956212639808655),
 ('trabalhos', 0.9950789213180542),
 ('américa', 0.9947072267532349)]
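
`similar_by_word` ranks vocabulary words by cosine similarity to the query word's vector. The sketch below computes that measure in plain Python on made-up 3-dimensional vectors (the values are illustrative assumptions, not taken from the model).

```python
import math

def similaridade_cosseno(u, v):
    """Cosine similarity between two equal-length vectors."""
    produto = sum(a * b for a, b in zip(u, v))
    norma_u = math.sqrt(sum(a * a for a in u))
    norma_v = math.sqrt(sum(b * b for b in v))
    return produto / (norma_u * norma_v)

# Made-up vectors for illustration only
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(similaridade_cosseno(a, b))  # close to 1
print(similaridade_cosseno(a, c))  # 0.0
```

Scores that all sit near 1.0, as in the outputs above, mean the word vectors point in almost the same direction, which suggests the rankings separate related from unrelated words only weakly.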

## From texts to vectors

In [58]:
encoder = LabelEncoder()
# Fit once on the string labels so that encoder.classes_ keeps the category names
df['encoded'] = encoder.fit_transform(df['label'])
df.loc[[0, 2000, 15000]]

Unnamed: 0,text,label,encoded
0,qual centro origem pereira?,Frutas,1
2000,é economicamente vantajoso plantar soja transg...,Graos,2
15000,o condiciona sabor produtos defumados?,Animal,0
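
`LabelEncoder` assigns integer codes following the sorted order of the class names, which is why `Animal`, `Frutas`, and `Graos` map to 0, 1, and 2 above. The plain-Python sketch below reproduces that mapping for illustration; the variable names are assumptions, and the notebook itself relies on scikit-learn.

```python
rotulos = ['Frutas', 'Graos', 'Animal', 'Frutas']

# Sorted unique labels play the role of LabelEncoder.classes_
classes = sorted(set(rotulos))
codigo_por_classe = {classe: i for i, classe in enumerate(classes)}

codificados = [codigo_por_classe[r] for r in rotulos]
print(codigo_por_classe)  # {'Animal': 0, 'Frutas': 1, 'Graos': 2}
print(codificados)        # [1, 2, 0, 1]

# Decoding goes back through the sorted class list
decodificados = [classes[c] for c in codificados]
print(decodificados)      # ['Frutas', 'Graos', 'Animal', 'Frutas']
```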


Now we infer a vector for each text.

In [59]:
from tqdm import tqdm

In [60]:
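def vetor_inferido(texto):
  # Tokenize on whitespace only; unlike training, punctuation is not stripped here
  string = str(texto).split()

  # Reseed before each call so that inference is reproducible
  model.random.seed(random_seed)
  # `steps` sets the number of inference epochs (gensim 3.8.3)
  inferido = model.infer_vector(string, steps = 100)

  return np.array(inferido)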
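
The training cell strips punctuation with `remover_caracteres_especiais` before tokenizing, while `vetor_inferido` only splits on whitespace, so a token such as `pereira?` would not match the vocabulary entry `pereira`. The sketch below illustrates the mismatch with a toy vocabulary (an assumption for illustration) and a hypothetical `tokenizar` helper that training and inference could share.

```python
import re

def remover_caracteres_especiais(t):
    return re.sub(r'[!,.?()\[\]\n]', r' ', t)

def tokenizar(texto):
    """Hypothetical shared tokenizer: strip punctuation, then split."""
    return remover_caracteres_especiais(str(texto)).split()

# Toy vocabulary standing in for the words the model saw during training
vocabulario = {'qual', 'centro', 'origem', 'pereira'}
texto = 'qual centro origem pereira?'

fora_simples = [t for t in texto.split() if t not in vocabulario]
fora_limpo = [t for t in tokenizar(texto) if t not in vocabulario]

print(fora_simples)  # ['pereira?']
print(fora_limpo)    # []
```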

In [61]:
df['infered'] = df['text'].apply(lambda x: vetor_inferido(x))

In [62]:
df.head(5)

Unnamed: 0,text,label,encoded,infered
0,qual centro origem pereira?,Frutas,1,"[-0.06854016, 0.008388899, 0.00080350135, 0.04..."
1,qual centro origem importante?,Frutas,1,"[-0.068602435, 0.0056723445, -0.004048582, 0.0..."
2,como ocorreu disseminação pereira mundo?,Frutas,1,"[-0.011550166, -0.029851705, 0.08680195, -0.09..."
3,quais espécies pereira cultivadas mundo?,Frutas,1,"[-0.10277663, 0.23485768, 0.25543827, -0.09155..."
4,quando pereira introduzida brasil?,Frutas,1,"[-0.09202158, -0.062365327, 0.058921706, -0.13..."


In [63]:
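# Stack the inferred vectors once, then expose each component as its own column
vetores = np.vstack(df['infered'])
for i in range(vetores.shape[1]):
  df[str(i)] = vetores[:, i]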
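
Stacking the per-row vectors gives a 2-D matrix whose columns are the individual features. A small NumPy sketch with made-up 3-dimensional vectors shows the shape logic; the data here is illustrative only.

```python
import numpy as np

# Three made-up "inferred" vectors of dimension 3 (the notebook uses 100)
inferidos = [np.array([0.1, 0.2, 0.3]),
             np.array([0.4, 0.5, 0.6]),
             np.array([0.7, 0.8, 0.9])]

matriz = np.vstack(inferidos)   # shape: (n_rows, vector_size)
print(matriz.shape)             # (3, 3)

# Column i holds the i-th component of every vector
coluna_1 = matriz[:, 1]
print(coluna_1)
```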

In [64]:
df.head(5)

Unnamed: 0,text,label,encoded,infered,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,qual centro origem pereira?,Frutas,1,"[-0.06854016, 0.008388899, 0.00080350135, 0.04...",-0.06854,0.008389,0.000804,0.047589,-0.028872,0.139739,-0.090662,0.121242,-0.048302,0.037798,-0.148107,0.076917,0.057312,-0.039587,-0.110055,0.072867,-0.061348,-0.04932,-0.077161,0.172598,0.148774,-0.083025,0.047373,0.080054,-0.119203,0.015849,-0.064748,0.007486,0.111363,-0.163756,-0.106167,0.076879,-0.114849,0.142855,0.116098,0.131554,...,-0.087486,-0.042807,0.048643,-0.031178,0.07235,0.02088,0.009057,0.082637,0.049311,0.071684,0.214402,-0.140717,0.022779,0.005459,0.028286,0.021183,0.112212,-0.002666,-0.073521,0.059076,0.134767,0.113307,0.117999,-0.024801,-0.03489,0.105382,0.05028,0.053371,-0.018242,-0.076977,-0.05566,0.121378,0.134723,-0.106304,-0.023535,0.019611,-0.073678,-0.076563,0.049614,-0.046318
1,qual centro origem importante?,Frutas,1,"[-0.068602435, 0.0056723445, -0.004048582, 0.0...",-0.068602,0.005672,-0.004049,0.044742,-0.022941,0.141607,-0.084429,0.122802,-0.045096,0.03392,-0.141995,0.080001,0.053498,-0.036971,-0.111662,0.069802,-0.065557,-0.055571,-0.076726,0.173373,0.148163,-0.07961,0.047212,0.078178,-0.116945,0.02317,-0.064312,0.005388,0.11135,-0.165651,-0.112385,0.082895,-0.119103,0.140038,0.108487,0.126037,...,-0.092884,-0.038729,0.048613,-0.027501,0.071192,0.020739,0.00851,0.080734,0.049911,0.073677,0.214743,-0.13739,0.018684,0.012158,0.025339,0.01427,0.112484,0.004247,-0.073981,0.060658,0.136869,0.111522,0.121604,-0.028996,-0.031646,0.106866,0.046661,0.05777,-0.015218,-0.0714,-0.051854,0.12099,0.130076,-0.112844,-0.025754,0.018145,-0.080825,-0.079461,0.043307,-0.046509
2,como ocorreu disseminação pereira mundo?,Frutas,1,"[-0.011550166, -0.029851705, 0.08680195, -0.09...",-0.01155,-0.029852,0.086802,-0.095231,-0.068509,0.063628,-0.048996,-0.019869,0.031899,0.053697,-0.162526,0.102209,0.038364,-0.08221,-0.022352,-0.034508,0.075335,-0.102704,0.001891,0.038196,0.106755,0.006811,-0.022264,-0.006513,-0.125544,0.054503,-0.141844,-0.033642,0.061711,-0.087479,-0.043109,-0.094629,-0.020163,-0.022761,0.063284,-0.076682,...,0.04863,0.076148,-0.063511,0.031387,0.077866,0.024719,-0.039954,0.065291,0.050937,-0.043759,0.072706,-0.005123,0.147172,0.009404,0.100738,-0.094785,0.044192,0.006651,-0.09612,-0.088729,0.093959,0.103154,-0.032985,0.01782,-0.099384,0.058481,-0.102357,-0.030059,0.134668,-0.005526,-0.132408,0.105761,-0.044023,-0.17401,-0.024313,-0.103699,-0.014942,-0.086835,0.013108,-0.007594
3,quais espécies pereira cultivadas mundo?,Frutas,1,"[-0.10277663, 0.23485768, 0.25543827, -0.09155...",-0.102777,0.234858,0.255438,-0.091553,0.033661,0.106272,-0.181073,-0.168772,-0.047658,-0.234931,-0.336579,0.296167,0.391762,0.132478,-0.307576,0.125865,0.232477,-0.23348,-0.027032,0.179783,0.229668,0.051366,-0.180092,0.149371,0.002621,-0.035104,0.00467,-0.221382,0.022736,0.024086,-0.153485,0.064251,0.036812,-0.060556,0.087736,-0.058607,...,0.118517,0.148758,-0.134773,-0.229297,0.129219,-0.084946,0.140321,-0.018472,-0.101721,-0.124174,-0.08868,0.287626,0.163881,-0.173085,-0.046747,0.144748,-0.190289,0.174222,0.15647,-0.136321,0.202592,0.069497,0.084147,-0.105718,-0.088435,0.05818,0.014897,-0.007631,0.024807,-0.13761,-0.019028,-0.105944,0.176448,-0.214963,-0.03025,-0.229992,-0.034443,-0.167177,-0.128473,-0.033014
4,quando pereira introduzida brasil?,Frutas,1,"[-0.09202158, -0.062365327, 0.058921706, -0.13...",-0.092022,-0.062365,0.058922,-0.137295,-0.009834,-0.01325,-0.039285,0.070745,0.037498,0.172552,-0.055033,0.168881,0.00265,-0.069344,0.072327,-0.04404,0.084403,-0.092817,0.000646,0.014093,0.106454,-0.072625,0.068465,-0.010419,-0.188989,0.024834,-0.203957,0.051379,0.088324,-0.125045,-0.015869,-0.203874,-0.087496,0.042361,0.060693,-0.128076,...,0.055728,0.015632,-0.056889,0.131431,0.048922,0.029954,-0.039958,0.111819,0.082959,-0.01381,0.167086,-0.159296,0.178688,0.022225,0.173649,-0.156547,0.165427,0.004653,-0.243802,0.008231,0.035723,0.08743,-0.017667,0.032946,-0.043222,0.0576,-0.039879,0.033061,0.197668,-0.059378,-0.285278,0.200499,-0.071772,-0.210824,0.02304,0.067508,-0.085081,-0.145705,0.095153,0.007724


### Training and Test Data

In [65]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns = ['encoded', 'text', 'label', 'infered']),
    df['label'],
    random_state = random_seed,
    test_size = 0.2,
)

In [66]:
# Sizes of the training and test X and y
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(13276, 100)
(3320, 100)
(13276,)
(3320,)
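
The shapes follow from `test_size = 0.2` applied to the 16,596 rows that remain after cleaning (13,276 + 3,320). The arithmetic below assumes scikit-learn rounds the test fraction up and assigns the remainder to training, which matches the numbers printed above.

```python
import math

n_amostras = 13276 + 3320          # total rows, from the shapes above
n_teste = math.ceil(0.2 * n_amostras)
n_treino = n_amostras - n_teste

print(n_amostras, n_treino, n_teste)
# prints: 16596 13276 3320
```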


# Classification


In [67]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

In [83]:
# Define the pipeline. The features are already doc2vec vectors, so it holds only the classifier.
pipeline = Pipeline([
    # The default loss is hinge; loss="log" gives probabilities (named "log_loss" in scikit-learn >= 1.1)
    ('clf', SGDClassifier(loss="log",  max_iter=50000, random_state=42)),
])


In [84]:
X_train[0:1]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
14301,-0.056321,-0.096743,0.039836,-0.010824,0.060898,0.097984,-0.0806,0.074663,-0.107627,0.12632,-0.092227,0.131806,-0.084563,-0.146647,-0.004875,-0.056188,-0.031578,-0.146916,-0.057455,0.24072,0.057688,0.023548,0.011121,0.052558,-0.172044,0.095042,-0.122678,-0.021582,0.042456,-0.241496,-0.118086,-0.02175,-0.169746,0.152788,0.176179,0.048516,0.085892,-0.072029,0.031519,-0.079184,...,-0.097035,0.021292,0.103412,0.050234,0.164194,0.014269,-0.059141,0.090943,0.16746,0.100974,0.174324,-0.095813,0.106565,0.026353,0.169472,-0.083883,0.107398,-0.035988,-0.174073,-0.05668,0.155976,0.126144,0.041628,-0.028807,-0.199917,0.03481,-0.062271,0.043952,0.058936,-0.090484,-0.142674,0.22657,-0.104521,-0.178929,0.012113,-0.039746,-0.042,-0.071031,0.069172,-0.029562


In [72]:
classificador = pipeline.fit(X_train, y_train)

In [73]:
# Show the model's accuracy on the test data
classificador.score(X_test,y_test)

0.5936746987951808

In [74]:
# Predict on the test data
predicted = classificador.predict(X_test)

In [75]:
predicted

array(['Frutas', 'Frutas', 'Animal', ..., 'Animal', 'Frutas', 'Animal'],
      dtype='<U6')

In [76]:
X_test['predicted'] = predicted

In [77]:
X_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,predicted
14026,-0.145874,0.132167,0.239608,-0.087786,-0.092967,0.154677,-0.235195,0.027584,-0.058500,-0.127445,-0.290769,0.292913,0.264506,0.031144,-0.224720,0.072318,0.157470,-0.223512,0.041094,0.214229,0.238112,0.006624,-0.097262,0.179018,-0.132764,0.021350,-0.081363,-0.183464,0.139207,-0.106298,-0.207485,0.026027,-0.023878,0.007454,0.198563,-0.019828,0.182623,-0.130143,-0.000365,-0.072412,...,0.102202,-0.099623,-0.079369,0.158083,-0.001126,0.027777,0.037911,-0.042753,-0.030045,0.123192,0.158409,0.218658,-0.099401,0.006207,0.012073,0.034299,0.142023,0.054989,-0.033299,0.245575,0.206894,0.114443,-0.047870,-0.154393,0.118134,-0.021369,-0.046921,0.033718,-0.068971,-0.058042,0.044586,0.127536,-0.334068,-0.052804,-0.111935,-0.102677,-0.167041,0.006002,-0.008604,Frutas
4789,-0.196017,-0.226406,-0.243421,-0.277949,0.227553,-0.020349,0.138743,0.025016,-0.022783,0.324804,0.129328,0.375387,-0.217633,-0.218214,0.038908,0.128642,-0.109111,-0.092221,-0.207508,0.130157,0.014690,-0.083542,0.212129,-0.083905,-0.283327,0.221476,-0.378810,0.425161,-0.111353,-0.011125,0.293636,-0.221635,-0.060601,0.314254,0.039254,-0.146609,-0.065325,0.305349,0.076571,0.350730,...,-0.111056,0.028809,0.315437,-0.127587,0.156011,-0.023049,0.194044,0.266918,0.035101,0.397722,-0.153498,0.062209,-0.053616,0.320596,-0.331574,0.288053,-0.270777,-0.599598,-0.080449,0.211471,-0.021267,-0.178126,-0.214300,-0.132561,0.155784,0.006280,0.430705,0.163841,-0.098964,-0.392994,0.290341,-0.213138,0.045591,-0.104900,-0.167312,-0.134586,-0.084036,0.194265,-0.216169,Frutas
14041,-0.059568,-0.058774,-0.131015,0.049946,0.070473,0.134127,-0.029794,0.110324,-0.078203,0.094839,-0.060509,0.011455,-0.057054,-0.086136,-0.073276,0.096116,-0.156012,-0.005564,-0.132282,0.165647,0.083271,-0.125289,0.100974,0.026781,-0.075683,0.089074,-0.092916,0.120272,0.018955,-0.159479,-0.063588,0.099218,-0.148196,0.202387,0.096787,0.136216,0.113957,-0.045600,0.080838,0.038977,...,-0.097414,0.066169,0.000933,0.019520,0.012286,0.008687,0.112111,0.074624,0.114058,0.222055,-0.189526,-0.033710,0.016881,0.065483,0.011208,0.109767,-0.062393,-0.148874,0.097271,0.067447,0.054305,0.143444,-0.052755,-0.006257,0.069285,0.044435,0.117396,-0.035372,-0.076464,-0.089203,0.132242,0.053011,0.008802,-0.047902,0.041386,-0.072286,-0.042925,0.074025,-0.072210,Animal
14765,-0.196128,0.155784,0.151485,-0.071423,-0.038441,0.055993,-0.067610,0.083810,-0.031246,-0.034852,-0.205283,0.233663,0.343619,0.074418,-0.122222,0.090242,0.289229,-0.276964,-0.042388,0.018649,0.232807,0.033589,-0.026344,0.070186,-0.067234,-0.085319,-0.022342,-0.100883,0.152697,0.039507,-0.066864,-0.094252,-0.017226,-0.056598,0.067098,-0.069267,0.047762,-0.068862,-0.036657,-0.042081,...,0.174341,-0.093647,-0.055252,0.087687,0.035219,0.061948,-0.010330,-0.005794,-0.092680,0.045712,0.054600,0.204304,-0.118695,-0.010990,0.059890,-0.012607,0.131787,0.023808,-0.125199,0.198744,0.093652,-0.141018,-0.072414,-0.027937,0.117264,0.069361,-0.046175,0.175548,-0.212535,-0.085614,0.009158,0.108439,-0.277070,0.053175,-0.086165,-0.045731,-0.235543,-0.059550,0.050338,Frutas
5211,-0.027885,0.262225,0.114146,0.032822,0.037950,-0.092822,-0.042279,-0.079508,-0.107165,-0.138999,0.008235,0.320486,0.537668,0.246523,-0.133470,0.155399,0.200467,0.020936,-0.007273,0.119176,0.134938,-0.010129,-0.148584,0.104186,0.237731,-0.117542,0.032870,-0.172044,0.147799,0.057334,-0.079518,-0.028502,0.289991,-0.299460,-0.166241,-0.021206,0.019430,0.066892,0.080035,-0.052913,...,0.118146,-0.106860,-0.158742,0.099522,0.052217,0.014067,-0.076340,-0.030507,-0.164180,-0.069680,0.082738,0.065263,-0.099274,-0.051121,0.266798,-0.261708,0.277353,0.091130,0.145554,0.060004,-0.218225,-0.056376,0.037650,-0.180028,-0.091213,0.283815,-0.024841,0.047261,-0.017082,-0.021824,-0.089295,0.318391,-0.229079,0.058802,-0.005863,-0.126169,0.010782,-0.181077,0.131892,Frutas
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4100,-0.114805,0.136794,0.223460,-0.063697,-0.002226,0.102636,-0.202085,-0.069007,-0.050511,-0.152640,-0.323577,0.323349,0.315174,0.064225,-0.264259,0.101005,0.166610,-0.209715,-0.032302,0.247215,0.226030,0.046092,-0.138840,0.161937,-0.096340,-0.055251,-0.052203,-0.178523,0.038546,-0.064120,-0.171666,0.029045,-0.058142,-0.008296,0.167764,-0.018967,0.288864,-0.150928,-0.020503,-0.091586,...,0.090419,-0.068003,-0.135676,0.161805,-0.020286,0.081177,0.028416,-0.018294,-0.056851,0.009710,0.195706,0.166017,-0.103959,-0.006299,0.079944,-0.073129,0.107972,0.071442,-0.113865,0.213560,0.146302,0.109526,-0.118003,-0.120010,0.077893,0.010494,-0.013857,0.023739,-0.131620,-0.085159,-0.008743,0.170678,-0.242467,-0.017213,-0.195818,-0.046581,-0.164397,-0.081629,-0.050500,Frutas
833,-0.070599,-0.052399,-0.030257,-0.030995,-0.085670,0.041763,-0.112511,0.241379,-0.032224,0.133185,0.055408,0.049543,-0.084940,-0.041408,0.111697,0.005174,-0.066332,-0.076724,0.023539,0.001215,0.055503,-0.064646,0.172762,0.000962,-0.191688,0.010080,-0.069064,0.146731,0.115411,-0.130401,-0.018930,-0.095406,-0.177870,0.191793,0.072798,0.111007,-0.113147,-0.026684,0.015338,0.038965,...,-0.110769,0.052783,0.183820,0.062092,0.049653,-0.033622,0.103439,0.043479,0.111177,0.268063,-0.240540,0.054069,0.024468,0.063029,-0.065483,0.370205,-0.054229,-0.141445,0.100446,0.002711,0.186739,0.049078,0.000645,0.038438,0.121951,-0.018499,0.070848,0.057967,-0.036849,-0.167340,0.136463,0.017648,-0.075132,-0.032363,0.226872,-0.119019,-0.068913,0.126711,-0.022048,Frutas
7596,-0.099352,0.017739,-0.032773,-0.016209,0.024391,0.164038,-0.098054,0.102626,-0.053988,0.035583,-0.144852,0.099608,0.033078,-0.027951,-0.135145,0.100652,-0.064014,-0.084814,-0.101286,0.158602,0.128666,-0.117225,0.085382,0.062290,-0.123250,0.063386,-0.107144,0.075198,0.050848,-0.143122,-0.074434,0.077837,-0.137182,0.197237,0.103406,0.109536,0.119941,-0.091541,0.067181,0.034525,...,-0.055991,0.011304,-0.019731,0.030215,-0.006322,0.040504,0.113819,0.037121,0.075113,0.228806,-0.136086,0.028216,-0.026970,0.049804,-0.006814,0.130511,-0.019754,-0.120139,0.061913,0.114637,0.115151,0.158110,-0.051783,-0.008432,0.115626,0.023985,0.134202,-0.011003,-0.082698,-0.091864,0.130697,0.105172,-0.067576,-0.073718,0.028167,-0.083977,-0.076195,0.063448,-0.069014,Animal
10067,-0.071816,0.005346,-0.001181,0.052857,-0.024365,0.145884,-0.090609,0.123532,-0.051538,0.034946,-0.141324,0.083171,0.050560,-0.042244,-0.110864,0.077251,-0.057034,-0.047681,-0.080411,0.178814,0.147540,-0.085666,0.042913,0.082695,-0.118072,0.013575,-0.066791,0.011956,0.115258,-0.167515,-0.104583,0.082203,-0.114679,0.134699,0.110719,0.127443,0.126278,-0.103885,0.061300,-0.013785,...,-0.039891,0.047311,-0.032757,0.068197,0.020345,0.015171,0.084581,0.058526,0.072597,0.216504,-0.138249,0.025947,0.011255,0.027549,0.016235,0.105778,-0.000688,-0.067333,0.066483,0.135080,0.110951,0.121547,-0.032859,-0.036287,0.108163,0.046689,0.054632,-0.019108,-0.069343,-0.053721,0.122256,0.127600,-0.105735,-0.024772,0.022841,-0.073559,-0.072145,0.050656,-0.040478,Frutas


In [78]:

df.iloc[9284].text

'quais principais causas baixo rendimento gergelim cultivo?'

In [79]:
# Evaluate the model using sklearn's 'classification_report'
from sklearn import metrics
print(metrics.classification_report(y_test, predicted))

              precision    recall  f1-score   support

      Animal       0.75      0.47      0.58       848
      Frutas       0.57      0.72      0.63      1414
       Graos       0.56      0.52      0.54      1058

    accuracy                           0.59      3320
   macro avg       0.63      0.57      0.59      3320
weighted avg       0.61      0.59      0.59      3320
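
The averages in the report can be cross-checked from the per-class rows, and the support column also gives a majority-class baseline for comparison. The numbers below are copied from the report above; they are rounded to two decimals, so the results are approximate.

```python
# (precision, recall, support) per class, copied from the report
relatorio = {
    'Animal': (0.75, 0.47, 848),
    'Frutas': (0.57, 0.72, 1414),
    'Graos':  (0.56, 0.52, 1058),
}
total = sum(s for _, _, s in relatorio.values())

macro_precisao = sum(p for p, _, _ in relatorio.values()) / len(relatorio)
macro_recall = sum(r for _, r, _ in relatorio.values()) / len(relatorio)
# Support-weighted recall equals overall accuracy
acuracia_aprox = sum(r * s for _, r, s in relatorio.values()) / total
# Accuracy of always predicting the most frequent class
baseline = max(s for _, _, s in relatorio.values()) / total

print(round(macro_precisao, 2), round(macro_recall, 2))  # 0.63 0.57
print(round(acuracia_aprox, 2), round(baseline, 2))      # 0.59 0.43
```

The classifier's accuracy of about 0.59 is therefore roughly 0.17 above the baseline of always predicting `Frutas`.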



In [81]:
nova_pergunta = "já quantidade sementes regulada comprimento"
nova_pergunta = nova_pergunta.lower()
nova_pergunta = vetor_inferido(nova_pergunta)

# Wrap the vector in a DataFrame with the training column names so that
# scikit-learn does not warn about missing feature names
result = classificador.predict(pd.DataFrame(nova_pergunta.reshape(1, -1), columns=X_train.columns))
result[0]


'Frutas'