<a href="https://colab.research.google.com/github/viniciusrpb/datavis_book/blob/main/cap4_fundamentos_de_textos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 4: Text Fundamentals

Installing NLTK

In [1]:
!pip install nltk



In [57]:
import nltk
from nltk.corpus import stopwords,reuters
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.svm import SVC

Download the resources required by the preprocessing techniques

In [59]:
nltk.download('punkt')
nltk.download('reuters')
nltk.download('stopwords')
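# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain
# list of English stopwords, so re-running this cell requires re-importing it first.
stopwords = stopwords.words('english')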

!unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora/

Streaming output truncated to the last 5000 lines.
  inflating: /root/nltk_data/corpora/reuters/training/2231  
  inflating: /root/nltk_data/corpora/reuters/training/2232  
  inflating: /root/nltk_data/corpora/reuters/training/2234  
  inflating: /root/nltk_data/corpora/reuters/training/2236  
  inflating: /root/nltk_data/corpora/reuters/training/2237  
  inflating: /root/nltk_data/corpora/reuters/training/2238  
  inflating: /root/nltk_data/corpora/reuters/training/2239  
  inflating: /root/nltk_data/corpora/reuters/training/2240  
  inflating: /root/nltk_data/corpora/reuters/training/2244  
  inflating: /root/nltk_data/corpora/reuters/training/2246  
  inflating: /root/nltk_data/corpora/reuters/training/2247  
  inflating: /root/nltk_data/corpora/reuters/training/2249  
  inflating: /root/nltk_data/corpora/reuters/training/225  
  inflating: /root/nltk_data/corpora/reuters/training/2251  
  inflating: /root/nltk_data/corpora/reuters/training/2252  
  inflating: /root/nl

We collect the document IDs for the training set and the test set

In [24]:
documents = reuters.fileids()

train_docs_id = [doc for doc in documents if doc.startswith("train")]
test_docs_id = [doc for doc in documents if doc.startswith("test")]
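
The split is encoded in each file ID as a path prefix, as the `training/...` paths in the unzip output suggest. A minimal sketch of the same filtering on a hand-made list follows; the specific IDs are illustrative placeholders, not taken from the corpus listing.

```python
# Hypothetical file IDs shaped like those returned by reuters.fileids()
sample_ids = ["test/0001", "training/0002", "training/0003", "test/0004"]

# str.startswith keeps only the IDs under the requested split prefix
sample_train_ids = [doc_id for doc_id in sample_ids if doc_id.startswith("train")]
sample_test_ids = [doc_id for doc_id in sample_ids if doc_id.startswith("test")]

print(sample_train_ids)
print(sample_test_ids)
```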

Retrieve the original texts from the IDs

In [None]:
train_docs = [reuters.raw(documento) for documento in train_docs_id]
test_docs = [reuters.raw(documento) for documento in test_docs_id]

Next, print the first five documents (news articles) of the Reuters corpus

In [68]:
for doc in train_docs[:5]:
  print(doc)

BAHIA COCOA REVIEW
  Showers continued throughout the week in
  the Bahia cocoa zone, alleviating the drought since early
  January and improving prospects for the coming temporao,
  although normal humidity levels have not been restored,
  Comissaria Smith said in its weekly review.
      The dry period means the temporao will be late this year.
      Arrivals for the week ended February 22 were 155,221 bags
  of 60 kilos making a cumulative total for the season of 5.93
  mln against 5.81 at the same stage last year. Again it seems
  that cocoa delivered earlier on consignment was included in the
  arrivals figures.
      Comissaria Smith said there is still some doubt as to how
  much old crop cocoa is still available as harvesting has
  practically come to an end. With total Bahia crop estimates
  around 6.4 mln bags and sales standing at almost 6.2 mln there
  are a few hundred thousand bags still in the hands of farmers,
  middlemen, exporters and processors.
      There are doubt

Starting from the original text, we preprocess the documents by removing stopwords (for English) and converting the corpus into a Term Frequency - Inverse Document Frequency (TF-IDF) matrix. Each document is then represented by a TF-IDF vector.
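
To make the weighting concrete, here is a minimal plain-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is what `TfidfVectorizer` uses with its default settings. The toy sentences, the tiny stopword set and the helper names are illustrative only, and tokenization is a plain whitespace split rather than scikit-learn's tokenizer.

```python
import math
from collections import Counter

# Toy corpus and stopword set (illustrative assumptions, not from Reuters)
toy_docs = ["the cocoa crop is late", "the cocoa price rose", "the oil price fell"]
toy_stopwords = {"the", "is"}

def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in toy_stopwords]

tokenized = [tokenize(doc) for doc in toy_docs]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

vectors = [tfidf_vector(doc) for doc in tokenized]
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term:>6}: {weight:.3f}")
```

In the first toy document, the terms that occur in only one document (`crop`, `late`) receive a larger weight than `cocoa`, which appears in two of the three documents.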

Note that the corpus vocabulary is built from the training set only. Never include the test set: otherwise the model would learn the vocabulary of the test set, which amounts to *cheating*.


In [60]:
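# Build the vectorizer; English stopwords are dropped during tokenization
tfidf = TfidfVectorizer(stop_words=stopwords)

# Learn the vocabulary and IDF weights from the training set only
tfidf.fit(train_docs)

# Project both splits onto the training vocabulary
tfidf_training = tfidf.transform(train_docs)
tfidf_test = tfidf.transform(test_docs)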
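
The effect of fitting on the training set only can be illustrated with a plain-Python sketch: words that appear only in the test documents have no column in the vocabulary and are simply ignored at transform time. The helper names `fit_vocabulary` and `transform_counts` are hypothetical stand-ins for the `fit` and `transform` steps, and raw counts replace TF-IDF weights to keep the example short.

```python
def fit_vocabulary(docs):
    """Map each distinct training word to a column index."""
    words = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def transform_counts(docs, vocab):
    """Count words per document, skipping words missing from the vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.lower().split():
            if word in vocab:          # unseen test words are dropped here
                row[vocab[word]] += 1
        rows.append(row)
    return rows

toy_train = ["cocoa crop late", "cocoa price rose"]
toy_test = ["coffee price rose"]      # "coffee" never occurs in training

vocab = fit_vocabulary(toy_train)     # learned from training data only
train_matrix = transform_counts(toy_train, vocab)
test_matrix = transform_counts(toy_test, vocab)

print(vocab)
print(test_matrix)
```

Both matrices share the same number of columns, which is why a classifier trained on the training matrix can be applied to the test matrix.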

Convert the originally multilabel classification problem into binary indicator targets, with one 0/1 column per topic.

In [None]:
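mlb = MultiLabelBinarizer()
# Fit the label set on the training categories only
y_train = mlb.fit_transform([reuters.categories(documento) for documento in train_docs_id])
# Reuse the fitted binarizer so that the test columns match the training columns
y_test = mlb.transform([reuters.categories(documento) for documento in test_docs_id])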


We take only the first topic as the class attribute. Remember that this attribute is binary: the value 1 indicates that the topic is present in a document, and the value 0 indicates that it is absent.

In [None]:
oneclass_train = y_train[:,0]
oneclass_test = y_test[:,0]
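
The two preceding steps can be sketched in plain Python. Conceptually, the binarizer sorts the set of labels seen during fitting and emits one 0/1 column per label; selecting column 0 then yields a binary target for the first label in sorted order. The label lists below are made-up examples and the helper names are hypothetical, not part of scikit-learn.

```python
def fit_classes(label_lists):
    """Collect the sorted set of labels seen during fitting."""
    return sorted({label for labels in label_lists for label in labels})

def binarize(label_lists, classes):
    """Build one 0/1 indicator row per document, one column per class."""
    return [[1 if cls in labels else 0 for cls in classes] for labels in label_lists]

# Made-up multilabel targets (a document may carry several topics)
toy_train_labels = [["cocoa"], ["acq"], ["grain", "wheat"], ["acq", "grain"]]
toy_test_labels = [["wheat"], ["acq"]]

classes = fit_classes(toy_train_labels)          # fitted on training labels only
toy_y_train = binarize(toy_train_labels, classes)
toy_y_test = binarize(toy_test_labels, classes)  # same column order as training

# Column 0 is the binary target for the first class in sorted order
first_topic_train = [row[0] for row in toy_y_train]

print(classes)
print(toy_y_train)
print(first_topic_train)
```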

In [61]:
print(tfidf_test.shape)

print(tfidf_training.shape)

print(oneclass_train)

(7769, 26147)

Create a Support Vector Classifier object and train the model on the training texts, already transformed into a TF-IDF matrix:

In [63]:
clfsvm = SVC(kernel="rbf",gamma='scale',C=1)

clfsvm.fit(tfidf_training,oneclass_train)

SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
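
For intuition about the `kernel="rbf"` and `gamma='scale'` settings, here is a minimal plain-Python sketch of the RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2). It assumes that `gamma='scale'` corresponds to 1 / (n_features * variance of all entries of X), as described in the scikit-learn documentation; the toy vectors and helper names are illustrative.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF similarity between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def scale_gamma(rows):
    """gamma = 1 / (n_features * variance over all entries of the matrix)."""
    values = [v for row in rows for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(rows[0]) * variance)

# Toy document vectors (illustrative, not real TF-IDF rows)
toy_X = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
gamma = scale_gamma(toy_X)

print(round(gamma, 3))
print(rbf_kernel(toy_X[0], toy_X[0], gamma))   # identical vectors give 1.0
print(rbf_kernel(toy_X[0], toy_X[1], gamma) >
      rbf_kernel(toy_X[0], toy_X[2], gamma))   # closer pairs score higher
```

The kernel value decays with the squared distance between two document vectors, so nearby documents have more influence on each other's predicted class than distant ones.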

Classify the test texts, which are in the TF-IDF representation:

In [64]:
y_classified = clfsvm.predict(tfidf_test)

Show the classification results on the test set. Note the class imbalance: 2300 test documents belong to class 0 and only 719 to class 1.

In [65]:
print(classification_report(oneclass_test,y_classified))

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      2300
           1       0.98      0.95      0.97       719

    accuracy                           0.98      3019
   macro avg       0.98      0.97      0.98      3019
weighted avg       0.98      0.98      0.98      3019
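
The columns of the report can be reproduced by hand. The sketch below computes precision, recall and F1 for the positive class from made-up label lists (the helper name and the toy labels are illustrative). It then uses the support counts printed above (2300 negatives, 719 positives) to compute the accuracy of a trivial classifier that always predicts 0, which is the reference point to keep in mind with imbalanced classes.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 4 positives among 10 documents
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = binary_report(toy_true, toy_pred)
print(precision, recall, f1)

# Majority-class baseline from the support column of the report above
negatives, positives = 2300, 719
baseline_accuracy = negatives / (negatives + positives)
print(round(baseline_accuracy, 4))
```

The reported accuracy of 0.98 is therefore well above the roughly 0.76 that always predicting the majority class would give, and the recall of 0.95 for class 1 indicates that the minority class is actually being detected.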

