## Topic Modeling of Financial News

In [23]:
# Imports
import numpy as np
from nltk.corpus import stopwords
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score

In [24]:
# Download the NLTK stop-word list (does nothing if it is already installed)
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\theyl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [25]:
# Loading the data
noticias = load_files('dados', encoding = 'utf-8', decode_error = 'replace')
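
load_files() expects one subfolder per category inside the container folder and uses the folder names as labels. The sketch below builds such a layout in a temporary directory to show what the call returns. The category names and texts are hypothetical and say nothing about the actual contents of 'dados'.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical categories and texts, used only for illustration.
samples = {
    "business": ["Shares rose after the earnings report."],
    "tech": ["The new chip doubles battery life."],
}

with tempfile.TemporaryDirectory() as root:
    # One subfolder per category, one text file per document.
    for category, texts in samples.items():
        folder = Path(root) / category
        folder.mkdir()
        for n, text in enumerate(texts):
            (folder / f"{n}.txt").write_text(text, encoding="utf-8")

    bunch = load_files(root, encoding="utf-8", decode_error="replace")

# target holds integer labels; target_names maps them back to folder names.
target_names = list(bunch.target_names)
labels = [target_names[t] for t in bunch.target]
print(target_names)
print(list(zip(labels, bunch.data)))
```

Documents are shuffled by default, so the order of `bunch.data` need not follow the folder order, but `bunch.target` stays aligned with it.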

In [26]:
# Separating input and output variables
X = noticias.data
y = noticias.target

In [27]:
# Dictionary with one empty list per voting mode; each list will hold (random_state, accuracy) tuples
d = {}
d['soft'] = []
d['hard'] = []

## Creating the models

1. An outer for loop runs nine times, with random_state values from 1 to 9. In each iteration, we split the data into training (70%) and testing (30%) sets using the train_test_split() function from the sklearn.model_selection module.

2. Then, we load the English stop-word list from nltk.corpus.stopwords to eliminate words that are so commonly used that they carry very little useful information.

3. Next, we create a vectorizer with the TfidfVectorizer class from the sklearn.feature_extraction.text module, which transforms the text into a numerical representation. The vocabulary is limited to the 1,000 most frequent terms and the vectors are left unnormalized (norm = None).

4. We apply the vectorization to the input data: fit_transform() learns the vocabulary and the IDF weights from the training set, and transform() reuses them on the testing set, so no information from the testing set leaks into the features.

5. We create three base models: a Logistic Regression model, a Random Forest model, and a Multinomial Naive Bayes model.

6. Inside the outer loop, an inner for loop runs twice: once for a "soft" voting classifier and once for a "hard" voting classifier.

7. In each inner iteration, we create a voting classifier with the VotingClassifier class from the sklearn.ensemble module. It combines the three base models using the voting mode given by the loop: soft voting averages the predicted class probabilities, while hard voting takes a majority vote over the predicted labels.

8. We fit the voting classifier on the vectorized training set using the fit() method.

9. We predict the categories of the vectorized testing set using the fitted voting classifier and the predict() method.

10. We calculate and print the accuracy of the predictions using the accuracy_score() function from the sklearn.metrics module.

11. We append a (random_state, accuracy) tuple to the list stored in the dictionary d under the corresponding voting mode.

12. After the outer for loop completes, we select, for each voting mode, the tuple with the highest accuracy using the max() function with its key argument, and print it.
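
To make step 3 more concrete, here is a minimal hand computation of the weights that TfidfVectorizer produces with norm = None and its default smoothing: the raw term count multiplied by ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. The toy documents are hypothetical and assumed to be already tokenized, lowercased and free of stop words.

```python
import math
from collections import Counter

# Hypothetical toy corpus, made up for illustration.
docs = [
    ["market", "stocks", "rally"],
    ["market", "bank", "rates"],
    ["bank", "rates", "rates"],
]

n_docs = len(docs)

# Document frequency: number of documents in which each term occurs.
df = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Unnormalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    return {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in counts.items()
    }


weights = [tfidf(doc) for doc in docs]
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

Terms that occur in fewer documents ("stocks") get a larger weight than terms shared across documents ("market"), and because no normalization is applied, repeating a term within a document ("rates") scales its weight linearly.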

In [29]:
# Repeat the experiment with nine different train/test splits
for x in range(1,10):

    # Train/test split
    X_treino, X_teste, y_treino, y_teste = train_test_split(X, y, test_size=0.30, random_state = x)

    # Stop words (kept as a list, which recent scikit-learn versions require for stop_words)
    my_stop_words = stopwords.words('english')

    # Vectorization
    vectorizer = TfidfVectorizer(norm = None, stop_words = my_stop_words, max_features = 1000, decode_error = "ignore")

    # Apply the vectorization (fit on the training set only)
    X_treino_vectors = vectorizer.fit_transform(X_treino)
    X_teste_vectors = vectorizer.transform(X_teste)

    # Create the base models
    modelo1 = LogisticRegression(multi_class = 'multinomial', solver = 'lbfgs', random_state = 30, max_iter = 1000)
    modelo2 = RandomForestClassifier(n_estimators = 1000, max_depth = 100, random_state = 1)
    modelo3 = MultinomialNB()

    # Train and evaluate one ensemble per voting mode
    for i in ['soft','hard']:
        voting_model = VotingClassifier(estimators = [ ('lg', modelo1), ('rf', modelo2), ('nb', modelo3)], voting = i)
        voting_model = voting_model.fit(X_treino_vectors, y_treino)
        previsoes = voting_model.predict(X_teste_vectors)

        # Compute the accuracy once, then print and store it
        acuracia = accuracy_score(y_teste, previsoes)
        print('-Random State:', x, '-Voting:', i, '-Accuracy :', acuracia)
        d[i].append((x, acuracia))
    

-Random State: 1 -Voting: soft -Accuracy : 0.968562874251497
-Random State: 1 -Voting: hard -Accuracy : 0.9670658682634731
-Random State: 2 -Voting: soft -Accuracy : 0.9730538922155688
-Random State: 2 -Voting: hard -Accuracy : 0.9775449101796407
-Random State: 3 -Voting: soft -Accuracy : 0.9745508982035929
-Random State: 3 -Voting: hard -Accuracy : 0.9790419161676647
-Random State: 4 -Voting: soft -Accuracy : 0.9700598802395209
-Random State: 4 -Voting: hard -Accuracy : 0.9745508982035929
-Random State: 5 -Voting: soft -Accuracy : 0.9625748502994012
-Random State: 5 -Voting: hard -Accuracy : 0.9670658682634731
-Random State: 6 -Voting: soft -Accuracy : 0.9565868263473054
-Random State: 6 -Voting: hard -Accuracy : 0.9655688622754491
-Random State: 7 -Voting: soft -Accuracy : 0.9670658682634731
-Random State: 7 -Voting: hard -Accuracy : 0.9655688622754491
-Random State: 8 -Voting: soft -Accuracy : 0.9760479041916168
-Random State: 8 -Voting: hard -Accuracy : 0.9835329341317365
-Random S
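
The soft and hard rows in the output above can differ because the two modes combine the base models differently. The sketch below reproduces both rules by hand for a single sample, using made-up class probabilities for the three estimators. It illustrates the idea rather than scikit-learn's actual code, and it assumes each model's predicted label is simply its most probable class.

```python
# Hypothetical class probabilities from the three base models for one article
# (two classes, indexed 0 and 1). These numbers are invented for illustration.
probas = {
    "lg": [0.45, 0.55],
    "rf": [0.40, 0.60],
    "nb": [0.90, 0.10],
}


def argmax(values):
    """Index of the largest value; the lowest index wins ties."""
    return max(range(len(values)), key=lambda i: values[i])


def hard_vote(probas):
    """Each model votes for its most probable class; the majority wins."""
    votes = [argmax(p) for p in probas.values()]
    n_classes = len(next(iter(probas.values())))
    counts = [votes.count(c) for c in range(n_classes)]
    return argmax(counts)


def soft_vote(probas):
    """Average the probabilities per class, then pick the largest mean."""
    rows = list(probas.values())
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return argmax(means)


hard = hard_vote(probas)
soft = soft_vote(probas)
print("hard:", hard, "soft:", soft)
```

Here two models lean slightly towards class 1, so hard voting picks class 1, while the third model's strong confidence in class 0 pulls the averaged probabilities towards class 0 under soft voting.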

In [30]:
# Extracting best results
h = max(d['hard'], key = lambda x:x[1])
s = max(d['soft'], key = lambda x:x[1])
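
As a small illustration of the key argument used in the cell above, the sketch below applies max() to a list of hypothetical (random_state, accuracy) tuples. The values are invented and are not results from this notebook.

```python
# Hypothetical (random_state, accuracy) pairs, for illustration only.
scores = [(1, 0.91), (2, 0.95), (3, 0.95), (4, 0.89)]

# key tells max() to compare the tuples by their second element (accuracy).
best = max(scores, key=lambda pair: pair[1])

# Without key, tuples are compared element by element, starting with the
# random state, which is not what we want here.
largest_seed = max(scores)

print(best)
print(largest_seed)
```

When several entries share the highest accuracy, max() returns the first one it encounters, so ties are resolved in favor of the lowest random state in this list.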


Best results:


In [32]:
print('\nBest Results:')
print('-Random State:',h[0], '-Voting:hard', '-Accuracy:', h[1])
print('-Random State:',s[0], '-Voting:soft', '-Accuracy:', s[1])



Best Results:
-Random State: 8 -Voting:hard -Accuracy: 0.9835329341317365
-Random State: 9 -Voting:soft -Accuracy: 0.9805389221556886


