Final project for the Mentorama course

Student: Rodrigo Martini Riboldi

Project: Fake news classifier

This notebook prepares the data for NLP, runs tests with different classification algorithms, and trains the models once the best parameters have been found.

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the data

## Training data

In [16]:
X_train = pd.read_csv('dados/X_train_fake_news.csv').drop(columns = ['Unnamed: 0'])
X_train.head()

Unnamed: 0,text
0,"Hollywood, Florida (CNN) Donald Trump's new de..."
1,"Bernie Sanders has civil rights credentials, b..."
2,0 comments \nIn case you didn’t already know t...
3,This is how it works in the Clinton Cabal...or...
4,Trump will also meet with retiring Indiana Sen...


Since we are building a fake news classifier, fake news is imported with label 1 and real news with label 0.

In [17]:
y_train = pd.read_csv('dados/y_train_fake_news.csv').drop(columns = ['Unnamed: 0'])
y_train['label'] = y_train['label'].map({'REAL': 0, 'FAKE': 1})
y_train.head()

Unnamed: 0,label
0,0
1,0
2,1
3,1
4,0
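The label encoding above relies on `Series.map`, which silently turns any value missing from the dictionary into `NaN`. A minimal sketch of a sanity check, using toy labels rather than the project files (the misspelled `"Real"` entry is a made-up example of an unexpected value):

```python
import pandas as pd

# Toy labels; "Real" (wrong case) stands in for any unexpected value.
labels = pd.Series(["REAL", "FAKE", "Real"])
mapping = {"REAL": 0, "FAKE": 1}

encoded = labels.map(mapping)        # values missing from the mapping become NaN
unmapped = labels[encoded.isna()]    # rows the mapping did not cover

print(encoded.tolist())
print(unmapped.tolist())
```

Running the same `isna()` check on `y_train.label` and `y_test.label` after mapping would confirm that every row received a 0 or a 1.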


## Test data

In [18]:
X_test = pd.read_csv('dados/X_test_fake_news.csv').drop(columns = ['Unnamed: 0'])
X_test.head()

Unnamed: 0,text
0,U.S. Secretary of State John F. Kerry said Mon...
1,Shocking! Michele Obama & Hillary Caught Glamo...
2,0 \nHillary Clinton has barely just lost the p...
3,"With little fanfare this fall, the New York de..."
4,Mitch McConnell has an unusual admonition for ...


In [19]:
y_test = pd.read_csv('dados/y_test_fake_news.csv').drop(columns = ['Unnamed: 0'])
y_test['label'] = y_test['label'].map({'REAL': 0, 'FAKE': 1})
y_test.head()

Unnamed: 0,label
0,0
1,1
2,1
3,0
4,0


# Preparing the data for NLP

## Creating the bag of words

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')

## Fitting the bag of words

In [21]:
X_train_vectorized = cv.fit_transform(X_train.text.values)

## Saving the bag-of-words model

In [22]:
import pickle

In [23]:
filename = 'modelos/bag_of_words.sav'
pickle.dump(cv, open(filename, 'wb'))
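The `pickle.dump(obj, open(...))` pattern works, but it leaves closing the file to the garbage collector. A possible alternative is a `with` block, sketched below on a stand-in object and a temporary directory (both are assumptions for illustration, not the project's `modelos/` folder):

```python
import os
import pickle
import tempfile

# Stand-in for a fitted object such as the CountVectorizer.
obj = {"vocabulary": {"senator": 0, "vote": 1}}

path = os.path.join(tempfile.mkdtemp(), "bag_of_words.sav")

# The context manager closes the file even if pickling raises.
with open(path, "wb") as f:
    pickle.dump(obj, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == obj)
```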

# Defining our objective

The goal of this project is to build classifiers that label most news articles correctly while keeping the false positive and false negative rates low.

According to research consulted for this project, only 38% of Brazilians can recognize fake news: https://canaltech.com.br/seguranca/brasileiros-nao-sabem-reconhecer-fake-news-diz-pesquisa-160415/

The problem is not very different for Americans, as seen in https://edition.cnn.com/2021/05/31/health/fake-news-study/index.html, https://www.sciencealert.com/most-americans-are-overestimating-their-ability-to-spot-fake-news-survey-finds and https://www.weforum.org/agenda/2022/04/can-you-spot-a-lie-fake-news/

The last article contains the passage "[...] participants in our experiments provided the correct evaluation of veracity only in 62% of the news stories they saw [...]", which gives us an initial target: a classifier with accuracy above 62%.

# Testing the best classifiers

## Using GridSearch to find the best parameters for different classifiers

Grid Search is used as a simple, practical way to reach good metrics without a high computational cost.

Once the best parameters are found, the classifiers are retrained.

Four classifiers were chosen for this project because of their good performance on NLP tasks, ease of use and conceptual simplicity.

The classifiers are:
- Random Forest
- XGBoost
- SVM
- Naive Bayes
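The search procedure described above can be sketched on synthetic data. The example below is only an illustration: `make_classification` stands in for the vectorized news texts, and the absolute value is taken because `MultinomialNB` expects non-negative features. With several scoring metrics, `refit` decides which one selects the best candidate, and `best_score_` reports that metric only; the others can be read from `cv_results_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the bag-of-words matrix (assumption, not project data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = np.abs(X)  # MultinomialNB requires non-negative feature values

grid = GridSearchCV(MultinomialNB(), {"alpha": [0.5, 1, 5, 10]},
                    scoring=["roc_auc", "accuracy"], refit="roc_auc", cv=5)
grid.fit(X, y)

best = grid.best_index_
cv_auc = grid.cv_results_["mean_test_roc_auc"][best]   # same value as best_score_
cv_acc = grid.cv_results_["mean_test_accuracy"][best]  # accuracy of the same candidate

print("best params:", grid.best_params_)
print("mean CV ROC AUC:", cv_auc)
print("mean CV accuracy:", cv_acc)
```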

### Random Forest

In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# defining the base estimator
estimador_base = RandomForestClassifier()

# defining the model's parameter dictionary
# (GridSearchCV fits a fresh clone per candidate, so warm_start should not change the scores here)
params_RF = {"n_estimators":[100,500,1000],
             "criterion":['entropy'],
             "warm_start":[True, False],
             "n_jobs":[-1],
             "random_state":[0]}

In [25]:
grid_rf = GridSearchCV(estimator = estimador_base, 
                    param_grid = params_RF, 
                    scoring = ['roc_auc','accuracy'],
                    refit = 'roc_auc',
                    cv = 5)

grid_rf

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'criterion': ['entropy'],
                         'n_estimators': [100, 500, 1000], 'n_jobs': [-1],
                         'random_state': [0], 'warm_start': [True, False]},
             refit='roc_auc', scoring=['roc_auc', 'accuracy'])

In [26]:
# fitting the models in the grid
grid_rf.fit(X_train_vectorized, y_train.values.ravel())

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'criterion': ['entropy'],
                         'n_estimators': [100, 500, 1000], 'n_jobs': [-1],
                         'random_state': [0], 'warm_start': [True, False]},
             refit='roc_auc', scoring=['roc_auc', 'accuracy'])

In [27]:
grid_rf.best_params_

{'criterion': 'entropy',
 'n_estimators': 1000,
 'n_jobs': -1,
 'random_state': 0,
 'warm_start': True}

In [28]:
grid_rf.best_score_

0.9644025273776273

In [29]:
grid_rf.best_estimator_

RandomForestClassifier(criterion='entropy', n_estimators=1000, n_jobs=-1,
                       random_state=0, warm_start=True)

In [30]:
grid_rf.refit_time_

3.8606815338134766

In [31]:
filename = 'modelos/grid_rf.sav'
pickle.dump(grid_rf, open(filename, 'wb'))

### XGBoost

In [32]:
import xgboost

In [33]:
# defining the base estimator
estimador_base = xgboost.XGBClassifier()

# defining the model's parameter dictionary
params_xgb = {"n_estimators":[10,50,100],
              "n_jobs":[-1],
              "random_state":[0]}

In [34]:
grid_xgb = GridSearchCV(estimator = estimador_base, 
                    param_grid = params_xgb, 
                    scoring = ['roc_auc','accuracy'],
                    refit = 'roc_auc',
                    cv = 5)

grid_xgb

GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     feature_types=None, gamma=None,
                                     gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None,...
                                     max_cat_threshold=None,
                                     max_cat_to_onehot=None,
                                     max_delta_step=None, max_depth=None,
                                     max_leaves=Non

In [35]:
# fitting the models in the grid
grid_xgb.fit(X_train_vectorized, y_train.values.ravel())

GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     feature_types=None, gamma=None,
                                     gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None,...
                                     max_cat_threshold=None,
                                     max_cat_to_onehot=None,
                                     max_delta_step=None, max_depth=None,
                                     max_leaves=Non

In [36]:
grid_xgb.best_params_

{'n_estimators': 100, 'n_jobs': -1, 'random_state': 0}

In [37]:
grid_xgb.best_score_

0.9716430066827471

In [38]:
grid_xgb.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0, ...)

In [39]:
grid_xgb.refit_time_

1.1437013149261475

In [40]:
filename = 'modelos/grid_xgb.sav'
pickle.dump(grid_xgb, open(filename, 'wb'))

### SVM

In [41]:
from sklearn.svm import SVC

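# defining the base estimator
estimador_base = SVC()

# defining the model's parameter dictionary
# NOTE: 'rfb' is a typo for 'rbf' and makes those fits fail; it is kept because the outputs below come from this run
params_SVC = {"kernel":['linear','sigmoid','poly','rfb'],
             "random_state":[0]}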

In [42]:
grid_SVC = GridSearchCV(estimator = estimador_base, 
                    param_grid = params_SVC, 
                    scoring = ['roc_auc','accuracy'],
                    refit = 'roc_auc',
                    cv = 5)

grid_SVC

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'kernel': ['linear', 'sigmoid', 'poly', 'rfb'],
                         'random_state': [0]},
             refit='roc_auc', scoring=['roc_auc', 'accuracy'])

In [43]:
# fitting the models in the grid
grid_SVC.fit(X_train_vectorized, y_train.values.ravel())

5 fits failed out of a total of 20.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\rodri\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\rodri\anaconda3\lib\site-packages\sklearn\svm\_base.py", line 255, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "C:\Users\rodri\anaconda3\lib\site-packages\sklearn\svm\_base.py", line 342, in _sparse_fit
    kernel_type = self._sparse_kernels.index(kernel)
ValueError: 'rfb' is not in list



GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'kernel': ['linear', 'sigmoid', 'poly', 'rfb'],
                         'random_state': [0]},
             refit='roc_auc', scoring=['roc_auc', 'accuracy'])

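The only warning during training comes from the kernel name 'rfb', a typo for 'rbf': its 5 fits failed, so the RBF kernel was never evaluated and the best parameters were chosen among the linear, sigmoid and poly kernels only. The warning will not recur when retraining with the best parameters.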
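A minimal sketch of the same search with the kernel name spelled `'rbf'`, run on small synthetic data rather than the news corpus (the dataset and its size are assumptions chosen to keep the example fast). Setting `error_score='raise'`, as the warning suggests, makes an invalid parameter stop the search immediately instead of producing NaN scores.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, stored sparse like a bag-of-words matrix.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
X_sparse = csr_matrix(X)

params = {"kernel": ["linear", "sigmoid", "poly", "rbf"], "random_state": [0]}

grid = GridSearchCV(SVC(), params, scoring=["roc_auc", "accuracy"],
                    refit="roc_auc", cv=5, error_score="raise")
grid.fit(X_sparse, y)

# One mean ROC AUC per kernel; none should be NaN now.
kernel_scores = dict(zip(grid.cv_results_["param_kernel"],
                         grid.cv_results_["mean_test_roc_auc"]))
for kernel, score in kernel_scores.items():
    print(kernel, round(float(score), 3))
```

Whether the RBF kernel would beat the linear one on the actual news data is not known from this notebook; the sketch only shows that all four candidates get scored.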

In [44]:
grid_SVC.best_params_

{'kernel': 'linear', 'random_state': 0}

In [45]:
grid_SVC.best_score_

0.9254248041104288

In [46]:
grid_SVC.best_estimator_

SVC(kernel='linear', random_state=0)

In [47]:
grid_SVC.refit_time_

6.100074529647827

In [48]:
filename = 'modelos/grid_SVC.sav'
pickle.dump(grid_SVC, open(filename, 'wb'))

### Naive Bayes

In [49]:
from sklearn.naive_bayes import MultinomialNB

# defining the base estimator
estimador_base = MultinomialNB()

# defining the model's parameter dictionary
params_NB = {"alpha":[0.5, 1, 5, 10]}

In [50]:
grid_NB = GridSearchCV(estimator = estimador_base, 
                    param_grid = params_NB, 
                    scoring = ['roc_auc','accuracy'],
                    refit = 'roc_auc',
                    cv = 5)

grid_NB

GridSearchCV(cv=5, estimator=MultinomialNB(),
             param_grid={'alpha': [0.5, 1, 5, 10]}, refit='roc_auc',
             scoring=['roc_auc', 'accuracy'])

In [51]:
# fitting the models in the grid
grid_NB.fit(X_train_vectorized, y_train.values.ravel())

GridSearchCV(cv=5, estimator=MultinomialNB(),
             param_grid={'alpha': [0.5, 1, 5, 10]}, refit='roc_auc',
             scoring=['roc_auc', 'accuracy'])

In [52]:
grid_NB.best_params_

{'alpha': 0.5}

In [53]:
grid_NB.best_score_

0.9415199745725001

In [54]:
grid_NB.best_estimator_

MultinomialNB(alpha=0.5)

In [55]:
grid_NB.refit_time_

0.00550079345703125

In [56]:
filename = 'modelos/grid_NB.sav'
pickle.dump(grid_NB, open(filename, 'wb'))

============

All the models tested achieved a high cross-validated ROC AUC (the metric reported by best_score_, since refit='roc_auc'). We still need to check against the test and validation data that this is not overfitting.

# Preparing the test data

In [57]:
# Loading the bag of words
cv = pickle.load(open('modelos/bag_of_words.sav', 'rb'))

In [58]:
X_test_vectorized = cv.transform(X_test.text.values)
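The test texts go through `transform` only, never `fit_transform`, so they are projected onto the vocabulary learned from the training data. A small sketch with invented sentences shows the consequence: words that never appeared in training are simply ignored, and the column count stays the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentences; "ballots" and "demands" never occur in the training docs.
train_docs = ["senator wins vote", "vote recount ordered"]
test_docs = ["senator demands recount of ballots"]

toy_cv = CountVectorizer(stop_words="english")
toy_cv.fit(train_docs)                      # vocabulary comes from training only

test_counts = toy_cv.transform(test_docs)   # reuse that vocabulary, do not refit

print(sorted(toy_cv.vocabulary_))
print(test_counts.toarray())
```

This keeps the feature columns aligned with what the classifiers saw during training.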

## Retraining and testing the classifiers with the best parameters found

Each classifier is retrained with its best parameters and its metrics are checked on the test data.

### Random Forest

In [59]:
rf_classifier = RandomForestClassifier(n_estimators = 1000,
                                       criterion = 'entropy',
                                       n_jobs = -1,
                                       warm_start = True,
                                       random_state = 0)

rf_classifier.fit(X_train_vectorized,y_train.values.ravel())

RandomForestClassifier(criterion='entropy', n_estimators=1000, n_jobs=-1,
                       random_state=0, warm_start=True)

In [60]:
filename = 'modelos/random_forrest.sav'
pickle.dump(rf_classifier, open(filename, 'wb'))

In [61]:
rf_predict = rf_classifier.predict(X_test_vectorized)

In [62]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score # The recall is intuitively the ability of the classifier to find all the positive samples.
from sklearn.metrics import precision_score  #The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

In [63]:
print("Accuracy:  ", accuracy_score(y_test, rf_predict))
print("Precision: ", precision_score(y_test, rf_predict))
print("Recall:    ", recall_score(y_test, rf_predict))
print("AUC score: ", roc_auc_score(y_test, rf_predict))

Accuracy:   0.9053254437869822
Precision:  0.916
Recall:     0.89453125
AUC score:  0.9054329556772908
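One caveat about the "AUC score" line: `roc_auc_score` is given the hard 0/1 predictions here, and in that case it reduces to the balanced accuracy (the mean of recall and specificity). A ROC AUC comparable to the grid search scores would usually be computed from continuous scores such as `predict_proba(...)[:, 1]`. The toy numbers below are hypothetical and only illustrate the difference:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.6, 0.45, 0.7, 0.9])  # hypothetical P(fake) per article
labels = (scores >= 0.5).astype(int)                # hard predictions at a 0.5 threshold

auc_from_labels = roc_auc_score(y_true, labels)     # what the cell above measures
auc_from_scores = roc_auc_score(y_true, scores)     # threshold-free ROC AUC
balanced_acc = balanced_accuracy_score(y_true, labels)

print(auc_from_labels, balanced_acc, auc_from_scores)
```

For the retrained models, this would mean passing something like `rf_classifier.predict_proba(X_test_vectorized)[:, 1]` as the second argument; the resulting value is not reported in this notebook.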


In [64]:
print("Confusion Matrix:  ")
confusion_matrix(y_test, rf_predict)

Confusion Matrix:  


array([[460,  42],
       [ 54, 458]], dtype=int64)

The random forest delivered excellent values, far above the 62% target.  
The model classified most news articles correctly and kept false positives and false negatives low (42 and 54 out of 1,014 test articles).
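The printed metrics can be recomputed from the confusion matrix, which also yields the false positive and false negative rates directly. The sketch below uses the random forest counts shown above and assumes scikit-learn's layout `[[TN, FP], [FN, TP]]`, with FAKE (label 1) as the positive class:

```python
import numpy as np

# Random forest confusion matrix on the test set, as printed above.
cm = np.array([[460, 42],
               [54, 458]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)             # share of predicted-fake that is really fake
recall = tp / (tp + fn)                # share of fake articles that were caught
false_positive_rate = fp / (fp + tn)   # real articles flagged as fake
false_negative_rate = fn / (fn + tp)   # fake articles that slipped through

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"FPR={false_positive_rate:.4f} FNR={false_negative_rate:.4f}")
```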

### XGBoost

In [65]:
xgb_classifier = xgboost.XGBClassifier(n_estimators = 100,
                                       n_jobs = -1,
                                       random_state = 0)

xgb_classifier.fit(X_train_vectorized,y_train.values.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0, ...)

In [66]:
filename = 'modelos/xgboost.sav'
pickle.dump(xgb_classifier, open(filename, 'wb'))

In [67]:
xgb_predict = xgb_classifier.predict(X_test_vectorized)

In [68]:
print("Accuracy:  ", accuracy_score(y_test, xgb_predict))
print("Precision: ", precision_score(y_test, xgb_predict))
print("Recall:    ", recall_score(y_test, xgb_predict))
print("AUC score: ", roc_auc_score(y_test, xgb_predict))

Accuracy:   0.9250493096646942
Precision:  0.9192307692307692
Recall:     0.93359375
AUC score:  0.9249642056772909


In [69]:
print("Confusion Matrix:  ")
confusion_matrix(y_test, xgb_predict)

Confusion Matrix:  


array([[460,  42],
       [ 34, 478]], dtype=int64)

XGBoost also delivered excellent values, far above the 62% target.  
The model classified most news articles correctly and kept false positives and false negatives low (42 and 34 out of 1,014 test articles).

### SVM

In [70]:
svm_classifier = SVC(kernel = 'linear', probability=True)

svm_classifier.fit(X_train_vectorized,y_train.values.ravel())

SVC(kernel='linear', probability=True)

In [71]:
filename = 'modelos/svm.sav'
pickle.dump(svm_classifier, open(filename, 'wb'))

In [72]:
svm_predict = svm_classifier.predict(X_test_vectorized)

In [73]:
print("Accuracy:  ", accuracy_score(y_test, svm_predict))
print("Precision: ", precision_score(y_test, svm_predict))
print("Recall:    ", recall_score(y_test, svm_predict))
print("AUC score: ", roc_auc_score(y_test, svm_predict))

Accuracy:   0.863905325443787
Precision:  0.8638132295719845
Recall:     0.8671875
AUC score:  0.8638726344621515


In [74]:
print("Confusion Matrix:  ")
confusion_matrix(y_test, svm_predict)

Confusion Matrix:  


array([[432,  70],
       [ 68, 444]], dtype=int64)

SVM is the worst model so far, but it still delivered very good values, far above the 62% target.  
The model classified most news articles correctly and kept false positives and false negatives at a reasonable level (70 and 68 out of 1,014 test articles).

### Naive Bayes

In [75]:
nb_classifier = MultinomialNB(alpha = 0.5)

nb_classifier.fit(X_train_vectorized,y_train.values.ravel())

MultinomialNB(alpha=0.5)

In [76]:
filename = 'modelos/naive_bayes.sav'
pickle.dump(nb_classifier, open(filename, 'wb'))

In [77]:
nb_predict = nb_classifier.predict(X_test_vectorized)

In [78]:
print("Accuracy:  ", accuracy_score(y_test, nb_predict))
print("Precision: ", precision_score(y_test, nb_predict))
print("Recall:    ", recall_score(y_test, nb_predict))
print("AUC score: ", roc_auc_score(y_test, nb_predict))

Accuracy:   0.8767258382642998
Precision:  0.9125799573560768
Recall:     0.8359375
AUC score:  0.8771320966135457


In [79]:
print("Confusion Matrix:  ")
confusion_matrix(y_test, nb_predict)

Confusion Matrix:  


array([[461,  41],
       [ 84, 428]], dtype=int64)

Naive Bayes came close to SVM (accuracy 0.877 versus 0.864) and likewise delivered very good values, far above the 62% target.  
The model classified most news articles correctly, with few false positives (41) but the most false negatives of the four models (84 out of 512 fake articles).

=================

All four chosen classifiers performed well on the test data and will now be tested on the validation data, simulating a real application.
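As a compact recap, the four test-set confusion matrices reported in this notebook can be ranked by accuracy and compared with the 62% target. This is a plain-Python sketch; the dictionary and helper names are illustrative.

```python
# Test-set confusion matrices reported above, as [[TN, FP], [FN, TP]].
confusion = {
    "Random Forest": [[460, 42], [54, 458]],
    "XGBoost":       [[460, 42], [34, 478]],
    "SVM":           [[432, 70], [68, 444]],
    "Naive Bayes":   [[461, 41], [84, 428]],
}
TARGET = 0.62  # human accuracy quoted in the objective section

def accuracy(cm):
    """Fraction of correctly classified articles in a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)

# Best model first.
ranking = sorted(confusion, key=lambda name: accuracy(confusion[name]), reverse=True)
for name in ranking:
    acc = accuracy(confusion[name])
    print(f"{name:<14} accuracy={acc:.4f} above_target={acc > TARGET}")
```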