Final project for the Mentorama course

Student: Rodrigo Martini Riboldi

Project: Fake news classifier

This notebook validates the classifiers and analyzes their performance, simulating a production environment.

In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

# Loading the models

In [48]:
# Loading the bag-of-words vectorizer
with open('modelos/bag_of_words.sav', 'rb') as f:
    cv = pickle.load(f)

In [49]:
# Loading the models (each file is closed as soon as it has been read)
with open('modelos/random_forrest.sav', 'rb') as f:
    rf_classifier = pickle.load(f)
with open('modelos/xgboost.sav', 'rb') as f:
    xgb_classifier = pickle.load(f)
with open('modelos/svm.sav', 'rb') as f:
    svm_classifier = pickle.load(f)
with open('modelos/naive_bayes.sav', 'rb') as f:
    nb_classifier = pickle.load(f)

# Loading the validation data

In [50]:
df_validation = pd.read_csv('dados/df_validation.csv').drop(columns = 'Unnamed: 0')

In [51]:
# Encode the labels: REAL -> 0, FAKE -> 1 (FAKE is the positive class)
df_validation['y_validation'] = df_validation.label.map({'REAL': 0, 'FAKE': 1})

## Preparing the news texts for the classifiers

In [52]:
# Reuse the already fitted bag of words: transform only, no refitting
X_validation_vectorized = cv.transform(df_validation.text.values)

## Classifying the news

In [53]:
df_validation['rf_validation'] = rf_classifier.predict(X_validation_vectorized)

In [54]:
df_validation['xgb_validation'] = xgb_classifier.predict(X_validation_vectorized)

In [55]:
df_validation['svm_validation'] = svm_classifier.predict(X_validation_vectorized)

In [56]:
df_validation['nb_validation'] = nb_classifier.predict(X_validation_vectorized)

In [57]:
df_validation

index,title,text,label,y_validation,rf_validation,xgb_validation,svm_validation,nb_validation
0,Pentagon weighs using force to protect US-back...,Senior U.S. military leaders and defense offic...,REAL,0,0,0,0,0
1,UK announces new troop deployment near Russia'...,Military British Defense Secretary Michael Fal...,FAKE,1,1,1,1,1
2,Dog Waited Faithfully For Over A Month After H...,There have been horror stories of families lea...,FAKE,1,1,1,1,1
3,Defense Secretary Carter endorses 3-year timel...,Defense Secretary Ash Carter on Wednesday endo...,REAL,0,0,0,0,0
4,Hilarious Cartoon Reveals 2016 Political Versi...,Pinterest \nC.E. Dyer writes that the disaster...,FAKE,1,0,0,1,0
...,...,...,...,...,...,...,...,...
1262,"Architect Of Paris Attacks Was Killed In Raid,...","Architect Of Paris Attacks Was Killed In Raid,...",REAL,0,0,0,0,0
1263,Investment Strategist Forecasts Collapse Timel...,Home » Headlines » Finance News » Investment S...,FAKE,1,1,1,1,1
1264,A Message to my Fellow Republicans,As the unfolding Clinton email story plays out...,REAL,0,0,0,1,1
1265,Donald Trump: 'I'm not flip-flopping' on immig...,Washington (CNN) It's still undecided whether ...,REAL,0,0,0,0,0


## Computing metrics on the validation data

In [58]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score # The recall is intuitively the ability of the classifier to find all the positive samples.
from sklearn.metrics import precision_score  #The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

In [59]:
print("Accuracy:  ", accuracy_score(df_validation.y_validation, df_validation.rf_validation))
print("Precision: ", precision_score(df_validation.y_validation, df_validation.rf_validation))
print("Recall:    ", recall_score(df_validation.y_validation, df_validation.rf_validation))
print("AUC score: ", roc_auc_score(df_validation.y_validation, df_validation.rf_validation))

Accuracy:   0.9084451460142068
Precision:  0.9321486268174475
Recall:     0.8863287250384024
AUC score:  0.9090734534282922


In [60]:
model_metrics = dict()
model_metrics['RF Accuracy'] = accuracy_score(df_validation.y_validation, df_validation.rf_validation)
model_metrics['RF Precision'] = precision_score(df_validation.y_validation, df_validation.rf_validation)
model_metrics['RF Recall'] = recall_score(df_validation.y_validation, df_validation.rf_validation)
model_metrics['RF AUC score'] = roc_auc_score(df_validation.y_validation, df_validation.rf_validation)

In [61]:
print("Confusion Matrix:  ")
confusion_matrix(df_validation.y_validation, df_validation.rf_validation)

Confusion Matrix:  


array([[574,  42],
       [ 74, 577]], dtype=int64)
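
As a sanity check, the Random Forest metrics printed earlier can be re-derived by hand from this confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, with FAKE = 1 as the positive class); the variable names are illustrative. It also shows that, because `roc_auc_score` was given hard 0/1 predictions rather than probabilities, the reported AUC equals the mean of recall and specificity.

```python
# Random Forest confusion matrix copied from the output above.
# Assumed layout (scikit-learn convention): [[TN, FP], [FN, TP]]
tn, fp = 574, 42
fn, tp = 74, 577

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # share of predicted FAKE that is really FAKE
recall = tp / (tp + fn)         # true positive rate
specificity = tn / (tn + fp)    # true negative rate

# With hard 0/1 predictions the ROC curve has a single interior point,
# so the area under it is the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(f"Accuracy:  {accuracy:.4f}")         # Accuracy:  0.9084
print(f"Precision: {precision:.4f}")        # Precision: 0.9321
print(f"Recall:    {recall:.4f}")           # Recall:    0.8863
print(f"AUC score: {auc_from_labels:.4f}")  # AUC score: 0.9091
```

The same arithmetic applies to the other three confusion matrices. A threshold-independent AUC would require continuous scores (for example `predict_proba` output, for the classifiers that provide it) instead of predicted labels.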

In [62]:
print("Accuracy:  ", accuracy_score(df_validation.y_validation, df_validation.xgb_validation))
print("Precision: ", precision_score(df_validation.y_validation, df_validation.xgb_validation))
print("Recall:    ", recall_score(df_validation.y_validation, df_validation.xgb_validation))
print("AUC score: ", roc_auc_score(df_validation.y_validation, df_validation.xgb_validation))

Accuracy:   0.9210734017363852
Precision:  0.9271317829457364
Recall:     0.9185867895545314
AUC score:  0.9211440441279151


In [63]:
model_metrics['XGBoost Accuracy'] = accuracy_score(df_validation.y_validation, df_validation.xgb_validation)
model_metrics['XGBoost Precision'] = precision_score(df_validation.y_validation, df_validation.xgb_validation)
model_metrics['XGBoost Recall'] = recall_score(df_validation.y_validation, df_validation.xgb_validation)
model_metrics['XGBoost AUC score'] = roc_auc_score(df_validation.y_validation, df_validation.xgb_validation)

In [64]:
print("Confusion Matrix:  ")
confusion_matrix(df_validation.y_validation, df_validation.xgb_validation)

Confusion Matrix:  


array([[569,  47],
       [ 53, 598]], dtype=int64)

In [65]:
print("Accuracy:  ", accuracy_score(df_validation.y_validation, df_validation.svm_validation))
print("Precision: ", precision_score(df_validation.y_validation, df_validation.svm_validation))
print("Recall:    ", recall_score(df_validation.y_validation, df_validation.svm_validation))
print("AUC score: ", roc_auc_score(df_validation.y_validation, df_validation.svm_validation))

Accuracy:   0.861878453038674
Precision:  0.8777777777777778
Recall:     0.8494623655913979
AUC score:  0.862231182795699


In [66]:
model_metrics['SVM Accuracy'] = accuracy_score(df_validation.y_validation, df_validation.svm_validation)
model_metrics['SVM Precision'] = precision_score(df_validation.y_validation, df_validation.svm_validation)
model_metrics['SVM Recall'] = recall_score(df_validation.y_validation, df_validation.svm_validation)
model_metrics['SVM AUC score'] = roc_auc_score(df_validation.y_validation, df_validation.svm_validation)

In [67]:
print("Confusion Matrix:  ")
confusion_matrix(df_validation.y_validation, df_validation.svm_validation)

Confusion Matrix:  


array([[539,  77],
       [ 98, 553]], dtype=int64)

In [68]:
print("Accuracy:  ", accuracy_score(df_validation.y_validation, df_validation.nb_validation))
print("Precision: ", precision_score(df_validation.y_validation, df_validation.nb_validation))
print("Recall:    ", recall_score(df_validation.y_validation, df_validation.nb_validation))
print("AUC score: ", roc_auc_score(df_validation.y_validation, df_validation.nb_validation))

Accuracy:   0.8918705603788477
Precision:  0.9461805555555556
Recall:     0.837173579109063
AUC score:  0.8934244518921939


In [69]:
model_metrics['NB Accuracy'] = accuracy_score(df_validation.y_validation, df_validation.nb_validation)
model_metrics['NB Precision'] = precision_score(df_validation.y_validation, df_validation.nb_validation)
model_metrics['NB Recall'] = recall_score(df_validation.y_validation, df_validation.nb_validation)
model_metrics['NB AUC score'] = roc_auc_score(df_validation.y_validation, df_validation.nb_validation)

In [70]:
print("Confusion Matrix:  ")
confusion_matrix(df_validation.y_validation, df_validation.nb_validation)

Confusion Matrix:  


array([[585,  31],
       [106, 545]], dtype=int64)

In [71]:
model_metrics

{'RF Accuracy': 0.9084451460142068,
 'RF Precision': 0.9321486268174475,
 'RF Recall': 0.8863287250384024,
 'RF AUC score': 0.9090734534282922,
 'XGBoost Accuracy': 0.9210734017363852,
 'XGBoost Precision': 0.9271317829457364,
 'XGBoost Recall': 0.9185867895545314,
 'XGBoost AUC score': 0.9211440441279151,
 'SVM Accuracy': 0.861878453038674,
 'SVM Precision': 0.8777777777777778,
 'SVM Recall': 0.8494623655913979,
 'SVM AUC score': 0.862231182795699,
 'NB Accuracy': 0.8918705603788477,
 'NB Precision': 0.9461805555555556,
 'NB Recall': 0.837173579109063,
 'NB AUC score': 0.8934244518921939}

In [72]:
import json

with open('metricas/model_metrics.json', 'w') as fp:
    json.dump(model_metrics, fp)

In [73]:
with open('metricas/model_metrics.json', 'r') as openfile:
    
    model_metrics_json = json.load(openfile)
    
model_metrics_json

{'RF Accuracy': 0.9084451460142068,
 'RF Precision': 0.9321486268174475,
 'RF Recall': 0.8863287250384024,
 'RF AUC score': 0.9090734534282922,
 'XGBoost Accuracy': 0.9210734017363852,
 'XGBoost Precision': 0.9271317829457364,
 'XGBoost Recall': 0.9185867895545314,
 'XGBoost AUC score': 0.9211440441279151,
 'SVM Accuracy': 0.861878453038674,
 'SVM Precision': 0.8777777777777778,
 'SVM Recall': 0.8494623655913979,
 'SVM AUC score': 0.862231182795699,
 'NB Accuracy': 0.8918705603788477,
 'NB Precision': 0.9461805555555556,
 'NB Recall': 0.837173579109063,
 'NB AUC score': 0.8934244518921939}

On the validation data, all four classifiers performed well (accuracy between 0.86 and 0.92), replicating the behavior seen on the test data. SVM and Naive Bayes performed slightly below Random Forest and XGBoost.
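
To make this comparison easier to read, the flat `model_metrics` dictionary can be regrouped per model and sorted. This is a minimal sketch using values copied from the output above and rounded to four decimals; `by_model` and `ranking` are hypothetical names, not part of the original notebook.

```python
# Values copied from the model_metrics output above, rounded to 4 decimals.
model_metrics = {
    'RF Accuracy': 0.9084, 'RF Precision': 0.9321,
    'RF Recall': 0.8863, 'RF AUC score': 0.9091,
    'XGBoost Accuracy': 0.9211, 'XGBoost Precision': 0.9271,
    'XGBoost Recall': 0.9186, 'XGBoost AUC score': 0.9211,
    'SVM Accuracy': 0.8619, 'SVM Precision': 0.8778,
    'SVM Recall': 0.8495, 'SVM AUC score': 0.8622,
    'NB Accuracy': 0.8919, 'NB Precision': 0.9462,
    'NB Recall': 0.8372, 'NB AUC score': 0.8934,
}

# Regroup "<model> <metric>" keys into {model: {metric: value}}.
by_model = {}
for key, value in model_metrics.items():
    model, metric = key.split(' ', 1)
    by_model.setdefault(model, {})[metric] = value

# Rank the models from highest to lowest accuracy.
ranking = sorted(by_model, key=lambda m: by_model[m]['Accuracy'], reverse=True)

for model in ranking:
    scores = by_model[model]
    print(f"{model:8s} accuracy={scores['Accuracy']:.4f} "
          f"precision={scores['Precision']:.4f} recall={scores['Recall']:.4f}")
```

Sorted by accuracy, the order is XGBoost, RF, NB, SVM. Naive Bayes has the highest precision but the lowest recall of the four.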

# Analyzing the prediction results

This analysis aims to identify problems and check for behavioral biases.

In [74]:
# Number of models that classified each item as fake (0 to 4)
df_validation['sum_classifications'] = df_validation.rf_validation + df_validation.xgb_validation + df_validation.svm_validation + df_validation.nb_validation

## Real news classified as fake by all algorithms

In [75]:
df_validation.loc[(df_validation.y_validation == 0) & (df_validation.sum_classifications == 4)]

index,title,text,label,y_validation,rf_validation,xgb_validation,svm_validation,nb_validation,sum_classifications
113,Patriots Day 2015: Boston does not stand alone,In the two years since the horrific marathon b...,REAL,0,1,1,1,1,4
211,States with the most people on food stamps,"With grocery bills priced as high as $1,300 pe...",REAL,0,1,1,1,1,4
717,Black faith is under attack: How to make sense...,"When I woke up on Thursday morning, the world ...",REAL,0,1,1,1,1,4
927,Providing Balanced Information Is Not Facebook...,Catherine R. Squires is a professor of communi...,REAL,0,1,1,1,1,4
1152,Donald Trump Call For Immediate Shutdown of Cl...,Donald Trump called this morning for the Clint...,REAL,0,1,1,1,1,4


In [76]:
df_validation.iloc[113].text

'In the two years since the horrific marathon bombing, Boston has been nothing less than resilient. The city has stood defiant and proud even as it painfully relived those grim events during a trial and persevered through a winter of record blizzards. But now, spring has come again, and we remember on this Patriots’ Day that if history is any guide, it holds not only the promise of Boston’s continued steadfastness, but also an affirmation from across America that Boston does not stand alone, and never has.\n\nAs far back as 1775 and predating our nation’s independence, Boston stood strong against those who would do the city harm. Punished for the singular act of dumping tea into Boston Harbor and the far broader initiatives of Massachusetts toward representative government, Boston chafed under the onerous provisions of what rebels called the Intolerable Acts. Chief among them was the Boston Port Bill, which closed Boston Harbor to commerce and effectively strangled the city. The questi

In [77]:
df_validation.iloc[211].text

"With grocery bills priced as high as $1,300 per month as of late, some American workers simply cannot afford all of their groceries on top of everything else they already have to buy. This is why the government offers food stamps.\n\nThe USDA Food and Nutrition Service reports that as of September 2014, there were around 46.5 million individual food stamp recipients (22.7 million households) receiving an average benefit of $123.74 each (around $257 per household).\n\nTo be eligible, a household has to earn a gross income amount that's less than 130% of the poverty level, or a net income amount (gross income minus deductions) that's less than 100% of the poverty level for their family size.\n\nThis means, a single person can be eligible for food stamps if his or her gross monthly income is under $1,265 ($15,180 per year), and a family of four can be eligible if they gross less than $2,584 per month ($31,008 per year). The applicant also can't be a wealthy person who simply doesn't have

Some real news items that use superlative and imperative terms were classified as fake by the algorithms. Fortunately, this happened with only 5 of the 1267 items in the dataset.

## Fake news classified as real by all algorithms

In [78]:
df_validation.loc[(df_validation.y_validation == 1) & (df_validation.sum_classifications == 0)]

index,title,text,label,y_validation,rf_validation,xgb_validation,svm_validation,nb_validation,sum_classifications
117,Trump Is Deadbeating On His Campaign Debts By ...,The Washington Post reported: Donald Trump’s h...,FAKE,1,0,0,0,0,0
471,Migrants FLOOD Into U.S. From Mexico Right Bef...,"With 12 days to go until the election, all eye...",FAKE,1,0,0,0,0,0
495,Suspect captured in ‘ambush-style’ killings of...,Suspect captured in ‘ambush-style’ killings of...,FAKE,1,0,0,0,0,0
616,"Little-Loved by Scholars, Trump Also Gets Litt...","Little-Loved by Scholars, Trump Also Gets Litt...",FAKE,1,0,0,0,0,0
781,Trump Dedicates D.C. Hotel: 'The Future Lies W...,\nRepublican presidential nominee Donald Trum...,FAKE,1,0,0,0,0,0
855,Donald Trump Elected 45th President Of The Uni...,Via AP : \nDonald Trump was elected America’s ...,FAKE,1,0,0,0,0,0
929,Police: Oklahoma Double Murder Suspect Has Hit...,Yahoo News \nA 38-year-old Oklahoma man who ha...,FAKE,1,0,0,0,0,0
938,Fact Check: Democrats Have Created Twice As Ma...,Comments \nDemocrats are better for the econom...,FAKE,1,0,0,0,0,0
1132,Memo to Trump: 'Action This Day!',"=> \n“In victory, magnanimity!” said Winston...",FAKE,1,0,0,0,0,0
1250,National Attention On Ayotte - Hassan (*NH) Se...,"Maggie Hassan, left and Kelly Ayotte Hassan de...",FAKE,1,0,0,0,0,0


In [79]:
df_validation.iloc[117].text

'The Washington Post reported: Donald Trump’s hiring of pollster Tony Fabrizio in May was viewed as a sign that the real estate mogul was finally bringing seasoned operatives into his insurgent operation. \nBut the Republican presidential nominee appears to have taken issue with some of the services provided by the veteran GOP strategist, who has advised candidates from 1996 GOP nominee Bob Dole to Florida Gov. Rick Scott. The Trump campaign’s latest Federal Election Commission report shows that it is disputing nearly $767,000 that Fabrizio’s firm says it is still owed for polling. \nTrump’s decision not to pay his pollster is the first of this type of story but given Trump’s history of not paying for services performed; it won’t be the last. It is astonishing that the party of supposed fiscal responsibility and conservativism would put someone forward as their presidential nominee who has made a career out of running up debt for personal gain. \nDonald Trump’s mentality has always bee

In [80]:
df_validation.iloc[938].text

'Comments \nDemocrats are better for the economy. This statement is not an opinion, but a fact. According to economist Steven Stoft, who created a series of graphs charting job creation under each party over the last 72 years (during which time Democrats and Republicans have held control for 36 years each), Democrats have created 58 million jobs while Republicans can only claim 26 million . \nFor roughly the last century, electing a Democrat has been the better option for the economy, with Dems creating more than double the jobs than that of Republicans, and faster. \nEven when taking the percent change of number of jobs held, or scaling population (to avoid counting an increased population, thus falsely indicating an increase in jobs), Democrats still prove more successful than Republicans in job creation, and by a wide margin. \nAnother way of studying job creation is to take unemployment into account. When a Democrat is in the White House, logically unemployment decreases as well. B

Some fake news items are well written and can pass for real news, fooling the classifiers. Fortunately, this happened with only 10 of the 1267 items in the dataset.

## Checking all combinations

In [81]:
# Count the unique texts for every (true label, number of fake votes) pair
df_validation[['y_validation', 'sum_classifications', 'text']].groupby(['y_validation', 'sum_classifications']).agg({'text':'nunique'})

y_validation,sum_classifications,text
0,0,493
0,1,54
0,2,24
0,3,24
0,4,5
1,0,10
1,1,23
1,2,46
1,3,123
1,4,440


Most real news items were classified correctly by all models, and 75% of all items in the dataset were classified correctly by all models. Only 1.2% were classified incorrectly by all models. These percentages are computed from the unique-text counts in the table above.

As for the fake news items, 69% were classified correctly by all algorithms and 88% by at least three different models. Only 1.6% were classified as real by all of them.
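
The percentages quoted above can be reproduced from the counts in the table. The sketch below copies those counts into a plain dictionary (the name `counts` and the other variables are illustrative). Because the aggregation uses `nunique`, the table counts unique texts, so its total (1242) is lower than the 1267 rows of the validation set.

```python
# (y_validation, sum_classifications) -> number of unique texts,
# copied from the table above.
counts = {
    (0, 0): 493, (0, 1): 54, (0, 2): 24, (0, 3): 24, (0, 4): 5,
    (1, 0): 10, (1, 1): 23, (1, 2): 46, (1, 3): 123, (1, 4): 440,
}

total = sum(counts.values())
real_total = sum(n for (y, _), n in counts.items() if y == 0)
fake_total = sum(n for (y, _), n in counts.items() if y == 1)

# Unanimous outcomes across the four models.
all_correct = counts[(0, 0)] + counts[(1, 4)]
all_wrong = counts[(0, 4)] + counts[(1, 0)]

# Fake items flagged by at least three of the four models.
fake_three_or_more = counts[(1, 3)] + counts[(1, 4)]

print(f"All models correct (all items):  {all_correct / total:.1%}")              # 75.1%
print(f"All models wrong (all items):    {all_wrong / total:.1%}")                # 1.2%
print(f"Real, all models correct:        {counts[(0, 0)] / real_total:.1%}")      # 82.2%
print(f"Fake, all models correct:        {counts[(1, 4)] / fake_total:.1%}")      # 68.5%
print(f"Fake, at least three models:     {fake_three_or_more / fake_total:.1%}")  # 87.7%
print(f"Fake, all models say real:       {counts[(1, 0)] / fake_total:.1%}")      # 1.6%
```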

# Conclusions

The four models appear to complement each other, so showing the results of all of them in an application is a good strategy: state categorically whether the news item may be fake and show the result of each of the four classifiers, along with the probability each model assigns to the item being fake.

The results will be displayed categorically as follows:

- If no model classified it as fake: "This news item is probably not fake."
- If only one model classified it as fake: "This news item has a low chance of being fake."
- If two models classified it as fake: "This news item may be fake."
- If three models classified it as fake: "This news item is very likely to be fake."
- If all models classified it as fake: "This news item is probably fake."
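
The mapping above could be implemented as a simple lookup on the number of models that voted "fake". This is only a sketch: `VERDICTS` and `verdict` are hypothetical names, and the input is assumed to be the four 0/1 predictions in the order RF, XGBoost, SVM, Naive Bayes.

```python
# Message for each possible number of "fake" votes (0 to 4).
VERDICTS = {
    0: "This news item is probably not fake.",
    1: "This news item has a low chance of being fake.",
    2: "This news item may be fake.",
    3: "This news item is very likely to be fake.",
    4: "This news item is probably fake.",
}


def verdict(predictions):
    """Map the four 0/1 model predictions to a categorical message."""
    votes = sum(predictions)  # number of models that predicted FAKE (1)
    return VERDICTS[votes]


example = verdict([1, 0, 1, 1])  # hypothetical predictions: rf, xgb, svm, nb
print(example)  # This news item is very likely to be fake.
```

Keeping the messages in a dictionary keyed by vote count keeps the wording in one place, which makes it easy to adjust the thresholds or texts later.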