<h3 align="center"><font size="15"><b>Fake News detection</b></font></h3> 

<img src="https://www.txstate.edu/cache78a0c25d34508c9d84822109499dee61/imagehandler/scaler/gato-docs.its.txstate.edu/jcr:21b3e33f-31c9-4273-aeb0-5b5886f8bcc4/fake-fact.jpg?mode=fit&width=1600" height=200 width=400>

<br></br>

**Task type:** Classification

**Models used:** LinearSVC, MultinomialNB, XGBoost, PyCaret, CatBoost

**Tools used:** NLP preprocessing tools, semi-supervised learning technique, new feature engineering, Word Cloud

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# 1. Loading data

In [None]:
fake = pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')
fake['flag'] = 0
fake

In [None]:
true = pd.read_csv('../input/fake-and-real-news-dataset/True.csv')
true['flag'] = 1
true

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([true, fake])

# 2. EDA + Data cleaning

**Let's check the datatypes.**

In [None]:
df.info()

**Removing the duplicates and preventing problems with indexing.**

In [None]:
df = df.drop_duplicates()
df = df.reset_index(drop=True)

**The dates come in several formats (for example '19-Feb-18' and abbreviated month names). I will convert them to a single format so that the column can be parsed as a datetime.**

In [None]:
# Correcting some data
df['date'] = df['date'].replace(['19-Feb-18'],'February 19, 2018')
df['date'] = df['date'].replace(['18-Feb-18'],'February 18, 2018')
df['date'] = df['date'].replace(['17-Feb-18'],'February 17, 2018')
df['date'] = df['date'].replace(['16-Feb-18'],'February 16, 2018')
df['date'] = df['date'].replace(['15-Feb-18'],'February 15, 2018')
df['date'] = df['date'].replace(['14-Feb-18'],'February 14, 2018')
df['date'] = df['date'].replace(['13-Feb-18'],'February 13, 2018')


df['date'] = df['date'].str.replace('Dec ', 'December ')
df['date'] = df['date'].str.replace('Nov ', 'November ')
df['date'] = df['date'].str.replace('Oct ', 'October ')
df['date'] = df['date'].str.replace('Sep ', 'September ')
df['date'] = df['date'].str.replace('Aug ', 'August ')
df['date'] = df['date'].str.replace('Jul ', 'July ')
df['date'] = df['date'].str.replace('Jun ', 'June ')
df['date'] = df['date'].str.replace('Apr ', 'April ')
df['date'] = df['date'].str.replace('Mar ', 'March ')
df['date'] = df['date'].str.replace('Feb ', 'February ')
df['date'] = df['date'].str.replace('Jan ', 'January ')

In [None]:
df['date'] = df['date'].str.replace(' ', '')

In [None]:
# Vectorized parsing; errors='coerce' turns values that do not match the format into NaT (null)
df['date'] = pd.to_datetime(df['date'], format='%B%d,%Y', errors='coerce')

In [None]:
df['date'] = df['date'].astype('datetime64[ns]')
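
The date normalization above can be tried on a few hand-made strings. This is a minimal sketch: the sample values are assumptions for illustration, not rows from the dataset. It shows that `errors='coerce'` turns anything that does not match the format into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset
sample = pd.Series(['December 31, 2017', 'Sep 15, 2016', 'not a date'])

# Expand the abbreviated month name, then drop spaces as in the cells above
sample = sample.str.replace('Sep ', 'September ', regex=False)
sample = sample.str.replace(' ', '', regex=False)

# Vectorized parsing: values that do not match the format become NaT
parsed = pd.to_datetime(sample, format='%B%d,%Y', errors='coerce')
print(parsed)
```

Parsing the whole column in one call is also much faster than assigning row by row with `.iloc`.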

In [None]:
df.info()

In [None]:
import datetime as dt
df['year'] = pd.to_datetime(df['date']).dt.to_period('Y')
df['month'] = pd.to_datetime(df['date']).dt.to_period('M')

df['month'] = df['month'].astype(str)

**Next, we will look at the non-text features to see whether they could help the text classifier.**

## Fake news dynamics

In [None]:
sub = df[['month', 'flag']].dropna()
# flag == 0 marks fake news; summing 'flag' directly would count real news instead
sub = (sub['flag'] == 0).groupby(sub['month']).sum()

In [None]:
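# Drop the bucket of unparsed dates; errors='ignore' avoids a KeyError if there is none
sub = sub.drop('NaT', errors='ignore')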

In [None]:
import matplotlib.pyplot as plt

plt.suptitle('Dynamics of fake news')
plt.xticks(rotation=90)
plt.ylabel('Number of fake news')
plt.xlabel('Month-Year')
plt.plot(sub.index, sub.values, linewidth=2, color='green')

**What a spike in the dynamics of fake news in late 2017!**

## Subject distribution

In [None]:
sub2 = df[['subject', 'flag']].dropna()
# flag == 0 marks fake news; summing 'flag' directly would count real news instead
sub2 = (sub2['flag'] == 0).groupby(sub2['subject']).sum()

In [None]:
plt.suptitle('Fake news among different categories')
plt.xticks(rotation=90)
plt.ylabel('Number of fake news')
plt.xlabel('Category')

plt.bar(sub2.index, height=sub2.values, color='green')

**As the plots suggest, features such as**
* subject
* date

**might also be crucial for deciding whether a piece of news is fake or real. We will try to include them in the model.**
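
One way to feed such categorical features to a model together with the text is to one-hot encode them next to the TF-IDF columns. The sketch below is an illustration only: it uses scikit-learn's `ColumnTransformer` on a hypothetical toy frame (`toy`), which is not how the rest of this notebook combines the features.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({
    'title': ['senate passes budget bill', 'you will not believe this video',
              'court rules on trade case', 'shocking secret revealed'],
    'subject': ['politicsNews', 'News', 'worldnews', 'News'],
    'month': ['2017-10', '2016-03', '2017-11', '2016-05'],
})

features = ColumnTransformer([
    # Text column -> sparse TF-IDF matrix
    ('title', TfidfVectorizer(), 'title'),
    # Categorical columns -> one-hot columns
    ('cats', OneHotEncoder(handle_unknown='ignore'), ['subject', 'month']),
])

X_toy = features.fit_transform(toy)
print(X_toy.shape)  # one row per title; TF-IDF columns followed by one-hot columns
```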

# 3. Text preparation

In [None]:
nlp = df.copy()  # copy, so the text preprocessing below does not modify df

**The 'subject' feature can be appended to the title field, as it might influence the classification outcome. The line below is commented out; uncomment it to include the subject in the title.**

In [None]:
#nlp['title'] = nlp['title'] + ' ' + nlp['subject']

## 3.1 Word Cloud visualization

**Here I take a sample of titles and visualize their TF-IDF weights as a word cloud.**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = nlp[nlp['flag'] == 0]['title'].iloc[0:500] # Take a slice of fake news (flag == 0) to see what its vocabulary looks like
tfidf1 = TfidfVectorizer()
vecs = tfidf1.fit_transform(corpus)

feature_names = tfidf1.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vecs.todense()
list_words = dense.tolist()
df_words = pd.DataFrame(list_words, columns=feature_names)

In [None]:
from wordcloud import WordCloud

# Sum the TF-IDF weights of each term across the sampled titles
term_weights = df_words.T.sum(axis=1)
Cloud = WordCloud(background_color="white", max_words=100).generate_from_frequencies(term_weights)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12,5))
plt.imshow(Cloud, interpolation='bilinear')

**Indeed, looks definitely like fake news :)**

**If the 'subject' value has been appended to every title, it also shows up in the foreground, because our vectorizer treats it as an important and frequent word.**

## 3.2 Tfidf-vectorizing

**First, I will tokenize the titles and pass the tokens to SnowballStemmer, which reduces each word to its stem.**

In [None]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize

nlp['title'] = nlp['title'].apply(lambda x: word_tokenize(str(x)))

**An important step in many NLP tasks is to reduce words to their stems, so that inflected forms of the same word are not treated as different features by the model.**

In [None]:
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')
nlp['title'] = nlp['title'].apply(lambda x: [snowball.stem(y) for y in x])
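
To see what stemming does, here is a small sketch with a hypothetical word list. It only illustrates how `SnowballStemmer` behaves; note that a stem is not always a dictionary word.

```python
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')

# Inflected forms collapse onto one stem, so the vectorizer sees a single feature
words = ['run', 'runs', 'running']
stems = [snowball.stem(w) for w in words]
print(stems)
```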

In [None]:
nlp['title'] = nlp['title'].apply(lambda x: ' '.join(x))

**Take the standard English list of stop words from nltk.**

In [None]:
from nltk.corpus import stopwords 

nltk.download('words')
nltk.download('stopwords')
stopwords = stopwords.words('english')

**And finally, TF-IDF vectorizing. You could also use CountVectorizer, but I prefer TF-IDF because it down-weights terms that appear in most documents.**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: the stop-word list loaded above is not applied here; pass stop_words=stopwords to use it
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(nlp['title'])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_text, nlp['flag'], test_size=0.33, random_state=1)
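
Because the vectorizer above is fitted on all titles before the split, vocabulary and IDF statistics from the test rows are visible during training. One way to avoid this is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so that the vectorizer is refitted on the training part of every fold. The data below is a hypothetical toy set, and this is a sketch of the idea rather than the setup used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 1 = real, 0 = fake, matching the 'flag' convention
titles = ['senate passes budget bill', 'court rules on trade case',
          'minister visits border region', 'parliament votes on tax plan',
          'shocking secret they hide', 'you will not believe this',
          'watch this insane video now', 'unbelievable trick exposed']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is fitted inside each fold, on that fold's training titles only
pipe = make_pipeline(TfidfVectorizer(), LinearSVC())
fold_scores = cross_val_score(pipe, titles, labels, cv=2)
print(fold_scores)
```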

# 4. Model building

**I will use several approaches to solve the classification task, such as:**

1) Traditional (known to work well for text classification):

    1.1) SVM
    1.2) Naive Bayes
    1.3) XGBoost
    
2) Not-very-traditional (experimental): PyCaret NLP toolkit (I will apply an unsupervised model to generate features, which I will in turn pass to a supervised model)

## 4.1 Linear SVC

In [None]:
scores = {}

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

clf = LinearSVC(max_iter=100, C=1.0)
clf.fit(X_train, y_train)

y_pred_SVM = clf.predict(X_test)
print(cross_val_score(clf, X_text, nlp['flag'], cv=3))
print(accuracy_score(y_pred_SVM, y_test))

scores['LinearSVC'] = accuracy_score(y_pred_SVM, y_test)


**This looks suspiciously good, but let's try another algorithm.**

## 4.2 Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf2 = MultinomialNB()
clf2.fit(X_train, y_train)

y_pred_MNB = clf2.predict(X_test)
print(cross_val_score(clf2, X_text, nlp['flag'], cv=3))
print(accuracy_score(y_pred_MNB, y_test))

scores['MultinomialNB'] = accuracy_score(y_pred_MNB, y_test)

**Okay, this model performs a little worse, but still very well.**

## 4.3 XGBoost

In [None]:
from xgboost import XGBClassifier

# 'logloss' is a classification metric; 'rmse' is intended for regression
clf3 = XGBClassifier(eval_metric='logloss', use_label_encoder=False)
clf3.fit(X_train, y_train)

y_pred_XGB = clf3.predict(X_test)
print(cross_val_score(clf3, X_text, nlp['flag'], cv=3))
print(accuracy_score(y_pred_XGB, y_test))

scores['XGB'] = accuracy_score(y_pred_XGB, y_test)

## 4.4 PyCaret + CatBoost

**PyCaret’s Natural Language Processing module is an unsupervised machine learning module that can be used for analyzing text data by creating topic models that can find hidden semantic structures  within documents. PyCaret’s NLP module comes with a wide range of text pre-processing techniques. It has over 5 ready-to-use algorithms and several plots to analyze the performance of trained models and text corpus.**

*Read more:* https://pycaret.org/nlp/

In [None]:
# The pycaret.nlp module was removed in PyCaret 3.0, so pin a 2.x release
!pip install "pycaret<3"

**Setting up the model, which will perform all the traditional NLP preprocessing operations (tokenizing, lemmatizing, etc.).**

**PyCaret is almost fully automatic!**

In [None]:
from pycaret.nlp import *

caret_nlp = setup(data=nlp, target='title', session_id=1)

**LDA stands for Latent Dirichlet Allocation and is widely used in unsupervised learning tasks.**

*Read more:* https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
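
The idea of passing unsupervised topic features to a supervised model can be sketched without PyCaret. The example below is an assumption-laden illustration: it uses scikit-learn's `LatentDirichletAllocation` as a stand-in for PyCaret's LDA model, a `LogisticRegression` as a stand-in for the final classifier, and hypothetical toy titles.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = real, 0 = fake
titles = ['senate passes budget bill', 'court rules on trade case',
          'minister visits border region', 'parliament votes on tax plan',
          'shocking secret they hide', 'you will not believe this',
          'watch this insane video now', 'unbelievable trick exposed']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

topic_clf = make_pipeline(
    CountVectorizer(),                                           # LDA works on raw term counts
    LatentDirichletAllocation(n_components=2, random_state=1),   # unsupervised: titles -> topic weights
    LogisticRegression(),                                        # supervised: topic weights -> label
)
topic_clf.fit(titles, labels)

# Inspect the intermediate topic features (one row per title, one column per topic)
topic_weights = topic_clf[:-1].transform(titles)
print(topic_weights.shape)
```

Each row of `topic_weights` is a distribution over topics, which plays the same role as the 'Topic' columns used further below.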

In [None]:
lda = create_model('lda')

In [None]:
lda_data = assign_model(lda)

**Here's the resulting dataset:**

In [None]:
lda_data

**We'll utilize the 'Topic' features generated by PyCaret.**

In [None]:
from catboost import CatBoostClassifier

In [None]:
input_cat = lda_data.drop(['text','date','Perc_Dominant_Topic','flag','year'], axis=1)
input_cat['month'] = input_cat['month'].astype(str)
target_cat = lda_data['flag']

In [None]:
from sklearn.model_selection import train_test_split
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(input_cat, target_cat, test_size=0.33, random_state=1)

In [None]:
clf4 = CatBoostClassifier(iterations=1000, 
                          cat_features=['title','subject','Dominant_Topic','month']
                         )

In [None]:
# Note: early_stopping_rounds only takes effect when an eval_set is passed to fit()
clf4.fit(X_train_cat, y_train_cat, early_stopping_rounds=10)

In [None]:
scores['CatBoost'] = clf4.score(X_test_cat, y_test_cat)

In [None]:
scores

In [None]:
plt.bar(list(scores.keys()), list(scores.values()))

# 5. Conclusion

**We have trained and tested 4 models for an NLP task, applying the traditional NLP preprocessing strategies. They all perform very well; however, this is most likely due to the high correlation of the target with other categorical features (such as 'subject'). If we had not added it to the analysis, the result could have been totally different.**
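
A quick way to check how strongly a categorical feature is tied to the target is a cross-tabulation. The sketch below uses hypothetical toy rows; on the real data, `df['subject']` and `df['flag']` would take the place of the toy columns.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({
    'subject': ['politicsNews', 'politicsNews', 'worldnews', 'News', 'News', 'politics'],
    'flag':    [1, 1, 1, 0, 0, 0],
})

# Rows: subject, columns: flag (0 = fake, 1 = real), cells: number of articles
table = pd.crosstab(toy['subject'], toy['flag'])
print(table)

# A subject predicts the label perfectly when all of its rows share one flag
pure = (table > 0).sum(axis=1) == 1
print(pure.all())
```

If every subject turns out to be "pure" in this sense, a model can separate the classes from the subject alone, without learning anything from the text.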

**We also used a combination of supervised & unsupervised learning, which can be an interesting method to use.**

**Also, for text classification tasks I recommend using BERT models and DNN.**

*For more information on this and code snippets, read here:* https://medium.com/engineering-zemoso/text-classification-bert-vs-dnn-b226497c9de7

<font color='blue'><b>Thank you for your attention!</b><br></br><br></br>
Your comments and discussion contributions are always welcome.</font>