## Import libraries and dataset

In this first phase we import the required libraries and the data, then display a small sample to get a first understanding of the data structure.

In [None]:
import os
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Load the large English spaCy model (it must be installed beforehand)
nlp = spacy.load('en_core_web_lg')

In [None]:
true=pd.read_csv("/kaggle/input/fake-and-real-news-dataset/True.csv")
false=pd.read_csv("/kaggle/input/fake-and-real-news-dataset/Fake.csv")

In [None]:
true.sample(20)

As we can see, the text of the true news always seems to begin with "CITY NAME (Reuters) - ". This does not happen in the fake news dataset, as you can observe below.
We will remove this prefix to obtain a cleaner and more homogeneous dataset. In fact, we want our classifier to learn to distinguish fake news from true news based on underlying patterns in the text. If we kept the original text, we would run the risk of a trivial classification based only on the presence of this prefix.

In [None]:
false.sample(20)

At first glance we can notice an interesting behaviour: the titles of fake news seem to contain more capital letters and punctuation than the titles of true news. We will study this phenomenon in the Exploratory Data Analysis section.

## Data cleaning


First of all we clean the True dataset, so that each news text starts right after the prefix "CITY NAME (Reuters) - ".

In [None]:
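def clean_true_news(text):
    # Leave non-string values (e.g. NaN) untouched
    if not isinstance(text, str):
        return text
    # Find the "(Reuters) - " marker and keep only what follows it
    match = re.search(r'\WReuters\W\s-\s', text)
    if match is None:
        return text
    return text[match.end():]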

In [None]:
true['text']=true['text'].apply(clean_true_news)
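
To make the effect of the cleaning step concrete, here is a minimal standalone sketch that applies the same regular expression to two made-up sentences. The helper name `strip_reuters_prefix` and the sample strings are assumptions introduced for illustration; they are not part of the dataset.

```python
import re

# Same pattern as the one used in clean_true_news
REUTERS_PREFIX = re.compile(r'\WReuters\W\s-\s')

def strip_reuters_prefix(text):
    """Return the text after the '(Reuters) - ' marker, or the text unchanged."""
    match = REUTERS_PREFIX.search(text)
    return text[match.end():] if match else text

# Invented examples, for illustration only
samples = [
    "WASHINGTON (Reuters) - The Senate voted on Tuesday.",
    "No agency prefix in this sentence.",
]
cleaned = [strip_reuters_prefix(s) for s in samples]
print(cleaned)
```

The first string loses everything up to and including "(Reuters) - ", while the second one is returned as it is because the pattern does not match.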

We create the dichotomous target variable: it takes the value 1 if the news is true and 0 if it is fake.

We then join the two datasets.

In [None]:
true["target"]=1
false["target"]=0
df=pd.concat([true,false])
df.reset_index(drop=True,inplace=True)
df.sample(5)

## Text EDA

In this section we explore different features and visualize them to see whether they can discriminate true news from fake news.

In [None]:
# Number of tokens in the text
def token_len(row_text):
    row=nlp(row_text)
    return len(row)

# Number of sentences in the text
def n_of_sentences(row_text):
    row=nlp(row_text)
    return len(list(row.sents))

We compute the number of tokens in both the title and the text.

In [None]:
df['title_len']=df['title'].apply(token_len)
df['text_len']=df['text'].apply(token_len)


We also calculate the number of sentences in the news text.

In [None]:
df['text_sent']=df['text'].apply(n_of_sentences)

As mentioned before, we add a feature that counts the number of capital letters in the title.

In [None]:
df['capital_letters']=df['title'].apply(lambda x: len(re.findall(r'[A-Z]',x)))

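Finally, we add a feature that counts the non-word characters in the news title as a proxy for punctuation. Note that the pattern `\W` also matches whitespace, so the count includes the spaces between words.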

In [None]:
df['non_alpha']=df['title'].apply(lambda x: len(re.findall(r'\W',x)))
df.sample(5)
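
The two title features can be checked on a single made-up headline. The sketch below compares the `\W` pattern used for `non_alpha` with a stricter `[^\w\s]` pattern that excludes whitespace. The stricter pattern is an alternative shown for comparison only, it is not what the notebook uses, and the headline is invented.

```python
import re

# Invented headline, for illustration only
title = "BREAKING: You Won't BELIEVE This!!!"

capital_letters = len(re.findall(r'[A-Z]', title))   # same pattern as the notebook
non_alpha = len(re.findall(r'\W', title))            # same pattern as the notebook
punct_only = len(re.findall(r'[^\w\s]', title))      # alternative: ignores whitespace

print(capital_letters, non_alpha, punct_only)  # 18 9 5
```

The gap between the last two counts is exactly the number of spaces in the title, so `non_alpha` grows with the number of words as well as with the amount of punctuation.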

### Visualization

In [None]:
sns.countplot(x="target", data=df)

The two classes are almost balanced.

In [None]:
chart=sns.countplot(x = "subject", hue = "target" , data = df)
labels=chart.get_xticklabels()
chart.set_xticklabels(labels,rotation=45)

The subject of the news seems to be extremely discriminative: there is only one subject where true news and fake news overlap, "worldnews". Since we don't know how this subject is assigned or whether the criteria are the same for both sources, and to prevent the classifier from overfitting this specific dataset, we won't use this variable in our classification.
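
The overlap between subjects and classes can also be checked numerically rather than read off the chart. The following is a minimal sketch on a toy frame; the subject names and counts are hypothetical and do not reproduce the real dataset.

```python
import pandas as pd

# Toy data with hypothetical subject names, for illustration only
toy = pd.DataFrame({
    "subject": ["subject_a", "subject_a", "subject_b", "subject_b", "subject_c"],
    "target":  [1,           0,           1,           1,           0],
})

# Rows are subjects, columns are the two classes (0 = fake, 1 = true)
counts = pd.crosstab(toy["subject"], toy["target"])

# A subject is shared only if it has at least one article in both classes
shared = counts[(counts > 0).all(axis=1)].index.tolist()

print(counts)
print(shared)  # ['subject_a']
```

Applied to the real data, the same two lines would be `pd.crosstab(df["subject"], df["target"])` followed by the same filter.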

In [None]:
sns.boxplot(x="target", y="title_len", data=df)

We can observe that fake news generally have longer titles than true news.

In [None]:
sns.boxplot(x="target", y="text_len", data=df)

In [None]:
sns.boxplot(x="target", y="text_sent", data=df)

The number of tokens and the number of sentences are more or less the same in the two classes, so these fields are not discriminative for our problem.

In [None]:
sns.boxplot(x="target", y="capital_letters", data=df)

In [None]:
sns.boxplot(x="target", y="non_alpha", data=df)

As we guessed before, the two boxplots clearly show that both the number of capital letters and the number of punctuation signs are higher in fake news than in true news. This is an important finding, as we can leverage this information to further improve our classification.
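
A numeric summary can complement the boxplots. The sketch below computes per-class medians with `groupby` on a toy frame; the values are invented and only the column names follow the notebook.

```python
import pandas as pd

# Toy values, invented for illustration (0 = fake, 1 = true)
toy = pd.DataFrame({
    "target":          [1, 1, 1, 0, 0, 0],
    "capital_letters": [3, 4, 5, 10, 14, 12],
    "non_alpha":       [9, 10, 11, 15, 18, 16],
})

# Median of each title feature within each class
medians = toy.groupby("target")[["capital_letters", "non_alpha"]].median()
print(medians)
```

On the real data the same call on `df` gives one row per class, which makes the size of the difference easier to quote than a plot.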

In [None]:
df.isnull().sum()

In [None]:
blanks = []  
for i,text in df['text'].items():  
    if text.isspace():         
        blanks.append(i)     
print(len(blanks))

Although there are no missing values, some texts consist only of blanks, i.e. just white space. We won't discard these observations because, as we will see in the next section, we consider the title and the text jointly.
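
The blank check can also be written without an explicit loop. This is a sketch on a toy series with invented values; it also shows one subtlety, namely that `isspace()` is false for an empty string, so a strip-and-compare test counts slightly more rows.

```python
import pandas as pd

# Toy texts, invented for illustration
texts = pd.Series(["Some article body.", " ", "", "\n\t", "Another body."])

# Same criterion as the loop above: strings made only of whitespace
only_whitespace = int(texts.str.isspace().sum())        # 2: " " and "\n\t"

# Broader criterion: whitespace-only OR completely empty strings
blank_or_empty = int(texts.str.strip().eq("").sum())    # 3

print(only_whitespace, blank_or_empty)
```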

## News classification

We will perform a two-step classification:
1. First we classify news based only on their text, considering the title and the text together. The dataset is transformed into a document-term matrix with the term frequency-inverse document frequency (TF-IDF) method.
2. The prediction of the first model is then used as input to a second model, which also takes as input the length of the title, the number of capital letters and the number of punctuation signs.

We will use a **Support Vector Machine** with linear kernel for both steps.
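
To illustrate what the TF-IDF transformation in step 1 produces, here is a minimal sketch on three made-up sentences. It uses `TfidfVectorizer` with its default settings, as the notebook does; the sentences are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents, for illustration only
docs = [
    "the senate passed the bill",
    "the bill was shocking",
    "shocking video goes viral",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

print(matrix.shape)                       # (3, 9): 3 documents, 9 distinct terms
print(sorted(vectorizer.vocabulary_))     # the terms that became columns

# A term that does not occur in a document gets weight 0 in that row
the_col = vectorizer.vocabulary_["the"]
print(matrix[2, the_col])
```

Each row is a document and each column a term, so the linear SVM in the first step learns one weight per term.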

In [None]:
X=df['title']+' '+df['text']
y=df['target']
X_2=df[['title_len','capital_letters','non_alpha']]
X_train, X_test, X_2_train, X_2_test, y_train, y_test = train_test_split(X,X_2,y,test_size=0.2,random_state=42)

In [None]:
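# Output order matters: X_test_1m (text) and X_train_2m (features) hold the same rows, paired with y_2m
X_train_1m, X_test_1m, X_test_2m, X_train_2m, y_1m, y_2m = train_test_split(X_train,X_2_train,y_train,test_size=0.5,random_state=42)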

In this second split we will use:
* X_train_1m (with labels y_1m) to train the first model, the one based only on the text
* X_train_2m (with labels y_2m) to train the second model, based on the prediction of the first model plus title_len, capital_letters and non_alpha. The first model's prediction is computed on X_test_1m, which contains the same rows as X_train_2m. X_test_2m holds the title features of the first half and is not used.
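
Because `train_test_split` returns a train/test pair for every array it receives, in the order the arrays were passed, it is easy to mix up the halves. The sketch below uses toy inputs and hypothetical variable names (`text_a`, `extra_b`, and so on) to show that the second-half text rows and the second-half feature rows stay aligned.

```python
from sklearn.model_selection import train_test_split

# Toy inputs, invented for illustration: document i has features [i, 2*i]
texts = [f"doc {i}" for i in range(10)]
extra = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]

# For each input array the function returns (first half, second half), in input order
text_a, text_b, extra_a, extra_b, y_a, y_b = train_test_split(
    texts, extra, labels, test_size=0.5, random_state=42
)

# The second halves refer to the same original rows
ids_from_text = [int(t.split()[1]) for t in text_b]
ids_from_extra = [row[0] for row in extra_b]
print(ids_from_text == ids_from_extra)  # True
```

In the notebook's naming, `text_a` plays the role of `X_train_1m`, `text_b` of `X_test_1m`, and `extra_b` of `X_train_2m`.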

In [None]:
text_clf=Pipeline([('tfidf',TfidfVectorizer()),
                  ('clf',LinearSVC())])
text_clf.fit(X_train_1m,y_1m)

In [None]:
yhat1=text_clf.predict(X_test)
print(metrics.classification_report(y_test,yhat1))

To measure the improvement brought by this two-step method, we first look at the performance we would obtain with the first model alone:

In [None]:
print(metrics.accuracy_score(y_test,yhat1))

In [None]:
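# Work on copies to avoid pandas' SettingWithCopyWarning
X_train_2m=X_train_2m.copy()
X_2_test=X_2_test.copy()
#Create the prediction field for the second step model:
X_train_2m['y_predict']=text_clf.predict(X_test_1m)
#save the prediction on the test set for the final model:
X_2_test['y_predict']=text_clf.predict(X_test)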

In [None]:
clf=LinearSVC(max_iter=10000,dual=False)
clf.fit(X_train_2m,y_2m)

In [None]:
yhat2=clf.predict(X_2_test)
print(metrics.classification_report(y_test,yhat2))

In [None]:
print(metrics.accuracy_score(y_test,yhat2))

We can observe that the two-step classification setting improves on the results of the simple text classification model.
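
The data flow of the two-step approach can be condensed into a toy sketch. Everything here is an assumption made for illustration: the eight texts, the single extra feature, and the hand-picked index lists that stand in for the second split. It only demonstrates how the first model's prediction becomes a column for the second model, it says nothing about accuracy.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy rows, invented for illustration: (text, number of capital letters, label)
rows = [
    ("senate passes budget bill", 1, 1),
    ("minister meets trade delegation", 1, 1),
    ("court rules on election law", 2, 1),
    ("central bank holds rates steady", 1, 1),
    ("SHOCKING secret they hide", 9, 0),
    ("you WONT believe this video", 5, 0),
    ("WOW watch this insane clip", 4, 0),
    ("UNREAL scandal exposed tonight", 7, 0),
]
texts = [r[0] for r in rows]
caps = np.array([[r[1]] for r in rows])
y = np.array([r[2] for r in rows])

# Hand-picked halves standing in for the second train_test_split
idx_1 = [0, 1, 4, 5]   # trains the text model
idx_2 = [2, 3, 6, 7]   # trains the second-step model

# Step 1: TF-IDF + linear SVM on the text only
text_model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
text_model.fit([texts[i] for i in idx_1], y[idx_1])

# Step 2: the step-1 prediction becomes an extra column next to the title feature
step1_pred = text_model.predict([texts[i] for i in idx_2]).reshape(-1, 1)
X_step2 = np.hstack([caps[idx_2], step1_pred])

second_model = LinearSVC(max_iter=10000, dual=False)
second_model.fit(X_step2, y[idx_2])

print(X_step2.shape)                      # (4, 2)
print(second_model.predict(X_step2))
```

Training the second model on rows that the first model has not seen keeps the `y_predict` column from being an overly optimistic, in-sample prediction.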