# Visually Exploring and Predicting Fake News

In this notebook we perform some basic exploration and visualisation of the Real and Fake Article datasets. 

This includes a few visual summaries of basic text features, such as word counts and most common words (before and after pre-processing and cleaning). Rather than cleaning and pre-processing the text extensively, we apply only light reformatting, such as lowercasing and removing various symbols and punctuation. Stop words are not removed, because in this case removing them adversely impacts the results. This is a common occurrence in NLP applications, especially with more complex deep learning models, so stop words should not be removed by default.

After pre-processing and cleaning, we obtain word embeddings for our text using a pre-trained language model. With these embeddings, we explore several dimensionality reduction and visualisation techniques: Uniform Manifold Approximation and Projection (UMAP), Latent Semantic Analysis (LSA), and t-distributed Stochastic Neighbor Embedding (t-SNE). All three produce similar 2-D visualisations of the pre-trained embeddings. A key difference is the significantly longer run time of t-SNE compared to UMAP or LSA. This is a common finding, and a good way to compensate is to apply a cheaper linear dimensionality reduction technique, such as Principal Component Analysis (PCA), before using t-SNE. Each technique produced satisfactory separation between real and fake articles.

Following these visualisations, some classical machine learning models are trained on the dataset and compared in terms of accuracy and F1 score using K-fold cross-validation. Using out-of-the-box classifiers, performance averaged 97-98%, which is a good result for such minimal effort. One of these models is then evaluated further using a confusion matrix and a plot of the Receiver Operating Characteristic (ROC) curve. The results could likely be improved with individually tuned models and random or grid searches. Beyond that, an ensemble of models could be formed, or a deep learning approach used, such as an LSTM, a CNN/LSTM composite model or a transformer architecture. The methods in this notebook are intended only as quick exploration and analysis of the given dataset.

**Table of Contents:**

1. [Import Dependencies](#imports)
2. [Import Data and Inspect Basic Features](#import-data)
3. [EDA](#EDA)
4. [Data Cleaning and Processing](#cleaning-processing)
5. [Pre-Trained Language Model Word Embeddings](#word-embeddings)
6. [Word Embedding Visualisation Techniques](#word-embedding-visualisation)
7. [Formation of Training and Validation Splits](#training-splits)
8. [Cross-validation on a range of models](#cross-validation)
9. [Model refinements and evaluations](#model-evaluation)
10. [Visualising results of our chosen model](#chosen-model-results)

---

<a id="imports"></a>
## 1. Import dependencies and external libs

In [None]:
import numpy as np
import os
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
import seaborn as sns
import spacy
import re
import umap

from collections import Counter, defaultdict
from matplotlib.colors import ListedColormap
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer, SnowballStemmer

from sklearn.decomposition import PCA, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, \
                            classification_report, f1_score, roc_curve, auc
from sklearn.model_selection import cross_validate, cross_val_score, train_test_split, StratifiedKFold
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# import classical ml models
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, Perceptron, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
nlp = spacy.load('en_core_web_sm')

---

<a id="import-data"></a>
## 2. Import data and inspect basic features

In [None]:
file_path = '/kaggle/input/fake-and-real-news-dataset/'
fake_df = pd.read_csv(f"{file_path}Fake.csv")
true_df = pd.read_csv(f"{file_path}True.csv")

fake_df['fake'] = 1
true_df['fake'] = 0

data_df = pd.concat([fake_df, true_df], ignore_index=True)
data_df

In [None]:
plt.figure(figsize=(8,5))
ax = sns.countplot(x="subject", hue="fake", data=data_df)
plt.xticks(rotation=90)
plt.show()

We can see that the real and fake datasets share no 'subject' categories: every subject value appears in only one of the two classes. It is therefore extremely important that we **do not use 'subject' as a feature for making predictions**; doing so would introduce **data leakage** that makes it trivial for our models to achieve 100% accuracy.

With this data leakage, our models would show extremely high performance on our training and validation sets, but very poor generalisation to unseen real-world data. This problem is a common occurrence throughout data science, and something we need to look out for carefully in every project.
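
A quick way to confirm this kind of leakage is to check whether any subject value appears under both labels. The sketch below uses a small hypothetical list of (subject, label) pairs with made-up subject names rather than the real dataset, and the helper name `subjects_by_label` is illustrative only.

```python
from collections import defaultdict

def subjects_by_label(rows):
    """Map each label to the set of subject values seen with it."""
    seen = defaultdict(set)
    for subject, label in rows:
        seen[label].add(subject)
    return dict(seen)

# hypothetical toy rows: (subject, fake_label), names are made up
toy_rows = [
    ("wire-politics", 0), ("wire-world", 0), ("wire-world", 0),
    ("blog-news", 1), ("blog-politics", 1), ("blog-opinion", 1),
]

by_label = subjects_by_label(toy_rows)
shared = by_label[0] & by_label[1]

# an empty intersection means 'subject' alone determines the label
print(f"Subjects shared by both classes: {sorted(shared)}")
```

With the notebook's dataframe, the same helper could be fed `zip(data_df['subject'], data_df['fake'])`.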

In [None]:
plt.figure(figsize=(6,4))
ax = sns.countplot(x="fake", data=data_df)
ax.set_xticklabels(['Real', 'Fake'])
plt.xlabel('Classification')
plt.xticks(rotation=90)
plt.show()

We have a fairly balanced mix of real and fake articles, which is good news.

---

<a id="EDA"></a>
## 3. Exploratory Data Analysis of text features

A quick and useful way to compare text datasets is to assess the word counts within each.

### 3.1 Distribution of word counts in title and text for true and fake articles

In [None]:
data_df['title_length'] = data_df['title'].apply(lambda x : len(x.strip().split()))
data_df['text_length'] = data_df['text'].apply(lambda x : len(x.strip().split()))

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(data_df[data_df['fake'] == 1]['title_length'], 
             kde=False, label='Fake', bins=20)
sns.distplot(data_df[data_df['fake'] == 0]['title_length'], 
             kde=False, label='True', bins=20)
plt.xlabel('Title Length', weight='bold')
plt.title('Length of title comparison', weight='bold')
plt.legend()
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 6))
plt.title("Word counts of article titles", fontsize=16, weight='bold')
ax = sns.boxplot(x="fake", y="title_length", data=data_df)
ax.set_xticklabels(['Real', 'Fake'])
ax.set_xlabel("Article Classification", fontsize=14, weight='bold') 
ax.set_ylabel("Length of Entry (Words)", fontsize=14, weight='bold')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(data_df[data_df['fake'] == 1]['text_length'], 
             kde=False, label='Fake', bins=20)
sns.distplot(data_df[data_df['fake'] == 0]['text_length'], 
             kde=False, label='True', bins=20)
plt.xlabel('Text Length', weight='bold')
plt.title('Length of text comparison', weight='bold')
plt.xlim(0.0, 4500)
plt.legend()
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 6))
plt.title("Word counts of article text", fontsize=16, weight='bold')
ax = sns.boxplot(x="fake", y="text_length", data=data_df)
ax.set_xticklabels(['Real', 'Fake'])
ax.set_xlabel("Article Classification", fontsize=14, weight='bold') 
ax.set_ylabel("Length of Entry (Words)", fontsize=14, weight='bold')
plt.ylim(0.0, 3000.0)
plt.show()

In general, fake news articles tend to be longer than true articles. This is especially true for the article titles; the stark difference noted above suggests that we could classify articles to a reasonable accuracy using title length alone.

However, in terms of word counts, it is harder to discriminate between real and fake articles using the main text field.
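
To make the title-length observation concrete, here is a minimal sketch of a one-feature threshold classifier. The lengths and labels are hypothetical toy values, not figures from the dataset, and `best_length_threshold` is an illustrative helper.

```python
def best_length_threshold(lengths, labels):
    """Find the word-count cut-off that best separates the two classes.

    Predicts 'fake' (1) when length >= threshold, and returns the
    (threshold, accuracy) pair with the highest accuracy.
    """
    best = (None, 0.0)
    for threshold in sorted(set(lengths)):
        preds = [1 if n >= threshold else 0 for n in lengths]
        correct = sum(p == t for p, t in zip(preds, labels))
        accuracy = correct / len(labels)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best

# hypothetical title lengths (in words) and labels (1 = fake, 0 = real)
toy_lengths = [9, 10, 8, 11, 10, 15, 14, 17, 12, 16]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

threshold, accuracy = best_length_threshold(toy_lengths, toy_labels)
print(f"Best threshold: {threshold} words, accuracy: {accuracy:.2f}")
```

On real data the threshold should be chosen on the training split and scored on held-out data, since the accuracy reported here is measured on the same samples used to pick the cut-off.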

### 3.2 Top words in true and fake articles

In [None]:
def create_corpus(text_data):
    """ Create a corpus from the given text array of sentences """
    corpus = []
    for sentence in text_data:
        for word in sentence.split():
            corpus.append(word)
    return corpus
            
def top_words(text_corpus, top_n=25, return_dict=False):
    """ Return the top n words from a given corpus """
    def_dict = defaultdict(int)
    for word in text_corpus:
        def_dict[word] += 1
    most_common = sorted(def_dict.items(), key=lambda x : x[1], reverse=True)[:top_n]
    if return_dict:
        return most_common, def_dict
    else:    
        return most_common

#### Fake article top words

In [None]:
top_n = 50
text_field = "title"

fake_corpus = create_corpus(fake_df[text_field].values)
fake_top_n_words, fake_symptom_dict = top_words(fake_corpus, top_n=top_n, return_dict=True)
fake_words, fake_word_counts = zip(*fake_top_n_words)

def plot_words(word_list, word_counts, n, text_description, figsize=(15,5)):
    plt.figure(figsize=figsize)
    plt.xticks(rotation=90)
    plt.bar(word_list, word_counts)
    plt.title(f"Top {n} words in {text_description}", weight='bold')
    plt.ylabel("Word Count", weight='bold')
    plt.show()

In [None]:
plot_words(fake_words, fake_word_counts, 50, "Fake Article Titles")
print(f"Total unique words in {text_field}: {len(fake_symptom_dict)}")

#### Real article top words

In [None]:
top_n = 50
text_field = "title"

true_corpus = create_corpus(true_df[text_field].values)
true_top_n_words, true_symptom_dict = top_words(true_corpus, top_n=top_n, return_dict=True)
true_words, true_word_counts = zip(*true_top_n_words)

plot_words(true_words, true_word_counts, 50, "True Article Titles")
print(f"Total unique words in {text_field}: {len(true_symptom_dict)}")

Clearly we have many duplicate words due to capitalised text. In addition, our text data contains tags such as [video], punctuation and other symbols. Key names, such as Trump and Hillary, are extremely common throughout both the real and fake article datasets.
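
The effect of capitalisation on the vocabulary can be seen on a tiny example. The two titles below are hypothetical and only illustrate how case variants inflate the number of unique words.

```python
from collections import Counter

# hypothetical titles, not taken from the dataset
toy_titles = ["Trump Says Economy Is Strong",
              "TRUMP says economy is weak [video]"]

# count words exactly as written, then again after lowercasing
raw_counts = Counter(word for title in toy_titles for word in title.split())
lower_counts = Counter(word.lower() for title in toy_titles
                       for word in title.split())

print(f"Unique words (raw): {len(raw_counts)}")
print(f"Unique words (lowercased): {len(lower_counts)}")
```

Lowercasing merges the case variants into single entries, while the `[video]` tag survives until symbols are stripped in the cleaning step.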

### 3.3 Numbers of Proper Nouns, Nouns, and other parts of speech in fake and real articles

Sometimes features such as the number of nouns and proper nouns within our text can be good indicators of spam or fake articles. Let's work out some of these features for our data and analyse how they are distributed between real and fake articles:

In [None]:
def count_nouns_and_prop_nouns(text, model=nlp):
    """ Return number of nouns and proper nouns in text """
    # generate POS tags, count and return nouns and proper nouns
    doc = model(text)
    pos = [token.pos_ for token in doc]
    return pos.count('NOUN'), pos.count('PROPN')

This could take some time, since we need to analyse each title using a pre-trained spaCy language model:

In [None]:
# create new features in our dataframe from these counts
data_df['propn_count'] = 0
data_df['noun_count'] = 0
data_df.loc[:, ['noun_count', 'propn_count']] = data_df['title'].apply(count_nouns_and_prop_nouns).values.tolist()

# calculate mean number of proper nouns in real / fake
real_propn = data_df[data_df['fake'] == 0]['propn_count'].mean()
fake_propn = data_df[data_df['fake'] == 1]['propn_count'].mean()

# calculate mean number of nouns in real / fake
real_noun = data_df[data_df['fake'] == 0]['noun_count'].mean()
fake_noun = data_df[data_df['fake'] == 1]['noun_count'].mean()

In [None]:
print(f"Proper Noun Mean Counts: \n  - Real: {real_propn:.3f} \n  - Fake: {fake_propn:.3f}\n")
print(f"Noun Mean Counts: \n  - Real: {real_noun:.3f} \n  - Fake: {fake_noun:.3f}\n")

Let's analyse the distributions of nouns and proper nouns in our titles for both real and fake articles:

In [None]:
plt.figure(figsize=(10,6))
sns.distplot(data_df[data_df['fake'] == 1]['propn_count'], 
             kde=False, label='Fake', bins=15)
sns.distplot(data_df[data_df['fake'] == 0]['propn_count'], 
             kde=False, label='True', bins=15)
plt.xlabel('Count of Proper Nouns in Title', weight='bold')
plt.title('Proper Noun Comparison', weight='bold')
plt.legend()
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 6))
plt.title("Proper noun counts of article titles", fontsize=16, weight='bold')
ax = sns.boxplot(x="fake", y="propn_count", data=data_df)
ax.set_xticklabels(['Real', 'Fake'])
ax.set_xlabel("Article Classification", fontsize=14, weight='bold') 
ax.set_ylabel("Number of Proper Nouns", fontsize=14, weight='bold')
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.distplot(data_df[data_df['fake'] == 1]['noun_count'], 
             kde=False, label='Fake', bins=15)
sns.distplot(data_df[data_df['fake'] == 0]['noun_count'], 
             kde=False, label='True', bins=15)
plt.xlabel('Count of Nouns in Title', weight='bold')
plt.title('Noun Comparison', weight='bold')
plt.legend()
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 6))
plt.title("Noun counts of article titles", fontsize=16, weight='bold')
ax = sns.boxplot(x="fake", y="noun_count", data=data_df)
ax.set_xticklabels(['Real', 'Fake'])
ax.set_xlabel("Article Classification", fontsize=14, weight='bold') 
ax.set_ylabel("Number of Nouns", fontsize=14, weight='bold')
plt.show()

This actually looks quite promising! On average, real article titles appear to have fewer proper nouns than fake ones. Conversely, real article titles appear to have more common nouns than fake ones. We haven't assessed whether either difference is statistically significant. Nevertheless, these patterns are insightful and could help the models we build later to discriminate between real and fake articles.
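
One simple way to check significance, which this notebook does not do, is a permutation test on the difference in means. The sketch below is a minimal standard-library version run on hypothetical proper-noun counts; `permutation_test` and the toy samples are illustrative assumptions.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(sample_a, sample_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        # reassign group membership at random and recompute the gap
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # add-one correction keeps the estimate away from exactly zero
    return (extreme + 1) / (n_permutations + 1)

# hypothetical proper-noun counts per title
toy_real = [2, 3, 2, 1, 3, 2, 2, 3, 1, 2]
toy_fake = [5, 6, 4, 5, 7, 5, 6, 4, 5, 6]

p_value = permutation_test(toy_real, toy_fake)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value suggests the gap between the two groups is unlikely to arise from random label assignment alone. On the full dataset, the two columns filtered by `fake` would take the place of the toy lists.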

---

<a id="cleaning-processing"></a>
## 4. Cleaning and pre-processing

### 4.1 Formation of cleaning functions and applying these to the title and text fields

In [None]:
def clean_and_tokenise(text, stop_words=False, stem=False, lemmatize=False):
    """ Text cleaning function - lowercase and remove stopwords """
    cleaned_text = re.sub('<[^>]*>', '', text.lower())
    
    # remove custom unwanted characters from our text
    cleaned_text = remove_badchars(cleaned_text)
    
    # apply stop-word, stemming/lemmatising as required
    if stop_words:
        tokens = [word for word in tokenise(cleaned_text, stem=stem, 
                                            lemmatize=lemmatize) if word not in sw]
    else:
        tokens = [word for word in tokenise(cleaned_text, stem=stem, lemmatize=lemmatize)]
    
    cleaned_text = " ".join(tokens)
    
    return cleaned_text


def remove_badchars(text):
    """ Remove certain unwanted symbols and chars """
    delete_chars = "[]()@''+&'"
    space_chars = "_.-"
    table = dict((ord(c), " ") for c in space_chars)
    table.update(dict((ord(c), None) for c in delete_chars))
    return text.translate(table)


def tokenise(text, stem=False, lemmatize=False):
    """ Form tokenised stemmed text using a list comp and return """
    if lemmatize:
        tokenised = [lemmatizer.lemmatize(word) for word in text.split()]
    elif stem:
        tokenised = [stemmer.stem(word) for word in text.split()]
    else:
        tokenised = [word for word in text.split()]
    return tokenised

We can perform a range of cleaning and pre-processing operations. In this notebook, we'll perform only basic cleaning, including removal of various symbols and punctuation, and lowercasing our text. The above functions are designed to support stop-word removal and stemming / lemmatisation, but these do more harm than good for this dataset, so they are not applied.
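
To see what this light cleaning does to a single string, here is a compact, self-contained sketch of the same idea using `str.translate`. The function name `light_clean` and the sample title are hypothetical; the character lists follow those used above.

```python
import re

def light_clean(text):
    """Lowercase, strip HTML-like tags, delete some symbols, space out others."""
    text = re.sub(r"<[^>]*>", "", text.lower())
    table = {ord(c): " " for c in "_.-"}               # replaced by a space
    table.update({ord(c): None for c in "[]()@'+&"})   # deleted outright
    # split/join collapses any runs of whitespace left behind
    return " ".join(text.translate(table).split())

sample = "BREAKING: Trump's <b>Re-Election</b> Bid [VIDEO]"
print(light_clean(sample))
```

Note that characters outside the two lists, such as the colon in this sample, pass through unchanged, so the cleaning is deliberately partial.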

In [None]:
# stop words - append additionals if needed
sw = stopwords.words('english')

# stemmer / lemmatiser used by tokenise() when those options are enabled
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# clean title data - no stop-word removal, stemming or lemmatisation
data_df['cleaned_title'] = data_df['title'].apply(clean_and_tokenise, stop_words=False, 
                                                      stem=False, lemmatize=False)

# clean text data - no stop-word removal, stemming or lemmatisation
data_df['cleaned_text'] = data_df['text'].apply(clean_and_tokenise, stop_words=False, 
                                                      stem=False, lemmatize=False)

data_df['combined_text'] = data_df['cleaned_text'] + " " + data_df['cleaned_title']

In [None]:
true_df = data_df[data_df['fake'] == 0].copy()
fake_df = data_df[data_df['fake'] == 1].copy()

### 4.2 Let's re-visualise the top n words in each after cleaning

Let's see how different our results are now that we've cleaned our data. We'll also combine both title and text fields, and visualise the word distributions.

In [None]:
top_n = 50
text_field = "combined_text"

fake_corpus = create_corpus(fake_df[text_field].values)
fake_top_n_words, fake_symptom_dict = top_words(fake_corpus, top_n=top_n, return_dict=True)
fake_words, fake_word_counts = zip(*fake_top_n_words)
plot_words(fake_words, fake_word_counts, 50, "Fake Article Combined Title and Text (Cleaned)", figsize=(15,4))

true_corpus = create_corpus(true_df[text_field].values)
true_top_n_words, true_symptom_dict = top_words(true_corpus, top_n=top_n, return_dict=True)
true_words, true_word_counts = zip(*true_top_n_words)
plot_words(true_words, true_word_counts, 50, "True Article Combined Title and Text (Cleaned)", figsize=(15,4))

There is a range of further cleaning options we could experiment with, such as removing common stop-words, applying stemming (or lemmatisation), or further feature engineering, although as noted earlier the first two did not help on this dataset.

---

<a id="word-embeddings"></a>
## 5. Obtaining word embeddings using a pre-trained language model

There is a wide variety of ways to obtain embeddings for our textual data. One such way, which we'll use in this notebook, is to obtain pre-trained word embeddings from a pre-trained language model. For this we'll make use of a spaCy language model, which is very convenient, as the following code shows:

In [None]:
X = data_df['combined_text']
y = data_df['fake']
X.shape, y.shape

In [None]:
# you need to ensure you've downloaded and installed 'en_core_web_lg' for this
nlp = spacy.load('en_core_web_lg', disable=["parser", "tagger", "ner", "textcat"])

In [None]:
vectorised_features = np.array([nlp(x).vector for x in X])
vectorised_features.shape

We now have a 300-dimensional vector representing the text (combined title + text) of each article in our dataset. By default, spaCy builds this document vector by averaging the vectors of the individual tokens. This is vastly smaller than the representation we would obtain from a bag-of-words or TF-IDF model, which makes it much easier to reduce to two dimensions and visualise graphically.
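
The idea of a document vector as an average of token vectors can be sketched with NumPy alone. The 4-dimensional vectors below are hypothetical stand-ins for real 300-dimensional embeddings, and `document_vector` is an illustrative helper that ignores details such as spaCy's own tokenisation and out-of-vocabulary handling.

```python
import numpy as np

# hypothetical 4-dimensional token vectors standing in for 300-d embeddings
toy_token_vectors = {
    "fake": np.array([1.0, 0.0, 2.0, 0.0]),
    "news": np.array([0.0, 2.0, 0.0, 4.0]),
    "spreads": np.array([2.0, 1.0, 1.0, 2.0]),
}

def document_vector(text, token_vectors, dim=4):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [token_vectors[tok] for tok in text.split() if tok in token_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

doc_vec = document_vector("fake news spreads", toy_token_vectors)
print(doc_vec)
```

Because every document is reduced to one fixed-length vector, word order is lost, but the result is compact and can be passed directly to the dimensionality reduction methods and classifiers.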

---

<a id="word-embedding-visualisation"></a>
## 6. Visualisation of pre-trained embeddings - a comparison of UMAP, LSA and t-SNE

We'll visualise our pre-trained text embeddings using three different techniques: UMAP, LSA and t-SNE. All things considered, they each produce relatively similar results in terms of separating real and fake articles.

### 6.1 Uniform Manifold Approximation and Projection (UMAP)

In [None]:
umap_embedder = umap.UMAP()
%time umap_features = umap_embedder.fit_transform(vectorised_features)

In [None]:
# sns settings
sns.set(rc={'figure.figsize':(12,8)})
palette = sns.hls_palette(2, l=.4, s=.9)

# plot UMAP projection, coloured by the real / fake label
sns.scatterplot(x=umap_features[:,0], y=umap_features[:,1], 
                hue=y, legend='full', 
                palette=palette, s=50, alpha=0.1)

plt.title("UMAP Projection", 
          weight='bold', size=14)
plt.show()

### 6.2 Latent Semantic Analysis (LSA)

In [None]:
def plot_LSA(word_vectors, word_labels, figsize=(12, 8), alpha=0.3):
    """ Perform latent semantic analysis and plot results """
    lsa = TruncatedSVD(n_components=2)
    lsa_scores = lsa.fit_transform(word_vectors)
    # label 0 (real) maps to orange, label 1 (fake) maps to blue
    colors = ['orange','blue']
    
    fig = plt.figure(figsize=figsize)
    plt.scatter(lsa_scores[:,0], lsa_scores[:,1], s=8, alpha=alpha, 
                c=word_labels, cmap=ListedColormap(colors))
    
    orange_patch = mpatches.Patch(color='orange', label='Real')
    blue_patch = mpatches.Patch(color='blue', label='Fake')
    plt.legend(handles=[orange_patch, blue_patch], prop={'size': 16})      
    plt.show()

In [None]:
plot_LSA(vectorised_features, y, alpha=0.2)

### 6.3 t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is much more computationally complex than the previous two methods. Despite this, it often produces very good results for visualising text and high-dimensional data.

#### Apply PCA first to reduce our feature dimensions:

In [None]:
pca = PCA(n_components=0.95, random_state=42)
%time X_reduced = pca.fit_transform(vectorised_features)
X_reduced.shape

In [None]:
tsne = TSNE(verbose=1, perplexity=50, n_jobs=-1)

In [None]:
%time X_embedded = tsne.fit_transform(X_reduced)

In [None]:
sns.set(rc={'figure.figsize':(12,8)})
palette = sns.hls_palette(2, l=.4, s=.8)

# plot t-SNE projection, coloured by the real / fake label
sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=y, 
                legend='full', palette=palette, alpha=0.2)
plt.title('t-SNE with true/fake labels', weight='bold')
plt.show()

t-SNE is a great tool for visualising our vectorised text, but it comes with significant computational cost on large datasets with many features. For this reason we applied Principal Component Analysis (PCA) before performing t-SNE, so that the overhead is not so high.

If we were applying this to tf-idf features, the dimensionality of our data would be significantly higher than the 300 we used in this case (with the pre-trained word embeddings), and so this prior reduction of dimensionality becomes increasingly more important when using t-SNE with traditional techniques like bag of words.
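
Passing a fraction such as `n_components=0.95` to PCA asks for as many components as are needed to explain that share of the variance. The NumPy sketch below mirrors the idea on hypothetical random data; it is not scikit-learn's exact implementation, and `components_for_variance` is an illustrative name.

```python
import numpy as np

def components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    X_centred = X - X.mean(axis=0)
    # squared singular values of the centred data are proportional
    # to the variance captured by each principal component
    singular_values = np.linalg.svd(X_centred, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    return int(np.searchsorted(cumulative, target) + 1)

# hypothetical data: 200 samples, 10 features, most variance in 2 directions
rng = np.random.default_rng(0)
scales = np.array([10, 10, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
X_toy = rng.normal(size=(200, 10)) * scales

n_keep = components_for_variance(X_toy, target=0.95)
print(f"Components needed for 95% of the variance: {n_keep}")
```

The fewer components that survive this step, the cheaper the subsequent t-SNE run becomes, at the cost of discarding the remaining share of the variance.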

---

<a id="training-splits"></a>
## 7. Form training and validation splits

We'll form an 80% training and 20% validation split of our data before evaluating general performance across a range of classical machine learning models.
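
A stratified split keeps the real / fake ratio the same in both partitions. The following standard-library sketch shows the mechanism on hypothetical labels; `stratified_split` is an illustrative helper, not a replacement for `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        # shuffle within each class, then carve off the validation share
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# hypothetical labels: 60 real (0) and 40 fake (1)
toy_labels = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split(toy_labels)

val_fake_share = sum(toy_labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_fake_share)
```

Sampling each class separately is what guarantees that the validation set has the same class balance as the full data.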

In [None]:
# obtain training and validation splits - 80% training data, 20% val data
X_train, X_val, y_train, y_val = train_test_split(vectorised_features, y, shuffle=True,
                                                  test_size = 0.2, random_state=0, stratify=y)

print("Shapes of our data: \nX_train: {0}\ny_train: {1}\nX_val: {2}\ny_val: {3} ".format(X_train.shape,
                                                                                         y_train.shape,
                                                                                         X_val.shape,
                                                                                         y_val.shape))

---

<a id="cross-validation"></a>
## 8. Evaluate K-folds cross validation performance on training set with a range of models

In [None]:
def multi_model_cross_validation(clf_tuple_list, X, y, K_folds=10, score_type='accuracy', random_seed=0):
    """ Find cross validation scores, and print and return results """
    
    model_names, model_scores = [], []
    
    for name, model in clf_tuple_list:
        k_fold = StratifiedKFold(n_splits=K_folds, shuffle=True, random_state=random_seed)
        cross_val_results = cross_val_score(model, X, y, cv=k_fold, scoring=score_type, n_jobs=-1)
        model_names.append(name)
        model_scores.append(cross_val_results)
        print("{0:<40} {1:.5f} +/- {2:.5f}".format(name, cross_val_results.mean(), cross_val_results.std()))
        
    return model_names, model_scores


def boxplot_comparison(model_names, model_scores, figsize=(12, 6), score_type="Accuracy",
                       title="Fake News Classification Comparison"):
    """ Boxplot comparison of a range of models using Seaborn and matplotlib """
    
    # one column of cross-validation scores per model (wide-form data)
    scores_df = pd.DataFrame(dict(zip(model_names, model_scores)))
    
    fig = plt.figure(figsize=figsize)
    fig.suptitle(title, fontsize=18)
    ax = fig.add_subplot(111)
    sns.boxplot(data=scores_df, ax=ax)
    ax.set_xlabel("Model", fontsize=16) 
    ax.set_ylabel("Model Score ({})".format(score_type), fontsize=16)
    plt.setp(ax.get_xticklabels(), rotation=60)
    plt.show()
    return

In [None]:
# list of classifiers to compare - use some additional models this time
clf_list = [("Perceptron", Perceptron(eta0=0.1)),
            ("Logistic Regression", LogisticRegression(C=10.0, max_iter=200)),
            ("Support Vector Machine", SVC(kernel='linear', C=1.0)),
            ("Decision Tree", DecisionTreeClassifier()),
            ("Random Forest", RandomForestClassifier(n_estimators=300)),
            ("Ridge Classifier", RidgeClassifier()),
            ("Gradient Boosting", GradientBoostingClassifier())]


# calculate cross-validation scores and print / plot for each model accordingly
model_names, model_scores = multi_model_cross_validation(clf_list, X_train[:10000], y_train[:10000])
boxplot_comparison(model_names, model_scores)

---

<a id="model-evaluation"></a>
## 9. Evaluate performance on validation set using the same range of models

Let's take this forward, and assess the performance on our stand-alone validation set when we train our models on the training splits.

In [None]:
def test_set_performances(clf_tuple_list, X_train, y_train, X_test, 
                          y_test, score_type='accuracy', print_results=True):
    """ Find test set accuracy and F1 Score performance for all classifiers 
        and return """
    
    model_names, model_accuracies, model_f1 = [], [], []
    
    if print_results:
        print("{0:<30} {1:<10} {2:<10} \n{3}".format("Model", "Accuracy", 
                                                     "F1-Score", "-"*50))
    
    # fit each model to training data and form predictions
    for name, model in clf_tuple_list:
        
        # fit on training, predict on test
        model.fit(X_train, y_train)
        y_preds = model.predict(X_test)
        
        # find accuracy and f1 (macro) scores
        accuracy = accuracy_score(y_test, y_preds)
        test_f1 = f1_score(y_test, y_preds, average='macro')
        
        # append model results
        model_names.append(name)
        model_accuracies.append(accuracy)
        model_f1.append(test_f1)
        
        if print_results:
            print("{0:<30} {1:<10.5f} {2:<10.5f}".format(name, accuracy, test_f1))
            
    return model_names, model_accuracies, model_f1

In [None]:
# obtain accuracy and f1 metrics and print for each model
model_names, test_acc, test_f1 = test_set_performances(clf_list, X_train, y_train, X_val, y_val)

In [None]:
# barplot of model accuracies on the validation set
sns.set(style="darkgrid")
sns.barplot(x=model_names, y=test_acc, alpha=0.9)
plt.title('Validation set accuracy', weight='bold')
plt.ylabel('Accuracy', fontsize=12, weight='bold')
plt.xticks(rotation=90)
plt.ylim(0.0, 1.0)
plt.show()

# barplot of model f1 scores on the validation set
sns.set(style="darkgrid")
sns.barplot(x=model_names, y=test_f1, alpha=0.9)
plt.title('Validation set F1 Score', weight='bold')
plt.ylabel('F1 Score (Macro)', fontsize=12, weight='bold')
plt.xticks(rotation=90)
plt.ylim(0.0, 1.0)
plt.show()

---

<a id="chosen-model-results"></a>
## 10. Visualising the results of a chosen model from above

We'll take our gradient boosting classifier from above and analyse it further in terms of F1 Score, precision and recall. An effective means of doing this is through using a confusion matrix.
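
Precision, recall and F1 all derive from the counts in a confusion matrix. As a minimal sketch, the helper below (an illustrative `binary_metrics`, run on hypothetical labels and predictions) computes them for the positive class by hand.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# hypothetical labels and predictions (1 = fake, 0 = real)
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here three fake articles are caught, one is missed and two real articles are flagged wrongly, so precision and recall differ even though accuracy alone would hide which kind of error dominates.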

### 10.1 Plot a confusion matrix for our trained model

In [None]:
def plot_confusion_matrix(true_y, pred_y, title='Confusion Matrix', figsize=(8,6)):
    """ Custom function for plotting a confusion matrix for predicted results """
    conf_matrix = confusion_matrix(true_y, pred_y)
    conf_df = pd.DataFrame(conf_matrix, columns=np.unique(true_y), index = np.unique(true_y))
    conf_df.index.name = 'Actual'
    conf_df.columns.name = 'Predicted'
    plt.figure(figsize = figsize)
    plt.title(title)
    sns.set(font_scale=1.4)
    sns.heatmap(conf_df, cmap="Blues", annot=True, 
                annot_kws={"size": 16}, fmt='g')
    plt.show()
    return

In [None]:
# create a gradient boosting classifier and predict using the validation set
gb_clf = GradientBoostingClassifier()
gb_clf.fit(X_train, y_train)
predictions = gb_clf.predict(X_val)

# print performance statistics
print("Samples incorrectly classified: {0} out of {1}.".format((y_val != predictions).sum(),
                                                                len(y_val)))

print("Gradient Boosting classifier accuracy: {0:.2f}%".format(accuracy_score(y_val, predictions)*100.0))

# plot a confusion matrix of our results
plot_confusion_matrix(y_val, predictions, 
                      title="Gradient Boosting Confusion Matrix", figsize=(5,5))

# print recall, precision and f1 score results
print(classification_report(y_val, predictions))

As shown, straight out of the box our classifier obtains a very good accuracy on predicting whether an article is fake or real.

### 10.2 Plot the Receiver Operating Characteristic (ROC) Curve for our Model

A further way to assess our model is the ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies. This helps us choose a threshold that maximises the true positive rate or minimises the false positive rate, depending on our specific use-case.
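
As a sketch of how a threshold might be chosen from ROC points, the example below computes the false and true positive rates at each distinct score and picks the threshold maximising Youden's J statistic (TPR minus FPR). This criterion is one common option assumed here for illustration; the scores and labels are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) for every distinct score threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # predict positive whenever the score reaches the threshold
        preds = [s >= threshold for s in scores]
        tp = sum(p and t == 1 for p, t in zip(preds, y_true))
        fp = sum(p and t == 0 for p, t in zip(preds, y_true))
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points

# hypothetical predicted probabilities of the 'fake' class
toy_true = [0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.55, 0.7, 0.8, 0.9]

# pick the point with the largest gap between TPR and FPR
best_threshold, best_fpr, best_tpr = max(roc_points(toy_true, toy_scores),
                                         key=lambda pt: pt[2] - pt[1])
print(best_threshold, best_fpr, best_tpr)
```

A use-case that penalises false alarms more heavily would instead pick a higher threshold, accepting a lower true positive rate in exchange.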

In [None]:
# obtain prediction probabilities for trg and val
y_val_probs = gb_clf.predict_proba(X_val)
y_trg_probs = gb_clf.predict_proba(X_train)

# obtain false positive and true positive rates for the training set
fpr, tpr, thresholds = roc_curve(y_train, y_trg_probs[:, 1], pos_label=1)
roc_auc = auc(fpr, tpr)

# obtain false positive and true positive rates for the validation set
val_fpr, val_tpr, val_thresholds = roc_curve(y_val, y_val_probs[:, 1], pos_label=1)
val_roc_auc = auc(val_fpr, val_tpr)

plt.figure(figsize=(9,9))
plt.plot(fpr, tpr, label=f"Train ROC AUC = {roc_auc:.4f}", color='blue')
plt.plot(val_fpr, val_tpr, label=f"Val ROC AUC = {val_roc_auc:.4f}", color='red')
plt.plot([0,1], [0, 1], label="Random Guessing", 
             linestyle=":", color='grey', alpha=0.6)
plt.plot([0, 0, 1], [0, 1, 1], label="Perfect Performance", 
             linestyle="--", color='black', alpha=0.6)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic", weight='bold')
plt.legend(loc='best')
plt.show()

Overall, this is a well-performing model, especially considering that our gradient boosting classifier uses only its default, untuned parameters. To improve it, we could run a grid search or randomised search across a range of hyper-parameters to maximise predictive performance.