# 1. Proposed Algorithms and Pretrained Models

***

List of proposed algorithms used in this experiment:

**Machine Learning Models**

- SVM
- Random Forest
- XGBoost
- Naive Bayes
- KNN
- Logistic Regression

**Deep Learning Models**

- 1-Dimensional Convolutional Neural Network (1D CNN)
- Long Short-Term Memory Neural Network (LSTM)
- Bidirectional Long Short-Term Memory Neural Network (Bi-LSTM)


**Pretrained Models**

- Word2Vec -> GoogleNews-vectors-negative300.bin
- GloVe -> glove.6B.300d.txt
- BERT  -> bert_en_uncased_L-12_H-768_A-12/1

*** 
# 2. Model Training Phase

This section contains the general scripts, user-defined helper functions, and sample scripts used during model development for the automatic fake news detection experiment.

## 2.1.  User-defined helper functions to create the confusion matrix and normalised confusion matrix

The scripts below plot the confusion matrix and the normalised confusion matrix from the predictions and the ground-truth labels.

In [None]:
# Helper functions to plot the confusion matrix and the normalised confusion matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def confusn_mtrx_plot(cm,path):
    # cm: confusion matrix from sklearn.metrics.confusion_matrix(y_true, y_pred)
    # path: file path where the figure is saved

    fig = plt.figure(figsize=(6, 5), dpi=60)
    ax = plt.subplot()
    sns.set(font_scale=1.4) # Adjust to fit
    sns.heatmap(cm, annot=True, ax=ax, cmap="GnBu", fmt="g");  
    #sns.heatmap(cm/np.sum(cm), annot=True, ax=ax[1], cmap="GnBu", fmt="g");  

    # Labels, title and ticks
    label_font = {'size':'18'}  # Adjust to fit
    ax.set_xlabel('Predicted', fontdict=label_font);
    ax.set_ylabel('Actuals', fontdict=label_font);

    title_font = {'size':'20'}  # Adjust to fit
    ax.set_title('Confusion Matrix', fontdict=title_font);

    ax.tick_params(axis='both', which='major', labelsize=15)  # Adjust to fit
    ax.xaxis.set_ticklabels(['Real', 'Fake']);
    ax.yaxis.set_ticklabels(['Real', 'Fake']);
    #fig.show()
    fig.savefig(path)
    
def norm_confusn_mtrx_plot(cm,path):
    fig = plt.figure(figsize=(6, 5), dpi=60)
    ax = plt.subplot()
    sns.set(font_scale=1.4) # Adjust to fit
    sns.heatmap(cm/np.sum(cm), annot=True, ax=ax, cmap="GnBu",fmt='.2%');  

    # Labels, title and ticks
    label_font = {'size':'18'}  # Adjust to fit
    ax.set_xlabel('Predicted', fontdict=label_font);
    ax.set_ylabel('Actuals', fontdict=label_font);

    title_font = {'size':'18'}  # Adjust to fit
    ax.set_title('Normalised Confusion Matrix', fontdict=title_font);

    ax.tick_params(axis='both', which='major', labelsize=15)  # Adjust to fit
    ax.xaxis.set_ticklabels(['Real', 'Fake']);
    ax.yaxis.set_ticklabels(['Real', 'Fake']);
    fig.savefig(path)
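
The normalised plot above divides every cell by the total number of samples, so the four cells together sum to 100%. As a small illustration with made-up counts (the matrix below is hypothetical, not a result from this experiment), the sketch compares that normalisation with a per-row alternative, in which each row sums to 100% and the diagonal reads as per-class recall. The per-row variant is an assumption added for comparison and is not used by the helper functions.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = actual, columns = predicted
cm = np.array([[90, 10],
               [20, 80]])

# Normalisation used by norm_confusn_mtrx_plot: share of all samples
cm_overall = cm / np.sum(cm)

# Alternative (not used above): share within each actual class
cm_per_row = cm / cm.sum(axis=1, keepdims=True)

print(cm_overall)
print(cm_per_row)
```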

## 2.2.  User-defined helper functions to compute the evaluation metrics and build the model comparison table


- The function **metrics** prints the confusion matrix, accuracy, classification report, F1 score and ROC AUC score for a given classifier on the test data.


- The function **model_comparison_table** builds a dataframe listing the accuracy, precision, recall, F1 score and ROC AUC score of every classifier, for comparison.

In [None]:
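import timeit
from timeit import default_timer as timer
from datetime import timedelta

import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score)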

def metrics(X_test,y_test,clf):
    predictions=clf.predict(X_test)
    #predictions=(clf.predict_proba(X_test)[:,1] >= 0.3).astype(bool)
    print("confusion_matrix :")
    print(confusion_matrix(y_test,predictions))
    print("Accuracy Score :")
    print(accuracy_score(y_test, predictions))
    print("Classification Report :")
    print(classification_report(y_test, predictions))
    print("F1 score :")
    print(f1_score(y_test, predictions))
    print("ROC AUC Score")
    y_pred_proba = clf.predict_proba(X_test)
    print(roc_auc_score(y_test, y_pred_proba[:,1]) )
    print("------------------------------")

    
def model_comparison_table(X_test,y_test,classifier_list):
    dict_clf={}
    for clf_name,clf in classifier_list:
        predictions=clf.predict(X_test)
        y_pred_proba = clf.predict_proba(X_test)
        accuracy=accuracy_score(y_test, predictions)
        # Note: precision is macro-averaged, while recall and F1 below use the default binary average
        precision=precision_score(y_test,predictions,average='macro').round(2)
        recall=recall_score(y_test,predictions)
        f1score=f1_score(y_test,predictions).round(2)
        ROC_AUC=roc_auc_score(y_test, y_pred_proba[:,1])
        dict_clf[clf_name]=[accuracy,precision,recall,f1score,ROC_AUC]
    df_models_scores = pd.DataFrame(dict_clf, index=['Accuracy', 'Precision', 'Recall', 'F1 Score','roc_auc_score'])
    return df_models_scores
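
The scores collected in the comparison table can all be derived from the four cells of a binary confusion matrix. The sketch below works through the arithmetic with hypothetical counts (they are not results from this experiment) and assumes that label 1 is treated as the positive class.

```python
# Hypothetical counts for the positive class (label 1)
tp, fp, fn, tn = 80, 10, 20, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of everything predicted positive, how much was right
recall = tp / (tp + fn)      # of everything actually positive, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```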

## 2.3.  Train-Test Data Split
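
A minimal sketch of a train-test split with scikit-learn is given here for orientation. The toy data, the 80/20 ratio, the `random_state` and the use of `stratify` are all assumptions made for illustration; the split actually used in the experiment is not shown in this section.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the article texts and their class labels
X = np.array([f"article {i}" for i in range(100)])
y = np.array([0, 1] * 50)

# test_size, random_state and stratify are illustrative choices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_train), len(X_test))
```

Stratifying on `y` keeps the class proportions the same in both parts, which matters when the metrics of several classifiers are compared on one test set.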

*** 
## 2.4  Feature Extraction Sample Scripts Using Tokenisation and Pretrained Word Embeddings

***
### 2.4.1 Tokenisation using TF-IDF


**Sample script applying TF-IDF tokenisation: fit on the training set, then transform both the training and test sets**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
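
The vectoriser is fitted on the training texts only, so the test matrix has the same columns as the training matrix and words that appear only in the test set are ignored. The toy corpus below is hypothetical and only illustrates that behaviour.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["fake news spreads fast", "real news takes time"]
test_docs = ["breaking fake story"]  # 'breaking' and 'story' are unseen words

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```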

*** 
### 2.4.2 Pretrained Word2Vec Embedding model


#### 2.4.2.1 Download the pretrained Word2Vec model

In [None]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

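#### 2.4.2.2 After downloading it, load the model as follows (`path_to_model` is the path to the downloaded file, for example the file name alone if it is in the same directory as the .py file or Jupyter notebook)

In [None]:
from gensim.models.keyedvectors import KeyedVectors
# path_to_model must be defined beforehand and point to the downloaded Word2Vec binary
%time w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
print('done loading Word2Vec')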

#### 2.4.2.3 Then inspect the model by listing the vocabulary keys of the pretrained model

In [None]:
# Inspect the model
word2vec_vocab = list(w2v_model.index_to_key)

# Convert the vocabulary to lower case
word2vec_vocab_lower = [item.lower() for item in word2vec_vocab]

#### 2.4.2.4 User-defined function to create a feature vector for each sentence by averaging the embeddings of its tokens

In [None]:
# Create a feature vector for each sentence by averaging the embeddings of its tokens
def embedding_feats(list_of_lists):
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists:
        feat_for_this =  np.zeros(DIMENSION)
        count_for_this = 0 + 1e-5 # to avoid divide-by-zero 
        for token in tokens:
            if token in w2v_model:
                feat_for_this += w2v_model[token]
                count_for_this +=1
        if(count_for_this!=0):
            feats.append(feat_for_this/count_for_this) 
        else:
            feats.append(zero_vector)
    return feats


#### 2.4.2.5 Then transform both the training and test data into averaged embedding vectors

In [None]:
train_vectors = embedding_feats(X_train)
test_vectors = embedding_feats(X_test)
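
To make the averaging step concrete, here is a minimal sketch with made-up 3-dimensional vectors. The dictionary `toy_vectors` and the function `average_embedding` are hypothetical stand-ins for the pretrained model and for `embedding_feats`. Unlike `embedding_feats`, which adds a small constant to the count, this sketch checks explicitly for sentences with no known tokens.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the pretrained model
toy_vectors = {
    "fake": np.array([1.0, 0.0, 0.0]),
    "news": np.array([0.0, 1.0, 0.0]),
}

def average_embedding(tokens, vectors, dimension=3):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dimension)
    return np.mean(known, axis=0)

mixed = average_embedding(["fake", "news", "unknownword"], toy_vectors)
empty = average_embedding(["unknownword"], toy_vectors)
print(mixed)
print(empty)
```

Tokens missing from the vocabulary are skipped, so they neither contribute to the sum nor count towards the divisor.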

*** 
### 2.4.3 Pretrained GloVe Embedding model


#### 2.4.3.1 Download the pretrained GloVe model

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

#### 2.4.3.2 Take the downloaded 300-dimension GloVe pretrained model "glove.6B.300d.txt" from its path and convert it to word2vec format

In [None]:
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

path_to_model = 'Embeddings//glove.6B.300d.txt'
output_file = 'Embeddings///gensim_glove.6B.300d.txt'
glove2word2vec(path_to_model, output_file)

#### 2.4.3.3 Then load the converted GloVe model

In [None]:
glove_model = KeyedVectors.load_word2vec_format(output_file, binary=False)

#### 2.4.3.4 User-defined function to create a feature vector for each sentence by averaging the embeddings of its tokens

In [None]:
# Create a feature vector for each sentence by averaging the GloVe embeddings of its tokens
def embedding_feats(list_of_lists):
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists:
        feat_for_this =  np.zeros(DIMENSION)
        count_for_this = 0 + 1e-5 # to avoid divide-by-zero 
        for token in tokens:
            if token in glove_model:
                feat_for_this += glove_model[token]
                count_for_this +=1
        if(count_for_this!=0):
            feats.append(feat_for_this/count_for_this) 
        else:
            feats.append(zero_vector)
    return feats


#### 2.4.3.5 Then transform both the training and test data into averaged embedding vectors

In [None]:
glove_train_vectors = embedding_feats(X_train)
glove_test_vectors = embedding_feats(X_test)

*** 
### 2.4.4 Pretrained BERT Embedding model


#### 2.4.4.1 Load the pretrained BERT preprocessor and encoder models from TensorFlow Hub

In [None]:
import tensorflow_hub as hub
import tensorflow_text  # registers the ops that the BERT preprocessor needs

# bert preprocessor - https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3
preprocessor = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
# bert encoder - https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/2
encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/2",trainable=True)

Text inputs must be transformed into numeric token ids and arranged in several tensors before being passed to BERT.
Because the dataset is large, transforming all of it at once is impractical, so the data is split into chunks that are processed one at a time, as shown below for the first chunk.

In [None]:
# preprocessing dataset  - First Set
inputs = preprocessor(texts[0:10000])

After executing the above command, the preprocessing produces the three outputs that a BERT model uses (input_word_ids, input_mask and input_type_ids).
The encoder then converts these outputs into BERT features that can be fed to the model.

In [None]:
# feeding it to model for vectorization
outputs = encoder(inputs)

The BERT encoder returns a dictionary with three important keys: pooled_output, sequence_output and encoder_outputs:

- pooled_output represents each input sequence as a whole. The shape is [batch_size, H].
- sequence_output represents each input token in context. The shape is [batch_size, seq_length, H].
- encoder_outputs are the intermediate activations of the L Transformer blocks.


In [None]:
pooled_output = outputs["pooled_output"]      # [batch_size, 512].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 512].

Then collapse the BERT sequence output of each text to one dimension by summing over the token axis, and save the result to a dataframe for a single chunk. Repeat this process for every data chunk, then merge all the dataframes into a single one, which is fed to model training.

In [None]:
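# Sum each sequence output over the token axis to get one vector per text.
# The rows are collected in a list because DataFrame.append was removed in pandas 2.0.
rows = [sequence_output[i].numpy().sum(axis=0) for i in range(len(sequence_output))]

# defining dataframe
bertf_df1 = pd.DataFrame(rows)
print('values added in dataframe')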
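
The chunk-by-chunk procedure described above can be sketched without TensorFlow. In the illustration below, `iter_chunks` and `fake_sequence_output` are hypothetical helpers. The latter only mimics the shape of the encoder's `sequence_output`, and the corpus size, chunk size and dimensions are made up.

```python
import numpy as np

def iter_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

def fake_sequence_output(batch, seq_length=4, hidden=8):
    """Stand-in for encoder(preprocessor(batch))['sequence_output']."""
    return np.ones((len(batch), seq_length, hidden))

texts = [f"article {i}" for i in range(25)]  # toy corpus

features = []
for chunk in iter_chunks(texts, chunk_size=10):
    seq_out = fake_sequence_output(chunk)   # [batch, seq_length, hidden]
    features.append(seq_out.sum(axis=1))    # sum over tokens -> [batch, hidden]

all_features = np.vstack(features)          # one row per text
print(all_features.shape)
```

Summing over axis 1 of the whole batch gives the same result as summing each text's output over axis 0, one text at a time.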

In [None]:
bertf_df1 = pd.read_csv("Updated//bertf_df1.csv")
bertf_df2 = pd.read_csv("Updated//bertf_df2.csv")
bertf_df3 = pd.read_csv("Updated//bertf_df3.csv")
bertf_df4 = pd.read_csv("Updated//bertf_df4.csv")

# merge the dataframes of all chunks; ignore_index gives the result one continuous row index
bertVectors_fulldf=pd.concat([bertf_df1,bertf_df2,bertf_df3,bertf_df4],ignore_index=True)

Then add the class label column to the merged BERT feature dataframe; this dataframe is then passed to the train-test split.

In [None]:
## Adding Class Label
bertVectors_fulldf.insert(len(bertVectors_fulldf.columns),'class',isot_full_df['class'])
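
Adding the label column relies on index alignment: pandas matches the label Series to the feature rows by index, not by position. A minimal sketch with toy data (the frames and the column name `f0` are hypothetical) shows why the merged frame needs one continuous index that matches the index of the labels.

```python
import pandas as pd

chunk1 = pd.DataFrame({"f0": [0.1, 0.2]})
chunk2 = pd.DataFrame({"f0": [0.3, 0.4]})
labels = pd.Series([0, 1, 1, 0], name="class")  # toy labels with index 0..3

# Without ignore_index=True the merged index would be 0, 1, 0, 1
merged = pd.concat([chunk1, chunk2], ignore_index=True)

# Same call pattern as above: append the label column at the end
merged.insert(len(merged.columns), "class", labels)

print(merged)
```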

*** 
## 2.5 Training Algorithm

***
### 2.5.1 Random Forest


**Instantiate the Random Forest model**

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf=RandomForestClassifier(random_state=0)

Then fit the model and save it to a path

In [None]:
import pickle

rf_clf.fit(X_train_tfidf,y_train)

# save the model to disk
filename = 'outputs//isot_ml_tfidf//isot_ml_RF_tfidf.sav'
with open(filename, 'wb') as f:
    pickle.dump(rf_clf, f)
print('RandomForest - Completed')

Then predict on the test data and compute the evaluation metrics

In [None]:
pred = rf_clf.predict(X_test_tfidf)
print("Accuracy score : {}".format(accuracy_score(y_test, pred)))
print("Confusion matrix : \n {}".format(confusion_matrix(y_test, pred)))
print("Classification Report")
print(classification_report(y_test, pred))
precision = precision_score(y_test, pred)
print("Precision : {}".format(precision))
recall = recall_score(y_test, pred)
print("Recall : {}".format(recall))
f1score = f1_score(y_test, pred)
print("F1 Score : {}".format(f1score))

Plot the confusion matrices using the helper functions defined in Section 2.1

In [None]:
cm=confusion_matrix(y_test, pred)
path1="outputs//isot_ml_tfidf//isot_ml_RF_tfidf_cmtrx.png"
path2="outputs//isot_ml_tfidf//isot_ml_RF_tfidf_ncmtrx.png"
confusn_mtrx_plot(cm,path1)
norm_confusn_mtrx_plot(cm,path2)
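
A model saved with `pickle.dump`, as in the fitting step of this section, can be restored later with `pickle.load` and used for prediction without retraining. The sketch below is a round trip on toy data; the tiny dataset, the `n_estimators` value and the temporary file name `rf_demo.sav` are assumptions for illustration only.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier

# Toy data: one feature, two classes
X = [[0.0], [0.1], [0.9], [1.0]]
y = [0, 0, 1, 1]
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Save to a temporary file (hypothetical path) and load it back
path = os.path.join(tempfile.mkdtemp(), "rf_demo.sav")
with open(path, "wb") as f:
    pickle.dump(clf, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[0.05], [0.95]]))
```

Pickled models should be loaded with the same scikit-learn version that saved them, and only from trusted sources, because unpickling can execute arbitrary code.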

***
### 2.5.2 CNN


**Configure the model checkpoint and early stopping callbacks for the CNN**

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

filepath = "outputs//isot_dl_tfidf//model_ISOT_CNN_TFIDF_V2.h5" # Location where the best model is saved
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
es = EarlyStopping(monitor='val_loss', patience=3,mode='min', verbose=1)
callbacks_list = [checkpoint,es]

**Define the model architecture**

In [None]:
import warnings
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

warnings.filterwarnings('ignore')
model=Sequential()
model.add(Conv1D(filters=128, kernel_size=3, padding='valid', activation='relu',input_shape=(max_features,1)))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model (summary() prints directly and returns None)
model.summary()

**Fitting the model**

In [None]:
warnings.filterwarnings('ignore')
history=model.fit(x_train_tf_cn, y_train, epochs=3, batch_size=32,validation_data=(x_test_tf_cn,y_test), 
                  callbacks=callbacks_list)

**Predict on the test data and save the history of metric scores to a CSV file**

In [None]:
y_pred=model.predict(x_test_tf_cn)
ytrue = y_test.astype(int).tolist()
y_pred2 = np.array((y_pred > 0.5).astype(int)[:,0])
precision = precision_score(ytrue, y_pred2)
recall = recall_score(ytrue, y_pred2)
f1score = f1_score(ytrue, y_pred2)
history.history['precision']=precision
history.history['recall']=recall
history.history['f1score']=f1score
hist_df = pd.DataFrame(history.history) 
hist_df.to_csv("outputs//isot_dl_tfidf//model_ISOT_CNN_TFIDF_V2_history.csv")
plot_loss_and_acc_from_hist2(hist_df)

**Measure the Performance Metrics Score**

In [None]:
print("Accuracy score : {}".format(accuracy_score(ytrue, y_pred2)))
print('precision =',precision)
print('recall =',recall)
print('f1score =',f1score)
print(confusion_matrix(ytrue, y_pred2))
print(classification_report(ytrue, y_pred2))
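
The prediction step above turns the sigmoid outputs of the network into class labels with a 0.5 cut-off. The sketch below repeats that step on hypothetical probabilities to show the shapes involved and the treatment of the boundary value.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape [n_samples, 1], as model.predict returns
y_prob = np.array([[0.10], [0.49], [0.50], [0.51], [0.97]])

# Same rule as above: strictly greater than 0.5 counts as class 1,
# then the single column is flattened to a 1-D label vector
y_label = (y_prob > 0.5).astype(int)[:, 0]

print(y_label)
```

A probability of exactly 0.5 is assigned to class 0, because the comparison is strict.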

**Plot the confusion matrices using the helper functions defined in Section 2.1**

In [None]:
cm=confusion_matrix(ytrue, y_pred2)
path1="outputs//isot_dl_tfidf//isot_cnn_tfidf_cmtrx.png"
path2="outputs//isot_dl_tfidf//isot_cnn_tfidf_ncmtrx.png"
confusn_mtrx_plot(cm,path1)
norm_confusn_mtrx_plot(cm,path2)