<a href="https://www.kaggle.com/code/tanat94/fake-real-news-classification-w-bert?scriptVersionId=115440008" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
<a href="https://colab.research.google.com/github/tanat1994/Fake-news-Classification/blob/main/Fake_True_News_Classification_with_DistilBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a name='0'></a>
# TABLE OF CONTENTS
- [1. Environment Setup](#1)
- [2. Import Libraries](#2)
- [3. Add Dataframe label & Merge](#3)
- [4. Exploratory Data Analysis (EDA)](#4)
- [5. Dataframe Preprocessing](#5)
    - [5.1 Tokenize function](#5.1)
    - [5.2 Split Train/Test](#5.2)
    - [5.3 Encode/Tokenize dataset](#5.3)
- [6. Modeling](#6)
- [7. Evaluation](#7)
    - [7.1 DistilBERT results](#7.1)
    - [7.2 BERT-base results](#7.2)

<a name='1'></a>
## 1. Environment Setup
[Back to TOC](#0)

In [None]:
!pip install -q transformers
!pip install -q kaggle

In [None]:
from google.colab import files
files.upload()

In [None]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# ! kaggle datasets list
!kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset

In [None]:
!unzip fake-and-real-news-dataset.zip

<a name='2'></a>
## 2. Import Libraries
[Back to TOC](#0)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import transformers
from transformers import DistilBertTokenizer, TFDistilBertModel, TFDistilBertForSequenceClassification, BertTokenizer, TFBertModel, TFBertForSequenceClassification

import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

<a name='3'></a>
## 3. Add Dataframe label & Merge
[Back to TOC](#0)

In [None]:
true_df = pd.read_csv('True.csv')
fake_df = pd.read_csv('Fake.csv')

In [None]:
true_df.shape, fake_df.shape

In [None]:
true_df['label'] = 1
fake_df['label'] = 0
full_df = pd.concat([true_df, fake_df], axis=0)

In [None]:
full_df.shape

In [None]:
full_df.head()

<a name='4'></a>
## 4. Exploratory Data Analysis (EDA)
[Back to TOC](#0)

In [None]:
sns.countplot(data=full_df, x='label')

In [None]:
plt.figure(figsize=(10, 5))
order_by_subject = full_df['subject'].value_counts().sort_values(ascending=False).index
sns.countplot(data=full_df, x='subject', order=order_by_subject)

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(data=full_df, x='subject', hue='label')

In [None]:
full_df.groupby(['subject', 'label'], sort=False)['label'].count()

<a name='5'></a>
## 5. Dataframe Preprocessing
[Back to TOC](#0)

In [None]:
full_df = full_df.drop(columns=['text', 'subject', 'date'])
full_df.head(2)

In [None]:
full_df['title_len'] = full_df['title'].str.split().str.len()
full_df.sample(3, random_state=42)

In [None]:
MAX_LENGTH = 42  # initial value; redefined as 64 in section 5.1
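
The word counts in `title_len` give a rough basis for choosing `MAX_LENGTH`. The following is a minimal sketch of that reasoning, assuming a handful of hypothetical titles rather than the dataset itself. The `percentile` helper and the 1.5x headroom factor are illustrative assumptions; whitespace word counts underestimate WordPiece token counts, so some margin is needed.

```python
import math

# Hypothetical titles, for illustration only (not taken from the dataset).
titles = [
    "Senate passes budget bill after long debate",
    "You will not believe what this politician said on live television last night",
    "Markets rally",
    "Court blocks new travel order",
]

# Whitespace word count per title, the same quantity as `title_len`.
word_counts = sorted(len(t.split()) for t in titles)

def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, with q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

longest = word_counts[-1]
p95 = percentile(word_counts, 95)

# WordPiece splits rare words into several tokens and the tokenizer adds
# [CLS] and [SEP], so the cap should sit above the raw word count.
# The 1.5x headroom factor is an assumption, not a measured value.
suggested_max_length = math.ceil(p95 * 1.5) + 2
print(longest, p95, suggested_max_length)
```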

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

<a name='5.1'></a>
### 5.1 Tokenize Function
[Back to TOC](#0)

In [None]:
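MAX_LENGTH = 64  # maximum number of tokens per news title (42 was tried earlier)
def tokenize_word(text):
  """Tokenize one title and return its 1-D `input_ids` and `attention_mask` tensors."""
  toks = tokenizer(text,
                   max_length=MAX_LENGTH,
                   padding='max_length',
                   truncation=True,
                   return_tensors='tf')
  # return_tensors='tf' yields shape (1, MAX_LENGTH); [0] drops the batch axis
  toks = {
      'input_ids': toks['input_ids'][0],
      'attention_mask': toks['attention_mask'][0]
  }
  return toks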

In [None]:
def inputs_tokenizer(df):
  """Tokenize every title in `df` and stack the results into int32 arrays."""
  input_ids = []
  attention_masks = []
  for title in df['title'].tolist():
    tokens = tokenize_word(title)
    input_ids.append(tokens['input_ids'])
    attention_masks.append(tokens['attention_mask'])

  # Each array has shape (number of titles, MAX_LENGTH)
  inputs = {
      'input_ids': np.asarray(input_ids, dtype='int32'),
      'attention_mask': np.asarray(attention_masks, dtype='int32')
  }
  return inputs

<a name='5.2'></a>
### 5.2 Split Train/Test
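`Training 80%`

`Validation 20%` (this hold-out split is named `X_test`/`y_test` in the code; it serves as validation data during training and is reused for the evaluation in section 7)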

[Back to TOC](#0)
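
To illustrate what a stratified 80/20 split does, here is a minimal standard-library sketch. It is not scikit-learn's implementation; `stratified_split` and the label counts are hypothetical and only show that each class is divided in the same ratio.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 60 real (1) and 40 fake (0).
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels)

# The hold-out set keeps the 60/40 class ratio of the full data.
test_real = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_real)
```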

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    full_df.drop(columns=['label']),
    full_df['label'],
    test_size=0.2,
    stratify=full_df['label'],
    random_state=42,
)

In [None]:
X_train.shape, X_test.shape

<a name='5.3'></a>
### 5.3 Encode/Tokenize dataset

Tokenize each title and add the special tokens `[CLS]` and `[SEP]`.

[Back to TOC](#0)
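
The effect of `padding='max_length'` and `truncation=True` can be sketched with a toy whitespace encoder. This is an illustration only, not the WordPiece tokenizer used in this notebook: `encode` and `toy_vocab` are hypothetical, and only the special-token ids correspond to the uncased BERT/DistilBERT vocabulary.

```python
# Special-token ids below match bert-base-uncased / distilbert-base-uncased;
# the word ids in `toy_vocab` are made up for illustration.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102
toy_vocab = {"senate": 2001, "passes": 2002, "new": 2003, "bill": 2004}

def encode(title, max_length):
    """Toy whitespace encoder mimicking padding='max_length', truncation=True."""
    word_ids = [toy_vocab.get(w, UNK_ID) for w in title.lower().split()]
    # Reserve two positions for [CLS] and [SEP] before truncating.
    word_ids = word_ids[: max_length - 2]
    input_ids = [CLS_ID] + word_ids + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Pad to the fixed length; padded positions get mask 0.
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

encoded = encode("Senate passes new bill", max_length=8)
print(encoded["input_ids"])       # [101, 2001, 2002, 2003, 2004, 102, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The attention mask lets the model ignore the padded positions, which is why both arrays are passed to the model together.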

In [None]:
X_train_inputs = inputs_tokenizer(X_train)
X_test_inputs = inputs_tokenizer(X_test)

<a name='6'></a>
## 6. Modeling
[Back to TOC](#0)

**Model Summary**
- Pre-trained Model => `distilbert-base-uncased | bert-base-uncased`
  - trainable = `True | False` 
- Dropout => `0.1`
- Epochs => `3` (4 epochs caused overfitting)
    - learning rates => `[3e-4, 1e-4, 5e-5, 3e-5]` (candidate values from the BERT repository README)
    - batch sizes => `8, 16, 32, 64, 128`
    - ref. [https://github.com/google-research/bert](https://github.com/google-research/bert)
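
The candidate values above span a small search space. The sketch below enumerates it and shows how batch size changes the number of optimizer steps. It is an assumption that the values are combined as a full grid, and `n_train` is a hypothetical training-set size used only for the arithmetic.

```python
import math
from itertools import product

learning_rates = [3e-4, 1e-4, 5e-5, 3e-5]
batch_sizes = [8, 16, 32, 64, 128]
EPOCHS = 3

# Hypothetical training-set size, used only to illustrate the arithmetic.
n_train = 10_000

# Every (learning rate, batch size) pair in the candidate lists.
grid = list(product(learning_rates, batch_sizes))
print(f"{len(grid)} combinations")

for batch_size in batch_sizes:
    # The last, partially filled batch still counts as one step.
    steps_per_epoch = math.ceil(n_train / batch_size)
    print(batch_size, steps_per_epoch, steps_per_epoch * EPOCHS)
```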

In [None]:
def create_model(bert_model):
  input_ids = Input(shape=(MAX_LENGTH,), dtype='int32', name='input_ids')
  attention_masks = Input(shape=(MAX_LENGTH,), dtype='int32', name='attention_masks')

  # TFDistilBertModel: output [0] is the last hidden state; position 0 is the [CLS] token
  embedding = bert_model(input_ids, attention_mask=attention_masks)[0]
  output = Dense(32, activation='relu')(embedding[:, 0, :])
  output = tf.keras.layers.Dropout(rate=0.1)(output)
  output = Dense(1, activation='sigmoid')(output)
  
  # TFBertModel
  # embedding = bert_model(input_ids, attention_masks)[1] #pooled output
  # output = Dense(32, activation='relu')(embedding)
  # output = tf.keras.layers.Dropout(rate=0.1)(output)
  # output = Dense(1, activation='sigmoid')(output)

  model = Model(inputs=[input_ids, attention_masks], outputs=output)
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), 
                loss='binary_crossentropy', 
                metrics=['accuracy'])
  return model
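
The slice `embedding[:, 0, :]` keeps only the first token's vector from each sequence. A minimal NumPy sketch of the shapes involved follows; the array is random stand-in data, not real model output, and the batch size of 2 is arbitrary.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 64, 768  # 768 is the base-model hidden size

# Stand-in for the encoder output `bert_model(...)[0]`; random values, not real embeddings.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Position 0 of every sequence holds the [CLS] token's vector.
cls_vector = last_hidden_state[:, 0, :]
print(last_hidden_state.shape, cls_vector.shape)
```

The result has one 768-dimensional vector per title, which is what the `Dense(32)` layer receives.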

In [None]:
# The base encoder has no classification head, so `num_labels` is not needed here.
base = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
# base = TFBertModel.from_pretrained('bert-base-uncased')
for layer in base.layers:
  layer.trainable = True  # set to False to freeze the encoder and train only the head
base.summary()

In [None]:
model = create_model(base)
model.summary()
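
When the encoder is frozen (`trainable = False`), only the classification head is updated. As a worked check against `model.summary()`, the head's parameter count can be computed by hand. `dense_params` is a hypothetical helper, and the hidden size of 768 applies to both base models.

```python
HIDDEN_SIZE = 768  # hidden size of distilbert-base-uncased and bert-base-uncased

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# Dense(32) on the [CLS] vector, then Dense(1); Dropout has no parameters.
head_params = dense_params(HIDDEN_SIZE, 32) + dense_params(32, 1)
print(head_params)  # 24641
```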

In [None]:
history = model.fit([X_train_inputs['input_ids'], X_train_inputs['attention_mask']], 
          y_train, 
          batch_size=64, 
          epochs=3,
          validation_data=([X_test_inputs['input_ids'], X_test_inputs['attention_mask']], y_test))

<a name='7'></a>
## 7. Evaluation
[Back to TOC](#0)

In [None]:
y_pred = model.predict([X_test_inputs['input_ids'], X_test_inputs['attention_mask']])

In [None]:
y_pred = np.round(y_pred).astype(int).ravel()
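
Rounding the sigmoid outputs amounts to thresholding at 0.5. A small sketch with hypothetical probabilities shows the equivalence; note that `np.round` rounds exact halves to the nearest even value, so a probability of exactly 0.5 maps to class 0.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1), as `model.predict` returns.
probs = np.array([[0.20], [0.50], [0.51], [0.97]])

rounded = np.round(probs).astype(int).ravel()
thresholded = (probs > 0.5).astype(int).ravel()
print(rounded, thresholded)
```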

<a name='7.1'></a>
### 7.1 DistilBERT results
[Back to TOC](#0)

#### Trainable = TRUE

- Elapsed: `238s`

- Results: `accuracy: 0.9594 - val_accuracy: 0.9835`

#### Trainable = FALSE
- Elapsed: `99s`

- Results: `accuracy: 0.8286 - val_accuracy: 0.8861`

<a name='7.2'></a>
### 7.2 BERT-base results
[Back to TOC](#0)

#### Trainable = TRUE

- Elapsed: `471s`

- Results: `accuracy: 0.9871 - val_accuracy: 0.9839`

#### Trainable = FALSE

- Elapsed: `199s`

- Results: `accuracy: 0.7195 - val_accuracy: 0.7667`

In [None]:
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.legend()

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

In [None]:
acc_score = accuracy_score(y_test, y_pred)
print(f'Accuracy score = {acc_score:.2%}')  # acc_score is a fraction in [0, 1]

In [None]:
# Use distinct names so the imported metric functions are not overwritten
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

In [None]:
scores = [['accuracy', acc_score], ['f1', f1], ['precision', precision], ['recall', recall]]
metrics_df = pd.DataFrame(scores, columns=['metrics', 'score'])
metrics_df

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred)  # y_pred is already a flat array of 0/1 labels
sns.heatmap(cf_matrix, annot=True, cmap='Blues', fmt='d')
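
The scores in `metrics_df` can be read directly off the confusion matrix. The sketch below uses hypothetical counts, not results from this notebook, to show how each metric follows from the four cells.

```python
# Hypothetical confusion-matrix counts, laid out as scikit-learn does:
# rows are true labels, columns are predicted labels, in order [0, 1].
cf = [[90, 10],
      [5, 95]]
tn, fp = cf[0]
fn, tp = cf[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted "real" that is truly real
recall = tp / (tp + fn)      # share of truly real items that were found
f1 = 2 * precision * recall / (precision + recall)
print(f"{accuracy:.3f} {precision:.3f} {recall:.3f} {f1:.3f}")
```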