<a href="https://colab.research.google.com/github/sreebalajisree/Fake_News_Detection/blob/main/Fake_News_Detection_Using_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install dependencies

In [99]:
#!pip install transformers

In [100]:
#!pip install texthero

In [101]:
#!pip install Cython

In [102]:
#!pip install -U spacy

In [103]:
#!pip install simpletransformers

# Import Libraries

In [104]:
#Basic Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Sk-Learn libraries
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score,confusion_matrix,precision_score,recall_score, accuracy_score, classification_report

#Text-hero for Data Cleaning, processing
import texthero as hero
from texthero import visualization

#Sci-py library
from scipy.stats import pearsonr, spearmanr

#Transformer Libraries
from simpletransformers.t5 import T5Model
from transformers.data.metrics.squad_metrics import compute_exact, compute_f1

#Other libraries
import json
from datetime import datetime
from pprint import pprint
from statistics import mean

# Load the dataset

In [105]:
fake_train_covid = pd.read_excel("/content/sample_data/data/Constraint_English_Train.xlsx")
fake_valid_covid = pd.read_excel("/content/sample_data/data/Constraint_English_Val.xlsx")
fake_test_covid = pd.read_excel("/content/sample_data/data/Constraint_English_Test_without_labels.xlsx")
fake_external1 = pd.read_excel("/content/sample_data/data/external_1.xlsx")
fake_external2 = pd.read_excel("/content/sample_data/data/external_2.xlsx")
fake_train = pd.read_excel("/content/sample_data/data/new_train_data_all_topic.xlsx")
fake_valid = pd.read_excel("/content/sample_data/data/new_valid_data_all_topic.xlsx")
fake_test = pd.read_excel("/content/sample_data/data/new_test_data_all_topic.xlsx")

# Preview the first 5 rows of the datasets

In [106]:
fake_train.head()

Unnamed: 0,tweet,label
0,Donald Trump Sends Out Embarrassing New Year’...,0
1,Drunk Bragging Trump Staffer Started Russian ...,0
2,Sheriff David Clarke Becomes An Internet Joke...,0
3,Trump Is So Obsessed He Even Has Obama’s Na...,0
4,Pope Francis Just Called Out Donald Trump Dur...,0


In [107]:
fake_test.head()

Unnamed: 0,tweet,label
0,NORDSTROM CANCELS IVANKA TRUMP BRAND After Lib...,0
1,BREAKING: IRAN Tests Cruise Missile…Trump WA...,0
2,WHAT? DEMOCRAT CONGRESSWOMAN Calls Violent Rio...,0
3,HILLARY’S LAP DOG VA Senator Tim Kaine Calls...,0
4,SHOCKING MIGRANT CLASS WARS: N. African Migran...,0


In [108]:
fake_valid.head()

Unnamed: 0,tweet,label
0,ULTIMATE HYPOCRITES! RUSSIAN Ambassador Visite...,0
1,WATCH: G.W. BUSH Gushes Over Kimmel’s Anti-T...,0
2,RACIST LIBERAL REPORTER Arrested In Connection...,0
3,NEWT GINGRICH Punches Back At Democrats With M...,0
4,EXPOSED! OBAMA REGIME Gave MILLIONS US Tax Dol...,0


# Data preprocessing

In [109]:
# Keep only the tweet text and its label, both cast to strings
fake_train1 = fake_train_covid[['tweet', 'label']].astype(str)
fake_valid1 = fake_valid_covid[['tweet', 'label']].astype(str)
fake_external1_df = fake_external1[['tweet', 'label']].astype(str)

In [110]:
# Verify the dtype of the dataframe
print(fake_train1.dtypes)
print(fake_valid1.dtypes)
print(fake_external1_df.dtypes)

tweet    object
label    object
dtype: object
tweet    object
label    object
dtype: object
tweet    object
label    object
dtype: object


In [111]:
"""
Clean the train dataset using texthero library
"""
df_cleaned_tweet_fake_train = pd.DataFrame()
df_cleaned_tweet_fake_train['tweet'] = hero.clean(fake_train1['tweet'])
df_cleaned_tweet_fake_train = pd.concat([df_cleaned_tweet_fake_train['tweet'], fake_train1['label']], axis=1)

In [112]:
"""
Clean the valid dataset using texthero library
"""
df_cleaned_tweet_fake_valid = pd.DataFrame()
df_cleaned_tweet_fake_valid['tweet'] = hero.clean(fake_valid1['tweet'])
df_cleaned_tweet_fake_valid = pd.concat([df_cleaned_tweet_fake_valid['tweet'], fake_valid1['label']], axis=1)

In [113]:
"""
Clean the external dataset using texthero library
"""
df_cleaned_tweet_fake_external = pd.DataFrame()
df_cleaned_tweet_fake_external['tweet'] = hero.clean(fake_external1_df['tweet'])
# Take the label from the string-cast frame, consistent with the train and valid cells
df_cleaned_tweet_fake_external = pd.concat([df_cleaned_tweet_fake_external['tweet'], fake_external1_df['label']], axis=1)

In [114]:
binary_train_df = pd.DataFrame({
    'prefix': ["binary classification" for i in range(len(df_cleaned_tweet_fake_train))],
    'input_text': df_cleaned_tweet_fake_train.tweet.str.replace('\n', ' '),
    'target_text': df_cleaned_tweet_fake_train.label.astype(str),
})

print(binary_train_df.head())

                  prefix                                         input_text  \
0  binary classification  cdc currently reports deaths general discrepan...   
1  binary classification  states reported deaths small rise last tuesday...   
2  binary classification  politically correct woman almost uses pandemic...   
3  binary classification  indiafightscorona covid testing laboratories i...   
4  binary classification  populous states generate large case counts loo...   

  target_text  
0        real  
1        real  
2        fake  
3        real  
4        real  


In [115]:
binary_valid_df = pd.DataFrame({
    'prefix': ["binary classification" for i in range(len(df_cleaned_tweet_fake_valid))],
    'input_text': df_cleaned_tweet_fake_valid.tweet.str.replace('\n', ' '),
    'target_text': df_cleaned_tweet_fake_valid.label.astype(str),
})

print(binary_valid_df.head())

                  prefix                                         input_text  \
0  binary classification  chinese converting islam realising muslim affe...   
1  binary classification  people diamond princess cruise ship intially t...   
2  binary classification       covid caused bacterium virus treated aspirin   
3  binary classification  mike pence rnc speech praises donald trump' co...   
4  binary classification  sky edconwaysky explains latest covid19 data g...   

  target_text  
0        fake  
1        fake  
2        fake  
3        fake  
4        real  


In [116]:
binary_external_df = pd.DataFrame({
    'prefix': ["binary classification" for i in range(len(df_cleaned_tweet_fake_external))],
    'input_text': df_cleaned_tweet_fake_external.tweet.str.replace('\n', ' '),
    'target_text': df_cleaned_tweet_fake_external.label.astype(str),
})

print(binary_external_df.head())

                  prefix                                         input_text  \
0  binary classification  travellers adhere strict hygiene measures wash...   
1  binary classification  first time post war history epidemics reversal...   
2  binary classification  understing japanese doctor offers excellent ad...   
3  binary classification  drinking lemon water could kill virus due vita...   
4  binary classification  coronavirus hoax fake virus pandemic fabricate...   

  target_text  
0           1  
1           1  
2           0  
3           0  
4           0  
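
The three frames above use the `prefix` / `input_text` / `target_text` column layout that simpletransformers' `T5Model` works with. As a rough sketch of how one row can be turned into a single prompt string (the helper name `to_t5_input` is hypothetical, and the `prefix: text` joining is an assumption based on the format the library documents for prediction inputs):

```python
def to_t5_input(prefix: str, text: str) -> str:
    """Join a task prefix and an input text into one 'prefix: text' prompt."""
    # Flatten newlines to spaces, mirroring the str.replace('\n', ' ') call above.
    return f"{prefix}: {text}".replace("\n", " ")


# One (prefix, input_text) pair taken from the validation preview above.
rows = [("binary classification", "covid caused bacterium virus treated aspirin")]
prompts = [to_t5_input(p, t) for p, t in rows]
print(prompts[0])
# binary classification: covid caused bacterium virus treated aspirin
```

Strings in this shape are what would be passed to `model.predict(...)` after training, since that method takes plain text rather than a DataFrame.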


# View the first 5 rows of the cleaned data

In [117]:
df_cleaned_tweet_fake_train.head()

Unnamed: 0,tweet,label
0,cdc currently reports deaths general discrepan...,real
1,states reported deaths small rise last tuesday...,real
2,politically correct woman almost uses pandemic...,fake
3,indiafightscorona covid testing laboratories i...,real
4,populous states generate large case counts loo...,real


In [118]:
df_cleaned_tweet_fake_valid.head()

Unnamed: 0,tweet,label
0,chinese converting islam realising muslim affe...,fake
1,people diamond princess cruise ship intially t...,fake
2,covid caused bacterium virus treated aspirin,fake
3,mike pence rnc speech praises donald trump' co...,fake
4,sky edconwaysky explains latest covid19 data g...,real


In [119]:
df_cleaned_tweet_fake_external.head()

Unnamed: 0,tweet,label
0,travellers adhere strict hygiene measures wash...,1
1,first time post war history epidemics reversal...,1
2,understing japanese doctor offers excellent ad...,0
3,drinking lemon water could kill virus due vita...,0
4,coronavirus hoax fake virus pandemic fabricate...,0
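
The previews show that the train and validation sets label tweets as `real` / `fake`, while the external set uses `1` / `0`. If the external data were ever combined with the others, the labels would need a shared vocabulary. The sketch below is one way to do that; the mapping `1 -> real`, `0 -> fake` is an assumption inferred only from the rows previewed above, and `normalize_label` is a hypothetical helper, not part of the notebook.

```python
# Assumed mapping (not confirmed by the dataset documentation): 1 = real, 0 = fake.
LABEL_MAP = {"1": "real", "0": "fake", "real": "real", "fake": "fake"}


def normalize_label(label) -> str:
    """Map numeric or textual labels onto the 'real' / 'fake' vocabulary."""
    key = str(label).strip().lower()
    if key not in LABEL_MAP:
        # Fail loudly rather than silently guessing an unknown label.
        raise ValueError(f"Unexpected label: {label!r}")
    return LABEL_MAP[key]


labels = [1, 1, 0, "fake", "real"]
normalized = [normalize_label(x) for x in labels]
print(normalized)
# ['real', 'real', 'fake', 'fake', 'real']
```

With pandas, the same function could be applied as `binary_external_df['target_text'].map(normalize_label)`.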


# Train T5-base on the dataset

In [120]:
model_args = {
    "max_seq_length": 196,
    "train_batch_size": 16,
    "eval_batch_size": 64,
    "num_train_epochs": 1,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 15000,
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "fp16": False,
    "save_steps": -1,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "wandb_project": "T5 tasks - Binary classification",
}

model = T5Model("t5", "t5-base", args=model_args, use_cuda=False)

model.train_model(binary_train_df, eval_data=binary_valid_df)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


  0%|          | 0/6420 [00:00<?, ?it/s]

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and targets.

Here is a short example:

model_inputs = tokenizer(src_texts, text_target=tgt_texts, ...)

If you either need to use different keyword arguments for the source and target texts, you should do two calls like
this:

model_inputs = tokenizer(src_texts, ...)
labels = tokenizer(text_target=tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.

  "`as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your "


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Running Epoch 0 of 1:   0%|          | 0/402 [00:00<?, ?it/s]

  0%|          | 0/2140 [00:00<?, ?it/s]


(402,
 {'global_step': [402],
  'eval_loss': [0.06928516294368926],
  'train_loss': [0.0019966699182987213]})
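
The `global_step` of 402 in the returned tuple matches the progress bars above: 6420 training rows at a batch size of 16. A small sketch of that arithmetic, assuming no gradient accumulation (the variable names are illustrative only):

```python
import math

num_train_examples = 6420   # from the feature-conversion progress bar
train_batch_size = 16       # from model_args
num_train_epochs = 1        # from model_args

# The last batch is partial, so round up.
steps_per_epoch = math.ceil(num_train_examples / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(total_steps)  # 402

# Number of times the step-based evaluation interval is reached during training.
evaluate_during_training_steps = 15000  # from model_args
evals = total_steps // evaluate_during_training_steps
print(evals)  # 0
```

Because 15000 exceeds the 402 total steps, the step-based evaluation interval is never reached in this run; the single `eval_loss` entry presumably comes from an end-of-epoch evaluation. A smaller interval would be needed to see intermediate evaluation results.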