<a href="https://colab.research.google.com/github/vitthal-bhandari/Coding-Challenge-Fatima-Fellowship/blob/master/Vitthal_Bhandari_Coding_Challenge_for_Fatima_Fellowship.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning for NLP


You can download the dataset from [https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset). 

To run this code, upload both dataset files (`True.csv` and `Fake.csv`) to the Colab instance.
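
A quick sanity check along these lines can confirm that both files are in place before the rest of the notebook runs. It is only a sketch: it assumes the files keep the names `True.csv` and `Fake.csv` (the names the data-loading cells read) and sit in the working directory. `missing_dataset_files` is a helper name introduced here for illustration.

```python
from pathlib import Path

# File names as read by the data-loading cells of this notebook.
REQUIRED_FILES = ("True.csv", "Fake.csv")

def missing_dataset_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]

missing = missing_dataset_files()
if missing:
    print("Upload these files to the Colab instance first:", ", ".join(missing))
else:
    print("Both dataset files found.")
```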

The final fine-tuned model is available on Hugging Face at [https://huggingface.co/bitsanlp/distilbert-base-uncased-distilbert-fakenews-detection](https://huggingface.co/bitsanlp/distilbert-base-uncased-distilbert-fakenews-detection).
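
For readers who only want to try the published checkpoint, a minimal sketch of loading it with the `transformers` pipeline API might look like the following. The helper names `readable_label` and `classify` are introduced here for illustration. The `LABEL_0`/`LABEL_1` names are an assumption: they are the defaults `transformers` assigns when no label names are configured, and this notebook configures none. The 0 = real, 1 = fake convention comes from the labelling cell in the Data Wrangling section.

```python
# Label ids follow this notebook's convention: 0 = real news, 1 = fake news.
# Assumption: the checkpoint exposes the default "LABEL_0"/"LABEL_1" names.
READABLE_LABELS = {"LABEL_0": "real", "LABEL_1": "fake"}

def readable_label(raw_label):
    """Map a raw pipeline label to a readable name, leaving unknown labels as-is."""
    return READABLE_LABELS.get(raw_label, raw_label)

def classify(texts, model_id="bitsanlp/distilbert-base-uncased-distilbert-fakenews-detection"):
    """Download the fine-tuned checkpoint and classify a list of article texts."""
    # Imported here so the label helper above works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True cuts inputs longer than the model's maximum length.
    results = classifier(texts, truncation=True)
    return [(readable_label(r["label"]), r["score"]) for r in results]
```

Because the model was trained on inputs of the form `subject + title + text`, a call such as `classify(["politicsNews Some headline. Some article body."])` is closest to what it saw during training.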

# Initializing Environment

In [None]:
#check gpu usage
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Fri Apr  1 15:30:12 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    54W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
#check ram usage
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


In [None]:
!pip install datasets
!pip install transformers
!sudo apt-get install git-lfs
!pip3 install torch==1.10.2+cu102 torchvision==0.11.3+cu102 torchaudio===0.10.2+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.
Looking in links: https://download.pytorch.org/whl/cu102/torch_stable.html


In [None]:
from huggingface_hub import notebook_login
from datasets import load_dataset
import pandas as pd
import numpy as np
import torch

In [None]:
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


# Data Wrangling

In [None]:
df_true=pd.read_csv("True.csv")
df_true.info()
df_true.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB


Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [None]:
df_fake=pd.read_csv('Fake.csv')
df_fake.info()
df_fake.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB


Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [None]:
# Remove empty rows
df_true = df_true.dropna()
df_fake = df_fake.dropna()

# Remove duplicates
df_true = df_true.drop_duplicates()
df_fake = df_fake.drop_duplicates()

# Reset index
df_true = df_true.reset_index(drop=True)
df_fake = df_fake.reset_index(drop=True)

In [None]:
# Concatenating subject and title with news text

df_true_aug = pd.DataFrame()
df_fake_aug = pd.DataFrame()

df_true_aug["text"] = df_true.apply(lambda row: row['subject'] + ' ' + row['title'] + ' ' + row['text'], axis=1)
df_true_aug["label"] = 0

df_fake_aug["text"] = df_fake.apply(lambda row: row['subject'] + ' ' + row['title'] + ' ' + row['text'], axis=1)
df_fake_aug["label"] = 1

In [None]:
df_true_aug.head()

Unnamed: 0,text,label
0,"politicsNews As U.S. budget fight looms, Repub...",0
1,politicsNews U.S. military to accept transgend...,0
2,politicsNews Senior U.S. Republican senator: '...,0
3,politicsNews FBI Russia probe helped by Austra...,0
4,politicsNews Trump wants Postal Service to cha...,0


In [None]:
df_fake_aug.head()

Unnamed: 0,text,label
0,News Donald Trump Sends Out Embarrassing New ...,1
1,News Drunk Bragging Trump Staffer Started Rus...,1
2,News Sheriff David Clarke Becomes An Internet...,1
3,News Trump Is So Obsessed He Even Has Obama’s...,1
4,News Pope Francis Just Called Out Donald Trum...,1


In [None]:
frames = [df_true_aug, df_fake_aug]
df_news = pd.concat(frames)

In [None]:
# Shuffle the dataframe

df_news = df_news.sample(frac=1).reset_index(drop=True)
df_news.info()
df_news.head()

Unnamed: 0,text,label
0,"worldnews Merkel, Juncker discuss Catalan cris...",0
1,left-news LIBERAL HACK KATIE COURIC Says Fake ...,1
2,politics TRUMP WAS RIGHT! Audit Reveals State ...,1
3,News Clay Aiken Says He Was A ‘F*****g Dumbas...,1
4,politicsNews U.S. militia girds for trouble as...,0


In [None]:
df_news['label'].value_counts()

1    23478
0    21211
Name: label, dtype: int64

In [None]:
#saving for future reference

df_news.to_csv('df_news.csv', index=False)

In [None]:
from datasets import Dataset, DatasetDict

data = Dataset.from_pandas(df = df_news)

In [None]:
data

Dataset({
    features: ['text', 'label'],
    num_rows: 44689
})

In [None]:
# 70% train, 30% test + validation
data_train = data.train_test_split(test_size = 0.3)

# Split the 30% test + valid in half test, half valid
data_valid = data_train['test'].train_test_split(test_size=0.5)

# Gather the three splits into a single DatasetDict
train_test_valid_dataset = DatasetDict({
    'train': data_train['train'],
    'test': data_valid['test'],
    'valid': data_valid['train']})

train_test_valid_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 31282
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 6704
    })
    valid: Dataset({
        features: ['text', 'label'],
        num_rows: 6703
    })
})
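
The split sizes above can be reproduced with a little arithmetic. The sketch below assumes the fractional split rounds the test share up and the train share down. That assumption matches the row counts shown, but it is not taken from the library's documentation. `split_sizes` is a helper name introduced here.

```python
import math

def split_sizes(n_rows, test_size):
    """Sizes of a fractional split: test share rounded up, train share rounded down."""
    n_test = math.ceil(test_size * n_rows)
    n_train = math.floor((1 - test_size) * n_rows)
    return n_train, n_test

n_train, n_holdout = split_sizes(44689, 0.3)   # first split: 70% / 30%
n_valid, n_test = split_sizes(n_holdout, 0.5)  # second split: halve the 30%
print(n_train, n_valid, n_test)  # 31282 6703 6704
```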

# Tokenization

In [None]:
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
def tokenize(batch):
  return tokenizer(batch["text"], padding=True, truncation=True)

In [None]:
news_train = train_test_valid_dataset["train"].filter(lambda example: example['text'] is not None)
news_dev = train_test_valid_dataset["valid"].filter(lambda example: example['text'] is not None)
news_test = train_test_valid_dataset["test"].filter(lambda example: example['text'] is not None)
print( len(news_train), len(news_dev), len(news_test) )

  0%|          | 0/32 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

31282 6703 6704


In [None]:
# Before tokenization

train_test_valid_dataset['train'].column_names

['text', 'label']

In [None]:
#apply tokenizer across all splits in the corpus
news_train_encoded = news_train.map(tokenize, batched=True, batch_size=None)
news_dev_encoded = news_dev.map(tokenize, batched=True, batch_size=None)
news_test_encoded = news_test.map(tokenize, batched=True, batch_size=None)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [None]:
# After tokenization

news_train_encoded.column_names

['text', 'label', 'input_ids', 'attention_mask']

# Instantiating Model

In [None]:
from transformers import AutoModelForSequenceClassification

num_labels = 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = (AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=num_labels).to(device))

In [None]:
#Define performance metrics

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  f1 = f1_score(labels, preds, average = "macro")
  acc = accuracy_score(labels, preds)
  return {"accuracy": acc, "f1": f1}
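
To make the `average = "macro"` choice concrete, here is a small pure-Python sketch of the two metrics on a toy example. It is an illustration of the definitions (macro F1 as the unweighted mean of the per-class F1 scores), not the scikit-learn implementation. `macro_f1` and `accuracy` are names introduced here.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)

labels = [0, 0, 1, 1]
preds = [0, 1, 1, 1]
# Class 0 has F1 = 2/3 and class 1 has F1 = 4/5, so the macro average is about 0.733.
print(accuracy(labels, preds), round(macro_f1(labels, preds), 3))
```

Macro averaging gives both classes equal weight regardless of how many examples each has, which is a reasonable choice when the classes are only mildly imbalanced, as they are here.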

In [None]:
#Define hyperparameters

from transformers import Trainer, TrainingArguments

batch_size = 32
logging_steps = len(news_train_encoded) // batch_size
model_name = f"{model_ckpt}-distilbert-fakenews-detection"
training_args = TrainingArguments(
    output_dir = model_name,
    num_train_epochs = 3,
    learning_rate = 2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay = 0.01,
    evaluation_strategy = "epoch",
    disable_tqdm = False,
    logging_steps =logging_steps,
    push_to_hub=True,
    log_level = "error"
)

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    compute_metrics = compute_metrics,
    train_dataset = news_train_encoded,
    eval_dataset = news_dev_encoded,
    tokenizer = tokenizer
)

trainer.train()

Cloning https://huggingface.co/bitsanlp/distilbert-base-uncased-distilbert-fakenews-detection into local empty directory.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.0125,3.9e-05,1.0,1.0
2,0.0,1.2e-05,1.0,1.0
3,0.0,9e-06,1.0,1.0


TrainOutput(global_step=2934, training_loss=0.004170979149720153, metrics={'train_runtime': 2765.9582, 'train_samples_per_second': 33.929, 'train_steps_per_second': 1.061, 'total_flos': 1.2431535494270976e+16, 'train_loss': 0.004170979149720153, 'epoch': 3.0})
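
The `global_step` value in the output can be checked by hand. The sketch below assumes a single device and no gradient accumulation, which is consistent with the arguments above but is still an assumption.

```python
import math

n_train = 31282      # training rows reported in the split above
batch_size = 32
num_epochs = 3

steps_per_epoch = math.ceil(n_train / batch_size)  # the last, smaller batch still counts
total_steps = steps_per_epoch * num_epochs
logging_steps = n_train // batch_size              # value passed to TrainingArguments

print(steps_per_epoch, total_steps, logging_steps)  # 978 2934 977
```

Since `logging_steps` is one less than the number of steps per epoch, the training loss is logged roughly once per epoch.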

# Testing

In [None]:
# Obtaining predictions on test set

preds_output = trainer.predict(news_test_encoded)

In [None]:
preds_output.metrics

{'test_accuracy': 1.0,
 'test_f1': 1.0,
 'test_loss': 8.775856258580461e-06,
 'test_runtime': 63.5898,
 'test_samples_per_second': 105.426,
 'test_steps_per_second': 3.302}

In [None]:
y_preds = np.argmax(preds_output.predictions, axis=1)

In [None]:
# Committing to hub

trainer.push_to_hub(commit_message="Training completed!")

Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 3.34k/255M [00:00<?, ?B/s]

Upload file runs/Apr01_16-11-53_8a4d231a373d/events.out.tfevents.1648829526.8a4d231a373d.987.2:  66%|######5  …

To https://huggingface.co/bitsanlp/distilbert-base-uncased-distilbert-fakenews-detection
   600ea63..1aeb5c8  main -> main

To https://huggingface.co/bitsanlp/distilbert-base-uncased-distilbert-fakenews-detection
   1aeb5c8..7475546  main -> main



'https://huggingface.co/bitsanlp/distilbert-base-uncased-distilbert-fakenews-detection/commit/1aeb5c86a05e21da4a18af4cbab300a6fc8a07d8'

# Discussion

The model reaches an accuracy and macro F1 of 1.0 on both the validation and test sets.
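
One thing that may be worth checking before reading too much into the perfect score: the `subject` value is prepended to every input, and in the rows printed above the real articles carry `politicsNews` or `worldnews` while the fake ones carry `News`, `left-news` or `politics`. If the two files share no subject values at all, the classifier could in principle separate the classes from that first token alone. The sketch below shows one way to test for this. `shared_subjects` is a helper name introduced here, and the toy frames only contain the subject values visible in this notebook's outputs, not the full dataset.

```python
import pandas as pd

def shared_subjects(df_true, df_fake, column="subject"):
    """Return the values of `column` that occur in both dataframes."""
    return set(df_true[column].unique()) & set(df_fake[column].unique())

# Toy frames using subject values visible in the outputs above.
toy_true = pd.DataFrame({"subject": ["politicsNews", "worldnews"]})
toy_fake = pd.DataFrame({"subject": ["News", "left-news", "politics"]})

print(shared_subjects(toy_true, toy_fake))  # set()
```

Running `shared_subjects(df_true, df_fake)` on the real frames would show whether this holds for the whole dataset. If it does, retraining without the `subject` prefix would give a more informative score.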

If some articles were misclassified, we could refine the dataset to improve performance, for example by excluding spurious examples, downsampling the majority class, or upsampling the minority class.
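
As an illustration of the downsampling option, a minimal pandas sketch could look like this. It assumes a dataframe with a `label` column like `df_news`. `downsample_majority` and its `seed` parameter are names introduced here, and the toy frame is made-up data.

```python
import pandas as pd

def downsample_majority(df, label_col="label", seed=0):
    """Randomly drop rows so every class has as many rows as the smallest one."""
    smallest = df[label_col].value_counts().min()
    parts = [group.sample(n=smallest, random_state=seed)
             for _, group in df.groupby(label_col)]
    # Shuffle so the classes are interleaved again.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Made-up example: four rows of class 1, three rows of class 0.
toy = pd.DataFrame({"text": list("abcdefg"), "label": [1, 1, 1, 1, 0, 0, 0]})
balanced = downsample_majority(toy)
print(balanced["label"].value_counts().to_dict())
```

In this notebook the imbalance is mild (23478 fake versus 21211 real articles), so downsampling would discard only about 2,300 rows.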

We could also experiment with models pretrained on news corpora that are closer to the test data than the generic BERT model and its derivatives.