# <font color = 'pickle'>**Fake News Multi-Class Classification**</font>
I will follow a standard model training plan, using the Hugging Face `transformers` library and Weights & Biases (WANDB) to track the experiment.
- The idea is to see how the base DistilRoBERTa model fares on a multi-class sequence classification task.

**Training Plan**
- Set the Environment
- Load the dataset
- EDA
  - Check sequence lengths
  - Check class distribution
- Load the pretrained distilRoBERTa tokenizer
  - Tokenize combining the title and body of the documents
  - Compile into datasets prior to training.
- Train the model
  - Download the model
  - Download and modify the model's config file.
  - Compute the metric function (ensuring we account for the multilabel-style, one-hot targets trained with BCE)
  - Training Args
  - Instantiate the trainer
  - Setup WANDB
  - Training and Validation
- Model inference



## <font color='pickle'>**1. Set the Environment**</font>


In [1]:
import sys
if 'google.colab' in str(get_ipython()):  # Check if running in Colab
    # Mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    # Set the base path to a Google Drive folder
    base_path = '/content/drive/MyDrive/Colab Notebooks'

    # Install only the additional necessary packages without upgrading torch and its dependencies
    !pip install transformers evaluate wandb datasets accelerate torchinfo -U -qq


Mounted at /content/drive


In [2]:
!pip show torch torchvision torchaudio torchtext torchdata transformers



Name: torch
Version: 2.5.0+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, fastai, peft, sentence-transformers, timm, torchaudio, torchdata, torchtext, torchvision
---
Name: torchvision
Version: 0.20.0+cu121
Summary: image and video datasets and models for torch deep learning
Home-page: https://github.com/pytorch/vision
Author: PyTorch Core Team
Author-email: soumith@pytorch.org
License: BSD
Location: /usr/local/lib/python3.10/dist-packages
Requires: numpy, 

<font color = 'pickle'>***Loading Libraries***</font>

In [45]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from pathlib import Path
import numpy as np
import joblib
import evaluate
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from datasets import load_dataset, DatasetDict, Dataset, ClassLabel
from transformers import (AutoTokenizer,
                          Trainer,
                          TrainingArguments,
                          AutoModelForSequenceClassification,
                          AutoConfig,
                          pipeline)
import wandb
from google.colab import userdata
from huggingface_hub import login
import torch.nn as nn
import ast
import torch

In [4]:
wandb_api_key = userdata.get('WANDB_API_KEY')
hf_token = userdata.get('HF_TOKEN')

In [5]:
if hf_token:
  login(token=hf_token)
  print('Login Successful')
else:
  print('Login Unsuccessful')

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Login Successful


In [7]:
if wandb_api_key:
  wandb.login(key = wandb_api_key)
  print('Login Successful')
else:
  print('Login Unsuccessful')

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Login Successful


<font color = 'pickle'>***Setting up paths for data and model***</font>

In [8]:
base_folder = Path(base_path)
data_folder = base_folder/'datasets/fake_news'
model_folder = base_folder/'models/personal/fake_distilroberta'

model_folder.mkdir(exist_ok = True, parents = True)

## <font color='pickle'>**2. Load the Dataset and Initial Split**</font>

Notice below that the dataset contains several columns besides the text and the labels. I will drop these extra columns and keep only the ones needed for training.

In [58]:
full_df = pd.read_csv(data_folder/'fake_news.csv', encoding = 'ISO-8859-1')
full_df.head()

Unnamed: 0,News_Headline,Link_Of_News,Source,Stated_On,Date,Label
0,Says Osama bin Laden endorsed Joe Biden,https://www.politifact.com/factchecks/2020/jun...,Donald Trump Jr.,"June 18, 2020","June 19, 2020",FALSE
1,CNN aired a video of a toddler running away fr...,https://www.politifact.com/factchecks/2020/jun...,Donald Trump,"June 18, 2020","June 19, 2020",pants-fire
2,Says Tim Tebow kneeled in protest of abortion...,https://www.politifact.com/factchecks/2020/jun...,Facebook posts,"June 12, 2020","June 19, 2020",FALSE
3,Even so-called moderate Democrats like Joe Bi...,https://www.politifact.com/factchecks/2020/jun...,Paul Junge,"June 10, 2020","June 19, 2020",barely-true
4,"""Our health department, our city and our count...",https://www.politifact.com/factchecks/2020/jun...,Jeanette Kowalik,"June 14, 2020","June 18, 2020",TRUE


In [59]:
# Pulling only the necessary columns
full_df = full_df[['News_Headline', 'Label']]
# renaming the columns for simplicity
full_df = full_df.rename(columns = {'News_Headline':'text', 'Label':'label'})
full_df.head()

Unnamed: 0,text,label
0,Says Osama bin Laden endorsed Joe Biden,FALSE
1,CNN aired a video of a toddler running away fr...,pants-fire
2,Says Tim Tebow kneeled in protest of abortion...,FALSE
3,Even so-called moderate Democrats like Joe Bi...,barely-true
4,"""Our health department, our city and our count...",TRUE


In [60]:
# setting up an encoder to do encoding of labels
encoder = OneHotEncoder(sparse_output = False)

# encoding the labels
encoded_labels = encoder.fit_transform(full_df[['label']])

# getting the encoded df
encoded_df = pd.DataFrame(encoded_labels, columns = encoder.get_feature_names_out(['label']))

# concatenating with the original dataframe
full_df = pd.concat([full_df.reset_index(drop = True), encoded_df.reset_index(drop = True)], axis = 1)

full_df.head()

Unnamed: 0,text,label,label_FALSE,label_TRUE,label_barely-true,label_full-flop,label_half-flip,label_half-true,label_mostly-true,label_no-flip,label_pants-fire
0,Says Osama bin Laden endorsed Joe Biden,FALSE,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CNN aired a video of a toddler running away fr...,pants-fire,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,Says Tim Tebow kneeled in protest of abortion...,FALSE,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Even so-called moderate Democrats like Joe Bi...,barely-true,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"""Our health department, our city and our count...",TRUE,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [61]:
# dropping the label column from the df
full_df.drop('label', axis=1, inplace = True)

full_df.head()

Unnamed: 0,text,label_FALSE,label_TRUE,label_barely-true,label_full-flop,label_half-flip,label_half-true,label_mostly-true,label_no-flip,label_pants-fire
0,Says Osama bin Laden endorsed Joe Biden,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CNN aired a video of a toddler running away fr...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,Says Tim Tebow kneeled in protest of abortion...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Even so-called moderate Democrats like Joe Bi...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"""Our health department, our city and our count...",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
full_df.shape

(9960, 10)

In [None]:
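# Assumption: each row's label is the list of its one-hot `label_*` values
label_cols = [col for col in full_df.columns if col.startswith('label_')]
full_dict = {
    'text': full_df['text'].tolist(),
    'label': full_df[label_cols].values.tolist()
}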

<font color = 'pickle'>***Performing initial split randomly to prevent leakage***</font>
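
The split itself is not shown in the notebook, so the following is only a minimal sketch of one way to do it. It assumes an 80/10/10 train/validation/test ratio and a fixed seed of 42, neither of which is stated in the original, and it uses a small toy frame (`toy_df`, hypothetical) in place of `full_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for full_df: a text column plus one-hot label columns.
toy_df = pd.DataFrame({
    'text': [f'headline {i}' for i in range(20)],
    'label_FALSE': [1.0, 0.0] * 10,
    'label_TRUE': [0.0, 1.0] * 10,
})

# Assumed ratios: 80% train, then the remaining 20% halved into validation and test.
train_df, holdout_df = train_test_split(
    toy_df, test_size=0.2, random_state=42, shuffle=True
)
valid_df, test_df = train_test_split(
    holdout_df, test_size=0.5, random_state=42, shuffle=True
)

print(len(train_df), len(valid_df), len(test_df))
```

Splitting before any tokenization or statistics are computed keeps information from the held-out rows out of the training pipeline. With rare classes, a stratified split may be preferable, but it fails when a class has fewer than two examples, so that choice is left open here.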


## <font color='pickle'>**3. EDA**</font>  

### <font color='pickle'>**3.1 Check Sequence Lengths**</font>  
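
As a rough illustration of this check, the sketch below counts whitespace-separated words per headline. This is an assumption-laden proxy: a subword tokenizer usually produces more tokens than words, so the numbers only hint at a suitable maximum sequence length. The `texts` series is a hypothetical stand-in for `full_df['text']`.

```python
import pandas as pd

# Toy stand-in for the 'text' column of full_df.
texts = pd.Series([
    'Says Osama bin Laden endorsed Joe Biden',
    'CNN aired a video of a toddler',
    'Short claim',
])

# Whitespace word counts: a rough proxy for token counts.
word_counts = texts.str.split().str.len()

# Summary statistics and a high percentile help choose a truncation length.
print(word_counts.describe())
print('95th percentile:', word_counts.quantile(0.95))
```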

### <font color='pickle'>**3.2 Check Class Distribution**</font>
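
Because the labels were one-hot encoded in Section 2, the class distribution can be read off by summing each `label_*` column. The sketch below shows this on a hypothetical toy frame (`toy_df`) with three of the nine label columns; it is an illustration, not the notebook's actual output.

```python
import pandas as pd

# Toy stand-in for full_df after one-hot encoding.
toy_df = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd'],
    'label_FALSE': [1.0, 0.0, 1.0, 0.0],
    'label_TRUE': [0.0, 1.0, 0.0, 0.0],
    'label_pants-fire': [0.0, 0.0, 0.0, 1.0],
})

# Select the one-hot columns by their shared prefix.
label_cols = [c for c in toy_df.columns if c.startswith('label_')]

# Column sums give per-class counts; dividing by the row count gives shares.
class_counts = toy_df[label_cols].sum().astype(int).sort_values(ascending=False)
class_share = (class_counts / len(toy_df)).round(3)

print(pd.DataFrame({'count': class_counts, 'share': class_share}))
```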

## <font color='pickle'>**4. Load the Pretrained DistilRoBERTa Tokenizer**</font>  

### <font color='pickle'>**4.1 Tokenize Combining the Title and Body of the Documents**</font>  

### <font color='pickle'>**4.2 Compile into Datasets**</font>  


## <font color='pickle'>**5. Train the Model**</font>  

### <font color='pickle'>**5.1 Download the Model**</font>  

### <font color='pickle'>**5.2 Download and Modify the Model's Config File**</font>  
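
One plausible set of config changes is sketched below: the label maps are built from the one-hot column names shown in Section 2. The `config_overrides` dictionary is a hypothetical name, and choosing `multi_label_classification` as the problem type is an assumption based on the BCE mention in Section 5.3. A single-label setup would instead need integer class ids rather than one-hot vectors.

```python
# Label names taken from the one-hot column headers shown in Section 2.
label_cols = [
    'label_FALSE', 'label_TRUE', 'label_barely-true', 'label_full-flop',
    'label_half-flip', 'label_half-true', 'label_mostly-true',
    'label_no-flip', 'label_pants-fire',
]

# Strip the 'label_' prefix to recover the class names.
class_names = [c[len('label_'):] for c in label_cols]

# Index <-> name maps in the same order as the one-hot columns.
id2label = {i: name for i, name in enumerate(class_names)}
label2id = {name: i for i, name in enumerate(class_names)}

# Assumption: these values would be passed as keyword overrides when loading
# the config, e.g. AutoConfig.from_pretrained(checkpoint, **config_overrides).
config_overrides = {
    'num_labels': len(class_names),
    'id2label': id2label,
    'label2id': label2id,
    'problem_type': 'multi_label_classification',
}

print(config_overrides)
```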

### <font color='pickle'>**5.3 Compute the Metric Function (Ensuring We Account for Multilabel) (BCE)**</font>  
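
The metric function is not written out in the notebook. The following is a minimal NumPy sketch of what a BCE-style `compute_metrics` could look like: it applies a sigmoid to the logits, thresholds at 0.5 (an assumed value), and reports micro-averaged precision, recall, F1, and exact-match accuracy. The metric names and the toy inputs are illustrative choices, not the author's.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def compute_metrics(eval_pred, threshold=0.5):
    """Micro-averaged metrics for one-hot / multilabel targets."""
    logits, labels = eval_pred
    probs = sigmoid(np.asarray(logits, dtype=np.float64))
    preds = (probs >= threshold).astype(int)
    labels = np.asarray(labels).astype(int)

    # Count true positives, false positives, and false negatives over all cells.
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # A row counts as correct only if every label position matches.
    exact_match = float((preds == labels).all(axis=1).mean())
    return {
        'micro_precision': precision,
        'micro_recall': recall,
        'micro_f1': f1,
        'exact_match': exact_match,
    }

# Toy logits and one-hot labels for three examples and three classes.
toy_logits = np.array([[2.0, -1.0, -3.0], [-2.0, 0.5, -1.0], [-1.0, -2.0, -0.5]])
toy_labels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

The function unpacks its argument as a `(logits, labels)` pair, which matches how the `Trainer` hands evaluation predictions to a `compute_metrics` callback.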

### <font color='pickle'>**5.4 Training Args**</font>  

### <font color='pickle'>**5.5 Instantiate the Trainer**</font>  

### <font color='pickle'>**5.6 Setup WANDB**</font>  

### <font color='pickle'>**5.7 Training and Validation**</font>

## <font color='pickle'>**6. Model Inference**</font>
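
Inference is not implemented in the notebook. As a sketch of the post-processing step only, the code below turns a vector of nine logits into a label name and a score. The helper `decode_prediction` and the toy logits are hypothetical; the label order follows the one-hot columns from Section 2, and using a sigmoid rather than a softmax is an assumption tied to the BCE setup.

```python
import numpy as np

# Class order follows the one-hot columns produced in Section 2.
id2label = {
    0: 'FALSE', 1: 'TRUE', 2: 'barely-true', 3: 'full-flop', 4: 'half-flip',
    5: 'half-true', 6: 'mostly-true', 7: 'no-flip', 8: 'pants-fire',
}

def decode_prediction(logits, id2label):
    """Return the highest-scoring label and its sigmoid score."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = 1.0 / (1.0 + np.exp(-logits))
    best = int(np.argmax(probs))
    return id2label[best], float(probs[best])

# Toy logits for a single headline (illustrative values only).
toy_logits = [-1.2, 0.3, -0.5, -4.0, -4.2, 0.1, -0.8, -3.9, 2.0]
label, score = decode_prediction(toy_logits, id2label)
print(label, round(score, 3))
```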