We will build a detection engine that classifies news articles as real or fake. The goal is to demonstrate how this pipeline can be adapted to your organization's specific needs. Instead of using a pre-built dataset, we download a dataset from Kaggle and use it for fine-tuning, which illustrates how the pipeline can be tailored to custom datasets in real-world applications.
Here is an outline of the fine-tuning process:
1. Import the required libraries and packages
2. Load the dataset: download the data from Kaggle and save it to your drive
3. Load the pre-trained BERT tokenizer
4. Prepare the dataset:
  * Tokenize the text using the BERT tokenizer
  * Create attention masks
  * Split the dataset into training and validation sets
  * Create a custom PyTorch dataset class (`TextClassificationDataset`)
  * Instantiate the custom dataset for both the training and validation sets
  * Create PyTorch `DataLoader`s
5. Load a pre-trained BERT model for sequence classification using the Hugging Face Transformers library
6. Set up the Accelerator environment
7. Fine-tune the model
8. Evaluate the model:
  * Calculate metrics such as F1 score, recall, and precision
9. Inference:
  * Create a function to perform inference on new text input
  * Tokenize the input text and convert it to the required format
  * Perform inference using the fine-tuned model
  * Interpret the model's output and return the predicted class

In [None]:
!pip install transformers
!pip install datasets
!pip install torch
!pip install torchtext
!pip install accelerate
!pip install sentencepiece
!pip3 install sacremoses

Collecting datasets
  Downloading datasets-3.0.2-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.2-py3-none-any.whl (472 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Downloading x

# 1. Import required libraries and packages


In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split
from accelerate import Accelerator

import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

from tqdm import tqdm

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from transformers import get_scheduler

**Note:** Apple's Metal Performance Shaders (MPS) is a framework that provides highly optimized, low-level GPU-accelerated functions for deep learning, image processing, and other compute-intensive tasks.

In [None]:
def get_device():
    # Prefer CUDA, then Apple's MPS backend, and fall back to the CPU
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


device = get_device()
print(device)

cuda


# 2. Load Data
1. Read the data from two CSV files: `True.csv` (real news) and `Fake.csv` (fake news)
2. Clean each dataframe by dropping the `title`, `subject`, and `date` columns, and add a `label` column
3. Concatenate both dataframes into a single dataframe
4. The resulting dataframe contains two columns: `text` for the news content and `label` for its category (`1.0` for real, `0.0` for fake)

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
real=pd.read_csv('/content/drive/MyDrive/Book6/Ch4/True.csv')
fake=pd.read_csv('/content/drive/MyDrive/Book6/Ch4/Fake.csv')


In [None]:
real = real.drop(['title','subject','date'], axis=1)
real['label']=1.0
fake = fake.drop(['title','subject','date'], axis=1)
fake['label']=0.0
dataframe=pd.concat([real, fake], axis=0, ignore_index=True)


In [None]:
df = dataframe.sample(frac=0.1).reset_index(drop=True)
print(df.head(20))
print(len(df[df['label']==1.0]))
print(len(df[df['label']==0.0]))

                                                 text  label
0   (Reuters) - Texas Governor Greg Abbott made go...    1.0
1   ABOARD AIR FORCE ONE (Reuters) - U.S. Attorney...    1.0
2   WASHINGTON (Reuters) - Democratic presidential...    1.0
3   Wikileaks released another email showing how p...    0.0
4   OSLO (Reuters) - The United States could influ...    1.0
5   Hillary Clinton has turned down repeated reque...    0.0
6   21st Century Wire says One of the great myths ...    0.0
7    Effective as of December 31, 2017, the Patien...    0.0
8   McCain: Our intelligence agencies concluded un...    0.0
9   SAMARKAND, Uzbekistan (Reuters) - Senior offic...    1.0
10  WASHINGTON (Reuters) - A federal court in the ...    1.0
11  PARIS (Reuters) - U.S. Secretary of State Rex ...    1.0
12  BRUSSELS (Reuters) - EU envoys discussed on We...    1.0
13  WASHINGTON (Reuters) - U.S. Senate negotiators...    1.0
14  KIEV (Reuters) - The Ukrainian wife of a Chech...    1.0
15  NEW YORK (Reuters) -

# 3. Load Tokenizer
1. We use the `bert-base-uncased` tokenizer. The model loaded later must be the matching `bert-base-uncased` checkpoint so that the token IDs line up with the model's vocabulary.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



# 4. Prepare Data
The data preparation process for the `bert-base-uncased` model involves tokenizing the text, mapping tokens to `input_ids`, creating the `attention_mask`, and preparing the `labels` tensor. Each element of the dataset class should be a dictionary with the following structure:

```
{'input_ids': torch.Tensor(),'attention_mask':torch.Tensor(), 'labels': torch.Tensor()  }
```
1. Tokenization: The text input should be tokenized into subwords using BERT's WordPiece tokenizer. This tokenizer converts the text into a format that BERT can understand.

2. `input_ids`: Each token from the tokenized text needs to be mapped to an ID using BERT's vocabulary. The resulting input IDs should be in the form of a tensor or array, usually of shape (batch_size, max_sequence_length).
3. `attention_mask`: The attention mask is used to differentiate between the actual tokens and padding tokens. It has the same shape as the input IDs tensor, i.e., (batch_size, max_sequence_length). The mask has 1s for actual tokens and 0s for padding tokens.
4. `labels`: The labels tensor contains the true class or value for each example in the dataset. With integer class indices it usually has a shape of (batch_size,). In this notebook the labels are one-hot encoded instead, so the tensor has a shape of (batch_size, num_classes).
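
To make the four steps above concrete, here is a library-free sketch of what a WordPiece-style tokenizer does: greedy longest-match-first subword splitting, mapping to IDs, truncation, padding, and building the attention mask. The vocabulary `VOCAB` and the helpers `wordpiece` and `encode` are hypothetical names invented for this illustration; only the special-token IDs (`[PAD]`=0, `[UNK]`=100, `[CLS]`=101, `[SEP]`=102) mirror `bert-base-uncased`. The real tokenizer also handles punctuation, casing rules, and a vocabulary of roughly 30,000 entries, so treat this as a simplified model of the idea rather than the library's implementation.

```python
# Toy vocabulary (assumption): IDs are made up except for the special tokens.
VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "fake": 1, "news": 2, "detect": 3, "##ion": 4, "##s": 5}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab, max_length):
    """Return input_ids and attention_mask, both padded to max_length."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[:max_length - 1] + ["[SEP]"]  # truncate, keeping room for [SEP]
    input_ids = [vocab[token] for token in tokens]
    attention_mask = [1] * len(input_ids)         # 1 marks a real token
    pad = max_length - len(input_ids)
    return {"input_ids": input_ids + [vocab["[PAD]"]] * pad,
            "attention_mask": attention_mask + [0] * pad}  # 0 marks padding


example = encode("Fake news detection", VOCAB, max_length=8)
print(example["input_ids"])       # [101, 1, 2, 3, 4, 102, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The notebook uses `max_length=512` with `padding='max_length'`, which is why every encoded example in the training set has exactly 512 entries.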

In [None]:
# Build a list of (text, label) tuples
data=list(zip(df['text'].tolist(), df['label'].tolist()))

# The following function takes lists of texts and labels
# and returns input_ids, attention_masks, and labels_out as tensors
def tokenize_and_encode(texts, labels):
    input_ids, attention_masks, labels_out = [], [], []
    for text, label in zip(texts, labels):
        encoded = tokenizer.encode_plus(
            text, max_length=512, padding='max_length', truncation=True)
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
        labels_out.append(label)
    return torch.tensor(input_ids), torch.tensor(attention_masks), torch.tensor(labels_out)

# Unzip the tuples into a sequence of texts and a sequence of labels
texts, labels = zip(*data)

# Split into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2)

# Tokenize
train_input_ids, train_attention_masks, train_labels = tokenize_and_encode(train_texts, train_labels)
val_input_ids, val_attention_masks, val_labels = tokenize_and_encode(val_texts, val_labels)




**It's always good to review the data**
1. `input_ids`
  * A token ID of `0` is the padding token (`[PAD]`)
2. `attention_mask`
  * `1`: the corresponding token is a real token
  * `0`: the corresponding token is a padding token

In [None]:
print('train_input_ids ',train_input_ids[0].shape ,train_input_ids[0], '\n'
      'train_attention_masks ', train_attention_masks[0].shape ,train_attention_masks[0], '\n'
      'train_labels', train_labels[0])

train_input_ids  torch.Size([512]) tensor([  101, 14749,  9587, 14478,  2218,  1037,  2811,  3034,  2651,  1999,
         2029,  2016,  2056,  2610,  3738,  2018,  2053, 15596,  3426,  2000,
         6545, 15528,  3897,  1012,  2008,  1055,  1037,  4682, 15528,  3897,
         2018,  2019,  3161, 10943,  2041,  2005,  2010,  6545,  2061,  6222,
         2610,  2018,  2296,  3114,  2000,  3288,  2032,  1999,  1012,   102,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0, 

### TextClassificationDataset
1. For fine-tuning `bert-base-uncased`, each item of the dataset must be a dictionary with at least the following keys:
  * `input_ids`
  * `attention_mask`
  * `labels`
2. Thus, `__getitem__` should return a dictionary with the following structure:
```
{
    'input_ids': self.input_ids[idx],
    'attention_mask': self.attention_masks[idx],
    'labels': self.one_hot_labels[idx]
}
```
3. `one_hot_encode` method: a static method that takes `targets` (labels) and `num_classes` as arguments and converts the targets into one-hot encoded tensors. It first converts the targets to long tensors and then initializes a zero tensor of shape (number of samples, num_classes). The `scatter_` function then places 1.0 in the appropriate position for each sample's label, resulting in a one-hot encoded tensor.
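
As a minimal illustration of what the one-hot step produces, the sketch below reproduces the same result with plain Python lists. It is a stand-in for the `scatter_`-based method, not a replacement for it: the function name matches the notebook's method only for readability, and the comments indicate which tensor operation each line mirrors.

```python
def one_hot_encode(targets, num_classes):
    """Pure-Python stand-in for the scatter_-based method: one row per sample."""
    rows = []
    for target in targets:
        index = int(target)        # mirrors targets.long()
        row = [0.0] * num_classes  # mirrors one row of torch.zeros(n, num_classes)
        row[index] = 1.0           # mirrors scatter_(1, targets.unsqueeze(1), 1.0)
        rows.append(row)
    return rows


labels = [1.0, 0.0, 1.0]  # 1.0 = real, 0.0 = fake, as in the dataframe
print(one_hot_encode(labels, num_classes=2))
# [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Each row has a single 1.0 in the column given by the class index, so a real article (label 1.0) becomes `[0.0, 1.0]` and a fake one becomes `[1.0, 0.0]`.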

In [None]:
class TextClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, input_ids, attention_masks, labels, num_classes=2):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.labels = labels
        self.num_classes = num_classes
        self.one_hot_labels = self.one_hot_encode(labels, num_classes)

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_masks[idx],
            'labels': self.one_hot_labels[idx]
        }


    @staticmethod
    def one_hot_encode(targets, num_classes):
        targets = targets.long()
        one_hot_targets = torch.zeros(targets.size(0), num_classes)
        one_hot_targets.scatter_(1, targets.unsqueeze(1), 1.0)
        return one_hot_targets


train_dataset = TextClassificationDataset(train_input_ids, train_attention_masks, train_labels)
val_dataset = TextClassificationDataset(val_input_ids, val_attention_masks, val_labels)


### DataLoader

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
eval_dataloader = DataLoader(val_dataset, batch_size=8)

In [None]:
print(len(train_dataset))
len(val_dataset)

3592


898

1. Revisiting the dimension requirements for Transformers in PyTorch from Chapter 3: the encoder expects data with dimensions (seq_len, batch_size). Hugging Face's `bert-base-uncased` model, however, requires data with dimensions (batch_size, seq_len). The batches produced by `train_dataloader` therefore have dimensions (batch_size, seq_len).
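
The difference between the two layouts can be seen on a tiny nested list. This is a sketch with made-up token IDs and hypothetical helper names (`transpose`, `shape`); with real tensors the same swap would be a transpose of the first two dimensions.

```python
def transpose(batch):
    """Swap the two axes of a rectangular nested list."""
    return [list(column) for column in zip(*batch)]


def shape(nested):
    """Return (rows, columns) of a rectangular nested list."""
    return (len(nested), len(nested[0]))


# Two sequences of four token IDs each: layout (batch_size, seq_len) = (2, 4)
batch_first = [[101, 7, 8, 102],
               [101, 9, 102, 0]]

# The same data in layout (seq_len, batch_size) = (4, 2)
seq_first = transpose(batch_first)

print(shape(batch_first), shape(seq_first))  # (2, 4) (4, 2)
print(seq_first[1])                          # second token of every sequence: [7, 9]
```

In the batch-first layout each row is one article; in the sequence-first layout each row is one position across all articles in the batch.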

In [None]:
item=next(iter(train_dataloader))
item_ids,item_mask,item_labels=item['input_ids'],item['attention_mask'],item['labels']
print ('item_ids, ',item_ids.shape, '\n',
       'item_mask, ',item_mask.shape, '\n',
       'item_labels, ',item_labels.shape, '\n',)

item_ids,  torch.Size([8, 512]) 
 item_mask,  torch.Size([8, 512]) 
 item_labels,  torch.Size([8, 2]) 



In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# 5. Prepare Accelerator
What is Accelerator?
 1. It provides an easy-to-use API for training deep learning models on various hardware accelerators, such as GPUs, TPUs, and Apple's Metal Performance Shaders (MPS).
  * In our example, we do not explicitly select a device such as `mps` during training. The Accelerator automatically detects the available device and uses it for training.
 2. The Accelerate library is particularly useful for distributed training and mixed-precision training.

In [None]:
# Prepare the model, optimizer, and dataloaders
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)


# 6. Fine-Tune the Model
1. `lr_scheduler` is a learning rate scheduler, which adjusts the learning rate during training based on the number of training steps completed. In this code, the learning rate starts at the initial value set in the optimizer and decreases linearly to 0 as training progresses.
2. Some benefits of using `lr_scheduler` rather than the optimizer alone are:
  * Faster convergence
  * Avoiding overshooting: with a fixed learning rate, the optimizer might overshoot the optimal solution, especially in the later stages of training. Decreasing the learning rate over time lets the model make smaller updates and fine-tune its weights.
  
3. `progress_bar` is just a utility that shows the progress of training
4. This is the standard approach for fine-tuning:
```
for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)
```
  * Each batch should be a dictionary with the structure `{'input_ids': torch.Tensor(), 'attention_mask': torch.Tensor(), 'labels': torch.Tensor()}`
  * The dimensions are `input_ids` = (batch_size, seq_len), `attention_mask` = (batch_size, seq_len), and `labels` = (batch_size, num_classes), since the labels in this notebook are one-hot encoded
  * Notice that during training we do not explicitly move tensors to a device; the Accelerator identifies the `device` and places each batch on it automatically
5. After each epoch, we also print the evaluation metrics computed over the evaluation dataset
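
The linear decay described in the first item of this list can be written out as a small function. This is a sketch of the schedule's shape, assuming the usual definition of a linear schedule with optional warmup; `linear_schedule_lr` is a hypothetical helper, not a function from the Transformers library. The values `5e-5` and `449` are the optimizer's learning rate and the number of training steps in this notebook (3592 training examples in batches of 8).

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up linearly from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: go down linearly from base_lr to 0 at the final step
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


base_lr = 5e-5
total_steps = 449
for step in (0, 224, 448, 449):
    print(step, linear_schedule_lr(step, base_lr, total_steps))
```

With `num_warmup_steps=0`, as in the notebook, the first step uses the full learning rate and the rate reaches 0 exactly when training ends.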

In [None]:
# Takes about 5 minutes 30 seconds to run

# Import the metric functions
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

num_epochs = 1
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
      "linear",
      optimizer=optimizer,
      num_warmup_steps=0,
      num_training_steps=num_training_steps
  )
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    model.train()  # switch back to training mode, because evaluation below calls model.eval()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    model.eval()
    # device = 'mps'  # from the original book's code; unnecessary on Colab, so commented out
    preds = []
    out_label_ids = []

    for batch in eval_dataloader:
        with torch.no_grad():
            inputs = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**inputs)
            logits = outputs.logits

        # The labels are one-hot encoded, so argmax recovers the class index
        preds.extend(torch.argmax(logits.detach().cpu(), dim=1).numpy())
        out_label_ids.extend(torch.argmax(inputs["labels"].detach().cpu(),dim=1).numpy())
    accuracy = accuracy_score(out_label_ids, preds)
    f1 = f1_score(out_label_ids, preds, average='weighted')
    recall = recall_score(out_label_ids, preds, average='weighted')
    precision = precision_score(out_label_ids, preds, average='weighted')

    print(f"Epoch {epoch + 1}/{num_epochs} Evaluation Results:")
    print(f"Accuracy: {accuracy}")
    print(f"F1 Score: {f1}")
    print(f"Recall: {recall}")
    print(f"Precision: {precision}")

100%|██████████| 449/449 [09:35<00:00,  1.28s/it]
100%|██████████| 449/449 [05:29<00:00,  1.34it/s]

Epoch 1/1 Evaluation Results:
Accuracy: 0.9977728285077951
F1 Score: 0.9977729390368493
Recall: 0.9977728285077951
Precision: 0.9977829520145779
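
The weighted metrics printed above can be reproduced by hand on a small example. The sketch below is a library-free approximation of what `average='weighted'` means: compute precision, recall, and F1 per class, then average them weighted by each class's share of the true labels. The helper names and the five sample rows of logits and one-hot labels are made up for illustration.

```python
def argmax(values):
    """Index of the largest value, like torch.argmax on one row."""
    return max(range(len(values)), key=lambda i: values[i])


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    total = len(pairs)
    accuracy = sum(t == p for t, p in pairs) / total
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / total  # share of true labels belonging to class c
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up model outputs and one-hot labels for five examples
logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 0.1], [0.2, 0.4]]
one_hot = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

y_pred = [argmax(row) for row in logits]   # [0, 1, 1, 0, 1]
y_true = [argmax(row) for row in one_hot]  # [0, 1, 1, 1, 0]
print(weighted_metrics(y_true, y_pred))
```

Support-weighted recall always equals accuracy, which is why the `Recall` and `Accuracy` lines in the output above show the same value.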


# 7. Inference Pipeline
1. `tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')`: you need to use the same tokenizer that was used for fine-tuning
2. `logits.detach().cpu()`
  * `.detach()` detaches the tensor from the computation graph to prevent unintentional backpropagation
  * `.cpu()` moves the output to the CPU so that it is compatible with NumPy and scikit-learn for further computation
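
Once the logits are on the CPU, interpreting them comes down to picking the largest score and mapping its index back to a class name. The sketch below shows that step on a plain list of two scores. The `ID2LABEL` mapping follows the labels used in this notebook (0 = fake, 1 = real); the `softmax` normalization and the `interpret` helper are illustrative additions that the notebook's `inference` function does not use, and the resulting confidence should be read as a rough indicator rather than a calibrated probability.

```python
import math

ID2LABEL = {0: "fake", 1: "real"}  # matches label 0.0 = fake, 1.0 = real


def softmax(logits):
    """Turn raw scores into values that are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits):
    """Return the predicted class name and its normalized score."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=lambda i: probs[i])  # same index as argmax on logits
    return ID2LABEL[idx], probs[idx]


label, confidence = interpret([-2.3, 3.1])  # made-up logits for one article
print(label, round(confidence, 3))
```

Because softmax preserves the ordering of the scores, the predicted index is the same one that `torch.argmax(logits, dim=1)` returns in the notebook's function.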

In [None]:
from transformers import BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def inference(text, model,  label, device=device):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    # Move the input tensors to the target device (defaults to the device detected earlier)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Put the model in eval mode, then run inference
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Extract the predicted label index
    pred_label_idx = torch.argmax(logits.detach().cpu(), dim=1).item()

    print(f"Predicted label index: {pred_label_idx}, actual label {label}")
    return pred_label_idx




In [None]:
# Because of the nature of deep learning models, your results may differ from those in the book.
# https://abcnews.go.com/US/tornado-confirmed-delaware-powerful-storm-moves-east/story?id=98293454
text="""
WASHINGTON (ABC) A confirmed tornado was located near Bridgeville in Sussex County, Delaware, shortly after 6 p.m. ET Saturday, moving east at 50 mph, according to the National Weather Service. Downed trees and wires were reported in the area.
"""
inference(text, model, 1.0)
text="this is definately junk text I am typing"
inference(text, model, 0.0)

Predicted label index: 1, actual label 1.0
Predicted label index: 0, actual label 0.0


0