
# **Final Project**

## **Problem statement:**

The widespread dissemination of fake news and propaganda presents serious societal risks, including the erosion of public trust, political polarization, manipulation of elections, and the spread of harmful misinformation during crises such as pandemics or conflicts. From an NLP perspective, detecting fake news is fraught with challenges. Linguistically, fake news often mimics the tone and structure of legitimate journalism, making it difficult to distinguish using surface-level features. The absence of reliable and up-to-date labeled datasets, especially across multiple languages and regions, hampers the effectiveness of supervised learning models. Additionally, the dynamic and adversarial nature of misinformation means that malicious actors constantly evolve their language and strategies to bypass detection systems. Cultural context, sarcasm, satire, and implicit bias further complicate automated analysis. Moreover, NLP models risk amplifying biases present in training data, leading to unfair classifications and potential censorship of legitimate content. These challenges underscore the need for cautious, context-aware approaches, as the failure to address them can inadvertently contribute to misinformation, rather than mitigate it.



Use the datasets at the following link to complete the requirements: https://drive.google.com/drive/folders/1mrX3vPKhEzxG96OCPpCeh9F8m_QKCM4z?usp=sharing

## **About the dataset:**

* **True Articles**:

  * **File**: `MisinfoSuperset_TRUE.csv`
  * **Sources**:

    * Reputable media outlets like **Reuters**, **The New York Times**, **The Washington Post**, etc.

* **Fake/Misinformation/Propaganda Articles**:

  * **File**: `MisinfoSuperset_FAKE.csv`
  * **Sources**:

    * **American right-wing extremist websites** (e.g., Redflag Newsdesk, Breitbart, Truth Broadcast Network)
    * **Public dataset** from:

      * Ahmed, H., Traore, I., & Saad, S. (2017): "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques" *(Springer LNCS 10618)*



## **Requirement**

A team consisting of three members must complete a project that involves applying the methods learned from the beginning of the course up to the present. The team is expected to follow and document the entire machine learning workflow, which includes the following steps:

1. **Data Preprocessing**: Clean and prepare the dataset.

2. **Exploratory Data Analysis (EDA)**: Explore and visualize the data.

3. **Model Building**: Select and build one or more machine learning models suitable for the problem at hand.

4. **Hyperparameter Setup**: Set and adjust the model's hyperparameters using appropriate methods to improve performance.

5. **Model Training**: Train the model(s) on the training dataset.

6. **Performance Evaluation**: Evaluate the trained model(s) using appropriate metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix, etc.) and validate their performance on unseen data.

7. **Conclusion**: Summarize the results, discuss the model's strengths and weaknesses, and suggest possible improvements or future work.
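
The metrics named in step 6 can all be derived from the four cells of a binary confusion matrix. The sketch below is only an illustration in plain Python: the function name `binary_metrics` and the toy labels are assumptions made for this example, and the notebook itself relies on `sklearn.metrics` for the real evaluation.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (hypothetical): 1 = true article, 0 = fake article
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters here because the two kinds of error (flagging a legitimate article, missing a fake one) have different costs.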





## Loading data


### Original Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
true_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/LAB-NLP/DataSet_Misinfo_TRUE_EN_2.csv")
fake_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/LAB-NLP/DataSet_Misinfo_FAKE_EN_2.csv")

In [None]:
true_df.head()

Unnamed: 0.1,Unnamed: 0,en_text
0,0,The head of a conservative Republican faction ...
1,1,Transgender people will be allowed for the fir...
2,2,The special counsel investigation of links bet...
3,3,Trump campaign adviser George Papadopoulos tol...
4,4,President Donald Trump called on the U.S. Post...


In [None]:
true_df.tail()

Unnamed: 0.1,Unnamed: 0,en_text
34522,34522,Most conservatives who oppose marriage equalit...
34523,34523,The freshman senator from Georgia quoted scrip...
34524,34524,The State Department told the Republican Natio...
34525,34525,"ADDIS ABABA, Ethiopia —President Obama convene..."
34526,34526,Jeb Bush Is Suddenly Attacking Trump. Here's W...


In [None]:
fake_df.tail()

Unnamed: 0.1,Unnamed: 0,en_text
33969,33969,"apparently, the new kyiv government is in a hu..."
33970,33970,the usa wants to divide syria.\r\n\r\ngreat br...
33971,33971,the ukrainian coup d'etat cost the us nothing ...
33972,33972,the european parliament falsifies history by d...
33973,33973,"a leading fsb officer, segey beseda, said duri..."


In [None]:
fake_df.head()

Unnamed: 0.1,Unnamed: 0,en_text
0,0,donald trump just couldn t wish all americans ...
1,1,house intelligence committee chairman devin nu...
2,2,"on friday, it was revealed that former milwauk..."
3,3,"on christmas day, donald trump announced that ..."
4,4,pope francis used his annual christmas day mes...


In [None]:
fake_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33974 entries, 0 to 33973
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  33974 non-null  int64 
 1   en_text     33974 non-null  object
dtypes: int64(1), object(1)
memory usage: 531.0+ KB


In [None]:
true_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34527 entries, 0 to 34526
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  34527 non-null  int64 
 1   en_text     34527 non-null  object
dtypes: int64(1), object(1)
memory usage: 539.6+ KB


In [None]:
fake_df['label'] = 0
true_df['label'] = 1
# Drop the leftover 'Unnamed: 0' index column and any rows containing NaN values
true_df = true_df.drop(columns=['Unnamed: 0'])
fake_df = fake_df.drop(columns=['Unnamed: 0'])
fake_df.dropna(inplace=True,ignore_index= True)
true_df.dropna(inplace=True,ignore_index= True)

In [None]:
df= pd.concat([true_df,fake_df],ignore_index=True)
display(df.head(10))
display(df.tail(10))

Unnamed: 0,en_text,label
0,The head of a conservative Republican faction ...,1
1,Transgender people will be allowed for the fir...,1
2,The special counsel investigation of links bet...,1
3,Trump campaign adviser George Papadopoulos tol...,1
4,President Donald Trump called on the U.S. Post...,1
5,The White House said on Friday it was set to k...,1
6,President Donald Trump said on Thursday he bel...,1
7,While the Fake News loves to talk about my so-...,1
8,"Together, we are MAKING AMERICA GREAT AGAIN! b...",1
9,Alabama Secretary of State John Merrill said h...,1


Unnamed: 0,en_text,label
68491,the european union is an unreliable partner wh...,0
68492,the recently opened mine action centre in stra...,0
68493,[...] trump is now openly unleashing trade war...,0
68494,leveling kaliningrad - such threats come from ...,0
68495,"in the village of starychi, a man’s body was f...",0
68496,"apparently, the new kyiv government is in a hu...",0
68497,the usa wants to divide syria.\r\n\r\ngreat br...,0
68498,the ukrainian coup d'etat cost the us nothing ...,0
68499,the european parliament falsifies history by d...,0
68500,"a leading fsb officer, segey beseda, said duri...",0
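
Before modelling, it helps to know how balanced the two classes are, since that fixes the accuracy a trivial majority-class predictor would reach. The sketch below only does the arithmetic on the row counts reported by `info()` above (34527 true, 33974 fake); the variable names are chosen for this example.

```python
# Row counts taken from the info() outputs shown earlier
n_true = 34527
n_fake = 33974

total = n_true + n_fake
share_true = n_true / total

# Accuracy of always predicting the larger class
majority_baseline = max(n_true, n_fake) / total

print(f"total rows: {total}")
print(f"share of true articles: {share_true:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

The total agrees with the last index (68500) of the concatenated frame, and the two classes are close to evenly split, so accuracy is a meaningful metric for this dataset.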


## Preprocessing Data

In [None]:
! pip install swifter


In [None]:
from nltk.corpus import stopwords
import unicodedata
import nltk
import re
import string
import pandas as pd
import swifter
from tqdm.auto import tqdm
import spacy
nlp = spacy.load("en_core_web_sm")

def lemmatize_words(text):
    """
    Lemmatize a piece of text.
    Args:
        text (str): Input text.
    Returns:
        List[str]: The tokens of the text in lemma form.
    """
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    return lemmas

tqdm.pandas()
punct_table = str.maketrans('', '', string.punctuation)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
url_pattern = re.compile(r'https?://\S+|www\.\S+')
mention_pattern = re.compile(r'(?:@|https?://)\S+')
bracket_pattern = re.compile(r'\[.*?\]')
html_pattern = re.compile(r'<.*?>')
digit_word_pattern = re.compile(r'\w*\d\w*')
repeat_char_pattern = re.compile(r'(.)\1{2,}')
non_word_pattern = re.compile(r'\W+')
def preprocess(text):
    text = url_pattern.sub('', text)
    text = mention_pattern.sub('', text)
    text = text.lower()
    text = bracket_pattern.sub('', text)
    text = html_pattern.sub('', text)
    text = digit_word_pattern.sub('', text)

    # Normalize Unicode and drop characters that have no ASCII equivalent
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")
    text = text.translate(punct_table)  # remove punctuation
    text = non_word_pattern.sub(' ', text)
    text = text.replace('\n', ' ').replace('\r', ' ')
    text = re.sub(r'\s+', ' ', text).strip()
    text = repeat_char_pattern.sub(r'\1', text)  # reduce repeated characters like "soooo" -> "so"
    words = lemmatize_words(text)
    # words = text.split()
    # Filter the lemmas: drop stopwords and uninformative tokens
    filtered_words = [word for word in words
                      if word not in stop_words
                      and len(word) > 1
                      and len(word) < 16
                      and word.isalpha()]

    return ' '.join(filtered_words)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
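
To make the effect of the cleaning steps concrete, the sketch below replays the regex and Unicode stages of `preprocess` on a single made-up sentence. It is a simplified stand-in, not the function above: the spaCy lemmatization, the stopword filter, and the mention pattern are deliberately left out so that it runs with the standard library only, and the name `clean_without_lemmas` and the sample sentence are assumptions for illustration.

```python
import re
import string
import unicodedata

# Same patterns as in the preprocessing cell
url_pattern = re.compile(r'https?://\S+|www\.\S+')
bracket_pattern = re.compile(r'\[.*?\]')
html_pattern = re.compile(r'<.*?>')
digit_word_pattern = re.compile(r'\w*\d\w*')
repeat_char_pattern = re.compile(r'(.)\1{2,}')
punct_table = str.maketrans('', '', string.punctuation)


def clean_without_lemmas(text):
    """Apply the character-level cleaning steps, skipping lemmatization."""
    text = url_pattern.sub('', text)          # remove URLs
    text = text.lower()
    text = bracket_pattern.sub('', text)      # remove [bracketed] spans
    text = html_pattern.sub('', text)         # remove HTML tags
    text = digit_word_pattern.sub('', text)   # remove tokens containing digits
    # Decompose accented characters, then drop the non-ASCII remainder
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.translate(punct_table)        # remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
    return repeat_char_pattern.sub(r'\1', text)  # "sooooo" -> "so"


sample = "BREAKING [video]: Sooooo many <b>café</b> owners saw it at https://example.com in 2020!!!"
cleaned = clean_without_lemmas(sample)
print(cleaned)  # breaking so many cafe owners saw it at in
```

The leftover words "at" and "in" show why the stopword filter in the full function is still needed after these steps.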


In [None]:
# Apply preprocessing to the 'en_text' column with a progress bar.
# fillna must come before astype(str); otherwise NaN becomes the string 'nan'.
df_clean = df.copy()
df_clean['en_text'] = df_clean['en_text'].fillna('').astype(str).progress_apply(preprocess)

# Show the first and last 10 rows
display(df_clean.head(10))
display(df_clean.tail(10))

  0%|          | 0/68501 [00:00<?, ?it/s]

In [None]:
# Save the cleaned DataFrame
df_clean.to_csv('/content/drive/MyDrive/Colab Notebooks/LAB-NLP/df_clean.csv', index=False)

## Exploratory Data Analysis


In [None]:
! pip install langdetect nltk



In [None]:
# Create Vocabulary
import nltk
# Download the stopwords corpus
nltk.download('stopwords')

def create_vocabulary(df_clean):
    """Collect every non-stopword token of length > 1 from the cleaned text."""
    vocab = []
    stop_words = set(nltk.corpus.stopwords.words("english"))
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    # The cleaned text lives in the 'en_text' column created during preprocessing
    for text in df_clean['en_text']:
        tokens = tokenizer.tokenize(text)
        filtered_words = [w.strip() for w in tokens if w not in stop_words and len(w) > 1]
        vocab.extend(filtered_words)
    return vocab

VOCAB = create_vocabulary(df_clean.copy())
print(VOCAB[:10])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['head', 'conservative', 'republican', 'faction', 'us', 'congress', 'voted', 'month', 'huge', 'expansion']


In [None]:
VOCAB = set(VOCAB)
print(len(VOCAB))

320628
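
Converting the token list to a `set` gives the vocabulary size but discards how often each word occurs. A `collections.Counter` keeps both. The sketch below uses three made-up documents in place of the real corpus; the variable names are assumptions for this example.

```python
from collections import Counter

# Toy documents standing in for the cleaned 'en_text' column (hypothetical)
docs = [
    "president sign bill",
    "senate pass bill",
    "president veto bill",
]

# Count every whitespace-separated token across all documents
token_counts = Counter(token for doc in docs for token in doc.split())

vocab_size = len(token_counts)           # number of distinct tokens
top_tokens = token_counts.most_common(2)  # most frequent tokens first

print(vocab_size)
print(top_tokens)
```

On the real data, the tail of such a count is where typos, concatenated words, and foreign-language tokens tend to show up.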


In [None]:
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException


def detect_language(word):
    try:
        return detect(word)
    except LangDetectException:
        return "unknown"

def classify_languages(vocab):
    language_map = {}
    for word in vocab:
        lang = detect_language(word)
        if lang not in language_map:
            language_map[lang] = []
        language_map[lang].append(word)
    return language_map




# Classify each vocabulary word by detected language
language_classification = classify_languages(VOCAB)
# Display the results
for lang, words in language_classification.items():
    print(f"Language: {lang}, Words: {words}")



Language: lt, Words: ['vau', 'tuvieran', 'gasline', 'alnujaifi', 'auspol', 'tuvo', 'panky', 'audiomost', 'somaliosman', 'karoblis', 'gavarin', 'silną', 'ministras', 'kavota', 'galvin', 'bonini', 'idiosyncrasies', 'kbeauty', 'judaistic', 'virta', 'nepalis', 'minisme', 'gaurantee', 'majorpayne', 'virginiaon', 'pilgrims', 'pussyrubio', 'stinting', 'karolyn', 'rivaltrump', 'ribisi', 'raus', 'britainin', 'ramon', 'prorioting', 'stupidin', 'roskosmos', 'pieish', 'arbaien', 'garbanzos', 'virginias', 'sikra', 'virologist', 'gavins', 'impetus', 'ploybut', 'gratinian', 'savoring', 'boinking', 'lipsky', 'dijon', 'jorgemario', 'religionkasich', 'splitas', 'lavinia', 'soldvietnam', 'salisbury', 'burials', 'givi', 'proselytism', 'drink', 'pssytrump', 'strause', 'inko', 'susy', 'usaustralia', 'karpinski', 'storybut', 'judiciaryso', 'bolivariana', 'saidmejia', 'kraichgau', 'últimos', 'usacitylinkcom', 'exijamos', 'viva', 'virago', 'sarajo', 'gvasalia', 'ausonius', 'veiny', 'moslim', 'nóvostivadim', 't

## Model building

### Pretraining model

#### Create DataLoader

In [None]:
from torch.utils.data import Dataset, DataLoader
import torch
import numpy as np
from tqdm import tqdm
from torch.optim.lr_scheduler import CosineAnnealingLR
from sklearn.metrics import f1_score,precision_score,recall_score,classification_report
import os

In [None]:

class MyDataset(Dataset):
  def __init__(self, df,tokenizer,max_length = 512):
    self.df = df
    self.labels = [label for label in df['label']]
    self.texts = [text for text in df['text']]
    self.tokenizer = tokenizer
    self.max_length = max_length
  def __len__(self):
    return len(self.df)
  def __getitem__(self, idx):
    label = self.labels[idx]
    text = self.texts[idx]
    encoding = self.tokenizer(text,
                              truncation=True,
                              padding='max_length',
                              max_length=self.max_length,
                              return_tensors='pt')
    item = {key: val.squeeze(0) for key, val in encoding.items()}
    item['labels'] = torch.tensor(label)
    return item

class MyDataLoader():
    def __init__(self, dataset, batch_size=16, shuffle=True,num_worker=4):
        self.dataset = dataset
        # Stored as num_workers to match the attribute read in get_dataloader
        self.num_workers = num_worker
        self.batch_size = batch_size
        self.shuffle = shuffle
    def get_dataloader(self):
      return DataLoader(self.dataset, batch_size=self.batch_size, shuffle=self.shuffle,num_workers=self.num_workers)
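
`MyDataset` asks the tokenizer for `truncation=True` and `padding='max_length'`, so every example comes out with exactly `max_length` ids plus an attention mask marking the real tokens. The sketch below mimics that shape contract in plain Python. It is a simplification under stated assumptions: it pads on the right with a hypothetical `pad_id` of 0 and ignores special tokens, whereas a real tokenizer inserts them and may pad on the other side.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both of length exactly max_length."""
    ids = list(token_ids[:max_length])  # truncate overly long sequences
    mask = [1] * len(ids)               # 1 marks a real token
    n_pad = max_length - len(ids)
    # Pad short sequences; 0 in the mask tells the model to ignore them
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item has the same length, the default `DataLoader` collation can stack them into a batch without a custom `collate_fn`, at the cost of computing over padding for short articles.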

#### Training Workflow

In [None]:
class Trainer():
  def __init__(self,training_args,model,trainLoader,testLoader,optimizer,criterion,device):
    self.lr = training_args.learning_rate
    self.epochs = training_args.num_train_epochs
    self.train_batch_size = training_args.per_device_train_batch_size
    self.eval_batch_size = training_args.per_device_eval_batch_size
    self.weight_decay = training_args.weight_decay
    self.model = model
    self.trainLoader = trainLoader.get_dataloader()
    self.testLoader = testLoader.get_dataloader()
    self.optimizer = optimizer
    self.criterion = criterion
    self.device = device
    self.device_count = torch.cuda.device_count()
    if self.device_count > 1:
      print(f"Using {torch.cuda.device_count()} GPUs via DataParallel")
      self.model = torch.nn.DataParallel(self.model)
    else:
      print(f"Using {torch.cuda.device_count()} GPUs")
      self.model.to(self.device)

  def train(self):
    self.model.to(self.device)
    self.model.train()
    # The scheduler is stepped once per batch, so T_max is the total number of steps
    scheduler = CosineAnnealingLR(self.optimizer, T_max=self.epochs* len(self.trainLoader))
    for epoch in range(self.epochs):
      print(f"Epoch {epoch+1}/{self.epochs}")
      # Create the progress bar per epoch so it restarts from zero each time
      loop = tqdm(self.trainLoader, desc="\tTraining", leave=False)
      sum_loss=0
      sum_step=0
      predict=None
      target=None
      for batch in loop:
        input_ids = batch['input_ids'].to(self.device)
        attention_mask = batch['attention_mask'].to(self.device)
        labels = batch['labels'].to(self.device)
        outputs = self.model(input_ids, attention_mask=attention_mask)
        # Labels are not passed to the model, so outputs.loss is None;
        # the loss is computed from the logits with the supplied criterion.
        loss = self.criterion(outputs.logits, labels)
        sum_loss += loss.item()
        sum_step += 1

        loss.backward()
        self.optimizer.step()
        scheduler.step()
        self.optimizer.zero_grad()
        if predict is None:
          predict = outputs.logits.argmax(dim=1).cpu().numpy()
          target = labels.cpu().numpy()
        else:
          predict = np.concatenate([predict, outputs.logits.argmax(dim=1).cpu().numpy()])
          target = np.concatenate([target, labels.cpu().numpy()])

      # Report the metrics accumulated over this epoch
      print("epoch: ",epoch,"loss: ",sum_loss/sum_step)
      print("f1_score: ",f1_score(target,predict,average="macro"))
      print("classification_report:\n",classification_report(target,predict))

  def evaluate(self):
    self.model.eval()
    self.model.to(self.device)
    sum_loss=0
    sum_step=0
    predict=None
    target=None
    with torch.no_grad():
      for batch in self.testLoader:
        input_ids = batch['input_ids'].to(self.device)
        attention_mask = batch['attention_mask'].to(self.device)
        labels = batch['labels'].to(self.device)
        outputs = self.model(input_ids, attention_mask=attention_mask)
        loss = self.criterion(outputs.logits, labels)
        sum_loss += loss.item()
        sum_step += 1
        if predict is None:
          predict = outputs.logits.argmax(dim=1).cpu().numpy()
          target = labels.cpu().numpy()
        else:
          predict = np.concatenate([predict, outputs.logits.argmax(dim=1).cpu().numpy()])
          target = np.concatenate([target, labels.cpu().numpy()])
    print("loss: ",sum_loss/sum_step)
    print("f1_score: ",f1_score(target,predict,average="macro"))
    print("classification_report:\n",classification_report(target,predict))

  def save_model(self, path, filename="best_model.pt"):
    os.makedirs(path, exist_ok=True)
    full_path = os.path.join(path, filename)

    try:
        torch.save(self.model.state_dict(), full_path)
        print(f"Model saved to {full_path}")
    except Exception as e:
        print(f"Failed to save the model: {e}")
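
The trainer steps `CosineAnnealingLR` once per batch with `T_max` set to the total number of training steps, so the learning rate falls along half a cosine wave from its initial value to the minimum. The sketch below evaluates the closed-form schedule in plain Python. The base learning rate of 2e-5 and the 3 epochs of 100 batches are hypothetical values picked for illustration, not settings taken from this notebook.

```python
import math


def cosine_lr(step, total_steps, base_lr, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at total_steps."""
    cosine = math.cos(math.pi * step / total_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


epochs = 3              # hypothetical
batches_per_epoch = 100  # hypothetical
total_steps = epochs * batches_per_epoch
base_lr = 2e-5          # hypothetical

# Learning rate at the start, the midpoint, and the end of training
lr_start = cosine_lr(0, total_steps, base_lr)
lr_mid = cosine_lr(total_steps // 2, total_steps, base_lr)
lr_end = cosine_lr(total_steps, total_steps, base_lr)

print(lr_start, lr_mid, lr_end)
```

At the midpoint the rate is half the base value, and it approaches zero only in the final steps, which is why `T_max` has to count batches rather than epochs when the scheduler is stepped per batch.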


#### Inference Pipeline

In [None]:
class Pipeline(torch.nn.Module):
  def __init__(self,model,tokenizer,device):
    # nn.Module must be initialized before sub-modules can be assigned
    super().__init__()
    self.model = model
    self.tokenizer = tokenizer
    self.device = device
    self.device_count = torch.cuda.device_count()
    if self.device_count > 1:
      print(f"Using {torch.cuda.device_count()} GPUs via DataParallel")
      self.model = torch.nn.DataParallel(self.model)
    else:
      print(f"Using {torch.cuda.device_count()} GPUs")
      self.model.to(self.device)
  def forward(self,text):
    encoding = self.tokenizer(text,
                              truncation=True,
                              padding='max_length',
                              max_length=512,
                              return_tensors='pt')
    input_ids = encoding['input_ids'].to(self.device)
    attention_mask = encoding['attention_mask'].to(self.device)
    self.model.to(self.device)
    outputs = self.model(input_ids, attention_mask=attention_mask)
    return outputs
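
The pipeline returns raw model outputs, so the caller still has to turn logits into a label. The sketch below shows one way to do that in plain Python, using the label convention set earlier in the notebook (0 = fake, 1 = true). The function names and the example logits are assumptions for illustration.

```python
import math

# Label convention from the data-loading step: fake_df -> 0, true_df -> 1
LABELS = {0: "fake", 1: "true"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits for one article: [score_fake, score_true]
label, confidence = decode([-1.2, 2.3])
print(label, round(confidence, 4))
```

With a PyTorch output, the same values would come from `outputs.logits.softmax(dim=-1)`, ideally computed under `torch.no_grad()` with the model in eval mode.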

#### Bert

In [None]:
! pip install --upgrade transformers peft
! pip install -U bitsandbytes


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    XLNetForSequenceClassification,
    XLNetTokenizer,
    BitsAndBytesConfig
)
import bitsandbytes
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
import numpy as np
from tqdm import tqdm
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import os
from torch.utils.checkpoint import checkpoint

class MyModel(torch.nn.Module):
    def __init__(self, type_model: str, num_labels: int):
        super(MyModel, self).__init__()
        self.type_model = type_model

        # Configure quantization
        self.bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,                     # load the weights in 4-bit precision
            bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
            bnb_4bit_quant_type="nf4",             # use the NormalFloat4 data type
            bnb_4bit_compute_dtype=torch.bfloat16  # run computations in bfloat16
        )

        if self.type_model == "bert":
            self.model = BertForSequenceClassification.from_pretrained(
                "bert-large-uncased",
                num_labels=num_labels,
                quantization_config=self.bnb_config,
                device_map="auto",
                torch_dtype=torch.bfloat16
            )
            # Prepare model for k-bit training
            self.model = prepare_model_for_kbit_training(self.model,use_gradient_checkpointing=True)

            # LoRA configuration for BERT
            self.lora_config = LoraConfig(
                r=8,
                target_modules=["query", "value", "key", "dense"],
                task_type=TaskType.SEQ_CLS,
                lora_alpha=32,
                lora_dropout=0.05,
                bias="none",
                inference_mode=False
            )
            self.model = get_peft_model(self.model, self.lora_config)

        elif self.type_model == "xlnet":
            self.model = XLNetForSequenceClassification.from_pretrained(
                "xlnet-large-cased",
                num_labels=num_labels,
                quantization_config=self.bnb_config,
                device_map="auto",
                torch_dtype=torch.bfloat16
            )

            # Prepare model for k-bit training
            self.model = prepare_model_for_kbit_training(self.model,use_gradient_checkpointing=False)

            # LoRA configuration for XLNet
            self.lora_config = LoraConfig(
                r=8,
                target_modules=["layer_1", "layer_2","logits_proj"],
                task_type=TaskType.SEQ_CLS,
                lora_alpha=32,
                lora_dropout=0.05,
                bias="none",
                inference_mode=False
            )
            self.model = get_peft_model(self.model, self.lora_config)

    def forward(self, input_ids, attention_mask=None, labels=None):
        # Wrapper function handed to torch.utils.checkpoint
        def custom_forward(*inputs):
            # Forward pass of the wrapped model
            return self.model(*inputs)

        # Run the model under activation checkpointing. The non-reentrant
        # variant is used because the inputs are integer tensors that do not
        # require gradients.
        outputs = checkpoint(custom_forward, input_ids, attention_mask, use_reentrant=False)

        # If labels are provided, compute the loss
        if labels is not None:
            logits = outputs.logits  # assumes the model output exposes logits
            loss_fn = torch.nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)
            return loss, outputs

        return outputs
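
LoRA keeps each targeted linear layer frozen and learns two small matrices of rank `r` beside it, so the number of added parameters per layer is `r * (in_features + out_features)`. The sketch below works through that arithmetic for the XLNet feed-forward layers targeted above, using `r=8`, `lora_alpha=32`, and layer shapes of 1024 by 4096 across 24 blocks as in `xlnet-large-cased`. It is only an illustration of the formula: it does not try to reproduce the trainable-parameter total that PEFT reports, which covers more than these two layer types.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


r = 8
lora_alpha = 32
num_blocks = 24  # transformer blocks in xlnet-large-cased

# layer_1: 1024 -> 4096, layer_2: 4096 -> 1024
per_block = lora_params(1024, 4096, r) + lora_params(4096, 1024, r)
ff_adapter_total = num_blocks * per_block

# Weights of the two frozen feed-forward matrices in one block (biases excluded)
frozen_per_block = 1024 * 4096 + 4096 * 1024
ratio = per_block / frozen_per_block

# The adapter update is scaled by lora_alpha / r before being added
scaling = lora_alpha / r

print(per_block, ff_adapter_total, ratio, scaling)
```

For these layers the adapters add just under 1% of the frozen weight count, which is what makes fine-tuning a large model feasible on a single Colab GPU.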

In [None]:
bert = MyModel("xlnet",2)  # note: despite the variable name, this builds the XLNet variant

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-large-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
xlnet = XLNetForSequenceClassification.from_pretrained("xlnet-large-cased",num_labels=2)

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-large-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
xlnet

XLNetForSequenceClassification(
  (transformer): XLNetModel(
    (word_embedding): Embedding(32000, 1024)
    (layer): ModuleList(
      (0-23): 24 x XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (layer_1): Linear(in_features=1024, out_features=4096, bias=True)
          (layer_2): Linear(in_features=4096, out_features=1024, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (activation_function): GELUActivation()
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (sequence_summary): XLNetSequenceSummary(
    (summary): Linear(in_features=1024, out_features=1024, bias=True)
    (activation): Tanh()
    (first_dropout): Identity

In [None]:
def count_parameters(model):
    """
    Count the total and the trainable parameters of a PyTorch model.

    Args:
        model (torch.nn.Module): The PyTorch model.

    Returns:
        dict: Total parameter count, trainable parameter count, and the percentage that is trainable.
    """
    total_params = sum(p.numel() for p in model.parameters())  # total number of parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)  # trainable parameters
    percent_trainable = (trainable_params / total_params) * 100 if total_params > 0 else 0  # percentage trainable
    return {
        "total_params": total_params,
        "trainable_params": trainable_params,
        "percent_trainable": percent_trainable,
    }

print(f"Parameter counts: {count_parameters(bert)}")

Parameter counts: {'total_params': 187182084, 'trainable_params': 3557378, 'percent_trainable': 1.9004906473848213}
