<a href="https://colab.research.google.com/github/uits-2215151050/ML-Project/blob/main/Project_Bangla_News_Summarize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os

BASE_DIR = "/content/drive/MyDrive/BanglaT5_Project"

os.makedirs(BASE_DIR, exist_ok=True)
os.makedirs(f"{BASE_DIR}/cache", exist_ok=True)
os.makedirs(f"{BASE_DIR}/checkpoints", exist_ok=True)
os.makedirs(f"{BASE_DIR}/model", exist_ok=True)
os.makedirs(f"{BASE_DIR}/logs", exist_ok=True)

BASE_DIR

'/content/drive/MyDrive/BanglaT5_Project'

**Redirect the Hugging Face cache to Google Drive and disable W&B**



In [None]:
import os

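# Redirect the Hugging Face cache (set these before importing transformers)
os.environ["HF_HOME"] = f"{BASE_DIR}/cache"
# Legacy variable; recent transformers releases deprecate it in favor of HF_HOME
os.environ["TRANSFORMERS_CACHE"] = f"{BASE_DIR}/cache"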

# Disable W&B entirely
os.environ["WANDB_DISABLED"] = "true"
os.environ["WANDB_MODE"] = "offline"
os.environ["WANDB_DIR"] = f"{BASE_DIR}/logs"

# Optional: silence the Hugging Face tokenizers fork/parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print("Cache directory:", os.environ["HF_HOME"])


Cache directory: /content/drive/MyDrive/BanglaT5_Project/cache


**Install required libraries**

In [None]:
!pip install transformers datasets sentencepiece rouge-score gradio

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=fce8ca9ee9de974122a1361b7a4b09333171235d03f5f20e693fe1601422d952
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


**Load dataset (article.txt & summary.txt)**

In [None]:
import pandas as pd

article_path = f"{BASE_DIR}/article.txt"
summary_path = f"{BASE_DIR}/summary.txt"

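# Bangla text: read explicitly as UTF-8 instead of relying on the platform default
with open(article_path, "r", encoding="utf-8") as f:
    articles = [line.strip() for line in f]

with open(summary_path, "r", encoding="utf-8") as f:
    summaries = [line.strip() for line in f]

# The two files are line-aligned, so their line counts must match
assert len(articles) == len(summaries), "article/summary line counts differ"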

df = pd.DataFrame({"article": articles, "summary": summaries})
df.head(), len(df)

(                                             article  \
 0  স্ট্যান্ডার্ড চার্টার্ড ব্যাংকের নতুন প্রধান ন...   
 1  রাজধানী থেকে চামড়া শিল্পগুলো সাভারে স্থানান্তর...   
 2  দেশীয় শিল্প বিকাশে সরকারের সব ধরনের উদ্যোগ অব্...   
 3  একীভূত হতে চলেছে অনলাইনে শ্রেণিবদ্ধ বিজ্ঞাপন স...   
 4  যাত্রীবাহী একটি বাসে আগুন দেওয়ার আধা ঘণ্টার মধ...   
 
                                              summary  
 0          স্ট্যান্ডার্ড চার্টার্ডের নতুন সিইও আবরার  
 1  মার্চের মধ্যে সাভারে চামড়া শিল্পের সিইটিপি: মন...  
 2                       ওয়ালটন কারখানায় শিল্পমন্ত্রী  
 3                    একীভূত হচ্ছে এখানেই ডটকমওএলএক্স  
 4              বাসে আগুন: নড়াইলের পৌর মেয়র গ্রেপ্তার  ,
 19096)
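
The loading step above relies on `article.txt` and `summary.txt` being line-aligned: line *i* of one file belongs to line *i* of the other. A minimal sketch of a loader that checks this and drops empty pairs is shown below. The helper name `load_parallel_lines` and the placeholder file contents are illustrative assumptions, not part of the original project.

```python
import os
import tempfile


def load_parallel_lines(article_path, summary_path):
    """Read two line-aligned UTF-8 files and return non-empty (article, summary) pairs."""
    with open(article_path, encoding="utf-8") as f:
        articles = [line.strip() for line in f]
    with open(summary_path, encoding="utf-8") as f:
        summaries = [line.strip() for line in f]

    if len(articles) != len(summaries):
        raise ValueError(
            f"line counts differ: {len(articles)} articles vs {len(summaries)} summaries"
        )

    # Keep only pairs where both sides contain text
    return [(a, s) for a, s in zip(articles, summaries) if a and s]


# Tiny demonstration with made-up placeholder files
with tempfile.TemporaryDirectory() as tmp:
    a_path = os.path.join(tmp, "article.txt")
    s_path = os.path.join(tmp, "summary.txt")
    with open(a_path, "w", encoding="utf-8") as f:
        f.write("first article\n\nthird article\n")
    with open(s_path, "w", encoding="utf-8") as f:
        f.write("first summary\nsecond summary\nthird summary\n")
    demo_pairs = load_parallel_lines(a_path, s_path)

print(demo_pairs)
```

The returned pairs could be passed to `pd.DataFrame(pairs, columns=["article", "summary"])` in place of the two separate lists.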

**Split train/validation**

In [None]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

train_df.shape, val_df.shape

((17186, 2), (1910, 2))
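
The shapes above follow from simple arithmetic. As far as I understand scikit-learn's behavior, a fractional `test_size` is rounded up to a whole number of test rows and the remainder goes to training. The sketch below reproduces that calculation; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test               # the rest goes to training
    return n_train, n_test


sizes = split_sizes(19096, 0.1)
print(sizes)
```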

**Convert to HuggingFace Dataset**

In [None]:
from datasets import Dataset

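# preserve_index=False keeps the shuffled pandas index from becoming an extra column
train_ds = Dataset.from_pandas(train_df, preserve_index=False)
val_ds = Dataset.from_pandas(val_df, preserve_index=False)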

**Load BanglaT5 model & tokenizer (stored into Drive cache)**

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "csebuetnlp/banglat5"

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=f"{BASE_DIR}/cache")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, cache_dir=f"{BASE_DIR}/cache")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


pytorch_model.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

**Preprocess / Tokenization function**

In [None]:
max_input = 128
max_output = 32

def preprocess(batch):
    inputs = ["summarize: " + x for x in batch["article"]]

    model_inputs = tokenizer(
        inputs,
        max_length=max_input,
        truncation=True,
        padding="max_length"
    )

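    # text_target tokenizes the summaries as labels; it replaces the
    # deprecated tokenizer.as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=batch["summary"],
        max_length=max_output,
        truncation=True,
        padding="max_length"
    )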

    labels["input_ids"] = [
        [(token if token != tokenizer.pad_token_id else -100) for token in seq]
        for seq in labels["input_ids"]
    ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = train_ds.map(preprocess, batched=True)
val_ds = val_ds.map(preprocess, batched=True)


Map:   0%|          | 0/17186 [00:00<?, ? examples/s]



Map:   0%|          | 0/1910 [00:00<?, ? examples/s]
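
The label handling in `preprocess` can be illustrated without a tokenizer. The sketch below works on plain lists of token ids: it truncates, pads to a fixed length, and then replaces padding with `-100`, the index that PyTorch's cross-entropy loss ignores. The pad id `0` and the example ids are assumptions for illustration, and unlike a real tokenizer this toy version does not re-append an end-of-sequence token after truncation.

```python
PAD_ID = 0        # assumed pad token id, for illustration only
IGNORE_ID = -100  # index ignored by PyTorch's cross-entropy loss


def pad_and_mask(label_ids, max_length, pad_id=PAD_ID, ignore_id=IGNORE_ID):
    """Truncate to max_length, pad with pad_id, then replace padding with ignore_id."""
    truncated = label_ids[:max_length]
    padded = truncated + [pad_id] * (max_length - len(truncated))
    return [tok if tok != pad_id else ignore_id for tok in padded]


masked = pad_and_mask([512, 87, 1], max_length=6)
print(masked)
```

Masking matters because padded positions would otherwise count toward the loss and push the model to predict padding tokens.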

**Data collator**

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,
    label_pad_token_id=-100
)
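
With `padding=True`, the collator pads each batch to its longest sequence. Because `preprocess` already pads everything to `max_length`, there is nothing left for the collator to pad in this notebook; dynamic padding would only take effect if `padding="max_length"` were dropped from the tokenizer calls. The sketch below shows the idea on plain lists; `pad_to_longest` and the token ids are hypothetical, and a pad id of `0` is assumed.

```python
def pad_to_longest(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Hypothetical token ids for a batch of two examples
batch_input_ids = [[11, 12, 13, 1], [21, 1]]
batch_labels = [[7, 1], [8, 9, 1]]

padded_inputs = pad_to_longest(batch_input_ids, pad_value=0)  # pad id assumed to be 0
padded_labels = pad_to_longest(batch_labels, pad_value=-100)  # ignored by the loss
print(padded_inputs)
print(padded_labels)
```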


**TrainingArguments (save everything into Drive)**

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=f"{BASE_DIR}/checkpoints",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=3e-4,
    num_train_epochs=5,

    save_strategy="epoch",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases

    logging_dir=f"{BASE_DIR}/logs",
    logging_steps=100,

    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",

    report_to="none"  # disable W&B
)
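
The batch settings above combine into an effective batch size of `per_device_train_batch_size × gradient_accumulation_steps`. The sketch below works out the resulting number of optimizer updates. It assumes a single GPU, and the per-epoch count is an approximation, since the exact figure depends on how the Trainer handles the last partial accumulation window.

```python
import math

train_examples = 17186   # rows in train_df
per_device_batch = 4
grad_accum_steps = 2
num_devices = 1          # assumption: a single Colab GPU
epochs = 5

effective_batch = per_device_batch * grad_accum_steps * num_devices
updates_per_epoch = math.ceil(train_examples / effective_batch)  # approximate
total_updates = updates_per_epoch * epochs

print(effective_batch, updates_per_epoch, total_updates)
```

With `logging_steps=100`, this corresponds to roughly 21 logged loss values per epoch.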

**Trainer setup**

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=data_collator,
    tokenizer=tokenizer  # recent transformers releases deprecate this in favor of processing_class
)


⚙️  Running in WANDB offline mode




**Train & Save final model to Drive**

In [None]:
trainer.train()

trainer.save_model(f"{BASE_DIR}/model")
tokenizer.save_pretrained(f"{BASE_DIR}/model")

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.5938        | 2.123913        |
| 2     | 2.1993        | 2.042734        |
| 3     | 1.8723        | 2.043406        |
| 4     | 1.6585        | 2.056576        |
| 5     | 1.3653        | 2.081972        |
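
With `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"`, the Trainer is expected to reload the checkpoint with the lowest validation loss before `save_model` runs. The sketch below picks that epoch from the logged values; the numbers are copied from the training log above.

```python
# (epoch, training loss, validation loss) copied from the training log
history = [
    (1, 2.5938, 2.123913),
    (2, 2.1993, 2.042734),
    (3, 1.8723, 2.043406),
    (4, 1.6585, 2.056576),
    (5, 1.3653, 2.081972),
]

best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])
print(best_epoch, best_val_loss)
```

Training loss keeps falling after the best epoch while validation loss rises, which suggests the later epochs mostly overfit.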


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


('/content/drive/MyDrive/BanglaT5_Project/model/tokenizer_config.json',
 '/content/drive/MyDrive/BanglaT5_Project/model/special_tokens_map.json',
 '/content/drive/MyDrive/BanglaT5_Project/model/spiece.model',
 '/content/drive/MyDrive/BanglaT5_Project/model/added_tokens.json',
 '/content/drive/MyDrive/BanglaT5_Project/model/tokenizer.json')

**Test summary**

In [None]:
def summarize(text):
    inp = "summarize: " + text
    # Truncate to the same input length that was used during training
    inputs = tokenizer(inp, return_tensors="pt", truncation=True, max_length=max_input)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    # Pass the attention mask along with the input ids
    output = model.generate(**inputs, max_length=max_output, num_beams=4)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print("ARTICLE:", val_df.iloc[0]["article"])
print("GENERATED:", summarize(val_df.iloc[0]["article"]))
print("TRUE:", val_df.iloc[0]["summary"])

ARTICLE: রাষ্ট্রদ্রোহের একটি ও নাশকতার ১০টিসহ মোট ১১ মামলায় বিএনপি চেয়ারপারসন খালেদা জিয়ার হাজির হওয়ার জন্য ৩১ জুলাই দিন ধার্য করেছে আদালত।
GENERATED: খালেদার হাজিরার দিন ধার্য
TRUE: ১১ মামলায় আদালতে খালেদার হাজিরা ৩১ জুলাই
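
A single example gives only a rough impression, so a simple overlap score can help compare generated and reference summaries. The sketch below is a simplified ROUGE-1 F1 on whitespace tokens, written in plain Python. It is not the `rouge-score` package installed earlier: to my understanding that package's default tokenizer keeps only ASCII letters and digits, so Bangla text may need a custom tokenizer there. The function name `rouge1_f1` is an illustrative assumption.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 on whitespace tokens (a simplified ROUGE-1)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


generated = "খালেদার হাজিরার দিন ধার্য"
reference = "১১ মামলায় আদালতে খালেদার হাজিরা ৩১ জুলাই"
score = rouge1_f1(generated, reference)
print(round(score, 4))
```

Whitespace matching is strict for Bangla: inflected forms such as "হাজিরার" and "হাজিরা" count as different tokens, so the score understates how close the two summaries are in meaning.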
