<a href="https://colab.research.google.com/github/susuhlaingmyk26-tech/Colab-project/blob/main/Untitled9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# =====================================
# ALL-IN-ONE : DATA + TRAIN + UI
# =====================================

# ---------- INSTALL ----------
!pip install -q transformers sentencepiece datasets accelerate gradio

import os
import torch
import gradio as gr
from datasets import Dataset
from transformers import (
    MT5ForConditionalGeneration,
    MT5Tokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments
)

# ---------- AUTO CREATE DATASET ----------
if not os.path.exists("data.txt"):
    with open("data.txt", "w", encoding="utf-8") as f:
        f.write(
            "ဖရဲသီး စားမယ်\tဖရဲသီး ဝါးဝါ\n"
            "မနက်ဖြန် သွားမယ်\tမနက်ဖြန် သွားဝါ\n"
            "မလာဘူး\tမလာဝါဘူး\n"
            "ဘာလုပ်နေလဲ\tဘာလုပ်နေလဲဝါ\n"
            "အိမ်မှာ ရှိလား\tအိမ်မှာ ရှိဝါလား\n"
            "ထမင်း စားပြီးပြီ\tထမင်း ဝါးပြီးဝါ\n"
        )

# ---------- LOAD DATA ----------
def load_txt(path):
    src, tgt = [], []
    with open(path, encoding="utf-8") as f:
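        for line in f:
            line = line.rstrip("\n")
            # Keep only "source<TAB>target" lines; split once so extra tabs do not raise
            if "\t" in line:
                s, t = line.split("\t", 1)
                src.append(s.strip())
                tgt.append(t.strip())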
    return Dataset.from_dict({"src": src, "tgt": tgt})

dataset = load_txt("data.txt")

# ---------- MODEL ----------
model_name = "google/mt5-small"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# ---------- TOKENIZE ----------
MAX_LEN = 64

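def preprocess(batch):
    # Tokenize sources and targets in one call; `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager.
    inputs = tokenizer(
        batch["src"],
        text_target=batch["tgt"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN
    )
    # Replace pad token ids in the labels with -100 so the loss ignores padding
    inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in inputs["labels"]
    ]
    return inputs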

dataset = dataset.map(preprocess, batched=True)
dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)

# ---------- TRAIN ----------
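args = TrainingArguments(
    output_dir="./beik_mt5",
    per_device_train_batch_size=8,
    num_train_epochs=6,
    learning_rate=3e-4,
    # mT5 checkpoints are prone to overflow (NaN loss) in fp16, so train in full precision
    fp16=False,
    # 6 examples at batch size 8 give one step per epoch; log every step so the loss is reported
    logging_steps=1,
    save_total_limit=1,
    report_to="none"
)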

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model)
)

trainer.train()

# ---------- SAVE ----------
model.save_pretrained("beik_translator")
tokenizer.save_pretrained("beik_translator")

# ---------- INFERENCE ----------
tokenizer = MT5Tokenizer.from_pretrained("beik_translator")
model = MT5ForConditionalGeneration.from_pretrained("beik_translator")

def translate(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# ---------- UI ----------
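iface = gr.Interface(
    fn=translate,
    # Input label: "Type Burmese text"
    inputs=gr.Textbox(label="မြန်မာစာရိုက်ပါ"),
    # Output label: "Beik dialect"
    outputs=gr.Textbox(label="ဘိတ်စကား"),
    title="Myanmar → Beik AI Translator",
    # Description: "Translates Burmese text into Beik dialect with an AI model"
    description="AI model နဲ့ မြန်မာစာကို ဘိတ်စကားအဖြစ် ပြန်ပေးသည်"
)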

iface.launch()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'T5Tokenizer'. 
The class this function is called from is 'MT5Tokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.mt5.tokenization_mt5.MT5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)


Step,Training Loss
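
A plausible reason the loss table above has no rows: 6 training pairs at a per-device batch size of 8 give one optimizer step per epoch, so 6 epochs produce only 6 steps, and a logging interval of 10 steps is never reached. The arithmetic can be sketched as follows. The helper names are hypothetical, and the sketch assumes a single device and no gradient accumulation.

```python
import math

def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

def steps_that_log(total_steps: int, logging_steps: int) -> list[int]:
    """Global steps at which a loss row is emitted when logging every `logging_steps` steps."""
    return [s for s in range(1, total_steps + 1) if s % logging_steps == 0]

total = total_optimizer_steps(num_examples=6, batch_size=8, epochs=6)
print(total)                       # 6
print(steps_that_log(total, 10))   # []
print(steps_that_log(total, 1))    # [1, 2, 3, 4, 5, 6]
```

Under these assumptions, a logging interval of 1 (or a larger training set) would make the loss visible in the table.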


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://052c1c53a27bd038c1.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
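
One detail of the tokenize step may be worth isolating. Hugging Face seq2seq models skip label positions equal to -100 when computing the loss, while positions that hold the pad token id are scored like any other token. With `padding="max_length"` and short sentences, most label positions are padding. Below is a minimal pure-Python sketch of the masking. `mask_pad_labels` is a hypothetical helper, and a pad id of 0 is assumed (as in T5-family tokenizers; check `tokenizer.pad_token_id` before relying on it).

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss in Hugging Face seq2seq models

def mask_pad_labels(label_ids, pad_token_id=0):
    """Return a copy of `label_ids` (list of lists) with pad ids replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Toy batch: token ids are made up; 1 stands in for the end-of-sequence id.
padded = [[259, 1234, 1, 0, 0], [567, 1, 0, 0, 0]]
print(mask_pad_labels(padded))
# [[259, 1234, 1, -100, -100], [567, 1, -100, -100, -100]]
```

The same list comprehension could be applied to the `labels` column inside `preprocess`, before the dataset is converted to tensors.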


