In [10]:
%pip install transformers sentencepiece torch

Note: you may need to restart the kernel to use updated packages.


In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load the fine-tuned T5 paraphraser model
tokenizer = AutoTokenizer.from_pretrained("ramsrigouthamg/t5_paraphraser")
model = AutoModelForSeq2SeqLM.from_pretrained("ramsrigouthamg/t5_paraphraser")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def t5_paraphraser_augment(text, max_length=64, num_return_sequences=5, temperature=1.5, top_k=50, top_p=0.95):
    prompt = f"paraphrase: {text} </s>"
    input_ids = tokenizer.encode(prompt, return_tensors="pt", truncation=True).to(device)

    outputs = model.generate(
        input_ids=input_ids,
        max_length=max_length,
        do_sample=True,
        top_k=top_k,
        top_p=top_p,
        temperature=temperature,
        num_return_sequences=num_return_sequences
    )

    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [6]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load the model and tokenizer (can be swapped for t5-small or a fine-tuned model).
# Separate names keep the paraphraser model loaded earlier from being overwritten.
# Note: stock t5-base was not trained with a "paraphrase:" task prefix, so a
# fine-tuned checkpoint is probably needed to get useful paraphrases.
base_tokenizer = T5Tokenizer.from_pretrained("t5-base")
base_model = T5ForConditionalGeneration.from_pretrained("t5-base")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
base_model = base_model.to(device)

def t5_data_augmentation(text, max_length=64, num_return_sequences=5, temperature=1.5, top_k=50, top_p=0.95):
    prompt = f"paraphrase: {text} </s>"
    input_ids = base_tokenizer.encode(prompt, return_tensors="pt", truncation=True).to(device)

    # Use sampling instead of beam search
    outputs = base_model.generate(
        input_ids=input_ids,
        max_length=max_length,
        do_sample=True,                # enable random sampling
        top_k=top_k,
        top_p=top_p,
        temperature=temperature,
        num_return_sequences=num_return_sequences
    )

    # Decode the generated token IDs back into text
    return [base_tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
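
Both augmentation functions pass temperature, top_k and top_p to generate with do_sample=True. The toy sketch below shows roughly what those three knobs do to a single next-token distribution. It is a pure-Python illustration, not the transformers implementation: the function name sample_next_token and the toy logits are made up for this example, and the library's exact filtering details may differ.

```python
import math
import random

def sample_next_token(logits, temperature=1.5, top_k=50, top_p=0.95, rng=random):
    """Toy temperature + top-k + top-p (nucleus) sampling over a dict of logits.

    `logits` maps token -> unnormalized score (hypothetical data for illustration).
    """
    # Temperature: divide the logits; values above 1 flatten the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}

    # Top-k: keep only the k highest-scoring tokens.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    max_score = kept[0][1]
    weights = [math.exp(score - max_score) for _, score in kept]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for (tok, _), p in zip(kept, probs):
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices renormalizes the remaining weights before drawing.
    tokens, nucleus_probs = zip(*nucleus)
    return rng.choices(tokens, weights=nucleus_probs, k=1)[0]

toy_logits = {"access": 2.0, "login": 1.0, "enter": 0.5, "banana": -3.0}
print([sample_next_token(toy_logits, rng=random.Random(i)) for i in range(5)])
```

A temperature of 1.5, as used in both functions, makes unlikely tokens more probable, which increases diversity but also the chance that a paraphrase drifts away from the source sentence.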


In [2]:
text = "The user cannot access the system after changing the password."
augmented_texts = t5_paraphraser_augment(text)
print(augmented_texts)

['If I changed my e mail address without password then I cannot get back access and password for computer.', 'I want to log into my social security website after change my password. Do you know what I am doing wrong with my current password? Just give your new password and then the same gets back.', "Will not get into my account after I've changed my password?", 'What does it mean if a user has not resets the password if they have forgotten the email address and all the information in the box also shows.', 'When the password for another PC is reset, the account is not able to function.']
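
Several of the sampled outputs above drift away from the meaning of the input sentence. One possible post-processing step, sketched below, is to drop duplicates and keep only candidates whose word overlap with the source falls inside a band. This is a crude lexical heuristic, not part of the original notebook: the helper names and the min_overlap / max_overlap thresholds are assumptions chosen for illustration and would need tuning on real data.

```python
import re

def _tokens(text):
    # Lowercased word set; a deliberately simple tokenizer for this sketch.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def filter_paraphrases(source, candidates, min_overlap=0.2, max_overlap=0.9):
    """Keep unique candidates that are neither unrelated nor near-copies.

    The thresholds are hypothetical defaults, not validated values.
    """
    kept, seen = [], set()
    for cand in candidates:
        key = " ".join(cand.lower().split())  # normalize case and whitespace
        if key in seen:
            continue
        seen.add(key)
        score = jaccard(source, cand)
        if min_overlap <= score <= max_overlap:
            kept.append((cand, round(score, 3)))
    return kept

source = "The user cannot access the system after changing the password."
candidates = [
    "The user cannot access the system after changing the password.",
    "The user can't access the system after a password change.",
    "the user can't access the system after a password change.",
    "I like bananas.",
]
for text, score in filter_paraphrases(source, candidates):
    print(score, text)
```

In the notebook this would be called as filter_paraphrases(text, augmented_texts). Word overlap cannot tell whether meaning is preserved, so a semantic similarity model may be a better filter when one is available.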
