**Step 1: Set Up the Environment**

First, ensure you have the necessary libraries installed.
You can use the Hugging Face Transformers library to fine-tune T5.

In [2]:
!pip install transformers torch



**Step 2: Load the Pre-trained T5 Model and Tokenizer**

We'll use the t5-small model for this example. You can choose other variants like t5-base or t5-large based on your computational resources.

In [3]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

**Step 3: Prepare the Data**

Prepare your dataset containing product features and corresponding descriptions. Format the data so that each input-output pair is suitable for fine-tuning.

Example data (for illustration purposes):

In [4]:
product_data = [
    {
        "features": "Red, cotton, V-neck, short-sleeve",
        "description": "A stylish red V-neck t-shirt made from 100% cotton. Perfect for casual outings."
    },
    {
        "features": "Black, leather, waterproof, size 42",
        "description": "Elegant black leather shoes, waterproof and comfortable, suitable for all occasions."
    }
    # Add more product entries here
]

train_texts = ["features: " + item["features"] for item in product_data]
train_labels = [item["description"] for item in product_data]

**Step 4: Tokenize the Data**

Tokenize the input features and labels.

In [5]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
train_labels_encodings = tokenizer(train_labels, truncation=True, padding=True, max_length=512)


**Step 5: Create a PyTorch Dataset**


Define a custom dataset class for PyTorch.

In [6]:
import torch

class ProductDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels_encodings):
        self.encodings = encodings
        self.labels_encodings = labels_encodings

    def __len__(self):
        return len(self.encodings.input_ids)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels_encodings.input_ids[idx])
        return item

train_dataset = ProductDataset(train_encodings, train_labels_encodings)


**Step 6: Set Up DataLoader**

Use DataLoader to handle batching and shuffling of data.

In [7]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)


**Step 7: Fine-tune the Model**

Define the training loop and fine-tune the model.

In [8]:
from transformers import AdamW

# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()

for epoch in range(3):  # Train for 3 epochs
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")




Epoch 1, Loss: 6.389193534851074
Epoch 2, Loss: 6.222358226776123
Epoch 3, Loss: 6.0869245529174805


**Step 8: Evaluate the Model**

Evaluate the fine-tuned model on some test data.

In [18]:
model.eval()

# test_features = ["Blue, denim, slim-fit, jeans, size 32"]
test_features = ["Red, cotton, V-neck, short-sleeve"]
test_encodings = tokenizer(test_features, truncation=True, padding=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(test_encodings.input_ids, max_length=100)
    generated_description = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

print(f"Features: {test_features[0]}")
print(f"Generated Description: {generated_description}")


Features: Red, cotton, V-neck, short-sleeve
Generated Description: Red, cotton, V-neck, short-sleeve, short-sleeve, short-sleeve, short-sleeve, short-sleeve, short-sleeve, short-sleeve, short-sleeve, short-sleeve, short-sleeve, short-sleeve, short-sleeve, short-sleeve,


In [16]:
generated_ids[0]


tensor([   0, 1624,    6,  710,   18,    7,  109,   15,  162,  710,   18,    7,
         109,   15,  162,    6,  710,   18,    7,  109,   15,  162,  710,   18,
           7,  109,   15,  162,  710,   18,    7,  109,   15,  162,  710,   18,
           7,  109,   15,  162,  710,   18,    7,  109,   15,  162,  710,   18,
           7,  109])