<a href="https://colab.research.google.com/github/vieer-dwivedi/AI/blob/main/Learnings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U transformers datasets accelerate torch sagemaker boto3

Collecting sagemaker
  Downloading sagemaker-2.251.1-py3-none-any.whl.metadata (17 kB)
Collecting boto3<2.0,>=1.39.5 (from sagemaker)
  Downloading boto3-1.40.21-py3-none-any.whl.metadata (6.7 kB)
Collecting docker (from sagemaker)
  Downloading docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting graphene<4,>=3 (from sagemaker)
  Downloading graphene-3.4.3-py2.py3-none-any.whl.metadata (6.9 kB)
Collecting importlib-metadata<7.0,>=1.4.0 (from sagemaker)
  Downloading importlib_metadata-6.11.0-py3-none-any.whl.metadata (4.9 kB)
Collecting numpy>=1.17 (from transformers)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0m
Collecting packaging>=20.0 (from transformers)
  Downloading packaging-24.2-py3-none-any.whl.metadata (3.2 kB)
Collecting pathos (from sagemaker)
  Downloading pathos-0.3.4-py3-none-any.whl.metad

In [None]:
import os, torch, pandas as pd
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq, TrainingArguments, Trainer
)

In [None]:
os.environ["WANDB_DISABLED"] = "true"
device = "cuda" if torch.cuda.is_available() else "cpu"
df = pd.read_csv("data_full.csv")
needed = ["brand", "model", "body_style", "description"]
df = df[[c for c in needed if c in df.columns]].copy()
df = df.dropna(subset=["brand", "model", "description"])
df["body_style"] = df.get("body_style", "").fillna("")

In [None]:
df["input_text"]  = (
    df["brand"].astype(str).fillna("") + " " +
    df["model"].astype(str).fillna("") + " " +
    df["body_style"].astype(str).fillna("")
).str.strip()
df["target_text"] = df["description"].astype(str).fillna("")

In [None]:
df_small = df.sample(n=min(len(df), 1000), random_state=42)
dataset = Dataset.from_pandas(df_small[["input_text", "target_text"]])
dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = dataset["train"], dataset["test"]

In [None]:
model_name = "MBZUAI/LaMini-T5-61M"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model     = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
MAX_IN_LEN  = 128
MAX_OUT_LEN = 128

In [None]:
def preprocess(batch):
    inputs  = [str(x) if x is not None else "" for x in batch["input_text"]]
    targets = [str(x) if x is not None else "" for x in batch["target_text"]]

    model_inputs = tokenizer(
        inputs, max_length=MAX_IN_LEN, truncation=True, padding="max_length"
    )
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets, max_length=MAX_OUT_LEN, truncation=True, padding="max_length"
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
train_enc = train_ds.map(preprocess, batched=True, remove_columns=train_ds.column_names)
val_enc   = val_ds.map(preprocess,   batched=True, remove_columns=val_ds.column_names)
collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
fp16_ok = torch.cuda.is_available()

Map:   0%|          | 0/800 [00:00<?, ? examples/s]



Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=5e-4,                 # your earlier choice (0.0005)
    per_device_train_batch_size=8,      # small to prevent OOM
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,      # effective batch 32 on 1 GPU
    num_train_epochs=3,                 # start small
    seed=42,
    lr_scheduler_type="linear",
    report_to=[],                       # no wandb
    fp16=fp16_ok,                       # halves memory if GPU is available
    save_total_limit=2,
    logging_steps=50,
    # optional memory savers:
    # gradient_checkpointing=True,      # enable if you still hit OOM (slower)
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_enc,
    eval_dataset=val_enc,
    data_collator=collator,
    tokenizer=tokenizer,
)

  trainer = Trainer(


In [None]:
trainer.train()

Step,Training Loss
50,3.9613


TrainOutput(global_step=75, training_loss=3.8598527018229167, metrics={'train_runtime': 41.0009, 'train_samples_per_second': 58.535, 'train_steps_per_second': 1.829, 'total_flos': 81205080883200.0, 'train_loss': 3.8598527018229167, 'epoch': 3.0})

Test

In [None]:
def generate(text, max_new_tokens=80):
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        out_ids = model.generate(
            **inputs,
            max_length=min(MAX_OUT_LEN, MAX_IN_LEN + max_new_tokens),
            num_beams=4,               # explore more possibilities
            no_repeat_ngram_size=3,    # stop repeating 3-word sequences
            repetition_penalty=2.0,    # discourage repeating phrases
            early_stopping=True
        )
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)


samples = [
    (df_small["brand"].iloc[0] + " " + df_small["model"].iloc[0] + " " + str(df_small["body_style"].iloc[0])).strip(),
    "Toyota Camry sedan",
    "Tesla Model 3 sedan"
]
for s in samples:
    print("\nINPUT:", s)
    print("OUTPUT:", generate(s))

# === 8) Save locally ===========================================================
trainer.save_model("my_car_model")
tokenizer.save_pretrained("my_car_model")
print("Saved to ./my_car_model")


INPUT: PORSCHE PORSCHE 928 GTS Coupé (two-door)
OUTPUT: The 928 GTS Coupé was the first generation of the Porsche 928, and it was the third generation of a new car. It was introduced in 1997 as a whole by the German carmaker, but it was not a single-door model. It had to be more aggressive than its predecessors, but with a few different variants.The 927 GTS was also a very popular version for the Japanese carmaker. However, it was still a long-term sporty rider.

INPUT: Toyota Camry sedan
OUTPUT: The first generation of the Toyota Camry was launched in 2008 and it was a great success. It was based on a compact SUV with a front-wheeled headlights and a rear-wheelbase system.The second generation, called the Ford Camry, was designed to be a hybrid sedan that had a lot of room for its customers. In 2015, the Chevrolet Camry introduced a new version of the car's interior design, but it wasn't just a single vehicle. But it didn't have any impact on the driver's reputation.

INPUT: Tesla Mo

In [1]:
!pip install sagemaker



In [2]:
from google.colab import userdata
import boto3

In [3]:
session = boto3.Session(
    aws_access_key_id=userdata.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=userdata.get('AWS_SECRET_ACCESS_KEY'),
    region_name=userdata.get('AWS_REGION')
)

In [4]:
import sagemaker
from sagemaker.s3 import S3Uploader

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
