# 🚀 DailyFlash Offer Generator - Model Training

This notebook fine-tunes a GPT-2 model to generate structured JSON offers from raw promotional text.

## Overview
1. Setup and installation
2. Data loading and preprocessing
3. Model configuration
4. Training
5. Saving the model
6. Testing the model

## Step 1: Setup & Installation

First, let's install the necessary packages and configure Google Drive for saving the model.

In [None]:
# Install required libraries
!pip install transformers datasets accelerate torch evaluate

In [None]:
# Mount Google Drive for saving model
from google.colab import drive
drive.mount('/content/drive')

# Create directory to save model
import os
os.makedirs('/content/drive/MyDrive/offer_generator_model', exist_ok=True)

In [None]:
# Upload dataset to Colab
from google.colab import files
uploaded = files.upload()  # Upload your offer_dataset.jsonl file here

## Step 2: Load and Preprocess Data

We'll read the JSONL file line by line, wrap the records in a Hugging Face `datasets.Dataset`, and tokenize them for training.

In [None]:
import pandas as pd
import json
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import Dataset

# Load dataset
data = []
with open('offer_dataset.jsonl', 'r') as file:
    for line in file:
        data.append(json.loads(line))

df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)

# Print sample to verify data
print(dataset[0]['text'])
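
The loading loop above assumes every line of the file is a valid JSON object. As a minimal sketch, a slightly more defensive reader could skip blank lines and report the offending line number. The helper name `load_jsonl` and the two sample records are hypothetical; in particular, the `Input: ... Output: ...` layout of the `text` field is an assumption based on the prompt used later in this notebook.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate empty or trailing blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"Invalid JSON on line {line_number}: {err}") from err
    return records


# Hypothetical sample file with one blank line, written to a temp location.
sample_lines = [
    json.dumps({"text": 'Input: 10% off shoes\nOutput: {"discount": "10%"}'}),
    "",
    json.dumps({"text": 'Input: Free delivery today\nOutput: {"offer": "free delivery"}'}),
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

records = load_jsonl(sample_path)
os.remove(sample_path)
print(len(records))  # 2: the blank line is skipped
```

The resulting list can be passed to `pd.DataFrame(...)` exactly as `data` is in the cell above.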

In [None]:
# Load tokenizer and model
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

In [None]:
# GPT-2 ships without a padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Split into training and validation
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

print(f"Training size: {len(split_dataset['train'])}")
print(f"Validation size: {len(split_dataset['test'])}")

## Step 3: Configure Training

We'll set up the training parameters and data collator.

In [None]:
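# Keep the embedding matrix in sync with the tokenizer's vocabulary.
# No new tokens were added above (padding reuses EOS), so this call leaves the size unchanged.
model.resize_token_embeddings(len(tokenizer))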

# Set training arguments
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/offer_generator_model",
    # Recent transformers releases call this `eval_strategy`; older ones use `evaluation_strategy`
    eval_strategy="epoch",
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [None]:
# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # We're doing causal language modeling, not masked language modeling
)
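
To make the collator's role more concrete: with `mlm=False` it copies `input_ids` into `labels` and replaces every padding position with `-100`, which the loss ignores. The model itself shifts the labels by one position. The following is a plain-Python sketch of that masking step, not the library's implementation; `causal_lm_labels` and the non-EOS token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels and mask every padding position."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# GPT-2's end-of-text token id is 50256; the other ids are hypothetical.
EOS_ID = 50256
padded_ids = [101, 202, 303, EOS_ID, EOS_ID, EOS_ID]
labels = causal_lm_labels(padded_ids, pad_token_id=EOS_ID)
print(labels)  # [101, 202, 303, -100, -100, -100]
```

One consequence worth keeping in mind: because this notebook reuses EOS as the padding token, any genuine end-of-text token in a training example would be masked as well, so the model may get little training signal for where to stop generating.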

## Step 4: Train the Model

Now we'll train the model using the Trainer API.

In [None]:
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    data_collator=data_collator,
)

# Train the model
trainer.train()
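
Because evaluation runs at the end of every epoch, the Trainer reports an evaluation loss each time. A common way to read that number is to convert it to perplexity, which is the exponential of the mean cross-entropy loss. The sketch below uses hypothetical loss values purely to show the conversion; they are not results from this notebook.

```python
import math


def perplexity(eval_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)


# Hypothetical per-epoch evaluation losses, for illustration only.
hypothetical_losses = [2.0, 1.0, 0.5]
for epoch, loss in enumerate(hypothetical_losses, start=1):
    print(f"epoch {epoch}: loss={loss:.2f} perplexity={perplexity(loss):.2f}")
```

A falling perplexity across epochs suggests the model is fitting the validation split better; if it starts rising while training loss keeps dropping, that is a typical sign of overfitting.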

## Step 5: Save the Model

Let's save the trained model to Google Drive.

In [None]:
# Save the model
save_directory = "/content/drive/MyDrive/offer_generator_model/final_model"
os.makedirs(save_directory, exist_ok=True)

model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

print(f"Model saved to {save_directory}")

## Step 6: Test the Model

Let's generate some offers using the trained model to see how it performs.

In [None]:
# Function for generating offers
def generate_offer(input_text, model=model, tokenizer=tokenizer):
    prompt = f"Input: {input_text}\nOutput:"
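    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
    
    # Generate output
    outputs = model.generate(
        **inputs,  # pass input_ids together with the attention mask
        max_length=512,  # total length, prompt included
        do_sample=True,  # required for temperature and top_p to take effect
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
        num_return_sequences=1
    )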
    
    # Decode and clean up the output
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Keep everything after the first "Output:" marker, which should be the JSON part
    if "Output:" in generated_text:
        generated_text = generated_text.split("Output:", 1)[1].strip()
    
    return generated_text

In [None]:
# Test the model with some examples
test_inputs = [
    "50% off on all electronics at TechMart until Dec 25. Shop now at www.techmart.com",
    "Buy 2 get 1 free on shirts at ABC Store, Lucknow. Call: +919876543210",
    "New test case: Flash sale on all vegetables at FreshMart, Mumbai. Valid today only."
]

for test_input in test_inputs:
    print(f"Input: {test_input}")
    result = generate_offer(test_input)
    print(f"Output: {result}")
    print("---\n")
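
Since the goal is structured JSON, it can help to check whether each generated string actually parses. The following is a minimal sketch, assuming the model emits a JSON object somewhere in its output; the helper name `extract_json_object` and the field names in the samples are hypothetical and not taken from the dataset.

```python
import json


def extract_json_object(generated_text):
    """Return the first parseable JSON object in the text, or None."""
    decoder = json.JSONDecoder()
    start = generated_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(generated_text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = generated_text.find("{", start + 1)
    return None


# Hypothetical model outputs for illustration.
print(extract_json_object('{"store": "TechMart", "discount": "50%"} Input: next'))
print(extract_json_object("no structured output here"))
```

Passing the result of `generate_offer(...)` through a check like this makes it easy to count how many test inputs yield valid JSON, rather than judging the output by eye.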

## Download the Model (Optional)

Run this cell if you want to download the model files directly to your computer.

In [None]:
# Zip the model directory for download
!zip -r /content/offer_generator_model.zip /content/drive/MyDrive/offer_generator_model/final_model
files.download('/content/offer_generator_model.zip')

## 🎉 Congratulations!

You've successfully trained a model that can generate structured JSON offers from raw promotional text. This model can now be used in your inference script and Gradio UI.