# Hackathon: Optimize and Deploy Text Summarization Model

---

## Problem Statement

- Optimize a transformer-based text summarization model for efficient deployment in a resource-constrained environment to simulate a production scenario.
- Use **knowledge distillation** and **manual pruning** techniques to reduce model size and inference latency without significantly compromising summarization quality.
- Use a subset of the CNN/DailyMail dataset and fine-tune a student model (`t5-small`).
- Apply structured weight pruning to further compress the model.
- Save the final model in TensorFlow's `SavedModel` format, suitable for containerized inference deployment.
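To make the distillation objective concrete, here is a minimal, framework-free sketch of a loss for a single token position. It assumes the common formulation: cross-entropy against the ground-truth token, plus a temperature-softened KL term between teacher and student that is scaled by the squared temperature. The names `alpha` and `temperature` and their default values are illustrative assumptions, not settings prescribed by this hackathon. In the notebook the same arithmetic would be expressed with TensorFlow ops over whole batches.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits divided by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Combined loss for one token position (alpha and temperature are assumed values).

    hard: cross-entropy between the student distribution and the true token id.
    soft: KL(teacher || student) on temperature-softened distributions, times T^2.
    """
    hard = -log_softmax(student_logits)[label]
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    soft = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return alpha * hard + (1.0 - alpha) * (temperature ** 2) * soft

# Toy vocabulary of three tokens.
uniform = [0.0, 0.0, 0.0]
baseline = distillation_loss(uniform, uniform, label=0)           # soft term is zero
with_gap = distillation_loss(uniform, [5.0, 0.0, 0.0], label=0)   # teacher disagrees
```

When the student already matches the teacher, the soft term vanishes and only the weighted hard-label loss remains; any disagreement with the teacher increases the total.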

---

## Steps

1. **Subsampling**: Subsample the original CNN/DailyMail dataset so that training fits within the limited compute resources.
2. **Distillation**: Train a compact student model to mimic the output behavior of a larger teacher model using both hard labels and soft logits.
3. **Pruning**: Apply magnitude-based weight pruning manually to remove unimportant weights from the trained student model.
4. **TensorFlow Only**: All components—models, training, pruning, and saving—must use TensorFlow (no PyTorch).
5. **No Quantization**: Skip any quantization techniques in this notebook.
6. **Deployment Ready**: Save the optimized model using TensorFlow’s `SavedModel` format for later use in Docker and ECS environments.
7. **Bonus**: Experiment with subsample size, batch size and number of epochs.
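The magnitude-based pruning in step 3 can be sketched as follows. This is a framework-free illustration on a nested list, assuming unstructured pruning with a single threshold per weight matrix. The function name `magnitude_prune` and the `sparsity` parameter are hypothetical. In the notebook the same thresholding would be applied to each TensorFlow weight tensor of the student model and the result written back to the variable.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of a 2-D weight matrix with the smallest-magnitude entries zeroed.

    weights:  list of rows, each a list of floats.
    sparsity: fraction of entries to remove, between 0.0 and 1.0.
    Entries tied with the threshold are also zeroed, so slightly more than
    the requested fraction may be removed when magnitudes repeat.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]        # largest magnitude that gets pruned
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

# Toy 2x2 layer: the two smallest magnitudes (0.05 and 0.1) are removed.
layer = [[0.1, -0.5], [2.0, -0.05]]
pruned = magnitude_prune(layer, sparsity=0.5)
```

Zeroed weights only reduce the size on disk if the saved model is subsequently compressed or stored sparsely, so it is worth measuring the size of the exported model rather than assuming a reduction.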

---

## Assumptions

* The dataset is a pre-subsampled version of CNN/DailyMail (\~200 training and \~50 test examples, or fewer) and is available locally in JSON format.
* The selected models (`t5-base` and `t5-small`) are fully supported by Hugging Face's TensorFlow APIs.
* Training is conducted on a CPU-only machine (e.g., AWS `t2.large`), requiring careful management of memory and runtime.
* Only basic training (2 epochs) is required to demonstrate optimization strategies, not to reach state-of-the-art performance.
* Final output will be integrated into a containerized microservice, so model size and inference efficiency are critical.

---


## CNN/DailyMail Dataset Description

The **CNN/DailyMail** dataset is a large-scale benchmark for **abstractive text summarization**, widely used in natural language processing research. It consists of news articles from CNN and the Daily Mail, paired with human-written summaries, often referred to as *highlights*.

Each example in the dataset includes:

* **`article`**: The full body of a news story (approx. 300–800 words).
* **`highlights`**: A concise, human-written summary of the main points, usually a few short bullet-style sentences.
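As a small illustration of how these two fields map onto a sequence-to-sequence training pair, the sketch below builds the input and target strings for one record. It assumes each record is a dictionary with `article` and `highlights` keys; the helper name `to_t5_pair` is hypothetical. The `"summarize: "` prefix is the task prefix conventionally used with T5 checkpoints for summarization.

```python
def to_t5_pair(record, prefix="summarize: "):
    """Build (input_text, target_text) from one record.

    Assumes `record` is a dict with `article` and `highlights` string fields.
    """
    input_text = prefix + record["article"].strip()
    target_text = record["highlights"].strip()
    return input_text, target_text

# Toy record, not taken from the real dataset.
example = {
    "article": "  The city council approved the new budget on Tuesday.  ",
    "highlights": "Council approves budget.\n",
}
source, target = to_t5_pair(example)
```

The resulting strings would then be passed to the tokenizer, with the article truncated to a maximum input length that suits the available memory.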

### Dataset Versions

* The most commonly used version is **"3.0.0"**, which is non-anonymized: named entities appear as plain text rather than as placeholder identifiers.
* The dataset is split into:

  * `train`: \~287,000 examples
  * `validation`: \~13,000 examples
  * `test`: \~11,000 examples

### Use Case in This Project

In this project, we use a **subsampled version** of the dataset (e.g., 200 train, 50 test examples) to simulate summarization model training in a resource-constrained environment. The summaries serve as target outputs for fine-tuning a student model using knowledge distillation techniques.
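A reproducible subsample of this kind could be drawn along the following lines. This is a sketch under assumptions: the records are held in a Python list of dictionaries, the helper names `subsample` and `save_json` are hypothetical, and the seed value is arbitrary. Only the standard library is used.

```python
import json
import random

def subsample(records, n, seed=42):
    """Return up to `n` records drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)           # local generator; global random state is untouched
    return rng.sample(records, min(n, len(records)))

def save_json(records, path):
    """Write the records as a single JSON array, readable with json.load."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)

# Toy stand-in for the full dataset.
toy = [{"article": f"Article {i}", "highlights": f"Summary {i}"} for i in range(10)]
train_subset = subsample(toy, 3)
```

Fixing the seed keeps the subset identical across runs, which makes it easier to compare experiments that vary batch size or epoch count.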

---



In [1]:

# --- Imports ---
import json
import tensorflow as tf
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
import numpy as np

# --- Load and prepare data ---
with open("subsampled_cnn_dailymail/train_sample.json", "r", encoding="utf-8") as f:
    train_data = json.load(f)
with open("subsampled_cnn_dailymail/test_sample.json", "r", encoding="utf-8") as f:
    test_data = json.load(f)
