# Daily Mail Text Summarization using Transformers

## 1. Dataset Loading & Inspection

### 1.1 Load the CSV File

In [12]:
import pandas as pd

DATA_PATH = "/kaggle/input/daily-mail-summarization-dataset/article_highlights.csv"

df = pd.read_csv(DATA_PATH)

### 1.2 Inspect the Dataset

In [13]:
df.head()

,url,article,highlights
0,https://www.dailymail.co.uk/tvshowbiz/article-...,Beyoncé showcases her incredible figure in plu...,Beyoncé has shown off her flawless beauty in a...
1,https://www.dailymail.co.uk/tvshowbiz/article-...,Radio 1 listeners in shock as sex noises are p...,BBC Radio 1 listeners were left choking on the...
2,https://www.dailymail.co.uk/tvshowbiz/article-...,"TOWIE's Dan Edgar, 33, and Ella Rae Wise, 23, ...",Dan Edgar and Ella Rae Wise put on a loved-up ...
3,https://www.dailymail.co.uk/tvshowbiz/article-...,Bradley Cooper recalls 'crazy' pitch meeting a...,Bradley Cooper discussed the 'crazy' experienc...
4,https://www.dailymail.co.uk/tvshowbiz/article-...,Margaret Qualley and Beanie Feldstein stun in ...,Margaret Qualley and Beanie Feldstein were dre...


In [14]:
print("Total samples:", len(df))
df.isnull().sum()

Total samples: 8176


url            0
article       11
highlights     3
dtype: int64

### 1.3 Drop Invalid Rows

In [15]:
df = df.dropna(subset=["article", "highlights"])
df = df.reset_index(drop=True)

print("Samples after cleaning:", len(df))

Samples after cleaning: 8165
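
The counts above imply that the three rows missing `highlights` are also missing `article`: the null cells total 11 + 3 = 14, yet only 8176 - 8165 = 11 rows were dropped. The following minimal sketch uses a small made-up frame (not the real data) to show that `dropna(subset=...)` removes a row once if any of the listed columns is null.

```python
import pandas as pd

# Toy frame (hypothetical data): row 1 lacks both fields, row 2 lacks only the article.
toy = pd.DataFrame({
    "article": ["some text", None, None, "more text"],
    "highlights": ["a summary", None, "orphan summary", "another summary"],
})

# Null cells per column, as df.isnull().sum() reports them.
null_counts = toy[["article", "highlights"]].isnull().sum()

# A row is dropped when either column is null; it is never counted twice.
cleaned = toy.dropna(subset=["article", "highlights"]).reset_index(drop=True)

print(null_counts.to_dict())
print("rows dropped:", len(toy) - len(cleaned))
```

The number of dropped rows is therefore bounded below by the largest per-column null count and above by the sum of the null counts.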


### 1.4 Inspect Text Lengths

In [16]:
df["article_length"] = df["article"].apply(lambda x: len(str(x).split()))
df["summary_length"] = df["highlights"].apply(lambda x: len(str(x).split()))

df[["article_length", "summary_length"]].describe()

,article_length,summary_length
count,8165.0,8165.0
mean,44.832456,23.055726
std,10.618829,9.135632
min,28.0,5.0
25%,37.0,18.0
50%,44.0,22.0
75%,51.0,28.0
max,78.0,53.0
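
These lengths are whitespace-separated word counts, not token counts. Subword tokenizers usually emit more tokens than words, so the figures are best read as a rough lower bound when choosing truncation limits. A minimal sketch of the same counting logic on a made-up mini-corpus (the sample strings and the `word_count` helper are illustrative, not part of the notebook):

```python
import statistics

def word_count(text):
    """Count whitespace-separated words, as the lambda in the cell above does."""
    return len(str(text).split())

# Hypothetical mini-corpus standing in for df["article"].
samples = [
    "Beyoncé showcases her incredible figure",
    "Radio 1 listeners in shock",
    "  extra   spaces   are   ignored  ",
]
counts = [word_count(s) for s in samples]

# A few of the statistics that describe() reports.
summary = {
    "min": min(counts),
    "median": statistics.median(counts),
    "max": max(counts),
}
print(counts, summary)
```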


## 2. Minimal Text Cleaning & Preparation

### 2.1 Select Only Required Columns

In [17]:
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
df = df[["article", "highlights"]].copy()

### 2.2 Basic Text Normalization

In [18]:
import re

def clean_text(text):
    text = str(text)
    text = text.strip()
    text = re.sub(r"\s+", " ", text)
    return text

df["article"] = df["article"].apply(clean_text)
df["highlights"] = df["highlights"].apply(clean_text)

df.head()

,article,highlights
0,Beyoncé showcases her incredible figure in plu...,Beyoncé has shown off her flawless beauty in a...
1,Radio 1 listeners in shock as sex noises are p...,BBC Radio 1 listeners were left choking on the...
2,"TOWIE's Dan Edgar, 33, and Ella Rae Wise, 23, ...",Dan Edgar and Ella Rae Wise put on a loved-up ...
3,Bradley Cooper recalls 'crazy' pitch meeting a...,Bradley Cooper discussed the 'crazy' experienc...
4,Margaret Qualley and Beanie Feldstein stun in ...,Margaret Qualley and Beanie Feldstein were dre...
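
To make the effect of `clean_text` concrete, the sketch below re-declares the function and runs it on a made-up string. Note that `str(text)` turns a missing value into the literal string `"nan"` or `"None"`, which is why the null rows need to be dropped before this step.

```python
import re

def clean_text(text):
    # Same logic as the notebook cell above: cast, trim, collapse whitespace.
    text = str(text).strip()
    return re.sub(r"\s+", " ", text)

# Hypothetical input with leading/trailing spaces, newlines, and a tab.
messy = "  Apples and\n\npears\tas you've   never eaten them  "
print(repr(clean_text(messy)))

# Missing values are not skipped; they become ordinary strings.
print(repr(clean_text(float("nan"))))
```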


## 3. Convert to Hugging Face Dataset & Split

### 3.1 Convert Pandas DataFrame to Hugging Face Dataset

In [19]:
from datasets import Dataset

dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['article', 'highlights'],
    num_rows: 8165
})

### 3.2 Create Train / Validation Split

In [20]:
dataset = dataset.train_test_split(test_size=0.1, seed=42)

dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights'],
        num_rows: 7348
    })
    test: Dataset({
        features: ['article', 'highlights'],
        num_rows: 817
    })
})
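
The split sizes can be checked by hand. The sketch below assumes the test share is rounded up and the remainder goes to training, which is the scikit-learn convention and is consistent with the 7348 / 817 counts shown above. Treat the rounding rule as an assumption rather than a documented guarantee; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

# 8165 cleaned samples with test_size=0.1, as in the cell above.
n_train, n_test = split_sizes(8165, 0.1)
print(n_train, n_test)
```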

### 3.3 Rename the Test Split to Validation


In [21]:
from datasets import DatasetDict

dataset = DatasetDict({
    "train": dataset["train"],
    "validation": dataset["test"]
})

print("TRAIN SAMPLE")
print(dataset["train"][0]["article"])
print("\nTARGET SUMMARY")
print(dataset["train"][0]["highlights"])

TRAIN SAMPLE
French apple rose tart Apples and pears as you've never eaten them beforeApples and pears as you’ve never eaten them before, in a savoury tray bake, a stunning squash soup and this spectacular – and simple – tart

TARGET SUMMARY
Apples and pears as you’ve never eaten them before, in a savoury tray bake, a stunning squash soup and this spectacular – and simple – tart


## 4. Load Tokenizer & Tokenize the Dataset

### 4.1 Load the Tokenizer

In [22]:
from transformers import AutoTokenizer

model_checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


### 4.2 Tokenization Function

In [23]:
MAX_INPUT_LENGTH = 128
MAX_TARGET_LENGTH = 64


def tokenize_function(batch):
    model_inputs = tokenizer(
        batch["article"],
        max_length=MAX_INPUT_LENGTH,
        truncation=True,
        padding="max_length"
    )

    # Tokenize the targets via text_target; as_target_tokenizer() is deprecated
    labels = tokenizer(
        text_target=batch["highlights"],
        max_length=MAX_TARGET_LENGTH,
        truncation=True,
        padding="max_length"
    )

    # Replace padding ids with -100 so the loss ignores padded positions
    labels["input_ids"] = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in labels["input_ids"]
    ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
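
The list comprehension in `tokenize_function` swaps padding ids in the labels for -100, the index that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the loss. A minimal sketch on made-up token ids, assuming a pad id of 1 (the value `facebook/bart-base` is generally configured with; read `tokenizer.pad_token_id` rather than hard-coding it). The `mask_padding` helper is illustrative only.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_batch, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so the loss skips them."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in label]
        for label in label_batch
    ]

# Hypothetical ids: 0 = <s>, 2 = </s>, 1 = <pad> (assumed pad id).
labels = [[0, 8774, 2, 1, 1], [0, 31, 42, 7, 2]]
masked = mask_padding(labels, pad_token_id=1)
print(masked)
```

Only the trailing padding in the first sequence is masked; real tokens, including the end-of-sequence id, are left untouched.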


### 4.3 Apply Tokenization to the Dataset

In [24]:
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["article", "highlights"]
)

tokenized_datasets["train"][0].keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])
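
Because both tokenizer calls use `padding="max_length"`, every example should carry exactly `MAX_INPUT_LENGTH` input ids and `MAX_TARGET_LENGTH` labels. A small helper (hypothetical, not part of the notebook) can check this on any example dict, for instance `check_example(tokenized_datasets["train"][0])`. Here it runs on a hand-built stand-in so that it needs no model download.

```python
MAX_INPUT_LENGTH = 128
MAX_TARGET_LENGTH = 64

def check_example(example, max_in=MAX_INPUT_LENGTH, max_tgt=MAX_TARGET_LENGTH):
    """Validate the padded lengths of one tokenized example."""
    assert len(example["input_ids"]) == max_in
    assert len(example["attention_mask"]) == max_in
    assert len(example["labels"]) == max_tgt
    # Number of label positions that actually contribute to the loss.
    return sum(1 for tok in example["labels"] if tok != -100)

# Stand-in example (made-up ids): 3 real tokens, the rest padding or masked.
fake = {
    "input_ids": [0, 9, 2] + [1] * 125,
    "attention_mask": [1, 1, 1] + [0] * 125,
    "labels": [0, 9, 2] + [-100] * 61,
}
print(check_example(fake))
```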