# Daily Mail Text Summarization using Transformers

## 1. Dataset Loading & Inspection

### 1.1 Load the CSV File

In [1]:
import pandas as pd

DATA_PATH = "/kaggle/input/daily-mail-summarization-dataset/article_highlights.csv"

df = pd.read_csv(DATA_PATH)

### 1.2 Inspect the Dataset

In [4]:
df.head()

| | url | article | highlights |
|---|---|---|---|
| 0 | https://www.dailymail.co.uk/tvshowbiz/article-... | Beyoncé showcases her incredible figure in plu... | Beyoncé has shown off her flawless beauty in a... |
| 1 | https://www.dailymail.co.uk/tvshowbiz/article-... | Radio 1 listeners in shock as sex noises are p... | BBC Radio 1 listeners were left choking on the... |
| 2 | https://www.dailymail.co.uk/tvshowbiz/article-... | TOWIE's Dan Edgar, 33, and Ella Rae Wise, 23, ... | Dan Edgar and Ella Rae Wise put on a loved-up ... |
| 3 | https://www.dailymail.co.uk/tvshowbiz/article-... | Bradley Cooper recalls 'crazy' pitch meeting a... | Bradley Cooper discussed the 'crazy' experienc... |
| 4 | https://www.dailymail.co.uk/tvshowbiz/article-... | Margaret Qualley and Beanie Feldstein stun in ... | Margaret Qualley and Beanie Feldstein were dre... |


In [5]:
print("Total samples:", len(df))
df.isnull().sum()

Total samples: 8176


url            0
article       11
highlights     3
dtype: int64

### 1.3 Drop Invalid Rows

In [6]:
df = df.dropna(subset=["article", "highlights"])
df = df.reset_index(drop=True)

print("Samples after cleaning:", len(df))

Samples after cleaning: 8165
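
The null counts above add up to 14 empty cells (11 in `article`, 3 in `highlights`), yet only 11 rows were removed (8176 to 8165). This is consistent with the rows missing `highlights` also missing `article`, because `dropna(subset=...)` removes a row once no matter how many of the listed columns are empty. A minimal sketch of that behaviour on a toy frame (the values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy data, invented for illustration:
# row 1 is missing both fields, row 3 is missing only the article.
toy = pd.DataFrame({
    "article": ["a1", None, "a3", None],
    "highlights": ["h1", None, "h3", "h4"],
})

cols = ["article", "highlights"]

# Total number of empty cells across both columns.
null_cells = int(toy[cols].isnull().sum().sum())

# Number of rows with at least one empty cell; this is what dropna removes.
rows_with_null = int(toy[cols].isnull().any(axis=1).sum())

cleaned = toy.dropna(subset=cols).reset_index(drop=True)
rows_dropped = len(toy) - len(cleaned)

print("Empty cells:", null_cells)
print("Rows with any empty cell:", rows_with_null)
print("Rows dropped:", rows_dropped)
```

Comparing `rows_with_null` against the per-column counts is a quick way to confirm that the number of rows lost matches expectations before moving on.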


### 1.4 Inspect Text Lengths

In [7]:
# Lengths are word counts (whitespace-separated pieces), not characters or subword tokens.
df["article_length"] = df["article"].apply(lambda x: len(str(x).split()))
df["summary_length"] = df["highlights"].apply(lambda x: len(str(x).split()))

df[["article_length", "summary_length"]].describe()

| | article_length | summary_length |
|---|---|---|
| count | 8165.0 | 8165.0 |
| mean | 44.832456 | 23.055726 |
| std | 10.618829 | 9.135632 |
| min | 28.0 | 5.0 |
| 25% | 37.0 | 18.0 |
| 50% | 44.0 | 22.0 |
| 75% | 51.0 | 28.0 |
| max | 78.0 | 53.0 |
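
These statistics count whitespace-separated words, whereas a Transformer tokenizer operates on subword tokens, so token counts will usually be somewhat higher. As a rough sketch of how such length statistics could inform truncation limits, the hypothetical helper below picks a cap from a column of lengths. The name `suggest_max_length`, the default 95th-percentile cutoff, and the rounding to a multiple of 8 are all assumptions made for illustration; none of them come from the notebook.

```python
import math

import pandas as pd


def suggest_max_length(lengths: pd.Series, quantile: float = 0.95, multiple: int = 8) -> int:
    """Return the chosen quantile of `lengths`, rounded up to a multiple of `multiple`.

    Hypothetical helper: the quantile and the rounding step are illustrative
    choices, not values used by the original notebook.
    """
    cutoff = float(lengths.quantile(quantile))
    return int(math.ceil(cutoff / multiple) * multiple)


# Invented toy lengths, used only to exercise the helper.
toy_lengths = pd.Series([28, 37, 44, 51, 78])

median_cap = suggest_max_length(toy_lengths, quantile=0.5)
full_cap = suggest_max_length(toy_lengths, quantile=1.0)
print(median_cap, full_cap)
```

In the notebook, something like `suggest_max_length(df["article_length"])` would have to run before the length columns are dropped in section 2.1, and the result would be better computed on tokenizer output than on word counts.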


## 2. Minimal Text Cleaning & Preparation

### 2.1 Select Only Required Columns

In [10]:
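# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning.
df = df[["article", "highlights"]].copy()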

### 2.2 Basic Text Normalization

In [12]:
import re

def clean_text(text):
    """Trim the text and collapse every run of whitespace into a single space."""
    text = str(text)
    text = text.strip()
    text = re.sub(r"\s+", " ", text)
    return text

df["article"] = df["article"].apply(clean_text)
df["highlights"] = df["highlights"].apply(clean_text)

df.head()

| | article | highlights |
|---|---|---|
| 0 | Beyoncé showcases her incredible figure in plu... | Beyoncé has shown off her flawless beauty in a... |
| 1 | Radio 1 listeners in shock as sex noises are p... | BBC Radio 1 listeners were left choking on the... |
| 2 | TOWIE's Dan Edgar, 33, and Ella Rae Wise, 23, ... | Dan Edgar and Ella Rae Wise put on a loved-up ... |
| 3 | Bradley Cooper recalls 'crazy' pitch meeting a... | Bradley Cooper discussed the 'crazy' experienc... |
| 4 | Margaret Qualley and Beanie Feldstein stun in ... | Margaret Qualley and Beanie Feldstein were dre... |
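
Because the truncated preview looks the same before and after cleaning, a small check on invented strings makes the effect of `clean_text` easier to see. The snippet repeats the function from the cell above so that it runs on its own; the sample inputs are made up for illustration.

```python
import re


def clean_text(text):
    # Same logic as the notebook cell: cast to str, trim, collapse whitespace.
    text = str(text)
    text = text.strip()
    text = re.sub(r"\s+", " ", text)
    return text


# Invented inputs covering padding, newlines, and mixed tabs/spaces.
samples = [
    "  Leading and trailing  ",
    "line one\nline two",
    "tabs\t\tand   spaces",
]
cleaned_samples = [clean_text(s) for s in samples]

# A missing value cast with str() becomes the literal string "nan",
# which is why rows with nulls need to be dropped before this step.
nan_result = clean_text(float("nan"))

for before, after in zip(samples, cleaned_samples):
    print(repr(before), "->", repr(after))
print(repr(nan_result))
```

The function leaves casing and punctuation alone, so it only touches whitespace.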
