**Extractive Text Summarization – Method Overview**
This notebook implements extractive summarization: the system selects the most important sentences directly from the original text and does not generate any new sentences.

**SentenceTransformer Embeddings**
After testing multiple models, `all-MiniLM-L6-v2` provided the best performance, so it is used for all sentence embeddings.
Each sentence is scored by the cosine similarity between its embedding and the mean embedding of the whole article, which serves as the measure of sentence importance.
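
To make the scoring idea concrete, here is a minimal sketch of similarity-to-centroid scoring using plain NumPy. The 3-dimensional vectors are made up for illustration and are not model output, and `centroid_scores` is a hypothetical helper name, not something defined in this notebook.

```python
import numpy as np

def centroid_scores(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity of each row to the mean of all rows."""
    centroid = embeddings.mean(axis=0)
    # Dividing the dot product by both norms gives cosine similarity.
    row_norms = np.linalg.norm(embeddings, axis=1)
    centroid_norm = np.linalg.norm(centroid)
    return embeddings @ centroid / (row_norms * centroid_norm)

# Toy "sentence embeddings" (assumed values, for illustration only).
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
scores = centroid_scores(toy)
print(scores.round(3))
```

The third vector points away from the other two, so it ends up furthest from the centroid and receives the lowest score. Sentences that resemble the bulk of the article are favoured in the same way.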


In [76]:
import pandas as pd

**For evaluation, I use only the validation split, because the training set contains over 100k samples. For extractive summarization, the validation set provides more than enough data to assess performance reliably without unnecessary computation.**

# Preprocessing

In [77]:
df=pd.read_csv('/kaggle/input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/validation.csv')

In [78]:
df.shape

(13368, 3)

In [79]:
df.head()

Unnamed: 0,id,article,highlights
0,61df4979ac5fcc2b71be46ed6fe5a46ce7f071c3,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th..."
1,21c0bd69b7e7df285c3d1b1cf56d4da925980a68,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...
2,56f340189cd128194b2e7cb8c26bb900e3a848b4,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago..."
3,00a665151b89a53e5a08a389df8334f4106494c2,Avid rugby fan Prince Harry could barely watch...,Prince Harry in attendance for England's crunc...
4,9f6fbd3c497c4d28879bebebea220884f03eb41a,A Triple M Radio producer has been inundated w...,Nick Slater's colleagues uploaded a picture to...


In [80]:
df.drop(columns=['id'], inplace=True)

In [81]:
df.head()

Unnamed: 0,article,highlights
0,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th..."
1,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...
2,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago..."
3,Avid rugby fan Prince Harry could barely watch...,Prince Harry in attendance for England's crunc...
4,A Triple M Radio producer has been inundated w...,Nick Slater's colleagues uploaded a picture to...


In [82]:
df['article'][0]

"Sally Forrest, an actress-dancer who graced the silver screen throughout the '40s and '50s in MGM musicals and films such as the 1956 noir While the City Sleeps died on March 15 at her home in Beverly Hills, California. Forrest, whose birth name was Katherine Feeney, was 86 and had long battled cancer. Her publicist, Judith Goffin, announced the news Thursday. Scroll down for video . Actress: Sally Forrest was in the 1951 Ida Lupino-directed film 'Hard, Fast and Beautiful' (left) and the 1956 Fritz Lang movie 'While the City Sleeps' A San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films including the critical and commercial success Not Wanted, Never Fear and Hard, Fast and Beautiful. Some of Forrest's other film credits included Bannerline, Son of Sinbad, and Excuse My Dust, according to her iMDB\xa0page. The page also indicates Forrest was in multiple Climax! and Rawhide television episodes. Forrest appeared as hersel

In [83]:
df['highlights'][0]

"Sally Forrest, an actress-dancer who graced the silver screen throughout the '40s and '50s in MGM musicals and films died on March 15 .\nForrest, whose birth name was Katherine Feeney, had long battled cancer .\nA San Diego native, Forrest became a protege of Hollywood trailblazer Ida Lupino, who cast her in starring roles in films ."

In [84]:
df.isnull().sum()

article       0
highlights    0
dtype: int64

In [85]:
df['article'].duplicated().sum()

0

# Building the summarization function

In [86]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import nltk, re
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [87]:
import re
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util

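def clean_text(text):
    """Lowercase the text and strip URLs, HTML tags, and non-ASCII/special characters."""
    text = text.lower()
    # Remove URLs (http\S+ also covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Drop non-ASCII characters
    text = text.encode("ascii", "ignore").decode()
    # Replace everything except letters, digits, and basic punctuation with a space
    text = re.sub(r"[^a-z0-9.,!? \n]", " ", text)
    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text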

def sent_token(text):
    return sent_tokenize(text)


model = SentenceTransformer('all-MiniLM-L6-v2')

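# Main function
def extractive(article, top_n=4):
    """Return the top_n sentences most similar to the article's mean embedding."""
    text = clean_text(article)
    sentences = sent_token(text)

    # Short articles: nothing to select, so return every sentence
    if len(sentences) <= top_n:
        return " ".join(sentences)

    # One embedding per sentence
    sentence_embeddings = model.encode(sentences)

    # Document representation: mean of the sentence embeddings
    doc_embedding = sentence_embeddings.mean(axis=0)

    # Cosine similarity of each sentence to the document embedding
    scores = util.cos_sim(sentence_embeddings, doc_embedding).squeeze().tolist()

    # Keep the top_n highest-scoring sentences, restored to article order
    top_ids = np.argsort(scores)[-top_n:]
    top_ids = sorted(top_ids)

    summary = " ".join([sentences[i] for i in top_ids])
    return summary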
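
The selection step at the end of `extractive` can be looked at in isolation. The sketch below uses made-up scores and a hypothetical helper name, `pick_top_n`, to show how `np.argsort` picks the highest-scoring sentences and how sorting the indices afterwards puts them back in article order.

```python
import numpy as np

def pick_top_n(scores, top_n):
    """Return the indices of the top_n highest scores, in original order."""
    ranked = np.argsort(scores)   # indices sorted by ascending score
    best = ranked[-top_n:]        # the top_n largest scores
    return sorted(int(i) for i in best)

# Made-up scores for five sentences (illustration only).
toy_scores = [0.20, 0.90, 0.50, 0.70, 0.10]
print(pick_top_n(toy_scores, top_n=3))  # [1, 2, 3]
```

Restoring the original order matters for readability: the summary then follows the flow of the article instead of listing sentences by score.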


In [88]:
df['extracted_summary']=df['article'].apply(extractive)

In [89]:
df.head()

Unnamed: 0,article,highlights,extracted_summary
0,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th...","sally forrest, an actress dancer who graced th..."
1,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...,natural canvas artist and teacher wang lian ha...
2,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago...","iftekhar murtaza, 30, was sentenced for the mu..."
3,Avid rugby fan Prince Harry could barely watch...,Prince Harry in attendance for England's crunc...,avid rugby fan prince harry could barely watch...
4,A Triple M Radio producer has been inundated w...,Nick Slater's colleagues uploaded a picture to...,"triple m producer nick slater, c , pictured wi..."
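
The extracted summaries above are lowercase and have lost hyphens and apostrophes (for example "actress dancer"). That is a side effect of cleaning the text before sentence splitting. The sketch below repeats the notebook's cleaning steps on a made-up sample string so the effect can be seen on its own; the sample sentence is an assumption for illustration.

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text, repeated so this runs standalone.
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = text.encode("ascii", "ignore").decode()
    text = re.sub(r"[^a-z0-9.,!? \n]", " ", text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = "Sally Forrest, an actress-dancer. It's true!"
cleaned = clean_text(sample)
print(cleaned)  # sally forrest, an actress dancer. it s true!
```

Because ROUGE compares tokens, changes such as "it's" becoming "it s" can affect the overlap with the reference highlights, which keep their original punctuation.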


# Evaluation


In [90]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1','rouge2','rougeL'], use_stemmer=True)

r1, r2, rl = [], [], []

n_eval = min(500, len(df))

for i in range(n_eval):
    scores = scorer.score(df['highlights'].iloc[i], df['extracted_summary'].iloc[i])
    r1.append(scores['rouge1'].fmeasure)
    r2.append(scores['rouge2'].fmeasure)
    rl.append(scores['rougeL'].fmeasure)
print(f"ROUGE-1  →  {np.mean(r1):.4f}")
print(f"ROUGE-2  →  {np.mean(r2):.4f}")
print(f"ROUGE-L  →  {np.mean(rl):.4f}")


ROUGE-1  →  0.3327
ROUGE-2  →  0.1286
ROUGE-L  →  0.2095
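
To give a feel for what these numbers measure, here is a simplified sketch of a ROUGE-1 F-measure computed by hand. It assumes plain whitespace tokenization and no stemming, so it will not reproduce the `rouge_score` values exactly; `rouge1_f` is a hypothetical helper written for illustration only.

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F-measure (simplified: whitespace tokens, no stemming)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

ref = "the cat sat on the mat"
cand = "the cat lay on a mat"
print(round(rouge1_f(ref, cand), 3))  # 0.667
```

Four of the six candidate words match the reference, so precision and recall are both 4/6 and the F-measure is about 0.667. ROUGE-2 applies the same idea to word pairs, and ROUGE-L uses the longest common subsequence.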


Decent performance for a simple embedding-based extractive approach, measured on the first 500 validation articles.


# Testing

In [91]:
# Let's test on the first article of the test split
test=pd.read_csv('/kaggle/input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/test.csv')['article'][0]
test


"Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk. They say that the shrinking space on aeroplanes is not only uncomfortable - it's putting our health and safety in danger. More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger? This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans. 'In a world where animals have more rights to space and food than humans,' said Charlie Leocha, consumer representative on the committee.\xa0'It is time that the DOT and FAA take a stand for humane treatment of passengers.' But could crowding on planes lead to more serious issues than fighting for sp

In [93]:
extractive(test, top_n=5)

'ever noticed how plane seats appear to be getting smaller and smaller? they say that the shrinking space on aeroplanes is not only uncomfortable it s putting our health and safety in danger. more than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger? but these tests are conducted using planes with 31 inches between each row of seats, a standard which on some airlines has decreased, reported the detroit news. while most airlines stick to a pitch of 31 inches or above, some fall below this.'