# 11. Text Summarization
Text summarization is a core task in Natural Language Processing (NLP): condensing a long text into a shorter, coherent summary that retains the essential information. It is used in applications such as news aggregation, digests of long documents, and information retrieval, where it helps readers grasp the main points without reading the full text.

### What You'll Learn:
- Extractive vs Abstractive summarization
- Extractive algorithms (TF-IDF, TextRank, frequency-based)
- Evaluation metrics (ROUGE)
- Practical implementation
- Real-world applications

## Two Types of Summarization

### Extractive Summarization
- Selects the most important sentences from the original text
- Faster, because no text generation is required
- Preserves the original wording
- Use case: News articles

### Abstractive Summarization
- Generates new sentences that paraphrase the source
- Slower, but the result usually reads more naturally
- Requires a language generation model
- Use case: Professional reports

## Extractive Summarization Methods

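1. **TF-IDF based**: Score each sentence by the TF-IDF weights of its words and keep the highest-scoring sentences
2. **TextRank**: Build a graph whose nodes are sentences and whose edges are weighted by sentence similarity, then rank the nodes with a PageRank-style algorithm
3. **Frequency-based**: Score each sentence by how often its words occur in the whole text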
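
The frequency-based method can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `frequency_summary`, the tiny `STOP_WORDS` set, and the regex-based sentence splitter are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "it", "from", "on", "with"}


def frequency_summary(text, n_sentences=2):
    """Return the n_sentences highest-scoring sentences, in document order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def tokenize(sentence):
        return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

    # Word frequencies over the whole text
    freq = Counter(w for s in sentences for w in tokenize(s))

    # Sentence score = sum of the frequencies of its words
    scores = [sum(freq[w] for w in tokenize(s)) for s in sentences]

    # Take the top-scoring indices, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


sample = ("Cats sleep a lot. Cats and dogs are popular pets. "
          "The weather was mild. Dogs need daily walks.")
print(frequency_summary(sample, n_sentences=2))
```

Because the score is a plain sum, longer sentences tend to win; dividing by sentence length is one common way to reduce that bias.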

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

print('='*60)
print('EXTRACTIVE SUMMARIZATION WITH TF-IDF')
print('='*60)

text = '''Machine learning is a subset of artificial intelligence. 
It enables computers to learn from data without explicit programming. 
Deep learning is a subset of machine learning. 
It uses neural networks with multiple layers. 
Natural language processing is an important field. 
It focuses on interaction between computers and human language.'''

# Split into sentences: collapse whitespace first so the newlines in `text`
# do not leak into the sentences, then drop the final period
sentences = [s.rstrip('.') for s in ' '.join(text.split()).split('. ') if s]
print('\nOriginal text:')
print(text)
print(f'\nTotal sentences: {len(sentences)}\n')

# Build TF-IDF features; max_features=10 keeps only the 10 most frequent terms
vectorizer = TfidfVectorizer(max_features=10, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)

# Score each sentence by the sum of its TF-IDF weights
sentence_scores = np.array(tfidf_matrix.sum(axis=1)).flatten()

# Pick the 2 highest-scoring sentences and restore document order
top_indices = sentence_scores.argsort()[-2:][::-1]
top_sentences = [sentences[i] for i in sorted(top_indices)]

print('Summary (top 2 sentences):')
for sent in top_sentences:
    print(f'  - {sent}.')

EXTRACTIVE SUMMARIZATION WITH TF-IDF

Original text:
Machine learning is a subset of artificial intelligence. 
It enables computers to learn from data without explicit programming. 
Deep learning is a subset of machine learning. 
It uses neural networks with multiple layers. 
Natural language processing is an important field. 
It focuses on interaction between computers and human language.

Total sentences: 6

Summary (top 2 sentences):
  - Machine learning is a subset of artificial intelligence.
  - It enables computers to learn from data without explicit programming.
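
TextRank, the graph-based method listed among the extractive approaches, can be sketched without any libraries. This is a rough illustration under stated assumptions: the function name `textrank` is hypothetical, similarity is word overlap normalised by log sentence lengths (with `+ 1` smoothing added here to avoid division by zero), no stop words are removed, and a sentence with no neighbours spreads its score uniformly.

```python
import math
import re


def textrank(sentences, damping=0.85, n_iter=50):
    """Return one PageRank-style importance score per sentence."""
    n = len(sentences)
    if n == 0:
        return []

    # Each sentence becomes a set of lower-cased words
    words = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]

    def similarity(a, b):
        # Shared words, normalised by log lengths (+1 smoothing is an assumption)
        denom = math.log(len(a) + 1) + math.log(len(b) + 1)
        return len(a & b) / denom if denom > 0 else 0.0

    # Weighted graph: weights[i][j] is the edge weight between sentences i and j
    weights = [[similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sum = [sum(row) for row in weights]

    # Power iteration: repeatedly pass score along the weighted edges
    scores = [1.0 / n] * n
    for _ in range(n_iter):
        new_scores = []
        for j in range(n):
            inflow = 0.0
            for i in range(n):
                if out_sum[i] > 0:
                    inflow += scores[i] * weights[i][j] / out_sum[i]
                else:
                    inflow += scores[i] / n  # isolated sentence: spread uniformly
            new_scores.append((1 - damping) / n + damping * inflow)
        scores = new_scores
    return scores


sentences = [
    "Machine learning is a subset of artificial intelligence",
    "It enables computers to learn from data without explicit programming",
    "Deep learning is a subset of machine learning",
    "It uses neural networks with multiple layers",
    "Natural language processing is an important field",
    "It focuses on interaction between computers and human language",
]
scores = textrank(sentences)

# Keep the two best-ranked sentences, printed in document order
best = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:2])
for i in best:
    print(f"  - {sentences[i]}.")
```

Unlike the TF-IDF sum, a sentence ranks highly here only if it is similar to other sentences that are themselves well connected, so an off-topic sentence ends up with a low score.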


## Abstractive Summarization with Transformers

In [2]:
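print('\nABSTRACTIVE SUMMARIZATION WITH TRANSFORMERS')

try:
    # Import inside the try block so a missing library also reaches the fallback below
    from transformers import pipeline

    # Loads the pre-trained model; the weights are downloaded on first use
    summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
    
    text = 'Machine learning enables computers to learn from data. ' \
           'Deep learning uses neural networks. ' \
           'Transformers revolutionized NLP. ' \
           'BERT is a famous transformer model. ' \
           'These models achieve state-of-the-art results.'
    
    # max_length and min_length are measured in tokens;
    # do_sample=False disables sampling, so the output is deterministic
    summary = summarizer(text, max_length=50, min_length=20, do_sample=False)
    print(f'\nOriginal: {text}')
    print(f'\nSummary: {summary[0]["summary_text"]}')
    
except Exception:
    # Fallback: describe the process when the model cannot be loaded or run
    print('(Requires transformers library and internet connection)')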
    print('Process:')
    print('1. Load pre-trained summarization model')
    print('2. Tokenize input text')
    print('3. Generate summary tokens')
    print('4. Decode to get summary')



ABSTRACTIVE SUMMARIZATION WITH TRANSFORMERS



(Requires transformers library and internet connection)
Process:
1. Load pre-trained summarization model
2. Tokenize input text
3. Generate summary tokens
4. Decode to get summary
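
The four steps in the fallback message can also be written out by hand instead of going through `pipeline`. The sketch below assumes the `transformers` library with a PyTorch backend is installed and that the model can be downloaded; the function name `summarize_stepwise` and the choice of `num_beams=4` are illustrative assumptions, not part of the original cell.

```python
def summarize_stepwise(text, model_name="facebook/bart-large-cnn",
                       max_length=50, min_length=20):
    """Run the four summarization steps explicitly and return the summary text."""
    # Imported lazily so that defining the function does not require the library
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # 1. Load the pre-trained tokenizer and model (downloaded on first use)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # 2. Tokenize: text -> tensors of token ids, truncated to the model's input limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    # 3. Generate summary token ids (beam search, no sampling)
    summary_ids = model.generate(
        **inputs,
        max_length=max_length,
        min_length=min_length,
        num_beams=4,
        do_sample=False,
    )

    # 4. Decode the ids back into a string, dropping special tokens
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling `summarize_stepwise(text)` with the same `text` as in the cell above should behave much like the pipeline call, since the pipeline performs these same steps internally.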


## Evaluation Metrics

### ROUGE Score
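ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares a generated summary with a reference summary by counting the units they share:
- ROUGE-1: Unigram (single word) overlap
- ROUGE-2: Bigram (word pair) overlap
- ROUGE-L: Longest common subsequence (LCS)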

In [3]:
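from collections import Counter

print('\nEVALUATION EXAMPLE:')

reference = 'machine learning is powerful'
generated = 'machine learning is great'

ref_words = reference.split()
gen_words = generated.split()

# ROUGE-1 style unigram overlap with clipped counts,
# so a repeated word is not counted more often than it appears in both texts
matching = sum((Counter(ref_words) & Counter(gen_words)).values())
precision = matching / len(gen_words)  # share of generated words found in the reference
recall = matching / len(ref_words)     # share of reference words recovered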

print(f'Reference: {reference}')
print(f'Generated: {generated}')
print(f'\nPrecision: {precision:.2%}')
print(f'Recall: {recall:.2%}')


EVALUATION EXAMPLE:
Reference: machine learning is powerful
Generated: machine learning is great

Precision: 75.00%
Recall: 75.00%
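
The same idea extends to ROUGE-2 and ROUGE-L. The following is a simplified sketch, assuming whitespace tokenization, a single reference, and a plain F1 score; the helper names `ngrams`, `rouge_n`, `lcs_length`, and `rouge_l` are illustrative, and established ROUGE packages add stemming and other details that are omitted here.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1_score(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def rouge_n(reference, generated, n=1):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    ref = Counter(ngrams(reference.split(), n))
    gen = Counter(ngrams(generated.split(), n))
    overlap = sum((ref & gen).values())
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall, f1_score(precision, recall)


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(reference, generated):
    """Return (precision, recall, f1) based on the LCS of the two word sequences."""
    ref, gen = reference.split(), generated.split()
    lcs = lcs_length(ref, gen)
    precision = lcs / max(len(gen), 1)
    recall = lcs / max(len(ref), 1)
    return precision, recall, f1_score(precision, recall)


reference = "machine learning is powerful"
generated = "machine learning is great"

for name, (p, r, f) in [
    ("ROUGE-1", rouge_n(reference, generated, 1)),
    ("ROUGE-2", rouge_n(reference, generated, 2)),
    ("ROUGE-L", rouge_l(reference, generated)),
]:
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For this pair, ROUGE-1 and ROUGE-L both come out at 0.75, while ROUGE-2 drops to 2/3 because only two of the three bigrams match.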
