Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language, enabling machines to understand, interpret, and generate human language.

### Types of Text Summarization Techniques

- **Abstractive Summarization**: Generates a summary that captures the main ideas by rephrasing the original content, often using deep learning models. Example techniques include:
  - Sequence-to-sequence models with attention mechanisms.
  - Generative Transformer models such as BART, T5, and GPT (BERT, being encoder-only, is more commonly used on the extractive side).
- **Extractive Summarization**: Selects key sentences or phrases from the original text to create a summary, preserving the original wording.





### Extractive Summarization Methods

1. **TF-IDF (Term Frequency-Inverse Document Frequency)**:
   - Weighs the importance of words in a document relative to a corpus, allowing the selection of sentences that contain the most important terms.

2. **TextRank**:
   - A graph-based ranking algorithm that identifies important sentences based on their connections within the text, similar to the PageRank algorithm used by Google.

3. **LexRank**:
   - Another graph-based method that identifies sentence importance by measuring the centrality of sentences in a graph, where sentences are nodes and edges represent similarities.

4. **BERT (Bidirectional Encoder Representations from Transformers)**:
   - Utilizes the pre-trained BERT model to generate embeddings for sentences, which can then be clustered or ranked based on similarity to create summaries.

5. **LSA (Latent Semantic Analysis)**:
   - Applies singular value decomposition (SVD) to a term-sentence matrix to uncover latent topics in word usage, then extracts the sentences that best represent those topics.

6. **LDA (Latent Dirichlet Allocation)**:
   - A topic modeling technique that can also be used for extractive summarization by identifying the most representative sentences for each topic in a document.

7. **SumBasic**:
   - A frequency-based extractive method that scores each sentence by the average document probability of its words, then down-weights words already covered by the summary so that later selections add new content.

8. **Centroid-based Summarization**:
   - Identifies the centroid of a set of sentences (i.e., the average representation) and selects sentences closest to this centroid as the summary.

9. **MMR (Maximal Marginal Relevance)**:
   - A method that combines relevance and diversity by selecting sentences that maximize relevance while minimizing redundancy with the already selected sentences.

10. **Other Graph-Based Methods**:
   - Techniques like **HITS (Hyperlink-Induced Topic Search)** can also be adapted for extractive summarization by assessing the importance of sentences based on their connectivity.
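
To make the graph-based idea behind TextRank and LexRank concrete, here is a minimal sketch using only the standard library. It assumes a naive regex sentence splitter and a word-overlap similarity; the function names (`split_sentences`, `overlap_similarity`, `textrank_sketch`) are hypothetical and not part of any library, and production implementations typically add stopword handling and better tokenization.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (an assumption for this sketch); NLTK's sent_tokenize is more robust.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def overlap_similarity(a, b):
    # Shared-word count, normalised by the log of each sentence's vocabulary size.
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_sketch(text, max_sentences=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    # Weighted similarity graph: sentences are nodes, no self-loops.
    weights = [[overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    # PageRank-style power iteration over the weighted graph.
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(weights[j][i] / out_sums[j] * scores[j]
                       for j in range(n) if out_sums[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:max_sentences]
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the rest of the text receives only the baseline score `(1 - damping) / n`, so it is the last to be selected.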

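The relevance-versus-redundancy trade-off in MMR can be sketched as a greedy loop. This is an illustrative version under stated assumptions: sentences are compared with bag-of-words cosine similarity, relevance is measured against the whole document, and the helper names (`bow`, `cosine`, `mmr_select`) and the `lambda_` parameter are hypothetical choices for this example.

```python
import math
import re
from collections import Counter


def bow(sentence):
    # Bag-of-words vector as a word -> count mapping.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_select(sentences, max_sentences=2, lambda_=0.7):
    vectors = [bow(s) for s in sentences]
    doc_vector = sum(vectors, Counter())
    # Relevance: similarity of each sentence to the document as a whole.
    relevance = [cosine(v, doc_vector) for v in vectors]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < max_sentences:
        def mmr_score(i):
            # Redundancy: highest similarity to any sentence already selected.
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    # Sentences are returned in selection order.
    return [sentences[i] for i in selected]
```

With `lambda_=1.0` the loop ignores redundancy and ranks purely by relevance; lower values penalise near-duplicate sentences more heavily.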

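SumBasic's select-then-down-weight loop is short enough to sketch directly. The version below is a simplified variant, not a reference implementation: it picks the sentence with the highest average word probability at each step (the published algorithm adds a constraint involving the single most probable word) and does no stopword removal. The function name `sumbasic_sketch` is hypothetical.

```python
import re
from collections import Counter


def sumbasic_sketch(sentences, max_sentences=2):
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words)
    total = sum(counts.values())
    # Unigram probability of each word in the document.
    prob = {w: c / total for w, c in counts.items()}
    remaining = [i for i, words in enumerate(tokenized) if words]
    chosen = []
    while remaining and len(chosen) < max_sentences:
        # Score each sentence by the average probability of its words.
        best = max(remaining,
                   key=lambda i: sum(prob[w] for w in tokenized[i]) / len(tokenized[i]))
        chosen.append(best)
        remaining.remove(best)
        # Square the probability of covered words so repeats become less attractive.
        for w in set(tokenized[best]):
            prob[w] **= 2
    # Return the selected sentences in document order.
    return [sentences[i] for i in sorted(chosen)]
```

Squaring a probability always shrinks it, so a sentence that mostly repeats already-covered words drops in the ranking after each selection.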


In [None]:
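!pip install nltk
!pip install numpy scikit-learn
!pip install sumy
!pip install sentence-transformers
# gensim.summarization was removed in Gensim 4.0, so pin an older release (installed once, not twice).
# Restart the kernel after installing; this pin may fail to build on recent Python versions.
!pip install gensim==3.8.3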

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)
Building wheels for collected packages: breadability, docopt
  Building wheel for breadability (setup.py) ... done
  Created wheel for breadability: filename=brea

In [None]:

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead of 'punkt'
from nltk.tokenize import sent_tokenize
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sentence_transformers import SentenceTransformer, util

from transformers import pipeline


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
  from tqdm.autonotebook import tqdm, trange


In [None]:

try:
    from gensim.summarization.summarizer import summarize as gensim_summarize
except ImportError:
    print("gensim.summarization not found. Please make sure Gensim 3.8.3 or lower is correctly installed and the kernel is restarted.")


gensim.summarization not found. Please make sure Gensim 3.8.3 or lower is correctly installed and the kernel is restarted.


In [None]:

text = """
Artificial intelligence (AI) is transforming industries by automating tasks, enhancing decision-making processes, and improving efficiency.
AI technologies, such as machine learning and natural language processing, are being integrated into various sectors like healthcare, finance, and manufacturing.
In healthcare, AI is helping doctors to diagnose diseases faster and more accurately, while in finance, AI algorithms are optimizing trading and risk management.
The ability of AI to analyze large amounts of data and provide actionable insights is making it indispensable in today’s data-driven world.

Despite the benefits, there are also concerns about AI’s impact on jobs and society.
Automation could lead to job displacement in certain sectors, particularly those that involve repetitive tasks.
Moreover, ethical issues such as data privacy, algorithmic bias, and transparency remain challenges that need to be addressed.
As AI continues to evolve, it is crucial to ensure that its development is guided by ethical considerations, and that regulations are put in place to mitigate its potential negative effects on society.
"""

In [None]:


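# K-Means clustering summarization
def kmeans_summarization(text, max_sentences=5):
    sentences = sent_tokenize(text)
    # Represent each sentence as a dense TF-IDF vector.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences).toarray()
    n_clusters = min(max_sentences, len(sentences))
    # Fix the seed so repeated runs produce the same clusters.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(tfidf_matrix)
    closest_sentences = []
    for i in range(n_clusters):
        centroid = kmeans.cluster_centers_[i]
        # Pick the sentence nearest (Euclidean distance) to this cluster centroid.
        closest_sentence_idx = np.argmin(np.linalg.norm(tfidf_matrix - centroid, axis=1))
        closest_sentences.append(sentences[closest_sentence_idx])

    # Sentences are joined in cluster order, not document order.
    return ' '.join(closest_sentences)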

# KL-Sum summarization
def klsum_summarization(text, max_sentences=5):
    # sumy ships its own KL-Sum implementation, so no Luhn substitute is needed.
    from sumy.summarizers.kl import KLSummarizer
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = KLSummarizer()
    summary = summarizer(parser.document, max_sentences)

    return ' '.join(str(sentence) for sentence in summary)

# Luhn Summarizer
def luhn_summarization(text, max_sentences=5):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LuhnSummarizer()
    summary = summarizer(parser.document, max_sentences)

    return ' '.join(str(sentence) for sentence in summary)

# BERT-style summarizer (sentence embeddings from a MiniLM Sentence-Transformers model)
def bert_summarization(text, max_sentences=5):
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    sentences = sent_tokenize(text)
    embeddings = model.encode(sentences, convert_to_tensor=True)

    # Rank sentences by cosine similarity to the mean (centroid) embedding
    mean_embedding = embeddings.mean(dim=0)
    # Index row 0 instead of squeeze() so a single-sentence input still yields a 1-D array
    similarities = util.pytorch_cos_sim(mean_embedding, embeddings)[0].cpu().numpy()

    # Highest-scoring sentences first (score order, not document order)
    top_sentence_indices = similarities.argsort()[-max_sentences:][::-1]
    top_sentences = [sentences[i] for i in top_sentence_indices]

    return ' '.join(top_sentences)

#LexRank Summarizer
def lexrank_summarization(text, max_sentences=5):
    from sumy.summarizers.lex_rank import LexRankSummarizer
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summary = summarizer(parser.document, max_sentences)
    return ' '.join(str(sentence) for sentence in summary)

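# Transformer summarizer (abstractive, using the Hugging Face default summarization model)
def transformer_summarization(text, max_sentences=5):
    from transformers import pipeline
    summarizer = pipeline("summarization")
    # max_length and min_length count tokens, so max_sentences acts only as a rough length budget.
    summary = summarizer(text, max_length=max_sentences*20, min_length=max_sentences*5, do_sample=False)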
    return summary[0]['summary_text']
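
The K-Means and embedding-based functions above join their sentences in cluster or score order, and two clusters can share the same nearest sentence. A small helper can restore document order and drop duplicates before joining. This is a sketch only; `in_document_order` is a hypothetical name, and using it would mean collecting sentence indices rather than sentence strings inside those functions.

```python
def in_document_order(sentences, selected_indices):
    # De-duplicate the indices and sort them by position in the source text.
    unique_sorted = sorted(set(int(i) for i in selected_indices))
    return ' '.join(sentences[i].strip() for i in unique_sorted)


# Example: indices chosen in score order, with one duplicate.
demo_sentences = ["First point.", "Second point.", "Third point."]
print(in_document_order(demo_sentences, [2, 0, 2]))
```

The `int(i)` conversion lets the helper accept NumPy integer indices such as those returned by `np.argmin` or `argsort`.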





In [None]:

def display_summary(text, technique, max_sentences=5):
    # Map each technique name to its summarization function.
    summarizers = {
        "K-Means": kmeans_summarization,
        "KL-Sum": klsum_summarization,
        "Luhn": luhn_summarization,
        "BERT": bert_summarization,
        "LexRank": lexrank_summarization,
        "Transformer": transformer_summarization,
    }
    if technique not in summarizers:
        print("Invalid technique selected.")
        return None
    summary = summarizers[technique](text, max_sentences)

    original_words = len(text.split())
    summarized_words = len(summary.split())
    print(f"\n{technique} Summary ({summarized_words}/{original_words} words):\n{summary}")
    return summary

In [None]:
def choose_summarization_technique():
    print("Choose a summarization technique:")
    print("1. K-Means Clustering")
    print("2. KL-Sum")
    print("3. Luhn Summarizer")
    print("4. BERT Summarizer")
    print("5. LexRank Summarizer")
    print("6. Transformer Summarizer")

    choice = input("Enter the number of your chosen technique: ")

    # Map menu numbers to technique names.
    techniques = {
        "1": "K-Means",
        "2": "KL-Sum",
        "3": "Luhn",
        "4": "BERT",
        "5": "LexRank",
        "6": "Transformer",
    }
    technique = techniques.get(choice)
    if technique is None:
        print("Invalid choice. Please choose a valid technique.")
    return technique


def summarization_pipeline(text):
    # Call choose_summarization_technique to get the user's choice
    technique = choose_summarization_technique()
    if technique:
        display_summary(text, technique)


summarization_pipeline(text)

Choose a summarization technique:
1. K-Means Clustering
2. KL-Sum
3. Luhn Summarizer
4. BERT Summarizer
5. LexRank Summarizer
6. Transformer Summarizer
Enter the number of your chosen technique: 1

K-Means Summary (106/162 words):
Automation could lead to job displacement in certain sectors, particularly those that involve repetitive tasks. As AI continues to evolve, it is crucial to ensure that its development is guided by ethical considerations, and that regulations are put in place to mitigate its potential negative effects on society. 
Artificial intelligence (AI) is transforming industries by automating tasks, enhancing decision-making processes, and improving efficiency. AI technologies, such as machine learning and natural language processing, are being integrated into various sectors like healthcare, finance, and manufacturing. The ability of AI to analyze large amounts of data and provide actionable insights is making it indispensable in today’s data-driven world.
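
The run above kept 106 of 162 words, roughly 65% of the original. To compare techniques side by side without the interactive prompt, a small loop can record each summary and its compression ratio. This is a sketch under assumptions: `compare_summarizers` and the `first_sentences` baseline are hypothetical helpers, and the baseline exists only so the example runs without extra libraries.

```python
def compare_summarizers(text, summarizers, max_sentences=3):
    # summarizers maps a display name to a function(text, max_sentences) -> str.
    results = {}
    original_words = len(text.split())
    for name, summarize in summarizers.items():
        summary = summarize(text, max_sentences)
        ratio = len(summary.split()) / original_words if original_words else 0.0
        results[name] = (summary, ratio)
    return results


def first_sentences(text, max_sentences):
    # Hypothetical lead baseline: keep the first few period-delimited sentences.
    parts = [s.strip() for s in text.split('.') if s.strip()]
    return '. '.join(parts[:max_sentences]) + '.'


demo = "One two three. Four five. Six seven eight nine."
for name, (summary, ratio) in compare_summarizers(demo, {"Lead": first_sentences}, 1).items():
    print(f"{name}: {ratio:.0%} of original words kept -> {summary}")
```

In the notebook, the same function could be called with the real summarizers, for example `compare_summarizers(text, {"Luhn": luhn_summarization, "LexRank": lexrank_summarization})`.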
