<a href="https://colab.research.google.com/github/usmanyousaaf/NLP/blob/master/Text_Summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Import libraries

In [7]:
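import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

# The stop word list is needed by the cells below; fetch it quietly
nltk.download('stopwords', quiet=True)
# Punkt models back sent_tokenize and word_tokenize
nltk.download('punkt')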

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
def read_article(text):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    # Use a set for fast membership tests
    stop_words = set(stopwords.words('english'))
    clean_sentences = []
    for sent in sentences:
        # Lowercase first so capitalized stop words ("The", "It") are removed too
        words = [word.lower() for word in nltk.word_tokenize(sent)]
        words = [word for word in words if word not in stop_words]
        clean_sentences.append(words)
    return clean_sentences

In [9]:
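_word_vectors = None

def load_word_vectors(path="glove.6B.100d.txt"):
    # Parse the GloVe file once and reuse it; re-reading it per sentence pair is very slow
    global _word_vectors
    if _word_vectors is None:
        _word_vectors = {}
        with open(path, encoding='utf-8') as f:
            for line in f:
                values = line.split()
                _word_vectors[values[0]] = np.asarray(values[1:], dtype='float32')
    return _word_vectors

def sentence_vector(sent, word_vectors):
    # Average the word vectors; unknown words contribute a zero vector
    if not sent:
        return np.zeros((100,))
    return np.mean([word_vectors.get(word, np.zeros((100,))) for word in sent], axis=0)

def sentence_similarity(sent1, sent2):
    word_vectors = load_word_vectors()
    sent1_vector = sentence_vector(sent1, word_vectors)
    sent2_vector = sentence_vector(sent2, word_vectors)
    # Cosine similarity is undefined for an all-zero vector, so report no similarity
    if not np.any(sent1_vector) or not np.any(sent2_vector):
        return 0.0
    return 1 - cosine_distance(sent1_vector, sent2_vector)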
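
The averaging-then-cosine step can be checked by hand on a toy example. The sketch below uses only the standard library and a made-up 3-dimensional embedding table (`toy_vectors`, `mean_vector`, and `cosine_similarity` are illustrative names, not part of the notebook or of GloVe), so it runs without downloading `glove.6B.100d.txt`.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}

def mean_vector(words, vectors, dim=3):
    # Average the vectors of the words; unknown words count as zero vectors
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(vectors.get(word, [0.0] * dim)):
            total[k] += value
    return [value / len(words) for value in total]

def cosine_similarity(u, v):
    # Return 0.0 when either vector has zero length, where cosine is undefined
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

sim_cat_dog = cosine_similarity(mean_vector(["cat"], toy_vectors),
                                mean_vector(["dog"], toy_vectors))
sim_cat_car = cosine_similarity(mean_vector(["cat"], toy_vectors),
                                mean_vector(["car"], toy_vectors))
print(sim_cat_dog, sim_cat_car)
```

Vectors pointing in similar directions score close to 1, and orthogonal ones score 0, which is the signal the similarity matrix is built from.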

In [10]:
def build_similarity_matrix(sentences):
    # Build a similarity matrix between sentences using sentence_similarity()
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                similarity_matrix[i][j] = sentence_similarity(sentences[i], sentences[j])
    return similarity_matrix


In [11]:
def generate_summary(text, num_sentences):
    # Keep the original sentences for the output; the cleaned word lists are only used for scoring
    original_sentences = sent_tokenize(text)
    sentences = read_article(text)
    # Build a similarity matrix between sentences
    similarity_matrix = build_similarity_matrix(sentences)
    # Use PageRank algorithm to rank sentences by importance
    nx_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(nx_graph)
    # Sort sentence indices by score and extract the top N original sentences as summary
    ranked = sorted(range(len(original_sentences)), key=lambda i: scores[i], reverse=True)
    summary = [original_sentences[i] for i in ranked[:num_sentences]]
    return " ".join(summary)
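
To make the ranking step less of a black box, here is a minimal sketch of weighted PageRank by power iteration on a small similarity matrix. It is an illustration of the idea only, not how `networkx` implements `nx.pagerank`; the function name `pagerank_scores`, the damping factor of 0.85, the fixed iteration count, and the toy matrix are all assumptions made for this example.

```python
def pagerank_scores(matrix, damping=0.85, iterations=100):
    # matrix[i][j] is the (non-negative) similarity between sentence i and sentence j
    n = len(matrix)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in matrix]
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                if out_weight[i] > 0:
                    # Sentence i passes its score on in proportion to edge weight
                    rank += scores[i] * matrix[i][j] / out_weight[i]
                else:
                    # A sentence with no similar neighbours spreads its score evenly
                    rank += scores[i] / n
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

# Toy similarity matrix: sentence 0 is similar to both others
toy_matrix = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank_scores(toy_matrix)
ranking = sorted(range(len(toy_scores)), key=lambda i: toy_scores[i], reverse=True)
print(ranking)
```

The sentence that is strongly connected to the rest of the text ends up with the highest score, which is why it is picked first for the summary.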

In [14]:
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from heapq import nlargest

# Frequency-based alternative; this redefines generate_summary from the PageRank-based cell above
def generate_summary(text, n):
    """
    Generates a summary of the given text using NLTK's tokenization and stopword removal.
    
    Parameters:
    text (str): The text to be summarized
    n (int): The number of sentences to include in the summary
    
    Returns:
    summary (str): The summary of the text
    """
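    stop_words = set(stopwords.words('english'))
    # Lowercase every token so the table keys match the lowercased lookup used for scoring
    words = [word.lower() for word in word_tokenize(text)]
    # Removing stop words (punctuation tokens are kept and counted like words)
    words = [word for word in words if word not in stop_words]
    # Creating frequency table
    freq_table = dict()
    for word in words:
        if word in freq_table:
            freq_table[word] += 1
        else:
            freq_table[word] = 1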
    # Creating sentence scores
    sentences = sent_tokenize(text)
    sentence_scores = dict()
    for sentence in sentences:
        for word in word_tokenize(sentence.lower()):
            if word in freq_table:
                if len(sentence.split(" ")) < 30:
                    if sentence not in sentence_scores:
                        sentence_scores[sentence] = freq_table[word]
                    else:
                        sentence_scores[sentence] += freq_table[word]
    # Selecting top n sentences
    summary_sentences = nlargest(n, sentence_scores, key=sentence_scores.get)
    # Combining summary sentences into a paragraph
    summary = ' '.join(summary_sentences)
    return summary


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
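
The frequency-scoring idea can be traced on a three-sentence input. The sketch below is a simplified stand-in that uses only the standard library: the regex tokenizer, the tiny `STOP_WORDS` set, and the name `frequency_summary` are assumptions for illustration. Unlike the cell above, it ignores punctuation and has no 30-word limit, so its scores will not match the notebook's exactly.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop word list; NLTK's English list is much longer
STOP_WORDS = {"the", "is", "a", "of", "and", "it", "on"}

def tokenize(text):
    # Lowercase and keep alphabetic tokens only
    return re.findall(r"[a-z']+", text.lower())

def frequency_summary(text, n):
    # Split on whitespace that follows sentence-ending punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count how often each non-stop word occurs in the whole text
    freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    # A sentence's score is the sum of the frequencies of its non-stop words
    scores = {
        s: sum(freq[w] for w in tokenize(s) if w not in STOP_WORDS)
        for s in sentences
    }
    return nlargest(n, scores, key=scores.get)

demo = "Data cleaning matters. Clean data helps models. The weather is nice."
print(frequency_summary(demo, 1))
```

Here "data" occurs twice, so the two sentences containing it outscore the third, and the longer of the two wins because it contains more counted words.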


In [23]:
text1="Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poorquality measurements), it will make it harder for the system to detect the underlyingpatterns, so your system is less likely to perform well. It is often well worth the effortto spend time cleaning up your training data. The truth is, most data scientists spenda significant part of their time doing just that. For example If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute alto‐gether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on."
summary = generate_summary(text1, 2)

print("Length of text:", len(text1))
print("Length of summary:", len(summary))
print("\nOriginal Text:\n", text1)
print("\nSummary:\n", summary)


Length of text: 846
Length of summary: 168

Original Text:
 Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poorquality measurements), it will make it harder for the system to detect the underlyingpatterns, so your system is less likely to perform well. It is often well worth the effortto spend time cleaning up your training data. The truth is, most data scientists spenda significant part of their time doing just that. For example If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute alto‐gether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on.

Summary:
 The truth is, most data scientists spenda significant part of their time doing jus

In [22]:
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

summary = generate_summary(text, 1)

text_word_count = len(text.split())
summary_word_count = len(summary.split())

print("Word count of text:", text_word_count)
print("Word count of summary:", summary_word_count)
print("\nOriginal Text:\n", text)
print("\nSummary:\n", summary)


Word count of text: 69
Word count of summary: 19

Original Text:
 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Summary:
 Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
