Text Summarization

In [None]:
import nltk
from nltk.corpus import movie_reviews
import random
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Download the movie_reviews corpus used as sample data
nltk.download('movie_reviews')

# Load movie reviews dataset
documents = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]

random.shuffle(documents)

def preprocess(text):
    """Split text into sentences and return (original, cleaned) sentence lists."""
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    sentences = sent_tokenize(text)
    preprocessed_sentences = []

    for sentence in sentences:
        words = word_tokenize(sentence.lower())
        # Drop stop words and tokens with no letters or digits (e.g. '.', '--', '``'),
        # then lemmatize what remains
        words = [lemmatizer.lemmatize(word) for word in words
                 if word not in stop_words and any(ch.isalnum() for ch in word)]
        preprocessed_sentences.append(' '.join(words))

    return sentences, preprocessed_sentences

def extract_summary(sentences, scores, num_sentences=3):
    """Join the top-scoring sentences, kept in their original document order."""
    top_indices = np.argsort(scores)[-num_sentences:]
    return ' '.join(sentences[i] for i in sorted(top_indices))

# Preprocess the first document in the dataset for demonstration
text = documents[0]
original_sentences, preprocessed_sentences = preprocess(text)

# Build one TF-IDF vector per preprocessed sentence
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_sentences)

# Convert the sparse matrix to a dense NumPy array
tfidf_matrix = tfidf_matrix.toarray()

# Score each sentence by its cosine similarity to the document centroid (mean TF-IDF vector)
centroid = tfidf_matrix.mean(axis=0).reshape(1, -1)
sentence_scores = cosine_similarity(tfidf_matrix, centroid).flatten()
print(sentence_scores)

summary = extract_summary(original_sentences, sentence_scores, num_sentences=3)
print("Summary:")
print(summary)




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


[0.301198   0.28207196 0.3149342  0.26607807 0.26826333 0.29040319
 0.29900693 0.27446921 0.28322579 0.31803876 0.2604603  0.29688908
 0.32477218 0.31391293]
Summary:
the question is what could have gone wrong with a potentially great idea with big name cast ? connery as sir august , does not fair better than thurman or fiennes . for one thing , you will not have to witness a product that is far inferior to the three high profile names that is associated with the title .
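
The scoring step above ranks each sentence by its cosine similarity to the mean TF-IDF vector of the document. The sketch below reproduces that idea on a toy input using only the standard library, so the arithmetic is visible. It is a simplified stand-in, not the scikit-learn implementation: it assumes whitespace tokenization, raw term counts, smoothed IDF and L2-normalised rows, and the names `tfidf_rows`, `centroid_scores`, `top_sentences` and `toy` are hypothetical helpers introduced here for illustration.

```python
import math
from collections import Counter


def tfidf_rows(sentences):
    """Return one L2-normalised TF-IDF row per sentence (whitespace tokens)."""
    tokenized = [s.split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(tokenized)
    # Document frequency: number of sentences that contain each word
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return rows


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid_scores(sentences):
    """Score each sentence by similarity to the mean TF-IDF vector."""
    rows = tfidf_rows(sentences)
    centroid = [sum(col) / len(rows) for col in zip(*rows)]
    return [cosine(row, centroid) for row in rows]


def top_sentences(sentences, scores, k):
    """Pick the k best-scoring sentences and return them in document order."""
    best = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


toy = [
    "the movie plot is weak",
    "the cast is strong but the plot is weak",
    "popcorn was salty",
]
scores = centroid_scores(toy)
print(scores)
print(top_sentences(toy, scores, 2))
```

The third toy sentence shares no words with the other two, so it sits farthest from the centroid and is left out of the two-sentence summary.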


In [None]:
documents

["this is the worst movie i've viewed so far in 98 . \nthe avengers = silly = man dressed in a bowler hat + woman wearing tight leathers > evil scientists dressed in teddy bear suits + greater evil , sir august de wynter wearing kilt . \nthe question is what could have gone wrong with a potentially great idea with big name cast ? \nthe same question was probably asked of last year's stinker batman and robin . \ni feel the production got a little too smug , the script a little to smart and direction was somehow lost in the chaos of random events that collided together to form a movie . \nmy greatest criticism rests on the fact that there was no chemistry between emma peel and john steed ( thurman and fiennes ) ? something that was a vital element of the 60's tv serial of the same name . \nthe dialogue goes on and on about tea and other finer british perks , but does not allow much room for character development and interaction , except to perhaps grate on the viewer's nerves . \none won

In [None]:
!pip install transformers
from transformers import pipeline

# Build the pipeline once. With no model argument, transformers falls back to
# its default summarization checkpoint (reported in the log output below).
summarizer = pipeline("summarization")

def abstractive_summary(text):
    """Return an abstractive summary of `text`."""
    # truncation=True clips inputs longer than the model's maximum input length
    result = summarizer(text, max_length=100, min_length=30,
                        do_sample=False, truncation=True)
    return result[0]['summary_text']

# Summarize the same review that was used for the extractive summary
abstractive_summary_text = abstractive_summary(documents[0])
print("Abstractive summary")
print(abstractive_summary_text)




No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Abstractive summary
 This is the worst movie i've viewed so far in 98 . The production got a little too smug, the script a little to smart and direction was somehow lost in the chaos of random events .
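
Summarization models accept only a bounded number of input tokens, so a long review may need to be split before it is summarized. The sketch below groups sentences into chunks under a word budget, and each chunk could then be passed to `abstractive_summary` separately. It is only an illustration: `chunk_by_words` and the `max_words=400` default are hypothetical, and a word count is a rough proxy for the model's real token count.

```python
def chunk_by_words(sentences, max_words=400):
    """Group consecutive sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = ["a b c", "d e", "f g h i", "j"]
print(chunk_by_words(demo, max_words=5))
```

In this notebook the sentence list could come from `sent_tokenize(documents[0])`. The per-chunk summaries would then be concatenated or summarized once more.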
