# Document summarization

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from heapq import nlargest

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
def summarize(text, n):
    """Return an extractive summary made of the n highest-scoring sentences."""
    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())

    # Keep alphabetic tokens that are not English stop words.
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.isalpha() and word not in stop_words]

    # Count how often each remaining word occurs in the whole text.
    word_freq = nltk.FreqDist(words)

    # Score each sentence by summing the frequencies of the words it contains.
    scores = {}
    for i, sentence in enumerate(sentences):
        for word in word_tokenize(sentence.lower()):
            if word in word_freq:
                scores[i] = scores.get(i, 0) + word_freq[word]

    # Pick the n best sentences, then restore their original order.
    top_sentences = nlargest(n, scores, key=scores.get)
    return ' '.join(sentences[i] for i in sorted(top_sentences))
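
To see how the frequency-based scoring in `summarize` behaves, here is a dependency-free sketch on a three-sentence toy text. The `toy_summarize` name, the tiny stop-word set, and the regex-based splitting are illustrative assumptions that stand in for NLTK's tokenizers and stop-word list; they are not part of the notebook.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word set (assumption); NLTK's English list is much larger.
STOP_WORDS = {'the', 'a', 'is', 'of', 'and', 'in', 'on', 'it'}


def toy_summarize(text, n):
    """Hypothetical stand-in for summarize() that uses only the standard library."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())

    # Word frequencies over the whole text, ignoring stop words.
    words = [w for w in re.findall(r'[a-z]+', text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)

    # A sentence's score is the sum of the frequencies of its words
    # (stop words are absent from freq, so they contribute 0).
    scores = {
        i: sum(freq[w] for w in re.findall(r'[a-z]+', s.lower()))
        for i, s in enumerate(sentences)
    }

    # Keep the n best sentences in their original order.
    top = nlargest(n, scores, key=scores.get)
    return ' '.join(sentences[i] for i in sorted(top))


toy_text = 'Cats sleep a lot. Cats chase mice. The weather is mild.'
toy_result = toy_summarize(toy_text, 2)
print(toy_result)  # Cats sleep a lot. Cats chase mice.
```

Because a score is a plain sum, sentences that are longer or that repeat the most frequent words tend to rank higher, which is worth keeping in mind when reading the summaries below.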



In [6]:

text = '''
Morgan & Morgan is an American law firm. Founded in 1988 by John Morgan, it is headquartered in Orlando, Florida. While Morgan & Morgan was historically considered a firm focused on personal injury, medical malpractice and class action lawsuits, it also expanded practices to other areas of legal services. The firm has offices in 49 U.S. states and Washington, D.C.
The law firm was established in 1988 by John Morgan and his partners Stewart Colling and Ron Gilbert.[4] From the start, the company has been headquartered in Orlando, Florida.
In 1989, the law firm began advertising on television and radio.[5] In 2005, Morgan bought out his partners' share of the company and renamed the firm "Morgan & Morgan", also adding his wife Ultima as partner.[6] As Orlando Sentinel quotes, "John Morgan and his partners had a fundamental difference over growth and expansion of the law firm throughout Florida at the time".[7]
By early 2000s, the firm expanded throughout Florida with 420 employees, and by 2013 the company had 260 attorneys among 1,800 staffers in Florida, Georgia, Mississippi, Kentucky and Manhattan.[4][5]
In January 2011, Charlie Crist joined the Tampa office of Morgan & Morgan after expressing an interest in returning to the legal field during his final week in office as governor of Florida. Crist worked primarily in the firm's class-action sector as a complex-litigation attorney, serving as a "rainmaker" for the firm.[8] In November 2016, after almost six years with the firm, he was elected to represent Florida's 13th congressional district.[9] In February 2018, Brad Slager of Sunshine State News cited evidence that Morgan & Morgan was "attempting to purge all evidence" of its relationship with Crist now that he was a "rookie congressman" with "little-to-no power".[10][relevant?]
In 2018, the firm received over two million phone calls and signed up 500 new cases each day. That year, the firm collected $1.5 billion in settlements and spent $130 million nationwide on advertising. John Morgan was one of the first lawyers to advertise in phone books and television commercials.[6]
In 2021, Morgan fired half of his firm's marketing department. The staffing purge came in the wake of a controversial Morgan & Morgan national advertising campaign, "Size Matters," which was meant to convey the large scale of the firm, but was criticized as an inappropriate dick joke. The staffers who were fired had criticized the ad campaign's phallic implications.[11][12]
As of 2022, the law firm had over 3,000 employees, including 800 lawyers in 49 states.
'''
summary = summarize(text, 3)
print(summary)

While Morgan & Morgan was historically considered a firm focused on personal injury, medical malpractice and class action lawsuits, it also expanded practices to other areas of legal services. [5] In 2005, Morgan bought out his partners' share of the company and renamed the firm "Morgan & Morgan", also adding his wife Ultima as partner. The staffing purge came in the wake of a controversial Morgan & Morgan national advertising campaign, "Size Matters," which was meant to convey the large scale of the firm, but was criticized as an inappropriate dick joke.
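
The extractive summary above keeps a stray "[5]" at the start of its second sentence, because the source text carries Wikipedia-style citation markers. One possible cleanup, sketched here with a hypothetical `strip_citations` helper that is not part of the original notebook, is to remove bracketed markers with a regular expression before calling `summarize`. The pattern assumes markers are either numeric (`[4]`) or lowercase editorial tags ending in a question mark (`[relevant?]`).

```python
import re

# Assumed marker shapes: [4], [10], or lowercase tags such as [relevant?].
CITATION_RE = re.compile(r'\[(?:\d+|[a-z ]+\?)\]')


def strip_citations(text):
    """Return text with bracketed citation markers removed (hypothetical helper)."""
    return CITATION_RE.sub('', text)


sample = ('In 1989, the law firm began advertising on television and radio.[5] '
          'He was a "rookie congressman" with "little-to-no power".[10][relevant?]')
cleaned = strip_citations(sample)
print(cleaned)
```

With the markers gone, `summarize(strip_citations(text), 3)` would score and join sentences without the leftover brackets; the selected sentences may differ slightly, since sentence boundaries can shift.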


In [7]:
len(text)

2579

In [8]:
len(summary)

561
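
The two character counts are easier to interpret as a ratio. A minimal sketch, assuming a hypothetical `compression_ratio` helper and reusing the lengths reported above (2579 and 561 characters):

```python
def compression_ratio(original, summary):
    """Fraction of the original's characters kept by the summary (hypothetical helper)."""
    if not original:
        raise ValueError('original text must be non-empty')
    return len(summary) / len(original)


# Placeholder strings with the same lengths as the notebook's text and summary.
ratio = compression_ratio('x' * 2579, 'y' * 561)
print(f'{ratio:.1%}')  # 21.8%
```

In the notebook itself the call would be `compression_ratio(text, summary)`.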

In [9]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


In [10]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Load the tokenizer and model
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')


In [11]:
text = '''
The law firm was established in 1988 by John Morgan and his partners Stewart Colling and Ron Gilbert.[4] From the start, the company has been headquartered in Orlando, Florida.
In 1989, the law firm began advertising on television and radio.[5] In 2005, Morgan bought out his partners' share of the company and renamed the firm "Morgan & Morgan", also adding his wife Ultima as partner.[6] As Orlando Sentinel quotes, "John Morgan and his partners had a fundamental difference over growth and expansion of the law firm throughout Florida at the time".[7]
By early 2000s, the firm expanded throughout Florida with 420 employees, and by 2013 the company had 260 attorneys among 1,800 staffers in Florida, Georgia, Mississippi, Kentucky and Manhattan.[4][5]
In January 2011, Charlie Crist joined the Tampa office of Morgan & Morgan after expressing an interest in returning to the legal field during his final week in office as governor of Florida. Crist worked primarily in the firm's class-action sector as a complex-litigation attorney, serving as a "rainmaker" for the firm.[8] In November 2016, after almost six years with the firm, he was elected to represent Florida's 13th congressional district.[9] In February 2018, Brad Slager of Sunshine State News cited evidence that Morgan & Morgan was "attempting to purge all evidence" of its relationship with Crist now that he was a "rookie congressman" with "little-to-no power".[10][relevant?]
In 2018, the firm received over two million phone calls and signed up 500 new cases each day. That year, the firm collected $1.5 billion in settlements and spent $130 million nationwide on advertising. John Morgan was one of the first lawyers to advertise in phone books and television commercials.[6]
In 2021, Morgan fired half of his firm's marketing department. The staffing purge came in the wake of a controversial Morgan & Morgan national advertising campaign, "Size Matters," which was meant to convey the large scale of the firm, but was criticized as an inappropriate dick joke. The staffers who were fired had criticized the ad campaign's phallic implications.[11][12]
As of 2022, the law firm had over 3,000 employees, including 800 lawyers in 49 states.
'''

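# Tokenize the input; this BART checkpoint accepts at most 1024 tokens, so truncate longer texts
inputs = tokenizer.encode(text, return_tensors='pt', truncation=True, max_length=1024)

# Generate the summary with beam search, then decode the token IDs back to text
summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)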

print(summary)

The law firm was established in 1988 by John Morgan and his partners Stewart Colling and Ron Gilbert. In 2005, Morgan bought out his partners' share of the company and renamed the firm "Morgan & Morgan" In January 2011, Charlie Crist joined the Tampa office of Morgan & Morgan after expressing an interest in returning to the legal field. Crist worked primarily in the firm's class-action sector as a complex-litigation attorney. In November 2016, after almost six years with the firm, he was elected to represent Florida's 13th congressional district.


In [12]:
print("Length of the text:", len(text))
print("Length of the summary:", len(summary))

Length of the text: 2212
Length of the summary: 552
