## Text Summarization With N-Grams
### By RUTUJA SHINDE 


## **Web Scraping Data**

In [1]:
import bs4 as bs                                                           # BeautifulSoup
import urllib.request
import re

In [2]:
def _scrape_webpage(url):
    # download the raw HTML of the page
    scraped_textdata = urllib.request.urlopen(url)
    textdata = scraped_textdata.read()
    # parse the HTML and collect every <p> element
    parsed_textdata = bs.BeautifulSoup(textdata,'lxml')
    paragraphs = parsed_textdata.find_all('p')
    formatted_text = ""

    # concatenate the text of all paragraphs
    for para in paragraphs:
        formatted_text += para.text

    return formatted_text

In [3]:
mytext = _scrape_webpage('https://en.wikipedia.org/wiki/Natural_language_processing')

In [4]:
#print(mytext)
len(mytext)

8602

## **Tokenizing the Words**

In [5]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(mytext)
print("Number of words after word tokenizing: ", len(tokens))                 #tokenizing the words
print(tokens[:50])

print(len(mytext))

Number of words after word tokenizing:  1466
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data']
8602


## **Sentence Tokenization**

In [10]:
from nltk.tokenize import sent_tokenize
sent_tokens = sent_tokenize(mytext)
print(sent_tokens)

['Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.', 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.', 'The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.', 'In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence[clarification needed].', 'The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.', 'The authors claimed that within three or five years, machine translation wo

## **Punctuation Removal**

In [11]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
regexp_tokens = tokenizer.tokenize(mytext.lower())                                   #punctuation removal
print("Number of words after word tokenizing with removing punctuation: ", len(regexp_tokens))
print(regexp_tokens[0:50])

print(len(mytext))
print(len(regexp_tokens))

Number of words after word tokenizing with removing punctuation:  1295
['natural', 'language', 'processing', 'nlp', 'is', 'a', 'subfield', 'of', 'linguistics', 'computer', 'science', 'information', 'engineering', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', 'challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'speech']
8602
1295
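To make the effect of the `\w+` pattern easier to see, here is a minimal standard-library sketch. It assumes that matching `\w+` with `re.findall` is a fair stand-in for what `RegexpTokenizer(r'\w+')` does in its default mode; the sample sentence is made up for illustration and is not taken from the scraped page.

```python
import re

# Hypothetical sample text, chosen to show how contractions are handled.
sample = "Natural language processing (NLP) isn't magic, it's statistics!"

# Keep only runs of word characters (letters, digits, underscore).
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
```

Punctuation disappears, but contractions are split at the apostrophe (`isn't` becomes `isn` and `t`), which is worth keeping in mind when counting n-grams later.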


## **Convert to Lower case**

In [12]:
# lower-case the text before tokenizing (reuses the tokenizer from the previous cell)
regexp_tokens = tokenizer.tokenize(mytext.lower())
print(regexp_tokens)

['natural', 'language', 'processing', 'nlp', 'is', 'a', 'subfield', 'of', 'linguistics', 'computer', 'science', 'information', 'engineering', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', 'challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'speech', 'recognition', 'natural', 'language', 'understanding', 'and', 'natural', 'language', 'generation', 'the', 'history', 'of', 'natural', 'language', 'processing', 'nlp', 'generally', 'started', 'in', 'the', '1950s', 'although', 'work', 'can', 'be', 'found', 'from', 'earlier', 'periods', 'in', '1950', 'alan', 'turing', 'published', 'an', 'article', 'titled', 'computing', 'machinery', 'and', 'intelligence', 'which', 'proposed', 'what', 'is', 'now', 'called', 'the', 'turing

In [13]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rutuj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
import nltk
nltk.download('stopwords')    

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rutuj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## **Stopword Removal**

In [15]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))                                                # load the stopword list once
stopwords_tokens = [token for token in regexp_tokens if token not in stop_words]
print("# of words without stop words: ", len(stopwords_tokens))
print(stopwords_tokens[0:50])                                                               #stopword removal
print(len(mytext))

# of words without stop words:  824
['natural', 'language', 'processing', 'nlp', 'subfield', 'linguistics', 'computer', 'science', 'information', 'engineering', 'artificial', 'intelligence', 'concerned', 'interactions', 'computers', 'human', 'natural', 'languages', 'particular', 'program', 'computers', 'process', 'analyze', 'large', 'amounts', 'natural', 'language', 'data', 'challenges', 'natural', 'language', 'processing', 'frequently', 'involve', 'speech', 'recognition', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'history', 'natural', 'language', 'processing', 'nlp', 'generally', 'started', '1950s']
8602


## **Digit Removal**

In [16]:
a = re.sub( r'\d+', '', mytext)
#print(a)
#print(len(mytext))                               # digit removal
print(len(a))

8499
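The pattern `\d+` removes digits wherever they occur, including inside tokens, which is why `1950s` shows up as a lone `s` in the unigram output. The sketch below contrasts it with a word-boundary variant that removes only standalone numbers. The variant is an alternative offered for comparison, not what the notebook uses, and the sample string is invented.

```python
import re

# Hypothetical sample containing a standalone number and a number fused to a letter.
sample = "started in the 1950s, in 1950 Alan Turing"

# Pattern used in the notebook: strips every run of digits.
all_digits_removed = re.sub(r"\d+", "", sample)

# Alternative (assumption): strip only numbers that form a whole token.
standalone_removed = re.sub(r"\b\d+\b", "", sample)

print(all_digits_removed)
print(standalone_removed)
```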


## **Generate Unigrams**

In [17]:
#function to generate n-grams from text (num=1 gives unigrams)
from nltk.util import ngrams
def extract_ngrams(data,num):
    n_grams = ngrams(nltk.word_tokenize(data.lower()), num)
    return [' '.join(grams) for grams in n_grams]
gram_num = 1
n_grams = extract_ngrams(a, gram_num)
print(n_grams)

['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'speech', 'recognition', ',', 'natural', 'language', 'understanding', ',', 'and', 'natural', 'language', 'generation', '.', 'the', 'history', 'of', 'natural', 'language', 'processing', '(', 'nlp', ')', 'generally', 'started', 'in', 'the', 's', ',', 'although', 'work', 'can', 'be', 'found', 'from', 'earlier', 'periods', '.', 'in', ',', 'alan', 'turing', 'published', 'an', 'article', 'titled', '``', 'computing', 'machinery', 'and', 'in

## **Generate Bi Grams**

In [18]:
#generate bigrams by reusing extract_ngrams, defined in the unigram cell
gram_num = 2
n_grams = extract_ngrams(a, gram_num)
print(n_grams)

['natural language', 'language processing', 'processing (', '( nlp', 'nlp )', ') is', 'is a', 'a subfield', 'subfield of', 'of linguistics', 'linguistics ,', ', computer', 'computer science', 'science ,', ', information', 'information engineering', 'engineering ,', ', and', 'and artificial', 'artificial intelligence', 'intelligence concerned', 'concerned with', 'with the', 'the interactions', 'interactions between', 'between computers', 'computers and', 'and human', 'human (', '( natural', 'natural )', ') languages', 'languages ,', ', in', 'in particular', 'particular how', 'how to', 'to program', 'program computers', 'computers to', 'to process', 'process and', 'and analyze', 'analyze large', 'large amounts', 'amounts of', 'of natural', 'natural language', 'language data', 'data .', '. challenges', 'challenges in', 'in natural', 'natural language', 'language processing', 'processing frequently', 'frequently involve', 'involve speech', 'speech recognition', 'recognition ,', ', natural'

## **Generate TriGrams**

In [19]:
#generate trigrams by reusing extract_ngrams, defined in the unigram cell
gram_num = 3
n_grams = extract_ngrams(a, gram_num)
print(n_grams)

['natural language processing', 'language processing (', 'processing ( nlp', '( nlp )', 'nlp ) is', ') is a', 'is a subfield', 'a subfield of', 'subfield of linguistics', 'of linguistics ,', 'linguistics , computer', ', computer science', 'computer science ,', 'science , information', ', information engineering', 'information engineering ,', 'engineering , and', ', and artificial', 'and artificial intelligence', 'artificial intelligence concerned', 'intelligence concerned with', 'concerned with the', 'with the interactions', 'the interactions between', 'interactions between computers', 'between computers and', 'computers and human', 'and human (', 'human ( natural', '( natural )', 'natural ) languages', ') languages ,', 'languages , in', ', in particular', 'in particular how', 'particular how to', 'how to program', 'to program computers', 'program computers to', 'computers to process', 'to process and', 'process and analyze', 'and analyze large', 'analyze large amounts', 'large amounts

## **Generate N Grams and frequencies**

In [23]:
#function to generate ngrams from mytext
def generate_ngrams(text,n):

    # split the text into tokens on whitespace
    tokens=re.split("\\s+",text)
    ngrams=[]

    # collect the n-grams
    for i in range(len(tokens)-n+1):
        temp=[tokens[j] for j in range(i,i+n)]
        ngrams.append(" ".join(temp))

    return ngrams
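As a quick check of the sliding-window idea, the following sketch runs a compact equivalent on a short string. The name `whitespace_ngrams` and the sample text are hypothetical, and the `strip()` call is an addition that avoids empty tokens at the ends of the input.

```python
import re

def whitespace_ngrams(text, n):
    """Return the n-grams of `text`, splitting on whitespace only."""
    tokens = re.split(r"\s+", text.strip())
    # Slide a window of width n over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "natural language processing is fun"
bigrams = whitespace_ngrams(sample, 2)
trigrams = whitespace_ngrams(sample, 3)
print(bigrams)
print(trigrams)
```

A text of 5 tokens yields 5 - n + 1 n-grams: four bigrams and three trigrams here.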

In [24]:
#defining a function that maps each sentence to the relative frequencies of its n-grams
def compute_ngrams(tokens,n):
    ngram_dict={}
    for sentence in tokens:
        # NOTE: the n-grams are built from the document-wide stopwords_tokens,
        # not from `sentence`, so every sentence maps to the same distribution
        generated_ngrams = ngrams(stopwords_tokens,n)
        length=len(stopwords_tokens)
        p=Counter(generated_ngrams)
        # convert raw counts to relative frequencies
        for x,y in p.items():
            p[x]=y/length
        ngram_dict[sentence]=p
    return ngram_dict

In [26]:
from collections import Counter                  #declare Counter

In [27]:
#defining a function that scores each sentence by the mean frequency of its n-grams
def compute_score(computed_ngram_Frequency):
    sentencescore={}
    for sentence,freq in computed_ngram_Frequency.items():
        ngrams_in_sentence = len(freq)                  # number of distinct n-grams
        totalscore = 0
        for gram,score in freq.items():
            totalscore += score
        sentencescore[sentence] = totalscore / ngrams_in_sentence
    return sentencescore
sentence_ngrams=compute_ngrams(sent_tokens,2)
sent_scores=compute_score(sentence_ngrams)
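In `compute_ngrams` as written, the n-grams for every sentence come from the document-wide `stopwords_tokens`, so all sentences receive the same score. The sketch below shows one possible per-sentence variant: each sentence is scored by the mean document-level relative frequency of its own n-grams. This is an illustration under stated assumptions, not the notebook's method. The helper names `sentence_ngrams` and `score_sentences` and the toy sentences are hypothetical, and stopword removal is left out to keep the example dependent on the standard library only.

```python
import re
from collections import Counter

def sentence_ngrams(sentence, n):
    """Lower-case a sentence, keep word characters only, return its n-grams as tuples."""
    words = re.findall(r"\w+", sentence.lower())
    return list(zip(*(words[i:] for i in range(n))))

def score_sentences(sentences, n):
    """Score each sentence by the mean document-level frequency of its n-grams."""
    per_sentence = {s: sentence_ngrams(s, n) for s in sentences}
    # Count every n-gram across the whole document.
    doc_counts = Counter(g for grams in per_sentence.values() for g in grams)
    total = sum(doc_counts.values())
    scores = {}
    for s, grams in per_sentence.items():
        if grams:
            scores[s] = sum(doc_counts[g] / total for g in grams) / len(grams)
        else:
            scores[s] = 0.0  # sentence shorter than n words
    return scores

# Hypothetical toy document.
toy_sentences = [
    "Natural language processing is fun.",
    "Natural language processing is hard.",
    "Cats sleep a lot.",
]
toy_scores = score_sentences(toy_sentences, 2)
for sentence, score in toy_scores.items():
    print(round(score, 3), sentence)
```

The first two sentences share three bigrams and therefore score higher than the third, whose bigrams each occur only once.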


## **Summary Generation**

In [32]:
#summarizing the text based on ngrams sentence frequencies
import heapq  
summary_sentences = heapq.nlargest(3, sent_scores, key=sent_scores.get)
summary = ' '.join(summary_sentences)
print(summary)

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.
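`heapq.nlargest(n, iterable, key=...)` is documented as equivalent to `sorted(iterable, key=key, reverse=True)[:n]`, so sentences with equal scores keep their input order. The sketch below uses made-up scores to show the selection step, plus an optional re-ordering of the selected sentences into document order. The re-ordering is a design choice added here for illustration and is not part of the notebook.

```python
import heapq

# Hypothetical scores keyed by sentence, inserted in document order.
toy_scores = {
    "first sentence.": 0.2,
    "second sentence.": 0.5,
    "third sentence.": 0.2,
    "fourth sentence.": 0.4,
}

# Pick the two highest-scoring sentences.
top = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Optional: present the selected sentences in their original order.
position = {sentence: i for i, sentence in enumerate(toy_scores)}
summary = " ".join(sorted(top, key=position.get))
print(summary)
```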


## **N-GRAM SUMMARY**

**Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.**

## **COMPARISON WITH LAB 1**
**The Lab 1 summary was lengthy and contained more punctuation, whereas the N-gram summary is shorter, more precise, and has comparatively little punctuation. N-grams capture the context of the data better than the TF-IDF approach because they use the frequencies of word sequences rather than of individual words. As N increases, the summary becomes more precise and to the point.**
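One way to see what "context" means here is to compare two texts that use the same words in a different order. In the sketch below, unigram counts cannot tell them apart while bigram counts can. The helper `ngram_counts` and the two sample phrases are hypothetical and serve only to illustrate the point.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count the n-grams (as tuples) in a list of words."""
    return Counter(zip(*(words[i:] for i in range(n))))

# Same bag of words, different word order.
text_a = "the dog bit the man".split()
text_b = "the man bit the dog".split()

same_unigrams = ngram_counts(text_a, 1) == ngram_counts(text_b, 1)
same_bigrams = ngram_counts(text_a, 2) == ngram_counts(text_b, 2)
print("identical unigram counts:", same_unigrams)
print("identical bigram counts:", same_bigrams)
```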