***Fetching Articles from Wikipedia***

In [50]:
!pip install beautifulsoup4  # HTML parsing library used for web scraping



In [51]:
!pip install lxml  # parser backend passed to BeautifulSoup below



In [52]:
import bs4 as bs
import urllib.request  # opens URLs (mostly HTTP)
import re

# Fetch the raw HTML of the Wikipedia article
data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
article = data.read()

# Parse the HTML into a navigable tree using the lxml parser
parsed_article = bs.BeautifulSoup(article, 'lxml')

# find_all returns every <p> element as a list-like result
paragraphs = parsed_article.find_all('p')

# Concatenate the text of all paragraphs into one string
article_text = "".join(p.text for p in paragraphs)

In [53]:
print(article_text)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The result is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. 
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural-language generation.
Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated inte

# ***Data Preprocessing***

***Removing bracketed citation markers and extra spaces***

In [54]:
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)  # drop citation markers such as [1]
article_text = re.sub(r'\s+', ' ', article_text)         # collapse runs of whitespace into one space
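
As a quick check of what these two substitutions do, here is a minimal sketch on a made-up sample string. The sample text and variable names are assumptions for illustration and are not taken from the fetched article.

```python
import re

# Hypothetical sample with Wikipedia-style citation markers and uneven whitespace.
sample = "NLP has its roots in the 1950s.[1]  Turing proposed a test[12]\nof intelligence."

# Replace numeric citation markers such as [1] or [12] with a space.
no_citations = re.sub(r'\[[0-9]*\]', ' ', sample)

# Collapse any run of whitespace (spaces, newlines, tabs) into a single space.
cleaned = re.sub(r'\s+', ' ', no_citations)

print(cleaned)
```

Because the pattern uses `[0-9]*`, it also matches an empty pair of brackets, but it leaves non-numeric brackets such as `[citation needed]` in place.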

***Removing special characters and digits***

In [55]:
# Keep only letters; article_text itself stays intact for sentence tokenization
formatted_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)  # used below to build the weighted word frequencies
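
The effect of the letters-only filter can be seen on a small made-up string. This is only a sketch; the sample sentence is an assumption and the variable names are hypothetical.

```python
import re

# Hypothetical sample containing digits, punctuation, and an apostrophe.
sample = "In 1950, Alan Turing's test (NLP) scored 3.5!"

# Replace every character that is not an ASCII letter with a space.
letters_only = re.sub(r'[^a-zA-Z]', ' ', sample)

# Collapse the resulting runs of spaces.
formatted = re.sub(r'\s+', ' ', letters_only)

print(repr(formatted))
```

Note that the apostrophe in "Turing's" is also replaced, which leaves a stray "s" token, and that accented or non-Latin letters would be dropped by this pattern as well.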

# ***Converting Text to Sentences***

***Sentence Tokenizer***

In [56]:
import nltk
nltk.download('punkt')  # tokenizer models needed by sent_tokenize

# Split the original (unformatted) text so sentence punctuation is preserved
sentence_list = nltk.sent_tokenize(article_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


***Finding weighted frequency of the words***

In [57]:
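nltk.download('stopwords')
from nltk.corpus import stopwords

# Use a separate name so the imported stopwords module is not shadowed
stop_words = set(stopwords.words('english'))

# Count each non-stopword. Tokens are not lowercased here, so capitalized
# words (e.g. "The") bypass the lowercase stopword list and are counted separately.
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stop_words:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1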

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [58]:
maximum_frequency = max(word_frequencies.values())

# Normalize each count by the highest count so weights fall in (0, 1]
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
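
To make the weighting concrete, here is a small worked sketch on a toy token list. The tokens and the tiny stopword set are made up for illustration, and `collections.Counter` is used here as a shorthand for the counting loop above.

```python
from collections import Counter

# Hypothetical tokens and stopword set, for illustration only.
tokens = ["language", "models", "process", "language", "data", "and", "language", "data"]
toy_stopwords = {"and", "the", "of"}

# Count every token that is not a stopword.
counts = Counter(t for t in tokens if t not in toy_stopwords)

# Divide each count by the largest count to get a weight in (0, 1].
max_count = max(counts.values())
weights = {word: count / max_count for word, count in counts.items()}

print(weights)
```

The most frequent word ("language", 3 occurrences) gets weight 1.0, "data" gets 2/3, and the words that appear once get 1/3.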

***Calculating Sentence Scores***

In [59]:
sentence_scores = {}
for sent in sentence_list:
    # Skip long sentences (40 or more space-separated words)
    if len(sent.split(' ')) >= 40:
        continue
    # A sentence's score is the sum of the weights of its known words
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]
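
The scoring rule can be traced by hand on a toy example. This sketch assumes made-up weights and sentences, and it replaces `nltk.word_tokenize` with a simple whitespace split so that it runs without any downloads.

```python
# Hypothetical normalized word weights, standing in for word_frequencies.
toy_weights = {"language": 1.0, "data": 0.5, "models": 0.25}

toy_sentences = [
    "Language models process language data.",
    "The weather is nice.",
]

MAX_WORDS = 40  # same length cutoff as in the cell above

toy_scores = {}
for sent in toy_sentences:
    if len(sent.split(' ')) >= MAX_WORDS:
        continue
    # Simplified tokenization (an assumption): lowercase, drop the final period, split on whitespace.
    for word in sent.lower().rstrip('.').split():
        if word in toy_weights:
            toy_scores[sent] = toy_scores.get(sent, 0) + toy_weights[word]

print(toy_scores)
```

The first sentence scores 1.0 + 0.25 + 1.0 + 0.5 = 2.75. The second contains no weighted words, so it never receives an entry and cannot be selected for the summary.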

***Getting the summary***

In [60]:
import heapq

# Take the 10 highest-scoring sentences (returned in descending score order)
summary_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural-language generation. In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. The cache language models upon which many speech recognition systems 
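
`heapq.nlargest` returns sentences ordered by score, not by their position in the article, which is why the summary above does not follow the article's order. The sketch below shows the selection on made-up scores and one possible way to restore the original order. The reordering step is an optional variation, not part of the notebook, and it relies on the dictionary's insertion order standing in for document order.

```python
import heapq

# Hypothetical scores, inserted in the order the sentences appear in the text.
toy_scores = {
    "First sentence.": 0.4,
    "Second sentence.": 1.3,
    "Third sentence.": 2.1,
    "Fourth sentence.": 0.9,
}

# Pick the 2 highest-scoring sentences, best first.
top = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Optional: put the selected sentences back into their original order.
position = {sent: i for i, sent in enumerate(toy_scores)}
in_document_order = sorted(top, key=position.get)

print(' '.join(top))
print(' '.join(in_document_order))
```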