# Web Scraping, Data Cleaning, and Content Summarization

This notebook scrapes a web page, cleans the text, and produces a short summary by scoring each sentence on the weighted frequencies of its words.

In [1]:
import bs4 as bs
import urllib.request
import re
import nltk
import heapq 
nltk.download('punkt')
nltk.download('stopwords') 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rishi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rishi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
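# Download the raw HTML of the article
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Data_science')
article = scraped_data.read()

# Parse the HTML (the 'lxml' parser requires the lxml package)
parsed_article = bs.BeautifulSoup(article, 'lxml')

# Collect every <p> tag and concatenate its text
paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text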

# Preprocessing
The first preprocessing step is to remove references from the article. The following script removes the bracketed reference numbers (such as [1]) and replaces the resulting runs of whitespace with a single space.

In [3]:
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)  
article_text = re.sub(r'\s+', ' ', article_text) 
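
To see what these two substitutions do, here is a minimal sketch on a made-up string. The `sample` text is an assumption for illustration and is not taken from the scraped article.

```python
import re

# Hypothetical sample text with Wikipedia-style citation markers.
sample = "Data science is an interdisciplinary field.[1][2]  It uses   statistics.[15]"

# Replace bracketed numeric references such as [1] or [15] with a space.
no_refs = re.sub(r'\[[0-9]*\]', ' ', sample)

# Collapse any run of whitespace into a single space.
cleaned = re.sub(r'\s+', ' ', no_refs)

print(cleaned)
```

Because each reference is replaced by a space rather than removed outright, the second substitution is what keeps the text from accumulating double spaces.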

To clean the text and calculate weighted frequencies, we will create a second object.

In [4]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )  
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)  

There are now two objects: article_text, which contains the original article, and formatted_article_text, which contains the formatted article. We will use formatted_article_text to compute weighted frequencies for the words, and then use those frequencies to score the sentences in the article_text object.

# Converting Text To Sentences

We will use the article_text object to split the article into sentences, since it still contains full stops. formatted_article_text does not contain any punctuation and therefore cannot be split into sentences at full stops.

In [5]:
sentence_list = nltk.sent_tokenize(article_text)  
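
The point about punctuation can be checked on a toy string. The sketch below uses a simple regular-expression splitter, `naive_sentences`, as a stand-in for `nltk.sent_tokenize`. That helper and the sample text are assumptions made so the example runs without the NLTK models, and the helper is much cruder than the real tokenizer.

```python
import re

def naive_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (crude stand-in for nltk.sent_tokenize)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Hypothetical sample, cleaned the same way as formatted_article_text.
original = "Data science uses statistics. It also uses computing."
letters_only = re.sub(r'\s+', ' ', re.sub('[^a-zA-Z]', ' ', original))

print(naive_sentences(original))      # the full stops give two sentences
print(naive_sentences(letters_only))  # no punctuation left, so one undivided string
```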

# Find Weighted Frequency of Occurrence

We use the formatted_article_text variable here, since it doesn't contain punctuation, digits, or other special characters.

In [6]:
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
# Lowercase first: the stopword list is lowercase, and sentence scoring looks words up in lowercase
for word in nltk.word_tokenize(formatted_article_text.lower()):
    if word not in stopwords:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

In [8]:
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():  
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
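
The normalization step divides every count by the largest count, so the most frequent word ends up with weight 1.0. A minimal sketch on toy data follows; the token list and the `toy_stopwords` set are assumptions for illustration, whereas the notebook itself uses NLTK's tokenizer and stopword list.

```python
from collections import Counter

# Hypothetical toy inputs.
tokens = "data science is the science of data and data analysis".split()
toy_stopwords = {"is", "the", "of", "and"}

# Count every token that is not a stopword.
counts = Counter(t for t in tokens if t not in toy_stopwords)

# Divide each count by the largest count to get the weighted frequency.
maximum = max(counts.values())
weights = {word: count / maximum for word, count in counts.items()}

print(weights)
```

Here "data" occurs three times and gets weight 1.0, "science" occurs twice and gets 2/3, and "analysis" occurs once and gets 1/3.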

# Calculating Sentence Scores

Here we calculate a score for each sentence by adding up the weighted frequencies of the words that occur in it.

In [9]:
sentence_scores = {}  
for sent in sentence_list:  
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 40:  # only score sentences with fewer than 40 words
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
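
The scoring rule can be traced by hand on a small example. In the sketch below the `weights` dictionary and the sentences are made up, and the tokenization is a crude split on spaces instead of `nltk.word_tokenize`.

```python
# Hypothetical weights and sentences for illustration only.
weights = {"data": 1.0, "science": 0.5, "statistics": 0.25}
sentences = ["Data science uses data.", "Statistics is older.", "Nothing relevant here."]

scores = {}
for sent in sentences:
    # Crude tokenization: lowercase, split on spaces, strip trailing punctuation.
    words = [w.strip(".,!?") for w in sent.lower().split()]
    # A sentence's score is the sum of the weights of its words (unknown words add 0).
    score = sum(weights.get(w, 0.0) for w in words)
    if score > 0:  # as in the notebook, sentences with no weighted words get no entry
        scores[sent] = score

print(scores)
```

The first sentence scores 1.0 + 0.5 + 1.0 = 2.5 because "data" appears twice, which shows that repeated words count each time they occur.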

The number 5 in the call below is the number of sentences we want in the summary. It could also be taken from user input.

In [11]:
summary_sentences = heapq.nlargest(5, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)  
print(summary)  

First, for Donoho, data science does not equate to big data, in that the size of the data set is not a criterion to distinguish data science and statistics. In his conclusion, he initiated the modern, non-computer science, usage of the term "data science" and advocated that statistics be renamed data science and statisticians data scientists. In 2015, the International Journal on Data Science and Analytics was launched by Springer to publish original work on data science and big data analytics. As Donoho concludes, "the scope and impact of data science will continue to expand enormously in coming decades as scientific data and data about science itself become ubiquitously available." However, many critical academics and journalists see no distinction between data science and statistics, whereas others consider it largely a popular term for "data mining" and "big data".
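
One way to make the summary length configurable is to wrap the selection in a small function. This is a sketch only: `top_sentences` and `toy_scores` are hypothetical names, not part of the notebook above.

```python
import heapq

def top_sentences(sentence_scores, n=5):
    """Return the n highest-scoring sentences, best first."""
    return heapq.nlargest(n, sentence_scores, key=sentence_scores.get)

# Hypothetical scores for illustration only.
toy_scores = {"First sentence.": 0.4, "Second sentence.": 1.2, "Third sentence.": 0.9}

n = 2  # could instead come from int(input("Number of sentences: "))
print(' '.join(top_sentences(toy_scores, n)))
```

Note that `heapq.nlargest` returns sentences in descending score order, so the summary does not necessarily follow the order in which the sentences appear in the article. If `n` exceeds the number of scored sentences, all of them are returned.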
