# BLOG SUMMARIZATION USING SCRAPY AND NLTK

### This tutorial is a short, quick way to summarise online blogs by scraping them. The two main Python tools used for the scraping and text processing are Scrapy and NLTK. Scrapy is worth learning because the huge amount of data on the internet is continuously changing, and a scraper lets you collect that data as it changes.

# USE CASE OF SCRAPY AND NLTK
In the upcoming parts of the tutorial, you will scrape the https://www.cnet.com website. From the website you will collect the URLs of the latest news. These URLs are saved to a file and then read back one by one to scrape the content of each blog. The blog text produced by Scrapy is then passed to the summarisation code. More details on the tools follow below.

# SCRAPY 

### Scrapy is a Python library, and a framework in itself, for scraping pages on the internet. It can also be used to crawl web pages over time. The data comes back in a structured form which can then be used directly for text analysis and other kinds of data science activities. The Scrapy architecture uses the following components:
- Scrapy Engine
- Scheduler
- Downloader
- Spiders
- Item Pipeline
- Downloader middlewares
- Spider middlewares

### In this tutorial we won't be using all of these components, only some of the basic ones. Each is explained along with an example. More about the components can be read here: https://doc.scrapy.org/en/latest/topics/architecture.html

The following are the parts of the Scrapy library that are needed to complete this tutorial. Install Scrapy using:
>  conda install scrapy

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
import logging
from scrapy.exceptions import CloseSpider

### Spiders :
`scrapy.Spider` is the component that users extend to write custom classes. The custom class is used to define the attributes of the spider, such as:
- name : a unique name for the spider that Scrapy uses to initialise it
- start_urls : a list of URLs that the spider should start crawling from
- custom_settings : basic settings such as the logging level or which pipeline to use to write the result
There are others as well, which can be looked up at: https://doc.scrapy.org/en/latest/topics/spiders.html

The `parse` function is also an overridden function; control arrives there after Scrapy has made the request. The `response` variable is the reply from the server to that request, and it is used to yield the data that you need. The yielded data, in the form of a dictionary, is sent to the item pipeline component (more on it soon).

So, in our use case, `start_urls` holds our base website, and the item pipeline is the `LinkWriterPipeline`, which is defined later on. The XPath to the `href` attribute of each latest blog is given; the value is extracted and sent to the pipeline in the form of a dictionary.

Also, you may notice that a `CloseSpider` exception is used, which closes the spider once the counter reaches 5. Because the counter starts at 0 and is checked after the `yield`, six links are produced. This is done just to keep the tutorial practical.

In [2]:
class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = [
        'https://www.cnet.com/'
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.LinkWriterPipeline': 1},
    }
    def parse(self, response):
        i = 0
        for link in response.xpath("//div[@class='latestScrollItems']//div[@class='item']//span[@class='col-5 rlLine']/a/@href"):
            yield {
                'link': link.extract()
            }
            if i == 5:
                raise CloseSpider('termination condition met')
            i += 1
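
The selection and the early stop can be tried offline on a small hand-written page. The sketch below is not Scrapy: it swaps in the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, and uses `itertools.islice` instead of a counter with `CloseSpider`. The `SAMPLE` markup and the `first_links` helper are made up for illustration, and the approach assumes well-formed markup, which real pages often are not.

```python
import xml.etree.ElementTree as ET
from itertools import islice

# Hypothetical, well-formed stand-in for the part of the page the spider targets.
SAMPLE = """
<html><body>
<div class="latestScrollItems">
  <div class="item"><span class="col-5 rlLine"><a href="/news/a/">A</a></span></div>
  <div class="item"><span class="col-5 rlLine"><a href="/news/b/">B</a></span></div>
  <div class="item"><span class="col-5 rlLine"><a href="/news/c/">C</a></span></div>
</div>
</body></html>
"""

def first_links(markup, limit):
    """Return at most `limit` href values matched by the spider's path."""
    root = ET.fromstring(markup.strip())
    anchors = root.iterfind(
        ".//div[@class='latestScrollItems']//div[@class='item']"
        "//span[@class='col-5 rlLine']/a"
    )
    # islice stops after `limit` matches, like the counter in parse().
    return [anchor.get("href") for anchor in islice(anchors, limit)]

print(first_links(SAMPLE, 2))
```

Asking for more links than the page contains simply returns all of them.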


### Item Pipelines :
This is where all the items output by each response page arrive and are processed. A pipeline class can have three functions:
- open_spider : called whenever a spider is started.
- process_item : where the processing of each item happens.
- close_spider : called whenever the spider is done crawling.
In this case, `open_spider` and `close_spider` just open and close the file `blogs-link.txt`, to which all the available links are written. In the `process_item` function, the domain name is prepended to the relative path received from the `href` of the tag, and the result is written to the file.

In [5]:
class LinkWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('blogs-link.txt', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # The href is a relative path, so prepend the domain to build a full URL.
        domain = "https://www.cnet.com"
        print(item['link'])
        link = domain + item['link']
        self.file.write(link + "\n")
        return item
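
The order in which the three hooks fire can be checked without running a crawl. The sketch below is a simplified stand-in, not Scrapy itself: `InMemoryLinkPipeline` and `drive` are hypothetical names, the links are collected in a list instead of a file, and `urllib.parse.urljoin` is used to build the absolute URL instead of string concatenation.

```python
from urllib.parse import urljoin

class InMemoryLinkPipeline:
    """Same three hooks as LinkWriterPipeline, but collects into a list."""

    def open_spider(self, spider):
        self.links = []

    def process_item(self, item, spider):
        # urljoin resolves the relative href against the site root.
        self.links.append(urljoin("https://www.cnet.com", item["link"]))
        return item

    def close_spider(self, spider):
        self.closed = True

def drive(pipeline, items):
    """Call the hooks in order: open once, process each item, close once."""
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return pipeline.links

links = drive(InMemoryLinkPipeline(), [{"link": "/news/a/"}, {"link": "/pictures/b/"}])
print(links)
```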

Now, we repeat the process: another spider and a similar item pipeline scrape the blog content and store it in `blog-text.txt`.
While doing so, you will notice that the URLs have to be appended dynamically to `start_urls` from the file before the spider runs. This implies that the LinkSpider must run first, before the BlogSpider.

But Scrapy runs on the Twisted reactor, which can be started only once per process and cannot be restarted after it stops. Running a second crawl in the same process therefore raises a `ReactorNotRestartable` error.

According to https://stackoverflow.com/questions/41495052/scrapy-reactor-not-restartable, this error can be solved by running each crawl in its own process, that is, by creating a multiprocess system. Hence the following code is used to run the spiders in different processes. (This is out of the scope of the project topic, so I used it as is.) The `run_spider` function is used to run the different spiders. Apart from the multiprocessing code, there are some Scrapy functions used to start the crawler. The basic code is as follows:
> process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 '})                
> process.crawl(LinkSpider)                                                                            
> process.start()

Here, a crawler process is defined and given parameters if needed, such as the User Agent. The process is then given the Spider class to crawl with, and the process is started.

In [6]:
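def run_spider(name):
    # Run one spider in a child process so each crawl gets a fresh Twisted reactor.
    def f(q):
        try:
            runner = CrawlerRunner()
            deferred = runner.crawl(name)
            # Stop the reactor once the crawl finishes, whether it succeeded or failed.
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            # Hand any exception back to the parent process.
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    # Re-raise in the parent whatever went wrong in the child.
    if result is not None:
        raise result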

The `run_spider` function is then called. The pipeline prints every link the spider outputs and stores it in the `blogs-link.txt` file. The output is shown below.

In [8]:
run_spider(LinkSpider)

/pictures/apple-iphone-x-macro/
/news/chinese-tiangong-1-space-station-may-miss-april-1-crash-re-entry/
/news/ron-howard-teases-solo-a-star-wars-story-space-battle/
/news/the-viva-egoista-845-extreme-high-end-audio-meets-italian-style/
/news/daytons-98-per-pair-bluetooth-speakers-do-full-on-stereo/
/news/best-april-fools-day-2018-pranks-on-web/


In [11]:
for link in open('blogs-link.txt'):
    print(link)

https://www.cnet.com/pictures/apple-iphone-x-macro/

https://www.cnet.com/news/chinese-tiangong-1-space-station-may-miss-april-1-crash-re-entry/

https://www.cnet.com/news/ron-howard-teases-solo-a-star-wars-story-space-battle/

https://www.cnet.com/news/the-viva-egoista-845-extreme-high-end-audio-meets-italian-style/

https://www.cnet.com/news/daytons-98-per-pair-bluetooth-speakers-do-full-on-stereo/

https://www.cnet.com/news/best-april-fools-day-2018-pranks-on-web/



Now that the URLs are ready in a file, we can create the new spider and run it in the same way as before. Here, we scrape all the links and store the result in a text file. The output of the process is given below.

In [12]:
class BlogSpider(scrapy.Spider):
    name = "blogs"
    start_urls=[]
    for url in open("blogs-link.txt"):
        url = url.replace('\n',"")
        start_urls.append(url)
    print(start_urls)  
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.BlogWriterPipeline': 1}, 
    }

    def parse(self, response):
        for quote in response.css('div.article-main-body'):
            yield {
                'text': quote.css('p').extract()
            }

['https://www.cnet.com/pictures/apple-iphone-x-macro/', 'https://www.cnet.com/news/chinese-tiangong-1-space-station-may-miss-april-1-crash-re-entry/', 'https://www.cnet.com/news/ron-howard-teases-solo-a-star-wars-story-space-battle/', 'https://www.cnet.com/news/the-viva-egoista-845-extreme-high-end-audio-meets-italian-style/', 'https://www.cnet.com/news/daytons-98-per-pair-bluetooth-speakers-do-full-on-stereo/', 'https://www.cnet.com/news/best-april-fools-day-2018-pranks-on-web/']
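
Reading the URL file can also be done in a small helper that closes the file and skips blank lines. This is only a sketch: `read_start_urls` is a hypothetical name, and the demonstration writes a throwaway temporary file with made-up URLs rather than touching the real `blogs-link.txt`.

```python
import os
import tempfile

def read_start_urls(path):
    """Return the non-empty lines of a file with surrounding whitespace removed."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

# Demonstration on a throwaway file rather than the real blogs-link.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("https://www.cnet.com/news/a/\n\nhttps://www.cnet.com/news/b/\n")
urls = read_start_urls(tmp.name)
os.remove(tmp.name)
print(urls)
```

Inside the spider class, the same helper could be used as `start_urls = read_start_urls("blogs-link.txt")`.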


In [13]:
class BlogWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('blog-text.txt', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # item['text'] is a list of <p> fragments; join them into one line per blog.
        blog = ''.join(item['text'])
        self.file.write(blog + '\n')
        return item

In [16]:
run_spider(BlogSpider)

In [83]:
for blog in open('blog-text.txt'):
    print(blog)

<p>I'll cut right to the chase: the <a href="https://www.parts-express.com/dayton-audio-mk402bt-powered-bluetooth-2-way-bookshelf-speaker-pair-with-35mm-aux-in--300-458" target="_blank" data-component="externalLink">Dayton Audio MK402BT</a> two-speaker system fills a room better than any single speaker could: whether thats a $349 <span class="link" section="shortcodeLink"><a href="/products/apple-homepod/review/">Apple HomePod</a></span> or a $399 <span class="link" section="shortcodeLink"><a href="/products/google-home-max/review/">Google Home Max</a></span> smart speaker. The big advantage of bona fide stereo separation is hard to beat for the money; the MK402BT sells for just $98 a pair.</p><p>The Dayton Audio MK402BT speakers</p><p>When I had them spread five feet apart, the MK402BTs' stereo was wonderfully spacious, with a fair degree of soundstage depth. While I didn't have a HomePod or Home Max on hand for direct comparisons, I know those bad boys deliver more potent bass. Still

# NLTK
NLTK is the library used for text processing. You may already know it from other tasks; here its functions are used to summarise text. There are two basic types of text summarisation:
- Extractive technique : picks the important sentences from the text as they are and presents them as the summary
- Abstractive technique : writes a short summary from scratch, in new sentences, based on all the statements.

In this tutorial, the extractive technique is used. For demo purposes, only the first blog from the blogs file is processed. As it is HTML, the tags are removed, backslashes are replaced with apostrophes, and commas and question marks are dropped. The remaining replacements normalise the spacing around periods so that the sentences are well formatted before processing starts.

In [84]:
import nltk
import re

f = open('blog-text.txt').readline()
text = re.sub('<.*?>','', f)
text = text.replace("\\","'")
text = text.replace('\n',"")
text = text.replace(",", "")
text = text.replace(" .",'.')
text = text.replace(". ",'.')
text = text.replace(".",". ")
text = text.replace("?", "")
print(text)

I'll cut right to the chase: the Dayton Audio MK402BT two-speaker system fills a room better than any single speaker could: whether thats a $349 Apple HomePod or a $399 Google Home Max smart speaker. The big advantage of bona fide stereo separation is hard to beat for the money; the MK402BT sells for just $98 a pair. The Dayton Audio MK402BT speakersWhen I had them spread five feet apart the MK402BTs' stereo was wonderfully spacious with a fair degree of soundstage depth. While I didn't have a HomePod or Home Max on hand for direct comparisons I know those bad boys deliver more potent bass. Still the MK402BT's bass is definitely adequate. The MK402BT speakers are the active version of one of our new favorite budget designs: the MK402. As well as offering a wired connection they also include Bluetooth connectivity.  Like the original this is a two-way speaker with a 4-inch treated paper woofer and a 0. 75-inch soft dome tweeter. The internal power amplifier is rated at 20 watts per chan
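
One side effect visible in the output above is that `0.75-inch` becomes `0. 75-inch`, because every period gets a space after it, and that `speakersWhen` is fused where two paragraphs met. A possible alternative, sketched below with the standard library only, separates paragraphs before stripping tags and adds a space after a period only when a letter follows. `clean_html_text` and the sample string are hypothetical, commas and question marks are left in place here, and the letter rule would still break things like domain names (`cnet.com`), so treat it as a starting point.

```python
import html
import re

def clean_html_text(raw):
    """Strip tags and tidy spacing without touching decimal points."""
    text = re.sub(r"</p>\s*<p>", " ", raw)         # keep paragraphs apart
    text = re.sub(r"<[^>]+>", "", text)             # drop the remaining tags
    text = html.unescape(text)                      # e.g. &amp; becomes &
    text = re.sub(r"\.(?=[A-Za-z])", ". ", text)    # space only before a letter
    return re.sub(r"\s+", " ", text).strip()        # collapse repeated whitespace

sample = (
    "<p>It has a 0.75-inch tweeter.</p>"
    "<p>The <b>MK402BT</b> costs $98 &amp; ships now.No joke.</p>"
)
cleaned = clean_html_text(sample)
print(cleaned)
```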

Now, following are the steps for summarization of the blog above:

## STEP 1 :
Load the English stopwords and tokenize the blog, which means splitting the text into a list of word and punctuation tokens. After that, a lemmatizer converts the different forms of a word to a common base form. The stopwords themselves are filtered out in the next step.

In [85]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

In [86]:
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
words=[lemmatizer.lemmatize(x) for x in words]

In [87]:
print(words)

['I', "'ll", 'cut', 'right', 'to', 'the', 'chase', ':', 'the', 'Dayton', 'Audio', 'MK402BT', 'two-speaker', 'system', 'fill', 'a', 'room', 'better', 'than', 'any', 'single', 'speaker', 'could', ':', 'whether', 'thats', 'a', '$', '349', 'Apple', 'HomePod', 'or', 'a', '$', '399', 'Google', 'Home', 'Max', 'smart', 'speaker', '.', 'The', 'big', 'advantage', 'of', 'bona', 'fide', 'stereo', 'separation', 'is', 'hard', 'to', 'beat', 'for', 'the', 'money', ';', 'the', 'MK402BT', 'sell', 'for', 'just', '$', '98', 'a', 'pair', '.', 'The', 'Dayton', 'Audio', 'MK402BT', 'speakersWhen', 'I', 'had', 'them', 'spread', 'five', 'foot', 'apart', 'the', 'MK402BTs', "'", 'stereo', 'wa', 'wonderfully', 'spacious', 'with', 'a', 'fair', 'degree', 'of', 'soundstage', 'depth', '.', 'While', 'I', 'did', "n't", 'have', 'a', 'HomePod', 'or', 'Home', 'Max', 'on', 'hand', 'for', 'direct', 'comparison', 'I', 'know', 'those', 'bad', 'boy', 'deliver', 'more', 'potent', 'bass', '.', 'Still', 'the', 'MK402BT', "'s", 'ba

## STEP 2 :
Make a dictionary that maps each word in the words list (lower-cased, with stopwords skipped) to the number of times it appears.

In [102]:
word_dict = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in word_dict:
        word_dict[word] += 1
    else:
        word_dict[word] = 1

In [103]:
print(word_dict)

{"'ll": 2, 'cut': 1, 'right': 5, 'chase': 1, ':': 4, 'dayton': 4, 'audio': 4, 'mk402bt': 15, 'two-speaker': 1, 'system': 1, 'fill': 1, 'room': 2, 'better': 2, 'single': 2, 'speaker': 17, 'could': 1, 'whether': 1, 'thats': 1, '$': 5, '349': 1, 'apple': 1, 'homepod': 3, '399': 1, 'google': 1, 'home': 3, 'max': 3, 'smart': 1, '.': 34, 'big': 1, 'advantage': 1, 'bona': 1, 'fide': 1, 'stereo': 4, 'separation': 2, 'hard': 1, 'beat': 1, 'money': 2, ';': 2, 'sell': 2, '98': 2, 'pair': 2, 'speakerswhen': 1, 'spread': 1, 'five': 1, 'foot': 2, 'apart': 1, 'mk402bts': 4, "'": 1, 'wa': 4, 'wonderfully': 1, 'spacious': 1, 'fair': 1, 'degree': 1, 'soundstage': 1, 'depth': 1, "n't": 4, 'hand': 1, 'direct': 1, 'comparison': 1, 'know': 1, 'bad': 1, 'boy': 1, 'deliver': 1, 'potent': 1, 'bass': 3, 'still': 2, "'s": 8, 'definitely': 1, 'adequate': 1, 'active': 1, 'version': 1, 'one': 2, 'new': 2, 'favorite': 1, 'budget': 1, 'design': 1, 'mk402': 3, 'well': 1, 'offering': 1, 'wired': 6, 'connection': 5, 'al
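
The same counting can be written more compactly with `collections.Counter`. The sketch below is an illustration under assumptions: the stopword set is a tiny hand-written stand-in for NLTK's English list, the token list is made up, and, unlike the cell above, tokens without any letter or digit (such as `.` and `$`) are dropped so that punctuation does not dominate the counts.

```python
from collections import Counter

# Tiny hypothetical stopword set; the tutorial uses stopwords.words("english").
STOPWORDS = {"the", "a", "is", "of", "for", "to"}

def word_frequencies(tokens, stopwords=STOPWORDS):
    """Count lower-cased tokens that are not stopwords or pure punctuation."""
    lowered = (token.lower() for token in tokens)
    return Counter(
        token for token in lowered
        if token not in stopwords and any(ch.isalnum() for ch in token)
    )

tokens = ["The", "speaker", "is", "small", ".", "The", "Speaker",
          "sells", "for", "$", "98", "."]
freq = word_frequencies(tokens)
print(freq.most_common(1))
```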

## STEP 3 :
Now, to extract the most important sentences of the blog, sentence tokenization is used to split the text into sentences. A score list with one entry per sentence is initialised to zero.

In [108]:
sentences = sent_tokenize(text)
sentence_score = [0]*len(sentences)

In [109]:
print(sentences)

["I'll cut right to the chase: the Dayton Audio MK402BT two-speaker system fills a room better than any single speaker could: whether thats a $349 Apple HomePod or a $399 Google Home Max smart speaker.", 'The big advantage of bona fide stereo separation is hard to beat for the money; the MK402BT sells for just $98 a pair.', "The Dayton Audio MK402BT speakersWhen I had them spread five feet apart the MK402BTs' stereo was wonderfully spacious with a fair degree of soundstage depth.", "While I didn't have a HomePod or Home Max on hand for direct comparisons I know those bad boys deliver more potent bass.", "Still the MK402BT's bass is definitely adequate.", 'The MK402BT speakers are the active version of one of our new favorite budget designs: the MK402.', 'As well as offering a wired connection they also include Bluetooth connectivity.', 'Like the original this is a two-way speaker with a 4-inch treated paper woofer and a 0.', '75-inch soft dome tweeter.', 'The internal power amplifier i

## STEP 4 : 
Find the importance of each sentence. The importance of a sentence is meant to be the sum of the frequencies of the words it contains.

In [110]:
for index,sentence in enumerate(sentences):
    for wordValue in word_dict:        
        if wordValue[0] in sentence.lower():
            sentence_score[index] += word_dict[wordValue]

In [111]:
print(sentence_score)

[397, 339, 387, 375, 286, 362, 312, 303, 278, 327, 366, 252, 66, 73, 157, 372, 289, 243, 363, 172, 384, 381, 388, 393, 331, 388, 381, 371, 364, 383, 374, 360, 385, 329]
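
In the scoring cell, `wordValue[0]` is the first character of each dictionary key, so the test asks whether that single character occurs anywhere in the sentence; this is why almost every sentence receives a large score. A whole-word variant might look like the sketch below. It is not the cell that produced the numbers above, so its scores would differ from them. `score_sentences` is a hypothetical name, the sample data is made up, and a simple regular expression stands in for `word_tokenize`.

```python
import re

def score_sentences(sentences, word_freq):
    """Score each sentence as the sum of the frequencies of its distinct words."""
    scores = []
    for sentence in sentences:
        # Simple regex tokenizer used here in place of nltk.word_tokenize.
        tokens = set(re.findall(r"[a-z0-9']+", sentence.lower()))
        scores.append(sum(word_freq.get(token, 0) for token in tokens))
    return scores

word_freq = {"speaker": 3, "bass": 2, "wired": 1}
sentences = [
    "The speaker has deep bass.",
    "A wired speaker.",
    "Nothing relevant here.",
]
scores = score_sentences(sentences, word_freq)
print(scores)
```

Because whole tokens are compared, a sentence with none of the counted words scores zero instead of picking up points from stray letters.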


## STEP 5 :
Calculate the average of the sentence scores. I keep the threshold for selecting the important sentences at 1.2 * average here, because I prefer a shorter summary. If you prefer a slightly longer summary, the threshold can be kept at the average itself. Then choose the sentences whose score exceeds the threshold and build the summary from them.

In [113]:
# Mean score over all sentences.
average = sum(sentence_score) / len(sentence_score)
print(average)

321.5


In [114]:
summary = ""
for index,sentence in enumerate(sentences):
    if sentence_score[index] > (1.2*average):
        print(index)
        summary += sentence

0
2
22
23
25


In [115]:
print(summary)

I'll cut right to the chase: the Dayton Audio MK402BT two-speaker system fills a room better than any single speaker could: whether thats a $349 Apple HomePod or a $399 Google Home Max smart speaker.The Dayton Audio MK402BT speakersWhen I had them spread five feet apart the MK402BTs' stereo was wonderfully spacious with a fair degree of soundstage depth.I'm not claiming the MK402BT will satisfy fussy audiophiles -- the sound has an edge that grates when the speakers are played at moderate or loud volume.That said the MK402BT's sound quality improves when you turn off Bluetooth and go for a wired connection with my iPhone 6S.With a wired connection the MK402BT sound is on par with what I recall from my recent listening sessions with the passive (non-amplified) Dayton Audio MK402 speakers that sell for $69 a pair.
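
The printed summary runs sentences together (`speaker.The Dayton`) because `summary += sentence` adds no separator. A small helper, sketched below, applies the same threshold rule and joins the chosen sentences with spaces. `select_summary` is a hypothetical name, the scores are copied from the scoring output in this tutorial, and short placeholder strings stand in for the real sentences.

```python
def select_summary(sentences, scores, factor=1.2):
    """Keep sentences scoring above factor * mean score, joined by spaces."""
    threshold = factor * sum(scores) / len(scores)
    picked = [i for i, score in enumerate(scores) if score > threshold]
    return picked, " ".join(sentences[i] for i in picked)

# Scores copied from the printed sentence_score list in this tutorial.
scores = [397, 339, 387, 375, 286, 362, 312, 303, 278, 327, 366, 252, 66, 73,
          157, 372, 289, 243, 363, 172, 384, 381, 388, 393, 331, 388, 381, 371,
          364, 383, 374, 360, 385, 329]
# Placeholder sentences so the example stays short.
placeholders = [f"S{i}." for i in range(len(scores))]

picked, summary = select_summary(placeholders, scores)
print(picked)
print(summary)
```

Passing `factor=1.0` gives the longer summary mentioned in this step.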


This is a very basic way of summarising text. Various papers and blogs discuss other methods and algorithms for the same task. A paper I liked that lists the various techniques is: https://arxiv.org/pdf/1707.02268.pdf
You can implement one of them in the second half of the tutorial to get better results.