# Web Scraping with Scrapy and MongoDB

This Python project scrapes headline titles and the URLs of the full stories from the latest news page of marketwatch.com. The scraped data are then stored in MongoDB.

The environment for this project is built with:

    conda env create -f requirements.txt

- Note: The Twisted version used by Scrapy must be <= 16.6.

## Build the MarketWatch Scraper

The scraper project can be generated with the following command in cmd.exe:

    c:\workpath\>scrapy startproject marketspider

A Scrapy project is created from the default template, with the following files:
![](figures/001-raw-files.png)



In the **items.py** file, we define the class that stores the scraped data: the newsid, the title, the timestamp of the headline, and the url.

```python
import scrapy

class MarketspiderItem(scrapy.Item):
    newsid = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    timestamp = scrapy.Field()
```

In **marketspider_spider.py**, we define how the data are scraped from the given webpage. The start URL is 'http://www.marketwatch.com/newsviewer'. Using Chrome's developer tools, we can find the XPath expressions for the headlines, newsids, and timestamps of all news items on this page. The `parse` function builds one `MarketspiderItem` per headline and yields it.
- Note: The current spider only handles 80 news items, which is the default number of headlines on the first page of the 'Latest News' tab of marketwatch.com. We will improve our scraper to crawl all news on this **infinite scrolling tab**.

```python
from scrapy import Spider
from scrapy.selector import Selector
from marketspider.items import MarketspiderItem

class MarketspiderSpider(Spider):
    name = 'marketspider'
    allowed_domains = ['marketwatch.com']
    start_urls = [
            'http://www.marketwatch.com/newsviewer',
            ]
    def parse(self, response):
        # select timestamps and headlines in the 'latest news' tab of newsviewer
        headlines = Selector(response).xpath('//div[@class="nv-text-cont"]/h4')
        timestamps = Selector(response).xpath('//li/@timestamp')
        newsids = Selector(response).xpath('//li/@id')

        for (newsid,timestamp, headline) in zip(newsids,timestamps,headlines):
            item = MarketspiderItem()
            
            item['timestamp'] = timestamp.extract().strip()
            item['newsid'] = newsid.extract().strip()
            if not headline.xpath('a').extract():
                item['title'] = headline.xpath(
                    'text()').extract()[0].strip()
                item['url'] = 'n/a'
                yield item
            else:
                item['title'] = headline.xpath(
                    'a[@class="read-more"]/text()').extract()[0].strip()
                item['url'] = headline.xpath(
                    'a[@class="read-more"]/@href').extract()[0]
                yield item
```
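The extraction logic in `parse` can be tried offline on a small fragment. The sketch below is only an illustration: the `SAMPLE` markup and the `parse_headlines` helper are hypothetical, modeled on the XPath expressions above rather than on the real page, and it uses the standard library's `xml.etree.ElementTree` instead of Scrapy selectors. It also reads the id and timestamp from the same `<li>` as the headline, which is one way to avoid pairing values from different rows if the three `zip`ped lists ever differ in length.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, modeled on the XPath expressions used in the spider.
SAMPLE = """
<ol>
  <li id="1001" timestamp="2017-03-01 09:30">
    <div class="nv-text-cont"><h4><a class="read-more" href="http://www.marketwatch.com/story/a">Stocks open higher</a></h4></div>
  </li>
  <li id="1002" timestamp="2017-03-01 09:28">
    <div class="nv-text-cont"><h4>Oil prices slip</h4></div>
  </li>
</ol>
"""


def parse_headlines(markup):
    """Return one dict per <li>, mirroring the fields of MarketspiderItem."""
    root = ET.fromstring(markup)
    items = []
    for li in root.iter('li'):
        h4 = li.find("./div[@class='nv-text-cont']/h4")
        if h4 is None:
            continue  # not a headline row
        link = h4.find("./a[@class='read-more']")
        if link is None:
            # Headline without a link to a full story.
            title, url = (h4.text or '').strip(), 'n/a'
        else:
            title, url = (link.text or '').strip(), link.get('href')
        items.append({
            'newsid': li.get('id', '').strip(),
            'timestamp': li.get('timestamp', '').strip(),
            'title': title,
            'url': url,
        })
    return items


for row in parse_headlines(SAMPLE):
    print(row)
```

Real pages are rarely well-formed XML, so this stand-in is only suitable for checking the selection logic, not for scraping.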

We set the download delay (in seconds) in **settings.py** as follows:
```python
DOWNLOAD_DELAY = 1
```

## From Spider to MongoDB

Once an item is yielded by the spider, it is sent to the item pipeline. According to the Scrapy documentation, each item pipeline component is a Python class that implements a simple method, which receives an item and performs an action on it, and also decides whether the item should continue through the pipeline or be dropped and no longer processed.
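To make the continue-or-drop behavior concrete, here is a minimal sketch that runs without Scrapy. Everything in it is a stand-in: `DropItem` imitates `scrapy.exceptions.DropItem`, and `RequireTitle`, `StripTitle`, and `run_pipelines` are hypothetical names used only for illustration. It assumes, as in Scrapy's `ITEM_PIPELINES` setting, that components with lower numbers run first.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem, so the sketch runs without Scrapy."""


class RequireTitle(object):
    def process_item(self, item, spider):
        # Decide whether the item continues through the pipeline.
        if not item.get('title'):
            raise DropItem('Missing title')
        return item


class StripTitle(object):
    def process_item(self, item, spider):
        # Perform an action on the item and pass it on.
        item['title'] = item['title'].strip()
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through (priority, component) pairs; return None if dropped."""
    for _, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None
    return item


# Lower priority numbers run first, so RequireTitle runs before StripTitle.
PIPELINES = [(300, StripTitle()), (100, RequireTitle())]

print(run_pipelines({'title': '  Stocks rise '}, PIPELINES))
print(run_pipelines({'title': ''}, PIPELINES))
```

The first call returns the cleaned item and the second returns `None`, because `RequireTitle` drops the item before `StripTitle` sees it.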

In this program, we would like to store items in a MongoDB database. The pipeline and the database parameters are defined in **settings.py**: the pipeline name, the database server location and port, the database name, and the collection name.

```python
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'marketspider.pipelines.MongoDBPipeline': 300,
}

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'marketwatch'
MONGODB_COLLECTION = 'news'

```

Using pymongo, we connect to the MongoDB database and write items to it in **pipelines.py**. The connection to the database is set up in the `__init__()` function. In the `process_item()` function, we call `self.collection.update()` with `upsert=True`, which inserts the item if its 'newsid' does not exist in the database and otherwise overwrites the stored document.

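```python
import logging

import pymongo
from scrapy.exceptions import DropItem

logger = logging.getLogger(__name__)


class MongoDBPipeline(object):

    def __init__(self, server, port, db_name, collection_name):
        # Open the connection once, when Scrapy instantiates the pipeline.
        connection = pymongo.MongoClient(server, port)
        self.collection = connection[db_name][collection_name]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB parameters defined in settings.py.
        settings = crawler.settings
        return cls(
            settings.get('MONGODB_SERVER'),
            settings.getint('MONGODB_PORT'),
            settings.get('MONGODB_DB'),
            settings.get('MONGODB_COLLECTION'),
        )

    def process_item(self, item, spider):
        # Drop the item if any of its fields is empty.
        for field, value in item.items():
            if not value:
                raise DropItem('Missing {}'.format(field))
        # Upsert keyed on newsid: overwrite the stored headline or insert a new one.
        # update() is the legacy PyMongo call; PyMongo 4 removed it in favor of replace_one().
        self.collection.update(
            {'newsid': item['newsid']}, dict(item), upsert=True)
        logger.debug('Headline added to MongoDB database')
        return item


class MarketspiderPipeline(object):
    def process_item(self, item, spider):
        # Default pipeline generated by the template: pass items through unchanged.
        return item
```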
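The effect of the upsert is that crawling the same headline twice leaves a single document per 'newsid'. The sketch below illustrates this under stated assumptions: `upsert_headline` is a hypothetical helper that uses `replace_one(filter, replacement, upsert=True)`, the PyMongo 3+ counterpart of the legacy `update()` call, and `InMemoryCollection` is a hypothetical stand-in so the example runs without a MongoDB server. Its return values are simplified strings, not PyMongo's result objects.

```python
def upsert_headline(collection, item):
    """Insert the item, or replace the stored document with the same newsid."""
    doc = dict(item)
    return collection.replace_one({'newsid': doc['newsid']}, doc, upsert=True)


class InMemoryCollection(object):
    """Hypothetical stand-in that mimics replace_one(..., upsert=True) for this demo."""

    def __init__(self):
        self.docs = []

    def replace_one(self, filter, replacement, upsert=False):
        for i, doc in enumerate(self.docs):
            # A document matches when every key in the filter has the same value.
            if all(doc.get(k) == v for k, v in filter.items()):
                self.docs[i] = dict(replacement)
                return 'replaced'
        if upsert:
            self.docs.append(dict(replacement))
            return 'inserted'
        return 'no match'


news = InMemoryCollection()
print(upsert_headline(news, {'newsid': '1001', 'title': 'Stocks open higher'}))
print(upsert_headline(news, {'newsid': '1001', 'title': 'Stocks open sharply higher'}))
print(len(news.docs))
```

With a real database, the same helper would be called with a `pymongo` collection object in place of `InMemoryCollection`.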

## Test the Spider

Our spider can be tested as follows:

    c:\workpath\>scrapy crawl marketspider


## Reference

- [Web Scraping With Scrapy and MongoDB](https://realpython.com/blog/python/web-scraping-with-scrapy-and-mongodb/)