# Introduction to Web Crawling

Web crawlers are bots that traverse websites and collect designated content. Search engines such as Google rely on this technology, and at large scale it requires a database to store the information the crawler fetches.
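
To make the idea concrete, here is a minimal sketch of the loop at the heart of a crawler: a queue of pages to visit, a set of pages already seen, and a store for what was collected. Everything in it is an assumption made for illustration. `FAKE_WEB` is a hypothetical in-memory stand-in for real pages, and a plain dictionary plays the role of the database. A real crawler would download and parse each page instead of looking it up.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to (page title, outgoing links).
FAKE_WEB = {
    "site/home": ("Home", ["site/about", "site/blog"]),
    "site/about": ("About", ["site/home"]),
    "site/blog": ("Blog", ["site/post-1", "site/about"]),
    "site/post-1": ("First post", []),
}


def crawl(start_url, web):
    """Breadth-first crawl: visit each reachable page once and store its title."""
    frontier = deque([start_url])  # URLs waiting to be fetched
    seen = {start_url}             # URLs already queued, so nothing is fetched twice
    store = {}                     # stands in for the crawler's database

    while frontier:
        url = frontier.popleft()
        title, links = web[url]    # a real crawler would download and parse here
        store[url] = title
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


pages = crawl("site/home", FAKE_WEB)
print(pages)
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other, as `site/about` and `site/home` do here.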
 
Search engines such as Google run massive crawling operations that account for a sizable share of World Wide Web traffic. How much load crawling places on a server depends heavily on the traffic that server already handles from other sources.

Crawling a site is allowed by default, and it is the system administrator's job to withdraw this permission. Restrictions are specified in a configuration file named `robots.txt`, placed at the root of the site. For example, the [robots.txt](https://github.com/robots.txt) file outlines the crawling rules for GitHub.
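
As a rough illustration of how a crawler can honor these rules, the sketch below uses Python's standard `urllib.robotparser` module to check URLs against a robots.txt file. The file content, the `myCrawler` user agent, and the `example.com` URLs are all made up for this example and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not a copy of any real site's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# "myCrawler" is an assumed user-agent name for this sketch.
allowed = parser.can_fetch("myCrawler", "https://example.com/public/page.html")
blocked = parser.can_fetch("myCrawler", "https://example.com/private/data.html")
print(allowed, blocked)
```

Against a live site, `set_url()` followed by `read()` would download the file instead of parsing a string. Scrapy can do this check itself when the `ROBOTSTXT_OBEY` setting is enabled.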

## Crawling with Scrapy

The first step is to create a project in a directory of your choice. 


In [1]:
%%bash
# uncomment the startproject command below and replace wikiCrawler with your project name
#scrapy startproject wikiCrawler
cd wikiCrawler/
# examine the folder structure with tree
tree
# move to the spiders folder and create articleSpiders.py
cd wikiCrawler/spiders
touch articleSpiders.py

.
├── scrapy.cfg
├── wikiCrawler
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-36.pyc
│   │   ├── items.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── settings.py
│   └── spiders
│       ├── articleSpiders.py
│       ├── __init__.py
│       └── __pycache__
│           ├── articleSpiders.cpython-36.pyc
│           └── __init__.cpython-36.pyc
└── wiki_pages.json

4 directories, 14 files


In [2]:
%%writefile wikiCrawler/wikiCrawler/items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field

#each item is a page
class WikicrawlerItem(Item):
    # define the fields for your item here like:
    name = Field()
    url = Field()
    #text = Field()
    #header = Field()

Overwriting wikiCrawler/wikiCrawler/items.py


In [3]:
%%bash
python wikiCrawler/wikiCrawler/items.py

Now that we understand the folder structure and the items we want from each page, we can fill in the articleSpiders.py file we created.

In [19]:
%%writefile wikiCrawler/wikiCrawler/spiders/articleSpiders.py
import sys
import os.path

sys.path.append(os.path.join(os.path.dirname(__file__), '..'))

from items import WikicrawlerItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup


class ArticleSpider(CrawlSpider):
    '''Spider that collects the title and URL of Wikipedia article pages.'''
    name = "wiki_pages"
    allowed_domains = ["en.wikipedia.org"]
    #identify a list of start urls for scraping
    start_urls = ["https://en.wikipedia.org/wiki/Main_Page", "https://en.wikipedia.org/wiki/History_of_Python"]
    # Skip Wikipedia pages that do not provide useful article content.
    # Raw strings keep the regex escapes (\.) intact.
    # Because a callback is set, follow defaults to False, so only links
    # found on the start pages are requested.
    rules = (Rule(LinkExtractor(deny=[
        r"https://en\.wikipedia\.org/wiki/Wikipedia.*",
        r"https://en\.wikipedia\.org/wiki/Main_Page",
        r"https://en\.wikipedia\.org/wiki/Free_Content",
        r"https://en\.wikipedia\.org/wiki/Talk.*",
        r"https://en\.wikipedia\.org/wiki/Portal.*",
        r"https://en\.wikipedia\.org/wiki/Special.*"
    ]), callback='parse_wiki'),)
    
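    def parse_wiki(self, response):
        item = WikicrawlerItem()
        # name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
        soup = BeautifulSoup(response.body, "html.parser")

        item['url'] = response.url
        # get_text() also works when the heading contains nested tags,
        # where .string would return None
        heading = soup.find("h1", {"id": "firstHeading"})
        item['name'] = heading.get_text(strip=True) if heading else None

        return item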


Overwriting wikiCrawler/wikiCrawler/spiders/articleSpiders.py
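
To see what the deny list in the spider does, the sketch below applies the same patterns to a few URLs using only the standard `re` module. It assumes the link extractor searches each pattern against the absolute URL, which is my understanding of Scrapy's behavior rather than something shown in this notebook. The `is_denied` helper and the candidate URLs are hypothetical and exist only for this illustration.

```python
import re

# Same patterns as the spider's deny list, written as raw strings.
DENY_PATTERNS = [
    r"https://en\.wikipedia\.org/wiki/Wikipedia.*",
    r"https://en\.wikipedia\.org/wiki/Main_Page",
    r"https://en\.wikipedia\.org/wiki/Free_Content",
    r"https://en\.wikipedia\.org/wiki/Talk.*",
    r"https://en\.wikipedia\.org/wiki/Portal.*",
    r"https://en\.wikipedia\.org/wiki/Special.*",
]
DENY_REGEXES = [re.compile(pattern) for pattern in DENY_PATTERNS]


def is_denied(url):
    """Return True if any deny pattern is found in the URL."""
    return any(regex.search(url) for regex in DENY_REGEXES)


# Hypothetical candidate links, chosen to exercise several patterns.
candidates = [
    "https://en.wikipedia.org/wiki/History_of_Python",
    "https://en.wikipedia.org/wiki/Talk:Python_(programming_language)",
    "https://en.wikipedia.org/wiki/Special:Random",
    "https://en.wikipedia.org/wiki/Wikipedia:About",
]
kept = [url for url in candidates if not is_denied(url)]
print(kept)  # only the History_of_Python URL survives the filter
```

Checking patterns this way before a crawl is a cheap way to confirm that the filter keeps article pages and drops the namespaces you meant to exclude.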


In [20]:
%%bash
#run the file first to ensure no errors
python wikiCrawler/wikiCrawler/spiders/articleSpiders.py
# now move to the directory that contains scrapy.cfg
cd wikiCrawler
scrapy crawl wiki_pages -o wiki_pages.csv -t csv

2018-01-25 23:46:52 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: wikiCrawler)
2018-01-25 23:46:52 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.3.1, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49) - [GCC 7.2.0], pyOpenSSL 17.2.0 (OpenSSL 1.0.2l  25 May 2017), cryptography 2.0.3, Platform Linux-4.8.0-59-generic-x86_64-with-debian-stretch-sid
2018-01-25 23:46:52 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'wikiCrawler', 'FEED_FORMAT': 'csv', 'FEED_URI': 'wiki_pages.csv', 'NEWSPIDER_MODULE': 'wikiCrawler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['wikiCrawler.spiders']}
2018-01-25 23:46:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2018-01-25 23:46:53 [scrapy