[View in Colaboratory](https://colab.research.google.com/github/samirpoonawala/learning-data-science/blob/master/scaping_data_from_the_web.ipynb)

# (Treehouse) Scraping Data from the Web
[course link](https://teamtreehouse.com/library/scraping-data-from-the-web)

Almost any information you want is available on the Internet. Web scraping is a key data-mining tool: it lets you explore web pages and collect their contents for a variety of reporting needs. The tools and techniques used in this course let you collect data that would be hard to gather without automation.

What you'll learn:


*   An introduction to the Beautiful Soup Python package
*   How to scrape a web page with Beautiful Soup
*   An introduction to the Scrapy Python package
*   How to crawl a website with Scrapy
*   Web scraping considerations



# Introducing Data Scraping

A look at what data scraping is and how it is used. We'll discuss how a web page is structured and use the Python package Beautiful Soup to scrape data from the web.

## What is data scraping?
A high-level overview of web scraping in Python: what it is, what it isn't, and how it can be used.


*  Web scraping is the automated collection of data from the web by any means other than a program interacting with a web API.

## Web page anatomy

Let's take a brief look at how an HTML page is structured so we can better understand how to navigate a page when scraping.


*   [Horse Land](https://treehouse-projects.github.io/horse-land/index.html) website
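
To make the idea of page structure concrete, here is a minimal sketch that prints the nesting of tags in a small HTML document using only Python's standard library. The sample markup, the `OutlineParser` class, and its attribute names are made up for illustration and are not taken from the course site.

```python
from html.parser import HTMLParser

# Hypothetical sample page, loosely modeled on a simple site layout.
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <div class="featured">
      <h2>Featured item</h2>
      <a href="detail.html">Read more</a>
    </div>
  </body>
</html>
"""

class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = OutlineParser()
parser.feed(SAMPLE_HTML)

# Indent each tag by its depth to show the tree a scraper has to navigate.
for depth, tag, attrs in parser.outline:
    print("  " * depth + tag, attrs)
```

The indentation in the output mirrors the parent/child relationships that scraping tools walk when you ask for, say, the `h2` inside a `div` with a given class.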



## Beautiful Soup
Introducing the Python web scraping package, Beautiful Soup

In [0]:
!pip install beautifulsoup4

In [0]:
# Python 3 import; on Python 2 use: from urllib2 import urlopen
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://treehouse-projects.github.io/horse-land/index.html")
soup = BeautifulSoup(html.read(), 'html.parser')

# print out page
#print(soup.prettify())

In [0]:
# print page title
print(soup.title)

In [0]:
# find all divs on the website
divs = soup.find_all('div')
for div in divs:
  print(div)

In [0]:
# filter divs by class
featured_divs = soup.find_all('div', {'class':'featured'})
for featured_div in featured_divs:
  print(featured_div)

## More Soup in the Tureen
Let's look at two Beautiful Soup methods, 'find()' and 'find_all()', in greater detail

In [0]:
# use find to find first instance of an item
div = soup.find('div', {'class':'featured'})
print(div)

In [0]:
# chaining elements together
featured_header = soup.find('div', {'class':'featured'}).h2
print(featured_header)

In [0]:
# just the text (w/o h2 tag)
# use as last step of the scraping process (harder to work with during scraping process)
featured_header = soup.find('div', {'class':'featured'}).h2
print(featured_header.get_text())

In [0]:
# print all elements with the primary button classes
# (find_all returns every match; find returns only the first)
for button in soup.find_all(attrs={'class':'button button--primary'}):
  print(button)

In [0]:
# shortcut for above (class is a reserved word in Python, so Beautiful Soup uses class_)
for button in soup.find_all(class_='button button--primary'):
  print(button)

In [0]:
# get all hyperlinks on a specific page

for link in soup.find_all('a'):
  print(link.get('href'))

## Being a Good Citizen
Just because we can do something doesn't mean we always should. Let's take a look at some of the responsibilities that come with the power of web scraping.

*   Web scraping legal claims (USA): copyright infringement; Computer Fraud and Abuse Act (CFAA); trespass to chattels
*   EU: Directive 96/9/EC (Database Directive)
*   Australia: Spam Act 2003
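
One practical way to act responsibly is to honor a site's `robots.txt` before fetching anything. The sketch below uses the standard library's `urllib.robotparser`; the rules, the `example.com` URLs, and the `example-bot` user agent are hypothetical, and respecting `robots.txt` does not by itself settle any of the legal questions listed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules. A real crawler would load the site's own
# file with set_url(...) followed by read() instead of parsing a literal.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

robots = RobotFileParser()
robots.parse(ROBOTS_LINES)

def allowed(url, user_agent="example-bot"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return robots.can_fetch(user_agent, url)

print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/data.html"))
print(robots.crawl_delay("example-bot"))
```

A scraper could call `allowed()` before each request and wait at least the reported crawl delay between requests.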




## Everyone Loves Charlotte
We've seen how to scrape data from a single page. Now let's see how we can capture links on one page and follow them to process additional pages.

In [0]:
from urllib.request import urlopen  # on Python 2: from urllib2 import urlopen
from bs4 import BeautifulSoup

import re

site_links = []

def internal_links(linkURL):
  html = urlopen('https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
  soup = BeautifulSoup(html, 'html.parser')
  
  # first link ending in .html, or None if the page has no such link
  return soup.find('a', href=re.compile(r'\.html$'))

if __name__ == '__main__':
  link = internal_links('index.html')
  # find() returns a single Tag or None, so test for None rather than len()
  while link is not None:
    page = link.attrs['href']
    if page not in site_links:
      site_links.append(page)

      print(page)

      print('\n==================\n')

      link = internal_links(page)
    else:
      break
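
The loop above follows only the first matching link on each page. As a rough sketch of how every link could be followed while visiting each page only once, the example below runs a breadth-first crawl over a small in-memory "site". The `PAGES` dictionary, the `crawl` function, and the page contents are hypothetical stand-ins so the example runs without network access; a real crawler would fetch each page with `urlopen`.

```python
import re
from collections import deque

# Hypothetical in-memory "site": page name -> HTML body.
PAGES = {
    "index.html": '<a href="mustang.html">Mustang</a> <a href="about.html">About</a>',
    "mustang.html": '<a href="index.html">Home</a> <a href="about.html">About</a>',
    "about.html": '<a href="index.html">Home</a> <a href="https://example.com/">External</a>',
}

# Capture href values that end in .html (the dot is escaped so it is literal).
HTML_LINK = re.compile(r'href="([^"]+\.html)"')

def crawl(start):
    """Breadth-first crawl that visits every reachable page exactly once."""
    visited = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Skip pages already seen and links that point outside the site.
        if page in visited or page not in PAGES:
            continue
        visited.append(page)
        queue.extend(HTML_LINK.findall(PAGES[page]))
    return visited

print(crawl("index.html"))
```

Tracking visited pages is what keeps the crawl from looping forever when pages link back to each other, which is the same role `site_links` plays in the cell above.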

## Installing Scrapy
Getting up and going with the Scrapy library

In [0]:
# install scrapy
!pip install scrapy

In [0]:
# scrapy project setup: terminal command
!scrapy startproject AraneaSpider

In [0]:
# !cd runs in a throwaway subshell; the %cd magic changes the notebook's working directory
%cd /usr/local/lib/python2.7/dist-packages/scrapy/templates/project

In [0]:
!ls

## Crawling Spiders

Let's use the Python library Scrapy to create a spider to crawl the web.

In [0]:
# new file to crawl site
import scrapy

class HorseSpider(scrapy.Spider):
  name = 'ike'
  def start_requests(self):
    urls = ['https://treehouse-projects.github.io/horse-land/index.html', 
            'https://treehouse-projects.github.io/horse-land/mustang.html']
    
    return [scrapy.Request(url=url, callback=self.parse) for url in urls]
  
  def parse(self, response):
    url = response.url
    page = url.split('/')[-1]
    filename = 'horses-%s' % page
    print('URL is {}'.format(url))
    with open(filename, 'wb') as file:
      file.write(response.body)
    print("Saved file %s" % filename)
    
    
# NOTE: getting error: 'no module named zope.interface'

In [0]:
# tried using this to fix error in cell above
!pip install zope.interface

In [0]:
# execute script above in terminal
# after navigating file directory to spiders folder

!scrapy crawl ike

## The Endless Web
Let's further explore how to crawl the web


*   The first scraper scraped a static list of URLs



In [0]:
# crawler.py

# imports
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class HorseSpider(CrawlSpider):
  
  name = 'Whirlaway'
  
  allowed_domains = ['treehouse-projects.github.io']
  
  start_urls = ['http://treehouse-projects.github.io/horse-land']
  
  # follow every link on the allowed domain and hand each page to parse_horses
  rules = [Rule(LinkExtractor(allow=r'.*'),
                callback='parse_horses',
                follow=True)]
  
  def parse_horses(self, response):
    url = response.url
    title = response.css('title::text').extract()
    print("Page URL: {}".format(url))
    print("Page title: {}".format(title))

In [0]:
!scrapy crawl Whirlaway

# Additional Scraping Tasks
Going beyond static web pages can be a challenge when scraping. Working with web forms and APIs can require a different approach. We'll also touch on how to write tests for a web scraper.



## An Intelligent Scraper

Forms are a big part of many websites. Scrapy provides a FormRequest class for handling them.

In [0]:
# formSpider.py

from scrapy.http import FormRequest
from scrapy.spiders import Spider

class FormSpider(Spider):
  name = 'horseForm'
  start_urls = ["https://treehouse-projects.github.io/horse-land/form.html"]
  
  def parse(self, response):
    formdata = {'firstname':'Samir',
               'lastname': "Poonawala",
               "title": "Partner and Chief Financial Officer"}
    
    # from_response pre-fills the request from the form found in the page
    return FormRequest.from_response(response, formnumber=0, 
                                     formdata=formdata, callback=self.after_post)
  
  def after_post(self, response):
    print("\n\n******\nForm processed.\n")
    print(response)
    print("\n******\n")

In [0]:
!scrapy crawl horseForm
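
For intuition about what a form submission involves, here is a minimal standard-library sketch that builds (but does not send) a POST request with URL-encoded fields. The field values are placeholders, and the assumption that the form posts back to its own URL is made only for illustration.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

form_url = "https://treehouse-projects.github.io/horse-land/form.html"
formdata = {"firstname": "Ada", "lastname": "Lovelace"}  # placeholder values

# URL-encode the fields the way a browser encodes a simple form body.
body = urlencode(formdata).encode("ascii")

# Supplying data makes urllib treat the request as a POST.
request = Request(form_url, data=body)

print(request.get_method())
print(body.decode("ascii"))
```

Nothing is sent here; passing `request` to `urlopen` would perform the submission. A form-aware helper such as Scrapy's `FormRequest` additionally reads the form's own action, method, and hidden fields from the page.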

## Scraping APIs

APIs are all around us on the web. Sometimes we can use scraping techniques to interact with them in a meaningful way.

In [0]:
#world_bank.py

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

def get_country(country_code):
  html = urlopen("http://api.worldbank.org/v2/countries/{}".format(country_code))
  
  # the 'xml' parser requires the lxml package
  soup = BeautifulSoup(html, 'xml')
  
  country_name = soup.find('wb:name')
  region = soup.find('wb:region')
  # XML tag names are case-sensitive; this assumes the API spells the element wb:incomeLevel
  income_level = soup.find('wb:incomeLevel')
  
  print(country_name.get_text())
  print(region.get_text())
  print(income_level.get_text())
  
if __name__ == "__main__":
  # references csv file included in project files with course
  with open("country_iso_codes.csv", "r") as file:
    iso_codes = csv.reader(file, delimiter=',')
  
    for code in iso_codes:
      get_country(code[0])
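
As an alternative sketch that needs no third-party parser, namespaced XML like this can also be read with the standard library's `xml.etree.ElementTree`. The sample document, its field values, the `parse_country` helper, and the namespace URI below are assumptions for illustration, not a captured API response.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body shaped like a namespaced XML country record.
SAMPLE_XML = """<wb:countries xmlns:wb="http://www.worldbank.org">
  <wb:country id="XXX">
    <wb:name>Exampleland</wb:name>
    <wb:region id="XR">Example Region</wb:region>
    <wb:incomeLevel id="XI">Example income</wb:incomeLevel>
  </wb:country>
</wb:countries>
"""

# Map the prefix used in the queries to the namespace URI (assumed here).
NS = {"wb": "http://www.worldbank.org"}

def parse_country(xml_text):
    """Extract the name, region, and income level from one country record."""
    root = ET.fromstring(xml_text)
    country = root.find("wb:country", NS)
    return {
        "name": country.findtext("wb:name", namespaces=NS),
        "region": country.findtext("wb:region", namespaces=NS),
        "income_level": country.findtext("wb:incomeLevel", namespaces=NS),
    }

print(parse_country(SAMPLE_XML))
```

Returning a dictionary rather than printing inside the function makes it easier to write the results to a CSV file or to check them in a test.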

## Using scrapers for site testing

Web scraping doesn't have to entirely be about scraping data for processing. Web scraping tools can be used to test websites as well.

In [0]:
#horse_test.py

from urllib.request import urlopen
from bs4 import BeautifulSoup

import unittest

class TestHorseLand(unittest.TestCase):
  soup = None
  
  # unittest calls setUpClass on the class itself, so it must be a classmethod
  @classmethod
  def setUpClass(cls):
    url = "https://treehouse-projects.github.io/horse-land/index.html"
    cls.soup = BeautifulSoup(urlopen(url), "html.parser")
    
  def test_header1(self):
    header1 = TestHorseLand.soup.find('h1').get_text()
    self.assertEqual("Horse Land", header1)
    
if __name__ == '__main__':
  unittest.main()
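
A test like the one above depends on the live site being reachable. One possible variation, sketched here with only the standard library, runs the same kind of check against a saved copy of the page. `FIXTURE_HTML`, `HeadingParser`, and `TestFixture` are hypothetical names, and the fixture markup is a stand-in rather than the real page.

```python
import unittest
from html.parser import HTMLParser

# Hypothetical saved copy of the page under test.
FIXTURE_HTML = "<html><body><h1>Horse Land</h1><p>Welcome</p></body></html>"

class HeadingParser(HTMLParser):
    """Collect the text of every <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings[-1] += data

class TestFixture(unittest.TestCase):
    def test_header1(self):
        parser = HeadingParser()
        parser.feed(FIXTURE_HTML)
        self.assertEqual(["Horse Land"], parser.headings)

# Run the suite without exiting the interpreter (useful inside a notebook).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFixture)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running the suite through `TextTestRunner` instead of `unittest.main()` avoids the `SystemExit` that `unittest.main()` raises when it finishes inside a notebook cell.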

In [0]:
# horse_test_selenium

from bs4 import BeautifulSoup
from selenium import webdriver

import time

driver = webdriver.Chrome()

driver.get("https://treehouse-projects.github.io/horse-land/index.html")

# crude fixed wait so the page's JavaScript has time to render
time.sleep(5)

page_html = driver.page_source

soup = BeautifulSoup(page_html, 'html.parser')

print(soup.prettify())

# quit() closes every window and ends the browser session
driver.quit()

## Common Issues with Data Scraping

### Challenges in Data Scraping
We've discussed a few of the challenges already. Topics such as bot access and legal crawling can pose hurdles to scraping data.

Other things to watch out for that can pose hurdles or outright walls to your data scraping include:

*   User authentication and CAPTCHAs
*   Honeypots
*   Structural site changes
*   IP blocking
*   Latency

We've seen that JavaScript poses challenges of its own, but dynamic websites in general are challenging, especially those that use AJAX mechanisms.

### Potential Solutions

*   User authentication can be handled much like forms, and the loginform library can help with these tasks in Scrapy.
*   CAPTCHAs can be worked around with various technologies, but they can still severely slow down the scraping process.
*   Site changes force web scraping developers to keep up to date with the "target" sites and may require spiders and scraping tools to be rewritten.
*   One way around IP blocking is to use multiple IP addresses for your scraping efforts.

### Other thoughts
Some things to think about when deciding how best to handle scraping websites:

*   Be polite and honest about your scraping intentions.
*   Minimize the load on any single website that you scrape. Scraping can put a heavy load on its web servers. One technique to handle this is to cache the pages you crawl so that you don't have to load them again.
*   Make your efforts as inconspicuous as possible to reduce suspicion from target websites.
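
The caching and load-reduction advice can be sketched as a small wrapper that fetches each URL at most once and pauses between real requests. `CachedFetcher`, `fake_fetch`, and the delay value are hypothetical. The stand-in fetch function keeps the example offline, and a real one might wrap `urllib.request.urlopen`.

```python
import time

class CachedFetcher:
    """Fetch each URL at most once and pause between real requests."""

    def __init__(self, fetch, delay_seconds=1.0, sleep=time.sleep):
        self.fetch = fetch                  # callable: url -> page content
        self.delay_seconds = delay_seconds
        self.sleep = sleep                  # injectable so tests can skip waiting
        self.cache = {}
        self.requests_made = 0

    def get(self, url):
        if url in self.cache:
            return self.cache[url]          # cache hit: no load on the server
        if self.requests_made > 0:
            self.sleep(self.delay_seconds)  # space out successive requests
        self.cache[url] = self.fetch(url)
        self.requests_made += 1
        return self.cache[url]

# Stand-in fetch function used only for this demonstration.
def fake_fetch(url):
    return "<html>content of {}</html>".format(url)

waits = []
fetcher = CachedFetcher(fake_fetch, delay_seconds=2.0, sleep=waits.append)
fetcher.get("index.html")
fetcher.get("index.html")     # served from the cache
fetcher.get("mustang.html")   # second real request, so one delay is recorded
print(fetcher.requests_made, waits)
```

Passing the sleep function in as a parameter is a design choice that lets the delay be recorded instead of actually waited for, which keeps the demonstration and any tests fast.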

## Wrapping Up

