# What if we want data from the Graduate Center Website?
![Graduate Center Home Page](figs/gchome.png)

The [Scrapy](https://scrapy.org/) library is designed to pull data from websites when a site has no API or its API doesn't give you what you need. If you only need to download pages or files in bulk, it's often worth trying the [DownThemAll](https://www.downthemall.net/) browser extension first.

If you haven't installed Scrapy yet, [open a terminal](https://github.com/GCDigitalFellows/installdri.github.io/blob/master/anaconda.md) and type:
```bash
conda install -c conda-forge scrapy -y
```

In [1]:
# scrapy tutorial is at 
# https://docs.scrapy.org/en/latest/intro/tutorial.html
import scrapy
from scrapy.crawler import CrawlerProcess


In [2]:
# this helps when running the scraper in a notebook:
# it shows the value of every expression in a cell, not just the last one
# https://www.jitsejan.nl/using-scrapy-in-jupyter-notebook.html
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
import scrapy

class GCSpider(scrapy.Spider):
    name = "gc"

    def start_requests(self):
        urls = ['https://www.gc.cuny.edu/Home', 
                'https://www.gc.cuny.edu/Prospective-Current-Students/Current-Students']
        for url in urls:
            yield scrapy.Request(url=url, 
                                 callback=self.parse)

    def parse(self, response):
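        # name the file after the second-to-last piece of the URL;
        # for https://www.gc.cuny.edu/Home that piece is the host name
        page = response.url.split("/")[-2]
        filename = f'gc-{page}.html'
        # response.body is bytes, so the file is written in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')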
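
The spider above takes the second-to-last piece of the URL as the page name, which is why the log reports the home page as `gc-www.gc.cuny.edu.html`. The sketch below shows one possible alternative that uses the last path segment instead. The helper `page_name` is hypothetical and not part of Scrapy. If you adopt it, the filename opened later in this notebook would have to change to match.

```python
from urllib.parse import urlparse


def page_name(url):
    """Return the last non-empty path segment, or the host if the path is empty."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    return segments[-1] if segments else parsed.netloc


urls = [
    "https://www.gc.cuny.edu/Home",
    "https://www.gc.cuny.edu/Prospective-Current-Students/Current-Students",
]
for url in urls:
    # left: what the spider computes; right: the alternative
    print(url.split("/")[-2], "->", page_name(url))
```

The first URL prints `www.gc.cuny.edu -> Home` and the second prints `Prospective-Current-Students -> Current-Students`.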

In [4]:
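process = CrawlerProcess()
process.crawl(GCSpider)
# start() blocks until the crawl is finished; it can only be called once
# per kernel session, so restart the kernel before running this cell again
process.start()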

2018-04-27 14:33:02 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-04-27 14:33:02 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:54:01) - [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)], pyOpenSSL 17.4.0 (OpenSSL 1.0.2n  7 Dec 2017), cryptography 2.1.4, Platform Darwin-15.6.0-x86_64-i386-64bit
2018-04-27 14:33:02 [scrapy.crawler] INFO: Overridden settings: {}
2018-04-27 14:33:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-04-27 14:33:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downlo

<Deferred at 0x113e00208>

2018-04-27 14:33:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gc.cuny.edu/Home> (referer: None)
2018-04-27 14:33:03 [gc] DEBUG: Saved file gc-www.gc.cuny.edu.html
2018-04-27 14:33:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.gc.cuny.edu/Prospective-Current-Students/Current-Students> (referer: None)
2018-04-27 14:33:04 [gc] DEBUG: Saved file gc-Prospective-Current-Students.html
2018-04-27 14:33:04 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-27 14:33:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 477,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 86692,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 27, 18, 33, 4, 361366),
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'memusage/max': 67903488,
 'memusage/startup': 67903488,
 'response_receiv

While Scrapy supports robust parsing, its selectors are written as XPath expressions (or CSS selectors), which take some time to learn. Sometimes it's easier to just save the page and use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the parsing. Install it using:
```bash
conda install -c conda-forge beautifulsoup4 -y
```
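
To give a feel for what an XPath expression looks like, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in, not Scrapy itself. ElementTree only understands a small subset of XPath and needs well-formed markup, so the HTML snippet is made up for illustration. In a Scrapy spider, expressions like these would be passed to `response.xpath(...)` instead.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a web page (made up for illustration).
html = """
<html>
  <head>
    <title>Example page</title>
    <meta name="description" content="A short description." />
  </head>
  <body>
    <a href="/Home">Home</a>
    <a href="/About">About</a>
  </body>
</html>
"""

root = ET.fromstring(html.strip())

# './/a' selects every <a> element anywhere below the root
links = [a.get("href") for a in root.findall(".//a")]

# a predicate in square brackets filters on an attribute value
description = root.find(".//meta[@name='description']").get("content")

print(links)
print(description)
```

Real pages are rarely well-formed XML, which is one reason to reach for Scrapy's selectors or BeautifulSoup in practice.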


In [5]:
from bs4 import BeautifulSoup
# open the saved file and pass the file handle to BeautifulSoup;
# binary mode lets BeautifulSoup work out the page's encoding itself
with open("gc-www.gc.cuny.edu.html", "rb") as html_doc:
    soup = BeautifulSoup(html_doc, 'lxml')

`soup` is now a parsed HTML document that we can traverse using the DOM ([Document Object Model](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model)).
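
As a rough illustration of what traversing the document means, the sketch below parses a small made-up HTML string (not the GC page) and walks from tag to tag. It uses Python's built-in `html.parser` so that it runs without `lxml`, and the variable name `demo` is only there to avoid overwriting `soup`.

```python
from bs4 import BeautifulSoup

# A small made-up page, used only for illustration.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <div id="nav">
      <a href="/Home">Home</a>
      <a href="/About">About</a>
    </div>
    <p>Hello, <b>world</b>.</p>
  </body>
</html>
"""

demo = BeautifulSoup(html, "html.parser")

# tags nest like a tree: go down with dotted names, up with .parent
print(demo.title.string)
print(demo.p.b.parent.name)

# find() returns the first match; find_all() returns every match
nav = demo.find("div", id="nav")
link_texts = [link.get_text() for link in nav.find_all("a")]
print(link_texts)
```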

For a real-world example of parsing with BeautifulSoup, see https://github.com/taspinar/twitterscraper/blob/master/twitterscraper/tweet.py

In [6]:
soup.text[:100]

' \n\n\tHome\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nvar ajaxConfig = new Array();\nvar calendarConfig = new Array();\nv'

# What is the DOM?

In [7]:
# any tag in the document is also an attribute of the soup object;
# soup.meta returns the first <meta> tag
soup.meta

<meta content="The Graduate Center, The City University of New York
Established in 1961, the Graduate Center of the City University of New York (CUNY) is devoted primarily to doctoral studies and awards most of CUNY's doctoral degrees. An internationally recognized center for advanced studies and a national model for public doctoral education, the Graduate Center offers more than thirty doctoral programs in the arts, humanities, social sciences, and the natural sciences, as well as a number of master's programs. Many of its faculty members are among the world's leading scholars in their respective fields. The school currently enrolls over 4700 students from throughout the United States, as well as from about eighty foreign countries, and its alumni hold major positions in industry and government, as well as in academia. The Graduate Center is also home to more than thirty interdisciplinary research centers and institutes focused on areas of compelling social, civic, cultural, and scien

In [8]:
soup.find_all('meta')

[<meta content="The Graduate Center, The City University of New York
 Established in 1961, the Graduate Center of the City University of New York (CUNY) is devoted primarily to doctoral studies and awards most of CUNY's doctoral degrees. An internationally recognized center for advanced studies and a national model for public doctoral education, the Graduate Center offers more than thirty doctoral programs in the arts, humanities, social sciences, and the natural sciences, as well as a number of master's programs. Many of its faculty members are among the world's leading scholars in their respective fields. The school currently enrolls over 4700 students from throughout the United States, as well as from about eighty foreign countries, and its alumni hold major positions in industry and government, as well as in academia. The Graduate Center is also home to more than thirty interdisciplinary research centers and institutes focused on areas of compelling social, civic, cultural, and sci
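
`find_all` returns every matching tag, which on a real page can be a long list. A minimal sketch of narrowing the search, again on a made-up snippet (not the GC page): the `attrs` argument filters on attribute values. It is needed here because `name` is already the keyword `find_all` uses for the tag name.

```python
from bs4 import BeautifulSoup

# A made-up <head>, used only for illustration.
html = """
<head>
  <meta charset="utf-8">
  <meta name="description" content="A short description.">
  <meta name="keywords" content="example, demo">
</head>
"""

demo = BeautifulSoup(html, "html.parser")

# every <meta> tag
all_meta = demo.find_all("meta")

# only <meta> tags that have a name attribute, whatever its value
named = demo.find_all("meta", attrs={"name": True})

# the first <meta> tag whose name is exactly "description"
description = demo.find("meta", attrs={"name": "description"})

print(len(all_meta), len(named))
print(description.get("content"))
```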

In [9]:
# attributes of a tag are accessed like a dictionary (key/value pairs)
soup.meta['content']

"The Graduate Center, The City University of New York\nEstablished in 1961, the Graduate Center of the City University of New York (CUNY) is devoted primarily to doctoral studies and awards most of CUNY's doctoral degrees. An internationally recognized center for advanced studies and a national model for public doctoral education, the Graduate Center offers more than thirty doctoral programs in the arts, humanities, social sciences, and the natural sciences, as well as a number of master's programs. Many of its faculty members are among the world's leading scholars in their respective fields. The school currently enrolls over 4700 students from throughout the United States, as well as from about eighty foreign countries, and its alumni hold major positions in industry and government, as well as in academia. The Graduate Center is also home to more than thirty interdisciplinary research centers and institutes focused on areas of compelling social, civic, cultural, and scientific concern

1. Find the content of the `meta` tag on https://www.gc.cuny.edu/Prospective-Current-Students/Current-Students
2. Scrape another page on the GC website.