# What if we want data from the Graduate Center Website?
![Graduate Center Home Page](figs/gchome.png)

The [Scrapy](https://scrapy.org/) library is designed to pull data from websites that offer no API, or whose API doesn't expose what you need. If all you need is to download pages or files in bulk, it's often worth trying the [DownThemAll](https://www.downthemall.net/) browser extension first.

If you haven't installed Scrapy yet, [open a terminal](https://github.com/GCDigitalFellows/installdri.github.io/blob/master/anaconda.md) and type:
```bash
conda install -c conda-forge scrapy -y
```

In [None]:
# scrapy tutorial is at 
# https://docs.scrapy.org/en/latest/intro/tutorial.html
import scrapy
from scrapy.crawler import CrawlerProcess


In [None]:
# this helps the scraper run in a notebook
# https://www.jitsejan.nl/using-scrapy-in-jupyter-notebook.html
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import scrapy

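class GCSpider(scrapy.Spider):
    # unique name that identifies this spider
    name = "gc"

    def start_requests(self):
        # pages to download; each response is handed to self.parse
        urls = ['https://www.gc.cuny.edu/Home',
                'https://www.gc.cuny.edu/Prospective-Current-Students/Current-Students']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # second-to-last piece of the URL; for .../Home this is the
        # domain, so that page is saved as gc-www.gc.cuny.edu.html
        page = response.url.split("/")[-2]
        filename = f'gc-{page}.html'
        # response.body is bytes, so write in binary mode
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')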
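
The filename logic in `parse` can be checked without running a crawl. The sketch below applies the same `split("/")[-2]` expression to the two start URLs; the helper name `filename_for` is made up for illustration and is not part of Scrapy.

```python
# The two start URLs used by the spider
urls = [
    "https://www.gc.cuny.edu/Home",
    "https://www.gc.cuny.edu/Prospective-Current-Students/Current-Students",
]


def filename_for(url):
    """Hypothetical helper mirroring the filename logic in GCSpider.parse."""
    page = url.split("/")[-2]  # second-to-last piece of the URL
    return f"gc-{page}.html"


filenames = [filename_for(url) for url in urls]
for name in filenames:
    print(name)
# gc-www.gc.cuny.edu.html
# gc-Prospective-Current-Students.html
```

Because neither URL ends in a slash, the second-to-last piece is the domain for the home page and the parent section for the student page.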

In [None]:
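# Run the spider. process.start() blocks until the crawl is finished.
# It can only be called once per kernel session: restart the kernel
# before running this cell again.
process = CrawlerProcess()
process.crawl(GCSpider)
process.start()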

While Scrapy supports really robust parsing, it requires understanding the XPath expression language. Sometimes it's easier to just save the page and use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the parsing. Install it using:
```bash
conda install -c conda-forge beautifulsoup4 -y
```


In [None]:
from bs4 import BeautifulSoup
# Open the file saved by the spider and pass the file handle to BeautifulSoup
with open("gc-www.gc.cuny.edu.html") as html_doc:
    soup = BeautifulSoup(html_doc, 'lxml')

`soup` is now a parsed HTML document that we can traverse using the DOM ([Document Object Model](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model)).

For an example of BeautifulSoup parsing in a larger project, see https://github.com/taspinar/twitterscraper/blob/master/twitterscraper/tweet.py

In [None]:
soup.text[:100]

# What is the DOM?
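
The DOM represents an HTML page as a tree of nested tags, where each tag has a parent and may have children. As a rough sketch of what that means in BeautifulSoup, the example below parses a small made-up HTML string (not the GC page) with the built-in `html.parser` and walks up and down the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, just to show the tree structure
html = (
    "<html><head><title>Example Page</title></head>"
    "<body><p>Hello, <b>GC</b>!</p></body></html>"
)
doc = BeautifulSoup(html, "html.parser")

# Every tag in the document, in the order it appears
all_tags = [tag.name for tag in doc.find_all(True)]
print(all_tags)  # ['html', 'head', 'title', 'body', 'p', 'b']

# Moving up the tree: the <title> tag lives inside <head>
print(doc.title.parent.name)  # head

# All ancestors of the <b> tag, from nearest to farthest
ancestors = [parent.name for parent in doc.b.parents]
print(ancestors[:3])  # ['p', 'body', 'html']

# The text inside a tag
print(doc.title.string)  # Example Page
```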

In [None]:
# any tag in the document is also an attribute of the soup object;
# soup.meta returns the first <meta> tag
soup.meta

In [None]:
soup.find_all('meta')

In [None]:
# attributes of a tag are accessed like a dictionary (key, value pairs);
# this raises a KeyError if the first <meta> tag has no content attribute
soup.meta['content']
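
Since `soup.meta` only returns the first `<meta>` tag, and not every `<meta>` tag has a `content` attribute, it can help to look tags up more defensively. The sketch below uses a made-up HTML string (not the GC page) and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical <head> with several meta tags
html = """
<html><head>
<meta charset="utf-8">
<meta name="description" content="A made-up description">
<meta name="keywords" content="example, scraping">
</head><body></body></html>
"""
doc = BeautifulSoup(html, "html.parser")

# The first <meta> tag has no content attribute
first_meta = doc.meta
print(first_meta.attrs)           # {'charset': 'utf-8'}
print(first_meta.get("content"))  # None, instead of a KeyError

# Find a specific meta tag by one of its attributes
description = doc.find("meta", attrs={"name": "description"})
print(description["content"])     # A made-up description

# Collect the content of every meta tag that has one
contents = [meta["content"] for meta in doc.find_all("meta", content=True)]
print(contents)
```

Using `.get()` returns `None` for a missing attribute, which is often more convenient than handling a `KeyError` when looping over many tags.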

1. Find the content of the meta tag on https://www.gc.cuny.edu/Prospective-Current-Students/Current-Students.
2. Scrape another page on the GC website.