# Using Scrapy to make a spider for HLHW Workshop events
Every year, a page of the upcoming HLHW events is posted on the Library web site. The events for the [2018-2019](https://libraries.ucsd.edu/visit/library-workshops/holocaust-living-history-workshop/events/2018-2019.html) season are up. Can we make a bot that will scrape the page so we don't have to manually key in the data? Let's try using the Python framework [Scrapy](https://doc.scrapy.org/en/latest)!

## Scrapy set up
The tutorial for Scrapy says we first have to run some commands to set up the project: 
```
$ scrapy startproject hlhw
```

This returned the message: 
```
You can start your first spider with:
    cd hlhw
    scrapy genspider example example.com
```

Following the tutorial, though, we will skip `genspider` and write our spider by hand, inside the project directory that `startproject` created.

## Coding the spider 

We will use the tutorial's spider as a template, plug in our relevant info, and save it as a Python (.py) file, `hlhw_events.py`, in the project's `spiders` directory (where the tutorial keeps its spiders).

```
import scrapy


class EventsSpider(scrapy.Spider):
    name = "events"

    def start_requests(self):
        # Keep the URLs in a list: looping over a bare string would
        # iterate over its characters, not over URLs.
        urls = [
            'https://libraries.ucsd.edu/visit/library-workshops/holocaust-living-history-workshop/events/2018-2019.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Name the file after the second-to-last segment of the URL.
        page = response.url.split("/")[-2]
        filename = 'hlhw-%s.html' % page
        # Save the raw page body so we can inspect it later.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```
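
To see which filename `parse` will produce for our page, the string handling can be tried on its own, without Scrapy. This is only a small sketch; `season_filename` is a hypothetical helper (it is not part of the spider above) showing one possible alternative that names the file after the season instead.

```python
url = 'https://libraries.ucsd.edu/visit/library-workshops/holocaust-living-history-workshop/events/2018-2019.html'

# Same logic as parse(): take the second-to-last path segment.
page = url.split("/")[-2]
filename = 'hlhw-%s.html' % page
print(filename)  # hlhw-events.html


def season_filename(url):
    """Hypothetical alternative: name the file after the last path segment."""
    last = url.split("/")[-1]          # '2018-2019.html'
    season = last.rsplit(".", 1)[0]    # '2018-2019'
    return 'hlhw-%s.html' % season


print(season_filename(url))  # hlhw-2018-2019.html
```

If other seasons follow the same URL pattern, the `[-2]` logic would give every season's page the same filename, so the alternative only matters once we scrape more than one year.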

## Learning to parse 
Running the spider (`scrapy crawl events`, going by the tutorial) dumped an .html file in our project, but to make our spider really smart we need to learn how to parse the page, and the Scrapy shell is the place to experiment. The tutorial says to start the Scrapy shell with
```
$ scrapy shell 'https://libraries.ucsd.edu/visit/library-workshops/holocaust-living-history-workshop/events/2018-2019.html'
```

The shell is a REPL environment (not unlike Jupyter!) where we can run commands and get output.

For example, we run: 
```
In [1]: response.css('title')
Out[1]: [<Selector xpath='descendant-or-self::title' data='<title>2018 - 2019 Events</title>'>]
```

That returns a list of Selector objects. Adding `::text` to the query and calling `.extract()` pulls out the text itself:
```
In [2]: response.css('title::text').extract()
Out[2]: ['2018 - 2019 Events']
```


A big thing we'd like to do is capture the descriptions of the events. Luckily, the descriptions only appear in paragraph `<p>` tags. Unfortunately, other tags are sometimes nested inside those tags (most commonly italics `<i>` tags). We can make the selector a little greedy by switching to an XPath query that grabs each whole `<p>` element, nested tags and all:

```
In [23]: response.xpath('//p')
Out[23]: 
[<Selector xpath='//p' data='<p>Paneriai is the Lithuanian name for P'>,
 <Selector xpath='//p' data='<p>“I live in crazy times,” Anne Frank w'>,
 <Selector xpath='//p' data='<p>Despite the explosive growth of Holoc'>,
 <Selector xpath='//p' data='<p>The fate of Bulgarian Jewry during Wo'>,
 <Selector xpath='//p' data='<p>It is a common misconception that Jew'>,
 <Selector xpath='//p' data='<p>The suite of international convention'>,
 <Selector xpath='//p' data='<p>Louis “Lubo” Pechi was born in the Cr'>,
 <Selector xpath='//p' data='<p>Every once in a while a book comes al'>]
```

Now we extract the content of those selectors (markup included):
```
In [24]: for p in response.xpath('//p'):
    ...:     print(p.extract())
    ...:     
<p>Paneriai is the Lithuanian name for Ponar (Ponary in Polish), the site of one of the worst massacres of Jews during World War II. For Barbara Michelman Panieriai is a landscape of loss and silence – a silence exemplified by her father who was born there. Though he escaped the slaughter, he was nevertheless broken by its haunting dimensions. In her solo-exhibition <i>Past is Prologue</i>, a series of photo montages with text, ...
```

This works well. Now we just have to loop through these paragraphs to get them into an output format like JSON or CSV.
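
As a rough idea of what that loop could look like, here is a minimal sketch that uses only the Python standard library, so it can be tried outside of Scrapy. It assumes we already have the paragraph strings that `p.extract()` printed; `TextOnly`, `strip_tags`, `paragraphs_to_json` and the `description` key are all hypothetical names chosen for illustration, and the sample string is a shortened piece of the first paragraph above. Within Scrapy itself, the more usual route would probably be to yield dicts from `parse` and let the feed exports write the file (for example `scrapy crawl events -o events.json`).

```python
import json
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and ignore the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(fragment):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks).strip()


def paragraphs_to_json(paragraphs):
    """Turn a list of '<p>...</p>' strings into a JSON array of records."""
    records = [{"description": strip_tags(p)} for p in paragraphs]
    return json.dumps(records, ensure_ascii=False, indent=2)


# Stand-in for [p.extract() for p in response.xpath('//p')]
sample = [
    "<p>In her solo-exhibition <i>Past is Prologue</i>, "
    "a series of photo montages with text</p>",
]
print(paragraphs_to_json(sample))
```

Stripping the tags this way keeps the text inside nested `<i>` elements in place, which is the part a plain `p::text` selector would miss.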
