# Hacking heritage: Day 2, Morning

## 1. Running Jupyter

A little bit more on the Jupyter technology ecosystem.

In [1]:
from IPython.display import IFrame
IFrame('https://slides.com/wragge/hh2020-running-jupyter/embed?token=kX6uUwRQ', 600, 438)

## 2. Getting to know the GLAM Workbench

Let's go for a guided tour of my [GLAM Workbench](https://glam-workbench.github.io/) (and some other GLAM Workbenches!).

## 3. Activity: Zooming in

In this activity we're going to link together a number of the notebooks in the GLAM Workbench to zoom in on particular questions drawn from Trove's digitised newspapers. The point of the exercise is both to introduce you to some more useful tools, and to get you thinking about how individual notebooks can be hooked up into workflows. There are lots of possible steps to try – see how far you get!

### The big picture

QueryPic is a tool for visualising searches in Trove's digitised newspapers and has existed in a variety of forms since 2010. The latest version, not surprisingly, is in a Jupyter notebook. It runs in Appmode, so as we did yesterday with the CSV Explorer, it's necessary to generate a link.

In [3]:
from notebook.notebookapp import list_running_servers
from IPython.display import HTML, display

# Get current running servers
servers = list_running_servers()
# Get the current base url
base_url = next(servers)['base_url']
app_url = f'{base_url}apps/hackingheritage2020/trove-newspapers/QueryPic_deconstructed.ipynb'
display(HTML(f'<a href="{app_url}">Open QueryPic in Appmode</a>'))

1. Click on the link to open QueryPic in Appmode.

2. Once the page has loaded, paste your Trove API key into the box where indicated.

3. Now it's time to try a search. Go to the 'Compare queries' tab and type in a search term.

4. Click on the **Add query** button (note this doesn't run the search, it just saves your search term).

5. Type in another search term to compare to the first, then click on **Add query** again.

6. Keep adding as many different search terms as you want.

7. Now click on the **Create Chart** button to run the searches and visualise the results.

8. When the chart first loads it's showing the raw number of newspaper articles each year matching your query. But raw numbers can be misleading because, as we saw yesterday, the articles are not evenly distributed.

9. Click on the dropdown to select 'Proportion of total articles' and see how this changes the chart.

10. Continue experimenting with different search terms. Note that you can adjust the date range with the slider.

11. As well as comparing multiple search terms, you can compare a single search term across multiple states or newspapers. Click on the 'Compare states' or 'Compare newspapers' tabs to give this a try.

12. Keep exploring until you find something you think might be worth exploring further. Perhaps there's an unexpected peak, or an interesting shift in usage between particular words.

13. Click on the 'Download data' link to save a CSV of your search data. This is useful if you want to capture the results as they are today (after all, more newspapers are being added all the time).

14. You can also click on the **Save chart** button to generate a standalone HTML version of your chart that you can save to your own computer.
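
To see why the 'Proportion of total articles' view in step 9 can tell a different story from the raw counts, here's a minimal sketch. The numbers are made up for illustration (they are not real Trove figures), and `proportions` is a hypothetical helper, not part of QueryPic.

```python
# Hypothetical yearly counts, invented for illustration only.
matching = {1914: 120, 1915: 300, 1916: 280}        # articles matching a query
totals = {1914: 60000, 1915: 100000, 1916: 70000}   # all articles published that year


def proportions(matching, totals):
    """Return the share of each year's articles that match the query."""
    return {
        year: matching[year] / totals[year]
        for year in matching
        if totals.get(year)  # skip years with no known total
    }


props = proportions(matching, totals)
for year in sorted(props):
    print(f"{year}: {matching[year]} articles, {props[year]:.2%} of total")
```

With these made-up numbers the raw count peaks in 1915, but the proportion peaks in 1916, because fewer articles were published overall that year.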

### The bigger picture

**QueryPic deconstructed** is a handy app for looking for change over time in the newspaper search results. But if you want to explore the trends it reveals in more depth, you probably want to know a bit more about what it's actually doing – the limitations and possibilities. Fortunately there's a notebook that does just that, showing how you can use search facets to slice and dice the search results in interesting ways.

1. Open up [Visualise Trove newspaper searches over time](../trove-newspapers/visualise-searches-over-time.ipynb)
2. Paste in your API key where indicated.
3. This is just a standard notebook, so you can **Shift+Enter** your way through, modifying the search terms wherever you want.
4. Copy the URL of your Trove search.
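
As a rough idea of what 'using search facets' can look like, here's a sketch that only builds a request URL and doesn't contact Trove. The endpoint and parameter names are assumptions based on version 2 of the Trove API, so check the current API documentation before relying on them. `facet_request_url` is a hypothetical helper, not something from the notebook.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names follow version 2 of the Trove API.
# They may have changed, so check the current Trove API documentation.
API_URL = "https://api.trove.nla.gov.au/v2/result"


def facet_request_url(query, api_key, decade):
    """Build a URL asking for per-year counts within one decade, with no article records."""
    params = {
        "q": query,
        "zone": "newspaper",   # assumed name of the newspaper zone
        "facet": "year",       # ask for the number of results in each year
        "l-decade": decade,    # assumed decade filter, e.g. "191" for the 1910s
        "n": 0,                # no article records, just the facet counts
        "encoding": "json",
        "key": api_key,
    }
    return f"{API_URL}?{urlencode(params)}"


url = facet_request_url("influenza", "YOUR_API_KEY", "191")
print(url)
```

Asking for zero records and a `year` facet is what makes this kind of request quick: you get back counts rather than thousands of articles.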

### Back to the Trove web interface

1. Armed with the interesting feature you found while playing around with QueryPic, go to the [Trove web interface](https://trove.nla.gov.au/) and construct a search that focuses on that feature. For example, you might use the facets to filter your search by year, state, newspaper, or article type. For the purposes of this exercise, you want to limit the number of results in your search to a few thousand. (This is just so you don't spend the rest of the day waiting for your 253,000 newspaper articles to download...)

### Digging deeper with the Trove Newspaper Harvester

The Trove Newspaper Harvester is a command line tool that also dates back about 10 years. I've embedded it in a couple of notebooks in the GLAM Workbench to make it easier to use. 

The Newspaper Harvester downloads all the metadata about newspaper articles in a Trove search, and the OCRd text of each individual article. This means that we can explore the contents of the articles in depth with a variety of tools. It's worth remembering too that the Trove web interface only lets you see the first 2,000 articles in your search. The Newspaper Harvester can download many thousands of articles. A while back I harvested more than half a million articles to [explore the changing context of the words 'aliens' and 'immigrants'](http://timsherratt.org/blog/who-belongs/).

Recently I added the option to save images of all the newspaper articles. This is pretty cool, but it does slow the harvest down considerably. If you want to try out the image harvesting, start with a search that has a small number of results.

1. Open up [Using TroveHarvester to get newspaper articles in bulk](../trove-newspaper-harvester/Using-TroveHarvester-to-get-newspaper-articles-in-bulk.ipynb). There's also an app-ified version in the GLAM Workbench, but the notebook includes a bit more information about what's going on.
2. Paste in your Trove API key where indicated. Remember to run the cell to save the value!
3. Paste the search you copied from the Trove web interface in as the value of `query`. Remember to run the cell to save the value!
4. Run the cell that says `%run -m troveharvester -- start $query $api_key --text` to start your harvest.

5. If you want to get images as well, you need to add ` --image` to this cell before you start harvesting.
6. While you're waiting for your harvest to complete, have a read of the information about the format of the harvest results.
7. Once your harvest has finished, zip up and download all the text, data, and images by running the two cells under 'Download your data'. Once it's downloaded, unzip it!

### Use the CSV Explorer to visualise your results!

I don't know if you noticed yesterday, but the GLAM CSV Explorer includes an option to upload your own CSV file for analysis. Let's see what happens when we upload the article metadata we just harvested. Once again we need to generate an Appmode link to the CSV Explorer with the following cell.

In [1]:
from notebook.notebookapp import list_running_servers
from IPython.display import HTML, display

# Get current running servers
servers = list_running_servers()
# Get the current base url
base_url = next(servers)['base_url']
app_url = f'{base_url}apps/hackingheritage2020/csv-explorer/csv-explorer.ipynb'
display(HTML(f'<a href="{app_url}">Open CSV Explorer in Appmode</a>'))

1. Click on the link above to open the CSV Explorer in Appmode.

2. Click on the 'Upload CSV' tab.

3. Click on the upload button and navigate to the unzipped data folder you just downloaded. Select the file named `results.csv` inside the downloaded folder.

4. Click on the **Analyse CSV** button!

5. What can you find out about your dataset?
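
If you'd rather poke at the harvested metadata in code, here's a minimal sketch that counts articles per year. It uses a tiny made-up stand-in for `results.csv`. The column names (`article_id`, `title`, `date`) and the `YYYY-MM-DD` date format are assumptions, so check the header row of your own file.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for results.csv; the column names are assumptions.
sample_csv = io.StringIO(
    "article_id,title,date\n"
    "1,Example article one,1915-04-26\n"
    "2,Example article two,1915-05-01\n"
    "3,Example article three,1916-07-20\n"
)


def articles_per_year(csv_file, date_column="date"):
    """Count rows per year, assuming dates start with a four-digit year."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[row[date_column][:4]] += 1
    return counts


per_year = articles_per_year(sample_csv)
for year, count in sorted(per_year.items()):
    print(year, count)
```

To try it on a real harvest, you could swap `sample_csv` for `open('results.csv', newline='', encoding='utf-8')`.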

### Turn your harvested data into a searchable database

Using Datasette, we can put our harvested metadata, OCRd text, and even images into a searchable online database.

1. Open up [Display the results of a harvest as a searchable database using Datasette](../trove-newspaper-harvester/display_harvest_results_using_datasette.ipynb)

2. Run the first two cells to create your database from the article metadata. Note that by default, the notebook will use the most recently completed harvest as the source of its data. **Don't** run the cell that says `open_datasette()` yet.

3. Run the first two cells under 'Add Ocrd text'. They'll add the text of the articles to your database and make it full text searchable.

4. If you harvested images as well as text, run the first two cells under 'Add image links'.

5. OK, now you're ready to run one of the `open_datasette()` cells (it doesn't matter which one). After you run it a blue button will appear. Just wait for a moment until you see a message that the server is running.

6. Now click on the big blue **View in Datasette** button. It will open your database in a new tab.

7. Click on the 'records' link to view the harvested data.

8. Explore! Try some searches! What can you find?

Using Datasette it's even possible to create a public, online version of your database. If you're interested in finding out how, [this article will get you started](https://101dhhacks.net/share-searchable-csvs/).
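
Datasette works by serving up SQLite databases, so it can help to see what one looks like underneath. This is a minimal sketch using Python's built-in `sqlite3` module with made-up records. The column names are assumptions, and the search is a simple substring match rather than the full text index the notebook sets up, so don't read it as how the notebook itself builds the database.

```python
import sqlite3

# Made-up records for illustration; the column names are assumptions.
records = [
    (1, "Example article one", "1915-04-26", "the troops landed at dawn"),
    (2, "Example article two", "1915-05-01", "wheat prices rose again"),
    (3, "Example article three", "1916-07-20", "more troops sailed today"),
]

# An in-memory database; use a filename such as 'harvest.db' to keep it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (article_id INTEGER PRIMARY KEY, title TEXT, date TEXT, text TEXT)"
)
conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", records)
conn.commit()


def search(conn, term):
    """Return (article_id, title) for records whose text contains the term."""
    cursor = conn.execute(
        "SELECT article_id, title FROM records WHERE text LIKE ? ORDER BY article_id",
        (f"%{term}%",),
    )
    return cursor.fetchall()


matches = search(conn, "troops")
print(matches)
```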

### Other ways of exploring the harvested data

I've created a few other notebooks that explore the harvested data in different ways.

1. [Exploring your Trove Harvester data](../trove-newspaper-harvester/Exploring-your-TroveHarvester-data.ipynb) doesn't do much more than the CSV Explorer, though it does include a map showing the places of publication.

2. [Explore harvested text files](../trove-newspaper-harvester/Explore-harvested-text-files.ipynb) does some word frequency analysis across **all** the files of OCRd text. It's worth having a look at, especially for the cool faceted charts! It also lets you compare the frequency of particular words (like QueryPic, but with bubbles).

![Faceted data](../images/faceted-data.png)

3. Finally, [Explore harvested text files using TF-IDF](../trove-newspaper-harvester/Explore-harvested-text-files-using-tfidf.ipynb) looks at word frequency a bit differently. TF-IDF scores are calculated by comparing the number of times a word appears in a single document to the number of times it appears in a collection of documents. Word frequencies point us to 'common' words, while TF-IDF can indicate 'significant' words. This notebook aggregates the articles by year and then calculates TF-IDF values for each word in each year. How do the words highlighted by TF-IDF differ from the most frequent words?
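
To make the difference between 'common' and 'significant' words concrete, here's a small pure-Python sketch. The text is made up, with one 'document' per year to mimic the aggregation described above. The formula is a plain textbook variant (term frequency multiplied by the log of total documents over documents containing the word). I'm assuming this is close enough for illustration; libraries such as scikit-learn use smoothed versions, so their scores will differ.

```python
import math
from collections import Counter

# Made-up text, one aggregated 'document' per year.
docs_by_year = {
    "1914": "war war war declared europe news",
    "1915": "war war war gallipoli gallipoli news",
    "1916": "war war war conscription conscription news",
}

tokens = {year: text.split() for year, text in docs_by_year.items()}
counts = {year: Counter(words) for year, words in tokens.items()}


def tf_idf(word, year):
    """Plain TF-IDF: relative frequency in this year, weighted by rarity across years."""
    tf = counts[year][word] / len(tokens[year])
    docs_with_word = sum(1 for c in counts.values() if word in c)
    idf = math.log(len(counts) / docs_with_word)
    return tf * idf


for year in sorted(counts):
    most_frequent = counts[year].most_common(1)[0][0]
    most_significant = max(counts[year], key=lambda w: tf_idf(w, year))
    print(year, "most frequent:", most_frequent, "| highest TF-IDF:", most_significant)
```

In this toy data 'war' is the most frequent word in every year, but because it appears in every year its TF-IDF score is zero. The words that only turn up in one year float to the top instead.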
