# Patrick's Absolute Beginner Guide to Building a Webscraper (Sort of) 

## Scraping for Algonquin Park Fire Photos

This notebook searches the Algonquin Park Archives online website for image files related to fires in the Park and collects the links to these photos in a single .csv document. The Python requests library and the Beautiful Soup package are required to run this notebook, so please install both before attempting to follow along with this project. 

This notebook serves as a tool to gather historical photos for my research on the topics of Algonquin Park, the timber trade in the Ottawa Valley, and the role of fire in the Indigenous, logging, and settlement histories of the area. The list of photo links that this notebook generates will give me quick access to the fire-related photos in the APK Archives Online. This will be especially helpful to me next year, when I am planning to participate in an MA program that deals with the topic of timber in the Ottawa Valley and fire in the Park. 

As for data, this notebook only uses the url of the online archives to access the webpage and parse its html for the desired links. The particular url I have used is for the photo search page with the search criteria set simply to 'fire'. This search returns 200 results out of almost 7000 photos searched. However, it appears that each time the url is refreshed, the photos appear in a different order. This can be disorienting while you are first examining the html with the inspect function in a web browser, but causes no issues once the coding phase is started. 

Additionally, not all the results are loaded onto a single webpage. Rather, you have to hit 'next page' to see the remaining results. This means that to get all the photos, we would need to run through the scraper twice, the second time with the page 2 url. This will be touched on again in the following section. 

Finally, this notebook is meant for those arts students in university who quite honestly have not the slightest clue of how coding works. For those people, congrats on managing to open up the notebook and view this page, that's a great first step. I'm going to walk you through this 'simple' scraper and explain every single minute thing to the best of my abilities, because that is what I needed even just to get to this point. 

## The Basics of html

Before we start parsing a website, we first have to understand html and how urls are formatted. 

First let's look at the entire url. 
https://algonquinpark.pastperfectonline.com/photo?utf8=%E2%9C%93&search_criteria=fire&searchButton=Search 
Our first part here is the main website, algonquinpark.pastperfectonline.com, so that's the first thing we need to know. 

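The second part, /photo, points to the "image only" search section of the archive webpage. Everything after the ? is the search itself, written as name=value pairs that are separated by &. utf8=%E2%9C%93 is just an encoded check mark (✓) that comes along with the search. search_criteria=fire means that fire is the keyword that was put into the search bar. searchButton=Search indicates the search button that gets clicked in order to initiate the search. [^1] 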
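
To see those pieces for yourself, here is a small sketch that uses Python's built-in urllib.parse module to take the url apart. Nothing here talks to the website; it only splits up the text of the url. The variable names `parts` and `query` are just my own choices.

```python
from urllib.parse import urlsplit, parse_qs

URL = 'https://algonquinpark.pastperfectonline.com/photo?utf8=%E2%9C%93&search_criteria=fire&searchButton=Search'

# urlsplit breaks the url into its named pieces
parts = urlsplit(URL)
print(parts.scheme)   # https
print(parts.netloc)   # algonquinpark.pastperfectonline.com
print(parts.path)     # /photo

# parse_qs turns everything after the ? into a dictionary of name: [values]
query = parse_qs(parts.query)
print(query['search_criteria'])  # ['fire']
print(query['searchButton'])     # ['Search']
```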

Keep in mind that this scraper only works for the specific page that the url links to, and only for the data that is contained on that specific page. This means that just going to the photo search page, without searching for a keyword, would yield no results. In other words, the search page doesn't automatically link to the entire site inventory. Additionally, in the case of the fire photos, the search finds 200 results, but the site doesn't display all of them on the first page. We would therefore need to run through the notebook a second time with the url for 'page 2 results' in order to get the second half of the photo links and titles. 
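
Since the keyword lives right inside the url, the url can also be put together in code instead of being copied from the browser. Here is a minimal sketch using Python's built-in urlencode. The helper name `build_search_url` is made up for this example, and it assumes the site treats other keywords the same way it treats 'fire', which this notebook has only tried with this one search.

```python
from urllib.parse import urlencode

# the main website plus the /photo search section
BASE = 'https://algonquinpark.pastperfectonline.com/photo'

def build_search_url(keyword):
    """Build a photo search url for keyword (hypothetical helper, not part of the site)."""
    params = {
        'utf8': '\u2713',            # the check mark, written as a unicode escape
        'search_criteria': keyword,  # the word typed into the search bar
        'searchButton': 'Search',
    }
    # urlencode joins the pairs with & and encodes any special characters
    return BASE + '?' + urlencode(params)

# build_search_url('fire') gives back the same url used in this notebook
print(build_search_url('fire'))
print(build_search_url('logging'))
```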

Overall, our entire process here is based off of this url, so it's best we understand why we're using it the way we are. 

[1] Martin Breuss, "Beautiful Soup: Build a Webscraper with Python," Real Python, https://realpython.com/beautiful-soup-web-scraper-python/

## Establishing Contact with the Website

In [1]:
#set up the code to communicate with the website
import requests
#the url used is from the APK Archives photo search for the keyword 'fire'
URL = 'https://algonquinpark.pastperfectonline.com/photo?utf8=%E2%9C%93&search_criteria=fire&searchButton=Search'
#requests.get(URL) communicates with the website server and 'gets' the data (the html); the data is stored in 'response'
response = requests.get(URL)
#print(response) indicates successful communication if the printed response is <Response [200]>
print(response)

<Response [200]>
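
The 200 is a status code that means the server answered OK. Other numbers mean trouble, for example 404 when the page cannot be found. Rather than reading the number off the screen every time, you can ask requests to complain on its own. Below is a minimal sketch; the function name `fetch_html` and the 10 second timeout are my own choices, not something the website requires.

```python
import requests

def fetch_html(url):
    """Return the html of url as a string, or raise an error if the request failed."""
    # timeout stops the notebook from waiting forever if the server never answers
    response = requests.get(url, timeout=10)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text

# Example use (needs an internet connection):
# html_string = fetch_html(URL)
```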


## Looking at the Raw html

In [3]:
#html is the code that makes up a website
# a string is a text unit in code
#.text renders data in text form
html_string = response.text
#therefore printing the above code gives you the raw html of the whole page
print(html_string)
#use the inspect function on your browser to compare the html of the original to ours

<!DOCTYPE html>
<html lang="en">

  
  <head>
     <script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

    ga('create', 'UA-84695581-1', 'auto', 'clientTracker');
    ga('clientTracker.send', 'pageview');


    </script>
      <title>The Friends of Algonquin Park : Online Collections</title>
    <meta name="viewport" content="width=device-width"/>
      
      <link href="https://s3.amazonaws.com/pastperfectonline/sitecontent/reset.css" media="screen" rel="stylesheet" />
      <link href="https://s3.amazonaws.com/pastperfectonline/sitecontent/text.css" media="screen" rel="stylesheet" />
      <link href="/assets/jquery-ui.min.css?body=1" media="screen" rel="stylesheet" />
      <link href="https://

## Understanding More html

At this point I invite you to go to the APK Archive page we're using, right click on the page, and hit inspect. This will bring up a window on your screen that shows the same html of the page that we see above, except that in the inspect function the html is interactive. This means that as you hover over blocks of html, the corresponding webpage element gets highlighted in blue. This is going to be important as we go forward, because it is this inspect function that will allow us to pinpoint the webpage elements that we want to scrape and collect. [^2] 

[2] Martin Breuss, "Beautiful Soup: Build a Webscraper with Python," Real Python, https://realpython.com/beautiful-soup-web-scraper-python/

## Parsing with Beautiful Soup

In [4]:
#import beautiful soup package that we need for parsing html, just like above but with the parser added
import requests
from bs4 import BeautifulSoup

URL = 'https://algonquinpark.pastperfectonline.com/photo?utf8=%E2%9C%93&search_criteria=fire&searchButton=Search'
response = requests.get(URL)
#Here we are making a single object to serve as our parser for the whole html of the webpage
document = BeautifulSoup(response.content, 'html.parser')
print(document.prettify())
#prettify() tidies up the indentation, so the print gives the same html as before, just laid out more neatly

<!DOCTYPE html>
<html lang="en">
 <head>
  <script>
   (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

    ga('create', 'UA-84695581-1', 'auto', 'clientTracker');
    ga('clientTracker.send', 'pageview');
  </script>
  <title>
   The Friends of Algonquin Park : Online Collections
  </title>
  <meta content="width=device-width" name="viewport"/>
  <link href="https://s3.amazonaws.com/pastperfectonline/sitecontent/reset.css" media="screen" rel="stylesheet"/>
  <link href="https://s3.amazonaws.com/pastperfectonline/sitecontent/text.css" media="screen" rel="stylesheet"/>
  <link href="/assets/jquery-ui.min.css?body=1" media="screen" rel="stylesheet"/>
  <link href="https://s3.amazonaws.com/pastperfectonline/sit
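
Printing everything is a good sanity check, but the real point of the BeautifulSoup object is that you can ask it for specific pieces. Here is a small sketch on a tiny stand-in page built from pieces of this notebook's printed output, so it runs without contacting the website. The names `sample_html`, `sample`, and `first_link` are made up for the example.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page, so no internet connection is needed
sample_html = """
<html>
 <head><title>The Friends of Algonquin Park : Online Collections</title></head>
 <body>
  <h1><a href="/photo/27F1913B-3F38-4D87-B1EB-360057568911">1922 - Lake Louisa</a></h1>
 </body>
</html>
"""
sample = BeautifulSoup(sample_html, 'html.parser')

# .title is the <title> tag; get_text() gives the words between the tags
print(sample.title.get_text())   # The Friends of Algonquin Park : Online Collections

# find('a') returns the first <a> tag in the document
first_link = sample.find('a')
print(first_link.name)           # a
print(first_link['href'])        # /photo/27F1913B-3F38-4D87-B1EB-360057568911
print(first_link.get_text())     # 1922 - Lake Louisa
```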

## Finding the Links

In [97]:
#'h1' is the html tag for a top-level heading. On the APK Archive page, every link and title sits inside an 'h1'
#so, .find_all finds all of the instances of h1 in document, which is our object for all the raw html
links = document.find_all('h1')
#For every h1 found by document.find_all('h1'), print it
for link in links: 
    print(link)
#now we have all our relevant links and their titles, with only a few extra html tags
#(the very first h1 is the results count, not a photo)

<h1 class="searchResultTitle">
<number>200</number> results found. Records searched: 6904
      </h1>
<h1>
<a href="/photo/A833C586-B138-4061-B0E3-785438375747">ca. 1926 - Camp Tanamakoon</a>
</h1>
<h1>
<a href="/photo/3A77376A-D1C7-4669-BA14-882771243770">1977 - Barron Canyon Fire</a>
</h1>
<h1>
<a href="/photo/22902A1C-BC67-465F-8436-068155738233">September 11, 1944 - Ne-ow-notin Island, Cedar Lake</a>
</h1>
<h1>
<a href="/photo/BDFAF8BB-F663-4AC6-8EB7-401499233125">1916. - Rose Thomas in front of the hill in front of Canoe Lake Station, December 20, 1916.</a>
</h1>
<h1>
<a href="/photo/27F1913B-3F38-4D87-B1EB-360057568911">1922 - Lake Louisa</a>
</h1>
<h1>
<a href="/photo/44C42E50-B3E3-407C-9DA6-540023428341">1927 - Tower for drying fire hose, Brule Lake.</a>
</h1>
<h1>
<a href="/photo/17C297A7-29A6-46D8-96F7-013333644848">1895 - W.S. Cranston's Survey Camp, No. 1, Cache Lake.</a>
</h1>
<h1>
<a href="/photo/9CB8D6F7-1FB6-47FE-9CAD-176017211400">19- - Burning brush on raft due to fir
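
If you would rather do the cleaning in Python, the sketch below pulls just the title and the link out of each h1. It is one possible approach under a few assumptions: the function name `extract_photo_links` is made up, the sample html is copied from the output above rather than fetched live, and the links are assumed to be relative to the main website, which is why urljoin is used to glue the two together.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://algonquinpark.pastperfectonline.com'

def extract_photo_links(html):
    """Return a list of (title, full_url) pairs, one per h1 that contains a link."""
    page = BeautifulSoup(html, 'html.parser')
    rows = []
    for heading in page.find_all('h1'):
        anchor = heading.find('a')
        if anchor is None:
            # headings with no link, like the "200 results found" line, are skipped
            continue
        title = anchor.get_text(strip=True)
        # turn '/photo/...' into a complete url that works in a browser
        full_url = urljoin(BASE_URL, anchor['href'])
        rows.append((title, full_url))
    return rows

sample_html = """
<h1 class="searchResultTitle"><number>200</number> results found. Records searched: 6904</h1>
<h1><a href="/photo/3A77376A-D1C7-4669-BA14-882771243770">1977 - Barron Canyon Fire</a></h1>
<h1><a href="/photo/27F1913B-3F38-4D87-B1EB-360057568911">1922 - Lake Louisa</a></h1>
"""
for title, full_url in extract_photo_links(sample_html):
    print(title, full_url)
```

With the real page you would pass in response.text instead of sample_html.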

## Save it all to a CSV

In [118]:
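#At this point, we could use python to clean up the links, or we could just send them to a csv and clean them there
#the csv module comes with python, but it has to be imported before csv.writer can be used
import csv
#open- opens (or creates) a csv file
#'APK_Python.csv'- file name
#'w+' opens the file for writing (and reading), and wipes any older file with the same name
#newline='' lets the csv module handle the line endings itself, which avoids extra blank rows on Windows
csvfile = open('APK_Python.csv', 'w+', newline='')
#This makes the writer, which is what puts stuff into the spreadsheet
writer = csv.writer(csvfile)
#This makes a column header
writer.writerow(['Links'])
#This is a loop like the one seen above, except with writer.writerow added we get all those sweet links sent to our csv file
#writerow(link) splits each h1 into its pieces, so the <a> tag with the link and title lands in its own cell
for link in links:
    print(link)
    writer.writerow(link)
#This closes the file, then we're done
csvfile.close()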

<h1 class="searchResultTitle">
<number>200</number> results found. Records searched: 6904
      </h1>
<h1>
<a href="/photo/A833C586-B138-4061-B0E3-785438375747">ca. 1926 - Camp Tanamakoon</a>
</h1>
<h1>
<a href="/photo/3A77376A-D1C7-4669-BA14-882771243770">1977 - Barron Canyon Fire</a>
</h1>
<h1>
<a href="/photo/22902A1C-BC67-465F-8436-068155738233">September 11, 1944 - Ne-ow-notin Island, Cedar Lake</a>
</h1>
<h1>
<a href="/photo/BDFAF8BB-F663-4AC6-8EB7-401499233125">1916. - Rose Thomas in front of the hill in front of Canoe Lake Station, December 20, 1916.</a>
</h1>
<h1>
<a href="/photo/27F1913B-3F38-4D87-B1EB-360057568911">1922 - Lake Louisa</a>
</h1>
<h1>
<a href="/photo/44C42E50-B3E3-407C-9DA6-540023428341">1927 - Tower for drying fire hose, Brule Lake.</a>
</h1>
<h1>
<a href="/photo/17C297A7-29A6-46D8-96F7-013333644848">1895 - W.S. Cranston's Survey Camp, No. 1, Cache Lake.</a>
</h1>
<h1>
<a href="/photo/9CB8D6F7-1FB6-47FE-9CAD-176017211400">19- - Burning brush on raft due to fir
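
For a tidier spreadsheet, here is a sketch that writes one row per photo with the title and the link in separate columns. It is only one way to do it: the function name `save_links`, the file name 'APK_Python_clean.csv', and the column headers are all made up for this example, and the rows are typed in by hand from the output above instead of being scraped live. It also skips any link it has already written, which could matter if you combine page 1 and page 2 and the site shuffles the order between visits.

```python
import csv

def save_links(rows, filename):
    """Write (title, url) pairs to a csv file with a header row, skipping repeats."""
    seen = set()
    # the with-block closes the file for us, even if something goes wrong
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Title', 'Link'])
        for title, url in rows:
            if url in seen:   # the same photo can turn up more than once
                continue
            seen.add(url)
            writer.writerow([title, url])
    # the number of different photos that were written
    return len(seen)

BASE_URL = 'https://algonquinpark.pastperfectonline.com'
rows = [
    ('1977 - Barron Canyon Fire', BASE_URL + '/photo/3A77376A-D1C7-4669-BA14-882771243770'),
    ('1922 - Lake Louisa', BASE_URL + '/photo/27F1913B-3F38-4D87-B1EB-360057568911'),
    ('1922 - Lake Louisa', BASE_URL + '/photo/27F1913B-3F38-4D87-B1EB-360057568911'),
]
count = save_links(rows, 'APK_Python_clean.csv')
print(count)  # 2, because the repeated Lake Louisa row is only written once
```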

## Conclusion

Let's recap what we've done in this notebook. Our goal was to scrape the APK Archives online for all their photos involving the keyword 'fire'. For me, this is a really relevant tool because I specialize in research concerning Algonquin Park and the surrounding region, and my current research deals with forest fires in the Park. This scraper lets me quickly compile the links from the search results on the APK Archive website into a single .csv document for easy access. Of course, you could sub in whatever url you like for whatever website is relevant to you. The ultimate purpose here is just to walk through the basic steps for making a scraper, so that you can go apply your new skills to the big wide world of research on the internet. 

First, we learned some html basics, established contact with the APK Archives server, and printed the raw html for the entire page. Then we added in Beautiful Soup so that we could parse the html for the links and titles that we were looking for. Once we identified that all the links and titles were contained in the page's h1 heading tags, we were able to isolate and print all h1 instances, in other words all the things we were looking for. Finally, we used some fancy code that I only sort of understand to make a .csv file, add a header in the .csv, and then write the results of our previous code into the csv. Then voila, we suddenly had a .csv that contained all the links and object titles from the first page of results, with minimal clean up of html needed in the csv itself.

For some, this notebook may seem dirt simple, but it has been made for an audience of historians or other arts students with literally zero knowledge of coding, just like myself. If you have no idea how to use the command line or what a string is, if you have never seen html before, if you don't know where to go to write code, if you don't even know the letters that make up the Python language, let alone how the language actually works, then this is the notebook for you. It isn't pretty, nor is it complicated, but it gets the job done, and that's good enough for me. 


### Further Reading

Melanie Walsh, "Webscraping Part 1," Introduction to Cultural Analytics and Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Web-Scraping-Part1.html

Melanie Walsh, "Webscraping Part 2," Introduction to Cultural Analytics and Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Web-Scraping-Part2.html

Both of these pages form the backbone of my knowledge for the scraper I've put together in this notebook. I did what I was able to do, but these two tutorials go further than I did. In particular, they run through how to use regular expressions to make entire functions that parse the html more quickly and effectively, and they give details on how to clean up your html data in the notebook before sending your results over to a .csv. 

Hiren Patel, "How Webscraping is Transforming the World with its Applications," Towards Data Science, https://towardsdatascience.com/https-medium-com-hiren787-patel-web-scraping-applications-a6f370d316f4

Patel goes through a vast number of ways that webscrapers are being used more and more in the modern world. In his section on academics, he makes the simple point that academics are all about working with data, and webscraping is all about collecting data, so it naturally follows that scraping could be applied to any kind of work you please. The rest of his article offers many other practical examples of business applications that may inspire new ideas of how to use a scraper. 

Martin Breuss, "Beautiful Soup: Build a Webscraper with Python," Real Python, https://realpython.com/beautiful-soup-web-scraper-python/

If you're actually an absolute beginner like me, then you'll need at least three different tutorials to reference in order to make this scraper actually work. Here's one that spends a lot of time explaining the basic theory that underlies these lines of code. Additionally, Breuss explains some of the pitfalls and traps that one can run into when parsing a site, which is valuable for when your code inevitably goes wrong and you need to figure out why. A little extra understanding goes a long way in the troubleshooting process. 

### References

Melanie Walsh, "Webscraping Part 1," Introduction to Cultural Analytics and Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Web-Scraping-Part1.html

Melanie Walsh, "Webscraping Part 2," Introduction to Cultural Analytics and Python, https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Collection/Web-Scraping-Part2.html

Martin Breuss, "Beautiful Soup: Build a Webscraper with Python," Real Python, https://realpython.com/beautiful-soup-web-scraper-python/

"Writing Scraped Links to a CSV File Using Python 3," StackOverflow, https://stackoverflow.com/questions/47372961/writing-to-scraped-links-to-a-csv-file-using-python3

"What does the Newline='' argument do?", Codecademy Discuss, https://discuss.codecademy.com/t/what-does-the-newline-argument-do/463575

Andrew Schwan, roommate and computer enthusiast