# Digital Humanities Tools: class

A digital humanities tool may have some peculiar characteristics, such as:

- a category (e.g. infographic, mapping, video, photo, social media, etc)
- an availability (e.g. free, with fees, free trial)
- a type (e.g. web app, software, tutorial, etc)
- a supported device (mobile, computer)

It is thus possible to write a code to define a "dhtool":

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from scipy import misc
import urllib.request
import requests
from io import BytesIO

class dhtool():
    def __init__(self, category= None, availability=None, type = None, device = None):
        self.cat = category
        self.av = availability
        self.typ = type
        self.dev = device

In order to extract information from the websites I am currently browsing, we could potentially write a code to automate that. This, though, is set back by a series of inconveniences:

- in order to extract information on a single tool, I search through MULTIPLE websites
- these website are not UNIFORM, as a matter of fact they each use a different format
- most website don't allow scraping, not even when using a spoofing user agent 
- most of the time, the description of a DH tool doesn't even follow a specific format, but simply "states" in a sentence the DH tool's characteristics:
    
    
    e.g. "An authoring and publishing platform for writing long-form, born-digital, multimodal scholarship online"
   
   --> this sentence on its own does NOT furnish any useful info, nor can be used with a SINGLE code to extract info such as "availability" or "device" 
    --> as a matter of fact, if looking for any of these "parameters" on any website, the output would always be useless
    --> the type of research I am doing would require MULTIPLE codes, with a definition of MULTIPLE classes with MULTIPLE parameters according to each website

To show this, let's use the code precedently used in class to scrape information from this website: "https://guides.nyu.edu/dighum/tools"

In [3]:
from bs4 import BeautifulSoup
import requests

# spoofing user agent
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
goodresponse = requests.get(url, headers={"User-agent":user_agent}) #access the website with requests library
print(goodresponse.status_code)


# inspect website to find particular item code
url = 'https://guides.nyu.edu/dighum/tools'

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
response = requests.get(url, headers={"User-agent":user_agent}) #access the website with requests library


#parse HTML and save to BeautifulSoup object
bigsoup = BeautifulSoup(response.text, "html.parser")

# no matter in what table, looking for <td>language</td> --> orginal is <td><b>language</b></td>
thedevice = bigsoup.find('id', text='device').find_next('td')
print(thedevice.text)

SSLError: HTTPSConnectionPool(host='guides.nyu.edu', port=443): Max retries exceeded with url: /dighum/tools (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

This website doesn't even allow scraping, which doesn't allow me to test the correctness of the code.
Let's try with another one: "https://cdh.unc.edu/resources/tools/"

In [4]:
from bs4 import BeautifulSoup
import requests

# spoofing user agent
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
goodresponse = requests.get(url, headers={"User-agent":user_agent}) #access the website with requests library
print(goodresponse.status_code)


# inspect website to find particular item code
url = 'https://cdh.unc.edu/resources/tools/'

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
response = requests.get(url, headers={"User-agent":user_agent}) #access the website with requests library


#parse HTML and save to BeautifulSoup object
bigsoup = BeautifulSoup(response.text, "html.parser")

# no matter in what table, looking for <td>language</td> --> orginal is <td><b>language</b></td>
thedevice = bigsoup.find('id', text='device').find_next('td')
print(thedevice.text)

SSLError: HTTPSConnectionPool(host='guides.nyu.edu', port=443): Max retries exceeded with url: /dighum/tools (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

Description of workflow:

1. choosing category of tool to do research on (for example, Mapping)
2. browsing website and lists and looking for tools belonging to the category "Mapping"
3. choosing one tool (e.g. Arcgis)
4. reading the multiple descriptions offered by the websites 
5. opening the tool to check existence
6. opening my Word File 
7. filling up set format:
    [TYPE OF TOOL] - {}
    [FIELD OF ACTION] - { }
    [ACCESSIBILITY]  - {}
    [DEVICES SUPPORTED] - {}
8. writing a short synapsis 
9. inserting link to tool 
10. opening website digitalhumanitiestools.altervista.org 
11. creating a new tool
12. copying info from the Word file to the website 