In [1]:
# DATA2001 Week 6 Tutorial
# Material last updated: 28 Mar 2023
# Note: this notebook was designed with the Roboto Condensed font, which can be installed here: https://www.1001fonts.com/roboto-condensed-font.html

from IPython.display import HTML
HTML('''
    <style> body {font-family: "Roboto Condensed Light", "Roboto Condensed";} h2 {padding: 10px 12px; background-color: #E64626; position: static; color: #ffffff; font-size: 40px;} .text_cell_render p { font-size: 15px; } .text_cell_render h1 { font-size: 30px; } h1 {padding: 10px 12px; background-color: #E64626; color: #ffffff; font-size: 40px;} .text_cell_render h3 { padding: 10px 12px; background-color: #0148A4; position: static; color: #ffffff; font-size: 20px;} h4:before{ 
    content: "@"; font-family:"Wingdings"; font-style:regular; margin-right: 4px;} .text_cell_render h4 {padding: 8px; font-family: "Roboto Condensed Light"; position: static; font-style: italic; background-color: #FFB800; color: #ffffff; font-size: 18px; text-align: center; border-radius: 5px;}input[type=submit] {background-color: #E64626; border: solid; border-color: #734036; color: white; padding: 8px 16px; text-decoration: none; margin: 4px 2px; cursor: pointer; border-radius: 20px;}</style>
''')

# Week 6 - Web Scraping

Not all data is presented as neatly as the structured dataframes of rows and columns we've become accustomed to in previous weeks. Often, meaningful information exists in unstructured or semi-structured formats. Take, for example, the internet. Worlds of information exist across millions of webpages, and extracting particular fields of interest from these pages is our focus today.

This will require the following Python libraries:
- **Requests**        for interacting with websites and web services
- **BeautifulSoup**   for webpage parsing
- **HTML5Lib**        for the actual parser that BeautifulSoup uses
- **Pandas**          for dataframe management

To use the above, you will need to have the following libraries installed (using either pip3 or Anaconda navigator):
- `bs4`
- `html5lib`

In [2]:
import requests
import bs4
import pandas as pd

## 1. Scraping Data from a Webpage

We'll start with a familiar example, and read in the webpage for [this unit's outline](https://www.sydney.edu.au/units/DATA2001/2023-S1C-ND-CC) on the USYD website.

### 1.1 Webpage Retrieval and Parsing

The `requests` library can be used to `get()` the contents of a page, as seen below:

In [3]:
webpage_source = requests.get("https://www.sydney.edu.au/units/DATA2001/2023-S1C-ND-CC").text
print(webpage_source)


<!DOCTYPE HTML>
<html lang="en-AU">
    <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
    
    
    <meta name="robots" content="noindex"/>
    
    
    

    








    

	<meta charset="utf-8"/>
	<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
	<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
	<meta name="viewport" content="width=device-width, initial-scale=1"/>

	
    
<link rel="stylesheet" href="/etc.clientlibs/corporate-commons/clientlibs/frontend-css.12d73ed15a462848ad40552cc7e5c20c.css" type="text/css">



	
    
<script src="/etc.clientlibs/corporate-commons/clientlibs/cookie-banner.c46e39bebc7a5cc57e670a5bc6619142.js"></script>
<script src="/etc.clientlibs/corporate-commons/clientlibs/jquery.577ddf6779040ea52503746fdeece2ce.js"></script>



	
	
	

	
		<meta name="DC.Created" content="2023-01-17"/>
	
		<meta name="DC.Modified" content="17/01/2023"/>
	
	
	<meta name="google-site-verification" content="FicMLRy30eyM

The output of this request is the raw webpage source code. This is normally parsed and rendered by a web browser as a nice visual webpage.

The language in which this webpage is written is called **HTML** (the *Hypertext Markup Language*), and is a tree-like structure of content elements. We can interpret this content using an **HTML parser** - several exist, but we'll be using *BeautifulSoup*.

In [None]:
from bs4 import BeautifulSoup
content = BeautifulSoup(webpage_source, 'html5lib')
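
To make the "tree-like structure" more concrete, the sketch below uses Python's built-in `html.parser` module (not BeautifulSoup) to list each opening tag indented by its depth. The `TreePrinter` class and the sample HTML string are made up for illustration, and the sketch assumes every tag is explicitly closed (real pages contain tags such as `<meta>` that are not).

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Hypothetical helper: records each opening tag, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then step one level deeper.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves us back up one level.
        self.depth -= 1


# Made-up sample document in which every tag is explicitly closed.
sample_html = "<html><head><title>Hi</title></head><body><div><p>Text</p></div></body></html>"

printer = TreePrinter()
printer.feed(sample_html)
tree_lines = printer.lines
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one step down the tree, which is the structure BeautifulSoup lets us navigate in the next section.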

### 1.2 Traversing the Tree

The key benefit of parsing the webpage is that we can now **locate and iterate through the HTML content** by either traversing the tree, or by selecting particular HTML tags, classes or identifiers. As a simple example, the webpage output contains a single instance of a `title` tag, as seen below:

`<title>Outline - The University of Sydney</title>`

This title is typically reflected in the browser tab name, which you'll notice here doesn't align with our code. In the web browser, the title reflects the unit of study name - "DATA2001: Semester 1, 2023". When the page first loads though, there's a brief moment where the tab name is "Outline - The University of Sydney", like our code here. For more complex web pages, some content may be dynamically generated on load (e.g. with JavaScript), and hence small disparities like these can occur. This is the only field we'll encounter in this tutorial's example that should differ.

Nonetheless, using BeautifulSoup, we can extract this information, by finding the first `title` tag within the content, and extracting its text:

In [None]:
print(content.title.text)

We can get even more in-depth, and specify a path to traverse. The example below seeks the `body` tag, then the first `div` within that, and the first `div` within that!

In another browser tab with the actual webpage open, try using the **Inspect Element** feature to follow the path, and confirm this is pulling the right information:

In [None]:
print(content.body.div.div.text)
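
To see how dotted traversal behaves without depending on the live page, here is a minimal sketch on a tiny, made-up HTML string. It assumes the built-in `"html.parser"` backend rather than `html5lib`, purely so the example needs nothing beyond `bs4`; the names `sample_html` and `sample` are illustrative.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page (not the real unit outline).
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div><div>Inner text</div><div>Second inner</div></div>
    <div>Outer sibling</div>
  </body>
</html>
"""

# "html.parser" ships with Python; the tutorial itself uses "html5lib".
sample = BeautifulSoup(sample_html, "html.parser")

# Each dotted step returns the FIRST matching tag beneath the current one.
title_text = sample.title.text
first_nested = sample.body.div.div.text

print(title_text)
print(first_nested)
```

Note that each step only ever follows the first match, so the second inner `div` and the outer sibling are never reached this way.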

### 1.3 CSS Selectors

All elements of an HTML document can be assigned a **class** (multiple elements can share a class) or an **id** (which should be unique within a page). These are **Cascading Style Sheet** references that ease formatting (for example, all elements containing `class='darktext'` might be defined as having a black text colour).

Take the website's header as an example. This is all contained within a div with the class _primaryNavigation_, which we can locate using the `find()` function. By then narrowing it down to `header > div`, we can find a few **<a\>** tags (which are links). Let's extract the `class` of the first link that appears:

In [None]:
print(content.find('div', 'primaryNavigation').header.div.a['class'])

From there, we can jump to the next **<a\>** tag using `find_next()` (the older camelCase spelling `findNext()` is kept for backwards compatibility). Within this element is an **img**, so we'll extract the class from that:

In [None]:
print(content.find('div', 'primaryNavigation').header.div.a.find_next('a').img['class'])

The above examples are useful, but only allow us to find single elements. The `find_all()` function captures _all_ occurrences of an HTML element, by tag, and optionally by class or ID. Header text elements are defined in HTML as **h1**, **h2**, etc. Finding the text from all occurrences of `h2` tags nicely recaps the page structure:

In [None]:
for heading in content.find_all('h2'):
    print(heading.text.strip())

# alternative one line solution:
# [x.text.strip() for x in content.find_all('h2')]
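
The selection techniques from this section can be tried side by side on a small, made-up snippet. This is only a sketch: the class names `nav`, `logo` and `item` are invented for illustration, and the built-in `"html.parser"` backend is assumed instead of `html5lib`.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names here are illustrative only.
sample_html = """
<div class="nav">
  <a class="logo" href="/home">Home</a>
  <a class="item" href="/about">About</a>
</div>
<h2>Overview</h2>
<h2> Assessment </h2>
"""
sample = BeautifulSoup(sample_html, "html.parser")

# A string passed as the second argument to find() is matched against the CSS class.
nav = sample.find("div", "nav")

# Attributes are read like dictionary keys; "class" comes back as a list.
first_link_classes = nav.a["class"]

# find_next() moves forward in the document to the next matching tag.
second_link_href = nav.a.find_next("a")["href"]

# find_all() returns every match, here every h2 heading.
headings = [h.text.strip() for h in sample.find_all("h2")]

print(first_link_classes, second_link_href, headings)
```

Because `class` can hold several space-separated values, BeautifulSoup returns it as a list even when only one class is present.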

**Task: Find the text and hyperlinks of all USYD social media platforms listed in the page footer.**

The footer section of all pages on USYD's website contains a few small icons on the left, each linking to a social media account run by the university. Use "Inspect Element" on the webpage to find the class of the `div` that contains this information, and within this, extract the **text** _and_ **link** of each.

In [None]:
### TO DO

### 1.4 HTML tables

Not all useful information on a webpage is stored in plain text fields. HTML features `table` elements, which are made up of `<tr>` rows, each consisting of `<td>` cells (or `<th>` if a header). In our example, the top of each UoS page contains an overview table of academic details. We can first locate it by its **id**, in this case _academicDetails_, and explore its structure:

In [None]:
details = content.find('div', id='academicDetails')
details

Let's iterate through each row, and extract both its header (in **<th\>**), and the corresponding cell data (in **<td\>**).

In [None]:
for row in details.find_all('tr'):
    print(row.th.text, '=', row.td.text)

Note this produces output that is a bit messy, as one of the cells has a div _within_ it (for a question mark icon that users can hover over for more information). By default, `.text` extracts all text, at any depth, within this cell, therefore including text within the internal div. This can be avoided by using `.find(string=True, recursive=False)` rather than just `.text` (older BeautifulSoup code spells the `string` argument as `text`).

In [None]:
for row in details.find_all('tr'):
    print(row.th.find(string=True, recursive=False).strip(), '=', row.td.text)
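
The difference between `.text` and a non-recursive string search can be seen on a small, made-up table row (not taken from the real page). This sketch assumes the built-in `"html.parser"` backend; `string=` is the current name for the argument that older code writes as `text=`.

```python
from bs4 import BeautifulSoup

# Made-up row: the header cell contains a nested div, like a tooltip icon.
cell_html = (
    "<table><tr>"
    '<th>Credit points <div class="tooltip">What is this?</div></th>'
    "<td>6</td>"
    "</tr></table>"
)
sample = BeautifulSoup(cell_html, "html.parser")
th = sample.find("th")

# .text gathers every string at any depth, including the nested div.
all_text = th.text

# string=True with recursive=False returns only the first string that is
# a direct child of the cell, so the nested div is skipped.
direct_text = th.find(string=True, recursive=False).strip()

print(repr(all_text))
print(repr(direct_text))
```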

**Task: Extract the details of all assessments in the webpage.**

1. Use Inspect Element to locate the id/class of the div containing the assessment details (set this as `assessments`)
2. Create a list of `headers`, for the column headers (<td\>) of the table (e.g. ['Type', 'Description', ...])
3. For each row in the table, add a dictionary of values for that row (e.g. {'Type': 'Online task', 'Description': 'Weekly Homework', ...}) to the `data` list

Tip #1: The "Outcomes assessed" rows are not intended to be kept. Either skip these rows in your loop, or see if you can find a CSS class that would ignore these.

Tip #2: When iterating through each cell of a row, beware that not all may be enclosed by a tag of the same type.

Tip #3: A different approach may be needed for cells with bold text, and cells without bold text, to avoid the longer description text being brought in.

In [None]:
### TO DO
assessments = '?' # use the find() function to locate the div containing the table

headers = []  # populate this list with the headers of the table

data = []
for row in '?':  # iterate through each row of the table
    assessment = {'Unit': 'DATA2001', 'Session': '2023-S1C-ND-CC'}  # start with a couple fields populated
    # iterate through each cell in the row, and add it to the 'assessment' dictionary
    data.append(assessment)  # add the dictionary of row values to our overall list 'data'

pd.DataFrame(data)  # return the results as a dataframe

## 2. Web Crawling

Web scraping can be very powerful, and especially so when a script can repeat it across **multiple** webpages.

Note the legal/ethical cautions, and best practices:
1. Check the **robots.txt** to determine whether users are permitted to scrape pages, and at what frequency
2. Add **intentional delays** in the code to avoid congesting servers (or getting blocked from websites!)
3. Initially, just **practice** building your code over a single webpage or two. Only scale up to multiple pages once you are confident the code does as it is intended to!
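
As a rough illustration of the first two points, Python's standard library includes `urllib.robotparser` for reading robots.txt rules. The sketch below parses a made-up set of rules from a list of strings rather than fetching a real file; the rules, the `example.com` URLs, and the user agent name `DATA2001-tutorial-bot` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real site's rules will differ.
robots_lines = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules directly, no network request

user_agent = "DATA2001-tutorial-bot"  # hypothetical name for our script

# can_fetch() reports whether the rules permit this agent to request a URL.
allowed_public = parser.can_fetch(user_agent, "https://example.com/units/DATA2001")
allowed_private = parser.can_fetch(user_agent, "https://example.com/private/page")

# crawl_delay() returns the requested pause in seconds, or None if unspecified.
delay = parser.crawl_delay(user_agent)

print(allowed_public, allowed_private, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would fetch the actual robots.txt, and the returned delay (if any) could be passed to `time.sleep()` between requests.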

### 2.1 Link Extraction

So far, we've been exploring the webpage for this year's occurrence of DATA2001. If we go back to the homepage for DATA2001, we can similarly parse the HTML content, and find links to all occurrences of the unit. Past occurrences are represented in the _archivedOutlines_ div, and current units are in the _currentOutlines_ div, so we'll pull links from them both.

In [None]:
data2001page = requests.get("https://www.sydney.edu.au/units/DATA2001").text
data2001content = BeautifulSoup(data2001page, 'html5lib')
oldlinks = data2001content.find('div', id='archivedOutlines').find_all('a')
newlinks = data2001content.find('div', id='currentOutlines').find_all('a')
links = oldlinks+newlinks
links

Note the links there seem incomplete - they start with a slash, rather than specifying a full URL. This implies pages on the same web domain. Therefore, we can add the domain in, to turn these into fully qualified hyperlinks:

In [None]:
for link in links:
    URL = 'https://www.sydney.edu.au'+link['href']
    print(URL)
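
String concatenation works here because every link starts with a slash. As an alternative sketch, the standard library's `urljoin()` resolves a relative link against the page it was found on, and leaves links that are already absolute untouched. The `relative_href` value below is an assumed example of what such a link might look like, not one read from the page.

```python
from urllib.parse import urljoin

# The page the links were found on.
base_url = "https://www.sydney.edu.au/units/DATA2001"

# Assumed example of a relative link, starting with a slash.
relative_href = "/units/DATA2001/2020-S1C-ND-CC"

# A leading slash replaces the whole path of the base URL.
full_url = urljoin(base_url, relative_href)

# A link that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, "https://example.com/other")

print(full_url)
print(already_absolute)
```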

### 2.2 Link Traversal

**Task: Create a function that receives a URL, and returns the assessment data.**

The function is set up below, for you to paste in your answer from the task in Section 1.4. Only a couple adjustments are needed:
1. Your previous code worked with a predefined `content` variable. This function should receive the URL, retrieve its web contents, parse its HTML, and then proceed with this.
2. Our previous row-by-row `assessment` dictionary had the unit and session hardcoded in. Try updating this to reflect this information dynamically from the URL itself.

When confident your function is likely correct, test that it runs correctly by uncommenting the last line of the cell below, which will test it on [2020's DATA2001](https://www.sydney.edu.au/units/DATA2001/2020-S1C-ND-CC).

In [None]:
### TO DO

def findAssessments(URL):
    # retrieve the URL first, then:
    """
    paste in your code from the task in Section 1.4, but adjust the initial 'assessment' dictionary to actually detail the true unit code and session from the URL
    """

    return pd.DataFrame(data)

# findAssessments('https://www.sydney.edu.au/units/DATA2001/2020-S1C-ND-CC')

Once this has been achieved, we can test it by iterating over the links we located in Section 2.1.

Note an explicit delay of two seconds has been added in between each request, using the `.sleep()` function from the `time` module.

In [None]:
import time as t
df = pd.DataFrame(columns=['Unit', 'Session']+headers)  # establishing a blank dataframe to be populated
for link in links:  # for each link we found earlier
    URL = 'https://www.sydney.edu.au'+link['href']  # establishing its full address
    print(URL)  # printing it to summarise our progress
    t.sleep(2)  # waiting for two seconds before requesting
    df = pd.concat([df, findAssessments(URL)])  # merging the new data with our existing df

df

And there we have it! A simple, brief example of crawling and scraping to collate data summarised from websites.

## 3. Data Storage

Now that we have collated some information, let's note our storage options.

### 3.1 CSV Output

As mentioned in previous coverage of Pandas, exporting to a CSV file is quite simple using the `.to_csv()` function. This should create a CSV in your working directory, containing the information we collated.

In [None]:
df.to_csv("assessments.csv", index=False)

### 3.2 Database Ingestion

Of course, we can also store this in our PostgreSQL databases (the ones we manage through pgAdmin). The code below is taken directly from the Week 4 tutorial, defining helper functions to allow us to both connect to, and query, our individual databases. Note it requires the `Credentials.json` file from Week 4, so make sure that's in your current working directory! As always, it is recommended to launch pgAdmin in the background, so that you can close connections if you encounter the issue of too many.

In [None]:
from sqlalchemy import create_engine
import psycopg2
import psycopg2.extras
import json
import os
import pandas as pd

credentials = "Credentials.json"

def pgconnect(credential_filepath, db_schema="public"):
    with open(credential_filepath) as f:
        db_conn_dict = json.load(f)
        host       = db_conn_dict['host']
        db_user    = db_conn_dict['user']
        db_pw      = db_conn_dict['password']
        default_db = db_conn_dict['user']
        try:
            db = create_engine('postgresql+psycopg2://'+db_user+':'+db_pw+'@'+host+'/'+default_db, echo=False)
            conn = db.connect()
            print('Connected successfully.')
        except Exception as e:
            print("Unable to connect to the database.")
            print(e)
            db, conn = None, None
        return db,conn

def query(conn, sqlcmd, args=None, df=True):
    result = pd.DataFrame() if df else None
    try:
        if df:
            result = pd.read_sql_query(sqlcmd, conn, params=args)
        else:
            result = conn.execute(sqlcmd, args).fetchall()
            result = result[0] if len(result) == 1 else result
    except Exception as e:
        print("Error encountered: ", e, sep='\n')
    return result
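
One caveat with building the connection string by concatenation, as `pgconnect()` does: a password containing characters such as `@`, `:` or `/` can be misread as part of the URL. A possible workaround, sketched below with made-up credentials, is to escape the password with `urllib.parse.quote_plus()` first. The values here are hypothetical and nothing is actually connected to.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only.
db_user = "student"
db_pw = "p@ss:word/1"
host = "localhost"
default_db = "student"

# Escape characters that would otherwise be read as URL separators.
safe_pw = quote_plus(db_pw)

# Same URL shape as in pgconnect(), but with the escaped password.
conn_string = f"postgresql+psycopg2://{db_user}:{safe_pw}@{host}/{default_db}"

print(conn_string)
```

If your password only contains letters and digits, the escaped and unescaped forms are identical, so this step is harmless either way.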

The cell below should inform you of a successful connection. If not, see our [Ed post](https://edstem.org/au/courses/8139/discussion/769731) for common issues.

In [None]:
db, conn = pgconnect(credentials)

We'll prepare the data load by creating a schema for it (_UnitsOfStudy_), setting our `search_path`, and deleting the Assessments table, if one already exists.

In [None]:
conn.execute("""
create schema if not exists UnitsOfStudy;
set search_path to UnitsOfStudy;
drop table if exists Assessments;
""")

We'll make two small adjustments to the dataframe, for later ease in pgAdmin.
1. Changing all column names to lower case (case sensitivity issues addressed in Week 4 tutorial)
2. Removing the '%' sign from the weight column, and interpreting it as a float (to allow numerical analysis)

In [None]:
df.columns = map(str.lower, df.columns)
df.weight = df.weight.str.rstrip('%').astype('float')
df.head()

Finally, we can push the data to the servers using Pandas' `.to_sql()` function, and try out a simple `select *` SQL statement to confirm the data has loaded.

In [None]:
df.to_sql("assessments", con=conn, if_exists='append', index=False)
query(conn, "select * from Assessments")

### 3.3 Querying

**Task: Develop an SQL query that reports the first session, last session, and average weight of each assessment type.**

Order the resulting table primarily by first session, and secondarily by last session.

Optional extension: try and report the first and last _year_ rather than session.

In [None]:
### TO DO
sql = """
"""
query(conn, sql)

Well done on making it through! A fairly in-depth example of how web scraping can be used to tackle semi-structured data, completed by database ingestion and a simple query example.

## 4. Application

The fun (but way overboard) **OPTIONAL** extra task, for students in either stream.

For the last couple of years, OLEs have been a requirement of all degrees at USYD. What if we could extract assessment information **for all OLEs at the university**, thereby enabling students to narrow down those that are most appealing to them? Perhaps one student is interested in finding the OLE with the _least_ assessments, but including at least one presentation, for example. Perhaps another seeks OLEs with a group work element above a weighting of 20%. The possibilities are plentiful.

**OPTIONAL Task: Extract the list of all OLE UoS codes/titles.**

Run the cell below to request one of the 3 OLE pages (this one describes units starting with the letters A-D), then (again using Inspect Element) extract the list of all OLE titles (e.g. 'OLET1622 Numbers and Numerics') available.

In [None]:
OLEpage = requests.get("https://www.sydney.edu.au/handbooks/interdisciplinary_studies/open_learning_environment/open_learning_environment_ad_table.html").text
OLEcontent = BeautifulSoup(OLEpage, 'html5lib')

In [None]:
### TO DO
uoslist = '?'

**OPTIONAL Task: For each OLE found, extract all assessments for their most recent outline.**

Iterate through the list of units extracted above, and visit the UoS page for each. Within this UoS page, find the first link that appears in _currentOutlines_ (if one exists), and apply the same `findAssessments()` function we developed earlier, again ensuring to leave an intentional delay between visiting web pages.

A `progress()` helper function is included below, so that the time taken for the cell to run is reported. It is also recommended to print the unit code as each next one is reached, so that your progress can be monitored.

A limit on your list of OLEs has also been added by default, so that only the first three units are processed. Only remove this once you are confident your code will run smoothly on the remaining units.

In [None]:
### TO DO

# helper function to report how long the cell took to run
def progress(t0):
    print('Completed in ' + str(round((t.time()-t0)/60, 1)) + ' minutes.')
t0 = t.time()

OLEdf = pd.DataFrame(columns=['Unit', 'Session']+headers)  # establishing a blank dataframe to be populated
for i, uos in enumerate(uoslist[:3]):  # purposefully limiting to just the first few for now
    print(f'({i+1}/{len(uoslist)})')  # printing what element in the list we're up to (counting from 1)
    uoscode = '?'  # locate the UoS code
    t.sleep(2)  # wait two seconds before requesting anything
    # request the URL for this unit and locate the first link in the unit outlines table, if it exists
    # go to the first link in this table, and use the findAssessments() function to extract its assessment info
    # final line should be something like: OLEdf = pd.concat([OLEdf, findAssessments(URL)])

progress(t0)

In [None]:
OLEdf

From there, you are welcome to ingest this into your database server (see the Section 3 instructions), and begin querying it at will, to begin discovering your "ideal" OLE. Any findings from this are gladly welcomed! If sufficient interest is garnered, we may even create an Ed thread to discuss some student findings, and (more importantly from an educational perspective), the queries used to discover them :)