# OpenAdvisor - Part 1. Web Scraping

To collect the data for this project, we'll scrape degree and course requirements from college catalog websites. The process begins with collecting all the URLs for degree and course requirements (section 1A), followed by scraping the data from the degree and course pages (sections 1B and 1C). We'll also do some preprocessing: converting HTML to plain text and keeping only the relevant HTML elements.

### 1A. Collecting Degree and Course Description URLs

For each school, we need to input four example URLs into an Excel file: two degree requirement URLs and two course description URLs. The script uses the differences in these URLs' directory structures to find all the other pages and to automatically generate school, department, and degree names.

![alt text](JupyterFiles\mainurls_excel.jpg "MainURLs.xlsx")
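
As a rough illustration of how two example URLs reveal the pattern shared by their sibling pages, here is a minimal sketch. The helper name `url_pattern` and the `catalog.example.edu` URLs are hypothetical, and the sketch skips details the real script handles (such as dropping trailing directories that match in both examples).

```python
def url_pattern(url_a, url_b):
    """Return the shared URL pattern, with '*' for each directory that differs."""
    parts_a = url_a.strip('/').split('/')
    parts_b = url_b.strip('/').split('/')
    # Assumes both examples have the same directory depth
    if len(parts_a) != len(parts_b):
        raise ValueError('example URLs must have the same number of directories')
    return '/'.join(a if a == b else '*' for a, b in zip(parts_a, parts_b))


pattern = url_pattern(
    'https://catalog.example.edu/colleges/science/physics/physics-major/',
    'https://catalog.example.edu/colleges/arts/music/music-minor/',
)
print(pattern)  # https://catalog.example.edu/colleges/*/*/*
```

Each `*` marks a directory level that varies between pages; three varying levels would correspond to school, program, and degree.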

We can now run the script, starting with importing the dependencies and the Excel file we created.

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd
import unicodedata
import json
import numpy as np
from bs4 import BeautifulSoup as bs
import html2text
h2t = html2text.HTML2Text()
h2t.ignore_links = True
h2t.ignore_emphasis = True
h2t.body_width = 0
from random import sample
import re
from jupyter_functions import display_styled_table

# Retrieve URLs
mainurlsdf = pd.read_excel('MainURLs.xlsx')
schools = mainurlsdf.loc[::2, 'School'].reset_index(drop=True)
schoolsdf = pd.DataFrame(schools)
schoolsdf.index.name = 'ID'

The next cell prompts the user to select a school. For this notebook, we will run it for Colorado State University (CSU). After this step, the rest of the web scraping process is completely automated and requires no user input.

In [2]:
print(schoolsdf)
print('Enter the ID for the school you want to scrape:')
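schoolid = None
while schoolid not in schoolsdf.index:
    try:
        schoolid = int(input())
    except ValueError:      # Non-numeric input --> ask again instead of crashing
        print('Please enter one of the IDs listed above.')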
schoolname = schoolsdf.School[schoolid]

# Save school name
with open('schoolname.json', 'w') as outfile:
    json.dump(schoolname, outfile)

coursedescription_urls = mainurlsdf.loc[schoolid*2:schoolid*2+1, 'Course description URLs'].reset_index(drop=True)
degreerequirement_urls = mainurlsdf.loc[schoolid*2:schoolid*2+1, 'Degree requirement URLs'].reset_index(drop=True)

        School
ID            
0          CSU
1           CU
2    UC Denver
3   CSU Pueblo
4      CO Mesa
5          CCD
Enter the ID for the school you want to scrape:


 0


<br>
We'll start the scraping process at the parent directory of the example URLs, find all the links with a matching directory structure, and repeat, going up every level of the example URL directory structure. For each link, the parts of the directory structure that vary are used to infer the school, program, degree name, and the department, where applicable. This is possible because CourseLeaf catalogs generally follow a standardized URL format across schools. 

In [3]:
def get_sibling_urls(urls):
    """Builds a dataframe containing all the pages within a site that match the input URLs' directory structure.

    Given two example URLs, this finds all the sibling pages that have matching directory structures. For example,
    given the example URLs "www.example.com/abc/def/ghi" and "www.example.com/abc/xyz/ghi", get_sibling_urls would
    match any URL of the form "www.example.com/abc/*/ghi", where "*" is a wildcard directory name. The dataframe also
    holds school, department, program, and degree names for each URL, where applicable, inferred from the links found
    at each directory level that varies in name.

    Requires a running Selenium webdriver bound to the global name `driver`.

    :param urls: Series containing two example URLs
    :return: Dataframe with a 'link' column plus one name column per varying directory level
    """
    # Determine the url structure
    exampleurl1 = urls[0].strip('/')
    exampleurl2 = urls[1].strip('/')
    examplesplit1 = exampleurl1.split('/')
    examplesplit2 = exampleurl2.split('/')
    splitdf = pd.DataFrame({'e1': examplesplit1, 'e2': examplesplit2})
    # Delete last subdirectories if they match in both examples (likely a fragment)
    splitdf = splitdf.loc[splitdf.e1.ne(splitdf.e2).loc[::-1].cumsum()[::-1].ne(0)]

    # Make a list of the static parent directories
    directoryi = 0
    parentdirectories = ['']*10                             # Pad list
    for i in range(len(splitdf)):
        if splitdf.e1[i] == splitdf.e2[i]:  # If directory in both example URLs match --> save as static directory
            parentdirectories[directoryi] = parentdirectories[directoryi] + '/' + splitdf.e1[i]
        else:
            directoryi += 1
    parentdirectories[0] = parentdirectories[0].strip('/')
    parentdirectories = parentdirectories[:directoryi]      # Remove padding
    if directoryi == 1:
        directorynames = ['department']         # Course description URLs are only delineated by department
    elif directoryi == 2:
        directorynames = ['school', 'degree']       # Some degree URLs are delineated by school and degree
    elif directoryi == 3:
        directorynames = ['school', 'program', 'degree']    # Most degree URLs are delineated by all of these
    else:
        raise Exception('check url format')

    # Extract a list of all pages that match the structure of the example URLs
    nextdirectoryurllist = ['']
    sitelinks = []
    # Loop through each static directory, finding matching URLs
    for diri, directory in enumerate(parentdirectories):
        urllist = list(set(nextdirectoryurllist))
        nextdirectoryurllist = []
        for url in urllist:
            url = url + directory
            driver.get(url)
            driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
            time.sleep(.3)      # increase this to 5 seconds if selenium is throwing exceptions
            # Get a list of all elements with a link starting with the matching parent url
            # hrefstart_index = parentdirectories[0].rindex('.edu')+4
            # hrefstart = url[hrefstart_index:]
            # alllinks = driver.find_elements(By.XPATH, '//a[contains(@href,"' + hrefstart + '")]')
            alllinks = driver.find_elements(By.XPATH, '//a')       # Slower backup option to get all unfiltered links

            for link in alllinks:
                rawlink = link.get_attribute("href")
                if not rawlink:
                    continue
                linkurl = rawlink.strip('/')
                if ('@' in linkurl) or ('#' in linkurl) or ('.pdf' in linkurl):
                    continue
                if '/' not in linkurl:
                    continue
                if linkurl.rindex('/') == len(url):   # If linkurl is child of parent dir, not a grandchild --> append
                    linktitle = link.text.strip(' \n')
                    linktitle = unicodedata.normalize('NFKC', linktitle).encode('ascii', 'ignore').decode('utf-8')
                    if not linktitle:
                        linktitle = link.get_attribute('innerText').strip(' \n')
                        linktitle = unicodedata.normalize('NFKC', linktitle).encode('ascii', 'ignore').decode('utf-8')
                    sitelinks.append({"link": linkurl, directorynames[diri]: linktitle})
                    nextdirectoryurllist.append(linkurl)

    # Save all links and properties to a dataframe
    linksdf = pd.DataFrame(sitelinks)
    # Remove duplicates and keep the longest titles
    linksdf['linklength'] = linksdf.link.str.len()
    linksdf = linksdf.groupby(linksdf.link, as_index=False).apply(lambda x: x.iloc[x.linklength.argmax()])
    linksdf.drop(columns='linklength', inplace=True)
    # Broadcast all properties except degree (sites without a degree name need to be removed)
    linksdf.loc[:, directorynames[:-1]] = linksdf.sort_values('link').loc[:, directorynames[:-1]].ffill()
    linksdf = linksdf.dropna(subset=[directorynames[-1]])
    linksdf = linksdf.sort_values('link')
    linksdf.reset_index(inplace=True, drop=True)

    return linksdf
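
The "child, not grandchild" test in the loop above compares only the position of the last slash with the length of the parent URL. The following is a minimal sketch of a stricter variant that also checks the prefix; the helper name `is_direct_child` and the example URLs are hypothetical, and this is not the check the notebook actually runs.

```python
def is_direct_child(link_url, parent_url):
    """True if link_url is exactly one directory below parent_url."""
    link_url = link_url.strip('/')
    parent_url = parent_url.strip('/')
    if not link_url.startswith(parent_url + '/'):
        return False
    # Whatever follows the parent must be a single directory name
    leaf = link_url[len(parent_url) + 1:]
    return bool(leaf) and '/' not in leaf


parent = 'https://catalog.example.edu/courses-az'
print(is_direct_child('https://catalog.example.edu/courses-az/math/', parent))     # True
print(is_direct_child('https://catalog.example.edu/courses-az/math/101', parent))  # False
print(is_direct_child('https://catalog.example.edu/admissions/apply', parent))     # False
```

A length-only comparison would accept the third URL, because `admissions` and `courses-az` happen to have the same number of characters. Links from unrelated sections of a catalog can pass the check for the same reason.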

We'll then start up the webdriver, scrape the course description URLs and view the output.

In [4]:
s = Service('C:/PythonExtraPath/chromedriver.exe')
driver = webdriver.Chrome(service=s)

course_urls_df = get_sibling_urls(coursedescription_urls)

# Print out results
head_tail_df = course_urls_df.iloc[np.r_[0:5, -5:0]] 
display_styled_table(head_tail_df)

| | link | department |
|---|---|---|
| 0 | https://catalog.colostate.edu/general-catalog/admissions/applicant-definitions | Undergraduate Applicant Definitions |
| 1 | https://catalog.colostate.edu/general-catalog/admissions/enrollment-deposit | Enrollment Deposit |
| 2 | https://catalog.colostate.edu/general-catalog/admissions/general-policies | General Policies for Undergraduate Admissions |
| 3 | https://catalog.colostate.edu/general-catalog/admissions/how-to-apply | How to Apply |
| 4 | https://catalog.colostate.edu/general-catalog/admissions/international-admissions | International Undergraduate Admissions |
| 140 | https://catalog.colostate.edu/general-catalog/courses-az/vm | Veterinary Medicine-VM (VM) |
| 141 | https://catalog.colostate.edu/general-catalog/courses-az/vmbs | Vet Med + Biomed Sciences-VMBS (VMBS) |
| 142 | https://catalog.colostate.edu/general-catalog/courses-az/vs | Clinical Sciences-VS (VS) |
| 143 | https://catalog.colostate.edu/general-catalog/courses-az/wr | Watershed Science-WR (WR) |
| 144 | https://catalog.colostate.edu/general-catalog/courses-az/ws | Women's Studies-WS (WS) |


Then we repeat for the degree requirement pages and close the webdriver.

In [5]:
degree_urls_df = get_sibling_urls(degreerequirement_urls)

# Print out results
head_tail_df = degree_urls_df.iloc[np.r_[0:5, -5:0]] 
display_styled_table(head_tail_df)

driver.close()

| | link | school | program | degree |
|---|---|---|---|---|
| 0 | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major | Agricultural Sciences | Agricultural Biology | Agricultural Biology |
| 1 | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major-entomology-concentration | Agricultural Sciences | Agricultural Biology | Entomology Concentration |
| 2 | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major-plant-pathology-concentration | Agricultural Sciences | Agricultural Biology | Plant Pathology Concentration |
| 3 | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major-weed-science-concentration | Agricultural Sciences | Agricultural Biology | Weed Science Concentration |
| 4 | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/bioagricultural-sciences-phd | Agricultural Sciences | Agricultural Biology | Ph.D. in Bioagricultural Sciences |
| 552 | https://catalog.colostate.edu/general-catalog/university-wide-programs/interdisciplinary-studies/school-advanced-materials-discovery | Students' Rights | University Interdisciplinary Studies Programs | School of Advanced Materials Discovery |
| 553 | https://catalog.colostate.edu/general-catalog/university-wide-programs/interdisciplinary-studies/sports-management-interdisciplinary-minor | Students' Rights | University Interdisciplinary Studies Programs | Sports Management Interdisciplinary Minor |
| 554 | https://catalog.colostate.edu/general-catalog/university-wide-programs/interdisciplinary-studies/sustainable-energy-interdisciplinary-minor | Students' Rights | University Interdisciplinary Studies Programs | Sustainable Energy Interdisciplinary Minor |
| 555 | https://catalog.colostate.edu/general-catalog/university-wide-programs/interdisciplinary-studies/sustainable-water-interdisciplinary-minor | Students' Rights | University Interdisciplinary Studies Programs | Sustainable Water Interdisciplinary Minor |
| 556 | https://catalog.colostate.edu/general-catalog/university-wide-programs/interdisciplinary-studies/womens-study-interdisciplinary-minor | Students' Rights | University Interdisciplinary Studies Programs | Womens Study Interdisciplinary Minor |


### 1B. Course Description Scraping

We'll now use the URLs gathered in section 1A to scrape those individual sites, starting with the course description URLs. On CourseLeaf catalogs, each individual URL holds an entire department's courses, and each individual course is contained in a DIV element with the class "courseblock". We use BeautifulSoup to parse and extract all these matching elements and then collect them into a dataframe.

In [6]:
courseblocks_df = pd.DataFrame()
driver = webdriver.Chrome(service=s)

# Loop through each course description page (each page contains an entire department's courses)
for site in course_urls_df.itertuples():
    url = site.link
    driver.get(url)
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    time.sleep(.3)
    sitehtml = driver.page_source
    soup = bs(sitehtml, "html.parser")
    # Courseblock is the html container for each individual course description
    courseblocks = soup.find_all('div', {'class': 'courseblock'})
    for blocki, block in enumerate(courseblocks):
        rowsdf = pd.DataFrame()
        # Elements contains every html element within courseblock
        elements = pd.Series(list(block.children))
        elements = elements[elements != '\n']
        # An html break followed by bold text is likely a header (TODO test if splitting at every break & \n is valid)
        rowsdf['html'] = elements.apply(str).str.split('<br/><strong>').apply(pd.Series).stack().reset_index(drop=True)
        rowsdf = rowsdf.dropna()
        rowsdf['plaintext'] = rowsdf.html.apply(lambda x: unicodedata.normalize('NFKC', h2t.handle(x)).strip('\n'))
        rowsdf = rowsdf.loc[rowsdf.plaintext != '']
        rowsdf = rowsdf.dropna().reset_index(drop=True)
        # Assign unique ID for each course within a page and their associated elements
        rowsdf['blockid'] = blocki
        rowsdf['department'] = site.department
        courseblocks_df = pd.concat([courseblocks_df, rowsdf])
driver.close()

# Block index resets for every block, blockid resets for every department, while df.index doesn't reset
courseblocks_df['blockindex'] = courseblocks_df.index
courseblocks_df.index = courseblocks_df.groupby(['department', 'blockid'], sort=False).ngroup()
courseblocks_df.index.rename('courseblock_id', inplace=True)

# Save the results
courseblocks_df.to_pickle('coursedescriptions.pkl')

# Display the first 20 rows
head20_df = courseblocks_df[['plaintext', 'department']].head(20)
display_styled_table(head20_df)

| courseblock_id | plaintext | department |
|---|---|---|
| 0 | AA 100 Introduction to Astronomy (GT-SC2) Credits: 3 (3-0-0) | Astronomy-AA (AA) |
| 0 | Course Description: Description of the various objects found in the heavens as well as the principles and techniques employed in investigations of these objects. | Astronomy-AA (AA) |
| 0 | Prerequisite: None. | Astronomy-AA (AA) |
| 0 | Registration Information: Sections may be offered: Online. | Astronomy-AA (AA) |
| 0 | Terms Offered: Fall, Spring, Summer. | Astronomy-AA (AA) |
| 0 | Grade Mode: Traditional. | Astronomy-AA (AA) |
| 0 | Special Course Fee: Yes. | Astronomy-AA (AA) |
| 0 | Additional Information: Biological & Physical Sciences 3A, Natural & Physical Sciences w/o lab (GT-SC2). | Astronomy-AA (AA) |
| 1 | AA 101 Astronomy Laboratory (GT-SC1) Credit: 1 (0-2-0) | Astronomy-AA (AA) |
| 1 | Course Description: Conduct observations, experiments, and simulations to develop an intuitive understanding of astronomical phenomena. | Astronomy-AA (AA) |
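
To make the row-splitting step concrete, here is a minimal sketch of how one course block breaks into the rows shown above. The HTML snippet is made up for illustration, and a simple regex stands in for `html2text`, so this is only an approximation of what the notebook does.

```python
import re

# Hypothetical course block; real CourseLeaf markup is more involved
block_html = (
    '<p><strong>AA 100 Introduction to Astronomy</strong>'
    '<br/><strong>Course Description:</strong> Objects found in the heavens.'
    '<br/><strong>Prerequisite:</strong> None.</p>'
)

# Split where a line break is immediately followed by bold text (a likely field header)
pieces = block_html.split('<br/><strong>')
# Drop the remaining tags to get plain text (a crude stand-in for html2text)
rows = [re.sub(r'<[^>]+>', '', piece).strip() for piece in pieces]
print(rows)
```

Each resulting string becomes one row of `courseblocks_df`, tagged with the block ID and department.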


### 1C. Degree Requirement Scraping and Pre-processing

Similar to section 1B, we'll use the URLs generated in 1A and use those to scrape the individual degree requirement pages. We will also apply several steps of filtering to return only relevant pages (ones that contain degree requirement tables) and relevant HTML elements on those pages. 

First, we select only pages that have a main div (class `page_content`) and extract all of its child elements into `siblingsdf`. We infer the hierarchy of the header (h1-h6) elements from their importance, and use this hierarchy to identify which headers apply to which elements. Similarly, we identify which elements contain superscript characters and determine which sibling element contains the associated superscript definitions. The superscript definitions are then stored, along with the school, program, and degree data, as properties in a dataframe.
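
The header assignment can be pictured with a small stack-based sketch. It follows the same level convention as the cell below (1-7 for headers, 8 for ordinary content), but the function name `assign_headers` and the sample page are hypothetical, and the notebook itself implements this step with pandas index slicing rather than a stack.

```python
def assign_headers(elements):
    """Pair each non-header element with the ' : '-joined headers above it.

    elements: list of (level, text) tuples, where levels 1-7 are headers
    and level 8 marks ordinary content.
    """
    open_headers = []   # stack of (level, text) for headers still in scope
    labeled = []
    for level, text in elements:
        if level < 8:
            # A new header closes every open header of equal or lower importance
            while open_headers and open_headers[-1][0] >= level:
                open_headers.pop()
            open_headers.append((level, text))
        else:
            labeled.append((' : '.join(t for _, t in open_headers), text))
    return labeled


page = [
    (1, 'Major in Physics'),
    (2, 'Requirements'),
    (8, '<table>core courses</table>'),
    (3, 'Electives'),
    (8, '<table>elective courses</table>'),
    (2, 'Learning Outcomes'),
    (8, '<p>outcome text</p>'),
]
for headers, content in assign_headers(page):
    print(headers, '->', content)
```

In this sample, the second table is labeled `Major in Physics : Requirements : Electives`, while the final paragraph falls back to `Major in Physics : Learning Outcomes` because the second h2 closes both earlier sub-headers.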

In [7]:
# Set regex definitions
htmltext_re = re.compile(r'(?<=>)[^<]+')
sscript_pattern = r'(?<=_SUPERSCRIPT_).+?(?=_)'     # sscript is abbreviation for superscript
tablerowhtml_re = re.compile(r'<tr.+?</tr>', flags=re.DOTALL)
tablerowclass_pattern = '(?:class=")([^ "]*)'

s = Service('C:/PythonExtraPath/chromedriver.exe')
driver = webdriver.Chrome(service=s)

htmldf = pd.DataFrame()
for pagei, page in degree_urls_df.iterrows():
    url = page.link
    driver.get(url)
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    time.sleep(.3)
    sitehtml = driver.page_source

    # Replace html superscripts with plain text code so they don't lose their superscript designation when decoded
    for superscript in re.findall(r'(?<=<sup>)[^<]+', sitehtml):
        if ',' in superscript:  # For comma separated lists of superscripts
            replacements = ' _SUPERSCRIPT_' + '_ _SUPERSCRIPT_'.join(re.findall(r'[^ ,]+', superscript)) + '_'
            sitehtml = re.sub(r'<sup>.+?</sup>', replacements, sitehtml, count=1)
        elif ' ' in superscript:  # For space separated lists of superscripts (this is rare)
            replacements = ' _SUPERSCRIPT_' + '_ _SUPERSCRIPT_'.join(re.findall(r'[^ ]+', superscript)) + '_'
            sitehtml = re.sub(r'<sup>.+?</sup>', replacements, sitehtml, count=1)
        else:
            sitehtml = re.sub(r'<sup>.+?</sup>', (' _SUPERSCRIPT_' + superscript + '_'), sitehtml, count=1)

    # Remove invisible superscripts (these are errors from the webdeveloper)
    sitehtml = re.sub(r'_SUPERSCRIPT_ *_', '', sitehtml)

    soup = bs(sitehtml, features='lxml')

    siblings = []
    tabnumber = []
    # Extract elements from main page content then loop through each sub-element
    pagecontent = soup.find_all(None, {'class': 'page_content tab_content'})
    if not pagecontent:       # Not a page with sub-pages on tabs
        pagecontent = soup.find_all(None, {'class': 'page_content'})
    # Drill down from page -> tabs -> children -> grandchildren -> greatgrandchildren and append elements
    for tabi, tabpage in enumerate(pagecontent):
        for child in tabpage.children:
            if child.name == 'div':         # TODO: Generalize to n-levels
                for grandchild in child.children:
                    if grandchild.name == 'div':
                        for ggchild in grandchild.children:
                            if str(ggchild) != '\n':
                                siblings.append(str(ggchild))       # Save each html element
                                tabnumber.append(tabi)              # Save the tab number too
                    elif str(grandchild) != '\n':
                        siblings.append(str(grandchild))
                        tabnumber.append(tabi)
            elif str(child) != '\n':
                siblings.append(str(child))
                tabnumber.append(tabi)
    if not siblings:
        continue

    # Save elements to a dataframe
    siblingsdf = pd.DataFrame({'tabnumber': tabnumber, 'html': siblings})

    # Extract html class (use html class name for tables and html tagname for all other elements)  todo: clarify naming
    siblingsdf['htmlclass'] = siblingsdf.html.str.extract('(?<=<)(.+?)(?=( |>))')[0]
    tableclass = siblingsdf.html.str.extract('(?:class=")(.+?)(?:")', expand=False).str.split().str[0]
    siblingsdf.loc[siblingsdf.htmlclass.eq('table'), 'htmlclass'] = tableclass
    siblingsdf = siblingsdf[siblingsdf.htmlclass.ne('hr/')]
    siblingsdf = siblingsdf[siblingsdf.htmlclass.notna()]

    # Extract header importance
    siblingsdf['h'] = siblingsdf.htmlclass.str.extract(r'(?<=h)([1-6])(?=\Z)')
    siblingsdf.loc[siblingsdf.htmlclass.eq('pre'), 'h'] = 7         # Preformatted text is sometimes table header
    # Anything above 7 is not a header (assign back instead of using inplace on a slice, which may not stick)
    siblingsdf['h'] = pd.to_numeric(siblingsdf['h'].fillna(8))
    siblingsdf['string'] = siblingsdf.html.apply(lambda x: ''.join(htmltext_re.findall(x)))

    # Assign corresponding headers to each element hierarchically and save as headertext
    siblingsdf['headertext'] = ''
    for h in range(1, 8):
        if h in siblingsdf.h.values:
            for starti in siblingsdf.h[siblingsdf.h == h].index:
                if not (siblingsdf.h.loc[starti + 1:] <= h).any():
                    stopi = max(siblingsdf.index)
                else:
                    stopi = siblingsdf.loc[starti + 1:, 'h'][siblingsdf.h.loc[starti + 1:] <= h].index[0] - 1
                siblingsdf.loc[starti:stopi, 'headertext'] = siblingsdf.headertext + siblingsdf.string[starti] + ' : '
    siblingsdf.headertext = siblingsdf.headertext.str.strip(' : ')

    # Delete header elements (their text is saved in headertext)
    siblingsdf = siblingsdf.loc[siblingsdf.h.eq(8)].reset_index(drop=True)
    siblingsdf.drop(columns="h", inplace=True)

    # Change htmlclass for elements to 'sc_footnotes' if they contain superscript definitions   todo: clarify naming
    siblingsdf.loc[siblingsdf.string.str.match('_SUPERSCRIPT_'), 'htmlclass'] = 'sc_footnotes'

    # Merge adjacent text blocks with matching headers (but don't merge for tables)
    dontmerge = siblingsdf.html.str.match('<table ')
    groupedbyheader = (
                siblingsdf.headertext.ne(siblingsdf.headertext.shift()) | dontmerge | dontmerge.shift(1)).cumsum()
    siblingsdf = siblingsdf.groupby(groupedbyheader, as_index=False).agg(
        {'tabnumber': 'first', 'html': ' \n '.join, 'htmlclass': 'first', 'string': ' ; '.join, 'headertext': 'first'})

    # Get headers of siblings that precede tables (may contain degree name)
    siblingsdf.loc[
        ~siblingsdf.htmlclass.isin(['sc_plangrid', 'sc_courselist']), 'siblingheaders'] = siblingsdf.headertext
    tablegroups = siblingsdf[::-1].siblingheaders.isna().cumsum()[::-1]
    siblingsdf.siblingheaders = siblingsdf.siblingheaders.groupby(tablegroups).transform(lambda x: ' : '.join(x[:-1]))

    # Assign superscript definitions to the tables that reference them
    siblingsdf.loc[siblingsdf.htmlclass.eq('sc_footnotes'), 'sscripts'] = siblingsdf.string
    siblingsdf.loc[siblingsdf.html.str.match('<p> _SUPERSCRIPT_'), 'sscripts'] = siblingsdf.string
    siblingsdf.sscripts = siblingsdf.sscripts.bfill()     # Sometimes definitions are listed under a later element
    siblingsdf.sscripts = siblingsdf.sscripts.ffill()     # or before the element
    has_sscripts = siblingsdf.html.str.contains(sscript_pattern) | siblingsdf.headertext.str.contains(sscript_pattern)
    needs_sstable = has_sscripts & siblingsdf.htmlclass.ne('sc_footnotes')
    # if (siblingsdf.sscripts.isna() & siblingsdf.htmlclass.isin(['sc_courselist', 'sc_plangrid']) & needs_sstable).any():
    #     print('There is a table that contains superscripts but does not have superscript definitions. Ignore? (y/n)')
    #     if input() != 'y':
    #         raise Exception('Terminated by User')

    # Remove definitions if the table doesn't have any superscripts
    siblingsdf.loc[~needs_sstable & siblingsdf.htmlclass.isin(['sc_courselist', 'sc_plangrid']), 'sscripts'] = ''
    siblingsdf = siblingsdf[siblingsdf.htmlclass.ne('sc_footnotes')]
    if siblingsdf.empty:
        continue
    siblingsdf = siblingsdf.reset_index(drop=True)

    # Assign all the page-wide properties
    siblingsdf = siblingsdf.assign(link=url)
    siblingsdf = siblingsdf.assign(pagenumber=pagei)
    # Levels missing from the URL structure (e.g. no 'program' for two-level catalogs) are filled with NaN
    for prop in ['program', 'degree', 'school']:
        siblingsdf[prop] = page.get(prop, np.nan)
    # Extract the page title (it's usually the degree name)
    if soup.find(None, {'id': 'page-title'}) is None:
        if soup.find(None, {'class': 'page-title'}) is None:
            siblingsdf['pagetitle'] = soup.find(None, {'class': 'page-header'}).text
        else:
            siblingsdf['pagetitle'] = soup.find(None, {'class': 'page-title'}).text
    else:
        siblingsdf['pagetitle'] = soup.find(None, {'id': 'page-title'}).text

    # Concatenate the df's for all pages
    htmldf = pd.concat([htmldf, siblingsdf])
driver.close()
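
The superscript tagging in the loop above can also be sketched as a single `re.sub` pass with a callback. This is a simplified variant rather than the exact logic of the cell: the function name `mark_superscripts` is hypothetical, it assumes superscripts contain plain text only, and it drops empty superscripts directly instead of cleaning them up afterwards.

```python
import re


def mark_superscripts(html):
    """Replace each <sup>...</sup> with _SUPERSCRIPT_x_ tokens, one per label."""
    def to_tokens(match):
        # Labels may be separated by commas and/or spaces, e.g. "1,2" or "1 2"
        labels = re.findall(r'[^ ,]+', match.group(1))
        return ''.join(' _SUPERSCRIPT_' + label + '_' for label in labels)
    return re.sub(r'<sup>([^<]*)</sup>', to_tokens, html)


print(mark_superscripts('AB 120<sup>1,2</sup> and AB 130<sup>3</sup>'))
# AB 120 _SUPERSCRIPT_1_ _SUPERSCRIPT_2_ and AB 130 _SUPERSCRIPT_3_
```

Because the tokens are plain text, they survive the later HTML-to-text conversion and can still be matched against the footnote definitions.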

On CourseLeaf catalogs, degree requirements are contained in tables with the HTML class "sc_courselist" or "sc_plangrid".

In [8]:
# Extract just the tables that contain degree requirements
isdegreetable = htmldf.htmlclass.eq('sc_courselist') | htmldf.htmlclass.eq('sc_plangrid')
tables = htmldf.loc[isdegreetable, ['degree', 'headertext', 'siblingheaders', 'sscripts', 'tabnumber', 'pagenumber',
                                    'html', 'htmlclass', 'link', 'pagetitle']].reset_index(drop=True)

To vectorize the following steps, we will nest each table within a series, then apply operations on the whole series.

In [9]:
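from io import StringIO

# At this point the tables are still just HTML
# Convert each individual HTML table to a dataframe, which is then nested as a single element in tablesseries
# (wrapping the string in StringIO avoids the deprecation warning newer pandas versions raise for literal HTML)
tablesseries = tables.html.apply(lambda x: pd.read_html(StringIO(x))[0])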

We apply some basic data cleaning, then rename the nested tables' columns according to the CourseLeaf schema.

In [10]:
tablesseries.rename('tables', inplace=True)
tablesseries = tablesseries.apply(lambda x: x.fillna(''))

# Remove unicode junk
tablesseries = tablesseries.apply(lambda x: x.applymap(lambda y: unicodedata.normalize('NFKC', str(y).strip(' \n')).
                                                       encode('ascii', 'ignore').decode('utf-8')))
# Mark current rows as not being a table header
tablesseries = tablesseries.apply(lambda x: x.assign(headerflag=False))

# Move the column index into the first row(s); in those rows the headerflag cell now holds a column label
tablesseries = tablesseries.apply(lambda x: x.T.reset_index().T.reset_index(drop=True))


def name_columns(table):
    """Add the optional 'coregroup' column if it is missing, then apply the standard column names."""
    # 'coregroup' is an optional column sometimes present in course tables (represents gen ed groups)
    if len(table.columns) == 4:
        table.insert(2, 'coregroup', '')
    # set_axis returns a renamed copy (its inplace argument was removed in pandas 2.0)
    return table.set_axis(['code', 'title', 'coregroup', 'credits', 'headerflag'], axis=1)


tablesseries = tablesseries.apply(name_columns)
# Flag the rows that came from the column index (their headerflag is no longer False) as table headers
tablesseries = tablesseries.apply(lambda x: x.assign(headerflag=x.headerflag.ne(False)))
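
The "remove unicode junk" step is applied to every cell of every table. Here is a minimal sketch of what it does to a single value, using only the standard library; the helper name `clean_text` and the sample strings are hypothetical.

```python
import unicodedata


def clean_text(value):
    """Normalize compatibility characters, then drop anything that is not ASCII."""
    normalized = unicodedata.normalize('NFKC', str(value).strip(' \n'))
    return normalized.encode('ascii', 'ignore').decode('utf-8')


print(repr(clean_text('CHEM\xa0107 \n')))   # 'CHEM 107'  (non-breaking space becomes a regular space)
print(repr(clean_text('\ufb01nal')))        # 'final'     (the "fi" ligature is expanded)
print(repr(clean_text('caf\xe9')))          # 'caf'       (accented letters are dropped entirely)
```

The last case shows a side effect worth keeping in mind: characters with no ASCII equivalent under NFKC are removed rather than transliterated.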


The series is then moved into a dataframe, along with the associated table-wide columns from `tables` (extracted from `htmldf` above).

In [11]:
# Make new df containing each individual table and their table-wide info
tabledf = pd.concat([tablesseries, tables], axis=1)
tabledf = tabledf.assign(id=list(range(len(tables))))
# Broadcast table info from the other columns of tabledf into the nested dataframes in the first column
tabledf.tables = tabledf.apply(lambda x: x.tables.assign(
    degree=x.degree, pagetitle=x.pagetitle, headertext=x.headertext, siblingheaders=x.siblingheaders,
    tabnumber=x.tabnumber, pagenumber=x.pagenumber, superscripts=x.sscripts, htmlclass=x.htmlclass,
    link=x.link, id=x.id), axis=1)

# Convert entire table html into html of individual rows              
tabledf.tables = tabledf.apply(lambda x: x.tables.assign(
    html=[''] * (len(x.tables) - len(tablerowhtml_re.findall(x.html))) + tablerowhtml_re.findall(x.html)), axis=1)
tabledf.tables = tabledf.apply(lambda x: x.tables.assign(rowclass=x.tables.html.str.extract(tablerowclass_pattern)),
                               axis=1)
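
The row-level HTML step relies on front-padding: when the parsed table has more rows than there are `<tr>` elements, empty strings are prepended so that the last rows still line up. Here is a minimal sketch using the same two regex definitions as the notebook; the table snippet and the row count are made up for illustration.

```python
import re

tablerowhtml_re = re.compile(r'<tr.+?</tr>', flags=re.DOTALL)
tablerowclass_pattern = '(?:class=")([^ "]*)'

# Hypothetical table HTML with two body rows
table_html = (
    '<table class="sc_courselist">'
    '<tr class="even firstrow"><td>CHEM 107</td><td>4</td></tr>'
    '<tr class="odd"><td>CHEM 108</td><td>1</td></tr>'
    '</table>'
)
n_parsed_rows = 3   # hypothetical: the parsed table carries one extra header row on top


def row_class(html):
    """Return the first class name in a row's HTML, or None if there is none."""
    match = re.search(tablerowclass_pattern, html)
    return match.group(1) if match else None


row_html = tablerowhtml_re.findall(table_html)
# Pad at the front so the last rows of the parsed table line up with the <tr> elements
padded = [''] * (n_parsed_rows - len(row_html)) + row_html
row_classes = [row_class(html) for html in padded]
print(row_classes)   # [None, 'even', 'odd']
```

Only the first class name of each row is kept, which is what appears in the `rowclass` column of the final dataframe.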

Finally, we explode the nested tables to create the final dataframe, then save and print out a sample.

In [12]:
# Explode series containing dataframes into one big dataframe
df = pd.concat(tabledf.tables.to_list())

# Delete blank rows
df = df.loc[df.html.str.contains(r'(?<=>)[^<]+'), :]
df.reset_index(drop=True, inplace=True)

df.to_pickle('degreetables.pkl')
df_no_html_column = df.drop(columns='html')
display_styled_table(df_no_html_column.head(10))

Unnamed: 0,code,title,coregroup,credits,headerflag,degree,pagetitle,headertext,siblingheaders,tabnumber,pagenumber,superscripts,htmlclass,link,id,rowclass
0,Freshman,Freshman,Freshman,Freshman,True,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,plangridyear
1,Unnamed: 0_level_1,Unnamed: 1_level_1,AUCC,Credits,True,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,plangridterm
2,AB 120 _SUPERSCRIPT_1_ _SUPERSCRIPT_2_,Agricultural Biology--Freshman Orientation,,1,False,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,codecol
3,AB 130 _SUPERSCRIPT_1_ _SUPERSCRIPT_2_,Working with Agricultural Biology Data,,1,False,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,codecol
4,AREC 202,Agricultural and Resource Economics (GT-SS1),3C,3,False,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,codecol
5,CHEM 107,Fundamentals of Chemistry (GT-SC2),3A,4,False,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,codecol
6,CHEM 108,Fundamentals of Chemistry Laboratory (GT-SC1),3A,1,False,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,codecol
7,CO 150,College Composition (GT-CO2),1A,3,False,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,codecol
8,Select one group from the following:,Select one group from the following:,,8,False,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,
9,Group A,Group A,,,False,Agricultural Biology,Major in Agricultural Biology,Effective Fall 2020,: Learning Outcomes : Potential Occupations : Concentrations,1,0,"_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course in order to complete the program. _SUPERSCRIPT_2_Transfer students are required to take AB 270 in lieu of AB 120, AB 130, and AB 230. _SUPERSCRIPT_3_Select enough elective credits to bring the program total to 120, of which at least 42 must be Upper-Division (300- to 400-level).",sc_plangrid,https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major,0,


**At this stage, the degree requirement and course description data are still fairly raw. In Section 2, we will process this data into a more organized, human-readable format.**
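
To illustrate what "raw" means here, the sample rows above still carry footnote markers such as `_SUPERSCRIPT_1_` inside the course code column, and the footnote text itself is repeated in every row as one long string. The following is a minimal sketch of how those markers could be separated out. It is not the Section 2 implementation, and the helper names `split_footnote_markers` and `parse_footnote_text` are hypothetical, introduced only for this example.

```python
import re

# Marker format as it appears in the scraped output above, e.g. "_SUPERSCRIPT_1_".
SUPERSCRIPT_RE = re.compile(r"_SUPERSCRIPT_(\d+)_")


def split_footnote_markers(cell):
    """Return (clean_text, footnote_numbers) for one scraped cell.

    Hypothetical helper: removes the markers and collapses leftover whitespace.
    """
    footnotes = [int(n) for n in SUPERSCRIPT_RE.findall(cell)]
    clean = " ".join(SUPERSCRIPT_RE.sub(" ", cell).split())
    return clean, footnotes


def parse_footnote_text(notes):
    """Map each footnote number to its text.

    Hypothetical helper: re.split with one capturing group yields
    [text_before_first_marker, number, text, number, text, ...].
    """
    parts = SUPERSCRIPT_RE.split(notes)
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


# Values copied from the sample rows above (footnote text shortened).
code, refs = split_footnote_markers("AB 120 _SUPERSCRIPT_1_ _SUPERSCRIPT_2_")
notes = parse_footnote_text(
    "_SUPERSCRIPT_1_ A minimum grade of 'C' (2.000) must be obtained in this course. "
    "_SUPERSCRIPT_2_Transfer students are required to take AB 270."
)
print(code, refs)
print(notes[2])
```

Keeping the footnote numbers next to the cleaned course code lets each requirement row link back to its footnote text without storing the full string in every row.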