# OpenAdvisor - Part 2. Data Organization & Parsing

## 2A. Course Description Organization

In this section, we start to organize and parse the data by taking the raw course descriptions scraped in part 1B and extracting key features. We use keywords, their positions, and other context clues to sort the data into categories for each course. We begin by importing dependencies and defining these keywords and key phrases.

In [1]:
import re
import pandas as pd
from listtopattern import listtopatternraw
from tabulate import tabulate
from random import sample
from verticalprinter import v
from jupyter_functions import display_styled_table
import warnings
import numpy as np
from listtopattern import listtopattern
import sys
import json
warnings.filterwarnings("ignore", 'This pattern has match groups')

# Header keywords/phrases
creditsID = ['.*Credits?:']
descriptionsID = ['Course Description:']
requisitesID = ['Requisites?:']
prerequisitesID = ['Pre-?(req(uisite)?)?s?:']
corequisitesID = ['Co-?req(uisite)?s?:']
equivalentsID = ['Equivalent - Duplicate Degree Credit Not Granted:', 'Also Offered As:', 'Same as:',
                 r'Equivalent Course\(?s?\)?:', 'Equivalent with:?']
coursegroupsID = ['Additional Information:', 'Attributes:', 'Essential Learning Categories:']
gradingtypeID = ['Grading Basis:', 'Grade Modes?:', 'Grading Scheme:']
recommendedsID = ['Recommended:']
repeatablityID = [r'Repeatable(\.|:)', r'Course may be taken (?=multiple|\d)']
restrictionsID = ['Restrictions?:']
registrationinfoID = ['Registration Information:', 'Note:']
termofferedID = ['(Terms? )?(Typically )?Offered:']
coursefeesID = ['((Special )?Course )?Fees?:']
GTpathwaysID = ['Colorado Guaranteed Transfer']
ID_list = [descriptionsID, requisitesID, prerequisitesID, corequisitesID, equivalentsID, coursegroupsID, gradingtypeID,
           recommendedsID, repeatablityID, restrictionsID, registrationinfoID, termofferedID, coursefeesID,
           GTpathwaysID]
ID_names = ['description', 'requisites', 'prerequisites', 'corequisites', 'equivalents', 'coursegroups', 'gradingtype',
            'recommendeds', 'repeatablity', 'restrictions', 'registrationinfo', 'termoffered', 'coursefees',
            'GTpathways']

Certain features can also be defined by unique patterns, such as course codes (e.g. MAT 201), so we define these as regular expressions.

In [2]:
# Regex patterns
coursecode_pattern = r'([A-Z][A-Z]?[A-Z]?[A-Z]?[A-Z]?[A-Z]?[A-Z]?[A-Z]?)[- ]?([0-9][0-9]?[0-9]?[0-9]?[A-Z]?[A-Z]?[A-Z]?)'
ccredits_parenthesis_pattern = r'(?:\()([0-9][0-9]?\.?[0-9]? ?-? ?[0-9]?[0-9]?\.?[0-9]?)(?:\))'
ccredits_colon_pattern = r'(?:credits?|units?|hours?):? ?(?:var|varies|variable)? ?\[?([0-9][0-9]?\.?[0-9]? ?-? ?[0-9]?[0-9]?\.?[0-9]?)\]?'
ccredits_nocolon_pattern = r'(?:var|varies|variable)? ?\[?([0-9][0-9]?\.?[0-9]? ?-? ?[0-9]?[0-9]?\.?[0-9]?)\]? (?:credits?|units?|(?:semester )?h(?:ou)?rs?)'
ID_pattern = r'\A([^(/.|:)]+:)'
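As a quick sanity check, two of these patterns can be tried on a made-up first line using only Python's `re` module. The sample string is hypothetical and not taken from the scraped data; the two patterns are copied from the cell above.

```python
import re

# Patterns copied from the cell above
coursecode_pattern = r'([A-Z][A-Z]?[A-Z]?[A-Z]?[A-Z]?[A-Z]?[A-Z]?[A-Z]?)[- ]?([0-9][0-9]?[0-9]?[0-9]?[A-Z]?[A-Z]?[A-Z]?)'
ccredits_parenthesis_pattern = r'(?:\()([0-9][0-9]?\.?[0-9]? ?-? ?[0-9]?[0-9]?\.?[0-9]?)(?:\))'

# Hypothetical first line of a course block
sample_line = 'MAT 201 Calculus I (1-3)'

# The course code is expected at the start of the line, so re.match is used
dept, number = re.match(coursecode_pattern, sample_line).groups()
# The credits can appear anywhere on the line, so re.search is used
course_credits = re.search(ccredits_parenthesis_pattern, sample_line).group(1)
print(dept, number, course_credits)
```

The department and number come back as separate capture groups, which is what lets the later cells store them in separate columns.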

We can then import the dataframe generated in section 1B (blocksdf), and create a new dataframe where we'll put all the features we extract in later steps. 

In [3]:
# Open dataframe from script 2
blocksdf = pd.read_pickle('coursedescriptions.pkl')
coursesdf = pd.DataFrame(columns=['dept', 'number', 'credits', 'title', 'description'])

# Split up any lines that have \n    Todo: Accomplish this in script 2 instead
blocksdf.plaintext = blocksdf.plaintext.apply(lambda x: x.split('\n'))
blocksdf = blocksdf.explode('plaintext')

# Reset block index
blocksdf.blockindex = blocksdf.blockid.groupby(blocksdf.index).transform(lambda x: range(len(x)))

The first line of each course description block typically contains the most important features: course code, course title, and credits. We use the regex pattern for course codes to identify the code at the start of each first line. If more than 98% of the first lines begin with a course code, we can safely assume that we've found all of them and that they aren't located somewhere else in the course block. Blocks whose first line does not match are dropped.

In [4]:
# First line of course description usually contains coursecode, title and credits
firstlines = blocksdf.groupby([blocksdf.index]).first().plaintext.reset_index(drop=True)

# Extract course codes (make sure 98% of firstlines contain course codes)
if sum(firstlines.str.match(coursecode_pattern)) > .98*len(firstlines):
    blocksdf = blocksdf.loc[firstlines.str.match(coursecode_pattern)]
    firstlines = firstlines.loc[firstlines.str.match(coursecode_pattern)]
    coursesdf['dept'] = firstlines.str.extract(coursecode_pattern).loc[:, 0]
    coursesdf['number'] = firstlines.str.extract(coursecode_pattern).loc[:, 1]
    firstlines = firstlines.str.replace(coursecode_pattern, '', regex=True, n=1).str.strip(' .')
else:
    raise Exception('the coursecode is not the first item on the firstlines')
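The threshold rule can be sketched as a small helper. This is a minimal illustration in plain Python rather than pandas; `match_rate`, the simplified stand-in pattern, and the sample lines are all hypothetical.

```python
import re

def match_rate(lines, pattern):
    """Return the fraction of lines that start with a match for pattern."""
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if re.match(pattern, line))
    return hits / len(lines)

# Simplified stand-in for coursecode_pattern (assumption, not the notebook's pattern)
simple_code_pattern = r'[A-Z]{1,8}[- ]?[0-9]{1,4}[A-Z]{0,3}'
lines = ['MAT 201 Calculus I', 'PH 122 Physics II', 'Offered in spring only']

rate = match_rate(lines, simple_code_pattern)
if rate > 0.98:
    print('course codes are on the first line')
else:
    print(f'only {rate:.0%} of first lines start with a course code')
```

Checking the rate before extracting anything means a catalog with a different layout fails loudly instead of silently producing empty columns.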


We follow the same procedure for credits, then assume that whatever is left over in the first line is the title of the course.

In [5]:
# Extract credits (ensure 80% of firstlines contain credits)        Todo: Clean up and generalize
if sum(firstlines.str.contains(ccredits_colon_pattern, flags=re.IGNORECASE, regex=True)) > .8*len(firstlines):
    coursesdf['credits'] = firstlines.str.extract(ccredits_colon_pattern, flags=re.IGNORECASE)
    firstlines = firstlines.str.replace(ccredits_colon_pattern, '', regex=True, flags=re.IGNORECASE, n=1)
elif sum(firstlines.str.contains(ccredits_nocolon_pattern, flags=re.IGNORECASE, regex=True)) > .8*len(firstlines):
    coursesdf['credits'] = firstlines.str.extract(ccredits_nocolon_pattern, flags=re.IGNORECASE)
    firstlines = firstlines.str.replace(ccredits_nocolon_pattern, '', regex=True, flags=re.IGNORECASE, n=1)
elif sum(firstlines.str.contains(ccredits_parenthesis_pattern)) > .8*len(firstlines):
    coursesdf['credits'] = firstlines.str.extract(ccredits_parenthesis_pattern)
    firstlines = firstlines.str.replace(ccredits_parenthesis_pattern, '', regex=True, n=1)
else:
    matchinglines = blocksdf.loc[blocksdf.plaintext.str.match(listtopatternraw(creditsID), flags=re.IGNORECASE),
                                 'plaintext']
    if sum(matchinglines.notna()) < .8*len(firstlines):
        raise Exception('cant find the credits on firstline')
    coursesdf['credits'] = matchinglines.str.replace(listtopatternraw(creditsID), '', regex=True, flags=re.IGNORECASE)
    blocksdf = blocksdf.loc[~blocksdf.plaintext.str.match(listtopatternraw(creditsID), flags=re.IGNORECASE)]

coursesdf.credits = coursesdf.credits.str.strip(' .')
firstlines = firstlines.str.strip(' .')

Occasionally there will be an item in parentheses on the first line which can typically be ignored.

In [6]:
# If all the courses have something else at the end in parentheses, assume it's irrelevant (like a breakdown of credits)
if firstlines.str.contains(r'\)\Z').all():
    firstlines = firstlines.str.replace(r'\([^()]*?\)\Z', '', regex=True)

# If a small percentage of the remaining firstlines contain parenthesis and colons, then assume it's the course title
has_parentheses = firstlines.str.contains(r'\(')
has_colon = firstlines.str.contains(r':')
if (has_parentheses.sum() < .1*len(firstlines)) & (has_colon.sum() < .3*len(firstlines)):
    coursesdf['title'] = firstlines
else:
    print(list(set(firstlines.str.extract(r'(\([^()]*\))', expand=False).to_list())))
    print('Can we delete these parentheses? (y/n)')         # Print all parentheses items to verify
    if input() == 'y':
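        # str.replace returns a new Series, so the result must be assigned back
        firstlines = firstlines.str.replace(r'(\([^()]*\))', '', regex=True)
        coursesdf['title'] = firstlines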
    else:
        raise Exception('terminated by user')

We can then remove the first lines from blocksdf and move on to extracting the other features.

The remaining features are typically found in one of two formats. Either they are all contained in one long, unbroken paragraph, or they are split up on different lines. If there is only one item for each course block remaining, we assume it is the former and search for keywords within a paragraph. If there are multiple items for each block, we can assume the keywords will be located at the beginning of each item and one item corresponds to a single feature (this is sometimes not the case and is remedied in later steps).

In [7]:
blocksdf = blocksdf.loc[blocksdf.blockindex != 0]

# Extract all the other info
if blocksdf.iloc[-1].name + 1 == len(blocksdf):         # If there is only 1 block per course, everything is in it
    for i, ID in enumerate(ID_list):    # Look for matches in the middle of the paragraph Todo: Apply this in all cases
        middle_pattern = listtopatternraw(ID) + r'([^\n]*?)(\.(?![A-Za-z0-9])|\n|\Z)'
        matchingstring = blocksdf.plaintext.str.extract(middle_pattern, flags=re.IGNORECASE).iloc[:, 2].str.strip(' .')
        coursesdf[ID_names[i]] = matchingstring
        blocksdf.plaintext = blocksdf.plaintext.str.replace(middle_pattern, '', regex=True, flags=re.IGNORECASE)
else:
    for i, ID in enumerate(ID_list):            # Look for matches at the start of each line
        matchinglines = blocksdf.loc[blocksdf.plaintext.str.match(listtopatternraw(ID), flags=re.IGNORECASE),
                                     'plaintext']
        # Fix duplicated entries (e.g. two lines that say 'Prerequisites:' but one is blank)
        lengthsdf = pd.concat([matchinglines, matchinglines.apply(len)], axis=1)
        lengthsdf.columns = ['lines', 'length']
        if not lengthsdf.empty:     # Keep entry with longest string length  Todo: Ensure deleted is redundant or blank
            matchinglines = lengthsdf.groupby(lengthsdf.index).apply(lambda x: x.lines.iloc[x.length.argmax()])
        coursesdf[ID_names[i]] = matchinglines.str.replace(listtopatternraw(ID), '', regex=True,
                                                           flags=re.IGNORECASE).str.strip(' .')
        blocksdf = blocksdf.loc[~blocksdf.plaintext.str.match(listtopatternraw(ID), flags=re.IGNORECASE)]
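To make the two layouts concrete, here is a minimal sketch using plain `re`. The helper names, the single `Prerequisites?:` keyword, and the sample text in the usage note are illustrative assumptions; the notebook builds its patterns with `listtopatternraw`, whose behavior is not shown here.

```python
import re

KEYWORD = r'Prerequisites?:'

def extract_from_paragraph(paragraph, keyword=KEYWORD):
    """Paragraph layout: find the keyword anywhere and read up to the sentence end."""
    pattern = keyword + r'([^\n]*?)(?:\.(?![A-Za-z0-9])|\n|\Z)'
    found = re.search(pattern, paragraph, flags=re.IGNORECASE)
    if not found:
        return None, paragraph
    # Remove the matched feature so it is not parsed a second time
    remainder = re.sub(pattern, '', paragraph, count=1, flags=re.IGNORECASE)
    return found.group(1).strip(' .'), remainder.strip()

def extract_from_lines(lines, keyword=KEYWORD):
    """Line layout: the keyword starts a line and the rest of that line is the value."""
    value, remaining = None, []
    for line in lines:
        if value is None and re.match(keyword, line, flags=re.IGNORECASE):
            value = re.sub(keyword, '', line, count=1, flags=re.IGNORECASE).strip(' .')
        else:
            remaining.append(line)
    return value, remaining
```

For example, `extract_from_paragraph('Intro to proofs. Prerequisites: MAT 201 or MAT 205. Offered: Fall.')` returns the value `'MAT 201 or MAT 205'` together with the paragraph minus that sentence, while `extract_from_lines(['Prerequisites: MAT 201.', 'Intro to proofs.'])` returns `'MAT 201'` and the untouched second line.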
       

The last feature we look for is the plain text description of the course, which is typically everything left over. If there are extra items, we ask the user to scan over the remainder and confirm that these are just descriptions or irrelevant.

In [8]:
# Locate the plaintext description
if coursesdf.description.isna().all():    # If descriptions is empty it must be in the remaining block
    # Assume course description is the first block; unparsed is after
    unparsed = blocksdf.groupby(blocksdf.index).apply(lambda x: x[1:] if len(x) != 1 else None)
    coursesdf['description'] = blocksdf.plaintext.groupby(blocksdf.index).apply(lambda x: x.iloc[0])
else:
    unparsed = blocksdf

if not unparsed.empty:
    # Verify first lines of remaining blocks are just plain text descriptions
    kvalue = min(400, len(blocksdf.index.to_list()))
    v(blocksdf.plaintext.groupby(blocksdf.index).first()[sorted(sample(blocksdf.index.to_list(), k=kvalue))])
    print('Do all these look like just descriptions? (y/n)')
    if input() != 'y':
        raise Exception('Terminated by user')

    # Verify remaining blocks can be merged with plain text description
    v(sorted(unparsed.plaintext.unique()))
    print('Move these unparsed items to the description? (y/n)')
    if input() == 'y':
        hasextras = blocksdf.groupby(blocksdf.index).apply(any)
        extrasjoined = blocksdf.plaintext.groupby(blocksdf.index).agg(lambda x: '. '.join(x))
        coursesdf.loc[hasextras.index, 'description'] = coursesdf.description + '. ' + extrasjoined
    else:
        raise Exception('Terminated by user')

# If prereqs, coreqs and requisites are all blank, look in the plain text description   Todo: Do this earlier in script
if coursesdf.requisites.isna().all() & coursesdf.prerequisites.isna().all():
    coursesdf.prerequisites = coursesdf.description.str.extract('Prerequisites?: ([^.]+)', expand=False)
    coursesdf.description = coursesdf.description.str.replace('Prerequisites?: ([^.]+)', '', regex=True)
    coursesdf.corequisites = coursesdf.description.str.extract('Corequisites?: ([^.]+)', expand=False)
    coursesdf.description = coursesdf.description.str.replace('Corequisites?: ([^.]+)', '', regex=True)
    if coursesdf.requisites.isna().all() & coursesdf.prerequisites.isna().all():
        raise Exception('Could not locate course requisites')

# Verify remaining instances of ID keywords/phrases are irrelevant
potentialIDs = blocksdf.plaintext.str.extract(ID_pattern).loc[:, 0].dropna().reset_index(drop=True)
if potentialIDs.groupby(potentialIDs).count().max() > .005*len(coursesdf):
    v(potentialIDs.groupby(potentialIDs).count().sort_values().tail(50))
    print('Ignore these? (y/n)')
    if input() != 'y':
        raise Exception('Terminated by user')


# Clean up
coursesdf = coursesdf.fillna('')
coursesdf.replace('  +', ' ', regex=True, inplace=True)
coursesdf.replace(r' \.', '.', regex=True, inplace=True)
coursesdf = coursesdf.applymap(lambda x: x.strip()).reset_index(drop=True)


  courseblock_id  plaintext
----------------  --------------------------------------------------------------------------------------------------------------------
             869  precipitation growth and breakup; ice multiplication; cloud electrification.
            4206  Junior or senior standing. Must register for lecture and laboratory.
            5144  Required field trips.
            5458  Recitative technique through both operatic and choral examples; final project is a group conducted Broadway musical.
            6126  Credit not allowed for both PHIL 550 and IE 550.
Do all these look like just descriptions? (y/n)


 y


    0
--  --------------------------------------------------------------------------------------------------------------------
 0  Credit not allowed for both PHIL 550 and IE 550.
 1  Junior or senior standing. Must register for lecture and laboratory.
 2  Recitative technique through both operatic and choral examples; final project is a group conducted Broadway musical.
 3  Required field trips.
 4  precipitation growth and breakup; ice multiplication; cloud electrification.
Move these unparsed items to the description? (y/n)


 y


<br>
We find 2 cases out of the 5 shown that are relevant and ideally should not be moved to the description column (courseblock_id 4206 and 6126). However, this is well within the acceptable amount of error (2 out of ~7300 courses), so we can safely ignore them.
<br>
<br>
Courses are sometimes identified as belonging to a larger group, such as a type of general education requirement or another requisite group. A course can belong to several groups, and the group names are usually separated by a common delimiter. We take whichever candidate delimiter (comma, semicolon, or newline) appears in the most entries to be the true delimiter, rather than occasional punctuation, then split each entry into a list.

In [9]:
# The punctuation that occurs the most in coursegroups is probably a delimiter. Split coursegroups using that
has_newlines = coursesdf.coursegroups.str.contains('\n')
has_semicolons = coursesdf.coursegroups.str.contains(';')
has_commas = coursesdf.coursegroups.str.contains(',')
delimiterdict = {'\n': sum(has_newlines), ';': sum(has_semicolons), ',': sum(has_commas)}
delimiter = max(delimiterdict, key=delimiterdict.get)
coursesdf.coursegroups = coursesdf.coursegroups.apply(lambda groups: [x.strip() for x in groups.split(delimiter)])
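The same delimiter vote can be sketched without pandas. The function names and the sample values below are hypothetical; only the idea of counting how many entries contain each candidate comes from the cell above.

```python
def guess_delimiter(values, candidates=('\n', ';', ',')):
    """Pick the candidate that appears in the largest number of values."""
    counts = {d: sum(d in value for value in values) for d in candidates}
    return max(counts, key=counts.get)

def split_groups(value, delimiter):
    """Split one field on the delimiter and trim whitespace around each item."""
    return [item.strip() for item in value.split(delimiter)]

# Hypothetical coursegroups values
values = [
    'Biological & Physical Sciences 3A; Natural & Physical Sciences w/o lab (GT-SC2)',
    'Arts & Humanities 3B; Diversity, Equity, and Inclusion',
    '',
]
delimiter = guess_delimiter(values)
groups = [split_groups(value, delimiter) for value in values]
print(repr(delimiter), groups)
```

Note that an empty field splits into a list holding one empty string, which matches the `['']` entries visible in the saved table.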

At this stage, the course descriptions are organized by feature and ready to be saved.

In [10]:
display_styled_table(coursesdf.head())
coursesdf.to_pickle('organizedcoursedescriptions.pkl')

Unnamed: 0,dept,number,credits,title,description,requisites,prerequisites,corequisites,equivalents,coursegroups,gradingtype,recommendeds,repeatablity,restrictions,registrationinfo,termoffered,coursefees,GTpathways
0,AA,100,3,Introduction to Astronomy (GT-SC2),Description of the various objects found in the heavens as well as the principles and techniques employed in investigations of these objects,,,,,"['Biological & Physical Sciences 3A', 'Natural & Physical Sciences w/o lab (GT-SC2)']",Traditional,,,,Sections may be offered: Online,"Fall, Spring, Summer",Yes,
1,AA,101,1,Astronomy Laboratory (GT-SC1),"Conduct observations, experiments, and simulations to develop an intuitive understanding of astronomical phenomena",,"AA 100, may be taken concurrently",,,"['Biological & Physical Sciences 3A', 'Natural & Physical Sciences w/ lab (GT-SC1)']",Traditional,,,,Sections may be offered: Online,"Fall, Spring, Summer",No,
2,AA,250,3,Introduction to Astrophysics,"Comprehensive introduction to astrophysics, including: observational astronomy, stellar evolution, cosmology, exoplanets, and astrobiology",,(MATH 161 or MATH 255 or MATH 271) and (PH 122 or PH 142),,,[''],Traditional,,,,"Credit allowed for only one of the following: AA 250, AA 280A1, and AA 380A1",Fall,No,
3,AA,495,1-6,Independent Study in Astrophysics,,,,,,[''],Instructor Option,,,,Written consent of instructor,"Fall, Spring, Summer",No,
4,AB,120,1,Agricultural Biology--Freshman Orientation,Introduction to information and skills necessary to succeed in the agricultural biology major,,,,,[''],Traditional,,,Must be a: Undergraduate,This is a partial semester course,Fall,No,


## 2B. Degree Requirements Organization

The main strategy for transforming the degree requirements is to first clean and standardize key features (like course codes) as much as possible, then perform named entity recognition to identify functional elements. Using the context of headers and other features (such as course sums), we interpret the hierarchical relationships within the tables to group related requirements for parsing in subsequent steps.

Let's begin by importing the raw degree requirement data, defining some regex patterns for use later, and cleaning up our working dataframe.

In [11]:
df = pd.read_pickle('degreetables.pkl')

# Define regex patterns
credit_pattern = r'([0-9][0-9]?[0-9]?\.?[0-9]?) ?-? ?([0-9][0-9]?[0-9]?\.?[0-9]?)?'
mincredit_pattern = r'\A([0-9][0-9]?[0-9]?\.?[0-9]?)'
maxcredit_pattern = r'([0-9][0-9]?[0-9]?\.?[0-9]?)\Z'
or_pattern = r'(?: or | ?/ ?| ?[|] ?)'
and_pattern = r'(?: and | ?& ?)'

# Standardize table formatting and simple requirements
df = df.fillna('')
df = df.replace('nan', '')
df = df.applymap(str)
df.headerflag = df.headerflag.eq('True')  # convert headerflag from string back to bool
df.credits = df.credits.str.replace('.0', '', regex=False)  # simplify string representations of floats
df = df.replace('  +', ' ', regex=True)
df.code = df.code.str.replace(' :', ':', regex=False)  # substring replace (Series.replace would only match whole cells)
df.degree = df.pagetitle        # set page-title as degree (not link title)


We'll use the course codes found in part 2A to generate a regex pattern that is more restrictive.

In [12]:
courses_df = pd.read_pickle('organizedcoursedescriptions.pkl')

# Determine the coursecode structure
cdeptrange = (courses_df.dept.apply(len).min(), courses_df.dept.apply(len).max())
course_numbers = courses_df.number.str.extract('([0-9]+)', expand=False).fillna('')
cnumrange = (course_numbers.apply(len).min(), course_numbers.apply(len).max())
course_letters = courses_df.number.str.extract('([A-Z]+)', expand=False).fillna('')  # Optional letters that follow course num.
cletrange = (course_letters.apply(len).min(), course_letters.apply(len).max())
# Redefine course code patterns to reflect structure
cdept_pattern = '[A-Z]'*cdeptrange[0] + '[A-Z]?'*(cdeptrange[1]-cdeptrange[0])
cnum_pattern = '[0-9]'*cnumrange[0] + '[0-9]?'*(cnumrange[1]-cnumrange[0]) + '[A-Z]'*cletrange[0] + '[A-Z]?'*(cletrange[1]-cletrange[0])

# Save ccode patterns for later scripts
with open('cnum_pattern.json', 'w') as outfile:
    json.dump(cnum_pattern, outfile)
with open('cdept_pattern.json', 'w') as outfile:
    json.dump(cdept_pattern, outfile)
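The pattern construction amounts to repeating a character class the minimum number of times and then appending optional copies up to the maximum. A minimal sketch, with a hypothetical helper name and made-up department codes and numbers:

```python
import re

def length_bounded_pattern(char_class, min_len, max_len):
    """Repeat char_class min_len times, then add optional copies up to max_len."""
    return char_class * min_len + (char_class + '?') * (max_len - min_len)

# Hypothetical department codes and course numbers
depts = ['AA', 'AB', 'MATH', 'PH']
numbers = ['100', '250', '495', '161']

dept_lengths = [len(d) for d in depts]
num_lengths = [len(n) for n in numbers]
cdept_pattern = length_bounded_pattern('[A-Z]', min(dept_lengths), max(dept_lengths))
cnum_pattern = length_bounded_pattern('[0-9]', min(num_lengths), max(num_lengths))
print(cdept_pattern, cnum_pattern)
```

For these inputs the department pattern is equivalent to `[A-Z]{2,4}`, so a one-letter or five-letter token no longer qualifies as a department code.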

We'll be doing some regex replacements with special characters later so we want to ensure none of them are present. We'll also replace our superscript representation ("\_SUPERSCRIPT_*") with a representation using angle brackets ("<*>").

In [13]:
# Check that special characters used in this program aren't already in use
df.code = df.code.str.replace('<([^<>][^<>][^<>]+)>', r'(\1)', regex=True)     # Replace <> if 3 or more chars inside
df.code = df.code.str.replace('{([^{}]*)}', r'(\1)', regex=True)

if df.code.str.contains('[{}<>¥ß§Æ¿Ø]').any():
    raise Exception('Special characters are present in the code column')

# Reformat superscripts to angle bracket representation
df.code = df.code.str.replace(r' ?_SUPERSCRIPT_(..?)_ ?', r'<\1>', regex=True)
df.headertext = df.headertext.str.replace(r' ?_SUPERSCRIPT_(..?)_ ?', r'<\1>', regex=True)

# Fix ccodes that lack a space before a following 'or' or '&'
df.code = df.code.replace(r'\b(' + cdept_pattern + ')' + ' ?-? ?(' + cnum_pattern + r')' + r'(&|or) ', r'\1\2 \3 ',
                          regex=True)

Now we begin to standardize simple course codes.

In [14]:
# Remove hyphens and spaces from ccodes
df = df.replace(r'\b(' + cdept_pattern + ')' + ' ?-? ?(' + cnum_pattern + r')\b', r'_\1\2_', regex=True)
ccode_pattern = '_' + cdept_pattern + cnum_pattern + '_(?:<.>)*'  # ccode + superscripts
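The effect of this replacement can be seen on a single string. The two patterns below are assumed stand-ins (2 to 4 letter departments, 3-digit numbers with an optional letter), and `wrap_course_codes` is a hypothetical helper name.

```python
import re

# Assumed patterns; the notebook derives its own from the course data
cdept_pattern = '[A-Z][A-Z][A-Z]?[A-Z]?'
cnum_pattern = '[0-9][0-9][0-9][A-Z]?'

def wrap_course_codes(text):
    """Rewrite 'MATH 161', 'MATH-161', or 'MATH161' as '_MATH161_'."""
    pattern = r'\b(' + cdept_pattern + ') ?-? ?(' + cnum_pattern + r')\b'
    return re.sub(pattern, r'_\1\2_', text)

print(wrap_course_codes('MATH 161 or PH-122 (see catalog)'))
```

Wrapping each code in underscores gives later regexes an unambiguous token to anchor on, regardless of how the catalog originally spaced or hyphenated it.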

Implied course codes are made explicit (e.g., "MAT 301 & 303" is converted to "MAT 301 & MAT 303").

In [15]:
# Fill in 'or' or 'and' separated ccodes that lack either the dept or number (dept or number is implied)
df_old = (['']*len(df.code))
while (df.code != df_old).any():
    df_old = df.code.copy()
    df.code = df.code.str.replace(
        r'\b(_' + cdept_pattern + ')(' + cnum_pattern + '_)' + or_pattern + '(' + cnum_pattern + r')\b',
        r'\1\2 | \1\3_', regex=True)
    df.code = df.code.str.replace(
        r'\b(_' + cdept_pattern + ')(' + cnum_pattern + '_)' + and_pattern + '(' + cnum_pattern + r')\b',
        r'\1\2 & \1\3_', regex=True)
    df.code = df.code.str.replace(
        r'\b(' + cdept_pattern + ')' + or_pattern + '(_' + cdept_pattern + ')' + '(' + cnum_pattern + r'_)\b',
        r'_\1\3 | \2\3', regex=True)
    df.code = df.code.str.replace(
        r'\b(' + cdept_pattern + ')' + and_pattern + '(_' + cdept_pattern + ')' + '(' + cnum_pattern + r'_)\b',
        r'_\1\3 & \2\3', regex=True)

# Convert '&' and 'or' separated ccodes into one unit
df = df.replace('(' + ccode_pattern + ') ?& ?', r'\1 & ', regex=True)
# Merge rows that begin with 'or' and their preceding row(s) into one XOR group
startswithor = df.code.str.match('or ', flags=re.IGNORECASE)
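# Strip only the leading 'or ' (an unanchored replace would also hit words such as 'Instructor approval')
df.loc[startswithor, 'code'] = df.loc[startswithor, 'code'].str.replace(r'\Aor ', '', regex=True, flags=re.IGNORECASE)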
orgroups = (~startswithor).cumsum()
df = df.groupby(orgroups, as_index=False).agg(
    {'code': ' | '.join, 'title': ' | '.join, 'coregroup': 'first', 'credits': 'first', 'headerflag': 'first',
     'pagenumber': 'first', 'tabnumber': 'first', 'degree': 'first', 'link': 'first', 'headertext': 'first',
     'siblingheaders': 'first',
     'superscripts': 'first', 'htmlclass': 'first', 'id': 'first', 'html': 'first', 'rowclass': 'first'})

# Replace 'or' when it separates two courses and place brackets around the group so it's serialized correctly later
allccodeor = df.code.str.fullmatch(ccode_pattern + '(' + or_pattern + ccode_pattern + ')+')
allccodeand = df.code.str.fullmatch(ccode_pattern + '(' + and_pattern + ccode_pattern + ')+')
df.loc[allccodeor, 'code'] = df.code.str.replace(or_pattern, ' | ', regex=True)
df.loc[allccodeand, 'code'] = df.code.str.replace(and_pattern, ' & ', regex=True)
df.loc[allccodeor | allccodeand, 'code'] = '{' + df.loc[allccodeor | allccodeand, 'code'] + '}'

# Delete (s)        example: course(s) --> course
df.code = df.code.str.replace('(s)', '', regex=False)

# Make a copy before any major transformation
df['codecopy'] = df.code
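The fixed-point loop that fills in implied departments can be illustrated on plain strings. This is a simplified sketch that only handles a missing department after an already wrapped code; the patterns and the helper name are assumptions rather than the notebook's exact rules.

```python
import re

# Assumed patterns: 2-4 letter department, 3-digit number with an optional letter
cdept = '[A-Z][A-Z][A-Z]?[A-Z]?'
cnum = '[0-9][0-9][0-9][A-Z]?'

def expand_implied_departments(text):
    """Repeat until stable: '_MAT301_ & 303' becomes '_MAT301_ & _MAT303_'."""
    and_rule = r'(_' + cdept + ')(' + cnum + r'_)(?: and | ?& ?)(' + cnum + r')\b'
    or_rule = r'(_' + cdept + ')(' + cnum + r'_)(?: or | ?/ ?)(' + cnum + r')\b'
    previous = None
    while text != previous:  # each pass can fill in one more implied department
        previous = text
        text = re.sub(and_rule, r'\1\2 & \1\3_', text)
        text = re.sub(or_rule, r'\1\2 | \1\3_', text)
    return text

print(expand_implied_departments('_MAT301_ & 303 & 305'))
```

Looping until nothing changes is what handles chains: the first pass expands `303`, and only then does `305` sit next to a complete code that it can borrow the department from.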

At this point, most course codes should be in the form "\_ABC123_", with their associated conjunctions, "and" and "or" represented as "&" and "|", respectively.

These tables still contain a mix of individual course requirements and headers, without a clear distinction between the two. Importantly, headers whose requirements apply to multiple cells below them are not identified. For example, let's take a look at the following table:

![alt text](JupyterFiles\ag_bio_semester1.jpg "example table")

The first row (marked "Freshman") is clearly a header because it's in bold (and it's designated as one in its HTML class under the hood), but the other highlighted rows are more ambiguous. They don't belong to a header HTML class and they're formatted similarly to the rest of the non-header elements. For the lines marked "Group A" and "Group B", the indentation of the courses below them indicates they are headers (a method we will use later on), but this doesn't work for the line that states "Select one group from the following:". We could possibly infer that it's a header because it has blanks in the title and "AUCC" columns, but this is problematic because the line marked "Arts and Humanities" isn't a header, yet it also lacks these columns. Furthermore, we can't be sure other schools will follow similar conventions, and we want to make our solution as general as possible so we don't have to tweak it for each school. A better way to handle this is to infer context from the actual text and other context clues.

To accomplish this, we'll extract key numerical requirements: numbers of courses, numbers of course groupings, numbers of credits, numbers of labs, as well as modifiers to these numerical requirements. In the table above, "select one group" is an example of what we'll be extracting. Because the text states we're selecting a group, this implies there are groupings below, which must also have their own individual headings. We will use this observation later in the code to link this "meta" header (a header which applies to sub-headers) to its corresponding subheaders and their elements.

We begin by converting variants & synonyms of keywords and phrases into single, standardized forms.

In [16]:
reqcode_pattern = r'_\d\d?\d?(?:-\d\d?\d?)?_[a-z_]+_(?<!__)'
groupwords = ['groups?', 'concentrations?', 'lists?', 'tracks?', 'options?', 'subfields?',
              'fields', 'areas', 'course groups?']
coursewords = ['courses?', 'classes']
creditswords = ['credits?', '(credit|semester) hours?', 'hours?', ]
twowords = ['one pair of', 'both']
fourwords = ['two pairs of']
perwords = ['from each', 'from every']
fromwords = ['belonging to', 'in']
labwords = ['labs?', 'laboratory']
maxwords = ['not? more than', 'max(imum)?( of)?', 'as (many|much) as', 'at most', 'up to',
            'may( choose| select)?']  # Todo: test to make sure 'may' isn't too general
upperdivwords = ['3000?-? or 4000?-? ?level', 'upper-? ?level', 'upper ?-? ?division']

# Replace synonyms
df.code = df.code.str.replace(listtopattern(coursewords), 'courses', regex=True, flags=re.IGNORECASE)
df.code = df.code.str.replace(listtopattern(groupwords), 'groups', regex=True, flags=re.IGNORECASE)
df.code = df.code.str.replace(listtopattern(creditswords), 'credits', regex=True, flags=re.IGNORECASE)
df.code = df.code.str.replace(listtopattern(twowords), 'two', regex=True, flags=re.IGNORECASE)
df.code = df.code.str.replace(listtopattern(fourwords), 'four', regex=True, flags=re.IGNORECASE)
df.code = df.code.str.replace(listtopattern(perwords), 'per', regex=True, flags=re.IGNORECASE)
df.code = df.code.str.replace(listtopattern(fromwords), ' from ', regex=True, flags=re.IGNORECASE)
df.code = df.code.str.replace(listtopattern(labwords), 'labs', regex=True, flags=re.IGNORECASE)
df.code = df.code.str.replace(listtopattern(maxwords), 'max', regex=True, flags=re.IGNORECASE)
df.code = df.code.str.replace(listtopattern(upperdivwords), 'upperdiv', regex=True, flags=re.IGNORECASE)

Likewise, we convert numbers represented as words to digits and standardize number ranges (e.g., "three to five" is converted to "3-5").

In [17]:
# Convert number words to digits
numbers = [r'(?i)\bone\b', r'(?i)\btwo\b', r'(?i)\bthree\b', r'(?i)\bfour\b', r'(?i)\bfive\b', r'(?i)\bsix\b',
           r'(?i)\bseven\b', r'(?i)\beight\b', r'(?i)\bnine\b', r'(?i)\bten\b', r'(?i)\beleven\b',
           r'(?i)\btwelve\b', r'(?i)\bthirteen\b', r'(?i)\bfourteen\b', r'(?i)\bfifteen\b', r'(?i)\bsixteen\b',
           r'(?i)\bseventeen\b', r'(?i)\beighteen\b', r'(?i)\bnineteen\b', r'(?i)\btwenty\b', r'(?i)\bthirty\b']
digits = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19',
          '20', '30']
df.code = df.code.replace(numbers, digits, regex=True)
# Fix number ranges so they're hyphenated
df.code = df.code.str.replace(r'\b(\d\d?) ?(?:-|to) ?(\d\d?)\b', r'\1-\2', regex=True)
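A compact version of the same two steps on a single string is sketched below. The dictionary only covers one through ten and the helper name is hypothetical; the range regex is the one used in the cell above.

```python
import re

NUMBER_WORDS = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
                'six': '6', 'seven': '7', 'eight': '8', 'nine': '9', 'ten': '10'}

def words_to_digits(text):
    """Replace whole-word number names with digits, then hyphenate ranges."""
    for word, digit in NUMBER_WORDS.items():
        # \b keeps 'one' inside 'someone' from being converted
        text = re.sub(r'\b' + word + r'\b', digit, text, flags=re.IGNORECASE)
    return re.sub(r'\b(\d\d?) ?(?:-|to) ?(\d\d?)\b', r'\1-\2', text)

print(words_to_digits('Someone must select three to five courses'))
```

The word boundaries matter here: without them, ordinary words that happen to contain a number name would be corrupted.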

Then we remove redundant numbers and numbers that don't quantify numeric requirements.

In [18]:
# Get rid of course and credit values that merely reference the total number of options in the group
noncodewords = [r'(of|from) the following', r'the']  # Todo: test to make sure 'the' isn't too general
df.code = df.code.str.replace(listtopattern(noncodewords) + r' \d\d?\b( (credits|courses|groups))?', '', regex=True,
                            flags=re.IGNORECASE)

# Get rid of noun keywords immediately followed by numbers (eg. group 1, lab 4, course 3)
df.code = df.code.str.replace(r'\b(groups|labs|courses) \d\d?:?\b', '', regex=True)

Our goal is to identify where a number is associated with a keyword, for instance "3 courses" or "5 credits". For these simple cases, we look for a number that is immediately followed by a keyword. For more complex cases, like "3 of the following social anthropology courses", we need an algorithm that can link the word "courses" to the number 3. This is accomplished by looking for numbers, then moving the closest keyword that follows each number (within the same sentence) so that it's adjacent. This simple algorithm may seem prone to erroneously moving keywords that are not related to numbers; however, preliminary tests show that it produces correct associations in over 98% of cases.

We implement this by replacing keywords with a unique character, using regex to shift these characters where appropriate, then decoding them from special characters back to keywords. We'll also account for important modifiers which should not be ignored, such as when credits/courses need to be upper division, or when there are a certain number of courses per course group, or when there is a lab requirement. 

In [19]:
# Remove words that come in between keywords and numbers, and keywords and modifiers
# First convert keywords to unique characters
df.code = df.code.str.replace(r'\bcredits\b', '¥', regex=True)

# Labs and other coursetypes need to be ID'd before courses so 'labs' gets moved next to the number
df.code = df.code.str.replace(r'\blabs\b', 'ß', regex=True)
df.code = df.code.str.replace(r'\bcourses\b', 'Æ', regex=True)
df.code = df.code.str.replace(r'\bgroups\b', '¿', regex=True)
df.code = df.code.str.replace(r'\bupperdiv\b', 'Ø', regex=True)
df.code = df.code.str.replace(r'\bper\b', '§', regex=True)

# Make sure to only move keywords if there isn't another keyword between it and the number
df.code = df.code.str.replace(r'\b(?:(\d\d?\d?) )([^()\d¥ß§Æ¿]*?)¥', r'\1 ¥ \2', regex=True)
# Note that course can be between num and lab
df.code = df.code.str.replace(r'\b(?:(\d\d?) )([^()\d¥ß§¿]*?)ß', r'\1 ß \2', regex=True)
df.code = df.code.str.replace(r'\b(?:(\d\d?) )([^()\d¥ß§Æ¿]*?)Æ', r'\1 Æ \2', regex=True)
df.code = df.code.str.replace(r'\b(?:(\d\d?) )([^()\d¥ß§Æ¿]*?)¿', r'\1 ¿ \2', regex=True)
df.code = df.code.str.replace(r'\b(?:(\d\d?\d?) )([^()\d¥ß§Æ¿]*?)Ø', r'\1 Ø \2', regex=True)
df.code = df.code.str.replace(r'§ ([^\d¥()ß§Æ¿]*?)¿', r'§ ¿ \1', regex=True)

# Convert symbols back to words
df.code = df.code.str.replace('¥', 'credits', regex=False)
df.code = df.code.str.replace('ß', 'labs', regex=False)
df.code = df.code.str.replace('Æ', 'courses', regex=False)
df.code = df.code.str.replace('¿', 'groups', regex=False)
df.code = df.code.str.replace('Ø', 'upperdiv', regex=False)
df.code = df.code.str.replace('§', 'per', regex=False)
df.code = df.code.str.replace('  +', ' ', regex=True)
df.code = df.code.str.replace(' :', ':', regex=False)  # .str is needed for a substring (not whole-cell) replace
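
To make the shifting step concrete, here is a minimal sketch of the same idea applied to a single string. It uses only the `credits` keyword and the standard `re` module rather than the pandas `.str` accessor. The function name `shift_credits` is hypothetical; the patterns are copied from the cell above.

```python
import re


def shift_credits(text):
    """Move the keyword 'credits' so it sits directly after its number."""
    # Encode the keyword as a single placeholder character
    text = re.sub(r'\bcredits\b', '¥', text)
    # Shift the placeholder next to the number, unless a digit, parenthesis,
    # or another placeholder sits between the two
    text = re.sub(r'\b(?:(\d\d?\d?) )([^()\d¥ß§Æ¿]*?)¥', r'\1 ¥ \2', text)
    # Decode the placeholder and tidy up whitespace
    text = text.replace('¥', 'credits')
    return re.sub('  +', ' ', text).strip()


print(shift_credits('select 6 upper division credits'))
print(shift_credits('6 of the 20 listed credits'))
```

In the second call the keyword attaches to `20` rather than `6`, because the character class forbids skipping over another number.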

Now that numbers are adjacent to their associated keywords, we can combine the two into an encoded form using underscores, which we'll refer to as a 'headercode'.

In [20]:
# Convert numbers followed by keywords to headercodes
df.code = df.code.str.replace(r'\b(\d\d?)( |-)credits\b', r'_\1_credits_', regex=True)
df.code = df.code.str.replace(r'\b(\d\d?)( |-)courses\b', r'_\1_courses_', regex=True)
df.code = df.code.str.replace(r'\b(\d\d?)( |-)labs\b', r'_\1_labs_', regex=True)
df.code = df.code.str.replace(r'\b(\d\d?)( |-)groups\b', r'_\1_groups_', regex=True)
# Convert any remaining number dashes ahead of headercodes (indicates a credit range)
df.code = df.code.str.replace(r'\b(\d\d?)-_(\d\d?_[a-z]+_)', r'_\1-\2', regex=True)
# Todo: Convert implied singular keywords to headercodes (e.g. Choose one:)
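
As a small illustration of what the headercode conversion produces, the sketch below applies the same substitutions to one string with the standard `re` module. The helper name `to_headercodes` and the `KEYWORDS` list are assumptions made for this example; the regexes mirror the cell above.

```python
import re

KEYWORDS = ['credits', 'courses', 'labs', 'groups']


def to_headercodes(text):
    """Turn '<number> <keyword>' into '_<number>_<keyword>_'."""
    for keyword in KEYWORDS:
        text = re.sub(rf'\b(\d\d?)( |-){keyword}\b', rf'_\1_{keyword}_', text)
    # A number and dash left in front of a headercode indicates a range
    text = re.sub(r'\b(\d\d?)-_(\d\d?_[a-z]+_)', r'_\1-\2', text)
    return text


print(to_headercodes('select 3-6 credits from 2 courses'))
```

A range such as `3-6 credits` is first converted to `3-_6_credits_` and then folded into a single `_3-6_credits_` headercode by the final substitution.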

Finally, we can deal with headercodes that have modifiers. A requirement may be a maximum limit (e.g., "5 credits max") rather than the default, which is a minimum requirement, or it may apply only to upper division credits or courses.

In [21]:
# Move and append modifiers to headercodes
df.code = df.code.str.replace(r'([a-z]_)\b([^_]*)per groups\b', r'\1per_group_ \2', regex=True)
df.code = df.code.str.replace(r'([a-z]_)\b([^_0-9]*)upperdiv\b', r'\1upperdiv_ \2', regex=True)
df.code = df.code.str.replace(r'\bmax (_\d\d?_)(credits_|courses_|credits_upperdiv_)\b', r'\1\2max_', regex=True)
df.code = df.code.str.replace(r'\b(_\d\d?_)(credits_|courses_|credits_upperdiv_)\b([^_.]*?)\bmax\b', r'\1\2max_\3',
                            regex=True)
# ID implicitly defined modifiers (missing 'courses' or 'credits') using the closest prior keyword
df.code = df.code.str.replace(r'(credits_|courses_)\b([^_\d]*\b)(\d\d?) upperdiv\b', r'\1\2_\3_\1upperdiv_',
                            regex=True)
df.code = df.code.str.replace(r'(credits_|courses_)\b([^_\d]*\b)(\d\d?) max\b', r'\1\2_\3_\1max_', regex=True)
# Left over 'max' + digits should be references to courses requirement
df.code = df.code.str.replace(r'\A([^_]*)\bmax (\d\d?)\b([^_]*)\Z', r'\1_\2_courses_max_ \3', regex=True)
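
The two `max` substitutions can be tried on single strings as follows. This is only a sketch with the standard `re` module; `attach_max` is a hypothetical name, and the sketch covers just the explicit `max` cases, not the implied ones handled later in the cell.

```python
import re

CODES = r'(credits_|courses_|credits_upperdiv_)'


def attach_max(text):
    """Append a 'max_' modifier to a headercode that is next to the word 'max'."""
    # 'max' written before the headercode
    text = re.sub(r'\bmax (_\d\d?_)' + CODES + r'\b', r'\1\2max_', text)
    # 'max' written after the headercode, with no underscore or period in between
    text = re.sub(r'\b(_\d\d?_)' + CODES + r'\b([^_.]*?)\bmax\b', r'\1\2max_\3', text)
    return text.strip()


print(attach_max('max _5_credits_ of internship'))
print(attach_max('_12_courses_ max'))
```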

Now that the numeric requirements are extracted and encoded, we can move them over to their own column. We'll also check that none conflict (e.g., a line that has both "_2_courses_" and "_4_courses_", making the requirement ambiguous), and flag those that do. 

In [22]:
# Copy all header codes over to a new column
df['headercodes'] = df.code.str.findall(r'_[^ ]+[a-z][a-z][a-z]_\b').apply(lambda x: ' '.join(x))

# Flag headers with multiple conflicting requirements
nonumbersheadercodes = df.headercodes.str.replace(r'\d\d?-\d\d?|\d\d?', '', regex=True)
df['codeconflict'] = nonumbersheadercodes.apply(lambda x: len(x.split()) != len(set(x.split())))
df['degreeflags'] = ''
df.degreeflags = df.codeconflict.groupby(df.id).transform(lambda x: 'codeconflict ' if x.any() else '')
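
The conflict check relies on a simple observation: once the numbers are stripped, two headercodes of the same kind become identical strings. A minimal sketch on plain strings (the function names are hypothetical, the regexes are those used above):

```python
import re


def extract_headercodes(code):
    """Collect every headercode in a line into a space-separated string."""
    return ' '.join(re.findall(r'_[^ ]+[a-z][a-z][a-z]_\b', code))


def has_conflict(headercodes):
    """True when two headercodes differ only in their numbers."""
    kinds = re.sub(r'\d\d?-\d\d?|\d\d?', '', headercodes).split()
    return len(kinds) != len(set(kinds))


line = 'select _2_courses_ and _4_courses_ from:'
codes = extract_headercodes(line)
print(codes, has_conflict(codes))
```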

Now we'll move on to identifying other headers. We'll begin with rows that are in a header html class, then lines that end in a colon, and then lines that precede an indented block. 

In [23]:
# ID headers that have row in an html header class
isrowheader = df.html.str.contains('areaheader', regex=False)
isrowsubheader = df.html.str.contains('areasubheader', regex=False)

# ID headers based on presence of a colon
iscolonheader = df.code.str.contains(r': ?\Z')

# ID headers based on indentation
df.html = df.html.str.replace('<br', 'Ð')         # Replace with special character so we can avoid it in next step
isindented = df.html.str.contains(r'\A[^Ð]* style="margin-left:')
indentlevel = df.html.str.extract(r'\A[^Ð]* style="margin-left:(\d\d?\d?)px').iloc[:, 0].fillna('0')
df.html = df.html.str.replace('Ð', '<br', regex=False)  # Restore the original tags (result must be assigned back)
if len(indentlevel.unique()) > 2:
    raise Exception('Tables have multiple levels of indent. Update table header hierarchy.')
# group together indented objects
indentgroups = isindented.eq(False).cumsum()
indentgroups.loc[indentgroups.groupby(indentgroups).transform('count') == 1] = np.nan
# ID header of indented objects
isindentheader = pd.Series(index=indentgroups.index, dtype=bool)
isindentheader.loc[indentgroups.groupby(indentgroups).head(1).index] = True  # Why does this treat nan's as a group?!
isindentheader.loc[0] = False
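
The cumulative-sum grouping can be hard to picture, so here is a plain-Python stand-in for the pandas logic. It is a sketch under simplifying assumptions: it works on a list of booleans, the name `indent_headers` is hypothetical, and it omits the notebook's special handling of row 0.

```python
from collections import Counter
from itertools import accumulate


def indent_headers(isindented):
    """Flag each non-indented row that is directly followed by an indented block."""
    # Every non-indented row starts a new group; indented rows join the group above
    groups = list(accumulate(0 if flag else 1 for flag in isindented))
    sizes = Counter(groups)
    headers = []
    for i, group in enumerate(groups):
        starts_group = i == 0 or groups[i - 1] != group
        # A group of size 1 is a lone row with nothing indented beneath it
        headers.append(starts_group and sizes[group] > 1)
    return headers


print(indent_headers([False, True, True, False, False, True]))
```

Rows 0 and 4 are flagged because each is followed by at least one indented row; row 3 stands alone and is not a header.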

Then we have table-wide headers, and finally the "meta" headers that we identified earlier.

In [24]:
# ID headers for the entire table
istableheader = df.headerflag.copy()
df.drop(columns='headerflag', inplace=True)

# ID headers for the year or semester in plangrids
istermheader = df.rowclass.isin(['plangridyear', 'plangridterm'])

# ID metaheaders (metaheaders indicate groups that contain sub-group requirements (eg: Choose two of the groups below))
ismetaheader = df.headercodes.str.contains('group', flags=re.IGNORECASE)

# All headers together
isheader = isrowheader | isrowsubheader | isindentheader | iscolonheader | istableheader | istermheader | ismetaheader

Now that we have a preliminary idea of which rows are headers (and conversely which rows are individual requirements), we will collect some more info on the structure of the tables, so that we can eventually parse the table as a whole and understand how everything relates to each other. 

For the indented headers, we want to mark the end of their indented blocks. Also, we need to ID rows that simply quantify the sums of credits. These rows act similarly to headers in the way that they break up groups of requirements.

In [25]:
# ID rows where indentation ends
reverseindentgroups = isindented[::-1].eq(False).cumsum()[::-1]
reverseindentgroups.loc[reverseindentgroups.groupby(reverseindentgroups).transform('count') == 1] = np.nan
isendofindent = pd.Series(index=indentgroups.index, dtype=bool)
isendofindent.loc[
    reverseindentgroups.groupby(reverseindentgroups).tail(1).index] = True
isendofindent.loc[0] = False

# ID tables with credit sums
creditsumnames = [r'(\w+ )?total (program )?(credits|units|hours)( required)?:?']
totalincode = df.code.str.fullmatch(listtopattern(creditsumnames), flags=re.IGNORECASE)
totalintitle = df.title.str.fullmatch(listtopattern(creditsumnames), flags=re.IGNORECASE)
alternatetotalmatch = df.title.str.fullmatch('total .+ (credits|units|hours)( required)?:?', flags=re.IGNORECASE)
creditsum_is_predefined = df.rowclass.isin(['listsum', 'plangridsum', 'plangridtotal'])

iscreditssum = creditsum_is_predefined | totalincode | totalintitle | alternatetotalmatch
df['containssum'] = iscreditssum.groupby(df.id).transform(lambda x: sum(x) != 0)
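
For reference, the credit-sum pattern can be checked against a few sample labels with the standard `re` module. The labels below are invented for illustration, and `is_credit_sum` is a hypothetical helper; `listtopattern` is left out because the pattern list has a single entry.

```python
import re

TOTAL_ROW = re.compile(
    r'(\w+ )?total (program )?(credits|units|hours)( required)?:?',
    re.IGNORECASE)


def is_credit_sum(label):
    """True when the entire label reads like a credit total row."""
    return TOTAL_ROW.fullmatch(label) is not None


for label in ['Total Credits', 'Program Total Credits:', 'CHEM 107']:
    print(label, is_credit_sum(label))
```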


Each table within a degree requirement page may serve different purposes. There may be a table representing a sample 4 year degree plan (we'll refer to these as plangrids), or there may be a table representing a list of electives or another group of courses to select from (an electivelist), or there may be something more general that concisely outlines all the degree requirements and options in one table (courselist). We'll define these here.

In [26]:
isplangrid = df.htmlclass.eq('sc_plangrid')
iscourselist = df.htmlclass.eq('sc_courselist') & df.containssum
iselectivelist = df.htmlclass.eq('sc_courselist') & ~df.containssum
df.loc[isplangrid, 'tableclass'] = 'plangrid'
df.loc[iscourselist, 'tableclass'] = 'courselist'
df.loc[iselectivelist, 'tableclass'] = 'electivelist'

if (~df.containssum & isplangrid).any():                # Todo: Replace exception with user prompt
    raise Exception("A plangrid doesn't have a credit sum")

Now, we'll perform a few sanity checks to make sure that the individual requirements' credits sum to the degree's total requirement. We'll also infer the degree type from the HTML titles, headers, and other HTML elements that neighbor the tables, then ensure that the credit totals are in line with the degree type.
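
The credit comparison boils down to a range test. The sketch below shows that test on plain tuples; it assumes each requirement has already been reduced to a `(min_credits, max_credits)` pair, and the function name is hypothetical.

```python
def sum_is_consistent(rows, stated_min, stated_max):
    """Check a stated credit total against the requirements listed above it.

    rows: list of (min_credits, max_credits) pairs, one per requirement.
    """
    min_sum = sum(low for low, _ in rows)
    max_sum = sum(high for _, high in rows)
    # A total is flagged when its maximum exceeds what the rows can supply,
    # or when its minimum is below what the rows already require
    return stated_max <= max_sum and stated_min >= min_sum


requirements = [(3, 3), (4, 4), (1, 3)]   # two fixed courses and a 1-3 credit option
print(sum_is_consistent(requirements, 8, 10))
print(sum_is_consistent(requirements, 12, 12))
```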

In [27]:
# Compare individual credits to program total credits
contains_credits = df.credits.str.fullmatch(credit_pattern)
varieswords = ['var(ies|iable)?.?']
creditsvary = df.credits.str.fullmatch(listtopattern(varieswords), flags=re.IGNORECASE)
df['maxcredits'] = df.credits.str.extract(maxcredit_pattern).iloc[:, 0].astype(float).fillna(0)
df['mincredits'] = df.credits.str.extract(mincredit_pattern).iloc[:, 0].astype(float).fillna(0)

# if df[~df.containssum].maxcredits.max() > 100:        # Todo: Replace exception with user prompt
#     raise Exception('An electives table has a high credits value (may be a degree requirement table)')

df.loc[creditsvary, 'maxcredits'] = 120  # 'credits vary' could mean anywhere from 0-120 credits at the extremes
df['creditsumblocks'] = (iscreditssum | istableheader)[::-1].cumsum()[::-1]
df.loc[~iscreditssum.groupby(df.creditsumblocks).transform('last'), 'creditsumblocks'] = np.nan
df['maxsums'] = df.maxcredits.groupby(df.creditsumblocks).transform(lambda x: sum(x[~iscreditssum]))
df['minsums'] = df.mincredits.groupby(df.creditsumblocks).transform(lambda x: sum(x[~iscreditssum]))

lastcreditsum = pd.Series([False] * len(iscreditssum))  # Degree totals (not term totals)
lastcreditsum.loc[df.loc[iscreditssum, 'creditsumblocks'].groupby(df.id).tail(1).index] = True
df['credittotalblocks'] = (lastcreditsum | istableheader)[::-1].cumsum()[::-1]
df.loc[~lastcreditsum.groupby(df.credittotalblocks).transform('last'), 'credittotalblocks'] = np.nan
df['maxtotals'] = df.maxcredits.groupby(df.credittotalblocks).transform(lambda x: sum(x[~iscreditssum]))
df['mintotals'] = df.mincredits.groupby(df.credittotalblocks).transform(lambda x: sum(x[~iscreditssum]))

# Verify sums and totals
sumnotinrange = ((df.maxcredits > df.maxsums) | (df.mincredits < df.minsums)) & iscreditssum & ~lastcreditsum
totalnotinrange = ((df.maxcredits > df.maxtotals) | (df.mincredits < df.mintotals)) & lastcreditsum
notinrange = sumnotinrange | totalnotinrange

# Flag mismatches
df.degreeflags = df.degreeflags + df.degreeflags.groupby(df.id).transform(lambda x: 'creditmismatch '
                                                                          if not x[notinrange].empty else '')
df.degreeflags = df.degreeflags + df.degreeflags.groupby(df.id).transform(lambda x: 'creditsvary '
                                                                          if not x[creditsvary].empty else '')

# ID degree types       Todo: Clean this up so code isn't duplicated
df.loc[df.degree.str.contains(r'\b(bachelor|major in|BA|BS|BM|BFA|BSN|BBA|BAS|BSME|BSRS|BSW|BME)\b',
                              flags=re.IGNORECASE), 'degreetype'] = 'bachelor'
df.loc[df.degree.str.contains(r'\b(associates?|AAS|AA|AS)\b', flags=re.IGNORECASE), 'degreetype'] = 'associate'
df.loc[df.degree.str.contains(r'\b(certificate|PCT)\b', flags=re.IGNORECASE), 'degreetype'] = 'certificate'
df.loc[df.degree.str.contains(r'\bminor\b', flags=re.IGNORECASE), 'degreetype'] = 'minor'
df.loc[df.degree.str.contains(r'\b(masters?|MS|ME|MA|MAED|MSN|MPAS|MBA)\b', flags=re.IGNORECASE), 'degreetype'] = 'master'
df.loc[df.degree.str.contains(r'\bdual degree\b', flags=re.IGNORECASE), 'degreetype'] = 'dual bachelor'
df.loc[df.degree.str.contains(r'\b3\+2 \b', flags=re.IGNORECASE), 'degreetype'] = 'combined B&M'
df.loc[df.degree.str.contains(r'\bp.?h.?d.?|doctor(ate)?\b', flags=re.IGNORECASE), 'degreetype'] = 'doctorate'

# If no degree types were found in degree column, look in the headertext
if df.degreetype.isna().all():
    df.loc[df.headertext.str.contains(r'\b(bachelor|major in|BA|BS|BM|BFA|BSN|BBA|BAS|BSME|BSRS|BSW|BME)\b',
                                      flags=re.IGNORECASE), 'degreetype'] = 'bachelor'
    df.loc[df.headertext.str.contains(r'\b(associates?|AAS|AA|AS)\b', flags=re.IGNORECASE), 'degreetype'] = 'associate'
    df.loc[df.headertext.str.contains(r'\b(certificate|PCT)\b', flags=re.IGNORECASE), 'degreetype'] = 'certificate'
    df.loc[df.headertext.str.contains(r'\bminor\b', flags=re.IGNORECASE), 'degreetype'] = 'minor'
    df.loc[df.headertext.str.contains(r'\b(masters?|MS|ME|MA|MAED|MSN|MPAS|MBA)\b',
                                      flags=re.IGNORECASE), 'degreetype'] = 'master'
    df.loc[df.headertext.str.contains(r'\bdual degree\b', flags=re.IGNORECASE), 'degreetype'] = 'dual bachelor'
    df.loc[df.headertext.str.contains(r'\b3\+2 \b', flags=re.IGNORECASE), 'degreetype'] = 'combined B&M'
    df.loc[df.headertext.str.contains(r'\bp.?h.?d.?|doctor(ate)?\b', flags=re.IGNORECASE), 'degreetype'] = 'doctorate'
df.loc[df.degree.eq('GENEDS'), 'degreetype'] = 'GENEDS'
if df.degreetype.isna().any():
    print(df.degree[df.degreetype.isna()].unique())
    print('Ignore these unidentified degrees? (y/n)')
    if input() != 'y':
        raise Exception('Terminated by User')

# Determine if there are multiple tracks for the same degree
hasmultipletracks = pd.Series([False] * len(isheader))
hasmultipletracks[isplangrid] = df[isplangrid].id.groupby([df.degree, df.tableclass]).transform(lambda x:
                                                                                                len(x.unique()) > 1)
hasmultipletracks[iscourselist] = df[iscourselist].id.groupby([df.degree, df.tableclass]).transform(lambda x:
                                                                                                    len(x.unique()) > 1)

# if not df.loc[hasmultipletracks, 'tableclass'].groupby(df.degree).transform(lambda x: len(x.unique()) > 1).empty:
#     raise Exception('Theres a degree with multiple fouryearplans AND multiple courselists')

# Extract concentration/track from page header, table headers, titles, or if absent, assign a unique ID
tableheader = df.headertext.apply(lambda x: x[x.rindex(' : ')+3:] if ' : ' in x else x)
numberofheaders = tableheader.groupby([df.degree, df.tableclass]).transform(lambda x: len(x.unique()))
numberoftoprows = df.code.groupby([df.degree, df.tableclass]).transform(lambda x: len(x.unique()))
numberoftables = df.id.groupby([df.degree, df.tableclass]).transform(lambda x: len(x.unique()))
toprowisheader = isheader.groupby(df.id).transform(lambda x: x.iloc[1] if len(x) > 1 else False)
alltoprowsareheader = toprowisheader.groupby([df.degree, df.tableclass]).transform('all')
toprows = df.code.groupby(df.id).transform(lambda x: x.iloc[1] if len(x) > 1 else np.nan)

# If headers vary, use those as track name; if toprows vary, use those; if nothing varies, use ID
df.loc[hasmultipletracks & (numberofheaders == numberoftables), 'track'] = tableheader
df.loc[hasmultipletracks & ~(numberofheaders == numberoftables) & (numberoftoprows == numberoftables), 'track'] = toprows
df.loc[hasmultipletracks & ~(numberofheaders == numberoftables) & ~(numberoftoprows == numberoftables), 'track'] = df.id
df.loc[df.track.notna(), 'track'] = df.degree + ' : ' + df.track
df.loc[df.track.isna(), 'track'] = df.degree

# Verify total credits makes sense for degree type
df['maxdegreecredits'] = df.maxcredits.groupby(df.track).transform(lambda x: max(x[iscreditssum])
                                                                   if not x[iscreditssum].empty else np.nan)
df['mindegreecredits'] = df.mincredits.groupby(df.track).transform(lambda x: max(x[iscreditssum])
                                                                   if not x[iscreditssum].empty else np.nan)
# # If less than 120 credits total for bachelors, this is only a partial sum
# df.loc[df.degreetype.isin(['bachelor', 'dual bachelor']) & (df['mindegreecredits'] < 120), 'maxdegreecredits'] = np.nan
# if df.degreetype.eq('master').any() and max(df.loc[df.degreetype.eq('master'), 'maxdegreecredits']) > 115:
#     raise Exception('theres a masters degree with more than 115 credits')
# if df.degreetype.eq('certificate').any() and max(df.loc[df.degreetype.eq('certificate'), 'maxdegreecredits']) > 65:
#     raise Exception('theres a certificate with more than 60 credits')
# if df.degreetype.eq('minor').any() and max(df.loc[df.degreetype.eq('minor'), 'maxdegreecredits']) > 40:
#     raise Exception('theres a minor with more than 40 credits')

df.drop(columns=['maxcredits', 'mincredits', 'creditsumblocks', 'maxsums', 'minsums', 'credittotalblocks', 'maxtotals',
                 'mintotals'], inplace=True)

['Extreme Ultraviolet and Optical Science and Technology Graduate Interdisciplinary Studies Program'
 'Food Science/Safety Interdisciplinary Studies Program'
 'International Development Interdisciplinary Studies Program'
 'Molecular, Cellular and Integrative Neurosciences Graduate Interdisciplinary Studies Program'
 'Sustainable Peace and Reconciliation Studies Graduate Interdisciplinary Studies Program'
 'Political Economy Graduate Interdisciplinary Studies Program'
 'Resilience of Social Ecological Systems Graduate Interdisciplinary Studies Program']
Ignore these unidentified degrees? (y/n)


 y
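
The Todo in the cell above notes that the degree-type patterns are written out twice. One possible way to remove the duplication, sketched here on plain strings, is to keep the patterns in a single ordered table and apply it to whichever column is being searched. The names `DEGREE_PATTERNS` and `classify_degree` are hypothetical; the patterns and labels are copied from the cell.

```python
import re

# (pattern, label) pairs in the same order as the notebook; later matches override earlier ones
DEGREE_PATTERNS = [
    (r'\b(bachelor|major in|BA|BS|BM|BFA|BSN|BBA|BAS|BSME|BSRS|BSW|BME)\b', 'bachelor'),
    (r'\b(associates?|AAS|AA|AS)\b', 'associate'),
    (r'\b(certificate|PCT)\b', 'certificate'),
    (r'\bminor\b', 'minor'),
    (r'\b(masters?|MS|ME|MA|MAED|MSN|MPAS|MBA)\b', 'master'),
    (r'\bdual degree\b', 'dual bachelor'),
    (r'\b3\+2 \b', 'combined B&M'),
    (r'\bp.?h.?d.?|doctor(ate)?\b', 'doctorate'),
]


def classify_degree(name):
    """Return the degree type for a degree name, or None if nothing matches."""
    label = None
    for pattern, degreetype in DEGREE_PATTERNS:
        if re.search(pattern, name, flags=re.IGNORECASE):
            label = degreetype
    return label


print(classify_degree('Major in Agricultural Biology'))
print(classify_degree('Food Science/Safety Interdisciplinary Studies Program'))
```

With pandas, the same table could drive a loop over `df.degree` and then `df.headertext`, so each pattern is written only once.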


A few of these degree types show up as unidentified, but remember that this section is just a sanity check. While classifying a degree type is nice, it's not critical to classifying the requirements for those degrees. These degrees represent a small proportion of the total, so we can safely skip the checks on them.

We'll now classify rows as either headers, requirements that only contain course codes, or currently unknown.

In [28]:
# ID rowtypes
# Row characteristics
only_ccode = df.code.str.fullmatch(ccode_pattern)
only_ccodecombos = df.code.str.fullmatch('{' + ccode_pattern + r'(( \| | & )' + ccode_pattern + ')+}')

# Table and term headers
df.loc[istermheader, 'rowtype'] = 'term header'
df.loc[istableheader, 'rowtype'] = 'table header'
df.loc[ismetaheader, 'rowtype'] = 'metagroup header'
df.loc[isrowheader, 'rowtype'] = 'row header'
df.loc[isrowsubheader, 'rowtype'] = 'row subheader'
df.loc[iscreditssum, 'rowtype'] = 'credits sum'
# Required courses
df.loc[df.containssum & only_ccode & (contains_credits | creditsvary), 'rowtype'] = 'required course'
# Course groups
df.loc[df.containssum & only_ccodecombos & (contains_credits | creditsvary), 'rowtype'] = 'oneline group'
df.loc[df.containssum & isindented, 'rowtype'] = 'multiline group'
# Credit sums
df.loc[df.containssum & df.rowclass.isin(['plangridsum', 'plangridtotal']), 'rowtype'] = 'credits sum'
# Everything else in degree tables (i.e. tables that contain credit sums)
df.loc[df.containssum & df.rowtype.isna() & (contains_credits | creditsvary), 'rowtype'] = 'other requirement'
df.loc[df.containssum & df.rowtype.isna(), 'rowtype'] = 'unknown'
# Elective groups
df.loc[~df.containssum & only_ccode, 'rowtype'] = 'elective'
df.loc[~df.containssum & only_ccodecombos, 'rowtype'] = 'elective combo'
# Everything else in elective tables (i.e. tables that don't contain credit sums)
df.loc[~df.containssum & df.rowtype.isna(), 'rowtype'] = 'unknown elective'

For a row that states "Select 7 courses:", we need to know exactly where we are selecting those seven courses from. Is it the entire table? Is it just the block of courses directly below it? Is it from multiple groups located below and separated by subheaders?

To tackle this effectively, the header hierarchy must be interpreted. We need to know which headers supersede others and which fall under another header's scope. We have multiple classes of headers to sort through:

1. rowheader - HTML class for headers in bold, larger font. We assume these are primary.
2. rowsubheader - HTML class for subheaders in bold, smaller font. These are secondary.
3. otherheader - Rows that end in a colon (but are not followed by indented items). These are tertiary.
4. indentheader - Rows that precede an indented block. These have the lowest precedence.

We also can have variations of these headers that modify their precedence. For instance, having a header in all caps suggests it has higher precedence than one that is in lower case. Also, indentation is often used to indicate when a row is a subheader of a non-indented header.

We'll go ahead and mark each header accordingly.

In [29]:
# ID header hierarchy
hheirarchy = ['rowheader', 'rowsubheader', 'otherheader', 'indentheader']
formatheirarchy = ['allcaps', 'regular', 'allcapsindented', 'regularindented']
headertype = pd.Series([np.nan] * len(isheader))
headertype[isrowheader & ~(istermheader | istableheader)] = 'rowheader'
headertype[isrowsubheader & ~(istermheader | istableheader)] = 'rowsubheader'
headertype[iscolonheader & ~(istermheader | istableheader | isrowheader | isrowsubheader)] = 'otherheader'
headertype[isindentheader & ~(istermheader | istableheader | isrowheader | isrowsubheader)] = 'indentheader'
headertype.fillna('otherheader', inplace=True)  # Leftovers are group headers that don't have any special formatting
formattype = pd.Series([np.nan] * len(isheader))
formattype[df.codecopy.str.isupper() & ~isindented] = 'allcaps'
formattype[~df.codecopy.str.isupper() & ~isindented] = 'regular'
formattype[df.codecopy.str.isupper() & isindented] = 'allcapsindented'
formattype[~df.codecopy.str.isupper() & isindented] = 'regularindented'
headerdf = pd.DataFrame({'headertype': headertype, 'formattype': formattype, 'isheader': isheader, })

We'll use the formula below to convert these to a numeric value, where lower numbers mean higher importance. We'll also assume that table-wide headers, term headers (e.g. "Fall 2020"), and credit sums outrank every other header regardless of formatting, so we assign them negative values.

Note that this header level is similar to HTML heading levels, where H1 has the highest precedence and H6 the lowest.

In [30]:
# Assign headerlevel based on header type and formatting (this determines the groupings for serialization later)
# Levels are 0-3 for allcaps headers, 4-7 for regular headers, 8-11 for indented allcaps, 12-15 for indented regular
df['headerlevel'] = headerdf[isheader].apply(
    lambda x: hheirarchy.index(x.headertype) + formatheirarchy.index(x.formattype) * len(hheirarchy), axis=1)
# Ensure table headers and term headers are the lowest level
df.loc[istermheader, 'headerlevel'] = -1
df.loc[istableheader, 'headerlevel'] = -3
# Creditsums aren't headers but they act as one in the hierarchy (they represent a complete division in the table)
df.loc[iscreditssum, 'headerlevel'] = -2
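
To see how the level formula behaves, here it is as a standalone function. The list contents match the cell above; the constant and function names are made up for this sketch.

```python
HEADER_TYPES = ['rowheader', 'rowsubheader', 'otherheader', 'indentheader']
FORMAT_TYPES = ['allcaps', 'regular', 'allcapsindented', 'regularindented']


def header_level(headertype, formattype):
    """Lower values mean higher precedence; formatting dominates header type."""
    return HEADER_TYPES.index(headertype) + FORMAT_TYPES.index(formattype) * len(HEADER_TYPES)


print(header_level('rowheader', 'allcaps'))             # most important ordinary header
print(header_level('rowsubheader', 'regular'))
print(header_level('indentheader', 'regularindented'))  # least important
```

Because the format index is multiplied by the number of header types, any all-caps header outranks any regular one, and header type only breaks ties within the same formatting.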

Now what about metaheaders? They can have the exact same formatting as their subheaders, but because of the _context_, we know that they take precedence over their subheaders.

Remember, a metaheader follows the pattern:
1. Metaheader
2. Sub-header 1 
3. Sub-group 1
4. Sub-header 2
5. Sub-group 2

This continues for each additional group.

Example: 
1. Choose 3 courses from one of the following groups:
2. Group A:
3. (Group A contents)
4. Group B:
5. (Group B contents)

We can verify metaheaders by checking that the first line after is a subheader, that this first subheader contains a group word (e.g., "group", "field", "concentration", "track", etc.), and that this group word is found in at least one header that immediately follows the first group. This will also allow us to ID what subheaders correspond to the metaheader, and where every group starts and ends.
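
A rough sketch of the group-word check on plain strings is shown below. The `GROUP_WORDS` list is an assumption based on the examples in the text (the notebook's own `groupwords` list is defined elsewhere), and the function names are hypothetical.

```python
import re

GROUP_WORDS = ['group', 'field', 'concentration', 'track']   # assumed list
GROUP_PATTERN = re.compile(r'\b(' + '|'.join(GROUP_WORDS) + r')s?\b', re.IGNORECASE)


def group_words(text):
    """Return the set of singular, lower-cased group words found in a header."""
    return {word.lower() for word in GROUP_PATTERN.findall(text)}


def looks_like_metagroup(metaheader, following_headers):
    """The first following header must share a group word, and so must at least one more."""
    meta_words = group_words(metaheader)
    matches = [bool(meta_words & group_words(header)) for header in following_headers]
    return len(matches) >= 2 and matches[0] and sum(matches) > 1


meta = 'Choose 3 courses from one of the following groups:'
print(looks_like_metagroup(meta, ['Group A:', 'Group B:']))
print(looks_like_metagroup(meta, ['Group A:']))
```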

In [31]:
# Row after a metaheader must be a header for another group, or a comment
nextrowisnotheader = ismetaheader & (
        only_ccode | only_ccodecombos | istermheader | ismetaheader | istableheader | iscreditssum).shift(-1)
ismetaheader[nextrowisnotheader] = False

# ID metagroup starts (i.e. the metagroup headers)
df['metagroup'] = ismetaheader.cumsum()
df.loc[~ismetaheader.groupby(df.metagroup).transform('first').fillna(False), 'metagroup'] = np.nan
if not df.loc[df.groupby('metagroup').code.transform('count') < 2, 'metagroup'].empty:
    raise Exception("there's a metagroup with no groups")
groupwordspresent = df.codecopy.str.findall(listtopattern(groupwords), flags=re.IGNORECASE)
groupwordspresent = groupwordspresent.apply(lambda x: list(set([string.rstrip('s').lower() for string in x])))
metaheadergroupname = groupwordspresent.groupby(df.metagroup).transform('first')
metaheadergroupname[metaheadergroupname.isna()] = pd.Series(
    [[]] * metaheadergroupname.isna().sum()).values  # sets nan as []
metasubheaderintersection = groupwordspresent.apply(set) - (
        groupwordspresent.apply(set) - metaheadergroupname.apply(set))
metaheadermatch = metasubheaderintersection.apply(len).ne(0)

# Metagroups must have a metaheader, followed by the first group header, then the first group, then the 2nd g. header...
# Group must have matching group word on the first line after the metaheader (i.e. 'group', 'list', 'field')
firstheadermatches = metaheadermatch.groupby(df.metagroup.fillna(-1)).transform(lambda x: x.iloc[1])

# Group must be matched more than once (can't be a group requirement with just one group)
groupismatched = firstheadermatches & metaheadermatch.groupby(df.metagroup.fillna(-1)).transform(
    lambda x: sum(x[1:]) > 1)

# Determine if metaheaders for unmatched groups are not metaheaders
# Row after metaheader must be header         Todo: Wasn't this done a few steps above?
nextisheader = df.headerlevel.groupby(df.metagroup.fillna(-1)).transform(lambda x: x.iloc[1]).notna()

# Must be at least 2 groups, all with headers that match the first one's text formatting
nextlinehlevel = df.headerlevel.groupby(df.metagroup.fillna(-1)).transform(lambda x: x.iloc[1])
nextlinehascredits = contains_credits.groupby(df.metagroup.fillna(-1)).transform(lambda x: x.iloc[1])
hlevelmatch = nextlinehlevel == df.headerlevel
hcreditsmatch = contains_credits == nextlinehascredits
headermatch = hlevelmatch & hcreditsmatch & df.headerlevel.notna()

# First line can't be a sub-metaheader. This is used so secondheadermatch returns false for missing secondheaderindex's
headermatch.iloc[0] = False
secondheaderindex = isheader.groupby(df.metagroup.fillna(-1)).transform(
    lambda x: x[2:].loc[x].index[0] if x[2:].any() else 0)
secondheadermatch = headermatch.iloc[secondheaderindex].reset_index(drop=True)

# Get rid of non-matching metaheaders and ungroup their groups
ismetaheader.loc[~groupismatched & (~nextisheader | ~secondheadermatch)] = False
df.loc[~ismetaheader.groupby(df.metagroup).transform('first').fillna(False), 'metagroup'] = np.nan

# Determine where metagroups end
# Find where the indentation of the group changes from indented to not indented
groupisindented = isindented.groupby(df.metagroup).transform(lambda x: x[~isheader].iloc[0])
# Backup option if there's no matching group word (metagroup ends at the next header that has a lower level)
islowerlevel = nextlinehlevel > df.headerlevel

# Regroup metagroup with end of metagroups
metagroupindicators = pd.Series([np.nan] * len(isheader))
metagroupindicators[groupismatched & ~groupisindented] = ismetaheader | (
        hlevelmatch & ~metaheadermatch) | islowerlevel | iscreditssum
metagroupindicators[groupismatched & groupisindented] = ismetaheader | (
        hlevelmatch & ~metaheadermatch) | islowerlevel | (~isheader & ~isindented) | iscreditssum
metagroupindicators[~groupismatched & ~groupisindented] = ismetaheader | islowerlevel | iscreditssum
metagroupindicators[~groupismatched & groupisindented] = ismetaheader | islowerlevel | (
        ~isheader & ~isindented) | iscreditssum
df['metagroupindicators'] = metagroupindicators
df['metagroup'] = metagroupindicators.cumsum()
df.loc[df.metagroup.groupby(df.metagroup).transform('count') == 1, 'metagroup'] = np.nan
df.loc[~ismetaheader.groupby(df.metagroup).transform('first').fillna(False), 'metagroup'] = np.nan
endofmetagroup = pd.Series([False] * len(isheader))
endofmetagroup.loc[df.groupby('metagroup').tail(1).index] = True
endofmetagroup = endofmetagroup.shift(1)

# ID the inner groups for each metagroup
innergroupindicators = pd.Series([False] * len(isheader))
innergroupindicators[groupismatched] = ismetaheader | endofmetagroup | (metaheadermatch & ~ismetaheader) | iscreditssum
innergroupindicators[~groupismatched & groupisindented] = ismetaheader | endofmetagroup | (
        metaheadermatch & ~ismetaheader) | iscreditssum | (~isheader & ~isindented)
innergroupindicators[~groupismatched & ~groupisindented] = ismetaheader | endofmetagroup | (
        metaheadermatch & ~ismetaheader) | iscreditssum
df['innergroup'] = innergroupindicators.cumsum()
df.loc[df.innergroup.groupby(df.innergroup).transform('count') == 1, 'innergroup'] = np.nan
df.loc[df.metagroup.isna() | ismetaheader, 'innergroup'] = np.nan
isinnergroupheader = pd.Series([False] * len(isheader))
isinnergroupheader.loc[df.innergroup.groupby(df.innergroup).head(1).index] = True
isinnergroupheader.loc[0] = False
df.drop(columns=['metagroup', 'metagroupindicators', 'innergroup'], inplace=True)

We can now reassign the header hierarchy based on this new information about what is and isn't a metaheader. Metaheaders should rank higher in importance than their subheaders, but not higher than other headers with different formatting, so we increase the header level of each inner group header by 0.5.

In [32]:
# Reclassify headers based on the new info about what's a metaheader and what's not
df.loc[isinnergroupheader, 'rowtype'] = 'group header'
isheader = isheader | ismetaheader | isinnergroupheader

# Reassign header hierarchy
headerdf = pd.DataFrame({'headertype': headertype, 'formattype': formattype, 'isheader': isheader})
df['headerlevel'] = headerdf[isheader].apply(
    lambda x: hheirarchy.index(x.headertype) + formatheirarchy.index(x.formattype) * len(hheirarchy), axis=1)
df.loc[istermheader, 'headerlevel'] = -1
df.loc[iscreditssum, 'headerlevel'] = -2
df.loc[istableheader, 'headerlevel'] = -3
# Implied inner group headers need to be higher than metagroup header (but not higher than ones formatted differently)
df.loc[isinnergroupheader, 'headerlevel'] = df.headerlevel + .5



Lastly, we will convert the credits column into explicit requirements by representing each value as a headercode and appending it to the headercodes column.

In [33]:
# Assign headercodes for credits requirements in credits column
df.credits = df.credits.replace(' ', '')
creditsreqs = ('_' + df.credits.str.extract(r'(\d\d?(?:-\d\d?)?)') + '_credits_').fillna('')
df['headercodes'] = df.headercodes + ' ' + creditsreqs.iloc[:, 0]                   # append creditsreq to headercodes
df.headercodes = df.headercodes.str.replace('  +', ' ', regex=True).str.strip()
df.headercodes = df.headercodes.fillna('')
df.headercodes = df.headercodes.apply(lambda x: ' '.join(list(set(x.split()))))     # remove duplicate headercodes

df['endofindent'] = isendofindent & ~isheader

# Classify based on whether requirement is defined or not
df['unknownreq'] = ~(df.headerlevel.notna() | only_ccode | only_ccodecombos)

# Delete credits requirements from creditsums so they are removed in the serializer
df.loc[iscreditssum, 'headercodes'] = ''
# Get rid of singular groups
df = df[df.id.groupby(df.id).transform(lambda x: len(x)) != 1].reset_index(drop=True)
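
As a quick illustration of the credits-to-headercode step, the sketch below handles one row at a time with plain strings. The function names are hypothetical. Unlike the `set`-based deduplication in the cell above, `dict.fromkeys` is used here so the original order of the headercodes is kept; that is a choice made for this example, not something the notebook does.

```python
import re


def credits_to_headercode(credits):
    """'3' -> '_3_credits_', '3-6' -> '_3-6_credits_', no number -> ''."""
    match = re.search(r'(\d\d?(?:-\d\d?)?)', credits)
    return f'_{match.group(1)}_credits_' if match else ''


def merge_headercodes(existing, credits):
    """Append the credits headercode to the existing ones, dropping duplicates."""
    codes = existing.split() + credits_to_headercode(credits).split()
    return ' '.join(dict.fromkeys(codes))


print(merge_headercodes('_1_groups_', '8'))
print(merge_headercodes('_3_credits_', '3'))
```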

At this point, we have most of the information we need to put everything together and parse the tables. We have functional delineations between the different table types. Simple requirements that are represented by a course code (or combination thereof) and a number of credits are identified and have their requisites formalized. Broader requirements that apply to multiple courses (i.e., headercodes) are encoded and have clear delineations of which sub-requirements they apply to. We even have metaheaders formalized with specified scopes. Everything that has been processed is surrounded by a pair of underscores to distinguish it from unknowns. We'll go ahead and save our dataframe and print out a sample.

In [34]:
df.to_pickle('degreesorganized.pkl')

df_only_important_columns = df[['code', 'credits', 'headercodes', 'rowtype', 'unknownreq','codeconflict', 'degree', 'degreetype', 'link']]
display_styled_table(df_only_important_columns.iloc[np.r_[0:20, -20:0]])

| | code | credits | headercodes | rowtype | unknownreq | codeconflict | degree | degreetype | link |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Freshman | Freshman | | table header | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |
| 1 | Unnamed: 0_level_1 | Credits | | table header | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |
| 2 | `_AB120_<1><2>` | 1 | `_1_credits_` | required course | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |
| 3 | `_AB130_<1><2>` | 1 | `_1_credits_` | required course | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |
| 4 | `_AREC202_` | 3 | `_3_credits_` | required course | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |
| 5 | `_CHEM107_` | 4 | `_4_credits_` | required course | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |
| 6 | `_CHEM108_` | 1 | `_1_credits_` | required course | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |
| 7 | `_CO150_` | 3 | `_3_credits_` | required course | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |
| 8 | Select `_1_groups_` from the following: | 8 | `_8_credits_ _1_groups_` | metagroup header | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |
| 9 | groups A | | | group header | False | False | Major in Agricultural Biology | bachelor | https://catalog.colostate.edu/general-catalog/colleges/agricultural-sciences/agricultural-biology/agricultural-biology-major |


**What remains is to parse more complicated requirements that aren't simple course codes, gather some more information on general education requirements and other course groups, and finally transform these tables into a serialized code. We'll do that in Part 3.**