# Code Walkthrough
The following takes a slow-motion walk through the code. Run each cell below one at a time, and study the code before you run it.

The walkthrough is in 3 parts:
- Comments, imports, variables, etc. at the top of the module
- The `parse_table_row()` function that comprises the bulk of the functionality
- The `scrape_undergrad_course_booklet()` function that reads the Tabula-generated CSV file and applies the `parse_table_row()` function to each line

Each section will first display the full source code and then run it in short snippets so we can examine variable settings (program state) along the way. 

# Section 1: Definitional Code at the Top of the File
First, let's walk through the first few lines of code, up to the first function definition. 

<!-- Note: ever wondered how you show Python code in Markdown? See below. -->

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
File: course_schedules_tabula.py

Created on Tue Dec  5 12:57:56 2017

@author: chuntley

A utility for extracting Fairfield U course data from text scraped PDF files using tabula.
Currently works for the Spring 2018 Course Booklet.

"""


import re
import csv
import json

# A set of tags that appear in the Notes field of a course_spec string
tags = {
  'CLRC':'Creative Life Residential College',
  'CORN':'Cornerstone Course',
  'HYBD':'Hybrid Course',
  'IGRC':'Ignatian Residential College',
  'RCOL':'Residential Colleges',
  'SERO':'Service Learning Option',
  'RNNU':'RN to BSN Students Only',
  'SJRC':'Service for Justice Residential College',
  'SDNU':'Second Degree Nurses Only',
  'UDIV':'U.S. Diversity',
  'SERL':'Service Learning',
  'WDIV':'World Diversity'
}

# A set of regular expressions (regex patterns) to use to extract data fields from a table row
flds = {
    'crn':re.compile('(^[0-9]+)'),
    'catalog_id':re.compile('(^[A-Z]+ [0-9,A-Z]+)'),
    'section':re.compile('(^[0-9,A-Z]+)'),
    'credits':re.compile('(^[0-9])'),
    'timecode':re.compile('(TBA|[Bb]y [Aa]rrangement|[Oo]nline|[MTWRFSU]+ [0-9]{4}-[0-9]{4}[PpAa][Mm])'),
    'tags':re.compile('('+'|'.join(tags.keys())+')'),
    'instructor':re.compile('(.+)'),
    'title':re.compile('(.+)')
}
```

## Section 1 Line by Line

**The very top of the file is for execution notes, comments, etc. Leaving this stuff out is considered extremely unprofessional.**

**This particular module was designed to be run as a script from the command line. By convention, the first line then needs to tell the (command line interface) shell how it expects to be run. Here it asks the `/usr/bin/env` program to find the `python3` interpreter on the user's `PATH`, which works on macOS and most Linux setups. (We'd have to modify it for other computer setups.) Note that both lines are comments in Python, so Python never executes them. The first is for the shell; the second declares the source file's text encoding.**

In [None]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

In [None]:
"""
File: course_schedules_tabula.py

Created on Tue Dec  5 12:57:56 2017

@author: chuntley

A utility for extracting Fairfield U course data from text scraped PDF files using tabula.
Currently works for the Spring 2018 Course Booklet.

"""

**Imports come just after the comments so the imported code can be used farther down in the file.**

In [1]:
import re     # regular expressions
import csv    # csv file I/O
import json   # JSON data handling

**Sometimes we will need to create constants or configuration variables used by the rest of the code. Like the imports, always define constants and config variables near the top of the file.**

In [2]:
# A set of tags that appear in the Notes field of the course booklet
tags = {
  'CLRC':'Creative Life Residential College',
  'CORN':'Cornerstone Course',
  'HYBD':'Hybrid Course',
  'IGRC':'Ignatian Residential College',
  'RCOL':'Residential Colleges',
  'SERO':'Service Learning Option',
  'RNNU':'RN to BSN Students Only',
  'SJRC':'Service for Justice Residential College',
  'SDNU':'Second Degree Nurses Only',
  'UDIV':'U.S. Diversity',
  'SERL':'Service Learning',
  'WDIV':'World Diversity'
}
tags

{'CLRC': 'Creative Life Residential College',
 'CORN': 'Cornerstone Course',
 'HYBD': 'Hybrid Course',
 'IGRC': 'Ignatian Residential College',
 'RCOL': 'Residential Colleges',
 'SERO': 'Service Learning Option',
 'RNNU': 'RN to BSN Students Only',
 'SJRC': 'Service for Justice Residential College',
 'SDNU': 'Second Degree Nurses Only',
 'UDIV': 'U.S. Diversity',
 'SERL': 'Service Learning',
 'WDIV': 'World Diversity'}

**This next snippet defines a regular expression for each column of the CSV file.** Don't know what a regular expression is? [RTFM](https://docs.python.org/3.7/library/re.html) or try [this tutorial](https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial).

In [3]:
# A set of regular expressions (regex patterns) to use to extract data fields from a table row
flds = {
    'crn':re.compile('(^[0-9]+)'),
    'catalog_id':re.compile('(^[A-Z]+ [0-9,A-Z]+)'),
    'section':re.compile('(^[0-9,A-Z]+)'),
    'credits':re.compile('(^[0-9])'),
    'timecode':re.compile('(TBA|[Bb]y [Aa]rrangement|[Oo]nline|[MTWRFSU]+ [0-9]{4}-[0-9]{4}[PpAa][Mm])'),
    'tags':re.compile('('+'|'.join(tags.keys())+')'),
    'instructor':re.compile('(.+)'),
    'title':re.compile('(.+)')
}
flds

{'crn': re.compile(r'(^[0-9]+)', re.UNICODE),
 'catalog_id': re.compile(r'(^[A-Z]+ [0-9,A-Z]+)', re.UNICODE),
 'section': re.compile(r'(^[0-9,A-Z]+)', re.UNICODE),
 'credits': re.compile(r'(^[0-9])', re.UNICODE),
 'timecode': re.compile(r'(TBA|[Bb]y [Aa]rrangement|[Oo]nline|[MTWRFSU]+ [0-9]{4}-[0-9]{4}[PpAa][Mm])',
 re.UNICODE),
 'tags': re.compile(r'(CLRC|CORN|HYBD|IGRC|RCOL|SERO|RNNU|SJRC|SDNU|UDIV|SERL|WDIV)',
 re.UNICODE),
 'instructor': re.compile(r'(.+)', re.UNICODE),
 'title': re.compile(r'(.+)', re.UNICODE)}
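
To see what these patterns do in practice, here is a small sketch that rebuilds two of them and applies them to a sample string. The patterns are copied from `flds`; the shortened `demo_tags` dict and the `sample` string are made up for illustration.

```python
import re

# Shortened copy of the tags dict, for illustration only
demo_tags = {'UDIV': 'U.S. Diversity', 'HYBD': 'Hybrid Course'}

# Same construction as flds['tags']: join the keys with '|' to form an alternation
tags_pattern = re.compile('(' + '|'.join(demo_tags.keys()) + ')')

# Copied from flds['timecode']
timecode_pattern = re.compile(
    r'(TBA|[Bb]y [Aa]rrangement|[Oo]nline|[MTWRFSU]+ [0-9]{4}-[0-9]{4}[PpAa][Mm])'
)

sample = '3 MR 0200-0250pm  Yezee, I UDIV HYBD'  # made-up test string

found_tags = tags_pattern.findall(sample)
found_times = timecode_pattern.findall(sample)
print(tags_pattern.pattern)   # the alternation built from the dict keys
print(found_tags)
print(found_times)
```

Building the tags pattern from the dict keys means a new tag only has to be added in one place.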

# Section 2: the `parse_table_row()` function

**The `parse_table_row()` function does the actual cleanup of the data.**

Note the use of a triple-quoted docstring to document what the function does. We can include lots of things in docstring comments (e.g., parameter definitions, assumptions, outputs, etc.) but in this case it's just a single line.

In [4]:
def parse_table_row(row):
    ''' Parse one row of tabula data; each row is a column-wise list of strings'''

    course_spec = {}

    # Deal with extra timecodes on rows by themselves
    if not row[0]:
        unparsed = ' '.join(row)
        # use a regex to extract the timecode
        course_spec['timecodes'] = flds['timecode'].findall(unparsed)

        # return a partial course_spec with just the timecode
        return course_spec

    # What follows handles a typical table row exported from tabula

    # Parse out the easier columns that always seem to work in tabula
    course_spec['crn'] = int(row[0])
    course_spec['catalog_id'] = row[1] + ' ' + row[2]
    course_spec['section'] = row[3]
    course_spec['title'] = row[4]

    # Parse out the trickier columns that seem to merge awkwardly in tabula.
    # The logic below applies regular expressions to an unparsed string.
    # For each column:
    #   1. use a regex to extract data from the unparsed string;
    #   2. remove the extracted data from the unparsed string
    unparsed = ' '.join(row[5:]) # create a string of columns

    credits = flds['credits'].findall(unparsed)
    course_spec['credits'] = int(credits[0]) if credits else 0 # number of credits
    unparsed = flds['credits'].sub('',unparsed)

    course_spec['tags'] = flds['tags'].findall(unparsed) # list of tags
    unparsed = flds['tags'].sub('',unparsed)

    course_spec['timecodes']=flds['timecode'].findall(unparsed) # list of timecodes
    unparsed = flds['timecode'].sub('',unparsed)

    course_spec['instructor']=unparsed.strip() # remainder, minus extra whitespace

    return course_spec
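
As a sketch of what a fuller docstring might look like, the stub below documents the parameter and the return value. The name `parse_table_row_documented` and the docstring wording are illustrative and not part of the module; the body is only a placeholder.

```python
def parse_table_row_documented(row):
    """Parse one row of tabula data into a course_spec dict.

    Parameters:
        row: a column-wise list of strings, as produced by csv.reader.

    Returns:
        A dict of parsed fields; a row with an empty first column yields
        a partial dict containing only 'timecodes'.
    """
    return {}  # placeholder body; only the docstring matters here


# The docstring is stored on the function object and is what help() displays
first_line = parse_table_row_documented.__doc__.splitlines()[0]
print(first_line)
```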

## Section 2 Line by Line Trace
**Rather than run the function directly, let's run its body line by line for a couple of different rows of data:** 
- `["35712","MU","0120","01","History of Hip Hop","3 MR 0200-0250pm","","Yezee, I","UDIV HYBD"]`
- `["","","","","","W","1000-1050am","",""]`

### Row 1: `parse_table_row(["35712","MU","0120","01","History of Hip Hop","3 MR 0200-0250pm","","Yezee, I","UDIV HYBD"])`

In [5]:
# The local variable below is meant to be equivalent to calling the function. It's not in the module code.
row = ["35712","MU","0120","01","History of Hip Hop","3 MR 0200-0250pm","","Yezee, I","UDIV HYBD"]

In [6]:
course_spec = {}
course_spec

{}

**This conditional deals with a special case. If the special case applies then the function returns immediately without running the code below this block. This "early return" technique (sometimes called a guard clause) is used a lot in systems programming to guard against so-called "corner-case" bugs.**

In this case the special case does not apply, so the code has no effect.

Note: Since we are in a cell, not a function, the `return` statement has been commented out in our snippet.

In [7]:
# Deal with extra timecodes on rows by themselves
if not row[0]:
    unparsed = ' '.join(row)
    # use a regex to extract the timecode
    course_spec['timecodes'] = flds['timecode'].findall(unparsed)
    
    # return a partial course_spec with just the timecode
    # return course_spec
course_spec

{}
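
Inside a real function, the `return` would end the call right there. A minimal sketch of that behavior, using a hypothetical `describe_row()` function that is not in the module:

```python
def describe_row(row):
    """Hypothetical helper: report which branch a row would take."""
    # Guard clause: handle the special case first and leave immediately
    if not row[0]:
        return 'continuation row (timecode only)'

    # Only reached when the guard above did not fire
    return 'normal row with CRN ' + row[0]


normal = describe_row(["35712", "MU", "0120", "01"])
extra = describe_row(["", "", "", "", "", "W", "1000-1050am", "", ""])
print(normal)
print(extra)
```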

**The rest of the function handles the normal case, where it builds up a dictionary  based on the columns/fields expected in the CSV file.**

The first few columns are easy. Just take them directly from the list of strings in the `row` variable, converting each value to the form it needs along the way.

In [8]:
# Parse out the easier columns that always seem to work in tabula
course_spec['crn'] = int(row[0])
course_spec['catalog_id'] = row[1] + ' ' + row[2]
course_spec['section'] = row[3]
course_spec['title'] = row[4]
course_spec

{'crn': 35712,
 'catalog_id': 'MU 0120',
 'section': '01',
 'title': 'History of Hip Hop'}

**With the easy columns parsed out, we can now move on to the hard ones where the column breaks are possibly wrong. For this we'll need to use regular expressions to pluck them out of the line of input.**

Also, since the column breaks can't be trusted anyway, the code starts by reassembling the original line of input for these last few columns. 

**For the remaining code `unparsed` always has the part of the input string that has not been parsed yet. We will truncate it as we go along.**

In [9]:
# Parse out the trickier columns that seem to merge awkwardly in tabula.
# The logic below applies regular expressions to an unparsed string.
# For each column:
#   1. use a regex to extract data from the unparsed string;
#   2. remove the extracted data from the unparsed string
unparsed = ' '.join(row[5:]) # create a string of columns
unparsed

'3 MR 0200-0250pm  Yezee, I UDIV HYBD'

**Now for the regular expressions. The `flds` dict is defined near the top of the file. Each dictionary item is a regular expression saying what data in that column is supposed to look like.** 
The line below applies the pattern `(^[0-9])` to extract a single digit at the front of the string. Because the pattern has one capturing group, calling the `.findall()` method returns a list of strings.

In [10]:
credits = flds['credits'].findall(unparsed)
credits

['3']

**Once the credits value has been extracted, it is converted to an integer (or set to 0 if nothing was found).**

In [11]:
course_spec['credits'] = int(credits[0]) if credits else 0 # number of credits
course_spec

{'crn': 35712,
 'catalog_id': 'MU 0120',
 'section': '01',
 'title': 'History of Hip Hop',
 'credits': 3}

**After the credits are extracted and inserted into the dictionary, we update `unparsed` to reflect that we extracted data from it. Note that the code chopped off the '3' from the head of the string. The remainder still needs to be parsed.**

In [12]:
unparsed = flds['credits'].sub('',unparsed)
unparsed

' MR 0200-0250pm  Yezee, I UDIV HYBD'

In [13]:
course_spec['tags'] = flds['tags'].findall(unparsed) # list of tags
course_spec

{'crn': 35712,
 'catalog_id': 'MU 0120',
 'section': '01',
 'title': 'History of Hip Hop',
 'credits': 3,
 'tags': ['UDIV', 'HYBD']}

In [14]:
unparsed = flds['tags'].sub('',unparsed)
unparsed

' MR 0200-0250pm  Yezee, I  '

In [15]:
course_spec['timecodes']=flds['timecode'].findall(unparsed) # list of timecodes
course_spec

{'crn': 35712,
 'catalog_id': 'MU 0120',
 'section': '01',
 'title': 'History of Hip Hop',
 'credits': 3,
 'tags': ['UDIV', 'HYBD'],
 'timecodes': ['MR 0200-0250pm']}

In [16]:
unparsed = flds['timecode'].sub('',unparsed)
unparsed

'   Yezee, I  '

In [17]:
course_spec['instructor']=unparsed.strip() # remainder, minus extra whitespace
course_spec

{'crn': 35712,
 'catalog_id': 'MU 0120',
 'section': '01',
 'title': 'History of Hip Hop',
 'credits': 3,
 'tags': ['UDIV', 'HYBD'],
 'timecodes': ['MR 0200-0250pm'],
 'instructor': 'Yezee, I'}
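
The trace above repeats the same two calls, `.findall()` then `.sub()`, for each field. One way to express that, as a sketch, is a small helper. `extract_and_remove()` is a hypothetical name that does not appear in the module; the two patterns are copied from `flds`.

```python
import re


def extract_and_remove(pattern, text):
    """Return (matches, remainder): everything pattern finds in text,
    plus text with those matches deleted."""
    matches = pattern.findall(text)
    remainder = pattern.sub('', text)
    return matches, remainder


# Patterns copied from flds['credits'] and flds['timecode']
credits_pattern = re.compile(r'(^[0-9])')
timecode_pattern = re.compile(
    r'(TBA|[Bb]y [Aa]rrangement|[Oo]nline|[MTWRFSU]+ [0-9]{4}-[0-9]{4}[PpAa][Mm])'
)

unparsed = '3 MR 0200-0250pm  Yezee, I'
credits, unparsed = extract_and_remove(credits_pattern, unparsed)
timecodes, unparsed = extract_and_remove(timecode_pattern, unparsed)
instructor = unparsed.strip()  # whatever is left over, minus whitespace
print(credits, timecodes, instructor)
```

Because the credits pattern is anchored with `^`, it only matches a digit at the very start of the string, so digits inside the timecode are left alone.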

### Row 2: `parse_table_row(["","","","","","W","1000-1050am","",""])`

In [18]:
# A local variable to account for the row function parameter
row = ["","","","","","W","1000-1050am","",""]
row

['', '', '', '', '', 'W', '1000-1050am', '', '']

In [19]:
course_spec = {}
if not row[0]:
    unparsed = ' '.join(row)
    # use a regex to extract the timecode
    course_spec['timecodes'] = flds['timecode'].findall(unparsed)
course_spec

{'timecodes': ['W 1000-1050am']}

**The function returns immediately in this case, so there is nothing more to trace.**

# Section 3: The `scrape_undergrad_course_booklet()` function

In [20]:
def scrape_undergrad_course_booklet(filename):
    ''' Parse a course booklet that has been exported as a CSV from Tabula.'''
    with open(filename, newline='') as csvfile:
        linereader = csv.reader(csvfile)
        course_specs =[]
        for row in linereader:
            if not row[0].startswith('CRN'):
                course_spec = parse_table_row(row)
                if 'crn' in course_spec:
                    # add the new course_spec
                    course_specs += [course_spec]
                elif 'timecodes' in course_spec:
                    # merge timecode into last course_spec
                    course_specs[-1]['timecodes'] += course_spec['timecodes']
    return {'course_offerings':course_specs,'tags':tags}
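
The module imports `json`, but the function above only returns a dict. Assuming the result is meant to be serialized (the source does not show this step), a sketch of converting a dict of that shape to JSON text and back might look like this. The `result` data is made up.

```python
import json

# Made-up result in the same shape the function returns
result = {
    'course_offerings': [
        {'crn': 35712, 'catalog_id': 'MU 0120', 'timecodes': ['MR 0200-0250pm']}
    ],
    'tags': {'UDIV': 'U.S. Diversity'},
}

text = json.dumps(result, indent=2)   # dict -> JSON string
round_trip = json.loads(text)         # JSON string -> dict
print(text)
```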

## Section 3 Line by Line
**As with the previous function, we're just going to step through the meat of the function body for the two rows of CSV data. Assume that we are reading a CSV file with our two test rows.**

In [21]:
with open('test_data.csv', newline='') as csvfile:
    linereader = csv.reader(csvfile)
    course_specs = []
    for row in linereader:
        print(row)

['CRN', 'Subj', 'Course', 'Sec', 'Title', 'Creds DaysTime', 'Instructor', 'Notes']
['35712', 'MU', '0120', '01', 'History of Hip Hop', '3 MR 0200-0250pm', 'Young A R', 'UDIV HYBD']
['', '', '', '', '', 'W', '1000-1050am', '', '']


**Now let's take each (non-header) row in turn.**

In [22]:
row = ['35712', 'MU', '0120', '01', 'History of Hip Hop', '3 MR 0200-0250pm', '', 'Yezee', ' I', 'UDIV HYBD']
import course_schedules_tabula
course_spec = course_schedules_tabula.parse_table_row(row)
course_spec

{'crn': 35712,
 'catalog_id': 'MU 0120',
 'section': '01',
 'title': 'History of Hip Hop',
 'credits': 3,
 'tags': ['UDIV', 'HYBD'],
 'timecodes': ['MR 0200-0250pm'],
 'instructor': 'Yezee  I'}

**Great, that matches what we already have, apart from the instructor field, which reflects how this row's columns were split! Now let's update `course_specs`.**

In [23]:
if 'crn' in course_spec:
    # add the new course_spec
    course_specs += [course_spec]
elif 'timecodes' in course_spec:
    # merge timecode into last course_spec
    course_specs[-1]['timecodes'] += course_spec['timecodes']
course_specs

[{'crn': 35712,
  'catalog_id': 'MU 0120',
  'section': '01',
  'title': 'History of Hip Hop',
  'credits': 3,
  'tags': ['UDIV', 'HYBD'],
  'timecodes': ['MR 0200-0250pm'],
  'instructor': 'Yezee  I'}]

**With the first row processed, let's try the second row.**

In [24]:
row = ['', '', '', '', '', 'W', '1000-1050am', '', '']
course_spec = parse_table_row(row)
course_spec

{'timecodes': ['W 1000-1050am']}

In [25]:
if 'crn' in course_spec:
    # add the new course_spec
    course_specs += [course_spec]
elif 'timecodes' in course_spec:
    # merge timecode into last course_spec
    course_specs[-1]['timecodes'] += course_spec['timecodes']
course_specs

[{'crn': 35712,
  'catalog_id': 'MU 0120',
  'section': '01',
  'title': 'History of Hip Hop',
  'credits': 3,
  'tags': ['UDIV', 'HYBD'],
  'timecodes': ['MR 0200-0250pm', 'W 1000-1050am'],
  'instructor': 'Yezee  I'}]
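
The same merge logic can be exercised without a file on disk. The sketch below feeds CSV text through `csv.reader` via `io.StringIO` and uses a stripped-down stand-in for `parse_table_row()`. The `tiny_parse()` function, the simplified pattern, and the CSV text are all made up for illustration.

```python
import csv
import io
import re

# Simplified version of the timecode pattern (day letters plus a time range)
timecode = re.compile(r'([MTWRFSU]+ [0-9]{4}-[0-9]{4}[PpAa][Mm])')


def tiny_parse(row):
    """Stand-in for parse_table_row(): keeps only crn and timecodes."""
    spec = {'timecodes': timecode.findall(' '.join(row))}
    if row[0]:
        spec['crn'] = int(row[0])
    return spec


csv_text = (
    'CRN,Title,DaysTime\n'
    '35712,History of Hip Hop,MR 0200-0250pm\n'
    ',,W 1000-1050am\n'
)

course_specs = []
for row in csv.reader(io.StringIO(csv_text, newline='')):
    if row[0].startswith('CRN'):
        continue  # skip the header row
    spec = tiny_parse(row)
    if 'crn' in spec:
        course_specs.append(spec)  # a new course
    elif course_specs:
        # continuation row: merge its timecodes into the previous course
        course_specs[-1]['timecodes'] += spec['timecodes']

print(course_specs)
```

The `elif course_specs` check is a small departure from the module code: it skips a continuation row that arrives before any course, where indexing `course_specs[-1]` on an empty list would raise an `IndexError`.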

# Final Output

In [26]:
course_schedules_tabula.scrape_undergrad_course_booklet('tabula-201801CourseBooklet.csv')['course_offerings']

[{'crn': 34379,
  'catalog_id': 'AY 0010',
  'section': '01',
  'title': 'Intro Four-Field Anthropology',
  'credits': 3,
  'tags': ['WDIV'],
  'timecodes': ['TF 1100-1215pm'],
  'instructor': 'Lacy S'},
 {'crn': 34380,
  'catalog_id': 'AY 0010',
  'section': '02',
  'title': 'Intro Four-Field Anthropology',
  'credits': 3,
  'tags': ['WDIV'],
  'timecodes': ['TF 1230-0145pm'],
  'instructor': 'Lacy S'},
 {'crn': 35688,
  'catalog_id': 'AY 0052',
  'section': '01',
  'title': 'Culture and Political Economy',
  'credits': 3,
  'tags': [],
  'timecodes': ['TF 0930-1045am'],
  'instructor': 'Crawford D'},
 {'crn': 34553,
  'catalog_id': 'AY 0110',
  'section': '01',
  'title': 'Biological Anthropology',
  'credits': 3,
  'tags': [],
  'timecodes': ['W 0200-0430pm'],
  'instructor': 'Hensley-Marschand B'},
 {'crn': 34749,
  'catalog_id': 'AY 0110',
  'section': '02',
  'title': 'Biological Anthropology',
  'credits': 3,
  'tags': [],
  'timecodes': ['W 0630-0900pm'],
  'instructor': 'Hensl