# Workflows

The point of Jupyter is to *document workflows*. 
A *workflow* is a series of steps toward an end result. 
The cells contain individual steps, either as programs or as text. 

Let's work out a basic workflow. We start with a text file in which each line holds a pair of elements separated by a comma. The first element is a label; the second is a count of the things with that label. Our goal is to sum the counts for lines that share a label and generate a report. 
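
To make the shape of one line concrete, here is a minimal sketch. The line `"apples,3"` is made up for illustration (the real `data.txt` may contain different labels), and the `sample_` names are hypothetical so that they don't collide with anything in the workflow.

```python
# A hypothetical line in the "label,count" shape; not taken from data.txt.
sample_line = "apples,3"

# Split at the comma: the first piece is the label, the second is the count.
sample_label, sample_count_text = sample_line.split(",")

# The count arrives as text, so convert it to a number before doing arithmetic.
sample_count = int(sample_count_text)

print(sample_label, sample_count)  # apples 3
```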

Very often, you'll be faced with a notebook written by someone else and have to figure out how it works. This is the first of several exercises in reading and understanding other people's notebooks. 

First, let's look at the file. Run this cell to see what it contains. 

In [None]:
# take a look at the file
f = open('data.txt', 'r')
for line in f: 
    print(line.strip())
f.close()

The next step is to read the file's contents into a variable that I can work with. I'll read it into a list, one entry per line. Run this to see what happens. 

In [None]:
# step 1: read a file into a list
f = open('data.txt', 'r')
lines = []
for line in f:
    lines.append(line.strip())
f.close()
lines
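
As an aside, the same reading step is often written with a `with` statement, which closes the file automatically. The sketch below is one possible variant, not the cell used in this workflow. It writes a small throwaway file with made-up contents so that it runs on its own, and it uses the hypothetical names `demo_path` and `demo_lines` so that it doesn't overwrite `lines`.

```python
import os
import tempfile

# Create a throwaway file so this sketch is self-contained.
# Its contents are invented; they are not the real data.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w") as out:
        out.write("apples,3\npears,12\napples,4\n")

    # The with-statement closes the file when the block ends,
    # so no explicit f.close() is needed.
    with open(demo_path, "r") as f:
        demo_lines = [line.strip() for line in f]

print(demo_lines)  # ['apples,3', 'pears,12', 'apples,4']
```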

Now I need to split each line into columns and sum the counts for each label. I have a recipe for that. Run it to see what happens. 

In [None]:
# step 2: split up lines into data columns, sum up categories
categories = {}
for l in lines: 
    label, value = l.split(',')
    if label in categories: 
        categories[label] += int(value)
    else: 
        categories[label] = int(value)
categories
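
The `if`/`else` in step 2 handles the first time a label is seen. A common alternative, shown here only as a sketch on made-up data, is `dict.get` with a default of 0. The names `demo_lines` and `demo_categories` are hypothetical and leave `lines` and `categories` untouched.

```python
# Made-up input in the same "label,count" shape as the workflow's lines.
demo_lines = ["apples,3", "pears,12", "apples,4"]

demo_categories = {}
for entry in demo_lines:
    label, value = entry.split(",")
    # get() returns the running total, or 0 if the label is new.
    demo_categories[label] = demo_categories.get(label, 0) + int(value)

print(demo_categories)  # {'apples': 7, 'pears': 12}
```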

Now I have to put that into a form that can be sorted. I'll put it into a list. 

In [None]:
# Step 3: transform into a list
pairs = []
for k in categories: 
    pairs.append((k, categories[k],))
pairs
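
For comparison, Python dictionaries have an `items()` method that yields the same (key, value) pairs, so the loop in step 3 can also be written in one line. This is a sketch on made-up data with hypothetical names, not a replacement for the cell above.

```python
# Made-up totals standing in for the workflow's categories dictionary.
demo_categories = {"apples": 7, "pears": 12}

# items() yields (key, value) tuples; list() collects them into a list.
demo_pairs = list(demo_categories.items())

print(demo_pairs)  # [('apples', 7), ('pears', 12)]
```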

Finally, I have a snippet that sorts the list in descending order of count: 

In [None]:
# Step 4: sort by most frequent
ordered = sorted(pairs, key=lambda x: x[1], reverse=True)
ordered
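
To see what `key` and `reverse` each contribute, here is a small sketch on made-up pairs. The names `demo_pairs`, `ascending`, and `descending` are hypothetical.

```python
# Made-up (label, total) pairs.
demo_pairs = [("apples", 7), ("pears", 12), ("plums", 1)]

# key tells sorted() what to compare: here, element 1 of each pair (the total).
ascending = sorted(demo_pairs, key=lambda x: x[1])

# reverse=True flips the order so the largest total comes first.
descending = sorted(demo_pairs, key=lambda x: x[1], reverse=True)

print(ascending)   # [('plums', 1), ('apples', 7), ('pears', 12)]
print(descending)  # [('pears', 12), ('apples', 7), ('plums', 1)]

# sorted() builds a new list; the original list keeps its order.
print(demo_pairs)  # [('apples', 7), ('pears', 12), ('plums', 1)]
```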

# A few notes on workflows

1. I intentionally put cells into this notebook that potentially no one in the class understands yet. 
2. The key to understanding a workflow is to look at it one cell at a time. 
3. The key to using a workflow is to understand the conditions under which it works, and the conditions under which it doesn't.
4. The cells here are 'rituals' that accomplish an end. 
5. We'll study how to turn them into 'patterns' that are reusable. 
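
As a preview of what a reusable 'pattern' might look like, here is one possible sketch that wraps steps 2 through 4 in a function. The function name `summarize_counts` and the sample data are hypothetical, and it uses `dict.get` and `items()` in place of the loops in the cells above. Like the cells, it only works under certain conditions: every line must contain exactly one comma, followed by a whole number.

```python
def summarize_counts(label_count_lines):
    """Sum counts per label and return (label, total) pairs, largest first.

    Assumes every line looks like "label,count" with a whole-number count.
    """
    totals = {}
    for entry in label_count_lines:
        label, value = entry.split(",")
        totals[label] = totals.get(label, 0) + int(value)
    # Turn the dictionary into pairs and sort by total, descending.
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Usage on made-up data:
demo_report = summarize_counts(["apples,3", "pears,12", "apples,4"])
print(demo_report)  # [('pears', 12), ('apples', 7)]
```

Because the input arrives as an argument and the result comes back as a return value, the function can be reused on any list of lines without depending on which cells ran before it.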

# Let's learn something about these cells
Each cell has inputs and outputs. The first two cells have an input that is an external file. 

Looking through the code, the last line of each cell displays a variable; that variable is the cell's output. When a later cell uses that variable, it is an input to that cell. Based upon this, what are the inputs and outputs to each cell below? 

In [None]:
# step 1: read a file into a list
f = open('data.txt', 'r')
lines = []
for line in f:
    lines.append(line.strip())
f.close()
lines

Your answer: 
1. Inputs are: 
2. Outputs are:

In [None]:
# step 2: split up lines into data columns, sum up categories
categories = {}
for l in lines: 
    label, value = l.split(',')
    if label in categories: 
        categories[label] += int(value)
    else: 
        categories[label] = int(value)
categories

Your answer: 
1. Inputs are: 
2. Outputs are:

In [None]:
# Step 3: transform into a list
pairs = []
for k in categories: 
    pairs.append((k, categories[k],))
pairs

Your answer: 
1. Inputs are: 
2. Outputs are:

In [None]:
# Step 4: sort by most frequent
ordered = sorted(pairs, key=lambda x: x[1], reverse=True)
ordered

Your answer: 
1. Inputs are: 
2. Outputs are:

# Order is important

Let's explore that concept. 

First, let's clear all calculations in the notebook by selecting `Kernel > Restart and clear output`. This erases all variable values. 
Then, run `step 4` without `steps 1-3`. What happens? 

___Your answer:___

# Kinds of workflow steps
There are several kinds of steps in a typical workflow: 
1. Input/output: import data from outside, or export data to outside. 
2. Calculation: perform a calculation on the data. 
3. Transformation: change the data format. 
4. Filtering: exclude certain data. 
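
Here is a small sketch of each kind of step on a different, made-up problem, so that it doesn't answer the question below. All names are hypothetical, and `io.StringIO` stands in for an external file so the example runs on its own.

```python
import io

# Input/output: read text from outside (an in-memory stand-in for a file).
source = io.StringIO("4\n15\n8\n23\n")
raw = [line.strip() for line in source]   # ['4', '15', '8', '23']

# Transformation: same data, different format (text -> integers).
numbers = [int(text) for text in raw]     # [4, 15, 8, 23]

# Filtering: keep only the data we want.
large = [n for n in numbers if n >= 10]   # [15, 23]

# Calculation: compute something new from the data.
total = sum(large)                        # 38

print(raw, numbers, large, total)
```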

What kinds of steps do we have above? 

___Your answer:___
* Step 1: 
* Step 2: 
* Step 3: 
* Step 4: 

# Filtering
Filtering is the process of eliminating data we don't want in our view. 

Suppose we want to eliminate all lines of the input with counts < 10. Here's a snippet to do that.

In [None]:
# step ???: remove all lines with counts < 10
temp = []
for l in lines: 
    if int(l.split(',')[1]) >= 10: 
        temp.append(l)
lines = temp
lines

# Put this step where it goes
Click on the cell above and use the up and down arrows *in the control ribbon* to move it to where it belongs in the workflow below. That workflow is a copy of the one above. 

In [None]:
# step 1: read a file into a list
f = open('data.txt', 'r')
lines = []
for line in f:
    lines.append(line.strip())
f.close()
lines

In [None]:
# step 2: split up lines into data columns, sum up categories
categories = {}
for l in lines: 
    label, value = l.split(',')
    if label in categories: 
        categories[label] += int(value)
    else: 
        categories[label] = int(value)
categories

In [None]:
# Step 3: transform into a list
pairs = []
for k in categories: 
    pairs.append((k, categories[k],))
pairs

In [None]:
# Step 4: sort by most frequent
ordered = sorted(pairs, key=lambda x: x[1], reverse=True)
ordered

After you have put the new step in its place, run these steps in order to see how the workflow behaves. 

# An afterword
I realize that I just exposed you to a lot of Python code that you might not yet understand. Don't worry, we'll cover that in detail. 

Note how I approached this problem: 
1. break the problem down into steps, 
2. use code to accomplish each step, and 
3. keep clear what you get from each step and the order of steps. 

This is, in general, the pattern for using Jupyter notebooks. 

# When you have completed this exercise

Rerun all cells from top to bottom and check that they work. 

You can submit a notebook by saving it as a PDF. In the cluster environment, use File | Print (Save as PDF), then submit the PDF to Gradescope (https://www.gradescope.com/courses/182658). On other versions, it may be File | Download As (PDF); then submit the PDF to Gradescope.

To submit to Gradescope, log into the [website](https://www.gradescope.com/courses/182658), add course **9W7PW3** (if not already added) and submit. The assignment name should match the name of this notebook.