Coding Strategies

We wrote a detailed writeup on pipe and related concepts, which is saved in a GitHub repository [here](https://github.com/sillyotter/pytut/blob/main/pytut.ipynb).  It is worth a read, but below we provide a short summary of the use of the `.pipe` operation.

In the first few documents in this collection, we discussed some of the problems with using a notebook. Specifically, cells can be run in any order, so what you see on the screen may not match the real state of the notebook, and the shared, often hidden state between cells can, if not managed well, produce different outputs from a cell depending on the order in which the cells are run.  These problems are often very hard to track down.

There are a few defensive strategies that you can adopt to try to avoid having to ever deal with these kinds of issues.

## Linear Flows

You can write a notebook composed of arbitrary cells.  You can run them manually in any order you want.  It is possible for you to find the answer you wanted, but only because you ran cell 2 first, then 1, then 5, then 3, then 4, then reran 5.  Some of those steps may have been unneeded, but you don't know that, and some may only work if preceded by some unknown sequence of cell runs.  It is also possible that you changed cell 5 between its two runs, meaning that even if I repeated that sequence of operations, I wouldn't get the same results you did.  While you may have gotten your answer, this notebook is not very useful to anyone else, nor to yourself.  No one else will know this magic order of operations, or the code changes that occurred between the two runs of cell 5.  This is not a terribly good way to communicate what you did, even to a future version of yourself.

To combat this, try to always think of your notebook as a linear flow of cells from top to bottom that will be run in that order.  There may be reasons for you to run them out of order during development, while exploring and investigating, but keep an eye on that and make sure that before you're done, you have structured your notebook as a linear sequence of cells.

Remember, this notebook is attempting to tell a story, with code and documentation, and it would be best if it read from top to bottom just like any piece of English prose.  No one else knows what was going on in your mind when you wrote it, so document it, and structure it to lead the next person from start to finish.

This helps prevent issues where out of order cell execution results in incorrect results being displayed, or global state getting corrupted.

**This should not be interpreted as saying you must restart the kernel and run everything linearly every time.**  Running all cells is a useful technique to make sure everything is a correct linear story, but it can also be hugely wasteful.  If cell 4 of your notebook spends 20 minutes downloading statistics, you don't want to run that over and over again.  Certainly restart and run all when you're done, just to make sure it all works as you think it does, or maybe once a day or every few hours, but not after every change.

If you have structured your notebook to be linear, then you will not need to restart and run everything very often.  Say that in cell 4 I read my data into a dataframe, and then in cells 5, 6, 7, and 8 I do work on it.  If I realize later on that I need to change how I work on that data, I can go back up to cells 5-8, make the needed changes, and just run those cells, in order.  I don't need to rerun the load.  The load was run once, and the data stored in a dataframe.  And in cells 5-8, I didn't change that dataset; it was left untouched, and I just created new dataframes by specifying changes to the original data.  While exploring I may need to rerun cells, and I may want to rerun them from wherever I made a change forward.  You will find your development process is basically a series of steps where you add a cell, do some work, add a second cell, do some work, go back a few cells and rework things forward, and then go a bit further, adding new cells.  The 'rerun' work you have to do will mostly be an iteration over the last N cells you have added, just changing and rerunning those.

## State Management

Above, we touched on but did not really dig into the idea of state management.  When you run a Jupyter notebook, a Python interpreter is started, and every time you run a cell, the code in that cell is sent to that interpreter and run.  The interpreter stays running between cell executions, and the same interpreter is reused for each cell.  If you allocate a dataframe in cell 1, it will still be there in cell 5.  If you add a cell at position 6 and run it to overwrite that variable, the variable takes on this new state from that point in time forward.  Going back to run cell 5 will then run its code against this new value, even though the change occurs later in the linear flow of cells, because you ran them out of order.

Most people reading your notebook will, without thinking about it, assume the program is to be read top to bottom and that the state changes to your program's memory happen in that order.  That is another good reason to keep the actual running order linear.

But people do dumb things, and even though you designed the notebook to be linear, you can't be sure the next user (who may be you) will run it that way every time.

As a result, it is useful to adopt the defensive strategy of not letting people have the option of running things out of order.

If I have 7 cells that must be run one after the other in order to get the answer I want, that's 7 cells that someone, including yourself, may accidentally or in a panic run out of order.  I can't stop that, but I can guarantee that the code gets run in the right order if I put it all in one cell.

If it's important to the correct execution of your notebook that a set of cells gets run in order, you'll be doing yourself and others a service by putting them all in one cell, to guarantee they get executed correctly.


In [2]:
xxx = 100

In [3]:
xxx += 200

In [4]:
xxx /= 2

In [5]:
xxx

150.0

In the above example, I have my calculation spread out over 3 cells.  We start with 100, add 200, then divide by 2.  In the order expressed, that results in the final answer 150.  But that's 3 whole cells, and who knows if someone will accidentally run one out of order.  If after I run the above linearly, I rerun the `xxx /= 2` cell, 150 just became 75.  If I then rerun the `xxx += 200` cell, it's now 275.
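The rerun arithmetic above can be reproduced outside a notebook. The following is a minimal sketch that imitates a kernel by running each "cell" as a string in one shared namespace; the `cells` dictionary and the `run` helper are hypothetical names used only for illustration, not part of Jupyter.

```python
# Simulate a notebook kernel: every "cell" runs in one shared namespace,
# so state left behind by one cell is visible to the next.
namespace = {}

cells = {
    "init": "xxx = 100",
    "add": "xxx += 200",
    "halve": "xxx /= 2",
}


def run(name):
    """Run one simulated cell and return the current value of xxx."""
    exec(cells[name], namespace)
    return namespace["xxx"]


# Running top to bottom gives the intended answer.
for name in ("init", "add", "halve"):
    in_order = run(name)
print(in_order)            # 150.0

# Rerunning single cells afterwards silently changes the result.
after_halve_again = run("halve")
print(after_halve_again)   # 75.0
after_add_again = run("add")
print(after_add_again)     # 275.0
```

Nothing in the code changed between the three printed values; only the execution history did.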

When I see a wrong answer, I don't want to ever have to ask myself whether it happened because someone ran the calculation in the wrong order or accidentally reran a cell after the fact.  There are enough real problems to sort through; I don't want to have to deal with artificial ones caused by Jupyter interactions.  So I would not write the code like you see above. I would put it all in one cell.

In [6]:
xxx = 100
xxx += 200
xxx /= 2
xxx

150.0

There is only one way to run the above cell.  You will get the same answer no matter how many times you run it, and if you go back and run it again later, you will still get the same answer.  Again, as with assuming a linear flow, this strategy makes entire classes of errors go away, and you no longer have to approach every debugging session by asking 'did this get run in order?'  I know it did; there was no choice.
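As a quick check of that claim, here is a small sketch that runs the combined cell several times in one shared namespace, again imitating a kernel with `exec`. The variable names are illustrative only.

```python
# One simulated cell holding the whole calculation.
cell = """
xxx = 100
xxx += 200
xxx /= 2
"""

namespace = {}
results = []
for _ in range(3):
    # Each run starts by resetting xxx, so earlier runs cannot leak in.
    exec(cell, namespace)
    results.append(namespace["xxx"])

print(results)  # [150.0, 150.0, 150.0]
```

The cell is repeatable because it begins by assigning `xxx` from scratch rather than relying on whatever value an earlier run left behind.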

As you can see, trying to make linear flows is helpful.  Thinking of cells as discrete units of work and putting all the work that does one step in one cell also limits the options for things to be run out of order, and helps make sure that the state of your variables in RAM is what you want it to be.  State management can also be thought of in a different way.

We will provide an example, but let's just think it out first.  It is perfectly natural to assume that someone reading a notebook will be doing so linearly, from top to bottom.  When examining code in cell 5, which is working on a dataframe created in cell 4, it's natural to assume I can go up and look at the previous cell, cell 4, to see how that dataframe was created, to better understand what cell 5 is doing.  We assume a linear flow of cells.  But what if, off in cell 44, we physically changed the dataframe?  What if we overwrote it with new data, or deleted columns, or overwrote them with new calculations?  Now, while I'm trying to understand the code, I see the cell 4 dataframe creation code doing x, and I see cell 5 working on that, but when I look at my dataframe, I see that it doesn't match what the code says.  Now I'm all confused.

It is safe to assume your notebook will be read linearly.  For that reading to make sense, so that the data created in cell 4 really is what feeds into cell 5, it's important that we don't change that data much later on in the notebook.  To be able to easily understand the code, I will want to be able to think linearly about it, from top to bottom, but that gets hard when people are fundamentally changing the contents of variables that were defined and used higher up in the program.  The flow is linear, but the data gets changed so it no longer matches the code, making it hard to understand.

To be linear, cell 5 should depend only on data created in cells 1-4.  That way, I can read the thing linearly, and even rerun parts of it, as long as I do so linearly, and get sensible results.  But if cell 33 is changing things that were created in cells 1-4, then if I go to rerun 5, or even just try to understand it, it will be hard to do, because the data that cell 5 is dependent on is being created in the future, off in cell 33.

So, not only do you want to write a linear notebook where the code is meant to be understood from front to back, but you want the data dependencies to be linear as well.  And this means don't go and destructively mutate data created earlier.  Always create new data with the changes, and make sure the following cells use this new data.
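To make the distinction concrete, here is a minimal sketch contrasting a non-mutating change with an in-place one. The small frame and the names `raw`, `squared`, and `scratch` are made up for illustration.

```python
import pandas as pd

raw = pd.DataFrame({"a": [1.0, 2.0, 3.0], "c": [1.0, 2.0, 3.0]})

# Non-mutating: assign returns a new frame and leaves `raw` alone,
# so this line gives the same result however often it is rerun.
squared = raw.assign(c=lambda d: d.c ** 2)
print(raw.c.tolist())      # [1.0, 2.0, 3.0]
print(squared.c.tolist())  # [1.0, 4.0, 9.0]

# Mutating: the column is overwritten in place, so "rerunning the cell"
# squares the already-squared values. A copy is used here only so the
# demonstration does not damage `raw`.
scratch = raw.copy()
scratch["c"] = scratch["c"] ** 2
scratch["c"] = scratch["c"] ** 2   # a second run of the same line
print(scratch.c.tolist())  # [1.0, 16.0, 81.0]
```

With the non-mutating form, any cell that reads `raw` sees exactly what the cell that created it produced.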

Let's try an example:


In [7]:
xxx = 100

In [8]:
xxx += 200

In [9]:
xxx /= 2

In [10]:
xxx

150.0

Yes, that should likely be written in a single cell, but for the example, let's do it like this.  If I write the code like this, and run it in order, I will get 150.  If I decide I want to change 200 to 300, I can do so, and as long as I rerun from the top cell down, I will get a new, correct answer, and I can think about it as a straightforward linear sequence of operations.  But what if it looked like this:

In [11]:
xxx = 100

In [12]:
xxx += 200

In [13]:
xxx /= 2

In [14]:
xxx

150.0

In [15]:
# 30 cells worth of work and then
xxx *= 2

In [16]:
xxx

300.0

Note that in the last code cell, way down in the notebook, I changed the contents of xxx to be double whatever it was.

If I am up reading the original 3 cells, I will read them linearly and assume that the `xxx += 200` cell is dependent only on what happened above it.  So, if I now change that to `xxx += 300` and run from there down, I will no longer get a meaningful result, because the value of xxx is no longer the 100 it was the first time: way down, many cells below, we mutated it with `xxx *= 2`.  The logic flows linearly, but the data does not.  Depending on the history of your executions, the `xxx += 200` cell is dependent either on the cell just before it or possibly on the cell 30 steps further down.

This leads to code that is very hard to work with and understand, and may make you throw up your hands, restart the kernel, and rerun everything, which can be a huge waste of time.
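For contrast, here is a sketch of the same arithmetic written so that no step overwrites an earlier value. The names `start`, `added`, `halved`, and `doubled` are illustrative and do not appear in the cells above.

```python
# Each step produces a new name instead of overwriting xxx.
start = 100
added = start + 200
halved = added / 2

# ... 30 cells worth of work, and then:
doubled = halved * 2
print(halved, doubled)   # 150.0 300.0

# Later I change 200 to 300 and rerun only the affected steps.
# `start` is still 100, because nothing downstream ever modified it.
added = start + 300
halved = added / 2
print(start, halved)     # 100 200.0
```

Because the doubling step wrote to its own name, rerunning the middle steps depends only on what sits above them.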

Another reason not to mutate data is performance.  In this document, we have been doing trivial things so far, so if you do have to rerun everything, it's no big deal.

But what if your first step was to load up a dataframe that contained a year's worth of metrics data for 1000 network interfaces?  That may take 20 minutes to load.  If you destructively mutate that dataset, and find you did it wrong, you have to go back up in your program and rerun that data fetch.  And now you're going to go take a nap for 20 minutes.

So in addition to linear logic and data flow making the program easier to reason about, linear data flow, where you never change data created earlier, also means that you always have the original data sitting around.  If I spend 20 minutes loading that data and mutate it incorrectly in my experiments, I now have to spend another 20 minutes reloading it.  If my changes leave the loaded data alone, then when I find I did my calculations wrong, I can just rerun the calculations on the untouched data I loaded up earlier.

In some of our code, we take this idea a bit further.  I may have to shut down my program, or my computer may get rebooted.  I will have to run everything the next time I open the notebook, but I don't want to spend 20 minutes waiting for that data to load again.  So in some cases, you will find that not only do we not mutate the data so that we don't have to load it again in this session, we may also save it to disk and use that data in future runs, to save time.  We take our state management a bit further, and try to preserve old values not just across cell executions, but across entire notebook sessions or login sessions.  We will discuss this in more detail later, but the concept is the same.  Sometimes state management is defensive, and sometimes it is just a quality of life thing where you want to avoid repeating costly operations over and over again.
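As a rough idea of what saving to disk can look like, here is a minimal sketch. It is an assumption-laden illustration, not the caching code used elsewhere in this collection: `slow_fetch`, `load_cached`, and the `metrics.pkl` file name are hypothetical, and a temporary directory stands in for a real cache location.

```python
import tempfile
from pathlib import Path

import pandas as pd

fetch_calls = []  # records each time the expensive fetch actually runs


def slow_fetch() -> pd.DataFrame:
    """Hypothetical stand-in for a 20 minute metrics download."""
    fetch_calls.append(1)
    return pd.DataFrame({"iface": ["eth0", "eth1"], "errors": [3, 0]})


def load_cached(cache_path: Path) -> pd.DataFrame:
    """Return the cached frame if it exists; otherwise fetch and cache it."""
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = slow_fetch()
    df.to_pickle(cache_path)
    return df


cache_file = Path(tempfile.mkdtemp()) / "metrics.pkl"
first = load_cached(cache_file)    # runs slow_fetch and writes the file
second = load_cached(cache_file)   # reads the file; no fetch this time
print(len(fetch_calls), first.equals(second))  # 1 True
```

A real cache also needs some way to decide when the saved file is stale, which this sketch deliberately leaves out.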

So, let's see where we are in our defensive programming strategies:

1. Write your notebooks with a linear logic flow so they are easier to understand, and so it is easier to rerun trailing subsets of cells.
2. If you really must enforce an order of operations, put it all in one cell to guarantee it's not run out of order.
3. Don't mutate data, to preserve a linear data flow. Possibly cache data as well.

So let's make a more realistic example.  I will try to write the code with a linear logic flow, and a linear data flow:

In [17]:
import pandas as pd

df = pd.read_csv('data/cleaning_example_01.csv')
df

Unnamed: 0,a,b,c,d
0,,1.0,1.0,1
1,2.0,2.0,2.0,2
2,3.0,3.0,,3
3,4.0,4.0,4.0,4
4,5.0,,5.0,5
5,6.0,6.0,6.0,YYY
6,7.0,7.0,7.0,XXX


In [18]:
df1 = df.replace('YYY', 0)

In [19]:
df2 = df1.replace('XXX', 0)
df2

Unnamed: 0,a,b,c,d
0,,1.0,1.0,1
1,2.0,2.0,2.0,2
2,3.0,3.0,,3
3,4.0,4.0,4.0,4
4,5.0,,5.0,5
5,6.0,6.0,6.0,0
6,7.0,7.0,7.0,0


In [20]:
df3 = df2.fillna(0)
df3

Unnamed: 0,a,b,c,d
0,0.0,1.0,1.0,1
1,2.0,2.0,2.0,2
2,3.0,3.0,0.0,3
3,4.0,4.0,4.0,4
4,5.0,0.0,5.0,5
5,6.0,6.0,6.0,0
6,7.0,7.0,7.0,0


In [21]:
df4 = df3.assign(c = lambda df: df.c ** 2)
df4

Unnamed: 0,a,b,c,d
0,0.0,1.0,1.0,1
1,2.0,2.0,4.0,2
2,3.0,3.0,0.0,3
3,4.0,4.0,16.0,4
4,5.0,0.0,25.0,5
5,6.0,6.0,36.0,0
6,7.0,7.0,49.0,0


The above cells present a linear flow of logic, and each one does no mutation of data, so our dataflow is also linear.

We did mention that if you must have things run in a particular order, you should put them in one cell to prevent any problems, so let's do that.  If nothing else, it makes the code take up less space so it's easier for us to read.

In [22]:
df = pd.read_csv('data/cleaning_example_01.csv')
df1 = df.replace('YYY', 0)
df2 = df1.replace('XXX', 0)
df3 = df2.fillna(0)
df4 = df3.assign(c = lambda xxx: xxx.c ** 2)
df4

Unnamed: 0,a,b,c,d
0,0.0,1.0,1.0,1
1,2.0,2.0,4.0,2
2,3.0,3.0,0.0,3
3,4.0,4.0,16.0,4
4,5.0,0.0,25.0,5
5,6.0,6.0,36.0,0
6,7.0,7.0,49.0,0


Based on our discussion above, this is pretty good.  The code which must go in that order is in one cell, so I don't have to worry about the order of execution.  The logic is linear, and the dataflow is linear, making it easy for me to reason about it in context with the rest of the program, and to get repeatable, sensible results if I should rerun it.

The only thing wrong with this, really, is all those temp dataframe variables we created. I can guarantee to you that in the writing of this or other code later on, someone will screw up and refer to df3 when they meant df4.  We have polluted the python namespace with 4 different intermediate dataframe variables we will never need, but they do exist, and we can accidentally use them. 

It would be ideal if we didn't have to worry about them taking up memory, or accidentally using them.  Fortunately, there is another strategy we can use to avoid having to worry about any of that.

In [23]:
dfx = (
    pd.read_csv('data/cleaning_example_01.csv')
    .replace('YYY', 0)
    .replace('XXX', 0)
    .fillna(0)
    .assign(c = lambda xxx: xxx.c ** 2)
)

dfx

Unnamed: 0,a,b,c,d
0,0.0,1.0,1.0,1
1,2.0,2.0,4.0,2
2,3.0,3.0,0.0,3
3,4.0,4.0,16.0,4
4,5.0,0.0,25.0,5
5,6.0,6.0,36.0,0
6,7.0,7.0,49.0,0


Visit the article we wrote [here](https://github.com/sillyotter/pytut/blob/main/pytut.ipynb) (and that we referenced up at the top of this article) for more detail on exactly how that works.

In summary, each of those operations above is a method on a dataframe that returns a new, modified dataframe, so we can simply chain the operations one after the other.  No unneeded and potentially confusing set of transient temp variables is created.  It's simple, linear, non mutating, and non polluting.  It's all in one cell, so we don't have to worry about what order the steps are run in.  We only get one dataframe out of it, the final answer, and it's easy to read as a linear sequence of steps.

This final technique is sort of the culmination of all the other strategies we listed above.  We want linear flow of logic and data, we want to put related operations in one cell to make sure they get run in the right order, and we don't want to pollute the namespace (nor our brains) with hundreds of spare variables we don't really need.

Writing things in this style structures our code such that we have made it nearly impossible for us to introduce several whole classes of errors.  We don't need to think about them anymore.  We can focus on the problem, not on looking for issues caused by out of order execution or improper state management.

We of course still need to make sure we write our cells so that logic and data flow linearly; we don't want to put all of our code in a single cell.  We still need to make sure future operations don't corrupt the contents of dfx, for instance.  But as for the work in a single cell, by adopting this non mutating chained sequence of operations as our basic style, we help prevent entire classes of errors from showing up.

And that is why you will see, while reading our code, this style of programming used over and over.  

There is one issue with this basic style, though it's easy enough to work around.  Above, we chained together a bunch of operations, each of which was a non mutating method on a dataframe.  Each one of those methods worked on a dataframe and returned a new but modified dataframe.

But what do we do when we want to call a function or do an operation that cannot be expressed as a method on a dataframe?

We use the `.pipe()` method.

The whole point of the pipe method is to let you use ordinary functions in a chain as if they were dataframe methods.

The naming of the method may seem odd, but it's got a long pedigree.  In old Unix systems from the 70s and 80s, the shells that users used to enter requests into the system needed a way to perform more than one operation in sequence over the data generated by commands.  You may issue the command `ls` to list files, but then you want to sort them.  There is a `sort` command, but how do you connect the two?  You want to, like in a factory, feed the output of one command into the input of the second command.  Or perhaps more like plumbing, you want to pipe the hot water from the heater to the dishwasher.  That metaphor, pipe, long ago became just how you discussed taking data from one command and sending it to the next command.  The shells standardized on the `|` character to do this.  I honestly don't know if that character even has a real name other than the "pipe character", as it's been used for that for so long.

As a result, in a unix shell, if you want to get a file listing and sort it, you would write:

```sh
ls | sort
```

Which is saying "run the ls command, then take its output and run it through a pipe which is connected to the input of the sort command."

You can get more complex with this:

```sh
ls | sort | uniq | head -n 2
```
This would get a directory listing, sort it, throw out any duplicates, and then take the first two lines and send them to the screen.

The same pipe syntax works in shells from the late 70s and in the most recent versions of PowerShell.  Pipe is ubiquitous in the software world.

That is what the pandas `.pipe()` method is used for.  I have a dataframe, it's my data, and I want to send it through a function, and take the output of that function (which had better be another dataframe) and send it on down the stream.  Some operations like `assign` and `replace` and so on do this naturally, but if I have a function that is not a member of the dataframe object, I need a way to wire it into this chain of operations.  And that is what pipe does.

In [24]:
def trim_dataframe(df: pd.DataFrame, count: int) -> pd.DataFrame:
    return df.head(count)

(
    dfx
    .pipe(trim_dataframe, 4)
    .sort_values('b')
)


Unnamed: 0,a,b,c,d
0,0.0,1.0,1.0,1
1,2.0,2.0,4.0,2
2,3.0,3.0,0.0,3
3,4.0,4.0,16.0,4


The above example shows that we were able to use the pipe method to take the dataframe in question and `pipe` it through a function, one that took in a dataframe and returned a new, modified dataframe, which could then be passed along to the `sort_values` method.
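Put another way, `df.pipe(func, *args, **kwargs)` simply calls `func(df, *args, **kwargs)`. The sketch below checks that equivalence on a small made-up frame; `frame` is illustrative data, and `trim_dataframe` is repeated here only so the block runs on its own.

```python
import pandas as pd


def trim_dataframe(df: pd.DataFrame, count: int) -> pd.DataFrame:
    """Return the first `count` rows as a new dataframe."""
    return df.head(count)


frame = pd.DataFrame({"a": [3, 1, 2], "b": [30, 10, 20]})

piped = frame.pipe(trim_dataframe, 2)             # extra positional argument
by_keyword = frame.pipe(trim_dataframe, count=2)  # same thing by keyword
direct = trim_dataframe(frame, 2)                 # the plain function call

print(piped.equals(direct), by_keyword.equals(direct))  # True True
```

The only thing pipe adds is position: the function call appears in the chain where it happens, instead of wrapping everything that came before it.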

Let's look at a different example.

If we have a dataframe and we want to subset it, filter it, extract just a few rows, we know how to do this:

In [25]:
dfx[dfx.a > 5]

Unnamed: 0,a,b,c,d
5,6.0,6.0,36.0,0
6,7.0,7.0,49.0,0


That's nice and simple, but how do I do that in a sequence of chained expressions?  With pipe.

In [26]:
(
    dfx
    .pipe(lambda df: df[df.a > 5])
    .sort_values('b')
)

Unnamed: 0,a,b,c,d
5,6.0,6.0,36.0,0
6,7.0,7.0,49.0,0


Without pipe, we'd have to write it like this:

In [27]:
dfy = dfx[dfx.a > 5]
dfz = dfy.sort_values('b')
dfz 

Unnamed: 0,a,b,c,d
5,6.0,6.0,36.0,0
6,7.0,7.0,49.0,0


The non piped code had to introduce pointless spare variables that will get in the way later.

So that is what pipe is for: to take a dataframe and pipe it through some kind of transformation or filter so we can include it in a sequenced chain of operations without having to pause, mutate anything, or introduce any transient temp variables.
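When a lambda starts to get long, the same filter can be pulled out into a named function and piped with keyword arguments. This is a small sketch on made-up data; `keep_rows_above` and `frame` are hypothetical names chosen for illustration.

```python
import pandas as pd


def keep_rows_above(df: pd.DataFrame, column: str, threshold: float) -> pd.DataFrame:
    """Return only the rows where `column` exceeds `threshold`."""
    return df[df[column] > threshold]


frame = pd.DataFrame({"a": [7.0, 2.0, 6.0, 4.0], "b": [1.0, 4.0, 3.0, 2.0]})

result = (
    frame
    .pipe(keep_rows_above, column="a", threshold=5)
    .sort_values("b")
    .reset_index(drop=True)
)

print(result.a.tolist())  # [7.0, 6.0]
print(len(frame))         # 4, the original frame is untouched
```

A named step documents its intent in the chain and can be reused or tested on its own, while the chain still produces a single final dataframe.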