# Debugging and Profiling Python
## Marcus D. Collins, Ph.D.
Senior Scientist, Placed, Inc.

# My history with Python
* I come from the land of C and FORTRAN
* I am trained as a physicist and biophysicist, with almost no formal CS background.
* I first started using Python for image processing, about 15 years ago.
* I've been writing Python daily for the last couple of years.
* I now work as a Data Scientist, essentially interpreting human telemetry data in shopping contexts.
* I have seen and used an actual working VT100 terminal, in my home.

# Jupyter, the debugger, and profiling
* This is the first time I've used Jupyter. People say it's good for presentations?
* Remember: I come from the land of FORTRAN and C, before even IDEs were common.
* I have come to love and rely on the debugger fairly recently. 
* Profiling code is vastly under-utilized.

# An approximate outline
* What is the role of a debugger? (<i> and the trap of scripting languages!!!</i>)
* Avoiding the debugger: Tips on how to write manageable code from someone who writes a lot of unmanageable code.
* BROKEN CODE!
* What is the role of profiling, and when should I use it?
* What to do when your code isss sooooo sloooowwwwwww.

# Why use a debugger?
## The temptation of scripting languages
* We've all used `print("My integer is %d" % myinteger)` in some form or other.
* If you are like me, you have had that statement raise a `TypeError` because `myinteger = str(3)` was the actual underlying bug. 
* This is the most basic debugging.
* If you've ever read Apache GreatestThingEver logs, you'll recognize its more sophisticated cousin, `logging`.
* I think it is a terrible habit, one that we got from the ease with which scripts can be run.
* With tools like Hadoop, it is sometimes the only way (AWS EMR was built for production, not development). 
* But it is slow, error prone, and doesn't allow realtime feedback.<p>
### Lesson
<b><i>We want to be able to make and test hypotheses of what's wrong in our code quickly!</i></b>

## Understanding what our code is doing.
* Often we make assumptions about what our code is doing that prove false.
* Very often, we want to test what our code will do if we give it odd or bad input.
### Lesson
<b><i>The debugger is also a tool to quickly test for fragile code and edge cases:</i></b> we can run our setup, and then alter the input <i>in situ</i> to test how short fragments will behave. 

# But first: Avoid debugging in the first place.
* Use a good IDE!
  * I use PyCharm. Atom is okay. There's one that starts with S that seems to be good if you're a developer.
  * Find one that does type checking, static error checking, PEP8.
* Other modules
  * PyLint and related tools are mostly style checkers, but they can be helpful.
  * pyflakes does some error checking: it looks for undefined names, unused imports... 
  * but really, get a good IDE.
* TESTS!
  * Tests are irritating from a Data Science perspective, because most of them are poorly designed.
  * A good test does not have explicit input. A good test generates the input according to a model.
  * A good test doesn't compare to explicit floating point numbers, unless it is testing the values of constants or distributions.
  * A good test doesn't test that a model prediction doesn't change, it tests the logic of the model.
  * <b><i> The best tests communicate the assumptions, required input, and function of the code.</i></b>
* Comment your code.
  * NO code is "self documenting". RTFCode is for arrogant losers. WTFComments is more like it.
  * Only YOU think the way you do.
  * Are you doing math? Using an even slightly unusual formula? <i>Cite references in your code!</i> For scientists especially, code is just as much an expression of your work as a journal article. Treat it that way.
* Code reuse
  * A common trap for Python programmers is that production types want to rewrite things in Java. <i>IF THIS HAPPENS TO YOU, you will need to write a test that compares their result to yours!</i> I say this from painful experience.
  * "Object obfuscation": 
    * Remember that I come from a long-ago time, when `malloc` was a thing people dealt with regularly. 
    * I abhor classes, but I've learned to live with them. 
    * Don't fall into the class trap though--too much inheritance, especially chained inheritance, makes your code unreadable, and that is a recipe for bugs. 
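
The points about tests above can be made concrete with a small sketch. Everything here is hypothetical and not from the talk: `normalize` stands in for "the code under test", and the uniform random weights stand in for "a model of the input". The test generates its own input and checks a logical property (the shares sum to one) instead of comparing against hard-coded floating point numbers.

```python
import math
import random


def normalize(weights):
    """Scale positive weights so that they sum to 1 (hypothetical example function)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def test_normalize_sums_to_one(trials=100, seed=0):
    # The input is generated from a simple model (uniform positive weights),
    # seeded so that a failure can be reproduced.
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 50)
        weights = [rng.uniform(0.1, 10.0) for _ in range(n)]
        result = normalize(weights)
        # Test the logic, not specific numbers.
        assert len(result) == n
        assert all(r >= 0 for r in result)
        assert math.isclose(sum(result), 1.0, rel_tol=1e-9)


test_normalize_sums_to_one()
```

Read top to bottom, the test also documents the assumptions: the input is a list of positive numbers, and the output is a list of shares of the same length.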

# Enough! Let's run something!

In [4]:
from scipy import sparse as sp
import numpy as np
from pandas import DataFrame

r = np.random.random_sample(size=(500,15))
r = r * (r < 0.2).astype(int)
r = sp.csr_matrix(r)

# Note the lack of a long line. I personally find it easier to read
# if I can glance up and down, rather than try to read a long line.
df = DataFrame(np.random.random_sample(size=(500,2)),         
               columns=['A','B'])
df['C'] = r  # hint, the error is here...
thing1 = sp.csr_matrix(df.A.values)
thing2 = sp.vstack(df.C.values)

In [31]:
output = sp.hstack([thing1, thing2])  # Whooops!

ValueError: blocks[0,:] has incompatible row dimensions

IPython integrates the Python debugger, essentially running your code as `python -m pdb somecode.py` and giving you the `ipdb>` prompt if you ask nicely <i>immediately</i> after an error: 

In [32]:
%debug

> /home/marcus/virtual/IPTEST/local/lib/python2.7/site-packages/scipy/sparse/construct.py(489)bmat()
    487                 else:
    488                     if brow_lengths[i] != A.shape[0]:
--> 489                         raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
    490 
    491                 if bcol_lengths[j] == 0:

ipdb> u
> /home/marcus/virtual/IPTEST/local/lib/python2.7/site-packages/scipy/sparse/construct.py(391)hstack()

Great, super cool. We can have a look around. 

Let's get familiar with some of the navigation in the ipython debugger (ipdb):

Python, unlike many languages, displays a very nice <i><b>traceback</b></i> showing exactly which lines resulted in the error. We can move up and down the <i><b>call stack</b></i>:

* <b>u</b>: up, the previous call in the traceback
* <b>d</b>: down, the next call in the traceback.
* <b>w</b>: show where in the call stack we are.
* <b>c</b>: continue running the code (until the next breakpoint or error).
* <b>q</b> or quit: exit the debugger and abandon the run.

and we can step through the code line by line or into functions:
* <b>n</b>: execute the next line of code, do not enter any function call
* <b>s</b>: same as n but will enter (<b>s</b>tep into) any python function call.

When doing this, it is easy to get lost:
* <b>l</b> to see the code at the current location.
* <b>a</b> (args) shows you the arguments of the current function; evaluate `locals()` to see every variable in the current namespace.
* <b>bt</b> (for back trace) shows the current back trace again, in case you're disoriented.

* <b>h</b> or <b>help</b> are always there for you too.

At any level of the call stack, you can inspect every variable visible from that frame (including class and instance variables), but not the local variables of other frames until you move to them with <b>u</b> or <b>d</b>. 

<b><i>But what if we have a bug, not an error? Or we're not running in this fancy-pants notebook?</i></b>

So now, we'll set a breakpoint. There are other ways to invoke the debugger, but this is probably the most useful in practice. When you find a place where your code breaks, you can set a breakpoint above it. Often, it makes sense to put it at the start of the last <i>block</i>, for instance, the first line of a for loop, or the first line of a class method. In our example, we'll set it at the start to demonstrate the basics, but it can be quite tedious to invoke the debugger too often.
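
`script.py` itself is not reproduced in these notes. As a rough stand-in, the usual way to set a breakpoint is to call `pdb.set_trace()` on the line above the code you want to inspect. The function `scale` and the `DEBUG` flag below are made up for illustration; the flag is only there so the sketch runs straight through unless you switch it on.

```python
import pdb

DEBUG = False  # hypothetical switch: set to True to stop at the breakpoint


def scale(values, factor):
    if DEBUG:
        # Execution pauses here and the (Pdb) prompt appears.
        # The line displayed at the prompt is the one about to run.
        pdb.set_trace()
    scaled = [v * factor for v in values]
    return scaled


result = scale([1, 2, 3], 0.5)
print(result)
```

With `DEBUG = True`, typing `n` steps over the list comprehension, `p scaled` prints the new variable, and `c` lets the script finish.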

In [34]:
%run script.py

ValueError: blocks[0,:] has incompatible row dimensions

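OOPS! You can't just type anything in the pdb! Some things are commands! (A bare `r`, for example, is the `return` command, not the variable `r`; use `print r` or `p r` to see the variable.) Try again: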

In [36]:
%run script.py

> /home/marcus/pyladies/script.py(11)<module>()
-> r = r * (r < 0.2).astype(int)
(Pdb) print r
[[ 0.26875702  0.78156342  0.5191541  ...,  0.53270272  0.95403293
   0.0345442 ]
 [ 0.38015038  0.504218    0.11425369 ...,  0.95339391  0.40322888
   0.37616339]
 [ 0.94012001  0.17384152  0.38398771 ...,  0.29083631  0.69258176
   0.48843051]
 ..., 
 [ 0.77112158  0.78840637  0.2680866  ...,  0.80123618  0.36078602
   0.57794764]
 [ 0.56212088  0.19646672  0.17092477 ...,  0.4652415   0.01796785
   0.76890592]
 [ 0.82946305  0.68115492  0.5279345  ...,  0.55681231  0.63164835
   0.76600214]]
(Pdb) n
> /home/marcus/pyladies/script.py(12)<module>()
-> r = sp.csr_matrix(r)
(Pdb) print r
[[ 0.          0.          0.         ...,  0.          0.          0.0345442 ]
 [ 0.          0.          0.11425369 ...,  0.          0.          0.        ]
 [ 0.          0.17384152  0.         ...,  0.          0.          0.        ]
 ..., 
 [ 0.          0.          0.         ...,  0.          0.      

BdbQuit: 

Some observations to help orient you:
  1. The line of code you see is the one you are <i>about</i> to execute, not the last one executed. So any variable defined on that line isn't available yet.
  2. Context is often helpful, remember <b>l</b>. 
  3. You always start way down in some module or class. Pandas for instance has unusually deep back traces. The meaningful bit may be quite a ways <b>u</b>p in the code.
  4. If you are running with `python -m pdb somescript.py`, be aware that your error is actually caught by yet another layer--the pdb module itself.
  5. When running a script wrapped by pdb, be aware that it will run through, and when it finishes, it will ask you if you want to start again. Type `quit` to get out. 
  
And *REMEMBER* you can change the course of things by redefining variables within the pdb, to make it work, and keep looping until it does!

*HOWEVER* you cannot edit the original source file and have pdb pick up the changes. 

# Profiling Python Code

# When should I profile?
## Any time something takes longer than it takes to get a cup of coffee.
* You can spend some time looking over your code to find obvious speed-ups. *BUT*
* remember the adage: make your code work first, then make it fast <b><i>BUT</i></b>
* making your code faster can help shorten your development cycle, SO:
* I say, profile as you go, make small chunks work, and make them fast. Sometimes "working" means "working in a reasonable amount of time."

# How should I profile?
## Visually: look for obvious speedups:
* convert simple for loops to list, dict, or set comprehensions
* numpy vs. pandas
* ...

## the cProfile module
*It's so easy!*

In [38]:
%run -p -s cumulative script2.py

ValueError: blocks[0,:] has incompatible row dimensions
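
Outside IPython, the same kind of report can be produced from a plain script with the standard library's `cProfile` and `pstats` modules. This is a minimal sketch, not the contents of `script2.py`; the function `slow_sum` is a made-up workload.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately naive loop so that it shows up in the profile.
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
result = profiler.runcall(slow_sum, 100000)  # run the function under the profiler

# Sort by cumulative time and print the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

The `cumtime` column in the report is the time spent in a function including everything it calls, which is what the `-s cumulative` option sorts on.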

 

## Some thoughts: because running all of this would be tedious, and I ran out of time...
- cProfile works very well in IPython/Jupyter, but also works at the command line: `python -m cProfile -s cumulative somescript.py`
- cumulative mode is typically more useful. 
- Look for loops: loops in Python are relatively slow.
  - Use comprehensions instead. A lot can be done in a comprehension:

```
import gzip
import pandas

def parse(line):
    # Placeholder: parse some complicated, nested, horrible object,
    # compute some things, and return something in flat dict form.
    return {'length': len(line)}

data = []
for line in gzip.open('file.gz'):
    data.append(parse(line))
df = pandas.DataFrame(data)
```

This can easily be converted to:

```
df = pandas.DataFrame([parse(line) for line in gzip.open('file.gz')])
```

Woo! And there are `dict` comprehensions too! `{k: v for k, v in zip(keys, values)}` 
and *generator expressions*! `myiter = (x**2 for x in np.array([0,1,2,3,4]))`

This is more compact, easier to read (I think) and usually *runs faster*.
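
Rather than take the speed claim on faith, it is easy to measure. The sketch below uses the standard library's `timeit` on a made-up `parse` function and made-up input lines; the actual numbers will depend on your machine and on how expensive the real parsing is.

```python
import timeit


def parse(line):
    # Hypothetical stand-in for real parsing work.
    key, value = line.split(',')
    return {'key': key, 'value': int(value)}


lines = ['k%d,%d' % (i, i) for i in range(1000)]


def with_loop():
    data = []
    for line in lines:
        data.append(parse(line))
    return data


def with_comprehension():
    return [parse(line) for line in lines]


# Time 200 runs of each version.
loop_time = timeit.timeit(with_loop, number=200)
comp_time = timeit.timeit(with_comprehension, number=200)
print('loop: %.4f s, comprehension: %.4f s' % (loop_time, comp_time))
```

If `parse` dominates the run time, the difference between the two forms will be small, which is itself useful to know before rewriting anything.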

  - Remove extraneous code. We use lots of JSON object data, with newlines. People were obsessed with str.strip(). It is sloooow and totally unnecessary. json.loads() doesn't care. (Usually). There are lots of things like this you can look for.
  
  - *Don't use classes*. Classes are slow because of how Python handles objects: attribute and method lookups are resolved at run time, every time. 
  
  - Avoid libraries that excessively use classes. (Wes McKinney I'm talking to you.)
  
  - Look for faster ways: there is so much Python I realized I don't know, or was stuck in a rut on. `"\t".join(list)` is faster than `"item1" + "\t" + "item2"`, even for short lists.
  
  - Remember that anything you do in a loop is magnified. Using cProfile will help you see this.
  
  - Lambdas are slow! (So is any Python function called once per element; the call overhead is what hurts.)
  
  - For math, push yourself to use numpy's amazing broadcasting. It allows you to make large computations at absurd speed.

This is super slow (even though it uses list comprehensions!):

In [39]:
tsarray1 = np.random.random_sample(size=(100,))
tsarray2 = np.random.random_sample(size=(100,))

In [40]:
min([[min(ts1 - ts2) for ts1 in tsarray1] for ts2 in tsarray2])

TypeError: 'numpy.float64' object is not iterable

In [8]:
min([(tsarray2 - ts1) for ts1 in tsarray1])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [9]:
%debug

> <ipython-input-8-47882c3c7535>(1)<module>()
----> 1 min([(tsarray2 - ts1) for ts1 in tsarray1])

ipdb> tsarray2
array([  9.8,  23.2,  42. ])
ipdb> tsarray1
array([ 10.4,  12.1,  78.4])
ipdb> [ts1 for ts1 in tsarray1]
[10.4, 12.1, 78.400000000000006]
ipdb> tsarray2 - ts1
array([-68.6, -55.2, -36.4])
ipdb> c


In [41]:
%timeit min([min([ts1 - ts2 for ts1 in tsarray1]) for ts2 in tsarray2])

100 loops, best of 3: 1.78 ms per loop


In [42]:
%timeit min([min(tsarray1 - ts2) for ts2 in tsarray2])

1000 loops, best of 3: 1.01 ms per loop


In [43]:
%timeit (tsarray1[:,np.newaxis] - tsarray2[np.newaxis, :]).min()

The slowest run took 4.37 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 27.4 µs per loop
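
The broadcasting version in the last cell may be easier to follow with the shapes spelled out. This sketch reuses the three-element arrays from the debugging session above (not the 100-element random arrays that were timed) and checks that the nested-loop and broadcast versions agree. The variable names `col`, `row` and `diffs` are mine.

```python
import numpy as np

tsarray1 = np.array([10.4, 12.1, 78.4])
tsarray2 = np.array([9.8, 23.2, 42.0])

col = tsarray1[:, np.newaxis]  # shape (3, 1): a column
row = tsarray2[np.newaxis, :]  # shape (1, 3): a row
diffs = col - row              # shape (3, 3): diffs[i, j] = tsarray1[i] - tsarray2[j]

# Same quantity computed with nested Python loops.
loop_min = min(min(ts1 - ts2 for ts1 in tsarray1) for ts2 in tsarray2)
broadcast_min = diffs.min()

print(diffs.shape, loop_min, broadcast_min)
```

The subtraction builds the full table of pairwise differences in compiled code, so the only Python-level work is the single call to `min()`. The price is memory: the intermediate array has `len(tsarray1) * len(tsarray2)` elements.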


  - If you absolutely must use an instance method, consider keeping a local reference to it:
`gb = dataframe.groupby` will essentially cache that bound method, preventing repeated lookups. *Remember: avoid costly lookups by avoiding classes!*

  - Avoid deepcopy! Get clever with slicing! Deepcopying objects is usually a mistake: it costs you memory and time. Learn better in-place algorithms to do what you need to do, or figure one out yourself.
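
The method-caching tip can be tried on something much smaller than a DataFrame. This is a sketch with made-up functions, `squares_plain` and `squares_cached`, that differ only in whether `out.append` is looked up on every iteration or once. How much it helps varies with the Python version, so time it before relying on it.

```python
import timeit


def squares_plain(n):
    out = []
    for i in range(n):
        out.append(i * i)  # looks up the append attribute on every iteration
    return out


def squares_cached(n):
    out = []
    append = out.append    # look the bound method up once, outside the loop
    for i in range(n):
        append(i * i)
    return out


plain_time = timeit.timeit(lambda: squares_plain(10000), number=200)
cached_time = timeit.timeit(lambda: squares_cached(10000), number=200)
print('plain: %.4f s, cached: %.4f s' % (plain_time, cached_time))
```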
  
## Cython
Not today! Maybe next time! Very freeing. Cython can actually compile pure Python code and will give some speedup.