# Recap

In order of priority (i.e. time taken):

1. `BasalAreaIncrementNonSpatialAw`
    - this is actually slow because `BAfromZeroToDataAw` is called many times and calls it repeatedly, as shown above
    - relaxing the tolerance may help
    - indeed, its tolerance is 0.01 * some value, while I think the other factor-finder functions use a 0.1 tolerance
    - can also use Cython for the increment functions
2. vectorize the merch and gross volume functions
    - they require a lot of getting scalars off a data frame, which is quite slow; it is faster to get an array
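
To see why relaxing the tolerance should cut the number of increment calls, here is a minimal sketch of a bisection-style factor finder. `increment`, `grow` and `find_factor` are hypothetical stand-ins, not the gypsy functions; the assumption is only that the tolerance is relative (`tol * target`) and is checked after each full simulation, so a looser `tol` lets the search stop earlier.

```python
def increment(ba, factor):
    """Hypothetical increment: annual basal area growth scaled by `factor`."""
    return factor * 0.02 * ba


def grow(factor, start_ba=1.0, years=50, counter=None):
    """Apply the increment once per year; optionally count increment calls."""
    ba = start_ba
    for _ in range(years):
        ba += increment(ba, factor)
        if counter is not None:
            counter[0] += 1
    return ba


def find_factor(target_ba, tol, lo=0.0, hi=10.0, max_iter=100):
    """Bisect on `factor` until grow(factor) is within tol * target_ba.

    Returns the factor found and the number of increment calls it took.
    """
    counter = [0]
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        ba = grow(mid, counter=counter)
        if abs(ba - target_ba) <= tol * target_ba:
            return mid, counter[0]
        if ba < target_ba:
            lo = mid  # grow() increases with factor, so search higher
        else:
            hi = mid
    raise RuntimeError("no factor found within max_iter bisection steps")


for tol in (0.01, 0.1):
    factor, calls = find_factor(20.0, tol)
    print(f"tol={tol}: factor={factor:.4f}, increment calls={calls}")
```

Because every bisection step that satisfies the tight tolerance also satisfies the loose one, the loose search can never make more increment calls than the tight one.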

do a profiling run with I/O (reading the input data and writing the plot curves to files) in the next run


# Decide on the action

- speed up the increment functions
    - use Cython for the increment functions
    - it turns out this may not help that much: the function itself is pretty fast, but it's called almost 500,000 times on the sample of 300 plots
    - reduce the number of times it's called, maybe by using gradient descent for the optimization?
    - relax the tolerance
    - this gives a context to refactor them as well (into their own module), which would be a welcome change
    - the increment functions use numpy functions but operate on scalars; there is no benefit to using numpy functions there
- performance-wise, it is not clear that this will pay off much; vectorizing the volume functions is probably wiser
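
A quick way to check the point about numpy on scalars is to time the same scalar formula written with `np.exp` and with `math.exp`. The increment functions are not shown in this notebook, so this sketch uses the gross total volume formula quoted below as a stand-in; both function names are hypothetical. Timings vary by machine, so none are claimed here.

```python
import math
import timeit

import numpy as np

A1, A2, A3, A4 = 0.194086, 0.988276, 0.949346, -3.39036


def gross_total_volume_np(ba, top_height):
    """Scalar formula using np.exp (as the original code does)."""
    if top_height > 0:
        return A1 * ba**A2 * top_height**A3 * np.exp(1 + A4 / (top_height**2 + 1))
    return 0.0


def gross_total_volume_math(ba, top_height):
    """Same scalar formula using the standard library's math.exp."""
    if top_height > 0:
        return A1 * ba**A2 * top_height**A3 * math.exp(1 + A4 / (top_height**2 + 1))
    return 0.0


if __name__ == "__main__":
    for func in (gross_total_volume_np, gross_total_volume_math):
        seconds = timeit.timeit(lambda: func(10.0, 10.0), number=100_000)
        print(f"{func.__name__}: {seconds:.3f} s per 100k calls")
```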

# Characterize what is happening

```python
import pandas as pd
import numpy as np
```

The original gross volume function checks that top height is greater than 0:


``` python
def GrossTotalVolume_Pl(BA_Pl, topHeight_Pl):
    Tvol_Pl = 0

    if topHeight_Pl > 0:
        a1 = 0.194086
        a2 = 0.988276
        a3 = 0.949346
        a4 = -3.39036
        Tvol_Pl = a1* (BA_Pl**a2) * (topHeight_Pl **a3) * numpy.exp(1+(a4/((topHeight_Pl**2)+1)))

    return Tvol_Pl
```

This makes it fail when called on arrays:

```python
from gypsy.GYPSYNonSpatial import GrossTotalVolume_Pl

GrossTotalVolume_Pl(np.random.random(10) * 100, np.random.random(10) * 100)
```

```
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```
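
The error comes from the `if topHeight_Pl > 0:` guard: comparing an array with 0 produces a boolean array, and `if` needs a single truth value. A minimal reproduction with plain NumPy:

```python
import numpy as np

top_height = np.array([5.0, 0.0, 12.0])
mask = top_height > 0  # elementwise comparison: one bool per element
print(mask)

try:
    if mask:  # `if` calls bool(mask), which is ambiguous for >1 element
        pass
except ValueError as err:
    print("ValueError:", err)

# Explicit reductions are unambiguous
print(mask.any(), mask.all())
```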

# MWEs

If we rewrite it to handle 0s properly without the `if`, i.e. to return 0 where an input is 0, then it is trivial to vectorize. The formula already does this, since `0**a2` and `0**a3` are both 0:

```python
def GrossTotalVolume_Pl_arr(BA_Pl, topHeight_Pl):
    a1 = 0.194086
    a2 = 0.988276
    a3 = 0.949346
    a4 = -3.39036
    Tvol_Pl = a1* (BA_Pl**a2) * (topHeight_Pl **a3) * np.exp(1+(a4/((topHeight_Pl**2)+1)))

    return Tvol_Pl

print(GrossTotalVolume_Pl_arr(10, 10))
print(GrossTotalVolume_Pl_arr(0, 10))
print(GrossTotalVolume_Pl_arr(10, 0))
print(GrossTotalVolume_Pl_arr(np.random.random(10) * 100, np.random.random(10) * 100))
print(GrossTotalVolume_Pl_arr(np.zeros(10) * 100, np.random.random(10) * 100))
```

```
44.1908473047
0.0
0.0
[  415.44083176  3769.55471466  1288.45549553    49.77278285   298.91564963
  1367.74822794   729.29619243   906.68358934  2393.18506461   385.18108024]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
```
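
One behavioural difference remains: the original returns 0 for a negative top height, whereas for array input the unguarded version raises a negative number to a fractional power, which NumPy evaluates to `nan` (with a `RuntimeWarning`). If negative heights can occur in the input (an assumption; it is not checked here), the guard can be kept in vectorized form with `np.where`. The name `gross_total_volume_guarded` is hypothetical:

```python
import numpy as np

A1, A2, A3, A4 = 0.194086, 0.988276, 0.949346, -3.39036


def gross_total_volume_guarded(ba, top_height):
    """Vectorized gross total volume that returns 0 where top_height <= 0."""
    ba = np.asarray(ba, dtype=float)
    top_height = np.asarray(top_height, dtype=float)
    positive = top_height > 0
    # Replace non-positive heights with 0 before the power, so no nan is produced
    safe_height = np.where(positive, top_height, 0.0)
    volume = A1 * ba**A2 * safe_height**A3 * np.exp(1 + A4 / (safe_height**2 + 1))
    # Keep the original guard: volume is 0 wherever the height is not positive
    return np.where(positive, volume, 0.0)


print(gross_total_volume_guarded([10.0, 10.0, 10.0], [10.0, 0.0, -5.0]))
```

`np.where` evaluates both branches, which is why the height is clipped first rather than only masking the result.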


### Timings

```python
ba = np.random.random(1000) * 100
top_height = np.random.random(1000) * 100
d = pd.DataFrame({'ba': ba, 'th': top_height})
```

```python
%%timeit
d.apply(
    lambda x: GrossTotalVolume_Pl(
        x.at['ba'],
        x.at['th']
    ),
    axis=1
)
```

```
10 loops, best of 3: 31.9 ms per loop
```

```python
%%timeit
GrossTotalVolume_Pl_arr(ba, top_height)
```

```
10000 loops, best of 3: 165 µs per loop
```
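
The same comparison can be reproduced outside IPython with the standard `timeit` module, which also makes it easy to confirm that both approaches give the same numbers. This is a sketch: the functions are local copies of the two versions above under hypothetical snake_case names, and the data is random.

```python
import timeit

import numpy as np
import pandas as pd

A1, A2, A3, A4 = 0.194086, 0.988276, 0.949346, -3.39036


def gross_total_volume_scalar(ba, top_height):
    """Mirrors GrossTotalVolume_Pl: guarded, scalar-only."""
    if top_height > 0:
        return A1 * ba**A2 * top_height**A3 * np.exp(1 + A4 / (top_height**2 + 1))
    return 0.0


def gross_total_volume_array(ba, top_height):
    """Mirrors GrossTotalVolume_Pl_arr: no guard, works elementwise on arrays."""
    return A1 * ba**A2 * top_height**A3 * np.exp(1 + A4 / (top_height**2 + 1))


rng = np.random.default_rng(0)
d = pd.DataFrame({"ba": rng.random(1000) * 100, "th": rng.random(1000) * 100})


def via_apply():
    # One Python call per row, pulling scalars out with .at
    return d.apply(lambda row: gross_total_volume_scalar(row.at["ba"], row.at["th"]), axis=1)


def via_arrays():
    # One call on whole columns
    return gross_total_volume_array(d["ba"].to_numpy(), d["th"].to_numpy())


if __name__ == "__main__":
    n = 20
    t_apply = timeit.timeit(via_apply, number=n) / n
    t_array = timeit.timeit(via_arrays, number=n) / n
    print(f"apply:  {t_apply * 1e3:.2f} ms per call")
    print(f"arrays: {t_array * 1e3:.3f} ms per call")
    print(f"speedup: {t_apply / t_array:.0f}x")
```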


The array method is roughly 190x faster (165 µs vs 31.9 ms for 1000 rows). This is worth implementing. We should also add tests to make the behaviour of these volume functions explicit.
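
A minimal sketch of such tests in pytest style (plain `test_*` functions with `assert`). `gross_total_volume` is a hypothetical local copy of `GrossTotalVolume_Pl_arr`; the reference value 44.1908473047 is the output printed above for inputs (10, 10).

```python
import numpy as np

A1, A2, A3, A4 = 0.194086, 0.988276, 0.949346, -3.39036


def gross_total_volume(ba, top_height):
    """Copy of GrossTotalVolume_Pl_arr for the purposes of these tests."""
    return A1 * ba**A2 * top_height**A3 * np.exp(1 + A4 / (top_height**2 + 1))


def test_reference_value():
    # Value printed by GrossTotalVolume_Pl_arr(10, 10) in this notebook
    assert np.isclose(gross_total_volume(10.0, 10.0), 44.1908473047)


def test_zero_inputs_give_zero_volume():
    assert gross_total_volume(0.0, 10.0) == 0.0
    assert gross_total_volume(10.0, 0.0) == 0.0


def test_array_matches_scalar_loop():
    rng = np.random.default_rng(42)
    ba = rng.random(50) * 100
    th = rng.random(50) * 100
    expected = [gross_total_volume(b, t) for b, t in zip(ba, th)]
    np.testing.assert_allclose(gross_total_volume(ba, th), expected)
```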

# Revise the code

Go on. Do it.

# Tests

Do tests still pass?

# Review code changes

```python
%%bash
git log --since "2016-11-14 19:30" --oneline # 19:30 GMT/UTC
```

```python
! git diff "HEAD~$(git log --since "2016-11-14 19:30" --oneline | wc -l)" ../gypsy
```

# Run timings

From last time:

```
real	6m16.021s
user	5m59.620s
sys  	0m2.030s
```

After cython'ing iter functions:

```python
%%bash
# git checkout 4c978aff110001efdc917ed60cb611139e1b54c9 -b remove-getitem-redundancy
# time gypsy simulate ../private-data/prepped_random_sample_300.csv --output-dir tmp
# rm -rfd tmp

# real	5m36.407s
# user	5m25.740s
# sys	0m2.140s
```

???
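
Putting the two wall-clock (`real`) times side by side: 6m16.021s before and 5m36.407s after is roughly a 1.12x speedup, i.e. about 10.5% less wall-clock time.

```python
def to_seconds(minutes, seconds):
    """Convert a `time` reading such as 6m16.021s to seconds."""
    return minutes * 60 + seconds


before = to_seconds(6, 16.021)  # real time from last time
after = to_seconds(5, 36.407)   # real time after cython'ing the iter functions
speedup = before / after
saved = 1 - after / before
print(f"{speedup:.2f}x faster, {saved:.1%} less wall-clock time")
```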

# Run profiling

```python
from gypsy.forward_simulation import simulate_forwards_df
```

```python
data = pd.read_csv('../private-data/prepped_random_sample_300.csv', index_col=0, nrows=10)
```

```python
%%prun -D forward-sim-2.prof -T forward-sim-2.txt -q
result = simulate_forwards_df(data)
```

```
*** Profile stats marshalled to file u'forward-sim-2.prof'. 

*** Profile printout saved to text file u'forward-sim-2.txt'. 
```

```python
!head forward-sim-2.txt
```

```
         4167146 function calls (4034230 primitive calls) in 36.370 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   492069    7.050    0.000    7.050    0.000 GYPSYNonSpatial.py:428(BasalAreaIncrementNonSpatialAw)
     7191    2.481    0.000    9.716    0.001 GYPSYNonSpatial.py:956(BAfromZeroToDataAw)
    90290    2.419    0.000    8.879    0.000 base.py:2116(get_value)
    90280    1.686    0.000   17.526    0.000 indexing.py:1654(__getitem__)
   369310    1.517    0.000    2.398    0.000 {isinstance}
```
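
The same kind of profile can be collected and inspected without IPython magics, using the standard library's `cProfile` and `pstats`. `%%prun -D` saves its stats via `dump_stats`, the same binary format that `pyprof2calltree` and `snakeviz` read. The toy functions and the `toy-sim.prof` filename below are hypothetical stand-ins for the simulation:

```python
import cProfile
import io
import pstats


def slow_increment(x):
    """Hypothetical stand-in for a small, frequently called increment function."""
    return x * 0.5 + 1.0


def simulate(n_steps):
    x = 1.0
    for _ in range(n_steps):
        x = slow_increment(x)
    return x


profiler = cProfile.Profile()
profiler.enable()
simulate(100_000)
profiler.disable()
profiler.dump_stats("toy-sim.prof")  # same format as %%prun -D

# Sort by internal time, like the "Ordered by: internal time" printout above
stream = io.StringIO()
stats = pstats.Stats("toy-sim.prof", stream=stream)
stats.sort_stats("tottime").print_stats(5)
print(stream.getvalue())
```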


# Compare performance visualizations

Now use either of these commands to visualize the profiling results:

```
pyprof2calltree -k -i forward-sim-1.prof forward-sim-1.txt

# or

dc run --service-ports snakeviz notebooks/forward-sim-1.prof
```

### Old

![definitive reference profile screenshot](forward-sim-1-performance-icicle.png)

### New

![2nd iteration performance](forward-sim-2a-performance.png)

## Summary of performance improvements

`forward_simulation` is now 2x faster than in the last iteration, and 8x faster in total, due to the changes outlined in the code review section above.

On my hardware, 1000 plots take ~4 minutes.

On Carol's hardware, 1000 plots take ~13 minutes.

For 1 million plots, we're looking at roughly 3 to 9 days on desktop hardware.
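
The 1 million plot estimate is a linear extrapolation from the 1000-plot timings, assuming a single process and that run time scales linearly with the number of plots: 4 minutes per 1000 plots gives about 2.8 days, and 13 minutes gives about 9.0 days.

```python
def days_for(n_plots, minutes_per_1000_plots):
    """Wall-clock days for n_plots, assuming run time scales linearly."""
    total_minutes = minutes_per_1000_plots * n_plots / 1000
    return total_minutes / (60 * 24)


for machine, minutes in {"my hardware": 4, "Carol's hardware": 13}.items():
    print(f"{machine}: {days_for(1_000_000, minutes):.1f} days")
```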



# Profile with I/O


```python
! rm -rfd gypsy-output
```

```python
output_dir = 'gypsy-output'
```

```python
%%prun -D forward-sim-2.prof -T forward-sim-2.txt -q
# restart the kernel first
import os

import pandas as pd
from gypsy.forward_simulation import simulate_forwards_df

data = pd.read_csv('../private-data/prepped_random_sample_300.csv', index_col=0, nrows=10)
result = simulate_forwards_df(data)
os.makedirs(output_dir)
for plot_id, df in result.items():
    filename = '%s.csv' % plot_id
    output_path = os.path.join(output_dir, filename)
    df.to_csv(output_path)
```

```
*** Profile stats marshalled to file u'forward-sim-1.prof'. 

*** Profile printout saved to text file u'forward-sim-1.txt'. 
```


# Identify new areas to optimize



- from last time:
    - parallel (3 cores) gets us to 2 - 6 days; save for last
    - AWS with 36 cores gets us to 4 - 12 hours ($6.70 - $20.10 USD on a c4.8xlarge instance in the US West region)
    - AWS Lambda, splitting up the data
- now:
    - getting items in `apply` is still slow; vectorize the functions
    - Cython for the increment functions, especially `BasalAreaIncrementNonSpatialAw`
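
For the parallel option, one possible shape is to split the plot table into one block per core and run each block in its own process with `concurrent.futures.ProcessPoolExecutor`. This is a sketch only: `simulate_chunk` is a hypothetical stand-in for `simulate_forwards_df` that, like the `result.items()` loop above suggests, returns a dict keyed by plot id, so the per-process results can simply be merged.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd


def split_plots(df, n_chunks):
    """Split the plot table into n_chunks row blocks of near-equal size."""
    positions = np.array_split(np.arange(len(df)), n_chunks)
    return [df.iloc[pos] for pos in positions]


def simulate_chunk(chunk):
    """Hypothetical stand-in for simulate_forwards_df on one block of plots."""
    return {plot_id: row.sum() for plot_id, row in chunk.iterrows()}


def simulate_parallel(df, n_workers=3):
    """Run simulate_chunk on each block in a separate process and merge results."""
    results = {}
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(simulate_chunk, split_plots(df, n_workers)):
            results.update(partial)
    return results


if __name__ == "__main__":
    plots = pd.DataFrame(np.random.random((10, 3)), columns=["a", "b", "c"])
    print(len(simulate_parallel(plots)))  # one result per plot
```

The `__main__` guard matters here: on platforms that start worker processes by spawning, each worker re-imports the module.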