# Recap

In order of priority/time taken

1. pandas init dict
    - `basal_area_aw_df = pd.DataFrame(columns=['BA_Aw'], index=xrange(max_age))`
    - find a faster way to create this data frame
    - relax the tolerance for aspen
2. pandas set item
    - use the `.at` accessor
    - http://pandas.pydata.org/pandas-docs/stable/indexing.html#fast-scalar-value-getting-and-setting
3. lambdas
    - use Cython for the gross total volume and merchantable volume functions
    - it might be wise to refactor these first to have conventional names, keyword arguments, and a base implementation to get rid of the boilerplate
    - don't be deceived: the callable is a minuscule portion; `Series.__getitem__` is taking most of the time
    - again, using `.at` here would probably be a significant improvement
4. `BasalAreaIncrementNonSpatialAw`
    - this is slow mainly because of the number of times `BAfromZeroToDataAw`, which calls it, is itself called
    - relaxing the tolerance may help
    - indeed, the Aw tolerance seems to be 0.01 * some value, while the other factor finder functions use 0.1 (to be confirmed)
    - Cython could also be used for the increment functions

In the next run, profile with I/O included (reading the input data and writing the plot curves to files).


# Characterize what is happening

Indexing with `df[]` or `series[]` is slow for scalars (lambdas, pandas set item).
`BasalAreaIncrementNonSpatialAw` runs a lot for Aw; use the same tolerance as for the other species.

The merchantable volume, increment, and gross volume functions are pure Python, so Cython would be effective.
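Why the tolerance matters for call counts can be sketched with a toy factor finder. This is a hypothetical bisection loop, not GYPSY's actual `BAfactorFinder_Aw` algorithm; `find_factor` and `toy_simulation` are illustrative names, and each `simulate()` call stands in for one run of a `BAfromZeroToData`-style simulation.

```python
def find_factor(simulate, target, lo, hi, rel_tol, max_calls=100):
    """Hypothetical bisection factor finder (not GYPSY's algorithm).

    Stops once simulate(factor) is within rel_tol * target of target.
    Returns the factor and how many simulations it took.
    """
    calls = 0
    while calls < max_calls:
        mid = (lo + hi) / 2.0
        value = simulate(mid)  # one full simulation per iteration
        calls += 1
        if abs(value - target) <= rel_tol * target:
            break
        if value < target:
            lo = mid  # simulated value too low: try larger factors
        else:
            hi = mid  # simulated value too high: try smaller factors
    return mid, calls


def toy_simulation(factor):
    """Toy monotonic model: final basal area proportional to the factor."""
    return 30.0 * factor


_, calls_loose = find_factor(toy_simulation, target=20.0, lo=0.0, hi=2.0, rel_tol=0.1)
_, calls_tight = find_factor(toy_simulation, target=20.0, lo=0.0, hi=2.0, rel_tol=0.01)
print(calls_loose, calls_tight)
```

In this toy case the 0.1 tolerance converges after 4 simulations and the 0.01 tolerance after 7; every extra iteration in the real code is another full zero-to-data run.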

# Decide on the action

- use the same tolerance for Aw as for the other species
- use `.at` instead of `[]` or `.ix`? Compare these in an MWE
- creating the data frame is slow, maybe because it is built from a dict; see if this can be improved

# MWEs

In [6]:
import pandas as pd
import numpy as np

## Init from dict and `xrange` index vs. from something else

### Timings

In [8]:
%%timeit
d = pd.DataFrame(columns=['A'], index=xrange(1000))

The slowest run took 38.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 368 µs per loop


In [9]:
%%timeit
d = pd.DataFrame(columns=['A'], index=xrange(1000), dtype='float')

The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 418 µs per loop


In [13]:
%%timeit
d = pd.DataFrame({'A': np.zeros(1000)})

The slowest run took 5.08 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 275 µs per loop


The problem here is that `DataFrame.__init__` is being called about 7000 times because of the Aw BA factor finder.

Maybe it's not worth using a data frame here. Use a list or numpy array instead, and convert to a data frame once the factor is found, e.g.:

In [14]:
%%timeit
for _ in xrange(5000):
    d = pd.DataFrame(columns=['A'], index=xrange(1000))

1 loop, best of 3: 2.13 s per loop


In [17]:
%%timeit
for _ in xrange(5000):
    d = np.zeros(1000)

100 loops, best of 3: 14.1 ms per loop
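The "numpy first, data frame once" idea can be sketched as below. It assumes the per-age loop can be written as a recurrence over ages; `accumulate_basal_area` and `increment_fn` are hypothetical names, not GYPSY's API, and the constant increment is purely illustrative.

```python
import numpy as np
import pandas as pd


def accumulate_basal_area(max_age, increment_fn):
    """Fill a preallocated numpy array age by age (hypothetical helper).

    increment_fn(age, ba_previous) stands in for the species increment
    function; its signature is an assumption.
    """
    ba = np.zeros(max_age)  # cheap allocation, no index/column machinery
    for age in range(1, max_age):
        ba[age] = ba[age - 1] + increment_fn(age, ba[age - 1])
    return ba


# Hypothetical constant increment of 1.0 per year, for illustration only
ba = accumulate_basal_area(5, lambda age, ba_prev: 1.0)

# Build the DataFrame once, after the loop, instead of on every call
basal_area_aw_df = pd.DataFrame({'BA_Aw': ba})
```

The data frame gets the default `0..max_age-1` integer index, which matches the `xrange(max_age)` index used in the original constructor.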


### Review the code to see how this can be applied

The numpy/pure-Python approach has potential.

But there are a couple of issues for which the code must be examined.

The problem comes from the following call chain:

`simulate_forwards_df` (called 1x) ->  
`get_factors_for_all_species` (called 10x, 1x per plot) ->  
`BAfactorFinder_Aw` (called 2x, 1x per plot that has Aw) ->  
`BAfromZeroToDataAw` (called 7191 times, mostly from this chain) ->  
`DataFrame.__init__` (called 7932 times, mostly from this chain) ...  

Why does `BAfromZeroToDataAw` create a data frame at all? It helps to look at the code:

First, `simulate_forwards_df` calls `get_factors_for_all_species`, and then calls `BAfromZeroToDataAw` with some parameters and a simulation choice of false.

Note that the list is only created when `simulation == False`; otherwise it is left empty.

Note also that `simulation_choice` defaults to `True` in forward simulation, i.e. when the `BAfromZeroToData__` functions are called from forward simulation.

`get_factors_for_all_species` calls the factor finder function for each species that is present, and returns a dict of the factors.

`BAfactorFinder_Aw` is the main suspect: for some reason aspen has a harder time converging, so the loop in this function runs many times.


It calls `BAfromZeroToDataAw` with `simulation_choice` of `'yes'` and `simulation=True`, **BUT IT ONLY USES THE FIRST RETURN VALUE**.
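Since the factor finder only needs the first return value, one option is to let the caller opt out of building the per-age series entirely. A minimal sketch under that assumption; `ba_from_zero_to_data`, `increment_fn`, and `keep_series` are hypothetical names, not the real `BAfromZeroToDataAw` signature.

```python
import numpy as np


def ba_from_zero_to_data(max_age, increment_fn, keep_series=False):
    """Hypothetical stand-in for BAfromZeroToDataAw.

    Returns (final_ba, per_age_array). The per-age array is only
    allocated when keep_series=True, so a caller that only uses the
    first return value skips that work.
    """
    ba = 0.0
    series = np.zeros(max_age) if keep_series else None
    for age in range(1, max_age):
        ba += increment_fn(age, ba)
        if keep_series:
            series[age] = ba
    return ba, series


# Factor-finder style call: only the scalar is needed
final_ba_only, no_series = ba_from_zero_to_data(5, lambda age, ba: 1.0)

# Forward-simulation style call: keep the whole trajectory
final_ba_full, trajectory = ba_from_zero_to_data(5, lambda age, ba: 1.0, keep_series=True)
```

The scalar path does no array or data frame allocation, which is the cost the profile attributes to this call chain.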

## slow lambdas

**The section below is left here for the record, but the time is actually spent in `__getitem__`, not so much in the applied callables, and that is an easy fix.**

With the data frame init replaced by a numpy array, the next suspect is the lambdas. The usual way to optimize them is Cython, but the functions themselves can also be examined for opportunities.

They are pretty basic: everything is a float.

``` python
def MerchantableVolumeAw(N_bh_Aw, BA_Aw, topHeight_Aw, StumpDOB_Aw,
                         StumpHeight_Aw, TopDib_Aw, Tvol_Aw):
    # ...
    if N_bh_Aw > 0:
        k_Aw = (BA_Aw * 10000.0 / N_bh_Aw)**0.5
    else:
        k_Aw = 0

    if k_Aw > 0 and topHeight_Aw > 0:
        b0 = 0.993673
        b1 = 923.5825
        b2 = -3.96171
        b3 = 3.366144
        b4 = 0.316236
        b5 = 0.968953
        b6 = -1.61247
        k1 = Tvol_Aw * (k_Aw**b0)
        k2 = (b1* (topHeight_Aw**b2) * (StumpDOB_Aw**b3) * (StumpHeight_Aw**b4) * (TopDib_Aw**b5)  * (k_Aw**b6)) + k_Aw
        MVol_Aw = k1/k2
    else:
        MVol_Aw = 0

    return MVol_Aw

```

``` python
import numpy

def GrossTotalVolume_Aw(BA_Aw, topHeight_Aw):
    # ...
    Tvol_Aw = 0

    if topHeight_Aw > 0:
        a1 = 0.248718
        a2 = 0.98568
        a3 = 0.857278
        a4 = -24.9961
        Tvol_Aw = a1 * (BA_Aw**a2) * (topHeight_Aw**a3) * numpy.exp(1+(a4/((topHeight_Aw**2)+1)))

    return Tvol_Aw
```
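A different technique from the Cython plan, numpy vectorization, avoids the per-row `apply` and its `__getitem__` calls altogether by evaluating the same formula on whole columns. A sketch for `GrossTotalVolume_Aw`, with the coefficients copied from the function above; `gross_total_volume_aw_vec` is a hypothetical name.

```python
import numpy as np


def gross_total_volume_aw_vec(ba_aw, top_height_aw):
    """Vectorized GrossTotalVolume_Aw: same formula, whole arrays at once."""
    a1, a2, a3, a4 = 0.248718, 0.98568, 0.857278, -24.9961
    ba = np.asarray(ba_aw, dtype=float)
    h = np.asarray(top_height_aw, dtype=float)
    # Replace non-positive heights before the fractional power so no NaN
    # warnings appear; those rows are zeroed below anyway.
    h_pos = np.where(h > 0, h, 0.0)
    vol = a1 * ba**a2 * h_pos**a3 * np.exp(1 + a4 / (h_pos**2 + 1))
    # Same rule as the scalar version: volume is 0 unless topHeight > 0
    return np.where(h > 0, vol, 0.0)


ba = np.array([10.0, 20.0, 30.0])
top_height = np.array([0.0, 15.0, -1.0])
tvol = gross_total_volume_aw_vec(ba, top_height)
```

With a data frame this would be called once per column pair, e.g. `gross_total_volume_aw_vec(df['BA_Aw'], df['topHeight_Aw'])`, where those column names are assumptions.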

### Timings for getitem

There are a few ways to get an item from a series:

In [23]:
d = pd.Series(np.random.randint(0, 100, size=100), index=['%d' % i for i in xrange(100)])

In [25]:
%%timeit
d['1']

The slowest run took 14.05 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.4 µs per loop


In [26]:
%%timeit 
d.at('1')

The slowest run took 37.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.42 µs per loop


In [27]:
%%timeit
d.loc('1')

The slowest run took 33.90 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.41 µs per loop


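Caution: `d.at('1')` and `d.loc('1')` *call* the indexer instead of subscripting it, so they return an indexer object rather than the value (the `TypeError` further down shows exactly this for `.at`). These two timings therefore don't measure a lookup and don't show that `loc` or `at` beat `[]`. The correct syntax is `d.at['1']` and `d.loc['1']`; the pandas docs recommend `.at` for fast scalar access, but it should be re-timed with the correct syntax.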
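A sketch of re-timing the three lookups with subscript syntax, using the standard `timeit` module instead of the `%%timeit` magic so it runs as a plain script. No timings are quoted here because they depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

d = pd.Series(np.random.randint(0, 100, size=100),
              index=['%d' % i for i in range(100)])

# Subscripted with [], all three forms return the same scalar
assert d['1'] == d.at['1'] == d.loc['1']

n = 10000  # lookups per form
timings = {
    'getitem': timeit.timeit(lambda: d['1'], number=n),
    'at': timeit.timeit(lambda: d.at['1'], number=n),
    'loc': timeit.timeit(lambda: d.loc['1'], number=n),
}
for name, total in sorted(timings.items()):
    print('%-8s %.3f us per lookup' % (name, total / n * 1e6))
```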

While revising the code, an issue came up:

In [35]:
df = pd.DataFrame(np.random.randint(0,100,size=(4, 100)), columns=['%d' %i for i in range(100)])
df.apply(np.mean, axis=1)


0    52.78
1    47.75
2    49.06
3    50.29
dtype: float64

In [37]:
df.apply(lambda x: x['1'] + x['2'], axis=1)

0    83
1    99
2    84
3    68
dtype: int64

In [39]:
df.apply(lambda x: x.at('1') + x.at('2'), axis=1)

TypeError: ("unsupported operand type(s) for +: '_AtIndexer' and '_AtIndexer'", u'occurred at index 0')
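The error comes from calling `x.at('1')` instead of subscripting it. A sketch of the corrected `apply`, alongside plain column arithmetic (vectorization, a different technique that skips `apply`); whether the real lambdas can be rewritten column-wise like this is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(4, 100)),
                  columns=['%d' % i for i in range(100)])

# .at is an indexer object: subscript it with [], don't call it with ()
row_wise = df.apply(lambda x: x.at['1'] + x.at['2'], axis=1)

# Often faster still: add the two columns directly, no per-row apply
vectorized = df['1'] + df['2']
```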

# Revise the code

Go on. Do it.

# Review code changes

In [2]:
%%bash
# git log --since 2016-11-09 --oneline

In [3]:
# ! git diff HEAD~7 ../gypsy

# Tests

Do tests still pass?

# Run timings

In [3]:
%%bash
# git checkout dev
# time gypsy simulate ../private-data/prepped_random_sample_300.csv --output-dir tmp
# rm -rfd tmp

# real	8m18.753s
# user	8m8.980s
# sys	0m1.620s

In [22]:
%%bash
# after factoring dataframe out of zerotodata functions
# git checkout -b df-factored-out-zerotodata da080a79200f50d2dda7942c838b7f3cad845280
# time gypsy simulate ../private-data/prepped_random_sample_300.csv --output-dir tmp
# rm -rfd tmp

# real	5m51.028s
# user	5m40.130s
# sys	0m1.680s

Removing the data frame init cuts the wall-clock (`real`) time by about 30%, from 8m19s to 5m51s.
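The percentage can be checked from the two `real` times reported by `time`:

```python
def to_seconds(minutes, seconds):
    """Convert a `time` report such as 8m18.753s to seconds."""
    return minutes * 60 + seconds


before = to_seconds(8, 18.753)  # dev branch
after = to_seconds(5, 51.028)   # data frame factored out of the zero-to-data functions

reduction = 1 - after / before  # fraction of wall time removed
speedup = before / after

print('reduction: %.1f%%, speedup: %.2fx' % (100 * reduction, speedup))
```

This prints a reduction of 29.6% and a speedup of 1.42x.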

In [22]:
%%bash
# after using a faster indexing method for the arguments put into the apply functions
# git checkout 
# time gypsy simulate ../private-data/prepped_random_sample_300.csv --output-dir tmp
# rm -rfd tmp

# real	5m51.028s
# user	5m40.130s
# sys	0m1.680s

# Run profiling

In [18]:
from gypsy.forward_simulation import simulate_forwards_df



In [19]:
data = pd.read_csv('../private-data/prepped_random_sample_300.csv', index_col=0, nrows=10)

In [20]:
%%prun -D forward-sim-2.prof -T forward-sim-2.txt -q
result = simulate_forwards_df(data)

 
*** Profile stats marshalled to file u'forward-sim-1.prof'. 

*** Profile printout saved to text file u'forward-sim-1.txt'. 


In [21]:
!head forward-sim-2.txt

         10055657 function calls (9875729 primitive calls) in 76.264 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   492069    6.857    0.000    6.857    0.000 GYPSYNonSpatial.py:427(BasalAreaIncrementNonSpatialAw)
  1836602    6.527    0.000    9.190    0.000 {isinstance}
796652/624746    3.102    0.000    4.823    0.000 {len}
     7191    2.670    0.000   40.459    0.006 GYPSYNonSpatial.py:959(BAfromZeroToDataAw)
   511948    2.020    0.000    3.373    0.000 {getattr}
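Outside IPython, the same two artifacts (binary stats for the visualizers, and a text report sorted by internal time) can be produced with the standard library's `cProfile` and `pstats` modules. A minimal sketch; `work` is a hypothetical stand-in for `simulate_forwards_df(data)` so the snippet runs on its own.

```python
import cProfile
import os
import pstats
import tempfile


def work():
    """Hypothetical stand-in for simulate_forwards_df(data)."""
    return sum(i ** 0.5 for i in range(100000))


profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

# Like %%prun -D: binary stats usable by snakeviz or pyprof2calltree
prof_path = os.path.join(tempfile.mkdtemp(), 'forward-sim.prof')
profiler.dump_stats(prof_path)

# Like %%prun -T followed by `head`: top entries ordered by internal time
stats = pstats.Stats(prof_path)
stats.sort_stats('tottime').print_stats(5)
```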


In [22]:
!diff -y forward-sim-2.txt forward-sim-1.txt

diff: forward-sim.txt: No such file or directory


# Compare performance visualizations

Now use either of these commands to visualize the profile:

```
pyprof2calltree -k -i forward-sim-1.prof forward-sim-1.txt

# or

dc run --service-ports snakeviz notebooks/forward-sim-1.prof
```

### Old

![definitive reference profile screenshot](forward-sim-1-performance.png)

### New

![1st iteration performance](forward-sim-2-performance.png)

## Summary of performance improvements

`forward_simulation` is now 4x faster due to the changes outlined in the code review section above.

On my hardware, 1000 plots take ~8 minutes.

On Carol's hardware, 1000 plots take ~25 minutes.

For 1 million plots, that extrapolates to roughly 5 to 17 days on desktop hardware.
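The extrapolation is linear in the number of plots. A sketch, assuming run time scales with plot count and divides evenly across cores (an idealization that ignores I/O and scheduling overhead):

```python
def projected_days(minutes_per_1000_plots, n_plots=1000000, cores=1):
    """Idealized linear projection of total run time, in days."""
    total_minutes = minutes_per_1000_plots * (n_plots / 1000.0) / cores
    return total_minutes / (60.0 * 24.0)


for label, minutes in [('my hardware', 8), ("Carol's hardware", 25)]:
    print('%s: %.1f days' % (label, projected_days(minutes)))
```

This gives 5.6 and 17.4 days. With `cores=3` or `cores=36` the same idealized model gives roughly the 2 to 6 days and 4 to 12 hours listed under "from last time".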



# Profile with I/O


In [None]:
! rm -rfd gypsy-output

In [None]:
output_dir = 'gypsy-output'

In [20]:
%%prun -D forward-sim-2.prof -T forward-sim-2.txt -q
# restart the kernel first, so the imports below are needed again
import os
import pandas as pd
from gypsy.forward_simulation import simulate_forwards_df
data = pd.read_csv('../private-data/prepped_random_sample_300.csv', index_col=0, nrows=10)
result = simulate_forwards_df(data)
os.makedirs(output_dir)
for plot_id, df in result.items():
    filename = '%s.csv' % plot_id
    output_path = os.path.join(output_dir, filename)
    df.to_csv(output_path)


 
*** Profile stats marshalled to file u'forward-sim-1.prof'. 

*** Profile printout saved to text file u'forward-sim-1.txt'. 


# Identify new areas to optimize



- from last time:
    - parallel (3 cores) gets us to 2 - 6 days - save for last
    - AWS with 36 cores gets us to 4 - 12 hours ($6.70 - $20.10 USD on a c4.8xlarge instance in US West Region)
- now:
    - 

# Identify some means of optimization

In order of priority/time taken

1.
2.