Initial attempts at profiling gave very confusing results, possibly because module loading and I/O were included in the measurements.

Here, gypsy is run and profiled on one plot, with module loading and I/O excluded from the profile.

# Characterize what is happening

In several places, we append rows to a data frame inside a loop:

In [19]:
%%bash
grep --colour -nr append ../gypsy/*.py

../gypsy/GYPSYNonSpatial.py:1028:        densities_along_time.append({'N_bh_AwT': N_bh_AwT, 'N_bh_SwT': N_bh_SwT, 'N_bh_SbT': N_bh_SbT, 'N_bh_PlT': N_bh_PlT,
../gypsy/GYPSYNonSpatial.py:1163:            BA_Aw_DF = BA_Aw_DF.append({'BA_Aw':BA_AwB}, ignore_index=True)
../gypsy/GYPSYNonSpatial.py:1287:            BA_Sb_DF = BA_Sb_DF.append({'BA_Sb': BA_SbB}, ignore_index=True)
../gypsy/GYPSYNonSpatial.py:1422:            BA_Sw_DF = BA_Sw_DF.append({'BA_Sw': BA_SwB}, ignore_index=True)
../gypsy/GYPSYNonSpatial.py:1673:            BA_Pl_DF = BA_Pl_DF.append({'BA_Pl': BA_PlB}, ignore_index=True)
../gypsy/forward_simulation.py:367:            output_DF_Sw = output_DF_Sw.append({'BA_Sw':BA_SwT}, ignore_index=True)
../gypsy/forward_simulation.py:368:            output_DF_Sb = output_DF_Sb.append({'BA_Sb':BA_SbT}, ignore_index=True)
../gypsy/forward_simulation.py:369:            output_DF_Pl = output_DF_Pl.append({'BA_Pl':BA_PlT}, ignore_index=True)


Appending is slow, either because of the way we do it or by its nature.

In [21]:
import pandas as pd

In [23]:
help(pd.DataFrame.append)

Help on method append in module pandas.core.frame:

append(self, other, ignore_index=False, verify_integrity=False) unbound pandas.core.frame.DataFrame method
    Append rows of `other` to the end of this frame, returning a new
    object. Columns not in this frame are added as new columns.
    
    Parameters
    ----------
    other : DataFrame or Series/dict-like object, or list of these
        The data to append.
    ignore_index : boolean, default False
        If True, do not use the index labels.
    verify_integrity : boolean, default False
        If True, raise ValueError on creating index with duplicates.
    
    Returns
    -------
    appended : DataFrame
    
    Notes
    -----
    If a list of dict/series is passed and the keys are all contained in
    the DataFrame's index, the order of the columns in the resulting
    DataFrame will be unchanged.
    
    See also
    --------
    pandas.concat : General function to concatenate DataFrame, Series
        or Panel obj

The documentation says nothing clear about performance. It may be worth examining the source and, of course, searching for discussions of append performance:

python - Improve Row Append Performance On Pandas DataFrames - Stack Overflow  
http://stackoverflow.com/questions/27929472/improve-row-append-performance-on-pandas-dataframes

python - Pandas: Why should appending to a dataframe of floats and ints be slower than if its full of NaN - Stack Overflow  
http://stackoverflow.com/questions/17141828/pandas-why-should-appending-to-a-dataframe-of-floats-and-ints-be-slower-than-if

python - Creating large Pandas DataFrames: preallocation vs append vs concat - Stack Overflow  
http://stackoverflow.com/questions/31690076/creating-large-pandas-dataframes-preallocation-vs-append-vs-concat

python - efficient appending to pandas dataframes - Stack Overflow  
http://stackoverflow.com/questions/32746248/efficient-appending-to-pandas-dataframes

python - Pandas append perfomance concat/append using "larger" DataFrames - Stack Overflow  
http://stackoverflow.com/questions/31860671/pandas-append-perfomance-concat-append-using-larger-dataframes

pandas.DataFrame.append — pandas 0.18.1 documentation  
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html

# Decide on the action

Do not append in a loop. Each call returns a new copy of the frame, so memory is reallocated on every iteration. Should have known; it's interesting to see it demonstrated in the wild!

Instead, pre-allocate the data frame by giving it an index of the final length, then assign each value to its row by index label.

# MWE

In [29]:
%%timeit
d = pd.DataFrame(columns=['A'])
for i in xrange(1000):
    d.append({'A': i}, ignore_index=True)

1 loop, best of 3: 1.39 s per loop


In [30]:
%%timeit
d = pd.DataFrame(columns=['A'], index=xrange(1000))
for i in xrange(1000):
    d.loc[i,'A'] = i

1 loop, best of 3: 150 ms per loop


In [31]:
1.39/.150

9.266666666666666

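A speedup of nearly one order of magnitude (about 9x). Note that the first cell discards the result of `append`, so `d` never grows; reassigning `d = d.append(...)` as the real code does would likely be slower still, since each copy gets larger.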
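The cells above use Python 2 (`xrange`) and `DataFrame.append`, which was removed in pandas 2.0. As a rough sketch for newer environments, the pre-allocation approach can be written as below, alongside a different technique that is not used in this notebook: accumulating plain dicts in a list and building the frame once. The function names are made up for illustration.

```python
import pandas as pd


def build_by_preallocating(n):
    # Allocate all n rows up front, then fill them in place by index label.
    frame = pd.DataFrame(columns=['A'], index=range(n))
    for i in range(n):
        frame.loc[i, 'A'] = i
    return frame


def build_from_list(n):
    # Alternative (not the notebook's method): collect rows as dicts
    # and construct the frame a single time at the end.
    rows = [{'A': i} for i in range(n)]
    return pd.DataFrame(rows, columns=['A'])


preallocated = build_by_preallocating(1000)
from_list = build_from_list(1000)
```

Both produce 1000 rows holding the same values; which is faster is worth timing with `%%timeit` on the actual workload rather than assuming.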

# Review code changes

In [36]:
! git diff HEAD~~ ../gypsy

diff --git a/gypsy/GYPSYNonSpatial.py b/gypsy/GYPSYNonSpatial.py
index 9ece9bc..2bae1d6 100644
--- a/gypsy/GYPSYNonSpatial.py
+++ b/gypsy/GYPSYNonSpatial.py
@@ -981,11 +981,10 @@ def BAfromZeroToDataAw(startTage, SI_bh_Aw, N0_Aw, BA_Aw0, SDF_Aw0, f_Aw,
     elif simulation_choice == 'no':
         max_age = 250
 
-    BA_Aw_DF = pd.DataFrame(columns=['BA_Aw'])
-    t = 0
+    basal_area_aw_df = pd.DataFrame(columns=['BA_Aw'], index=xrange(max_age))
     BA_tempAw = BA_Aw0
 
-    for SC_Dict in densities[0: max_age]:
+    for SC_Dict, i in enumerate(densities[0: max_age]):
         bhage_Aw = SC_Dict['bhage_Aw']
         SC_Aw = SC_Dict['SC_Aw']
         N_bh_AwT = SC_Dict['N_bh_AwT']
@@ -1008,11 +1007,9 @@ def BAfromZeroToDataAw(startTage, SI_bh_Aw, N0_Aw, BA_Aw0, SDF_Aw0, f_Aw,
             BA_AwB = 0
 
         if simulatio
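
One thing to check in the diff above: `enumerate` yields `(index, item)` pairs, so a loop written as `for SC_Dict, i in enumerate(...)` would bind `SC_Dict` to the integer index and `i` to the dict. A minimal illustration, using made-up stand-in data rather than real gypsy densities:

```python
# Hypothetical stand-in for the `densities` list of dicts.
densities = [{'SC_Aw': 0.4}, {'SC_Aw': 0.5}]

# enumerate yields the index first, then the item.
pairs = list(enumerate(densities))

collected = []
for i, sc_dict in enumerate(densities):
    # With the names in this order, sc_dict is the dict and i is its position.
    collected.append((i, sc_dict['SC_Aw']))
```

If the committed code really has the names in the order shown in the diff, the subsequent `SC_Dict['bhage_Aw']` lookup would fail on an integer, so the order is worth confirming against the repository.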

# Tests

There are some issues with the tests: the new output does not match the old output data to within 3, or even 2, decimal places. The mismatch is always:

`(mismatch 0.221052631579%)`
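
The message format looks like the one printed by NumPy's array comparison helpers, which report the percentage of elements that differ. To find out which cells differ, something like the following sketch may help. The arrays here are invented for illustration, and the absolute tolerance of `1e-3` is only an approximation of "3 decimal places", not the exact rule the tests apply.

```python
import numpy as np

# Hypothetical old and new outputs; real data would come from the test fixtures.
expected = np.array([[1.0, 2.0], [3.0, 4.0]])
actual = np.array([[1.0, 2.0], [3.0, 4.05]])

# True where the values agree to within an absolute tolerance of 1e-3.
close = np.isclose(actual, expected, rtol=0, atol=1e-3)

# Row/column positions of the elements that disagree.
mismatch_positions = np.argwhere(~close)

# Share of mismatching elements, as a percentage.
mismatch_percent = 100.0 * (~close).mean()
```

A mismatch percentage that never changes would be consistent with the same small set of cells differing on every run, which the positions above would reveal.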



# Run profiling

In [22]:
from gypsy.forward_simulation import simulate_forwards_df

In [3]:
data = pd.read_csv('../private-data/prepped_random_sample_300.csv', index_col=0, nrows=10)

In [None]:
%%prun -D forward-sim-1.prof -T forward-sim-1.txt -q
result = simulate_forwards_df(data)

In [1]:
!head forward-sim-1.txt

head: cannot open ‘forward-sim.txt’ for reading: No such file or directory
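
If the text dump is not available, the binary `.prof` file written by `%%prun -D` can also be read with the standard library's `pstats` module. The sketch below profiles a toy function (a stand-in, not `simulate_forwards_df`) so that it is self-contained; with the real file, only the `pstats.Stats(...)` part would be needed.

```python
import cProfile
import io
import os
import pstats
import tempfile


def toy_workload():
    # Hypothetical stand-in for the real simulation call.
    return sum(i * i for i in range(10000))


# Profile only the call itself, so imports and file reads are not recorded.
prof_path = os.path.join(tempfile.mkdtemp(), 'toy.prof')
profiler = cProfile.Profile()
profiler.runcall(toy_workload)
profiler.dump_stats(prof_path)

# Load the dump and print the ten entries with the largest cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(prof_path, stream=buffer)
stats.strip_dirs().sort_stats('cumulative').print_stats(10)
report = buffer.getvalue()
print(report)
```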


# Compare performance visualizations

Now use either of these commands to visualize the profiling results:

```
pyprof2calltree -k -i forward-sim-1.prof forward-sim-1.txt

# or

dc run --service-ports snakeviz notebooks/forward-sim-1.prof
```

![definitive reference profile screenshot](profile-forwards-sim.png)

![profile 1 screenshot](profile-forwards-sim-1.png)

# Identify new areas to optimize



# Identify some means of optimization

- do not ignore index in append operations
- use Cython for the factor finder / zero-to-data functions
- cache numeric functions
- ???
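
For the caching idea, one possible sketch uses `functools.lru_cache` (Python 3.2+, so not available in the Python 2 environment used above without a backport). The function below is invented for illustration and is not taken from gypsy; the approach assumes the cached function is pure and that its arguments are hashable.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def growth_factor(age, site_index):
    # Hypothetical pure numeric function: same inputs always give the same output.
    return site_index * (1.0 - 0.5 ** (age / 10.0))


first = growth_factor(50, 12.5)   # computed
second = growth_factor(50, 12.5)  # served from the cache
info = growth_factor.cache_info()
```

Cache keys compare float arguments exactly, so this only pays off if the same argument values recur; whether they do in the factor finder is something the profile would need to confirm.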