## Demo of faster_df_apply

`pandas.DataFrame.apply` is a lot slower than it should be. See the discussion [here](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code).

This notebook demonstrates the `faster_df_apply` function, which is roughly 4x faster than a row-wise `DataFrame.apply` in the timings below.

In [1]:
import ticdat
import time
import pandas as pd

In [2]:
pandat = ticdat.PanDatFactory(table_data = [[], ["SoldCount", "StrengthFactor", "ItemCount"]])

In [3]:
pandat.set_data_type("table_data", "SoldCount", min=0, max=50, inclusive_min=True, 
                     inclusive_max=True, nullable=False)

In [4]:
dat = pandat.csv.create_pan_dat("pandat_data")

In [5]:
dat.table_data["SoldCount"].max(), dat.table_data["SoldCount"].min()

(73.0, 0.0)

In [6]:
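def faster_df_apply(df, func):
    """Apply func to each row of df, passing the row as a plain dict.

    Returns a pd.Series of the results, indexed like df.
    """
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        # row[0] is the index label; row[1:] holds the column values in order
        row_dict = {f:v for f,v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)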

In [7]:
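def bad_row(row):
    """Return True if the row's SoldCount is outside [0, 50] or not comparable."""
    val = row["SoldCount"]
    try:
        if 0 <= val <= 50:
            return False
    except TypeError:
        # non-numeric values (e.g. strings, None) cannot be compared to ints
        return True
    # out-of-range values, including NaN, end up here
    return True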

Here are three different ways to find the rows with a bad "SoldCount", each with a very different run time.
 * Using the `bad_row` function with `pandas.DataFrame.apply`.
 * Using the `bad_row` function with `faster_df_apply`.
 * Using `PanDatFactory.find_data_type_failures`.
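Before comparing run times, it is worth checking that the first two approaches return the same answer. The following is a minimal sketch, not part of the original notebook. The `toy` frame and its values are made up for illustration and are not taken from `pandat_data`. Both functions are restated so the block runs on its own.

```python
import pandas as pd

def faster_df_apply(df, func):
    # Same approach as the notebook: iterate with itertuples, pass dicts to func
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)

def bad_row(row):
    val = row["SoldCount"]
    try:
        if 0 <= val <= 50:
            return False
    except TypeError:
        return True
    return True

# Hypothetical data: two valid values, one too large, one negative, one NaN
toy = pd.DataFrame({"SoldCount": [0, 25, 73, -1, float("nan")],
                    "ItemCount": [1, 2, 3, 4, 5]})

slow = toy.apply(bad_row, axis=1)      # row is passed as a pd.Series
fast = faster_df_apply(toy, bad_row)   # row is passed as a plain dict

# Both approaches should flag the same rows under the same index labels
agree = slow.tolist() == fast.tolist() and list(slow.index) == list(fast.index)
print(agree)
```

Because `bad_row` only uses `row["SoldCount"]`, it works unchanged whether the row arrives as a `Series` or as a `dict`. A function that relies on `Series`-specific methods would need adapting before use with `faster_df_apply`.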

In [8]:
%timeit dat.table_data.apply(bad_row, axis=1)

1.92 s ± 29.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
%timeit faster_df_apply(dat.table_data, bad_row)

480 ms ± 5.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
%timeit pandat.find_data_type_failures(dat)

806 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Conclusion
The default implementation of `apply` appears to be slow for silly reasons. It can be significantly improved by a more thoughtful pure Python rewrite. With this commit, we have woven `faster_df_apply` into `ticdat`, thus addressing issue [7](https://github.com/ticdat/ticdat/issues/7). That said, `PanDatFactory.find_data_type_failures` uses a slower version of `bad_row`, and is thus somewhat slower (806 ms) than the `faster_df_apply(dat.table_data, bad_row)` call (480 ms). There may or may not be room for further run-time improvement, but we are closing issue 7 for now.

In [11]:
import cProfile

In [12]:
#cProfile.run("pandat.find_data_type_failures(dat)", sort=2)

In [13]:
#cProfile.run("faster_df_apply(dat.table_data, bad_row)", sort=2)
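The commented-out profiling cells can be approximated without the `pandat_data` files. The sketch below is an illustration under assumptions: the `toy` frame and the lambda check are hypothetical stand-ins for `dat.table_data` and `bad_row`. It sorts by cumulative time, which is what the legacy numeric value `sort=2` means in `pstats`.

```python
import cProfile
import io
import pstats

import pandas as pd

def faster_df_apply(df, func):
    # Restated from the notebook so this block is self-contained
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)

# Hypothetical stand-in for dat.table_data: SoldCount values 0..999
toy = pd.DataFrame({"SoldCount": range(1000)})

profiler = cProfile.Profile()
profiler.enable()
result = faster_df_apply(toy, lambda row: not (0 <= row["SoldCount"] <= 50))
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 10 entries
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

Reading the top entries by cumulative time shows where the row loop spends its time. The same pattern applies to `pandat.find_data_type_failures(dat)` once the real data is loaded.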