The data used here is not being archived for now. Even so, this exercise points to a fairly easy way to improve ticdat.

In [1]:
import ticdat
import time
import pandas as pd

In [2]:
pandat = ticdat.PanDatFactory(table_data = [[], ["SoldCount", "StrengthFactor", "ItemCount"]])

In [3]:
pandat.set_data_type("table_data", "SoldCount", min=0, max=50, inclusive_min=True, 
                     inclusive_max=True, nullable=False)

In [4]:
dat = pandat.csv.create_pan_dat("pandat_data")

In [5]:
dat.table_data["SoldCount"].max(), dat.table_data["SoldCount"].min()

(73.0, 0.0)

In [6]:
start_time = time.time()

In [7]:
all_fails = pandat.find_data_type_failures(dat)

In [8]:
df_with_bad_rows = all_fails['table_data', "SoldCount"]

In [9]:
len(dat.table_data)

198917

In [10]:
len(df_with_bad_rows)

122924

In [11]:
print(f"Time to do bulk query using apply under the hood in sec {time.time()-start_time}")

Time to do bulk query using apply under the hood in sec 2.4090590476989746


In [12]:
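def faster_df_apply(df, func):
    # Row-wise apply that passes each row to func as a plain dict,
    # avoiding the per-row Series that df.apply(func, axis=1) passes.
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        # row[0] is the index label, row[1:] are the column values
        row_dict = {f:v for f,v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)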

In [13]:
start_time = time.time()

In [14]:
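def bad_row(row):
    # True when SoldCount is outside [0, 50], null, or not comparable to a number
    val = row["SoldCount"]
    try:
        if 0 <= val <= 50:
            return False
    except Exception:  # e.g. TypeError when val is a string or None
        return True
    return True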

In [15]:
df_with_bad_rows = dat.table_data[faster_df_apply(dat.table_data, bad_row)]

In [16]:
len(df_with_bad_rows)

122924

In [17]:
print(f"Time to do bulk query using faster_df_apply in sec {time.time()-start_time}")

Time to do bulk query using faster_df_apply in sec 0.5074670314788818
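
Since the original data is not archived, here is a minimal sketch for checking that `faster_df_apply` returns the same mask as `DataFrame.apply(..., axis=1)`. The `toy` frame and its values are made up for illustration and only borrow the column names from `table_data`; the two functions are repeated so the cell runs on its own.

```python
import numpy as np
import pandas as pd

def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)

def bad_row(row):
    val = row["SoldCount"]
    try:
        if 0 <= val <= 50:
            return False
    except Exception:
        return True
    return True

# Synthetic stand-in for table_data (hypothetical values, not the original data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "SoldCount": rng.integers(0, 74, size=1000).astype(float),
    "StrengthFactor": rng.random(1000),
    "ItemCount": rng.integers(1, 100, size=1000),
})
toy.loc[5, "SoldCount"] = np.nan  # a null should be flagged as a bad row

fast_mask = faster_df_apply(toy, bad_row)
slow_mask = toy.apply(bad_row, axis=1)

# Same index labels and the same True/False value for every row.
assert list(fast_mask.index) == list(slow_mask.index)
assert (fast_mask.values == slow_mask.values).all()
```

Both calls hand `bad_row` something that supports `row["SoldCount"]` (a dict in one case, a Series in the other), which is why the same predicate can be reused unchanged.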


There is a lot going on above, so let's run some `timeit` tests too.

In [18]:
%timeit dat.table_data.copy(deep=True)

9.02 ms ± 88.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
%timeit faster_df_apply(dat.table_data, bad_row)

471 ms ± 9.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [20]:
%timeit dat.table_data.apply(bad_row, axis=1)

1.92 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Conclusion
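The default implementation of `apply` appears to be slow for avoidable reasons. It can be significantly improved by a more thoughtful pure Python rewrite: in the `timeit` runs above, `dat.table_data.apply(bad_row, axis=1)` took about 1.92 s, while `faster_df_apply` took about 471 ms, roughly a 4x speedup. That said, even the faster version is slow compared to a deep copy (about 9 ms), so YMMV.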
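
For comparison, a simple range rule like the one on `SoldCount` can also be written as a vectorized pandas expression, which avoids calling a Python function per row. This is only a sketch on a made-up `toy` frame, and it assumes a numeric column. It is not how ticdat implements its checks, and it would not cover data type rules that a single range comparison cannot express.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen to cover both bounds, a null and a negative.
toy = pd.DataFrame({"SoldCount": [0.0, 12.0, 50.0, 51.0, 73.0, np.nan, -1.0]})

# Series.between is inclusive on both ends by default and yields False for NaN,
# so negating it flags out-of-range values and nulls alike.
bad_mask = ~toy["SoldCount"].between(0, 50)
df_with_bad_rows = toy[bad_mask]
print(df_with_bad_rows)
```

The row-wise approach is still the general one, since it accepts an arbitrary Python predicate. The vectorized form only applies when the rule can be expressed in column operations.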