# StaticFrame from the Ground Up: Getting Started with Immutable DataFrames

Back in 2017 I found myself frequently asking: "is Pandas a suitable foundation for production library code?" While Pandas is well-known for its utility in data science, I consistently found its flexibility a detriment in building library code for production systems.

This led me to create StaticFrame, an alternative dataframe library built on an immutable data model. After years of development and use, I am confident that StaticFrame reduces opportunities for error and leads to more maintainable code. While not yet always more efficient than Pandas, in some areas StaticFrame offers very significant improvements in run time and memory usage. Beyond common functionality, StaticFrame offers a more explicit and consistent API, novel multi-Frame containers and processors, and support for high-performance serialization through the NPZ format.

# What is a DataFrame?
* A 2D table with labelled rows and columns
    * Labels stay with data after selection
    * Operations align on labels
    * Can reindex based on labels
* Distinct from a simple 2D array
    * Labels can be any (hashable) type
    * Support for heterogeneous column types
* A high-level language (Python) can be used to implement dataframe functionality over a high-performance, low-level array library (NumPy)

# A Brief History of DataFrames

* 1991: earliest implementation of a dataframe in the S language
* 2009: Pandas 0.1 released
* 2018: StaticFrame 0.1 released
* Presently a number of dataframe libraries in Python and other languages


# Why Not Just Use Pandas?

* Pandas prioritizes ease of use over explicit, strict interfaces
* Pandas supports in-place mutation
* Pandas only optionally supports unique indices (`verify_integrity` defaults to `False`)
* Pandas API is riddled with inconsistencies
* Many Pandas interfaces have non-orthogonal parameters
* Pandas does not support all NumPy types (`<U`, `datetime64`)
* Pandas abandoned multi-frame containers (i.e., removing the `pd.Panel`)

* See also: https://dev.to/flexatone/ten-reasons-to-use-staticframe-instead-of-pandas-4aad

# Learning StaticFrame from Pandas

* Nearly everything you can do with Pandas you can do with StaticFrame
    * No internal graphing / plotting support
    * Few internal implementations of calculation available elsewhere (NumPy, SciPy)
* Much of what you already know will directly translate
    * Many interfaces and methods are identical
    * StaticFrame has more numerous, narrower interfaces with keyword-only arguments
    * StaticFrame follows hierarchical naming
* You can go back and forth
    * `Frame.to_pandas()`
    * `Frame.from_pandas()`

# StaticFrame Development Status

* Releases
    * Regular releases via PIP
    * API is stable within a minor release series; a new minor release may introduce backward incompatibilities (e.g., 0.9 relative to 0.8)
* Quality & Test
    * 100% test coverage
    * Robust CI/CD with MyPy, Pylint, and multi-platform testing
* Documentation
    * Full code-based API documentation (https://static-frame.readthedocs.io)
    * Every object exposes API via `interface` attribute
* Core Dependencies
    * NumPy
    * Team-maintained CPython extension libraries: `automap`, `arraykit`
* When will there be a 1.0?
    * Pending `arraykit` implementation of delimited file readers to fix known issues
    * Maybe by end of 2022


# Installing & Importing

* Available via pip, conda-forge
* `import static_frame as sf`


In [2]:
import static_frame as sf

# The Frame & the Series
* A `Series` is a 1D array (of a single dtype) with labels 
* A `Frame` is a 2D container (of one or more columnar dtypes) with row and column labels
* Support for higher-dimensional data
    * Use hierarchical indices on a 2D container
    * Use multi-`Frame` containers (i.e., the `Bus`)

# Anatomy of a Frame

* A `sf.Frame` wraps 1D and 2D NumPy arrays
* NumPy dtypes are unified by column
* Each axis is labelled with an `sf.Index` (or subclass)
    * Row labels via `sf.Frame.index`
    * Column labels via `sf.Frame.columns`
* Hashable metadata via `name` attributes
    * `sf.Frame.name`
    * `sf.Frame.index.name`
    * `sf.Frame.columns.name`

# Getting Data In & Out: Constructors & Exporters

* Constructors always live on containers
    * `pd.read_csv()`, `pd.DataFrame.from_records()`
    * `sf.Frame.from_csv()`, `sf.Frame.from_records()`
* Explicit constructors with narrow functionality
    * `pd.DataFrame()` supports a single element, or a column of elements
    * `sf.Frame.from_element()`, `sf.Frame.from_elements()`
* Support for common serialization formats
    * `pd.read_excel()`, `pd.read_csv()`, `pd.read_parquet()`
    * `sf.Frame.from_xlsx()`, `sf.Frame.from_csv()`, `sf.Frame.from_parquet()`
* Serialization methods exclusive to StaticFrame
    * NPZ and NPY formats faster than parquet with comparable file sizes
    * Encodes all `sf.Frame` characteristics
    * NPY supports memory mapping out-of-core data
    * `sf.Frame.to_npz()`, `sf.Frame.from_npz()`

In [25]:
# Creating a Frame from row iterables
f = sf.Frame.from_records(((True, 20, '1954-11'), (False, 30, '2020-04')))
print(str(f))

<Frame>
<Index> 0      1       2       <int64>
<Index>
0       True   20      1954-11
1       False  30      2020-04
<int64> <bool> <int64> <<U7>


# String Representations

* `sf.Frame.__repr__()` provides more information than `pd.DataFrame.__repr__()`
* Shows types of `Frame`, `.index`, and `.columns`
* Shows NumPy dtypes of each column, `.index`, and `.columns`
* In terminal environments can use colors for types, dtypes

In [26]:
# Creating a Frame with Frame subclass, Index subclasses, name attributes
f = sf.FrameGO.from_records(((True, 20, '1954-11'), (False, 30, '2020-04')), 
        index=sf.IndexDate(('2021-04-03', '2022-01-09'), name='date'),
        columns=('A', 'B', 'C'),
        name='records', 
        )
print(str(f))

<FrameGO: records>
<IndexGO>          A      B       C       <<U1>
<IndexDate: date>
2021-04-03         True   20      1954-11
2022-01-09         False  30      2020-04
<datetime64[D]>    <bool> <int64> <<U7>


# Representation in Jupyter Notebooks

* An HTML table representation
* Type / dtype information is hidden by default

In [27]:
f1 = sf.Frame.from_records(((True, 20, '1954-11'), (False, 30, '2020-04')), 
                            index=tuple('xy'), columns=tuple('ABC'))
f1

Unnamed: 0,A,B,C
x,True,20,1954-11
y,False,30,2020-04


# Finding All Constructors

* Every SF container has an `.interface` attribute
* `.interface` returns a `sf.Frame` of the complete interface
* The same representation is used to populate API overview: https://static-frame.readthedocs.io/en/latest/api_overview/frame.html


In [28]:
# Using the interface attribute to show the signature of all constructors
f = sf.Frame.interface
f.loc[f['group'] == 'Constructor'].head()

Unnamed: 0,cls_name,group,doc
"__init__(data, *, index, columns, ...)",Frame,Constructor,Initializer. Args: data: Default Frame initialization requires typed data such a...
"from_arrow(value, *, index_depth, index_name_depth_level, ...)",Frame,Constructor,Realize a Frame from an Arrow Table. Args: value: A pyarrow.Table instance. inde...
"from_clipboard(*, delimiter, index_depth, index_column_first, ...)",Frame,Constructor,Create a Frame from the contents of the clipboard (assuming a table is stored as...
"from_concat(frames, *, axis, union, ...)",Frame,Constructor,Concatenate multiple Frames into a new Frame. If index or columns are provided a...
"from_concat_items(items, *, axis, union, ...)",Frame,Constructor,"Produce a Frame with a hierarchical index from an iterable of pairs of labels, F..."


# Selection
* StaticFrame exposes all types of NumPy and Pandas-style selection
* StaticFrame interfaces are more narrow than Pandas
* Selection interfaces
    * `[]`: use labels
    * `loc[]`: use labels
    * `iloc[]`: use integer position (from zero)
    * `bloc[]`: use Boolean indicator (StaticFrame only)
* NumPy-style selection values 
    * A single label (a tuple is a single label)
    * A list of labels (must be a list to distinguish from a tuple label)
    * A slice of labels
    * A 1D Boolean array selecting labels


# Frame Selection Interfaces
    
* `[]`: root `__getitem__` selection 
    * `pd.DataFrame[]` selects by column labels, or row and column labels, or by 2D Boolean array
    * `sf.Frame[]` is exclusively column selection
* `loc[]`: select rows, optionally columns, by label (same as Pandas)
* `iloc[]`: select rows, optionally columns, by integer position (same as Pandas)
* `bloc[]`: select with a 2D Boolean array (StaticFrame only)

In [29]:
f1 = sf.Frame.from_records(((True, 20, '1954-11'), (False, 30, '2020-04')), 
                            index=tuple('xy'), columns=tuple('ABC'))
f1['B'] # Single label

0,1
x,20
y,30


In [30]:
f1[f1.columns == 'C'] # Boolean indicator

Unnamed: 0,C
x,1954-11
y,2020-04


In [31]:
f1.loc['y':, ['A', 'C']] # A slice and list of labels

Unnamed: 0,A,C
y,False,2020-04


In [32]:
f1.iloc[-1, -1]

'2020-04'

In [34]:
f1.bloc[f1.isin([30, '2020-04'])]

0,1
"('y', 'B')",30
"('y', 'C')",2020-04


# Mixing `loc` and `iloc` Selection

* `sf.ILoc` (StaticFrame only) permits embedding `iloc` selection in a `loc` selection
* `sf.HLoc` (similar to `pd.IndexSlice`) permits embedding hierarchical selection in `loc` selection

In [35]:
f1.loc[sf.ILoc[-1], ['A', 'C']] # Get the last row, columns A and C

0,1
A,False
C,2020-04


# Collection Inclusion

* `sf.Frame.isin()` (same as Pandas)


# Handling Missing Values
* Missing values are `None` and `np.nan` (same as Pandas)
* Boolean indicators (same as Pandas)
    * `sf.Frame.isna()`
    * `sf.Frame.notna()`
* Replacing missing values with new containers (same as Pandas)
    * `sf.Frame.dropna()`
    * `sf.Frame.fillna()`

# Handling Falsy Values
* Sometimes we want to treat `0` or `''` or `()` as missing
* StaticFrame provides a family of functions
    * `sf.Frame.isfalsy()`
    * `sf.Frame.notfalsy()`
    * `sf.Frame.dropfalsy()`
    * `sf.Frame.fillfalsy()`

# Fill Missing Values Along an Axis
* Fill missing values with the preceding (forward) or following (backward) non-missing observation; `limit` caps how many values are filled per contiguous missing region.
    * `sf.Frame.fillna_forward()`
    * `sf.Frame.fillna_backward()`
* Fill the leading or trailing missing values with a provided value
    * `sf.Frame.fillna_leading()`
    * `sf.Frame.fillna_trailing()`

# Fill Falsy Values Along an Axis
* Fill falsy values with the preceding (forward) or following (backward) non-falsy observation; `limit` caps how many values are filled per contiguous falsy region.
    * `sf.Frame.fillfalsy_forward()`
    * `sf.Frame.fillfalsy_backward()`
* Fill the leading or trailing falsy values with a provided value
    * `sf.Frame.fillfalsy_leading()`
    * `sf.Frame.fillfalsy_trailing()`

# Immutability and "No-Copy" Operations
* Immutability reduces opportunities for errors 
* NumPy provides no-copy "views" of array data when possible
* With immutable arrays, we can pass around views without defensive copies
* Examples:
    * Renaming a `sf.Frame` is no-copy
    * Relabelling `index` or `columns` does not copy underlying arrays
    * Horizontal concatenation of same-index components is no-copy

# Assignment with Immutable Frames
* Pandas permits in-place assignment and mutation on all types of selections
    * `pd.DataFrame.loc['x', 'B':] = 1.0`
* StaticFrame offers an `assign` interface that defines a selection and is called with a value to assign
* `sf.Frame.assign.loc['x', 'B':](1.0)`
    * Returns a new container
    * Unchanged columns will be views and re-used (no-copy)

In [36]:
# Assigning a value to a slice in a single row
f1.assign.loc['x', 'B':](-1)

Unnamed: 0,A,B,C
x,True,-1,-1
y,False,30,2020-04


In [37]:
# Assigning a Series to a column, matching on label
f1.assign['B'](sf.Series(('y', 'x'), index=('y', 'x')))

Unnamed: 0,A,B,C
x,True,x,1954-11
y,False,y,2020-04


# Grow-Only Mutation
* Pandas permits growing a DataFrame by columns (efficient) and rows (very inefficient)
* The `sf.FrameGO` permits grow-only column addition or whole-frame extension
* While the container is mutated, underlying array data remains immutable 
* Growing rows is never permitted (use `sf.Frame.from_concat()` with collected rows)

In [55]:
# Adding a column to a FrameGO
f2 = f1.to_frame_go()
f2['D'] = (34, 87)
f2

Unnamed: 0,A,B,C,D
x,True,20,1954-11,34
y,False,30,2020-04,87


In [54]:
# Extending a FrameGO with another Frame
f3 = (f1[['A', 'B']] * 100).relabel(columns=lambda l: l.lower())
f2.extend(f3)
f2

Unnamed: 0,A,B,C,D,a,b
x,True,20,1954-11,34,100,2000
y,False,30,2020-04,87,0,3000


# A Family of `sf.Frame`

* Pandas has only one `DataFrame` class
* StaticFrame has a family
    * `sf.Frame`
    * `sf.FrameGO`: a grow-only `sf.Frame`
    * `sf.FrameHE`: a hashable `sf.Frame`
* Methods exist to easily convert between all three (always a no-copy operation)
    * `sf.Frame.to_frame_go()`
    * `sf.FrameHE.to_frame()`



# Full Support for All NumPy Dtypes
* NumPy is the foundation of StaticFrame and Pandas
* Pandas only uses a subset of NumPy dtypes; StaticFrame supports all
* NumPy's fixed-size Unicode arrays
    * Optimal when elements are diverse and of similar size
    * Pandas always converts these to object arrays of Python strings
* NumPy's `datetime64` type
    * Fast datetime representation with units for resolution (from year to attosecond)
    * Pandas coerces any `datetime64` to nanosecond units
    * StaticFrame permits using year, date, or any `datetime64` unit
    * See also: https://www.youtube.com/watch?v=jdnr7sgxCQI

# A Family of `sf.Index`

* To use `datetime64` as an index, use a `datetime64` `sf.Index` subclass
    * `sf.IndexDate`, `sf.IndexYearMonth`, etc.
    * Provides robust translation from Python date / datetime objects
    * Provides partial selection with less granular date units
    * Provides alternative constructor for date ranges
* Hierarchical indices with `sf.IndexHierarchy`
* Many interfaces expose `index_constructor` arguments to specify what kind of index to make.
    

# Renaming, Reindexing, Relabeling

* `rename()` sets the `name` attribute on all containers
    * `pd.DataFrame.rename()` relabels the axis, `pd.Series.rename()` sets the name of the container
    * `sf.Frame.rename()`, `sf.Series.rename()` all do the same thing
    * Renaming is a no-copy operation
* `reindex()` applies new index, aligning to the previous index
    * Similar to `pd.DataFrame.reindex()`
    * Matching labels will retain their data
    * New labels will introduce missing values (provided with a `fill_value`)
* `relabel()` applies a new index, regardless of alignment to previous index
    * Can map old to new with `dict`
    * Can process old to new with a function
    * Can replace with a new `sf.Index` or iterable

# Iteration
* Iterating elements: `Frame.iter_element()`
* Iterating rows or columns:
    * Specify axis=1 for rows, axis=0 for columns
    * Choose what you want to get back
        * `Frame.iter_series()`
        * `Frame.iter_tuple()`
        * `Frame.iter_array()`

# Function & Mapping Application
* Function application implies iteration
* Choose what you want to iterate on and call `apply()`
* Can multi-process / thread with `apply_pool()`
* Can iterate through results with `apply_iter()`
* Can map instead of apply
    * `map_all()`: if value not mappable, raise
    * `map_any()`: map what you can, leave the rest unchanged
    * `map_fill()`: map what you can, provide `fill_value` for others

# Grouping & Windowing

* `sf.Frame.iter_group()`
    * Group by unique values in one or more columns (axis 0) or rows (axis 1)
    * Can use `apply()` if reducing to an `sf.Series`
    * Can use an `sf.Batch` for performing operations on sub-Frames like `pd.DataFrameGroupBy`
* `sf.Frame.iter_window()`
    * Can use an `sf.Batch` for performing operations on sub Frames like `pd.Rolling`

# Interfaces for Working with Strings
* `sf.Frame.via_str`, similar to `pd.Series.str`

# Interfaces for Working with Dates
* `sf.Frame.via_dt`, similar to `pd.Series.dt`

# Configuring `fill_value` in Operator Application

* Operations on labelled containers force reindexing
* `sf.Frame.via_fill_value()` permits providing a fill value

# Virtual Transposition in Operator Application
* Applying a 1D container on a 2D container applies to rows
* `sf.Frame.via_T` presents 2D containers "virtually" transposed

# Working with Collections of Frames
* Pandas deprecated, and later removed, the `pd.Panel` for 3D data
* Hierarchical indices incur overhead and force loading all data at once
* The `sf.Bus`
    * Offers a Series-like interface to collections of Frames
    * Can read from and write to multi-table storage formats
        * XLSX, HDF5, SQLite
        * Zipped archives of CSV, TSV, Parquet, and NPZ
    * Reads lazily
    * Optionally unloads eagerly


* A more explicit interface
    * Keyword-only arguments everywhere
    * Relevant, orthogonal parameters
    * Avoiding string parameters
        * `pd.rank` v. `sf.rank_ordinal()`
    * Consistent, hierarchical naming
        * `relabel` v. `rename`
        * `sf.Series.relabel`, `sf.Series.relabel_level_add`
        * `iter_element().apply()`, `map_all()`, `map_fill()`, `map_any()`

    
* Higher-Order Containers
    * Bus: A Series of Frame
        * Lazy loading, eager unloading
        * Support for reading from, writing to archives
    * Batch: composing operations on iterators of Frames
        * A generalization of Pandas
    * Yarn, Quilt
    
* Performance
