
* StaticFrame from the Ground Up
* StaticFrame for Pandas Users
* DataFrames Beyond Pandas: An Introduction to StaticFrame






Back in 2017 I asked the question: "is Pandas suitable for building production libraries?". While Pandas is well-known for its utility in data science and ad-hoc research, I consistently found its ease of use and flexibility a detriment in building long-lived library code for production systems. This led me to create StaticFrame, an alternative dataframe library built on an immutable data model. After years of development and use, I am confident that StaticFrame reduces opportunities for error and leads to more maintainable code. While not yet always more efficient than Pandas, in some areas StaticFrame's immutable data model offers very significant improvements in run time and memory usage. Beyond common functionality, StaticFrame offers a more explicit and consistent API, novel multi-Frame containers, and support for high-performance serialization through the NPZ format.


* History
    * Motivation
        * Spooky actions at a distance
            * Giving a 2D array to Frame
        * Redundancies
            * f.attr, f['attr']
            * pivot, pivot_table
        * Inconsistencies
            * No DataFrame.name, but Index.name, Series.name
        * NumPy types: datetime64, strings
    * Development
        * May of 2017 began implementation
        * May of 2018 first public release
    
* Current State
    * Code, Docs, Packages
    * Quality & Test
        * 100% test coverage
        * CI on all platforms
        * Full mypy, pylint quality checks
    * Contributing
    * 1.0
        * Nearly feature complete
        * C-extensions through ArrayKit

* The Frame and the Series
    * DataFrame as labelled arrays
        * Labels stay with data
        * Operations align on labels
    * Focus on 2D and 1D containers
    * nD via hierarchical indices, higher-order containers

* Introspective containers
    * The `interface` attributes
    * Using the docs
    
* A more explicit interface
    * Keyword-only arguments everywhere
    * Relevant, orthogonal parameters
    * Avoiding string parameters
        * pd.rank v. sf.rank_ordinal()    
    * Consistent, hierarchical naming
        * relabel v. rename
        * sf.Series.relabel, sf.Series.relabel_label_add
        * iter_element().apply(), map_all(), map_fill(), map_any()

* Immutability and "no-copy" operations
    * Immutability reduces opportunities for errors
    * Immutability means we can pass around views without defensive copies
        * Renaming a `sf.Frame` is no-copy
        * Horizontal concatenation of same-index components is no-copy

* Getting data in and out
    * Consistent from / to naming on classes
        * pd.read_parquet(), pd.DataFrame.to_parquet()
        * sf.Frame.from_parquet(), sf.Frame.to_parquet()
    * Explicit constructors to ensure parameter relevance and orthogonality
        * sf.Series.from_element
    * Broad support for all common formats
        * TSV, XLSX, Parquet
    * A new format: NPZ
        * Often faster than parquet
        * Fully encodes all Frame characteristics

* More than one type of Frame
    * Frame, FrameGO, FrameHE
    * FrameGO: Grow-only: safer mutability
    * FrameHE: A hashable Frame

* More than one type of Index
    * Index, IndexDate, IndexYear, etc.
    * IndexHierarchy
    
* Selection
    * More explicit than pandas
        * sf.Frame[] is only a columns selector
        * sf.Frame.iloc, sf.Frame.loc, sf.Frame.bloc
        * sf.ILoc, sf.HLoc: embed iloc, hloc selection in a loc

* Renaming, reindexing, and relabeling
    * rename()
    * reindex
    * relabel

* Iteration and function application
    * sf.Frame.iter_element().apply()
    * sf.Frame.iter_series(), iter_array(), iter_tuple()

* Grouping and Windowing
    * sf.Frame.iter_group()
    * sf.Frame.iter_window()

* `via` interfaces
    * Consistent, hierarchical naming
    * via_str, via_dt
    * via_re
    * via_T, via_fill_value
    
* Higher-Order Containers
    * Bus: A Series of Frame
        * Lazy loading, eager unloading
        * Support for reading to, writing to archives
    * Batch: composing operations on iterators of Frames
        * A generalization of Pandas
    * Yarn, Quilt
    
* Performance


# A Brief History of DataFrames

* 1991: earliest implementation of a data frame in the S language
* 2009: Pandas 0.1 released
* 2018: StaticFrame 0.1 released



# Why Use a Different DataFrame Library?

* Pandas prioritizes ease of use over explicit, strict interfaces
* Pandas supports in-place mutation
* Pandas only optionally enforces unique indices (`verify_integrity` defaults to `False`)
* Pandas API is riddled with inconsistencies
* Pandas does not support all NumPy types (`<U`, `datetime64`)
* Pandas abandoned multi-frame containers (i.e., removing the `pd.Panel`)

# StaticFrame Development Status
* Releases
    * Regular releases via pip
    * Stable API within a minor release series (e.g., 0.9 may introduce backward incompatibilities relative to 0.8)
* Quality & Test
    * 100% test coverage
    * Robust CI/CD with MyPy, Pylint, and test
* Documentation
    * Full code-based API documentation (https://static-frame.readthedocs.io)
    * Every object exposes API via `interface` attribute
* Core Dependencies
    * NumPy
    * team-maintained CPython extension libraries: `automap`, `arraykit`
* 1.0
    * Pending `arraykit` implementation of delimited file readers to fix known issues
    * Maybe by end of 2022


# Installing & Importing

* Available via pip, conda-forge
* `import static_frame as sf`


In [2]:
import static_frame as sf

# The Frame & the Series
* A `Series` is a 1D array (of a single dtype) with labels 
* A `Frame` is a 2D container (of one or more columnar dtypes) with row and column labels
* nD containers
    * Use hierarchical indices on a 2D container
    * Use multi-`Frame` containers (i.e., the `Bus`)

# Anatomy of a Frame

* A `sf.Frame` wraps 1D and 2D NumPy arrays
* NumPy dtypes are unified by column
* Each axis is labelled with an `sf.Index` (or subclass)
    * Row labels via `sf.Frame.index`
    * Column labels via `sf.Frame.columns`
* Hashable metadata via `name` attributes
    * `sf.Frame.name`
    * `sf.Frame.index.name`
    * `sf.Frame.columns.name`

# Getting Data In & Out with Constructors & Exporters

* Constructors always live on containers
    * `pd.read_csv()`, `pd.DataFrame.from_records()`
    * `sf.Frame.from_csv()`, `sf.Frame.from_records()`
* Explicit constructors with narrow functionality
    * `pd.DataFrame()` supports a single element, or a column of elements
    * `sf.Frame.from_element()`, `sf.Frame.from_elements()`
* Support for common serialization formats
    * `pd.read_excel()`, `pd.read_csv()`, `pd.read_parquet()`
    * `sf.Frame.from_xlsx()`, `sf.Frame.from_csv()`, `sf.Frame.from_parquet()`
* StaticFrame-only serialization formats
    * NPZ and NPY formats are often faster than Parquet, with comparable file sizes
    * All `sf.Frame` characteristics encoded
    * NPY supports memory mapping out-of-core data
    * `sf.Frame.to_npz()`, `sf.Frame.from_npz()`

In [3]:
# Creating a Frame from row iterables
f = sf.Frame.from_records(((True, 20, '1954-11'), (False, 30, '2020-04')))
print(str(f))

<Frame>
<Index> 0      1       2       <int64>
<Index>
0       True   20      1954-11
1       False  30      2020-04
<int64> <bool> <int64> <<U7>


# The StaticFrame `__repr__()`

* Shows types of `Frame`, `.index`, and `.columns`
* Shows NumPy dtypes of each column, `.index`, and `.columns`

In [4]:
# Creating a Frame with Frame, Index subclasses, name attributes
f = sf.FrameGO.from_records(((True, 20, '1954-11'), (False, 30, '2020-04')), 
        index=sf.IndexDate(('2021-04-03', '2022-01-09'), name='date'),
        columns=('A', 'B', 'C'),
        name='records', 
        )
print(str(f))

<FrameGO: records>
<IndexGO>          A      B       C       <<U1>
<IndexDate: date>
2021-04-03         True   20      1954-11
2022-01-09         False  30      2020-04
<datetime64[D]>    <bool> <int64> <<U7>


# Representation in Jupyter Notebooks

* An HTML table representation
* Type / dtype information is hidden

In [6]:
f = sf.Frame.from_records(((True, 20, '1954-11'), (False, 30, '2020-04')), index=tuple('xy'), columns=tuple('ABC'))
f

| | A | B | C |
|---|---|---|---|
| x | True | 20 | 1954-11 |
| y | False | 30 | 2020-04 |


# Finding All Constructors

* Every SF container has a `.interface` attribute
* Returns an `sf.Frame` describing the complete interface


In [10]:
# Using the interface attribute to show the signature of all constructors
f = sf.Frame.interface
f.loc[f['group'] == 'Constructor'].head()

| | cls_name | group | doc |
|---|---|---|---|
| `__init__(data, *, index, columns, ...)` | Frame | Constructor | Initializer. Args: data: Default Frame initialization requires typed data such a... |
| `from_arrow(value, *, index_depth, index_name_depth_level, ...)` | Frame | Constructor | Realize a Frame from an Arrow Table. Args: value: A pyarrow.Table instance. inde... |
| `from_clipboard(*, delimiter, index_depth, index_column_first, ...)` | Frame | Constructor | Create a Frame from the contents of the clipboard (assuming a table is stored as... |
| `from_concat(frames, *, axis, union, ...)` | Frame | Constructor | Concatenate multiple Frames into a new Frame. If index or columns are provided a... |
| `from_concat_items(items, *, axis, union, ...)` | Frame | Constructor | Produce a Frame with a hierarchical index from an iterable of pairs of labels, F... |


# Selection
* Selection Types
    * `loc`: use labels
    * `iloc`: use integer position (from zero)
    * `bloc`: use Boolean indicator (StaticFrame only)
* Selection values
    * A single label (a tuple is a single label)
    * A list of labels
    * A slice of labels
    * A 1D Boolean array selecting labels


# Frame Selection Interfaces
    
* `[]`: root `__getitem__` selection 
    * `pd.DataFrame[]` selects columns by label, rows by slice or 1D Boolean array, or elements by 2D Boolean array
    * `sf.Frame[]` is exclusively column selection
* `loc[]`: select rows, and optionally columns, by label
* `iloc[]`: select rows, and optionally columns, by integer position
* `bloc[]`: select with a 2D Boolean array (StaticFrame only)

In [None]:
# Examples

# Mixing `loc` and `iloc` Selection

* `sf.ILoc` (StaticFrame only)
* Permits embedding `iloc` selection in a `loc` selection

In [9]:
# Example

# Immutability and "No-Copy" Operations
* Immutability reduces opportunities for errors
* Immutability means we can pass around views without defensive copies
    * Renaming a `sf.Frame` is no-copy
    * Horizontal concatenation of same-index components is no-copy

# Assignment-like Transformations Without In-Place Mutation

* Pandas permits in-place assignment to all types of selections
    * `pd.DataFrame.loc['x', 'B':] = 1.0`
* StaticFrame offers an interface that defines a selection and is called with a value
    * `sf.Frame.assign.loc['x', 'B':](1.0)`
    * Creates a new container
    * Unchanged columns will be views and re-used (no-copy)

In [None]:
# Example

# Grow-Only Mutation
* Pandas permits growing a DataFrame by columns (efficient) and rows (very inefficient)
* The `sf.FrameGO` permits grow-only column addition
* While the container is mutated, underlying array data remains immutable
* Growing rows is never permitted (use `sf.Frame.from_concat()` with collected rows)

In [None]:
# Example

# A Family of `sf.Frame`

* Pandas has only one `DataFrame` class
* StaticFrame has a family
    * `sf.Frame`
    * `sf.FrameGO`: a grow-only `sf.Frame`
    * `sf.FrameHE`: a hashable `sf.Frame`
* Methods to convert between all three (always a no-copy operation)
    * `sf.Frame.to_frame_go()`
    * `sf.FrameHE.to_frame()`


In [11]:
# Example

# Full Support for All NumPy Dtypes
* Pandas only uses a subset of NumPy dtypes

# A Family of `sf.Index`



In [None]:
import pandas as pd
import numpy as np
pd.read_csv()