
# DataFrames Beyond Pandas: An Introduction to StaticFrame






Back in 2017 I asked the question: "Is Pandas suitable for building production libraries?" While Pandas is well known for its utility in data science and ad hoc research, I consistently found its ease of use and flexibility a detriment in building long-lived library code for production systems. This led me to create StaticFrame, an alternative dataframe library built on an immutable data model. After years of development and use, I am confident that StaticFrame reduces opportunities for error and leads to more maintainable code. While not yet always more efficient than Pandas, in some areas StaticFrame's immutable data model offers very significant improvements in run time and memory usage. Beyond common functionality, StaticFrame offers a more explicit and consistent API, novel multi-Frame containers, and support for high-performance serialization through the NPZ format.


* History
    * Motivation
        * Spooky actions at a distance
            * Giving a 2D array to Frame
        * Redundancies
            * f.attr, f['attr']
            * pivot, pivot_table
        * Inconsistencies
            * No DataFrame.name, but Index.name, Series.name
        * NumPy types: datetime64, strings
    * Development
        * May of 2017 began implementation
        * May of 2018 first public release
    
* Current State
    * Code, Docs, Packages
    * Quality & Test
        * 100% test coverage
        * CI on all platforms
        * Full mypy, pylint quality checks
    * Contributing
    * 1.0
        * Nearly feature complete
        * C-extensions through ArrayKit

* The Frame and the Series
    * DataFrame as labelled arrays
        * Labels stay with data
        * Operations align on labels
    * Focus on 2D and 1D containers
    * nD via hierarchical indices, higher-order containers

* Introspective containers
    * The `interface` attributes
    * Using the docs
    
* A more explicit interface
    * Keyword-only arguments everywhere
    * Relevant, orthogonal parameters
    * Avoiding string parameters
        * pd.Series.rank(method=...) v. sf.Series.rank_ordinal()
    * Consistent, hierarchical naming
        * relabel v. rename
        * sf.Series.relabel, sf.Series.relabel_label_add
        * iter_element().apply(), map_all(), map_fill(), map_any()

* Immutability and "no-copy" operations
    * Immutability reduces opportunities for errors
    * Immutability means we can pass around views without defensive copies
        * Renaming a `sf.Frame` is no-copy
        * Horizontal concatenation of same-index components is no-copy

* Getting data in and out
    * Consistent from / to naming on classes
        * pd.read_parquet(), pd.DataFrame.to_parquet()
        * sf.Frame.from_parquet(), sf.Frame.to_parquet()
    * Explicit constructors to ensure parameter relevance and orthogonality
        * sf.Series.from_element
    * Broad support for all common formats
        * TSV, XLSX, Parquet
    * A new format: NPZ
        * Often faster than parquet
        * Fully encodes all Frame characteristics

* More than one type of Frame
    * Frame, FrameGO, FrameHE
    * FrameGO: Grow-only: safer mutability
    * FrameHE: A hashable Frame

* More than one type of Index
    * Index, IndexDate, IndexYear, etc.
    * IndexHierarchy
    
* Selection
    * More explicit than pandas
        * sf.Frame[] is only a columns selector
        * sf.Frame.iloc, sf.Frame.loc, sf.Frame.bloc
        * sf.ILoc, sf.HLoc: embed iloc, hloc selection in a loc

* Renaming, reindexing, and relabeling
    * rename()
    * reindex()
    * relabel()

* Iteration and function application
    * sf.Frame.iter_element().apply()
    * sf.Frame.iter_series(), iter_array(), iter_tuple()

* Grouping and Windowing
    * sf.Frame.iter_group()
    * sf.Frame.iter_window()

* `via` interfaces
    * Consistent, hierarchical naming
    * via_str, via_dt
    * via_re
    * via_T, via_fill_value
    
* Higher-Order Containers
    * Bus: A Series of Frame
        * Lazy loading, eager unloading
        * Support for reading from, writing to archives
    * Batch: composing operations on iterators of Frames
        * A generalization of Pandas
    * Yarn, Quilt
    
* Performance


# A Brief History of DataFrames

* 1991: earliest implementation of a data frame in the S language
* 2009: Pandas 0.1 released
* 2018: StaticFrame 0.1 released



# Another DataFrame Library?

* Pandas prioritizes ease of use over explicit, strict interfaces
* Pandas supports in-place mutation
* Pandas API is riddled with inconsistencies
* Pandas does not support all NumPy types (`<U`, `datetime64`)
* Pandas abandoned multi-frame containers (i.e., removing the `pd.Panel`)

# StaticFrame Development Status
* Releases
    * Regular releases via pip
    * API is stable within a minor release series; backward incompatibilities arrive only with a new minor release (e.g., 0.9 relative to 0.8)
* Quality & Test
    * 100% test coverage
    * Robust CI/CD with mypy, Pylint, and tests
* Documentation
    * Full code-based API documentation (https://static-frame.readthedocs.io)
    * Every object exposes API via `interface` attribute
* Core Dependencies
    * NumPy
    * team-maintained CPython extension libraries: `automap`, `arraykit`
* 1.0
    * Pending `arraykit` implementation of delimited file readers to fix known issues
    * Maybe by end of 2022


# Getting Data In and Out

* Constructors live on containers
    * `pd.read_csv()`, `pd.DataFrame.from_records()`
    * `sf.Frame.from_csv()`, `sf.Frame.from_records()`
* Narrow, explicit constructors
    * `pd.DataFrame()` supports a single element, or a column of elements
    * `sf.Frame.from_element()`, `sf.Frame.from_elements()`
* Support for common serialization formats
    * `pd.read_excel()`, `pd.read_csv()`, `pd.read_parquet()`
    * `sf.Frame.from_xlsx()`, `sf.Frame.from_csv()`, `sf.Frame.from_parquet()`
* Serialization unique to StaticFrame
    * NPZ and NPY formats faster than parquet with comparable file sizes
    * All `sf.Frame` characteristics encoded
    * NPY supports memory mapping out-of-core data
    * `sf.Frame.to_npz()`, `sf.Frame.from_npz()`

# The Frame & the Series
* A `Series` is a 1D array (of a single type) with labels (an `index`)
* A `Frame` is a 2D container (of one or more types) with row and column labels (an `index` and `columns`)
* Higher-order containers
    * Use hierarchical indices on a 2D container
    * Use multi-`Frame` containers (i.e., the `Bus`)

```python
import pandas as pd
import numpy as np
pd.read_csv()
# TypeError: read_csv() missing 1 required positional argument: 'filepath_or_buffer'
```