Standardize DataFrame

standardize_df provides features to store and execute your operations for standardizing a pandas dataframes.

Installation

pip install standardize-df

Overview

You can think of the process of standardizing a dataframe as conforming its values to some standard for comparitive evaluations. standardize contains many features to help standardize a dataframe:

adjust_df: Functions for altering a dataframe based off an altered subset of the dataframe.
df_operations: General functions to help with standardizing a dataframe.
pipeline: Pipeline classes for chaining the output of one callable to the next.
standards: Provides the Standard class to store and execute your operations for standardizing a dataframes column(s).
utils: Utility functions related for evaluating empty values and flattening iterables

Here's a tour of some of the main features with examples from these modules:

Adjusting DataFrames

When standardizing a dataframe, you'll often create, drop, or override columns and rows from a subset. The adjust_df module offers the adjust_df function that can reflect those changes from the subset to the dataframe.

Adjusting rows:

import pandas as pd
from standardize_df.adjust_df import adjust_df
df = pd.DataFrame({'letters': ['a', 'b', 'c'], 'symbols': ['!', '@', '#']})
altered_subset = pd.Series(['!', '@'], name='symbols')
adjust_df(df, altered_subset, fields=['symbols'])

  letters symbols
0       a       !
1       b       @

Adjusting columns:

altered_subset = pd.DataFrame({'symbols': ['!', '@', '#'], 'numbers': [1, 2, 3]})
adjust_df(df, altered_subset, fields=['symbols'])

  letters symbols  numbers
0       a       !        1
1       b       @        2
2       c       #        3

Pipeline

You can store operations (callables) into Pipeline or PipelineMapping to create a data pipeline. The PipelineMapping offers the reorder method to easily reorder, replace, or drop operations.

Creating a data pipeline with the Pipeline class:

from standardize_df.pipeline import Pipeline

def add_one(num): return num + 1
def add_two(num): return num + 2

pipe = Pipeline([add_one, add_two])
pipe(7)

Creating a data pipeline with the PipelineMapping class:

from standardize_df.pipeline import PipelineMapping

def add_one(num): return num + 1
def square_two(num): return num ** 2

pipe = PipelineMapping({'add_one': add_one, 'square_two': square_two})
pipe(2)

Reordering a PipelineMapping instance with an iterable:

reordered = pipe.reorder(['square_two', 'add_one'])
reordered
reordered(2)

PipelineMapping([('square_two', <function square_two at 0x7f002b86e820>), ('add_one', <function add_one at 0x7f0044be15e0>)])
5

Reordering a PipelineMapping instance with a mapping:

reordered = pipe.reorder({'square_two': None})  # None denotes 'leave func as is'
reordered
reordered(2)

PipelineMapping([('square_two', <function square_two at 0x7f002b86e820>)])
4

Standards

The standards module offers the Standards class for storing field name(s) and an operation to standardize those fields of a dataframe. The Standards.standardize_df method passes in a subset with the field name(s) to the operation, and adjusts the dataframe according to the return value, the altered subset.

Single field name (keys) will result in a series subset being passed into the operation. A series or a dataframe can be returned, and the original dataframe will be adjusted to it.

import pandas as pd
from standardize_df.standards import Standards

def drop_first(col: pd.Series) -> pd.Series: 
    '''drops the first row from the original dataframe.'''
    return col.drop(index=0)

def add_one(col: pd.Series) -> pd.DataFrame:
    '''adds the plus_one column to the original dataframe.'''
    df = col.to_frame()
    df['plus_one'] = df['numbers'] + 1
    return df

df = pd.DataFrame({'letters': ['a', 'b', 'c'], 'numbers': [1, 2, 3]})
standards_mapping = {
    'letters': drop_first, 
    'numbers': add_one
}
standards = Standards(standards_mapping)
standards.standardize_df(df)

  letters  numbers  plus_one
1       b        2         3
2       c        3         4

Multiple field name (keys) will result in a dataframe subset being passed into the operation. A series or dataframe can be returned, and the original dataframe will be adjusted to it.

def increment_one(df: pd.DataFrame) -> pd.DataFrame:
    '''adds columns letters_plus and numbers_plus to the original dataframe.'''
    df['letters_plus'] = df['letters'].apply(lambda x: chr(ord(x) + 1))
    df['numbers_plus'] = df['numbers'] + 1
    return df

def drop_numbers(df: pd.DataFrame) -> pd.Series:
    '''drops the numbers column from the original dataframe.'''
    return df['numbers_plus']


standards_mapping = {
    ('letters', 'numbers'): increment_one, 
    ('numbers', 'numbers_plus'): drop_numbers  # numbers_incr column added in increment_one func 
}
standards = Standards(standards_mapping)
standards.standardize_df(df)

  letters letters_plus  numbers_plus
0       a            b             2
1       b            c             3
2       c            d             4

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
standardize_df		standardize_df
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Standardize DataFrame

Installation

Overview

Adjusting DataFrames

Pipeline

Standards

About

Releases

Packages

Languages

License

yohan10/standardize-df

Folders and files

Latest commit

History

Repository files navigation

Standardize DataFrame

Installation

Overview

Adjusting DataFrames

Pipeline

Standards

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages