
raoulg/goad_toolkit


GOAD🐐 is the GOAT - Goal Oriented Analysis of Data


GOAD🐐 - When your data analysis is so fire🔥 it's got rizz✨


GOAD🐐 is a flexible Python package for analyzing, transforming, and visualizing data, with an emphasis on statistical distribution fitting and modular visualization components.

With thanks to my daughters for the Gen Z slang!

📊 Features

  • Composable & extendable plotting system - Build complex visualizations by combining simple components. You can extend the existing components with your own.
  • Statistical distribution fitting - Automatically fit and compare distributions to your data. The distribution registry is extendable with additional distributions.
  • Extendable data transformation pipelines - Chain and reuse data transformations into pipelines. Again, extendable with custom transformation components.

Before GOAD🐐: mid data
After GOAD🐐: data got infinity aura

🚀 Quick Start

Installation

Using uv:

uv add goad-toolkit

Or, if you prefer your dependencies to be installed 100x slower, with pip:

pip install goad-toolkit

📋 Demo: Linear Model Analysis

GOAD🐐 includes a comprehensive demo that shows how to use its components together.

Main capabilities

The demo/linear.py file showcases the main capabilities of GOAD🐐:

  • create a data processing pipeline
  • components are extendable, so you can easily add your own steps to a pipeline
  • create visualisations by stacking components. BasePlot will handle boilerplate.
  • the DistributionFitter will try to fit a few common distributions, and add statistical tests for you
  • The results work together with the visualizer.PlotFits class to show the results

The main strength of this module is not that these elements exist (even though they are very useful). Its superpower is that everything is extendable: you can use it as a starting point and extend it with your own visualisations and analytics.

POV: Your data just got GOADed🐐 and now it's giving main character energy

📚 Core Components

🔄 Extendable Data Transforms

GOAD🐐 provides a pipeline approach to transform your data:

from goad_toolkit.datatransforms import Pipeline, ShiftValues, ZScaler

# Create a pipeline
pipeline = Pipeline()

# Add transformations
pipeline.add(ShiftValues, name="shift_deaths", column="deaths", period=-14)
pipeline.add(ZScaler, name="scale_tests", column="positivetests", rename=True)

# Apply all transformations
result = pipeline.apply(data)
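
The chaining idea behind this interface can be illustrated without the library. The sketch below is a hypothetical stand-in, not GOAD's implementation: `MiniPipeline`, `add_offset` and `zscale` are invented names, and rows are plain dicts instead of a DataFrame. It only shows how named steps with stored keyword arguments can be applied in order.

```python
from statistics import mean, stdev
from typing import Any, Callable

Rows = list[dict[str, float]]


class MiniPipeline:
    """Hypothetical stand-in: stores named steps and applies them in order."""

    def __init__(self) -> None:
        self.steps: list[tuple[str, Callable[..., Rows], dict[str, Any]]] = []

    def add(self, func: Callable[..., Rows], name: str, **kwargs: Any) -> None:
        # Keep the callable and its keyword arguments until apply() is called.
        self.steps.append((name, func, kwargs))

    def apply(self, rows: Rows) -> Rows:
        # Feed the output of each step into the next one.
        for _name, func, kwargs in self.steps:
            rows = func(rows, **kwargs)
        return rows


def add_offset(rows: Rows, column: str, offset: float) -> Rows:
    """Return new rows with `offset` added to `column`."""
    return [{**row, column: row[column] + offset} for row in rows]


def zscale(rows: Rows, column: str) -> Rows:
    """Return new rows with an extra `<column>_zscore` entry."""
    values = [row[column] for row in rows]
    mu, sigma = mean(values), stdev(values)
    return [{**row, f"{column}_zscore": (row[column] - mu) / sigma} for row in rows]


pipeline = MiniPipeline()
pipeline.add(add_offset, name="offset_tests", column="tests", offset=10.0)
pipeline.add(zscale, name="scale_tests", column="tests")
result = pipeline.apply([{"tests": 1.0}, {"tests": 2.0}, {"tests": 3.0}])
print(result)
```

Because each step returns new rows instead of modifying its input, the same pipeline object can be applied to several datasets without side effects. Whether GOAD's own transforms copy or mutate their input is not something this sketch makes claims about.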

Available transforms include:

  • ShiftValues - Shift values in a column by a specified period
  • DiffValues - Calculate the difference between consecutive values
  • SelectDataRange - Select rows within a specified date range
  • RollingAvg - Calculate rolling average of a column
  • ZScaler - Standardize values in a column

You can extend the pipeline with your own transformations by subclassing the base transform class, shown as TransformBase in the excerpt below. ZScaler, for example, is implemented as follows:

class ZScaler(TransformBase):
    """Standardize the values in a column."""
    def transform(
        self, data: pd.DataFrame, column: str, rename: bool = False
    ) -> pd.DataFrame:
        """Standardize the values in a column."""
        if rename:
            colname = f"{column}_zscore"
        else:
            colname = column
        data[colname] = (data[column] - data[column].mean()) / data[column].std()
        return data
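
To make the formula concrete, here is a standalone sketch of the same standardization using only the standard library (neither GOAD nor pandas is required). The helper name `zscores` is hypothetical. It uses the sample standard deviation (dividing by n - 1), which is also the default of pandas' `Series.std()`.

```python
from statistics import mean, stdev


def zscores(values: list[float]) -> list[float]:
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]


scaled = zscores([2.0, 4.0, 6.0, 8.0])
# After scaling, the mean is 0 and the sample standard deviation is 1,
# up to floating-point rounding.
print(mean(scaled), stdev(scaled))
```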

📊 Visualization System

GOAD🐐's visualization system is built on a composable architecture: you build complex plots by combining simpler components.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from goad_toolkit.visualizer import BasePlot, PlotSettings

# Create plot settings
plotsettings = PlotSettings(
    xlabel="date",
    ylabel="normalized values",
    title="Z-Scores of Deaths and Positive Tests",
)

class LinePlot(BasePlot):
    """Plot a line plot using seaborn."""
    def build(self, data: pd.DataFrame, **kwargs):
        sns.lineplot(data=data, ax=self.ax, **kwargs)
        return self.fig, self.ax


class ComparePlot(BasePlot):
    def build(self, data: pd.DataFrame, x: str, y1: str, y2: str, **kwargs):
        compare = LinePlot(self.settings)
        self.plot_on(compare, data=data, x=x, y=y1, label=y1, **kwargs)
        self.plot_on(compare, data=data, x=x, y=y2, label=y2, **kwargs)
        plt.xticks(rotation=45)

        return self.fig, self.ax

compareplot = ComparePlot(plotsettings)
compareplot.plot(
    data=data, x="date", y1="deaths_shifted_zscore", y2="positivetests_zscore"
)

This extendable strategy lets BasePlot handle the boilerplate while you focus on the visualizations you need. It also makes components easier to reuse in different contexts.
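
The division of labour between `plot`, `build` and `plot_on` can be sketched as a template-method pattern. The code below is an assumption about the general shape, not GOAD's actual `BasePlot`: the class names `SketchPlot`, `SketchLine` and `SketchCompare` are hypothetical, and a plain list stands in for the matplotlib axes so the example runs without any plotting library.

```python
from typing import Optional


class SketchPlot:
    """Hypothetical base class: owns the setup, delegates drawing to build()."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.ax: Optional[list[str]] = None  # stand-in for a matplotlib Axes

    def plot(self, **kwargs) -> list[str]:
        # Boilerplate lives here: create the canvas, then delegate to build().
        self.ax = [f"title: {self.title}"]
        self.build(**kwargs)
        return self.ax

    def plot_on(self, other: "SketchPlot", **kwargs) -> None:
        # Let another component draw on this plot's canvas.
        other.ax = self.ax
        other.build(**kwargs)

    def build(self, **kwargs) -> None:
        raise NotImplementedError


class SketchLine(SketchPlot):
    def build(self, y: str, **kwargs) -> None:
        self.ax.append(f"line: {y}")


class SketchCompare(SketchPlot):
    def build(self, y1: str, y2: str, **kwargs) -> None:
        # Reuse the same simple component twice on the shared canvas.
        line = SketchLine(self.title)
        self.plot_on(line, y=y1)
        self.plot_on(line, y=y2)


canvas = SketchCompare("Z-scores").plot(y1="deaths", y2="positivetests")
print(canvas)
```

In this shape a subclass only overrides `build`, and composite plots are assembled by handing their canvas to smaller components.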

📈 Distribution Fitting

GOAD🐐 includes tools for fitting statistical distributions to your data:

from goad_toolkit.analytics import DistributionFitter
from goad_toolkit.visualizer import PlotSettings, FitPlotSettings, PlotFits

fitter = DistributionFitter()
fits = fitter.fit(data["residual"], discrete=False) # we have to decide if the data is discrete or not
best = fitter.best(fits)
settings = PlotSettings(
    figsize=(12, 6), title="Residuals", xlabel="error", ylabel="probability"
)
fitplotsettings = FitPlotSettings(bins=30, max_fits=3)
fitplotter = PlotFits(settings)
fig = fitplotter.plot(
    data=data["residual"], fit_results=fits, fitplotsettings=fitplotsettings
)

For the kstest, the null hypothesis is that the data follows the fitted distribution. In this example the p-values are all below 0.05, so we reject the null hypothesis for each candidate and conclude that the data does not follow any of these distributions.
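
For intuition about what a Kolmogorov-Smirnov test measures, the sketch below computes the one-sample KS statistic D, the largest gap between the empirical CDF of a sample and a reference CDF. It is a standard-library illustration with a hypothetical helper name `ks_statistic`, not the code GOAD runs, and it stops at the statistic: turning D into a p-value is left out (a function such as `scipy.stats.kstest` reports both).

```python
from statistics import NormalDist


def ks_statistic(sample: list[float], cdf) -> float:
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


standard_normal = NormalDist(0.0, 1.0)
n = 100
# Evenly spaced normal quantiles: as close to the reference as n points get.
normal_like = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
# Evenly spaced points on [-1, 1]: clearly not standard normal.
uniform_like = [-1.0 + 2.0 * (i - 0.5) / n for i in range(1, n + 1)]

d_normal = ks_statistic(normal_like, standard_normal.cdf)
d_uniform = ks_statistic(uniform_like, standard_normal.cdf)
print(d_normal, d_uniform)
```

A small D means the sample tracks the reference distribution closely; a large D, as for the made-up uniform sample here, is what leads to a small p-value and a rejected null hypothesis.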

The plots are sorted by log-likelihood. In this case, none of the candidate distributions fits the data well.
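
Ranking by log-likelihood can be sketched as follows. This is a standard-library illustration under stated assumptions, not GOAD's fitter: the sample is made up, only two candidates are compared, and the helper names `normal_loglik` and `exponential_loglik` are hypothetical. Parameters are set to their maximum-likelihood estimates (mean and population standard deviation for the normal, 1 / mean for the exponential rate).

```python
import math
from statistics import NormalDist, mean, pstdev


def normal_loglik(sample: list[float], mu: float, sigma: float) -> float:
    """Sum of log densities under a normal distribution."""
    dist = NormalDist(mu, sigma)
    return sum(math.log(dist.pdf(x)) for x in sample)


def exponential_loglik(sample: list[float], rate: float) -> float:
    """Sum of log densities under rate * exp(-rate * x), x >= 0."""
    return sum(math.log(rate) - rate * x for x in sample)


# Made-up, right-skewed, non-negative data.
sample = [0.1, 0.2, 0.2, 0.4, 0.5, 0.9, 1.3, 2.1, 3.4]

scores = {
    "normal": normal_loglik(sample, mean(sample), pstdev(sample)),
    "exponential": exponential_loglik(sample, 1.0 / mean(sample)),
}
# Higher log-likelihood first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

A higher log-likelihood only says that one candidate explains the data better than another. It does not say the best candidate is adequate, which is why a goodness-of-fit test is reported alongside the ranking.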

🧩 Extending with Custom Distributions

You can easily register new distributions:

from goad_toolkit.distributions import DistributionRegistry
from scipy import stats

# Create registry
registry = DistributionRegistry()

# Register a new distribution
registry.register_distribution(
    name="negative_binomial",
    dist=stats.nbinom,
    is_discrete=True,
    num_params=2
)

# Now it will be used automatically in the DistributionFitter for discrete fits
from goad_toolkit.analytics import DistributionFitter
fitter = DistributionFitter()
print(fitter.registry) # shows all registered distributions
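
How a registry can drive the choice of candidate distributions is sketched below. This is a hypothetical, standard-library stand-in (`MiniRegistry` and `DistEntry` are invented names), not GOAD's `DistributionRegistry`. It only illustrates the idea that each entry carries an `is_discrete` flag, so a fitter can ask for the candidates that match the kind of data it was given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEntry:
    """Metadata stored for one registered distribution."""

    name: str
    is_discrete: bool
    num_params: int


class MiniRegistry:
    """Hypothetical registry keyed by distribution name."""

    def __init__(self) -> None:
        self._entries: dict[str, DistEntry] = {}

    def register(self, name: str, is_discrete: bool, num_params: int) -> None:
        self._entries[name] = DistEntry(name, is_discrete, num_params)

    def names(self, discrete: bool) -> list[str]:
        # Return the candidates whose flag matches the requested kind of data.
        return sorted(
            name for name, entry in self._entries.items()
            if entry.is_discrete == discrete
        )


registry = MiniRegistry()
registry.register("norm", is_discrete=False, num_params=2)
registry.register("poisson", is_discrete=True, num_params=1)
registry.register("negative_binomial", is_discrete=True, num_params=2)

discrete_names = registry.names(discrete=True)
continuous_names = registry.names(discrete=False)
print(discrete_names, continuous_names)
```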

🔧 Advanced Usage: Composing Plots

GOAD🐐 has a powerful plotting system that allows you to combine plot elements:

import pandas as pd

from goad_toolkit.visualizer import BasePlot, LinePlot, BarWithDates, VerticalDate

# Use a base plot to create a composite
class MyCompositePlot(BasePlot):
    def build(self, data: pd.DataFrame, x: str, y1: str, y2: str, special_date: str):
        # Plot the first component - a line plot
        line_plot = LinePlot(self.settings)
        self.plot_on(line_plot, data=data, x=x, y=y1, label=y1)

        # Plot the second component - a bar chart
        bar_plot = BarWithDates(self.settings)
        self.plot_on(bar_plot, data=data, x=x, y=y2)

        # Add a vertical line
        vline = VerticalDate(self.settings)
        self.plot_on(vline, date=special_date, label="Important Event")
        return self.fig, self.ax

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


