Commit

Merge 2a9f18e into 942e3d9
lalmei committed Apr 30, 2021
2 parents 942e3d9 + 2a9f18e commit e5122ac
Showing 3 changed files with 104 additions and 90 deletions.
194 changes: 104 additions & 90 deletions README.md
@@ -1,141 +1,155 @@
# whylogs Library
# whylogs: A Data and Machine Learning Logging Standard
<img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250"><img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250"><img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250">


[![License](http://img.shields.io/:license-Apache%202-blue.svg)](https://github.com/whylabs/whylogs-python/blob/mainline/LICENSE)
[![PyPI version](https://badge.fury.io/py/whylogs.svg)](https://badge.fury.io/py/whylogs)
[![Coverage Status](https://coveralls.io/repos/github/whylabs/whylogs-python/badge.svg?branch=mainline&service=github)](https://coveralls.io/github/whylabs/whylogs-python?branch=mainline)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
[![CII Best Practices](https://bestpractices.coreinfrastructure.org/projects/4490/badge)](https://bestpractices.coreinfrastructure.org/projects/4490)
[![PyPi Downloads](https://pepy.tech/badge/whylogs)](https://pepy.tech/project/whylogs)

![CI](https://github.com/whylabs/whylogs-python/workflows/whylogs%20CI/badge.svg)
[![Maintainability](https://api.codeclimate.com/v1/badges/442f6ca3dca1e583a488/maintainability)](https://codeclimate.com/github/whylabs/whylogs-python/maintainability)

This is a Python implementation of whylogs. The Java implementation can be found [here](https://github.com/whylabs/whylogs-java).

Understanding the properties of data as it moves through applications is essential to keeping your ML/AI pipeline stable
and improving your user experience, whether your pipeline is built for production or experimentation. whylogs is an
open source statistical logging library that allows data science and ML teams to effortlessly profile ML/AI pipelines and applications, producing log files that can be used for monitoring, alerts, analytics, and error analysis.
whylogs is an open source standard for data and ML logging, monitoring, and troubleshooting.

The whylogs logging agent is the easiest way to enable logging, testing, and monitoring in an ML/AI application. The lightweight agent profiles data in real time, collecting thousands of metrics from structured data, unstructured data, and ML model predictions with zero configuration.

whylogs can be installed in any Python, Java or Spark environment; it can be deployed as a container and run as a sidecar; or invoked through various ML tools (see integrations).

whylogs is designed by data scientists, ML engineers and distributed systems engineers to log data in the most cost-effective, scalable and accurate manner. No sampling. No post-processing. No manual configurations.

whylogs is released under the Apache 2.0 open source license. It supports many languages and is easy to extend. This repo contains the whylogs CLI, language SDKs, and individual libraries are in their own repos.

whylogs calculates approximate statistics for datasets of any size up to TB-scale, making it easy for users to identify
changes in the statistical properties of a model's inputs or outputs. Using approximate statistics allows the package
to run on minimal infrastructure and monitor an entire dataset, rather than miss outliers and other anomalies by only
using a sample of the data to calculate statistics. These qualities make whylogs an excellent solution for profiling
production ML/AI pipelines that operate on TB-scale data and with enterprise SLAs.

For questions and discussions, hop on our [slack channel](http://join.slack.whylabs.ai/)!

# Key Features
This is a Python implementation of whylogs. The Java implementation can be found [here](https://github.com/whylabs/whylogs-java).

* **Data Insight:** whylogs provides complex statistics across different stages of your ML/AI pipelines and applications.
If you have any questions, comments, or just want to hang out with us, please join [our Slack channel](http://join.slack.whylabs.ai/).

* **Scalability:** whylogs scales with your system, from local development mode to live production systems in multi-node
clusters, and works well with batch and streaming architectures.

* **Lightweight:** whylogs produces small mergeable lightweight outputs in a variety of formats, using sketching
algorithms and summarizing statistics.
<img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250"><img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250"><img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250">

* **Unified data instrumentation:** To enable data engineering pipelines and ML pipelines to share a common framework
for tracking data quality and drifts, the whylogs library supports multiple languages and integrations.

* **Observability:** In addition to supporting traditional monitoring approaches, whylogs data can support advanced
ML-focused analytics, error analysis, and data quality and data drift detection.
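
To make the "mergeable outputs" idea above concrete, here is a minimal sketch in plain Python. It is not the whylogs API: `NumberSummary`, `track`, `merge`, and `summarize` are hypothetical names chosen for illustration, and the summary keeps only a few exact counters rather than the sketching algorithms whylogs uses. It shows why per-batch summaries can be combined into the same result as a single pass over all the data.

```python
from dataclasses import dataclass
import math


@dataclass
class NumberSummary:
    """Hypothetical mergeable summary of one numeric column (not the whylogs API)."""

    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value: float) -> None:
        # Update the running counters with a single observation.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "NumberSummary") -> "NumberSummary":
        # Each field combines independently, so summaries built on
        # separate nodes or batches can be merged later.
        return NumberSummary(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )


def summarize(values):
    """Build a summary from an iterable of numbers."""
    summary = NumberSummary()
    for value in values:
        summary.track(value)
    return summary


batch_a = [3.0, 1.5, 8.0]
batch_b = [2.0, 9.5]

merged = summarize(batch_a).merge(summarize(batch_b))

# Merging per-batch summaries matches summarizing everything at once.
assert merged == summarize(batch_a + batch_b)
```

Because only the small summary objects need to be shipped and merged, the same pattern fits both batch jobs and streaming workers.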

## Statistical Profile
whylogs collects approximate statistics and sketches of data on a per-column basis into a statistical profile.
These metrics include:
- [Getting started](#getting-started)
- [Features](#features)
- [Data Types](#data-types)
- [Integrations](#integrations)
- [Community](#community)
- [Roadmap](#roadmap)
- [Contribute](#contribute)

* **Simple counters**: boolean, null values, data types.
* **Summary statistics**: sum, min, max, variance.
* **Unique value counter** or **cardinality**: tracks the approximate number of unique values of your feature using the HyperLogLog algorithm.
* **Histograms** for numerical features. whylogs binary output can be queried with dynamic binning based on the
shape of your data.
* **Top frequent items** (default is 128). Note that this configuration affects the memory footprint, especially for text features.
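
As a rough illustration of what such a per-column profile contains, the sketch below computes the same kinds of metrics exactly and in memory using only the Python standard library. It is an assumption-laden stand-in, not whylogs itself: `profile_column` is a hypothetical helper, the unique count uses a plain `set` where whylogs uses HyperLogLog, the top items come from an exact `Counter`, histograms are omitted, and the variance shown is the population variance.

```python
from collections import Counter
import statistics


def profile_column(values, top_k=128):
    """Exact stand-in for a column profile (hypothetical helper, not the whylogs API)."""
    present = [v for v in values if v is not None]
    numbers = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    profile = {
        # Simple counters: totals, nulls, and observed data types.
        "count": len(values),
        "null_count": len(values) - len(present),
        "type_counts": dict(Counter(type(v).__name__ for v in present)),
        # Exact cardinality; a sketch such as HyperLogLog would approximate this.
        "unique_count": len(set(present)),
        # Most frequent items, capped at top_k entries.
        "top_items": Counter(present).most_common(top_k),
    }
    if numbers:
        # Summary statistics only make sense for numeric values.
        profile.update(
            sum=sum(numbers),
            min=min(numbers),
            max=max(numbers),
            variance=statistics.pvariance(numbers),
        )
    return profile


ages = [31, 45, None, 31, 27, 45, 31]
profile = profile_column(ages, top_k=2)

print(profile["null_count"])    # 1
print(profile["unique_count"])  # 3
print(profile["top_items"])     # [(31, 3), (45, 2)]
```

The exact versions above need memory proportional to the number of distinct values, which is the cost that approximate sketches are meant to avoid on large datasets.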
<img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250"><img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250"><img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250">
## Getting started<a name="getting-started" />

# Examples
For a full set of our examples, please check out [whylogs-examples](https://github.com/whylabs/whylogs-examples).

Note that to run the examples with matplotlib visualization, you'll have to install whylogs with the `viz` dependencies:
### Using pip

Install whylogs using the pip package manager by running

```
pip install "whylogs[viz]"
pip install whylogs
```

Check out our example notebooks with Binder: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/whylabs/whylogs-examples/HEAD)
- [Getting Started notebook](https://github.com/whylabs/whylogs-examples/blob/mainline/python/GettingStarted.ipynb)
- [Logging Example notebook](https://github.com/whylabs/whylogs-examples/blob/mainline/python/logging_example.ipynb)
- [Logging Images](https://github.com/whylabs/whylogs-examples/blob/mainline/python/Logging_Images.ipynb)
- [MLflow Integration](https://github.com/whylabs/whylogs-examples/blob/mainline/python/MLFlow%20Integration%20Example.ipynb)
### From the source

# Installation
- Download the source code by cloning the repository or by pressing [Download ZIP](https://github.com/whylabs/whylogs-python/archive/master.zip) on this page.

### Using pip
- You'll need to install poetry in order to install dependencies using the lock file in this project. Follow [their docs](https://python-poetry.org/docs/) to get it set up.

[![PyPi Downloads](https://pepy.tech/badge/whylogs)](https://pepy.tech/project/whylogs)
[![PyPi Version](https://badge.fury.io/py/whylogs.svg)](https://pypi.org/project/whylogs/)
- Run the following command at the root of the source code:

Install whylogs using the pip package manager by running
```
make install
make
```

pip install whylogs

### From source
<img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250"><img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250"><img align="center" src="images/Whylabs-Dots-Light-Bg.png" width="250">
## Quickly Logging Data

Download the source code by cloning the repository or by pressing ['Download ZIP'](https://github.com/whylabs/whylogs-python/archive/master.zip) on this page.
Install by navigating to the desired directory and running
whylogs is easy to get up and running:

python setup.py install
```python
from whylogs import get_or_create_session
import pandas as pd

## Documentation
session = get_or_create_session()

API documentation for `whylogs` can be found at [whylogs.readthedocs.io](http://whylogs.readthedocs.io/).
df = pd.read_csv("path/to/file.csv")

### Demo CLI
with session.logger(dataset_name="my_dataset") as logger:

    # dataframe
    logger.log_dataframe(df)

Our demo CLI generates a demo project flow by running
    # dict
    logger.log({"name": 1})

whylogs-demo init
    # images
    logger.log_images("path/to/image.png")
```

### Quick start CLI
whylogs can be configured programmatically or by using our config YAML file. The quick start CLI can help you bootstrap the
configuration for your project. To use the quick start CLI, run the following command in the root of your Python project.
Check the examples below for visualization and other use cases.

whylogs init
### Glossary/Concepts
**Project:** A collection of related data sets used for multiple models or applications.
### Documentation

**Pipeline:** One or more datasets used to build a single model or application. A project may contain multiple pipelines.
The [documentation](https://docs.whylabs.ai/docs/) of this package is generated automatically.

**Dataset:** A collection of records. whylogs v0.0.2 supports structured datasets, which represent data as a table
where each row is a different record and each column is a feature of the record.
### Features

**Feature:** In the context of whylogs v0.0.2 and structured data, a feature is a column in a dataset. A feature can
be discrete (like gender or eye color) or continuous (like age or salary).
- Accurate data profiling: whylogs calculates statistics from 100% of the data, never requiring sampling, ensuring an accurate representation of data distributions
- Lightweight runtime: whylogs utilizes approximate statistical methods to achieve minimal memory footprint that scales with the number of features in the data
- Any architecture: whylogs scales with your system, from local development mode to live production systems in multi-node clusters, and works well with batch and streaming architectures
- Configuration-free: whylogs infers the schema of the data, requiring zero manual configuration to get started
- Tiny storage footprint: whylogs turns data batches and streams into statistical fingerprints, 10-100MB uncompressed
- Unlimited metrics: whylogs collects all possible statistical metrics about structured or unstructured data

**whylogs Output:** whylogs returns profile summary files for a dataset in JSON format. For convenience, these files
are provided in flat table, histogram, and frequency formats.

**Statistical Profile:** A collection of statistical properties of a feature. Properties can be different for discrete
and continuous features.
## Data Types<a name="data-types" />
whylogs supports both structured and unstructured data, specifically:

### Integrations
The whylogs library is integrated with the following:
- NumPy and Pandas
- [Java and Apache Spark](https://github.com/whylabs/whylogs-java)
- AWS S3 (for output storage)
- Jupyter Notebooks
- MLflow
| Data type | Features | Notebook Example |
| --- | --- | ---|
| Structured data | Distribution, cardinality, schema, counts, missing values | [Getting started with structured data](https://github.com/whylabs/whylogs-examples/blob/mainline/python/GettingStarted.ipynb) |
| Images | EXIF metadata, derived pixel features, bounding boxes | [Getting started with images](https://github.com/whylabs/whylogs-examples/blob/mainline/python/Logging_Images.ipynb) |
| Video | In development | [Github Issue #214](https://github.com/whylabs/whylogs/issues/214) |
| Tensors | Derived 1D features (more in development) | [Github Issue #216](https://github.com/whylabs/whylogs/issues/216) |
| Text | Top-k values, counts, cardinality (more in development) | [Github Issue #213](https://github.com/whylabs/whylogs/issues/213) |
| Audio | In development | [Github Issue #212](https://github.com/whylabs/whylogs/issues/212) |

### Dependencies

For the core requirements, see [requirements.txt](https://github.com/whylabs/whylogs-python/blob/mainline/requirements.txt).

For the development environment, see [requirements-dev.txt](https://github.com/whylabs/whylogs-python/blob/mainline/requirements-dev.txt).
## Integrations

![current integration](images/integrations.001.png)

# Development/contributing
For more information on contributing to whylogs, see [`DEVELOPMENT.md`](DEVELOPMENT.md).
| Integration | Features | Resources |
| --- | --- | --- |
| Spark | Log and monitor any Spark dataframe | |
| Pandas | Log and monitor any pandas dataframe | <ul><li>[Notebook Example](https://github.com/whylabs/whylogs-examples/blob/mainline/python/logging_example.ipynb)</li><li>[whylogs: Embrace Data Logging](https://whylabs.ai/blog/posts/whylogs-embrace-data-logging)</li></ul> |
| Kafka | Log and monitor Kafka topics with whylogs| <ul><li>[Notebook Example](https://github.com/whylabs/whylogs-examples/blob/mainline/python/Kafka.ipynb)</li><li> [Integrating whylogs into your Kafka ML Pipeline](https://whylabs.ai/blog/posts/integrating-whylogs-into-your-kafka-ml-pipeline) </li></ul>|
| MLflow | Enhance MLflow metrics with whylogs | <ul><li>[Notebook Example](https://github.com/whylabs/whylogs-examples/blob/mainline/python/MLFlow%20Integration%20Example.ipynb)</li><li>[Streamlining data monitoring with whylogs and MLflow](https://whylabs.ai/blog/posts/on-model-lifecycle-and-monitoring)</li></ul> |
| GitHub Actions | Unit test data with whylogs and GitHub Actions | <ul><li>[Notebook Example](https://github.com/whylabs/whylogs-examples/tree/mainline/github-actions)</li></ul> |
| RAPIDS | Use whylogs in RAPIDS environment | <ul><li>[Notebook Example](https://github.com/whylabs/whylogs-examples/blob/mainline/python/RAPIDS%20GPU%20Integration%20Example.ipynb)</li><li>[Monitoring High-Performance Machine Learning Models with RAPIDS and whylogs](https://whylabs.ai/blog/posts/monitoring-high-performance-machine-learning-models-with-rapids-and-whylogs)</li></ul> |
| Java | Run whylogs in Java environment| <ul><li>[Notebook Example](https://github.com/whylabs/whylogs-examples/blob/mainline/java/demo1/src/main/java/com/whylogs/examples/WhyLogsDemo.java)</li></ul> |
| Scala | Run whylogs in Scala environment| <ul><li>[Notebook Example](https://github.com/whylabs/whylogs-examples/blob/mainline/scala/src/main/scala/WhyLogsDemo.scala)</li></ul> |
| Docker | Run whylogs in Docker | |
| AWS S3 | Store whylogs profiles in S3 | <ul><li>[S3 example](https://github.com/whylabs/whylogs-examples/blob/mainline/python/S3%20example.ipynb)</li></ul> |

## Roadmap

# Who maintains whylogs?
whylogs is maintained by [WhyLabs](https://whylabs.ai).

## Community

If you have any questions, comments, or just want to hang out with us, please join [our Slack channel](http://join.slack.whylabs.ai/).

If you want to see whylogs in action in enterprise settings with complex visualizations, check out the [WhyLabs Platform Sandbox](http://try.whylabsapp.com/).
You'll need a GitHub/Google/LinkedIn account to log in and view the sandbox (it's a 1-click experience!).

## Contribute

We welcome contributions to whylogs. Please see our [development guide](https://github.com/whylabs/whylogs/blob/mainline/DEVELOPMENT.md) for details.






Binary file added images/Whylabs-Dots-Light-Bg.png
Binary file added images/integrations.001.png
