# Auto EDA Tool Comparison
- Many auto-EDA tools have appeared in recent years, so it's good to know which one to use in a workflow
- Comparing Pandas Profiler (now ydata-profiling), SweetViz, AutoViz, Lux and DataPrep
- Usability and style are the biggest factors. In terms of features all are very similar: distributions, correlations, missing values.
- Testing on a tabular Kaggle dataset (Multi-Class Prediction of Obesity Risk)
- The ones that are not actively maintained will break with new versions of jupyter and numpy
- Had to clear the cell outputs: the interactive views don't work in the viewer and are also quite large
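
Since breakage mostly comes from version drift, it can help to record which versions the environment actually has before testing each tool. A minimal sketch using only the standard library; the distribution names in `PACKAGES` are assumed to be the PyPI names and may need adjusting:

```python
from importlib import metadata

# Assumed PyPI distribution names; adjust to match your environment.
PACKAGES = [
    "numpy", "pandas", "jupyterlab", "markupsafe",
    "ydata-profiling", "sweetviz", "autoviz", "lux-api", "dataprep",
]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:16} {version or 'not installed'}")
```

Keeping this output next to the notes makes it easier to tell later whether a failure came from the tool or from the environment.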

## Summary 
- Pandas Profiler is best: easy to use and configurable for extra complexity. Actively maintained and very neat.
- DataPrep is the second choice, perhaps to dig deeper after using Pandas Profiler. Has env issues with new numpy versions.
- SweetViz and AutoViz are ok but not very usable (either messy viz or messy api) and not up to date
- Lux is seriously out of date, couldn't get it to work without changing the base env

## Pandas Profiler
- very clean and neatly organised
- great docs
- nice balance between ease of use and advanced usage
- configs are not the most pythonic but do the job
- would be nice to have dedicated target analysis
- compatible with newer libraries and up to date 

## SweetViz
- no docs?
- viz style not great, too cluttered
- Has target analysis options but doesn't work for categorical :-/
- Doesn't seem to be actively maintained

## AutoViz
- api fairly complicated, would have separated reader from df api.
- not great separation of use-cases: basic overview and more complex drilldown.
- target analysis pretty cool
- output is quite messy

## Lux
- Not a fan of overriding pandas api
- Can't get it to work with jupyterlab 4.2 `UserWarning: ValueError: The extension "luxwidget" does not yet support the current version of JupyterLab.'`
- While it's ok to have to use older package versions inside the venv, it's not acceptable to have to change the stack outside of the venv
- Doesn't seem to be maintained, last commit in 2023

## DataPrep
- Had to downgrade numpy and markupsafe
- Stats and insights quite helpful
- Nice that each specialised call (correlation, missing) has multiple metrics and displays to choose from
- Good balance between high-level reports and drill down options using different apis
- Viz not as tidy as pandas-profiler

In [None]:
import pandas as pd
train_df = pd.read_csv("/data/datasets/kaggle/Multi-Class-Prediction-of-Obesity-Risk/train.csv")
test_df = pd.read_csv("/data/datasets/kaggle/Multi-Class-Prediction-of-Obesity-Risk/test.csv")

# Pandas Profiler
- very clean and neatly organised
- great docs
- nice balance between ease of use and advanced usage via additional configs
- would be nice to have dedicated target analysis

In [None]:
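import ydata_profiling as ydp

# Explicitly enable the "auto" and Spearman correlation calculations
conf = {"correlations": {"auto": {"calculate": True}, "spearman": {"calculate": True}}}

# One report per split, built with the same settings
train_report = ydp.ProfileReport(train_df, **conf)
test_report = ydp.ProfileReport(test_df, **conf)

# Side-by-side comparison of train vs test
train_report.compare(test_report)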

# SweetViz
- no docs?
- viz style not great, too cluttered
- Has target analysis options but doesn't work for categorical :-/
- Doesn't seem to be actively maintained

In [None]:
import sweetviz as sv

report = sv.compare([train_df, "Train"], [test_df, "Test"])
report.show_notebook()
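
Where the built-in target analysis falls short for a categorical target, a manual fallback in plain pandas is fairly short. The sketch below is not part of any of the tools compared here. The toy frame, its column names and the helper names `numeric_by_target` and `categorical_by_target` are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the real data; columns and values are hypothetical.
df = pd.DataFrame({
    "Age": [21, 35, 29, 41, 23, 38],
    "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
    "target": ["Normal", "Obese", "Normal", "Obese", "Normal", "Obese"],
})


def numeric_by_target(frame, target):
    """Mean of every numeric column per target class."""
    return frame.groupby(target).mean(numeric_only=True)


def categorical_by_target(frame, column, target):
    """Share of each target class within each category (rows sum to 1)."""
    return pd.crosstab(frame[column], frame[target], normalize="index")


print(numeric_by_target(df, "target"))
print(categorical_by_target(df, "Gender", "target"))
```

On the dataset used here the same helpers would presumably be called with `train_df` and `"NObeyesdad"` as the target column.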

# AutoViz
- api fairly complicated, would have separated reader from df api.
- not great separation of use-cases: basic overview and more complex drilldown.
- target analysis pretty cool
- output is quite messy

In [None]:
from autoviz import AutoViz_Class
AV = AutoViz_Class()

dft = AV.AutoViz(
    filename="",  # left empty because the data is passed in as a dataframe
    dfte=train_df,
    depVar="NObeyesdad",  # target column
    chart_format="bokeh",
)


# Lux
- Not a fan of overriding pandas api
- Can't get it to work with jupyterlab 4.2 `UserWarning: ValueError: The extension "luxwidget" does not yet support the current version of JupyterLab.'`
- While it's ok to have to use older package versions inside the venv, it's not acceptable to have to change the stack outside of the venv
- Doesn't seem to be maintained, last commit in 2023

In [None]:
import lux
lux.debug_info()
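
To tell whether a fix would stay inside the venv or touch the stack outside it, it helps to check where the kernel is running and which JupyterLab it can see. A rough standard-library sketch; the helper names are made up, and it assumes a `venv`/`virtualenv` style environment (conda environments are not detected this way):

```python
import sys
from importlib import metadata


def in_virtualenv():
    """True when the interpreter runs inside a venv/virtualenv."""
    # In a venv, sys.prefix points at the venv while sys.base_prefix
    # points at the interpreter the venv was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def jupyterlab_version():
    """JupyterLab version visible to this interpreter, or None."""
    try:
        return metadata.version("jupyterlab")
    except metadata.PackageNotFoundError:
        return None


print("inside venv:", in_virtualenv())
print("jupyterlab :", jupyterlab_version() or "not installed in this environment")
```

If JupyterLab shows as not installed here, the frontend is probably served from a different environment than the kernel, so pinning packages inside the venv would not change what the widget extension sees.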

# DataPrep
- Had to downgrade numpy and markupsafe
- Stats and insights quite helpful
- Nice that each specialised call (correlation, missing) has multiple metrics and displays to choose from
- Good balance between high-level reports and drill down options using different apis
- Viz not as tidy as pandas-profiler

In [None]:
from dataprep.eda import plot, plot_correlation, plot_missing, create_report, plot_diff

In [None]:
create_report(train_df)

In [None]:
plot(train_df)

In [None]:
plot(train_df, "NObeyesdad")

In [None]:
plot_correlation(train_df)

In [None]:
plot_missing(train_df)

In [None]:
plot_diff([train_df, test_df])