## EDA Toolkit Demo

This notebook provides a walkthrough of key functionalities from the EDA Toolkit, a lightweight Python library designed to streamline exploratory data analysis. Whether you're working in Jupyter, building reports for stakeholders, or preparing datasets for modeling, the toolkit helps automate the process of generating summary statistics, visualizations, and Markdown exports.

In this demo, we'll walk through:

- Importing and reading data
- Creating summary tables (Table 1 style)
- Exporting clean, readable output for use in documentation or publication
- Plotting a histogram with mean, median, and standard deviation overlays

In [1]:
import pandas as pd
import eda_toolkit

print(f"EDA Toolkit version: {eda_toolkit.__version__}")

EDA Toolkit version: 0.0.19


### Read in the Data

In [None]:
from ucimlrepo import fetch_ucirepo

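# Fetch dataset id=2 from the UCI Machine Learning Repository
dataset = fetch_ucirepo(id=2)

# Join features and targets into a single DataFrame
df = dataset.data.features.join(dataset.data.targets, how="inner")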

In [4]:
# Save a local copy of df in case the UCI repository is unreachable later
df.to_csv("../data/raw/df.csv", index=False)
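
On later runs, the saved CSV can stand in for the download. The following is a minimal sketch of that fallback, not part of EDA Toolkit or ``ucimlrepo``: ``load_or_fetch`` is a hypothetical helper, and the ``fetch`` argument is assumed to be any function that returns the DataFrame built in the cell above.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(csv_path: str, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Hypothetical helper: read the cached CSV if present, else fetch and cache."""
    path = Path(csv_path)
    if path.exists():
        # A local copy already exists, so no download is needed
        return pd.read_csv(path)
    frame = fetch()
    # Create the target folder if it is missing, then cache the result
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return frame


# Example call (assumes fetch_adult wraps the two fetch lines above):
# df = load_or_fetch("../data/raw/df.csv", fetch_adult)
```

Passing the fetch step as a function keeps the helper independent of any particular data source.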

### Table 1

#### Example 1: Mixed Summary Table

In the example below, we generate a summary table from a dataset containing both 
categorical and continuous variables. We explicitly define which columns fall into 
each category, although the ``generate_table1`` function also supports automatic 
detection of variable types if desired.

The summary output is automatically pretty-printed in the console using the 
``table1_to_str`` utility. This formatting is applied behind the scenes whenever 
a summary table is printed, making it especially helpful for reading outputs 
within notebooks or logging environments.

In this case, we specify ``export_markdown=True`` and provide a filename via
``markdown_path``. This exports the summary in Markdown format for use in
reports, documentation, or publishing platforms like Jupyter Book or Quarto.
When ``include_types="both"`` and ``combine=True`` (the defaults), the continuous and 
categorical summaries are returned merged into a single DataFrame, while the export 
still writes two separate Markdown files with ``_continuous.md`` and 
``_categorical.md`` suffixes.

We also set ``value_counts=False`` to limit each categorical variable to a single 
summary row, rather than expanding into one row per category-level value.

In [None]:
from eda_toolkit import generate_table1

table1 = generate_table1(
    df=df,
    categorical_cols=["sex", "race", "workclass"],
    continuous_cols=["hours-per-week", "age", "education-num"],
    value_counts=False,
    max_categories=3,
    export_markdown=True,
    decimal_places=0,
    markdown_path="table1_summary.md",
)

print(table1)
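
To make the idea of a Table 1 style summary concrete, here is a rough pandas-only sketch of the kind of statistics such a table holds. It is an illustration under stated assumptions, not how ``generate_table1`` is implemented: the sample data is made up, and the helper functions and output column names are hypothetical.

```python
import pandas as pd

# Tiny made-up sample (not the dataset loaded above)
sample = pd.DataFrame(
    {
        "age": [39, 50, 38, 53, None],
        "hours-per-week": [40, 13, 40, 40, 45],
        "sex": ["Male", "Male", "Female", None, "Male"],
    }
)


def summarize_continuous(frame: pd.DataFrame, cols: list) -> pd.DataFrame:
    """One row per numeric column: center, spread, range, and missing count."""
    rows = []
    for col in cols:
        s = frame[col]
        rows.append(
            {
                "Variable": col,
                "Mean": s.mean(),
                "SD": s.std(),  # sample standard deviation (ddof=1)
                "Median": s.median(),
                "Min": s.min(),
                "Max": s.max(),
                "Missing (n)": int(s.isna().sum()),
            }
        )
    return pd.DataFrame(rows)


def summarize_categorical(frame: pd.DataFrame, cols: list) -> pd.DataFrame:
    """One row per categorical column, i.e. no per-level breakdown."""
    rows = []
    for col in cols:
        s = frame[col]
        rows.append(
            {
                "Variable": col,
                "Mode": s.mode(dropna=True).iloc[0],
                "Unique": int(s.nunique(dropna=True)),
                "Missing (n)": int(s.isna().sum()),
            }
        )
    return pd.DataFrame(rows)


cont = summarize_continuous(sample, ["age", "hours-per-week"])
cat = summarize_categorical(sample, ["sex"])
print(cont.round(1))
print(cat)
```

The single row per categorical variable mirrors the effect described for ``value_counts=False``; the exact columns the toolkit reports may differ.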

#### Example 2: Mixed Summary Table ``(value_counts=True)``

In this example, we call ``generate_table1`` without manually specifying which 
columns are categorical or continuous. Instead, the function automatically detects 
variable types based on data types. Numeric columns with two or fewer unique values 
are also reclassified as categorical by default 
(controlled via ``detect_binary_numeric=True``).

We set ``value_counts=True`` to generate a separate summary row for each unique value 
within a categorical variable, rather than a single row per variable. To keep 
the output concise, we limit each breakdown to the top 3 most frequent values 
using ``max_categories=3``.

We also enable ``export_markdown=True`` to export the summaries in Markdown format. 
Here we pass ``markdown_path="table1_summary.md"``; if no path is provided, the output 
files are saved to the current working directory.

``include_types="both"`` is the default, and because we pass ``combine=False`` here, 
the continuous and categorical summaries are returned as two separate DataFrames 
rather than merged into one (the ``combine=True`` default). Either way, two separate 
Markdown files are generated with suffixes that reflect the type of summary:

- ``table1_summary_continuous.md``
- ``table1_summary_categorical.md``

This setup is ideal for detailed reporting, especially when working with 
downstream tools like Jupyter Book, Quarto, or static site generators.
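
The relationship between ``markdown_path`` and the two file names listed above can be sketched with ``pathlib``. ``suffixed_paths`` is a hypothetical helper written for this illustration; it only reproduces the naming pattern described here and is not the toolkit's own code.

```python
from pathlib import Path


def suffixed_paths(markdown_path: str) -> dict:
    """Hypothetical helper: derive the per-type file names from markdown_path."""
    base = Path(markdown_path)
    return {
        kind: base.with_name(f"{base.stem}_{kind}{base.suffix}")
        for kind in ("continuous", "categorical")
    }


paths = suffixed_paths("table1_summary.md")
for kind, path in paths.items():
    print(kind, path)
```

After an export has run, ``path.exists()`` on each entry is a quick way to confirm that both files were written.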

In [None]:
from eda_toolkit import generate_table1

table1_cont, table1_cat = generate_table1(
    df=df,
    value_counts=True,
    max_categories=3,
    combine=False,
    export_markdown=True,
    markdown_path="table1_summary.md",
)

table1_cont = table1_cont.drop(columns=["Type", "Mode"])

In [None]:
table1_cont

In [None]:
table1_cat
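
The two behaviors described for this example, treating numeric columns with two or fewer unique values as categorical and keeping only the most frequent levels, can be approximated in plain pandas. This is a sketch under assumptions: the sample data is made up, ``split_columns`` and ``top_levels`` are hypothetical helpers, and the toolkit's actual detection rules may differ.

```python
import pandas as pd

# Made-up sample (not the dataset loaded above)
sample = pd.DataFrame(
    {
        "age": [39, 50, 38, 53, 28, 37],
        "flag": [0, 1, 1, 0, 1, 1],
        "workclass": [
            "Private", "Private", "State-gov",
            "Private", "Self-emp", "Federal-gov",
        ],
    }
)


def split_columns(frame: pd.DataFrame, detect_binary_numeric: bool = True):
    """Split columns into continuous and categorical lists by dtype."""
    continuous, categorical = [], []
    for col in frame.columns:
        s = frame[col]
        is_numeric = pd.api.types.is_numeric_dtype(s)
        is_binary = s.nunique(dropna=True) <= 2
        # Numeric columns stay continuous unless they look binary
        if is_numeric and not (detect_binary_numeric and is_binary):
            continuous.append(col)
        else:
            categorical.append(col)
    return continuous, categorical


def top_levels(series: pd.Series, max_categories: int = 3) -> pd.DataFrame:
    """One row per level, limited to the most frequent levels."""
    counts = series.value_counts().head(max_categories)
    share = (100 * counts / series.notna().sum()).round(1)
    return pd.DataFrame(
        {
            "Level": counts.index.tolist(),
            "Count": counts.to_numpy(),
            "Proportion (%)": share.to_numpy(),
        }
    )


continuous, categorical = split_columns(sample)
print(continuous, categorical)
print(top_levels(sample["workclass"]))
```

Setting ``detect_binary_numeric=False`` in the sketch keeps ``flag`` with the continuous columns, which is the trade-off the parameter name suggests.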

### Histogram Example - (Mean, Median, and Std. Deviation)

In this example, the ``kde_distributions()`` function is customized to generate 
a histogram that includes mean, median, and standard deviation lines at one, two, 
and three standard deviations. With ``plot_type="both"``, the histogram is drawn 
together with a KDE curve. The ``mean_color`` is set to ``"blue"`` and the 
``median_color`` is set to ``"black"``, allowing for a clear distinction between 
these two measures of central tendency. Both lines are enabled with 
``plot_mean=True`` and ``plot_median=True``. The ``y_axis_label`` remains
``"Density"``, indicating that the histogram represents the density of observations 
within each bin. The histogram bars are colored using ``hist_color="brown"``, 
with ``fill_alpha=0.40`` adjusting the transparency of the fill color. 
Additionally, standard deviation bands are plotted using colors ``"purple"``, 
``"green"``, and ``"silver"`` for one, two, and three standard deviations, respectively.

Only one variable, ``age``, is plotted here, so the figure is a single panel whose 
size is set with ``figsize=(6, 4)`` (6x4 inches). This setup is useful for 
visualizing the central tendency of the data while also showing the distribution 
and its spread through the standard deviation bands.

In [None]:
from eda_toolkit import kde_distributions

vars_of_interest = [
    "age",
]

kde_distributions(
    df=df,
    figsize=(6, 4),
    text_wrap=50,
    hist_color="brown",
    bbox_inches="tight",
    vars_of_interest=vars_of_interest,
    y_axis_label="Density",
    bins=10,
    fill_alpha=0.40,
    plot_type="both",
    stat="Density",
    # label_fontsize=16,  # Font size for axis labels
    # tick_fontsize=14,  # Font size for tick labels
    plot_mean=True,
    plot_median=True,
    mean_color="blue",
    median_color="black",
    std_dev_levels=[
        1,
        2,
        3,
    ],
    std_color=[
        "purple",
        "green",
        "silver",
    ],
    image_filename="age_distribution_mean_median_std",
)
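
For reference, the positions of the overlay lines described above follow from three statistics: the mean, the median, and the standard deviation. The sketch below computes them for a made-up series. ``overlay_positions`` is a hypothetical helper, and it assumes the sample standard deviation (pandas' default); which variant ``kde_distributions`` uses is not stated here.

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 67])  # made-up values


def overlay_positions(values: pd.Series, std_dev_levels=(1, 2, 3)) -> dict:
    """Return x-positions for mean, median, and each +/- k SD band."""
    mean = values.mean()
    median = values.median()
    std = values.std()  # sample standard deviation (ddof=1), an assumption
    bands = {k: (mean - k * std, mean + k * std) for k in std_dev_levels}
    return {"mean": mean, "median": median, "std_bands": bands}


pos = overlay_positions(ages)
print(f"mean={pos['mean']:.2f}, median={pos['median']:.2f}")
for k, (low, high) in pos["std_bands"].items():
    print(f"+/-{k} SD: {low:.2f} to {high:.2f}")
```

Each band is symmetric around the mean, so a median line that sits away from the center of the bands is a quick visual cue for skew.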