test(eda):add densify test and doc for diff
jinglinpeng committed Oct 26, 2021
1 parent 323ae6b commit f8d2054
Showing 2 changed files with 61 additions and 43 deletions.
1 change: 1 addition & 0 deletions dataprep/tests/eda/test_plot_diff.py
@@ -79,3 +79,4 @@ def test_dataset() -> None:
df1 = df[df["Survived"] == 0]
df2 = df[df["Survived"] == 1]
plot_diff([df1, df2])
plot_diff([df1, df2], config={"diff.density": True})
103 changes: 60 additions & 43 deletions docs/source/user_guide/eda/plot_diff.ipynb
@@ -1,53 +1,36 @@
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": 3
},
"orig_nbformat": 2
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# `plot_diff()`: analyze differences"
],
"cell_type": "markdown",
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
],
"cell_type": "markdown",
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function `plot_diff()` explores differences in column distributions and statistics across multiple datasets.\n",
"\n",
"Next, we demonstrate the functionality of `plot_diff()`."
],
"cell_type": "markdown",
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the dataset\n",
"\n",
"`dataprep.eda` supports **Pandas** and **Dask** dataframes. Here, we load the house prices training and testing datasets into Pandas dataframes."
],
"cell_type": "markdown",
"metadata": {}
]
},
{
"cell_type": "code",
@@ -64,18 +47,18 @@
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get an overview of the datasets with `plot_diff([df1, df2])`"
],
"cell_type": "markdown",
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We start by calling `plot_diff([df1, df2])`, which computes dataset-level statistics, a histogram for each numerical column, and a bar chart for each categorical column across the two dataframes. The number of bins in the histogram can be specified with the parameter `bins`, and the number of categories in the bar chart with the parameter `ngroups`. If a column contains missing values, the percentage of missing values is shown in the title, and the missing values are ignored when generating the plots."
],
"cell_type": "markdown",
"metadata": {}
]
},
{
"cell_type": "code",
@@ -88,13 +71,13 @@
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set custom labels for the comparison\n",
"\n",
"To give the datasets more descriptive names, set the `diff.label` parameter."
],
"cell_type": "markdown",
"metadata": {}
]
},
{
"cell_type": "code",
@@ -106,15 +89,15 @@
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Change the baseline dataset used for comparison\n",
"\n",
"By default, the first dataset is used as the baseline for computing the distributions and statistics. To use a different dataset as the baseline, set the `diff.baseline` parameter to its index.\n",
"\n",
"The baseline index starts at `0`, unlike the default labels, which are numbered from `1`."
],
"cell_type": "markdown",
"metadata": {}
]
},
{
"cell_type": "code",
@@ -124,6 +107,40 @@
"source": [
"plot_diff([df1, df2], config={\"diff.baseline\": 1})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Change to density plot\n",
"\n",
"By default, numerical columns are compared using histograms. You can switch to a density plot with the `diff.density` parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plot_diff([df1, df2], config={\"diff.density\": True})"
]
}
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": 3
},
"orig_nbformat": 2
},
"nbformat": 4,
"nbformat_minor": 2
}
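
A note on the new `diff.density` option: the sketch below is not dataprep's implementation. It is a stdlib-only illustration of why a density scale can help when the compared dataframes differ in size, as `df1` and `df2` in the test may. It assumes "density" means bin counts divided by `n * bin_width`, so that each histogram has total area 1. The helper name `histogram` and the sample data are hypothetical.

```python
from typing import List, Sequence


def histogram(
    values: Sequence[float], edges: Sequence[float], density: bool = False
) -> List[float]:
    """Bin `values` into the intervals defined by consecutive `edges`.

    Hypothetical helper for illustration only; not part of dataprep.
    Bins are half-open [lo, hi), except the last one, which also
    includes its right edge.
    """
    counts = [0.0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if not density or total == 0:
        return counts
    # Assumption: density = count / (n * bin_width), so the area sums to 1.
    return [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


edges = [0.0, 10.0, 20.0, 30.0]
small = [5.0, 15.0, 15.0, 25.0]  # 4 rows
large = small * 10               # 40 rows with the same distribution

raw_small = histogram(small, edges)
raw_large = histogram(large, edges)
dens_small = histogram(small, edges, density=True)
dens_large = histogram(large, edges, density=True)
```

With raw counts, the larger dataset's bars are ten times taller even though both samples have the same shape. Under the density assumption above, the two sets of bars coincide, which makes differences in shape easier to spot.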
