docs(clean): add documentation for clean_df()

sfu-db · Apr 21, 2021 · 4ecf0d7 · 4ecf0d7
1 parent 2812062
commit 4ecf0d7
Showing 1 changed file with 333 additions and 0 deletions.
diff --git a/docs/source/user_guide/clean/clean_df.ipynb b/docs/source/user_guide/clean/clean_df.ipynb
@@ -0,0 +1,333 @@
+{
+ "cells": [
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    ".. _clean_df_user_guide:\n",
+    "\n",
+    "Pandas DataFrame\n",
+    "=============="
+   ]
+  },
+  {
+   "cell_type": "raw",
+   "metadata": {},
+   "source": [
+    "Introduction\n",
+    "------------\n",
+    "\n",
+    "The function :func:`clean_df() <dataprep.clean.clean_df.clean_df>` performs a set of operations that would be useful for cleaning and standardizing a full Pandas DataFrame."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In general, `clean_df()` is designed as a function to be called initially when users want to clean a DataFrame. It currently supports the following functionalities via parameters:\n",
+    "\n",
+    "1. `clean_headers`:  aims to clean column headers in a DataFrame and standardizes them in a desired format. It uses the default settings from the function \"clean_headers()\", which set column names to \"snake\" case and strip accents.\n",
+    "2. `data_type_detection`: aims to do a column-wise semantic and/or atomic data type detection for the whole DataFrame.\n",
+    "    - `\"semantic\"`: it helps users identify semantic variables (e.g., email) that are supported by DataPrep validating functions. Hence, users can later call the corresponding cleaning functions on these variables (e.g., \"clean_email()\").\n",
+    "    - `\"atomic\"`: it helps users identify the basic data types that are built into Python (e.g., interger, floating, string).\n",
+    "    - `\"none\"`: no results will be returned.\n",
+    "3. `standardize_missing_values`: aims to standardize all missing entries to be a uniform format.\n",
+    "    - `\"fill\"`: it sets all detected missing values to be \"np.nan\" or \"pd.NaT\".\n",
+    "    - `\"remove\"`: it removes all rows with any missing values (hence, return a complete DataFrame).\n",
+    "    - `\"ignore\"`: no actions will be taken.\n",
+    "4. `remove_duplicate_entries`: aims to remove the repetitive rows in the DataFrame.\n",
+    "5. `downcast_memory`: aims to downcast the column data types to be the smallest possible, which saves storage memory for the DataFrame.\n",
+    "\n",
+    "After cleaning, a **report** is printed that provides the following information:\n",
+    "* The number and percentage of header values that were cleaned from `clean_headers`.\n",
+    "* A list of supported semantic data types by DataPrep from `data_type_detection`: \"semantic\".\n",
+    "* How many data entries are removed from `standardize_missing_values`: \"remove\" and/or `remove_duplicate_entries`.\n",
+    "* How much memory is saved only from `downcast_memory`.\n",
+    "\n",
+    "The following sections demonstrate the functionality of `clean_df()`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## An example dirty dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "df = pd.DataFrame({\"Name\":\n",
+    "                   [\"Abby\", \"Scott\", \"Scott\", \"Scott2\", np.nan, \"NULL\"],\n",
+    "                   \"AGE\":\n",
+    "                   [12, 33, 33, 56,  np.nan, \"NULL\"],\n",
+    "                   \"weight__\":\n",
+    "                   [32.5, 47.1, 47.1, 55.2, np.nan, \"NULL\"],\n",
+    "                   \"Admission Date\":\n",
+    "                   [\"2020-01-01\", \"2020-01-15\", \"2020-01-15\", \n",
+    "                    \"2020-09-01\", pd.NaT, \"NULL\"],\n",
+    "                   \"email_address\":\n",
+    "                   [\"abby@gmail.com\",\"scott@gmail.com\", \"scott@gmail.com\", \"test@abc.com\", np.nan, \"NULL\"],\n",
+    "                   \"Country of Birth\":\n",
+    "                   [\"CA\",\"Canada\", \"Canada\", \"NULL\", np.nan, \"NULL\"],\n",
+    "                   \"Contact (Numbers)\":\n",
+    "                   [\"1-789-456-0123\",\"1-123-456-7890\",\"1-123-456-7890\",\"1-456-123-7890\", np.nan, \"NULL\" ],\n",
+    "                   \n",
+    "})\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Default `clean_df()`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By default, `clean_df()` will always return **a tuple of two DataFrames**, with one for data types and another one for the cleaned DataFrame. The default parameters are set to be:\n",
+    "* `clean_headers`: True\n",
+    "* `data_type_detection`: \"semantic\"\n",
+    "* `standardize_missing_values`: \"fill\"\n",
+    "* `remove_duplicate_entries`: False\n",
+    "* `downcast_memory`: True"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dataprep.clean import clean_df\n",
+    "inferred_dtypes, cleaned_df = clean_df(df)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "inferred_dtypes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The first returned DataFrame presents the semantic data types for each variable. Note that when using \"semantic\", the corresponding atomic results will also be returned.  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cleaned_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The second returned DataFrame presents the cleaned DataFrame.\n",
+    "\n",
+    "Note that the floating columns (e.g., \"weight\") may look to lose some information by downcasting memory. Actually nowhere is precision lost, it is due to how the numbers get displayed in Python (more explanations are available in section 5). "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. `data_type_detection` parameter\n",
+    "\n",
+    "### semantic\n",
+    "\n",
+    "Consider the efficiency on medium or large datasets (>=10,000 data entries), the function will randomly query a subset of valid data entries to best infer the semantic data types. Currently, `clean_df()` supports to detect the following semantic data types, which all have the corresponding `clean_$()` function for users to call.\n",
+    "* Country Names (`country`)\n",
+    "* Email Addresses (`email`)\n",
+    "* Geographic Coordinates (`coordinate`)\n",
+    "* IP Address (`ip`)\n",
+    "* Phone Numbers (`phone`)\n",
+    "* URLs (`url`)\n",
+    "* US Street Addresses (`address`)\n",
+    "* ...more types will be supported with the development\n",
+    "\n",
+    "Note that \"semantic\" will also generate a report to show the semantic data types that are supported by DataPrep to further clean."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "inferred_dtypes, _ = clean_df(df, data_type_detection = \"semantic\")\n",
+    "inferred_dtypes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### atomic\n",
+    "The \"atomic\" option can give users an overview of the data types. Note that \"atomic\" does not generate a report. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "inferred_dtypes, _ = clean_df(df, data_type_detection = \"atomic\")\n",
+    "inferred_dtypes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### none\n",
+    "If \"none\", then no data type detection will be performed and `clean_df()` will return a single cleaned DataFrame only."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cleaned_df = clean_df(df, data_type_detection = \"none\")\n",
+    "cleaned_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. `standardize_missing_values`: \"remove\"\n",
+    "\n",
+    "This section demonstrates how the \"remove\" option works. It removes all rows with **any** missing values and returns a complete DataFrame. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "_, cleaned_df = clean_df(df, standardize_missing_values = \"remove\")\n",
+    "cleaned_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. `remove_duplicate_entries` parameter"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "_, cleaned_df = clean_df(df, remove_duplicate_entries = True)\n",
+    "cleaned_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that if multiple rows are all standardized in step 3 to be completely missing rows, this `remove_duplicate_entries = True` will also treat them as duplicated entries and only keep one."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. `downcast_memory` parameter\n",
+    "\n",
+    "By default and depending on the environment, Pandas will read numerical data and commonly store them using \"`int64`/`float64`\" types. The storage memory can usually be saved by assigning subtypes (e.g., `int8`), if the range of all data in a numerical column is not extreme. Also, Pandas stores categorical columns as `object`. This can be optimized by using the `category` type for columns with low cardinality. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "_, cleaned_df = clean_df(df, downcast_memory = True, report = False)\n",
+    "cleaned_df.dtypes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can see that the numeric columns for `cleaned_df` use downcasted `int8` and `float32` types. \n",
+    "\n",
+    "However, the floating column \"weight\" may look to lose some precision. This is due to how Python Numpy displays decimals."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cleaned_df['weight']"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The actual values are not influenced. For safety, Numpy tends to use `float32` as the downcasted data type."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cleaned_df['weight'][1]"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}