docs(clean): improve DataPrep.Clean ReadMe

sfu-db · Mar 4, 2021 · a0bc96b
1 parent bcbc2dd

Showing 3 changed files with 39 additions and 46 deletions.
diff --git a/README.md b/README.md
@@ -126,45 +126,43 @@ Check [plot](https://sfu-db.github.io/dataprep/user_guide/eda/plot.html), [plot_
 
 ## Clean
 
-DataPrep.Clean contains simple functions designed for cleaning and standardizing a column in a DataFrame. It provides
-
-- A unified API: each function follows the syntax `clean_{type}(df, "column name")` (see an example below)
-- Python Data Science Support: its design for cleaning pandas and Dask DataFrames enables seamless integration into the Python data science workflow
-- Transparency: a report is generated that summarizes the alterations to the data that occured during cleaning
-
-The following example shows how to clean a column containing messy emails:
-
-<center><img src="https://github.com/sfu-db/dataprep/blob/develop/assets/clean_example_1.jpg"/></center>
-<center><img src="https://github.com/sfu-db/dataprep/blob/develop/assets/clean_example_2.jpg"/></center>
+DataPrep.Clean contains simple functions designed for cleaning and validating data in a DataFrame. It provides
+
+- **A Unified API**: each function follows the syntax `clean_{type}(df, 'column name')` (see an example below).
+- **Speed**: the computations are parallelized using Dask. It can clean **50K rows per second** on a dual-core laptop (that means cleaning 1 million rows in only 20 seconds).
+- **Transparency**: a report is generated that summarizes the alterations to the data that occurred during cleaning.
+
+The following example shows how to clean and standardize a column of country names.
+
+``` python
+from dataprep.clean import clean_country
+import pandas as pd
+df = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})
+df2 = clean_country(df, 'country')
+df2
+           country  country_clean
+0              USA  United States
+1  country: Canada         Canada
+2              233        Estonia
+3              tr          Turkey
+4               NA            NaN
+```
 
 Type validation is also supported:
 
-<center><img src="https://github.com/sfu-db/dataprep/blob/develop/assets/clean_example_3.jpg"/></center>
-
-Below are the supported semantic types (more are currently being developed).
-
-<table>
-    <tr>
-      <th>Semantic Types</th>
-    </tr>
-    <tr>
-      <td>longitude/latitude</td>
-    </tr>
-    <tr>
-      <td>country</td>
-    </tr>
-    <tr>
-      <td>email</td>
-    </tr>
-    <tr>
-      <td>url</td>
-    </tr>
-    <tr>
-      <td>phone</td>
-    </tr>
-  </table>
+``` python
+from dataprep.clean import validate_country
+series = validate_country(df['country'])
+series
+0     True
+1    False
+2     True
+3     True
+4    False
+Name: country, dtype: bool
+```
 
-For more information, refer to the [User Guide](https://sfu-db.github.io/dataprep/user_guide/clean/introduction.html).
+**Currently supports functions for:** Column Headers | Country Names | Dates and Times | Email Addresses | Geographic Coordinates | IP Addresses | Phone Numbers | URLs | US Street Addresses
 
 ## Documentation
 

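Note on the speed claim added to the README hunk above: the quoted 50K rows per second and the "1 million rows in only 20 seconds" remark are consistent with each other, as the minimal sketch below shows. The helper name `seconds_to_clean` is hypothetical and not part of DataPrep, and the throughput constant is simply the figure quoted in the README, not a measurement.

```python
# Throughput figure quoted in the README (an assumption taken from the text,
# not something measured here).
ROWS_PER_SECOND = 50_000


def seconds_to_clean(n_rows: int, rows_per_second: int = ROWS_PER_SECOND) -> float:
    """Estimate cleaning time in seconds at a constant throughput."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return n_rows / rows_per_second


# One million rows at 50K rows per second.
print(seconds_to_clean(1_000_000))  # 20.0
```

The estimate assumes throughput stays constant as the data grows, which the README does not claim; actual timings will depend on the column type and the hardware.
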
diff --git a/docs/source/user_guide/clean/clean_address.ipynb b/docs/source/user_guide/clean/clean_address.ipynb
@@ -9,7 +9,7 @@
     ".. _address_userguide:\n",
     "\n",
     "US Street Addresses\n",
-    "============="
+    "==================="
    ]
   },
   {
@@ -3577,7 +3577,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.3"
+   "version": "3.9.1"
   }
  },
  "nbformat": 4,

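Note on the heading change in `clean_address.ipynb` above: reStructuredText expects a section title's underline to be at least as long as the title text, and docutils reports a "Title underline too short" warning otherwise. A minimal sketch of that check follows; the helper `underline_is_valid` is hypothetical and not part of Sphinx or docutils.

```python
import string


def underline_is_valid(title: str, underline: str) -> bool:
    """Return True if `underline` is one repeated punctuation character
    that is at least as long as `title` (trailing whitespace ignored)."""
    text = title.rstrip()
    line = underline.rstrip()
    # The underline must be non-empty and made of a single repeated
    # punctuation character, such as "=" or "-".
    if not line or len(set(line)) != 1 or line[0] not in string.punctuation:
        return False
    return len(line) >= len(text)


title = "US Street Addresses"  # 19 characters
print(underline_is_valid(title, "=" * 13))          # False: shorter than the title
print(underline_is_valid(title, "=" * len(title)))  # True: matches the title length
```

A check like this could be run over the first cell of each user-guide notebook before building the docs, so short underlines are caught without waiting for a Sphinx warning.
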
diff --git a/docs/source/user_guide/clean/introduction.ipynb b/docs/source/user_guide/clean/introduction.ipynb
@@ -7,7 +7,7 @@
     "# Clean\n",
     "\n",
     "\n",
-    "This section introduces the data cleaning component of DataPrep."
+    "DataPrep.Clean provides functions for quickly and easily cleaning and validating your data."
    ]
   },
   {
@@ -24,6 +24,8 @@
    "source": [
     "## Section Contents\n",
     "\n",
+    "DataPrep.Clean currently contains functions to clean:\n",
+    "\n",
     " * [Column Headers](clean_headers.ipynb)\n",
     " * [Country Names](clean_country.ipynb)\n",
     " * [Email Addresses](clean_email.ipynb)\n",
@@ -33,13 +35,6 @@
     " * [URLs](clean_url.ipynb)\n",
     " * [US Street Addresses](clean_address.ipynb)"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {
@@ -59,7 +54,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.3"
+   "version": "3.9.1"
   }
  },
  "nbformat": 4,