docs(clean): improve DataPrep.Clean ReadMe
Brandon Lockhart committed Mar 4, 2021
1 parent bcbc2dd commit a0bc96b
Showing 3 changed files with 39 additions and 46 deletions.
68 changes: 33 additions & 35 deletions README.md
@@ -126,45 +126,43 @@ Check [plot](https://sfu-db.github.io/dataprep/user_guide/eda/plot.html), [plot_

## Clean

DataPrep.Clean contains simple functions designed for cleaning and standardizing a column in a DataFrame. It provides

- A unified API: each function follows the syntax `clean_{type}(df, "column name")` (see an example below)
- Python Data Science Support: its design for cleaning pandas and Dask DataFrames enables seamless integration into the Python data science workflow
- Transparency: a report is generated that summarizes the alterations to the data that occurred during cleaning

The following example shows how to clean a column containing messy emails:

<center><img src="https://github.com/sfu-db/dataprep/blob/develop/assets/clean_example_1.jpg"/></center>
<center><img src="https://github.com/sfu-db/dataprep/blob/develop/assets/clean_example_2.jpg"/></center>
DataPrep.Clean contains simple functions designed for cleaning and validating data in a DataFrame. It provides

- **A Unified API**: each function follows the syntax `clean_{type}(df, 'column name')` (see an example below).
- **Speed**: the computations are parallelized using Dask. DataPrep.Clean can clean **50K rows per second** on a dual-core laptop (that is, 1 million rows in only 20 seconds).
- **Transparency**: a report is generated that summarizes the alterations to the data that occurred during cleaning.
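
As a quick sanity check on the throughput claim above, the arithmetic can be written out. This is only an illustrative calculation: the variable names are ours, and the 50K rows per second figure is taken from the text, not measured here.

``` python
# Figures taken from the Speed bullet above; not measured here.
rows_per_second = 50_000
total_rows = 1_000_000

# Time needed at the stated throughput.
seconds_needed = total_rows / rows_per_second
print(seconds_needed)  # 20.0
```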

The following example shows how to clean and standardize a column of country names.

``` python
from dataprep.clean import clean_country
import pandas as pd
df = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})
df2 = clean_country(df, 'country')
df2
           country  country_clean
0              USA  United States
1  country: Canada         Canada
2              233        Estonia
3              tr          Turkey
4               NA            NaN
```

Type validation is also supported:

<center><img src="https://github.com/sfu-db/dataprep/blob/develop/assets/clean_example_3.jpg"/></center>

Below are the supported semantic types (more are currently being developed).

<table>
<tr>
<th>Semantic Types</th>
</tr>
<tr>
<td>longitude/latitude</td>
</tr>
<tr>
<td>country</td>
</tr>
<tr>
<td>email</td>
</tr>
<tr>
<td>url</td>
</tr>
<tr>
<td>phone</td>
</tr>
</table>
``` python
from dataprep.clean import validate_country
series = validate_country(df['country'])
series
0     True
1    False
2     True
3     True
4    False
Name: country, dtype: bool
```
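
The boolean results shown above can serve as a mask for selecting the valid rows. The following is a minimal sketch in plain Python that hard-codes the values and flags from the example rather than calling `validate_country`; the variable names are illustrative assumptions.

``` python
# Values and validation results copied from the example above
# (hard-coded here, not computed by dataprep).
countries = ['USA', 'country: Canada', '233', ' tr ', 'NA']
is_valid = [True, False, True, True, False]

# Keep only the entries whose validation flag is True.
valid_countries = [c for c, ok in zip(countries, is_valid) if ok]
print(valid_countries)  # ['USA', '233', ' tr ']
```

With pandas, the same selection is ordinary boolean indexing, for example `df[series]`.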

For more information, refer to the [User Guide](https://sfu-db.github.io/dataprep/user_guide/clean/introduction.html).
**Currently supports functions for:** Column Headers | Country Names | Dates and Times | Email Addresses | Geographic Coordinates | IP Addresses | Phone Numbers | URLs | US Street Addresses

## Documentation

4 changes: 2 additions & 2 deletions docs/source/user_guide/clean/clean_address.ipynb
@@ -9,7 +9,7 @@
".. _address_userguide:\n",
"\n",
"US Street Addresses\n",
"============="
"==================="
]
},
{
@@ -3577,7 +3577,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.9.1"
}
},
"nbformat": 4,
13 changes: 4 additions & 9 deletions docs/source/user_guide/clean/introduction.ipynb
@@ -7,7 +7,7 @@
"# Clean\n",
"\n",
"\n",
"This section introduces the data cleaning component of DataPrep."
"DataPrep.Clean provides functions for quickly and easily cleaning and validating your data."
]
},
{
@@ -24,6 +24,8 @@
"source": [
"## Section Contents\n",
"\n",
"DataPrep.Clean currently contains functions to clean:\n",
"\n",
" * [Column Headers](clean_headers.ipynb)\n",
" * [Country Names](clean_country.ipynb)\n",
" * [Email Addresses](clean_email.ipynb)\n",
@@ -33,13 +35,6 @@
" * [URLs](clean_url.ipynb)\n",
" * [US Street Addresses](clean_address.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -59,7 +54,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.9.1"
}
},
"nbformat": 4,