perf(clean): update documentation of clean_duplication
qidanrui committed Jun 8, 2021
1 parent b941801 commit 50f90fa
Showing 1 changed file with 41 additions and 3 deletions.
44 changes: 41 additions & 3 deletions docs/source/user_guide/clean/clean_duplication.ipynb
@@ -61,7 +61,7 @@
"compared using the Levenshtein distance function. If two values have a distance less than \n",
"or equal to the given radius, they are added to the same cluster. Textboxes are provided for choosing the block size and the radius.\n",
"\n",
"The [python-Levenshtein](https://github.com/ztane/python-Levenshtein) library is used for a fast levenshtein distance implementation.\n",
"The [Levenshtein](https://github.com/polm/levenshtein) library is used for a fast Levenshtein distance implementation.\n",
"\n",
"Clustering methods are taken from the [OpenRefine](https://github.com/OpenRefine/OpenRefine) project and the [simile-vicino](https://code.google.com/archive/p/simile-vicino/n) project. You can read more about these clustering methods [here](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth).\n",
"\n",
@@ -107,13 +107,41 @@
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"cities = pd.DataFrame(\n",
" {\n",
" \"city\": [\n",
" \"Québec\",\n",
" \"Quebec\",\n",
" \"Vancouver\",\n",
" \"Vancouver\",\n",
" \"vancouver\",\n",
" \" Vancuver \",\n",
" \"Toronto\",\n",
" \"Toront\",\n",
" \"Tronto\",\n",
" \"Ottowa\",\n",
" \"otowa\"\n",
" ]\n",
" }\n",
")\n",
"cities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Default `clean_duplication()`\n",
"\n",
"By default the `df_var_name` parameter is set to \"df\" and the `page_size` variable is set to 5. Clustering methods can be toggled using the dropdown menu at the top of the GUI. Select which clusters you would like to merge using the checkboxes under the \"Merge?\" heading. Then press the \"Merge and Re-Cluster\" button to merge the cluster. If the \"export code\" checkbox is selected, code for merging the clusters will be created and added to the notebook cell. Finally, you can press the \"finish\" button to close the GUI and see the final DataFrame created."
"By default, the `df_var_name` parameter is set to `default`, which means the name of the final result DataFrame is prefixed with the name of the input DataFrame, and the `page_size` parameter is set to 5. Clustering methods can be toggled using the dropdown menu at the top of the GUI. Select the clusters you would like to merge using the checkboxes under the \"Merge?\" heading, then press the \"Merge and Re-Cluster\" button to merge them. If the \"export code\" checkbox is selected, code for merging the clusters will be created and added to the notebook cell. Finally, press the \"finish\" button to close the GUI and see the final DataFrame.\n"
]
},
{
@@ -126,6 +154,16 @@
"clean_duplication(df, \"city\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dataprep.clean import clean_duplication\n",
"clean_duplication(cities, \"city\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -179,7 +217,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.6.12"
}
},
"nbformat": 4,

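Note on the radius rule described in the notebook text ("if two values have a distance less than or equal to the given radius they are added to the same cluster"): the following is a minimal sketch of that idea, not the code `clean_duplication` actually runs. It uses a pure-Python edit distance instead of the Levenshtein library, skips the blocking step, and groups greedily against the first member of each cluster. The helper names `levenshtein` and `cluster_by_radius` are hypothetical and exist only for this illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute (or match)
                )
            )
        prev = curr
    return prev[-1]


def cluster_by_radius(values: List[str], radius: int) -> List[List[str]]:
    """Hypothetical helper: put each value into the first cluster whose
    first member is within `radius` edits, else start a new cluster."""
    clusters: List[List[str]] = []
    for value in values:
        for cluster in clusters:
            if levenshtein(value, cluster[0]) <= radius:
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters


# A subset of the city names from the notebook cell added in this commit.
cities = [
    "Québec", "Quebec", "Vancouver", "vancouver",
    "Toronto", "Toront", "Tronto", "Ottowa", "otowa",
]
clusters = cluster_by_radius(cities, radius=2)
for cluster in clusters:
    print(cluster)
```

With a radius of 2 the misspellings fall into the same group as the first spelling seen for each city. A smaller radius would split some of them apart; for example, "otowa" is two edits away from "Ottowa".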