perf(clean): update documentation of clean_duplication

sfu-db · Jun 8, 2021 · 50f90fa
1 parent b941801
commit 50f90fa
Showing 1 changed file with 41 additions and 3 deletions.
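
Note on the notebook text touched by this commit: it describes clustering values whose Levenshtein distance is less than or equal to a chosen radius. The snippet below is only a minimal sketch of that idea in plain Python. It is not the code `clean_duplication` runs: the helper names `levenshtein` and `cluster_by_radius` are hypothetical, the distance is a hand-written dynamic-programming version rather than the Levenshtein library, and the blocking step (the "block size" setting) is omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def cluster_by_radius(values, radius):
    """Greedy grouping (hypothetical helper): a value joins the first cluster
    that already holds a member within `radius`, else it starts a new cluster."""
    clusters = []
    for value in values:
        for cluster in clusters:
            if any(levenshtein(value, member) <= radius for member in cluster):
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters


cities = ["Vancouver", "Vancuver", "Toronto", "Toront", "Tronto", "Ottowa"]
clusters = cluster_by_radius(cities, radius=1)
print(clusters)
```

With a radius of 1, the misspellings that are one edit away from "Vancouver" and "Toronto" land in the same cluster as the correct spelling, while "Ottowa" stays on its own. Comparing every value against every cluster member is quadratic, which is why a real implementation would restrict comparisons to blocks first.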
diff --git a/docs/source/user_guide/clean/clean_duplication.ipynb b/docs/source/user_guide/clean/clean_duplication.ipynb
@@ -61,7 +61,7 @@
     "compared using the levenshtein distance function. If two values have a distance less than \n",
     "or equal to the given radius they are added to the same cluster. Textboxes are provided for choosing the block size and the radius.\n",
     "\n",
-    "The [python-Levenshtein](https://github.com/ztane/python-Levenshtein) library is used for a fast levenshtein distance implementation.\n",
+    "The [Levenshtein](https://github.com/polm/levenshtein) library is used for a fast Levenshtein distance implementation.\n",
     "\n",
     "Clustering methods are taken from the [OpenRefine](https://github.com/OpenRefine/OpenRefine) project and the [simile-vicino](https://code.google.com/archive/p/simile-vicino/n) project, you can read more about these clustering methods [here](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth).\n",
     "\n",
@@ -107,13 +107,41 @@
     "df"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "cities = pd.DataFrame(\n",
+    "    {\n",
+    "        \"city\": [\n",
+    "            \"Québec\",\n",
+    "            \"Quebec\",\n",
+    "            \"Vancouver\",\n",
+    "            \"Vancouver\",\n",
+    "            \"vancouver\",\n",
+    "            \" Vancuver \",\n",
+    "            \"Toronto\",\n",
+    "            \"Toront\",\n",
+    "            \"Tronto\",\n",
+    "            \"Ottowa\",\n",
+    "            \"otowa\"\n",
+    "        ]\n",
+    "    }\n",
+    ")\n",
+    "cities"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## 1. Default `clean_duplication()`\n",
     "\n",
-    "By default the `df_var_name` parameter is set to \"df\" and the `page_size` variable is set to 5. Clustering methods can be toggled using the dropdown menu at the top of the GUI. Select which clusters you would like to merge using the checkboxes under the \"Merge?\" heading. Then press the \"Merge and Re-Cluster\" button to merge the cluster. If the \"export code\" checkbox is selected, code for merging the clusters will be created and added to the notebook cell. Finally, you can press the \"finish\" button to close the GUI and see the final DataFrame created."
+    "By default, the `df_var_name` parameter is set to `default`, which means the name of the final DataFrame takes the name of the input DataFrame as its prefix, and the `page_size` parameter is set to 5. Clustering methods can be toggled using the dropdown menu at the top of the GUI. Select which clusters you would like to merge using the checkboxes under the \"Merge?\" heading, then press the \"Merge and Re-Cluster\" button to merge them. If the \"export code\" checkbox is selected, code for merging the clusters will be created and added to the notebook cell. Finally, press the \"finish\" button to close the GUI and see the final DataFrame.\n"
    ]
   },
   {
@@ -126,6 +154,16 @@
     "clean_duplication(df, \"city\")"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dataprep.clean import clean_duplication\n",
+    "clean_duplication(cities, \"city\")"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -179,7 +217,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.3"
+   "version": "3.6.12"
   }
  },
  "nbformat": 4,