|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": { |
| 6 | + "id": "tzcU5p2gdak9" |
| 7 | + }, |
| 8 | + "source": [ |
| 9 | + "# Introducing Partition with Semantic Chunking SparkNLP\n", |
| 10 | + "This notebook showcases the newly added `Partition` component in Spark NLP\n", |
| 11 | + "providing a streamlined and user-friendly interface for interacting with Spark NLP readers" |
| 12 | + ] |
| 13 | + }, |
| 14 | + { |
| 15 | + "cell_type": "markdown", |
| 16 | + "metadata": { |
| 17 | + "id": "RFOFhaEedalB" |
| 18 | + }, |
| 19 | + "source": [ |
| 20 | + "## Setup and Initialization\n", |
| 21 | + "Let's keep in mind a few things before we start 😊\n", |
| 22 | + "\n", |
| 23 | + "Support for **Partitioning** files was introduced in Spark NLP 6.0.1 \n", |
| 24 | + "\n", |
| 25 | + "Chunking support was added in Spark NLP 6.0.3\n", |
| 26 | + "Please make sure you have upgraded to the latest Spark NLP release.\n", |
| 27 | + "\n", |
| 28 | + "For local files example we will download different files from Spark NLP Github repo:" |
| 29 | + ] |
| 30 | + }, |
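| 31 | + {
| 32 | + "cell_type": "markdown",
| 33 | + "metadata": {},
| 34 | + "source": [
| 35 | + "As a minimal setup sketch (assuming a pip-based environment; the exact packages you need may differ on Databricks or other managed clusters), you can install the latest release and start a Spark NLP session like this:"
| 36 | + ]
| 37 | + },
| 38 | + {
| 39 | + "cell_type": "code",
| 40 | + "execution_count": null,
| 41 | + "metadata": {},
| 42 | + "outputs": [],
| 43 | + "source": [
| 44 | + "# Minimal setup sketch -- assumes a pip-based environment.\n",
| 45 | + "!pip install -q --upgrade spark-nlp pyspark\n",
| 46 | + "\n",
| 47 | + "import sparknlp\n",
| 48 | + "\n",
| 49 | + "# Start (or attach to) a Spark session with Spark NLP on the classpath\n",
| 50 | + "spark = sparknlp.start()\n",
| 51 | + "print(sparknlp.version())"
| 52 | + ]
| 53 | + },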
| 31 | + { |
| 32 | + "cell_type": "markdown", |
| 33 | + "metadata": { |
| 34 | + "id": "ATDLz3Gws5ob" |
| 35 | + }, |
| 36 | + "source": [ |
| 37 | + "**Downloading Files**" |
| 38 | + ] |
| 39 | + }, |
| 40 | + { |
| 41 | + "cell_type": "code", |
| 42 | + "execution_count": 7, |
| 43 | + "metadata": { |
| 44 | + "id": "g7PMCOJo0ZlU" |
| 45 | + }, |
| 46 | + "outputs": [], |
| 47 | + "source": [ |
| 48 | + "!mkdir txt-files\n", |
| 49 | + "!mkdir html-files" |
| 50 | + ] |
| 51 | + }, |
| 52 | + { |
| 53 | + "cell_type": "code", |
| 54 | + "execution_count": 8, |
| 55 | + "metadata": { |
| 56 | + "colab": { |
| 57 | + "base_uri": "https://localhost:8080/" |
| 58 | + }, |
| 59 | + "id": "AV-krG6Ps8pq", |
| 60 | + "outputId": "ea4c2484-6e83-4a7a-a000-537f38189ed0" |
| 61 | + }, |
| 62 | + "outputs": [ |
| 63 | + { |
| 64 | + "name": "stdout", |
| 65 | + "output_type": "stream", |
| 66 | + "text": [ |
| 67 | + "--2025-06-06 15:19:01-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1125-Implement-Chunking-Strategies/src/test/resources/reader/txt/long-text.txt\n", |
| 68 | + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n", |
| 69 | + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n", |
| 70 | + "HTTP request sent, awaiting response... 200 OK\n", |
| 71 | + "Length: 1032 (1.0K) [text/plain]\n", |
| 72 | + "Saving to: ‘txt-files/long-text.txt’\n", |
| 73 | + "\n", |
| 74 | + "long-text.txt 100%[===================>] 1.01K --.-KB/s in 0s \n", |
| 75 | + "\n", |
| 76 | + "2025-06-06 15:19:01 (58.1 MB/s) - ‘txt-files/long-text.txt’ saved [1032/1032]\n", |
| 77 | + "\n", |
| 78 | + "--2025-06-06 15:19:01-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1125-Implement-Chunking-Strategies/src/test/resources/reader/html/fake-html.html\n", |
| 79 | + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n", |
| 80 | + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n", |
| 81 | + "HTTP request sent, awaiting response... 200 OK\n", |
| 82 | + "Length: 665 [text/plain]\n", |
| 83 | + "Saving to: ‘html-files/fake-html.html’\n", |
| 84 | + "\n", |
| 85 | + "fake-html.html 100%[===================>] 665 --.-KB/s in 0s \n", |
| 86 | + "\n", |
| 87 | + "2025-06-06 15:19:02 (26.7 MB/s) - ‘html-files/fake-html.html’ saved [665/665]\n", |
| 88 | + "\n" |
| 89 | + ] |
| 90 | + } |
| 91 | + ], |
| 92 | + "source": [ |
| 93 | + "!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1125-Implement-Chunking-Strategies/src/test/resources/reader/txt/long-text.txt -P txt-files\n", |
| 94 | + "!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1125-Implement-Chunking-Strategies/src/test/resources/reader/html/fake-html.html -P html-files" |
| 95 | + ] |
| 96 | + }, |
| 97 | + { |
| 98 | + "cell_type": "markdown", |
| 99 | + "metadata": { |
| 100 | + "id": "EoFI66NAdalE" |
| 101 | + }, |
| 102 | + "source": [ |
| 103 | + "## Partitioning Documents with Chunking\n", |
| 104 | + "Use the `basic` chunking to segment data into coherent chunks based on character limits" |
| 105 | + ] |
| 106 | + }, |
| 107 | + { |
| 108 | + "cell_type": "code", |
| 109 | + "execution_count": 9, |
| 110 | + "metadata": { |
| 111 | + "colab": { |
| 112 | + "base_uri": "https://localhost:8080/" |
| 113 | + }, |
| 114 | + "id": "bAkMjJ1vdalE", |
| 115 | + "outputId": "75831f62-c84a-4170-f87e-e70a6c1ef39d" |
| 116 | + }, |
| 117 | + "outputs": [ |
| 118 | + { |
| 119 | + "name": "stdout", |
| 120 | + "output_type": "stream", |
| 121 | + "text": [ |
| 122 | + "Warning::Spark Session already created, some configs may not take.\n" |
| 123 | + ] |
| 124 | + } |
| 125 | + ], |
| 126 | + "source": [ |
| 127 | + "from sparknlp.partition.partition import Partition\n", |
| 128 | + "\n", |
| 129 | + "partition_df = Partition(content_type = \"text/plain\", chunking_strategy = \"basic\").partition(\"./txt-files/long-text.txt\")" |
| 130 | + ] |
| 131 | + }, |
| 132 | + { |
| 133 | + "cell_type": "markdown", |
| 134 | + "metadata": { |
| 135 | + "id": "k6uvYxiVzGsG" |
| 136 | + }, |
| 137 | + "source": [ |
| 138 | + "Output without `basic` chunk:" |
| 139 | + ] |
| 140 | + }, |
| 141 | + { |
| 142 | + "cell_type": "code", |
| 143 | + "execution_count": 10, |
| 144 | + "metadata": { |
| 145 | + "colab": { |
| 146 | + "base_uri": "https://localhost:8080/" |
| 147 | + }, |
| 148 | + "id": "3L-Tp017qgqb", |
| 149 | + "outputId": "98af5f84-5abc-4554-bab7-7dd9c5212612" |
| 150 | + }, |
| 151 | + "outputs": [ |
| 152 | + { |
| 153 | + "name": "stdout", |
| 154 | + "output_type": "stream", |
| 155 | + "text": [ |
| 156 | + "+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", |
| 157 | + "|col |\n", |
| 158 | + "+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", |
| 159 | + "|Ukrainian forces reportedly advanced in the western Donetsk-eastern Zaporizhia Oblast border area and in western Zaporizhia Oblast amid Ukrainian counteroffensive operations in southern and eastern Ukraine. Tavriisk Group of Forces Spokesperson Oleksandr Shtupun reported that Ukrainian forces are advancing in the directions of Novoprokopivka (13km south of Orikhiv), Mala Tokmachka (9km southeast of Orikhiv), and Ocheretuvate (25km southeast of Orikhiv) in western Zaporizhia Oblast.[1] Shtupun also stated that Ukrainian forces advanced near Urozhaine (9km south of Velyka Novosilka) and Robotyne (10km south of Orikhiv) and achieved unspecified successes near Staromayorske (9km south of Velyka Novosilka) in the Berdyansk direction (western Donetsk-eastern Zaporizhia Oblast border area) and in an unspecified location in the Melitopol direction (western Zaporizhia Oblast).[2] Ukrainian Eastern Group of Forces Spokesperson Ilya Yevlash stated that Ukrainian forces continued offensive operations in the Bakhmut direction.[3]|\n", |
| 160 | + "+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", |
| 161 | + "\n" |
| 162 | + ] |
| 163 | + } |
| 164 | + ], |
| 165 | + "source": [ |
| 166 | + "from pyspark.sql.functions import explode, col\n", |
| 167 | + "\n", |
| 168 | + "result_df = partition_df.select(explode(col(\"txt.content\")))\n", |
| 169 | + "result_df.show(truncate=False)" |
| 170 | + ] |
| 171 | + }, |
| 172 | + { |
| 173 | + "cell_type": "markdown", |
| 174 | + "metadata": { |
| 175 | + "id": "EQJvQsnxzRg1" |
| 176 | + }, |
| 177 | + "source": [ |
| 178 | + "Output with `basic` chunk:" |
| 179 | + ] |
| 180 | + }, |
| 181 | + { |
| 182 | + "cell_type": "code", |
| 183 | + "execution_count": 11, |
| 184 | + "metadata": { |
| 185 | + "colab": { |
| 186 | + "base_uri": "https://localhost:8080/" |
| 187 | + }, |
| 188 | + "id": "VlhnXCV5qr4J", |
| 189 | + "outputId": "cdaf98f1-3109-4770-adaf-f51c80a59ab9" |
| 190 | + }, |
| 191 | + "outputs": [ |
| 192 | + { |
| 193 | + "name": "stdout", |
| 194 | + "output_type": "stream", |
| 195 | + "text": [ |
| 196 | + "+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", |
| 197 | + "|col |\n", |
| 198 | + "+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", |
| 199 | + "|Ukrainian forces reportedly advanced in the western Donetsk-eastern Zaporizhia Oblast border area and in western Zaporizhia Oblast amid Ukrainian counteroffensive operations in southern and eastern Ukraine. Tavriisk Group of Forces Spokesperson Oleksandr Shtupun reported that Ukrainian forces are advancing in the directions of Novoprokopivka (13km south of Orikhiv), Mala Tokmachka (9km southeast of Orikhiv), and Ocheretuvate (25km southeast of Orikhiv) in western Zaporizhia Oblast.[1] Shtupun|\n", |
| 200 | + "|also stated that Ukrainian forces advanced near Urozhaine (9km south of Velyka Novosilka) and Robotyne (10km south of Orikhiv) and achieved unspecified successes near Staromayorske (9km south of Velyka Novosilka) in the Berdyansk direction (western Donetsk-eastern Zaporizhia Oblast border area) and in an unspecified location in the Melitopol direction (western Zaporizhia Oblast).[2] Ukrainian Eastern Group of Forces Spokesperson Ilya Yevlash stated that Ukrainian forces continued offensive |\n", |
| 201 | + "|operations in the Bakhmut direction.[3] |\n", |
| 202 | + "+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", |
| 203 | + "\n" |
| 204 | + ] |
| 205 | + } |
| 206 | + ], |
| 207 | + "source": [ |
| 208 | + "result_df = partition_df.select(explode(col(\"chunks.content\")))\n", |
| 209 | + "result_df.show(truncate=False)" |
| 210 | + ] |
| 211 | + }, |
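| 212 | + {
| 213 | + "cell_type": "markdown",
| 214 | + "metadata": {},
| 215 | + "source": [
| 216 | + "The `basic` strategy segments text by character limits, which you can tune. A hedged sketch follows: `maxCharacters` and `overlap` are assumed to follow the same camelCase option convention as `combineTextUnderNChars` used later in this notebook; verify the exact names against the `Partition` API docs."
| 217 | + ]
| 218 | + },
| 219 | + {
| 220 | + "cell_type": "code",
| 221 | + "execution_count": null,
| 222 | + "metadata": {},
| 223 | + "outputs": [],
| 224 | + "source": [
| 225 | + "# Hedged sketch: tuning basic chunking by character limits.\n",
| 226 | + "# `maxCharacters` (cap per chunk) and `overlap` (characters repeated between\n",
| 227 | + "# consecutive chunks) are assumed option names; check the Partition docs.\n",
| 228 | + "partition_df = Partition(\n",
| 229 | + "    content_type=\"text/plain\",\n",
| 230 | + "    chunking_strategy=\"basic\",\n",
| 231 | + "    maxCharacters=200,\n",
| 232 | + "    overlap=20\n",
| 233 | + ").partition(\"./txt-files/long-text.txt\")\n",
| 234 | + "\n",
| 235 | + "partition_df.select(explode(col(\"chunks.content\"))).show(truncate=False)"
| 236 | + ]
| 237 | + },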
| 212 | + { |
| 213 | + "cell_type": "markdown", |
| 214 | + "metadata": { |
| 215 | + "id": "4YYTB7G6zbmN" |
| 216 | + }, |
| 217 | + "source": [ |
| 218 | + "Use `by_title` chunking to group sections in documents with headings, tables, and mixed semantic elements" |
| 219 | + ] |
| 220 | + }, |
| 221 | + { |
| 222 | + "cell_type": "code", |
| 223 | + "execution_count": 12, |
| 224 | + "metadata": { |
| 225 | + "colab": { |
| 226 | + "base_uri": "https://localhost:8080/" |
| 227 | + }, |
| 228 | + "id": "PxTf0Ot23ZaO", |
| 229 | + "outputId": "9b02a493-b4d0-41fc-c5ee-9ed8ab2de194" |
| 230 | + }, |
| 231 | + "outputs": [ |
| 232 | + { |
| 233 | + "name": "stdout", |
| 234 | + "output_type": "stream", |
| 235 | + "text": [ |
| 236 | + "Warning::Spark Session already created, some configs may not take.\n" |
| 237 | + ] |
| 238 | + } |
| 239 | + ], |
| 240 | + "source": [ |
| 241 | + "partition_df = Partition(content_type = \"text/html\", chunking_strategy = \"by_title\", combineTextUnderNChars = 50).partition(\"./html-files/fake-html.html\")" |
| 242 | + ] |
| 243 | + }, |
| 244 | + { |
| 245 | + "cell_type": "markdown", |
| 246 | + "metadata": { |
| 247 | + "id": "YXMf3cBfz_2-" |
| 248 | + }, |
| 249 | + "source": [ |
| 250 | + "Output without `by_title` chunk:" |
| 251 | + ] |
| 252 | + }, |
| 253 | + { |
| 254 | + "cell_type": "code", |
| 255 | + "execution_count": 13, |
| 256 | + "metadata": { |
| 257 | + "colab": { |
| 258 | + "base_uri": "https://localhost:8080/" |
| 259 | + }, |
| 260 | + "id": "O-_R-W86sFo-", |
| 261 | + "outputId": "6f07e491-c556-41af-89da-273e905d0e8b" |
| 262 | + }, |
| 263 | + "outputs": [ |
| 264 | + { |
| 265 | + "name": "stdout", |
| 266 | + "output_type": "stream", |
| 267 | + "text": [ |
| 268 | + "+----------------------------------------------------------------------------------------------------------------------------------+\n", |
| 269 | + "|col |\n", |
| 270 | + "+----------------------------------------------------------------------------------------------------------------------------------+\n", |
| 271 | + "|My First Heading |\n", |
| 272 | + "|My Second Heading |\n", |
| 273 | + "|My first paragraph. lorem ipsum dolor set amet. if the cow comes home under the sun how do you fault the cow for it's worn hooves?|\n", |
| 274 | + "|A Third Heading |\n", |
| 275 | + "|Column 1 Column 2 Row 1, Cell 1 Row 1, Cell 2 Row 2, Cell 1 Row 2, Cell 2 |\n", |
| 276 | + "+----------------------------------------------------------------------------------------------------------------------------------+\n", |
| 277 | + "\n" |
| 278 | + ] |
| 279 | + } |
| 280 | + ], |
| 281 | + "source": [ |
| 282 | + "result_df = partition_df.select(explode(col(\"html.content\")))\n", |
| 283 | + "result_df.show(truncate=False)" |
| 284 | + ] |
| 285 | + }, |
| 286 | + { |
| 287 | + "cell_type": "markdown", |
| 288 | + "metadata": { |
| 289 | + "id": "EhLOvpfe0JIe" |
| 290 | + }, |
| 291 | + "source": [ |
| 292 | + "Output with `by_title` chunk:" |
| 293 | + ] |
| 294 | + }, |
| 295 | + { |
| 296 | + "cell_type": "code", |
| 297 | + "execution_count": 14, |
| 298 | + "metadata": { |
| 299 | + "colab": { |
| 300 | + "base_uri": "https://localhost:8080/" |
| 301 | + }, |
| 302 | + "id": "WhSWaeYGrvP-", |
| 303 | + "outputId": "8f5da326-029c-4ad6-c201-c5d2f2f8fa7d" |
| 304 | + }, |
| 305 | + "outputs": [ |
| 306 | + { |
| 307 | + "name": "stdout", |
| 308 | + "output_type": "stream", |
| 309 | + "text": [ |
| 310 | + "+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", |
| 311 | + "|col |\n", |
| 312 | + "+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", |
| 313 | + "|My First Heading My Second Heading My first paragraph. lorem ipsum dolor set amet. if the cow comes home under the sun how do you fault the cow for it's worn hooves? A Third Heading|\n", |
| 314 | + "|Column 1 Column 2 Row 1, Cell 1 Row 1, Cell 2 Row 2, Cell 1 Row 2, Cell 2 |\n", |
| 315 | + "+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", |
| 316 | + "\n" |
| 317 | + ] |
| 318 | + } |
| 319 | + ], |
| 320 | + "source": [ |
| 321 | + "result_df = partition_df.select(explode(col(\"chunks.content\")))\n", |
| 322 | + "result_df.show(truncate=False)" |
| 323 | + ] |
| 324 | + }, |
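| 325 | + {
| 326 | + "cell_type": "markdown",
| 327 | + "metadata": {},
| 328 | + "source": [
| 329 | + "As the snippets above suggest, `partition` names the raw output column after the input type (`txt` for plain text, `html` for HTML), while chunked output always lands in a `chunks` column. You can confirm the layout by printing the schema:"
| 330 | + ]
| 331 | + },
| 332 | + {
| 333 | + "cell_type": "code",
| 334 | + "execution_count": null,
| 335 | + "metadata": {},
| 336 | + "outputs": [],
| 337 | + "source": [
| 338 | + "# Inspect the result: the raw content column is named after the source type\n",
| 339 | + "# (here `html`); the chunked output lives in `chunks`.\n",
| 340 | + "partition_df.printSchema()"
| 341 | + ]
| 342 | + },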
| 325 | + { |
| 326 | + "cell_type": "markdown", |
| 327 | + "metadata": { |
| 328 | + "id": "BB2FEfegGuxl" |
| 329 | + }, |
| 330 | + "source": [ |
| 331 | + "You can also use DFS file systems like:\n", |
| 332 | + "- Databricks: `dbfs://`\n", |
| 333 | + "- HDFS: `hdfs://`\n", |
| 334 | + "- Microsoft Fabric OneLake: `abfss://`" |
| 335 | + ] |
| 336 | + },
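| 337 | + {
| 338 | + "cell_type": "markdown",
| 339 | + "metadata": {},
| 340 | + "source": [
| 341 | + "For example (a sketch only -- the `hdfs://` URI below is a hypothetical placeholder, not a real cluster path):"
| 342 | + ]
| 343 | + },
| 344 | + {
| 345 | + "cell_type": "code",
| 346 | + "execution_count": null,
| 347 | + "metadata": {},
| 348 | + "outputs": [],
| 349 | + "source": [
| 350 | + "# Sketch: partitioning a file stored on a distributed file system.\n",
| 351 | + "# The hdfs:// URI is a hypothetical placeholder -- substitute your own\n",
| 352 | + "# dbfs://, hdfs://, or abfss:// path.\n",
| 353 | + "partition_df = Partition(content_type=\"text/plain\", chunking_strategy=\"basic\") \\\n",
| 354 | + "    .partition(\"hdfs://namenode:9000/data/txt/long-text.txt\")"
| 355 | + ]
| 356 | + }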
| 337 | + ], |
| 338 | + "metadata": { |
| 339 | + "colab": { |
| 340 | + "provenance": [] |
| 341 | + }, |
| 342 | + "kernelspec": { |
| 343 | + "display_name": "Python 3 (ipykernel)", |
| 344 | + "language": "python", |
| 345 | + "name": "python3" |
| 346 | + }, |
| 347 | + "language_info": { |
| 348 | + "codemirror_mode": { |
| 349 | + "name": "ipython", |
| 350 | + "version": 3 |
| 351 | + }, |
| 352 | + "file_extension": ".py", |
| 353 | + "mimetype": "text/x-python", |
| 354 | + "name": "python", |
| 355 | + "nbconvert_exporter": "python", |
| 356 | + "pygments_lexer": "ipython3", |
| 357 | + "version": "3.10.12" |
| 358 | + } |
| 359 | + }, |
| 360 | + "nbformat": 4, |
| 361 | + "nbformat_minor": 1 |
| 362 | +} |