From ce06972742fbb2064e64c5e9002a03594c76a226 Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Thu, 4 Dec 2025 15:21:16 +0100 Subject: [PATCH 01/21] feat(dlb): add v2 doc MTA-6795 --- .../data-lab/how-to/use-private-networks.mdx | 39 +++++++++++++++++++ 1 file changed, 39 insertions(+) create mode 100644 pages/data-lab/how-to/use-private-networks.mdx diff --git a/pages/data-lab/how-to/use-private-networks.mdx b/pages/data-lab/how-to/use-private-networks.mdx new file mode 100644 index 0000000000..e547fa50ed --- /dev/null +++ b/pages/data-lab/how-to/use-private-networks.mdx @@ -0,0 +1,39 @@ +--- +title: How to use Private Networks with your Data Lab cluster +description: This page explains how to use Private Networks with Scaleway Data Lab for Apache Spark™ +tags: private-networks private networks data lab spark apache cluster vpc +dates: + validation: 2025-06-25 + posted: 2021-06-25 +--- +import Requirements from '@macros/iam/requirements.mdx' + + +[Private Networks](/vpc/concepts/#private-networks) allow your Data Lab for Apache Spark™ cluster to communicate in an isolated and secure network without needing to be connected to the public internet. + +For full information about Scaleway Private Networks and VPC, see our [dedicated documentation](/vpc/) and [best practices guide](/vpc/reference-content/getting-most-private-networks/). + + + +- A Scaleway account logged into the [console](https://console.scaleway.com) +- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization +- [Created a Private Network](/vpc/how-to/create-private-network/) + + +## How to create a Private Network + +This action must be carried out from the Private Networks section of the console. Follow the procedure detailed in our [dedicated Private Networks documentation](/vpc/how-to/create-private-network/). + +## How to attach and detach a cluster to a Private Network + +Data Lab clusters can only be attached to a Private Network during their creation, and cannot be detached and reattached to another Private Network afterward. + +Refer to the [dedicated documentation](/data-lab/how-to/create-data-lab/) for comprehensive information on how to create a Data Lab for Apache Spark™ cluster. + +## How to delete a Private Network + + + Before deleting a Private Network, you must [detach](/vpc/how-to/attach-resources-to-pn/#how-to-detach-a-resource-from-a-private-network) all resources attached to it. + + +This must be carried out from the Private Networks section of the console. Follow the procedure detailed in our [dedicated Private Networks documentation](/vpc/how-to/delete-private-network/). 
\ No newline at end of file From 813b49004c1618ee112c7e52c4330587796e92a0 Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Thu, 4 Dec 2025 15:24:45 +0100 Subject: [PATCH 02/21] feat(dlb): update --- pages/data-lab/how-to/use-private-networks.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/data-lab/how-to/use-private-networks.mdx b/pages/data-lab/how-to/use-private-networks.mdx index e547fa50ed..bd8ae1e7eb 100644 --- a/pages/data-lab/how-to/use-private-networks.mdx +++ b/pages/data-lab/how-to/use-private-networks.mdx @@ -26,7 +26,7 @@ This action must be carried out from the Private Networks section of the console ## How to attach and detach a cluster to a Private Network -Data Lab clusters can only be attached to a Private Network during their creation, and cannot be detached and reattached to another Private Network afterward. +At the moment, Data Lab clusters can only be attached to a Private Network during their creation, and cannot be detached and reattached to another Private Network afterward. Refer to the [dedicated documentation](/data-lab/how-to/create-data-lab/) for comprehensive information on how to create a Data Lab for Apache Spark™ cluster. From 82ae49458306aa4996d21673b8e8091fbb54746f Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Thu, 4 Dec 2025 16:09:41 +0100 Subject: [PATCH 03/21] feat(dlb): update --- pages/data-lab/concepts.mdx | 4 ++- pages/data-lab/how-to/access-spark-ui.mdx | 31 +++++++++++++++++++++++ pages/data-lab/how-to/create-data-lab.mdx | 8 +++--- 3 files changed, 39 insertions(+), 4 deletions(-) create mode 100644 pages/data-lab/how-to/access-spark-ui.mdx diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index 6e8d9269b7..d3a732f77c 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -38,13 +38,15 @@ Lighter is a technology that enables SparkMagic commands to be readable and exec A notebook for an Apache Spark cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows. +Adding a notebook to your cluster requires 1 GB of storage. + ## Persistent volume A Persistent Volume (PV) is a cluster-wide storage resource that ensures data persistence beyond the lifecycle of individual Pods. Persistent volumes abstract the underlying storage details, allowing administrators to use various storage solutions. Apache Spark® executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers. -A PV sized properly ensures a smooth execution of your workload. +A persistent volume sized properly ensures a smooth execution of your workload. 
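+
+For example, a grouped aggregation like the sketch below is a wide operation: rows from every partition must be combined, so executors write shuffle files to storage while the job runs. This is a minimal illustrative sketch; the app name and data volume are arbitrary values, not settings required by Data Lab.
+
+```python
+from pyspark.sql import SparkSession
+from pyspark.sql import functions as F
+
+spark = SparkSession.builder.appName("shuffle-sizing-sketch").getOrCreate()
+
+# Wide operation: grouping forces a shuffle, and the intermediate
+# shuffle files are written to the executors' persistent volume.
+df = spark.range(100_000_000).withColumn("bucket", F.col("id") % 1000)
+df.groupBy("bucket").count().show()
+```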
## SparkMagic diff --git a/pages/data-lab/how-to/access-spark-ui.mdx b/pages/data-lab/how-to/access-spark-ui.mdx new file mode 100644 index 0000000000..89b6394357 --- /dev/null +++ b/pages/data-lab/how-to/access-spark-ui.mdx @@ -0,0 +1,31 @@ +--- +title: How to Access the Apache Spark™ UI +description: Step-by-step guide to access and use the Apache Spark™ UI in a Data Lab for Apache Spark™ on Scaleway. +tags: data lab apache spark ui gui console +dates: + validation: 2025-12-04 + posted: 2025-12-04 +--- + +import Requirements from '@macros/iam/requirements.mdx' + +This page explains how to Access the Apache Spark™ UI of your Data Lab for Apache Spark™ cluster. + + + +- A Scaleway account logged into the [console](https://console.scaleway.com) +- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization +- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/) +- Created an [IAM API key](/iam/how-to/create-api-keys/) + +1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. + +2. Click the name of the desired Data Lab cluster. The overview tab of the cluster displays. + +3. Click the **Open Apache Spark™ UI** button. A login page displays. + +4. Enter the **secret key** of your API key, then click **Authenticate**. The Apache Spark™ UI dashboard displays. + +From this view, you can view and monitor worker nodes, executors and applications. + +Refer to the [official Apache Spark™ documentation](https://spark.apache.org/docs/latest/web-ui.html) for comprehensive information on how to use the web UI. \ No newline at end of file diff --git a/pages/data-lab/how-to/create-data-lab.mdx b/pages/data-lab/how-to/create-data-lab.mdx index 335efadc4d..1f42fe894a 100644 --- a/pages/data-lab/how-to/create-data-lab.mdx +++ b/pages/data-lab/how-to/create-data-lab.mdx @@ -21,9 +21,11 @@ Data Lab for Apache Spark™ is a product designed to assist data scientists and 2. Click **Create Data Lab cluster**. The creation wizard displays. -3. Complete the following steps in the wizard: - - Choose an Apache Spark version from the drop-down menu. - - Select a worker node configuration. +3. Choose an Apache Spark version from the drop-down menu. + +4. Choose a main node type. If you plan to add a notebook to your cluster, select the **DDL-PLAY2-MICRO** configuration to provision sufficient resources for it. + +5. Select a worker node configuration. - Enter the desired number of worker nodes. Provisioning zero worker nodes lets you retain and access you cluster and notebook configurations, but will not allow you to run calculations. From fb2992184263e32f70255515948760bc87533e0e Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Thu, 4 Dec 2025 17:38:40 +0100 Subject: [PATCH 04/21] feat(dlb): update --- pages/data-lab/how-to/access-notebook.mdx | 32 +++++++++++++++++++++++ 1 file changed, 32 insertions(+) create mode 100644 pages/data-lab/how-to/access-notebook.mdx diff --git a/pages/data-lab/how-to/access-notebook.mdx b/pages/data-lab/how-to/access-notebook.mdx new file mode 100644 index 0000000000..bf7351fdce --- /dev/null +++ b/pages/data-lab/how-to/access-notebook.mdx @@ -0,0 +1,32 @@ +--- +title: How to access and use the notebook of a Data Lab cluster +description: Step-by-step guide to access and use the notebook environment in a Data Lab for Apache Spark™ on Scaleway. 
+tags: data lab apache spark notebook environment jupyterlab +dates: + validation: 2025-12-04 + posted: 2025-12-04 +--- + +import Requirements from '@macros/iam/requirements.mdx' + +This page explains how to access and use the notebook environment of your Data Lab for Apache Spark™ cluster. + + + +- A Scaleway account logged into the [console](https://console.scaleway.com) +- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization +- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/) with a notebook +- Created an [IAM API key](/iam/how-to/create-api-keys/) + +## How to access the notebook of your cluster + +1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. + +2. Click the name of the desired Data Lab cluster. The overview tab of the cluster displays. + +3. Click the **Open notebook** button. A login page displays. + +4. Enter the **secret key** of your API key, then click **Authenticate**. The notebook dashboard displays. + + + From 40bef224c3381360f2dc6d5a86936c6739217fa7 Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Thu, 4 Dec 2025 17:45:10 +0100 Subject: [PATCH 05/21] feat(dlb): update --- pages/data-lab/how-to/use-private-networks.mdx | 2 ++ 1 file changed, 2 insertions(+) diff --git a/pages/data-lab/how-to/use-private-networks.mdx b/pages/data-lab/how-to/use-private-networks.mdx index bd8ae1e7eb..9e1caeaf39 100644 --- a/pages/data-lab/how-to/use-private-networks.mdx +++ b/pages/data-lab/how-to/use-private-networks.mdx @@ -24,6 +24,8 @@ For full information about Scaleway Private Networks and VPC, see our [dedicated This action must be carried out from the Private Networks section of the console. Follow the procedure detailed in our [dedicated Private Networks documentation](/vpc/how-to/create-private-network/). +## How to use + ## How to attach and detach a cluster to a Private Network At the moment, Data Lab clusters can only be attached to a Private Network during their creation, and cannot be detached and reattached to another Private Network afterward. From 285338318183e6ebee9670f8a317d07e2f3dcb62 Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Mon, 8 Dec 2025 15:23:01 +0100 Subject: [PATCH 06/21] feat(dlb): update --- .../data-lab/how-to/use-private-networks.mdx | 141 ++++++++++++++++-- 1 file changed, 130 insertions(+), 11 deletions(-) diff --git a/pages/data-lab/how-to/use-private-networks.mdx b/pages/data-lab/how-to/use-private-networks.mdx index 9e1caeaf39..ff43c52cf3 100644 --- a/pages/data-lab/how-to/use-private-networks.mdx +++ b/pages/data-lab/how-to/use-private-networks.mdx @@ -11,6 +11,8 @@ import Requirements from '@macros/iam/requirements.mdx' [Private Networks](/vpc/concepts/#private-networks) allow your Data Lab for Apache Spark™ cluster to communicate in an isolated and secure network without needing to be connected to the public internet. +At the moment, Data Lab clusters can only be attached to a Private Network [during their creation](/data-lab/how-to/create-data-lab/), and cannot be detached and reattached to another Private Network afterward. + For full information about Scaleway Private Networks and VPC, see our [dedicated documentation](/vpc/) and [best practices guide](/vpc/reference-content/getting-most-private-networks/). 
@@ -18,24 +20,141 @@ For full information about Scaleway Private Networks and VPC, see our [dedicated - A Scaleway account logged into the [console](https://console.scaleway.com) - [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization - [Created a Private Network](/vpc/how-to/create-private-network/) +- [Created an Instance](/instances/how-to/create-an-instance/) attached to a [Private Network](/instances/how-to/use-private-networks/) + +## How to use a cluster through a Private Network + +### Setting up your Instance + +1. [Connect to your Instance via SSH](/instances/how-to/connect-to-instance/). + +2. Run the command below from the shell of your Instance to install the required dependencies: + + ```bash + sudo apt update + sudo apt install -y \ + build-essential curl git \ + libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev \ + libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \ + openjdk-17-jre-headless tmux + ``` + +2. Run the command below to install `pyenv`: + + ```bash + curl https://pyenv.run | bash + ``` + +3. Run the command below to add `pyenv` to your Bash configuration: + + ```bash + echo 'export PATH="$HOME/.pyenv/bin:$PATH"' >> ~/.bashrc + echo 'eval "$(pyenv init -)"' >> ~/.bashrc + echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc + ``` + +4. Run the command below to reload your shell: + + ```bash + exec $SHELL + ``` + +5. Run the command below to install **Python 3.13**, and activate a virtual environment: + + ```bash + pyenv install 3.13.0 + pyenv virtualenv 3.13.0 jupyter-spark-3.13 + pyenv activate jupyter-spark-3.13 + ``` + + + Your Instance Python version must be 3.13. If you encounter an error due to a mismatch between the worker and driver Python versions, run the following command to display minor versions, then reinstall using the exact one: + + ```bash + pyenv install -l | grep 3.13 + ``` + + +6. Run the command below to install JupyterLab and PySpark inside the virtual environment: + + ```bash + pip install --upgrade pip + pip install jupyterlab pyspark + ``` + +7. Run the command below to generate a configuration file for your JupyterLab: + + ```bash + jupyter lab --generate-config + ``` + +8. Open the configuration file you just created: + + ```bash + nano ~/.jupyter/jupyter_lab_config.py + ``` + +9. Set the following parameters: + + ```python + c.ServerApp.ip = "127.0.0.1" # bind only to localhost + c.ServerApp.port = 8888 + c.ServerApp.open_browser = False + # optional authentication token: + # c.ServerApp.token = "your-super-secure-password" + # if running as root: + # c.ServerApp.allow_root = True + ``` + +10. Run the command below to start Jupyterlab: + + ```bash + jupyter lab + ``` + +11. Connect to your JupyterLab via SSH: + + ```bash + ssh -L 8888:127.0.0.1:8888 @ + ``` + +12. Access [http://localhost:8888](http://localhost:8888) + +You now have access to your Data Lab for Apache Spark™ cluster via a Private Network, using a JupyterLab notebook deployed on an Instance. + +### Running a sample workload over Private Networks +1. In a new Jupyter notebook file, add the code below to a new cell: -## How to create a Private Network + ```python + from pyspark.sql import SparkSession -This action must be carried out from the Private Networks section of the console. Follow the procedure detailed in our [dedicated Private Networks documentation](/vpc/how-to/create-private-network/). 
+ MASTER_URL = "" # "spark://master-datalab-[...]:7077" format + DRIVER_HOST = "" # "XX.XX.XX.XX" format -## How to use + spark = ( + SparkSession.builder + .appName("jupyter-from-vpc-instance-test") + .master(MASTER_URL) + # make sure executors can talk back to this driver + .config("spark.driver.host", DRIVER_HOST) + .config("spark.driver.bindAddress", "0.0.0.0") + .config("spark.driver.port", "7078") + .config("spark.blockManager.port", "7079") + .config("spark.ui.port", "4040") + .getOrCreate() + ) -## How to attach and detach a cluster to a Private Network + spark.range(10).show() + ``` -At the moment, Data Lab clusters can only be attached to a Private Network during their creation, and cannot be detached and reattached to another Private Network afterward. +2. Replace the placeholders with the appropriate values: + - `` can be found in the **Overview** tab of your cluster, under **Private endpoint** in the **Network** section. + - `` can be found in the **Private Networks** tab of your Instance. Make sure to only copy the IP, and not the `/22` part. -Refer to the [dedicated documentation](/data-lab/how-to/create-data-lab/) for comprehensive information on how to create a Data Lab for Apache Spark™ cluster. +3. Run the cell. -## How to delete a Private Network +Your notebook hosted on an Instance is ready to be used over Private Network. - - Before deleting a Private Network, you must [detach](/vpc/how-to/attach-resources-to-pn/#how-to-detach-a-resource-from-a-private-network) all resources attached to it. - +### Running an application over Private Networks using spark-submit -This must be carried out from the Private Networks section of the console. Follow the procedure detailed in our [dedicated Private Networks documentation](/vpc/how-to/delete-private-network/). \ No newline at end of file From d7265a29bc859375a328aebebcdd0af6c533361e Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Mon, 8 Dec 2025 15:30:47 +0100 Subject: [PATCH 07/21] feat(dlb): update --- pages/data-lab/menu.ts | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/pages/data-lab/menu.ts b/pages/data-lab/menu.ts index 893c9a3cbf..0291c1ddd5 100644 --- a/pages/data-lab/menu.ts +++ b/pages/data-lab/menu.ts @@ -19,15 +19,27 @@ export const dataLabMenu = { { items: [ { - label: 'Create a Data Lab', + label: 'Create a Data Lab cluster', slug: 'create-data-lab', }, { - label: 'Connect to a Data Lab', + label: 'Connect to a Data Lab cluster', slug: 'connect-to-data-lab', }, { - label: 'Manage and delete a Data Lab', + label: 'Access the notebook', + slug: 'access-notebook', + }, + { + label: 'Access the Spark™ UI', + slug: 'access-spark-ui', + }, + { + label: 'Use a cluster with Private Networks', + slug: 'use-private-network', + }, + { + label: 'Manage and delete a cluster', slug: 'manage-delete-data-lab', }, ], From 945fd9b4f50578100ff32256ee25e873397f14a7 Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Mon, 8 Dec 2025 15:34:10 +0100 Subject: [PATCH 08/21] feat(dlb): update --- pages/data-lab/how-to/access-notebook.mdx | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/pages/data-lab/how-to/access-notebook.mdx b/pages/data-lab/how-to/access-notebook.mdx index bf7351fdce..fbce1eddfb 100644 --- a/pages/data-lab/how-to/access-notebook.mdx +++ b/pages/data-lab/how-to/access-notebook.mdx @@ -28,5 +28,4 @@ This page explains how to access and use the notebook environment of your Data L 4. Enter the **secret key** of your API key, then click **Authenticate**. 
The notebook dashboard displays. - - +You are now connected to your notebook environment. \ No newline at end of file From 9af7650660573a8e72a91b0fa74489048ac1aeb3 Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Mon, 8 Dec 2025 15:35:38 +0100 Subject: [PATCH 09/21] feat(dlb): update --- pages/data-lab/how-to/connect-to-data-lab.mdx | 38 ------------------- pages/data-lab/menu.ts | 4 -- 2 files changed, 42 deletions(-) delete mode 100644 pages/data-lab/how-to/connect-to-data-lab.mdx diff --git a/pages/data-lab/how-to/connect-to-data-lab.mdx b/pages/data-lab/how-to/connect-to-data-lab.mdx deleted file mode 100644 index 8a941f17a1..0000000000 --- a/pages/data-lab/how-to/connect-to-data-lab.mdx +++ /dev/null @@ -1,38 +0,0 @@ ---- -title: How to connect to a Data Lab for Apache Spark™ -description: Step-by-step guide to connecting to a Data Lab for Apache Spark™ with the Scaleway console. -tags: data lab for apache spark create process -dates: - validation: 2025-09-17 - posted: 2024-07-31 ---- -import Requirements from '@macros/iam/requirements.mdx' - - - - -- A Scaleway account logged into the [console](https://console.scaleway.com) -- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization -- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/) -- A valid [API key](/iam/how-to/create-api-keys/) - -1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. - -2. Click the name of the Data Lab cluster you want to connect to. The cluster **Overview** page displays. - -3. Click **Open Notebook** in the **Notebook** section. You are directed to the notebook login page. - -4. Enter your [API secret key](/iam/concepts/#api-key) when prompted for a password, then click **Log in**. You are directed to the lab's home screen. - -5. In the files list on the left, double-click the `DatalabDemo.ipynb` file to open it. - -6. Update the first cell of the file with your API access key and secret key, as shown below: - - ```json - "spark.hadoop.fs.s3a.access.key": "[your-api-access-key]", - "spark.hadoop.fs.s3a.secret.key": "[your-api-secret-key]", - ``` - - Your notebook environment is now ready to be used. - -7. Optionally, follow the instructions contained in the `DatalabDemo.ipynb` file to process a test batch of data. 
\ No newline at end of file diff --git a/pages/data-lab/menu.ts b/pages/data-lab/menu.ts index 0291c1ddd5..b933b9f1ab 100644 --- a/pages/data-lab/menu.ts +++ b/pages/data-lab/menu.ts @@ -22,10 +22,6 @@ export const dataLabMenu = { label: 'Create a Data Lab cluster', slug: 'create-data-lab', }, - { - label: 'Connect to a Data Lab cluster', - slug: 'connect-to-data-lab', - }, { label: 'Access the notebook', slug: 'access-notebook', From 2e7b209153e8b598a33889d7ea64421a06d2c327 Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Mon, 8 Dec 2025 17:52:56 +0100 Subject: [PATCH 10/21] feat(dlb): update --- .../data-lab/how-to/use-private-networks.mdx | 60 ++++++++++++++++--- pages/data-lab/menu.ts | 2 +- 2 files changed, 54 insertions(+), 8 deletions(-) diff --git a/pages/data-lab/how-to/use-private-networks.mdx b/pages/data-lab/how-to/use-private-networks.mdx index ff43c52cf3..548cc28283 100644 --- a/pages/data-lab/how-to/use-private-networks.mdx +++ b/pages/data-lab/how-to/use-private-networks.mdx @@ -20,7 +20,7 @@ For full information about Scaleway Private Networks and VPC, see our [dedicated - A Scaleway account logged into the [console](https://console.scaleway.com) - [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization - [Created a Private Network](/vpc/how-to/create-private-network/) -- [Created an Instance](/instances/how-to/create-an-instance/) attached to a [Private Network](/instances/how-to/use-private-networks/) +- [Created an Ubuntu Instance](/instances/how-to/create-an-instance/) attached to a [Private Network](/instances/how-to/use-private-networks/) ## How to use a cluster through a Private Network @@ -97,13 +97,11 @@ For full information about Scaleway Private Networks and VPC, see our [dedicated 9. Set the following parameters: ```python - c.ServerApp.ip = "127.0.0.1" # bind only to localhost + # if running as root: + c.ServerApp.allow_root = True c.ServerApp.port = 8888 - c.ServerApp.open_browser = False # optional authentication token: # c.ServerApp.token = "your-super-secure-password" - # if running as root: - # c.ServerApp.allow_root = True ``` 10. Run the command below to start Jupyterlab: @@ -112,13 +110,17 @@ For full information about Scaleway Private Networks and VPC, see our [dedicated jupyter lab ``` -11. Connect to your JupyterLab via SSH: +11. In a new terminal, connect to your JupyterLab via SSH. The Instance public IP can be found in the **Overview** tab of your Instance: ```bash ssh -L 8888:127.0.0.1:8888 @ ``` -12. Access [http://localhost:8888](http://localhost:8888) + + Make sure to allow root connection in your configuration file if you log in as a root user. + + +12. Access [http://localhost:8888](http://localhost:8888), then enter the token generated while executing the `jupyter lab` command. You now have access to your Data Lab for Apache Spark™ cluster via a Private Network, using a JupyterLab notebook deployed on an Instance. @@ -158,3 +160,47 @@ Your notebook hosted on an Instance is ready to be used over Private Network. ### Running an application over Private Networks using spark-submit +1. [Connect to your Instance via SSH](/instances/how-to/connect-to-instance/). + +2. Run the command below from the shell of your Instance to install the required dependencies: + + ```bash + sudo apt update + sudo apt install -y openjdk-17-jdk curl wget tar + java -version + ``` + +3. 
Run the command below to install Apache Spark™:
+
+    ```bash
+    cd ~
+    wget https://archive.apache.org/dist/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
+    ```
+
+4. Run the command below to extract the archive:
+
+    ```bash
+    sudo mkdir -p /opt/spark
+    sudo tar -xzf spark-4.0.0-bin-hadoop3.tgz -C /opt/spark --strip-components=1
+    ```
+
+5. Run the command below to add Apache Spark™ to your Bash configuration and reload your Bash session:
+
+    ```bash
+    echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
+    echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.bashrc
+    source ~/.bashrc
+    ```
+
+6. Install Python 3.13 if you have not already done so, then set the environment variables below:
+
+    ```bash
+    export PYSPARK_PYTHON=$(which python)  # should be Python 3.13
+    export PYSPARK_DRIVER_PYTHON=$(which python)
+    ```
+
+7. Run the command below to execute `spark-submit`. Do not forget to replace the placeholders with the appropriate values.
+
+    ```bash
+    spark-submit --master spark://<master-node-private-endpoint>:7077 --deploy-mode client --conf spark.driver.port=7078 --conf spark.blockManager.port=7079 --conf spark.driver.host=<instance-private-ip> $SPARK_HOME/examples/src/main/python/pi.py 100
+    ```
\ No newline at end of file
diff --git a/pages/data-lab/menu.ts b/pages/data-lab/menu.ts
index b933b9f1ab..ad0bdeb9af 100644
--- a/pages/data-lab/menu.ts
+++ b/pages/data-lab/menu.ts
@@ -32,7 +32,7 @@ export const dataLabMenu = {
     },
     {
       label: 'Use a cluster with Private Networks',
-      slug: 'use-private-network',
+      slug: 'use-private-networks',
     },
     {
       label: 'Manage and delete a cluster',
       slug: 'manage-delete-data-lab',
     },
   ],

From 370d3dc70f411a0782faaa958640cd9e183cc424 Mon Sep 17 00:00:00 2001
From: Samy OUBOUAZIZ
Date: Tue, 9 Dec 2025 11:19:50 +0100
Subject: [PATCH 11/21] feat(dlb): update

---
 .../data-lab/how-to/use-private-networks.mdx  | 25 +++++++++++++++----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/pages/data-lab/how-to/use-private-networks.mdx b/pages/data-lab/how-to/use-private-networks.mdx
index 548cc28283..5541da232b 100644
--- a/pages/data-lab/how-to/use-private-networks.mdx
+++ b/pages/data-lab/how-to/use-private-networks.mdx
@@ -131,7 +131,7 @@ You now have access to your Data Lab for Apache Spark™ cluster via a Private N
     ```python
     from pyspark.sql import SparkSession

-    MASTER_URL = ""   # "spark://master-datalab-[...]:7077" format
+    MASTER_URL = "spark://<master-node-private-endpoint>:7077"  # "spark://master-datalab-[...]:7077" format
     DRIVER_HOST = "<instance-private-ip>"  # "XX.XX.XX.XX" format
@@ -151,7 +151,7 @@ You now have access to your Data Lab for Apache Spark™ cluster via a Private N
     ```

 2. Replace the placeholders with the appropriate values:
-    - `` can be found in the **Overview** tab of your cluster, under **Private endpoint** in the **Network** section.
+    - `<master-node-private-endpoint>` can be found in the **Overview** tab of your cluster, under **Private endpoint** in the **Network** section.
     - `<instance-private-ip>` can be found in the **Private Networks** tab of your Instance. Make sure to only copy the IP, and not the `/22` part.

 3. Run the cell.

 Your notebook hosted on an Instance is ready to be used over Private Network.

 ### Running an application over Private Networks using spark-submit

-7. Run the command below to execute `spark-submit`. Do not forget to replace the placeholders with the appropriate values.
+7. Run the command below to execute `spark-submit` to calculate pi for 100 iterations. Do not forget to replace the placeholders with the appropriate values.
    ```bash
    spark-submit \
    --master spark://<master-node-private-endpoint>:7077 \
    --deploy-mode client \
    --conf spark.driver.port=7078 \
    --conf spark.blockManager.port=7079 \
    --conf spark.driver.host=<instance-private-ip> \
    $SPARK_HOME/examples/src/main/python/pi.py 100
    ```
+
+    
+    - `<master-node-private-endpoint>` can be found in the **Overview** tab of your cluster, under **Private endpoint** in the **Network** section.
+    - `<instance-private-ip>` can be found in the **Private Networks** tab of your Instance. Make sure to only copy the IP, and not the `/22` part.
+    
+
+8. [Access the Apache Spark™ UI](/data-lab/how-to/access-spark-ui/) of your cluster. The list of completed applications displays. From here, you can inspect the jobs previously started using `spark-submit`.
+
+You have successfully run workloads on your cluster from an Instance over a Private Network.
\ No newline at end of file

From 853fe1f46a23e8d7acb836c88bfecc24409888d2 Mon Sep 17 00:00:00 2001
From: Samy OUBOUAZIZ
Date: Tue, 9 Dec 2025 16:05:01 +0100
Subject: [PATCH 12/21] feat(dlb): update

---
 pages/data-lab/quickstart.mdx | 78 ++++++++++-------------------------
 1 file changed, 21 insertions(+), 57 deletions(-)

diff --git a/pages/data-lab/quickstart.mdx b/pages/data-lab/quickstart.mdx
index ed60589db6..fefa7f3e36 100644
--- a/pages/data-lab/quickstart.mdx
+++ b/pages/data-lab/quickstart.mdx
@@ -3,7 +3,7 @@ title: Data Lab for Apache Spark™ - Quickstart
 description: Get started with Scaleway Data Lab for Apache Spark™ quickly and efficiently.
 tags: data lab apache spark notebook jupyter processing
 dates:
-  validation: 2025-09-02
+  validation: 2025-12-09
   posted: 2024-07-10
 ---
 import Requirements from '@macros/iam/requirements.mdx'
@@ -12,23 +12,19 @@ import Requirements from '@macros/iam/requirements.mdx'

 Follow this guided tour to discover how to navigate the console.

-Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure.
+Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark™ infrastructure.

-It is composed of the following:
+Scaleway provides dedicated node types for both the main node and the worker nodes of the cluster. The worker nodes are high-end machines built for intensive computations, featuring powerful CPUs/GPUs, and substantial RAM.

- - Cluster: An Apache Spark cluster powered by a Kubernetes architecture.
+The main node, although capable of performing some local computations, primarily serves as a web interface for interacting with the Apache Spark™ cluster.

- - Notebook: A JupyterLab service operating on a dedicated node type.
-Scaleway provides dedicated node types for both the notebook and the cluster. The cluster nodes are high-end machines built for intensive computations, featuring powerful CPUs/GPUs, and substantial RAM.
+This documentation explains how to create a Data Lab for Apache Spark™ cluster, how to access its notebook environment and run the included demo file, and how to delete your cluster.

-The notebook, although capable of performing some local computations, primarily serves as a web interface for interacting with the Apache Spark cluster.
- A Scaleway account logged into the [console](https://console.scaleway.com) - [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization -- Optionally, an [Object Storage bucket](/object-storage/how-to/create-a-bucket/) +- Created an [IAM API key](/iam/how-to/create-api-keys/) ## How to create a Data Lab for Apache Spark™ cluster @@ -37,19 +33,21 @@ The notebook, although capable of performing some local computations, primarily 2. Click **Create Data Lab cluster**. The creation wizard displays. 3. Complete the following steps in the wizard: - - Choose an Apache Spark version from the drop-down menu. - - Select a worker node configuration. For this procedure, we recommend selecting a CPU rather than a GPU. + + - Select a region for your cluster. + - Choose an Apache Spark™ version from the drop-down menu. + - Select the **DDL-PLAY2-MICRO** main nide type. + - Select a **CPU** worker node configuration. - Enter the desired number of worker nodes. - - Provisioning zero worker nodes lets you retain and access you cluster and notebook configurations, but will not allow you to run calculations. - - - Enter a name for your Data Lab. - - Optionally, add a description and/or tags for your Data Lab. + - Enter a name for your cluster. + - Optionally, add a description and/or tags. - Verify the estimated cost. -4. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page. +4. Click **Create Data Lab cluster** to finish. + +Once the cluster is created, you are directed to its **Overview** page. -## How to connect to your Data Lab +## How to connect to your cluster's notebook 1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. @@ -63,45 +61,11 @@ The notebook, although capable of performing some local computations, primarily Each Data Lab for Apache Spark™ comes with a default `DatalabDemo.ipynb` demonstration file for testing purposes. This file contains a preconfigured notebook environment that requires no modification to run. -Execute the cells in order to perform pre-determined operations on a dummy data set. - -## How to set up a new Data Lab environment - -1. From the notebook **Launcher** tab, select **PySpark** under **Notebook**. - -2. In a new cell, copy and paste the code below and replace the placeholders with your API access key, secret key, and the endpoint of your Object Storage Bucket to set up the Apache Spark session: - - ```json - %%configure -f - { - "name": "My Spark", - "conf":{ - "spark.hadoop.fs.s3a.access.key": "your-api-access-key", - "spark.hadoop.fs.s3a.secret.key": "your-api-secret-key", - "spark.hadoop.fs.s3a.endpoint": "your-bucket-endpoint" - } - } - ``` +Execute the cells in order to perform pre-determined operations on a dummy data set representative of real life use cases and workloads to assess the performance of your cluster. - - The Object Storage bucket endpoint is required only if you did not specify a bucket when creating the cluster. - - -3. In a new cell below, copy and paste the following command to initialize the Apache Spark session: - - ```python - from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType - ``` - -4. Execute the two cells you just created. - - - The initialization of your Apache Spark session can take a few minutes. - - - Once initialized, the information of the Spark session displays. 
- -You can now execute commands that will run on the resources defined when creating the Data Lab for Apache Spark™. + +The demo file also contains a set of examples to configure and extend your Apache Spark™ configuration. + ## How to delete a Data Lab for Apache Spark™ From 09dfaed41238e45ab8fb8376ba8705ea06ef935c Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Wed, 10 Dec 2025 09:48:45 +0100 Subject: [PATCH 13/21] feat(dlb): update --- pages/data-lab/concepts.mdx | 27 ++++++++++++++++++--------- pages/data-lab/quickstart.mdx | 4 ---- 2 files changed, 18 insertions(+), 13 deletions(-) diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index d3a732f77c..a536e8be4b 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -6,13 +6,13 @@ dates: validation: 2025-09-02 --- -## Apache Spark cluster +## Apache Spark™ cluster -An Apache Spark cluster is an orchestrated set of machines over which distributed/Big data calculus is processed. In the case of Scaleway Data Lab, the Apache Spark cluster is a Kubernetes cluster, with Apache Spark installed in each Pod. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html). +An Apache Spark™ cluster is an orchestrated set of machines over which distributed/Big data calculus is processed. In the case of Scaleway Data Lab, the Apache Spark™ cluster is a Kubernetes cluster, with Apache Spark™ installed in each Pod. For more details, check out the [Apache Spark™ documentation](https://spark.apache.org/documentation.html). ## Data Lab -A Data Lab is a project setup that combines a Notebook and an Apache Spark Cluster for data analysis and experimentation. it comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights. +A Data Lab is a project setup that combines a Notebook and an Apache Spark™ Cluster for data analysis and experimentation. it comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights. ## Data Lab for Apache Spark™ @@ -24,7 +24,7 @@ A fixture is a set of data forming a request used for testing purposes. ## GPU -GPUs (Graphical Processing Units) allow Apache Spark to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and specific data-analytics, significantly reducing the processing time for massive datasets and preparation for AI models. +GPUs (Graphical Processing Units) allow Apache Spark™ to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and specific data-analytics, significantly reducing the processing time for massive datasets and preparation for AI models. ## JupyterLab @@ -32,11 +32,16 @@ JupyterLab is a web-based platform for interactive computing, letting you work w ## Lighter -Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark cluster. For more details, check out the [Lighter repository](https://github.com/exacaster/lighter). +Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark™ cluster. For more details, check out the [Lighter repository](https://github.com/exacaster/lighter). 
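+
+As an illustration, the kind of notebook cell that SparkMagic sends through Lighter looks like the following sketch. The session name and settings here are placeholder values, not required ones:
+
+```python
+%%configure -f
+{
+    "name": "my-spark-session",
+    "conf": {
+        "spark.executor.instances": "2",
+        "spark.executor.memory": "4g"
+    }
+}
+```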
+ +## Main node + +The main node in a n Apache Spark™ cluster is the driver Node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster. + ## Notebook -A notebook for an Apache Spark cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows. +A notebook for an Apache Spark™ cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark™ cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows. Adding a notebook to your cluster requires 1 GB of storage. @@ -44,15 +49,19 @@ Adding a notebook to your cluster requires 1 GB of storage. A Persistent Volume (PV) is a cluster-wide storage resource that ensures data persistence beyond the lifecycle of individual Pods. Persistent volumes abstract the underlying storage details, allowing administrators to use various storage solutions. -Apache Spark® executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers. +Apache Spark™ executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers. A persistent volume sized properly ensures a smooth execution of your workload. ## SparkMagic -SparkMagic is a set of tools that allows you to interact with Apache Spark clusters through Jupyter notebooks. It provides magic commands for running Spark jobs, querying data, and managing Spark sessions directly within the notebook interface, facilitating seamless integration and execution of Spark tasks. For more details, check out the [SparkMagic repository](https://github.com/jupyter-incubator/sparkmagic). +SparkMagic is a set of tools that allows you to interact with Apache Spark™ clusters through Jupyter notebooks. It provides magic commands for running Spark™ jobs, querying data, and managing Spark™ sessions directly within the notebook interface, facilitating seamless integration and execution of Spark™ tasks. For more details, check out the [SparkMagic repository](https://github.com/jupyter-incubator/sparkmagic). ## Transaction -An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. 
An SQL transaction can be rolled back in case of an error. \ No newline at end of file +An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error. + +## Worker nodes + +Worker nodes are high-end machines built for intensive computations, featuring powerful CPUs/GPUs, and substantial RAM. diff --git a/pages/data-lab/quickstart.mdx b/pages/data-lab/quickstart.mdx index fefa7f3e36..90de72d1e3 100644 --- a/pages/data-lab/quickstart.mdx +++ b/pages/data-lab/quickstart.mdx @@ -14,10 +14,6 @@ Follow this guided tour to discover how to navigate the console. Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark™ infrastructure. -Scaleway provides dedicated node types for both the main node and the worker nodes of the cluster. The worker nodes are high-end machines built for intensive computations, featuring powerful CPUs/GPUs, and substantial RAM. - -The main node, although capable of performing some local computations, primarily serves as a web interface for interacting with the Apache Spark™ cluster. - This documentation explains how to create a Data Lab for Apache Spark™ cluster, how to access its notebook environment and run the included demo file, and how to delete your cluster. From 314c0706b5434f36788128c5bab343f8702b771b Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Wed, 10 Dec 2025 11:17:06 +0100 Subject: [PATCH 14/21] feat(dlb): update --- pages/data-lab/faq.mdx | 18 ++++----- pages/data-lab/how-to/create-data-lab.mdx | 37 +++++++++++-------- .../how-to/manage-delete-data-lab.mdx | 8 +++- .../data-lab/how-to/use-private-networks.mdx | 4 +- pages/data-lab/quickstart.mdx | 5 ++- 5 files changed, 39 insertions(+), 33 deletions(-) diff --git a/pages/data-lab/faq.mdx b/pages/data-lab/faq.mdx index 20cc6dea3d..e7f26fc615 100644 --- a/pages/data-lab/faq.mdx +++ b/pages/data-lab/faq.mdx @@ -28,10 +28,6 @@ It offers scalable CPU and GPU Instances with flexible node limits and robust Ap ## Offering and availability -### What data source options are available? - -Data Lab natively integrates with Scaleway Object Storage for reading and writing data, making it easy to process data directly from your buckets. Your buckets are accessible using the Scaleway console or any other Amazon S3-compatible CLI tool. - ### What notebook is included with Dedicated Data Labs? The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark cluster for seamless data processing and calculations. @@ -40,9 +36,11 @@ The service provides a JupyterLab notebook running on a dedicated CPU Instance, ### How am I billed for Data Lab for Apache Spark™? -Data Lab for Apache Spark™ is billed based on two factors: -- The main node configuration selected +Data Lab for Apache Spark™ is billed based on the following factors: +- The main node configuration selected. - The worker node configuration selected, and the number of worker nodes in the cluster. +- The Persistent volume size provisioned. +- The presence of a notebook. 
## Compatibility and integration @@ -50,13 +48,11 @@ Data Lab for Apache Spark™ is billed based on two factors: Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages Nvidia's [RAPIDS Accelerator For Apache Spark](https://www.nvidia.com/en-gb/deep-learning-ai/software/rapids/), an open-source suite of software libraries and APIs to execute end-to-end data science and analytics pipelines entirely on GPUs. This technology allows for significant acceleration of data processing tasks compared to CPU-based processing. -### Can I connect to S3 buckets from other cloud providers? - -Currently, connections are limited to Scaleway's Object Storage environment. +### Can I connect a separate notebook environment to the Data Lab? -### Can I connect my local JupyterLab to the Data Lab? +Yes, you can connect a different notebook via Private Networks. -Remote connections to a Data Lab cluster are currently not supported. +Refer to the [dedicated documentation](/data-lab/how-to/use-private-networks/) for comprehensive information on how to connect to a Data Lab for Apache Spark™ cluster over Private Networks. ## Usage and management diff --git a/pages/data-lab/how-to/create-data-lab.mdx b/pages/data-lab/how-to/create-data-lab.mdx index 1f42fe894a..cf30424c66 100644 --- a/pages/data-lab/how-to/create-data-lab.mdx +++ b/pages/data-lab/how-to/create-data-lab.mdx @@ -3,7 +3,7 @@ title: How to create a Data Lab for Apache Spark™ description: Step-by-step guide to creating a Data Lab for Apache Spark™ on Scaleway. tags: data lab apache spark create process dates: - validation: 2025-09-02 + validation: 2025-12-10 posted: 2024-07-31 --- import Requirements from '@macros/iam/requirements.mdx' @@ -14,8 +14,8 @@ Data Lab for Apache Spark™ is a product designed to assist data scientists and - A Scaleway account logged into the [console](https://console.scaleway.com) - [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization -- Optionally, an [Object Storage bucket](/object-storage/how-to/create-a-bucket/) - A valid [API key](/iam/how-to/create-api-keys/) +- Created a [Private Network](/vpc/how-to/create-private-network/) 1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. @@ -25,17 +25,22 @@ Data Lab for Apache Spark™ is a product designed to assist data scientists and 4. Choose a main node type. If you plan to add a notebook to your cluster, select the **DDL-PLAY2-MICRO** configuration to provision sufficient resources for it. -5. Select a worker node configuration. - - Enter the desired number of worker nodes. - - Provisioning zero worker nodes lets you retain and access you cluster and notebook configurations, but will not allow you to run calculations. - - - Activate the [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs. - - Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. A minimum of 1 GB is required to run the notebook. - - - Enter a name for your Data Lab. - - Optionally, add a description and/or tags for your Data Lab. - - Verify the estimated cost. - -4. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page. \ No newline at end of file +5. Choose a worker node type depending on your hardware requirements. + +6. Enter the desired number of worker nodes. + +7. 
Add a [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs. + + + Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. A minimum of 1 GB is required to run the notebook. + + +8. Add a notebook if you want to use an integrated notebook environment to interact with your cluster. Adding a notebook requires 1 GB of billable storage. + +9. Select a Private Network from the drop-down menu to attach to your cluster, or create a new one. Data Lab clusters cannot be used without a Private Network. + +8. Enter a name for your Data Lab cluster, and add an optional description and/or tags. + +9. Verify the estimated cost. + +10. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page. \ No newline at end of file diff --git a/pages/data-lab/how-to/manage-delete-data-lab.mdx b/pages/data-lab/how-to/manage-delete-data-lab.mdx index 8ad567bcbe..991a949b53 100644 --- a/pages/data-lab/how-to/manage-delete-data-lab.mdx +++ b/pages/data-lab/how-to/manage-delete-data-lab.mdx @@ -3,7 +3,7 @@ title: How to manage and delete a Data Lab for Apache Spark™ description: Step-by-step guide to managing and deleting a Data Lab for Apache Spark™ with the Scaleway console. tags: data lab apache spark delete remove suppress dates: - validation: 2025-09-02 + validation: 2025-12-10 posted: 2024-07-31 --- import Requirements from '@macros/iam/requirements.mdx' @@ -20,7 +20,11 @@ This page explains how to manage and delete your Data Lab for Apache Spark™. 1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. -2. Click the name of the Data Lab cluster you want to manage. The overview tab of the cluster displays. From this view, you can see the configuration of your cluster. +2. Click the name of the Data Lab cluster you want to manage. The overview tab of the cluster displays. From this view, you can: + - Consult the configuration of your cluster. + - View the network information of your cluster. + - [Access the Apache Spark™ UI](/data-lab/how-to/acess-spark-ui/) of your cluster. + - [Access the notebook environment](/data-lab/how-to/acess-notebook/) of your cluster. 3. Click the **Settings** tab. 
diff --git a/pages/data-lab/how-to/use-private-networks.mdx b/pages/data-lab/how-to/use-private-networks.mdx
index 5541da232b..ac871c6e96 100644
--- a/pages/data-lab/how-to/use-private-networks.mdx
+++ b/pages/data-lab/how-to/use-private-networks.mdx
@@ -3,8 +3,8 @@ title: How to use Private Networks with your Data Lab cluster
 description: This page explains how to use Private Networks with Scaleway Data Lab for Apache Spark™
 tags: private-networks private networks data lab spark apache cluster vpc
 dates:
-  validation: 2025-06-25
-  posted: 2021-06-25
+  validation: 2025-12-10
+  posted: 2025-12-10
 ---
 import Requirements from '@macros/iam/requirements.mdx'
diff --git a/pages/data-lab/quickstart.mdx b/pages/data-lab/quickstart.mdx
index 90de72d1e3..dd083df913 100644
--- a/pages/data-lab/quickstart.mdx
+++ b/pages/data-lab/quickstart.mdx
@@ -20,6 +20,7 @@

 - A Scaleway account logged into the [console](https://console.scaleway.com)
 - [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
+- Created a [Private Network](/vpc/how-to/create-private-network/)
 - Created an [IAM API key](/iam/how-to/create-api-keys/)

 ## How to create a Data Lab for Apache Spark™ cluster

 3. Complete the following steps in the wizard:

     - Select a region for your cluster.
     - Choose an Apache Spark™ version from the drop-down menu.
     - Select the **DDL-PLAY2-MICRO** main node type.
     - Select a **CPU** worker node configuration.
     - Enter the desired number of worker nodes.
+    - Select an existing Private Network, or create a new one.
-    - Enter a name for your cluster.
-    - Optionally, add a description and/or tags.
+    - Enter a name for your cluster, and an optional description and tags.
     - Verify the estimated cost.

 4. Click **Create Data Lab cluster** to finish.

From 2a2db390b02c894bdfddfd09626f1a740e06bc9b Mon Sep 17 00:00:00 2001
From: SamyOubouaziz
Date: Wed, 10 Dec 2025 11:52:21 +0100
Subject: [PATCH 15/21] Apply suggestions from code review

Co-authored-by: ldecarvalho-doc <82805470+ldecarvalho-doc@users.noreply.github.com>
---
 pages/data-lab/concepts.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx
index a536e8be4b..3000e2b8e8 100644
--- a/pages/data-lab/concepts.mdx
+++ b/pages/data-lab/concepts.mdx
@@ -12,7 +12,7 @@ An Apache Spark™ cluster is an orchestrated set of machines over which distrib

 ## Data Lab

-A Data Lab is a project setup that combines a Notebook and an Apache Spark™ Cluster for data analysis and experimentation. it comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.
+A Data Lab is a project setup that combines a notebook and an Apache Spark™ cluster for data analysis and experimentation. It comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.
## Data Lab for Apache Spark™ From 99ef7033b97261649c878dac3c80ccd8c928440c Mon Sep 17 00:00:00 2001 From: SamyOubouaziz Date: Wed, 10 Dec 2025 11:52:35 +0100 Subject: [PATCH 16/21] Update pages/data-lab/concepts.mdx Co-authored-by: ldecarvalho-doc <82805470+ldecarvalho-doc@users.noreply.github.com> --- pages/data-lab/concepts.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index 3000e2b8e8..c8118de55b 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -36,7 +36,7 @@ Lighter is a technology that enables SparkMagic commands to be readable and exec ## Main node -The main node in a n Apache Spark™ cluster is the driver Node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster. +The main node in an Apache Spark™ cluster is the driver Node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster. ## Notebook From 75b7d2b721b3917b89318773e7ad9aab326524ac Mon Sep 17 00:00:00 2001 From: SamyOubouaziz Date: Wed, 10 Dec 2025 11:55:06 +0100 Subject: [PATCH 17/21] Apply suggestions from code review Co-authored-by: Jessica <113192637+jcirinosclwy@users.noreply.github.com> --- pages/data-lab/faq.mdx | 2 +- pages/data-lab/how-to/access-spark-ui.mdx | 6 ++--- pages/data-lab/how-to/create-data-lab.mdx | 6 ++--- .../how-to/manage-delete-data-lab.mdx | 2 +- .../data-lab/how-to/use-private-networks.mdx | 26 +++++++++---------- pages/data-lab/quickstart.mdx | 4 +-- 6 files changed, 23 insertions(+), 23 deletions(-) diff --git a/pages/data-lab/faq.mdx b/pages/data-lab/faq.mdx index e7f26fc615..b66e6a08fa 100644 --- a/pages/data-lab/faq.mdx +++ b/pages/data-lab/faq.mdx @@ -39,7 +39,7 @@ The service provides a JupyterLab notebook running on a dedicated CPU Instance, Data Lab for Apache Spark™ is billed based on the following factors: - The main node configuration selected. - The worker node configuration selected, and the number of worker nodes in the cluster. -- The Persistent volume size provisioned. +- The persistent volume size provisioned. - The presence of a notebook. ## Compatibility and integration diff --git a/pages/data-lab/how-to/access-spark-ui.mdx b/pages/data-lab/how-to/access-spark-ui.mdx index 89b6394357..6010ef2240 100644 --- a/pages/data-lab/how-to/access-spark-ui.mdx +++ b/pages/data-lab/how-to/access-spark-ui.mdx @@ -1,5 +1,5 @@ --- -title: How to Access the Apache Spark™ UI +title: How to access the Apache Spark™ UI description: Step-by-step guide to access and use the Apache Spark™ UI in a Data Lab for Apache Spark™ on Scaleway. tags: data lab apache spark ui gui console dates: @@ -9,7 +9,7 @@ dates: import Requirements from '@macros/iam/requirements.mdx' -This page explains how to Access the Apache Spark™ UI of your Data Lab for Apache Spark™ cluster. +This page explains how to access the Apache Spark™ UI of your Data Lab for Apache Spark™ cluster. @@ -26,6 +26,6 @@ This page explains how to Access the Apache Spark™ UI of your Data Lab for Apa 4. Enter the **secret key** of your API key, then click **Authenticate**. The Apache Spark™ UI dashboard displays. -From this view, you can view and monitor worker nodes, executors and applications. +From this page, you can view and monitor worker nodes, executors, and applications. 
Refer to the [official Apache Spark™ documentation](https://spark.apache.org/docs/latest/web-ui.html) for comprehensive information on how to use the web UI.
\ No newline at end of file
diff --git a/pages/data-lab/how-to/create-data-lab.mdx b/pages/data-lab/how-to/create-data-lab.mdx
index cf30424c66..77a895f44c 100644
--- a/pages/data-lab/how-to/create-data-lab.mdx
+++ b/pages/data-lab/how-to/create-data-lab.mdx
@@ -39,8 +39,8 @@ Data Lab for Apache Spark™ is a product designed to assist data scientists and
 
 9. Select a Private Network from the drop-down menu to attach to your cluster, or create a new one. Data Lab clusters cannot be used without a Private Network.
 
-8. Enter a name for your Data Lab cluster, and add an optional description and/or tags.
+10. Enter a name for your Data Lab cluster, and add an optional description and/or tags.
 
-9. Verify the estimated cost.
+11. Verify the estimated cost.
 
-10. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page.
\ No newline at end of file
+12. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page.
\ No newline at end of file
diff --git a/pages/data-lab/how-to/manage-delete-data-lab.mdx b/pages/data-lab/how-to/manage-delete-data-lab.mdx
index 991a949b53..7fd87af6c1 100644
--- a/pages/data-lab/how-to/manage-delete-data-lab.mdx
+++ b/pages/data-lab/how-to/manage-delete-data-lab.mdx
@@ -20,7 +20,7 @@ This page explains how to manage and delete your Data Lab for Apache Spark™.
 
 1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.
 
-2. Click the name of the Data Lab cluster you want to manage. The overview tab of the cluster displays. From this view, you can:
+2. Click the name of the Data Lab cluster you want to manage. The overview tab of the cluster displays. From this page, you can:
     - Consult the configuration of your cluster.
     - View the network information of your cluster.
     - [Access the Apache Spark™ UI](/data-lab/how-to/access-spark-ui/) of your cluster.
diff --git a/pages/data-lab/how-to/use-private-networks.mdx b/pages/data-lab/how-to/use-private-networks.mdx
index ac871c6e96..8a423207dc 100644
--- a/pages/data-lab/how-to/use-private-networks.mdx
+++ b/pages/data-lab/how-to/use-private-networks.mdx
@@ -39,13 +39,13 @@ For full information about Scaleway Private Networks and VPC, see our [dedicated
    openjdk-17-jre-headless
    tmux
    ```
 
-2. Run the command below to install `pyenv`:
+3. Run the command below to install `pyenv`:
 
    ```bash
    curl https://pyenv.run | bash
    ```
 
-3. Run the command below to add `pyenv` to your Bash configuration:
+4. Run the command below to add `pyenv` to your Bash configuration:
 
    ```bash
    echo 'export PATH="$HOME/.pyenv/bin:$PATH"' >> ~/.bashrc
    echo 'eval "$(pyenv init -)"' >> ~/.bashrc
    echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc
    ```
 
-4. Run the command below to reload your shell:
+5. Run the command below to reload your shell:
 
    ```bash
    exec $SHELL
    ```
 
-5. Run the command below to install **Python 3.13**, and activate a virtual environment:
+6. Run the command below to install **Python 3.13**, and activate a virtual environment:
 
    ```bash
    pyenv install 3.13.0
    pyenv virtualenv 3.13.0 spark-env
    pyenv activate spark-env
    ```
 
-    Your Instance Python version must be 3.13. 
If you encounter an error due to a mismatch between the worker and driver Python versions, run the following command to display minor versions, then reinstall using the exact one:
+    Your Instance's Python version must be 3.13. If you encounter an error due to a mismatch between the worker and driver Python versions, run the following command to display minor versions, then reinstall using the exact one:
 
    ```bash
    pyenv install -l | grep 3.13
    ```
 
-6. Run the command below to install JupyterLab and PySpark inside the virtual environment:
+7. Run the command below to install JupyterLab and PySpark inside the virtual environment:
 
    ```bash
    pip install --upgrade pip
    pip install jupyterlab pyspark
    ```
 
-7. Run the command below to generate a configuration file for your JupyterLab:
+8. Run the command below to generate a configuration file for your JupyterLab:
 
    ```bash
    jupyter lab --generate-config
    ```
 
-8. Open the configuration file you just created:
+9. Open the configuration file you just created:
 
    ```bash
    nano ~/.jupyter/jupyter_lab_config.py
    ```
 
-9. Set the following parameters:
+10. Set the following parameters:
 
    ```python
    # if running as root:
    c.ServerApp.allow_root = True
 
    # optional (password):
    # c.ServerApp.token = "your-super-secure-password"
    ```
 
-10. Run the command below to start Jupyterlab:
+11. Run the command below to start JupyterLab:
 
    ```bash
    jupyter lab
    ```
 
-11. In a new terminal, connect to your JupyterLab via SSH. The Instance public IP can be found in the **Overview** tab of your Instance:
+12. In a new terminal, connect to your JupyterLab via SSH. The Instance public IP can be found in the **Overview** tab of your Instance:
 
    ```bash
    ssh -L 8888:127.0.0.1:8888 <user>@<instance_public_ip>
    ```
 
    Make sure to allow root connection in your configuration file if you log in as a root user.
 
-12. Access [http://localhost:8888](http://localhost:8888), then enter the token generated while executing the `jupyter lab` command.
+13. Access [http://localhost:8888](http://localhost:8888), then enter the token generated while executing the `jupyter lab` command.
 
 You now have access to your Data Lab for Apache Spark™ cluster via a Private Network, using a JupyterLab notebook deployed on an Instance.
 
@@ -156,7 +156,7 @@ You now have access to your Data Lab for Apache Spark™ cluster via a Private N
 
 3. Run the cell.
 
-Your notebook hosted on an Instance is ready to be used over Private Network.
+Your notebook hosted on an Instance is ready to be used over Private Networks.
 
 ### Running an application over Private Networks using spark-submit
diff --git a/pages/data-lab/quickstart.mdx b/pages/data-lab/quickstart.mdx
index dd083df913..a5aeb1ac90 100644
--- a/pages/data-lab/quickstart.mdx
+++ b/pages/data-lab/quickstart.mdx
@@ -14,7 +14,7 @@ Follow this guided tour to discover how to navigate the console.
 
 Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark™ infrastructure.
 
-This documentation explains how to create a Data Lab for Apache Spark™ cluster, how to access its notebook environment and run the included demo file, and how to delete your cluster.
+This documentation explains how to create a Data Lab for Apache Spark™ cluster, access its notebook environment and run the included demo file, and delete your cluster.
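+
+Once the cluster is up, a quick way to confirm that the notebook can reach it is to run a first cell like the sketch below. This is illustrative only: it assumes the session is already bound to the cluster with `spark` as the preconfigured entry point, as set up by the included demo file.
+
+```python
+# Minimal smoke test, assuming a preconfigured `spark` session:
+# distributes a small dataset across the workers and aggregates it.
+df = spark.range(1000).selectExpr("id % 7 AS bucket")
+df.groupBy("bucket").count().orderBy("bucket").show()
+```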
@@ -33,7 +33,7 @@ This documentation explains how to create a Data Lab for Apache Spark™ cluster - Select a region for your cluster. - Choose an Apache Spark™ version from the drop-down menu. - - Select the **DDL-PLAY2-MICRO** main nide type. + - Select the **DDL-PLAY2-MICRO** main node type. - Select a **CPU** worker node configuration. - Enter the desired number of worker nodes. - Select an existing Private Network, or create a new one. From 53b16e7094002e20acc8fe2ffaf7b007af66b1b9 Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Wed, 10 Dec 2025 11:56:50 +0100 Subject: [PATCH 18/21] feat(dlb): update --- pages/data-lab/faq.mdx | 10 +++++----- pages/data-lab/how-to/create-data-lab.mdx | 4 ++-- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/pages/data-lab/faq.mdx b/pages/data-lab/faq.mdx index b66e6a08fa..2f1a8a4341 100644 --- a/pages/data-lab/faq.mdx +++ b/pages/data-lab/faq.mdx @@ -10,11 +10,11 @@ productIcon: DistributedDataLabProductIcon ### What is Apache Spark? -Apache Spark is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. +Apache Spark™ is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark™ offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. -### How does Apache Spark work? +### How does Apache Spark™ work? -Apache Spark processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like [Hadoop MapReduce](https://fr.wikipedia.org/wiki/MapReduce). It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data. +Apache Spark™ processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like [Hadoop MapReduce](https://fr.wikipedia.org/wiki/MapReduce). It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data. ### What workloads is Data Lab for Apache Spark™ suited for? @@ -24,13 +24,13 @@ Data Lab for Apache Spark™ supports a range of workloads, including: - Machine learning tasks - High-speed operations on large datasets -It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark library support. +It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark™ library support. ## Offering and availability ### What notebook is included with Dedicated Data Labs? -The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark cluster for seamless data processing and calculations. +The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark™ cluster for seamless data processing and calculations. 
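+
+In practice, this integration surfaces as SparkMagic-style cells: code prefixed with the `%%spark` magic is shipped to the cluster for execution instead of running locally on the notebook Instance. The cell below is a generic SparkMagic sketch rather than a Scaleway-specific example; the exact session wiring in the bundled notebook may differ.
+
+```python
+%%spark
+# Executed remotely on the cluster; the managed session exposes `spark`.
+spark.range(10_000).selectExpr("sum(id) AS total").show()
+```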
## Pricing and billing diff --git a/pages/data-lab/how-to/create-data-lab.mdx b/pages/data-lab/how-to/create-data-lab.mdx index 77a895f44c..e737d37604 100644 --- a/pages/data-lab/how-to/create-data-lab.mdx +++ b/pages/data-lab/how-to/create-data-lab.mdx @@ -8,7 +8,7 @@ dates: --- import Requirements from '@macros/iam/requirements.mdx' -Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure. +Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark™ infrastructure. @@ -21,7 +21,7 @@ Data Lab for Apache Spark™ is a product designed to assist data scientists and 2. Click **Create Data Lab cluster**. The creation wizard displays. -3. Choose an Apache Spark version from the drop-down menu. +3. Choose an Apache Spark™ version from the drop-down menu. 4. Choose a main node type. If you plan to add a notebook to your cluster, select the **DDL-PLAY2-MICRO** configuration to provision sufficient resources for it. From ddf71e883fea5adcae47249caf836f1ad37f9743 Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Wed, 10 Dec 2025 11:57:07 +0100 Subject: [PATCH 19/21] feat(dlb): update --- pages/data-lab/concepts.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index c8118de55b..e89ab6f01b 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -36,7 +36,7 @@ Lighter is a technology that enables SparkMagic commands to be readable and exec ## Main node -The main node in an Apache Spark™ cluster is the driver Node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster. +The main node in an Apache Spark™ cluster is the driver node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster. ## Notebook From 0f89393c9d4a6784f666ee5fd43f310eba802acf Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Wed, 10 Dec 2025 11:58:07 +0100 Subject: [PATCH 20/21] feat(dlb): update --- pages/data-lab/concepts.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index e89ab6f01b..2c6d9974cc 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -12,7 +12,7 @@ An Apache Spark™ cluster is an orchestrated set of machines over which distrib ## Data Lab -A Data Lab is a project setup that combines a Notebook and an Apache Spark™ Cluster for data analysis and experimentation. It comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights. +A Data Lab is a project setup that combines a Notebook and an Apache Spark™ cluster for data analysis and experimentation. It includes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights. 
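+
+The task scheduling described for the main node above is easy to observe in code: the driver turns a transformation into tasks (one per partition), ships them to the executors, and only the reduced result travels back. The sketch below is generic PySpark for illustration, not a Scaleway-specific setup.
+
+```python
+# Generic illustration of driver-side scheduling: four partitions
+# become four tasks that run on executors; sum() reduces on the driver.
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.appName("driver-demo").getOrCreate()
+rdd = spark.sparkContext.parallelize(range(100), 4)
+print(rdd.map(lambda x: x * x).sum())  # 328350
+```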
## Data Lab for Apache Spark™ From 117d0d69c65b928195bb48a1120263f915da95cb Mon Sep 17 00:00:00 2001 From: Samy OUBOUAZIZ Date: Wed, 10 Dec 2025 14:18:28 +0100 Subject: [PATCH 21/21] feat(dlb): update --- pages/data-lab/concepts.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index 2c6d9974cc..8151a7665a 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -12,7 +12,7 @@ An Apache Spark™ cluster is an orchestrated set of machines over which distrib ## Data Lab -A Data Lab is a project setup that combines a Notebook and an Apache Spark™ cluster for data analysis and experimentation. It includes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights. +A Data Lab is a project setup that combines a Notebook and an Apache Spark™ cluster for data analysis and experimentation. It includes the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights. ## Data Lab for Apache Spark™