31 changes: 21 additions & 10 deletions pages/data-lab/concepts.mdx
@@ -6,13 +6,13 @@ dates:
validation: 2025-09-02
---

## Apache Spark cluster

An Apache Spark cluster is an orchestrated set of machines across which distributed Big Data computations are processed. In the case of Scaleway Data Lab, the Apache Spark cluster is a Kubernetes cluster, with Apache Spark installed in each Pod. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html).
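
As a minimal sketch, the PySpark snippet below illustrates how work submitted through a single session is partitioned and executed in parallel across the cluster's executors; the application name is only an illustrative placeholder.

```python
from pyspark.sql import SparkSession

# In a managed Data Lab cluster the Spark session is typically pre-configured;
# the application name here is only a placeholder.
spark = SparkSession.builder.appName("cluster-overview").getOrCreate()

# The rows of this DataFrame are partitioned across the cluster,
# and the count is computed in parallel by the executors.
print(spark.range(1_000_000).count())
```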

## Data Lab

A Data Lab is a project setup that combines a Notebook and an Apache Spark Cluster for data analysis and experimentation. it comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.
A Data Lab is a project setup that combines a Notebook and an Apache Spark™ cluster for data analysis and experimentation. It includes the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.

## Data Lab for Apache Spark™

@@ -24,33 +24,44 @@ A fixture is a set of data forming a request used for testing purposes.

## GPU

GPUs (Graphics Processing Units) allow Apache Spark to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and specific data analytics workloads, significantly reducing the processing time for massive datasets and data preparation for AI models.

## JupyterLab

JupyterLab is a web-based platform for interactive computing, letting you work with notebooks, code, and data all in one place. It builds on the classic Jupyter Notebook by offering a more flexible and integrated user interface, making it easier to handle various file formats and interactive components.

## Lighter

Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark cluster. For more details, check out the [Lighter repository](https://github.com/exacaster/lighter).
Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark™ cluster. For more details, check out the [Lighter repository](https://github.com/exacaster/lighter).

## Main node

The main node in an Apache Spark™ cluster is the driver node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster.


## Notebook

A notebook for an Apache Spark cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows.
A notebook for an Apache Spark™ cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark™ cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows.

Adding a notebook to your cluster requires 1 GB of storage.

## Persistent volume

A Persistent Volume (PV) is a cluster-wide storage resource that ensures data persistence beyond the lifecycle of individual Pods. Persistent volumes abstract the underlying storage details, allowing administrators to use various storage solutions.

Apache Spark® executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers.
Apache Spark executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers.

A PV sized properly ensures a smooth execution of your workload.
A properly sized persistent volume ensures smooth execution of your workload.
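
To make the shuffle behavior concrete, here is a minimal PySpark sketch (illustrative only) of a wide transformation: `groupBy` forces rows sharing a key to be brought together, so executors write intermediate data to shuffle storage backed by the persistent volume.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-example").getOrCreate()

# groupBy is a wide transformation: rows sharing a key must be combined,
# so executors write map-side intermediate data to shuffle storage before
# the aggregation is computed.
df = spark.range(10_000_000).withColumn("key", F.col("id") % 100)
df.groupBy("key").count().show()
```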

## SparkMagic

SparkMagic is a set of tools that allows you to interact with Apache Spark clusters through Jupyter notebooks. It provides magic commands for running Spark jobs, querying data, and managing Spark sessions directly within the notebook interface, facilitating seamless integration and execution of Spark tasks. For more details, check out the [SparkMagic repository](https://github.com/jupyter-incubator/sparkmagic).
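
As an illustration, notebook cells using SparkMagic might look like the sketch below; the exact magics available depend on your kernel setup, and a remote session must already be configured (that step is omitted here).

```python
# One-time setup in an IPython kernel; in a Data Lab notebook this may
# already be done for you.
%load_ext sparkmagic.magics
```

```python
%%spark
# This cell body is shipped to the remote Apache Spark cluster and executed there.
df = spark.range(100)
df.count()
```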


## Transaction

An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error.
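
As a minimal, generic illustration (using Python's built-in `sqlite3` module, not specific to Data Lab), the sketch below shows a transaction that is either committed as a whole or rolled back on error:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # Both updates form a single unit of work: either both apply or neither does.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()
except sqlite3.Error:
    # On error, roll back so the data stays consistent.
    conn.rollback()
```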

## Worker nodes

Worker nodes are high-end machines built for intensive computations, featuring powerful CPUs or GPUs and substantial RAM.
28 changes: 12 additions & 16 deletions pages/data-lab/faq.mdx
@@ -10,11 +10,11 @@ productIcon: DistributedDataLabProductIcon

### What is Apache Spark?

Apache Spark is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.

### How does Apache Spark work?

Apache Spark processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like [Hadoop MapReduce](https://fr.wikipedia.org/wiki/MapReduce). It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data.
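
For illustration, the PySpark sketch below (illustrative only) uses an RDD to distribute data across the cluster and run a map/reduce computation in parallel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# The data is split into partitions stored across the cluster's nodes,
# and the map and reduce steps run in parallel on those partitions.
rdd = sc.parallelize(range(1, 1001))
sum_of_squares = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(sum_of_squares)
```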

### What workloads is Data Lab for Apache Spark™ suited for?

@@ -24,39 +24,35 @@ Data Lab for Apache Spark™ supports a range of workloads, including:
- Machine learning tasks
- High-speed operations on large datasets

It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark library support.

## Offering and availability

### What data source options are available?

Data Lab natively integrates with Scaleway Object Storage for reading and writing data, making it easy to process data directly from your buckets. Your buckets are accessible using the Scaleway console or any other Amazon S3-compatible CLI tool.
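
As a sketch of what this can look like, the snippet below reads a Parquet dataset from an Object Storage bucket over the S3A connector; the endpoint, bucket name, path, and credentials are placeholders, and some of these settings may already be pre-configured in a Data Lab cluster.

```python
from pyspark.sql import SparkSession

# The endpoint, bucket name, and credentials below are placeholders only.
spark = (
    SparkSession.builder.appName("object-storage-example")
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.fr-par.scw.cloud")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Read a Parquet dataset directly from the bucket over the S3A connector.
df = spark.read.parquet("s3a://my-example-bucket/datasets/example.parquet")
df.show(5)
```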

### What notebook is included with Dedicated Data Labs?

The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark cluster for seamless data processing and calculations.

## Pricing and billing

### How am I billed for Data Lab for Apache Spark™?

Data Lab for Apache Spark™ is billed based on two factors:
- The main node configuration selected
Data Lab for Apache Spark™ is billed based on the following factors:
- The main node configuration selected.
- The worker node configuration selected, and the number of worker nodes in the cluster.
- The persistent volume size provisioned.
- The presence of a notebook.

## Compatibility and integration

### Can I run a Data Lab for Apache Spark™ using GPUs?

Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages Nvidia's [RAPIDS Accelerator For Apache Spark](https://www.nvidia.com/en-gb/deep-learning-ai/software/rapids/), an open-source suite of software libraries and APIs to execute end-to-end data science and analytics pipelines entirely on GPUs. This technology allows for significant acceleration of data processing tasks compared to CPU-based processing.
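
As an illustrative sketch only: on a managed GPU cluster the RAPIDS Accelerator is typically configured for you, and the exact options depend on the plugin version, but enabling it in a Spark session generally looks like the following.

```python
from pyspark.sql import SparkSession

# Illustrative settings only; the RAPIDS jars and GPU resource configuration
# are assumed to be provided by the managed cluster.
spark = (
    SparkSession.builder.appName("gpu-example")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)

# With the plugin active, supported SQL/DataFrame operations run on the GPU.
spark.range(10_000_000).selectExpr("sum(id)").show()
```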

### Can I connect to S3 buckets from other cloud providers?

Currently, connections are limited to Scaleway's Object Storage environment.
### Can I connect a separate notebook environment to the Data Lab?

### Can I connect my local JupyterLab to the Data Lab?
Yes, you can connect a different notebook via Private Networks.

Remote connections to a Data Lab cluster are currently not supported.
Refer to the [dedicated documentation](/data-lab/how-to/use-private-networks/) for comprehensive information on how to connect to a Data Lab for Apache Spark™ cluster over Private Networks.

## Usage and management

31 changes: 31 additions & 0 deletions pages/data-lab/how-to/access-notebook.mdx
@@ -0,0 +1,31 @@
---
title: How to access and use the notebook of a Data Lab cluster
description: Step-by-step guide to access and use the notebook environment in a Data Lab for Apache Spark™ on Scaleway.
tags: data lab apache spark notebook environment jupyterlab
dates:
validation: 2025-12-04
posted: 2025-12-04
---

import Requirements from '@macros/iam/requirements.mdx'

This page explains how to access and use the notebook environment of your Data Lab for Apache Spark™ cluster.

<Requirements />

- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/) with a notebook
- Created an [IAM API key](/iam/how-to/create-api-keys/)

## How to access the notebook of your cluster

1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.

2. Click the name of the desired Data Lab cluster. The overview tab of the cluster displays.

3. Click the **Open notebook** button. A login page displays.

4. Enter the **secret key** of your API key, then click **Authenticate**. The notebook dashboard displays.

You are now connected to your notebook environment.
31 changes: 31 additions & 0 deletions pages/data-lab/how-to/access-spark-ui.mdx
@@ -0,0 +1,31 @@
---
title: How to access the Apache Spark™ UI
description: Step-by-step guide to access and use the Apache Spark™ UI in a Data Lab for Apache Spark™ on Scaleway.
tags: data lab apache spark ui gui console
dates:
validation: 2025-12-04
posted: 2025-12-04
---

import Requirements from '@macros/iam/requirements.mdx'

This page explains how to access the Apache Spark™ UI of your Data Lab for Apache Spark™ cluster.

<Requirements />

- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/)
- Created an [IAM API key](/iam/how-to/create-api-keys/)

1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.

2. Click the name of the desired Data Lab cluster. The overview tab of the cluster displays.

3. Click the **Open Apache Spark™ UI** button. A login page displays.

4. Enter the **secret key** of your API key, then click **Authenticate**. The Apache Spark™ UI dashboard displays.

From this page, you can view and monitor worker nodes, executors, and applications.

Refer to the [official Apache Spark™ documentation](https://spark.apache.org/docs/latest/web-ui.html) for comprehensive information on how to use the web UI.
38 changes: 0 additions & 38 deletions pages/data-lab/how-to/connect-to-data-lab.mdx

This file was deleted.

45 changes: 26 additions & 19 deletions pages/data-lab/how-to/create-data-lab.mdx
@@ -3,37 +3,44 @@ title: How to create a Data Lab for Apache Spark™
description: Step-by-step guide to creating a Data Lab for Apache Spark™ on Scaleway.
tags: data lab apache spark create process
dates:
validation: 2025-09-02
validation: 2025-12-10
posted: 2024-07-31
---
import Requirements from '@macros/iam/requirements.mdx'

Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure.

<Requirements />

- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- Optionally, an [Object Storage bucket](/object-storage/how-to/create-a-bucket/)
- A valid [API key](/iam/how-to/create-api-keys/)
- Created a [Private Network](/vpc/how-to/create-private-network/)

1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.

2. Click **Create Data Lab cluster**. The creation wizard displays.

3. Complete the following steps in the wizard:
- Choose an Apache Spark version from the drop-down menu.
- Select a worker node configuration.
- Enter the desired number of worker nodes.
<Message type="note">
Provisioning zero worker nodes lets you retain and access your cluster and notebook configurations, but will not allow you to run calculations.
</Message>
- Activate the [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs.
<Message type="note">
Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. A minimum of 1 GB is required to run the notebook.
</Message>
- Enter a name for your Data Lab.
- Optionally, add a description and/or tags for your Data Lab.
- Verify the estimated cost.

4. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page.
3. Choose an Apache Spark™ version from the drop-down menu.

4. Choose a main node type. If you plan to add a notebook to your cluster, select the **DDL-PLAY2-MICRO** configuration to provision sufficient resources for it.

5. Choose a worker node type depending on your hardware requirements.

6. Enter the desired number of worker nodes.

7. Add a [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs.

<Message type="note">
Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. A minimum of 1 GB is required to run the notebook.
</Message>

8. Add a notebook if you want to use an integrated notebook environment to interact with your cluster. Adding a notebook requires 1 GB of billable storage.

9. Select a Private Network from the drop-down menu to attach to your cluster, or create a new one. Data Lab clusters cannot be used without a Private Network.

10. Enter a name for your Data Lab cluster, and add an optional description and/or tags.

11. Verify the estimated cost.

12. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page.