diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx
index 6e8d9269b7..8151a7665a 100644
--- a/pages/data-lab/concepts.mdx
+++ b/pages/data-lab/concepts.mdx
@@ -6,13 +6,13 @@ dates:
   validation: 2025-09-02
 ---
 
-## Apache Spark cluster
+## Apache Spark™ cluster
 
-An Apache Spark cluster is an orchestrated set of machines over which distributed/Big data calculus is processed. In the case of Scaleway Data Lab, the Apache Spark cluster is a Kubernetes cluster, with Apache Spark installed in each Pod. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html).
+An Apache Spark™ cluster is an orchestrated set of machines across which distributed and big data computations are processed. In the case of Scaleway Data Lab, the Apache Spark™ cluster is a Kubernetes cluster, with Apache Spark™ installed in each Pod. For more details, check out the [Apache Spark™ documentation](https://spark.apache.org/documentation.html).
 
 ## Data Lab
 
-A Data Lab is a project setup that combines a Notebook and an Apache Spark Cluster for data analysis and experimentation. it comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.
+A Data Lab is a project setup that combines a Notebook and an Apache Spark™ cluster for data analysis and experimentation. It includes the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights.
 
 ## Data Lab for Apache Spark™
 
@@ -24,7 +24,7 @@ A fixture is a set of data forming a request used for testing purposes.
 
 ## GPU
 
-GPUs (Graphical Processing Units) allow Apache Spark to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and specific data-analytics, significantly reducing the processing time for massive datasets and preparation for AI models.
+GPUs (Graphical Processing Units) allow Apache Spark™ to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and certain data analytics workloads, significantly reducing the processing time for massive datasets and data preparation for AI models.
 
 ## JupyterLab
 
@@ -32,25 +32,36 @@ JupyterLab is a web-based platform for interactive computing, letting you work w
 
 ## Lighter
 
-Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark cluster. For more details, check out the [Lighter repository](https://github.com/exacaster/lighter).
+Lighter is a technology that enables SparkMagic commands to be readable and executable by the Apache Spark™ cluster. For more details, check out the [Lighter repository](https://github.com/exacaster/lighter).
+
+## Main node
+
+The main node in an Apache Spark™ cluster is the driver node, which coordinates the execution of the Spark™ application by transforming code into tasks, scheduling them, and managing communication with the cluster.
+
 
 ## Notebook
 
-A notebook for an Apache Spark cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows.
+A notebook for an Apache Spark™ cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark™ cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows.
+
+Adding a notebook to your cluster requires 1 GB of storage.
 
 ## Persistent volume
 
 A Persistent Volume (PV) is a cluster-wide storage resource that ensures data persistence beyond the lifecycle of individual Pods. Persistent volumes abstract the underlying storage details, allowing administrators to use various storage solutions.
 
-Apache Spark® executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers.
+Apache Spark™ executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers.
 
-A PV sized properly ensures a smooth execution of your workload.
+A properly sized persistent volume ensures the smooth execution of your workloads.
 
 ## SparkMagic
 
-SparkMagic is a set of tools that allows you to interact with Apache Spark clusters through Jupyter notebooks. It provides magic commands for running Spark jobs, querying data, and managing Spark sessions directly within the notebook interface, facilitating seamless integration and execution of Spark tasks. For more details, check out the [SparkMagic repository](https://github.com/jupyter-incubator/sparkmagic).
+SparkMagic is a set of tools that allows you to interact with Apache Spark™ clusters through Jupyter notebooks. It provides magic commands for running Spark™ jobs, querying data, and managing Spark™ sessions directly within the notebook interface, facilitating seamless integration and execution of Spark™ tasks. For more details, check out the [SparkMagic repository](https://github.com/jupyter-incubator/sparkmagic).
 
 ## Transaction
 
-An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error.
\ No newline at end of file
+An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error.
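+
+For illustration only, a minimal sketch of this commit-or-roll-back behavior, using Python's built-in `sqlite3` module purely as an example (not specific to Data Lab): the two updates below either take effect together or not at all.
+
+```python
+import sqlite3
+
+conn = sqlite3.connect(":memory:")
+conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
+conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
+conn.commit()
+
+try:
+    # Both updates form a single unit of work (atomicity)
+    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
+    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
+    conn.commit()    # both changes become durable
+except sqlite3.Error:
+    conn.rollback()  # on error, neither change takes effect
+```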
+ +## Worker nodes + +Worker nodes are high-end machines built for intensive computations, featuring powerful CPUs/GPUs, and substantial RAM. diff --git a/pages/data-lab/faq.mdx b/pages/data-lab/faq.mdx index 20cc6dea3d..2f1a8a4341 100644 --- a/pages/data-lab/faq.mdx +++ b/pages/data-lab/faq.mdx @@ -10,11 +10,11 @@ productIcon: DistributedDataLabProductIcon ### What is Apache Spark? -Apache Spark is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. +Apache Spark™ is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark™ offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. -### How does Apache Spark work? +### How does Apache Spark™ work? -Apache Spark processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like [Hadoop MapReduce](https://fr.wikipedia.org/wiki/MapReduce). It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data. +Apache Spark™ processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like [Hadoop MapReduce](https://fr.wikipedia.org/wiki/MapReduce). It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data. ### What workloads is Data Lab for Apache Spark™ suited for? @@ -24,25 +24,23 @@ Data Lab for Apache Spark™ supports a range of workloads, including: - Machine learning tasks - High-speed operations on large datasets -It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark library support. +It offers scalable CPU and GPU Instances with flexible node limits and robust Apache Spark™ library support. ## Offering and availability -### What data source options are available? - -Data Lab natively integrates with Scaleway Object Storage for reading and writing data, making it easy to process data directly from your buckets. Your buckets are accessible using the Scaleway console or any other Amazon S3-compatible CLI tool. - ### What notebook is included with Dedicated Data Labs? -The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark cluster for seamless data processing and calculations. +The service provides a JupyterLab notebook running on a dedicated CPU Instance, fully integrated with the Apache Spark™ cluster for seamless data processing and calculations. ## Pricing and billing ### How am I billed for Data Lab for Apache Spark™? -Data Lab for Apache Spark™ is billed based on two factors: -- The main node configuration selected +Data Lab for Apache Spark™ is billed based on the following factors: +- The main node configuration selected. - The worker node configuration selected, and the number of worker nodes in the cluster. +- The persistent volume size provisioned. +- The presence of a notebook. 
## Compatibility and integration @@ -50,13 +48,11 @@ Data Lab for Apache Spark™ is billed based on two factors: Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages Nvidia's [RAPIDS Accelerator For Apache Spark](https://www.nvidia.com/en-gb/deep-learning-ai/software/rapids/), an open-source suite of software libraries and APIs to execute end-to-end data science and analytics pipelines entirely on GPUs. This technology allows for significant acceleration of data processing tasks compared to CPU-based processing. -### Can I connect to S3 buckets from other cloud providers? - -Currently, connections are limited to Scaleway's Object Storage environment. +### Can I connect a separate notebook environment to the Data Lab? -### Can I connect my local JupyterLab to the Data Lab? +Yes, you can connect a different notebook via Private Networks. -Remote connections to a Data Lab cluster are currently not supported. +Refer to the [dedicated documentation](/data-lab/how-to/use-private-networks/) for comprehensive information on how to connect to a Data Lab for Apache Spark™ cluster over Private Networks. ## Usage and management diff --git a/pages/data-lab/how-to/access-notebook.mdx b/pages/data-lab/how-to/access-notebook.mdx new file mode 100644 index 0000000000..fbce1eddfb --- /dev/null +++ b/pages/data-lab/how-to/access-notebook.mdx @@ -0,0 +1,31 @@ +--- +title: How to access and use the notebook of a Data Lab cluster +description: Step-by-step guide to access and use the notebook environment in a Data Lab for Apache Spark™ on Scaleway. +tags: data lab apache spark notebook environment jupyterlab +dates: + validation: 2025-12-04 + posted: 2025-12-04 +--- + +import Requirements from '@macros/iam/requirements.mdx' + +This page explains how to access and use the notebook environment of your Data Lab for Apache Spark™ cluster. + + + +- A Scaleway account logged into the [console](https://console.scaleway.com) +- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization +- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/) with a notebook +- Created an [IAM API key](/iam/how-to/create-api-keys/) + +## How to access the notebook of your cluster + +1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. + +2. Click the name of the desired Data Lab cluster. The overview tab of the cluster displays. + +3. Click the **Open notebook** button. A login page displays. + +4. Enter the **secret key** of your API key, then click **Authenticate**. The notebook dashboard displays. + +You are now connected to your notebook environment. \ No newline at end of file diff --git a/pages/data-lab/how-to/access-spark-ui.mdx b/pages/data-lab/how-to/access-spark-ui.mdx new file mode 100644 index 0000000000..6010ef2240 --- /dev/null +++ b/pages/data-lab/how-to/access-spark-ui.mdx @@ -0,0 +1,31 @@ +--- +title: How to access the Apache Spark™ UI +description: Step-by-step guide to access and use the Apache Spark™ UI in a Data Lab for Apache Spark™ on Scaleway. +tags: data lab apache spark ui gui console +dates: + validation: 2025-12-04 + posted: 2025-12-04 +--- + +import Requirements from '@macros/iam/requirements.mdx' + +This page explains how to access the Apache Spark™ UI of your Data Lab for Apache Spark™ cluster. 
+ + + +- A Scaleway account logged into the [console](https://console.scaleway.com) +- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization +- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/) +- Created an [IAM API key](/iam/how-to/create-api-keys/) + +1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. + +2. Click the name of the desired Data Lab cluster. The overview tab of the cluster displays. + +3. Click the **Open Apache Spark™ UI** button. A login page displays. + +4. Enter the **secret key** of your API key, then click **Authenticate**. The Apache Spark™ UI dashboard displays. + +From this page, you can view and monitor worker nodes, executors, and applications. + +Refer to the [official Apache Spark™ documentation](https://spark.apache.org/docs/latest/web-ui.html) for comprehensive information on how to use the web UI. \ No newline at end of file diff --git a/pages/data-lab/how-to/connect-to-data-lab.mdx b/pages/data-lab/how-to/connect-to-data-lab.mdx deleted file mode 100644 index 8a941f17a1..0000000000 --- a/pages/data-lab/how-to/connect-to-data-lab.mdx +++ /dev/null @@ -1,38 +0,0 @@ ---- -title: How to connect to a Data Lab for Apache Spark™ -description: Step-by-step guide to connecting to a Data Lab for Apache Spark™ with the Scaleway console. -tags: data lab for apache spark create process -dates: - validation: 2025-09-17 - posted: 2024-07-31 ---- -import Requirements from '@macros/iam/requirements.mdx' - - - - -- A Scaleway account logged into the [console](https://console.scaleway.com) -- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization -- Created a [Data Lab for Apache Spark™ cluster](/data-lab/how-to/create-data-lab/) -- A valid [API key](/iam/how-to/create-api-keys/) - -1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. - -2. Click the name of the Data Lab cluster you want to connect to. The cluster **Overview** page displays. - -3. Click **Open Notebook** in the **Notebook** section. You are directed to the notebook login page. - -4. Enter your [API secret key](/iam/concepts/#api-key) when prompted for a password, then click **Log in**. You are directed to the lab's home screen. - -5. In the files list on the left, double-click the `DatalabDemo.ipynb` file to open it. - -6. Update the first cell of the file with your API access key and secret key, as shown below: - - ```json - "spark.hadoop.fs.s3a.access.key": "[your-api-access-key]", - "spark.hadoop.fs.s3a.secret.key": "[your-api-secret-key]", - ``` - - Your notebook environment is now ready to be used. - -7. Optionally, follow the instructions contained in the `DatalabDemo.ipynb` file to process a test batch of data. \ No newline at end of file diff --git a/pages/data-lab/how-to/create-data-lab.mdx b/pages/data-lab/how-to/create-data-lab.mdx index 335efadc4d..e737d37604 100644 --- a/pages/data-lab/how-to/create-data-lab.mdx +++ b/pages/data-lab/how-to/create-data-lab.mdx @@ -3,37 +3,44 @@ title: How to create a Data Lab for Apache Spark™ description: Step-by-step guide to creating a Data Lab for Apache Spark™ on Scaleway. 
tags: data lab apache spark create process dates: - validation: 2025-09-02 + validation: 2025-12-10 posted: 2024-07-31 --- import Requirements from '@macros/iam/requirements.mdx' -Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure. +Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark™ infrastructure. - A Scaleway account logged into the [console](https://console.scaleway.com) - [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization -- Optionally, an [Object Storage bucket](/object-storage/how-to/create-a-bucket/) - A valid [API key](/iam/how-to/create-api-keys/) +- Created a [Private Network](/vpc/how-to/create-private-network/) 1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. 2. Click **Create Data Lab cluster**. The creation wizard displays. -3. Complete the following steps in the wizard: - - Choose an Apache Spark version from the drop-down menu. - - Select a worker node configuration. - - Enter the desired number of worker nodes. - - Provisioning zero worker nodes lets you retain and access you cluster and notebook configurations, but will not allow you to run calculations. - - - Activate the [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs. - - Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. A minimum of 1 GB is required to run the notebook. - - - Enter a name for your Data Lab. - - Optionally, add a description and/or tags for your Data Lab. - - Verify the estimated cost. - -4. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page. \ No newline at end of file +3. Choose an Apache Spark™ version from the drop-down menu. + +4. Choose a main node type. If you plan to add a notebook to your cluster, select the **DDL-PLAY2-MICRO** configuration to provision sufficient resources for it. + +5. Choose a worker node type depending on your hardware requirements. + +6. Enter the desired number of worker nodes. + +7. Add a [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs. + + + Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. A minimum of 1 GB is required to run the notebook. + + +8. Add a notebook if you want to use an integrated notebook environment to interact with your cluster. Adding a notebook requires 1 GB of billable storage. + +9. Select a Private Network from the drop-down menu to attach to your cluster, or create a new one. Data Lab clusters cannot be used without a Private Network. + +10. Enter a name for your Data Lab cluster, and add an optional description and/or tags. + +11. Verify the estimated cost. + +12. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page. 
\ No newline at end of file
diff --git a/pages/data-lab/how-to/manage-delete-data-lab.mdx b/pages/data-lab/how-to/manage-delete-data-lab.mdx
index 8ad567bcbe..7fd87af6c1 100644
--- a/pages/data-lab/how-to/manage-delete-data-lab.mdx
+++ b/pages/data-lab/how-to/manage-delete-data-lab.mdx
@@ -3,7 +3,7 @@ title: How to manage and delete a Data Lab for Apache Spark™
 description: Step-by-step guide to managing and deleting a Data Lab for Apache Spark™ with the Scaleway console.
 tags: data lab apache spark delete remove suppress
 dates:
-  validation: 2025-09-02
+  validation: 2025-12-10
   posted: 2024-07-31
 ---
 import Requirements from '@macros/iam/requirements.mdx'
@@ -20,7 +20,11 @@ This page explains how to manage and delete your Data Lab for Apache Spark™.
 
 1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays.
 
-2. Click the name of the Data Lab cluster you want to manage. The overview tab of the cluster displays. From this view, you can see the configuration of your cluster.
+2. Click the name of the Data Lab cluster you want to manage. The overview tab of the cluster displays. From this page, you can:
+    - Consult the configuration of your cluster.
+    - View the network information of your cluster.
+    - [Access the Apache Spark™ UI](/data-lab/how-to/access-spark-ui/) of your cluster.
+    - [Access the notebook environment](/data-lab/how-to/access-notebook/) of your cluster.
 
 3. Click the **Settings** tab.
 
diff --git a/pages/data-lab/how-to/use-private-networks.mdx b/pages/data-lab/how-to/use-private-networks.mdx
new file mode 100644
index 0000000000..8a423207dc
--- /dev/null
+++ b/pages/data-lab/how-to/use-private-networks.mdx
@@ -0,0 +1,221 @@
+---
+title: How to use Private Networks with your Data Lab cluster
+description: This page explains how to use Private Networks with Scaleway Data Lab for Apache Spark™
+tags: private-networks private networks data lab spark apache cluster vpc
+dates:
+  validation: 2025-12-10
+  posted: 2025-12-10
+---
+import Requirements from '@macros/iam/requirements.mdx'
+
+[Private Networks](/vpc/concepts/#private-networks) allow your Data Lab for Apache Spark™ cluster to communicate in an isolated and secure network without needing to be connected to the public internet.
+
+At the moment, Data Lab clusters can only be attached to a Private Network [during their creation](/data-lab/how-to/create-data-lab/), and cannot be detached and reattached to another Private Network afterward.
+
+For full information about Scaleway Private Networks and VPC, see our [dedicated documentation](/vpc/) and [best practices guide](/vpc/reference-content/getting-most-private-networks/).
+
+<Requirements />
+
+- A Scaleway account logged into the [console](https://console.scaleway.com)
+- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
+- [Created a Private Network](/vpc/how-to/create-private-network/)
+- [Created an Ubuntu Instance](/instances/how-to/create-an-instance/) attached to a [Private Network](/instances/how-to/use-private-networks/)
+
+## How to use a cluster through a Private Network
+
+### Setting up your Instance
+
+1. [Connect to your Instance via SSH](/instances/how-to/connect-to-instance/).
+
+2. Run the command below from the shell of your Instance to install the required dependencies:
+
+    ```bash
+    sudo apt update
+    sudo apt install -y \
+      build-essential curl git \
+      libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev \
+      libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
+      openjdk-17-jre-headless tmux
+    ```
+
+3. Run the command below to install `pyenv`:
+
+    ```bash
+    curl https://pyenv.run | bash
+    ```
+
+4. Run the command below to add `pyenv` to your Bash configuration:
+
+    ```bash
+    echo 'export PATH="$HOME/.pyenv/bin:$PATH"' >> ~/.bashrc
+    echo 'eval "$(pyenv init -)"' >> ~/.bashrc
+    echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc
+    ```
+
+5. Run the command below to reload your shell:
+
+    ```bash
+    exec $SHELL
+    ```
+
+6. Run the command below to install **Python 3.13**, then create and activate a virtual environment:
+
+    ```bash
+    pyenv install 3.13.0
+    pyenv virtualenv 3.13.0 jupyter-spark-3.13
+    pyenv activate jupyter-spark-3.13
+    ```
+
+
+    Your Instance's Python version must be 3.13. If you encounter an error due to a mismatch between the worker and driver Python versions, run the following command to list the available 3.13 patch versions, then reinstall using the exact one:
+
+    ```bash
+    pyenv install -l | grep 3.13
+    ```
+
+
+7. Run the command below to install JupyterLab and PySpark inside the virtual environment:
+
+    ```bash
+    pip install --upgrade pip
+    pip install jupyterlab pyspark
+    ```
+
+8. Run the command below to generate a configuration file for your JupyterLab:
+
+    ```bash
+    jupyter lab --generate-config
+    ```
+
+9. Open the configuration file you just created:
+
+    ```bash
+    nano ~/.jupyter/jupyter_lab_config.py
+    ```
+
+10. Set the following parameters:
+
+    ```python
+    # if running as root:
+    c.ServerApp.allow_root = True
+    c.ServerApp.port = 8888
+    # optional authentication token:
+    # c.ServerApp.token = "your-super-secure-password"
+    ```
+
+11. Run the command below to start JupyterLab:
+
+    ```bash
+    jupyter lab
+    ```
+
+12. In a new terminal on your local machine, open an SSH tunnel to your Instance to forward the JupyterLab port. The Instance public IP can be found in the **Overview** tab of your Instance:
+
+    ```bash
+    ssh -L 8888:127.0.0.1:8888 <user>@<instance-public-ip>
+    ```
+
+
+    Make sure to allow root connection in your configuration file if you log in as a root user.
+
+
+13. Access [http://localhost:8888](http://localhost:8888), then enter the token generated while executing the `jupyter lab` command.
+
+You now have access to your Data Lab for Apache Spark™ cluster via a Private Network, using a JupyterLab notebook deployed on an Instance.
+
+### Running a sample workload over Private Networks
+
+1. In a new Jupyter notebook file, add the code below to a new cell:
+
+    ```python
+    from pyspark.sql import SparkSession
+
+    MASTER_URL = ""   # "spark://master-datalab-[...]:7077" format
+    DRIVER_HOST = ""  # "XX.XX.XX.XX" format
+
+    spark = (
+        SparkSession.builder
+        .appName("jupyter-from-vpc-instance-test")
+        .master(MASTER_URL)
+        # make sure executors can talk back to this driver
+        .config("spark.driver.host", DRIVER_HOST)
+        .config("spark.driver.bindAddress", "0.0.0.0")
+        .config("spark.driver.port", "7078")
+        .config("spark.blockManager.port", "7079")
+        .config("spark.ui.port", "4040")
+        .getOrCreate()
+    )
+
+    spark.range(10).show()
+    ```
+
+2. Replace the placeholders with the appropriate values:
+    - `MASTER_URL` can be found in the **Overview** tab of your cluster, under **Private endpoint** in the **Network** section.
+    - `DRIVER_HOST` can be found in the **Private Networks** tab of your Instance. Make sure to only copy the IP, and not the `/22` part.
+
+3. Run the cell.
+
+Your notebook hosted on an Instance is ready to be used over Private Networks.
+
+### Running an application over Private Networks using spark-submit
+
+1. [Connect to your Instance via SSH](/instances/how-to/connect-to-instance/).
+
+2. Run the command below from the shell of your Instance to install the required dependencies:
+
+    ```bash
+    sudo apt update
+    sudo apt install -y openjdk-17-jdk curl wget tar
+    java -version
+    ```
+
+3. Run the command below to download Apache Spark™:
+
+    ```bash
+    cd ~
+    wget https://archive.apache.org/dist/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
+    ```
+
+4. Run the command below to extract the archive:
+
+    ```bash
+    sudo mkdir -p /opt/spark
+    sudo tar -xzf spark-4.0.0-bin-hadoop3.tgz -C /opt/spark --strip-components=1
+    ```
+
+5. Run the command below to add Apache Spark™ to your Bash configuration, and reload your Bash session:
+
+    ```bash
+    echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
+    echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.bashrc
+    source ~/.bashrc
+    ```
+
+6. Install Python 3.13 if you have not done so yet, then set the environment variables below:
+
+    ```bash
+    export PYSPARK_PYTHON=$(which python)         # should point to Python 3.13
+    export PYSPARK_DRIVER_PYTHON=$(which python)
+    ```
+
+7. Run the command below to execute `spark-submit` to calculate pi for 100 iterations. Do not forget to replace the placeholders with the appropriate values.
+
+    ```bash
+    spark-submit \
+      --master spark://<cluster-private-endpoint>:7077 \
+      --deploy-mode client \
+      --conf spark.driver.port=7078 \
+      --conf spark.blockManager.port=7079 \
+      --conf spark.driver.host=<instance-private-ip> \
+      $SPARK_HOME/examples/src/main/python/pi.py 100
+    ```
+
+
+    - `<cluster-private-endpoint>` can be found in the **Overview** tab of your cluster, under **Private endpoint** in the **Network** section.
+    - `<instance-private-ip>` can be found in the **Private Networks** tab of your Instance. Make sure to only copy the IP, and not the `/22` part.
+
+
+8. [Access the Apache Spark™ UI](/data-lab/how-to/access-spark-ui/) of your cluster. The list of completed applications displays. From here, you can inspect the jobs previously started using `spark-submit`.
+
+You have successfully run workloads on your cluster from an Instance over a Private Network.
\ No newline at end of file
diff --git a/pages/data-lab/menu.ts b/pages/data-lab/menu.ts
index 893c9a3cbf..ad0bdeb9af 100644
--- a/pages/data-lab/menu.ts
+++ b/pages/data-lab/menu.ts
@@ -19,15 +19,23 @@ export const dataLabMenu = {
     {
       items: [
         {
-          label: 'Create a Data Lab',
+          label: 'Create a Data Lab cluster',
           slug: 'create-data-lab',
         },
         {
-          label: 'Connect to a Data Lab',
-          slug: 'connect-to-data-lab',
+          label: 'Access the notebook',
+          slug: 'access-notebook',
        },
        {
-          label: 'Manage and delete a Data Lab',
+          label: 'Access the Spark™ UI',
+          slug: 'access-spark-ui',
+        },
+        {
+          label: 'Use a cluster with Private Networks',
+          slug: 'use-private-networks',
+        },
+        {
+          label: 'Manage and delete a cluster',
           slug: 'manage-delete-data-lab',
         },
       ],
diff --git a/pages/data-lab/quickstart.mdx b/pages/data-lab/quickstart.mdx
index ed60589db6..a5aeb1ac90 100644
--- a/pages/data-lab/quickstart.mdx
+++ b/pages/data-lab/quickstart.mdx
@@ -3,7 +3,7 @@ title: Data Lab for Apache Spark™ - Quickstart
 description: Get started with Scaleway Data Lab for Apache Spark™ quickly and efficiently.
tags: data lab apache spark notebook jupyter processing dates: - validation: 2025-09-02 + validation: 2025-12-09 posted: 2024-07-10 --- import Requirements from '@macros/iam/requirements.mdx' @@ -12,23 +12,16 @@ import Requirements from '@macros/iam/requirements.mdx' Follow this guided tour to discover how to navigate the console. -Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark infrastructure. +Data Lab for Apache Spark™ is a product designed to assist data scientists and data engineers in performing calculations on a remotely managed Apache Spark™ infrastructure. -It is composed of the following: - - - Cluster: An Apache Spark cluster powered by a Kubernetes architecture. - - - Notebook: A JupyterLab service operating on a dedicated node type. - -Scaleway provides dedicated node types for both the notebook and the cluster. The cluster nodes are high-end machines built for intensive computations, featuring powerful CPUs/GPUs, and substantial RAM. - -The notebook, although capable of performing some local computations, primarily serves as a web interface for interacting with the Apache Spark cluster. +This documentation explains how to create a Data Lab for Apache Spark™ cluster, access its notebook environment and run the included demo file, and delete your cluster. - A Scaleway account logged into the [console](https://console.scaleway.com) - [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization -- Optionally, an [Object Storage bucket](/object-storage/how-to/create-a-bucket/) +- Created a [Private Network](/vpc/how-to/create-private-network/) +- Created an [IAM API key](/iam/how-to/create-api-keys/) ## How to create a Data Lab for Apache Spark™ cluster @@ -37,19 +30,21 @@ The notebook, although capable of performing some local computations, primarily 2. Click **Create Data Lab cluster**. The creation wizard displays. 3. Complete the following steps in the wizard: - - Choose an Apache Spark version from the drop-down menu. - - Select a worker node configuration. For this procedure, we recommend selecting a CPU rather than a GPU. + + - Select a region for your cluster. + - Choose an Apache Spark™ version from the drop-down menu. + - Select the **DDL-PLAY2-MICRO** main node type. + - Select a **CPU** worker node configuration. - Enter the desired number of worker nodes. - - Provisioning zero worker nodes lets you retain and access you cluster and notebook configurations, but will not allow you to run calculations. - - - Enter a name for your Data Lab. - - Optionally, add a description and/or tags for your Data Lab. + - Select an existing Private Network, or create a new one. + - Enter a name for your cluster, and an optional description and tags. - Verify the estimated cost. -4. Click **Create Data Lab cluster** to finish. You are directed to the Data Lab cluster overview page. +4. Click **Create Data Lab cluster** to finish. -## How to connect to your Data Lab +Once the cluster is created, you are directed to its **Overview** page. + +## How to connect to your cluster's notebook 1. Click **Data Lab** under **Data & Analytics** on the side menu. The Data Lab for Apache Spark™ page displays. 
@@ -63,45 +58,11 @@ The notebook, although capable of performing some local computations, primarily Each Data Lab for Apache Spark™ comes with a default `DatalabDemo.ipynb` demonstration file for testing purposes. This file contains a preconfigured notebook environment that requires no modification to run. -Execute the cells in order to perform pre-determined operations on a dummy data set. - -## How to set up a new Data Lab environment - -1. From the notebook **Launcher** tab, select **PySpark** under **Notebook**. - -2. In a new cell, copy and paste the code below and replace the placeholders with your API access key, secret key, and the endpoint of your Object Storage Bucket to set up the Apache Spark session: +Execute the cells in order to perform pre-determined operations on a dummy data set representative of real life use cases and workloads to assess the performance of your cluster. - ```json - %%configure -f - { - "name": "My Spark", - "conf":{ - "spark.hadoop.fs.s3a.access.key": "your-api-access-key", - "spark.hadoop.fs.s3a.secret.key": "your-api-secret-key", - "spark.hadoop.fs.s3a.endpoint": "your-bucket-endpoint" - } - } - ``` - - - The Object Storage bucket endpoint is required only if you did not specify a bucket when creating the cluster. - - -3. In a new cell below, copy and paste the following command to initialize the Apache Spark session: - - ```python - from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType - ``` - -4. Execute the two cells you just created. - - - The initialization of your Apache Spark session can take a few minutes. - - - Once initialized, the information of the Spark session displays. - -You can now execute commands that will run on the resources defined when creating the Data Lab for Apache Spark™. + +The demo file also contains a set of examples to configure and extend your Apache Spark™ configuration. + ## How to delete a Data Lab for Apache Spark™