In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Digest and analyze data from BigQuery with Dataproc

View on GitHub:
[![GitHub logo](https://cloud.google.com/ml-engine/images/github-logo-32px.png "GitHub logo")](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/workbench/spark/spark_bigquery.ipynb)

Run in Colab:
[![Colab logo](https://cloud.google.com/ml-engine/images/colab-logo-32px.png "Colab logo")](https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/workbench/spark/spark_bigquery.ipynb)

Open in Vertex AI Workbench:
[![Vertex AI logo](https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32 "Vertex AI logo")](https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/workbench/spark/spark_bigquery.ipynb)

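## Overview

This notebook tutorial shows you how to use Apache Spark and [Dataproc](https://cloud.google.com/dataproc) to ingest data from BigQuery, analyze it, and write the results back to BigQuery. The notebook code analyzes GitHub activity data and explores metrics about the programming languages used in GitHub repositories.

To run this notebook, click the `Open in Vertex AI Workbench` link above.

Learn more about [Vertex AI Workbench](https://cloud.google.com/vertex-ai/docs/workbench/introduction) and [Dataproc Serverless for Spark](https://cloud.google.com/dataproc-serverless/docs/guides/bigquery-connector-spark-example).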

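### Objective

This notebook tutorial runs an Apache Spark job that fetches data from the BigQuery "GitHub Activity Data" dataset, queries the data, and then writes the results back to BigQuery. This sequence of jobs represents a common data engineering use case: ingest, transform, and query data, then write the output to a database. It also demonstrates how to submit an Apache Spark job to Dataproc.

This tutorial uses the following Google Cloud services:

- `Dataproc`
- `BigQuery`

The steps performed include:

- Set up a Google Cloud project and a Dataproc cluster.
- Configure the spark-bigquery-connector.
- Ingest data from BigQuery into a Spark DataFrame.
- Preprocess the ingested data.
- Query the most frequently used programming languages in monoglot repositories.
- Query the average size (MB) of the code in each language stored in monoglot repositories.
- Query the languages that are most often found together in polyglot repositories.
- Write the query results back to BigQuery.
- Delete the resources created for this notebook tutorial.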

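### Dataset

The [GitHub Activity Data](https://console.cloud.google.com/marketplace/product/github/github-repos) dataset is available in [BigQuery public datasets](https://cloud.google.com/bigquery/public-data), which let you query up to 1 TB of data per month for free. The data covers two kinds of repositories: "polyglot" repositories, which contain files in multiple programming languages, and "monoglot" repositories, which contain a single programming language.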

### Costs

This tutorial uses billable components of Google Cloud:

* [Vertex AI](https://cloud.google.com/vertex-ai/pricing)
* [Cloud Storage](https://cloud.google.com/storage/pricing)
* [Dataproc](https://cloud.google.com/dataproc/pricing)

Use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

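### Installation

Install the following packages to run this notebook.

The testing environment has neither Java nor PySpark, so the following cell is needed for testing purposes only.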

In [None]:
import os

if os.getenv("IS_TESTING"):
    """
    The testing suite does not currently support testing on Dataproc clusters,
    so the testing environment is set up to replicate Dataproc via the following steps.
    """
    JAVA_VER = "8u332-b09"
    JAVA_FOLDER = "/tmp/java"
    FILE_NAME = f"openlogic-openjdk-{JAVA_VER}-linux-x64"
    TAR_FILE = f"{JAVA_FOLDER}/{FILE_NAME}.tar.gz"
    DOWNLOAD_LINK = f"https://builds.openlogic.com/downloadJDK/openlogic-openjdk/{JAVA_VER}/openlogic-openjdk-{JAVA_VER}-linux-x64.tar.gz"
    PYSPARK_VER = "3.1.3"

    # Download Open JDK 8. Spark requires Java to execute.
    ! rm -rf $JAVA_FOLDER
    ! mkdir $JAVA_FOLDER
    ! wget -P $JAVA_FOLDER $DOWNLOAD_LINK
    os.environ["JAVA_HOME"] = f"{JAVA_FOLDER}/{FILE_NAME}"
    ! tar -zxf $TAR_FILE -C $JAVA_FOLDER
    ! echo $JAVA_HOME

    # Pin the Spark version to match that of the Dataproc 2.0 cluster.
    ! pip install pyspark==$PYSPARK_VER -q

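### Create a Dataproc cluster

The Spark jobs executed in this notebook tutorial are compute-intensive. Because the jobs can take a long time to complete in a standard notebook environment, this notebook tutorial runs on a Dataproc cluster that is created with the Dataproc Component Gateway and the Jupyter component installed.

**Already have a Dataproc cluster with Jupyter?** If you have a running Dataproc cluster with the [Component Gateway and Jupyter component installed](https://cloud.google.com/dataproc/docs/concepts/components/jupyter#gcloud-command), you can use it in this tutorial. If you plan to use it, skip this step and go to `Switch your kernel`.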

In [None]:
if not os.getenv("IS_TESTING"):
    CLUSTER_NAME = "[your-cluster]"  # @param {type: "string"}
    CLUSTER_REGION = "[your-region]"  # @param {type: "string"}

    if CLUSTER_REGION == "[your-region]":
        CLUSTER_REGION = "us-central1"

    print(f"CLUSTER_NAME: {CLUSTER_NAME}")
    print(f"CLUSTER_REGION: {CLUSTER_REGION}")

In [None]:
if not os.getenv("IS_TESTING"):
    !gcloud dataproc clusters create $CLUSTER_NAME \
        --region=$CLUSTER_REGION \
        --enable-component-gateway \
        --image-version=2.0 \
        --optional-components=JUPYTER

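Your `CLUSTER_NAME` must be **unique** within your Google Cloud project. It must start with a lowercase letter, followed by up to 51 lowercase letters, numbers, and hyphens, and it cannot end with a hyphen.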
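
The naming rule above can be checked locally before you run the `gcloud` command. The following is a minimal sketch and not part of the original notebook: the helper name `is_valid_cluster_name` is hypothetical, and the pattern encodes only the rule as stated here.

```python
import re

# Hypothetical helper: encodes only the naming rule stated in this tutorial.
# One leading lowercase letter, then at most 51 more characters drawn from
# lowercase letters, digits, and hyphens, where the last one is not a hyphen.
_CLUSTER_NAME_PATTERN = re.compile(r"[a-z](?:[-a-z0-9]{0,50}[a-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if `name` satisfies the cluster naming rule."""
    return _CLUSTER_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["spark-demo-1", "1-spark", "spark-", "Spark"]:
    print(candidate, is_valid_cluster_name(candidate))
```

Only the first candidate passes. The others start with a digit, end with a hyphen, or contain an uppercase letter.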

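#### Switch your kernel

Your notebook kernel is shown at the top of the notebook page. Your notebook should run on a Python 3 kernel that runs on the Dataproc cluster.

From the top menu, select **Kernel > Change Kernel**, then select `Python 3 on CLUSTER_NAME: Dataproc cluster in REGION (Remote)`.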

### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113).

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

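### Create a BigQuery dataset

The Spark DataFrames created in this tutorial are stored in BigQuery.

#### UUID

To avoid name collisions, you can create a UUID for the current notebook session and append it to the name of the BigQuery dataset that you create in this tutorial.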

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

Set the name of your BigQuery dataset, then create it.
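
BigQuery dataset IDs may contain only letters, digits, and underscores, up to 1,024 characters, so a name derived from a project ID that contains hyphens is not a valid dataset ID. The sketch below shows one way to derive a valid ID from an arbitrary prefix and the session suffix. The helper `to_dataset_id` is hypothetical and not part of the notebook.

```python
import re


def to_dataset_id(prefix: str, suffix: str) -> str:
    """Build a dataset ID from `prefix` and `suffix` (hypothetical helper).

    Every character that is not a letter, digit, or underscore is replaced
    with an underscore, and the result is capped at 1,024 characters.
    """
    raw = f"{prefix}_{suffix}"
    return re.sub(r"[^A-Za-z0-9_]", "_", raw)[:1024]


print(to_dataset_id("my-project-id", "a1b2c3d4"))  # my_project_id_a1b2c3d4
```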

In [None]:
if not os.getenv("IS_TESTING"):
    DATASET_NAME = "[your-dataset-name]"  # @param {type:"string"}

    if (
        DATASET_NAME == ""
        or DATASET_NAME is None
        or DATASET_NAME == "[your-dataset-name]"
    ):
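        # BigQuery dataset IDs allow only letters, digits, and underscores,
        # so replace the hyphens that project IDs commonly contain.
        DATASET_NAME = f"{PROJECT_ID.replace('-', '_')}_{UUID}"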
else:
    DATASET_NAME = f"python_docs_samples_tests_spark_{UUID}"

In [None]:
! bq mk $DATASET_NAME

## Tutorial

### Import required libraries

In [None]:
# You use Spark SQL in a "SparkSession" to create DataFrames
from pyspark.sql import SparkSession
# PySpark functions
from pyspark.sql.functions import avg, col, count, desc, round, size, udf
# These allow us to create a schema for our data
from pyspark.sql.types import ArrayType, IntegerType, StringType

### Initialize the SparkSession

To use BigQuery with Apache Spark, you must include the [spark-bigquery-connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector) when you initialize the `SparkSession`.

In [None]:
# Initialize the "SparkSession" with the following config.
VER = "0.26.0"
FILE_NAME = f"spark-bigquery-with-dependencies_2.12-{VER}.jar"

if os.getenv("IS_TESTING"):
    connector = f"https://github.com/GoogleCloudDataproc/spark-bigquery-connector/releases/download/{VER}/{FILE_NAME}"
else:
    connector = f"gs://spark-lib/bigquery/{FILE_NAME}"

spark = (
    SparkSession.builder.appName("spark-bigquery-polyglot-language-demo")
    .config("spark.jars", connector)
    .config("spark.sql.debug.maxToStringFields", "500")
    .getOrCreate()
)

### Fetch data from BigQuery

In [None]:
# Load the GitHub Activity public dataset from BigQuery.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.github_repos.languages")
    .load()
)

# Restrict testing data since the testing environment runs on a small Docker image.
if os.getenv("IS_TESTING"):
    df = df.sample(0.0001)

df.printSchema()

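### Preprocessing

As the displayed schema shows, the GitHub Activity data is stored in arrays rather than primitive types.

To process the data efficiently, convert the arrays to primitive types and separate the monoglot and polyglot repository data.

Three Python functions carry the `@udf` decorator (which marks a [user-defined function](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html)) with a declared return type. UDFs extend the PySpark framework functions.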
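
To see what the three preprocessing functions compute without starting Spark, the same logic can be sketched in plain Python. This is an illustration only: the `Language` named tuple stands in for the struct elements of the `language` array, the function names are shortened, and the sample repositories are made up.

```python
from collections import namedtuple
from typing import List, Optional

# Stand-in for the struct elements of the "language" array column.
Language = namedtuple("Language", ["name", "bytes"])


def mono_language(language: List[Language]) -> Optional[str]:
    """Name of the only language, or None if the repo is not monoglot."""
    return language[0].name if len(language) == 1 else None


def mono_size(language: List[Language]) -> int:
    """Bytes of the only language, or 0 if the repo is not monoglot."""
    return language[0].bytes if len(language) == 1 else 0


def poly_language(language: List[Language]) -> Optional[str]:
    """Top three languages by bytes, listed alphabetically, or None."""
    if len(language) < 2:
        return None
    top_3 = sorted(language, key=lambda x: x.bytes, reverse=True)[:3]
    return ", ".join(sorted(elem.name for elem in top_3))


repo_a = [Language("C", 300)]
repo_b = [
    Language("HTML", 50),
    Language("Python", 900),
    Language("CSS", 70),
    Language("Shell", 10),
]

print(mono_language(repo_a), mono_size(repo_a), poly_language(repo_a))
# C 300 None
print(mono_language(repo_b), mono_size(repo_b), poly_language(repo_b))
# None 0 CSS, HTML, Python
```

Sorting the top three names alphabetically means that repositories with the same three largest languages produce the same string, regardless of which language is the largest.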

In [None]:
# Set the LIMIT constant as 10 to get the top ten results.
LIMIT = 10

# A constant used to explode the pie chart to aid visibility.
EXPLODE_PIE_CHART = tuple([0.05] * LIMIT)


@udf(returnType=StringType())
def language_to_mono_language(language) -> str:
    """
    The preprocessing function takes a language array and returns its name if the language has one element.
    Args:
        language: list of struct that contains name and bytes.
                  (e.g., language = [[name: "C", bytes: 300]])
    Returns:
        The monoglot repository's language name, or None otherwise
    """
    return language[0].name if len(language) == 1 else None


@udf(returnType=IntegerType())
def language_to_mono_size(language) -> int:
    """
    The preprocessing function takes a language array and returns its bytes if the language has one element.
    Args:
        language: list of struct that contains name and bytes.
                  (e.g., language = [[name: "C", bytes: 300]])
    Returns:
        The monoglot repository's size in bytes, or 0 otherwise
    """
    return language[0].bytes if len(language) == 1 else 0


@udf(returnType=StringType())
def language_to_poly_language(language) -> str:
    """
    The preprocessing function takes a language array and returns the top three language names based on their bytes.
    Args:
        language: list of struct that contains name and bytes.
                  (e.g., language = [[name: "C", bytes: 300],
                                     [name: "Java", bytes: 200]])
    Returns:
        The polyglot repository's top three language names as a comma-separated string
    """
    if len(language) < 2:
        return None
    # Sort languages by their bytes in a descending order.
    language.sort(key=lambda x: -x.bytes)
    top_3 = language[:3]

    # Sort top_3 languages by their name.
    top_3.sort(key=lambda x: x.name)
    ret = []
    for elem in top_3:
        ret.append(elem.name)
    return ", ".join(ret)

In [None]:
# Create a DataFrame named "preprocessed_df", with the array split into three columns using UDF.
preprocessed_df = df.select(
    col("repo_name"),
    language_to_mono_language(col("language")).alias("mono_language"),
    language_to_mono_size(col("language")).alias("mono_size"),
    language_to_poly_language(col("language")).alias("poly_language"),
)
preprocessed_df.printSchema()

The displayed schema of `preprocessed_df` shows that the language column is split into three columns: `mono_language`, `mono_size`, and `poly_language`.

In [None]:
# Output the number of repositories of monoglot(single language used) and polyglot(multiple languages used).
mono = preprocessed_df.where(col("mono_language").isNotNull()).count()
print(f"The number of repositories that use one language is {mono}")

poly = preprocessed_df.where(col("poly_language").isNotNull()).count()
print(f"The number of repositories that use multiple languages is {poly}")

poly_percent = (poly / (mono + poly)) * 100
print(
    f"Polyglot repositories comprise approximately {poly_percent:.2f}% of the total number of repositories."
)

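### Analysis

#### What are the most frequently used languages in monoglot repositories?
To answer this question, run the following query, which uses the preprocessed `mono_language` column.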
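
The group, count, and sort pattern used for this ranking can be sketched in plain Python with `collections.Counter`. The column values below are made up for illustration, with `None` marking a repository that is not monoglot.

```python
from collections import Counter

# Made-up values of the "mono_language" column; None marks a polyglot repo.
mono_languages = ["Python", "C", None, "Python", "Java", None, "Python", "C"]

# Equivalent of groupBy("mono_language").count() with nulls filtered out.
counts = Counter(lang for lang in mono_languages if lang is not None)

# Equivalent of sort(desc("count")).
ranking = counts.most_common()
print(ranking)  # [('Python', 3), ('C', 2), ('Java', 1)]
```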

In [None]:
# Get the monoglot repositories and sort them based on language popularity.
mono_ranking = (
    preprocessed_df.groupBy("mono_language")
    .count()
    .sort(desc("count"))
    .where(col("mono_language").isNotNull())
)
mono_ranking.show()

Using `mono_ranking`, display the results in a pie chart.

In [None]:
# Convert the Spark DataFrame to a Pandas DataFrame to display the pie chart.
mono_panda = mono_ranking.toPandas()[:LIMIT].copy()
mono_panda.groupby(["mono_language"]).sum().plot(
    kind="pie",
    y="count",
    autopct="%1.1f%%",
    label="",
    title="Monoglot repositories",
    legend=False,
    figsize=(7, 7),
    explode=EXPLODE_PIE_CHART,
)

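#### What is the average size of each language in monoglot repositories?

Use the preprocessed `mono_size` and `mono_language` columns to get the average size of each language.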
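
A per-language average with a minimum-count filter can be sketched in plain Python as follows. The rows and the `MIN_REPOS` threshold are made-up illustration values (the notebook query keeps languages with more than 500 repositories), and sizes are left in their raw unit.

```python
from collections import defaultdict

# Made-up (mono_language, mono_size) pairs.
rows = [("C", 400), ("C", 600), ("Python", 100), ("Python", 300), ("Go", 900)]
MIN_REPOS = 2  # hypothetical threshold, chosen to suit the tiny sample

# Group sizes by language.
sizes = defaultdict(list)
for language, size in rows:
    sizes[language].append(size)

# Average each group, dropping languages with too few repositories.
averages = {
    language: sum(values) / len(values)
    for language, values in sizes.items()
    if len(values) >= MIN_REPOS
}

# Sort by average size, largest first.
ranking = sorted(averages.items(), key=lambda item: item[1], reverse=True)
print(ranking)  # [('C', 500.0), ('Python', 200.0)]
```

The count filter keeps a single unusually large repository, such as the lone `Go` entry here, from dominating the ranking.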

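`mono_size` is in bytes. The following query divides the average `mono_size` by 1,000,000 to convert it to megabytes, and keeps only languages that appear in more than 500 repositories.

In [None]:
mono_ranking_avg_bytes = (
    preprocessed_df.groupBy("mono_language")
    .agg(
        count("mono_language").alias("count"),
        # The source "bytes" field is a byte count; 1 MB = 1,000,000 bytes.
        round(avg("mono_size") / 1_000_000, 2).alias("average_in_MB"),
    )
    .sort(desc("average_in_MB"))
    .where(col("mono_language").isNotNull() & (col("count") > 500))
)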

mono_ranking_avg_bytes.show()

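#### Which three languages most often appear together in polyglot repositories?

Using the preprocessed `poly_language` column, which holds each polyglot repository's three largest languages by size, run a query that ranks these combinations by how many repositories share them.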

In [None]:
# Get the polyglot repositories by language popularity.
poly_ranking = (
    preprocessed_df.groupBy("poly_language")
    .count()
    .sort(desc("count"))
    .where(col("poly_language").isNotNull())
)

poly_ranking.show()

Most of the results contain a combination of HTML or CSS with JavaScript.

Display a pie chart:

In [None]:
# Convert the Spark DataFrame to a Pandas DataFrame to display the pie chart.
poly_panda = poly_ranking.toPandas()[:LIMIT].copy()
poly_panda.groupby(["poly_language"]).sum().plot(
    kind="pie",
    y="count",
    autopct="%1.1f%%",
    label="",
    title="Polyglot repositories",
    legend=False,
    figsize=(7, 7),
    explode=EXPLODE_PIE_CHART,
)

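The pie chart shows that eight of the top ten results contain `HTML` or `CSS`. You can use the raw data fetched from BigQuery to create the language combinations within each repository.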
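
As a small illustration of what "language combinations" means here, the sketch below lists every ordered pair of languages in one repository by using `itertools.combinations` and adding the reversed pair. The helper name `ordered_pairs` and the sample list are made up for this example.

```python
from itertools import combinations
from typing import List, Tuple


def ordered_pairs(languages: List[str]) -> List[Tuple[str, str]]:
    """Return every pair of distinct list entries in both orders."""
    pairs = []
    for first, second in combinations(languages, 2):
        pairs.append((first, second))
        pairs.append((second, first))
    return pairs


print(ordered_pairs(["C", "C++", "Java"]))
# [('C', 'C++'), ('C++', 'C'), ('C', 'Java'), ('Java', 'C'), ('C++', 'Java'), ('Java', 'C++')]
```

Emitting both orders makes the resulting frequency table symmetric, so a language can be looked up either as a row or as a column.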

In [None]:
# A Python package to get combinations.
from itertools import combinations
# A Python package to use type hint
from typing import List

# PySpark functions
from pyspark.sql.functions import explode

In [None]:
def normalize_name(name: str) -> str:
    """
    Change the language name to avoid invalid characters in the BigQuery data.
    Args:
        name: string
    Returns:
        Normalized name: string
    """
    normalized_arr = []

    # The following sets of characters cannot be used in BigQuery's fields.
    invalid_chars = {",", ";", "{", "}", "(", ")", "\n", "\t", "=", "'"}
    replace_chars = {
        " ": "_",
        ".": "_",
        "-": "_",
        "#": "_sharp",
        "+": "_plus",
        "*": "_star",
    }

    # The name must start with a letter or underscore.
    if name[0].isnumeric():
        normalized_arr.append("_")

    for ch in name:
        # Skip if a character is in the set of invalid characters.
        if ch in invalid_chars:
            continue

        # Replace if a character is in the dictionary of replace_chars.
        if ch in replace_chars:
            normalized_arr.append(replace_chars[ch])

        # Change to lowercase to merge name duplications, for example, "Java" and "java".
        else:
            normalized_arr.append(ch.lower())

    # Convert the array to string
    return "".join(normalized_arr)


@udf(returnType=ArrayType(StringType()))
def reduce_language(language) -> List[str]:
    """
    The preprocess function takes the language and reduces it to remove "bytes".
    Args:
        language: list of struct that contains name and bytes.
                  (e.g., language = [[name: "C", bytes: 300],
                                     [name: "Java", bytes: 200]])
    Returns:
        list of strings that contains the normalized names, or None if
        the repository has fewer than two languages.
                  (e.g., reduced_languages = ["c", "java"])
    """
    if len(language) < 2:
        return None
    reduced_languages = []
    for elem in language:
        # To write back to BigQuery, the name must be normalized.
        normalized_name = normalize_name(elem.name)
        reduced_languages.append(normalized_name)
    return reduced_languages


@udf(returnType=ArrayType(ArrayType(StringType())))
def preprocess_combination(language) -> List[List[str]]:
    """
    The preprocess function takes the language and returns every language combination.
    Args:
        language: list of normalized language names.
                  (e.g., language = ["c", "java"])
    Returns:
        List of every ordered pair of languages.
                  (e.g., arr_combinations = [["c", "java"], ["java", "c"]])
    """
    if not language:
        return None
    arr_combinations = []
    for combination in combinations(language, 2):
        arr_combinations.append(combination)
        arr_combinations.append(combination[::-1])
    return arr_combinations


# Preprocess the "reduced_languages" column using UDF.
df = df.withColumn("reduced_languages", reduce_language(col("language")))

# Preprocess the "combinations" column using UDF.
df = df.withColumn("combinations", preprocess_combination(col("reduced_languages")))

# Create another DataFrame from "df" that has "repo_name" and "combinations" as columns.
frequency_df = df.select(col("repo_name"), col("combinations")).where(
    size(col("language")) > 1
)
frequency_df.printSchema()

`frequency_df` has the repository name and the language combinations.

Use Spark's `explode()` function, which is similar to the `UNNEST` operator in SQL.

Your table currently has the following contents:

| repo_name   | combinations      |
| :---------: | :---------------: |
| a           | [['C', 'C++'], ['C++', 'C'], ['C', 'Java'], ['Java', 'C'], ['C++', 'Java'], ['Java', 'C++']] |
| b           | [['C', 'C++'], ['C++', 'C'], ['C', 'Python'], ['Python', 'C'], ['C++', 'Python'], ['Python', 'C++']] |
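
The effect of `explode()` can be sketched in plain Python as a nested loop that emits one output row per array element. The two shortened sample rows below are made up for illustration.

```python
# Made-up rows of (repo_name, combinations).
rows = [
    ("a", [["C", "C++"], ["C++", "C"]]),
    ("b", [["C", "Python"], ["Python", "C"]]),
]

# One output row per element of "combinations", with the pair also split
# into the two scalar columns "language0" and "language1".
exploded = [
    {
        "repo_name": repo,
        "languages": pair,
        "language0": pair[0],
        "language1": pair[1],
    }
    for repo, combos in rows
    for pair in combos
]

for row in exploded:
    print(row)
```

Spark's `explode()` produces no output row for a null or empty array, which the nested loop mirrors for empty lists.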

In [None]:
# explode() converts the elements in combinations to rows.
frequency_df = frequency_df.withColumn("languages", explode(col("combinations")))

# Create columns for combinations of languages.
frequency_df = frequency_df.withColumn("language0", col("languages")[0])
frequency_df = frequency_df.withColumn("language1", col("languages")[1])

After you use the `explode()` function and add the `language0` and `language1` columns, the `frequency_df` table has the following contents:

| repo_name   | languages         | language0    | language1 |
| :---------: | :---------------: | :--------:   | :-------: |
| a           | ['C', 'C++']      | 'C'          |'C++'      |
| a           | ['C++', 'C']      | 'C++'        |'C'        |
| a           | ['C', 'Java']     | 'C'          |'Java'     |
| a           | ['Java', 'C']     | 'Java'       |'C'        |
| a           | ['C++', 'Java']   | 'C++'        |'Java'     |
| a           | ['Java', 'C++']   | 'Java'       |'C++'      |
| b           | ['C', 'C++']      | 'C'          |'C++'      |
| b           | ['C++', 'C']      | 'C++'        |'C'        |
| b           | ['C', 'Python']   | 'C'          |'Python'   |
| b           | ['Python', 'C']   | 'Python'     |'C'        |
| b           | ['C++', 'Python'] | 'C++'        |'Python'   |
| b           | ['Python', 'C++'] | 'Python'     |'C++'      |

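Compute a pairwise frequency table of the `language0` and `language1` columns. The first column of each row contains the distinct values of `language0`, and the column names are the distinct values of `language1`.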

In [None]:
# crosstab() reshapes the table into a frequency distribution table by using cross tabulations.
frequency_df = frequency_df.crosstab("language0", "language1").withColumnRenamed(
    "language0_language1", "languages"
)

After `crosstab()` is applied to `frequency_df`, the DataFrame data is arranged as follows:

| languages  |  C  | C++ | Java | Python |
| :--------: | :-: | :-: | :-:  |  :-:   |
|     C      |  0  |  2  |  1   |   1    |
|     C++    |  2  |  0  |  1   |   1    |
|    Java    |  1  |  1  |  0   |   0    |
|   Python   |  1  |  1  |  0   |   0    |

Note that this table contains sample data, not real data.

See [frequency distribution](https://en.wikipedia.org/wiki/Frequency_%28statistics%29#Frequency_distribution_table) and [cross tabulations](https://en.wikipedia.org/wiki/Contingency_table).
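
The cross tabulation of the sample data can be reproduced in plain Python by counting ordered pairs. This is a sketch of the idea, not of Spark's implementation, and the two repositories are the same made-up samples used in the tables above.

```python
from collections import Counter
from itertools import permutations

# Made-up repositories matching the sample tables in this section.
repos = {"a": ["C", "C++", "Java"], "b": ["C", "C++", "Python"]}

# One (language0, language1) entry per ordered pair in each repository.
pair_counts = Counter(
    pair for languages in repos.values() for pair in permutations(languages, 2)
)

# Rows and columns are the distinct language names.
names = sorted({name for languages in repos.values() for name in languages})
table = {
    row: {column: pair_counts[(row, column)] for column in names} for row in names
}

for row in names:
    print(row, table[row])
```

The cell for `C` and `C++` is 2 because both sample repositories contain that pair, while `Java` and `Python` never share a repository and therefore get 0.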

The DataFrame now contains the co-occurrence frequency of each language. Visualize it for a set of popular languages.
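
Selecting the languages that most often accompany a given language amounts to reading one column of the frequency table and sorting it in descending order. A minimal plain-Python sketch follows, in which the helper `top_co_occurring` is hypothetical and the nested dictionary holds the sample table rather than real data.

```python
from typing import Dict, List, Tuple

# Sample frequency table: outer key is the row, inner key is the column.
crosstab: Dict[str, Dict[str, int]] = {
    "C": {"C": 0, "C++": 2, "Java": 1, "Python": 1},
    "C++": {"C": 2, "C++": 0, "Java": 1, "Python": 1},
    "Java": {"C": 1, "C++": 1, "Java": 0, "Python": 0},
    "Python": {"C": 1, "C++": 1, "Java": 0, "Python": 0},
}


def top_co_occurring(
    table: Dict[str, Dict[str, int]], language: str, k: int
) -> List[Tuple[str, int]]:
    """Return the k rows with the largest count in the `language` column."""
    column = [(row, counts[language]) for row, counts in table.items()]
    # sorted() is stable, so ties keep the table's row order.
    return sorted(column, key=lambda item: item[1], reverse=True)[:k]


print(top_co_occurring(crosstab, "C", 2))  # [('C++', 2), ('Java', 1)]
```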

In [None]:
# Set of popular languages. You can modify this set to show your preferred languages.
MAJOR_LANGUAGES = {"C", "Java", "Python", "JavaScript", "Go"}

# Declare a dictionary to store the key as a language name and the value as the selected DataFrame
df_dict = dict()

for language in MAJOR_LANGUAGES:
    # Get the top ten co-occurring languages for each language and store them in the dictionary.
    df_dict[language] = (
        frequency_df.select(col("languages"), language).sort(-col(language)).limit(10)
    )

In [None]:
for language in df_dict:
    # Convert Spark DataFrame to Pandas DataFrame to display the bar chart.
    elem_panda = df_dict[language].toPandas()[:LIMIT].copy()
    elem_panda.set_index("languages", inplace=True)
    elem_panda.sort_values(language, ascending=True, inplace=True)
    elem_panda.plot(
        kind="barh",
        title=language,
        legend=False,
        xlabel="",
    )

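### Write back to BigQuery

After analyzing these queries, you have several DataFrames: the ranking of monoglot repositories, the average size of monoglot repositories per language, and the frequency table of the languages used together in repositories.

These DataFrames are stored in BigQuery using the [spark-bigquery-connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector).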

In [None]:
dataframes = {
    "mono_ranking": mono_ranking,
}
if not os.getenv("IS_TESTING"):
    dataframes["mono_ranking_avg_bytes"] = mono_ranking_avg_bytes
    dataframes["frequency_table"] = frequency_df

# Iterate through the DataFrames and save them to BigQuery.
# Distinct loop names avoid shadowing the "df" DataFrame defined earlier.
for table_name, table_df in dataframes.items():
    table_df.write.format("bigquery").option("writeMethod", "direct").option(
        "table", f"{DATASET_NAME}.{table_name}"
    ).save()

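If no errors are reported, congratulations! Your DataFrames are successfully stored in BigQuery.

You can view the data in the [Google Cloud console](https://console.cloud.google.com/bigquery) or with the `bq` command-line tool.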

In [None]:
QUERY = f"SELECT languages, python FROM {PROJECT_ID}.{DATASET_NAME}.frequency_table ORDER BY python DESC LIMIT 10"

! bq query --nouse_legacy_sql $QUERY

## Clean up

See [Clean up](https://cloud.google.com/vertex-ai/docs/workbench/managed/create-managed-notebooks-instance-console-quickstart#clean-up) to delete the project or the managed notebook that you created in this tutorial.

### Delete the BigQuery dataset

In [None]:
! bq rm -r -f $DATASET_NAME

After you delete the BigQuery dataset, you can check the datasets in BigQuery with the following command.

In [None]:
! bq ls

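### Delete the Dataproc cluster

You cannot delete a cluster that is currently in use unless you switch the kernel to a local one. To delete it, switch the kernel to local `Python 3` or `PySpark`, manually set your `CLUSTER_NAME` and `CLUSTER_REGION` in the following cell, and then run the `gcloud` command.

See [Deleting a cluster](https://cloud.google.com/dataproc/docs/guides/manage-cluster#console) to delete the Dataproc cluster created in this tutorial.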

In [None]:
CLUSTER_NAME = "[your-cluster-name]"
CLUSTER_REGION = "[your-cluster-region]"

In [None]:
# Re-import os because switching to a local kernel starts a fresh session.
import os

if not os.getenv("IS_TESTING"):
    ! gcloud dataproc clusters delete $CLUSTER_NAME --region=$CLUSTER_REGION -q