In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Exploratory Data Analysis with R and BigQuery

**Authors**: [Alok Pattani](https://github.com/alokpattani), [Khalid Salama](https://github.com/ksalama)

**Last updated**: February 2024

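## Overview

This notebook demonstrates how to use [R](https://www.r-project.org/about.html) to perform exploratory data analysis (EDA) on data extracted from [BigQuery](https://cloud.google.com/bigquery). After the data is analyzed and processed, the transformed data is stored in [Cloud Storage](https://cloud.google.com/storage) for use in later machine learning (ML) tasks.

R is one of the most widely used programming languages for statistical modeling, with a large and active community of data scientists and ML professionals. The open-source [CRAN](https://cran.r-project.org/) repository hosts more than 20,000 packages, covering statistical data analysis, ML, and visualization.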

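## Dataset
The dataset used in this tutorial is the BigQuery natality dataset. This public dataset contains information on more than 137 million births registered in the United States from 1969 to 2008. The dataset is available [here](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=samples&t=natality&page=table&_ga=2.99329886.-1705629017.1551465326&_gac=1.109796023.1561476396.CI2rz-z4hOMCFc6RhQods4oEXA).

This notebook focuses on exploratory data analysis and visualization with R and BigQuery, with a view toward a potential ML goal: predicting a baby's weight from a number of factors about the pregnancy and the baby's mother.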

## Objectives
The goals of this tutorial are to:
1. Query and analyze data from BigQuery using the [bigrquery](https://cran.r-project.org/web/packages/bigrquery/index.html) R library.
2. Prepare the data for ML and store it in Cloud Storage.

## Costs
This tutorial uses the following billable components of Google Cloud:
1. [BigQuery](https://cloud.google.com/bigquery/pricing)
2. [Cloud Storage](https://cloud.google.com/storage/pricing)
3. [Vertex AI Workbench Instances](https://cloud.google.com/vertex-ai/pricing#notebooks) (if you run this notebook there)

Use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## 0. Setup

Check the version of R that is running.

In [None]:
version

Install the necessary R packages if they are not already available in the current session.

In [None]:
# List the necessary packages
needed_packages <- c("dplyr", "ggplot2", "bigrquery")

# Check if packages are installed
installed_packages <- .packages(all.available = TRUE)
missing_packages <- needed_packages[!(needed_packages %in% installed_packages)]

# If any are missing, install them
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

In [None]:
# Load the required packages
lapply(needed_packages, library, character.only = TRUE) 

Authenticate to BigQuery using out-of-band (OOB) authentication.

In [None]:
bq_auth(use_oob = TRUE)

Set a variable to the ID of the project you want to use for this tutorial.

In [None]:
# Set the project ID
PROJECT_ID <- "[YOUR-PROJECT-ID]"

Set a variable to the name of the Cloud Storage bucket that you want to use later to store the output data. The name must be globally unique.

In [None]:
# Set your Cloud Storage bucket name
BUCKET_NAME <- "[YOUR-BUCKET-NAME]"
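
Besides being globally unique, a bucket name has to follow the Cloud Storage naming rules. As a rough pre-check, the sketch below tests a candidate name against a simplified subset of those rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The helper `is_plausible_bucket_name()` is hypothetical and not part of any Google library, and it cannot tell whether a name is already taken.

```r
# Hypothetical helper: checks a simplified subset of the Cloud Storage
# bucket naming rules. It does not check global uniqueness.
is_plausible_bucket_name <- function(name) {
  is.character(name) &&
    length(name) == 1 &&
    nchar(name) >= 3 &&
    nchar(name) <= 63 &&
    grepl("^[a-z0-9][a-z0-9._-]*[a-z0-9]$", name)
}

is_plausible_bucket_name("my-project-natality-data")  # TRUE
is_plausible_bucket_name("[YOUR-BUCKET-NAME]")        # FALSE: placeholder not replaced
```

A check like this mainly catches the case where the placeholder value was left unchanged before running the rest of the notebook.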

## 1. Query data from BigQuery

### 1.1. Prepare the BigQuery query

In [None]:
sql_query_template <- "
    SELECT
      ROUND(weight_pounds, 2) AS weight_pounds,
      is_male,
      mother_age,
      plurality,
      gestation_weeks,
      cigarette_use,
      alcohol_use,
      CAST(ABS(FARM_FINGERPRINT(CONCAT(
        CAST(year AS STRING), CAST(month AS STRING),
        CAST(weight_pounds AS STRING)))
        ) AS STRING) AS key
    FROM
        publicdata.samples.natality
    WHERE 
      year > 2000
      AND weight_pounds > 0
      AND mother_age > 0
      AND plurality > 0
      AND gestation_weeks > 0
      AND month > 0
    LIMIT %s
"

### 1.2. Execute the query
The data is retrieved from BigQuery, and the results are stored in an in-memory [tibble](https://tibble.tidyverse.org/) (similar to a data frame).

In [None]:
sample_size <- 10000

sql_query <- sprintf(sql_query_template, sample_size)

natality_data <- bq_table_download(
    bq_project_query(
        PROJECT_ID, 
        query = sql_query
    )
)
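
The `sprintf()` call above fills the `LIMIT %s` placeholder in the query template. One detail worth knowing if you change `sample_size`: `%s` formats a number the way `as.character()` does, so some round values come out in scientific notation, which is not an integer literal and is likely to be rejected in a `LIMIT` clause. The sketch below uses shortened stand-in templates (not the full query) to compare `%s` with the integer placeholder `%d`.

```r
# Shortened stand-ins for the query template, for illustration only
template_s <- "SELECT 1 LIMIT %s"
template_d <- "SELECT 1 LIMIT %d"

# With %s, a large round number can come out in scientific notation
sprintf(template_s, 100000)
# [1] "SELECT 1 LIMIT 1e+05"

# With %d and an explicit integer, the digits are written in full
sprintf(template_d, as.integer(100000))
# [1] "SELECT 1 LIMIT 100000"
```

The value 10000 used in this notebook is formatted as `10000` either way, so the issue only appears with certain larger sample sizes.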

### 1.3. View the query results

In [None]:
# View the query result
head(natality_data)

In [None]:
# Show # of rows and data types of each column
str(natality_data)

In [None]:
# View the results summary
summary(natality_data)
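
`summary()` reports missing values column by column, but a compact count can be easier to scan. The sketch below uses a small made-up data frame as a stand-in for `natality_data` (the values are illustrative, not query results) and counts the `NA` entries in each column with base R.

```r
# Made-up stand-in for natality_data; values are illustrative only
example_data <- data.frame(
  weight_pounds = c(7.25, 6.81, 8.12, 5.94),
  cigarette_use = c(NA, FALSE, NA, TRUE),
  alcohol_use   = c(NA, NA, NA, FALSE)
)

# Count missing values per column
na_counts <- colSums(is.na(example_data))
na_counts
```

On the real query result, the equivalent call would be `colSums(is.na(natality_data))`.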

## 2. Visualize the retrieved data

In [None]:
# Display the distribution of baby weights using a histogram
ggplot(
    data = natality_data, 
    aes(x = weight_pounds)
    ) + 
geom_histogram(bins = 200)

In [None]:
# Display the relationship between gestation weeks and baby weights 
ggplot(
    data = natality_data, 
    aes(x = gestation_weeks, y = weight_pounds)
    ) + 
geom_point() + 
geom_smooth(method = "lm")
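
`geom_smooth(method = "lm")` overlays a least-squares line on the scatter plot. To read off the slope and intercept as numbers, a model of the same form can be fitted directly with `lm()`. The sketch below uses made-up points that lie exactly on a known line, so the recovered coefficients can be checked by eye; the data frame is an illustrative stand-in, and with the real data you would pass `data = natality_data`.

```r
# Made-up points on the line weight = -2 + 0.25 * weeks (illustrative only)
example_data <- data.frame(gestation_weeks = c(30, 34, 38, 40, 42))
example_data$weight_pounds <- -2 + 0.25 * example_data$gestation_weeks

# Fit a simple linear regression of weight on gestation weeks
fit <- lm(weight_pounds ~ gestation_weeks, data = example_data)

# Intercept and slope (pounds per additional week of gestation)
coef(fit)
```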

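### Perform the processing in BigQuery
Create a function that finds the number of records and the average weight for each value of the chosen column.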

In [None]:
get_distinct_values <- function(column_name) {
    query <- paste0(
        'SELECT ', column_name, ', 
            COUNT(1) AS num_babies,
            AVG(weight_pounds) AS avg_wt
        FROM publicdata.samples.natality
        WHERE year > 2000
        GROUP BY ', column_name)
    
    bq_table_download(
        bq_project_query(
            PROJECT_ID, 
            query = query
        )
    )
}
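
Because the function pastes `column_name` straight into the SQL text, a typo or unexpected input only surfaces as a query error after a round trip to BigQuery. One possible safeguard, sketched below, is to check the name against a fixed list of known columns before building the query. The helper `build_group_query()` and the `allowed_columns` list are hypothetical additions for illustration; the sketch only builds the SQL string and does not contact BigQuery.

```r
# Hypothetical allowlist of columns that make sense to group by
allowed_columns <- c("is_male", "mother_age", "plurality", "gestation_weeks")

# Hypothetical helper: validates the column name, then builds the SQL text
build_group_query <- function(column_name) {
  if (!(column_name %in% allowed_columns)) {
    stop("Unknown column: ", column_name)
  }
  paste0(
    "SELECT ", column_name, ", ",
    "COUNT(1) AS num_babies, ",
    "AVG(weight_pounds) AS avg_wt ",
    "FROM publicdata.samples.natality ",
    "WHERE year > 2000 ",
    "GROUP BY ", column_name
  )
}

# Print the generated SQL for one column
cat(build_group_query("plurality"))
```

Separating query construction from execution also makes it possible to inspect the generated SQL before it is billed.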

Apply the function to different columns and plot the results to look for patterns.

In [None]:
df <- get_distinct_values('mother_age')

ggplot(
    data = df, 
    aes(x = mother_age, y = num_babies)
    ) + 
geom_line()

ggplot(
    data = df, 
    aes(x = mother_age, y = avg_wt)
    ) + 
geom_line()

In [None]:
df <- get_distinct_values('is_male')

ggplot(
    data = df, 
    aes(x = is_male, y = num_babies)
    ) + 
geom_col()

ggplot(
    data = df, 
    aes(x = is_male, y = avg_wt)
    ) + 
geom_col()

In [None]:
df <- get_distinct_values('plurality')

ggplot(
    data = df, 
    aes(x = plurality, y = num_babies)
    ) + 
geom_col() + 
scale_y_log10()

ggplot(
    data = df,
    aes(x = plurality, y = avg_wt)
    ) + 
geom_col()

In [None]:
df <- get_distinct_values('gestation_weeks')

ggplot(
    data = df,
    aes(x = gestation_weeks, y = num_babies)
    ) + 
geom_col() + 
scale_y_log10()

ggplot(
    data = df,
    aes(x = gestation_weeks, y = avg_wt)
    ) + 
geom_col()

## 3. Save the data as CSV files to Cloud Storage

In [None]:
# Prepare training and evaluation data from BigQuery
sample_size <- 10000

sql_query <- sprintf(sql_query_template, sample_size)

# Split data into 75% training, 25% evaluation.
# MOD(..., 100) yields buckets 0-99, so buckets 0-74 form the 75% share.
train_query <- paste('SELECT * FROM (', sql_query,
  ') WHERE MOD(CAST(key AS INT64), 100) < 75')
eval_query <- paste('SELECT * FROM (', sql_query,
  ') WHERE MOD(CAST(key AS INT64), 100) >= 75')

# Load training data to data frame
train_data <- bq_table_download(
    bq_project_query(
        PROJECT_ID, 
        query = train_query
    )
)

# Load evaluation data to data frame
eval_data <- bq_table_download(
    bq_project_query(
        PROJECT_ID, 
        query = eval_query
    )
)
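
The split relies on a common pattern: hash each row to a stable key, take the key modulo 100 to get a bucket from 0 to 99, and assign ranges of buckets to the training or evaluation set. Because bucket numbers start at 0, a 75% share corresponds to buckets 0 through 74. The sketch below checks that arithmetic locally with synthetic integer keys, which stand in for the hashed `key` column and are assumed to be spread evenly across buckets.

```r
# Synthetic keys standing in for the hashed key column (illustrative only)
keys <- 0:999
buckets <- keys %% 100

# Fraction of rows selected by each comparison
mean(buckets < 75)   # 0.75
mean(buckets <= 75)  # 0.76
mean(buckets >= 75)  # 0.25
```

On real data the proportions are only approximate, since hashed keys do not fall into the buckets perfectly evenly.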

In [None]:
print(paste0("Training instances count: ", nrow(train_data)))

print(paste0("Evaluation instances count: ", nrow(eval_data)))

In [None]:
# Write data frames to local CSV files, without headers or row names
dir.create(file.path('data'), showWarnings = FALSE)

write.table(train_data, "data/train_data.csv", 
   row.names = FALSE, col.names = FALSE, sep = ",")

write.table(eval_data, "data/eval_data.csv", 
   row.names = FALSE, col.names = FALSE, sep = ",")
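
Since the files are written without a header row, whatever reads them later has to supply the column names and types. The sketch below round-trips a single made-up row (a stand-in for `train_data`, not real query output) through a temporary file and reads it back with explicit names. The `key` column is read as character because the hashed values can have up to 19 digits, more than a double can represent exactly.

```r
# Column names in the order selected by the query
column_names <- c(
  "weight_pounds", "is_male", "mother_age", "plurality",
  "gestation_weeks", "cigarette_use", "alcohol_use", "key"
)

# Made-up single row standing in for train_data (illustrative only)
example_data <- data.frame(
  weight_pounds = 7.25, is_male = TRUE, mother_age = 28L, plurality = 1L,
  gestation_weeks = 39L, cigarette_use = NA, alcohol_use = FALSE,
  key = "8599690069971956834"
)

# Write without headers or row names, then read back with explicit names
csv_path <- tempfile(fileext = ".csv")
write.table(example_data, csv_path,
  row.names = FALSE, col.names = FALSE, sep = ",")

restored <- read.csv(
  csv_path, header = FALSE, col.names = column_names,
  colClasses = c("numeric", "logical", "integer", "integer",
                 "integer", "logical", "logical", "character")
)
str(restored)
```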

In [None]:
# Upload CSV data to Cloud Storage by passing gsutil commands to system
gcs_url <- paste0("gs://", BUCKET_NAME, "/")

command <- paste("gsutil mb", gcs_url)

system(command)

gcs_data_dir <- paste0("gs://", BUCKET_NAME, "/data")

command <- paste("gsutil cp data/*_data.csv", gcs_data_dir)

system(command)

command <- paste("gsutil ls -l", gcs_data_dir)

system(command, intern = TRUE)
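
`system()` returns the command's exit status instead of raising an R error, so a failed `gsutil` call (for example, when the bucket cannot be created) can go unnoticed. The sketch below wraps `system()` in a hypothetical helper, `run_checked()`, that stops on a non-zero status. It is demonstrated with a harmless `echo` command rather than `gsutil` and assumes a Unix-like shell.

```r
# Hypothetical helper: runs a shell command and stops if it exits with a
# non-zero status
run_checked <- function(command) {
  status <- system(command)
  if (status != 0) {
    stop("Command failed with status ", status, ": ", command)
  }
  invisible(status)
}

# Harmless demonstration command; in the notebook this would be a gsutil call
run_checked("echo upload step would run here")
```

In the cell above, the same wrapper could replace the plain `system(command)` calls, for example `run_checked(paste("gsutil cp data/*_data.csv", gcs_data_dir))`.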