# Installing gcloud

Install gcloud by following the official instructions here:

https://cloud.google.com/sdk/docs/install

### Common commands

Once the installation is complete, the following commands are useful:

Log in to gcloud:
```
gcloud auth login
```
Get credentials so that code can run on your local machine:

```
gcloud auth application-default login
```

Update the gcloud components:
```
gcloud components update
```

Set the default compute region:
```
gcloud config set compute/region asia-east1
```

### If you have multiple accounts and projects
List the accounts and projects:
```
gcloud auth list
gcloud projects list --sort-by=projectId
```


Set the active account and project:

```
gcloud config set account ACCOUNT_EMAIL
gcloud config set project PROJECT_ID
```


If you use a service account, point the GOOGLE_APPLICATION_CREDENTIALS
environment variable at its key file:
```
export GOOGLE_APPLICATION_CREDENTIALS="path/to/service/account"
```
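
To check from Python whether this variable is set and points at a real file, a minimal sketch could look like the following. `describe_credentials` is a hypothetical helper written for this note, not part of any Google library, and it does not validate the key itself.

```python
import os


def describe_credentials(env=None):
    """Report whether GOOGLE_APPLICATION_CREDENTIALS points at an existing file.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset: client libraries will look for other default credentials"
    if not os.path.isfile(path):
        return f"set, but no file found at {path}"
    return f"set: using key file {path}"


print(describe_credentials())
```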

# Creating a service account

batch-job-service-account

<!-- @test-app-309909.iam.gserviceaccount.com -->

# Working with Google Cloud Storage (gs)

Install the Python library:

In [None]:
# !pip install --upgrade google-cloud-storage

In [None]:
# !gcloud auth application-default login

In [1]:
from google.colab import auth

PROJECT_ID = "test-app-309909"
auth.authenticate_user(project_id=PROJECT_ID)

In [2]:
from google.cloud import storage

# Initialize the client
PROJECT_ID = 'test-app-309909'
storage_client = storage.Client(project=PROJECT_ID)

In [None]:
# Create a bucket
# BUCKET_NAME = "bronze-zone"
# bucket = storage_client.create_bucket(BUCKET_NAME)

In [3]:
# Get an existing bucket
BUCKET_NAME = "mmo_event_processing"
bucket = storage_client.get_bucket(BUCKET_NAME)

In [4]:
# List the files in the bucket under the path:
# gs://mmo_event_processing/bronze-zone/event/2023/08/12/

PREFIX = "bronze-zone/event/2023/08/12/"
for blob in bucket.list_blobs(prefix=PREFIX):
    print(blob)

# Equivalent call:
# for blob in storage_client.list_blobs(BUCKET_NAME,prefix=PREFIX):
#     print(blob.name)

<Blob: mmo_event_processing, bronze-zone/event/2023/08/12/c6bfa982-3933-11ee-800c-56934cb9d3ad.json, 1692770728797278>
<Blob: mmo_event_processing, bronze-zone/event/2023/08/12/c8aaf800-3933-11ee-800c-56934cb9d3ad.json, 1692770730477281>
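
Because the objects are laid out by date, the prefix can be derived from a date instead of being typed by hand. A minimal sketch, assuming the `zone/dataset/YYYY/MM/DD/` layout seen in the listing above (`daily_prefix` is a hypothetical helper):

```python
from datetime import date


def daily_prefix(zone, dataset, day):
    """Build an object-name prefix of the form zone/dataset/YYYY/MM/DD/."""
    return f"{zone}/{dataset}/{day:%Y/%m/%d}/"


print(daily_prefix("bronze-zone", "event", date(2023, 8, 12)))
# bronze-zone/event/2023/08/12/
```

The result can be passed as `prefix=` to `bucket.list_blobs`.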


In [5]:
# Get a blob reference (this also works for a blob that does not exist yet)
GCS_BLOB_NAME = "bronze-zone/event/2023/08/12/c6bfa982-3933-11ee-800c-56934cb9d3ad.json"
# GCS_BLOB_NAME = "none/path/blob"

blob = bucket.blob(GCS_BLOB_NAME)
blob

<Blob: mmo_event_processing, bronze-zone/event/2023/08/12/c6bfa982-3933-11ee-800c-56934cb9d3ad.json, None>

In [6]:
# Check whether the blob exists
is_existed = blob.exists()
is_existed

True

In [7]:
# Download the file to the local machine
# Change this to the path you want to download to
LOCAL_FILE_PATH = "./downloaded.json"

blob.download_to_filename(LOCAL_FILE_PATH)

In [8]:
# Download the file into memory
data = blob.download_as_bytes()

In [None]:
# data
# Case where the data is a JSON file (one record per line)
for line in data.decode('utf-8').split("\n"):
    print(line)
    break
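
If each line is a standalone JSON object, the downloaded bytes can be parsed record by record with the standard library. This is a sketch under that assumption; `iter_json_lines` is a hypothetical helper and the sample bytes are made up to stand in for `blob.download_as_bytes()`.

```python
import json


def iter_json_lines(data):
    """Yield one parsed object per non-empty line of UTF-8 encoded bytes."""
    for line in data.decode("utf-8").split("\n"):
        if line.strip():
            yield json.loads(line)


# Hypothetical sample standing in for blob.download_as_bytes()
sample = b'{"event_id": "e1", "user_id": 1}\n{"event_id": "e2", "user_id": 2}\n'
first = next(iter_json_lines(sample))
print(first["event_id"])  # e1
```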

In [None]:
# Upload a local file to the blob
# blob.upload_from_filename(LOCAL_FILE_PATH)

# Working with pyarrow

https://arrow.apache.org/cookbook/py/data.html


In [None]:
# Install pyarrow
!pip install --upgrade pyarrow

In [25]:
import pyarrow
import pyarrow.json
import pyarrow.parquet

In [12]:
GCS_BLOB_NAME = "gold-zone/json/c6bfa982-3933-11ee-800c-56934cb9d3ad.json"
blob = bucket.blob(GCS_BLOB_NAME)

LOCAL_FILE_PATH = "./cleaned.json"
blob.download_to_filename(LOCAL_FILE_PATH)

# Read JSON from the file.
table = pyarrow.json.read_json(LOCAL_FILE_PATH)

In [None]:
table.schema

In [19]:
schema = pyarrow.schema([
    ('event_id', pyarrow.string()),
    ('event_type', pyarrow.string()),
    ('timestamp', pyarrow.string()),   # Force the timestamp type to string
    ('user_id', pyarrow.int64()),
    ('location', pyarrow.string()),
    ('device', pyarrow.string()),
    ('ip_address', pyarrow.string()),
    ('event_attribute', pyarrow.list_(pyarrow.struct([
        ('key', pyarrow.string()),
        ('int_value', pyarrow.int64()),
        ('float_value', pyarrow.float64()),
        ('string_value', pyarrow.string()),
        ('bool_value', pyarrow.bool_())
    ])))
])

parse_opt = pyarrow.json.ParseOptions(
    explicit_schema = schema
)
table = pyarrow.json.read_json(LOCAL_FILE_PATH,parse_options=parse_opt)

In [None]:
table.schema

In [None]:
table['event_id']

# table[['event_id','event_attribute']]  # this raises an error

In [None]:
table.select(['event_id','event_attribute'])

In [25]:
table.group_by('user_id').aggregate([("event_id", "count"),
                                    ("timestamp", "max")])
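
To make the aggregation concrete, here is a plain-Python sketch of the same idea: count the events and take the latest timestamp per user. It is not how PyArrow computes the result, and the sample rows are made up. The `YYYY-MM-DD HH:MM:SS` timestamp format is an assumption.

```python
from collections import defaultdict

# Hypothetical rows using column names from the schema above
rows = [
    {"user_id": 1, "event_id": "e1", "timestamp": "2023-08-12 10:00:00"},
    {"user_id": 1, "event_id": "e2", "timestamp": "2023-08-12 11:30:00"},
    {"user_id": 2, "event_id": "e3", "timestamp": "2023-08-12 09:15:00"},
]

grouped = defaultdict(lambda: {"event_id_count": 0, "timestamp_max": None})
for row in rows:
    agg = grouped[row["user_id"]]
    agg["event_id_count"] += 1
    # String timestamps in this fixed format sort chronologically
    if agg["timestamp_max"] is None or row["timestamp"] > agg["timestamp_max"]:
        agg["timestamp_max"] = row["timestamp"]

result = dict(grouped)
print(result)
```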

In [29]:
# Write the table to a file
pyarrow.parquet.write_table(table,"table.parquet", compression="snappy")

In [31]:
# Check the size of the data
!du -h cleaned.json table.parquet

21M	cleaned.json
4.5M	table.parquet


In [None]:
# Read the table back from the Parquet file
pyarrow.parquet.read_table("table.parquet")

In [32]:
# Write the table out as a partitioned dataset
pyarrow.parquet.write_to_dataset(table, root_path='event_info_dataset',
                                    partition_cols=['event_type',
                                                    'user_id'])
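
With `partition_cols`, `write_to_dataset` creates one directory level per partition column, named `column=value` (Hive-style). The sketch below only illustrates that directory naming in plain Python. `partition_dir` is a hypothetical helper and the row values are made up.

```python
def partition_dir(root_path, row, partition_cols):
    """Return the Hive-style directory (col=value/...) a row would fall under."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([root_path, *parts])


# Hypothetical row; the values are made up
row = {"event_type": "login", "user_id": 42, "event_id": "e1"}
print(partition_dir("event_info_dataset", row, ["event_type", "user_id"]))
# event_info_dataset/event_type=login/user_id=42
```

Partitioning on a high-cardinality column such as `user_id` produces one directory per user, which is worth keeping in mind when choosing partition columns.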

In [10]:
import pyarrow.fs

# Define the Google Cloud Storage filesystem
gcs = pyarrow.fs.GcsFileSystem(anonymous=False)
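
The paths passed to this filesystem in this notebook are written as `bucket/path/to/object`, without the `gs://` scheme. A small sketch for converting a `gs://` URI into that form follows; `split_gs_uri` is a hypothetical helper and the object name is made up.

```python
def split_gs_uri(uri):
    """Split gs://bucket/key into (bucket, key)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri}")
    bucket_name, _, key = uri[len(prefix):].partition("/")
    return bucket_name, key


bucket_name, key = split_gs_uri("gs://mmo_event_processing/gold-zone/json/example.json")
print(f"{bucket_name}/{key}")
# mmo_event_processing/gold-zone/json/example.json
```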

In [11]:
# Open a file on Google Cloud Storage and read it
my_file = gcs.open_input_stream("mmo_event_processing/gold-zone/json/c6bfa982-3933-11ee-800c-56934cb9d3ad.json")

schema = pyarrow.schema([
    ('event_id', pyarrow.string()),
    ('event_type', pyarrow.string()),
    ('timestamp', pyarrow.string()),
    ('user_id', pyarrow.int64()),
    ('location', pyarrow.string()),
    ('device', pyarrow.string()),
    ('ip_address', pyarrow.string()),
    ('event_attribute', pyarrow.list_(pyarrow.struct([
        ('key', pyarrow.string()),
        ('int_value', pyarrow.int64()),
        ('float_value', pyarrow.float64()),
        ('string_value', pyarrow.string()),
        ('bool_value', pyarrow.bool_())
    ])))
])

parse_opt = pyarrow.json.ParseOptions(
    explicit_schema = schema
)

table = pyarrow.json.read_json(my_file,parse_options=parse_opt)

In [None]:
table

In [37]:
# Write the dataset to Google Cloud Storage
pyarrow.parquet.write_to_dataset(table,
                                root_path='mmo_event_processing/gold-zone/event_info_dataset',
                                partition_cols=['event_type', 'user_id'],
                                filesystem=gcs)

# Problem:
- PyArrow only overwrites the dataset.
- Workaround:
    - Store the data under a year/month/day path
    - Parse the timestamp into year/month/day columns
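
The workaround can be sketched in plain Python as follows. The `YYYY-MM-DD HH:MM:SS` timestamp format is an assumption, and `add_date_parts` is a hypothetical helper; adjust `fmt` to match the real data.

```python
from datetime import datetime


def add_date_parts(record, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a copy of record with zero-padded year/month/day string fields.

    The timestamp format is an assumption; adjust fmt to the real data.
    """
    ts = datetime.strptime(record["timestamp"], fmt)
    return {**record, "year": f"{ts:%Y}", "month": f"{ts:%m}", "day": f"{ts:%d}"}


record = add_date_parts({"event_id": "e1", "timestamp": "2023-08-12 10:00:00"})
print(f"year={record['year']}/month={record['month']}/day={record['day']}")
# year=2023/month=08/day=12
```

Each day then lands in its own directory, so rewriting one day does not touch the data of other days.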


# Pandas and pyarrow

Pandas 2.0 supports pyarrow as an engine.


In [None]:
!pip install --upgrade pandas

In [14]:
import pandas as pd


pd.read_parquet("./table.parquet",engine = "pyarrow")

# Convert the pyarrow table to pandas
df = table.to_pandas(types_mapper=pd.ArrowDtype)

In [26]:
df['datetime'] = df['timestamp'].str.split(" ").map(lambda x: x[0])
df['year'] = df['datetime'].str.split("-").map(lambda x: x[0])
df['month'] = df['datetime'].str.split("-").map(lambda x: x[1])
df['day'] = df['datetime'].str.split("-").map(lambda x: x[2])

new_table = pyarrow.Table.from_pandas(df)

pyarrow.parquet.write_to_dataset(new_table,
                                root_path='mmo_event_processing/gold-zone/event_info_dataset',
                                partition_cols=['year','month','day'],
                                filesystem=gcs)

# Cloud Build

Build with Docker:
```
docker build -t gcr.io/test-app-309909/simple-image:v1.0 .
```

Build on Cloud Build:
```
gcloud builds submit --tag gcr.io/test-app-309909/simple-image:v1.0
```

```
# cloudbuild.yaml
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/${_PROJECT_ID}/${_IMAGE_NAME}:${_TAG}', '.']
substitutions:
  _PROJECT_ID: test-app-309909 # default value
  _IMAGE_NAME: simple-image # default value
  _TAG: v1.0
images:
  - 'gcr.io/${_PROJECT_ID}/${_IMAGE_NAME}:${_TAG}'
```

```
gcloud builds submit --config=cloudbuild.yaml
```

```
gcloud builds submit \
        --config=cloudbuild.yaml \
        --substitutions=_PROJECT_ID=test-app-309909,_IMAGE_NAME=simple-image,_TAG=v1.0
```

# Cloud Run

Create a Cloud Run job:
```
export PROJECT_ID=test-app-309909
export IMAGE=simple-image
export TAG=v1.0
export JOB_NAME=simple-job
export SERVICE_ACCOUNT=batch-job-service-account@test-app-309909.iam.gserviceaccount.com

gcloud run jobs create ${JOB_NAME} \
            --region asia-east1 \
            --image gcr.io/${PROJECT_ID}/${IMAGE}:${TAG} \
            --service-account ${SERVICE_ACCOUNT}
```

# Cloud Scheduler

Create a Cloud Scheduler job:

```
export PROJECT_ID=test-app-309909
export IMAGE=simple-image
export TAG=v1.0
export SCHEDULER_NAME=simple-scheduler
export JOB_NAME=simple-job
export CLOUD_RUN_REGION=asia-east1
export SCHEDULER_REGION=asia-east1
export SERVICE_ACCOUNT=batch-job-service-account@test-app-309909.iam.gserviceaccount.com


# The target is a Google API (*.googleapis.com), so authenticate with an
# OAuth token rather than an OIDC token.
gcloud scheduler jobs create http ${SCHEDULER_NAME} \
        --location ${SCHEDULER_REGION} \
        --schedule="* * * * *" \
        --uri="https://${CLOUD_RUN_REGION}-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/${PROJECT_ID}/jobs/${JOB_NAME}:run" \
        --http-method POST \
        --oauth-service-account-email ${SERVICE_ACCOUNT}

```
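
The `--uri` value follows a fixed pattern built from the region, project and job name. A minimal sketch of that pattern in Python, useful when generating scheduler definitions for several jobs (`run_job_uri` is a hypothetical helper):

```python
def run_job_uri(region, project_id, job_name):
    """Build the URL used above to trigger an execution of a Cloud Run job."""
    return (
        f"https://{region}-run.googleapis.com/apis/run.googleapis.com/v1/"
        f"namespaces/{project_id}/jobs/{job_name}:run"
    )


print(run_job_uri("asia-east1", "test-app-309909", "simple-job"))
```

The schedule `* * * * *` is standard cron syntax for "every minute".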