***

**<center><font size = "6">Perform Foundational Data, ML, and AI Tasks in Google Cloud</font></center>**
***
<center><font size = "2">Prepared by: Sitsawek Sukorn</font></center>

### Vertex AI: Qwik Start

### Enable Google Cloud services

- In Cloud Shell, use gcloud to enable the services used in the lab.

In [None]:
gcloud services enable \
  compute.googleapis.com \
  iam.googleapis.com \
  iamcredentials.googleapis.com \
  monitoring.googleapis.com \
  logging.googleapis.com \
  notebooks.googleapis.com \
  aiplatform.googleapis.com \
  bigquery.googleapis.com \
  artifactregistry.googleapis.com \
  cloudbuild.googleapis.com \
  container.googleapis.com

### Create Vertex AI custom service account for Vertex Tensorboard integration

- Create custom service account

In [None]:
SERVICE_ACCOUNT_ID=vertex-custom-training-sa
gcloud iam service-accounts create $SERVICE_ACCOUNT_ID  \
    --description="A custom service account for Vertex custom training with Tensorboard" \
    --display-name="Vertex AI Custom Training"

- Grant it access to GCS for writing and retrieving Tensorboard logs

In [None]:
PROJECT_ID=$(gcloud config get-value core/project)
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com \
    --role="roles/storage.admin"

- Grant it access to your BigQuery data source to read data into your TensorFlow model

In [None]:
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com \
    --role="roles/bigquery.admin"

- Grant it access to Vertex AI for running model training, deployment, and explanation jobs.


In [None]:
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com \
    --role="roles/aiplatform.user"
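
The three role bindings above differ only in the `--role` value. As a minimal sketch, the same commands can be generated from a list of roles. This only builds the command lines and does not run them; the function names and the `my-project` project ID are illustrative placeholders, not part of the lab.

```python
# Sketch only: builds the gcloud commands as argument lists; it does not run them.
SERVICE_ACCOUNT_ID = "vertex-custom-training-sa"
ROLES = ["roles/storage.admin", "roles/bigquery.admin", "roles/aiplatform.user"]


def service_account_member(project_id, service_account_id=SERVICE_ACCOUNT_ID):
    """Return the IAM member string used in the --member flag."""
    return f"serviceAccount:{service_account_id}@{project_id}.iam.gserviceaccount.com"


def iam_binding_commands(project_id, roles=ROLES):
    """Return one `gcloud projects add-iam-policy-binding` command per role."""
    member = service_account_member(project_id)
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", f"--role={role}"]
        for role in roles
    ]


if __name__ == "__main__":
    # "my-project" is a placeholder project ID, not a real project.
    for command in iam_binding_commands("my-project"):
        print(" ".join(command))
```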

### Launch Vertex AI Workbench notebook

To create and launch a Vertex AI Workbench notebook:

- In the Navigation menu, click Vertex AI > Workbench.

- On the Workbench page, click New Notebook.

- In the Customize instance menu, select TensorFlow Enterprise and choose the latest version of TensorFlow Enterprise 2.x (with LTS) > Without GPUs.

- Name the notebook.

- Set Region to us-central1 and Zone to any zone within the designated region.

- In the Notebook properties, click the pencil icon to edit the instance properties.

- Scroll down to Machine configuration and select e2-standard-2 for Machine type.

- Leave the remaining fields at their default and click Create.

After a few minutes, the Workbench page lists your instance, followed by Open JupyterLab.

- Click Open JupyterLab to open JupyterLab in a new tab.

### Clone the example repo within your Workbench instance

To clone the training-data-analyst repository in your JupyterLab instance:

- In JupyterLab, click the Terminal icon to open a new terminal.

- At the command-line prompt, type the following command and press ENTER:

In [None]:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst

- To confirm that you have cloned the repository, in the left panel, double click the training-data-analyst folder to see its contents.

It can take several minutes for the repository to clone.

### Install lab dependencies

- Run the following commands to change to the training-data-analyst/self-paced-labs/vertex-ai/vertex-ai-qwikstart folder and install the lab dependencies listed in requirements.txt:

In [None]:
cd training-data-analyst/self-paced-labs/vertex-ai/vertex-ai-qwikstart
pip install -U -r requirements.txt

#### Navigate to lab notebook

- In your notebook, navigate to training-data-analyst > self-paced-labs > vertex-ai > vertex-ai-qwikstart, and open lab_exercise.ipynb.

- Continue the lab in the notebook, and run each cell by clicking the Run icon at the top of the screen.


Alternatively, you can execute the code in a cell with SHIFT + ENTER.

Read the narrative and make sure you understand what's happening in each cell.

***

**<center><font size = "6">Dataprep: Qwik Start</font></center>**
***

### Create a Cloud Storage bucket in your project

- In the Cloud Console, select Navigation menu > Cloud Storage > Buckets.

- Click Create bucket.

- In the Create a bucket dialog, give the bucket a unique name. Leave the other settings at their default values.

Note: Learn more about naming buckets from Bucket naming guidelines.

- Click Create.
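
As a rough aid for picking a name, the sketch below checks a simplified subset of the bucket naming guidelines: 3 to 63 characters, only lowercase letters, digits, dashes, underscores, and dots, starting and ending with a letter or digit, and no `goog` prefix or `google` substring. It is a partial check, not a full validator (for example, it ignores the longer limit that applies to dotted names), and the function name is hypothetical.

```python
import re

# Partial check only: covers length, allowed characters, and first/last character.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if `name` passes a simplified subset of the naming rules."""
    if not _BUCKET_NAME.fullmatch(name):
        return False
    # Names may not start with "goog" or contain "google".
    return not name.startswith("goog") and "google" not in name


if __name__ == "__main__":
    for candidate in ["my-lab-bucket-123", "My_Bucket", "ab"]:
        print(candidate, looks_like_valid_bucket_name(candidate))
```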

### Initialize Cloud Dataprep

- Select Navigation menu > Dataprep.

- Check to accept the Google Dataprep Terms of Service, then click Accept.

- Check to authorize sharing your account information with Trifacta, then click Agree and Continue.

- Click Allow to allow Trifacta to access project data.

- Click your student username to sign in to Cloud Dataprep by Trifacta. Your username is the Username in the left panel in your lab.

- Click Allow to grant Cloud Dataprep access to your Google Cloud lab account.

- Check to agree to Trifacta Terms of Service, and then click Accept.

- Click Continue on the First time setup screen to create the default storage location.

Dataprep opens.

- Click on the Dataprep icon on the top left corner to go to the home screen.



### Create a flow

Cloud Dataprep uses a flow workspace to access and manipulate datasets.

- Click the Flows icon, then the Create button, then select Blank Flow.

- Click on Untitled Flow, then name and describe the flow. Because this lab uses 2016 data from the United States Federal Elections Commission, name the flow "FEC-2016", and then describe the flow as "United States Federal Elections Commission 2016".

- Click OK.

### Import datasets

In this section you import and add data to the FEC-2016 flow.

- Click Add Datasets, then select the Import Datasets link.

- In the left menu pane, select Cloud Storage to import datasets from Cloud Storage, then click on the pencil to edit the file path.

- Type gs://spls/gsp105 in the Choose a file or folder text box, then click Go.
You may have to widen the browser window to see the Go and Cancel buttons.

- Click us-fec/.

- Click the + icon next to cn-2016.txt to create a dataset shown in the right pane. Click on the title in the dataset in the right pane and rename it "Candidate Master 2016".

- In the same way add the itcont-2016-orig.txt dataset, and rename it "Campaign Contributions 2016".

- Both datasets are listed in the right pane; click Import & Add to Flow.

You see both datasets listed as a flow.

### Prep the candidate file

- By default, the Candidate Master 2016 dataset is selected. In the right pane, click Edit Recipe.

The Candidate Master 2016 Transformer page opens in the grid view.

The Transformer page is where you build your transformation recipe and see the results applied to the sample. When you are satisfied with what you see, execute the job against your dataset.

- Each column head has a name and a value that specifies the data type. To see the data types, click the column icon.

- Notice also that when you click the name of the column, a Details panel opens on the right.

- Click X in the top right of the Details panel to close the Details panel.

In the following steps you explore data in the grid view and apply transformation steps to your recipe.

- Column5 provides data from 1990-2064. Widen column5 (like you would on a spreadsheet) to separate each year. Click to select the tallest bin, which represents the year 2016.

This creates a step where these values are selected.

- In the Suggestions panel on the right, in the Keep rows section, click Add to add this step to your recipe.

The Recipe panel on the right now has the following step:

Keep rows where(DATE(2016, 1, 1) <= column5) && (column5 < DATE(2018, 1, 1))

- In Column6 (State), hover over and click on the mismatched (red) portion of the header to select the mismatched rows.

Scroll down to the bottom (highlighted in red) to find the mismatched values, and notice how most of these records have the value "P" in column7 and "US" in column6. The mismatch occurs because column6 is marked as a "State" column (indicated by the flag icon), but it contains non-state values (such as "US").

- To correct the mismatch, click X in the top of the Suggestions panel to cancel the transformation, then click on the flag icon in Column6 and change it to a "String" column.

There is no longer a mismatch and the column marker is now green.

- Filter on just the presidential candidates, which are those records that have the value "P" in column7. In the histogram for column7, hover over the two bins to see which is "H" and which is "P". Click the "P" bin.

- In the right Suggestions panel, click Add to accept the step to the recipe.
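
The two Keep rows steps in this recipe amount to simple row filters. The sketch below reproduces them in plain Python on a few made-up rows. The column names follow the lab, but the sample values are invented, and this illustrates the logic only, not how Dataprep executes a recipe.

```python
from datetime import date

# Made-up sample rows; the column names follow the lab's default headers.
ROWS = [
    {"column2": "C1", "column5": date(2016, 1, 1), "column6": "US", "column7": "P"},
    {"column2": "C2", "column5": date(2016, 1, 1), "column6": "NY", "column7": "H"},
    {"column2": "C3", "column5": date(2012, 1, 1), "column6": "US", "column7": "P"},
]


def keep_2016_rows(rows):
    """Keep rows where DATE(2016, 1, 1) <= column5 < DATE(2018, 1, 1)."""
    return [r for r in rows if date(2016, 1, 1) <= r["column5"] < date(2018, 1, 1)]


def keep_presidential(rows):
    """Keep rows whose column7 value is "P" (presidential candidates)."""
    return [r for r in rows if r["column7"] == "P"]


filtered = keep_presidential(keep_2016_rows(ROWS))
print([r["column2"] for r in filtered])
```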

### Wrangle the Contributions file and join it to the Candidates file

On the Join page, you can add your current dataset to another dataset or recipe based on information that is common to both datasets.

Before you join the Contributions file to the Candidates file, clean up the Contributions file.

- Click on FEC-2016 (the dataset selector) at the top of the grid view page.

- Click to select the grayed out Campaign Contributions 2016.

- In the right pane, click Add > Recipe, then click Edit Recipe.

- Click the recipe icon at the top right of the page, then click Add New Step.

Remove extra delimiters in the dataset.

- Insert the following Wrangle language command in the Search box:

replacepatterns col: * with: '' on: `{start}"|"{end}` global: true

The Transformation Builder parses the Wrangle command and populates the Find and Replace transformation fields.

- Click Add to add the transform to the recipe.

- Add another new step to the recipe. Click New Step, then type "Join" in the Search box.

- Click Join datasets to open the Joins page.

- Click on "Candidate Master 2016" to join with Campaign Contributions 2016, then Accept in the bottom right.

- On the right side, hover in the Join keys section, then click on the pencil (Edit icon).

Dataprep infers common keys. There are many common values that Dataprep suggests as Join Keys.

- In the Add Key panel, in the Suggested join keys section, click column2 = column11.

- Click Save and Continue.

Columns 2 and 11 open for your review.

- Click Next, then check the checkbox to the left of the "Column" label to add all columns of both datasets to the joined dataset.

- Click Review, and then Add to Recipe to return to the grid view.
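
To make the two recipe steps in this section concrete, here is a minimal Python sketch of the same ideas: stripping a double quote from the start or end of each value (the effect of the `replacepatterns` step), then joining the two datasets on a shared key. It assumes an inner join, the sample rows are invented, and which dataset holds `column2` and which holds `column11` is an assumption made for illustration. The `right_` prefix on joined columns is also hypothetical.

```python
import re


def strip_outer_quotes(value):
    """Remove a double quote at the start and/or end of a value."""
    return re.sub(r'^"|"$', "", value)


def inner_join(left_rows, right_rows, left_key, right_key):
    """Inner join: pair each left row with every right row that has the same key."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for left in left_rows:
        for right in index.get(left[left_key], []):
            merged = dict(left)
            # Prefix right-hand columns so names from the two datasets cannot collide.
            merged.update({f"right_{k}": v for k, v in right.items()})
            joined.append(merged)
    return joined


# Invented sample data, not taken from the FEC files.
contributions = [
    {"column2": '"C001"', "column16": "250"},
    {"column2": "C999", "column16": "100"},
]
candidates = [{"column11": "C001", "column3": "DOE, JANE"}]

cleaned = [{k: strip_outer_quotes(v) for k, v in row.items()} for row in contributions]
joined = inner_join(cleaned, candidates, "column2", "column11")
print(joined)
```

Rows without a matching key on the other side (here, `C999`) are dropped, which is the behavior of an inner join.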

### Summary of data

Generate a useful summary by summing, averaging, and counting the contributions in column16, and grouping the candidates by ID, name, and party affiliation in column2, column24, and column8, respectively.

- At the top of the Recipe panel on the right, click on New Step and enter the following formula in the Transformation search box to preview the aggregated data.

pivot value:sum(column16),average(column16),countif(column16 > 0) group: column2,column24,column8

An initial sample of the joined and aggregated data is displayed, representing a summary table of US presidential candidates and their 2016 campaign contribution metrics.

- Click Add to open a summary table of major US presidential candidates and their 2016 campaign contribution metrics.
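
The pivot step computes three aggregates per group. The sketch below shows the equivalent logic in plain Python on invented rows: the sum and average of `column16`, and a count of rows where `column16 > 0`, grouped by `column2`, `column24`, and `column8`. The output key names mirror the column names that appear in the rename step later in this lab. The sample data is made up.

```python
from collections import defaultdict

# Made-up joined rows: column2 = candidate ID, column24 = name,
# column8 = party affiliation, column16 = contribution amount.
ROWS = [
    {"column2": "C1", "column24": "DOE, JANE", "column8": "IND", "column16": 100.0},
    {"column2": "C1", "column24": "DOE, JANE", "column8": "IND", "column16": 50.0},
    {"column2": "C1", "column24": "DOE, JANE", "column8": "IND", "column16": -30.0},
    {"column2": "C2", "column24": "ROE, RICHARD", "column8": "IND", "column16": 20.0},
]


def summarize(rows):
    """Group by (column2, column24, column8) and aggregate column16."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["column2"], row["column24"], row["column8"])].append(row["column16"])
    summary = {}
    for key, amounts in groups.items():
        summary[key] = {
            "sum_column16": sum(amounts),
            "average_column16": sum(amounts) / len(amounts),
            # countif(column16 > 0): only positive amounts are counted.
            "countif": sum(1 for amount in amounts if amount > 0),
        }
    return summary


summary = summarize(ROWS)
for key, metrics in summary.items():
    print(key, metrics)
```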

### Rename columns

You can make the data easier to interpret by renaming the columns.

- Add the renaming and rounding steps to the recipe one at a time. Click New Step, then enter:

rename type: manual mapping: [column24,'Candidate_Name'], [column2,'Candidate_ID'],[column8,'Party_Affiliation'], [sum_column16,'Total_Contribution_Sum'], [average_column16,'Average_Contribution_Sum'], [countif,'Number_of_Contributions']

- Then click Add.

- Add in this last New Step to round the Average Contribution amount:

set col: Average_Contribution_Sum value: round(Average_Contribution_Sum)

- Then click Add.

***

**<center><font size = "6">Dataflow: Qwik Start - Templates</font></center>**
***

#### Ensure that the Dataflow API is successfully enabled


To ensure access to the necessary API, restart the connection to the Dataflow API.

- In the Cloud Console, enter "Dataflow API" in the top search bar. Click on the result for Dataflow API.

- Click Manage.

- Click Disable API.

If asked to confirm, click Disable.

- Click Enable.

When the API has been enabled again, the page will show the option to disable.

### Create a Cloud BigQuery dataset and table using Cloud Shell

Let's first create a BigQuery dataset and table.

Note: This section uses the bq command-line tool. Skip down if you want to run through this lab using the console.

- Run the following command to create a dataset called taxirides:

In [None]:
bq mk taxirides

Your output should look similar to:

Dataset 'myprojectid:taxirides' successfully created

- Now that the dataset exists, run the following command to create the realtime table in it:

In [None]:
bq mk \
--time_partitioning_field timestamp \
--schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\
timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,\
passenger_count:integer -t taxirides.realtime

Your output should look similar to:

Table 'myprojectid:taxirides.realtime' successfully created
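
The `--schema` flag above takes a comma-separated list of `name:type` pairs. As a small illustration of that format, the following sketch splits the lab's schema string into (name, type) pairs. The helper name is hypothetical, and the sketch handles only this flat inline format.

```python
# The schema string passed to `bq mk --schema` in this lab.
SCHEMA = (
    "ride_id:string,point_idx:integer,latitude:float,longitude:float,"
    "timestamp:timestamp,meter_reading:float,meter_increment:float,"
    "ride_status:string,passenger_count:integer"
)


def parse_schema(schema):
    """Split a comma-separated name:type schema string into (name, type) pairs."""
    fields = []
    for item in schema.split(","):
        name, field_type = item.strip().split(":")
        fields.append((name, field_type))
    return fields


for name, field_type in parse_schema(SCHEMA):
    print(f"{name:16} {field_type}")
```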

#### Create a storage bucket

Now that we have our table instantiated, let's create a bucket.

- Run the following commands to do so:

In [None]:
export BUCKET_NAME="<your-unique-name>"

In [None]:
gsutil mb gs://$BUCKET_NAME/

### Create a Cloud BigQuery dataset and table using the Cloud Console

Note: Don't go through this section if you've done the command-line setup!

- From the left-hand menu, in the Big Data section, click on BigQuery.

- Then click Done.

- Click on the three dots next to your project name under the Explorer section, then click Create dataset.

- Input taxirides as your dataset ID:

- Select us (multiple regions in United States) in Data location.

- Leave all of the other default settings in place and click CREATE DATASET.

- You should now see the taxirides dataset underneath your project ID in the left-hand console.

- Click on the three dots next to taxirides dataset and select Open.

- Then select CREATE TABLE in the right-hand side of the console.

- In the Destination > Table Name input, enter realtime.

- Under Schema, toggle the Edit as text slider and enter the following:

ride_id:string,point_idx:integer,latitude:float,longitude:float,timestamp:timestamp,
meter_reading:float,meter_increment:float,ride_status:string,passenger_count:integer

- Now, click Create table.

### Create a storage bucket

- Go back to the Cloud Console and navigate to Cloud Storage > Browser > Create bucket.

- Give your bucket a unique name.

- Leave all other default settings, then click Create.

### Run the pipeline

- From the Navigation menu, find the Analytics section and click on Dataflow.

- Click on + Create job from template at the top of the screen.

- Enter iotflow as the Job name for your Cloud Dataflow job and select us-east1 for Regional Endpoint.

- Under Dataflow Template, select the Pub/Sub Topic to BigQuery template.

- Under Input Pub/Sub topic, click Enter Topic Manually and enter:

projects/pubsub-public-data/topics/taxirides-realtime

- Under BigQuery output table, enter the name of the table that was created:

<myprojectid>:taxirides.realtime

- Add your bucket as Temporary Location:

gs://Your_Bucket_Name/temp
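
Conceptually, the template reads JSON-formatted messages from the topic and writes each one as a row in the output table. The sketch below illustrates that mapping for the `taxirides.realtime` schema. It is not the template's actual code, the helper name is hypothetical, and the sample payload is invented for illustration.

```python
import json

# Column names from the taxirides.realtime schema used in this lab.
FIELDS = [
    "ride_id", "point_idx", "latitude", "longitude", "timestamp",
    "meter_reading", "meter_increment", "ride_status", "passenger_count",
]

# Invented sample payload for illustration; not a real message from the topic.
SAMPLE_MESSAGE = json.dumps({
    "ride_id": "ride-123", "point_idx": 7, "latitude": 40.75, "longitude": -73.99,
    "timestamp": "2016-01-01T00:00:00Z", "meter_reading": 12.5,
    "meter_increment": 0.05, "ride_status": "enroute", "passenger_count": 2,
}).encode("utf-8")


def message_to_row(payload):
    """Decode a JSON message payload and keep only the columns in the table schema."""
    record = json.loads(payload.decode("utf-8"))
    return {name: record.get(name) for name in FIELDS}


row = message_to_row(SAMPLE_MESSAGE)
print(row["ride_id"], row["passenger_count"])
```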

### Submit a query

You can submit queries using standard SQL.

- In the BigQuery Editor field add the following, replacing myprojectid with the Project ID from the Qwiklabs page:

In [None]:
SELECT * FROM `myprojectid.taxirides.realtime` LIMIT 1000

- Now click RUN.

If you run into any issues or errors, run the query again (the pipeline takes a minute to start up).

- When the query runs successfully, you'll see the output in the Query Results panel.

Great work! You just pulled 1000 taxi rides from a Pub/Sub topic and pushed them to a BigQuery table. As you saw firsthand, templates are a practical, easy-to-use way to run Dataflow jobs. For other Google-provided templates, see the Get started with Google-provided templates guide in the Dataflow documentation.

***

**<center><font size = "6">Dataflow: Qwik Start - Python</font></center>**
***

### Create a Cloud Storage bucket

- In the Cloud Console, click on Navigation menu and then click on Cloud Storage.

- Click Create bucket.

- In the Create bucket dialog, specify the following attributes:

- Name: A unique bucket name. Do not include sensitive information in the bucket name, as the bucket namespace is global and publicly visible.

- Location type: Multi-region

- Location: us (the location where bucket data will be stored)

- Click Create.


### Install pip and the Cloud Dataflow SDK

The latest Cloud Dataflow SDK for Python requires a Python version >= 3.7.

- To ensure you are running the process with the correct version, run the Python 3.9 Docker image:

In [None]:
docker run -it -e DEVSHELL_PROJECT_ID=$DEVSHELL_PROJECT_ID python:3.9 /bin/bash

This command pulls the Docker image for the latest stable release of Python 3.9, starts a container from it, and opens a command shell so that you can run the following commands inside the container.

- After the container is running, install Apache Beam for Python (this lab pins version 2.42.0rc2) by running the following command from a virtual environment:

In [None]:
pip install 'apache-beam[gcp]'==2.42.0rc2

You will see some warnings returned that are related to dependencies. It is safe to ignore them for this lab.

- Run the wordcount.py example locally by running the following command:

In [None]:
python -m apache_beam.examples.wordcount --output OUTPUT_FILE

Note: You installed apache-beam[gcp] and run wordcount as an apache_beam module. The reason for this is that Cloud Dataflow is a distribution of Apache Beam.

You may see a message similar to the following:

INFO:root:Missing pipeline option (runner). Executing pipeline using the default runner: DirectRunner.
INFO:oauth2client.client:Attempting refresh to obtain initial access_token


This message can be ignored.

- You can now list the files that are on your local cloud environment to get the name of the OUTPUT_FILE:

In [None]:
ls

- Copy the name of the OUTPUT_FILE and use cat to display its contents:

In [None]:
cat <file name>
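
For intuition about what the output file contains, here is a plain-Python sketch of a word count. It does not use Apache Beam, and it assumes the example splits each line into runs of word characters and apostrophes and writes one `word: count` line per word. Treat that assumption, and the function names, as illustrative rather than as the example's exact implementation.

```python
import re
from collections import Counter


def count_words(lines):
    """Count every word (runs of word characters or apostrophes) across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[\w']+", line))
    return counts


def format_counts(counts):
    """Format the counts as 'word: count' lines, sorted by word."""
    return [f"{word}: {count}" for word, count in sorted(counts.items())]


if __name__ == "__main__":
    for output_line in format_counts(count_words(["to be or not to be"])):
        print(output_line)
```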

### Run an example pipeline remotely

- Set the BUCKET environment variable to the bucket you created earlier:

In [None]:
BUCKET=gs://<bucket name provided earlier>

- Now you'll run the wordcount.py example remotely:

In [None]:
python -m apache_beam.examples.wordcount --project $DEVSHELL_PROJECT_ID \
  --runner DataflowRunner \
  --staging_location $BUCKET/staging \
  --temp_location $BUCKET/temp \
  --output $BUCKET/results/output \
  --region us-west1

In your output, wait until you see the message:

JOB_MESSAGE_DETAILED: Workers have started successfully.

### Check that your job succeeded

- Open the Navigation menu and click Dataflow from the list of services.

You should see your wordcount job with a status of Running at first.

- Click on the name to watch the process. When all the boxes are checked off, you can continue watching the logs in Cloud Shell.

The process is complete when the status is Succeeded.

- Click Navigation menu > Cloud Storage in the Cloud Console.

- Click on the name of your bucket. In your bucket, you should see the results and staging directories.

- Click on the results folder and you should see the output files that your job created:

- Click on a file to see the word counts it contains.

***

**<center><font size = "6">Dataproc: Qwik Start - Console</font></center>**
***

#### Confirm Cloud Dataproc API is enabled

To create a Dataproc cluster in Google Cloud, the Cloud Dataproc API must be enabled. To confirm the API is enabled:

- Click Navigation menu > APIs & Services > Library:

- Type Cloud Dataproc in the Search for APIs & Services dialog. The console will display the Cloud Dataproc API in the search results.

- Click on Cloud Dataproc API to display the status of the API. If the API is not already enabled, click the Enable button.

### Create a cluster

- In the Cloud Platform Console, select Navigation menu > Dataproc > Clusters, then click Create cluster.

- Click Create for Cluster on Compute Engine.

- Set the following fields for your cluster and accept the default values for all other fields:

Note: In the Configure nodes section ensure both the Master node and Worker nodes are set to the correct Machine Series and Machine Type

Note: The global region is a special multi-region namespace that is capable of deploying instances into all Google Compute zones globally. You can also specify distinct regions, such as us-central1 or europe-west1, to isolate resources (including VM instances and Cloud Storage) and metadata storage locations utilized by Cloud Dataproc within the user-specified region.

- Click Create to create the cluster.

Your new cluster appears in the Clusters list. It may take a few minutes to create. The cluster Status shows Provisioning until the cluster is ready to use, then changes to Running.

### Submit a job

To run a sample Spark job:

- Click Jobs in the left pane to switch to Dataproc's jobs view, then click Submit job.

- Set the following fields to update Job. Accept the default values for all other fields:

- Click Submit.

Note: How the job calculates Pi: The Spark job estimates a value of Pi using the Monte Carlo method. It generates x,y points on a coordinate plane that models a circle enclosed by a unit square. The input argument (1000) determines the number of x,y pairs to generate; the more pairs generated, the greater the accuracy of the estimation. This estimation leverages Cloud Dataproc worker nodes to parallelize the computation. For more information, see Estimating Pi using the Monte Carlo Method and see JavaSparkPi.java on GitHub.

Your job should appear in the Jobs list, which shows your project's jobs with their cluster, type, and current status. The job status displays as Running, and then Succeeded after the job completes.
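
The Monte Carlo method described in the note above can be sketched in a few lines of single-machine Python. This is not the Spark job's code. It draws random points in the square [-1, 1] x [-1, 1] and uses the fraction that lands inside the inscribed circle, which approaches Pi/4, to estimate Pi. The function name and the `seed` parameter are illustrative.

```python
import random


def estimate_pi(num_pairs, seed=None):
    """Estimate Pi from `num_pairs` random (x, y) points in the square [-1, 1]^2."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_pairs):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # A point is inside the circle of radius 1 when x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # The circle covers Pi/4 of the square, so scale the fraction by 4.
    return 4.0 * inside / num_pairs


if __name__ == "__main__":
    print(estimate_pi(100_000, seed=0))
```

Generating more pairs narrows the error, which is why the job benefits from spreading the sampling across worker nodes.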

### View the job output

To see your completed job's output:

- Click the job ID in the Jobs list.

- Check Line wrapping or scroll all the way to the right to see the calculated value of Pi.

### Update a cluster

To change the number of worker instances in your cluster:

- Select Clusters in the left navigation pane to return to the Dataproc Clusters view.

- Click example-cluster in the Clusters list. By default, the page displays an overview of your cluster's CPU usage.

- Click Configuration to display your cluster's current settings.

- Click Edit. The number of worker nodes is now editable.

- Enter 4 in the Worker nodes field.

- Click Save.

Your cluster is now updated. Check out the number of VM instances in the cluster.

- To rerun the job with the updated cluster, click Jobs in the left pane, then click SUBMIT JOB.

- Set the same fields you set in the Submit a job section:

- Click Submit.

***

**<center><font size = "6">Cloud Natural Language API: Qwik Start</font></center>**
***