# Module 5: High throughput and low-latency with Bigtable
1. Overview
2. What is Bigtable?
3. Ingesting into Bigtable
4. Designing for Bigtable
5. Lab 4: Streaming into Bigtable
6. Performance Considerations

## 1. Overview
![](img/5-1-01.png)
![](img/5-1-02.png)
![](img/5-1-03.png)
![](img/5-1-04.png)
![](img/5-1-05.png)

## 2. What is Bigtable?
![](img/5-2-01.png)
![](img/5-2-02.png)
![](img/5-2-03.png)
![](img/5-2-04.png)
![](img/5-2-05.png)
![](img/5-2-06.png)
![](img/5-2-07.png)

## 3. Ingesting into Bigtable
![](img/5-3-01.png)
![](img/5-3-02.png)
![](img/5-3-03.png)
![](img/5-3-04.png)
![](img/5-3-05.png)
![](img/5-3-06.png)
![](img/5-3-07.png)
![](img/5-3-08.png)
![](img/5-3-09.png)

## 4. Designing for Bigtable
![](img/5-4-01.png)
![](img/5-4-02.png)
![](img/5-4-03.png)
![](img/5-4-04.png)
![](img/5-4-05.png)
![](img/5-4-06.png)
![](img/5-4-07.png)
![](img/5-4-08.png)
![](img/5-4-09.png)
![](img/5-4-10.png)

## 5. Lab 4: Streaming into Bigtable
![ ](img/lab4-01.png)
### Overview
At the time of this writing, streaming pipelines are not available in the Dataflow Python SDK, so the streaming labs are written in Java.
### Lab 4: Streaming Data Pipelines into Bigtable
In this lab you will use Dataflow to collect traffic events from simulated traffic sensor data made available through Google Cloud Pub/Sub, and write them into a Bigtable table.

- Launch a Dataflow pipeline to read from Pub/Sub and write into Bigtable
- Open an HBase shell to query the Bigtable data

### Task 1: Preparation
You will be running a sensor simulator from the training VM. In Lab 1 you manually set up the Pub/Sub components. In this lab several of those processes are automated.

__Open the SSH terminal and connect to the training VM__

1. In the Console, on the Navigation menu, click Compute Engine > VM instances.
2. Locate the line with the instance called training_vm.
3. On the far right, under Connect, click SSH to open a terminal window.
4. In this lab you will enter CLI commands on the training_vm.<br>
__Verify initialization is complete__
5. The training_vm is installing software in the background. Verify that setup is complete by checking that the following directory exists. If it does not exist, wait a few minutes and try again.<br>
`ls /training`<br>
Wait until setup is complete before proceeding. You can verify the installation of Maven with `mvn -version` and the JDK with `java -version`.<br>
__Copy files__
6. A repository has been downloaded to the VM. Copy the repository to your home directory.<br>
`cp -r /training/training-data-analyst/ .`<br>
__Set environment variables__
7. On the training_vm SSH terminal enter the following:<br>
`source /training/project_env.sh`<br>
This script sets the **DEVSHELL_PROJECT_ID** and **BUCKET** environment variables<br>
**Prepare HBase quickstart files**
8. In the training_vm SSH terminal, run the script to download and unzip the quickstart files (you will later use these to run the HBase shell).<br>
`cd ~/training-data-analyst/courses/streaming/process/sandiego`<br>
`./install_quickstart.sh`

### Task 2: Simulate traffic sensor data into Pub/Sub
1. In the training_vm SSH terminal, start the sensor simulator. The script reads sample data from a CSV file and publishes it to Pub/Sub.<br>
`/training/sensor_magic.sh`<br>
This command will send 1 hour of data in 1 minute. Let the script continue to run in the current terminal.<br>
**Open a second SSH terminal and connect to the training VM**
2. In the upper right corner of the training_vm SSH terminal, click the gear-shaped button and select New Connection to training-vm from the drop-down menu. A new terminal window will open.
![ ](img/lab4-02.png)
3. The new terminal session will not have the required environment variables. In the new training_vm SSH terminal, enter the following to set them:<br>
`source /training/project_env.sh`
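
The simulator's time compression (1 hour of data in 1 minute, that is, a speed-up factor of 60) can be pictured with a short sketch. This is only an illustration of the idea, not the code behind `sensor_magic.sh`; the function name `replay_delays` and the sample timestamps are hypothetical.

```python
from datetime import datetime


def replay_delays(event_times, speed_factor=60.0):
    """Return the seconds to wait before publishing each event.

    event_times: recorded timestamps in ascending order (hypothetical data).
    speed_factor: how much faster than real time to replay; 60 means
    one hour of recorded data is sent in one minute.
    """
    delays = []
    previous = event_times[0]
    for current in event_times:
        gap_seconds = (current - previous).total_seconds()
        delays.append(gap_seconds / speed_factor)
        previous = current
    return delays


# Three readings recorded five minutes apart.
times = [
    datetime(2018, 6, 8, 0, 0),
    datetime(2018, 6, 8, 0, 5),
    datetime(2018, 6, 8, 0, 10),
]
print(replay_delays(times))  # [0.0, 5.0, 5.0]
```

A five-minute gap in the recorded data becomes a five-second pause during replay, which is why the publisher appears to emit events in bursts.
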
### Task 3: Launch Dataflow Pipeline
1. In the second training_vm SSH terminal, navigate to the directory for this lab. Examine the script in Cloud Shell or using nano. Do not make any changes to the code.<br>
`cd ~/training-data-analyst/courses/streaming/process/sandiego`<br>
`nano run_oncloud.sh`<br>
What does the script do?
2. The script takes three required arguments (project ID, bucket name, and class name) and an optional fourth argument for options. In this part of the lab, you will use the `--bigtable` option, which directs the pipeline to write into Cloud Bigtable.
3. Run the following script to create the Bigtable instance.<br>
`cd ~/training-data-analyst/courses/streaming/process/sandiego`<br>
`./create_cbt.sh`
4. Run the Dataflow pipeline to read from Pub/Sub and write into Cloud Bigtable.<br>
`cd ~/training-data-analyst/courses/streaming/process/sandiego`<br>
`./run_oncloud.sh $DEVSHELL_PROJECT_ID $BUCKET CurrentConditions --bigtable`<br>
Example successful run:<br>
`[INFO] ------------------------------------------------------------------------`<br>
`[INFO] BUILD SUCCESS`<br>
`[INFO] ------------------------------------------------------------------------`<br>
`[INFO] Total time: 47.582 s`<br>
`[INFO] Finished at: 2018-06-08T21:25:32+00:00`<br>
`[INFO] Final Memory: 58M/213M`<br>
`[INFO] ------------------------------------------------------------------------`
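
To picture what the pipeline writes, the sketch below turns one sensor reading into a Bigtable-style row: a single string row key plus cells addressed by column family and qualifier. The key layout `highway#direction#lane#timestamp` is an assumption inferred from the scan bounds used later in this lab (such as `15#S#1`), only the `lane:speed` column is named in the lab, and the function and parameter names are hypothetical. The sketch does not call the Bigtable client library.

```python
def to_bigtable_row(highway, direction, lane, timestamp_ms, speed):
    """Map one traffic reading to (row_key, cells).

    Assumed key layout: highway#direction#lane#timestamp. Fields that
    are scanned together come first so related rows sort next to
    each other.
    """
    row_key = f"{highway}#{direction}#{lane}#{timestamp_ms}"
    # Each cell is addressed as family:qualifier; values are stored as bytes.
    cells = {
        "lane:speed": str(speed).encode("utf-8"),
    }
    return row_key, cells


key, cells = to_bigtable_row("15", "S", 1, 1528493132000, 71.2)
print(key)    # 15#S#1#1528493132000
print(cells)  # {'lane:speed': b'71.2'}
```

Because everything needed to locate a reading is packed into the key, a query for one highway and direction becomes a scan over a contiguous key range rather than a filter over the whole table.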

### Task 4: Explore the pipeline
1. Return to the browser tab for the Console. On the Navigation menu, click Dataflow and click the new pipeline job. Confirm that the pipeline job is listed and verify that it is running without errors.
2. Find the write:cbt step in the pipeline graph, and click on the down arrow on the right to see the writer in action. Review the Bigtable options in the step summary.

### Task 5: Query Bigtable data
1. In the second training_vm SSH terminal, run the `quickstart.sh` script to launch the HBase shell.<br>
`cd ~/training-data-analyst/courses/streaming/process/sandiego/quickstart`<br>
`./quickstart.sh`
2. If the script runs successfully, you will see an HBase shell prompt that looks something like this:
`hbase(main):001:0>`
3. At the HBase shell prompt, type the following query to retrieve 2 rows from your Bigtable table that was populated by the pipeline.
`scan 'current_conditions', {'LIMIT' => 2}`
4. Review the output. Notice each row is broken into column, timestamp, value combinations.
5. Run another query. This time, look only at the `lane:speed` column, limit the result to 10 rows, and specify row key patterns for the start and end rows to scan over.
`scan 'current_conditions', {'LIMIT' => 10, STARTROW => '15#S#1', ENDROW => '15#S#999', COLUMN => 'lane:speed'}`
6. Review the output. Notice that you see 10 of the column, timestamp, value combinations, all of which correspond to Highway 15. Also notice that the column is restricted to `lane:speed`.
7. Feel free to run other queries if you are familiar with the syntax. Once you're satisfied, enter the following to exit the shell.
`quit`
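
The range scan in step 5 relies on rows being stored in sorted row key order: `STARTROW` is inclusive, `ENDROW` is exclusive, and keys are compared as strings of bytes rather than as numbers. The following is a small in-memory sketch of that behaviour, not the HBase or Bigtable client API; the sample keys and the `scan` helper are hypothetical.

```python
from bisect import bisect_left


def scan(sorted_keys, start_row, end_row, limit):
    """Return up to `limit` keys with start_row <= key < end_row.

    sorted_keys must already be in lexicographic order, which is how
    rows are stored.
    """
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, end_row)
    return sorted_keys[lo:hi][:limit]


# Hypothetical row keys in the form highway#direction#lane#timestamp.
keys = sorted([
    "15#N#1#1000",
    "15#S#1#1000",
    "15#S#1#2000",
    "15#S#2#1000",
    "163#S#1#1000",
    "8#E#1#1000",
])

print(scan(keys, "15#S#1", "15#S#999", limit=10))
# ['15#S#1#1000', '15#S#1#2000', '15#S#2#1000']
```

Note that `163#S#1#1000` is excluded even though it also starts with `1`: the `#` separator sorts before the digit `3`, so all Highway 15 keys stay together in one contiguous range.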

### Cleanup
1. Run the script to delete your Bigtable instance.<br>
`cd ~/training-data-analyst/courses/streaming/process/sandiego`<br>
`./delete_cbt.sh`
2. On the Dataflow page in the Cloud Console, click the pipeline job name, then click Stop job in the right panel.
3. Go back to the first SSH terminal, where the publisher is running, and press Ctrl-C to stop it.
4. Go to the BigQuery console and delete the dataset `demos`.

### Completion
**Cleanup**<br>
In the Cloud Platform Console, sign out of the Google account.

Close the browser tab.

End your lab

## Module 5: Quiz
### Question 1
Which of the following are true about Cloud Bigtable?
(Mark all 3 correct responses)
- [x] **Offers very low latency, on the order of milliseconds**
- [x] **Ideal for >1TB data**
- [x] **Great for time-series data**
- [ ] Support for SQL

### Question 2
True or False?
Cloud Bigtable learns access patterns and attempts to distribute reads and storage across nodes evenly

- [x] **True**
- [ ] False

### Question 3
Which of the following can help improve performance of Bigtable?
(Select all 3 correct responses)

- [x] **Change schema to minimize data skew**
- [x] **Clients and Bigtable are in same zone**
- [ ] Use HDD instead of SSD
- [x] **Add more nodes**
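
The "minimize data skew" answer can be illustrated with a toy model. Bigtable keeps rows in key order and divides the key space into contiguous ranges, so writes whose keys sort next to each other land in the same range. The sketch below counts how one batch of writes spreads over three ranges for two key designs. The split points, key layouts, and helper name are all hypothetical, and the real service rebalances ranges dynamically, so treat this as an intuition aid only.

```python
from bisect import bisect_right
from collections import Counter


def writes_per_range(row_keys, split_points):
    """Count how many keys fall into each contiguous key range.

    Range i holds keys k with split_points[i-1] <= k < split_points[i].
    """
    counts = Counter(bisect_right(split_points, key) for key in row_keys)
    return [counts.get(i, 0) for i in range(len(split_points) + 1)]


highways = ["15", "163", "5", "8", "805"]
timestamps = [1000 + i for i in range(4)]  # four consecutive readings

# Design A: timestamp first, so every new write sorts at the end of the table.
time_first = [f"{ts}#{hw}" for ts in timestamps for hw in highways]
# Design B: highway first, so new writes are spread across the key space.
highway_first = [f"{hw}#{ts}" for hw in highways for ts in timestamps]

# Split points are chosen to divide each design's existing key space in three.
print(writes_per_range(time_first, ["0500", "0900"]))  # [0, 0, 20]
print(writes_per_range(highway_first, ["2", "6"]))     # [8, 4, 8]
```

With timestamp-first keys, all 20 writes hit the last range while the other two sit idle; with highway-first keys, the same writes are shared among all three.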