**Updated:** 02/22/2021 (Allie Hajian)

## Purpose

This tutorial is an introduction to the storage options available in a Terra Workspace and how to move data between them using a notebook or the terminal in your Cloud Environment.

In this tutorial you will:

- Copy data from a public bucket to your notebook's Persistent Disk storage   
- Move data from your PD to the Workspace bucket     
- Use gsutil commands in a notebook and the terminal command line interface (CLI)   
- Verify the persistence of the Cloud Environment Persistent Disk

## Cloud Environment requirements

The requirements of this notebook are minimal and almost any Cloud Environment configuration should work correctly.    

Note that the notebook was tested using the following Cloud Environment configuration:
* Application configuration: Current default
* Computer power: Standard VM

## How to use this notebook

You should run the code cells in order, by selecting the "Run" button at the top or using the "shift + enter" keyboard shortcut. Because this is a tutorial, each code cell is accompanied by documentation that gives instructions and overall background. Some commands are optional.

## Overview: Notebooks and data in Terra

### You are here! (platform architecture)    

**Overall architecture**     
You are running a Jupyter Notebook on a virtual computer, or machine, in the cloud. This virtual machine (VM) is part of the Terra Cloud Environment (blue rectangle) which lives in your workspace (green rectangle).    
![Where is the notebook in the Terra architecture?](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Notebooks-Quickstart_Terra-architecture_Diagram_You-are-here_scaled.png)   

**Cloud Environment components**     
When you selected a Cloud Environment configuration, Terra created the VM and the persistent disk from a boot disk (Docker image) that stores the software necessary to launch and run the notebook.      

**The detachable Persistent Disk**     
By default, your notebook data is stored in a detachable persistent disk which is part of the VM — think of it as a removable flash drive. As you’re working in the notebook, the Persistent Disk stores the data you generate as well as any additional packages or libraries you install. 

**Workspace bucket**       
The notebook itself (.ipynb file) is automatically copied to the Workspace bucket.

### Where's my Workspace data stored? 
Data used in an analysis on Terra is in the cloud, including [open-access data stored in GCP buckets](https://cloud.google.com/life-sciences/docs/resources/public-datasets) and open-access or controlled data stored in a data repository such as [Gen 3](https://gen3.org/).     

Your Terra workspace stores data in two places:         
- Your Cloud Environment Persistent Disk    
- Your Workspace bucket   

**Notebook storage (Persistent Disk)**    
Data generated by a notebook analysis, as well as any additional libraries or packages you install, is stored by default in the Persistent Disk associated with your Cloud Environment. There is currently one Cloud Environment per user per Terra Billing project (pale blue rectangle). Since your PD is unique to you, you will need to copy data from it to the Workspace bucket to share that data with colleagues (even if you are collaborating in a shared workspace).   

**Workspace bucket**    
Each workspace has a dedicated Workspace bucket for permanent storage. It is integrated with workflows. **You will move notebook data from the PD to the Workspace bucket if you want to 1) share with colleagues, 2) use as input for a workflow, or 3) archive the data to less expensive storage.**         

**To learn more about the Terra platform architecture, see [this article](https://support.terra.bio/hc/en-us/articles/360058163311-).** 

### Example data used in this tutorial
This Notebook tutorial uses two small text files stored in a public bucket: 

1. gs://terra-featured-workspaces/QuickStart/Tutorial-data.txt
2. gs://terra-featured-workspaces/QuickStart/Tutorial-data-2.txt

## Set environment variables

The preset environment variable `WORKSPACE_BUCKET` enables you to save to **your unique** workspace bucket without changing the code.

In [1]:
# Set workspace bucket variable
import os
WORKSPACE_BUCKET=os.environ['WORKSPACE_BUCKET']

## Verify bucket
WORKSPACE_BUCKET

'gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02'
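If you run the same code outside a Terra Cloud Environment, the `WORKSPACE_BUCKET` variable may not be set, and `os.environ['WORKSPACE_BUCKET']` would raise a `KeyError`. The following is a minimal sketch of a more defensive lookup. The helper name `get_workspace_bucket` and the example bucket name are hypothetical and are not part of Terra.

```python
import os


def get_workspace_bucket(default=None):
    """Return the workspace bucket URL, falling back to `default` if unset.

    `get_workspace_bucket` is a hypothetical helper for illustration only.
    """
    bucket = os.environ.get("WORKSPACE_BUCKET", default)
    if bucket is None:
        raise RuntimeError(
            "WORKSPACE_BUCKET is not set; are you running in a Terra Cloud Environment?"
        )
    if not bucket.startswith("gs://"):
        raise ValueError(f"Expected a gs:// URL, got {bucket!r}")
    return bucket
```

Inside Terra, `get_workspace_bucket()` should return the same value as the cell above. Elsewhere you could pass a placeholder such as `default="gs://example-bucket"`.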

## Using gsutil in a notebook
gsutil is a Python application that lets you access Cloud Storage from the command line in a terminal. You can run the terminal on your local machine or use the one built into the workspace Cloud Environment. You can use gsutil to do a wide range of tasks associated with data files, including:

- Uploading, downloading, and deleting data files.
- Listing files in a Workspace bucket.
- Moving, copying, and renaming files.   

**Bash commands in a notebook**      
In this tutorial notebook, you will run gsutil commands in a command cell by prepending an exclamation point/bang (i.e. `!`) to the beginning of the command.

**Places where you would customize (to your own files and file locations) are noted in a <font color="green"># comment</font> in the code cell.**
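The `!` prefix is a notebook convenience. In a plain Python script, a rough equivalent is the standard-library `subprocess` module. The sketch below assumes nothing about gsutil: it runs the Python interpreter itself as a stand-in command so that it works anywhere. The helper name `run_command` is hypothetical.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of strings and return its standard output."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in for a shell command: ask the Python interpreter to print one line.
output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(output.strip())
```

In a Cloud Environment where gsutil is installed, the same helper could be called with a list such as `["gsutil", "ls", WORKSPACE_BUCKET]`.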

### List current directory/file information    

In [2]:
# List the file directory
! pwd

/home/jupyter/notebooks/broad-cpa-pipeline/edit


![The Persistent Disk holds notebook data, packages and libraries](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Notebooks-Quickstart_Terra-architecture_Your-notebook-data-is-here_scaled.png) 

In [3]:
# List files in the directory. Note that all notebooks and data generated in a notebook are here by default.
! ls -l 

total 296
-rw-rw-r-- 1 welder-user users  31886 Oct 12 14:46  1_R_environment_setup.ipynb
-rw-rw-r-- 1 welder-user users  35173 Oct 12 14:50  2_BigQuery_cohort_analysis.ipynb
-rw-rw-r-- 1 welder-user users 108897 Oct 12 15:24  3_Access_and_plot_public_BigQuery_data.ipynb
-rw-rw-r-- 1 welder-user users  23068 Oct 12 15:25  4_Working_with_data_in_your_cloud_environment.ipynb
-rw-rw-r-- 1 welder-user users  69333 Oct 11 20:01 'Create Reference and Test Data.ipynb'
-rw-rw-r-- 1 welder-user users  27635 Oct  6 12:41 'Workflow Cost Estimator.ipynb'


### Copy/download a single file to the Cloud Environment VM
![Copy data from a public bucket to the current directory of the cloud environment VM](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Notebooks-Quickstart_Copy-data-from-external-bucket_scaled.png)      

In [4]:
# Use gsutil to copy the Tutorial-data.txt file to this directory in the PD

! gsutil cp gs://terra-featured-workspaces/QuickStart/Tutorial-data.txt .
    
# To copy a different file, replace gs://terra-featured-workspaces/QuickStart/Tutorial-data.txt 
# with your own complete file path 

Copying gs://terra-featured-workspaces/QuickStart/Tutorial-data.txt...
/ [1 files][   63.0 B/   63.0 B]                                                
Operation completed over 1 objects/63.0 B.                                       
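When you substitute your own file path, it can help to confirm that it is a complete `gs://` path with both a bucket name and an object name. The following sketch uses only the standard library; the helper name `split_gs_url` is hypothetical.

```python
from urllib.parse import urlparse


def split_gs_url(url):
    """Split a gs:// URL into (bucket name, object name)."""
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a complete gs:// path: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, object_name = split_gs_url(
    "gs://terra-featured-workspaces/QuickStart/Tutorial-data.txt"
)
print(bucket_name)   # bucket portion of the path
print(object_name)   # everything after the bucket
```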


In [5]:
# List files in the directory. Note that you should see the Tutorial-data.txt file!
! ls -l   

total 300
-rw-rw-r-- 1 welder-user users  31886 Oct 12 14:46  1_R_environment_setup.ipynb
-rw-rw-r-- 1 welder-user users  35173 Oct 12 14:50  2_BigQuery_cohort_analysis.ipynb
-rw-rw-r-- 1 welder-user users 108897 Oct 12 15:24  3_Access_and_plot_public_BigQuery_data.ipynb
-rw-rw-r-- 1 welder-user users  23068 Oct 12 15:25  4_Working_with_data_in_your_cloud_environment.ipynb
-rw-rw-r-- 1 welder-user users  69333 Oct 11 20:01 'Create Reference and Test Data.ipynb'
-rw-rw-r-- 1 jupyter     users     63 Oct 12 15:25  Tutorial-data.txt
-rw-rw-r-- 1 welder-user users  27635 Oct  6 12:41 'Workflow Cost Estimator.ipynb'


In [6]:
# Sanity check - print the first lines of the file (head shows up to 10 lines)
! head Tutorial-data.txt

Congratulations! You've got the data on your Persistent Disk!!
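The same sanity check can be done in Python rather than with `head`. This is a minimal sketch: the helper name `first_lines` is hypothetical, and the demo writes its own temporary file rather than assuming `Tutorial-data.txt` is present. Reading with `errors="replace"` is a cautious choice, so that unexpected bytes do not stop the preview.

```python
import os
import tempfile
from itertools import islice


def first_lines(path, n=10):
    """Return up to the first `n` lines of a text file, without newlines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Demo on a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("line 1\nline 2\nline 3\n")
    preview = first_lines(demo_path, n=2)

print(preview)
```

In the notebook, `first_lines("Tutorial-data.txt")` would preview the file copied above.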

### Copy a single file from PD to the Workspace bucket
You can copy to the Workspace bucket, or any other Google bucket to which you have write access. This example uses the Workspace bucket.     

![Workspace bucket](https://storage.googleapis.com/terra-featured-workspaces/QuickStart/Notebooks-Quickstart_Terra-architecture_You-move-data-here_scaled.png)

**Why copy from notebook PD to the Workspace bucket?**    
Copying a file to a Google bucket is the most permanent storage option. It also lets you share data generated in a notebook analysis with colleagues or use it as input for a workflow. This is because your Cloud Environment is unique to you and the PD cannot be accessed by others, even if they share the same workspace!  

**Note that the bucket path is set to be the workspace variable WORKSPACE_BUCKET, defined above**. Thus you do not have to change anything to use with your own copy of the workspace!
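To see how a destination path is assembled from the bucket variable, here is a small sketch that joins a bucket URL with a directory and file name. The helper name `bucket_destination` and the bucket `gs://fc-example` are hypothetical placeholders.

```python
def bucket_destination(bucket, *parts):
    """Join a bucket URL and path parts into one gs:// object path."""
    # Strip stray slashes so the pieces join with exactly one "/" between them.
    cleaned = [bucket.rstrip("/")]
    cleaned += [part.strip("/") for part in parts if part.strip("/")]
    return "/".join(cleaned)


destination = bucket_destination("gs://fc-example/", "notebook-data", "Tutorial-data.txt")
print(destination)
```

In the notebook you would pass `WORKSPACE_BUCKET` as the first argument, so the code keeps working in your own copy of the workspace.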

In [7]:
# Verify workspace bucket
WORKSPACE_BUCKET

'gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02'

In [8]:
# List files/directories in the Workspace bucket
! gsutil ls -l {WORKSPACE_BUCKET}

   3304193  2021-09-21T22:55:39Z  gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/genelist.gff
  12400379  2021-09-15T10:10:42Z  gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/sacCer3.fa
    238694  2021-10-11T14:51:57Z  gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/test-1.fastq.gz
    205017  2021-09-15T10:11:05Z  gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/xwt-1.fastq.gz
                                 gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/0d32d7c9-9512-4db0-b6c9-6654f840de15/
                                 gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/12101555-5060-41ca-a3cf-665434207dc0/
                                 gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/12e100aa-86a0-4739-abbd-3bfd589e31cb/
                                 gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/15cf0b45-ea46-4123-8d08-3f6b9590c473/
                                 gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/1a3f5c55-561b-4dac-bab4-4a237574f3bd/
                                 gs://fc

In [9]:
# Use gsutil to copy Tutorial-data.txt from the PD to the Workspace bucket
# Note that the command below creates a directory (notebook-data) to store the file
! gsutil cp Tutorial-data.txt {WORKSPACE_BUCKET}/notebook-data/

# To copy a different file, replace Tutorial-data.txt with your own file name. 

Copying file://Tutorial-data.txt [Content-Type=text/plain]...
/ [1 files][   63.0 B/   63.0 B]                                                
Operation completed over 1 objects/63.0 B.                                       


(Optional) List the file in the workspace bucket

In [10]:
# List file in the Workspace bucket
! gsutil ls -l {WORKSPACE_BUCKET}/notebook-data/

# To list a specific file, or a different directory, change the command above 

        63  2021-10-12T15:26:04Z  gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02/notebook-data/Tutorial-data.txt
TOTAL: 1 objects, 63 bytes (63 B)


### See your Workspace bucket in GCP console (optional - click arrow at left to expand)

The code below will generate a link for the workspace bucket on GCP console that you can click to check for your file...

In [11]:
# Grab the ID (a workspace variable) for this workspace bucket
bucket = os.environ['WORKSPACE_BUCKET']

# Create full path to workspace bucket in GCP console
workspace_id = bucket[5:]
bucket_in_console = "https://console.cloud.google.com/storage/browser/{}".format(workspace_id)

print(bucket_in_console)

https://console.cloud.google.com/storage/browser/fc-353b14e5-19da-46ef-b358-f5b85cffef02
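The cell above removes the `gs://` prefix with `bucket[5:]`, which silently produces a wrong link if the value does not start with `gs://`. A slightly more defensive variant is sketched below; the helper name `console_url` is hypothetical, and the base URL is the one used in the cell above.

```python
GS_PREFIX = "gs://"
CONSOLE_BASE = "https://console.cloud.google.com/storage/browser/"


def console_url(bucket):
    """Build the GCP console link for a bucket given as a gs:// URL."""
    if not bucket.startswith(GS_PREFIX):
        raise ValueError(f"Expected a gs:// URL, got {bucket!r}")
    # Drop the prefix by its length rather than a hard-coded index.
    return CONSOLE_BASE + bucket[len(GS_PREFIX):].rstrip("/")


print(console_url("gs://fc-353b14e5-19da-46ef-b358-f5b85cffef02"))
```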


### Clean-up/remove the example file from the PD (optional - click arrow at left for command)

In [12]:
# Remove this file from the PD (once you've copied it to the Workspace bucket!)
! rm Tutorial-data.txt

In [13]:
# You can rerun the command cell from above to verify that you've removed the file
# Note that you should not see the Tutorial-data.txt file!
! ls -l   

# To clean up a different file, replace "Tutorial-data.txt" with your own file name 

total 304
-rw-rw-r-- 1 welder-user users  31886 Oct 12 14:46  1_R_environment_setup.ipynb
-rw-rw-r-- 1 welder-user users  35173 Oct 12 14:50  2_BigQuery_cohort_analysis.ipynb
-rw-rw-r-- 1 welder-user users 108897 Oct 12 15:24  3_Access_and_plot_public_BigQuery_data.ipynb
-rw-rw-r-- 1 welder-user users  31171 Oct 12 15:26  4_Working_with_data_in_your_cloud_environment.ipynb
-rw-rw-r-- 1 welder-user users  69333 Oct 11 20:01 'Create Reference and Test Data.ipynb'
-rw-rw-r-- 1 welder-user users  27635 Oct  6 12:41 'Workflow Cost Estimator.ipynb'


### Clean-up files from the Workspace bucket (optional - click arrow at left to expand)

Clean-up/remove the example file from the workspace bucket:

In [None]:
# To remove a different file, replace Tutorial-data.txt with your own file name 
! gsutil rm {WORKSPACE_BUCKET}/notebook-data/Tutorial-data.txt

### Copy a list of files to the Cloud Environment VM

First, create a directory for the example files:

In [None]:
! mkdir tutorial_example_files

Verify that you created the directory

In [None]:
! ls -l

Copy the *list* of files to the directory using the `*` wildcard

In [None]:
! gsutil cp gs://terra-featured-workspaces/QuickStart/Tutorial-data* tutorial_example_files/.
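To get a feel for which file names a pattern such as `Tutorial-data*` selects, you can try it locally with Python's `fnmatch` module. This is only an approximation for illustration: gsutil applies its own wildcard rules, which may differ in details. The name `Other-file.txt` is a made-up example.

```python
from fnmatch import fnmatchcase

# Two names from this tutorial plus one hypothetical non-matching name.
names = ["Tutorial-data.txt", "Tutorial-data-2.txt", "Other-file.txt"]

# Keep only the names that match the wildcard pattern (case-sensitive).
matches = [name for name in names if fnmatchcase(name, "Tutorial-data*")]
print(matches)
```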

List the files in the directory on the Cloud Environment VM:

In [None]:
! ls -lr tutorial_example_files

In [None]:
# Sanity check - print the first lines of the files (head shows up to 10 lines)
! head tutorial_example_files/Tutorial-data.txt

In [None]:
! head tutorial_example_files/Tutorial-data-2.txt

Clean-up/remove the example directory and files:

In [None]:
# Uncomment to remove files
#! rm -r tutorial_example_files/*

### Congratulations! You now know how to move data using gsutil in a notebook! <a class="tocSkip">

## Exercises - Confirm the persistence of your PD
Try the exercises below to verify where in your Cloud Environment VM the files are stored, and that they are protected from deletion if you are using the Persistent Disk (default option).

###  <font color="#FF6600">(click arrow at left to expand) </font> Find the files in the Cloud Environment  <a class="tocSkip">

Each VM includes a persistent storage location. This disk is mounted to the directory /home/jupyter/notebooks, so a file has to be saved there if you want it to persist. Anything saved outside of this directory is not saved to the persistent disk and will be lost when the Cloud Environment is deleted. 
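A quick way to check whether a given absolute path falls under the persistent directory is sketched below. The default root is an assumption taken from the `pwd` output earlier in this notebook; pass a different `root` if your environment mounts the disk elsewhere. The helper name `is_on_persistent_disk` is hypothetical, and the check is purely textual (it does not resolve `..` or symbolic links).

```python
from pathlib import PurePosixPath

# Assumption: matches the `pwd` output shown earlier; adjust for your environment.
PERSISTENT_ROOT = PurePosixPath("/home/jupyter/notebooks")


def is_on_persistent_disk(path, root=PERSISTENT_ROOT):
    """Return True if `path` is the persistent root or lies somewhere below it."""
    path = PurePosixPath(path)
    root = PurePosixPath(root)
    return path == root or root in path.parents


print(is_on_persistent_disk("/home/jupyter/notebooks/my-workspace/edit/Tutorial-data.txt"))
print(is_on_persistent_disk("/tmp/scratch.txt"))
```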

Let's first figure out what directory we are currently in (i.e. where the files were copied to in the "Copy/download a single file to the Cloud Environment VM" section above)

In [None]:
# List the current directory
! pwd

In [None]:
# List everything inside this directory
! ls

You will see all the notebooks from this workspace, as well as the `Tutorial-data.txt` and `Tutorial-data-2.txt` files you copied into the Cloud Environment in the copy steps above

###  <font color="#FF6600">(click arrow at left to expand) </font> Use the terminal to find files stored in the Cloud Environment  <a class="tocSkip">

You may be used to using a command-line terminal to access and manage files, especially if you work on a local cluster. You can access the terminal in Terra by clicking on the `>_` icon on the left side of the Cloud Environment widget at the top right.

1. Open the terminal (click on the `>_` inside the widget at the top right)
2. Type `pwd` to find what directory you are in, and `ls` to list its contents
3. Change to the workspace directory by typing `cd notebooks/` followed by the name of this workspace
4. Type `ls` to list the contents 
5. Type `cd edit` to go to the directory associated with the notebooks in `Edit` mode
6. Type `ls` to find the file in the Detachable Persistent Disk

### <font color="#FF6600">(click arrow at left to expand) </font> Verify files on persistent disk don't disappear  <a class="tocSkip">

Because this file is in the /home/jupyter/notebooks/ directory, which is in the detachable persistent disk, it should remain even if you delete the current Cloud Environment. **Let's test that!**

**Delete the Cloud Environment but keep the Detachable Persistent Disk**

1. Go to the widget at the top right and click on the gear icon
2. Click "Delete Runtime Options" at the bottom left 
3. Select the top radio button option ("Keep persistent disk, delete application configuration and compute profile")

**Restart the Cloud Environment and verify files are still there**   

4. Once the Cloud Environment is deleted, create a fresh one and **reopen this notebook in the same mode as before**
5. Return to this cell and run the code below to verify that the file is still there

In [None]:
# List the current directory
! pwd

# List everything inside this directory
! ls

Notice that the `Tutorial-data.txt` file and `tutorial_example_files` directory are still there!
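If you prefer an explicit check over reading the `ls` output, the sketch below reports which expected names are missing from a directory. The helper name `missing_paths` is hypothetical, and the demo runs against a temporary directory so that it works anywhere.

```python
import tempfile
from pathlib import Path


def missing_paths(names, base="."):
    """Return the entries of `names` that do not exist under `base`."""
    base = Path(base)
    return [name for name in names if not (base / name).exists()]


# Demo: create only one of the two expected entries in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Tutorial-data.txt").write_text("example\n")
    demo_missing = missing_paths(["Tutorial-data.txt", "tutorial_example_files"], base=tmp)

print(demo_missing)
```

In the notebook, calling `missing_paths(["Tutorial-data.txt", "tutorial_example_files"])` should return an empty list if both survived the deletion of the Cloud Environment.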

## Contact information <a class="tocSkip">

Please share any feedback you have on this Notebook, either things that you liked or things you would like to see improved. Thank you!