<a href="https://colab.research.google.com/github/rtheman/Data_IO/blob/master/from_Google_Cloud_Storage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective

This notebook provides recipes for loading data into a Jupyter notebook (Google Colab), in this case from **Google Cloud Storage (GCS)**.

There are two ways to connect:
1. gsutil
1. Python API

# Initial settings and libraries

In [1]:
project_id = 'rleung-sandbox'
bucket_name = 'samples_data'

file_name = 'COVID_Active_Cases_USA_NY.csv'

local_path = "/content/sample_data/"

Authenticate to GCS...

In [2]:
from google.colab import auth
auth.authenticate_user()

# 1.) Ingest Data

## a.) via `gsutil`

First, we use `gcloud` to set the project specified above, so that `gsutil` runs against it.

In [3]:
!gcloud config set project {project_id}

Updated property [core/project].


Download the file from GCS using the `gsutil cp` command.

In [4]:
!gsutil cp gs://{bucket_name}/{file_name} {local_path}{file_name}
  
# Print the result to make sure the transfer worked.
# !cat /content/sample_data/{file_name}

Copying gs://samples_data/COVID_Active_Cases_USA_NY.csv...
/ [1 files][  1.9 KiB/  1.9 KiB]                                                
Operation completed over 1 objects/1.9 KiB.                                      
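As an alternative to `cat`, the transfer can also be checked from Python. The following is a minimal sketch, not part of the original notebook: `downloaded_size` is a hypothetical helper name, and the directory and file name simply mirror the settings cell above.

```python
import os


def downloaded_size(local_dir, name):
    """Return the size in bytes of local_dir/name, or None if the file is missing."""
    path = os.path.join(local_dir, name)
    return os.path.getsize(path) if os.path.isfile(path) else None


# Values copied from the settings cell; adjust them to your environment.
size = downloaded_size("/content/sample_data/", "COVID_Active_Cases_USA_NY.csv")
print("file is missing" if size is None else f"file is {size} bytes")
```

`os.path.join` also avoids a doubled slash when the directory already ends with `/`.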


## b.) via Python API

These snippets are based on [a larger example](https://github.com/GoogleCloudPlatform/storage-file-transfer-json-python/blob/master/chunked_transfer.py) that shows additional uses of the API.

First, we create the service client.

In [None]:
from googleapiclient.discovery import build
gcs_service = build('storage', 'v1')
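Before downloading, it can help to see which objects the bucket holds. The sketch below assumes the `gcs_service` client built above and relies on the JSON API's `objects().list` method together with its `list_next` pagination helper; `list_object_names` is a hypothetical helper name, not something the original notebook defines.

```python
def list_object_names(service, bucket):
    """Collect the names of all objects in a bucket, following pagination."""
    names = []
    request = service.objects().list(bucket=bucket)
    while request is not None:
        response = request.execute()
        # An empty bucket returns a response without an "items" key.
        names.extend(item["name"] for item in response.get("items", []))
        # list_next returns None once there are no further pages.
        request = service.objects().list_next(request, response)
    return names


# Usage in this notebook (requires the authenticated client from the cell above):
# print(list_object_names(gcs_service, bucket_name))
```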

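Download the file from GCS using `MediaIoBaseDownload` from `googleapiclient.http` (`apiclient` is only a legacy alias for the same package).

In [None]:
from googleapiclient.http import MediaIoBaseDownload

# Write to the same location the gsutil example used.
with open(local_path + file_name, 'wb') as f: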
  request = gcs_service.objects().get_media(bucket=bucket_name,
                                            object=file_name)
  media = MediaIoBaseDownload(f, request)

  done = False
  while not done:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = media.next_chunk()

print('Download complete')

Download complete
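For larger files, the progress object that the loop above discards can be reported instead. This is a sketch under assumptions: `download_with_progress` is a hypothetical helper, and it expects a `MediaIoBaseDownload`-style object whose `next_chunk()` returns a `(status, done)` pair, where `status.progress()` gives the completed fraction between 0 and 1.

```python
def download_with_progress(downloader, report=print):
    """Pull chunks until the download finishes, reporting progress after each one."""
    done = False
    while not done:
        status, done = downloader.next_chunk()
        if status is not None:
            report(f"Downloaded {int(status.progress() * 100)}%")


# Usage in this notebook, replacing the while loop inside the `with open(...)` block:
# download_with_progress(MediaIoBaseDownload(f, request))
```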


# 2.) Load the downloaded data into a DataFrame

In [5]:
import pandas as pd

In [6]:
path = local_path + file_name

df = pd.read_csv(path, header=0, index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 101 entries, 2020-03-12 to 2020-06-20
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Active_Cases  101 non-null    int64
dtypes: int64(1)
memory usage: 1.6+ KB
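The `info()` output above reports a plain `Index`, which means the dates were read as strings. If a date-aware index is wanted, `read_csv` can parse it on load. The sketch below uses a three-row in-memory excerpt so it runs without the downloaded file; the column header `Observ_Date,Active_Cases` is an assumption inferred from the displayed table, and `df_dates` and `daily_change` are illustrative names.

```python
import io

import pandas as pd

# Excerpt of the first rows shown in this notebook; the real file has 101 rows.
sample_csv = io.StringIO(
    "Observ_Date,Active_Cases\n"
    "2020-03-12,328\n"
    "2020-03-13,421\n"
    "2020-03-14,524\n"
)

# parse_dates=True asks pandas to parse the index column as dates.
df_dates = pd.read_csv(sample_csv, header=0, index_col=0, parse_dates=True)

# Day-over-day change in active cases; the first row has no predecessor, so it is NaN.
daily_change = df_dates["Active_Cases"].diff()

print(df_dates.index.dtype)
print(daily_change)
```

With a `DatetimeIndex`, date-based operations such as `df_dates.resample("W").mean()` become available.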


In [7]:
df.dtypes

Active_Cases    int64
dtype: object

In [8]:
df

| Observ_Date | Active_Cases |
|---|---|
| 2020-03-12 | 328 |
| 2020-03-13 | 421 |
| 2020-03-14 | 524 |
| 2020-03-15 | 729 |
| 2020-03-16 | 956 |
| ... | ... |
| 2020-06-16 | 288566 |
| 2020-06-17 | 288954 |
| 2020-06-18 | 290469 |
| 2020-06-19 | 291432 |


# REFERENCE

*   Colab Notebook Examples (Google) https://colab.research.google.com/notebooks/io.ipynb#scrollTo=S7c8WYyQdh5i
*   Google Storage Client Libraries https://cloud.google.com/storage/docs/reference/libraries#command-line