This notebook walks you through accessing files stored in a GCP bucket.

*Note: The best way to access your embeddings or pre-processed data is by uploading it to a Kaggle dataset.*

But if you have an automated data pipeline, it can be easier to pre-process the data on an offline VM or local machine and store it in a GCP bucket. From there you can access your data via the following two methods.

1. Access data via the client storage API 
2. Data is publicly available  - anyone with the link can access the data

+++++++++++++  ***Note***  +++++++++++++

As per the task submission guidelines: To be valid, a submission must be contained in a single notebook made public on or before the submission deadline. Participants are free to use additional datasets in addition to the official Kaggle dataset, but those datasets must also be publicly available on either Kaggle, Allen.ai, or Semantic Scholar in order for the submission to be valid. Participants must also accept the competition rules.
Source: [Tasks description](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=568)



# 1. Access data via client storage API

A Google account can be linked when a new notebook is created. Select *"SHOW ADVANCED SETTINGS"*

Select *"Link a Google Account to access Google services"*

![Selecting gcp services](https://storage.googleapis.com/tpu-aakash/Access%20GCP%20bucket/Access_GCP_account.jpg)


Your Google account will be visible. Select *Attach to Notebook*

![Attach to notebook](https://storage.googleapis.com/tpu-aakash/Access%20GCP%20bucket/Select%20Account.jpg)

Once the account is attached, you will see the list of GCP APIs that are linked. 

BigQuery is selected automatically, which is useful if your data is stored in BigQuery tables. Note that Storage is not authorized automatically.

![Select API](https://storage.googleapis.com/tpu-aakash/Access%20GCP%20bucket/Select_api.jpg)

Click on *"Add authorization"*

You will have to select your Google account and allow Kaggle to access the API service. Once the API is authorized, the new notebook is created with GCP access. The next code block walks you through accessing a CSV file in your GCP bucket.

![Authorized API](https://storage.googleapis.com/tpu-aakash/Access%20GCP%20bucket/GCP_bucket_authorization.jpg)

# Code for accessing a file in a GCP bucket

Please follow the steps above to enable GCP access on your notebook and authorize the relevant APIs.
Note that Internet access must be enabled in your settings panel. *(default setting)*

![Internet access](https://storage.googleapis.com/tpu-aakash/Access%20GCP%20bucket/default_internet.jpg)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# System packages 
import os
import sys


In [None]:
# Set your own project ID here
PROJECT_ID = 'project-x-262017' # Replace this with your own GCP project ID

from google.cloud import bigquery # Import bigquery API client library 
bigquery_client = bigquery.Client(project=PROJECT_ID)

from google.cloud import storage  # Import storage client library 
storage_client = storage.Client(project=PROJECT_ID)

In [None]:
bucket_name = 'tpu-aakash' # select the GCP bucket 
bucket = storage_client.get_bucket(bucket_name)

In [None]:
%%time
blob_name = "DrugVisData_small.csv" # Select filename & print its meta information
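blob = bucket.get_blob(blob_name) # get_blob returns None if the file is not in the bucket
if blob is None:
    raise FileNotFoundError("{} not found in bucket {}".format(blob_name, bucket_name))

print("ID: {}".format(blob.id))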
print("Size: {} bytes".format(blob.size))
print("Content type: {}".format(blob.content_type))
print("Public URL: {}".format(blob.public_url))

In [None]:
output_file_name = "DrugVisData_small.csv" # select local filename to save file
blob.download_to_filename(output_file_name)

print("Downloaded blob {} to {}.".format(blob.name, output_file_name))
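
When the notebook is re-run, the download above repeats even if the file is already on disk. Below is a minimal sketch of a guard. It assumes that comparing the local file size with `blob.size` is a good enough check for your data; it will not detect a changed file whose size stays the same. `needs_download` is a hypothetical helper name, not part of the storage library.

```python
import os


def needs_download(local_path, remote_size):
    """Return True if local_path is missing or its size differs from remote_size.

    remote_size is the object size in bytes, e.g. blob.size.
    Assumption: a local file of equal size counts as "already downloaded".
    """
    if not os.path.exists(local_path):
        return True
    return os.path.getsize(local_path) != remote_size
```

In this notebook it could be used as `if needs_download(output_file_name, blob.size): blob.download_to_filename(output_file_name)`.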

In [None]:
# Read the csv file & print out its header
drugData = pd.read_csv(output_file_name)
print(drugData.shape)
drugData.head()

In [None]:
# This is an additional code block if you want to list the files in the bucket <- Uncomment to run
# blobs = bucket.list_blobs()

# print("Blobs in {}:".format(bucket.name))
# for item in blobs:
#     print("\t" + item.name)
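
If the bucket holds many objects, the listing above can be narrowed. `bucket.list_blobs` accepts a `prefix` argument for filtering by name prefix, while filtering by file extension has to happen in the notebook. Below is a minimal sketch of that second part. `names_with_suffix` is a hypothetical helper that works on any iterable of objects with a `name` attribute.

```python
def names_with_suffix(blobs, suffix=".csv"):
    """Return the sorted names of the blobs whose name ends with suffix."""
    return sorted(item.name for item in blobs if item.name.endswith(suffix))
```

For example, `names_with_suffix(bucket.list_blobs(), ".csv")` would return only the CSV files in the bucket.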

# 2. Data is publicly available


Once the file is uploaded to a GCP bucket, you can generate a publicly visible download link. Click on the three dots at the far right of the file's row, then:

1. Select *Edit permissions* 
2. Select *+ Add Item*
3. Now select *Entity* = *User*
4. Under *Name* add *allUsers*
5. Keep *Access* = *Reader*
6. Save the settings 

![Edit permissions](https://storage.googleapis.com/tpu-aakash/Access%20GCP%20bucket/editPermission.jpg)

A publicly available URL will now be visible. Copy it into the cell below.

In [None]:
# This is the public URL, which is available once you grant access to *allUsers*
url = "https://storage.googleapis.com/tpu-aakash/DrugVisData_small.csv"
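
The public URL follows a fixed pattern: `https://storage.googleapis.com/<bucket>/<object name>`, with special characters in the object name percent-encoded (the screenshot links in this notebook show a space encoded as `%20`). Below is a minimal sketch that builds such a URL. `public_gcs_url` is a hypothetical helper, and the link only works if the object has been made readable by *allUsers* as described above.

```python
from urllib.parse import quote


def public_gcs_url(bucket_name, blob_name):
    """Build the public download URL for an object in a GCP bucket."""
    # quote() percent-encodes spaces and other special characters
    # but keeps "/" so that nested object names stay readable.
    return "https://storage.googleapis.com/{}/{}".format(bucket_name, quote(blob_name))
```

Calling `public_gcs_url("tpu-aakash", "DrugVisData_small.csv")` reproduces the URL used in the cell above.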

In [None]:
%%time
# csv files can be directly read using pandas *read_csv* 
drugData = pd.read_csv(url)
print(drugData.shape)
drugData.head()