# Uploading Data to Google Cloud Storage

The code below is based on this
[code sample](https://cloud.google.com/storage/docs/uploading-objects#prereq-code-samples)
from the Cloud Storage documentation.

Here we import all of the needed Python packages for authenticating access to
Google Cloud Platform and for interacting with Google Cloud Storage and Google
BigQuery.

In [1]:
from glob import glob
from google.cloud import storage, bigquery
# User credentials: the type returned by pydata_google_auth.get_user_credentials
from google.oauth2.credentials import Credentials
import pydata_google_auth

Run the code cell below to authenticate using `pydata_google_auth`. Click
the link and log in with your Bayer email. Copy the token and paste it into
the prompt to finish authenticating.

In [2]:
credentials = pydata_google_auth.get_user_credentials(
    ['https://www.googleapis.com/auth/cloud-platform'],
)

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=262006177488-3425ks60hkk80fssi9vpohv88g6q1iqd.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&state=0j9083d6Y2MZI4UQo6gdeyHRERczYp&prompt=consent&access_type=offline


The function below implements the steps for uploading a file to GCS.

In [3]:
# From: https://cloud.google.com/storage/docs/uploading-objects#prereq-code-samples
def upload_file_to_gcs(
    project: str, bucket_name: str, source_file_name: str,
    destination_blob_name: str, credentials: Credentials,
) -> None:
    """
    Uploads a local file to Google Cloud Storage (GCS).

    Parameters
    ----------
    project : str
        Name of Google Cloud project.
    bucket_name : str
        Name of bucket in GCS.
    source_file_name : str
        Name of local source file being uploaded.
    destination_blob_name : str
        Name of blob (path) where local file will be written in GCS.
    credentials : Credentials
        Credentials for Google Cloud from `pydata_google_auth`.
    """
    storage_client = storage.Client(
        project=project, credentials=credentials
    )
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_filename(source_file_name)

    print(
        f"File {source_file_name} uploaded to {destination_blob_name}."
    )

Next, we get a list of the files we want to upload to Google Cloud Storage,
iterate through them and call the `upload_file_to_gcs` function, supplying
the desired project and bucket names, the destination blob/file name, and
the credentials we obtained above.

In [5]:
json_files = glob("example_data/*.json")
for json_file in json_files:
    upload_file_to_gcs(
        "bcs-genomics-analytics-sbx",  # project
        "bcs-genomics-analytics-sbx-sandbox",  # bucket
        json_file,  # local file
        f"csw_workspace/{json_file}",  # destination file
        credentials
    )

File example_data/example_data2.json uploaded to csw_workspace/example_data/example_data2.json.
File example_data/example_data1.json uploaded to csw_workspace/example_data/example_data1.json.
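
The destination name above is built by pasting the local path into an f-string. Since `glob` returns paths with the operating system's separator, that would put backslashes into blob names on Windows. Below is a minimal sketch of a helper that normalizes the separators, assuming the local paths are relative; `make_blob_name` is a hypothetical name and not part of the original notebook.

```python
from pathlib import PurePosixPath, PureWindowsPath


def make_blob_name(prefix: str, local_path: str) -> str:
    """Join a GCS prefix and a relative local path using forward slashes."""
    # PureWindowsPath splits on both "\\" and "/", so either style of input works.
    # Assumption: local_path is relative (no drive letter or leading slash).
    parts = PureWindowsPath(local_path).parts
    return str(PurePosixPath(prefix, *parts))


print(make_blob_name("csw_workspace", "example_data/example_data1.json"))
```

In the loop, the fourth argument would then become `make_blob_name("csw_workspace", json_file)`.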


# Create Table and Load Data in Google BigQuery

The example code below is based on the BigQuery documentation for
[loading JSON data from Cloud Storage](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json).

The first thing we need to do is define the table schema, which follows the same
layout as the JSON files we uploaded in the previous section. The schema is a Python
list of `bigquery.SchemaField` objects. The `chromosome` and `position` columns use
the `NULLABLE` mode. The `markers` field is nested, so we declare it with type
`RECORD` and mode `REPEATED`, meaning each row can hold several marker records. The
`fields` argument defines the nested fields of `markers`; here there are only two,
`genotype` and `count`.

In [7]:
table_schema = [
    bigquery.SchemaField(
        "chromosome", "INTEGER", "NULLABLE",
        description="Chromosome number."
    ),
    bigquery.SchemaField(
        "position", "FLOAT", "NULLABLE",
        description="0.1 cM bin position."
    ),
    bigquery.SchemaField(
        "markers", "RECORD", "REPEATED",
        description="Observed markers in the given 0.1 cM bin.",
        fields=[
            bigquery.SchemaField(
                "genotype", "STRING", "NULLABLE",
                description="Observed genotype (e.g., AA, AT)."
            ),
            bigquery.SchemaField(
                "count", "INTEGER", "NULLABLE",
                description="Number of times genotype was observed."
            )
        ]
    )
]
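
To make the layout concrete, the sketch below builds one record in newline-delimited JSON, the source format used when the data is loaded later. The values are made up for illustration and are not taken from the example files.

```python
import json

# Hypothetical record matching the schema above; values are illustrative only.
record = {
    "chromosome": 1,
    "position": 12.3,
    "markers": [
        {"genotype": "AA", "count": 7},
        {"genotype": "AT", "count": 2},
    ],
}

# Newline-delimited JSON: one complete JSON object per line, no enclosing array.
line = json.dumps(record)
print(line)
```

Each entry in the `markers` list corresponds to one element of the `REPEATED` `RECORD` field.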

Next we define the project name, create a BigQuery client, and give a name to the
table.

In [9]:
project_name = "bcs-genomics-analytics-sbx"
client = bigquery.Client(credentials=credentials, project=project_name)
table_id = f"{project_name}.gmlfs.csw_workspace_example_data"

After that, we create a table object with the `table_id` and schema we defined. We can also add clustering fields and a table description here. Then, we create the table in BigQuery.

In [12]:
table = bigquery.Table(table_id, schema=table_schema)
table.clustering_fields = ["chromosome"]
table.description = "Toy data set for CSW Workspace."
table = client.create_table(table)
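
Re-running the cell above once the table exists makes `create_table` fail with a conflict error. Here is a minimal sketch of an idempotent variant, assuming the installed `google-cloud-bigquery` version supports the `exists_ok` parameter of `Client.create_table`. `create_table_idempotent` is a hypothetical helper name, and the client is passed in so the snippet does not depend on the notebook's state.

```python
def create_table_idempotent(client, table):
    """Create `table`, or return the existing table if it is already there."""
    # Assumption: `client` is a bigquery.Client whose create_table accepts
    # exists_ok=True, which suppresses the "already exists" conflict error.
    return client.create_table(table, exists_ok=True)
```

In the notebook this would be called as `table = create_table_idempotent(client, table)`. An existing table is returned as-is, so a changed schema or description is not applied to it.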

The last thing to do is load the data that we put in GCS into the table. We
first create a job configuration that specifies the schema and the format of
the input files. Then we specify the URI of our example JSON data (this can
include a wildcard). Next, we call the `load_table_from_uri` method of the client
object we created earlier to start the load job. Finally, we wait for the job to
finish and print the number of rows in the table.

In [13]:
job_config = bigquery.LoadJobConfig(
    schema=table_schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
uri = "gs://bcs-genomics-analytics-sbx-sandbox/csw_workspace/example_data/*.json"
load_job = client.load_table_from_uri(
    uri,
    table_id,
    location="US",  # Must match the destination dataset location.
    job_config=job_config,
)
load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print(f"Loaded {destination_table.num_rows} rows.")

Loaded 10 rows.
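
The URI passed to `load_table_from_uri` has to line up with the bucket and blob names used in the upload step. One way to keep them in sync is to build the URI from the same parts rather than hard-coding it. This is only a sketch; `gcs_wildcard_uri` is a hypothetical helper name.

```python
def gcs_wildcard_uri(bucket_name: str, prefix: str, pattern: str = "*.json") -> str:
    """Build a gs:// URI that matches objects under `prefix` in `bucket_name`."""
    # Strip stray slashes so the parts join with exactly one "/" between them.
    return f"gs://{bucket_name}/{prefix.strip('/')}/{pattern}"


uri = gcs_wildcard_uri(
    "bcs-genomics-analytics-sbx-sandbox",  # bucket used in the upload step
    "csw_workspace/example_data",  # blob prefix used in the upload step
)
print(uri)
```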
