## Project to Upload Files to GCS using Python

In this series of lectures, we will see how to upload files to GCS using Python. We will use the standard library modules `glob` and `os`, along with `storage` from `google.cloud`, to build the application logic.

Here are the design details.
* First, we need to get the list of file names to upload from the local file system.
* We need to build a `blob` object for each file.
* We can call `upload_from_filename` on the blob object to upload the file as a blob in GCS.
* We will use a metadata-driven (data-driven) development approach to take care of uploading all the retail files to GCS.
* Blobs will be named using the file names as reference.

In [1]:
!gsutil rm -r gs://sgretail/pythondemo



Removing gs://sgretail/pythondemo/retail_db/orders/part-00000#1768329056013824...
/ [1 objects]                                                                   
Operation completed over 1 objects.                                              


In [4]:
!gsutil ls gs://sgretail/

gs://sgretail/retail_db/


In [3]:
!gsutil rm -r gs://sgretail/retail_db_parquet 

CommandException: No URLs matched: gs://sgretail/retail_db_parquet


In [5]:
import glob 
import os

In [6]:
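def get_file_names(src_base_dir):
    # '**' with recursive=True matches src_base_dir and everything beneath it
    items = glob.glob(f'{src_base_dir}/**', recursive=True)
    # Keep only the data files; directories and other files are skipped
    return list(filter(lambda item: os.path.isfile(item) and item.endswith('part-00000'), items))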

In [7]:
src_base_dir = '../../data/retail_db'

In [8]:
get_file_names(src_base_dir)

['../../data/retail_db/products/part-00000',
 '../../data/retail_db/orders/part-00000',
 '../../data/retail_db/departments/part-00000',
 '../../data/retail_db/customers/part-00000',
 '../../data/retail_db/categories/part-00000',
 '../../data/retail_db/order_items/part-00000']
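
To see what the recursive pattern and the filter do without depending on the course data, here is a small self-contained sketch. It builds a throwaway directory tree with `tempfile` (the layout and the helper `build_demo_tree` are made up for illustration) and applies the same logic as `get_file_names`.

```python
import glob
import os
import tempfile


def get_file_names(src_base_dir):
    # Same logic as the notebook function, written as a list comprehension
    items = glob.glob(f'{src_base_dir}/**', recursive=True)
    return [item for item in items
            if os.path.isfile(item) and item.endswith('part-00000')]


def build_demo_tree(base_dir):
    # Hypothetical layout that mimics retail_db: two tables and one extra file
    for table in ('orders', 'products'):
        os.makedirs(os.path.join(base_dir, table))
        with open(os.path.join(base_dir, table, 'part-00000'), 'w') as f:
            f.write('1,demo\n')
    with open(os.path.join(base_dir, 'schemas.json'), 'w') as f:
        f.write('{}')


with tempfile.TemporaryDirectory() as tmp_dir:
    build_demo_tree(tmp_dir)
    # Report the table directory of each matched file; schemas.json is filtered out
    demo_tables = sorted(
        os.path.basename(os.path.dirname(f)) for f in get_file_names(tmp_dir)
    )

print(demo_tables)
```

Directories and the extra `schemas.json` file are returned by `glob.glob` but dropped by the filter, which is why only the table folders show up.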

In [10]:
items = glob.glob(f'{src_base_dir}/**', recursive=True)

In [11]:
items

['../../data/retail_db/',
 '../../data/retail_db/products',
 '../../data/retail_db/products/part-00000',
 '../../data/retail_db/create_db_tables_pg.sql',
 '../../data/retail_db/schemas.json',
 '../../data/retail_db/orders',
 '../../data/retail_db/orders/part-00000',
 '../../data/retail_db/load_db_tables_pg.sql',
 '../../data/retail_db/departments',
 '../../data/retail_db/departments/part-00000',
 '../../data/retail_db/customers',
 '../../data/retail_db/customers/part-00000',
 '../../data/retail_db/categories',
 '../../data/retail_db/categories/part-00000',
 '../../data/retail_db/order_items',
 '../../data/retail_db/order_items/part-00000']

In [None]:
# item = items[2]

In [None]:
# item

In [None]:
# import os
# os.path.isfile(item)

In [None]:
# files = filter(lambda item: os.path.isfile(item), items)

In [None]:
# list(files)

In [12]:
files = list(filter(lambda item: os.path.isfile(item), items))
file = files[0]

In [13]:
file

'../../data/retail_db/products/part-00000'

In [15]:
file.split('/')[3:]

['retail_db', 'products', 'part-00000']

In [16]:
'/'.join(file.split('/')[3:])

'retail_db/products/part-00000'
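
Slicing with `[3:]` works only because the path starts with exactly three components to discard (`..`, `..`, `data`). As a sketch of a less position-dependent alternative, `os.path.relpath` can compute the part of the path below a chosen directory. The helper name `get_blob_suffix` and its parameters are hypothetical and not part of the original notebook.

```python
import os


def get_blob_suffix(file_path, src_parent_dir):
    # Path of the file relative to the directory that contains retail_db
    suffix = os.path.relpath(file_path, start=src_parent_dir)
    # GCS object names use forward slashes regardless of the local OS
    return suffix.replace(os.sep, '/')


blob_suffix = get_blob_suffix(
    '../../data/retail_db/products/part-00000',
    '../../data'
)
print(blob_suffix)
```

With this approach, moving the data folder to a different depth only requires changing the second argument, not the slice index.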

In [17]:
tgt_base_dir = 'pythondemo'

In [18]:
from google.cloud import storage



In [19]:
gsclient = storage.Client()

In [20]:
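# Every regular file under src_base_dir is uploaded, including the .sql and .json files
files = filter(lambda item: os.path.isfile(item), items)
bucket = gsclient.get_bucket('sgretail')
for file in files:
    print(f'Uploading file {file}')
    # Drop the leading '../../data' components so the blob name starts at retail_db
    blob_suffix = '/'.join(file.split('/')[3:])
    blob_name = f'{tgt_base_dir}/{blob_suffix}'
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(file)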


Uploading file ../../data/retail_db/products/part-00000
Uploading file ../../data/retail_db/create_db_tables_pg.sql
Uploading file ../../data/retail_db/schemas.json
Uploading file ../../data/retail_db/orders/part-00000
Uploading file ../../data/retail_db/load_db_tables_pg.sql
Uploading file ../../data/retail_db/departments/part-00000
Uploading file ../../data/retail_db/customers/part-00000
Uploading file ../../data/retail_db/categories/part-00000
Uploading file ../../data/retail_db/order_items/part-00000
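
The steps above can be collected into one reusable function. The following is a minimal sketch, not the notebook's own code: `upload_files` and its parameters (including `strip_levels`) are hypothetical names, and it assumes `bucket` is an object with a `blob(name)` method whose result supports `upload_from_filename(path)`, as `google.cloud.storage.Bucket` does. Unlike the loop above, it takes an explicit list of file names, so it can be fed the output of `get_file_names` to upload only the `part-00000` files.

```python
def upload_files(bucket, file_names, tgt_base_dir, strip_levels=3):
    """Upload each local file to the bucket and return the blob names used."""
    blob_names = []
    for file_name in file_names:
        # Same naming rule as the loop above: drop the leading path components
        blob_suffix = '/'.join(file_name.split('/')[strip_levels:])
        blob_name = f'{tgt_base_dir}/{blob_suffix}'
        bucket.blob(blob_name).upload_from_filename(file_name)
        blob_names.append(blob_name)
    return blob_names


# Usage with the objects already defined in this notebook (requires GCS credentials):
# bucket = gsclient.get_bucket('sgretail')
# upload_files(bucket, get_file_names(src_base_dir), tgt_base_dir)
```

Returning the blob names makes it easy to compare what was uploaded against a later listing of the bucket.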


In [None]:
# !gsutil ls -r gs://sgretail/pythondemo

In [None]:
# gsclient.list_blobs?

In [None]:
# gsclient.list_blobs(
#     'sgretail',
#     prefix='pythondemo'
# )

In [None]:
# blobs = list(gsclient.list_blobs(
#     'sgretail',
#     prefix='pythondemo'
# ))

In [None]:
# blobs
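
To confirm the upload from Python rather than with `gsutil`, the blob names under the target prefix can be listed. This is a minimal sketch, assuming `client` behaves like `storage.Client`: its `list_blobs` method accepts a bucket name and a `prefix` keyword and yields objects with a `name` attribute. The helper name `list_blob_names` is hypothetical.

```python
def list_blob_names(client, bucket_name, prefix):
    # list_blobs returns an iterator, so it is materialized into a sorted list of names
    return sorted(blob.name for blob in client.list_blobs(bucket_name, prefix=prefix))


# Usage with the client created earlier (requires GCS credentials):
# list_blob_names(gsclient, 'sgretail', 'pythondemo')
```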