## Project to Upload Files to GCS using Python

In this lecture of the series, we will see how to upload files to GCS using Python. We will use the standard-library modules `glob` and `os`, together with `storage` from `google.cloud`, to build the application logic.

Here are the design details.
* First, we need to get the list of file names to upload from the local file system.
* We need to build a `blob` object for each file.
* We can call `upload_from_filename` on the blob object to upload the file as a blob in GCS.
* We will use a metadata-driven (data-driven) development approach to take care of uploading all the files related to retail to GCS.
* Blobs will be named using the file names as reference.

In [1]:
#!gsutil rm -r gs://airetail/pythondemo
!gsutil rm -r gs://udemy-retail-gcpbucket/pythondemo

CommandException: No URLs matched: gs://udemy-retail-gcpbucket/pythondemo


In [2]:
#!gsutil ls gs://airetail/
!gsutil ls gs://udemy-retail-gcpbucket/

In [19]:
import glob
import os

In [20]:
def get_file_name(src_base_dir):
    # NOTE: the recursive glob below can also be accomplished using os.walk()
    items = glob.glob(f"{src_base_dir}/**", recursive=True)
    return list(filter(lambda item: os.path.isfile(item) and item.endswith("part-00000"), items))
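
The note in the function above mentions `os.walk` as an alternative to the recursive `glob` call. A minimal sketch of that variant follows; the function name `get_file_names_walk` and its `suffix` parameter are hypothetical and not part of the original notebook.

```python
import os


def get_file_names_walk(src_base_dir, suffix="part-00000"):
    """Return paths of files under src_base_dir whose names end with suffix."""
    matches = []
    # os.walk yields (directory, subdirectory names, file names) for every level.
    for dir_path, _dir_names, file_names in os.walk(src_base_dir):
        for file_name in file_names:
            if file_name.endswith(suffix):
                matches.append(os.path.join(dir_path, file_name))
    # Sort so the result does not depend on file system traversal order.
    return sorted(matches)
```

Because `os.walk` already separates files from directories, no `os.path.isfile` check is needed in this version.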

In [24]:
#src_base_dir = '../../data/retail_db'
#src_base_dir = "c:\\users\\user\\desktop\\computerscience\\udemy\\dataengineering\\data-engineering-on-gcp\\data\\retail"
#src_base_dir = '..\..\data\retail_db'
src_base_dir = os.path.join(os.getcwd(), "data-engineering-on-gcp", "data", "retail_db")

In [25]:
#NOTE can also accomplish glob.glob(f'{src_base_dir}/**', recursive=True) using os.walk ()
items = glob.glob(f'{src_base_dir}/**', recursive=True)

In [26]:
items

['C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\categories',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\categories\\part-00000',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\create_db_tables_pg.sql',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\customers',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\customers\\part-00000',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\departments',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail

In [27]:
item = items[2]

In [28]:
item

'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\categories\\part-00000'

In [29]:
import os
os.path.isfile(item)

True

In [30]:
files = filter(lambda item: os.path.isfile(item), items)

In [31]:
list(files)

['C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\categories\\part-00000',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\create_db_tables_pg.sql',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\customers\\part-00000',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\departments\\part-00000',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\load_db_tables_pg.sql',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\orders\\part-00000',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\order_items\\part-00000',
 'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\
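
One detail worth noting about the two cells above: in Python 3, `filter` returns a lazy iterator, so `list(files)` consumes it and a second pass over `files` yields nothing. The short sketch below illustrates this with made-up sample values standing in for the paths returned by `glob`.

```python
# Made-up sample values standing in for paths returned by glob.
items = ["retail_db/categories", "retail_db/categories/part-00000"]

files = filter(lambda item: item.endswith("part-00000"), items)
first_pass = list(files)   # consumes the iterator
second_pass = list(files)  # the iterator is already exhausted

# Materializing the result once keeps it reusable.
files_list = list(filter(lambda item: item.endswith("part-00000"), items))
```

This is why wrapping the `filter` call in `list(...)` at assignment time is the safer choice when the result is used more than once.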

In [56]:
#files = list(filter(lambda item: os.path.isfile(item), items))
files = list(filter(lambda item: os.path.isfile(item) and item.endswith("part-00000"), items))
file = files[0]

In [33]:
file

'C:\\Users\\User\\Desktop\\ComputerScience\\Udemy\\DataEngineering\\data-engineering-on-gcp\\data\\retail_db\\categories\\part-00000'

In [52]:
#file.split('/')[3:]
print(file.split(os.sep))
file.split(os.sep).index("retail_db")

['C:', 'Users', 'User', 'Desktop', 'ComputerScience', 'Udemy', 'DataEngineering', 'data-engineering-on-gcp', 'data', 'retail_db', 'categories', 'part-00000']


9

In [53]:
#'/'.join(file.split('/')[3:])
"/".join(file.split(os.sep)[file.split(os.sep).index("retail_db"):])

'retail_db/categories/part-00000'
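
The expression above can be wrapped in a small helper so the path is split only once. This is a sketch under assumptions: the name `build_blob_suffix` and the `anchor_dir` and `sep` parameters are hypothetical, introduced here only so the same logic can be tried with either Windows or POSIX separators.

```python
import os


def build_blob_suffix(file_path, anchor_dir="retail_db", sep=os.sep):
    """Return the part of file_path starting at anchor_dir, joined with '/'."""
    parts = file_path.split(sep)
    # list.index raises ValueError if anchor_dir is not a path component.
    start = parts.index(anchor_dir)
    return "/".join(parts[start:])
```

For example, with an illustrative shortened path, `build_blob_suffix(r"C:\Users\User\data\retail_db\categories\part-00000", sep="\\")` returns `'retail_db/categories/part-00000'`.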

In [57]:
#tgt_base_dir = 'pythondemo'
tgt_base_dir = "retail_pythondemo"

In [58]:
from google.cloud import storage

In [59]:
gsclient = storage.Client()

In [60]:
#files = filter(lambda item: os.path.isfile(item), items)
# Keep only the part-00000 data files.
files = list(filter(lambda item: os.path.isfile(item) and item.endswith("part-00000"), items))
#bucket = gsclient.get_bucket('airetail')
bucket = gsclient.get_bucket("udemy-retail-gcpbucket")
for file in files:
    print(f'Uploading file {file}')
    #blob_suffix = '/'.join(file.split('/')[3:])
    # Keep the path from retail_db onward and join it with "/" for the blob name.
    parts = file.split(os.sep)
    blob_suffix = "/".join(parts[parts.index("retail_db"):])
    blob_name = f'{tgt_base_dir}/{blob_suffix}'
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(file)

Uploading file C:\Users\User\Desktop\ComputerScience\Udemy\DataEngineering\data-engineering-on-gcp\data\retail_db\categories\part-00000
Uploading file C:\Users\User\Desktop\ComputerScience\Udemy\DataEngineering\data-engineering-on-gcp\data\retail_db\customers\part-00000
Uploading file C:\Users\User\Desktop\ComputerScience\Udemy\DataEngineering\data-engineering-on-gcp\data\retail_db\departments\part-00000
Uploading file C:\Users\User\Desktop\ComputerScience\Udemy\DataEngineering\data-engineering-on-gcp\data\retail_db\orders\part-00000
Uploading file C:\Users\User\Desktop\ComputerScience\Udemy\DataEngineering\data-engineering-on-gcp\data\retail_db\order_items\part-00000
Uploading file C:\Users\User\Desktop\ComputerScience\Udemy\DataEngineering\data-engineering-on-gcp\data\retail_db\products\part-00000
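
The upload loop can also be packaged as a function that receives the bucket as an argument, which makes the logic easy to try against a stand-in object before touching GCS. This is only a sketch: the name `upload_files` and the `anchor_dir` parameter are hypothetical, while `bucket.blob` and `upload_from_filename` are the same calls used in the loop above.

```python
import os


def upload_files(bucket, files, tgt_base_dir, anchor_dir="retail_db"):
    """Upload each local file to bucket and return the blob names used."""
    uploaded = []
    for file in files:
        parts = file.split(os.sep)
        # Keep the path from anchor_dir onward; blob names always use "/".
        blob_suffix = "/".join(parts[parts.index(anchor_dir):])
        blob_name = f"{tgt_base_dir}/{blob_suffix}"
        bucket.blob(blob_name).upload_from_filename(file)
        uploaded.append(blob_name)
    return uploaded
```

With the objects from this notebook, the call would look like `upload_files(gsclient.get_bucket("udemy-retail-gcpbucket"), files, tgt_base_dir)`.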


In [61]:
#!gsutil ls -r gs://airetail/pythondemo
!gsutil ls -r gs://udemy-retail-gcpbucket/retail_pythondemo

gs://udemy-retail-gcpbucket/retail_pythondemo/:

gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/:

gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/categories/:
gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/categories/part-00000

gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/customers/:
gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/customers/part-00000

gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/departments/:
gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/departments/part-00000

gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/order_items/:
gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/order_items/part-00000

gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/orders/:
gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/orders/part-00000

gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/products/:
gs://udemy-retail-gcpbucket/retail_pythondemo/retail_db/products/part-00000


In [64]:
gsclient.list_blobs?
#help (gsclient.list_blobs)

Signature:
gsclient.list_blobs(
    bucket_or_name,
    max_results=None,
    page_token=None,
    prefix=None,
    delimiter=None,
    start_offset=None,
    end_offset=None,
    include_trailing_delimiter=None,
    versions=None,
    projection='noAcl',
    fields=None,
    page_size=None,
    timeout=60

In [65]:
'''
gsclient.list_blobs(
    'airetail',
    prefix='pythondemo'
)
'''
gsclient.list_blobs("udemy-retail-gcpbucket", prefix="retail_pythondemo")

<google.api_core.page_iterator.HTTPIterator at 0x2497ceb96f0>

In [66]:
'''
blobs = list(gsclient.list_blobs(
    'airetail',
    prefix='pythondemo'
))
'''
blobs = list(gsclient.list_blobs("udemy-retail-gcpbucket", prefix="retail_pythondemo"))

In [67]:
blobs

[<Blob: udemy-retail-gcpbucket, retail_pythondemo/retail_db/categories/part-00000, 1709539445352783>,
 <Blob: udemy-retail-gcpbucket, retail_pythondemo/retail_db/customers/part-00000, 1709539446623319>,
 <Blob: udemy-retail-gcpbucket, retail_pythondemo/retail_db/departments/part-00000, 1709539447009923>,
 <Blob: udemy-retail-gcpbucket, retail_pythondemo/retail_db/order_items/part-00000, 1709539449973035>,
 <Blob: udemy-retail-gcpbucket, retail_pythondemo/retail_db/orders/part-00000, 1709539448526361>,
 <Blob: udemy-retail-gcpbucket, retail_pythondemo/retail_db/products/part-00000, 1709539450739387>]
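
The listing above can be used to confirm that every expected blob exists. The sketch below compares a list of expected blob names against the `name` attribute of each listed blob; the function name `find_missing_blobs` is hypothetical and not part of the original notebook.

```python
def find_missing_blobs(expected_names, blobs):
    """Return the expected blob names that are absent from blobs, sorted."""
    # Each Blob object exposes its object name through the .name attribute.
    existing = {blob.name for blob in blobs}
    return sorted(set(expected_names) - existing)
```

An empty result means all expected blobs were found under the listed prefix.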