<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Tips/Python%20Packages.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FTips%2FPython%2520Packages.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Tips/Python%20Packages.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Tips/Python%20Packages.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Python Packages - for Training Code

At the simplest, all the training code may be in a single `filename.py` file that is a module. There are a couple of layers of depth that are commonly added to this:

**Python Modules**

Modules are files: `filename.py`

**Python Project**

Projects are collections of **Python Modules** in folders and possibly subfolders.  Here is an example project named `trainer`.
```bash
│   │   ├── trainer/
│   │   │   ├── __init__.py
│   │   │   ├── train.py
│   │   │   ├── module_1.py
│   │   │   ├── helpers/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── module_a.py
│   │   │   │   ├── module_b.py
```
Here `train.py` might have `import module_1` and `import helpers.module_a as module_a`. Note the `__init__.py` file in each folder: this is an (often empty) file that lets Python know the folder can be imported as a package.
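
As a minimal sketch of this layout, the snippet below builds a throwaway copy of the project in a temporary directory and imports it. The file contents and the `VALUE`, `helper_value`, and `run` names are hypothetical. It also assumes `trainer` is imported as a package from its parent folder (rather than running `train.py` directly as a script), which is why the imports use the `trainer.` prefix.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway copy of the example project (file contents are hypothetical).
root = Path(tempfile.mkdtemp())
helpers = root / "trainer" / "helpers"
helpers.mkdir(parents=True)

(root / "trainer" / "__init__.py").write_text("")   # marks trainer/ as importable
(helpers / "__init__.py").write_text("")            # marks helpers/ as importable
(root / "trainer" / "module_1.py").write_text("VALUE = 1\n")
(helpers / "module_a.py").write_text("def helper_value():\n    return 'a'\n")
(root / "trainer" / "train.py").write_text(
    "from trainer import module_1\n"
    "from trainer.helpers import module_a\n"
    "\n"
    "def run():\n"
    "    return module_1.VALUE, module_a.helper_value()\n"
)

# Make the folder that contains trainer/ importable, then import the entry module.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
train = importlib.import_module("trainer.train")
result = train.run()
print(result)
```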

**Python Packages**

Packages are created by adding the necessary files to a **Python Project** to help create a distribution package.
```bash
├── training_package/
│   ├── pyproject.toml
│   ├── src/
│   │   ├── trainer/
│   │   │   ├── __init__.py
│   │   │   ├── train.py
│   │   │   ├── module_1.py
│   │   │   ├── helpers/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── module_a.py
│   │   │   │   ├── module_b.py
```

Example `pyproject.toml` file that sets `setuptools` as the build system:
```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = 'trainer'
version = '0.1'
dependencies = ['tensorflow_io', 'google-cloud-aiplatform>=1.17.0']
description = 'Training Package'
authors = [{name = 'statmike'}]
```

**Python Distribution Archive**

Preparing a **Python Package** for distribution produces what is called an archive, distribution, or distribution archive. There are two formats for these:
- `file.tar.gz` is a source distribution
    - created with `python setup.py sdist` or `python -m build` run in the package level folder
    - tarballs, `file.tar`, a collection of files wrapped into a single file
    - compressed, `file.tar.gz`, using [gzip](https://www.gzip.org/)
    - contains metadata and source files to be installed by pip
- `.whl` is a built distribution
    - created with `python setup.py bdist_wheel` or `python -m build` from the `package` level folder
    - wheels, `file.whl`, built into a compressed binary format that is portable
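
To make the "tarball plus gzip" description concrete, here is a minimal sketch using only the standard library `tarfile` module. The folder and file names are hypothetical, and this only illustrates the archive format: a real source distribution built by `python -m build` also contains metadata such as `PKG-INFO`.

```python
import tarfile
import tempfile
from pathlib import Path

# Hypothetical package folder with a single module.
work = Path(tempfile.mkdtemp())
pkg = work / "trainer"
pkg.mkdir()
(pkg / "train.py").write_text("print('training')\n")

# "w:gz" writes a tar archive and gzip-compresses it in one step.
archive = work / "trainer-0.1.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(pkg, arcname="trainer-0.1/trainer")

# Read it back and list the members.
with tarfile.open(archive, "r:gz") as tar:
    members = sorted(tar.getnames())
print(members)
```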

Notes on distribution tools:
- here we use setuptools as the build backend, specified in the `[build-system]` section of `pyproject.toml`
    - `python -m build` uses `pyproject.toml` to automatically create both `.whl` built distribution and `.tar.gz` source distribution versions
- another way you may see this done is using setuptools directly by creating a `setup.py` file instead of `pyproject.toml`.  It can then be used with setuptools:
    - `python setup.py sdist` which automatically creates `file.tar.gz` by default
    - `python setup.py bdist_wheel` which creates `file.whl`
    - this is the method mentioned on this Vertex AI documentation page for [creating a python training application for a pre-built container](https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container)
        - the method in this notebook also builds the source distribution in a compatible way for use with Vertex AI pre-built containers.  You can actually directly use `gzip` to create the source distribution for a folder of training files!
- several advantages to using `build`
    - automatically create source and built distribution in a `/dist` subfolder
    - automatic discovery of modules for common directory structures - [link](https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#automatic-discovery)
    - defaults to include package data files - [link](https://setuptools.pypa.io/en/latest/userguide/datafiles.html#include-package-data)
    - can link to readme.md and license files

**Installing Packages**

When you `pip install ...` what is happening?  This causes pip to look for the package and install it.  The default location to look is [PyPI](https://pypi.org/).  This can be overridden:
- local install `pip install path/to/file.tar.gz` or `pip install path/to/file.whl`
- install from a custom repository on Artifact Registry with `pip install --index-url https://{REGION}-python.pkg.dev/{PROJECT}/{REPOSITORY}/simple/ sampleproject`
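
As a small illustration, the index URL for an Artifact Registry Python repository can be assembled from its parts. This is a sketch that assumes the `/simple/` index path described in the Artifact Registry documentation. The helper name `artifact_registry_index_url` and the example values are hypothetical.

```python
def artifact_registry_index_url(region: str, project: str, repository: str) -> str:
    """Return the pip index URL for an Artifact Registry Python repository.

    Assumes the '/simple/' path used by Artifact Registry's pip-compatible index.
    """
    return f"https://{region}-python.pkg.dev/{project}/{repository}/simple/"


# Example values are placeholders, not real resources.
index_url = artifact_registry_index_url("us-central1", "my-project", "my-repo")
pip_command = ["pip", "install", "--index-url", index_url, "sampleproject"]
print(" ".join(pip_command))
```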


Resources:
- [pip install](https://pip.pypa.io/en/stable/cli/pip_install/)
- [Packaging Python Projects Tutorial](https://packaging.python.org/en/latest/tutorials/packaging-projects/)
- [setuptools](https://docs.python.org/3/distutils/sourcedist.html)
- [setuptools quickstart](https://setuptools.pypa.io/en/latest/userguide/quickstart.html)

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Tips/Python%20Packages.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs

The list `packages` contains tuples of package import names and install names. If the import name is not found, then the install name is used to install the package quietly for the current user.

In [3]:
# tuples of (import name, install name)
packages = [
    ('twine', 'twine'),
    ('build', 'build'),
    ('google.cloud.artifactregistry_v1', 'google-cloud-artifact-registry')
]

import importlib.util
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
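
One caveat with `importlib.util.find_spec`: for a dotted name such as `google.cloud.artifactregistry_v1` it imports the parent package first, and it raises `ModuleNotFoundError` when that parent is missing instead of returning `None`. A hedged sketch of a more defensive check follows. The helper name `is_installed` is hypothetical and not part of this notebook.

```python
import importlib.util


def is_installed(import_name: str) -> bool:
    """Return True if import_name can be located, False otherwise."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is not installed.
        return False


print(is_installed("json"))
print(is_installed("no_such_parent_pkg_xyz.child"))
```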

### Enable APIs

In [4]:
!gcloud services enable artifactregistry.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

inputs:

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
EXPERIMENT = 'packages'
SERIES = 'tips'

# specify a GCS Bucket
GCS_BUCKET = PROJECT_ID

packages:

In [8]:
import os, shutil
import importlib.metadata
from datetime import datetime
from google.cloud import storage

from google.cloud import artifactregistry_v1

clients:

In [9]:
gcs = storage.Client(project = PROJECT_ID)
ar_client = artifactregistry_v1.ArtifactRegistryClient()

parameters:

In [10]:
DIR = f'code/{EXPERIMENT}'

environment:

In [11]:
if not os.path.exists(DIR):
    os.makedirs(DIR)
else:
    shutil.rmtree(DIR, ignore_errors = True)
    os.makedirs(DIR)
    
# list contents of directory one level higher than DIR
os.listdir(DIR + '/../')

['packages']

---
## Construct Python Package

Use the temp directory created at DIR:

In [12]:
DIR

'code/packages'

In [13]:
os.listdir(f'{DIR}')

[]

### Create the folder structure:

In [14]:
os.makedirs(DIR + f'/{SERIES}_trainer/src/{SERIES}_trainer')

In [15]:
for root, dirs, files in os.walk(DIR):
    print(root)

code/packages
code/packages/tips_trainer
code/packages/tips_trainer/src
code/packages/tips_trainer/src/tips_trainer


### Add files to directory:

The [05 - TensorFlow](../05%20-%20TensorFlow/readme.md) series has a model training file named [train.py](../05%20-%20TensorFlow/code/train.py) that will be used here.

Retrieve the training script (if not already included in a clone of this repository):

In [16]:
file = "../05 - TensorFlow/code/train.py"
if not os.path.exists(file):
    print('Retrieving document...')
    if not os.path.exists(os.path.dirname(file)):
        os.makedirs(os.path.dirname(file))
    import requests, urllib.parse
    # file[3:] drops the leading '../' so the repository-relative path is appended only once
    r = requests.get(f'https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/{urllib.parse.quote(file[3:])}')
    with open(file, 'wb') as f:
        f.write(r.content)
    print(f'Document now at `{file}`')
else:
    print(f'Document Found at `{file}`')

Document Found at `../05 - TensorFlow/code/train.py`


Copy the training file:

In [17]:
shutil.copyfile('../05 - TensorFlow/code/train.py', f'{DIR}/{SERIES}_trainer/src/{SERIES}_trainer/train.py')

'code/packages/tips_trainer/src/tips_trainer/train.py'

Create an empty `__init__.py` file:

In [18]:
with open(f'{DIR}/{SERIES}_trainer/src/{SERIES}_trainer/__init__.py', 'w') as file: pass

In [19]:
with open(f'{DIR}/{SERIES}_trainer/pyproject.toml', 'w') as file:
    file.write(f"""[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = '{SERIES}_trainer'
version = '0.1'
dependencies = ['tensorflow_io', 'google-cloud-aiplatform', 'db-dtypes', 'protobuf>={importlib.metadata.version('protobuf')}']
description = 'Training Package'
authors = [{{name = 'statmike'}}]
""")

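Because the `pyproject.toml` content above is written with an f-string, literal braces must be doubled: `{{` and `}}` render as `{` and `}`, while single braces are substituted with a value. A minimal sketch of that behavior, using the same `SERIES` value as this notebook:

```python
SERIES = "tips"

# Single braces substitute a value; doubled braces produce literal braces.
name_line = f"name = '{SERIES}_trainer'"
authors_line = f"authors = [{{name = 'statmike'}}]"

print(name_line)
print(authors_line)
```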
list directory:

In [20]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        print(os.path.join(root, f))

code/packages/tips_trainer/pyproject.toml
code/packages/tips_trainer/src/tips_trainer/__init__.py
code/packages/tips_trainer/src/tips_trainer/train.py


### Build the Python distribution archives:

The build process creates both a `.tar.gz` source distribution and a `.whl` built distribution.

In [21]:
!cd ./{DIR}/{SERIES}_trainer && python -m build

* Creating virtualenv isolated environment...
* Installing packages in isolated environment... (setuptools)
* Getting build dependencies for sdist...
running egg_info
creating src/tips_trainer.egg-info
writing src/tips_trainer.egg-info/PKG-INFO
writing dependency_links to src/tips_trainer.egg-info/dependency_links.txt
writing requirements to src/tips_trainer.egg-info/requires.txt
writing top-level names to src/tips_trainer.egg-info/top_level.txt
writing manifest file 'src/tips_trainer.egg-info/SOURCES.txt'
reading manifest file 'src/tips_trainer.egg-info/SOURCES.txt'
writing manifest file 'src/tips_trainer.egg-info/SOURCES.txt'
* Building sdist...
running sdist
running egg_info
writing src/tips_trainer.egg-info/PKG-INFO
writing dependency_links to src/tips_trainer.egg-info/dependency_links.txt
writing requirements to src/tips_trainer.egg-info/requires.txt
writing top-level names to src/tips_trainer.egg-info/top_level.txt
reading manifest file 'src/tips_t

list directory:

In [22]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        print(os.path.join(root, f))

code/packages/tips_trainer/pyproject.toml
code/packages/tips_trainer/src/tips_trainer/__init__.py
code/packages/tips_trainer/src/tips_trainer/train.py
code/packages/tips_trainer/src/tips_trainer.egg-info/top_level.txt
code/packages/tips_trainer/src/tips_trainer.egg-info/PKG-INFO
code/packages/tips_trainer/src/tips_trainer.egg-info/requires.txt
code/packages/tips_trainer/src/tips_trainer.egg-info/dependency_links.txt
code/packages/tips_trainer/src/tips_trainer.egg-info/SOURCES.txt
code/packages/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl
code/packages/tips_trainer/dist/tips_trainer-0.1.tar.gz


**Review**

This single folder now has:

- a single training file/module: {DIR}/tips_trainer/src/tips_trainer/train.py
- a folder of training code: {DIR}/tips_trainer/src/tips_trainer*
    - with a starting point of train.py
- a source distribution: {DIR}/tips_trainer/dist/tips_trainer-0.1.tar.gz
- a built distribution: {DIR}/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl
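
The wheel file name itself encodes compatibility information. As a sketch, assuming the standard wheel naming convention `{distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl` with no optional build tag, the name produced above can be split like this (the helper `parse_wheel_name` is hypothetical):

```python
def parse_wheel_name(filename: str) -> dict:
    """Split a wheel file name into its parts (assumes no optional build tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename}")
    stem = filename[: -len(".whl")]
    distribution, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "distribution": distribution,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


tags = parse_wheel_name("tips_trainer-0.1-py3-none-any.whl")
print(tags)
```

The `py3-none-any` tags mean any Python 3 interpreter, no ABI requirement, and any platform, which is what is expected for a pure-Python package.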

### Copy to GCS

Here the folder structure for DIR will be copied to the GCS Bucket used across this project.  This section uses skills that are discussed in more detail in the [Python Client for GCS](./Python%20Client%20for%20GCS.ipynb) notebook.

Get the bucket:

In [23]:
bucket = gcs.lookup_bucket(GCS_BUCKET)

list files to upload:

In [24]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        print(os.path.join(root, f))

code/packages/tips_trainer/pyproject.toml
code/packages/tips_trainer/src/tips_trainer/__init__.py
code/packages/tips_trainer/src/tips_trainer/train.py
code/packages/tips_trainer/src/tips_trainer.egg-info/top_level.txt
code/packages/tips_trainer/src/tips_trainer.egg-info/PKG-INFO
code/packages/tips_trainer/src/tips_trainer.egg-info/requires.txt
code/packages/tips_trainer/src/tips_trainer.egg-info/dependency_links.txt
code/packages/tips_trainer/src/tips_trainer.egg-info/SOURCES.txt
code/packages/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl
code/packages/tips_trainer/dist/tips_trainer-0.1.tar.gz


list of desired bucket object URIs:

In [25]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        filepath = os.path.join(root, f)
        gcspath = f'{SERIES}/{filepath}'
        print(gcspath)

tips/code/packages/tips_trainer/pyproject.toml
tips/code/packages/tips_trainer/src/tips_trainer/__init__.py
tips/code/packages/tips_trainer/src/tips_trainer/train.py
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/top_level.txt
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/PKG-INFO
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/requires.txt
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/dependency_links.txt
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/SOURCES.txt
tips/code/packages/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl
tips/code/packages/tips_trainer/dist/tips_trainer-0.1.tar.gz
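
Note that `os.path.join` uses the local path separator, so on Windows the printed names would contain backslashes, whereas GCS object names conventionally use `/`. A sketch of a separator-safe variant follows. The helper `object_names` is hypothetical, and the temporary files only stand in for the package folder.

```python
import os
import tempfile
from pathlib import Path


def object_names(local_dir: str, prefix: str) -> list:
    """Map every file under local_dir to '<prefix>/<relative posix path>'."""
    base = Path(local_dir)
    names = []
    for root, _dirs, files in os.walk(base):
        for f in files:
            relative = (Path(root) / f).relative_to(base)
            names.append(f"{prefix}/{relative.as_posix()}")
    return sorted(names)


# Stand-in folder with two files (names are placeholders).
demo = Path(tempfile.mkdtemp())
(demo / "src").mkdir()
(demo / "pyproject.toml").write_text("")
(demo / "src" / "train.py").write_text("")

names = object_names(str(demo), "tips/code/packages/tips_trainer")
print(names)
```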


upload files as objects:

In [26]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        filepath = os.path.join(root, f)
        gcspath = f'{SERIES}/{filepath}'
        blob = bucket.blob(gcspath)
        blob.upload_from_filename(filepath)

In [27]:
print(f"View the bucket directly here:\nhttps://console.cloud.google.com/storage/browser/{GCS_BUCKET};tab=objects&project={PROJECT_ID}")

View the bucket directly here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915;tab=objects&project=statmike-mlops-349915


list files in bucket:

In [28]:
for blob in list(bucket.list_blobs(prefix = f'{SERIES}/{DIR}')):
    print(blob.name)

tips/code/packages/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl
tips/code/packages/tips_trainer/dist/tips_trainer-0.1.tar.gz
tips/code/packages/tips_trainer/pyproject.toml
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/PKG-INFO
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/SOURCES.txt
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/dependency_links.txt
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/requires.txt
tips/code/packages/tips_trainer/src/tips_trainer.egg-info/top_level.txt
tips/code/packages/tips_trainer/src/tips_trainer/__init__.py
tips/code/packages/tips_trainer/src/tips_trainer/train.py


---
## Artifact Registry

Artifact Registry organizes artifacts with repositories. Each repository contains packages and is designated to hold a particular format of package: Docker images, Python Packages and [others](https://cloud.google.com/artifact-registry/docs/supported-formats#package).


- upload package
- show listing at multiple levels - see doc

### List Repositories

In [29]:
for repo in ar_client.list_repositories(parent = f'projects/{PROJECT_ID}/locations/{REGION}'):
    print(repo.name)

projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-docker
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python


### Create/Retrieve Python Package Repository

Create an Artifact Registry Repository to hold Python Packages created by this notebook.

In [30]:
python_repo = None
for repo in ar_client.list_repositories(parent = f'projects/{PROJECT_ID}/locations/{REGION}'):
    if f'{PROJECT_ID}-python' in repo.name:
        python_repo = repo
        print(f'Retrieved existing repo: {python_repo.name}')

if not python_repo:
    operation = ar_client.create_repository(
        request = artifactregistry_v1.CreateRepositoryRequest(
            parent = f'projects/{PROJECT_ID}/locations/{REGION}',
            repository_id = f'{PROJECT_ID}-python',
            repository = artifactregistry_v1.Repository(
                description = f'A repository for the {PROJECT_ID} experiment that holds Python Packages.',
                name = f'{PROJECT_ID}-python',
                format_ = artifactregistry_v1.Repository.Format.PYTHON,
                labels = {'series': SERIES}
            )
        )
    )
    print('Creating Repository ...')
    python_repo = operation.result()
    print(f'Completed creating repo: {python_repo.name}')

Retrieved existing repo: projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python
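
The lookup above uses a substring test, so a repository named, for example, `{PROJECT_ID}-python-dev` would also match. A sketch of an exact match on the final path segment follows, using plain strings so it runs without any API calls. The helper `find_repo_name` and the example resource names are hypothetical.

```python
def find_repo_name(repo_names, repository_id):
    """Return the first full resource name whose last segment equals repository_id."""
    for name in repo_names:
        if name.rsplit("/", 1)[-1] == repository_id:
            return name
    return None


parent = "projects/my-project/locations/us-central1/repositories"
repo_names = [
    f"{parent}/my-project-python-dev",  # would also pass a substring test
    f"{parent}/my-project-python",
]
match = find_repo_name(repo_names, "my-project-python")
print(match)
```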


In [31]:
python_repo.name, python_repo.format_.name

('projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python',
 'PYTHON')

### Upload Python Package Distribution

This uses the `twine` Python package to upload the package distributions to the repository. Rather than PyPI, which is twine's default target, the upload goes to our custom repository in Artifact Registry by using the `--repository-url` flag.

This command runs from our Tips folder, so it needs to change directories into DIR and into the subfolder for the trainer project, then run the upload of the full `dist` subfolder, which contains both distribution types (source and built).

In [32]:
!cd ./{DIR}/{SERIES}_trainer && python -m twine upload --verbose --repository-url https://{REGION}-python.pkg.dev/{PROJECT_ID}/{PROJECT_ID}-python dist/*

Uploading distributions to
https://us-central1-python.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-python
INFO     dist/tips_trainer-0.1-py3-none-any.whl (3.7 KB)
INFO     dist/tips_trainer-0.1.tar.gz (3.3 KB)
INFO     Querying keyring for username
INFO     username set from keyring
INFO     Querying keyring for password
INFO     password set from keyring
INFO     username: oauth2accesstoken
INFO     password: <hidden>
Uploading tips_trainer-0.1-py3-none-any.whl
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 kB • 00:00 • ?

### Review Repository Contents

In [33]:
python_repo.name

'projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python'

List Packages in the repository:

In [34]:
ar_client.list_packages(
    parent = python_repo.name
)

ListPackagesPager<packages {
  name: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/packages/tips-trainer"
  create_time {
    seconds: 1703256229
    nanos: 454826000
  }
  update_time {
    seconds: 1703256229
    nanos: 639447000
  }
}
>
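
The `create_time` and `update_time` fields are printed above as seconds since the Unix epoch. A quick sketch for turning the printed value into a readable UTC timestamp with the standard library:

```python
from datetime import datetime, timezone

# Value copied from the create_time field in the output above.
created = datetime.fromtimestamp(1703256229, tz=timezone.utc)
print(created.isoformat())
```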

List files in the repository:

In [35]:
ar_client.list_files(
    parent = python_repo.name
)

ListFilesPager<files {
  name: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/files/tips-trainer%2Ftips_trainer-0.1-py3-none-any.whl"
  size_bytes: 3751
  hashes {
    type_: SHA256
    value: "%\006 N\362\206\211J\243+~\225=bg\272\235\004C\343\233\021\316r\017\203\370@p}\r\027"
  }
  create_time {
    seconds: 1703256229
    nanos: 454826000
  }
  update_time {
    seconds: 1703256229
    nanos: 454826000
  }
  owner: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/packages/tips-trainer/versions/0.1"
}
files {
  name: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/files/tips-trainer%2Ftips_trainer-0.1.tar.gz"
  size_bytes: 3413
  hashes {
    type_: SHA256
    value: "\207{\303\000\332\244\034Bz3Ds\020\374\202\325\216\320\320\306/6O\256c\017ri\027F\254~"
  }
  create_time {
    seconds: 1703256229
    nanos: 639447000
  }
  update_time {
   

List versions of a package:

In [36]:
ar_client.list_versions(
    parent = python_repo.name + f'/packages/{SERIES}-trainer'
)

ListVersionsPager<versions {
  name: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/packages/tips-trainer/versions/0.1"
  create_time {
    seconds: 1703256229
    nanos: 454826000
  }
  update_time {
    seconds: 1703256229
    nanos: 639447000
  }
}
>
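
Note that the package was built as `tips_trainer` but is listed as `tips-trainer`, which is why the `parent` above uses a hyphen. This is consistent with the name normalization rule used by Python package indexes (PEP 503), where runs of `-`, `_`, and `.` become a single `-` and the name is lowercased. A sketch of that rule (the helper name `normalize_name` is hypothetical):

```python
import re


def normalize_name(name: str) -> str:
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize_name("tips_trainer"))
```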

---
## Using Training Code

This notebook, [Python Packages](./Python%20Packages.ipynb), created versions of training code (script, folder, distribution) in multiple locations (local, GCS Bucket, Artifact Registry, GitHub).

The next notebook, [Python Custom Containers](./Python%20Custom%20Containers.ipynb), demonstrates many workflows for getting the training code into a custom container.

Then another notebook, [Python Training](./Python%20Training.ipynb), uses this training code and these custom containers to run Vertex AI Training Custom Jobs.
