# Python Packages - for Training Code

At the simplest, all the training code may be in a single `filename.py` file that is a module. There are a couple of layers of depth that are commonly added to this:

**Python Modules**

Modules are files: `filename.py`

**Python Project**

Projects are collections of **Python Modules** in folders and possibly subfolders.  Here is an example project named `trainer`.
```bash
├── trainer/
│   ├── __init__.py
│   ├── train.py
│   ├── module_1.py
│   ├── helpers/
│   │   ├── __init__.py
│   │   ├── module_a.py
│   │   ├── module_b.py
```
Here the `train.py` might have `import module_1` and `import helpers.module_a as module_a`.  Note the `__init__.py` file in the folders - this is an empty file that lets Python know the folder can be imported as a module.
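
The import behavior described above can be checked with a small sketch. It builds a throwaway copy of the `trainer` layout in a temporary directory and imports from it. The function name `hello` and the file contents are hypothetical, and the sketch assumes absolute imports (`trainer.helpers.module_a`) because they do not depend on the working directory.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway project that mirrors the layout above.
root = Path(tempfile.mkdtemp())
helpers = root / "trainer" / "helpers"
helpers.mkdir(parents=True)

# Empty __init__.py files mark each folder as an importable package.
(root / "trainer" / "__init__.py").write_text("")
(helpers / "__init__.py").write_text("")

# A hypothetical helper module with a single function.
(helpers / "module_a.py").write_text(
    "def hello():\n    return 'hello from module_a'\n"
)

# Make the folder that contains `trainer/` visible to the import system.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

module_a = importlib.import_module("trainer.helpers.module_a")
message = module_a.hello()
print(message)
```

Removing either `__init__.py` changes how Python treats the folder, which is why both are created even though they are empty.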

**Python Packages**

Packages are created by adding the necessary files to a **Python Project** to help create a distribution package.
```bash
├── training_package/
│   ├── pyproject.toml
│   ├── src/
│   │   ├── trainer/
│   │   │   ├── __init__.py
│   │   │   ├── train.py
│   │   │   ├── module_1.py
│   │   │   ├── helpers/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── module_a.py
│   │   │   │   ├── module_b.py
```

Example `pyproject.toml` file that sets `setuptools` as the build system:
```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = 'trainer'
version = '0.1'
dependencies = ['tensorflow_io', 'google-cloud-aiplatform>=1.17.0']
description = 'Training Package'
authors = [{name = 'statmike'}]
```

**Python Distribution Archive**

Preparing a **Python Package** for distribution creates a file called an archive, distribution, or distribution archive. There are two formats for these:
- `file.tar.gz` is a source distribution
    - created with `python setup.py sdist` or `python -m build` run in the package level folder
    - tarballs, `file.tar`, a collection of files wrapped into a single file
    - compressed, `file.tar.gz`, using [gzip](https://www.gzip.org/)
    - contains metadata and source files to be installed by pip
- `.whl` is a built distribution
    - created with `python setup.py bdist_wheel` or `python -m build` from the `package` level folder
    - wheels, `file.whl`, built into a compressed binary format that is portable
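
To make the two formats concrete, here is a minimal sketch using only the standard library. It writes one hypothetical file into a gzip-compressed tarball and into a zip archive (a wheel is a zip archive with a `.whl` extension), then reads both back. These toy archives lack the metadata that `python -m build` generates, so they are not installable; the member paths are illustrative assumptions.

```python
import io
import tarfile
import zipfile

payload = b"print('training')\n"

# A source distribution is a gzip-compressed tarball.
sdist_buffer = io.BytesIO()
with tarfile.open(fileobj=sdist_buffer, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="trainer-0.1/src/trainer/train.py")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# A wheel uses the zip format.
wheel_buffer = io.BytesIO()
with zipfile.ZipFile(wheel_buffer, mode="w") as whl:
    whl.writestr("trainer/train.py", payload)

# Read both archives back and list their members.
sdist_buffer.seek(0)
with tarfile.open(fileobj=sdist_buffer, mode="r:gz") as tar:
    sdist_members = tar.getnames()

wheel_buffer.seek(0)
with zipfile.ZipFile(wheel_buffer) as whl:
    wheel_members = whl.namelist()

print(sdist_members)
print(wheel_members)
```

The same `tarfile` and `zipfile` calls can be pointed at real files in a `dist` folder to inspect what a build produced.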

Notes on distribution tools:
- here we use setuptools as the build backend, specified in the `[build-system]` section of `pyproject.toml`
    - `python -m build` uses `pyproject.toml` to automatically create both `.whl` built distribution and `.tar.gz` source distribution versions
- another way you may see this done is using setuptools directly by creating a `setup.py` file instead of `pyproject.toml`.  It can then be used with setuptools:
    - `python setup.py sdist` which automatically creates `file.tar.gz` by default
    - `python setup.py bdist_wheel` which creates `file.whl`
    - this is the method mentioned on this Vertex AI documentation page for [creating a python training application for a pre-built container](https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container)
        - the method in this notebook also builds the source distribution in a compatible way for use with Vertex AI pre-built containers.  You can actually directly use `gzip` to create the source distribution for a folder of training files!
- several advantages to using `build`
    - automatically create source and built distribution in a `/dist` subfolder
    - automatic discovery of modules for common directory structures - [link](https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#automatic-discovery)
    - defaults to include package data files - [link](https://setuptools.pypa.io/en/latest/userguide/datafiles.html#include-package-data)
    - can link to readme.md and license files

**Installing Packages**

When you `pip install ...` what is happening?  This causes pip to look for the package and install it.  The default location to look is [PyPI](https://pypi.org/).  This can be overridden:
- local install `pip install path/to/file.tar.gz` or `pip install path/to/file.whl`
- install from a custom repository on Artifact Registry with `pip install --index-url https://{REGION}-python.pkg.dev/{PROJECT}/{REPOSITORY}/simple/ {PACKAGE}`
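
As a rough sketch of the Artifact Registry case, the helpers below assemble the index URL and the matching pip command. The function names and the example values (`my-project`, `my-repo`) are hypothetical, and the `/simple/` suffix is an assumption based on the Artifact Registry documentation that is worth verifying for your repository.

```python
def artifact_registry_index_url(region: str, project: str, repository: str) -> str:
    """Build the pip index URL for a Python repository (assumed /simple/ layout)."""
    return f"https://{region}-python.pkg.dev/{project}/{repository}/simple/"


def pip_install_command(package: str, index_url: str) -> list:
    """Return the pip command as an argument list, ready for subprocess.run."""
    return ["pip", "install", "--index-url", index_url, package]


index_url = artifact_registry_index_url("us-central1", "my-project", "my-repo")
command = pip_install_command("tips_trainer", index_url)
print(" ".join(command))
```

Because `--index-url` replaces PyPI as the lookup location, pip's `--extra-index-url` flag is an option when dependencies should still resolve from PyPI.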


Resources:
- [pip install](https://pip.pypa.io/en/stable/cli/pip_install/)
- [Packaging Python Projects Tutorial](https://packaging.python.org/en/latest/tutorials/packaging-projects/)
- [setuptools](https://docs.python.org/3/distutils/sourcedist.html)
- [setuptools quickstart](https://setuptools.pypa.io/en/latest/userguide/quickstart.html)

---
## Setup

### Package Installs (if needed)

This notebook uses the Python Clients for
- Google Service Usage
    - to enable APIs (Artifact Registry)
- Artifact Registry
    - to create a repository for storing custom Python packages in a GCP Project

The cells below check whether the required Python libraries are installed.  If one is missing, the cell prints a message and runs the associated pip install command.  These installs must complete before continuing with this notebook.

In [170]:
try:
    import google.cloud.service_usage_v1
except ImportError:
    print('You need to pip install google-cloud-service-usage')
    !pip install google-cloud-service-usage -q

In [171]:
try:
    import google.cloud.artifactregistry_v1
except ImportError:
    print('You need to pip install google-cloud-artifact-registry')
    !pip install google-cloud-artifact-registry -q

In [172]:
try:
    import build
except ImportError:
    print('You need to pip install build')
    !pip install build -q

In [173]:
try:
    import twine
except ImportError:
    print('You need to pip install twine')
    !pip install twine -q
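
The four checks above follow the same pattern, so they could be folded into one helper. This is a sketch, not the notebook's method: `missing_packages` is a hypothetical name, and it uses `importlib.util.find_spec` to look for each module instead of a full import.

```python
import importlib.util


def missing_packages(requirements):
    """Return pip names whose import name cannot be found.

    `requirements` maps an import name to its pip distribution name.
    """
    missing = []
    for import_name, pip_name in requirements.items():
        try:
            found = importlib.util.find_spec(import_name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package (for example `google`) is absent.
            found = False
        if not found:
            missing.append(pip_name)
    return missing


required = {
    "google.cloud.service_usage_v1": "google-cloud-service-usage",
    "google.cloud.artifactregistry_v1": "google-cloud-artifact-registry",
    "build": "build",
    "twine": "twine",
}
to_install = missing_packages(required)
print(to_install)
```

Any names in `to_install` can then be passed to a single `pip install` command.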

### Environment

inputs:

In [221]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [222]:
REGION = 'us-central1'
EXPERIMENT = 'packages'
SERIES = 'tips'

packages:

In [235]:
import os, shutil
import pkg_resources
from datetime import datetime
from google.cloud import storage
from google.cloud import aiplatform

from google.cloud import service_usage_v1
from google.cloud import artifactregistry_v1

clients:

In [224]:
gcs = storage.Client()
aiplatform.init(project = PROJECT_ID, location = REGION)

su_client = service_usage_v1.ServiceUsageClient()
ar_client = artifactregistry_v1.ArtifactRegistryClient()

parameters:

In [225]:
DIR = 'code'

environment:

In [227]:
# remove directory named DIR if exists
shutil.rmtree(DIR, ignore_errors = True)

# create directory DIR
os.makedirs(DIR)

# check for existence of DIR
print('DIR exists? ', os.path.exists(DIR))

DIR exists?  True


---
## Construct Python Package

Use the temp directory created at DIR:

In [228]:
DIR

'code'

In [230]:
os.listdir(f'{DIR}')

[]

### Create the folder structure:

In [231]:
os.makedirs(DIR + f'/{SERIES}_trainer/src/{SERIES}_trainer')

In [232]:
for root, dirs, files in os.walk(DIR):
    print(root)

code
code/tips_trainer
code/tips_trainer/src
code/tips_trainer/src/tips_trainer


### Add files to directory:

The [05 - TensorFlow](../05%20-%20TensorFlow/readme.md) series has a model training file named [train.py](../05%20-%20TensorFlow/code/train.py) that will be used here.

In [238]:
shutil.copyfile('../05 - TensorFlow/code/train.py', f'{DIR}/{SERIES}_trainer/src/{SERIES}_trainer/train.py')
with open(f'{DIR}/{SERIES}_trainer/src/{SERIES}_trainer/__init__.py', 'w') as file: pass

In [239]:
with open(f'{DIR}/{SERIES}_trainer/pyproject.toml', 'w') as file:
    file.write(f"""[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = '{SERIES}_trainer'
version = '0.1'
dependencies = ['tensorflow_io', 'google-cloud-aiplatform>={aiplatform.__version__}', 'protobuf=={pkg_resources.get_distribution('protobuf').version}']
description = 'Training Package'
authors = [{{name = 'statmike'}}]
""")

list directory:

In [240]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        print(os.path.join(root, f))

code/tips_trainer/pyproject.toml
code/tips_trainer/.ipynb_checkpoints/pyproject-checkpoint.toml
code/tips_trainer/src/tips_trainer/__init__.py
code/tips_trainer/src/tips_trainer/train.py
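
The `pyproject.toml` cell above pins `protobuf` with `pkg_resources`, which setuptools has deprecated. A possible alternative, sketched here under the assumption of Python 3.8 or later, is the standard library's `importlib.metadata`. The helper name `pinned_requirement` is hypothetical, and falling back to an unpinned name is a design choice of this sketch.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirement(distribution: str) -> str:
    """Return 'name==installed_version', or the bare name if not installed."""
    try:
        return f"{distribution}=={version(distribution)}"
    except PackageNotFoundError:
        # Leave the requirement unpinned when the distribution is missing.
        return distribution


requirement = pinned_requirement("protobuf")
print(requirement)
```

The returned string can be placed directly in the `dependencies` list of the generated `pyproject.toml`.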


### Build the Python distribution archives:

The build process creates both a `.tar.gz` source distribution and a `.whl` built distribution

In [241]:
!cd ./{DIR}/{SERIES}_trainer && python -m build

* Creating virtualenv isolated environment...
* Installing packages in isolated environment... (setuptools)
* Getting dependencies for sdist...
running egg_info
creating src/tips_trainer.egg-info
writing src/tips_trainer.egg-info/PKG-INFO
writing dependency_links to src/tips_trainer.egg-info/dependency_links.txt
writing requirements to src/tips_trainer.egg-info/requires.txt
writing top-level names to src/tips_trainer.egg-info/top_level.txt
writing manifest file 'src/tips_trainer.egg-info/SOURCES.txt'
reading manifest file 'src/tips_trainer.egg-info/SOURCES.txt'
writing manifest file 'src/tips_trainer.egg-info/SOURCES.txt'
* Building sdist...
running sdist
running egg_info
writing src/tips_trainer.egg-info/PKG-INFO
writing dependency_links to src/tips_trainer.egg-info/dependency_links.txt
writing requirements to src/tips_trainer.egg-info/requires.txt
writing top-level names to src/tips_trainer.egg-info/top_level.txt
reading manifest file 'src/tips_trainer

list directory:

In [242]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        print(os.path.join(root, f))

code/tips_trainer/pyproject.toml
code/tips_trainer/.ipynb_checkpoints/pyproject-checkpoint.toml
code/tips_trainer/src/tips_trainer/__init__.py
code/tips_trainer/src/tips_trainer/train.py
code/tips_trainer/src/tips_trainer.egg-info/top_level.txt
code/tips_trainer/src/tips_trainer.egg-info/SOURCES.txt
code/tips_trainer/src/tips_trainer.egg-info/requires.txt
code/tips_trainer/src/tips_trainer.egg-info/dependency_links.txt
code/tips_trainer/src/tips_trainer.egg-info/PKG-INFO
code/tips_trainer/dist/tips_trainer-0.1.tar.gz
code/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl


**Review**

This single folder now has:

- a single training file/module: `{DIR}/tips_trainer/src/tips_trainer/train.py`
- a folder of training code: `{DIR}/tips_trainer/src/tips_trainer*`
    - with a starting point of `train.py`
- a source distribution: `{DIR}/tips_trainer/dist/tips_trainer-0.1.tar.gz`
- a built distribution: `{DIR}/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl`

### Copy to GCS

Here the folder structure for DIR will be copied to the GCS Bucket used across this project.  This section uses skills that are discussed in more detail in the [Python Client for GCS](./Python%20Client%20for%20GCS.ipynb) notebook.

List buckets in project:

In [243]:
list(gcs.list_buckets())

[<Bucket: cloud-ai-platform-a68e7f3a-fac8-47f6-9f92-fff95c09cdb8>,
 <Bucket: statmike-mlops-349915>,
 <Bucket: statmike-mlops-349915-vertex-pipelines-us-central1>]

Get the bucket:

In [244]:
bucket = gcs.lookup_bucket(PROJECT_ID)

list files to upload:

In [245]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        print(os.path.join(root, f))

code/tips_trainer/pyproject.toml
code/tips_trainer/.ipynb_checkpoints/pyproject-checkpoint.toml
code/tips_trainer/src/tips_trainer/__init__.py
code/tips_trainer/src/tips_trainer/train.py
code/tips_trainer/src/tips_trainer.egg-info/top_level.txt
code/tips_trainer/src/tips_trainer.egg-info/SOURCES.txt
code/tips_trainer/src/tips_trainer.egg-info/requires.txt
code/tips_trainer/src/tips_trainer.egg-info/dependency_links.txt
code/tips_trainer/src/tips_trainer.egg-info/PKG-INFO
code/tips_trainer/dist/tips_trainer-0.1.tar.gz
code/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl


list of desired bucket object URIs:

In [247]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        filepath = os.path.join(root, f)
        gcspath = f'{SERIES}/{filepath}'
        print(gcspath)

tips/code/tips_trainer/pyproject.toml
tips/code/tips_trainer/.ipynb_checkpoints/pyproject-checkpoint.toml
tips/code/tips_trainer/src/tips_trainer/__init__.py
tips/code/tips_trainer/src/tips_trainer/train.py
tips/code/tips_trainer/src/tips_trainer.egg-info/top_level.txt
tips/code/tips_trainer/src/tips_trainer.egg-info/SOURCES.txt
tips/code/tips_trainer/src/tips_trainer.egg-info/requires.txt
tips/code/tips_trainer/src/tips_trainer.egg-info/dependency_links.txt
tips/code/tips_trainer/src/tips_trainer.egg-info/PKG-INFO
tips/code/tips_trainer/dist/tips_trainer-0.1.tar.gz
tips/code/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl


upload files as objects:

In [248]:
for root, dirs, files in os.walk(DIR):
    for f in files:
        filepath = os.path.join(root, f)
        gcspath = f'{SERIES}/{filepath}'
        blob = bucket.blob(gcspath)
        blob.upload_from_filename(filepath)

In [249]:
print(f"View the bucket directly here:\nhttps://console.cloud.google.com/storage/browser/{PROJECT_ID};tab=objects&project={PROJECT_ID}")

View the bucket directly here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915;tab=objects&project=statmike-mlops-349915


list files in bucket:

In [251]:
for blob in list(bucket.list_blobs(prefix = f'{SERIES}/{DIR}')):
    print(blob.name)

tips/code/tips_trainer/.ipynb_checkpoints/pyproject-checkpoint.toml
tips/code/tips_trainer/dist/tips_trainer-0.1-py3-none-any.whl
tips/code/tips_trainer/dist/tips_trainer-0.1.tar.gz
tips/code/tips_trainer/pyproject.toml
tips/code/tips_trainer/src/tips_trainer.egg-info/PKG-INFO
tips/code/tips_trainer/src/tips_trainer.egg-info/SOURCES.txt
tips/code/tips_trainer/src/tips_trainer.egg-info/dependency_links.txt
tips/code/tips_trainer/src/tips_trainer.egg-info/requires.txt
tips/code/tips_trainer/src/tips_trainer.egg-info/top_level.txt
tips/code/tips_trainer/src/tips_trainer/__init__.py
tips/code/tips_trainer/src/tips_trainer/train.py


---
## Artifact Registry

Artifact Registry organizes artifacts with repositories.  Each repository contains packages and is designated to hold a particular format of package: Docker images, Python Packages and [others](https://cloud.google.com/artifact-registry/docs/supported-formats#package).


- upload package
- show listing at multiple levels - see doc

### Enable API for Artifact Registry

Using Artifact Registry requires enabling the API for the Google Cloud Project.

Options for enabling the API.  In this notebook (2) is used.
 1. Use the APIs & Services page in the console: https://console.cloud.google.com/apis
     - `+ Enable APIs and Services`
     - Search for Cloud Build and Enable
     - Search for Artifact Registry and Enable
 2. Use [Google Service Usage](https://cloud.google.com/service-usage/docs) API from Python
     - [Python Client For Service Usage](https://github.com/googleapis/python-service-usage)
     - [Python Client Library Documentation](https://cloud.google.com/python/docs/reference/serviceusage/latest)
     
The following code cells use the Service Usage Client to:
- get the state of the service
- if 'DISABLED':
    - Try enabling the service and return the state after trying
- if 'ENABLED' print the state for confirmation

In [252]:
artifactregistry = su_client.get_service(
    request = service_usage_v1.GetServiceRequest(
        name = f'projects/{PROJECT_ID}/services/artifactregistry.googleapis.com'
    )
).state.name


if artifactregistry == 'DISABLED':
    print(f'Artifact Registry is currently {artifactregistry} for project: {PROJECT_ID}')
    print(f'Trying to Enable...')
    operation = su_client.enable_service(
        request = service_usage_v1.EnableServiceRequest(
            name = f'projects/{PROJECT_ID}/services/artifactregistry.googleapis.com'
        )
    )
    response = operation.result()
    if response.service.state.name == 'ENABLED':
        print(f'Artifact Registry is now enabled for project: {PROJECT_ID}')
    else:
        print(response)
else:
    print(f'Artifact Registry already enabled for project: {PROJECT_ID}')

Artifact Registry already enabled for project: statmike-mlops-349915


### Create Python Package Repository

Create an Artifact Registry Repository to hold Python Packages created by this notebook.

In [253]:
python_repo = None
for repo in ar_client.list_repositories(parent = f'projects/{PROJECT_ID}/locations/{REGION}'):
    if f'{PROJECT_ID}-python' in repo.name:
        python_repo = repo
        print(f'Retrieved existing repo: {python_repo.name}')

if not python_repo:
    operation = ar_client.create_repository(
        request = artifactregistry_v1.CreateRepositoryRequest(
            parent = f'projects/{PROJECT_ID}/locations/{REGION}',
            repository_id = f'{PROJECT_ID}-python',
            repository = artifactregistry_v1.Repository(
                description = f'A repository for the {PROJECT_ID} experiment that holds Python Packages.',
                name = f'{PROJECT_ID}-python',
                format_ = artifactregistry_v1.Repository.Format.PYTHON,
                labels = {'series': SERIES, 'experiment': EXPERIMENT}
            )
        )
    )
    print('Creating Repository ...')
    python_repo = operation.result()
    print(f'Completed creating repo: {python_repo.name}')

Retrieved existing repo: projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python


In [254]:
python_repo.name, python_repo.format_.name

('projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python',
 'PYTHON')
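
The lookup above matches with `in`, so a repository whose name merely contains `{PROJECT_ID}-python` would also match. A stricter variant compares the final segment of the resource name. This is a sketch with a hypothetical helper, `find_repository`, and stand-in objects in place of the results of `ar_client.list_repositories`.

```python
from types import SimpleNamespace


def find_repository(repositories, repository_id):
    """Return the first repository whose final name segment equals repository_id."""
    for repo in repositories:
        if repo.name.rsplit("/", 1)[-1] == repository_id:
            return repo
    return None


# Stand-in objects with a `name` attribute (hypothetical values).
parent = "projects/my-project/locations/us-central1"
repos = [
    SimpleNamespace(name=f"{parent}/repositories/my-project-python-archive"),
    SimpleNamespace(name=f"{parent}/repositories/my-project-python"),
]

match = find_repository(repos, "my-project-python")
print(match.name)
```

With the real client, the first argument would be the iterable returned by `ar_client.list_repositories(parent = ...)`.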

### List Repositories

In [255]:
for repo in ar_client.list_repositories(parent = f'projects/{PROJECT_ID}/locations/{REGION}'):
    print(repo.name)

projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python


### Upload Python Package Distribution

This uses the `twine` Python package to upload the package distributions to the repository. Rather than PyPI, twine's default target, the upload goes to our custom repository in Artifact Registry by using the `--repository-url` flag.

This command runs from our Tips folder, so it needs to change directories into DIR and into the subfolder for the trainer project, then upload the full `dist` subfolder, which contains both distribution types (source and built).

In [256]:
!cd ./{DIR}/{SERIES}_trainer && python -m twine upload --repository-url https://{REGION}-python.pkg.dev/{PROJECT_ID}/{PROJECT_ID}-python dist/*

Uploading distributions to 
https://us-central1-python.pkg.dev/statmike-mlops-349915/statmike-mlops-349915-python
Uploading tips_trainer-0.1-py3-none-any.whl
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 kB • 00:00 • ?
Uploading tips_trainer-0.1.tar.gz
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 kB • 00:00 • ?

### Review Repository Contents

In [257]:
python_repo.name

'projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python'

List Packages in the repository:

In [258]:
ar_client.list_packages(
    parent = python_repo.name
)

ListPackagesPager<packages {
  name: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/packages/tips-trainer"
  create_time {
    seconds: 1663768642
    nanos: 502115000
  }
  update_time {
    seconds: 1663768642
    nanos: 771288000
  }
}
>

List files in the repository:

In [259]:
ar_client.list_files(
    parent = python_repo.name
)

ListFilesPager<files {
  name: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/files/tips-trainer%2Ftips_trainer-0.1-py3-none-any.whl"
  size_bytes: 3626
  create_time {
    seconds: 1663768642
    nanos: 502115000
  }
  update_time {
    seconds: 1663768642
    nanos: 502115000
  }
  owner: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/packages/tips-trainer/versions/0.1"
}
files {
  name: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/files/tips-trainer%2Ftips_trainer-0.1.tar.gz"
  size_bytes: 3070
  create_time {
    seconds: 1663768642
    nanos: 771288000
  }
  update_time {
    seconds: 1663768642
    nanos: 771288000
  }
  owner: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/packages/tips-trainer/versions/0.1"
}
>

List versions of a package:

In [262]:
ar_client.list_versions(
    parent = python_repo.name + f'/packages/{SERIES}-trainer'
)

ListVersionsPager<versions {
  name: "projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python/packages/tips-trainer/versions/0.1"
  create_time {
    seconds: 1663768642
    nanos: 502115000
  }
  update_time {
    seconds: 1663768642
    nanos: 771288000
  }
}
>
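
The listing calls above return pager objects, which can be iterated to pull out individual fields. The sketch below collects the last segment of each resource name and decodes URL escapes such as `%2F`. The helper `resource_ids` is hypothetical, and stand-in objects replace the real pager so the example runs without cloud access.

```python
from types import SimpleNamespace
from urllib.parse import unquote


def resource_ids(pager):
    """Return the decoded final path segment of each resource name in a pager."""
    return [unquote(item.name.rsplit("/", 1)[-1]) for item in pager]


# Stand-in for the iterable returned by ar_client.list_files(parent = ...).
repo = "projects/my-project/locations/us-central1/repositories/my-repo"
fake_files = [
    SimpleNamespace(name=f"{repo}/files/tips-trainer%2Ftips_trainer-0.1-py3-none-any.whl"),
    SimpleNamespace(name=f"{repo}/files/tips-trainer%2Ftips_trainer-0.1.tar.gz"),
]

file_ids = resource_ids(fake_files)
print(file_ids)
```

The same helper should work for the package and version pagers, since each item there also carries a `name` field.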

---
## Using Training Code

This notebook created versions of training code (script, folder, distribution) in multiple locations (local, GCS Bucket, Artifact Registry, GitHub).  Using these versions and forms in Vertex AI Training Custom Jobs is demonstrated in [Python Training](./Python%20Training.ipynb), which also uses many workflows with custom containers created by the demonstrations in [Python Custom Containers](./Python%20Custom%20Containers.ipynb).

