# Run Job Action List

## Checklist to Complete Prior to Run

Ensure the following steps are complete before running the notebook.

1. Run this notebook from the CDH_Cluster_Python_SQL_UC_Shared cluster in dev to perform process_ingress or process_data actions.
    Users need to be in the AD developer or administrator groups to have permission to perform these actions.
    Users in the analyst group will likely not have ADLS file write permission, particularly to the database container.
2. Ensure the service principal EDAV_DATAHUB_DEV: e08bf725-02ed-4bb6-83dd-2211235be8b1 has full rights to the repo in dev;
    otherwise the run_analytics_processing action will fail when saving to the repo.
3. Ensure the service principal secret is available in Databricks as secret apps-client-secret in scope dbs-scope-CDH.
4. Ensure az_sub_client_id in the config json is set to the service principal for the project: EDAV_DATAHUB_DEV:
    140ec12a-3b3d-4138-8294-57d6c0e82dd6.
5. Ensure cdh_oauth_databricks_resource_id in the config json is set to the service principal for Databricks:
    AzureDatabricks: 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d.
6. Run show users in the cdh_global_reference database and make sure that EDAV_DATAHUB_DEV is a user in Databricks SQL.
7. Ensure all developers and EDAV_DATAHUB_DEV are members of AD Group gp-u-EDAV-CDH-DEV-DBR-ADMIN.
8. Ensure AD Group gp-u-EDAV-CDH-DEV-DBR-ADMIN is the owner of the database cdh_global_reference.
9. Double-check or set the default values in the first cell of the template notebook, such as default_environment:
    - default_environment = "dev" (if not set will use ENVIRONMENT variable)
    - default_data_product_id = "global_reference" (should default based on directory)
10. Upload input and configuration files to the appropriate containers and folders (create containers if necessary)
    - ingress directory requires subdirectories named per source abbreviation with the files listed in dataset list for
        the following sources
    - ingress directory requires subdirectory to hold json config output
        - config
    - config directory requires
        - cdh folder with csv configurations (currently populated from Excel with manual upload)
        - json config file root for each ENVIRONMENT
    - autogenerated folder needs to exist below /cdh/reference_data/autogenerated with subfolders
        - python
        - sql
11. Ensure that you are running from a cluster that supports AD passthrough, or that you have configured the Spark OAuth
        client ID and client secret.
12. To debug library source code from the notebook:
    - cd cdh_dav_python
    - pip install -e .
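
Items 4 and 5 lend themselves to an automated pre-run check. The following is a minimal sketch, not part of cdh_lava_core: `check_config` is a hypothetical helper, and it assumes the config json stores both values as top-level keys named exactly as in the checklist. The expected IDs are copied from items 4 and 5 above.

```python
import json

# Expected values copied from checklist items 4 and 5.
EXPECTED_CONFIG = {
    "az_sub_client_id": "140ec12a-3b3d-4138-8294-57d6c0e82dd6",
    "cdh_oauth_databricks_resource_id": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d",
}


def check_config(config: dict, expected: dict = EXPECTED_CONFIG) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, want in expected.items():
        got = config.get(key)
        if got is None:
            problems.append(f"{key} is missing")
        elif str(got).strip().lower() != want.lower():
            problems.append(f"{key} is {got!r}, expected {want!r}")
    return problems


# Example with an in-memory config; in practice, load the environment's json file.
sample = json.loads('{"az_sub_client_id": "140ec12a-3b3d-4138-8294-57d6c0e82dd6"}')
for problem in check_config(sample):
    print(problem)
```

For item 3, a similar check on a Databricks cluster could call `dbutils.secrets.get(scope="dbs-scope-CDH", key="apps-client-secret")`, which raises an error when the secret or scope is not available.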

## Usage Instructions

* Select JOB_NAME from the drop-down. This filters the list of job actions to perform.
* The list can contain one or more job action items associated with the job name.
* Job actions are configured in the job tab in the data life cycle Excel.
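
The filtering described above can be pictured as a lookup from job name to its configured actions. This is only an illustrative sketch: the column names `job_name` and `job_action` and the sample rows are hypothetical, and the real job configuration exported from the Excel job tab may be structured differently.

```python
import csv
import io

# Hypothetical rows; the real columns come from the job tab of the data life cycle Excel.
JOBS_CSV = """job_name,job_action
process_ingress,process_ingress
process_data,process_data
process_analytics,run_analytics_processing
process_analytics,process_data
"""


def actions_for_job(csv_text: str, job_name: str) -> list:
    """Return the job actions configured for job_name, in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["job_action"] for row in reader if row["job_name"] == job_name]


print(actions_for_job(JOBS_CSV, "process_analytics"))
```

A job name with no matching rows yields an empty list, which a caller can treat as "nothing to run".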

### Run Job - Interactively

* Select the name of the job to run
* Select the as of Year (YYYY), Month (MM) and Day (DD or NA for blank)
* Run notebook
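
The as-of selections arrive as strings, with NA standing for a blank day. A small sketch of how they might be validated before a run follows; `parse_as_of` is a hypothetical helper and is not taken from the library.

```python
from datetime import date


def parse_as_of(yyyy: str, mm: str, dd: str = "NA") -> tuple:
    """Convert drop-down strings to (year, month, day); day is None for 'NA'."""
    year, month = int(yyyy), int(mm)
    day = None if dd.strip().upper() == "NA" else int(dd)
    # date() raises ValueError for impossible dates such as 2024-02-30;
    # day 1 is used as a stand-in when the day is blank.
    date(year, month, 1 if day is None else day)
    return year, month, day


print(parse_as_of("2024", "03", "NA"))
print(parse_as_of("2024", "03", "06"))
```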

This script is used to run job actions in the CDH (CDC Data Hub) project.
It provides a checklist of steps to complete before running the notebook and usage instructions for running
the job interactively.

The script imports necessary modules and defines functions for installing Python packages,
setting up the environment, and running job actions.
It also includes code for handling different execution environments, such as Databricks and local.

To run a job, select the job name from a dropdown list and provide additional parameters such as the
as of year, month, and day.
The script then executes the specified job using the provided parameters.

Supported methods for running actions in the global-reference/cdh_lava_lib:
1. Run job interactively via this notebook in Databricks
2. Run job from another Databricks notebook (Python, Scala, R) by calling dbutils.notebook.run
3. Run job from bash, PowerShell, or DevOps notebook by calling a Python shell script
4. Run job from a Python Jupyter notebook by calling the library API directly
5. Run job from a function call inside a Databricks Python notebook by calling the function
6. Run job from VS Code in a client-server setup using a Databricks session
7. Run job from VS Code using a local Spark server without using Databricks

Note: Before running the script, ensure that the necessary requirements are installed and
the environment is properly configured.
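
As an illustration of method 2, the sketch below wraps `dbutils.notebook.run(path, timeout_seconds, arguments)`. The wrapper name, the notebook path passed by the caller, and the argument names (`job_name`, `report_yyyy`, `report_mm`, `report_dd`) are assumptions; the date names mirror the widget labels in this notebook's output, but the parameters the notebook actually reads should be confirmed before use.

```python
def run_job_notebook(
    dbutils,
    notebook_path: str,
    job_name: str,
    yyyy: str,
    mm: str,
    dd: str = "NA",
    timeout_seconds: int = 3600,
) -> str:
    """Run the job notebook as a child run and return its exit value."""
    # Argument names are assumed; notebook arguments are passed as strings.
    arguments = {
        "job_name": job_name,
        "report_yyyy": yyyy,
        "report_mm": mm,
        "report_dd": dd,
    }
    return dbutils.notebook.run(notebook_path, timeout_seconds, arguments)
```

On a Databricks cluster, the call would pass the built-in `dbutils` object and the workspace path of this notebook. Passing `dbutils` in as a parameter also lets the wrapper be exercised locally with a stand-in object.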


In [3]:
import os
import sys
import ipywidgets as widgets
from IPython.display import display

dbutils_exists = "dbutils" in locals() or "dbutils" in globals()
if dbutils_exists is False:
    # pylint: disable=invalid-name
    dbutils = None

running_local = dbutils is None

if running_local is False:

    # Get the current working directory
    current_dir = os.getcwd()
    print("Current Directory:", current_dir)

    # Go up one directory
    parent_dir = os.path.abspath(os.path.join(current_dir, ".."))
    print("Parent Directory:", parent_dir)

    # Add the parent directory path to sys.path
    if parent_dir not in sys.path:
        sys.path.append(parent_dir)

    core_dir = parent_dir + "/cdh_lava_core_lib"
    print(core_dir)

    # Add the core library directory to sys.path
    if core_dir not in sys.path:
        sys.path.append(core_dir)

    lib_dir = core_dir + "/cdh_lava_core"
    print(lib_dir)

    # Add the library package directory to sys.path
    if lib_dir not in sys.path:
        sys.path.append(lib_dir)

    # Now, list files in the parent directory
    try:
        files = os.listdir(parent_dir)
        print("Files in Parent Directory:", files)
    except Exception as e:
        print(f"Error accessing {parent_dir}: {e}")
    current_file_dir = None  # or set a default path
else:
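    # Not in Databricks: use __file__ when it is defined (scripts),
    # otherwise fall back to the working directory (e.g. Jupyter, where __file__ is undefined)
    current_file_dir = (
        os.path.dirname(os.path.abspath(__file__))
        if "__file__" in globals()
        else os.getcwd()
    )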
    # Resolve the path to its absolute form
    peer_dir = os.path.join(current_file_dir, "..")
    full_path = os.path.abspath(peer_dir)
    # Print the full, resolved path
    print(full_path)
    # Add the peer directory to sys.path
    sys.path.insert(0, full_path)

import cdh_lava_core_lib.cdh_lava_core as cdh_lava_core

# Define your default job name
# "process_data"
DEFAULT_JOB_NAME = "process_data_where_source_abbreviation_name_is_phvs"  # Replace with your actual default job name
PACKAGE_NAME = "global_reference"
ENVIRONMENT = "DEV"
DATA_PRODUCT_ID = "global_reference"

spark_exists = "spark" in locals() or "spark" in globals()
if spark_exists is False:
    # pylint: disable=invalid-name
    spark = None

print(f"running_local: {running_local}")
initial_script_dir = (
    os.path.dirname(os.path.abspath(__file__))
    if "__file__" in globals()
    else os.getcwd()
)

print(f"initial_script_dir: {initial_script_dir}")
parent_dir = os.path.abspath(os.path.join(initial_script_dir, ""))

print(f"parent_dir: {parent_dir}")
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

from cdh_lava_core_lib import run_install_cdh_lava_core

(
    spark,
    jobs_list,
    job_names,
    obj_environment_metadata,
    obj_job_metadata,
    config,
    job_name,
) = run_install_cdh_lava_core.setup_job(
    running_local,
    PACKAGE_NAME,
    DEFAULT_JOB_NAME,
    initial_script_dir,
    dbutils,
    spark,
    ENVIRONMENT,
    DATA_PRODUCT_ID,
)

if DEFAULT_JOB_NAME != "Select job to run":
    config_jobs_path = config.get("config_jobs_path")
    obj_job_metadata.run_job_name(
        obj_environment_metadata,
        spark,
        job_name,
        config,
        dbutils,
        DATA_PRODUCT_ID,
        ENVIRONMENT,
    )
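
The cell above repeats the same "append to sys.path if absent" pattern three times. One possible consolidation is sketched here; `add_to_sys_path` is a hypothetical helper, not something provided by cdh_lava_core_lib, and the directory names are the ones the cell already uses.

```python
import os
import sys


def add_to_sys_path(*dirs: str) -> list:
    """Append each directory to sys.path once; return the ones that were added."""
    added = []
    for directory in dirs:
        path = os.path.abspath(directory)
        if path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added


# Same three locations as the cell: parent, core library, library package.
parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
core_dir = os.path.join(parent_dir, "cdh_lava_core_lib")
lib_dir = os.path.join(core_dir, "cdh_lava_core")
print(add_to_sys_path(parent_dir, core_dir, lib_dir))
```

Calling the helper a second time with the same directories adds nothing, so re-running the cell does not grow sys.path.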

2024-03-06 21:58:57cdh_lava_core_lib:run_install_cdh_lava_core.pyrun_install_cdh_lava_core654INFOtracer: <opentelemetry.sdk.trace.Tracer object at 0x7f64fa843010>
INFO:cdh_lava_core_lib:run_install_cdh_lava_core.py:tracer: <opentelemetry.sdk.trace.Tracer object at 0x7f64fa843010>
2024-03-06 21:58:57cdh_lava_core_lib:run_install_cdh_lava_core.pyjob_core69INFOvirtual_env: reference_data_dev
INFO:cdh_lava_core_lib:run_install_cdh_lava_core.py:virtual_env: reference_data_dev


Current Directory: /home/developer/projects/cdh-ref/reference_data
Parent Directory: /home/developer/projects/cdh-ref
/home/developer/projects/cdh-ref/cdh_lava_core_lib
/home/developer/projects/cdh-ref/cdh_lava_core_lib/cdh_lava_core
Files in Parent Directory: ['cdh_lava_core_lib', '.vscode', 'poetry.lock', 'yarn.lock', 'README.md', 'docs', '.python-version', 'setup.py', 'cdh_ref', '.git', '.pytest_cache', '.gitignore', '.gitmodules', 'package.json', 'poetry.toml', '.VSCodeCounter', 'setup.cfg', 'pyproject.toml', 'configs', 'requirements.txt', 'reference_data', '.github', '.databricks']
running_local: False
initial_script_dir: /home/developer/projects/cdh-ref/reference_data
parent_dir: /home/developer/projects/cdh-ref/reference_data
initial_script_directory: /home/developer/projects/cdh-ref/reference_data
library_root: /home/developer/projects/cdh-ref/cdh_lava_core_lib
script_directory:/home/developer/projects/cdh-ref/reference_data
Package cdh_ref is already installed.
absolute_path: 

Box(children=(Label(value='report_yyyy'), Dropdown(index=3, options=('2021', '2022', '2023', '2024'), value='2…

Box(children=(Label(value='report_mm'), Dropdown(index=2, options=('01', '02', '03', '04', '05', '06', '07', '…

Box(children=(Label(value='report_dd'), Dropdown(options=('NA', '01', '02', '03', '04', '05', '06', '07', '08'…

2024-03-06 21:58:57cdh_lava_core_lib:run_install_cdh_lava_core.pyjob_metadata164INFOconfig_jobs_path:/home/developer/projects/cdh-ref/reference_data/config/bronze_sps_config_jobs.csv
INFO:cdh_lava_core_lib:run_install_cdh_lava_core.py:config_jobs_path:/home/developer/projects/cdh-ref/reference_data/config/bronze_sps_config_jobs.csv
2024-03-06 21:58:57cdh_lava_core_lib:run_install_cdh_lava_core.pyjob_metadata190INFOjob_name_values_list:['Select job to run', 'process_analytics', 'process_data', 'process_data_where_source_abbreviation_name_is_phvs', 'process_ingress', 'process_ingress_where_source_abbreviation_name_is_athena', 'process_ingress_where_source_abbreviation_name_is_phvs']
INFO:cdh_lava_core_lib:run_install_cdh_lava_core.py:job_name_values_list:['Select job to run', 'process_analytics', 'process_data', 'process_data_where_source_abbreviation_name_is_phvs', 'process_ingress', 'process_ingress_where_source_abbreviation_name_is_athena', 'process_ingress_where_source_abbr

parameters: {'environment': 'dev', 'data_product_id_root': 'reference', 'data_product_id_individual': 'data', 'data_product_id': 'reference_data', 'yyyy': '2024', 'mm': '03', 'dd': 'NA', 'repository_path': '/home/developer/projects/cdh-ref/reference_data', 'dataset_name': 'all', 'cicd_action': 'pull_request', 'running_local': False, 'array_jobs': ['Select job to run', 'process_analytics', 'process_data', 'process_data_where_source_abbreviation_name_is_phvs', 'process_ingress', 'process_ingress_where_source_abbreviation_name_is_athena', 'process_ingress_where_source_abbreviation_name_is_phvs'], 'data_product_root_id': 'reference', 'data_product_individual_id': 'data'}


INFO:reference_data:running_local: False
INFO:reference_data:env_file_path: /home/developer/share/.env
INFO:reference_data:default_connection_string: InstrumentationKey=8f02ef9a-cd94-48cf-895a-367f102e8a24;IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/;LiveEndpoint=https://eastus.livediagnostics.monitor.azure.com/
INFO:reference_data:application_insights_connection_string: InstrumentationKey=8f02ef9a-cd94-48cf-895a-367f102e8a24;IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/;LiveEndpoint=https://eastus.livediagnostics.monitor.azure.com/
INFO:reference_data:dotenv_file: /home/developer/share/.env
2024-03-06 21:58:57cdh_lava_core_lib:run_install_cdh_lava_core.pyenvironment_metadata1265INFORetrieving Databricks secret for apps-client-secret.
INFO:cdh_lava_core_lib:run_install_cdh_lava_core.py:Retrieving Databricks secret for apps-client-secret.
2024-03-06 21:58:57cdh_lava_core_lib:run_install_cdh_lava_core.pyenvironment_metadata1266INFO

default_connection_string: InstrumentationKey=8f02ef9a-cd94-48cf-895a-367f102e8a24;IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/;LiveEndpoint=https://eastus.livediagnostics.monitor.azure.com/
application_insights_connection_string: InstrumentationKey=8f02ef9a-cd94-48cf-895a-367f102e8a24;IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/;LiveEndpoint=https://eastus.livediagnostics.monitor.azure.com/
dotenv_file: /home/developer/share/.env


2024-03-06 21:58:58cdh_lava_core_lib:run_install_cdh_lava_core.pysecurity_core162ERRORFile : /home/developer/.pyenv/versions/3.10.10/envs/REFERENCE_DATA_DEV/lib/python3.10/site-packages/msal/authority.py , Line : 165, Func.Name : canonicalize, Message : raise ValueError(, Type : <class 'ValueError'>, Value : Your given address (https://login.microsoftonline.com9ce70869-60db-44fd-abe8-d2767077fc8f) should consist of an https url with a minimum of one segment in a path: e.g. https://login.microsoftonline.com/{tenant} or https://{tenant_name}.ciamlogin.com/{tenant} or https://{tenant_name}.b2clogin.com/{tenant_name}.onmicrosoft.com/policy
ERROR:cdh_lava_core_lib:run_install_cdh_lava_core.py:File : /home/developer/.pyenv/versions/3.10.10/envs/REFERENCE_DATA_DEV/lib/python3.10/site-packages/msal/authority.py , Line : 165, Func.Name : canonicalize, Message : raise ValueError(, Type : <class 'ValueError'>, Value : Your given address (https://login.microsoftonline.com9ce70869-60db-44fd-ab

acquire_access_token_with_client_credentials for reference_data
az_sub_oauth_token_endpoint:https://login.microsoftonline.com/9ce70869-60db-44fd-abe8-d2767077fc8f
sp_client_id:e08bf725-02ed-4bb6-83dd-2211235be8b1
azure_databricks_resource_id:2ff814a6-3304-4ab8-85cb-cd0e6f879c1d
catalog_name: edav_dev_cdh
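
The ValueError logged above is MSAL rejecting an authority address in which the tenant ID was appended to the login host without a separating slash. A defensive way to compose that URL is sketched below; `build_authority_url` is a hypothetical helper and not the library's actual fix.

```python
def build_authority_url(login_base: str, tenant_id: str) -> str:
    """Join the login host and tenant ID with exactly one slash between them."""
    base = login_base.strip().rstrip("/")
    tenant = tenant_id.strip().strip("/")
    if not base.startswith("https://"):
        raise ValueError(f"authority base must be an https url: {login_base!r}")
    if not tenant:
        raise ValueError("tenant_id must not be empty")
    return f"{base}/{tenant}"


# Works whether or not the configured base ends with a slash.
print(build_authority_url("https://login.microsoftonline.com", "9ce70869-60db-44fd-abe8-d2767077fc8f"))
print(build_authority_url("https://login.microsoftonline.com/", "9ce70869-60db-44fd-abe8-d2767077fc8f"))
```

Both calls produce the same address as the az_sub_oauth_token_endpoint value printed above.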


2024-03-06 21:59:14cdh_lava_core_lib:run_install_cdh_lava_core.pydatabase76INFODatabase cdh_reference already exists.
INFO:cdh_lava_core_lib:run_install_cdh_lava_core.py:Database cdh_reference already exists.
2024-03-06 21:59:38cdh_lava_core_lib:run_install_cdh_lava_core.pyjob_metadata533INFOdf_datasets unfiltered count:10
INFO:cdh_lava_core_lib:run_install_cdh_lava_core.py:df_datasets unfiltered count:10
2024-03-06 21:59:39cdh_lava_core_lib:run_install_cdh_lava_core.pyenvironment_logging430ERROR('Error: %s', ResourceDoesNotExist('No file or directory exists on path abfss://cdh@davsynapseanalyticsdev.dfs.core.windows.net/raw/reference_data/config/bronze_sps_config_columns.csv.')): ResourceDoesNotExist: No file or directory exists on path abfss://cdh@davsynapseanalyticsdev.dfs.core.windows.net/raw/reference_data/config/bronze_sps_config_columns.csv.:   File "/home/developer/projects/cdh-ref/cdh_lava_core_lib/cdh_lava_core/cdc_tech_environment_service/environment_file.py",

ResourceDoesNotExist: No file or directory exists on path abfss://cdh@davsynapseanalyticsdev.dfs.core.windows.net/raw/reference_data/config/bronze_sps_config_columns.csv.
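
The run stops here because a required config CSV was never uploaded (checklist step 10). A pre-flight check along the following lines could report every missing file before the job starts. It is a sketch under assumptions: the helper names are hypothetical, the file list contains only the two config file names visible in this log, and the `exists` callable stands in for whatever storage client is actually used.

```python
import posixpath

# Only the two config files that appear in the log output above.
CONFIG_FILES = ["bronze_sps_config_jobs.csv", "bronze_sps_config_columns.csv"]


def config_paths(root: str, data_product_id: str, names=CONFIG_FILES) -> list:
    """Build <root>/<data_product_id>/config/<name> for each expected file."""
    return [posixpath.join(root, data_product_id, "config", name) for name in names]


def find_missing(paths, exists) -> list:
    """Return the paths for which exists(path) is false."""
    return [path for path in paths if not exists(path)]


root = "abfss://cdh@davsynapseanalyticsdev.dfs.core.windows.net/raw"
paths = config_paths(root, "reference_data")

# Stand-in for a real storage lookup: pretend only the first file was uploaded.
present = {paths[0]}
for path in find_missing(paths, present.__contains__):
    print("missing:", path)
```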