Databricks Job Runs Monitoring Tool

This script fetches and processes job run data from an Azure Databricks instance using the Databricks REST API. It extracts the relevant details for each job run, processes them, and outputs the results as a Pandas DataFrame and a CSV file.

Prerequisites

Before running this script, ensure you have the following:

  • Azure Databricks Instance: You need access to an Azure Databricks instance.
  • API Token: Generate an API token from your Databricks instance with appropriate permissions to access job run data.

Getting Started

  1. Install the required libraries using the following command:
pip install requests pandas
  2. Replace the placeholders in the code with your actual values (see the sketch below):
baseURI: Replace with your Azure Databricks instance URL.
apiToken: Replace with your API token.
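
For orientation, here is a minimal sketch of what that configuration could look like, assuming the variable names baseURI and apiToken used in the script (both values are placeholders, not real credentials):

baseURI = "https://adb-1234567890123456.7.azuredatabricks.net"  # your Azure Databricks instance URL
apiToken = "dapiXXXXXXXXXXXXXXXXXXXXXXXX"                       # personal access token generated in the workspace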

How the Script Works

The script starts by importing the necessary libraries: requests, pandas, math, datetime, and json.

Function fetch_and_process_job_runs

The script defines the function fetch_and_process_job_runs, which is responsible for fetching job run data via the Databricks API. The function takes three arguments:

  • base_uri: The base URL of your Databricks instance.
  • api_token: Your API token for authentication.
  • params: A dictionary containing query parameters, including start_time_from, start_time_to, and expand_tasks (see the example after this list).
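
As an illustration, params could look like the following; the Jobs API expects start_time_from and start_time_to as epoch timestamps in milliseconds, and the values shown are placeholders for a single day:

params = {
    "start_time_from": 1696636800000,  # 2023-10-07 00:00:00 UTC, epoch milliseconds
    "start_time_to": 1696723199000,    # 2023-10-07 23:59:59 UTC, epoch milliseconds
    "expand_tasks": "true",            # include task-level details in each run
}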

Inside the function:

  • An API request is made to the specified endpoint.
  • The response is processed to extract job run details.
  • Processed data is accumulated and transformed into a Pandas DataFrame.
  • Pagination is managed using the has_more field in the response (a condensed sketch follows this list).
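
A condensed sketch of what such a function could look like is shown below. The /api/2.1/jobs/runs/list endpoint, offset-based paging, and the column names are assumptions for illustration and may not match the actual script exactly:

import requests
import pandas as pd

def fetch_and_process_job_runs(base_uri, api_token, params):
    """Fetch job runs page by page and return them as a Pandas DataFrame."""
    headers = {"Authorization": f"Bearer {api_token}"}
    endpoint = f"{base_uri}/api/2.1/jobs/runs/list"
    rows, offset = [], 0
    while True:
        response = requests.get(endpoint, headers=headers,
                                params={**params, "offset": offset})
        response.raise_for_status()
        payload = response.json()
        runs = payload.get("runs", [])
        for run in runs:
            rows.append({
                "job_id": run.get("job_id"),
                "run_id": run.get("run_id"),
                "result_state": run.get("state", {}).get("result_state"),
                # execution_duration is reported in milliseconds
                "execution_duration_in_mins": round(run.get("execution_duration", 0) / 60000, 2),
            })
        offset += len(runs)
        # has_more signals that another page of results is available
        if not runs or not payload.get("has_more", False):
            break
    return pd.DataFrame(rows)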

Data Analysis and Output

After fetching and processing the job run data:

  • The resulting DataFrame is sorted based on the execution_duration_in_mins column in descending order.
  • The total execution time for all job runs is calculated and added as a row in the DataFrame.
  • The processed DataFrame is saved as a CSV file named jobs.csv.
  • The sorted DataFrame is printed to the console (see the sketch after this list).
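
Roughly, this post-processing could look like the following; the column names mirror the sketch above and may differ from those used in the actual script:

df = fetch_and_process_job_runs(baseURI, apiToken, params)

# Sort runs by duration, longest first
df = df.sort_values("execution_duration_in_mins", ascending=False).reset_index(drop=True)

# Append a summary row holding the total execution time across all runs
total_row = pd.DataFrame([{"job_id": "TOTAL",
                           "execution_duration_in_mins": df["execution_duration_in_mins"].sum()}])
df = pd.concat([df, total_row], ignore_index=True)

# Save to CSV and print the sorted results
df.to_csv("jobs.csv", index=False)
print(df)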

Usage

Make sure you have fulfilled the prerequisites and replaced the placeholder values in the code.

Run the script. It will fetch and process the job run data, display the sorted results, save them to jobs.csv, and print a Markdown table.
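
If the Markdown table is produced with pandas' built-in helper, it could be as simple as the line below (DataFrame.to_markdown requires the tabulate package to be installed):

# Render the sorted DataFrame as a Markdown table
print(df.to_markdown(index=False))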

Note: This script provides a basic example of how to fetch and process job runs data from Azure Databricks using the Databricks REST API. You can further enhance and customize the script to suit your specific use case and requirements.

Output

KPI Report for 2023-10-07:

  • Total Jobs: 160
  • Total Tasks: 214
  • Successful Tasks: 174
  • Failed Tasks: 15
  • Total Execution Time (mins): 1158
  • Average Execution Time (mins): 10.82
  • Min Execution Time (mins): 0
  • Max Execution Time (mins): 1158

Key Insights:

  1. Task Status Distribution:
{
    "SUCCESS": 174,
    "CANCELED": 24,
    "FAILED": 15
}
  2. Execution Duration Distribution:

    • Min: 0 mins
    • Max: 1158 mins
    • Average: 10.82 mins
  3. Jobs with Longest Execution Time:

    job_id           execution_duration_in_mins
    260792223809789  140
    74519312719017   93
    371241484431340  88
    655421446142082  85
    887636488212750  65
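
These figures could be derived from the runs DataFrame with standard pandas operations. The sketch below works at the run level and reuses the column names from the earlier examples, which may not match the actual script:

duration = df["execution_duration_in_mins"]
kpis = {
    "Total Jobs": df["job_id"].nunique(),
    "Total Runs": len(df),
    "Total Execution Time (mins)": duration.sum(),
    "Average Execution Time (mins)": round(duration.mean(), 2),
    "Min Execution Time (mins)": duration.min(),
    "Max Execution Time (mins)": duration.max(),
}

# Result-state distribution, e.g. {"SUCCESS": ..., "CANCELED": ..., "FAILED": ...}
status_distribution = df["result_state"].value_counts().to_dict()

# Five runs with the longest execution time
top_runs = df.nlargest(5, "execution_duration_in_mins")[["job_id", "execution_duration_in_mins"]]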

