<a href="https://colab.research.google.com/github/tianyuenyt/MO-PCDE_M9_final_assignment/blob/main/BigQuery_Cloud_Cost_Management_Self_Service_Tool.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📌 **BigQuery Cloud Cost Management Self Service Tool**
This tool is designed to help BigQuery users track cloud cost trends, identify optimization opportunities, and improve cost efficiency.

## How It Works  
This tool consists of 3 parts to help analyze and optimize BigQuery costs:
1. **Billing Overview** is the starting point for trend analysis, providing the most accurate cost data through project-level reports.
2. **Compute Cost** provides user-level, job-level, and data product (destination table) costs, which are estimated based on [`INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION`](https://cloud.google.com/bigquery/docs/information-schema-jobs-by-organization).
3. **Storage Cost** includes dataset-level and table-level costs, estimated based on [`INFORMATION_SCHEMA.TABLE_STORAGE`](https://cloud.google.com/bigquery/docs/information-schema-table-storage).

Follow the prompts in each cell to enter a `date range`, `project name`, and `other granular filters` to refine the data.

---
📩 Need Help?  
If you encounter **Permission Issues** or have any questions about this tool and analysis, please contact the *Platform Analytics* Team:  
- Tommy: tommy.wu@nytimes.com
- Tian: tian.yue@nytimes.com
---

---
## 💰 Billing Overview
---

In [40]:
#@title **Billing Overview**
#@markdown 👈 **Click this Play button** to start.
#@markdown
#@markdown - If no **Date Range** is entered, the query will **automatically load data from the first day of the previous quarter through yesterday**.

from google.colab import auth
import pandas as pd
import plotly.express as px
import ipywidgets as widgets
import google.colab.data_table
from google.colab import files
from google.cloud import bigquery
from datetime import datetime, timedelta

# --- Authenticate & Init ---
auth.authenticate_user()
project_id = "nyt-platform-analytics-dbt"
client = bigquery.Client(project=project_id)

# --- Default Date Helper ---
def get_default_date_range():
    """Return (start, end) strings: first day of the previous quarter through yesterday."""
    today = datetime.today()
    # First month of the current quarter: 1, 4, 7, or 10
    current_quarter_month = 3 * ((today.month - 1) // 3) + 1
    if current_quarter_month == 1:
        # Previous quarter is Q4 of the prior year
        start = datetime(today.year - 1, 10, 1)
    else:
        start = datetime(today.year, current_quarter_month - 3, 1)
    return start.strftime('%Y-%m-%d'), (today - timedelta(days=1)).strftime('%Y-%m-%d')

# --- Date Input ---
start_date_input = input("Enter start date (YYYY-MM-DD) or press Enter to use default: ").strip()
end_date_input = input("Enter end date (YYYY-MM-DD) or press Enter to use default: ").strip()
default_start, default_end = get_default_date_range()
start_date = start_date_input if start_date_input else default_start
end_date = end_date_input if end_date_input else default_end

# --- Date Validation ---
try:
    start_date_dt = datetime.strptime(start_date, "%Y-%m-%d")
    end_date_dt = datetime.strptime(end_date, "%Y-%m-%d")
    if end_date_dt < start_date_dt:
        raise ValueError("End date cannot be earlier than start date.")
    if end_date_dt > datetime.today():
        raise ValueError("End date cannot be in the future.")
except ValueError as e:
    raise ValueError(f"Invalid date input: {e}")

# --- Query Billing Data ---
billing_query = f"""
SELECT
  _pt AS usage_date,
  project_id,
  CASE
    WHEN service_description = 'BigQuery' AND sku_description LIKE '%Storage%' THEN 'Storage'
    WHEN service_description = 'BigQuery' AND sku_description LIKE '%Analysis%' THEN 'Compute'
    WHEN service_description = 'BigQuery Reservation API' THEN 'Compute'
    ELSE 'BigQuery Others'
  END AS label,
  ROUND(SUM(cost), 2) AS cost,
  COALESCE(mapping.team, 'Untagged') AS team,
  COALESCE(mapping.mission, 'Untagged') AS mission
FROM `nyt-platform-analytics-dbt.prod.stg_bq__invoiced_costs` billing
LEFT JOIN `nyt-platform-analytics-dbt.dbt_tyue.seeds__finout_project_owner_budget_2025` mapping
  ON billing.project_id = mapping.application
WHERE _pt BETWEEN '{start_date}' AND '{end_date}'
GROUP BY ALL
ORDER BY usage_date, project_id, label
"""


try:
    df = client.query(billing_query).to_dataframe()
    # print(f"✅ Billing data loaded: {len(df)} rows from {start_date} to {end_date}")
except Exception as e:
    print(f"❌ Failed to load billing data: {e}")
    df = pd.DataFrame()

# --- If data is available ---
if not df.empty:
    df["usage_date"] = pd.to_datetime(df["usage_date"])
    df["year_month"] = df["usage_date"].dt.to_period("M").astype(str)

    color_mapping = {
        "Compute": "#9BB4C1",
        "Storage": "#FFC440",
        "BigQuery Others": "#CFC4B6"
    }

    # --- Chart 1: Monthly Stacked Bar ---
    monthly_cost = df.groupby(["year_month", "label"])["cost"].sum().reset_index()
    monthly_cost["cost_formatted"] = monthly_cost["cost"].apply(lambda x: f"${x:,.2f}")

    fig_bar = px.bar(
        monthly_cost,
        x="year_month",
        y="cost",
        color="label",
        text="cost_formatted",
        title="📊 Monthly Cost Breakdown by Label",
        labels={"year_month": "Month", "cost": "Cost (USD)", "label": "Label"},
        barmode="stack",
        hover_data={"cost": ":$.2f"},
        color_discrete_map=color_mapping
    )
    fig_bar.update_traces(texttemplate='%{text}', textposition='outside')
    fig_bar.update_layout(
        xaxis=dict(tickangle=-45),
        yaxis_tickprefix="$",
        legend_title="Label",
        hovermode="x unified"
    )
    fig_bar.show()

    # --- Chart 2: Daily Cost Line ---
    daily_cost = df.groupby(["usage_date", "label"])["cost"].sum().reset_index()
    daily_cost["cost_formatted"] = daily_cost["cost"].apply(lambda x: f"${x:,.2f}")

    fig_line = px.line(
        daily_cost,
        x="usage_date",
        y="cost",
        color="label",
        title="📈 Day-to-Day Cost Trend by Label",
        labels={"usage_date": "Date", "cost": "Cost (USD)", "label": "Label"},
        hover_data={"cost": ":$.2f"},
        color_discrete_map=color_mapping,
        markers=True
    )
    fig_line.update_layout(
        xaxis_title="Date",
        yaxis_title="Cost (USD)",
        xaxis=dict(tickangle=-45),
        legend_title="Label",
        yaxis_tickprefix="$",
        hovermode="x unified"
    )
    fig_line.show()

    # --- Pivot: Monthly Cost by Project ---
    project_monthly_cost = (
        df.groupby(["year_month", "project_id", "label", "team", "mission"])["cost"]
        .sum()
        .reset_index()
        .rename(columns={"cost": "monthly_cost"})
    )
    project_monthly_cost["monthly_cost"] = project_monthly_cost["monthly_cost"].round(2)

    pivot_table = project_monthly_cost.pivot(
        index=["project_id", "label", "team", "mission"],
        columns="year_month",
        values="monthly_cost"
    ).fillna(0).reset_index()

    pivot_table = pivot_table.loc[:, ~pivot_table.columns.duplicated()]
    valid_months = sorted([col for col in pivot_table.columns if isinstance(col, str) and col.startswith("202")])
    ordered_columns = ["project_id", "label"] + valid_months + ["team", "mission"]
    pivot_table = pivot_table[ordered_columns]

    display(google.colab.data_table.DataTable(pivot_table))

    # --- Export Button ---
    def download_file(change):
        file_name = f"/content/monthly_billing_data_for_all_projects_from_{start_date}_to_{end_date}.xlsx"
        pivot_table.to_excel(file_name, index=False)
        print(f"\n✅ Excel file saved: {file_name}")
        files.download(file_name)

    download_button = widgets.Button(description="💾 Download as Excel")
    download_button.on_click(download_file)
    display(download_button)

else:
    print("⚠️ No data found in this date range.")


Enter start date (YYYY-MM-DD) or press Enter to use default: 
Enter end date (YYYY-MM-DD) or press Enter to use default: 


year_month,project_id,label,2025-01,2025-02,2025-03,2025-04,2025-05,team,mission
0,ai-strat-ops,Compute,0.00,0.00,0.00,0.00,0.00,Untagged,Untagged
1,ai-strat-ops,Storage,0.00,0.00,0.00,0.00,0.00,Untagged,Untagged
2,aristo-sadp-dev,Storage,0.00,0.00,0.00,0.00,0.00,Domain Data Products,Data Platform
3,aristo-sadp-prd,Compute,0.00,0.00,3.80,0.44,0.00,Domain Data Products,Data Platform
4,bisque-dbt-bridge-prd,Compute,3.06,54.95,51.14,265.03,23.76,Warehouse Platform,Data Platform
...,...,...,...,...,...,...,...,...,...
750,uxf-dev-4766,Compute,0.00,0.00,0.09,0.45,0.00,Untagged,Untagged
751,wc-ga-167813,Storage,13.33,13.43,13.33,13.50,0.27,Untagged,Untagged
752,webstore-analytics-export-prd,BigQuery Others,0.11,0.01,0.00,0.01,0.00,Marketing Analytics,Marketing
753,webstore-analytics-export-prd,Compute,0.13,0.01,0.05,0.03,0.01,Marketing Analytics,Marketing


Button(description='💾 Download as Excel', style=ButtonStyle())
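
The default date range used by the Billing Overview cell above (first day of the previous quarter through yesterday) can be sketched as a pure function. This is a minimal sketch rather than the notebook's own code: the name `default_date_range` and the injectable `today` argument are assumptions, added so the logic can be checked without depending on the current date.

```python
from datetime import date, timedelta


def default_date_range(today: date) -> tuple[str, str]:
    """Return (start, end) as YYYY-MM-DD strings.

    start = first day of the quarter before the one containing `today`
    end   = the day before `today`
    """
    # First month of the current quarter: 1, 4, 7, or 10.
    quarter_month = 3 * ((today.month - 1) // 3) + 1
    if quarter_month == 1:
        # The previous quarter is Q4 of the prior year.
        start = date(today.year - 1, 10, 1)
    else:
        start = date(today.year, quarter_month - 3, 1)
    end = today - timedelta(days=1)
    return start.isoformat(), end.isoformat()


print(default_date_range(date(2025, 5, 2)))  # ('2025-01-01', '2025-05-01')
```

Passing `today` explicitly keeps the function deterministic, which makes quarter-boundary cases such as January 1 easy to verify.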

---
## 📈 Project Billing
---

In [50]:
#@title **Project Billing - Filter Project & Date Range**
#@markdown 👈 **Click this Play button** to start.
#@markdown
#@markdown Now, enter a `Project Name` and `Date Range` to explore the Daily Costs distribution in more detail.


from google.cloud import bigquery
from datetime import datetime
import pandas as pd
import plotly.express as px
import google.colab.data_table
import ipywidgets as widgets
from google.colab import files

assert 'df' in globals(), "❌ Billing data not found. Please run the Billing Overview cell first."
assert 'start_date' in globals() and 'end_date' in globals(), "❌ Date range missing. Please run the Billing Overview cell first."

project_id_input = ""
while not project_id_input:
    project_id_input = input("Please enter a valid Project ID to continue: ").strip()
    if not project_id_input:
        print("⚠️ Project ID is required. Please try again.")

subset_start_date_input = input(f"Enter subset start date (YYYY-MM-DD) or press Enter to use {start_date}: ").strip()
subset_end_date_input = input(f"Enter subset end date (YYYY-MM-DD) or press Enter to use {end_date}: ").strip()

subset_start_date = subset_start_date_input if subset_start_date_input else start_date
subset_end_date = subset_end_date_input if subset_end_date_input else end_date

try:
    subset_start_dt = pd.to_datetime(subset_start_date)
    subset_end_dt = pd.to_datetime(subset_end_date)
    if subset_end_dt < subset_start_dt:
        raise ValueError("End date cannot be earlier than start date.")
except Exception as e:
    raise ValueError(f"Invalid date: {e}")

billing_df = df.copy()
billing_df["usage_date"] = pd.to_datetime(billing_df["usage_date"])

filtered_billing_df = billing_df.loc[
    (billing_df["usage_date"] >= pd.to_datetime(subset_start_date)) &
    (billing_df["usage_date"] <= pd.to_datetime(subset_end_date)) &
    (billing_df["project_id"] == project_id_input)
].copy()

filtered_billing_df["source"] = "Billing"

# print(f"✅ Billing data filtered: {len(filtered_billing_df)} rows")

if 'client' not in globals():
    from google.cloud import bigquery
    client = bigquery.Client()

# Compute
print(f"\nFetching Billing, Compute and Storage data for '{project_id_input}' from {subset_start_date} to {subset_end_date}...")

compute_query = f"""
WITH base AS (
  SELECT
    DATE(start_time, 'America/New_York') AS job_start_date,
    user_email,
    CASE
      WHEN reservation_id LIKE '%.%' THEN SPLIT(reservation_id, '.')[OFFSET(1)]
      ELSE reservation_id
    END AS reservation_name,
    project_id,
    job_id,
    job_type,
    statement_type,
    destination_table.project_id AS destination_project_id,
    destination_table.dataset_id AS destination_dataset_id,
    destination_table.table_id AS destination_table_id,
    total_slot_ms,
    total_bytes_billed
  FROM `nyt-platform-analytics-dbt.prod.stg_bq__jobs_by_organization`
  WHERE _pt BETWEEN '{subset_start_date}' AND '{subset_end_date}'
    AND project_id = '{project_id_input}'
)

SELECT
  job_start_date,
  user_email,
  reservation_name,
  CASE WHEN reservation_name IS NULL THEN 'Analysis' ELSE 'Reservation' END AS current_pricing_model,
  project_id,
  job_id,
  job_type,
  destination_project_id,
  destination_dataset_id,
  destination_table_id,
  SUM(total_slot_ms/1000/60/60) AS period_slot_hr,
  SUM(total_bytes_billed/POWER(2, 40)) AS total_tb_billed,
  SUM(
    CASE
      WHEN reservation_name IS NULL THEN total_bytes_billed/POWER(2, 40) * 5.1875
      ELSE total_slot_ms/1000/60/60 * 0.0522
    END
  ) AS cost,
  SUM(
    CASE
      WHEN reservation_name IS NOT NULL THEN total_bytes_billed/POWER(2, 40) * 5.1875
      ELSE total_slot_ms/1000/60/60 * 0.0522
    END
  ) AS alternative_cost,
  bqutil.fn.job_url(project_id || ':us.' || job_id) AS job_url
FROM base
WHERE NOT (reservation_name IS NULL AND statement_type = 'SCRIPT')
GROUP BY ALL
ORDER BY job_start_date DESC
"""

try:
    compute_df = client.query(compute_query).to_dataframe()
    compute_df["source"] = "Compute"
    # print(f"✅ Compute data ready: {len(compute_df)} rows")
except Exception as e:
    print(f"❌ Compute query failed: {e}")
    compute_df = pd.DataFrame()

# Storage

storage_query = f"""
SELECT
  DATE_TRUNC(_pt, MONTH) AS month,
  project_id,
  table_schema AS dataset_id,
  table_name AS table_id,
  creation_time,
  table_type,
  ROUND(SUM(active_physical_cost), 2) AS active_physical_cost,
  ROUND(SUM(long_term_physical_cost), 2) AS long_term_physical_cost,
  ROUND(SUM(active_logical_cost), 2) AS active_logical_cost,
  ROUND(SUM(long_term_logical_cost), 2) AS long_term_logical_cost
FROM `nyt-platform-analytics-dbt.prd_bigquery.int_bq__table_daily_storage_costs`
WHERE _pt BETWEEN '{subset_start_date}' AND '{subset_end_date}'
  AND project_id = '{project_id_input}'
GROUP BY ALL
ORDER BY long_term_physical_cost DESC
"""

try:
    storage_df = client.query(storage_query).to_dataframe()
    storage_df["source"] = "Storage"
    # print(f"✅ Storage data ready: {len(storage_df)} rows")
except Exception as e:
    print(f"❌ Storage query failed: {e}")
    storage_df = pd.DataFrame()

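if compute_df.empty or storage_df.empty:
    print("\n⚠️ Data pull finished, but Compute and/or Storage returned no rows. Check any error messages above.")
else:
    print("\nAll data pulled and filtered successfully.")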

if not filtered_billing_df.empty:
    daily_cost_filtered = (
        filtered_billing_df
        .groupby(["project_id", "usage_date", "label"])["cost"]
        .sum()
        .reset_index()
        .sort_values("usage_date")
    )

    color_mapping = {
        "Compute": "#9BB4C1",
        "Storage": "#FFC440",
        "BigQuery Others": "#CFC4B6"
    }

    fig = px.line(
        daily_cost_filtered,
        x="usage_date",
        y="cost",
        color="label",
        title=f"Daily Billing Cost Trend for {project_id_input}",
        labels={"usage_date": "Date", "cost": "Cost (USD)", "label": "Label"},
        hover_data={"cost": ":$.2f"},
        color_discrete_map=color_mapping,
        markers=True
    )

    fig.update_layout(
        xaxis_title="Date",
        yaxis_title="Cost (USD)",
        legend_title="Label",
        xaxis=dict(tickangle=-45),
        yaxis_tickprefix="$",
        hovermode="x unified"
    )

    fig.show()
    display(google.colab.data_table.DataTable(daily_cost_filtered))

    # Download
    def download_file(change):
        """Download filtered billing data as Excel"""
        project_name = project_id_input
        subset_start = subset_start_date_input if subset_start_date_input else start_date
        subset_end = subset_end_date_input if subset_end_date_input else end_date
        file_name = f"/content/daily_billing_data_for_{project_name}_from_{subset_start}_to_{subset_end}.xlsx"

        daily_cost_filtered.to_excel(file_name, index=False)
        print(f"\n✅ Excel file saved successfully: {file_name}")
        files.download(file_name)

    download_button = widgets.Button(description="💾 Download as Excel")
    download_button.on_click(download_file)
    display(download_button)

else:
    print("⚠️ No billing data available to visualize or download.")

Please enter a valid Project ID to continue: nyt-algo-recs-dbt
Enter subset start date (YYYY-MM-DD) or press Enter to use 2025-01-01: 
Enter subset end date (YYYY-MM-DD) or press Enter to use 2025-05-01: 

Fetching Billing, Compute and Storage data for 'nyt-algo-recs-dbt' from 2025-01-01 to 2025-05-01...

All data pulled and filtered successfully.


Unnamed: 0,project_id,usage_date,label,cost
0,nyt-algo-recs-dbt,2025-01-01,Compute,14.22
1,nyt-algo-recs-dbt,2025-01-01,Storage,43.48
2,nyt-algo-recs-dbt,2025-01-02,Compute,49.96
3,nyt-algo-recs-dbt,2025-01-02,Storage,43.64
4,nyt-algo-recs-dbt,2025-01-03,Compute,22.42
...,...,...,...,...
237,nyt-algo-recs-dbt,2025-04-29,Storage,16.61
238,nyt-algo-recs-dbt,2025-04-30,Compute,113.83
239,nyt-algo-recs-dbt,2025-04-30,Storage,16.64
240,nyt-algo-recs-dbt,2025-05-01,Compute,163.48


Button(description='💾 Download as Excel', style=ButtonStyle())
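
The Project Billing cell above places the entered project ID directly into the SQL text. One way to make that safer is to validate the input before it reaches the query. The sketch below is an illustration, not part of the notebook: the helper name `validate_project_id` is hypothetical, and the pattern assumes standard Google Cloud project IDs (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen). Legacy domain-scoped IDs containing a colon are not handled.

```python
import re

# Assumed format: 6-30 chars, lowercase letters, digits, hyphens,
# starts with a letter, does not end with a hyphen.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def validate_project_id(raw: str) -> str:
    """Return the stripped project ID, or raise ValueError if it looks malformed."""
    project_id = raw.strip()
    if not _PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError(f"Not a valid project ID: {project_id!r}")
    return project_id


print(validate_project_id(" nyt-algo-recs-dbt "))  # nyt-algo-recs-dbt
```

An alternative is to pass the value as a BigQuery query parameter (for example `bigquery.ScalarQueryParameter` supplied through `bigquery.QueryJobConfig(query_parameters=[...])`), which keeps user input out of the SQL string entirely.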

---
## 📊 Compute Cost  
---

In [None]:
#@title **README**
#@markdown BigQuery offers 2 Compute pricing models:
#@markdown - **Analysis-Based**: Charged by data volume processed.
#@markdown - **Reservation-Based**: Slot-based pricing, where compute capacity is allocated in slots, and costs adjust dynamically based on actual usage.
#@markdown
#@markdown The data in this section is an approximation. Analysis-based cost estimates are 99.99% aligned with billing data. Reservation costs are harder to estimate because they are dynamic, so those estimates are less accurate.
#@markdown
#@markdown 💡 Tip: Switching to the optimal model can reduce costs. Estimated potential savings are provided.
#@markdown Please contact *DBRE* Team (`#bigquery-admin` channel) for pricing change inquiries.
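
The comparison between the two pricing models can be sketched in plain Python. This is a minimal illustration of the arithmetic used by the compute query in this notebook, not an official pricing calculator: the two rates are copied from that query and should be treated as notebook-specific assumptions, and the function name `estimate_job_cost` is hypothetical.

```python
# Rates copied from the compute query in this notebook (assumptions,
# not official list prices).
ANALYSIS_USD_PER_TIB = 5.1875
RESERVATION_USD_PER_SLOT_HOUR = 0.0522


def estimate_job_cost(total_bytes_billed: int, total_slot_ms: int, on_reservation: bool) -> dict:
    """Estimate a job's cost under its current model and under the alternative."""
    tib_billed = total_bytes_billed / 2**40
    slot_hours = total_slot_ms / 1000 / 60 / 60
    analysis_cost = tib_billed * ANALYSIS_USD_PER_TIB
    reservation_cost = slot_hours * RESERVATION_USD_PER_SLOT_HOUR
    if on_reservation:
        cost, alternative = reservation_cost, analysis_cost
    else:
        cost, alternative = analysis_cost, reservation_cost
    # Positive savings: the alternative model would have been cheaper.
    # Negative savings: the job is already on the cheaper model.
    return {
        "cost": cost,
        "alternative_cost": alternative,
        "potential_savings": cost - alternative,
    }


# A job that billed 1 TiB and used 10 slot-hours, currently on analysis pricing.
print(estimate_job_cost(total_bytes_billed=2**40, total_slot_ms=10 * 3_600_000, on_reservation=False))
```

In this example the analysis-based cost is far higher than the slot-based estimate, so the potential savings value is positive, which is the case where a pricing change may be worth raising.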




In [51]:
#@title **Compute Part I - MoM User Level Cost Overview**
#@markdown 👈 **Click this Play button** to start.



import pandas as pd
import google.colab.data_table
import ipywidgets as widgets
from google.colab import files

# ANSI Color Code for Red
RED = "\033[31m"
RESET = "\033[0m"  # Reset color to default

# Ensure compute_df has necessary columns
if not compute_df.empty:
    compute_user_df = compute_df.copy()

    compute_user_df["job_start_date"] = pd.to_datetime(compute_user_df["job_start_date"])
    compute_user_df["year_month"] = compute_user_df["job_start_date"].dt.to_period("M").astype(str)

    compute_user_df["potential_savings_from_alternative_pricing"] = \
        compute_user_df["cost"] - compute_user_df["alternative_cost"]

    compute_user_df = compute_user_df.groupby([
        "year_month", "project_id", "current_pricing_model", "user_email"
    ])[["cost", "potential_savings_from_alternative_pricing"]].sum().reset_index()

    compute_user_df[["cost", "potential_savings_from_alternative_pricing"]] = \
        compute_user_df[["cost", "potential_savings_from_alternative_pricing"]].round(2)

    compute_user_df = compute_user_df.sort_values(
        by=["year_month", "potential_savings_from_alternative_pricing"], ascending=[False, False]
    )


    print(f"{RED}⚠️ WARNING: If you see 'potential_savings_from_alternative_pricing' as a negative value, it means you are already on the optimal pricing.{RESET}")
    display(google.colab.data_table.DataTable(compute_user_df, include_index=False))

    # --- Download Button ---
    def download_file(change):
        """ Function to handle file download when button is clicked """
        project_name = project_id_input if project_id_input else "all_projects"
        file_name = f"/content/high_compute_costs_jobs_for_{project_name}.xlsx"
        compute_user_df.to_excel(file_name, index=False)
        print(f"\n✅ Excel file saved successfully: {file_name}")
        files.download(file_name)

    download_button = widgets.Button(description="💾 Download as Excel")
    download_button.on_click(download_file)
    display(download_button)

else:
    print("No compute job data available for aggregation.")





Unnamed: 0,year_month,project_id,current_pricing_model,user_email,cost,potential_savings_from_alternative_pricing
25,2025-05,nyt-algo-recs-dbt,Analysis,aditi.sarkar@nytimes.com,0.0,0.0
26,2025-05,nyt-algo-recs-dbt,Analysis,nyt-algo-recs-dbt@nyt-algo-recs-dbt.iam.gservi...,0.0,0.0
28,2025-05,nyt-algo-recs-dbt,Reservation,nyt-algo-recs-dbt@nyt-algo-recs-dbt.iam.gservi...,135.69,-67.15
27,2025-05,nyt-algo-recs-dbt,Reservation,aditi.sarkar@nytimes.com,99.04,-111.18
21,2025-04,nyt-algo-recs-dbt,Reservation,aditi.sarkar@nytimes.com,147.68,139.96
22,2025-04,nyt-algo-recs-dbt,Reservation,james.schintz@nytimes.com,61.86,49.66
18,2025-04,nyt-algo-recs-dbt,Analysis,aditi.sarkar@nytimes.com,0.0,0.0
19,2025-04,nyt-algo-recs-dbt,Analysis,nyt-algo-recs-dbt@nyt-algo-recs-dbt.iam.gservi...,0.0,0.0
20,2025-04,nyt-algo-recs-dbt,Analysis,zach.davis@nytimes.com,0.0,0.0
24,2025-04,nyt-algo-recs-dbt,Reservation,zach.davis@nytimes.com,0.75,-6.85


Button(description='💾 Download as Excel', style=ButtonStyle())

In [52]:
#@title **Compute Part II - MoM Dataset Level Cost Overview**
#@markdown 👈 **Click this Play button** to start.
#@markdown
#@markdown Click **💾 Download as Excel** to save the data locally.

import pandas as pd
import google.colab.data_table
import ipywidgets as widgets
from google.colab import files

# Ensure compute_df has necessary columns
if not compute_df.empty:
    dataset_compute_df = compute_df.copy()

    dataset_compute_df["job_start_date"] = pd.to_datetime(dataset_compute_df["job_start_date"])

    dataset_compute_df["year_month"] = dataset_compute_df["job_start_date"].dt.to_period("M").astype(str)

    dataset_compute_df = dataset_compute_df.groupby(
        ["year_month", "project_id", "destination_project_id", "destination_dataset_id"]
    )["cost"].sum().reset_index()

    dataset_compute_df["cost"] = dataset_compute_df["cost"].round(2)

    dataset_compute_df = dataset_compute_df.sort_values(
        by=["year_month", "cost"], ascending=[False, False]
    )

    display(google.colab.data_table.DataTable(dataset_compute_df, include_index=False))

    # --- Download Button ---
    def download_file(change):
        """ Function to handle file download when button is clicked """
        project_name = project_id_input if project_id_input else "all_projects"
        file_name = f"/content/high_compute_costs_datasets_for_{project_name}.xlsx"

        dataset_compute_df.to_excel(file_name, index=False)
        print(f"\n✅ Excel file saved successfully: {file_name}")
        files.download(file_name)

    download_button = widgets.Button(description="💾 Download as Excel")
    download_button.on_click(download_file)

    display(download_button)

else:
    print("No compute job data available for aggregation.")


Unnamed: 0,year_month,project_id,destination_project_id,destination_dataset_id,cost
186,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,dbt_asarkar_algo_experiment_reporting,97.25
194,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,dbt_cloud_pr_711785_86_intermediate,59.28
201,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,59.17
202,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_ufn_batch,3.34
197,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_algo_first_expose,2.90
...,...,...,...,...,...
40,2025-01,nyt-algo-recs-dbt,nyt-algo-recs-dbt,dbt_cloud_pr_711785_58,0.00
41,2025-01,nyt-algo-recs-dbt,nyt-algo-recs-dbt,dbt_cloud_pr_711785_59,0.00
42,2025-01,nyt-algo-recs-dbt,nyt-algo-recs-dbt,dbt_cloud_pr_711785_59_ufn_batch,0.00
43,2025-01,nyt-algo-recs-dbt,nyt-algo-recs-dbt,dbt_zdavis,0.00


Button(description='💾 Download as Excel', style=ButtonStyle())

In [55]:
#@title **Compute Part III - MoM Table Level Cost Overview**
#@markdown 👈 **Click this Play button** to start.
#@markdown
#@markdown Enter a `destination_dataset_id` to explore table-level compute costs, displaying only the top 200 highest-cost tables per dataset per month.
#@markdown
#@markdown Click **💾 Download as Excel** to save the data locally.


import pandas as pd
import google.colab.data_table
import ipywidgets as widgets
from google.colab import files

# User input for destination_table_dataset_id filter
destination_dataset_input = input("Enter a destination dataset (destination_table_dataset_id) to filter (or press Enter to include all): ").strip()

if not compute_df.empty:
    table_compute_df = compute_df.copy()

    table_compute_df["job_start_date"] = pd.to_datetime(table_compute_df["job_start_date"])

    table_compute_df["year_month"] = table_compute_df["job_start_date"].dt.to_period("M").astype(str)

    if destination_dataset_input:
        table_compute_df = table_compute_df[table_compute_df["destination_dataset_id"] == destination_dataset_input]

    table_compute_df = table_compute_df.groupby(
        ["year_month", "project_id", "destination_project_id", "destination_dataset_id", "destination_table_id"]
    )["cost"].sum().reset_index()

    table_compute_df["cost"] = table_compute_df["cost"].round(2)

    table_compute_df = table_compute_df.sort_values(by=["year_month", "cost"], ascending=[False, False])

    # Keep only the top 200 highest-cost tables per dataset per month
    table_compute_df = table_compute_df.groupby(["year_month", "destination_dataset_id"]).head(200)

    display(google.colab.data_table.DataTable(table_compute_df, include_index=False))

    # --- Download Button ---
    def download_file(change):
        """ Function to handle file download when button is clicked """
        project_name = project_id_input if project_id_input else "all_projects"
        dataset_name = destination_dataset_input if destination_dataset_input else "all_datasets"
        file_name = f"/content/high_compute_costs_tables_for_{project_name}_{dataset_name}.xlsx"

        table_compute_df.to_excel(file_name, index=False)
        print(f"\n✅ Excel file saved successfully: {file_name}")
        files.download(file_name)

    download_button = widgets.Button(description="💾 Download as Excel")
    download_button.on_click(download_file)

    display(download_button)

else:
    print("No compute job data available for aggregation.")


Enter a destination dataset (destination_table_dataset_id) to filter (or press Enter to include all): 	prod_intermediate


Unnamed: 0,year_month,project_id,destination_project_id,destination_dataset_id,destination_table_id,cost
200,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,int_monthly_etsor_agg__dbt_tmp,43.90
184,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,dim_user_history_behavior__dbt_tmp,4.30
196,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,int_holdout_monthly_engagement__dbt_tmp,4.09
190,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,int_content_meta_and_engagement_daily__dbt_tmp,1.89
199,2025-05,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,int_monthly_etsor_agg,1.40
...,...,...,...,...,...,...
25,2025-01,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,int_user_email_click_embeddings,0.00
26,2025-01,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,int_user_embedding_features,0.00
27,2025-01,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,int_user_geo_features,0.00
28,2025-01,nyt-algo-recs-dbt,nyt-algo-recs-dbt,prod_intermediate,int_user_ufn_click_history_features,0.00


Button(description='💾 Download as Excel', style=ButtonStyle())
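
The "top N rows per group" step in the table-level view relies on sorting by cost first and then calling `groupby(...).head(n)`. The sketch below uses made-up data and a hypothetical helper name, `top_n_per_group`, to show how the choice of grouping columns decides what "per" means.

```python
import pandas as pd

# Made-up costs for illustration only.
costs = pd.DataFrame({
    "year_month": ["2025-05"] * 4 + ["2025-04"] * 2,
    "destination_dataset_id": ["a", "a", "a", "b", "a", "a"],
    "destination_table_id": ["t1", "t2", "t3", "t4", "t1", "t2"],
    "cost": [1.0, 9.0, 5.0, 2.0, 3.0, 7.0],
})


def top_n_per_group(df: pd.DataFrame, group_cols: list, n: int) -> pd.DataFrame:
    """Sort newest month first and highest cost first, then keep the first n rows of each group."""
    ordered = df.sort_values(by=["year_month", "cost"], ascending=[False, False])
    return ordered.groupby(group_cols).head(n)


# Top 2 per month: dataset "b" is crowded out of 2025-05.
per_month = top_n_per_group(costs, ["year_month"], n=2)
# Top 2 per dataset per month: dataset "b" keeps its own row.
per_dataset_month = top_n_per_group(costs, ["year_month", "destination_dataset_id"], n=2)

print(per_month["destination_table_id"].tolist())
print(per_dataset_month["destination_table_id"].tolist())
```

Because `head(n)` preserves the sorted order, the result stays ranked by month and cost without a second sort.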

In [56]:
#@title **Compute Part IV - Job Level Cost Drilldown Tool**
#@markdown Live filtering by User, Date, Dataset, and Table

import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output
import google.colab.data_table
from google.colab import files

RED = "\033[31m"
BOLD = "\033[1m"
RESET = "\033[0m"

if not compute_df.empty:
    compute_df["job_start_date"] = pd.to_datetime(compute_df["job_start_date"])

    # Initiate Dropdown value options
    unique_users = sorted(compute_df["user_email"].dropna().unique())
    unique_datasets = sorted(compute_df["destination_dataset_id"].dropna().unique())
    unique_tables = sorted(compute_df["destination_table_id"].dropna().unique())

    start_picker = widgets.DatePicker(description="Start date")
    end_picker = widgets.DatePicker(description="End date")
    user_dropdown = widgets.Dropdown(options=["All"] + unique_users, description="User:")
    dataset_dropdown = widgets.Dropdown(options=["All"] + unique_datasets, description="Dataset:")
    table_dropdown = widgets.Dropdown(options=["All"] + unique_tables, description="Table:")

    output_table = widgets.Output()
    download_button = widgets.Button(description="💾 Download as Excel")
    final_df_cache = {"df": pd.DataFrame()}

    def filter_df_for_dropdowns():
        df = compute_df.copy()
        if start_picker.value and end_picker.value:
            start = pd.Timestamp(start_picker.value)
            end = pd.Timestamp(end_picker.value)
            if start <= end:
                df = df[(df["job_start_date"] >= start) & (df["job_start_date"] <= end)]
        return df

    def refresh_user_and_dataset_table_dropdowns(change=None):
        """Update user, dataset, and table dropdowns based on current filters."""
        base_df = filter_df_for_dropdowns()

        # Refresh user dropdown
        users = sorted(base_df["user_email"].dropna().unique())
        user_dropdown.options = ["All"] + users
        if user_dropdown.value not in user_dropdown.options:
            user_dropdown.value = "All"

        # Apply user filter if selected
        user_df = base_df.copy()
        if user_dropdown.value != "All":
            user_df = user_df[user_df["user_email"] == user_dropdown.value]

        # Refresh dataset dropdown
        datasets = sorted(user_df["destination_dataset_id"].dropna().unique())
        dataset_dropdown.options = ["All"] + datasets
        if dataset_dropdown.value not in dataset_dropdown.options:
            dataset_dropdown.value = "All"

        # Apply dataset filter if selected
        dataset_df = user_df.copy()
        if dataset_dropdown.value != "All":
            dataset_df = dataset_df[dataset_df["destination_dataset_id"] == dataset_dropdown.value]

        # Refresh table dropdown
        tables = sorted(dataset_df["destination_table_id"].dropna().unique())
        table_dropdown.options = ["All"] + tables
        if table_dropdown.value not in table_dropdown.options:
            table_dropdown.value = "All"

    def apply_filters_and_update_table(change=None):
        output_table.clear_output()
        df = compute_df.copy()
        if user_dropdown.value != "All":
            df = df[df["user_email"] == user_dropdown.value]
        if dataset_dropdown.value != "All":
            df = df[df["destination_dataset_id"] == dataset_dropdown.value]
        if table_dropdown.value != "All":
            df = df[df["destination_table_id"] == table_dropdown.value]
        if start_picker.value and end_picker.value:
            start = pd.Timestamp(start_picker.value)
            end = pd.Timestamp(end_picker.value)
            if start > end:
                with output_table:
                    print("⚠️ Start date cannot be after end date.")
                return
            df = df[(df["job_start_date"] >= start) & (df["job_start_date"] <= end)]

        if df.empty:
            with output_table:
                print("🔍 No jobs match your filters.")
            return

        df["year_month"] = df["job_start_date"].dt.to_period("M").astype(str)
        df["potential_savings_from_alternative_pricing"] = df["cost"] - df["alternative_cost"]

        grouped = df.groupby([
            "year_month", "job_start_date", "project_id", "current_pricing_model",
            "user_email", "job_id", "job_url", "destination_project_id",
            "destination_dataset_id", "destination_table_id"
        ]).agg({
            "cost": "sum",
            "potential_savings_from_alternative_pricing": "sum"
        }).reset_index()

        grouped[["cost", "potential_savings_from_alternative_pricing"]] = grouped[
            ["cost", "potential_savings_from_alternative_pricing"]
        ].round(2)

        # Keep jobs with a cost of at least 1, then the 200 most expensive jobs per month
        grouped = grouped[grouped["cost"] >= 1]
        grouped = grouped.sort_values(by=["year_month", "cost"], ascending=[False, False])
        grouped = grouped.groupby("year_month").head(200)
        grouped["job_start_date"] = grouped["job_start_date"].dt.strftime('%Y-%m-%d')

        final_df = grouped[[
            "job_start_date", "project_id", "current_pricing_model", "user_email",
            "cost", "potential_savings_from_alternative_pricing", "job_id", "job_url",
            "destination_project_id", "destination_dataset_id", "destination_table_id"
        ]]

        final_df_cache["df"] = final_df.copy()

        with output_table:
            print(f"{RED}⚠️ NOTE: A negative {BOLD}potential_savings_from_alternative_pricing{RESET}{RED} value means the job is already on the cheaper pricing model.{RESET}")
            display(google.colab.data_table.DataTable(final_df, include_index=False))

    def export_to_excel(change=None):
        df = final_df_cache["df"]
        if df.empty:
            print("⚠️ No data to download.")
            return

        user = user_dropdown.value if user_dropdown.value != "All" else "all_users"
        dataset = dataset_dropdown.value if dataset_dropdown.value != "All" else "all_datasets"
        table = table_dropdown.value if table_dropdown.value != "All" else "all_tables"
        if start_picker.value and end_picker.value:
            date_range = f"{start_picker.value.strftime('%Y%m%d')}_{end_picker.value.strftime('%Y%m%d')}"
        else:
            date_range = "alldates"

        file_name = f"/content/job_cost_{user}_{dataset}_{table}_{date_range}.xlsx".replace(" ", "_")
        df.to_excel(file_name, index=False)
        print(f"\n✅ Excel file saved successfully: {file_name}")
        files.download(file_name)

    # Register observers
    for widget in [start_picker, end_picker, user_dropdown]:
        widget.observe(refresh_user_and_dataset_table_dropdowns, names="value")

    for widget in [start_picker, end_picker, user_dropdown, dataset_dropdown, table_dropdown]:
        widget.observe(apply_filters_and_update_table, names="value")

    download_button.on_click(export_to_excel)

    # Layout
    print("📊 Filter job-level details using the options below:")
    display(widgets.VBox([
        widgets.HBox([start_picker, end_picker]),
        widgets.HBox([user_dropdown, dataset_dropdown, table_dropdown]),
    ]))
    display(output_table)
    display(download_button)

    # Initial render
    refresh_user_and_dataset_table_dropdowns()
    apply_filters_and_update_table()
else:
    print("No compute job data available for job-level detail.")


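The sign convention behind the message printed with the job table can be illustrated with a small, self-contained sketch. The job IDs and amounts below are invented for illustration, and the `recommend` helper is hypothetical; only the column names come from the cell above.

```python
import pandas as pd

# Hypothetical jobs; the IDs and amounts are invented for illustration.
jobs = pd.DataFrame({
    "job_id": ["job_a", "job_b"],
    "cost": [10.0, 4.0],              # cost under the current pricing model
    "alternative_cost": [6.0, 9.0],   # estimated cost under the other model
})

# Same formula as in the cell above: current cost minus alternative cost.
jobs["potential_savings_from_alternative_pricing"] = (
    jobs["cost"] - jobs["alternative_cost"]
)

def recommend(savings):
    """Hypothetical helper: positive savings mean the alternative is cheaper."""
    return "review alternative pricing" if savings > 0 else "keep current pricing"

jobs["recommendation"] = jobs["potential_savings_from_alternative_pricing"].map(recommend)
print(jobs)
```

In this toy data, `job_a` would have been cheaper under the alternative model, while `job_b` has a negative value and is already on the cheaper model.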

---
## 🗄️ Storage Cost  
---

In [None]:
#@title **README**
#@markdown BigQuery storage is charged under 2 models:
#@markdown - **Physical Storage**: Default for most tables.
#@markdown - **Logical Storage**: Used by some tables; confirm with the project owner if unclear.
#@markdown
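#@markdown Unlike compute costs, which are incurred only when jobs run, storage costs accrue continuously for as long as the data is kept. They are priced in two tiers based on when the data was last modified:
#@markdown - **Active Storage**: Tables or partitions modified within the last 90 days.
#@markdown - **Long-Term Storage**: Tables or partitions not modified for 90 consecutive days.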
#@markdown
#@markdown Note: This analysis focuses only on BASE TABLES for the most accurate results.
#@markdown
#@markdown 💡 Tip: Focus on Long-Term storage savings by deleting unnecessary tables and setting proper retention policies.
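
As a rough illustration of how the two billing models and the two storage tiers combine, here is a minimal sketch that compares the monthly cost of one dataset under each model. The per-GiB prices are assumed list prices for a US multi-region and may not match your contract or region; the function names and byte counts are hypothetical. It also ignores time travel and fail-safe bytes, which can add to the bill under physical storage.

```python
# Assumed list prices in USD per GiB per month; verify against current pricing.
ASSUMED_PRICES = {
    "logical": {"active": 0.02, "long_term": 0.01},
    "physical": {"active": 0.04, "long_term": 0.02},
}
GIB = 1024 ** 3  # bytes per GiB


def monthly_storage_cost(active_bytes, long_term_bytes, model):
    """Estimate one month's storage cost for the given billing model."""
    prices = ASSUMED_PRICES[model]
    return (
        (active_bytes / GIB) * prices["active"]
        + (long_term_bytes / GIB) * prices["long_term"]
    )


def cheaper_model(active_logical, long_term_logical, active_physical, long_term_physical):
    """Return the cheaper billing model along with both estimated costs."""
    logical = monthly_storage_cost(active_logical, long_term_logical, "logical")
    physical = monthly_storage_cost(active_physical, long_term_physical, "physical")
    return ("physical" if physical < logical else "logical"), logical, physical


# Hypothetical dataset: data compresses 5x, so physical bytes are much smaller.
model, logical_cost, physical_cost = cheaper_model(
    1000 * GIB, 500 * GIB,   # logical (uncompressed) bytes
    200 * GIB, 100 * GIB,    # physical (compressed) bytes
)
print(model, round(logical_cost, 2), round(physical_cost, 2))
```

Physical storage has a higher unit price but is billed on compressed bytes, so which model is cheaper depends on how well the data compresses.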


In [54]:
#@title **Storage Part I - Dataset Level Cost**
#@markdown 👈 **Click this Play button** to start.
#@markdown
#@markdown Click the **💾 Download as Excel** button below to save the dataset-level storage cost data locally.

import pandas as pd
import google.colab.data_table
import ipywidgets as widgets
from google.colab import files

# ANSI escape codes for styled console output
RED = "\033[31m"
ITALIC = "\033[3m"
RESET = "\033[0m"  # Reset color to default

# Build the dataset-level view only when storage data is available
if not storage_df.empty:
    dataset_level_df = storage_df.copy()

    dataset_level_df["month"] = pd.to_datetime(dataset_level_df["month"])

    # Filter only BASE TABLES
    dataset_level_df = dataset_level_df[dataset_level_df["table_type"] == "BASE TABLE"]

    # Cost columns to aggregate and round
    cost_cols = [
        "active_physical_cost", "long_term_physical_cost",
        "active_logical_cost", "long_term_logical_cost",
    ]

    aggregated_dataset_df = dataset_level_df.groupby(
        ["month", "project_id", "dataset_id"]
    )[cost_cols].sum().reset_index()

    aggregated_dataset_df[cost_cols] = aggregated_dataset_df[cost_cols].round(2)

    aggregated_dataset_df = aggregated_dataset_df.sort_values(
        by=["month", "long_term_physical_cost", "dataset_id"], ascending=[False, False, False]
    )

    aggregated_dataset_df["month"] = aggregated_dataset_df["month"].dt.strftime('%Y-%m')


    print(f"{RED}⚠️ This section is still an experimental feature in development.\n"
          f"If you're unsure about the storage type (Physical or Logical) of a dataset,\n"
          f"please reach out to the {ITALIC}Project Owner, DBRE, or Tian{RESET} {RED}for assistance.{RESET}")


    display(google.colab.data_table.DataTable(aggregated_dataset_df, include_index=False))

    # --- Download Button ---
    def download_file(change):
        """ Function to handle file download when button is clicked """
        project_name = project_id_input if project_id_input else "all_projects"
        file_name = f"/content/high_storage_costs_datasets_for_{project_name}.xlsx"

        aggregated_dataset_df.to_excel(file_name, index=False)
        print(f"\n✅ Excel file saved successfully: {file_name}")
        files.download(file_name)

    download_button = widgets.Button(description="💾 Download as Excel")
    download_button.on_click(download_file)

    display(download_button)

else:
    print("No storage data available for the selected range.")


⚠️ This section is still an experimental feature in development.
If you're unsure about the storage type (Physical or Logical) of a dataset,
please reach out to the Project Owner, DBRE, or Tian for assistance.


index,month,project_id,dataset_id,active_physical_cost,long_term_physical_cost,active_logical_cost,long_term_logical_cost
287,2025-05,nyt-algo-recs-dbt,prod_algo_experiment_reporting,2.05,4.27,5.22,10.22
288,2025-05,nyt-algo-recs-dbt,prod_algo_first_expose,1.61,1.84,2.61,3.03
292,2025-05,nyt-algo-recs-dbt,prod_intermediate,3.58,1.44,11.30,2.98
256,2025-05,nyt-algo-recs-dbt,dbt_asarkar_algo_first_expose,0.00,0.40,0.00,0.60
289,2025-05,nyt-algo-recs-dbt,prod_cooking_experiment_reporting,0.10,0.19,0.54,0.87
...,...,...,...,...,...,...,...
4,2025-01,nyt-algo-recs-dbt,dbt_asarkar_algo_experiment_reporting,205.30,0.00,491.42,0.00
3,2025-01,nyt-algo-recs-dbt,dbt_asarkar,0.00,0.00,0.04,0.00
2,2025-01,nyt-algo-recs-dbt,dbt_asaez_intermediate,0.26,0.00,0.58,0.00
1,2025-01,nyt-algo-recs-dbt,dbt_asaez_home_experiment_reporting,0.00,0.00,0.00,0.00



In [57]:
#@title **Storage Part II - Table Level Cost**
#@markdown 👈 **Click this Play button** to start.
#@markdown
#@markdown Enter a `dataset_id` to filter and explore storage costs at the **table level**.
#@markdown
#@markdown Click the **💾 Download as Excel** button below the table to save the data locally.

import pandas as pd
import google.colab.data_table
import ipywidgets as widgets
from google.colab import files

# ANSI escape codes for styled console output
RED = "\033[31m"
ITALIC = "\033[3m"
RESET = "\033[0m"  # Reset color to default

# Prompt for an optional dataset_id filter
dataset_id_input = input("Enter a dataset (dataset_id) to filter (or press Enter to include all): ").strip()

# Check storage_df directly so this cell does not depend on Part I having been run
if not storage_df.empty:
    table_level_df = storage_df.copy()

    table_level_df["month"] = pd.to_datetime(table_level_df["month"])

    # Filter only BASE TABLES
    table_level_df = table_level_df[table_level_df["table_type"] == "BASE TABLE"]

    if dataset_id_input:
        table_level_df = table_level_df[table_level_df["dataset_id"] == dataset_id_input]

    # Cost columns to aggregate and round
    cost_cols = [
        "active_physical_cost", "long_term_physical_cost",
        "active_logical_cost", "long_term_logical_cost",
    ]

    table_level_df = table_level_df.groupby(
        ["month", "project_id", "dataset_id", "table_id"]
    )[cost_cols].sum().reset_index()

    table_level_df[cost_cols] = table_level_df[cost_cols].round(2)

    table_level_df = table_level_df.sort_values(
        by=["month", "long_term_physical_cost"], ascending=[False, False]
    )

    table_level_df["month"] = table_level_df["month"].dt.strftime('%Y-%m')

    print(f"{RED}⚠️ This section is still an experimental feature in development.\n"
          f"If you're unsure about the storage type (Physical or Logical) of a table,\n"
          f"please reach out to the {ITALIC}Project Owner, DBRE, or Tian{RESET} {RED}for assistance.{RESET}")

    display(google.colab.data_table.DataTable(table_level_df, include_index=False))

    # --- Download Button ---
    def download_file(change):
        """ Function to handle file download when button is clicked """
        project_name = project_id_input if project_id_input else "all_projects"
        dataset_name = dataset_id_input if dataset_id_input else "all_datasets"
        file_name = f"/content/high_storage_costs_tables_for_{project_name}_{dataset_name}.xlsx"

        table_level_df.to_excel(file_name, index=False)
        print(f"\n✅ Excel file saved successfully: {file_name}")
        files.download(file_name)

    download_button = widgets.Button(description="💾 Download as Excel")
    download_button.on_click(download_file)

    display(download_button)

else:
    print("No storage data available for the selected range.")


Enter a dataset (dataset_id) to filter (or press Enter to include all): prod_algo_experiment_reporting
⚠️ This section is still an experimental feature in development.
If you're unsure about the storage type (Physical or Logical) of a table,
please reach out to the Project Owner, DBRE, or Tian for assistance.


index,month,project_id,dataset_id,table_id,active_physical_cost,long_term_physical_cost,active_logical_cost,long_term_logical_cost
38,2025-05,nyt-algo-recs-dbt,prod_algo_experiment_reporting,fact_algo_surface_impressions_interactions,2.05,4.27,5.2,10.18
32,2025-05,nyt-algo-recs-dbt,prod_algo_experiment_reporting,agg_day_home_monitoring,0.0,0.0,0.0,0.0
33,2025-05,nyt-algo-recs-dbt,prod_algo_experiment_reporting,agg_day_home_monitoring__dbt_tmp,0.0,0.0,0.0,0.0
34,2025-05,nyt-algo-recs-dbt,prod_algo_experiment_reporting,agg_day_variant_surface_impressions_interactions,0.0,0.0,0.0,0.0
35,2025-05,nyt-algo-recs-dbt,prod_algo_experiment_reporting,agg_day_variant_surface_impressions_interactio...,0.0,0.0,0.0,0.0
36,2025-05,nyt-algo-recs-dbt,prod_algo_experiment_reporting,agg_day_variant_surface_story_impressions_inte...,0.0,0.0,0.02,0.04
37,2025-05,nyt-algo-recs-dbt,prod_algo_experiment_reporting,agg_day_variant_surface_story_impressions_inte...,0.0,0.0,0.0,0.0
39,2025-05,nyt-algo-recs-dbt,prod_algo_experiment_reporting,fact_algo_surface_impressions_interactions__db...,0.0,0.0,0.0,0.0
30,2025-04,nyt-algo-recs-dbt,prod_algo_experiment_reporting,fact_algo_surface_impressions_interactions,64.47,121.18,161.46,289.09
28,2025-04,nyt-algo-recs-dbt,prod_algo_experiment_reporting,agg_day_variant_surface_story_impressions_inte...,0.03,0.07,0.61,1.15

