# Homework 1: Data Validation and Transformation via Polars DataFrames

## Background

You work in Supply Chain for a hyperscale cloud provider. One of your organization's responsibilities is to procure server racks for data centers. Server racks have two major categories of components: server components and rack components. Server components make up each individual server (e.g., CPU, SSD, HDD, etc.). There can be multiple servers in a rack. Rack components hold all the servers in a chassis and provide a top-of-rack (TOR) networking switch for data center connectivity.

## Situation

We manage several different server rack programs for various compute products. Each program is quoted by a variety of vendors. The quoting process is challenging. Initially, vendors submitted summary-level quotes that provided the cost of an entire server rack. As your organization evolved, you asked vendors to start providing more detailed quotes where each line represents the cost of a individual component in the server rack. These quote line details do not include server or rack quantities. Unfortunately, internal systems are still tied to the summary-level quote submissions. The current state requires vendors to provide quotes as both summaries and detailed line items.
Vendors have built automation to continuously supply quotes, as market rates for various components can fluctuate daily. 

## Available Data

- Server component quantites (Excel)
- Rack component quantities (Excel)
- Vendor quote summary data (csv)
- Vendor quote line detail data (csv)

# Import Libraries and Inspect Datasets

This section imports all four source datasets and removes the "Server" label on the program columns from the excel dataframes (server_specs_raw and rack_specs_raw).

All four datasets come from Data 516 - Scalable Algorithms taught by professor Mark Kazzaz.

In [1]:
%%capture
%pip install polars
%pip install pandas
%pip install fastexcel
%pip install pyarrow
%pip install openpyxl

import pyarrow
import fastexcel
import polars as pl
import pandas as pd
from datetime import datetime
from IPython.display import display

# Assignment

As part of the data team, you need to transform various datasets to create the following:

- A table detailing the total server rack quantity for each component across all programs. 

    - Each record in the table should represent a distinct component. 
    - Each attribute of the table should represent a distinct program. 
    - The intersection of record and attribute should represent the total extended quantity of parts for that component and program combination.

- You must join the relevant datasets and apply the necessary calculations, transformations, and filters to create a final dataset that complies with all of our data validation requirements.

- Once you have a dataset that meets all the requirements, create tables for each of the following scenarios:

    - If we only consider the latest quote received by each vendor for each program, what is the total server rack cost per program per vendor?
    - If we only consider the first quote received by each vendor for each program, what is the total cost per program per vendor?
    - To determine "best-in-class" pricing, calculate the total server rack cost by determining the lowest price per component and summing the total, regardless of vendor. What is the "best-in-class" pricing for each program?

In [2]:
import polars as pl

# import pandas as pd
quote_lines_raw = pl.read_csv("quote_lines.csv")
quote_summaries_raw = pl.read_csv("quote_summaries.csv")
server_specs_raw = pl.read_excel('/Users/sarahkilpatrick/Documents/DATA 516 Homework 1/program configurations.xlsx', sheet_name='server_specs')
rack_specs_raw = pl.read_excel('/Users/sarahkilpatrick/Documents/DATA 516 Homework 1/program configurations.xlsx', sheet_name='rack_specs')
# Server factors for each program
servers_factors = {'A': 12, 'B': 8, 'C': 14, 'D': 10, 'E': 10}



In [3]:
from datetime import datetime
import polars as pl

# Convert the quote_timestamp to datetime and filter the range
quote_lines_dated = (
    quote_lines_raw
    .with_columns(
        pl.col("quote_timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S%.f")
    )
    .filter(
        pl.col("quote_timestamp")
        .is_between(datetime(2024, 9, 2), datetime(2024, 9, 26))
    )

)
# Filter rows outside the above date range
discarded_rows = (
    quote_lines_raw
    .with_columns(
        pl.col("quote_timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S%.f")
    )
    .filter(
        ~pl.col("quote_timestamp").is_between(datetime(2024, 9, 2), datetime(2024, 9, 26))
    )
)

print(quote_lines_dated)

if discarded_rows.height > 0:
    print(f"Discarding {discarded_rows.height} quotes outside the date range:\n")
    print(discarded_rows.select(["Vendor", "Program", "quote_timestamp"]))



shape: (167, 14)
┌──────────┬───────────┬─────────────────┬────────┬───┬────────┬───────┬────────┬─────────┐
│ Vendor   ┆ Program   ┆ quote_timestamp ┆ CPU    ┆ … ┆ PSU    ┆ TRAY  ┆ TOR    ┆ CHASSIS │
│ ---      ┆ ---       ┆ ---             ┆ ---    ┆   ┆ ---    ┆ ---   ┆ ---    ┆ ---     │
│ str      ┆ str       ┆ datetime[μs]    ┆ f64    ┆   ┆ f64    ┆ f64   ┆ f64    ┆ f64     │
╞══════════╪═══════════╪═════════════════╪════════╪═══╪════════╪═══════╪════════╪═════════╡
│ Vendor_7 ┆ Program_D ┆ 2024-09-25      ┆ 305.84 ┆ … ┆ 94.19  ┆ 11.57 ┆ 746.67 ┆ 1015.84 │
│          ┆           ┆ 03:52:19.637095 ┆        ┆   ┆        ┆       ┆        ┆         │
│ Vendor_6 ┆ Program_E ┆ 2024-09-12      ┆ 324.21 ┆ … ┆ 89.99  ┆ 70.05 ┆ 728.08 ┆ 1068.57 │
│          ┆           ┆ 00:26:22.414219 ┆        ┆   ┆        ┆       ┆        ┆         │
│ Vendor_5 ┆ Program_A ┆ 2024-09-23      ┆ 300.29 ┆ … ┆ 102.29 ┆ 22.98 ┆ 734.33 ┆ 1770.98 │
│          ┆           ┆ 18:35:08.078890 ┆        ┆   ┆        

In [4]:
import polars as pl

def remove_vendor_7_programs(df):
    # Filter out rows where Vendor is 'Vendor_7' and Program is either 'Program_C' or 'Program_E'
    df_filtered = df.filter(
        ~(
            (pl.col("Vendor") == "Vendor_7") &
            (pl.col("Program").is_in(["C", "E"]))
        )
    )

    discarded_rows = df.filter(
        (pl.col("Vendor") == "Vendor_7") &
        (pl.col("Program").is_in(["C", "E"]))
    )

    if discarded_rows.height > 0:
        print(f"Discarding {discarded_rows.height} quotes from Vendor_7 for Program C and Program E:\n")
        print(discarded_rows.select(["Vendor", "Program", "quote_timestamp"]))

    return df_filtered

# Remove Vendor_7 from Program C or E
quote_lines_cleaned = remove_vendor_7_programs(quote_lines_dated)

print("Cleaned quote_lines\n", quote_lines_cleaned)

Cleaned quote_lines
 shape: (167, 14)
┌──────────┬───────────┬─────────────────┬────────┬───┬────────┬───────┬────────┬─────────┐
│ Vendor   ┆ Program   ┆ quote_timestamp ┆ CPU    ┆ … ┆ PSU    ┆ TRAY  ┆ TOR    ┆ CHASSIS │
│ ---      ┆ ---       ┆ ---             ┆ ---    ┆   ┆ ---    ┆ ---   ┆ ---    ┆ ---     │
│ str      ┆ str       ┆ datetime[μs]    ┆ f64    ┆   ┆ f64    ┆ f64   ┆ f64    ┆ f64     │
╞══════════╪═══════════╪═════════════════╪════════╪═══╪════════╪═══════╪════════╪═════════╡
│ Vendor_7 ┆ Program_D ┆ 2024-09-25      ┆ 305.84 ┆ … ┆ 94.19  ┆ 11.57 ┆ 746.67 ┆ 1015.84 │
│          ┆           ┆ 03:52:19.637095 ┆        ┆   ┆        ┆       ┆        ┆         │
│ Vendor_6 ┆ Program_E ┆ 2024-09-12      ┆ 324.21 ┆ … ┆ 89.99  ┆ 70.05 ┆ 728.08 ┆ 1068.57 │
│          ┆           ┆ 00:26:22.414219 ┆        ┆   ┆        ┆       ┆        ┆         │
│ Vendor_5 ┆ Program_A ┆ 2024-09-23      ┆ 300.29 ┆ … ┆ 102.29 ┆ 22.98 ┆ 734.33 ┆ 1770.98 │
│          ┆           ┆ 18:35:08.078890 ┆

In [36]:

def clean_program_column(df: pl.DataFrame) -> pl.DataFrame:
    # Remove the "Program_" prefix from the Program column
    df_cleaned = df.with_columns(
        pl.col("Program").str.replace("Program_", "")  # Removing the "Program_" prefix
    )

    return df_cleaned

quote_lines_cleaned = clean_program_column(quote_lines_cleaned)

print(quote_lines_cleaned)



shape: (167, 14)
┌──────────┬─────────┬────────────────────────────┬────────┬───┬────────┬───────┬────────┬─────────┐
│ Vendor   ┆ Program ┆ quote_timestamp            ┆ CPU    ┆ … ┆ PSU    ┆ TRAY  ┆ TOR    ┆ CHASSIS │
│ ---      ┆ ---     ┆ ---                        ┆ ---    ┆   ┆ ---    ┆ ---   ┆ ---    ┆ ---     │
│ str      ┆ str     ┆ datetime[μs]               ┆ f64    ┆   ┆ f64    ┆ f64   ┆ f64    ┆ f64     │
╞══════════╪═════════╪════════════════════════════╪════════╪═══╪════════╪═══════╪════════╪═════════╡
│ Vendor_7 ┆ D       ┆ 2024-09-25 03:52:19.637095 ┆ 305.84 ┆ … ┆ 94.19  ┆ 11.57 ┆ 746.67 ┆ 1015.84 │
│ Vendor_6 ┆ E       ┆ 2024-09-12 00:26:22.414219 ┆ 324.21 ┆ … ┆ 89.99  ┆ 70.05 ┆ 728.08 ┆ 1068.57 │
│ Vendor_5 ┆ A       ┆ 2024-09-23 18:35:08.078890 ┆ 300.29 ┆ … ┆ 102.29 ┆ 22.98 ┆ 734.33 ┆ 1770.98 │
│ Vendor_5 ┆ B       ┆ 2024-09-24 11:33:41.034306 ┆ 333.53 ┆ … ┆ 113.9  ┆ 80.13 ┆ 616.13 ┆ 1563.11 │
│ Vendor_7 ┆ C       ┆ 2024-09-15 13:20:11.557807 ┆ 324.79 ┆ … ┆ 86.8   ┆ 

In [37]:
# Check if 'quote_timestamp' column in 'quote_summaries_raw' is of type string

if quote_summaries_raw.schema["quote_timestamp"] == pl.Utf8:
    # Convert 'quote_timestamp' from string to datetime format
    quote_summaries_raw = quote_summaries_raw.with_columns(
        pl.col("quote_timestamp").str.to_datetime("%Y-%m-%dT%H:%M:%S%.f")
    )
    print("'quote_timestamp' has been successfully converted to datetime.")
else:
    print("The 'quote_timestamp' column is already in datetime format or another format:\n")
    display(quote_summaries_raw)


The 'quote_timestamp' column is already in datetime format or another format:



vendor,program,quote_timestamp,reported_total_price
str,str,datetime[μs],f64
"""Vendor_4""","""E""",2024-09-01 11:34:47.063229,60041.39
"""Vendor_7""","""A""",2024-09-08 09:24:26.457788,18166.85
"""Vendor_7""","""C""",2024-09-04 20:54:28.430814,20561.73
"""Vendor_1""","""D""",2024-09-24 20:49:52.296317,52527.28
"""Vendor_1""","""D""",2024-09-04 17:07:58.019610,36545.28
…,…,…,…
"""Vendor_6""","""B""",2024-09-03 04:36:02.445715,29320.38
"""Vendor_7""","""C""",2024-09-02 00:38:15.661547,22139.59
"""Vendor_1""","""E""",2024-09-24 10:23:03.302944,54118.35
"""Vendor_7""","""A""",2024-09-20 03:53:18.814830,18946.16


In [38]:
#Concatenate the scaling factors together from server specs and rack specs
rack_specs_filtered = rack_specs_raw.filter(pl.col('Item').is_in(['TOR', 'CHASSIS']))

scaling_factors = pl.concat([server_specs_raw, rack_specs_filtered], how='vertical')

print(scaling_factors)

shape: (11, 6)
┌─────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ Item    ┆ Server A ┆ Server B ┆ Server C ┆ Server D ┆ Server E │
│ ---     ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ str     ┆ i64      ┆ i64      ┆ i64      ┆ i64      ┆ i64      │
╞═════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ CPU     ┆ 2        ┆ 2        ┆ 2        ┆ 1        ┆ 1        │
│ GPU     ┆ 0        ┆ 4        ┆ 0        ┆ 0        ┆ 2        │
│ RAM     ┆ 4        ┆ 4        ┆ 4        ┆ 8        ┆ 8        │
│ SSD     ┆ 1        ┆ 2        ┆ 1        ┆ 1        ┆ 0        │
│ HDD     ┆ 0        ┆ 0        ┆ 0        ┆ 20       ┆ 20       │
│ …       ┆ …        ┆ …        ┆ …        ┆ …        ┆ …        │
│ NIC     ┆ 2        ┆ 2        ┆ 2        ┆ 2        ┆ 1        │
│ PSU     ┆ 1        ┆ 2        ┆ 1        ┆ 1        ┆ 1        │
│ TRAY    ┆ 1        ┆ 1        ┆ 1        ┆ 1        ┆ 1        │
│ TOR     ┆ 1        ┆ 2        ┆ 1        ┆ 1 

In [39]:
# Separate the last two rows to keep them unscaled
unscaled_rows = scaling_factors.tail(2)

# Scale the remaining rows
scaled_factors = scaling_factors.head(-2).select(
    [pl.col("Item")] +  # Keep the "Item" column as is
    [(pl.col(f"Server {server}") * factor).alias(f"Server {server}")
     for server, factor in servers_factors.items()]
)

# Concatenate the scaled factors with the unscaled rows
scaled_factors = pl.concat([scaled_factors, unscaled_rows])

# Display the result
print("Scaled factors:\n", scaled_factors)

Scaled factors:
 shape: (11, 6)
┌─────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ Item    ┆ Server A ┆ Server B ┆ Server C ┆ Server D ┆ Server E │
│ ---     ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ str     ┆ i64      ┆ i64      ┆ i64      ┆ i64      ┆ i64      │
╞═════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ CPU     ┆ 24       ┆ 16       ┆ 28       ┆ 10       ┆ 10       │
│ GPU     ┆ 0        ┆ 32       ┆ 0        ┆ 0        ┆ 20       │
│ RAM     ┆ 48       ┆ 32       ┆ 56       ┆ 80       ┆ 80       │
│ SSD     ┆ 12       ┆ 16       ┆ 14       ┆ 10       ┆ 0        │
│ HDD     ┆ 0        ┆ 0        ┆ 0        ┆ 200      ┆ 200      │
│ …       ┆ …        ┆ …        ┆ …        ┆ …        ┆ …        │
│ NIC     ┆ 24       ┆ 16       ┆ 28       ┆ 20       ┆ 10       │
│ PSU     ┆ 12       ┆ 16       ┆ 14       ┆ 10       ┆ 10       │
│ TRAY    ┆ 12       ┆ 8        ┆ 14       ┆ 10       ┆ 10       │
│ TOR     ┆ 1        ┆ 2      

In [40]:
# Remove "Server " from all column names (except "Item")
scaling_factors = scaled_factors.rename(
    {col: col.replace("Server ", "") for col in scaled_factors.columns if col != "Item"}
)
print("scaling factors, no server:\n", scaling_factors)

scaling factors, no server:
 shape: (11, 6)
┌─────────┬─────┬─────┬─────┬─────┬─────┐
│ Item    ┆ A   ┆ B   ┆ C   ┆ D   ┆ E   │
│ ---     ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str     ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════╪═════╪═════╪═════╪═════╡
│ CPU     ┆ 24  ┆ 16  ┆ 28  ┆ 10  ┆ 10  │
│ GPU     ┆ 0   ┆ 32  ┆ 0   ┆ 0   ┆ 20  │
│ RAM     ┆ 48  ┆ 32  ┆ 56  ┆ 80  ┆ 80  │
│ SSD     ┆ 12  ┆ 16  ┆ 14  ┆ 10  ┆ 0   │
│ HDD     ┆ 0   ┆ 0   ┆ 0   ┆ 200 ┆ 200 │
│ …       ┆ …   ┆ …   ┆ …   ┆ …   ┆ …   │
│ NIC     ┆ 24  ┆ 16  ┆ 28  ┆ 20  ┆ 10  │
│ PSU     ┆ 12  ┆ 16  ┆ 14  ┆ 10  ┆ 10  │
│ TRAY    ┆ 12  ┆ 8   ┆ 14  ┆ 10  ┆ 10  │
│ TOR     ┆ 1   ┆ 2   ┆ 1   ┆ 1   ┆ 1   │
│ CHASSIS ┆ 1   ┆ 1   ┆ 1   ┆ 1   ┆ 1   │
└─────────┴─────┴─────┴─────┴─────┴─────┘


In [41]:
# Unpivot the DataFrame
unpivoted_df = scaling_factors.unpivot(index="Item", variable_name="Program", value_name="Value")

# Pivot to get the desired structure
cleaned_scaling_factors_final = unpivoted_df.pivot(index="Program", on="Item", values="Value")

# Display the final DataFrame
print("cleaned_scaling_factors_final:\n", cleaned_scaling_factors_final)


cleaned_scaling_factors_final:
 shape: (5, 12)
┌─────────┬─────┬─────┬─────┬───┬─────┬──────┬─────┬─────────┐
│ Program ┆ CPU ┆ GPU ┆ RAM ┆ … ┆ PSU ┆ TRAY ┆ TOR ┆ CHASSIS │
│ ---     ┆ --- ┆ --- ┆ --- ┆   ┆ --- ┆ ---  ┆ --- ┆ ---     │
│ str     ┆ i64 ┆ i64 ┆ i64 ┆   ┆ i64 ┆ i64  ┆ i64 ┆ i64     │
╞═════════╪═════╪═════╪═════╪═══╪═════╪══════╪═════╪═════════╡
│ A       ┆ 24  ┆ 0   ┆ 48  ┆ … ┆ 12  ┆ 12   ┆ 1   ┆ 1       │
│ B       ┆ 16  ┆ 32  ┆ 32  ┆ … ┆ 16  ┆ 8    ┆ 2   ┆ 1       │
│ C       ┆ 28  ┆ 0   ┆ 56  ┆ … ┆ 14  ┆ 14   ┆ 1   ┆ 1       │
│ D       ┆ 10  ┆ 0   ┆ 80  ┆ … ┆ 10  ┆ 10   ┆ 1   ┆ 1       │
│ E       ┆ 10  ┆ 20  ┆ 80  ┆ … ┆ 10  ┆ 10   ┆ 1   ┆ 1       │
└─────────┴─────┴─────┴─────┴───┴─────┴──────┴─────┴─────────┘


In [42]:
# Merge the DataFrames on the Program column
print(quote_lines_cleaned)
print(cleaned_scaling_factors_final)
merged_df = quote_lines_cleaned.join(cleaned_scaling_factors_final, on="Program", suffix="_scaling")
print(merged_df)

shape: (167, 14)
┌──────────┬─────────┬────────────────────────────┬────────┬───┬────────┬───────┬────────┬─────────┐
│ Vendor   ┆ Program ┆ quote_timestamp            ┆ CPU    ┆ … ┆ PSU    ┆ TRAY  ┆ TOR    ┆ CHASSIS │
│ ---      ┆ ---     ┆ ---                        ┆ ---    ┆   ┆ ---    ┆ ---   ┆ ---    ┆ ---     │
│ str      ┆ str     ┆ datetime[μs]               ┆ f64    ┆   ┆ f64    ┆ f64   ┆ f64    ┆ f64     │
╞══════════╪═════════╪════════════════════════════╪════════╪═══╪════════╪═══════╪════════╪═════════╡
│ Vendor_7 ┆ D       ┆ 2024-09-25 03:52:19.637095 ┆ 305.84 ┆ … ┆ 94.19  ┆ 11.57 ┆ 746.67 ┆ 1015.84 │
│ Vendor_6 ┆ E       ┆ 2024-09-12 00:26:22.414219 ┆ 324.21 ┆ … ┆ 89.99  ┆ 70.05 ┆ 728.08 ┆ 1068.57 │
│ Vendor_5 ┆ A       ┆ 2024-09-23 18:35:08.078890 ┆ 300.29 ┆ … ┆ 102.29 ┆ 22.98 ┆ 734.33 ┆ 1770.98 │
│ Vendor_5 ┆ B       ┆ 2024-09-24 11:33:41.034306 ┆ 333.53 ┆ … ┆ 113.9  ┆ 80.13 ┆ 616.13 ┆ 1563.11 │
│ Vendor_7 ┆ C       ┆ 2024-09-15 13:20:11.557807 ┆ 324.79 ┆ … ┆ 86.8   ┆ 

In [43]:
# Calculate multiplied values and ensure they are floats
for column in ['CPU', 'GPU', 'RAM', 'SSD', 'HDD', 'MOBO', 'NIC', 'PSU', 'TRAY', 'TOR', 'CHASSIS']:
    merged_df = merged_df.with_columns(
        (pl.col(column) * pl.col(f"{column}_scaling")).cast(pl.Float64).alias(f"{column}_multiplied")
    )

# Display the final DataFrame with the new multiplied columns
print("Final DataFrame with multiplied values:\n", merged_df)

Final DataFrame with multiplied values:
 shape: (167, 36)
┌──────────┬─────────┬────────────┬────────┬───┬────────────┬────────────┬────────────┬────────────┐
│ Vendor   ┆ Program ┆ quote_time ┆ CPU    ┆ … ┆ PSU_multip ┆ TRAY_multi ┆ TOR_multip ┆ CHASSIS_mu │
│ ---      ┆ ---     ┆ stamp      ┆ ---    ┆   ┆ lied       ┆ plied      ┆ lied       ┆ ltiplied   │
│ str      ┆ str     ┆ ---        ┆ f64    ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│          ┆         ┆ datetime[μ ┆        ┆   ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
│          ┆         ┆ s]         ┆        ┆   ┆            ┆            ┆            ┆            │
╞══════════╪═════════╪════════════╪════════╪═══╪════════════╪════════════╪════════════╪════════════╡
│ Vendor_7 ┆ D       ┆ 2024-09-25 ┆ 305.84 ┆ … ┆ 941.9      ┆ 115.7      ┆ 746.67     ┆ 1015.84    │
│          ┆         ┆ 03:52:19.6 ┆        ┆   ┆            ┆            ┆            ┆            │
│          ┆         ┆ 37095     

In [44]:
merged_df = merged_df.with_columns(
    (pl.col('CPU_multiplied') +
     pl.col('GPU_multiplied') +
     pl.col('RAM_multiplied') +
     pl.col('SSD_multiplied') +
     pl.col('HDD_multiplied') +
     pl.col('MOBO_multiplied') +
     pl.col('NIC_multiplied') +
     pl.col('PSU_multiplied') +
     pl.col('TRAY_multiplied') +
     pl.col('TOR_multiplied') +
     pl.col('CHASSIS_multiplied')).round(2).alias("multiplied_items_sum")
)

# Display the final DataFrame with the new summed column
print("Final DataFrame with multiplied items sum:\n", merged_df)

Final DataFrame with multiplied items sum:
 shape: (167, 37)
┌──────────┬─────────┬────────────┬────────┬───┬────────────┬────────────┬────────────┬────────────┐
│ Vendor   ┆ Program ┆ quote_time ┆ CPU    ┆ … ┆ TRAY_multi ┆ TOR_multip ┆ CHASSIS_mu ┆ multiplied │
│ ---      ┆ ---     ┆ stamp      ┆ ---    ┆   ┆ plied      ┆ lied       ┆ ltiplied   ┆ _items_sum │
│ str      ┆ str     ┆ ---        ┆ f64    ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│          ┆         ┆ datetime[μ ┆        ┆   ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
│          ┆         ┆ s]         ┆        ┆   ┆            ┆            ┆            ┆            │
╞══════════╪═════════╪════════════╪════════╪═══╪════════════╪════════════╪════════════╪════════════╡
│ Vendor_7 ┆ D       ┆ 2024-09-25 ┆ 305.84 ┆ … ┆ 115.7      ┆ 746.67     ┆ 1015.84    ┆ 34309.11   │
│          ┆         ┆ 03:52:19.6 ┆        ┆   ┆            ┆            ┆            ┆            │
│          ┆         ┆ 37095  

In [45]:
# Rename columns in merged_df to match quote_summaries_raw
merged_df_renamed = merged_df.rename({"Vendor": "vendor", "Program": "program", "quote_timestamp": "quote_timestamp"})

# Perform the join
final_df = merged_df_renamed.join(quote_summaries_raw, on=["vendor", "program", "quote_timestamp"], how="inner")

# Display the final joined DataFrame
print("Final joined DataFrame:\n", final_df)

Final joined DataFrame:
 shape: (167, 38)
┌──────────┬─────────┬────────────┬────────┬───┬────────────┬────────────┬────────────┬────────────┐
│ vendor   ┆ program ┆ quote_time ┆ CPU    ┆ … ┆ TOR_multip ┆ CHASSIS_mu ┆ multiplied ┆ reported_t │
│ ---      ┆ ---     ┆ stamp      ┆ ---    ┆   ┆ lied       ┆ ltiplied   ┆ _items_sum ┆ otal_price │
│ str      ┆ str     ┆ ---        ┆ f64    ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│          ┆         ┆ datetime[μ ┆        ┆   ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
│          ┆         ┆ s]         ┆        ┆   ┆            ┆            ┆            ┆            │
╞══════════╪═════════╪════════════╪════════╪═══╪════════════╪════════════╪════════════╪════════════╡
│ Vendor_7 ┆ A       ┆ 2024-09-08 ┆ 313.71 ┆ … ┆ 709.1      ┆ 1478.43    ┆ 18166.85   ┆ 18166.85   │
│          ┆         ┆ 09:24:26.4 ┆        ┆   ┆            ┆            ┆            ┆            │
│          ┆         ┆ 57788      ┆        ┆   ┆ 

In [46]:
# Filter to keep only records where multiplied_items_sum equals reported_total_price
filtered_df = final_df.filter(pl.col("multiplied_items_sum") == pl.col("reported_total_price"))

# Display the filtered DataFrame
print("Filtered DataFrame with matching multiplied_items_sum and reported_total_price:\n", filtered_df)

Filtered DataFrame with matching multiplied_items_sum and reported_total_price:
 shape: (155, 38)
┌──────────┬─────────┬────────────┬────────┬───┬────────────┬────────────┬────────────┬────────────┐
│ vendor   ┆ program ┆ quote_time ┆ CPU    ┆ … ┆ TOR_multip ┆ CHASSIS_mu ┆ multiplied ┆ reported_t │
│ ---      ┆ ---     ┆ stamp      ┆ ---    ┆   ┆ lied       ┆ ltiplied   ┆ _items_sum ┆ otal_price │
│ str      ┆ str     ┆ ---        ┆ f64    ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│          ┆         ┆ datetime[μ ┆        ┆   ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
│          ┆         ┆ s]         ┆        ┆   ┆            ┆            ┆            ┆            │
╞══════════╪═════════╪════════════╪════════╪═══╪════════════╪════════════╪════════════╪════════════╡
│ Vendor_7 ┆ A       ┆ 2024-09-08 ┆ 313.71 ┆ … ┆ 709.1      ┆ 1478.43    ┆ 18166.85   ┆ 18166.85   │
│          ┆         ┆ 09:24:26.4 ┆        ┆   ┆            ┆            ┆            ┆       

In [47]:
# Final Four Cells - Cell 1 

print("The table of extended quantites per component for all programs\n")
display(cleaned_scaling_factors_final)

The table of extended quantites per component for all programs



Program,CPU,GPU,RAM,SSD,HDD,MOBO,NIC,PSU,TRAY,TOR,CHASSIS
str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
"""A""",24,0,48,12,0,12,24,12,12,1,1
"""B""",16,32,32,16,0,8,16,16,8,2,1
"""C""",28,0,56,14,0,14,28,14,14,1,1
"""D""",10,0,80,10,200,10,20,10,10,1,1
"""E""",10,20,80,0,200,10,10,10,10,1,1


In [48]:
# Final Four Cells - Cell 2

latest_total_cost_per_program_per_vendor = (

    filtered_df
    .group_by(['vendor', 'program'])
    .agg(
        pl.col("quote_timestamp").max().alias("latest_date")
    )
    .join(
        filtered_df, left_on = ["program", "vendor", "latest_date"], right_on = ["program", "vendor", "quote_timestamp"]
    )
    .select(
        ["program", "vendor", "latest_date", "reported_total_price"]
    )
)

print("Total cost per program per vendor based on the latest received quote:\n")
display(latest_total_cost_per_program_per_vendor)


Total cost per program per vendor based on the latest received quote:



program,vendor,latest_date,reported_total_price
str,str,datetime[μs],f64
"""A""","""Vendor_3""",2024-09-17 18:56:17.120014,18337.45
"""D""","""Vendor_2""",2024-09-15 16:31:01.577563,50639.78
"""E""","""Vendor_3""",2024-09-21 07:42:39.011815,50823.06
"""A""","""Vendor_4""",2024-09-22 22:27:29.616821,18097.66
"""C""","""Vendor_1""",2024-09-21 09:52:40.202503,20237.5
…,…,…,…
"""E""","""Vendor_1""",2024-09-19 00:04:36.436457,56903.34
"""C""","""Vendor_2""",2024-09-16 13:58:07.986085,20375.36
"""E""","""Vendor_6""",2024-09-20 12:38:17.371232,49034.84
"""B""","""Vendor_2""",2024-09-25 02:56:27.371579,31166.21


In [49]:
# Final Four Cells - Cell 3
earliest_total_cost_per_program_per_vendor = (

    filtered_df
    .group_by(['vendor', 'program'])
    .agg(
        pl.col("quote_timestamp").min().alias("latest_date")
    )
    .join(
        filtered_df, left_on = ["program", "vendor", "latest_date"], right_on = ["program", "vendor", "quote_timestamp"]
    )
    .select(
        ["program", "vendor", "latest_date", "reported_total_price"]
    )
)

print("Total cost per program per vendor based on the latest received quote:\n")
display(earliest_total_cost_per_program_per_vendor)

Total cost per program per vendor based on the latest received quote:



program,vendor,latest_date,reported_total_price
str,str,datetime[μs],f64
"""D""","""Vendor_1""",2024-09-04 14:32:50.758660,35885.62
"""B""","""Vendor_2""",2024-09-07 05:29:28.582140,30943.99
"""D""","""Vendor_7""",2024-09-14 21:51:34.330801,39209.79
"""A""","""Vendor_7""",2024-09-03 23:40:49.599144,18099.99
"""B""","""Vendor_6""",2024-09-03 02:21:46.569968,32021.13
…,…,…,…
"""D""","""Vendor_5""",2024-09-02 04:38:28.238073,54559.0
"""D""","""Vendor_2""",2024-09-02 15:44:11.625368,45809.01
"""A""","""Vendor_3""",2024-09-03 20:38:52.014528,18934.65
"""C""","""Vendor_7""",2024-09-02 00:38:15.661547,22139.59


In [50]:
# Final Four Cells - Cell 4 
# The table of "best-in-class" total cost per program (regardless of vendor).

sorted_filtered_df = filtered_df.sort(["program", "quote_timestamp"])

# Calculate the best-in-class total cost per program
best_in_class_total_cost_per_program = (
    sorted_filtered_df
    .group_by("program")
    .agg(
        [
            pl.col("CPU_multiplied").min().alias("CPU_min"),
            pl.col("GPU_multiplied").min().alias("GPU_min"),
            pl.col("RAM_multiplied").min().alias("RAM_min"),
            pl.col("SSD_multiplied").min().alias("SSD_min"),
            pl.col("HDD_multiplied").min().alias("HDD_min"),
            pl.col("MOBO_multiplied").min().alias("MOBO_min"),
            pl.col("NIC_multiplied").min().alias("NIC_min"),
            pl.col("PSU_multiplied").min().alias("PSU_min"),
            pl.col("TRAY_multiplied").min().alias("TRAY_min"),
            pl.col("TOR_multiplied").min().alias("TOR_min"),
            pl.col("CHASSIS_multiplied").min().alias("CHASSIS_min"),
        ]
    )
    .with_columns(
        (
            pl.col("CPU_min") + pl.col("GPU_min") + pl.col("RAM_min") + pl.col("SSD_min") +
            pl.col("HDD_min") + pl.col("MOBO_min") + pl.col("NIC_min") + pl.col("PSU_min") +
            pl.col("TRAY_min") + pl.col("TOR_min") + pl.col("CHASSIS_min")
        ).alias("total_minimum_sum")
    )
)

print("The table of best-in-class total cost per program (regardless of vendor):\n")
display(best_in_class_total_cost_per_program)


The table of best-in-class total cost per program (regardless of vendor):



program,CPU_min,GPU_min,RAM_min,SSD_min,HDD_min,MOBO_min,NIC_min,PSU_min,TRAY_min,TOR_min,CHASSIS_min,total_minimum_sum
str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""C""",8423.52,0.0,2474.08,2524.76,0.0,1139.04,576.52,1136.1,155.82,502.95,1007.48,17940.27
"""E""",3017.7,8211.4,3528.8,0.0,20390.0,803.5,201.9,803.9,115.7,523.49,1009.27,38605.66
"""D""",3013.6,0.0,3568.0,1802.4,20178.0,807.2,419.6,819.7,115.7,508.13,1015.84,32248.17
"""B""",4808.48,12924.16,1421.44,2889.44,0.0,650.88,330.24,1315.68,98.64,1007.74,1018.23,26464.93
"""A""",7206.96,0.0,2122.56,2169.96,0.0,975.24,486.72,977.88,239.88,507.45,1097.67,15784.32
