# Homework 1: Scalable Data 516
In this assignment, we focus on the supply chain management of a hyperscale cloud provider, specifically on procuring server racks for data centers. Server racks consist of two main types of components: 
- **Server components** (e.g., CPU, SSD, HDD), which make up each individual server, and 
- **Rack components**, which hold all servers in a chassis and include a top-of-rack (TOR) networking switch for connectivity.

## Situation

We manage multiple server rack programs, with quotes submitted by various vendors. Initially, quotes were provided as summary-level totals for the entire server rack. Now, vendors submit detailed quotes, where each line represents the cost of an individual component. However, internal systems still rely on summary quotes, meaning vendors must provide both summary and detailed quote line items.

Since vendor quotes are automated and can fluctuate daily, we need to validate incoming data before using it. The validated data is then used to produce the following tables:

1. The table of extended quantities per component for all programs.
2. The table of total cost per program per vendor, based on the latest received quote.
3. The table of total cost per program per vendor, based on the first received quote.
4. The table of "best-in-class" total cost per program (regardless of vendor).

In [111]:
import polars as pl
from datetime import datetime

## Part 1: Extended quantities per component for all programs
Task: 
A table detailing the total server rack quantity for each component across all programs. Each record in the table should represent a distinct component. Each attribute of the table should represent a distinct program. The intersection of record and attribute should represent the total extended quantity of parts for that component and program combination.

Process:
First, we want all of the scaled quantities in one table. I multiplied every item in `server_specs` by the number of servers listed for each program. Then I concatenated the remaining rack-level items from `rack_specs` (TOR and CHASSIS) to get a single table of extended quantities, which later lets us calculate the total cost per program per vendor.
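As a quick sanity check on the arithmetic, the scaling step can be sketched in plain Python. The per-server quantities and rack size below are made-up illustrative values, not read from `program configurations.xlsx`, and `extend_quantities` is a hypothetical helper name.

```python
# Illustrative values only (assumed, not taken from the workbook)
per_server = {"CPU": 2, "RAM": 4, "SSD": 1}   # parts inside one server
servers_per_rack = 12                          # the SERVERS row for this program
rack_level = {"TOR": 1, "CHASSIS": 1}          # already counted per rack


def extend_quantities(per_server, servers_per_rack, rack_level):
    """Scale per-server parts by the server count, then append rack-level parts unchanged."""
    extended = {item: qty * servers_per_rack for item, qty in per_server.items()}
    extended.update(rack_level)
    return extended


extended = extend_quantities(per_server, servers_per_rack, rack_level)
print(extended)
```

Only the server components are multiplied; the rack components are appended as-is because they are already expressed per rack.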

In [112]:
# read in the Excel workbook, one DataFrame per sheet
server_specs = pl.read_excel('input_data/program configurations.xlsx', sheet_name='server_specs')
rack_specs = pl.read_excel('input_data/program configurations.xlsx', sheet_name='rack_specs')


In [113]:
# filter rack_specs to keep only the row where Item is SERVERS
# drop the Item column so only the numeric server counts per program remain
# then take the first (only) row as a tuple of counts, e.g. (12, 8, 14, 10, 10)
servers_quantities = rack_specs.filter(pl.col('Item') == 'SERVERS').select(pl.exclude('Item')).row(0)

# multiply each component in server_specs by the corresponding SERVERS quantity
# we want (pl.col('Server A') * servers_quantities[0]).alias('Server A') for each server

server_columns = ['Server A', 'Server B', 'Server C', 'Server D', 'Server E']
scaled_servers_quantities = server_specs.with_columns([
    (pl.col(server) * servers_quantities[i]).alias(server) 
    for i, server in enumerate(server_columns)
])

In [114]:
rack_quantities = rack_specs.filter(pl.col('Item').is_in(['TOR', 'CHASSIS']))

extended_quantities_df = pl.concat([scaled_servers_quantities, rack_quantities], how='vertical')

print("Table with Extended Quantities Per Component for All Programs")

extended_quantities_df.write_csv("output_tables/extended_quantities_df.csv")

extended_quantities_df

Table with Extended Quantities Per Component for All Programs


Item,Server A,Server B,Server C,Server D,Server E
str,i64,i64,i64,i64,i64
"""CPU""",24,16,28,10,10
"""GPU""",0,32,0,0,20
"""RAM""",48,32,56,80,80
"""SSD""",12,16,14,10,0
"""HDD""",0,0,0,200,200
…,…,…,…,…,…
"""NIC""",24,16,28,20,10
"""PSU""",12,16,14,10,10
"""TRAY""",12,8,14,10,10
"""TOR""",1,2,1,1,1


## Part 2: Data Validation Requirements
We need to calculate the total cost per program per vendor based on the latest received quote. To do that, we first filter `quote_lines.csv` and `quote_summaries.csv` according to the following data validation requirements:
- Quotes for a given month can only be accepted starting on the first Monday of the month and ending on the 25th. These boundary dates are inclusive. Any quotes provided outside of these dates will not be considered.
- The sum of the quote line details must match the provided quote summary total. If the two datasets do not agree, the quotes will not be considered.
- We do not purchase Program C and Program E quotes from Vendor 7. Vendor 7’s systems are configured to quote all available programs, so these quotes need to be discarded.
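The first requirement can be sketched with the standard library alone, so that the window does not have to be hardcoded per month. This is a minimal sketch; `first_monday` and `in_quote_window` are hypothetical helper names, not part of the assignment.

```python
from datetime import date, datetime, timedelta


def first_monday(year: int, month: int) -> date:
    """Return the date of the first Monday in the given month."""
    first = date(year, month, 1)
    offset = (0 - first.weekday()) % 7  # Monday is weekday 0
    return first + timedelta(days=offset)


def in_quote_window(ts: datetime) -> bool:
    """True if ts falls between the first Monday and the 25th of its month, inclusive."""
    start = first_monday(ts.year, ts.month)
    end = date(ts.year, ts.month, 25)
    return start <= ts.date() <= end


print(first_monday(2024, 9))
print(in_quote_window(datetime(2024, 9, 25, 23, 59, 59)))
print(in_quote_window(datetime(2024, 9, 26, 0, 0, 0)))
```

Comparing on the calendar date rather than the full timestamp keeps every quote received on the 25th and rejects anything from the 26th onward.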

#### Loading in the Data 
Loading in `quote_lines.csv` and `quote_summaries.csv`

In [115]:
# Read in the raw data and do some light formatting before we get started
# Convert quote_timestamp from string to datetime so that date comparisons and min/max work correctly
quote_lines_df_raw = (
    pl.read_csv("input_data/quote_lines.csv")
    .rename({"Vendor":"vendor", "Program":"program"})
    .with_columns(
    pl.col("quote_timestamp").str.to_datetime().alias("quote_timestamp")      

    )
)



In [116]:
# reformatting quote_summaries.csv
relabel_to_program = {
    "A": "Program_A",
    "B": "Program_B",
    "C": "Program_C",
    "D": "Program_D",
    "E": "Program_E"
}
quote_summaries_df_raw = (
    pl.read_csv("input_data/quote_summaries.csv")
    .with_columns(
    pl.col("quote_timestamp").str.to_datetime().alias("quote_timestamp")
    )
)

quote_summaries_df_raw = quote_summaries_df_raw.with_columns(
    pl.col("program").replace(relabel_to_program).alias("program")
)


In [117]:
# Filtering our data between required timeframes and specified vendor info 
def filter_quote_dataframes(df: pl.DataFrame, program_prefix: str = "") -> pl.DataFrame:
    filtered_df = (
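        df.filter(
            # Window: first Monday of September 2024 (Sept 2) through Sept 25, inclusive.
            # closed="left" keeps all of Sept 25 but excludes a timestamp at exactly midnight on Sept 26.
            pl.col("quote_timestamp").is_between(
                datetime(2024, 9, 2), datetime(2024, 9, 26), closed="left"
            )
        )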
        .filter(
            ~((pl.col("vendor") == "Vendor_7") & (pl.col("program").is_in([f"{program_prefix}C", f"{program_prefix}E"])))
        )
    )

    return filtered_df

filtered_quote_lines_df1 = filter_quote_dataframes(quote_lines_df_raw, "Program_")
filtered_quote_summaries_df1 = filter_quote_dataframes(quote_summaries_df_raw, "Program_")


In [118]:
filtered_quote_lines_df1

vendor,program,quote_timestamp,CPU,GPU,RAM,SSD,HDD,MOBO,NIC,PSU,TRAY,TOR,CHASSIS
str,str,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""Vendor_7""","""Program_D""",2024-09-25 03:52:19.637095,305.84,409.06,53.43,196.14,103.49,107.72,20.98,94.19,11.57,746.67,1015.84
"""Vendor_6""","""Program_E""",2024-09-12 00:26:22.414219,324.21,446.35,44.91,194.27,184.53,100.12,28.8,89.99,70.05,728.08,1068.57
"""Vendor_5""","""Program_A""",2024-09-23 18:35:08.078890,300.29,414.38,48.73,187.55,105.42,103.5,21.64,102.29,22.98,734.33,1770.98
"""Vendor_5""","""Program_B""",2024-09-24 11:33:41.034306,333.53,445.41,48.16,193.25,104.02,90.87,23.0,113.9,80.13,616.13,1563.11
"""Vendor_3""","""Program_A""",2024-09-17 18:56:17.120014,338.88,441.63,51.69,198.78,110.31,113.62,23.77,112.67,32.99,547.85,1108.16
…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Vendor_5""","""Program_A""",2024-09-05 03:11:32.722222,330.5,468.04,47.79,190.65,110.16,91.25,25.52,94.26,38.55,684.71,1268.8
"""Vendor_4""","""Program_C""",2024-09-08 16:45:01.071635,338.68,466.55,50.5,190.87,142.77,102.34,29.21,116.78,93.16,617.82,1820.72
"""Vendor_4""","""Program_E""",2024-09-11 17:20:46.390587,313.77,476.99,46.76,191.12,122.7,81.18,28.76,115.58,27.46,711.28,1094.2
"""Vendor_4""","""Program_D""",2024-09-18 01:40:39.987961,310.68,457.19,49.46,199.81,134.83,106.66,24.49,81.97,97.49,607.15,1170.61


In [119]:
filtered_quote_summaries_df1

vendor,program,quote_timestamp,reported_total_price
str,str,datetime[μs],f64
"""Vendor_7""","""Program_A""",2024-09-08 09:24:26.457788,18166.85
"""Vendor_1""","""Program_D""",2024-09-24 20:49:52.296317,52527.28
"""Vendor_1""","""Program_D""",2024-09-04 17:07:58.019610,36545.28
"""Vendor_5""","""Program_D""",2024-09-04 20:43:51.024192,36864.57
"""Vendor_5""","""Program_D""",2024-09-14 06:05:48.206929,49587.42
…,…,…,…
"""Vendor_6""","""Program_E""",2024-09-05 02:24:06.963283,54307.11
"""Vendor_6""","""Program_B""",2024-09-03 04:36:02.445715,29320.38
"""Vendor_1""","""Program_E""",2024-09-24 10:23:03.302944,54118.35
"""Vendor_7""","""Program_A""",2024-09-20 03:53:18.814830,18946.16


## Calculating Server Rack Costs
To calculate the total cost of each quote, we reshape the previously filtered tables and combine them with the extended quantities. This process is broken down into three main steps:

#### Part 1: Unpivot
First, we will unpivot the `filtered_quote_lines_df1` DataFrame while keeping the `vendor`, `program`, and `quote_timestamp` columns as the index. This transforms all the component columns (e.g., CPU, GPU) into a long, vertical structure. Then, we'll unpivot `extended_quantities_df` so that the server columns become rows, and rename the resulting column to `program` for consistency.

#### Part 2: Merge
Next, we will merge the unpivoted tables based on the `program` and `item` columns. After merging, we will multiply the extended quantity and price to scale the data and create a new column representing the total price per component. This will result in the `Table of Scaled Cost Per Component`.

#### Part 3: Pivot and Sum
Finally, we will pivot the data back to its original wide format after scaling, then sum the values across all components to obtain the total cost of each quote.

In [120]:
# Part 1: Unpivot the two dataframes to get similar structure

filtered_quote_lines_unpivoted = filtered_quote_lines_df1.unpivot(
    index=["vendor", "program", "quote_timestamp"],  
    variable_name="item",  
    value_name="price"  
)
extended_quantities_unpivoted = extended_quantities_df.unpivot(
    index="Item",  
    variable_name="program",  
    value_name="quantity"  
)

In [121]:
# relabeling "Server _" to "Program _"
server_to_program = {
    "Server A": "Program_A",
    "Server B": "Program_B",
    "Server C": "Program_C",
    "Server D": "Program_D",
    "Server E": "Program_E"
}

extended_quantities_unpivoted = extended_quantities_unpivoted.with_columns(
    pl.col("program").replace(server_to_program).alias("program")
)

In [122]:
# Part 2: Merging
# Calculate the scaled prices after merging
merged_df = extended_quantities_unpivoted.join(
    filtered_quote_lines_unpivoted,  
    left_on=["program", "Item"],     
    right_on=["program", "item"],    
    how="inner"                      
)

scaled_components_df = merged_df.with_columns(
    (pl.col("quantity") * pl.col("price")).alias("scaled_price")
)
print("Table of Scaled Cost Per Component")
scaled_components_df

Table of Scaled Cost Per Component


Item,program,quantity,vendor,quote_timestamp,price,scaled_price
str,str,i64,str,datetime[μs],f64,f64
"""CPU""","""Program_D""",10,"""Vendor_7""",2024-09-25 03:52:19.637095,305.84,3058.4
"""CPU""","""Program_E""",10,"""Vendor_6""",2024-09-12 00:26:22.414219,324.21,3242.1
"""CPU""","""Program_A""",24,"""Vendor_5""",2024-09-23 18:35:08.078890,300.29,7206.96
"""CPU""","""Program_B""",16,"""Vendor_5""",2024-09-24 11:33:41.034306,333.53,5336.48
"""CPU""","""Program_A""",24,"""Vendor_3""",2024-09-17 18:56:17.120014,338.88,8133.12
…,…,…,…,…,…,…
"""CHASSIS""","""Program_A""",1,"""Vendor_5""",2024-09-05 03:11:32.722222,1268.8,1268.8
"""CHASSIS""","""Program_C""",1,"""Vendor_4""",2024-09-08 16:45:01.071635,1820.72,1820.72
"""CHASSIS""","""Program_E""",1,"""Vendor_4""",2024-09-11 17:20:46.390587,1094.2,1094.2
"""CHASSIS""","""Program_D""",1,"""Vendor_4""",2024-09-18 01:40:39.987961,1170.61,1170.61


In [123]:
# Part 3: Pivot Back
# Pivot back to the original wide form so we can sum across columns to get the total of all the components per vendor/program/timestamp
pivoted_df = scaled_components_df.pivot(
    values="scaled_price",                 
    index=["vendor", "program", "quote_timestamp"],  
    on="Item",                      
    aggregate_function="first"     
)

# Sum every Float64 column (the scaled component costs) across each row
scaled_quote_lines_df = pivoted_df.with_columns(
    pl.sum_horizontal(pl.col(pl.Float64)).alias("total_cost")
)

scaled_quote_lines_df


vendor,program,quote_timestamp,CPU,GPU,RAM,SSD,HDD,MOBO,NIC,PSU,TRAY,TOR,CHASSIS,total_cost
str,str,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""Vendor_7""","""Program_D""",2024-09-25 03:52:19.637095,3058.4,0.0,4274.4,1961.4,20698.0,1077.2,419.6,941.9,115.7,746.67,1015.84,34309.11
"""Vendor_6""","""Program_E""",2024-09-12 00:26:22.414219,3242.1,8927.0,3592.8,0.0,36906.0,1001.2,288.0,899.9,700.5,728.08,1068.57,57354.15
"""Vendor_5""","""Program_A""",2024-09-23 18:35:08.078890,7206.96,0.0,2339.04,2250.6,0.0,1242.0,519.36,1227.48,275.76,734.33,1770.98,17566.51
"""Vendor_5""","""Program_B""",2024-09-24 11:33:41.034306,5336.48,14253.12,1541.12,3092.0,0.0,726.96,368.0,1822.4,641.04,1232.26,1563.11,30576.49
"""Vendor_3""","""Program_A""",2024-09-17 18:56:17.120014,8133.12,0.0,2481.12,2385.36,0.0,1363.44,570.48,1352.04,395.88,547.85,1108.16,18337.45
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Vendor_5""","""Program_A""",2024-09-05 03:11:32.722222,7932.0,0.0,2293.92,2287.8,0.0,1095.0,612.48,1131.12,462.6,684.71,1268.8,17768.43
"""Vendor_4""","""Program_C""",2024-09-08 16:45:01.071635,9483.04,0.0,2828.0,2672.18,0.0,1432.76,817.88,1634.92,1304.24,617.82,1820.72,22611.56
"""Vendor_4""","""Program_E""",2024-09-11 17:20:46.390587,3137.7,9539.8,3740.8,0.0,24540.0,811.8,287.6,1155.8,274.6,711.28,1094.2,45293.58
"""Vendor_4""","""Program_D""",2024-09-18 01:40:39.987961,3106.8,0.0,3956.8,1998.1,26966.0,1066.6,489.8,819.7,974.9,607.15,1170.61,41156.46


### Comparing Outputs to Ensure Quote Line and Quote Summary Totals Match
We will merge `scaled_quote_lines_df` with `filtered_quote_summaries_df1` to compare the summary total against the itemized total. This comparison checks that the total computed from the detailed quote lines matches the summary total provided by the vendor.

In [124]:
# Join between the two DataFrames on 'vendor', 'program', and 'quote_timestamp'
final_merged_df = scaled_quote_lines_df.join(
    filtered_quote_summaries_df1,
    on=["vendor", "program", "quote_timestamp"], 
    how="inner"  
)

final_merged_df

vendor,program,quote_timestamp,CPU,GPU,RAM,SSD,HDD,MOBO,NIC,PSU,TRAY,TOR,CHASSIS,total_cost,reported_total_price
str,str,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""Vendor_7""","""Program_A""",2024-09-08 09:24:26.457788,7529.04,0.0,2547.84,2372.28,0.0,1237.92,544.56,1002.84,744.84,709.1,1478.43,18166.85,18166.85
"""Vendor_1""","""Program_D""",2024-09-24 20:49:52.296317,3284.0,0.0,4025.6,1971.3,38052.0,894.8,460.8,948.8,937.2,729.42,1223.36,52527.28,52527.28
"""Vendor_1""","""Program_D""",2024-09-04 17:07:58.019610,3330.4,0.0,3813.6,1958.8,21304.0,856.0,581.0,1035.2,982.2,688.4,1995.68,36545.28,36545.28
"""Vendor_5""","""Program_D""",2024-09-04 20:43:51.024192,3444.4,0.0,3619.2,1949.1,23316.0,1129.4,450.2,961.8,334.4,569.18,1090.89,36864.57,36864.57
"""Vendor_5""","""Program_D""",2024-09-14 06:05:48.206929,3402.8,0.0,3762.4,1954.0,35994.0,1000.4,428.6,950.2,259.4,534.24,1301.38,49587.42,49587.42
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Vendor_6""","""Program_E""",2024-09-05 02:24:06.963283,3249.9,8036.0,3649.6,0.0,32394.0,1147.8,295.5,832.0,619.3,604.75,1896.5,52725.35,54307.11
"""Vendor_6""","""Program_B""",2024-09-03 04:36:02.445715,5138.24,14357.44,1421.44,3026.72,0.0,650.88,449.6,1325.28,397.36,1480.36,1073.06,29320.38,29320.38
"""Vendor_1""","""Program_E""",2024-09-24 10:23:03.302944,3225.3,9222.6,3528.8,0.0,32204.0,843.2,266.4,1080.4,261.2,643.35,1266.84,52542.09,54118.35
"""Vendor_7""","""Program_A""",2024-09-20 03:53:18.814830,7463.28,0.0,2492.64,2385.36,0.0,1106.64,590.64,1240.8,1045.08,672.21,1949.51,18946.16,18946.16


In [130]:
# Adding in some tolerance  
epsilon = 1e-6  

mismatches_df = final_merged_df.filter(
    (pl.col("reported_total_price") - pl.col("total_cost")).abs() > epsilon
)

if mismatches_df.height > 0:
    print("\nDiscrepancies found between reported_total_price and total_cost:")
    print(mismatches_df)
else:
    print("\nAll values match between reported_total_price and total_cost within the tolerance.")



Discrepancies found between reported_total_price and total_cost:
shape: (12, 16)
┌──────────┬───────────┬──────────────┬─────────┬───┬─────────┬─────────┬────────────┬─────────────┐
│ vendor   ┆ program   ┆ quote_timest ┆ CPU     ┆ … ┆ TOR     ┆ CHASSIS ┆ total_cost ┆ reported_to │
│ ---      ┆ ---       ┆ amp          ┆ ---     ┆   ┆ ---     ┆ ---     ┆ ---        ┆ tal_price   │
│ str      ┆ str       ┆ ---          ┆ f64     ┆   ┆ f64     ┆ f64     ┆ f64        ┆ ---         │
│          ┆           ┆ datetime[μs] ┆         ┆   ┆         ┆         ┆            ┆ f64         │
╞══════════╪═══════════╪══════════════╪═════════╪═══╪═════════╪═════════╪════════════╪═════════════╡
│ Vendor_6 ┆ Program_E ┆ 2024-09-12   ┆ 3282.1  ┆ … ┆ 705.71  ┆ 1619.31 ┆ 48689.12   ┆ 50149.79    │
│          ┆           ┆ 21:33:39.401 ┆         ┆   ┆         ┆         ┆            ┆             │
│          ┆           ┆ 565          ┆         ┆   ┆         ┆         ┆            ┆             │
│ Vendor_

### Final Merged Dataframe That Complies with Data Validation Requirements  

To ensure we're following the data validation requirements, we filter the merged DataFrame based on the difference between the `reported_total_price` and the calculated `total_cost`. We allow for a small margin of error, defined by `epsilon`, to account for floating-point rounding differences.
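The need for a tolerance can be seen with two illustrative numbers in plain Python: a sum of floating-point values may differ from the expected total in the last bits even when the two agree on paper.

```python
import math

# Illustrative values only
line_items = [0.1, 0.2]
reported_total = 0.3

itemized_total = sum(line_items)

# Exact comparison fails because 0.1 + 0.2 is not exactly 0.3 in binary floating point
exact_match = itemized_total == reported_total

# Comparison with an absolute tolerance, analogous to the epsilon used above
close_match = math.isclose(itemized_total, reported_total, abs_tol=1e-6)

print(exact_match, close_match)
```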

In [126]:
validated_merged = final_merged_df.filter(
    (pl.col("reported_total_price") - pl.col("total_cost")).abs() <= epsilon
)
print("\nValidated Merged DataFrame (with matching total costs):")
validated_merged


Validated Merged DataFrame (with matching total costs):


vendor,program,quote_timestamp,CPU,GPU,RAM,SSD,HDD,MOBO,NIC,PSU,TRAY,TOR,CHASSIS,total_cost,reported_total_price
str,str,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""Vendor_7""","""Program_A""",2024-09-08 09:24:26.457788,7529.04,0.0,2547.84,2372.28,0.0,1237.92,544.56,1002.84,744.84,709.1,1478.43,18166.85,18166.85
"""Vendor_1""","""Program_D""",2024-09-24 20:49:52.296317,3284.0,0.0,4025.6,1971.3,38052.0,894.8,460.8,948.8,937.2,729.42,1223.36,52527.28,52527.28
"""Vendor_1""","""Program_D""",2024-09-04 17:07:58.019610,3330.4,0.0,3813.6,1958.8,21304.0,856.0,581.0,1035.2,982.2,688.4,1995.68,36545.28,36545.28
"""Vendor_5""","""Program_D""",2024-09-04 20:43:51.024192,3444.4,0.0,3619.2,1949.1,23316.0,1129.4,450.2,961.8,334.4,569.18,1090.89,36864.57,36864.57
"""Vendor_5""","""Program_D""",2024-09-14 06:05:48.206929,3402.8,0.0,3762.4,1954.0,35994.0,1000.4,428.6,950.2,259.4,534.24,1301.38,49587.42,49587.42
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Vendor_1""","""Program_C""",2024-09-08 22:11:29.817579,8716.96,0.0,2692.48,2709.7,0.0,1313.34,790.16,1204.98,723.66,587.21,1352.27,20090.76,20090.76
"""Vendor_6""","""Program_B""",2024-09-21 10:38:25.443676,4935.2,13267.52,1690.24,3037.92,0.0,694.08,399.04,1569.44,327.6,1114.1,1111.66,28146.8,28146.8
"""Vendor_6""","""Program_B""",2024-09-03 04:36:02.445715,5138.24,14357.44,1421.44,3026.72,0.0,650.88,449.6,1325.28,397.36,1480.36,1073.06,29320.38,29320.38
"""Vendor_7""","""Program_A""",2024-09-20 03:53:18.814830,7463.28,0.0,2492.64,2385.36,0.0,1106.64,590.64,1240.8,1045.08,672.21,1949.51,18946.16,18946.16


## Part 3: Calculate Requested Tables
Outputting All Requested Tables

In [132]:
print("Table with Extended Quantities Per Component for All Programs")
extended_quantities_df

Table with Extended Quantities Per Component for All Programs


Item,Server A,Server B,Server C,Server D,Server E
str,i64,i64,i64,i64,i64
"""CPU""",24,16,28,10,10
"""GPU""",0,32,0,0,20
"""RAM""",48,32,56,80,80
"""SSD""",12,16,14,10,0
"""HDD""",0,0,0,200,200
…,…,…,…,…,…
"""NIC""",24,16,28,20,10
"""PSU""",12,16,14,10,10
"""TRAY""",12,8,14,10,10
"""TOR""",1,2,1,1,1


In [127]:
latest_dates_df = (
    validated_merged
    .group_by(['vendor', 'program'])
    .agg(
        pl.col("quote_timestamp").max().alias("latest_date")  
    )
)

latest_total_cost_per_program_per_vendor = (
    latest_dates_df
    .join(
        validated_merged,
        left_on=["vendor", "program", "latest_date"],
        right_on=["vendor", "program", "quote_timestamp"]
    )
    .select(
        ["program", "vendor", "latest_date", "reported_total_price"]
    )
)

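print("Total cost per program per vendor based on the latest received quote:\n")
# Sort before writing so the CSV row order is deterministic (group_by does not guarantee order)
latest_total_cost_per_program_per_vendor = latest_total_cost_per_program_per_vendor.sort(["program","vendor"])
latest_total_cost_per_program_per_vendor.write_csv('output_tables/latest_received_total_cost_per_program_per_vendor.csv')
latest_total_cost_per_program_per_vendor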

Total cost per program per vendor based on the latest received quote:



program,vendor,latest_date,reported_total_price
str,str,datetime[μs],f64
"""Program_A""","""Vendor_1""",2024-09-23 16:47:39.737290,18226.44
"""Program_A""","""Vendor_2""",2024-09-22 01:24:11.930370,17658.75
"""Program_A""","""Vendor_3""",2024-09-17 18:56:17.120014,18337.45
"""Program_A""","""Vendor_4""",2024-09-22 22:27:29.616821,18097.66
"""Program_A""","""Vendor_5""",2024-09-23 18:35:08.078890,17566.51
…,…,…,…
"""Program_E""","""Vendor_2""",2024-09-19 22:44:42.953684,60823.12
"""Program_E""","""Vendor_3""",2024-09-21 07:42:39.011815,50823.06
"""Program_E""","""Vendor_4""",2024-09-22 11:12:42.133680,55786.45
"""Program_E""","""Vendor_5""",2024-09-23 04:24:43.218127,43245.08


In [128]:
first_dates_df = (
    validated_merged
    .group_by(['vendor', 'program'])
    .agg(
        pl.col("quote_timestamp").min().alias("first_date")  
    )
)

first_total_cost_per_program_per_vendor = (
    first_dates_df
    .join(
        validated_merged,
        left_on=["vendor", "program", "first_date"],
        right_on=["vendor", "program", "quote_timestamp"]
    )
    .select(
        ["program", "vendor", "first_date", "reported_total_price"]
    )
    .sort(["program","vendor"])
)

print("Total cost per program per vendor based on the first received quote:\n")
first_total_cost_per_program_per_vendor.write_csv('output_tables/first_received_total_cost_per_program_per_vendor.csv')
first_total_cost_per_program_per_vendor

Total cost per program per vendor based on the first received quote:



program,vendor,first_date,reported_total_price
str,str,datetime[μs],f64
"""Program_A""","""Vendor_1""",2024-09-02 08:30:18.429076,17555.8
"""Program_A""","""Vendor_2""",2024-09-13 16:51:40.159684,18287.76
"""Program_A""","""Vendor_3""",2024-09-03 20:38:52.014528,18934.65
"""Program_A""","""Vendor_4""",2024-09-02 18:27:51.592116,18547.48
"""Program_A""","""Vendor_5""",2024-09-05 03:11:32.722222,17768.43
…,…,…,…
"""Program_E""","""Vendor_2""",2024-09-09 03:10:16.042863,48245.24
"""Program_E""","""Vendor_3""",2024-09-02 19:47:59.236292,47761.56
"""Program_E""","""Vendor_4""",2024-09-07 17:35:53.593231,61305.92
"""Program_E""","""Vendor_5""",2024-09-03 05:21:26.272780,44481.12


In [129]:
best_in_class = (
    validated_merged
    .group_by("program")  
    .agg([
        pl.col("CPU").min(),  
        pl.col("GPU").min(),
        pl.col("RAM").min(),
        pl.col("SSD").min(),
        pl.col("HDD").min(),
        pl.col("MOBO").min(),
        pl.col("NIC").min(),
        pl.col("PSU").min(),
        pl.col("TRAY").min(),
        pl.col("TOR").min(),
        pl.col("CHASSIS").min()
    ])
    .with_columns(
        (pl.col("CPU") + pl.col("GPU") + pl.col("RAM") + pl.col("SSD") + pl.col("HDD") +
         pl.col("MOBO") + pl.col("NIC") + pl.col("PSU") + pl.col("TRAY") +
         pl.col("TOR") + pl.col("CHASSIS")).alias("best_in_class_total_cost")  
    )
    .select(["program", "best_in_class_total_cost"])
)

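print("Best-in-class total cost per program:\n")
# Sort before writing so the CSV row order is deterministic (group_by does not guarantee order)
best_in_class = best_in_class.sort(["program"])
best_in_class.write_csv('output_tables/best-in-class.csv')
best_in_class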


Best-in-class total cost per program:



program,best_in_class_total_cost
str,f64
"""Program_A""",15784.32
"""Program_B""",26464.93
"""Program_C""",18038.69
"""Program_D""",32248.17
"""Program_E""",38613.96

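To make the best-in-class logic concrete, here is a minimal plain-Python sketch for a single program, using made-up scaled costs. It takes the cheapest scaled cost of each component across all validated quotes and sums those minima, so the result may mix components from different vendors.

```python
# Scaled component costs per validated quote for one program (made-up numbers)
quotes = [
    {"vendor": "V1", "CPU": 7000.0, "RAM": 2500.0, "CHASSIS": 1200.0},
    {"vendor": "V2", "CPU": 7200.0, "RAM": 2300.0, "CHASSIS": 1100.0},
]
components = ["CPU", "RAM", "CHASSIS"]

# Cheapest cost per component, regardless of which vendor quoted it
best_per_component = {c: min(q[c] for q in quotes) for c in components}
best_in_class_total = sum(best_per_component.values())

# For comparison: the cheapest total from any single quote
cheapest_single_quote = min(sum(q[c] for c in components) for q in quotes)

print(best_in_class_total, cheapest_single_quote)
```

Because each component is minimized independently, the best-in-class total can never exceed the cheapest single quote; it is a lower bound rather than a price any one vendor actually offered.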