# Homework 1: Data Validation and Transformation via Polars DataFrames

## Background

You work in Supply Chain for a hyperscale cloud provider. One of your organization's responsibilities is to procure server racks for data centers. Server racks have two major categories of components: server components and rack components. Server components make up each individual server (e.g., CPU, SSD, HDD, etc.). There can be multiple servers in a rack. Rack components hold all the servers in a chassis and provide a top-of-rack (TOR) networking switch for data center connectivity.

## Situation

We manage several different server rack programs for various compute products. Each program is quoted by a variety of vendors. The quoting process is challenging. Initially, vendors submitted summary-level quotes that provided the cost of an entire server rack. As your organization evolved, you asked vendors to start providing more detailed quotes where each line represents the cost of an individual component in the server rack. These quote line details do not include server or rack quantities. Unfortunately, internal systems are still tied to the summary-level quote submissions. The current state requires vendors to provide quotes as both summaries and detailed line items.
Vendors have built automation to continuously supply quotes, as market rates for various components can fluctuate daily.

## Available Data

- Server component quantities (Excel)
- Rack component quantities (Excel)
- Vendor quote summary data (CSV)
- Vendor quote line detail data (CSV)

# Import Libraries and Inspect Datasets

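This section imports all four source datasets and removes the "Server" label from the program column names of the two Excel DataFrames (server_specs_raw and rack_specs_raw). It also normalizes two item labels in those DataFrames (MoBo to MOBO, CHASIS to CHASSIS) so that they match the component column names in the quote data.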

All four datasets come from Data 516 - Scalable Algorithms taught by professor Mark Kazzaz.

In [1]:
import openpyxl

In [2]:
%%capture
%pip install polars
%pip install pandas
%pip install openpyxl
import polars as pl
import pandas as pd
from datetime import datetime, timedelta

In [3]:
# quote_lines_raw = (
#     pl.read_csv("quote_lines.csv")
# )
# print(quote_lines_raw)
quote_lines_raw = pd.read_csv("quote_lines.csv")
print(quote_lines_raw.head())

     Vendor    Program             quote_timestamp     CPU     GPU    RAM  \
0  Vendor_7  Program_D  2024-09-25T03:52:19.637095  305.84  409.06  53.43   
1  Vendor_6  Program_E  2024-09-12T00:26:22.414219  324.21  446.35  44.91   
2  Vendor_6  Program_A  2024-09-01T04:29:41.575326  325.75  404.60  52.24   
3  Vendor_5  Program_A  2024-09-23T18:35:08.078890  300.29  414.38  48.73   
4  Vendor_5  Program_B  2024-09-24T11:33:41.034306  333.53  445.41  48.16   

      SSD     HDD    MOBO    NIC     PSU   TRAY     TOR  CHASSIS  
0  196.14  103.49  107.72  20.98   94.19  11.57  746.67  1015.84  
1  194.27  184.53  100.12  28.80   89.99  70.05  728.08  1068.57  
2  181.26  130.24  114.01  22.66  106.22  28.00  719.14  1665.41  
3  187.55  105.42  103.50  21.64  102.29  22.98  734.33  1770.98  
4  193.25  104.02   90.87  23.00  113.90  80.13  616.13  1563.11  


In [None]:
# quote_summaries_raw = (
#     pl.read_csv("quote_summaries.csv")
# )
# print(quote_summaries_raw)

quote_summaries_raw = pd.read_csv("quote_summaries.csv")
# print(quote_summaries_raw.head())


In [5]:
# server_specs_raw = (
#     pl.read_excel("program configurations.xlsx", sheet_name = "server_specs")
# )
# print(server_specs_raw.head())

server_specs_raw = pd.read_excel("program configurations.xlsx", sheet_name = "server_specs")
server_specs_raw.replace('MoBo', 'MOBO', inplace=True)
# print(server_specs_raw)


In [6]:
# rack_specs_raw = (
#     pl.read_excel("program configurations.xlsx", sheet_name = "rack_specs")
# )
# print(rack_specs_raw.head())

rack_specs_raw = pd.read_excel("program configurations.xlsx", sheet_name = "rack_specs")
rack_specs_raw.replace('CHASIS', 'CHASSIS', inplace=True)
# print(rack_specs_raw)

In [7]:
def clean_column_names(df):
    # Remove the 'Server ' part from the column names
    df.columns = df.columns.str.replace('Server ', '', regex=False)
    
    # Log the changes (optional)
    print("Column names cleaned. Updated columns:")
    print(df)
    
    return df

# Apply the function to server_specs_raw and rack_specs_raw
server_specs_raw = clean_column_names(server_specs_raw)
rack_specs_raw = clean_column_names(rack_specs_raw)

Column names cleaned. Updated columns:
   Item  A  B  C   D   E
0   CPU  2  2  2   1   1
1   GPU  0  4  0   0   2
2   RAM  4  4  4   8   8
3   SSD  1  2  1   1   0
4   HDD  0  0  0  20  20
5  MOBO  1  1  1   1   1
6   NIC  2  2  2   2   1
7   PSU  1  2  1   1   1
8  TRAY  1  1  1   1   1
Column names cleaned. Updated columns:
      Item   A  B   C   D   E
0  SERVERS  12  8  14  10  10
1      TOR   1  2   1   1   1
2  CHASSIS   1  1   1   1   1


# Assignment

As part of the data team, you need to transform various datasets to create the following:

- A table detailing the total server rack quantity for each component across all programs. 

    - Each record in the table should represent a distinct component. 
    - Each attribute of the table should represent a distinct program. 
    - The intersection of record and attribute should represent the total extended quantity of parts for that component and program combination.

- You must join the relevant datasets and apply the necessary calculations, transformations, and filters to create a final dataset that complies with all of our data validation requirements.

- Once you have a dataset that meets all the requirements, create tables for each of the following scenarios:

    - If we only consider the latest quote received by each vendor for each program, what is the total server rack cost per program per vendor?
    - If we only consider the first quote received by each vendor for each program, what is the total cost per program per vendor?
    - To determine "best-in-class" pricing, calculate the total server rack cost by determining the lowest price per component and summing the total, regardless of vendor. What is the "best-in-class" pricing for each program?
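
The three scenarios can be sketched with pandas on a small made-up table. Everything in this example is an assumption for illustration: the sample rows, the two-component layout, and the `total_cost` column, which stands in for a validated total rack cost. The component columns are assumed to hold extended costs (unit price times quantity), so the lowest extended cost per component also corresponds to the lowest unit price.

```python
import pandas as pd

# Hypothetical validated quotes; CPU and TOR hold extended costs per rack.
quotes = pd.DataFrame({
    "Vendor": ["Vendor_1", "Vendor_1", "Vendor_2", "Vendor_2"],
    "Program": ["A", "A", "A", "A"],
    "quote_timestamp": pd.to_datetime([
        "2024-09-03T10:00:00", "2024-09-20T10:00:00",
        "2024-09-05T10:00:00", "2024-09-18T10:00:00",
    ]),
    "CPU": [600.0, 620.0, 610.0, 590.0],
    "TOR": [700.0, 690.0, 720.0, 710.0],
})
components = ["CPU", "TOR"]
quotes["total_cost"] = quotes[components].sum(axis=1)

# Sort once by time, then keep the last or first row per (Program, Vendor).
keys = ["Program", "Vendor"]
ordered = quotes.sort_values("quote_timestamp")
latest = ordered.drop_duplicates(keys, keep="last")
first = ordered.drop_duplicates(keys, keep="first")

# One row per program, one column per vendor.
latest_table = latest.pivot(index="Program", columns="Vendor", values="total_cost")
first_table = first.pivot(index="Program", columns="Vendor", values="total_cost")

# Best-in-class: cheapest value of each component per program, then summed.
best_in_class = quotes.groupby("Program")[components].min().sum(axis=1)

print(latest_table)
print(first_table)
print(best_in_class)
```

Sorting once and calling `drop_duplicates` with `keep="last"` or `keep="first"` keeps the two vendor-level scenarios symmetric; a `groupby` with `idxmax`/`idxmin` on the timestamp would be another option.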

# Clean Data 

There is little to no validation performed on quotes when they enter our system. As such, we need to perform the following validations on our quote data:

- Quotes for a given month can only be accepted starting on the first Monday of the month and ending on the 25th.

    - These boundary dates are inclusive.
    - Any quotes provided outside of these dates will not be considered.

- The sum of the quote line details must match the provided quote summary total.

    - If the two datasets do not agree, the quotes will not be considered.

- We do not purchase Program C and Program E quotes from Vendor 7.

    - Vendor 7’s systems are configured to quote all available programs, so these quotes need to be discarded.


# Cleaning Setup

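This function removes the "Program_" prefix from the Program column. The result is assigned to quote_lines_raw_prefix; because the function modifies its input in place, quote_lines_raw and quote_lines_raw_prefix refer to the same DataFrame. The cell then splits the quotes into one DataFrame per program for inspection.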

In [8]:
def clean_program_column(df):
    # Remove the "Program_" prefix from the Program column
    df['Program'] = df['Program'].str.replace('Program_', '', regex=False)
    
    # Log the changes (optional)
    print("Program column cleaned. Updated values:")
    print(df['Program'].unique())
    
    return df

quote_lines_raw_prefix = clean_program_column(quote_lines_raw)

# Output the modified DataFrame
print(quote_lines_raw_prefix.head())

# Group the DataFrame by the 'Program' column
grouped_programs = quote_lines_raw_prefix.groupby('Program')

# Create a dictionary to hold separate DataFrames for each program
program_dataframes = {program: grouped_programs.get_group(program) for program in grouped_programs.groups}

# Display the DataFrame for each program (optional)
for program, program_df in program_dataframes.items():
    print(f"DataFrame for Program {program}:")
    print(program_df, "\n")



Program column cleaned. Updated values:
['D' 'E' 'A' 'B' 'C']
     Vendor Program             quote_timestamp     CPU     GPU    RAM  \
0  Vendor_7       D  2024-09-25T03:52:19.637095  305.84  409.06  53.43   
1  Vendor_6       E  2024-09-12T00:26:22.414219  324.21  446.35  44.91   
2  Vendor_6       A  2024-09-01T04:29:41.575326  325.75  404.60  52.24   
3  Vendor_5       A  2024-09-23T18:35:08.078890  300.29  414.38  48.73   
4  Vendor_5       B  2024-09-24T11:33:41.034306  333.53  445.41  48.16   

      SSD     HDD    MOBO    NIC     PSU   TRAY     TOR  CHASSIS  
0  196.14  103.49  107.72  20.98   94.19  11.57  746.67  1015.84  
1  194.27  184.53  100.12  28.80   89.99  70.05  728.08  1068.57  
2  181.26  130.24  114.01  22.66  106.22  28.00  719.14  1665.41  
3  187.55  105.42  103.50  21.64  102.29  22.98  734.33  1770.98  
4  193.25  104.02   90.87  23.00  113.90  80.13  616.13  1563.11  
DataFrame for Program A:
       Vendor Program             quote_timestamp     CPU     GPU 

# Compile Scaling Factors

The following code compiles the per-program quantities (scaling factors) from the two source DataFrames (server_specs_raw and rack_specs_raw) into one dictionary keyed by program and then by item.

In [9]:
scaling_factors = {}
programs = ['A', 'B', 'C', 'D', 'E']

# Process server_specs_raw, then rack_specs_raw, into one nested dict:
# scaling_factors[program][item] = quantity
for specs in (server_specs_raw, rack_specs_raw):
    for _, row in specs.iterrows():
        item = row['Item']
        for program in programs:
            scaling_factors.setdefault(program, {})[item] = row[program]

# print("these are the scaling factors sorted by program:\n", scaling_factors)

# Shallow copy keyed by program (same contents as scaling_factors)
program_scaling_factors = dict(scaling_factors)

# Display the resulting dictionaries (optional)
for program, factors in program_scaling_factors.items():
    print(f"Scaling factors for Program {program}:")
    print(factors, "\n")


Scaling factors for Program A:
{'CPU': 2, 'GPU': 0, 'RAM': 4, 'SSD': 1, 'HDD': 0, 'MOBO': 1, 'NIC': 2, 'PSU': 1, 'TRAY': 1, 'SERVERS': 12, 'TOR': 1, 'CHASSIS': 1} 

Scaling factors for Program B:
{'CPU': 2, 'GPU': 4, 'RAM': 4, 'SSD': 2, 'HDD': 0, 'MOBO': 1, 'NIC': 2, 'PSU': 2, 'TRAY': 1, 'SERVERS': 8, 'TOR': 2, 'CHASSIS': 1} 

Scaling factors for Program C:
{'CPU': 2, 'GPU': 0, 'RAM': 4, 'SSD': 1, 'HDD': 0, 'MOBO': 1, 'NIC': 2, 'PSU': 1, 'TRAY': 1, 'SERVERS': 14, 'TOR': 1, 'CHASSIS': 1} 

Scaling factors for Program D:
{'CPU': 1, 'GPU': 0, 'RAM': 8, 'SSD': 1, 'HDD': 20, 'MOBO': 1, 'NIC': 2, 'PSU': 1, 'TRAY': 1, 'SERVERS': 10, 'TOR': 1, 'CHASSIS': 1} 

Scaling factors for Program E:
{'CPU': 1, 'GPU': 2, 'RAM': 8, 'SSD': 0, 'HDD': 20, 'MOBO': 1, 'NIC': 1, 'PSU': 1, 'TRAY': 1, 'SERVERS': 10, 'TOR': 1, 'CHASSIS': 1} 
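
The extended-quantity table requested in the assignment could be derived from this dictionary along the following lines. This is a minimal sketch: it copies the Program A and Program E quantities printed above, and it assumes that server components are multiplied by SERVERS while TOR and CHASSIS are counted once per rack. The names `RACK_LEVEL` and `extended_quantities` are introduced here for illustration, and dropping SERVERS from the result is a choice, not a requirement stated in the assignment.

```python
# Quantities copied from the printed scaling factors (Programs A and E only).
scaling_factors = {
    "A": {"CPU": 2, "GPU": 0, "RAM": 4, "SSD": 1, "HDD": 0, "MOBO": 1, "NIC": 2,
          "PSU": 1, "TRAY": 1, "SERVERS": 12, "TOR": 1, "CHASSIS": 1},
    "E": {"CPU": 1, "GPU": 2, "RAM": 8, "SSD": 0, "HDD": 20, "MOBO": 1, "NIC": 1,
          "PSU": 1, "TRAY": 1, "SERVERS": 10, "TOR": 1, "CHASSIS": 1},
}

# Assumption: these items are per rack, everything else is per server.
RACK_LEVEL = {"TOR", "CHASSIS"}


def extended_quantities(factors):
    """Return the total quantity of each component in one full rack."""
    servers = factors["SERVERS"]
    return {
        item: qty if item in RACK_LEVEL else qty * servers
        for item, qty in factors.items()
        if item != "SERVERS"  # SERVERS is a multiplier, not a component
    }


extended = {program: extended_quantities(f) for program, f in scaling_factors.items()}

# Print one row per component and one column per program.
programs = sorted(extended)
print("Item".ljust(8) + "".join(p.rjust(6) for p in programs))
for item in extended[programs[0]]:
    print(item.ljust(8) + "".join(str(extended[p][item]).rjust(6) for p in programs))
```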



In [15]:
# messing around


     Vendor Program             quote_timestamp     CPU      GPU     RAM  \
0  Vendor_7       D  2024-09-25T03:52:19.637095  305.84     0.00  427.44   
1  Vendor_6       E  2024-09-12T00:26:22.414219  324.21   892.70  359.28   
2  Vendor_6       A  2024-09-01T04:29:41.575326  651.50     0.00  208.96   
3  Vendor_5       A  2024-09-23T18:35:08.078890  600.58     0.00  194.92   
4  Vendor_5       B  2024-09-24T11:33:41.034306  667.06  1781.64  192.64   

      SSD     HDD    MOBO    NIC     PSU   TRAY      TOR  CHASSIS  Total_Sum  \
0  196.14  2069.8  107.72  41.96   94.19  11.57   746.67  1015.84   50171.70   
1    0.00  3690.6  100.12  28.80   89.99  70.05   728.08  1068.57   73524.00   
2  181.26     0.0  114.01  45.32  106.22  28.00   719.14  1665.41   44637.84   
3  187.55     0.0  103.50  43.28  102.29  22.98   734.33  1770.98   45124.92   
4  386.50     0.0   90.87  46.00  227.80  80.13  1232.26  1563.11   50144.08   

   Scaled_Total_Cost  
0          473774.10  
1          80881

# Scale Relevant Quotes (Prices) and Sum

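This section of code contains functions that scale the raw quote_lines data using the quantities in the scaling_factors dictionary, and then create a new column named Total_Sum. Total_Sum is the sum of the eleven scaled component prices, multiplied by the program's SERVERS quantity. The scaling is applied in place, so quote_lines_raw_prefix holds scaled prices after this cell runs.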

In [10]:
# List of relevant columns to scale
columns_to_scale = ['CPU', 'GPU', 'RAM', 'SSD', 'HDD', 'MOBO', 'NIC', 'PSU', 'TRAY', 'TOR', 'CHASSIS']

# Function to scale individual columns based on the program for the entire DataFrame
def scale_items(df, columns_to_scale, scaling_factors):
    # Apply scaling to all relevant columns in the DataFrame
    for index, row in df.iterrows():
        program = row['Program']
        for column in columns_to_scale:
            if column in scaling_factors[program]:
                df.at[index, column] = row[column] * scaling_factors[program][column]
    
    return df

# Function to sum all scaled components to create a Total_Sum column
def add_total_sum_column(df):
    # List of columns to sum
    columns_to_sum = ['CPU', 'GPU', 'RAM', 'SSD', 'HDD', 'MOBO', 'NIC', 'PSU', 'TRAY', 'TOR', 'CHASSIS']
    
    # Create a new column 'Total_Sum' which is the sum of the specified columns
    df['Total_Sum'] = df[columns_to_sum].sum(axis=1)
    # Round to the nearest hundredth, two decimal places
    df['Total_Sum'] = df['Total_Sum'].apply(lambda x: round(x, 2))

    return df

# Function to scale the Total_Sum based on the SERVERS scaling factor
def scale_total_sum(df, scaling_factors):
    # Apply scaling to the Total_Sum based on SERVERS scaling factor
    for index, row in df.iterrows():
        program = row['Program']
        if 'SERVERS' in scaling_factors[program]:
            df.at[index, 'Total_Sum'] = row['Total_Sum'] * scaling_factors[program]['SERVERS']
    
    return df

# Scale the items in the DataFrame
quote_lines_scaled = scale_items(quote_lines_raw_prefix, columns_to_scale, scaling_factors)

# Add the Total_Sum column
quote_lines_scaled = add_total_sum_column(quote_lines_scaled)

# Scale the Total_Sum based on the SERVERS scaling factor
quote_lines_scaled = scale_total_sum(quote_lines_scaled, scaling_factors)

# Display and save the scaled dataframe with the new 'Total_Sum' column
print(quote_lines_scaled.head())
quote_lines_scaled.to_csv('quote_lines_summed.csv', index=False)


     Vendor Program             quote_timestamp     CPU      GPU     RAM  \
0  Vendor_7       D  2024-09-25T03:52:19.637095  305.84     0.00  427.44   
1  Vendor_6       E  2024-09-12T00:26:22.414219  324.21   892.70  359.28   
2  Vendor_6       A  2024-09-01T04:29:41.575326  651.50     0.00  208.96   
3  Vendor_5       A  2024-09-23T18:35:08.078890  600.58     0.00  194.92   
4  Vendor_5       B  2024-09-24T11:33:41.034306  667.06  1781.64  192.64   

      SSD     HDD    MOBO    NIC     PSU   TRAY      TOR  CHASSIS  Total_Sum  
0  196.14  2069.8  107.72  41.96   94.19  11.57   746.67  1015.84   50171.70  
1    0.00  3690.6  100.12  28.80   89.99  70.05   728.08  1068.57   73524.00  
2  181.26     0.0  114.01  45.32  106.22  28.00   719.14  1665.41   44637.84  
3  187.55     0.0  103.50  43.28  102.29  22.98   734.33  1770.98   45124.92  
4  386.50     0.0   90.87  46.00  227.80  80.13  1232.26  1563.11   50144.08  
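
The Total_Sum above multiplies the sum of all eleven scaled columns, including TOR and CHASSIS, by SERVERS. If rack components are instead counted once per rack, as the Background section describes them, the total could be computed as in the vectorized sketch below. It uses the first raw quote shown earlier (Vendor_7, Program D) together with the Program D quantities. The names `SERVER_ITEMS`, `RACK_ITEMS`, `qty`, and `rack_total` are introduced here for illustration, and whether this formula matches the vendors' summary totals is an assumption that would need checking against quote_summaries.csv.

```python
import pandas as pd

SERVER_ITEMS = ["CPU", "GPU", "RAM", "SSD", "HDD", "MOBO", "NIC", "PSU", "TRAY"]
RACK_ITEMS = ["TOR", "CHASSIS"]

# Program D quantities, one row per program, indexed by program letter.
factors = pd.DataFrame.from_dict(
    {"D": {"CPU": 1, "GPU": 0, "RAM": 8, "SSD": 1, "HDD": 20, "MOBO": 1, "NIC": 2,
           "PSU": 1, "TRAY": 1, "SERVERS": 10, "TOR": 1, "CHASSIS": 1}},
    orient="index",
)

# First raw quote line from quote_lines.csv (unit prices, prefix already removed).
quotes = pd.DataFrame([{
    "Vendor": "Vendor_7", "Program": "D",
    "CPU": 305.84, "GPU": 409.06, "RAM": 53.43, "SSD": 196.14, "HDD": 103.49,
    "MOBO": 107.72, "NIC": 20.98, "PSU": 94.19, "TRAY": 11.57,
    "TOR": 746.67, "CHASSIS": 1015.84,
}])

# Look up each quote's quantities by program; reset the index so rows align.
qty = factors.loc[quotes["Program"]].reset_index(drop=True)

# Server components: price * quantity per server, then times servers per rack.
server_cost = (quotes[SERVER_ITEMS] * qty[SERVER_ITEMS]).sum(axis=1) * qty["SERVERS"]
# Rack components: price * quantity, counted once per rack (assumption).
rack_cost = (quotes[RACK_ITEMS] * qty[RACK_ITEMS]).sum(axis=1)

quotes["rack_total"] = (server_cost + rack_cost).round(2)
print(quotes[["Vendor", "Program", "rack_total"]])
```

Unlike the row-by-row loops, this version leaves the unit-price columns untouched and writes only the new `rack_total` column, so the raw prices remain available for later steps.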


# Remove Quotes With Out-of-Bounds Dates

This set of functions filters the quote_lines_scaled DataFrame, removing quotes whose timestamps fall outside the range from the first Monday of the month to the 25th of the same month. It outputs the filtered data as a DataFrame named quote_lines_time.

In [18]:
from datetime import datetime, timedelta
# Helper function to get the first Monday of the month
def get_first_monday(year, month):
    date = datetime(year, month, 1)
    while date.weekday() != 0:  # 0 = Monday
        date += timedelta(days=1)
    return date

# Function to clean quote_timestamp based on the specified date criteria
def clean_timestamps(df):
    valid_rows = []
    
    for i, row in df.iterrows():
        # Convert the quote_timestamp to a datetime object
        quote_date = pd.to_datetime(row['quote_timestamp'])
        
        # Get the first Monday and the 25th of the month
        first_monday = get_first_monday(quote_date.year, quote_date.month)
        last_day = datetime(quote_date.year, quote_date.month, 25)
        
        # Check if the quote date is within the valid range
        if first_monday <= quote_date <= last_day:
            valid_rows.append(i)
        else:
            print(f"Discarding quote outside valid range: {row['quote_timestamp']}")
    
    # Return the filtered dataframe
    return df.loc[valid_rows]

# Clean the timestamps
quote_lines_time = clean_timestamps(quote_lines_scaled)
# Display the cleaned dataframe
# print(quote_lines_time.head())

quote_lines_raw_prefix2 = clean_timestamps(quote_lines_raw_prefix)
print(quote_lines_raw_prefix2.head())


Discarding quote outside valid range: 2024-09-25T03:52:19.637095
Discarding quote outside valid range: 2024-09-01T04:29:41.575326
Discarding quote outside valid range: 2024-09-01T18:29:18.286929
Discarding quote outside valid range: 2024-09-01T08:46:16.565852
Discarding quote outside valid range: 2024-09-25T02:56:27.371579
Discarding quote outside valid range: 2024-09-01T11:34:47.063229
Discarding quote outside valid range: 2024-09-01T03:28:05.312822
Discarding quote outside valid range: 2024-09-01T23:45:13.588332
Discarding quote outside valid range: 2024-09-01T22:47:48.325236
Discarding quote outside valid range: 2024-09-01T01:01:25.768272
Discarding quote outside valid range: 2024-09-25T00:00:53.275109
     Vendor Program             quote_timestamp     CPU      GPU     RAM  \
1  Vendor_6       E  2024-09-12T00:26:22.414219  324.21   892.70  359.28   
3  Vendor_5       A  2024-09-23T18:35:08.078890  600.58     0.00  194.92   
4  Vendor_5       B  2024-09-24T11:33:41.034306  667.06  
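
The log above shows quotes timestamped during 25 September being discarded. That happens because `datetime(year, month, 25)` is midnight at the start of the 25th, so any later time on that day compares as greater. If the 25th is meant to be inclusive for the whole day, as the validation rule states, one option is to compare calendar dates instead of timestamps. The sketch below uses only the standard library; `first_monday` and `in_quote_window` are hypothetical helper names.

```python
from datetime import date, datetime, timedelta


def first_monday(year, month):
    """Return the date of the first Monday in the given month."""
    first = date(year, month, 1)
    # weekday(): Monday is 0, so (7 - weekday) % 7 is the days until the next Monday.
    return first + timedelta(days=(7 - first.weekday()) % 7)


def in_quote_window(timestamp):
    """True if an ISO timestamp falls between the first Monday and the 25th, inclusive."""
    day = datetime.fromisoformat(timestamp).date()
    return first_monday(day.year, day.month) <= day <= date(day.year, day.month, 25)


# Timestamps taken from the discard log and from the Vendor_7 listing.
for ts in ["2024-09-01T04:29:41.575326",
           "2024-09-02T00:38:15.661547",
           "2024-09-25T03:52:19.637095"]:
    print(ts, in_quote_window(ts))
```

Applying this to the notebook would change which rows survive the filter, so the downstream row counts would need to be regenerated.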

# Remove Vendor 7 with Programs C or E

In [20]:
def remove_vendor_7_programs(df):
    # Filter out rows where Vendor is 'Vendor_7' and Program is either 'C' or 'E'
    # (the 'Program_' prefix was already removed)
    df_filtered = df[~((df['Vendor'] == 'Vendor_7') & (df['Program'].isin(['C', 'E'])))]
    
    # Log the discarded rows for reference (optional)
    discarded_rows = df[(df['Vendor'] == 'Vendor_7') & (df['Program'].isin(['C', 'E']))]
    if not discarded_rows.empty:
        print(f"Discarding {len(discarded_rows)} quotes from Vendor_7 for program C and program E:\n")
        print(discarded_rows[['Vendor', 'Program', 'quote_timestamp']])

    return df_filtered

# Remove vendor_7 from program C or E
quote_lines_7CE_removed = remove_vendor_7_programs(quote_lines_time)

# Display the final cleaned dataframe
# print(quote_lines_7CE_removed.head())

quote_lines_raw_prefix2 = remove_vendor_7_programs(quote_lines_raw_prefix2)
# print(quote_lines_raw_prefix2.head())



Discarding 12 quotes from Vendor_7 for program C and program E:

       Vendor Program             quote_timestamp
5    Vendor_7       C  2024-09-15T13:20:11.557807
27   Vendor_7       E  2024-09-07T17:45:04.688250
41   Vendor_7       C  2024-09-10T10:24:30.472069
48   Vendor_7       E  2024-09-20T06:35:16.738483
49   Vendor_7       C  2024-09-24T22:09:35.544275
60   Vendor_7       C  2024-09-02T00:38:15.661547
83   Vendor_7       C  2024-09-17T00:37:32.947288
90   Vendor_7       C  2024-09-04T20:54:28.430814
114  Vendor_7       E  2024-09-13T12:18:38.436683
120  Vendor_7       C  2024-09-05T08:48:50.223750
170  Vendor_7       E  2024-09-02T19:59:44.793380
171  Vendor_7       C  2024-09-10T20:59:39.959689


In [23]:
def merge_with_reported_prices(quote_lines, quote_summaries):
    # Perform the merge on Vendor, Program, and quote_timestamp
    merged_df = quote_lines.merge(
        quote_summaries,
        how='inner',  # Use 'inner' join to keep only matching records
        left_on=['Vendor', 'Program', 'quote_timestamp'],
        right_on=['vendor', 'program', 'quote_timestamp'],
        suffixes=('', '_reported')  # Avoid column name clashes
    )
    
    # Log the result of the merge
    print(f"Merged DataFrame shape: {merged_df.shape}")
    print("Merged DataFrame preview:")
    print(merged_df.head())
    
    return merged_df

# Merge the datasets
merged_quotes_raw = merge_with_reported_prices(quote_lines_7CE_removed, quote_summaries_raw)
merged_quotes_raw.to_csv("merged_quotes_raw.csv")
# Display the merged DataFrame
# print(merged_quotes_raw.head())

quote_lines_raw_prefix2 = merge_with_reported_prices(quote_lines_raw_prefix2, quote_summaries_raw)
print(quote_lines_raw_prefix2.head())



Merged DataFrame shape: (152, 19)
Merged DataFrame preview:
     Vendor Program             quote_timestamp     CPU      GPU     RAM  \
0  Vendor_6       E  2024-09-12T00:26:22.414219  324.21   892.70  359.28   
1  Vendor_5       A  2024-09-23T18:35:08.078890  600.58     0.00  194.92   
2  Vendor_5       B  2024-09-24T11:33:41.034306  667.06  1781.64  192.64   
3  Vendor_3       A  2024-09-17T18:56:17.120014  677.76     0.00  206.76   
4  Vendor_6       B  2024-09-03T02:21:46.569968  658.04  1977.20  214.88   

      SSD     HDD    MOBO    NIC     PSU   TRAY      TOR  CHASSIS  Total_Sum  \
0    0.00  3690.6  100.12  28.80   89.99  70.05   728.08  1068.57   73524.00   
1  187.55     0.0  103.50  43.28  102.29  22.98   734.33  1770.98   45124.92   
2  386.50     0.0   90.87  46.00  227.80  80.13  1232.26  1563.11   50144.08   
3  198.78     0.0  113.62  47.54  112.67  32.99   547.85  1108.16   36553.56   
4  393.92     0.0   92.94  55.74  217.66  17.06  1107.08  1894.53   53032.40   

  

In [24]:
def clean_merged_dataset(merged_quotes):
    # Filter rows where Total_Sum matches reported_total_price
    cleaned_quotes = merged_quotes[
        merged_quotes['Total_Sum'].astype(float) == merged_quotes['reported_total_price']
    ]
    
    # Log the result of the cleaning process
    print(f"Cleaned DataFrame shape: {cleaned_quotes.shape}")
    print("Cleaned DataFrame preview:")
    print(cleaned_quotes.head())
    
    return cleaned_quotes

merged_quotes_cleaned = clean_merged_dataset(merged_quotes_raw)

# Display the cleaned DataFrame
print(merged_quotes_cleaned)
quote_lines_raw_prefix2 = clean_merged_dataset(quote_lines_raw_prefix2)



Cleaned DataFrame shape: (0, 19)
Cleaned DataFrame preview:
Empty DataFrame
Columns: [Vendor, Program, quote_timestamp, CPU, GPU, RAM, SSD, HDD, MOBO, NIC, PSU, TRAY, TOR, CHASSIS, Total_Sum, Scaled_Total_Cost, vendor, program, reported_total_price]
Index: []
Empty DataFrame
Columns: [Vendor, Program, quote_timestamp, CPU, GPU, RAM, SSD, HDD, MOBO, NIC, PSU, TRAY, TOR, CHASSIS, Total_Sum, Scaled_Total_Cost, vendor, program, reported_total_price]
Index: []
Cleaned DataFrame shape: (0, 19)
Cleaned DataFrame preview:
Empty DataFrame
Columns: [Vendor, Program, quote_timestamp, CPU, GPU, RAM, SSD, HDD, MOBO, NIC, PSU, TRAY, TOR, CHASSIS, Total_Sum, Scaled_Total_Cost, vendor, program, reported_total_price]
Index: []
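
An empty result here can have more than one cause: the computed totals may follow a different formula than the vendors' summaries, or exact `==` comparison on floats may reject values that differ only by rounding noise. The sketch below addresses the second cause with a tolerance of half a cent and also shows a difference column that can help diagnose the first. The sample values and the 0.005 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical merged rows: computed total vs. the vendor's reported total.
merged = pd.DataFrame({
    "Total_Sum": [0.1 + 0.2, 100.00, 250.50],
    "reported_total_price": [0.3, 100.00, 260.00],
})

# Exact equality: the first row fails only because of float representation.
exact = merged["Total_Sum"] == merged["reported_total_price"]

# Tolerance-based match: treat values within half a cent as equal.
delta = merged["Total_Sum"] - merged["reported_total_price"]
within_cent = delta.abs() < 0.005

# Keeping the rounded difference makes systematic mismatches easy to spot.
merged["difference"] = delta.round(2)

print(merged.assign(exact=exact, within_cent=within_cent))
validated = merged[within_cent]
```

If the difference column shows large, consistent gaps across rows rather than tiny ones, the total formula is the more likely culprit than float precision.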


In [26]:
# Sanity Check
def print_matching_rows(merged_quotes):
    # Find rows where Total_Sum matches reported_total_price
    matching_rows = merged_quotes[
        merged_quotes['Total_Sum'].astype(float) == merged_quotes['reported_total_price']
    ]
    
    # Print the matching rows
    if not matching_rows.empty:
        print("Rows where Total_Sum matches reported_total_price:")
        print(matching_rows)
    else:
        print("No matching rows found.")

# Example usage
# Assuming merged_quotes is the DataFrame obtained from the previous merge function
print_matching_rows(merged_quotes_raw)
# print_matching_rows(quote_lines_raw_prefix2)


No matching rows found.
No matching rows found.


# Submission Requirements

The final four cells of this file display the following four outputs:

- The table of extended quantities per component for all programs.
- The table of total cost per program per vendor, based on the latest received quote.
- The table of total cost per program per vendor, based on the first received quote.
- The table of "best-in-class" total cost per program (regardless of vendor).