# Data Classification Jupyter Notebook

Jupyter Notebook is an interactive, web-based environment for creating and sharing documents that combine live code, equations, visualizations, and narrative text, which makes it well suited to data analysis and visualization. To use it, install Jupyter Notebook first. Once it is installed, start it by running the command `jupyter notebook` in your terminal or command prompt.

To create a new notebook, click `New` and then `Python 3` in the Jupyter Notebook interface. Write your code in a cell and run it by pressing `Shift+Enter`. To add more cells, click the `+` button in the toolbar (labeled `+ Code` in some front ends). You can also add Markdown (text) cells to explain or comment on your code.

To explore the data, import libraries such as pandas, matplotlib, and seaborn. Read the data into a pandas DataFrame to inspect and transform it, use matplotlib to plot it, and use seaborn, a library built on top of matplotlib, for higher-level visualizations.

We'll start the process by importing pandas and loading the data, then print the first few rows of the file to check that the data was imported correctly.

In [6]:
import pandas as pd

df = pd.read_csv('data/Spend_Intake_010124_063024.csv')
df.head()


Unnamed: 0,source_system\tdate_extract\ttransaction_id\ttransaction_number\ttransaction_line_number\tsupplier_id\tsupplier_name\tdba\tsupplier_address\tsupplier_contracted\tgl_account_code\tgl_account_desc\tcost_centre_code\tcost_centre_code_desc\ttransaction_line_desc\ttransaction_line_value\ttransaction_line_unit_price\ttransaction_currency\ttransaction_line_qty\tuom\tuom_volume\tinternal_classification_code\tinternal_classification_desc\tgl_date\tcurrency_conversion\titem_code\tcontract_id\ton_off_catalog\tcatalog_name\torder_or_invoice_date\tpayment_date\ttransaction_date\tpayment_terms_desc\tpayment_terms_days_due\tpay_term_discount_amt\tpo_number\tpo_line_number\tbu_level1\tbu_level2\tbu_level3\tbu_region\tbu_country\tbu_state\tsupplier_level1\tsupplier_level2\tsupplier_level3\tsupplier_level4\tcategory_level1\tcategory_level2\tcategory_level3\tcategory_level4\torder_type\titem_description\titem_service_line\tvendor_type\titem_3rd_number\tcompany_code\tcompany_name\tcompany_division\tcompany_region\tcompany_zone\tcompany_service_line\tcompany_service_type\teof,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,ARS_JDE\t2024-07-10\t\t4513040\t1.000\t776691\...,,,,,,,,
1,ARS_JDE\t2024-07-10\t\t4513040\t2.000\t776691\...,,,,,,,,
2,ARS_JDE\t2024-07-10\t\t4513040\t3.000\t776691\...,,,,,,,,
3,ARS_JDE\t2024-07-10\t\t4513040\t4.000\t776691\...,,,,,,,,
4,ARS_JDE\t2024-07-10\t\t4788252\t.000\t649327\t...,,,,,,,,
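
One way to check the delimiter rather than eyeballing it is to let the standard library's `csv.Sniffer` inspect a sample of the text. The sketch below is illustrative only: `guess_delimiter` is a hypothetical helper, and the sample string is made up to resemble the file rather than copied from it.

```python
import csv


def guess_delimiter(sample, candidates=",\t;|"):
    """Return the delimiter that csv.Sniffer infers from a text sample."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Invented sample: three tab-separated columns, header plus two rows
sample = (
    "source_system\tdate_extract\tsupplier_id\n"
    "ARS_JDE\t2024-07-10\t776691\n"
    "ARS_JDE\t2024-07-10\t649327\n"
)
print(repr(guess_delimiter(sample)))
```

With a real file, the sample would be the first few kilobytes read from disk. The sniffer relies on heuristics, so its answer is a hint and not a guarantee.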


The `\t` sequences in the output suggest that the file is tab-delimited rather than comma-delimited, so let's read it again with `sep='\t'`.

In [7]:
import pandas as pd

df = pd.read_csv('data/Spend_Intake_010124_063024.csv', sep='\t')
df.head()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 39
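
The error names a line number and a field count, so a quick check of how many tab-separated fields each line holds can show where the file is inconsistent. This is a minimal sketch under assumptions: `field_count_report` is a hypothetical helper, and the sample lines are invented for illustration.

```python
from collections import Counter


def field_count_report(lines, sep="\t"):
    """Return the most common field count and the 1-based numbers of lines that differ from it."""
    counts = [len(line.rstrip("\r\n").split(sep)) for line in lines]
    if not counts:
        return 0, []
    expected = Counter(counts).most_common(1)[0][0]
    odd_lines = [number for number, n in enumerate(counts, start=1) if n != expected]
    return expected, odd_lines


# Invented sample: the third line is one field short
sample_lines = [
    "a\tb\tc\n",
    "1\t2\t3\n",
    "4\t5\n",
    "6\t7\t8\n",
]
expected, odd_lines = field_count_report(sample_lines)
print(expected, odd_lines)
```

For the real file, the same function could be given the lines of an open file handle. Note that it splits on raw tabs and ignores quoting, so its counts may differ from what the pandas parser reports.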


The parser found a different number of fields on line 14 than it expected, so the rows are not consistently formatted. Let's try again with some additional logic to clean each row.

In [10]:
# Ensure pandas is imported
import pandas as pd


def clean_messy_csv(file_path):
    """
    Reads a CSV file with tab-separated values, cleans it, and returns a pandas DataFrame.

    Parameters:
    file_path (str): The path to the CSV file.

    Returns:
    pd.DataFrame: Cleaned data as a pandas DataFrame.
    """
    # Read the file as plain text, with an explicit encoding, so each line can be processed manually
    with open(file_path, "r", encoding='utf-8') as file:
        lines = file.readlines()

    # Split the header and data lines on tabs; strip only the line ending,
    # because strip() would also remove trailing tabs and drop empty trailing fields
    header = lines[0].rstrip("\r\n").split("\t")
    data_lines = [line.rstrip("\r\n").split("\t") for line in lines[1:]]

    # Create a DataFrame from the processed data
    cleaned_data = pd.DataFrame(data_lines, columns=header)

    # Remove any leading/trailing whitespace characters from the headers
    cleaned_data.columns = cleaned_data.columns.str.strip()

    # Optionally, remove any rows with entirely empty values
    cleaned_data.dropna(how="all", inplace=True)

    return cleaned_data


# Now, let's use the function and display the first few rows of the cleaned DataFrame
clean_data = clean_messy_csv('data/Spend_Intake_010124_063024.csv')
clean_data.head()

Unnamed: 0,"""source_system",date_extract,transaction_id,transaction_number,transaction_line_number,supplier_id,supplier_name,dba,supplier_address,supplier_contracted,...,vendor_type,item_3rd_number,company_code,company_name,company_division,company_region,company_zone,company_service_line,company_service_type,"eof"",,,,,,,,"
0,"""ARS_JDE",2024-07-10,,4513040,1.0,776691,RHEEM SALES COMPANY INC,RHEEM SALES COMPANY INC,PO BOX 533013 CHARLOTTE NC 28290-3013,,...,General Vendor -DO NOT USE,,9202,McCarthy Heating & Air,Northeast Division,Mid-Atlantic Region,HVAC/Combo Zone,HVAC (02),Home Depot AOR (10),"X"",,,,,,,,"
1,"""ARS_JDE",2024-07-10,,4513040,2.0,776691,RHEEM SALES COMPANY INC,RHEEM SALES COMPANY INC,PO BOX 533013 CHARLOTTE NC 28290-3013,,...,General Vendor -DO NOT USE,,9202,McCarthy Heating & Air,Northeast Division,Mid-Atlantic Region,HVAC/Combo Zone,HVAC (02),Home Depot AOR (10),"X"",,,,,,,,"
2,"""ARS_JDE",2024-07-10,,4513040,3.0,776691,RHEEM SALES COMPANY INC,RHEEM SALES COMPANY INC,PO BOX 533013 CHARLOTTE NC 28290-3013,,...,General Vendor -DO NOT USE,,9202,McCarthy Heating & Air,Northeast Division,Mid-Atlantic Region,HVAC/Combo Zone,HVAC (02),Home Depot AOR (10),"X"",,,,,,,,"
3,"""ARS_JDE",2024-07-10,,4513040,4.0,776691,RHEEM SALES COMPANY INC,RHEEM SALES COMPANY INC,PO BOX 533013 CHARLOTTE NC 28290-3013,,...,General Vendor -DO NOT USE,,9202,McCarthy Heating & Air,Northeast Division,Mid-Atlantic Region,HVAC/Combo Zone,HVAC (02),Home Depot AOR (10),"X"",,,,,,,,"
4,"""ARS_JDE",2024-07-10,,4788252,0.0,649327,C & J FAMILY TRUST,C & J FAMILY TRUST,3100 A PULLMAN ST COSTA MESA` CA 92626,,...,General Vendor -DO NOT USE,,8101,Rescue Rooter Orange #560,Plumbing Division,CA Plumbing Region,Plumbing Zone,Administrative (00),Administrative (00),"X"",,,,,,,,"
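
The first and last columns above still carry a stray double quote and a run of trailing commas, which suggests (as an inference, not something the file confirms) that each row was wrapped in quotes and padded with commas by an earlier export. A minimal sketch of one way to remove that wrapping before splitting, using a hypothetical helper `unwrap_line` and an invented sample row:

```python
def unwrap_line(line):
    """Drop trailing commas and one pair of wrapping double quotes, then split on tabs."""
    text = line.rstrip("\r\n").rstrip(",")
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1]
    return text.split("\t")


# Invented row shaped like the output above: quoted, tab-separated, comma-padded
raw = '"ARS_JDE\t2024-07-10\t\t4513040\tX",,,,,,,,\n'
print(unwrap_line(raw))
```

This assumes the quotes wrap the whole row and that no final field legitimately ends in a comma. If either assumption fails for the real data, the helper would need adjusting.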


Let's filter out some of the noise by selecting a handful of supplier, account, and classification columns into a smaller DataFrame.

In [12]:
# Extract a smaller number of fields from a dataframe
smaller_df = clean_data[['supplier_id', 'supplier_name', 'dba', 'gl_account_code', 'gl_account_desc',
                         'cost_centre_code', 'cost_centre_code_desc', 'internal_classification_code',
                         'internal_classification_desc', 'item_code', 'order_type']]
smaller_df.head()


Unnamed: 0,supplier_id,supplier_name,dba,gl_account_code,gl_account_desc,cost_centre_code,cost_centre_code_desc,internal_classification_code,internal_classification_desc,item_code,order_type
0,776691,RHEEM SALES COMPANY INC,RHEEM SALES COMPANY INC,92020210.5207,Serialized Equipment,92020210,HVAC HDepot AOR McCarthy Heat,31450447,Supplier Invoice Number,UP18AZ48AJVCA (W18231586,JOB/FIELD TICKET PURCHASE
1,776691,RHEEM SALES COMPANY INC,RHEEM SALES COMPANY INC,92020210.5207,Serialized Equipment,92020210,HVAC HDepot AOR McCarthy Heat,31450447,Supplier Invoice Number,RHMVZ6021SEACAJ (W162370,JOB/FIELD TICKET PURCHASE
2,776691,RHEEM SALES COMPANY INC,RHEEM SALES COMPANY INC,92020210.5205,Parts and Materials,92020210,HVAC HDepot AOR McCarthy Heat,31450447,Supplier Invoice Number,Sales tax,JOB/FIELD TICKET PURCHASE
3,776691,RHEEM SALES COMPANY INC,RHEEM SALES COMPANY INC,92020210.5209,Vendor Rebates Earned,92020210,HVAC HDepot AOR McCarthy Heat,31450447,Supplier Invoice Number,Vendor Rebates Earned,JOB/FIELD TICKET PURCHASE
4,649327,C & J FAMILY TRUST,C & J FAMILY TRUST,81010000.721,Rent Expense,81010000,Admn RR Orange #560,0124BASE1,Supplier Invoice Number,C&J Family Trust (Base Rent),


Let's pause here and take a minute to create our assistant in the OpenAI Playground. Once that's done, we'll create a new thread to encapsulate the logic for a supplier transaction.

In [6]:
import os
from openai import OpenAI

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable is not set.")
client = OpenAI(api_key=api_key)
print("OpenAI API key is set.")

ValueError: OPENAI_API_KEY environment variable is not set.
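
This error means the variable was not visible to the notebook kernel. One common cause is exporting it in a shell after the Jupyter server was already started, since the kernel inherits its environment from the server process. The sketch below shows one way to make the failure message more actionable. It is only an illustration: `get_api_key` is a hypothetical helper, and the key shown is a placeholder, not a real credential.

```python
import os


def get_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the key from the given mapping (default: os.environ), or raise with a hint."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise ValueError(
            f"{name} is not set. Export it in the shell that launches Jupyter, "
            "then restart the notebook server."
        )
    return key


# Demonstration with a stand-in mapping and a placeholder value
print(get_api_key({"OPENAI_API_KEY": "sk-example"}))
```

Passing a mapping explicitly keeps the helper easy to test. In the notebook it would be called with no arguments so that it reads `os.environ`.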