# Data Loading & Quality Verification

---

## Key Findings Summary

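**All 6 data files loaded successfully. No duplicate keys or missing values were found, but 3 ShipIDs and 1 InvoiceID referenced in Sales Orders have no matching record and need follow-up.**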

### Dataset Overview:
- **Sales Orders**: 1,168 records with order dates from 2016-12-28 to 2017-12-29
- **Shipments**: 1,165 shipment records  
- **Customer Invoices**: 1,167 invoice records
- **Customer Master**: 73 distributors across multiple territories
- **Sales Territory**: 5 sales territories with Q4 2017 goals
- **Products**: 14 different product SKUs

### Data Quality Observations:
- All datasets loaded without errors
- No missing critical identifiers (IDs, dates)
- Date fields properly formatted
- Numeric fields contain valid values
- Cross-file linking keys present (SalesOrderID, CustID, etc.); 3 ShipIDs and 1 InvoiceID in Sales Orders have no match in Shipments or Customer Invoices

### Ready for Analysis:
- Requirement 2: Revenue/AR reconciliation
- Requirement 3: Three-way match testing
- Requirements 4-5: Credit limit analysis
- Requirement 6: Aging analysis
- Requirement 7: Fraud detection testing

---

## Objective
Load all UMD data files and perform initial data quality checks to ensure data integrity before analysis.

## Data Sources
- UMD_Data Set_Sales Orders.xlsx
- UMD_Data Set_Shipments.xlsx
- UMD_Data Set_Customer Invoices.xlsx
- UMD_Data Set_Customer Master.xlsx
- UMD_Data Set_Sales Territory.xlsx
- UMD_Data Set_Products.xlsx


In [10]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '{:.2f}'.format)

## 1. Load All Data Files

Loading all 6 data files from the UMD case.

In [11]:
# Define data directory
data_dir = Path('../data')

# Load all datasets
sales_orders = pd.read_excel(data_dir / 'UMD_Data Set_Sales Orders.xlsx')
shipments = pd.read_excel(data_dir / 'UMD_Data Set_Shipments.xlsx')
invoices = pd.read_excel(data_dir / 'UMD_Data Set_Customer Invoices.xlsx')
customers = pd.read_excel(data_dir / 'UMD_Data Set_Customer Master.xlsx')
territories = pd.read_excel(data_dir / 'UMD_Data Set_Sales Territory.xlsx')
products = pd.read_excel(data_dir / 'UMD_Data Set_Products.xlsx')

print("✓ All data files loaded successfully")

✓ All data files loaded successfully


In [12]:
# Quick overview of all datasets
datasets = {
    'Sales Orders': sales_orders,
    'Shipments': shipments,
    'Customer Invoices': invoices,
    'Customer Master': customers,
    'Sales Territory': territories,
    'Products': products
}

overview = pd.DataFrame({
    'Dataset': list(datasets.keys()),
    'Records': [len(df) for df in datasets.values()],
    'Columns': [len(df.columns) for df in datasets.values()]
})

print("Dataset Overview:")
print("=" * 60)
print(overview.to_string(index=False))
print("=" * 60)

Dataset Overview:
          Dataset  Records  Columns
     Sales Orders     1168       16
        Shipments     1165        7
Customer Invoices     1167        7
  Customer Master       73        8
  Sales Territory        5        6
         Products       14       11


## 2. Sales Orders Data

The Sales Orders dataset is the primary transaction file. Per the case, it includes:
- Orders from prior year (2016) completed in 2017
- Orders from 2017  
- Orders from 2017 not completed until 2018

In [13]:
print("SALES ORDERS Dataset")
print("=" * 80)
print(f"Total Records: {len(sales_orders):,}")
print(f"\nColumns ({len(sales_orders.columns)}):")
print(sales_orders.columns.tolist())
print(f"\nData Types:")
print(sales_orders.dtypes)
print(f"\nMissing Values:")
print(sales_orders.isnull().sum()[sales_orders.isnull().sum() > 0])
print(f"\nSample Records:")
sales_orders.head(3)

SALES ORDERS Dataset
Total Records: 1,168

Columns (16):
['SalesOrderID', 'OrderDate', 'ProdID', 'CustID', 'TerritoryID', 'Quantity', 'UnitPrice', 'SubTotal', 'TaxAmt', 'Freight', 'TotalDue', 'CredApr', 'ShipID', 'InvoiceID', 'ModifiedDate', 'ModifiedTime']

Data Types:
SalesOrderID             int64
OrderDate       datetime64[ns]
ProdID                   int64
CustID                   int64
TerritoryID              int64
Quantity                 int64
UnitPrice                int64
SubTotal                 int64
TaxAmt                 float64
Freight                float64
TotalDue               float64
CredApr                 object
ShipID                   int64
InvoiceID                int64
ModifiedDate    datetime64[ns]
ModifiedTime            object
dtype: object

Missing Values:
Series([], dtype: int64)

Sample Records:


| | SalesOrderID | OrderDate | ProdID | CustID | TerritoryID | Quantity | UnitPrice | SubTotal | TaxAmt | Freight | TotalDue | CredApr | ShipID | InvoiceID | ModifiedDate | ModifiedTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2679 | 2016-12-28 | 7022 | 12 | 1 | 40 | 895 | 35800 | 2685.0 | 57.28 | 38542.28 | ADB | 64883 | 100635 | 2017-01-02 | 17:26:18.778000 |
| 1 | 2680 | 2016-12-28 | 7031 | 26 | 4 | 20 | 3363 | 67260 | 5044.5 | 72.96 | 72377.46 | ADB | 64881 | 100633 | 2016-12-31 | 17:45:36.022000 |
| 2 | 2681 | 2016-12-28 | 7122 | 19 | 2 | 45 | 2785 | 125325 | 9399.38 | 112.83 | 134837.21 | ADB | 64884 | 100636 | 2017-01-03 | 13:38:19.662000 |


## 3. Shipments Data

Shipments represent when product was actually shipped to distributors.


In [14]:
print("SHIPMENTS Dataset")
print("=" * 80)
print(f"Total Records: {len(shipments):,}")
print(f"\nColumns ({len(shipments.columns)}):")
print(shipments.columns.tolist())
print(f"\nData Types:")
print(shipments.dtypes)
print(f"\nMissing Values:")
print(shipments.isnull().sum()[shipments.isnull().sum() > 0])
print(f"\nSample Records:")
shipments.head(3)


SHIPMENTS Dataset
Total Records: 1,165

Columns (7):
['ShipID', 'SalesOrderID', 'ShipDate', 'ShipWeight', 'Carrier', 'ModifiedDate', 'ModifiedTime']

Data Types:
ShipID                   int64
SalesOrderID             int64
ShipDate        datetime64[ns]
ShipWeight             float64
Carrier                 object
ModifiedDate    datetime64[ns]
ModifiedTime            object
dtype: object

Missing Values:
Series([], dtype: int64)

Sample Records:


| | ShipID | SalesOrderID | ShipDate | ShipWeight | Carrier | ModifiedDate | ModifiedTime |
|---|---|---|---|---|---|---|---|
| 0 | 64883 | 2679 | 2017-01-02 | 23.6 | Jet Freight | 2017-01-02 | 17:26:18.778000 |
| 1 | 64884 | 2681 | 2017-01-03 | 29.9 | Titan Shipments | 2017-01-03 | 13:38:19.662000 |
| 2 | 64882 | 2682 | 2017-01-02 | 15.73 | Bengal Trucking | 2017-01-02 | 14:51:20.210000 |


## 4. Customer Invoices Data

**Important:** PaidDate = 9/9/9999 indicates unpaid invoices (per case documentation).
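Because the year 9999 falls outside the range pandas can store at nanosecond resolution, the sentinel is one plausible reason `PaidDate` loads as `object` rather than as a datetime column. The sketch below shows one way to split the column into an unpaid flag and a real datetime column. The helper name `split_paid_date` and the toy data are illustrative assumptions, not part of the case files or the notebook.

```python
from datetime import datetime

import pandas as pd


def split_paid_date(paid_raw: pd.Series) -> pd.DataFrame:
    """Separate the unpaid sentinel from real payment dates (hypothetical helper)."""
    # Same test the notebook uses: a value is the sentinel if its text contains "9999".
    is_unpaid = paid_raw.astype(str).str.contains("9999")
    # Blank out the sentinel before parsing so it becomes NaT instead of a fake date.
    paid_date = pd.to_datetime(paid_raw.where(~is_unpaid), errors="coerce")
    return pd.DataFrame({"IsUnpaid": is_unpaid, "PaidDate": paid_date})


# Toy data (not from the case files): two paid invoices and one unpaid.
sample_paid = pd.Series(
    [datetime(2017, 2, 2), datetime(9999, 9, 9), datetime(2017, 1, 31)],
    dtype=object,
)
status = split_paid_date(sample_paid)
print(status)
```

Keeping the flag and the parsed date as separate columns means later date arithmetic never has to special-case the sentinel.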


In [15]:
print("CUSTOMER INVOICES Dataset")
print("=" * 80)
print(f"Total Records: {len(invoices):,}")
print(f"\nColumns ({len(invoices.columns)}):")
print(invoices.columns.tolist())
print(f"\nData Types:")
print(invoices.dtypes)
print(f"\nMissing Values:")
print(invoices.isnull().sum()[invoices.isnull().sum() > 0])

# Check payment status (PaidDate = 9/9/9999 indicates unpaid)
# Note: PaidDate loads as object dtype, so cast to string and match the 9999 sentinel year
is_unpaid = invoices['PaidDate'].astype(str).str.contains('9999')
unpaid_count = is_unpaid.sum()
paid_count = (~is_unpaid).sum()
print(f"\nPayment Status:")
print(f"  Paid: {paid_count:,}")
print(f"  Unpaid: {unpaid_count:,}")
print(f"\nSample Records:")
invoices.head(3)


CUSTOMER INVOICES Dataset
Total Records: 1,167

Columns (7):
['InvoiceID', 'CustID', 'InvoiceDate', 'SalesOrderID', 'PaidDate', 'ModifiedDate', 'ModifiedTime']

Data Types:
InvoiceID                int64
CustID                   int64
InvoiceDate     datetime64[ns]
SalesOrderID             int64
PaidDate                object
ModifiedDate    datetime64[ns]
ModifiedTime            object
dtype: object

Missing Values:
Series([], dtype: int64)

Payment Status:
  Paid: 1,130
  Unpaid: 37

Sample Records:


| | InvoiceID | CustID | InvoiceDate | SalesOrderID | PaidDate | ModifiedDate | ModifiedTime |
|---|---|---|---|---|---|---|---|
| 0 | 100635 | 54 | 2017-01-02 | 2679 | 2017-02-02 00:00:00 | 2017-02-02 | 17:29:59.480000 |
| 1 | 100636 | 62 | 2017-01-03 | 2681 | 2017-02-01 00:00:00 | 2017-02-01 | 10:05:26.862000 |
| 2 | 100634 | 23 | 2017-01-02 | 2682 | 2017-01-31 00:00:00 | 2017-01-31 | 16:30:58.923000 |


## 5. Customer Master Data

Contains all 73 distributors with their credit limits.


In [16]:
print("CUSTOMER MASTER Dataset")
print("=" * 80)
print(f"Total Records: {len(customers):,}")
print(f"\nColumns ({len(customers.columns)}):")
print(customers.columns.tolist())
print(f"\nData Types:")
print(customers.dtypes)
print(f"\nMissing Values:")
print(customers.isnull().sum()[customers.isnull().sum() > 0])
print(f"\nCredit Limit Summary:")
print(customers['CredLimit'].describe())
print(f"\nSample Records:")
customers.head(3)


CUSTOMER MASTER Dataset
Total Records: 73

Columns (8):
['CustID', 'TerritoryID', 'CustName', 'ShipAddr', 'BillAddr', 'CredLimit', 'ModifiedDate', 'ModifiedTime']

Data Types:
CustID                   int64
TerritoryID              int64
CustName                object
ShipAddr                object
BillAddr                object
CredLimit                int64
ModifiedDate    datetime64[ns]
ModifiedTime            object
dtype: object

Missing Values:
Series([], dtype: int64)

Credit Limit Summary:
count       73.00
mean    240410.96
std     132460.04
min      50000.00
25%     100000.00
50%     250000.00
75%     250000.00
max     500000.00
Name: CredLimit, dtype: float64

Sample Records:


| | CustID | TerritoryID | CustName | ShipAddr | BillAddr | CredLimit | ModifiedDate | ModifiedTime |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | Desert Medical | 7159 Clear Sky Thicket, Robins Forest West, SC... | 7159 Clear Sky Thicket, Robins Forest West, SC... | 250000 | 2013-01-06 | 15:42:13.279000 |
| 1 | 2 | 3 | Maple Limited | 5341 Silent Pond End, Venetie Landing, SD, 572... | 5341 Silent Pond End, Venetie Landing, SD, 572... | 250000 | 2013-01-27 | 07:44:23.831000 |
| 2 | 3 | 5 | Moonlight Medical | 3625 Lazy Quail Village, Silver Springs, BC, V... | 3625 Lazy Quail Village, Silver Springs, BC, V... | 250000 | 2013-02-27 | 17:17:51.619000 |


## 6. Sales Territory Data

Contains the 5 sales territories with Q4 2017 sales goals.


In [17]:
print("SALES TERRITORY Dataset")
print("=" * 80)
print(f"Total Records: {len(territories):,}")
print(f"\nColumns ({len(territories.columns)}):")
print(territories.columns.tolist())
print(f"\nData Types:")
print(territories.dtypes)
print(f"\nMissing Values:")
print(territories.isnull().sum()[territories.isnull().sum() > 0])
print(f"\nAll Territories:")
territories[['TerritoryID', 'TerritoryName', 'SalesVP', 'SalesGoalQTR']]


SALES TERRITORY Dataset
Total Records: 5

Columns (6):
['TerritoryID', 'TerritoryName', 'SalesVP', 'SalesGoalQTR', 'ModifiedDate', 'ModifiedTime']

Data Types:
TerritoryID               int64
TerritoryName            object
SalesVP                  object
SalesGoalQTR              int64
ModifiedDate     datetime64[ns]
ModifiedTime             object
dtype: object

Missing Values:
Series([], dtype: int64)

All Territories:


| | TerritoryID | TerritoryName | SalesVP | SalesGoalQTR |
|---|---|---|---|---|
| 0 | 1 | Northeast | Doug Petersen | 6000000 |
| 1 | 2 | Southeast | Dan Kwin | 8000000 |
| 2 | 3 | Midwest | Marvin Louis | 4800000 |
| 3 | 4 | Southwest | Bruce Eryans | 2250000 |
| 4 | 5 | West | Anthony Linn | 5750000 |


## 7. Products Data

Master list of all products sold by UMD.


In [18]:
print("PRODUCTS Dataset")
print("=" * 80)
print(f"Total Records: {len(products):,}")
print(f"\nColumns ({len(products.columns)}):")
print(products.columns.tolist())
print(f"\nData Types:")
print(products.dtypes)
print(f"\nMissing Values:")
print(products.isnull().sum()[products.isnull().sum() > 0])
print(f"\nPrice Range:")
print(products['UnitPrice'].describe())
print(f"\nSample Records:")
products.head(3)


PRODUCTS Dataset
Total Records: 14

Columns (11):
['ProdID', 'ProdName', 'SafetyStockLevel', 'ReManPoint', 'StandardCost', 'UnitPrice', 'Weight', 'DaysToMan', 'SellStartDate', 'ModifiedDate', 'ModifiedTime']

Data Types:
ProdID                       int64
ProdName                    object
SafetyStockLevel             int64
ReManPoint                   int64
StandardCost               float64
UnitPrice                    int64
Weight                     float64
DaysToMan                    int64
SellStartDate       datetime64[ns]
ModifiedDate        datetime64[ns]
ModifiedTime                object
dtype: object

Missing Values:
Series([], dtype: int64)

Price Range:
count     14.00
mean    2086.14
std     1100.25
min      895.00
25%     1259.00
50%     1751.50
75%     2667.25
max     4888.00
Name: UnitPrice, dtype: float64

Sample Records:


| | ProdID | ProdName | SafetyStockLevel | ReManPoint | StandardCost | UnitPrice | Weight | DaysToMan | SellStartDate | ModifiedDate | ModifiedTime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5021 | Femoral Knee Stem | 45 | 70 | 167.25 | 2479 | 1.25 | 3 | 2013-01-05 | 2013-01-05 | 14:45:34.380000 |
| 1 | 5022 | Tibial Knee Stem | 45 | 70 | 127.33 | 1742 | 0.95 | 2 | 2013-01-05 | 2013-01-05 | 10:34:57.626000 |
| 2 | 6021 | Tibial Insert | 45 | 70 | 74.35 | 932 | 1.13 | 2 | 2013-01-05 | 2013-01-05 | 12:55:12.649000 |


## 8. Data Quality Checks

Checking for critical data quality issues that could impact analysis.


In [19]:
# Check for duplicate IDs
print("DUPLICATE ID CHECKS")
print("=" * 80)
print(f"Duplicate SalesOrderIDs: {sales_orders['SalesOrderID'].duplicated().sum()}")
print(f"Duplicate ShipIDs: {shipments['ShipID'].duplicated().sum()}")
print(f"Duplicate InvoiceIDs: {invoices['InvoiceID'].duplicated().sum()}")
print(f"Duplicate CustIDs: {customers['CustID'].duplicated().sum()}")
print(f"Duplicate TerritoryIDs: {territories['TerritoryID'].duplicated().sum()}")
print(f"Duplicate ProdIDs: {products['ProdID'].duplicated().sum()}")

# Check for referential integrity
print("\n\nREFERENTIAL INTEGRITY CHECKS")
print("=" * 80)

# Check if all ShipIDs in sales_orders exist in shipments (where not null)
shipped_orders = sales_orders[sales_orders['ShipID'].notna()]
ship_ids_not_found = ~shipped_orders['ShipID'].isin(shipments['ShipID'])
print(f"ShipIDs in Sales Orders not found in Shipments: {ship_ids_not_found.sum()}")

# Check if all InvoiceIDs in sales_orders exist in invoices (where not null)
invoiced_orders = sales_orders[sales_orders['InvoiceID'].notna()]
invoice_ids_not_found = ~invoiced_orders['InvoiceID'].isin(invoices['InvoiceID'])
print(f"InvoiceIDs in Sales Orders not found in Invoices: {invoice_ids_not_found.sum()}")

# Check if all CustIDs in sales_orders exist in customers
cust_ids_not_found = ~sales_orders['CustID'].isin(customers['CustID'])
print(f"CustIDs in Sales Orders not found in Customer Master: {cust_ids_not_found.sum()}")

# Check if all TerritoryIDs in sales_orders exist in territories
terr_ids_not_found = ~sales_orders['TerritoryID'].isin(territories['TerritoryID'])
print(f"TerritoryIDs in Sales Orders not found in Sales Territory: {terr_ids_not_found.sum()}")

# Check if all ProdIDs in sales_orders exist in products
prod_ids_not_found = ~sales_orders['ProdID'].isin(products['ProdID'])
print(f"ProdIDs in Sales Orders not found in Products: {prod_ids_not_found.sum()}")


DUPLICATE ID CHECKS
Duplicate SalesOrderIDs: 0
Duplicate ShipIDs: 0
Duplicate InvoiceIDs: 0
Duplicate CustIDs: 0
Duplicate TerritoryIDs: 0
Duplicate ProdIDs: 0


REFERENTIAL INTEGRITY CHECKS
ShipIDs in Sales Orders not found in Shipments: 3
InvoiceIDs in Sales Orders not found in Invoices: 1
CustIDs in Sales Orders not found in Customer Master: 0
TerritoryIDs in Sales Orders not found in Sales Territory: 0
ProdIDs in Sales Orders not found in Products: 0
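The counts above show 3 ShipIDs and 1 InvoiceID with no match, but they do not say which orders are affected. A left merge with `indicator=True` is one common way to list the unmatched rows (an anti-join). The sketch below uses toy frames rather than the case data; the helper name `unmatched_rows` is an assumption for illustration.

```python
import pandas as pd


def unmatched_rows(left: pd.DataFrame, right: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return rows of `left` whose `key` value has no match in `right` (hypothetical helper)."""
    # Deduplicate the lookup keys so the merge cannot multiply rows of `left`.
    right_keys = right[[key]].drop_duplicates()
    merged = left.merge(right_keys, on=key, how="left", indicator=True)
    # "left_only" marks rows that found no partner in `right`.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


# Toy data: order 2 points at a ShipID that is absent from the shipment file.
demo_orders = pd.DataFrame({"SalesOrderID": [1, 2, 3], "ShipID": [10, 11, 12]})
demo_shipments = pd.DataFrame({"ShipID": [10, 12], "Carrier": ["A", "B"]})
orphans = unmatched_rows(demo_orders, demo_shipments, "ShipID")
print(orphans)
```

Applied to the notebook's frames, a call such as `unmatched_rows(sales_orders, shipments, 'ShipID')` would be expected to return the three orders behind the first count.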


## 9. Date Range Analysis

Understanding the date ranges helps identify which transactions belong to fiscal 2017.


In [20]:
print("DATE RANGE ANALYSIS")
print("=" * 80)

# Sales Orders
print("Sales Orders - OrderDate:")
print(f"  Earliest: {sales_orders['OrderDate'].min()}")
print(f"  Latest: {sales_orders['OrderDate'].max()}")
print(f"  Orders by Year:")
print(sales_orders['OrderDate'].dt.year.value_counts().sort_index())

# Shipments
print("\nShipments - ShipDate:")
print(f"  Earliest: {shipments['ShipDate'].min()}")
print(f"  Latest: {shipments['ShipDate'].max()}")
print(f"  Shipments by Year:")
print(shipments['ShipDate'].dt.year.value_counts().sort_index())

# Invoices
print("\nInvoices - InvoiceDate:")
print(f"  Earliest: {invoices['InvoiceDate'].min()}")
print(f"  Latest: {invoices['InvoiceDate'].max()}")
print(f"  Invoices by Year:")
print(invoices['InvoiceDate'].dt.year.value_counts().sort_index())


DATE RANGE ANALYSIS
Sales Orders - OrderDate:
  Earliest: 2016-12-28 00:00:00
  Latest: 2017-12-29 00:00:00
  Orders by Year:
OrderDate
2016      10
2017    1158
Name: count, dtype: int64

Shipments - ShipDate:
  Earliest: 2017-01-02 00:00:00
  Latest: 2018-01-02 00:00:00
  Shipments by Year:
ShipDate
2017    1156
2018       9
Name: count, dtype: int64

Invoices - InvoiceDate:
  Earliest: 2017-01-02 00:00:00
  Latest: 2018-01-02 00:00:00
  Invoices by Year:
InvoiceDate
2017    1158
2018       9
Name: count, dtype: int64
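The yearly counts suggest that a few orders were placed in one calendar year and shipped in the next. As a rough sketch, those orders can be listed by joining orders to shipments on `SalesOrderID` and comparing the two years. The function name `cross_year_orders` and the toy rows are assumptions; only the column names come from the datasets above.

```python
import pandas as pd


def cross_year_orders(orders: pd.DataFrame, shipments: pd.DataFrame) -> pd.DataFrame:
    """List orders whose ship year differs from their order year (hypothetical helper)."""
    joined = orders.merge(
        shipments[["SalesOrderID", "ShipDate"]], on="SalesOrderID", how="inner"
    )
    crosses = joined["OrderDate"].dt.year != joined["ShipDate"].dt.year
    return joined.loc[crosses, ["SalesOrderID", "OrderDate", "ShipDate"]]


# Toy data: orders 1 and 3 ship in the year after they were placed.
sample_orders = pd.DataFrame({
    "SalesOrderID": [1, 2, 3],
    "OrderDate": pd.to_datetime(["2016-12-28", "2017-06-01", "2017-12-29"]),
})
sample_shipments = pd.DataFrame({
    "SalesOrderID": [1, 2, 3],
    "ShipDate": pd.to_datetime(["2017-01-02", "2017-06-03", "2018-01-02"]),
})
crossing = cross_year_orders(sample_orders, sample_shipments)
print(crossing)
```

The inner join drops orders with no shipment record, so unmatched orders need to be reviewed separately.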


---

## Data Quality Summary & Next Steps

### Data Quality Assessment

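**All datasets loaded successfully; two linking exceptions need follow-up:**

1. **No Duplicate Primary Keys**: All ID fields (SalesOrderID, ShipID, InvoiceID, etc.) are unique
2. **Referential Integrity**: CustID, TerritoryID, and ProdID all resolve to their master files, but 3 ShipIDs and 1 InvoiceID in Sales Orders have no matching record in Shipments or Customer Invoices
3. **No Missing Critical Data**: Key identifier and date fields are complete
4. **Appropriate Date Ranges**: Order dates run from 2016-12-28 to 2017-12-29; ship and invoice dates run from 2017-01-02 to 2018-01-02, consistent with the cross-year orders described in the case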

### Data Ready for Analysis

The data is now ready for the substantive audit procedures:

- **Requirement 2 (Reconciliation)**: Need to filter transactions to identify which comprise the $84.9M revenue and $12.0M AR
- **Requirement 3 (Three-Way Match)**: Sales Orders, Shipments, and Invoices all have linking keys
- **Requirements 4-5 (Credit Analysis)**: Customer Master contains credit limits; Sales Orders contain transaction amounts
- **Requirement 6 (Aging Analysis)**: Invoice dates and payment dates available for aging calculation
- **Requirement 7 (Fraud Detection)**: All source data available for pattern analysis

### Important Notes for Subsequent Analysis

1. **Revenue Recognition**: The company records revenue when goods are shipped; this notebook treats InvoiceDate as the revenue date, so ShipDate and InvoiceDate should be compared during cutoff testing
2. **Payment Status**: PaidDate = 9/9/9999 indicates unpaid invoices
3. **Date Filtering**: Will need to carefully filter by InvoiceDate (for revenue) and ShipDate (for shipment cutoff testing)
4. **Cross-Year Transactions**: Dataset includes prior year orders completed in 2017 and 2017 orders completed in 2018
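
The date filtering in note 3 can be sketched as a small inclusive range filter. The period boundaries below assume fiscal 2017 coincides with calendar 2017, which this notebook does not confirm; the helper name `filter_period` and the toy invoices are likewise illustrative.

```python
import pandas as pd


def filter_period(df: pd.DataFrame, date_col: str, start: str, end: str) -> pd.DataFrame:
    """Keep rows whose date falls within [start, end], both ends inclusive."""
    in_range = df[date_col].between(pd.Timestamp(start), pd.Timestamp(end))
    return df.loc[in_range]


# Toy data: the third invoice is dated after the assumed year end.
sample_invoices = pd.DataFrame({
    "InvoiceID": [1, 2, 3],
    "InvoiceDate": pd.to_datetime(["2017-01-02", "2017-12-31", "2018-01-02"]),
})
# Assumption: fiscal 2017 runs 2017-01-01 through 2017-12-31.
fy2017 = filter_period(sample_invoices, "InvoiceDate", "2017-01-01", "2017-12-31")
print(fy2017)
```

The same helper could be pointed at `ShipDate` for shipment cutoff testing, so both filters share one definition of the period.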
