# CMS Open Payments Data Exploration & Analysis

**Project:** AAI-540 Machine Learning Operations - Final Team Project  
**Dataset:** CMS Open Payments Program Year 2024 General Payments  
**Purpose:** Exploratory Data Analysis for Payment Patterns and Statistical Insights

---

## Environment Setup and Variable Retrieval 

In [2]:
import pandas as pd
import numpy as np
import awswrangler as wr
from datetime import datetime
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import IsolationForest

# retrieve the path variables from Notebook 01
%store -r bucket
%store -r database_name
%store -r table_name_parquet

# reload the cleaned dataset (Parquet table in S3, queried through Athena)
# so that 'df' is defined in this notebook's kernel
print("Loading processed data from S3...")
df = wr.athena.read_sql_query(
    sql=f"SELECT * FROM {database_name}.{table_name_parquet}",
    database=database_name
)

print(f"Environment ready. Dataframe shape: {df.shape}")

Loading processed data from S3...


2026-02-04 09:42:12,603	INFO worker.py:1852 -- Started a local Ray instance.


Environment ready. Dataframe shape: (1000000, 91)


## Feature Selection and Matrix Preparation

In [12]:
# restore feature and dataset splits

# turn non-date strings into NaT to prevent crashing
df['date_of_payment'] = pd.to_datetime(df['date_of_payment'], errors='coerce')

# report how many dates failed to parse (a large count would suggest a schema shift)
nan_dates = df['date_of_payment'].isna().sum()
if nan_dates > 0:
    print(f"Warning: {nan_dates} rows had invalid date formats and were set to NaT.")

# fill NaT from neighbouring rows: forward-fill first, then back-fill any leading NaT
df['date_of_payment'] = df['date_of_payment'].ffill().bfill()

df['payment_month'] = df['date_of_payment'].dt.month
df['is_weekend'] = (df['date_of_payment'].dt.dayofweek >= 5).astype(int)

print(f"Success: Features restored. New shape: {df.shape}")

Success: Features restored. New shape: (1000000, 97)
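
The cell above coerces unparseable dates to `NaT` and then fills them from adjacent rows. The following is a minimal sketch on a toy frame (the values are hypothetical, not taken from the CMS data) that shows what each step does. Note that the fill copies a neighbouring row's date, so `payment_month` and `is_weekend` for those rows are borrowed rather than observed.

```python
import pandas as pd

# Toy data (hypothetical): one malformed date and one missing date
toy = pd.DataFrame(
    {"date_of_payment": ["2024-03-02", "not a date", None, "2024-03-04"]}
)

# Unparseable strings and missing values both become NaT
toy["date_of_payment"] = pd.to_datetime(toy["date_of_payment"], errors="coerce")
n_invalid = int(toy["date_of_payment"].isna().sum())

# Forward-fill, then back-fill in case the first rows are NaT
toy["date_of_payment"] = toy["date_of_payment"].ffill().bfill()

# Same derived features as in the notebook cell
toy["payment_month"] = toy["date_of_payment"].dt.month
toy["is_weekend"] = (toy["date_of_payment"].dt.dayofweek >= 5).astype(int)

print(f"Invalid dates coerced to NaT: {n_invalid}")
print(toy)
```

In this toy frame the two invalid rows inherit the Saturday date above them, so both are marked as weekend payments. If the warning in the real run reports a non-trivial count, dropping or separately flagging those rows may be a safer choice than filling.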


## Baseline Model: Isolation Forest

In [13]:
# define model features (must match the columns restored in the feature preparation step)
model_features = [
    'total_amount_of_payment_usdollars', 'hist_pay_avg', 
    'amt_to_avg_ratio', 'is_new_recipient', 'payment_month', 'is_weekend'
]

# create Train/Test splits using the restored 'dataset_usage' column
train_df = df[df['dataset_usage'] == 'train'].copy()
test_df = df[df['dataset_usage'] == 'test'].copy()

# fit the scaler on the training split only, then apply it to the test split;
# this creates the 'X_train' and 'X_test' matrices used by the model
scaler = RobustScaler()
X_train = scaler.fit_transform(train_df[model_features])
X_test = scaler.transform(test_df[model_features])

print(f"Success: X_train defined with {X_train.shape[0]} rows.")

Success: X_train defined with 399617 rows.
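
`RobustScaler` with its default settings centres each feature on its median and divides by the interquartile range (25th to 75th percentile), which keeps a few very large payments from dominating the scale. The sketch below, using hypothetical toy amounts, checks that against a manual calculation.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy payment amounts (hypothetical) with one extreme value
amounts = np.array([[10.0], [20.0], [30.0], [40.0], [1_000_000.0]])

# Defaults: centre on the median, scale by the 25th-75th percentile range
toy_scaler = RobustScaler()
scaled = toy_scaler.fit_transform(amounts)

# The same transformation computed by hand
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q3 - q1)

print("center_:", toy_scaler.center_, "scale_:", toy_scaler.scale_)
print("matches manual calculation:", np.allclose(scaled, manual))
```

The extreme value still maps to a very large scaled number, but it does not shift the centre or stretch the scale applied to the typical payments, which is the property the notebook relies on.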


In [14]:
# initialize the baseline
baseline_model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)

# fit on the matrices we prepared in the Scaling block
print("Training Baseline (Isolation Forest)...")
baseline_model.fit(X_train)

# predict on the test set
# We use 'baseline' suffixes so we can compare models later
test_df['scores_baseline'] = baseline_model.decision_function(X_test)
test_df['is_anomaly_baseline'] = baseline_model.predict(X_test)

# map to Yes/No
test_df['is_anomaly_baseline'] = test_df['is_anomaly_baseline'].map({-1: 'Yes', 1: 'No'})

print("Baseline complete.")
print(f"Anomalies detected: {test_df[test_df['is_anomaly_baseline'] == 'Yes'].shape[0]}")

Training Baseline (Isolation Forest)...
Baseline complete.
Anomalies detected: 1402
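
Two properties of `IsolationForest` help when reading this output. First, `predict` returns `-1` exactly where `decision_function` is negative. Second, the `contamination` threshold is set from the training scores, so the share flagged on a separate test set does not have to equal `contamination`. The sketch below illustrates both on synthetic Gaussian data; the array names and sizes are assumptions for illustration, not the notebook's matrices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-ins for the scaled train/test matrices (hypothetical data)
rng = np.random.default_rng(42)
X_train_toy = rng.normal(size=(2000, 2))
X_test_toy = rng.normal(size=(1000, 2))

toy_model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
toy_model.fit(X_train_toy)

scores = toy_model.decision_function(X_test_toy)
labels = toy_model.predict(X_test_toy)

# predict() is -1 exactly where decision_function() is below zero
labels_match_scores = np.array_equal(labels == -1, scores < 0)

# The threshold is calibrated on the training data, not on the test data
train_rate = (toy_model.predict(X_train_toy) == -1).mean()
test_rate = (labels == -1).mean()

print("labels consistent with scores:", labels_match_scores)
print(f"flagged in train: {train_rate:.2%}, flagged in test: {test_rate:.2%}")
```

A test-set rate that differs from `contamination` therefore does not by itself indicate a bug. A persistent gap may, however, point to a distribution difference between the train and test splits.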


## Baseline Model Report

In [15]:
# summarize how many test records the baseline flagged
# the lower the decision_function score, the more isolated (extreme) the payment is
print("--- BASELINE PERFORMANCE SUMMARY ---")
print(f"Total Test Records: {len(test_df):,}")
print(f"Anomalies Flagged: {test_df[test_df['is_anomaly_baseline'] == 'Yes'].shape[0]:,}")
print(f"Anomaly Rate: {(test_df['is_anomaly_baseline'] == 'Yes').mean():.2%}")

# statistical validation: do flagged payments look different from the rest?
# compare the mean payment amount and ratio for 'No' (normal) vs 'Yes' (anomaly)
comparison = test_df.groupby('is_anomaly_baseline')[['total_amount_of_payment_usdollars', 'amt_to_avg_ratio']].mean()

print("\n--- STATISTICAL VALIDATION ---")
print("Average Payment Amount:")
print(comparison['total_amount_of_payment_usdollars'].map('${:,.2f}'.format))

print("\nAverage Amount-to-Historical-Average Ratio:")
print(comparison['amt_to_avg_ratio'].map('{:.2f}x'.format))

# top 5 most extreme anomalies
print("\n--- TOP 5 MOST EXTREME ANOMALIES ---")
top_red_flags = test_df[test_df['is_anomaly_baseline'] == 'Yes'].sort_values('scores_baseline').head(5)

display(top_red_flags[[
    'nature_of_payment_or_transfer_of_value', 
    'total_amount_of_payment_usdollars', 
    'amt_to_avg_ratio', 
    'scores_baseline'
]])

--- BASELINE PERFORMANCE SUMMARY ---
Total Test Records: 99,866
Anomalies Flagged: 1,402
Anomaly Rate: 1.40%

--- STATISTICAL VALIDATION ---
Average Payment Amount:
is_anomaly_baseline
No     $11,216,779.51
Yes           $317.87
Name: total_amount_of_payment_usdollars, dtype: object

Average Amount-to-Historical-Average Ratio:
is_anomaly_baseline
No     437840.39x
Yes         7.86x
Name: amt_to_avg_ratio, dtype: object

--- TOP 5 MOST EXTREME ANOMALIES ---


| | nature_of_payment_or_transfer_of_value | total_amount_of_payment_usdollars | amt_to_avg_ratio | scores_baseline |
|---|---|---|---|---|
| 283583 | Entertainment | 99.12 | 0.031592 | -0.092846 |
| 788962 | Food and Beverage | 77.41 | 0.024575 | -0.092846 |
| 97738 | Travel and Lodging | 598.94 | 0.204658 | -0.086592 |
| 294667 | Food and Beverage | 2.63 | 0.000638 | -0.083334 |
| 728268 | Food and Beverage | 126.39 | 0.076173 | -0.083247 |
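
The validation step above compares group means, which a handful of very large payments can dominate. As a hedged sketch, reporting the median and maximum alongside the mean makes that kind of skew visible. The toy frame below uses hypothetical values and reuses the notebook's column names only for readability.

```python
import pandas as pd

# Toy data (hypothetical): the 'No' group contains one very large payment
toy = pd.DataFrame({
    "is_anomaly_baseline": ["No", "No", "No", "No", "Yes", "Yes"],
    "total_amount_of_payment_usdollars": [20.0, 35.0, 50.0, 5_000_000.0, 90.0, 600.0],
})

# Report several statistics per group instead of the mean alone
summary = (
    toy.groupby("is_anomaly_baseline")["total_amount_of_payment_usdollars"]
    .agg(["count", "mean", "median", "max"])
)
print(summary)
```

In this toy example the 'No' group has the larger mean but the smaller median, because a single payment drives the average. Running the same aggregation on `test_df` would show whether the group averages in the report are similarly driven by a few records.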


