# CMS Open Payments Datalake Setup

**Project:** AAI-540 Machine Learning Operations - Final Team Project  
**Purpose:** Set up an AWS S3 datalake for CMS Open Payments data  
**Dataset:** CMS Open Payments Program Year 2024

---

## Table of Contents
1. [Environment Setup](#setup)
2. [AWS Configuration & S3 Bucket Creation](#aws-config)
3. [Download CMS Open Payments Data](#download)
4. [Upload Data to S3](#upload)
5. [Create Athena Database](#athena)
6. [Register Data with Athena](#register)
7. [Convert CSV to Parquet](#parquet)
8. [Query Data with AWS Data Wrangler](#query)
9. [Validation & Verification](#validation)

---

## 1. Environment Setup

Install and import necessary libraries for AWS integration and data processing.

In [7]:
# Install required AWS packages
%pip install boto3 sagemaker awswrangler pyathena

Note: you may need to restart the kernel to use updated packages.


In [12]:
# Import necessary libraries
import boto3
import sagemaker
import pandas as pd
import numpy as np
import os
import requests
from pathlib import Path
from datetime import datetime
from io import BytesIO, StringIO
import awswrangler as wr
from pyathena import connect
import warnings

warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)

print("Libraries imported successfully")

Libraries imported successfully


## 2. AWS Configuration & S3 Bucket Creation

Configure AWS session and create S3 bucket for the datalake.

In [19]:
import boto3

# Initialize AWS session
boto_session = boto3.Session()
region = boto_session.region_name

# Get account information
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity().get('Account')

# Initialize AWS clients
s3_client = boto3.client('s3', region_name=region)
s3_resource = boto3.resource('s3')

# Define bucket name
bucket = "cmsopenpaymentsystems"

# Get IAM role (if needed)
iam = boto3.client('iam')
try:
    role = iam.get_role(RoleName='LabRole')['Role']['Arn']
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    role = "Role not found"

print(f"AWS Configuration:")
print(f"  Region: {region}")
print(f"  Account ID: {account_id}")
print(f"  S3 Bucket: {bucket}")
print(f"  Role: {role}")

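# Verify S3 bucket exists or create it
from botocore.exceptions import ClientError

def ensure_bucket_exists(bucket_name, region):
    # Check first: in us-east-1, create_bucket succeeds silently for a bucket
    # you already own, so BucketAlreadyOwnedByYou is never raised there.
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket already exists: {bucket_name}")
        return
    except ClientError as e:
        if e.response['Error']['Code'] != '404':
            print(f"Error with bucket {bucket_name}: {str(e)}")
            return
    try:
        if region == 'us-east-1':
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={'LocationConstraint': region}
            )
        print(f"Created new bucket: {bucket_name}")
    except s3_client.exceptions.BucketAlreadyOwnedByYou:
        print(f"Bucket already exists: {bucket_name}")
    except Exception as e:
        print(f"Error with bucket {bucket_name}: {str(e)}")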

# Ensure bucket exists
ensure_bucket_exists(bucket, region)


AWS Configuration:
  Region: us-east-1
  Account ID: 864106638709
  S3 Bucket: cmsopenpaymentsystems
  Role: arn:aws:iam::864106638709:role/LabRole
Created new bucket: cmsopenpaymentsystems
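
S3 bucket names are shared across all AWS accounts, so a fixed name such as `cmsopenpaymentsystems` can only be created by one account. The following is a minimal sketch, not part of the original notebook, of deriving a per-account name and checking it against the basic naming rules. The helper names and the account ID in the usage note are hypothetical.

```python
import re

# Basic S3 naming rules: 3-63 characters, lowercase letters, digits and hyphens,
# starting and ending with a letter or digit. Dots are also legal in bucket
# names but are left out of this sketch.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the basic rules encoded above."""
    return _BUCKET_NAME.fullmatch(name) is not None


def unique_bucket_name(base: str, account_id: str) -> str:
    """Append the account ID to a base name and validate the result."""
    candidate = f"{base}-{account_id}".lower()
    if not is_valid_bucket_name(candidate):
        raise ValueError(f"Not a valid S3 bucket name: {candidate!r}")
    return candidate
```

With the variables from the cell above, `bucket = unique_bucket_name("cmsopenpaymentsystems", account_id)` would give each team member a distinct bucket.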


In [20]:
# Define S3 paths for CMS data
cms_data_prefix = "cms-open-payments"
raw_data_prefix = f"{cms_data_prefix}/raw"
processed_data_prefix = f"{cms_data_prefix}/processed"
parquet_data_prefix = f"{cms_data_prefix}/parquet"

s3_raw_path = f"s3://{bucket}/{raw_data_prefix}"
s3_processed_path = f"s3://{bucket}/{processed_data_prefix}"
s3_parquet_path = f"s3://{bucket}/{parquet_data_prefix}"

print(f"S3 Data Paths:")
print(f"  Raw Data: {s3_raw_path}")
print(f"  Processed Data: {s3_processed_path}")
print(f"  Parquet Data: {s3_parquet_path}")

# Store paths for use in other notebooks
%store bucket
%store region
%store s3_raw_path
%store s3_processed_path
%store s3_parquet_path

S3 Data Paths:
  Raw Data: s3://cmsopenpaymentsystems/cms-open-payments/raw
  Processed Data: s3://cmsopenpaymentsystems/cms-open-payments/processed
  Parquet Data: s3://cmsopenpaymentsystems/cms-open-payments/parquet
Stored 'bucket' (str)
Stored 'region' (str)
Stored 's3_raw_path' (str)
Stored 's3_processed_path' (str)
Stored 's3_parquet_path' (str)
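
The stored paths are full `s3://` URIs, while boto3 calls such as `upload_file` and `list_objects_v2` take the bucket and key prefix separately. A notebook that restores only the URIs could split them back apart along these lines. This is a small sketch using only the standard library; `split_s3_uri` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path component starts with '/', which is not part of the S3 key
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `split_s3_uri(s3_raw_path)` returns the bucket name and the `cms-open-payments/raw` prefix.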


## 3. Download CMS Open Payments Data

Download the CMS Open Payments Program Year 2024 General Payments dataset.

**Data Source:** CMS Open Payments  
**Dataset:** Program Year 2024 General Payments  
**Published:** June 30, 2025  
**Coverage:** January 1, 2024 - December 31, 2024

In [21]:
# CMS Open Payments data URL - Direct CSV download
cms_data_url = "https://download.cms.gov/openpayments/PGYR2024_P06302025_06162025/OP_DTL_GNRL_PGYR2024_P06302025_06162025.csv"

# Alternative: If the above URL doesn't work, use this approach:
# 1. Go to https://openpaymentsdata.cms.gov/datasets
# 2. Select "Program Year 2024" and "General Payments"
# 3. Download the CSV file manually and place it in ../data/ directory

print(f"CMS Data URL: {cms_data_url}")
print(f"\nNote: This dataset is approximately 8 GB.")
print(f"Download may take several minutes depending on your connection.")

CMS Data URL: https://download.cms.gov/openpayments/PGYR2024_P06302025_06162025/OP_DTL_GNRL_PGYR2024_P06302025_06162025.csv

Note: This dataset is approximately 8 GB.
Download may take several minutes depending on your connection.


In [22]:
# Create local data directory if it doesn't exist
local_data_dir = Path("../data")
local_data_dir.mkdir(exist_ok=True)

# Local CSV file path
local_csv_file = local_data_dir / "OP_DTL_GNRL_PGYR2024_P06302025_06162025.csv"

print(f"Local data directory: {local_data_dir.absolute()}")
print(f"Target CSV file: {local_csv_file.name}")

Local data directory: /home/sagemaker-user/aai540_3proj/notebooks/../data
Target CSV file: OP_DTL_GNRL_PGYR2024_P06302025_06162025.csv


In [23]:
# Download CMS data if not already present
if local_csv_file.exists():
    print(f"CSV file already exists: {local_csv_file}")
    print(f"  File size: {local_csv_file.stat().st_size / (1024**3):.2f} GB")
else:
    print(f"Downloading CMS Open Payments data...")
    print(f"This may take 10-20 minutes depending on your connection.")
    
    try:
        # Download CSV file with progress indication
        response = requests.get(cms_data_url, stream=True, timeout=60)
        response.raise_for_status()
        
        total_size = int(response.headers.get('content-length', 0))
        print(f"Total download size: {total_size / (1024**3):.2f} GB")
        
        # Save CSV file directly
        with open(local_csv_file, 'wb') as f:
            downloaded = 0
            # 1 MiB chunks: far fewer writes and progress updates than 8 KiB
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                if chunk:
                    f.write(chunk)
                    downloaded += len(chunk)
                    if total_size > 0:
                        percent = (downloaded / total_size) * 100
                        print(f"\rProgress: {percent:.1f}%", end="")
        
        print(f"\nDownload complete: {local_csv_file}")
        print(f"  File size: {local_csv_file.stat().st_size / (1024**3):.2f} GB")
            
    except Exception as e:
        print(f"\nError downloading data: {e}")
        print(f"\nAlternative approach:")
        print(f"1. Visit: https://openpaymentsdata.cms.gov/datasets")
        print(f"2. Select 'Program Year 2024' and 'General Payments'")
        print(f"3. Download CSV and save to: {local_data_dir.absolute()}")

Downloading CMS Open Payments data...
This may take 10-20 minutes depending on your connection.
Total download size: 8.22 GB
Progress: 100.0%
Download complete: ../data/OP_DTL_GNRL_PGYR2024_P06302025_06162025.csv
  File size: 8.22 GB
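
The download cell treats any existing file as complete, so an interrupted download would leave a truncated CSV that later runs silently reuse. One way to guard against that is to write to a temporary name and rename only once the stream is exhausted. This is a sketch under that assumption; `save_stream_atomically` and the `.part` suffix are hypothetical choices.

```python
import os
from pathlib import Path


def save_stream_atomically(stream, dest, chunk_size=1024 * 1024):
    """Copy a readable binary stream to dest, via a temporary '.part' file.

    Returns the number of bytes written. dest only appears once the whole
    stream has been read, so a partial download is never mistaken for a
    finished one.
    """
    dest = Path(dest)
    part = dest.with_name(dest.name + ".part")
    written = 0
    with open(part, "wb") as f:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
            written += len(chunk)
    # os.replace is atomic when source and target are on the same filesystem
    os.replace(part, dest)
    return written
```

Any object with a `read(n)` method works as `stream`. Comparing the returned byte count with the `content-length` header gives a simple completeness check.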


## 4. Upload Data to S3

Upload the downloaded CMS data to S3 for datalake storage.

In [24]:
# Preview the data before upload
print("Loading sample of data for preview...")
df_sample = pd.read_csv(local_csv_file, nrows=5)

print(f"\nDataset Preview:")
print(f"  Columns: {len(df_sample.columns)}")
print(f"  Sample rows:")
display(df_sample.head())

print(f"\nColumn names:")
for i, col in enumerate(df_sample.columns, 1):
    print(f"  {i}. {col}")

Loading sample of data for preview...

Dataset Preview:
  Columns: 91
  Sample rows:


Unnamed: 0,Change_Type,Covered_Recipient_Type,Teaching_Hospital_CCN,Teaching_Hospital_ID,Teaching_Hospital_Name,Covered_Recipient_Profile_ID,Covered_Recipient_NPI,Covered_Recipient_First_Name,Covered_Recipient_Middle_Name,Covered_Recipient_Last_Name,Covered_Recipient_Name_Suffix,Recipient_Primary_Business_Street_Address_Line1,Recipient_Primary_Business_Street_Address_Line2,Recipient_City,Recipient_State,Recipient_Zip_Code,Recipient_Country,Recipient_Province,Recipient_Postal_Code,Covered_Recipient_Primary_Type_1,Covered_Recipient_Primary_Type_2,Covered_Recipient_Primary_Type_3,Covered_Recipient_Primary_Type_4,Covered_Recipient_Primary_Type_5,Covered_Recipient_Primary_Type_6,...,Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_2,Product_Category_or_Therapeutic_Area_2,Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_2,Associated_Drug_or_Biological_NDC_2,Associated_Device_or_Medical_Supply_PDI_2,Covered_or_Noncovered_Indicator_3,Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_3,Product_Category_or_Therapeutic_Area_3,Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_3,Associated_Drug_or_Biological_NDC_3,Associated_Device_or_Medical_Supply_PDI_3,Covered_or_Noncovered_Indicator_4,Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_4,Product_Category_or_Therapeutic_Area_4,Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_4,Associated_Drug_or_Biological_NDC_4,Associated_Device_or_Medical_Supply_PDI_4,Covered_or_Noncovered_Indicator_5,Indicate_Drug_or_Biological_or_Device_or_Medical_Supply_5,Product_Category_or_Therapeutic_Area_5,Name_of_Drug_or_Biological_or_Device_or_Medical_Supply_5,Associated_Drug_or_Biological_NDC_5,Associated_Device_or_Medical_Supply_PDI_5,Program_Year,Payment_Publication_Date
0,ADD,Covered Recipient Teaching Hospital,190036,14616,Ochsner Clinic Foundation,,,,,,,1516 Jefferson Hwy,,New Orleans,LA,70121,United States,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,2024,06/30/2025
1,ADD,Covered Recipient Teaching Hospital,440039,15311,Vanderbilt University Medical Center,,,,,,,1211 Medical Center Drive,,Nashville,TN,37232,United States,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,2024,06/30/2025
2,ADD,Covered Recipient Teaching Hospital,520087,15508,Gundersen Lutheran Medical Center I,,,,,,,1910 SOUTH AVE,,LA CROSSE,WI,54601,United States,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,2024,06/30/2025
3,ADD,Covered Recipient Teaching Hospital,520087,15508,Gundersen Lutheran Medical Center I,,,,,,,1910 SOUTH AVE,,LA CROSSE,WI,54601,United States,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,2024,06/30/2025
4,ADD,Covered Recipient Teaching Hospital,520087,15508,Gundersen Lutheran Medical Center I,,,,,,,1910 SOUTH AVE,,LA CROSSE,WI,54601,United States,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,2024,06/30/2025



Column names:
  1. Change_Type
  2. Covered_Recipient_Type
  3. Teaching_Hospital_CCN
  4. Teaching_Hospital_ID
  5. Teaching_Hospital_Name
  6. Covered_Recipient_Profile_ID
  7. Covered_Recipient_NPI
  8. Covered_Recipient_First_Name
  9. Covered_Recipient_Middle_Name
  10. Covered_Recipient_Last_Name
  11. Covered_Recipient_Name_Suffix
  12. Recipient_Primary_Business_Street_Address_Line1
  13. Recipient_Primary_Business_Street_Address_Line2
  14. Recipient_City
  15. Recipient_State
  16. Recipient_Zip_Code
  17. Recipient_Country
  18. Recipient_Province
  19. Recipient_Postal_Code
  20. Covered_Recipient_Primary_Type_1
  21. Covered_Recipient_Primary_Type_2
  22. Covered_Recipient_Primary_Type_3
  23. Covered_Recipient_Primary_Type_4
  24. Covered_Recipient_Primary_Type_5
  25. Covered_Recipient_Primary_Type_6
  26. Covered_Recipient_Specialty_1
  27. Covered_Recipient_Specialty_2
  28. Covered_Recipient_Specialty_3
  29. Covered_Recipient_Specialty_4
  30. Covered_Recipient_Spec

In [25]:
# Upload raw CSV to S3
print(f"Uploading data to S3...")
print(f"  Source: {local_csv_file}")
print(f"  Destination: {s3_raw_path}/")

s3_raw_file_path = f"{s3_raw_path}/{local_csv_file.name}"

try:
    # Upload file with progress callback
    file_size = local_csv_file.stat().st_size
    
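    uploaded = {'bytes': 0}

    def upload_progress(chunk_bytes):
        # boto3 passes the bytes sent since the previous call, so keep a running total
        uploaded['bytes'] += chunk_bytes
        percent = (uploaded['bytes'] / file_size) * 100
        print(f"\rUpload progress: {percent:.1f}%", end="")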
    
    s3_client.upload_file(
        str(local_csv_file),
        bucket,
        f"{raw_data_prefix}/{local_csv_file.name}",
        Callback=upload_progress
    )
    
    print(f"\nUpload complete")
    print(f"  S3 URI: {s3_raw_file_path}")
    
    # Store the S3 file path
    %store s3_raw_file_path
    
except Exception as e:
    print(f"\nError uploading to S3: {e}")

Uploading data to S3...
  Source: ../data/OP_DTL_GNRL_PGYR2024_P06302025_06162025.csv
  Destination: s3://cmsopenpaymentsystems/cms-open-payments/raw/
Upload progress: 0.0%
Upload complete
  S3 URI: s3://cmsopenpaymentsystems/cms-open-payments/raw/OP_DTL_GNRL_PGYR2024_P06302025_06162025.csv
Stored 's3_raw_file_path' (str)
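
The recorded progress above stays at 0.0%, which is consistent with boto3 passing the callback the number of bytes transferred since the previous call rather than a running total. Because large uploads are split into parts that may be sent from several threads, a reusable callback also benefits from a lock. The class below is a sketch of that pattern; `UploadProgress` and `report_step` are hypothetical names.

```python
import threading


class UploadProgress:
    """Callable that accumulates per-chunk byte counts into a running percentage."""

    def __init__(self, total_bytes, report_step=10.0):
        self._total = total_bytes
        self._seen = 0
        self._step = report_step
        self._next_report = report_step
        self._lock = threading.Lock()

    def __call__(self, bytes_transferred):
        # The lock keeps the running total consistent across transfer threads
        with self._lock:
            self._seen += bytes_transferred
            if self.percent >= self._next_report:
                print(f"Upload progress: {self.percent:.1f}%")
                # Skip ahead so one large chunk does not trigger several reports
                while self._next_report <= self.percent:
                    self._next_report += self._step

    @property
    def percent(self):
        if self._total <= 0:
            return 100.0
        return 100.0 * self._seen / self._total
```

It would be passed as `Callback=UploadProgress(file_size)` in the `upload_file` call.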


In [26]:
# Verify upload
print("Verifying S3 upload...")

response = s3_client.list_objects_v2(
    Bucket=bucket,
    Prefix=raw_data_prefix
)

if 'Contents' in response:
    print(f"\nFiles in S3 bucket:")
    for obj in response['Contents']:
        size_gb = obj['Size'] / (1024**3)
        print(f"  {obj['Key']} ({size_gb:.2f} GB)")
else:
    print(f"\nNo files found in S3 bucket")

Verifying S3 upload...

Files in S3 bucket:
  cms-open-payments/raw/OP_DTL_GNRL_PGYR2024_P06302025_06162025.csv (8.22 GB)


## 5. Create Athena Database

Create an Amazon Athena database for querying CMS data using SQL.

In [27]:
# Define Athena database name
database_name = "cms_open_payments"

# Set S3 staging directory for Athena queries
s3_athena_staging = f"s3://{bucket}/athena/staging"

print(f"Athena Configuration:")
print(f"  Database: {database_name}")
print(f"  Staging Directory: {s3_athena_staging}")

# Store for use in other notebooks
%store database_name
%store s3_athena_staging

Athena Configuration:
  Database: cms_open_payments
  Staging Directory: s3://cmsopenpaymentsystems/athena/staging
Stored 'database_name' (str)
Stored 's3_athena_staging' (str)


In [28]:
# Create Athena connection
athena_conn = connect(
    region_name=region,
    s3_staging_dir=s3_athena_staging
)

print("Athena connection established")

Athena connection established


In [29]:
# Create database
create_db_query = f"CREATE DATABASE IF NOT EXISTS {database_name}"

print(f"Creating Athena database...")
print(f"  Query: {create_db_query}")

try:
    result = pd.read_sql(create_db_query, athena_conn)
    print(f"Database created successfully")
except Exception as e:
    print(f"Error creating database: {e}")

Creating Athena database...
  Query: CREATE DATABASE IF NOT EXISTS cms_open_payments
Database created successfully


In [30]:
# Verify database creation
show_db_query = "SHOW DATABASES"

print("Verifying database creation...")
databases = pd.read_sql(show_db_query, athena_conn)

print(f"\n Available Databases:")
display(databases)

if database_name in databases.values:
    print(f"\nDatabase '{database_name}' exists")
else:
    print(f"\nDatabase '{database_name}' not found")

Verifying database creation...

 Available Databases:


| | database_name |
|---|---|
| 0 | cms_open_payments |
| 1 | default |
| 2 | dsoaws |



Database 'cms_open_payments' exists
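
The queries in this notebook interpolate database and table names into SQL with f-strings. When those names come from configuration or `%store`, a cheap check before interpolation helps catch typos and unsafe values. The pattern below is a deliberately conservative assumption (lowercase letters, digits, underscores, not starting with a digit) rather than Athena's full naming rules.

```python
import re

# Conservative assumption: lowercase letters, digits and underscores only
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Return name unchanged if it is safe to interpolate into SQL, else raise."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return name
```

For example, `f"CREATE DATABASE IF NOT EXISTS {safe_identifier(database_name)}"`.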


## 6. Register Data with Athena

Create an external table in Athena to query the CSV data stored in S3.

In [31]:
# Define table name
table_name_csv = "general_payments_csv"

print(f"Table Configuration:")
print(f"  Database: {database_name}")
print(f"  Table: {table_name_csv}")
print(f"  Location: {s3_raw_path}/")

%store table_name_csv

Table Configuration:
  Database: cms_open_payments
  Table: general_payments_csv
  Location: s3://cmsopenpaymentsystems/cms-open-payments/raw/
Stored 'table_name_csv' (str)


In [32]:
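# Get actual column names from the CSV
# Note: dtypes are inferred from a single row, so any column that is empty
# in that row is read as float64 and mapped to DOUBLE below
df_schema = pd.read_csv(local_csv_file, nrows=1)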

# Create column definitions for Athena
# Map pandas dtypes to Athena types
def get_athena_type(dtype):
    if pd.api.types.is_integer_dtype(dtype):
        return 'BIGINT'
    elif pd.api.types.is_float_dtype(dtype):
        return 'DOUBLE'
    elif pd.api.types.is_datetime64_any_dtype(dtype):
        return 'TIMESTAMP'
    else:
        return 'STRING'

# Create column definitions
columns_def = []
for col in df_schema.columns:
    # Clean column name for Athena (replace spaces and special chars)
    clean_col = col.replace(' ', '_').replace('(', '').replace(')', '').replace('-', '_')
    athena_type = get_athena_type(df_schema[col].dtype)
    # Emit the cleaned name so the DDL never contains spaces or parentheses
    columns_def.append(f"`{clean_col}` {athena_type}")

columns_str = ',\n    '.join(columns_def)

print(f"Schema preview (first 10 columns):")
for i, col_def in enumerate(columns_def[:10], 1):
    print(f"  {i}. {col_def}")
print(f"  ... ({len(columns_def)} columns total)")

Schema preview (first 10 columns):
  1. `Change_Type` STRING
  2. `Covered_Recipient_Type` STRING
  3. `Teaching_Hospital_CCN` BIGINT
  4. `Teaching_Hospital_ID` BIGINT
  5. `Teaching_Hospital_Name` STRING
  6. `Covered_Recipient_Profile_ID` DOUBLE
  7. `Covered_Recipient_NPI` DOUBLE
  8. `Covered_Recipient_First_Name` DOUBLE
  9. `Covered_Recipient_Middle_Name` DOUBLE
  10. `Covered_Recipient_Last_Name` DOUBLE
  ... (91 columns total)
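
In the schema preview above, the recipient name columns are typed `DOUBLE`. That is what pandas infers when the only sampled row has those fields empty. A more defensive approach samples many rows, falls back to `STRING` for columns with no observed values, and keeps identifiers with leading zeros as text. The following is a standard-library sketch of that idea; the function name and the default sample size are hypothetical.

```python
import csv
import re
from itertools import islice

_INT = re.compile(r"-?(0|[1-9][0-9]*)")   # rejects leading zeros such as ZIP '02134'
_DECIMAL = re.compile(r"-?[0-9]+\.[0-9]+")


def infer_athena_types(fileobj, sample_rows=10000):
    """Infer BIGINT / DOUBLE / STRING per column from the first sample_rows rows."""
    reader = csv.reader(fileobj)
    header = next(reader)
    n = len(header)
    seen_value = [False] * n
    could_be_int = [True] * n
    could_be_float = [True] * n

    for row in islice(reader, sample_rows):
        for i, value in enumerate(row[:n]):
            value = value.strip()
            if not value:
                continue  # empty fields carry no type information
            seen_value[i] = True
            is_int = _INT.fullmatch(value) is not None
            if not is_int:
                could_be_int[i] = False
            if not (is_int or _DECIMAL.fullmatch(value)):
                could_be_float[i] = False

    types = {}
    for i, name in enumerate(header):
        if not seen_value[i]:
            types[name] = "STRING"  # no evidence: the safest choice
        elif could_be_int[i]:
            types[name] = "BIGINT"
        elif could_be_float[i]:
            types[name] = "DOUBLE"
        else:
            types[name] = "STRING"
    return types
```

It could be called as `infer_athena_types(open(local_csv_file, newline="", encoding="utf-8"))`. Identifier columns such as NPI would still come out as `BIGINT`; declaring them `STRING` explicitly may be preferable.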


In [33]:
# Create external table for CSV data
create_table_query = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS {database_name}.{table_name_csv} (
    {columns_str}
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\\n'
STORED AS TEXTFILE
LOCATION '{s3_raw_path}/'
TBLPROPERTIES (
    'skip.header.line.count'='1',
    'serialization.null.format'=''
)
"""

print(f"Creating external table...")
print(f"\nQuery preview:")
print(create_table_query[:500] + "...")

try:
    result = pd.read_sql(create_table_query, athena_conn)
    print(f"\nTable '{table_name_csv}' created successfully")
except Exception as e:
    print(f"\nError creating table: {e}")

Creating external table...

Query preview:

CREATE EXTERNAL TABLE IF NOT EXISTS cms_open_payments.general_payments_csv (
    `Change_Type` STRING,
    `Covered_Recipient_Type` STRING,
    `Teaching_Hospital_CCN` BIGINT,
    `Teaching_Hospital_ID` BIGINT,
    `Teaching_Hospital_Name` STRING,
    `Covered_Recipient_Profile_ID` DOUBLE,
    `Covered_Recipient_NPI` DOUBLE,
    `Covered_Recipient_First_Name` DOUBLE,
    `Covered_Recipient_Middle_Name` DOUBLE,
    `Covered_Recipient_Last_Name` DOUBLE,
    `Covered_Recipient_Name_Suffix` DOUBLE,...

Table 'general_payments_csv' created successfully
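
`ROW FORMAT DELIMITED` splits each line on every comma and does not interpret double quotes, so a quoted field that contains a comma would shift all later columns. If the source file contains such fields, the OpenCSVSerde is the usual alternative. The sketch below builds an equivalent DDL statement; it assumes all columns are declared `STRING` with casts deferred to the Parquet conversion, and the function name is hypothetical.

```python
def build_csv_table_ddl(database, table, columns, location):
    """Return CREATE EXTERNAL TABLE DDL that parses quoted CSV fields."""
    cols = ",\n    ".join(f"`{c}` STRING" for c in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"    {cols}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'\n"
        "WITH SERDEPROPERTIES (\n"
        "    'separatorChar' = ',',\n"
        "    'quoteChar' = '\"',\n"
        "    'escapeChar' = '\\\\'\n"
        ")\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('skip.header.line.count' = '1')"
    )
```

With the notebook's variables this would be `build_csv_table_ddl(database_name, table_name_csv, df_schema.columns, f"{s3_raw_path}/")`.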


In [34]:
# Verify table creation
show_tables_query = f"SHOW TABLES IN {database_name}"

print("Verifying table creation...")
tables = pd.read_sql(show_tables_query, athena_conn)

print(f"\nTables in database '{database_name}':")
display(tables)

if table_name_csv in tables.values:
    print(f"\nTable '{table_name_csv}' exists")
else:
    print(f"\nTable '{table_name_csv}' not found")

Verifying table creation...

Tables in database 'cms_open_payments':


| | tab_name |
|---|---|
| 0 | general_payments_csv |



Table 'general_payments_csv' exists


In [35]:
# Test query - count rows
count_query = f"""
SELECT COUNT(*) as row_count
FROM {database_name}.{table_name_csv}
"""

print("Testing table access...")
print(f"Query: {count_query}")

try:
    result = pd.read_sql(count_query, athena_conn)
    print(f"\nQuery successful")
    print(f"  Total rows: {result['row_count'][0]:,}")
except Exception as e:
    print(f"\nError querying table: {e}")

Testing table access...
Query: 
SELECT COUNT(*) as row_count
FROM cms_open_payments.general_payments_csv


Query successful
  Total rows: 15,397,627
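
A row count alone does not show whether rows were parsed correctly. One cross-check is to profile the local file with a quote-aware parser and compare the record count with Athena's. It also helps to count records whose fields contain a comma or newline, since those would be split incorrectly by a plain comma delimiter. This is a standard-library sketch; `profile_csv` and its result keys are hypothetical, and a full pass over an 8 GB file will take several minutes.

```python
import csv


def profile_csv(fileobj):
    """Count columns, records, and records that need quote-aware parsing."""
    reader = csv.reader(fileobj)
    header = next(reader)
    records = 0
    risky = 0
    for row in reader:
        records += 1
        # A comma or newline inside a field only survives if quotes are honoured
        if any("," in field or "\n" in field for field in row):
            risky += 1
    return {"columns": len(header), "records": records, "risky_records": risky}
```

Open the file with `newline=""` so that the `csv` module handles embedded line breaks itself, for example `profile_csv(open(local_csv_file, newline="", encoding="utf-8"))`.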


In [36]:
# Sample query - preview data
sample_query = f"""
SELECT *
FROM {database_name}.{table_name_csv}
LIMIT 5
"""

print("Fetching sample data...")

try:
    sample_data = pd.read_sql(sample_query, athena_conn)
    print(f"\nSample data retrieved")
    print(f"  Shape: {sample_data.shape}")
    display(sample_data.head())
except Exception as e:
    print(f"\nError fetching sample data: {e}")

Fetching sample data...

Sample data retrieved
  Shape: (5, 91)


Unnamed: 0,change_type,covered_recipient_type,teaching_hospital_ccn,teaching_hospital_id,teaching_hospital_name,covered_recipient_profile_id,covered_recipient_npi,covered_recipient_first_name,covered_recipient_middle_name,covered_recipient_last_name,covered_recipient_name_suffix,recipient_primary_business_street_address_line1,recipient_primary_business_street_address_line2,recipient_city,recipient_state,recipient_zip_code,recipient_country,recipient_province,recipient_postal_code,covered_recipient_primary_type_1,covered_recipient_primary_type_2,covered_recipient_primary_type_3,covered_recipient_primary_type_4,covered_recipient_primary_type_5,covered_recipient_primary_type_6,...,indicate_drug_or_biological_or_device_or_medical_supply_2,product_category_or_therapeutic_area_2,name_of_drug_or_biological_or_device_or_medical_supply_2,associated_drug_or_biological_ndc_2,associated_device_or_medical_supply_pdi_2,covered_or_noncovered_indicator_3,indicate_drug_or_biological_or_device_or_medical_supply_3,product_category_or_therapeutic_area_3,name_of_drug_or_biological_or_device_or_medical_supply_3,associated_drug_or_biological_ndc_3,associated_device_or_medical_supply_pdi_3,covered_or_noncovered_indicator_4,indicate_drug_or_biological_or_device_or_medical_supply_4,product_category_or_therapeutic_area_4,name_of_drug_or_biological_or_device_or_medical_supply_4,associated_drug_or_biological_ndc_4,associated_device_or_medical_supply_pdi_4,covered_or_noncovered_indicator_5,indicate_drug_or_biological_or_device_or_medical_supply_5,product_category_or_therapeutic_area_5,name_of_drug_or_biological_or_device_or_medical_supply_5,associated_drug_or_biological_ndc_5,associated_device_or_medical_supply_pdi_5,program_year,payment_publication_date
0,NEW,Covered Recipient Physician,,,,77275.0,1710931000.0,,,,,1401 E TRENT AVE STE 200,,SPOKANE,WA,99201,United States,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,
1,NEW,Covered Recipient Non-Physician Practitioner,,,,10939760.0,1104063000.0,,,,,275 BETHESDA DR,,GREENVILLE,NC,27833,United States,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,
2,NEW,Covered Recipient Non-Physician Practitioner,,,,10874215.0,1093306000.0,,,,,6001 N MAYFAIR ST,,SPOKANE,WA,99201,United States,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,
3,NEW,Covered Recipient Physician,,,,123293.0,1730126000.0,,,,,18100 OAKWOOD BLVD STE 315,,DEARBORN,MI,48120,United States,,,,,,,,,...,,,,,,,10705030000000.0,,,,,,,,,,,,,,,,,,
4,NEW,Covered Recipient Physician,,,,352958.0,1760687000.0,,,,,600 SUNCREST TOWN CENTRE DR STE 115,,MORGANTOWN,WV,26501,United States,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,


## 7. Convert CSV to Parquet

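Convert the CSV data to Snappy-compressed Parquet, partitioned by `program_year`, using an Athena CTAS (`CREATE TABLE AS SELECT`) query. Parquet is a columnar format, so Athena reads only the columns a query references, which reduces both the data scanned per query and the storage footprint.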

In [37]:
# Define Parquet table name
table_name_parquet = "general_payments_parquet"

print(f"Parquet Conversion Configuration:")
print(f"  Source Table: {database_name}.{table_name_csv}")
print(f"  Target Table: {database_name}.{table_name_parquet}")
print(f"  Target Location: {s3_parquet_path}/")

%store table_name_parquet

Parquet Conversion Configuration:
  Source Table: cms_open_payments.general_payments_csv
  Target Table: cms_open_payments.general_payments_parquet
  Target Location: s3://cmsopenpaymentsystems/cms-open-payments/parquet/
Stored 'table_name_parquet' (str)


In [43]:
# Get column names and create explicit column list
columns_query = f"""
SELECT * 
FROM {database_name}.{table_name_csv} 
LIMIT 1
"""

try:
    # Get all columns from source table
    columns_df = pd.read_sql(columns_query, athena_conn)
    columns = columns_df.columns.tolist()
    
    # Remove program_year if it exists
    if 'program_year' in columns:
        columns.remove('program_year')
    
    # Create column selection string, ensuring program_year is last
    columns_str = ',\n        '.join(columns)
    
    create_parquet_query = f"""
    CREATE TABLE {database_name}.{table_name_parquet}
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = '{s3_parquet_path}/',
        partitioned_by = ARRAY['program_year']
    )
    AS
    SELECT 
        {columns_str},
        '2024' as program_year
    FROM {database_name}.{table_name_csv}
    """
    
    print("Converting CSV to Parquet format...")
    print("Note: This operation may take 15-30 minutes for large datasets")
    print(f"\nQuery:")
    print(create_parquet_query)

    # Execute conversion
    result = pd.read_sql(create_parquet_query, athena_conn)
    print(f"\nConversion complete")
    print(f"  Parquet table '{table_name_parquet}' created successfully")
except Exception as e:
    print(f"\nError during conversion: {e}")
    print(f"\nNote: If table already exists, drop it first and empty its S3 location:")
    print(f"  DROP TABLE IF EXISTS {database_name}.{table_name_parquet}")
    print(f"  (CTAS requires {s3_parquet_path}/ to contain no objects)")


Converting CSV to Parquet format...
Note: This operation may take 15-30 minutes for large datasets

Query:

    CREATE TABLE cms_open_payments.general_payments_parquet
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://cmsopenpaymentsystems/cms-open-payments/parquet/',
        partitioned_by = ARRAY['program_year']
    )
    AS
    SELECT 
        change_type,
        covered_recipient_type,
        teaching_hospital_ccn,
        teaching_hospital_id,
        teaching_hospital_name,
        covered_recipient_profile_id,
        covered_recipient_npi,
        covered_recipient_first_name,
        covered_recipient_middle_name,
        covered_recipient_last_name,
        covered_recipient_name_suffix,
        recipient_primary_business_street_address_line1,
        recipient_primary_business_street_address_line2,
        recipient_city,
        recipient_state,
        recipient_zip_code,
        recipient_country,
        re
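
The CTAS statement above copies every column with its CSV type and adds the partition column last, as Athena requires for partitioned CTAS output. Building the statement in a function makes that ordering explicit and leaves room for per-column casts. The following is a sketch; the function name and the `casts` parameter are hypothetical, and `TRY_CAST` is used so that unparseable values become NULL instead of failing the query.

```python
def build_ctas_query(database, source_table, target_table, columns, location,
                     partition_col="program_year", partition_value="2024",
                     casts=None):
    """Return a CTAS statement writing Snappy Parquet partitioned by partition_col."""
    casts = casts or {}
    select_items = []
    for col in columns:
        if col == partition_col:
            continue  # re-added below so the partition column comes last
        if col in casts:
            select_items.append(f"TRY_CAST({col} AS {casts[col]}) AS {col}")
        else:
            select_items.append(col)
    select_items.append(f"'{partition_value}' AS {partition_col}")
    select_sql = ",\n    ".join(select_items)
    return (
        f"CREATE TABLE {database}.{target_table}\n"
        "WITH (\n"
        "    format = 'PARQUET',\n"
        "    parquet_compression = 'SNAPPY',\n"
        f"    external_location = '{location}',\n"
        f"    partitioned_by = ARRAY['{partition_col}']\n"
        ")\n"
        "AS\n"
        f"SELECT\n    {select_sql}\n"
        f"FROM {database}.{source_table}"
    )
```

For instance, passing `casts={"total_amount_of_payment_usdollars": "DOUBLE"}` would store the payment amount as a number regardless of how the CSV table declared it.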

In [44]:
# Verify Parquet table
count_parquet_query = f"""
SELECT COUNT(*) as row_count
FROM {database_name}.{table_name_parquet}
"""

print("Verifying Parquet table...")

try:
    result = pd.read_sql(count_parquet_query, athena_conn)
    print(f"\nParquet table verified")
    print(f"  Total rows: {result['row_count'][0]:,}")
except Exception as e:
    print(f"\nError verifying Parquet table: {e}")

Verifying Parquet table...

Parquet table verified
  Total rows: 15,397,627


In [45]:
# Compare file sizes
print("Comparing CSV vs Parquet storage:")

# Get CSV size
csv_objects = s3_client.list_objects_v2(
    Bucket=bucket,
    Prefix=raw_data_prefix
)

csv_size = sum(obj['Size'] for obj in csv_objects.get('Contents', []))

# Get Parquet size
parquet_objects = s3_client.list_objects_v2(
    Bucket=bucket,
    Prefix=parquet_data_prefix
)

parquet_size = sum(obj['Size'] for obj in parquet_objects.get('Contents', []))

print(f"\nStorage Comparison:")
print(f"  CSV Size: {csv_size / (1024**3):.2f} GB")
print(f"  Parquet Size: {parquet_size / (1024**3):.2f} GB")
if csv_size > 0 and parquet_size > 0:
    compression_ratio = (1 - parquet_size/csv_size) * 100
    print(f"  Compression: {compression_ratio:.1f}% reduction")
    print(f"  Space Saved: {(csv_size - parquet_size) / (1024**3):.2f} GB")

Comparing CSV vs Parquet storage:

Storage Comparison:
  CSV Size: 8.22 GB
  Parquet Size: 0.51 GB
  Compression: 93.8% reduction
  Space Saved: 7.70 GB
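
`list_objects_v2` returns at most 1,000 keys per call, and a CTAS query can write many Parquet files, so a single call may undercount the total size. Iterating over paginated results avoids that. The helper below is a sketch that sums sizes over any iterable of response pages; `summarize_pages` is a hypothetical name.

```python
def summarize_pages(pages):
    """Return (object_count, total_bytes) across list_objects_v2 response pages."""
    count = 0
    total = 0
    for page in pages:
        # Pages for an empty prefix have no 'Contents' key
        for obj in page.get("Contents", []):
            count += 1
            total += obj["Size"]
    return count, total

# Intended use with the notebook's client (requires AWS credentials):
#   paginator = s3_client.get_paginator("list_objects_v2")
#   pages = paginator.paginate(Bucket=bucket, Prefix=parquet_data_prefix)
#   n_files, parquet_size = summarize_pages(pages)
```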


## 8. Query Data with AWS Data Wrangler

Use AWS Data Wrangler (the `awswrangler` package, since renamed AWS SDK for pandas) to run Athena queries and load the results directly into pandas DataFrames.

In [46]:
# Query using AWS Data Wrangler
sample_query_wr = f"""
SELECT 
    COUNT(*) as total_payments,
    SUM(CAST(Total_Amount_of_Payment_USDollars AS DOUBLE)) as total_amount,
    AVG(CAST(Total_Amount_of_Payment_USDollars AS DOUBLE)) as avg_amount,
    MIN(CAST(Total_Amount_of_Payment_USDollars AS DOUBLE)) as min_amount,
    MAX(CAST(Total_Amount_of_Payment_USDollars AS DOUBLE)) as max_amount
FROM {database_name}.{table_name_parquet}
"""

print("Querying payment statistics with AWS Data Wrangler...")
print(f"\nQuery: {sample_query_wr}")

try:
    df_stats = wr.athena.read_sql_query(
        sql=sample_query_wr,
        database=database_name,
        ctas_approach=False
    )
    
    print(f"\nQuery successful")
    print(f"\nPayment Statistics:")
    display(df_stats)
    
except Exception as e:
    print(f"\nError querying data: {e}")

Querying payment statistics with AWS Data Wrangler...

Query: 
SELECT 
    COUNT(*) as total_payments,
    SUM(CAST(Total_Amount_of_Payment_USDollars AS DOUBLE)) as total_amount,
    AVG(CAST(Total_Amount_of_Payment_USDollars AS DOUBLE)) as avg_amount,
    MIN(CAST(Total_Amount_of_Payment_USDollars AS DOUBLE)) as min_amount,
    MAX(CAST(Total_Amount_of_Payment_USDollars AS DOUBLE)) as max_amount
FROM cms_open_payments.general_payments_parquet


Query successful

Payment Statistics:


| | total_payments | total_amount | avg_amount | min_amount | max_amount |
|---|---|---|---|---|---|
| 0 | 15397627 | 64501680000000.0 | 7630776.0 | 0.01 | 100001200000.0 |
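
The statistics above deserve a sanity check before they are used downstream. `AVG` ignores NULLs, so dividing `total_amount` by `avg_amount` implies only about 8.45 million of the 15.4 million rows have a usable amount. The maximum of roughly 100 billion dollars is also implausible for a single payment. One possible explanation, not confirmed here, is column misalignment in the CSV table. The function below sketches an automated check; its name and the `max_plausible` threshold are hypothetical.

```python
def audit_payment_stats(total_payments, total_amount, avg_amount, max_amount,
                        max_plausible=1e9):
    """Return a list of warnings for aggregates that look suspicious."""
    if not total_payments:
        return []
    findings = []
    # AVG skips NULLs, so SUM / AVG estimates the number of non-NULL amounts
    implied_non_null = total_amount / avg_amount if avg_amount else 0
    null_share = 1 - implied_non_null / total_payments
    if null_share > 0.01:
        findings.append(f"about {null_share:.0%} of rows have no usable amount")
    if max_amount > max_plausible:
        findings.append(f"max amount {max_amount:,.0f} exceeds {max_plausible:,.0f}")
    return findings
```

Applied to the row above, it reports both a high share of missing amounts and an out-of-range maximum.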


In [47]:
# Sample data by recipient type
recipient_query = f"""
SELECT 
    Covered_Recipient_Type,
    COUNT(*) as payment_count,
    SUM(CAST(Total_Amount_of_Payment_USDollars AS DOUBLE)) as total_amount
FROM {database_name}.{table_name_parquet}
GROUP BY Covered_Recipient_Type
ORDER BY total_amount DESC
"""

print("Analyzing payments by recipient type...")

try:
    df_recipients = wr.athena.read_sql_query(
        sql=recipient_query,
        database=database_name,
        ctas_approach=False
    )
    
    print(f"\nQuery successful")
    print(f"\nPayments by Recipient Type:")
    display(df_recipients)
    
except Exception as e:
    print(f"\nError querying data: {e}")

Analyzing payments by recipient type...

Query successful

Payments by Recipient Type:


| | Covered_Recipient_Type | payment_count | total_amount |
|---|---|---|---|
| 0 | Covered Recipient Physician | 9894393 | 42901130000000.0 |
| 1 | Covered Recipient Non-Physician Practitioner | 5468860 | 21600140000000.0 |
| 2 | Covered Recipient Teaching Hospital | 34374 | 414516800.0 |


## 9. Validation & Verification

Perform final validation checks on the datalake setup.

In [48]:
# Comprehensive validation
print("=" * 70)
print("DATALAKE SETUP VALIDATION")
print("=" * 70)

validation_passed = True

# Check 1: S3 Buckets
print("\n1. S3 Storage:")
try:
    for prefix in [raw_data_prefix, parquet_data_prefix]:
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        if 'Contents' in response:
            print(f"   [OK] {prefix}/")
        else:
            print(f"   [FAIL] {prefix}/ (empty or missing)")
            validation_passed = False
except Exception as e:
    print(f"   [FAIL] Error checking S3: {e}")
    validation_passed = False

# Check 2: Athena Database
print("\n2. Athena Database:")
try:
    databases = pd.read_sql("SHOW DATABASES", athena_conn)
    if database_name in databases.values:
        print(f"   [OK] Database '{database_name}' exists")
    else:
        print(f"   [FAIL] Database '{database_name}' not found")
        validation_passed = False
except Exception as e:
    print(f"   [FAIL] Error checking database: {e}")
    validation_passed = False

# Check 3: Tables
print("\n3. Athena Tables:")
try:
    tables = pd.read_sql(f"SHOW TABLES IN {database_name}", athena_conn)
    for table in [table_name_csv, table_name_parquet]:
        if table in tables.values:
            print(f"   [OK] Table '{table}' exists")
        else:
            print(f"   [FAIL] Table '{table}' not found")
            validation_passed = False
except Exception as e:
    print(f"   [FAIL] Error checking tables: {e}")
    validation_passed = False

# Check 4: Data Accessibility
print("\n4. Data Accessibility:")
try:
    count_result = pd.read_sql(
        f"SELECT COUNT(*) as cnt FROM {database_name}.{table_name_parquet}",
        athena_conn
    )
    row_count = count_result['cnt'][0]
    print(f"   [OK] Query successful ({row_count:,} rows)")
except Exception as e:
    print(f"   [FAIL] Error querying data: {e}")
    validation_passed = False

# Final result
print("\n" + "=" * 70)
if validation_passed:
    print("ALL VALIDATION CHECKS PASSED")
    print("Datalake setup complete and operational")
    setup_datalake_passed = True
else:
    print("SOME VALIDATION CHECKS FAILED")
    print("Please review the errors above and re-run failed steps")
    setup_datalake_passed = False

print("=" * 70)

# Store validation result
%store setup_datalake_passed

DATALAKE SETUP VALIDATION

1. S3 Storage:
   [OK] cms-open-payments/raw/
   [OK] cms-open-payments/parquet/

2. Athena Database:
   [OK] Database 'cms_open_payments' exists

3. Athena Tables:
   [OK] Table 'general_payments_csv' exists
   [OK] Table 'general_payments_parquet' exists

4. Data Accessibility:
   [OK] Query successful (15,397,627 rows)

ALL VALIDATION CHECKS PASSED
Datalake setup complete and operational
Stored 'setup_datalake_passed' (bool)
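
The four checks above repeat the same try/except and status-printing pattern. If more checks are added, a small runner that takes named callables keeps them uniform. This is a sketch of one way to structure it; `run_checks` is a hypothetical helper, and each callable is assumed to return a truthy value on success.

```python
def run_checks(checks):
    """Run (name, callable) pairs; return (all_passed, {name: passed})."""
    results = {}
    for name, check in checks:
        try:
            passed = bool(check())
            detail = ""
        except Exception as exc:
            # A raised exception counts as a failed check, not a crash
            passed = False
            detail = str(exc)
        results[name] = passed
        status = "OK" if passed else "FAIL"
        print(f"   [{status}] {name}" + (f": {detail}" if detail else ""))
    return all(results.values()), results
```

A check for the Parquet prefix could then be written as `("parquet data present", lambda: "Contents" in s3_client.list_objects_v2(Bucket=bucket, Prefix=parquet_data_prefix, MaxKeys=1))`.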
