# Landry BlueBook
This is the working notebook for my Project BlueBook code.

# What's working?

• imported libraries<br>
• imported most .csv files<br>
• combined classification files into one dataframe<br>
• removed trailing spaces from classification dataframe keys<br>

# What's not working?

• outpatient column names contain newline characters, and my attempt with DataFrame.replace() does not remove them<br>
• providers_csv does not load yet<br>

In [1]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
# set up dataframes
#   absolute path: /Users/landrybutler/github/healthcare-bluebook-project-bluebook/data/my_file.csv
#   relative path: ../data/my_file.csv

# paths to files
# providers_csv is too large to store in git, so kept outside this project folder in 'oversized_files'
providers_csv = '/Users/landrybutler/github/oversized_files/Medicare_Provider_Util_Payment_PUF_CY2017.csv'
outpatient_csv = '../data/MUP_OHP_R19_P04_V10_D17_APC_Provider.csv'
classification1_csv = '../data/508-Compliant-Version-of-2020_january_web_addendum_b.12312019.csv'
classification2_csv = '../data/2020_january_web_addendum_b.12312019.csv'
cbsa_csv = '../data/ZIP_CBSA_032020.csv'

# DEBUG: having trouble loading providers_csv, so both attempts are commented out for now
# providers = pd.read_csv('../data/Medicare_Provider_Util_Payment_PUF_CY2017.csv', engine='python')
# providers = pd.read_csv(providers_csv, engine='python')

outpatient = pd.read_csv(outpatient_csv, low_memory=False) 
classification1 = pd.read_csv(classification1_csv) 
classification2 = pd.read_csv(classification2_csv) 
cbsa = pd.read_csv(cbsa_csv) 
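
NOTE: one thing to try for the providers_csv problem is reading the file in chunks and keeping only the columns that are needed. This is a minimal sketch, not a confirmed fix: the cause of the loading failure is not known here, the helper name `load_large_csv` is hypothetical, and the tab delimiter and column names in the demo are made-up stand-ins for the real file.

```python
import io
import pandas as pd

def load_large_csv(path_or_buffer, sep=",", usecols=None, chunksize=100_000):
    """Read a large delimited file piece by piece and return one dataframe.

    Hypothetical helper: every value is read as text first, so a stray
    non-numeric value cannot break the load; convert dtypes afterwards.
    """
    reader = pd.read_csv(
        path_or_buffer,
        sep=sep,              # the real delimiter is an assumption; check the file
        usecols=usecols,      # keep only the columns that are needed
        chunksize=chunksize,  # number of rows per chunk
        dtype=str,
    )
    return pd.concat([chunk for chunk in reader], ignore_index=True)

# Tiny in-memory stand-in for the real file (made-up columns and values)
sample = io.StringIO("npi\tstate\tcharge\n1\tAL\t10\n2\tTN\t20\n3\tTN\t30\n")
demo = load_large_csv(sample, sep="\t", usecols=["npi", "state"], chunksize=2)
print(demo)
```

With the real file, the call would look like `load_large_csv(providers_csv, usecols=[...])`, with the column list filled in after looking at the header row.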


In [3]:
# how can I join the classification files? 
# look at df.head()

classification1.head()

Unnamed: 0,HCPCS Code,Short Descriptor,SI,APC,Relative Weight,Payment Rate,National Unadjusted Copayment,Minimum Unadjusted Copayment,Column1,Column2,Column3
0,100,Anesth salivary gland,N,,,,,,,,
1,102,Anesth repair of cleft lip,N,,,,,,,,
2,103,Anesth blepharoplasty,N,,,,,,,,
3,104,Anesth electroshock,N,,,,,,,,
4,120,Anesth ear surgery,N,,,,,,,,


In [4]:
classification2.head()

Unnamed: 0,HCPCS Code,Short Descriptor,SI,APC,Relative Weight,Payment Rate,National Unadjusted Copayment,Minimum Unadjusted Copayment,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,100,Anesth salivary gland,N,,,,,,,,
1,102,Anesth repair of cleft lip,N,,,,,,,,
2,103,Anesth blepharoplasty,N,,,,,,,,
3,104,Anesth electroshock,N,,,,,,,,
4,120,Anesth ear surgery,N,,,,,,,,


In [5]:
# look at df.info()

classification1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16628 entries, 0 to 16627
Data columns (total 11 columns):
HCPCS Code                        16628 non-null object
Short Descriptor                  16628 non-null object
SI                                16628 non-null object
APC                               5942 non-null float64
Relative Weight                   5516 non-null float64
Payment Rate                      5936 non-null object
National Unadjusted Copayment     5934 non-null object
Minimum Unadjusted Copayment      5936 non-null object
Column1                           279 non-null object
Column2                           408 non-null object
Column3                           0 non-null float64
dtypes: float64(3), object(8)
memory usage: 1.4+ MB


In [6]:
classification2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16628 entries, 0 to 16627
Data columns (total 11 columns):
HCPCS Code                        16628 non-null object
Short Descriptor                  16628 non-null object
SI                                16628 non-null object
APC                               5942 non-null float64
Relative Weight                   5516 non-null float64
Payment Rate                      5936 non-null object
National Unadjusted Copayment     5934 non-null object
Minimum Unadjusted Copayment      5936 non-null object
Unnamed: 8                        279 non-null object
Unnamed: 9                        408 non-null object
Unnamed: 10                       0 non-null float64
dtypes: float64(3), object(8)
memory usage: 1.4+ MB


NOTE: based on a quick look at head() and info(), both files are laid out the same way:
        - HCPCS Code
        - Short Descriptor
        - SI
        - APC
        - Relative Weight
        - Payment Rate
        - National Unadjusted Copayment
        - Minimum Unadjusted Copayment
        - Column1 or Unnamed: 8
        - Column2 or Unnamed: 9
        - Column3 or Unnamed: 10

The row count (16628), the non-null counts, and the memory usage are identical for both files; only the names of the last three columns differ. I wonder if they contain duplicate info?
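
A quick way to answer the duplicate question would be to compare the two dataframes value by value while ignoring the column names. The sketch below uses small made-up frames instead of the real files, and the function name `same_data` is hypothetical.

```python
import numpy as np
import pandas as pd

def same_data(df_a, df_b):
    """Return True if both frames hold identical values, ignoring column names."""
    if df_a.shape != df_b.shape:
        return False
    # Give df_b the column labels of df_a (by position) so only values are compared
    relabeled = df_b.copy()
    relabeled.columns = df_a.columns
    # DataFrame.equals treats NaNs in the same position as equal
    return df_a.equals(relabeled)

# Made-up stand-ins for classification1 and classification2
a = pd.DataFrame({"HCPCS Code": ["100", "102"], "APC ": [np.nan, 5012.0], "Column1": ["a", "b"]})
b = pd.DataFrame({"HCPCS Code": ["100", "102"], "APC ": [np.nan, 5012.0], "Unnamed: 8": ["a", "b"]})
c = pd.DataFrame({"HCPCS Code": ["100", "999"], "APC ": [np.nan, 5012.0], "Unnamed: 8": ["a", "b"]})

print(same_data(a, b))
print(same_data(a, c))
```

If `same_data(classification1, classification2)` came back True, one of the two files could simply be dropped instead of joined.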

In [7]:
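# Inner join on the eight shared columns (HCPCS Code through Minimum Unadjusted Copayment),
# so rows that match in both files collapse into one row
# NOTE: some column names have trailing spaces; they are included in the keys below and removed later
# Memory usage after the join only increased by 0.5 MB (1.4+ MB to 1.9+ MB)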

classifications = pd.merge(left=classification1, right=classification2, 
                           how='inner', 
                           on=['HCPCS Code','Short Descriptor','SI','APC ',
                               'Relative Weight','Payment Rate ','National Unadjusted Copayment ',
                              'Minimum Unadjusted Copayment '])
classifications.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16628 entries, 0 to 16627
Data columns (total 14 columns):
HCPCS Code                        16628 non-null object
Short Descriptor                  16628 non-null object
SI                                16628 non-null object
APC                               5942 non-null float64
Relative Weight                   5516 non-null float64
Payment Rate                      5936 non-null object
National Unadjusted Copayment     5934 non-null object
Minimum Unadjusted Copayment      5936 non-null object
Column1                           279 non-null object
Column2                           408 non-null object
Column3                           0 non-null float64
Unnamed: 8                        279 non-null object
Unnamed: 9                        408 non-null object
Unnamed: 10                       0 non-null float64
dtypes: float64(4), object(10)
memory usage: 1.9+ MB
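
An inner join silently drops rows that exist in only one file. To see whether that happens, an outer join with `indicator=True` labels each row as `both`, `left_only`, or `right_only`. This is a sketch on made-up frames (only two key columns, invented values); stripping the column names first also means the join keys no longer need trailing spaces.

```python
import pandas as pd

# Made-up stand-ins; note the trailing space in "SI " as in the real files
left = pd.DataFrame({"HCPCS Code": ["100", "102", "103"],
                     "SI ": ["N", "N", "N"],
                     "Column1": ["x", "y", "z"]})
right = pd.DataFrame({"HCPCS Code": ["100", "102", "104"],
                      "SI ": ["N", "N", "N"],
                      "Unnamed: 8": ["x", "y", "w"]})

# Strip whitespace from the column names before merging
left = left.rename(columns=str.strip)
right = right.rename(columns=str.strip)

# Outer join keeps every row; the _merge column records where each row came from
check = pd.merge(left, right, how="outer", on=["HCPCS Code", "SI"], indicator=True)
print(check["_merge"].value_counts())
```

If every row of the real merge is labeled `both`, the inner join lost nothing.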


In [8]:
# Rename columns to remove leading/trailing spaces
classifications = classifications.rename(columns=lambda x: x.strip())

classifications.keys()

Index(['HCPCS Code', 'Short Descriptor', 'SI', 'APC', 'Relative Weight',
       'Payment Rate', 'National Unadjusted Copayment',
       'Minimum Unadjusted Copayment', 'Column1', 'Column2', 'Column3',
       'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10'],
      dtype='object')

In [9]:
# Look at outpatient dataframe
outpatient.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61779 entries, 0 to 61778
Data columns (total 16 columns):
Provider ID                                  61779 non-null int64
Provider Name                                61779 non-null object
Provider Street Address                      61779 non-null object
Provider City                                61779 non-null object
Provider
State                               61779 non-null object
Provider
Zip Code                            61779 non-null int64
Provider
Hospital Referral Region
(HRR)      61779 non-null object
APC                                          61779 non-null int64
APC
Description                              61779 non-null object
Beneficiaries                                60782 non-null object
Comprehensive APC
Services                   61779 non-null object
Average
Estimated
Total
Submitted
Charges    61779 non-null object
Average
Medicare
Allowed
Amount              61779 non-null object
Average
Medicare
Paymen

In [17]:
outpatient.head()

Unnamed: 0,Provider ID,Provider Name,Provider Street Address,Provider City,Provider\nState,Provider\nZip Code,Provider\nHospital Referral Region\n(HRR),APC,APC\nDescription,Beneficiaries,Comprehensive APC\nServices,Average\nEstimated\nTotal\nSubmitted\nCharges,Average\nMedicare\nAllowed\nAmount,Average\nMedicare\nPayment\nAmount,Outlier\nComprehensive\nAPC\nServices,Average\nMedicare\nOutlier\nAmount
0,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5072,Level 2 Excision/ Biopsy/ Incision and Drainage,249,259,"$9,575","$1,038",$826,,
1,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5073,Level 3 Excision/ Biopsy/ Incision and Drainage,52,53,"$12,578","$1,793","$1,423",,
2,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5091,Level 1 Breast/Lymphatic Surgery and Related P...,26,27,"$11,338","$2,114","$1,684",0.0,$0
3,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5092,Level 2 Breast/Lymphatic Surgery and Related P...,23,23,"$17,116","$3,737","$2,978",0.0,$0
4,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5112,Level 2 Musculoskeletal Procedures,17,17,"$7,383","$1,029",$820,0.0,$0
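
NOTE: the charge and payment columns show values like "$9,575" and are listed as object in info(), so they are text, not numbers. A possible conversion is sketched below; the function name `currency_to_float` is hypothetical and the sample series is made up from the values shown above.

```python
import pandas as pd

def currency_to_float(series):
    """Turn strings like '$9,575' into floats; missing or unparseable values become NaN."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample in the same format as the outpatient amount columns
amounts = pd.Series(["$9,575", "$826", None])
converted = currency_to_float(amounts)
print(converted)
```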


In [22]:
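# DEBUG: this is not working
# Goal: remove the newline characters from the column names.
# DataFrame.replace() only looks at cell values, not column labels, so the header below is unchanged.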
outpatient = outpatient.replace(to_replace='\n',value=' ')

# df.replace(',', '-', regex=True)
# outpatient.keys()

# outpatient = outpatient.rename(columns=lambda x: x.strip())

outpatient.head()

Unnamed: 0,Provider ID,Provider Name,Provider Street Address,Provider City,Provider\nState,Provider\nZip Code,Provider\nHospital Referral Region\n(HRR),APC,APC\nDescription,Beneficiaries,Comprehensive APC\nServices,Average\nEstimated\nTotal\nSubmitted\nCharges,Average\nMedicare\nAllowed\nAmount,Average\nMedicare\nPayment\nAmount,Outlier\nComprehensive\nAPC\nServices,Average\nMedicare\nOutlier\nAmount
0,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5072,Level 2 Excision/ Biopsy/ Incision and Drainage,249,259,"$9,575","$1,038",$826,,
1,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5073,Level 3 Excision/ Biopsy/ Incision and Drainage,52,53,"$12,578","$1,793","$1,423",,
2,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5091,Level 1 Breast/Lymphatic Surgery and Related P...,26,27,"$11,338","$2,114","$1,684",0.0,$0
3,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5092,Level 2 Breast/Lymphatic Surgery and Related P...,23,23,"$17,116","$3,737","$2,978",0.0,$0
4,10001,Southeast Alabama Medical Center,1108 Ross Clark Circle,Dothan,AL,36301,AL - Dothan,5112,Level 2 Musculoskeletal Procedures,17,17,"$7,383","$1,029",$820,0.0,$0
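
The header still contains the newlines because `DataFrame.replace()` works on the cell values, not on the column labels. A possible fix is to rewrite the labels themselves through the string methods on `df.columns`. The sketch below uses a made-up one-row frame with three of the outpatient column names.

```python
import pandas as pd

# Made-up frame with column names copied from the outpatient header
demo = pd.DataFrame({
    "Provider\nState": ["AL"],
    "Average\nMedicare\nAllowed\nAmount": ["$1,038"],
    "Provider City": ["Dothan"],
})

# Replace every run of whitespace (including newlines) with one space, then trim the ends
demo.columns = demo.columns.str.replace(r"\s+", " ", regex=True).str.strip()

print(list(demo.columns))
```

The same line applied to `outpatient.columns` should give names such as `Provider State` and `Provider Zip Code`, and it also covers any trailing spaces.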


In [23]:
# Look at cbsa dataframe
cbsa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47415 entries, 0 to 47414
Data columns (total 6 columns):
ZIP          47415 non-null int64
CBSA         47415 non-null int64
RES_RATIO    47415 non-null float64
BUS_RATIO    47415 non-null float64
OTH_RATIO    47415 non-null float64
TOT_RATIO    47415 non-null float64
dtypes: float64(4), int64(2)
memory usage: 2.2 MB


In [24]:
cbsa.head()

Unnamed: 0,ZIP,CBSA,RES_RATIO,BUS_RATIO,OTH_RATIO,TOT_RATIO
0,501,35620,0.0,1.0,0.0,1.0
1,601,38660,1.0,1.0,1.0,1.0
2,602,10380,1.0,1.0,1.0,1.0
3,603,10380,1.0,1.0,1.0,1.0
4,604,10380,1.0,1.0,1.0,1.0


In [26]:
cbsa.shape

(47415, 6)

The cbsa dataframe appears to be standard tabular data without any issues so far: all 6 columns are numeric and none has missing values.

In [27]:
cbsa.tail()

Unnamed: 0,ZIP,CBSA,RES_RATIO,BUS_RATIO,OTH_RATIO,TOT_RATIO
47410,99925,99999,0.0,0.0,1.0,1.0
47411,99926,99999,0.0,0.0,1.0,1.0
47412,99927,99999,0.0,0.0,1.0,1.0
47413,99928,28540,0.0,0.0,1.0,1.0
47414,99929,99999,0.0,0.0,1.0,1.0
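
Two more checks that might be worth running on cbsa before joining it to anything, sketched on made-up rows. The first looks for ZIP values that appear on more than one row, which would duplicate rows in a join on ZIP. The second builds a five-character text version of ZIP, on the assumption that values like 501 are ZIP codes whose leading zeros were dropped when the column was read as int64. The column name `ZIP_str` and the ratio values are invented for the demo.

```python
import pandas as pd

# Made-up rows in the same layout as cbsa (only three of the columns)
demo = pd.DataFrame({
    "ZIP": [501, 601, 602, 602],
    "CBSA": [35620, 38660, 10380, 99999],
    "TOT_RATIO": [1.0, 1.0, 0.7, 0.3],
})

# Rows whose ZIP appears more than once
multi = demo[demo.duplicated("ZIP", keep=False)]
print(multi)

# Text version of ZIP, left-padded with zeros to five characters
demo["ZIP_str"] = demo["ZIP"].astype(str).str.zfill(5)
print(demo[["ZIP", "ZIP_str"]])
```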
