# Get sorta big data

## Medicare Part B Payment Data

For calendar year 2015: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier2015.html


NOTE: this file is over 400 MB zipped and about 2 GB unzipped.

This file lists all procedures paid by Medicare Part B, aggregated by physician (NPI) and procedure (HCPCS code):

* NPI: https://en.wikipedia.org/wiki/National_Provider_Identifier
* HCPCS: https://en.wikipedia.org/wiki/Healthcare_Common_Procedure_Coding_System

In [None]:
!curl -L "http://www.cms.gov/apps/ama/license.asp?file=http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/Medicare_Provider_Util_Payment_PUF_CY2015.zip" -o 2015_partB.zip

In [None]:
!unzip 2015_partB.zip

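First trick: use grep to cut our huge file down to something manageable for a tutorial. Each `-e` pattern is matched independently, anywhere on the line, so this keeps every line that contains `FL` or `MIAMI`. Any header line that matches neither pattern is dropped too.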

In [None]:
!grep -e FL -e MIAMI 2015_partB.txt > 2015_partB_miami.txt
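Because grep matches text anywhere on a line, a street such as `TAMIAMI TRL` also counts as a hit for `MIAMI`. A minimal sketch of a stricter alternative, assuming pandas is available: read the file in chunks and filter on the `nppes_provider_city` column. The sample rows, file names, and chunk size below are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the real tab-separated file (hypothetical rows).
sample = (
    "npi\tnppes_provider_city\thcpcs_code\n"
    "1\tMIAMI\t97001\n"
    "2\tNAPLES\t97035\n"
    "3\tMIAMI\t97110\n"
)

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "partB_sample.txt")
dst = os.path.join(workdir, "partB_sample_miami.txt")
with open(src, "w") as f:
    f.write(sample)

# Read a few rows at a time so the full file never has to fit in memory.
first_chunk = True
for chunk in pd.read_csv(src, sep="\t", dtype=str, chunksize=2):
    keep = chunk[chunk["nppes_provider_city"] == "MIAMI"]
    # Write the header once, then append the remaining chunks without it.
    keep.to_csv(
        dst,
        sep="\t",
        index=False,
        mode="w" if first_chunk else "a",
        header=first_chunk,
    )
    first_chunk = False

miami = pd.read_csv(dst, sep="\t", dtype=str)
print(miami)
```

Unlike the grep version, this keeps the header row and only matches on the city column. On the real 2 GB file a much larger `chunksize` would be the natural choice.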

If you're impatient, just download it from S3:

In [134]:
!aws s3 cp s3://rikturr/2015_partB_miami.txt 2015_partB_miami.txt

download: s3://rikturr/2015_partB_miami.txt to ./2015_partB_miami.txt


In [135]:
import pandas as pd
import numpy as np

df = pd.read_csv('2015_partB_miami.txt', sep='\t')
df.head()

| Unnamed: 0 | npi | nppes_provider_last_org_name | nppes_provider_first_name | nppes_provider_mi | nppes_credentials | nppes_provider_gender | nppes_entity_code | nppes_provider_street1 | nppes_provider_street2 | nppes_provider_city | ... | hcpcs_code | hcpcs_description | hcpcs_drug_indicator | line_srvc_cnt | bene_unique_cnt | bene_day_srvc_cnt | average_Medicare_allowed_amt | average_submitted_chrg_amt | average_Medicare_payment_amt | average_Medicare_standard_amt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1003011107 | STANDHART | PHILIP | L | MSPT, CSCS | M | I | 3841 E TAMIAMI TRL | | NAPLES | ... | 97001 | Physical therapy evaluation | N | 217.0 | 199 | 217 | 77.583318 | 165.0 | 57.633272 | 56.3047 |
| 1 | 1003011107 | STANDHART | PHILIP | L | MSPT, CSCS | M | I | 3841 E TAMIAMI TRL | | NAPLES | ... | 97035 | Application of ultrasound to 1 or more areas, ... | N | 205.0 | 33 | 205 | 10.721707 | 29.0 | 8.284585 | 8.050439 |
| 2 | 1003011107 | STANDHART | PHILIP | L | MSPT, CSCS | M | I | 3841 E TAMIAMI TRL | | NAPLES | ... | 97110 | Therapeutic exercise to develop strength, endu... | N | 4643.0 | 217 | 2442 | 28.992406 | 63.0 | 22.384523 | 19.457097 |
| 3 | 1003011107 | STANDHART | PHILIP | L | MSPT, CSCS | M | I | 3841 E TAMIAMI TRL | | NAPLES | ... | 97112 | Therapeutic procedure to re-educate brain-to-n... | N | 164.0 | 16 | 164 | 34.376646 | 45.0 | 26.94939 | 26.43878 |
| 4 | 1003011107 | STANDHART | PHILIP | L | MSPT, CSCS | M | I | 3841 E TAMIAMI TRL | | NAPLES | ... | 97140 | Manual (physical) therapy techniques to 1 or m... | N | 2030.0 | 199 | 2029 | 23.383468 | 60.0 | 18.179355 | 19.014739 |


# File formats are key

In [136]:
!ls -alh 2015_partB_miami.txt

-rw-r--r--  1 aaron.richter  staff    11M Jan 20  2018 2015_partB_miami.txt


In [140]:
df = pd.read_csv('2015_partB_miami.txt', sep='\t')
# Parquet is a compressed, columnar format; to_parquet needs pyarrow or fastparquet installed
df.to_parquet('2015_partB_miami.parquet')

In [141]:
!ls -alh 2015_partB_miami.parquet

-rw-r--r--  1 aaron.richter  staff   2.7M Jan  8 18:44 2015_partB_miami.parquet


# Use your cores

In [53]:
indexes = list(df.index)
len(indexes)

52396

Oh no! A for loop 😱

In [77]:
def super_complex_function(x):
    return len(df.loc[x]['hcpcs_code'])

In [88]:
%%time
out = []
for i in indexes:
    out.append(super_complex_function(i))

CPU times: user 8.24 s, sys: 137 ms, total: 8.38 s
Wall time: 8.3 s
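Before reaching for more cores, it is worth checking whether the loop can be replaced by a single column-wise call. This is a different technique from the one this section demonstrates, shown here only for comparison. A minimal sketch, using a made-up stand-in for `df`:

```python
import pandas as pd

# Stand-in for the notebook's df; the codes are strings, values are made up.
df_small = pd.DataFrame({"hcpcs_code": ["97001", "G0101", "97110"]})

# Row-by-row version, mirroring super_complex_function.
loop_out = [len(df_small.loc[i]["hcpcs_code"]) for i in df_small.index]

# Vectorized version: one call over the whole column.
vector_out = df_small["hcpcs_code"].str.len().tolist()

print(loop_out == vector_out)
```

This only works when the per-row function maps onto an existing pandas operation. For a function that genuinely cannot be vectorized, spreading the work across cores is still the way to go.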


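Let's try using multiple processes

In [91]:
import multiprocessing as mp

num_chunks = 10
num_processes = 4  # mp.Pool starts worker processes, not threads

In [92]:
%%time
# The context manager shuts the worker processes down when the block exits
with mp.Pool(num_processes) as pool:
    fast_out = pool.map(super_complex_function, indexes)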

CPU times: user 28.4 ms, sys: 50.5 ms, total: 78.9 ms
Wall time: 2.21 s


In [93]:
# pool.map returns results in input order, so the lists can be compared directly
out == fast_out

True
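The `num_chunks` setting above is never used. One way it might come into play, sketched under assumptions: split the data into a handful of chunks and let each worker process a whole chunk with one vectorized call. The helper names `split_into_chunks`, `chunk_lengths`, and `parallel_lengths` are hypothetical, and the codes are made-up stand-ins for `df["hcpcs_code"]`.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

# Stand-in data; the real notebook would use df["hcpcs_code"].
codes = pd.Series(["97001", "G0101", "97110", "97112", "97140", "99213"])


def split_into_chunks(values, num_chunks):
    """Split a Series into num_chunks roughly equal consecutive pieces."""
    bounds = np.linspace(0, len(values), num_chunks + 1).astype(int)
    return [values.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


def chunk_lengths(chunk):
    """Work done by one worker: handle a whole chunk in a single call."""
    return chunk.str.len().tolist()


def parallel_lengths(values, num_chunks, num_processes):
    """Map chunk_lengths over the chunks and flatten the results in order."""
    chunks = split_into_chunks(values, num_chunks)
    with mp.Pool(num_processes) as pool:
        results = pool.map(chunk_lengths, chunks)
    return [n for part in results for n in part]


if __name__ == "__main__":
    print(parallel_lengths(codes, num_chunks=3, num_processes=2))
```

The `if __name__ == "__main__":` guard matters when this runs as a script on platforms that start workers by re-importing the main module. Each chunk is pickled and sent to a worker, so this pays off only when the per-chunk work outweighs that transfer cost.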

# Sparse matrices

In [122]:
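# NOTE: despite the name, the cells hold service counts rather than 0/1 flags
# (pivot_table averages line_srvc_cnt when an npi/hcpcs_code pair repeats).
one_hot = (df
           .pivot_table(index=['npi'], columns='hcpcs_code', values='line_srvc_cnt')
           .reset_index()  # moves npi into the first column of the array
           .fillna(0)
           .values)
one_hot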

array([[  1.00301111e+09,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  1.00301711e+09,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  1.00302358e+09,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       ..., 
       [  1.99297706e+09,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  1.99299449e+09,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  1.99299752e+09,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]])

In [123]:
one_hot.shape, one_hot.shape[0] * one_hot.shape[1]

((5740, 2185), 12541900)

In [124]:
np.count_nonzero(one_hot)

56550
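Those two numbers say how empty the matrix is. A quick bit of arithmetic with the figures from the cells above shows that roughly 0.45% of the cells are nonzero, and that storing every cell as a float64 takes close to 100 MB:

```python
# Numbers copied from the cells above.
n_rows, n_cols = 5740, 2185
nonzero = 56550

total_cells = n_rows * n_cols
density = nonzero / total_cells

# A float64 value takes 8 bytes, so this is the size of the dense array's data.
dense_bytes = total_cells * 8

print(f"{density:.4%} of cells are nonzero")
print(f"dense array: {dense_bytes / 2**20:.1f} MiB")
```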

In [126]:
import scipy.sparse as sp

one_hot_sparse = sp.csc_matrix(one_hot)
one_hot_sparse

<5740x2185 sparse matrix of type '<class 'numpy.float64'>'
	with 56550 stored elements in Compressed Sparse Column format>
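Converting from a dense array means the full dense array has to exist in memory first. A minimal sketch of one way to avoid that, assuming the long-format frame is available: build the sparse matrix directly from (row, column, value) triples. The frame below is made up, and the result differs from the pivot above in two ways: it has no `npi` column, and repeated provider/code pairs are summed rather than averaged.

```python
import pandas as pd
import scipy.sparse as sp

# Stand-in rows; column names follow the notebook, values are made up.
df_small = pd.DataFrame({
    "npi": [111, 111, 222, 333],
    "hcpcs_code": ["97001", "97035", "97001", "99213"],
    "line_srvc_cnt": [217.0, 205.0, 10.0, 3.0],
})

# Map each provider and each code to a consecutive integer position.
row_pos, providers = pd.factorize(df_small["npi"])
col_pos, codes = pd.factorize(df_small["hcpcs_code"])

# Only the (row, column, value) triples are stored; zeros are implicit.
counts = sp.coo_matrix(
    (df_small["line_srvc_cnt"].to_numpy(), (row_pos, col_pos)),
    shape=(len(providers), len(codes)),
).tocsc()

print(counts.toarray())
```

`providers` and `codes` keep the labels for each row and column position, so a matrix index can be mapped back to an NPI or HCPCS code.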

In [130]:
np.save('dense.npy', one_hot)
sp.save_npz('sparse.npz', one_hot_sparse)

In [132]:
!ls -alh dense.npy

-rw-r--r--  1 aaron.richter  staff    96M Jan  8 18:36 dense.npy


In [133]:
!ls -alh sparse.npz

-rw-r--r--  1 aaron.richter  staff   186K Jan  8 18:36 sparse.npz
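The saving is not limited to disk. A CSC matrix keeps three arrays (`data`, `indices`, `indptr`), and their combined size is what the matrix costs in memory. `save_npz` additionally compresses the file by default. A minimal sketch on a made-up, mostly-zero matrix, including the round trip back from disk:

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

# Mostly-zero stand-in matrix (made-up values): 100 nonzeros in 500,000 cells.
dense = np.zeros((1000, 500))
dense[::100, ::50] = 1.0

sparse = sp.csc_matrix(dense)

# In-memory footprint: the dense buffer vs. the three CSC arrays.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)

# Round trip through disk and back.
path = os.path.join(tempfile.mkdtemp(), "sparse.npz")
sp.save_npz(path, sparse)
restored = sp.load_npz(path)
```

Calling `restored.toarray()` gives back the original dense array, so nothing is lost by storing only the nonzero entries.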
