
Shell script modifications to run remotely and add indexes #1

Open
jonbartels opened this issue May 9, 2016 · 1 comment
To use nppes-postgres in my environment I had to make some changes. I do not believe these changes are broadly useful enough for a pull request, so I am sharing them as a comment. Thank you @semerj for sharing the basic technique.

  1. Modified the shell script to run against a remote Postgres instance. Note: this is specific to my environment; it is not a good idea to do this remotely unless necessary. You have to transfer all 2.8GB of uncompressed NPPES data, and that takes some time even on a fast network.

  2. Added some indexes. Note that these are UPPER indexes. The npi column was indexed as both a numeric PK and as a varchar; my use case has me joining to another table where NPI data is stored as a varchar.

#!/bin/sh

# TODO: set this so that it just replicates the psql commands
# Usage: ./create_npi_db.sh [USERNAME] [DATABASE] [NPI-DATA] [TAXONOMY-DATA] [HOST]

# wget the NPPES full files to ./data
# wget the NUCC taxonomy full files to ./data

# Clean data (per Readme.md):
# replace empty "" integer fields in the NPI CSV file.
# Interesting fact: this takes the NPPES file from 5.5GB to 2.8GB.
sed 's/""//g' "$3" > cleaned_npi.csv

# Change taxonomy data per Readme.md:
# convert it to UTF-8 and the delimiter to tab (csvformat -T).
# csvformat took some extra effort on OS X: had to install pip, then csvkit.
iconv -c -t utf8 "$4" | csvformat -T > taxonomy.tab

psql -U "$1" -d "$2" -h "$5" -f nppes_tables.sql

# Had to change this to pipe the file in on stdin: when you use psql and COPY
# remotely, COPY runs on that server, so the local file isn't readable there.
cat taxonomy.tab | psql -U "$1" -d "$2" -h "$5" -c "COPY taxonomy FROM STDIN WITH CSV HEADER DELIMITER AS E'\t';"

# Doing this remotely is SLOW!
cat cleaned_npi.csv | psql -U "$1" -d "$2" -h "$5" -c "COPY npi FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS '';"

psql -U "$1" -d "$2" -h "$5" -f nppes_indexes.sql
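As a quick sanity check of the `sed` cleanup step at the top of the script, here is what it does to a single fabricated row (the values below are made up and stand in for real NPPES columns):

```shell
# Fabricated one-line CSV: the second field is an empty quoted string (""),
# which sed strips so Postgres can treat it as NULL via the NULL AS '' option.
printf '1234567890,"",1,"SMITH"\n' > sample_npi.csv
sed 's/""//g' sample_npi.csv
# -> 1234567890,,1,"SMITH"
rm sample_npi.csv
```

Note this also strips escaped double quotes (`""` inside a quoted field), which is why the Readme's caveat about the cleanup only applies to empty integer fields matters.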

--constraints
--make npi PK
ALTER TABLE npi ADD CONSTRAINT npi_pkey PRIMARY KEY (npi);

--indexes on npi
--npi is implicit because it is a PK 
--index npi as a varchar since other systems represent npis as strings
CREATE INDEX npi_npi_as_string ON npi (cast(npi as varchar));
--replacement npi
CREATE INDEX npi_replacement ON npi (replacement_npi);
--last name (upper)
CREATE INDEX npi_upper_last ON npi (upper(provider_last_name));
--first and last name (composite, upper)
CREATE INDEX npi_upper_first_last ON npi (upper(provider_last_name), upper(provider_first_name));
--entity_type
CREATE INDEX npi_entity_type ON npi (entity_type_code);
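
For reference, queries only use these expression indexes when they repeat the indexed expression exactly. A sketch of lookups that match the indexes above (the literal values and the joined table are hypothetical):

```sql
-- Matches npi_upper_first_last: the WHERE expressions mirror the
-- upper(...) expressions in the index definition.
SELECT npi, provider_last_name, provider_first_name
FROM npi
WHERE upper(provider_last_name) = 'SMITH'
  AND upper(provider_first_name) = 'JANE';

-- Matches npi_npi_as_string when joining to a system that stores
-- NPIs as strings (other_system is a made-up table name):
SELECT o.*
FROM other_system o
JOIN npi n ON o.npi_string = cast(n.npi AS varchar);
```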
semerj (Owner) commented May 11, 2016

@jonbartels thanks for the suggestions!
