
Shell script modifications to run remotely and add indexes #1

Open
jonbartels opened this issue May 9, 2016 · 1 comment
To use nppes-postgres in my environment I had to make some changes. I do not believe these changes are broadly useful enough for a pull request, so I am sharing them as a comment. Thank you @semerj for sharing the basic technique.

  1. Modified the shell script to run against a remote Postgres instance. Note: this is specific to my environment; it is not a good idea to do this remotely unless necessary. You have to transfer all 2.8GB of uncompressed NPPES data, and that takes some time even on a fast network.

  2. Added some indexes. Note that these are UPPER indexes. The npi column was indexed as both a numeric PK and as a varchar; my use case has me joining to another table where NPI data is stored as a varchar.

#!/bin/sh

# TODO: set this so that it just replicates the psql commands
# Usage: ./create_npi_db.sh [USERNAME] [DATABASE] [NPI-DATA] [TAXONOMY-DATA] [HOST]

# wget the NPPES full files to ./data
# wget the NUCC taxonomy full files to ./data

# Clean data (per Readme.md):
# replace empty "" integer fields in the NPI CSV file.
# Interesting fact: this takes the NPPES file from 5.5GB to 2.8GB.
sed 's/""//g' "$3" > cleaned_npi.csv

# Change taxonomy data per Readme.md:
# convert it to UTF-8 and the delimiter to tab (csvformat -T).
# csvformat took some extra effort on OS X: had to install pip, then csvkit.
iconv -c -t utf8 "$4" | csvformat -T > taxonomy.tab

psql -U "$1" -d "$2" -h "$5" -f nppes_tables.sql

# Had to change this to pipe the file in on stdin: when you use psql and COPY
# remotely, COPY runs on that server, so the local file isn't readable there.
cat taxonomy.tab | psql -U "$1" -d "$2" -h "$5" -c "COPY taxonomy FROM STDIN WITH CSV HEADER DELIMITER AS E'\t';"

# Doing this remotely is SLOW!
cat cleaned_npi.csv | psql -U "$1" -d "$2" -h "$5" -c "COPY npi FROM STDIN WITH CSV HEADER DELIMITER AS ',' NULL AS '';"

psql -U "$1" -d "$2" -h "$5" -f nppes_indexes.sql
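As a quick sanity check of the `sed` cleanup step at the top of the script, here is what it does to a single fabricated row (the values below are made up and stand in for real NPPES columns):

```shell
# Fabricated one-line CSV: the second field is an empty quoted string (""),
# which sed strips so Postgres can treat it as NULL via the NULL AS '' option.
printf '1234567890,"",1,"SMITH"\n' > sample_npi.csv
sed 's/""//g' sample_npi.csv
# -> 1234567890,,1,"SMITH"
rm sample_npi.csv
```

Note this also strips escaped double quotes (`""` inside a quoted field), which is why the Readme's caveat about the cleanup only applies to empty integer fields matters.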

--constraints
--make npi PK
ALTER TABLE npi ADD CONSTRAINT npi_pkey PRIMARY KEY (npi);

--indexes on npi
--npi is implicit because it is a PK 
--index npi as a varchar since other systems represent npis as strings
CREATE INDEX npi_npi_as_string ON npi (cast(npi as varchar));
--replacement npi
CREATE INDEX npi_replacement ON npi (replacement_npi);
--last name (upper)
CREATE INDEX npi_upper_last ON npi (upper(provider_last_name));
--first and last name (composite, upper)
CREATE INDEX npi_upper_first_last ON npi (upper(provider_last_name), upper(provider_first_name));
--entity_type
CREATE INDEX npi_entity_type ON npi (entity_type_code);
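
For reference, queries only use these expression indexes when they repeat the indexed expression exactly. A sketch of lookups that match the indexes above (the literal values and the joined table are hypothetical):

```sql
-- Matches npi_upper_first_last: the WHERE expressions mirror the
-- upper(...) expressions in the index definition.
SELECT npi, provider_last_name, provider_first_name
FROM npi
WHERE upper(provider_last_name) = 'SMITH'
  AND upper(provider_first_name) = 'JANE';

-- Matches npi_npi_as_string when joining to a system that stores
-- NPIs as strings (other_system is a made-up table name):
SELECT o.*
FROM other_system o
JOIN npi n ON o.npi_string = cast(n.npi AS varchar);
```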
semerj (Owner) commented May 11, 2016

@jonbartels thanks for the suggestions!
