I download the contents of Socrata portals to a filesystem and then upload them to S3.
Set the portal URLs and the S3 bucket.

export SOCRATA_URLS=( data.cityofnewyork.us )
export SOCRATA_S3_BUCKET=socrata.appgen.me
If you are using Proxy Rack to get around API limits, set the wget proxy parameters.
export http_proxy=
Then run the main script.
./run.sh
This runs ./portals.py to get the list of all portals from socrata.com, and then runs ./run_all.sh (see below) for each of the portals. API limits appear to apply across all of Socrata, not just within a single data portal, so the different portals are simply run in series.
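A minimal sketch of that outer loop, assuming ./portals.py prints one portal hostname per line; the scripts' actual interface may differ.

```sh
# Hypothetical outer loop: one portal at a time, in series, because API
# limits appear to apply across all of Socrata rather than per portal.
./portals.py | while read -r portal; do
  export SOCRATA_URL="$portal"
  ./run_all.sh
done
```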
To download a single portal by itself, set the parameters.
SOCRATA_URL=data.cityofnewyork.us
SOCRATA_S3_BUCKET=socrata.appgen.me
Then run the main script.
./run_one.sh
The result will be the following file structure, both locally and in the bucket.
data.cityofnewyork.us/
  searches/
    1
    2
    ...
  views/
    abcd-efgh
    ijkl-mnop
    ...
  rows/
    abcd-efgh
    ijkl-mnop
    ...
When you run ./run_all.sh, the following things happen in order.
./search.sh searches/browses through all of the datasets/maps/views/&c. and saves all of the files as $SOCRATA_URL/searches/$page_number.
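A rough sketch of what a paged search download like that could look like; the legacy /api/search/views.json endpoint and the stopping rule are assumptions, not necessarily what ./search.sh does.

```sh
# Hypothetical paged search download; endpoint and stopping rule are assumptions.
mkdir -p "$SOCRATA_URL/searches"
page=1
while true; do
  out="$SOCRATA_URL/searches/$page"
  wget -O "$out" "https://$SOCRATA_URL/api/search/views.json?page=$page"
  # Stop once a page contains no 4x4 view ids.
  grep -qE '[a-z0-9]{4}-[a-z0-9]{4}' "$out" || break
  page=$((page + 1))
done
```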
./viewids.py returns all of the 4x4 Socrata view id codes from the files in $SOCRATA_URL/searches.
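The extraction can be approximated from the shell, since Socrata "4x4" ids are two groups of four lowercase letters and digits (e.g. abcd-efgh); ./viewids.py may of course be stricter than this.

```sh
# Shell approximation of the 4x4 id extraction done by ./viewids.py.
grep -ohE '[a-z0-9]{4}-[a-z0-9]{4}' "$SOCRATA_URL"/searches/* | sort -u
```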
./views.sh downloads the metadata files for each of the viewids and saves all of the files as $SOCRATA_URL/views/$viewid.
./rows.sh downloads the data files for each of the viewids as CSV and saves all of the files as $SOCRATA_URL/rows/$viewid.
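Per view id, those two downloads boil down to something like the following. The viewids.txt file stands in for the output of ./viewids.py, and the URLs are the standard Socrata metadata and CSV-export endpoints; the scripts may build their requests differently.

```sh
# Hypothetical per-view downloads; viewids.txt holds one 4x4 id per line.
mkdir -p "$SOCRATA_URL/views" "$SOCRATA_URL/rows"
while read -r viewid; do
  wget -O "$SOCRATA_URL/views/$viewid" "https://$SOCRATA_URL/api/views/$viewid.json"
  wget -O "$SOCRATA_URL/rows/$viewid" "https://$SOCRATA_URL/api/views/$viewid/rows.csv?accessType=DOWNLOAD"
done < viewids.txt
```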
./builddb.py makes a SQLite3 database with one row per dataset, using features from the view and row files. It contains one table, called datasets. One of the columns is named socrata.url, so unioning it with databases from other portals will be easy. The database is named $SOCRATA_URL/features.db.
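Because every row carries its portal in the socrata.url column, combining two portals' databases is a single union. This sketch uses the sqlite3 command-line shell; the second portal is only an illustration, and it assumes ./builddb.py produced the same columns for both databases.

```sh
# Hypothetical union of two portals' feature databases.
sqlite3 combined.db <<'SQL'
ATTACH 'data.cityofnewyork.us/features.db' AS a;
ATTACH 'data.seattle.gov/features.db' AS b;
CREATE TABLE datasets AS
  SELECT * FROM a.datasets
  UNION ALL
  SELECT * FROM b.datasets;
SQL
```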
./s3-upload.sh uploads all of the downloaded view metadata files to an S3 bucket, compressing them first.

./s3-download.sh downloads and decompresses all of the files in the S3 bucket.
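I have not spelled out which S3 tool the scripts use; a minimal sketch of the two directions with gzip and the AWS CLI would look roughly like this, though the real scripts may compress or upload differently.

```sh
# Hypothetical upload: gzip copies of the downloaded files, then sync the
# .gz files into the bucket.
find "$SOCRATA_URL" -type f ! -name '*.gz' -exec gzip -k {} \;
aws s3 sync "$SOCRATA_URL" "s3://$SOCRATA_S3_BUCKET/$SOCRATA_URL" --exclude '*' --include '*.gz'

# Reverse direction, roughly what ./s3-download.sh describes.
aws s3 sync "s3://$SOCRATA_S3_BUCKET/$SOCRATA_URL" "$SOCRATA_URL"
find "$SOCRATA_URL" -name '*.gz' -exec gunzip -k {} \;
```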
catalogs.sh downloads the catalogs from the /data.json endpoint of each portal.
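A sketch of what that amounts to, assuming the portal list comes from ./portals.py as above; catalogs.sh may obtain the list differently.

```sh
# Hypothetical catalog download: one data.json per portal.
./portals.py | while read -r portal; do
  mkdir -p "$portal"
  wget -O "$portal/data.json" "https://$portal/data.json"
done
```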