Scrape unit counts for NYC rent stabilized apts from tax bills

NYC Stabilization Unit Counts

Liberate NYC DOF tax documents to a machine readable format.

See why here.

Grab the latest parsed data here.

Parsed data is licensed CC BY-SA. See DATALICENSE-CC-BY-SA.html for details.

You are free to:

  • Share — copy and redistribute the material in any medium or format
  • Adapt — remix, transform, and build upon the material for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

  • No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Installation

You'll need the following:

  • Linux, BSD, or Mac OS X machine
  • Python 2.6 or greater (not tested on 3)
  • virtualenv or virtualenvwrapper (in order to install requirements without using sudo)

Install the following system packages (on Debian):

sudo apt-get install python-dev python-pip python-virtualenv \
                     build-essential libxml2-dev libxslt1-dev xpdf

Or on Mac (with Homebrew):

brew install python pyenv-virtualenv libxml2 xpdf

Then create a virtualenv and install the requirements in it:

virtualenv .env
source .env/bin/activate
pip install -r requirements.txt

Developer Usage

To download all documents for a single address:

python download.py <house no> '<street name with suffix>' <borough number>

Make sure to put the street name in single quotes.

To download documents for multiple addresses:

  1. Create a tab-separated file (e.g. addresses.tsv) containing the house number, street name with suffix, and borough number. Put each address on its own line.

  2. Then run the download in the background:

    python download.py /path/to/addresses.tsv >/path/to/log.txt 2>&1 &
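The address file can be produced by hand or with a short script. The following is a minimal sketch, not part of this repository: the `write_addresses_tsv` helper and the two addresses are hypothetical, and the column order (house number, street name with suffix, borough number) simply follows the description in step 1. It is written for Python 3.

```python
import csv

# Hypothetical example addresses:
# (house number, street name with suffix, borough number).
ADDRESSES = [
    ("123", "MAIN STREET", "1"),
    ("45-10", "QUEENS BOULEVARD", "4"),
]


def write_addresses_tsv(path, addresses):
    """Write one tab-separated address per line (illustrative helper)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(addresses)
```

Calling `write_addresses_tsv("addresses.tsv", ADDRESSES)` would then give a file that can be passed to download.py as shown above.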

To download a single tax bill for many BBLs:

python download_direct.py YYYYMMDD /path/to/bbls.csv >/path/to/log.txt 2>&1 &

To parse the raw data into a CSV

You'll probably want to background this too, as it takes a while. The PDF bills are converted to .txt files using pdftotext (part of xpdf).

python parse.py /path/to/input >/path/to/output.csv 2>/path/to/log.txt &

The structure of the CSV is as follows, including types:

bbl     activityThrough  section  key   dueDate  activityDate  value  meta  apts
BIGINT  DATE             TEXT     TEXT  DATE     DATE          TEXT   TEXT  INT

Since so much disparate information is recorded in tax bills, this raw output is something like a key-value store, with additional metadata to identify what building and tax period the data applies to.

  • bbl: Lot identifier.
  • activityThrough: Date of the tax bill.
  • section: The section of the bill in which the charge appeared. For example, Previous Balance, Tax Year Charges Remaining, Current Charges, etc.
  • key: The type of data on this line. For example, Owner name, Housing-Rent stabilization, Health-Extermination, etc.
  • dueDate: If this is a charge, when it is due. Not very accurate except for the rent stabilization lines.
  • activityDate: If this is a payment, when it was made. Not very accurate.
  • value: The value of the line. If the key was Owner name this would be their actual name; if it was Health-Extermination it would be the charge for the extermination, etc.
  • meta: Additional metadata recorded on the line. For rent stabilization, this is the registration number. For some payments it is the bank that actually made the payment.
  • apts: Only for rent stabilization lines, this is the stabilized unit count.
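Because the raw output behaves like a key-value store, a common first step is to pull out only the rent stabilization lines. The sketch below is illustrative and not part of this repository. It assumes the CSV starts with a header row and that the key text contains "rent stabilization" (as in the `Housing-Rent stabilization` example above); the `SAMPLE` rows, including their date format and values, are made up. It is written for Python 3.

```python
import csv
import io

# Column order as documented above.
COLUMNS = ["bbl", "activityThrough", "section", "key", "dueDate",
           "activityDate", "value", "meta", "apts"]

# Made-up sample rows, for illustration only.
SAMPLE = (
    "bbl,activityThrough,section,key,dueDate,activityDate,value,meta,apts\n"
    "1000010001,2014-06-06,Current Charges,Housing-Rent stabilization,"
    "2014-07-01,,120.00,12345600,12\n"
    "1000010001,2014-06-06,Current Charges,Health-Extermination,"
    "2014-07-01,,85.00,,\n"
)


def stabilization_rows(handle, has_header=True):
    """Yield dict rows whose key mentions rent stabilization."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the (assumed) header row
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        if "rent stabilization" in row.get("key", "").lower():
            yield row


for row in stabilization_rows(io.StringIO(SAMPLE)):
    print(row["bbl"], row["activityThrough"], row["apts"])
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper, and set `has_header=False` if the output turns out to have no header row.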

To import the CSV into postgres

You should have docker4data installed and set up on your system.

./reparse.sh

This will directly parse the data folder into docker4data's postgres.

Data Usage

You can see all data here.

Downloading is complete. This means all 6+ unit buildings, in addition to all buildings on DHCR's stabilized buildings list, are available and parsed.

Folder scheme for bills: data/<borough>/<block>/<lot>/
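A BBL is a 10-digit identifier made of a 1-digit borough, a 5-digit block, and a 4-digit lot, so the folder for a given lot can be derived from it. The helpers below are a hypothetical sketch, not code from this repository, and they assume the folder names are not zero-padded; check the actual data folder before relying on that.

```python
import os


def split_bbl(bbl):
    """Split a 10-digit BBL into (borough, block, lot) integers."""
    text = str(bbl)
    if len(text) != 10 or not text.isdigit():
        raise ValueError("expected a 10-digit BBL, got %r" % (bbl,))
    return int(text[0]), int(text[1:6]), int(text[6:])


def bill_folder(bbl, root="data"):
    """Build data/<borough>/<block>/<lot> (assumes unpadded folder names)."""
    borough, block, lot = split_bbl(bbl)
    return os.path.join(root, str(borough), str(block), str(lot))
```

For example, `bill_folder("3012340056")` resolves to the `data/3/1234/56` folder under these assumptions.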

All PDFs are converted to their textual representations in the same folder.

A crosstab CSV with unit counts and abatements 2007-2014

Probably the most useful file for journalists or data-minded community advocates. This file has a row for every possibly stabilized building in New York. There could be stabilized buildings not on the list, but it is unlikely. Any building with 6 or more units as well as any building that was ever on HCR's own list of stabilized buildings was scraped. Buildings are aggregated by BBL.

  • borough: Borough of this lot.
  • ucbbl: The BBL.
  • 2007uc: The unit count in 2007. This is based off of the rent stabilization surcharge dated "4/1/2007", which appears in tax bills starting 2008. The parser sums these counts when a single tax bill includes multiple buildings, but is careful not to double-count if previous years' surcharges reappear.
  • 2007est: Whether or not this is an estimated unit count. As registration is voluntary, it is common for a building to miss a year, or even several. See the section Caveats below for information about how estimates are derived.
  • 2007dhcr: Whether the building appeared on DHCR's list that year. Blank if DHCR did not publish a list for that year.
  • 2007abat: A list of all abatements and exemptions claimed on that year's tax bill. This includes 421a, J51, 420C (LIHTC), SCRIE, DRIE, and several others.
  • These columns repeat for every year up to and including 2014
  • cd: The community district, from PLUTO. All remaining columns are from PLUTO.
  • ct2010: Census tract in 2010 census.
  • cb2010: Census block in 2010 census.
  • council: The city council district.
  • zipcode: The zip code.
  • address: An address for the lot, although it could have several.
  • ownername: The name of the lot's owner. Oftentimes just an LLC.
  • numbldgs: The number of buildings on the lot.
  • numfloors: The approximate number of floors on the lot's buildings.
  • unitsres: An approximate number of residential units in the lot's buildings.
  • unitstotal: An approximate number of residential & commercial units in the lot's buildings.
  • yearbuilt: An approximate year built, not particularly accurate. Especially poor quality in older buildings.
  • condono: The condo number, which links together different lots into a single condo development.
  • lon: The lot's centerpoint longitude.
  • lat: The lot's centerpoint latitude.

A CSV as above but with a separate row for each year

The columns are the same as before, except instead of having separate columns for each year of observation, there is a separate row.

This would be more useful for making a time-based map or doing statistical analysis where the year column can be fed in as a proper dimension.
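The relationship between the two layouts can be sketched as a wide-to-long reshape. This is an illustration only, not the code that produced the published files: the output field names `year`, `uc`, `est`, `dhcr`, and `abat` are assumptions, and the real per-year CSV may name its columns differently.

```python
import re

# Matches per-year crosstab columns such as 2007uc, 2007est, 2007dhcr, 2007abat.
YEAR_COLUMN = re.compile(r"^(\d{4})(uc|est|dhcr|abat)$")


def wide_to_long(row):
    """Turn one crosstab row (a dict) into a list of per-year dicts."""
    shared = {}
    by_year = {}
    for name, value in row.items():
        match = YEAR_COLUMN.match(name)
        if match:
            year, field = int(match.group(1)), match.group(2)
            by_year.setdefault(year, {})[field] = value
        else:
            shared[name] = value  # ucbbl, PLUTO columns, etc.
    records = []
    for year in sorted(by_year):
        record = dict(shared)
        record["year"] = year
        record.update(by_year[year])
        records.append(record)
    return records
```

Each row read with `csv.DictReader` from the crosstab file could be passed through `wide_to_long` to get one record per building per year.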

A summary of building changes over the seven-year span.

This is the table that underlies the map.

  • ucbbl: The BBL.
  • unitstotal: An estimate of the number of units in the building. This is the greatest of PLUTO's unitstotal, unitsres, or the highest stabilized unit count ever recorded on this BBL's tax bills.
  • unitsstab2007: The number of stabilized units in 2007.
  • unitsstab2014: The number of stabilized units in 2014.
  • diff: The number of stabilized units gained or lost between 2007 and 2014.
  • percentchange: The percentage increase or loss. The denominator for this calculation is the greatest of unitsres, unitstotal, or the greatest number of stabilized units reported on a tax bill.
  • j51: Start and end year of any J51 abatement. Earliest start possible is 2009.
  • 421a: Start and end year of any 421-a abatement. Earliest start possible is 2009.
  • scrie: Start and end year of any SCRIE abatement. Earliest start possible is 2009.
  • drie: Start and end year of any DRIE abatement. Earliest start possible is 2009.
  • 420c: Start and end year of any 420C (LIHTC) abatement. Earliest start possible is 2009.
  • All remaining columns are from PLUTO as above.
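The `diff` and `percentchange` columns can be reproduced roughly as follows. This is a sketch based on the column descriptions above, not the actual SQL used for the table; the function name is hypothetical, and expressing the result on a 0-100 scale is an assumption.

```python
def stabilization_change(unitsstab2007, unitsstab2014,
                         unitsres, unitstotal, max_stab_ever):
    """Return (diff, percentchange) for one BBL.

    The denominator is the greatest of unitsres, unitstotal, and the
    highest stabilized unit count ever reported on a tax bill.
    """
    diff = unitsstab2014 - unitsstab2007
    denominator = max(unitsres, unitstotal, max_stab_ever)
    if denominator == 0:
        return diff, None  # no sensible percentage without any units
    return diff, 100.0 * diff / denominator


# Worked example: 20 stabilized units in 2007, 15 in 2014, in a lot where
# PLUTO reports unitsres=24 and unitstotal=25.
print(stabilization_change(20, 15, 24, 25, 20))  # (-5, -20.0)
```

Using the largest available unit count as the denominator keeps the percentage from exceeding 100% when PLUTO undercounts the units in a building.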

Borough/CD summary tables

These are simple breakdowns of changes over the seven-year period by borough and community district.

Estimates of income and expense

Every year, Finance estimates the earnings and expenses of rentals as part of the assessment. For larger (10+ unit) buildings, these estimates are based upon real earnings and expense data filed by the landlord.

This table is simply an extract and simplification of the raw data.

  • bbl: The BBL of the property
  • activityThrough: The date of the bill.
  • key: Whether this is an estimate of income or expense.
  • value: The dollar amount of the estimate.

Caveats

The combination of self-reported stabilization counts and occasionally missing tax bills means that a significant percentage of buildings are missing counts for some years.

In order to compensate, all output files contain some estimated counts, marked as such in the <YYYY>est columns described above. You can exclude these estimates in your own aggregations by replacing those unit counts with 0.
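Excluding estimates from a citywide total might look like the sketch below. It is illustrative only: the helper names are hypothetical, and how the <YYYY>est flag is encoded in the CSV is an assumption (any of `1`, `t`, `true`, `y`, `yes` is treated as "estimated" here), so check the file before using it.

```python
def is_estimate(flag):
    """Guess whether a <YYYY>est cell marks an estimate (assumed encodings)."""
    return str(flag or "").strip().lower() in ("1", "t", "true", "y", "yes")


def total_units(rows, year, include_estimates=True):
    """Sum <year>uc over crosstab rows, optionally zeroing estimated counts."""
    uc_col = "%duc" % year
    est_col = "%dest" % year
    total = 0
    for row in rows:
        count = row.get(uc_col) or 0  # blank cells count as 0
        if not include_estimates and is_estimate(row.get(est_col)):
            count = 0
        total += int(count)
    return total
```

Comparing `total_units(rows, 2008)` with `total_units(rows, 2008, include_estimates=False)` gives a sense of how much of a year's total rests on estimates.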

If there is no stabilized unit count for a building that had one the previous year, the previous year's number is used in any of the following cases:

  • The bill without a unit count had a SCRIE or DRIE abatement, indicating the continued presence of regulated units.
  • The bill without a unit count maintained the same abatements as the previous year (for example 421a or J51) indicating that restrictions mandating affordability remained in effect.
  • The building appeared on HCR's stabilized building list for the year without a unit count, indicating that it was in fact still stabilized.

After working forwards through the years with the above criteria, the same criteria are applied going backwards. For example, if a building reported no units in 2008 but had a SCRIE or DRIE abatement in effect, the count from 2009 is used if it is available.
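The forward and backward passes can be sketched as follows. This is an interpretation of the rules above, not the code that generated the published files. The record layout (`uc`, `abat`, `dhcr`, `est`) and function names are hypothetical, and three details are assumptions: "same abatements" requires a non-empty set, estimated counts can carry on to later years, and the backward pass compares a bill against the following year's bill.

```python
def qualifies(bill, neighbor):
    """Apply the three criteria for reusing a neighboring year's count."""
    abatements = bill["abat"]
    if abatements & {"SCRIE", "DRIE"}:
        return True  # SCRIE/DRIE implies regulated units remain
    if abatements and abatements == neighbor["abat"]:
        return True  # same abatements (e.g. 421a, J51) still in effect
    return bool(bill["dhcr"])  # on the stabilized buildings list that year


def _fill_pass(bills, indexes, step):
    """Copy counts from the bill `step` positions earlier in the list."""
    for i in indexes:
        bill, neighbor = bills[i], bills[i - step]
        if (bill["uc"] is None and neighbor["uc"] is not None
                and qualifies(bill, neighbor)):
            bill["uc"] = neighbor["uc"]
            bill["est"] = True


def fill_estimates(bills):
    """Return copies of year-sorted bills with missing counts estimated."""
    filled = [dict(bill, est=False) for bill in bills]
    n = len(filled)
    _fill_pass(filled, range(1, n), 1)            # forwards: previous year
    _fill_pass(filled, range(n - 2, -1, -1), -1)  # backwards: following year
    return filled
```

With a 2008 bill that has no count but a SCRIE abatement, followed by a 2009 bill reporting 12 units, `fill_estimates` leaves 2008 with 12 units and `est` set to `True`, matching the example in the text.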