# Scrape Property Tax Data

Using the NYC Department of Finance website, download property tax data for all buildings in Assembly District 36 that fall under the 421A Tax Abatement program.

Each tax class has a different property tax form, so we pull each tax class individually.

Tax classes are:
- 1
- 2
- 2A
- 2B




The initial pull covers only the most recent Q1 tax documents (June 2019 through June 2021).

In [265]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
from urllib.request import urlretrieve
import json
import tabula as tb
from datetime import datetime
import re

## Tax Class 2

### Read Building Data

In [266]:
df_buildings = pd.read_csv("ad36_421Properties_TaxClass2.csv")
df_buildings = df_buildings.astype({'BBL': 'str'})

In [267]:
print("Total Number of buildings under this class: {}".format(df_buildings.shape[0]))
df_buildings.head()

Total Number of buildings under this class: 71


Unnamed: 0.1,Unnamed: 0,the_geom,bin,cnstrct_yr,lstmoddate,lststatype,doitt_id,heightroof,feat_code,groundelev,...,BUILDING CLASS,ADDRESS,ZIP CODE,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,BBL
0,0,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4006371,2009.0,2017-08-22T00:00:00.000Z,Constructed,1113607,100.19426,2100,10.0,...,D6,21-16 31 AVENUE,11106.0,32,1,33,8550,33500,2009,4005520022
1,7,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4005656,2008.0,2017-08-22T00:00:00.000Z,Constructed,1109509,53.890919,2100,10.0,...,D1,12-26 30 AVENUE,11102.0,37,0,37,13575,39745,2008,4005150031
2,11,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4005897,2010.0,2017-08-22T00:00:00.000Z,Constructed,1259711,63.0,2100,11.0,...,D1,14-34 31 AVENUE,11106.0,14,0,14,4726,12945,2010,4005330037
3,14,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4019881,2008.0,2017-08-22T00:00:00.000Z,Constructed,1108121,59.0,2100,30.0,...,D1,18-05 27 AVENUE,11102.0,11,0,11,2640,15900,2008,4008850016
4,15,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4540411,2012.0,2017-08-22T00:00:00.000Z,Constructed,1256175,64.0,2100,15.0,...,D1,23-12 30 DRIVE,11102.0,20,0,20,7512,16000,2012,4005700030


In [268]:
#get BBL values into a list
building_bbl_list = list(df_buildings['BBL'])
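
For reference, a BBL is the 10-digit Borough-Block-Lot identifier: one digit for the borough, five for the block, and four for the lot. A minimal sketch of splitting one apart, assuming the values are already 10-character strings as in the list above (`split_bbl` is a hypothetical helper, not part of the original notebook):

```python
def split_bbl(bbl):
    """Split a 10-digit BBL string into (borough, block, lot) integers."""
    bbl = str(bbl).strip()
    if len(bbl) != 10 or not bbl.isdigit():
        raise ValueError("Expected a 10-digit BBL, got {!r}".format(bbl))
    # Layout: 1 digit borough, 5 digits block, 4 digits lot
    return int(bbl[0]), int(bbl[1:6]), int(bbl[6:])

# First BBL from the buildings preview above
print(split_bbl("4005520022"))  # prints (4, 552, 22)
```

A check like this can catch malformed identifiers before any statements are requested.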

### Extract Property Data

For each BBL, we download the property tax PDF and extract the relevant 421A data.

The results are then combined into a single dataframe and merged with the buildings dataframe above.

Link to download PDF: https://a836-edms.nyc.gov/dctm-rest/repositories/dofedmspts/StatementSearch?bbl=[BBL VALUE HERE]&stmtDate=20211120&stmtType=SOA 
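
The link pattern above can be filled in programmatically. A small sketch using only the standard library; the endpoint and parameter names come from the link, while `build_statement_url` is a hypothetical helper name:

```python
from urllib.parse import urlencode

# Base endpoint taken from the link above
STATEMENT_SEARCH_URL = "https://a836-edms.nyc.gov/dctm-rest/repositories/dofedmspts/StatementSearch"

def build_statement_url(bbl, stmt_date, stmt_type="SOA"):
    """Return the statement download URL for one BBL and statement date (YYYYMMDD)."""
    query = urlencode({"bbl": bbl, "stmtDate": stmt_date, "stmtType": stmt_type})
    return "{}?{}".format(STATEMENT_SEARCH_URL, query)

print(build_statement_url("4005520022", "20210605"))
```

Building the query with `urlencode` keeps the parameters escaped correctly if a value ever contains unexpected characters.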

PDF Process:

- To pull tax information (including rent-regulated units), we need to parse the actual PDF tax documents. These are all available through the NYC Dept of Finance website.
- For each building, we download the tax document for each of the past 3 years (2019, 2020, 2021). Reading the PDF directly from the URL runs into encoding issues, so each file is saved locally first.
- After downloading the tax document, we have to find *where* the unit information is and gather the number of rent-regulated units reported that year.
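
The search step can be illustrated without Tabula. A minimal sketch, assuming the page has already been extracted into a list of rows (each a list of cell strings); the sample rows are made up to mimic the layout and `find_units_row` is a hypothetical helper. Trailing asterisks are stripped before comparing so that a footnote marker on the label does not break the lookup:

```python
def find_units_row(rows, label="Rent Stabilization"):
    """Return the row directly below the header row whose first cell equals `label`.

    Trailing asterisks and spaces on the header cell are ignored.
    Returns None when no header row is found or nothing follows it.
    """
    for idx, row in enumerate(rows):
        first_cell = str(row[0]).strip().rstrip("* ") if row else ""
        if first_cell == label:
            return rows[idx + 1] if idx + 1 < len(rows) else None
    return None

# Made-up rows that only mimic the layout described above
sample_rows = [
    ["Billing Summary", "", ""],
    ["Rent Stabilization*", "# Apts", "Due Date"],
    ["Rent Stabilization Fee- Chg", "30", "01/01/2022"],
]
print(find_units_row(sample_rows))
```

Returning `None` instead of raising makes it easy for the caller to decide whether to fall back to another search or flag the document for manual review.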

This is not a perfect process! There is no guarantee that the PDF format is the same for every building, so values may be missing for some of them. For these buildings, we need to go into the files after the fact and update the values manually.
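
One way to make the manual follow-up easier is to record which documents could not be parsed. A minimal sketch, assuming a hypothetical `needs_review` list and `flag_for_review` helper that are not in the original notebook:

```python
import csv
import io

# Collected (BBL, date, reason) entries for documents that need a manual look
needs_review = []

def flag_for_review(bbl, tax_form_date, reason):
    """Remember a document whose values may need to be corrected by hand."""
    needs_review.append({"BBL": str(bbl), "Date": tax_form_date, "Reason": reason})

def review_list_as_csv(entries):
    """Render the review entries as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["BBL", "Date", "Reason"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()

flag_for_review("4005310015", "20210605", "No Rent Stabilized Units")
print(review_list_as_csv(needs_review))
```

Calling `flag_for_review` next to each fallback message would leave a checklist of BBLs to revisit, instead of relying on the printed log.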

In [269]:
#list to include all the rent stabilization data
property_tax_list = []

In [270]:
#Tax Dates for this type of tax document. Anything before 2019 is a different format
tax_dates = ['20210605', '20200606', '20190605']
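
Each entry is a statement date in `YYYYMMDD` form. A short sketch of turning these strings into dates and tax years with the standard library, which can serve as a sanity check on the list before any downloads start:

```python
from datetime import datetime

tax_dates = ['20210605', '20200606', '20190605']

# Parse each YYYYMMDD string; strptime raises ValueError on a malformed entry
parsed_dates = [datetime.strptime(d, '%Y%m%d').date() for d in tax_dates]
tax_years = [d.year for d in parsed_dates]

print(tax_years)  # prints [2021, 2020, 2019]
```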

In [271]:
#Loop through the tax dates and the building list and pull each tax document
for tax_form_date in tax_dates:

    for i in range(len(building_bbl_list)):

        print("BBL: {} Tax Date: {}".format(building_bbl_list[i], tax_form_date))

        #Download tax document PDF and save as "property_tax.pdf"
        url_loc = "https://a836-edms.nyc.gov/dctm-rest/repositories/dofedmspts/StatementSearch?bbl={}&stmtDate={}&stmtType=SOA".format(building_bbl_list[i], tax_form_date)
        urlretrieve(url_loc, "property_tax.pdf")


        #On these documents (tax class 2), the relevant tax info is usually on page 2. However, the format is not always the same, so this section tries two ways to get the data.
        try:
            file = 'property_tax.pdf'
            #Using Tabula, we have to identify the area of the page we are trying to read and the column widths for the data. 
            df_pdf= tb.read_pdf(file, pages = '2', area = (50, 0, 1000, 1000), columns = [200, 305, 365, 455, 500])[0]
            
            #Find Rent data and save into a new dataframe
            try:
                #The first attempt looks for 'Rent Stabilization' in the first column and records the index of that row
                index_to_find = df_pdf.loc[df_pdf['Unnamed: 0'] == 'Rent Stabilization'].index

                #We will take that found row and convert it into the header
                df_pdf.columns = df_pdf.iloc[index_to_find[0]]
                
                #Take the row of values directly below it and convert it to a list
                df_pdf = df_pdf.iloc[index_to_find[0]+1]
                df_list = df_pdf.to_list()
                df_list.append(str(building_bbl_list[i]))
                df_list.append(tax_form_date)
                
            except IndexError:
                try:
                    #If the above process returns an IndexError, try again but look for the "# Apts" value in the next column. This is likely due to there being an asterisk or some other value on the Rent Stabilization cell.
                    # Following that, it is the same process as above
                    index_to_find = df_pdf.loc[df_pdf['Unnamed: 1'] == '# Apts'].index
                    df_pdf.columns = df_pdf.iloc[index_to_find[0]]
                    df_pdf = df_pdf.iloc[index_to_find[0]+1]
                    df_list = df_pdf.to_list()
                    df_list.append(str(building_bbl_list[i]))
                    df_list.append(tax_form_date)
                    print("Issue with Rent Stabilization Field - Update data")
                except IndexError:
                    #If this also fails, the property tax document has some other formatting issue and the values may have to be updated manually.
                    df_list = ['Rent Stabilization Fee - Chg', '0', 'NaN', 'nan', 'nan', '0', str(building_bbl_list[i]), tax_form_date]
                    print("No Rent Stabilized Units")
                
        
        
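        except Exception:
            #Skip this document so that a stale df_list from the previous building is not appended below
            print("Error in Tabula")
            continue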

        #Append to the master list which will be turned into a dataframe
        property_tax_list.append(df_list)
        
#Specify columns for dataframe from master list
cols = ['Rent Stabilized', 'Units', 'Due Date', 'RS ID', 'To Drop', 'Total Charge', 'BBL', 'Date']
final_df = pd.DataFrame(property_tax_list, columns=cols)
# final_df['Date'] = tax_form_date


BBL: 4005520022 Tax Date: 20210605
BBL: 4005150031 Tax Date: 20210605
BBL: 4005330037 Tax Date: 20210605
BBL: 4008850016 Tax Date: 20210605
BBL: 4005700030 Tax Date: 20210605
BBL: 4005700030 Tax Date: 20210605
BBL: 4005190006 Tax Date: 20210605
BBL: 4005190006 Tax Date: 20210605
BBL: 4008870035 Tax Date: 20210605
BBL: 4008720019 Tax Date: 20210605
BBL: 4005670040 Tax Date: 20210605
BBL: 4007250043 Tax Date: 20210605
BBL: 4006520080 Tax Date: 20210605
BBL: 4008110001 Tax Date: 20210605
BBL: 4005310015 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4005310017 Tax Date: 20210605
BBL: 4008720008 Tax Date: 20210605
BBL: 4008720032 Tax Date: 20210605
BBL: 4005510013 Tax Date: 20210605
BBL: 4008720011 Tax Date: 20210605
BBL: 4005310050 Tax Date: 20210605
BBL: 4005310050 Tax Date: 20210605
BBL: 4005690034 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4005690034 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4005690034 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4008610005 Tax Date:

Got stderr: Dec 05, 2021 9:19:50 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4008720032 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005510013 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4008720011 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005310050 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005310050 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005690034 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005690034 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005690034 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4008610005 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005980038 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4008480042 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4008550003 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:20:15 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4006490074 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:20:17 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4005700024 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:20:19 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4005350046 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4008610033 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:20:23 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



No Rent Stabilized Units
BBL: 4005960030 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:20:25 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4006330040 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005700033 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006600005 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005430002 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005430002 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005430019 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005690015 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005310060 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005400040 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:20:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4005310059 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4006630043 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006590075 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:20:50 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



No Rent Stabilized Units
BBL: 4005980064 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4008850001 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005710011 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005710011 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005960033 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:01 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4005500010 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:03 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4005940051 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:05 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4006320024 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005730059 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4008390007 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005780001 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:14 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4006200178 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005730046 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4006590073 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005390039 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4005330027 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4005430024 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4008350034 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:28 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



No Rent Stabilized Units
BBL: 4008870032 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006520042 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:33 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



No Rent Stabilized Units
BBL: 4005350051 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data
BBL: 4008310089 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:37 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4005780035 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:39 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field - Update data
BBL: 4005340106 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:21:42 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



No Rent Stabilized Units
BBL: 4005390041 Tax Date: 20190605
Issue with Rent Stabilization Field - Update data


In [272]:
#merge the master list data with the df buildings
merged_data = df_buildings.merge(final_df, left_on='BBL', right_on='BBL', how = 'inner')

In [273]:
merged_data

Unnamed: 0.1,Unnamed: 0,the_geom,bin,cnstrct_yr,lstmoddate,lststatype,doitt_id,heightroof,feat_code,groundelev,...,GROSS SQUARE FEET,YEAR BUILT,BBL,Rent Stabilized,Units,Due Date,RS ID,To Drop,Total Charge,Date
0,0,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4006371,2009.0,2017-08-22T00:00:00.000Z,Constructed,1113607,100.194260,2100,10.0,...,33500,2009,4005520022,Rent Stabilization Fee- Chg,30,01/01/2022,41089700,,$600.00,20210605
1,0,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4006371,2009.0,2017-08-22T00:00:00.000Z,Constructed,1113607,100.194260,2100,10.0,...,33500,2009,4005520022,Rent Stabilization Fee- Chg,30,01/01/2021,41089700,,$600.00,20200606
2,0,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4006371,2009.0,2017-08-22T00:00:00.000Z,Constructed,1113607,100.194260,2100,10.0,...,33500,2009,4005520022,Rent Stabilization Fee- Chg,30,01/01/2020,41089700,,$300.00,20190605
3,7,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4005656,2008.0,2017-08-22T00:00:00.000Z,Constructed,1109509,53.890919,2100,10.0,...,39745,2008,4005150031,Rent Stabilization Fee- Chg,36,01/01/2022,42909800,,$720.00,20210605
4,7,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4005656,2008.0,2017-08-22T00:00:00.000Z,Constructed,1109509,53.890919,2100,10.0,...,39745,2008,4005150031,Rent Stabilization Fee - Chg,0,,,,0,20200606
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
256,255,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4618780,,2021-01-22T00:00:00.000Z,Constructed,1285809,50.000000,2100,13.0,...,14259,2017,4005340106,Rent Stabilization Fee- Chg,18,01/01/2021,51155300,,$360.00,20200606
257,255,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4618780,,2021-01-22T00:00:00.000Z,Constructed,1285809,50.000000,2100,13.0,...,14259,2017,4005340106,Rent Stabilization Fee - Chg,0,,,,0,20190605
258,262,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4006100,1952.0,2021-11-12T00:00:00.000Z,Constructed,564157,112.000000,2100,21.0,...,41381,2012,4005390041,Rent Stabilization Fee- Chg,30,01/01/2022,43177700,,$600.00,20210605
259,262,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4006100,1952.0,2021-11-12T00:00:00.000Z,Constructed,564157,112.000000,2100,21.0,...,41381,2012,4005390041,Rent Stabilization Fee- Chg,30,01/01/2021,43177700,,$600.00,20200606


In [274]:
#Fix the date values and add a year indicator
merged_data['Date'] = merged_data['Date'].apply(lambda x: datetime.strptime(x, '%Y%m%d'))
merged_data['Year'] = merged_data['Date'].apply(lambda x: x.year)

In [None]:
merged_data.to_csv('ad36_421Properties_TaxClass2_withTaxData.csv')
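
The `Units` and `Total Charge` columns come out of the PDF as text (for example `30` and `$600.00` in the preview above). A minimal sketch of converting such strings to numbers before analysis; `parse_charge` and `parse_units` are hypothetical helpers, and treating blanks and `nan` as zero is an assumption that may not suit every use:

```python
def parse_charge(value):
    """Convert a charge string such as '$600.00' to a float (blank or 'nan' becomes 0.0)."""
    text = str(value).strip().replace("$", "").replace(",", "")
    if text == "" or text.lower() == "nan":
        return 0.0
    return float(text)

def parse_units(value):
    """Convert a unit count string such as '30' to an int (blank or 'nan' becomes 0)."""
    text = str(value).strip()
    if text == "" or text.lower() == "nan":
        return 0
    return int(float(text))

print(parse_charge("$600.00"), parse_units("30"))  # prints 600.0 30
```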

### Pivot Table

Clean up the Tax Class 2 data and convert it into a pivot table.
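
To make the reshaping concrete, here is a plain-Python sketch of what the pivot does: one row per address, one column per tax year, with missing years filled with 0. The records are illustrative (the addresses and unit counts echo the merged preview above), and `pivot_units` is a hypothetical helper rather than the pandas call used in the notebook:

```python
def pivot_units(records):
    """Reshape long records into {address: {year: units}}, filling missing years with 0."""
    years = sorted({r["year"] for r in records})
    table = {}
    for r in records:
        # Start every address with a zero for each year seen anywhere in the data
        row = table.setdefault(r["address"], {y: 0 for y in years})
        row[r["year"]] = r["units"]
    return table

records = [
    {"address": "21-16 31 AVENUE", "year": 2021, "units": 30},
    {"address": "21-16 31 AVENUE", "year": 2020, "units": 30},
    {"address": "21-16 31 AVENUE", "year": 2019, "units": 30},
    {"address": "12-26 30 AVENUE", "year": 2021, "units": 36},
    {"address": "12-26 30 AVENUE", "year": 2020, "units": 0},
]

for address, row in pivot_units(records).items():
    print(address, row)
```

The wide layout puts each building's yearly counts side by side, which is the shape that mapping and charting tools usually expect.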

In [275]:
#Clean up dataframe fields and columns. Take an explicit copy so the changes below do not act on a slice of merged_data
df_class2 = merged_data[['the_geom', 'geomsource', 'cnstrct_yr', 'lstmoddate', 'lststatype', 'Latitude', 'Longitude',
                         'AssemDist', 'BUILDING CLASS CATEGORY', 'BUILDING CLASS', 'ADDRESS', 'ZIP CODE', 'RESIDENTIAL UNITS',
                         'COMMERCIAL UNITS', 'TOTAL UNITS', 'YEAR BUILT', 'BBL', 'Units', 'Total Charge', 'Date', 'Year']].copy()
df_class2 = df_class2.rename(columns={'Units': 'Rent Stabilized Units', 'Total Charge': 'Rent Stabilized Charges', 'Date': 'Tax Doc Date',
                                      'Year': 'Tax Doc Year'})
df_class2 = df_class2.fillna(0)
df_class2['cnstrct_yr'] = df_class2['cnstrct_yr'].astype(int)

#Create a pivot table of rent stabilized units by address and tax year. This makes the data easier to map and visualize
df_pivot2 = df_class2.drop(['the_geom'], axis = 1)
df_pivot2 = df_pivot2.sort_values('geomsource', ascending=False).drop_duplicates(subset=['ADDRESS', 'Tax Doc Year'])
df_pivot2 = df_pivot2.pivot(index=['ADDRESS', 'Latitude', 'Longitude'], columns='Tax Doc Year', values='Rent Stabilized Units')
df_pivot2 = df_pivot2.fillna(0)

#Write to file
df_pivot2.to_csv('taxClass2_pivotTable_ad36.csv')


## Tax Class 2B

### Read Building Data

In [277]:
df_buildings = pd.read_csv("ad36_421Properties_TaxClass2B.csv")
df_buildings = df_buildings.astype({'BBL': 'str'})

In [278]:
print("Total Number of buildings under this class: {}".format(df_buildings.shape[0]))
df_buildings.head(1)

Total Number of buildings under this class: 105


Unnamed: 0.1,Unnamed: 0,the_geom,bin,cnstrct_yr,lstmoddate,lststatype,doitt_id,heightroof,feat_code,groundelev,...,BUILDING CLASS,ADDRESS,ZIP CODE,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,BBL
0,1,"{'type': 'MultiPolygon', 'coordinates': [[[[-7...",4594935,2009.0,2017-08-22T00:00:00.000Z,Constructed,1112303,54.768241,2100,35,...,C1,25-16 30 DRIVE,11102.0,8,0,8,3669,10976,2009,4005780026


In [279]:
#get BBL values into a list
building_bbl_list = list(df_buildings['BBL'])
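
The run logs in this notebook show some BBLs appearing more than once per tax date, which suggests duplicate rows in the buildings file. If re-downloading the same statement is not wanted, one option is to de-duplicate the list while preserving order. A small sketch (the sample values are taken from the Tax Class 2 log above):

```python
# Sample BBLs as they appear, with repeats, in the Tax Class 2 run log
sample_bbls = ["4005700030", "4005700030", "4005190006", "4005190006", "4008870035"]

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order
unique_bbls = list(dict.fromkeys(sample_bbls))

print(unique_bbls)  # prints ['4005700030', '4005190006', '4008870035']
```

Because the later merge is keyed on `BBL`, each downloaded result would still attach to every matching building row.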

### Extract Property Data
For each BBL, we download the property tax PDF and extract the relevant 421A data.

The results are then combined into a single dataframe and merged with the buildings dataframe above.

Link to download PDF: https://a836-edms.nyc.gov/dctm-rest/repositories/dofedmspts/StatementSearch?bbl=[BBL VALUE HERE]&stmtDate=20211120&stmtType=SOA 

PDF Process:

- To pull tax information (including rent-regulated units), we need to parse the actual PDF tax documents. These are all available through the NYC Dept of Finance website.
- For each building, we download the tax document for each of the past 3 years (2019, 2020, 2021). Reading the PDF directly from the URL runs into encoding issues, so each file is saved locally first.
- After downloading the tax document, we have to find *where* the unit information is and gather the number of rent-regulated units reported that year.
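
If a download ever returns something other than a statement (an error page, for instance), it may help to confirm the saved file is a PDF before handing it to Tabula. A minimal sketch, relying on the fact that PDF files begin with the bytes `%PDF-`; `looks_like_pdf` and `file_looks_like_pdf` are hypothetical helpers:

```python
def looks_like_pdf(data):
    """Return True when the byte string starts with the PDF header marker."""
    return data[:5] == b"%PDF-"

def file_looks_like_pdf(path):
    """Check only the first five bytes of a file on disk."""
    with open(path, "rb") as handle:
        return looks_like_pdf(handle.read(5))

print(looks_like_pdf(b"%PDF-1.4\n..."))           # prints True
print(looks_like_pdf(b"<html>Not found</html>"))  # prints False
```

In the download loop, `file_looks_like_pdf("property_tax.pdf")` could be checked right after `urlretrieve`, skipping the parse when it returns `False`.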

This is not a perfect process! There is no guarantee that the PDF format is the same for every building, so values may be missing for some of them. For these buildings, we need to go into the files after the fact and update the values manually.

In [280]:
#list to include all the rent stabilization data
property_tax_list = []

In [281]:
tax_dates = ['20210605', '20200606', '20190605']

In [282]:
#Loop through the tax dates and the building list and pull each tax document
for tax_form_date in tax_dates:

    for i in range(len(building_bbl_list)):

        print("BBL: {} Tax Date: {}".format(building_bbl_list[i], tax_form_date))

        #Download PDF 
        url_loc = "https://a836-edms.nyc.gov/dctm-rest/repositories/dofedmspts/StatementSearch?bbl={}&stmtDate={}&stmtType=SOA".format(building_bbl_list[i], tax_form_date)

        urlretrieve(url_loc, "property_tax.pdf")

        #read second page into a dataframe
        try:
            file = 'property_tax.pdf'
            df_pdf= tb.read_pdf(file, pages = '2', area = (50, 0, 1000, 1000), columns = [200, 305, 365, 455, 500])[0]
            #Find Rent data and save into a new dataframe
            try:
                index_to_find = df_pdf.loc[df_pdf['Unnamed: 0'] == 'Rent Stabilization'].index
                df_pdf.columns = df_pdf.iloc[index_to_find[0]]
                # df_pdf = df_pdf.loc[df_pdf['Rent Stabilization']]
                df_pdf = df_pdf.iloc[index_to_find[0]+1]
                df_list = df_pdf.to_list()
                df_list.append(str(building_bbl_list[i]))
                df_list.append(tax_form_date)
                # final_df = pd.DataFrame(df_pdf).T
            except IndexError:
                try:
                    index_to_find = df_pdf.loc[df_pdf['Unnamed: 1'] == '# Apts'].index
                    df_pdf.columns = df_pdf.iloc[index_to_find[0]]
                    df_pdf = df_pdf.iloc[index_to_find[0]+1]
                    df_list = df_pdf.to_list()
                    df_list.append(str(building_bbl_list[i]))
                    df_list.append(tax_form_date)
                    print("Issue with Rent Stabilization Field")
                except IndexError:
                    df_list = ['Rent Stabilization Fee - Chg', '0', 'NaN', 'nan', 'nan', '0', str(building_bbl_list[i]), tax_form_date]
                    print("No Rent Stabilized Units")
                
        
        
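        except Exception:
            #Skip this document so that a stale df_list from the previous building is not appended below
            print("Error in Tabula")
            continue

        #Append to the master list which will be turned into a dataframe
        property_tax_list.append(df_list)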
cols = ['Rent Stabilized', 'Units', 'Due Date', 'RS ID', 'To Drop', 'Total Charge', 'BBL', 'Date']
final_df = pd.DataFrame(property_tax_list, columns=cols)
# final_df['Date'] = tax_form_date

BBL: 4005780026 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4006600024 Tax Date: 20210605
BBL: 4008610044 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4005930011 Tax Date: 20210605
BBL: 4005940055 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4006220019 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4006220019 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4005490036 Tax Date: 20210605
BBL: 4006530074 Tax Date: 20210605
BBL: 4008860030 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4006530020 Tax Date: 20210605
BBL: 4007110004 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4006500069 Tax Date: 20210605
BBL: 4006330051 Tax Date: 20210605
BBL: 4005970120 Tax Date: 20210605
BBL: 4008720012 Tax Date: 20210605
BBL: 4006300011 Tax Date: 20210605
BBL: 4005940029 Tax Date: 20210605
BBL: 4006330082 Tax Date: 20210605
BBL: 4005740036 Tax Date: 20210605
BBL: 4005740037 Tax Date: 20210605
BBL: 4005390007 Tax Date: 20210605
No Rent Stabilized Units
BBL: 4008920113 Tax Date: 2021

Got stderr: Dec 05, 2021 9:29:26 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005940055 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006220019 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006220019 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005490036 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006530074 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4008860030 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006530020 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:29:41 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4007110004 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:29:43 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006500069 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006330051 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:29:47 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005970120 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:29:49 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4008720012 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006300011 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:29:53 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005940029 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006330082 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005740036 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005740037 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005390007 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4008920113 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4008920113 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006540022 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:30:10 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006540022 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:30:12 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006550012 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005720030 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005770015 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005970127 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:30:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005970118 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:30:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005970119 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:30:24 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005970110 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005970111 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:30:28 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006160004 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:30:31 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006530090 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006530081 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006610034 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006610037 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006630010 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005680007 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005680007 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006290009 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:30:47 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006290009 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:30:49 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006590068 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006500081 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006500081 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006520010 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006160003 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:00 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005670029 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005670029 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005980023 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:06 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006530089 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006530089 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4008350012 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4008040123 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:14 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006150023 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:16 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006250084 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005150001 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4008390019 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:23 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



No Rent Stabilized Units
BBL: 4006290010 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005520018 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:27 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006610003 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:29 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



No Rent Stabilized Units
BBL: 4006300018 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:31 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005170021 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005170021 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005690032 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005980055 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006150083 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006590077 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:43 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006290011 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:45 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006160019 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:47 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006160011 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:50 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005690027 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005690027 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005510003 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005930021 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:31:59 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005730060 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005940034 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006520046 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005980076 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4008870038 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006590078 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006600031 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:32:14 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006490069 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006330003 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:32:18 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006160030 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4008840005 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005970124 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:32:24 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005710019 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:32:27 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006500080 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006520062 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4005970223 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006530075 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005980001 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006150022 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:32:39 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4005740035 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005920007 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:32:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



Issue with Rent Stabilization Field
BBL: 4006520011 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005740135 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005700007 Tax Date: 20190605
No Rent Stabilized Units
BBL: 4006600027 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4008400049 Tax Date: 20190605


Got stderr: Dec 05, 2021 9:32:54 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



No Rent Stabilized Units
BBL: 4006190057 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005940155 Tax Date: 20190605
No Rent Stabilized Units
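
The run above prints two kinds of messages, each followed by a `BBL: ... Tax Date: ...` line. A minimal sketch of how that output could be tallied, assuming the printed text has been captured into a string (the name `captured_log` and the regular expression are illustrative and not part of the original notebook):

```python
import re
from collections import Counter

# A few message/BBL pairs copied from the output above, used as sample input.
captured_log = """\
No Rent Stabilized Units
BBL: 4006290010 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4005520018 Tax Date: 20190605
Issue with Rent Stabilization Field
BBL: 4006610003 Tax Date: 20190605
"""

# Match a message line immediately followed by its BBL / tax date line.
pattern = re.compile(
    r"^(No Rent Stabilized Units|Issue with Rent Stabilization Field)\n"
    r"BBL: (\d{10}) Tax Date: (\d{8})$",
    re.MULTILINE,
)

records = pattern.findall(captured_log)
counts = Counter(message for message, _bbl, _date in records)
print(counts)
```

Collecting the BBLs per message type this way makes it easier to revisit only the statements whose rent stabilization field failed to parse.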


In [283]:
merged_data = df_buildings.merge(final_df, on='BBL', how='inner')
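
Because this is an inner join, any building whose BBL has no parsed tax record drops out without notice. A minimal sketch of one way to surface those rows, using toy frames (the `Units` values are made up for illustration) and the `indicator` option of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for df_buildings and final_df; the Units values are illustrative only.
buildings = pd.DataFrame({"BBL": ["4005330037", "4005520022", "4005150031"]})
tax_rows = pd.DataFrame({"BBL": ["4005330037", "4005520022"], "Units": [14, 32]})

# A left join with indicator=True adds a `_merge` column that records whether
# each building found a match in the tax data.
checked = buildings.merge(tax_rows, on="BBL", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "BBL"].tolist()
print(unmatched)
```

Running a check like this before the inner join gives a list of BBLs to re-scrape or inspect by hand.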

In [284]:
merged_data['Date'] = pd.to_datetime(merged_data['Date'], format='%Y%m%d')
merged_data['Year'] = merged_data['Date'].dt.year

In [None]:
merged_data.to_csv('ad36_421Properties_TaxClass2B_withTaxData.csv')

### Pivot Table

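Clean up the Tax Class 2B data and reshape it into a pivot table: one row per address, one column per tax document year, with the number of rent-stabilized units as the values.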

In [285]:
# Take an explicit copy so the edits below modify a new DataFrame instead of a
# slice of merged_data (this avoids pandas' SettingWithCopyWarning).
df_class2b = merged_data[['the_geom', 'geomsource', 'cnstrct_yr', 'lstmoddate', 'lststatype', 'Latitude', 'Longitude',
                         'AssemDist', 'BUILDING CLASS CATEGORY', 'BUILDING CLASS', 'ADDRESS', 'ZIP CODE', 'RESIDENTIAL UNITS',
                         'COMMERCIAL UNITS', 'TOTAL UNITS', 'YEAR BUILT', 'BBL', 'Units', 'Total Charge', 'Date', 'Year']].copy()
df_class2b = df_class2b.rename(columns={'Units': 'Rent Stabilized Units', 'Total Charge': 'Rent Stabilized Charges',
                                        'Date': 'Tax Doc Date', 'Year': 'Tax Doc Year'})
df_class2b = df_class2b.fillna(0)
df_class2b['cnstrct_yr'] = df_class2b['cnstrct_yr'].astype(int)

# Drop the geometry column, keep one row per address and tax year, then pivot.
df_pivot2b = df_class2b.drop(columns=['the_geom'])
df_pivot2b = df_pivot2b.sort_values('geomsource', ascending=False).drop_duplicates(subset=['ADDRESS', 'Tax Doc Year'])
df_pivot2b = df_pivot2b.pivot(index=['ADDRESS', 'Latitude', 'Longitude'], columns='Tax Doc Year', values='Rent Stabilized Units')
df_pivot2b = df_pivot2b.fillna(0)
df_pivot2b.to_csv('taxClass2B_pivotTable_ad36.csv')
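
To make the reshape concrete, here is a small sketch on made-up rows (the addresses and counts are illustrative, not taken from the tax documents). One row per address and tax year becomes one row per address with a column per year:

```python
import pandas as pd

# Long format: one row per (address, tax year). Values are illustrative only.
long_df = pd.DataFrame({
    "ADDRESS": ["1 EXAMPLE ST", "1 EXAMPLE ST", "2 EXAMPLE AVE"],
    "Tax Doc Year": [2019, 2020, 2020],
    "Rent Stabilized Units": [10, 12, 5],
})

# Wide format: years become columns; address/year pairs with no statement
# come out as NaN and are filled with 0, as in the cell above.
wide_df = long_df.pivot(index="ADDRESS", columns="Tax Doc Year",
                        values="Rent Stabilized Units").fillna(0)
print(wide_df)
```

`pivot` raises a `ValueError` if the same address and year appear more than once, which is why the notebook drops duplicates on `ADDRESS` and `Tax Doc Year` first.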


## Troubleshooting!


In [290]:
# All of this should be done later. Focus on rent stabilization efforts first.

bbl_trial = '4005330037'
tax_form_date = '20191119'


url_loc = "https://a836-edms.nyc.gov/dctm-rest/repositories/dofedmspts/StatementSearch?bbl={}&stmtDate={}&stmtType=SOA".format(bbl_trial, tax_form_date)

urlretrieve(url_loc, "property_tax.pdf")

('property_tax.pdf', <http.client.HTTPMessage at 0x7f8d80fef310>)
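
The URL construction and download above could be wrapped in small helpers so each BBL and statement date goes through the same path. This is only a sketch: the function names `build_statement_url` and `download_statement` are hypothetical, and the query parameter names are copied from the URL used above.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

BASE_URL = "https://a836-edms.nyc.gov/dctm-rest/repositories/dofedmspts/StatementSearch"

def build_statement_url(bbl, stmt_date, stmt_type="SOA"):
    """Build the statement search URL for one BBL and statement date (YYYYMMDD)."""
    query = urlencode({"bbl": bbl, "stmtDate": stmt_date, "stmtType": stmt_type})
    return f"{BASE_URL}?{query}"

def download_statement(bbl, stmt_date, dest="property_tax.pdf"):
    """Download one statement PDF to `dest` and return the local path."""
    # Network call; defined here for illustration and not executed below.
    path, _headers = urlretrieve(build_statement_url(bbl, stmt_date), dest)
    return path

url = build_statement_url("4005330037", "20191119")
print(url)
```

Keeping URL building separate from the download makes it possible to inspect or log the URL without touching the network.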

In [291]:
file = 'property_tax.pdf'
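# area is (top, left, bottom, right) in points; columns lists the x-coordinates of the column boundaries
df3 = tb.read_pdf(file, pages='2', area=(50, 0, 400, 1000), columns=[200, 305, 365, 455, 500])[0]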

In [292]:
df3

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Page 2
0,Previous Charges,,,,,Amount
1,Total previous charges including in,terest and payments,,,,$0.00
2,Current Charges,Activity Date,Due Date,,,Amount
3,Finance-Property Tax,,01/01/2020,,,$0.00
4,Adopted Tax Rate,,,,,$-30.92
5,Early Payment Discount,,01/01/2020,,,$0.15
6,Payment Adjusted,01/01/2020,,,,$30.77
7,Rent Stabilization * fee $10/apt.,# Apts,,RS fee identifiers,,
8,Rent Stabilization Fee- Chg,14,01/01/2020,43176000,,$140.00
9,Rent Stabilization Fee- Chg,14,01/01/2020,43176000,,$126.00


In [293]:
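# Locate the header row of the rent stabilization block (second column reads '# Apts')
index_to_find = df3.loc[df3['Unnamed: 1'] == '# Apts'].index
# Promote that row to column names, then keep only the first fee row beneath it
df3.columns = df3.iloc[index_to_find[0]]
df3 = df3.iloc[index_to_find[0]+1]
df3 = pd.DataFrame(df3).T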

In [294]:
df3

7,Rent Stabilization * fee $10/apt.,# Apts,NaN,RS fee identifiers,NaN.1,NaN.2
8,Rent Stabilization Fee- Chg,14,01/01/2020,43176000,,$140.00


In [295]:
df4 = tb.read_pdf(file, pages='2', area=(0, 0, 450, 1000), columns=[200, 305, 385, 475, 500])[0]


In [296]:
df4

Unnamed: 0.1,Unnamed: 0,Statemen,t Details,Unnamed: 1,Unnamed: 2,"November 16, 2019"
0,,,,,,"Jml CPA, LLC"
1,,,,,,14-34 31st Ave.
2,,,,,,4-00533-0037
3,,,,,,Page 2
4,Previous Charges,,,,,Amount
5,Total previous charges including in,terest and payments,,,,$0.00
6,Current Charges,Activity Date,Due Date,,,Amount
7,Finance-Property Tax,,01/01/2020,,,$0.00
8,Adopted Tax Rate,,,,,$-30.92
9,Early Payment Discount,,01/01/2020,,,$0.15
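
The statement header prints the BBL in dashed form (`4-00533-0037`), while the rest of the notebook uses the 10-digit string (`4005330037`). A small sketch for converting between the two, relying on the BBL layout of a 1-digit borough, 5-digit block, and 4-digit lot (the helper names are hypothetical):

```python
def bbl_to_dashed(bbl):
    """Convert a 10-digit BBL string to the borough-block-lot form printed on statements."""
    bbl = str(bbl)
    if len(bbl) != 10 or not bbl.isdigit():
        raise ValueError(f"expected a 10-digit BBL, got {bbl!r}")
    return f"{bbl[0]}-{bbl[1:6]}-{bbl[6:]}"

def dashed_to_bbl(dashed):
    """Convert a dashed borough-block-lot string back to the 10-digit form."""
    boro, block, lot = dashed.split("-")
    return f"{int(boro)}{int(block):05d}{int(lot):04d}"

print(bbl_to_dashed("4005330037"))
print(dashed_to_bbl("4-00533-0037"))
```

A check like `dashed_to_bbl(header_value) == bbl_trial` could confirm that the downloaded PDF belongs to the requested building before any fields are parsed.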


In [297]:
index_to_find2 = df4.loc[df4['Unnamed: 0'] == 'Billable Assessed Value'].index
# df4.columns = df4.iloc[index_to_find2[0]-1]
df4 = df4.iloc[index_to_find2[0]:index_to_find2[0]+2]
# df4 = pd.DataFrame(df4)


IndexError: index 0 is out of bounds for axis 0 with size 0
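
This `IndexError` appears when the label lookup returns an empty index and `[0]` is taken anyway, for example when the row falls outside the extracted area or the label lands in a different column. A sketch of a guarded lookup on a toy frame (the helper name `first_match` and the sample values are assumptions for illustration):

```python
import pandas as pd

def first_match(df, column, label):
    """Return the position of the first row where df[column] == label, or None."""
    positions = (df[column] == label).to_numpy().nonzero()[0]
    return int(positions[0]) if len(positions) else None

# Toy frame mimicking the shape of the parsed page; values are illustrative.
page = pd.DataFrame({
    "Unnamed: 0": ["Previous Charges", "Current Charges"],
    "Amount": ["$0.00", "$0.00"],
})

pos = first_match(page, "Unnamed: 0", "Billable Assessed Value")
if pos is None:
    print("Label not found on this page; skipping")
else:
    # Keep the matching row and the one after it, as in the cell above.
    print(page.iloc[pos:pos + 2])
```

Returning `None` instead of raising lets a loop over many statements record the miss and continue with the next BBL.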

In [None]:
billable_value = df4.iloc[0]['t Details']
discount_421 = df4.iloc[1]['t Details']
type_421 = df4.iloc[1]['Unnamed: 0']

In [None]:
df4

Unnamed: 0.1,Unnamed: 0,Statemen,t Details,Unnamed: 1,Unnamed: 2,"November 21, 2020"
19,Billable Assessed Value,,"$922,860",,,
20,421a,,-889965.00,,,
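
The values pulled from this table are still strings (`$922,860`, `-889965.00`), so they need to be converted before any arithmetic. A minimal sketch, assuming amounts only ever carry a dollar sign, thousands separators, and an optional minus sign (the helper name `parse_amount` is hypothetical):

```python
from decimal import Decimal

def parse_amount(text):
    """Turn a statement amount such as '$922,860' or '$-30.92' into a Decimal."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)

billable = parse_amount("$922,860")
discount = parse_amount("-889965.00")

# The sum is shown only to demonstrate that the parsed values support arithmetic.
print(billable + discount)
```

`Decimal` is used here instead of `float` so that dollar amounts keep their exact cents.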
