# Problem statement and Goal
NHPD: For all counties, aggregate the National Housing Preservation Database at the census tract level using whatever variables seem most interesting or relevant, and combine it with the “processed” datasets. What does the overall low-income housing stock look like in areas with high eviction rates? Are any of these features statistically related to the incidence of evictions in these counties? Furthermore, are there any insights common to two or more counties, or is the “state” of low-income housing unique to each county?


In [1]:
import pandas as pd

In [3]:
data_set = pd.read_excel("nhpd_data.xlsx", engine="openpyxl")

### General inference from the data dictionary
- The data set 'Active and Inconclusive Properties.xlsx' appears to be the `Data Extract` rather than the `Data Grid`: each row represents one property, and subsidy information sits alongside the property information in that row instead of being expanded out from the property record.
- Each property maps to one address.
- Most phased properties are split into separate records by address.
- In the rare cases where phases are combined into one record, we might need to separate them by address. This is probably unnecessary: funding is tracked at the property level, so unless we need details that sub-categorize a property, there is little use in maintaining these records as separate entities.
- The keywords 'Development', 'Project' and 'Property' are used interchangeably to mean a 'cluster of buildings' tracked under the same identification number.
  - If we wish to run NLP algorithms on the descriptions, we may need to replace these words with 'Property'. Otherwise, the vectorized versions of these words may not be close to each other.
  - Even embedding methods such as Word2Vec may not place these vectors close to one another, because 'Development', 'Project' and 'Property' have different meanings in general usage.
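The term replacement described above could be sketched as follows. This is a minimal illustration, not a settled preprocessing step: the function name `normalize_property_terms` is hypothetical, and treating plural forms as 'Properties' is an assumption.

```python
import re

# Assumption: all of these surface forms refer to the same NHPD entity.
_PATTERN = re.compile(
    r"\b(developments?|projects?|property|properties)\b", re.IGNORECASE
)


def normalize_property_terms(text: str) -> str:
    """Collapse 'Development'/'Project'/'Property' variants to one token."""

    def _swap(match: re.Match) -> str:
        word = match.group(1).lower()
        # Plural forms all end in "s"; singular forms do not.
        return "Properties" if word.endswith("s") else "Property"

    return _PATTERN.sub(_swap, text)


example = normalize_property_terms("Phase II of the project and two Developments")
```

Applying this before vectorization means the three terms share a single vocabulary entry instead of relying on the embedding to place them near each other.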

In [15]:
shape = data_set.shape
data_set.head()

Unnamed: 0,NHPDPropertyID,PropertyName,PropertyAddress,City,State,Zip,CBSACode,CBSAType,County,CountyCode,...,NumberActiveMR,NumberInconclusiveMR,NumberInactiveMR,Mr_1_Status,Mr_1_ProgramName,Mr_1_AssistedUnits,Mr_2_Status,Mr_2_ProgramName,Mr_2_AssistedUnits,OldNHPDPropertyID
0,1000000,IVY ESTATES,6729 Zeigler Blvd,Mobile,AL,36608-4253,33660.0,Metropolitan Statistical Area,Mobile,1097.0,...,0,0,0,,,,,,,
1,1000001,RENDU TERRACE WEST,7400 Old Shell Rd,Mobile,AL,36608-4549,33660.0,Metropolitan Statistical Area,Mobile,1097.0,...,0,0,0,,,,,,,
2,1000002,TWB RESIDENTIAL OPPORTUNITIES II,93 Canal Rd,Port Jefferson Station,NY,11776-3024,35620.0,Metropolitan,Suffolk,36103.0,...,0,0,0,,,,,,,
3,1000003,THE DAISY HOUSE,615 Clarissa St,Rochester,NY,14608-2485,40380.0,Metropolitan,Monroe,36055.0,...,0,0,0,,,,,,,
4,1000004,MAIN AVENUE APARTMENTS,105 E Walnut St,Sylacauga,AL,35150-3012,45180.0,Micropolitan Statistical Area,Talladega,1121.0,...,0,0,0,,,,,,,
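The header above shows subsidy details stored in numbered column groups (`Mr_1_*`, `Mr_2_*`). For per-subsidy comparisons it may be easier to reshape these into one row per property and slot. The sketch below assumes every group follows the same `Status`/`ProgramName`/`AssistedUnits` pattern seen for `Mr`; the function name, the `Slot` column, and the toy rows (including the program name) are made up for illustration.

```python
import pandas as pd

FIELDS = ["Status", "ProgramName", "AssistedUnits"]


def subsidy_slots_to_long(df: pd.DataFrame, prefix: str = "Mr", slots=(1, 2)) -> pd.DataFrame:
    """Stack <prefix>_<slot>_<field> column groups into one row per slot."""
    frames = []
    for slot in slots:
        # Map e.g. "Mr_1_Status" -> "Status" for this slot.
        cols = {f"{prefix}_{slot}_{field}": field for field in FIELDS}
        part = df[["NHPDPropertyID", *cols]].rename(columns=cols)
        part.insert(1, "Slot", slot)
        frames.append(part)
    stacked = pd.concat(frames, ignore_index=True)
    # Drop slots in which the property has no subsidy recorded at all.
    return stacked.dropna(subset=FIELDS, how="all").reset_index(drop=True)


# Toy rows for illustration only; not taken from the NHPD file.
toy = pd.DataFrame(
    {
        "NHPDPropertyID": [1, 2],
        "Mr_1_Status": ["Active", None],
        "Mr_1_ProgramName": ["Program A", None],
        "Mr_1_AssistedUnits": [12.0, None],
        "Mr_2_Status": ["Inactive", None],
        "Mr_2_ProgramName": ["Program A", None],
        "Mr_2_AssistedUnits": [11.0, None],
    }
)
long_subsidies = subsidy_slots_to_long(toy)
```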


In [18]:
print("No. of data points:", shape[0])
print("No. of features:", shape[1])
print("Different property statuses:", data_set.loc[:, "PropertyStatus"].unique())

No. of data points: 82287
No. of features: 252
Different property statuses: ['Active' 'Inconclusive']


In [19]:
data_set.describe()

Unnamed: 0,NHPDPropertyID,CBSACode,CountyCode,CensusTract,Latitude,Longitude,ActiveSubsidies,TotalInconclusiveSubsidies,TotalInactiveSubsidies,TotalUnits,...,NumberInconclusivePBV,NumberInactivePBV,Pbv_1_AssistedUnits,Pbv_2_AssistedUnits,NumberActiveMR,NumberInconclusiveMR,NumberInactiveMR,Mr_1_AssistedUnits,Mr_2_AssistedUnits,OldNHPDPropertyID
count,82287.0,72919.0,82229.0,82224.0,82287.0,82287.0,82287.0,82287.0,82287.0,82287.0,...,82287.0,82287.0,2784.0,173.0,82287.0,82287.0,82287.0,500.0,9.0,58370.0
mean,1074656.0,30447.401281,28953.566759,28952170000.0,38.483402,-90.228069,1.386355,0.077485,0.369098,66.711145,...,0.0,0.0,43.463003,35.011561,0.006234,0.0,0.0,34.208,26.333333,52459.141083
std,40216.2,11096.935896,15256.967835,15254660000.0,4.975471,15.636717,0.895868,0.287468,0.734841,96.200003,...,0.0,0.0,41.703263,29.06228,0.081741,0.0,0.0,25.840049,19.68502,33880.224028
min,1000000.0,10100.0,1001.0,1001020000.0,13.49503,-166.722478,0.0,0.0,0.0,1.0,...,0.0,0.0,11.0,11.0,0.0,0.0,0.0,11.0,12.0,4.0
25%,1039279.0,19740.0,17053.0,17053960000.0,34.983064,-96.380764,1.0,0.0,0.0,18.0,...,0.0,0.0,17.0,14.0,0.0,0.0,0.0,16.0,13.0,24226.25
50%,1073499.0,32580.0,29095.0,29095010000.0,39.312214,-86.490946,1.0,0.0,0.0,40.0,...,0.0,0.0,29.0,24.0,0.0,0.0,0.0,25.0,19.0,49850.5
75%,1108144.0,39300.0,41015.0,41013950000.0,41.799999,-79.05225,2.0,0.0,1.0,82.0,...,0.0,0.0,53.0,46.0,0.0,0.0,0.0,43.0,22.0,78774.75
max,1163400.0,99999.0,69120.0,56045950000.0,65.160556,145.751129,106.0,13.0,24.0,5881.0,...,0.0,0.0,449.0,191.0,5.0,0.0,0.0,187.0,62.0,127185.0
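In the summary above, `CensusTract` is stored as a float, and its minimum (about 1001020000) has only 10 digits, which suggests that tracts in states whose FIPS code begins with 0 have lost their leading zero. Standard tract GEOIDs are 11 characters (2 state + 3 county + 6 tract), so joining with other tract-level data may require a zero-padded string key. A minimal sketch, assuming the other datasets use that string form; `tract_to_geoid` is a hypothetical helper name and the sample values are illustrative.

```python
import pandas as pd


def tract_to_geoid(tract: pd.Series) -> pd.Series:
    """Convert float tract codes to 11-character GEOID strings.

    Missing tracts stay missing (<NA>) instead of becoming the text "nan".
    """
    as_int = tract.astype("Int64")  # nullable integer keeps NaN as <NA>
    return as_int.astype("string").str.zfill(11)


geoids = tract_to_geoid(pd.Series([1001020000.0, 56045950000.0, float("nan")]))
```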


### Status [WIP]
- Find out how housing and subsidies work in the US.
- Map each census tract to its geographical boundaries.
- List all of the different subsidies ().
- Compute metrics per subsidy and compare them using statistical plots.
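As a starting point for the tract-level aggregation named in the problem statement, the sketch below collapses property rows to one row per census tract. The choice of `TotalUnits` and `ActiveSubsidies` as the variables to sum is an assumption, the output column names are hypothetical, and the toy rows are for illustration only.

```python
import pandas as pd


def aggregate_by_tract(properties: pd.DataFrame) -> pd.DataFrame:
    """Collapse property-level rows to one row per census tract."""
    return (
        properties.dropna(subset=["CensusTract"])  # rows without a tract cannot be grouped
        .groupby("CensusTract")
        .agg(
            n_properties=("NHPDPropertyID", "count"),
            total_units=("TotalUnits", "sum"),
            active_subsidies=("ActiveSubsidies", "sum"),
        )
        .reset_index()
    )


# Toy rows for illustration only; not taken from the NHPD file.
toy = pd.DataFrame(
    {
        "NHPDPropertyID": [1, 2, 3],
        "CensusTract": [1001020000.0, 1001020000.0, 36055000100.0],
        "TotalUnits": [10, 20, 5],
        "ActiveSubsidies": [1, 2, 1],
    }
)
tract_level = aggregate_by_tract(toy)
```

On the real data this would be called as `aggregate_by_tract(data_set)`, and the result could then be merged with the processed datasets on the tract key.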