In [59]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

### Data Description: Boston Housing Dataset

Ruonan Zhang (ruonanz2)

I am using the Boston Property Assessment dataset (51.5 MB), provided by the Boston government: https://data.boston.gov/dataset/property-assessment/resource/fd351943-c2c6-4630-992d-3f895360febd

**[Official description]** Gives property, or parcel, ownership together with value information, which ensures fair assessment of Boston taxable and non-taxable property of all types and classifications.

This dataset contains detailed information on real estate in the Boston area. It has 172841 rows, each representing a unique building, and 75 columns, each providing descriptive information about the buildings.

The columns include several kinds of features. **Classification features**, such as land-use type `LU` or structure class `STRUCTURE_CLASS`, are categorical variables. **Descriptive features**, such as total number of rooms `R_TOTAL_RMS` or number of full bathrooms `R_FULL_BTH`, are numerical variables. **Condition descriptions**, such as overall condition `R_OVRALL_CND` or interior finish `R_INT_FIN`, are categorical variables. **Assessment values**, such as total assessed property value `AV_TOTAL` or total assessed land value `AV_LAND`, are numerical variables.

Some of the 75 columns can be understood from their names alone, but a detailed description of each is available here: https://data.boston.gov/dataset/property-assessment/resource/d6c1268c-cd83-4dc3-a914-bba1ed59da6d

The total assessed value of each property is updated yearly, so a new version of the dataset is released each year. **I am only using the most up-to-date dataset (2018 version).**

In [60]:
BostonHousing = pd.read_csv('E:/UIUC/IS590_dv/Final/ast2018full.csv')
nrow = BostonHousing.shape[0]
print('This dataset has ' + str(nrow) + ' rows; each row represents a record for a unique building.')

This dataset has 172841 rows; each row represents a record for a unique building.


In [62]:
n_features = BostonHousing.shape[1]
print('This dataset has ' + str(n_features) + ' features in total.')

This dataset has 75 features in total.


In [26]:
# Keep only rows whose land-use code LU is one of R1, R2, R3, R4
BostonResidential = BostonHousing[BostonHousing['LU'].isin(['R1','R2','R3','R4'])]
print('Among all the buildings in the Boston area, ' + str(BostonResidential.shape[0]) + ' are residential buildings.')

Among all the buildings in the Boston area, 64496 are residential buildings.


We are only going to use **residential building** data in the following steps.
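
As a quick sanity check on the filter above, the rows can be tabulated per land-use code. The sketch below runs on a small invented frame (`toy` and the `OTHER` code are hypothetical); only the `LU` column name and the codes `R1` to `R4` come from this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for BostonHousing; 'OTHER' is a made-up code.
toy = pd.DataFrame({'LU': ['R1', 'R2', 'OTHER', 'R1', 'OTHER', 'R3', 'R4', 'R1']})

residential_codes = ['R1', 'R2', 'R3', 'R4']
toy_residential = toy[toy['LU'].isin(residential_codes)]

# Number of rows for each residential land-use code
lu_counts = toy_residential['LU'].value_counts()
print(lu_counts)
```

Running the same `value_counts()` call on `BostonResidential['LU']` would show how the residential rows split across the four codes.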

#### Column Description

In [68]:
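# Only float-typed columns are counted here; integer-typed columns are not included
numerical = BostonResidential.select_dtypes(include='float').shape[1]
print('There are ' + str(numerical) + ' numerical (float) features in total.')

There are 24 numerical (float) features in total.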


In [72]:
# Columns with object dtype are treated as categorical
cate = BostonResidential.select_dtypes(include='object').shape[1]
print('There are ' + str(cate) + ' categorical features in total.')

There are 45 categorical features in total.
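
The two counts above add up to 69 of the 75 columns, so the remaining 6 columns have a dtype that is neither float nor object (integer columns are one possibility, though the notebook does not show this). A minimal sketch of a fuller dtype breakdown follows, using a hypothetical toy frame rather than the real data; the column `some_int` and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are invented and 'some_int' is not a real column.
toy = pd.DataFrame({
    'some_int': [1, 2, 3],               # integer column
    'AV_TOTAL': [100.0, 250.5, 300.0],   # float column
    'LU': ['R1', 'R2', 'R1'],            # string column
})

n_numeric = toy.select_dtypes(include='number').shape[1]    # ints and floats
n_float = toy.select_dtypes(include='float').shape[1]       # floats only
n_non_numeric = toy.select_dtypes(exclude='number').shape[1]

# Full breakdown of columns per dtype
print(toy.dtypes.value_counts())
print(n_numeric, n_float, n_non_numeric)
```

Using `include='number'` instead of `include='float'` would count integer columns as numerical too, which may or may not be what is wanted here.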


#### Missing values

In [50]:
# Fraction of missing values per column (between 0 and 1), sorted from most to least missing
Missing = BostonResidential.isna().mean().sort_values(ascending=False).reset_index()
Missing.columns = ['colname','percentage']

In [58]:
large_missing = Missing[Missing['percentage'] > 0.4].shape[0]
print('There are ' + str(large_missing) + ' columns that have more than 40% missing values.')

There are 31 columns that have more than 40% missing values.


In [74]:
no_missing = Missing[Missing['percentage'] == 0].shape[0]
print('There are ' + str(no_missing) + ' columns that have no missing values.')

There are 16 columns that have no missing values.
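
The cells above count columns by their missing fraction. A minimal sketch of the same computation on a hypothetical toy frame, which also lists the affected column names, is given below. The 40% threshold and the `colname`/`percentage` column names are taken from this notebook; the data and the names `toy`, `high_missing` and `complete` are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with known missing fractions
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],   # 50% missing
    'b': [1.0, 2.0, 3.0, 4.0],         # 0% missing
    'c': [np.nan, 2.0, 3.0, 4.0],      # 25% missing
})

# Fraction of missing values per column, sorted from most to least missing
missing = toy.isna().mean().sort_values(ascending=False).reset_index()
missing.columns = ['colname', 'percentage']

# Names of columns above the 40% threshold, and of fully populated columns
high_missing = missing.loc[missing['percentage'] > 0.4, 'colname'].tolist()
complete = missing.loc[missing['percentage'] == 0, 'colname'].tolist()
print(high_missing, complete)
```

Note that the `percentage` column holds a fraction between 0 and 1, which is why the threshold is written as `0.4`.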
