# Final Project: Boston Housing Visualization

## Group Members

- MingFu Chou (mfchou2)
- Ruonan Zhang (ruonanz2)
- Shukai Yao (shukaiy2)
- Ni Lin (nilin2)

## Data Introduction

- Dataset: City of Boston Assessing Department
- File Name: ast2018full.csv
- Data Source: Analyze Boston (City of Boston's open data hub)
- Link: https://data.boston.gov/dataset/property-assessment
- License: Open Data Commons Public Domain Dedication and License (PDDL)
- Data usage: The PDDL allows anyone to freely share, modify, and use this work for any purpose and without restrictions.
- File Size: 54.1 MB

**[Official description]** Gives property, or parcel, ownership together with value information, which ensures fair assessment of Boston taxable and non-taxable property of all types and classifications.

This dataset contains detailed information about real estate in the Boston area. It has 172841 rows, each representing a unique building, and 75 columns, each providing descriptive information about the buildings.
The columns fall into four groups: **classification features**, such as land use type `LU` or structure class `STRUCTURE_CLASS`, stored as categorical variables; **descriptive features**, such as total number of rooms `R_TOTAL_RMS` or number of full bathrooms `R_FULL_BTH`, stored as numerical variables; **condition descriptions**, such as overall condition `R_OVRALL_CND` or interior finish `R_INT_FIN`, stored as categorical variables; and **assessment values**, such as total assessed property value `AV_TOTAL` or total assessed land value `AV_LAND`, stored as numerical variables.

We are only using the residential properties of Boston in this dataset.

## Visualization Contents
 

[Map for Boston Housing](#Map-for-Boston-Housing)

- [Data Preprocessing](#Data-Preprocessing)

- [Build Interaction](#Build-Interaction)

- [Map function](#Map-function)

- [Final Display](#Final-Display)

- [Observations and summary](#Observations-and-summary)

[Time Series Chart](#Time-Series-Visualization)

- [Time Series Line Chart](#Time-Series-Line-Chart)

- [Observations](#Observations-for-Line-Chart)

- [Time Series Stacked Bar Chart](#Time-Series-Stacked-Bar-Chart)

- [Observations](#Observations-for-Stacked-Bar-Chart)

[Filter Scatter Chart](#Filter-Scatter-Chart)

- [Observations](#Observations-by-Filter)


----
### Map for Boston Housing

This map shows Boston properties by ZIP code. Users can select which value they want to explore. The map is displayed at the bottom of this section; run all the preceding code cells first to preprocess the data for the map.

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import warnings
warnings.filterwarnings('ignore')
import json
import folium
from folium import plugins
from branca.colormap import LinearColormap

#### Data Preprocessing
To load the map, **run all the code cells below** to generate the dataframe and the GeoJSON file for the polygon objects.

In [6]:
BostonHousing = pd.read_csv('E:/UIUC/IS590_dv/Final/ast2018full.csv')

In [7]:
#select only residential building
BostonResidential = BostonHousing[BostonHousing['LU'].isin(['R1','R2','R3','R4'])].reset_index()
#convert zip code into five digit zip
BostonResidential['ZIPCODE'] = ['0'+str(int(i)) for i in BostonResidential['ZIPCODE']]
#replace 0 values of year built and year remodeled with NaN
#(each column is checked against its own zeros)
BostonResidential['YR_BUILT'] = BostonResidential['YR_BUILT'].replace(0, np.nan)
BostonResidential['YR_REMOD'] = BostonResidential['YR_REMOD'].replace(0, np.nan)
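
The ZIP code step above restores the leading zero that is lost when ZIP codes are read as numbers. As a minimal sketch of the same idea in plain Python, the helper below pads to five characters and passes missing values through; `normalize_zip` is a hypothetical name and is not part of the notebook.

```python
import math

def normalize_zip(value):
    """Return a five-character ZIP string, or None when the value is missing.

    Hypothetical helper for illustration only.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # int() drops a trailing ".0" from floats; zfill restores leading zeros
    return str(int(value)).zfill(5)

print(normalize_zip(2134.0))
print(normalize_zip(12345))
```

Padding with `zfill` rather than prepending a fixed `'0'` also leaves values that already have five digits unchanged.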

In [8]:
#read in geojson file
with open('ZIP_Codes.geojson','r') as jsonFile:
    data = json.load(jsonFile)
tmp = data

#processing geojson file
zip_name = list(BostonResidential['ZIPCODE'].unique())
geozips = []
zip_code_list = []
for i in range(len(tmp['features'])):
    if tmp['features'][i]['properties']['ZIP5'] in zip_name:
        geozips.append(tmp['features'][i])
        zip_code_list.append(tmp['features'][i]['properties']['ZIP5'])
        
new_json = {}
new_json['type'] = 'FeatureCollection'
new_json['features'] = geozips

#write the filtered features; the with block closes the file
with open('update-file.json','w') as outFile:
    json.dump(new_json, outFile, sort_keys = True, indent = 4,
              separators = (',',': '))

In [9]:
with open('update-file.json','r') as temp:
    data = json.load(temp)
geodata = data
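
The GeoJSON processing above keeps only the polygons whose `ZIP5` property appears in the housing data. A minimal sketch of that filtering step on a tiny made-up feature collection might look like this; `filter_features` and the sample features are assumptions for illustration, not part of the notebook.

```python
import json

def filter_features(geojson, wanted_zips, key="ZIP5"):
    """Return a new FeatureCollection with only the features whose
    properties[key] is in wanted_zips. Hypothetical helper."""
    wanted = set(wanted_zips)  # set membership is faster than list lookup
    kept = [f for f in geojson["features"]
            if f["properties"].get(key) in wanted]
    return {"type": "FeatureCollection", "features": kept}

# Made-up sample data; geometry is omitted for brevity
sample = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"ZIP5": "02134"}, "geometry": None},
    {"type": "Feature", "properties": {"ZIP5": "99999"}, "geometry": None},
]}
filtered = filter_features(sample, ["02134", "02135"])
print(json.dumps(filtered, indent=2))
```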

In [10]:
#numerical_selected = ['AV_BLDG','AV_TOTAL','AV_LAND','GROSS_TAX']
#categorical_selected = ['LU','R_BLDG_STYL','R_OVRALL_CND']
numerical_selected = ['Assessed building value',
                      'Assessed property value',
                      'Assessed land value',
                      'Tax bill amount']
categorical_selected = ['Land of Use',
                       'Building style',
                       'Overall condition']

numerical_dict = {'AV_BLDG': 'Assessed building value',
                  'AV_TOTAL': 'Assessed property value',
                  'AV_LAND':'Assessed land value',
                  'GROSS_TAX': 'Tax bill amount'}
categorical_dict = {'LU':'Land of Use',
                    'R_BLDG_STYL':'Building style',
                    'R_OVRALL_CND':'Overall condition'}
categorical_sub_dict = {
    'LU':{'R1':'One-Family',
          'R2':'Two-Family',
          'R3':'Three-Family',
          'R4':'Four or more'},
    'R_BLDG_STYL':{'BL':'Bi-Level', 'DX':'Duplex', 'SL':'Split Level',
            'BW':'Bungalow', 'L':'Tri-Level', 'TF':'Two-Family Stack',
            'CL':'Colonial', 'OT':'Other', 'TD':'Tudor','TL':'TL',
            'CN':'Contemporary', 'RE':'Row End', 'SD':'Semi-Detached',
            'CP':'Cape', 'RM':'Row Middle', 'VT':'Victorian',
            'CV':'Conventional', 'RN':'Ranch',
            'DK':'Decker', 'RR':'Raised Ranch'},
    'R_OVRALL_CND':{'A':'Average','E':'Excellent','F':'Fair',
            'G': 'Good','P':'Poor'}}
def get_key(dic,val):
    return list(dic.keys())[list(dic.values()).index(val)]
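
`get_key` performs a reverse lookup: given a display label, it returns the column name. One possible alternative, sketched here under the assumption that labels are unique, is to build the inverse dictionary once; `invert` is a hypothetical helper and the two-entry dictionary is a shortened copy of `numerical_dict`.

```python
def invert(mapping):
    """Return a label -> key dictionary; raise if two keys share a label."""
    inverse = {}
    for key, label in mapping.items():
        if label in inverse:
            raise ValueError(f"duplicate label: {label!r}")
        inverse[label] = key
    return inverse

labels = {"AV_BLDG": "Assessed building value",
          "AV_LAND": "Assessed land value"}
label_to_column = invert(labels)
print(label_to_column["Assessed land value"])
```

The explicit duplicate check matters because a list-based lookup such as `get_key` silently returns the first match when two keys share a label.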

#### Build Interaction

In [11]:
select_type_widgets = widgets.Dropdown(options=['Price Map','Explore Other Features'],value='Price Map',
                                        description='Visual Type:',disabled=False)
select_group_widgets = widgets.Dropdown(options=['Assessed building value'],value='Assessed building value',description='Visual Group:',disabled=False)
select_cate_widgets = widgets.Dropdown(options=[' '],value=' ',description='Visual Group:',disabled=False)
creat_map_button = widgets.Button(description='Generate Map',disabled=False,button_style='', 
                                    tooltip='Click to visualize selected data')
Selection_type = widgets.Label()
Group_type = widgets.Label()
Cate_type = widgets.Label()
out = widgets.Output()

def selection_changed_1(event):
    if event['new'] == 'Price Map':
        select_group_widgets.options = numerical_selected
        select_group_widgets.value = numerical_selected[0]
        Selection_type.value = 'Price map is the average assessed value by Zipcode'
        Cate_type.value = ' '
        select_cate_widgets.options = ' '
    elif event['new'] == 'Explore Other Features':
        select_group_widgets.options = categorical_selected
        select_group_widgets.value = categorical_selected[0]
        Selection_type.value = 'View the number of building in different category'

def selection_changed_2(event):
    Group_type.value = ' '
    if select_type_widgets.value != 'Price Map':
        real_col_name = get_key(categorical_dict,event['new'])
        drop_list = list(BostonResidential[real_col_name].unique())#Drop List is the real val in data
        #Convert into option values in dictionary
        option_list = [categorical_sub_dict[real_col_name][i]  for i in drop_list if str(i) != 'nan']
        select_cate_widgets.options = option_list
        select_cate_widgets.value = option_list[0]
    else:
        select_cate_widgets.options = ' '
    
def selection_changed_3(event):
    Cate_type.value = 'Selected '+ str(event['new']) + ' out of '+str(select_group_widgets.value)
    
def creat_click(event):
    select_type = select_type_widgets.value
    if select_type == 'Price Map':
        column = get_key(numerical_dict,select_group_widgets.value)
        category = select_cate_widgets.value
    else:
        column = get_key(categorical_dict,select_group_widgets.value)
        category =  get_key(categorical_sub_dict[column],select_cate_widgets.value)
    with out:
        from IPython.display import clear_output
        clear_output(True)
        m = creat_map(select_type,column,category)
        display(m)
        
select_type_widgets.observe(selection_changed_1, 'value')
select_group_widgets.observe(selection_changed_2, 'value')
select_cate_widgets.observe(selection_changed_3, 'value')
creat_map_button.on_click(creat_click)

#### Map function

This is the plotting function, built with the Folium package.

In [12]:
def creat_map(select_type,column,category):
    boston_geo = r'update-file.json'
    
    def count_distribution(df, location, subgroup):
        group_counts = pd.DataFrame(df.groupby([location,subgroup]).size().unstack(1))
        group_counts.reset_index(inplace = True)
        return group_counts
    def continuous_var(df,var_name):
        group = df.groupby('ZIPCODE')[var_name].mean().reset_index()
        #group.reset_index(inplace = True)
        return group
    
    #dat = subgroup_distribution(BostonResidential,'ZIPCODE',column)
    if column in numerical_dict.keys():
        map_data = continuous_var(BostonResidential,column)
        category = column
    elif column in categorical_dict.keys():
        map_data = count_distribution(BostonResidential,'ZIPCODE',column)[['ZIPCODE',category]].fillna(0)
    #generate legend name
    if select_type == 'Price Map':
        legend = 'Average of ' + str(numerical_dict[column]) + ' by zipcode'
    elif select_type == 'Explore Other Features':
        legend = 'Number of residential buildings by zipcode. (' + str(categorical_dict[column]) \
                + '=' + str(categorical_sub_dict[column][category]) + ')'
    map_dict = map_data.set_index('ZIPCODE')[category].to_dict()
    
    color_scale = LinearColormap(['yellow','green'], 
                                 vmin = min(map_dict.values()), 
                                 vmax = max(map_dict.values()))
    color_scale.caption = legend
    
    def get_color(feature):
        value = map_dict.get(feature['properties']['ZIP5'])
        return color_scale(value)
    
    m = folium.Map(location = [42.3601,-71.0589], zoom_start = 11)
    folium.GeoJson(data = geodata,
            style_function = lambda feature: {
            'fillColor': get_color(feature),
            'fillOpacity': 0.5,
            'color' : 'white',
            'weight' : 0.7}).add_to(m)
    m.add_child(color_scale)
    return(m)
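
The map colors each ZIP code by placing its value on a scale between yellow (minimum) and green (maximum). The sketch below illustrates the idea with plain linear interpolation in RGB; it is a conceptual illustration only, and branca's `LinearColormap` may compute its colors differently. `lerp_color` and the gray fallback for missing values are assumptions.

```python
def lerp_color(value, vmin, vmax, low=(255, 255, 0), high=(0, 128, 0)):
    """Blend two RGB colors: `low` at vmin, `high` at vmax.

    Hypothetical helper; returns a gray fallback when value is None.
    """
    if value is None:
        return "#cccccc"
    if vmax == vmin:
        t = 0.0
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values outside [vmin, vmax]
    rgb = [round(a + t * (b - a)) for a, b in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)

print(lerp_color(0, 0, 10))    # color at the minimum
print(lerp_color(10, 0, 10))   # color at the maximum
```

A fallback color is useful because `map_dict.get(...)` returns `None` for any polygon whose ZIP code has no entry in the data.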

#### Final Display

In this map, users can view Boston colored by ZIP code area. Two kinds of visualization are provided: **Visual for Price** and **Visual for some other categorical features**. For the price (assessed value) visual, the map shows the average value by ZIP code. For a categorical variable, users can pick the level of interest, and the map shows the number of residential buildings at that level in each ZIP code.

Note that the map does not update automatically when the selections change. After altering the options, click 'Generate Map' to refresh.

In [13]:
display(widgets.VBox([widgets.HBox([widgets.VBox([select_type_widgets,Selection_type]),
                                    widgets.VBox([select_group_widgets,Group_type]),
                                    widgets.VBox([select_cate_widgets,Cate_type])]),
                      creat_map_button,out]))

#### Observations and summary

From the map, we observed that the central (downtown) area of Boston has the highest assessed property value (the combination of land value and building value), which is no surprise. Comparing the building value map with the land value map, however, the gap between central and non-central areas is much less pronounced for building value, while it remains significant for land value.

From the map of the categorical variable `Land of Use`, after trying each level, we observed that residential buildings in the south of Boston house fewer families per building; that is, one-family buildings are more common in the south. Buildings for four or more families are more common in the north. We think this means there are more apartment buildings in the north and more single-family houses in the south.

From the map of the categorical variable `Overall condition`, we found that houses in the downtown area are in excellent condition, which is expected. However, some buildings around downtown are in poor condition; we think this is because buildings in those areas are older. Overall, residential buildings in the south are in good condition and have lower assessed values.

----
### Time Series Visualization

In this part we focus on how features of buildings in Boston change with the year built (`YR_BUILT`) or the year remodeled (`YR_REMOD`).

#### Time Series Line Chart

In [14]:
@widgets.interact(numeric = ["AV_TOTAL", "AV_LAND", "AV_BLDG", "GROSS_TAX", "LAND_SF",  "GROSS_AREA", "LIVING_AREA"],
                  categorical = ["R_BLDG_STYL", "R_ROOF_TYP", "R_HEAT_TYP", "R_AC", "R_OVRALL_CND", "R_VIEW", "LU"],
                  year = ["YR_BUILT", "YR_REMOD"])

def get_line(numeric, categorical, year):
    plt.figure(figsize=(20,10))
    #skip missing categories so every line gets a matching legend label
    for l in BostonResidential[categorical].dropna().unique():
        plt.plot(BostonResidential[BostonResidential[categorical]== l].groupby(year)[numeric].mean(), label = l)
    plt.xlabel(year)
    plt.ylabel(numeric)
    plt.legend(loc='upper left')

#### Observations for Line Chart
From the time series graph with the default selections "AV_TOTAL", "R_BLDG_STYL", and "YR_BUILT", we can clearly see how the mean assessed value changes with the year built. By adding the categorical filter, we can also compare the mean value across building styles. We plot the mean rather than the sum so that the result is not driven by how many buildings were built in a given year. In the graph, the building styles RM (Row Middle) and VT (Victorian) tend to have higher values than the other styles.
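
To illustrate why the mean is used instead of the sum, the sketch below compares both on made-up numbers (the years and values are invented for illustration and do not come from the dataset). The year with more buildings has the larger total even though its typical value is lower.

```python
from collections import defaultdict
from statistics import mean

# Made-up (year built, assessed value) pairs
records = [(1900, 300_000), (1900, 500_000), (1900, 400_000),
           (1950, 450_000)]

by_year = defaultdict(list)
for year, value in records:
    by_year[year].append(value)

totals = {year: sum(vals) for year, vals in by_year.items()}
averages = {year: mean(vals) for year, vals in by_year.items()}

print(totals)    # the 1900 total is larger because three buildings contribute
print(averages)  # the 1950 average is larger
```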

If we change the categorical filter to R_OVRALL_CND, it becomes clearer that buildings in Excellent overall condition tend to have higher values than buildings in other conditions. The earliest buildings rated Excellent were built around 1800. We can also observe that the average value of buildings in Excellent overall condition rises in recent years.

#### Time Series Stacked Bar Chart

In [15]:
@widgets.interact(numeric = ["COUNT", "AV_LAND", "AV_BLDG","AV_TOTAL", "GROSS_TAX", "LAND_SF",  "GROSS_AREA", "LIVING_AREA"],
                  categorical = ["R_AC", "R_OVRALL_CND", "R_BLDG_STYL", "R_ROOF_TYP", "R_HEAT_TYP", "R_VIEW", "LU"],
                  year = ["YR_REMOD", "YR_BUILT"],
                  present = ["%", "value"])


def get_bar(numeric, categorical, year, present):
    if numeric == "COUNT":
        bar = BostonResidential.groupby([year,categorical]).size()
    else:
        bar = BostonResidential.groupby([year,categorical])[numeric].mean()
    if present == "%":
        #share of each category within its year, in percent
        percent = 100 * bar / bar.groupby(level=0).transform('sum')
        percent.unstack().plot(kind='bar',stacked=True, figsize = (20, 10))
        plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
    else:
        bar.unstack().plot(kind='bar',stacked=True, figsize = (20, 10))
    plt.show()
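
The "%" option rescales each year's bar so that its categories add up to 100. A minimal plain-Python sketch of that normalization, using made-up counts and a hypothetical helper `to_percent_shares`, could be:

```python
def to_percent_shares(counts):
    """Convert {year: {category: value}} into the same shape with each
    value expressed as a percentage of that year's total."""
    shares = {}
    for year, by_cat in counts.items():
        total = sum(by_cat.values())
        shares[year] = {cat: (100 * v / total if total else 0.0)
                        for cat, v in by_cat.items()}
    return shares

# Made-up counts per year and category
counts = {2000: {"C": 1, "N": 3}, 2010: {"C": 6, "N": 2}}
shares = to_percent_shares(counts)
print(shares)
```

The percentages are easiest to read with "COUNT"; when a mean is selected instead, each share is a fraction of the sum of category means for that year, which is harder to interpret.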

#### Observations for Stacked Bar Chart
For the stacked bar chart, we added a new option called "COUNT", which makes changes in the mix of building types easier to see. With the default settings, year "YR_REMOD" and categorical "R_AC", we can clearly see that recently remodeled buildings are more likely to have central A/C. The same holds for "R_OVRALL_CND": recently remodeled buildings tend to be in excellent overall condition. This trend does not carry over to the view rating ("R_VIEW"), however.

----
### Filter Scatter Chart 

In [16]:
import collections
import ipywidgets
import bqplot

In [17]:
BostonHousing['TotalBath']=BostonHousing['R_FULL_BTH']*1+0.5*BostonHousing['R_HALF_BTH']

In [18]:
BostonHousing.groupby('TotalBath').count()   ##24[0.....13.5,21]
BostonHousing.groupby('R_BDRMS').count()  ####18[0:18]
BostonHousing['LIVING_AREA'].describe() #####[0:25542][1598:3190]

count    1.678610e+05
mean     3.560676e+03
std      2.566108e+04
min      0.000000e+00
25%      6.750000e+02
50%      1.293000e+03
75%      2.434000e+03
max      1.940476e+06
Name: LIVING_AREA, dtype: float64

In [19]:
@ipywidgets.interact(Style = ['RE','CL','CV','BL', 'DX', 'SL','BW', 
                               'OT', 'TD','TL', 'RM','CN',  'TF',
                              'SD', 'CP', 'L', 'VT',  'RN',
                              'DK','RR'],
                    Condition=['A','E','F','G','P'],
                    LU = ['R1', 'R2', 'R3', 'R4'],
                    bedroommin = (0.0, 18, 1.0), bedroommax = (0.0, 18.0, 1.0),
                    bathroommin = (0.0, 24.0, 0.5), bathroommax = (0.0, 24.0, 0.5),
                    areamin =   (0.0, 25542, 54),   areamax= (0.0, 25542, 54))
def get_scatter(Style, Condition,LU, bedroommin,bedroommax,bathroommin,bathroommax, areamin, areamax):

   
    x_sc = bqplot.LinearScale()
    y_sc = bqplot.LinearScale()
    
    
    x_ax = bqplot.Axis(scale = x_sc, label = 'YR_BUILT')
    y_ax = bqplot.Axis(scale = y_sc, label = 'AV_BLDG', orientation = 'vertical')
    
    
    tooltip = bqplot.Tooltip(fields = ["x", "y"])
    
    #filter data
    new_df = BostonHousing[BostonHousing['R_BLDG_STYL']==Style] 
    rnew_df = new_df[new_df['R_OVRALL_CND']==Condition] 
    renew1_df = rnew_df[rnew_df['LU']==LU] 
    #bathroom sliders filter TotalBath; bedroom sliders filter R_BDRMS
    renew2_df = renew1_df[(renew1_df['TotalBath']>=bathroommin) & (renew1_df['TotalBath']<=bathroommax)]
    renew3_df = renew2_df[(renew2_df['R_BDRMS']>=bedroommin) & (renew2_df['R_BDRMS']<=bedroommax)]
    renew_df = renew3_df[(renew3_df['LIVING_AREA']>=areamin) & (renew3_df['LIVING_AREA']<=areamax)]
    
    scatters = bqplot.Scatter(x = renew_df['YR_BUILT'],
                              y = renew_df['AV_BLDG'],                           
                              scales = {'x': x_sc, 'y': y_sc },
                              tooltip = tooltip)

    selector = bqplot.interacts.FastIntervalSelector(
            scale = x_sc, marks = [scatters])
    fig = bqplot.Figure(marks = [scatters], axes = [x_ax, y_ax], interaction = selector)
    tb = bqplot.Toolbar(figure=fig)
    

    display(ipywidgets.VBox([fig, tb]))
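
The filtering in `get_scatter` combines an exact match on the categorical fields with range checks on the numeric ones. The sketch below shows the same logic on a few made-up records, pairing each range with its own column; `total_baths` and `matches` are hypothetical helpers, not functions from the notebook.

```python
def total_baths(full, half):
    # A half bath counts as 0.5, as in the TotalBath column
    return full + 0.5 * half

def matches(house, style, bedrooms, baths, area):
    """Check one record against a style and three (min, max) ranges."""
    bed_lo, bed_hi = bedrooms
    bath_lo, bath_hi = baths
    area_lo, area_hi = area
    n_baths = total_baths(house["R_FULL_BTH"], house["R_HALF_BTH"])
    return (house["R_BLDG_STYL"] == style
            and bed_lo <= house["R_BDRMS"] <= bed_hi
            and bath_lo <= n_baths <= bath_hi
            and area_lo <= house["LIVING_AREA"] <= area_hi)

# Made-up records for illustration
houses = [
    {"R_BLDG_STYL": "CL", "R_BDRMS": 3, "R_FULL_BTH": 1, "R_HALF_BTH": 1, "LIVING_AREA": 1600},
    {"R_BLDG_STYL": "CL", "R_BDRMS": 6, "R_FULL_BTH": 3, "R_HALF_BTH": 0, "LIVING_AREA": 3200},
    {"R_BLDG_STYL": "RN", "R_BDRMS": 3, "R_FULL_BTH": 2, "R_HALF_BTH": 0, "LIVING_AREA": 1500},
]
selected = [h for h in houses
            if matches(h, "CL", bedrooms=(2, 4), baths=(1, 2), area=(1000, 2000))]
print(len(selected))
```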

#### Observations by Filter

This filter scatter chart aims to help users find houses that meet their needs. For the categorical variables, we focus on building style, overall condition, and land use type, since these are what we look at first when searching for a house. For the numerical variables, we consider the living area, the number of bathrooms, and the number of bedrooms. We therefore provide three dropdown menus and six sliders for filtering. The number of bathrooms ranges from 0 to 21, but most buildings have two to eight, so the figure may not change when the maximum number of bathrooms is moved. The number of bedrooms ranges from 0 to 18, but most buildings have two to eleven. The living area ranges from 0 to 25542, while for most buildings it lies between 1598 and 3190. The scatter chart shows all houses that meet the selected criteria, and the tooltip shows the year built and the assessed building value (`AV_BLDG`) of each house.