# **Chapter 2 - Customizing Visualizations**

**Colors, legend, and theme**

For your first assignment, the estate agents would like a visualization to represent the relationship between the year a property was built and its total land area, factoring in how this varies between the Northern and Southern regions of Melbourne. You decide to use one of Bokeh's custom themes for the plot.

Two subsets of melb have been created based on which region a property is located in, north and south, as shown below:
```
north = melb.loc[melb["region"] == "Northern"]
south = melb.loc[melb["region"] == "Southern"]
```
A figure, fig, has been preloaded for you. You will update the theme, and add circle glyphs using different colors for each region. You will then add a legend_label so they can be easily distinguished.

In [1]:
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_file, show
from bokeh.io import output_notebook

# Enable viewing Bokeh plots in the notebook
output_notebook()

In [2]:
melb = pd.read_csv('melb_clean.csv')

In [3]:
melb.head()

Unnamed: 0.1,Unnamed: 0,rooms,type,price,date,distance,bedrooms,bathrooms,car,land_area,building_area,year_built,council_area,region
0,0,2,h,1480000.0,3/12/2016,2.5,2.0,1.0,1.0,202.0,,,Yarra,Northern
1,1,2,h,1035000.0,4/02/2016,2.5,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra,Northern
2,2,3,h,1465000.0,4/03/2017,2.5,3.0,2.0,0.0,134.0,150.0,1900.0,Yarra,Northern
3,3,3,h,850000.0,4/03/2017,2.5,3.0,2.0,1.0,94.0,,,Yarra,Northern
4,4,4,h,1600000.0,4/06/2016,2.5,3.0,1.0,2.0,120.0,142.0,2014.0,Yarra,Northern


In [4]:
north = melb.loc[melb["region"] == "Northern"]
south = melb.loc[melb["region"] == "Southern"]

In [5]:
# Import curdoc
from bokeh.io import curdoc

# Change theme to contrast
curdoc().theme = "contrast"
fig = figure(x_axis_label="Year Built", y_axis_label="Land Area (Meters Squared)")

# Add north circle glyphs
fig.circle(x=north["year_built"], y=north["land_area"], color="yellow", legend_label="North")

# Add south circle glyphs
fig.circle(x=south["year_built"], y=south["land_area"], color="red", legend_label="South")

output_file(filename="north_vs_south.html")
show(fig)
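
Once glyphs carry a `legend_label`, the legend itself can be adjusted through `fig.legend`. The following is a minimal sketch rather than part of the exercise: the year and land-area values are made up (they are not taken from melb), and it uses `fig.scatter` instead of `fig.circle` on the assumption that a recent Bokeh release is installed.

```python
from bokeh.io import curdoc
from bokeh.plotting import figure

# Hypothetical sample values standing in for the north and south subsets
north_years, north_land = [1900, 1950, 2005], [156.0, 134.0, 120.0]
south_years, south_land = [1920, 1975, 2010], [320.0, 410.0, 275.0]

# Set the theme on the current document before the plot is displayed
curdoc().theme = "contrast"

fig = figure(x_axis_label="Year Built", y_axis_label="Land Area (Meters Squared)")
fig.scatter(x=north_years, y=north_land, color="yellow", legend_label="North")
fig.scatter(x=south_years, y=south_land, color="red", legend_label="South")

# Move the legend and let a click on an entry hide that region's glyphs
fig.legend.location = "top_left"
fig.legend.click_policy = "hide"
```

Calling `show(fig)` afterwards would display the plot with one legend entry per region.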

**Customizing glyphs**

The estate agents have requested a plot displaying the relationship between the year a property was built and its distance to the Central Business District (CBD), distinguishing between houses, units, and townhouses. You decide to use different colors and glyphs for each of the three property types.

Three subsets of melb have been created and preloaded for you:
```
houses = melb.loc[melb["type"] == "h"]
units = melb.loc[melb["type"] == "u"]
townhouses = melb.loc[melb["type"] == "t"]
```

In [6]:
houses = melb.loc[melb["type"] == "h"]
units = melb.loc[melb["type"] == "u"]
townhouses = melb.loc[melb["type"] == "t"]

In [7]:
curdoc().theme = "light_minimal"

# Create figure
fig = figure(x_axis_label = "Year Built", y_axis_label="Distance from CBD (km)")

# Add circle glyphs for houses
fig.circle(x=houses["year_built"], y=houses["distance"], legend_label="House", color="purple")

# Add square glyphs for units
fig.square(x=units["year_built"], y=units["distance"], legend_label="Unit", color="red")

# Add triangle glyphs for townhouses
fig.triangle(x=townhouses["year_built"], y=townhouses["distance"], legend_label="Townhouse", color="green")
output_file(filename="year_built_vs_distance_by_property_type.html")
show(fig)
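
When several subsets differ only in marker and color, the three glyph calls can be written as one loop. This is a sketch under stated assumptions: the `styles` dictionary and its sample coordinates are hypothetical, and `fig.scatter` with a `marker` argument is used in place of the separate `circle`, `square`, and `triangle` methods.

```python
from bokeh.plotting import figure

# Hypothetical (year_built, distance) samples for each property type
styles = {
    "House": {"marker": "circle", "color": "purple", "x": [1900, 1960], "y": [2.5, 8.0]},
    "Unit": {"marker": "square", "color": "red", "x": [1970, 2005], "y": [4.1, 6.3]},
    "Townhouse": {"marker": "triangle", "color": "green", "x": [1995, 2014], "y": [9.2, 11.5]},
}

fig = figure(x_axis_label="Year Built", y_axis_label="Distance from CBD (km)")

# One scatter call per property type, each with its own marker, color, and legend entry
for label, style in styles.items():
    fig.scatter(x=style["x"], y=style["y"], marker=style["marker"],
                color=style["color"], legend_label=label)
```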

**Average building size**

The estate agents are interested in understanding if there have been changes in building size over time.

You will group the melb dataset by date and calculate the average building size. You will then convert the grouped DataFrame into a Bokeh source object and build a line plot.

In [24]:
from bokeh.models import ColumnDataSource

date_format="%d/%m/%Y"
melb["date"] = pd.to_datetime(melb['date'], format=date_format)

# Group by date and calculate average building size
prop_size = melb.groupby("date", as_index=False)["building_area"].mean()
source = ColumnDataSource(data=prop_size)

# Create the figure
fig = figure(x_axis_label="Date", y_axis_label="Building Size (Meters Squared)", x_axis_type="datetime")

# Add line glyphs
fig.line(x="date",y="building_area",source=source)

# Generate the HTML file
output_file("property_size_by_date.html")
show(fig)
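
The explicit format string matters because the dates are written day first. The sketch below uses a few made-up rows (not the real melb data) to show how `"3/12/2016"` is parsed and what the grouped result looks like.

```python
import pandas as pd

# Hypothetical rows using the same day/month/year layout as the date column
sample = pd.DataFrame({
    "date": ["3/12/2016", "4/02/2016", "4/02/2016"],
    "building_area": [100.0, 79.0, 121.0],
})

# With an explicit format, "3/12/2016" is read as 3 December 2016
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%Y")

# One row per date, holding the mean building area for that date
prop_size = sample.groupby("date", as_index=False)["building_area"].mean()
```

Here `prop_size` has two rows, sorted by date, and the two sales on 4 February 2016 are averaged into a single value of 100.0.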

**Sales over time**

The estate agents have now asked you to examine housing market activity by visualizing changes in total sales over time. The melb DataFrame has been grouped on date, this time calculating total sales using the sum of the price column, and stored as melb_sales:
```
melb_sales = melb.groupby("date", as_index=False)["price"].sum()
```
source has been created from melb_sales, and preloaded for you. Your task is to format a plot to display the visualization with meaningful axes allowing for insights to be drawn.

In [25]:
melb_sales = melb.groupby("date", as_index=False)["price"].sum()

In [26]:
melb_sales.head()

Unnamed: 0,date,price
0,2016-01-28,2018000.0
1,2016-02-04,23612750.0
2,2016-04-16,225348800.0
3,2016-04-23,91392950.0
4,2016-05-07,264289250.0


In [27]:
source = ColumnDataSource(data=melb_sales)

In [28]:
# Import the numeral and datetime tick formatters
from bokeh.models import NumeralTickFormatter, DatetimeTickFormatter
fig = figure(x_axis_label="Date", y_axis_label="Sales")

# Add line glyphs
fig.line(x="date", y="price",source=source)

# Format the x-axis tick labels as abbreviated month and year
fig.xaxis[0].formatter = DatetimeTickFormatter(months="%b %Y")

# Format the y-axis tick labels as abbreviated dollar amounts
fig.yaxis[0].formatter = NumeralTickFormatter(format="$0a")

output_file(filename="melbourne_sales.html")
show(fig)
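
To get a feel for what an abbreviated format such as `"$0a"` does to large values, the idea can be approximated in plain Python. This is only an illustration: `abbreviate` is a hypothetical helper, not Bokeh's implementation, and the real tick labels are produced in the browser by `NumeralTickFormatter`.

```python
def abbreviate(value, decimals=0):
    """Roughly mimic an abbreviated dollar format such as "$0a" or "$0.0a"."""
    # Try the largest suffix first: trillions, billions, millions, thousands
    for threshold, suffix in [(1e12, "t"), (1e9, "b"), (1e6, "m"), (1e3, "k")]:
        if abs(value) >= threshold:
            return f"${value / threshold:.{decimals}f}{suffix}"
    # Values below one thousand are shown without a suffix
    return f"${value:.{decimals}f}"


daily_total = abbreviate(225348800)             # "$225m"
single_sale = abbreviate(1480000, decimals=1)   # "$1.5m"
```

The `decimals` argument plays the role of the digits after the point in the format string, so `"$0.0a"` corresponds to `decimals=1`.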

**Categorical column subplots**

The estate agents would like you to analyze how property size and amount of land vary by region in Melbourne. With the column layout, you can create two subplots displaying these relationships, using region as the x-axis.

The melb DataFrame has been grouped by region, and the average values for land_area and building_area have been calculated. This has been set up as a Bokeh data object called source, preloaded for you.

In [29]:
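# Select both columns with a list (double brackets); tuple-style selection fails in recent pandas
region_grouped = melb.groupby("region", as_index=False)[["land_area", "building_area"]].mean()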
region_grouped


Unnamed: 0,region,land_area,building_area
0,Eastern,634.133923,178.001521
1,Eastern Victoria,2949.698113,183.645
2,Northern,568.948072,124.177723
3,Northern Victoria,3355.463415,1746.374286
4,South-Eastern,613.991111,162.734296
5,Southern,509.252183,153.580962
6,Western,493.606852,144.697623
7,Western Victoria,655.5,134.68381


In [30]:
source = ColumnDataSource(data=region_grouped)

In [35]:
curdoc().theme = "dark_minimal"

# Import column
from bokeh.layouts import column
regions = ["Eastern", "Southern", "Western", "Northern"]
building_size = figure(x_axis_label="Region", y_axis_label="Building Size (Meters Squared)",
                       x_range=regions)
land_size = figure(x_axis_label="Region", y_axis_label="Land Size (Meters Squared)",
                   x_range=regions)

# Add bar glyphs
building_size.vbar(x="region", top="building_area", source=source)
land_size.vbar(x="region", top="land_area", source=source)

# Generate HTML file and display the subplots
output_file(filename="my_first_column.html")
show(column(building_size, land_size))
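
Note that the grouped data holds more regions than the four listed in `x_range`. The figure only has positions for the listed regions, so the remaining rows have nowhere to be drawn. If you prefer the data to match the axis explicitly, one option is to filter it first. The sketch below uses made-up averages, and `metro` is a hypothetical name.

```python
import pandas as pd

# Hypothetical averages for a few regions (not the real melb figures)
region_grouped = pd.DataFrame({
    "region": ["Eastern", "Eastern Victoria", "Northern", "Southern", "Western"],
    "building_area": [178.0, 183.6, 124.2, 153.6, 144.7],
})

regions = ["Eastern", "Southern", "Western", "Northern"]

# Keep only the rows whose region appears on the x-axis
metro = region_grouped[region_grouped["region"].isin(regions)]
```

`metro` could then be passed to `ColumnDataSource` in place of the full grouped DataFrame.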

**Size, location, and price**

Next, the estate agents would like to understand how price is related to the size of the property and its distance from the Central Business District (CBD).

In this case, the y-axis of both figures will have the same units, so making a row of subplots is an appropriate choice. source has been set up as a Bokeh object using the melb dataset, and preloaded for you.

In [44]:
# Keep only the rows with a building area under 2000 square meters
melb_new = melb.loc[melb["building_area"] < 2000]
source = ColumnDataSource(data=melb_new)

In [45]:
curdoc().theme = "light_minimal"

# Import row
from bokeh.layouts import row
building_size = figure(x_axis_label="Building Area (Meters Squared)", y_axis_label="Sales")
distance = figure(x_axis_label="Distance from CBD (km)", y_axis_label="Sales")

# Add circle glyphs
building_size.circle(x="building_area", y="price", source=source)
distance.circle(x="distance", y="price", source=source)

# Update the y-axis format for both figures
building_size.yaxis[0].formatter = NumeralTickFormatter(format="$0a")
distance.yaxis[0].formatter = NumeralTickFormatter(format="$0a")

# Display the subplots
output_file(filename="my_first_row.html")
show(row(building_size, distance))
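
Because both subplots show price on the y-axis, they can also share a single range so that the two scales stay identical. The following is a sketch with made-up points, assuming the second figure is created after the first so that it can reuse its `y_range`.

```python
from bokeh.layouts import row
from bokeh.plotting import figure

building_size = figure(x_axis_label="Building Area (Meters Squared)", y_axis_label="Sales")

# Reuse the first figure's y_range so both subplots share one price scale
distance = figure(x_axis_label="Distance from CBD (km)", y_axis_label="Sales",
                  y_range=building_size.y_range)

# Hypothetical sample points for each subplot
building_size.scatter(x=[79.0, 150.0, 142.0], y=[1035000, 1465000, 1600000])
distance.scatter(x=[2.5, 6.4, 11.2], y=[1480000, 920000, 640000])

layout = row(building_size, distance)
```

With a shared range, panning or zooming the y-axis of one subplot moves the other as well.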

**Using gridplot**

The estate agents would like to examine how the relationship between property size and price varies across the four regions of Melbourne:

"Northern", "Western", "Eastern", and "Southern".

This is a great opportunity to use gridplot, displaying one subplot for each region!

In [46]:
# Import gridplot
from bokeh.layouts import gridplot
plots = []

# Create one subplot per region
for region in ["Northern", "Western", "Southern", "Eastern"]:
  df = melb.loc[melb["region"] == region]
  source = ColumnDataSource(data=df)
  fig = figure(x_axis_label="Building Area (Meters Squared)", y_axis_label="Price")
  fig.circle(x="building_area", y="price", source=source, legend_label=region)
  fig.yaxis[0].formatter = NumeralTickFormatter(format="$0a")
  plots.append(fig)

# Display plot
output_file(filename="gridplot.html")
show(gridplot(plots, ncols=2))
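
With `ncols=2`, the flat list of plots is laid out row by row, two per row. The plain-Python sketch below illustrates that arrangement using region names in place of figures; `to_rows` is a hypothetical helper, not a Bokeh function.

```python
def to_rows(items, ncols):
    """Split a flat list into rows of at most ncols items."""
    return [items[i:i + ncols] for i in range(0, len(items), ncols)]


regions = ["Northern", "Western", "Southern", "Eastern"]

# Two rows of two: the same positions the four subplots take in the grid
layout = to_rows(regions, ncols=2)
```

gridplot also accepts a nested list of rows like this one directly, in which case `ncols` is not needed.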

**Changing size**

The estate agents have fed back that the subplots are quite large! They have asked you to make the next round of visualizations a bit smaller. These will display the relationship between the year a property was built and a) its distance from the CBD and b) its building size.

You will manually specify the size of two subplots in row format.

In [47]:
source = ColumnDataSource(data=melb)

# Set up figures
distance_vs_year = figure(x_axis_label="Year Built", y_axis_label="Distance from CBD (km)", height=300, width=400)
building_size_vs_year = figure(x_axis_label="Year Built", y_axis_label="Building Size (Meters Squared)", height=300, width=400)

# Add circle glyphs to distance_vs_year
distance_vs_year.circle(x="year_built", y="distance", source=source)

# Add circle glyphs to building_size_vs_year
building_size_vs_year.circle(x="year_built", y="building_area", source=source)

# Generate HTML file and display plot
output_file(filename="custom_size_plot.html")
show(row(distance_vs_year, building_size_vs_year))

**High to low prices by region**

Now that you know how to sort a DataFrame, the estate agents have asked you to create a bar plot visualizing the average property price by region from largest to smallest.

regions has been created by grouping melb by region and calculating the average price, and preloaded for you:

```
regions = melb.groupby("region", as_index=False)["price"].mean()
```

In [48]:
regions = melb.groupby("region", as_index=False)["price"].mean()

In [55]:
# Sort df by price in descending order
regions = regions.sort_values("price", ascending=False)

# Create figure
fig = figure(x_range=regions["region"], x_axis_label="Region", y_axis_label="Average Price", width=1000)

# Add bar glyphs
fig.vbar(x=regions["region"], top=regions["price"], width=0.9)

# Format the y-axis to numeric format
fig.yaxis[0].formatter = NumeralTickFormatter(format="$0.0a")

output_file(filename="sorted_barplot.html")
show(fig)
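
The bars appear from highest to lowest because a categorical axis follows the order of the values given to `x_range`, and those values come from the sorted DataFrame. A small sketch with made-up prices shows the ordering step on its own:

```python
import pandas as pd

# Hypothetical average prices (not the real melb figures)
regions = pd.DataFrame({
    "region": ["Northern", "Southern", "Western"],
    "price": [900000.0, 1400000.0, 870000.0],
})

# Sort so the most expensive region comes first
regions = regions.sort_values("price", ascending=False)

# The sorted region names define the left-to-right order of the bars
x_range = regions["region"].tolist()
```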

**Creating nested categories**

For your final plot, the estate agents would like you to present property sales across the year, displaying months and quarters on the x-axis.

Some of the code to add months and quarters into the Melbourne dataset has been preloaded for you. The factors variable, which will represent months and their corresponding quarters, needs to be created. The data must also be grouped by these two newly created columns to calculate total sales by taking the sum of the "price" column.

In [56]:
melb["month"] = melb["date"].dt.month
quarters = {1: "Q1", 2:"Q1", 3:"Q1", 4:"Q2", 5:"Q2", 6:"Q2", 7:"Q3", 8:"Q3", 9:"Q3", 10:"Q4", 11:"Q4", 12:"Q4"}
melb["quarter"] = melb["month"].replace(quarters)
melb["month"] = melb["month"].replace({1:"January", 2:"February", 3:"March", 4:"April", 5:"May", 6:"June", 7:"July", 8:"August", 9:"September", 10:"October", 11:"November", 12:"December"})

# Create factors
factors = [("Q1", "January"), ("Q1", "February"), ("Q1", "March"),
           ("Q2", "April"), ("Q2", "May"), ("Q2", "June"),
           ("Q3", "July"), ("Q3", "August"), ("Q3", "September"),
           ("Q4", "October"), ("Q4", "November"), ("Q4", "December")]

# Calculate total sales by month and quarter
grouped_melb = melb.groupby(["month", "quarter"], as_index=False)["price"].sum()
grouped_melb.sort_values("quarter", inplace=True)
grouped_melb.head()

Unnamed: 0,month,quarter,price
3,February,Q1,489236600.0
4,January,Q1,2018000.0
7,March,Q1,773771600.0
0,April,Q2,966240600.0
6,June,Q2,1969722000.0
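
Notice in the output that the months within each quarter are in alphabetical rather than calendar order, because the month names are plain strings. The sketch below shows one way to build the factors programmatically and put grouped totals into calendar order. It uses made-up totals, and the names `month_names`, `grouped`, and `x` are hypothetical.

```python
import pandas as pd

# Month names in calendar order
month_names = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Pair each month with its quarter: positions 0-2 -> Q1, 3-5 -> Q2, and so on
factors = [(f"Q{i // 3 + 1}", name) for i, name in enumerate(month_names)]

# Hypothetical totals in the alphabetical order that groupby produces
grouped = pd.DataFrame({
    "month": ["April", "February", "January"],
    "quarter": ["Q2", "Q1", "Q1"],
    "price": [300.0, 200.0, 100.0],
})

# An ordered categorical makes sort_values follow the calendar
grouped["month"] = pd.Categorical(grouped["month"], categories=month_names, ordered=True)
grouped = grouped.sort_values("month")

# (quarter, month) pairs that line up row by row with the price column
x = list(zip(grouped["quarter"], grouped["month"]))
```

Building the pairs from the same DataFrame as the totals keeps every price attached to its own month, whatever order the rows happen to be in.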


**Visualizing sales by period**

Now that you have created your factors, it is time to build a bar plot visualizing sales per month, grouped into quarters!

grouped_melb, a pandas DataFrame containing one row for each month, its respective quarter, and total sales for that month, has been preloaded for you. Additionally, factors, which is a list of tuples containing each quarter and month pair, has also been preloaded.

In [57]:
grouped_melb = melb.groupby(["month", "quarter"], as_index=False)["price"].sum()

In [59]:
factors = [('Q1', 'January'),
 ('Q1', 'February'),
 ('Q1', 'March'),
 ('Q2', 'April'),
 ('Q2', 'May'),
 ('Q2', 'June'),
 ('Q3', 'July'),
 ('Q3', 'August'),
 ('Q3', 'September'),
 ('Q4', 'October'),
 ('Q4', 'November'),
 ('Q4', 'December')]

In [60]:
# Import NumeralTickFormatter and FactorRange
from bokeh.models import NumeralTickFormatter, FactorRange

# Create figure
fig = figure(x_range=FactorRange(*factors), y_axis_label="Sales")

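# Create bar glyphs, pairing each total with its own (quarter, month) factor.
# grouped_melb is ordered alphabetically by month, so passing factors directly would mislabel the bars.
fig.vbar(x=list(zip(grouped_melb["quarter"], grouped_melb["month"])), top=grouped_melb["price"], width=0.9)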
fig.yaxis[0].formatter = NumeralTickFormatter(format="$0.0a")

# Rotate the x-axis labels by 45 degrees (numeric orientations are given in radians)
from math import pi
fig.xaxis.major_label_orientation = pi / 4

output_file(filename="sales_by_period.html")
show(fig)