# Fun facts about Vancouver tree distributions



## Foreword

This notebook presents an exploratory data analysis of a subset of the Vancouver Street Trees dataset located here.

## Introduction

### Questions of interest


1. How are trees distributed over the years across different neighbourhoods in Vancouver?
2. How do trees differ among street sides in Vancouver?
3. Which neighbourhood has the biggest and tallest trees in Vancouver?
4. Which range of tree sizes is most common in Vancouver?

In [1]:
# Import libraries needed for this assignment

import altair as alt
import pandas as pd
import os

alt.data_transformers.enable("data_server")

DataTransformerRegistry.enable('data_server')

Let's import the subset of the Vancouver Street Trees data. Since this is a new dataset, a good first step is to get familiar with it by glancing at the values in the dataframe.

In [2]:
trees_df = pd.read_csv('small_unique_vancouver.csv')
trees_df.head()

| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 |  | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.25635 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset |  | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 |  | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy |  | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 |  | N | 49.238514 | -123.154958 |


Next, let's check the type of data in each column and how many missing values there are.

In [3]:
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          5000 non-null   int64  
 1   std_street          5000 non-null   object 
 2   on_street           5000 non-null   object 
 3   species_name        5000 non-null   object 
 4   neighbourhood_name  5000 non-null   object 
 5   date_planted        2363 non-null   object 
 6   diameter            5000 non-null   float64
 7   street_side_name    5000 non-null   object 
 8   genus_name          5000 non-null   object 
 9   assigned            5000 non-null   object 
 10  civic_number        5000 non-null   int64  
 11  plant_area          4950 non-null   object 
 12  curb                5000 non-null   object 
 13  tree_id             5000 non-null   int64  
 14  common_name         5000 non-null   object 
 15  height_range_id     5000 non-null   int64  
 16  on_str

From the information above, the dtype of date_planted is object, so the dates were read as strings and need to be parsed as datetimes. We can do this by passing parse_dates=['date_planted'] to read_csv when reading the file again.

Also, it looks like three of the columns contain NaNs, and date_planted and cultivar_name seem to have the most: about half of their rows are missing a value.

Now let's parse the dates and reprint the info of the dataset.

In [4]:

trees_df = pd.read_csv('small_unique_vancouver.csv',parse_dates=['date_planted'])
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Unnamed: 0          5000 non-null   int64         
 1   std_street          5000 non-null   object        
 2   on_street           5000 non-null   object        
 3   species_name        5000 non-null   object        
 4   neighbourhood_name  5000 non-null   object        
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64       
 7   street_side_name    5000 non-null   object        
 8   genus_name          5000 non-null   object        
 9   assigned            5000 non-null   object        
 10  civic_number        5000 non-null   int64         
 11  plant_area          4950 non-null   object        
 12  curb                5000 non-null   object        
 13  tree_id             5000 non-null   int64       
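As a minimal sketch of what parse_dates changes, the example below reads a hypothetical three-row CSV held in memory (not small_unique_vancouver.csv) with and without the option. The sample values are made up for illustration; only the column names follow the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample; only the column names follow the dataset
csv_text = """tree_id,date_planted,diameter
1,2000-02-23,28.5
2,,12.0
3,1999-11-12,11.0
"""

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_planted"])

print(raw["date_planted"].dtype)     # dates are still text here
print(parsed["date_planted"].dtype)  # a datetime64 dtype

# Datetime accessors such as .dt.year only work on the parsed column
print(parsed["date_planted"].dt.year.tolist())
```

The empty date in the second row becomes a missing value (NaT) in the parsed column, so it is still counted as missing by info().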

Visualizing missing values helps us identify potential issues with the data.

In [5]:
alt.data_transformers.disable_max_rows();
trees_nans = trees_df.isna().reset_index().melt(id_vars='index', var_name='column', value_name='NaN')
trees_nans



| | index | column | NaN |
|---|---|---|---|
| 0 | 0 | Unnamed: 0 | False |
| 1 | 1 | Unnamed: 0 | False |
| 2 | 2 | Unnamed: 0 | False |
| 3 | 3 | Unnamed: 0 | False |
| 4 | 4 | Unnamed: 0 | False |
| ... | ... | ... | ... |
| 104995 | 4995 | longitude | False |
| 104996 | 4996 | longitude | False |
| 104997 | 4997 | longitude | False |
| 104998 | 4998 | longitude | False |
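The reshaping step is easier to follow on a tiny frame. The sketch below applies the same isna, reset_index and melt chain to a hypothetical two-row, two-column frame: each cell of the original frame becomes one row in the long table, which is why 5000 rows and 21 columns produce 105000 rows.

```python
import pandas as pd

# Hypothetical toy frame with a single missing value
toy = pd.DataFrame({"date_planted": ["2000-02-23", None],
                    "diameter": [28.5, 12.0]})

long_nans = (
    toy.isna()            # True where a value is missing
       .reset_index()     # keep the row number as a regular column named 'index'
       .melt(id_vars="index", var_name="column", value_name="NaN")
)
print(long_nans)
```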


In [6]:
alt.Chart(trees_nans).mark_rect(height=17).encode(
    x='index:O',
    y='column',
    color='NaN',
    stroke='NaN').properties(width=900)

By visualizing the missing values for each column next to each other, we can quickly see whether columns share similar patterns. The plot above shows that the missing values in cultivar_name and date_planted do not fall on exactly the same rows, although both columns are missing a value in about half of their rows. The column plant_area is missing a value in only 1% of rows.

Since cultivar_name and plant_area are categorical columns holding descriptive information about the trees, we do not need to drop their NaN values as long as we are not analysing those columns. For date_planted, we can drop the NaN values when we focus on time-related statistics. Because almost half of the rows are missing a value in date_planted, we keep those rows rather than drop them when working on statistics unrelated to time.
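The missing-value strategy above can be sketched as follows, assuming a hypothetical four-row frame whose column names follow the dataset. The share of missing values is computed per column, and rows are dropped only in the copy used for time-related questions.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "date_planted": pd.to_datetime(["2000-02-23", None, None, "1999-11-12"]),
    "cultivar_name": [None, "CHANTICLEER", None, "AUTUMN APPLAUSE"],
    "diameter": [28.5, 6.0, 12.0, 11.0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Drop rows only for time-related questions; keep the full frame otherwise
dated = toy.dropna(subset=["date_planted"])
print(len(toy), len(dated))
```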

Now let’s print out the summary statistics for the numerical columns.

In [7]:
trees_df.describe()

| | Unnamed: 0 | diameter | civic_number | tree_id | height_range_id | on_street_block | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 14861.9204 | 12.340888 | 2975.7076 | 128682.5846 | 2.7344 | 2960.227 | 49.247349 | -123.107128 |
| std | 8680.023278 | 9.2666 | 2078.580429 | 75412.260406 | 1.56957 | 2086.861052 | 0.021251 | 0.049137 |
| min | 2.0 | 0.0 | 2.0 | 36.0 | 0.0 | 0.0 | 49.202783 | -123.22056 |
| 25% | 7192.75 | 4.0 | 1300.5 | 61321.5 | 2.0 | 1300.0 | 49.230152 | -123.144178 |
| 50% | 14870.0 | 10.0 | 2639.0 | 130130.5 | 2.0 | 2600.0 | 49.247981 | -123.105861 |
| 75% | 22366.75 | 18.0 | 4123.0 | 191332.0 | 4.0 | 4100.0 | 49.263275 | -123.063484 |
| max | 29992.0 | 71.0 | 9113.0 | 270750.0 | 9.0 | 9100.0 | 49.29393 | -123.023311 |


Visualizing the distributions of all numerical columns helps us understand the data.


The first column, Unnamed: 0, looks like the row id from the original dataset, so it is of little interest when exploring relationships between numerical columns. We will leave it out of the numerical column exploration below.

In [8]:
# Exclude the first column (Unnamed: 0) from the numerical columns
numerical_columns = trees_df.iloc[:,1:].select_dtypes('number').columns.tolist()
(alt.Chart(trees_df)
 .mark_bar().encode(
     alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=25)),
     y='count()')
 .properties(width=220, height=150)
 .repeat(numerical_columns,columns=3))

This overview tells us that the most common diameter is under 5 inches and the most common height range id is between 1 and 2. As trees get bigger and taller, the counts go down. Also, the civic number and the street block number seem to share the same distribution.
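Readings taken from a histogram can be cross-checked numerically. The sketch below is one way to do that, assuming a hypothetical six-row frame and 5-inch bins chosen for illustration (the bins Altair picks with maxbins=25 may differ).

```python
import pandas as pd

# Hypothetical toy data; diameters are in inches
toy = pd.DataFrame({"diameter": [3.0, 4.0, 4.5, 10.0, 18.0, 28.5],
                    "height_range_id": [1, 1, 2, 2, 4, 4]})

# Share of trees under 5 inches in diameter
small_share = (toy["diameter"] < 5).mean()

# Counts per 5-inch diameter bin, listed in bin order (left edge included)
bins = pd.cut(toy["diameter"], bins=list(range(0, 35, 5)), right=False)
bin_counts = bins.value_counts().sort_index()

print(small_share)
print(bin_counts)
```

Comparing the tallest bin with the overall share helps separate "the most common bin" from "the majority of trees", which are not the same claim.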

Repeating columns of both X and Y lets us effectively explore pairwise relationships between columns.

In [9]:
# Scroll right on the plot to see the last column
(alt.Chart(trees_df)
 .mark_point(size=10).encode(
     alt.X(alt.repeat('column'), type='quantitative'),
     alt.Y(alt.repeat('row'), type='quantitative'))
 .properties(width=80, height=120)
 .repeat(column=numerical_columns, row=numerical_columns))

Unfortunately, these plots are saturated, so although we can see that there might be some correlative relationships, we should remake this plot as a 2D histogram heatmap.

In [10]:
# Scroll right on the plot to see more columns
(alt.Chart(trees_df)
 .mark_rect().encode(
     alt.X(alt.repeat('column'), type='quantitative', bin=alt.Bin(maxbins=30)),
     alt.Y(alt.repeat('row'), type='quantitative', bin=alt.Bin(maxbins=30)),
     alt.Color('count()', title=None))
 .properties(width=110, height=110)
 .repeat(column=numerical_columns, row=numerical_columns)).resolve_scale(color='independent')

From the heatmaps above, we find that diameter and height range might have a positive relationship when the diameter is less than 25 inches. We can also see that the civic number and block number are related to longitude and latitude, which hints at some interesting aspects of the geographic distribution.
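A visual impression of a positive relationship can be backed up with a correlation coefficient. The sketch below compares the Pearson correlation for small trees with the correlation over all trees, using hypothetical values; the 25-inch cut-off comes from the observation above.

```python
import pandas as pd

# Hypothetical toy data: height stops increasing for the largest diameters
toy = pd.DataFrame({
    "diameter": [3.0, 6.0, 11.0, 15.5, 24.0, 40.0, 60.0],
    "height_range_id": [1, 2, 3, 4, 5, 4, 4],
})

small = toy[toy["diameter"] < 25]

# Series.corr computes the Pearson correlation by default
corr_small = small["diameter"].corr(small["height_range_id"])
corr_all = toy["diameter"].corr(toy["height_range_id"])

print(round(corr_small, 2), round(corr_all, 2))
```

On the real data the same two lines could be run on trees_df directly; the numbers here only illustrate the mechanics.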

Visualizing the counts of all categorical columns also helps us understand the data. Since some columns have too many distinct values, we select only a subset of the categorical columns to explore here.

In [11]:
categorical_columns = ['street_side_name','curb','neighbourhood_name','root_barrier']
# categorical_columns = trees_df.select_dtypes('object').columns.tolist()
(alt.Chart(trees_df)
 .mark_bar().encode(
     alt.X('count()'),
     alt.Y(alt.repeat(), type='nominal', sort='x',title=''))
 .properties(width=80, height=200)
 .repeat(categorical_columns))

Some of these distributions are interesting, such as how trees are spread across street sides and neighbourhoods. We explore more fun aspects of the data in the exploratory visualizations below.
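The same counts shown in the bar charts can be produced as tables, and the number of distinct values per column is a quick way to decide which categorical columns are small enough to plot. A minimal sketch on a hypothetical five-row frame:

```python
import pandas as pd

# Hypothetical toy frame; column names follow the dataset
toy = pd.DataFrame({
    "street_side_name": ["EVEN", "ODD", "ODD", "EVEN", "ODD"],
    "curb": ["Y", "Y", "Y", "Y", "N"],
})

# Number of distinct values per column
n_unique = toy.nunique()
print(n_unique)

# Counts per category, most frequent first
side_counts = toy["street_side_name"].value_counts()
print(side_counts)
```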

## Exploratory Visualizations

### Question 1: How are trees distributed over the years across different neighbourhoods in Vancouver?

In [12]:
trees_df = trees_df.assign(year_planted=(trees_df['date_planted'].dt.year.astype('Int64')))
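The cast to 'Int64' matters because date_planted contains missing values. As a small illustration on a hypothetical three-value series, extracting the year from a column with a missing date yields floats, while the nullable integer dtype keeps whole-number years alongside the missing marker.

```python
import pandas as pd

# Hypothetical dates, one of them missing
dates = pd.Series(pd.to_datetime(["2000-02-23", None, "1999-11-12"]))

as_float = dates.dt.year                      # missing date forces a float dtype
as_nullable = dates.dt.year.astype("Int64")   # years stay integers, missing stays <NA>

print(as_float.tolist())
print(as_nullable.tolist())
```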

In [13]:
trees_with_date_df = trees_df[trees_df['date_planted'].notna()]   
trees_with_date_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2363 entries, 0 to 4998
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Unnamed: 0          2363 non-null   int64         
 1   std_street          2363 non-null   object        
 2   on_street           2363 non-null   object        
 3   species_name        2363 non-null   object        
 4   neighbourhood_name  2363 non-null   object        
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            2363 non-null   float64       
 7   street_side_name    2363 non-null   object        
 8   genus_name          2363 non-null   object        
 9   assigned            2363 non-null   object        
 10  civic_number        2363 non-null   int64         
 11  plant_area          2328 non-null   object        
 12  curb                2363 non-null   object        
 13  tree_id             2363 non-null   int64       

In [14]:
alt.Chart(trees_with_date_df).mark_bar().encode(
alt.X('year_planted'),
alt.Y('count()')
).properties(width=500,height=200)

From the plot above, we can easily see that the most trees were planted in 1996, 2002 and 2013. Next we look at the trees planted in different neighbourhoods over these years.
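The peak years can also be read off a table instead of the chart. The sketch below counts trees per planting year on hypothetical data; value_counts sorts by frequency and skips missing years by default.

```python
import pandas as pd

# Hypothetical planting years, including one missing value
toy = pd.DataFrame({"year_planted": pd.array(
    [1996, 1996, 2002, 2013, 2013, 2013, None], dtype="Int64")})

# Trees per year, most frequent year first; <NA> is excluded
per_year = toy["year_planted"].value_counts()
print(per_year.head(3))
```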

In [15]:
alt.Chart(trees_with_date_df).mark_rect().encode(
     alt.X('year_planted',bin=alt.Bin(maxbins=25)),
     alt.Y('neighbourhood_name'),
     alt.Color('count()')).properties(width=510, height=410)

From the heatmap above, we learn that most trees were planted in Hastings-Sunrise, Kensington-Cedar Cottage, Renfrew-Collingwood, Sunset and Victoria-Fraserview from 1992 to 2002.
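The counts behind such a heatmap can be tabulated with a cross-tabulation. The sketch below groups hypothetical planting years into decades for brevity; the decade grouping is an assumption made for this example and is not the binning Altair applies with maxbins=25.

```python
import pandas as pd

# Hypothetical toy data; plain integers are used for the years
toy = pd.DataFrame({
    "neighbourhood_name": ["Sunset", "Sunset", "Killarney", "Riley Park", "Sunset"],
    "year_planted": [1993, 1998, 1999, 2000, 2011],
})

# Round each year down to its decade, then count trees per neighbourhood and decade
decade = (toy["year_planted"] // 10) * 10
table = pd.crosstab(toy["neighbourhood_name"], decade)
print(table)
```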

In [16]:
alt.Chart(trees_df).mark_bar().encode(
alt.X('count()'),
alt.Y('neighbourhood_name',sort='x')
)
    

We find that the neighbourhoods that planted the most trees from 1992 to 2002 are also the areas with the most trees today.

As a bonus to question 1, we would also like to look at how tree height is distributed over the planting years.

In [17]:
# tree_size = ['height_range_id','diameter']
line = alt.Chart(trees_with_date_df).mark_line().encode(
alt.X('year_planted'),
alt.Y('mean(height_range_id)')  
).properties(width=500,height=200)

point = alt.Chart(trees_with_date_df).mark_point().encode(
alt.X('year_planted'),
alt.Y('mean(height_range_id)')  
).properties(width=500,height=200)

line + point

We find that the trees planted in 1991 have the highest mean height range today, so they either grew the fastest or were the tallest when planted, which is really interesting. We might find out more about this in later exploration.
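Before reading much into a single peak year, it helps to check how many trees the mean is based on. A minimal sketch with hypothetical values, reporting the mean height range next to the group size:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "year_planted": [1991, 1991, 1996, 1996, 1996, 2013],
    "height_range_id": [5, 4, 3, 2, 3, 1],
})

# Mean height range and number of trees per planting year, tallest first
by_year = (
    toy.groupby("year_planted")["height_range_id"]
       .agg(["mean", "count"])
       .sort_values("mean", ascending=False)
)
print(by_year)
```

A year with a high mean but only a handful of trees is weaker evidence than a year with hundreds of trees.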

### Question 2: How do trees differ among street sides in Vancouver?

To answer this question, we'll explore the relationship between average tree size and the street side.

In [18]:
alt.Chart(trees_df).mark_bar().encode(
alt.X('mean(diameter)'),
alt.Y('street_side_name'),
alt.Color('street_side_name')
).properties(width=500,height=200)

In [19]:
alt.Chart(trees_df).mark_bar().encode(
alt.X('mean(height_range_id)'),
alt.Y('street_side_name'),
alt.Color('street_side_name')
).properties(width=500,height=200)

We find that trees planted on either side of the street are bigger and taller than those planted in the middle of the street. Trees are smallest in the bike area. This makes sense: it matches the impression we usually get when looking at trees along the street.
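The two bar charts can be summarised in one table using named aggregation. In the sketch below the data are hypothetical, and the 'MED' label for the middle of the street is an assumed category name used only for illustration.

```python
import pandas as pd

# Hypothetical toy data; 'MED' is an assumed label for the street median
toy = pd.DataFrame({
    "tree_id": [1, 2, 3, 4, 5],
    "street_side_name": ["EVEN", "EVEN", "ODD", "ODD", "MED"],
    "diameter": [20.0, 10.0, 18.0, 12.0, 4.0],
    "height_range_id": [4, 2, 4, 2, 1],
})

# Mean size and number of trees for each street side
side_summary = toy.groupby("street_side_name").agg(
    mean_diameter=("diameter", "mean"),
    mean_height=("height_range_id", "mean"),
    n_trees=("tree_id", "count"),
)
print(side_summary)
```

Including n_trees makes it visible when a street side has too few trees for its mean to be reliable.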

### Question 3: Which neighbourhood has the biggest and tallest trees in Vancouver?

Now we explore the most wonderful neighbourhoods, where giant, tall trees are most abundant.

In [20]:
alt.Chart(trees_df).mark_bar().encode(
alt.X('mean(diameter)'),
alt.Y('neighbourhood_name',sort='x')
).properties(width=600,height=300)

In [21]:
alt.Chart(trees_df).mark_bar().encode(
alt.X('mean(height_range_id)'),
alt.Y('neighbourhood_name',sort='x')
).properties(width=600,height=300)

From the plots above, we find that Kitsilano, Dunbar-Southlands, Fairview, Shaughnessy and Kerrisdale are the neighbourhoods with the biggest and tallest trees. It is fascinating that these neighbourhoods are all on the west side of Vancouver and usually have the highest housing prices as well.
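Since "big" and "tall" come from two different charts, one way to combine them is to take the top neighbourhoods by each measure and keep those that appear in both lists. This is a sketch on hypothetical values, not necessarily how the five neighbourhoods above were chosen.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "neighbourhood_name": ["Kitsilano", "Kitsilano", "Fairview", "Fairview",
                           "Sunset", "Sunset", "Killarney", "Killarney"],
    "diameter": [20.0, 24.0, 18.0, 16.0, 6.0, 8.0, 22.0, 16.0],
    "height_range_id": [4, 5, 4, 4, 2, 2, 3, 2],
})

means = toy.groupby("neighbourhood_name")[["diameter", "height_range_id"]].mean()

# Top two neighbourhoods by each measure, then their intersection
top_by_diameter = set(means["diameter"].nlargest(2).index)
top_by_height = set(means["height_range_id"].nlargest(2).index)
both = sorted(top_by_diameter & top_by_height)
print(both)
```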

Now let's take a look at how the trees are distributed in these top neighbourhoods using faceted subplots.

In [22]:
top_neighbourhood_trees_df = trees_df[trees_df['neighbourhood_name'].isin(['Kitsilano', 'Dunbar-Southlands','Fairview','Shaughnessy','Kerrisdale'])]


alt.Chart(top_neighbourhood_trees_df).mark_bar().encode(
    alt.X('diameter', bin=alt.Bin(maxbins=30)),
    alt.Y('count()'),
    alt.Color('neighbourhood_name')
).properties(width=200, height=150
).facet('neighbourhood_name',columns=3)




In [23]:
alt.Chart(top_neighbourhood_trees_df).mark_bar().encode(
    alt.X('height_range_id', bin=alt.Bin(maxbins=30)),
    alt.Y('count()'),
    alt.Color('neighbourhood_name')
).properties(width=200, height=150
).facet('neighbourhood_name',columns=3)



These subplots show that Fairview has the most evenly distributed tree sizes, just as its name "Fairview" suggests! What a fun fact!
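"Evenly distributed" is judged by eye here. One possible way to put a number on it, offered only as an illustration and not as the method used in this notebook, is the normalized Shannon entropy of the binned sizes: 1.0 means every bin holds the same number of trees, and lower values mean the trees are concentrated in a few bins. The helper name evenness, the bin edges and the values are all hypothetical.

```python
import numpy as np
import pandas as pd

def evenness(values, bins):
    """Normalized Shannon entropy of binned values (1.0 = perfectly even)."""
    counts = pd.cut(values, bins=bins, right=False).value_counts()
    shares = counts[counts > 0] / counts.sum()
    # Divide by the log of the number of bins so the result lies in [0, 1]
    return float(-(shares * np.log(shares)).sum() / np.log(len(bins) - 1))

bins = [0, 10, 20, 30, 40]  # hypothetical diameter bins in inches
even = pd.Series([5, 15, 25, 35, 6, 16, 26, 36])
skewed = pd.Series([5, 6, 7, 8, 9, 15, 25, 35])

print(evenness(even, bins), evenness(skewed, bins))
```

Values outside the bin edges are ignored by pd.cut, so the edges should cover the full range of the column being measured.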

### Question 4: Which range of tree sizes is most common in Vancouver?

In [24]:
alt.Chart(trees_df).mark_circle(size=500).encode(
     alt.X('height_range_id', type='quantitative', bin=alt.Bin(maxbins=30)),
     alt.Y('diameter', type='quantitative', bin=alt.Bin(maxbins=30)),
     alt.Color('count()', title=None),
     alt.Size('count()', title=None)
).properties(width=510, height=310)



Using both colour and marker size to indicate the count makes the plot above an effective visualization. We can easily see that a diameter of less than 5 with a height range between 1 and 1.5 is the most common tree size in Vancouver. Trees with a diameter between 5 and 10 and a height range between 2 and 2.5 come second.
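The ranking read from the circle plot can be reproduced by counting trees per combination of diameter bin and height range. The sketch below uses hypothetical values and 5-inch diameter bins chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "diameter": [3.0, 4.0, 4.5, 6.0, 8.0, 18.0],
    "height_range_id": [1, 1, 1, 2, 2, 4],
})

# Bin the diameters, then count trees per (diameter bin, height range) pair
diameter_bin = pd.cut(toy["diameter"], bins=[0, 5, 10, 15, 20], right=False)
combo_counts = (
    toy.groupby([diameter_bin, "height_range_id"], observed=True)
       .size()
       .sort_values(ascending=False)
)
print(combo_counts.head(2))
```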

## Conclusion

Based on the exploratory visualizations above, we are going to keep exploring and focus on fun facts about tree distributions in the report. Some of these are inspired by the quick and dirty EDA plots in the introduction. The columns of interest are date_planted, neighbourhood_name, diameter, height_range_id and street_side_name.

During the exploration of the data, we found some interesting aspects that relate to people's impressions of the city of Vancouver, such as prestigious communities with more giant trees versus newly developing communities with more recently planted trees. We also found other fun facts, such as a tree distribution that fits its neighbourhood's name perfectly, as in "Fairview".

We used five key types of graphs:

1. Heatmap plot

From a heatmap, we learn that most trees were planted in Hastings-Sunrise, Kensington-Cedar Cottage and other neighbourhoods in the east of Vancouver from 1992 to 2002.

2. Bar plot

Through simple bar plots, we can see how tree size contrasts among different street sides in Vancouver.


3. Faceted Histogram subplots

First we use simple bar plots to find the top neighbourhoods with the most giant and tall trees. Coincidentally, they are all located in the west of Vancouver. Then, using histogram subplots faceted by these top neighbourhoods, we find another fun fact about the tree distribution.

4. Circle plot with the colour and marker size

Using both colour and marker size to indicate the count creates an effective visualization in the circle plot. It makes it easy to find the most common range of tree sizes in Vancouver.



5. Line + point plot

Exploring the first question opens another door to something more interesting. Using a line and point plot, we can easily see that the trees planted in 1991 either grew the fastest or were originally the tallest, because they are the tallest trees today.
