## Exploratory Data Analysis of the [Vancouver Street Trees](https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id) Dataset

This report was prepared by Sarah McDonald on December 12, 2021, as the final project for a Data Visualization class at the University of British Columbia. It uses the provided [subset](https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv) of the Vancouver Street Trees dataset.

In [1]:
# Import libraries needed for this analysis
import pandas as pd
import altair as alt
import json
alt.data_transformers.enable("data_server")

DataTransformerRegistry.enable('data_server')

In [2]:
# Load in the data and view a subset
trees_url = 'https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv'
trees_df = pd.read_csv(trees_url, parse_dates=['date_planted'])
trees_df.head()

,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,...,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,10747,W 20TH AV,W 20TH AV,PLATANOIDES,Riley Park,2000-02-23,28.5,EVEN,ACER,N,...,15,Y,21421,NORWAY MAPLE,4,0,,N,49.252711,-123.106323
1,12573,W 18TH AV,W 18TH AV,CALLERYANA,Arbutus-Ridge,1992-02-04,6.0,ODD,PYRUS,N,...,7,Y,129645,CHANTICLEER PEAR,2,2300,CHANTICLEER,N,49.25635,-123.158709
2,29676,ROSS ST,ROSS ST,NIGRA,Sunset,NaT,12.0,ODD,PINUS,N,...,7,Y,154675,AUSTRIAN PINE,4,7800,,N,49.213486,-123.083254
3,8856,DOMAN ST,DOMAN ST,AMERICANA,Killarney,1999-11-12,11.0,EVEN,FRAXINUS,N,...,7,Y,180803,AUTUMN APPLAUSE ASH,4,6900,AUTUMN APPLAUSE,N,49.220839,-123.036721
4,21098,EAST BOULEVARD,EAST BOULEVARD,HIPPOCASTANUM,Shaughnessy,NaT,15.5,ODD,AESCULUS,Y,...,N,Y,74364,COMMON HORSECHESTNUT,4,5200,,N,49.238514,-123.154958


In [3]:
# get more information about our dataset
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Unnamed: 0          5000 non-null   int64         
 1   std_street          5000 non-null   object        
 2   on_street           5000 non-null   object        
 3   species_name        5000 non-null   object        
 4   neighbourhood_name  5000 non-null   object        
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64       
 7   street_side_name    5000 non-null   object        
 8   genus_name          5000 non-null   object        
 9   assigned            5000 non-null   object        
 10  civic_number        5000 non-null   int64         
 11  plant_area          4950 non-null   object        
 12  curb                5000 non-null   object        
 13  tree_id             5000 non-null   int64       

# Questions of Interest
For this analysis I am interested in how the number and type of trees planted have changed over time. From our initial look at the data, I can see that more than half of the values in the 'date_planted' column are missing (only 2,363 of 5,000 records have a date). This could be an error in data recording, or it could be that we don't have records of when older trees were planted. To visualize the gaps in our data, let's first plot the dates we do have.

In [4]:
# rug plot to visualize date_planted column data
trees_date = alt.Chart(trees_df).mark_tick().encode(
             alt.X("date_planted:T", scale=alt.Scale())
             )

trees_date

It looks like we have continuous data from 1989 to 2019. If our theory is correct and records with no value in the ‘date_planted’ column come from older trees, we would expect these trees to be larger than trees planted more recently. Let’s see if that holds true for our data.

In [5]:
# add a boolean column to our dataframe flagging missing planting dates
# (date_record is True when date_planted is missing)
trees_nan = trees_df.assign(date_record=trees_df['date_planted'].isna())
trees_nan.head()

,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,...,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude,date_record
0,10747,W 20TH AV,W 20TH AV,PLATANOIDES,Riley Park,2000-02-23,28.5,EVEN,ACER,N,...,Y,21421,NORWAY MAPLE,4,0,,N,49.252711,-123.106323,False
1,12573,W 18TH AV,W 18TH AV,CALLERYANA,Arbutus-Ridge,1992-02-04,6.0,ODD,PYRUS,N,...,Y,129645,CHANTICLEER PEAR,2,2300,CHANTICLEER,N,49.25635,-123.158709,False
2,29676,ROSS ST,ROSS ST,NIGRA,Sunset,NaT,12.0,ODD,PINUS,N,...,Y,154675,AUSTRIAN PINE,4,7800,,N,49.213486,-123.083254,True
3,8856,DOMAN ST,DOMAN ST,AMERICANA,Killarney,1999-11-12,11.0,EVEN,FRAXINUS,N,...,Y,180803,AUTUMN APPLAUSE ASH,4,6900,AUTUMN APPLAUSE,N,49.220839,-123.036721,False
4,21098,EAST BOULEVARD,EAST BOULEVARD,HIPPOCASTANUM,Shaughnessy,NaT,15.5,ODD,AESCULUS,Y,...,Y,74364,COMMON HORSECHESTNUT,4,5200,,N,49.238514,-123.154958,True
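
The flag column above can also be used to quantify the gap directly. The following is a minimal sketch on a made-up stand-in for `trees_df` (the name `toy_df` and its values are hypothetical, not taken from the dataset); it assumes the same convention as the cell above, where `date_record` is `True` when the planting date is missing.

```python
import pandas as pd

# Hypothetical stand-in for trees_df; values are invented for illustration
toy_df = pd.DataFrame({
    "date_planted": pd.to_datetime(["2000-02-23", None, "1992-02-04", None, None]),
    "diameter": [28.5, 12.0, 6.0, 15.5, 20.0],
})

# Same flag as in the notebook: True means date_planted is missing
toy_nan = toy_df.assign(date_record=toy_df["date_planted"].isna())

# Summing a boolean column counts the True values; the mean gives the fraction
n_missing = int(toy_nan["date_record"].sum())
share_missing = toy_nan["date_record"].mean()
print(f"{n_missing} rows without a date ({share_missing:.0%})")
```

Applied to the real `trees_nan`, the same two lines should reproduce the missing count implied by the `info()` output.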


To account for differences in species we want to break the records down by species. First let's see how many species we are working with.

In [6]:
species = trees_nan.groupby("species_name")
species.describe()

,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,Unnamed: 0,diameter,diameter,...,latitude,latitude,longitude,longitude,longitude,longitude,longitude,longitude,longitude,longitude
species_name,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
ABIES,3.0,11484.666667,7736.631718,4347.0,7374.00,10401.0,15053.5,19706.0,3.0,16.000000,...,49.251689,49.265250,3.0,-123.139497,0.082919,-123.191800,-123.187300,-123.182800,-123.113346,-123.043891
ACERIFOLIA X,60.0,14736.833333,7736.247569,1152.0,8729.25,12926.0,21225.0,29978.0,60.0,22.355000,...,49.263235,49.289708,60.0,-123.117517,0.047075,-123.198230,-123.150238,-123.122816,-123.078775,-123.030066
ACUTISSIMA,19.0,16161.631579,8395.660984,2483.0,11159.00,16611.0,23396.5,28798.0,19.0,11.355263,...,49.263155,49.285991,19.0,-123.087162,0.038076,-123.166011,-123.113721,-123.089016,-123.058271,-123.028403
ALNIFOLIA,7.0,19888.285714,6129.725299,11189.0,15721.50,21692.0,24053.0,26788.0,7.0,7.642857,...,49.271948,49.290517,7.0,-123.086372,0.052193,-123.157361,-123.132622,-123.055624,-123.044008,-123.038358
ALPINUM,1.0,7160.000000,,7160.0,7160.00,7160.0,7160.0,7160.0,1.0,8.000000,...,49.261980,49.261980,1.0,-123.176110,,-123.176110,-123.176110,-123.176110,-123.176110,-123.176110
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WATERERI X,3.0,10674.000000,4363.166396,7523.0,8184.00,8845.0,12249.5,15654.0,3.0,18.833333,...,49.247132,49.258560,3.0,-123.137358,0.067524,-123.209370,-123.168305,-123.127239,-123.101351,-123.075464
X YEDOENSIS,90.0,16544.900000,8492.142408,832.0,9711.75,17409.5,22845.5,29792.0,90.0,7.547222,...,49.256834,49.289456,90.0,-123.117258,0.057906,-123.220360,-123.166355,-123.130054,-123.058314,-123.025868
XX,57.0,16790.192982,9060.538799,397.0,8913.00,18314.0,25920.0,29855.0,57.0,3.504386,...,49.261244,49.289050,57.0,-123.097158,0.050363,-123.209720,-123.137614,-123.088452,-123.060023,-123.023650
YUNNANENSIS,1.0,5188.000000,,5188.0,5188.00,5188.0,5188.0,5188.0,1.0,10.000000,...,49.220989,49.220989,1.0,-123.100972,,-123.100972,-123.100972,-123.100972,-123.100972,-123.100972
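
The number of groups is easier to read off with `nunique()` than from the grouped summary. This is a small sketch on invented rows (`toy_df` is hypothetical); it also counts `common_name`, since the two columns need not have the same number of distinct values.

```python
import pandas as pd

# Hypothetical rows; only the two naming columns matter here
toy_df = pd.DataFrame({
    "species_name": ["PLATANOIDES", "CALLERYANA", "NIGRA", "PLATANOIDES", "NIGRA"],
    "common_name": ["NORWAY MAPLE", "CHANTICLEER PEAR", "AUSTRIAN PINE",
                    "NORWAY MAPLE", "AUSTRIAN PINE"],
})

# nunique() counts distinct non-null values in a column
n_species = toy_df["species_name"].nunique()
n_common = toy_df["common_name"].nunique()
print(n_species, n_common)
```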


171 is a lot of species to visualize all at once. Let's find our top 10.

In [7]:
#find the 10 most common trees in our dataset
trees_common = (trees_nan.groupby("common_name").count().sort_values(by='tree_id', ascending=False
                ).reset_index().loc[0:9])
trees_common = trees_common["common_name"].tolist()
trees_common

['KWANZAN FLOWERING CHERRY',
 'PISSARD PLUM',
 'NORWAY MAPLE',
 'CRIMEAN LINDEN',
 'PYRAMIDAL EUROPEAN HORNBEAM',
 'NIGHT PURPLE LEAF PLUM',
 'KOBUS MAGNOLIA',
 'AKEBONO FLOWERING CHERRY',
 'RED MAPLE',
 'KATSURA TREE']
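
A shorter way to get a similar list is `value_counts()`, which already sorts by frequency. The sketch below uses invented data (`toy_df` and `top_n` are hypothetical names). It counts rows per name rather than non-null `tree_id` values, so it should agree with the cell above when `tree_id` has no missing values, although the order of tied names may differ.

```python
import pandas as pd

# Hypothetical data: three maples, two plums, one red maple
toy_df = pd.DataFrame({
    "common_name": ["NORWAY MAPLE"] * 3 + ["PISSARD PLUM"] * 2 + ["RED MAPLE"],
})

top_n = 2  # the notebook uses 10

# value_counts() returns counts sorted in descending order;
# the index holds the names, so head(top_n) keeps the most common ones
top_names = toy_df["common_name"].value_counts().head(top_n).index.tolist()
print(top_names)
```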

In [8]:
# filter trees_nan to include only the most common trees
common_records = trees_nan.common_name.isin(trees_common)
trees_nan_small = trees_nan[common_records]

In [9]:
# chart the distribution of tree diameter (boxplot) for the 10 most common trees
tree_diam = alt.Chart(trees_nan_small).mark_boxplot().encode(
            alt.X('diameter:Q'),
            alt.Y('common_name:N'),
            ).properties(width=300).facet('date_record')
tree_diam
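
The medians that the boxplots display can also be computed numerically, which makes the comparison between the two facets explicit. This is a sketch on invented numbers (`toy_nan_small` and the `date_status` label column are hypothetical); as in the notebook, `date_record` is `True` when the planting date is missing.

```python
import pandas as pd

# Hypothetical stand-in for trees_nan_small
toy_nan_small = pd.DataFrame({
    "common_name": ["NORWAY MAPLE"] * 4 + ["RED MAPLE"] * 4,
    "diameter": [10.0, 14.0, 30.0, 40.0, 5.0, 7.0, 20.0, 22.0],
    "date_record": [False, False, True, True, False, False, True, True],
})

# Readable labels for the flag: True (date missing) -> "no_date"
toy_nan_small["date_status"] = toy_nan_small["date_record"].map(
    {True: "no_date", False: "has_date"}
)

# Median diameter per tree name and date status, one column per status
median_diam = (
    toy_nan_small.groupby(["common_name", "date_status"])["diameter"]
    .median()
    .unstack("date_status")
)

# Positive values mean undated trees are larger at the median
median_diam["difference"] = median_diam["no_date"] - median_diam["has_date"]
print(median_diam)
```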

As we can see from the chart above, trees without a date record do have a higher median diameter than trees with a date record. Our theory that trees without date records are older seems to be correct, so we will exclude these records from future plots involving dates. To make analysis easier, I will add a column with just the year planted.

In [10]:
# remove entries with no date_planted
trees_small = trees_df.dropna(subset=['date_planted'])
# create a new column with just year 
trees_small = trees_small.assign(year_planted = trees_small['date_planted'].dt.year)

In [11]:
# number of trees planted over time
trees_time = alt.Chart(trees_small).mark_bar().encode(
             alt.X('year_planted:O'),
             alt.Y('count()'))
trees_time

Let's make this chart clickable so we can filter our top 10 tree species by year. 

In [12]:
click_year = alt.selection_multi(encodings=['x'], on='click')
click_trees_year = (trees_time.encode(
                   opacity=alt.condition(click_year, alt.value(1), alt.value(0.5)))
                  .properties(height=100, width=500)
                  .add_selection(click_year))

In [13]:
# select 10 most common trees based on year
species_select = (alt.Chart(trees_small).transform_filter(click_year).mark_bar().encode(
                    alt.Y('species_name:N', sort='x'),
                    alt.X('species_count:Q'),
                    ).transform_aggregate(
                    species_count="count()",
                    groupby=["species_name"]
                    ).transform_window(
                    rank='rank(species_count)',
                    sort=[alt.SortField("species_count", order="descending")]
                    ).transform_filter((alt.datum.rank <= 10)).add_selection(click_year))
species_select & click_trees_year
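
The aggregate, rank, and filter steps in the chart can be approximated in pandas to compare years numerically rather than by clicking. The sketch below uses invented data, and the helpers `top_species` and `jaccard` are hypothetical names introduced here. One difference to keep in mind: `head(k)` always returns exactly `k` rows, while a rank-based filter can keep more when counts are tied.

```python
import pandas as pd

# Hypothetical stand-in for trees_small
toy_small = pd.DataFrame({
    "year_planted": [2004] * 5 + [2005] * 5,
    "species_name": ["RUBRUM", "RUBRUM", "NIGRA", "NIGRA", "KOBUS",
                     "RUBRUM", "RUBRUM", "KOBUS", "KOBUS", "NIGRA"],
})


def top_species(df, year, k):
    """Return the set of the k most frequent species planted in `year`."""
    in_year = df.loc[df["year_planted"] == year, "species_name"]
    return set(in_year.value_counts().head(k).index)


def jaccard(a, b):
    """Share of species common to both sets (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


top_2004 = top_species(toy_small, 2004, k=2)
top_2005 = top_species(toy_small, 2005, k=2)
overlap = jaccard(top_2004, top_2005)
print(top_2004, top_2005, round(overlap, 2))
```

With the real data, `k=10` and a loop over consecutive years would give one overlap score per pair of years.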

Interesting, there is less overlap in the top 10 species per year than I thought there would be. Now, I would like to look more at the size of trees. I wonder how the method of planting affects a tree's size. To visualize this, I will use our top 10 data subset.

In [14]:
# Tree diameter vs height colored by species
tree_height = alt.Chart(trees_nan_small).mark_circle().encode(
              alt.X('diameter:Q'),
              alt.Y('height_range_id:Q'),
              color='species_name:N'
              )
tree_height

In [15]:
# facet our size chart by root barrier
tree_height.facet('root_barrier:N')

In [16]:
# facet tree size by side of street
tree_side = tree_height.properties(width=200).facet('street_side_name')
tree_side

It looks like the side of the street that trees are planted on makes no difference to size; however, trees planted with a root barrier do seem to be smaller. Using our full dataset, let's see if the trees with root barriers are younger than those without.

In [17]:
root_barrier = trees_time.encode(color="root_barrier:N")
root_barrier
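
The counts behind a stacked bar chart like this one can be tabulated with `pd.crosstab`, which makes it easier to see in which years root barriers were used. The sketch below uses invented data (`toy_small` is a hypothetical stand-in for `trees_small`) and assumes `root_barrier` holds `'Y'`/`'N'` values as in the table preview.

```python
import pandas as pd

# Hypothetical stand-in for trees_small
toy_small = pd.DataFrame({
    "year_planted": [2003, 2004, 2004, 2005, 2005, 2005],
    "root_barrier": ["N", "N", "Y", "Y", "Y", "N"],
})

# Rows: year planted; columns: root barrier flag; cells: number of trees
barrier_by_year = pd.crosstab(toy_small["year_planted"], toy_small["root_barrier"])

# Fraction of each year's trees that were planted with a root barrier
barrier_share = barrier_by_year["Y"] / barrier_by_year.sum(axis=1)
print(barrier_by_year)
print(barrier_share)
```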

It looks like most of the trees with root barriers were planted between 2004 and 2009. Let's filter our data to include just those years and see if the pattern still holds. 

In [18]:
tree_height_filter = alt.Chart(trees_small).transform_filter(
                   alt.FieldRangePredicate(field='year_planted', range=[2004, 2009])
                   ).mark_circle().encode(
                   alt.X('diameter:Q'),
                   alt.Y('height_range_id:Q')
                   ).properties(width=300).facet('root_barrier:N')
tree_height_filter

When we filter for just the years in which root barriers were commonly used, the size difference is much less pronounced. Our initial observation about root barriers may have arisen because only a small percentage of the trees in the data were planted with root barriers.
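
The effect of restricting the years can be checked with summary statistics as well as by eye. This is a sketch on invented numbers (`toy_small` is hypothetical) that compares median diameter by root barrier before and after applying the same 2004 to 2009 window used in the chart; `between` includes both endpoints.

```python
import pandas as pd

# Hypothetical stand-in for trees_small: older trees have no root barrier
toy_small = pd.DataFrame({
    "year_planted": [1995, 1998, 2005, 2006, 2007, 2008],
    "root_barrier": ["N", "N", "N", "Y", "N", "Y"],
    "diameter": [30.0, 26.0, 8.0, 7.0, 10.0, 9.0],
})

# Median diameter by root barrier over all years
all_years = toy_small.groupby("root_barrier")["diameter"].median()

# Same comparison restricted to trees planted in 2004-2009 (inclusive)
in_window = toy_small[toy_small["year_planted"].between(2004, 2009)]
window_only = in_window.groupby("root_barrier")["diameter"].median()

print(all_years["N"] - all_years["Y"])      # gap across all years
print(window_only["N"] - window_only["Y"])  # gap within the window
```

In this toy example the gap shrinks once tree age is held roughly constant, which is the pattern described above; whether the real data behaves the same way would need to be checked.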

Now, let's see how the trees are distributed over Vancouver.

In [19]:
# load data to make a map of vancouver (code provided)
url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json'))
data_geojson_remote

Data({
  format: DataFormat({
    property: 'features',
    type: 'json'
  }),
  url: 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
})

In [20]:
# base map of Vancouver (code provided)
vancouver_map = alt.Chart(data_geojson_remote).mark_geoshape(
    color = 'white', opacity= 0.5, stroke='black').encode(
).project(type='identity', reflectY=True)

vancouver_map

In [21]:
# Map the location of all trees with a recorded planting date
points = alt.Chart(trees_small).mark_circle(size=20).encode(
         longitude='longitude',
         latitude='latitude',
         ).project(type= 'identity', reflectY=True)

point_map = (vancouver_map + points)
point_map

To see how the distribution changes over time I am going to use the clickable year chart we made earlier.

In [22]:
point_map = point_map.encode(
                opacity=alt.condition(click_year, alt.value(1), alt.value(0.1)),
                color="species_name:N"
                ).add_selection(click_year)
point_map & click_trees_year

Interesting, over the years the distribution seems to be spread out evenly. I would have guessed that the street tree program would have started in a few neighbourhoods and branched out from there. There also don't seem to be any clusters of particular species in neighbourhoods, but it is hard to tell with so many species to consider. For the analysis report, I think it will be interesting to explore the distribution of species planted over time and space using both time charts and a map. Linking our top 10 species per year chart will make the species distribution much easier to visualize. I am also very interested in our findings about the size of trees and root barriers, so I will include those in our report as well.
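
Because species clusters are hard to judge from a map with many colours, one rough numeric check is the share of each neighbourhood's trees that belong to its most common species. The sketch below uses invented data (`toy_small` is a hypothetical stand-in for `trees_small`); a share close to 1 would suggest one species dominates that neighbourhood.

```python
import pandas as pd

# Hypothetical stand-in for trees_small
toy_small = pd.DataFrame({
    "neighbourhood_name": ["Sunset"] * 4 + ["Killarney"] * 4,
    "species_name": ["NIGRA", "NIGRA", "NIGRA", "RUBRUM",
                     "RUBRUM", "KOBUS", "NIGRA", "AMERICANA"],
})

# Row-normalised counts: each row sums to 1 across species
species_share = pd.crosstab(
    toy_small["neighbourhood_name"], toy_small["species_name"], normalize="index"
)

# Share and name of the most common species in each neighbourhood
top_share = species_share.max(axis=1)
top_species_name = species_share.idxmax(axis=1)
print(top_share)
print(top_species_name)
```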