Title: "Green Vancouver: Exploring various tree species across different Vancouver neighbourhoods using Altair."
Author: Suvin Majithia
Date: March 21, 2024

Introduction:
Possible questions of interest you want to answer with your data exploration:

Here I aim to explore a subset of the original "Vancouver Street Trees" dataset. The subset is called "small_unique_vancouver.csv" and has 5000 rows and 21 columns.
I will carry out Exploratory Data Analysis with Altair, a powerful visualization library, creating plots to answer the following questions:
Question 1.) Exploring different Vancouver neighbourhoods: what do they tell me?
Question 2.) Total count of different species of trees planted in Vancouver.
Question 3.) Distribution of different species of trees in Vancouver according to their latitude and longitude.
Question 4.) Street_side_name vs species/genus of trees: different types of trees based on even or odd street side name.

In [7]:
# Importing necessary libraries for EDA
import pandas as pd
import numpy as np
import altair as alt
alt.data_transformers.enable("data_server")

DataTransformerRegistry.enable('data_server')

Description & Review of Data:
Explain columns of interest, and overall data you will be using:
I will be using the small_unique_vancouver.csv file. My columns of interest include neighbourhood_name, street_side_name, latitude, longitude, species_name and genus_name.

In [8]:
# Reading in the data for EDA
trees_df = pd.read_csv("small_unique_vancouver.csv")
trees_df.head()

Unnamed: 0.1,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,...,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,10747,W 20TH AV,W 20TH AV,PLATANOIDES,Riley Park,2000-02-23,28.5,EVEN,ACER,N,...,15,Y,21421,NORWAY MAPLE,4,0,,N,49.252711,-123.106323
1,12573,W 18TH AV,W 18TH AV,CALLERYANA,Arbutus-Ridge,1992-02-04,6.0,ODD,PYRUS,N,...,7,Y,129645,CHANTICLEER PEAR,2,2300,CHANTICLEER,N,49.25635,-123.158709
2,29676,ROSS ST,ROSS ST,NIGRA,Sunset,,12.0,ODD,PINUS,N,...,7,Y,154675,AUSTRIAN PINE,4,7800,,N,49.213486,-123.083254
3,8856,DOMAN ST,DOMAN ST,AMERICANA,Killarney,1999-11-12,11.0,EVEN,FRAXINUS,N,...,7,Y,180803,AUTUMN APPLAUSE ASH,4,6900,AUTUMN APPLAUSE,N,49.220839,-123.036721
4,21098,EAST BOULEVARD,EAST BOULEVARD,HIPPOCASTANUM,Shaughnessy,,15.5,ODD,AESCULUS,Y,...,N,Y,74364,COMMON HORSECHESTNUT,4,5200,,N,49.238514,-123.154958
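
The "Unnamed: 0" column in the preview is most likely a row index that was saved into the CSV with an empty header. The following is a minimal sketch of one way to avoid carrying it along as a data column, assuming the file's first column is such an index. The two-row `csv_text` string is a stand-in built from the preview rows, not the real file.

```python
import io

import pandas as pd

# Stand-in for the real file: two preview rows with an unnamed first column
csv_text = (
    ",std_street,species_name\n"
    "10747,W 20TH AV,PLATANOIDES\n"
    "12573,W 18TH AV,CALLERYANA\n"
)

# Default read: pandas labels the header-less column "Unnamed: 0"
raw_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that first column as the row index instead
clean_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw_df.columns))
print(list(clean_df.columns))
```

In the notebook the same idea would be `pd.read_csv("small_unique_vancouver.csv", index_col=0)`, if the assumption about the first column holds.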


Description & Review of Data:
Use info/describe:

In [4]:
# Column dtypes and non-null counts of the trees dataset
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          5000 non-null   int64  
 1   std_street          5000 non-null   object 
 2   on_street           5000 non-null   object 
 3   species_name        5000 non-null   object 
 4   neighbourhood_name  5000 non-null   object 
 5   date_planted        2363 non-null   object 
 6   diameter            5000 non-null   float64
 7   street_side_name    5000 non-null   object 
 8   genus_name          5000 non-null   object 
 9   assigned            5000 non-null   object 
 10  civic_number        5000 non-null   int64  
 11  plant_area          4950 non-null   object 
 12  curb                5000 non-null   object 
 13  tree_id             5000 non-null   int64  
 14  common_name         5000 non-null   object 
 15  height_range_id     5000 non-null   int64  
 16  on_str

In [5]:
trees_df.describe()

Unnamed: 0.1,Unnamed: 0,diameter,civic_number,tree_id,height_range_id,on_street_block,latitude,longitude
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,14861.9204,12.340888,2975.7076,128682.5846,2.7344,2960.227,49.247349,-123.107128
std,8680.023278,9.2666,2078.580429,75412.260406,1.56957,2086.861052,0.021251,0.049137
min,2.0,0.0,2.0,36.0,0.0,0.0,49.202783,-123.22056
25%,7192.75,4.0,1300.5,61321.5,2.0,1300.0,49.230152,-123.144178
50%,14870.0,10.0,2639.0,130130.5,2.0,2600.0,49.247981,-123.105861
75%,22366.75,18.0,4123.0,191332.0,4.0,4100.0,49.263275,-123.063484
max,29992.0,71.0,9113.0,270750.0,9.0,9100.0,49.29393,-123.023311


In [6]:
# Checking for null values in the trees dataset
null_values_df = trees_df.isnull().sum()
print("Null values in the dataframe: \n", null_values_df)

Null values in the dataframe: 
 Unnamed: 0               0
std_street               0
on_street                0
species_name             0
neighbourhood_name       0
date_planted          2637
diameter                 0
street_side_name         0
genus_name               0
assigned                 0
civic_number             0
plant_area              50
curb                     0
tree_id                  0
common_name              0
height_range_id          0
on_street_block          0
cultivar_name         2342
root_barrier             0
latitude                 0
longitude                0
dtype: int64
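
From the counts above, date_planted is missing in 2637 of 5000 rows (about 52.7%), cultivar_name in 2342 (about 46.8%) and plant_area in 50 (1%). A small sketch of how these shares could be computed directly, and how date_planted could be parsed into real dates, is given below. It uses only the five preview rows from `trees_df.head()` as sample data, so its percentages differ from the full dataset.

```python
import pandas as pd

# The five preview rows from trees_df.head(), reduced to the columns with gaps
sample_df = pd.DataFrame({
    "date_planted": ["2000-02-23", "1992-02-04", None, "1999-11-12", None],
    "cultivar_name": [None, "CHANTICLEER", None, "AUTUMN APPLAUSE", None],
    "plant_area": ["15", "7", "7", "7", "N"],
})

# Share of missing values per column, as a percentage
missing_pct = sample_df.isnull().mean().mul(100).round(1)

# Parse the date strings; missing or unparseable entries become NaT
sample_df["date_planted"] = pd.to_datetime(sample_df["date_planted"], errors="coerce")

print(missing_pct)
print(sample_df["date_planted"].dtype)
```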


Exploratory visualizations:
Minimum 4 exploratory visualizations that support in answering your initial question(s) and that potentially open the door for new ones. 
These could be repeated plots, or draft plots (essentially plots you are creating for yourself to explore the data with). 
Use a variety of plot types.
Explanation of what your thought process is behind why you are making specific plots, and what the plots tells us:


In [15]:
# Plotting a graph to answer Question 1.) Exploring different Vancouver neighbourhoods: what do they tell me?
# Creating a scatter plot of Vancouver neighbourhoods
van_neighbourhood_plot = alt.Chart(trees_df).mark_circle(opacity=0.5).encode(
    x=alt.X('longitude', scale=alt.Scale(domain=[-123.22, -123.021]), title='Longitude'),
    y=alt.Y('latitude', scale=alt.Scale(domain=[49.2, 49.3]), title='Latitude'),
    color=alt.Color('neighbourhood_name', scale=alt.Scale(scheme='turbo'), title='Neighbourhood'),
    tooltip=['neighbourhood_name', 'latitude', 'longitude']  
).properties(
    width=600,
    height=400
).interactive()

# Show the scatter plot
van_neighbourhood_plot

The above plot is a scatter plot in which each tree is drawn as a circle positioned by its longitude and latitude and coloured by the "neighbourhood_name" column. Because the points cluster by neighbourhood, the result is a colorful, map-like view of the different Vancouver neighbourhoods.
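
The hard-coded longitude domain in the plot starts at -123.22, while the minimum longitude reported by `trees_df.describe()` is -123.22056, so the westernmost tree falls just outside the axis range. One possible alternative is to derive the domains from the data with a small margin. The sketch below uses only the extreme values from the describe() output. The helper name `padded_domain` and the padding of 0.005 degrees are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Extremes copied from the trees_df.describe() output
coords = pd.DataFrame({
    "longitude": [-123.22056, -123.023311],
    "latitude": [49.202783, 49.29393],
})


def padded_domain(series, pad=0.005):
    """Return [min - pad, max + pad] as plain floats (hypothetical helper)."""
    return [float(series.min() - pad), float(series.max() + pad)]


lon_domain = padded_domain(coords["longitude"])
lat_domain = padded_domain(coords["latitude"])

print(lon_domain, lat_domain)
```

In the notebook, the resulting lists could be passed as `alt.Scale(domain=lon_domain)` and `alt.Scale(domain=lat_domain)` with `coords` replaced by `trees_df`.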

In [21]:
# Plotting a graph to answer Question 2.) Total count of different species of trees planted in Vancouver.

# Grouping the data by 'species_name' and calculating the count of each species
species_counts = trees_df['species_name'].value_counts().reset_index()
species_counts.columns = ['species_name', 'count']

# Creating a bar chart of total count of different species of trees
species_plot = alt.Chart(species_counts).mark_bar().encode(
    x=alt.X('species_name:N', title='Species of Trees'),
    y=alt.Y('count:Q', title='Total Count'),
    tooltip=['species_name', 'count']  # Show species name and count on hover
).properties(
    width=600,
    height=400
)

# Show the bar chart
species_plot

The above bar plot shows the "species_name" column on the X axis and the count of each species on the Y axis, highlighting how many trees of each species are found in Vancouver.

In [24]:
# Plotting a graph to answer Question 3.) Distribution of different species of trees in Vancouver according to their latitude and longitude.
# Creating a heatmap of tree distribution
tree_distribution_plot = alt.Chart(trees_df).mark_rect().encode(
    alt.X('longitude:Q', bin=alt.Bin(maxbins=50), title='Longitude'),
    alt.Y('latitude:Q', bin=alt.Bin(maxbins=50), title='Latitude'),
    color=alt.Color('count()', scale=alt.Scale(scheme='viridis'), title='Number of Trees')
).properties(
    width=600,
    height=400
)

# Show the heatmap
tree_distribution_plot

The above heatmap bins the trees by latitude and longitude and shows how they are distributed across Vancouver; the colour legend gives the number of trees in each bin. Note that it counts all trees together and does not yet separate the individual species.

In [27]:
# Plotting a facet graph to answer Question 4.) Street_side_name vs species/genus of trees: different types of trees based on even or odd street side name.
# Grouping the data by 'street_side_name', 'species_name', and 'genus_name' and calculating the count
street_species_counts = trees_df.groupby(['street_side_name', 'species_name', 'genus_name']).size().reset_index(name='count')

# Creating a facet plot to explore different types of trees based on even or odd street side name
facet_plot = alt.Chart(street_species_counts).mark_point().encode(
    x=alt.X('species_name:N', title='Species/Genus of Trees'),
    y=alt.Y('count:Q', title='Total Count'),
    color=alt.Color('species_name:N', title='Species/Genus of Trees'),
    tooltip=['species_name', 'genus_name', 'count']
).properties(
    width=200,
    height=200
).facet(
    column=alt.Column('street_side_name:N', title='Street Side Name')
)

# Show the facet plot
facet_plot

The above facet plot is a set of scatter plots, one panel per street side name, that shows the species of trees on the X axis and their total count on the Y axis.
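
Raw counts are hard to compare across panels when one street side has more trees than the other. A table of within-side shares is one possible complement to the plot. The sketch below builds such a table with `pd.crosstab` from the five preview rows, so the numbers are illustrative only and do not describe the full dataset.

```python
import pandas as pd

# Preview rows from trees_df.head(); the notebook would use trees_df instead
sample_df = pd.DataFrame({
    "street_side_name": ["EVEN", "ODD", "ODD", "EVEN", "ODD"],
    "genus_name": ["ACER", "PYRUS", "PINUS", "FRAXINUS", "AESCULUS"],
})

# Share of each genus within each street side (every column sums to 1)
genus_share = pd.crosstab(
    sample_df["genus_name"],
    sample_df["street_side_name"],
    normalize="columns",
)

print(genus_share.round(2))
```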

Concluding remarks: 
Explain which four plots you are including in your report and how they will be changed for your audience. 

Plot 1 - Different Vancouver Neighbourhoods w.r.t. their latitude and longitude:
Original Plot: A scatter plot showing the data points as circles, giving the distribution of different Vancouver neighbourhoods according to their latitude and longitude.
Change for the audience: I'll include different tree species and focus on just the main types of trees people see most often. This will help the stakeholders understand which trees are most common in each neighborhood.

Plot 2 - Total Count of Different Tree species Planted in Vancouver:
Original Plot: A bar chart showing how many of each different species of trees are planted in Vancouver.
Change for the audience: I'll add numbers to each bar so it's clear how many trees of each type there are. I also plan to highlight the top three most common types to make it easier to see which trees are most popular.

Plot 3 - Distribution of Trees by Latitude and Longitude:
Original Plot: A heatmap showing how many trees fall into each latitude and longitude bin across Vancouver.
Change for the audience: I'll add a simple comparison between even and odd street sides, so that it's easy to see if there's a difference in tree types between the two sides of the street.

Plot 4 - Street Side Name vs. Tree Species/Genus:
Original Plot: A facet plot showing which types of trees are found on each side of the street.
Change for the audience: I'll use a chart that stacks bars to show the proportion of each tree type on each side of the street. This will make it clear which trees are most common on even and odd sides.

References: 
1. https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name
2. https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv
3. https://stackoverflow.com/questions/62602367/altair-use-a-field-to-specify-the-domain-of-the-y-axis