# Final Project: Exploratory Data Analysis of the Vancouver Street Trees Dataset


This project explores the Vancouver Street Trees dataset through summary statistics and plots.

The questions of interest are:

1. **Which neighborhoods in Vancouver have the most trees?**

2. **How are the height range and diameter of trees related?**

3. **What is the distribution of tree species in Vancouver?**

4. **How does tree diameter vary by species?**

5. **What is the geographical distribution of trees in Vancouver?**

6. **What are the top 10 popular trees in Vancouver?**

7. **Visualize the distributions of all numerical columns for popular trees in Vancouver.**

8. **What is the most frequent combination of height and diameter among popular trees in Vancouver?**

9. **Visualize the count of all categorical aspects of popular trees in Vancouver.**

10. **Explore the relationship between categorical and numerical columns in the popular tree data frame.**

   

These questions are preliminary and can evolve as more insights are gained from the data. The exploration uses the Pandas and Altair libraries in Python.

## Description & Review of Data

For this project, I use a subset of the Vancouver Street Trees data from the City of Vancouver website (the subset was provided by the instructors of the Data Visualization 2024S course). The dataset includes information about public trees on boulevards in Vancouver, including tree coordinates, species, and other related characteristics. According to the City of Vancouver website, park trees and private trees are not included in the dataset.

In [1]:
# Import necessary libraries
import pandas as pd
import altair as alt

In [2]:
# Load the dataset
url = 'https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv'
trees_df = pd.read_csv(url, parse_dates=['date_planted'])

# Initial Data Inspection

In [3]:
# Display basic information about the dataset
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Unnamed: 0          5000 non-null   int64         
 1   std_street          5000 non-null   object        
 2   on_street           5000 non-null   object        
 3   species_name        5000 non-null   object        
 4   neighbourhood_name  5000 non-null   object        
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64       
 7   street_side_name    5000 non-null   object        
 8   genus_name          5000 non-null   object        
 9   assigned            5000 non-null   object        
 10  civic_number        5000 non-null   int64         
 11  plant_area          4950 non-null   object        
 12  curb                5000 non-null   object        
 13  tree_id             5000 non-null   int64       

In [4]:
# Display the first few rows and the column names
print(trees_df.head())
print(trees_df.columns)

   Unnamed: 0      std_street       on_street   species_name  \
0       10747       W 20TH AV       W 20TH AV    PLATANOIDES   
1       12573       W 18TH AV       W 18TH AV     CALLERYANA   
2       29676         ROSS ST         ROSS ST          NIGRA   
3        8856        DOMAN ST        DOMAN ST      AMERICANA   
4       21098  EAST BOULEVARD  EAST BOULEVARD  HIPPOCASTANUM   

  neighbourhood_name date_planted  diameter street_side_name genus_name  \
0         Riley Park   2000-02-23      28.5             EVEN       ACER   
1      Arbutus-Ridge   1992-02-04       6.0              ODD      PYRUS   
2             Sunset          NaT      12.0              ODD      PINUS   
3          Killarney   1999-11-12      11.0             EVEN   FRAXINUS   
4        Shaughnessy          NaT      15.5              ODD   AESCULUS   

  assigned  ...  plant_area curb tree_id           common_name  \
0        N  ...          15    Y   21421          NORWAY MAPLE   
1        N  ...           7    Y

**Data set information**

- The dataset contains 5000 rows and 21 columns.
- Some columns have missing values, like date_planted, plant_area and cultivar_name.
- Key columns of interest include species_name, diameter, neighbourhood_name, latitude, and longitude.
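
The missing-value observation above can be checked directly in pandas. The sketch below uses a small made-up frame (`toy_df` is a hypothetical stand-in, not rows from the dataset); the same two lines should apply to `trees_df`.

```python
import pandas as pd

# Toy stand-in for trees_df (made-up rows, not taken from the dataset)
toy_df = pd.DataFrame({
    "species_name": ["PLATANOIDES", "NIGRA", "AMERICANA"],
    "date_planted": pd.to_datetime(["2000-02-23", None, "1999-11-12"]),
    "plant_area": ["15", None, None],
})

# Count missing values per column and keep only columns that have any
missing_counts = toy_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```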

In [5]:
# Display summary statistics of the dataset
trees_df.describe()

# Filter the relevant columns for analysis
columns_of_interest = [
    'on_street', 'species_name', 'neighbourhood_name', 'date_planted', 
    'diameter', 'genus_name', 'common_name', 'height_range_id', 'root_barrier', 'latitude', 'longitude'
]
trees_df = trees_df[columns_of_interest]

# Display summary statistics for selected columns
trees_df.describe()

        diameter  height_range_id   latitude    longitude
count     5000.0           5000.0     5000.0       5000.0
mean   12.340888           2.7344  49.247349  -123.107128
std       9.2666          1.56957   0.021251     0.049137
min          0.0              0.0  49.202783   -123.22056
25%          4.0              2.0  49.230152  -123.144178
50%         10.0              2.0  49.247981  -123.105861
75%         18.0              4.0  49.263275  -123.063484
max         71.0              9.0   49.29393  -123.023311


**Key Observations:**
    
- The mean diameter of trees is approximately 12.34 inches.
- The data includes various neighborhoods and tree species.
- Latitude and longitude values range within reasonable bounds for Vancouver.

**Filtering and Cleaning Data**

- Ensure all rows have latitude and longitude values.
- Ensure latitude and longitude are of type float.
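
The notebook does not show the code for these two cleaning steps, so the following is only one possible way to perform them. It runs on a made-up frame (`toy_df`, `coord_cols`, and `clean_df` are hypothetical names) in which the coordinates are deliberately stored as strings.

```python
import pandas as pd

# Toy stand-in for trees_df; coordinates are strings on purpose, one is missing
toy_df = pd.DataFrame({
    "common_name": ["NORWAY MAPLE", "PISSARD PLUM", "KWANZAN FLOWERING CHERRY"],
    "latitude": ["49.25", None, "49.23"],
    "longitude": ["-123.10", "-123.05", "-123.14"],
})

coord_cols = ["latitude", "longitude"]

# Coerce both columns to numeric; entries that cannot be parsed become NaN
toy_df[coord_cols] = toy_df[coord_cols].apply(pd.to_numeric, errors="coerce")

# Drop rows missing either coordinate and make the float dtype explicit
clean_df = toy_df.dropna(subset=coord_cols).astype(
    {col: "float64" for col in coord_cols}
)
print(clean_df.dtypes)
```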

## Exploratory Visualizations

Let's start exploring the data to answer the questions posed at the beginning.

**Question 1:** Which Neighborhoods in Vancouver Have the Most Trees?

In [6]:
neighbourhood_trees = alt.Chart(trees_df).mark_bar().encode(
    alt.X('count()', title='Number of Trees Planted'),
    alt.Y('neighbourhood_name', sort='-x', title='Neighborhood')
).properties(
    title='Neighborhood Tree Counts'
)
neighbourhood_trees

The plot identifies the neighborhoods with the most trees, which can be useful for urban planning and environmental studies. Renfrew-Collingwood, Kensington-Cedar Cottage, and Hastings-Sunrise are the top three neighborhoods by number of trees planted.
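
A ranking read off a bar chart can be cross-checked with a plain count. This is a minimal sketch on made-up data (`toy_df` is hypothetical); on the real data the same call would be made on `trees_df['neighbourhood_name']`.

```python
import pandas as pd

# Toy stand-in for trees_df (made-up rows, not taken from the dataset)
toy_df = pd.DataFrame({
    "neighbourhood_name": [
        "Sunset", "Riley Park", "Sunset", "Killarney", "Sunset", "Riley Park",
    ]
})

# value_counts sorts in descending order, so head(3) gives the top three
top_neighbourhoods = toy_df["neighbourhood_name"].value_counts().head(3)
print(top_neighbourhoods)
```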

**Question 2:** Are Height Range and Diameter of Trees Related?

In [7]:
tree_size_plot_scatter = alt.Chart(trees_df).mark_circle().encode(
    alt.X('diameter', title='Diameter'),
    alt.Y('height_range_id', title='Height Range ID')
)
tree_size_plot_line = alt.Chart(trees_df).mark_line(color='green').encode(
    alt.X('mean(diameter)', title='Mean Diameter'),
    alt.Y('height_range_id', title='Height Range')
)
tree_size_plot_scatter + tree_size_plot_line


This plot investigates the relationship between tree height range and diameter, which can provide insight into tree growth patterns. The relationship is positive: taller trees have larger diameters on average, although many tall trees have small diameters.
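
The green line in the chart is a per-height-range mean, and the same summary can be tabulated with a `groupby`. The sketch below uses made-up values (`toy_df` is hypothetical) and adds the median and count, which help judge how much each mean can be trusted.

```python
import pandas as pd

# Toy stand-in for trees_df (made-up rows, not taken from the dataset)
toy_df = pd.DataFrame({
    "height_range_id": [1, 1, 2, 2, 3, 3],
    "diameter": [3.0, 5.0, 8.0, 12.0, 15.0, 25.0],
})

# Mean, median, and number of trees per height range
summary = toy_df.groupby("height_range_id")["diameter"].agg(
    ["mean", "median", "count"]
)
print(summary)
```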

**Question 3:** What is the Distribution of Tree Species in Vancouver?

In [8]:
# Get the top 20 species by count
top_species = trees_df['species_name'].value_counts().head(20).index.tolist()
filtered_trees_df = trees_df[trees_df['species_name'].isin(top_species)]

species_distribution = alt.Chart(filtered_trees_df).mark_bar().encode(
    alt.X('count()', title='Number of Trees'),
    alt.Y('species_name', sort='-x', title='Species')
).properties(
    title='Distribution of Top 20 Tree Species in Vancouver'
)
species_distribution

This bar chart shows the distribution of tree species in Vancouver, highlighting the most common species. The most common species are Serrulata, Platanoides, and Cerasifera.


**Question 4:** How Does Tree Diameter Vary by Species?

In [9]:
diameter_by_species = alt.Chart(filtered_trees_df).mark_boxplot().encode(
    alt.X('diameter:Q', title='Diameter'),
    alt.Y('species_name:N', sort='-x', title='Species')
).properties(
    title='Tree Diameter Variation by Top 20 Species'
)
diameter_by_species


This plot displays the variation in tree diameter across species, helping to identify species with larger or smaller typical diameters. Several species show outliers: Rubrum has an outlier that is the largest tree by diameter, while Hippocastanum has the largest median diameter.
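
Medians and extreme values can be hard to read off a dense boxplot, so a numeric check may help. This sketch uses made-up diameters (`toy_df` is hypothetical and only mimics the pattern described above) to show how the largest median and the single largest tree can belong to different species.

```python
import pandas as pd

# Toy stand-in for filtered_trees_df (made-up rows, not taken from the dataset)
toy_df = pd.DataFrame({
    "species_name": [
        "RUBRUM", "RUBRUM", "RUBRUM",
        "HIPPOCASTANUM", "HIPPOCASTANUM",
        "SERRULATA", "SERRULATA",
    ],
    "diameter": [6.0, 8.0, 70.0, 20.0, 24.0, 10.0, 12.0],
})

# Median diameter per species, largest first
median_by_species = (
    toy_df.groupby("species_name")["diameter"].median().sort_values(ascending=False)
)

# Species of the single largest tree (a possible outlier)
largest_tree_species = toy_df.loc[toy_df["diameter"].idxmax(), "species_name"]

print(median_by_species)
print(largest_tree_species)
```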

**Question 5:** What is the Geographical Distribution of Trees in Vancouver?

In [10]:
# Take a subset of the data for geographical plotting
trees_subset = trees_df.sample(500)

# Plot the sampled data as a 2D binned heatmap (rectangular bins, not true hexbins)
geographical_distribution_hexbin = alt.Chart(trees_subset).mark_rect().encode(
    alt.X('longitude:Q', bin=alt.Bin(maxbins=30), title='Longitude'),
    alt.Y('latitude:Q', bin=alt.Bin(maxbins=30), title='Latitude'),
    alt.Color('count()', scale=alt.Scale(scheme='greenblue'), title='Count')
).properties(
    title='Geographical Distribution of Trees in Vancouver',
    width=800,
    height=600
).interactive()

geographical_distribution_hexbin

**Question 6:** What Are the Top 10 Popular Trees in Vancouver?

In [11]:
top_trees = trees_df['common_name'].value_counts().head(10).index.tolist()
popular_trees_df = trees_df[trees_df['common_name'].isin(top_trees)]
popular_trees = alt.Chart(popular_trees_df).mark_bar().encode(
    alt.X('count()', title='Count'),
    alt.Y('common_name', sort='-x', title='Common Name')
).properties(
    title='Top 10 Popular Trees in Vancouver'
)
popular_trees


**Question 7:** Visualize the Distributions of All Numerical Columns for Popular Trees in Vancouver

In [12]:
numerical_columns = ['diameter', 'height_range_id']
distribution_plots = alt.Chart(popular_trees_df).mark_bar().encode(
    alt.X(alt.repeat('column'), type='quantitative', bin=alt.Bin(maxbins=25)),
    alt.Y('count()')
).properties(
    width=250,
    height=150
).repeat(
    column=numerical_columns
)
distribution_plots

**Question 8:** What is the Most Frequent Combination of Height and Diameter Among Popular Trees in Vancouver?

In [13]:
height_diameter_combination = alt.Chart(popular_trees_df).mark_rect().encode(
    alt.X('diameter', bin=alt.Bin(maxbins=30)),
    alt.Y('height_range_id', bin=alt.Bin(maxbins=30)),
    alt.Color('count()', scale=alt.Scale(scheme='greenblue'), title='Count')
).properties(
    width=350,
    height=350
)
height_diameter_combination
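
The darkest cell of the heatmap can also be identified numerically. The sketch below is an assumption-laden illustration: the data are made up, and the 2-inch diameter bins are chosen here for readability and need not match the bins Altair picks with `maxbins=30`.

```python
import pandas as pd

# Toy stand-in for popular_trees_df (made-up rows, not taken from the dataset)
toy_df = pd.DataFrame({
    "diameter": [3.0, 3.5, 2.5, 12.0, 14.0, 3.0, 20.0],
    "height_range_id": [1, 1, 1, 3, 3, 2, 4],
})

# Left-closed 2-inch bins: [0, 2), [2, 4), ..., [20, 22)
bin_edges = list(range(0, 24, 2))
toy_df["diameter_bin"] = pd.cut(toy_df["diameter"], bins=bin_edges, right=False)

# Count trees in every observed (diameter bin, height range) combination
combo_counts = toy_df.groupby(
    ["diameter_bin", "height_range_id"], observed=True
).size()

# Index of the most frequent combination: (diameter interval, height range id)
most_common = combo_counts.idxmax()
print(most_common, combo_counts.max())
```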

**Question 9:** Visualize the Count of All Categorical Aspects of Popular Trees in Vancouver

In [14]:
categorical_columns = ['species_name', 'neighbourhood_name', 'genus_name', 'common_name', 'root_barrier']
categorical_plots = alt.Chart(popular_trees_df).mark_bar().encode(
    alt.X('count()'),
    alt.Y(alt.repeat('row'), type='nominal', sort='-x')
).properties(
    width=250,
    height=150
).repeat(
    row=categorical_columns
)
categorical_plots


**Question 10:** Explore the Relationship Between Categorical and Numerical Columns in the Popular Tree Data Frame

In [15]:
# Use mark_tick for a compact representation of tree diameters
diameter_tick = alt.Chart(filtered_trees_df).mark_tick().encode(
    alt.X('diameter:Q', title='Diameter'),
    alt.Y('species_name:N', sort='-x', title='Species')
).properties(
    title='Tree Diameter Distribution by Species'
)
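# Only the last expression in a cell is rendered automatically, so display this chart explicitly
diameter_tick.display()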

# Explore the relationship between categorical and numerical columns
relationship_plots = alt.Chart(popular_trees_df).mark_boxplot().encode(
    alt.X(alt.repeat('column'), type='quantitative'),
    alt.Y(alt.repeat('row'), type='nominal', sort='-x')
).properties(
    width=350,
    height=350
).repeat(
    column=numerical_columns,
    row=categorical_columns
)
relationship_plots

## Concluding Remarks

1. **Neighborhood Tree Counts**: 
   - Renfrew-Collingwood, Kensington-Cedar Cottage, and Hastings-Sunrise are the neighborhoods with the highest number of trees planted. This information can guide urban planning and environmental efforts in these areas.
 
2. **Height and Diameter Relationship**:
   - There is a positive relationship between tree height and diameter, with taller trees generally having larger diameters. However, there are exceptions, as some tall trees have smaller diameters.

3. **Distribution of Tree Species**:
   - The most common tree species in Vancouver include Serrulata, Platanoides, and Cerasifera. Understanding the distribution of tree species helps in biodiversity conservation and tree management practices.

4. **Tree Diameter Variation by Species**:
   - Species like Platanoides have a larger average diameter, while species like Japonica have smaller diameters. This variation can inform tree selection for different urban environments.

5. **Geographical Distribution of Trees**:
   - The geographical distribution of trees shows higher densities in downtown and central areas of Vancouver. This spatial information is crucial for urban forestry planning and environmental assessments.
   
6. **Top 10 Popular Trees in Vancouver**:
   - The Kwanzan Flowering Cherry is the most popular tree in Vancouver, followed by the Pissard Plum and Norway Maple. This insight can help future planting strategies enhance urban aesthetics and biodiversity, especially because these trees have different cycles and flower in different months.

7. **Distribution of Numerical Columns for Popular Trees**:
   - The distributions show that most trees have a diameter between 10 and 15 inches. This information helps understand the typical size and growth patterns of popular trees.

8. **Frequent Combination of Height and Diameter**:
   - The heatmap shows that the most frequent combination of height and diameter among popular trees is a diameter between 2 and 4 inches and a height.

9. **Relationship Between Categorical and Numerical Columns**:
   - The boxplots highlight interesting relationships between categorical variables like species name, genus name, common name, and root barrier with numerical variables like diameter and height.


## Recommendations for Final Report:
- Enhance the titles and subtitles of each plot for better context.
- Improve axis labels for clarity and readability.
- Try different color schemes to improve graph readability.
- Add tooltips to interactive plots for detailed information.
- Summarize key insights in markdown cells after each set of graphs.

### References:
- City of Vancouver Open Data Portal
- Altair Documentation
- Pandas Documentation
