> **WARNING**: This is a work in progress

# Exploring the trees of Berlin's streets

The folk at [Code for Berlin](https://www.codefor.de/berlin/) have created a REST API offering access to the database of Berlin street trees and have [an issue open](https://github.com/codeforberlin/tickets/issues/3) asking people to try to do "something" with it. It seemed a cool way to look more deeply into the architecture of REST APIs on both the client and server side as well as playing with an interesting dataset, given I live in Berlin and like trees.

The API itself is built using the [Django REST Framework](https://www.django-rest-framework.org/) and its source is hosted [here](https://github.com/codeforberlin/trees-api-v2). An [interactive map](https://trees.codefor.de/) uses the API to plot all the trees on top of tiles from OpenStreetMap and allows some simple filtering. I took a look and it proved a great intro to the data, but I wanted to do a deeper analysis.

Some of the things I wanted to look into were:

* Which areas have the most trees, the oldest trees etc
* Are there any connections between the number of trees and other datapoints (air quality, socioeconomic demographics etc)
* Why are there no trees showing on my street even though I can see some out the window as I type this? 

## What sort of data is there and how can it be consumed? 

One of the cool things about the Django REST Framework is the way its API can be explored out of the box. Simply point your browser to the API using the following link:

https://trees.codefor.de/api/v2

You should see something like this:

```json

HTTP 200 OK
Allow: GET, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept

{
    "trees": "https://trees.codefor.de/api/v2/trees/",
    "species": "https://trees.codefor.de/api/v2/species/",
    "genera": "https://trees.codefor.de/api/v2/genera/",
    "boroughs": "https://trees.codefor.de/api/v2/boroughs/"
}

```

Essentially this is telling us that we have four endpoints - trees, species, genera and boroughs. You can follow the links to each one to get more details. To explore the data available, I hacked together a simple Python wrapper which you can find here:

https://github.com/scrambldchannel/berlin-trees-api-pywrapper

### Usage

The wrapper can be installed via pip:

```
pip install git+https://github.com/scrambldchannel/berlin-trees-api-pywrapper.git
```

#### Set up the wrapper



In [None]:
# Import the module and other useful libs

import json
from berlintreesapiwrapper import TreesWrapper

# Instantiate the api wrapper object
# you can change the base url if you are running a local instance of the api 

base_url = "https://trees.codefor.de/api/"
api_version = 2

api = TreesWrapper(api_root = base_url, version = api_version)

#### Calling functions

There is a function defined for each endpoint. At this stage, each function accepts only a couple of parameters. Each endpoint returns paginated results (the current config seems to return ten results per page) so the page number is a valid parameter for each function, defaulting to 1 if not supplied. See examples below.   
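
Since every endpoint is paginated in the same way, the page loop can be written once and reused. The following is only a sketch: `iter_results` is a hypothetical helper, not part of the wrapper, and the fake pages are made up so that it runs without touching the API. It relies on the last page reporting a `next` value of null, which is what the responses I looked at do.

```python
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[[int], dict], key: str = "results") -> Iterator[dict]:
    """Yield every item from a paginated endpoint, one page at a time.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON body for that page.
    """
    page = 1
    while True:
        body = fetch_page(page)
        yield from body[key]
        # the last page reports "next" as null (None once decoded)
        if body.get("next") is None:
            break
        page += 1


# Stand-in for the real API so the sketch runs offline (made-up data)
fake_pages = {
    1: {"results": [{"borough": "Mitte", "count": 1}], "next": "page-2"},
    2: {"results": [{"borough": "Pankow", "count": 2}], "next": None},
}

rows = list(iter_results(lambda page: fake_pages[page]))
print(rows)
```

With the wrapper, the same helper could be called as `iter_results(lambda p: api.get_boroughs(page=p).json())`, or with `key="features"` for the trees endpoint.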

#### Trees endpoint

The most versatile endpoint is the trees endpoint which returns sets of individual trees. The endpoint allows filtering in a number of different ways (see https://github.com/codeforberlin/trees-api-v2#making-queries).

My basic wrapper function doesn't support anything other than a simple dump of all trees, by page, at this stage. This was sufficient for pulling all the data, but I will look into enhancing the wrapper later; the ability to filter trees based on location is particularly interesting.


In [None]:
# Eg. request first page of all trees

ret_trees = api.get_trees()
print(json.dumps(ret_trees.json(), indent=4, sort_keys=True))

In [None]:
# Eg. request the 5000th page of all trees

ret_trees = api.get_trees(page=5000)
print(json.dumps(ret_trees.json(), indent=4, sort_keys=True))

#### Other endpoints

The other endpoints just return a count of the trees by borough, species and genus. Results can be filtered by page and the name of the borough etc. See examples below.

In [None]:
# Eg. request first page of the borough count

ret_borough = api.get_boroughs()
print(json.dumps(ret_borough.json(), indent=4, sort_keys=True))

In [None]:
# Eg. request the count for a specific borough

ret_borough = api.get_boroughs(borough = "Friedrichshain-Kreuzberg")
print(json.dumps(ret_borough.json(), indent=4, sort_keys=True))

In [None]:
# Eg. request the count for a specific species

ret_species = api.get_species(species = "Fagus sylvatica")
print(json.dumps(ret_species.json(), indent=4, sort_keys=True))

In [None]:
# Eg. request a specific page of the count of genera

ret_genera = api.get_genera(page = 13)
print(json.dumps(ret_genera.json(), indent=4, sort_keys=True))

## Data exploration 

Now we have a framework for pulling the data, let's create some simple visualisations to give an overview of the data. 

### Visualising the number of trees per borough

This helps give an overview of the data (and playing with graphs is cool)

#### Pull data into a dataframe and create simple bar chart


In [None]:
# Import necessary libraries

import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Loop over all pages, create a dataframe and create simple horizontal bar graph

page = 1
boroughs = []
counts = []

while True:
    result = api.get_boroughs(page=page).json()
    for i in result['results']:
        boroughs.append(i['borough'])
        counts.append(i['count'])
    # the last page has no "next" link
    if result['next'] is None:
        break
    page = page + 1

df = pd.DataFrame({'borough': boroughs, 'count': counts})

ax = df.plot.barh(figsize=(30, 14) , x = 'borough', color='green', alpha = 0.3, width = 0.9, title = "Number of trees by Borough") 


#### Plot these figures onto a simple map

To get a simple shapefile of the administrative boundaries of Berlin I used this interface which makes it easy to select the areas you want and export. There are other ways of doing this though, such as using the Overpass API, which I will definitely look into in the future. 

https://wambachers-osm.website/boundaries/

I've saved the shapefiles into a local directory and use this code to import them:

In [None]:
# Import a couple of necessary libraries

import geopandas as gpd
import shapely

# set path for data imports

dataset_path = "../datasets/"

# Import the file into a geodataframe

gdf = gpd.read_file(dataset_path + 'Berlin_AL9-AL9.shp', encoding='utf-8')


Now we want to combine the dataframe containing the tree counts by borough with the data from the shapefile. Happily, I know the borough names correspond to the names in the shapefile from OSM, so we can join the two using the merge method.

In [None]:
# Merge counts from dataframe with our geodataframe into new geodataframe

map_gdf = gdf.merge(df, left_on='name', right_on='borough')
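
Because `merge` defaults to an inner join, any borough whose name differed between the two sources would silently drop off the map. A quick way to check for that is a left join with `indicator=True`. This is a sketch on made-up frames standing in for the shapefile attributes and the API counts, not the real data:

```python
import pandas as pd

# Made-up stand-ins for the shapefile attributes and the API counts
shapes = pd.DataFrame({"name": ["Mitte", "Pankow", "Neukölln"]})
counts = pd.DataFrame({"borough": ["Mitte", "Pankow"], "count": [10, 20]})

# A left join keeps every shape; indicator=True adds a "_merge" column
# saying whether each row was found in both frames or only the left one
checked = shapes.merge(counts, left_on="name", right_on="borough",
                       how="left", indicator=True)

# Names that exist in the shapefile but have no matching count
unmatched = checked.loc[checked["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```

An empty list here would mean every shape found a count and the inner join loses nothing.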


Now create a choropleth as per the how-to here - https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d


In [None]:
# set a variable that will call whatever column we want to visualise on the map

variable = "count"

# set the range , figures and axes

vmin, vmax = 500, 60000
fig, ax = plt.subplots(1, figsize=(30, 14))

# create map, passing vmin/vmax so the map colours match the colorbar below
map_gdf.plot(column=variable, cmap='Greens', vmin=vmin, vmax=vmax, linewidth=1.5, ax=ax, edgecolor='0.5')
ax.axis("off")

# add a title
ax.set_title("Number of street trees by Borough", horizontalalignment="center",  fontdict={"fontsize": "30", "fontweight" : "5"})
# create an annotation for the data source
ax.annotate("Source: Code for Berlin",xy=(0.1, .08), xycoords="figure fraction", horizontalalignment="left", verticalalignment="top", fontsize=12, color="#555555")

# Create colorbar as a legend
sm = plt.cm.ScalarMappable(cmap="Greens", norm=plt.Normalize(vmin=vmin, vmax=vmax))
# empty array for the data range
sm.set_array([])
# add the colorbar to the figure, attached to the map's axes
cbar = fig.colorbar(sm, ax=ax)

### Pull all the data from the trees endpoint to do more detailed analysis

Looking at counts of trees is fine, but to really analyse the data I want to pull all the individual trees into a single dataframe. To do so, I returned to the trees endpoint. The relevant part of the JSON result is contained within "features" and an individual tree looks like this:

```json

{
    "geometry": {
        "coordinates": [
            13.357809221770479,
            52.56657685261005
        ],
        "type": "Point"
    },
    "id": 38140,
    "properties": {
        "age": 80,
        "borough": "Reinickendorf",
        "circumference": 251,
        "created": "2018-11-11T12:22:35.506000Z",
        "feature_name": "s_wfs_baumbestand_an",
        "genus": "ACER",
        "height": 20,
        "identifier": "s_wfs_baumbestand_an.7329",
        "species": "Acer pseudoplatanus",
        "updated": "2018-11-11T12:22:35.506000Z",
        "year": 1938
    },
    "type": "Feature"
},
```

Essentially, I want to pull all of these trees into a single dataframe by iterating over every page of the trees endpoint. I hacked together this code to accomplish this. It also converts the result to a geodataframe based on the longitude/latitude information returned. Note that this was really slow and probably wasn't the best way to do it, and there are other ways of sourcing the raw data. That said, I wanted to do it as a PoC.

```python

import pandas as pd
import geopandas as gpd

# one list per column, filled one tree at a time
ids, age, borough, circumference = [], [], [], []
genus, height, species, year = [], [], [], []
lat, long = [], []

page = 1

while True:
    this_page = api.get_trees(page=page).json()
    for feature in this_page['features']:
        properties = feature['properties']
        ids.append(feature['id'])
        age.append(properties['age'])
        borough.append(properties['borough'])
        circumference.append(properties['circumference'])
        genus.append(properties['genus'])
        height.append(properties['height'])
        species.append(properties['species'])
        year.append(properties['year'])
        # GeoJSON coordinates are ordered [longitude, latitude]
        long.append(feature['geometry']['coordinates'][0])
        lat.append(feature['geometry']['coordinates'][1])

    # for debugging, can be removed at some point
    print(page)

    # stop once there is no next page
    if this_page["next"] is None:
        break

    page = page + 1

# create dataframe from resulting lists
df = pd.DataFrame(
    {'id': ids,
    'age' : age,
    'borough' : borough,
    'circumference' : circumference,
    'genus' : genus,
    'height' : height,
    'species' : species,
    'year': year,
    'Latitude': lat,
    'Longitude': long})

# convert to geodataframe
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))

```
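
A more compact way to build the rows, as a sketch, is to flatten each feature into a dict first. `feature_to_row` is a hypothetical helper name, and the record used to try it out is the sample tree shown earlier, trimmed to the fields used in the analysis. One thing worth making explicit is that GeoJSON lists point coordinates as longitude first, then latitude.

```python
def feature_to_row(feature: dict) -> dict:
    """Flatten one GeoJSON feature into a single dict suitable for a dataframe row."""
    # GeoJSON stores point coordinates as [longitude, latitude]
    longitude, latitude = feature["geometry"]["coordinates"]
    row = {"id": feature["id"], "Longitude": longitude, "Latitude": latitude}
    # copy only the properties used in the analysis
    for key in ("age", "borough", "circumference", "genus", "height", "species", "year"):
        row[key] = feature["properties"].get(key)
    return row


# The sample record shown earlier, trimmed to the fields used here
sample = {
    "geometry": {"coordinates": [13.357809221770479, 52.56657685261005], "type": "Point"},
    "id": 38140,
    "properties": {"age": 80, "borough": "Reinickendorf", "circumference": 251,
                   "genus": "ACER", "height": 20,
                   "species": "Acer pseudoplatanus", "year": 1938},
    "type": "Feature",
}

row = feature_to_row(sample)
print(row)
```

A list of such rows can be passed straight to `pd.DataFrame(rows)`, which avoids keeping ten separate lists in step.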

Happily, I saved the result to a CSV for future analysis. On that note, I generally find it much easier to do this sort of data exploration, once the data is in text form, using the amazing [VisiData](https://visidata.org/).

In [None]:
# load csv into geodataframe

all_trees_gdf = gpd.read_file(dataset_path + 'all_trees_gdf.csv', encoding='utf-8')

# convert columns back to numeric as necessary

all_trees_gdf['age'] = pd.to_numeric(all_trees_gdf["age"])
all_trees_gdf['circumference'] = pd.to_numeric(all_trees_gdf["circumference"])
all_trees_gdf['year'] = pd.to_numeric(all_trees_gdf["year"])

### Now that we have the data, let's do some basic exploration

Let's start by finding the oldest tree using pandas.

In [None]:
# Get oldest tree

all_trees_gdf[all_trees_gdf['age']==all_trees_gdf['age'].max()]

That looks a bit unlikely, particularly given its height and circumference. At a guess, it was probably planted in 2017. Let's try to identify any other similar outliers.

In [None]:
# OK, that looks a bit spurious, particularly given its height and circumference
# At a guess, it was probably planted in 2017
# Let's look at what other outliers there are

all_trees_gdf.loc[(all_trees_gdf['age'] >= 1500)]

In [None]:
# This seems to show that anything with a year has a sensible age

all_trees_gdf.loc[(all_trees_gdf['age'] == 0) & (all_trees_gdf['year'] >= 1) & (all_trees_gdf['year'] < 2018)]

In [None]:
# but there are a lot of missing ages that have years

all_trees_gdf.loc[(all_trees_gdf['age'].isnull()) & (all_trees_gdf['year'] >= 1) & (all_trees_gdf['year'] < 2018)]
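
Rows with a year but no age could probably be repaired by deriving the age from the year. The sketch below assumes ages are counted relative to 2018, which the sample record shown earlier suggests (year 1938, age 80, created in 2018). Both the reference year and the rows are illustrative, so treat this as one possible approach rather than a verified fix.

```python
import pandas as pd

REFERENCE_YEAR = 2018  # assumption: the year the ages were calculated

# Made-up rows: one complete, one missing its age, one missing both
nan = float("nan")
trees = pd.DataFrame({
    "year": [1938, 1990, nan],
    "age": [80, nan, nan],
})

# Derive an age wherever a plausible planting year exists
derived = REFERENCE_YEAR - trees["year"]
plausible = trees["year"].between(1, REFERENCE_YEAR)

# Keep existing ages; only fill the gaps that have a plausible year
trees["age_filled"] = trees["age"].fillna(derived.where(plausible))
print(trees)
```

Writing to a new column keeps the original `age` values intact, so the derived figures can be compared with the recorded ones before anything is overwritten.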

In [None]:
# What about circumference? 

all_trees_gdf.loc[(all_trees_gdf['circumference'] >= 500) & (all_trees_gdf['circumference'] <= 13000)]

In [None]:
# this should give the oldest tree for each borough

all_trees_gdf.sort_values('age').drop_duplicates(['borough'], keep='last')

In [None]:
# this will give you the tree with the highest circumference for each borough

# more columns can be added to the list passed to drop_duplicates to effectively group by more columns

all_trees_gdf.sort_values('circumference').drop_duplicates(['borough'], keep='last').head()
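
An alternative to the sort-and-drop-duplicates trick, sketched here on made-up rows rather than the real dataset, is `groupby` with `idxmax`, which states the "largest per group" intent more directly:

```python
import pandas as pd

# Made-up trees for illustration
trees = pd.DataFrame({
    "borough": ["Mitte", "Mitte", "Pankow", "Pankow"],
    "species": ["Tilia cordata", "Acer platanoides", "Quercus robur", "Fagus sylvatica"],
    "circumference": [120, 310, 95, 240],
})

# idxmax returns, per borough, the index label of the row with the largest value
largest_idx = trees.groupby("borough")["circumference"].idxmax()

# use those labels to pull out the full rows
largest = trees.loc[largest_idx]
print(largest)
```

Both approaches return a single row per borough when values are tied. On the real data, rows with a missing circumference may need dropping first.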