The database of trees will be created from the SF Street Tree list, with some addresses added and modified based on each tree's listed geolocation (see SF Plum Finder post).

In [82]:
import os
import pandas as pd
original_path = os.path.join('..', 'original_data', 'Processed_Street_Tree_List.csv')
original_data = pd.read_csv(original_path).set_index('TreeID')
original_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 196458 entries, 121399 to 15238
Data columns (total 22 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   qLegalStatus               196401 non-null  object 
 1   qSpecies                   196458 non-null  object 
 2   qAddress                   195060 non-null  object 
 3   SiteOrder                  194654 non-null  float64
 4   qSiteInfo                  196458 non-null  object 
 5   PlantType                  196458 non-null  object 
 6   qCaretaker                 196458 non-null  object 
 7   qCareAssistant             24687 non-null   object 
 8   PlantDate                  70755 non-null   object 
 9   DBH                        152951 non-null  float64
 10  PlotSize                   146108 non-null  object 
 11  PermitNotes                53306 non-null   object 
 12  XCoord                     193516 non-null  float64
 13  YCoord                   

As can be seen, the data contains over 196,000 (!) trees in San Francisco, with a wide range of information spread across 22 columns. Most of these columns are unnecessary here, so we will remove them now and convert some of the data types.

In [83]:
data = original_data.loc[:, ['qSpecies', 'qAddress', 'SiteOrder', 'qSiteInfo']].dropna(subset='qAddress')
data[['SiteOrder']] = data[['SiteOrder']].fillna(1)

# checking to see what size integer is needed
data.loc[data.SiteOrder > 2**8]


TreeID,qSpecies,qAddress,SiteOrder,qSiteInfo
98278,Pinus canariensis :: Canary Island Pine,900 Brotherhood Way,500.0,Sidewalk: Curb side : Yard
137894,Pittosporum undulatum :: Victorian Box,1710 Bush St,1700.0,Sidewalk: Curb side : Cutout
98279,Pinus canariensis :: Canary Island Pine,900 Brotherhood Way,501.0,Sidewalk: Curb side : Yard
96618,Lophostemon confertus :: Brisbane Box,3555@ 19th St 19th St,268.0,Sidewalk: Property side :


It seems as if there are some issues with the site orders of some of the trees. Checking one of the addresses we get:

In [84]:
data.loc[data.qAddress == '900 Brotherhood Way'].SiteOrder

TreeID
178278      5.0
255371      4.0
176156     10.0
178280      7.0
176153      3.0
178282     11.0
178277      1.0
98278     500.0
256768      7.0
256767     22.0
98279     501.0
178291     20.0
255373      6.0
176154      4.0
178284     13.0
178285     14.0
178290     19.0
178279      6.0
255372      5.0
178286     15.0
178288     17.0
178287     16.0
255374     21.0
176152      2.0
178289     18.0
178283     12.0
178281      8.0
Name: SiteOrder, dtype: float64

Let's replace those values with more reasonable ones:

In [85]:
data.loc[98278, 'SiteOrder'] = 23
data.loc[98279, 'SiteOrder'] = 24
data.loc[data.qAddress == '900 Brotherhood Way'].SiteOrder.max()

24.0

Checking the others:

In [86]:
data.loc[data.qAddress == '3555@ 19th St 19th St'].SiteOrder

TreeID
96618    268.0
Name: SiteOrder, dtype: float64

'3555@ 19th St 19th St' isn't a real address anyway, so we will replace both the address and the site order:

In [87]:
data.loc[96618, 'SiteOrder'] = 1
data.loc[96618, 'qAddress'] = '3555 19th St'

And finally the last one:

In [88]:
data.loc[data.qAddress == '1710 Bush St'].SiteOrder

TreeID
137894    1700.0
2588         1.0
Name: SiteOrder, dtype: float64

In [89]:
data.loc[137894, 'SiteOrder'] = 2

# checking
data.loc[data.SiteOrder > 2**8]

TreeID,qSpecies,qAddress,SiteOrder,qSiteInfo
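
The outliers above were found with a fixed threshold and fixed by hand. As a rough sketch of a more general heuristic (not what this notebook does), one could flag any tree whose SiteOrder is far larger than the number of trees recorded at its address. The `slack` value and the variable names are assumptions chosen for illustration; the sample rows reuse TreeIDs and values shown earlier.

```python
import pandas as pd

# Small sample using TreeIDs and site orders shown earlier in this post
trees = pd.DataFrame(
    {'qAddress': ['900 Brotherhood Way'] * 3 + ['1710 Bush St'] * 2,
     'SiteOrder': [1.0, 2.0, 500.0, 1.0, 1700.0]},
    index=pd.Index([178277, 176152, 98278, 2588, 137894], name='TreeID'))

# Number of trees recorded at each tree's address, aligned to the rows
trees_at_address = trees.groupby('qAddress')['SiteOrder'].transform('size')

# Assumption: a site order more than `slack` above the tree count is suspect
slack = 10
suspicious = trees[trees['SiteOrder'] > trees_at_address + slack]
print(suspicious)
```

This is only a heuristic: an address can legitimately have gaps in its site orders, so flagged rows would still need a manual look.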


Now we can convert the data types:

In [90]:
data = data.astype({'qSpecies': 'category', 'qAddress': 'string', 'SiteOrder': 'int8', 'qSiteInfo': 'category'})
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 195060 entries, 121399 to 15238
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype   
---  ------     --------------   -----   
 0   qSpecies   195060 non-null  category
 1   qAddress   195060 non-null  string  
 2   SiteOrder  195060 non-null  int8    
 3   qSiteInfo  195060 non-null  category
dtypes: category(2), int8(1), string(1)
memory usage: 7.8 MB
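
A note on the earlier range check: `int8` holds values from -128 to 127, so a threshold of `2**8` would not flag a SiteOrder between 128 and 256. Below is a minimal sketch of a stricter check that reads the limits of the target dtype with NumPy's `iinfo`. The helper name `fits_dtype` and the sample values are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fits_dtype(values, dtype):
    """Return True if every value lies within the integer range of dtype."""
    limits = np.iinfo(np.dtype(dtype))
    return bool(((values >= limits.min) & (values <= limits.max)).all())

# Illustrative site orders: all within the int8 range
site_order = pd.Series([1.0, 24.0, 127.0])
print(fits_dtype(site_order, 'int8'))

# A value of 200 passes a `> 2**8` check but does not fit in int8
with_outlier = pd.Series([1.0, 200.0])
print(fits_dtype(with_outlier, 'int8'))
print(fits_dtype(with_outlier, 'int16'))
```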


Let's look at the different possible qSiteInfo values:

In [91]:
data.qSiteInfo.cat.categories

Index([':', ': Cutout', ': Yard', 'Back Yard : Cutout', 'Back Yard : Yard',
       'Front Yard :', 'Front Yard : Cutout', 'Front Yard : Pot',
       'Front Yard : Yard', 'Hanging basket : Cutout', 'Hanging basket : Yard',
       'Median :', 'Median : Cutout', 'Median : Hanging Pot', 'Median : Yard',
       'Side Yard : Cutout', 'Side Yard : Pot', 'Side Yard : Yard',
       'Sidewalk: Curb side :', 'Sidewalk: Curb side : Cutout',
       'Sidewalk: Curb side : Hanging Pot', 'Sidewalk: Curb side : Pot',
       'Sidewalk: Curb side : Yard', 'Sidewalk: Property side :',
       'Sidewalk: Property side : Cutout', 'Sidewalk: Property side : Pot',
       'Sidewalk: Property side : Yard', 'Unaccepted Street : Cutout',
       'Unaccepted Street : Pot', 'Unaccepted Street : Yard',
       'unknown : Cutout', 'unknown : Pot', 'unknown : Yard'],
      dtype='object')

Since we only care about trees that are publicly accessible (and exist), we'll filter the data to keep only front yard, hanging basket, median, and sidewalk trees:

In [92]:
acceptable_site_info = ['Front Yard :', 'Front Yard : Cutout', 'Front Yard : Pot',
       'Front Yard : Yard', 'Hanging basket : Cutout', 'Hanging basket : Yard',
       'Median :', 'Median : Cutout', 'Median : Hanging Pot', 'Median : Yard', 'Sidewalk: Curb side :', 
       'Sidewalk: Curb side : Cutout',
       'Sidewalk: Curb side : Hanging Pot', 'Sidewalk: Curb side : Pot', 'Sidewalk: Curb side : Yard', 
       'Sidewalk: Property side :', 'Sidewalk: Property side : Cutout', 'Sidewalk: Property side : Pot',
       'Sidewalk: Property side : Yard']

data = data.loc[data['qSiteInfo'].isin(acceptable_site_info)]
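
As an alternative sketch, the same selection can be expressed by prefix instead of listing every category. For the categories printed above, the pattern below picks out the same values as the explicit list; the variable names and the sample series are illustrative only.

```python
import pandas as pd

# A few of the qSiteInfo categories shown above
site_info = pd.Series(['Back Yard : Yard', 'Front Yard : Pot', 'Median :',
                       'Side Yard : Cutout', 'Sidewalk: Curb side : Cutout',
                       'unknown : Yard', ':'])

# str.match anchors at the start of each string, so this keeps only
# values that begin with one of the publicly accessible site types
public_prefixes = r'(?:Front Yard|Hanging basket|Median|Sidewalk)\b'
is_public = site_info.str.match(public_prefixes)
print(site_info[is_public].to_list())
```

The explicit list is more verbose but makes every accepted category visible, which is arguably easier to audit.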

Looking at the species in the data:

In [93]:
print(data.qSpecies.cat.categories.to_list())

['::', ':: To Be Determine', ':: Tree', 'Abutilon hybridum :: Flowering maple', "Acacia baileyana 'Purpurea' :: Purple-leaf Acacia", "Acacia baileyana :: Bailey's Acacia", 'Acacia cognata :: River Wattle', 'Acacia cyclops :: Cyclops wattle', 'Acacia dealbata :: Silver Wattle', 'Acacia decurrens :: Acacia: Silver Wattle', 'Acacia iteaphylla :: Willow wattle', 'Acacia longifolia :: Golden Wattle', 'Acacia melanoxylon :: Blackwood Acacia', 'Acacia spp :: Acacia Spp', 'Acacia stenophylla :: Shoestring Acacia', 'Acacia vestita :: Hairy wattle', 'Acca sellowiana :: Pineapple Guava Tree', 'Acer buergeranum :: Trident Maple', 'Acer campestre :: Hedge Maple', 'Acer circinatum :: Vine Maple', 'Acer ginnela :: Amur Maple', 'Acer japonicum :: Japanese Maple', 'Acer macrophyllum :: Big Leaf Maple', 'Acer negundo :: Box Elder', "Acer palmatum 'Bloodgood' :: Bloodgood Japanese Maple", "Acer palmatum 'Sango Kaku' :: Coral Bark Maple", 'Acer palmatum :: Japanese Maple', 'Acer paxii :: Evergreen Maple',

We'll remove trees without proper species names:

In [94]:
non_species_categories = ['::', ':: To Be Determine', ':: Tree', 'Tree(s) ::', 
                          'Potential Site :: Potential Site', 'Private shrub :: Private Shrub']
data = data.loc[~data.qSpecies.isin(non_species_categories)]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 175092 entries, 121399 to 15238
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype   
---  ------     --------------   -----   
 0   qSpecies   175092 non-null  category
 1   qAddress   175092 non-null  string  
 2   SiteOrder  175092 non-null  int8    
 3   qSiteInfo  175092 non-null  category
dtypes: category(2), int8(1), string(1)
memory usage: 3.4 MB


cleaned_path = os.path.join('..', 'src', 'SF_Tree_Identifier', 'data', 'Cleaned_Street_Tree_List.csv')
data.to_csv(cleaned_path)

Now that we've got a dataframe containing mostly usable trees, it's time to assign a webpage to each species so the user can get information about the tree. Specifically, the webpage will be the species' corresponding 'SelecTree' page, hosted by Cal Poly's Urban Forest Ecosystems Institute: https://selectree.calpoly.edu/

SelecTree uses a unique integer (or key) in the URL path for each species of tree. For example, in the URL for the Red Flowering Gum (Corymbia ficifolia), https://selectree.calpoly.edu/tree-detail/540, the key is 540. This is in contrast to, say, Wikipedia, where the URL path is typically the name of the article itself (e.g. https://en.wikipedia.org/wiki/Corymbia_ficifolia). Consequently, to be able to return a URL for each tree, we first need to get these keys for every species.

Fortunately, SelecTree also has an API endpoint that allows search by name: https://selectree.calpoly.edu/api/search-by-name-multiresult

Species names can be queried using the [requests package](https://requests.readthedocs.io/en/latest/) via:
```
import requests
url = 'https://selectree.calpoly.edu/api/search-by-name-multiresult'
payload = {'searchTerm': species_name, 'activePage': 1, 'resultsPerPage': 1, 'sort': 1}
r = requests.get(url, params=payload)
```

where 'species_name' is the species name being queried and r is the response from the request. The JSON response can then be queried for the URL path of the first result using:
```
path_id = r.json()['pageResults'][0]['tree_id']
```
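
A search may return no results, in which case indexing `[0]` would raise an error. Here is a minimal sketch of a more defensive lookup, assuming the response has the shape used above (a `pageResults` list whose items carry a `tree_id`). The function names `extract_tree_id` and `tree_url` are hypothetical helpers, not part of SelecTree or requests.

```python
def extract_tree_id(response_json):
    """Return the tree_id of the first search result, or None if there is none."""
    # Assumed response shape: {'pageResults': [{'tree_id': ...}, ...]}
    results = response_json.get('pageResults') or []
    if not results:
        return None
    return results[0].get('tree_id')

def tree_url(tree_id):
    """Build the SelecTree detail URL from a key, following the pattern shown above."""
    return f'https://selectree.calpoly.edu/tree-detail/{tree_id}'

# Usage with hand-written dictionaries standing in for r.json()
print(extract_tree_id({'pageResults': [{'tree_id': 540}]}))
print(extract_tree_id({'pageResults': []}))
print(tree_url(540))
```

Keeping the parsing separate from the network call makes it possible to check the logic without contacting the server.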

Through some experimentation, I found that the SelecTree search works using the scientific name of *most* of the species left in the database. There are two major edge cases:
1. Trees containing only genus names
2. Cross bred species

Trees identified only by a genus name (or 'generic epithet'), without a species name, are denoted by *genus_name spp* in the scientific portion of the tree's qSpecies column:

In [95]:
data = data.astype({'qSpecies': 'string'})
data.loc[data.qSpecies.str.contains('spp')].qSpecies.unique()

<StringArray>
[            'Melaleuca spp :: Melaleuca spp',
                          'Acer spp :: Maple',
                        'Salix spp :: Willow',
                   'Acacia spp :: Acacia Spp',
                       'Prunus spp :: Cherry',
                         'Quercus spp :: Oak',
                  'Ilex spp :: Holly Species',
                   'Magnolia spp :: Magnolia',
          'Citrus spp :: Lemon: Orange: Lime',
            'Brugmansia spp :: Angel trumpet',
                    'Fraxinus spp :: Ash Spp',
                       'Ulmus spp :: Elm Spp',
                         'Yucca spp :: Yucca',
                        'Tilia spp :: Linden',
                        'Betula spp :: Birch',
         'Pittosporum spp :: Pittosporum spp',
                  'Crateagus spp :: Hawthorn',
   'Fremontodendron spp :: Flannel Bush Tree',
           'Grevillea spp :: Silkoak species',
                        'Metrosideros spp ::',
  'Lagerstroemia spp :: Crape myrtle species',

For most of these, using the common name (with 'spp' removed if needed) should work well enough. For the few remaining, I may need to manually add a name, but as there should be very few this shouldn't be an issue.
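
The rule described above can be sketched as a small function: use the scientific name normally, and for *spp* entries fall back to the common name with 'spp' stripped. The function name `search_term` is hypothetical, and falling back to the bare genus when no common name is left is an assumption of this sketch, not something tested against SelecTree.

```python
import re

def search_term(q_species):
    """Pick a search term from a 'Scientific :: Common' qSpecies string."""
    # Split on '::' rather than ' :: ' so entries like 'Metrosideros spp ::' still split
    scientific, _, common = q_species.partition('::')
    scientific, common = scientific.strip(), common.strip()
    if not scientific.endswith(' spp'):
        return scientific
    # Genus-only entry: use the common name with any 'spp' removed
    common = re.sub(r'\bspp\b', '', common, flags=re.IGNORECASE).strip()
    if common:
        return common
    # Assumption: with no usable common name, search on the genus alone
    return scientific[:-len(' spp')]

for name in ['Corymbia ficifolia :: Red Flowering Gum', 'Acer spp :: Maple',
             'Acacia spp :: Acacia Spp', 'Metrosideros spp ::']:
    print(name, '->', search_term(name))
```

Entries such as 'Citrus spp :: Lemon: Orange: Lime' would still produce an awkward term and are the kind of case that likely needs a manual name.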

The second edge case, cross bred species, is denoted by *first species x second species* in the scientific portion of the tree's qSpecies column:

In [96]:
data = data.astype({'qSpecies': 'string'})
data.loc[data.qSpecies.str.contains(' x ')].qSpecies.unique()

<StringArray>
[                                   'Platanus x hispanica :: Sycamore: London Plane',
                        "Magnolia x soulangiana 'Rustica Rubra' :: Chinese Magnolia",
                                "Platanus x hispanica 'Yarwood' :: Yarwood Sycamore",
                                              "Laurus x 'Saratoga' :: Hybrid Laurel",
                                         'Magnolia x soulangiana :: Saucer Magnolia',
 "Cornus nuttallii x florida 'Eddie's White Wonder' :: Eddie's White Wonder Dogwood",
                                "Aesculus x carnea 'Briotii' :: Ruby Horse Chestnut",
                                               "Pyrus x 'Bartlett' :: Bartlett Pear",
                     "Platanus x hispanica 'Columbia' :: Columbia Hybrid Plane Tree",
                                           'Aesculus x carnea :: Red Horse Chestnut',
                                                 'Acer x freemanii :: Freeman Maple',
                                        

To assign the url paths to the species names, we'll first create a dataframe containing all of the unique species and split the species name into scientific names and common names:

In [97]:
data = data.astype({'qSpecies': 'string'})
species_names = pd.Series(data.qSpecies.unique())
species_names

0                Corymbia ficifolia :: Red Flowering Gum
1      Eucalyptus polyanthemos :: Silver Dollar Eucal...
2                  Lophostemon confertus :: Brisbane Box
3               Cupressus macrocarpa :: Monterey Cypress
4                     Jacaranda mimosifolia :: Jacaranda
                             ...                        
560    Pyrus pyrifolia '20th Century' :: Asian Pear '...
561       Acer palmatum 'Sango Kaku' :: Coral Bark Maple
562    Prunus sargentii 'Columnaris' :: Sargent Cherr...
563                  Paulownia tomentosa :: Empress Tree
564    Prunus persica nectarina :: Flowering Nectarin...
Length: 565, dtype: string

In [98]:
import numpy as np
urls = pd.Series(np.zeros(len(species_names)), dtype='uint16')
pd.concat([species_names, urls], axis=1)

Unnamed: 0,0,1
0,Corymbia ficifolia :: Red Flowering Gum,0
1,Eucalyptus polyanthemos :: Silver Dollar Eucal...,0
2,Lophostemon confertus :: Brisbane Box,0
3,Cupressus macrocarpa :: Monterey Cypress,0
4,Jacaranda mimosifolia :: Jacaranda,0
...,...,...
560,Pyrus pyrifolia '20th Century' :: Asian Pear '...,0
561,Acer palmatum 'Sango Kaku' :: Coral Bark Maple,0
562,Prunus sargentii 'Columnaris' :: Sargent Cherr...,0
563,Paulownia tomentosa :: Empress Tree,0


species_path = os.path.join('..', 'src', 'SF_Tree_Identifier', 'data', 'Species.csv')
species_names.to_csv(species_path)

Note the new index, as this will be how we look up the URLs later.
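
To illustrate how that index could be used, here is a minimal sketch that inverts the species series into a name-to-index lookup and attaches the index to each tree. The column name `SpeciesID` and the TreeIDs in the sample are made up for illustration.

```python
import pandas as pd

# Two of the unique species strings, indexed 0..n-1 as in the series above
species_names = pd.Series(['Corymbia ficifolia :: Red Flowering Gum',
                           'Lophostemon confertus :: Brisbane Box'])

# Invert the series: species string -> integer index
species_id = pd.Series(species_names.index, index=species_names.to_numpy())

# Hypothetical trees (TreeIDs 1-3 are placeholders, not real records)
trees = pd.DataFrame(
    {'qSpecies': ['Lophostemon confertus :: Brisbane Box',
                  'Corymbia ficifolia :: Red Flowering Gum',
                  'Lophostemon confertus :: Brisbane Box']},
    index=pd.Index([1, 2, 3], name='TreeID'))

# Replace the long species string with its compact integer index
trees['SpeciesID'] = trees['qSpecies'].map(species_id)
print(trees)
```

Storing the integer per tree and the URL key per species means each SelecTree key only needs to be fetched and saved once.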

The next step is to create a new column that contains the URL path number for SelecTree.

In [99]:
species_names = data['qSpecies'].str.split(' :: ', expand=True).rename(columns = {0: 'ScientificName', 1: 'CommonName'})
pd.isna(species_names.loc[pd.isna(species_names.CommonName)].fillna(np.nan).iloc[0,1])


True

In [100]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 175092 entries, 121399 to 15238
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype   
---  ------     --------------   -----   
 0   qSpecies   175092 non-null  string  
 1   qAddress   175092 non-null  string  
 2   SiteOrder  175092 non-null  int8    
 3   qSiteInfo  175092 non-null  category
dtypes: category(1), int8(1), string(2)
memory usage: 4.3 MB


Now, we can check to see if wikipedia articles exist for each 