Exploratory Data Analysis (EDA)
===============================
___
> Exploring the relevant content of [austraits-7.0.0](https://zenodo.org/records/15718081), specifically "austraits-7.0.0.zip".
>
> The extracted austraits-7.0.0 folder is excluded from GitHub due to its size, but the zip file is included.
>
> The data is being explored with the problem statement:
>
> "Develop machine learning to identify an unknown native plant entity based on easily measurable traits in the field".

# Glossary
___
References:

[www.merriam-webster.com](https://www.merriam-webster.com)

[thedailyeco.com: What is a Whorl in Biology](https://www.thedailyeco.com/what-is-a-whorl-in-biology-218.html)

[wikipedia.org: Perianth](https://en.wikipedia.org/wiki/Perianth)

[sciencefacts.net](https://www.sciencefacts.net/)

![Leaf and Flower Anatomy](leaf_and_flower_anatomy.png)

| Name | Description |
|:-----|:------------|
| abaxial | Facing away from the stem of a plant. |
| adaxial | Facing towards the stem of a plant. |
| androecium | Flower: Male reproductive organs. |
| anther | Flower: The tip of the stamen, consisting of lobes called thecae, which contain pollen sacs. |
| anthesis | Flower: Period during which a flower is in the stages of opening. |
| apex | Leaf: Tip of the leaf. |
| axil | Leaf: Crotch formed between stem and petiole. |
| calyx | Flower: Outer whorl of the flower that protects the flower bud before it opens and forms part of the perianth, which includes the petals. The calyx is composed of multiple leaf-like 'sepals'. |
| carpel | Flower: Female reproductive part of flower, including ovary, style and stigma. |
| corolla | Flower: The petals. |
| gynoecium | Flower: Female reproductive organs. |
| lamina (blade) | Leaf: Flat expanded part of the leaf where photosynthesis and gas exchange takes place. |
| leaf whorl / flower whorl | Flower/Leaf: Separately identifiable segments/sections, indicated by alternate leaves or leaf clusters for leaves, or peduncle, calyx, corolla, androecium and gynoecium for flowers. |
| midrib | Leaf: Central vein of the leaf, from the petiole to the apex. |
| ovary | Flower: Protects the ovules. |
| ovule | Flower: Female reproductive cells. |
| peduncle | The flower stalk. |
| perianth | The calyx and petals protecting the flower. Flowers with a perianth are called dressed flowers (clamídeas), while naked flowers do not have a perianth. |
| petiole | Leaf: Stalk connecting the leaf blade to the plant stem. |
| receptacle | Flower: The structure immediately above the flower stalk, forming the base to which the other floral whorls are attached. |
| sepal | The multiple leaf-like parts forming the calyx. |
| stamen | Flower: Male reproductive part of the flower, including the anther and filament. |
| stigma | Flower: sticky head of the carpel, for receiving and germinating pollen grains. |
| stipule | Leaf: Leaf-like protrusions found on some plants, usually in pairs, extending from the base of the petiole. |
| style | Flower: Slender stalk structure connecting from the ovary to the stigma, and housing the pollen tube. |
| taxon | Grouping of organisms, such as by ranking, like `family`, `genus` or `species`. For example, taxon rank might be 'species', with the taxon name being 'Acanthocarpus preissii', such that "the species is Acanthocarpus preissii". |

# Overview
___
austraits-7.0.0 consists of 13 tables, saved in various formats. Of particular interest serving the problem statement:

* schema
>  * YAML file.
>  * Describes structure of all the tables, definition of fields, and the relationships between the tables.

* taxa
>  * CSV file containing `taxon_name` with `taxon_rank`, `genus` and `family` fields.

* traits
>  * CSV file with 26 columns and 1798215 records.
>  * One or more row entries belong to a single `observation_id`, representing a specific `taxon_name` from a particular `dataset_id`.
>  * Each entry of the `observation_id`, or just '_**observation**_', is for a particular `trait_name`, or just '_**trait**_'.
>    * -> Each row is a trait with a `value` and, if applicable, a `unit`, but a trait may have more than one entry, distinguished by `value_type` (such as 'minimum' and 'maximum') and possibly other fields.
>  * Consider checking uniqueness of row entries.
>  * See "1.1 First 10 rows", below.

* definitions
>  * YAML file with 532 keys in the top level.
>  * Each top level key is a possible `trait_name` entry in the traits table.
>    * Trait keys include `description` and `type` (categorical or numerical) and can include `units`, `allowed_values_min` and `allowed_values_max`, among others.

* methods
>  * CSV file containing the methodology for measuring each trait for the respective dataset sources.
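
The uniqueness check suggested for the traits table could be sketched as follows. This is a minimal illustration on a toy DataFrame rather than the real traits.csv; the toy values and the choice of `key_columns` as a candidate key are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for traits.csv; the values are illustrative only.
toy_traits = pd.DataFrame({
    'observation_id': [1, 1, 4, 4, 4],
    'trait_name': ['leaf_compoundness', 'seed_length', 'seed_length',
                   'seed_length', 'seed_length'],
    'value': ['simple', '3', '4', '5', '5'],
    'value_type': ['mode', 'maximum', 'minimum', 'maximum', 'maximum'],
})

# Rows that exactly repeat an earlier row across every column.
exact_duplicates = toy_traits[toy_traits.duplicated(keep='first')]

# Rows that share a candidate key, whether or not their values agree.
key_columns = ['observation_id', 'trait_name', 'value_type']  # assumed key
key_collisions = toy_traits[toy_traits.duplicated(subset=key_columns, keep=False)]

print(f'Exact duplicate rows: {len(exact_duplicates)}')
print(f'Rows sharing a key: {len(key_collisions)}')
```

On the real table, the context id columns would probably need to be added to the key before any collision is treated as an error.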

In [1]:
# Imports
import pandas as pd

# 1. traits.csv
___

In [2]:
traits_csv = 'austraits-7.0.0/traits.csv'
traits_df = pd.read_csv(traits_csv, encoding='ISO-8859-1', low_memory=False)

## 1.1. First 10 rows
> View the first few rows to determine the type of data available.

In [3]:
# Don't skip columns or rows in output
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
traits_df.head(10)

Unnamed: 0,dataset_id,taxon_name,observation_id,trait_name,value,unit,entity_type,value_type,basis_of_value,replicates,basis_of_record,life_stage,population_id,individual_id,repeat_measurements_id,temporal_context_id,source_id,location_id,entity_context_id,plot_context_id,treatment_context_id,collection_date,measurement_remarks,method_id,method_context_id,original_name
0,ABRS_1981,Acanthocarpus canaliculatus,1,leaf_compoundness,simple,,species,mode,expert_score,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus canaliculatus
1,ABRS_1981,Acanthocarpus canaliculatus,1,seed_length,3,mm,species,maximum,measurement,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus canaliculatus
2,ABRS_1981,Acanthocarpus humilis,2,leaf_compoundness,simple,,species,mode,expert_score,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus humilis
3,ABRS_1981,Acanthocarpus parviflorus,3,leaf_compoundness,simple,,species,mode,expert_score,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus parviflorus
4,ABRS_1981,Acanthocarpus preissii,4,leaf_compoundness,simple,,species,mode,expert_score,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus preissii
5,ABRS_1981,Acanthocarpus preissii,4,seed_length,4,mm,species,minimum,measurement,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus preissii
6,ABRS_1981,Acanthocarpus preissii,4,seed_length,5,mm,species,maximum,measurement,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus preissii
7,ABRS_1981,Acanthocarpus robustus,5,leaf_compoundness,simple,,species,mode,expert_score,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus robustus
8,ABRS_1981,Acanthocarpus rupestris,6,leaf_compoundness,simple,,species,mode,expert_score,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus rupestris
9,ABRS_1981,Acanthocarpus rupestris,6,seed_length,2.5,mm,species,maximum,measurement,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2015,,1.0,,Acanthocarpus rupestris


* **An issue here is that the value_type indicates entries are aggregates/minimums/maximums.**
  * Not having all the raw data may make it difficult to produce a well generalised model.
  * The features predicted on will also have to be aggregates/minimums/maximums if/when using features of this type.
  * There may be too few samples below the genus level, or family level in some cases, to model with.
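
One way to work with aggregate values is to treat each trait and `value_type` pair as its own feature. The sketch below reshapes a toy long-format table into one row per taxon; the column name `value_num` and the `trait_valuetype` naming scheme are assumptions for illustration, and only a few `seed_length` rows resembling the preview above are used.

```python
import pandas as pd

# Toy long-format rows resembling the seed_length entries previewed above.
toy_traits = pd.DataFrame({
    'taxon_name': ['Acanthocarpus preissii', 'Acanthocarpus preissii',
                   'Acanthocarpus rupestris'],
    'trait_name': ['seed_length', 'seed_length', 'seed_length'],
    'value_type': ['minimum', 'maximum', 'maximum'],
    'value': ['4', '5', '2.5'],
})

# The value column is stored as text, so convert it before aggregating.
toy_traits['value_num'] = pd.to_numeric(toy_traits['value'], errors='coerce')

# One row per taxon, one column per (trait_name, value_type) pair.
wide = toy_traits.pivot_table(index='taxon_name',
                              columns=['trait_name', 'value_type'],
                              values='value_num',
                              aggfunc='mean')

# Flatten the two-level column labels, e.g. 'seed_length_maximum'.
wide.columns = ['_'.join(pair) for pair in wide.columns]
print(wide)
```

Taxa that lack a particular value type end up with missing values in that column, which makes the sparsity of each feature visible.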

## 1.2. Total Records in traits.csv

In [4]:
print(f'Total traits records: {len(traits_df)}')

Total traits records: 1798215


## 1.3. Column Names and Data Types

In [5]:
traits_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1798215 entries, 0 to 1798214
Data columns (total 26 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   dataset_id              object 
 1   taxon_name              object 
 2   observation_id          int64  
 3   trait_name              object 
 4   value                   object 
 5   unit                    object 
 6   entity_type             object 
 7   value_type              object 
 8   basis_of_value          object 
 9   replicates              object 
 10  basis_of_record         object 
 11  life_stage              object 
 12  population_id           object 
 13  individual_id           float64
 14  repeat_measurements_id  float64
 15  temporal_context_id     float64
 16  source_id               object 
 17  location_id             float64
 18  entity_context_id       float64
 19  plot_context_id         float64
 20  treatment_context_id    float64
 21  collection_date         object 

## 1.4. Total Unique Trait Names

In [6]:
print(f'Number of unique trait names: {traits_df.trait_name.nunique()}')

Number of unique trait names: 530


## 1.5. Entity Types and Counts

In [7]:
traits_df.entity_type.value_counts()

entity_type
species           1355931
individual         322631
population         118170
metapopulation        933
unknown               550
Name: count, dtype: int64

This list only represents the source of the data with respect to the entity: for example, whether the value comes from a single individual or represents a summary statistic for an entire species. An `entity_type` of 'species' is not necessarily the taxon rank. For example, an unknown entity could belong to a family, genus, species and subspecies, each of which could have rows in traits.csv.

The `taxon_rank` is available in taxa.csv, by referencing `taxon_name`.
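
A minimal sketch of that lookup is a left join on `taxon_name`. The two toy tables below are hypothetical stand-ins for traits.csv and taxa.csv, and 'Unlisted taxon' is an invented name used to show an unmatched row.

```python
import pandas as pd

# Hypothetical stand-ins for traits.csv and taxa.csv.
toy_traits = pd.DataFrame({
    'taxon_name': ['Abelmoschus manihot',
                   'Abelmoschus manihot subsp. manihot',
                   'Unlisted taxon'],
    'entity_type': ['species', 'species', 'individual'],
})
toy_taxa = pd.DataFrame({
    'taxon_name': ['Abelmoschus manihot', 'Abelmoschus manihot subsp. manihot'],
    'taxon_rank': ['species', 'subspecies'],
    'genus': ['Abelmoschus', 'Abelmoschus'],
    'family': ['Malvaceae', 'Malvaceae'],
})

# Left join keeps every trait row; validate guards against duplicate taxa rows.
traits_ranked = toy_traits.merge(toy_taxa, on='taxon_name', how='left',
                                 validate='many_to_one')

# Trait rows whose taxon_name has no match in the taxa table.
unmatched = int(traits_ranked['taxon_rank'].isna().sum())
print(traits_ranked[['taxon_name', 'entity_type', 'taxon_rank']])
print(f'Unmatched trait rows: {unmatched}')
```

The second toy row shows how an `entity_type` of 'species' can sit alongside a `taxon_rank` of 'subspecies'.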

## 1.6 Taxon Example

In [8]:
traits_df[traits_df.taxon_name == 'Abrotanella scapigera']

Unnamed: 0,dataset_id,taxon_name,observation_id,trait_name,value,unit,entity_type,value_type,basis_of_value,replicates,basis_of_record,life_stage,population_id,individual_id,repeat_measurements_id,temporal_context_id,source_id,location_id,entity_context_id,plot_context_id,treatment_context_id,collection_date,measurement_remarks,method_id,method_context_id,original_name
9034,ABRS_2022,Abrotanella scapigera,3,life_history,perennial,,species,mode,expert_score,,preserved_specimen,adult,,,,,,,,,,unknown/2022,author,1.0,,Abrotanella scapigera
9035,ABRS_2022,Abrotanella scapigera,3,plant_growth_form,herb,,species,mode,expert_score,,preserved_specimen,adult,,,,,,,,,,unknown/2022,author,1.0,,Abrotanella scapigera
9036,ABRS_2022,Abrotanella scapigera,3,woodiness_detailed,herbaceous,,species,mode,expert_score,,preserved_specimen,adult,,,,,,,,,,unknown/2022,author,1.0,,Abrotanella scapigera
56558,ABRS_2023,Abrotanella scapigera,3,fruit_colour,brown,,species,mode,expert_score,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2022,Author: I.R. Thompson; Contributor: John R. Bu...,1.0,32.0,Abrotanella scapigera
56559,ABRS_2023,Abrotanella scapigera,3,fruit_dehiscence,indehiscent,,species,mode,expert_score,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2022,Author: I.R. Thompson; Contributor: John R. Bu...,1.0,2.0,Abrotanella scapigera
56560,ABRS_2023,Abrotanella scapigera,3,fruit_length,1.7,mm,species,minimum,measurement,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2022,Author: I.R. Thompson; Contributor: John R. Bu...,1.0,32.0,Abrotanella scapigera
56561,ABRS_2023,Abrotanella scapigera,3,fruit_length,2.2,mm,species,maximum,measurement,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2022,Author: I.R. Thompson; Contributor: John R. Bu...,1.0,32.0,Abrotanella scapigera
56562,ABRS_2023,Abrotanella scapigera,3,fruit_type,achene,,species,mode,expert_score,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2022,Author: I.R. Thompson; Contributor: John R. Bu...,1.0,32.0,Abrotanella scapigera
56563,ABRS_2023,Abrotanella scapigera,3,leaf_length,10,mm,species,minimum,measurement,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2022,Author: I.R. Thompson; Contributor: John R. Bu...,1.0,28.0,Abrotanella scapigera
56564,ABRS_2023,Abrotanella scapigera,3,leaf_length,40,mm,species,maximum,measurement,,preserved_specimen,adult,,,,1.0,,,,,,unknown/2022,Author: I.R. Thompson; Contributor: John R. Bu...,1.0,28.0,Abrotanella scapigera


## 1.7. Traits Summary

There are 530 unique trait names in this data set, but section '2.1 Trait Definition Count' indicates there are 532 defined. Each is a feature to consider. The taxon name could be considered a dependent variable, but must take the taxon rank into account, which is available in taxa.csv.

A possible approach to identifying an unknown entity could be to first use ML to determine the highest available taxon, then use additional ML models to determine each next subordinate taxon.

Section '1.6 Taxon Example', above, shows entries from different `dataset_id`'s. One provides minimum and maximum leaf_length, while another provides average leaf_area. Different entities will likely vary in their available features.

In [9]:
traits_df[traits_df.taxon_name.str.startswith('Abrotanella')].taxon_name.value_counts()

taxon_name
Abrotanella nivigena            95
Abrotanella forsteroides        35
Abrotanella scapigera           34
Abrotanella sp. [White_2020]    19
Name: count, dtype: int64

# 2. definitions.yml
___
## 2.1. Trait Definition Count

In [10]:
import yaml

definitions_yaml = 'austraits-7.0.0/definitions.yml'
with open(definitions_yaml, 'r') as definitions_file:
    definitions_data = yaml.safe_load(definitions_file)

    # Ensure the data is a dictionary
    if isinstance(definitions_data, dict):
        # Print count of top level keys
        print(f'Number of types of traits: {len(definitions_data.keys())}')
    else:
        raise ValueError("YAML file does not contain a dictionary at top level.")

Number of types of traits: 532
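
The gap between defined and observed trait names can be located with a set difference. The sketch below uses a toy dictionary and a toy DataFrame in place of definitions.yml and traits.csv; `hypothetical_unused_trait` is an invented key for illustration.

```python
import pandas as pd

# Toy stand-ins: a definitions mapping and a traits table.
toy_definitions = {
    'leaf_length': {'type': 'numeric'},
    'leaf_width': {'type': 'numeric'},
    'hypothetical_unused_trait': {'type': 'categorical'},
}
toy_traits = pd.DataFrame({
    'trait_name': ['leaf_length', 'leaf_width', 'leaf_length'],
})

defined = set(toy_definitions)
observed = set(toy_traits['trait_name'].dropna().unique())

# Defined in the YAML but never recorded in the traits table.
unused_definitions = sorted(defined - observed)
# Recorded in the traits table but missing a definition.
undefined_traits = sorted(observed - defined)

print(f'Defined but unused: {unused_definitions}')
print(f'Used but undefined: {undefined_traits}')
```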


# 3. Trait Selection
___
* To meet the problem statement, features will be selected that are quickly attainable in the field, such as leaf dimensions, fruit colour, ...
* Given the reduced number of features, the output of the model could be the most likely `taxon_name`'s with the associated probability of each. The probability of the most likely `taxon_name` may typically be well under 50% with a reduced number of features.

## 3.1. Grouped Trait List
* This grouping and sorting of traits allows for easier initial trait selection, as traits can be easily excluded based on units, as well as trait name.
  * For example, if units are 'um', this would not easily be measurable with simple field instruments.
* Displaying `unit` and `value_type` also makes any errors easy to identify as they may show as duplicates.
* It is noteworthy there are value types of mean and raw with many traits having entries for both.

In [11]:

group_cols = ['trait_name', 'unit', 'value_type']
# groupby skips rows whose key is NaN by default, so rows without a unit are not counted.
traits_grouped_df = traits_df[group_cols].groupby(group_cols).size().reset_index(name='Counts').sort_values(by=group_cols)
traits_grouped_df

Unnamed: 0,trait_name,unit,value_type,Counts
0,accessory_cost_fraction,mg/mg,mean,47
1,accessory_cost_mass,mg,mean,47
2,atmospheric_CO2_concentration,umol{CO2}/mol,raw,840
3,bark_Al_per_dry_mass,mg/g,raw,70
4,bark_B_per_dry_mass,mg/g,raw,70
5,bark_C_per_dry_mass,mg/g,raw,229
6,bark_Ca_per_dry_mass,mg/g,mean,34
7,bark_Ca_per_dry_mass,mg/g,raw,70
8,bark_Cu_per_dry_mass,mg/g,raw,70
9,bark_Fe_per_dry_mass,mg/g,raw,70
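
A caveat with the grouping above: pandas `groupby` drops rows whose grouping key is missing unless `dropna=False` is passed, so traits recorded without a `unit` do not appear in the grouped list. The sketch below shows the difference on toy data; it assumes a pandas version that supports the `dropna` argument (1.1 or later).

```python
import pandas as pd

# Toy rows: one categorical trait without a unit, one numeric trait with a unit.
toy_traits = pd.DataFrame({
    'trait_name': ['leaf_compoundness', 'seed_length', 'seed_length'],
    'unit': [None, 'mm', 'mm'],
    'value_type': ['mode', 'minimum', 'maximum'],
})
group_cols = ['trait_name', 'unit', 'value_type']

# Default behaviour: the row with a missing unit is silently excluded.
default_groups = toy_traits.groupby(group_cols).size().reset_index(name='Counts')

# dropna=False keeps groups whose key contains a missing value.
all_groups = (toy_traits.groupby(group_cols, dropna=False)
              .size().reset_index(name='Counts'))

print(f'Groups with default settings: {len(default_groups)}')
print(f'Groups with dropna=False: {len(all_groups)}')
```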


## 3.2. Selected Traits
* Traits have been selected for this list based on ease of field based measurement.
  * For example, a ruler can easily be used in the field to measure leaf width.
* The initial selection of traits was then verified by cross referencing definitions.yml and, if more clarification was required, methods.csv.
* Example exclusions:
  * leaf_lifespan - requires observation over time
  * fruit_dry_weight - requires drying of the fruit
  * leaf_mass_per_area - requires several measurements to calculate
  * leaf_palisade_tissue_thickness_abaxial - measured in um
  * root_diameter - requires excavation
  * root_wood_density - requires more complex measurements
* Will there be enough data points to adequately represent each feature for ML?
  * For example, a trait may exist for most species, but there could be just one average value entered per species.
    * There may at least be sufficient data points if rolled up for genus or family.

In [12]:
selected_traits_initial = [
                           'bark_thickness',
                           'bark_thickness_index',
                           'branch_terminal_twig_cross_sectional_area',
                           'branch_terminal_twig_length',
                           'bud_length',
                           'buds_per_inflorescence',
                           'flower_androecium_structural_merism',
                           'flower_androecium_structural_whorls_count',
                           'flower_count_maximum',
                           'flower_diameter',
                           'flower_fertile_stamens_count',
                           'flower_filament_fusion',
                           'flower_filament_fusion_to_inner_perianth',
                           'flower_length',
                           'flower_ovary_fusion',
                           'flower_ovules_per_functional_carpel_count',
                           'flower_perianth_fusion',
                           'flower_perianth_merism',
                           'flower_perianth_parts_count',
                           'flower_perianth_whorls_count',
                           'flower_pollen_apertures_count',
                           'flower_pollen_length',
                           'flower_structural_carpels_count',
                           'flower_style_fusion',
                           'fruit_height',
                           'fruit_length',
                           'fruit_wall_thickness',
                           'fruit_width',
                           'leaf_area',
                           'leaf_inclination_angle',
                           'leaf_length',
                           'leaf_posture_numeric',
                           'leaf_secondary_vein_angle',
                           'leaf_thickness',
                           'leaf_vein_frequency',
                           'leaf_vessel_density',
                           'leaf_width',
                           'leaflet_area',
                           'leaflet_count',
                           'petiole_length',
                           'petiole_width',
                           'plant_diameter_breast_height',
                           'plant_height',
                           'plant_height_climbing_plant',
                           'plant_height_reproductive',
                           'plant_width',
                           'ploidy',
                           'resprouting_capacity_proportion_individuals',
                           'resprouting_capacity_stem_ratio',
                           'seed_germination',
                           'sprout_depth',
                           'stem_count',
                           'stem_length',
                           'storage_organ_diameter',
                           'storage_organ_length'
                           ]

### Notes:
* **bark_thickness** and **bark_thickness_index**
  * definitions.yml: bark_thickness_index: Ratio of two times bark thickness to stem diameter.
  * Consider dropping **bark_thickness**, as it may be arbitrary on its own. Note that it is still retained in the round 2 list below.
* **branch_terminal_twig_cross_sectional_area**
  * Could be approximated by measuring the largest width, and the width perpendicular to that.
* flower and fruit measurements
  * These measurements depend on the plant being in the state of flowering and producing fruit, respectively.
* **leaf_vein_frequency** and **leaf_vessel_density**
  * The former appears achievable in the field, while the latter requires more complex instruments, like a microscope, and is excluded.
* **leaflet_area**
  * A small scanner could be used for quick area calculation in the field, or simpler techniques to estimate.
* **plant_diameter_breast_height**
  * A somewhat obscure method of measurement, as it generally pertains to plant diameter 1.4m above ground. It is excluded.
* **ploidy**
  * Chromosome detail, so it is excluded.
* **resprouting_capacity_proportion_individuals** and **resprouting_capacity_stem_ratio**
  * Fire response metric, so they are excluded.
* **seed_germination**
  * Germination under specified conditions, so it is excluded.
* **sprout_depth**
  * Fire response metric, so it is excluded.
* **storage_organ_diameter** and **storage_organ_length**
  * Includes below ground, so they are excluded.
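
Two of the notes above describe simple calculations that could be done from field measurements. The sketch below treats the twig cross-section as an ellipse, which is an assumption made here and not a method taken from the data set, and computes the bark thickness index as the ratio quoted from definitions.yml. The function names are hypothetical.

```python
import math


def approx_cross_sectional_area(largest_width, perpendicular_width):
    """Approximate a cross-section as an ellipse with the two widths as its axes."""
    return math.pi / 4 * largest_width * perpendicular_width


def bark_thickness_index(bark_thickness, stem_diameter):
    """Ratio of two times bark thickness to stem diameter (same units for both)."""
    return 2 * bark_thickness / stem_diameter


# Example field measurements in mm (illustrative values only).
area_mm2 = approx_cross_sectional_area(3.0, 2.0)
index = bark_thickness_index(5.0, 100.0)
print(f'Approximate cross-sectional area: {area_mm2:.2f} mm2')
print(f'Bark thickness index: {index:.2f}')
```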

### Selected traits, round 2:

In [13]:
selected_traits2 = [
                   'bark_thickness',
                   'bark_thickness_index',
                   'branch_terminal_twig_cross_sectional_area',
                   'branch_terminal_twig_length',
                   'bud_length',
                   'buds_per_inflorescence',
                   'flower_androecium_structural_merism',
                   'flower_androecium_structural_whorls_count',
                   'flower_count_maximum',
                   'flower_diameter',
                   'flower_fertile_stamens_count',
                   'flower_filament_fusion',
                   'flower_filament_fusion_to_inner_perianth',
                   'flower_length',
                   'flower_ovary_fusion',
                   'flower_ovules_per_functional_carpel_count',
                   'flower_perianth_fusion',
                   'flower_perianth_merism',
                   'flower_perianth_parts_count',
                   'flower_perianth_whorls_count',
                   'flower_pollen_apertures_count',
                   'flower_pollen_length',
                   'flower_structural_carpels_count',
                   'flower_style_fusion',
                   'fruit_height',
                   'fruit_length',
                   'fruit_wall_thickness',
                   'fruit_width',
                   'leaf_area',
                   'leaf_inclination_angle',
                   'leaf_length',
                   'leaf_posture_numeric',
                   'leaf_secondary_vein_angle',
                   'leaf_thickness',
                   'leaf_vein_frequency',
                   'leaf_width',
                   'leaflet_area',
                   'leaflet_count',
                   'petiole_length',
                   'petiole_width',
                   'plant_height',
                   'plant_height_climbing_plant',
                   'plant_height_reproductive',
                   'plant_width',
                   'stem_count',
                   'stem_length'
                   ]
print(f'Total traits selected: {len(selected_traits2)}')

Total traits selected: 46
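
The round 2 list could also be derived from the initial list by filtering out the exclusions from the notes, which avoids retyping it. The sketch below uses a shortened stand-in for the initial list; the variable names are hypothetical, and the exclusion set simply collects the traits the notes mark as excluded.

```python
# Shortened stand-in for the initial selection (illustrative subset only).
initial_subset = [
    'bark_thickness_index',
    'leaf_length',
    'leaf_vessel_density',
    'plant_diameter_breast_height',
    'plant_height',
    'ploidy',
    'seed_germination',
]

# Traits the notes mark as excluded.
excluded_in_notes = {
    'leaf_vessel_density',
    'plant_diameter_breast_height',
    'ploidy',
    'resprouting_capacity_proportion_individuals',
    'resprouting_capacity_stem_ratio',
    'seed_germination',
    'sprout_depth',
    'storage_organ_diameter',
    'storage_organ_length',
}

# Keep the original order while dropping excluded traits.
round2_subset = [trait for trait in initial_subset if trait not in excluded_in_notes]
print(f'Total traits selected: {len(round2_subset)}')
```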


## 3.3. Sample Counts of Selected Traits Round 2

In [14]:
sample_counts2_df = traits_grouped_df[traits_grouped_df['trait_name'].isin(selected_traits2)].sort_values(by=['Counts'], ascending=[False])
sample_counts2_df

Unnamed: 0,trait_name,unit,value_type,Counts
528,plant_height,m,maximum,52096
315,leaf_length,mm,maximum,41517
317,leaf_length,mm,minimum,37367
493,leaf_width,mm,maximum,36218
495,leaf_width,mm,minimum,33528
153,fruit_length,mm,maximum,21289
530,plant_height,m,minimum,17505
155,fruit_length,mm,minimum,17320
221,leaf_area,mm2,raw,13768
158,fruit_width,mm,maximum,13556


In [15]:
print(f'Sample count minimum: {sample_counts2_df.Counts.min()}, Sample count maximum: {sample_counts2_df.Counts.max()}')

Sample count minimum: 1, Sample count maximum: 52096


### Notes
* The majority of features appear to have too few samples to provide value to the model.
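
A minimum sample threshold is one way to make that judgement explicit. In the sketch below, `min_samples` is an arbitrary value chosen for illustration, and `hypothetical_rare_trait` is an invented row standing in for a feature with very few samples; the other counts are copied from the table above.

```python
import pandas as pd

# A few rows in the shape of the sample count table.
toy_counts = pd.DataFrame({
    'trait_name': ['plant_height', 'leaf_length', 'leaf_area',
                   'hypothetical_rare_trait'],
    'value_type': ['maximum', 'maximum', 'raw', 'raw'],
    'Counts': [52096, 41517, 13768, 1],
})

min_samples = 1000  # arbitrary threshold for illustration

# Features with enough samples to be worth considering.
kept = toy_counts[toy_counts['Counts'] >= min_samples]

# Share of all samples that the kept features account for.
share_kept = kept['Counts'].sum() / toy_counts['Counts'].sum()
print(f'Features kept: {len(kept)} of {len(toy_counts)}')
print(f'Share of samples kept: {share_kept:.4f}')
```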

## 3.4. Taxa Counts
* It would be worth considering the possible class count with respect to the number of entries per feature.
* Firstly, a preview of some taxa.csv data:

In [16]:
taxa_csv = 'austraits-7.0.0/taxa.csv'
taxa_df = pd.read_csv(taxa_csv, encoding='ISO-8859-1', low_memory=False)
taxa_df.head(10)

Unnamed: 0,taxon_name,taxon_rank,taxonomic_status,taxonomic_dataset,taxon_name_alternatives,genus,family,binomial,trinomial,taxon_distribution,establishment_means,scientific_name,taxon_id,taxon_id_genus,taxon_id_family,scientific_name_id
0,(Dockrillia pugioniformis x Dockrillia striola...,species,accepted,APC,,(Dockrillia,Orchidaceae,(Dockrillia pugioniformis x Dockrillia striola...,,NSW,native,(Dockrillia pugioniformis (A.Cunn.) Rauschert ...,https://id.biodiversity.org.au/taxon/apni/5140...,https://id.biodiversity.org.au/taxon/apni/5179...,https://id.biodiversity.org.au/taxon/apni/5179...,https://id.biodiversity.org.au/name/apni/51342600
1,Abelia x grandiflora,species,accepted,APC,,Abelia,Caprifoliaceae,Abelia x grandiflora,,NSW (naturalised),naturalised,Abelia x grandiflora (Rovelli ex AndrÃ©) Rehder,https://id.biodiversity.org.au/taxon/apni/5143...,https://id.biodiversity.org.au/taxon/apni/5143...,https://id.biodiversity.org.au/taxon/apni/5161...,https://id.biodiversity.org.au/name/apni/190758
2,Abelmoschus ficulneus,species,accepted,APC,,Abelmoschus,Malvaceae,Abelmoschus ficulneus,,"WA, NT, Qld",native,Abelmoschus ficulneus (L.) Wight,https://id.biodiversity.org.au/node/apni/2897916,https://id.biodiversity.org.au/node/apni/2898872,https://id.biodiversity.org.au/taxon/apni/5177...,https://id.biodiversity.org.au/name/apni/55929
3,Abelmoschus manihot,species,accepted,APC,,Abelmoschus,Malvaceae,Abelmoschus manihot,,"ChI, NT, Qld (naturalised), NSW (doubtfully na...",native and naturalised,Abelmoschus manihot (L.) Medik.,https://id.biodiversity.org.au/node/apni/2901085,https://id.biodiversity.org.au/node/apni/2898872,https://id.biodiversity.org.au/taxon/apni/5177...,https://id.biodiversity.org.au/name/apni/55937
4,Abelmoschus manihot subsp. manihot,subspecies,accepted,APC,,Abelmoschus,Malvaceae,Abelmoschus manihot,Abelmoschus manihot subsp. manihot,"Qld (naturalised), NSW (doubtfully naturalised)",naturalised,Abelmoschus manihot (L.) Medik. subsp. manihot,https://id.biodiversity.org.au/node/apni/2917035,https://id.biodiversity.org.au/node/apni/2898872,https://id.biodiversity.org.au/taxon/apni/5177...,https://id.biodiversity.org.au/name/apni/116920
5,Abelmoschus manihot subsp. tetraphyllus,subspecies,accepted,APC,,Abelmoschus,Malvaceae,Abelmoschus manihot,Abelmoschus manihot subsp. tetraphyllus,"ChI, Qld (naturalised)",native and naturalised,Abelmoschus manihot subsp. tetraphyllus (Roxb....,https://id.biodiversity.org.au/node/apni/2892917,https://id.biodiversity.org.au/node/apni/2898872,https://id.biodiversity.org.au/taxon/apni/5177...,https://id.biodiversity.org.au/name/apni/55945
6,Abelmoschus moschatus,species,accepted,APC,,Abelmoschus,Malvaceae,Abelmoschus moschatus,,"WA, NT, Qld, NSW (naturalised)",native and naturalised,Abelmoschus moschatus Medik.,https://id.biodiversity.org.au/node/apni/2900572,https://id.biodiversity.org.au/node/apni/2898872,https://id.biodiversity.org.au/taxon/apni/5177...,https://id.biodiversity.org.au/name/apni/55953
7,Abelmoschus moschatus subsp. biakensis,subspecies,accepted,APC,,Abelmoschus,Malvaceae,Abelmoschus moschatus,Abelmoschus moschatus subsp. biakensis,WA,native,Abelmoschus moschatus subsp. biakensis (Hochr....,https://id.biodiversity.org.au/node/apni/2907435,https://id.biodiversity.org.au/node/apni/2898872,https://id.biodiversity.org.au/taxon/apni/5177...,https://id.biodiversity.org.au/name/apni/116595
8,Abelmoschus moschatus subsp. moschatus,subspecies,accepted,APC,,Abelmoschus,Malvaceae,Abelmoschus moschatus,Abelmoschus moschatus subsp. moschatus,NSW (naturalised),naturalised,Abelmoschus moschatus Medik. subsp. moschatus,https://id.biodiversity.org.au/node/apni/2911283,https://id.biodiversity.org.au/node/apni/2898872,https://id.biodiversity.org.au/taxon/apni/5177...,https://id.biodiversity.org.au/name/apni/243806
9,Abelmoschus moschatus subsp. tuberosus,subspecies,accepted,APC,,Abelmoschus,Malvaceae,Abelmoschus moschatus,Abelmoschus moschatus subsp. tuberosus,"WA, NT, Qld",native,Abelmoschus moschatus subsp. tuberosus (Span.)...,https://id.biodiversity.org.au/node/apni/2919287,https://id.biodiversity.org.au/node/apni/2898872,https://id.biodiversity.org.au/taxon/apni/5177...,https://id.biodiversity.org.au/name/apni/55961
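
The 'AndrÃ©' in the `scientific_name` column above matches the pattern produced when UTF-8 bytes are decoded as ISO-8859-1. This suggests, but does not confirm, that taxa.csv may be UTF-8 encoded. The sketch below reproduces the effect on a plain string and does not touch the file.

```python
# Reproduce the garbling: UTF-8 bytes read back with the wrong codec.
original = 'André'
misread = original.encode('utf-8').decode('ISO-8859-1')

# Reversing the two steps recovers the intended text.
repaired = misread.encode('ISO-8859-1').decode('utf-8')

print(f'Misread: {misread}')
print(f'Repaired: {repaired}')
```

If the file does turn out to be UTF-8, passing `encoding='utf-8'` to `pd.read_csv` would be worth trying, with the caveat that it raises an error if any bytes are not valid UTF-8.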


* Counts of each taxon rank:

In [17]:
taxa_df.taxon_rank.value_counts()

taxon_rank
species       26807
subspecies     2970
genus          2022
variety        1326
form             86
family           68
series           20
Name: count, dtype: int64

In [18]:
total_feature_samples = sample_counts2_df.Counts.sum()
print(f'Total sample count across all 46 selected features: {total_feature_samples}')

Total sample count across all 46 selected features: 379188


In [19]:
total_top10_feature_samples = sample_counts2_df.iloc[0:10].Counts.sum()
total_top10_feature_samples_pcnt = int(total_top10_feature_samples/total_feature_samples * 100)
print('Total sample count across largest 10 selected features: {}.\nSo, {}% belong to the 10 largest features.'.format(total_top10_feature_samples, total_top10_feature_samples_pcnt))

Total sample count across largest 10 selected features: 284164.
So, 74% belong to the 10 largest features.


In [20]:

# Look up the species count by label rather than relying on it being the first entry
species_count = taxa_df.taxon_rank.value_counts()['species']
print(f'The ratio of total samples to total species (ignoring the other taxon ranks) is: {total_feature_samples/species_count}')
print(f'The ratio of top 10 feature samples to total species (ignoring the other taxon ranks) is: {total_top10_feature_samples/species_count}')

The ratio of total samples to total species (ignoring the other taxon ranks) is: 14.145111351512664
The ratio of top 10 feature samples to total species (ignoring the other taxon ranks) is: 10.600365576155482


### Notes
* The largest count for a single feature is 52,096, for maximum `plant_height`. Across all 46 features there are 379,188 samples, about three quarters of which belong to the 10 largest features. Meanwhile, 26,807 species are listed in taxa.csv, so the top 10 features average about 10.6 samples per species. However, the taxa are unlikely to be equally represented in the data, and some taxa may not be present in traits.csv at all.
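
How evenly the species are represented could be checked by counting trait rows per species and including the species that have none. The sketch below uses toy tables: the species names are taken from the taxa.csv preview, while the trait rows are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for taxa.csv and traits.csv.
toy_taxa = pd.DataFrame({
    'taxon_name': ['Abelmoschus ficulneus', 'Abelmoschus manihot',
                   'Abelmoschus moschatus'],
    'taxon_rank': ['species', 'species', 'species'],
})
toy_traits = pd.DataFrame({
    'taxon_name': ['Abelmoschus ficulneus', 'Abelmoschus ficulneus',
                   'Abelmoschus ficulneus', 'Abelmoschus manihot'],
    'trait_name': ['leaf_length', 'leaf_width', 'plant_height', 'leaf_length'],
})

species_names = toy_taxa.loc[toy_taxa['taxon_rank'] == 'species',
                             'taxon_name'].tolist()

# Trait rows per species; species absent from the traits table get a count of 0.
rows_per_species = (toy_traits['taxon_name'].value_counts()
                    .reindex(species_names, fill_value=0))

species_without_rows = int((rows_per_species == 0).sum())
print(rows_per_species)
print(f'Species with no trait rows: {species_without_rows}')
```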

## 3.5. Analyze Taxon Rank between taxa.csv and traits.csv

* taxa.csv lists seven taxon ranks, which are counted in the previous section. The order of the ranks:
  * family (68)
  * genus (2022)
    * series (20) **
  * species (26807)
  * subspecies (2970)
  * variety (1326) **
  * form (86) **

* ** However, 'variety' and 'form' can each branch directly off any rank above them, up to 'species', while, according to [wikipedia](https://en.wikipedia.org/wiki/Series_(botany)), a 'series' can be used to divide up a genus with a very large number of species. Referencing Christensen (1987; Nordic J. Botany 7: 383-408; see p.384):

>"_The species concept used in the present work is morphological, and mostly in line with Rothmaler (1944) and Du Rietz (1930). The taxonomic ranks used are defined as follows:_
> 
>_**Forma** of a variety, subspecies or species occurs sporadically within the distribution area of the taxon of higher rank to which it is referred and differs from that taxon in a single character._
>
>_**Varietas** of a subspecies or species is to some extent allopatric and forms local, distinct populations as well as mixed, integrating populations within the distribution area of the subspecies or species. They differ from each other in usually more than a single, distinct character._
>
>_**Subspecies** of a species are both regionally and locally allopatric. They differ from each other in several, distinct characters, but intergrade in overlapping areas._
>
>_**Species** of a genus differ from each other in numerous, distinct characters and have a characteristic distribution area of their own. Where closely related species meet occasional hybridization and introgression may occur._"


* Within taxa.csv, the `taxon_rank` field affects the content of the `taxon_name` field:
  * If the rank is 'family', `taxon_name` equals or starts with the `family` field. The `genus` field is blank.
  * If the rank is 'genus', `taxon_name` equals or starts with the `genus` field. The `family` field is populated.
  * If the rank is 'species', `taxon_name` equals the `binomial` field. The `trinomial` field is blank.
  * If the rank is 'subspecies', `taxon_name` equals the `trinomial` field. The `trinomial` field starts with the `binomial` field, followed by ' subsp. ', followed by the subspecies suffix.
  * If the rank is 'variety', `taxon_name` equals the `trinomial` field. The `trinomial` field starts with the `binomial` field, followed by ' var. ', followed by the variety suffix. The binomial is the same as the species, so there is no indication of whether the variety belongs to a subspecies, only the species.
  * If the rank is 'form', `taxon_name` equals the `trinomial` field. The `trinomial` field starts with the `binomial` field, followed by ' f. ', followed by the form suffix. The binomial is the same as the species, so there is no indication of whether the form belongs to a subspecies or variety, only the species.
  * If the rank is 'series', `taxon_name` equals the `trinomial` field. The `trinomial` field starts with the `binomial` field, followed by ' ser. ', followed by the series suffix.
    * 19 of the 'series' are genus Dryandra, which includes two subspecies. The only other series is Eucalyptus ser. Diversiformae. All of these currently have the `taxonomic_status` of 'unplaced', indicating they don't fit in their hierarchy, which can be due to various reasons, including source material problems.
  * 68 `taxon_name`'s are missing data for the other fields.
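
The naming rules above suggest that the rank of a name below genus can be guessed from its infix alone. The following is a minimal sketch of that idea, not part of the notebook: `INFIX_TO_RANK` and `infer_rank_from_name` are hypothetical names, and the two-word fallback for 'species' is an assumption that will not hold for placeholder names such as 'Malvaceae sp. [White_2020]'.

```python
# Hypothetical lookup built from the infixes described in the list above
INFIX_TO_RANK = {
    " subsp. ": "subspecies",
    " var. ": "variety",
    " f. ": "form",
    " ser. ": "series",
}


def infer_rank_from_name(taxon_name):
    """Guess the taxon rank from the infix in a taxon_name.

    Returns None when no rule applies (e.g. family, genus or
    placeholder names), so callers can fall back to taxa.csv.
    """
    for infix, rank in INFIX_TO_RANK.items():
        if infix in taxon_name:
            return rank
    # Assumption: a plain two-word name is a species binomial
    if len(taxon_name.split()) == 2:
        return "species"
    return None


for name in ["Acacia melanoxylon",
             "Eucalyptus ser. Diversiformae",
             "Malvaceae sp. [White_2020]"]:
    print(name, "->", infer_rank_from_name(name))
```

In practice the `taxon_rank` column in taxa.csv remains the authoritative source; a helper like this would only be useful as a consistency check against it.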


* Entries in traits.csv for a family would likely apply to its subordinate ranks, but such traits likely do not pertain to the selected traits being observed. For example, 'Malvaceae sp. [White_2020]':

In [21]:
traits_df[traits_df.taxon_name.str.startswith('Malvaceae sp. [White_2020]')]

Unnamed: 0,dataset_id,taxon_name,observation_id,trait_name,value,unit,entity_type,value_type,basis_of_value,replicates,basis_of_record,life_stage,population_id,individual_id,repeat_measurements_id,temporal_context_id,source_id,location_id,entity_context_id,plot_context_id,treatment_context_id,collection_date,measurement_remarks,method_id,method_context_id,original_name
1659829,White_2020,Malvaceae sp. [White_2020],4865,dispersal_syndrome,zoochory,,species,mode,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.
1659830,White_2020,Malvaceae sp. [White_2020],4865,dispersers,invertebrates,,species,mode,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.
1659831,White_2020,Malvaceae sp. [White_2020],4865,lifespan,1--10,a,species,bin,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.
1659832,White_2020,Malvaceae sp. [White_2020],4865,nitrogen_fixing,non_nitrogen_fixer,,species,mode,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.
1659833,White_2020,Malvaceae sp. [White_2020],4865,parasitic,not_parasitic,,species,mode,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.
1659834,White_2020,Malvaceae sp. [White_2020],4865,photosynthetic_pathway,c3,,species,mode,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.
1659835,White_2020,Malvaceae sp. [White_2020],4865,plant_growth_form,herb,,species,mode,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.
1659836,White_2020,Malvaceae sp. [White_2020],4865,plant_physical_defence_structures,absent,,species,mode,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.
1659837,White_2020,Malvaceae sp. [White_2020],4865,plant_tolerance_inundation,1-6_months greater_than_6_months less_than_1_m...,,species,mode,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.
1659838,White_2020,Malvaceae sp. [White_2020],4865,plant_tolerance_salt,salinity_tolerance_undefined,,species,mode,expert_score,,literature field,adult,,,,1.0,,,,,,unknown/2020,,1.0,,Malvaceae spp.


* Rows with a `taxon_rank` of 'family' or 'genus' will be dropped from the trait observations.
* The selected features could vary by series, species, subspecies, variety or form, so each taxon_name could be treated independently of the others, such that each would be a class. However, this depends on there being sufficient data.
  * If sample counts are insufficient, then existing samples could be rolled up into genus, and the genera would then be the only classes.
  * This might have unexpected consequences. Some species may generalize better while other species' details may be lost in the model due to underfitting.

### Filtering and merging

In [22]:
# Only rows with desired traits
traits_filtered_df = traits_df[traits_df.trait_name.isin(selected_traits2)]
# Only for adult plants
traits_filtered_df = traits_filtered_df[traits_filtered_df.life_stage == 'adult']

In [23]:
# Only columns of use to modelling
traits_filtered_df = traits_filtered_df[['taxon_name', 'observation_id', 'trait_name', 'value', 'unit', 'value_type', 'basis_of_value']]

In [24]:
# Remove genus and family entries
taxa_filtered_df = taxa_df[~taxa_df.taxon_rank.isin(['genus', 'family'])]
taxa_filtered_df = taxa_filtered_df[['taxon_name', 'taxon_rank', 'genus', 'family']]

In [25]:
# Inner merge on taxon_name
trait_filt_df = pd.merge(traits_filtered_df, taxa_filtered_df, on='taxon_name', how='inner')

In [26]:
trait_filt_df[trait_filt_df.basis_of_value != 'measurement'].head(10)

Unnamed: 0,taxon_name,observation_id,trait_name,value,unit,value_type,basis_of_value,taxon_rank,genus,family
272954,Acacia fasciculifera,11,plant_height,20,m,maximum,expert_score,species,Acacia,Fabaceae
272957,Acacia fasciculifera,12,leaflet_count,1,{count},mode,expert_score,species,Acacia,Fabaceae
272962,Acacia melanoxylon,24,plant_height,30,m,maximum,expert_score,species,Acacia,Fabaceae
272965,Acacia melanoxylon,25,leaflet_count,1,{count},mode,expert_score,species,Acacia,Fabaceae
272969,Acalypha capillipes,35,plant_height,4,m,maximum,expert_score,species,Acalypha,Euphorbiaceae
272972,Acalypha capillipes,36,leaflet_count,1,{count},mode,expert_score,species,Acalypha,Euphorbiaceae
272981,Ackama paniculosa,51,plant_height,40,m,maximum,expert_score,species,Ackama,Cunoniaceae
272984,Ackama paniculosa,52,leaflet_count,5,{count},mode,expert_score,species,Ackama,Cunoniaceae
272987,Syzygium hemilamprum,60,plant_height,35,m,maximum,expert_score,species,Syzygium,Myrtaceae
272990,Syzygium hemilamprum,61,leaflet_count,1,{count},mode,expert_score,species,Syzygium,Myrtaceae


### Samples per taxon_name

In [27]:
# Calculate average samples per taxon_name/trait_name/value_type
temp = trait_filt_df[['taxon_name', 'trait_name', 'value_type', 'observation_id']].groupby(['taxon_name', 'trait_name', 'value_type']).count().rename(columns={'observation_id': 'counts'})
print(f'Average samples per taxon_name/trait_name/value_type: {temp.counts.mean()}')

Average samples per taxon_name/trait_name/value_type: 1.7753787915743875


* There are only about 1.8 samples per `trait_name`/`value_type` for each class (a.k.a. `taxon_name`).
* This isn't sufficient for model training, as there would be insufficient samples per class.

### Samples per genus

In [28]:
# Calculate average samples per genus/trait_name/value_type
temp = trait_filt_df[['genus', 'trait_name', 'value_type', 'observation_id']].groupby(['genus', 'trait_name', 'value_type']).count().rename(columns={'observation_id': 'counts'})
print(f'Average samples per genus/trait_name/value_type: {temp.counts.mean()}')

Average samples per genus/trait_name/value_type: 8.120710825352175


* The result of 8 samples on average per genus is better, but is unlikely to be sufficient as the counts will vary by genus.
* Viewing the largest and smallest count:

In [29]:
# Double-quoted f-strings so the nested ['counts'] also parses on Python < 3.12
print(f"Largest genus count of trait_name/value_type: {temp.loc[temp['counts'].idxmax()]}\n")
print(f"Smallest genus count of trait_name/value_type: {temp.loc[temp['counts'].idxmin()]}")

Largest genus count of trait_name/value_type: counts    3851
Name: (Acacia, plant_height, maximum), dtype: int64

Smallest genus count of trait_name/value_type: counts    1
Name: (Abelia, leaf_length, maximum), dtype: int64


* All the counts for "plant_height, maximum":

In [30]:
temp.loc[(slice(None), 'plant_height', 'maximum')]

Unnamed: 0_level_0,counts
genus,Unnamed: 1_level_1
Abelia,1
Abelmoschus,11
Abildgaardia,4
Abroma,2
Abrophyllum,4
Abrotanella,5
Abrus,1
Abutilon,84
Acacia,3851
Acaciella,3
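
The ten counts shown above already illustrate why an average alone is misleading here. As a small sketch using only those ten rows (not the full table), comparing the mean with the median shows how a single genus such as Acacia dominates:

```python
from statistics import mean, median

# plant_height/maximum counts for the ten genera shown in the output above
counts = {
    "Abelia": 1, "Abelmoschus": 11, "Abildgaardia": 4, "Abroma": 2,
    "Abrophyllum": 4, "Abrotanella": 5, "Abrus": 1, "Abutilon": 84,
    "Acacia": 3851, "Acaciella": 3,
}

mean_count = mean(counts.values())      # pulled up by Acacia
median_count = median(counts.values())  # the "typical" genus

print(f"mean = {mean_count}, median = {median_count}")
# prints: mean = 396.6, median = 4.0
```

The same comparison could be run on the full `temp.counts` column; the values for the complete table are not computed here.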


* The counts are far from evenly distributed, and the variation within each genus per trait_name/value_type could be too large for good modelling.
* Filtering to only genera with a minimum count (threshold yet to be decided) per trait_name/value_type feature, and selecting only the largest trait_name/value_type features, may improve modelling for those genera that are included.
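
One possible way to apply such a minimum-count filter is sketched below. It is an illustration only: the `toy` frame is made-up data shaped like the genus/trait table built in the next cell, and `MIN_COUNT = 3` is an arbitrary placeholder since the threshold has not been decided.

```python
import pandas as pd

# Made-up toy data: one row per observation of a genus/trait pair
toy = pd.DataFrame({
    "genus": ["Acacia"] * 4 + ["Abutilon"] * 3 + ["Abelia"],
    "trait": ["plant_height_maximum"] * 8,
    "value": [20.0, 30.0, 12.0, 8.0, 1.5, 2.0, 1.0, 2.5],
})

MIN_COUNT = 3  # hypothetical threshold, to be tuned

# Number of observations in each row's (genus, trait) group,
# aligned with the original rows
group_sizes = toy.groupby(["genus", "trait"])["value"].transform("count")

# Keep only rows whose group meets the threshold
kept = toy[group_sizes >= MIN_COUNT]
kept_genera = sorted(kept["genus"].unique())

print(kept_genera)
```

With this toy data Abelia, which has a single observation, is dropped while Acacia and Abutilon are kept. Raising the threshold trades the number of classes for more samples per class.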

In [31]:
# Drop taxon_name, taxon_rank as genus will be the class, combine trait_name and value_type as the trait
genus_trait_df = trait_filt_df.drop(columns=['taxon_name', 'taxon_rank', 'observation_id', 'basis_of_value'])
genus_trait_df['trait'] = genus_trait_df['trait_name'] + '_' + genus_trait_df['value_type']
genus_trait_df = genus_trait_df.drop(columns=['trait_name', 'value_type'])
genus_trait_df.head()

Unnamed: 0,value,unit,genus,family,trait
0,5.0,mm,Adenanthos,Proteaceae,leaf_length_minimum
1,20.0,mm,Adenanthos,Proteaceae,leaf_length_maximum
2,0.5,mm,Adenanthos,Proteaceae,leaf_width_maximum
3,0.5,m,Adenanthos,Proteaceae,plant_height_maximum
4,10.0,mm,Adenanthos,Proteaceae,leaf_length_minimum


# 4. Summary
> As per README.md, this data is not suited to building a model that solves the problem statement. The data could be suitable if it were the raw data and the data sources were consistent in how the data was collected.