# Filter records

Only keep records useful to this project:
- Records must have a picture, that is, an entry in the multimedia table with a valid link.
    - However, we keep only records with exactly one image, so that we know which image the notations refer to.
- Only angiosperm records. Filter by:
    - Phylum/Division
    - Class
    - Family
- Records should have reproductive data in one of these fields:
    - reproductivecondition is not empty
    - occurrenceremarks contains a reproductive key
    - dynamicproperties contains a reproductive key
    - fieldnotes contains a reproductive key
- Reproductive keys are:
    - flower, fruit, petal, fls, corolla, leaves, tepal, seed, sterile, ray, infl,
    - bract, inflor, inflorescence, stigma, sepal, flores
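
The reproductive-data rule above can be sketched in Python as follows. This is a minimal illustration, not the project's actual filter: the function name `has_reproductive_data` is hypothetical, records are assumed to be plain dicts keyed by field name, and "contains a reproductive key" is assumed to mean a case-insensitive substring match.

```python
# Reproductive keys copied from the list above.
REPRODUCTIVE_KEYS = """
    flower fruit petal fls corolla leaves tepal seed sterile ray infl
    bract inflor inflorescence stigma sepal flores
""".split()

# Free-text fields that are searched for a reproductive key.
TEXT_FIELDS = ('occurrenceremarks', 'dynamicproperties', 'fieldnotes')


def has_reproductive_data(record):
    """Return True if the record meets the reproductive-data rule (hypothetical helper)."""
    # Any non-blank reproductivecondition value is enough on its own.
    if (record.get('reproductivecondition') or '').strip():
        return True
    # Otherwise look for any key inside the free-text fields.
    # Assumption: a plain case-insensitive substring test.
    for field in TEXT_FIELDS:
        text = (record.get(field) or '').lower()
        if any(key in text for key in REPRODUCTIVE_KEYS):
            return True
    return False
```

A plain substring test over-matches: the key `ray` also hits "gray", for example. Anchoring keys at a word boundary would tighten the match if that turns out to matter.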

In [1]:
import sys

# Add the parent directory to the module search path.
sys.path.append('..')

In [2]:
import sqlite3
from pathlib import Path

import pandas as pd

In [3]:
DATA_DIR = Path('..') / 'data'

DB_IN = DATA_DIR / 'idigbio_2021-02.sqlite'
DB_OUT = DATA_DIR / 'angiosperms.sqlite'

The old database is "idigbio_2021-02.sqlite" and the new database is "angiosperms.sqlite".

This takes a ~60 GB database down to 1.1 GB. About 2.2 M records remain, of which ~2 M have data in the reproductivecondition field.

In [4]:
!ls -lh $DATA_DIR/idigbio_2021-02.sqlite

-rw-r--r-- 1 rafe rafe 20G Nov  4 18:45 ../data/idigbio_2021-02.sqlite


## Look at taxon distributions

In [4]:
# pd.set_option("display.max_rows", None, "display.max_columns", None)

In [5]:
sql = """
    select phylum, count(*) as n
    from angiosperms
    group by phylum
    order by n desc
"""
with sqlite3.connect(DB_OUT) as cxn:
    df = pd.read_sql(sql, cxn)
df

Unnamed: 0,phylum,n
0,tracheophyta,1691656
1,magnoliophyta,112180
2,,20801
3,charophyta,5100
4,pteridophyta,29
5,mollusca,9
6,lycopodiophyta,6


In [6]:
sql = """
    select class_, count(*) as n
    from angiosperms
    group by class_
    order by n desc
"""
with sqlite3.connect(DB_OUT) as cxn:
    df = pd.read_sql(sql, cxn)
df.to_csv(DATA_DIR / 'output' / 'class.csv', index=False)
df

Unnamed: 0,class_,n
0,magnoliopsida,1374164
1,liliopsida,334149
2,,115371
3,equisetopsida,5126
4,polypodiopsida,392
5,dicotyledonae,289
6,monocotyledonae,114
7,lycopodiopsida,75
8,pinopsida,70
9,psilotopsida,21


In [7]:
sql = """
    select order_, count(*) as n
    from angiosperms
    group by order_
    order by n desc
"""
with sqlite3.connect(DB_OUT) as cxn:
    df = pd.read_sql(sql, cxn)
df.to_csv(DATA_DIR / 'output' / 'order.csv', index=False)
df

Unnamed: 0,order_,n
0,asterales,274844
1,poales,222397
2,lamiales,142595
3,fabales,140475
4,caryophyllales,100066
...,...,...
109,myricales,1
110,marattiales,1
111,illiciales,1
112,hamamelidales,1


In [8]:
sql = """
    select family, count(*) as n
    from angiosperms
    group by family
    order by n desc
"""
with sqlite3.connect(DB_OUT) as cxn:
    df = pd.read_sql(sql, cxn)
df.to_csv(DATA_DIR / 'output' / 'family.csv', index=False)
df

Unnamed: 0,family,n
0,asteraceae,264278
1,fabaceae,130815
2,poaceae,115949
3,cyperaceae,87365
4,rosaceae,60595
...,...,...
522,caesalpiniaceae,1
523,blandfordiaceae,1
524,biebersteiniaceae,1
525,asteracea,1


In [9]:
sql = """
    select genus, count(*) as n
    from angiosperms
    group by genus
    order by n desc
"""
with sqlite3.connect(DB_OUT) as cxn:
    df = pd.read_sql(sql, cxn)
df.to_csv(DATA_DIR / 'output' / 'genera.csv', index=False)

In [10]:
sql = """
    select genus || ' ' || specificepithet as binominal, count(*) as n
    from angiosperms
    group by binominal
    order by n desc
"""
with sqlite3.connect(DB_OUT) as cxn:
    df = pd.read_sql(sql, cxn)
df.to_csv(DATA_DIR / 'output' / 'binominal.csv', index=False)
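
The cells in this section repeat the same group-and-count query for each taxonomic rank. As a rough sketch, the query text could be built by one small helper. The name `count_sql` is hypothetical, and the demo runs against a throwaway in-memory table rather than the real `angiosperms.sqlite`.

```python
import sqlite3


def count_sql(column, table='angiosperms'):
    """Build the group-and-count query for one column (hypothetical helper)."""
    # Column and table names cannot be bound as SQL parameters,
    # so accept only simple identifiers before formatting them in.
    if not (column.isidentifier() and table.isidentifier()):
        raise ValueError(f'not a simple identifier: {column!r} / {table!r}')
    return f"""
        select {column}, count(*) as n
        from {table}
        group by {column}
        order by n desc
    """


# Demo on a tiny in-memory table with made-up rows.
cxn = sqlite3.connect(':memory:')
cxn.execute('create table angiosperms (family text, genus text)')
cxn.executemany(
    'insert into angiosperms values (?, ?)',
    [('asteraceae', 'aster'), ('asteraceae', 'solidago'), ('fabaceae', 'lupinus')],
)
demo_rows = cxn.execute(count_sql('family')).fetchall()
cxn.close()
```

In the notebook, the same string could be passed to `pd.read_sql(count_sql('family'), cxn)` in place of each hand-written `sql` block.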