# Data Ingest Pipeline

## Raw

The raw landing zone will hold datasets in a variety of formats, so we need a way to normalize every type of input into one standard format (images). The expected input types are:

- Image

- Video

## Metadata

Before loading the images and videos into the dataset, we need to build a metadata mapping list for the known pollen grain slides. The goal is to build a library of pollen types robust enough that data annotation becomes automated, reducing the need to label each pollen slide by hand. I think a binary classification system for detecting pollen grains would help with the initial annotation pass. Once the slides are annotated, the "unknown" pollen types can be labeled by a professional. Overall, we want to reduce the amount of manual intervention required to annotate and count pollen grains on microscope slides.

## Dataset

The dataset will be the normalized images in JPG format. The images will be stored under:
- POLLEN/{FAMILY}/{YEAR_COLLECTED}/{MONTH_COLLECTED}

It seems the most common level of classification is the family level. Let's build an object detection system at the family level of pollen grain. Once it can accurately count and classify grains into these families, we can create genus- and species-level models to refine the types further. The POC will be the family-level model.
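The storage layout above can be sketched as a small path helper. This is only an illustration: the function name `build_dataset_dir` is hypothetical, and it assumes that the family is the first underscore-separated token of an id such as `FAGACEAE_FAGUS_GRANDIFOLIA` and that months are stored zero-padded.

```python
from datetime import date
from pathlib import PurePosixPath


def build_dataset_dir(common_name_id: str, collected: date) -> PurePosixPath:
    """Return POLLEN/{FAMILY}/{YEAR_COLLECTED}/{MONTH_COLLECTED} for one image."""
    # Assumption: the family is the first underscore-separated token of the id.
    family = common_name_id.split("_", 1)[0].upper()
    # Assumption: months are written zero-padded ("01" to "12").
    return PurePosixPath(
        "POLLEN", family, f"{collected.year:04d}", f"{collected.month:02d}"
    )


example_dir = build_dataset_dir("FAGACEAE_FAGUS_GRANDIFOLIA", date(2017, 12, 31))
print(example_dir)  # POLLEN/FAGACEAE/2017/12
```

Keeping the family as the top-level folder matches the family-level POC: every image for one training class sits under a single directory.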

In [28]:
metadata_url = '../data/metadata/pollenlibrary/full.json'

import pandas as pd

metadata_df = pd.read_json(metadata_url)

In [174]:

metadata_df = pd.read_csv('../data/database/pollen_categories.csv')

common_names_df = pd.read_csv('../data/database/pollen_commonnames.csv')

common_names_df

| | id | name | classification_level |
|---|---|---|---|
| 0 | HYPERICACEAE_HYPERICUM_CALYCINUM | HYPERICUM CALYCINUM | SPECIES |
| 1 | CUPRESSACEAE_CALLITROPSISOERST_NOOTKATENSIS | CALLITROPSIS NOOTKATENSIS | SPECIES |
| 2 | ROSACEAE_CERCOCARPUS_MONTANUS | CERCOCARPUS MONTANUS | SPECIES |
| 3 | FABACEAE_MEDICAGO_SATIVA | MEDICAGO SATIVA | SPECIES |
| 4 | FAGACEAE_CASTANEA_PUMILA | CASTANEA PUMILA | SPECIES |
| ... | ... | ... | ... |
| 3046 | POACEAE_GE_SP | GRASS | FAMILY |
| 3047 | PINACEAE_PINUS_SP | PINE | GENUS |
| 3048 | SAPINDACEAE_ACER_SP | MAPLE | GENUS |
| 3049 | BETULACEAE_CORYLUS_SP | HAZEL | GENUS |


In [208]:
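import re
from datetime import datetime


def get_dataframe(path: str, verbose: int = 0):
    """
    Read a DataFrame from a csv file.

    Args:
        path (str): Path to the csv file.
        verbose (int): If greater than 0, print the error when reading fails.

    Returns:
        pd.DataFrame: Read DataFrame, or an empty DataFrame if reading fails.
    """
    try:
        return pd.read_csv(path)
    except Exception as e:
        if verbose > 0:
            print(e)
        return pd.DataFrame()


def clean_string(s):
    # Remove every character that is not a letter or digit (including underscores).
    if isinstance(s, str):
        return re.sub(r"[\W_]+", '', s)
    else:
        return s


def re_braces(s):
    # Remove any (...) or [...] group from a string.
    if isinstance(s, str):
        return re.sub(r"[\(\[].*?[\)\]]", "", s)
    else:
        return s


def name_filter(s):
    # NOTE: clean_string already strips brackets and spaces, so the
    # re_braces and replace(' ', '') steps below leave the string unchanged.
    s = clean_string(s)
    s = re_braces(s)
    s = str(s)
    s = s.replace(' ', '').lower()
    return s
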
# Class map for the POLLEN20L dataset: class name <-> label id.
labels_df = pd.read_csv('../data/raw/pollen20L/class_map.csv', header=None, names=['name', 'id'])
class_dict = dict(zip(labels_df['id'], labels_df['name']))
inverse_class_dict = dict(zip(labels_df['name'], labels_df['id']))

# Bounding boxes: map the raw class name to its label first, then normalize
# the name so it can be joined against the common-names table.
bboxes_df = pd.read_csv('../data/raw/pollen20L/bboxes.csv', header=None, names=['image_name', 'x1', 'y1', 'x2', 'y2', 'class_name'])
bboxes_df['label'] = bboxes_df['class_name'].map(inverse_class_dict)
bboxes_df['class_name'] = bboxes_df['class_name'].str.replace('_', '').str.upper()

common_names_df = get_dataframe('../data/database/pollen_commonnames.csv')
if common_names_df.empty:
    raise Exception('Need to run category_importer before running media_metadata_importer')

# Inner join: boxes whose class_name has no entry in common_names_df are dropped.
out_df = pd.merge(
    bboxes_df,
    common_names_df.rename(columns={'name': 'class_name', 'id': 'common_name_id'}).drop(columns=['classification_level']),
    on=['class_name'],
)
out_df['source'] = 'POLLEN20L'
out_df

image_name        0
x1                0
y1                0
x2                0
y2                0
class_name        0
label             0
common_name_id    0
source            0
dtype: int64

In [121]:
latin_name_df = metadata_df[['id','latin_name']].rename(columns={'latin_name':'name'})
latin_name_df.name = latin_name_df.name.str.upper()

friendly_name_df = metadata_df[['id','friendly_name']].rename(columns={'friendly_name':'name'})
friendly_name_df.name = friendly_name_df.name.str.upper()

long_latin_name_df = metadata_df[['id']]
long_latin_name_df['name'] = long_latin_name_df.id.str.replace('_',' ')

ku_common_names_df = pd.DataFrame([
    {
        'id':'ASTERACEAE_XANTHIUM_STRUMARIUM',
        'name':'ASTERACEA XANTHIUM STRUMARIUM',
        'classification_level':'SPECIES'
    },
    {
        'id':'JUGLANDACEAE_CARYA_TOMENTOSA',
        'name':'JUGLANDACEA CARYA ALBA',
        'classification_level':'SPECIES'
    },
    {
        'id':'SAPINDACEAE_ACER_NEGUNDO',
        'name':'ACERACEA ACER NEGUNDO',
        'classification_level':'SPECIES'
    },
    {
        'id':'ASTERACEAE_ARTEMISIA_VULGARIS',
        'name':'ARTEMESIA VULGARIS',
        'classification_level':'SPECIES'
    },
    {
        'id':'OLEACEAE_OLEA_EUROPAEA',
        'name':'OLEA EUROPEAE',
        'classification_level':'SPECIES'
    },
    {
        'id':'PINACEAE_PINUS_STROBUS',
        'name':'PINACEA PINUS STROBUS',
        'classification_level':'SPECIES'
    },
    {
        'id':'ASTERACEAE_ARTEMISIA_TRIDENTATA',
        'name':'ASTERACEA AMBROSIA TRIDENTATA',
        'classification_level':'SPECIES'
    },
    {
        'id':'SALICACEAE_SALIX_LASIOLEPIS',
        'name':'SALIX LASIOLEPSIS',
        'classification_level':'SPECIES'
    },
    {
        'id':'PLANTAGINACEAE_PLANTAGO_LANCEOLATA',
        'name':'PLANTANGO LANCEOLATA',
        'classification_level':'SPECIES'
    },
    {
        'id':'ASTERACEAE_AMBROSIA_TRIFIDA',
        'name':'ASTERACEA AMBROSIA TRIFIDA',
        'classification_level':'SPECIES'
    },
    {
        'id':'ASTERACEAE_SOLIDAGO_CANADENSIS',
        'name':'SOLDIAGO SP',
        'classification_level':'SPECIES'
    },
    {
        'id':'CANNABACEAE_CELTIS_OCCIDENTALIS',
        'name':'ULMACEAE CELTIS OCCIDENTALIS',
        'classification_level':'SPECIES'
    },
    {
        'id':'SALICACEAE_POPULUS_SP',
        'name':'SALICEAE POPULUS SP',
        'classification_level':'GENUS'
    },
    {
        'id':'MORACEAE_MORUS_RUBRA',
        'name':'MORACEA MORUS RUBRA',
        'classification_level':'SPECIES'
    },
    {
        'id':'ASTERACEAE_ARTEMISIA_PYCNOCEPHALA',
        'name':'ASTERACEA ARTEMESIA PYCNOCEPHALA',
        'classification_level':'SPECIES'
    },
    {
        'id':'JUGLANDACEAE_CARYA_LACINIOSA',
        'name':'JUGLANDACEA CARYA LACINIOSA',
        'classification_level':'SPECIES'
    },
    {
        'id':'PLATANACEAE_PLATANUS_SP',
        'name':'PLATANACEA PLATANUS SP',
        'classification_level':'GENUS'
    },
])

common_names_df = [
    latin_name_df,
    friendly_name_df,
    long_latin_name_df,
    ku_common_names_df
]

common_names_df = pd.concat(common_names_df,ignore_index=True)
common_names_df['classification_level'] = 'SPECIES'
common_names_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  long_latin_name_df['name'] = long_latin_name_df.id.str.replace('_',' ')


| | id | name | classification_level |
|---|---|---|---|
| 0 | HYPERICACEAE_HYPERICUM_CALYCINUM | HYPERICUM CALYCINUM | SPECIES |
| 1 | CUPRESSACEAE_CALLITROPSISOERST_NOOTKATENSIS | CALLITROPSIS NOOTKATENSIS | SPECIES |
| 2 | ROSACEAE_CERCOCARPUS_MONTANUS | CERCOCARPUS MONTANUS | SPECIES |
| 3 | FABACEAE_MEDICAGO_SATIVA | MEDICAGO SATIVA | SPECIES |
| 4 | FAGACEAE_CASTANEA_PUMILA | CASTANEA PUMILA | SPECIES |
| ... | ... | ... | ... |
| 3027 | SALICACEAE_POPULUS_SP | SALICEAE POPULUS SP | SPECIES |
| 3028 | MORACEAE_MORUS_RUBRA | MORACEA MORUS RUBRA | SPECIES |
| 3029 | ASTERACEAE_ARTEMISIA_PYCNOCEPHALA | ASTERACEA ARTEMESIA PYCNOCEPHALA | SPECIES |
| 3030 | JUGLANDACEAE_CARYA_LACINIOSA | JUGLANDACEA CARYA LACINIOSA | SPECIES |
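Two details of the cell above are worth a closer look. Assigning a column to `metadata_df[['id']]` is what raises the `SettingWithCopyWarning`, and the final `common_names_df['classification_level'] = 'SPECIES'` overwrites the whole column, which is why `SALICACEAE_POPULUS_SP` appears as `SPECIES` in the output even though `ku_common_names_df` declares it as `GENUS`. The sketch below shows one possible way to handle both, using a tiny stand-in for `metadata_df` (the frame contents and the names `alias_df` and `combined_df` are illustrative, not the notebook's own).

```python
import pandas as pd

# Hypothetical stand-in for metadata_df (the real frame is read from disk).
metadata_df = pd.DataFrame({
    "id": ["FAGACEAE_FAGUS_GRANDIFOLIA"],
    "latin_name": ["Fagus grandifolia"],
})

latin_name_df = metadata_df[["id", "latin_name"]].rename(columns={"latin_name": "name"})
latin_name_df["name"] = latin_name_df["name"].str.upper()

# .copy() makes the subset an independent frame, so adding a column
# does not trigger SettingWithCopyWarning.
long_latin_name_df = metadata_df[["id"]].copy()
long_latin_name_df["name"] = long_latin_name_df["id"].str.replace("_", " ")

# One manually curated alias that is declared at GENUS level.
alias_df = pd.DataFrame([
    {
        "id": "SALICACEAE_POPULUS_SP",
        "name": "SALICEAE POPULUS SP",
        "classification_level": "GENUS",
    },
])

combined_df = pd.concat([latin_name_df, long_latin_name_df, alias_df], ignore_index=True)
# Fill only the missing levels so the explicit GENUS value is preserved.
combined_df["classification_level"] = combined_df["classification_level"].fillna("SPECIES")
print(combined_df)
```

Whether the aliases should keep their own level is a design decision for the pipeline; the sketch only shows that `fillna` leaves already-set values alone.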


In [122]:
import glob
import os
import re
from datetime import datetime

import pandas as pd
image_metadata = {
    'media_type':None,
    'name':None,
    'source':'Kean University',
    'magnification':None,
    'capture_datetime':None,
    'last_updated': None
}

# Example file path:
# '../data/raw/ku_pollen/DRB_100x/images\\100x Beech, American (Fagus grandifolia)Photo on 12-31-17 at 1.20 PM #2.jpg'

def clean_string(s):
    # Remove every character that is not a letter or digit (including underscores).
    if isinstance(s, str):
        return re.sub(r"[\W_]+", '', s)
    else:
        return s


def re_braces(s):
    # Remove any (...) or [...] group from a string.
    if isinstance(s, str):
        return re.sub(r"[\(\[].*?[\)\]]", "", s)
    else:
        return s


def name_filter(s):
    s = clean_string(s)
    s = re_braces(s)
    s = str(s)
    s = s.replace(' ', '').lower()
    return s

class MediaMetadata:
    def __init__(self, media_type: str, name: str, source: str, magnification: int, capture_datetime: datetime, specific_type:str='MIXED'):
        self.media_type = media_type
        self.name = name
        self.id = name_filter(name)
        self.source = source
        self.magnification = magnification
        self.capture_datetime = capture_datetime
        self.specific_type = specific_type
        self.last_updated = datetime.now()

    @classmethod
    def from_image_file(cls, file_path: str, source: str = None):
        file_name = os.path.basename(file_path)

        # Extract magnification
        parts = file_name.split(' on ')
        magnification_str = parts[0].split(' ')[0]
        magnification = int(magnification_str[:-1])

        # Extract the capture datetime, e.g. "12-31-17 at 1.20 PM #2.jpg"
        datetime_str = parts[1]
        date_str = datetime_str.split(' at ')[0].strip()
        # Expand the two-digit year (assumes captures from 2000 onwards)
        year = "20" + date_str[-2:]
        date_str = date_str[0:-2] + year
        time_str = datetime_str.split(' at ')[1].rsplit(' ', 1)[0].replace('.', ':')
        datetime_str = f"{date_str} {time_str}"
        capture_datetime = pd.Timestamp(datetime_str).to_pydatetime()

        #date_str = name_and_date.split(' at ')[-1].strip()
        #capture_datetime = datetime.strptime(date_str, '%m-%d-%y at %I.%M %p')

        # Extract specific type
        specific_type = re.search(r'\((.*?)\)', file_name).group(1).upper().strip().replace('.','')
        return cls(
            media_type=file_path.split('.')[-1],
            name=file_name,
            source=source,
            magnification=magnification,
            capture_datetime=capture_datetime,
            specific_type=specific_type
        )

# Build one metadata record per image across both magnification folders.
media_metadatas = []
file_paths = glob.glob('../data/raw/ku_pollen/DRB_400x/images/*')
file_paths.extend(glob.glob('../data/raw/ku_pollen/DRB_100x/images/*'))
for file_path in file_paths:
    media_metadata = MediaMetadata.from_image_file(file_path, source='Kean University')
    media_metadatas.append(media_metadata.__dict__)

media_metadatas_df = pd.DataFrame(media_metadatas)
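`from_image_file` above pulls the magnification, Latin name, and capture time out of the file name through a chain of string splits. As an alternative sketch, the same three fields can be read with one regular expression and `datetime.strptime`. The pattern and the function name `parse_ku_filename` are assumptions based only on the single example path shown in the cell above; other files in the folders may not follow the same shape.

```python
import re
from datetime import datetime

# Assumed file name shape, taken from the one example path in the notebook:
# "<mag>x <common name> (<latin name>)Photo on <m-d-yy> at <h.mm AM/PM> ..."
FILENAME_PATTERN = re.compile(
    r"^(?P<mag>\d+)x\s"
    r".*?\((?P<latin>[^)]*)\)"
    r".*?\son\s(?P<date>\d{1,2}-\d{1,2}-\d{2})"
    r"\sat\s(?P<time>\d{1,2}\.\d{2}\s[AP]M)"
)


def parse_ku_filename(file_name: str):
    """Return (magnification, latin_name, capture_datetime), or None if no match."""
    match = FILENAME_PATTERN.search(file_name)
    if match is None:
        return None
    # strptime expands the two-digit year and handles the 12-hour clock.
    capture = datetime.strptime(
        f"{match['date']} {match['time']}", "%m-%d-%y %I.%M %p"
    )
    latin_name = match["latin"].upper().strip().replace(".", "")
    return int(match["mag"]), latin_name, capture


parsed = parse_ku_filename(
    "100x Beech, American (Fagus grandifolia)Photo on 12-31-17 at 1.20 PM #2.jpg"
)
print(parsed)
```

Returning `None` for a file name that does not match lets the caller skip or log unexpected files instead of failing partway through the folder.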


In [123]:
media_metadatas_df.specific_type.unique()

array(['CARYA OVATA', 'ULMUS AMERICANA', 'CUPRESSUS ARIZONICA',
       'SALIX LASIOLEPSIS', 'SALIX NIGRA', 'ACERACEA ACER NEGUNDO',
       'ASTERACEA XANTHIUM STRUMARIUM', 'RUMEX CRISPUS',
       'PLANTANGO LANCEOLATA', 'ASTERACEA AMBROSIA TRIFIDA',
       'SOLDIAGO SP', 'ULMACEAE CELTIS OCCIDENTALIS',
       'JUGLANDACEA CARYA ALBA', 'CHENOPODIUM ALBUM',
       'POACEAE DACTYLIS GLOMERATA', 'PINACEA PINUS STROBUS',
       'SALICEAE POPULUS SP', 'AMBROSIA PSILOSTACHYA', 'ALNUS RUBRA',
       'MORACEA MORUS RUBRA', 'JUNIPERUS SCOPULORUM',
       'ASTERACEA AMBROSIA TRIDENTATA',
       'ASTERACEA ARTEMESIA PYCNOCEPHALA', 'JUGLANDACEA CARYA LACINIOSA',
       'PLATANACEA PLATANUS SP', 'PLATANUS RACEMOSA', 'SALSOLA KALI',
       'JUGLANS CALIFORNICA', 'FAGUS GRANDIFOLIA', 'BETULA NIGRA',
       'BETULA OCCIDENTALIS', 'JUNIPERUS VIRGINIANA', 'ALNUS INCANA',
       'ARTEMESIA VULGARIS', 'URTICA DIOICA', 'QUERCUS AGRIFOLIA',
       'OLEA EUROPEAE', 'CASUARINA EQUISETIFOLIA'], dtype=object)

In [167]:
media_metadatas_df = pd.merge(media_metadatas_df, common_names_df.rename(columns={'name':'specific_type','id':'name_id'}), on=['specific_type'])


media_metadatas_df.columns

Index(['media_type', 'name', 'id', 'source', 'magnification',
       'capture_datetime', 'specific_type', 'last_updated', 'name_id',
       'classification_level'],
      dtype='object')

In [124]:
for unique_ku in list(media_metadatas_df.specific_type.unique()):
    res = common_names_df.loc[common_names_df.name == unique_ku]
    if res.shape[0] == 0:
        print(unique_ku)
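The loop above looks up each label in `common_names_df` one at a time and prints the ones with no match. A possible vectorized version of the same check uses a left merge with `indicator=True`, which marks rows that found no partner as `left_only`. The two small frames below are illustrative stand-ins; the label spellings are taken from the outputs earlier in the notebook.

```python
import pandas as pd

# Illustrative stand-ins for media_metadatas_df and common_names_df.
labels_df = pd.DataFrame({
    "specific_type": ["FAGUS GRANDIFOLIA", "ARTEMESIA VULGARIS", "FAGUS GRANDIFOLIA"],
})
names_df = pd.DataFrame({
    "id": ["FAGACEAE_FAGUS_GRANDIFOLIA", "ASTERACEAE_ARTEMISIA_VULGARIS"],
    "name": ["FAGUS GRANDIFOLIA", "ARTEMISIA VULGARIS"],
})

# how="left" keeps every label row; indicator=True adds a "_merge" column.
checked_df = labels_df.merge(
    names_df.rename(columns={"name": "specific_type", "id": "name_id"}),
    on="specific_type",
    how="left",
    indicator=True,
)
unmatched = sorted(
    checked_df.loc[checked_df["_merge"] == "left_only", "specific_type"].unique()
)
print(unmatched)  # ['ARTEMESIA VULGARIS']
```

Here the misspelled label `ARTEMESIA VULGARIS` is reported because only `ARTEMISIA VULGARIS` exists in the names table, which is the kind of mismatch the manual alias list is meant to cover.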

In [162]:
common_names_df.loc[common_names_df.name.str.contains('VULGARIS')]

| | id | name | classification_level |
|---|---|---|---|
| 218 | OLEACEAE_SYRINGA_VULGARIS | SYRINGA VULGARIS | SPECIES |
| 230 | ASTERACEAE_ARTEMISIA_VULGARIS | ARTEMISIA VULGARIS | SPECIES |
| 311 | BERBERIDACEAE_BERBERIS_VULGARIS | BERBERIS VULGARIS | SPECIES |
| 409 | ERICACEAE_CALLUNASALISB_VULGARIS | CALLUNA VULGARIS | SPECIES |
| 738 | AMARANTHACEAE_BETA_VULGARIS | BETA VULGARIS | SPECIES |
| 2228 | OLEACEAE_SYRINGA_VULGARIS | OLEACEAE SYRINGA VULGARIS | SPECIES |
| 2240 | ASTERACEAE_ARTEMISIA_VULGARIS | ASTERACEAE ARTEMISIA VULGARIS | SPECIES |
| 2321 | BERBERIDACEAE_BERBERIS_VULGARIS | BERBERIDACEAE BERBERIS VULGARIS | SPECIES |
| 2419 | ERICACEAE_CALLUNASALISB_VULGARIS | ERICACEAE CALLUNASALISB VULGARIS | SPECIES |
| 2748 | AMARANTHACEAE_BETA_VULGARIS | AMARANTHACEAE BETA VULGARIS | SPECIES |


In [193]:
'Polygonaceae_Fagopyrum_esculentum'.upper()

'POLYGONACEAE_FAGOPYRUM_ESCULENTUM'

In [169]:
df = pd.read_csv('../data/database/pollen_media_metadata.csv')
df.value_counts(['common_name_id'])

common_name_id                       
JUGLANDACEAE_CARYA_OVATA                 62
CHENOPODIACEAE_SALSOLA_KALI              30
BETULACEAE_ALNUS_RUBRA                   30
PINACEAE_PINUS_STROBUS                   29
SAPINDACEAE_ACER_NEGUNDO                 28
ASTERACEAE_AMBROSIA_PSILOSTACHYA         27
PLATANACEAE_PLATANUS_RACEMOSA            26
POLYGONACEAE_RUMEX_CRISPUS               25
JUGLANDACEAE_JUGLANS_CALIFORNICA         25
ASTERACEAE_XANTHIUM_STRUMARIUM           24
AMARANTHACEAE_CHENOPODIUM_ALBUM          23
ASTERACEAE_ARTEMISIA_TRIDENTATA          23
CUPRESSACEAE_JUNIPERUS_SCOPULORUM        21
JUGLANDACEAE_CARYA_TOMENTOSA             20
ASTERACEAE_ARTEMISIA_VULGARIS            17
ULMACEAE_ULMUS_AMERICANA                 17
BETULACEAE_BETULA_NIGRA                  16
BETULACEAE_BETULA_OCCIDENTALIS           14
ASTERACEAE_SOLIDAGO_CANADENSIS           12
PLANTAGINACEAE_PLANTAGO_LANCEOLATA       12
URTICACEAE_URTICA_DIOICA                 11
FAGACEAE_FAGUS_GRANDIFOLIA            
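Since the POC targets the family level, the per-species counts above can be rolled up into per-family counts. The sketch below does this for a handful of rows copied from the output above, assuming the family is the first underscore-separated token of `common_name_id`; the variable names are illustrative.

```python
import pandas as pd

# A few per-species counts copied from the value_counts output above.
species_counts = pd.Series({
    "JUGLANDACEAE_CARYA_OVATA": 62,
    "BETULACEAE_ALNUS_RUBRA": 30,
    "JUGLANDACEAE_JUGLANS_CALIFORNICA": 25,
    "JUGLANDACEAE_CARYA_TOMENTOSA": 20,
    "BETULACEAE_BETULA_NIGRA": 16,
    "BETULACEAE_BETULA_OCCIDENTALIS": 14,
})

# Assumption: the family is the first underscore-separated token of the id.
# groupby calls the function on each index value to get the group key.
family_counts = (
    species_counts
    .groupby(lambda common_name_id: common_name_id.split("_")[0])
    .sum()
    .sort_values(ascending=False)
)
print(family_counts)
```

Grouping this way gives a quick view of how many images each family-level class would have before any training split is made.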

In [None]:


#angelica

#angelica_garden

#hill_mustard

#linden