Some genera were removed during data curation, but their genus_ids still appear in the `distances` metadata. Thanks @kalelpark for reporting this.

This notebook double-checks that the genus_ids present in the data are consistent across all metadata, except the `distances` metadata.

In [None]:
import json
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
os.listdir('/kaggle/input/herbarium-2022-fgvc9')

In [None]:
with open('/kaggle/input/herbarium-2022-fgvc9/train_metadata.json', 'r') as f:
    train_metadata = json.load(f)

In [None]:
train_metadata.keys()

In [None]:
train_metadata['genera'][0], train_metadata['genera'][-1], len(train_metadata['genera'])

We see that the genus IDs run from 1 to 2584, but the list has only 2564 entries, because some genera were removed during data curation.
The first step is to check that the genus_ids in the `genera` metadata exactly match the genus_ids in the `annotations` metadata:

In [None]:
# Unique genus ids from the `genera` metadata
S_genus_id = set(pd.DataFrame(train_metadata['genera']).genus_id)

In [None]:
# Unique genus ids from the image `annotations` metadata
S_genus_id_from_annotations = set(pd.DataFrame(train_metadata['annotations']).genus_id)

In [None]:
# Make sure the two sets match: if True, every genus in the `genera` metadata
# appears in the image annotations, and vice versa
S_genus_id == S_genus_id_from_annotations

The genus_ids match, OK.
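
Had the two sets differed, the equality check alone would not say which ids were at fault. Below is a minimal sketch of how set differences could pinpoint them. It uses made-up toy ids rather than the real metadata, and the helper name `compare_id_sets` is hypothetical:

```python
def compare_id_sets(ids_a, ids_b):
    """Return (only_in_a, only_in_b) as sorted lists."""
    set_a, set_b = set(ids_a), set(ids_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)

# Toy ids for illustration only, not taken from the dataset
genera_ids = [1, 2, 3, 5]
annotation_ids = [1, 2, 5, 6]

only_in_genera, only_in_annotations = compare_id_sets(genera_ids, annotation_ids)
print(only_in_genera, only_in_annotations)
```

With the real data, the same helper could be called as `compare_id_sets(S_genus_id, S_genus_id_from_annotations)`; two empty lists would correspond to the `True` result above.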

Now we check whether the `distances` metadata contains every genus_id from 1 to 2584:

In [None]:
# Unique genus ids on each side of the pairwise `distances` metadata
distances_df = pd.DataFrame(train_metadata['distances'])
S_genus_id_x = set(distances_df.genus_id_x)
S_genus_id_y = set(distances_df.genus_id_y)

In [None]:
# range() excludes its stop value, so use 2585 to include genus_id 2584
([id_num for id_num in range(1, 2585) if id_num not in S_genus_id_x],
 [id_num for id_num in range(1, 2585) if id_num not in S_genus_id_y])

Finally, we list the genus_ids that are not used in the Herbarium 2022 dataset:

In [None]:
# List the genus ids that are not present in the current image dataset
# (range() excludes its stop value, so use 2585 to include genus_id 2584)
[id_num for id_num in range(1, 2585) if id_num not in S_genus_id]
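
The same kind of gap check can also be written as a set difference against the full expected range. This is only a sketch on toy ids, with a hypothetical helper `missing_ids`; it is not part of the original notebook:

```python
def missing_ids(present_ids, first_id, last_id):
    """Return the sorted ids in [first_id, last_id] that are absent from present_ids."""
    # range() excludes its stop value, hence last_id + 1
    expected = set(range(first_id, last_id + 1))
    return sorted(expected - set(present_ids))

# Toy ids for illustration only, not taken from the dataset
toy_present = {1, 2, 4, 7}
toy_missing = missing_ids(toy_present, 1, 7)
print(toy_missing)
```

For the real metadata, a call such as `missing_ids(S_genus_id, 1, 2584)` would play the role of the list comprehension above, with the inclusive upper bound stated explicitly.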