# Bird Detective

## Overview
This notebook explores the structure of the NABirds dataset from the Cornell Lab of Ornithology, described in [this CVPR 2015 paper](https://vision.cornell.edu/se3/wp-content/uploads/2015/05/Horn_Building_a_Bird_2015_CVPR_paper.pdf).

#### Data Load
The dataset ships with a collection of files that describe the images:
- [image location in the directory structure](#image_locations),
- [the taxonomic hierarchy](#hierarchy),
- [the class labels](#class_labels),
- [the class names themselves](#classes),
- [bounding boxes for the birds](#bounding_boxes),
- [part locations within the images](#part_locations),
- [the names for those parts](#parts),
- [image sizes](#sizes),
- [the photographer's name](#photographers),
- [and a train-test split](#test_train_split).

##### Import and Configuration

In [1]:
import boto3
import s3fs
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()
bucket = "furry-dollop"
data_key = "nabirds"
data_location = f"s3://{bucket}/{data_key}"

fs = s3fs.S3FileSystem()
fs.ls(data_location)

['furry-dollop/nabirds/bounding_boxes.txt',
 'furry-dollop/nabirds/classes.txt',
 'furry-dollop/nabirds/hierarchy.txt',
 'furry-dollop/nabirds/image_class_labels.txt',
 'furry-dollop/nabirds/images.txt',
 'furry-dollop/nabirds/photographers.txt',
 'furry-dollop/nabirds/sizes.txt',
 'furry-dollop/nabirds/train_test_split.txt',
 'furry-dollop/nabirds/images',
 'furry-dollop/nabirds/parts']

In [2]:
fs.ls(data_location+"/parts")

['furry-dollop/nabirds/parts/part_locs.txt',
 'furry-dollop/nabirds/parts/parts.txt']

In [3]:
image_location = data_location + "/images"

##### Function Definitions

In [4]:
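def get_table(file, columns, index=0):
    """Load a space-delimited NABirds file, splitting each line into at most `columns` + 1 fields."""
    # Read every line as one string (assumes the files contain no tabs).
    # `squeeze=True` was removed from read_csv in pandas 2.0, so squeeze afterwards.
    lines = pd.read_csv(data_location+"/"+file, header=None, sep="\t").squeeze("columns")
    # `n` must be passed by keyword in pandas 2.0+; it caps the number of splits.
    df = lines.str.split(" ", n=columns, expand=True)
    if isinstance(df[index], pd.Series):
        # Single index column: cast to int where possible, otherwise leave as strings.
        df[index] = df[index].astype(int, errors='ignore')
    else:
        # A list of index columns selects a DataFrame, so cast column by column.
        df[index] = df[index].apply(lambda s: s.astype(int, errors='ignore'))
        print("Your index got split ...MultiIndex FTW!")
    df = df.set_index(index)
    if df.index.is_unique:
        print(f"{file}: Success!")
        return df
    else:
        raise IndexError(f"Index in {file} is not unique.")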
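
The splitting step inside `get_table` can be tried without S3 access. The sketch below runs the same read-then-split idea on in-memory text. The sample rows and the numeric IDs are hypothetical, and the calls use `.squeeze("columns")` and the `n=` keyword, which recent pandas versions expect.

```python
import io
import pandas as pd

# Hypothetical sample in the "<id> <name>" layout; the name contains spaces.
sample = io.StringIO(
    "0001 Jane Q Public\n"
    "0002 John Doe\n"
)

# A tab separator keeps each line as a single string column,
# because the sample contains no tabs.
lines = pd.read_csv(sample, header=None, sep="\t").squeeze("columns")

# n=1 splits on the first space only, so the name stays in one piece.
table = lines.str.split(" ", n=1, expand=True)
table[0] = table[0].astype(int)
table = table.set_index(0)

print(table.loc[1, 1])  # Jane Q Public
```

Limiting the number of splits is what lets the same helper handle both single-value files and files whose last field contains spaces.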

##### Data Loading

<a id='image_locations'></a>

##### ------- List of image files (images.txt) ------
```
The list of image file names is contained in the file images.txt, with each line corresponding to one image:

<image_id> <image_name>
------------------------------------------
```

In [5]:
img_locs_srs = get_table("images.txt", 1)

images.txt: Success!


<a id='hierarchy'></a>

##### ------- Class hierarchy (hierarchy.txt) ------
```
The class hierarchy is contained in the file hierarchy.txt, with each line corresponding to one child-parent pair:

<child_class_id> <parent_class_id>

where <child_class_id> and <parent_class_id> correspond to the IDs in classes.txt.
---------------------------------------------------------
```

In [6]:
class_hierarchy_srs = get_table("hierarchy.txt", 1).astype(int)
class_hierarchy_dict = {k:v for k,v in zip(class_hierarchy_srs.index, class_hierarchy_srs[1].values)}

hierarchy.txt: Success!


<a id='class_labels'></a>

##### ------- Image class labels (image_class_labels.txt) ------
```
The ground truth class labels (bird species labels) for each image are contained in the file image_class_labels.txt, with each line corresponding to one image:

<image_id> <class_id>

where <image_id> and <class_id> correspond to the IDs in images.txt and classes.txt, respectively.
---------------------------------------------------------
```

In [7]:
img_class_labels_srs = get_table("image_class_labels.txt", 1)
## Can inspect to understand class balance >> Would potentially benefit from understanding taxonomy structure
## e.g., 0 [birds] > 21 [woodpeckers] > 124 [red-naped sapsucker] > 555 [red-naped sapsucker]
## e.g., 0 [birds] > 9 [caracaras and falcons] > 176 [peregrine falcon] > 663 [peregrine falcon (immature)]
## e.g., 0 [birds] > 9 [caracaras and falcons] > 176 [peregrine falcon] > 364 [peregrine falcon (adult)]

image_class_labels.txt: Success!
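
As a rough illustration of the class-balance inspection mentioned in the cell comments, the sketch below counts images per class on a small made-up table. The `labels` frame is a hypothetical stand-in shaped like `img_class_labels_srs` (image ID index, class ID in column `1`), and the max/min ratio is only one possible imbalance measure.

```python
import pandas as pd

# Hypothetical stand-in: image_id index, class_id (still a string) in column 1.
labels = pd.DataFrame(
    {1: ["555", "663", "663", "364", "555", "663"]},
    index=["a", "b", "c", "d", "e", "f"],
)

# Count images per class, most common first.
class_counts = labels[1].astype(int).value_counts()

# Ratio of the most common to the least common class as a crude imbalance measure.
imbalance = class_counts.max() / class_counts.min()

print(class_counts)
print(f"imbalance ratio: {imbalance:.1f}")
```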


<a id='classes'></a>

##### ------- List of class names (classes.txt) ------
```
The list of class names (bird species) is contained in the file classes.txt, with each line corresponding to one class:

<class_id> <class_name>
--------------------------------------------
```

In [8]:
classes_srs = get_table("classes.txt", 1)

classes.txt: Success!


<a id='bounding_boxes'></a>

##### ------- Bounding boxes (bounding_boxes.txt) ------

```
Each image contains a single bounding box label.  Bounding box labels are contained in the file bounding_boxes.txt, with each line corresponding to one image:

<image_id> <x> <y> <width> <height>

where <image_id> corresponds to the ID in images.txt, and <x>, <y>, <width>, and <height> are all measured in pixels
------------------------------------------
```

In [9]:
b_boxes_df = get_table("bounding_boxes.txt", 4)

bounding_boxes.txt: Success!
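
Many detection tools expect corner coordinates rather than `<x> <y> <width> <height>`. The sketch below converts between the two on made-up rows. It assumes `(x, y)` is the top-left corner of the box, which the README excerpt above does not state, and the column names are illustrative.

```python
import pandas as pd

# Hypothetical rows in the <x> <y> <width> <height> layout, indexed by image_id.
boxes = pd.DataFrame(
    {1: ["10", "0"], 2: ["20", "5"], 3: ["100", "50"], 4: ["40", "60"]},
    index=pd.Index(["img_a", "img_b"], name=0),
).astype(int)
boxes.columns = ["x", "y", "width", "height"]

# Assumption: (x, y) is the top-left corner, so the opposite corner is an offset away.
boxes["x_max"] = boxes["x"] + boxes["width"]
boxes["y_max"] = boxes["y"] + boxes["height"]

print(boxes)
```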


<a id='part_locations'></a>

##### ------- Part locations (parts/part_locs.txt) ------
```
The set of all ground truth part locations is contained in the file parts/part_locs.txt, with each line corresponding to the annotation of a particular part in a particular image:

<image_id> <part_id> <x> <y> <visible>

where <image_id> and <part_id> correspond to the IDs in images.txt and parts/parts.txt, respectively.  <x> and <y> denote the pixel location of the center of the part.  <visible> is 0 if the part is not visible in the image and 1 otherwise.
----------------------------------------------------------
```

In [10]:
parts_locs_df = get_table("parts/part_locs.txt", 3, [0,1])
## Part coordinates are in original-image pixels; they will need rescaling once the images are resized

Your index got split ...MultiIndex FTW!
parts/part_locs.txt: Success!
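
One way to rescale part coordinates after resizing is to join each part with its image's original size and multiply by the resize ratio. The sketch below shows this on made-up data. The frames, column names, and the 224x224 target are all assumptions for illustration, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical part locations: (image_id, part_id) index, then x, y, visible.
parts = pd.DataFrame(
    {"x": [100, 300], "y": [50, 200], "visible": [1, 1]},
    index=pd.MultiIndex.from_tuples(
        [("img_a", 0), ("img_a", 1)], names=["image_id", "part_id"]
    ),
)

# Hypothetical image sizes: image_id index, then width and height in pixels.
sizes = pd.DataFrame(
    {"width": [400], "height": [200]},
    index=pd.Index(["img_a"], name="image_id"),
)

TARGET_W, TARGET_H = 224, 224  # assumed resize target

# Attach each image's original size to its parts, then rescale the coordinates.
scaled = parts.join(sizes, on="image_id")
scaled["x_resized"] = scaled["x"] * TARGET_W / scaled["width"]
scaled["y_resized"] = scaled["y"] * TARGET_H / scaled["height"]

print(scaled[["x_resized", "y_resized"]])
```

This treats the resize as a plain stretch to the target size. A resize that preserves aspect ratio or crops would need a different mapping.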


<a id='parts'></a>

##### ------- List of part names (parts/parts.txt) ------
```
The list of all part names is contained in the file parts/parts.txt, with each line corresponding to one part:

<part_id> <part_name>
------------------------------------------
```

In [11]:
parts_srs = get_table("parts/parts.txt", 1)

parts/parts.txt: Success!


<a id='sizes'></a>

##### ------- Image sizes (sizes.txt) ------
```
The size of each image in pixels:

<image_id> <width> <height>

where <image_id> corresponds to the ID in images.txt, and <width> and <height> correspond to the width and height of the image in pixels.
------------------------------------------------------
```

In [12]:
sizes_df = get_table("sizes.txt", 2)

sizes.txt: Success!


<a id='photographers'></a>

##### ------- Image photographers (photographers.txt) ------
```
The photographer for each image:

<image_id> <photographer_name>

where <image_id> corresponds to the ID in images.txt, and <photographer_name> corresponds to the name of the photographer that took the photo. Please
be considerate and display the photographer's name when displaying their image.
------------------------------------------------------
```

<a id='test_train_split'></a>

##### ------- Train/test split (train_test_split.txt) ------
```
The suggested train/test split is contained in the file train_test_split.txt, with each line corresponding to one image:

<image_id> <is_training_image>

where <image_id> corresponds to the ID in images.txt, and a value of 1 or 0 for <is_training_image> denotes that the file is in the training or test set, respectively.
------------------------------------------------------
```

In [13]:
train_test_split_srs = get_table("train_test_split.txt", 1)

train_test_split.txt: Success!
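
The `<is_training_image>` flag can be turned into two lists of image IDs. The sketch below does so on a hypothetical table shaped like `train_test_split_srs` (image ID index, flag stored as a string in column `1`).

```python
import pandas as pd

# Hypothetical stand-in: image_id index, is_training_image flag in column 1.
split = pd.DataFrame(
    {1: ["1", "0", "1", "1"]},
    index=pd.Index(["a", "b", "c", "d"], name=0),
)

# 1 marks a training image and 0 a test image.
is_train = split[1].astype(int).astype(bool).to_numpy()

train_ids = split.index[is_train]
test_ids = split.index[~is_train]

print(list(train_ids), list(test_ids))
```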


In [14]:
## Given a category number, recursively list its ancestors, stopping when the parent category is 0

def traverse_taxonomy(category):
    parent_category = class_hierarchy_dict.get(category, False)
    if parent_category:
        return f"{traverse_taxonomy(parent_category)} < {parent_category}"
    else:
        return parent_category

In [15]:
print(traverse_taxonomy(663))

0 < 9 < 176
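
The output above lists the ancestors but not the starting class itself. A possible iterative variant that includes it is sketched below. The `hierarchy` dict is a toy example built from the chains noted in the comments of cell 7, the function name is hypothetical, and the loop assumes the hierarchy is a tree with no cycles.

```python
# Toy child -> parent mapping taken from the example chains in the cell 7 comments.
hierarchy = {663: 176, 364: 176, 176: 9, 9: 0}

def taxonomy_path(category, parents):
    """Return the chain from the root down to `category`, inclusive."""
    path = [category]
    # Walk upward until reaching a class with no recorded parent (the root).
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path[::-1]

print(" < ".join(map(str, taxonomy_path(663, hierarchy))))  # 0 < 9 < 176 < 663
```

Returning a list rather than a formatted string keeps the path usable for lookups, for example mapping each ID to its name in `classes.txt`.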


In [16]:
all(class_hierarchy_srs.index.astype(int)[:5] == class_hierarchy_srs.index[:5])

True

In [17]:
len(class_hierarchy_dict)

1010