# Similarity of left and right profiles of sea turtles

This notebook analyzes the differences between the left and right profiles of sea turtles. We analyzed three species (loggerheads, greens and hawksbills) with the uniform conclusion that there is a significant similarity between opposite profiles in all three species. The main implication of this observation is that biologists should use both profiles for identifying individuals, and not only the same profile, as is the current practice.

We first load the required packages and functions.

In [None]:
import sys
sys.path.append('..')
import os
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
import torchvision.transforms as T
from scipy.stats import ttest_ind

from wildlife_datasets import datasets
from sides_matching import Prediction, Data_MegaDescriptor, Data_SIFT, Data_TORSOOI
from sides_matching import get_dataset, get_transform, get_box_plot_data
from sift_matching import Loader

We assume that the data have already been downloaded and the features extracted. If this is not the case, please first run [this notebook](compute_features.ipynb). The next code specifies the folders where the data are stored and where the results will be saved. The variable `data` states that we will run experiments on the datasets ZakynthosTurtles, AmvrakikosTurtles and ReunionTurtles, where the last dataset is analyzed separately for green and hawksbill turtles. As methods, we use MegaDescriptor, SIFT and TORSOOI codes; the TORSOOI codes are used only for ReunionTurtles.

In [None]:
root_datasets = '../data'
root_features = '../features'
root_figures = '../figures'
root_images = '../images'
data = [
    ('Zakynthos-Loggerheads MegaDescriptor', datasets.ZakynthosTurtles, 'MegaDescriptor', {}),
    ('Amvrakikos-Loggerheads MegaDescriptor', datasets.AmvrakikosTurtles, 'MegaDescriptor', {}),
    ('Reunion-Greens MegaDescriptor', datasets.ReunionTurtles, 'MegaDescriptor', {'species': 'Green'}),
    ('Reunion-Hawksbills MegaDescriptor', datasets.ReunionTurtles, 'MegaDescriptor', {'species': 'Hawksbill'}),
    ('Zakynthos-Loggerheads SIFT', datasets.ZakynthosTurtles, 'SIFT', {}),
    ('Amvrakikos-Loggerheads SIFT', datasets.AmvrakikosTurtles, 'SIFT', {}),
    ('Reunion-Greens SIFT', datasets.ReunionTurtles, 'SIFT', {'species': 'Green'}),
    ('Reunion-Hawksbills SIFT', datasets.ReunionTurtles, 'SIFT', {'species': 'Hawksbill'}),
    ('Reunion-Greens TORSOOI', datasets.ReunionTurtles, 'TORSOOI', {'species': 'Green'}),
    ('Reunion-Hawksbills TORSOOI', datasets.ReunionTurtles, 'TORSOOI', {'species': 'Hawksbill'}),
]

# Create the output folders if they do not exist yet
for root in [root_features, root_figures, root_images]:
    os.makedirs(root, exist_ok=True)
names = [x[0] for x in data]
names_methods = [name.split(' ')[-1] for name in names]
names_datasets = [name.split(' ')[0] for name in names]
data_index = pd.MultiIndex.from_arrays([names_methods, names_datasets], names=['Method', 'Dataset'])

The next code computes the scores and predictions for the query images. The scores are computed for pairs of images as follows:

- MegaDescriptor: the cosine similarity between the extracted features.
- SIFT: the negative distance between the 15 closest descriptors.
- TORSOOI: the number of matching edges from the TORSOOI codes.

The predictions are computed as the images with the highest similarity to the query image. We return not only the usual top-1 prediction but a sorted array of all scores. 
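
To illustrate the MegaDescriptor case, the following is a minimal sketch of scoring by cosine similarity and ranking the matches. It is not the implementation in `sides_matching`; the function names `cosine_similarity_matrix` and `rank_matches` and the toy features are assumptions made for this example only.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Return the matrix of pairwise cosine similarities between rows."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / norms
    return unit @ unit.T

def rank_matches(similarity):
    """For each image, return the indices of the other images sorted by decreasing similarity."""
    similarity = similarity.copy()
    # An image must not be matched to itself
    np.fill_diagonal(similarity, -np.inf)
    # Sort in descending order and drop the last column, which holds the image itself
    return np.argsort(-similarity, axis=1)[:, :-1]

# Toy features (hypothetical): images 0 and 1 point in similar directions, image 2 differs
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
similarity = cosine_similarity_matrix(features)
ranking = rank_matches(similarity)
print(ranking)
```

The first column of `ranking` corresponds to the top-1 prediction, while the remaining columns are what makes the top-k evaluation possible.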

In [None]:
predictions = {x: {} for x in [True, False]}
img_size_SIFT = 90
for grayscale in [True, False]:
    for name, dataset_class, metric, pars_split in data:
        # Load dataset
        root_dataset = os.path.join(root_datasets, dataset_class.__name__)
        df = get_dataset(dataset_class, root_dataset).df
        # Get split into database (empty) and query
        idx_database = []    
        idx_ignore = []    
        for key, value in pars_split.items():
            idx_ignore = idx_ignore + list(np.where(df[key] != value)[0])    
        idx_query = np.setdiff1d(np.arange(len(df)), idx_ignore)
        # Define score_computer for each method
        if metric == 'MegaDescriptor':
            path_features_query = os.path.join(root_features, f'features_{dataset_class.__name__}_flip={False}_grayscale={grayscale}.npy')
            path_features_database = os.path.join(root_features, f'features_{dataset_class.__name__}_flip={False}_grayscale={grayscale}.npy')
            score_computer = Data_MegaDescriptor(path_features_query, path_features_database)
        elif metric == 'SIFT':
            transform = get_transform(flip=False, grayscale=grayscale, img_size=img_size_SIFT, normalize=False)
            if transform is not None:
                transform = T.Compose([T.Lambda(lambda x: Image.fromarray(x)), *transform.transforms, T.Lambda(lambda x: np.array(x))])
            img_load = 'bbox' if 'bbox' in df.columns else 'full'
            image_loader = Loader(img_load=img_load, img_size=None, transform=transform, transform_name=f'{grayscale}+{img_size_SIFT}')
            path_features = os.path.join(root_features, f'features_{dataset_class.__name__}_flip={False}_grayscale={grayscale}_SIFT')        
            score_computer = Data_SIFT(root_dataset, path_features, df, image_loader=image_loader)        
        elif metric == 'TORSOOI':
            score_computer = Data_TORSOOI(df)
        else:
            raise Exception('Metric not known')
        # Compute scores and predictions based on closest scores
        idx_true, idx_pred, scores = score_computer.compute_scores(idx_query)
        prediction = Prediction(df, idx_true, idx_pred, scores, df['identity'].iloc[idx_query].nunique())
        predictions[grayscale][name] = prediction 

From the predictions, the accuracy is computed. The top-1 accuracy is the standard accuracy, which is the ratio of queries whose closest prediction is correct. The top-k accuracy counts a query as a success when at least one of the top-k sorted predictions is correct (the predicted identity is the same as the true identity).
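
The definition of top-k accuracy can be sketched as follows. This is only an illustration with made-up identities; the function `top_k_accuracy` is hypothetical and is not the `Prediction.compute_accuracy` method used in the next cell.

```python
def top_k_accuracy(true_identities, ranked_identities, k):
    """Fraction of queries whose true identity appears among the first k predictions."""
    hits = [
        true in list(ranked[:k])
        for true, ranked in zip(true_identities, ranked_identities)
    ]
    return sum(hits) / len(hits)

# Toy data (hypothetical): each query has predictions sorted by decreasing similarity
true_identities = ['t1', 't2', 't3', 't4']
ranked_identities = [
    ['t1', 't2', 't3'],  # correct at rank 1
    ['t1', 't2', 't3'],  # correct at rank 2
    ['t1', 't2', 't3'],  # correct at rank 3
    ['t1', 't2', 't3'],  # never correct
]
acc1 = top_k_accuracy(true_identities, ranked_identities, k=1)
acc3 = top_k_accuracy(true_identities, ranked_identities, k=3)
print(acc1, acc3)
```

By construction, the top-k accuracy cannot decrease as k grows.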

In [None]:
mods = ['full', 'same year', 'same orientation', 'different both', 'different year']
mods_text = ['all images', 'A: diff side, same year', 'B: same side, diff year', 'C: diff side, diff year', 'B+C: any side, diff year']
for grayscale in [True, False]:
    for _, prediction in predictions[grayscale].items():
        prediction.compute_accuracy(mods)

We now plot the accuracy for all methods and all datasets. The images are saved into the `root_figures` folder. As an example, one figure is plotted in this notebook as well.

In [None]:
for grayscale in [True, False]:
    for i_name, name in enumerate(names):
        prediction = predictions[grayscale][name]
        xs = range(1, 1+prediction.n_individuals)
        df_save = pd.DataFrame()
        df_save['k'] = xs
        plt.figure()
        for mod in mods:
            ys = [prediction.accuracy[mod][f'top {i}'] for i in xs]
            df_save[mod] = ys
            plt.plot(xs, ys)
        df_save.to_csv(os.path.join(root_figures, f'accuracy_{name}_{grayscale}.csv'), index=False)
        plt.axhline(1, color='black', linestyle='dotted')
        plt.xlim([1, 10])
        plt.ylim([0, 1.05])
        plt.xlabel('k')
        plt.ylabel('top k accuracy')
        plt.legend(mods_text)
        plt.title(f'{name}, grayscale = {grayscale}')
        plt.savefig(os.path.join(root_figures, f'accuracy_{name}_{grayscale}.png'), bbox_inches='tight')
        if i_name > 0 or grayscale:
            plt.close()

The previous figures may be visualized as a table. We show the top-5 accuracy for all methods and all datasets.

In [None]:
metric = 'top 5'
accuracy_top = {x: {mod: [] for mod in mods} for x in [True, False]}    
for grayscale in [True, False]:
    for name in names:
        prediction = predictions[grayscale][name]
        for mod in mods:
            accuracy_top[grayscale][mod].append(prediction.accuracy[mod][metric])
    df_save = pd.DataFrame(accuracy_top[grayscale], index=data_index)
    df_latex = df_save.to_latex(float_format='%.3f')
    print(f'Grayscale = {grayscale}')
    display(df_save)
    with open(os.path.join(root_figures, f'accuracy_{metric}_{grayscale}.txt'), 'w') as file:
        file.write(df_latex)

The next code draws boxplots of the similarities for the various settings mentioned in the paper. As in the previous case, all figures are saved into `root_figures` and only one is plotted here.
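
The settings are indexed by three booleans in the order (same identity, same orientation, same year). The sketch below shows one way such a nested split could be built. It is an assumption about the structure only and not the actual `split_scores` method; the function `split_pair_scores` and the toy records are hypothetical.

```python
from collections import defaultdict

def split_pair_scores(pairs):
    """Group pair scores by (same identity, same orientation, same year)."""
    split = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for a, b, score in pairs:
        same_identity = a['identity'] == b['identity']
        same_orientation = a['orientation'] == b['orientation']
        same_year = a['year'] == b['year']
        split[same_identity][same_orientation][same_year].append(score)
    return split

# Toy image records (hypothetical)
img1 = {'identity': 't1', 'orientation': 'left', 'year': 2020}
img2 = {'identity': 't1', 'orientation': 'right', 'year': 2020}
img3 = {'identity': 't1', 'orientation': 'left', 'year': 2021}
img4 = {'identity': 't2', 'orientation': 'left', 'year': 2020}

pairs = [(img1, img2, 0.8), (img1, img3, 0.9), (img2, img3, 0.7), (img1, img4, 0.2)]
split = split_pair_scores(pairs)

setting_a = split[True][False][True]   # (A): same ind, diff side, same year
setting_b = split[True][True][False]   # (B): same ind, same side, diff year
setting_c = split[True][False][False]  # (C): same ind, diff side, diff year
print(setting_a, setting_b, setting_c)
```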

In [None]:
p_values = {x: {name: {} for name in names} for x in [True, False]}
boxplot_data = {x: {name: {} for name in names} for x in [True, False]}
for grayscale in [True, False]:
    for i_name, name in enumerate(names):
        similarity_split = predictions[grayscale][name].split_scores()
        similarity_boxplot = [
            # identity, orientation, year
            similarity_split[True][False][True],
            similarity_split[True][True][False],
            similarity_split[True][False][False],
            similarity_split[False][True][True] + similarity_split[False][True][False],
            similarity_split[False][False][True] + similarity_split[False][False][False],
        ]
        names_boxplot = [
            '(A): same ind, diff side, same year',
            '(B): same ind, same side, diff year',
            '(C): same ind, diff side, diff year',
            '(D): diff ind, same side',
            '(E): diff ind, diff side',
        ]

        for i in range(len(similarity_boxplot)):
            similarity_boxplot[i] = np.array(similarity_boxplot[i])[np.isfinite(similarity_boxplot[i])]
        
        for i, j, alt, comparison in zip([0,0,1,2,3], [1,2,2,3,4], ['two-sided', 'greater', 'greater', 'greater', 'two-sided'], ['A!=B', 'A>C', 'B>C', 'C>D', 'D!=E']):
            _, p_value = ttest_ind(similarity_boxplot[i], similarity_boxplot[j], alternative=alt)
            p_values[grayscale][name][comparison] = np.round(p_value, 3)

        plt.figure()
        fig = plt.boxplot(similarity_boxplot)
        plt.xticks(range(1, len(names_boxplot)+1), names_boxplot, rotation=25)
        plt.ylabel('similarity')
        plt.title(name)
        plt.savefig(os.path.join(root_figures, f'similarity_{name}_{grayscale}.png'), bbox_inches='tight')
        if i_name > 0 or grayscale:
            plt.close()
        boxplot_data[grayscale][name] = get_box_plot_data(fig)

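We show the p-values of the t-tests comparing the individual settings. The null hypothesis is that the two compared settings have the same mean similarity, while the alternative is given by the column name: for example, `A>C` tests whether setting A has a higher mean similarity than setting C.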
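
The role of the `alternative` argument of `scipy.stats.ttest_ind` can be seen in the following small sketch. The two samples are synthetic and only serve as an illustration; they are not results from the turtle datasets.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic similarity scores (hypothetical): the first group has a clearly higher mean
rng = np.random.default_rng(0)
same_individual = rng.normal(loc=0.6, scale=0.1, size=200)
different_individual = rng.normal(loc=0.3, scale=0.1, size=200)

# One-sided test: the alternative is that the first sample has the greater mean
_, p_greater = ttest_ind(same_individual, different_individual, alternative='greater')
# Two-sided test: the alternative is that the means differ in either direction
_, p_two_sided = ttest_ind(same_individual, different_individual, alternative='two-sided')
# Swapping the samples in the one-sided test gives a p-value close to one
_, p_reversed = ttest_ind(different_individual, same_individual, alternative='greater')

print(p_greater, p_two_sided, p_reversed)
```

A small p-value in a `greater` column therefore supports the stated ordering, whereas a small p-value in a two-sided column only indicates that the means differ.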

In [None]:
for grayscale in [True, False]:
    df_save = pd.DataFrame(p_values[grayscale]).T.set_index(data_index)
    df_latex = df_save.to_latex(float_format='%.3f')
    print(f'Grayscale = {grayscale}')
    display(df_save)
    with open(os.path.join(root_figures, f'pvalues_{metric}_{grayscale}.txt'), 'w') as file:
        file.write(df_latex)

In [None]:
flags = [
    'col1, ylabel={Zakynthos-Loggerheads}, title={MegaDescriptor}, xticklabels={}',
    'col2, ylabel={}, title={SIFT}, xticklabels={}, yticklabels={}',
    'col3, group/empty plot, title={TORSOOI}',
    'col1, ylabel={Amvrakikos-Loggerheads}, title={}, xticklabels={}',
    'col2, ylabel={}, title={}, xticklabels={}, yticklabels={}',
    'col3, group/empty plot, title={}',
    'col1, ylabel={Reunion-Greens}, title={}, xticklabels={}',
    'col2, ylabel={}, title={}, xticklabels={}, yticklabels={}',
    'col3, ylabel={}, title={}, xticklabels={}, yticklabels={}',
    'col1, xlabel={similarity}, ylabel={Reunion-Hawksbills}, title={}',
    'col2, xlabel={similarity}, ylabel={}, title={}, yticklabels={}',
    'col3, xlabel={similarity}, ylabel={}, title={}, yticklabels={}',
]
names_order = [
    'Zakynthos-Loggerheads MegaDescriptor',
    'Zakynthos-Loggerheads SIFT',
    '',
    'Amvrakikos-Loggerheads MegaDescriptor',
    'Amvrakikos-Loggerheads SIFT',
    '',
    'Reunion-Greens MegaDescriptor',
    'Reunion-Greens SIFT',
    'Reunion-Greens TORSOOI',
    'Reunion-Hawksbills MegaDescriptor',
    'Reunion-Hawksbills SIFT',
    'Reunion-Hawksbills TORSOOI'
]

for grayscale in [True, False]:
    for name, flag in zip(names_order, flags):
        print(f'\\nextgroupplot[{flag}]')
        if name != '':
            bp_data = boxplot_data[grayscale][name]
            for _, row in bp_data[::-1].iterrows():
                l_w = np.round(row["lower_whisker"],2)
                l_q = np.round(row["lower_quartile"],2)
                median = np.round(row["median"],2)
                u_q = np.round(row["upper_quartile"],2)
                u_w = np.round(row["upper_whisker"],2)
                print(f'\\addboxplot{{bp}}{{{median}}}{{{u_q}}}{{{l_q}}}{{{u_w}}}{{{l_w}}};')
    print('\n\n')

In [None]:
for grayscale in [True, False]:
    for name, dataset_class, _, _ in data[:4]:
        similarity_split = predictions[grayscale][name].split_scores(save_idx=True)
        
        similarity_boxplot = [
            # identity, orientation, year
            similarity_split[True][False][True],
            similarity_split[True][False][False],
        ]
        names_boxplot = [
            '(A)',
            '(C)',
        ]

        root_dataset = os.path.join(root_datasets, dataset_class.__name__)
        prediction = predictions[grayscale][name]
        transform = get_transform(flip=False, grayscale=grayscale, normalize=False)
        dataset = dataset_class(root_dataset, img_load='auto', transform=transform)
        
        for i in range(len(similarity_boxplot)):
            for score_selection in ['top', 'bottom']:
                sim = similarity_boxplot[i]
                sim = sorted(sim, key=lambda x: (x[0]))
                if score_selection == 'top':
                    sim = sim[::-1]
                idx1 = [sim[0][1], sim[0][2]] if prediction.orientation[sim[0][1]] == 'right' else [sim[0][2], sim[0][1]]
                idx2 = [sim[2][1], sim[2][2]] if prediction.orientation[sim[2][1]] == 'right' else [sim[2][2], sim[2][1]]
                idx = idx1 + idx2

                header_cols = [f'{names_boxplot[i]}, {score_selection}', '']
                for j1, j2 in enumerate(idx):
                    img = dataset[j2]
                    new_height = int(200)
                    new_width  = int(new_height * img.size[0] / img.size[1])
                    img = img.resize((new_width, new_height))
                    img.save(os.path.join(root_images, f'sim_{name}_{score_selection}_{grayscale}_{i}_{j1}.png'))

In [None]:
for grayscale in [True, False]:
    transform = get_transform(flip=False, grayscale=grayscale, normalize=False)
    for name, dataset_class, _, _ in data[:4]:
        root_dataset = os.path.join(root_datasets, dataset_class.__name__)
        prediction = predictions[grayscale][name]
        dataset = dataset_class(root_dataset, img_load='auto', transform=transform)
        if 'year' not in dataset.df.columns:
            dataset.df['year'] = pd.to_datetime(dataset.df['date']).apply(lambda x: x.year)

        for i, identity in enumerate(dataset.df.identity[prediction.true].unique()[:1]):
            idx = list(np.where(dataset.df['identity'] == identity)[0])
            idx = dataset.df.iloc[idx].sort_values(['year', 'orientation'])[::-1].index
            idx = dataset.df.index.get_indexer(idx)
            dataset.plot_grid(idx=idx, n_rows=1, n_cols=4, transform=transform);            
            plt.savefig(os.path.join(root_figures, f'grid_{name}_{grayscale}_{i}.png'), bbox_inches='tight')
            plt.close()
            for j_save, j in enumerate(idx):
                img = dataset[j]
                new_height = int(200)
                new_width  = int(new_height * img.size[0] / img.size[1])
                img = img.resize((new_width, new_height))
                img.save(os.path.join(root_images, f'img_{name}_{grayscale}_{i}_{j_save}.png'))