## Difficulties with the photos

* Sand on the face (e.g. ID_0FEWYAAG)
* Blur
* Grass in the foreground
* Varying colors of the dark areas (orange to black)
* Exposure (especially overexposure and strong underexposure)
* Motion-blurred turtles
* Light-colored threads
* Flaking skin
* Leg with a textured pattern in the background
* Date stamp in the image
* Weakly developed pattern
* Whole turtle in the image (ID_0RNNI62X)
* Head takes up only a small part of the image (ID_0TN13JTG)
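
Two of the issues listed above, exposure problems and blur, can be screened for with simple image statistics. The following is a minimal sketch, not part of the original analysis: it assumes each photo has already been loaded as a 2-D 8-bit grayscale NumPy array, and the function names and thresholds (`dark`, `bright`, `blur`) are hypothetical values that would need tuning on the real photos.

```python
import numpy as np

def exposure_and_sharpness(gray):
    """Return (mean brightness in [0, 1], Laplacian variance).

    `gray` is assumed to be a 2-D uint8 array with values in 0..255.
    """
    img = np.asarray(gray, dtype=np.float64) / 255.0
    # 4-neighbour Laplacian on the interior pixels; a low variance suggests blur.
    lap = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
           - 4.0 * img[1:-1, 1:-1])
    return float(img.mean()), float(lap.var())

def flag_photo(gray, dark=0.2, bright=0.8, blur=1e-3):
    """Flag a photo using hypothetical thresholds (assumptions, not tuned)."""
    brightness, sharpness = exposure_and_sharpness(gray)
    flags = []
    if brightness < dark:
        flags.append('underexposed')
    if brightness > bright:
        flags.append('overexposed')
    if sharpness < blur:
        flags.append('blurry')
    return flags

# Synthetic stand-ins for real photos.
rng = np.random.default_rng(0)
noisy = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)  # mid-bright, high detail
flat_dark = np.full((32, 32), 10, dtype=np.uint8)             # uniform and very dark
flat_bright = np.full((32, 32), 250, dtype=np.uint8)          # uniform and very bright
```

Calling `flag_photo` on every image and counting the flags would give a rough idea of how many photos are affected, but it cannot detect the other issues in the list (sand, grass, date stamps).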

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df_train = pd.read_csv('../data/train.csv')
df_test = pd.read_csv('../data/test.csv')
df_extra = pd.read_csv('../data/extra_images.csv')

## Head

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_extra.head()

## Info

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
df_extra.info()

## Image location

In [None]:
# Convert image_location strings to lowercase so that differently cased
# labels are counted as the same category. `.str.lower()` also tolerates NaN.
for df in [df_train, df_test]:
    df['image_location'] = df['image_location'].str.lower()

In [None]:
df_train['image_location'].value_counts()

In [None]:
df_test['image_location'].value_counts()

---

The image orientation (`image_location`) is fairly balanced in both the training and the test set.

---
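
To make the balance claim easier to check, the relative frequencies of both sets can be placed side by side. This is a small sketch under assumptions: the helper name `location_shares` is hypothetical, and the toy values (`left`, `right`, `top`) merely stand in for whatever labels `image_location` actually contains.

```python
import pandas as pd

def location_shares(train, test, column='image_location'):
    """Side-by-side relative frequencies of `column` in two DataFrames."""
    shares = pd.concat(
        {
            'train': train[column].value_counts(normalize=True),
            'test': test[column].value_counts(normalize=True),
        },
        axis=1,
    ).fillna(0.0)  # a label missing from one set gets a share of 0
    shares['abs_diff'] = (shares['train'] - shares['test']).abs()
    return shares

# Toy data standing in for df_train and df_test.
toy_train = pd.DataFrame({'image_location': ['left', 'left', 'right', 'top']})
toy_test = pd.DataFrame({'image_location': ['left', 'right', 'right', 'top']})
shares = location_shares(toy_train, toy_test)
print(shares)
```

With the real data, `location_shares(df_train, df_test)` would show how far the per-label shares differ between the two sets.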


## Unique turtles

In [None]:
df_train['turtle_id'].nunique()

In [None]:
df_extra['turtle_id'].nunique()

In [None]:
df_full = pd.concat([df_train, df_extra], axis=0, ignore_index=True)

In [None]:
df_full.info()

In [None]:
df_full

In [None]:
df_full['turtle_id'].nunique()

In [None]:
more_turtles = df_full['turtle_id'].nunique() - df_train['turtle_id'].nunique()
more_turtles

---

The training set contains 100 turtles. The extra-images set adds another 2165 turtles, for a total of 2265 distinct turtles.

---
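
The difference of two `nunique()` values gives the number of additional turtles, but not whether any IDs occur in both sets. A possible way to separate shared from new IDs is plain set arithmetic. The helper name `turtle_id_overlap` and the toy IDs below are assumptions made for illustration.

```python
import pandas as pd

def turtle_id_overlap(train_ids, extra_ids):
    """Count distinct IDs per group; missing values are ignored."""
    train_set = set(pd.Series(train_ids).dropna())
    extra_set = set(pd.Series(extra_ids).dropna())
    return {
        'train': len(train_set),
        'extra_only': len(extra_set - train_set),  # IDs seen only in the extra set
        'shared': len(train_set & extra_set),      # IDs present in both sets
        'total': len(train_set | extra_set),
    }

# Toy IDs standing in for df_train['turtle_id'] and df_extra['turtle_id'].
overlap = turtle_id_overlap(['t_a', 't_b', 't_a'], ['t_b', 't_c', 't_d'])
print(overlap)
```

On the real data the call would be `turtle_id_overlap(df_train['turtle_id'], df_extra['turtle_id'])`, where `extra_only` should match `more_turtles`.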

In [None]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts().
full_images_per_turtle = df_full['turtle_id'].value_counts()
print('The number of images per turtle in the train- and extra-set is:\n'
      f'Mean is {round(np.mean(full_images_per_turtle), 2)}.\n'
      f'Median is {int(np.median(full_images_per_turtle))}.\n'
      f'Maximum is {int(np.max(full_images_per_turtle))}.\n'
      f'Minimum is {int(np.min(full_images_per_turtle))}.')
sns.histplot(full_images_per_turtle)
plt.xlabel('Images per turtle in train- and extra-set')
plt.show()

In [None]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts().
train_images_per_turtle = df_train['turtle_id'].value_counts()
print('The number of images per turtle in the train-set is:\n'
      f'Mean is {round(np.mean(train_images_per_turtle), 2)}.\n'
      f'Median is {int(np.median(train_images_per_turtle))}.\n'
      f'Maximum is {int(np.max(train_images_per_turtle))}.\n'
      f'Minimum is {int(np.min(train_images_per_turtle))}.')
sns.histplot(train_images_per_turtle)
plt.xlabel('Images per turtle in train-set')
plt.show()

In [None]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts().
extra_images_per_turtle = df_extra['turtle_id'].value_counts()
print('The number of images per turtle in the extra-set is:\n'
      f'Mean is {round(np.mean(extra_images_per_turtle), 2)}.\n'
      f'Median is {int(np.median(extra_images_per_turtle))}.\n'
      f'Maximum is {int(np.max(extra_images_per_turtle))}.\n'
      f'Minimum is {int(np.min(extra_images_per_turtle))}.')
sns.histplot(extra_images_per_turtle)
plt.xlabel('Images per turtle in extra-set')
plt.show()
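
The three cells above repeat the same computation on different DataFrames. One option is to move the statistics into a small helper. This is a sketch only: the function name `images_per_turtle_stats` is hypothetical, and plotting is left out so the helper stays easy to test.

```python
import pandas as pd

def images_per_turtle_stats(df, id_column='turtle_id'):
    """Summary statistics of the number of images per turtle."""
    counts = df[id_column].value_counts()
    return {
        'mean': round(float(counts.mean()), 2),
        'median': int(counts.median()),
        'max': int(counts.max()),
        'min': int(counts.min()),
    }

# Toy data: turtle 'a' has 3 images, 'b' has 2, 'c' has 1.
toy = pd.DataFrame({'turtle_id': ['a', 'a', 'a', 'b', 'b', 'c']})
stats = images_per_turtle_stats(toy)
print(stats)
```

The same call could then be applied to `df_train`, `df_extra` and `df_full` in a loop instead of three near-identical cells.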