## Creating the Dataset

---

### Create Directory

To label the images by species, we create one folder per species inside each of the `train`, `test` and `valid` directories. As this code runs on Google Colab, the data is downloaded straight to Google Drive, which saves local resources.

As there is a lot of data that needs to be gathered, rather than doing it by hand, we can automate the process using code. Credits to [Jeremy Howard's](https://www.kaggle.com/code/jhoward/is-it-a-bird-creating-a-model-from-your-own-data) tutorial for this method of gathering data. To begin, we need to install the `duckduckgo_search` library using pip:

In [None]:
!pip install duckduckgo_search

In [None]:
bird_classes = ['Blackbird', 'Blackcap', 'Blue Tit', 'Bullfinch', 'Carrion Crow', 'Chaffinch', 'Chiffchaff', 'Coal Tit', 'Collared Dove', 'Common Linnet', 'Common Sandpiper', 'Common Whitethroat', 'Crossbill', 'Dotterel', 'Dunnock', 'Eurasian Jay', 'Eurasian Magpie', 'Eurasian Teal', 'Feral Pigeon', 'Fieldfare', 'Firecrest', 'Garden Warbler', 'Goldcrest', 'Golden Plover', 'Goldfinch', 'Great Spotted Woodpecker', 'Great Tit', 'Green Woodpecker', 'Greenfinch', 'Grey Wagtail', 'Hawfinch', 'House Martin', 'House Sparrow', 'Jackdaw', 'Lesser Redpoll', 'Lesser Spotted Woodpecker', 'Lesser Whitethroat', 'Long-tailed Tit', 'Meadow Pipit', 'Mealy Redpoll', 'Mistle Thrush', 'Nightingale', 'Nuthatch', "Pallas's Warbler", "Pied Flycatcher", 'Pied Wagtail', 'Redstart', 'Redwing', 'Reed Bunting', 'Robin', 'Sand Martin', 'Sedge Warbler', 'Siskin', 'Skylark', 'Song Thrush', 'Spotted Flycatcher', 'Starling', 'Stock Dove', 'Stonechat', 'Swallow', 'Swift', 'Treecreeper', 'Waxwing', 'Wheatear', 'Whinchat', 'Willow Warbler', 'Wood Warbler', 'Woodcock', 'Woodpigeon', 'Wren', 'Yellow-browed warbler']
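raw_bird_classes = [name.replace(' ', '') for name in bird_classes] # species names without spaces, used in file names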

In [None]:
import os

base_dir = '/content/drive/MyDrive/BirdClassifierDataset'

# one folder per species inside each split;
# makedirs creates missing parent folders and exist_ok lets the cell be re-run safely
for split in ('train', 'test', 'valid'):
  for species in bird_classes:
    os.makedirs(os.path.join(base_dir, split, species), exist_ok=True)

### Download Images
For this process, we use the `fastcore.all`, `duckduckgo_search`, `fastdownload` and `time` libraries. Two lists are used: `bird_classes` and `raw_bird_classes`, where the latter omits the spaces from each species name so that it can be used in file names.

The `search_images()` function is defined, which scrapes the web, using the `duckduckgo_search` library, for a provided search term and returns a list of urls for the images.

Then each bird species in the list is used as a search term, and up to 210 images per species are downloaded to Google Drive with the `download_url()` function: the first 160 go to `train`, the next 30 to `valid` and the last 20 to `test`. The download sits in a `try`/`except`/`else` block because some images cannot be downloaded properly. This avoids restarting the whole process after every failure, at the cost of an uneven number of images per class, which is acceptable here.
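The index-based split used in the download loop can be sketched on its own. This is a minimal illustration, assuming the 160/30/20 thresholds from the download cell; `split_for_index` is a hypothetical helper, not part of the original notebook.

```python
def split_for_index(j, n_train=160, n_valid=30, n_test=20):
    """Return the split name for the j-th downloaded image of a species."""
    total = n_train + n_valid + n_test
    if not 0 <= j < total:
        raise ValueError(f'index {j} is outside 0..{total - 1}')
    if j < n_train:
        return 'train'
    if j < n_train + n_valid:
        return 'valid'
    return 'test'

# Count how many of the 210 indices land in each split.
counts = {}
for j in range(210):
    name = split_for_index(j)
    counts[name] = counts.get(name, 0) + 1
print(counts)
```

Keeping the thresholds as parameters makes it easy to try a different ratio without touching the download loop.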

It is important to include a `sleep(1)` call between downloads so that requests are spaced out and the server is not overloaded.
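The error-handling pattern described above can be shown in isolation. The sketch below is an assumption-laden illustration: `download_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real downloader such as `download_url()` so that the example runs without network access.

```python
from time import sleep

def download_all(urls, fetch, delay=1.0):
    """Try every URL once and return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for url in urls:
        try:
            fetch(url)                      # may raise on a bad URL
        except Exception as err:
            failed.append((url, repr(err))) # record the failure and carry on
        else:
            succeeded.append(url)           # only reached when fetch() did not raise
        sleep(delay)                        # space out the requests
    return succeeded, failed

def fake_fetch(url):
    # Hypothetical stand-in for a real download function.
    if 'broken' in url:
        raise IOError(f'cannot fetch {url}')

ok, bad = download_all(['a.jpg', 'broken.jpg', 'c.jpg'], fake_fetch, delay=0)
print(len(ok), 'downloaded,', len(bad), 'failed')
```

Returning the failed URLs makes it possible to retry only those later instead of restarting the whole run.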



In [None]:
from duckduckgo_search import ddg_images
from fastcore.all import *
from fastdownload import download_url
from time import sleep
from PIL import Image, ImageFile
import os

base_dir = '/content/drive/MyDrive/BirdClassifierDataset'

# returns a list of image URLs for a search term
# (ddg_images is the helper this notebook was written against; newer releases of
# duckduckgo_search may expose image search through a different interface)
def search_images(term, max_images=200):
    return L(ddg_images(term, max_results=max_images)).itemgot('image')

for i, (species, raw_species) in enumerate(zip(bird_classes, raw_bird_classes)):
  urls = search_images(species + ' bird', max_images=210)
  print(species)
  print(f'i = {i}')

  downloaded = 0
  # up to 210 images per species: 160 train, 30 valid, 20 test
  for j, url in enumerate(urls[:210]):
    print(j)
    if j < 160: split = 'train'
    elif j < 190: split = 'valid'
    else: split = 'test'
    path = f'{base_dir}/{split}/{species}/{raw_species}{j}.jpg'
    try:
      download_url(url, path, show_progress=False) # downloads the image to the path
    except Exception:
      # skip images that fail to download instead of aborting the whole run
      print(f'Error downloading: {raw_species}{j}.jpg')
    else:
      downloaded += 1

    sleep(1) # pause between requests to avoid overloading the server

  print(f'{species}: {downloaded} images downloaded')

Sometimes the downloaded images are not the ones you intended. At this step, it is important to go through the dataset and check that the images show the correct species.

The images may also be corrupted or in the wrong format, so we now verify every image. Otherwise the classifier will throw an error when it tries to read them.

Making sure all the folders contain a suitable amount of images:

In [None]:
import os

def check_folder_filenum(parent_dir):
  folder_list = os.listdir(parent_dir)
  empty_folders = []
  for folder in folder_list:
    folder_path = os.path.join(parent_dir, folder)
    folder_contents = os.listdir(folder_path)
    if (len(folder_contents)) == 0:
      print(f"{folder} IS EMPTY")
      empty_folders.append(folder)
    else:
        print(f"{folder} has {len(folder_contents)} files")

  return empty_folders

empty_folders = []
for split in ("train", "test", "valid"):
  print(f"Checking {split} directory: \n")

  empty_folders.append(check_folder_filenum(f'/content/drive/MyDrive/BirdClassifierDataset/{split}'))

print("Empty train folders:\n")
print(empty_folders[0])

print("\n\nEmpty test folders:\n")
print(empty_folders[1])

print("\n\nEmpty valid folders:\n")
print(empty_folders[2])

Verifying the files (Link to code source: https://stackoverflow.com/a/70363370):

In [None]:
import os
import cv2
import imghdr # note: deprecated since Python 3.11 and removed in 3.13

def check_images(parent_dir, ext_list):
    bad_images = []

    folder_list = os.listdir(parent_dir)
    for folder in folder_list:
        folder_path = os.path.join(parent_dir, folder)
        if not os.path.isdir(folder_path): # parent_dir should only hold class folders
            print('*** WARNING *** you have files in ', parent_dir, ' it should only contain sub directories')
            continue
        print('processing class directory ', folder)
        for f in os.listdir(folder_path):
            f_path = os.path.join(folder_path, f)
            if not os.path.isfile(f_path):
                print('*** fatal error, you have a sub directory ', f, ' in class directory ', folder)
                continue
            f_type = imghdr.what(f_path) # image type detected from the file contents
            if f_type not in ext_list: # unrecognised or unwanted format
                bad_images.append(f_path)
                continue
            img = cv2.imread(f_path) # returns None if the file cannot be decoded
            if img is None:
                print('file ', f_path, ' is not a valid image file')
                bad_images.append(f_path)
    return bad_images

corrupted_files = []
good_exts = ['jpg', 'png', 'jpeg', 'gif', 'bmp'] # list of acceptable extensions

for split in ("train", "test", "valid"):
  print(f'Checking {split} dataset: ')
  source_dir = f'/content/drive/MyDrive/BirdClassifierDataset/{split}' # changes for each folder
  bad_file_list = check_images(source_dir, good_exts)
  if len(bad_file_list) != 0:
    print(f'{len(bad_file_list)} improper files found')
  else:
    print(' no improper image files were found')
  corrupted_files.append(bad_file_list)

# removing the images

for bad_file_list in corrupted_files:
  for f in bad_file_list:
    try:
      print(f'removing {f}')
      os.remove(f)
    except OSError:
      print(f'Error removing {f}')
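The `imghdr` module used above was deprecated in Python 3.11 and removed in 3.13. As a rough standard-library alternative, the file type can be guessed from the first few bytes of the file. The sketch below is not the method used in this notebook: `SIGNATURES` and `sniff_image_type` are hypothetical names, and the check only inspects the header, so a truncated image can still pass.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of the formats accepted in this notebook.
SIGNATURES = {
    'jpeg': (b'\xff\xd8\xff',),
    'png': (b'\x89PNG\r\n\x1a\n',),
    'gif': (b'GIF87a', b'GIF89a'),
    'bmp': (b'BM',),
}

def sniff_image_type(path):
    """Return 'jpeg', 'png', 'gif', 'bmp', or None based on the file header."""
    with open(path, 'rb') as fh:
        header = fh.read(8)
    for kind, prefixes in SIGNATURES.items():
        if header.startswith(prefixes):  # startswith accepts a tuple of prefixes
            return kind
    return None

# Demonstration on two throwaway files: a PNG header and an HTML error page.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.png')
    bad = os.path.join(tmp, 'bad.jpg')
    with open(good, 'wb') as fh:
        fh.write(b'\x89PNG\r\n\x1a\n' + b'\x00' * 16)
    with open(bad, 'wb') as fh:
        fh.write(b'<html>not an image</html>')
    print(sniff_image_type(good), sniff_image_type(bad))
```

A header check like this is cheap, so it can run before the slower decode test with `cv2.imread()`.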

## Creating the Model

In [None]:
from keras import layers, optimizers, metrics
from keras.models import Sequential, load_model
from keras.utils import image_dataset_from_directory, to_categorical

Here we load the pre-trained InceptionV3 model with iNaturalist 2017 weights:

In [None]:
import tensorflow_hub as hub

base_model = hub.KerasLayer("https://tfhub.dev/google/inaturalist/inception_v3/feature_vector/5",
               trainable=True, arguments=dict(batch_norm_momentum=0.997))

Freezing the base model's layers so that their weights are not updated during training:

In [None]:
base_model.trainable = False

For a limited dataset, we can introduce more variety in the data by including a data augmentation model that randomly flips and rotates the images:

In [None]:
data_augmentation = Sequential([
  layers.RandomFlip('horizontal'),
  layers.RandomRotation(0.2),
])

Next we will have to put the model together, while adding the classification layers:


In [None]:
model_incep = Sequential()
model_incep.add(data_augmentation)

model_incep.add(layers.Rescaling(1./255))

model_incep.add(base_model)
model_incep.add(layers.Flatten())

model_incep.add(layers.Dropout(0.2))
model_incep.add(layers.BatchNormalization())
model_incep.add(layers.Dense(71, activation = 'softmax'))

In [None]:
model_incep.summary()

## Training the Model

In [None]:
traindata = image_dataset_from_directory(
                                        directory= '/content/drive/MyDrive/SmallBirdsDataset/train',
                                        labels= 'inferred',
                                        label_mode= 'categorical',
                                        image_size=(224, 224),
                                        shuffle= True,
                                        ) 
validdata = image_dataset_from_directory(
                                        directory= '/content/drive/MyDrive/SmallBirdsDataset/valid',
                                        labels= 'inferred', 
                                        label_mode= 'categorical',
                                        shuffle= True,
                                        image_size=(224, 224),
                                        )

testdata = image_dataset_from_directory(
                                        directory= '/content/drive/MyDrive/SmallBirdsDataset/test',
                                        labels= 'inferred', 
                                        label_mode= 'categorical',
                                        shuffle= True,
                                        image_size=(224, 224),
                                        )

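Here we load the dataset.

`keras.utils.image_dataset_from_directory()` reads the images from disk and returns a dataset that yields batches of `(images, labels)` tuples, with each image converted into an image tensor (a data structure like a matrix). As no `batch_size` is given, the default of 32 images per batch is used.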

### Parameters
`labels='inferred'`

The class labels are inferred from the name of the folders/structure of directory.

`label_mode='categorical'`

Describes the encoding of the labels. In this case we set it to categorical, meaning each label is encoded as a one-hot vector with one entry per class. This allows us to apply the categorical crossentropy loss function.

`image_size=(224, 224)`

Sets the target image size to 224px by 224px, so every image is resized to the same shape when it is loaded. Note that InceptionV3 is usually trained on 299px by 299px images, so matching that size may be worth trying to optimise performance.

`shuffle=True`

Shuffles the images in the dataset.
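To make `labels='inferred'` and `label_mode='categorical'` more concrete, the sketch below builds one-hot vectors for three of the species by hand. It is only an illustration: `one_hot` is a hypothetical helper, and the class names are sorted alphanumerically because that is the order Keras uses for inferred labels.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < num_classes:
        raise ValueError('index out of range')
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

# Folder names stand in for class names; sorting fixes the index of each class.
class_names = sorted(['Robin', 'Blackbird', 'Wren'])
labels = {name: one_hot(i, len(class_names)) for i, name in enumerate(class_names)}

for name, vec in labels.items():
    print(name, vec)
```

With the full dataset each vector would have 71 entries, one per species, matching the 71 units of the final `Dense` layer.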

To train the model, we need to apply two main methods: `Model.compile()` and `Model.fit()`. The former prepares the model for training, while the latter trains the model for a set number of epochs (iterations).

First we start with the `Model.compile()` method. As this is a supervised learning problem with more than two classes and one-hot labels, we set the loss function to `categorical_crossentropy`. The final softmax layer outputs a probability for each class, and the loss penalises the model whenever it assigns a low probability to the true class.
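The relationship between the softmax output and the categorical crossentropy loss can be checked by hand. The sketch below uses plain Python rather than Keras; the three logits and the clipping constant `eps` are made-up values for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: minus the log-probability given to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

probs = softmax([2.0, 1.0, 0.1])                             # hypothetical scores for 3 classes
loss_right = categorical_crossentropy([1.0, 0.0, 0.0], probs)  # true class is the top guess
loss_wrong = categorical_crossentropy([0.0, 0.0, 1.0], probs)  # true class is the lowest guess

print([round(p, 3) for p in probs])
print(round(loss_right, 3), round(loss_wrong, 3))
```

The loss is small when the true class receives a high probability and grows as that probability shrinks, which is what the optimizer tries to minimise.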

We set our optimizer as [`Adam`](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/) with the `learning_rate` parameter set to 0.01. As we want to evaluate the accuracy of the model while it's being trained, we set `metrics=['accuracy']`.


In [None]:
model_incep.compile(loss='categorical_crossentropy', 
                  optimizer= optimizers.Adam(learning_rate=0.01),
                  metrics=['accuracy'])

After compiling, we move on to training the model. We call the `Model.fit()` method and pass it the training data and the validation data. Because `traindata` is a dataset that is already shuffled and split into batches of 32, we do not pass `shuffle` or `batch_size` here. We set the number of epochs to 10 and `verbose` to 1, which gives us a progress bar to visualise the speed of training. Finally we save the trained model.



In [None]:
history = model_incep.fit(traindata, # already batched and shuffled by image_dataset_from_directory
               validation_data= validdata,
               epochs=10,
               verbose=1)

model_incep.save('/content/drive/MyDrive/ML_Models/birdClassifierDIYv0/inceptiNatv0.h5')

## Fine Tuning

To fine tune, we load the model we just trained:

In [None]:
model_ft = load_model(
       '/content/drive/MyDrive/ML_Models/birdClassifierDIYv0/inceptiNatv0.h5',
       custom_objects={'KerasLayer': hub.KerasLayer} # map the layer name to its class so Keras can rebuild it
)

Now we set the InceptionV3 layers as trainable so we can fine-tune the weights:

In [None]:
model_ft.layers[2].trainable = True # index 2 is the InceptionV3 hub layer
model_ft.summary() # summary() prints the table itself

Now we compile and fit, reducing the learning rate and number of epochs.

In [None]:
model_ft.compile(loss='categorical_crossentropy', 
                  optimizer= optimizers.Adam(learning_rate=0.00001),
                  metrics=['accuracy'])

# traindata is already batched and shuffled, so batch_size and shuffle are not passed
history = model_ft.fit(traindata,
               validation_data= validdata,
               epochs=5,
               verbose=1)

model_ft.save('/content/drive/MyDrive/ML_Models/birdClassifierDIYv0/fttiNatFTUNEv0.h5')