# Bird Species Classifier for AML project
## University of Vienna, S2022
---
### Goal: An image recognition model
#### Open Questions:
* Which methods to use? 
* Which model to train?
* What is the class_dict.csv for?
* Do we need both valid set and test set?

In [161]:
import numpy as np
import pandas as pd

# To search directories
import os

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# Load data
#from tf.keras.preprocessing.image import ImageDataGenerator

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Models
from sklearn.linear_model import LogisticRegression
#...
# Model Selection
from sklearn.metrics import confusion_matrix, classification_report

# Let's get started

### CSV data

In [162]:
birds_english = "../input/100-bird-species/birds.csv"
birds_latin = "../input/100-bird-species/birds latin names.csv"

In [163]:
birds_df = pd.read_csv(birds_english)
# clean column names
birds_df.columns = [col.replace(' ', '_').lower() for col in birds_df.columns]
birds_df.head()

Unnamed: 0,class_index,filepaths,labels,data_set
0,0,train/ABBOTTS BABBLER/001.jpg,ABBOTTS BABBLER,train
1,0,train/ABBOTTS BABBLER/002.jpg,ABBOTTS BABBLER,train
2,0,train/ABBOTTS BABBLER/003.jpg,ABBOTTS BABBLER,train
3,0,train/ABBOTTS BABBLER/004.jpg,ABBOTTS BABBLER,train
4,0,train/ABBOTTS BABBLER/005.jpg,ABBOTTS BABBLER,train


In [164]:
birds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62388 entries, 0 to 62387
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   class_index  62388 non-null  int64 
 1   filepaths    62388 non-null  object
 2   labels       62388 non-null  object
 3   data_set     62388 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.9+ MB


In [165]:
birds_df.value_counts("data_set").head()

data_set
train    58388
test      2000
valid     2000
dtype: int64

In [166]:
birds_df.value_counts("class_index").head()

class_index
224    259
144    243
287    243
363    227
396    224
dtype: int64

In [167]:
#mask = birds_df['labels'].str.contains("ABBOTTS BABBLER") # Search for text fragment
#mask = birds_df.query('labels == "ABBOTTS BABBLER"') # query for name (case sensitive!)
mask = birds_df.loc[birds_df['class_index'] == 0]

print(mask.value_counts("data_set"))
mask

data_set
train    166
test       5
valid      5
dtype: int64


Unnamed: 0,class_index,filepaths,labels,data_set
0,0,train/ABBOTTS BABBLER/001.jpg,ABBOTTS BABBLER,train
1,0,train/ABBOTTS BABBLER/002.jpg,ABBOTTS BABBLER,train
2,0,train/ABBOTTS BABBLER/003.jpg,ABBOTTS BABBLER,train
3,0,train/ABBOTTS BABBLER/004.jpg,ABBOTTS BABBLER,train
4,0,train/ABBOTTS BABBLER/005.jpg,ABBOTTS BABBLER,train
...,...,...,...,...
60388,0,valid/ABBOTTS BABBLER/1.jpg,ABBOTTS BABBLER,valid
60389,0,valid/ABBOTTS BABBLER/2.jpg,ABBOTTS BABBLER,valid
60390,0,valid/ABBOTTS BABBLER/3.jpg,ABBOTTS BABBLER,valid
60391,0,valid/ABBOTTS BABBLER/4.jpg,ABBOTTS BABBLER,valid


### Image data

In [168]:
# Filepaths
rootpath = "../input/100-bird-species"
trainpath = "../input/100-bird-species/train"
validpath = "../input/100-bird-species/valid"
testpath = "../input/100-bird-species/test"
pretest_path = "../input/100-bird-species/images to test"
paths = [pretest_path, trainpath, validpath, testpath]

In [172]:
def importTestImages(path):
    images = []
    
    # Iterate over all files and subfolders
    for root, dirs, filenames in os.walk(path):
        print(root)
        print(dirs)
        print(filenames)
        print("-----------")
        # Iterate all filenames
        for file in filenames:
            # Store the filename for later inspection
            files.append(file)
            
    return pd.DataFrame(files)
            
test_images_df = importTestImages(pretest_path)
test_images_df

../input/100-bird-species/images to test
[]
['5.jpg', '1.jpg', '7.jpg', '4.jpg', '3.jpg', '14.jpg', '2.jpg']
-----------


Unnamed: 0,0
0,../input/100-bird-species/class_dict.csv
1,../input/100-bird-species/my_csv-2-17-2022-1-1...
2,../input/100-bird-species/birds latin names.csv
3,../input/100-bird-species/EfficientNetB4-BIRDS...
4,../input/100-bird-species/birds.csv
...,...
68416,7.jpg
68417,4.jpg
68418,3.jpg
68419,14.jpg


In [None]:
def importFiles(path): 
    """
    Walk through and document the folder structure. Read images and foldernames into DataFrame. Return this DataFrame.
    """
    # Variable to store all files
    files = []

    # Iterate over all files and subfolders
    for dirname, _, filenames in os.walk(path):
        # Iterate all filenames
        for filename in filenames:
            # Store the filename for later inspection
            files.append(os.path.join(dirname, filename))

    print({len(files)}, "files are in the directories.")

    # Split the filenames into subfolders and filename
    # Remove the first three folders (home, kaggle and input) since they do not add new information
    files_split = [file.split('/')[3:] for file in files]

    # Store the split files as DataFrame to get aggregated summaries 
    df = pd.DataFrame(files_split)

    return df

folder_df = importFiles(testpath)
print('\nThese are some sampled entries:')
folder_df.sample(3)

# Prepare data for model

In [170]:
# Read in images from folders (train, valid, test)




#### Define train, validation and test data X and y

In [171]:
X_train = birds_df.loc[birds_df['data_set']=="train"].drop("class_index", axis=1).copy()
y_train = birds_df.loc[birds_df['data_set']=="train"]["class_index"].copy()

X_valid = birds_df.loc[birds_df['data_set']=="valid"].drop("class_index", axis=1).copy()
y_valid = birds_df.loc[birds_df['data_set']=="valid"]["class_index"].copy()

X_test = birds_df.loc[birds_df['data_set']=="test"].drop("class_index", axis=1).copy()
y_test = birds_df.loc[birds_df['data_set']=="test"]["class_index"].copy()

---
# README
after the first meeting

### Aufteilung
- keras tensorflow - Clemens
- pytorch - Jakob
- PCA + preprocessing - Lena 


### Methoden 
- pca?
- image segementation
- Wie laden wir die Bilder von der CSV ins Notebook?
- Wieviele Datenreihen brauchen wir? 
data set
| train    58388 | test      2000 | valid     2000 |
