# Petbreed Multiclassification

**Try out the working classifier [here](https://mypetbreed.onrender.com/) as a web app!**

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from torchvision.models import *
import pretrainedmodels

from fastai.vision import *
from fastai.callbacks.tracker import SaveModelCallback
from fastai.vision.models import *
from fastai.vision.learner import model_meta

from sklearn.utils import shuffle
import sys

In [3]:
torch.cuda.set_device(0)
print(f'Using GPU#{torch.cuda.current_device()}')

Using GPU#0


## Configuration

Next we'll configure our paths and set up some parameters that we'll change throughout the training process to scale the model with increasing image resolution.

In [4]:
PATH = Path('data')
TRAIN = PATH/'train'
TEST = PATH/'test'
LARGE = PATH/'large'

In [5]:
# prefix = "resnet50_v2_"
# model = models.resnet50

# prefix = "vgg19_bn_"
# model = models.vgg19_bn

prefix = "resnet152_"
model = models.resnet152

# prefix = "xception_"

# def xception(pretrained=False):
#     pretrained = 'imagenet' if pretrained else None
#     model = pretrainedmodels.xception(pretrained=pretrained)
#     return nn.Sequential(*list(model.children()))

# model = xception

 

## Data Prep

Before anything, we need to prepare our data for modeling. With how the raw files are structured, the steps we'll need to take are:

1. Match image files to breed names. Since the file names are just numbers and the breed names are IDs in a csv, we need a function that pairs the two together for when we set up our [Databunch](https://docs.fast.ai/data_block.html).
2. Upsample the image set. We only have a few thousand training images in total, and more data should improve our training and validation accuracy. Because of this, we can duplicate our training set several times to "artificially" get a bigger dataset. We'll limit overfitting by applying random image transforms to all these images so that each copy differs from the rest, which should help the model generalize.
3. Create a databunch. Using Fastai's Datablock API, we'll create a databunch that uses our labeling function and upsampled dataset to split our training dataset into training and validation subsets. We'll also apply image transforms using Fastai's `vision.transform` package.

In [6]:
gc.collect()

40

In [7]:
train_df = pd.read_csv(PATH/'train.csv', engine='python')
train_df.head()

Unnamed: 0,breedID,speciesID,fname,breed_name
0,23,2,0,newfoundland
1,35,2,1,staffordshire bull terrier
2,19,2,2,keeshond
3,2,2,3,american bulldog
4,29,2,4,saint bernard


### Upsampling

Let's cycle through our training set 10 times to generate a bigger dataset (only run this once!):

In [8]:
amt = os.listdir(TRAIN)
amt_len = len(amt)
mult_amt = 10

In [9]:
# %%time
# i = 1
# while i < mult_amt:
#     n = 0
#     while n < amt_len:
#         amt.append(amt[n][:-4] + '_copy_' + str(i) + '.jpg')
#         os.system(f'cp {TRAIN}/{amt[n]} {LARGE}/{amt[n][:-4]}_copy_{str(i)}.jpg')
#         n+=1
#         if n%250 == 0:
#             print(n)
#     i+=1
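
The commented-out loop above shells out to `cp` once per file. A minimal sketch of the same idea with `shutil.copy` is below. `make_copies` is a hypothetical helper (it is not part of this notebook), the `_copy_{i}.jpg` naming mirrors the loop above, and the demo runs on a throwaway temporary folder rather than the real `TRAIN` and `LARGE` paths.

```python
import shutil
import tempfile
from pathlib import Path


def make_copies(src_dir, dst_dir, mult_amt):
    """Copy each .jpg in src_dir to dst_dir as <stem>_copy_<i>.jpg, i = 1..mult_amt-1."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for img in sorted(src_dir.glob("*.jpg")):
        for i in range(1, mult_amt):
            target = dst_dir / f"{img.stem}_copy_{i}.jpg"
            shutil.copy(img, target)
            created.append(target.name)
    return created


# Demo with two empty placeholder files standing in for images.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "train"
    src.mkdir()
    for name in ("0.jpg", "1.jpg"):
        (src / name).touch()
    copies = make_copies(src, Path(tmp) / "large", mult_amt=3)
    print(copies)
```

Copying through Python avoids quoting problems with unusual file names and works the same on systems without `cp`.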
    

Custom label function for the enlarged dataset:

In [10]:
def get_large_labels(fname):
    fname = str(fname) # Convert path object to string
    fname = fname.split(sep='/')
    
    fname = fname[len(fname) - 1]
    fname = fname[:-11]

    row = train_df.loc[train_df['fname'] == int(fname)]
    label = row["breed_name"].values[0]
        
    return label
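
The `[:-11]` slice above relies on every copy suffix being exactly 11 characters long (`_copy_N.jpg` with a single-digit `N`). A minimal sketch of a more tolerant parser is below; `original_id` and the `COPY_NAME` pattern are hypothetical names, and the pattern assumes the `<id>_copy_<n>` naming used in the upsampling step.

```python
import re
from pathlib import Path

# Matches stems such as "12_copy_3" and captures the original image id.
COPY_NAME = re.compile(r"^(\d+)_copy_(\d+)$")


def original_id(fname):
    """Return the numeric id of the source image for a name like '12_copy_3.jpg'."""
    match = COPY_NAME.match(Path(str(fname)).stem)
    if match is None:
        raise ValueError(f"unexpected file name: {fname}")
    return int(match.group(1))


print(original_id("data/large/train/12_copy_3.jpg"))
print(original_id("data/large/train/12_copy_10.jpg"))
```

Both calls return the same id, so the lookup in `train_df` would keep working if the number of copies ever grew past nine.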

Train/valid split proportional to class amounts:

In [11]:
def large_train_valid_split(mult_amt, val_pct):
    breeds = list(train_df.breed_name.unique())
    
    for breed in breeds:
        breed_df = train_df.loc[train_df["breed_name"] == str(breed)]
        breed_df = shuffle(breed_df, random_state=42)
        fnames = list(breed_df["fname"])
        f_amt = round(len(fnames)*val_pct)
        
        # Validation split
        for file in fnames[0:f_amt]:            
            for rep in range(1,mult_amt):
                os.system(f"mv {LARGE}/{file}_copy_{rep}.jpg {LARGE}/valid")
        print(f"Completed {breed} validation split.")
    
    # Training split
    print("Starting training split.")
    os.system(f"mv {LARGE}/*.jpg {LARGE}/train")

In [12]:
# large_train_valid_split(10, 0.2)
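
The core of the split above is choosing, per breed, which original images go to validation so that all copies of one image land on the same side. A minimal sketch of just that selection step, without moving any files, is below. `stratified_split` is a hypothetical helper and the labels are toy data.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_pct, seed=42):
    """labels maps file id -> class name. Returns (train_ids, valid_ids)."""
    by_class = defaultdict(list)
    for file_id, label in labels.items():
        by_class[label].append(file_id)

    rng = random.Random(seed)
    train_ids, valid_ids = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        n_valid = round(len(ids) * val_pct)  # same rounding as f_amt above
        valid_ids.extend(ids[:n_valid])
        train_ids.extend(ids[n_valid:])
    return train_ids, valid_ids


# Toy data: 10 images of one breed and 20 of another.
labels = {i: ("pug" if i < 10 else "boxer") for i in range(30)}
train_ids, valid_ids = stratified_split(labels, val_pct=0.2)
print(len(train_ids), len(valid_ids))
```

Because the split happens per class, each breed contributes the same fraction of its images to validation.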

### Datablock

Below is our labeling function for matching image file names with breed names. We will pass it to our databunch when we create it later:

In [None]:
# Labeling function used by Datablock API
def get_labels(fname):    
    fname = str(fname) # Convert path object to string
    fname = fname.split(sep='/')
    
    fname = fname[len(fname) - 1]
    fname = fname[:-4]

    row = train_df.loc[train_df['fname'] == int(fname)]
    label = row["breed_name"].values[0]
        
    return label
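
The same lookup can be sketched without manual string splitting. The version below uses `Path.stem` to drop the folder and extension, and a plain dictionary in place of the dataframe filter. `label_for` and `breed_by_fname` are hypothetical names, and the dictionary only holds the five rows shown in `train_df.head()` above.

```python
from pathlib import Path

# Toy lookup taken from the five rows of train_df.head() shown earlier.
breed_by_fname = {
    0: "newfoundland",
    1: "staffordshire bull terrier",
    2: "keeshond",
    3: "american bulldog",
    4: "saint bernard",
}


def label_for(fname, lookup=breed_by_fname):
    """Map an image path such as data/train/2.jpg to its breed name."""
    return lookup[int(Path(str(fname)).stem)]


print(label_for(Path("data/train/2.jpg")))
```

With the real data, a dictionary like this could be built once from the `fname` and `breed_name` columns, which avoids filtering the whole dataframe for every image.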

Custom Transforms List

In [None]:
# Hyperparams
prob = 1
brightness_range = (0.25,0.75)
contrast_range = (0.5,1.5)
jitter_mag = (0.005,0.01)
max_warp = (0.3)
rotate_range = (0,25)
zoom_range = (1., 1.5)
img_size = (128,512)
x_pct = (0.25,0.75)
y_pct = (0.25,0.75)

# Transforms
trn_tfms = [
    brightness(change=brightness_range, use_on_y=False),
    contrast(scale=contrast_range, use_on_y=False),
    crop_pad(size=img_size, row_pct=x_pct, col_pct=y_pct, use_on_y=False), # Random Expand 
    flip_lr(p=prob, use_on_y=False), # Flips Image
    jitter(magnitude=jitter_mag, use_on_y=False),
    perspective_warp(magnitude=(-max_warp,max_warp), use_on_y=False),
    rand_zoom(scale=zoom_range),
    rotate(degrees=rotate_range, use_on_y=False)
]

val_tfms = [crop_pad(use_on_y=False)] 
tfms = (trn_tfms, val_tfms)

In [None]:
# Datablock
def get_data(sz, bs):
    data = (ImageList.from_folder(COMBINED_TRAIN)
            .split_by_rand_pct(valid_pct=0.1, seed=42)
            .label_from_func(get_labels) 
            .transform(tfms, size=sz)
            .databunch(bs=bs).normalize(imagenet_stats))
    
    return data

# Datablock for large dataset
# def get_data(sz, bs):
#     data = (ImageList.from_folder(LARGE) 
#             .split_by_folder()
#             .label_from_func(get_large_labels) 
#             .transform(tfms, size=sz)
#             .databunch(bs=bs).normalize(imagenet_stats))
    
#     return data

In [None]:
%%time
data = get_data(128,128)

After successfully creating our databunch, let's look at some animals!

In [None]:
data.show_batch(rows=3, figsize=(6,6))

Now that our databunch is set up, it's time to do some modeling.

## Modeling

### Progressive Resizing

From testing various pretrained models and architectures on the Azure VM, we saw the best results with a Resnet50 architecture in a Convolutional Neural Network. InceptionV3 wasn't getting nearly as good accuracy as Resnet50 (or Resnet34), and neither was Vgg.

In [None]:
gc.collect()

Now we'll create our learner and track the metrics we care about (error rate and accuracy):

In [None]:
# xception
learn = cnn_learner(data, model, pretrained=True, cut=-1,
                    split_on=lambda m: (m[0][11], m[1]), metrics=(error_rate,accuracy))

# fastai/pytorch models
# learn = cnn_learner(data, model, metrics=(error_rate,accuracy))


learn.callbacks = [SaveModelCallback(learn, every='improvement', monitor='accuracy', name=f'{prefix}best', mode='max')]

In [None]:
# InceptionV4 test
def inceptionv4(pretrained=False):
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.inceptionv4(pretrained=pretrained)
    all_layers = list(model.children())
    return nn.Sequential(*all_layers[0], *all_layers[1:])

# Set the prefix first so the callback below saves checkpoints under this name
prefix = "inceptionv4_"

learn = cnn_learner(data, inceptionv4, pretrained=True,
                    cut=-2, split_on=lambda m: (m[0][11], m[1]), metrics=(error_rate,accuracy))

learn.callbacks = [SaveModelCallback(learn, every='improvement', monitor='accuracy', name=f'{prefix}best', mode='max')]

In [None]:
# PNAS test
def identity(x): return x

def pnasnet5large(pretrained=False):    
    pretrained = 'imagenet' if pretrained else None
    model = pretrainedmodels.pnasnet5large(pretrained=pretrained, num_classes=1000) 
    model.logits = identity
    return nn.Sequential(model)

# model_meta[pnasnet5large] =  { 'cut': None, 
#                                'split': lambda m: (list(m[0][0].children())[8], m[1]) }

learn = cnn_learner(data, pnasnet5large, pretrained=True)

In [None]:
learn.path = PATH
os.system(f"mv {TRAIN}/models {PATH}")

Let's find the learning rate we'll want to use with Fastai's lr finder:

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
lr = 1e-2

Time to train! During prototyping, accuracy seemed to plateau around 15 epochs, so that's the number we'll go with here:

In [None]:
learn.fit_one_cycle(15, slice(lr))

In [None]:
os.system(f"mv {PATH}/models/{prefix}best.pth {PATH}/models/first_{prefix}best.pth")

In [None]:
gc.collect()
learn.load(f"first_{prefix}best");

In [None]:
learn.unfreeze()

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(30, slice(1e-6,1e-4))
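
The `slice(...)` arguments passed to `fit_one_cycle` control how the learning rate is spread across the model's layer groups. As we understand fastai v1, `slice(start, stop)` gives the earliest group `start`, the last group `stop`, and spaces the groups in between geometrically, while `slice(stop)` gives the last group `stop` and every earlier group `stop/10`. The plain-Python sketch below mimics that behavior; it is not fastai's code, `spread_lrs` is a hypothetical name, and three layer groups is an assumption.

```python
def spread_lrs(lr, n_groups):
    """Mimic how a learning-rate slice is spread over n_groups layer groups."""
    if not isinstance(lr, slice):
        return [lr] * n_groups
    if lr.start is not None:
        if n_groups == 1:
            return [lr.stop]
        # Geometric spacing from start to stop.
        step = (lr.stop / lr.start) ** (1 / (n_groups - 1))
        return [lr.start * step ** i for i in range(n_groups)]
    # Only an upper bound given: earlier groups train 10x slower.
    return [lr.stop / 10] * (n_groups - 1) + [lr.stop]


print(spread_lrs(slice(1e-6, 1e-4), 3))
print(spread_lrs(slice(1e-2), 3))
```

Under these assumptions, the unfrozen stages nudge the early pretrained layers with very small steps while the head keeps a larger rate.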

In [None]:
os.system(f"mv {PATH}/models/{prefix}best.pth {PATH}/models/second_{prefix}best.pth")

In [None]:
gc.collect()
learn.load(f"second_{prefix}best");
learn.data = get_data(256,32)

In [None]:
learn.freeze()

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(15, slice(1e-3))

In [None]:
os.system(f"mv {PATH}/models/{prefix}best.pth {PATH}/models/third_{prefix}best.pth")

In [None]:
gc.collect()
learn.load(f"third_{prefix}best");
learn.unfreeze()

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(15, slice(1e-6, 1e-4))

In [None]:
os.system(f"mv {PATH}/models/{prefix}best.pth {PATH}/models/fourth_{prefix}best.pth")
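
The four training stages above follow one pattern: train frozen, then unfrozen, first at 128px and then at 256px, renaming the best checkpoint after each stage. The sketch below only records that schedule as data; the values are read off the cells above, while `Stage`, `STAGES`, and `checkpoint_name` are hypothetical names that do not exist in this notebook.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str      # checkpoint prefix used when renaming the best model
    size: int      # image size passed to get_data
    bs: int        # batch size passed to get_data
    frozen: bool   # True: only the head trains; False: after learn.unfreeze()
    epochs: int
    lr: slice


STAGES = [
    Stage("first", 128, 128, True, 15, slice(1e-2)),
    Stage("second", 128, 128, False, 30, slice(1e-6, 1e-4)),
    Stage("third", 256, 32, True, 15, slice(1e-3)),
    Stage("fourth", 256, 32, False, 15, slice(1e-6, 1e-4)),
]


def checkpoint_name(stage, prefix):
    """Name of the renamed checkpoint for a stage, without the .pth extension."""
    return f"{stage.name}_{prefix}best"


for stage in STAGES:
    mode = "frozen" if stage.frozen else "unfrozen"
    print(f"{stage.name}: {stage.size}px, bs={stage.bs}, {mode}, {stage.epochs} epochs")
```

Writing the schedule down this way makes it easier to see that the batch size drops from 128 to 32 when the image size doubles.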

With ~95% validation accuracy, we were satisfied with our model. Now to get predictions for our test set!

## Submission

Since we trained on breed names, we need to map each predicted breed name back to its `breedID`. First we'll make an inference learner to get the predictions for the test set images, then feed our results into a pandas dataframe which will be exported to a csv for Kaggle.

In [None]:
samp_df = pd.read_csv(PATH/'sampleSubmission_breed.csv')
samp_df.head()

Getting predictions for test set:

In [None]:
# Create inference learner
# learn = cnn_learner(data, model, metrics=(error_rate,accuracy))

# # xception
# learn = cnn_learner(data, model, pretrained=True, cut=-1,
#                     split_on=lambda m: (m[0][11], m[1]), metrics=(error_rate,accuracy))

# inceptionv4
learn = cnn_learner(data, inceptionv4, pretrained=True,
                   cut=-2, split_on=lambda m: (m[0][11], m[1]), metrics=(error_rate,accuracy))

learn.path = PATH
os.system(f"mv {TRAIN}/models {PATH}")
os.system(f"mv {LARGE}/models {PATH}")

# Load best model
# learn.load(f"final_{prefix}best");
learn.load(f"fourth_{prefix}best");

In [None]:
# Create Submission Dataframe
samp_data = {'fname':[], 'breedID':[]}
sub_df = pd.DataFrame(samp_data)
sub_df

In [None]:
%%time
count = 0
dataset = TEST

for image in os.listdir(TEST):    
    # File names
    fname = image[:-4]
    
    # breed_id predictions
    img = open_image(str(TEST/image))
    breed_pred = learn.predict(img)[0]
    temp_df = train_df.loc[train_df['breed_name'] == str(breed_pred)]
    id_pred = temp_df['breedID'].values[0]  # select the column by name rather than by position
    
    sub_df.loc[count] = [fname, id_pred]
    
    if count%250 == 0:
        print(f"{count} of {len(os.listdir(TEST))} done.")
    count += 1
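
The loop above filters `train_df` and grows `sub_df` one row at a time for every test image. A minimal sketch of an alternative is below: build the breed-name-to-ID lookup once, collect all rows in a list, and write them out in one go. The lookup uses only the five rows shown in `train_df.head()`, and `fake_predict` is a hypothetical stand-in for `learn.predict(img)[0]`, not a real model.

```python
import csv
import io

# Toy lookup taken from the five rows of train_df.head() shown earlier.
breed_to_id = {
    "newfoundland": 23,
    "staffordshire bull terrier": 35,
    "keeshond": 19,
    "american bulldog": 2,
    "saint bernard": 29,
}


def fake_predict(fname):
    """Stand-in for the model: deterministically picks a known breed."""
    names = sorted(breed_to_id)
    return names[int(fname) % len(names)]


def build_rows(fnames, predict):
    """Return (fname, breedID) rows sorted by numeric file name."""
    rows = [(fname, breed_to_id[predict(fname)]) for fname in fnames]
    return sorted(rows, key=lambda row: int(row[0]))


rows = build_rows(["2", "0", "1"], fake_predict)

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["fname", "breedID"])
writer.writerows(rows)
print(buf.getvalue())
```

Note that this sketch sorts file names numerically, whereas sorting a string column as in the next cell orders them lexicographically.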

In [None]:
sub_df.sort_values(by=["fname"], inplace=True)
sub_df.reset_index(inplace=True)
sub_df.drop(["index"], axis=1, inplace=True)
sub_df["breedID"] = sub_df["breedID"].apply(int)
sub_df.head()

In [None]:
sub_df.shape

In [None]:
sub_df.to_csv(f'{PATH}/submissions/{prefix}submission.csv', index=False)

##### We'll also export our model so we can use it in our Web App:

In [None]:
learn.export()

## Ensemble Model Predictions 

We'll now create a master submission by combining the predictions from multiple models.

### Averaging Submissions

In [None]:
SUB = Path(PATH/'submissions')
sub_list = list(SUB.ls())
sub_list

In [None]:
sub_list.pop(4)

In [None]:
sub_list

In [None]:
submissions = []
for index in range(len(sub_list)):
    submissions.append(pd.read_csv(sub_list[index]))

In [None]:
sub_len = len(pd.read_csv(SUB.ls()[0]))
breed_preds = []
for row in range(sub_len):
    temp_breed_pred = []
    
    for sub in submissions:
        temp_breed_pred.append(sub["breedID"][row])
    
    # Get mode of predictions
    mode = max(set(temp_breed_pred), key=temp_breed_pred.count)
    breed_preds.append(mode)
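
The cell above takes, for each test image, the most common `breedID` across all submissions. The same majority vote can be sketched with `collections.Counter`; `majority_vote` is a hypothetical helper and the three prediction lists are toy data, not real model output.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent vote; ties go to the vote seen first."""
    return Counter(votes).most_common(1)[0][0]


# One list of predicted breedIDs per model, aligned by test image.
model_preds = [
    [23, 35, 19],
    [23, 2, 19],
    [29, 35, 19],
]

ensemble = [majority_vote(row_votes) for row_votes in zip(*model_preds)]
print(ensemble)
```

One difference from the cell above: with `max(set(...))` the winner of a tie depends on set iteration order, while `Counter.most_common` keeps the first-seen vote, so listing the strongest model first makes it the tie-breaker.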
    

In [None]:
fnames = list(pd.read_csv(SUB.ls()[0])["fname"])

In [None]:
final_data = {'fname':fnames, 'breedID':breed_preds}
final_sub = pd.DataFrame(final_data)
final_sub.head()

In [None]:
final_sub.to_csv('final_submission.csv', index=False)

## Analysis

Let's look at our results!

In [None]:
interp = ClassificationInterpretation.from_learner(learn)

losses,idxs = interp.top_losses()

len(data.valid_ds)==len(losses)==len(idxs)

Below are several images that our model had the most trouble with. If we had more time to work on this project, we'd adjust our training dataset so that we generate more augmented images of each class, which should help eliminate some of these losses. We can also see what features activated the model the most, and take that into consideration for future edits of this project:

In [None]:
interp.plot_top_losses(9, figsize=(15,11))

If we look at the confusion matrix below, we can see a visualization of the model's performance on all the breeds. The darker and more unbroken the diagonal from the top left to the bottom right is, the more accurate the model:

In [None]:
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

We can also look at which breeds the model mixed up most often. `most_confused` lists the pairs of actual and predicted classes with the most misclassifications, here limited to pairs that were confused at least twice:

In [None]:
interp.most_confused(min_val=2)
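
To make the idea concrete, the sketch below counts confused pairs from two plain lists of labels. It is an illustration of the concept, not fastai's implementation; `confused_pairs` is a hypothetical name and the labels are toy data.

```python
from collections import Counter


def confused_pairs(actual, predicted, min_val=1):
    """Return (actual, predicted, count) for misclassified pairs, most frequent first."""
    pairs = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    ranked = sorted(pairs.items(), key=lambda item: (-item[1], item[0]))
    return [(a, p, n) for (a, p), n in ranked if n >= min_val]


actual = ["pug", "pug", "pug", "boxer", "boxer", "beagle"]
predicted = ["boxer", "boxer", "pug", "pug", "boxer", "beagle"]

print(confused_pairs(actual, predicted, min_val=2))
```

Each pair is directional: a pug predicted as a boxer is counted separately from a boxer predicted as a pug.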

## Playground

In [None]:
TEST = Path(PATH/'categories'/'downloads'/'bear -brown -black').ls()
TEST[0]

In [None]:
open_image(TEST[1])

### Scraping Google Images

In [13]:
from google_images_download import google_images_download

In [14]:
GI_PATH = Path(PATH/'gi_train')
GI_PATH.mkdir(exist_ok=True)

In [15]:
cats_df = train_df[train_df["speciesID"] == 1]
cat_breeds = list(cats_df.breed_name.unique())

dogs_df = train_df[train_df["speciesID"] == 2]
dog_breeds = list(dogs_df.breed_name.unique())

In [16]:
cat_ids = {}
dog_ids = {}

for breed in cat_breeds:
    temp = train_df[train_df["breed_name"] == breed]
    breed_id = list(temp["breedID"])[0]    
    breed_name = breed
    cat_ids[breed_name] = breed_id
    
for breed in dog_breeds:
    temp = train_df[train_df["breed_name"] == breed]
    breed_id = list(temp["breedID"])[0]    
    breed_name = breed
    dog_ids[breed_name] = breed_id    

In [17]:
chrome_driver_path = "/home/waydegg/Downloads/chromedriver"

**Download images**

In [18]:
%%time
gi_keywords = []
breeds = [dog_breeds, cat_breeds]

def download_animal_breeds():
    for breed in breeds:
        animal = ""
        if breed == dog_breeds:
            animal = "dog"
        else:
            animal = "cat"

        for sub_breed in breed:
            keyword = ""
            keyword = f"{animal} {sub_breed}"

            # Exclude breeds in search
            for exclude in breed:
                if exclude != sub_breed:
                    formatted = exclude.replace(" ", "_")
                    keyword = keyword + f" -{formatted}" 

            gi_keywords.append(keyword)

    for keyword in gi_keywords:
        fn = keyword.split("-")[0][4:-1].replace(" ", "_")
        KW_PATH = Path(GI_PATH/fn)
        KW_PATH.mkdir(exist_ok=True)

        os.system(f'googleimagesdownload -k "{keyword}" -o {GI_PATH} -i {fn} -l 100 --chromedriver {chrome_driver_path}')
        print(f"Finished downloading {fn} pictures!")
        
# download_animal_breeds()

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.81 µs
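
The keyword logic inside `download_animal_breeds` can be pulled out into two small pure functions, which makes it easy to check what search string and folder name a breed produces. In the sketch below, `build_keyword` and `folder_name` are hypothetical names; the string handling follows the cell above, including the assumption that the animal word is three letters long and that at least one breed is excluded.

```python
def build_keyword(animal, breed, all_breeds):
    """Search string for one breed, excluding every other breed of that animal."""
    excluded = [f"-{other.replace(' ', '_')}" for other in all_breeds if other != breed]
    return " ".join([f"{animal} {breed}"] + excluded)


def folder_name(keyword):
    """Recover the breed folder name from a keyword, as in the cell above."""
    # Text before the first "-" is "<animal> <breed> "; drop the 4-character
    # animal prefix and the trailing space.
    return keyword.split("-")[0][4:-1].replace(" ", "_")


breeds = ["pug", "saint bernard", "boxer"]
keyword = build_keyword("dog", "saint bernard", breeds)
print(keyword)
print(folder_name(keyword))
```

A breed name that itself contains a hyphen would break `folder_name`, since the keyword is split on the first `-`.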


**Clean Image Names**

In [27]:
%%time

for breed in GI_PATH.ls():
    index = 0
    breed_name = str(breed).split("/")[-1]
    
    for animal in breed.ls():
        old_fn = str(animal).split("/")[-1]
        new_fn = old_fn.replace(" ", "")
        suffix = new_fn.split('.')[-1]
        new_fn = f"{breed_name}_{index}.{suffix}"
        old_fn = old_fn.replace(" ", "\\ ")  # escape spaces for the shell command below
        
        os.system(f"mv {breed}/{old_fn} {breed}/{new_fn}")
        index += 1
    
    print(breed_name)


shiba_inu
scottish_terrier
Bengal
Birman
miniature_pinscher
beagle
Bombay
Russian_Blue
yorkshire_terrier
great_pyrenees
newfoundland
Egyptian_Mau
staffordshire_bull_terrier
keeshond
British_Shorthair
Maine_Coon
leonberger
chihuahua
english_cocker_spaniel
japanese_chin
wheaten_terrier
american_bulldog
Siamese
american_pit_bull_terrier
boxer
german_shorthaired
havanese
pug
english_setter
Abyssinian
Sphynx
Persian
Ragdoll
pomeranian
saint_bernard
basset_hound
samoyed
CPU times: user 303 ms, sys: 40.8 s, total: 41.1 s
Wall time: 1min 38s
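
Downloaded file names can contain spaces, parentheses, and other characters that are awkward to escape for `mv`. A minimal sketch of the same renaming with `pathlib` is below. `clean_names` is a hypothetical helper, it assumes no file in the folder already carries one of the target names, and the demo runs in a temporary folder with empty placeholder files.

```python
import tempfile
from pathlib import Path


def clean_names(breed_dir):
    """Rename every file in breed_dir to <folder>_<index><suffix>."""
    breed_dir = Path(breed_dir)
    files = sorted(p for p in breed_dir.iterdir() if p.is_file())
    renamed = []
    for index, img in enumerate(files):
        target = breed_dir / f"{breed_dir.name}_{index}{img.suffix}"
        img.rename(target)
        renamed.append(target.name)
    return renamed


with tempfile.TemporaryDirectory() as tmp:
    pug_dir = Path(tmp) / "pug"
    pug_dir.mkdir()
    for name in ("1. cute pug.jpg", "2. pug (1).png"):
        (pug_dir / name).touch()
    renamed = clean_names(pug_dir)
    print(renamed)
```

Since no shell is involved, nothing in the file name needs escaping.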


**Label and move downloaded images**

In [None]:
COMBINED_TRAIN = Path(PATH/'combined_train')
COMBINED_TRAIN.mkdir(exist_ok=True)

In [None]:
cols = ["breedID", "speciesID", "fname", "breed_name"]
download_df = pd.DataFrame(columns=cols)
download_df

Labeling

In [None]:
for breed_dir in GI_PATH.ls():
#     verify_images(breed_dir, delete=True)

    breed_dir = Path(breed_dir)
    index = 0

    temp_df = pd.DataFrame(columns=cols)

    for animal in breed_dir.ls():
        filename = str(animal).split('/')[-1]
        breed = str(animal).split('/')[-2].replace("_", " ")

        # Keep only png/jpg files; anything else is printed and removed
        if filename.split(".")[-1] in ("png", "jpg"):
            if breed in dog_ids:
                temp_id = dog_ids[breed]
                species_id = 2
            else:
                temp_id = cat_ids[breed]
                species_id = 1

            temp_df.loc[index] = [temp_id, species_id, filename, breed]
            index += 1
        else:
            print(filename)
            os.system(f"rm {str(animal)}")
        
#     break

    download_df = pd.concat([download_df, temp_df], ignore_index=True)

In [None]:
download_df.shape

In [None]:
download_df.head()

Move images into combined folder

In [None]:
def move_downloaded_images():
    # Copying downloaded images
    for breed in GI_PATH.ls():
        breed_name = str(breed).split("/")[-1]
        index = 0
        for pic in breed.ls():
            print(pic)
            
            os.system(f"cp {pic} {COMBINED_TRAIN}")
            index += 1
            break
        break
            
#         os.system(f"cp {breed}/*.jpg {COMBINED_TRAIN}")
#         os.system(f"cp {breed}/*.png {COMBINED_TRAIN}")
        
    # Copying default images
#     os.system(f"cp {PATH}/train/* {COMBINED_TRAIN}")
    
move_downloaded_images()

In [None]:
download_df.shape, len(COMBINED_TRAIN.ls()) # These should be the same size?

In [None]:
list(train_df["fname"])

In [None]:
# Rows whose file is missing from the combined folder
download_df[~download_df["fname"].isin([p.name for p in COMBINED_TRAIN.ls()])]