# Setup

First, we import everything we need. The code for `keras_utilities` is available [here](https://github.com/rforgione/keras-utilities). It is essentially a compilation of the utility code Jeremy Howard wrote for the FastAI deep learning course, plus some of my own code for creating the appropriate directory structure for Keras models.

In [34]:
from keras_utilities import *
from keras_utilities.models.vgg16bn import Vgg16BN
from keras_utilities.models.vgg16 import Vgg16
from keras.layers import Dense
import os
from numpy import array
from IPython.display import FileLink

## Control Panel

In [5]:
prod = True

if not prod:
    path = 'sample/'
else:
    path = 'data/'
    
batch_size = 64

model_dir = "models/"
results_dir = "results/"

The following three cells are commented out because we only want to run them once. The first one copies 10% of the training data into a sample data directory, which we can use to iterate on models quickly.

In [3]:
# for subdir in ['c' + str(i) for i in range(10)]:
#     if not os.path.exists("sample/train/%s" % subdir):
#         os.makedirs("sample/train/%s" % subdir)
#     create_sample_data("data/train/%s" % subdir, "sample/train/%s" % subdir)
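
`create_sample_data` comes from `keras_utilities`, and its implementation isn't shown in this notebook. To give a rough idea of what such a helper could look like, here is a minimal Python 3 sketch. The name `sample_files`, the 10% default, and the `seed` parameter are assumptions made for illustration; only the `method` and `sample_pct` arguments mirror the calls used here.

```python
import os
import random
import shutil


def sample_files(src_dir, dst_dir, method="copy", sample_pct=0.1, seed=None):
    """Copy or move a random fraction of the files in src_dir into dst_dir.

    Hypothetical stand-in for create_sample_data, not the library's code.
    """
    if method not in ("copy", "move"):
        raise ValueError("method must be 'copy' or 'move'")
    os.makedirs(dst_dir, exist_ok=True)
    # Sort first so that a fixed seed gives a reproducible sample.
    names = sorted(
        n for n in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, n))
    )
    k = int(round(len(names) * sample_pct))
    chosen = random.Random(seed).sample(names, k)
    transfer = shutil.copy2 if method == "copy" else shutil.move
    for name in chosen:
        transfer(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen
```

With `method="move"` the chosen files leave the source directory, which is the behaviour we want for carving out a validation set; with `method="copy"` the source stays intact, which suits the sample directory.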

In this next piece, we take 25% of our training data and move it into a validation set directory.

In [4]:
# for subdir in ['c' + str(i) for i in range(10)]:
#     if not os.path.exists("data/valid/%s" % subdir):
#         os.makedirs("data/valid/%s" % subdir)
#     create_sample_data("data/train/%s" % subdir, "data/valid/%s" % subdir, 
#                        method="move", sample_pct=.25)

Finally, we take 25% of our sample data and move it into a sample validation set directory.

In [5]:
# for subdir in ['c' + str(i) for i in range(10)]:
#     if not os.path.exists("sample/valid/%s" % subdir):
#         os.makedirs("sample/valid/%s" % subdir)
#     create_sample_data("sample/train/%s" % subdir, "sample/valid/%s" % subdir, 
#                        method="move", sample_pct=.25)

Let's do a quick check on whether this worked:

In [9]:
# training data for class c0
!ls -1 data/train/c0 | wc -l

1900


In [10]:
# sample training data for class c0, should be about 10% of the data above
!ls -1 sample/train/c0 | wc -l

166


In [11]:
# validation data for class c0, should be about 1/3 of the training data for c0
!ls -1 data/valid/c0 | wc -l

589


In [12]:
# sample validation data for class c0, should be about 1/3 of the sample training data for c0
!ls -1 sample/valid/c0 | wc -l

62


So the percentages don't work out perfectly, but they're good enough for our purposes. We now have a well-organized collection of image data. Time to move on to the fun part: the actual modeling.
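
As a quick sanity check on those numbers: moving 25% of a directory leaves 75% behind, so the moved set should be about a third the size of what remains. The short sketch below redoes that arithmetic with the c0 counts printed above; the helper name `moved_to_remaining` is made up for illustration.

```python
# File counts for class c0, copied from the cell outputs above.
train, valid = 1900, 589
sample_train, sample_valid = 166, 62


def moved_to_remaining(pct):
    """Ratio of moved files to files left behind when a fraction pct is moved."""
    return pct / (1.0 - pct)


expected = moved_to_remaining(0.25)  # 0.25 / 0.75 = 1/3
full_ratio = valid / float(train)
sample_ratio = sample_valid / float(sample_train)

print("expected %.3f, full data %.3f, sample %.3f"
      % (expected, full_ratio, sample_ratio))
```

The observed ratios land at roughly 0.31 and 0.37, on either side of the expected 0.33.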

# Modeling

In [2]:
vgg = Vgg16()

In [3]:
vgg.ft(10)

In [40]:
batches = get_batches(path+"train", batch_size=batch_size)
val_batches = get_batches(path+"valid", batch_size=batch_size)

Found 16904 images belonging to 10 classes.
Found 5520 images belonging to 10 classes.


In [47]:
vgg.fit(batches=batches, val_batches=val_batches)

Epoch 1/1


In [51]:
vgg.model.save_weights(model_dir+"vgg_ft_1epoch.h5")

In [43]:
vgg.model.load_weights(model_dir+"vgg_ft_1epoch.h5")

In [22]:
files = os.listdir("data/test/unknown/")

In [7]:
test_batches = get_batches("data/test/", batch_size=32)

Found 79726 images belonging to 1 classes.


In [96]:
batch = next(test_batches)

In [9]:
batch[0].shape

NameError: name 'batch' is not defined

In [8]:
preds = vgg.test("data/test", batch_size=32)

Found 79726 images belonging to 1 classes.


In [32]:
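def create_prediction_file(filename, preds=None, model=None):
    # Validate the arguments before doing any work.
    if preds is None and model is None:
        raise ValueError("Must pass either preds or a model.")
    if preds is None:
        preds = model.test("data/test", batch_size=32)

    # preds[1] holds the class probabilities, one row per test image.
    pred_str = [",".join(["%.10f" % i for i in x]) for x in preds[1].tolist()]
    # Relies on the global `files` list being in the same order as the predictions.
    rows = [",".join([a, b]) + "\n" for a, b in zip(files, pred_str)]
    # Open with "w" so that rerunning the cell does not append a second header.
    with open(filename, "w") as f:
        f.write("img,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9\n")
        for row in rows:
            f.write(row)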

In [33]:
create_prediction_file(results_dir+"results_vgg.csv", preds=preds)
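
To make the output format concrete, here is a small sketch of the row formatting that `create_prediction_file` performs, run on made-up data. The helper name `format_rows`, the demo file names, and the demo probabilities are all hypothetical; the header and the `%.10f` formatting match the function above.

```python
def format_rows(filenames, probs):
    """Return CSV lines: the file name followed by one probability per class."""
    lines = []
    for name, row in zip(filenames, probs):
        lines.append(",".join([name] + ["%.10f" % p for p in row]))
    return lines


# Hypothetical inputs: two test images and ten class probabilities each.
demo_files = ["img_1.jpg", "img_2.jpg"]
demo_probs = [
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1] * 10,
]

header = "img," + ",".join("c%d" % i for i in range(10))
lines = format_rows(demo_files, demo_probs)

print(header)
for line in lines:
    print(line)
```

Each line pairs one file name with ten probabilities, in the same column order as the header.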

In [35]:
FileLink(results_dir+"results_vgg.csv")

In [36]:
vgg.model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
lambda_1 (Lambda)                (None, 3, 224, 224)   0           lambda_input_1[0][0]             
____________________________________________________________________________________________________
zeropadding2d_1 (ZeroPadding2D)  (None, 3, 226, 226)   0           lambda_1[0][0]                   
____________________________________________________________________________________________________
convolution2d_1 (Convolution2D)  (None, 64, 224, 224)  1792        zeropadding2d_1[0][0]            
____________________________________________________________________________________________________
zeropadding2d_2 (ZeroPadding2D)  (None, 64, 226, 226)  0           convolution2d_1[0][0]            
___________________________________________________________________________________________

Results: a score of 1.42 on the public leaderboard (1.54 on the private one), which gets the job done!

In [38]:
# Mark the last convolutional layer and the dense layers as trainable, then recompile.
for layer in ['convolution2d_13', 'dense_1', 'dense_2', 'dense_4']:
    vgg.model.get_layer(layer).trainable = True

vgg.compile()

In [49]:
vgg.fit(batches=batches, val_batches=val_batches)

In [46]:
vgg16bn = Vgg16BN()
vgg16bn.ft(10)
vgg16bn.compile()
vgg16bn.fit(batches=batches, val_batches=val_batches)

Epoch 1/1


In [52]:
vgg16bn.model.save_weights(model_dir+"vgg16bn_1_epoch_and_change.h5")

In [53]:
vgg16bn.fit(batches=batches, val_batches=val_batches)

Epoch 1/1


In [47]:
create_prediction_file(results_dir+"vggbn_results.csv", model=vgg16bn)

Found 79726 images belonging to 1 classes.


In [48]:
FileLink(results_dir+"vggbn_results.csv")

## Creating a validation set split by driver

In [54]:
import pandas as pd
imgs_list = pd.read_csv("data/driver_imgs_list.csv")

In [55]:
imgs_list.head()

|   | subject | classname | img |
|---|---|---|---|
| 0 | p002 | c0 | img_44733.jpg |
| 1 | p002 | c0 | img_72999.jpg |
| 2 | p002 | c0 | img_25094.jpg |
| 3 | p002 | c0 | img_69092.jpg |
| 4 | p002 | c0 | img_92629.jpg |


In [56]:
imgs_list['subject'].nunique()

26

In [84]:
imgs_list['file_path'] = imgs_list.apply(lambda x: "/".join([x['classname'], x['img']]), axis=1)

In [87]:
imgs_list.head()

|   | subject | classname | img | file_path |
|---|---|---|---|---|
| 0 | p002 | c0 | img_44733.jpg | c0/img_44733.jpg |
| 1 | p002 | c0 | img_72999.jpg | c0/img_72999.jpg |
| 2 | p002 | c0 | img_25094.jpg | c0/img_25094.jpg |
| 3 | p002 | c0 | img_69092.jpg | c0/img_69092.jpg |
| 4 | p002 | c0 | img_92629.jpg | c0/img_92629.jpg |


In [57]:
imgs_list.groupby('subject').count()

| subject | classname | img |
|---|---|---|
| p002 | 725 | 725 |
| p012 | 823 | 823 |
| p014 | 876 | 876 |
| p015 | 875 | 875 |
| p016 | 1078 | 1078 |
| p021 | 1237 | 1237 |
| p022 | 1233 | 1233 |
| p024 | 1226 | 1226 |
| p026 | 1196 | 1196 |
| p035 | 848 | 848 |


In [62]:
imgs_list['subject'].unique()[:20], imgs_list['subject'].unique()[20:]

(array(['p002', 'p012', 'p014', 'p015', 'p016', 'p021', 'p022', 'p024', 'p026', 'p035', 'p039',
        'p041', 'p042', 'p045', 'p047', 'p049', 'p050', 'p051', 'p052', 'p056'], dtype=object),
 array(['p061', 'p064', 'p066', 'p072', 'p075', 'p081'], dtype=object))

In [73]:
train_subjects = list(imgs_list['subject'].unique()[:20])
valid_subjects = list(imgs_list['subject'].unique()[20:])

In [74]:
train_subjects, valid_subjects

(['p002',
  'p012',
  'p014',
  'p015',
  'p016',
  'p021',
  'p022',
  'p024',
  'p026',
  'p035',
  'p039',
  'p041',
  'p042',
  'p045',
  'p047',
  'p049',
  'p050',
  'p051',
  'p052',
  'p056'],
 ['p061', 'p064', 'p066', 'p072', 'p075', 'p081'])

In [88]:
val_imgs = list(imgs_list.loc[imgs_list['subject'].isin(valid_subjects), 'file_path'])
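
The point of holding out whole drivers is that no person appears in both the training and the validation set. Here is the same idea as a plain-Python sketch, so it runs without pandas. The function name `split_by_subject` and the demo rows are made up for illustration; the real rows come from `driver_imgs_list.csv`.

```python
def split_by_subject(records, valid_subjects):
    """Split (subject, file_path) records so each subject lands in one set only."""
    held_out = set(valid_subjects)
    train, valid = [], []
    for subject, file_path in records:
        (valid if subject in held_out else train).append(file_path)
    return train, valid


# Hypothetical rows, shaped like the subject and file_path columns above.
records = [
    ("p002", "c0/img_1.jpg"),
    ("p002", "c1/img_2.jpg"),
    ("p061", "c0/img_3.jpg"),
    ("p081", "c3/img_4.jpg"),
]

train_files, valid_files = split_by_subject(records, ["p061", "p081"])

# No file should end up in both sets.
overlap = set(train_files) & set(valid_files)
```

Because membership is decided per subject rather than per image, every image of a held-out driver goes to the validation list.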

In [89]:
val_imgs

['c0/img_79017.jpg',
 'c0/img_97584.jpg',
 'c0/img_43311.jpg',
 'c0/img_59737.jpg',
 'c0/img_51347.jpg',
 'c0/img_94538.jpg',
 'c0/img_48175.jpg',
 'c0/img_77617.jpg',
 'c0/img_82654.jpg',
 'c0/img_74403.jpg',
 'c0/img_102087.jpg',
 'c0/img_4733.jpg',
 'c0/img_13675.jpg',
 'c0/img_58922.jpg',
 'c0/img_81052.jpg',
 'c0/img_54366.jpg',
 'c0/img_97751.jpg',
 'c0/img_62737.jpg',
 'c0/img_16990.jpg',
 'c0/img_32707.jpg',
 'c0/img_52699.jpg',
 'c0/img_19970.jpg',
 'c0/img_44121.jpg',
 'c0/img_81643.jpg',
 'c0/img_15117.jpg',
 'c0/img_79563.jpg',
 'c0/img_41948.jpg',
 'c0/img_91788.jpg',
 'c0/img_85949.jpg',
 'c0/img_32174.jpg',
 'c0/img_50526.jpg',
 'c0/img_52776.jpg',
 'c0/img_84289.jpg',
 'c0/img_40944.jpg',
 'c0/img_48288.jpg',
 'c0/img_80325.jpg',
 'c0/img_10704.jpg',
 'c0/img_98733.jpg',
 'c0/img_51928.jpg',
 'c0/img_39356.jpg',
 'c0/img_87301.jpg',
 'c0/img_54415.jpg',
 'c0/img_88541.jpg',
 'c0/img_74442.jpg',
 'c0/img_73265.jpg',
 'c0/img_10307.jpg',
 'c0/img_79400.jpg',
 'c0/img_2283

In [94]:
import shutil

train_dir = "data/train/"
val_dir = "data/valid/"

for img in val_imgs:
    shutil.move(train_dir+img, val_dir+img)
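
The loop above assumes that every class folder already exists under `data/valid/` and that each file is still sitting in `data/train/`. A slightly more defensive variant might look like the sketch below; the helper name `move_into` and its return value are assumptions for illustration, not part of `keras_utilities`.

```python
import os
import shutil


def move_into(src_root, dst_root, rel_path):
    """Move src_root/rel_path to dst_root/rel_path, creating folders as needed.

    Returns True if the file was moved, False if it was not found.
    """
    src = os.path.join(src_root, rel_path)
    dst = os.path.join(dst_root, rel_path)
    if not os.path.exists(src):
        return False  # already moved earlier, or never there
    dst_dir = os.path.dirname(dst)
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    shutil.move(src, dst)
    return True
```

Used as `moved = [img for img in val_imgs if move_into(train_dir, val_dir, img)]`, this makes the cell safe to rerun, since files that have already been moved are simply skipped.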

In [95]:
batches = get_batches(train_dir)
val_batches = get_batches(val_dir)

Found 17778 images belonging to 10 classes.
Found 4646 images belonging to 10 classes.


In [None]:
vgg16bn = Vgg16BN()
