**What I want to do:** Enter the Dogs vs. Cats competition on Kaggle by fine-tuning a pretrained deep learning model, specifically the VGG16 model.

## Step 1 - Sign up for the dogs v cats comp

Sign up [here](https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition). You can also sign up at the point where you first try to download the data or make a submission. There are 25,000 labelled dog and cat photos available for training, and 12,500 in the test set that we have to label for this competition.

## Step 2 - Set up the kaggle cli

The command line interface for Kaggle has been developed by this [fine gentleman](https://github.com/floydwch/kaggle-cli). Thank you!

In this case, the competition name is "dogs-vs-cats-redux-kernels-edition". Also, don't forget to globally configure your username and password.

In [2]:
#!pip install kaggle-cli

In [19]:
#!kg config -g -u '###' -p '#####' -c 'dogs-vs-cats-redux-kernels-edition'
#!kg config

In [8]:
import os
os.getcwd()

'/home/ubuntu/kaggle_dogs_cats'

In [14]:
#!mkdir data_redux
#%cd data_redux

/home/ubuntu/kaggle_dogs_cats/data_redux


In [18]:
!kg download

downloading https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/download/test.zip

test.zip 100% |######################################| Time: 0:00:10  24.7 MiB/s
downloading https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/download/train.zip

train.zip 100% |#####################################| Time: 0:00:20  26.3 MiB/s
downloading https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/download/sample_submission.csv

sample_submission.csv 100% |#########################| Time: 0:00:00 269.2 KiB/s


## Step 3 - Download the data

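Run "kg download" (from kaggle-cli) after creating a separate folder for the competition in your data directory and changing into it. These notes call that folder kaggle_dogscats; the notebook cells here use data_redux. Note that I'm assuming you're using a large GPU box. Otherwise this will be miserably slow.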

After you've finished downloading, you'll see a sample submission file and 2 zip files for test and train. Unzip the two zip files to get 2 new directories, test and train:

unzip test.zip<br>
unzip train.zip
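
If you would rather stay inside the notebook than shell out to `unzip`, the same step can be sketched with the standard-library `zipfile` module. This is only a minimal sketch, assuming Python 3; the helper name `extract_archive` is made up for illustration and is not part of kaggle-cli.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir.

    Returns the number of entries in the archive.
    (Hypothetical helper, not part of kaggle-cli.)
    """
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return len(archive.namelist())


# Example usage (assumes the archives sit in the current directory):
# extract_archive("test.zip")
# extract_archive("train.zip")
```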

In [21]:
!ls

sample_submission.csv  test  test.zip  train  train.zip


## Step 4 - Explore the data


ls test/ | wc -l   # 12500 images in test set<br>
ls train/ | wc -l  # 25000 images in training set

ls kaggle_dogscats/train/ | less<br>
ls kaggle_dogscats/train/ | tail   # in the training set, images are named class(dog/cat).image_id(986660).jpg

ls kaggle_dogscats/test/ | tail    # in the test set, though, images are named image_id(986660).jpg

ls kaggle_dogscats/train/ | grep 'dog' | wc -l    # 12500 images of dogs in the training set<br>
ls kaggle_dogscats/train/ | grep 'cat' | wc -l    # 12500 images of cats in the training set

The two classes are perfectly balanced (12,500 each), so I'll use a uniform distribution for shuffling and sampling the training data.
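
The class counts can also be read straight from the file names in Python, since the class is the text before the first dot. A minimal sketch, assuming Python 3 and the `class.image_id.jpg` naming seen in the training set; `count_classes` is a made-up name for illustration.

```python
import os
from collections import Counter


def count_classes(image_dir):
    """Count images per class in a flat directory of class.image_id.jpg files.

    The class label is taken to be everything before the first dot.
    (Hypothetical helper for illustration.)
    """
    return Counter(
        name.split(".")[0]
        for name in os.listdir(image_dir)
        if name.endswith(".jpg")
    )


# Example usage (assumes the unzipped train directory is in the current folder):
# print(count_classes("train"))
```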


In [23]:
!ls test/ | wc -l
!ls train/ | wc -l

12500
25000


In [24]:
!ls train/ | tail

dog.9993.jpg
dog.9994.jpg
dog.9995.jpg
dog.9996.jpg
dog.9997.jpg
dog.9998.jpg
dog.9999.jpg
dog.999.jpg
dog.99.jpg
dog.9.jpg


In [25]:
!ls test/ | tail

9993.jpg
9994.jpg
9995.jpg
9996.jpg
9997.jpg
9998.jpg
9999.jpg
999.jpg
99.jpg
9.jpg


In [27]:
!ls train/ | grep 'dog' | wc -l
!ls train/ | grep 'cat' | wc -l

12500
12500


## Step 5 - Split images into training, test, validation and sample sets

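Execute "split_images_into_directories.py". With perc_split = 0.8, 80% of the labelled images go to the training set and the rest to the validation set.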

In [29]:
#%cd ..

/home/ubuntu/kaggle_dogs_cats


In [30]:
from split_images_into_directories import *

In [31]:
split_image_dataset_into_train_test_validation_sample("data_redux", 
                                                       response_classes=['dog','cat'],
                                                       perc_split = 0.8)

0 files were unable to be classified into response classes in the 'train' sub-directory within the 'data_redux/sample' directory
0 files were unable to be classified into response classes in the 'valid' sub-directory within the 'data_redux/sample' directory
0 files were unable to be classified into response classes in the 'train' sub-directory within the 'data_redux/train' directory
0 files were unable to be classified into response classes in the 'valid' sub-directory within the 'data_redux/train' directory


'Success'

The whole process barely took 2 seconds!
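
The script itself is not shown in this notebook. As a rough idea of what a train/validation split like this involves, here is a minimal sketch, assuming Python 3 and the `class.image_id.jpg` naming from Step 4. The function `split_train_valid` and all of its parameters are made up for illustration and are not the script's actual API. Unlike the real script, it copies files rather than moving them and does not build the sample set.

```python
import os
import random
import shutil


def split_train_valid(src_dir, dest_dir, classes=("dog", "cat"),
                      train_fraction=0.8, seed=0):
    """Shuffle labelled images uniformly and copy them into
    dest_dir/train/<class> and dest_dir/valid/<class>.

    Returns a dict mapping (subset, class) to the number of files copied.
    (Hypothetical sketch, not the author's script.)
    """
    # Keep only files whose prefix is one of the known classes.
    files = sorted(f for f in os.listdir(src_dir) if f.split(".")[0] in classes)
    # Seeded shuffle so the split is reproducible.
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_fraction)

    counts = {}
    for i, name in enumerate(files):
        subset = "train" if i < n_train else "valid"
        label = name.split(".")[0]
        target = os.path.join(dest_dir, subset, label)
        os.makedirs(target, exist_ok=True)
        shutil.copy(os.path.join(src_dir, name), target)
        counts[(subset, label)] = counts.get((subset, label), 0) + 1
    return counts
```

Because the shuffle runs over all images at once rather than per class, the per-class counts in each subset come out close to equal but not exactly equal.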

## Step 6 - Test whether the directories are how you want them to be

I manually checked whether the file splits are to my satisfaction. It all checks out!

In [37]:
os.listdir('data_redux')

['train.zip',
 'valid',
 'sample_submission.csv',
 'unprocessed',
 'train',
 'test',
 'sample',
 'test.zip']

In [41]:
for subdirs in ['valid','unprocessed','train','test','sample']:
    print (os.path.join('data_redux',subdirs),len(os.listdir(os.path.join('data_redux',subdirs))))

('data_redux/valid', 2)
('data_redux/unprocessed', 0)
('data_redux/train', 2)
('data_redux/test', 12500)
('data_redux/sample', 2)


In [44]:
!ls data_redux/valid/cat | wc -l
!ls data_redux/valid/dog | wc -l
!ls data_redux/train/cat | wc -l
!ls data_redux/train/dog | wc -l

2520
2480
9980
10020


In [46]:
!ls data_redux/sample/train/cat | wc -l
!ls data_redux/sample/valid/cat | wc -l

43
7
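
The counts above add up as expected: 9980 + 10020 = 20000 training images and 2520 + 2480 = 5000 validation images, which is the 80/20 split requested with perc_split = 0.8. The same check can be sketched in Python, assuming Python 3 and the `<subset>/<class>` layout listed above; `count_split` and `train_fraction` are made-up names for illustration.

```python
import os


def count_split(root, subsets=("train", "valid"), classes=("cat", "dog")):
    """Return {(subset, class): number of files} for each class folder under root.

    (Hypothetical helper for illustration.)
    """
    counts = {}
    for subset in subsets:
        for label in classes:
            folder = os.path.join(root, subset, label)
            counts[(subset, label)] = len(os.listdir(folder))
    return counts


def train_fraction(counts):
    """Fraction of all counted files that landed in the 'train' subset."""
    total = sum(counts.values())
    train = sum(n for (subset, _), n in counts.items() if subset == "train")
    return train / total


# Example usage (assumes the data_redux layout from this notebook):
# counts = count_split("data_redux")
# print(counts, train_fraction(counts))
```

Fed the four counts reported above, `train_fraction` returns 0.8.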


## Step 7 - Rewrite the code taught during lesson 1 in the fastai course for the new dataset

FIN. Step 7 onwards will be in another ipynb. I want to logically break here because I'm hungry. I also don't want to rerun the image-splitting code each time I start the Amazon instance.