**What I want to do:** Enter the State Farm Distracted Driver Detection competition on Kaggle by fine-tuning a pretrained deep learning model, specifically the VGG16 model.

## Step 1 - Accept T&C for the dataset

Sign up [here](https://www.kaggle.com/c/state-farm-distracted-driver-detection). Accept the terms and conditions before you download the data.

In [1]:
%cd data

/home/ubuntu/distracted_driver/data


In [3]:
#!kg config -g -u '****' -p '*****' -c 'state-farm-distracted-driver-detection'
#!kg config

Working config:
[('username', u'datastic'), ('password', '********'), ('competition', u'state-farm-distracted-driver-detection')]


In [4]:
!kg download

downloading https://www.kaggle.com/c/state-farm-distracted-driver-detection/download/sample_submission.csv.zip

sample_submission.csv.zip 100% |#####################| Time: 0:00:00 465.0 KiB/s
downloading https://www.kaggle.com/c/state-farm-distracted-driver-detection/download/imgs.zip

imgs.zip 100% |######################################| Time: 0:02:16  29.9 MiB/s
downloading https://www.kaggle.com/c/state-farm-distracted-driver-detection/download/driver_imgs_list.csv.zip

driver_imgs_list.csv.zip 100% |######################| Time: 0:00:00  93.2 KiB/s


## Step 3 - Download the data

Create a separate folder for the competition in your data directory, then run "kg download" from kaggle-cli inside it. Note: I'm assuming you're using a large GPU box. Otherwise this will be miserably slow.

After the download finishes, you'll have three zip files: sample_submission.csv.zip, imgs.zip and driver_imgs_list.csv.zip. Unzip the latter two to get the image directories and the driver image list:

unzip imgs.zip<br>
unzip driver_imgs_list.csv.zip
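
If you'd rather stay inside the notebook than shell out, the unzip step can be sketched with Python's standard `zipfile` module. This is only a minimal sketch (Python 3): `extract_archives` is a hypothetical helper name of mine, and the archive names in the example call are the ones shown in the download log above.

```python
import zipfile
from pathlib import Path


def extract_archives(data_dir, archive_names):
    """Extract each named zip archive into data_dir and return the names extracted."""
    data_dir = Path(data_dir)
    extracted = []
    for name in archive_names:
        archive_path = data_dir / name
        if not archive_path.is_file():
            # Skip archives that were never downloaded instead of raising.
            continue
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(data_dir)
        extracted.append(name)
    return extracted


# Example call (hypothetical helper; adjust the directory and names to your setup):
# extract_archives("data", ["imgs.zip", "driver_imgs_list.csv.zip"])
```

The helper returns the list of archives it actually extracted, so a missing download shows up as a shorter list rather than an exception.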

In [None]:
!unzip driver_imgs_list.csv.zip

Archive:  driver_imgs_list.csv.zip
  inflating: driver_imgs_list.csv    
Archive:  imgs.zip
   creating: test/
  inflating: test/img_1.jpg          
  inflating: test/img_10.jpg         
  inflating: test/img_100.jpg        
  inflating: test/img_1000.jpg       
  inflating: test/img_100000.jpg     
  inflating: test/img_100001.jpg     
  inflating: test/img_100002.jpg     
  inflating: test/img_100003.jpg     
  inflating: test/img_100004.jpg     
  inflating: test/img_100005.jpg     
  inflating: test/img_100007.jpg     
  inflating: test/img_100008.jpg     
  inflating: test/img_100009.jpg     
  inflating: test/img_10001.jpg      
  inflating: test/img_100010.jpg     
  inflating: test/img_100011.jpg     
  inflating: test/img_100012.jpg     
  inflating: test/img_100013.jpg     
  inflating: test/img_100014.jpg     
  inflating: test/img_100016.jpg     
  inflating: test/img_100017.jpg     
  inflating: test/img_100018.jpg     
  inflating: test/img_100019.jpg     
  inflating: te

In [5]:
!ls

driver_imgs_list.csv.zip  imgs.zip  sample_submission.csv.zip


## Step 4 - Explore the data


ls test/ | wc -l   # 12500 images in the test set<br>
ls train/ | wc -l  # 25000 images in the training set

ls kaggle_dogscats/train/ | less<br>
ls kaggle_dogscats/train/ | tail   # in the training set, file names have the format class.image_id.jpg, e.g. dog.9993.jpg

ls kaggle_dogscats/test/ | tail    # in the test set, file names have the format image_id.jpg, e.g. 9993.jpg

ls kaggle_dogscats/train/ | grep 'dog' | wc -l    # 12500 dog images in the training set<br>
ls kaggle_dogscats/train/ | grep 'cat' | wc -l    # 12500 cat images in the training set

The two classes are balanced, so I'll shuffle and sample the training data uniformly.
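
The same class counts can be had in one pass from Python. A minimal sketch (Python 3), assuming the training file names follow the class.image_id.jpg pattern described above; `count_images_by_class` is a hypothetical name of mine, not something from the course code.

```python
import os
from collections import Counter


def count_images_by_class(image_dir):
    """Count .jpg files per class, reading the class from the text before the first dot."""
    counts = Counter()
    for filename in os.listdir(image_dir):
        if not filename.lower().endswith(".jpg"):
            continue
        # e.g. "dog.9993.jpg" -> "dog"
        class_name = filename.split(".", 1)[0]
        counts[class_name] += 1
    return counts


# Example call (hypothetical helper):
# count_images_by_class("train")
```

This only makes sense for the training directory. Test file names carry no class prefix, so each one would end up as its own "class".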


In [23]:
!ls test/ | wc -l
!ls train/ | wc -l

12500
25000


In [24]:
!ls train/ | tail

dog.9993.jpg
dog.9994.jpg
dog.9995.jpg
dog.9996.jpg
dog.9997.jpg
dog.9998.jpg
dog.9999.jpg
dog.999.jpg
dog.99.jpg
dog.9.jpg


In [25]:
!ls test/ | tail

9993.jpg
9994.jpg
9995.jpg
9996.jpg
9997.jpg
9998.jpg
9999.jpg
999.jpg
99.jpg
9.jpg


In [27]:
!ls train/ | grep 'dog' | wc -l
!ls train/ | grep 'cat' | wc -l

12500
12500


## Step 5 - Split images into training, test, validation and sample sets

Import "split_images_into_directories.py" and call its split function.

In [29]:
#%cd ..

/home/ubuntu/kaggle_dogs_cats


In [30]:
from split_images_into_directories import *

In [31]:
split_image_dataset_into_train_test_validation_sample("data_redux", 
                                                       response_classes=['dog','cat'],
                                                       perc_split = 0.8)

0 files were unable to be classified into response classes in the 'train' sub-directory within the 'data_redux/sample' directory
0 files were unable to be classified into response classes in the 'valid' sub-directory within the 'data_redux/sample' directory
0 files were unable to be classified into response classes in the 'train' sub-directory within the 'data_redux/train' directory
0 files were unable to be classified into response classes in the 'valid' sub-directory within the 'data_redux/train' directory


'Success'

The whole process took barely 2 seconds!
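
The script itself isn't reproduced in this notebook, so here is a rough sketch of what a split like this might look like. It is not the actual split_images_into_directories.py. It assumes Python 3, a flat source directory of class.image_id.jpg files, and that files are moved rather than copied. `split_train_valid` and its `seed` parameter are hypothetical names of mine, and the sample sets are left out.

```python
import os
import random
import shutil


def split_train_valid(source_dir, dest_dir, response_classes, perc_split=0.8, seed=0):
    """Move '<class>.<id>.jpg' files into dest_dir/{train,valid}/<class>/.

    Returns the file names that matched none of the response classes.
    """
    for subset in ("train", "valid"):
        for class_name in response_classes:
            os.makedirs(os.path.join(dest_dir, subset, class_name), exist_ok=True)

    # Only plain files are candidates; sub-directories are ignored.
    filenames = sorted(
        name for name in os.listdir(source_dir)
        if os.path.isfile(os.path.join(source_dir, name))
    )
    matched = [n for n in filenames if n.split(".", 1)[0] in response_classes]
    unclassified = [n for n in filenames if n.split(".", 1)[0] not in response_classes]

    # Shuffle uniformly, then send the first perc_split share to train, the rest to valid.
    random.Random(seed).shuffle(matched)
    cut = int(len(matched) * perc_split)
    for index, name in enumerate(matched):
        subset = "train" if index < cut else "valid"
        class_name = name.split(".", 1)[0]
        shutil.move(os.path.join(source_dir, name),
                    os.path.join(dest_dir, subset, class_name, name))
    return unclassified
```

Because the shuffle runs over all files at once rather than per class, the overall train/valid ratio is exact but the per-class counts can drift slightly from it.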

## Step 6 - Check that the directories are laid out the way you want

I manually checked that the files were split the way I wanted. It all checks out!

In [37]:
os.listdir('data_redux')

['train.zip',
 'valid',
 'sample_submission.csv',
 'unprocessed',
 'train',
 'test',
 'sample',
 'test.zip']

In [41]:
for subdirs in ['valid','unprocessed','train','test','sample']:
    print (os.path.join('data_redux',subdirs),len(os.listdir(os.path.join('data_redux',subdirs))))

('data_redux/valid', 2)
('data_redux/unprocessed', 0)
('data_redux/train', 2)
('data_redux/test', 12500)
('data_redux/sample', 2)


In [44]:
!ls data_redux/valid/cat | wc -l
!ls data_redux/valid/dog | wc -l
!ls data_redux/train/cat | wc -l
!ls data_redux/train/dog | wc -l

2520
2480
9980
10020


In [46]:
!ls data_redux/sample/train/cat | wc -l
!ls data_redux/sample/valid/cat | wc -l

43
7
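
The per-directory checks above can be rolled into one helper. A minimal sketch (Python 3), assuming the root/subset/class layout shown in the listings; `count_files_per_class` is a hypothetical name of mine.

```python
import os


def count_files_per_class(root_dir, subsets=("train", "valid")):
    """Return {(subset, class_name): file_count} for root_dir/<subset>/<class_name>/."""
    counts = {}
    for subset in subsets:
        subset_dir = os.path.join(root_dir, subset)
        for class_name in sorted(os.listdir(subset_dir)):
            class_dir = os.path.join(subset_dir, class_name)
            if os.path.isdir(class_dir):
                counts[(subset, class_name)] = len(os.listdir(class_dir))
    return counts


# Example call (hypothetical helper):
# count_files_per_class("data_redux")
```

As a sanity check on the counts above: train holds 9980 + 10020 = 20000 images and valid holds 2520 + 2480 = 5000, which is the 80/20 split of 25000 that perc_split = 0.8 asks for.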


## Step 7 - Rewrite the code taught in lesson 1 of the fast.ai course for the new dataset

FIN. Step 7 onwards will be in another ipynb. I want to break here because it's a logical stopping point and I'm hungry. I also don't want to rerun the image-splitting code every time I start the Amazon instance.