<a href="https://colab.research.google.com/github/rahiakela/deep-learning-with-python-francois-chollet/blob/5-deep-learning-for-computer-vision/2_training_convnet_from_scratch_on_small_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a convnet from scratch on a small dataset

Having to train an image-classification model using very little data is a common situation, which you’ll likely encounter in practice if you ever do computer vision in a professional context. 

A “few” samples can mean anywhere from a few hundred to a few tens of thousands of images. As a practical example, we’ll focus on classifying images as dogs or cats, in a dataset containing 4,000 pictures of cats and dogs (2,000 cats, 2,000 dogs). We’ll use 2,000 pictures for training, 1,000 for validation, and 1,000 for testing.

We’ll review one basic strategy to tackle this problem: **training a new model from scratch using what little data you have**. You’ll start by naively training a small convnet on the 2,000 training samples, without any regularization, to set a baseline for what can be achieved. 

This will get you to a classification accuracy of 71%. At that point, the main issue will be overfitting. Then we’ll introduce data augmentation, a powerful technique for mitigating overfitting in computer vision. By using data augmentation, you’ll improve the network to reach an accuracy of 82%.

We’ll then review two more essential techniques for applying deep learning to small datasets:
* **feature extraction with a pretrained network**, which will get you to an accuracy of 90% to 96%
* **fine-tuning a pretrained network**, which will get you to a final accuracy of 97%

Together, these three strategies—training a small model from scratch, doing feature extraction using a pretrained model, and fine-tuning a pretrained model—will constitute your future toolbox for tackling the problem of performing image classification with small datasets.

## Setup

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

TensorFlow 2.x selected.


## The relevance of deep learning for small-data problems

You’ll sometimes hear that deep learning only works when lots of data is available. This is valid in part: one fundamental characteristic of deep learning is that it can find interesting features in the training data on its own, without any need for manual feature engineering, and this can only be achieved when lots of training examples are available. This is especially true for problems where the input samples are very high-dimensional, like images.



## Downloading the data

The Dogs vs. Cats dataset that you’ll use isn’t packaged with Keras. It was made available by Kaggle as part of a computer-vision competition in late 2013, back when convnets weren’t mainstream. You can download the original dataset from [www.kaggle.com/c/dogs-vs-cats/data](https://www.kaggle.com/c/dogs-vs-cats/data).

<img src='https://s3.amazonaws.com/book.keras.io/img/ch5/cats_vs_dogs_samples.jpg?raw=1' width='800'/>

Unsurprisingly, the dogs-versus-cats Kaggle competition in 2013 was won by entrants who used convnets. The best entries achieved up to 95% accuracy. In this example, you’ll get fairly close to this accuracy (in the next section), even though you’ll train your models on less than 10% of the data that was available to the competitors.

This dataset contains 25,000 images of dogs and cats (12,500 from each class) and is 543 MB (compressed). After downloading and uncompressing it, you’ll create a new dataset containing three subsets: a training set with 1,000 samples of each class, a validation set with 500 samples of each class, and a test set with 500 samples of each class.
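As a quick sanity check on the numbers above, the subset sizes and the share of the full dataset used for training can be computed directly. This is only arithmetic on the figures quoted in the text; the variable names are illustrative.

```python
# Figures quoted in the text above.
FULL_DATASET_SIZE = 25_000
SAMPLES_PER_CLASS = {"train": 1_000, "validation": 500, "test": 500}
NUM_CLASSES = 2  # cats and dogs

# Each subset holds the same number of images for both classes.
subset_sizes = {name: n * NUM_CLASSES for name, n in SAMPLES_PER_CLASS.items()}
total_used = sum(subset_sizes.values())
train_fraction = subset_sizes["train"] / FULL_DATASET_SIZE

print(subset_sizes)             # {'train': 2000, 'validation': 1000, 'test': 1000}
print(total_used)               # 4000
print(f"{train_fraction:.0%}")  # 8%
```

The training set is 8% of the 25,000 available images, which is where the "less than 10% of the data" figure comes from.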

Let's download the dataset from Kaggle.

In [0]:
# reference: https://medium.com/@saedhussain/google-colaboratory-and-kaggle-datasets-b57a83eb6ef8
# Install Kaggle library
! pip install -q kaggle

In [0]:
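# Copy the kaggle.json API token to ~/.kaggle (created if missing) and restrict its permissions
! mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json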
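The credential step can also be done in plain Python, which makes the two common pitfalls explicit: the `~/.kaggle` directory may not exist yet, and the Kaggle client warns when the token file is readable by other users. The following is a minimal sketch; `install_kaggle_credentials` is a hypothetical helper written for this notebook, not part of the `kaggle` package.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", dest_dir=Path.home() / ".kaggle"):
    """Copy an API token file into dest_dir and restrict its permissions.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    dest_dir = Path(dest_dir)
    # Create the target directory first; copying into a missing directory fails.
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    # Owner read/write only, the equivalent of `chmod 600`.
    dest.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return dest
```

In a Colab session, you would call `install_kaggle_credentials()` once after uploading `kaggle.json` to the working directory.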

In [10]:
# Download data for the dogs-vs-cats challenge
!kaggle competitions download -c dogs-vs-cats

Downloading train.zip to /content
 98% 533M/543M [00:05<00:00, 142MB/s]
100% 543M/543M [00:05<00:00, 111MB/s]
Downloading sampleSubmission.csv to /content
  0% 0.00/86.8k [00:00<?, ?B/s]
100% 86.8k/86.8k [00:00<00:00, 63.3MB/s]
Downloading test1.zip to /content
 95% 258M/271M [00:02<00:00, 111MB/s] 
100% 271M/271M [00:03<00:00, 91.2MB/s]
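The downloaded archives still need to be uncompressed before any images can be copied. One way to do that from Python is sketched below, assuming the archive sits in the working directory as `train.zip` (as the log above suggests). `extract_archive` and the destination directory are illustrative names, not something defined elsewhere in this notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the number of files extracted."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        # Count regular entries only, skipping directory entries.
        return sum(1 for info in archive.infolist() if not info.is_dir())
```

For example, `extract_archive("train.zip", "data")` would unpack the training archive into a `data` directory and report how many files it contained, which is a cheap way to confirm the download was complete.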


### Copying images to training, validation, and test directories