<a href="https://colab.research.google.com/github/sboomi/exploradome_tangram/blob/master/Laura_create_balanced_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2018 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Image classification

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/images/classification"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/images/classification.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/images/classification.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/images/classification.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

This tutorial shows how to classify cats or dogs from images. It builds an image classifier using a `tf.keras.Sequential` model and loads data using `tf.keras.preprocessing.image.ImageDataGenerator`. You will get some practical experience and develop intuition for the following concepts:

* Building _data input pipelines_ using the `tf.keras.preprocessing.image.ImageDataGenerator` class to efficiently work with data on disk to use with the model.
* _Overfitting_: how to identify and prevent it.
* _Data augmentation_ and _dropout_: key techniques for fighting overfitting in computer vision tasks, to incorporate into the data pipeline and the image classifier model.

This tutorial follows a basic machine learning workflow:

1. Examine and understand data
2. Build an input pipeline
3. Build the model
4. Train the model
5. Test the model
6. Improve the model and repeat the process

## Import packages

Let's start by importing the required packages. The `os` package is used to read files and the directory structure, NumPy is used to convert Python lists to NumPy arrays and to perform the required matrix operations, and `matplotlib.pyplot` is used to plot graphs and display images from the training and validation data.

Import TensorFlow and the Keras classes needed to construct our model.

In [None]:
import tensorflow as tf

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ModelCheckpoint, CSVLogger


import os
import numpy as np
import matplotlib.pyplot as plt

## Load data

Begin by downloading the dataset. This tutorial uses a filtered version of the <a href="https://www.kaggle.com/c/dogs-vs-cats/data" target="_blank">Dogs vs Cats</a> dataset from Kaggle. Download the archive version of the dataset and store it in the "/tmp/" directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

After mounting the drive and extracting the dataset's contents, assign variables with the file paths of the training and validation sets.

In [None]:
import os

PATH = '/content/drive/My Drive/data/'
train_dir = os.path.join(PATH, 'train_full')
validation_dir = os.path.join(PATH, 'test_full')

### Understand the data

Let's look at how many cat and dog images are in the training and validation directories:

In [None]:
num_tr = len(os.listdir(train_dir))
num_val = len(os.listdir(validation_dir))

In [None]:
print(os.listdir(train_dir))

In [None]:
num_tr

We can verify with the preceding output that we have the same number of images for each category. Let's now build our smaller dataset, so that each category has 140 training images and 28 test images (20% of the per-category training size).
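
The sizes above follow from simple arithmetic: a balanced training set can hold at most as many images per category as the smallest category contains, and the test size is 20% of that. A minimal sketch, assuming hypothetical per-category counts (only the value 140 for `maison` comes from this notebook; the other category names and counts are made up for illustration):

```python
# Hypothetical per-category image counts; only the 140 for "maison" is from the notebook.
class_counts = {"maison": 140, "class_b": 212, "class_c": 187}

# A balanced set is limited by the smallest category.
n_train_per_class = min(class_counts.values())

# Test images per category: 20% of the per-category training size.
test_fraction = 0.2
n_test_per_class = round(n_train_per_class * test_fraction)

print(n_train_per_class, n_test_per_class)  # 140 28
```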

In [None]:
# Count the images in each category and collect the file names per category

string_train = []

for i in os.listdir(train_dir):
  class_dir = os.path.join(train_dir, i)
  files = os.listdir(class_dir)
  print(class_dir)
  print(len(files))
  string_train.append(files)
  print(string_train)

In [None]:
# Count the images in each category and collect the file names per category

string_valid = []

for i in os.listdir(validation_dir):
  class_dir = os.path.join(validation_dir, i)
  files = os.listdir(class_dir)
  print(class_dir)
  print(len(files))
  string_valid.append(files)
  print(string_valid)
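
The two loops above keep file names in a list whose positions must later be matched to category names by index. One alternative, sketched here under assumptions, is to key the file lists by category name so the pairing cannot drift. The helper `list_images_by_class` and the toy directory tree are hypothetical and not part of the original notebook:

```python
import os
import tempfile

def list_images_by_class(root_dir):
    """Return {category name: sorted file names} for each subfolder of root_dir."""
    files_by_class = {}
    for class_name in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, class_name)
        if os.path.isdir(class_dir):
            files_by_class[class_name] = sorted(os.listdir(class_dir))
    return files_by_class

# Toy directory tree standing in for train_dir (category names are made up).
demo_root = tempfile.mkdtemp()
for class_name, n_files in [("class_a", 3), ("class_b", 2)]:
    os.makedirs(os.path.join(demo_root, class_name))
    for k in range(n_files):
        open(os.path.join(demo_root, class_name, f"img_{k}.jpg"), "w").close()

demo_files = list_images_by_class(demo_root)
for class_name, files in demo_files.items():
    print(class_name, len(files))
```

Sorting both the category names and the file names makes the result independent of the arbitrary order in which `os.listdir` returns entries.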

In [None]:
len(string_train)
string_train[4]

In [None]:
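# Randomly sample the same number of images from each category

import random

def split_train_balanced(string, nb):
  # string: list of per-category file-name lists; nb: number of images to draw per category
  class_train_balanced = []
  for i in range(len(string)):
    # random.sample draws nb distinct items; it raises ValueError if the category has fewer
    class_train = random.sample(string[i], k=nb)
    class_train_balanced.append(class_train)
  return class_train_balanced

train_balanced = split_train_balanced(string_train, nb=140)
# nb=140 because the 'maison' (house) category has 140 images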
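
Because the sampling is random, each run selects different images. If a repeatable split is wanted, one option is to draw from a seeded generator. The sketch below uses toy file names; the function `sample_balanced` and its `seed` parameter are assumptions for illustration, not part of the original notebook:

```python
import random

def sample_balanced(files_per_class, nb, seed=0):
    """Draw nb distinct file names from each category, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    return [rng.sample(files, k=nb) for files in files_per_class]

# Toy data: two categories with 5 and 8 made-up file names.
toy_classes = [
    [f"a_{k}.jpg" for k in range(5)],
    [f"b_{k}.jpg" for k in range(8)],
]

# The largest value of nb that works for every category is the smallest category size.
max_nb = min(len(files) for files in toy_classes)

toy_balanced = sample_balanced(toy_classes, nb=3)
print(max_nb, [len(sample) for sample in toy_balanced])  # 5 [3, 3]
```

Asking for more images than a category holds makes `random.sample` raise `ValueError`, so `nb` should not exceed `max_nb`.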

In [None]:
print(train_balanced)
train_balanced[5]
len(train_balanced)

In [None]:
list_class = os.listdir(train_dir)
list_class[0]

In [None]:
# Copy the selected images into a new folder, one subfolder per category

import shutil

source_train = train_dir
source_test = validation_dir
dest_train = "/content/drive/My Drive/data/train_balanced/"
dest_test = "/content/drive/My Drive/data/test_balanced/"


def copy_file(source, dest, data_balanced):
  # Relies on the global list_class: list_class[i] must be the category of data_balanced[i]
  copied = []
  for i in range(len(data_balanced)):
    class_dest = os.path.join(dest, list_class[i])
    # shutil.copy does not create missing folders, so create the category folder first
    os.makedirs(class_dest, exist_ok=True)
    for j in range(len(data_balanced[i])):
      # Copy the file into the category folder
      newPath = shutil.copy(os.path.join(source, list_class[i], data_balanced[i][j]), class_dest)
      print("Path of copied file : ", newPath)
      copied.append(newPath)
  return copied

train_balanced_img = copy_file(source_train, dest_train, train_balanced)
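
The copy step can be tried safely on a throwaway directory before touching the data on Drive. The following is a minimal sketch under assumptions: `copy_balanced` is a hypothetical variant that takes a dictionary of category name to file names instead of relying on a global list, and the toy tree and file names are made up:

```python
import os
import shutil
import tempfile

def copy_balanced(source, dest, files_by_class):
    """Copy each category's selected files from source/<category>/ to dest/<category>/."""
    copied = []
    for class_name, file_names in files_by_class.items():
        class_dest = os.path.join(dest, class_name)
        os.makedirs(class_dest, exist_ok=True)  # shutil.copy does not create folders
        for file_name in file_names:
            src_path = os.path.join(source, class_name, file_name)
            copied.append(shutil.copy(src_path, class_dest))
    return copied

# Toy source tree: two categories with three placeholder files each.
demo_source = tempfile.mkdtemp()
demo_dest = tempfile.mkdtemp()
selection = {
    "class_a": ["img_0.jpg", "img_1.jpg"],
    "class_b": ["img_0.jpg", "img_2.jpg"],
}
for class_name in selection:
    os.makedirs(os.path.join(demo_source, class_name))
    for k in range(3):
        with open(os.path.join(demo_source, class_name, f"img_{k}.jpg"), "w") as f:
            f.write("placeholder")

copied_paths = copy_balanced(demo_source, demo_dest, selection)
print(len(copied_paths))  # 4
```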

In [None]:
for i in os.listdir("/content/drive/My Drive/data/train_balanced"):
  print("/content/drive/My Drive/data/train_balanced"+"/"+i)
  print(len(os.listdir("/content/drive/My Drive/data/train_balanced"+"/"+i))) 

In [None]:
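# copy_file reads the global list_class, so align it with the order used to build string_valid
list_class = os.listdir(validation_dir)
test_balanced = split_train_balanced(string_valid, nb=28)
# nb=28, i.e. 20% of the 140 training images per category
test_balanced_img = copy_file(source_test, dest_test, test_balanced)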

In [None]:
for i in os.listdir("/content/drive/My Drive/data/test_balanced"):
  print("/content/drive/My Drive/data/test_balanced"+"/"+i)
  print(len(os.listdir("/content/drive/My Drive/data/test_balanced"+"/"+i))) 
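
Rather than reading the printed counts by eye, the balance can be checked programmatically. A minimal sketch, assuming the same one-subfolder-per-category layout; the helpers `count_images_per_class` and `is_balanced` and the toy tree are hypothetical:

```python
import os
import tempfile

def count_images_per_class(root_dir):
    """Return {category name: number of files} for each subfolder of root_dir."""
    counts = {}
    for class_name in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, class_name)
        if os.path.isdir(class_dir):
            counts[class_name] = len(os.listdir(class_dir))
    return counts

def is_balanced(counts, expected=None):
    """True if all categories have the same size (and it equals expected, if given)."""
    sizes = set(counts.values())
    if expected is not None:
        return sizes == {expected}
    return len(sizes) <= 1

# Toy tree standing in for the balanced folder: two categories, three files each.
demo_balanced_dir = tempfile.mkdtemp()
for class_name in ["class_a", "class_b"]:
    os.makedirs(os.path.join(demo_balanced_dir, class_name))
    for k in range(3):
        open(os.path.join(demo_balanced_dir, class_name, f"img_{k}.jpg"), "w").close()

demo_counts = count_images_per_class(demo_balanced_dir)
balanced_ok = is_balanced(demo_counts, expected=3)
print(demo_counts, balanced_ok)
```

On the real data, the same check would be called with `expected=140` for the training folder and `expected=28` for the test folder.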