# Data Collection and Preparation  

---  

# Objective of this notebook
* Prepare the image sets before the modelling phase

---
## 1. Importing packages & modules

In [4]:
# Common modules/packages
import math
import matplotlib.pyplot as plt
import numpy as np
import pathlib, os, shutil
import random
import requests
import warnings

from zipfile import ZipFile
from PIL import Image, UnidentifiedImageError

from csv import reader

warnings.filterwarnings('ignore')

In [2]:
# Mount Google Drive folder
from google.colab import drive
drive.mount('/content/drive')
# change current directory after mounting
PROJ_DIR = '/content/drive/MyDrive/art-classifier'
%cd $PROJ_DIR

Mounted at /content/drive
/content/drive/MyDrive/art-classifier


## 2. Data Collection
Modelling requires three image sets (`train`, `test` and `valid`), so a sufficient number of images must be collected beforehand. The subsections below describe how to obtain them.

### 2.1. Extract from archive file
A set of images is provided as an archive (`downloads.zip`) located in the `Dataset` folder.

In [3]:
pathToDataset = pathlib.Path.cwd().joinpath('Dataset')
os.chdir(pathToDataset)

images_file = os.path.join(pathToDataset, 'downloads.zip')

# Extract all the images to the `downloads` folder
with ZipFile(images_file, 'r') as zipObj:
    zipObj.extractall(pathToDataset)
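
Before extracting, it can be useful to confirm that the archive is readable and to see what it contains. The following is a minimal sketch using only the standard `zipfile` module; the helper name `inspect_archive` is hypothetical and not part of the notebook.

```python
from zipfile import ZipFile


def inspect_archive(archive_path):
    """Return (member_names, first_bad_member) for a zip archive.

    Hypothetical helper for illustration: `first_bad_member` is None
    when every member passes the CRC check.
    """
    with ZipFile(archive_path, 'r') as zip_obj:
        names = zip_obj.namelist()      # all paths stored in the archive
        bad_member = zip_obj.testzip()  # name of the first corrupt member, or None
    return names, bad_member
```

For example, `names, bad = inspect_archive(images_file)` could be run first, with extraction proceeding only if `bad` is `None`.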

### 2.2. Direct download
The method below lets you be more specific, for example when you want to focus on one art category or already have a list of images. It assumes that you have one or more text files, each containing links to images. Two such files are provided as examples.
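
To illustrate the expected input: each list file is assumed to be a plain-text file named after an art category (for example `cubism.txt`) with one image URL per line. The sketch below shows one way to read such a file; the helper name `read_url_list` and the skipping of blank lines are assumptions, not part of the notebook.

```python
import pathlib


def read_url_list(filepath):
    """Return (category, urls) for a text file holding one URL per line.

    Hypothetical helper: the category is taken from the file name
    without its extension, and blank lines are ignored.
    """
    path = pathlib.Path(filepath)
    category = path.stem  # 'cubism.txt' -> 'cubism'
    with open(path) as url_file:
        urls = [line.strip() for line in url_file if line.strip()]
    return category, urls
```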

In [4]:
###
# Download images from a list of urls
def download_listed_images(filepath):

    # Create the 'downloads' folder and the art category folder if they are missing;
    # the category name is the file name without its extension ('cubism.txt' -> 'cubism')
    pathToDownload = pathlib.Path.cwd().joinpath('downloads', pathlib.Path(filepath).stem)
    pathToDownload.mkdir(parents=True, exist_ok=True)

    # grab the list of URLs from the input file, then initialize the total number of images downloaded so far
    with open(filepath) as url_file:
        urls = url_file.read().strip().split("\n")
    urlCounter = 0

    # loop over the URLs
    for url in urls:
        try:
            # try to download the image
            req = requests.get(url, timeout=60)

            # save the image to disk
            pathToDownloadedImage = pathToDownload.joinpath("{}.jpg".format(str(urlCounter).zfill(8)))
            with open(pathToDownloadedImage, "wb") as downloaded_image:
                downloaded_image.write(req.content)

            # update the counter
            print("[INFO] downloaded: {}".format(pathToDownloadedImage))
            urlCounter += 1

        # handle network and file errors raised during the download process;
        # report the URL, since the target path may not be defined yet
        except (requests.exceptions.RequestException, OSError):
            print("[INFO] error downloading {}...skipping".format(url))

Invoke the `download_listed_images` function for each file containing the URLs:

In [5]:
###
# List of files
image_files = ['cubism.txt', 'surrealism.txt']
for image_file in image_files:
    download_listed_images(image_file)

[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000000.jpg
[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000001.jpg
[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000002.jpg
[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000003.jpg
[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000004.jpg
[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000005.jpg
[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000006.jpg
[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000007.jpg
[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000008.jpg
[INFO] downloaded: /content/drive/MyDrive/art-classifier/Dataset/downloads/cubism/00000009.jpg
[INFO] downloaded: /content/drive/MyDrive/art-clas

## 3. Data Preparation
Now that the images are downloaded, begin to prepare the datasets. 

For example, the training images are all stored in a directory path that looks like this:
```
dataset/train/artCategory_1/abc123.jpg
dataset/train/artCategory_1/abc456.jpg
dataset/train/artCategory_1/abc789.jpg
...
dataset/train/artCategory_2/abc123.jpg
dataset/train/artCategory_2/abc456.jpg
dataset/train/artCategory_2/abc789.jpg
```

In this case, the root folder for training is `dataset/train`, and each subfolder name is a class, here an art category. Likewise, `dataset/valid` and `dataset/test` hold the validation and test images respectively.
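
As a quick check of this layout, the number of images per split and per class can be counted by walking the folders. This is a minimal sketch: the helper name `count_images` is hypothetical, and it assumes that every image uses the `.jpg` extension.

```python
import pathlib


def count_images(dataset_root, splits=('train', 'valid', 'test')):
    """Return {split: {class_name: number_of_jpg_files}}.

    Hypothetical helper; splits whose folder does not exist are skipped.
    """
    root = pathlib.Path(dataset_root)
    counts = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            continue
        counts[split] = {
            class_dir.name: len(list(class_dir.glob('*.jpg')))
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts
```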

### 3.1. Preparation functions

Before separating the images, define two utility functions:
* One function should return a list of files present in a specific directory
* One function should return a sorted list of folder names present in a specific directory

In [3]:
# Retrieves the list of files within a directory (`extension` is a glob pattern)
def getFilesInDirectory(pathToDir, extension = "*.*"):
    dir_path = pathlib.Path(pathToDir)
    file_list = list(dir_path.glob(extension))
    return file_list

# Retrieves the sorted list of folder names within a directory
def getFolderNamesInDirectory(pathToDir):
    dir_path = pathlib.Path(pathToDir)
    # 'entry' avoids shadowing the built-in dir()
    dir_list = sorted(entry.name for entry in dir_path.iterdir() if entry.is_dir())
    return dir_list

### 3.2. Prepare the images
* Set the location for `train`, `test` and `valid` folders and create the missing folders

In [6]:
###
# Sets the root folder for image sets
pathToDataset  = pathlib.Path.cwd()
pathToDownload = pathToDataset.joinpath('downloads')

# mkdir(exist_ok=True) creates the folder only when it is missing
pathToTrain = pathToDataset.joinpath('train')
pathToTrain.mkdir(exist_ok=True)

pathToTest = pathToDataset.joinpath('test')
pathToTest.mkdir(exist_ok=True)

pathToValid = pathToDataset.joinpath('valid')
pathToValid.mkdir(exist_ok=True)

###
# Sets the folder for models (where all the models will be saved)
pathToModels = pathToDataset.joinpath('..', 'models')
pathToModels.mkdir(exist_ok=True)

* Count the number of art categories and list them using the function above

In [7]:
# list the folders required under 'dataset' folder (using a list to reduce the lines of code)
artCategories = getFolderNamesInDirectory(pathToDownload)  #collects the list of folders
print("Total no. of categories = ", len(artCategories))  #displays the number of classes (= Art categories)
print("Categories: ", artCategories)  #displays the list of classes

Total no. of categories =  6
Categories:  ['cubism', 'genre', 'landscape', 'portrait', 'still-life', 'surrealism']


* For each art category in the downloads folder, shuffle the images and move them to the `train` folder (about 60% of them), the `test` folder (about 20%) and the `valid` folder (the remainder, about 20%)

In [9]:
# For each art category
for artCategory in artCategories:

    # Sets the source folder
    path_source = pathToDownload.joinpath(artCategory)

    # Sets the datasets
    image_list = getFilesInDirectory(path_source, '*.jpg') # lists all the 'jpg' images in the folder
    random.shuffle(image_list) # shuffle image list
    split_index = int(round(len(image_list)/5)) # Size of one fifth (20%) of the images

    # Split the files across the 3 datasets: 3/5 train, 1/5 test, remainder valid
    split_images = [
        (pathToTrain, image_list[:3*split_index]),
        (pathToTest,  image_list[3*split_index:4*split_index]),
        (pathToValid, image_list[4*split_index:]),
    ]

    # Creates each target folder if needed and moves the images into it
    for path_root, img_files in split_images:
        path_target = path_root.joinpath(artCategory)
        path_target.mkdir(exist_ok=True)
        for img_file in img_files:
            shutil.move(img_file, path_target.joinpath(img_file.name))
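
The split arithmetic used in the cell above can be worked through on its own. The sketch below reproduces only the index computation (one fifth of the images, rounded, with the validation set taking whatever is left); the function name `split_sizes` is hypothetical and no files are touched.

```python
def split_sizes(n_images):
    """Sizes of the (train, test, valid) subsets for a folder of n_images files."""
    split_index = int(round(n_images / 5))   # one fifth of the images, rounded
    items = list(range(n_images))            # stand-in for the shuffled file list
    train = items[:3 * split_index]
    test = items[3 * split_index:4 * split_index]
    valid = items[4 * split_index:]          # whatever is left over
    return len(train), len(test), len(valid)


for n in (100, 103, 12):
    print(n, split_sizes(n))
# 100 (60, 20, 20)
# 103 (63, 21, 19)
# 12 (6, 2, 4)
```

Because the validation set receives the remainder, the proportions drift away from 60/20/20 for small folders: with 12 images, a third of them end up in `valid`.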


## 4. Check the folder content

The project folder now has the following structure:
 * root >  dataset >  downloads  
 * root >  dataset >  test  
 * root >  dataset >  train  
 * root >  dataset >  valid  

Some of the downloaded files might be corrupted or might not be images at all. Define a function to remove these files.

In [15]:
def cleanImages(location):
    artCategories = getFolderNamesInDirectory(location)

    # For each art category
    for artCategory in artCategories:

        # Sets the source folder
        path_source = location.joinpath(artCategory)

        # Sets the datasets
        files = getFilesInDirectory(path_source, '*.jpg')    # lists all the 'jpg' images in the folder

        for file in files:
            try:
                # Image.open identifies the file from its header;
                # the context manager closes the file before it may be removed
                with Image.open(file):
                    pass
            # UnidentifiedImageError is a subclass of OSError (alias IOError), so a single
            # clause covers both unreadable files and files that are not images
            except OSError:
                print("IOError:")
                print(file)
                os.remove(file)

pathToTrain = pathlib.Path.cwd().joinpath('..', 'Dataset', 'train')
cleanImages(pathToTrain)

pathToValid = pathlib.Path.cwd().joinpath('..', 'Dataset', 'valid')
cleanImages(pathToValid)

pathToTest = pathlib.Path.cwd().joinpath('..', 'Dataset', 'test')
cleanImages(pathToTest)


IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/cubism/00000066.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/cubism/00000068.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/cubism/00000130.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/cubism/00000155.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/cubism/00000227.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/cubism/00000230.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/cubism/00000245.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/surrealism/00000014.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/surrealism/00000053.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/valid/surrealism/00000070.jpg
IOError:
/content/drive/MyDrive/art-classifier/Dataset/../Dataset/va
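
Because the download step saves every response with a `.jpg` extension, some files may contain something else entirely. As a cheap complement to the Pillow-based check, the sketch below flags files that do not start with the JPEG start-of-image marker (bytes `FF D8`). The helper name `find_non_jpeg_files` is hypothetical, and the check only looks at the first two bytes: it would not catch truncated images, and it would also flag valid images stored in another format such as PNG.

```python
import pathlib

JPEG_MAGIC = b'\xff\xd8'  # every JPEG file starts with the SOI marker FF D8


def find_non_jpeg_files(folder):
    """Return the '*.jpg' files under `folder` that lack the JPEG signature.

    Hypothetical helper: it only reports suspects and deletes nothing.
    """
    suspects = []
    for path in sorted(pathlib.Path(folder).rglob('*.jpg')):
        with open(path, 'rb') as handle:
            if handle.read(2) != JPEG_MAGIC:
                suspects.append(path)
    return suspects
```

Reporting suspects rather than deleting them leaves the decision to the caller, for instance `find_non_jpeg_files(pathToValid)` could be reviewed before anything is removed.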