## Land Use Classification | Prepare Dataset

The data for this project comes from the German Research Center for Artificial Intelligence's open-source EuroSAT Sentinel-2 satellite image [dataset](http://madm.dfki.de/downloads). I downloaded the data locally and used this notebook to split it into train, validation, and test folders suitable for Keras.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import os
import shutil
import random
from osgeo import gdal

In [3]:
path = "E:/land-use-classification-cnn-master/"
SEED = 123
random.seed(SEED)  # random.seed() returns None, so keep the seed value in its own variable

### Creating Folders

In order to be compatible with the keras function `flow_from_directory`, I created train, validation, and test set folders for the data. Within each folder, the data was separated into more folders by category. 

In [4]:
# get names of categories
categories = []
tif_files = []
for (dirpath, dirnames, filenames) in os.walk(path + 'land-use-tif/'):
    categories.extend(dirnames)
    tif_files += filenames

In [5]:
# create new folder with test train and valid sets with folders of all categories

# make train, test, valid folders
split_names = ['train', 'test', 'valid']
for sp_name in split_names:
    directory = path + 'land-use-jpeg/' + sp_name + '/'
    if not os.path.exists(directory):
        os.makedirs(directory) #here directory is land-use-jpeg -> train , test, valid
    # make category folders
    for category in categories:
        dir_cat = directory + category + '/'  
        if not os.path.exists(dir_cat):
            os.makedirs(dir_cat) #here directory is dir_cat = train -> .. .. .. , test -> .. .. .. , valid -> .. .. ..

### Translating File Type

Compatibility with Keras/TensorFlow also required me to translate the files from tif to jpeg format. I found this easiest with the `gdal_translate` command-line utility, which I called from within the notebook using the `!` notation.

When translating, I selected bands 2, 3, and 4, which for Sentinel-2 are the blue, green, and red bands. Choosing these visible-light bands produces a conventional color image, rather than a specialized satellite image that may have extra near-infrared, red-edge, or short-wave infrared bands. This makes the model more applicable to ordinary, everyday images of the outdoors.
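
As a rough sketch of the same call outside of the notebook's `!` notation, the command can be assembled as an argument list and run with `subprocess`. The helper names `build_gdal_translate_cmd` and `translate_to_jpeg` are hypothetical and not part of this project. The flags and their order mirror the command used in the cell below, and running it assumes the GDAL command-line tools are on the PATH.

```python
import subprocess

def build_gdal_translate_cmd(img_in, img_out, bands=(2, 3, 4)):
    """Build the argument list for a gdal_translate call that writes a JPEG."""
    cmd = ["gdal_translate", "-of", "JPEG", img_in, img_out]
    for band in bands:
        cmd += ["-b", str(band)]  # one -b flag per selected input band
    cmd.append("-scale")  # with no arguments, rescales the input range to 0-255
    return cmd

def translate_to_jpeg(img_in, img_out, bands=(2, 3, 4)):
    """Run gdal_translate; raises CalledProcessError if the command fails."""
    subprocess.run(build_gdal_translate_cmd(img_in, img_out, bands), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.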

After translating a file, I moved it to the training data folder. Once all files of a category were translated and moved, I separated the jpeg files into the validation and test data folders with a train:validation:test split of 80:10:10.
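
The 80:10:10 split described above can be sketched as a small standalone function. The name `split_filenames` and its parameters are hypothetical. It uses a local `random.Random` instance rather than the global seed, which is a design choice for this illustration and not necessarily what the notebook does.

```python
import random

def split_filenames(filenames, train_frac=0.8, valid_frac=0.1, seed=123):
    """Shuffle file names and cut them into train, valid, and test lists."""
    names = sorted(filenames)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)   # local RNG; the global random state is untouched
    split_1 = int(train_frac * len(names))
    split_2 = int((train_frac + valid_frac) * len(names))
    return names[:split_1], names[split_1:split_2], names[split_2:]

# Example with 100 placeholder file names
train, valid, test = split_filenames([f"img_{i}.jpeg" for i in range(100)])
print(len(train), len(valid), len(test))
```

Because the list is sorted before shuffling with a fixed seed, repeated calls on the same set of names return the same partition.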

In [6]:
%%capture
for category in categories:
    directory = path + 'land-use-tif/' + category + '/'
    directory_train = path + 'land-use-jpeg/train/' + category + '/'
    directory_valid = path + 'land-use-jpeg/valid/' + category + '/'
    directory_test = path + 'land-use-jpeg/test/' + category + '/'
    
    cat_files = os.listdir(directory)
    
    if '.DS_Store' in cat_files:
        cat_files.remove('.DS_Store')
        
    # translate files from .tif to .jpeg
    for file in cat_files:
        file_no_ext = file.split('.')[0] # separate file name from extension
        img_in = directory + file
        img_out = directory + file_no_ext + '.jpeg'
            
        if not os.path.exists(directory_train + file_no_ext + '.jpeg'):
            !gdal_translate -of JPEG $img_in $img_out -b 2 -b 3 -b 4 -scale
            if os.path.exists(img_out):
                shutil.move(img_out, directory_train + file_no_ext + '.jpeg')
                
    # remove .xml files that come from translation
    # (re-list the directory: cat_files was read before the .xml files were created)
    for item in os.listdir(directory):
        if item.endswith(".xml"):
            os.remove(os.path.join(directory, item))
                
    # sort files into test and valid folders
    filenames = os.listdir(directory_train)
    filenames.sort()
    if '.DS_Store' in filenames:
        filenames.remove('.DS_Store')
    random.shuffle(filenames)
    split_1 = int(0.8 * len(filenames))
    split_2 = int(0.9 * len(filenames))
    train_filenames = filenames[:split_1]
    valid_filenames = filenames[split_1:split_2]
    test_filenames = filenames[split_2:]
        
    for file in filenames:
        if file in valid_filenames:
            shutil.move(directory_train + file, directory_valid + file)
        elif file in test_filenames:
            shutil.move(directory_train + file, directory_test + file)
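
After the split, a quick count per folder can confirm that each category ended up close to the intended 80:10:10 ratio. This is a minimal sketch and not part of the original notebook. `count_images` is a hypothetical helper, and it assumes the `<root>/<split>/<category>/` layout with `.jpeg` files created above.

```python
import os

def count_images(root, splits=("train", "valid", "test")):
    """Return {split: {category: number of .jpeg files}} for a split folder tree."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(root, split)
        counts[split] = {}
        if not os.path.isdir(split_dir):
            continue  # a missing split folder is reported as an empty dict
        for category in sorted(os.listdir(split_dir)):
            cat_dir = os.path.join(split_dir, category)
            if os.path.isdir(cat_dir):
                counts[split][category] = sum(
                    1 for f in os.listdir(cat_dir) if f.endswith(".jpeg")
                )
    return counts
```

In this notebook it would be called as `count_images(path + 'land-use-jpeg/')`.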

In [3]:
!gdal_translate -of JPEG $E:\\land-use-classification-cnn-master\\Sikkim_LULC $E:\\land-use-classification-cnn-master\\Sikkim_LULC1 -b 2 -b 3 -b 4 -scale

ERROR 4: `$E:\\land-use-classification-cnn-master\\Sikkim_LULC' does not exist in the file system, and is not recognized as a supported dataset name.
