# Welcome  

Notebook Author: Samuel Alter  
Notebook Subject: Capstone Project - Preprocess Imagery

BrainStation Winter 2023: Data Science

This notebook processes the satellite imagery into tiles, or 'patches', for modelling, and cleans the files to ensure that only square images are fed into the image analysis.

# Imports

In [1]:
# imports

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

import os
import shutil
from PIL import Image
import cv2

## Split all images into `train`, `validation`, and `test` splits

### First, create a function for flow control. Used in an `if` statement, it can prevent a step from running again when its output files already exist in a folder. It's also a handy way to check how many files are in a directory.

In [2]:
def filePresenceSumChecker(directory:str,extension:str,count=True,verbose=False):
    '''
    Checks the sum of all the files with a certain extension.
    
    Useful to see if a file move process has already been completed.
    
    ----
    Inputs
    
    >directory
    path to a folder to check if files are there
    
    >extension
    user-specified extension to only count those files
    
    >verbose
    option for the user to print how many files with the extension 
    are in the directory provided
    
    >count
    option for the user to have the count of the files returned
    
    ----
    Outputs
    
    >counter
    gives the number of files within the directory (returned only if count is True)
    '''
    
    counter=0
    
    # get a list of all files in the directory
    files = os.listdir(directory)

    # iterate through the files and check if any have the specified extension
    for file in files:
        if file.endswith(extension):
            counter+=1
                
    if verbose==True:
        print(f"There are {counter} '{extension}' files within {directory}.")

    if count==True:
        return counter
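
The flow-control idea can be sketched on a temporary directory. This is only an illustration: `count_files`, `run_once`, and `make_dummy_jpg` are hypothetical names that do not appear elsewhere in this notebook, and the step simply writes an empty file.

```python
import os
import tempfile

def count_files(directory, extension):
    # hypothetical stand-in for the counting function above
    return sum(1 for name in os.listdir(directory) if name.endswith(extension))

def run_once(directory, extension, step):
    # run `step` only when no file with `extension` exists in `directory` yet
    if count_files(directory, extension) == 0:
        step(directory)
        return True
    return False

def make_dummy_jpg(directory):
    # placeholder step: writes one empty file so the demo has an "output"
    open(os.path.join(directory, 'demo.jpg'), 'w').close()

with tempfile.TemporaryDirectory() as tmp:
    first_run = run_once(tmp, '.jpg', make_dummy_jpg)   # directory is empty, so the step runs
    second_run = run_once(tmp, '.jpg', make_dummy_jpg)  # output exists, so the step is skipped

print(first_run, second_run)
```

Re-running a cell guarded this way should leave existing outputs untouched, which is the behaviour the later cells rely on.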

Determine how many individual patch images are in the full patch dataset:

In [30]:
# setup paths
farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_farm'
city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_city'
fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire1'
fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire2'

# save counts
farm_ct=filePresenceSumChecker(directory=farm,extension='.tif')
city_ct=filePresenceSumChecker(directory=city,extension='.tif')
fire1_ct=filePresenceSumChecker(directory=fire1,extension='.tif')
fire2_ct=filePresenceSumChecker(directory=fire2,extension='.tif')

# sum fire patch counts together and nofire patch counts together
sum_nofire=farm_ct+city_ct
sum_fire=fire1_ct+fire2_ct

# check if the fire and nofire sums are the same
if sum_nofire==sum_fire:
    print('same size')
    print(f'the size of city_ct is {city_ct}')
    print(f'the size of farm_ct is {farm_ct}')
    print(f'the size of fire1_ct is {fire1_ct}')
    print(f'the size of fire2_ct is {fire2_ct}')
else:
    print('not the same size')

same size
the size of city_ct is 7761
the size of farm_ct is 2496
the size of fire1_ct is 7761
the size of fire2_ct is 2496


The fire and nofire categories contain the same number of image patches, effectively building a labeled dataset with $50\%$ of the images from fire areas and $50\%$ from nofire areas.

The images are in four folders. Though all images have names corresponding to their folder, the numbers repeat across the folders. For example, the folder `patch_nofire_farm` has images with the pattern `patch_nofire_farm.XXXX.tif`. Images from the `patch_nofire_city` folder have the pattern `patch_nofire_city.XXXX.tif`, and the numbers from the `_farm` folder are repeated in the `_city` folder. I want the numbers to be unique so that, using the counts from the file counting function above:
* The `7761` images from `city` are numbered `0` through `7760`
* The `2496` images from `farm` are numbered `7761` through `10256`
* And so on
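
As a rough sketch of that numbering scheme, the starting number of each folder is the running total of the counts before it. The helper `number_ranges` is hypothetical, and zero-based numbering is an assumption; the counts are the ones printed above.

```python
from itertools import accumulate

# counts reported by the file-counting cell above
counts = {'city': 7761, 'farm': 2496}

def number_ranges(counts):
    # hypothetical helper: (first, last) zero-based number for each folder,
    # where each folder starts right after the previous one ends
    totals = list(accumulate(counts.values()))
    starts = [0] + totals[:-1]
    return {name: (start, start + n - 1)
            for (name, n), start in zip(counts.items(), starts)}

ranges = number_ranges(counts)
print(ranges)
```

With zero-based numbering, the last number in a folder is one less than the running total, so no number is shared between folders.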

### Rename the files to remove the period between the `_farm`/`_city`/etc. and `XXXX`:

In [66]:
def fileRenamer(source:str, prefix:str,extension='.tif',verbose=False):
    '''
    Renames files to the format provided by the user.
    It can help clean an image format to one that can be
    read by modules like Tensorflow.
    
    Note: code will break if there are no files with a number
    suffix separated by a period. Put the function in a flow
    control loop first.
    
    ----
    Inputs:
    
    >source
    the directory where the files are located
    
    >prefix
    the base part of the filename that will remain
    
    >extension
    defaults to '.tif', but this will ensure you only rename
    certain files that have the specified extension
    
    ----
    Outputs:
    
    >N/A
    renames files in-place, no further output
    
    ----
    Example:
    >>source='/patch_nofire_farm.0.tif'
    >>fileRenamer(source=source,prefix='patch_nofire_farm',extension='.tif')
    >>patch_nofire_farm_0.tif
    
    '''

    # loop over each file from the source directory
    for filename in os.listdir(source):
        if verbose==True:
            print('filename:',filename)
        
        # check if the file is the provided `ext` (extension)
        if filename.endswith(extension):
            
            # split the filename into base and extension
            base, ext = os.path.splitext(filename)
            if verbose==True:
                print('base:',base)
                print('ext:',ext)
            
            # split the base into the prefix and number parts
            # (the prefix is read from the filename itself, so the
            # `prefix` argument is not overwritten)
            file_prefix, number = base.split('.', 1)
            if verbose==True:
                print('prefix:',file_prefix)
                print('number:',number)
            
            # create the new filename with the desired format
            new_filename = f'{file_prefix}_{number}{ext}'
            if verbose==True:
                print('new_filename:',new_filename)
            
            # rename the file
            os.rename(os.path.join(source, filename), 
                      os.path.join(source, new_filename))
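
The docstring notes that the renaming breaks on filenames without a period-separated number. A minimal sketch of a guarded version of the name transformation is below. `underscore_name` is a hypothetical helper, and unlike the function above it splits on the last period rather than the first, which is an assumption about the filename pattern.

```python
import os

def underscore_name(filename, extension='.tif'):
    '''Hypothetical helper: 'name.12.tif' -> 'name_12.tif'; None when there is nothing to rename.'''
    if not filename.endswith(extension):
        return None
    base, ext = os.path.splitext(filename)
    # split off the part after the last period in the base name
    prefix, sep, number = base.rpartition('.')
    # skip names with no period, or where the suffix is not a number
    if not sep or not number.isdigit():
        return None
    return f'{prefix}_{number}{ext}'

print(underscore_name('patch_nofire_farm.0.tif'))
print(underscore_name('patch_nofire_farm_0.tif'))
```

Returning `None` for already-renamed files means the transformation can be applied repeatedly without raising an error.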

In [69]:
# testing renaming function
# directory='/Users/sra/temp/'
# prefix_='patch_nofire_city'

# fileRenamer(source=directory,prefix=prefix_)

In [71]:
# write function to check if files have already been renamed
# this is for flow control

def checkFileString(directory_path, file_string):
    '''
    Takes a directory and string and checks if the string is
    included in any of the filenames within the directory.
    
    ----
    Inputs:
    
    >directory_path
    User-specified path to look for the filenames
    
    >file_string
    User-specified string that the function will look for
    '''
    
    for filename in os.listdir(directory_path):
        if file_string in filename:
            return True
    return False

In [74]:
# setup lists
# to run renaming function

city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_city'
farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_farm'
fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire1'
fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire2'

sources=[city,farm,#nofire
        fire1,fire2]#fire

prefix_city='patch_nofire_city'
prefix_farm='patch_nofire_farm'
prefix_fire1='patch_fire_fire1'
prefix_fire2='patch_fire_fire2'

prefixes=[prefix_city,prefix_farm,
         prefix_fire1,prefix_fire2]

In [77]:
# run fileRenamer on all the patches:
# city #nofire
# farm #nofire
# fire1 #fire
# fire2 #fire

# setup directory for flow control file string checker function
directory='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_city'

if checkFileString(directory_path=directory,file_string='patch_nofire_city_') == False:
    for src,pref in zip(sources,prefixes):
        fileRenamer(source=src,prefix=pref,extension='.tif')

Checking the filenames shows that they have been changed to swap the `.` for a `_`.

---

Now I need to remove the non-square images from the folders, because the point-creation tool (see geoanalysis and report) did not make points for the images on the margins of the four areas (`city`, `farm`, `fire1`, `fire2`).

This needs to happen before I renumber the images, because each image number must match the geographic dataset. In other words, the ideal situation would be that point `18` in the geographic dataset corresponds to the exact location of image `18`.

First, though, the photos need to be converted from `.tif` to `.jpg`. This will be achieved through a function.

### Convert `.tif` to `.jpg`

In [109]:
def imageConverter(inputPath, outputPath, oldExtension='.tif',newExtension='.jpg',fileType='JPEG',verbose=False):
    '''
    Iterates through a directory of (default) .tif files and 
    converts them to (default) .jpg format using OpenCV (cv2).
    
    The images will be sent to a new folder.

    Requires an input directory path 
    and an output directory path as strings.
    
    ----
    Inputs:
    
    >inputPath
    string path to where the inputs are located
    
    >outputPath
    string path to where the outputs will be located
    '''

    # create the output directory if it doesn't exist
    # os.makedirs(outputPath, exist_ok=True)

    # iterate through all files in the input directory
    for file_name in os.listdir(inputPath):
        if file_name.endswith(oldExtension):
            # construct the input and output file paths
            input_path = os.path.join(inputPath, file_name)
            output_path = os.path.join(outputPath, 
                                       file_name.replace(oldExtension,
                                                         newExtension))

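            # load the image unchanged (keeps bit depth and any alpha channel)
            # https://stackoverflow.com/questions/40751523/how-do-you-read-a-32-bit-tiff-image-in-python
            img = cv2.imread(input_path, cv2.IMREAD_UNCHANGED)
            
            # skip files that OpenCV could not read
            if img is None:
                if verbose==True:
                    print(f"Could not read: {input_path}")
                continue
            
            # convert to 3-channel BGR if necessary; cv2.imwrite expects
            # BGR order, and grayscale arrays may have no channel axis
            if img.ndim == 2 or img.shape[2] == 1:
                img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
            elif img.shape[2] == 4:
                img = cv2.cvtColor(img, cv2.COLOR_BGRA2BGR)

            # Save the image as a .jpg file
            cv2.imwrite(output_path, img, [int(cv2.IMWRITE_JPEG_QUALITY), 90])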
            
            if verbose==True:
                print(f"Conversion complete: {input_path} -> {output_path}")
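
One detail of the output path is worth a small sketch: `str.replace` swaps every occurrence of the old extension, not only the one at the end of the name. The helper `swap_extension` below is hypothetical and uses `os.path.splitext` so that only the final extension changes; the example filename is made up for illustration.

```python
import os

def swap_extension(file_name, new_extension='.jpg'):
    # hypothetical helper: replace only the final extension of the filename
    base, _ = os.path.splitext(file_name)
    return base + new_extension

tricky_name = 'patch.tif_scene.3.tif'  # made-up name containing '.tif' twice
print(tricky_name.replace('.tif', '.jpg'))
print(swap_extension(tricky_name))
```

For the patch names used in this notebook both approaches should give the same result; the difference only matters if the extension text appears elsewhere in a name.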


In [110]:
# test function

inp='/Users/sra/temp/'
out='/Users/sra/temp2'

imageConverter(inputPath=inp,outputPath=out)

In [111]:
# setup lists for jpg converter function

source_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_city'
source_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_farm'
source_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire1'
source_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire2'

sources=[source_city,
        source_farm,
        source_fire1,
        source_fire2]

dest_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city'
dest_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm'
dest_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1'
dest_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'

dests=[dest_city,
      dest_farm,
      dest_fire1,
      dest_fire2]

In [113]:
source='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'

if filePresenceSumChecker(directory=source,extension='.jpg')==0:
    for src,des in zip(sources,dests):
        imageConverter(inputPath=src,outputPath=des)

Now we can move the nonsquare `.jpg` images:

In [95]:
def moveNonSquareJPG(source_folder:str, destination_folder:str):
    '''
    Checks whether any JPG or PNG in the source folder
    does not have square dimensions (e.g. 29x128 is not square,
    128x128 is).
    
    Any non-square images are moved to the destination_folder.
    
    ----
    Inputs
    
    >source_folder
    the source of the images to be checked
    
    >destination_folder
    where the non-square images will be relocated to
    
    '''
    
    # Create destination folder if it doesn't exist
    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)
    
    # get a list of all image files in the source folder
    image_files = [f for f in os.listdir(source_folder) if \
                   f.endswith('.jpg') or \
                   f.endswith('.jpeg') or \
                   f.endswith('.png')]
    
    for file_name in image_files:
        file_path = os.path.join(source_folder, file_name)
        
        # open the image using PIL and close it again before moving the file
        with Image.open(file_path) as img:
            width, height = img.size
        
        # check if image is square
        if width != height:
            # move the image to the destination folder
            shutil.move(file_path, os.path.join(destination_folder, file_name))
        
    # friendly notice
    print('Done!')

In [103]:
# test function
source_='/Users/sra/temp2'
dest_='/Users/sra/temp'

moveNonSquareJPG(source_folder=source_,destination_folder=dest_)

Done!


In [116]:
# setup lists for function

source_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city'
source_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm'
source_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1'
source_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'

sources=[source_city,
        source_farm,
        source_fire1,
        source_fire2]

dest_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/city'
dest_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/farm'
dest_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/fire1'
dest_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/fire2'

dests=[dest_city,
      dest_farm,
      dest_fire1,
      dest_fire2]

In [117]:
# setup flow control

directory_='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/city'

if filePresenceSumChecker(directory=directory_,extension='.jpg') == 0:
    for src,dest in zip(sources,dests):
        moveNonSquareJPG(source_folder=src,
                         destination_folder=dest)

Done!
Done!
Done!
Done!


---

To split the images, we will create a function that generates lists of unique numbers, which will serve as the random selector for setting up the train/validation/test splits:

In [78]:
def createTVTS(start,total_img:int,step=1,
              valid_frac=0.15,test_frac=0.15,
              replace=False,verbose=True,debug=False):
    '''
    Creates three lists for a train/validation/test split of 
    numbered files, such as patches previously made from 
    a larger image to be used in convolutional neural network 
    workflows.
    
    The training fraction of the output is the remainder of
    the sum of the validation fraction and the testing fraction:
    
    train_frac = 1 - (valid_frac + test_frac)
    
    Default splits are:
        0.7    = 1 - (   .15     +    .15 )
    
    Please ensure that you have a reasonable split amongst these
    three groups.
    
    ----
    Inputs:
    
    >start
    starting number for the image patches
    
    >total_img
    total number of images in the patch set; also serves as the
    (exclusive) upper bound of the generated list of numbers
    
    >step
    defaults to 1, the step size in creating a list of numbers
    
    >valid_frac
    the fraction of the numbers that will be split into the
    validation set. Please make the number between 0 and 1
    
    >test_frac
    the fraction of the numbers that will be split into the
    testing set. Please make the number between 0 and 1
    
    >replace
    since this function is splitting the numbers, replace defaults
    to False
    
    >verbose
    runs a line of code to check that the splitting was successful
    
    >debug
    helpful print statements to show you what step function is on.
    defaults to not showing these statements
    
    ----
    Outputs:
    
    >train_valid_test_tuple
    a tuple of three lists, containing the train, valid, and
    test list that when combined together are the same size as
    the total_img value
    
    '''
#     create list with each image's number
#     there are `total_img` images each in the fire and nofire datasets
    file_nums=np.arange(start,total_img,step)
    if debug==True:
        print(f'created initial list of size {total_img}')
        
#     create train fraction
    train_frac=1-(valid_frac+test_frac)
    if debug==True:
        print(f'created train_fraction ({train_frac})')
    
#     create train, valid, and test splits    
    trains = np.random.choice(file_nums,
                              size=int(total_img * train_frac),
                              replace=False)
    if debug==True:
        print('created train list')
    
    valids = np.random.choice(np.setdiff1d(file_nums, trains),
                              size=int(total_img * valid_frac),
                              replace=False)
    if debug==True:
        print('created validation list')
    
    tests = np.random.choice(np.setdiff1d(file_nums, np.concatenate((trains, valids))),
                             size=int(total_img * test_frac),
                             replace=False)
    if debug==True:
        print('created test list')
    
    # tests=list(set(file_nums)-set(trains))

    if verbose==True:
        print(f'The size of train ({len(trains)}), validation ({len(valids)}), and tests ({len(tests)}) together is {len(trains)+len(valids)+len(tests)}')
        if debug==True:
            print('printed size of train, validation, and test')
            
    train_valid_test_tuple=(trains,valids,tests)
    if debug==True:
         print('created tuple of train, validation, and test')
    
    return train_valid_test_tuple
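
As a point of comparison, the same kind of split can be sketched with a single shuffle, which makes the three lists disjoint by construction. This is a sketch and not the method used above: `split_numbers` is a hypothetical helper, it uses the standard library's `random` module instead of `np.random.choice`, and the `seed` argument is an added assumption for reproducibility.

```python
import random

def split_numbers(start, stop, valid_frac=0.15, test_frac=0.15, seed=0):
    # hypothetical helper: shuffle once, then slice into three disjoint lists
    nums = list(range(start, stop))
    random.Random(seed).shuffle(nums)
    n_valid = int(len(nums) * valid_frac)
    n_test = int(len(nums) * test_frac)
    valids = nums[:n_valid]
    tests = nums[n_valid:n_valid + n_test]
    trains = nums[n_valid + n_test:]  # everything left over goes to training
    return trains, valids, tests

trains_demo, valids_demo, tests_demo = split_numbers(0, 7760)
print(len(trains_demo), len(valids_demo), len(tests_demo))
```

Because the training list takes the remainder, every number ends up in exactly one list even when the fractions do not divide the total evenly.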

In [139]:
# run function
train_valid_test_tuple=createTVTS(start=0,total_img=7760,\
                                  step=1,verbose=True,\
                                  debug=True)
# train_valid_test_tuple

# sanity checks
trains=train_valid_test_tuple[0]
valids=train_valid_test_tuple[1]
tests=train_valid_test_tuple[2]

print(len(trains))
print(len(valids))
print(len(tests))

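# any number shared by two of the three sets would show up here
print((set(trains) & set(valids)) | (set(trains) & set(tests)) | (set(valids) & set(tests)))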

created initial list of size 7760
created train_fraction (0.7)
created train list
created validation list
created test list
The size of train (5432), validation (1164), and tests (1164) together is 7760
printed size of train, validation, and test
created tuple of train, validation, and test
5432
1164
1164
set()


In [140]:
trains

array([1179, 5283, 7319, ..., 4356, 5760, 3943])

In [141]:
# convert list of integers to list of strings
# important for moving files in next step
trains=[str(i) for i in trains]
valids=[str(i) for i in valids]
tests=[str(i) for i in tests]

type(trains[0])

str

### Copy the images to their corresponding training, validation, and test locations

In [142]:
def copyFileByNumber(source, dest, set_):
    '''
    Copies files from `source` to `dest` that have numbers 
    in their filename and that match any element in `set_` 
    list (either trains, valids, or tests).
    
    ----
    Inputs:
    
    >source
    source of files to be copied
    
    >dest
    destination the files will be copied to
    
    >set_
    list of number strings for either the training ('trains'),
    validation ('valids') or test ('tests') set
    
    ----
    Outputs:
    
    >N/A
    copies files, no further output
    
    '''
    
    for filename in os.listdir(source):
        # Get the number in the filename
        file_num = "".join(filter(str.isdigit, filename))
        # Check if the number is in the provided set_ list
        if file_num in set_:
            # Copy the file to the destination folder
            shutil.copy(os.path.join(source, filename), dest)
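
Joining every digit in a filename works for names like `patch_fire.18.tif`, but a name that also has a digit in its prefix (for example `fire1`) would produce a different number. A possible alternative, shown here only as a sketch, is to read just the number that sits directly before the extension. `trailing_number` is a hypothetical helper.

```python
import re

def trailing_number(filename):
    # hypothetical helper: digits immediately before the final extension, as a string
    match = re.search(r'(\d+)\.[^.]+$', filename)
    return match.group(1) if match else None

print(trailing_number('patch_fire.18.tif'))
print(trailing_number('patch_fire_fire1_18.jpg'))
print("".join(filter(str.isdigit, 'patch_fire_fire1_18.jpg')))
```

The result is a string, so it can be compared directly against the lists of number strings created above.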

In [144]:
# setup loop for copyFileByNumber

source_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire'
source_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire'
source_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire'
source_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire'
source_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire'
source_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire'


dest_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/fire'
dest_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/nofire'
dest_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/fire'
dest_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/nofire'
dest_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/fire'
dest_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/nofire'


sources = [source_train_fire,
          source_train_nofire,
          source_valid_fire,
          source_valid_nofire,
          source_test_fire,
          source_test_nofire]

dests = [dest_train_fire,
        dest_train_nofire,
        dest_valid_fire,
        dest_valid_nofire,
        dest_test_fire,
        dest_test_nofire]

sets = [trains,
        trains,
        valids,
        valids,
        tests,
        tests]

In [147]:
# run function to move subsets of patches
# flow control
if filePresenceSumChecker(directory=dests[0],extension='.tif')==0:
    for src,des,sts in zip(sources,dests,sets):
        copyFileByNumber(source=src,dest=des,set_=sts)

### Rename files and change to proper format

Convert filenames from `patch_fire.X.tif` to `patch_fire_X.tif`, where X is a number with one or more digits.

In [148]:
def fileRenamer(source:str, prefix:str,extension='.tif'):
    '''
    Renames files to the format provided by the user.
    It can help clean an image format to one that can be
    read by modules like Tensorflow.
    
    ----
    Inputs:
    
    >source
    the directory where the files are located
    
    >prefix
    the base part of the filename that will remain
    
    >extension
    defaults to '.tif', but this will ensure you only rename
    certain files that have the specified extension
    
    ----
    Outputs:
    
    >N/A
    renames files in-place, no further output
    '''

    # loop over each file from the source directory
    for filename in os.listdir(source):
        
        # check if the file is the provided `ext` (extension)
        if filename.endswith(extension):
            
            # split the filename into base and extension
            base, ext = os.path.splitext(filename)
            
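            # split the base into the prefix and number parts
            # (the prefix is read from the filename itself, so the
            # `prefix` argument is not overwritten)
            file_prefix, number = base.split('.', 1)
            
            # create the new filename with the desired format
            new_filename = f'{file_prefix}_{number}{ext}'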
            
            # rename the file
            os.rename(os.path.join(source, filename), 
                      os.path.join(source, new_filename))

In [152]:
# make a for loop to rename all the files

source_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/fire'
source_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/nofire'
source_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/fire'
source_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/nofire'
source_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/fire'
source_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/nofire'


sources=[source_train_fire,
        source_train_nofire,
        source_valid_fire,
        source_valid_nofire,
        source_test_fire,
        source_test_nofire]
 
# flow control: only rename if no file already has the new `_X` format
if checkFileString(directory_path=source_train_fire,file_string='patch_fire_') == False:
    for src in sources:    
        fileRenamer(source=src,prefix='patch_fire')

In [156]:
directory = '/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/nofire'
extension = '.jpg'

filePresenceSumChecker(directory=directory,extension=extension)

0

In [157]:
# setup for loop

inputPath_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/fire'
inputPath_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/nofire'
inputPath_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/fire'
inputPath_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/nofire'
inputPath_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/fire'
inputPath_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/nofire'


outputPath_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/fire'
outputPath_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/nofire'
outputPath_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/fire'
outputPath_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/nofire'
outputPath_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/fire'
outputPath_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/nofire'


inputPaths=[inputPath_train_fire,
            inputPath_train_nofire,
            inputPath_valid_fire,
            inputPath_valid_nofire,
            inputPath_test_fire,
            inputPath_test_nofire]

outputPaths=[outputPath_train_fire,
            outputPath_train_nofire,
            outputPath_valid_fire,
            outputPath_valid_nofire,
            outputPath_test_fire,
            outputPath_test_nofire]

In [158]:
# flow control
if filePresenceSumChecker(directory=outputPath_train_fire,extension='.jpg')==0:
    for inp,outp in zip(inputPaths,outputPaths):
        imageConverter(inputPath=inp,outputPath=outp)

In [159]:
filePresenceSumChecker(directory='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/fire',
                      extension='.jpg')

1164

In [24]:
def fileDeleter(source:str, extension:str='.tif'):
    '''
    Deletes files with the provided extension from the source directory.
    
    ----
    Inputs:
    
    >source
    the directory where the files are located
    
    >extension
    defaults to '.tif', but this will ensure you only delete
    certain files that have the specified extension
    
    ----
    Outputs:
    
    >N/A
    deletes files in-place, no further output
    '''
    
    # loop over each file from the source directory
    for filename in os.listdir(source):
        
        # check if the file is the provided `ext` (extension)
        if filename.endswith(extension):
            
            # delete the file
            os.remove(os.path.join(source, filename))
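
Since deletion cannot be undone, it may help to preview which files would be removed before running the function. The sketch below does this on a temporary directory with made-up filenames; `list_deletable` is a hypothetical helper and removes nothing itself.

```python
import os
import tempfile

def list_deletable(source, extension='.tif'):
    # hypothetical helper: names that a delete pass for `extension` would remove
    return sorted(name for name in os.listdir(source) if name.endswith(extension))

with tempfile.TemporaryDirectory() as tmp:
    # create a few empty placeholder files for the demo
    for name in ('a.tif', 'b.tif', 'a.jpg'):
        open(os.path.join(tmp, name), 'w').close()
    to_delete = list_deletable(tmp)
    remaining_after = sorted(set(os.listdir(tmp)) - set(to_delete))

print(to_delete)
print(remaining_after)
```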


In [163]:
# setup for loop

inputPath_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/fire'
inputPath_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/nofire'
inputPath_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/fire'
inputPath_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/nofire'
inputPath_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/fire'
inputPath_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/nofire'

inputPaths=[inputPath_train_fire,
            inputPath_train_nofire,
            inputPath_valid_fire,
            inputPath_valid_nofire,
            inputPath_test_fire,
            inputPath_test_nofire]

In [164]:
# flow control
if filePresenceSumChecker(directory=inputPath_train_fire,extension='.tif')>0:
    for inp in (inputPaths):
        fileDeleter(source=inp)

In [45]:
# dir_path='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/validation/nofire'

input_path='/Users/sra/temp'
output_path='/Users/sra/temp3'

if filePresenceSumChecker(directory=output_path,extension='.jpg') == 0:
    moveNonSquareJPG(source_folder=input_path,destination_folder=output_path)

In [59]:
# setup for loop

inputPath_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/fire'
inputPath_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/nofire'
inputPath_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/fire'
inputPath_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/nofire'
inputPath_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/fire'
inputPath_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/nofire'

destPath_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/train/fire'
destPath_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/train/nofire'
destPath_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/validation/fire'
destPath_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/validation/nofire'
destPath_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/test/fire'
destPath_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/test/nofire'

inputPaths=[inputPath_train_fire,
            inputPath_train_nofire,
            inputPath_valid_fire,
            inputPath_valid_nofire,
            inputPath_test_fire,
            inputPath_test_nofire]

destPaths=[destPath_train_fire,
          destPath_train_nofire,
          destPath_valid_fire,
          destPath_valid_nofire,
          destPath_test_fire,
          destPath_test_nofire]
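The twelve hard-coded paths above follow one pattern: a base folder, an optional `nonsquares` level, a split, and a label. As a minimal sketch, the same two lists could be built programmatically. The names `base_dir`, `input_paths`, and `dest_paths` are hypothetical and the base folder shown is a placeholder, not the notebook's real path.

```python
import os
from itertools import product

# Hypothetical base folder; substitute the real patches directory.
base_dir = os.path.join("orthoimagery", "patches")

splits = ["train", "validation", "test"]
labels = ["fire", "nofire"]

# product() yields train/fire, train/nofire, validation/fire, ... which is
# the same order as the hand-written lists.
input_paths = [os.path.join(base_dir, split, label)
               for split, label in product(splits, labels)]
dest_paths = [os.path.join(base_dir, "nonsquares", split, label)
              for split, label in product(splits, labels)]
```

Because both lists come from the same `product()` call, they stay aligned when zipped together.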

In [60]:
checked_location='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/validation/nofire'

if filePresenceSumChecker(directory=checked_location,extension='.jpg') == 0:
    for inp,des in zip(inputPaths,destPaths):
        moveNonSquareImages(source_folder=inp,destination_folder=des)

## Clip raster to polygon

In [None]:
from osgeo import gdal, ogr

# Define the input raster and polygon mask
input_raster = "/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/m_3411849_se_11_060_20180722.tif"
mask = "/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/shapefiles/perimeters_sm/santa_monica_fire_perimeters_fire_valid.shp"

# Open the input raster and polygon mask
raster_ds = gdal.Open(input_raster)
mask_ds = ogr.Open(mask)

# Get the mask layer
mask_lyr = mask_ds.GetLayer()

# Get the extent of the mask layer
mask_extent = mask_lyr.GetExtent()

# Set the output file name and format
output_file = "path/to/clipped_raster.tif"
output_format = "GTiff"

# Set the output file resolution
output_res = raster_ds.GetGeoTransform()[1]

# Define the output file size
output_width = int((mask_extent[1] - mask_extent[0]) / output_res)
output_height = int((mask_extent[3] - mask_extent[2]) / output_res)

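# Define the warp options.
# GetExtent() returns (minX, maxX, minY, maxY); outputBounds expects (minX, minY, maxX, maxY).
# width/height are left out because they cannot be combined with xRes/yRes.
output_bounds = (mask_extent[0], mask_extent[2], mask_extent[1], mask_extent[3])
warp_options = gdal.WarpOptions(format=output_format, cutlineDSName=mask, cropToCutline=True, dstSRS=raster_ds.GetProjection(), outputBounds=output_bounds, xRes=output_res, yRes=output_res)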

# Call the gdal.Warp() function to clip the raster
clipped_raster_ds = gdal.Warp(output_file, raster_ds, options=warp_options)

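# Note: ESRI Shapefile and GeoJSON are vector formats, and gdal.VectorTranslate()
# expects a vector source, so the clipped raster cannot be written to them directly.
# The clipped GeoTIFF is already written to output_file by gdal.Warp() above.
# Vector output would first require converting the raster to polygons.
# output_shp = "path/to/clipped_raster.shp"
# gdal.VectorTranslate(output_shp, clipped_raster_ds, format="ESRI Shapefile")
# output_geojson = "path/to/clipped_raster.geojson"
# gdal.VectorTranslate(output_geojson, clipped_raster_ds, format="GeoJSON")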

# Clean up
raster_ds = None
mask_ds = None
clipped_raster_ds = None

In [None]:
from osgeo import gdal, ogr

# Define the input raster and polygon mask
input_raster = "/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/m_3411849_se_11_060_20180722.tif"
mask = "/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/shapefiles/perimeters_sm/santa_monica_fire_perimeters_fire_valid.shp"

# Open the input raster and polygon mask
raster_ds = gdal.Open(input_raster)
mask_ds = ogr.Open(mask)

# Get the mask layer
mask_lyr = mask_ds.GetLayer()

# Get the extent of the mask layer
mask_extent = mask_lyr.GetExtent()

# Set the output file name and format for GeoTIFF
output_file_tif = "path/to/clipped_raster.tif"
output_format_tif = "GTiff"

# Set the output file name and format for GeoJSON
output_file_geojson = "path/to/clipped_raster.geojson"
output_format_geojson = "GeoJSON"

# Set the output file resolution
output_res = raster_ds.GetGeoTransform()[1]

# Define the output file size
output_width = int((mask_extent[1] - mask_extent[0]) / output_res)
output_height = int((mask_extent[3] - mask_extent[2]) / output_res)

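# Define the warp options.
# GetExtent() returns (minX, maxX, minY, maxY); outputBounds expects (minX, minY, maxX, maxY).
# width/height are left out because they cannot be combined with xRes/yRes.
# format="MEM" keeps the result in memory, which an empty destination name requires.
output_bounds = (mask_extent[0], mask_extent[2], mask_extent[1], mask_extent[3])
warp_options = gdal.WarpOptions(format="MEM", cutlineDSName=mask, cropToCutline=True, dstSRS=raster_ds.GetProjection(), outputBounds=output_bounds, xRes=output_res, yRes=output_res)

# Call the gdal.Warp() function to clip the raster
clipped_raster_ds = gdal.Warp('', raster_ds, options=warp_options)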

# Save clipped raster to GeoTIFF
output_tif = "path/to/clipped_raster.tif"
gdal.Translate(output_tif, clipped_raster_ds, format=output_format_tif)

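# Note: GeoJSON and ESRI Shapefile are vector formats. gdal.Translate() only writes
# raster formats and gdal.VectorTranslate() expects a vector source, so the clipped
# raster cannot be saved to either format directly. Vector output would first
# require converting the raster to polygons.
# gdal.Translate(output_file_geojson, clipped_raster_ds, format=output_format_geojson)
# output_shp = "path/to/clipped_raster.shp"
# gdal.VectorTranslate(output_shp, clipped_raster_ds, format="ESRI Shapefile")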

# Clean up
raster_ds = None
mask_ds = None
clipped_raster_ds = None
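One detail in the clipping cells is easy to get wrong. OGR's `GetExtent()` returns `(minX, maxX, minY, maxY)`, while the `outputBounds` argument of `gdal.WarpOptions` expects `(minX, minY, maxX, maxY)`. The sketch below shows the reordering and the pixel-size arithmetic in plain Python, without GDAL. The function names and the example extent are made up for illustration.

```python
def ogr_extent_to_bounds(extent):
    """Reorder an OGR extent (minX, maxX, minY, maxY) into (minX, minY, maxX, maxY)."""
    min_x, max_x, min_y, max_y = extent
    return (min_x, min_y, max_x, max_y)


def output_size(bounds, pixel_size):
    """Pixel width and height of the bounds at the given pixel size (truncated)."""
    min_x, min_y, max_x, max_y = bounds
    width = int((max_x - min_x) / pixel_size)
    height = int((max_y - min_y) / pixel_size)
    return width, height


# Made-up extent in projected units, for illustration only.
example_extent = (1000.0, 1600.0, 5000.0, 5300.0)
bounds = ogr_extent_to_bounds(example_extent)
size = output_size(bounds, pixel_size=0.5)
```

This arithmetic only makes sense when the mask and the raster share a coordinate reference system, which is assumed here.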

## [Extracting Patches from Large Images ~~and Masks~~ for Semantic Segmentation](https://www.youtube.com/watch?v=7IL7LKSLb9I)

I followed this tutorial to convert my large fire/nofire images into patches for neural network analysis. The code below comes from the video, with some alterations to fit my use case.

In [1]:
import numpy as np
from matplotlib import pyplot as plt
from patchify import patchify
import tifffile as tiff

# large_image_stack_fire=tiff.imread('/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/2018/ortho_2018_sm_fire.tif')

large_image_stack_patch_fire=tiff.imread('/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/img_patch_fire2.tif')

In [6]:
# updated 
# https://stackoverflow.com/questions/68224588/problem-when-using-patchify-library-to-create-patches

import cv2

# filepaths
target_tiff_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/img_patch_fire2.tif'
output_location_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches'

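# read the large fire image; cv2.imread returns None if the file cannot be decoded
img = cv2.imread(target_tiff_fire)
if img is None:
    raise IOError(f"Could not read {target_tiff_fire}")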

# cv2.imshow('image',img)

[ WARN:0@347.904] global /private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_d9lyif19nl/croot/opencv-suite_1676472756314/work/modules/imgcodecs/src/grfmt_tiff.cpp (629) readData OpenCV TIFF: TIFFRGBAImageOK: Sorry, can not handle images with 32-bit samples
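The warning above says OpenCV's TIFF reader cannot handle 32-bit samples. One possible workaround, sketched here under assumptions, is to read the raster with `tifffile` (already imported in this notebook) and rescale it to 8-bit before passing it to OpenCV. The helper `to_uint8` is hypothetical, and a synthetic array stands in for the real file so the example runs on its own.

```python
import numpy as np


def to_uint8(arr):
    """Linearly rescale a numeric array to the 0-255 range and cast to uint8."""
    arr = arr.astype(np.float64)
    lo, hi = arr.min(), arr.max()
    if hi == lo:
        # Constant image: avoid dividing by zero.
        return np.zeros(arr.shape, dtype=np.uint8)
    scaled = (arr - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# Stand-in for a 32-bit raster. With a real file, the array could instead
# come from tiff.imread(target_tiff_fire).
raster32 = np.linspace(0.0, 1.0, 4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
img8 = to_uint8(raster32)
```

A min-max rescale is only one choice. Percentile clipping or a fixed value range may suit real imagery better.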


In [3]:
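import os

patches_img = patchify(img, (128,128,3), step=128)

for i in range(patches_img.shape[0]):
    for j in range(patches_img.shape[1]):
        single_patch_img = patches_img[i, j, 0, :, :, :]
        # use os.path.join so the file lands inside the output folder, and
        # separate the row and column indices so file names stay unique
        patch_path = os.path.join(output_location_fire, f'image_{i}_{j}.jpg')
        if not cv2.imwrite(patch_path, single_patch_img):
            raise Exception("Could not write the image")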

error: OpenCV(4.6.0) /private/var/folders/nz/j6p8yfhx1mv_0grj5xl4650h0000gp/T/abs_d9lyif19nl/croot/opencv-suite_1676472756314/work/modules/imgcodecs/src/loadsave.cpp:77: error: (-215:Assertion failed) pixels <= CV_IO_MAX_IMAGE_PIXELS in function 'validateInputImageSize'
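To show what non-overlapping patching does to an array, here is a plain NumPy sketch. It is not patchify's implementation, only an illustration of the same idea with a step equal to the patch size. The function name `split_into_patches` and the synthetic image are assumptions. Unlike the patchify output indexed above, the result has no extra singleton axis.

```python
import numpy as np


def split_into_patches(image, patch_size):
    """Split an (H, W, C) array into non-overlapping square patches.

    Rows and columns that do not fill a whole patch are cropped off.
    Returns an array of shape (rows, cols, patch_size, patch_size, C).
    """
    h, w, c = image.shape
    rows, cols = h // patch_size, w // patch_size
    cropped = image[:rows * patch_size, :cols * patch_size]
    patches = cropped.reshape(rows, patch_size, cols, patch_size, c)
    # Move the patch-row axis next to the patch-column axis.
    return patches.swapaxes(1, 2)


# Small synthetic image standing in for the orthoimage.
image = np.arange(300 * 400 * 3).reshape(300, 400, 3)
patches = split_into_patches(image, 128)
```

Here `patches[i, j]` is the 128 x 128 x 3 tile at patch row `i` and patch column `j`. The leftover 44 rows and 16 columns of the 300 x 400 image are dropped.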
