# Welcome  

Notebook Author: Samuel Alter  
Notebook Subject: Capstone Project - Preprocess Imagery

BrainStation Winter 2023: Data Science

This notebook processes the satellite imagery into tiles, or 'patches', for modelling, and cleans the files so that only square images are fed into the image analysis.

The satellite images processed here were already patched in QGIS.

# Imports

In [19]:
# imports

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

import os
import shutil
from PIL import Image
import cv2
import re

# Preprocess Images

### First, create a function for flow control. When used in an `if` statement, it can prevent a step from running again if its output files already exist in a folder. It's also a handy way to check how many files are in a directory.

In [43]:
def filePresenceSumChecker(directory:str,extension:str,count=True,verbose=False):
    '''
    Checks the sum of all the files with a certain extension.
    
    Useful to see if a file move process has already been completed.
    
    ----
    Inputs
    
    >directory
    path to a folder to check if files are there
    
    >extension
    user-specified extension to only count those files
    
    >verbose
    option for the user to see how many files with the extension 
    is in the directory provided
    
    >count
    option for the user to see the count of the files
    
    ----
    Outputs
    
    >counter
    gives the amount of files within the directory
    '''
    
    counter=0
    
    # get a list of all files in the directory
    files = os.listdir(directory)

    # iterate through the files and check if any have the specified extension
    for file in files:
        if file.endswith(extension):
            counter+=1
                
    if verbose==True:
        print(f"There are {counter} '{extension}' files within {directory}.")

    if count==True:
        return counter
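
Because `filePresenceSumChecker` matches names with `str.endswith`, it is case-sensitive (`.TIF` is not counted as `.tif`) and it would also count a subdirectory whose name ends in the extension. A minimal sketch of a stricter variant, assuming only regular files should be counted; `count_files_with_extension` is a hypothetical name and is not used elsewhere in this notebook:

```python
from pathlib import Path


def count_files_with_extension(directory: str, extension: str) -> int:
    """Count regular files in `directory` whose suffix equals `extension`, ignoring case."""
    wanted = extension.lower()
    return sum(
        1
        for entry in Path(directory).iterdir()
        # skip subdirectories and compare suffixes case-insensitively
        if entry.is_file() and entry.suffix.lower() == wanted
    )
```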

Determine how many individual patch images are in the full patch dataset:

In [3]:
# setup paths
# farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_farm'
# city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_city'
# fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire1'
# fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire2'

farm='./patches/patch_nofire_farm'
city='./patches/patch_nofire_city'
fire1='./patches/patch_fire_fire1'
fire2='./patches/patch_fire_fire2'

# save counts
farm_ct=filePresenceSumChecker(directory=farm,extension='.tif')
city_ct=filePresenceSumChecker(directory=city,extension='.tif')
fire1_ct=filePresenceSumChecker(directory=fire1,extension='.tif')
fire2_ct=filePresenceSumChecker(directory=fire2,extension='.tif')

# sum fire patch counts together and nofire patch counts together
sum_nofire=farm_ct+city_ct
sum_fire=fire1_ct+fire2_ct

# check if the fire and nofire sums are the same
if sum_nofire==sum_fire:
    print('same size')
    print(f'the size of city_ct is {city_ct}')
    print(f'the size of farm_ct is {farm_ct}')
    print(f'the size of fire1_ct is {fire1_ct}')
    print(f'the size of fire2_ct is {fire2_ct}')
else:
    print('not the same size')

same size
the size of city_ct is 7761
the size of farm_ct is 2496
the size of fire1_ct is 7761
the size of fire2_ct is 2496


The fire and nofire categories contain the same number of image patches (10257 each), which yields a labeled dataset in which $50\%$ of the images come from fire areas and $50\%$ from nofire areas. 

**However**, and this is very important, some of the images are not square, which means that the point layer did not capture the area of these images. The nonsquare images will therefore be moved out of the dataset later in the notebook, and the number of images will decrease accordingly.

The images are in four folders. Although each image name includes its folder name, the numbers repeat across folders. For example, the folder `patch_nofire_farm` has images with the pattern `patch_nofire_farm.XXXX.tif`. Images from the `patch_nofire_city` folder have the pattern `patch_nofire_city.XXXX.tif`, and the numbers from the `_farm` folder are repeated in the `_city` folder. I want the numbers to be unique across folders so that, given the counts from the file counting function above:
* The images from `city` are numbered `0` through `7760`
* The images from `farm` are numbered `7761` through `10256`
* And so on

### Rename the files to remove the period between the `_farm`/`_city`/etc. and `XXXX`:

In [4]:
def fileRenamer(source:str, prefix:str,extension='.tif',verbose=False):
    '''
    Renames files to the format provided by the user.
    It can help clean an image format to one that can be
    read by modules like Tensorflow.
    
    Note: code will break if there are no files with a number
    suffix separated by a period. Put the function in a flow
    control loop first.
    
    ----
    Inputs:
    
    >source
    the directory where the files are located
    
    >prefix
    the base part of the filename that will remain
    
    >extension
    defaults to '.tif', but this will ensure you only rename
    certain files that have the specified extension
    
    ----
    Outputs:
    
    >N/A
    renames files in-place, no further output
    
    ----
    Example:
    >>source='./patches/patch_nofire_farm'  # contains patch_nofire_farm.0.tif
    >>fileRenamer(source=source,prefix='patch_nofire_farm',extension='.tif')
    >>patch_nofire_farm_0.tif
    
    '''

    # loop over each file from the source directory
    for filename in os.listdir(source):
        if verbose==True:
            print('filename:',filename)
        
        # check if the file is the provided `ext` (extension)
        if filename.endswith(extension):
            
            # split the filename into base and extension
            base, ext = os.path.splitext(filename)
            if verbose==True:
                print('base:',base)
                print('ext:',ext)
            
            # split the base into the name and number parts
            # (stored in `stem` so the `prefix` argument is not overwritten)
            stem, number = base.split('.', 1)
            if verbose==True:
                print('prefix:',stem)
                print('number:',number)
            
            # create the new filename with the desired format
            new_filename = f'{stem}_{number}{ext}'
            if verbose==True:
                print('new_filename:',new_filename)
            
            # rename the file
            os.rename(os.path.join(source, filename), 
                      os.path.join(source, new_filename))
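
The docstring above warns that `fileRenamer` breaks on files that lack a period-separated number suffix, because `base.split('.', 1)` then returns a single element. A minimal sketch of one way around that, using a regular expression to skip files that do not match; `rename_dot_numbered` is a hypothetical helper, not the function used in the cells below:

```python
import os
import re


def rename_dot_numbered(source: str, extension: str = ".tif") -> int:
    """Rename `<stem>.<number><extension>` to `<stem>_<number><extension>` in place.

    Files that do not match the pattern are left untouched.
    Returns the number of files renamed.
    """
    pattern = re.compile(
        r"^(?P<stem>.+)\.(?P<number>\d+)" + re.escape(extension) + "$"
    )
    renamed = 0
    for filename in sorted(os.listdir(source)):
        match = pattern.match(filename)
        if match is None:
            continue  # already renamed, or not a numbered patch
        new_name = f"{match['stem']}_{match['number']}{extension}"
        os.rename(os.path.join(source, filename),
                  os.path.join(source, new_name))
        renamed += 1
    return renamed
```

Since already-renamed files no longer match, calling the helper a second time renames nothing, so it does not need a separate flow-control check.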

In [69]:
# testing renaming function
# directory='/Users/sra/temp/'
# prefix_='patch_nofire_city'

# fileRenamer(source=directory,prefix=prefix_)

In [5]:
# write function to check if files have already been renamed
# this is for flow control

def checkFileString(directory_path, file_string):
    '''
    Takes a directory and string and checks if the string is
    included in any of the filenames within the directory.
    
    ----
    Inputs:
    
    >directory_path
    User-specified path to look for the filenames
    
    >file_string
    User-specified string that the function will look for
    '''
    
    for filename in os.listdir(directory_path):
        if file_string in filename:
            return True
    return False

In [6]:
# setup lists
# to run renaming function

# city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_city'
# farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_farm'
# fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire1'
# fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire2'

city='./patches/patch_nofire_city'
farm='./patches/patch_nofire_farm'
fire1='./patches/patch_fire_fire1'
fire2='./patches/patch_fire_fire2'


sources=[city,farm,#nofire
        fire1,fire2]#fire

prefix_city='patch_nofire_city'
prefix_farm='patch_nofire_farm'
prefix_fire1='patch_fire_fire1'
prefix_fire2='patch_fire_fire2'

prefixes=[prefix_city,prefix_farm,
         prefix_fire1,prefix_fire2]

In [77]:
# run fileRenamer on all the patches:
# city #nofire
# farm #nofire
# fire1 #fire
# fire2 #fire

# setup directory for flow control file string checker function
# directory='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_city'
directory='./patches/patch_nofire_city'

if checkFileString(directory_path=directory,file_string='patch_nofire_city_') == False:
    for src,pref in zip(sources,prefixes):
        fileRenamer(source=src,prefix=pref,extension='.tif')

Checking the filenames shows that they have been changed to swap the `.` for a `_`.

---

Now I need to remove the non-square images from the folders because the point-creation tool (see geoanalysis and report) did not make points for the images that are on the margin of the four areas (`city`, `farm`, `fire1`, `fire2`).

This needs to happen before I renumber the images, because the image numbers must match the point numbers in the geographic dataset. In other words, the ideal situation would be that point `18` in the geographic dataset corresponds to the exact location of image `18`.

First, though, the photos need to be converted from `.tif` to `.jpg`. This will be achieved through a function.

### Convert `.tif` to `.jpg`

In [7]:
def imageConverter(inputPath, outputPath, oldExtension='.tif',newExtension='.jpg',fileType='JPEG',verbose=False):
    '''
    Iterates through a directory of (default) .tif files and 
    converts them to (default) .jpg format using OpenCV (cv2).
    
    The images will be sent to a new folder.

    Requires an input directory path 
    and an output directory path as strings.
    
    ----
    Inputs:
    
    >inputPath
    string path to where the inputs are located
    
    >outputPath
    string path to where the outputs will be located
    '''

    # create the output directory if it doesn't exist
    # os.makedirs(outputPath, exist_ok=True)

    # iterate through all files in the input directory
    for file_name in os.listdir(inputPath):
        if file_name.endswith(oldExtension):
            # construct the input and output file paths
            input_path = os.path.join(inputPath, file_name)
            output_path = os.path.join(outputPath, 
                                       file_name.replace(oldExtension,
                                                         newExtension))

            # load the image
            # https://stackoverflow.com/questions/40751523/how-do-you-read-a-32-bit-tiff-image-in-python
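            img = cv2.imread(input_path, cv2.IMREAD_UNCHANGED)
            if img is None:
                # cv2.imread returns None instead of raising when a file cannot be read
                print(f"Could not read {input_path}; skipping")
                continue
            
            # convert to 3-channel BGR if necessary (cv2.imwrite expects BGR order)
            if img.ndim == 2 or img.shape[2] == 1:
                img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
            elif img.shape[2] == 4:
                img = cv2.cvtColor(img, cv2.COLOR_BGRA2BGR)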

            # Save the image as a .jpg file
            cv2.imwrite(output_path, img, [int(cv2.IMWRITE_JPEG_QUALITY), 90])
            
            if verbose==True:
                print(f"Conversion complete: {input_path} -> {output_path}")


In [8]:
# # test function

# inp='/Users/sra/temp/'
# out='/Users/sra/temp2'

# imageConverter(inputPath=inp,outputPath=out)

In [9]:
# setup lists for jpg converter function

# source_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_city'
# source_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_nofire_farm'
# source_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire1'
# source_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/patch_fire_fire2'

source_city='./patches/patch_nofire_city'
source_farm='./patches/patch_nofire_farm'
source_fire1='./patches/patch_fire_fire1'
source_fire2='./patches/patch_fire_fire2'


sources=[source_city,
        source_farm,
        source_fire1,
        source_fire2]

# dest_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city'
# dest_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm'
# dest_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1'
# dest_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'

dest_city='./patches/_patch_jpg/city'
dest_farm='./patches/_patch_jpg/farm'
dest_fire1='./patches/_patch_jpg/fire1'
dest_fire2='./patches/_patch_jpg/fire2'


dests=[dest_city,
      dest_farm,
      dest_fire1,
      dest_fire2]

In [113]:
# source='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'
source='./patches/_patch_jpg/fire2'

if filePresenceSumChecker(directory=source,extension='.jpg')==0:
    for src,des in zip(sources,dests):
        imageConverter(inputPath=src,outputPath=des)

Now we can move the nonsquare `.jpg` images:

In [10]:
def moveNonSquareJPG(source_folder:str, destination_folder:str):
    '''
    Checks whether any JPG or PNG in the source folder
    does not have square dimensions (e.g. 29x128 is not square,
    128x128 is).
    
    Any non-square image is moved to the destination_folder.
    
    ----
    Inputs
    
    >source_folder
    the source of the images to be checked
    
    >destination_folder
    where the non-square images will be relocated to
    
    '''
    
    # Create destination folder if it doesn't exist
    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)
    
    # get a list of all image files in the source folder
    image_files = [f for f in os.listdir(source_folder) if \
                   f.endswith('.jpg') or \
                   f.endswith('.jpeg') or \
                   f.endswith('.png')]
    
    for file_name in image_files:
        file_path = os.path.join(source_folder, file_name)
        
        # open the image using PIL just long enough to read its size,
        # so the file handle is closed before the file is moved
        with Image.open(file_path) as img:
            width, height = img.size
        
        # move the image to the destination folder if it is not square
        if width != height:
            shutil.move(file_path, os.path.join(destination_folder, file_name))
        
    # friendly notice
    print('Done!')

In [11]:
# # test function
# source_='/Users/sra/temp2'
# dest_='/Users/sra/temp'

# moveNonSquareJPG(source_folder=source_,destination_folder=dest_)

In [12]:
# setup lists for function

# source_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city'
# source_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm'
# source_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1'
# source_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'

source_city='./patches/_patch_jpg/city'
source_farm='./patches/_patch_jpg/farm'
source_fire1='./patches/_patch_jpg/fire1'
source_fire2='./patches/_patch_jpg/fire2'


sources=[source_city,
        source_farm,
        source_fire1,
        source_fire2]

# dest_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/city'
# dest_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/farm'
# dest_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/fire1'
# dest_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/fire2'

dest_city='./patches/nonsquares/jpg/city'
dest_farm='./patches/nonsquares/jpg/farm'
dest_fire1='./patches/nonsquares/jpg/fire1'
dest_fire2='./patches/nonsquares/jpg/fire2'


dests=[dest_city,
      dest_farm,
      dest_fire1,
      dest_fire2]

In [117]:
# setup flow control

# directory_='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/nonsquares/jpg/city'
directory_='./patches/nonsquares/jpg/city'

if filePresenceSumChecker(directory=directory_,extension='.jpg') == 0:
    for src,dest in zip(sources,dests):
        moveNonSquareJPG(source_folder=src,
                         destination_folder=dest)

Done!
Done!
Done!
Done!


---

### `train`/`valid`/`test` Split

#### Get the number of images in each category

In [45]:
# setup paths
# city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city'
# farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm'
# fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1'
# fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'

city='./patches/_patch_jpg/city'
farm='./patches/_patch_jpg/farm'
fire1='./patches/_patch_jpg/fire1'
fire2='./patches/_patch_jpg/fire2'


# save counts
farm_ct=filePresenceSumChecker(directory=farm,extension='.jpg')
city_ct=filePresenceSumChecker(directory=city,extension='.jpg')
fire1_ct=filePresenceSumChecker(directory=fire1,extension='.jpg')
fire2_ct=filePresenceSumChecker(directory=fire2,extension='.jpg')

# sum fire patch counts together and nofire patch counts together
sum_nofire=farm_ct+city_ct
sum_fire=fire1_ct+fire2_ct

# check if the fire and nofire sums are the same
if sum_nofire==sum_fire:
    print('same size')
    print(f'the size of city_ct is {city_ct}')
    print(f'the size of farm_ct is {farm_ct}')
    print(f'the size of fire1_ct is {fire1_ct}')
    print(f'the size of fire2_ct is {fire2_ct}')
else:
    print('not the same size')

same size
the size of city_ct is 0
the size of farm_ct is 0
the size of fire1_ct is 0
the size of fire2_ct is 0


**Reminder: the number of images corresponds exactly to the number of points in the geographic dataset. This is a very, very important check to ensure that the eventual metamodel can be created properly**

#### Reset the number suffix on each image to prepare for `train`/`valid`/`test` splits

The numbers in the geographic dataset will also be reset. The numbering restarts within each patch set (city, farm, fire1, fire2).

In [14]:
def resetFileNumbers(directory: str, prefix: str):
    '''
    Resets the numbering in filenames 
    in the provided directory.
    
    ----
    Inputs:
    
    >directory
    the directory where the files are located
    
    >prefix
    the prefix part of the filename that will remain
    
    ----
    Outputs:
    
    >N/A
    renames files in-place, no further output
    
    ----
    Example:
    >>directory='./files'
    >>prefix='file_'
    >>resetFileNumbers(directory=directory,prefix=prefix)
    >>file_0
    >>file_1
    >>file_2
    '''

    # create a list of all the files in the directory
    files = os.listdir(directory)

    # filter only the files with the prefix provided
    files = [f for f in files if f.startswith(prefix)]

    # sort the files by their numerical suffix
    files.sort(key=lambda x: int(''.join(filter(str.isdigit, x))))

    # rename the files with the new numbering
    for i, filename in enumerate(files):
        new_filename = prefix + str(i) + os.path.splitext(filename)[1]
        os.rename(os.path.join(directory, filename),
                  os.path.join(directory, new_filename))

In [15]:
# test function

# directory_='/Users/sra/temp'

# resetFileNumbers(directory=directory_,prefix='patch_')

In [16]:
# city=7525
# farm=2395
# fire1=7525
# fire2=2395

In [17]:
# setup lists

# dir_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city'
# dir_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm'
# dir_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1'
# dir_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'

dir_city='./patches/_patch_jpg/city'
dir_farm='./patches/_patch_jpg/farm'
dir_fire1='./patches/_patch_jpg/fire1'
dir_fire2='./patches/_patch_jpg/fire2'


dirs=[dir_city,dir_farm,dir_fire1,dir_fire2]

prefix_city='patch_nofire_city_'
prefix_farm='patch_nofire_farm_'
prefix_fire1='patch_fire_fire1_'
prefix_fire2='patch_fire_fire2_'

prefixes=[prefix_city,prefix_farm,prefix_fire1,prefix_fire2]

In [149]:
# flow control and rename images

# destination_folder='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city/flow_control'
# destination='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city'

destination_folder='./patches/_patch_jpg/city/flow_control'
destination='./patches/_patch_jpg/city'


if not os.path.exists(destination_folder):
    for dr,prf in zip(dirs,prefixes):
        resetFileNumbers(directory=dr,prefix=prf)
        
    # create the full path to the new folder
    new_folder_path = os.path.join(destination, 'flow_control')

    # create the new folder if it doesn't already exist
    if not os.path.exists(new_folder_path):
        os.makedirs(new_folder_path)
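
The cell above uses the existence of a `flow_control` folder as a marker that the renumbering has already run. That pattern can be sketched as a small helper, under the assumption that the marker should only be created after the step finishes without raising; `run_once` is a hypothetical name and is not used elsewhere in this notebook:

```python
import os


def run_once(marker_dir: str, action) -> bool:
    """Run `action()` only if `marker_dir` does not exist yet.

    The marker folder is created after `action` returns, so a step that
    raises partway through is attempted again on the next run.
    Returns True if the action ran, False if it was skipped.
    """
    if os.path.exists(marker_dir):
        return False
    action()
    os.makedirs(marker_dir)
    return True
```

A call might look like `run_once('./patches/_patch_jpg/city/flow_control', lambda: None)`, with the real renaming loop in place of the placeholder lambda.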

Now to the train/validation/test split. To do this, we will create a function that generates lists of unique numbers, which will serve as the random selector for setting up the train/validation/test splits:

First, count the number of files in each folder as before to confirm the total_img size:

In [44]:
# setup paths
# dir_city='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city'
# dir_farm='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm'
# dir_fire1='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1'
# dir_fire2='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'

dir_city='./patches/_patch_jpg/city'
dir_farm='./patches/_patch_jpg/farm'
dir_fire1='./patches/_patch_jpg/fire1'
dir_fire2='./patches/_patch_jpg/fire2'


# save counts
city_ct=filePresenceSumChecker(directory=dir_city,extension='.jpg')
farm_ct=filePresenceSumChecker(directory=dir_farm,extension='.jpg')
fire1_ct=filePresenceSumChecker(directory=dir_fire1,extension='.jpg')
fire2_ct=filePresenceSumChecker(directory=dir_fire2,extension='.jpg')

# sum fire patch counts together and nofire patch counts together
sum_nofire=city_ct+farm_ct
sum_fire=fire1_ct+fire2_ct

# check if the fire and nofire sums are the same
if sum_nofire==sum_fire:
    print('same size')
    print(f'the size of city_ct is {city_ct}')
    print(f'the size of farm_ct is {farm_ct}')
    print(f'the size of fire1_ct is {fire1_ct}')
    print(f'the size of fire2_ct is {fire2_ct}')
else:
    print('not the same size')
    
total_sum=sum_nofire+sum_fire
print('total_sum:',total_sum)

In [57]:
def createTVTS(start,total_img:int,step=1,
              valid_frac=0.15,test_frac=0.15,
              replace=False,verbose=True,debug=False):
    '''
    Creates three lists for a train/validation/test split of 
    numbered files, such as patches previously made from 
    a larger image to be used in convolutional neural network 
    workflows.
    
    The training fraction of the output is the remainder of
    the sum of the validation fraction and the testing fraction:
    
    train_frac = 1 - (valid_frac + test_frac)
    
    Default splits are:
        0.7    = 1 - (   .15     +    .15 )
    
    Please ensure that you have a reasonable split amongst these
    three groups.
    
    ----
    Inputs:
    
    >start
    starting number for the image patches
    
    >total_img
    total number of images in the patch set; also serves as the
    (exclusive) upper bound of the generated file numbers
    
    >step
    defaults to 1, the step size in creating a list of numbers
    
    >valid_frac
    the fraction of the numbers that will be split into the
    validation set. Please make the number between 0 and 1
    
    >test_frac
    the fraction of the numbers that will be split into the
    testing set. Please make the number between 0 and 1
    
    >replace
    since this function is splitting the numbers, replace defaults
    to False
    
    >verbose
    runs a line of code to check that the splitting was successful
    
    >debug
    helpful print statements to show you what step function is on.
    defaults to not showing these statements
    
    ----
    Outputs:
    
    >train_valid_test_tuple
    a tuple of three lists, containing the train, valid, and
    test list that when combined together are the same size as
    the total_img value
    
    '''
#     create list with each image's number
#     there are `total_img` images each in the fire and nofire datasets
    file_nums=np.arange(start,total_img,step)
    if debug==True:
        print(f'created initial list of size {total_img}')
        
#     create train fraction
    train_frac=1-(valid_frac+test_frac)
    if debug==True:
        print(f'created train_fraction ({train_frac})')
    
#     create train, valid, and test splits    
    trains = np.random.choice(file_nums,
                              size=int(total_img * train_frac),
                              replace=False)
    if debug==True:
        print('created train list')
    
    valids = np.random.choice(np.setdiff1d(file_nums, trains),
                              size=int(total_img * valid_frac),
                              replace=False)
    if debug==True:
        print('created validation list')
    
    tests = np.random.choice(np.setdiff1d(file_nums, np.concatenate((trains, valids))),
                             size=int(total_img * test_frac),
                             replace=False)
    if debug==True:
        print('created test list')
    
    # tests=list(set(file_nums)-set(trains))

    if verbose==True:
        print(f'The size of train ({len(trains)}), validation ({len(valids)}), and tests ({len(tests)}) together is {len(trains)+len(valids)+len(tests)}')
        if debug==True:
            print('printed size of train, validation, and test')
            
    train_valid_test_tuple=(trains,valids,tests)
    if debug==True:
         print('created tuple of train, validation, and test')
    
    return train_valid_test_tuple

In [58]:
# run function
# total_sum is defined in a cell above
train_valid_test_tuple=createTVTS(start=0,total_img=total_sum,\
                                  step=1,verbose=True,\
                                  debug=True)
# train_valid_test_tuple

# sanity checks
trains=train_valid_test_tuple[0]
valids=train_valid_test_tuple[1]
tests=train_valid_test_tuple[2]

print(len(trains))
print(len(valids))
print(len(tests))

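# every pair of splits should be disjoint, so the union of the
# pairwise overlaps should be an empty set
print((set(trains) & set(valids)) | (set(trains) & set(tests)) | (set(valids) & set(tests)))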

created initial list of size 19836
created train_fraction (0.7)
created train list
created validation list
created test list
The size of train (13885), validation (2975), and tests (2975) together is 19835
printed size of train, validation, and test
created tuple of train, validation, and test
13885
2975
2975
set()
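
In the run above, the three lists hold 19835 of the 19836 indices, because each split size is truncated with `int(...)` independently and the leftover index is assigned to no split. A minimal sketch of an alternative that assigns every index exactly once, by shuffling once and slicing; it uses the standard-library `random` module rather than `np.random.choice`, and `split_indices` and its `seed` parameter are hypothetical:

```python
import random


def split_indices(total: int, valid_frac: float = 0.15,
                  test_frac: float = 0.15, seed: int = 0):
    """Shuffle range(total) once and slice it into train/valid/test lists."""
    indices = list(range(total))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_valid = int(total * valid_frac)
    n_test = int(total * test_frac)
    n_train = total - n_valid - n_test  # the remainder goes to train

    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test
```

Because the slices come from a single shuffled list, the splits are disjoint by construction and their lengths always add up to `total`.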


In [59]:
trains

array([15641, 14998, 10468, ..., 17694,  9970,  2793])

In [60]:
# convert list of integers to list of strings
# important for moving files in next step
trains=[str(i) for i in trains]
valids=[str(i) for i in valids]
tests=[str(i) for i in tests]

type(trains[0])

str
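
Storing the numbers as strings means they can be compared directly with the number suffix parsed out of a filename. The file-moving step itself is not shown in this part of the notebook, so the following is only a sketch of how such a lookup could work; `move_by_split` and the `dest_root/<split>/` folder layout are assumptions:

```python
import os
import shutil


def move_by_split(source_dir: str, dest_root: str, splits: dict) -> dict:
    """Move `<prefix>_<number><ext>` files into `dest_root/<split>/`.

    `splits` maps a split name (e.g. 'train') to a collection of
    number strings. Returns how many files were moved per split.
    """
    # invert the mapping: number string -> split name
    lookup = {num: name for name, nums in splits.items() for num in nums}
    moved = {name: 0 for name in splits}

    for filename in os.listdir(source_dir):
        stem = os.path.splitext(filename)[0]
        number = stem.rsplit("_", 1)[-1]  # text after the last underscore
        split = lookup.get(number)
        if split is None:
            continue  # this number is not assigned to any split
        target_dir = os.path.join(dest_root, split)
        os.makedirs(target_dir, exist_ok=True)
        shutil.move(os.path.join(source_dir, filename),
                    os.path.join(target_dir, filename))
        moved[split] += 1
    return moved
```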

As a reminder: the images are in four folders. Although each image name includes its folder name, the numbers repeat across folders. For example, the folder `patch_nofire_farm` has images with the pattern `patch_nofire_farm.XXXX.tif`. Images from the `patch_nofire_city` folder have the pattern `patch_nofire_city.XXXX.tif`, and the numbers from the `_farm` folder are repeated in the `_city` folder. I want the numbers to be unique across folders so that, given the file counts listed below:
* The images from `city` are numbered `0` through `7523`
* The images from `farm` are numbered `7524` through `9917`
* And so on

the size of city_ct is 7524  
the size of farm_ct is 2394  
the size of fire1_ct is 7524  
the size of fire2_ct is 2394

Rename the images following the pattern described above and enumerated below:

* `city` = 0 to 7523
* `farm` = 7524 to 9917
* `fire1` = 9918 to 17441
* `fire2` = 17442 to 19835
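
Number ranges like these are easy to get wrong by one when typed by hand, so it can help to derive them from the folder counts. A minimal sketch, using the post-cleaning counts reported above; `inclusive_ranges` is a hypothetical helper:

```python
from itertools import accumulate

# counts per folder after the non-square images were moved out
counts = {"city": 7524, "farm": 2394, "fire1": 7524, "fire2": 2394}


def inclusive_ranges(counts: dict) -> dict:
    """Map each folder to the (first, last) number it should occupy."""
    # each folder starts where the running total of the previous folders ends
    starts = [0] + list(accumulate(counts.values()))[:-1]
    return {
        name: (start, start + n - 1)
        for (name, n), start in zip(counts.items(), starts)
    }


ranges = inclusive_ranges(counts)
for name, (first, last) in ranges.items():
    print(f"{name}: {first} to {last}")
```

The first element of each pair is the `start` value to pass when renumbering that folder.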

In [38]:
# function to specify the filenumbers

def resetFileNumbers(directory: str, prefix: str, start: int = 0):
    '''
    Resets the numbering in filenames 
    in the provided directory.
    
    ----
    Inputs:
    
    >directory
    the directory where the files are located
    
    >prefix
    the prefix part of the filename that will remain
    
    >start
    the starting number for the file numbering
    
    ----
    Outputs:
    
    >N/A
    renames files in-place, no further output
    
    ----
    Example:
    >>directory='./files'
    >>prefix='file_'
    >>resetFileNumbers(directory=directory,prefix=prefix, start=2)
    >>file_2
    >>file_3
    >>file_4
    '''

    # create a list of all the files in the directory
    files = os.listdir(directory)

    # filter only the files with the prefix provided
    files = [f for f in files if f.startswith(prefix)]

    # sort the files by their numerical suffix
#     files.sort(key=lambda x: int(''.join(filter(str.isdigit, x))))
    files.sort(key=lambda x: int(''.join(filter(str.isdigit, x))) if any(char.isdigit() for char in x) else 0)

    
    # rename the files with the new numbering
    for i, filename in enumerate(files):
        new_filename = prefix + str(i + start) + os.path.splitext(filename)[1]
        os.rename(os.path.join(directory, filename),
                  os.path.join(directory, new_filename))

In [31]:
# # test the function

# test_dir='/Users/sra/temp'
# test_prefix='patch_'
test_start=0

# resetFileNumbers(directory=test_dir,
#                  prefix=test_prefix,
#                  start=test_start)

In [29]:
# the size of city_ct is 7524
# the size of farm_ct is 2394
# the size of fire1_ct is 7524
# the size of fire2_ct is 2394

Rename the images following the pattern described above and enumerated below:

* `city` = 0 to 7523
* `farm` = 7524 to 9917
* `fire1` = 9918 to 17441
* `fire2` = 17442 to 19835

In [43]:
def resetFileNumbers(directory: str, prefix: str, start: int = 0):
    '''
    Resets the numbering in filenames 
    in the provided directory.
    
    ----
    Inputs:
    
    >directory
    the directory where the files are located
    
    >prefix
    the prefix part of the filename that will remain
    
    >start
    the starting number for the file numbering
    
    ----
    Outputs:
    
    >N/A
    renames files in-place, no further output
    
    ----
    Example:
    >>directory='./files'
    >>prefix='file_'
    >>resetFileNumbers(directory=directory,prefix=prefix, start=2)
    >>file_2
    >>file_3
    >>file_4
    '''

    # create a list of all the files in the directory
    files = os.listdir(directory)

    # filter only the files with the prefix provided
    files = [f for f in files if f.startswith(prefix)]

    # sort the files by their numerical suffix
    files.sort(key=lambda x: int(x.split('_')[-1].split('.')[0]))

    # rename the files with the new numbering
    for i, filename in enumerate(files):
        new_filename = prefix + str(i + start) + os.path.splitext(filename)[1]
        shutil.move(os.path.join(directory, filename),
                  os.path.join(directory, new_filename))
    return
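
Because `resetFileNumbers` renames files within the same directory, a new name can coincide with a file that has not been renamed yet (for example, renumbering `patch_0` through `patch_9` to start at 5), and `shutil.move` may then replace that file, depending on the operating system. One possible safeguard, sketched here under the assumption that no other file in the folder starts with `tmp_`, is to rename in two passes. The function name `reset_file_numbers_safe` is hypothetical.

```python
import os
import tempfile


def reset_file_numbers_safe(directory, prefix, start=0):
    '''
    Hypothetical two-pass variant of resetFileNumbers.
    Pass 1 moves every matching file to a temporary name,
    pass 2 assigns the final numbers, so no file is overwritten.
    '''
    files = [f for f in os.listdir(directory) if f.startswith(prefix)]
    # sort by the number between the last underscore and the extension
    files.sort(key=lambda x: int(x.split('_')[-1].split('.')[0]))

    # pass 1: temporary names (assumes no existing file starts with 'tmp_')
    temp_names = []
    for i, filename in enumerate(files):
        temp_name = f'tmp_{i}' + os.path.splitext(filename)[1]
        os.rename(os.path.join(directory, filename),
                  os.path.join(directory, temp_name))
        temp_names.append(temp_name)

    # pass 2: final names with the new numbering
    for i, temp_name in enumerate(temp_names):
        new_name = prefix + str(i + start) + os.path.splitext(temp_name)[1]
        os.rename(os.path.join(directory, temp_name),
                  os.path.join(directory, new_name))


# small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    for n in range(4):
        with open(os.path.join(demo_dir, f'patch_{n}.jpg'), 'w') as fh:
            fh.write(str(n))  # file content records the original number
    reset_file_numbers_safe(demo_dir, prefix='patch_', start=2)
    renamed = sorted(os.listdir(demo_dir))
    with open(os.path.join(demo_dir, 'patch_2.jpg')) as fh:
        first_content = fh.read()
    print(renamed)
```

In the demonstration, the file that was originally `patch_0.jpg` ends up as `patch_2.jpg` with its content intact, even though a `patch_2.jpg` already existed before the renaming.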

In [46]:
# city images are already named appropriately 
# they start at 0

In [45]:
# farm
# directory_='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm'
directory_='./patches/_patch_jpg/farm'

prefix_='patch_nofire_farm_'
start_=7524

# destination_folder='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city/flow_control_resetFileNumbers0to19835'
# destination='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm'
destination_folder='./patches/_patch_jpg/farm/flow_control_resetFileNumbers0to19835'
destination='./patches/_patch_jpg/farm'


if not os.path.exists(destination_folder):
    for dr,prf in zip([directory_], [prefix_]):
        resetFileNumbers(directory=dr,prefix=prf,start=start_)
        
    # create the full path to the new folder
    new_folder_path = os.path.join(destination, 'flow_control_resetFileNumbers0to19835')

    # create the new folder if it doesn't already exist
    if not os.path.exists(new_folder_path):
        os.makedirs(new_folder_path)
        
    # move the files to the new folder
    for file in os.listdir(directory_):
        if file.startswith(prefix_):
            shutil.move(os.path.join(directory_, file), os.path.join(new_folder_path, file))

In [50]:
# fire1
# directory_='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1'
directory_='./patches/_patch_jpg/fire1'

prefix_='patch_fire_fire1_'
start_=9918

# destination_folder='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1/flow_control_resetFileNumbers0to19835'
# destination='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1'
destination_folder='./patches/_patch_jpg/fire1/flow_control_resetFileNumbers0to19835'
destination='./patches/_patch_jpg/fire1'


if not os.path.exists(destination_folder):
    for dr,prf in zip([directory_], [prefix_]):
        resetFileNumbers(directory=dr,prefix=prf,start=start_)
        
    # create the full path to the new folder
    new_folder_path = os.path.join(destination, 'flow_control_resetFileNumbers0to19835')

    # create the new folder if it doesn't already exist
    if not os.path.exists(new_folder_path):
        os.makedirs(new_folder_path)
        
    # move the files to the new folder
    for file in os.listdir(directory_):
        if file.startswith(prefix_):
            shutil.move(os.path.join(directory_, file), os.path.join(new_folder_path, file))

In [51]:
# fire2
# directory_='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'
directory_='./patches/_patch_jpg/fire2'
prefix_='patch_fire_fire2_'
start_=17442

# destination_folder='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2/flow_control_resetFileNumbers0to19835'
# destination='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2'
destination_folder='./patches/_patch_jpg/fire2/flow_control_resetFileNumbers0to19835'
destination='./patches/_patch_jpg/fire2'


if not os.path.exists(destination_folder):
    for dr,prf in zip([directory_], [prefix_]):
        resetFileNumbers(directory=dr,prefix=prf,start=start_)
        
    # create the full path to the new folder
    new_folder_path = os.path.join(destination, 'flow_control_resetFileNumbers0to19835')

    # create the new folder if it doesn't already exist
    if not os.path.exists(new_folder_path):
        os.makedirs(new_folder_path)
        
    # move the files to the new folder
    for file in os.listdir(directory_):
        if file.startswith(prefix_):
            shutil.move(os.path.join(directory_, file), os.path.join(new_folder_path, file))

### Moving the images to their corresponding training and validation locations

#### First, move the images into `nofire` and `fire` folders.

In [61]:
def move_files_by_extension(source_folder, dest_folder, extension):
    """
    Move all files with the specified extension from the source folder to the destination folder.
    
    Args:
    - source_folder (str): The path to the source folder.
    - dest_folder (str): The path to the destination folder.
    - extension (str): The file extension to move, with or without the dot (e.g., '.txt' or 'txt').
    
    Returns:
    - None: The function does not return anything, but raises an error if the source or destination folder 
    paths are invalid or if there are no files with the specified extension in the source folder.
    """
    
    # Check if the source folder exists
    if not os.path.exists(source_folder):
        raise ValueError(f"The source folder {source_folder} does not exist.")
    
    # Check if the destination folder exists; if not, create it
    if not os.path.exists(dest_folder):
        os.makedirs(dest_folder)
    
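    # Normalize the extension so that 'jpg' and '.jpg' both match only '.jpg' files
    if not extension.startswith('.'):
        extension = '.' + extension
    
    # Get a list of all files in the source folder with the specified extension
    files_to_move = [f for f in os.listdir(source_folder) if f.endswith(extension)]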
    
    # Raise an error if no files were found with the specified extension
    if len(files_to_move) == 0:
        raise ValueError(f"No files with the extension {extension} were found in the source folder {source_folder}.")
    
    # Move each file from the source folder to the destination folder
    for file in files_to_move:
        source_file_path = os.path.join(source_folder, file)
        dest_file_path = os.path.join(dest_folder, file)
        shutil.move(source_file_path, dest_file_path)


In [68]:
# city

In [None]:
# source='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/city'
# dest='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/nofire'
source='./patches/_patch_jpg/city'
dest='./patches/_patch_jpg/nofire'

move_files_by_extension(source_folder=source,dest_folder=dest,extension='.jpg')

In [66]:
# farm

In [62]:
# source='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/farm/flow_control_resetFileNumbers0to19835'
# dest='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/nofire'
source='./patches/_patch_jpg/farm/flow_control_resetFileNumbers0to19835'
dest='./patches/_patch_jpg/nofire'

move_files_by_extension(source_folder=source,dest_folder=dest,extension='.jpg')

In [63]:
# fire1

In [64]:
# source='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire1/flow_control_resetFileNumbers0to19835'
# dest='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire'
source='./patches/_patch_jpg/fire1/flow_control_resetFileNumbers0to19835'
dest='./patches/_patch_jpg/fire'

move_files_by_extension(source_folder=source,dest_folder=dest,extension='.jpg')

In [None]:
# fire2

In [65]:
# source='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire2/flow_control_resetFileNumbers0to19835'
# dest='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire'
source='./patches/_patch_jpg/fire2/flow_control_resetFileNumbers0to19835'
dest='./patches/_patch_jpg/fire'

move_files_by_extension(source_folder=source,dest_folder=dest,extension='.jpg')

##### Move the files based on train/valid/test splits

In [16]:
def copyFileByNumber(source, dest, set_):
    '''
    Copies files from `source` to `dest` that have numbers 
    in their filename and that match any element in `set_` 
    list (either trains, valids, or tests).
    
    ----
    Inputs:
    
    >source
    source of files to be copied
    
    >dest
    destination folder that the files are copied to
    
    >set_
    specify either the training ('trains'), validation
    ('valids') or test ('tests') set
    
    ----
    Outputs:
    
    >N/A
    copies files, no further output
    
    '''
    
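    # convert the numbers to integers once for fast membership tests
    wanted = {int(n) for n in set_}
    
    for filename in os.listdir(source):
        # Get the number directly before the extension, ignoring digits
        # elsewhere in the filename (such as the '1' in 'fire1')
        match = re.search(r'(\d+)\.[^.]+$', filename)
        # Check if the number is in the provided set_
        if match and int(match.group(1)) in wanted:
            # Copy the file to the destination folder
            shutil.copy(os.path.join(source, filename), dest)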
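
Filenames such as `patch_fire_fire1_9918.jpg` contain a digit in the prefix as well as the file number, so it matters which digits are extracted. A string of digits also never equals an integer when tested with `in` against a list of integers. The following minimal sketch compares two extraction approaches on that filename; the helper name `trailing_number` is hypothetical.

```python
import re


def trailing_number(filename):
    '''
    Hypothetical helper: return the integer directly before the file
    extension, or None if the filename does not end in digits + extension.
    '''
    match = re.search(r'(\d+)\.[^.]+$', filename)
    return int(match.group(1)) if match else None


filename = 'patch_fire_fire1_9918.jpg'

# joining every digit also picks up the '1' in 'fire1'
all_digits = ''.join(filter(str.isdigit, filename))

# anchoring on the extension keeps only the file number
file_number = trailing_number(filename)

print(all_digits)
print(file_number)
```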

In [71]:
# setup loop for copyFileByNumber

# dest_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/train/fire'
# dest_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/train/nofire'
# dest_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/validation/fire'
# dest_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/validation/nofire'
# dest_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/test/fire'
# dest_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/test/nofire'

dest_train_fire='./patches/_model_images/train/fire'
dest_train_nofire='./patches/_model_images/train/nofire'
dest_valid_fire='./patches/_model_images/validation/fire'
dest_valid_nofire='./patches/_model_images/validation/nofire'
dest_test_fire='./patches/_model_images/test/fire'
dest_test_nofire='./patches/_model_images/test/nofire'

# source_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire'
# source_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/nofire'
# source_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire'
# source_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/nofire'
# source_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire'
# source_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/nofire'

source_train_fire='./patches/_patch_jpg/fire'
source_train_nofire='./patches/_patch_jpg/nofire'
source_valid_fire='./patches/_patch_jpg/fire'
source_valid_nofire='./patches/_patch_jpg/nofire'
source_test_fire='./patches/_patch_jpg/fire'
source_test_nofire='./patches/_patch_jpg/nofire'


sources = [source_train_fire,
          source_train_nofire,
          source_valid_fire,
          source_valid_nofire,
          source_test_fire,
          source_test_nofire]

dests = [dest_train_fire,
        dest_train_nofire,
        dest_valid_fire,
        dest_valid_nofire,
        dest_test_fire,
        dest_test_nofire]

sets = [trains,
        trains,
        valids,
        valids,
        tests,
        tests]

In [72]:
# run function to copy subsets of patches
# flow control: only copy if no '.jpg' patches are in the destination yet
if filePresenceSumChecker(directory=dests[0],extension='.jpg')==0:
    for src,des,sts in zip(sources,dests,sets):
        copyFileByNumber(source=src,dest=des,set_=sts)

In [74]:
# fire
# run function
# total_sum is defined in a cell above
train_valid_test_tuple=createTVTS(start=0,total_img=9918,\
                                  step=1,verbose=True,\
                                  debug=True)
# train_valid_test_tuple

# sanity checks
fire_trains=train_valid_test_tuple[0]
fire_valids=train_valid_test_tuple[1]
fire_tests=train_valid_test_tuple[2]

print(len(fire_trains))
print(len(fire_valids))
print(len(fire_tests))

print(set(fire_trains) & set(fire_valids) & set(fire_tests))

created initial list of size 9918
created train_fraction (0.7)
created train list
created validation list
created test list
The size of train (6942), validation (1487), and tests (1487) together is 9916
printed size of train, validation, and test
created tuple of train, validation, and test
6942
1487
1487
set()
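
The three-way intersection printed above is empty unless a number appears in all three lists, so it would not reveal an overlap between only two of them. A stricter check, sketched here with a hypothetical helper name and small made-up lists, compares every pair:

```python
from itertools import combinations


def pairwise_overlaps(**named_lists):
    '''
    Hypothetical helper: return a dict mapping each pair of list names
    to the set of values they share (only pairs with a non-empty overlap).
    '''
    overlaps = {}
    for (name_a, a), (name_b, b) in combinations(named_lists.items(), 2):
        shared = set(a) & set(b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# made-up example: 3 appears in both train and valid, but not in test
example_train = [0, 1, 2, 3]
example_valid = [3, 4]
example_test = [5, 6]

three_way = set(example_train) & set(example_valid) & set(example_test)
pairs = pairwise_overlaps(train=example_train, valid=example_valid, test=example_test)

print(three_way)
print(pairs)
```

An empty dictionary from `pairwise_overlaps` would indicate that no number is shared between any two of the lists.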


Create a list of numbers for the second set of images. The numbers range from 9918 to 19835.

In [17]:
import numpy as np

def split_list(start, end, train_frac, valid_frac, test_frac):
    '''
    Splits a list of numbers from start to end into training, validation, and
    test sets using the specified fractions.
    
    Args:
    start (int): The starting number in the list.
    end (int): The ending number in the list.
    train_frac (float): The fraction of numbers to put in the training set.
    valid_frac (float): The fraction of numbers to put in the validation set.
    test_frac (float): The fraction of numbers to put in the test set. Not used
    directly: the test set receives whatever remains after the other two sets are drawn.
    
    Returns:
    A tuple of three NumPy arrays (train, valid, test) that together contain
    every number from start to end exactly once.
    '''
    
    # Create list with each number from start to end
    nums = np.arange(start, end+1)
    
    # Calculate the number of elements to put in each set
    train_size = int(len(nums) * train_frac)
    valid_size = int(len(nums) * valid_frac)
    test_size = len(nums) - train_size - valid_size
    
    # Use np.random.choice to split the numbers into sets
    train_nums = np.random.choice(nums, size=train_size, replace=False)
    nums = np.setdiff1d(nums, train_nums)
    valid_nums = np.random.choice(nums, size=valid_size, replace=False)
    test_nums = np.setdiff1d(nums, valid_nums)
    
    # Return the tuple of lists
    return (train_nums, valid_nums, test_nums)
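
`split_list` draws from NumPy's global random state, so each run produces a different split unless a seed is set beforehand. Below is a minimal sketch of a reproducible variant, assuming that a seeded `np.random.default_rng` generator and a single shuffle are acceptable substitutes for the repeated `np.random.choice` calls. The function name `split_list_seeded` and the seed value are hypothetical.

```python
import numpy as np


def split_list_seeded(start, end, train_frac, valid_frac, seed=42):
    '''
    Hypothetical reproducible variant of split_list: shuffle the numbers
    once with a seeded generator, then slice the shuffled array.
    The test set receives whatever remains after train and validation.
    '''
    nums = np.arange(start, end + 1)

    # a fixed seed gives the same shuffle on every run
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(nums)

    train_size = int(len(nums) * train_frac)
    valid_size = int(len(nums) * valid_frac)

    train_nums = shuffled[:train_size]
    valid_nums = shuffled[train_size:train_size + valid_size]
    test_nums = shuffled[train_size + valid_size:]
    return train_nums, valid_nums, test_nums


train_a, valid_a, test_a = split_list_seeded(9918, 19835, 0.7, 0.15)
train_b, valid_b, test_b = split_list_seeded(9918, 19835, 0.7, 0.15)

print(len(train_a), len(valid_a), len(test_a))
```

Slicing one shuffled array keeps the three sets disjoint by construction, and calling the function twice with the same seed returns the same split.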


In [2]:
train,valid,test=split_list(start=9918,end=19835,
                            train_frac=0.7,
                            valid_frac=0.15,
                            test_frac=0.15)

In [10]:
train_len=len(train)
valid_len=len(valid)
test_len=len(test)

print(f'''
train: {len(train)}
valid: {len(valid)}
test:  {len(test)}
sum: {train_len+valid_len+test_len}
''')


train: 6942
valid: 1487
test:  1489
sum: 9918



In [40]:
def copy_files_by_number(source_dir, dest_dir, numbers_list, extension, prefix):
    '''
    Takes a source directory, destination directory,
    list of numbers, an extension, and a prefix, and copies 
    every file whose name is exactly the prefix, followed by
    a number from the list, followed by the extension, from
    the source directory to the destination directory.
    '''
    for filename in os.listdir(source_dir):
        if filename.endswith(extension) and filename.startswith(prefix):
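            # the text between the prefix and the extension should be the file number
            number_part = filename[len(prefix):-len(extension)]
            # skip files whose middle part is not purely numeric
            if number_part.isdecimal() and int(number_part) in numbers_list: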
                source_path = os.path.join(source_dir, filename)
                dest_path = os.path.join(dest_dir, filename)
                shutil.copyfile(source_path, dest_path)

In [39]:
# test function

source='/Users/sra/temp'
dest='/Users/sra/temp2'
set_=[0,1,2]

copy_files_by_number(source_dir=source,
                    dest_dir=dest,
                    numbers_list=set_,
                    extension='.jpg',
                    prefix='patch_')

In [None]:
# setup loop for copyFileByNumber

# dest_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/train/fire'
# dest_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/train/nofire'
# dest_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/validation/fire'
# dest_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/validation/nofire'
# dest_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/test/fire'
# dest_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_model_images/test/nofire'

dest_train_fire='./patches/_model_images/train/fire'
dest_train_nofire='./patches/_model_images/train/nofire'
dest_valid_fire='./patches/_model_images/validation/fire'
dest_valid_nofire='./patches/_model_images/validation/nofire'
dest_test_fire='./patches/_model_images/test/fire'
dest_test_nofire='./patches/_model_images/test/nofire'

# source_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire'
# source_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/nofire'
# source_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire'
# source_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/nofire'
# source_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/fire'
# source_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/_patch_jpg/nofire'

source_train_fire='./patches/_patch_jpg/fire'
source_train_nofire='./patches/_patch_jpg/nofire'
source_valid_fire='./patches/_patch_jpg/fire'
source_valid_nofire='./patches/_patch_jpg/nofire'
source_test_fire='./patches/_patch_jpg/fire'
source_test_nofire='./patches/_patch_jpg/nofire'

sources = [source_train_fire,
          source_train_nofire,
          source_valid_fire,
          source_valid_nofire,
          source_test_fire,
          source_test_nofire]

dests = [dest_train_fire,
        dest_train_nofire,
        dest_valid_fire,
        dest_valid_nofire,
        dest_test_fire,
        dest_test_nofire]

sets = [trains,
        trains,
        valids,
        valids,
        tests,
        tests]

In [24]:
def fileDeleter(source:str, extension:str='.tif'):
    '''
    Deletes files with the provided extension from the source directory.
    
    ----
    Inputs:
    
    >source
    the directory where the files are located
    
    >extension
    defaults to '.tif', but this will ensure you only delete
    certain files that have the specified extension
    
    ----
    Outputs:
    
    >N/A
    deletes files in-place, no further output
    '''
    
    # loop over each file from the source directory
    for filename in os.listdir(source):
        
        # check if the file has the provided `extension`
        if filename.endswith(extension):
            
            # delete the file
            os.remove(os.path.join(source, filename))
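
Deleting files cannot be undone, so it can help to list what would be removed before calling `fileDeleter`. A minimal sketch, with a hypothetical helper name `files_to_delete`, demonstrated in a temporary directory:

```python
import os
import tempfile


def files_to_delete(source, extension='.tif'):
    '''
    Hypothetical preview helper: return the sorted filenames in `source`
    that end with `extension`, without deleting anything.
    '''
    return sorted(f for f in os.listdir(source) if f.endswith(extension))


# small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['a.tif', 'b.tif', 'c.jpg']:
        # create empty placeholder files
        open(os.path.join(demo_dir, name), 'w').close()
    preview = files_to_delete(demo_dir)
    remaining = sorted(os.listdir(demo_dir))
    print(preview)
```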


In [163]:
# setup for loop

# inputPath_train_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/fire'
# inputPath_train_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/train/nofire'
# inputPath_valid_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/fire'
# inputPath_valid_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/validation/nofire'
# inputPath_test_fire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/fire'
# inputPath_test_nofire='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/orthoimagery/patches/test/nofire'

inputPath_train_fire='./patches/train/fire'
inputPath_train_nofire='./patches/train/nofire'
inputPath_valid_fire='./patches/validation/fire'
inputPath_valid_nofire='./patches/validation/nofire'
inputPath_test_fire='./patches/test/fire'
inputPath_test_nofire='./patches/test/nofire'

inputPaths=[inputPath_train_fire,
            inputPath_train_nofire,
            inputPath_valid_fire,
            inputPath_valid_nofire,
            inputPath_test_fire,
            inputPath_test_nofire]

In [164]:
# flow control
if filePresenceSumChecker(directory=inputPath_train_fire,extension='.tif')>0:
    for inp in (inputPaths):
        fileDeleter(source=inp)