# SETOM_501A Final Project

This final project will be part of the MSGT capstone project. The project accepts shapefiles or folders of shapefiles, runs them against a GIS model, and then generates a single output shapefile. This final project implements that workflow at a high level as a proof of concept: it searches a staging folder for new files, processes them against a simple data model, and outputs a single file. The exact procedure is described in the code comments. A high-level description follows:

### Data Pipeline
The data passes through multiple stages as it is processed. This is most easily described as five steps:
 - Data is ingested and placed in a named folder. That folder is placed in the Stage folder
 - The name of each staged folder is checked against the existing intermediary (dataset) folders. If a matching folder exists, the new data is an addition to that dataset; if no folder exists, the data is a new dataset. The data is then added to the existing dataset, or a new dataset is created
 - For each modified or new dataset, the new files are processed
 - The new files are removed from the dataset. Each dataset folder should end up containing a single file
 - All of the single dataset files are merged into one final output file

The end result is: MANY input files --> SOME dataset files --> ONE output file
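
As a rough illustration of the second step, the folder-name check could be sketched as below. The function name `classify_staged` and the `"new"`/`"existing"` labels are hypothetical and are not part of the project code; the sketch only assumes that each dataset is a folder whose name matches the staged folder.

```python
from pathlib import Path


def classify_staged(stage_root, dataset_root):
    """Map each staged folder name to 'existing' or 'new' (hypothetical helper)."""
    stage_root, dataset_root = Path(stage_root), Path(dataset_root)
    result = {}
    for entry in sorted(stage_root.iterdir()):
        # Skip hidden entries (e.g. .DS_Store) and loose files:
        # only named folders are treated as incoming datasets.
        if entry.name.startswith(".") or not entry.is_dir():
            continue
        # A dataset folder with the same name means the data is an addition.
        target = dataset_root / entry.name
        result[entry.name] = "existing" if target.is_dir() else "new"
    return result
```

Calling `classify_staged("Stage/", "Datasets/")` would return one entry per staged folder, which the later steps could use to decide whether to create a dataset folder first.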

### Project Fundamental Pseudocode
 - Scan a folder for files
 - If there are new files, check whether a matching dataset exists
 - If the dataset exists, add the files to it and update the folder metadata file to flag that processing is required
 - If no dataset exists, create a new dataset, add the files to it, flag the dataset for processing, clear the Stage folder, and create a folder metadata file including: date created, date modified, and a flag for modification required
 - Process all flagged datasets, discard files as they are processed (end result should be one file per dataset)
 - Merge all datasets
 - Output single file, overwrite existing output
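
The last two steps (merge, then overwrite the output) might look something like the sketch below. It assumes each dataset folder holds a single plain-text master file named `<dataset>/<dataset>.txt` as a stand-in for the real shapefile, and it uses simple concatenation as a stand-in for a real GIS merge. The function name `merge_datasets` is hypothetical.

```python
from pathlib import Path


def merge_datasets(dataset_root, output_file):
    """Concatenate every <name>/<name>.txt master file into one output file.

    Hypothetical helper: text concatenation stands in for a shapefile merge.
    Returns the number of dataset files merged.
    """
    dataset_root, output_file = Path(dataset_root), Path(output_file)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    merged = 0
    # Mode "w" truncates the file, so any existing output is overwritten.
    with output_file.open("w") as out:
        for folder in sorted(dataset_root.iterdir()):
            master = folder / (folder.name + ".txt")
            if folder.is_dir() and master.is_file():
                out.write(master.read_text())
                merged += 1
    return merged
```

Sorting the dataset folders keeps the output order stable between runs, which makes it easier to compare one output against the next.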
 
#### Development Notes
I had initially planned to check for updates using folder timestamps. However, I think this is a bad design: if a folder is somehow missed, the cron job will not pick it up the next time around, since its modified date will remain the same. For this reason I'm going with a flag/metadata file for each folder.
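
One possible shape for that flag/metadata file is sketched below. The file name `metadata.json`, the JSON layout, and the helper names are all assumptions made for illustration; the project does not define them yet.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

META_NAME = "metadata.json"  # hypothetical file name


def flag_dataset(folder):
    """Create or update the metadata file and mark the dataset for processing."""
    meta_path = Path(folder) / META_NAME
    now = datetime.now(timezone.utc).isoformat()
    if meta_path.is_file():
        meta = json.loads(meta_path.read_text())
    else:
        meta = {"date_created": now}
    meta["date_modified"] = now
    meta["needs_processing"] = True
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta


def clear_flag(folder):
    """Mark the dataset as processed (expects the metadata file to exist)."""
    meta_path = Path(folder) / META_NAME
    meta = json.loads(meta_path.read_text())
    meta["needs_processing"] = False
    meta_path.write_text(json.dumps(meta, indent=2))


def needs_processing(folder):
    """Return True only if the folder has a metadata file with the flag set."""
    meta_path = Path(folder) / META_NAME
    if not meta_path.is_file():
        return False
    return json.loads(meta_path.read_text()).get("needs_processing", False)
```

Because the flag stays set until `clear_flag` is called, a dataset that is missed on one run is still picked up on the next. Note that a cleanup step which deletes every file except the dataset master would also need to skip this metadata file.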
 
 

In [10]:
import os
import pathlib
import shutil


#Define paths to different folders
stagePath = "Stage/"
datasetPath = "Datasets/"
outputPath = "Output/"
#Define a list of directories that need to be processed
toProcess = []

#Checks whether a given path contains any entries. NOTE: os.listdir returns files AND directories (including hidden ones), so an empty subdirectory still counts
# IN: A path to check
# OUT: True if the path contains anything, False if it is empty
def checkForNewFiles(path) : 
    return len(os.listdir(path)) > 0

#Checks for the existence of any datasets with the given name
#IN: (The name of the new dataset), (The path of the directory to check)
#OUT: True if exists
def checkForExistingDataset(name, path) :
    for directory in os.listdir(path) :
        if not directory.startswith('.') :
            if directory == name :
                return True
    return False

#Copies a file from a source directory to a destination directory - Retains the filename
#IN: (Source Directory), (Destination Directory), (Filename)
#OUT: Nil
def copyFile(source, dest, fileName) :
    src = os.path.join(source, fileName)
    dst = os.path.join(dest, fileName)
    shutil.copyfile(src,dst)

#Process Contents of Stage into Datasets - Takes staged folders, 
#puts them in appropriate Dataset folders. Updates Global toProcess list. Cleans Stage
#IN: Nothing
#OUT: Returns True if at least one staged folder was imported, otherwise False
def processStage() :
    global toProcess
    flag = False
    if(checkForNewFiles(stagePath)) :
        #For each new dataset: 
        for directory in os.listdir(stagePath) :
            #Skip hidden entries and loose files; only folders are treated as datasets
            if not directory.startswith('.') and os.path.isdir(stagePath+directory) :
                flag = True
                #If the dataset does not exist yet, make a new dataset folder;
                #otherwise the new data is added to the existing one
                if not checkForExistingDataset(directory, datasetPath) :
                    pathlib.Path(datasetPath+directory).mkdir() 
                #Now copy files to the datasets
                files = os.listdir(stagePath+directory)
                filesCopied = 0
                for f in files :
                    copyFile(stagePath+directory+"/", datasetPath+directory+"/", f)
                    filesCopied += 1
                #print("%s files copied to dataset: %s" % (filesCopied, directory))
                #update Global toProcessList
                toProcess.append(directory)
                #Finally, delete the Directory from Stage
                shutil.rmtree(stagePath+directory)
    #Always return a boolean, even when Stage is empty
    return flag
    
#Process a dataset: given a dataset name, take all the shapefiles in that dataset's directory and process them against a model
#Checks for existence of the dataset master file ({dataset}.txt in this code); if it exists, uses it. If it does not exist, creates it
#IN: Name of the dataset (its folder name under datasetPath)
#OUT: Nil
def processDataset(path) :
    #Check for the dataset master, make it if there isn't one
    dataset = datasetPath + path + "/" + path + ".txt"
    if not (os.path.isfile(dataset)) :
        with open(dataset, "w+"): pass
    #run the model

    #delete the superfluous files (everything except the dataset master)
    for file in os.listdir(datasetPath+path) :
        if file != path+".txt" :
            os.remove(datasetPath + path + "/" + file) 
    


#-----*********MAIN*********-----
print("Running ...")
if(processStage()) :
    print("Stage Files Imported...Processing...")
    for directory in toProcess: 
        processDataset(directory)
    #reset toProcess
    toProcess = []

print("Job complete!")
    
    

Running ...
Stage Files Imported...Processing...
Job complete!
