# SETOM_501A Final Project

This final project will be part of the MSGT capstone project. The project accepts shapefiles or folders of shapefiles, runs them against a GIS model, and then generates a single output shapefile. This final project implements that workflow at a high level as a proof of concept: it searches a staging folder for new files, processes them against a simple data model, and outputs a single file. The exact procedure is described in the code comments. A high-level description follows:

### Data Pipeline
The data passes through multiple stages as it is processed. This is most easily described as a sequence of steps: 
 - Data is ingested and placed in a named folder. That folder is placed in the Stage folder
 - The name of each staged folder is checked against the existing intermediary (dataset) folders. If a matching intermediary folder exists, the new data is an addition to that dataset and is added to it. If no folder exists, the data is a new dataset and a new folder is created for it
 - For each modified or new dataset, process the new files
 - Remove the new files from the dataset once they are processed. The end result of each dataset folder should be a single file
 - Merge the single file from each dataset into one final output file

The end result is: MANY input files --> SOME dataset files --> ONE output file

### Project Fundamental Pseudocode
 - Scan a folder for files
 - If new files, check for the existence of a dataset 
 - If existing dataset, add files to that dataset, flag dataset for processing
 - If no existing dataset, create new dataset, add files to dataset, flag dataset for processing
 - Clear the processed folder from Stage
 - Process all flagged datasets, discard files as they are processed (end result should be one file per dataset)
 - Merge all datasets
 - Output single file, overwrite existing output
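
The pseudocode above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions, not the project's implementation: the function names (`stage_to_datasets`, `process_dataset`, `merge_datasets`, `run_pipeline`) and the `combined.txt` filename are hypothetical, and the "processing" and "merge" steps are plain text concatenation standing in for the GIS model, which is not specified here. Flagging is done with an in-memory set of dataset names.

```python
import shutil
from pathlib import Path

COMBINED = "combined.txt"  # hypothetical name for each dataset's single file


def stage_to_datasets(stage, datasets):
    """Copy staged folders into dataset folders and return the flagged names."""
    flagged = set()
    for folder in sorted(Path(stage).iterdir()):
        if folder.name.startswith(".") or not folder.is_dir():
            continue  # skip hidden entries and loose files
        target = Path(datasets) / folder.name
        target.mkdir(parents=True, exist_ok=True)  # new dataset, or reuse existing
        for f in folder.iterdir():
            if f.is_file():
                shutil.copy2(f, target / f.name)
        shutil.rmtree(folder)  # clear this folder from Stage
        flagged.add(folder.name)
    return flagged


def process_dataset(dataset):
    """Placeholder for the GIS model: fold all new files into one file."""
    dataset = Path(dataset)
    combined = dataset / COMBINED
    parts = sorted(
        p for p in dataset.iterdir()
        if p.is_file() and p != combined and not p.name.startswith(".")
    )
    with combined.open("a") as out:
        for part in parts:
            out.write(part.read_text())
            part.unlink()  # discard each input once it has been processed
    return combined


def merge_datasets(datasets, output_file):
    """Placeholder merge: concatenate each dataset's file, overwriting old output."""
    output_file = Path(output_file)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w") as out:
        for dataset in sorted(Path(datasets).iterdir()):
            if not dataset.is_dir():
                continue
            combined = dataset / COMBINED
            if combined.is_file():
                out.write(combined.read_text())


def run_pipeline(stage, datasets, output_file):
    """Scan Stage, process only the flagged datasets, then merge everything."""
    flagged = stage_to_datasets(stage, datasets)
    for name in sorted(flagged):
        process_dataset(Path(datasets) / name)
    if flagged:
        merge_datasets(datasets, output_file)
    return flagged
```

Calling `run_pipeline("Stage", "Datasets", "Output/result.txt")` would process whatever folders are currently staged. Because only the returned names are processed, datasets that received no new files are left untouched until the merge step.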
 
 

In [84]:
import os
import pathlib
from os import listdir
from os.path import isfile, join
import shutil

#Define paths to different folders
stagePath = "Stage/"
datasetPath = "Datasets/"
outputPath = "Output/"

#Checks for the existence of any files in a given path NOTE: Directories within the path (empty or not) do not count, so a path containing only directories returns False (Files only!)
# IN: A path to check
# OUT: True if files, False if no files
def checkForNewFiles(path) : 
    return any(isfile(join(path, i)) for i in listdir(path))

#Checks for the existence of any datasets with the given name
#IN: (The name of the new dataset), (The path of the directory to check)
#OUT: True if exists
def checkForExistingDataset(name, path) :
    for directory in os.listdir(path) :
        if not directory.startswith('.') :
            if directory == name :
                return True
    return False

#Copies a file from a source directory to a destination directory - Retains the filename
#IN: (Source Directory), (Destination Directory), (Filename)
#OUT: Nil
def copyFile(source, dest, fileName) :
    #join() handles directory paths with or without a trailing slash
    src = join(source, fileName)
    dst = join(dest, fileName)
    shutil.copyfile(src,dst)

#Process Contents of Stage into Datasets - Takes staged folders, puts them in appropriate Dataset folders. Cleans Stage
#IN: Nothing
#OUT: Returns True if something happened
def processStage() :
    flag = False
    #For each staged dataset folder (hidden entries and loose files are skipped): 
    for directory in os.listdir(stagePath) :
        if directory.startswith('.') or not os.path.isdir(join(stagePath, directory)) :
            continue
        flag = True
        #If the dataset already exists add the new data to it: 
        if(checkForExistingDataset(directory, datasetPath)) :
            print("Existing Dataset Found: %s" % (datasetPath+directory))
        #Else it does not exist, so make a new dataset
        else : 
            pathlib.Path(datasetPath+directory).mkdir() 
            print("Adding new dataset: %s" % (datasetPath+directory))

        #Now copy files to the datasets
        files = os.listdir(stagePath+directory)
        filesCopied = 0
        for f in files :
            copyFile(stagePath+directory+"/", datasetPath+directory+"/", f)
            filesCopied += 1
        print("%s files copied to dataset: %s" % (filesCopied, directory))

        #Finally, delete the Directory from Stage
        shutil.rmtree(stagePath+directory)
    #True if at least one staged folder was handled, False otherwise
    return flag
    

    

#-----*********MAIN*********-----
if(processStage()) :
    print("Files processed")

    
#Current Problem 
'''
I have these intermediary folders where some are updated and some are not. I need to flag which ones need processing 
so that I don't have to process all of my intermediary folders. 

How can I flag a folder for processing?
'''

Adding new dataset: Datasets/Dataset88999 2.16.34 PM
5 files copied to dataset: Dataset88999 2.16.34 PM
Adding new dataset: Datasets/Dataset12345 2.16.34 PM
5 files copied to dataset: Dataset12345 2.16.34 PM
Adding new dataset: Datasets/DatasetExisting
5 files copied to dataset: DatasetExisting
Files processed
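
One possible answer to the flagging question above is to drop a small marker file into each dataset folder that receives new data, then process only the folders that contain the marker. The sketch below is just one option under stated assumptions: the marker name `.needs_processing` and the helper names (`flag_dataset`, `is_flagged`, `clear_flag`, `flagged_datasets`) are hypothetical and not part of the project. Unlike an in-memory list, a marker file survives if the notebook is restarted between staging and processing.

```python
from pathlib import Path

FLAG_NAME = ".needs_processing"  # hypothetical marker filename


def flag_dataset(dataset_dir):
    """Mark a dataset folder as needing processing."""
    (Path(dataset_dir) / FLAG_NAME).touch()


def is_flagged(dataset_dir):
    """Return True if the dataset folder carries the marker file."""
    return (Path(dataset_dir) / FLAG_NAME).is_file()


def clear_flag(dataset_dir):
    """Remove the marker once the dataset has been processed."""
    marker = Path(dataset_dir) / FLAG_NAME
    if marker.exists():
        marker.unlink()


def flagged_datasets(datasets_root):
    """List the names of all dataset folders that still need processing."""
    root = Path(datasets_root)
    return sorted(d.name for d in root.iterdir() if d.is_dir() and is_flagged(d))
```

In `processStage()`, `flag_dataset(datasetPath+directory)` could be called right after the copy loop; a later processing step would then loop over `flagged_datasets(datasetPath)` and call `clear_flag()` as each dataset finishes. The marker name starts with a dot so that hidden-file checks skip it, but the processing step would still need to ignore it when reading the dataset's files.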
