cat_esri

Description

LogicalCat cat_esri is a command-line utility that collects metadata and file attributes from ESRI shapefiles and geodatabases by recursively crawling the supplied path and parsing file components. It can export the full metadata in .csv or sqlite3 format. E&P data managers and technicians can use it to locate scattered shapefiles, identify duplicates using a composite checksum, and extract X/Y extents and coordinate systems. See the Fields section below for details.

cat_esri is part of a suite of crawlers that work with LogicalCat’s Simple E&P Search Engine to provide full text search and reporting for messy geo-computing environments. Contact LogicalCat LLC for more details.

The following “data” gems collect metadata from several data exchange formats commonly used by E&P companies:

  • cat_bizdoc: MS Office documents and Adobe .pdf

  • cat_esri: ESRI Shapefiles and Personal/File Geodatabases

  • cat_imagery: Common aerial photo/satellite image file formats

  • cat_las: Petrophysical LAS files (including those in .zips)

  • cat_rascurve: Petrophysical Raster Log Image files (.tifs and .jpgs)

  • cat_seismic: SEGY and SEGP1 header metadata

The following “project” gems collect project metadata on wells, maps, digital curves, raster curves, surfaces, and project inventories:

  • cat_petra: IHS PETRA project crawler

  • cat_kingdom: IHS Kingdom Suite project crawler

  • cat_discovery: LMKR GeoGraphix Discovery project crawler

Installation

gem install cat_esri

Requirements

LogicalCat cat_esri works on Windows and Unix-based operating systems; however, the Personal Geodatabase parser works only on Windows. The cat_esri crawler was developed on Ruby 1.9.3 and depends on the following gems:

  • trollop

  • sqlite3

  • dbf
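
If you manage dependencies with Bundler, a minimal Gemfile along the lines below should cover the same requirements; the explicit entries for trollop, sqlite3 and dbf are only needed if installing cat_esri does not already pull them in.

  source 'https://rubygems.org'

  gem 'cat_esri'
  # Redundant if the cat_esri gemspec already declares these dependencies.
  gem 'trollop'
  gem 'sqlite3'
  gem 'dbf'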

Command Line Options

-p    --path

Network path to scan. Path strings should be quoted if they contain spaces. The --path argument is mandatory.

-l    --logfile

Log crawl activity to file. Path strings should be quoted if they contain spaces. Logging will capture all activity, including errors or interruptions.

-t    --timeout

The time limit (in seconds) allowed for certain long-running tasks before the crawler moves on. Default = 10 sec. Setting a short timeout may result in collection of only partial content.

-e    --esrigdb

ESRI Personal and File geodatabases are excluded by default. Set the ‘-e’ flag to include them. Note: at this time the File Geodatabase parser is a “hack” and will collect some garbage text data; the Personal Geodatabase parser works fine.

-g    --ggxlayer

GeoGraphix Discovery’s GeoAtlas layers (www.geographix.com) are excluded by default. Set the ‘-g’ flag to include them, but note that there will be duplication if you are also running a cat_discovery ‘map’ crawl.

-f    --format

Output format: csv (comma separated values) or sqlite3 database. Default is sqlite3. See www.sqlite.org/ for info about sqlite3.

-x    --xitems

Maximum number of items per outfile. Default = 50000; a new outfile is started when the limit is exceeded. Items are held in memory before writing, so very high values may exhaust memory. See the note about outfile timestamp naming below.

-o    --outdir

The directory for output files. Output file names are auto-generated from the data type, a timestamp, and an extension matching the selected format. For example, a crawl of digital log curves using sqlite3 format might be: DIGCURVE_1392987864628876.sqlite

-n    --label

Apply an arbitrary label to the crawl results. Labels can be used to help filter or group multiple crawls later. Use a business unit, vendor or other logical group name.

-q    --quell

Exclude crawl subdirectories. Use --quell to stop recursion on folders under the path that should not be scanned. Separate multiple exclusions with a question mark, ‘?’: cat_esri -p "X:\stuff" -q "X:\stuff\secret?X:\stuff\boring?X:\stuff\hidden"

NOTE: Any paths or labels containing spaces should be enclosed in quotes!

Usage

Crawl path on the X: drive and write output to c:\temp in default sqlite3 format:

cat_esri -p "X:\geodata\gis data" -o c:\temp

Same crawl as above, log activity to file, and include GeoGraphix layers:

cat_esri -p "X:\geodata\gis data" -o c:\temp -l "c:\temp\crawler.log" -g

Output to .csv format, write only 100 items per outfile:

cat_esri -p "X:\geodata\gis data" -x 100 -f csv -o "c:\temp"

Output to .csv format, write only 1000 items per outfile, set 45 second timeout:

cat_esri -p "X:\geodata\gis data" -x 1000 -f csv -o "c:\temp" -t 45

Crawl a UNC path, output to sqlite3, exclude one subdirectory:

cat_esri -p "\\server\stuff" -f sqlite3 -o "c:\temp" -q "\\server\stuff\secret"

Crawl a UNC path, output to sqlite3, exclude two subdirectories:

cat_esri -p "\\server\stuff" -f sqlite3 -o "c:\temp" -q "\\server\stuff\secret?\\server\stuff\dull"
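
Once a crawl finishes, the sqlite3 outfile can be queried directly. The Ruby sketch below (using the same sqlite3 gem the crawler depends on) groups rows by the composite checksum field to flag likely duplicates; the outfile name is illustrative, and because the output table name is not documented here it is looked up from sqlite_master.

  require 'sqlite3'

  # Illustrative outfile name; use whatever cat_esri actually wrote to --outdir.
  db = SQLite3::Database.new('MAP_1392987864628876.sqlite')

  # Discover the output table name rather than assuming it.
  table = db.get_first_value("SELECT name FROM sqlite_master WHERE type = 'table' LIMIT 1")

  # Checksums that appear more than once indicate likely duplicates.
  sql = "SELECT checksum, COUNT(*) AS copies, GROUP_CONCAT(location, ' | ') AS locations
         FROM #{table}
         GROUP BY checksum
         HAVING COUNT(*) > 1
         ORDER BY copies DESC"

  db.execute(sql) do |checksum, copies, locations|
    puts "#{checksum} (#{copies} copies): #{locations}"
  end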

Fields

The following types of metadata are collected by cat_esri. Not all fields may fit politely in an Excel spreadsheet. Contact LogicalCat LLC if you require more sophisticated options.

store

‘shapefile’, ‘esri_pgdb’ or ‘esri_fgdb’

name

file name (basename of shapefile or geodatabase)

location

the network path for the file/folder

checksum

a composite MD5 checksum based on multi-file components (see the sketch after this list)

modified

composite file modification date (last modified)

bytes

composite file size

coordsys

geographic coordinate/projection system

x_min

minimum easting extent (in default units)

x_max

maximum easting extent (in default units)

y_min

minimum northing extent (in default units)

y_max

maximum northing extent (in default units)

cloud

unique text from shapefile .dbf or geodatabase tables

minced

chopped file path (for use with full text search)

scanclient

hostname of workstation running cat_esri

guid

unique identifier for this file

model

‘map’ (allows comparison to project-bound maps)

created_at

time of crawl
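
For illustration only, one way to build a composite checksum of this sort is to hash each component file and then hash the concatenated digests. The sketch below is not necessarily cat_esri’s exact algorithm; it just shows the idea behind matching duplicate shapefiles whose multi-file components are scattered across folders.

  require 'digest'

  # Illustrative only; not necessarily cat_esri's exact algorithm.
  # Hash each shapefile component, then hash the combined digests.
  def composite_checksum(shp_path)
    base    = shp_path.sub(/\.shp\z/i, '')
    parts   = %w[.shp .shx .dbf .prj].map { |ext| base + ext }.select { |f| File.exist?(f) }
    digests = parts.sort.map { |f| Digest::MD5.file(f).hexdigest }
    Digest::MD5.hexdigest(digests.join)
  end

  puts composite_checksum('X:/geodata/gis data/leases.shp')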

Note on Patches/Pull Requests

  • Fork the project.

  • Ask your mother for permission.

  • Make your feature addition or bug fix.

  • Add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Commit, do not mess with rakefile, version, or history.

  • Send me a pull request. Bonus points for topic branches.

Copyright (c) 2012 LogicalCat LLC

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
