LogicalCat cat_esri is a command line utility for collecting metadata content and various file attributes from ESRI shapefiles and geodatabases by recursively crawling the supplied path and parsing file components. It can export full metadata in .csv or sqlite3 formats. E&P data managers or technicians can use it to locate scattered shapefiles, identify duplicates using a special composite checksum, get X/Y boundaries, get coordinate systems, etc. See the fields section below for details.
cat_esri is part of a suite of crawlers that work with LogicalCat’s Simple E&P Search Engine to provide full text search and reporting for messy geo-computing environments. Contact LogicalCat LLC for more details.
The following “data” gems collect metadata from several data exchange formats commonly used by E&P companies:
- cat_bizdoc
-
MS Office documents and Adobe .pdf
- cat_esri
-
ESRI Shapefiles and Personal/File Geodatabases
- cat_imagery
-
Common aerial photo/satellite image file formats
- cat_las
-
Petrophysical LAS files (including those in .zips)
- cat_rascurve
-
Petrophysical Raster Log Image files (.tifs and .jpgs)
- cat_seismic
-
SEGY and SEGP1 header metadata
The following “project” gems collect project metadata on: wells, maps, digital curves, raster curves, surfaces, and project inventories:
- cat_petra
-
IHS PETRA project crawler
- cat_kingdom
-
IHS Kingdom Suite project crawler
- cat_discovery
-
LMKR GeoGraphix Discovery project crawler
gem install cat_esri
LogicalCat cat_esri works on Windows and Unix-based operating systems, however the Personal Geodatabase parser only works on Windows. The cat_esri crawler was developed on Ruby 1.9.3 and depends on the following gems:
-
trollop
-
sqlite3
-
dbf
-p --path
Network path to scan. Path strings should be quoted if they contain spaces. The –path argument is mandatory.
-l --logfile
Log crawl activity to file. Path strings should be quoted if they contain spaces. Logging will capture all activity, including errors or interruptions.
-t --timeout
The time limit (in seconds) to allow some long-running tasks before moving on. Default = 10 sec. Setting a short timeout may result in collection of only partial content.
-e --esrigdb
ESRI Personal and File geodatabases are excluded by default. Set the ‘-e’ flag to include them. Note: at this time the File Geodatabase parser is a “hack” and will collect some garbage text data; the Personal Geodatabases parser works fine.
-g --ggxlayer
GeoGraphix Discovery’s GeoAtlas layers (www.geographix.com) are excluded by default. Set the ‘-g’ flag to include them, but note that there will be duplication if you are also running a cat_discovery ‘map’ crawl.
-f --format
Output format: csv (comma separated values) or sqlite3 database. Default is sqlite3. See www.sqlite.org/ for info about sqlite3.
-x --xitems
Maximum items per outfile. Default = 50000; writes new file if exceeded. Items are held in memory before writing; very high values may cause problems. See note about outfile timestamp naming below.
-o --outdir
The directory for output files. Crawler file names are auto-generated based on data type, a timestamp, and an extension based on selected format. For example, a crawl of digital log curves using sqlite3 format might be: DIGCURVE_1392987864628876.sqlite
-n --label
Apply an arbitrary label to the crawl results. Labels can be used to help filter or group multiple crawls later. Use a business unit, vendor or other logical group name.
-q --quell
Exclude crawl subdirectories. Use –quell to stop recursion on folders under the path that should not be scanned. Separate multiple exclusions with a question mark, ‘?’: cat_esri -p “X:stuff” -q “X:stuffsecret?X:stuffboring?X:stuffhidden”
NOTE: Any paths or labels containing spaces should be enclosed in quotes!
Crawl path on the X: drive and write output to c:temp in default sqlite3 format:
cat_esri -p "X:\geodata\gis data" -o c:\temp
Same crawl as above, log activity to file, and include GeoGraphix layers:
cat_esri -p "X:\geodata\gis data" -o c:\temp -l "c:\temp\crawler.log" -g
Output to .csv format, write only 100 items per outfile:
cat_esri -p "X:\geodata\gis data" -x 100 -f csv -o "c:\temp"
Output to .csv format, write only 1000 items per outfile, set 45 second timeout:
cat_esri -p "X:\geodata\gis data" -x 1000 -f csv -o "c:\temp" -t 45
Crawl a UNC path, output to sqlite3, exclude one subdirectory:
cat_esri -p "\\server\stuff" -f sqlite3 -o "c:\temp" -q "\\server\stuff\secret"
Crawl a UNC path, output to sqlite3, exclude two subdirectories:
cat_esri -p "\\server\stuff" -f sqlite3 -o "c:\temp" -q "\\server\stuff\secret?\\server\stuff\dull"
The following types of metadata are collected by cat_esri. Not all fields may fit politely in an Excel spreadsheet. Contact LogicalCat LLC if you require more sophisticated options.
store = ‘shapefile’, ‘esri_pgdb’ or ‘esri_fgdb’ name = file name (basename of shapefile or geodatabase) location = the network path for the file/folder checksum = a composite MD5 checksum based on multi-file components modified = composite file modification date (last modified) bytes = composite file size
- coordsys
-
geographic coordinate/projection system
- x_min
-
minimum northing extent (in default units)
- x_max
-
maximum northing extent (in default units)
- y_min
-
minimum easting extent (in default units)
- y_max
-
maximum easting extent (in default units)
- cloud
-
unique text from shapefile .dbf or geodatabase tables
- minced
-
chopped file path (for use with full text search)
- scanclient
-
hostname of workstation running cat_esri
- guid
-
unique identifier for this file
- model
-
‘map’ (allows comparison to project-bound maps)
- created_at
-
time of crawl
-
Fork the project.
-
Ask your mother for permission.
-
Make your feature addition or bug fix.
-
Add tests for it. This is important so I don’t break it in a future version unintentionally.
-
Commit, do not mess with rakefile, version, or history.
-
Send me a pull request. Bonus points for topic branches.
Copyright (c) 2012 LogicalCat LLC Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.