"Scraper" class for DTI's ContentPublisher CMS which is built on Intersystems Caché, the world's fastest high performance object database.

custom.rg.Scraper

About:

"Scraper" class for Digital Technology International's ContentPublisher CMS, which is built on InterSystems Caché, the world's fastest high-performance object database.

Feedback:

If you improve upon this code and/or have feedback, please contact me:

micky [at] registerguard (.) com.

... or use this repo's issue tracker.

Thank-you's:

Big ups go out to DTI's Eric Gauthier and Joy Peterson! Thanks for the pro help and optimization tips, guys!

Basic usage:

#(##class(custom.rg.Scraper).scrape("baz", "www.foo.com", "cms/some/path/foo.php", 10))#

See scraper.csp for more examples.

custom_rg_Scraper properties:

  • "name" Name of scraping "fragment".
  • "interval" Interval of scraping in minutes.
  • "first" First time scraped.
  • "scraped" Date/time of last scraping.
  • "counter" Counts how many times scraping has been updated.
  • "scraping" Contents of scraping.
  • "uri" URI of scraping.

custom_rg_Scraper methods:

  • "expired" Checks if scraping has expired.
  • "diff" Time difference since last scraping to now in minutes.
  • "age" Time since very first scraping.
  • "next" Time until next scraping in minutes.
  • "elapsed" Elapsed time, in minutes, since last update.
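
The timing methods above reduce to simple arithmetic on the "first", "scraped" and "interval" properties. The Python sketch below shows one way that arithmetic could work; it is a model, not the class's code. Counting in whole minutes, reporting "age" in minutes, and clamping "next" at zero are assumptions, and because "diff" and "elapsed" are described in nearly identical terms, only `elapsed` is modeled.

```python
from dataclasses import dataclass
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes from `start` to `end` (assumed unit for every method)."""
    return int((end - start).total_seconds() // 60)


@dataclass
class Fragment:
    first: datetime     # first time scraped
    scraped: datetime   # date/time of last scraping
    interval: int = 60  # scraping interval in minutes

    def elapsed(self, now):
        """Minutes since the last update."""
        return minutes_between(self.scraped, now)

    def next(self, now):
        """Minutes until the next scraping is due, never below zero."""
        return max(self.interval - self.elapsed(now), 0)

    def expired(self, now):
        """True once a full interval has passed since the last scraping."""
        return self.elapsed(now) >= self.interval

    def age(self, now):
        """Minutes since the very first scraping."""
        return minutes_between(self.first, now)
```

For a fragment with a 10-minute interval that was last scraped seven and a half minutes ago, `elapsed` gives 7, `next` gives 3, and `expired` is false.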

custom_rg_Scraper.scrape() parameters:

  • "name" (Required) Scraping fragment identifier.
  • "server" (Required) The IP address or machine name of the web server that you wish to connect to.
  • "location" The URL path to request, e.g. '/test.html'. It may contain query parameters, which are assumed to be already URL-escaped.
  • "interval" Time, in minutes, of scraping interval. Default: 60 minutes.
  • "force" Force scraping fragment update? Default: False (0).
  • "userAgent" The User-Agent request-header field contains information about the user agent originating the request.
  • "followRedirect" If true, automatically follow redirects issued by the web server. Default: False (0).
  • "https" If true (and no proxy server is in use), request the page over HTTPS instead of plain HTTP. Default: False (0).
  • "authorization" Sets/gets the 'Authorization:' header field in the HTTP request.
  • "contentEncoding" Sets/gets the 'Content-Encoding:' entity header field in the HTTP request.
  • "contentType" Sets/gets the 'Content-Type:' entity header field in the HTTP request. Default: "text/html".
  • "contentCharset" If the ContentType starts with 'text/' then this is the charset to encode the contents with. Default: UTF-8.
  • "port" The TCP/IP port number to connect to. Default: 80.
  • "pragma" The Pragma general-header field is used to include implementation-specific directives that may apply to any recipient along the request/response chain.
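
To make the relationship between "server", "location", "https" and "port" concrete, the Python sketch below combines them into the URL that would be requested. The `build_url` helper is hypothetical and is not part of the class; omitting the port when it matches the scheme's usual default (80 for http, 443 for https) is an assumption made to keep the output readable.

```python
def build_url(server, location="", https=False, port=80):
    """Combine the scrape() connection parameters into one request URL."""
    scheme = "https" if https else "http"
    # Assumption: hide the port when it is the usual default for the scheme.
    default_port = 443 if https else 80
    host = server if port == default_port else f"{server}:{port}"
    # Exactly one slash between host and location; query strings pass through.
    return f"{scheme}://{host}/{location.lstrip('/')}"


print(build_url("www.foo.com", "cms/some/path/foo.php"))
print(build_url("10.0.0.5", "/test.html", port=8080))
```

Because the documented default for "port" is 80, a caller of this sketch who sets `https=True` would also pass `port=443` to get a conventional HTTPS URL.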

To-do list:

  • ClassMethod "scrape": Validate the "server" parameter? Would probably need to account for IP addresses.
  • Use TRY/CATCH? Not sure of best way to handle this.
  • What's faster/easier than $SYSTEM.SQL.DATEDIFF?
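
On the first to-do item, one possible shape for "server" validation is to accept anything that parses as an IP address and otherwise check that it looks like a dotted host name. The Python sketch below illustrates that idea only; `is_valid_server` is a hypothetical helper, the label rule (letters, digits and inner hyphens, 1 to 63 characters) is a simplification, and nothing here reflects how the class behaves today.

```python
import ipaddress
import re

# One host-name label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_valid_server(server):
    """Return True for an IPv4/IPv6 address or a syntactically plausible host name."""
    try:
        ipaddress.ip_address(server)
        return True
    except ValueError:
        pass  # not an IP address, fall through to the host-name check
    if not server or len(server) > 253:
        return False
    return all(_LABEL.match(label) for label in server.split("."))
```

A value such as "http://www.foo.com" would be rejected, since the scheme belongs to the "https" parameter rather than to "server".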

Changelog:

  • v2.0.0: 2013/05/30
    • MAJOR FIX: Squashed truncation bug!
      • Removed ##class(%CSP.Page).EscapeHTML() and ##class(%CSP.Page).UnescapeHTML().
    • MAJOR FIX: Now passing the location to http.Get(location) instead of setting http.Location = location beforehand.
      • Setting http.Location would not allow query strings, whereas http.Get(location) does.
    • Added while( ' http.HttpResponse.Data.AtEnd) { ... } to make sure the response finalizes before writing it to the database.
    • Changed stream.Read() to stream.Read($$$MaxLocalLength).
      • Using the $$$MaxLocalLength macro to make sure that I can safely read all of the stream content that can fit into a string without getting a <MAXSTRING> error.
    • Changed stream.SizeGet() to stream.Size.
      • The former is a getter method for the Size property and is implicitly invoked when you access said property.
    • Updated/added repo boilerplate files.
    • Moved code to scraper sub-folder (keeps the primary code separate from the boilerplate cruft).
    • Bumped version number.
  • v1.0.2: 2011/04/05
    • Modified URI formatting for storage in table.
      • Needed to account for slash between server and location variables.
    • Cleaned up indentation of tabbed white space.
    • Slightly modified class documentation.
      • Added version number.
    • Added (more) error checking to the return value of ClassMethod scrape().
  • v1.0.1: 2011/03/30
    • Properties no longer truncate.
    • Properties that are logically required have been marked "Required".
    • Added property "counter": Counts the number of times scraping has been updated.
    • Added "age" method: Time since very first scraping.
    • Added "elapsed" method: Elapsed time, in minutes, since last update.
    • ClassMethod scrape() now depends on DTI's dtCommon.inc macros.
      • Making use of $$$ISOK(), $$$ISERR() and $$$dtThrow() macros.
      • Character stream now uses DT's global character stream class.
      • Native Caché streams are not cached on ECP servers while streams defined in dt.common.streams.* are.
      • Added/modified code, in a few spots, to check return "status".
      • Initialized vars at top of ClassMethod (I like to see what I am working with).
  • v1.0.0: 2011/03/10
    • Initial public release: Uploaded to GitHub.
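
The v2.0.0 truncation fix comes down to a general pattern: read a response stream in bounded chunks and keep going until the stream reports that it is at its end, rather than trusting a single read. The Python sketch below illustrates that pattern with `io.StringIO` standing in for the response stream. `MAX_CHUNK` is a hypothetical stand-in for $$$MaxLocalLength, kept tiny so the effect is visible, and joining the chunks into one string is an assumption of the sketch, not a description of what the class stores.

```python
import io

MAX_CHUNK = 32  # hypothetical stand-in for $$$MaxLocalLength, tiny for the demo


def read_all(stream, max_chunk=MAX_CHUNK):
    """Read `stream` in chunks of at most `max_chunk` characters until it is empty."""
    parts = []
    while True:
        chunk = stream.read(max_chunk)
        if not chunk:  # an empty read means the stream is at its end
            break
        parts.append(chunk)
    return "".join(parts)


body = "x" * 100
single_read = io.StringIO(body).read(MAX_CHUNK)  # one bounded read: truncated
full_read = read_all(io.StringIO(body))          # looped reads: complete
print(len(single_read), len(full_read))
```

The single bounded read returns only the first chunk, which is the kind of truncation the changelog entry describes, while the loop recovers the whole body.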

LEGAL

Copyright © 2013 Micky Hulse/The Register-Guard

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License in the LICENSE file, or at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.