PATFT is a simple gem that extracts relevant data from the raw HTML provided by the USPTO at http://patft.uspto.gov/. PATFT uses Nokogiri and XPath to scan the HTML given to it and returns a structured representation (e.g., a Hash or JSON) of the patent document.
WARNING: PATFT is under active development. Refer to the roadmap below (and the specs) to see what is and is not implemented.
require 'patft'
local_html = File.read('patent.html')
patents = PATFT::Parser.new(local_html)
patents.extract(:title) # => 'System and method for ...'

Note that PATFT::Parser#parse requires a String representation of the HTML; how you obtain it is up to you. This is intentional, given the USPTO's policy on scraping (and, more generally, to encourage responsible use).
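Since the parser only needs a String, one way to stay on the responsible side is to keep a local copy of each page and reuse it. The following is a minimal sketch under stated assumptions: `cached_html` is a hypothetical helper, not part of PATFT, and the block passed to it stands in for whatever retrieval method you choose.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not part of PATFT): returns the HTML for a patent as a
# String. It reads from a local cache file when one exists and calls the
# supplied block only on a cache miss.
def cached_html(patent_number, cache_dir)
  path = File.join(cache_dir, "#{patent_number}.html")
  return File.read(path) if File.exist?(path)

  html = yield(patent_number) # the caller decides how to obtain the page
  FileUtils.mkdir_p(cache_dir)
  File.write(path, html)
  html
end

# Usage: the block below is a stand-in that fabricates a tiny page.
fetch_count = 0
cache_dir = Dir.mktmpdir('patft-cache')
2.times do
  cached_html('1234567', cache_dir) do |number|
    fetch_count += 1
    "<html><title>Patent #{number}</title></html>"
  end
end
puts fetch_count # the block ran only once; the second call hit the cache
```

The returned String can then be handed to the parser in the same way as `local_html` above.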
Below are the keys output by Parser#parse:
- Number: A String containing the patent number, without the kind code. Note that this field may contain non-numeric characters for design, reissue, and similar patents.
- Title: A String containing the title.
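Because the patent number is a String that may carry a letter prefix, code consuming it should not assume it converts cleanly to an Integer. As a rough sketch, the hypothetical helper below (not part of PATFT) guesses the patent type from the leading letters, assuming the common USPTO prefixes `D` (design), `RE` (reissue), and `PP` (plant):

```ruby
# Hypothetical helper (not part of PATFT): guesses the patent type from the
# leading letters of a patent number String. Numbers without a letter prefix
# are treated as utility patents; unrecognized prefixes map to :other.
def patent_type(number)
  case number[/\A[A-Z]+/] # leading capital letters, or nil if there are none
  when nil  then :utility
  when 'D'  then :design
  when 'RE' then :reissue
  when 'PP' then :plant
  else :other
  end
end

patent_type('7654321') # => :utility
patent_type('D654321') # => :design
patent_type('RE45678') # => :reissue
```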
Extract the following fields:
- Number
- Title
- Issue Date
- Abstract
- Inventors*+
- Assignee*
- Family ID
- Serial Number
- Filing Date
- US Class*+
- CPC Class*+
- Int'l Class*+
- Field of search
- Primary Examiner
- Assistant Examiner
- Attorney/Agent
- Parent Case Text
- Claims*+
- Description (paragraphs)+
- Related Patents*+
- References Cited*+
Format notes:
- An asterisk (*) denotes structured data.
- A plus (+) denotes an array of data.
- An asterisk and a plus together (*+) denote an array of structured data.
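To make the notation concrete, here is an illustrative Hash showing the four shapes side by side. The key names and values are assumptions chosen for the example; they are not confirmed PATFT output.

```ruby
require 'json'

# Illustrative data only: key names and values are assumptions, not PATFT output.
def sample_patent
  {
    title: 'Example widget',                                 # unmarked: plain value
    assignee: { name: 'Example Corp', city: 'Springfield' }, # *  structured data
    description: ['First paragraph.', 'Second paragraph.'],  # +  array of data
    inventors: [                                             # *+ array of structured data
      { first_name: 'Ada', last_name: 'Example' },
      { first_name: 'Bob', last_name: 'Sample' }
    ]
  }
end

# A Hash of this shape serializes to JSON and back without losing structure.
json = JSON.generate(sample_patent)
round_tripped = JSON.parse(json, symbolize_names: true)
puts round_tripped == sample_patent
```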
Planned for future releases:
- CLI
- Increase field support based on the Red Book (e.g., PCT data)
- Remote search interface
- Query tool ("Advanced Search")
- AppFT (probably a different gem)
The gem is available as open source under the terms of the MIT License.