introduce html4 namespace #2278

Merged
merged 6 commits into from
Jun 21, 2021
3 changes: 3 additions & 0 deletions .yardopts
@@ -1,5 +1,8 @@
--embed-mixins
--main=README.md
--exclude=lib/nokogiri/css/tokenizer.rb
--exclude=lib/nokogiri/css/parser.rb
--exclude=ext/nokogiri/test_global_handlers.c
lib/**/*.rb
ext/nokogiri/*.c
-
21 changes: 17 additions & 4 deletions CHANGELOG.md
@@ -6,17 +6,25 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

## next / unreleased

### Added
### Notable Addition: HTML5 Support (CRuby only)

__HTML5 support__ has been added (to CRuby only) by merging [Nokogumbo](https://github.com/rubys/nokogumbo) into Nokogiri. The Nokogumbo public API has been preserved, so this functionality is available under the `Nokogiri::HTML5` namespace. [[#2204](https://github.com/sparklemotion/nokogiri/issues/2204)]

Please note that HTML5 support is not available for JRuby in this version. However, we feel it is important to think about JRuby and we hope to work on this in the future. If you're interested in helping with HTML5 support on JRuby, please reach out to the maintainers by commenting on issue [#2227](https://github.com/sparklemotion/nokogiri/issues/2227).

Please also note that the `Nokogiri::HTML` parse methods still use libxml2's HTML4 parser in the v1.12 release series. Future releases of Nokogiri may change this behavior, but we'll proceed cautiously to avoid breaking existing applications.

Many thanks to Sam Ruby, Steve Checkoway, and Craig Barnes for creating and maintaining Nokogumbo and supporting the Gumbo HTML5 parser. They're now Nokogiri core contributors with all the powers and privileges pertaining thereto. 🙌

#### Other

### Notable Change: `Nokogiri::HTML4` module and namespace

`Nokogiri::HTML` has been renamed to `Nokogiri::HTML4`, and `Nokogiri::HTML` is aliased to preserve backwards-compatibility. `Nokogiri::HTML` and `Nokogiri::HTML4` parse methods still use libxml2's (or NekoHTML's) HTML4 parser in the v1.12 release series.

Take special note that if you rely on the class name of an object in your code, objects will now report a class of `Nokogiri::HTML4::Foo` where they previously reported `Nokogiri::HTML::Foo`. Instead of relying on the string returned by `Object#class`, prefer `Class#===` or `Object#is_a?` or `Object#instance_of?`.

Future releases of Nokogiri may deprecate `HTML` methods or otherwise change this behavior, so please start using `HTML4` in place of `HTML`.
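
The effect of the alias on class-name checks can be sketched without loading Nokogiri at all. The `Demo` module below is a hypothetical stand-in (an assumption for illustration, not Nokogiri's actual source) that mirrors the arrangement described above: a module named `HTML4` with the old constant `HTML` pointing at it.

```ruby
# Hypothetical stand-in for the namespaces described above; not the real library.
module Demo
  module HTML4
    class Document; end
  end

  # The old constant is an alias for the new module.
  HTML = HTML4
end

doc = Demo::HTML4::Document.new

# Fragile: the class reports its name under the new namespace,
# so comparing against the old string no longer matches.
reported_name    = doc.class.name
name_matches_old = (reported_name == "Demo::HTML::Document")

# Robust: both constants resolve to the same class object,
# so these checks succeed regardless of which name is used.
via_old_constant = doc.is_a?(Demo::HTML::Document)
via_case_equal   = Demo::HTML4::Document === doc
via_instance_of  = doc.instance_of?(Demo::HTML::Document)

puts reported_name
puts name_matches_old
puts via_old_constant && via_case_equal && via_instance_of
```

In this sketch the string comparison fails while all three class-based checks pass, which is why the class-based checks are the safer choice across the rename.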


### Added

* [CRuby] `Nokogiri::VERSION_INFO["libxslt"]["datetime_enabled"]` is a new boolean value which describes whether libxslt (or, more properly, libexslt) has compiled-in datetime support. This is generally going to be `true`, but some distros ship without this support (e.g., some mingw UCRT-based packages, see https://github.com/msys2/MINGW-packages/pull/8957). See [#2272](https://github.com/sparklemotion/nokogiri/issues/2272) for more details.

@@ -38,6 +46,11 @@ Many thanks to Sam Ruby, Steve Checkoway, and Craig Barnes for creating and main
* [CRuby] Speed up (slightly) the compile time of packaged libraries `libiconv`, `libxml2`, and `libxslt` by using autoconf's `--disable-dependency-tracking` option. ("ruby" platform gem only.)


### Deprecated

* Deprecating Nokogumbo's `Nokogiri::HTML5.get`. This method will be removed in a future version of Nokogiri.


### Dependencies

* [CRuby] Upgrade mini_portile2 dependency from `~> 2.5.0` to `~> 2.6.1`. ("ruby" platform gem only.)
2 changes: 1 addition & 1 deletion README.md
@@ -14,7 +14,7 @@ Some guiding principles Nokogiri tries to follow:

## Features Overview

- DOM Parser for XML and HTML4
- DOM Parser for XML, HTML4, and HTML5
- SAX Parser for XML and HTML4
- Push Parser for XML and HTML4
- Document search via XPath 1.0
@@ -18,13 +18,13 @@
import static nokogiri.internals.NokogiriHelpers.getNokogiriClass;

/**
* Class for Nokogiri::HTML::Document.
* Class for Nokogiri::HTML4::Document.
*
* @author sergio
* @author Yoko Harada <yokolet@gmail.com>
*/
@JRubyClass(name = "Nokogiri::HTML::Document", parent = "Nokogiri::XML::Document")
public class HtmlDocument extends XmlDocument
@JRubyClass(name = "Nokogiri::HTML4::Document", parent = "Nokogiri::XML::Document")
public class Html4Document extends XmlDocument
{
private static final String DEFAULT_CONTENT_TYPE = "html";
private static final String DEFAULT_PUBLIC_ID = "-//W3C//DTD HTML 4.01//EN";
@@ -33,19 +33,19 @@ public class HtmlDocument extends XmlDocument
private String parsed_encoding = null;

public
HtmlDocument(Ruby ruby, RubyClass klazz)
Html4Document(Ruby ruby, RubyClass klazz)
{
super(ruby, klazz);
}

public
HtmlDocument(Ruby runtime, Document document)
Html4Document(Ruby runtime, Document document)
{
this(runtime, getNokogiriClass(runtime, "Nokogiri::XML::Document"), document);
}

public
HtmlDocument(Ruby ruby, RubyClass klazz, Document doc)
Html4Document(Ruby ruby, RubyClass klazz, Document doc)
{
super(ruby, klazz, doc);
}
@@ -55,10 +55,10 @@ public class HtmlDocument extends XmlDocument
rbNew(ThreadContext context, IRubyObject klazz, IRubyObject[] args)
{
final Ruby runtime = context.runtime;
HtmlDocument htmlDocument;
Html4Document htmlDocument;
try {
Document docNode = createNewDocument(runtime);
htmlDocument = (HtmlDocument) NokogiriService.HTML_DOCUMENT_ALLOCATOR.allocate(runtime, (RubyClass) klazz);
htmlDocument = (Html4Document) NokogiriService.HTML_DOCUMENT_ALLOCATOR.allocate(runtime, (RubyClass) klazz);
htmlDocument.setDocumentNode(context.runtime, docNode);
} catch (Exception ex) {
throw asRuntimeError(runtime, "couldn't create document: ", ex);
@@ -135,13 +135,6 @@ public class HtmlDocument extends XmlDocument
return parsed_encoding;
}

/*
* call-seq:
* read_io(io, url, encoding, options)
*
* Read the HTML document from +io+ with given +url+, +encoding+,
* and +options+. See Nokogiri::HTML.parse
*/
@JRubyMethod(meta = true, required = 4)
public static IRubyObject
read_io(ThreadContext context, IRubyObject klass, IRubyObject[] args)
@@ -151,13 +144,6 @@ public class HtmlDocument extends XmlDocument
return ctx.parse(context, (RubyClass) klass, args[1]);
}

/*
* call-seq:
* read_memory(string, url, encoding, options)
*
* Read the HTML document contained in +string+ with given +url+, +encoding+,
* and +options+. See Nokogiri::HTML.parse
*/
@JRubyMethod(meta = true, required = 4)
public static IRubyObject
read_memory(ThreadContext context, IRubyObject klass, IRubyObject[] args)
@@ -16,12 +16,12 @@
import org.jruby.runtime.builtin.IRubyObject;

/**
* Class for Nokogiri::HTML::ElementDescription.
* Class for Nokogiri::HTML4::ElementDescription.
*
* @author Patrick Mahoney <pat@polycrystal.org>
*/
@JRubyClass(name = "Nokogiri::HTML::ElementDescription")
public class HtmlElementDescription extends RubyObject
@JRubyClass(name = "Nokogiri::HTML4::ElementDescription")
public class Html4ElementDescription extends RubyObject
{

/**
@@ -38,7 +38,7 @@ public class HtmlElementDescription extends RubyObject
protected HTMLElements.Element element;

public
HtmlElementDescription(Ruby runtime, RubyClass rubyClass)
Html4ElementDescription(Ruby runtime, RubyClass rubyClass)
{
super(runtime, rubyClass);
}
@@ -89,8 +89,8 @@ public class HtmlElementDescription extends RubyObject
return context.nil;
}

HtmlElementDescription desc =
new HtmlElementDescription(context.getRuntime(), (RubyClass)klazz);
Html4ElementDescription desc =
new Html4ElementDescription(context.getRuntime(), (RubyClass)klazz);
desc.element = elem;
return desc;
}
@@ -12,16 +12,16 @@
import org.jruby.runtime.builtin.IRubyObject;

/**
* Class for Nokogiri::HTML::EntityLookup.
* Class for Nokogiri::HTML4::EntityLookup.
*
* @author Patrick Mahoney <pat@polycrystal.org>
*/
@JRubyClass(name = "Nokogiri::HTML::EntityLookup")
public class HtmlEntityLookup extends RubyObject
@JRubyClass(name = "Nokogiri::HTML4::EntityLookup")
public class Html4EntityLookup extends RubyObject
{

public
HtmlEntityLookup(Ruby runtime, RubyClass rubyClass)
Html4EntityLookup(Ruby runtime, RubyClass rubyClass)
{
super(runtime, rubyClass);
}
@@ -41,7 +41,7 @@ public class HtmlEntityLookup extends RubyObject
if (val == -1) { return ruby.getNil(); }

IRubyObject edClass =
ruby.getClassFromPath("Nokogiri::HTML::EntityDescription");
ruby.getClassFromPath("Nokogiri::HTML4::EntityDescription");
IRubyObject edObj = invoke(context, edClass, "new",
ruby.newFixnum(val), ruby.newString(name),
ruby.newString(name + " entity"));
@@ -24,27 +24,27 @@
import static nokogiri.internals.NokogiriHelpers.rubyStringToString;

/**
* Class for Nokogiri::HTML::SAX::ParserContext.
* Class for Nokogiri::HTML4::SAX::ParserContext.
*
* @author serabe
* @author Patrick Mahoney <pat@polycrystal.org>
* @author Yoko Harada <yokolet@gmail.com>
*/

@JRubyClass(name = "Nokogiri::HTML::SAX::ParserContext", parent = "Nokogiri::XML::SAX::ParserContext")
public class HtmlSaxParserContext extends XmlSaxParserContext
@JRubyClass(name = "Nokogiri::HTML4::SAX::ParserContext", parent = "Nokogiri::XML::SAX::ParserContext")
public class Html4SaxParserContext extends XmlSaxParserContext
{

static HtmlSaxParserContext
static Html4SaxParserContext
newInstance(final Ruby runtime, final RubyClass klazz)
{
HtmlSaxParserContext instance = new HtmlSaxParserContext(runtime, klazz);
Html4SaxParserContext instance = new Html4SaxParserContext(runtime, klazz);
instance.initialize(runtime);
return instance;
}

public
HtmlSaxParserContext(Ruby ruby, RubyClass rubyClass)
Html4SaxParserContext(Ruby ruby, RubyClass rubyClass)
{
super(ruby, rubyClass);
}
@@ -68,7 +68,7 @@ public class HtmlSaxParserContext extends XmlSaxParserContext
return parser;
} catch (SAXException ex) {
throw new SAXException(
"Problem while creating HTML SAX Parser: " + ex.toString());
"Problem while creating HTML4 SAX Parser: " + ex.toString());
}
}

@@ -79,7 +79,7 @@ public class HtmlSaxParserContext extends XmlSaxParserContext
IRubyObject data,
IRubyObject encoding)
{
HtmlSaxParserContext ctx = HtmlSaxParserContext.newInstance(context.runtime, (RubyClass) klazz);
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klazz);
String javaEncoding = findEncodingName(context, encoding);
if (javaEncoding != null) {
CharSequence input = applyEncoding(rubyStringToString(data.convertToString()), javaEncoding);
@@ -231,7 +231,7 @@ static EncodingType get(final int ordinal)
IRubyObject data,
IRubyObject encoding)
{
HtmlSaxParserContext ctx = HtmlSaxParserContext.newInstance(context.runtime, (RubyClass) klass);
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klass);
ctx.setInputSourceFile(context, data);
String javaEncoding = findEncodingName(context, encoding);
if (javaEncoding != null) {
@@ -247,7 +247,7 @@ static EncodingType get(final int ordinal)
IRubyObject data,
IRubyObject encoding)
{
HtmlSaxParserContext ctx = HtmlSaxParserContext.newInstance(context.runtime, (RubyClass) klass);
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(context.runtime, (RubyClass) klass);
ctx.setIOInputSource(context, data, context.nil);
String javaEncoding = findEncodingName(context, encoding);
if (javaEncoding != null) {
@@ -258,12 +258,12 @@ static EncodingType get(final int ordinal)

/**
* Create a new parser context that will read from a raw input stream.
* Meant to be run in a separate thread by HtmlSaxPushParser.
* Meant to be run in a separate thread by Html4SaxPushParser.
*/
static HtmlSaxParserContext
static Html4SaxParserContext
parse_stream(final Ruby runtime, RubyClass klass, InputStream stream)
{
HtmlSaxParserContext ctx = HtmlSaxParserContext.newInstance(runtime, klass);
Html4SaxParserContext ctx = Html4SaxParserContext.newInstance(runtime, klass);
ctx.setInputSource(stream);
return ctx;
}
@@ -27,25 +27,25 @@
import org.jruby.runtime.builtin.IRubyObject;

/**
* Class for Nokogiri::HTML::SAX::PushParser
* Class for Nokogiri::HTML4::SAX::PushParser
*
* @author
* @author Piotr Szmielew <p.szmielew@ava.waw.pl> - based on Nokogiri::XML::SAX::PushParser
*/
@JRubyClass(name = "Nokogiri::HTML::SAX::PushParser")
public class HtmlSaxPushParser extends RubyObject
@JRubyClass(name = "Nokogiri::HTML4::SAX::PushParser")
public class Html4SaxPushParser extends RubyObject
{
ParserContext.Options options;
IRubyObject saxParser;

NokogiriBlockingQueueInputStream stream;

private ParserTask parserTask = null;
private FutureTask<HtmlSaxParserContext> futureTask = null;
private FutureTask<Html4SaxParserContext> futureTask = null;
private ExecutorService executor = null;

public
HtmlSaxPushParser(Ruby ruby, RubyClass rubyClass)
Html4SaxPushParser(Ruby ruby, RubyClass rubyClass)
{
super(ruby, rubyClass);
}
@@ -111,7 +111,7 @@ public class HtmlSaxPushParser extends RubyObject
final ByteArrayInputStream data = NokogiriHelpers.stringBytesToStream(chunk);
if (data == null) {
terminateTask(context.runtime);
throw XmlSyntaxError.createHTMLSyntaxError(context.runtime).toThrowable(); // Nokogiri::HTML::SyntaxError
throw XmlSyntaxError.createHTMLSyntaxError(context.runtime).toThrowable(); // Nokogiri::HTML4::SyntaxError
}

int errorCount0 = parserTask.getErrorCount();
@@ -149,12 +149,12 @@ public class HtmlSaxPushParser extends RubyObject

assert saxParser != null : "saxParser null";
parserTask = new ParserTask(context, saxParser, stream);
futureTask = new FutureTask<HtmlSaxParserContext>((Callable) parserTask);
futureTask = new FutureTask<Html4SaxParserContext>((Callable) parserTask);
executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
@Override
public Thread newThread(Runnable r) {
Thread t = new Thread(r);
t.setName("HtmlSaxPushParser");
t.setName("Html4SaxPushParser");
t.setDaemon(true);
return t;
}
@@ -187,14 +187,14 @@ public Thread newThread(Runnable r) {
futureTask = null;
}

private static HtmlSaxParserContext
private static Html4SaxParserContext
parse(final Ruby runtime, final InputStream stream)
{
RubyClass klazz = getNokogiriClass(runtime, "Nokogiri::HTML::SAX::ParserContext");
return HtmlSaxParserContext.parse_stream(runtime, klazz, stream);
RubyClass klazz = getNokogiriClass(runtime, "Nokogiri::HTML4::SAX::ParserContext");
return Html4SaxParserContext.parse_stream(runtime, klazz, stream);
}

static class ParserTask extends XmlSaxPushParser.ParserTask /* <HtmlSaxPushParser> */
static class ParserTask extends XmlSaxPushParser.ParserTask /* <Html4SaxPushParser> */
{

private
@@ -204,10 +204,10 @@ static class ParserTask extends XmlSaxPushParser.ParserTask /* <HtmlSaxPushParse
}

@Override
public HtmlSaxParserContext
public Html4SaxParserContext
call() throws Exception
{
return (HtmlSaxParserContext) super.call();
return (Html4SaxParserContext) super.call();
}

}