Refactoring: UTS46 revision 31 and WHATWG IDNA support
skryukov committed Nov 14, 2023
1 parent ecd8d45 commit 210c8ee
Showing 32 changed files with 1,878 additions and 619 deletions.
14 changes: 14 additions & 0 deletions .rubocop.yml
@@ -1,3 +1,7 @@
require:
- rubocop-rake
- rubocop-rspec

AllCops:
NewCops: disable
TargetRubyVersion: 2.7
@@ -31,3 +35,13 @@ Metrics:

Naming/MethodParameterName:
Enabled: false

Naming/FileName:
Exclude:
- lib/uri-idna.rb

RSpec/MultipleExpectations:
Enabled: false

RSpec/NestedGroups:
Enabled: false
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,20 @@ and this project adheres to [Semantic Versioning].

## [Unreleased]

### Added

- WHATWG IDNA functions

### Changed

- **BREAKING!** Names of options updated to match UTS46 flags
- Unicode version updated to 15.1
- UTS46 functions now support Revision 31

### Fixed

- IDNA2008 functions now support not only labels, but full domains

## [0.1.0] - 2023-08-05

### Added
2 changes: 2 additions & 0 deletions Gemfile
@@ -10,3 +10,5 @@ gem "rake", "~> 13.0"
gem "rspec", "~> 3.0"

gem "rubocop", "~> 1.55", require: false
gem "rubocop-rake", "~> 0.6", require: false
gem "rubocop-rspec", "~> 2.25", require: false
169 changes: 134 additions & 35 deletions README.md
@@ -3,7 +3,7 @@
[![Gem Version](https://badge.fury.io/rb/uri-idna.svg)](https://rubygems.org/gems/uri-idna)
[![Ruby](https://github.com/skryukov/uri-idna/actions/workflows/main.yml/badge.svg)](https://github.com/skryukov/uri-idna/actions/workflows/main.yml)

A IDNA 2008, UTS 46 and Punycode implementation in pure Ruby.
An implementation of IDNA2008, UTS46, IDNA from the WHATWG URL Standard, and Punycode in pure Ruby.

This gem provides a number of functions for converting internationalized domain names (IDNs) between the Unicode and ASCII Compatible Encoding (ACE) forms.

@@ -24,98 +24,197 @@ And then run `bundle install`.

There are plenty of ways to convert IDNs between Unicode and ACE forms.

### WHATWG
### IDNA2008

- [URL Standard] – Standard used by modern browsers
[RFC 5891] defines two protocols for IDN conversion: [Registration](https://datatracker.ietf.org/doc/html/rfc5891#section-4) and [Domain Name Lookup](https://datatracker.ietf.org/doc/html/rfc5891#section-5).

### IDNA 2008
#### Registration protocol

The [RFC 5890] defines two protocols for IDN conversion: [Registration](https://datatracker.ietf.org/doc/html/rfc5891#section-4) and [Domain Name Lookup](https://datatracker.ietf.org/doc/html/rfc5891#section-5).
`URI::IDNA.register(alabel:, ulabel:, **options)`

#### Registration protocol
##### Options

- `check_hyphens`: `true` – whether to check hyphens according to [Section 5.4](https://datatracker.ietf.org/doc/html/rfc5891#section-5.4).
- `leading_combining`: `true` – whether to check leading combining marks according to [Section 5.4](https://datatracker.ietf.org/doc/html/rfc5891#section-5.4).
- `check_joiners`: `true` – whether to check `CONTEXTJ` code points according to [Section 5.4](https://datatracker.ietf.org/doc/html/rfc5891#section-5.4).
- `check_others`: `true` – whether to check `CONTEXTO` code points according to [Section 5.4](https://datatracker.ietf.org/doc/html/rfc5891#section-5.4).
- `check_bidi`: `true` – whether to check bidirectional characters according to [Section 5.4](https://datatracker.ietf.org/doc/html/rfc5891#section-5.4).
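
To give a feel for what a check like `check_hyphens` involves, here is a minimal stdlib-only sketch of the hyphen restrictions from RFC 5891 (no leading or trailing hyphen, no `--` in the third and fourth positions). The helper name `hyphens_ok?` is hypothetical and is not part of the gem's API; the gem's own validation may differ in details.

```ruby
# Hypothetical helper, not part of the uri-idna API.
# Checks the hyphen restrictions on a single Unicode label.
def hyphens_ok?(label)
  # A label must not start or end with a hyphen.
  return false if label.start_with?("-") || label.end_with?("-")

  # Indexes 2 and 3 are the third and fourth character positions.
  label[2, 2] != "--"
end

hyphens_ok?("example") #=> true
hyphens_ok?("ab--cd")  #=> false
hyphens_ok?("-abc")    #=> false
```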

```ruby
require "uri/idna"

URI::IDNA.register(alabel: "xn--gdkl8fhk5egc", ulabel: "ハロー・ワールド")
#=> "xn--gdkl8fhk5egc"
URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp", ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.register(ulabel: "ハロー・ワールド")
#=> "xn--gdkl8fhk5egc"
URI::IDNA.register(ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.register(alabel: "xn--gdkl8fhk5egc")
#=> "xn--gdkl8fhk5egc"
URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.register(ulabel: "☕.us")
#<URI::IDNA::InvalidCodepointError: Codepoint U+2615 at position 1 of "☕" not allowed>
```

#### Domain Name Lookup Protocol

`URI::IDNA.lookup(domain_name, **options)`

##### Options

- `check_hyphens`: `true` – whether to check hyphens according to [Section 4.2.3.1](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.1).
- `leading_combining`: `true` – whether to check leading combining marks according to [Section 4.2.3.2](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.2).
- `check_joiners`: `true` – whether to check CONTEXTJ code points according to [Section 4.2.3.3](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.3).
- `check_others`: `true` – whether to check CONTEXTO code points according to [Section 4.2.3.3](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.3).
- `check_bidi`: `true` – whether to check bidirectional characters according to [Section 4.2.3.4](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.4).
- `verify_dns_length`: `true` – whether to check DNS length according to [Section 4.4](https://datatracker.ietf.org/doc/html/rfc5891#section-4.4).
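
As an illustration of what a DNS length check typically covers, the sketch below tests the usual limits on the ASCII form of a name: every label is 1 to 63 bytes long and the whole name, without the trailing root dot, is 1 to 253 bytes long. The helper name `dns_length_ok?` is an assumption made for this example and is not part of the gem's API.

```ruby
# Hypothetical helper, not part of the uri-idna API.
# Expects the ASCII (A-label) form of a domain name.
def dns_length_ok?(ascii_domain)
  # Ignore the root label and its dot, if present.
  name = ascii_domain.delete_suffix(".")
  return false unless (1..253).cover?(name.bytesize)

  # Keep empty labels (limit -1) so that "a..com" is rejected.
  name.split(".", -1).all? { |label| (1..63).cover?(label.bytesize) }
end

dns_length_ok?("xn--pck0a1b0a6a2e.jp") #=> true
dns_length_ok?("#{"a" * 64}.jp")       #=> false
```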

```ruby
require "uri/idna"

URI::IDNA.lookup("ハロー・ワールド")
#=> "xn--pck0a1b0a6a2e"
URI::IDNA.lookup("ハロー・ワールド.jp")
#=> "xn--pck0a1b0a6a2e.jp"

URI::IDNA.lookup("xn--pck0a1b0a6a2e")
#=> "xn--pck0a1b0a6a2e"
URI::IDNA.lookup("xn--pck0a1b0a6a2e.jp")
#=> "xn--pck0a1b0a6a2e.jp"

URI::IDNA.lookup("Ῠ.me")
#<URI::IDNA::InvalidCodepointError: Codepoint U+1FE8 at position 1 of "Ῠ" not allowed>
```

### Unicode UTS 46(TR46)
### Unicode UTS46 (TR46)

_Current revision: 31_

The [UTS 46] defines two IDN conversion functions: [ToASCII](https://www.unicode.org/reports/tr46/#ToASCII) and [ToUnicode](https://www.unicode.org/reports/tr46/#ToUnicode).
[UTS46] defines two IDN conversion functions: [ToASCII](https://www.unicode.org/reports/tr46/#ToASCII) and [ToUnicode](https://www.unicode.org/reports/tr46/#ToUnicode).

#### ToASCII

`URI::IDNA.to_ascii(domain_name, **options)`

##### Options

- `use_std3_ascii_rules`: `true` – whether to apply [STD3 rules](https://www.unicode.org/reports/tr46/#STD3_Rules) for both mapping and validation.
- `check_hyphens`: `true` – whether to check hyphens according to [Section 4.2.3.1](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.1) of [RFC 5891].
- `check_bidi`: `true` – whether to check bidirectional characters according to [Section 4.2.3.4](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.4) of [RFC 5891].
- `check_joiners`: `true` – whether to check CONTEXTJ code points according to [Section 4.2.3.3](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.3) of [RFC 5891].
- `transitional_processing`: `false` – (deprecated) whether to apply [transitional processing](https://www.unicode.org/reports/tr46/#ProcessingStepMap) for mapping.
- `ignore_invalid_punycode`: `false` – whether to fast-path invalid Punycode labels according to [4th step of Processing](https://www.unicode.org/reports/tr46/#ProcessingStepPunycode).
- `verify_dns_length`: `true` – whether to check DNS length according to [Section 4.4](https://datatracker.ietf.org/doc/html/rfc5891#section-4.4) of [RFC 5891].

```ruby
require "uri/idna"

URI::IDNA.to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"

# UTS 46 transitional processing is disabled by default,
# UTS46 transitional processing is disabled by default,
# but can be enabled via option:
URI::IDNA.to_ascii("Bloß.de", uts46_transitional: true)
URI::IDNA.to_ascii("Bloß.de", transitional_processing: true)
#=> "bloss.de"

# Note that UTS 46 transitional processing is not fully IDNA 2008 compliant:
# Note that UTS46 processing is not fully IDNA2008 compliant:
URI::IDNA.to_ascii("☕.us")
#=> "xn--53h.us"
```

#### ToUnicode

`URI::IDNA.to_unicode(domain_name, **options)`

##### Options

- `use_std3_ascii_rules`: `true` – whether to apply [STD3 rules](https://www.unicode.org/reports/tr46/#STD3_Rules) for both mapping and validation.
- `check_hyphens`: `true` – whether to check hyphens according to [Section 4.2.3.1](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.1) of [RFC 5891].
- `check_bidi`: `true` – whether to check bidirectional characters according to [Section 4.2.3.4](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.4) of [RFC 5891].
- `check_joiners`: `true` – whether to check CONTEXTJ code points according to [Section 4.2.3.3](https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3.3) of [RFC 5891].
- `transitional_processing`: `false` – (deprecated) whether to apply [transitional processing](https://www.unicode.org/reports/tr46/#ProcessingStepMap) for mapping.
- `ignore_invalid_punycode`: `false` – whether to fast-path invalid Punycode labels according to [4th step of Processing](https://www.unicode.org/reports/tr46/#ProcessingStepPunycode).

```ruby
require "uri/idna"

URI::IDNA.to_unicode("xn--blo-7ka.de")
#=> "bloß.de"
```

#### IDNA 2008 compatibility
#### IDNA2008 compatibility

It's possible to apply both IDNA 2008 and UTS 46 at once:
It's possible to apply UTS46 mapping first and then IDNA2008, so that the processing fully conforms to IDNA2008:

```ruby
require "uri/idna"

URI::IDNA.to_ascii("☕.us", idna_validity: true, contexto: true)
#<URI::IDNA::InvalidCodepointError: Codepoint U+2615 at position 1 of "☕" not allowed>
# For example we can use UTS46 mapping to downcase some characters
char = "⼤"
char.ord # "\u2F24"
#=> 12068

# just downcase doesn't work in this case
char.downcase.ord
#=> 12068

# but UTS46 mapping does its thing:
URI::IDNA::UTS46::Mapping.call(char).ord
#=> 22823

# so here is a full example:
domain = "⼤.cn" # "\u2F24.cn"
URI::IDNA.lookup(domain)
# <URI::IDNA::InvalidCodepointError: Codepoint U+2F24 at position 1 of "⼤" not allowed>

mapped_domain = URI::IDNA::UTS46::Mapping.call(domain)
URI::IDNA.lookup(mapped_domain)
#=> "xn--pss.cn"
```
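
For this particular character, the effect of the mapping can be reproduced with Ruby's built-in NFKC normalization. This is only a rough stdlib illustration: UTS46 mapping uses its own data table and is not equivalent to NFKC in general, and the snippet says nothing about how `URI::IDNA::UTS46::Mapping` is implemented.

```ruby
# Stdlib-only illustration; not the gem's implementation.
char = "\u2F24" # KANGXI RADICAL BIG

# NFKC replaces the compatibility character with U+5927.
normalized = char.unicode_normalize(:nfkc)
normalized.ord
#=> 22823
```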

### WHATWG

WHATWG's [URL Standard] uses the UTS46 algorithm to define its ToASCII and ToUnicode functions. It hides the individual UTS46 flags and exposes a single `be_strict` flag instead.

Note that the `check_hyphens` UTS46 option is set to `false` in this algorithm.
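
As a mental model, the relationship between `be_strict` and the UTS46 options for ToASCII can be written down as a small hash builder. The helper name `whatwg_to_ascii_options` is hypothetical and exists only for this illustration; it covers just the three options mentioned in this section and assumes the remaining options keep their defaults.

```ruby
# Hypothetical helper, not part of the uri-idna API.
# Expands the single be_strict flag into the UTS46 options it controls.
def whatwg_to_ascii_options(be_strict: true)
  {
    use_std3_ascii_rules: be_strict, # follows be_strict
    check_hyphens: false,            # always disabled in this algorithm
    verify_dns_length: be_strict     # follows be_strict
  }
end

whatwg_to_ascii_options(be_strict: false)
#=> {use_std3_ascii_rules: false, check_hyphens: false, verify_dns_length: false}
```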

# It's also possible to apply UTS 46 to IDNA 2008 protocols:
URI::IDNA.lookup("Ῠ.me", check_dot: true, uts46: true, uts46_std3: true)
#=> "xn--rtg.me"
#### ToASCII

`URI::IDNA.whatwg_to_ascii(domain_name, **options)`

##### Options

- `be_strict`: `true` – defines values of `use_std3_ascii_rules` and `verify_dns_length` UTS46 options.

```ruby
require "uri/idna"

URI::IDNA.whatwg_to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"

# The be_strict flag sets use_std3_ascii_rules and verify_dns_length UTS46 flags to its value
URI::IDNA.whatwg_to_ascii("2003_rules.com", be_strict: false)
#=> "2003_rules.com"

# By default be_strict is set to true
URI::IDNA.whatwg_to_ascii("2003_rules.com")
#<URI::IDNA::InvalidCodepointError: Codepoint U+005F at position 5 of "2003_rules" not allowed>
```

#### ToUnicode

`URI::IDNA.whatwg_to_unicode(domain_name, **options)`

##### Options

- `be_strict`: `true` – defines the value of the `use_std3_ascii_rules` UTS46 option.

```ruby
require "uri/idna"

URI::IDNA.whatwg_to_unicode("xn--blo-7ka.de")
#=> "bloß.de"
```

### Punycode

Punycode module performs conversion between Unicode and Punycode. Note that Punycode is not IDNA 2008 compliant, it is only used for conversion, no validations performed.
The Punycode module converts between Unicode and Punycode. Note that Punycode on its own is not IDNA2008 compliant: it only performs the conversion, without any validation.
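
To show what the conversion itself does, here is a compact stdlib-only sketch of the Punycode encoding procedure from RFC 3492. The method names (`punycode_encode`, `punycode_adapt`, `punycode_digit`) are made up for this example and are not the gem's API; overflow handling is omitted, and the gem's own implementation may be structured differently.

```ruby
# Bootstring parameters for Punycode (RFC 3492, section 5).
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700
INITIAL_BIAS = 72
INITIAL_N = 128

# Maps a digit value 0..35 to "a".."z", "0".."9".
def punycode_digit(d)
  (d < 26 ? d + 97 : d + 22).chr
end

# Bias adaptation (RFC 3492, section 6.1).
def punycode_adapt(delta, num_points, first_time)
  delta = first_time ? delta / DAMP : delta / 2
  delta += delta / num_points
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end
  k + (((BASE - TMIN + 1) * delta) / (delta + SKEW))
end

# Encodes one Unicode label to Punycode, without the "xn--" prefix.
def punycode_encode(label)
  input = label.codepoints
  basic = input.select { |cp| cp < INITIAL_N }
  output = basic.pack("U*")
  handled = basic.length
  output += "-" if handled.positive?

  n = INITIAL_N
  delta = 0
  bias = INITIAL_BIAS

  while handled < input.length
    # Next smallest code point that has not been encoded yet.
    m = input.select { |cp| cp >= n }.min
    delta += (m - n) * (handled + 1)
    n = m
    input.each do |cp|
      delta += 1 if cp < n
      next unless cp == n

      # Emit delta as a variable-length integer.
      q = delta
      k = BASE
      loop do
        t = (k - bias).clamp(TMIN, TMAX)
        break if q < t

        output += punycode_digit(t + ((q - t) % (BASE - t)))
        q = (q - t) / (BASE - t)
        k += BASE
      end
      output += punycode_digit(q)
      bias = punycode_adapt(delta, handled + 1, handled == basic.length)
      delta = 0
      handled += 1
    end
    delta += 1
    n += 1
  end
  output
end

punycode_encode("bloß") #=> "blo-7ka"
punycode_encode("☕")   #=> "53h"
```

The results match the labels in the examples above once the `xn--` prefix is added (`xn--blo-7ka`, `xn--53h`).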

```ruby
require "uri/idna/punycode"
@@ -129,7 +228,7 @@ URI::IDNA::Punycode.decode("gdkl8fhk5egc")

## Full technical reference:

### IDNA 2008
### IDNA2008
- [RFC 5890] – Definitions and Document Framework
- [RFC 5891] – Protocol
- [RFC 5892] – The Unicode Code Points
@@ -139,9 +238,9 @@ URI::IDNA::Punycode.decode("gdkl8fhk5egc")

- [RFC 3492] – Punycode: A Bootstring encoding of Unicode

### UTS 46 (also referenced as TS46)
### UTS46 (also referenced as TS46)

- [Unicode IDNA Compatibility Processing][UTS 46]
- [Unicode IDNA Compatibility Processing][UTS46]

## Development

@@ -167,9 +266,9 @@ To inspect Unicode data, run `bundle exec rake 'idna:inspect[<HEX_CODE>]'`.

To specify Unicode version, or cache directory, use `VERSION` or `CACHE_DIR` environment variables, e.g. `VERSION=15.1.0 bundle exec rake 'idna:inspect[1f495]'`.

### Update UTS 46 test suite data
### Update UTS46 test suite data

To update UTS 46 test suite data, run `bundle exec rake idna:update_uts46_test_suite`.
To update UTS46 test suite data, run `bundle exec rake idna:update_uts46_test_suite`.

## Contributing

@@ -184,6 +283,6 @@ The gem is available as open source under the terms of the [MIT License].
[RFC 5892]: https://datatracker.ietf.org/doc/html/rfc5892
[RFC 5893]: https://datatracker.ietf.org/doc/html/rfc5893
[RFC 3492]: https://datatracker.ietf.org/doc/html/rfc3492
[UTS 46]: https://www.unicode.org/reports/tr46
[UTS46]: https://www.unicode.org/reports/tr46
[URL Standard]: https://url.spec.whatwg.org/#idna
[MIT License]: https://opensource.org/licenses/MIT
3 changes: 3 additions & 0 deletions lib/uri-idna.rb
@@ -0,0 +1,3 @@
# frozen_string_literal: true

require_relative "uri/idna"