Refactor unicode.py script #60081

Merged
merged 6 commits into from
Jul 7, 2019

pawroman
Contributor

Hi, I noticed that the unicode.py script used some deprecated escape sequences in its regular expressions. For example, \d, \w and \. inside ordinary (non-raw) string literals will become illegal in a future Python release. This is now fixed by using raw strings. I have also cleaned up the script quite a bit.

Escape deprecation

OK (note the r):
re.compile(r"\d")

Deprecated (from Python 3.6 onwards, see here and here):
re.compile("\d").

This was evident running the script using Python 3.7 like so:

$ python3 -Wall unicode.py 
unicode.py:227: DeprecationWarning: invalid escape sequence \w
  re1 = re.compile("^ *([0-9A-F]+) *; *(\w+)")
unicode.py:228: DeprecationWarning: invalid escape sequence \.
  re2 = re.compile("^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)")
unicode.py:453: DeprecationWarning: invalid escape sequence \d
  pattern = "for Version (\d+)\.(\d+)\.(\d+) of the Unicode"

The documentation states that

A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases.
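As a minimal sketch of the difference (not code taken from unicode.py; the helper name `has_invalid_escape` and the constant names are made up for illustration), the snippet below compiles a small piece of source with warnings turned into errors. It also shows the two patterns from the warnings above rewritten as raw strings:

```python
import re
import warnings

# The two patterns from the warnings above, rewritten as raw strings.
RE_SINGLE = re.compile(r"^ *([0-9A-F]+) *; *(\w+)")
RE_RANGE = re.compile(r"^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)")


def has_invalid_escape(source):
    """Return True if compiling `source` reports an invalid escape sequence.

    Hypothetical helper: warnings are escalated to errors so that the
    deprecation shows up as an exception regardless of Python 3 version.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            compile(source, "<snippet>", "exec")
        except (DeprecationWarning, SyntaxWarning, SyntaxError):
            return True
    return False


# The source text below is: p = "\d"   (non-raw, deprecated)
print(has_invalid_escape('p = "\\d"'))
# The source text below is: p = r"\d"  (raw, fine)
print(has_invalid_escape('p = r"\\d"'))
print(RE_RANGE.match("0041..005A ; Lu").groups())
```

A raw string and a string with doubled backslashes produce the same pattern, so `r"\d" == "\\d"`; the raw form is simply easier to read in regular expressions.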

Testing

To test my changes, I had to add support for choosing the Unicode version to use. The script will default to the latest release (which is 12.0.0 at the moment; the repo has 11.0.0 checked in).

The script generates the exact same output for version 11.0.0 with Python 2.7 and 3.7 and no longer generates any deprecation warnings:

$ python3 -Wall unicode.py -v 11.0.0
Using Unicode version: 11.0.0
Regenerated tables.rs.
$ git diff tables.rs
$ python2 -Wall unicode.py -v 11.0.0
Using Unicode version: 11.0.0
Regenerated tables.rs.
$ git diff tables.rs
$ python2 --version
Python 2.7.16
$ python3 --version
Python 3.7.3

Extra functionality

Furthermore, when run without the -v argument, the script will check for the latest Unicode version and download it. The --help output is below:

$ ./unicode.py --help
usage: unicode.py [-h] [-v VERSION]

Regenerate Unicode tables (tables.rs).

optional arguments:
  -h, --help            show this help message and exit
  -v VERSION, --version VERSION
                        Unicode version to use (if not specified, defaults to
                        latest available final release).
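For readers unfamiliar with argparse, a command line like the one above can be declared roughly as follows. This is a sketch under assumptions, not the script's actual code: the function names `build_parser` and `parse_unicode_version` are hypothetical, and only the option names and help text come from the output above.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the --help output shown above."""
    parser = argparse.ArgumentParser(
        prog="unicode.py",
        description="Regenerate Unicode tables (tables.rs).",
    )
    parser.add_argument(
        "-v", "--version",
        default=None,  # None means "look up the latest release"
        help="Unicode version to use (if not specified, defaults to "
             "latest available final release).",
    )
    return parser


def parse_unicode_version(text):
    """Split a version string such as "11.0.0" into a tuple of ints."""
    major, minor, micro = text.split(".")
    return int(major), int(minor), int(micro)


args = build_parser().parse_args(["-v", "11.0.0"])
print(parse_unicode_version(args.version))
```

Leaving the default as `None` keeps "no version given" distinguishable from any real version, so the lookup of the latest release can happen only when it is needed.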

Cleanups

I have cleaned up the code quite a bit, with Python best practices and code style in mind. I'm happy to provide more details and rationale for all my changes if the reviewers so desire.

One externally visible change is that the Unicode data will now be downloaded into the src/libcore/unicode/downloaded directory, in a subdirectory named after the Unicode version:

$ pwd
.../rust/src/libcore/unicode
$ exa -T downloaded/
downloaded
├── 11.0.0
│  ├── DerivedCoreProperties.txt
│  ├── DerivedNormalizationProps.txt
│  ├── PropList.txt
│  ├── ReadMe.txt
│  ├── Scripts.txt
│  ├── SpecialCasing.txt
│  └── UnicodeData.txt
└── 12.0.0
   ├── DerivedCoreProperties.txt
   ├── DerivedNormalizationProps.txt
   ├── PropList.txt
   ├── ReadMe.txt
   ├── Scripts.txt
   ├── SpecialCasing.txt
   └── UnicodeData.txt
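The layout above amounts to one subdirectory per version. A small sketch of how such paths could be computed and checked is shown below; the names `UNICODE_FILES`, `DOWNLOAD_ROOT`, `local_path` and `missing_files` are assumptions for illustration and may not match the script. Only the file names and the `downloaded/<version>/` structure are taken from the listing.

```python
import os

# File names as they appear in the directory listing above.
UNICODE_FILES = (
    "DerivedCoreProperties.txt",
    "DerivedNormalizationProps.txt",
    "PropList.txt",
    "ReadMe.txt",
    "Scripts.txt",
    "SpecialCasing.txt",
    "UnicodeData.txt",
)

# Hypothetical constant: directory, relative to the script, holding the data.
DOWNLOAD_ROOT = "downloaded"


def local_path(version, filename, root=DOWNLOAD_ROOT):
    """Return the versioned location of one data file."""
    return os.path.join(root, version, filename)


def missing_files(version, root=DOWNLOAD_ROOT):
    """List the data files that are not yet present for `version`."""
    return [
        name for name in UNICODE_FILES
        if not os.path.isfile(local_path(version, name, root))
    ]


print(local_path("11.0.0", "UnicodeData.txt"))
```

Keeping each version in its own directory means data for 11.0.0 and 12.0.0 can sit side by side, so switching versions does not require downloading the files again.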

@rust-highfive
Collaborator

r? @KodrAus

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 18, 2019
@rust-highfive
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

travis_time:end:064f0514:start=1555598252801053942,finish=1555598254934744840,duration=2133690898
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
$ export AWS_ACCESS_KEY_ID=AKIA46X5W6CZEJZ6XT55
---

[00:03:43] travis_fold:start:tidy
travis_time:start:tidy
tidy check
[00:03:43] tidy error: /checkout/src/libcore/unicode/unicode.py:463: line longer than 100 chars
[00:03:43] tidy error: /checkout/src/libcore/unicode/unicode.py:629: TODO is deprecated; use FIXME
[00:03:45] some tidy checks failed
[00:03:45] 
[00:03:45] 
[00:03:45] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
[00:03:45] 
[00:03:45] 
[00:03:45] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
[00:03:45] Build completed unsuccessfully in 0:00:45
[00:03:45] Build completed unsuccessfully in 0:00:45
[00:03:45] make: *** [tidy] Error 1
[00:03:45] Makefile:67: recipe for target 'tidy' failed
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:31ae9ce0
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Thu Apr 18 14:41:31 UTC 2019
---
travis_time:end:00af8fc0:start=1555598492378398530,finish=1555598492383061211,duration=4662681
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:0f3c2b5c
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:159c54d0
travis_time:start:159c54d0
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:1a9a1a97
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)
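The two tidy errors in the log above are simple per-line style rules. The real tidy tool is a separate program in the Rust repository and its actual rules may differ; the sketch below only imitates the two messages seen in this log, and the names `MAX_LINE_LENGTH` and `tidy_errors` are hypothetical.

```python
# Limit taken from the error message in the log above.
MAX_LINE_LENGTH = 100


def tidy_errors(path, text):
    """Return messages for over-long lines and TODO markers in `text`."""
    errors = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            errors.append(
                "{}:{}: line longer than {} chars".format(
                    path, number, MAX_LINE_LENGTH
                )
            )
        if "TODO" in line:
            errors.append(
                "{}:{}: TODO is deprecated; use FIXME".format(path, number)
            )
    return errors


sample = "x = 1\n" + "#" * 101 + "\n# TODO: later\n"
for message in tidy_errors("example.py", sample):
    print(message)
```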

@fbstj
Contributor

fbstj commented Apr 18, 2019

should each of your functions have some doc-strings?

@pawroman
Contributor Author

should each of your functions have some doc-strings?

Possibly, I can annotate the most obvious ones but I would have to ask the authors about some of the Unicode table calculations -- to explain what is being done and why.

@pawroman pawroman changed the title from "Cleanup unicode.py script" to "Refactor unicode.py script" Apr 19, 2019
@pawroman
Contributor Author

Following the suggestion above, I have added docstrings and type annotations for all functions. To fully comprehend what's going on I had to give the code a few good, thorough reads. This spawned more cleanups and the entire script is now refactored.

@KodrAus
Contributor

KodrAus commented Apr 23, 2019

I don't think I've got the python experience to be able to properly review this so I'll assign somebody else based on the commit history...

r? @SimonSapin

@rust-highfive rust-highfive assigned SimonSapin and unassigned KodrAus Apr 23, 2019
@SimonSapin
Contributor

Sorry, I don’t have bandwidth available at the moment for non-trivial Rust reviews.

@SimonSapin SimonSapin assigned varkor and unassigned SimonSapin Apr 23, 2019
@SimonSapin
Contributor

r? @varkor, based on git-shortlog

@varkor
Member

varkor commented Apr 28, 2019

I'll try to take a look soon.

Member

@varkor varkor left a comment


I'm a little divided about pull requests like this. On the one hand, refactoring and cleaning code up is a good thing. However, for scripts like this, which are quite large, but also very rarely used, and also written in Python, it's (unfortunately) possibly not worth the effort to clean up. For instance, it's very likely this script will never need to be modified (functionally), so readability isn't a huge issue, and it will continue to work in older versions of Python indefinitely. I do appreciate the desire to improve the codebase as a whole, but (as can be seen) reviewing such refactorings can be difficult, as it takes a lot of time.

The one thing I might be tempted to do is rewrite this (and the other table-generating scripts) in Rust, which would then be accessible to more of the reviewers and also give us more type safety and reassurance that we aren't accidentally breaking anything.

That said, I've taken a look at these changes. Overall I think they are an improvement. Honestly, as long as we continue to generate the same tables and the changes look mostly like moving code around, I'm happy. I would like the comment style to be normalised a bit more. One thing that would be nice to have is a test to check that the current tables.rs matches the output of unicode.py -v 11.0.0, so we know that the script is (mostly) correct, but as that requires a connection to the Unicode website, it's probably not feasible.

If you could fix the few comments I had, I'll approve.
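The consistency check suggested above, comparing the checked-in tables.rs against freshly generated output, could be sketched roughly as follows. This is only an illustration of the idea: the function name `table_diff` and the file paths are hypothetical, and the step that regenerates the tables (which needs the Unicode data) is left out.

```python
import difflib


def table_diff(checked_in_path, regenerated_path):
    """Return unified-diff lines between two files; empty if identical."""
    with open(checked_in_path, encoding="utf-8") as handle:
        expected = handle.readlines()
    with open(regenerated_path, encoding="utf-8") as handle:
        actual = handle.readlines()
    return list(
        difflib.unified_diff(
            expected,
            actual,
            fromfile=checked_in_path,
            tofile=regenerated_path,
        )
    )
```

A test would then assert that `table_diff("tables.rs", "regenerated/tables.rs")` is an empty list, and print the diff lines on failure so the mismatch is easy to locate.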

@varkor varkor added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 6, 2019
@SimonSapin
Contributor

The one thing I might be tempted to do is rewrite this (and the other table-generating scripts) in Rust

Such an effort might be able to rely on (parts of?) ucd-generate or unic-ucd.

@jonas-schievink
Contributor

Ping from triage @pawroman, there are still outstanding review items that need to be addressed

@pawroman
Contributor Author

Thanks for being patient - I couldn't find the time to look into this recently.

@varkor Thanks for reviewing - I will address the remarks soon.

I am tempted to rewrite this in Rust myself; however, I wanted to clean up the Python code first (so I understand it better) and then perhaps work on a Rust re-implementation.

Co-Authored-By: varkor <github@varkor.com>
@varkor
Member

varkor commented Jun 23, 2019

@pawroman: if you could squash the last two commits (so the submodules never change), I'm happy to approve with the new changes 👍

@Centril Centril reopened this Jul 6, 2019
@Centril
Contributor

Centril commented Jul 6, 2019

@bors r=varkor

@bors
Contributor

bors commented Jul 6, 2019

📌 Commit 2b47a08 has been approved by varkor

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 6, 2019
Centril added a commit to Centril/rust that referenced this pull request Jul 6, 2019
Centril added a commit to Centril/rust that referenced this pull request Jul 6, 2019
Rollup of 4 pull requests [2]

Successful merges:

 - rust-lang#59800 (rustc: Remove `dylib` crate type from most rustc crates)
 - rust-lang#60081 (Refactor unicode.py script)
 - rust-lang#62270 (Move async-await tests from run-pass to ui)
 - rust-lang#62281 (Add support for pc-relative addressing on 64-bit RISC-V)

Failed merges:

r? @ghost
Centril added a commit to Centril/rust that referenced this pull request Jul 6, 2019
@bors
Contributor

bors commented Jul 6, 2019

⌛ Testing commit 2b47a08 with merge 5a94ebbc9dc43dfa82b165b2c41a115b0e869804...

Centril added a commit to Centril/rust that referenced this pull request Jul 6, 2019
@Centril
Contributor

Centril commented Jul 6, 2019

@bors retry

bors added a commit that referenced this pull request Jul 6, 2019
Rollup of 6 pull requests

Successful merges:

 - #60081 (Refactor unicode.py script)
 - #61862 (Make the Weak::{into,as}_raw methods)
 - #62243 (Improve documentation for built-in macros)
 - #62422 (Remove some uses of mem::uninitialized)
 - #62432 (Update rustfmt to 1.3.2)
 - #62436 (normalize use of backticks/lowercase in compiler messages for librustc_mir)

Failed merges:

r? @ghost
Centril added a commit to Centril/rust that referenced this pull request Jul 6, 2019
bors added a commit that referenced this pull request Jul 6, 2019
Rollup of 5 pull requests

Successful merges:

 - #60081 (Refactor unicode.py script)
 - #61862 (Make the Weak::{into,as}_raw methods)
 - #62243 (Improve documentation for built-in macros)
 - #62422 (Remove some uses of mem::uninitialized)
 - #62436 (normalize use of backticks/lowercase in compiler messages for librustc_mir)

Failed merges:

r? @ghost
@bors bors merged commit 2b47a08 into rust-lang:master Jul 7, 2019
@bors
Contributor

bors commented Jul 7, 2019

⌛ Testing commit 2b47a08 with merge b0bd5f2...

@pietroalbini
Member

@bors r-

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jul 15, 2019
@pietroalbini
Member

@bors retry

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 15, 2019