adjust tools for Unicode 15 in folders named dev
- add version number 15
- adjust file reading code to detect the next version and read from the "dev" folder
- adjust file writing to omit the version infix
- adjust docs
see issue #144

also
- change UNICODETOOLS_DIR+data to DATA_DIR
- fix TestSecurity, see issue #151
  - fall back from Generated output to input data files
  - CI: no output-to-input symlink
markusicu committed Nov 24, 2021
1 parent 1b478b4 commit aa6d11c
Showing 42 changed files with 538 additions and 248 deletions.
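
The Java hunks in this commit replace hand-built paths such as `DATA_DIR + "idna/" + Settings.latestVersion` with calls to `getDataPathString(...)`, `getDataPathStringForLatestVersion(...)` and `getDataPathForLatestVersion(...)`. The implementation of those helpers is not part of the hunks shown here. As a rough sketch of the behavior the commit message describes (the latest version is read from an unversioned "dev" folder, older versions from a versioned folder), something like the following could work. The class `DataPathSketch`, the constant `LATEST_VERSION`, and both method names are hypothetical and are not the actual `Settings.UnicodeTools` code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not the real Settings.UnicodeTools API. */
final class DataPathSketch {
    /** Assumption: the version currently under development. */
    static final String LATEST_VERSION = "15.0.0";

    /**
     * Returns dataDir/component/dev for the latest version and
     * dataDir/component/version for any other version.
     */
    static Path dataPath(Path dataDir, String component, String version) {
        String folder = LATEST_VERSION.equals(version) ? "dev" : version;
        return dataDir.resolve(component).resolve(folder);
    }

    /** Convenience wrapper that always resolves the "dev" folder. */
    static Path dataPathForLatestVersion(Path dataDir, String component) {
        return dataPath(dataDir, component, LATEST_VERSION);
    }

    public static void main(String[] args) {
        Path dataDir = Paths.get("unicodetools", "data");
        // Latest version: resolves to the unversioned dev folder.
        System.out.println(dataPathForLatestVersion(dataDir, "idna"));
        // Older version: resolves to a versioned folder.
        System.out.println(dataPath(dataDir, "security", "14.0.0"));
    }
}
```

Keeping the version-to-folder decision in one place means callers no longer concatenate `Settings.latestVersion` into paths themselves, which is the pattern the hunks remove.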
1 change: 0 additions & 1 deletion .github/workflows/build-jsp.yml
@@ -39,7 +39,6 @@ jobs:
# TODO: symlink of security here?
run: >
mkdir -pv $(pwd)/output/Generated/ &&
ln -s $(pwd)/unicodetools/data/security ./output/Generated/ &&
mvn -s .github/workflows/mvn-settings.xml -B exec:java
-Dexec.mainClass="org.unicode.text.UCD.Main" -Dexec.args="version 14.0.0 build MakeUnicodeFiles"
-pl unicodetools -DCLDR_DIR=${GITHUB_WORKSPACE}/cldr
44 changes: 42 additions & 2 deletions docs/build.md
@@ -192,6 +192,7 @@ See the top level `pom.xml` under `<properties>`.
- If you are using Eclipse, make sure CLDR and UnicodeTools are in the same workspace,
and Eclipse should do the right thing.
- I'm not sure how to do the same with ICU.

### Input data files

The input data files for the Unicode Tools are checked into the repo since
@@ -215,12 +216,35 @@ Make sure you have the VM arguments set up as described above.

## Updating to a new Unicode version

All of the following have "version 14.0.0" (or whatever the latest version is)
### Unicode 15+ workflow

Starting with Unicode 15, we are developing most of the Unicode data files
in this Unicode Tools project, and publishing them to the Public folder
only for alpha/beta/final releases.
That is, we are reversing the flow of files.
(See [issue #144](https://github.com/unicode-org/unicodetools/issues/144).)

We are also no longer generating and posting files with version suffixes.

Except: Some files, such as Unihan and ucdxml data files, are developed elsewhere,
and we continue to ingest them as before.

###

All of the following have `version 15.0.0` (or whatever the latest version is)
in the options given to Java.

Example changes for adding Unicode 15 version numbers:
See the second commit of https://github.com/unicode-org/unicodetools/pull/156

Example changes for adding properties:
<https://github.com/unicode-org/unicodetools/pull/40>. Throughout these steps we
will walk through updating unicodetools to support Unicode 14.
will walk through updating unicodetools to support Unicode 15 or 14.

Starting with Unicode 15, we keep the latest versions of data files in
unversioned "dev" folders in this repo.

Unicode 14:

Firstly, fetch the latest data files for this version from
<https://www.unicode.org/Public/14.0.0/ucd/>, matching your new version number.
@@ -236,6 +260,14 @@ desuffix the files (removing the -dN suffixes). Copy these into
to set up the inputs correctly. For some updates you may need to pull in other
(uca, security, idna, etc) files, see [Input data setup](inputdata.md) for more information.

Unicode 15:

We no longer generate files with version suffixes, but for now we still
generate files into an output folder with the DeltaVersion that is set in MakeUnicodeFiles.txt.
We might revisit this.

Unicode 14:

Now, update the following files:

`MakeUnicodeFiles.txt` (find in Eclipse via Navigate/Resource or Ctrl+Shift+R)
@@ -325,6 +357,12 @@ to generate new files). For all the new ones:
Make a pull request to incorporate these updates, and upload the generated files
in a way that can be shared with ucd-dev.

Unicode 15 TODO:
We plan to
- make a commit for changes in input data files
- copy the output files back into the input folders, review, and commit again
... instead of posting draft files elsewhere and re-ingesting them later.

Ideally, diff the files to check for any discrepancies. The script does this
automatically; search the output for lines that say "Found difference in
`<filename>`". Note, however, that it displays only the first line of the diff,
@@ -478,6 +516,8 @@ If there are new break rules (or changes), see

### Upload for Ken Whistler & editorial committee

Unicode 15 TODO: See above; commit new input data, run tools, review output, copy back to input, commit, pull request...

1. Check diffs for problems
2. First drop for a version: Upload **all** files
3. Subsequent drop for a version: Upload *only modified* files
16 changes: 15 additions & 1 deletion docs/inputdata.md
@@ -1,5 +1,18 @@
# Input data setup

## Unicode 15+ workflow

Starting with Unicode 15, we are developing most of the Unicode data files
in this Unicode Tools project, and publishing them to the Public folder
only for alpha/beta/final releases.
That is, we are reversing the flow of files.
(See [issue #144](https://github.com/unicode-org/unicodetools/issues/144).)

We are also no longer generating and posting files with version suffixes.

Except: Some files, such as Unihan and ucdxml data files, are developed elsewhere,
and we continue to ingest them as before.

## Source Files

The source files that you will need for a release such as 8.0.0 are in:
@@ -54,6 +67,7 @@ files have the version suffix.

### Removing Suffixes

Only for Unicode 14 and earlier:
For the ucd and uca files, you will have to remove the suffixes.
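
For a single file name, removing the suffix amounts to dropping the `-<version>` part (plus an optional draft marker such as `d2`) that sits just before the extension. The sketch below shows this in Java. It is not how `desuffixucd.py` works internally; the class name, the regular expression, and the sample file names are assumptions made for illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the Unicode Tools. */
final class DesuffixSketch {
    // Assumed suffix shape: "-<major>.<minor>.<update>" with an optional
    // draft marker "dN", located directly before the file extension.
    private static final Pattern VERSION_SUFFIX =
            Pattern.compile("-\\d+\\.\\d+\\.\\d+(?:d\\d+)?(?=\\.[A-Za-z]+$)");

    /** Returns the file name without its version suffix, or unchanged if it has none. */
    static String removeVersionSuffix(String fileName) {
        return VERSION_SUFFIX.matcher(fileName).replaceFirst("");
    }

    public static void main(String[] args) {
        // Illustrative file names only.
        System.out.println(removeVersionSuffix("UnicodeData-14.0.0d3.txt"));
        System.out.println(removeVersionSuffix("allkeys-14.0.0.txt"));
        System.out.println(removeVersionSuffix("ReadMe.txt"));
    }
}
```

The lookahead anchors the match to the end of the name, so a hyphenated base name without a version is left alone.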

Tip: On Linux, you can remove version suffixes on the command line like this:
@@ -122,7 +136,7 @@ $ ../../desuffixucd.py .

### Unihan

You may need to manually change "Unihan-8.0.0d2 Folder" to "Unihan".
You may need to manually change the "Unihan-8.0.0d2 Folder" to "Unihan".

Unzip the Unihan.zip file into a "Unihan" subfolder.

2 changes: 2 additions & 0 deletions docs/security.md
@@ -61,6 +61,8 @@ Run GenerateConfusables -c -b to generate the files. They will appear in two pla
**Run TestSecurity to verify that the confusable mappings are idempotent!**

Use the same VM arguments as for the generator.
Starting in 2021q3, TestSecurity needs to be run as a JUnit test.
It is also now part of the unit test suite and run on GitHub CI.
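
A mapping is idempotent when applying it a second time changes nothing: mapping any target again must return the same target. The sketch below illustrates that property in Java on a toy map. It is not the TestSecurity code; the class name, the method names, and the sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of an idempotence check; not the real TestSecurity. */
final class IdempotenceSketch {
    /** Maps each code point of source; unmapped code points map to themselves. */
    static String apply(Map<Integer, String> mapping, String source) {
        StringBuilder result = new StringBuilder();
        source.codePoints().forEach(cp -> {
            String target = mapping.get(cp);
            if (target != null) {
                result.append(target);
            } else {
                result.appendCodePoint(cp);
            }
        });
        return result.toString();
    }

    /** Returns the source code points whose targets change when mapped a second time. */
    static List<Integer> findNonIdempotent(Map<Integer, String> mapping) {
        List<Integer> failures = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : mapping.entrySet()) {
            String once = entry.getValue();
            String twice = apply(mapping, once);
            if (!once.equals(twice)) {
                failures.add(entry.getKey());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Toy data: A maps to B, but B itself maps on to C, so A is not stable.
        Map<Integer, String> mapping = new LinkedHashMap<>();
        mapping.put((int) 'A', "B");
        mapping.put((int) 'B', "C");
        System.out.println(findNonIdempotent(mapping));
    }
}
```

An empty result means every target is already a fixed point of the mapping.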

Copy the following from the output directory to the top level of the revision directory:

@@ -33,9 +33,8 @@ public static void main(String[] args) {
patriksData.put(cp, allButFirst(parts));
}

for (String line : FileUtilities.in(Settings.UnicodeTools.DATA_DIR + "/idna/" +
Settings.latestVersion +
"/", "IdnaMappingTable.txt")) {
String idnaDir = Settings.UnicodeTools.getDataPathStringForLatestVersion("idna");
for (String line : FileUtilities.in(idnaDir + "/", "IdnaMappingTable.txt")) {
final int pos2 = line.indexOf('#');
if (pos2 >= 0) {
line = line.substring(0,pos2);
@@ -176,7 +176,7 @@ int generateTests(int lines) throws IOException {
"update GenerateIdnaTest.ucdTypesLastVersion to match " + lastVersion);
}
Set<TestLine> testLines = LoadIdnaTest.load(
Settings.UnicodeTools.UNICODETOOLS_DIR + "data/idna/" + lastVersion);
Settings.UnicodeTools.DATA_DIR + "idna/" + lastVersion);

for (TestLine testLine : testLines) {
count += generateLine(replaceNewerThan(testLine.source, ucdTypesLastVersion), out, out2);
@@ -163,7 +163,8 @@ static public Set<TestLine> load(String directory) {

public static void main(String[] args) {
Set<Idna2008Status> seen = EnumSet.noneOf(Idna2008Status.class);
for (TestLine testLine : load(Settings.UnicodeTools.UNICODETOOLS_DIR + "data/idna/13.0.0")) {
// TODO: latestVersion? lastVersion?
for (TestLine testLine : load(Settings.UnicodeTools.DATA_DIR + "idna/13.0.0")) {
System.out.println(testLine);
}
}
3 changes: 2 additions & 1 deletion unicodetools/src/main/java/org/unicode/idna/Uts46.java
@@ -21,7 +21,8 @@ public class Uts46 extends Idna {
public static Uts46 SINGLETON = new Uts46();

private Uts46() {
new MyHandler().process(Settings.UnicodeTools.DATA_DIR + "idna/" + Settings.latestVersion, "IdnaMappingTable.txt");
String path = Settings.UnicodeTools.getDataPathStringForLatestVersion("idna");
new MyHandler().process(path, "IdnaMappingTable.txt");
types.freeze();
mappings.freeze();
mappings_display.freeze();
@@ -31,7 +31,7 @@ public static Binary forName(String name) {
}
}

public enum Age_Values implements Named {
public enum Age_Values implements Named {
V1_1("1.1"),
V2_0("2.0"),
V2_1("2.1"),
@@ -57,6 +57,7 @@ public enum Age_Values implements Named {
V13_0("13.0"),
V13_1("13.1"), // TODO: there is no Unicode 13.1, see https://github.com/unicode-org/unicodetools/issues/100
V14_0("14.0"),
V15_0("15.0"),
Unassigned("NA");
private final PropertyNames<Age_Values> names;
private Age_Values (String shortName, String...otherNames) {
4 changes: 3 additions & 1 deletion unicodetools/src/main/java/org/unicode/text/UCA/UCA.java
@@ -16,6 +16,7 @@
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Path;
import java.text.MessageFormat;
import java.util.Collections;
import java.util.Comparator;
@@ -1761,7 +1762,8 @@ public UnicodeSet getHomelessSecondaries() {
public static UCA buildCollator(Remap primaryRemap) {
try {
if (VERBOSE) System.out.println("Building UCA");
final String file = Utility.searchDirectory(new File(Settings.UnicodeTools.DATA_DIR + "uca/" + Default.ucdVersion() + "/"), "allkeys", true, ".txt");
final Path dataPath = Settings.UnicodeTools.getDataPathForLatestVersion("uca");
final String file = Utility.searchDirectory(dataPath.toFile(), "allkeys", true, ".txt");
final UCA collator = new UCA(file, Default.ucdVersion(), primaryRemap);
if (VERBOSE) System.out.println("Built version " + collator.getDataVersion() + "/ucd: " + collator.getUCDVersion());
if (VERBOSE) System.out.println("Building UCD data");
@@ -86,7 +86,8 @@ public class GenerateConfusables {
private static final String REVISION = Settings.latestVersion;
static final String VERSION_PROP_VALUE = REVISION; // "V7_0";

static final String reformatedInternal = Settings.UnicodeTools.UNICODETOOLS_DIR + "data/security/" + REVISION + "/data/";
static final String reformatedInternal =
Settings.UnicodeTools.getDataPathString("security", REVISION) + "/data/";
public static final String GEN_SECURITY_DIR = Settings.Output.GEN_DIR + "security/" + REVISION + "/";

// static final XIDModifications REFERENCE_VALUES = new XIDModifications(Settings.UNICODETOOLS_DIRECTORY + "data/security/"
@@ -99,8 +100,8 @@ public class GenerateConfusables {
static final boolean DEBUG = false;

static {
Confusables REFERENCE_VALUES = new Confusables(Settings.UnicodeTools.UNICODETOOLS_DIR + "data/security/"
+ REFERENCE_VERSION);
String path = Settings.UnicodeTools.getDataPathString("security", REFERENCE_VERSION);
Confusables REFERENCE_VALUES = new Confusables(path);
for (EntryRange<String> entry : REFERENCE_VALUES.getRawMapToRepresentative(Confusables.Style.MA).entryRanges()) {
if (entry.string != null) {
LAST_COUNT.add(entry.value, 1);
@@ -86,7 +86,8 @@ public class GenerateConfusablesCopy {
private static final String REVISION = Settings.latestVersion;
private static final String VERSION_PROP_VALUE = "V7_0";

private static final String outdir = Settings.UnicodeTools.UNICODETOOLS_DIR + "data/security/" + REVISION + "/data/";
private static final String outdir =
Settings.UnicodeTools.getDataPathString("security", REVISION) + "/data/";
private static final String indir = outdir + "source/";
private static final UCD DEFAULT_UCD = Default.ucd();
private static final UnicodeProperty.Factory ups = ToolUnicodePropertySource.make(version); // ICUPropertyFactory.make();
@@ -33,6 +33,7 @@
import com.ibm.icu.text.UnicodeSetIterator;

public final class TestNormalization {
// TODO: There is no such Update folder. Is this class obsolete?
static final String DIR = Settings.UnicodeTools.UCD_DIR + "Update 3.0.1/";
static final boolean SKIP_FILE = true;

@@ -605,6 +605,7 @@ public final class UCD_Names implements UCD_Types {
"12.1",
"13.0",
"14.0",
"15.0",
};

static final String[] LONG_AGE = {
@@ -633,6 +634,7 @@ public final class UCD_Names implements UCD_Types {
"V12_1",
"V13_0",
"V14_0",
"V15_0",
};

static final String[] GENERAL_CATEGORY = {
@@ -601,7 +601,8 @@ public interface UCD_Types {
AGE121 = 22,
AGE130 = 23,
AGE140 = 24,
LIMIT_AGE = AGE140 + 1; // + FIX_FOR_NEW_VERSION;
AGE150 = 25,
LIMIT_AGE = AGE150 + 1; // + FIX_FOR_NEW_VERSION;

static final String[] AGE_VERSIONS = {
"?",
@@ -629,6 +630,7 @@ public interface UCD_Types {
"12.1.0",
"13.0.0",
"14.0.0",
"15.0.0",
};

public static byte
@@ -21,11 +21,11 @@
import com.ibm.icu.dev.util.UnicodeMap;

public class CheckSecurityProposals {
private static final String SECURITY_DIR = Settings.UnicodeTools.DATA_DIR + "security/";
private static final String SECURITY_DIR = Settings.UnicodeTools.getDataPathStringForLatestVersion("security");
private static final IndexUnicodeProperties IUP = IndexUnicodeProperties.make(Settings.latestVersion);
private static final UnicodeMap<Age_Values> AGE = IUP.loadEnum(UcdProperty.Age, UcdPropertyValues.Age_Values.class);

public static final Confusables CONFUSABLES = new Confusables(SECURITY_DIR + Settings.latestVersion);
public static final Confusables CONFUSABLES = new Confusables(SECURITY_DIR);
public static final UnicodeMap<String> conMap = CONFUSABLES.getRawMapToRepresentative(Style.MA);

public static Splitter TAB_SPLITTER = Splitter.on('\t').trimResults();
@@ -39,8 +39,10 @@ public static void main(String[] args) {
LinkedHashMultimap<String, String> nonconfusable = LinkedHashMultimap.create();
HashMap<String, String> contributor = new HashMap<>();


for (String line : FileUtilities.in(Settings.UnicodeTools.UNICODETOOLS_DIR + "data/security/" + Settings.latestVersion + "/data/source/", "proposals.txt")) {
String path =
Settings.UnicodeTools.getDataPathStringForLatestVersion("security") +
"/data/source/";
for (String line : FileUtilities.in(path, "proposals.txt")) {
List<String> parts = TAB_SPLITTER.splitToList(line);
String sourceRaw = parts.get(1);
String source = NFD.normalize(Utility.fromHex(sourceRaw, true));
@@ -19,7 +19,9 @@ public static void main(String[] args) {
final SortedSet<String> props = new TreeSet<String>();
final Relation<String,String> values = Relation.of(new HashMap<String,Set<String>>(), HashSet.class);
final Pattern tabSplitter = Pattern.compile("\t");
for (final File file : new File(Settings.UnicodeTools.UCD_DIR + "/Unihan").listFiles()) {
// TODO: There is no Unihan folder directly inside .../unicodetools/data/ucd/
// Is this class obsolete?
for (final File file : new File(Settings.UnicodeTools.UCD_DIR + "Unihan").listFiles()) {
System.out.println(file.getName());
for (final String line : FileUtilities.in(file.getParent(), file.getName())) {
if (line.length() == 0 || line.startsWith("#")) {
@@ -26,7 +26,7 @@ public class FindDuplicateFiles {
private static Map<String,String> DIRS = new LinkedHashMap<>();
static {
String[] dirs = {
Settings.UnicodeTools.UNICODETOOLS_DIR,
Settings.UnicodeTools.UNICODETOOLS_REPO_DIR,
CLDRPaths.BASE_DIRECTORY
};
for (String dir : dirs) {
@@ -167,8 +167,6 @@ private static Identifier_Type getIdentifierType(int cp, String dataLine) {
}
}

private static final String SECURITY = Settings.UnicodeTools.UNICODETOOLS_DIR + "data/security/";

public static void main(String[] args) {
// for (Entry<String, Set<IdentifierType>> x : DATA2TYPE.keyValues()) {
// String s = x.getKey();
@@ -183,7 +181,8 @@ public static void main(String[] args) {
System.out.println("\nValues");
showValues("values.txt", DATA2TYPE, Identifier_Type.recommended);

XIDModifications xidModOld = new XIDModifications(SECURITY + Settings.latestVersion);
String path = Settings.UnicodeTools.getDataPathStringForLatestVersion("security");
XIDModifications xidModOld = new XIDModifications(path);
UnicodeMap<Set<Identifier_Type>> xidMod = xidModOld.getType();

UnicodeMap<Set<String>> cldrChars = CLDRCharacterUtility.getCLDRCharacters();
@@ -17,11 +17,6 @@
* @see com.ibm.icu.text.SpoofChecker
*/
public class RecommendedSetGenerator {
/**
* Update the directory to use for generating the data:
*/
private static final String DIRECTORY = "data/security/" + Settings.latestVersion;

public static void main(String[] args) {
Sets sets = generateSet();
System.out.println("# inclusion: \n" + sets.inclusion.toString());
@@ -85,7 +80,8 @@ public static String uniSetToCodeString(UnicodeSet uniset, String varName, boole
}

public static Sets generateSet() {
XIDModifications inst = new XIDModifications(DIRECTORY);
String path = Settings.UnicodeTools.getDataPathStringForLatestVersion("security");
XIDModifications inst = new XIDModifications(path);

// Compute sets based on status
UnicodeSet allowedS = new UnicodeSet();
@@ -43,7 +43,10 @@ static Factory getFactory() {
public static void main(String[] args) throws IOException {
try {
//checkRegex();
testFile(Settings.UnicodeTools.UCD_DIR + "/xml/ucd.nounihan.grouped.xml");
// TODO: There is no xml folder inside .../unicodetools/data/ucd/
// Instead, there is a ucdxml folder parallel to ucd.
// Is this class obsolete?
testFile(Settings.UnicodeTools.UCD_DIR + "xml/ucd.nounihan.grouped.xml");
// too many errors to test: testFile("C:/DATA/UCD/xml/ucd.nounihan.grouped.xml");
} finally {
System.out.println("DONE");