Umple is using the system default encoding scheme #960

Open
AdamBJ opened this Issue Jan 21, 2017 · 2 comments


@AdamBJ
Contributor
AdamBJ commented Jan 21, 2017

Brief Description

Umple currently relies on the default encoding of whatever system it runs on when its parser reads files (this includes both .ump files and .grammar files). Umple does the same thing when writing files (e.g. Java or PHP files). This is problematic if:

  • The system default encoding differs from the encoding of the .ump file being passed to Umple (it can lead to parsing errors)
  • There are non-Latin characters in the .ump file you pass in to Umple (e.g. in comments or string literals). If your default system encoding is ANSI, for example, the generated files will contain sequences of ? in place of the expected non-Latin characters. It appears that the server running Umple Online is experiencing this issue (it can't handle non-Latin characters).

I encountered this issue on a machine running Windows 10, but it should affect all platforms.

Minimum Steps to Reproduce

To reproduce the "reading files in" problem:

First, determine what the default encoding scheme of your machine is. You can do this by running the following Java code and examining the contents of defaultEncoding:

FileInputStream fis = new FileInputStream("path/to/your/file");
InputStreamReader isr = new InputStreamReader(fis);
String defaultEncoding = isr.getEncoding();
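
The same information can also be obtained without opening a file. The sketch below is only an illustration (the class and method names are made up for this example); it uses `Charset.defaultCharset()`, which reports the canonical charset name (e.g. `windows-1252`), whereas `InputStreamReader.getEncoding()` may return a historical name such as `Cp1252`.

```java
import java.nio.charset.Charset;

public class DefaultEncodingCheck {
    // Returns the canonical name of the JVM's default charset.
    static String defaultEncodingName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultEncodingName());
        // The file.encoding system property usually reflects the same setting.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Note that on JDK 18 and later the default charset is UTF-8 regardless of the operating system locale, so the problem described here mainly shows up on older JDKs.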

If your system default is UTF-8, you might not be able to reproduce this part of the issue. If it's ANSI or a variant of ANSI (such as Windows-1252 or CP1252), try passing this .ump file into Umple. Umple should return a parsing error, even though there's nothing syntactically wrong with the file.

To reproduce the output file encoding problem:

Just add non-Latin characters in a comment or as a string literal in any example in Umple Online. After generating code, you'll see ? instead of the expected characters. Here are some Japanese characters for those who want to try this: 他の文字体系を圧倒する.
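
To see where the ? characters come from without going through Umple, here is a small standalone sketch. It is an assumption-laden illustration, not Umple code: it uses ISO-8859-1 as a stand-in for an "ANSI" default (that charset is guaranteed to exist on every JVM), and the class and method names are invented for the example. Encoding replaces every character the charset cannot represent with its replacement byte, which is ? for this charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encodes the text with the given charset and decodes it again,
    // mimicking a file write followed by a read in that encoding.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String sample = "他の文字体系を圧倒する";
        // Narrow single-byte charset: unmappable characters are lost.
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1));
        // UTF-8 can represent every character, so the text survives.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8));
    }
}
```

What the console actually displays also depends on the terminal's own encoding, so comparing the returned strings in code is more reliable than eyeballing the output.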

Expected Result

In the case of reading files in, we should either be able to specify the encoding of the .ump files we pass in (perhaps via a command-line option such as -encoding), or we should be using a sensible default. I suggest UTF-8 as the default, as it is the most flexible: it can represent every Unicode character, and plain ASCII files decode identically under it. (Files that use the non-ASCII part of an ANSI code page such as Windows-1252 would still need to be converted or read with an explicit encoding.)

A similar pattern holds for output files. We should either be able to specify the encoding scheme directly, or we should use UTF-8 as a sensible default.
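
A rough sketch of how such an option could be resolved is shown below. Everything in it is hypothetical: Umple has no `-encoding` flag today, and the `EncodingOption` class and `resolve` method are names invented for this illustration. The only real APIs used are `Charset.forName` and `StandardCharsets.UTF_8`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingOption {
    // Hypothetical helper: looks for "-encoding <name>" in the arguments
    // and falls back to UTF-8 when the option is absent.
    static Charset resolve(String[] args) {
        for (int i = 0; i + 1 < args.length; i++) {
            if ("-encoding".equals(args[i])) {
                String name = args[i + 1];
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    throw new IllegalArgumentException("Unsupported encoding: " + name, e);
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println("Using encoding: " + resolve(args).name());
    }
}
```

Failing fast on an unknown name seems preferable to silently falling back, since a silent fallback would reintroduce the kind of mis-decoding this issue is about.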

Actual Result

We're seeing sequences of ?s pop up in the output when non-Latin characters are expected (in the case of output files), and encountering unexpected parsing issues (in the case of input files).

This is how the parser reads files in (snippet from ...\umple\UmpleParser\src\GrammarParsing_Code.ump):

  if ((new File(file)).exists())
      {
        reader = new BufferedReader(new FileReader(file));
      }
      else
      {
        resourceStream = getClass().getResourceAsStream(file);
        reader = new BufferedReader(new InputStreamReader(resourceStream));
      }

In both cases, we're relying on the default system character encoding to parse the files correctly (see this Stack Overflow question for more information). If an Umple user has a system with a default encoding that is different from the encoding of the .ump files they're trying to pass to Umple, the decoding scheme Umple applies to the file will be wrong.

To fix this issue (i.e. to make UTF8 the default), we need to change two lines in the above code:

  if ((new File(file)).exists())
      {
        ->reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));<-
      }
      else
      {
        resourceStream = getClass().getResourceAsStream(file);
        ->reader = new BufferedReader(new InputStreamReader(resourceStream, "UTF8"));<-
      }
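
For anyone who wants to try the change outside the Umple code base, here is a self-contained sketch of the same idea. It is not the actual parser code: the `Utf8ReaderSketch` class and `open` method are invented for the example, it swaps the "UTF8" string for the `StandardCharsets.UTF_8` constant (JDK 7+), and it adds a null check for a missing resource that the original snippet does not have.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf8ReaderSketch {
    // Opens a file (or, failing that, a classpath resource) and always
    // decodes it as UTF-8 instead of using the system default.
    static BufferedReader open(String file) throws IOException {
        if (new File(file).exists()) {
            return new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
        }
        InputStream resourceStream = Utf8ReaderSketch.class.getResourceAsStream(file);
        if (resourceStream == null) {
            throw new FileNotFoundException("No file or resource named " + file);
        }
        return new BufferedReader(
            new InputStreamReader(resourceStream, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = open(args[0])) {
            System.out.println(reader.readLine());
        }
    }
}
```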

Similarly, when outputting a file (such as a generated Java file), Umple again relies on the system default. From Generator_CodeJava.ump's writeFile() method:

 File file = new File(path);
 file.mkdirs();   
 String filename = path + File.separator + aClass.getName() + ".java";
 BufferedWriter bw = new BufferedWriter(new FileWriter(filename));

If the system default is ANSI or some variety thereof, generated code that contains any non-Latin characters (for example in a comment or a string constant) won't display correctly (the user will see a sequence of question marks). To fix this issue we'll need to change the output behaviour of each of the writeFile() methods. To fix the writeFile() method above, we just change the last line to:

BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
    new FileOutputStream(filename), "UTF8"));
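
A standalone version of the write side might look like the sketch below. Again, this is an illustration rather than the generator's real code: `Utf8WriterSketch` and `writeUtf8` are made-up names, the `StandardCharsets.UTF_8` constant is used in place of the "UTF8" string, and try-with-resources is added so the writer is always closed.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class Utf8WriterSketch {
    // Writes the content as UTF-8 no matter what the system default is.
    static void writeUtf8(String filename, String content) throws IOException {
        try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(filename), StandardCharsets.UTF_8))) {
            bw.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        writeUtf8(args[0], "// 他の文字体系を圧倒する\n");
    }
}
```

Reading the file back as raw bytes and decoding them as UTF-8 is a quick way to confirm that no ? substitution happened.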

@TimLethbridge, @vahdat-ab mentioned you might want to have input on this issue.

@AdamBJ AdamBJ self-assigned this Jan 21, 2017
@Nava2
Member
Nava2 commented Jan 21, 2017 edited

Small thing: use Guava's Charsets. Guava is already available in our build. Also, Java's String is encoded internally as UTF-16, so that might be something to consider performance-wise (hardly much of a concern).

@AdamBJ
Contributor
AdamBJ commented Jan 21, 2017

@Nava2 So Guava's Charsets just makes things a bit easier because we have to do less error checking (because we can be sure whatever encoding we specify using Guava will be supported)?
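
As a sketch of the difference being asked about: the constructor that takes a charset name as a String declares the checked `UnsupportedEncodingException`, while the overload that takes a `Charset` constant does not. The example uses the JDK's own `StandardCharsets.UTF_8` (available since Java 7) so that it runs without Guava; Guava's `Charsets.UTF_8` is likewise a `Charset` constant and would be used the same way. The class and method names are invented for the illustration.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class CharsetConstantDemo {
    // Name passed as a String: the compiler forces us to handle the
    // checked exception, and a typo is only detected at run time.
    static Reader byName(InputStream in) {
        try {
            return new InputStreamReader(in, "UTF8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Charset constant: no checked exception, and a misspelled constant
    // would not compile.
    static Reader byConstant(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }
}
```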

@AdamBJ AdamBJ removed their assignment Feb 5, 2017