Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add options to allow unicode character in identifier names #4508

Merged
merged 9 commits into from
Jan 20, 2017

Conversation

eblis
Copy link
Contributor

@eblis eblis commented Jan 6, 2017

PR checklist

  • Read the contribution guildelines.
  • Ran the shell/batch script under ./bin/ to update Petstore sample so that CIs can verify the change. (For instance, only need to run ./bin/{LANG}-petstore.sh and ./bin/security/{LANG}-petstore.sh if updating the {LANG} (e.g. php, ruby, python, etc) code generator or {LANG} client's mustache templates)
  • Filed the PR against the correct branch: master for non-breaking changes and 2.3.0 branch for breaking (non-backward compatible) changes.

Description of the PR

Added option to allow unicode characters in class names, method names etc.
Added option to allow keeping of underscore characters in class names, methods names, etc.

Unicode identifiers are supported by some languages, like Java, but the codegen will remove all non ASCII letters by default. Users might want to separate different parts of names using extended punctuation connector characters, like "_, ‿, ⁀, ⁔, ・, ︳, ︴, ﹍, ﹎, ﹏, _, ・"

Underscores cannot be used to separate different parts of the name as they're removed by default by the codegen, the second option allows underscore characters if the user so desires.

Both options can be used separately if needed, for example one might allow extended punctuation connector characters, but still disallow underscore. Sometimes you need to logically separate parts of the identifier names, which isn't supported yet, without these changes.

The default values for both options are set such that the generator will behave as it did before.

…s etc.

Added option to allow keeping of underscore characters in class names, methods names, etc.

Unicode identifiers are supported by some languages, like Java, but the codegen will remove all non ASCII letters by default. Users might want to separate different parts of names using extended punctuation connector characters, like "_, ‿, ⁀, ⁔, ・, ︳, ︴, ﹍, ﹎, ﹏, _, ・".
Underscores cannot be used to separate different parts of the name as they're removed by default by the codegen, the second option allows underscore characters if the user so desires.
Both options can be used separately if needed, for example one might allow extended punctuation connector characters, but still disallow underscore.
@wing328
Copy link
Contributor

wing328 commented Jan 7, 2017

Thanks for the PR but your commit (as shown in the Commits tab) is not linked to your Github account, which means this PR won't count as your contribution in https://github.com/swagger-api/swagger-codegen/graphs/contributors.

Let me know if you need help fixing it.

Ref: https://github.com/swagger-api/swagger-codegen/wiki/FAQ#how-can-i-update-commits-that-are-not-linked-to-my-github-account

@eblis
Copy link
Contributor Author

eblis commented Jan 9, 2017

I've added the email address in the commits as a secondary email, hopefully that should fix it.
Thanks.

If that doesn't work i'll use the script to edit the history, please let me know if I should do that instead.

@wing328
Copy link
Contributor

wing328 commented Jan 9, 2017

Now the commits are correctly linked to your account.

Can you elaborate with a use case in which you need to keep the underscore in the property name? As you probably aware, keeping the underscore in the variable name won't conform to C# style guide in terms of variable naming.

@eblis
Copy link
Contributor Author

eblis commented Jan 9, 2017

Sure.

The reason we need to keep the underscores is to logically separate parts of type names. We're building a proxy server to handle requests to a 3rd party system which has a very weird API and we're building a more RESTful interface for it. The problem is that the 3rd party system types use 2 different names, domain name and type name, and when calling the system we need to know which is which, as the proxy relies on reflection to do it's job (so as not to manually hard-code calls), reflection allows us to be future compatible, any new feature created in the 3rd party system just works without any further changes in the proxy. We've used _ to separate between the domain and type names, but those get removed by the swagger codegen. We'd like to have the option to keep those underscores in certain cases, if the user knows what he's doing and really wants underscores to be present, for any language they might be generating for.

The feature has been implemented at the base layer, so going forward the codegen could have different default values, depending on whether underscores conform to each language's style guides or not, or in cases users really want the underscores for other reasons, like the one I mentioned above.

Alternatively, some languages (like Java, which we're generating for) allow multiple unicode characters as punctuation connector characters, which would alleviate our need for underscores, as we could use one of these extended connectors to logically separate between domain and type names.
Also, some languages support unicode characters in type names, so this feature could be used to allow that support for languages that offer it.

Both feature default values are set to false, so they're not activated unless the user really wants to use either of them. And they can be turned on/off independently, which could be used to further configure the generators for all languages that swagger codegen supports, depending on the style guides for those specific languages.

@wing328
Copy link
Contributor

wing328 commented Jan 12, 2017

@eblis thanks for the detailed explanation.

We've used _ to separate between the domain and type names, but those get removed by the swagger codegen

Let's say the name is "domain_type" and it will become "domainType" in Java, "DomainType" in C#. I would like to point out that these are just variable names but not the actual parameter (path, header, form, query) name. When the API client (e.g. C#) sends the request to the RESTful backend, the actual parameter name (baseName) is used as shown below:

https://github.com/swagger-api/swagger-codegen/blob/master/modules/swagger-codegen/src/main/resources/csharp/api.mustache#L240-L251

We definitely welcome enhancement to meet different use cases but I would like to point this out to make sure we're on the same page after merging the additional features you've built.

@wing328 wing328 added this to the v2.2.2 milestone Jan 12, 2017
@eblis
Copy link
Contributor Author

eblis commented Jan 12, 2017

We are more interested in the variable and class names, as we're using reflection to handle the proxy-ing, and need to know the domain name and type name.
I probably didn't explain it good enough the first time, the data types we use have 2 different logical names in them (as the system we're doing proxy calls for uses this naming convention) so a full name is not just domain_type (or DomainType, domainType) but instead it is domain_type, for example ActiveDirectory_OrganizationalUnit. We'd like to have a separator there so we can determine, at runtime, that domain was ActiveDirectory and type was OrganizationalUnit.
(There are some other ways we could achieve this, like adding an enumerated value to each type definition, but that would just generate a lot of fluff)

We can do the logical separation using underscores, but we can also use a unicode extended separator character (e.g. _) to logically separate these parts of the name.

However, I thought that having the option to strip/not strip underscores might be useful for some people (I noticed other reported issues mentioning the fact that underscores get stripped) and also for some languages which may use/approve of this naming styling (C, Python, etc ?) The two features should allow for more customizing options for each supported language, e.g. enable unicode characters for languages that support it, enable underscores for languages (or companies) that use snake case.

@eblis
Copy link
Contributor Author

eblis commented Jan 12, 2017

I just noticed the name template variable for classes, which I missed the first time looking at the variables :(
Looks like it's holding the original name of the class, including any unicode characters and/or underscores so we could probably use this value and annotate all generated types with the original name, instead of relying on the name of the class.

However, I feel like the original pull request is still valid, as you can make use of unicode characters in variable names and allow languages (and companies) that use snake case to generate code using swagger codegen.

@bhuvnesharya
Copy link

Thanks for the excellent use cases @eblis . I am having a use case where data is coming from the MATLAB type system and the image is already created from that and used from a long time and the response is again send back to the other system to do some analysis. I have to write an application in between that do some transformations and send the data in the same format as it is coming from the first system( snake_case). If I will not have support of snake_case then I need to convert snake_case to camelCase ( already done by swagger codegen , while storing I need to store it in camelCase and while retrieving and send back data to other system , I need to convert it back to snake_case. Which creates complexity while writing code.

@bhuvnesharya
Copy link

@eblis : Can you provide more insight , how to use KEEP_UNDERSCORES parameter in maven codegen. Shall I use this as a config options.

@eblis
Copy link
Contributor Author

eblis commented Jan 16, 2017

Just a note, the KEEP_UNDERSCORES setting will not prevent camelization of the names, as I wasn't really interested in the case that much, I just wanted to keep the underscores in place. I guess it makes more sense to stop camelization if the KEEP_UNDERSCORES setting is enabled, as you probably don't want to have camel case + snake case.

The easiest way to use the new settings is to enable them in the configuration options, e.g.:
{ "allowUnicodeIdentifiers": "false", "keepUnderscores": "false", "interfaceOnly": false, "java8": true, "serializableModel": false, "async": false, "library": "spring-boot" }

…ifier case as is (you probably don't want camel case + snake case, just snake case when KEEP_UNDERSCORES is set).

Added unit tests to verify how the case is computed for various scenarios.
@eblis
Copy link
Contributor Author

eblis commented Jan 16, 2017

I have modified the pull request such that it doesn't camelize the names if KEEP_UNDERSCORES option is selected. The settings acts like a case selector now, if KEEP_UNDERSCORES is false camel case will be used, if KEEP_UNDERSCORES is true snake case will be used instead.

@bhuvnesharya
Copy link

Thanks @eblis . I am currently using swagger-codegen-maven-plugin-2.2.1 which internally uses swagger codegen -2.2.1 . May I know when a new release will be coming where I can get these above changes merged into master.

@eblis
Copy link
Contributor Author

eblis commented Jan 17, 2017

It depends if and when this pull request is accepted, the code from this pull request isn't available in the swagger code.
Maybe if there's more demand for the feature it will get in :)

@wing328
Copy link
Contributor

wing328 commented Jan 18, 2017

@eblis first of all, thank you for your work. The support for unicode identifier is definitely a nice one and we definitely want to include the enhancement in the upcoming release.

But for the "underscore" feature, I do have the following feedback:

  1. the switch/flag is done inside the camelize function. IMO, that's not the best way to do it because a developer will expect the function to simply convert the input into camel case. A switch/flag inside the function makes the implementation relatively easy but that will completely change the behaviour of the function, which can be used elsewhere such as class naming, function naming, filename naming, parameter naming, etc.

  2. For some generators (e.g. typescript) we do have an option to keep the original naming, e.g.

	modelPropertyNaming
	    Naming convention for the property: 'camelCase', 'PascalCase', 'snake_case' and 'original', which keeps the original name (Default: camelCase)

(ref: java -jar modules/swagger-codegen-cli/target/swagger-codegen-cli.jar config-help -l typescript-node)

If we implement this option for other generators, it should cover what the "KEEP_UNDERSCORE" option provides. Do you agree?

What I would propose is that I'll cherry pick your work related to ALLOW_UNICODE_IDENTIFIERS and add the feature to the latest master while we discuss further on the "KEEP_UNDERSCORE" feature. What do you think?

@eblis
Copy link
Contributor Author

eblis commented Jan 18, 2017 via email

@eblis
Copy link
Contributor Author

eblis commented Jan 19, 2017

I've reworked the pull request to only include changes related to unicode characters support.

@eblis eblis changed the title Options to allow unicode character in identifier names and keep underscore characters Options to allow unicode character in identifier names ~~and keep underscore characters~~ Jan 19, 2017
@eblis eblis changed the title Options to allow unicode character in identifier names ~~and keep underscore characters~~ Options to allow unicode character in identifier names Jan 19, 2017
@wing328
Copy link
Contributor

wing328 commented Jan 19, 2017

@eblis thanks. Looks good to me. I'll perform some tests later and merge if no question/feedback from me or anyone in the community.

@wing328
Copy link
Contributor

wing328 commented Jan 19, 2017

Btw, before I forgot, for your upcoming PRs (not this one), I would recommend creating a new branch for the change as per git best practice.

@wing328 wing328 merged commit 3b3f2d2 into swagger-api:master Jan 20, 2017
@wing328
Copy link
Contributor

wing328 commented Jan 20, 2017

@eblis PR merged into master. Thanks for your contribution.

@eblis
Copy link
Contributor Author

eblis commented Jan 20, 2017 via email

@wing328 wing328 changed the title Options to allow unicode character in identifier names Add options to allow unicode character in identifier names Feb 22, 2017
davidgri pushed a commit to davidgri/swagger-codegen that referenced this pull request May 11, 2017
)

* Added option to allow unicode identifiers in class names, method names etc.
Added option to allow keeping of underscore characters in class names, methods names, etc.

Unicode identifiers are supported by some languages, like Java, but the codegen will remove all non ASCII letters by default. Users might want to separate different parts of names using extended punctuation connector characters, like "_, ‿, ⁀, ⁔, ・, ︳, ︴, ﹍, ﹎, ﹏, _, ・".
Underscores cannot be used to separate different parts of the name as they're removed by default by the codegen, the second option allows underscore characters if the user so desires.
Both options can be used separately if needed, for example one might allow extended punctuation connector characters, but still disallow underscore.

* Added new command line options to all required unit tests.

* Added KEEP_UNDERSCORES and ALLOW_UNICODE_IDENTIFIERS to Bash tests.

* When KEEP_UNDERSCORES is set don't camelize the names, keep the identifier case as is (you probably don't want camel case + snake case, just snake case when KEEP_UNDERSCORES is set).
Added unit tests to verify how the case is computed for various scenarios.

* Reworked pull request to only include changes related to supporting unicode characters in identifiers (removed references to keep underscores).

* These methods and classes can be static again.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants