General improvement to the C/C++ lexer #1350

Merged
merged 32 commits into from
May 22, 2020

hgruniaux
Contributor

@hgruniaux hgruniaux commented Jan 5, 2020

Improve the Pygments C and C++ lexers. This includes adding new features and fixing existing ones.

Changelog

  • Add support for hexadecimal floating-point literals (C99 and C++17) (e.g. 0x1.2p3)
    Reference.

  • Add support for digit separators in literals (C++14, but implemented in the C lexer too) (e.g. 341'658'362)
    Digit separators have been added to the generic C lexer, so C also has this functionality
    even though it is not in the C specification.
    Reference.
    Reference.

  • Correct integer suffixes. Now at most two 'l' (lowercase or uppercase) and one 'u' (lowercase or uppercase)
    are matched, so 1234uuuuuuuu is no longer valid.
    Reference.

  • Add support for _Alignas, _Alignof, _Noreturn, _Generic, _Thread_local, _Static_assert, _Imaginary, _Atomic (C99, C11).
    Reference.

  • Add keyword macros to the C lexer (e.g. complex, imaginary, noreturn, alignas, etc.)

  • Add support for binary integer literals (C++14) (e.g. 0b11001)
    Binary literals have been added to the generic C lexer, so C also has this functionality
    even though it is not in the C specification.
    Reference.

  • Move _Bool and _Complex from the generic C lexer to the C lexer (C++ does not have them).

  • Correct floating-point literal suffixes: l and f now both work.
    Previously, even u was recognized as a valid suffix.
    Reference.

  • Correct hexadecimal integer literal prefix, now 0XFF works.
    Previously, only 0x (lowercase) worked as a hexadecimal prefix.
    Reference.

  • Identifiers following concept (C++20), enum, struct, union, class and typename, are now recognized as class names.
    Previously, only identifiers following class worked in C++.

  • Add highlighting for C11 atomic types from the standard library (e.g. atomic_uint).
    Like C99 standard types, C11 atomic types are only highlighted when the c11highlighting option is True (default).
    Reference.

  • Add support for u'a', U'a', and u8'a' (before only L'a' was supported).
    Reference (C++).
    Reference (C).

  • Add support for $ (dollar sign) in identifiers.
    This is a language extension supported by GCC, Clang, MSVC, and probably many others.

  • Add _Pragma keyword to generic C lexer.

  • Add Unicode string literals to C (e.g. u8"a", u"a" or U"a").
    Previously, they were only supported in C++.

  • Add support for Universal Character Names (UCNs)

  • A - directly in front of a numeric literal is now highlighted as part of the number (e.g. -9.56 is highlighted as a single number, including the -).
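
The numeric-literal rules in the changelog above can be sketched with a few regular expressions. This is only an illustration using Python's standard `re` module: the pattern names and the `classify` helper are hypothetical and are not copied from pygments/lexers/c_cpp.py, whose actual rules may be written differently.

```python
import re

# Hypothetical building blocks (assumptions for illustration, not the Pygments source).
DIGITS = r"[0-9](?:'?[0-9])*"            # decimal digits, optional ' separators
HEX = r"[0-9a-fA-F](?:'?[0-9a-fA-F])*"   # hexadecimal digits, optional ' separators
# At most one u/U and at most two l/L (ll or LL), in either order.
INT_SUFFIX = r"(?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?"

PATTERNS = [
    # Hexadecimal float: the binary exponent (p or P) is mandatory.
    ("hex-float", re.compile(
        rf"0[xX](?:{HEX}\.?|(?:{HEX})?\.{HEX})[pP][+-]?{DIGITS}[fFlL]?")),
    # Binary integer, e.g. 0b11001.
    ("bin-int", re.compile(rf"0[bB][01](?:'?[01])*{INT_SUFFIX}")),
    # Hexadecimal integer; both 0x and 0X prefixes are accepted.
    ("hex-int", re.compile(rf"0[xX]{HEX}{INT_SUFFIX}")),
    # Decimal integer with optional digit separators.
    ("dec-int", re.compile(rf"{DIGITS}{INT_SUFFIX}")),
]


def classify(literal):
    """Return the name of the first pattern matching the whole literal, else None."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(literal):
            return name
    return None


if __name__ == "__main__":
    for text in ["0x1.2p3", "0x5.5", "341'658'362", "0b11001", "0XFF", "1234uuuuuuuu"]:
        print(text, "->", classify(text))
```

Trying alternatives in a fixed order mirrors how a regex-based lexer picks the first rule that matches, which is why the more specific hexadecimal-float pattern is listed before the hexadecimal-integer one.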

Solves

hgruniaux and others added 3 commits January 5, 2020 22:18
Add '_Imaginary', '_Static_assert', '_Atomic' keywords.
Add support for C11 atomic types `atomic_*`.
hgruniaux and others added 17 commits January 6, 2020 12:33
Fix bad highlighting for `5.`, where `.` was not highlighted.
Hexadecimal floating-point literals need an exponent (`0x5p8`). Before this commit, even hexadecimal floating-point literals without an exponent were accepted (e.g. `0x5.5`).
Some old C/C++ compilers supported `$` (dollar sign) in identifiers, and some newer ones continue to support it for legacy reasons. Some code may therefore use it, so it is preferable to highlight it correctly.
- Add '_Pragma' keyword
- Recognize the identifier following 'typename' as Name.Class
- Do not tokenize 'class' or 'struct' following 'enum' as Name.Class, but instead as Keyword (C++ lexer)
- Move some C++ keywords to the generic lexer (`alignas`, `alignof`, etc...)
- Add some C keywords (`noreturn`, `imaginary`, `complex`)
- And other things...
Now `class`, `struct`, `enum`, `union`, etc. can be used alone. Previously, the lexer did not recognize them unless they were followed by an identifier. This regression was introduced in pygments@013bf6a by me.
Some lexers depend on the old state names (e.g. the `classname` state) to work. This commit reintroduces these old names.
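
The class-name behaviour described in the commit notes above can be approximated with a small word tagger. This is a minimal sketch under stated assumptions: the `WORD` pattern, the `INTRODUCERS` set, and the `tag_words` function are hypothetical names, the token names are plain strings rather than Pygments token objects, and the real lexer implements this with lexer states rather than a loop.

```python
import re

# Hypothetical names for illustration; not taken from the Pygments source.
WORD = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")   # identifiers may contain $
INTRODUCERS = {"class", "struct", "union", "enum", "typename", "concept"}


def tag_words(code):
    """Tag each word as Keyword, Name.Class, or Name (punctuation is skipped)."""
    tokens = []
    prev_end = 0
    after_introducer = False
    for match in WORD.finditer(code):
        word = match.group()
        # Only whitespace may separate an introducer from the class name.
        adjacent = code[prev_end:match.start()].strip() == ""
        if word in INTRODUCERS:
            # `enum class X`: `class` stays a keyword and X becomes the class name.
            tokens.append(("Keyword", word))
            after_introducer = True
        elif after_introducer and adjacent:
            tokens.append(("Name.Class", word))
            after_introducer = False
        else:
            tokens.append(("Name", word))
            after_introducer = False
        prev_end = match.end()
    return tokens


if __name__ == "__main__":
    print(tag_words("enum class Color$1 {}"))
    print(tag_words("struct; foo"))
```

The adjacency check is what lets a keyword such as `struct` appear on its own: if anything other than whitespace follows it, the next word is treated as an ordinary name.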
@hgruniaux hgruniaux requested a review from Anteru January 6, 2020 19:07
@Anteru
Collaborator

Anteru commented Mar 2, 2020

Thanks a lot for this huge PR -- one question before I assign this to a milestone. There are a few TODO items left, are you planning to work on those as well or should I merge it as-is and leave those open for a future PR?

@hgruniaux
Contributor Author

Hi! I will try to finish the few remaining TODO items, and I will let you know when they are done or if I give up on them.

@Anteru Anteru modified the milestones: 2.6, 2.7 Mar 5, 2020
@Anteru
Collaborator

Anteru commented Mar 5, 2020

Perfect. I'm going to cut off 2.6 this week, so I'm targeting this at 2.7. Please take your time to fix this, and thanks a lot for your contribution!

@hgruniaux
Contributor Author

This pull request is finished. The remaining TODOs will not be completed, or at least not by me and not in this specific pull request, so you can merge it into the main branch if you want.

Thank you in advance.
