Coding standard proposal

VM Coding Style Guidelines

We need to define a coding convention standard for the new VM code. The results of this discussion should be located on the wiki or in a new version of the CONTRIBUTING.md document.

Conding style guidelines include cosmetic, functional and quality aspects. From a quality standpoint what is really important on this aspect is to be as consistent as possible on the codebase (i.e. use the same coding style everywhere).

I suggest keeping the following original guidelines from the original VM source code, but this is clearly not enough:

Original VM coding guidelines from CONTRIBUTING.md

C Source Code Formatting

When editing an existing file, please be polite and first follow the guidelines below, and secondly follow the general rule of keeping to the same formatting conventions as exist in the file.

C Function Declarations

C function declarations should have the type on the same line as the function name and parameters, e.g.:

static int convertCopy(char *from, int fromLen, char *to, int toLen, int term)

C Function Definitions

C function definitions should have the type on one line and the function name and parameters on the following line, e.g.:

static int
convertCopy(char *from, int fromLen, char *to, int toLen, int term)

This facilitates searching for function definitions by searching for the name at the beginning of the line.

Tab stops are 4 spaces

In this case the first rule is to keep the existing formatting within a file - we don't want to polute the code history with space changes.

The goal is to use tabs set to 4 spaces.

Additional coding guidelines

Checking existing coding guidelines

The following are some popular coding style that used in the industry and open source/free software projects:

Google C++ style guide: https://google.github.io/styleguide/cppguide.html
GNU coding standard: https://www.gnu.org/prep/standards/
MISRA (Safe critical and reliable software): https://en.wikipedia.org/wiki/MISRA_C

We should take and adapt some of the guidelines that are used by these coding standards.

Indentation style

A coding standard is not the same as an indentation and formatting standard. A coding standard may have some .

For a list of indentation styles check: https://en.wikipedia.org/wiki/Indentation_style . For my own projects I typically use the Allman "BSD style" from that list. Another popular indentation style is to use a variant from the Kernighan and Ritchie style from that list. The K&R styles are more similar to the indentation style that is typically used in Java.

An indentation style must be selected explicitly for the whole VM source code project (excluding third party libraries). The selection of this indentation style must be specified with the command line parameters that have to be passed to the indent Unix utility. New code should be written by following that indentation style, and existing code should be converted by running the indent utility on them. The indentation style typically can be specified on IDEs such as Eclipse, and text editors such as Emacs so that they automatically help on writing code following the indentation.

Specific guidelines

Single line comments

Single line comments using // were not a part of the C standard until C99. These comments are supported by the important compilers (gcc, clang, msvc). The advantage of using these comment is that they facilitate using /**/ for comment large blocks of code for debugging purposes. These same large blocks of code can also be commented by #if 0 ... #endif. Since these comments are supported by the important compilers, then there is no reason for not using them.

Include guards and #pragma once

The #pragma once preprocessor directive is not part of the C standard. However, mainstream C compilers such as gcc, clang and msvc do support this pragma. The C standard does also mandates that a #pragma that is not understood by a compiler must be ignore. The intention of this pragma is to improve compilation speed, but because of its non-standard nature it should be used in conjunction with include guards, in the following way:

#ifndef HEADERFOLDER_NAME_H_
#define HEADERFOLDER_NAME_H_
#pragma once

/**
 Here goes the C header content.
*/
#endif //HEADERFOLDER_NAME_H_

Cast from void* to specific type

In C, casting implicitly from void* is an allowed operation, but this is not allowed in C++. For example, the following is valid code in C, but not in C++:

int *array = malloc(sizeof(int)*8);

The following version of the same code is valid in C and in C++:

int *array = (int*)malloc(sizeof(int)*8);

Since the second version is more explicit, I recommend that we use that version just for the sake of being more explicit.

Order of includes in headers

According to the Google C++ coding standard, header files should be self contained and they should compile by it self. To facilitate this, the order of the includes should be the following:

Project specific headers should come first.
System provided headers should come after.

The intention of this ordering is to produce a compilation error if a project specific header is missing the inclusion of a dependency.

Standard headers that are always present

The C language has the following two compilation modes:

Free standing mode (-ffreestanding): this mode is typically used for embedded system or for constructing an operating system kernel.
Hosted mode: the default mode with an operating system.

The following headers are mandated by the language standard to be provided by the compiler on free standing mode, so it can be assumed that they are always going to be present even if there is no operating system:

<float.h> (C99)
<iso646.h> (C99)
<limits.h> (C99)
<stdalign.h> (C11)
<stdarg.h> (C99)
<stdbool.h> (C99)
<stddef.h> (C99)
<stdint.h> (C99)
<stdnoreturn.h> (C11)
<stdatomic.h> (C11)

From this list, the most important are <stddef.h>, <stdarg.h> and <stdint.h>. stddef defines ptrdiff_t, size_t, wchar_t, NULL and offsetof. stdarg defines macros for implementing functions with variable arguments such as printf. stdint defines integer types with fixed size such as int32_t and uint32_t. Since these headers are standard and they have to be always provided by the compiler, then there is no reason to not use them.

Clang and GCC go for full C11 compatibility. MSVC implements only the subsets of C99 and C11 that it also needs to implement a corresponding C++ standard (https://herbsutter.com/2012/05/03/reader-qa-what-about-vc-and-c99/).

Prefixes for namespacing

The lack of namespaces in C, and the usage of a single global symbol table (specially on Linux. DLLs on Windows have their own symbol table. OS X uses something similar to RTLD_DEEPBIND by default) has a huge potential for producing name clashes. New code for the VM should use a prefix for simulating namespaces in C which is a standard practice in C libraries. This prefix could be realized by separating with an underscore:

int
pharovm_parseArgument(int argc, const char **argv)
{
    //...
    return 0;
}

Or by juxtaposition by using CamelCase:

int
pharoVMParseArgument(int argc, const char **argv)
{
    //...
    return 0;
}

I suggest using the version with underscores because it allows to have multiple namespaces, and we could use a different one for VM internals that should not be exposed to the public:

int
pharovm_internal_interpreter_pushOop(sqInt oop)
{
    //...
}

We should decide what is this prefix going to be. The following are some suggestions:

pharo
pharovm
osvm (OpenSmalltalkVM)

Memory management hygiene

The explicit memory management in C is the source of several reliability and security issues. For this reason we need some guidelines when allocating memory, working with them, and disposing it.

alloca should not be used

alloca is a built-in primitive from the C compiler for allocating memory on the stack. The implementation of alloca is done by decrementing the stack pointer. On Linux only the memory for the stack of the main thread can dynamically grow. The memory for threads different to the main thread is allocated on thread creation time and it is fixed for the whole lifetime of the thread. Since alloca can be used with large sizes, there is no guarantee on having enough memory on the stack for fullfilling the alloca, there is a strong potential for provocating a stack underflow and an almost guaranteed segmentation fault or memory corruption. For this reason the usage alloca is typically forbidden by coding standards used in the industry.

In case where alloca is absolutely required, the size of the allocation should be really small (no more than 100s of bytes). A local array with the maximum size should always be preferred to alloca.

alloca should not be used for MAX_PATH for the same reason that a local variable should not be used for MAX_PATH. The two idioms are equivalent.

malloced memory should be always fully inititialized

malloc returns uninitialized memory which can contain random garbage, but in many cases it may look like zeroed memory. Some of this garbage may never be initialized explicitly and could be interpreted in different random ways that vary with the execution of the program. This usage of uninitialized memory may be very hard to replicate and to debug. In some cases this uninitialized memory may introduce subtle security bugs and leak sensitive information like in the case of the infamous hearbleed bug in openssl. For this reason it is recommended to do a memset after a malloc. For example:

#define BufferSize 4096
char *buffer = (char*)malloc(BufferSize);
memset(buffer, 0, BufferSize);

calloc should preferred over malloc

calloc always zeroes the allocated memory by doing an implicit memset. For this reason, calloc should be preferred to a raw malloc.

Initialization of structures

Structure should be always fully initialized. Most of the time it is recommended to fully zero initialize structures by using the following idiom:

#include <stdlib.h> // For malloc and family
#include <string.h> // For memset

typedef struct some_struct_s
{
    char firstField;
    int secondField;
} some_struct_t;

// Global variables are always zero initialized.
some_struct_t globalVariable;

// In C, but not in C++, we can also initialize structures
// in the following way:
some_struct_t globalVariable2 = {
    .firstField = 'A',
    .secondField = 42
};

void someFunction()
{
    // ...

    // Local structure variable definition.
    some_struct_t localVariable = {};

    // DO NOT USE the following:
    //struct some_struct localVariable = {0};
    // Some compiler may only initialize the first field with this variant.


    // Use memset with malloc
    some_struct_t *pointer = (some_struct_t*)malloc(sizeof(some_struct_t));
    memset(pointer, 0, sizeof(some_struct_t));

    // Or use calloc
    some_struct_t *otherPointer = (some_struct_t*)calloc(1, sizeof(some_struct_t));

    // ...
}

String handling safeness

The usage of null terminated string literals in C is one of its greatest design errors. This design has performance problems. Another problem with string handling is that the string functions provided by the standard C library are prone to produce buffer overflows (e.g: strcpy, strcat).

Common operations on strings require knowing the length of a string for allocating memory to hold the result, which requires doing a linear iteration per string just for asking for its size. The correct way of fixing this is to prefix a string with its size (a la Pascal), but this introduces a strong penalty on the code. Instead we are going to accept this performance penalty, but we still need to use functions that operates on strings and take care doing a correct allocation of the result buffer, and provide safeness, security and reliability correctness when manipulating string.

The basic operations that should be provided by this library are the followings:

String duplication.
String concatenation.
String formatting (wrapper on snprintf, but properly allocating the result buffer).

String functions from the standard library that do not receive the size of the target should not be used (e.g: strcpy, strcat, sprintf). Instead versions, of these functions that receive the size of the target buffer must be used (strncat, snprintf). In the case of strncpy, it must be avoided because it is designed for fixed sized strings without a null terminator, and strncpy may not insert the required null terminator at the end location.

Pointers to constants

Functions that receive string arguments that are not mutated must have pointer to constant values type. These types can be written in the following two equivalent ways:

void functionReceivingStringLiteral(const char *string);
void functionReceivingStringLiteral(char const *string);

For reasons of consistency, only one of these two ways of writing this type must be used. The first way of defining this type is the most commonly used way, but the second way may be easier to understand, and some files are currently using it.

In C++ unlike C, there is no implicit cast from string literals to char*, so for example the following code is legal in C, but not in C++:

int executeImage(char *imageFileName);


int main()
{
    return executeImage("Pharo.image");
}

Using const as much as possible it is recommended because allows using the compiler to detect some errors. In the case of functions that belong to public interface that is exposed to plugins or applications that are embedding the VM, these headers should be usable from C++ code.

Operating system naming convention

The following naming prefixes should be used when writing operating system specific code:

For generic Unixes (Linux, FreeBSD, and to some extent OS X), they should be referred as Unix or abbreviated as unix. When using specific APIs from an Unix such as epoll in Linux vs kqueue in BSDs (including OS X), Unix specific variant naming and abbreviations should be used.
Generic BSDs should be abbreviated as just bsd.
OS X specific code should use osx. (Maybe we should use macos?)
For Windows, windows in full, or win32, or w32 should be used. In some communities, using win for Windows is advised against (GNU coding standard: https://www.gnu.org/prep/standards/html_node/System-Portability.html#System-Portability https://www.gnu.org/prep/standards/html_node/Trademarks.html#Trademarks ). Even on 64 bits mode the API is still called win32.
For the Windows App Store, the API is called UWP (Universal Windows Platform). This API is completely different from the Win32 API (except for a supported subset of Win), and it is only necessary to support it for some ARM devices, the Hololens and the Xbox One. On desktop the Win32 API is preferred.

C programming language standards

For reference, here are the draft versions of the current standards of the C language:

These language standards should be used mostly as guidelines for separating platforms independent features from platform specific features. GCC and clang tend to follow the language standard closely, but MSVC tend to only implement the standard partially. C99 features tend to be more widely supported. From C11, the support for atomic operation seems to be the most relevant feature for the VM.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly