# [chibicc](https://github.com/rui314/chibicc)

## va_list, va_start, va_arg

```c
// Takes a printf-style format string and returns a formatted string.
char *format(char *fmt, ...) {
  char *buf;
  size_t buflen;
  FILE *out = open_memstream(&buf, &buflen);

  va_list ap;
  va_start(ap, fmt);
  vfprintf(out, fmt, ap);
  va_end(ap);
  fclose(out);
  return buf;
}
```

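1. The `va_*` macros boil down to pointer arithmetic. A simplified model, assuming every argument is passed contiguously on the stack: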

```c
typedef unsigned char *va_list;
#define va_start(list, param) (list = (((va_list)&param) + sizeof(param)))
#define va_arg(list, type)    (*(type *)((list += sizeof(type)) - sizeof(type)))

```

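GCC does not define these as ordinary macros, though; they map to compiler builtins (`__builtin_va_start`, `__builtin_va_arg`, `__builtin_va_end`). On targets such as x86-64, where some arguments are passed in registers, `va_list` is also more than a plain byte pointer.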
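
For comparison, here is a minimal sketch of a variadic function written against the standard `<stdarg.h>` interface rather than the hand-rolled macros. The function name `sum_ints` is made up for illustration and is not part of chibicc.

```c
#include <stdarg.h>

// Hypothetical helper: adds up `count` int arguments that follow `count`.
int sum_ints(int count, ...) {
  va_list ap;
  va_start(ap, count);          // start just past the last named parameter
  int total = 0;
  for (int i = 0; i < count; i++)
    total += va_arg(ap, int);   // read the next argument as an int and advance
  va_end(ap);
  return total;
}
```

Calling `sum_ints(3, 1, 2, 3)` walks the three trailing arguments in order. The callee has no way to learn the argument count or types on its own, which is why `printf`-style functions rely on a format string.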


![](resources/01.png)

-----------------
-----------------

## tokenize.c

```c
// Tokenize a given string and returns new tokens.
Token *tokenize(File *file) {
  current_file = file;

  char *p = file->contents;
  Token head = {};
  Token *cur = &head;

  at_bol = true;
  has_space = false;

  while (*p) {
    // Skip line comments.
    if (startswith(p, "//")) {
      p += 2;
      while (*p != '\n')
        p++;
      has_space = true;
      continue;
    }

    // Skip block comments.
    if (startswith(p, "/*")) {
      char *q = strstr(p + 2, "*/");
      if (!q)
        error_at(p, "unclosed block comment");
      p = q + 2;
      has_space = true;
      continue;
    }

    // Skip newline.
    if (*p == '\n') {
      p++;
      at_bol = true;
      has_space = false;
      continue;
    }

    // Skip whitespace characters.
    if (isspace(*p)) {
      p++;
      has_space = true;
      continue;
    }

    // Numeric literal
    if (isdigit(*p) || (*p == '.' && isdigit(p[1]))) {
      char *q = p++;
      for (;;) {
        if (p[0] && p[1] && strchr("eEpP", p[0]) && strchr("+-", p[1]))
          p += 2;
        else if (isalnum(*p) || *p == '.')
          p++;
        else
          break;
      }
      cur = cur->next = new_token(TK_PP_NUM, q, p);
      continue;
    }

```

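Tokenizing the file is fairly simple. There are only a few main cases:

(1) comments, (2) newlines, (3) whitespace, (4) numeric literals, (5) string literals, (6) keywords, (7) operators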
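
The numeric-literal case is the least obvious one, so here is a small standalone sketch of the same scanning loop. The function name `pp_number_len` is an assumption made for this note; chibicc does this inline inside `tokenize`.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: length of the preprocessing number starting at p,
// or 0 if p does not start one.
size_t pp_number_len(const char *p) {
  const char *start = p;
  if (!(isdigit((unsigned char)p[0]) ||
        (p[0] == '.' && isdigit((unsigned char)p[1]))))
    return 0;
  p++;
  for (;;) {
    // A signed exponent such as "e+" or "p-" belongs to the number.
    // The p[0] && p[1] guard matters: strchr also matches the terminating NUL.
    if (p[0] && p[1] && strchr("eEpP", p[0]) && strchr("+-", p[1]))
      p += 2;
    else if (isalnum((unsigned char)*p) || *p == '.')
      p++;
    else
      break;
  }
  return (size_t)(p - start);
}
```

The scanner is deliberately permissive: it accepts anything shaped like a number (including suffixes and hex floats) and leaves validation to a later stage, which is why the token kind is `TK_PP_NUM` rather than a final numeric token.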

----------------------------------------------------------------------------
-------------------------------------------------------------

## preprocess.c

[Macro Algo: Dave Prosser Algo](resources/cpp.algo.pdf)

[GCC Macros](https://gcc.gnu.org/onlinedocs/cpp/Macros.html)

![](resources/02.png)

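Dave Prosser's algorithm:

1. Every token carries a hideset: the set of macro names whose expansion produced that token. Initially every hideset is empty (`{}`).

2. During macro expansion, if the current token's name is already in its hideset, that macro has been expanded once on the way to this token, so it is not expanded again. This is what prevents infinite recursive expansion.

3. If the name is not in the hideset, the token is replaced with the macro's body, and the macro's name is added to the hideset of the resulting tokens (the $HS \cup \{T\}$ in the algorithm).

4. For a function-like macro (note: the names in the macro definition are called parameters; the values passed at the call site are called actuals or arguments), the actuals are macro-expanded first, and the expanded actuals are then substituted for the parameters in the body. The hideset applied to the result starts from the intersection of the macro-name token's hideset `HS` and the closing parenthesis's hideset `HS'`, which is the $(HS \cap HS')$ in the algorithm; the macro's name is then added to it.

5. In the code, expansion is retried on each token in a loop until that token can no longer be expanded; only then does processing move on to the next token. This handles macros defined in terms of other macros, for example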

    ```c
    #define A 1
    #define B A
    #define C B
    #define D C+B
    ```
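
Both behaviours can be observed with any conforming C preprocessor. The sketch below reuses the chain above and adds a self-referential macro; the names `counter`, `nested_value` and `self_reference_value` are invented for this illustration.

```c
// Nested object-like macros: D -> C+B -> B+A -> 1+1.
#define A 1
#define B A
#define C B
#define D C+B

static int counter = 10;

// Self-referential macro. While `counter` is being expanded, its own name is
// in the hideset of the tokens it produces, so the inner `counter` stays a
// plain identifier referring to the variable above.
#define counter (counter + 1)

int nested_value(void) { return D; }                // expands to 1+1
int self_reference_value(void) { return counter; }  // expands to (counter + 1)

// Keep the macros local to this sketch.
#undef counter
#undef D
#undef C
#undef B
#undef A
```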
----------------------------------

### static Token *preprocess2(Token *tok)


```c
// Visit all tokens in `tok` while evaluating preprocessing
// macros and directives.
static Token *preprocess2(Token *tok) {
  Token head = {};
  Token *cur = &head;

  while (tok->kind != TK_EOF) {
    // If it is a macro, expand it.
    if (expand_macro(&tok, tok))
      continue;

    // Pass through if it is not a "#".
    if (!is_hash(tok)) {
      tok->line_delta = tok->file->line_delta;
      tok->filename = tok->file->display_name;
      cur = cur->next = tok;
      tok = tok->next;
      continue;
    }

    Token *start = tok;
    tok = tok->next;

```
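1. `expand_macro` tries to expand the current token. If it succeeds it returns true and the loop hits `continue`, so the first token of the result is tried again. Only when nothing more can be expanded does processing move on.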

-----------------------------------------

### static bool expand_macro(Token **rest, Token *tok)

```c
// If tok is a macro, expand it and return true.
// Otherwise, do nothing and return false.
static bool expand_macro(Token **rest, Token *tok) {
  if (hideset_contains(tok->hideset, tok->loc, tok->len))
    return false;

  /** xitongsys
   * 
   * find_macro looks the current token up in the global hashmap
   * to see whether it names a macro
   * 
  **/
  Macro *m = find_macro(tok);
  if (!m)
    return false;

  // Built-in dynamic macro application such as __LINE__
  if (m->handler) {
    *rest = m->handler(tok);
    (*rest)->next = tok->next;
    return true;
  }


  /** xitongsys
   * 
   * For an object-like macro, the new hideset is the union of the token's
   * previous hideset and the macro's name.
   * An expansion can produce several tokens, so body is a linked list.
   * 
  **/
  // Object-like macro application
  if (m->is_objlike) {
    Hideset *hs = hideset_union(tok->hideset, new_hideset(m->name));
    Token *body = add_hideset(m->body, hs);
    for (Token *t = body; t->kind != TK_EOF; t = t->next)
      t->origin = tok;
    *rest = append(body, tok->next);
    (*rest)->at_bol = tok->at_bol;
    (*rest)->has_space = tok->has_space;
    return true;
  }

  // If a funclike macro token is not followed by an argument list,
  // treat it as a normal identifier.
  if (!equal(tok->next, "("))
    return false;

  // Function-like macro application
  Token *macro_token = tok;
  MacroArg *args = read_macro_args(&tok, tok, m->params, m->va_args_name);
  Token *rparen = tok;

  // Tokens that consist a func-like macro invocation may have different
  // hidesets, and if that's the case, it's not clear what the hideset
  // for the new tokens should be. We take the intersection of the
  // macro token and the closing parenthesis and use it as a new hideset
  // as explained in Dave Prosser's algorithm.
  Hideset *hs = hideset_intersection(macro_token->hideset, rparen->hideset);
  hs = hideset_union(hs, new_hideset(m->name));

  Token *body = subst(m->body, args);
  body = add_hideset(body, hs);
  for (Token *t = body; t->kind != TK_EOF; t = t->next)
    t->origin = macro_token;
  *rest = append(body, tok->next);
  (*rest)->at_bol = macro_token->at_bol;
  (*rest)->has_space = macro_token->has_space;
  return true;
}
```
1. See the comments marked `xitongsys` in the code above.
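
The hideset helpers called in `expand_macro` are not shown in these notes. The following is a minimal sketch of one way they could work, assuming a hideset is a singly linked list of names. The function names and argument shapes follow the calls above, but the bodies are this note's assumption, not a copy of chibicc's code. Allocation failure and freeing are ignored for brevity.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct Hideset Hideset;
struct Hideset {
  Hideset *next;
  const char *name;
};

// One-element hideset. (Sketch: calloc failure is not handled.)
Hideset *new_hideset(const char *name) {
  Hideset *hs = calloc(1, sizeof(Hideset));
  hs->name = name;
  return hs;
}

// True if the name s[0..len) is a member of hs.
bool hideset_contains(Hideset *hs, const char *s, size_t len) {
  for (; hs; hs = hs->next)
    if (strlen(hs->name) == len && !strncmp(hs->name, s, len))
      return true;
  return false;
}

// Copies hs1 and links the copy in front of hs2; neither input is modified.
Hideset *hideset_union(Hideset *hs1, Hideset *hs2) {
  Hideset head = {0};
  Hideset *cur = &head;
  for (; hs1; hs1 = hs1->next)
    cur = cur->next = new_hideset(hs1->name);
  cur->next = hs2;
  return head.next;
}

// New list holding only the names present in both inputs.
Hideset *hideset_intersection(Hideset *hs1, Hideset *hs2) {
  Hideset head = {0};
  Hideset *cur = &head;
  for (; hs1; hs1 = hs1->next)
    if (hideset_contains(hs2, hs1->name, strlen(hs1->name)))
      cur = cur->next = new_hideset(hs1->name);
  return head.next;
}
```

With this representation a union may contain duplicate names, which is harmless because membership is the only query that matters.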

---------------------------------------

```c
static Token *add_hideset(Token *tok, Hideset *hs) {
  Token head = {};
  Token *cur = &head;

  for (; tok; tok = tok->next) {
    Token *t = copy_token(tok);
    t->hideset = hideset_union(t->hideset, hs);
    cur = cur->next = t;
  }
  return head.next;
}
```

`hs` is added to every token in the list because a macro can expand to several tokens, so the result of an expansion is a token list. See the `Macro` struct for details: `body` is the token list that the macro expands to.

```c
typedef struct Macro Macro;
struct Macro {
  char *name;
  bool is_objlike; // Object-like or function-like
  MacroParam *params;
  char *va_args_name;
  Token *body;
  macro_handler_fn *handler;
};
```

----------------------------

```c
// Append tok2 to the end of tok1.
static Token *append(Token *tok1, Token *tok2) {
  if (tok1->kind == TK_EOF)
    return tok2;

  Token head = {};
  Token *cur = &head;

  for (; tok1->kind != TK_EOF; tok1 = tok1->next)
    cur = cur->next = copy_token(tok1);
  cur->next = tok2;
  return head.next;
}
```

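1. Many operations in the code that take two token lists leave the first argument untouched: they copy it, attach the second argument to the end of the copy, and return (fresh copy of list 1 + original list 2).

2. `cur = cur->next = copy_token(tok1);` is a handy chained assignment. Assignment associates right to left, so the new node is first linked after `cur`, and then `cur` advances to it.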
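
The two points above can be tried out on a toy list. This is a sketch with a made-up `Node` type standing in for `Token`, where the `is_end` flag plays the role of `TK_EOF`; `new_node` and `copy_node` are hypothetical helpers, and allocation failure is not handled.

```c
#include <stdlib.h>

typedef struct Node Node;
struct Node {
  Node *next;
  int value;
  int is_end;  // nonzero marks the terminator node
};

Node *new_node(int value, int is_end, Node *next) {
  Node *n = calloc(1, sizeof(Node));
  n->value = value;
  n->is_end = is_end;
  n->next = next;
  return n;
}

static Node *copy_node(const Node *src) {
  return new_node(src->value, src->is_end, NULL);
}

// Returns (copy of list1 without its terminator) followed by list2 itself.
Node *append(Node *list1, Node *list2) {
  if (list1->is_end)
    return list2;

  Node head = {0};     // dummy head: no special case for the first node
  Node *cur = &head;

  for (; !list1->is_end; list1 = list1->next)
    cur = cur->next = copy_node(list1);  // link the copy, then step onto it
  cur->next = list2;
  return head.next;
}
```

Because `list1` is only read, the same macro body can be appended in front of many different continuations without one expansion corrupting another.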

------------------------
------------------------

## parse.c

```c
// This file contains a recursive descent parser for C.
//
// Most functions in this file are named after the symbols they are
// supposed to read from an input token list. For example, stmt() is
// responsible for reading a statement from a token list. The function
// then construct an AST node representing a statement.
//
// Each function conceptually returns two values, an AST node and
// remaining part of the input tokens. Since C doesn't support
// multiple return values, the remaining tokens are returned to the
// caller via a pointer argument.
//
// Input tokens are represented by a linked list. Unlike many recursive
// descent parsers, we don't have the notion of the "input token stream".
// Most parsing functions don't change the global state of the parser.
// So it is very easy to lookahead arbitrary number of tokens in this
// parser.
```

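The overall logic is fairly simple: recursive descent over all the tokens to build the AST (the hard part is writing down the grammar of every construct precisely). There are some tricks inside, of course.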

---------------------------------------

### Reviewing some compiler-theory concepts

![](resources/04.png)
![](resources/05.png)
![](resources/06.png)

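1. LL parsing reads the input from left to right and always replaces the leftmost nonterminal first. In the example above, on seeing the first `(`, rule 2 is applied. At the second parenthesis, rule 2 is applied again. After that, every step applies rule 3.

2. The sequence of replacements shows that the tree is built top-down.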

![](resources/07.png)

### Left recursion

![](resources/08.png)
![](resources/09.png)
![](resources/10.png)


### [Paull's Algorithm](resources/removing_left_recursion_from_context_free_grammars.pdf)

![](resources/11.png)
![](resources/12.png)
![](resources/13.png)

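1. For direct left recursion, $A \rightarrow A\alpha_1 \mid \beta_1$: any derivation that uses the left-recursive production must eventually finish with a non-left-recursive one, here $\beta_1$, while $\alpha_1$ may be repeated any number of times. Rewriting the rule as $A \rightarrow \beta_1A'$, $A' \rightarrow \alpha_1A' \mid \epsilon$ therefore generates the same language.

2. For indirect left recursion, put all nonterminals in a fixed order and only allow a production to begin with a nonterminal that comes later in the order, never an earlier one. For example, a production $A_i \rightarrow A_j\alpha$ with $i > j$ has to be eliminated. Since $A_j$ has already been processed, each of its productions begins either with a terminal or with some $A_k$, $k > j$. So it is enough to replace the leading $A_j$ with each of its possible expansions, and then remove any direct left recursion on $A_i$ as in item 1.

3. Removing left recursion often changes associativity; the wiki excerpt above also mentions a few remedies. For a hand-written parser, the simplest fix is to special-case the construction of the syntax tree and restore the intended order, as in this example from the Dragon Book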

![](resources/14.png)
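
As a small illustration of item 3, consider the left-recursive rule `expr = expr "-" num | num`. A hand-written parser can turn the recursion into a loop and fold each new operand into the left side, which keeps subtraction left-associative. The sketch below evaluates directly instead of building AST nodes, works on a plain string rather than a token list, and borrows the `rest` out-parameter convention described at the top of `parse.c`. The function names are chosen for this note only.

```c
#include <ctype.h>

// num = digit+
static int num(const char **rest, const char *p) {
  int value = 0;
  while (isdigit((unsigned char)*p))
    value = value * 10 + (*p++ - '0');
  *rest = p;  // hand the unconsumed input back to the caller
  return value;
}

// expr = num ("-" num)*
// (rewritten from the left-recursive form: expr = expr "-" num | num)
int expr(const char **rest, const char *p) {
  int lhs = num(&p, p);
  while (*p == '-') {
    int rhs = num(&p, p + 1);
    lhs = lhs - rhs;  // the result so far becomes the new left operand
  }
  *rest = p;
  return lhs;
}
```

On `"10-3-2"` the loop computes `(10-3)-2`. A naive right-recursive rewrite, `expr = num "-" expr | num`, would group it as `10-(3-2)` instead. An AST-building parser would replace the `lhs = lhs - rhs` step with something like creating a binary node whose left child is the previous `lhs`.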

------------------

### Code notes


```c
// program = (typedef | function-definition | global-variable)*
Obj *parse(Token *tok) {
  declare_builtin_functions();
  globals = NULL;

  while (tok->kind != TK_EOF) {
    VarAttr attr = {};
    Type *basety = declspec(&tok, tok, &attr);

    // Typedef
    if (attr.is_typedef) {
      tok = parse_typedef(tok, basety);
      continue;
    }

    // Function
    if (is_function(tok)) {
      tok = function(tok, basety, &attr);
      continue;
    }

    // Global variable
    tok = global_variable(tok, basety, &attr);
  }

  for (Obj *var = globals; var; var = var->next)
    if (var->is_root)
      mark_live(var);

  // Remove redundant tentative definitions.
  scan_globals();
  return globals;
}
```

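1. This is the top-level `parse`. Preprocessor directives such as `#include` and `#define` have already been processed and turned into ordinary tokens during preprocessing, so only three cases remain here: `typedef, function-definition, global-variable`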

-------------------

---------------------
---------------------

## some functions

```c
/* Open a stream that writes into a malloc'd buffer that is expanded as
   necessary.  *BUFLOC and *SIZELOC are updated with the buffer's location
   and the number of characters written on fflush or fclose.  */
extern FILE *open_memstream (char **__bufloc, size_t *__sizeloc) __THROW
  __attribute_malloc__ __attr_dealloc_fclose __wur;
```

------------------------

![](resources/03.png)

---------------------------