report.tex

\documentclass[a4paper]{article}
\usepackage[titletoc]{appendix}
\usepackage[a4paper,includeheadfoot,left=2.5cm,top=1.5cm,right=2.5cm,bottom=1.75cm]{geometry}
\usepackage{CJKutf8}
\usepackage{fancyhdr}

\usepackage{amsmath,amssymb,amsthm}
\usepackage{color,graphicx,subfigure,enumerate,epsfig,epstopdf}
\usepackage{pdfpages}

\title{Project Report for Translator from C to Java}
\author{Li Wei (5092029004)}
\date{\today}

\begin{document}

\maketitle

\section{Introduction}

This project, which is written in C programming language, consist of 10 files described as follows.\\
\begin{enumerate}
  \item \texttt{c2java.l} is the lexical specification for \texttt{flex}.
  \item \texttt{c2java.y} is the syntactic specification for \texttt{bison}.
  \item \texttt{c2java.h} declares the data structure of \texttt{ast\_node, type\_node, scope\_node, sym\_node} and other function prototypes which are used through out the building of the translator.
  \item \texttt{main.c} contains the \texttt{main()} function which drives the whole parsing, constructing and translating process.
  \item \texttt{ast.c} implements functions required for building the abstract syntax tree, type table, symbol table and scopes.
  \item \texttt{print\_ast.c} implements functions required for printing the AST in HTML format.
  \item \texttt{check\_ast.c} implements functions required for checking the AST types according to scopes, type table, etc. Also, some synthesized attributes of the ast\_node would be recorded and propagated in the type checking process.
  \item \texttt{translate\_ast.c} implements functions required for translating the abstract syntax tree to the java code leveraging the attributes of the ast\_node recoded in previous process.
  \item \texttt{report.pdf} illustrates the design and implementations of the translator.
\end{enumerate}

\begin{figure}
  \centering
  \includegraphics[width=0.9\textwidth]{overview}
  \caption{translator structure}
  \label{fig:overview}
\end{figure}


In this project, the stucture of the translator is as shown in Figure \ref{fig:overview}. Firstly, the lexical analyzer takes the source program as input and output the tokens. Then the syntax analyzer will parse the tokens to construct the abstract syntax tree. Type checker will check the type rules and semantic errors and output the errors. Finally the translator will translate the abstract syntax tree to Java code. Through this path, the ast\_node propagate its synthesized attributes: \texttt{int type\_id, char fg};\\


Generally speaking, the translator takes a subset of C language as input:\\
\begin{itemize}
\item Three types are contained:
\begin{enumerate}
	\item int;
	\item array(recursive, muti-dimensional, members can be int, array, struct);
	\item struct(field type can only be int);
\end{enumerate}
\item Constraints on the function definition:
\begin{enumerate}
	\item return type can only be int;
	\item parameters type can only be int;
\end{enumerate}
\item All the arithmetic operations are included.
\item The flow "if...then...else", for loop(including break and continue), return statement are included.

\item The two iosream functions of c are rewritten as read(int) and write(int), which can be translated in to java library function Scanner.nextInt() and System.out.println().

\item No other libraries or library function call is contained.

\item Several testcases are in the directory \textbf{testcases}, which will be translated, run in java, and compared with the validated output file in the \texttt{runtest.sh}.
\end{itemize}

The \texttt{Makefile} is available to assist operating:
\begin{enumerate}
	\item \textbf{make}: is to compile the translator c2java.
	\item \textbf{make test}: is to call the \texttt{runtest.sh} to translate and run the test cases. 
\end{enumerate}

\section{Lexical Analysis}

4 types of tokens are matched in the following order:
\begin{description}
  \item[Keywords:] \texttt{break}, \texttt{continue}, \texttt{if}, \texttt{else}, \texttt{for}, \texttt{int}, \texttt{return}, and \texttt{struct}, \texttt{read}.
  \item[Identifiers:] \texttt{sym()} function will look up the symbol table according to \texttt{yytext}.
  \item[Numbers:] hexadecimal, octal, and decimal integers.
  \item[Operators:] single-character operators correspond to the ASCII codes of the characters, while multiple-character operators have their own token numbers.
\end{description}
Whitespace and illegal characters are ignored. See \texttt{c2java.l} for details.


\section{Syntactic Analysis}

Non-terminals correspond to pointers of AST nodes with type \texttt{struct ast\_node}, which is defined in \texttt{c2java.h}.
For terminals, only \texttt{CONSTANT} and \texttt{IDENTIFIER} have values which are integers.
\texttt{CONSTANT} values are parsed integers. \texttt{IDENTIFIER} values are numeric symbols.\\

The grammar constructed according to the C grammar and is a subsect of the standard grammar of C. Also there are some differences from the standard C grammar, e.g.\\
(i) avoiding sloppy $\epsilon$'s to reduce unnecessary conflicts;\\
(ii) using left recursions instead of right recursions to reduce backtracks and memory cost.\\
Also, operator precedence is implemented in a rather formal way other than specifying priorities to related tokens.\\


The abstract syntax tree will be constructed during the syntactic parsing by adding semantic actions. In the yacc program, the corresponding construct functions such as \texttt{funcdef\_new()} or \texttt{stdef\_new()} will be called at certain productions to construct the ast\_node of funcdef type or of stdef type.\\


The ast\_node defined in \texttt{c2java.h} has the different types, for instance \texttt{AST\_LIST, AST\_FUNCDEF, AST\_FUNCHEAD, AST\_PARA}. Each type has its own attributes implemented by a union shown as below. \\
\begin{verbatim}
typedef struct ast_node {
    enum {
        AST_LIST,
        /*AST_VARDEF,*/ AST_FUNCDEF, AST_FUNCHEAD, AST_PARA,
        AST_STDEF, AST_VAR, AST_SUBVAR,
        AST_COMPOUND_STMT, AST_EXPR_STMT, AST_IF_STMT, AST_FOR_STMT,
        AST_RETURN_STMT, AST_CONTINUE_STMT, AST_BREAK_STMT,
        AST_DEF, AST_DEC,
        AST_BINOP, AST_PREFIX, AST_POSTFIX,
        AST_INDEXING, AST_FUNC_CALL, AST_MEMBER,
        AST_ID, AST_CONST, /*AST_INIT_ARG,*/
        AST_TYPE_LIMIT
    } type;
    
    union {
        struct {
            int type;
            struct ast_node *head;
            struct ast_node *tail;
        } list;

        struct {
            int ret_type;
            struct ast_node *funchead;
            struct ast_node *funcbody;
        } funcdef;

        struct {
            int sym_name;
            struct ast_node *paras;
        } funchead;

        struct {
            int para_type;
            struct ast_node *var;
        } para;

        struct {
            int sym_name;
            struct ast_node *defs;
        } stdef;

        struct {
            int sym_name;
        } var;

        struct {
            int index;
            struct ast_node *var;
        } subvar;

        struct {
            struct ast_node *defs;
            struct ast_node *stmts;
        } compound_stmt;

        struct {
            struct ast_node *expr;
        } expr_stmt;

        struct {
            struct ast_node *cond;
            struct ast_node *then;
            struct ast_node *els;
        } if_stmt;

        struct {
            struct ast_node *init;
            struct ast_node *cond;
            struct ast_node *incr;
            struct ast_node *body;
        } for_stmt;

        struct {
            struct ast_node *retval;
        } return_stmt;

        struct {
            int def_type;
            struct ast_node *decs;
        } def;

        struct {
            struct ast_node *var;
            struct ast_node *init;
        } dec;

        struct {
            int op;
            struct ast_node *lhs;
            struct ast_node *rhs;
        } binop;

        struct {
            int op;
            struct ast_node *unary;
        } prefix;

        struct {
            int op;
            struct ast_node *unary;
        } postfix;

        struct {
            struct ast_node *postfix;
            struct ast_node *expr;
        } indexing;

        struct {
            struct ast_node *postfix;
            struct ast_node *args;
        } func_call;

        struct {
            int sym_name;
            struct ast_node *postfix;
        } member;

        struct {
            int sym_name;
        } id;

        struct {
            int number;
        } constant;

        /*
        struct {
            int sym_name;
            struct ast_node *expr;
        } init_arg;
        */
    };
    int type_id;
    int lvalue; // 1 - left value   0 - right value
    int offset;
    int framesize;
    char fg;
} ast_node;

\end{verbatim}


More details can be seen in \texttt{c2java.h}.\\


The construction functions are implemented with polymorphism, according to the type of the ast\_node, so as to in accord with the attributes of different types of ast\_node.

\section{Print Abstract Syntax Tree}


This part is for checking and testing the constructed abstract syntax tree and visualize the ast. It isn't included in the final translator as we only need to translate and generate the Java code.\\

The ast will be printed by \texttt{print\_ast()}. For the sake of printing the parse tree in a hierarchical style, the list representation in HTML with tags $<ul>$ and $<li>$ is leveraged. Below is one instance for printing funcdef ast\_node.

\begin{verbatim}
static void print_funcdef(ast_node *n)
{
    printf("<ul>\n");
    printf("<li>funcdef</li>\n");
    printf("<li>ret_type: %d</li>\n", n->funcdef.ret_type);
    printf("<li>funchead:\n");
    print_ast(n->funcdef.funchead);
    printf("</li>\n");
    printf("<li>funcbody:\n");
    print_ast(n->funcdef.funcbody);
    printf("</li>\n");
    printf("</ul>\n");
}
\end{verbatim}

\begin{figure}
  \centering
  \includegraphics[width=0.5\textwidth]{printast}
  \caption{Print abstract syntax tree}
  \label{fig:printast}
\end{figure}


The parser reads from the input and writes the AST in HTML format to the output. 
It will create an HTML file to show the parse tree in a hierarchical way. One example is shown in Figure \ref{fig:printast}.


\section{Type Checking}

\begin{figure}
  \centering
  \includegraphics[width=\textwidth]{scope}
  \caption{Scope structure}
  \label{fig:scope}
\end{figure}

For type checking, type table as well as the scopes are constructed. The scope is seperated into two scopes: vscope for variables and function names; tscope for new struct type names. The sym\_id and type\_id is stored in scope\_node. The type table contains the type infomation and the symbol talbe contains the hash from the string symbol name to int symbol id. The structure of the scope and the two tables are shown in Figure \ref{fig:scope}.\\

There are four types in the c2java translator: INT, STRUCTURE, ARRAY, FUNC. The information for each type is stored in type table, which is a list of type\_node. The data structure of type\_node is as below:

\begin{verbatim}
enum TYPE {STRUCTURE, ARRAY, FUNC};

struct field_list
{
	int sym_id;
	int index;
	struct field_list *next;
};

struct type_node
{
  int type_id;
  enum TYPE ty;
  union
  {
    struct
    {
      int sym_name;
      struct field_list *fields;
    } as_struct;
    
    struct 
    {
      int elem_type_id;
      int size;
    } as_array;
    
    struct
    {
      int ret_type;
      int para_num; 
      int para_type[1000];
    } as_func;
    
  };
  int mem_space;
  struct type_node *next;
};

\end{verbatim}
The type\_id for INT is always set to 1, so there's no need to store the INT type in the type table. The implementation here is the same with that in ast\_node, different attributes for different types in a union.\\


Here the array type contains three types: \textbf{INT, STRUCTURE and ARRAY}, which means it can have nesting arrays. So when constructing array in \texttt{check\_ast.c}, each nested array are given a type\_id by calling \texttt{array\_type()} to construct the array with setting up the attributes in type\_node of array type.\\

The struct type only contains INT type.The function's return type and parameters' type can only be INT.
 

In my translator, I implement the following type checks:
\begin{enumerate}
    \item Reserved words can not be used as \textit{identifiers}.
    \item Program must contain a function \textit{int main()} to be the entrance
    \item Check variables should not be re-declared by checking the curent scope
    \item Check variables have been declared before usage by checking all the \textbf{scopes}
    \item Check the initialized element have the same type with the left variable
    \item Check the size of the \textbf{array initialization} should equal or less than the size of the left array.
    \item Check the return type of function declaration should be INT
    \item Check function name should not be re-declared
    \item Check the parameter's type should be INT
    \item Check \textbf{parametes} should not have the same name
    \item Check if the structure type has been declared
    \item The condition of \textit{if} statement should be an expression with \textit{int} type
    \item Check wheather \textit{break} and \textit{continue} are in for loop
    \item The condition of \textit{for} should be an expression with \textit{int} type or epsilon
    \item Check the type of return statement in function should be INT
    \item Check only expression with type int can be involved in arithmetic
    \item Check right-value can not be assigned by any value or expression
    \item Use [] operator to a non-array variable is not allowed
    \item Check the \textit{identifier} of a function call is declared and is of type FUNC
    \item Check the number and type of variable(s) passed should match the definition of the function
    \item Check . operator can only be used to a struct variable
    \item Check the member following . operator is declared in the struct type
    \item Check the \textit{identifier} is declared
\end{enumerate}

Through type checking, the \texttt{check\_ast.c} also implements the following funtionalities:
\begin{itemize}
    \item Construct array type: array type can be constructed by STRUCT, INT and ARRAY type.
    \item Assign type id which is the synthesized attribute of the ast node
    \item Record and propagate the global variable and local variable in \texttt{char gf} in ast\_node
\end{itemize}


\section{Translation}

\subsection{Overview}
The implementation of the code tranlation part is in accord with the checking and printing part. Actually the whole project is implemented in a polymorphic style. An array of function pointer will point to a specific function according to the type of the ast\_node. So when the function is called, it don't need to specify the type with the \texttt{trans\_ast(ast\_node* n)} function. Below is the example of translate\_ast: 

\begin{verbatim}

static ast_translator g_ast_translators[AST_TYPE_LIMIT];

static void init_ast_translators()
{
    if (g_ast_translators[0]) return;

    g_ast_translators[AST_LIST] = trans_list;
    g_ast_translators[AST_FUNCDEF] = trans_funcdef;
    g_ast_translators[AST_FUNCHEAD] = trans_funchead;
    g_ast_translators[AST_PARA] = trans_para;
    g_ast_translators[AST_STDEF] = trans_stdef;
    g_ast_translators[AST_VAR] = trans_var;
    g_ast_translators[AST_SUBVAR] = trans_subvar;
    g_ast_translators[AST_COMPOUND_STMT] = trans_compound_stmt;
    g_ast_translators[AST_EXPR_STMT] = trans_expr_stmt;
    g_ast_translators[AST_IF_STMT] = trans_if_stmt;
    g_ast_translators[AST_FOR_STMT] = trans_for_stmt;
    g_ast_translators[AST_RETURN_STMT] = trans_return_stmt;
    g_ast_translators[AST_CONTINUE_STMT] = trans_continue_stmt;
    g_ast_translators[AST_BREAK_STMT] = trans_break_stmt;
    g_ast_translators[AST_DEF] = trans_def;
    g_ast_translators[AST_DEC] = trans_dec;
    g_ast_translators[AST_BINOP] = trans_binop;
    g_ast_translators[AST_PREFIX] = trans_prefix;
    g_ast_translators[AST_POSTFIX] = trans_postfix;
    g_ast_translators[AST_INDEXING] = trans_indexing;
    g_ast_translators[AST_FUNC_CALL] = trans_func_call;
    g_ast_translators[AST_MEMBER] = trans_member;
    g_ast_translators[AST_ID] = trans_id;
    g_ast_translators[AST_CONST] = trans_const;
}

void trans_ast(ast_node *n)
{
    init_ast_translators();
    if (n)
    {
        #ifdef DEBUG
            fprintf(stderr, "[DEBUG] trans_ast n->type=%s\n", g_ast_names[n->type]);
        #endif

        g_ast_translators[n->type](n);
    }
}

\end{verbatim}

In general, firstly the class is constructed with the same name of the input c file. And the Scanner will be initialized with the \texttt(import java.util.*;) in the top of the java file. Then Each ast\_node is translated in to Java code correspondingly with the proper braces and tabs. The number of tabs is recorded to increas of decreas so that each time the function \texttt{emit\_tab} is called, it will output the right nunmber of the tabs. The function \textbf{main} in c will be translated into \textbf{main\_} since the the main function in c will return an interger but the main function in java must have the return type of void. So finally the main function in java will call \textbf{main\_}, which is the real main function in c, with \texttt{"System.exit(main\_(args.length, args));"}.

To handle the translating of \texttt{read} and \texttt{write} functions:
\begin{itemize}
    \item {\tt read(x)} is transformed to {\tt x = read()} in the parser in order to be processed consistently along with other function calls.
    \item Both {\tt read} and {\tt write} are considered as standard library functions, and are translated into the java function System.out.printlin and leverage the scanner of System.in.
\end{itemize}


\subsection{Problems and Solutions}
There are several differences between C and Java. So during the translations some problems occurs.
\begin{itemize}
	\item Since the condition expression in Java must be boolean type, which is different from C (the value not equal to zero is true, zero is false), so the experssion in a conditon cannot be translated directly. So a specific function \texttt{trans\_bool} will be called recursively according to the type of each ast\_node. It will be checked whether the expression is of type boolean or int. If boolean, the expression can be translated in to the java condition code, otherwise the end of the expression will be attached a \texttt{"!= 0"}. To judge whether the type of the expression is boolean or not, the operators in the expression can be leverated. If the expression node contains the operator \texttt{GE\_OP, LE\_OP, EQ\_OP, NE\_OP, '>', '<';}, then the expression is of boolean type. If the expression node contains the operator \texttt{AND\_OP, OR\_OP}, then the left handside and the right handside must be boolean and the expression itself is of boolean type. The example of boolean translation shown in Fig. \ref{fig:bool} is extracted from the \texttt{testcases/test}. More details can be seen there.
	
\begin{figure}
  \centering
  \includegraphics[width=0.9\textwidth]{bool}
  \caption{Boolean translation}
  \label{fig:bool}
\end{figure}

	\item The global variable will be declared with static and the local variables will be declared without static. The ast\_node has the attribute \texttt{char fg;} to record whether the ast\_node is global or not. So the static will be declared according to this character. Additionally, if the \texttt{"new object"} is called in the global space, where the variables are declared static, it will be translated into \texttt{"\{ static xx = new xxx;\}"}.
	\item Struct array initialization in Java is different from C since Java need to initialize each member of the object array dynamically with the construction function \textbf{class\_name()}. So the for loop will be generated according to the array dimension. For the array of integer with multi-dimension, the inilization can be static with the initlized arguments. The example can be seen in Fig. \ref{fig:sarray}. The snapshot is from the \texttt{testcases/test}.
	
\end{itemize}
	
\begin{figure}
  \centering
  \includegraphics[width=\textwidth]{sarray}
  \caption{Example: Struct and Int array initialization}
  \label{fig:sarray}
\end{figure}
	

\end{document}