Download links to all models of my doctoral dissertation

Chapter 2 PCC model implementation

PCC implementation (with dataset, data preprocessing code, video and tutorial)

Chapter 3 HLM and SLM model implementation

HLM-SLM implementation (GitHub repository; the project also contains the REP model and includes a tutorial)

Chapter 4 REP model implementation

REP implementation (GitHub repository; the project also contains the HLM-SLM model and includes a tutorial)

Detailed experiments for REP and Pointer-Mixture in Chapter 4 of doctoral dissertation

In Chapter 4, the REP model (proposed by me) is compared with Pointer-Mixture (proposed by JianLi). We use a setting similar to that of Pointer-Mixture, meaning that we ignore grammar tokens when learning token repetition.

only consider identifier tokens
- In this experiment, both the REP model and Pointer-Mixture consider only identifiers. The REP model further splits identifier tokens into different types and uses a separate REP model to learn the repetition of tokens of each type. The "distinct tokens" in the paper refer to tokens that need to be handled specifically.
for non-identifier tokens
- For grammar tokens and other non-variable tokens, we use the Hierarchical Language Model described in Chapter 3 to further improve prediction accuracy. This optimization is described at length in Chapter 5 (the arrangement of the paper's content may cause some confusion here). The final accuracy is the weighted average over the two kinds of tokens (variable tokens and non-variable tokens).
Please check the paper and the corrigendum paper for further details and experimental comparison results.
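
As a rough illustration of the weighted average mentioned above, the sketch below combines two per-category accuracies using token counts as weights. Weighting by token count is an assumption, as are the function name `combined_accuracy` and all numbers; none of them are taken from the dissertation.

```python
def combined_accuracy(acc_var, n_var, acc_non_var, n_non_var):
    """Token-count-weighted average of two per-category accuracies.

    Assumption: each category is weighted by its number of tokens.
    """
    total = n_var + n_non_var
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return (acc_var * n_var + acc_non_var * n_non_var) / total


# Hypothetical numbers, chosen only to show the arithmetic.
overall = combined_accuracy(acc_var=0.5, n_var=200, acc_non_var=0.75, n_non_var=600)
print(overall)  # (0.5 * 200 + 0.75 * 600) / 800 = 0.6875
```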

Explanation of "we learn the repetition patterns of all kinds of tokens." in the dissertation:

First of all, for different kinds of tokens, we apply independent REP models to learn token repetition. For example, we use one REP model for grammar tokens and another, separate REP model for variables.

grammar tokens
- For grammar tokens, we also tried applying REP to learn token repetition, but we found that the repetition of grammar tokens shows no regularity. Because a large number of grammar tokens are not repeated and only a small number are, the model ends up treating all grammar tokens in the test set as not repeated. This is equivalent to directly using a traditional language model to predict the token. Here we use the Hierarchical Language Model as the traditional language model.
literals
- For string, char, and number literals, most such tokens in the test set are marked as UNK. If predicting UNK correctly is counted as a success, then applying REP improves model performance. Otherwise, the situation degenerates into one similar to that of grammar tokens, because few string, char, or number literals are repeated in the training set or the test set.
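
The degenerate case described above can be illustrated with a toy calculation. When only a small fraction of the tokens of some type are repeated, always answering "not repeated" already scores well on the repeated/not-repeated decision, so a learned model can collapse to that constant answer. The labels below are made up, and `majority_baseline` is a hypothetical helper, not part of the REP code.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    label, freq = counts.most_common(1)[0]
    return label, freq / len(labels)


# Hypothetical repeated / not-repeated labels for 100 grammar tokens:
# True means the token repeats an earlier token, False means it does not.
grammar_labels = [True] * 5 + [False] * 95

label, acc = majority_baseline(grammar_labels)
print(label, acc)  # the constant answer "not repeated" is right 95% of the time
```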

Data preprocessor implementation

Data preprocessing module (translates raw Java files into the tensor format that REP or HLM-SLM can handle directly; GitHub repository, includes a tutorial)
Data set (note that the train/test/validation split of the data set has been redone, so results may differ from those presented in the dissertation or the already published paper; GitHub repository, includes a tutorial)
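
To make the "tensor format" above a little more concrete, here is a minimal sketch of one common way to turn token sequences into integer IDs, with rare tokens mapped to UNK. It is not the preprocessing module's actual code: the names `build_vocab` and `encode`, the `min_count` threshold, and the sample tokens are all assumptions for illustration.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder for out-of-vocabulary tokens


def build_vocab(token_sequences, min_count=2):
    """Assign an integer ID to every token seen at least min_count times."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    vocab = {UNK: 0}
    for tok, count in counts.items():
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map tokens to IDs; tokens missing from the vocabulary become UNK."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


# Made-up token sequences standing in for tokenized Java statements.
train = [["int", "x", "=", "1", ";"], ["int", "y", "=", "x", ";"]]
vocab = build_vocab(train)
ids = encode(["int", "z", "=", "x", ";"], vocab)
print(ids)  # [1, 0, 3, 2, 4]; the unseen token "z" maps to UNK (ID 0)
```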
