Project investigating human and artificial neural representations of code.
This branch is currently under development and should be considered unstable. To replicate specific papers, `git checkout` the corresponding branch, e.g., `NeurIPS2022`, and follow the instructions in its `README.md`.
This pipeline supports several major functions.
- MVPA (multivariate pattern analysis) evaluates decoding of code properties or code model representations from their respective brain representations within a collection of canonical brain regions.
- RSA (representational similarity analysis) is also supported as an alternative to MVPA.
- VWEA (voxel-wise encoding analysis) evaluates prediction of voxel-level activation patterns using code properties and code model representations as features.
- NLEA (network-level encoding analysis) uses the same features to evaluate encoding of mean network-level activation strength.
- PRDA (program representation decoding analysis) evaluates decoding of code properties from code model representations.
- PREA (program representation encoding analysis) evaluates encoding of code model representations using the set of code properties explored in this work.
Note: VWEA and NLEA also support ceiling estimates at the network level, calculated via an identical pipeline but with the features being the representations of other participants to the same stimuli rather than the properties extracted from those stimuli. To invoke a ceiling analysis, prefix the requested analysis type with a "C", e.g., CNLEA.
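To make the decoding setting concrete, MVPA at its core fits a cross-validated classifier that predicts a stimulus property from voxel patterns. The sketch below is purely illustrative (synthetic data and scikit-learn), not this repository's internal implementation:

```python
# Illustrative MVPA-style decoding sketch (NOT this repo's API):
# predict a code property (e.g., a 3-way structure label) from voxel patterns.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

n_programs, n_voxels = 72, 500
X = rng.normal(size=(n_programs, n_voxels))  # brain response per program
y = rng.integers(0, 3, size=n_programs)      # property labels, e.g., seq/for/if
X[y == 1] += 0.5                             # inject a weak class signal

# Cross-validated decoding accuracy; above-chance scores indicate the
# property is linearly decodable from the voxel patterns.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(round(scores.mean(), 3))
```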
Brain Networks
- `brain-md_lh` (Multiple Demand Network: Left Hemisphere)
- `brain-md_rh` (Multiple Demand Network: Right Hemisphere)
- `brain-lang_lh` (Language Network: Left Hemisphere)
- `brain-lang_rh` (Language Network: Right Hemisphere)
Code Properties
- `task-structure` (seq vs. for vs. if) *ControlFlow*
- `task-content` (math vs. str) *DataType*
- `task-nodes` (# of nodes in AST) *ASTNodes*
- `task-lines` (# of runtime steps during execution) *LinesExecuted*
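Static properties like the AST node count can be extracted directly from program text. A minimal sketch using Python's standard `ast` module, in the spirit of `task-nodes` (the repo's exact extraction may differ):

```python
# Count AST nodes for a small program (illustrative of the task-nodes
# property; not necessarily this repo's exact extraction).
import ast

def count_ast_nodes(source: str) -> int:
    """Number of nodes in the program's abstract syntax tree."""
    return sum(1 for _ in ast.walk(ast.parse(source)))

program = "total = 0\nfor x in range(3):\n    total = total + x\n"
print(count_ast_nodes(program))
```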
Code Models
Baseline:
- `code-tokens` (arbitrary projection encoding presence of individual tokens)

LLM Suite (CodeGen1):
- `code-llm_350m_nl`
- `code-llm_2b_nl`
- `code-llm_6b_nl`
- `code-llm_16b_nl`
- `code-llm_350m_mono`
- `code-llm_2b_mono`
- `code-llm_6b_mono`
- `code-llm_16b_mono`

Note: checkpoints vary in size and pre-training data (`nl`: ThePile; `mono`: ThePile+BigQuery+BigPython)
Requirements: Anaconda, GNU Make
```bash
git clone --branch main --depth 1 https://github.com/benlipkin/braincode
cd braincode
make setup
```
```
usage: __main__.py [-h] [-f FEATURE] [-t TARGET] [-m METRIC]
                   [-d CODE_MODEL_DIM] [-p BASE_PATH] [-s] [-b]
                   {mvpa,rsa,vwea,nlea,cvwea,cnlea,prda,prea}

run specified analysis type

positional arguments:
  {mvpa,rsa,vwea,nlea,cvwea,cnlea,prda,prea}

optional arguments:
  -h, --help            show this help message and exit
  -f FEATURE, --feature FEATURE
  -t TARGET, --target TARGET
  -m METRIC, --metric METRIC
  -d CODE_MODEL_DIM, --code_model_dim CODE_MODEL_DIM
  -p BASE_PATH, --base_path BASE_PATH
  -s, --score_only
  -b, --debug
```

Note: `BASE_PATH` must be specified to match `setup.sh` if changed from the default.
Sample calls

```bash
# basic examples
python -m braincode mvpa -f brain-md_lh -t task-structure     # brain -> {task, model}
python -m braincode rsa -f brain-lang_lh -t code-llm_2b_nl    # brain <-> {task, model}
python -m braincode vwea -f brain-md_rh -t code-tokens        # brain <- {task, model}
python -m braincode nlea -f brain-lang_rh -t task-content     # brain <- {task, model}
python -m braincode prda -f code-llm_350m_mono -t task-lines  # model -> task
python -m braincode prea -f code-tokens -t task-content       # model <- task

# more complex examples
python -m braincode cnlea -f all -m SpearmanRho --score_only  # check metrics module for all options
python -m braincode mvpa -f brain-lang_lh+brain-lang_rh -t code-tokens -d 64 -p $BASE_PATH
python -m braincode vwea -t task-content+task-structure+task-nodes+task-lines

# note how the `+` operator can be used to join multiple representations via concatenation
```
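For intuition about the RSA analyses (and metrics such as SpearmanRho), representational similarity analysis correlates the pairwise-dissimilarity structure of two representation spaces over the same stimuli. A minimal NumPy/SciPy sketch of that computation (illustrative only, not this repository's implementation):

```python
# Minimal RSA sketch (illustrative; not this repo's implementation):
# correlate the representational dissimilarity matrices (RDMs) of two
# representation spaces over the same stimuli.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_stimuli = 20
brain = rng.normal(size=(n_stimuli, 100))   # e.g., voxel responses per stimulus
model = brain @ rng.normal(size=(100, 64))  # correlated model features

# RDM = condensed vector of pairwise distances between stimulus responses.
rdm_brain = pdist(brain, metric="correlation")
rdm_model = pdist(model, metric="correlation")

# Spearman correlation between the RDMs quantifies shared geometry.
rho, _ = spearmanr(rdm_brain, rdm_model)
print(round(rho, 3))
```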
If you use this work, please cite XXX (under review)