LLM-Prop: Predicting Physical And Electronic Properties Of Crystalline Solids From Their Text Descriptions

This repository contains the implementation of the LLM-Prop model. LLM-Prop is an efficiently finetuned large language model (T5 encoder) on crystals text descriptions to predict their properties. Given a text sequence that describes the crystal structure, LLM-Prop encodes the underlying crystal representation from its text description and output its properties such as band gap and volume.

LLM-Prop architecture

For more details check our pre-print.

Installation

You can install LLM-Prop by following these steps:

git clone https://github.com/vertaix/LLM-Prop.git
cd LLM-Prop
conda create -n <environment_name> requirement.txt
conda activate <environment_name>

Usage

Training LLM-Prop from scratch

Add the following scripts to llmprop_train.sh

#!/usr/bin/env bash

TRAIN_PATH="data/samples/textedge_prop_mp22_train.csv"
VALID_PATH="data/samples/textedge_prop_mp22_valid.csv"
TEST_PATH="data/samples/textedge_prop_mp22_test.csv"
EPOCHS=5 # the default epochs is 200 to get the best performance
TASK_NAME="regression" # the task name can also be set to "classification"
PROPERTY="band_gap" # the property can also be set to "volume" or "is_gap_direct". Note that if the task name is set to classification, only "is_gap_direct" is allowed here. And if the task name is set to regression, only "band_gap" or "volume" is allowed here.

python llmprop_train.py \
--train_data_path $TRAIN_PATH \
--valid_data_path $VALID_PATH \
--test_data_path $TEST_PATH \
--epochs $EPOCHS \
--task_name $TASK_NAME \
--property $PROPERTY

Then run bash scripts/llmprop_train.sh

Evaluating the pretrained LLM-Prop

Add the following scripts to llmprop_evaluate.sh

#!/usr/bin/env bash

TRAIN_PATH="data/samples/textedge_prop_mp22_train.csv"
TEST_PATH="data/samples/textedge_prop_mp22_test.csv"
TASK_NAME="regression" # the task name can also be set to "classification"
PROPERTY="band_gap" # the property can also be set to "volume" or "is_gap_direct". Note that if the task name is set to classification, only "is_gap_direct" is allowed here. And if the task name is set to regression, only "band_gap" or "volume" is allowed here.
CKPT_PATH="checkpoints/samples/$TASK_NAME/best_checkpoint_for_$PROPERTY.tar.gz" # path to the best model if the property to be predicted

python llmprop_evaluate.py \
--train_data_path $TRAIN_PATH \
--test_data_path $TEST_PATH \
--task_name $TASK_NAME \
--property $PROPERTY \
--checkpoint $CKPT_PATH

Then run bash scripts/llmprop_evaluate.sh

Data availability

Note: The data samples and checkpoints in this repository are just for the purpose of testing the LLM-Prop implementation pipeline only, not intended to reproduce the results.

To use TextEdge (a dataset collected in this work) for your work or reproducibility purporses, first dowload it here (about 700 MB) and then replace the train, validation, and test samples in your local directory accordingly.

Citation

@article{rubungo2023llm,
  title={LLM-Prop: Predicting Physical And Electronic Properties Of Crystalline Solids From Their Text Descriptions},
  author={Rubungo, Andre Niyongabo and Arnold, Craig and Rand, Barry P and Dieng, Adji Bousso},
  journal={arXiv preprint arXiv:2310.14029},
  year={2023}
}

Name		Name	Last commit message	Last commit date
Latest commit History 175 Commits
checkpoints/samples		checkpoints/samples
data/samples		data/samples
figures		figures
scripts		scripts
stopwords/en		stopwords/en
tokenizers/t5_tokenizer_trained_on_modified_part_of_C4_and_textedge		tokenizers/t5_tokenizer_trained_on_modified_part_of_C4_and_textedge
LICENSE		LICENSE
README.md		README.md
llmprop_args_parser.py		llmprop_args_parser.py
llmprop_dataset.py		llmprop_dataset.py
llmprop_evaluate.py		llmprop_evaluate.py
llmprop_model.py		llmprop_model.py
llmprop_train.py		llmprop_train.py
llmprop_utils.py		llmprop_utils.py
requirements.txt		requirements.txt

License

vertaix/LLM-Prop

Folders and files

Latest commit

History

Repository files navigation

Installation

Usage

Training LLM-Prop from scratch

Evaluating the pretrained LLM-Prop

Data availability

Citation

About

Resources

License

Stars

Watchers

Forks

Languages