Large Language Model for Scientific Discovery (LLM4SD)

LLM4SD is an open-source initiative that aims to leverage large language models for scientific discovery. We have now released the complete code 😆.

Code Description

QuickStart:

🌟 First, the dependencies are listed in requirements.txt. Use this file to create the environment for running LLM4SD.

🌟 Second, put your OpenAI API key in the bash file before you run it. The OpenAI API is used to call GPT-4, which summarises the rules produced by knowledge inference and generates code automatically.

To run the tasks "bbbp", "bace", "clintox", "esol", "freesolv", "hiv" and "lipophilicity", run:

bash run_others.sh

To run the tasks "Tox21" and "Sider", run:

bash run_tox21.sh

bash run_sider.sh

To run the task "Qm9", run:

bash run_qm9.sh
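The mapping from dataset to launcher script above can be summarised in a small lookup. This is only an illustrative sketch, not part of the repository: the helper name script_for and the case-insensitive matching are assumptions made here.

```python
# Dataset -> bash script, as listed in the QuickStart instructions.
# The helper below is hypothetical and not shipped with LLM4SD.
SCRIPT_FOR_DATASET = {
    **{name: "run_others.sh" for name in (
        "bbbp", "bace", "clintox", "esol", "freesolv", "hiv", "lipophilicity")},
    "tox21": "run_tox21.sh",
    "sider": "run_sider.sh",
    "qm9": "run_qm9.sh",
}


def script_for(dataset: str) -> str:
    """Return the bash script that covers the given dataset name."""
    key = dataset.strip().lower()  # assumption: names are matched case-insensitively
    if key not in SCRIPT_FOR_DATASET:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return SCRIPT_FOR_DATASET[key]


if __name__ == "__main__":
    for name in ("bace", "Tox21", "Qm9"):
        print(f"{name}: bash {script_for(name)}")
```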

The Process of LLM4SD Code Pipeline:

Each bash file runs LLM4SD in the following steps:

👉 "Knowledge synthesis from the literature": this step calls python synthesize.py. The synthesized rules are stored in the prior_knowledge folder.

👉 "Knowledge inference from data": this step calls python inference.py. The inferred rules are stored in the data_knowledge folder.

👉 "Inferred knowledge summarization": this step calls python summarize_rules.py. The summarized rules are stored in the summarized_inference_rules folder. The purpose of this step is to drop duplicate rules.

👉 "Automatic code generation and evaluation": this step calls python code_gen_and_eval.py. It automatically generates the code using GPT-4 and runs the experiments to measure model performance. Note that, in practice, human experts would review the code before use. However, even with automatic code generation and direct evaluation, the code achieves nearly the same performance as expert-reviewed code.
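The four steps above can be written down as a small table of stages. The sketch below only describes the pipeline and does not run it: the Stage type and the describe helper are hypothetical names introduced here, and the command-line arguments each script expects are not shown because they are set inside the bash files.

```python
from typing import List, NamedTuple, Optional


class Stage(NamedTuple):
    name: str
    script: str
    output_dir: Optional[str]  # None when the README names no output folder


# Script and folder names are taken from the step descriptions above.
PIPELINE = [
    Stage("Knowledge synthesis from the literature", "synthesize.py", "prior_knowledge"),
    Stage("Knowledge inference from data", "inference.py", "data_knowledge"),
    Stage("Inferred knowledge summarization", "summarize_rules.py",
          "summarized_inference_rules"),
    # Script name as it appears in the repository's file list.
    Stage("Automatic code generation and evaluation", "code_gen_and_eval.py", None),
]


def describe(pipeline: List[Stage] = PIPELINE) -> List[str]:
    """Return one human-readable line per stage, in execution order."""
    lines = []
    for i, stage in enumerate(pipeline, start=1):
        target = f" -> {stage.output_dir}/" if stage.output_dir else ""
        lines.append(f"{i}. {stage.name}: python {stage.script}{target}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe()))
```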

📓 Note: We also provide an advanced automatic code generation tool based on the newly released OpenAI Assistant. To try the assistant version of code generation, see the "code_gen.py" and "eval.py" files in the folder "LLM4SD-gpt4-demo".

PS: To obtain an explanation, take the information provided by the trained interpretable model and structure it into a prompt that asks an LLM to explain the result, as shown in the paper.
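As a rough illustration of the idea in the PS, the sketch below turns rule weights from an interpretable model into an explanation prompt. Everything here is an assumption made for illustration: the function name, the prompt wording, and the example rules and weights are invented and are not taken from the repository or the paper.

```python
from typing import Sequence, Tuple


def build_explanation_prompt(
    task: str,
    prediction: str,
    rule_weights: Sequence[Tuple[str, float]],
    top_k: int = 3,
) -> str:
    """Format the top_k rules (by absolute weight) into an explanation prompt."""
    ranked = sorted(rule_weights, key=lambda rw: abs(rw[1]), reverse=True)[:top_k]
    rule_lines = [f"- {rule} (weight {weight:+.2f})" for rule, weight in ranked]
    return (
        f"Task: {task}\n"
        f"Model prediction: {prediction}\n"
        "Most influential rules according to the interpretable model:\n"
        + "\n".join(rule_lines)
        + "\nExplain in plain language how these rules lead to the prediction."
    )


if __name__ == "__main__":
    # Invented example values, for illustration only.
    example_rules = [
        ("molecular weight below a threshold", 0.42),
        ("number of hydrogen bond donors", -0.31),
        ("presence of an aromatic ring", 0.08),
    ]
    print(build_explanation_prompt("bbbp", "permeable", example_rules, top_k=2))
```

The resulting string can then be sent to whichever LLM you use; the call itself is left out because it depends on your API setup.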

Direct Evaluation:

To directly evaluate the generated code for a specific task, run:

python eval.py --dataset ${dataset} --subtask "${subtask_name}" --model ${model_name} --knowledge_type ${knowledge_type}

When evaluating inference code or combined code, also specify --num_samples ${num_samples}, where ${num_samples} is the number of responses generated during inference.
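If you launch evaluations from Python, the command line above can be assembled as an argument list. This is a minimal sketch: build_eval_command is a hypothetical helper, only the flags named in this README are used, and the values in the example are placeholders rather than known valid choices.

```python
import shlex
from typing import List, Optional


def build_eval_command(
    dataset: str,
    subtask: str,
    model: str,
    knowledge_type: str,
    num_samples: Optional[int] = None,
) -> List[str]:
    """Build the eval.py argument list; --num_samples is added only when given."""
    cmd = [
        "python", "eval.py",
        "--dataset", dataset,
        "--subtask", subtask,
        "--model", model,
        "--knowledge_type", knowledge_type,
    ]
    if num_samples is not None:
        cmd += ["--num_samples", str(num_samples)]
    return cmd


if __name__ == "__main__":
    # Placeholder values; replace them with the names used in your setup.
    cmd = build_eval_command("tox21", "my_subtask", "my_model", "my_knowledge_type",
                             num_samples=30)
    print(shlex.join(cmd))
    # To execute it, pass the list to subprocess.run(cmd, check=True).
```

Passing a list instead of a single string avoids shell quoting problems when a subtask name contains spaces.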

To directly evaluate all generated code for all tasks, run:

bash eval_code.sh

Architecture of LLM4SD

Web-based application developed based on LLM4SD (Will be released soon)

Comments are welcome to help us improve the web-based application!

1. Knowledge Synthesis (Derive Knowledge from Scientific Literature)

2. Knowledge Inference (Derive Knowledge from Analyzing Scientific Data)

3. Prediction with Explanation (Explaining how the Prediction is derived)
