Mining Patents with Large Language Models Demonstrates Congruence of Functional Labels and Chemical Structures #33

shnakazawa · 2023-11-06T00:32:16Z

Kosonocky, Clayton W., et al. “Mining Patents with Large Language Models Demonstrates Congruence of Functional Labels and Chemical Structures.” arXiv [q-bio.QM], 15 Sept. 2023, http://arxiv.org/abs/2309.08765. arXiv.

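Uses ChatGPT (gpt-3.5-turbo) to build the Chemical Function (CheF) dataset, which links molecules to their functions using patent information.
In the CheF dataset, molecules and their functions are associated with high accuracy.
A model trained on the CheF dataset could predict the functions of molecules in the validation data.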

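An interesting report as a case study of developing new materials with language models plus patent data. I am curious about what characterizes the molecules the model fails to predict, and what results come out when it is given a completely unknown molecule.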

Abstract

Predicting chemical function from structure is a major goal of the chemical sciences, from the discovery and repurposing of novel drugs to the creation of new materials. Recently, new machine learning algorithms are opening up the possibility of general predictive models spanning many different chemical functions. Here, we consider the challenge of applying large language models to chemical patents in order to consolidate and leverage the information about chemical functionality captured by these resources. Chemical patents contain vast knowledge on chemical function, but their usefulness as a dataset has historically been neglected due to the impracticality of extracting high-quality functional labels. Using a scalable ChatGPT-assisted patent summarization and word-embedding label cleaning pipeline, we derive a Chemical Function (CheF) dataset, containing 100K molecules and their patent-derived functional labels. The functional labels were validated to be of high quality, allowing us to detect a strong relationship between functional label and chemical structural spaces. Further, we find that the co-occurrence graph of the functional labels contains a robust semantic structure, which allowed us in turn to examine functional relatedness among the compounds. We then trained a model on the CheF dataset, allowing us to assign new functional labels to compounds. Using this model, we were able to retrodict approved Hepatitis C antivirals, uncover an antiviral mechanism undisclosed in the patent, and identify plausible serotonin-related drugs. The CheF dataset and associated model offers a promising new approach to predict chemical functionality.

Code

https://github.com/kosonocky/chef

Problem addressed / comparison with prior work

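A drug's function is determined by the chemical structure of its molecule. Predicting function from structure, however, is not easy.
On the other hand, over the history of drug discovery, relationships between chemical structure and function have presumably been recorded across a wide range of documents.
This paper uses ChatGPT (gpt-3.5-turbo) to turn patent text into functional labels and link them to molecular structures, showing one way to use patent information in new drug development.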

Key techniques and methods

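Extract 100,000 randomly chosen molecules and their associated patent information from SureChEMBL, a database of molecules and patents.
- To make the molecule-patent correspondence more reliable, the 100,000 were chosen only from molecules mentioned in fewer than 10 patents.
  - For example, penicillin is mentioned in 40,000 patents, but only a few of them are really about penicillin itself. Such molecules are excluded.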
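The patent-count filter described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `select_molecules`, the `(SMILES, patent ID)` input format, and the toy patent IDs are all assumptions made for the example.

```python
import random
from collections import defaultdict


def select_molecules(pairs, max_patents=10, sample_size=100_000, seed=0):
    """Keep molecules linked to fewer than `max_patents` patents, then sample.

    `pairs` is an iterable of (smiles, patent_id) tuples. This input format
    is hypothetical; SureChEMBL exports look different.
    """
    patents_by_molecule = defaultdict(set)
    for smiles, patent_id in pairs:
        patents_by_molecule[smiles].add(patent_id)

    # Molecules cited by many patents are mostly incidental mentions,
    # so only those with fewer than `max_patents` patents are eligible.
    eligible = sorted(
        m for m, patents in patents_by_molecule.items()
        if len(patents) < max_patents
    )

    rng = random.Random(seed)
    chosen = rng.sample(eligible, min(sample_size, len(eligible)))
    return {m: sorted(patents_by_molecule[m]) for m in chosen}


# Toy data with made-up patent IDs: one over-cited molecule, two specific ones.
pairs = [("CCO", f"US-{i}") for i in range(40)]
pairs += [
    ("c1ccccc1O", "US-100"),
    ("c1ccccc1O", "US-101"),
    ("CC(=O)N", "US-200"),
]
selected = select_molecules(pairs, sample_size=2)
```

Here the molecule mentioned in 40 patents is dropped, and the two molecules with one or two patents are kept together with their patent lists.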
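Scrape the title, abstract, and description of each extracted patent from Google Scholar.
Using gpt-3.5-turbo, assign one to three functional labels to each molecule based on the scraped patent information.
Then clean the data with consolidation steps such as merging semantically similar labels into one (using OpenAI's text-embedding-ada-002).
The result is a chemical function dataset containing patent summaries, 100,000 molecules, and patent-derived functional labels, named the Chemical Function (CheF) dataset.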
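To make the label-merging step concrete, here is a minimal sketch under stated assumptions. This note only says that semantically similar labels were merged using embeddings. The threshold-based grouping, the 0.95 cutoff, the choice of the most frequent label as representative, and the three-dimensional toy vectors are all hypothetical stand-ins (real text-embedding-ada-002 vectors would be fetched from the API and are much longer).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_labels(embeddings, counts, threshold=0.95):
    """Map each label to a representative of its similarity group.

    Labels whose embeddings have cosine similarity >= threshold are linked,
    and linked labels are grouped transitively with a small union-find.
    """
    labels = list(embeddings)
    parent = {label: label for label in labels}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for label in labels:
        groups.setdefault(find(label), []).append(label)

    mapping = {}
    for members in groups.values():
        # Assumption: the most frequent spelling represents the group.
        representative = max(members, key=lambda m: (counts[m], m))
        for member in members:
            mapping[member] = representative
    return mapping


# Toy embeddings (made-up numbers, not real model output).
embeddings = {
    "antiviral": [1.0, 0.0, 0.1],
    "anti-viral": [0.98, 0.05, 0.12],
    "kinase inhibitor": [0.0, 1.0, 0.2],
}
counts = {"antiviral": 120, "anti-viral": 7, "kinase inhibitor": 45}
mapping = merge_labels(embeddings, counts)
```

With these toy vectors, "anti-viral" is folded into "antiviral" while "kinase inhibitor" stays separate.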

Evaluation

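Confirming that labels and molecules in the CheF dataset are strongly related
- 200 molecules were randomly selected from the CheF dataset; together they carried 1,738 labels.
- Of these labels, 99.6% had correct syntactic structure and 99.8% were relevant to their respective patents.
- 77.9% of the labels directly described the function of the labeled molecule. When the function of the primary patented molecule, for which the labeled molecule is an intermediate, is also counted, this rises to 98.2%.
- Tanimoto similarity between molecules sharing a label was higher than between randomly chosen molecules (Fig. 2).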
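For readers unfamiliar with the Tanimoto comparison, the following sketch shows the idea on toy data. Fingerprints are represented as sets of "on" bit indices, and the bit sets and labels are invented for illustration. In practice the fingerprints would come from a cheminformatics toolkit, and the comparison here (same-label pairs versus all pairs) is a simplification of whatever protocol the paper used.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as bit-index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_pairwise(fingerprints):
    """Mean Tanimoto similarity over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy fingerprints and labels (hypothetical values).
fingerprints = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 5},
    "mol_c": {2, 3, 4, 5},
    "mol_d": {10, 11, 12},
    "mol_e": {1, 11, 12, 13},
}
labels = {
    "mol_a": {"antiviral"},
    "mol_b": {"antiviral"},
    "mol_c": {"antiviral"},
    "mol_d": {"herbicide"},
    "mol_e": {"herbicide"},
}

same_label = [fingerprints[m] for m, ls in labels.items() if "antiviral" in ls]
within = mean_pairwise(same_label)                    # pairs sharing "antiviral"
overall = mean_pairwise(list(fingerprints.values()))  # baseline over all pairs
```

If labels track structure, `within` should exceed `overall`, which is the pattern reported for CheF.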
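The co-occurrence graph of the functional labels was found to contain a robust semantic structure, which made it possible to examine functional relatedness among compounds (Fig. 3).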
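A label co-occurrence graph of this kind can be built by counting how often two labels appear on the same molecule. The sketch below is an assumption-laden illustration: the helper names `cooccurrence` and `neighbors` and the toy label sets are hypothetical, and the paper's graph construction may weight or filter edges differently.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(label_sets):
    """Count, for each unordered label pair, the molecules carrying both labels."""
    edges = Counter()
    for labels in label_sets:
        for a, b in combinations(sorted(set(labels)), 2):
            edges[(a, b)] += 1
    return edges


def neighbors(edges, label):
    """Labels co-occurring with `label`, strongest edge first."""
    out = Counter()
    for (a, b), weight in edges.items():
        if a == label:
            out[b] += weight
        elif b == label:
            out[a] += weight
    return out.most_common()


# Toy per-molecule label lists (invented for illustration).
molecule_labels = [
    ["antiviral", "hcv", "protease inhibitor"],
    ["antiviral", "hcv"],
    ["antiviral", "hiv"],
    ["serotonin", "antidepressant"],
    ["serotonin", "5-ht"],
]
edges = cooccurrence(molecule_labels)
antiviral_neighbors = neighbors(edges, "antiviral")
```

In this toy graph the antiviral labels and the serotonin labels form two disconnected groups, a small-scale version of the semantic structure described above.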

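Training a model on the CheF dataset made it possible to assign new functional labels to compounds.
- The model is a neural network with two hidden layers (512 and 256 neurons) that performs multi-label classification, since each molecule can carry several labels.
- On a hold-out test set, the mean ROC-AUC over 1,543 labels was 0.81 and the mean PR-AUC was 0.12 (Fig. 5a).
- Predicting effects from molecular structure (Fig. 5b) and retrieving molecules by effect (Fig. 5c, d); green marks true positives and red marks false positives.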
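The gap between a mean ROC-AUC of 0.81 and a mean PR-AUC of 0.12 is easier to read with a small worked example. The sketch below computes both metrics per label and macro-averages them. Average precision is used here as a common estimate of PR-AUC; this note does not say how the paper computed it, so that choice and the toy scores are assumptions.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5).

    Assumes at least one positive and one negative example.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision values at the rank of each positive example.

    Assumes at least one positive example.
    """
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def macro_average(metric, y_true_by_label, scores_by_label):
    """Unweighted mean of a per-label metric."""
    values = [metric(y_true_by_label[k], scores_by_label[k]) for k in y_true_by_label]
    return sum(values) / len(values)


# Toy scores for 20 molecules, highest first (made-up numbers).
scores = [1.0 - 0.05 * i for i in range(20)]

# A rare label: a single positive, ranked 5th out of 20.
rare = [0] * 20
rare[4] = 1
# A common label: 10 positives, ranked perfectly.
common = [1] * 10 + [0] * 10

y_true_by_label = {"rare": rare, "common": common}
scores_by_label = {"rare": scores, "common": scores}

rare_roc = roc_auc(rare, scores)           # 15 of 19 negatives rank below the positive
rare_ap = average_precision(rare, scores)  # precision at rank 5 is 1/5
mean_roc = macro_average(roc_auc, y_true_by_label, scores_by_label)
mean_ap = macro_average(average_precision, y_true_by_label, scores_by_label)
```

For the rare label, ROC-AUC is about 0.79 while average precision is only 0.2. With many rare labels, as is likely among 1,543 labels, a macro-averaged PR-AUC can therefore sit far below the ROC-AUC; a random ranker would score roughly the label's prevalence on PR-AUC (0.05 in this toy case).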

Open issues, discussion, and impressions

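The CheF dataset used in this paper covers only 100,000 molecules. Extending it to data on tens of millions of molecules could make it more useful.
One caveat: the training data is patent data, so it is biased toward molecules that get patented. Molecules that are highly practical but not patented are not included.
- It will probably be necessary to bring in information from the scientific literature through non-patent sources such as PubMed.
Performance was also checked only with the hold-out method, so it should be kept in mind that the test molecules are not entirely novel.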

Key references

Examples of LLM use in drug discovery prior to this paper
- Andres M Bran, Sam Cox, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023.
- Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. arXiv preprint arXiv:2306.08018, 2023.
- Dimitrios Christofidellis, Giorgio Giannone, Jannis Born, Ole Winther, Teodoro Laino, and Matteo Manica. Unifying molecular and textual representations via multi-task language modelling. arXiv preprint arXiv:2301.12586, 2023.
SureChEMBL database
- George Papadatos, Mark Davies, Nathan Dedman, Jon Chambers, Anna Gaulton, James Siddle, Richard Koks, Sean A Irvine, Joe Pettersson, Nicko Goncharoff, et al. Surechembl: a large-scale, chemically annotated patent document database. Nucleic acids research, 44(D1):D1220–D1228, 2016.

