yuxuanliao/SigmaCCS

SigmaCCS

This is the code repository for the paper "Highly accurate and large-scale collision cross section prediction with graph neural network for compound identification". We developed a method named Structure included graph merging with adduct method for CCS prediction (SigmaCCS), and built a dataset containing 282 million CCS values for three ion adducts ([M+H]+, [M+Na]+ and [M-H]-) of 94 million compounds. Each molecule's entry lists "Pubchem ID", "SMILES", "InChi", "Inchikey", "Molecular Weight", "Exact Mass", "Formula" and the predicted CCS values for the three adduct ion types.

Required packages:

We recommend using conda and pip.

Installing from requirements/conda/environment.yml or requirements/pip/requirements.txt installs all the required packages.

Data pre-processing

SigmaCCS predicts CCS with a graph neural network, so SMILES strings must first be converted to graphs. The related methods are in sigma/GraphData.py.

1. Generate 3D conformations of molecules.

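from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles(smiles)  # parse the SMILES string
mol = Chem.AddHs(mol)             # add explicit hydrogens before embedding
ps = AllChem.ETKDGv3()
ps.randomSeed = -1                # -1: non-deterministic seed
ps.maxAttempts = 1
ps.numThreads = 0                 # 0: use all available threads
ps.useRandomCoords = True         # start embedding from random coordinates
# Embed one conformer; returns the list of conformer IDs
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs = 1, params = ps)
# MMFF optimization; returns one (not_converged, energy) tuple per conformer
results = AllChem.MMFFOptimizeMoleculeConfs(mol, numThreads = 0)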
  • ETKDGv3: returns an EmbedParameters object for the ETKDG method, version 3 (macrocycles).
  • EmbedMultipleConfs: uses distance geometry to obtain multiple sets of coordinates for a molecule.
  • MMFFOptimizeMoleculeConfs: uses MMFF to optimize all of a molecule's conformations.

2. Save relevant parameters. For details, see sigma/parameter.py.

  • adduct set
  • atoms set
  • Minimum value in atomic coordinates
  • Maximum value in atomic coordinates

3. Generate the Graph dataset. Generate the three matrices used to construct the Graph:
(1) node feature matrix, (2) adjacency matrix, (3) edge feature matrix.

adj, features, edge_features = convertToGraph(smiles, Coordinate, All_Atoms)
DataSet = MyDataset(features, adj, edge_features, ccs)

Optional args

  • All_Atoms : The set of all elements in the dataset
  • Coordinate : Array of coordinates of all molecules
  • features : Node feature matrix
  • adj : Adjacency matrix
  • edge_features : Edge feature matrix
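To make the three matrices concrete, here is a simplified sketch that builds them for a small molecule from an explicit atom and bond list. It is not the repository's convertToGraph: the function `build_graph_matrices`, its input format, the one-hot element encoding for node features, and the use of bond order as the only edge feature are all assumptions chosen for illustration. The real implementation starts from SMILES and coordinates and may encode more per node and per edge.

```python
from typing import List, Sequence, Tuple


def build_graph_matrices(
    atoms: Sequence[str],
    bonds: Sequence[Tuple[int, int, float]],
    all_atoms: Sequence[str],
) -> Tuple[List[List[int]], List[List[int]], List[List[float]]]:
    """Return (adj, features, edge_features) for one molecule.

    Hypothetical helper: atoms are element symbols, bonds are
    (atom_index, atom_index, bond_order) triples.
    """
    n = len(atoms)
    column = {sym: i for i, sym in enumerate(all_atoms)}

    # Node feature matrix: one row per atom, one-hot over the element set.
    features = [[0] * len(all_atoms) for _ in range(n)]
    for row, sym in enumerate(atoms):
        features[row][column[sym]] = 1

    # Adjacency and edge feature matrices are symmetric n x n matrices.
    adj = [[0] * n for _ in range(n)]
    edge_features = [[0.0] * n for _ in range(n)]
    for i, j, order in bonds:
        adj[i][j] = adj[j][i] = 1
        edge_features[i][j] = edge_features[j][i] = float(order)

    return adj, features, edge_features


# Formaldehyde (C=O with two hydrogens), with the element set C, H, O.
adj, features, edge_features = build_graph_matrices(
    atoms=["C", "O", "H", "H"],
    bonds=[(0, 1, 2.0), (0, 2, 1.0), (0, 3, 1.0)],
    all_atoms=["C", "H", "O"],
)
```

The width of the node feature matrix depends on All_Atoms, which is why the same element set has to be used for training and prediction.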

Model training

Train the model on your own training dataset with the Model_train function.

Model_train(ifile, ParameterPath, ofile, ofileDataPath, EPOCHS, BATCHS, Vis, All_Atoms=[], adduct_SET=[])

Optional args

  • ifile : Path to the input file containing the SMILES and adduct data.
  • ofile : File path where the trained model is stored.
  • ParameterPath : Path where the related data parameters are saved.
  • ofileDataPath : File path for storing model parameter data.

Predicting CCS

The CCS prediction for a molecule is obtained by feeding its graph and adduct into a trained SigmaCCS model with the Model_prediction function.

Model_prediction(ifile, ParameterPath, mfileh5, ofile, Isevaluate = 0)

Optional args

  • ifile : Path to the input file containing the SMILES and adduct data.
  • ParameterPath : File path for storing model parameter data.
  • mfileh5 : File path where the model is stored.
  • ofile : Path where the predicted CCS values are saved.
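When experimental CCS values are available, predictions can be compared against them. The sketch below computes two metrics commonly used for CCS models, the median relative error and the coefficient of determination (R²). It is an illustration only: the source does not state what Isevaluate reports, and the function `evaluate_ccs` and its return keys are hypothetical names, not part of the SigmaCCS API.

```python
from statistics import median
from typing import Dict, Sequence


def evaluate_ccs(true_ccs: Sequence[float], pred_ccs: Sequence[float]) -> Dict[str, float]:
    """Compare predicted and experimental CCS values (hypothetical helper)."""
    if not true_ccs or len(true_ccs) != len(pred_ccs):
        raise ValueError("inputs must be non-empty and of equal length")

    # Relative error of each prediction, in percent.
    rel_errors = [abs(p - t) / t * 100.0 for t, p in zip(true_ccs, pred_ccs)]

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(true_ccs) / len(true_ccs)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_ccs, pred_ccs))
    ss_tot = sum((t - mean_true) ** 2 for t in true_ccs)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")

    return {
        "median_relative_error_pct": median(rel_errors),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy values only; these are not results from the paper.
metrics = evaluate_ccs([100.0, 200.0, 300.0], [101.0, 198.0, 303.0])
```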

Usage

Example usage code is included in test.ipynb.

Others

The following files are in the others folder:

Package required:

Slurm script

Slurm scripts for generating CCS values for PubChem compounds on an HPC cluster. The following files are in the slurm folder:

  • mp.py
  • multiple_job.sh (Batch generation of slurm script files)
  • normal_job.sh (Submit the slurm script for the mp.py file)

Information of maintainers
