Skip to content

Commit

Permalink
Merge 896928e into 08e78ea
Browse files Browse the repository at this point in the history
  • Loading branch information
rasbt committed Jun 7, 2017
2 parents 08e78ea + 896928e commit 54cc8e9
Show file tree
Hide file tree
Showing 10 changed files with 215 additions and 63 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -1531,6 +1531,13 @@
"## Parsing Multi-MOL2 files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basic Multi-MOL2 File Parsing"
]
},
{
"cell_type": "markdown",
"metadata": {},
Expand All @@ -1542,7 +1549,7 @@
},
{
"cell_type": "code",
"execution_count": 47,
"execution_count": 1,
"metadata": {},
"outputs": [
{
Expand Down Expand Up @@ -1601,6 +1608,107 @@
" # note that the mol2_text contains the original mol2 content\n",
" f.write(pdmol.mol2_text) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using Multiprocessing for Multi-MOL2 File Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To improve the computational efficiency and throughput for multi-mol2 analyses, it is recommended to use the [`mputil`](https://github.com/rasbt/mputil) package, which evaluates Python generators lazily. The `lazy_imap` function from `mputil` is based on Python's standardlib multiprocessing `imap` function, but it doesn't consume the generator upfront. This lazy evaluation is important, for example, if we are parsing large (possibly Gigabyte- or Terabyte-large) multi-mol2 files for multiprocessing.\n",
"\n",
"The following example provides a template for atom-type based molecule queries, but the `data_processor` function can be extended to do any kind of functional group queries (for example, involving the `'charge'` column and/or `PandasMol2.distance` method). "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from mputil import lazy_imap\n",
"from biopandas.mol2 import PandasMol2\n",
"from biopandas.mol2 import split_multimol2"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Searched 40 molecules. Got 3 hits.\n"
]
}
],
"source": [
"# Selection strings to capture\n",
"# all molecules that contain at least one sp2 hybridized\n",
"# oxygen atom and at least one Fluorine atom\n",
"SELECTIONS = [\"(pdmol.df.atom_type == 'O.2')\",\n",
" \"(pdmol.df.atom_type == 'F')\"]\n",
" \n",
"# Path to the multi-mol2 input file\n",
"MOL2_FILE = \"./data/40_mol2_files.mol2\"\n",
"\n",
"# Data processing function to be run in parallel\n",
"def data_processor(mol2):\n",
" \"\"\"Return molecule ID if there's a match and '' otherwise\"\"\"\n",
" pdmol = PandasMol2().read_mol2_from_list(mol2_lines=mol2[1],\n",
" mol2_code=mol2[0])\n",
" match = mol2[0]\n",
" for sub_sele in SELECTIONS:\n",
" if not pd.eval(sub_sele).any():\n",
" match = ''\n",
" break\n",
" return match\n",
"\n",
"# Process molecules and save IDs of hits to disk\n",
"with open('./data/selected_ids.txt', 'w') as f: \n",
" searched, found = 0, 0\n",
" for chunk in lazy_imap(data_processor=data_processor,\n",
" data_generator=split_multimol2(MOL2_FILE),\n",
" n_cpus=0): # means all available cpus\n",
"\n",
" for mol2_id in chunk:\n",
" if mol2_id:\n",
" # write IDs of matching molecules to disk\n",
" f.write('%s\\n' % mol2_id)\n",
" found += 1\n",
" searched += len(chunk)\n",
" \n",
" \n",
"print('Searched %d molecules. Got %d hits.' % (searched, found))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Input File link: [40_mol2_files.mol2](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/40_mol2_files.mol2)]\n",
"\n",
"[Output File link: [selected_ids.txt](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/selected_ids.txt)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
Expand All @@ -1620,7 +1728,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
"version": "3.6.1"
}
},
"nbformat": 4,
Expand Down
109 changes: 79 additions & 30 deletions docs/sources/tutorials/Working_with_MOL2_Structures_in_DataFrames.md
Original file line number Diff line number Diff line change
@@ -1,31 +1,4 @@

BioPandas

Author: Sebastian Raschka <mail@sebastianraschka.com>
License: BSD 3 clause
Project Website: http://rasbt.github.io/biopandas/
Code Repository: https://github.com/rasbt/biopandas


```python
%load_ext watermark
%watermark -d -u -p pandas,biopandas
```

last updated: 2017-04-02

pandas 0.19.2
biopandas 0.2.0.dev0



```python
from biopandas.mol2 import PandasMol2
import pandas as pd
pd.set_option('display.width', 600)
pd.set_option('display.max_columns', 8)
```

# Working with MOL2 Structures in DataFrames

The Tripos MOL2 format is a common format for working with small molecules. In this tutorial, we will go over some examples that illustrate how we can use Biopandas' MOL2 DataFrames to analyze molecules conveniently.
Expand All @@ -41,13 +14,17 @@ from biopandas.mol2 import PandasMol2
pmol = PandasMol2().read_mol2('./data/1b5e_1.mol2')
```

[File link: [1b5e_1.mol2](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/1b5e_1.mol2)]

The `read_mol2` method can also load structures from `.mol2.gz` files, but if you have a multi-mol2 file, keep in mind that it will only fetch the first molecule in this file. In the section "[Parsing Multi-MOL2 files](#parsing-multi-mol2-files)," we will see how we can parse files that contain multiple structures.


```python
pmol = PandasMol2().read_mol2('./data/40_mol2_files.mol2.gz')
```

[File link: [40_mol2_files.mol2.gz](https://github.com/rasbt/biopandas/blob/master/docs/sources/tutorials/data/40_mol2_files.mol2.gz?raw=true)]

After the file was succesfully loaded, we have access to the following basic `PandasMol2` attributes:


Expand Down Expand Up @@ -358,6 +335,8 @@ pmol.df.tail(10)



[File link: [1b5e_1.mol2](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/1b5e_1.mol2)]

For example, we can select all hydrogen atoms by filtering on the atom type column:


Expand Down Expand Up @@ -636,6 +615,8 @@ from biopandas.mol2 import PandasMol2
pmol = PandasMol2().read_mol2('./data/1b5e_1.mol2')
```

[File link: [1b5e_1.mol2](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/1b5e_1.mol2)]


```python
%matplotlib inline
Expand All @@ -655,7 +636,7 @@ plt.show()
```


![png](Working_with_MOL2_Structures_in_DataFrames_files/Working_with_MOL2_Structures_in_DataFrames_30_0.png)
![png](Working_with_MOL2_Structures_in_DataFrames_files/Working_with_MOL2_Structures_in_DataFrames_34_0.png)


If this is too fine-grained for our needs, we could summarize the different atom types by atomic elements:
Expand All @@ -671,7 +652,7 @@ plt.show()
```


![png](Working_with_MOL2_Structures_in_DataFrames_files/Working_with_MOL2_Structures_in_DataFrames_32_0.png)
![png](Working_with_MOL2_Structures_in_DataFrames_files/Working_with_MOL2_Structures_in_DataFrames_36_0.png)


One of the coolest features in pandas is the groupby method. Below is an example plotting the average charge of the different atom types with the standard deviation as error bars:
Expand All @@ -685,7 +666,7 @@ plt.show()
```


![png](Working_with_MOL2_Structures_in_DataFrames_files/Working_with_MOL2_Structures_in_DataFrames_34_0.png)
![png](Working_with_MOL2_Structures_in_DataFrames_files/Working_with_MOL2_Structures_in_DataFrames_38_0.png)


## Computing the Root Mean Square Deviation
Expand Down Expand Up @@ -719,6 +700,8 @@ print('All-atom RMSD: %.4f Angstrom' % r_all)
All-atom RMSD: 1.5523 Angstrom


[File links: [1b5e_1.mol2](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/1b5e_1.mol2), [1b5e_2.mol2](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/1b5e_2.mol2)]

<br>

## Filtering Atoms by Distance
Expand Down Expand Up @@ -1046,6 +1029,8 @@ all_within_3A.tail()

## Parsing Multi-MOL2 files

### Basic Multi-MOL2 File Parsing

As mentioned earlier, `PandasMol2.read_mol2` method only reads in the first molecule if it is given a multi-MOL2 file. However, if we want to create DataFrames from multiple structures in a MOL2 file, we can use the handy `split_multimol2` generator.

The `split_multimol2` generator yields tuples containing the molecule IDs and the MOL2 content as strings in a list -- each line in the MOL2 file is stored as a string in the list.
Expand All @@ -1066,6 +1051,8 @@ print('First 10 lines:\n', mol2_cont[:10])
['@<TRIPOS>MOLECULE\n', 'ZINC38611810\n', ' 65 68 0 0 0\n', 'SMALL\n', 'NO_CHARGES\n', '\n', '@<TRIPOS>ATOM\n', ' 1 C1 -1.1786 2.7011 -4.0323 C.3 1 <0> -0.1537\n', ' 2 C2 -1.2950 1.2442 -3.5798 C.3 1 <0> -0.1156\n', ' 3 C3 -0.1742 0.4209 -4.2178 C.3 1 <0> -0.1141\n']


[File link: [40_mol2_files.mol2](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/40_mol2_files.mol2)]

We can now use this generator to loop over all files in a multi-MOL2 file and create PandasMol2 DataFrames. A typical use case would be the filtering of mol2 files by certain properties:


Expand All @@ -1084,3 +1071,65 @@ with open('./data/filtered.mol2', 'w') as f:
# note that the mol2_text contains the original mol2 content
f.write(pdmol.mol2_text)
```

### Using Multiprocessing for Multi-MOL2 File Analysis

To improve the computational efficiency and throughput for multi-mol2 analyses, it is recommended to use the [`mputil`](https://github.com/rasbt/mputil) package, which evaluates Python generators lazily. The `lazy_imap` function from `mputil` is based on Python's standardlib multiprocessing `imap` function, but it doesn't consume the generator upfront. This lazy evaluation is important, for example, if we are parsing large (possibly Gigabyte- or Terabyte-large) multi-mol2 files for multiprocessing.

The following example provides a template for atom-type based molecule queries, but the `data_processor` function can be extended to do any kind of functional group queries (for example, involving the `'charge'` column and/or `PandasMol2.distance` method).


```python
import pandas as pd
from mputil import lazy_imap
from biopandas.mol2 import PandasMol2
from biopandas.mol2 import split_multimol2
```


```python
# Selection strings to capture
# all molecules that contain at least one sp2 hybridized
# oxygen atom and at least one Fluorine atom
SELECTIONS = ["(pdmol.df.atom_type == 'O.2')",
"(pdmol.df.atom_type == 'F')"]

# Path to the multi-mol2 input file
MOL2_FILE = "./data/40_mol2_files.mol2"

# Data processing function to be run in parallel
def data_processor(mol2):
"""Return molecule ID if there's a match and '' otherwise"""
pdmol = PandasMol2().read_mol2_from_list(mol2_lines=mol2[1],
mol2_code=mol2[0])
match = mol2[0]
for sub_sele in SELECTIONS:
if not pd.eval(sub_sele).any():
match = ''
break
return match

# Process molecules and save IDs of hits to disk
with open('./data/selected_ids.txt', 'w') as f:
searched, found = 0, 0
for chunk in lazy_imap(data_processor=data_processor,
data_generator=split_multimol2(MOL2_FILE),
n_cpus=0): # means all available cpus

for mol2_id in chunk:
if mol2_id:
# write IDs of matching molecules to disk
f.write('%s\n' % mol2_id)
found += 1
searched += len(chunk)


print('Searched %d molecules. Got %d hits.' % (searched, found))
```

Searched 40 molecules. Got 3 hits.


[Input File link: [40_mol2_files.mol2](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/40_mol2_files.mol2)]

[Output File link: [selected_ids.txt](https://raw.githubusercontent.com/rasbt/biopandas/master/docs/sources/tutorials/data/selected_ids.txt)]
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit 54cc8e9

Please sign in to comment.