**Protkit Data Download Demo**

This short notebook illustrates the following:

*   Downloading PDB files from the Protein Data Bank.
*   Creating a Prot file containing the PDB files.
*   Visualising the proteins in 3D.

Protkit is available as a package on PyPI.

We start by installing Protkit and py3Dmol. Installing the necessary dependencies may take a few minutes.

In [1]:
!pip install protkit
!pip install py3Dmol

Collecting protkit
  Downloading protkit-0.3.0-py3-none-any.whl (125 kB)
Collecting scikit-learn>=1.4.1.post1 (from protkit)
  Downloading scikit_learn-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Collecting biopython>=1.83 (from protkit)
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
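To confirm that the installation succeeded before continuing, one option is to query the installed package versions. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Protkit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two packages this notebook relies on.
for name in ("protkit", "py3Dmol"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either package is reported as not installed, rerun the install cell above.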

Protkit contains a number of modules organised by the functionality they provide. In this demo, we are interested in its downloading and file-handling capabilities. To use these, we import the relevant modules and classes.

In [2]:
from protkit.download import Download
from protkit.file_io import PDBIO, ProtIO
import py3Dmol

For convenience, we list the PDB IDs of the proteins we will use in this demo.

In [3]:
pdb_ids = ["1ahw", "1a4y", "1a6m", "1dc9"]

We can download all of the PDB files from the Protein Data Bank with a single line of code. Protkit will download the files efficiently in parallel and store them in the data directory.

In [4]:
Download.download_pdb_files_from_rcsb(pdb_ids, "data")
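Since the download depends on network access, it can be worth checking that every file actually arrived. The sketch below assumes the files are stored as `<pdb_id>.pdb` inside the target directory, which matches the paths used when loading the proteins later in this notebook. The helper `missing_pdb_files` is hypothetical and not part of Protkit.

```python
from pathlib import Path


def missing_pdb_files(pdb_ids, directory):
    """Return the IDs that have no non-empty `<pdb_id>.pdb` file in `directory`."""
    folder = Path(directory)
    missing = []
    for pdb_id in pdb_ids:
        path = folder / f"{pdb_id}.pdb"
        # Treat both absent and zero-byte files as failed downloads.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(pdb_id)
    return missing


# An empty list means every expected file is present.
print(missing_pdb_files(["1ahw", "1a4y", "1a6m", "1dc9"], "data"))
```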

We can load PDB files through the PDBIO class. The next line of code will load all the proteins into memory. They are stored as Protein objects, which expose a large number of useful functions that can be applied to proteins.

In [5]:
proteins = [PDBIO.load(f"data/{pdb_id}.pdb")[0] for pdb_id in pdb_ids]

Let's visualise the proteins. For this, we can use py3Dmol. It can load proteins from file, but in this demo we will create the models from our in-memory data representations.

In [6]:
viewer = py3Dmol.view(linked=False, viewergrid=(2, 2))
for i, protein in enumerate(proteins):
    cell = (i // 2, i % 2)  # (row, column) of this protein in the 2x2 grid
    viewer.addModel(PDBIO.save_to_string(protein), 'pdb', viewer=cell)
    viewer.setStyle({'cartoon': {'color': 'spectrum'}}, viewer=cell)
    viewer.zoomTo(viewer=cell)
viewer.render()

<py3Dmol.view at 0x7c400ff2aef0>
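The grid placement in the cell above relies on mapping a flat list index to a (row, column) pair. The sketch below shows that mapping on its own and generalises it to other grid widths. The helper names `grid_cell` and `grid_shape` are hypothetical and not part of py3Dmol or Protkit.

```python
def grid_cell(index, n_cols=2):
    """Map a flat index to a (row, column) cell, filling each row left to right."""
    return divmod(index, n_cols)


def grid_shape(n_items, n_cols=2):
    """Return the (rows, columns) needed to hold n_items, rounding rows up."""
    n_rows = (n_items + n_cols - 1) // n_cols
    return (n_rows, n_cols)


pdb_ids = ["1ahw", "1a4y", "1a6m", "1dc9"]
print(grid_shape(len(pdb_ids)))  # grid size for four proteins
for i, pdb_id in enumerate(pdb_ids):
    print(pdb_id, grid_cell(i))
```

With four proteins and two columns this reproduces the 2x2 layout used above. A different number of proteins would only need a different `viewergrid` value computed the same way.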

Let's save all our proteins in a single Prot file. This could be useful if we want to work with the data again in future.

In [7]:
ProtIO.save(proteins, "data/proteins.prot")

Prot files are stored using an efficient compressed JSON format. The Prot file we just created holds all four proteins, yet it is about half the size of a single PDB file (here, 1ahw.pdb).

In [9]:
import os

file_stats = os.stat("data/proteins.prot")
print(f'Database size in bytes: {file_stats.st_size}')

file_stats_pdb = os.stat("data/1ahw.pdb")
print(f'Single protein size in bytes: {file_stats_pdb.st_size}')

Database size in bytes: 423593
Single protein size in bytes: 862245
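The two sizes printed above give a ratio of roughly 0.49. Comparing the Prot file against the combined size of all four PDB files would arguably be a fairer measure. The sketch below shows one way to compute either ratio; the helper `size_ratio` is hypothetical and not part of Protkit.

```python
import os


def size_ratio(combined_path, reference_paths):
    """Return the size of one file divided by the total size of the reference files."""
    combined = os.path.getsize(combined_path)
    total = sum(os.path.getsize(path) for path in reference_paths)
    return combined / total


# Ratio computed directly from the two sizes printed above.
print(round(423593 / 862245, 2))  # → 0.49

# With the files from this notebook on disk, the fairer comparison would be:
# pdb_paths = [f"data/{pdb_id}.pdb" for pdb_id in pdb_ids]
# print(size_ratio("data/proteins.prot", pdb_paths))
```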
