# Loading data

This notebook illustrates how to load 3D structural ensemble data in IDPET and perform basic analyses.

IDPET can load ensemble data from local files or from online databases (currently PED and ATLAS).

## Download ensembles from PED

IDPET can download ensembles directly from the [Protein Ensemble Database](https://proteinensemble.org/) (PED).

We will begin by downloading three ensembles of the same protein domain, drkN SH3.

In [2]:
from idpet.ensemble import Ensemble
from idpet.ensemble_analysis import EnsembleAnalysis
from idpet.utils import set_verbosity

# Change verbosity level to show more information when performing the analysis.
set_verbosity("INFO")

We begin by creating a list of `Ensemble` objects, each initialized with a PED ensemble code. When `database` is set to `"ped"`, IDPET will automatically attempt to download the corresponding data from PED.

In [2]:
ens_codes = [
    Ensemble(code='PED00156e001', database='ped'),
    Ensemble(code='PED00157e001', database='ped'),
    Ensemble(code='PED00158e001', database='ped') 
]

Next, we will download the ensembles and load them into the notebook.

The ensemble data will be saved locally on your system. You can specify the download location using the `output_dir` argument of the `EnsembleAnalysis` class.

If `output_dir` is not provided or is set to `None`, the data will be downloaded to the default directory `${HOME}/.idpet/data`, which will be created automatically if it does not exist. You can override this default location by setting the `$IDPET_OUTPUT_DIR` environment variable on your system.
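
The lookup order described above can be pictured with a short sketch. This is not IDPET's own code: `resolve_output_dir` is a hypothetical helper, and the precedence it encodes (explicit argument, then `$IDPET_OUTPUT_DIR`, then `${HOME}/.idpet/data`) is an assumption based on the description in this notebook.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_output_dir(output_dir: Optional[str] = None) -> Path:
    """Hypothetical helper: pick the directory where ensemble data would be saved.

    Assumed precedence: explicit argument, then $IDPET_OUTPUT_DIR,
    then the default ~/.idpet/data.
    """
    if output_dir is not None:
        return Path(output_dir).expanduser()
    env_dir = os.environ.get("IDPET_OUTPUT_DIR")
    if env_dir:
        return Path(env_dir).expanduser()
    return Path.home() / ".idpet" / "data"


# With no argument, the result depends on the environment variable (if set).
print(resolve_output_dir())
# An explicit path always wins in this sketch.
print(resolve_output_dir("/tmp/my_ensembles"))
```

Printing the resolved path before starting a large download is a quick way to confirm where the files will end up.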

In [3]:
analysis = EnsembleAnalysis(
    ens_codes,
    output_dir=None  # Optional: add your own path to a directory to save ensemble data.
)
analysis.load_trajectories();

Downloading entry PED00156e001 from PED.
Download complete. Saved to: /Users/giacomojanson/.idpet/data/PED00156e001.tar.gz
Downloaded file PED00156e001.tar.gz from PED.
Extracted file PED00156e001.pdb.
Downloading entry PED00157e001 from PED.
Download complete. Saved to: /Users/giacomojanson/.idpet/data/PED00157e001.tar.gz
Downloaded file PED00157e001.tar.gz from PED.
Extracted file PED00157e001.pdb.
Downloading entry PED00158e001 from PED.
Download complete. Saved to: /Users/giacomojanson/.idpet/data/PED00158e001.tar.gz
Downloaded file PED00158e001.tar.gz from PED.
Extracted file PED00158e001.pdb.
PED00156e001 chain ids: ['A']
Generating trajectory for PED00156e001...
Generated trajectory saved to /Users/giacomojanson/.idpet/data.
PED00157e001 chain ids: ['A']
Generating trajectory for PED00157e001...
Generated trajectory saved to /Users/giacomojanson/.idpet/data.
PED00158e001 chain ids: ['A']
Generating trajectory for PED00158e001...
Generated trajectory saved to /Users/giacomojanson

## How to use your ensembles

Once the data has been downloaded, the `Ensemble` objects within the `EnsembleAnalysis` instance will be populated and ready for use.

Each ensemble contains multiple 3D structures, which are stored internally as a `Trajectory` object from the [MDTraj library](https://www.mdtraj.org).

In this section, we demonstrate basic usage of the ensembles and how to access their structural data.

For additional functionality and more useful analyses, please refer to the other notebooks and the official IDPET documentation.

In [4]:
# Access via iteration.
for ensemble in analysis.ensembles:
    print("---")
    print("code:", ensemble.code)
    # The xyz coordinates are stored in numpy arrays.
    mdtraj_trajectory = ensemble.trajectory
    n_conformations, n_atoms, _xyz_dimensions = mdtraj_trajectory.xyz.shape
    n_residues = mdtraj_trajectory.topology.n_residues
    print(f"shape xyz ensemble data: conformations={n_conformations}, residues={n_residues}, atoms={n_atoms}")

---
code: PED00156e001
shape xyz ensemble data: conformations=100, residues=59, atoms=941
---
code: PED00157e001
shape xyz ensemble data: conformations=100, residues=59, atoms=939
---
code: PED00158e001
shape xyz ensemble data: conformations=88, residues=59, atoms=939


In [5]:
# Access via key-value pairs.
ensemble = analysis["PED00156e001"]
print("code:", ensemble.code)

code: PED00156e001


The `Ensemble` class can be used to compute numerous structural features that describe the 3D conformations. IDPET has many high-level methods for calculating and analyzing these features (see the other notebooks), but here we show how you can, in principle, calculate them at a lower level.

In [6]:
# Compute radius of gyration.
rg_array = ensemble.get_features("rg")
print(f"Average Rg: {rg_array.mean():.3f} [nm]")

Average Rg: 1.794 [nm]
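
To make the quantity concrete, here is a minimal sketch of a radius of gyration calculation for a single conformation. It assumes the unweighted definition (root-mean-square distance of the atoms from their centroid); whether IDPET applies mass weighting is not stated in this notebook, and `radius_of_gyration` is an illustrative function, not part of the IDPET API.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def radius_of_gyration(coords: Sequence[Point]) -> float:
    """Unweighted Rg: root-mean-square distance of the atoms from their centroid."""
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one atom")
    # Centroid of the conformation.
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    # Mean squared distance from the centroid.
    msd = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords) / n
    return math.sqrt(msd)


# Toy conformation (coordinates in nm): four atoms on the corners of a square.
toy = [(1.0, 1.0, 0.0), (1.0, -1.0, 0.0), (-1.0, 1.0, 0.0), (-1.0, -1.0, 0.0)]
print(f"Rg: {radius_of_gyration(toy):.3f} [nm]")
```

Applying such a function to every conformation and averaging the results gives an ensemble-level value comparable in spirit to the one printed above.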


## Download ensembles from ATLAS

IDPET also allows you to download ensembles from [ATLAS](https://www.dsimb.inserm.fr/ATLAS/), a database of molecular dynamics (MD) simulations of protein chains from the PDB.

Let's start a new analysis to demonstrate this.

In [7]:
ens_codes = [
    Ensemble(
        code='1ail_A',
        database='atlas'  # Specify "atlas" here.
    )
]

analysis = EnsembleAnalysis(
    ens_codes,
    output_dir=None  # Optional: add your own path to a directory to save ensemble data.
)
analysis.load_trajectories();

Downloading entry 1ail_A from ATLAS.
Download complete. Saved to: /Users/giacomojanson/.idpet/data/1ail_A.zip
Downloaded file 1ail_A.zip from Atlas.
Extracted file /Users/giacomojanson/.idpet/data/1ail_A.zip.
Loading trajectory for 1ail_A_prod_R1_fit...
Loading trajectory for 1ail_A_prod_R2_fit...
Loading trajectory for 1ail_A_prod_R3_fit...


For each ATLAS system, three independent MD trajectories are available. IDPET will store them as independent `Ensemble` objects.

Each ATLAS MD trajectory contains 10001 conformations, so we will first randomly downsample each one to 250.

In [8]:
analysis.random_sample_trajectories(sample_size=250);

250 conformations sampled from 1ail_A_prod_R1_fit trajectory.
250 conformations sampled from 1ail_A_prod_R2_fit trajectory.
250 conformations sampled from 1ail_A_prod_R3_fit trajectory.
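
Conceptually, random downsampling amounts to picking a subset of frame indices. The sketch below assumes sampling without replacement and uses a hypothetical helper, `sample_frame_indices`; it illustrates the idea only and is not the implementation behind `random_sample_trajectories`.

```python
import random
from typing import List, Optional


def sample_frame_indices(
    n_frames: int, sample_size: int, seed: Optional[int] = None
) -> List[int]:
    """Hypothetical helper: choose `sample_size` distinct frame indices at random."""
    if sample_size > n_frames:
        raise ValueError("sample_size cannot exceed the number of frames")
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    # Sample without replacement, then sort to keep the original frame order.
    return sorted(rng.sample(range(n_frames), sample_size))


indices = sample_frame_indices(n_frames=10001, sample_size=250, seed=0)
print(len(indices), "frames kept out of 10001")
```

Fixing the seed is useful when you want the same subset of conformations each time the notebook is re-run.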


In [9]:
# Access via iteration.
for ensemble in analysis.ensembles:
    print("---")
    print("code:", ensemble.code)
    # The xyz coordinates are stored in numpy arrays.
    mdtraj_trajectory = ensemble.trajectory
    n_conformations, n_atoms, _xyz_dimensions = mdtraj_trajectory.xyz.shape
    n_residues = mdtraj_trajectory.topology.n_residues
    print(f"shape xyz ensemble data: conformations={n_conformations}, residues={n_residues}, atoms={n_atoms}")
    # Compute some example molecular features.
    for feature in ("rg", "prolateness", "asphericity", "sasa", "end_to_end"):
        feat_array = ensemble.get_features(feature)
        feat_avg = feat_array.mean()
        feat_std = feat_array.std()
        print(f"Average {feature}: mean={feat_avg:.3f}, std={feat_std:.3f}")
    

---
code: 1ail_A_prod_R1_fit
shape xyz ensemble data: conformations=250, residues=73, atoms=1172
Average rg: mean=1.361, std=0.038
Average prolateness: mean=0.412, std=0.101
Average asphericity: mean=0.139, std=0.034
Average sasa: mean=57.753, std=1.335
Average end_to_end: mean=3.317, std=0.449
---
code: 1ail_A_prod_R2_fit
shape xyz ensemble data: conformations=250, residues=73, atoms=1172
Average rg: mean=1.356, std=0.025
Average prolateness: mean=0.431, std=0.104
Average asphericity: mean=0.143, std=0.029
Average sasa: mean=58.516, std=1.040
Average end_to_end: mean=3.106, std=0.554
---
code: 1ail_A_prod_R3_fit
shape xyz ensemble data: conformations=250, residues=73, atoms=1172
Average rg: mean=1.342, std=0.019
Average prolateness: mean=0.398, std=0.081
Average asphericity: mean=0.138, std=0.017
Average sasa: mean=57.150, std=1.005
Average end_to_end: mean=2.988, std=0.376
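
As an illustration of one of these features, the sketch below computes an end-to-end distance on toy data. It assumes one common definition, the distance between one representative point of the first residue and one of the last residue; the exact atoms IDPET uses are not stated in this notebook. `statistics.pstdev` is used because NumPy's `.std()` also returns the population standard deviation by default.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def end_to_end_distance(residue_coords: Sequence[Point]) -> float:
    """Distance between the first and last residue positions of one conformation."""
    return math.dist(residue_coords[0], residue_coords[-1])


# Toy ensemble (coordinates in nm): two conformations of a three-residue chain.
toy_ensemble: List[List[Point]] = [
    [(0.0, 0.0, 0.0), (0.38, 0.0, 0.0), (0.76, 0.0, 0.0)],   # fully extended
    [(0.0, 0.0, 0.0), (0.38, 0.0, 0.0), (0.38, 0.38, 0.0)],  # bent at a right angle
]
distances = [end_to_end_distance(conf) for conf in toy_ensemble]
print(f"Average end_to_end: mean={mean(distances):.3f}, std={pstdev(distances):.3f}")
```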


## Load from local PDB files

You can also load ensembles from multi-model PDB files on your local system.

By default, IDPET will save a binary trajectory file in its data directory so that the ensemble can be read faster the next time you load it.

In [None]:
code = "your_ensemble_from_pdb_file"

ensemble = Ensemble(
    code=code,
    data_path="/home/your_username/path/to/multi_model_file.pdb"
)

# Add your ensemble to a new analysis (see other notebooks for more information).
analysis = EnsembleAnalysis([ensemble])
analysis.load_trajectories();

In [None]:
print(ensemble.code)
# Compute Flory exponent.
flory_exp = ensemble.get_features("flory_exponent")[0]
print(f"Flory exponent: {flory_exp:.3f}")
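
Before loading a local PDB file as an ensemble, it can help to check that it really contains multiple models. The following sketch counts `MODEL` records using only the standard library; `count_pdb_models` is a hypothetical helper for illustration and is not part of IDPET.

```python
import tempfile
from pathlib import Path


def count_pdb_models(pdb_path: str) -> int:
    """Count MODEL records in a PDB file; returns 0 if the file has none."""
    n_models = 0
    with open(pdb_path, "r") as handle:
        for line in handle:
            # The record name occupies the first six columns of a PDB line.
            if line[:6].strip() == "MODEL":
                n_models += 1
    return n_models


# Tiny synthetic file with two single-atom models, written only for this demo.
atom = "ATOM      1  CA  ALA A   1       0.000   0.000   0.000  1.00  0.00           C\n"
text = f"MODEL        1\n{atom}ENDMDL\nMODEL        2\n{atom}ENDMDL\nEND\n"
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_ensemble.pdb"
    toy_path.write_text(text)
    print("models:", count_pdb_models(str(toy_path)))
```

A count of 0 or 1 suggests the file holds a single structure rather than an ensemble.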

## Load from local trajectory files

Finally, you can load data from binary trajectory files.

You need a trajectory file (e.g., in DCD or XTC format) and a topology file (e.g., in PDB format).

In [None]:
code = "your_ensemble_from_traj_file"

ensemble = Ensemble(
    code=code,
    data_path="/home/your_username/path/to/trajectory_file.xtc",
    top_path="/home/your_username/path/to/multi_model_file.pdb"
)

# Instead of loading data by adding your ensemble to an `EnsembleAnalysis` object,
# you can also load an ensemble directly using the `load_trajectory` method of
# the `Ensemble` class.
data_dir = None
ensemble.load_trajectory(data_dir)

# Compute Flory exponent.
flory_exp = ensemble.get_features("flory_exponent")[0]
print(f"Flory exponent: {flory_exp:.3f}")