# 1. Data description and retrieval

In [5]:
import bin.params as params
from IPython.display import Markdown as md

## Data

We use **P**rotein **D**ata **B**ank (`.pdb`) files downloaded from the
`SAbDab - The Structural Antibody Database` project website (http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/).

The PDB files stored there describe common antibody structures, their spatial organisation, chemical properties and more.

In [5]:
md(f"We use only structures that have **resolution value less (=better) than {params.RESOLUTION_CUTOFF} Angstroms** (using top navbar `SAbDab > Structure search > Search structures by attribute > Resolution cutoff = {params.RESOLUTION_CUTOFF}`).")

We use only structures that have **resolution value less (=better) than 3.0 Angstroms** (using top navbar `SAbDab > Structure search > Search structures by attribute > Resolution cutoff = 3.0`).

---

## PDB file format

Detailed description: https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html
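As a quick illustration of the fixed-width layout, the sketch below slices one `ATOM` record by column position. The column ranges follow the `ATOM` record definition of the PDB format. The function name `parse_atom_line` and the sample record are made up for this example and are not part of the project code.

```python
def parse_atom_line(line: str) -> dict:
    """Slice one fixed-width ATOM/HETATM record into named fields."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "atom_name": line[12:16].strip(), # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],             # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
        "element": line[76:78].strip(),   # columns 77-78
    }


# Hypothetical record, assembled field by field so the column widths stay explicit.
sample = (
    "ATOM  "            # record name
    + f"{1:>5}"         # atom serial number
    + " "
    + f"{' N':<4}"      # atom name
    + " "               # alternate location indicator (empty)
    + "MET"             # residue name
    + " "
    + "A"               # chain identifier
    + f"{1:>4}"         # residue sequence number
    + " " * 4           # insertion code (empty) plus 3 unused columns
    + f"{27.340:>8.3f}{24.430:>8.3f}{2.614:>8.3f}"  # x, y, z in Angstroms
    + f"{1.00:>6.2f}{9.67:>6.2f}"                   # occupancy, temperature factor
    + " " * 10          # unused columns
    + f"{'N':>2}"       # element symbol
)

atom = parse_atom_line(sample)
print(atom["res_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```

Slicing by position rather than splitting on whitespace matters here, because neighbouring fields in a PDB record may touch each other without a separating space.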

Concise formal description: 

---

## Downloading the data

This process consists of several steps:

### a) Getting the database accession code

1. Run the cell below and follow the link it generates:

In [17]:
link = f'http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/search/?ABtype=All&method=All&species=All&resolution={params.RESOLUTION_CUTOFF}&rfactor=&antigen=All&ltype=All&constantregion=All&affinity=All&isin_covabdab=All&isin_therasabdab=All&chothiapos=&restype=ALA&field_0=Antigens&keyword_0=#downloads'
print(link)

http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/search/?ABtype=All&method=All&species=All&resolution=3.0&rfactor=&antigen=All&ltype=All&constantregion=All&affinity=All&isin_covabdab=All&isin_therasabdab=All&chothiapos=&restype=ALA&field_0=Antigens&keyword_0=#downloads


The link will take you to the **Download results** section immediately.
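The long search link is simply a base URL plus a query string, so it can also be assembled from a dictionary, which makes the individual filters easier to read and change. The following is a minimal sketch using only the standard library. The helper name `build_search_url` is hypothetical, and the literal `3.0` stands in for `params.RESOLUTION_CUTOFF`; the parameter names and values are copied from the link above.

```python
from urllib.parse import urlencode

BASE_URL = "http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/search/"


def build_search_url(resolution_cutoff: float) -> str:
    """Assemble the SAbDab search link from its individual query parameters."""
    query = {
        "ABtype": "All",
        "method": "All",
        "species": "All",
        "resolution": resolution_cutoff,  # the only filter this notebook changes
        "rfactor": "",
        "antigen": "All",
        "ltype": "All",
        "constantregion": "All",
        "affinity": "All",
        "isin_covabdab": "All",
        "isin_therasabdab": "All",
        "chothiapos": "",
        "restype": "ALA",
        "field_0": "Antigens",
        "keyword_0": "",
    }
    # The '#downloads' fragment makes the browser jump to the download section.
    return f"{BASE_URL}?{urlencode(query)}#downloads"


print(build_search_url(3.0))
```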

2. You will see a couple of links there. **Right-click** the one that offers **an archived zip file** and select the **Inspect** option from the context menu that appears. The web developer console should open in one part of your browser window (usually at the bottom).

3. Check the highlighted line in the web developer console. It should look like this:

> &lt;a href="/webapps/newsabdab/sabdab/archive/**20220601_0200368**/">zip file&lt;/a>

4. The numerical part of the link, highlighted in **bold** above, is your accession code. Your code will probably have different digits, which is expected. Copy **your code** into the cell below, replacing the old accession code that is already there, and run the cell.

In [6]:
DATABASE_ACCESSION_CODE = '20220601_0621156'

5. Hooray! :) You are done.

**Proceed straight to the next section, since your accession code will not be valid forever!** (You will need to obtain a new one when the current one expires.)
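If you prefer to paste the whole highlighted HTML line instead of picking out the digits by hand, the code can be extracted with a regular expression. This is only a sketch: the helper `extract_accession_code` is hypothetical, and the pattern assumes that every accession code has the `digits_digits` shape seen in the example above.

```python
import re


def extract_accession_code(anchor_html: str) -> str:
    """Return the accession code found in an '/sabdab/archive/<code>/' link."""
    # Assumption: the code is two runs of digits joined by an underscore.
    match = re.search(r"/sabdab/archive/([0-9]+_[0-9]+)/", anchor_html)
    if match is None:
        raise ValueError("no archive link found in the given HTML")
    return match.group(1)


snippet = '<a href="/webapps/newsabdab/sabdab/archive/20220601_0200368/">zip file</a>'
print(extract_accession_code(snippet))
```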

### b) Downloading the actual data

Run the following cell to download the zipped structure directory. The zip archive is about 4 GB, so the download will take a while:

In [12]:
command = f"""
mkdir -p {params.DATA_DIR}/pdb
cd {params.DATA_DIR}/pdb
echo 'downloading the SAbDab data to'  $(pwd) '...'
wget http://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/archive/{DATABASE_ACCESSION_CODE}/ -o sabdab_download.log
"""
! $command

downloading the SAbDab data to /SFS/user/wp/benor/test/proto-moto/data/pdb ...


**Note: Check the `sabdab_download.log` file in your `data/pdb` directory to see the progress of your download. It may contain, for example, an `Internal Server Error` if your accession code has expired. In that case you need to obtain a new accession code and start the download again.**
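The log check can also be done from a cell. The sketch below simply collects every log line that mentions an error; the helper `find_log_errors` is hypothetical, and the sample log text is made up for illustration, so the exact wording of a real `wget` log may differ.

```python
def find_log_errors(log_text: str) -> list:
    """Return all log lines that mention an error (case-insensitive)."""
    return [line.strip() for line in log_text.splitlines() if "error" in line.lower()]


# Illustrative log content only, not copied from a real download.
example_log = """\
Connecting to opig.stats.ox.ac.uk ... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
"""

for line in find_log_errors(example_log):
    print("Problem found:", line)

# With a real log (path assumed from the download cell above):
# from pathlib import Path
# find_log_errors(Path("data/pdb/sabdab_download.log").read_text())
```

An empty result does not prove that the download succeeded; it only means that no line containing the word "error" was found.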

### c) Unzip and reorganize the downloaded data

Unzip the archive and reorganize the subdirectories. This may again take a while because of the size of the downloaded zip archive:

In [11]:
command = f"""
echo 'Creating required directory structure...'
cd {params.DATA_DIR}/pdb
mkdir -p raw chothia imgt

# the data was downloaded into the 'index.html' file (that is weird, I know)
# rename it to a .zip file and unzip it
echo 'Renaming and unzipping the downloaded file...'
mv index.html sabdab_chothia.zip
unzip -qq sabdab_chothia.zip

echo 'Moving raw .pdb files ...'
mv *.pdb raw 

echo 'Removing the original .zip file...'
rm -rf sabdab_chothia.zip
"""

! $command
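Because an expired accession code can leave an HTML error page in `index.html` instead of the archive, it may be worth confirming that the downloaded file really is a zip archive before renaming and unzipping it. A minimal sketch with the standard library follows; the helper name `looks_like_zip` is hypothetical, and the file location in the usage comment is assumed from the cells above.

```python
import zipfile
from pathlib import Path


def looks_like_zip(path) -> bool:
    """Return True if the file exists and has a valid zip archive signature."""
    path = Path(path)
    return path.is_file() and zipfile.is_zipfile(path)


# Usage (location assumed from the download cell):
# looks_like_zip(Path(params.DATA_DIR) / "pdb" / "index.html")
```

If this returns `False`, inspect `sabdab_download.log` and repeat the download with a fresh accession code.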

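**Sanity check -** output the number of downloaded structures (note that `ls -l` also prints a leading `total` line, so the reported count is one higher than the number of directory entries):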

In [10]:
command = f"""
cd {params.DATA_DIR}/pdb/chothia
ls -l | wc -l 
"""

! $command

4326
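The same check can be done in Python, which avoids counting anything other than structure files. This sketch assumes that the structures in the directory carry a `.pdb` extension; the helper `count_structures` is hypothetical.

```python
from pathlib import Path


def count_structures(directory, suffix: str = ".pdb") -> int:
    """Count regular files in 'directory' whose name ends with 'suffix'."""
    return sum(
        1
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.name.endswith(suffix)
    )


# Usage (directory assumed from the cell above):
# count_structures(Path(params.DATA_DIR) / "pdb" / "chothia")
```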


**All done :)**