
# Project: Predicting chemical shift <=> structure

### **Personal project proposal since April 2024**

#### **Composed by Qian Wang**

AlphaFold2's predictions are quite impressive. However, they are based on the primary structure (the amino-acid sequence) of the peptide, which ignores the protein's individual environment, such as pH, temperature, ionic strength, or the presence of other molecules such as ligands. Fortunately, the NMR chemical-shift assignment of each atom in a protein molecule is the closest available description of the protein's in vivo state. In addition, an NMR spectrum contains a great deal of information about each atom that has never been fully explored. Most of this information has been ignored because it arises from a complicated combination of many minor quantum effects, which together produce a large effect.

This project attempts prediction in two directions, using machine learning and deep learning models trained on data from the BMRB and PDB databases:

1. predict NMR chemical shift based on a given structure (PDB format)

2. predict the structure from the acquired NMR chemical shift.


Additionally, we will try to build models that predict secondary structure, dihedral angles, or even dynamics from chemical shifts.

**Table of contents**:

1. Reading data from BMRB
2. Reading data from PDB
3.




In [None]:
# install requests and biopython for later use
!pip install requests
!pip install biopython

Collecting biopython
  Downloading biopython-1.83-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m12.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: biopython
Successfully installed biopython-1.83


In [None]:
# show all columns
import pandas as pd
pd.set_option('display.max_columns', None)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 2.0 Read the files from PDB database

## 2.0.1 read index files

In [None]:
# Read the index table from Drive; it maps each BMRB entry ID to its PDB ID(s)
path_BMRB_ID = "/content/drive/MyDrive/Colab Notebooks/BMRB_PDB_Project/rawdata/query_grid.csv"
df_query_CSV = pd.read_csv(path_BMRB_ID)

In [None]:
df_EntryID_PDB = df_query_CSV[['Entry_ID', 'pdb_ids']]

In [None]:
df_EntryID_PDB.head()

|   | Entry_ID | pdb_ids |
|---|----------|---------|
| 0 | 4023 | 2SPZ,1Q2N |
| 1 | 4052 | 1JOO |
| 2 | 4053 | 1JOQ |
| 3 | 4089 | 2DEF |
| 4 | 4090 | 2EZH |
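As the first row shows, a single BMRB entry can list several comma-separated PDB IDs. A minimal sketch of flattening that column into one row per (entry, PDB ID) pair is given here. The helper name `explode_pdb_ids` and the toy frame `df_demo` are illustrative assumptions, not part of the notebook; the values are copied from the output above.

```python
import pandas as pd

# Toy frame mirroring the first rows of the index table (illustrative only).
df_demo = pd.DataFrame({
    "Entry_ID": [4023, 4052, 4053],
    "pdb_ids": ["2SPZ,1Q2N", "1JOO", "1JOQ"],
})

def explode_pdb_ids(df):
    """Return one row per (Entry_ID, pdb_id) pair (hypothetical helper)."""
    out = df.dropna(subset=["pdb_ids"]).copy()    # skip entries without a PDB ID
    out["pdb_id"] = out["pdb_ids"].str.split(",")  # "2SPZ,1Q2N" -> ["2SPZ", "1Q2N"]
    out = out.explode("pdb_id")                    # one list element per row
    out["pdb_id"] = out["pdb_id"].str.strip()      # remove stray whitespace
    return out[["Entry_ID", "pdb_id"]].reset_index(drop=True)

df_pairs = explode_pdb_ids(df_demo)
print(df_pairs)
```

Keeping `Entry_ID` next to each PDB ID may make it easier to join the downloaded structures back to their chemical-shift entries later on.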


## 2.0.2 read concurrently after sharding
**Colab provides only 2 CPU cores, so I decided to run this step on my local Mac, which has 8 cores and is faster.**
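For downloads, most of the time is likely spent waiting on the network rather than on the CPU, so a thread pool can help even on a machine with few cores. The sketch below illustrates this with a simulated request; `fake_download`, `run`, and the 0.05 s latency are assumptions made for the illustration and are not taken from the notebook.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(pdb_id, latency=0.05):
    # Stand-in for a network request: the thread simply waits, as it would on I/O.
    time.sleep(latency)
    return pdb_id.lower()

def run(ids, workers):
    # Process all IDs with the given number of threads and time the whole batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(fake_download, ids))  # map preserves input order
    return results, time.perf_counter() - start

ids = ["2SPZ", "1Q2N", "1JOO", "1JOQ", "2DEF", "2EZH", "1C05", "1HYJ"]
serial_results, serial_time = run(ids, workers=1)
pooled_results, pooled_time = run(ids, workers=8)
print(f"1 worker: {serial_time:.2f}s, 8 workers: {pooled_time:.2f}s")
```

With real requests the speed-up also depends on the server and the connection, so the worker count is worth tuning empirically.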

In [None]:
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
from io import StringIO
from math import ceil

# Check the number of CPU cores in the system
num_cores = os.cpu_count()
print(f"Number of CPU cores available: {num_cores}")

# Set max_workers based on the number of CPU cores
# os.cpu_count() can return None, so fall back to 1 core in that case
max_workers = (num_cores or 1) * 2  # or another multiplier based on your testing

def get_pdb_files_by_url(pdb_num):
    url_pre = "https://models.rcsb.org/v1/"
    url_post = "/full?encoding=cif&copy_all_categories=false&download=false"
    url = url_pre + pdb_num + url_post

    try:
        # Use a timeout so a stalled connection cannot block a worker thread forever
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Check if the request was successful

        # Read the content into MMCIF2Dict
        dico = MMCIF2Dict(StringIO(response.text))
        df_temp = pd.DataFrame.from_dict(dico, orient='index').transpose()

        print(f"Successfully added PDB ID {pdb_num}.")
        return df_temp
    except requests.exceptions.RequestException as e:
        print(f"Request failed for PDB ID {pdb_num}: {e}")
    except Exception as e:
        print(f"Failed to process PDB ID {pdb_num}: {e}")
    return pd.DataFrame()  # Return an empty DataFrame on failure

def download_pdb_files_concurrently(pdb_ids, max_workers=10):
    df_BMRB_all = pd.DataFrame()
    chunk_size = ceil(len(pdb_ids) / max_workers)
    pdb_chunks = [pdb_ids[i:i + chunk_size] for i in range(0, len(pdb_ids), chunk_size)]

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_chunk = {executor.submit(download_chunk, chunk): chunk for chunk in pdb_chunks}

        for future in as_completed(future_to_chunk):
            try:
                df_temp = future.result()
                df_BMRB_all = pd.concat([df_BMRB_all, df_temp], ignore_index=True)
            except Exception as e:
                print(f"Error processing a chunk of PDB IDs: {e}")

    return df_BMRB_all

def download_chunk(chunk):
    df_chunk = pd.DataFrame()
    for pdb_id in chunk:
        df_temp = get_pdb_files_by_url(pdb_id)
        df_chunk = pd.concat([df_chunk, df_temp], ignore_index=True)
    return df_chunk

def main():
    # Load the PDB IDs from your CSV file
    path_BMRB_ID = "/content/drive/MyDrive/Colab Notebooks/BMRB_PDB_Project/rawdata/query_grid.csv"
    df_query_CSV = pd.read_csv(path_BMRB_ID)
    df_EntryID_PDB = df_query_CSV[['Entry_ID', 'pdb_ids']]

    # Prepare the list of PDB IDs
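    # Entries may list several comma-separated IDs; skip rows with no PDB ID
    pdb_ids = []
    for pdb_field in df_EntryID_PDB['pdb_ids'].dropna():
        pdb_ids.extend(pdb_id.strip() for pdb_id in str(pdb_field).split(','))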

    chunk_size = 200
    for i in range(0, len(pdb_ids), chunk_size):
        pdb_chunk = pdb_ids[i:i + chunk_size]

        # Download PDB files in parallel
        df_BMRB_all = download_pdb_files_concurrently(pdb_chunk, max_workers=max_workers)

        # Remove duplicates and reset index
        # df_BMRB_all = df_BMRB_all.drop_duplicates().reset_index(drop=True)

        # Save the DataFrame to a CSV file
        temp_path = f"/content/drive/MyDrive/Colab Notebooks/BMRB_PDB_Project/rawdata/df_PDB_{i//chunk_size + 1}.csv"
        df_BMRB_all.to_csv(temp_path, index=False)
        print(f"Successfully saved to {temp_path}!")

if __name__ == '__main__':
    main()


Number of CPU cores available: 2
Successfully added PDB ID 1C05.
Successfully added PDB ID 2SPZ.
Successfully added PDB ID 1Q2N.
Successfully added PDB ID 1HYJ.
Successfully added PDB ID 1JOO.
Successfully added PDB ID 1JM4.
Successfully added PDB ID 1JR6.
Successfully added PDB ID 1HYI.
Successfully added PDB ID 1DOQ.
Successfully added PDB ID 1ZRR.
Successfully added PDB ID 1EIG.
Successfully added PDB ID 1EIH.
Successfully added PDB ID 1EOQ.
Successfully added PDB ID 7HSC.
Successfully added PDB ID 1KQQ.
Successfully added PDB ID 2AN7.
Successfully added PDB ID 1JOQ.
Successfully added PDB ID 1CL4.
Successfully added PDB ID 1LY7.
Successfully added PDB ID 1DLZ.
Successfully added PDB ID 1KD6.
Successfully added PDB ID 1FPW.
Successfully added PDB ID 1DV5.
Successfully added PDB ID 1EE7.
Successfully added PDB ID 2DEF.
Successfully added PDB ID 1LS4.
Successfully added PDB ID 1FHO.
Successfully added PDB ID 2EZH.
Successfully added PDB ID 1M58.
Successfully added PDB ID 1A5J.
Success

KeyboardInterrupt: 
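Because the run above was interrupted, it may be useful to resume without downloading shards that were already saved. The following is a minimal sketch under the assumption that each finished shard exists on disk as `df_PDB_<n>.csv`, the naming used in `main()`. The helpers `shard_path` and `pending_shards` are hypothetical, and the demo uses made-up IDs in a temporary directory.

```python
import os
import tempfile

def shard_path(out_dir, shard_no):
    # Same naming pattern as the save step: df_PDB_1.csv, df_PDB_2.csv, ...
    return os.path.join(out_dir, f"df_PDB_{shard_no}.csv")

def pending_shards(pdb_ids, out_dir, chunk_size=200):
    """Yield (shard_no, ids) for every shard whose CSV is not on disk yet."""
    for i in range(0, len(pdb_ids), chunk_size):
        shard_no = i // chunk_size + 1
        if not os.path.exists(shard_path(out_dir, shard_no)):
            yield shard_no, pdb_ids[i:i + chunk_size]

# Demo with made-up IDs: pretend shard 1 was written before the interruption.
with tempfile.TemporaryDirectory() as tmp:
    ids = [f"ID{n:03d}" for n in range(5)]
    open(shard_path(tmp, 1), "w").close()
    todo = list(pending_shards(ids, tmp, chunk_size=2))
print(todo)
```

This only checks that a file exists. A shard that was partially written when the interruption happened would still be skipped, so a stricter check may be needed.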