<a href="https://colab.research.google.com/github/wangqian2149185/BMRB-API/blob/master/Sec03_combine_PDB_BMRB_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Predicting chemical shift <=> structure

### **Personal project proposal since April 2024**

#### **Composed by Qian Wang**

AlphaFold2's predictions are impressive, but they are all based on the primary structure (the amino acid sequence) of the peptide. This ignores the protein's individual environment, such as pH, temperature, ionic strength, or the presence of other molecules such as ligands. Fortunately, the NMR chemical shift assignment of each atom in a protein molecule is the closest description of the protein's in vivo state. In addition, an NMR spectrum contains a great deal of information about each atom in the molecule that has never been fully exploited. Most of this information is ignored because many minor quantum effects combine in complicated ways, even though their combined effect is large.

This project attempts prediction in two directions, using machine learning and deep learning models trained on data from the BMRB and PDB databases:

1. Predict NMR chemical shifts from a given structure (PDB format).

2. Predict the structure from the acquired NMR chemical shifts.


Additionally, we will try to build models that predict secondary structure, dihedral angles, or even dynamics from chemical shifts.

**Table of contents**:

1. Reading data from BMRB
2. Reading data from PDB
3. Combine the downloaded BMRB and PDB files, clean the data, and get it ready for modeling.




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# show all columns
import pandas as pd
import ast
pd.set_option('display.max_columns', None)

# 3.0 Read the BMRB CSV files

## 3.0.1 Read BMRB CSV files

In [None]:
import glob
import os


def read_all_csv_in_folder(folder_path):
    """
    Reads all CSV files in the specified folder and concatenates them into a single DataFrame.

    Args:
        folder_path (str): Path to the folder containing the CSV files.

    Returns:
        pd.DataFrame: A DataFrame containing all the data from the CSV files.
    """
    # Use glob to get all CSV files in the folder
    csv_files = glob.glob(os.path.join(folder_path, "*.csv"))
    if not csv_files:
        # pd.concat fails with an unhelpful message on an empty list
        raise FileNotFoundError(f"No CSV files found in {folder_path}")

    # Read each CSV file into its own DataFrame
    dfs = [pd.read_csv(csv_file) for csv_file in csv_files]

    # Concatenate all DataFrames into a single DataFrame
    return pd.concat(dfs, ignore_index=True)

In [None]:
# Read the previously downloaded BMRB CSV files
bmrb_csv_path = "/content/drive/MyDrive/Colab Notebooks/BMRB_PDB_Project/rawdata/bmrb_folders"
df_bmrb_raw = read_all_csv_in_folder(bmrb_csv_path)

In [None]:
df_bmrb_raw

Unnamed: 0,Comp_index_ID,Comp_ID,Atom_ID,Atom_type,Val,BMRB_ID,pH,temperature_Kelvin
0,1,LYS,C,C,174.500,4023,6.5,303.0
1,1,LYS,CA,C,56.000,4023,6.5,303.0
2,1,LYS,HA,H,4.350,4023,6.5,303.0
3,1,LYS,CB,C,33.000,4023,6.5,303.0
4,1,LYS,HB2,H,1.750,4023,6.5,303.0
...,...,...,...,...,...,...,...,...
6076518,84,GLN,HE22,H,6.790,51834,6.4,277.0
6076519,84,GLN,CA,C,57.522,51834,6.4,277.0
6076520,84,GLN,CB,C,30.544,51834,6.4,277.0
6076521,84,GLN,N,N,127.162,51834,6.4,277.0
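
With roughly six million rows, the BMRB table is dominated by a handful of repeated labels. The following is a minimal sketch, not part of the original workflow, of how explicit dtypes could reduce its memory footprint. The column names come from the table above; the dtype choices and the in-memory sample are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for one BMRB CSV (column names taken from the table above).
sample_csv = io.StringIO(
    "Comp_index_ID,Comp_ID,Atom_ID,Atom_type,Val,BMRB_ID,pH,temperature_Kelvin\n"
    "1,LYS,C,C,174.500,4023,6.5,303.0\n"
    "1,LYS,CA,C,56.000,4023,6.5,303.0\n"
    "1,LYS,HA,H,4.350,4023,6.5,303.0\n"
)

# Assumed dtypes: repeated labels as categoricals, measurements as float32.
compact_dtypes = {
    "Comp_ID": "category",
    "Atom_ID": "category",
    "Atom_type": "category",
    "Val": "float32",
    "pH": "float32",
    "temperature_Kelvin": "float32",
}

df_compact = pd.read_csv(sample_csv, dtype=compact_dtypes)
print(df_compact.dtypes)
print(df_compact.memory_usage(deep=True).sum(), "bytes")
```

When frames whose categorical columns hold different category sets are concatenated, pandas falls back to the object dtype, so the conversion may need to be repeated after `pd.concat`.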


In [None]:
#df_bmrb_raw.to_csv("/content/drive/MyDrive/Colab Notebooks/BMRB_PDB_Project/rawdata/df_bmrb_raw.csv", index=False)

In [None]:
#df_bmrb_raw = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/BMRB_PDB_Project/rawdata/df_bmrb_raw.csv")

In [None]:
#df_bmrb_raw



# 3.1 Read the PDB CSV files

## 3.1.1 Read CSV files

In [None]:
# Read the previously downloaded PDB CSV files (disabled; loading them all at once exhausts RAM)
# pdb_csv_path = "/content/drive/MyDrive/Colab Notebooks/BMRB_PDB_Project/rawdata/pdb_folders"
# df_pdb_raw = read_all_csv_in_folder(pdb_csv_path)

In [None]:
# Reading all of the PDB CSV files into df_pdb_raw at once uses up all of the system RAM.
# So the data is divided: each part is read and merged separately,
# and the parts are combined right before modeling.

In [None]:
# First, read a single PDB CSV file
pdb_csv_1_path = "/content/drive/MyDrive/Colab Notebooks/BMRB_PDB_Project/rawdata/pdb_folders/df_PDB_1.csv"
df_pdb_raw = pd.read_csv(pdb_csv_1_path)
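
The one-file-at-a-time strategy can be sketched as a generator that yields one DataFrame per CSV file and discards unneeded columns while parsing. This is an illustrative sketch, not the notebook's own code: `iter_pdb_csvs` is a hypothetical helper, and the demo files are throwaway stand-ins for the real `df_PDB_*.csv` files.

```python
import glob
import os
import tempfile

import pandas as pd


def iter_pdb_csvs(folder_path, wanted_columns):
    """Yield one DataFrame per CSV file, keeping only the wanted columns."""
    wanted = set(wanted_columns)
    for csv_file in sorted(glob.glob(os.path.join(folder_path, "*.csv"))):
        # A callable `usecols` skips unwanted columns at parse time and
        # does not fail when a wanted column is absent from a file.
        yield pd.read_csv(csv_file, usecols=lambda name: name in wanted)


# Demo on two tiny throwaway files (hypothetical content).
with tempfile.TemporaryDirectory() as tmp_dir:
    for i, pdb_id in enumerate(["1I4V", "1F6U"], start=1):
        pd.DataFrame(
            {"data_": [pdb_id], "_exptl.method": ["['SOLUTION NMR']"], "unused": [0]}
        ).to_csv(os.path.join(tmp_dir, f"df_PDB_{i}.csv"), index=False)

    parts = list(iter_pdb_csvs(tmp_dir, ["data_", "_exptl.method"]))

print([part.columns.tolist() for part in parts])
```

Because only one file is held in memory per iteration, each part can be cleaned and reduced before the next one is loaded.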

In [None]:
columns_wanted = ['_exptl.method',
                  'data_',
                  '_entity.formula_weight',
                  '_entity.id',
                  '_entity.pdbx_number_of_molecules',
                  '_pdbx_struct_assembly.oligomeric_details',
                  '_pdbx_struct_assembly.oligomeric_count',
                  '_pdbx_struct_assembly_gen.asym_id_list',
                  '_struct_conf.beg_label_comp_id',
                  '_struct_conf.beg_label_seq_id',
                  '_struct_conf.beg_label_asym_id',
                  '_struct_conf.beg_label_entity_id',
                  '_struct_conf.end_label_comp_id',
                  '_struct_conf.end_label_seq_id',
                  '_struct_conf.end_label_asym_id',
                  '_struct_conf.end_label_entity_id',
                  '_struct_conf.end_auth_asym_id',
                  '_struct_conf.pdbx_PDB_helix_class',
                  '_struct_conf.pdbx_PDB_helix_length',
                  '_entity_poly.pdbx_seq_one_letter_code',
                  '_entity_poly_seq.mon_id',
                  '_atom_site.type_symbol',
                  '_atom_site.label_atom_id',
                  '_atom_site.label_comp_id',
                  '_atom_site.label_seq_id',
                  '_chem_comp.formula_weight',
                  '_atom_site.id',
                  '_atom_site.type_symbol',
                  '_atom_site.label_atom_id',
                  '_atom_site.label_comp_id',
                  '_atom_site.label_seq_id',
                  '_atom_site.Cartn_x',
                  '_atom_site.Cartn_y',
                  '_atom_site.Cartn_z',
                  '_struct_sheet_range.sheet_id',
                  '_struct_sheet_range.id',
                  '_struct_sheet_range.beg_label_comp_id',
                  '_struct_sheet_range.beg_label_seq_id',
                  '_struct_sheet_range.beg_auth_asym_id',
                  '_struct_sheet_range.end_label_comp_id',
                  '_struct_sheet_range.end_label_seq_id',
                  '_struct_conn.ptnr1_label_comp_id',
                  '_struct_conn.ptnr1_label_seq_id',
                  '_struct_conn.ptnr2_label_comp_id',
                  '_struct_conn.ptnr2_label_seq_id'
                  ]
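
The list above names four `_atom_site` columns twice (`type_symbol`, `label_atom_id`, `label_comp_id`, `label_seq_id`), so selecting with it produces repeated column labels. As a small sketch of one way to avoid that, the list can be de-duplicated while keeping its order; `columns_with_repeats` is a shortened stand-in, not the full list.

```python
# Shortened stand-in for `columns_wanted`, with the same kind of repetition.
columns_with_repeats = [
    "data_",
    "_atom_site.type_symbol",
    "_atom_site.label_atom_id",
    "_chem_comp.formula_weight",
    "_atom_site.type_symbol",
    "_atom_site.label_atom_id",
]

# dict preserves insertion order, so this keeps the first occurrence of each name.
columns_unique = list(dict.fromkeys(columns_with_repeats))
print(columns_unique)
```

A plain `set()` would also remove the repeats but would not preserve the column order.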

In [None]:
df_pdb_raw.head(10)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,data_,_model_server_result.job_id,_model_server_result.datetime_utc,_model_server_result.server_version,_model_server_result.query_name,_model_server_result.source_id,_model_server_result.entry_id,_entry.id,_exptl.entry_id,_exptl.method,_entity.details,_entity.formula_weight,_entity.id,_entity.src_method,_entity.type,_entity.pdbx_description,_entity.pdbx_number_of_molecules,_entity.pdbx_mutation,_entity.pdbx_fragment,_entity.pdbx_ec,_cell.angle_alpha,_cell.angle_beta,_cell.angle_gamma,_cell.entry_id,_cell.length_a,_cell.length_b,_cell.length_c,_cell.Z_PDB,_cell.pdbx_unique_axis,_pdbx_struct_assembly.method_details,_pdbx_struct_assembly.oligomeric_details,_pdbx_struct_assembly.oligomeric_count,_pdbx_struct_assembly.details,_pdbx_struct_assembly.id,_pdbx_struct_assembly_gen.asym_id_list,_pdbx_struct_assembly_gen.assembly_id,_pdbx_struct_assembly_gen.oper_expression,_pdbx_struct_oper_list.id,_pdbx_struct_oper_list.type,_pdbx_struct_oper_list.name,_pdbx_struct_oper_list.symmetry_operation,_pdbx_struct_oper_list.matrix[1][1],_pdbx_struct_oper_list.matrix[1][2],_pdbx_struct_oper_list.matrix[1][3],_pdbx_struct_oper_list.matrix[2][1],_pdbx_struct_oper_list.matrix[2][2],_pdbx_struct_oper_list.matrix[2][3],_pdbx_struct_oper_list.matrix[3][1],_pdbx_struct_oper_list.matrix[3][2],_pdbx_struct_oper_list.matrix[3][3],_pdbx_struct_oper_list.vector[1],_pdbx_struct_oper_list.vector[2],_pdbx_struct_oper_list.vector[3],_struct_conf.conf_type_id,_struct_conf.id,_struct_conf.beg_label_comp_id,_struct_conf.beg_label_seq_id,_struct_conf.pdbx_beg_PDB_ins_code,_struct_conf.beg_label_asym_id,_struct_conf.beg_label_entity_id,_struct_conf.beg_auth_comp_id,_struct_conf.beg_auth_seq_id,_struct_conf.beg_auth_asym_id,_struct_conf.end_label_comp_id,_struct_conf.end_label_seq_id,_struct_conf.pdbx_end_PDB_ins_code,_struct_conf.end_label_asym_id,_struct_conf.end_label_entity_id,_struct_conf.end_auth_comp_id,_struct_conf.end_auth_seq_id,_struct_conf.end_auth_asym_id,
_struct_conf.pdbx_PDB_helix_class,_struct_conf.details,_struct_conf.pdbx_PDB_helix_length,_struct_asym.details,_struct_asym.entity_id,_struct_asym.id,_struct_asym.pdbx_modified,_struct_asym.pdbx_blank_PDB_chainid_flag,_entity_poly.entity_id,_entity_poly.nstd_linkage,_entity_poly.nstd_monomer,_entity_poly.type,_entity_poly.pdbx_strand_id,_entity_poly.pdbx_seq_one_letter_code,_entity_poly.pdbx_seq_one_letter_code_can,_entity_poly.pdbx_target_identifier,_entity_poly_seq.entity_id,_entity_poly_seq.hetero,_entity_poly_seq.mon_id,_entity_poly_seq.num,_chem_comp.formula,_chem_comp.formula_weight,_chem_comp.id,_chem_comp.mon_nstd_flag,_chem_comp.name,_chem_comp.type,_chem_comp.pdbx_synonyms,_chem_comp_bond.atom_id_1,_chem_comp_bond.atom_id_2,_chem_comp_bond.comp_id,_chem_comp_bond.value_order,_chem_comp_bond.pdbx_ordinal,_chem_comp_bond.pdbx_stereo_config,_chem_comp_bond.pdbx_aromatic_flag,_atom_sites.entry_id,_atom_sites.fract_transf_matrix[1][1],_atom_sites.fract_transf_matrix[1][2],_atom_sites.fract_transf_matrix[1][3],_atom_sites.fract_transf_matrix[2][1],_atom_sites.fract_transf_matrix[2][2],_atom_sites.fract_transf_matrix[2][3],_atom_sites.fract_transf_matrix[3][1],_atom_sites.fract_transf_matrix[3][2],_atom_sites.fract_transf_matrix[3][3],_atom_sites.fract_transf_vector[1],_atom_sites.fract_transf_vector[2],_atom_sites.fract_transf_vector[3],_atom_site.group_PDB,_atom_site.id,_atom_site.type_symbol,_atom_site.label_atom_id,_atom_site.label_comp_id,_atom_site.label_seq_id,_atom_site.label_alt_id,_atom_site.pdbx_PDB_ins_code,_atom_site.label_asym_id,_atom_site.label_entity_id,_atom_site.Cartn_x,_atom_site.Cartn_y,_atom_site.Cartn_z,_atom_site.occupancy,_atom_site.B_iso_or_equiv,_atom_site.pdbx_formal_charge,_atom_site.auth_atom_id,_atom_site.auth_comp_id,_atom_site.auth_seq_id,_atom_site.auth_asym_id,_atom_site.pdbx_PDB_model_num,_model_server_stats.io_time_ms,_model_server_stats.parse_time_ms,_model_server_stats.create_model_time_ms,_model_server_stats.query_time_ms,_
model_server_stats.encode_time_ms,_model_server_stats.element_count,_struct_sheet_range.sheet_id,_struct_sheet_range.id,_struct_sheet_range.beg_label_comp_id,_struct_sheet_range.beg_label_seq_id,_struct_sheet_range.pdbx_beg_PDB_ins_code,_struct_sheet_range.beg_label_asym_id,_struct_sheet_range.beg_label_entity_id,_struct_sheet_range.beg_auth_comp_id,_struct_sheet_range.beg_auth_seq_id,_struct_sheet_range.beg_auth_asym_id,_struct_sheet_range.end_label_comp_id,_struct_sheet_range.end_label_seq_id,_struct_sheet_range.pdbx_end_PDB_ins_code,_struct_sheet_range.end_label_asym_id,_struct_sheet_range.end_label_entity_id,_struct_sheet_range.end_auth_comp_id,_struct_sheet_range.end_auth_seq_id,_struct_sheet_range.end_auth_asym_id,_struct_sheet_range.symmetry,_struct_conn.conn_type_id,_struct_conn.details,_struct_conn.id,_struct_conn.ptnr1_label_asym_id,_struct_conn.ptnr1_label_atom_id,_struct_conn.ptnr1_label_comp_id,_struct_conn.ptnr1_label_seq_id,_struct_conn.ptnr1_auth_asym_id,_struct_conn.ptnr1_auth_comp_id,_struct_conn.ptnr1_auth_seq_id,_struct_conn.ptnr1_symmetry,_struct_conn.ptnr2_label_asym_id,_struct_conn.ptnr2_label_atom_id,_struct_conn.ptnr2_label_comp_id,_struct_conn.ptnr2_label_seq_id,_struct_conn.ptnr2_auth_asym_id,_struct_conn.ptnr2_auth_comp_id,_struct_conn.ptnr2_auth_seq_id,_struct_conn.ptnr2_symmetry,_struct_conn.pdbx_ptnr1_PDB_ins_code,_struct_conn.pdbx_ptnr1_label_alt_id,_struct_conn.pdbx_ptnr1_standard_comp_id,_struct_conn.pdbx_ptnr2_PDB_ins_code,_struct_conn.pdbx_ptnr2_label_alt_id,_struct_conn.pdbx_ptnr3_PDB_ins_code,_struct_conn.pdbx_ptnr3_label_alt_id,_struct_conn.pdbx_ptnr3_label_asym_id,_struct_conn.pdbx_ptnr3_label_atom_id,_struct_conn.pdbx_ptnr3_label_comp_id,_struct_conn.pdbx_ptnr3_label_seq_id,_struct_conn.pdbx_PDB_id,_struct_conn.pdbx_dist_value,_struct_conn.pdbx_value_order,_symmetry.entry_id,_symmetry.cell_setting,_symmetry.Int_Tables_number,_symmetry.space_group_name_Hall,_symmetry.space_group_name_H-M,_pdbx_nonpoly_scheme.asym_id,_pdbx_nonp
oly_scheme.entity_id,_pdbx_nonpoly_scheme.mon_id,_pdbx_nonpoly_scheme.pdb_strand_id,_pdbx_nonpoly_scheme.ndb_seq_num,_pdbx_nonpoly_scheme.pdb_seq_num,_pdbx_nonpoly_scheme.auth_seq_num,_pdbx_nonpoly_scheme.pdb_mon_id,_pdbx_nonpoly_scheme.auth_mon_id,_pdbx_nonpoly_scheme.pdb_ins_code
0,4,4,1I4V,['0zLu1KA8Y1cHYVBLYOt9sA'],['2024-05-19 01:51:23'],['0.9.11'],['full'],['pdb-bcif'],['1I4V'],['1I4V'],['1I4V'],['SOLUTION NMR'],['?'],['12308.985'],['1'],['man'],['polymer'],"[""UMUD' PROTEIN""]",['2'],['G25A'],['?'],['3.4.21.-'],['?'],['?'],['?'],['1I4V'],['?'],['?'],['?'],['1'],['?'],['?'],['dimeric'],['2'],['author_defined_assembly'],['1'],"['A,B']",['1'],['1'],['1'],['identity operation'],['1_555'],['?'],['1'],['0'],['0'],['0'],['1'],['0'],['0'],['0'],['1'],['0'],['0'],['0'],"['helx_p', 'helx_p']","['helx_p1', 'helx_p2']","['ASP', 'ASP']","['15', '15']","['.', '.']","['A', 'B']","['1', '1']","['ASP', 'ASP']","['39', '39']","['A', 'B']","['ILE', 'ILE']","['21', '21']","['.', '.']","['A', 'B']","['1', '1']","['ILE', 'ILE']","['45', '45']","['A', 'B']","['1', '1']","['?', '?']","['7', '7']","['?', '?']","['1', '1']","['A', 'B']","['N', 'N']","['N', 'N']",['1'],['no'],['no'],['polypeptide(L)'],"['A,B']",['AFPSPAADYVEQRIDLNQLLIQHPSATYFVKASGDSMIDGGISD...,['AFPSPAADYVEQRIDLNQLLIQHPSATYFVKASGDSMIDGGISD...,['?'],"['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', ...","['ALA', 'PHE', 'PRO', 'SER', 'PRO', 'ALA', 'AL...","['1', '2', '3', '4', '5', '6', '7', '8', '9', ...","['C3 H7 N O2', 'C6 H15 N4 O2 1', 'C4 H8 N2 O3'...","['89.093', '175.209', '132.118', '133.103', '1...","['ALA', 'ARG', 'ASN', 'ASP', 'GLN', 'GLU', 'GL...","['y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', ...","['ALANINE', 'ARGININE', 'ASPARAGINE', 'ASPARTI...","['l-peptide linking', 'l-peptide linking', 'l-...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...",,,,,,,,['1I4V'],['1'],['0'],['0'],['0'],['1'],['0'],['0'],['0'],['1'],['0'],['0'],['0'],"['ATOM', 'ATOM', 'ATOM', 'ATOM', 'ATOM', 'ATOM...","['1', '2', '3', '4', '5', '6', '7', '8', '9', ...","['N', 'C', 'C', 'O', 'C', 'H', 'H', 'H', 'H', ...","['N', 'CA', 'C', 'O', 'CB', 'H1', 'H2', 'H3', ...","['ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'AL...","['1', '1', '1', '1', '1', '1', '1', '1', 
'1', ...","['.', '.', '.', '.', '.', '.', '.', '.', '.', ...","['.', '.', '.', '.', '.', '.', '.', '.', '.', ...","['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', ...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['16.068', '16.091', '16.608', '17.513', '17.0...","['21.522', '21.281', '19.869', '19.683', '22.3...","['-36.166', '-37.637', '-37.934', '-38.722', '...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['N', 'CA', 'C', 'O', 'CB', 'H1', 'H2', 'H3', ...","['ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'AL...","['25', '25', '25', '25', '25', '25', '25', '25...","['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', ...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...",['8'],['23'],['78'],['4755'],['288'],['69480'],"['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', ...","['1', '2', '3', '4', '5', '6', '7', '8', '9', ...","['TYR', 'ASP', 'GLY', 'TYR', 'ASP', 'GLY', 'VA...","['28', '46', '105', '28', '46', '105', '62', '...","['.', '.', '.', '.', '.', '.', '.', '.', '.', ...","['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', ...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['TYR', 'ASP', 'GLY', 'TYR', 'ASP', 'GLY', 'VA...","['52', '70', '129', '52', '70', '129', '86', '...","['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', ...","['ALA', 'SER', 'LYS', 'ALA', 'SER', 'LYS', 'AL...","['32', '52', '112', '32', '52', '112', '65', '...","['.', '.', '.', '.', '.', '.', '.', '.', '.', ...","['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', ...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['ALA', 'SER', 'LYS', 'ALA', 'SER', 'LYS', 'AL...","['56', '76', '136', '56', '76', '136', '89', '...","['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,11,11,1F6U,['CrJfcPblSc_Ts924vWPScA'],['2024-05-19 01:51:58'],['0.9.11'],['full'],['pdb-bcif'],['1F6U'],['1F6U'],['1F6U'],['SOLUTION NMR'],"['?', '?', '?']","['6162.739', '6384.502', '65.409']","['1', '2', '3']","['syn', 'man', 'syn']","['polymer', 'polymer', 'non-polymer']","['HIV-1 STEM-LOOP SL2 FROM PSI-RNA PACKAGING',...","['1', '1', '2']","['?', '?', '?']","['?', '?', '?']","['?', '?', '?']",['90'],['90'],['90'],['1F6U'],['1'],['1'],['1'],['1'],['?'],['?'],['dimeric'],['2'],['author_defined_assembly'],['1'],"['A,B,C,D']",['1'],['1'],['1'],['identity operation'],['1_555'],"['x,y,z']",['1'],['0'],['0'],['0'],['1'],['0'],['0'],['0'],['1'],['0'],['0'],['0'],"['helx_p', 'helx_p', 'helx_p']","['helx_p1', 'helx_p2', 'helx_p3']","['GLN', 'ILE', 'GLN']","['2', '24', '45']","['.', '.', '.']","['B', 'B', 'B']","['2', '2', '2']","['GLN', 'ILE', 'GLN']","['2', '24', '45']","['A', 'A', 'A']","['THR', 'CYS', 'CYS']","['12', '28', '49']","['.', '.', '.']","['B', 'B', 'B']","['2', '2', '2']","['THR', 'CYS', 'CYS']","['12', '28', '49']","['A', 'A', 'A']","['5', '5', '5']","['?', '?', '?']","['11', '5', '5']","['?', '?', '?', '?']","['1', '2', '3', '3']","['A', 'B', 'C', 'D']","['N', 'N', 'N', 'N']","['N', 'N', 'N', 'N']","['1', '2']","['no', 'no']","['yes', 'yes']","['polyribonucleotide', 'polypeptide(L)']","['B', 'A']","['(CG1)GCGACUGGUGAGUACGCC', 'MQKGNFRNQRKTVKCFN...","['GGCGACUGGUGAGUACGCC', 'MQKGNFRNQRKTVKCFNCGKE...","['?', '?']","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', ...","['CG1', 'G', 'C', 'G', 'A', 'C', 'U', 'G', 'G'...","['1', '2', '3', '4', '5', '6', '7', '8', '9', ...","['C10 H14 N5 O7 P', 'C3 H7 N O2', 'C6 H15 N4 O...","['347.221', '89.093', '175.209', '132.118', '1...","['A', 'ALA', 'ARG', 'ASN', 'ASP', 'C', 'CG1', ...","['y', 'y', 'y', 'y', 'y', 'y', 'n', 'y', 'y', ...","[""ADENOSINE-5'-MONOPHOSPHATE"", 'ALANINE', 'ARG...","['rna linking', 'l-peptide linking', 'l-peptid...","['?', '?', '?', '?', '?', 
'?', '?', '?', '?', ...",,,,,,,,['1F6U'],['1'],['0'],['0'],['0'],['1'],['0'],['0'],['0'],['1'],['0'],['0'],['0'],"['HETATM', 'HETATM', 'HETATM', 'HETATM', 'HETA...","['1', '2', '3', '4', '5', '6', '7', '8', '9', ...","['O', 'P', 'O', 'O', 'O', 'C', 'C', 'O', 'C', ...","['OP3', 'P', 'OP1', 'OP2', ""O5'"", ""C5'"", ""C4'""...","['CG1', 'CG1', 'CG1', 'CG1', 'CG1', 'CG1', 'CG...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['.', '.', '.', '.', '.', '.', '.', '.', '.', ...","['.', '.', '.', '.', '.', '.', '.', '.', '.', ...","['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', ...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['-42.627', '-41.555', '-40.306', '-41.446', '...","['-6.41', '-6.296', '-6.988', '-4.886', '-7.13...","['3.043', '1.861', '2.251', '1.422', '0.677', ...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['0', '0', '0', '0', '0', '0', '0', '0', '0', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['OP3', 'P', 'OP1', 'OP2', ""O5'"", ""C5'"", ""C4'""...","['CG1', 'CG1', 'CG1', 'CG1', 'CG1', 'CG1', 'CG...","['201', '201', '201', '201', '201', '201', '20...","['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', ...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...",['7'],['40'],['39'],['6358'],['150'],['29940'],"['A', 'A']","['1', '2']","['GLY', 'LYS']","['35', '41']","['.', '.']","['B', 'B']","['2', '2']","['GLY', 'LYS']","['35', '41']","['A', 'A']","['CYS', 'GLU']","['36', '42']","['.', '.']","['B', 'B']","['2', '2']","['CYS', 'GLU']","['36', '42']","['A', 'A']","['?', '?']","['covale', 'covale', 'metalc', 'metalc', 'meta...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['covale1', 'covale2', 'metalc1', 'metalc2', '...","['A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', ...","[""O3'"", 'C', 'SG', 'SG', 'NE2', 'SG', 'SG', 'S...","['CG1', 'ASN', 'CYS', 'CYS', 'HIS', 'CYS', 'CY...","['1', '55', '15', '18', '23', '28', '36', '39'...","['B', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', ...","['CG1', 'ASN', 'CYS', 'CYS', 'HIS', 'CYS', 
'CY...","['201', '55', '15', '18', '23', '28', '36', '3...","['1_555', '1_555', '1_555', '1_555', '1_555', ...","['A', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', ...","['P', 'N', 'ZN', 'ZN', 'ZN', 'ZN', 'ZN', 'ZN',...","['G', 'NH2', 'ZN', 'ZN', 'ZN', 'ZN', 'ZN', 'ZN...","['2', '56', '.', '.', '.', '.', '.', '.', '.',...","['B', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', ...","['G', 'NH2', 'ZN', 'ZN', 'ZN', 'ZN', 'ZN', 'ZN...","['202', '56', '128', '128', '128', '128', '149...","['1_555', '1_555', '1_555', '1_555', '1_555', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...","['1.6', '1.325', '2.396', '2.365', '2.117', '2...","['?', '?', '?', '?', '?', '?', '?', '?', '?', ...",,,,,,"['C', 'D']","['3', '3']","['ZN', 'ZN']","['A', 'A']","['1', '1']","['128', '149']","['128', '149']","['ZN', 'ZN']","['ZN', 'ZN']","['.', '.']"


In [None]:
df_pdb_raw = df_pdb_raw.loc[:, ~df_pdb_raw.columns.duplicated()]

In [None]:
# t_df =  df_pdb_raw[df_pdb_raw['data_'].isin(['1F6U', '1I4V'])]

In [None]:
# t_df.to_csv("/content/drive/MyDrive/Colab Notebooks/BMRB_PDB_Project/rawdata/pdb_folders/df_PDB_00#.csv")

In [None]:
# df_pdb_clean = t_df[columns_wanted]
df_pdb_clean = df_pdb_raw[columns_wanted]

## 3.1.2 Clean up the PDB data frame

In [None]:
# df_pdb_clean = df_pdb_raw[columns_wanted]

In [None]:
df_pdb_clean.head()

Unnamed: 0,_exptl.method,data_,_entity.formula_weight,_entity.id,_entity.pdbx_number_of_molecules,_pdbx_struct_assembly.oligomeric_details,_pdbx_struct_assembly.oligomeric_count,_pdbx_struct_assembly_gen.asym_id_list,_struct_conf.beg_label_comp_id,_struct_conf.beg_label_seq_id,_struct_conf.beg_label_asym_id,_struct_conf.beg_label_entity_id,_struct_conf.end_label_comp_id,_struct_conf.end_label_seq_id,_struct_conf.end_label_asym_id,_struct_conf.end_label_entity_id,_struct_conf.end_auth_asym_id,_struct_conf.pdbx_PDB_helix_class,_struct_conf.pdbx_PDB_helix_length,_entity_poly.pdbx_seq_one_letter_code,_entity_poly_seq.mon_id,_atom_site.type_symbol,_atom_site.label_atom_id,_atom_site.label_comp_id,_atom_site.label_seq_id,_chem_comp.formula_weight,_atom_site.id,_atom_site.type_symbol.1,_atom_site.label_atom_id.1,_atom_site.label_comp_id.1,_atom_site.label_seq_id.1,_atom_site.Cartn_x,_atom_site.Cartn_y,_atom_site.Cartn_z,_struct_sheet_range.sheet_id,_struct_sheet_range.id,_struct_sheet_range.beg_label_comp_id,_struct_sheet_range.beg_label_seq_id,_struct_sheet_range.beg_auth_asym_id,_struct_sheet_range.end_label_comp_id,_struct_sheet_range.end_label_seq_id,_struct_conn.ptnr1_label_comp_id,_struct_conn.ptnr1_label_seq_id,_struct_conn.ptnr2_label_comp_id,_struct_conn.ptnr2_label_seq_id
0,['SOLUTION NMR'],1I4V,['12308.985'],['1'],['2'],['dimeric'],['2'],"['A,B']","['ASP', 'ASP']","['15', '15']","['A', 'B']","['1', '1']","['ILE', 'ILE']","['21', '21']","['A', 'B']","['1', '1']","['A', 'B']","['1', '1']","['7', '7']",['AFPSPAADYVEQRIDLNQLLIQHPSATYFVKASGDSMIDGGISD...,"['ALA', 'PHE', 'PRO', 'SER', 'PRO', 'ALA', 'AL...","['N', 'C', 'C', 'O', 'C', 'H', 'H', 'H', 'H', ...","['N', 'CA', 'C', 'O', 'CB', 'H1', 'H2', 'H3', ...","['ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'AL...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['89.093', '175.209', '132.118', '133.103', '1...","['1', '2', '3', '4', '5', '6', '7', '8', '9', ...","['N', 'C', 'C', 'O', 'C', 'H', 'H', 'H', 'H', ...","['N', 'CA', 'C', 'O', 'CB', 'H1', 'H2', 'H3', ...","['ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'ALA', 'AL...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['16.068', '16.091', '16.608', '17.513', '17.0...","['21.522', '21.281', '19.869', '19.683', '22.3...","['-36.166', '-37.637', '-37.934', '-38.722', '...","['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', ...","['1', '2', '3', '4', '5', '6', '7', '8', '9', ...","['TYR', 'ASP', 'GLY', 'TYR', 'ASP', 'GLY', 'VA...","['28', '46', '105', '28', '46', '105', '62', '...","['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', ...","['ALA', 'SER', 'LYS', 'ALA', 'SER', 'LYS', 'AL...","['32', '52', '112', '32', '52', '112', '65', '...",,,,
1,['SOLUTION NMR'],1F6U,"['6162.739', '6384.502', '65.409']","['1', '2', '3']","['1', '1', '2']",['dimeric'],['2'],"['A,B,C,D']","['GLN', 'ILE', 'GLN']","['2', '24', '45']","['B', 'B', 'B']","['2', '2', '2']","['THR', 'CYS', 'CYS']","['12', '28', '49']","['B', 'B', 'B']","['2', '2', '2']","['A', 'A', 'A']","['5', '5', '5']","['11', '5', '5']","['(CG1)GCGACUGGUGAGUACGCC', 'MQKGNFRNQRKTVKCFN...","['CG1', 'G', 'C', 'G', 'A', 'C', 'U', 'G', 'G'...","['O', 'P', 'O', 'O', 'O', 'C', 'C', 'O', 'C', ...","['OP3', 'P', 'OP1', 'OP2', ""O5'"", ""C5'"", ""C4'""...","['CG1', 'CG1', 'CG1', 'CG1', 'CG1', 'CG1', 'CG...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['347.221', '89.093', '175.209', '132.118', '1...","['1', '2', '3', '4', '5', '6', '7', '8', '9', ...","['O', 'P', 'O', 'O', 'O', 'C', 'C', 'O', 'C', ...","['OP3', 'P', 'OP1', 'OP2', ""O5'"", ""C5'"", ""C4'""...","['CG1', 'CG1', 'CG1', 'CG1', 'CG1', 'CG1', 'CG...","['1', '1', '1', '1', '1', '1', '1', '1', '1', ...","['-42.627', '-41.555', '-40.306', '-41.446', '...","['-6.41', '-6.296', '-6.988', '-4.886', '-7.13...","['3.043', '1.861', '2.251', '1.422', '0.677', ...","['A', 'A']","['1', '2']","['GLY', 'LYS']","['35', '41']","['A', 'A']","['CYS', 'GLU']","['36', '42']","['CG1', 'ASN', 'CYS', 'CYS', 'HIS', 'CYS', 'CY...","['1', '55', '15', '18', '23', '28', '36', '39'...","['G', 'NH2', 'ZN', 'ZN', 'ZN', 'ZN', 'ZN', 'ZN...","['2', '56', '.', '.', '.', '.', '.', '.', '.',..."


In [None]:
# Remove duplicated columns; .copy() avoids pandas' SettingWithCopyWarning in the column-wise conversion that follows
df_pdb_clean = df_pdb_clean.loc[:, ~df_pdb_clean.columns.duplicated()].copy()

In [None]:
# Convert string representations of lists to actual lists
columns_to_convert = df_pdb_clean.columns.difference(['data_'])
for col in columns_to_convert:
    df_pdb_clean[col] = df_pdb_clean[col].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)
    print(col + ' is done.')

_atom_site.Cartn_x is done.
_atom_site.Cartn_y is done.
_atom_site.Cartn_z is done.
_atom_site.id is done.
_atom_site.label_atom_id is done.
_atom_site.label_comp_id is done.
_atom_site.label_seq_id is done.
_atom_site.type_symbol is done.
_chem_comp.formula_weight is done.
_entity.formula_weight is done.
_entity.id is done.
_entity.pdbx_number_of_molecules is done.
_entity_poly.pdbx_seq_one_letter_code is done.
_entity_poly_seq.mon_id is done.
_exptl.method is done.
_pdbx_struct_assembly.oligomeric_count is done.
_pdbx_struct_assembly.oligomeric_details is done.
_pdbx_struct_assembly_gen.asym_id_list is done.
_struct_conf.beg_label_asym_id is done.
_struct_conf.beg_label_comp_id is done.
_struct_conf.beg_label_entity_id is done.
_struct_conf.beg_label_seq_id is done.
_struct_conf.end_auth_asym_id is done.
_struct_conf.end_label_asym_id is done.
_struct_conf.end_label_comp_id is done.
_struct_conf.end_label_entity_id is done.
_struct_conf.end_label_seq_id is done.
_struct_conf.pdbx_PDB
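
The conversion loop assumes every non-missing cell is a string holding a valid Python literal, so re-running it on cells that are already lists raises an error. A more forgiving variant might look like the sketch below; `parse_list_cell` is a hypothetical helper and the demo frame is made up for illustration.

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn a string such as "['A', 'B']" into a list; pass other values through."""
    if not isinstance(value, str):
        return value  # NaN or an already-parsed list
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # leave text that is not a Python literal untouched


demo = pd.DataFrame(
    {
        "data_": ["1I4V", "1F6U"],
        "_entity.id": ["['1']", "['1', '2', '3']"],
        "_struct_conn.ptnr1_label_seq_id": [float("nan"), "['1', '55']"],
    }
)
for col in demo.columns.difference(["data_"]):
    demo[col] = demo[col].map(parse_list_cell)
print(demo)
```

Passing unparseable text through unchanged keeps the cell safe to re-run, at the cost of hiding malformed entries; logging them instead may be preferable.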

In [None]:
df_pdb_clean.head()

Unnamed: 0,_exptl.method,data_,_entity.formula_weight,_entity.id,_entity.pdbx_number_of_molecules,_pdbx_struct_assembly.oligomeric_details,_pdbx_struct_assembly.oligomeric_count,_pdbx_struct_assembly_gen.asym_id_list,_struct_conf.beg_label_comp_id,_struct_conf.beg_label_seq_id,_struct_conf.beg_label_asym_id,_struct_conf.beg_label_entity_id,_struct_conf.end_label_comp_id,_struct_conf.end_label_seq_id,_struct_conf.end_label_asym_id,_struct_conf.end_label_entity_id,_struct_conf.end_auth_asym_id,_struct_conf.pdbx_PDB_helix_class,_struct_conf.pdbx_PDB_helix_length,_entity_poly.pdbx_seq_one_letter_code,_entity_poly_seq.mon_id,_atom_site.type_symbol,_atom_site.label_atom_id,_atom_site.label_comp_id,_atom_site.label_seq_id,_chem_comp.formula_weight,_atom_site.id,_atom_site.Cartn_x,_atom_site.Cartn_y,_atom_site.Cartn_z,_struct_sheet_range.sheet_id,_struct_sheet_range.id,_struct_sheet_range.beg_label_comp_id,_struct_sheet_range.beg_label_seq_id,_struct_sheet_range.beg_auth_asym_id,_struct_sheet_range.end_label_comp_id,_struct_sheet_range.end_label_seq_id,_struct_conn.ptnr1_label_comp_id,_struct_conn.ptnr1_label_seq_id,_struct_conn.ptnr2_label_comp_id,_struct_conn.ptnr2_label_seq_id
0,[SOLUTION NMR],1I4V,[12308.985],[1],[2],[dimeric],[2],"[A,B]","[ASP, ASP]","[15, 15]","[A, B]","[1, 1]","[ILE, ILE]","[21, 21]","[A, B]","[1, 1]","[A, B]","[1, 1]","[7, 7]",[AFPSPAADYVEQRIDLNQLLIQHPSATYFVKASGDSMIDGGISDG...,"[ALA, PHE, PRO, SER, PRO, ALA, ALA, ASP, TYR, ...","[N, C, C, O, C, H, H, H, H, H, H, H, N, C, C, ...","[N, CA, C, O, CB, H1, H2, H3, HA, HB1, HB2, HB...","[ALA, ALA, ALA, ALA, ALA, ALA, ALA, ALA, ALA, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...","[89.093, 175.209, 132.118, 133.103, 146.144, 1...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[16.068, 16.091, 16.608, 17.513, 17.05, 15.569...","[21.522, 21.281, 19.869, 19.683, 22.332, 20.74...","[-36.166, -37.637, -37.934, -38.722, -38.196, ...","[A, A, A, A, A, A, B, B, B, C, C, D, D, D, E, E]","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[TYR, ASP, GLY, TYR, ASP, GLY, VAL, THR, ASP, ...","[28, 46, 105, 28, 46, 105, 62, 71, 102, 82, 93...","[A, A, A, B, B, B, A, A, A, A, A, B, B, B, B, B]","[ALA, SER, LYS, ALA, SER, LYS, ALA, LYS, VAL, ...","[32, 52, 112, 32, 52, 112, 65, 73, 103, 83, 94...",,,,
1,[SOLUTION NMR],1F6U,"[6162.739, 6384.502, 65.409]","[1, 2, 3]","[1, 1, 2]",[dimeric],[2],"[A,B,C,D]","[GLN, ILE, GLN]","[2, 24, 45]","[B, B, B]","[2, 2, 2]","[THR, CYS, CYS]","[12, 28, 49]","[B, B, B]","[2, 2, 2]","[A, A, A]","[5, 5, 5]","[11, 5, 5]","[(CG1)GCGACUGGUGAGUACGCC, MQKGNFRNQRKTVKCFNCGK...","[CG1, G, C, G, A, C, U, G, G, U, G, A, G, U, A...","[O, P, O, O, O, C, C, O, C, O, C, O, C, N, C, ...","[OP3, P, OP1, OP2, O5', C5', C4', O4', C3', O3...","[CG1, CG1, CG1, CG1, CG1, CG1, CG1, CG1, CG1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[347.221, 89.093, 175.209, 132.118, 133.103, 3...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[-42.627, -41.555, -40.306, -41.446, -42.234, ...","[-6.41, -6.296, -6.988, -4.886, -7.131, -8.509...","[3.043, 1.861, 2.251, 1.422, 0.677, 0.502, -0....","[A, A]","[1, 2]","[GLY, LYS]","[35, 41]","[A, A]","[CYS, GLU]","[36, 42]","[CG1, ASN, CYS, CYS, HIS, CYS, CYS, CYS, HIS, ...","[1, 55, 15, 18, 23, 28, 36, 39, 44, 49, 1, 1, ...","[G, NH2, ZN, ZN, ZN, ZN, ZN, ZN, ZN, ZN, C, C,...","[2, 56, ., ., ., ., ., ., ., ., 19, 19, 19, 18..."


### 3.1.3 Split each atom out of the list

#### 3.1.3.1 First, convert the atom columns to lists

In [None]:
# Drop duplicated columns; otherwise .apply(ast.literal_eval) may fail later on
df_pdb_clean = df_pdb_clean.loc[:, ~df_pdb_clean.columns.duplicated()]

# Columns holding per-atom lists that will be exploded into one row per atom
columns_tolist = ['data_','_atom_site.type_symbol', '_atom_site.label_atom_id','_atom_site.label_comp_id', '_atom_site.label_seq_id',  '_atom_site.Cartn_x', '_atom_site.Cartn_y', '_atom_site.Cartn_z']
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
df_tolist = df_pdb_clean[columns_tolist].copy()

In [None]:
df_tolist.head()

Unnamed: 0,data_,_atom_site.type_symbol,_atom_site.label_atom_id,_atom_site.label_comp_id,_atom_site.label_seq_id,_atom_site.Cartn_x,_atom_site.Cartn_y,_atom_site.Cartn_z
0,1I4V,"[N, C, C, O, C, H, H, H, H, H, H, H, N, C, C, ...","[N, CA, C, O, CB, H1, H2, H3, HA, HB1, HB2, HB...","[ALA, ALA, ALA, ALA, ALA, ALA, ALA, ALA, ALA, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...","[16.068, 16.091, 16.608, 17.513, 17.05, 15.569...","[21.522, 21.281, 19.869, 19.683, 22.332, 20.74...","[-36.166, -37.637, -37.934, -38.722, -38.196, ..."
1,1F6U,"[O, P, O, O, O, C, C, O, C, O, C, O, C, N, C, ...","[OP3, P, OP1, OP2, O5', C5', C4', O4', C3', O3...","[CG1, CG1, CG1, CG1, CG1, CG1, CG1, CG1, CG1, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-42.627, -41.555, -40.306, -41.446, -42.234, ...","[-6.41, -6.296, -6.988, -4.886, -7.131, -8.509...","[3.043, 1.861, 2.251, 1.422, 0.677, 0.502, -0...."


In [None]:
df_tolist = df_tolist.rename(columns={
    'data_': 'pdb_id',
    '_entity_poly_seq.num': 'atom_id',
    '_atom_site.type_symbol': 'atom_type',
    '_atom_site.label_atom_id' : 'atom_inResidue',
    '_atom_site.label_comp_id' : 'residue_type' ,
    '_atom_site.label_seq_id' : 'peptide_seq_id',
    '_atom_site.Cartn_x' : 'atom_axis_x',
    '_atom_site.Cartn_y' : 'atom_axis_y',
    '_atom_site.Cartn_z' : 'atom_axis_z'
})

In [None]:

# Keep the original row index as a column; it is the merge key after exploding
df_tolist['index_column'] = df_tolist.index

In [None]:
df_tolist.head()

Unnamed: 0,pdb_id,atom_type,atom_inResidue,residue_type,peptide_seq_id,atom_axis_x,atom_axis_y,atom_axis_z,index_column
0,1IIO,"[N, C, C, O, H, H, H, H, H, N, C, C, O, C, O, ...","[N, CA, C, O, H1, H2, H3, HA2, HA3, N, CA, C, ...","[GLY, GLY, GLY, GLY, GLY, GLY, GLY, GLY, GLY, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, ...","[-19.311, -18.95, -19.327, -18.54, -19.413, -1...","[-0.245, -1.311, -0.948, -1.141, 0.667, -0.157...","[7.735, 6.76, 5.338, 4.412, 7.246, 8.46, 8.202...",0
1,1I4V,"[N, C, C, O, C, H, H, H, H, H, H, H, N, C, C, ...","[N, CA, C, O, CB, H1, H2, H3, HA, HB1, HB2, HB...","[ALA, ALA, ALA, ALA, ALA, ALA, ALA, ALA, ALA, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...","[16.068, 16.091, 16.608, 17.513, 17.05, 15.569...","[21.522, 21.281, 19.869, 19.683, 22.332, 20.74...","[-36.166, -37.637, -37.934, -38.722, -38.196, ...",1


#### 3.1.3.2 Second, explode each atom into its own row

In [None]:
df_tolist_1 = df_tolist[['atom_type', 'atom_inResidue','residue_type','peptide_seq_id','atom_axis_x','atom_axis_y' ,'atom_axis_z','index_column']]
df_tolist_2 = df_tolist[['pdb_id', 'index_column']]

# Explode each atom out into its own row
df_atomized = df_tolist_1.apply(lambda x: x.explode()).reset_index(drop=True)
# Re-attach pdb_id through the shared index_column key
df_atomized = df_atomized.merge(df_tolist_2, on='index_column')
df_atomized.head()

Unnamed: 0,atom_type,atom_inResidue,residue_type,peptide_seq_id,atom_axis_x,atom_axis_y,atom_axis_z,index_column,pdb_id
0,N,N,ALA,1,16.068,21.522,-36.166,0,1I4V
1,C,CA,ALA,1,16.091,21.281,-37.637,0,1I4V
2,C,C,ALA,1,16.608,19.869,-37.934,0,1I4V
3,O,O,ALA,1,17.513,19.683,-38.722,0,1I4V
4,C,CB,ALA,1,17.05,22.332,-38.196,0,1I4V
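
The explode step can be illustrated on a toy frame. This is a minimal sketch, not the notebook's own code: it assumes pandas 1.3 or newer, where `DataFrame.explode` accepts a list of columns, and the identifiers and coordinates below are made up. All lists within one row must have the same length.

```python
import pandas as pd

# Toy stand-in for df_tolist: one row per entry, parallel lists per column.
toy = pd.DataFrame({
    "pdb_id": ["AAAA", "BBBB"],  # made-up identifiers
    "atom_inResidue": [["N", "CA", "C"], ["N", "CA"]],
    "atom_axis_x": [["16.068", "16.091", "16.608"], ["-19.311", "-18.95"]],
})

# Explode the parallel list columns together; scalar columns are repeated.
list_columns = ["atom_inResidue", "atom_axis_x"]
toy_atomized = toy.explode(list_columns, ignore_index=True)
print(toy_atomized)
```

Exploding the list columns together keeps `pdb_id` attached to every atom, so no separate merge on `index_column` is needed in this variant.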


In [None]:
df_atomized.head(20)

Unnamed: 0,atom_type,atom_inResidue,residue_type,peptide_seq_id,atom_axis_x,atom_axis_y,atom_axis_z,index_column,pdb_id
0,N,N,ALA,1,16.068,21.522,-36.166,0,1I4V
1,C,CA,ALA,1,16.091,21.281,-37.637,0,1I4V
2,C,C,ALA,1,16.608,19.869,-37.934,0,1I4V
3,O,O,ALA,1,17.513,19.683,-38.722,0,1I4V
4,C,CB,ALA,1,17.05,22.332,-38.196,0,1I4V
5,H,H1,ALA,1,15.569,20.741,-35.695,0,1I4V
6,H,H2,ALA,1,15.576,22.417,-35.971,0,1I4V
7,H,H3,ALA,1,17.042,21.574,-35.808,0,1I4V
8,H,HA,ALA,1,15.107,21.418,-38.058,0,1I4V
9,H,HB1,ALA,1,16.535,23.277,-38.286,0,1I4V


In [None]:
# Replace the missing-value marker '.' with the sentinel '9999' in the 'peptide_seq_id' column
df_atomized['peptide_seq_id'] = df_atomized['peptide_seq_id'].replace('.', '9999')

In [None]:
# Cast 'peptide_seq_id' column to int64
df_atomized['peptide_seq_id'] = df_atomized['peptide_seq_id'].astype('int64')
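
A sentinel such as 9999 could in principle collide with a real residue number. As a hedged alternative sketch (not what this notebook does), `pd.to_numeric(..., errors="coerce")` turns `'.'` into a missing value, and the nullable `Int64` dtype keeps the column integer-typed. The toy series is illustrative only.

```python
import pandas as pd

# Toy stand-in for the exploded peptide_seq_id column (values stored as strings).
seq = pd.Series(["54", "55", ".", "."], name="peptide_seq_id")

# '.' cannot be parsed, so it becomes NaN and then <NA> in the nullable dtype.
seq_num = pd.to_numeric(seq, errors="coerce").astype("Int64")
print(seq_num)
print("missing:", seq_num.isna().sum())
```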

In [None]:
# df_atomized[df_atomized['peptide_seq_id']== 9999]
df_atomized.iloc[81440:81459]

Unnamed: 0,atom_type,atom_inResidue,residue_type,peptide_seq_id,atom_axis_x,atom_axis_y,atom_axis_z,index_column,pdb_id
81440,C,CA,ASN,55,-46.221,36.444,-5.714,1,1F6U
81441,C,C,ASN,55,-46.599,37.708,-6.487,1,1F6U
81442,O,O,ASN,55,-47.465,38.469,-6.057,1,1F6U
81443,C,CB,ASN,55,-47.157,35.319,-6.16,1,1F6U
81444,C,CG,ASN,55,-46.421,33.979,-6.204,1,1F6U
81445,O,OD1,ASN,55,-45.577,33.677,-5.377,1,1F6U
81446,N,ND2,ASN,55,-46.787,33.194,-7.214,1,1F6U
81447,H,H,ASN,55,-47.289,36.807,-3.962,1,1F6U
81448,H,HA,ASN,55,-45.179,36.161,-5.861,1,1F6U
81449,H,HB2,ASN,55,-48.003,35.251,-5.477,1,1F6U


In [None]:
df_atomized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99420 entries, 0 to 99419
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   atom_type       99420 non-null  object
 1   atom_inResidue  99420 non-null  object
 2   residue_type    99420 non-null  object
 3   peptide_seq_id  99420 non-null  int64 
 4   atom_axis_x     99420 non-null  object
 5   atom_axis_y     99420 non-null  object
 6   atom_axis_z     99420 non-null  object
 7   index_column    99420 non-null  int64 
 8   pdb_id          99420 non-null  object
dtypes: int64(2), object(7)
memory usage: 6.8+ MB
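
The `info()` output shows that the three coordinate columns still have dtype `object`. They will probably need to be numeric before any geometry or modeling step. A minimal sketch on toy data, assuming the coordinates are stored as strings:

```python
import pandas as pd

# Toy stand-in for df_atomized with string coordinates.
toy = pd.DataFrame({
    "atom_axis_x": ["16.068", "16.091"],
    "atom_axis_y": ["21.522", "21.281"],
    "atom_axis_z": ["-36.166", "-37.637"],
})

# Convert each coordinate column to float; unparseable values become NaN.
for col in ["atom_axis_x", "atom_axis_y", "atom_axis_z"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```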


In [None]:
# NMR structures usually come with multiple states (models); number them per entry.
# Heuristic: a new state starts whenever the backbone N of residue 1 appears.
# Caveat: every chain that starts with N of residue 1 also increments the counter,
# so multi-chain entries may be over-counted.

# Initialize the states column
df_atomized['states'] = 1
current_state = 1
index_num = 0

for i in range(1, len(df_atomized)):
    if df_atomized.at[i, 'index_column'] != index_num:
        # New PDB entry: restart the counter
        current_state = 0
        index_num = df_atomized.at[i, 'index_column']
    # Separate `if` (not `elif`), so an entry whose first atom is N of residue 1
    # starts at state 1 instead of 0
    if df_atomized.at[i, 'atom_inResidue'] == 'N' and df_atomized.at[i, 'peptide_seq_id'] == 1:
        current_state += 1

    df_atomized.at[i, 'states'] = current_state
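
Row-by-row loops with `.at` can be slow on large tables. A vectorized sketch of the same idea is to count, within each entry, how many times a backbone N of residue 1 has been seen so far. The toy frame is made up, and the numbering is only expected to agree with the loop for entries whose first atom is that marker.

```python
import pandas as pd

# Toy stand-in for df_atomized: two entries, with 2 models each.
toy = pd.DataFrame({
    "index_column":   [0, 0, 0, 0, 1, 1, 1, 1, 1],
    "atom_inResidue": ["N", "CA", "N", "CA", "N", "CA", "N", "CA", "N"],
    "peptide_seq_id": [1, 1, 1, 1, 1, 1, 2, 2, 1],
})

# True where a new model is assumed to start (backbone N of residue 1).
is_model_start = (toy["atom_inResidue"] == "N") & (toy["peptide_seq_id"] == 1)

# Running count of model starts, restarted for every entry.
toy["states"] = is_model_start.astype("int64").groupby(toy["index_column"]).cumsum()
print(toy)
```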

In [None]:
# Replace each 9999 sentinel with the previous row's peptide_seq_id + 1
for i in range(1, len(df_atomized)):
    if df_atomized.at[i, 'peptide_seq_id'] == 9999:
        df_atomized.at[i, 'peptide_seq_id'] = df_atomized.at[i - 1, 'peptide_seq_id'] + 1
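
The same "previous value + 1" rule can be sketched without a Python loop. This is an illustrative alternative on toy data. It assumes the first row is not a sentinel, and `SENTINEL` is a name introduced here for the example.

```python
import pandas as pd

SENTINEL = 9999  # hypothetical constant name for the placeholder value
seq = pd.Series([54, 55, SENTINEL, SENTINEL, 1, 2, SENTINEL])

valid = seq != SENTINEL
run_id = valid.cumsum()                  # every valid row starts a new run
offset = seq.groupby(run_id).cumcount()  # 0 on the valid row, then 1, 2, ... on sentinels
# Carry the last valid number forward and add the position inside the run.
filled = (seq.where(valid).ffill() + offset).astype("int64")
print(filled.tolist())
```

Consecutive sentinels receive increasing numbers, as in the loop, because each one is offset from the last valid residue number.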

In [None]:
df_atomized.tail(20)

Unnamed: 0,atom_type,atom_inResidue,residue_type,peptide_seq_id,atom_axis_x,atom_axis_y,atom_axis_z,index_column,pdb_id,states
99400,H,HB1,ALA,54,-52.153,34.173,6.601,1,1F6U,20
99401,H,HB2,ALA,54,-50.853,33.508,7.618,1,1F6U,20
99402,H,HB3,ALA,54,-51.965,34.728,8.282,1,1F6U,20
99403,N,N,ASN,55,-49.413,37.489,7.774,1,1F6U,20
99404,C,CA,ASN,55,-48.799,38.335,8.781,1,1F6U,20
99405,C,C,ASN,55,-47.58,37.618,9.37,1,1F6U,20
99406,O,O,ASN,55,-46.603,38.26,9.752,1,1F6U,20
99407,C,CB,ASN,55,-49.773,38.628,9.924,1,1F6U,20
99408,C,CG,ASN,55,-49.597,40.056,10.443,1,1F6U,20
99409,O,OD1,ASN,55,-48.808,40.836,9.934,1,1F6U,20


In [None]:
# test_df = df_atomized[df_atomized['atom_inResidue'] == 'ZN']
# #test_df = test_df[test_df['residue_type'] == 'ARG']
# #test_df = test_df[test_df['peptide_seq_id'] == 9999 ]
# test_df.head(30)

Unnamed: 0,atom_type,atom_inResidue,residue_type,peptide_seq_id,atom_axis_x,atom_axis_y,atom_axis_z,index_column,pdb_id,states
70975,ZN,ZN,ZN,57,-50.949,11.206,7.572,1,1F6U,1
70976,ZN,ZN,ZN,58,-53.925,28.422,2.562,1,1F6U,1
72472,ZN,ZN,ZN,57,-51.284,10.873,7.629,1,1F6U,2
72473,ZN,ZN,ZN,58,-52.641,27.81,0.776,1,1F6U,2
73969,ZN,ZN,ZN,57,-50.628,11.886,7.444,1,1F6U,3
73970,ZN,ZN,ZN,58,-54.338,28.174,2.417,1,1F6U,3
75466,ZN,ZN,ZN,57,-50.82,11.348,7.362,1,1F6U,4
75467,ZN,ZN,ZN,58,-53.628,28.765,2.646,1,1F6U,4
76963,ZN,ZN,ZN,57,-50.279,11.33,7.839,1,1F6U,5
76964,ZN,ZN,ZN,58,-54.242,28.042,2.128,1,1F6U,5


In [None]:
test_df.head(30)

Unnamed: 0,atom_inResidue,residue_type,peptide_seq_id,atom_axis_x,atom_axis_y,atom_axis_z,index_column,pdb_id,states
424,N,PRO,30,7.351,-0.786,5.743,0,1IIO,1
1628,N,PRO,30,7.445,-0.786,5.659,0,1IIO,2
2832,N,PRO,30,7.359,-0.689,5.752,0,1IIO,3
4036,N,PRO,30,7.508,-0.778,5.767,0,1IIO,4
5240,N,PRO,30,7.404,-0.901,5.742,0,1IIO,5
6444,N,PRO,30,7.296,-0.841,5.714,0,1IIO,6
7648,N,PRO,30,7.526,-0.879,5.731,0,1IIO,7
8852,N,PRO,30,7.303,-0.86,5.874,0,1IIO,8
10056,N,PRO,30,7.381,-0.663,5.708,0,1IIO,9
11260,N,PRO,30,7.219,-0.861,5.594,0,1IIO,10


### 3.1.4 For the remaining feature columns, de-list each element

In [None]:
columns_restlist = ['_exptl.method','_atom_site.type_symbol', '_atom_site.label_atom_id','_atom_site.label_comp_id', '_atom_site.label_seq_id',  '_atom_site.Cartn_x', '_atom_site.Cartn_y', '_atom_site.Cartn_z']

df_pdb_rest = df_pdb_clean.drop(columns= columns_restlist)

In [None]:
df_pdb_rest

Unnamed: 0,data_,_entity.formula_weight,_entity.id,_entity.pdbx_number_of_molecules,_pdbx_struct_assembly.oligomeric_details,_pdbx_struct_assembly.oligomeric_count,_pdbx_struct_assembly_gen.asym_id_list,_struct_conf.beg_label_comp_id,_struct_conf.beg_label_seq_id,_struct_conf.beg_label_asym_id,_struct_conf.beg_label_entity_id,_struct_conf.end_label_comp_id,_struct_conf.end_label_seq_id,_struct_conf.end_label_asym_id,_struct_conf.end_label_entity_id,_struct_conf.end_auth_asym_id,_struct_conf.pdbx_PDB_helix_class,_struct_conf.pdbx_PDB_helix_length,_entity_poly.pdbx_seq_one_letter_code,_entity_poly_seq.mon_id,_chem_comp.formula_weight,_atom_site.id,_struct_sheet_range.sheet_id,_struct_sheet_range.id,_struct_sheet_range.beg_label_comp_id,_struct_sheet_range.beg_label_seq_id,_struct_sheet_range.beg_auth_asym_id,_struct_sheet_range.end_label_comp_id,_struct_sheet_range.end_label_seq_id,_struct_conn.ptnr1_label_comp_id,_struct_conn.ptnr1_label_seq_id,_struct_conn.ptnr2_label_comp_id,_struct_conn.ptnr2_label_seq_id
0,1I4V,[12308.985],[1],[2],[dimeric],[2],"[A,B]","[ASP, ASP]","[15, 15]","[A, B]","[1, 1]","[ILE, ILE]","[21, 21]","[A, B]","[1, 1]","[A, B]","[1, 1]","[7, 7]",[AFPSPAADYVEQRIDLNQLLIQHPSATYFVKASGDSMIDGGISDG...,"[ALA, PHE, PRO, SER, PRO, ALA, ALA, ASP, TYR, ...","[89.093, 175.209, 132.118, 133.103, 146.144, 1...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[A, A, A, A, A, A, B, B, B, C, C, D, D, D, E, E]","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[TYR, ASP, GLY, TYR, ASP, GLY, VAL, THR, ASP, ...","[28, 46, 105, 28, 46, 105, 62, 71, 102, 82, 93...","[A, A, A, B, B, B, A, A, A, A, A, B, B, B, B, B]","[ALA, SER, LYS, ALA, SER, LYS, ALA, LYS, VAL, ...","[32, 52, 112, 32, 52, 112, 65, 73, 103, 83, 94...",,,,
1,1F6U,"[6162.739, 6384.502, 65.409]","[1, 2, 3]","[1, 1, 2]",[dimeric],[2],"[A,B,C,D]","[GLN, ILE, GLN]","[2, 24, 45]","[B, B, B]","[2, 2, 2]","[THR, CYS, CYS]","[12, 28, 49]","[B, B, B]","[2, 2, 2]","[A, A, A]","[5, 5, 5]","[11, 5, 5]","[(CG1)GCGACUGGUGAGUACGCC, MQKGNFRNQRKTVKCFNCGK...","[CG1, G, C, G, A, C, U, G, G, U, G, A, G, U, A...","[347.221, 89.093, 175.209, 132.118, 133.103, 3...","[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...","[A, A]","[1, 2]","[GLY, LYS]","[35, 41]","[A, A]","[CYS, GLU]","[36, 42]","[CG1, ASN, CYS, CYS, HIS, CYS, CYS, CYS, HIS, ...","[1, 55, 15, 18, 23, 28, 36, 39, 44, 49, 1, 1, ...","[G, NH2, ZN, ZN, ZN, ZN, ZN, ZN, ZN, ZN, C, C,...","[2, 56, ., ., ., ., ., ., ., ., 19, 19, 19, 18..."


In [None]:
# TODO: save both df_pdb_rest and df_atomized
# The de-listing will be done in the next section

# 3.2 Wrap up into a single script to run in PyCharm

In [None]:
import os
import pandas as pd
import ast

pd.set_option('display.max_columns', None)


def process_file(file_path):
    df_pdb_raw = pd.read_csv(file_path)

    print(f"Successfully read: {file_path}")

    columns_wanted = [
        '_exptl.method', 'data_', '_entity.formula_weight', '_entity.id', '_entity.pdbx_number_of_molecules',
        '_pdbx_struct_assembly.oligomeric_details', '_pdbx_struct_assembly.oligomeric_count',
        '_pdbx_struct_assembly_gen.asym_id_list', '_struct_conf.beg_label_comp_id',
        '_struct_conf.beg_label_seq_id', '_struct_conf.beg_label_asym_id', '_struct_conf.beg_label_entity_id',
        '_struct_conf.end_label_comp_id', '_struct_conf.end_label_seq_id', '_struct_conf.end_label_asym_id',
        '_struct_conf.end_label_entity_id', '_struct_conf.end_auth_asym_id', '_struct_conf.pdbx_PDB_helix_class',
        '_struct_conf.pdbx_PDB_helix_length', '_entity_poly.pdbx_seq_one_letter_code',
        '_entity_poly_seq.mon_id', '_atom_site.type_symbol', '_atom_site.label_atom_id',
        '_atom_site.label_comp_id', '_atom_site.label_seq_id', '_chem_comp.formula_weight',
        '_atom_site.id', '_atom_site.Cartn_x',
        '_atom_site.Cartn_y', '_atom_site.Cartn_z', '_struct_sheet_range.sheet_id',
        '_struct_sheet_range.id', '_struct_sheet_range.beg_label_comp_id',
        '_struct_sheet_range.beg_label_seq_id', '_struct_sheet_range.beg_auth_asym_id',
        '_struct_sheet_range.end_label_comp_id', '_struct_sheet_range.end_label_seq_id',
        '_struct_conn.ptnr1_label_comp_id', '_struct_conn.ptnr1_label_seq_id',
        '_struct_conn.ptnr2_label_comp_id', '_struct_conn.ptnr2_label_seq_id'
    ]

    df_pdb_raw = df_pdb_raw.loc[:, ~df_pdb_raw.columns.duplicated()]
    df_pdb_clean = df_pdb_raw[columns_wanted]
    # .copy() avoids SettingWithCopyWarning when columns are reassigned below
    df_pdb_clean = df_pdb_clean.loc[:, ~df_pdb_clean.columns.duplicated()].copy()

    columns_to_convert = df_pdb_clean.columns.difference(['data_'])
    for col in columns_to_convert:
        df_pdb_clean[col] = df_pdb_clean[col].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)

    print(f"Successfully converted to list: {file_path}")

    df_pdb_clean = df_pdb_clean.loc[:, ~df_pdb_clean.columns.duplicated()]

    columns_tolist = [
        'data_', '_atom_site.type_symbol', '_atom_site.label_atom_id', '_atom_site.label_comp_id',
        '_atom_site.label_seq_id', '_atom_site.Cartn_x', '_atom_site.Cartn_y', '_atom_site.Cartn_z'
    ]
    # .copy() so that adding index_column below does not trigger SettingWithCopyWarning
    df_tolist = df_pdb_clean[columns_tolist].copy()

    df_tolist = df_tolist.rename(columns={
        'data_': 'pdb_id',
        '_entity_poly_seq.num': 'atom_id',
        '_atom_site.type_symbol': 'atom_type',
        '_atom_site.label_atom_id': 'atom_inResidue',
        '_atom_site.label_comp_id': 'residue_type',
        '_atom_site.label_seq_id': 'peptide_seq_id',
        '_atom_site.Cartn_x': 'atom_axis_x',
        '_atom_site.Cartn_y': 'atom_axis_y',
        '_atom_site.Cartn_z': 'atom_axis_z'
    })

    df_tolist['index_column'] = df_tolist.index

    df_tolist_1 = df_tolist[
        ['atom_type', 'atom_inResidue', 'residue_type', 'peptide_seq_id', 'atom_axis_x', 'atom_axis_y', 'atom_axis_z',
         'index_column']]
    df_tolist_2 = df_tolist[['pdb_id', 'index_column']]

    df_atomized = df_tolist_1.apply(lambda x: x.explode()).reset_index(drop=True)

    print(f"Successfully atomized: {file_path}")

    df_atomized = df_atomized.merge(df_tolist_2, on='index_column')

    df_atomized['peptide_seq_id'] = df_atomized['peptide_seq_id'].replace('.', '9999')
    df_atomized['peptide_seq_id'] = df_atomized['peptide_seq_id'].astype('int64')

    df_atomized['states'] = 1
    current_state = 1
    index_num = 0

    for i in range(1, len(df_atomized)):
        if df_atomized.at[i, 'index_column'] != index_num:
            # New PDB entry: restart the counter
            current_state = 0
            index_num = df_atomized.at[i, 'index_column']
        # Separate `if` (not `elif`), so an entry starting with N of residue 1 gets state 1, not 0
        if df_atomized.at[i, 'atom_inResidue'] == 'N' and df_atomized.at[i, 'peptide_seq_id'] == 1:
            current_state += 1
        df_atomized.at[i, 'states'] = current_state

    print(f"Successfully added 'states' column: {file_path}")

    for i in range(1, len(df_atomized)):
        if df_atomized.at[i, 'peptide_seq_id'] == 9999:
            df_atomized.at[i, 'peptide_seq_id'] = df_atomized.at[i - 1, 'peptide_seq_id'] + 1

    columns_restlist = [
        '_exptl.method', '_atom_site.type_symbol', '_atom_site.label_atom_id', '_atom_site.label_comp_id',
        '_atom_site.label_seq_id', '_atom_site.Cartn_x', '_atom_site.Cartn_y', '_atom_site.Cartn_z'
    ]
    df_pdb_rest = df_pdb_clean.drop(columns=columns_restlist)

    file_index = os.path.basename(file_path).split('_')[-1].split('.')[0]
    df_atomized.to_csv(f'/Users/wangqian/PycharmProjects/pythonProject8/df_atomized_{file_index}.csv', index=False)  # TODO: update path if needed

    print(f"Successfully saved df_atomized_{file_index}")

    df_pdb_rest.to_csv(f'/Users/wangqian/PycharmProjects/pythonProject8/df_pdb_rest_{file_index}.csv', index=False)  # TODO: update path if needed
    print(f"Successfully saved df_pdb_rest_{file_index}")

    print(f"Processed and saved: df_PDB_{file_index}.csv")


def process_files_sequentially(csv_files):
    for file in csv_files:
        try:
            process_file(file)
            print(f"Completed processing: {file}")
        except Exception as e:
            print(f"Error processing {file}: {e}")


def main():
    pdb_csv_path = "/Users/wangqian/PycharmProjects/pythonProject8/" # TODO: update path if needed
    csv_files = [os.path.join(pdb_csv_path, f"df_PDB_{i}.csv") for i in range(3, 30)]

    chunk_size = 200
    for i in range(0, len(csv_files), chunk_size):
        csv_chunk = csv_files[i:i + chunk_size]
        process_files_sequentially(csv_chunk)
        print(f"Chunk {i // chunk_size + 1} processed.")


if __name__ == '__main__':
    main()
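
The script above builds its input list from a fixed `range(3, 30)`. As a hedged sketch of an alternative, the files could be discovered on disk and sorted by their numeric suffix. The helper name `find_pdb_csvs` is hypothetical, and the demo uses empty placeholder files in a temporary folder rather than real data.

```python
import re
import tempfile
from pathlib import Path


def find_pdb_csvs(folder):
    """Return df_PDB_<n>.csv files in `folder`, sorted by their numeric index."""
    pattern = re.compile(r"df_PDB_(\d+)\.csv$")
    matches = []
    for path in Path(folder).glob("df_PDB_*.csv"):
        m = pattern.search(path.name)
        if m:
            matches.append((int(m.group(1)), path))
    # Sort numerically, so that 10 comes after 4 rather than before 3.
    return [path for _, path in sorted(matches)]


# Demo in a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for i in (10, 3, 4):
        (Path(tmp) / f"df_PDB_{i}.csv").touch()
    found_names = [p.name for p in find_pdb_csvs(tmp)]
    print(found_names)
```

Sorting on the parsed integer avoids the lexicographic order that a plain `sorted(glob(...))` would give.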
