## Project Overview
In this project, I developed a Python program that identifies and traces the backbone of a polypeptide chain from a molecular structure file in XYZ format. The program reads the atomic data from the file, decides which atoms are bonded by comparing interatomic distances with the sum of their covalent radii, and locates amide and carboxyl groups. Starting from the nitrogen atom at the N-terminus, it follows the bonds from one backbone atom to the next until it reaches a carboxyl carbon. The result is an ordered list of backbone atoms, printed one per line in chain order.



### Key Features

- Parses `.xyz` molecular files and extracts each atom's element symbol, index, and Cartesian coordinates.
- Determines bonded atoms by comparing interatomic distances against the sum of covalent radii (+0.4 Å tolerance).
- Detects functional groups:
  - **Amide group**: carbon bonded to both nitrogen and oxygen
  - **Carboxyl group**: carbon bonded to two oxygens
- Identifies the start of the chain by locating a nitrogen atom bonded to four atoms (N-terminus).
- Traces through the backbone structure by following bonds and printing atom identities and indices in order.
- Outputs the full backbone chain one atom per line, with `|` between bonded atoms and `~` marking both ends of the chain.


In [1]:
import pandas as pd
import numpy as np
from cov_radii import *

In [2]:
polypep = pd.read_table('polypep.xyz', skiprows=2, sep='\\s+',
                         names=['atom', 'x', 'y', 'z'])
display(polypep)

| | atom | x | y | z |
|---|---|---|---|---|
| 0 | N | -22.81300 | -15.68100 | 8.13800 |
| 1 | C | -22.50900 | -16.60700 | 9.21500 |
| 2 | C | -21.10200 | -16.40200 | 9.74500 |
| 3 | O | -20.85800 | -16.48100 | 10.95200 |
| 4 | N | -20.16700 | -16.13700 | 8.84000 |
| ... | ... | ... | ... | ... |
| 196 | H | -14.59685 | -17.51337 | 24.17883 |
| 197 | H | -15.69359 | -16.92701 | 21.53027 |
| 198 | H | -16.31849 | -16.40429 | 22.98308 |
| 199 | H | -16.30627 | -18.02944 | 22.61824 |


### parse_line(*number*,*string*)

This function receives an integer, referred to here as atom_label, and a string (the name of the XYZ file), and returns a list of the form:

 [element_symbol,atom_label,coordinate_array]

where element_symbol is a string with the chemical symbol letter(s), atom_label is the integer passed to the function, which is the 1-based position of the atom in the XYZ file, and coordinate_array is a NumPy array with the atom's x, y, and z coordinates. For example:

    ['Se',137,array([66.52700,-0.24800,33.94300])]
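
As a minimal sketch of this record format, the hypothetical helper `parse_xyz_line` below builds the same three-element list from a single line of text. It is not part of the project code: it assumes whitespace-separated fields and stores the coordinates in a plain tuple rather than a NumPy array so that it runs without any dependencies.

```python
def parse_xyz_line(atom_label, line):
    """Build [element_symbol, atom_label, (x, y, z)] from one XYZ atom line."""
    # An atom line holds the element symbol followed by three coordinates.
    symbol, x, y, z = line.split()
    return [symbol, atom_label, (float(x), float(y), float(z))]


record = parse_xyz_line(137, "Se 66.52700 -0.24800 33.94300")
print(record)  # ['Se', 137, (66.527, -0.248, 33.943)]
```
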

In [3]:
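def parse_line(number,file_name):
    # Read the atom table, skipping the two header lines of the XYZ file
    # (atom count and comment line).
    molecule = pd.read_table(file_name, skiprows=2, sep='\\s+',
                         names=['atom', 'x', 'y', 'z'])
    # Atom labels are 1-based, while the DataFrame rows are 0-based.
    atom = molecule.loc[number-1,'atom']
    x = molecule.loc[number-1,'x']
    y = molecule.loc[number-1,'y']
    z = molecule.loc[number-1,'z']
    return [atom, number, np.array([x, y, z])]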


In [4]:
parse_line(1,'polypep.xyz')

['N', 1, array([-22.813, -15.681,   8.138])]

### whole_list(file_name)
This function receives one string (the name of the XYZ file) and returns a list of lists, each of the same form as the output of parse_line().

In [5]:
def whole_list(file_name):
    # Count the atom rows (the two XYZ header lines are skipped).
    rows = len(pd.read_table(file_name, skiprows=2, sep='\\s+',
                         names=['atom', 'x', 'y', 'z']))
    # Atom labels run from 1 to rows, in the order the atoms appear in the file.
    atom_list = []
    for i in range(1, rows + 1):
        atom_list.append(parse_line(i,file_name))

    return atom_list
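
Because `whole_list()` calls `parse_line()` once per atom, the file is re-read for every atom. The sketch below shows one way the same records could be collected in a single pass. `read_xyz` is a hypothetical name, the comment line in the sample text is invented, and tuples stand in for NumPy arrays to keep the example dependency-free. It relies on the XYZ layout in which line 1 holds the atom count and line 2 is a comment.

```python
def read_xyz(lines):
    """Parse XYZ-format lines into [symbol, label, (x, y, z)] records in one pass."""
    lines = list(lines)
    n_atoms = int(lines[0].split()[0])  # line 1: number of atoms
    records = []
    # Line 2 is a free-text comment, so the atom lines start at index 2.
    for label, line in enumerate(lines[2:2 + n_atoms], start=1):
        symbol, x, y, z = line.split()[:4]
        records.append([symbol, label, (float(x), float(y), float(z))])
    return records


# First three atoms of the table shown earlier; the comment line is made up.
example = """3
illustrative fragment
N -22.81300 -15.68100 8.13800
C -22.50900 -16.60700 9.21500
C -21.10200 -16.40200 9.74500
"""
atoms = read_xyz(example.splitlines())
for atom in atoms:
    print(atom)
```
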

### distance(*array1*,*array2*) 

This function receives two arrays as arguments, each containing the coordinates of a point in 3D space, and returns the Euclidean distance between the two points.

In [6]:
def distance(array1,array2):
    # Euclidean norm of the difference vector between the two points.
    return np.linalg.norm(array1 - array2)

In [7]:
distance(np.array([-22.509, -16.607,   9.215]), np.array([-21.102, -16.402,   9.745]))

1.5174234741824704
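
The value above can be cross-checked without NumPy. The sketch below computes the same distance twice, once by summing squared coordinate differences and once with the standard-library `math.dist` (available from Python 3.8). The variable names are illustrative only.

```python
import math

# Coordinates of atoms 2 and 3 from the example call above.
c2 = (-22.509, -16.607, 9.215)
c3 = (-21.102, -16.402, 9.745)

# d = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)
squared = sum((a - b) ** 2 for a, b in zip(c2, c3))
d_manual = math.sqrt(squared)
d_builtin = math.dist(c2, c3)

print(round(d_manual, 4), round(d_builtin, 4))  # 1.5174 1.5174
```
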


### coordination(*list1*,*list2*)

This function receives two lists: list1 has the same form as the output of parse_line(), and list2 is a list of lists, each of that same form. The function determines which atoms in list2 are bonded to the atom in list1 and returns them as a list with the same format as list2. Two atoms are treated as bonded when the distance between them is smaller than the sum of their covalent radii plus 0.4 Å.

In [8]:
def coordination(list1,list2):
    # list2 is a list of atom records; return those bonded to the atom in list1.
    bonded_atom_list = []
    for i in list2:
        atom_distance = distance(list1[2],i[2])
        # Bonded means closer than the sum of the covalent radii plus a 0.4 Å
        # tolerance; a distance of 0 is the atom itself and is skipped.
        max_bond_length = cov_dictionary[list1[0]] + cov_dictionary[i[0]] + 0.4
        if atom_distance < max_bond_length and atom_distance != 0:
            bonded_atom_list.append(i)

    return bonded_atom_list

In [9]:
print(parse_line(3,'polypep.xyz'))
coordination(parse_line(5,'polypep.xyz'),whole_list('polypep.xyz'))

['C', 3, array([-21.102, -16.402,   9.745])]


[['C', 3, array([-21.102, -16.402,   9.745])],
 ['C', 6, array([-18.785, -15.914,   9.242])],
 ['H', 108, array([-20.41058, -16.09078,   7.88723])]]
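
The bonding test used by `coordination()` can be illustrated in isolation. In the sketch below, `ASSUMED_RADII` holds approximate covalent radii chosen for illustration; the contents of the `cov_radii` module are not shown in this notebook and may differ. `is_bonded` is a hypothetical helper, and the coordinates are those of atoms 1, 2, and 3 from the table above.

```python
import math

# Illustrative covalent radii in Å (assumed values, not taken from cov_radii).
ASSUMED_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4  # Å, the tolerance used in coordination()


def is_bonded(symbol_a, xyz_a, symbol_b, xyz_b):
    """Return True if the two atoms are closer than r_a + r_b + tolerance."""
    d = math.dist(xyz_a, xyz_b)
    cutoff = ASSUMED_RADII[symbol_a] + ASSUMED_RADII[symbol_b] + TOLERANCE
    return 0 < d < cutoff  # d == 0 would be the atom compared with itself


n1 = (-22.813, -15.681, 8.138)
c2 = (-22.509, -16.607, 9.215)
c3 = (-21.102, -16.402, 9.745)

bonded_c2_c3 = is_bonded("C", c2, "C", c3)  # about 1.52 Å against a 1.92 Å cutoff
bonded_n1_c3 = is_bonded("N", n1, "C", c3)  # about 2.46 Å against a 1.87 Å cutoff
print(bonded_c2_c3, bonded_n1_c3)  # True False
```
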

### amide(*list1*,*list2*)

This function receives two lists of the same form as the arguments of the coordination() function. list1 corresponds to a carbon atom, and list2 to a list with all atoms in the molecule. The function returns True if the carbon is bonded to both a nitrogen and an oxygen, that is, if it is part of an amide group, and False otherwise.

In [10]:
def amide(list1, list2):
    # Only carbon atoms are tested.
    if list1[0] != 'C':
        return False
    bonded_atoms = coordination(list1, list2)
    # An amide carbon is bonded to at least one nitrogen and one oxygen.
    has_N = any(atom[0] == 'N' for atom in bonded_atoms)
    has_O = any(atom[0] == 'O' for atom in bonded_atoms)
    return has_N and has_O

In [11]:
(amide(parse_line(3,'polypep.xyz'),whole_list('polypep.xyz')))


True


### carboxyl(*list1*,*list2*) 

This function receives two lists of the same form as the arguments of the coordination() function. list1 corresponds to a carbon atom, and list2 to a list with all atoms in the molecule. The function returns True if the carbon has exactly three bonded atoms, two of which are oxygens, that is, if it is part of a carboxyl group, and False otherwise.

In [12]:
def carboxyl(list1, list2):
    # Only carbon atoms are tested.
    if list1[0] != 'C':
        return False
    bonded_atoms = coordination(list1, list2)
    # A carboxyl carbon has exactly three bonded atoms, two of them oxygens.
    if len(bonded_atoms) != 3:
        return False
    number_of_oxygen = sum(1 for atom in bonded_atoms if atom[0] == 'O')
    return number_of_oxygen == 2

In [13]:
print(parse_line(11,'polypep.xyz'))
coordination(parse_line(11,'polypep.xyz'), whole_list('polypep.xyz'))

['C', 11, array([-17.012, -17.317,   6.061])]


[['C', 10, array([-17.828, -17.285,   7.336])],
 ['O', 12, array([-17.101, -16.363,   5.261])],
 ['O', 13, array([-16.291, -18.317,   5.846])]]
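
Both group tests reduce to counting the elements bonded to a carbon. The sketch below restates them on plain lists of element symbols. `looks_like_amide` and `looks_like_carboxyl` are hypothetical names. The neighbours of atom 11 are taken from the output above, while the neighbour list for atom 3 is an assumption consistent with the earlier results.

```python
def looks_like_amide(neighbour_symbols):
    """A carbon bonded to at least one N and at least one O."""
    return "N" in neighbour_symbols and "O" in neighbour_symbols


def looks_like_carboxyl(neighbour_symbols):
    """A carbon with exactly three bonded atoms, two of them O."""
    return len(neighbour_symbols) == 3 and neighbour_symbols.count("O") == 2


c3_neighbours = ["C", "O", "N"]   # assumed neighbours of atom 3
c11_neighbours = ["C", "O", "O"]  # atoms 10, 12 and 13, as printed above

print(looks_like_amide(c3_neighbours), looks_like_carboxyl(c3_neighbours))    # True False
print(looks_like_amide(c11_neighbours), looks_like_carboxyl(c11_neighbours))  # False True
```

Note that the amide test does not check the number of bonded atoms, whereas the carboxyl test requires exactly three.
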

### backbone(file_name)
This function identifies and prints the sequence of atoms forming the backbone of a polypeptide chain from a molecular structure file in .xyz format.

In [14]:
def backbone(file_name):
    atoms = whole_list(file_name)
    backbone_list = []

    # Step 1: collect every carbon that belongs to an amide or carboxyl group.
    for atom in atoms:
        if carboxyl(atom, atoms) or amide(atom, atoms):
            backbone_list.append(atom)

    # Step 2: add the N and C atoms bonded to those carbons as candidates.
    # Iterate over a copy because the list grows inside the loop.
    for atom in backbone_list[:]:
        for bonded in coordination(atom, atoms):
            if bonded[0] in ('N', 'C') and bonded not in backbone_list:
                backbone_list.append(bonded)

    # Step 3: the N-terminus is a nitrogen with four bonded atoms, one of
    # which is a backbone candidate.
    first_atom = None
    for atom in atoms:
        if atom[0] == 'N' and len(coordination(atom, atoms)) == 4:
            for bonded in coordination(atom, atoms):
                if any(bonded[1] == b[1] for b in backbone_list):
                    first_atom = atom
                    break
            if first_atom:
                break

    if not first_atom:
        print("Start nitrogen not found.")
        return

    print('~')
    print(first_atom[0], first_atom[1])
    print('|')

    # Step 4: walk along the chain. Each visited atom is removed from the
    # candidates so that the walk cannot turn back.
    current_atom = first_atom
    while True:
        bonded_atoms = coordination(current_atom, backbone_list)
        if not bonded_atoms:
            print("Backbone traversal incomplete.")
            break

        next_atom = bonded_atoms[0]
        print(next_atom[0], next_atom[1])
        backbone_list.remove(next_atom)

        # Reaching a carboxyl carbon ends the trace.
        if carboxyl(next_atom, atoms):
            print('~')
            break

        print('|')
        current_atom = next_atom
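
The traversal loop in `backbone()` can be seen more easily on a toy example. The sketch below is an illustration under stated assumptions, not the project code: `trace_chain` is a hypothetical helper, the atom names and the adjacency dictionary describe a made-up dipeptide-like chain, and bonds are looked up in a dictionary instead of being computed from distances.

```python
def trace_chain(start, neighbours, candidates, is_end):
    """Walk from start through the candidate atoms until is_end() is True."""
    remaining = list(candidates)
    path = [start]
    current = start
    while True:
        # Candidates bonded to the current atom, in candidate-list order.
        options = [a for a in remaining if a in neighbours[current]]
        if not options:
            break  # no unvisited bonded candidate: traversal incomplete
        current = options[0]
        path.append(current)
        remaining.remove(current)  # visited atoms cannot be revisited
        if is_end(current):
            break
    return path


# Made-up chain: N1 - CA1 - C1 - N2 - CA2 - C2 (C2 plays the carboxyl carbon).
neighbours = {
    "N1": ["CA1"],
    "CA1": ["N1", "C1"],
    "C1": ["CA1", "N2"],
    "N2": ["C1", "CA2"],
    "CA2": ["N2", "C2"],
    "C2": ["CA2"],
}
# Group carbons first, then their N/C neighbours; the start atom is excluded.
candidates = ["C1", "C2", "CA1", "N2", "CA2"]

path = trace_chain("N1", neighbours, candidates, lambda atom: atom == "C2")
formatted = "~\n" + "\n|\n".join(path) + "\n~"
print(formatted)
```

Removing each visited atom from the candidate list is what keeps the walk moving forward: at `CA1`, the only remaining bonded candidate is `C1`, because `N1` was never in the list.
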

In [15]:
backbone('polypep.xyz')

~
N 1
|
C 2
|
C 3
|
N 5
|
C 6
|
C 7
|
N 14
|
C 15
|
C 16
|
N 28
|
C 29
|
C 30
|
N 37
|
C 38
|
C 39
|
N 46
|
C 47
|
C 48
|
N 53
|
C 54
|
C 55
|
N 61
|
C 62
|
C 63
|
N 69
|
C 70
|
C 71
|
N 76
|
C 77
|
C 78
|
N 90
|
C 91
|
C 92
|
N 94
|
C 95
|
C 96
~
