# Extract amino acids responsible for DNA specificity from C2H2 zinc finger domains

C2H2 zinc finger domains form a loop stabilized by a central zinc ion that is coordinated by two cysteine and two histidine residues. The amino acids that are mainly responsible for DNA binding specificity are located at defined positions within the finger.

## C--C-----X--X--XH---H

C: Cysteine
H: Histidine
-: Variable amino acid
X: Contact amino acid

This script reads a multi-FASTA file with protein sequences, identifies C2H2 zinc fingers and returns the contact amino acids for each zinc finger.
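
As a compact illustration of the motif, the pattern can also be written as a regular expression whose three capture groups sit at the X positions. This is only a sketch, not the function used later in this notebook; the names `C2H2_PATTERN` and `find_aa_regex` are hypothetical. It assumes the sequence is a single string without line breaks.

```python
import re

# Regex form of C--C-----X--X--XH---H (21 residues);
# the three groups capture the contact (X) positions.
C2H2_PATTERN = re.compile(r"C..C.{5}(.)..(.)..(.)H...H")

def find_aa_regex(protein):
    """Return the contact residues of each matched finger, joined by '-'."""
    seq = protein.upper()
    # finditer scans left to right and yields non-overlapping matches
    return "-".join("".join(m.groups()) for m in C2H2_PATTERN.finditer(seq))

# One finger taken from the ZFP809 sequence in KRAB-ZFPs.fa
print(find_aa_regex("YECDECSKMYYWKSDLTSHQKTH"))  # WDS
```

Because `re.finditer` resumes after the end of each match, a matched finger is never reused as the start of another one.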

In [17]:
with open('KRAB-ZFPs.fa') as file:
    for line in file:
        print(line)

>ZFP809

MGLVSFEDVAVDFTLEEWQDLDAAQRTLYRDVMLETYSSLVFLDPCIAKPKLIFNLERGF

GPWSLAEASSRSLPGVHNVSTLSDTSKKIPKTRLRQLRKTNQKTPSEDTIEAELKARQEV

SKGTTSRHRRAPVKSLCRKSQRTKNQTSYNDGNLYECKDCEKVFCNNSTLIKHYRRTHNV

YKPYECDECSKMYYWKSDLTSHQKTHRQRKRIYECSECGKAFFRKSHLNAHERTHSGEKP

YECTECRKAFYYKSDLTRHKKTHLGEKPFKCEECKKAFSRKSKLAIHQKKHTGEKPYECT

ECKKAFSHQSQLTAHRIAHSSENPYECKECNKSFHWKCQLTAHQKRHTGQYGDS*

>Zfp599

MGLISFEDVAVDFTWEEWQDLDAAQRTLYRDVMLETYISLVSLGHCMNKPELIFKLEQGL

GPWSVAEASDRNLSDFHILTAPIVTSQKNHKAYMWQARTTENKASNEKIAELKEQQKIHQ

GSKSCEREAHGKTFFQKAQSTVIQMSPTRQTALHYTATLTKGQRPHRGKMSREYEECRKT

IFHNSHVPGCQKTLIDTKLCGCTECRKDFSCNSKLTSHPRTRIRKRPYKCKECGKAFCSQ

GKLTLHQIVHTGEKPYECTECGKAFSHKAYLTQHQKIHMSKKPYACTECGKAFYRLSHLT

LHQRTHTNEKPYDCTECQKSFSCRSQLTLHQRTHTGERPFECMECGKSFYYKAHLIRHQR

IHTNEKPFECIECEKSFYCQSDLTVHQRSHTGEKPYECKECGKSFYQKSKLTLHQRNHVG

EKSYACTDCGEVFYCKSHLTLHQTVHTDEHPYICTECGKCFYYKSQLIVHGRTHTGDRPY

KCGDCGKAFSRKSHLIRHQSITHIDKNNLNVANVGKVSTVRPDSLHTHSLYLSEHKHAPS

ILKEKNAG*

>Zfp810

MVLVSFEDVAVDFTWEEWQALDAAQRTLYRDVMLETY

In [18]:
def find_aa(x):
    seq = x.upper()
    fingerprint = ''
    while len(seq) >= 21:                                                  # Loop while a full 21 aa motif still fits
        if seq[0] + seq[3] + seq[16] + seq[20] == 'CCHH':
            fingerprint = fingerprint + seq[9] + seq[12] + seq[15] + '-'
            seq = seq[20:]                                                 # Skip the matched C2H2 motif for the further screen
        seq = seq[1:]                                                      # Remove first amino acid from sequence
    return fingerprint[:-1]                                                # Drop the trailing '-'
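
To make the position arithmetic easier to check, the following sketch scans with an index instead of shortening the string and reports where each finger starts (0-based). The helper name `find_fingers` is hypothetical; the fragment is taken from two consecutive lines of the ZFP809 sequence printed above.

```python
def find_fingers(protein):
    """Return (start_index, contact_residues) for every C--C-----X--X--XH---H match."""
    seq = protein.upper()
    hits = []
    i = 0
    while i + 21 <= len(seq):              # a full 21-residue window must fit
        window = seq[i:i + 21]
        if window[0] + window[3] + window[16] + window[20] == "CCHH":
            hits.append((i, window[9] + window[12] + window[15]))
            i += 21                        # jump past the matched finger
        else:
            i += 1                         # slide the window by one residue
    return hits

fragment = ("YKPYECDECSKMYYWKSDLTSHQKTHRQRKRIYECSECGKAFFRKSHLNAHERTHSGEKP"
            "YECTECRKAFYYKSDLTRHKKTHLGEKPFKCEECKKAFSRKSKLAIHQKKHTGEKPYECT")
for start, contacts in find_fingers(fragment):
    print(start, contacts)
```

On this fragment the sketch finds four fingers starting at indices 5, 34, 62 and 90, with the contact triplets WDS, RHA, YDR and RKI.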

In [19]:
with open('KRAB-ZFPs.fa') as file:                 # Open fasta file with protein sequences
    
    zfp_seq = ''                                   # Empty string for protein sequence
    zfp_names = []                                 # Empty list for protein names
    
    for line in file:                              # iterate through lines of fasta file
        
        if line[0] != '>':                         # Add protein sequence to string if line belongs to sequence  
            zfp_seq += line
            
        else:                                      # Store protein name in list if line starts with '>'
            zfp_names.append(line)

zfp_seq_flat = zfp_seq.replace('\n', '')           # Remove newlines from sequence
zfp_seq_list = zfp_seq_flat.split('*')             # Split at the '*' terminators into individual protein sequences

for i in range(len(zfp_seq_list) - 1):             # Apply find_aa to each sequence; the last element is what follows the final '*'
    print(zfp_names[i], find_aa(zfp_seq_list[i]))

>ZFP809
 WDS-RHA-YDR-RKI-HQA-WQA
>Zfp599
 SKL-HYQ-RHL-CQL-YHR-CDV-QKL-CHL-YQV
>Zfp810
 CHV-SQV-RDV-CQV-CQV-CDR-RDV-CQV-CER-TDR
>Zfp961
 CSK-YCR-NAE-TSM-RSL-HHA-SCR-YSN-YSN-CYN-YSR
>Zfp882
 SSE-SST-YGR-DSC-DSC-RSC-HAY-LAK-FSC-YSC-YSR-DSC-TSY
>Zfp709
 FNR-CNE-FNR-YND-YST-YST-TSY-STI-VSC-RSC-SNK-GSR-SGR-NAN-SKI-GSR-TDR-RSI
>Zfp617
 YSN-RNA-STA-STA-TTA-STA-QYI-RHI-QYI-RHI-HYI-RSI-LSR-CSK
>Zfp600
 DTK-DTK-DSV-QHI-DTK-ETK-DSV-KHI-QPI-DSV-QHI-DSV-DSV-DYV-HSV-RHI-DSR
>2610305D13Rik
 QLN-HSI-HSI
>Gm13051
 NNM-KNI-QYI-ESI-QHI-QNI-ESI-ESI-ESI-VSI-KHI-EII-RHI-QHI
>Gm13139
 HHI-QHI-ENI-QHI-EHI-EHI-KSI-FSI-ECI-FSI-EKI
>Gm13212
 HHI-HSM-ESI-HHS-KSI-QDI-HHS-RHI-QDI-HHS-RHI-RHI-RHI-RHI-QDI-HHS-RHI-QNI-ESI-RHI
>Gm13242
 DTK-DSV-QHI-DSV-DSV-DSV-HSV-EGR-DSV-DSV-DYV-HSV-EGR
>Rex2_
 DTK-DTK-DSV-QHI-DTK-ETK-DSV-KHI-QPI-DSV-QHI-HSV-EGR-RHI-QHI
>Gm13154
 HNA-HRY-HNA-HNA-RHI-GSI-GSI-GSI-QHI-HHN
>Gm13157
 HHI-HSM-QHV-HHS-KSI-HNI-HHS-RHI-QNI-ESI-RHI
>LOC102638055
 ESI-ESI-ESI-VSI-KNI-RHQ-DSV-QHI-DSV-DSV-DYV-HSV
>
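
The parsing step above relies on every sequence ending with `*`. As an alternative sketch, records can be grouped by their header line instead, which also works for FASTA files without terminators. The function name `read_fasta` is hypothetical, and the demo input is made up for illustration; records with duplicate names would overwrite each other here.

```python
import io

def read_fasta(handle):
    """Map each header (without '>') to its sequence, dropping a trailing '*'."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            name = line[1:]                # start a new record
            records[name] = ""
        elif name is not None:
            records[name] += line          # append sequence line to current record
    return {n: s.rstrip("*") for n, s in records.items()}

# Made-up two-record input, one with and one without the '*' terminator
demo = io.StringIO(">A\nMGLV\nSFED*\n>B\nCKEC\n")
print(read_fasta(demo))
```

Each value of the returned dictionary can then be passed to `find_aa`, and the names no longer carry a trailing newline.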