# stitch chromosomal contigs based on nucmer results
strain: M7p
Input: a pad file & a fasta file
Sherwood's trimming comment (April 7, 2023): No N’s at left end, but about 500 bp missing relative to other chromosomes.  
Remove the rightmost contig (about 42 kbp); it was unitig_0 and is actually the full lp28-6 sequence including wraparounds at both ends – it should not be part of the chromosome.  After this removal I think there are about 2500 bp missing from the right end.  Already has N’s there.

Trim the right end: stop at unitig_48, the last contig kept in the pad file.
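
The right-end trim can be sketched as truncating the pad rows after the last contig to keep. This is a minimal illustration, not the step actually used to produce the pad file: `trim_after` is a hypothetical helper, and the trailing `unitig_0` row (including its pad value) is a placeholder that stands in for the removed lp28-6 contig.

```python
def trim_after(rows, stop_id):
    """Keep pad rows up to and including the contig named stop_id.

    rows: iterable of (seq_id, revcom, pad) tuples in chromosome order.
    stop_id: contig name without the "|quiver|quiver|pilon" suffix.
    """
    kept = []
    for seq_id, revcom, pad in rows:
        kept.append((seq_id, revcom, pad))
        if seq_id.split("|")[0] == stop_id:
            return kept
    raise ValueError(f"{stop_id} not found in pad rows")


rows = [
    ("unitig_49|quiver|quiver|pilon", -1, 3094),
    ("unitig_48|quiver|quiver|pilon", 1, 0),
    ("unitig_0|quiver|quiver|pilon", 1, 0),  # placeholder row, not from the real file
]
trimmed = trim_after(rows, "unitig_48")
print([r[0].split("|")[0] for r in trimmed])
```

Matching on the name before the first `|` avoids `unitig_4` accidentally matching `unitig_48`.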

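M7p.pad file (tab-separated columns: contig ID; orientation, where -1 = reverse complement and 1 = forward; pad in bp, where a positive value is a gap filled with N's and a negative value is an overlap to trim):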

unitig_18|quiver|quiver|pilon   -1      237
unitig_54|quiver|quiver|pilon   -1      5542
unitig_19|quiver|quiver|pilon   -1      -117
unitig_50|quiver|quiver|pilon   -1      753
unitig_16|quiver|quiver|pilon   -1      -25977
unitig_70|quiver|quiver|pilon   -1      29429
unitig_25|quiver|quiver|pilon   -1      4589
unitig_63|quiver|quiver|pilon   1       1364
unitig_14|quiver|quiver|pilon   1       1579
unitig_22|quiver|quiver|pilon   -1      5438
unitig_12|quiver|quiver|pilon   1       2395
unitig_6|quiver|quiver|pilon    1       12169
unitig_20|quiver|quiver|pilon   -1      2310
unitig_56|quiver|quiver|pilon   -1      2848
unitig_45|quiver|quiver|pilon   -1      -989
unitig_36|quiver|quiver|pilon   1       218
unitig_29|quiver|quiver|pilon   1       3884
unitig_73|quiver|quiver|pilon   1       2144
unitig_37|quiver|quiver|pilon   -1      1220
unitig_47|quiver|quiver|pilon   1       3153
unitig_32|quiver|quiver|pilon   -1      560
unitig_10|quiver|quiver|pilon   1       2895
unitig_13|quiver|quiver|pilon   1       9152
unitig_21|quiver|quiver|pilon   1       2332
unitig_39|quiver|quiver|pilon   -1      1677
unitig_17|quiver|quiver|pilon   1       513
unitig_46|quiver|quiver|pilon   -1      186
unitig_28|quiver|quiver|pilon   1       1632
unitig_81|quiver|quiver|pilon   1       1446
unitig_38|quiver|quiver|pilon   1       430
unitig_5|quiver|quiver|pilon    -1      3361
unitig_15|quiver|quiver|pilon   1       4156
unitig_31|quiver|quiver|pilon   -1      1277
unitig_24|quiver|quiver|pilon   1       5933
unitig_26|quiver|quiver|pilon   1       1009
unitig_11|quiver|quiver|pilon   1       -159
unitig_41|quiver|quiver|pilon   -1      -7395
unitig_42|quiver|quiver|pilon   1       2338
unitig_43|quiver|quiver|pilon   -1      -44
unitig_30|quiver|quiver|pilon   -1      1455
unitig_49|quiver|quiver|pilon   -1      3094
unitig_48|quiver|quiver|pilon   1       0
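
How a single pad row is applied can be shown on a toy sequence. The sketch below mirrors the pad-or-merge rule used in the stitching cell further down, but on plain strings without Biopython; `reverse_complement` and `apply_pad` are hypothetical helper names, and the 6 bp sequence is made up for illustration.

```python
# Complement table for plain DNA strings (N stays N)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_pad(seq, revcom, pad):
    """Orient one contig, then pad it with N's (pad >= 0) or trim it (pad < 0)."""
    oriented = reverse_complement(seq) if revcom < 0 else seq
    if pad >= 0:
        return oriented + "N" * pad   # gap: append pad N's
    if revcom < 0:
        return oriented[abs(pad):]    # overlap: drop bases from the revcom start
    return oriented[:pad]             # overlap: drop bases from the end


print(apply_pad("AACCGG", 1, 3))    # AACCGGNNN
print(apply_pad("AACCGG", 1, -2))   # AACC
print(apply_pad("AACCGG", -1, -2))  # GGTT
```

In every case the output length is `len(seq) + pad`, which is what makes the length check at the end of the notebook possible.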

In [11]:
import pandas as pd
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

file_in ='Broken-to-be-gapped/Mp7.chrom.fasta'
file_out='Broken-to-be-gapped/M7p-chrom-gapped-v2.fasta'

df_pad = pd.read_csv("Broken-to-be-gapped/M7p.pad", delimiter="\t", header=None)
df_pad.columns = ['seq_id', 'revcom', 'pad']
print(df_pad.head())
print(df_pad['pad'].sum())
print(df_pad.shape)

                          seq_id  revcom    pad
0  unitig_18|quiver|quiver|pilon      -1    237
1  unitig_54|quiver|quiver|pilon      -1   5542
2  unitig_19|quiver|quiver|pilon      -1   -117
3  unitig_50|quiver|quiver|pilon      -1    753
4  unitig_16|quiver|quiver|pilon      -1 -25977
88037
(42, 3)


In [12]:
# Load every contig from the FASTA file, keyed by record ID
seqs = {rec.id: rec for rec in SeqIO.parse(file_in, 'fasta')}

pieces = []     # oriented, padded/trimmed contigs in pad-file order
contig_len = 0  # total length of the untrimmed contigs
pad_len = 0     # net pad: gaps count positive, overlaps negative
for index, row in df_pad.iterrows():
    seq_id = row['seq_id']
    revcom = int(row['revcom'])
    pad = int(row['pad'])
    
    # orient the contig: -1 means reverse complement
    if revcom < 0:
        seq_contig = str(seqs[seq_id].seq.reverse_complement())
    else:
        seq_contig = str(seqs[seq_id].seq)
        
    contig_len += len(seqs[seq_id].seq)
    
    # pad or merge
    if pad >= 0: # gap, pad N's
        seq_contig += 'N' * pad
    else: # overlap, merge
        if revcom < 0:
            seq_contig = seq_contig[abs(pad):] # e.g. pad = -4 -> [4:], drop 4 bases from the revcom start
        else:
            seq_contig = seq_contig[:pad] # e.g. pad = -4 -> [:-4], drop 4 bases from the seq end
    
    pieces.append(seq_contig)
    pad_len += pad
        
seq_str = ''.join(pieces)
seq_out = SeqRecord(id = "M7p_chromosome_gapped", seq = Seq(seq_str))
# check sum: contig_len + pad_len should equal len(seq_str)
print(contig_len)
print(pad_len)
print(len(seq_str))

#with open(file_out, "w") as f_out:
#    f_out.write(seq_out.format('fasta'))


809131
88037
897168
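
The three printed numbers act as a checksum: each gap adds `pad` N's and each overlap removes `abs(pad)` bases, so the stitched length should equal the total contig length plus the net pad. The sketch below turns that into an explicit check using the values printed above; `check_stitch` is a hypothetical helper, and the identity assumes no overlap is longer than the contig it trims.

```python
def check_stitch(contig_len, pad_len, stitched_len):
    """Return the expected stitched length, or raise if the lengths disagree."""
    expected = contig_len + pad_len
    if expected != stitched_len:
        raise ValueError(
            f"length mismatch: expected {expected}, got {stitched_len}"
        )
    return expected


# Values printed by the stitching cell above
print(check_stitch(809131, 88037, 897168))  # 897168
```

Raising on a mismatch, rather than comparing the numbers by eye, makes it harder to write out a FASTA file that was stitched from an inconsistent pad file.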
