SyRI crashing for incomplete assemblies #79

Closed
mnshgl0110 opened this issue Jun 2, 2021 · 4 comments

Comments

@mnshgl0110
Member

mnshgl0110 commented Jun 2, 2021

Hi, sorry that my first message was so vague. Let me explain in a little more detail. First, I had no problem reproducing the example analysis; it worked as expected. Then I tried to run the protocol on my own genomes. I have several genome assemblies that I want to compare to the current reference, but none of them is at the chromosome level, and they have varying numbers of scaffolds. I aligned each query independently to the reference using minimap2 and then tried to call the SRs using SyRI with the following parameters:

./syri/bin/syri -c path/to/sam -r path/to/reference -q path/to/query -F S -k -f --no-chrmatch

SyRI started running, reported that the reference and the query have different numbers of scaffolds and that some were not aligned, and continued to run until it crashed. The error message was as follows:

SAM reader - WARNING - A1_scaffold0151 do not align with any reference sequence and cannot be analysed. Remove all unplaced scaffolds and contigs from the assemblies.
Reading Coords - WARNING - Chromosomes IDs do not match.
Reading Coords - WARNING - --no-chrmatch is set. Not matching chromosomes automatically.
Reading Coords - WARNING - BDIQ01000126.1, BDIQ01000179.1, BDIQ01000057.1, BDIQ01000041.1, BDIQ01000183.1, BDIQ01000086.1, BDIQ01000074.1, BDIQ01000107.1, BDIQ01000051.1, BDIQ01000167.1, BDIQ01000205.1, BDIQ01000152.1, BDIQ01000040.1, BDIQ01000030.1, BDIQ01000163.1, BDIQ01000135.1, BDIQ01000080.1, BDIQ01000117.1, BDIQ01000116.1, BDIQ01000088.1, BDIQ01000136.1, BDIQ01000169.1, BDIQ01000038.1, BDIQ01000055.1, BDIQ01000171.1, BDIQ01000160.1, BDIQ01000028.1, BDIQ01000193.1, BDIQ01000067.1, BDIQ01000191.1, BDIQ01000132.1, BDIQ01000014.1, BDIQ01000130.1, BDIQ01000121.1, BDIQ01000144.1, BDIQ01000024.1, BDIQ01000134.1, BDIQ01000139.1, BDIQ01000185.1, BDIQ01000058.1, BDIQ01000129.1, BDIQ01000032.1, BDIQ01000123.1, BDIQ01000063.1, BDIQ01000166.1, BDIQ01000035.1, BDIQ01000190.1, BDIQ01000102.1, BDIQ01000095.1, BDIQ01000076.1, BDIQ01000100.1, BDIQ01000125.1, BDIQ01000112.1, BDIQ01000006.1, BDIQ01000174.1, BDIQ01000178.1, BDIQ01000085.1, BDIQ01000011.1, BDIQ01000090.1, BDIQ01000151.1, BDIQ01000066.1, BDIQ01000108.1, BDIQ01000013.1, BDIQ01000017.1, BDIQ01000061.1, BDIQ01000098.1, BDIQ01000031.1, BDIQ01000198.1, BDIQ01000075.1, BDIQ01000137.1, BDIQ01000048.1, BDIQ01000156.1, BDIQ01000147.1, BDIQ01000127.1, BDIQ01000054.1, BDIQ01000164.1, BDIQ01000060.1, BDIQ01000081.1, BDIQ01000170.1, BDIQ01000012.1, BDIQ01000010.1, BDIQ01000068.1, BDIQ01000165.1, BDIQ01000184.1, BDIQ01000016.1, BDIQ01000050.1, BDIQ01000131.1, BDIQ01000077.1, BDIQ01000020.1, BDIQ01000042.1, BDIQ01000140.1, BDIQ01000158.1, BDIQ01000097.1, BDIQ01000168.1, BDIQ01000180.1, BDIQ01000105.1, BDIQ01000138.1, BDIQ01000022.1, BDIQ01000089.1, BDIQ01000122.1, BDIQ01000197.1, BDIQ01000161.1, BDIQ01000096.1, BDIQ01000047.1, BDIQ01000146.1, BDIQ01000128.1, BDIQ01000201.1, BDIQ01000001.1, BDIQ01000007.1, BDIQ01000194.1, BDIQ01000154.1, BDIQ01000091.1, BDIQ01000188.1, BDIQ01000120.1, BDIQ01000149.1, BDIQ01000083.1, BDIQ01000109.1, BDIQ01000079.1, BDIQ01000043.1, BDIQ01000143.1, BDIQ01000059.1, BDIQ01000114.1, BDIQ01000070.1, 
BDIQ01000118.1, BDIQ01000056.1, BDIQ01000033.1, BDIQ01000115.1, BDIQ01000025.1, BDIQ01000052.1, BDIQ01000093.1, BDIQ01000141.1, BDIQ01000199.1, BDIQ01000133.1, BDIQ01000065.1, BDIQ01000071.1, BDIQ01000195.1, BDIQ01000053.1, BDIQ01000039.1, BDIQ01000177.1, BDIQ01000073.1, BDIQ01000162.1, BDIQ01000192.1, BDIQ01000082.1, BDIQ01000034.1, BDIQ01000159.1, BDIQ01000101.1, BDIQ01000106.1, BDIQ01000157.1, BDIQ01000046.1, BDIQ01000145.1, BDIQ01000124.1, BDIQ01000111.1, BDIQ01000148.1, BDIQ01000104.1, BDIQ01000153.1, BDIQ01000150.1, BDIQ01000155.1, BDIQ01000002.1, BDIQ01000187.1, BDIQ01000196.1, BDIQ01000062.1, BDIQ01000078.1, BDIQ01000173.1, BDIQ01000186.1, BDIQ01000110.1, BDIQ01000027.1, BDIQ01000182.1, BDIQ01000021.1, BDIQ01000044.1, BDIQ01000172.1, BDIQ01000023.1, BDIQ01000064.1, BDIQ01000092.1, A1_scaffold0145, A1_scaffold0028, A1_scaffold0004, A1_scaffold0065, A1_scaffold0097, A1_scaffold0096, A1_scaffold0012, A1_scaffold0025, A1_scaffold0136, A1_scaffold0040, A1_scaffold0102, A1_scaffold0044, A1_scaffold0049, A1_scaffold0130, A1_scaffold0123, A1_scaffold0019, A1_scaffold0138, A1_scaffold0017, A1_scaffold0135, A1_scaffold0053, A1_scaffold0127, A1_scaffold0092, A1_scaffold0034, A1_scaffold0041, A1_scaffold0036, A1_scaffold0003, A1_scaffold0133, A1_scaffold0117, A1_scaffold0108, A1_scaffold0061, A1_scaffold0057, A1_scaffold0113, A1_scaffold0119, A1_scaffold0099, A1_scaffold0089, A1_scaffold0093, A1_scaffold0079, A1_scaffold0144, A1_scaffold0018, A1_scaffold0005, A1_scaffold0002, A1_scaffold0048, A1_scaffold0128, A1_scaffold0142, A1_scaffold0084, A1_scaffold0087, A1_scaffold0116, A1_scaffold0141, A1_scaffold0082, A1_scaffold0022, A1_scaffold0043, A1_scaffold0148, A1_scaffold0132, A1_scaffold0143, A1_scaffold0029, A1_scaffold0023, A1_scaffold0088, A1_scaffold0106, A1_scaffold0075, A1_scaffold0045, A1_scaffold0147, A1_scaffold0068, A1_scaffold0067, A1_scaffold0105, A1_scaffold0059, A1_scaffold0125, A1_scaffold0046, A1_scaffold0121, A1_scaffold0140, A1_scaffold0115, 
A1_scaffold0101, A1_scaffold0078, A1_scaffold0033, A1_scaffold0024, A1_scaffold0085, A1_scaffold0052, A1_scaffold0009, A1_scaffold0026, A1_scaffold0006, A1_scaffold0081, A1_scaffold0071, A1_scaffold0076, A1_scaffold0015, A1_scaffold0124, A1_scaffold0030, A1_scaffold0062, A1_scaffold0011, A1_scaffold0060, A1_scaffold0031, A1_scaffold0055, A1_scaffold0047, A1_scaffold0080, A1_scaffold0131, A1_scaffold0070, A1_scaffold0035, A1_scaffold0074, A1_scaffold0064, A1_scaffold0146, A1_scaffold0014, A1_scaffold0066, A1_scaffold0137, A1_scaffold0069, A1_scaffold0016, A1_scaffold0104, A1_scaffold0122, A1_scaffold0110, A1_scaffold0008, A1_scaffold0126, A1_scaffold0086, A1_scaffold0032, A1_scaffold0027, A1_scaffold0129, A1_scaffold0073, A1_scaffold0109, A1_scaffold0098, A1_scaffold0063, A1_scaffold0090, A1_scaffold0149, A1_scaffold0037, A1_scaffold0042, A1_scaffold0058, A1_scaffold0077, A1_scaffold0020, A1_scaffold0118, A1_scaffold0054, A1_scaffold0111, A1_scaffold0095, A1_scaffold0094, A1_scaffold0001, A1_scaffold0100, A1_scaffold0091, A1_scaffold0007, A1_scaffold0072, A1_scaffold0021, A1_scaffold0050, A1_scaffold0120, A1_scaffold0039, A1_scaffold0083, A1_scaffold0038, A1_scaffold0150, A1_scaffold0010, A1_scaffold0051, A1_scaffold0107, A1_scaffold0134, A1_scaffold0013, A1_scaffold0103, A1_scaffold0114, A1_scaffold0056, A1_scaffold0112, A1_scaffold0139 present in only one genome. Removing corresponding alignments
Traceback (most recent call last):
  File "/scratch/jcruzcor/04_SyRI_analysis/syri/syri/bin/syri", line 250, in <module>
    startSyri(args, coords[["aStart", "aEnd", "bStart", "bEnd", "aLen", "bLen", "iden", "aDir", "bDir", "aChr", "bChr"]])
  File "syri/pyxFiles/synsearchFunctions.pyx", line 467, in syri.pyxFiles.synsearchFunctions.startSyri
  File "syri/pyxFiles/synsearchFunctions.pyx", line 860, in syri.pyxFiles.synsearchFunctions.outSyn
  File "/scratch/jcruzcor/trial_minimap/SYRI/lib/python3.5/site-packages/pandas/core/generic.py", line 4389, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/scratch/jcruzcor/trial_minimap/SYRI/lib/python3.5/site-packages/pandas/core/generic.py", line 646, in _set_axis
    self._data.set_axis(axis, labels)
  File "/scratch/jcruzcor/trial_minimap/SYRI/lib/python3.5/site-packages/pandas/core/internals.py", line 3323, in set_axis
    'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 0 elements, new values have 7 elements
Originally posted by @Jquimcrz in #42 (comment)

@mnshgl0110
Member Author

Hi Joaquim,

I updated the comment here, so this should be OK.

The issue here is that SyRI is designed for chromosome-level assemblies, so a crash on incomplete assemblies is expected.

You can use chroder or another homology-based scaffolding method to generate pseudo-chromosome-level assemblies and then run SyRI on those. This might lose some SV information, because contigs could be scaffolded incorrectly, but I think that in the absence of de novo chromosome-level assemblies they still provide quite a lot of information (check SyRI's manuscript).

I hope this helps. Let me know if you have more questions.

@Jquimcrz

Jquimcrz commented Jun 2, 2021

I actually tried generating these pseudo-chromosome-level assemblies, and SyRI worked just fine. However, I wanted to integrate the gene and repeat annotations of the current reference into the downstream analyses, to find out whether they are related to the generation of structural rearrangements. The out.ref.fasta and out.qry.fasta assemblies no longer have the same genomic coordinates as the original reference, which is a big burden for studying SVs. In addition, with six assemblies to compare to the reference, generating pseudo-chromosome-level assemblies is an independent process repeated six times, resulting in six different out.ref.fasta assemblies that cannot be directly compared. If SyRI currently crashes with "incomplete assemblies", what exactly does the --no-chrmatch argument do? I guess there is no "trick" to overcome this limitation, am I right?

@mnshgl0110
Member Author

You can run chroder with the best query assembly, and then use the out.ref.fasta generated from it to scaffold the other query genomes. Then you would have the same reference assembly for all six comparisons.

You would still need to map the gene coords to the pseudo-chromosome-level assembly. chroder also generates an out.anno file that records the order in which the contigs are concatenated. You can use that to map the gene coords.
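A minimal sketch of the coordinate mapping described above, assuming you have already extracted the contig order, lengths, and orientations into a plain list. The tuple layout, the `gap` parameter, and the function names are hypothetical; this does not parse out.anno itself, and you should check whether your scaffolding run inserts spacer bases between contigs before trusting the offsets.

```python
from typing import Dict, List, Tuple

# Hypothetical input: one (contig_id, length, strand) tuple per contig, listed
# in the order the contigs were concatenated. This is NOT the out.anno format.
Placement = Tuple[str, int, str]


def build_offsets(placements: List[Placement], gap: int = 0) -> Dict[str, Tuple[int, int, str]]:
    """Return contig_id -> (offset, length, strand) on the pseudo-chromosome.

    `gap` is an assumed number of spacer bases between consecutive contigs.
    """
    offsets = {}
    pos = 0
    for contig_id, length, strand in placements:
        offsets[contig_id] = (pos, length, strand)
        pos += length + gap
    return offsets


def lift(offsets: Dict[str, Tuple[int, int, str]], contig_id: str,
         start: int, end: int) -> Tuple[int, int]:
    """Map a 1-based inclusive interval on a contig to the pseudo-chromosome."""
    offset, length, strand = offsets[contig_id]
    if not (1 <= start <= end <= length):
        raise ValueError("interval lies outside the contig")
    if strand == "+":
        return offset + start, offset + end
    # Reverse-complemented contig: coordinates are mirrored within the contig.
    return offset + (length - end + 1), offset + (length - start + 1)


if __name__ == "__main__":
    offsets = build_offsets([("ctgA", 100, "+"), ("ctgB", 50, "-")], gap=10)
    print(lift(offsets, "ctgA", 5, 20))
    print(lift(offsets, "ctgB", 1, 10))
```

Features on a reverse-oriented contig also change strand after lifting, so a full GFF conversion would need to flip the strand column as well.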

--no-chrmatch controls how homologous chromosomes are identified when the assemblies have different chromosome IDs. Setting it tells SyRI that homologous chromosomes already have exactly the same IDs in both assemblies, so SyRI should not try to match them automatically.
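Since --no-chrmatch relies on identical IDs, it can help to check the two FASTA files before running SyRI. The following is a rough pre-check sketch, not SyRI's own logic; the function names are hypothetical, and it only compares the first word of each FASTA header.

```python
from typing import Iterable, List, Set, Tuple


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def compare_ids(ref_lines: Iterable[str],
                qry_lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (shared, reference-only, query-only) IDs, each sorted."""
    ref, qry = fasta_ids(ref_lines), fasta_ids(qry_lines)
    return sorted(ref & qry), sorted(ref - qry), sorted(qry - ref)


if __name__ == "__main__":
    ref = [">Chr1 some description", "ACGT", ">Chr2", "GGCC"]
    qry = [">Chr1", "ACGA", ">scaffold1", "TTAA"]
    shared, ref_only, qry_only = compare_ids(ref, qry)
    print("shared:", shared)
    print("reference only:", ref_only)
    print("query only:", qry_only)
```

For real files, pass open file handles, for example `compare_ids(open(ref_path), open(qry_path))`, where both paths are placeholders. An empty shared list would suggest that no alignments can survive ID-based filtering.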

@Jquimcrz

Jquimcrz commented Jun 2, 2021

Thanks for your answer, I will give it a try.
