Problem with ABBABABAwindows.py #28

foala · 2020-06-24T05:27:05Z

Hi,
Thank you very much for your scripts.
I have been struggling a lot with ABBABABAwindows.py
I am using python 3

When I run this command
python3 ABBABABAwindows.py -g $FILE.geno.gz -o $FILE.fd2.csv -w 20 -m 10 -s 20 -f phased --minData 0.5 -P1 P1 -P2 P2-P3 P3 -O P4--popsFile pop_file.txt

I get

Traceback (most recent call last):
File "ABBABABAwindows.py", line 256, in
windowQueue = SimpleQueue()
TypeError: init() missing 1 required keyword-only argument: 'ctx'

and when I run
python3 ABBABABAwindows.py
I get

usage: ABBABABAwindows.py [-h] [--windType {sites,coordinate,predefined}] [-w sites] [-s sites] [-m sites]
[--overlap sites] [-D MAXDIST] [--windCoords WINDCOORDS] [--minData proportion] -P1
popName [[samples] ...] -P2 popName [[samples] ...] -P3 popName [[samples] ...] -O popName
[[samples] ...] [--popsFile POPSFILE] [--haploid sample names] [-g GENOFILE] [-o OUTFILE]
[--exclude EXCLUDE] [--include INCLUDE] -f {phased,pairs,haplo,diplo} [-T threads]
[--verbose] [--addWindowID] [--writeFailedWindows]

Can you please assist.
Thank you very much,

The text was updated successfully, but these errors were encountered:

simonhmartin · 2020-06-25T13:00:47Z

Apologies, this was a bug with migration to Python3. Should hopefully be working in the latest version

foala · 2020-06-25T15:19:14Z

Thank you for your reply.
The SimpleQueue issue is resolved, thank you.
Now I have a new error
Traceback (most recent call last):

File "ABBABABAwindows.py", line 312, in
for window in windowGenerator:
File "/Merged_Final/window/py3/genomics.py", line 1894, in slidingCoordWindows
GTs = [site["GTs"][name] for name in names] if extractSpecificGTs else site["GTs"]
File "/Merged_Final/window/py3/genomics.py", line 1894, in
GTs = [site["GTs"][name] for name in names] if extractSpecificGTs else site["GTs"]
KeyError: 'CB11'

Note: CB11 is the name of the first sample in my geno.

I reordered the samples in the pops file to match the order in the geno file, and I get the same error with another sample name.
Update:
I get the error with samples associated with the first population (P1)
Please assist,
Thank you,

simonhmartin · 2020-06-25T16:15:12Z

I think the issue was caused by a change in text encoding in python 3 (as you can see I made these changes for python 3 without actually testing that the script still works!). I've added the fix now (again without testing, so I hope it works!)

foala · 2020-06-25T16:18:03Z

Hi Simon,
Thank you for your prompt reply.
That issue is resolved. Now I get a new error!

/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py:1334: RuntimeWarning: invalid value encountered in double_scalars
return f4(p1,p2,p3,p4).sum()1./((1 - p1)p2p3(1-p4) + p1 * (1-p2)p3(1-p4)).sum()
Process Process-1:
Traceback (most recent call last):
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/conda/envs/admix/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/conda/envs/admix/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, *self._kwargs)
File "ABBABABAwindows.py", line 36, in ABBABABA_wrapper
statsDict = genomics.ABBABABA(Aln, P1, P2, P3, O, minData)
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py", line 1567, in ABBABABA
all4freqs = all4Aln.siteFreqs(sites=goodSites)
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py", line 1005, in siteFreqs
return np.array([binBaseFreqs(self.numArray[:,x][self.nanMask[:,x]], asCounts=asCounts) for x in sites])
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py", line 1005, in
return np.array([binBaseFreqs(self.numArray[:,x][self.nanMask[:,x]], asCounts=asCounts) for x in sites])
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py", line 574, in binBaseFreqs
else: return 1. np.bincount(numArr, minlength=4) / n
File "<array_function internals>", line 5, in bincount
MemoryError: Unable to allocate 1020. TiB for an array with shape (140168216732977,) and data type int64

simonhmartin · 2020-06-25T16:20:48Z

Oh that's a pain. This is one I have not seen before. If you attach a short subset of your file (the first 1000 lines should do) I could try figure it out. Might not get to this today though.

foala · 2020-06-25T16:26:15Z

Thank you Simon.
Please find the subest from the geno file (Just rename .txt to .geno)
subset.txt

A lot is riding on this analysis. I really appreciate your help

simonhmartin · 2020-06-26T08:17:37Z

I couldn't recreate your specific error with the first 1000 lines, but I did see an issue with your input file that will cause errors. It contains multi-base genotypes, which cannot be parsed by my scripts. In any case, in this subset, at these sites, all individuals are heterozygous, so the sites don't carry any information anyway. After removing those I could run the script with no issues.
The other thing I noticed is your window size is far too small. By default the script assumes you are using "coordinate" windows , so -w 20 means the window is 20 bp on the chromosome. This means that most of your windows will have zero SNPs, and the rest will tend to have only one SNP. This is not suitable for these statistics. There are two options to rectify this:

Set the window size to a much larger value (tens to hundreds of kb). Most windows will cover sufficient number of SNPs for reliable results. You can then also do post-filtering of the output file based on the "sitesUsed" column, which should be at least 100.
Use --windType sites, in which case the window is defined by the number of sites read from the input file, rather than chromosome coordinates. This results in windows using similar amounts of information, but they will all differ in the number of bp they actually cover in terms of chromosome coordinates. After doing this, it still makes sense to check the "sitesUsed" in the output, because not all sites in your input file will be informative for ABBA BABA statistics.
If you correct these two issues, I think it should work.

foala · 2020-06-26T23:29:21Z

Hi Simon,
Thank you very, very much for your thorough and patient walkthrough.
I have removed the multi-base genotypes, and it seems to work perfectly.

Thank you very much,

foala closed this as completed Jun 27, 2020

nelidapg mentioned this issue Jan 26, 2022

bug in fourPopWindows.py #66

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem with ABBABABAwindows.py #28

Problem with ABBABABAwindows.py #28

foala commented Jun 24, 2020

simonhmartin commented Jun 25, 2020

foala commented Jun 25, 2020 •

edited

Loading

simonhmartin commented Jun 25, 2020

foala commented Jun 25, 2020

simonhmartin commented Jun 25, 2020

foala commented Jun 25, 2020

simonhmartin commented Jun 26, 2020 •

edited

Loading

foala commented Jun 26, 2020 •

edited

Loading

Problem with ABBABABAwindows.py #28

Problem with ABBABABAwindows.py #28

Comments

foala commented Jun 24, 2020

simonhmartin commented Jun 25, 2020

foala commented Jun 25, 2020 • edited Loading

simonhmartin commented Jun 25, 2020

foala commented Jun 25, 2020

simonhmartin commented Jun 25, 2020

foala commented Jun 25, 2020

simonhmartin commented Jun 26, 2020 • edited Loading

foala commented Jun 26, 2020 • edited Loading

foala commented Jun 25, 2020 •

edited

Loading

simonhmartin commented Jun 26, 2020 •

edited

Loading

foala commented Jun 26, 2020 •

edited

Loading