Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with ABBABABAwindows.py #28

Closed
foala opened this issue Jun 24, 2020 · 8 comments
Closed

Problem with ABBABABAwindows.py #28

foala opened this issue Jun 24, 2020 · 8 comments

Comments

@foala
Copy link

foala commented Jun 24, 2020

Hi,
Thank you very much for your scripts.
I have been struggling a lot with ABBABABAwindows.py
I am using python 3

When I run this command
python3 ABBABABAwindows.py -g $FILE.geno.gz -o $FILE.fd2.csv -w 20 -m 10 -s 20 -f phased --minData 0.5 -P1 P1 -P2 P2-P3 P3 -O P4--popsFile pop_file.txt

I get

Traceback (most recent call last):
File "ABBABABAwindows.py", line 256, in
windowQueue = SimpleQueue()
TypeError: init() missing 1 required keyword-only argument: 'ctx'

and when I run
python3 ABBABABAwindows.py
I get

usage: ABBABABAwindows.py [-h] [--windType {sites,coordinate,predefined}] [-w sites] [-s sites] [-m sites]
[--overlap sites] [-D MAXDIST] [--windCoords WINDCOORDS] [--minData proportion] -P1
popName [[samples] ...] -P2 popName [[samples] ...] -P3 popName [[samples] ...] -O popName
[[samples] ...] [--popsFile POPSFILE] [--haploid sample names] [-g GENOFILE] [-o OUTFILE]
[--exclude EXCLUDE] [--include INCLUDE] -f {phased,pairs,haplo,diplo} [-T threads]
[--verbose] [--addWindowID] [--writeFailedWindows]

Can you please assist.
Thank you very much,

@simonhmartin
Copy link
Owner

Apologies, this was a bug with migration to Python3. Should hopefully be working in the latest version

@foala
Copy link
Author

foala commented Jun 25, 2020

Thank you for your reply.
The SimpleQueue issue is resolved, thank you.
Now I have a new error
Traceback (most recent call last):

File "ABBABABAwindows.py", line 312, in
for window in windowGenerator:
File "/Merged_Final/window/py3/genomics.py", line 1894, in slidingCoordWindows
GTs = [site["GTs"][name] for name in names] if extractSpecificGTs else site["GTs"]
File "/Merged_Final/window/py3/genomics.py", line 1894, in
GTs = [site["GTs"][name] for name in names] if extractSpecificGTs else site["GTs"]
KeyError: 'CB11'

Note: CB11 is the name of the first sample in my geno.

I reordered the samples in the pops file to match the order in the geno file, and I get the same error with another sample name.
Update:
I get the error with samples associated with the first population (P1)
Please assist,
Thank you,

@simonhmartin
Copy link
Owner

I think the issue was caused by a change in text encoding in python 3 (as you can see I made these changes for python 3 without actually testing that the script still works!). I've added the fix now (again without testing, so I hope it works!)

@foala
Copy link
Author

foala commented Jun 25, 2020

Hi Simon,
Thank you for your prompt reply.
That issue is resolved. Now I get a new error!

/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py:1334: RuntimeWarning: invalid value encountered in double_scalars
return f4(p1,p2,p3,p4).sum()1./((1 - p1)p2p3(1-p4) + p1 * (1-p2)p3(1-p4)).sum()
Process Process-1:
Traceback (most recent call last):
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/conda/envs/admix/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/conda/envs/admix/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, *self._kwargs)
File "ABBABABAwindows.py", line 36, in ABBABABA_wrapper
statsDict = genomics.ABBABABA(Aln, P1, P2, P3, O, minData)
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py", line 1567, in ABBABABA
all4freqs = all4Aln.siteFreqs(sites=goodSites)
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py", line 1005, in siteFreqs
return np.array([binBaseFreqs(self.numArray[:,x][self.nanMask[:,x]], asCounts=asCounts) for x in sites])
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py", line 1005, in
return np.array([binBaseFreqs(self.numArray[:,x][self.nanMask[:,x]], asCounts=asCounts) for x in sites])
File "/vggpfs/fs3/vgl/scratch/vglshare/sandbox/fal/pg/pg2/plink/vcp2/Merged_Final/window/py3/genomics.py", line 574, in binBaseFreqs
else: return 1.
np.bincount(numArr, minlength=4) / n
File "<array_function internals>", line 5, in bincount
MemoryError: Unable to allocate 1020. TiB for an array with shape (140168216732977,) and data type int64

@simonhmartin
Copy link
Owner

Oh that's a pain. This is one I have not seen before. If you attach a short subset of your file (the first 1000 lines should do) I could try figure it out. Might not get to this today though.

@foala
Copy link
Author

foala commented Jun 25, 2020

Thank you Simon.
Please find the subest from the geno file (Just rename .txt to .geno)
subset.txt

A lot is riding on this analysis. I really appreciate your help

@simonhmartin
Copy link
Owner

simonhmartin commented Jun 26, 2020

I couldn't recreate your specific error with the first 1000 lines, but I did see an issue with your input file that will cause errors. It contains multi-base genotypes, which cannot be parsed by my scripts. In any case, in this subset, at these sites, all individuals are heterozygous, so the sites don't carry any information anyway. After removing those I could run the script with no issues.
The other thing I noticed is your window size is far too small. By default the script assumes you are using "coordinate" windows , so -w 20 means the window is 20 bp on the chromosome. This means that most of your windows will have zero SNPs, and the rest will tend to have only one SNP. This is not suitable for these statistics. There are two options to rectify this:

  1. Set the window size to a much larger value (tens to hundreds of kb). Most windows will cover sufficient number of SNPs for reliable results. You can then also do post-filtering of the output file based on the "sitesUsed" column, which should be at least 100.
  2. Use --windType sites, in which case the window is defined by the number of sites read from the input file, rather than chromosome coordinates. This results in windows using similar amounts of information, but they will all differ in the number of bp they actually cover in terms of chromosome coordinates. After doing this, it still makes sense to check the "sitesUsed" in the output, because not all sites in your input file will be informative for ABBA BABA statistics.
    If you correct these two issues, I think it should work.

@foala
Copy link
Author

foala commented Jun 26, 2020

Hi Simon,
Thank you very, very much for your thorough and patient walkthrough.
I have removed the multi-base genotypes, and it seems to work perfectly.

Thank you very much,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants