Different numbers of final read counts in _1 and _2 FASTQ files #4
Hello! The corresponding Python code that checks the read names is as below; if the names are not exactly the same, you will see the error message.
Otherwise, may I see a fraction of your input FASTQ files? Thanks,
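For reference, a minimal sketch of this kind of read-name check. It is not the actual FreeHi-C code: the function names are hypothetical, and the normalization (dropping the leading `@`, anything after the first whitespace, and a trailing `/1` or `/2`) is an assumption about how the two ends are named.

```python
import itertools


def normalize_read_name(header):
    """Reduce a FASTQ header line to a bare read name (assumed convention)."""
    name = header.strip().split()[0]
    if name.startswith("@"):
        name = name[1:]
    # Assumption: mates may be distinguished only by a /1 or /2 suffix.
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def read_names(lines):
    """Yield the read name of every 4-line FASTQ record, skipping blank lines."""
    non_empty = (line for line in lines if line.strip())
    for i, line in enumerate(non_empty):
        if i % 4 == 0:
            yield normalize_read_name(line)


def first_mismatch(lines1, lines2):
    """Return (record_index, name1, name2) at the first disagreement, else None.

    A missing record in the shorter file shows up as None for that name.
    """
    pairs = itertools.zip_longest(read_names(lines1), read_names(lines2))
    for idx, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return idx, name1, name2
    return None
```

To run it on real files, pass two open file handles, for example `first_mismatch(open(path_1), open(path_2))`, where `path_1` and `path_2` are the two FASTQ files.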
I downloaded the data from SRA using These are some parts of my input FASTQ files
<SRR6691720_2.fastq>
I also checked SAM files, which is the output of FreeHi-C rawDataTraining steps.
I don't think there are any problems with the read names in either the input FASTQ files or the output mapping files. Thanks,
Yes, these look fine to me as well. Can you also share the original complete log or screen output with the error messages? It looks to me like this error may arise not in the training section but in the simulated sequence data processing section. In that case, there must be something wrong with the simulation.
Here is the original complete log information! [ Training step log ]
[Simulation & Post-process step log]
If you need any other information, please ask me! |
Can you check the number of reads in /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_1.fastq and /mss5/RACA2/read_simulation/hic/FreeHiC/bosTau9/bosTau9_05x/bosTau9_05x/simuSequence/SRR6691720_2.fastq? According to the following log, more than 157500000 reads appear to have been simulated for each end.
However, only 1763658 reads of end 1 were aligned by bwa aln, and similarly for end 2. Far fewer reads were processed by bwa than were simulated.
If the FASTQ files do not contain 157702285 reads, then there must be something wrong with the simulation section. BTW, have you tried simulating using the demo data and demo runs? Ye
There are 157,702,265 reads in the simuSequence/SRR6691720_1.fastq file. Yes, I have already run the demo data and got simulation data. The demo run did not seem to have any problems in its outputs.
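A minimal sketch of how such a read count can be obtained, assuming an uncompressed FASTQ file with exactly four lines per record. The function name is hypothetical and is not part of FreeHi-C.

```python
def count_fastq_reads(path):
    """Count records in an uncompressed 4-line-per-record FASTQ file."""
    n_lines = 0
    with open(path, "rb") as handle:
        for line in handle:
            # Ignore blank lines, e.g. a stray newline at the end of the file.
            if line.strip():
                n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

Running this on both `simuSequence/SRR6691720_1.fastq` and `simuSequence/SRR6691720_2.fastq` and comparing the two results with `simuN` shows whether reads were lost during simulation or only later, during alignment.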
Emm......that is indeed very weird....looks like the simulation is almost complete. Let me know if this helps. Ye |
Hi, I ran the simulation section with 10 times fewer reads (15,000,000), as you recommended. These are examples of the final simulation data
[SRR6691720_2.fastq]
Do you think that I should run the simulation with much smaller numbers? Thanks, |
I see!! I just realized that bedtools has changed its getfasta function in recently released versions!! A quick solution: if you download and install version 2.25.0 of bedtools (https://github.com/arq5x/bedtools2/releases/tag/v2.25.0) and keep using the current FreeHi-C software, it should solve the problem. Otherwise, I will update the FreeHi-C repository later this week to accommodate the new output from the latest version of bedtools. Let me know how it goes! Thanks for bringing it up! I appreciate your time and effort! Ye
I was using bedtools version 2.28.0 before, and I downloaded bedtools version 2.25.0 as you advised.
But I still get the same error messages as before, and the numbers of reads in the two files are still different.
I think Sue |
Hello Sue,
BTW, I noticed that you specified 20 cores for parallel alignment. Can you confirm that these computing resources are available on your side? I apologize for the trouble and for delaying your studies! Ye
Hello Ye, I used the bosTau9 chromosome-level assembly, which is soft-masked, and changed the name of chrX to chr999.
You could just download the data from the link below and keep only the regular chromosome sequences (chr1~29 + chrX). BTW, I renamed chromosome X because I got some error messages from FreeHiC when it was named chrX.
Or I could just share these data by email. Thanks, |
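A minimal sketch of the reference preparation described above, assuming a plain-text FASTA input. The function name and the `KEEP` and `RENAME` constants are hypothetical; only the chromosome list (chr1~29 + chrX) and the chrX to chr999 rename come from the description.

```python
# Regular chromosomes to keep: chr1..chr29 plus chrX.
KEEP = {f"chr{i}" for i in range(1, 30)} | {"chrX"}
# chrX is renamed to chr999, as described above.
RENAME = {"chrX": "chr999"}


def filter_fasta(lines, keep, rename=None):
    """Yield FASTA lines for sequences named in `keep`, renaming some headers.

    Only the first word of each header is used as the sequence name, and the
    rest of the header is dropped. Sequence lines (including soft-masked
    lowercase bases) are passed through unchanged.
    """
    rename = rename or {}
    writing = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            writing = name in keep
            if writing:
                yield ">" + rename.get(name, name) + "\n"
        elif writing:
            yield line
```

For a real genome, the output can be written with `out.writelines(filter_fasta(open(in_path), KEEP, RENAME))`, where `in_path` and `out` are placeholders for the input FASTA path and an open output file.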
While I am running, if you are interested, you can also check out the lite version of FreeHiC that we recently developed. It is an approximation of the original FreeHi-C simulation pipeline but much faster. ;-) |
Sounds great! If you have any progress, please let me know! :) Thanks, |
Hello, Ye! Were you able to reproduce my results with the data I pointed you to before? Thanks,
Apologies for the delay! I should be able to finish the runs later this week. Ye
Hello Sue,
Therefore, the issue you came across is still not quite clear to me. It may be related to your computing resources. Maybe you can try using 1 core and a larger memory allocation, clearing the /tmp folder where temporary sorting files are stored, and simulating 10000 reads or fewer to start. Otherwise, if this read ID mismatch issue persists, the following filtering may help. The rationale is to match the read query names from both ends in the SAM file. I added it right after the alignment step in the simulation section. You can replace the original
Best, |
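A minimal sketch of the rationale above: keep only alignment records whose query name (QNAME) occurs in the SAM output of both ends. This is not the filter that was added to FreeHi-C. The function names are hypothetical, and the set-based approach assumes all read names fit in memory, which may not hold for more than 157 million reads.

```python
def qnames(sam_lines):
    """Collect QNAMEs (first tab-separated field) from alignment records."""
    return {
        line.split("\t", 1)[0]
        for line in sam_lines
        if line.strip() and not line.startswith("@")  # skip blanks and headers
    }


def filter_to_shared(sam_lines, shared):
    """Yield header lines plus records whose QNAME is in `shared`."""
    for line in sam_lines:
        if not line.strip():
            continue
        if line.startswith("@") or line.split("\t", 1)[0] in shared:
            yield line


def match_pairs(sam1_lines, sam2_lines):
    """Return both SAM line lists restricted to QNAMEs present in both ends."""
    sam1_lines = list(sam1_lines)
    sam2_lines = list(sam2_lines)
    shared = qnames(sam1_lines) & qnames(sam2_lines)
    return (
        list(filter_to_shared(sam1_lines, shared)),
        list(filter_to_shared(sam2_lines, shared)),
    )
```

Header lines can be told apart from records by their leading `@` because the SAM specification does not allow `@` in a QNAME. The original record order within each file is preserved, so two files that were in the same read order stay aligned after filtering.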
Hi, I'm using FreeHiC for simulating Cow Hi-C data.
For the fastqFile data used during the training steps, I used SRR6691720 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR6691720); the other parameters are below.
train=1
simulate=1
postProcess=1
coreN=20
mismatchN=3
gapN=1
mismatchP=""
gapP=""
chimericP=""
simuN=157702285
readLen=80
resolution=40000
lowerBound=80000
refragU=800
ligateSite="AAGCTAGCTT"
After simulating the Hi-C data, I checked the output of the simulation and found that the read IDs in the two read files did not match. Also, the numbers of reads in the read 1 and read 2 files are different.
There was a warning message in the log, but I have already checked that my input FASTQ files do not have any problems.
(All reads are sorted equally in both input read files and all IDs are matched)
Read id in read 1 and read 2 file does not match. Please check the input read files and sort them correctly.
Could you tell me how to fix this problem?