# Creating datasets: Download & Symlink & Preprocessing steps
## 1. Download missing (not yet downloaded) data

In [3]:
library(tidyverse, warn.conflicts = FALSE)

In [4]:
setwd("/home/vanda.marosi/floral_development_thesis_vm/datatables/")
# import my datasets
wheat <- read.table("wheat_final.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
barley <- read.table("barley_final.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
glimpse(wheat)
glimpse(barley)
# import Nadia & Maxim's datasets
wheat_m <- read.table("wheat_22012020_v1_MM.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
barley_n <- read.table("hordeum_vulgare_14012020_NK.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
glimpse(wheat_m)
glimpse(barley_n)

Rows: 215
Columns: 36
$ Dataset                 <chr> "cytoplasmic_male_sterility", "cytoplasmic_ma…
$ PMID                    <int> 32019527, 32019527, 32019527, 32019527, 32019…
$ Run.ID                  <chr> "SRR10737427", "SRR10737428", "SRR10737429", …
$ GSA                     <chr> "", "", "", "", "", "", "CRA002161", "CRA0021…
$ NCBI.BioProject         <chr> "PRJNA596597", "PRJNA596597", "PRJNA596597", …
$ SRA.Sample.ID           <chr> "SRS5860070", "SRS5860068", "SRS5860066", "SR…
$ BioSample.ID            <chr> "SAMN13632079", "SAMN13632078", "SAMN13632077…
$ Sample.Name.Alias       <chr> "303-B", "303-B", "303-B", "C303A", "C303A", …
$ Batch                   <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, …
$ Organism                <chr> "Triticum aestivum", "Triticum aestivum", "Tr…
$ Cultivar                <chr> "303-B", "303-B", "303-B", "C303A", "C303A", …
$ GM                      <chr> "wt", "wt", "wt", "mut", "mut", "mut", "wt", …
$ Genotype                <chr

In [70]:
# intersect wheat tables
w_v <- select(wheat, Run.ID, Dataset.name, NCBI.BioProject)
w_m <- select(wheat_m, sample)
colnames(w_m) <- "Run.ID"
# this gives id-s that appear in mine but not in Maxim's -> 67 fastq-s have to be downloaded
y <- anti_join(w_v, w_m, by = "Run.ID") # all rows in w_v that do not have a match in w_m
glimpse(y)

# this gives IDs from my list that also appear in Maxim's -> 148 fastq-s have to be symlinked
y1 <- inner_join(w_v, w_m, by = "Run.ID") # retain only rows in both sets
glimpse(y1)

Rows: 67
Columns: 3
$ Run.ID          <chr> "CRR088963", "CRR088962", "CRR088961", "CRR088960", "…
$ Dataset.name    <chr> "iamsls", "iamsls", "iamsls", "iamsls", "iamsls", "ia…
$ NCBI.BioProject <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "…
Rows: 148
Columns: 3
$ Run.ID          <chr> "SRR10737427", "SRR10737428", "SRR10737429", "SRR1073…
$ Dataset.name    <chr> "pistillody of stamen", "pistillody of stamen", "pist…
$ NCBI.BioProject <chr> "PRJNA596597", "PRJNA596597", "PRJNA596597", "PRJNA59…


In [71]:
# intersect barley tables
b_v <- select(barley, Run.ID, Dataset.name, NCBI.BioProject)
b_n <- select(barley_n, Run)
colnames(b_n) <- "Run.ID"
# this gives id-s that appear in mine but not in Nadia's -> 65 fastq-s have to be downloaded
x <- anti_join(b_v, b_n, by = "Run.ID")
glimpse(x)

# this gives IDs from my list that also appear in Nadia's -> 175 fastq-s have to be symlinked
x1 <- inner_join(b_v, b_n, by = "Run.ID")
glimpse(x1)

Rows: 65
Columns: 3
$ Run.ID          <chr> "ERR781039", "ERR781040", "ERR781041", "ERR781042", "…
$ Dataset.name    <chr> "inflorescence development", "inflorescence developme…
$ NCBI.BioProject <chr> "PRJEB8748", "PRJEB8748", "PRJEB8748", "PRJEB8748", "…
Rows: 175
Columns: 3
$ Run.ID          <chr> "ERR1248084", "ERR1248085", "ERR1248086", "ERR1248087…
$ Dataset.name    <chr> "ref dataset drought", "ref dataset drought", "ref da…
$ NCBI.BioProject <chr> "PRJEB12540", "PRJEB12540", "PRJEB12540", "PRJEB12540…
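
The same split into a "download" list and a "symlink" list can also be done on the command line. The following is a minimal sketch, assuming two plain-text files with one unique Run.ID per line; the function names and file names are hypothetical and not part of the original workflow.

```bash
# Sketch: split my run IDs into a "download" and a "symlink" list.
# $1 = my ID list, $2 = the other group's ID list (one ID per line).

# IDs in my list that are absent from the other list (like anti_join)
ids_to_download() {
    grep -Fxv -f "$2" "$1"
}

# IDs in my list that also occur in the other list (like inner_join on one key)
ids_to_symlink() {
    grep -Fx -f "$2" "$1"
}
```

For example, `ids_to_download my_wheat_ids.txt maxim_wheat_ids.txt > wheat_download_runids.txt` would write the IDs that still have to be fetched. `-F` treats the IDs as fixed strings and `-x` only accepts whole-line matches, so `SRR1` does not match `SRR10`.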


## 1.1 Download additional fastq-s (from my dataset) using `sra-toolkit`

In [6]:
# get Run.ID-s into single vector
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum")
w_download <- dplyr::pull(y, Run.ID)
write.table(w_download, file = "wheat_download_runids.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = FALSE)
glimpse(w_download)
    
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/hordeum")
b_download <- dplyr::pull(x, Run.ID)
write.table(b_download, file = "barley_download_runids.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = FALSE)
glimpse(b_download)

 chr [1:67] "CRR088963" "CRR088962" "CRR088961" "CRR088960" "CRR088959" ...
 chr [1:65] "ERR781039" "ERR781040" "ERR781041" "ERR781042" "ERR781043" ...


## 1.2 Steps to download extra files (added to the flower dataset but not listed in Nadia & Maxim's tables)
1. The Run.ID-s of N&M's samples and of my selected samples were compared, and the IDs found only in my tables were saved as vectors: 67 wheat and 65 barley samples needed to be downloaded
2. These vectors were written into `.txt` files and placed in the subfolder of the respective species
3. SRA Toolkit was downloaded to the same main folder (https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/)
4. The toolkit's `bin` directory had to be added to `PATH` every time I opened a terminal; I solved this by adding it to my `.bashrc`
5. `vdb-config -i` was called for basic configuration, following the website's recommendation, and an empty main folder was set as the download target
6. `prefetch --option-file wheat_download_runids.txt`
7. `fasterq-dump --split-files *.sra`
8. copied the `*.fastq` files into the destination folder (triticum) and emptied the sra-download folder for the next batch
9. `prefetch --option-file barley_download_runids.txt`
10. `fasterq-dump --split-files *.sra`
11. copied the `*.fastq` files into the destination folder (hordeum) and emptied the sra-download folder
12. for the 2 extra Chinese datasets and the 1 French dataset, the respective websites were visited and the web link of each fastq file was copied into a bash script with `wget`
13. `chmod +x` to make the bash scripts executable
14. `./*.sh` to execute them
15. I double-checked whether all files were downloaded and manually downloaded the missing ones, e.g. `prefetch SRR000001` and then `fasterq-dump --split-files SRR000001.sra`
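
Steps 6-7 and 15 can be combined into one loop that handles each run on its own. This is a sketch rather than the procedure I actually ran: the helper names `run` and `download_runs` and the `DRY_RUN` switch are hypothetical, and where the `.sra` file ends up depends on the download target set with `vdb-config`, so the path passed to `fasterq-dump` may need adjusting.

```bash
# Sketch: fetch and convert every run listed in an ID file.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

download_runs() {
    id_file=$1
    while IFS= read -r run_id || [ -n "$run_id" ]; do
        [ -n "$run_id" ] || continue      # skip empty lines
        run prefetch "$run_id"
        # assumption: the .sra file is in the current directory, as in step 15
        run fasterq-dump --split-files "$run_id.sra"
    done < "$id_file"
}
```

Running `DRY_RUN=1; download_runs wheat_download_runids.txt` first shows the commands for every ID, which makes it easy to spot a malformed ID file before starting a long download.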

## 2. Create symlinks 

In [6]:
# get Run.ID-s into single vector
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum")
#w_download <- as.vector(dplyr::pull(y, Run.ID))
w_symlink <- dplyr::pull(y1, Run.ID)
#w_download <- as.name(w_download)
write.table(w_symlink, file = "wheat_symlink_runids.txt", append = FALSE, quote = FALSE, sep = ",", dec = ".",
            row.names = FALSE, col.names = FALSE)
glimpse(w_symlink)
    
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/hordeum")
b_symlink <- dplyr::pull(x1, Run.ID)
write.table(b_symlink, file = "barley_symlink_runids.txt", append = FALSE, quote = FALSE, sep = ",", dec = ".",
            row.names = FALSE, col.names = FALSE)
glimpse(b_symlink)

 chr [1:148] "SRR10737427" "SRR10737428" "SRR10737429" "SRR10737430" ...
 chr [1:175] "ERR1248084" "ERR1248085" "ERR1248086" "ERR1248087" ...


## 2.1 My steps to create symbolic-links
1. The Run.ID-s of N&M's samples and of my selected samples were intersected and the shared IDs saved as vectors: 148 wheat and 175 barley samples needed to be symlinked
2. These vectors were written into the `wheat_symlink_runids.txt` and `barley_symlink_runids.txt` files and placed in the subfolder of the respective species

### For Wheat dataset:
* from `/nfs/scratch/daniel.lang/comparative_triticeae/raw/T.aestivum`
* to `/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum`

### For Barley dataset:
* from `/nfs/scratch/daniel.lang/comparative_triticeae/raw/H.vulgare`
* to `/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/hordeum`

## 2.2 Bash scripts to find failed SRA-s and create symlinks for existing ones
1. `csver.sh`: creates a `.csv` file with two columns: the 1st column holds the source files with their path, the 2nd column the final destination with the symlink names
    - it detects the paired fastq-s first and the single ones in a second step (for later use when creating input tables for FastQC/Trimmomatic etc.)
2. `confirmationlinker.sh`:
    - 1st run: reports which files from my link list have no target in the source directory, so that they can be downloaded manually
    - 2nd run: links the two tables to prepare them for `ln -s`
3. `linker.sh`: creates symlinks using `ln -s /path/to/file /path/to/link`
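
The core of the check-then-link logic can be sketched as follows. This is not the content of `confirmationlinker.sh` or `linker.sh` (which are not reproduced here), only a minimal stand-in that assumes a two-column CSV of the form `source,link_name` without commas inside the paths; the function name is hypothetical.

```bash
# Sketch: create symlinks from a two-column CSV (source,link_name).
# Rows whose source file does not exist are reported on stderr instead.
link_from_csv() {
    csv=$1
    while IFS=, read -r src dest || [ -n "$src" ]; do
        [ -n "$src" ] || continue         # skip empty lines
        if [ -e "$src" ]; then
            ln -s "$src" "$dest"
        else
            echo "missing source: $src" >&2
        fi
    done < "$csv"
}
```

Collecting the stderr messages (`link_from_csv index.csv 2> missing.log`) gives the list of runs that still need a manual `prefetch`.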

* scripts are located at: `vanda.marosi/scripts/`

* for the wheat table, the missing SRA-s (failed downloads) were `prefetch`-ed into my folder: SRR7106399, SRR7106400, SRR7106402, SRR7106404, SRR6802615

* for the barley table, 87 SRA-s were missing; they were detected via `confirmationlinker.sh`, written into a new `missing_barley.txt` and `prefetch`-ed:

    from this (column of 87 rows):
    **this link will have no target ERR2026425.fastq.gz**

    made this (column of 87 rows):
    **ERR2026425**

    `./confirmationlinker.sh | cut -d " " -f 7 | cut -d "." -f 1 > missing_barley.txt`
* to raise the maximum download file size to 100G (default: 20G): `prefetch --max-size 100G SRR649944`
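
To see why the `cut` pipeline above uses field 7, it helps to run it on a single report line. A small sketch (the function name `extract_run_id` is hypothetical):

```bash
# Sketch: pull the run accession out of one report line.
# The message has six words before the file name, so the name is field 7;
# the second cut keeps everything before the first dot.
extract_run_id() {
    printf '%s\n' "$1" | cut -d " " -f 7 | cut -d "." -f 1
}

extract_run_id "this link will have no target ERR2026425.fastq.gz"
# prints: ERR2026425
```

This only works as long as the message wording stays the same, since the field number is counted from the start of the line.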


## 2.3 Create symlinks for French dataset
* used the three bash scripts mentioned above, partially modified, to create symlinks in the same folder with these names: FRR7482-9

In [10]:
setwd("~/scripts")
symlink_french_dataset <- read.table("ngtmpindex.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE)
head(symlink_french_dataset)

,V1,V2
,<chr>,<chr>
1,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/NG-5789_1A_lib7482_1_sequence_val_1.fq.gz,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/FRR7482_1.fastq.gz
2,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/NG-5789_1A_lib7482_2_sequence_val_2.fq.gz,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/FRR7482_2.fastq.gz
3,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/NG-5789_1B_lib7486_1_sequence_val_1.fq.gz,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/FRR7486_1.fastq.gz
4,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/NG-5789_1B_lib7486_2_sequence_val_2.fq.gz,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/FRR7486_2.fastq.gz
5,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/NG-5789_2A_lib7483_1_sequence_val_1.fq.gz,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/FRR7483_1.fastq.gz
6,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/NG-5789_2A_lib7483_2_sequence_val_2.fq.gz,/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/FRR7483_2.fastq.gz


## 3. Crosscheck downloaded files with original table

In [7]:
# triticum double-check, here is a column created from the downloaded files using bash (ls *gz | cut -d "." -f 1 | cut -d "_" -f 1 | uniq | grep -v NG > wheat_final_list.txt)
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum")
w_check <- read.table("wheat_final_list.txt", header = FALSE, sep = "\t", stringsAsFactors = FALSE)
glimpse(w_check)
colnames(w_check) <- "Run.ID"
# the intersection with my original wheat table (based on which files were downloaded) will give the rows that are present in both tables
ww <- inner_join(w_v, w_check, by = "Run.ID")
glimpse(ww)
# everything was successful!

Rows: 215
Columns: 1
$ V1 <chr> "CRR078059", "CRR078060", "CRR078061", "CRR078062", "CRR078063", "…
Rows: 215
Columns: 3
$ Run.ID          <chr> "SRR10737427", "SRR10737428", "SRR10737429", "SRR1073…
$ Dataset.name    <chr> "pistillody of stamen", "pistillody of stamen", "pist…
$ NCBI.BioProject <chr> "PRJNA596597", "PRJNA596597", "PRJNA596597", "PRJNA59…


In [9]:
# hordeum double-check -> here is a column created from the downloaded files using bash (ls *gz | cut -d "." -f 1 | cut -d "_" -f 1 | uniq | grep -v NG > barley_final_list.txt)
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/hordeum")
b_check <- read.table("barley_final_list.txt", header = FALSE, sep = "\t", stringsAsFactors = FALSE)
glimpse(b_check)
colnames(b_check) <- "Run.ID"
# the intersection with my original barley table (based on which files were downloaded) will give the rows that are present in both tables
bb <- inner_join(b_v, b_check, by = "Run.ID")
glimpse(bb)
# everything was successful!

Rows: 240
Columns: 1
$ V1 <chr> "ERR1248084", "ERR1248085", "ERR1248086", "ERR1248087", "ERR124808…
Rows: 240
Columns: 3
$ Run.ID          <chr> "ERR781039", "ERR781040", "ERR781041", "ERR781042", "…
$ Dataset.name    <chr> "inflorescence development", "inflorescence developme…
$ NCBI.BioProject <chr> "PRJEB8748", "PRJEB8748", "PRJEB8748", "PRJEB8748", "…
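
The same crosscheck can be done entirely in the shell, which avoids the intermediate `*_final_list.txt` file. The sketch below assumes a directory of `*gz` files and a text file with the expected Run.ID-s, one per line; the function names are hypothetical.

```bash
# Sketch: list the run IDs that have at least one *gz file in directory $1.
downloaded_ids() {
    for f in "$1"/*gz; do
        [ -e "$f" ] || [ -L "$f" ] || continue
        name=${f##*/}                  # strip the directory
        name=${name%%.*}               # strip everything from the first dot
        printf '%s\n' "${name%%_*}"    # strip a read suffix such as _1 / _2
    done | sort -u
}

# Expected IDs (one per line in file $2) without any file in directory $1.
missing_ids() {
    have=$(mktemp)
    downloaded_ids "$1" > "$have"
    grep -Fxv -f "$have" "$2"
    rm -f "$have"
}
```

An empty result from `missing_ids` corresponds to the situation above, where the inner join returns as many rows as the original table.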


## 4. Rename libraries
* goal: uniform `_1.fastq.gz` / `_2.fastq.gz` naming, created with bash loops (because of the `echo`, each loop only prints the `mv` commands; remove `echo` to actually rename)
1. format in triticum only: `CRR088946_f1.fastq.gz`
    solution: `for file in $(find ./*f1.fastq.gz);do file1=$(echo $file | cut -d '/' -f 2 | cut -d '_' -f 1 ); echo mv ./${file1}_f1.* ./${file1}_1.fastq.gz; done`
2. format in triticum only: `CRR088946_r2.fastq.gz`
    solution: `for file in $(find ./*r2.fastq.gz);do file1=$(echo $file | cut -d '/' -f 2 | cut -d '_' -f 1 ); echo mv ./${file1}_r2.* ./${file1}_2.fastq.gz; done`
3. format in both triticum and hordeum: `SRR7106404.sra_1.fastq.gz`
    solution: `for file in $(find ./*.sra_1.fastq.gz);do file1=$(echo $file | cut -d '/' -f 2 | cut -d '.' -f 1 ); echo mv ./${file1}.sra_1* ./${file1}_1.fastq.gz; done`
4. format in both triticum and hordeum: `SRR7106404.sra_2.fastq.gz`
    solution: `for file in $(find ./*.sra_2.fastq.gz);do file1=$(echo $file | cut -d '/' -f 2 | cut -d '.' -f 1 ); echo mv ./${file1}.sra_2* ./${file1}_2.fastq.gz; done`
5. format in both triticum and hordeum: `SRR5464507.sra.fastq.gz`
    solution: `for file in $(find ./*.sra.fastq.gz);do file1=$(echo $file | cut -d '/' -f 2 | cut -d '.' -f 1 ); echo mv ./${file1}.sra.* ./${file1}.fastq.gz; done`
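
The five variants can also be handled by one function that maps any of these file names onto the target scheme. This is a sketch of an alternative, not the commands I ran; `normalized_name` and `rename_all` are hypothetical names.

```bash
# Sketch: map the file-name variants listed above onto <ID>_1/_2.fastq.gz.
normalized_name() {
    case $1 in
        *_f1.fastq.gz)    printf '%s_1.fastq.gz\n' "${1%_f1.fastq.gz}" ;;
        *_r2.fastq.gz)    printf '%s_2.fastq.gz\n' "${1%_r2.fastq.gz}" ;;
        *.sra_1.fastq.gz) printf '%s_1.fastq.gz\n' "${1%.sra_1.fastq.gz}" ;;
        *.sra_2.fastq.gz) printf '%s_2.fastq.gz\n' "${1%.sra_2.fastq.gz}" ;;
        *.sra.fastq.gz)   printf '%s.fastq.gz\n'   "${1%.sra.fastq.gz}" ;;
        *)                printf '%s\n' "$1" ;;    # already in target format
    esac
}

# Rename every *.fastq.gz in directory $1; -n never overwrites an existing file.
rename_all() {
    for f in "$1"/*.fastq.gz; do
        [ -e "$f" ] || [ -L "$f" ] || continue
        new="$1/$(normalized_name "${f##*/}")"
        [ "$f" = "$new" ] || mv -n "$f" "$new"
    done
}
```

Keeping the name mapping in its own function means it can be checked on a few example names before any file is touched.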

## 5. MD5 checksum
* Perl script from Daniel:
    - for triticum: ``perl -ne 'chomp; @a=split/\t/; next if $a[0]=~ /^CRR|FRR/; my $u= sprintf(q(https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=%s&result=read_run&fields=run_accession,fastq_ftp,fastq_md5,fastq_bytes,read_count,first_public,last_updated,center_name,broker_name),$a[0]); my $c=`curl -s "$u"`;  if ($i) {$c=~ s/^run_accession.+\n//g;} print $c; $i++;' wheat_project_table.txt > wheat_project_table.only_sra.metadata.txt``
    - for hordeum: ``perl -ne 'chomp; @a=split/\t/; next if $a[0]=~ /^CRR|FRR/; my $u= sprintf(q(https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=%s&result=read_run&fields=run_accession,fastq_ftp,fastq_md5,fastq_bytes,read_count,first_public,last_updated,center_name,broker_name),$a[0]); my $c=`curl -s "$u"`;  if ($i) {$c=~ s/^run_accession.+\n//g;} print $c; $i++;' barley_project_table.txt > barley_project_table.only_sra.metadata.txt``

* the 3 datasets that were downloaded from other websites had to be excluded because they don't have an SRA number (`CRR|FRR`):
    - CRR: 1x18, 1x27
    - FRR: 1x8
    - altogether 53 samples cannot be checked with md5sum
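
The local checksum tables read in below (`md5sum_triticum.txt`, `md5sum_hordeum.txt`) have two columns, checksum and file name. How they were produced is not recorded here; one plausible way, assuming GNU `md5sum` is available, is sketched below (the function name is hypothetical).

```bash
# Sketch: write "<md5>  <file name>" for every *.fastq.gz in directory $1
# into the output file $2. Symlinked files are read through the link.
write_md5_table() {
    ( cd "$1" && md5sum -- *.fastq.gz ) > "$2"
}
```

For example, `write_md5_table . md5sum_triticum.txt` run inside the triticum folder would give a table in the format that `read.table()` expects in the next cell.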

### 5.1 Triticum

In [131]:
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum")
md5_myfiles_t <- read.table("md5sum_triticum.txt", header = FALSE, stringsAsFactors = FALSE, fill = TRUE)
colnames(md5_myfiles_t) <- c("md5sum", "read_origin")
glimpse(md5_myfiles_t)
# in my files there are total 395 files: 180 paired (2x=360) + 35 single reads
tail(md5_myfiles_t)

md5_srafiles_t <- read.table("wheat_project_table.only_sra.metadata.txt", header = TRUE, stringsAsFactors = FALSE, fill = TRUE)
glimpse(md5_srafiles_t)
# from my files 53 had to be excluded (not SRA-based), and 8 are single reads, and 9 SRAquery failed = (395-53-8-9)/2=162 rows
# so 9 rows are fully empty!!! only run accession is available: rows 96-104 are empty
tail(md5_srafiles_t)

Rows: 395
Columns: 2
$ md5sum      <chr> "f733f5ee55823b5ebf2a849753fbe4d1", "1e496dfb4829e6d2fcf0…
$ read_origin <chr> "CRR078059.fastq.gz", "CRR078060.fastq.gz", "CRR078061.fa…


,md5sum,read_origin
,<chr>,<chr>
390,2fc1e290297d7aa65e27d2768ae07d9d,SRR9593826_1.fastq.gz
391,a24c2baae8befbfdbf8310e81c6b0408,SRR9593826_2.fastq.gz
392,d81780ca607bcdf0316f6ff7167538be,SRR9593827_1.fastq.gz
393,3c721cd1caa85c4c18a4c43173721152,SRR9593827_2.fastq.gz
394,2916437a5589491f0f1ecc9bc5ef4529,SRR9593828_1.fastq.gz
395,9a656f9988cf72ec05c0ed20c0c420a1,SRR9593828_2.fastq.gz


Rows: 162
Columns: 5
$ run_accession <chr> "SRR10737427", "SRR10737428", "SRR10737429", "SRR107374…
$ fastq_ftp     <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/027/SRR10737427/SR…
$ fastq_md5     <chr> "c48a1b60c43bdd2be454da61bb3639c2;f561735f47f93862cf552…
$ fastq_bytes   <chr> "2002341649;2538119255", "1974289005;2491528745", "1673…
$ read_count    <int> 24689119, 24037621, 19838033, 24597111, 23550244, 23342…


,run_accession,fastq_ftp,fastq_md5,fastq_bytes,read_count
,<chr>,<chr>,<chr>,<chr>,<int>
157,SRR5186313,ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/003/SRR5186313/SRR5186313_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/003/SRR5186313/SRR5186313_2.fastq.gz,5a531c5b27bdf30631d945dd0e7ff312;fae6d79a2325fcc19b11130960bd3e26,1825752785;1899613025,23607972
158,SRR5186364,ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/004/SRR5186364/SRR5186364_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/004/SRR5186364/SRR5186364_2.fastq.gz,5f1b636f4eb935bf1f4243914e9b83b9;601c31619da584853d14b91786dc76e6,1790888245;1872901419,23279520
159,SRR5186375,ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/005/SRR5186375/SRR5186375_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/005/SRR5186375/SRR5186375_2.fastq.gz,26731d99fc42c834abafed858621955a;a556d201fa0e2be73c13bab5dc24b726,1821737928;1915714923,23375772
160,SRR5186382,ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/002/SRR5186382/SRR5186382_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/002/SRR5186382/SRR5186382_2.fastq.gz,d203a431336e1d8d3e1d7474206ff3c8;2c3e8c6dfe877438516c6d1bd80476dd,1822596582;1896701016,23530190
161,SRR5186387,ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/007/SRR5186387/SRR5186387_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/007/SRR5186387/SRR5186387_2.fastq.gz,f382a7e050bf68e8161a15a7cc3da81c;2e43e68afd7207b95637b3c5da450ab8,1795499261;1901983362,23293084
162,SRR5186416,ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/006/SRR5186416/SRR5186416_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR518/006/SRR5186416/SRR5186416_2.fastq.gz,1f1834e127d76e260f6d6f0163b39460;0314277a3e5652ddc0d2a9b0239c99ef,1724414113;1819518362,23194453


In [129]:
# wrangling md5_srafiles_t
# drop the empty rows 96-104 (failed SRA queries)
md5_1 <- slice(md5_srafiles_t, 1:95)
md5_2 <- slice(md5_srafiles_t, 105:162)
md5_srafiles_joined <- bind_rows(md5_1, md5_2)
# select relevant columns
md5_srafiles_js <- select(md5_srafiles_joined, run_accession, fastq_ftp, fastq_md5)
# separate read names
md5_srafiles_t_n <- separate(md5_srafiles_js, fastq_ftp, sep=";", into=c("read", "read2"), remove = TRUE)
# separate md5sums
md5_srafiles_t_nm <- separate(md5_srafiles_t_n, fastq_md5, sep=";", into=c("md5sum", "md5sum2"), remove = TRUE)
glimpse(md5_srafiles_t_nm)
# bind column2-s under column1-s as new rows
col1 <- select(md5_srafiles_t_nm, read, md5sum)
col2 <- select(md5_srafiles_t_nm, read2, md5sum2)
colnames(col2) <- c("read", "md5sum")
md5_srafiles_clean <- bind_rows(col1, col2)
glimpse(md5_srafiles_clean)

# drop rows 241-248: the NA second reads of the 8 single-end runs (rows 88-95, shifted by the 153 rows of the first block)
md5_j1 <- slice(md5_srafiles_clean, 1:240)
md5_j2 <- slice(md5_srafiles_clean, 249:306)
md5_srafiles_jc <- bind_rows(md5_j1, md5_j2)
# final table
glimpse(md5_srafiles_jc)
head(md5_srafiles_jc)

“Expected 2 pieces. Missing pieces filled with `NA` in 8 rows [88, 89, 90, 91, 92, 93, 94, 95].”

Rows: 153
Columns: 5
$ run_accession <chr> "SRR10737427", "SRR10737428", "SRR10737429", "SRR107374…
$ read          <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/027/SRR10737427/SR…
$ read2         <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/027/SRR10737427/SR…
$ md5sum        <chr> "c48a1b60c43bdd2be454da61bb3639c2", "c90946fe4d3243d6c3…
$ md5sum2       <chr> "f561735f47f93862cf552112d5ecfd44", "735ebb95b8efb3cf26…
Rows: 306
Columns: 2
$ read   <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/027/SRR10737427/SRR107374…
$ md5sum <chr> "c48a1b60c43bdd2be454da61bb3639c2", "c90946fe4d3243d6c39817345…
Rows: 298
Columns: 2
$ read   <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/027/SRR10737427/SRR107374…
$ md5sum <chr> "c48a1b60c43bdd2be454da61bb3639c2", "c90946fe4d3243d6c39817345…


,read,md5sum
,<chr>,<chr>
1,ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/027/SRR10737427/SRR10737427_1.fastq.gz,c48a1b60c43bdd2be454da61bb3639c2
2,ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/028/SRR10737428/SRR10737428_1.fastq.gz,c90946fe4d3243d6c39817345124b33c
3,ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/029/SRR10737429/SRR10737429_1.fastq.gz,36969eeb74e3f80d16e1c46558c9a220
4,ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/030/SRR10737430/SRR10737430_1.fastq.gz,1845b33ce6e8507e2f6a69fdf1768be5
5,ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/031/SRR10737431/SRR10737431_1.fastq.gz,99b22710f2a9d2b02d57588f245b7b2f
6,ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/032/SRR10737432/SRR10737432_1.fastq.gz,781ed0f598ef66f9926ed843a3aebf76


In [125]:
# anti-join: SRA checksums that have no match among the md5sums of my files
missing <- anti_join(md5_srafiles_jc, md5_myfiles_t, by = "md5sum")
glimpse(missing)

Rows: 298
Columns: 2
$ read   <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/SRR107/027/SRR10737427/SRR107374…
$ md5sum <chr> "c48a1b60c43bdd2be454da61bb3639c2", "c90946fe4d3243d6c39817345…
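
The reshaping and comparison can also be done without hard-coded row ranges. The sketch below assumes a tab-separated ENA metadata file with a header line and the columns `run_accession`, `fastq_ftp`, `fastq_md5` in positions 1-3, plus a local table in `md5sum` format; the function names are hypothetical.

```bash
# Sketch: turn "file1;file2 <TAB> md5a;md5b" into one "md5 file_name" line
# per read. Rows with an empty checksum column (failed queries) are skipped,
# and single-end runs simply yield one line.
ena_long_format() {
    awk -F '\t' 'NR > 1 && $3 != "" {
        n = split($2, files, ";")
        split($3, sums, ";")
        for (i = 1; i <= n; i++) {
            m = split(files[i], parts, "/")   # keep only the file name
            print sums[i], parts[m]
        }
    }' "$1"
}

# ENA checksums ($1 = metadata file) that do not occur in the local
# md5sum table ($2), i.e. the shell counterpart of the anti-join.
unmatched_md5() {
    ena_long_format "$1" | awk -v local_table="$2" '
        BEGIN {
            while ((getline line < local_table) > 0) {
                split(line, field, " ")
                seen[field[1]] = 1
            }
        }
        !($1 in seen)'
}
```

Because empty rows and single reads are handled by the data itself, the same two functions should work unchanged for both the wheat and the barley tables.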


### 5.2 Hordeum

In [147]:
# import tables
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/hordeum")
md5_myfiles_h <- read.table("md5sum_hordeum.txt", header = FALSE, stringsAsFactors = FALSE, fill = TRUE)
colnames(md5_myfiles_h) <- c("md5sum", "read_origin")
glimpse(md5_myfiles_h)
# in my files there are total 433 files: 193 paired (2x=386) + 47 single reads
tail(md5_myfiles_h)

md5_srafiles_h <- read.table("barley_project_table.only_sra.metadata.txt", header = FALSE, stringsAsFactors = FALSE, fill = TRUE)
colnames(md5_srafiles_h) <- c("run_accession", "fastq_ftp", "fastq_md5", "4", "5", "6", "7", "8", "9", "10", "11")
md5_srafiles_h <- select(md5_srafiles_h, run_accession, fastq_ftp, fastq_md5)
glimpse(md5_srafiles_h)
# there are many empty queries!!!
tail(md5_srafiles_h)

Rows: 433
Columns: 2
$ md5sum      <chr> "ab64193890edd469d38655955585ac2f", "26120850d01b56932af6…
$ read_origin <chr> "ERR1248084_1.fastq.gz", "ERR1248084_2.fastq.gz", "ERR124…


,md5sum,read_origin
,<chr>,<chr>
428,aeaaab884267a57687cca261a19f3071,SRR9890004_1.fastq.gz
429,03ef1475894988e931c6fc933ec06348,SRR9890004_2.fastq.gz
430,ad99daa9542c46e4fff4ec9de370532c,SRR9890005_1.fastq.gz
431,6ce80aecc9b7afd8d9d2b1d842cc70b3,SRR9890005_2.fastq.gz
432,d5cd12618ab7d5a98622610bcc7b189f,SRR9890006_1.fastq.gz
433,e07233b7a7fd8a1e70a8818e8bf3db59,SRR9890006_2.fastq.gz


Rows: 268
Columns: 3
$ run_accession <chr> "ERR781039", "ERR781040", "ERR781041", "ERR781042", "ER…
$ fastq_ftp     <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781039/ERR78103…
$ fastq_md5     <chr> "948b0de6103c2aeffc4528e50a836fc7", "23ee478974a8d8c4bf…


,run_accession,fastq_ftp,fastq_md5
,<chr>,<chr>,<chr>
263,ERR515192,ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515192/ERR515192_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515192/ERR515192_2.fastq.gz,6453c813f381e90256a1ace2805f15f3;0863ee3974059d5388cc9e4dd00b6ff1
264,ERR515193,ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515193/ERR515193_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515193/ERR515193_2.fastq.gz,1290c2824b66bffd9338f175200a0fca;71aea413b20d193dd841270382b3e684
265,ERR515194,ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515194/ERR515194_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515194/ERR515194_2.fastq.gz,bebfce4b8b24578249cda412db2dd5d2;1a61643391a886e357f0977fe66c75ef
266,ERR515195,ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515195/ERR515195_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515195/ERR515195_2.fastq.gz,fbc583e70b1c2dfba4a4e50f10424ec0;f3c30acbebb3492be10bc0e6355efd34
267,ERR515196,ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515196/ERR515196_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515196/ERR515196_2.fastq.gz,67cfa3dd95fc5e78f0d6b9e9cb7939d6;e9b5669ef6747d2d15ddfdeceee81f72
268,ERR515197,ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515197/ERR515197_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/ERR515/ERR515197/ERR515197_2.fastq.gz,504444da4ffce064c91c60d04553f2da;bd3e06e3b0bc0a11df6a33dd32605ba6


In [150]:
# wrangling md5_srafiles_h
# separate read names
md5_srafiles_t_n <- separate(md5_srafiles_h, fastq_ftp, sep=";", into=c("read", "read2"), remove = TRUE)
# separate md5sums
md5_srafiles_t_nm <- separate(md5_srafiles_t_n, fastq_md5, sep=";", into=c("md5sum", "md5sum2"), remove = TRUE)
glimpse(md5_srafiles_t_nm)
# bind column2-s under column1-s as new rows
col1 <- select(md5_srafiles_t_nm, read, md5sum)
col2 <- select(md5_srafiles_t_nm, read2, md5sum2)
colnames(col2) <- c("read", "md5sum")
md5_srafiles_clean <- bind_rows(col1, col2)
glimpse(md5_srafiles_clean)

# drop rows 269-315 (gap left by the single reads in the second block)
md5_h1 <- slice(md5_srafiles_clean, 1:268)
md5_h2 <- slice(md5_srafiles_clean, 316:536)
md5_srafiles_final <- bind_rows(md5_h1, md5_h2)
# final table
glimpse(md5_srafiles_final)
head(md5_srafiles_final)

“Expected 2 pieces. Missing pieces filled with `NA` in 75 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].”

Rows: 268
Columns: 5
$ run_accession <chr> "ERR781039", "ERR781040", "ERR781041", "ERR781042", "ER…
$ read          <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781039/ERR78103…
$ read2         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ md5sum        <chr> "948b0de6103c2aeffc4528e50a836fc7", "23ee478974a8d8c4bf…
$ md5sum2       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Rows: 536
Columns: 2
$ read   <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781039/ERR781039.fastq…
$ md5sum <chr> "948b0de6103c2aeffc4528e50a836fc7", "23ee478974a8d8c4bff871f01…
Rows: 489
Columns: 2
$ read   <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781039/ERR781039.fastq…
$ md5sum <chr> "948b0de6103c2aeffc4528e50a836fc7", "23ee478974a8d8c4bff871f01…


,read,md5sum
,<chr>,<chr>
1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781039/ERR781039.fastq.gz,948b0de6103c2aeffc4528e50a836fc7
2,ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781040/ERR781040.fastq.gz,23ee478974a8d8c4bff871f01448818d
3,ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781041/ERR781041.fastq.gz,4d89368a89b6ddbefd9f737224c2dd37
4,ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781042/ERR781042.fastq.gz,0eb49bddcf0785ad4c4789a64b304b02
5,ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781043/ERR781043.fastq.gz,2b4e82f440d4c5cf534dbc6ab31bf7a3
6,ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781044/ERR781044.fastq.gz,ae7176a805965f383295d93b3f042212


In [151]:
# anti-join: SRA checksums that have no match among the md5sums of my files
missing <- anti_join(md5_srafiles_final, md5_myfiles_h, by = "md5sum")
glimpse(missing)

Rows: 489
Columns: 2
$ read   <chr> "ftp.sra.ebi.ac.uk/vol1/fastq/ERR781/ERR781039/ERR781039.fastq…
$ md5sum <chr> "948b0de6103c2aeffc4528e50a836fc7", "23ee478974a8d8c4bff871f01…


## 6. Preprocessing
1. FastQC + MultiQC
2. Trimmomatic
3. FastQC + MultiQC

## 6.1 Create Project table for FastQC_01_raw
### 6.1.1 Triticum

In [64]:
w_pt <- select(wheat, Run.ID, Dataset.name, Tissue)
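# note: w_check must carry the Library.layout column here (as redefined in section 6.1.4) to give the 4 columns named below
w_pt <- inner_join(w_pt, w_check, by = "Run.ID")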
colnames(w_pt) <- c("ID", "dataset_name", "tissue", "library")
glimpse(w_pt)
# save project table
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum")
write.table(w_pt, file = "wheat_project_table.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = FALSE)
# create & save testing subset
w_pt_test1 <- slice(w_pt, 1:2) # choosing 2 paired reads with "_1/2.fastq.gz"
w_pt_test2 <- slice(w_pt, 102:103) # choosing 2 single reads with "fq.gz"
w_pt_test3 <- slice(w_pt, 141) # choosing 1 single read with "sra.fastq.gz" extension
w_pt_test <- bind_rows(w_pt_test1, w_pt_test2, w_pt_test3)
print(w_pt_test)
write.table(w_pt_test, file = "wheat_project_table_test.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = TRUE)

Rows: 215
Columns: 4
$ ID           <chr> "SRR10737427", "SRR10737428", "SRR10737429", "SRR1073743…
$ dataset_name <chr> "pistillody of stamen", "pistillody of stamen", "pistill…
$ tissue       <chr> "anther", "anther", "anther", "anther", "anther", "anthe…
$ library      <chr> "paired", "paired", "paired", "paired", "paired", "paire…
           ID              dataset_name tissue library
1 SRR10737427      pistillody of stamen anther  paired
2 SRR10737428      pistillody of stamen anther  paired
3   CRR078059                      tf q  spike  single
4   CRR078085                      tf q  spike  single
5  SRR5464524 inflorescence development  spike  single


### 6.1.2 Hordeum

In [12]:
b_pt <- select(barley, Run.ID, Dataset.name, Tissue)
b_pt <- inner_join(b_pt, b_check, by = "Run.ID")
colnames(b_pt) <- c("ID", "dataset_name", "tissue")
glimpse(b_pt)
# save project table
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/hordeum")
write.table(b_pt, file = "barley_project_table.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = TRUE)


Rows: 240
Columns: 3
$ ID           <chr> "ERR781039", "ERR781040", "ERR781041", "ERR781042", "ERR…
$ dataset_name <chr> "inflorescence development", "inflorescence development"…
$ tissue       <chr> "apex", "apex", "apex", "apex", "apex", "apex", "apex", …


### 6.1.3 Delete 0 size files from FastQC-folder
1. to confirm: `for file in $(find ./ -size 0 -print); do ls $file; echo $file will be removed;ls $file -lah; done`

2. to delete: `for file in $(find ./ -size 0 -print); do ls $file; echo $file will be removed;ls $file -lah; rm $file; done`
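
The two loops above split the output of `find` on whitespace, so a path containing a space would be handled incorrectly. A sketch of a variant that lets `find` pass the paths on directly (the function names are hypothetical):

```bash
# Sketch: list zero-byte regular files below directory $1 ...
list_empty_files() {
    find "$1" -type f -size 0 -print
}

# ... and delete them only when explicitly asked to.
delete_empty_files() {
    find "$1" -type f -size 0 -exec rm -f -- {} +
}
```

`-type f` restricts the match to regular files, so symlinks in the same folder are left alone.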

### 6.1.4 Crosscheck fastqc-s with original dataset

In [33]:
# hordeum had all the runs completed
# triticum had one paired read missing:
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum/01_FastQC_raw_paired")
final <- read.table("01_trit_paired_fastqc.txt", header = FALSE, sep = "\t", stringsAsFactors = FALSE)
colnames(final) <- "Run.ID"
glimpse(final)

# create table with read layout
w_check <- select(wheat, Run.ID, Library.layout)
w_check_paired <- group_by(w_check, Library.layout)
glimpse(w_check_paired)

# intersect tables and get the missing paired read
joined_wf <- anti_join(w_check_paired, final, by = "Run.ID")
glimpse(joined_wf)
print(joined_wf)
# for some reason SRR10737427* wasn't processed, although it exists as a symlink and looks complete

Rows: 179
Columns: 1
$ Run.ID <chr> "CRR088946", "CRR088947", "CRR088948", "CRR088949", "CRR088950…
Rows: 215
Columns: 2
Groups: Library.layout [2]
$ Run.ID         <chr> "SRR10737427", "SRR10737428", "SRR10737429", "SRR10737…
$ Library.layout <chr> "paired", "paired", "paired", "paired", "paired", "pai…
Rows: 36
Columns: 2
Groups: Library.layout [2]
$ Run.ID         <chr> "SRR10737427", "CRR078059", "CRR078085", "CRR078084", …
$ Library.layout <chr> "paired", "single", "single", "single", "single", "sin…
# A tibble: 36 x 2
# Groups:   Library.layout [2]
   Run.ID      Library.layout
   <chr>       <chr>         
 1 SRR10737427 paired        
 2 CRR078059   single        
 3 CRR078085   single        
 4 CRR078084   single        
 5 CRR078083   single        
 6 CRR078082   single        
 7 CRR078081   single        
 8 CRR078080   single        
 9 CRR078079   single        
10 CRR078078   single        
# … with 26 more rows


In [164]:
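# create extra project table for the 1 missing file: SRR10737427
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum")
# the project table was written with a header row, so read it with header = TRUE
extra <- read.table("wheat_project_table_trimmomatic_paired.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# select the missing run by ID rather than by row position
extra <- filter(extra, ID == "SRR10737427")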
colnames(extra) <- c("ID", "dataset_name", "tissue")
glimpse(extra)
write.table(extra, file = "extra_1missingfile_paired.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = TRUE)

Rows: 1
Columns: 3
$ ID           <chr> "SRR10737427"
$ dataset_name <chr> "pistillody of stamen"
$ tissue       <chr> "anther"


## 6.2 Trimming with `Trimmomatic`
* bash scripts are available: `home/pgsb/vanda.marosi/scripts/triticum & /hordeum`
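
The scripts themselves are not reproduced in this notebook. As a rough sketch only, this is how a tab-separated project table of the kind written in this section could drive a paired-end Trimmomatic run. The loop only prints the commands (a dry run). The FASTQ file names, the adapter file and the trimming steps are placeholder assumptions, not the settings actually used in this project.

```shell
set -eu

# Hypothetical project table in the same layout as the *_trimmomatic_paired.txt files:
# tab-separated, with a header row (ID, dataset_name, tissue).
table=$(mktemp)
printf 'ID\tdataset_name\ttissue\n' > "$table"
printf 'SRR10737427\tpistillody of stamen\tanther\n' >> "$table"
printf 'SRR10737428\tpistillody of stamen\tanther\n' >> "$table"

# Skip the header row, then read one run per line. IFS is set to a tab so that
# dataset names containing spaces stay in a single field.
cmds=$(tail -n +2 "$table" | while IFS="$(printf '\t')" read -r id dataset tissue; do
    # Dry run: print the command instead of executing it.
    echo "trimmomatic PE ${id}_1.fastq.gz ${id}_2.fastq.gz" \
         "${id}_1P.fastq.gz ${id}_1U.fastq.gz ${id}_2P.fastq.gz ${id}_2U.fastq.gz" \
         "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36"
done)

printf '%s\n' "$cmds"
n_cmds=$(printf '%s\n' "$cmds" | wc -l | tr -d ' ')
echo "commands generated: $n_cmds"
```

The point of the sketch is the header handling: because the tables are written with `col.names = TRUE`, any script that reads them line by line has to skip the first row.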

### Create separate project tables for Trimmomatic paired & single reads
### 6.2.1 Triticum

In [158]:
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/triticum")
w_pt_trim <- select(wheat, Run.ID, Dataset.name, Tissue, Library.layout)
colnames(w_pt_trim) <- c("ID", "dataset_name", "tissue", "library")
w_pt_trim <- group_by(w_pt_trim, library)
glimpse(w_pt_trim)

w_pt_trim_paired <- filter(w_pt_trim, library == "paired")
w_pt_trim_paired <- ungroup(w_pt_trim_paired)
w_pt_trim_paired <- select(w_pt_trim_paired, ID, dataset_name, tissue)
glimpse(w_pt_trim_paired)
write.table(w_pt_trim_paired, file = "wheat_project_table_trimmomatic_paired.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = TRUE)

w_pt_trim_single <- filter(w_pt_trim, library == "single")
w_pt_trim_single <- ungroup(w_pt_trim_single)
w_pt_trim_single <- select(w_pt_trim_single, ID, dataset_name, tissue)
glimpse(w_pt_trim_single)
write.table(w_pt_trim_single, file = "wheat_project_table_trimmomatic_single.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = TRUE)

Rows: 215
Columns: 4
Groups: library [2]
$ ID           <chr> "SRR10737427", "SRR10737428", "SRR10737429", "SRR1073743…
$ dataset_name <chr> "pistillody of stamen", "pistillody of stamen", "pistill…
$ tissue       <chr> "anther", "anther", "anther", "anther", "anther", "anthe…
$ library      <chr> "paired", "paired", "paired", "paired", "paired", "paire…
Rows: 180
Columns: 3
$ ID           <chr> "SRR10737427", "SRR10737428", "SRR10737429", "SRR1073743…
$ dataset_name <chr> "pistillody of stamen", "pistillody of stamen", "pistill…
$ tissue       <chr> "anther", "anther", "anther", "anther", "anther", "anthe…
Rows: 35
Columns: 3
$ ID           <chr> "CRR078059", "CRR078085", "CRR078084", "CRR078083", "CRR…
$ dataset_name <chr> "tf q", "tf q", "tf q", "tf q", "tf q", "tf q", "tf q", …
$ tissue       <chr> "spike", "spike", "spike", "spike", "spike", "spike", "s…


### 6.2.2 Hordeum

In [159]:
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/hordeum")
b_pt_trim <- select(barley, Run.ID, Dataset.name, Tissue, Library.layout)
colnames(b_pt_trim) <- c("ID", "dataset_name", "tissue", "library")
b_pt_trim <- group_by(b_pt_trim, library)
glimpse(b_pt_trim)

b_pt_trim_paired <- filter(b_pt_trim, library == "paired")
b_pt_trim_paired <- ungroup(b_pt_trim_paired)
b_pt_trim_paired <- select(b_pt_trim_paired, ID, dataset_name, tissue)
glimpse(b_pt_trim_paired)
write.table(b_pt_trim_paired, file = "barley_project_table_trimmomatic_paired.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = TRUE)

b_pt_trim_single <- filter(b_pt_trim, library == "single")
b_pt_trim_single <- ungroup(b_pt_trim_single)
b_pt_trim_single <- select(b_pt_trim_single, ID, dataset_name, tissue)
glimpse(b_pt_trim_single)
write.table(b_pt_trim_single, file = "barley_project_table_trimmomatic_single.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = TRUE)

Rows: 240
Columns: 4
Groups: library [2]
$ ID           <chr> "ERR781039", "ERR781040", "ERR781041", "ERR781042", "ERR…
$ dataset_name <chr> "inflorescence development", "inflorescence development"…
$ tissue       <chr> "apex", "apex", "apex", "apex", "apex", "apex", "apex", …
$ library      <chr> "single", "single", "single", "single", "single", "singl…
Rows: 193
Columns: 3
$ ID           <chr> "ERR1248084", "ERR1248085", "ERR1248086", "ERR1248087", …
$ dataset_name <chr> "ref dataset drought", "ref dataset drought", "ref datas…
$ tissue       <chr> "spike", "spike", "spike", "spike", "spike", "spike", "l…
Rows: 47
Columns: 3
$ ID           <chr> "ERR781039", "ERR781040", "ERR781041", "ERR781042", "ERR…
$ dataset_name <chr> "inflorescence development", "inflorescence development"…
$ tissue       <chr> "apex", "apex", "apex", "apex", "apex", "apex", "apex", …


## 6.3 FastQC & MultiQC
* bash scripts are available: `home/pgsb/vanda.marosi/scripts/triticum & /hordeum`

### Results
* **wheat paired:** 1 of 180 samples failed
* **wheat single:** all 35 passed
* **barley single:** all 47 passed
* **barley paired:** 1 of 193 samples failed
 - since then, all of them have been re-checked and pass

#### Investigating the 1 failed barley paired sample
1. list the successful runs with bash: `ls | grep "\.zip$" | cut -d "_" -f 1 | uniq > 03_hordeum_paired_fastqc.txt`
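
The comparison against the expected runs can also be sketched entirely in the shell. The example below works on a scratch directory with empty stand-in files named after three run IDs from this dataset; `expected_ids.txt` and `done_ids.txt` are hypothetical file names introduced only for the illustration.

```shell
set -eu

# Hypothetical scratch directory standing in for 03_FastQC_trimmed_paired.
workdir=$(mktemp -d)
cd "$workdir"

# FastQC writes one zip per read file; these are empty stand-ins.
touch ERR1248085_1_fastqc.zip ERR1248085_2_fastqc.zip \
      ERR1248086_1_fastqc.zip ERR1248086_2_fastqc.zip

# Run IDs that should have been processed (sorted, as comm requires).
printf '%s\n' ERR1248084 ERR1248085 ERR1248086 | sort > expected_ids.txt

# Run IDs with at least one report: glob on the suffix, then de-duplicate.
ls -- *_fastqc.zip | cut -d "_" -f 1 | sort -u > done_ids.txt

# Lines found only in the expected list are runs without a FastQC report.
missing=$(comm -23 expected_ids.txt done_ids.txt)
echo "missing: $missing"
```

`comm -23` suppresses the lines unique to the second file and the lines common to both, which leaves exactly the set that `anti_join()` returns in the R cell.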

In [162]:
## looking for the missing barley paired-end run
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/hordeum/03_FastQC_trimmed_paired")
fqc_b <- read.table("03_hordeum_paired_fastqc.txt", header = FALSE, sep = "\t", stringsAsFactors = FALSE)
colnames(fqc_b) <- "Run.ID"
glimpse(fqc_b)

# create table with read layout
b_check <- select(barley, Run.ID, Library.layout)
b_check_paired <- filter(b_check, Library.layout == "paired")
glimpse(b_check_paired)

# anti-join the tables to get the paired run without a FastQC report
joined_b <- anti_join(b_check_paired, fqc_b, by = "Run.ID")
print(joined_b)

Rows: 192
Columns: 1
$ Run.ID <chr> "ERR1248085", "ERR1248086", "ERR1248087", "ERR1248088", "ERR12…
Rows: 193
Columns: 2
$ Run.ID         <chr> "ERR1248084", "ERR1248085", "ERR1248086", "ERR1248087"…
$ Library.layout <chr> "paired", "paired", "paired", "paired", "paired", "pai…
      Run.ID Library.layout
1 ERR1248084         paired


Run ERR1248084 has 101 bp reads and appears to be a sample of average quality in the first (raw) MultiQC report.
* solution: re-run this sample separately, starting from Trimmomatic; the bash scripts for the trimming and the final FastQC step carry `extra` in their names

In [165]:
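# create extra project table for the 1 missing file: ERR1248084
setwd("/nfs/pgsb/projects/comparative_triticeae/phenotype/flower_development/refsets/hordeum")
# the project table was written with a header row, so read it with header = TRUE
extra <- read.table("barley_project_table_trimmomatic_paired.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# select the missing run by ID rather than by row position
extra <- filter(extra, ID == "ERR1248084")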
colnames(extra) <- c("ID", "dataset_name", "tissue")
glimpse(extra)
write.table(extra, file = "extra_1missingfile_paired.txt", append = FALSE, quote = FALSE, sep = "\t", dec = ".",
            row.names = FALSE, col.names = TRUE)

Rows: 1
Columns: 3
$ ID           <chr> "ERR1248084"
$ dataset_name <chr> "ref dataset drought"
$ tissue       <chr> "spike"


After re-running:
- triticum SRR10737427 paired sample: 01_FastQC & `multiqc -f .`, Trimmomatic, 03_FastQC & `multiqc -f .`
- hordeum ERR1248084 paired sample: Trimmomatic, 03_FastQC & `multiqc -f .`

A new MultiQC report was generated; all samples are now fully processed and in place.

# Final results of quality controlled data
* created bash scripts to measure data size in GB
* scripts are available under `size_calculator_wheat/barley.sh` in the Bash_scripts folder
* final calculation: 
       barley = 696 GB paired + 38 GB single = 734 GB
       wheat = 723 GB paired + 19 GB single = 742 GB
       in total 1476 GB = 1.44 TB data
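
The size scripts are not shown here, but the totals can be re-derived from the four per-layout figures. This is a small sketch that assumes 1 TB = 1024 GB, which is the conversion the 1.44 TB figure implies; the variable names are made up for the example.

```shell
set -eu

# Per-layout sizes in GB, taken from the figures above.
barley_paired=696; barley_single=38
wheat_paired=723;  wheat_single=19

barley=$((barley_paired + barley_single))
wheat=$((wheat_paired + wheat_single))
total_gb=$((barley + wheat))

# Shell arithmetic is integer-only, so awk does the GB -> TB division.
total_tb=$(LC_ALL=C awk -v gb="$total_gb" 'BEGIN { printf "%.2f", gb / 1024 }')

echo "barley = ${barley} GB, wheat = ${wheat} GB, total = ${total_gb} GB = ${total_tb} TB"
```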

In [152]:
sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/vanda.marosi/anaconda3/envs/r/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.5.0   stringr_1.4.0   purrr_0.3.4     readr_1.3.1    
[5] tidyr_1.0.2     tibble_3.0.1    ggplot2_3.3.0   tidyverse_1.2.1
[9] dplyr_0.8.5    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     cellranger_1.1.0 pillar_1.4.3     compiler_3.6.3  
 [5] base64enc_0.1-3  to