Working with several samples

Most studies comprise more than one sample. When working with several amplicon sets (i.e. samples) targeting the same DNA fragment, it is recommended to dereplicate the amplicons at the study level to speed up the clustering process.

Dereplication at the study level

Let's assume that you have 10 samples (sample_1.fasta, sample_2.fasta, etc.), and each of them was dereplicated as explained here. The following block of shell code performs a dereplication at the study level (i.e. it merges all samples and updates the amplicon abundances accordingly). Please note that these are code examples; there are surely shorter, faster, or safer ways to encode the same logic.

# Dereplicate the whole study: merge all samples, sum the abundances
# of identical amplicons, sort by decreasing abundance, and restore
# the fasta format (header line, then sequence line)
export LC_ALL=C
cat sample_*.fasta | \
awk 'BEGIN {RS = ">" ; FS = "[_\n]"}
     {if (NR != 1) {abundances[$1] += $2 ; sequences[$1] = $3}}
     END {for (amplicon in sequences) {
         print ">" amplicon "_" abundances[amplicon] "_" sequences[amplicon]}}' | \
sort --temporary-directory=$(pwd) -t "_" -k2,2nr -k1.2,1d | \
sed -e 's/_/\n/2' > all_samples.fasta
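
A quick sanity check may help make the expected formats concrete. The amplicon name (a9b3), the sequences, and the abundances below are made up; note that amplicons are merged by name, which assumes that the per-sample dereplication gave identical sequences identical names (e.g. hash values):

# Toy input, written to a scratch directory to avoid clobbering real samples
mkdir -p /tmp/derep_demo && cd /tmp/derep_demo
printf ">a9b3_5\nACGT\n" > sample_1.fasta   # abundance 5 in sample 1
printf ">a9b3_2\nACGT\n" > sample_2.fasta   # abundance 2 in sample 2
# running the dereplication block above in that directory yields:
# >a9b3_7
# ACGT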

The operation is usually very fast (millions of amplicons processed per minute). For large projects, it is possible to reach the maximum amount of data sort can handle in memory; in that case, sort's --temporary-directory option lets it spill to disk. Setting LC_ALL=C speeds up all "text" operations (sort, sed, grep, etc.), but it assumes that your fasta files contain only ASCII characters.
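
If you are unsure whether your fasta files are ASCII-only, one possible check with GNU grep is to search, in the C locale, for any byte outside the printable ASCII and whitespace ranges:

# List the fasta files containing non-ASCII or control bytes, if any
LC_ALL=C grep --files-with-matches '[^[:print:][:space:]]' sample_*.fasta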

Produce a contingency table for amplicons

To keep track of the presence-absence (or number of occurrences) of each amplicon in each sample, you might want to build a contingency table. The following shell code reads the fasta files representing your samples and produces a tab-delimited table containing all the amplicons (rows) and their occurrences in the different samples (columns). It requires GNU awk 4 or later, as it uses arrays of arrays and a coprocess.

awk 'BEGIN {FS = "[>_]"}

     # Parse the sample files
     /^>/ {contingency[$2][FILENAME] = $3
           amplicons[$2] += $3
           if (FNR == 1) {
               samples[++i] = FILENAME
           }
          }

     END {# Create table header
          printf "amplicon"
          s = length(samples)
          for (i = 1; i <= s; i++) {
              printf "\t%s", samples[i]
          }
          printf "\t%s\n", "total"

          # Sort amplicons by decreasing total abundance (use a coprocess)
          command = "LC_ALL=C sort -k1,1nr -k2,2d"
          for (amplicon in amplicons) {
               printf "%d\t%s\n", amplicons[amplicon], amplicon |& command
          }
          close(command, "to")
          FS = "\t"
          while ((command |& getline) > 0) {
              amplicons_sorted[++j] = $2
          }
          close(command)

          # Print the amplicon occurrences in the different samples
          n = length(amplicons_sorted)
          for (i = 1; i <= n; i++) {
               amplicon = amplicons_sorted[i]
               printf "%s", amplicon
               for (j = 1; j <= s; j++) {
                   printf "\t%d", contingency[amplicon][samples[j]]
               }
               printf "\t%d\n", amplicons[amplicon]
          }}' sample_*.fasta > amplicon_contingency_table.csv
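
With ten samples, the resulting table has the following overall shape (tab-delimited, rows sorted by decreasing total abundance; the amplicon names and counts below are made up, and the columns are aligned here for readability):

amplicon    sample_1.fasta    sample_2.fasta    ...    sample_10.fasta    total
a9b3        5                 2                 ...    0                  7
f2c8        0                 3                 ...    1                  4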

That code is efficient enough to deal with 3 million unique amplicons in 10 samples in less than a minute. For extremely large studies, working with flat text files might become intractable (imagine a table with 130 million rows and 1,000 columns!). If you are in that situation, switching to a (fast) relational database such as PostgreSQL is probably your best option.
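
As a minimal sketch of that approach (the database name metabarcoding and the file occurrences.tsv are hypothetical), one could store one row per amplicon-sample pair in "long" format and let the database aggregate or pivot on demand:

# Hypothetical long-format schema: one row per (amplicon, sample) pair
psql -d metabarcoding -c 'CREATE TABLE occurrences (
    amplicon  text,
    sample    text,
    abundance integer);'
# Bulk-load a tab-delimited file of occurrences, then index for fast lookups
psql -d metabarcoding -c "\copy occurrences FROM 'occurrences.tsv'"
psql -d metabarcoding -c 'CREATE INDEX ON occurrences (amplicon);'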

If the above code does not work for you (GNU awk 4 is not installed everywhere), you can use the Python script amplicon_contingency_table.py located in the scripts folder:

python amplicon_contingency_table.py sample_*.fasta > amplicon_contingency_table.csv

Perform the clustering

If you want swarm to partition your dataset with the finest resolution (a local number of differences d = 1, with elimination of potential chained clusters and secondary grafting of small clusters) on a quad-core CPU:

./swarm -d 1 -f -t 4 -s amplicons.stats all_samples.fasta > amplicons.swarms
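
Each line of amplicons.swarms describes one cluster: the amplicon names, still carrying their _abundance suffix, separated by single spaces, with the seed in first position. For example, a cluster of three amplicons (hypothetical names) looks like this:

a9b3_7 f2c8_4 0d1e_1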

Produce a contingency table for clusters

The "stats" table obtained gives for each cluster the total abundance, the number of unique amplicons in the cluster, the name and the abundance of the most abundant amplicon in the cluster (i.e. the seed), and the number of low abundant amplicons (abundance = 1) in the cluster.

Now that you have sorted clusters, you can combine that information with the amplicon contingency table to produce a cluster contingency table. That table indicates, for each cluster (rows), the number of times elements of that cluster (i.e. amplicons) have been seen in each sample (columns).

STATS="amplicons.stats"
SWARMS="amplicons.swarms"
AMPLICON_TABLE="amplicon_contingency_table.csv"
CLUSTER_TABLE="cluster_contingency_table.csv"

# Header
echo -e "cluster\t$(head -n 1 "${AMPLICON_TABLE}")" > "${CLUSTER_TABLE}"

# Compute "per sample abundance" for each cluster
awk -v SWARM="${SWARMS}" \
    -v TABLE="${AMPLICON_TABLE}" \
    'BEGIN {FS = " "
            # Index each cluster by its seed (the first amplicon on the line)
            while ((getline < SWARM) > 0) {
                swarms[$1] = $0
            }
            FS = "\t"
            # Index the amplicon contingency table by amplicon name
            while ((getline < TABLE) > 0) {
                table[$1] = $0
            }
           }

     {# Parse the stats file (clusters are sorted by decreasing abundance)
      seed = $3 "_" $4
      n = split(swarms[seed], cluster, "[ _]")
      # Sum, sample by sample, the abundances of all amplicons in the cluster
      for (i = 1; i < n; i = i + 2) {
          s = split(table[cluster[i]], abundances, "\t")
          for (j = 1; j < s; j++) {
              samples[j] += abundances[j+1]
          }
      }
      printf "%s\t%s", NR, $3
      for (j = 1; j < s; j++) {
          printf "\t%s", samples[j]
      }
      printf "\n"
      delete samples
     }' "${STATS}" >> "${CLUSTER_TABLE}"