# Globbing and Brace Expansion

Let's see how to match filenames using shell globbing patterns.
Here are some example data files.

In [1]:
ls data/filenames/

annot-a.gff3              Ba-samp2-R1.fastq.gz      sample_3.fastq
annot-b.gff3              Ba-samp2-R2-cln.fastq.gz  sample4.fastq
annot-c.gff3              Ba-samp2-R2.fastq.gz      sample5.fastq
annot-d.gff3              Ba-samp3-R1-cln.fastq.gz  sample6.fastq
annot-e.gff3              Ba-samp3-R1.fastq.gz      t1.txt
annot-f.gff3              Ba-samp3-R2-cln.fastq.gz  test1.txt
annot-g.gff3              Ba-samp3-R2.fastq.gz      testing1.txt
annot-xx.gff3             Ft-samp1-R1-cln.fastq.gz  test-reads.bam
annot-xy.gff3             Ft-samp1-R1.fastq.gz      test-reads.sam
annot-xz.gff3             Ft-samp1-R2-cln.fastq.gz  Yp-samp1-R1-cln.fastq.gz
annot-yx.gff3             Ft-samp1-R2.fastq.gz      Yp-samp1-R1.fastq.gz
annot-yy.gff3             Ft-samp2-R1-cln.fastq.gz  Yp-samp1-R2-cln.fa

Below is a brief explanation of the shell globbing syntax.

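- `*`: matches zero or more characters of any kind
- `?`: matches exactly one character of any kind
- `[X-Z]`: matches exactly one character in the range X through Z
- `[3-6]`: matches exactly one digit in the range 3 through 6 (a single character, not a multi-digit number)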
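
To make the four patterns concrete, here is a minimal sketch that runs in a throwaway directory. The file names (`a1.txt`, `ab3.txt`, and so on) are invented for this illustration and are not part of the course data. `echo` is used instead of `ls` so that the output is exactly what the shell expanded each pattern to.

```bash
# Work in a scratch directory so the demo does not touch real data.
# All file names below are hypothetical, made up for this sketch.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch a1.txt a2.txt ab3.txt b7.txt

star=$(echo a*.txt)        # '*' matches zero or more characters
qmark=$(echo a?.txt)       # '?' matches exactly one character
range=$(echo [a-b]7.txt)   # exactly one character in the range a-b
digit=$(echo a[1-2].txt)   # exactly one digit in the range 1-2

echo "a*.txt      -> $star"    # a1.txt a2.txt ab3.txt
echo "a?.txt      -> $qmark"   # a1.txt a2.txt
echo "[a-b]7.txt  -> $range"   # b7.txt
echo "a[1-2].txt  -> $digit"   # a1.txt a2.txt

# Return to the previous directory and remove the scratch directory.
cd - > /dev/null
rm -r "$demo_dir"
```

Note that `a?.txt` skips `ab3.txt` because two characters sit between `a` and `.txt`, while `a*.txt` accepts it.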

Shell globbing patterns will only expand if there are filenames matching the specified patterns.
On the other hand, brace expansion patterns will expand even for files that don't exist.
This can be useful for creating new files/directories or listing ranges of letters or numbers.
Below is a brief explanation of brace expansion syntax.

- `{.faa,.gff3}`: alternative strings, producing one result per comma-separated item
- `{1..25}`: numeric ranges
    - can also specify increment like `{2..20..2}`
    - can also zero pad to force equal width like `{001..100}`
- `{a..g}`: character ranges
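
The difference between the two mechanisms can be sketched with `echo`, since brace expansion is purely textual and needs no files on disk. This assumes bash 4 or later (the increment and zero-padding forms are not available in older versions) and default shell options, meaning neither `nullglob` nor `failglob` is set. The names `genome` and `no_such_prefix_` are placeholders chosen for the example.

```bash
# Brace expansion generates strings whether or not matching files exist.
alternatives=$(echo genome{.faa,.gff3})
stepped=$(echo {2..20..6})   # start..end..increment
padded=$(echo {08..10})      # leading zero forces equal width
letters=$(echo {a..e})

echo "$alternatives"   # genome.faa genome.gff3
echo "$stepped"        # 2 8 14 20
echo "$padded"         # 08 09 10
echo "$letters"        # a b c d e

# A glob with no matching file is not expanded. With default bash
# options the pattern is passed through unchanged.
unmatched=$(echo no_such_prefix_*.xyz)
echo "$unmatched"      # no_such_prefix_*.xyz
```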

Here are a few examples of globbing and brace expansion in action.

In [2]:
# Match files starting with 't' and ending with '1.txt'
ls data/filenames/t*1.txt

data/filenames/t1.txt  data/filenames/test1.txt  data/filenames/testing1.txt


In [3]:
# Match filenames containing 'cln'
ls data/filenames/*cln*

data/filenames/Ba-samp1-R1-cln.fastq.gz
data/filenames/Ba-samp1-R2-cln.fastq.gz
data/filenames/Ba-samp2-R1-cln.fastq.gz
data/filenames/Ba-samp2-R2-cln.fastq.gz
data/filenames/Ba-samp3-R1-cln.fastq.gz
data/filenames/Ba-samp3-R2-cln.fastq.gz
data/filenames/Ft-samp1-R1-cln.fastq.gz
data/filenames/Ft-samp1-R2-cln.fastq.gz
data/filenames/Ft-samp2-R1-cln.fastq.gz
data/filenames/Ft-samp2-R2-cln.fastq.gz
data/filenames/Ft-samp3-R1-cln.fastq.gz
data/filenames/Ft-samp3-R2-cln.fastq.gz
data/filenames/Yp-samp1-R1-cln.fastq.gz
data/filenames/Yp-samp1-R2-cln.fastq.gz
data/filenames/Yp-samp2-R1-cln.fastq.gz
data/filenames/Yp-samp2-R2-cln.fastq.gz
data/filenames/Yp-samp3-R1-cln.fastq.gz
data/filenames/Yp-samp3-R2-cln.fastq.gz


In [4]:
# Match annotation files with a single letter label
ls data/filenames/annot-?.gff3

data/filenames/annot-a.gff3  data/filenames/annot-e.gff3
data/filenames/annot-b.gff3  data/filenames/annot-f.gff3
data/filenames/annot-c.gff3  data/filenames/annot-g.gff3
data/filenames/annot-d.gff3


In [5]:
# Match annotation files with a double letter label
ls data/filenames/annot-??.gff3

data/filenames/annot-xx.gff3  data/filenames/annot-yz.gff3
data/filenames/annot-xy.gff3  data/filenames/annot-zx.gff3
data/filenames/annot-xz.gff3  data/filenames/annot-zy.gff3
data/filenames/annot-yx.gff3  data/filenames/annot-zz.gff3
data/filenames/annot-yy.gff3


In [6]:
# Match Fastq files for samples 2-5
ls data/filenames/sample[2-5].fastq
# Whoops! Sample 3 is missing because its file is named inconsistently (sample_3.fastq).

data/filenames/sample2.fastq  data/filenames/sample5.fastq
data/filenames/sample4.fastq
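
One possible way to pick up the inconsistently named file is to make the underscore optional with an empty brace alternative. The sketch below recreates the sample file names in a scratch directory so it can run anywhere. It is only one option, not necessarily the approach the course intends.

```bash
# Recreate the sample file names in a scratch directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch sample2.fastq sample_3.fastq sample4.fastq sample5.fastq sample6.fastq

# The original pattern misses sample_3.fastq.
strict=$(echo sample[2-5].fastq)

# {,_} expands to an empty string and to '_', giving two globs:
#   sample[2-5].fastq  sample_[2-5].fastq
relaxed=$(echo sample{,_}[2-5].fastq)

echo "strict:  $strict"    # sample2.fastq sample4.fastq sample5.fastq
echo "relaxed: $relaxed"   # sample2.fastq sample4.fastq sample5.fastq sample_3.fastq

cd - > /dev/null
rm -r "$demo_dir"
```

The results of the second glob are listed after those of the first, which is why `sample_3.fastq` appears last.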


In [7]:
# Match files with .sam, .bam, or .txt extensions
ls data/filenames/*{.sam,.bam,.txt}

data/filenames/t1.txt        data/filenames/test-reads.bam
data/filenames/test1.txt     data/filenames/test-reads.sam
data/filenames/testing1.txt
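
The pattern above mixes both mechanisms, and it works because bash performs brace expansion first and only then expands each resulting glob on its own. The following sketch makes the two stages visible. It uses a scratch directory that deliberately contains no `.bam` file, and it assumes default bash options (no `nullglob` or `failglob`).

```bash
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch t1.txt test-reads.sam   # note: no .bam file here

# Stage 1: with globbing switched off, only brace expansion runs.
set -f
after_braces=$(echo *{.sam,.bam,.txt})
set +f
echo "$after_braces"     # *.sam *.bam *.txt

# Stage 2: with globbing on, each of the three globs is expanded separately.
# The one without a match is left as a literal pattern.
after_globbing=$(echo *{.sam,.bam,.txt})
echo "$after_globbing"   # test-reads.sam *.bam t1.txt

cd - > /dev/null
rm -r "$demo_dir"
```

In a directory like this one, `ls *{.sam,.bam,.txt}` would list the two matching files and report an error for the literal `*.bam`.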


In [8]:
# Create dummy files for every multiple of 3 between 1 and 90
#     This is possible with brace expansion but not globbing
touch data/filenames/dummy{03..90..3}
ls data/filenames/dummy*
rm -f data/filenames/dummy*  # cleanup

data/filenames/dummy03  data/filenames/dummy33  data/filenames/dummy63
data/filenames/dummy06  data/filenames/dummy36  data/filenames/dummy66
data/filenames/dummy09  data/filenames/dummy39  data/filenames/dummy69
data/filenames/dummy12  data/filenames/dummy42  data/filenames/dummy72
data/filenames/dummy15  data/filenames/dummy45  data/filenames/dummy75
data/filenames/dummy18  data/filenames/dummy48  data/filenames/dummy78
data/filenames/dummy21  data/filenames/dummy51  data/filenames/dummy81
data/filenames/dummy24  data/filenames/dummy54  data/filenames/dummy84
data/filenames/dummy27  data/filenames/dummy57  data/filenames/dummy87
data/filenames/dummy30  data/filenames/dummy60  data/filenames/dummy90


Different patterns can be combined for complex matches.

In [9]:
# Match samples 2 and 3 from Ba and Yp, both R1 and R2; exclude cleaned reads
ls data/filenames/{Ba,Yp}-samp[2-3]-R?.fastq.gz

data/filenames/Ba-samp2-R1.fastq.gz  data/filenames/Yp-samp2-R1.fastq.gz
data/filenames/Ba-samp2-R2.fastq.gz  data/filenames/Yp-samp2-R2.fastq.gz
data/filenames/Ba-samp3-R1.fastq.gz  data/filenames/Yp-samp3-R1.fastq.gz
data/filenames/Ba-samp3-R2.fastq.gz  data/filenames/Yp-samp3-R2.fastq.gz


## Exercise

Let's try putting these patterns into practice on another set of files.
These are the bacterial species present on the RefSeq genomes FTP site.

In [10]:
ls data/refseqbacteria/

Acinetobacter_pittii.txt           Lactobacillus_salivarius.txt
Aeromonas_hydrophila.txt           Lactococcus_lactis.txt
Agrobacterium_fabrum.txt           Legionella_pneumophila.txt
Aliivibrio_fischeri.txt            Leptospira_interrogans.txt
Amycolatopsis_mediterranei.txt     Listeria_monocytogenes.txt
Aquifex_aeolicus.txt               Mesoplasma_florum.txt
Bacillus_anthracis.txt             Mesorhizobium_ciceri.txt
Bacillus_cereus.txt                Moorella_thermoacetica.txt
Bacillus_subtilis.txt              Mycobacterium_leprae.txt
Bacillus_thuringiensis.txt         Mycobacterium_tuberculosis.txt
Bacteroides_fragilis.txt           Mycobacteroides_abscessus.txt
Bacteroides_thetaiotaomicron.txt   Mycolicibacterium_smegmatis.txt
Bifidobacterium_bifidum.txt        Mycoplasma_mycoides.txt
Bifidobacterium_longum.txt         Mycoplasma_pneumoniae.txt
Bordetella_bronchiseptica.txt      Neisseria_gonorrhoeae.txt
Bordetella_parapertussis.txt       Neisseria_meningitidis.txt
Bordetella_p

In [None]:
# Match files starting with 'R'
# ...your code goes here...

In [None]:
# Match files with species name 4 characters long
# ...your code goes here...

In [None]:
# Match files with species name starting with g, h, i, or k
# ...your code goes here...

In [None]:
# Match files whose genus name is Bacillus or Yersinia
# ...your code goes here...

In [None]:
# Bonus! Create all of the following directories using a single command
#    GramNeg_1a   GramNeg_1b   GramNeg_2a   GramNeg_2b   GramNeg_3a   GramNeg_3b
#    GramPos_1a   GramPos_1b   GramPos_2a   GramPos_2b   GramPos_3a   GramPos_3b
# ...your code goes here...
ls | grep Gram
rm -rf Gram*  # cleanup