split outputs consecutive sections of a file; a common use case is breaking a large file into several smaller files by number of lines or by size. Navigate to the /Udata2/Sorting folder with cd:
user@computer:~$ cd /home/user/Desktop/Udata2/Sorting
First, check the number of lines in the file molecules3.smi:
user@computer:~$ wc -l molecules3.smi
1430 molecules3.smi
Let's say we need to split this file into chunks of 250 lines each. By default, split writes 1000 lines per output file, but we can change this with the -l option. In addition, we will use mols_ as the file prefix and the --additional-suffix option to add our preferred file extension:
user@computer:~$ split -l250 molecules3.smi mols_ --additional-suffix='.smi'
user@computer:~$ wc -l mols*.smi
250 mols_aa.smi
250 mols_ab.smi
250 mols_ac.smi
250 mols_ad.smi
250 mols_ae.smi
180 mols_af.smi
1430 total
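One way to sanity-check a split like the one above is to confirm that the chunks, concatenated in order, reproduce the original file. The sketch below assumes GNU coreutils (for --additional-suffix and -d) and uses a generated stand-in file, sample.smi, with the hypothetical prefixes part_ and num_ rather than the lesson's data:

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for molecules3.smi (hypothetical content): 1430 numbered lines.
seq 1 1430 > sample.smi

# Split into 250-line chunks with alphabetic suffixes, as in the lesson.
split -l 250 --additional-suffix='.smi' sample.smi part_

# Concatenating the chunks in glob (alphabetical) order should give back the original.
cat part_*.smi | cmp - sample.smi && echo "chunks match original"

# Same split, but with numeric suffixes (-d): num_00.smi, num_01.smi, ...
split -l 250 -d --additional-suffix='.smi' sample.smi num_
ls num_*.smi
```

Numeric suffixes can be easier to loop over in later scripts, while the default alphabetic suffixes match the output shown in this lesson.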
shuf writes a random permutation of its input lines; with the -n option it outputs only that many randomly selected lines. According to info shuf, the default behavior is that "Each output permutation is equally likely." This is quite useful for sampling datasets. For example, we could use shuf to select 20 random lines from molecules3.smi:
user@computer:~$ shuf -n20 molecules3.smi
CCC1(C(CCC2C1=CCCC2)C(C)(C)C)C PZPIXXRYSWHWEV-UHFFFAOYSA-N 143681056
CC1(CCC(C2=CCCCC21)(C)C)C LOTKSLLORLTOQN-UHFFFAOYSA-N 145751120
CCC1CC[C@H](C2C1C=C(CC2)C)C JGVOBAQNVSGFRL-NTXGFPLRSA-N 141800277
C[C@@H]1CCC2=CC(=C)CCC2C1 VYQXHZJGWUHKSE-RWANSRKNSA-N 153141248
CCC[C@H]1CCC(=CC1)C2[C@@H](CCC3=CC23)C SPPRTJKFYRDPTA-QJRTVWDNSA-N 152806595
...
...
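Because shuf samples without replacement by default, a quick check is that no line appears twice in the sample. The following is a minimal sketch using a generated stand-in file (sample.smi and sample20.smi are hypothetical names), not the lesson's dataset:

```shell
# Work in a scratch directory with a stand-in file of 1430 distinct lines.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 1430 > sample.smi

# Draw 20 lines at random and save them with -o instead of printing them.
shuf -n 20 -o sample20.smi sample.smi
wc -l < sample20.smi

# uniq -d prints only repeated lines; no output means the sample has no duplicates.
sort sample20.smi | uniq -d

# Without -n, shuf permutes the whole file: sorting the result restores the original.
shuf sample.smi | sort -n | cmp - sample.smi && echo "shuffled output contains the same lines"
```

Saving the sample to a file keeps it available for later steps, since running shuf again would normally produce a different selection.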
csplit can split files on a specified pattern. Navigate to the /Udata2/SDF folder with cd:
user@computer:~$ cd /home/user/Desktop/Udata2/SDF
The PubChem_substance_WikiData1-15.sdf file is a sample of 15 chemical substance records from WikiData in the PubChem Substance database, exported in .sdf format. The .sdf file has a regular structure: while the number of lines in each record varies, every substance record ends with a line of four dollar signs, $$$$. What if we wanted to split the .sdf file into individual substance files, creating 15 files? We can do this with csplit as follows:
user@computer:~$ csplit --elide-empty-files --suffix-format='%03d.mol' PubChem_substance_WikiData1-15.sdf '/\$\{4\}/1' '{*}'
user@computer:~$ ls
PubChem_substance_WikiData1-15.sdf xx003.mol xx007.mol xx011.mol
xx000.mol xx004.mol xx008.mol xx012.mol
xx001.mol xx005.mol xx009.mol xx013.mol
xx002.mol xx006.mol xx010.mol xx014.mol
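To experiment with the command above without touching the real data, it can be run on a tiny synthetic file. This is only a sketch: toy.sdf and its three made-up records are hypothetical stand-ins for an SDF file, and --quiet is added just to suppress the byte counts that csplit normally prints (GNU csplit is assumed):

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Three toy "records" of different lengths, each terminated by a $$$$ line.
printf '%s\n' \
  'record A' 'M  END' '$$$$' \
  'record B' 'line 2' 'M  END' '$$$$' \
  'record C' '$$$$' > toy.sdf

# Same options and pattern as in the lesson.
csplit --quiet --elide-empty-files --suffix-format='%03d.mol' toy.sdf '/\$\{4\}/1' '{*}'

# One output file per record: xx000.mol, xx001.mol, xx002.mol
ls xx*.mol

# Each file should end with its own $$$$ terminator.
tail -n 1 xx000.mol
```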
The above csplit command is more complicated than the split command. Let's look at each piece and the thought process:
- First, the pattern I wanted to use for splitting the file was four repeated dollar signs, $$$$. The regular expression I came up with was \${4}, which is a dollar sign repeated 4 times (i.e., {4}). Note that I added an escape \ to avoid special interpretation of $. I'm not an expert with regular expressions, but this seemed reasonable to try.
- The regular expression above did not work, but I found out that in POSIX Basic Regular Expression (BRE) syntax, the braces of the repetition count must themselves be escaped, like this: \$\{4\}. See the Wikipedia Regular Expressions page and chapter 19 of LinuxCommand.org / LinuxCommand (PDF version).
- Next, this pattern is placed between forward slashes, as per the csplit syntax: /\$\{4\}/
- After some testing, I found that I needed to add an offset of 1 to the pattern so that the $$$$ line is included at the end of the current file rather than at the beginning of the next one. So, the pattern argument became: /\$\{4\}/1
- Next, this pattern is followed by the csplit notation '{*}' to repeat the pattern for every match (see man csplit). So now we have our final pattern syntax: '/\$\{4\}/1' '{*}'
- The --elide-empty-files option was added because a 16th empty file is created without this, presumably from trying to split the file at the last occurrence of $$$$.
- The --suffix-format='%03d.mol' option labels each output file with a three-digit, zero-padded integer and the .mol extension. The %03d is printf notation.
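The two regular-expression spellings and the printf-style suffix discussed above can be compared directly in the shell. In this sketch grep stands in for csplit's pattern matching, on the assumption that grep's default syntax treats braces the same way (basic syntax by default, extended syntax with -E); toy.sdf is a small hypothetical file:

```shell
# Work in a scratch directory with a tiny file containing two $$$$ lines.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' 'record A' '$$$$' 'record B' '$$$$' > toy.sdf

# Basic syntax, unescaped braces: { and } are literal characters, so nothing matches.
# (grep exits with status 1 when there is no match, hence the "|| true".)
bre_unescaped=$(grep -c '\${4}' toy.sdf || true)

# Basic syntax, escaped braces: \{4\} is a repetition count.
bre_escaped=$(grep -c '\$\{4\}' toy.sdf)

# Extended syntax (-E): unescaped braces are the repetition count.
ere=$(grep -E -c '\${4}' toy.sdf)

echo "unescaped: $bre_unescaped, escaped: $bre_escaped, extended: $ere"

# printf reuses its format for each argument; %03d pads to three digits with zeros.
printf 'xx%03d.mol\n' 0 7 14
```

The counts illustrate why the first attempt found nothing to split on, and the printf line shows how suffixes such as 000, 007, and 014 come about.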
Attribution
Some content in this workshop has been adapted and derived from the Software Carpentry The Unix Shell lesson (CC BY 4.0 license for Instructional Materials and MIT License for programs and code examples). We reused a few parts of their lesson, including some descriptions and command examples, but included our own specific datasets and use-case workflows. We have maintained the MIT license for program and code examples. Molecular and bibliographic dataset examples were retrieved from NCBI via their EDirect utility and are credited to NCBI and NLM. Please see the NCBI Website and Data Usage Policies and Disclaimers for more information regarding the data.