# Shell scripting tutorial

Shell scripting gives you quick and easy ways to handle much of the routine work in a machine learning or text processing project.

Some of the things you can achieve with shell scripting:

- Install project dependencies
- Build project dependencies from source
- Download, organize and clean data
- Extract data statistics (e.g. word / character counts)
- Perform text processing
- Train and evaluate models using existing frameworks that provide a command line interface (Kaldi, OpenFst, fastText, fairseq etc.)



First, list the current working directory and its contents

In [49]:
pwd

/home/geopar/projects/bash_tutorial


In [50]:
ls -lah

total 80K
drwxr-xr-x  6 geopar geopar 4.0K Oct 18 06:48  .
drwxrwxrwx 13 geopar geopar 4.0K Oct 17 19:45  ..
-rw-r--r--  1 geopar geopar  55K Oct 18 06:48 'Bash Tutorial.ipynb'
drwxr-xr-x  2 geopar geopar 4.0K Oct 17 20:52  bin
drwxr-xr-x  2 geopar geopar 4.0K Oct 18 06:42  data
drwxr-xr-x  2 geopar geopar 4.0K Oct 17 19:46  .ipynb_checkpoints
drwxr-xr-x  9 geopar geopar 4.0K Oct 17 20:41  kenlm


## Installing project dependencies and building a project from source

Let's say we need to build an n-gram language model for some corpus. One commonly used tool for this is KenLM. Let's download it and build it from source.

Download KenLM from its git repository

In [51]:
git clone https://github.com/kpu/kenlm

fatal: destination path 'kenlm' already exists and is not an empty directory.


: 128

Install the dependencies needed to build KenLM (see the docs: https://kheafield.com/code/kenlm/dependencies/)

In [52]:
# sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
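
Before installing anything, it can help to see which build tools are already available. The following is a minimal sketch, not part of the KenLM instructions: missing_tools is a hypothetical helper name, and the tool list (cmake, make, g++) is an assumption about what the build needs on the command line.

```shell
# Hypothetical helper: print every given command that is NOT found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed tool list for a cmake-based C++ build; prints nothing if all are present.
missing_tools cmake make g++
```

An empty result means every listed command was found; any name printed still needs to be installed.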

List files in current directory

In [53]:
ls -lah

total 80K
drwxr-xr-x  6 geopar geopar 4.0K Oct 18 06:48  .
drwxrwxrwx 13 geopar geopar 4.0K Oct 17 19:45  ..
-rw-r--r--  1 geopar geopar  55K Oct 18 06:48 'Bash Tutorial.ipynb'
drwxr-xr-x  2 geopar geopar 4.0K Oct 17 20:52  bin
drwxr-xr-x  2 geopar geopar 4.0K Oct 18 06:42  data
drwxr-xr-x  2 geopar geopar 4.0K Oct 17 19:46  .ipynb_checkpoints
drwxr-xr-x  9 geopar geopar 4.0K Oct 17 20:41  kenlm


Navigate into the kenlm directory

In [54]:
cd kenlm

Display current working directory

In [55]:
pwd

/home/geopar/projects/bash_tutorial/kenlm


In [56]:
ls -lah

total 212K
drwxr-xr-x 9 geopar geopar 4.0K Oct 17 20:41 .
drwxr-xr-x 6 geopar geopar 4.0K Oct 18 06:48 ..
drwxr-xr-x 7 geopar geopar 4.0K Oct 17 20:41 build
-rw-r--r-- 1 geopar geopar  696 Oct 17 20:40 BUILDING
-rwxr-xr-x 1 geopar geopar   81 Oct 17 20:40 clean_query_only.sh
drwxr-xr-x 3 geopar geopar 4.0K Oct 17 20:40 cmake
-rw-r--r-- 1 geopar geopar 3.7K Oct 17 20:40 CMakeLists.txt
-rwxr-xr-x 1 geopar geopar 1.2K Oct 17 20:40 compile_query_only.sh
-rw-r--r-- 1 geopar geopar  26K Oct 17 20:40 COPYING
-rw-r--r-- 1 geopar geopar  35K Oct 17 20:40 COPYING.3
-rw-r--r-- 1 geopar geopar 7.5K Oct 17 20:40 COPYING.LESSER.3
-rw-r--r-- 1 geopar geopar  63K Oct 17 20:40 Doxyfile
drwxr-xr-x 6 geopar geopar 4.0K Oct 17 20:40 .git
drwxr-xr-x 3 geopar geopar 4.0K Oct 17 20:40 .github
-rw-r--r-- 1 geopar geopar  249 Oct 17 20:40 .gitignore
-rw-r--r-- 1 geopar geopar 1.2K Oct 17 20:40 LICENSE
drwxr-xr-x

Use cmake to compile the project (following the instructions at https://github.com/kpu/kenlm)

In [11]:
mkdir -p build  # -p: no error if the directory already exists
cd build
cmake ..
make -j4
# run "make install" to install kenlm system-wide

-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found suitable version "1.71.0", minimum required is "1.41.0") found components: program_options system thread unit_test_framework 
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE  
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11") 
-- Found BZi

[ 68%] Linking CXX executable ../bin/query
[ 68%] Built target query
Scanning dependencies of target filter
[ 69%] Building CXX object lm/filter/CMakeFiles/filter.dir/filter_main.cc.o
[ 70%] Linking CXX executable ../../bin/phrase_table_vocab
[ 70%] Built target phrase_table_vocab
Scanning dependencies of target kenlm_interpolate
[ 71%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/backoff_reunification.cc.o
[ 72%] Building CXX object lm/builder/CMakeFiles/kenlm_builder.dir/corpus_count.cc.o
[ 73%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/bounded_sequence_encoding.cc.o
[ 75%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate.dir/merge_probabilities.cc.o
[ 76%] Building CXX object lm/builder/CMakeFiles/kenlm_builder.dir/initial_probabilities.cc.o
[ 77%] Building CXX object lm/interpolate/CMakeFiles/kenlm_interpolate

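The previous commands built the KenLM binaries inside the build/bin folder. Let's copy them to a more accessible directory. The copy below has to run from kenlm/build, where the build cell left us; the output captured here comes from a run started in kenlm, so cp fails and the final listing shows the parent projects directory instead of bash_tutorial.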

In [57]:
ls -l

total 192
drwxr-xr-x 7 geopar geopar  4096 Oct 17 20:41 build
-rw-r--r-- 1 geopar geopar   696 Oct 17 20:40 BUILDING
-rwxr-xr-x 1 geopar geopar    81 Oct 17 20:40 clean_query_only.sh
drwxr-xr-x 3 geopar geopar  4096 Oct 17 20:40 cmake
-rw-r--r-- 1 geopar geopar  3689 Oct 17 20:40 CMakeLists.txt
-rwxr-xr-x 1 geopar geopar  1154 Oct 17 20:40 compile_query_only.sh
-rw-r--r-- 1 geopar geopar 26530 Oct 17 20:40 COPYING
-rw-r--r-- 1 geopar geopar 35147 Oct 17 20:40 COPYING.3
-rw-r--r-- 1 geopar geopar  7637 Oct 17 20:40 COPYING.LESSER.3
-rw-r--r-- 1 geopar geopar 63537 Oct 17 20:40 Doxyfile
-rw-r--r-- 1 geopar geopar  1150 Oct 17 20:40 LICENSE
drwxr-xr-x 7 geopar geopar  4096 Oct 17 20:40 lm
-rw-r--r-- 1 geopar geopar   220 Oct 17 20:40 MANIFEST.in
drwxr-xr-x 2 geopar geopar  4096 Oct 17 20:40 python
-rw-r--r-- 1 geopar geopar  5394 Oct 17 20:40 README.md
-rw-r--r-- 1 geopar geopar  1937 Oct 17 20:40 setup.py
drwxr-x

In [58]:
cp -r bin ../../  # 2 directories up
cd ../../
ls -lah

cp: cannot stat 'bin': No such file or directory
total 56K
drwxrwxrwx 13 geopar geopar 4.0K Oct 17 19:45 .
drwxr-xr-x 24 geopar geopar 4.0K Oct 17 20:53 ..
drwxr-xr-x  6 geopar geopar 4.0K Oct 18 06:48 bash_tutorial
drwxr-xr-x  4 geopar geopar 4.0K Sep 22 16:06 exam-generator
drwxr-xr-x  3 geopar geopar 4.0K Oct  7 20:20 fsspell
-rw-r--r--  1 geopar geopar  174 Oct  8 18:25 mispel.py
drwxr-xr-x  3 geopar geopar 4.0K Oct  8 18:20 .mypy_cache
drwxr-xr-x  4 geopar geopar 4.0K Sep  3 13:23 python-lab
drwxrwxrwx 17 geopar geopar 4.0K Sep  7 06:21 slp
drwxr-xr-x 14 geopar geopar 4.0K Sep 28 18:57 slp_daptmlm
drwxr-xr-x 10 geopar geopar 4.0K Sep  6 14:21 space-vim
drwxr-xr-x  2 geopar geopar 4.0K Oct  7 18:19 spellweb
drwxr-xr-x 14 geopar geopar 4.0K Sep  7 08:09 splchk
drwxr-xr-x 15 geopar geopar 4.0K Sep 15 19:20 transformers
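
Commands such as cp -r bin ../../ depend on the current working directory, which is easy to lose track of in a notebook. One way to make a step independent of earlier cd calls is to run it in a subshell. This is only a sketch: run_in_dir is a hypothetical helper name, and the KenLM paths in the comment are assumptions about where the checkout lives.

```shell
# Hypothetical helper: run a command inside a directory without changing
# the caller's working directory (the parentheses start a subshell).
run_in_dir() {
    local dir=$1
    shift
    ( cd "$dir" && "$@" )
}

# Demo with a scratch directory. For KenLM this could look like:
#   mkdir -p kenlm/build
#   run_in_dir kenlm/build cmake ..
#   run_in_dir kenlm/build make -j4
#   cp -r kenlm/build/bin .
scratch=$(mktemp -d)
run_in_dir "$scratch" pwd   # prints the scratch directory
pwd                         # still the original directory
```

Because the cd happens inside the subshell, later cells can keep using paths relative to the project root.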


## Downloading and preprocessing the training corpus

Let's get a book from Project Gutenberg and clean it up using bash

In [20]:
mkdir -p data  # -p: no error if data already exists
wget -O data/dracula.txt http://www.gutenberg.org/cache/epub/345/pg345.txt

--2020-10-17 20:54:39--  http://www.gutenberg.org/cache/epub/345/pg345.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 883160 (862K) [text/plain]
Saving to: ‘data/dracula.txt’


2020-10-17 20:54:43 (272 KB/s) - ‘data/dracula.txt’ saved [883160/883160]
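
When a notebook is re-run, it is often convenient to skip downloads that already succeeded. A minimal sketch, assuming wget is installed; fetch_if_missing is a hypothetical helper name, not something the tutorial defines.

```shell
# Hypothetical helper: download URL to DEST unless DEST already exists
# and is non-empty.
fetch_if_missing() {
    local url=$1 dest=$2
    if [ -s "$dest" ]; then
        echo "skip: $dest already present"
        return 0
    fi
    mkdir -p "$(dirname "$dest")"
    wget -q -O "$dest" "$url"
}

# Usage with the URL from the cell above (not executed here):
#   fetch_if_missing http://www.gutenberg.org/cache/epub/345/pg345.txt data/dracula.txt
```

The -s test checks that the file exists and is not empty, so an empty file left behind by a failed download is fetched again.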



In [36]:
ls -lah data

total 1.7M
drwxr-xr-x 2 geopar geopar 4.0K Oct 18 06:42 .
drwxr-xr-x 6 geopar geopar 4.0K Oct 18 06:42 ..
-rw-r--r-- 1 geopar geopar 859K Oct 18 06:42 dracula1.txt
-rw-r--r-- 1 geopar geopar 863K Oct  1 11:00 dracula.txt


Count the number of lines, words and characters using wc. Note that wc -c counts bytes; for multi-byte UTF-8 text, wc -m counts characters instead.

In [83]:
wc -l data/dracula.txt

15973 data/dracula.txt


We can format the output columns using awk

In [78]:
wc -l data/dracula.txt | awk '{printf "%s contains %s lines\n", $2, $1}'

data/dracula.txt contains 15973 lines


In [65]:
wc -w data/dracula.txt | awk '{printf "%s contains %s words\n", $2, $1}'

data/dracula.txt contains 164424 words


In [66]:
wc -c data/dracula.txt | awk '{printf "%s contains %s characters\n", $2, $1}'

data/dracula.txt contains 883160 characters
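
When the counts are needed in variables rather than printed, redirecting the file to stdin makes wc print only the number, with no filename to strip. A small sketch on toy input (the sample file and variable names are made up for illustration):

```shell
# Toy input so the example is self-contained (not the Dracula file).
sample=$(mktemp)
printf 'one two\nthree\nfour five six\n' > "$sample"

# With input on stdin, wc prints only the count.
# tr removes the padding that some wc implementations add.
n_lines=$(wc -l < "$sample" | tr -d ' ')
n_words=$(wc -w < "$sample" | tr -d ' ')
echo "lines=$n_lines words=$n_words"
```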


Inspect the first 250 lines using head

In [70]:
head -n 250 data/dracula.txt

﻿The Project Gutenberg EBook of Dracula, by Bram Stoker

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org/license


Title: Dracula

Author: Bram Stoker

Release Date: August 16, 2013 [EBook #345]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK DRACULA ***




Produced by Chuck Greif and the Online Distributed
Proofreading Team at http://www.pgdp.net (This file was
produced from images generously made available by The
Internet Archive)







                                DRACULA





                                DRACULA

                                  _by_

                              Bram Stoker

                        [Illustration: colophon]

                                NEW YORK

                            GROSSET & DUNLAP

                              _P

We see that the first 200 lines contain Project Gutenberg boilerplate and the table of contents. We can remove them using sed, then inspect the new file using head and wc.

In [88]:
sed -e "1,200d" data/dracula.txt > data/dracula1.txt
head data/dracula1.txt


_3 May. Bistritz._--Left Munich at 8:35 P. M., on 1st May, arriving at
Vienna early next morning; should have arrived at 6:46, but train was an
hour late. Buda-Pesth seems a wonderful place, from the glimpse which I
got of it from the train and the little I could walk through the
streets. I feared to go very far from the station, as we had arrived
late and would start as near the correct time as possible. The
impression I had was that we were leaving the West and entering the
East; the most western of splendid bridges over the Danube, which is
here of noble width and depth, took us among the traditions of Turkish


In [89]:
wc data/dracula1.txt | \
    awk '{
    printf "%s contains %s lines, %s words and %s characters\n", $4, $1, $2, $3
    }'

data/dracula1.txt contains 15773 lines, 164092 words and 878987 characters
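
Hard-coding the line count works for this one file. An alternative sketch keys on the marker lines instead. The START marker appears in the head output above; the matching END marker is an assumption about the rest of the file, and strip_gutenberg is a hypothetical helper name. Unlike the sed command, this keeps the table of contents, since that comes after the START marker.

```shell
# Toy file imitating the Project Gutenberg layout (made-up content).
sample=$(mktemp)
printf '%s\n' \
    'license header' \
    '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***' \
    'first line' \
    'second line' \
    '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***' \
    'license footer' > "$sample"

# Hypothetical helper: print only the lines strictly between the markers.
strip_gutenberg() {
    awk '/^\*\*\* END OF/   { inside = 0 }
         inside             { print }
         /^\*\*\* START OF/ { inside = 1 }' "$1"
}

strip_gutenberg "$sample"
```

The rule order matters: the END check runs before printing and the START check after it, so neither marker line is printed.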


We can also remove all empty lines using sed

In [92]:
sed -r '/^\s*$/d' data/dracula1.txt > data/dracula2.txt

wc data/dracula2.txt | \
    awk '{
    printf "%s contains %s lines, %s words and %s characters\n", $4, $1, $2, $3
    }'

data/dracula2.txt contains 13323 lines, 164092 words and 874087 characters


Convert all characters to lowercase using tr

In [94]:
tr A-Z a-z <data/dracula2.txt >data/dracula3.txt
head data/dracula3.txt

_3 may. bistritz._--left munich at 8:35 p. m., on 1st may, arriving at
vienna early next morning; should have arrived at 6:46, but train was an
hour late. buda-pesth seems a wonderful place, from the glimpse which i
got of it from the train and the little i could walk through the
streets. i feared to go very far from the station, as we had arrived
late and would start as near the correct time as possible. the
impression i had was that we were leaving the west and entering the
east; the most western of splendid bridges over the danube, which is
here of noble width and depth, took us among the traditions of turkish
rule.


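Then remove punctuation and digits. Note that deleting punctuation outright also glues together words that were separated only by dashes (bistritz._--left becomes bistritzleft).

In [155]:
tr -d '[:punct:][:digit:]' < data/dracula3.txt > data/dracula4.txt  # quoted so the shell cannot glob-expand the classes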
head data/dracula4.txt

 may bistritzleft munich at  p m on st may arriving at
vienna early next morning should have arrived at  but train was an
hour late budapesth seems a wonderful place from the glimpse which i
got of it from the train and the little i could walk through the
streets i feared to go very far from the station as we had arrived
late and would start as near the correct time as possible the
impression i had was that we were leaving the west and entering the
east the most western of splendid bridges over the danube which is
here of noble width and depth took us among the traditions of turkish
rule
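
The intermediate files dracula2.txt to dracula4.txt are handy for inspection, but the same cleaning steps can also be chained into one filter. A minimal sketch on toy input; normalize_text is a hypothetical helper name, and it drops blank lines last rather than first, so lines consisting only of punctuation disappear too.

```shell
# Hypothetical helper: lowercase, delete punctuation and digits,
# then drop lines that have no fields left.
normalize_text() {
    tr 'A-Z' 'a-z' | tr -d '[:punct:][:digit:]' | awk 'NF'
}

printf 'Hello, World! 42\n\nSecond LINE.\n' | normalize_text
```

On the real data this could be used as sed '1,200d' data/dracula.txt | normalize_text > data/dracula4.txt.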


Now we can perform a word frequency analysis using uniq and sort.
First we replace spaces with newlines, so that each word sits on its own line, and sort the lines to group identical words together. uniq -c then counts runs of identical consecutive lines, which gives the word frequencies. Finally we sort numerically in reverse order to print the most frequent words first

In [6]:
# A line ending in | continues on the next line, so no backslashes are
# needed and each stage can carry its own comment.
sed -r 's/\s+/\n/g' data/dracula4.txt |  # replace whitespace runs with newlines
    awk 'NF' |                           # another way to remove empty lines
    sort |                               # alphabetical sort
    uniq -c |                            # count consecutive identical lines
    sort -nr |                           # reverse numeric sort
    awk '{$1=$1; print}' |               # strip leading and trailing whitespace
    awk 'BEGIN { OFS=" " } {print $2,$1}' > data/wordcount.txt  # swap columns: word first, then count

In [56]:
head data/wordcount.txt

the 8027
and 5893
i 4710
to 4538
of 3732
a 2962
in 2555
he 2543
that 2455
it 2138
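
The same counts can also be produced by letting awk keep a tally in an associative array, which avoids sorting the whole corpus before counting. This is an alternative sketch on toy input, not the method used above; word_freq is a hypothetical helper name.

```shell
# Hypothetical helper: count whitespace-separated words from files or stdin,
# then sort by count (descending) and by word (ascending) for ties.
word_freq() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print w, count[w] }' "$@" |
        sort -k2,2nr -k1,1
}

printf 'the cat and the dog\nthe end\n' | word_freq
```

Here only the list of distinct words is sorted, which is usually much shorter than the corpus itself.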


We can even create a histogram of word counts using a simple Python script.

In [72]:
function histogram {
python3 -c 'import sys
for line in sys.stdin:
  data, width = line.split()
  print("{:<15}{:=<{width}}".format(data, "", width=int(int(width) / 75)))' # each = corresponds to a count of 75

}

In [73]:
cat data/wordcount.txt  | histogram > data/histogram.txt

In [74]:
head -n 250 data/histogram.txt

up             =====
some           =====
out            =====
would          =====
shall          =====
may            =====
our            =====
now            =====
see            =====
been           =====
know           =====
can            =====
more           ====
time           ====
an             ====
has            ====
come           ====
am             ====
over           ====
any            ====
van            ====
your           ====
came           ====
helsing        ===
went           ===
into           ===
only           ===
who            ===
go             ===
very           ===
did            ===
before         ===
like           ===
here           ===
back           ===
down           ===
again          ===
seemed         ===
about          ===
well           ===
even           ===
such           ===
way            ===
took           ==
lucy           ==
than           ==
good           ==
dear           ==
think          ==
their          ==
much           ==
wher
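
For comparison, the histogram can be sketched without Python, using awk's printf. histogram_awk is a hypothetical helper name; like the Python version it expects "word count" lines and draws one = per 75 occurrences by default.

```shell
# Hypothetical helper: read "word count" lines and print the word padded
# to 15 columns, followed by one "=" per SCALE occurrences (default 75).
histogram_awk() {
    awk -v scale="${1:-75}" '{
        bar = ""
        n = int($2 / scale)
        for (i = 0; i < n; i++) bar = bar "="
        printf "%-15s%s\n", $1, bar
    }'
}

# Toy counts (made up); on the real data: histogram_awk < data/wordcount.txt
printf 'the 300\nand 150\nrare 10\n' | histogram_awk 75
```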

## Training an n-gram language model

Now we can use KenLM to train a 3-gram language model on our preprocessed corpus

In [78]:
./bin/lmplz --help

Builds unpruned language models with modified Kneser-Ney smoothing.

Please cite:
@inproceedings{Heafield-estimate,
  author = {Kenneth Heafield and Ivan Pouzyrevsky and Jonathan H. Clark and Philipp Koehn},
  title = {Scalable Modified {Kneser-Ney} Language Model Estimation},
  year = {2013},
  month = {8},
  booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics},
  address = {Sofia, Bulgaria},
  url = {http://kheafield.com/professional/edinburgh/estimate\_paper.pdf},
}

Provide the corpus on stdin.  The ARPA file will be written to stdout.  Order of
the model (-o) is the only mandatory option.  As this is an on-disk program,
setting the temporary file location (-T) and sorting memory (-S) is recommended.

Memory sizes are specified like GNU sort: a number followed by a unit character.
Valid units are % for percentage of memory (supported platforms only) and (in
increasing powers of 1024): b, K, M, G, T, P, E, Z, Y.  Default is K (*1024).

: 1

In [82]:
./bin/lmplz -o 3 <data/dracula4.txt > data/dracula.lm.arpa 

=== 1/5 Counting and sorting n-grams ===
Reading /home/geopar/projects/bash_tutorial/data/dracula4.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 163116 types 10719
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:128628 2:3683752192 3:6907035648
Statistics:
1 10719 D1=0.644236 D2=1.00333 D3+=1.39803
2 72604 D1=0.771329 D2=1.15115 D3+=1.31645
3 133828 D1=0.882174 D2=1.25019 D3+=1.4389
Memory estimate for binary LM:
type      kB
probing 4326 assuming -p 1.5
probing 4793 assuming -r models -p 1.5
trie    1828 without quantization
trie    1039 assuming -q 8 -b 8 quantization 
trie    1738 assuming -a 22 array pointer compression
trie     949 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:128628 2:1161664 3:2
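
The help text above recommends setting the temporary file location (-T) and the sorting memory (-S). A minimal wrapper sketch under those assumptions: train_lm and the LMPLZ variable are hypothetical names, and the 50% memory limit is an arbitrary illustrative value.

```shell
# Hypothetical wrapper around lmplz: ORDER CORPUS OUTPUT.
# LMPLZ can point at the binary; it defaults to the copy in ./bin.
train_lm() {
    local order=$1 corpus=$2 out=$3
    local lmplz=${LMPLZ:-./bin/lmplz}
    if [ ! -x "$lmplz" ]; then
        echo "lmplz not found at $lmplz" >&2
        return 1
    fi
    # -o: model order, -T: temporary directory, -S: sorting memory
    "$lmplz" -o "$order" -T "${TMPDIR:-/tmp}" -S 50% < "$corpus" > "$out"
}

# Usage (not executed here):
#   train_lm 3 data/dracula4.txt data/dracula.lm.arpa
```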

We can inspect some 1-gram, 2-gram and 3-gram scores using grep

In [90]:
grep -E -A10 "1-grams|2-grams|3-grams" data/dracula.lm.arpa

\1-grams:
-4.890992	<unk>	0
0	<s>	-0.61173564
-1.4264659	</s>	0
-2.7945037	may	-0.3863618
-4.7507243	bistritzleft	-0.11276052
-4.7507243	munich	-0.11276052
-2.1378868	at	-0.7028289
-3.8816328	p	-0.14545205
-3.9838684	m	-0.19899167
-2.1765325	on	-0.64976496
--
\2-grams:
-1.7202902	<s> </s>	0
-1.1438557	may </s>	0
-1.0149596	at </s>	0
-1.0583433	p </s>	0
-0.9642732	m </s>	0
-1.2002825	on </s>	0
-1.1483785	early </s>	0
-0.9711424	next </s>	0
-1.4812524	morning </s>	0
-1.1897001	should </s>	0
--
\3-grams:
-0.8619659	<s> may </s>
-1.1233317	i may </s>
-0.7419162	of may </s>
-0.9894832	it may </s>
-1.0609393	and may </s>
-1.0421566	we may </s>
-1.3042028	he may </s>
-1.8012102	there may </s>
-0.9887751	this may </s>
-1.196441	you may </s>
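
In the ARPA format each entry is tab-separated: a log10 probability, the n-gram itself and, for lower orders, a log10 backoff weight. A short sketch that turns the first column into a plain probability; arpa_to_prob is a hypothetical helper name and the input line is a toy value, not taken from the model above.

```shell
# Hypothetical helper: read ARPA entry lines and print "ngram<TAB>probability".
# Section headers and other lines without a leading number are skipped.
arpa_to_prob() {
    awk -F'\t' 'NF >= 2 && $1 ~ /^-?[0-9]/ { printf "%s\t%.6f\n", $2, 10 ^ $1 }'
}

# Toy entry in ARPA layout (made-up values): log10 p = -2, so p = 0.01
printf -- '-2\tfoo bar\t-0.5\n' | arpa_to_prob
```

On the trained model this could be run as arpa_to_prob < data/dracula.lm.arpa | head.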


We can also use query to compute the perplexity of a sentence under the trained language model.
Lower perplexity means the model considers the sentence more probable.

Let's have the model score two possible endings.

In [91]:
./bin/query -h

./bin/query: invalid option -- 'h'
KenLM was compiled with maximum order 6.
Usage: ./bin/query [-b] [-n] [-w] [-s] lm_file
-b: Do not buffer output.
-n: Do not wrap the input in <s> and </s>.
-v summary|sentence|word: Print statistics at this level.
   Can be used multiple times: -v summary -v sentence -v word
-l lazy|populate|read|parallel: Load lazily, with populate, or malloc+read
The default loading method is populate on Linux and read on others.

Each word in the output is formatted as:
  word=vocab_id ngram_length log10(p(word|context))
where ngram_length is the length of n-gram matched.  A vocab_id of 0 indicates
the unknown word. Sentence-level output includes log10 probability of the
sentence and OOV count.


: 1

In [92]:
echo "harker and mina die a horrible death" > data/bad_ending
echo "harker and mina live happily ever after" > data/good_ending

In [100]:
cat data/bad_ending
./bin/query data/dracula.lm.arpa < data/bad_ending 2>&1| grep "Perplexity" | head -1

harker and mina die a horrible death
Perplexity including OOVs:	456.47716780020465


In [101]:
cat data/good_ending
./bin/query data/dracula.lm.arpa < data/good_ending 2>&1| grep "Perplexity" | head -1

harker and mina live happily ever after
Perplexity including OOVs:	916.2937036462345
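
Perplexity is usually defined from the total log10 probability of a sentence and its number of scored tokens as 10^(-logprob / tokens). The sketch below shows that arithmetic with awk on made-up numbers; perplexity is a hypothetical helper name, and the assumption that the end-of-sentence token counts as a token should be checked against the KenLM documentation.

```shell
# Hypothetical helper: perplexity from a total log10 probability and a
# token count, using the usual definition 10^(-logprob / tokens).
perplexity() {
    awk -v lp="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 10 ^ (-lp / n) }'
}

# Made-up values: total log10 probability -6 over 3 tokens
perplexity -6 3   # 10^(6/3) = 100.00
```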
