# Regular Expressions

Regular expressions (regex) define a pattern, built from literal characters and wildcards, that is used to search text.  RegExs can be used with several Linux commands and are also supported in many programming languages, including Python and C++.
In this tutorial we will use 'grep' and 'egrep' to investigate the most common regular expressions.  Full details can be found online, for example:  

[RegEx](https://www.rexegg.com/regex-quickstart.html "Cheat Sheet")

First, let's create a Jupyter magic command that runs bash cells without raising an error when the script exits with a non-zero status:

In [37]:
%alias_magic b bash -p '--no-raise-error'

Created `%%b` as an alias for `%%bash --no-raise-error`.
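
The reason for ignoring errors deserves a brief sketch. "grep" signals "no match" through its exit status (0 when at least one line matches, 1 when none does), and a `%%bash` cell whose script ends with a non-zero status is reported as a failure. The snippet below assumes a made-up three-word list in place of the dictionary; the variable names are illustrative only.

```shell
# Hypothetical sample input standing in for /usr/share/dict/words.
words=$(printf '%s\n' abcoulomb Babcock father)

# A pattern that matches at least one line: exit status 0.
printf '%s\n' "$words" | grep 'abc' > /dev/null
status_match=$?

# A pattern that matches nothing: exit status 1, which is not a real error.
status_nomatch=0
printf '%s\n' "$words" | grep 'xyz' > /dev/null || status_nomatch=$?

echo "match: $status_match, no match: $status_nomatch"
```

With the alias in place, a search that finds nothing simply prints nothing instead of ending the cell with a traceback.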


The file:  
- /usr/share/dict/words  

is a dictionary with one word per line.  We can use it to illustrate most RegExs.

In [7]:
%%b
grep abc /usr/share/dict/words

abcoulomb
Babcock
crabcatcher
dabchick


This example looks for the pattern "abc" anywhere on a line in our chosen file.  "grep" prints each matching line in full, but since this file has only one word per line, each match prints as a single word.  
Note that when using "grep", the pattern can appear anywhere in the line, not just at the start of a line.  If you want to match a pattern only at the start of a line, you need to use an anchor, as below:

In [38]:
%%b
grep '^abc' /usr/share/dict/words

abcoulomb


Now only lines beginning with "abc" are matched.  The ^ sign anchors the pattern to the start of the line.  It's wise to enclose the pattern in single quotes to stop "bash" from interpreting characters such as * or [ ] as filename wildcards before "grep" sees them.  
Other anchors are possible:

In [42]:
%%b
grep 'father$' /usr/share/dict/words

cofather
father
forefather
godfather
grandfather
gudefather
housefather
misfather
stepfather
stepgrandfather
stockfather
unfather


The $ anchors the pattern to the end of the line.  If you use both anchors: 

In [44]:
%%b
grep '^father$' /usr/share/dict/words

father


The pattern must fill the entire line.  Note that when we add wildcards to the pattern we can get a result with several lines:

In [45]:
%%b
grep '^.ather$' /usr/share/dict/words

bather
father
gather
lather
nather
rather


Here the dot matches any one character.  
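
The anchors and the dot can be tried on a small inline list without touching the dictionary. This is a minimal sketch; the sample words and variable names are assumptions chosen for illustration.

```shell
# Hypothetical sample words; the real dictionary is not needed here.
words=$(printf '%s\n' father forefather fathered bather gather feather)

# '^' pins the match to the start of the line, '$' to the end.
starts=$(printf '%s\n' "$words" | grep '^father')   # father, fathered
ends=$(printf '%s\n' "$words" | grep 'father$')     # father, forefather
whole=$(printf '%s\n' "$words" | grep '^father$')   # father only

# '.' stands for exactly one character, so '^.ather$' needs six letters.
dot=$(printf '%s\n' "$words" | grep '^.ather$')     # father, bather, gather

printf 'starts:\n%s\nends:\n%s\nwhole:\n%s\ndot:\n%s\n' \
  "$starts" "$ends" "$whole" "$dot"
```
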
Indeed, the dot is often used in patterns:

In [46]:
%%b
grep '^as.....er$' /usr/share/dict/words

asphalter
assaulter
assembler
astatizer
astringer


Five dots in the pattern match any 5 characters.  Thus we have searched for 9-letter words beginning with 'as' and ending with 'er'.  
RegExs allow repeat counts, so the above can also be written:

In [48]:
%%b
grep '^as.{5}er$' /usr/share/dict/words

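However, as you can see, this prints nothing.  That's because "grep" uses basic regular expression syntax by default, in which unescaped braces are ordinary characters rather than a repeat count.  What we need is extended grep or "egrep", which is equivalent to "grep -E":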

In [49]:
%%b
egrep '^as.{5}er$' /usr/share/dict/words

asphalter
assaulter
assembler
astatizer
astringer


Now things work as expected.  We will continue to use egrep from now on.  
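
The difference between the two syntaxes can be sketched on a small inline list. In basic syntax a repeat count has to be written with escaped braces, `\{5\}`; in extended syntax the braces are used as they are. The sample words, including the odd literal line, are assumptions made up for this illustration, and `grep -E` is used in place of "egrep".

```shell
# Hypothetical sample lines; the last one is literal text containing braces.
words=$(printf '%s\n' asphalter assembler aster 'as.{5}er')

# Basic syntax (plain grep): unescaped braces are ordinary characters,
# so this only matches a line that literally contains '{5}'.
basic_plain=$(printf '%s\n' "$words" | grep '^as.{5}er$')

# Basic syntax with escaped braces: a genuine repeat count.
basic_escaped=$(printf '%s\n' "$words" | grep '^as.\{5\}er$')

# Extended syntax (grep -E): no escaping needed.
extended=$(printf '%s\n' "$words" | grep -E '^as.{5}er$')

printf 'basic, plain braces:\n%s\n' "$basic_plain"      # as.{5}er
printf 'basic, escaped braces:\n%s\n' "$basic_escaped"  # asphalter, assembler
printf 'extended:\n%s\n' "$extended"                    # asphalter, assembler
```
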
Now let's look at square brackets.  [ ] matches any one of the characters inside the brackets: 

In [50]:
%%b
egrep '^a[rst].{5}er$' /usr/share/dict/words

archaizer
archiater
archruler
archsewer
archurger
areasoner
areometer
arraigner
arsnicker
artificer
artolater
asphalter
assaulter
assembler
astatizer
astringer
atmolyzer
atmometer
attainder
attempter
attracter


This is the same pattern as before, except the second character can be an 'r', 's' or 't'.  Brackets also work with repeat counts:

In [52]:
%%b
egrep '^a[rst]{3}.{5}$' /usr/share/dict/words

astraddle
astragali
astrakhan
astringer
astrocyte
astrodome
astrogeny
astroglia
astrogony
astrolabe
astrology
astronaut
astronomy
astrophil
attracter
attractor
attrahent
attribute
attrition
attritive


The above pattern searches for 'a' followed by 3 successive occurrences of 'r', 's' or 't' in any order, followed by any 5 characters.  The anchors further restrict the search to fill the entire line.  
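
A small sketch of a bracket set combined with a repeat count, with and without the anchors. The five sample words are assumptions picked so that each case is visible; `grep -E` is used in place of "egrep".

```shell
# Hypothetical sample words standing in for the dictionary.
words=$(printf '%s\n' astronomy attribute astronomer castration artist)

# Anchored: the whole line must be 'a', three of r/s/t, then five characters.
anchored=$(printf '%s\n' "$words" | grep -E '^a[rst]{3}.{5}$')

# Unanchored: the same sequence may sit anywhere inside a longer line.
unanchored=$(printf '%s\n' "$words" | grep -E 'a[rst]{3}.{5}')

printf 'anchored:\n%s\n' "$anchored"      # astronomy, attribute
printf 'unanchored:\n%s\n' "$unanchored"  # adds astronomer, castration
```

Here 'artist' never matches because 'a' is followed by 'rti', and 'i' is not in the set.
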
Note that without the anchors, many more lines are matched:

In [54]:
%%b
egrep 'a[rst]{3}.{5}' /usr/share/dict/words

agastroneuria
Anastrophia
antiastronomical
antiattrition
astraddle
astraeiform
astragalar
astragalectomy
astragali
astragalocalcaneal
astragalocentral
astragalomancy
astragalonavicular
astragaloscaphoid
astragalotibial
astragalus
astrakanite
astrakhan
astraphobia
astrapophobia
astriction
astrictive
astrictively
astrictiveness
astriferous
astringency
astringent
astringently
astringer
astroalchemist
astroblast
astrochemist
astrochemistry
astrochronological
astrocyte
astrocytoma
astrocytomata
astrodiagnosis
astrodome
astrogeny
astroglia
astrognosy
astrogonic
astrogony
astrograph
astrographic
astrography
astrolabe
astrolabical
astrolater
astrolatry
astrolithology
astrologaster
astrologer
astrologian
astrologic
astrological
astrologically
astrologistic
astrologize
astrologous
astrology
astromancer
astromancy
astromantic
astrometeorological
astrometeorologist
astrometeorology
astrometer
astrometrical
astrometry
astronaut
astronautics
astronomer
astronomic
astronomical
astronomically
astronom

Yet another example of using brackets:

In [20]:
%%b
egrep 'a[sed]sp' /usr/share/dict/words

broadspread
grassplot
headspring
passpenny
passport
passportless
praesphenoid


In [78]:
%%b
egrep th..ing /usr/share/dict/words

atheling
blithering
feathering
forthbring
forthbringer
forthgoing
gathering
ingathering
leathering
Lotharingian
mothering
pregathering
slithering
smothering
smotheringly
splathering
strengthening
strengtheningly
sympathizing
sympathizingly
taxgathering
thuringite
unbothering
underfeathering
unsympathizing
unsympathizingly
untethering
unthawing
unwithering
weathering
withering
witheringly
woolgathering


Here we search for lines with 'th' followed by 2 characters, followed by 'ing'.  
Using [ ] we can restrict the search:

In [82]:
%%b
egrep 'th[er]{2}ing' /usr/share/dict/words

blithering
feathering
gathering
ingathering
leathering
mothering
pregathering
slithering
smothering
smotheringly
splathering
taxgathering
unbothering
underfeathering
untethering
unwithering
weathering
withering
witheringly
woolgathering


Now each of those 2 characters must be 'e' or 'r'.  The reverse search is also possible:

In [85]:
%%b
egrep 'th[^er]{2}ing' /usr/share/dict/words

forthgoing
sympathizing
sympathizingly
unsympathizing
unsympathizingly
unthawing


A ^ as the first character inside [ ] negates the set: each of the two characters can be anything other than 'e' or 'r'.
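
The two meanings of ^ are easy to confuse, so here is a minimal sketch of a set and its negation side by side. The sample words are assumptions; `grep -E` is used in place of "egrep".

```shell
# Hypothetical sample words standing in for the dictionary.
words=$(printf '%s\n' gathering unthawing withering sympathizing thing)

# Both characters between 'th' and 'ing' must come from the set {e, r}.
only_er=$(printf '%s\n' "$words" | grep -E 'th[er]{2}ing')

# Inside brackets a leading '^' negates the set: neither character may be e or r.
# (Outside brackets '^' would instead anchor to the start of the line.)
not_er=$(printf '%s\n' "$words" | grep -E 'th[^er]{2}ing')

printf 'in set:\n%s\n' "$only_er"     # gathering, withering
printf 'not in set:\n%s\n' "$not_er"  # unthawing, sympathizing
```

Note that 'thing' matches neither pattern: a negated set still has to consume a character, and there is nothing between 'th' and 'ing'.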

In [86]:
%%b
egrep 'th[^er]{2,4}ing' /usr/share/dict/words

clothmaking
earthmaking
earthquaking
forthcoming
forthcomingness
forthgoing
forthputting
freethinking
mythmaking
nonthinking
outhousing
pathfinding
pothunting
soothsaying
sympathizing
sympathizingly
thatching
thigging
thinking
thinkingly
thinkingpart
thinkling
thinning
thomasing
thudding
thuddingly
thumping
thumpingly
thwacking
thwackingly
toothaching
tufthunting
unmethodizing
unsympathizing
unsympathizingly
unthanking
unthawing
unthinking
unthinkingly
unthinkingness
unthinning
unwithholding
wealthmaking
wreathmaking


This is the same search with a repeat count of {2,4}, i.e. at least 2 and at most 4 repeats.  {2,} would mean 2 or more.
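
The three forms of repeat count can be compared on one inline list. This is a sketch under the assumption of a hand-picked set of sample words; `grep -E` is used in place of "egrep".

```shell
# Hypothetical sample words with 0, 2, 3, 4 and 7 letters between 'th' and 'ing'.
words=$(printf '%s\n' thing thawing thinking thatching thanksgiving)

exactly_two=$(printf '%s\n' "$words" | grep -E 'th[^er]{2}ing')
two_to_four=$(printf '%s\n' "$words" | grep -E 'th[^er]{2,4}ing')
two_or_more=$(printf '%s\n' "$words" | grep -E 'th[^er]{2,}ing')

printf '{2}:\n%s\n' "$exactly_two"    # thawing
printf '{2,4}:\n%s\n' "$two_to_four"  # thawing, thinking, thatching
printf '{2,}:\n%s\n' "$two_or_more"   # adds thanksgiving
```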

In [26]:
%%bash
egrep 'b[o]+st' /usr/share/dict/words

boost
booster
boosterism
bostangi
bostanji
bosthoon
boston
bostonite
bostrychid
bostrychoid
bostrychoidal
bostryx
carbostyril
embosture
hydrocarbostyril
isocarbostyril
phlebostasia
phlebostasis
phlebostenosis
phlebostrepsis
thrombostasis
upboost


In the above search, '+' is a repeat count of 1 or more.  It is equivalent to {1,}.  The brackets around the single 'o' are optional; 'bo+st' gives the same result.
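
A short sketch of the equivalence, with '*' (zero or more) added for contrast. The four sample strings are assumptions, and 'bst' is not a real word; it is there only to show the zero-repeat case. `grep -E` is used in place of "egrep".

```shell
# Hypothetical sample strings standing in for the dictionary.
words=$(printf '%s\n' boost boston best bst)

plus=$(printf '%s\n' "$words" | grep -E 'bo+st')        # one or more 'o'
interval=$(printf '%s\n' "$words" | grep -E 'bo{1,}st') # same thing, spelled out
star=$(printf '%s\n' "$words" | grep -E 'bo*st')        # zero or more 'o'

printf '+:\n%s\n' "$plus"          # boost, boston
printf '{1,}:\n%s\n' "$interval"   # boost, boston
printf '*:\n%s\n' "$star"          # boost, boston, bst
```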

In [35]:
%%bash
egrep '[aeiou]{5,}' /usr/share/dict/words

cadiueio
Chaouia
euouae
Guauaenok


In [37]:
%%bash
egrep 'a....rb' /usr/share/dict/words

anteorbital
cyanocarbonic
nasoturbinal
sulphatocarbonic
transverbate
transverbation
transverberate
transverberation


In [46]:
%%bash
egrep '.{23,24}' /usr/share/dict/words

anthropomorphologically
blepharosphincterectomy
epididymodeferentectomy
formaldehydesulphoxylate
formaldehydesulphoxylic
gastroenteroanastomosis
hematospectrophotometer
macracanthrorhynchiasis
pancreaticoduodenostomy
pathologicohistological
pathologicopsychological
pericardiomediastinitis
phenolsulphonephthalein
philosophicotheological
Pseudolamellibranchiata
pseudolamellibranchiate
scientificogeographical
scientificophilosophical
tetraiodophenolphthalein
thymolsulphonephthalein
thyroparathyroidectomize
transubstantiationalist


In [51]:
%%bash
egrep 'b[.]*ingly$' /usr/share/dict/words


absorbingly
benumbingly
daubingly
disturbingly
gibingly
jabbingly
numbingly
perturbingly
snubbingly
sobbingly
stabbingly
throbbingly
undisturbingly
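
One detail in the search above is easy to miss: inside square brackets the dot loses its wildcard meaning and stands for a literal '.', so '[.]*' means "zero or more full stops". Since dictionary words contain none, the pattern behaves like 'bingly$', which is consistent with every result ending in 'bingly'. A minimal sketch of the difference, using made-up sample lines and `grep -E` in place of "egrep":

```shell
# Hypothetical sample lines; 'b..ingly' contains two literal dots.
words=$(printf '%s\n' sobbingly boringly 'b..ingly' kingly)

# '[.]' is a literal dot, so only dots may sit between 'b' and 'ingly'.
bracket_dot=$(printf '%s\n' "$words" | grep -E 'b[.]*ingly$')

# A bare '.' is the wildcard, so anything may sit between 'b' and 'ingly'.
plain_dot=$(printf '%s\n' "$words" | grep -E 'b.*ingly$')

printf '[.]*:\n%s\n' "$bracket_dot"  # sobbingly, b..ingly
printf '.*:\n%s\n' "$plain_dot"      # sobbingly, boringly, b..ingly
```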


In [60]:
%%bash
grep '^ap.*ingly$' /usr/share/dict/words

appallingly
appealingly
appeasingly
appetizingly
applaudingly
applyingly
appraisingly
appreciatingly
apprehendingly
approvingly


In [58]:
%%bash
egrep '^a.*b$' /usr/share/dict/words

abb
abcoulomb
abscoulomb
absorb
abwab
acerb
adsorb
adverb
alb
aplomb
aquabib
archsnob
ardeb
athrob
autocab


In [65]:
%%bash
egrep '.{3,}(after|before)' /usr/share/dict/words

engrafter
hereafter
hereafterward
herebefore
hereinafter
hereinbefore
midafternoon
pinbefore
preafternoon
thenceafter
thereafter
thereafterward
thereinafter
thereinbefore
unraftered
whereafter
woodcrafter


In [79]:
%%bash
egrep 'h.*[z]{2}' /usr/share/dict/words

Belshazzar
Belshazzaresque
hazzan
hizz
humbuzz
huzz
huzza
huzzard
photomezzotype
unhuzzaed
whizzer
whizzerman
whizziness
whizzing
whizzingly
whizzle


In [82]:
%%bash
egrep -i 'z.*z.*z' /usr/share/dict/words

zizz
Zyzzogeton


In [23]:
%%bash --no-raise-error
egrep -i 'z.*z.\K*z' /usr/share/dict/words
echo "DONE"

DONE


In [11]:
%%bash
egrep -in '\d\d' FishPie.txt

16:cook fish for 10 mins in 1/2 pint milk in a frying pan
22:place pie in middle shelf of oven and cook for 20 mins at 170C


In [19]:
%%bash
egrep -in '\d.\d' FishPie.txt

9:1/2 pint milk for white sauce
16:cook fish for 10 mins in 1/2 pint milk in a frying pan
22:place pie in middle shelf of oven and cook for 20 mins at 170C
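
A note on portability: '\d' is a Perl-style shorthand for a digit that POSIX does not define for extended regular expressions, so whether "egrep" accepts it depends on the implementation. The bracket forms '[0-9]' and '[[:digit:]]' are the portable spellings. The sketch below uses four made-up lines loosely modelled on the recipe output above, not the real FishPie.txt, and `grep -E` in place of "egrep".

```shell
# Hypothetical recipe lines; the real FishPie.txt is not reproduced here.
recipe=$(printf '%s\n' \
  '1/2 pint milk for white sauce' \
  'cook fish for 10 mins' \
  'place pie in middle shelf' \
  'cook for 20 mins at 170C')

# Two consecutive digits, written portably (-n prefixes the line number).
two_digits=$(printf '%s\n' "$recipe" | grep -En '[0-9][0-9]')
posix_class=$(printf '%s\n' "$recipe" | grep -En '[[:digit:]]{2}')

# A digit, any one character, then a digit: matches '1/2' and '170'.
digit_any_digit=$(printf '%s\n' "$recipe" | grep -En '[0-9].[0-9]')

printf 'two digits:\n%s\n' "$two_digits"            # lines 2 and 4
printf 'digit-any-digit:\n%s\n' "$digit_any_digit"  # lines 1 and 4
```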
