![regular expressions](./images/cover.png)
# <a id='toc1_'></a>[Regular Expressions with Python](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Regular Expressions with Python](#toc1_)    
  - [What are Regular Expressions](#toc1_1_)    
  - [GREP](#toc1_2_)    
  - [Regular Expressions in Python | The "re" library](#toc1_3_)    
  - [Notation in RegEx](#toc1_4_)    
    - [Special characters](#toc1_4_1_)    
    - [Raw Strings](#toc1_4_2_)    
  - [Basic functions](#toc1_5_)    
    - [Defining a working string](#toc1_5_1_)    
    - [re.search( ) and re.findall( ) | Search anywhere](#toc1_5_2_)    
    - [re.match( ) | Search the beginning of the string](#toc1_5_3_)    
    - [re.fullmatch ( ) | Checks if the entire string is a match for the regex](#toc1_5_4_)    
    - [re.split( ) | Splits the string using the regex as a separator](#toc1_5_5_)    
    - [re.compile( ) | Create a compiled regex object](#toc1_5_6_)    
    - [re.sub( ) | Replace strings (substitute)](#toc1_5_7_)    
      - [Match ReGex and replace with literal strings](#toc1_5_7_1_)    
      - [Match ReGex and replace with another ReGex](#toc1_5_7_2_)    
  - [The "Match" Object](#toc1_6_)    
  - [Regex Syntax](#toc1_7_)    
    - [Find a match at the start of a string (^) & re.MULTILINE](#toc1_7_1_)    
    - [Find a match at the end of a string ($)](#toc1_7_2_)    
    - [Ignore the letter case | re.IGNORECASE](#toc1_7_3_)    
    - [Match any character with a wildcard (.) except newline (\n) or empty string ("")](#toc1_7_4_)    
    - [Using repetition ({ } + *)](#toc1_7_5_)    
    - [Match any variant in a set ([ ])](#toc1_7_6_)    
    - [Character Classes ([0-9], \d, [A-z], \w, etc.)](#toc1_7_7_)    
    - [Negation of a set [^ ]](#toc1_7_8_)    
    - [Logical OR ```|```](#toc1_7_9_)    
  - [Capturing Groups ```( )```](#toc1_8_)    
  - [Applying RegEx to Logs](#toc1_9_)    
    - [About the Data](#toc1_9_1_)    
    - [Checking out the data](#toc1_9_2_)    


## <a id='toc1_1_'></a>[What are Regular Expressions](#toc0_)

A **regular expression**, or RegEx / regex, is a pattern of characters that is passed to an engine, which interprets the pattern and searches for it in a text. For example, suppose you have a log of residential addresses and you want to extract the postal code from each one of them. RegEx lets you search not only for an exact sequence of characters, like "_W1S 4DZ_", but also for something more generalizable that represents variations of a sequence, or a pattern, such as:

>"_three alphanumerical characters, followed by a space, followed by one digit and two letters_".

which, as you will see, can be written (approximately, since ```\w``` also accepts digits and the underscore) as:

>```[A-Z0-9]{3}\s\d\w{2}```
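
As a rough illustration of the pattern above, the sketch below runs it over a few made-up address lines. The addresses and the variable names are assumptions for the example only; the functions used are introduced later in this document.

```python
import re

# Hypothetical sample addresses; only the first two contain a postcode
# in the "three characters, space, digit, two characters" shape.
addresses = [
    "10 Savile Row, London W1S 4DZ",
    "1 Example Street, London N1C 4AG",
    "No postcode on this line",
]

postcode = re.compile(r"[A-Z0-9]{3}\s\d\w{2}")

found_codes = []
for address in addresses:
    found = postcode.search(address)
    # search() returns None when the pattern is absent from the line
    found_codes.append(found.group() if found else None)

print(found_codes)
```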

## <a id='toc1_2_'></a>[GREP](#toc0_)

Most modern programming languages offer some form of functionality to handle regular expressions. The syntax is not exactly the same across languages, since each one has its own flavour, but common threads can be observed across different platforms. These common motifs have roots in the 1950s, when regular expressions were first formalised, and in their later implementation in computing systems. The implementation of RegEx in UNIX systems gave rise to the tool we now know as "**GREP**", which stands for "_**G**lobal search for **R**egular **E**xpression and **P**rint matching lines_".

Nowadays the "grep" tool is available directly from the command-line terminal of Linux-based systems and does exactly what the GREP acronym suggests:

>it looks globally across a text-based file, and returns the lines where the input character pattern is found.

To call the GREP functionality you may use the bash syntax:

```console
grep [-<flag>] <regular_expression> <filename.ext>
```

Have a look at the example below:

![grep](./images/grep.png)

The file ```textfile.txt``` has four sentences. Among the words, it is possible to identify in sentences 1 and 3 a reference to an address in the UK containing a postal code formed by two groups of alphanumeric characters. We then see two possible implementations of a search query using different notations, that return the same result.
The result is the whole sentence returned with the match sequence highlighted.

## <a id='toc1_3_'></a>[Regular Expressions in Python | The "re" library](#toc0_)

The Python ["re" library](https://docs.python.org/3/library/re.html) allows you to access the engine for regular expression matching.
Out of the several functions available in the module, some particularly important ones are:
- ```re.search(<regular_expression>, <string>)``` which returns the first match found in the input string
- ```re.findall(<regular_expression>, <string>)``` which returns all the matches found in the input string
- ```re.split(<regular_expression>, <string>)``` which splits the string wherever a match is found
- ```re.sub(<regular_expression>, <replacement_expression>, <string>)``` which replaces the matched parts of the string

## <a id='toc1_4_'></a>[Notation in RegEx](#toc0_)

### <a id='toc1_4_1_'></a>[Special characters](#toc0_)

Some characters have a special meaning in the context of RegEx, so when you use them in a search pattern the engine interprets them as instructions rather than as literal characters. These special characters are non-alphanumeric and include parentheses (), curly braces {}, the asterisk *, the dot or period mark ., the plus sign +, the dollar sign $ and the circumflex ^, among others. When you literally mean one of these characters, you must place the escape character \ (which is itself a special character) immediately before the special character you want to include in your search pattern.

For example, suppose you are looking for the end of each sentence, so you search for the period mark. Because the period is a special character, instead of "." you should write "\\." in the search pattern.
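
The difference between an escaped and an unescaped period can be sketched as follows. The sample sentence is made up for the illustration; `re.escape()` is a helper from the standard **re** module that adds the backslashes for you.

```python
import re

text = "Version 3.12 costs $5. Is that a lot?"

# Unescaped, the dot is a wildcard and matches every character.
wildcard_hits = re.findall(r".", text)
print(len(wildcard_hits))

# Escaped, it matches only literal periods.
period_hits = re.findall(r"\.", text)
print(period_hits)

# re.escape() adds the backslashes when the text to find
# comes from elsewhere (for example, user input).
literal = re.escape("$5.")
print(literal)
print(re.findall(literal, text))
```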

### <a id='toc1_4_2_'></a>[Raw Strings](#toc0_)
Because of the undesired effects that escape characters can have on how RegEx engines interpret the search pattern, it is highly advisable to **always** use raw strings when specifying your search pattern. Even if this is not strictly required, it is a good habit, so that instead of
```python
search_pattern = "<some_string>"
```
you include an **r** preceding the pattern string:
```python
search_pattern = r"<some_string>"
```
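
To see why the **r** prefix matters, here is a small sketch using the word-boundary token `\b`, which is not otherwise covered in this document and is chosen only because Python also treats `"\b"` as a backspace character in normal strings. The sample text is an assumption.

```python
import re

text = "songs of the sea are a seaside favourite"

# In a normal string, Python itself turns "\b" into a single backspace
# character before the regex engine ever sees it.
print(len("\b"), len(r"\b"))

# Normal string: the engine looks for backspace + "sea" + backspace.
plain_hits = re.findall("\bsea\b", text)
print(plain_hits)

# Raw string: the engine receives \b and treats it as a word boundary.
raw_hits = re.findall(r"\bsea\b", text)
print(raw_hits)
```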

## <a id='toc1_5_'></a>[Basic functions](#toc0_)

### <a id='toc1_5_1_'></a>[Defining a working string](#toc0_)

Let us create a string on which we can try to find character sequences and patterns.

In [1]:
# Prepare a string on which we can do regex matching
string = """Pirates and sailors, together with songs of the sea are still a favourite theme among many.
The excerpt below is from the sea shanty: "Blow the Man Down"

> She was round in the counter and bluff in the bow,
> Way aye blow the man down
> So I took in all sail and cried, "Way enough now."
> Give me some time to blow the man down!

According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this
song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song
sits at #3 out the 12 as one of the most popular sea songs of all time.
"""

print(string)

Pirates and sailors, together with songs of the sea are still a favourite theme among many.
The excerpt below is from the sea shanty: "Blow the Man Down"

> She was round in the counter and bluff in the bow,
> Way aye blow the man down
> So I took in all sail and cried, "Way enough now."
> Give me some time to blow the man down!

According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this
song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song
sits at #3 out the 12 as one of the most popular sea songs of all time.



### <a id='toc1_5_2_'></a>[re.search( ) and re.findall( ) | Search anywhere](#toc0_)

Let's get to know the functions in the "**re**" library by trying to find exact sequences of letters, such as _"http"_.

In [2]:
import re

regex = r"http"      #note the initial "r" declaration
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))

# Note that:
#  re.search returns a match object referring to the first match
#  re.findall returns a list of all the matching strings found.

Output from re.search:  <re.Match object; span=(345, 349), match='http'>
Output from re.findall:  ['http', 'http']


In [3]:
# Make a function that emulates grep and returns the lines where matches are found

def get_line(reg_exp: str, string: str) -> list:
    '''Search each line of the input string for a user-defined regular expression.
    Returns a list with all the lines where a match occurs.
    If no matches are found, returns an empty list.
    '''
    out_lst = []

    for line in string.split("\n"):
        # re.search returns None when the line does not contain the pattern
        if re.search(reg_exp, line) is not None:
            out_lst.append(line)

    return out_lst

In [4]:
get_line(regex,string)
# Note that the output lists every line that contains "http" anywhere in it.

['According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this',
 'song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song']

### <a id='toc1_5_3_'></a>[re.match( ) | Search the beginning of the string](#toc0_)

In [5]:
print("Output from re.match: ", re.match(regex, string))
# re.match only searches at the beginning of the string, so the output here is None

Output from re.match:  None


In [6]:
# Change the regex to get a match.
print("Output from re.match: ", re.match("Pirates", string))

Output from re.match:  <re.Match object; span=(0, 7), match='Pirates'>


### <a id='toc1_5_4_'></a>[re.fullmatch ( ) | Checks if the entire string is a match for the regex](#toc0_)

In [7]:
print("Output from re.fullmatch: ", re.fullmatch(regex, string))
# The sequence "http" is only a part of the string and does not match the entire string, so the output is None

Output from re.fullmatch:  None


In [8]:
# Change the regex and the input string to get a full match.
print("Output from re.fullmatch: ", re.fullmatch(r"P[a-z]*s", "Pirates"))

Output from re.fullmatch:  <re.Match object; span=(0, 7), match='Pirates'>


### <a id='toc1_5_5_'></a>[re.split( ) | Splits the string using the regex as a separator](#toc0_)

In [9]:
re.split(regex,string)
# Note that the output is a list of 3 elements.
# the "http" sequence disappeared

['Pirates and sailors, together with songs of the sea are still a favourite theme among many.\nThe excerpt below is from the sea shanty: "Blow the Man Down"\n\n> She was round in the counter and bluff in the bow,\n> Way aye blow the man down\n> So I took in all sail and cried, "Way enough now."\n> Give me some time to blow the man down!\n\nAccording to ',
 's://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this\nsong date from the 1860s, based on 1987 publication. According to ',
 '://www.ranker.com the song\nsits at #3 out the 12 as one of the most popular sea songs of all time.\n']

The re.split() function works very similarly to the str.split() method.
The biggest difference is that:
- re.split() allows you to use regular expressions as the separator, whereas
- str.split() only accepts a literal separator string - no expressions allowed.

In [10]:
re.split(r"\n",string)

['Pirates and sailors, together with songs of the sea are still a favourite theme among many.',
 'The excerpt below is from the sea shanty: "Blow the Man Down"',
 '',
 '> She was round in the counter and bluff in the bow,',
 '> Way aye blow the man down',
 '> So I took in all sail and cried, "Way enough now."',
 '> Give me some time to blow the man down!',
 '',
 'According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this',
 'song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song',
 'sits at #3 out the 12 as one of the most popular sea songs of all time.',
 '']

In [11]:
string.split("\n")

['Pirates and sailors, together with songs of the sea are still a favourite theme among many.',
 'The excerpt below is from the sea shanty: "Blow the Man Down"',
 '',
 '> She was round in the counter and bluff in the bow,',
 '> Way aye blow the man down',
 '> So I took in all sail and cried, "Way enough now."',
 '> Give me some time to blow the man down!',
 '',
 'According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this',
 'song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song',
 'sits at #3 out the 12 as one of the most popular sea songs of all time.',
 '']
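
The two calls above give the same result because the separator is a single literal character. The difference between re.split() and str.split() only shows up when the separator varies, as in this sketch (the sample record is made up for the illustration):

```python
import re

record = "pirates, sailors;shanties ,  songs"

# str.split() only understands one literal separator at a time.
literal_parts = record.split(",")
print(literal_parts)

# re.split() can treat "a comma or a semicolon, with optional spaces
# around it" as a single separator.
regex_parts = re.split(r"\s*[,;]\s*", record)
print(regex_parts)
```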

### <a id='toc1_5_6_'></a>[re.compile( ) | Create a compiled regex object](#toc0_)

When your RegEx is rather complex and used many times in a script, it can be useful to compile that expression first into an object that can be reused.

In [12]:
regex_comp = re.compile(r"HTTP", re.IGNORECASE)

print("Output from re.search: ",regex_comp.search(string))
print("Output from re.findall: ",regex_comp.findall(string))

Output from re.search:  <re.Match object; span=(345, 349), match='http'>
Output from re.findall:  ['http', 'http']


In [13]:
# Using the search or findall methods of the regex object is equivalent to
# passing regex_comp as a parameter to the re.search or re.findall functions.

print("Output from re.search: ", re.search(regex_comp,string))
print("Output from re.findall: ", re.findall(regex_comp,string))

Output from re.search:  <re.Match object; span=(345, 349), match='http'>
Output from re.findall:  ['http', 'http']
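
A typical use of a compiled object is inside a loop, where the same pattern is applied to many inputs. The sketch below assumes a handful of hypothetical lines. The **re** module also caches recently used patterns internally, so the gain is often more about readability than speed.

```python
import re

# Compile once, before the loop; "http\S+" is the same idea as the
# "htt[^\s]+" group used elsewhere in this document.
url_re = re.compile(r"http\S+")

# Hypothetical lines, e.g. read from a file one at a time.
lines = [
    "According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest",
    "no link on this line",
    "According to http://www.ranker.com the song",
]

urls = []
for line in lines:
    found = url_re.search(line)   # reuse the compiled object
    if found:
        urls.append(found.group())

print(urls)
```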


### <a id='toc1_5_7_'></a>[re.sub( ) | Replace strings (substitute)](#toc0_)

#### <a id='toc1_5_7_1_'></a>[Match ReGex and replace with literal strings](#toc0_)

In [14]:
# Replace the URLs with the names of the websites.
regex = r"""(htt[^\s]+)(.*\n*.*)(htt[^\s]+)"""  # Group 1: "htt" followed by one or more non-space characters
                                                # Group 2: any characters before and after any possible newlines
                                                # Group 3: a second "htt" followed by one or more non-space characters


print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
print("")
repl = r'"Wikipedia"\2"Ranker"'     # Keeps group 2 in the middle, but substitutes groups 1 and 3 with literal strings.
result = re.sub(regex,repl,string)
print(result)
re.purge()

Output from re.search:  <re.Match object; span=(345, 512), match='https://en.wikipedia.org/wiki/Blow_the_Man_Down t>
Output from re.findall:  [('https://en.wikipedia.org/wiki/Blow_the_Man_Down', ' the earliest references to this\nsong date from the 1860s, based on 1987 publication. According to ', 'http://www.ranker.com')]

Pirates and sailors, together with songs of the sea are still a favourite theme among many.
The excerpt below is from the sea shanty: "Blow the Man Down"

> She was round in the counter and bluff in the bow,
> Way aye blow the man down
> So I took in all sail and cried, "Way enough now."
> Give me some time to blow the man down!

According to "Wikipedia" the earliest references to this
song date from the 1860s, based on 1987 publication. According to "Ranker" the song
sits at #3 out the 12 as one of the most popular sea songs of all time.



#### <a id='toc1_5_7_2_'></a>[Match ReGex and replace with another ReGex](#toc0_)

In [15]:
# Exchange the position of the URLs in the text
regex = r"(htt[^\s]+)(.*\n.*)(htt[^\s]+)"
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
print("")
result = re.sub(regex,r"\3\2\1",string)
print(result)
re.purge()

Output from re.search:  <re.Match object; span=(345, 512), match='https://en.wikipedia.org/wiki/Blow_the_Man_Down t>
Output from re.findall:  [('https://en.wikipedia.org/wiki/Blow_the_Man_Down', ' the earliest references to this\nsong date from the 1860s, based on 1987 publication. According to ', 'http://www.ranker.com')]

Pirates and sailors, together with songs of the sea are still a favourite theme among many.
The excerpt below is from the sea shanty: "Blow the Man Down"

> She was round in the counter and bluff in the bow,
> Way aye blow the man down
> So I took in all sail and cried, "Way enough now."
> Give me some time to blow the man down!

According to http://www.ranker.com the earliest references to this
song date from the 1860s, based on 1987 publication. According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the song
sits at #3 out the 12 as one of the most popular sea songs of all time.
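
Note that the replacement argument is not itself a regular expression but a template, in which `\1`, `\2`, `\3` refer back to the captured groups. The same idea on a smaller, made-up input, reordering the parts of a date:

```python
import re

log_line = "Backup started 2024-01-31 and finished 2024-02-01"

# Three capturing groups: year, month, day.
iso_date = r"(\d{4})-(\d{2})-(\d{2})"

# The replacement template reuses the groups in a different order.
reordered = re.sub(iso_date, r"\3/\2/\1", log_line)
print(reordered)
```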



## <a id='toc1_6_'></a>[The "Match" Object](#toc0_)

Some functions in the **re** library return a "match" object that encapsulates important parameters concerning a single match.

Using indexing notation, it is possible to access the matched string and subgroups, e.g.:
- ```<match_object_name>[0]``` returns the complete matched string
- ```<match_object_name>[1]``` returns the first group substring
- ```<match_object_name>[2]``` returns the second group substring
- ```<match_object_name>[3]``` returns the third group substring... and so on.

These "match" objects have attributes which are important to know:
- **re**: the regular expression object associated with a given match object.
- **string**: the string that was searched.
- **pos**: the index of the start of the searched substring.
- **endpos**: the index of the end of the searched substring.
- **lastindex**: the integer index of the last matched capturing group, or None if no group was matched.
- **lastgroup**: the name of the last matched capturing group, or None if no group was matched.


In [16]:
pattern = r"(\w+) (\d+)"        # Match a group of word characters, a space, and a group of digits.
text = "There are 42 apples in the basket"
match = re.search(pattern, text)

print("re:", match.re)
print("string:", match.string)
print("pos:", match.pos)
print("endpos:", match.endpos)
print("lastindex:", match.lastindex)
print("lastgroup:", match.lastgroup)
print("--------------")
for i in range(match.lastindex+1):
    print("match[" + str(i) + "] ->", match[i])


re: re.compile('(\\w+) (\\d+)')
string: There are 42 apples in the basket
pos: 0
endpos: 33
lastindex: 2
lastgroup: None
--------------
match[0] -> are 42
match[1] -> are
match[2] -> 42
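
In the output above, `lastgroup` is None because the groups have no names. A minimal sketch of the same search with named groups follows; the names "word" and "count" are assumptions chosen for this example.

```python
import re

# Same pattern as above, but with hypothetical group names "word" and "count".
pattern = r"(?P<word>\w+) (?P<count>\d+)"
text = "There are 42 apples in the basket"
match = re.search(pattern, text)

print(match.lastgroup)        # name of the last matched group
print(match["count"])         # groups can also be indexed by name
print(match.groupdict())      # all named groups as a dictionary
print(match.span())           # (start, end) indices of the whole match
print(match.start("count"), match.end("count"))
```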


## <a id='toc1_7_'></a>[Regex Syntax](#toc0_)

### <a id='toc1_7_1_'></a>[Find a match at the start of a string (^) & re.MULTILINE](#toc0_)

In [17]:
# The ^ character specifies that the pattern must be at the beginning of the string
regex = r"^s"   # match should be:
                    # - located at the start of the string (^)
                    # - the letter s (s)
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
# Note:
# The output is None because the string starts with the word "Pirates", not with an "s".

Output from re.search:  None
Output from re.findall:  []


In [18]:
# To make ^ match at the start of each line, pass re.MULTILINE as a flag
print("Output from re.search: ", re.search(regex,string,re.MULTILINE))
print("Output from re.findall: ", re.findall(regex,string,re.MULTILINE))
get_line(regex,string)

# Note:
# The get_line() function retrieved the two lines that start with the letter s.
# It does not need re.MULTILINE because it searches one line at a time.

Output from re.search:  <re.Match object; span=(425, 426), match='s'>
Output from re.findall:  ['s', 's']


['song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song',
 'sits at #3 out the 12 as one of the most popular sea songs of all time.']

### <a id='toc1_7_2_'></a>[Find a match at the end of a string ($)](#toc0_)

In [19]:
regex = r"time\.$"      # "time" followed by a literal period (\.) at the end of the string ($)
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex,string)


Output from re.search:  <re.Match object; span=(588, 593), match='time.'>
Output from re.findall:  ['time.']


['sits at #3 out the 12 as one of the most popular sea songs of all time.']
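
As with ^, the behaviour of $ changes with re.MULTILINE. The sketch below uses a short made-up string to show this, and also what happens when the period is left unescaped.

```python
import re

verse = "first line.\nsecond line!\nthird line.\n"

# Without flags, $ only matches at the very end of the string
# (or just before a trailing newline).
end_of_string = re.findall(r"line\.$", verse)
print(end_of_string)

# With re.MULTILINE, $ also matches at the end of every line.
end_of_lines = re.findall(r"line\.$", verse, re.MULTILINE)
print(end_of_lines)

# Leaving the dot unescaped makes it a wildcard, so "line!" matches too.
unescaped = re.findall(r"line.$", verse, re.MULTILINE)
print(unescaped)
```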

### <a id='toc1_7_3_'></a>[Ignore the letter case | re.IGNORECASE](#toc0_)

In [20]:
regex = r"DOWN"
print("Output from re.search: ", re.search(regex,string,re.IGNORECASE))
print("Output from re.findall: ", re.findall(regex,string,re.IGNORECASE))

Output from re.search:  <re.Match object; span=(148, 152), match='Down'>
Output from re.findall:  ['Down', 'down', 'down', 'Down']


### <a id='toc1_7_4_'></a>[Match any character with a wildcard (.) except newline (\n) or empty string ("")](#toc0_)

In [21]:
regex = r"s.... " # match an "s" (s), followed by any four characters (....), followed by a space ( )
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(6, 12), match='s and '>
Output from re.findall:  ['s and ', 'songs ', 'still ', 's one ', 'songs ']


['Pirates and sailors, together with songs of the sea are still a favourite theme among many.',
 'sits at #3 out the 12 as one of the most popular sea songs of all time.']

### <a id='toc1_7_5_'></a>[Using repetition ({ } + *)](#toc0_)
Sometimes you may want to indicate that a given character, or character class, should repeat itself.
Regex syntax offers three alternatives:
- **{}** the preceding character repeats within specific limits
- **+** the preceding character occurs at least once, and may or may not repeat itself.
- __*__ the preceding character may or may not occur, and may or may not repeat itself.

When using limits, you can:
 - specify the exact number of repetitions, e.g. {5}
 - specify a range, e.g. {5,10}
 - specify a maximum number of repetitions, e.g. {,10}, or
 - specify a minimum number of repetitions, e.g. {5,}
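
Before applying these to the working string, the quantifiers can be sketched on a short made-up input. The example borrows `\d` (any digit, covered further down) and `\b` (a word boundary, an assumption not otherwise covered here) so that each match lines up with a whole number.

```python
import re

numbers = "7 42 365 2024 86400"

# \b (word boundary) keeps each match aligned to a whole number.
exactly_two = re.findall(r"\b\d{2}\b", numbers)     # exactly 2 digits
two_to_three = re.findall(r"\b\d{2,3}\b", numbers)  # between 2 and 3 digits
four_or_more = re.findall(r"\b\d{4,}\b", numbers)   # 4 digits or more
one_or_more = re.findall(r"\b\d+\b", numbers)       # 1 digit or more
zero_or_more = re.findall(r"ab*", "a ab abbb")      # "a" plus zero or more "b"

print(exactly_two, two_to_three, four_or_more, one_or_more, zero_or_more)
```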


In [22]:
print(string)

Pirates and sailors, together with songs of the sea are still a favourite theme among many.
The excerpt below is from the sea shanty: "Blow the Man Down"

> She was round in the counter and bluff in the bow,
> Way aye blow the man down
> So I took in all sail and cried, "Way enough now."
> Give me some time to blow the man down!

According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this
song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song
sits at #3 out the 12 as one of the most popular sea songs of all time.



In [23]:
regex = r"sea.*"              # matching string starts with "sea",
                            # followed by any character (.),
                            # repeated as many times as possible until the next newline
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
print(len(re.findall(regex,string)))

Output from re.search:  <re.Match object; span=(48, 91), match='sea are still a favourite theme among many.'>
Output from re.findall:  ['sea are still a favourite theme among many.', 'sea shanty: "Blow the Man Down"', 'sea songs of all time.']
3


In [24]:
regex = r"s.{4} "           # matching string starts with an "s",
                            # followed by any character (.),
                            # repeated exactly 4 times ({4}), and
                            # ends with a space
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
print(len(re.findall(regex,string)))

Output from re.search:  <re.Match object; span=(6, 12), match='s and '>
Output from re.findall:  ['s and ', 'songs ', 'still ', 's one ', 'songs ']
5


In [25]:
regex = r"s.{,4} "          # matching string starts with an "s",
                            # followed by any character (.),
                            # repeated a maximum of 4 times ({,4}), and
                            # ends with a space
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
print(len(re.findall(regex,string)))

Output from re.search:  <re.Match object; span=(6, 12), match='s and '>
Output from re.findall:  ['s and ', 's, ', 'songs ', 'sea ', 'still ', 's ', 'sea ', 's ', 'sail ', 'some ', 'st ', 's to ', 'song ', 's, ', 'sed ', 'sits ', 's one ', 'st ', 'sea ', 'songs ']
20


In [26]:
regex = r"s.{4,} "          # matching string starts with an "s",
                            # followed by any character (.),
                            # repeated a minimum of 4 times ({4,}), and
                            # ends with a space
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
print(len(re.findall(regex,string)))

Output from re.search:  <re.Match object; span=(6, 86), match='s and sailors, together with songs of the sea are>
Output from re.findall:  ['s and sailors, together with songs of the sea are still a favourite theme among ', 's from the sea shanty: "Blow the Man ', 's round in the counter and bluff in the ', 'sail and cried, "Way enough ', 'some time to blow the man ', 's://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to ', 'song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the ', 'sits at #3 out the 12 as one of the most popular sea songs of all ']
8


In [27]:
regex = r"s[ea]*"          # matching string starts with "s",
                           # followed by zero or more characters that are either "e" or "a"
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
print(len(re.findall(regex,string)))

# Note:
# With the * sign, the possible matches include:
# - "s" (no "e" or "a" after the "s"),
# - "sa" or "se" (one character from the set),
# - "sea" (two characters from the set)... etc.

Output from re.search:  <re.Match object; span=(6, 7), match='s'>
Output from re.findall:  ['s', 'sa', 's', 's', 's', 'sea', 's', 's', 'sea', 's', 's', 'sa', 's', 's', 's', 's', 's', 's', 's', 'se', 's', 's', 's', 's', 's', 'sea', 's', 's']
28


In [28]:
regex = r"s[ea]+"          # matching string starts with an "s",
                           # followed by one or more characters that are either "e" or "a"
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
print(len(re.findall(regex,string)))

# Note:
# With the + sign, at least one "e" or "a" must follow the "s".
# A lone "s" is therefore no longer a match,
# and only 6 of the 28 matches found with * remain.

Output from re.search:  <re.Match object; span=(12, 14), match='sa'>
Output from re.findall:  ['sa', 'sea', 'sea', 'sa', 'se', 'sea']
6


### <a id='toc1_7_6_'></a>[Match any variant in a set ([ ])](#toc0_) [&#8593;](#toc0_)

In [29]:
# Specifying case independence for a particular character in the pattern
regex = r"[Dd]own"       # Match either "Down" or "down"
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(148, 152), match='Down'>
Output from re.findall:  ['Down', 'down', 'down', 'Down']


['The excerpt below is from the sea shanty: "Blow the Man Down"',
 '> Way aye blow the man down',
 '> Give me some time to blow the man down!',
 'According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this']

In [30]:
# Beyond different letter cases, different characters
regex = r"s[aeo].{3}"       # Match starts with an "s", followed by "a", "e" or "o", followed by any character 3 times.
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(12, 17), match='sailo'>
Output from re.findall:  ['sailo', 'songs', 'sea a', 'sea s', 'sail ', 'some ', 'song ', 'sed o', 'sea s']


['Pirates and sailors, together with songs of the sea are still a favourite theme among many.',
 'The excerpt below is from the sea shanty: "Blow the Man Down"',
 '> So I took in all sail and cried, "Way enough now."',
 '> Give me some time to blow the man down!',
 'song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song',
 'sits at #3 out the 12 as one of the most popular sea songs of all time.']

### <a id='toc1_7_7_'></a>[Character Classes ([0-9], \d, [A-z], \w, etc.)](#toc0_) [&#8593;](#toc0_)

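Characters can be divided into classes, such as:
- ranges of digits: ```[0-9]```, ```[1-4]```, etc.
- ranges of letters: ```[a-z]```, ```[A-Z]```, ```[A-Za-z]```, or any other range such as ```[B-D]```. Note that ```[A-z]``` is not equivalent to ```[A-Za-z]```: it also matches the punctuation characters that sit between "Z" and "a" in the ASCII table, such as ```_``` and ```^```
- combinations of letters and digits: ```[A-Z0-9]```, ```[a-z0-9]```, ```[A-Za-z0-9]```
- digits ```\d``` and non-digits ```\D```
- white spaces ```\s``` and non-white spaces ```\S```
- word characters ```\w``` and non-word characters ```\W```
- unions of subsets: ```[a-cx-z]``` (the nested form ```[a-c[x-z]]``` used by some other regex flavours is not supported by Python's **re** module)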
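
The range ```[A-z]``` deserves some care. The sketch below, using a made-up sample word, shows which extra characters it lets through compared with ```[A-Za-z]```.

```python
import re

sample = "snake_case"

# [A-z] spans ASCII codes 65 to 122, which includes "_" (code 95).
loose = re.fullmatch(r"[A-z]+", sample)
print(loose)

# [A-Za-z] lists the two letter ranges explicitly, so "_" is rejected.
strict = re.fullmatch(r"[A-Za-z]+", sample)
print(strict)

# The characters matched by [A-z] that are not letters.
extras = [chr(code) for code in range(ord("A"), ord("z") + 1)
          if not chr(code).isalpha()]
print(extras)
```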


In [31]:
regex = r"\d{4}"        # Four digits
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(444, 448), match='1860'>
Output from re.findall:  ['1860', '1987']


['song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song']

In [32]:
regex = r"\d{3}[0-2]"        # Three digits, terminating in either 0,1 or 2
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(444, 448), match='1860'>
Output from re.findall:  ['1860']


['song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song']

In [33]:
regex = r"[A-z0-9]{4}s" 	# Four characters from [A-z0-9] (letters, digits and a few punctuation marks such as "_") followed by an "s"
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(2, 7), match='rates'>
Output from re.findall:  ['rates', 'ilors', 'songs', 'https', 'rlies', 'ences', '1860s', 'songs']


['Pirates and sailors, together with songs of the sea are still a favourite theme among many.',
 'According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this',
 'song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song',
 'sits at #3 out the 12 as one of the most popular sea songs of all time.']

In [34]:
regex = r"\W{3}"    # Three consecutive non-word characters
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(132, 135), match=': "'>
Output from re.findall:  [': "', '"\n\n', ',\n>', '\n> ', ', "', '."\n', '!\n\n', '://', '://']


['The excerpt below is from the sea shanty: "Blow the Man Down"',
 '> So I took in all sail and cried, "Way enough now."',
 'According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this',
 'song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song']

In [35]:
regex = r"[h-lp-t]{4,5}"    # a set of 4 to 5 letters between h and l or between p and t
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(56, 61), match='still'>
Output from re.findall:  ['still', 'https', 'ikip', 'this', 'http', 'sits']


['Pirates and sailors, together with songs of the sea are still a favourite theme among many.',
 'According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this',
 'song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song',
 'sits at #3 out the 12 as one of the most popular sea songs of all time.']

### <a id='toc1_7_8_'></a>[Negation of a set [^ ]](#toc0_)

In [36]:
regex = r"http[^s]"        # "http" followed by any single character other than "s"
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(491, 496), match='http:'>
Output from re.findall:  ['http:']


['song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song']

In [37]:
regex = r"\d{4}[^\s]"        # Four digits followed by any single non-whitespace character
print("Output from re.search: ", re.search(regex,string))
print("Output from re.findall: ", re.findall(regex,string))
get_line(regex, string)
# Note:
# 1860 and 1987 appear in the same sentence,
# but 1987 is followed by a space, whereas 1860 is followed by the letter "s".
# Only "1860s" is a match, because the character after the four digits must not be whitespace.

Output from re.search:  <re.Match object; span=(444, 449), match='1860s'>
Output from re.findall:  ['1860s']


['song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song']
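
A negated set still has to consume one character, which matters at the end of a string. The sketch below, on a made-up sample string, compares `[^\s]` with its shorthand `\S` and with a negative lookahead `(?!\s)`, which only checks what comes next without consuming it.

```python
import re

# Made-up sample: "2024" sits at the very end, with no character after it
text = "the 1860s, 1987 publication, 2024"

negated_set = re.findall(r"\d{4}[^\s]", text)  # needs a fifth, non-whitespace character
shorthand = re.findall(r"\d{4}\S", text)       # \S is shorthand for [^\s]
lookahead = re.findall(r"\d{4}(?!\s)", text)   # only asserts "no whitespace comes next"

print(negated_set)  # ['1860s']
print(shorthand)    # ['1860s']
print(lookahead)    # ['1860', '2024']
```

The lookahead version also returns "2024" because nothing follows it, and it leaves the trailing "s" out of the "1860" match.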

In [38]:
# Recall that ^ also serves as an anchor to the start of a string/line
regex = r"^\W [A-z]"        # Lines that start with a non-word character,
                            # followed by a space
                            # followed by a letter
                            
print("Output from re.search: ", re.search(regex,string, re.MULTILINE))
print("Output from re.findall: ", re.findall(regex,string, re.MULTILINE))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(155, 158), match='> S'>
Output from re.findall:  ['> S', '> W', '> S', '> G']


['> She was round in the counter and bluff in the bow,',
 '> Way aye blow the man down',
 '> So I took in all sail and cried, "Way enough now."',
 '> Give me some time to blow the man down!']

In [39]:
# But ^ means NEGATION only when it is the first character inside square brackets
regex = r"^\W [A-z][^a-h]"          # Lines that start with a non-word character,
                                    # followed by a space
                                    # followed by a letter
                                    # followed by any character which is not in the interval [a-h]
print("Output from re.search: ", re.search(regex,string, re.MULTILINE))
print("Output from re.findall: ", re.findall(regex,string, re.MULTILINE))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(236, 240), match='> So'>
Output from re.findall:  ['> So', '> Gi']


['> So I took in all sail and cried, "Way enough now."',
 '> Give me some time to blow the man down!']

### <a id='toc1_7_9_'></a>[Logical OR ```|```](#toc0_)


In [40]:
regex = r"[a-z]+\.|[0-9]+"        # one or more lowercase letters followed by a period, OR one or more digits
print("Output from re.search: ", re.search(regex,string, re.MULTILINE))
print("Output from re.findall: ", re.findall(regex,string, re.MULTILINE))
get_line(regex, string)

Output from re.search:  <re.Match object; span=(86, 91), match='many.'>
Output from re.findall:  ['many.', 'now.', 'en.', 'wikipedia.', '1860', '1987', 'publication.', 'www.', 'ranker.', '3', '12', 'time.']


['Pirates and sailors, together with songs of the sea are still a favourite theme among many.',
 '> So I took in all sail and cried, "Way enough now."',
 'According to https://en.wikipedia.org/wiki/Blow_the_Man_Down the earliest references to this',
 'song date from the 1860s, based on 1987 publication. According to http://www.ranker.com the song',
 'sits at #3 out the 12 as one of the most popular sea songs of all time.']
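
Note that `|` splits the whole pattern into alternatives, not just the characters next to it. To restrict the choice to part of a pattern, wrap it in a group. The sketch below uses a made-up sample and a non-capturing group `(?: )` so that `re.findall` still returns the full match.

```python
import re

text = "gray grey"  # made-up sample

ungrouped = re.findall(r"gra|ey", text)    # reads as: "gra" OR "ey"
grouped = re.findall(r"gr(?:a|e)y", text)  # reads as: "gr", then "a" OR "e", then "y"

print(ungrouped)  # ['gra', 'ey']
print(grouped)    # ['gray', 'grey']
```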

## <a id='toc1_8_'></a>[Capturing Groups ```( )```](#toc0_)

In [41]:
string = "Lovelace, Ada"
regex = r"^(\w*), (\w*)$"
result = re.search(regex,string)
print(result)

<re.Match object; span=(0, 13), match='Lovelace, Ada'>


In [42]:
result.groups()

('Lovelace', 'Ada')
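
Individual groups can also be read by position, given names, or reused in a replacement string. The following is a small sketch on the same string; the group names `last` and `first` are chosen here for illustration only.

```python
import re

string = "Lovelace, Ada"

# Groups by position: group(0) is the whole match, group(1) the first pair of parentheses
result = re.search(r"^(\w*), (\w*)$", string)
whole, last_name, first_name = result.group(0), result.group(1), result.group(2)

# Named groups: (?P<name>...) makes the result self-describing
named = re.search(r"^(?P<last>\w+), (?P<first>\w+)$", string)
as_dict = named.groupdict()

# Backreferences in re.sub: \1 and \2 refer to the captured groups
swapped = re.sub(r"^(\w+), (\w+)$", r"\2 \1", string)

print(last_name, first_name)  # Lovelace Ada
print(as_dict)                # {'last': 'Lovelace', 'first': 'Ada'}
print(swapped)                # Ada Lovelace
```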

## <a id='toc1_9_'></a>[Applying RegEx to Logs](#toc0_)

### <a id='toc1_9_1_'></a>[About the Data](#toc0_)

To practise regular expressions on real data, we will use a log file produced by "Thunderbird", a supercomputer that in 2006 ranked #6 in the TOP500 list of the most powerful computers.

The file is available in [this GitHub repository](https://github.com/logpai/loghub/blob/master/Thunderbird/Thunderbird_2k.log) and contains 2000 entries. It is a small sample of the original dataset of over 211 million events, which was used in a [2007 paper by Oliner & Stearley](https://www.computer.org/csdl/proceedings-article/dsn/2007/28550575/12OmNB1eJDn) that aimed at predicting hardware failures through the analysis of computer logs.

The complete dataset is available [here](https://www.usenix.org/cfdr-data#hpc4).

### <a id='toc1_9_2_'></a>[Checking out the data](#toc0_)

In [85]:
# Load example data where each line is an element of a list
with open("./data/Thunderbird_2k.log", "r") as f:   # forward slashes avoid invalid escapes such as "\d"
    log_file = f.readlines()

# Get the first 10 elements of the log_file list
log_file[0:10]

['- 1131566461 2005.11.09 dn228 Nov 9 12:01:01 dn228/dn228 crond(pam_unix)[2915]: session closed for user root\n',
 '- 1131566461 2005.11.09 dn228 Nov 9 12:01:01 dn228/dn228 crond(pam_unix)[2915]: session opened for user root by (uid=0)\n',
 '- 1131566461 2005.11.09 dn228 Nov 9 12:01:01 dn228/dn228 crond[2916]: (root) CMD (run-parts /etc/cron.hourly)\n',
 '- 1131566461 2005.11.09 dn261 Nov 9 12:01:01 dn261/dn261 crond(pam_unix)[2907]: session closed for user root\n',
 '- 1131566461 2005.11.09 dn261 Nov 9 12:01:01 dn261/dn261 crond(pam_unix)[2907]: session opened for user root by (uid=0)\n',
 '- 1131566461 2005.11.09 dn261 Nov 9 12:01:01 dn261/dn261 crond[2908]: (root) CMD (run-parts /etc/cron.hourly)\n',
 '- 1131566461 2005.11.09 dn3 Nov 9 12:01:01 dn3/dn3 crond(pam_unix)[2907]: session closed for user root\n',
 '- 1131566461 2005.11.09 dn3 Nov 9 12:01:01 dn3/dn3 crond(pam_unix)[2907]: session opened for user root by (uid=0)\n',
 '- 1131566461 2005.11.09 dn3 Nov 9 12:01:01 dn3/dn3 cron

The log features the word _**crond**_, the name of a process that is short for "Cron Daemon":
>- **"Cron"**: a job-scheduling utility in UNIX-based systems, used to run tasks automatically at specific times. Typical uses are basic, repetitive tasks such as scheduling backups or collecting system logs periodically.
>- **"Daemon"**: in Unix-like systems, a program that runs as a background process rather than under the direct control of an interactive user. Traditionally, the process name of a daemon ends with the letter "d" to distinguish it from a normal program.
>- **"pam_unix"**: attached to _crond_ to form _crond(pam_unix)_, it indicates that the crond daemon is using the **pam_unix** module for authentication. When a user or process interacts with the crond daemon, its credentials are checked using the traditional Unix authentication methods provided by the pam_unix module.

So these logs seem to be mostly associated with scheduled tasks run automatically via **crond**.

Every time _crond_ runs, it gives rise to a process bearing its name, to which a unique identifier (ID) is assigned and shown between square brackets, e.g. [2915].

Each entry ends with a short text message describing the event that the process raised, such as "session closed for user root".
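
The structure just described (process name, optional module in parentheses, process ID in square brackets, message) can be captured in one pattern. The sketch below applies it to two entries copied from the output above; the name `LOG_PATTERN` and the group names are assumptions made for illustration, and the layout is inferred only from the sample lines shown.

```python
import re

# Hypothetical pattern for the tail of a log entry
LOG_PATTERN = re.compile(
    r"(?P<process>[\w.]+)"       # process name, e.g. crond
    r"(?:\((?P<module>\w+)\))?"  # optional module in parentheses, e.g. (pam_unix)
    r"\[(?P<pid>\d+)\]: "        # process ID between square brackets
    r"(?P<message>.*)"           # free-text message
)

sample_lines = [
    "- 1131566461 2005.11.09 dn228 Nov 9 12:01:01 dn228/dn228 crond(pam_unix)[2915]: session closed for user root",
    "- 1131566461 2005.11.09 dn228 Nov 9 12:01:01 dn228/dn228 crond[2916]: (root) CMD (run-parts /etc/cron.hourly)",
]

fields = [LOG_PATTERN.search(line).groupdict() for line in sample_lines]
for entry in fields:
    print(entry)
```

A group that did not take part in the match, such as `module` in the second entry, is reported as `None`.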

### Which daemon processes are being called in the logs?

In [143]:
import re
regex = r"(\w{1,10}d)[\(\w\)]*\[\d*\]" # extract the name of the processes that end with "d"

daemon_lst = []
non_daemon_lst = []
for line in log_file:
    result = re.findall(regex,line)
    if len(result) != 0 and result[0] not in daemon_lst:
        daemon_lst.append(result[0])
    elif len(result) == 0:
        non_daemon_lst.append(line)

daemon_lst


['crond',
 'gmetad',
 'ntpd',
 'sshd',
 'send',
 'xinetd',
 'snmpd',
 'statd',
 'smartd']
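
The `send` entry is worth a second look. The class `[\(\w\)]*` accepts any run of word characters after the final "d", so a longer process name that merely contains a "d" can be cut short. The sketch below shows the effect on a made-up `sendmail` line (an assumption for illustration; it is not copied from the log) and compares it with a hypothetical stricter variant that only allows a parenthesised module before the brackets.

```python
import re

loose = r"(\w{1,10}d)[\(\w\)]*\[\d*\]"     # pattern used in the cell above
strict = r"(\w{1,10}d)(?:\(\w+\))?\[\d*\]"  # hypothetical stricter variant

samples = [
    "crond(pam_unix)[2915]: session closed for user root",   # shortened from the log
    "crond[2916]: (root) CMD (run-parts /etc/cron.hourly)",  # shortened from the log
    "sendmail[1234]: example message",                       # made up for illustration
]

loose_hits = [re.findall(loose, s) for s in samples]
strict_hits = [re.findall(strict, s) for s in samples]

print(loose_hits)   # [['crond'], ['crond'], ['send']]
print(strict_hits)  # [['crond'], ['crond'], []]
```

With the stricter variant, a name that does not end in "d" falls through to the non-daemon list instead of being reported under a truncated name.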

In [121]:
non_daemon_lst
# Processes not ending with "d" are listed in the non-daemon list of processes (good!)
# It seems that the "period" punctuation mark is allowed in process names (e.g. ib_sm.x[24904])

['- 1131566470 2005.11.09 tbird-sm1 Nov 9 12:01:10 src@tbird-sm1 ib_sm.x[24904]: [ib_sm_sweep.c:1831]: ********************** NEW SWEEP ********************\n',
 '- 1131566474 2005.11.09 tbird-sm1 Nov 9 12:01:14 src@tbird-sm1 ib_sm.x[24904]: [ib_sm_sweep.c:1455]: No topology change\n',
 '- 1131566474 2005.11.09 tbird-sm1 Nov 9 12:01:14 src@tbird-sm1 ib_sm.x[24904]: [ib_sm_sweep.c:1482]: No configuration change required\n',
 '- 1131566484 2005.11.09 tbird-sm1 Nov 9 12:01:24 src@tbird-sm1 ib_sm.x[24904]: [ib_sm_sweep.c:1831]: ********************** NEW SWEEP ********************\n',
 '- 1131566488 2005.11.09 tbird-sm1 Nov 9 12:01:28 src@tbird-sm1 ib_sm.x[24904]: [ib_sm_sweep.c:1455]: No topology change\n',
 '- 1131566488 2005.11.09 tbird-sm1 Nov 9 12:01:28 src@tbird-sm1 ib_sm.x[24904]: [ib_sm_sweep.c:1482]: No configuration change required\n',
 '- 1131566498 2005.11.09 tbird-sm1 Nov 9 12:01:38 src@tbird-sm1 ib_sm.x[24904]: [ib_sm_sweep.c:1831]: ********************** NEW SWEEP **********

In [144]:
# Include the period punctuation mark as part of the valid set of characters for process names:
regex = r"([\w\.]{1,10}d)[\(\w\)]*\[\d*\]" # extract the name of the processes

daemon_lst = []
non_daemon_lst = []
for line in log_file:
    result = re.findall(regex,line)
    if len(result) != 0:
        daemon_lst.append(result[0])
    elif len(result) == 0:
        non_daemon_lst.append(line)

daemon_set = set(daemon_lst)
daemon_set

# Note: the set now includes rpc.statd (previously captured as "statd")


{'crond',
 'gmetad',
 'ntpd',
 'rpc.statd',
 'send',
 'smartd',
 'snmpd',
 'sshd',
 'xinetd'}

In [145]:
print("{} different daemon process names were identified in the logs.".format(len(daemon_set)))
print("Out of {} log messages, {} were issued through a non-daemon process.".format(len(log_file),len(non_daemon_lst)))

9 different daemon process names were identified in the logs.
Out of 2000 log messages, 449 were issued through a non-daemon process.


### How many times does each daemon process occur in the log?

In [165]:
import pandas as pd

df_daemons = pd.DataFrame(columns=["Daemon","Counts","Frequency"])
for i, daemon_name in enumerate(list(daemon_set)):
    df_daemons.loc[i,"Daemon"] = daemon_name
    df_daemons.loc[i,"Counts"] = daemon_lst.count(daemon_name)
    df_daemons.loc[i,"Frequency"] = daemon_lst.count(daemon_name)/len(log_file)

df_daemons.sort_values("Frequency",ascending=False,inplace =True)
df_daemons

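| | Daemon | Counts | Frequency |
|---|---|---|---|
| 4 | gmetad | 830 | 0.415 |
| 8 | ntpd | 571 | 0.2855 |
| 5 | crond | 62 | 0.031 |
| 7 | xinetd | 36 | 0.018 |
| 1 | send | 35 | 0.0175 |
| 2 | sshd | 14 | 0.007 |
| 0 | rpc.statd | 1 | 0.0005 |
| 3 | smartd | 1 | 0.0005 |
| 6 | snmpd | 1 | 0.0005 |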
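
The same tally can be obtained in a single pass with `collections.Counter` from the standard library. This is a sketch on a made-up list standing in for `daemon_lst`; the names `sample_names` and `total_lines` are hypothetical.

```python
from collections import Counter

# Made-up stand-in for daemon_lst (one process name per matching log line)
sample_names = ["gmetad", "ntpd", "gmetad", "crond", "gmetad", "ntpd"]
total_lines = 10  # hypothetical size of the log, including non-daemon lines

counts = Counter(sample_names)

# most_common() returns (name, count) pairs sorted by decreasing count
summary = [(name, n, n / total_lines) for name, n in counts.most_common()]

for name, n, freq in summary:
    print(name, n, freq)
```

Unlike calling `list.count` once per name, `Counter` reads the list only once, which matters for the full dataset more than for this 2000-line sample.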


### What are the messages of the most frequent process ID?

In [170]:
# Find out which process ID is the most frequent
regex = r"[\w\.]{1,10}d[\(\w\)]*\[(\d*)\]" # extract the process ID of the processes

pid_lst = []
for line in log_file:
    result = re.findall(regex,line)
    if len(result) != 0:
        pid_lst.append(result[0])

df_pid = pd.DataFrame(columns=["PID","Counts","Frequency"])
for i, pid in enumerate(list(set(pid_lst))):
    df_pid.loc[i,"PID"] = pid
    df_pid.loc[i,"Counts"] = pid_lst.count(pid)
    df_pid.loc[i,"Frequency"] = pid_lst.count(pid)/len(log_file)

df_pid.sort_values("Counts",ascending=False,inplace =True)
df_pid

# Note: the process with PID 1682 accounts for about 40% of the log entries

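| | PID | Counts | Frequency |
|---|---|---|---|
| 89 | 1682 | 799 | 0.3995 |
| 344 | 1798 | 35 | 0.0175 |
| 104 | 1691 | 31 | 0.0155 |
| 108 | 1815 | 5 | 0.0025 |
| 380 | 14256 | 4 | 0.002 |
| ... | ... | ... | ... |
| 210 | 24262 | 1 | 0.0005 |
| 209 | 14363 | 1 | 0.0005 |
| 208 | 2921 | 1 | 0.0005 |
| 207 | 30260 | 1 | 0.0005 |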


In [171]:
regex = r"\[1682\](.*)"

msg_lst = []
for line in log_file:
    result = re.findall(regex,line)
    if len(result) != 0:
        msg_lst.append(result[0])

msg_lst

[': data_thread() got not answer from any [Thunderbird_A8] datasource',
 ': data_thread() got not answer from any [Thunderbird_B8] datasource',
 ': data_thread() got not answer from any [Thunderbird_C5] datasource',
 ': data_thread() got not answer from any [Thunderbird_B7] datasource',
 ': data_thread() got not answer from any [Thunderbird_A4] datasource',
 ': data_thread() got not answer from any [Thunderbird_B4] datasource',
 ': data_thread() got not answer from any [Thunderbird_C8] datasource',
 ': data_thread() got not answer from any [Thunderbird_B3] datasource',
 ': data_thread() got not answer from any [Thunderbird_D5] datasource',
 ': data_thread() got not answer from any [Thunderbird_A1] datasource',
 ': data_thread() got not answer from any [Thunderbird_C2] datasource',
 ': data_thread() got not answer from any [Thunderbird_B1] datasource',
 ': data_thread() got not answer from any [Thunderbird_A3] datasource',
 ': data_thread() got not answer from any [Thunderbird_A5] datas
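
To repeat this lookup for other process IDs, the pattern can be built from a variable. The sketch below wraps the idea in a hypothetical helper, `messages_for_pid`, and runs it on two shortened, illustrative lines; it also consumes the leading ": " so that only the message text is returned.

```python
import re

def messages_for_pid(lines, pid):
    """Return the message part of every line logged under the given process ID."""
    # re.escape keeps the PID literal even if it ever contained special characters
    pattern = re.compile(r"\[" + re.escape(str(pid)) + r"\]: (.*)")
    return [m.group(1) for m in map(pattern.search, lines) if m]

# Shortened, illustrative lines (timestamp and host fields omitted)
sample_lines = [
    "gmetad[1682]: data_thread() got not answer from any [Thunderbird_A8] datasource\n",
    "crond[2916]: (root) CMD (run-parts /etc/cron.hourly)\n",
]

found = messages_for_pid(sample_lines, 1682)
print(found)
```

Because `.` does not match a newline by default, the trailing "\n" left by `readlines()` is not included in the captured message.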

### Filter list for entries that do not match the typical pattern

In [173]:
regex = r"\[Thunderbird\w{3}\]"   # typical pattern, e.g. [Thunderbird_A8] (\w also matches the underscore)
strange_msg_lst = []
for msg in msg_lst:
    result = re.findall(regex,msg)
    if len(result) == 0:
        strange_msg_lst.append(msg)

strange_msg_lst

[': RRD_update (/var/lib/ganglia/rrds/C Nodes/cn304/pkts_out.rrd): illegal attempt to update using time 1131563037 when last update time is 1131563037 (minimum one second step)',
 ': RRD_update (/var/lib/ganglia/rrds/D Nodes/dn731/pkts_out.rrd): illegal attempt to update using time 1131563089 when last update time is 1131563089 (minimum one second step)',
 ': RRD_update (/var/lib/ganglia/rrds/D Nodes/dn731/pkts_out.rrd): illegal attempt to update using time 1131563117 when last update time is 1131563117 (minimum one second step)',
 ': RRD_update (/var/lib/ganglia/rrds/D Nodes/dn731/pkts_out.rrd): illegal attempt to update using time 1131563149 when last update time is 1131563149 (minimum one second step)',
 ': RRD_update (/var/lib/ganglia/rrds/D Nodes/dn731/pkts_out.rrd): illegal attempt to update using time 1131563319 when last update time is 1131563319 (minimum one second step)',
 ': RRD_update (/var/lib/ganglia/rrds/D Nodes/dn731/pkts_out.rrd): illegal attempt to update using time 1