# Python for testers - Patterns

# Shell patterns

## Shell pattern matching

* Can use placeholders:
  * `'*'` = any number of characters, including none
  * `'?'` = exactly 1 character
  * `'[amz]'` = 'a', 'm' or 'z'
  * `'[a-z]'` = any single character between 'a' and 'z' (inclusive)
* module `fnmatch` to match specific strings
* module `glob` to match filenames in a folder
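The `glob` module applies the same placeholders to the files in a folder. The following is a minimal sketch; the file names are made up for this example and live in a temporary folder that is removed afterwards.

```python
import glob
import os
import tempfile

# Create a throwaway folder with a few hypothetical files to match against.
with tempfile.TemporaryDirectory() as folder:
    for name in ('data001.csv', 'data002.csv', 'notes.txt'):
        open(os.path.join(folder, name), 'w').close()

    # glob.glob() returns the paths of all files matching the shell pattern.
    # glob.escape() protects special characters in the folder name itself.
    pattern = os.path.join(glob.escape(folder), 'data???.csv')
    csv_names = sorted(os.path.basename(path) for path in glob.glob(pattern))

print(csv_names)
```

Only the two CSV files match, because `notes.txt` does not fit `data???.csv`.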

## Shell pattern examples

In [1]:
from fnmatch import fnmatch
fnmatch('some.sh', '*.sh')    # * = any or none characters

True

In [2]:
fnmatch('data123.csv', 'data???.csv')  # ? = exactly 1 character

True

In [3]:
fnmatch('data123.csv', 'data[0-9][0-9][0-9].csv')  # [0-9] = 1 digit

True

In [4]:
fnmatch('data12345.csv', 'data[0-9][0-9][0-9].csv')  # no match

False

## Shell pattern case sensitivity

Case sensitivity depends on the platform, because `fnmatch()` normalizes both arguments with `os.path.normcase()`:

In [5]:
fnmatch('SOME.SH', '*.sh')  # False under Linux, True under Windows

False

For platform-independent, case-insensitive matches, use a lower case pattern and convert the search string to lower case:

In [6]:
fnmatch('SOME.SH'.lower(), '*.sh')  # True both under Linux and Windows

True
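If the match should instead be case sensitive on every platform, the standard library also offers `fnmatch.fnmatchcase()`, which skips the platform-specific normalization. A short sketch:

```python
from fnmatch import fnmatchcase

# fnmatchcase() never normalizes case, so the result is the same everywhere.
upper_matches = fnmatchcase('SOME.SH', '*.sh')
lower_matches = fnmatchcase('some.sh', '*.sh')
print(upper_matches, lower_matches)
```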

# Regular expressions

> Some people, when confronted with a problem,
> think "I know, I'll use regular expressions."
> Now they have two problems.

Jamie Zawinski

## About regular expressions

* The [re](https://docs.python.org/3/library/re.html) module supports regular expressions.
* Useful for simple parsing and pattern matching.
* Difficult to understand and maintain.


## Pattern matching

In [7]:
import re
if re.match(r'.*\.sh', 'some.sh'):
    # similar to fnmatch.fnmatch('some.sh', '*.sh')
    print("it's a shell script!")

it's a shell script!


In [8]:
print(re.match(r'.*\.sh', 'data.xml'))  # no match

None


## Compiled regular expressions

In [9]:
sh_regex = re.compile(r'.*\.sh')
if sh_regex.match('some.sh'):  # faster than re.match()
    print("it's a shell script")

it's a shell script


* Compile once, match often
* Slightly faster when matching in loops, because `re.match()` has to look up the pattern in an internal cache on every call
* Easier to read because regex gets a variable name
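As a sketch of "compile once, match often", the compiled pattern can be reused for every element of a list. The file names below are hypothetical.

```python
import re

sh_regex = re.compile(r'.*\.sh')  # compile once ...

# Hypothetical file names, made up for this example.
names = ['build.sh', 'data.xml', 'deploy.sh', 'readme.txt']

# ... match often: reuse the compiled pattern for every name.
shell_scripts = [name for name in names if sh_regex.match(name)]
print(shell_scripts)
```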

## Placeholders (excerpt)

* `.` = any single character
* `.+` = at least one character
* `.*` = any number of characters, including none
* `\.` = a single dot (literally)
* `^` = start of the string (with `re.MULTILINE`: start of each line)
* `$` = end of the string (with `re.MULTILINE`: end of each line)
* `\s` = white space
* `[a-z]` = any lower case letter
* `[^a-z]` = anything but a lower case letter
* `(abc|xy)` = 'abc' or 'xy'
* more: https://docs.python.org/3/library/re.html#regular-expression-syntax
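The placeholders above can be tried out with a small table of checks. This sketch uses `re.search()`, which looks for the pattern anywhere in the text; the sample texts are made up for illustration.

```python
import re

# Each entry: (pattern, text, expected result of bool(re.search(pattern, text)))
checks = [
    (r'a.c', 'abc', True),        # . = any single character
    (r'a.+c', 'ac', False),       # .+ needs at least one character
    (r'a.*c', 'ac', True),        # .* also accepts none
    (r'\.sh$', 'some.sh', True),  # literal dot, anchored at the end
    (r'^\s', ' x', True),         # white space at the start
    (r'[^a-z]', 'abc', False),    # text has only lower case letters
    (r'^(abc|xy)$', 'xy', True),  # one of two alternatives
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in checks]
expected = [wanted for _, _, wanted in checks]
print(results == expected)
```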

# Regular expression examples

## Example matches

The `match()` function checks if the pattern matches at the beginning of the string (use `search()` to find it anywhere in the string):

In [10]:
bool(re.match(r'.', 'some.sh'))

True

Use `^` and `$` to anchor the pattern to the start and end of the string. For example, `^.$` only matches a string that consists of exactly one character:

In [11]:
bool(re.match(r'^.$', 'some.sh'))

False
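The difference between matching at the start, anywhere, and across the whole string can be seen side by side in the following sketch. `re.search()` and `re.fullmatch()` are standard `re` functions that the examples here do not otherwise use.

```python
import re

text = 'some.sh'

at_start = bool(re.match(r'\.sh', text))     # '.sh' is not at the start
anywhere = bool(re.search(r'\.sh', text))    # but it occurs later on
whole = bool(re.fullmatch(r'.*\.sh', text))  # the entire string must match
print(at_start, anywhere, whole)
```

`re.fullmatch(pattern, text)` behaves like `re.match()` with the pattern wrapped in `^` and `$`, which can be easier to read.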

## Example matches (continued)

To match a certain number of characters, one might be tempted to use a corresponding number of dots:

In [12]:
bool(re.match(r'^.......$', 'some.sh'))

True

There's a shortcut for this: simply specify the expected number of characters in curly braces:

In [13]:
bool(re.match(r'^.{7}$', 'some.sh'))

True
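Curly braces also accept a range in the form `{min,max}`. As a small sketch with made-up file names, the following pattern accepts script names of two to four lower case letters:

```python
import re

# {2,4} = between 2 and 4 repetitions of the preceding [a-z]
short_name_regex = re.compile(r'^[a-z]{2,4}\.sh$')

results = {name: bool(short_name_regex.match(name))
           for name in ('a.sh', 'some.sh', 'longer.sh')}
print(results)
```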

## Building complex regular expressions 

It can be challenging to build complex regular expressions, in particular if the supposedly correct expression does not match the test data as expected. Unfortunately, a failed match gives no feedback about where exactly the pattern and the string diverge.

An efficient approach for that is to open a Python shell such as IDLE or IPython and build the regular expression bit by bit based on a specific example that is supposed to match.

Example goal: we need a regular expression that matches strings like `'My name is <name>.'` where `<name>` must start with an upper case letter, followed by one or more lower case letters. For example, in `'My name is Alice.'`, `<name>` would be `Alice`.

## Building complex regular expressions (continued)

First, let's build a regular expression that matches the initial part of the string. Note that the pattern string has an `r` before the initial quote in order to keep Python from processing escape sequences:

In [14]:
bool(re.match(r'^My name is ', 'My name is Alice.'))

True

Next, check for the initial upper case letter of the name:

In [15]:
bool(re.match(r'^My name is [A-Z]', 'My name is Alice.'))

True

## Building complex regular expressions (continued)

Now we need a lower case letter:

In [16]:
bool(re.match(r'^My name is [A-Z][a-z]', 'My name is Alice.'))

True

Actually, there can be more than one lower case letter:

In [17]:
bool(re.match(r'^My name is [A-Z][a-z]+', 'My name is Alice.'))

True

## Building complex regular expressions (continued)

Next, there is a literal dot we need to escape with a backslash (`\`):

In [18]:
bool(re.match(r'^My name is [A-Z][a-z]+\.', 'My name is Alice.'))

True

And finally we need to specify with `$` that there may be no further characters after the dot:

In [19]:
bool(re.match(r'^My name is [A-Z][a-z]+\.$', 'My name is Alice.'))

True
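The bit-by-bit approach can also be scripted. The following sketch uses a hypothetical helper, `first_failing_step()`, that appends one pattern part at a time and reports the first part that breaks the match. It assumes that every intermediate pattern is a valid regular expression on its own, so a group must not be split across parts.

```python
import re

def first_failing_step(parts, text):
    """Return the index of the first pattern part that breaks the match,
    or None if the complete pattern matches."""
    pattern = ''
    for index, part in enumerate(parts):
        pattern += part
        if re.match(pattern, text) is None:
            return index
    return None

# The pattern from the steps above, split into its parts.
name_parts = [r'^My name is ', r'[A-Z]', r'[a-z]+', r'\.', r'$']

ok_result = first_failing_step(name_parts, 'My name is Alice.')
bad_result = first_failing_step(name_parts, 'My name is alice.')
print(ok_result, bad_result)
```

For the lower case `alice`, the helper points at index 1, the `[A-Z]` part.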

# Extracting data

## Extracting parts from a string

As an example, consider sentences that look like

  `'My <item> is <value>.'`

For example:
* 'My name is Alice.' --> 'name', 'Alice'
* 'My favorite color is yellow.' --> 'favorite color', 'yellow'

## Extracting parts from a string (continued)

Use the `(?P<groupname>pattern)` notation to extract the text matching `pattern` into a group named `groupname`:

In [20]:
item_value_regex = re.compile(r'My (?P<item>.+) is (?P<value>.+)\.')

item_value_match = item_value_regex.match('My name is Alice.')

Once you have found a match, you can access its groups using the `group()` method:

In [21]:
item_value_match.group('item')

'name'

In [22]:
item_value_match.group('value')

'Alice'
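In practice a sentence might not match at all, in which case `match()` returns `None` and calling `group()` on it would fail. The following sketch wraps the extraction in a hypothetical helper, `extract_item_and_value()`, that handles this case, and also shows `groupdict()`, which returns all named groups at once.

```python
import re

item_value_regex = re.compile(r'My (?P<item>.+) is (?P<value>.+)\.')

def extract_item_and_value(sentence):
    """Return (item, value) for a matching sentence, or None otherwise."""
    match = item_value_regex.match(sentence)
    if match is None:
        return None
    return match.group('item'), match.group('value')

color = extract_item_and_value('My favorite color is yellow.')
nothing = extract_item_and_value('Hello world!')

# groupdict() maps each group name to the text it matched.
name_groups = item_value_regex.match('My name is Alice.').groupdict()
print(color, nothing, name_groups)
```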

## Pygments lexers

* Lexers read a structured text (e.g. source code) and convert it to a stream of tokens (e.g. keyword, number, string, comment etc).
* The [pygments](http://pygments.org/) library provides many lexers for programming languages and configuration file formats.
* It also has a general [RegexLexer](http://pygments.org/docs/lexerdevelopment/#regexlexer) that provides a sound base to develop your own lexers quickly.
* Particularly useful for domain specific languages (DSL).
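To illustrate what "a stream of tokens" means, here is a minimal sketch of a tokenizer that uses only the standard `re` module, not Pygments. The token types and the tiny SQL-like grammar are assumptions made up for this example; a real lexer covers far more cases. Note that `finditer()` silently skips characters that match none of the alternatives.

```python
import re

# Hypothetical token types for a tiny SQL-like language.
token_regex = re.compile(r"""
    (?P<string>'[^']*')       # text in single quotes
  | (?P<name>[A-Za-z_]\w*)    # keywords and identifiers
  | (?P<operator>[*<>=]+)     # a few operators
  | (?P<space>\s+)            # white space
""", re.VERBOSE)

def tokenize(code):
    """Yield (token_type, token_value) pairs, skipping white space."""
    for match in token_regex.finditer(code):
        # lastgroup is the name of the alternative that matched.
        if match.lastgroup != 'space':
            yield match.lastgroup, match.group()

sql_code = "select * from customer where date_of_birth >= '1980-01-01'"
tokens = list(tokenize(sql_code))
for token_type, token_value in tokens:
    print('%-8s - %s' % (token_type, token_value))
```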

## SQL lexer example

In [23]:
from pygments import lexers

sql_code = "select * from customer where date_of_birth >= '1980-01-01'"
sql_lexer = lexers.get_lexer_by_name('sql')
for token_type, token_value in sql_lexer.get_tokens(sql_code):
    print('%-27s - %s' % (token_type, token_value))

Token.Keyword               - select
Token.Text                  -  
Token.Operator              - *
Token.Text                  -  
Token.Keyword               - from
Token.Text                  -  
Token.Name                  - customer
Token.Text                  -  
Token.Keyword               - where
Token.Text                  -  
Token.Name                  - date_of_birth
Token.Text                  -  
Token.Operator              - >
Token.Operator              - =
Token.Text                  -  
Token.Literal.String.Single - '1980-01-01'
Token.Text                  - 



## Regex lexer for C++ comments

In [24]:
from pygments.lexer import RegexLexer
from pygments.token import Comment, Text

class CppCommentLexer(RegexLexer):
    name = 'Example Lexer with states'

    tokens = {
        'root': [
            (r'[^/]+', Text),
            (r'/\*', Comment.Multiline, 'comment'),
            (r'//.*?$', Comment.Singleline),
            (r'/', Text)
        ],
        'comment': [
            (r'[^*/]', Comment.Multiline),
            (r'/\*', Comment.Multiline, '#push'),
            (r'\*/', Comment.Multiline, '#pop'),
            (r'[*/]', Comment.Multiline)
        ]
    }

Source: http://pygments.org/docs/lexerdevelopment/

# Summary

* Use `fnmatch` and `glob` for simple patterns.
* Use regular expressions for more powerful patterns.
* Use regular expressions to extract parts from a string.
* Use lexers to extract parts of complex structured text, e.g. domain specific languages.