# Python for testers - Patterns

# Shell patterns

## Shell pattern matching

* Shell patterns can contain placeholders:
  * `'*'` = any number of characters, including none
  * `'?'` = exactly 1 character
  * `'[amz]'` = 'a', 'm' or 'z'
  * `'[a-z]'` = any single character between 'a' and 'z' (inclusive)
* Use the `fnmatch` module to match strings against a shell pattern
* Use the `glob` module to find the files in a folder whose names match a shell pattern
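The slides show `fnmatch` in action but not `glob`, so here is a minimal sketch of the latter. The helper `find_matching_files()` and the file names are made up for illustration, and the files are created in a temporary folder so the example does not depend on what is on your disk.

```python
import glob
import os
import tempfile


def find_matching_files(folder, pattern):
    """Return the sorted base names of the files in `folder` matching `pattern`."""
    # glob.escape() protects special characters that might occur in the folder name.
    paths = glob.glob(os.path.join(glob.escape(folder), pattern))
    return sorted(os.path.basename(path) for path in paths)


# Build a throwaway folder containing a few empty example files.
with tempfile.TemporaryDirectory() as folder:
    for name in ('build.sh', 'deploy.sh', 'data1.csv', 'notes.txt'):
        open(os.path.join(folder, name), 'w').close()
    shell_scripts = find_matching_files(folder, '*.sh')
    numbered_data = find_matching_files(folder, 'data[0-9].csv')

print(shell_scripts)  # ['build.sh', 'deploy.sh']
print(numbered_data)  # ['data1.csv']
```

Unlike `fnmatch()`, which only compares strings, `glob.glob()` actually looks at the file system and returns the paths it finds.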

## Shell pattern examples

In [146]:
from fnmatch import fnmatch
fnmatch('some.sh', '*.sh')    # * = any number of characters, including none

True

In [147]:
fnmatch('data123.csv', 'data???.csv')  # ? = exactly 1 character

True

In [148]:
fnmatch('data123.csv', 'data[0-9][0-9][0-9].csv')  # [0-9] = 1 digit

True

## Shell pattern case sensitivity

Case sensitivity depends on the platform:

In [151]:
fnmatch('some.sh', '*.SH')  # False under Linux, True under Windows

False
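If a test has to behave the same on every platform, one option is `fnmatch.fnmatchcase()`, which always compares case sensitively. The following sketch shows that, plus a small hypothetical helper `matches_ignoring_case()` (not part of the standard library) for the opposite case.

```python
from fnmatch import fnmatchcase


def matches_ignoring_case(name, pattern):
    """Match `name` against `pattern` case insensitively on every platform."""
    # Hypothetical helper: lower-case both sides before comparing.
    return fnmatchcase(name.lower(), pattern.lower())


case_sensitive_result = fnmatchcase('some.sh', '*.SH')
case_insensitive_result = matches_ignoring_case('some.sh', '*.SH')

print(case_sensitive_result)    # False, regardless of platform
print(case_insensitive_result)  # True, regardless of platform
```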

# Regular expressions

> Some people, when confronted with a problem,
> think "I know, I'll use regular expressions."
> Now they have two problems.

Jamie Zawinski

## About regular expressions

* Yes, Python can do regular expressions!
* Useful for simple parsing and pattern matching.
* Difficult to understand and maintain.


## Pattern matching

In [158]:
import re
if re.match(r'.+\.sh', 'some.sh'):  # roughly like fnmatch('some.sh', '*.sh'), but only anchored at the start
    print("it's a shell script!")

it's a shell script!


In [156]:
print(re.match(r'.+\.sh', 'data.xml'))  # no match

None
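One thing to watch out for: `re.match()` only requires the pattern to match at the beginning of the string, so trailing text is silently accepted. The sketch below compares it with `re.fullmatch()`, which requires the whole string to match; the file name `some.sh.bak` is just an example.

```python
import re

pattern = r'.+\.sh'

# re.match() is satisfied as soon as the start of the string matches.
prefix_match = re.match(pattern, 'some.sh.bak')
print(prefix_match.group())  # some.sh

# re.fullmatch() requires the pattern to cover the entire string.
whole_match = re.fullmatch(pattern, 'some.sh.bak')
print(whole_match)  # None

exact_match = re.fullmatch(pattern, 'some.sh')
print(exact_match.group())  # some.sh
```

An alternative with the same effect is to end the pattern with `$`, as in `r'.+\.sh$'`.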


## Compiled regular expressions

In [161]:
sh_regex = re.compile(r'.+\.sh')
if sh_regex.match('some.sh'):  # skips the pattern cache lookup that re.match() does on every call
    print("it's a shell script")

it's a shell script


* Compile once, match often
* Slightly faster when matching in loops (`re.match()` caches compiled patterns but has to look them up on every call)
* Easier to read because regex gets a variable name
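As a small illustration of "compile once, match often", the following sketch filters a made-up list of file names with a single compiled pattern. It uses `fullmatch()` so that names like `deploy.sh.bak` are rejected.

```python
import re

# Compile once, outside the loop.
sh_regex = re.compile(r'.+\.sh')

names = ['build.sh', 'data.xml', 'deploy.sh', 'deploy.sh.bak', 'README']

# Match often: reuse the compiled pattern for every name.
shell_scripts = [name for name in names if sh_regex.fullmatch(name)]

print(shell_scripts)  # ['build.sh', 'deploy.sh']
```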

## Placeholders (excerpt)

* `.` = any single character
* `.+` = at least one character
* `.*` = any number of characters, including none
* `\.` = a single dot (literally)
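* `^` = start of the string (start of each line with `re.MULTILINE`)
* `$` = end of the string, or just before a newline at its end (end of each line with `re.MULTILINE`)
* `\s` = a single whitespace character
* `[a-z]` = any lowercase letter from 'a' to 'z'
* `[^a-z]` = any character except a lowercase letter from 'a' to 'z'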
* `(abc|xy)` = 'abc' or 'xy'
* more: https://docs.python.org/3/library/re.html#regular-expression-syntax
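To make the list above more tangible, here is a short sketch that checks each placeholder against a sample text. The patterns and texts are invented for this example; `re.fullmatch()` is used so that the whole text has to match.

```python
import re

# (pattern, text, whether the pattern is expected to match the whole text)
examples = [
    (r'a.c', 'abc', True),       # . = any single character
    (r'a.c', 'ac', False),       # . needs exactly one character
    (r'a.+c', 'ac', False),      # .+ needs at least one character
    (r'a.*c', 'ac', True),       # .* also accepts no character at all
    (r'1\.5', '1.5', True),      # \. = a literal dot
    (r'1\.5', '125', False),
    (r'a\sb', 'a b', True),      # \s = a whitespace character
    (r'[a-z]', 'q', True),
    (r'[a-z]', 'Q', False),
    (r'[^a-z]', 'Q', True),
    (r'(abc|xy)', 'xy', True),
    (r'(abc|xy)', 'ab', False),
]

# Collect every example where the actual result differs from the expectation.
mismatches = [
    (pattern, text)
    for pattern, text, expected in examples
    if (re.fullmatch(pattern, text) is not None) != expected
]
print(mismatches)  # []

# ^ and $ refer to the whole string unless re.MULTILINE is set.
lines = 'one\ntwo'
whole_string_only = re.findall(r'^\w+$', lines)
per_line = re.findall(r'^\w+$', lines, re.MULTILINE)
print(whole_string_only)  # []
print(per_line)           # ['one', 'two']
```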

## Extracting parts from a string

Use the `(?P<groupname>pattern)` notation to extract the text matching `pattern` into a group named `groupname`:

In [188]:
item_value_regex = re.compile(r'My (?P<item>\w+) is (?P<value>.+)\.')
item_value_match = item_value_regex.match('My name is Alice.')

Once you have found a match, you can access its groups using the `group()` method:

In [189]:
item_value_match.group('item')

'name'

In [163]:
item_value_match.group('value')

'Alice'
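In real test code the text might not match at all, in which case `match()` returns `None` and calling `group()` on it fails. A minimal sketch of how this could be handled, using a hypothetical helper `parse_item_value()` and the `groupdict()` method to get all named groups at once:

```python
import re

item_value_regex = re.compile(r'My (?P<item>\w+) is (?P<value>.+)\.')


def parse_item_value(text):
    """Return a dict with 'item' and 'value', or None if `text` does not match."""
    match = item_value_regex.match(text)
    if match is None:
        return None
    # groupdict() maps each group name to the text it matched.
    return match.groupdict()


parsed = parse_item_value('My name is Alice.')
not_parsed = parse_item_value('Hello world')

print(parsed)      # {'item': 'name', 'value': 'Alice'}
print(not_parsed)  # None
```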

# Lexing

## Pygments lexers

* Lexers read structured text (e.g. source code) and convert it to a stream of tokens (e.g. keywords, numbers, strings, comments).
* The [pygments](http://pygments.org/) library provides many lexers for programming languages and configuration file formats.
* It also has a general [RegexLexer](http://pygments.org/docs/lexerdevelopment/#regexlexer) that provides a sound base to develop your own lexers quickly.
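To illustrate what "converting text to a stream of tokens" means, here is a tiny hand-written lexer that only uses the `re` module. It is a simplified sketch and not how pygments is implemented; the token names and the `tokenize()` function are assumptions made for this example.

```python
import re

# One named group per token type; the first alternative that matches wins.
TOKEN_REGEX = re.compile(r"""
    (?P<number>\d+)
  | (?P<name>[A-Za-z_]\w*)
  | (?P<operator>[-+*/=<>]+)
  | (?P<whitespace>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Yield (token_type, token_value) pairs, skipping whitespace."""
    position = 0
    while position < len(text):
        match = TOKEN_REGEX.match(text, position)
        if match is None:
            raise ValueError(
                'unexpected character %r at position %d' % (text[position], position))
        # lastgroup is the name of the group that matched.
        if match.lastgroup != 'whitespace':
            yield match.lastgroup, match.group()
        position = match.end()


tokens = list(tokenize('total = price * 3'))
for token_type, token_value in tokens:
    print('%-8s - %s' % (token_type, token_value))
```

For anything beyond toy input, a ready-made lexer is usually the better choice because it already deals with strings, comments and nested states.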

## SQL lexer example

In [181]:
from pygments import lexers

sql_code = "select * from customer where date_of_birth >= '1980-01-01'"
sql_lexer = lexers.get_lexer_by_name('sql')
for token_type, token_value in sql_lexer.get_tokens(sql_code):
    print('%-27s - %s' % (token_type, token_value))

Token.Keyword               - select
Token.Text                  -  
Token.Operator              - *
Token.Text                  -  
Token.Keyword               - from
Token.Text                  -  
Token.Name                  - customer
Token.Text                  -  
Token.Keyword               - where
Token.Text                  -  
Token.Name                  - date_of_birth
Token.Text                  -  
Token.Operator              - >
Token.Operator              - =
Token.Text                  -  
Token.Literal.String.Single - '1980-01-01'
Token.Text                  - 



## Regex lexer for C++ comments

In [190]:
from pygments.lexer import RegexLexer
from pygments.token import Comment, Text  # import only the token types used below

class CppCommentLexer(RegexLexer):
    name = 'Example Lexer with states'

    tokens = {
        'root': [
            (r'[^/]+', Text),
            (r'/\*', Comment.Multiline, 'comment'),
            (r'//.*?$', Comment.Singleline),
            (r'/', Text)
        ],
        'comment': [
            (r'[^*/]+', Comment.Multiline),
            (r'/\*', Comment.Multiline, '#push'),
            (r'\*/', Comment.Multiline, '#pop'),
            (r'[*/]', Comment.Multiline)
        ]
    }

Source: http://pygments.org/docs/lexerdevelopment/