# A survival kit for dealing with Regular Expressions in Python

This introduction is heavily inspired by [this tutorial](https://scotch.io/tutorials/an-introduction-to-regex-in-python) by [Jee Gikera](https://scotch.io/@jee). You can find many cheatsheets for writing regular expressions online, for instance [here](https://www.debuggex.com/cheatsheet/regex/python).

## What are regular expressions?
A regular expression is a sequence of characters that defines a search pattern for finding text. This "search engine" is embedded within the Python programming language (and many other languages as well) and made available through the `re` module.

In [1]:
import re

## Main use cases in data analysis in biology

- Parsing the path to recording files
- Reading the content of data files when it does not make sense, or is impossible, to use a dedicated Python library
- Finding basic patterns in DNA/RNA/Amino-Acid sequences

### Regular expressions can be extremely useful but also a bit tricky in the beginning
![facepalm](https://miro.medium.com/max/400/0*eiqUk8yDrfMCGwsN.jpg)

## Regular expression syntax
### Ordinary characters
Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves.

E.g. the pattern "ABC" only matches the text "ABC".

### Special characters
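Special characters do not match themselves. Instead, they stand for classes of characters or change how the expression around them is interpreted: `.`, `^`, `$`, `*`, `+`, `?`, `{`, `}`, `[`, `]`, `\`, `|`, `(`, and `)`. To match one of them literally, escape it with a backslash (e.g. `\.` matches a dot). The best practice is to systematically refer to a cheatsheet unless you have a supernatural Beth-Harmon-level instinct for regular expressions.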
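
As a minimal sketch of what escaping changes in practice, the cell below compares an unescaped and an escaped dot. The file name is made up for illustration, and `re.escape` is a helper from the `re` module that is not otherwise covered in this introduction.

```python
import re

# Hypothetical file name, used only for illustration
file_name = "recording_v1.5.csv"

# Unescaped, "." matches any character, so "1.5" also matches "1x5"
unescaped_hit = bool(re.search(r"1.5", "recording_v1x5"))

# Escaped, "\." only matches a literal dot
escaped_hit = bool(re.search(r"1\.5", "recording_v1x5"))
literal_hit = bool(re.search(r"1\.5", file_name))

# re.escape adds the backslashes for you
escaped_text = re.escape("v1.5")
auto_hit = bool(re.search(escaped_text, file_name))

print(unescaped_hit, escaped_hit, literal_hit, auto_hit)
```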

![regexp first time](https://i.redd.it/iywcj7vuieg21.png)


A few examples from [wikipedia](https://en.wikipedia.org/wiki/Regular_expression) and this [cheatsheet](https://www.debuggex.com/cheatsheet/regex/python):

- `a|b`	a or b
- `a*`	0 or more a's
- `a+`  1 or more a's
- `a?`  0 or 1 a's
- `\d`  one digit
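- `\s`  one whitespace character (space, tab, newline, ...)
- `\w`  one alphanumeric character or underscore
- `{2}`  exactly 2 repetitions of the preceding element
- `{2,5}`  2 to 5 repetitions of the preceding element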
- `[A-D]` A to D (A, B, C or D)
- `[3-5]` 3 to 5 (3, 4 or 5)
- `(...)` Capturing group
- `^` Start of the string
- `$` End of the string
- `.` Any character except newline

A few examples using special characters:
- `(a|b){2}` a or b two times ("aa", "ab", "ba" or "bb")
- `^[3-5].*`   A digit between 3 and 5 at the beginning of the string, followed by 0 or more characters of any kind except newline (e.g. "4 is a beautiful integer")
- `\d{6,8}`  6 to 8 digits
- `[a-z]+`   One or more lowercase letters
- `\d_[a-e]` A digit, followed by an underscore, followed by a lowercase letter between a and e (e.g. "3_d").
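
One way to check examples like these is `re.fullmatch`, which only succeeds when the whole string fits the pattern. The cell below is an illustrative sketch; the candidate strings are made up.

```python
import re

# (pattern, candidate string) pairs to test
checks = [
    (r"(a|b){2}", "ab"),
    (r"(a|b){2}", "abc"),                        # one character too many
    (r"^[3-5].*", "4 is a beautiful integer"),
    (r"\d{6,8}", "1234567"),
    (r"[a-z]+", "Hello"),                        # "H" is not lowercase
    (r"\d_[a-e]", "3_d"),
]

# re.fullmatch returns a Match object or None; bool() turns that into True/False
results = [bool(re.fullmatch(pattern, text)) for pattern, text in checks]

for (pattern, text), ok in zip(checks, results):
    print(f"{pattern!r:15} {text!r:30} {ok}")
```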

## Using the re module

### Using re.match and re.search
`re.match(pattern, string)` checks for a match only at the beginning of the string, while `re.search(pattern, string)` checks for a match anywhere in the string.

If the pattern matches, the output is a `re.Match` object, which always has a boolean value of `True`; otherwise `None` is returned:

In [3]:
# Raw string (r"...") so that the backslash reaches the regex engine unchanged
pattern = r"\d{6,8}"

if re.match(pattern, "123456"):
    print("I found the pattern !")
else:
    print("The input string does not match the pattern")

I found the pattern !


In [4]:
# Note that re.match only looks at the beginning of the string
pattern = r"\d{6,8}"

if re.match(pattern, "a_123456"):
    print("I found the pattern !")
else:
    print("The input string does not match the pattern")

The input string does not match the pattern


In [5]:
# ..whereas re.search looks everywhere
pattern = r"\d{6,8}"

if re.search(pattern, "a_123456"):
    print("I found the pattern !")
else:
    print("The input string does not match the pattern")

I found the pattern !


The resulting `re.Match` object has several built-in methods, for instance for finding the start and end indices of a match or the matched text itself:

In [6]:
my_string = "Cats are beautiful creatures"

# b followed by one or more alphanumerical characters
pattern = r"b\w+"

# create a re.Match object
my_match = re.search(pattern, my_string)

# Index of the start and end of the match
start_idx, end_idx = my_match.span()
print(my_string[start_idx: end_idx])
# Show the match directly
print(my_match.group())

beautiful
beautiful


#### Optional exercise:
You can practice your regexp skills on [this](https://regexone.com/) website.

#### Exercise: 
You obtained the content of one of your folders containing experimental data (using for instance the [pathlib](https://realpython.com/python-pathlib/) module). 

From the list of files, keep the `.csv` files whose name starts with 8 digits (corresponding to a date) and exclude those which include `test` in their name:

In [12]:
file_names = ["20200112_dark_test_01.csv", "20200112_dark_test_01.avi", ".tmp_20200112_dark_test_01.csv",
              "20200112_light_test.csv", "20200112_light_test.avi", "20200112_dark.avi", "20200112_dark.csv",
              "20200112_light_02.avi", "20200112_light_02.csv", "metadata.csv", "20200113_dark_test_02.csv",
              "20200113_dark_test_02.avi", "20200113_dark_test_02.avi", "20200113_dark_03.csv",
              "20200113_dark_03.avi", "20200113_csv", "backup_csv.json"]

pattern = r"\d{8}.*\.csv$"
pattern_test = "test"

matching_names = []
for f in file_names:
    if re.match(pattern, f):
        if not re.search(pattern_test, f):
            matching_names.append(f)
matching_names

['20200112_dark.csv', '20200112_light_02.csv', '20200113_dark_03.csv']

![how you feel](https://i.imgflip.com/2x54jq.jpg)

## A few tricks
### Compiling regular expressions for improving speed
`re.compile(pattern)` compiles a regular expression pattern into a regular expression object. This object has `match()` and `search()` methods that behave like the functions discussed above, as well as `findall()`, which is introduced in the next section. Compiling can also save time when the same pattern is reused many times, since the pattern string only has to be parsed once.

**Example:**

In [7]:
pattern = re.compile('Python')

result = pattern.findall('Pythonistas are programmers that use Python, which is an easy-to-learn and powerful language.')
print(result)

find = pattern.findall('Python is easy to learn')
print(find)

['Python', 'Python']
['Python']
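
The compiled object can also be used with `match()` and `search()`. The cell below is a small sketch of that, reusing one compiled pattern on several strings; the file names are made up for illustration.

```python
import re

# Compile once, reuse many times
date_pattern = re.compile(r"\d{8}")

# Hypothetical file names
names = ["20200112_dark.csv", "metadata.csv", "backup_20200113.csv"]

# match(): the 8 digits must be at the beginning of the name
starts_with_date = [n for n in names if date_pattern.match(n)]

# search(): the 8 digits may be anywhere in the name
contains_date = [n for n in names if date_pattern.search(n)]

print(starts_with_date)
print(contains_date)
```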


### Find all occurrences of a pattern
`re.findall(pattern, string)` returns a list of all non-overlapping occurrences of a pattern in a string:

**Example:**

In [8]:
# Find the occurrences of groups of one or more alphanumerical characters
my_string = "42 is the Answer to the Ultimate Question of Life, the Universe, and Everything"
re.findall(r"\w+", my_string)

['42',
 'is',
 'the',
 'Answer',
 'to',
 'the',
 'Ultimate',
 'Question',
 'of',
 'Life',
 'the',
 'Universe',
 'and',
 'Everything']
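
The same idea applies to sequence data. The cell below is a minimal sketch that looks for the EcoRI recognition site `GAATTC` in a made-up DNA string. It also uses `re.finditer`, a related function not covered above, which yields `re.Match` objects so that the position of each hit is available.

```python
import re

# Made-up DNA sequence, for illustration only
dna = "ATGCGAATTCTTAGGAATTCGGCTA"

# EcoRI recognition site (ordinary characters only, so it matches itself)
motif = r"GAATTC"

# findall returns the matched substrings
sites = re.findall(motif, dna)

# finditer yields Match objects, from which the start index can be read
positions = [m.start() for m in re.finditer(motif, dna)]

print(sites)
print(positions)
```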

### Substitute a string if a pattern occurs
`re.sub(pattern, replacement, string)` returns a copy of the string in which every occurrence of the pattern is replaced by the replacement. If the pattern does not occur, the string is returned unchanged.

**Example:**

In [9]:
result = re.sub('python', 'regexp', 'I love learning python')
print(result)

I love learning regexp
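
Substitution becomes more useful once the pattern contains special characters. The cell below is a small sketch on a made-up string: it collapses runs of whitespace into a single space, then replaces an 8-digit date with a placeholder.

```python
import re

# Made-up string with irregular whitespace (spaces and a tab)
messy = "sample_01   dark \t 20200112"

# Replace every run of one or more whitespace characters by a single space
clean = re.sub(r"\s+", " ", messy)

# Replace any block of exactly 8 consecutive digits by a placeholder
masked = re.sub(r"\d{8}", "<date>", clean)

print(clean)
print(masked)
```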


### Using groupdict to extract variables from a string
The following syntax can be used to give a name to a substring matching an expression:

`(?P<Y>...)`	Capturing group named Y

The `groupdict` method then returns the matched substrings as a dictionary keyed by group name:

In [10]:
pattern = r"(?P<date>\d{6})_test\.(?P<format>[a-z]+)"
target = "191201_test.csv"

my_match = re.search(pattern, target)
# If the pattern was found
if my_match:
    print(my_match.groupdict())
# If the pattern was not found
else:
    print("Pattern not found")

{'date': '191201', 'format': 'csv'}
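
When only one or two values are needed, named groups can also be read individually. The cell below is a sketch of two equivalent ways to do so with the same pattern and target string: `group("name")` and indexing the match object (available in Python 3.6 and later).

```python
import re

pattern = r"(?P<date>\d{6})_test\.(?P<format>[a-z]+)"
my_match = re.search(pattern, "191201_test.csv")

if my_match:
    # Access a single named group
    date = my_match.group("date")
    # Indexing the Match object is equivalent
    file_format = my_match["format"]
    # groups() returns all captured groups as a tuple, in order
    all_groups = my_match.groups()
    print(date, file_format, all_groups)
```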


#### Exercise: Parse the content of the following strings with the same regexp

In [17]:
example1 = "190192    MEINGVEIEDTFAEAFEAKMARVLITAASHKWAMIAVKEATGFGTSVIMCPAEAGIDCYVPPEETPDGRP 70\n"
example2 = "28892     YVGEEDFGIVKGVAGGNFFVMGENQMAALMGAQAAVDAIAGVGGVITSFPIVASGSKVGKYKFMASTNEK 210\n"
# We expect for example1:
# {'organism_id': '190192',
#  'sequence': 'MEINGVEIEDTFAEAFEAKMARVLITAASHKWAMIAVKEATGFGTSVIMCPAEAGIDCYVPPEETPDGRP',
#  'end_position': '70'}
# and for example2:
# {'organism_id': '28892',
#  'sequence': 'YVGEEDFGIVKGVAGGNFFVMGENQMAALMGAQAAVDAIAGVGGVITSFPIVASGSKVGKYKFMASTNEK',
#  'end_position': '210'}

pattern = r"(?P<organism_id>\d+)\s+(?P<sequence>\w+)\s+(?P<end_position>\d+)\n"

my_match = re.search(pattern, example2)
my_match.groupdict()


{'organism_id': '28892',
 'sequence': 'YVGEEDFGIVKGVAGGNFFVMGENQMAALMGAQAAVDAIAGVGGVITSFPIVASGSKVGKYKFMASTNEK',
 'end_position': '210'}
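
To confirm that one regexp handles both strings, the pattern can be compiled once and applied in a loop. The cell below is one possible sketch; the variable names `line_pattern` and `parsed` are chosen here for illustration.

```python
import re

example1 = "190192    MEINGVEIEDTFAEAFEAKMARVLITAASHKWAMIAVKEATGFGTSVIMCPAEAGIDCYVPPEETPDGRP 70\n"
example2 = "28892     YVGEEDFGIVKGVAGGNFFVMGENQMAALMGAQAAVDAIAGVGGVITSFPIVASGSKVGKYKFMASTNEK 210\n"

# Same pattern as above, compiled once for reuse
line_pattern = re.compile(
    r"(?P<organism_id>\d+)\s+(?P<sequence>\w+)\s+(?P<end_position>\d+)\n"
)

parsed = []
for line in (example1, example2):
    line_match = line_pattern.search(line)
    # search returns None when a line does not fit the pattern
    if line_match is not None:
        parsed.append(line_match.groupdict())

for record in parsed:
    print(record["organism_id"], record["end_position"])
```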