## String and Text re

### Splitting Strings on Any of Multiple Delimiters

In [None]:
names = 'Abu bin Ahmad; Osman, Iskandar,Mohammad,   Tan Ah Kau'

# Let's split the line
names.split()

Called without arguments, `str.split()` only splits text on runs of whitespace. We need more than this.

Make use of the standard library module **`re`**.

`re.split()` supports multiple delimiters, expressed as a regular expression pattern.

In [None]:
import re

re.split(r'[;,\s]', names) # \s matches any whitespace character

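This does not handle adjacent delimiters: the `; ` after `Ahmad` and the comma plus multiple spaces after `Mohammad` each leave empty strings in the result.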

Let's try again by adding `\s*` (*zero or more whitespace characters*) after the character set.

In [None]:
re.split(r'[;,\s]\s*', names)
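To compare the two patterns side by side, here is a minimal sketch. The variable names `single` and `greedy` are illustrative only and do not appear elsewhere in this notebook.

```python
import re

names = 'Abu bin Ahmad; Osman, Iskandar,Mohammad,   Tan Ah Kau'

# Splitting on one delimiter character at a time leaves an empty string
# wherever two delimiter characters are adjacent (e.g. '; ' or ',   ').
single = re.split(r'[;,\s]', names)

# Consuming the whitespace that follows as part of the same delimiter
# removes those empty strings.
greedy = re.split(r'[;,\s]\s*', names)

print(f"empty strings with single-character split: {single.count('')}")
print(greedy)
```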

We can use a *capturing group* (a pattern enclosed in parentheses) to include the *matched text* (the delimiters) in the result.

In [None]:
fields = re.split(r'(;|,|\s)\s*', names)
print(fields)

If we want to split the fields only on `;` or `,`, each followed by zero or more whitespace characters, then we remove `\s` from the character set in brackets. Spaces inside a name are then kept.

In [None]:
re.split(r'[;,]\s*', names)

If we need to separate fields into values and delimiters, we can use the following method:

In [None]:
fields = re.split(r'(;|,|\s)\s*', names)
values = fields[::2]
delimiters = fields[1::2]
print(f'values = {values}')
print(f'delimiters = {delimiters}')

We can join the values and delimiters back together, which removes the extra whitespace.

In [None]:
''.join(v+d for v,d in zip(values, delimiters))

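The last value, `Kau`, is missing: `zip()` stops at the shorter sequence, and there is one fewer delimiter than values. Let's try again, padding the delimiters with an empty string.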

In [None]:
''.join(v+d for v,d in zip(values, delimiters+['']))
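As an alternative to padding the list by hand, `itertools.zip_longest()` can fill in the missing delimiter. The sketch below wraps the whole split-and-rejoin step in a helper; the name `normalize_delimiters` is hypothetical and chosen only for this illustration.

```python
import re
from itertools import zip_longest

def normalize_delimiters(text):
    """Drop the whitespace that follows each delimiter, keeping the delimiter."""
    fields = re.split(r'(;|,|\s)\s*', text)
    values = fields[::2]        # even positions: text between delimiters
    delimiters = fields[1::2]   # odd positions: the captured delimiters
    # zip_longest pads the shorter sequence, so the last value is not lost.
    pairs = zip_longest(values, delimiters, fillvalue='')
    return ''.join(v + d for v, d in pairs)

names = 'Abu bin Ahmad; Osman, Iskandar,Mohammad,   Tan Ah Kau'
cleaned = normalize_delimiters(names)
print(cleaned)
```

Using `zip_longest()` avoids having to remember which of the two lists is shorter.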

If we do not want the delimiters in the result but still want to group parts of the pattern with parentheses, we can use a *non-capturing group*, written as `(?:...)`.

In [None]:
re.split(r'(?:,|;|\s)\s*', names)
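A quick check of how the capturing and non-capturing forms relate, as a small sketch (the variable names are illustrative):

```python
import re

names = 'Abu bin Ahmad; Osman, Iskandar,Mohammad,   Tan Ah Kau'

with_capture = re.split(r'(;|,|\s)\s*', names)       # values and delimiters
without_capture = re.split(r'(?:;|,|\s)\s*', names)  # values only

# The non-capturing result equals the values taken from the capturing result.
same_as_values = without_capture == with_capture[::2]

# For single-character alternatives, a character set gives the same split.
same_as_charset = without_capture == re.split(r'[;,\s]\s*', names)

print(same_as_values, same_as_charset)
```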

### Matching Text at the Start or End of a String

Strings come with methods to check how they begin or end: `str.startswith()` and `str.endswith()`.

In [None]:
filename = '/tmp/local/spam.txt'
filename.endswith('.txt')

In [None]:
filename.startswith('/temp')

In [None]:
# import os
# filenames = os.listdir('.')
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

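`(... for ... in ...)` is a generator expression. It produces its items lazily, so to see all the results at once it must be converted to a list, tuple or dict (if applicable), or consumed by a function such as `any()`. The same syntax inside square brackets is a list comprehension, which builds the list directly.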
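A minimal sketch of how a generator expression behaves, using the same `filenames` list. The names `gen`, `first`, `rest` and `leftover` are illustrative only.

```python
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# Nothing is computed when the generator expression is created.
gen = (name.upper() for name in filenames)

first = next(gen)      # compute only the first item
rest = list(gen)       # consume the remaining items into a list
leftover = list(gen)   # a generator can be consumed only once

print(first)
print(rest)
print(leftover)
```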

If we need to check against multiple choices with `endswith()` or `startswith()`, we need to pass the choices as a **tuple**.

In [None]:
[name for name in filenames if name.endswith(('.c', '.h'))]
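The tuple requirement is strict: passing a list raises a `TypeError`. A short sketch, assuming the choices arrive as a list (the `extensions` variable is hypothetical):

```python
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']
extensions = ['.c', '.h']   # hypothetical: choices stored in a list

# A list is rejected by endswith() and startswith().
raised = False
try:
    filenames[0].endswith(extensions)
except TypeError as err:
    raised = True
    print(f'TypeError: {err}')

# Converting the list with tuple() makes the call valid.
c_files = [name for name in filenames if name.endswith(tuple(extensions))]
print(c_files)
```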

In [None]:
any(name.endswith('.py') for name in filenames)

We can also use `re.match()` for this, but it may be overkill.

In [None]:
# Escape the dot and anchor at the end so only '.c' or '.h' extensions match
any(re.match(r'.*\.(c|h)$', name) for name in filenames)

In [None]:
# \. matches a literal dot; without it, any name ending in 'c' or 'h' would match
files = list(name for name in filenames if re.match(r'.*\.[ch]$', name))
print(files)
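When a regular expression is used to match file extensions, the dot matters: an unescaped `.` matches any character, and leaving the dot out entirely matches any name that merely ends in the letter. The sketch below uses hypothetical names chosen to expose the difference.

```python
import re

# Hypothetical names: two real C files and three that should not match.
candidates = ['spam.c', 'spam.h', 'bash', 'music', 'notes.txt']

# No literal dot required: anything ending in 'c' or 'h' matches.
loose = [n for n in candidates if re.match(r'.*[ch]$', n)]

# Literal dot required before the final 'c' or 'h'.
strict = [n for n in candidates if re.match(r'.*\.[ch]$', n)]

print(loose)
print(strict)
```

For plain extension checks like this, `str.endswith()` with a tuple is usually simpler and harder to get wrong.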

### Extract Matched Information

In [None]:
timestr = '5h23m'

# To extract hours and minutes from timestr
matchObj = re.match(r'(\d+)h(\d+)m', timestr)
if matchObj:
  print(f"hours={matchObj.group(1)} minutes={matchObj.group(2)}")
else:
  print("No match")

In [None]:
# To extract hours and minutes from timestr and put them in a dictionary
matchObj = re.match(r'(?P<hours>\d+)h(?P<minutes>\d+)m', timestr)
if matchObj:
  # Named groups can be read by name as well as by position
  print(f"hours={matchObj.group('hours')} minutes={matchObj.group('minutes')}")
  print(matchObj.groupdict())  # Note: the values are strings
else:
  print("No match")
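Because the captured values are strings, a common next step is converting them to integers. The following is a minimal sketch; the helper name `parse_duration` and the choice to return total minutes are assumptions made for this illustration.

```python
import re

def parse_duration(text):
    """Return the total minutes for strings like '5h23m', or None if no match."""
    match = re.match(r'(?P<hours>\d+)h(?P<minutes>\d+)m', text)
    if match is None:
        return None
    # groupdict() values are strings, so convert each one to int.
    parts = {key: int(value) for key, value in match.groupdict().items()}
    return parts['hours'] * 60 + parts['minutes']

total = parse_duration('5h23m')
missing = parse_duration('soon')
print(total, missing)
```

Note that `re.match()` only anchors at the start of the string, so trailing text after the final `m` is ignored by this sketch.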