In [1]:
import re

### Index
* [Split](#Split)
* [Substitute](#Substitute)

### Split
**split** breaks the string apart at every match of the pattern and returns the pieces as a list.

In [2]:
re.split(r"\n", "Beautiful is better than ugly.\nExplicit is better than implicit.")

['Beautiful is better than ugly.', 'Explicit is better than implicit.']

In [3]:
# For ASCII text, \W is equivalent to [^a-zA-Z0-9_] (any non-word character),
# so one pattern can split on many different separators.
pattern = re.compile(r"\W")
texts = ("hello world", "hello*world", "hello&world")
for txt in texts:
    parts = pattern.split(txt)
    print("{} : {}".format(txt, parts))

hello world : ['hello', 'world']
hello*world : ['hello', 'world']
hello&world : ['hello', 'world']
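
One caveat worth sketching: \W matches a single character, so two separators in a row leave an empty string between them. A minimal example, assuming you would rather treat a run of separators as one (the sample string is made up for illustration):

```python
import re

text = "hello, world"  # illustrative input: a comma followed by a space

# \W matches one character at a time, so ", " yields an empty piece.
single = re.split(r"\W", text)

# \W+ treats the whole run of non-word characters as one separator.
run = re.split(r"\W+", text)

print(single)
print(run)
```

The first call keeps an empty string between the comma and the space; the second does not.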


The **maxsplit** parameter caps the number of splits. Whatever is left of the string after the last split is returned, unsplit, as the final element of the result.

In [4]:
pattern.split("Beautiful is better than ugly", 2)

['Beautiful', 'is', 'better than ugly']
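
As a small sketch, **maxsplit** can also be passed by keyword, which makes the intent easier to read. The whitespace pattern below is chosen only for illustration.

```python
import re

sentence = "Beautiful is better than ugly"

# At most two splits, so at most three pieces; the tail stays intact.
limited = re.split(r"\s", sentence, maxsplit=2)

# maxsplit=0 (the default) means "no limit".
unlimited = re.split(r"\s", sentence, maxsplit=0)

print(limited)
print(unlimited)
```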

Normally, the text matched by the pattern is not included in the result. <span style="color:blue;font-weight:bold">Use a capturing group to include the matched text in the result</span>

In [5]:
re.split(r"-", "hello-world")  # normally the matched pattern isn't included

['hello', 'world']

In [6]:
re.split(r"(-)", "hello-world")  # use a group to include the matched pattern

['hello', '-', 'world']
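
A short sketch of why this is handy: when the separators are kept, the pieces can be joined back into the original string. It also shows that a non-capturing group (?:...) does not keep them. The arithmetic-style input is only an illustration.

```python
import re

expr = "1+2-3"  # illustrative input with two different separators

# A capturing group keeps each separator as its own list item.
with_seps = re.split(r"([-+])", expr)

# A non-capturing group groups without capturing, so separators are dropped.
without_seps = re.split(r"(?:-|\+)", expr)

# Because nothing was thrown away, joining restores the input.
rebuilt = "".join(with_seps)

print(with_seps)
print(without_seps)
print(rebuilt)
```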

### Substitute
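pattern.sub(repl, string, count=0) returns a copy of the string with each match of the pattern replaced by **repl**. The default **count** of 0 replaces every match. The module-level equivalent is re.sub(pattern, repl, string, count=0).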

In [9]:
pattern = re.compile(r"[0-9]+")
pattern.sub("-", "order0 order1789 order13")

'order- order- order-'
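
To illustrate the **count** parameter from the signature, here is a minimal sketch reusing the same pattern and input. It also uses **subn**, the companion method that additionally reports how many replacements were made.

```python
import re

pattern = re.compile(r"[0-9]+")
text = "order0 order1789 order13"

# count=1 replaces only the first (leftmost) match.
first_only = pattern.sub("-", text, count=1)

# subn returns a (new_string, number_of_replacements) tuple.
replaced, how_many = pattern.subn("-", text)

print(first_only)
print(replaced, how_many)
```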

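It replaces the <span style="color:blue;font-weight:bold">leftmost non-overlapping occurrences</span> of the pattern: the string is scanned from left to right, and characters consumed by one match cannot be part of the next.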

In [10]:
re.sub('00', '-', 'order00000')

'order--0'
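
One way to see where that result comes from is to list the matches themselves. The sketch below uses **finditer** on the same input; the five zeros yield only two matches of '00', and the last zero is left over.

```python
import re

text = "order00000"

# Each match starts where the previous one ended, never inside it.
spans = [m.span() for m in re.finditer("00", text)]
matches = re.findall("00", text)

print(spans)
print(matches)
```

The spans come out as (5, 7) and (7, 9), so the character at index 9 is never matched.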

The **repl** argument of sub can also be a function. It is called once per match, receives the match object, and returns the string to use as the replacement.

For example, imagine you have two kinds of order codes:
* some start with a dash, like **-1234**
* the others start with a letter, like **A193, B123, C124**

You want to normalize them as follows:
* the ones starting with a dash should start with an A, so **-1234 --> A1234**
* all the rest should start with a B, so **A193, B123, C124 --> B193, B123, B124**

In [11]:
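def normalize_orders(matchobj):
    # A dash becomes "A"; any uppercase letter becomes "B".
    if matchobj.group() == '-':
        return "A"
    return "B"

# Inside a character class "|" is a literal pipe, not alternation, so [-A-Z] is enough.
re.sub(r'[-A-Z]', normalize_orders, '-1234 A193 B123')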

'A1234 B193 B123'
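
The pattern above rewrites every dash or uppercase letter, wherever it occurs. A stricter variant is sketched below, under the assumption that an order code is a prefix character immediately followed by a digit and preceded by whitespace or the start of the string. The names normalize_prefix and ORDER_PREFIX are hypothetical, chosen only for this illustration.

```python
import re

def normalize_prefix(matchobj):
    # Hypothetical helper: a dash prefix becomes "A", any letter prefix "B".
    return "A" if matchobj.group() == "-" else "B"

# (?<!\S)  not preceded by a non-space character (start of a token)
# [-A-Z]   the prefix character itself
# (?=\d)   must be followed by a digit, which is not consumed
ORDER_PREFIX = re.compile(r"(?<!\S)[-A-Z](?=\d)")

codes = ORDER_PREFIX.sub(normalize_prefix, "-1234 A193 B123 C124")
mixed = ORDER_PREFIX.sub(normalize_prefix, "Order A193 is re-sent")

print(codes)
print(mixed)
```

With the looser [-A-Z] pattern, the second input would also lose the capital "O" and the dash in "re-sent".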

Use <span style="color:red;font-weight:bold;font-size:1.2em">Backreferences</span> in substitution

In [12]:
# swap the two sides of each dash-separated pair: \1 and \2 refer to the groups
pattern = re.compile(r"(\d+)-(\w+)")
pattern.sub(r"\2-\1", "1-a\n20-baer\n34-afcr")

'a-1\nbaer-20\nafcr-34'
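
Numbered backreferences get hard to read as patterns grow. As a sketch, the same swap can be written with named groups and the \g<name> syntax. The group names num and word are made up for this example.

```python
import re

# Same pattern as before, but each group gets a name.
pattern = re.compile(r"(?P<num>\d+)-(?P<word>\w+)")

# \g<name> in the replacement refers to the named group.
swapped = pattern.sub(r"\g<word>-\g<num>", "1-a\n20-baer\n34-afcr")

print(repr(swapped))
```

The \g<...> form also accepts a group number, as in \g<1>, which avoids ambiguity when the replacement continues with a literal digit.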