# Regex Notebook

Import the regex library _re_ as follows:

In [19]:
import re

Create a sample string to search in:

In [20]:
text1 = "This is a beautiful day"

## Regex Functions:

The ``re`` module provides several functions for searching, replacing, splitting, etc.
We'll see these functions in action here:

### The search() Function

This function searches for a pattern present _anywhere_ in the string.

**Syntax:** ``re.search(pattern, string, flags)``

This function returns a re.Match object (or ``None`` if nothing matches), which has several useful methods such as:

- ``group()``
- ``groups()``
- ``start()``
- ``end()``
- ``span()``


And a few others.

In [21]:
re.search(r'is', text1)

<re.Match object; span=(2, 4), match='is'>

In [22]:
m = re.search(r'is', text1)
print(type(m))

<class 're.Match'>


#### The group() Method:

This method returns the matched string.

In [23]:
m.group()

'is'

#### The start(), end(), span() Methods:
These methods return:
 - The start index of the match
 - The end index of the match (exclusive)
 - The tuple (start, end)
 

In [24]:
m.start(), m.end(), m.span()

(2, 4, (2, 4))
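
The list of re.Match methods above also mentions ``groups()``, which only becomes interesting once the pattern contains capturing groups (parentheses). Here is a minimal sketch; the ``sample`` string and the date pattern are made up for illustration and are not part of the notebook's data.

```python
import re

# Hypothetical sample string, not part of the cells above
sample = "Order 66 shipped on 2024-05-17"

# Three capturing groups: year, month, day
m = re.search(r'(\d{4})-(\d{2})-(\d{2})', sample)

whole = m.group()       # the entire match, same as m.group(0)
year = m.group(1)       # text captured by the first group
parts = m.groups()      # all captured groups as a tuple
month_span = m.span(2)  # (start, end) of the second group

print(whole, year, parts, month_span)
```

``start()``, ``end()`` and ``span()`` accept a group number in the same way as ``group()``.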

### The match() Function:
Very similar to ``re.search``, 
but this function looks for the pattern only at the beginning of the string.

Syntax: `re.match(pattern, string, flags)`
<hr>

Here, we are searching for 'is' at the start of text1,
 but text1 begins with 'Th', so this returns ``None``. 

In [25]:
m = re.match(r'is', text1)
print(m)

None


But here, since 'Th' is present at the beginning,
 it returns the expected re.Match object.

In [26]:
m = re.match(r'Th', text1)
print(m)

<re.Match object; span=(0, 2), match='Th'>


Since it is the same kind of re.Match object as seen previously, 
all the methods work here as well.

In [27]:
m.group(), m.start(), m.end(), m.span()

('Th', 0, 2, (0, 2))

Also, ``span()`` returns a tuple, so you can access its elements by index, 
just like the elements of a list.

In [28]:
x = m.span()
print(x[0], x[1])

0 2
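
Because ``re.search`` and ``re.match`` return ``None`` when nothing matches, calling ``group()`` on the result without a check raises an ``AttributeError``. The sketch below shows one way to guard against that. The helper ``describe`` is a hypothetical name introduced only for this example, and ``re.fullmatch`` is added for comparison, since it requires the whole string to match.

```python
import re

text1 = "This is a beautiful day"

def describe(m):
    """Return a short description of a match result, handling None."""
    if m is None:
        return "no match"
    return f"{m.group()!r} at {m.span()}"

found_anywhere = describe(re.search(r'is', text1))        # anywhere in the string
found_at_start = describe(re.match(r'is', text1))         # only at the beginning
found_whole = describe(re.fullmatch(r'Th.*day', text1))   # must cover the whole string

print(found_anywhere)
print(found_at_start)
print(found_whole)
```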


### The findall() Function:
This function returns a list of all non-overlapping matched strings.

In [29]:
re.findall(r'is', text1)

['is', 'is']

In [30]:
text2 = "abbbaaabbbbabababa"

In [31]:
re.findall(r'ba', text2)

['ba', 'ba', 'ba', 'ba', 'ba']
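
One detail worth knowing: when the pattern contains capturing groups, ``findall`` returns the group contents instead of the whole match. A small sketch, using a made-up ``pairs`` string that is not part of the notebook's data:

```python
import re

# Hypothetical sample string for illustration
pairs = "x=1, y=22, z=333"

no_groups = re.findall(r'\w=\d+', pairs)       # whole matches
one_group = re.findall(r'\w=(\d+)', pairs)     # only the captured text
two_groups = re.findall(r'(\w)=(\d+)', pairs)  # a tuple per match

print(no_groups)
print(one_group)
print(two_groups)
```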

In [32]:
mat = re.finditer(r'ba', text2)

In [33]:
for m in mat:
    print(m.group(), m.start(), m.end(), m.span())

ba 3 5 (3, 5)
ba 10 12 (10, 12)
ba 12 14 (12, 14)
ba 14 16 (14, 16)
ba 16 18 (16, 18)


In [34]:
print(re.sub(r'ba', 'xy', text2, count=2))

abbxyaabbbxybababa
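
The cell above uses ``re.sub(pattern, repl, string, count)`` to replace the first two matches of 'ba' with 'xy'. As a rough sketch of what else ``repl`` can be: it may refer back to captured groups, or be a function that receives each re.Match object. The variable names below are chosen just for this example.

```python
import re

text2 = "abbbaaabbbbabababa"

# \1 and \2 refer to the first and second capturing groups
swapped = re.sub(r'(b)(a)', r'\2\1', text2, count=1)

# A function receives each re.Match and returns the replacement string
upper = re.sub(r'ba', lambda m: m.group().upper(), text2)

# subn returns the new string together with the number of replacements
result, n = re.subn(r'ba', 'xy', text2)

print(swapped)
print(upper)
print(result, n)
```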


In [35]:
pat = re.compile(r'ba')
type(pat)

re.Pattern

In [36]:
re.findall(pat, text2)

['ba', 'ba', 'ba', 'ba', 'ba']
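
A compiled re.Pattern object can be passed to the module-level functions, as above, but it also has its own methods with the same names. This is a minimal sketch of that style; compiling once is mostly useful when the same pattern is reused many times.

```python
import re

text2 = "abbbaaabbbbabababa"
pat = re.compile(r'ba')

all_matches = pat.findall(text2)           # same as re.findall(pat, text2)
first_span = pat.search(text2).span()      # span of the first match
replaced = pat.sub('xy', text2, count=2)   # same as re.sub(pat, 'xy', text2, count=2)

# Pattern methods also accept a start position as the second argument
from_11 = pat.findall(text2, 11)

print(all_matches, first_span, replaced, from_11)
```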

In [37]:
text3 = "akasad kadkad; asdadnnas; asdkakds: ajdasdj, sjdjdj; sisisiu;      hshs"

In [38]:
text3_list = re.split(r'[ ;:,]\s*', text3)

In [39]:
print(text3_list)

['akasad', 'kadkad', 'asdadnnas', 'asdkakds', 'ajdasdj', 'sjdjdj', 'sisisiu', 'hshs']
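
``re.split`` above cuts the string wherever the separator pattern matches. Two variations that are easy to miss are sketched here, on a made-up ``record`` string: ``maxsplit`` limits the number of cuts, and a capturing group around the separator keeps the separators in the result.

```python
import re

# Hypothetical sample string for illustration
record = "alpha; beta, gamma:delta"

fields = re.split(r'[;:,]\s*', record)                    # split on every separator
first_only = re.split(r'[;:,]\s*', record, maxsplit=1)    # split at most once
with_seps = re.split(r'([;:,])\s*', record)               # keep captured separators

print(fields)
print(first_only)
print(with_seps)
```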


## Patterns: Repetition Coding

### Two types: 
- Greedy
- Non Greedy
<hr>

#### Greedy Repetition

\* \+ ? {n} {m,n}

In [40]:
text = "ab abb a a a abbbb abbbbbb"

\* Indicates 0 or more

In [41]:
print(re.findall(r'ab*', text))

['ab', 'abb', 'a', 'a', 'a', 'abbbb', 'abbbbbb']


\+ Indicates 1 or more

In [42]:
print(re.findall(r'ab+', text))

['ab', 'abb', 'abbbb', 'abbbbbb']


? Indicates 0 or 1

In [43]:
print(re.findall(r'ab?', text))


['ab', 'ab', 'a', 'a', 'a', 'ab', 'ab']


{n} Indicates exactly n

In [44]:
print(re.findall(r'ab{2}', text))


['abb', 'abb', 'abb']


{m, n} Indicates m to n

In [45]:
print(re.findall(r'ab{3,5}', text))

['abbbb', 'abbbbb']


#### Non Greedy:
Similar to greedy; the only difference is a trailing '?', which makes the quantifier match as few characters as possible.

In [46]:
print(re.findall(r'ab*?', text))
print(re.findall(r'ab+?', text))
print(re.findall(r'ab??', text))
print(re.findall(r'ab{3}?', text))
print(re.findall(r'ab{3,5}?', text))

['a', 'a', 'a', 'a', 'a', 'a', 'a']
['ab', 'ab', 'ab', 'ab']
['a', 'a', 'a', 'a', 'a', 'a', 'a']
['abbb', 'abbb']
['abbb', 'abbb']
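
The difference becomes clearer when something has to follow the repeated part. The sketch below uses a made-up ``html`` string: the greedy version runs to the last '>' in the string, while the non-greedy version stops at the first one it can.

```python
import re

# Hypothetical sample string for illustration
html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r'<.*>', html)   # as much as possible between < and >
lazy = re.findall(r'<.*?>', html)    # as little as possible between < and >

print(greedy)
print(lazy)
```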


### Character Sets:

In [47]:
text = "xyxyxyxxxxxyyyzzzxyz"

[xyz] - Indicates "Match any single character that is x, y, or z." The cells below use the smaller set [xy].

In [48]:
print(re.findall(r'[xy]', text))
print(re.findall(r'x[xy]', text))
print(re.findall(r'x[xy]+', text))
print(re.findall(r'x[xy]+?', text))

['x', 'y', 'x', 'y', 'x', 'y', 'x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'x', 'y']
['xy', 'xy', 'xy', 'xx', 'xx', 'xy', 'xy']
['xyxyxyxxxxxyyy', 'xy']
['xy', 'xy', 'xy', 'xx', 'xx', 'xy', 'xy']


^ This symbol is called __Caret__

As the first character inside a character set, ^ indicates "Select everything **except** the following characters"

In [49]:
print(re.findall(r'y[^xy]+', text))

['yzzz', 'yz']


In [50]:
text = "12 3 4 5 6 S T Sample Text. +=- some Punctuation marks!!!"

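Range [A-Z] or [a-z] or [1-9]

This indicates "Select any single character from this one to that one, inclusive".
In the third pattern below, the ^ is not the first character of the set, so it is treated as a literal caret and not as a negation.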

In [51]:
print(re.findall(r'[A-Z][a-z]+', text))
print(re.findall(r'[^.\-=+! ]+', text))
print(re.findall(r'[1-3^.\-=+! ]+', text))

['Sample', 'Text', 'Punctuation']
['12', '3', '4', '5', '6', 'S', 'T', 'Sample', 'Text', 'some', 'Punctuation', 'marks']
['12 3 ', ' ', ' ', ' ', ' ', ' ', ' ', '. +=- ', ' ', ' ', '!!!']
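
The meaning of the caret depends on where it appears, which is easy to get wrong. A short sketch on a made-up string ``s``:

```python
import re

# Hypothetical sample string containing a literal caret
s = "a^b c"

negated = re.findall(r'[^a]', s)   # ^ first in the set: everything except 'a'
literal = re.findall(r'[a^]', s)   # ^ not first: matches 'a' or a literal '^'
anchored = re.findall(r'^a', s)    # ^ outside a set: 'a' at the start of the string

print(negated)
print(literal)
print(anchored)
```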


### Escape Codes

\d indicates "select all digits"

\D indicates "select all non-digits"

In [68]:
text = "This is a beautiful day 123   %$."

In [53]:
print(re.findall(r'\d', text))
print(re.findall(r'\d+', text))

['1', '2', '3']
['123']


In [54]:
#All non digits

print(re.findall(r'\D', text))
print(re.findall(r'\D+', text))

['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'b', 'e', 'a', 'u', 't', 'i', 'f', 'u', 'l', ' ', 'd', 'a', 'y', ' ', ' ', ' ', ' ', '%', '$', '.']
['This is a beautiful day ', '   %$.']


\s indicates "select all whitespace characters"

\S indicates "select all non-whitespace characters"

In [55]:
print(re.findall(r'\s', text))
print(re.findall(r'\s+', text))


[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ', '   ']


In [56]:
print(re.findall(r'\S', text))
print(re.findall(r'\S+', text))



['T', 'h', 'i', 's', 'i', 's', 'a', 'b', 'e', 'a', 'u', 't', 'i', 'f', 'u', 'l', 'd', 'a', 'y', '1', '2', '3', '%', '$', '.']
['This', 'is', 'a', 'beautiful', 'day', '123', '%$.']


\w indicates "select all word characters: [A-Z], [a-z], [0-9] and the underscore _"

\W indicates "select all non-word characters"

In [57]:
print(re.findall(r'\w', text))
print(re.findall(r'\w+', text))

['T', 'h', 'i', 's', 'i', 's', 'a', 'b', 'e', 'a', 'u', 't', 'i', 'f', 'u', 'l', 'd', 'a', 'y', '1', '2', '3']
['This', 'is', 'a', 'beautiful', 'day', '123']


In [58]:
print(re.findall(r'\W', text))
print(re.findall(r'\W+', text))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '%', '$', '.']
[' ', ' ', ' ', ' ', ' ', '   %$.']
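
To see the escape codes side by side on input that contains an underscore and a tab, here is a small sketch. The string ``s`` is made up for illustration. Note how ``\w`` keeps 'user_id' together while the '.' and '-' split the number.

```python
import re

# Hypothetical sample string; "\t" is a tab character
s = "user_id=42, temp=-3.5C\tok"

words = re.findall(r'\w+', s)    # runs of letters, digits and underscores
digits = re.findall(r'\d+', s)   # runs of digits only
spaces = re.findall(r'\s', s)    # each whitespace character, including the tab

print(words)
print(digits)
print(spaces)
```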


### Anchoring

- '^' - matches at the start of the string
- '$' - matches at the end of the string (or just before a trailing newline)
- '\A' - matches only at the start of the string
- '\Z' - matches only at the end of the string
- '\b' - matches at a word boundary, i.e. at the start or end of a word

In [69]:
print(re.findall(r'is', text))

# Since text does not begin with 'is', the result is empty
print(re.findall(r'^is+', text))

['is', 'is']
[]


In [70]:
print(re.findall(r'\.$', text))

['.']


In [77]:
text = "This is a beautiful beautifulday 123   %$."

print(text)
print(re.findall(r'\bis\b', text))
print(re.search(r'\bis\b', text))

print(re.findall(r'beautiful', text))
print(re.search(r'beautiful', text))

print(re.findall(r'\bbeautiful\b', text))
print(re.search(r'\bbeautiful\b', text))

This is a beautiful beautifulday 123   %$.
['is']
<re.Match object; span=(5, 7), match='is'>
['beautiful', 'beautiful']
<re.Match object; span=(10, 19), match='beautiful'>
['beautiful']
<re.Match object; span=(10, 19), match='beautiful'>
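
The '\A' and '\Z' anchors from the list above are not used in the cells, so here is a rough sketch of how they differ from '^' on a multi-line string. The ``lines`` string is made up for illustration, and '\B' (the opposite of '\b', i.e. not at a word boundary) is added for comparison.

```python
import re

# Hypothetical three-line sample string
lines = "is it\nthis is\nis"

start_only = re.findall(r'^is', lines)                # ^ matches at the start of the string
each_line = re.findall(r'^is', lines, re.MULTILINE)   # ^ now matches at the start of every line
a_anchor = re.findall(r'\Ais', lines, re.MULTILINE)   # \A is not affected by MULTILINE
z_anchor = re.findall(r'is\Z', lines)                 # 'is' at the very end of the string
inside_word = re.findall(r'\Bis\b', lines)            # 'is' that ends a longer word ("this")

print(start_only, each_line, a_anchor, z_anchor, inside_word)
```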


### Flags

- `re.IGNORECASE` or `re.I` or 2

	Makes the expression case-insensitive
- `re.DOTALL` or `re.S` or 16

	Makes the . character also match the \n character
- `re.MULTILINE` or `re.M` or 8

	Makes ^ and $ match at the beginning and end of each line of the string
- `re.VERBOSE` or `re.X` or 64

	Allows whitespace and comments inside the pattern for readability
- `re.DEBUG` or 128

	Prints debug information about the compiled expression

In [78]:
text = "HeLLo hello HELLO"

In [79]:
print(re.findall(r'hello', text))

['hello']


In [80]:
# Using Ignore flag
print(re.findall(r'hello', text, re.IGNORECASE))
print(re.findall(r'hello', text, re.I))
print(re.findall(r'hello', text, 2))

['HeLLo', 'hello', 'HELLO']
['HeLLo', 'hello', 'HELLO']
['HeLLo', 'hello', 'HELLO']
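
The other flags in the list can be tried the same way, and several flags can be combined with ``|``. The sketch below uses made-up strings and a hypothetical ``phone`` pattern, purely to show the effect of each flag; the named constants are generally easier to read than the integer values.

```python
import re

# Hypothetical two-line sample string
s = "Hello\nWORLD"

# Without DOTALL, '.' does not match the newline between the words
no_dotall = re.findall(r'hello.world', s, re.IGNORECASE)
# Flags are combined with the | operator
with_dotall = re.findall(r'hello.world', s, re.IGNORECASE | re.DOTALL)

# With MULTILINE, ^ and $ match at the start and end of each line
per_line = re.findall(r'^\w+$', s, re.MULTILINE)

# With VERBOSE, whitespace and # comments inside the pattern are ignored
phone = re.compile(r"""
    (\d{3})    # first three digits
    [-\s]?     # optional separator
    (\d{4})    # last four digits
""", re.VERBOSE)
phone_parts = phone.search("call 555-1234 now").groups()

print(no_dotall, with_dotall, per_line, phone_parts)
```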
