# RegEx Tutorial

A **RegEx** (regular expression) is a string that defines a pattern used to match, locate, and manipulate text.

Regular expressions are essentially a tiny, highly specialized programming language embedded inside Python and made available through the `re` module.

## Matching Characters

In a RegEx, most characters match themselves; for example, the regular expression `test` matches the string `test` exactly. The exceptions are a few special characters known as *metacharacters*, which have a different meaning.

First we will see how to match simple characters in a string.

In [2]:
# import re module
import re

**CASE 1:**

We will try to find all the strings that contain the word `cat`.

In [12]:
# Let's define a list of strings
words = ["Hey I'm there", "Where's my cat", "The game of Cat and Mouse", "It could be catalyst"]

# define the pattern to match
pattern = 'cat'
for word in words:
    match = re.search(pattern, word)
    if match:
        print(word)

Where's my cat
It could be catalyst


#### Observations from above example:

1. It correctly ignores the 1st string, which does not contain the word `cat`.
2. It identifies the word `cat` in the 2nd string.
3. It is case-sensitive, so it ignores the word `Cat` in the 3rd string.
4. It prints the 4th string, which it is not supposed to print.

So we found 2 issues with the above approach.

1. It is case-sensitive.
2. It returns any string in which `cat` appears, even as part of another word, e.g. `catalyst` in the 4th string.

We can make it case-insensitive using `re.IGNORECASE`, and the 2nd issue can be handled using the *Word Boundary* approach, which we will cover later in this tutorial.
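
As a brief preview of the word-boundary idea, here is a minimal sketch that assumes the `\b` escape (not yet introduced at this point in the tutorial) and reuses the list from CASE 1. `\b` matches the position between a word character and a non-word character, so `cat` inside `catalyst` is no longer accepted.

```python
import re

words = ["Hey I'm there", "Where's my cat", "The game of Cat and Mouse", "It could be catalyst"]

# \b marks a word boundary on each side of 'cat'.
pattern = r'\bcat\b'

matches = [w for w in words if re.search(pattern, w, re.IGNORECASE)]
print(matches)
# → ["Where's my cat", 'The game of Cat and Mouse']
```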


**CASE 2:**
Let's have a look at how we can make it case-insensitive.

In [19]:
words = ["Hey I'm there", "Where's my cat", "The game of Cat and Mouse"]
pattern = 'cat'

for word in words:
    match = re.search(pattern, word, re.IGNORECASE)
    if match:
        print(word)

Where's my cat
The game of Cat and Mouse
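
The same flag can be supplied in a few other ways. The sketch below shows the short alias `re.I`, the inline form `(?i)`, and a compiled pattern; the variable names are only illustrative.

```python
import re

text = "The game of Cat and Mouse"

# re.I is the short alias of re.IGNORECASE.
with_flag = re.search('cat', text, re.IGNORECASE)
with_alias = re.search('cat', text, re.I)

# An inline flag placed at the very start of the pattern has the same effect.
with_inline = re.search('(?i)cat', text)

# A compiled pattern carries its flags, which is handy inside loops.
cat_re = re.compile('cat', re.IGNORECASE)
with_compiled = cat_re.search(text)

print(with_flag.group(), with_alias.group(), with_inline.group(), with_compiled.group())
# → Cat Cat Cat Cat
```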


##### Now we'll see *Metacharacters* and their usage.

Here's the complete list of *metacharacters*:

`. ^ $ * + ? { } [ ] \ | ( )`
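
Because these characters are special, matching them literally requires escaping. The following sketch, using a made-up sample string `price`, shows what can go wrong without escaping and two ways to fix it: a backslash before each metacharacter, or `re.escape()`.

```python
import re

price = "Total: $5.00 (approx.)"  # hypothetical sample text

# Unescaped, '$' is an end anchor, so nothing can follow it and the search fails.
unescaped = re.search('$5.00', price)
print(unescaped)  # → None

# A backslash makes each metacharacter literal.
manual = re.search(r'\$5\.00', price)
print(manual.group())  # → $5.00

# re.escape() builds the escaped pattern for you.
escaped = re.escape('$5.00')
automatic = re.search(escaped, price)
print(automatic.group())  # → $5.00
```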

### 1.  Metacharacter:  `[ ]` 
`[` and `]` define a character set: the set matches any single character listed inside the brackets.

**CASE 3:**

Say we have to check whether a 3-letter string ends with `at`.

In [17]:
words = ['cat', 'wed', 'Bat', 'rat', 'ate', 'sit', 'Fat']
pattern = r'[A-Za-z]at'

for word in words:
    match = re.search(pattern, word)
    if match:
        print(word)

cat
Bat
rat
Fat


In the above example we passed 2 ranges inside `[ ]`: uppercase `A-Z` and lowercase `a-z`.

In short, a set works as an `OR` condition: it matches exactly one character that falls in any of the given ranges.
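
Note that `re.search` looks for the pattern anywhere in the string, so it does not by itself enforce the 3-letter length. A minimal sketch, with a word list chosen here for illustration, compares it with `re.fullmatch`, which requires the whole string to match.

```python
import re

words = ['cat', 'Bat', 'ate', 'at', 'catalog', 'combat']  # illustrative list
pattern = r'[A-Za-z]at'

# re.search accepts the pattern anywhere inside the string.
found_anywhere = [w for w in words if re.search(pattern, w)]

# re.fullmatch accepts only strings that match from start to end.
exact = [w for w in words if re.fullmatch(pattern, w)]

print(found_anywhere)  # → ['cat', 'Bat', 'catalog', 'combat']
print(exact)           # → ['cat', 'Bat']
```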

**CASE 4:**

Let's take a similar example and we'll only fetch the words starting with `b` or `c` and followed by `at`.

In [18]:
words = ['cat', 'wed', 'Bat', 'rat', 'ate', 'sit', 'Fat']
pattern = r'[bc]at'

for word in words:
    match = re.search(pattern, word, re.IGNORECASE)   # ignoring the case
    if match:
        print(word)

cat
Bat


**CASE 5:**

Now let's see how to complement a set.

We want to match all the strings ending with `at` but not starting with `b` or `c`.

In [22]:
words = ['cat', 'wed', 'Bat', 'rat', 'ate', 'sit', 'Fat', '8at']
pattern = r'[^bc]at'

for word in words:
    match = re.search(pattern, word, re.IGNORECASE)   # ignoring the case
    if match:
        print(word)

rat
Fat
8at


By simply adding the `^` character at the start of the set, we can complement it. We'll see the usage of `^` outside `[ ]` soon.

Furthermore, `^` has this special meaning only at the start of the set. If you place it anywhere else in the set, it is treated as a literal character.
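
Two details are easy to miss here, so a small sketch may help: a complemented set still has to consume one character, and `^` can also be made literal at the start of a set by escaping it with a backslash.

```python
import re

# A complemented set must still match one character, so bare 'at' fails.
no_prefix = re.search(r'[^bc]at', 'at')
print(no_prefix)  # → None

# '^' is literal when it is not the first character of the set...
literal_later = re.search(r'[b^]at', '^at')
print(literal_later.group())  # → ^at

# ...or when it is escaped with a backslash.
literal_escaped = re.search(r'[\^b]at', '^at')
print(literal_escaped.group())  # → ^at
```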

**CASE 6:**

We have to match words that end with `at` and start with `b` or `^`.

In [25]:
words = ['cat', 'wed', 'Bat', 'bat', 'ate', 'sit', '^at', '8at']
pattern = r'[b^]at'

for word in words:
    match = re.search(pattern, word, re.IGNORECASE)   # ignoring the case
    if match:
        print(word)

Bat
bat
^at


### 2. Metacharacter:   `.`

`.` matches any single character except a newline.

**CASE 7:**

Find the 3-character pattern that starts with `a`, is followed by any character, and ends with `e`.

*eg.  ate, ace, ave, save*  

In [33]:
words = ['aces', 'weds', 'A.e', 'bat', 'eats', 'save', '^at', 'a\nesim','a\ns']   # `\n` is newline in last 2 strings
pattern = r'a.e'

for word in words:
    match = re.search(pattern, word, re.IGNORECASE)   # ignoring the case
    if match:
        print(word)

aces
A.e
save


We'll see more examples of it later in this tutorial. 
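
If the newline should be matched as well, the `re.DOTALL` flag changes the behaviour of `.`. A minimal sketch:

```python
import re

text = 'a\ne'

# By default '.' refuses to match the newline.
default_match = re.search(r'a.e', text)

# With re.DOTALL, '.' matches every character, including '\n'.
dotall_match = re.search(r'a.e', text, re.DOTALL)

print(default_match)               # → None
print(repr(dotall_match.group()))  # → 'a\ne'
```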

### 3. Metacharacter:  `*`

`*` matches the preceding character 0 or more times.

**CASE 8:**

Find the strings that represent the decimal number `1.1`, with any number of trailing zeros, in the list.

In [34]:
words = ['1.1', '1.01', '1.1000', '0.1100', '1.110', '1.10']   
pattern = r'1.10*'

for word in words:
    match = re.search(pattern, word)
    if match:
        print(word)

1.1
1.1000
1.110
1.10


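The above example fails on `1.110`. The pattern looks for 0 or more `0`s after `1.1`; in `1.110` it finds `1.1` followed by zero `0`s, which still counts as a match, and `re.search` does not care what comes after the matched text.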
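
To see exactly which part of a string was matched, one can inspect `match.group()`. The sketch below also points out a second subtlety of the pattern in CASE 8: its `.` is unescaped, so it matches any character rather than only a decimal point. The input `'1x1'` is a made-up example.

```python
import re

# group() returns the exact text that was matched.
match = re.search(r'1.10*', '1.110')
print(match.group())  # → 1.1

# Unescaped '.' matches any character, not just a decimal point.
loose = re.search(r'1.10*', '1x1') is not None
# Escaped '\.' matches only a literal dot.
strict = re.search(r'1\.10*', '1x1') is not None

print(loose, strict)  # → True False
```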

We can avoid this by requiring that the string ends right after the optional `0`s, and this is where `$` helps us.

### 4. Metacharacter: `$`

`$` matches the end of the string (or the end of each line when `re.MULTILINE` is used).

**CASE 9:**

We'll use the same example as *CASE 8*

In [35]:
words = ['1.1', '1.01', '1.1000', '0.1100', '1.110', '1.10']   
pattern = r'1.10*$'

for word in words:
    match = re.search(pattern, word)
    if match:
        print(word)

1.1
1.1000
1.10
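
One caveat worth knowing: in Python, `$` also matches just before a trailing newline. The sketch below compares it with `\Z`, which matches only at the very end of the string, and with `re.fullmatch`, which anchors both ends. The dot is escaped here so that it matches only a literal decimal point.

```python
import re

# '$' also matches just before a trailing newline.
dollar = re.search(r'1\.10*$', '1.10\n') is not None

# '\Z' matches only at the very end of the string.
strict_end = re.search(r'1\.10*\Z', '1.10\n') is not None

# re.fullmatch anchors both the start and the end.
whole_ok = re.fullmatch(r'1\.10*', '1.1000') is not None
whole_bad = re.fullmatch(r'1\.10*', '21.10') is not None

print(dollar, strict_end, whole_ok, whole_bad)
# → True False True False
```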


### 5. Metacharacter: `^`

We have seen the usage of `^` inside `[ ]`. Now let's see how to use it outside.

`^` matches the beginning of the string (or the beginning of each line when `re.MULTILINE` is used).

**CASE 10:**

Let's find out all the words starting with `k`.

In [38]:
words = ['cat', 'kite', 'KITTEN', 'cold', 'skin', 'kane']   
pattern = r'^k'

for word in words:
    match = re.search(pattern, word, re.IGNORECASE)
    if match:
        print(word)

kite
KITTEN
kane
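
The difference between "start of string" and "start of line" shows up with multi-line text. This sketch assumes the `\w` escape (any word character, not covered above) and a sample `text` built from the words in CASE 10; it also shows that `re.match` is implicitly anchored at the start.

```python
import re

text = "cat\nkite\nskin\nkane"  # four lines in one string

# Without re.MULTILINE, '^' matches only at the start of the whole string.
single = re.findall(r'^k\w*', text)

# With re.MULTILINE, '^' also matches right after each newline.
multi = re.findall(r'^k\w*', text, re.MULTILINE)

print(single)  # → []
print(multi)   # → ['kite', 'kane']

# re.match only tries the start of the string, so no '^' is needed.
starts_with_k = re.match(r'k', 'kite') is not None
inside_k = re.match(r'k', 'skin') is not None
print(starts_with_k, inside_k)  # → True False
```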


### 6. Metacharacter: `|` 


`|` works as `or` in a regex: `A|B` matches either `A` or `B`.

**CASE 11:**

Find the sentences containing `tea` or `coffee`, ignoring case.

In [40]:
words = ['Cat drinks milk', 'He likes coffee', 'He neither like tea or coffee', 'She prefers cold drinks', 'Green Tea is healthy']   
pattern = r'tea|coffee'

for word in words:
    match = re.search(pattern, word, re.IGNORECASE)
    if match:
        print(word)

He likes coffee
He neither like tea or coffee
Green Tea is healthy
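
`|` has very low precedence, which matters once it is combined with other metacharacters. In this sketch (sample sentences invented for illustration), parentheses are used to limit the scope of the alternation.

```python
import re

sentences = ['tea time', 'iced coffee', 'coffee first']  # illustrative data

# Without parentheses this reads as: ('^tea') OR ('coffee' anywhere).
loose = [s for s in sentences if re.search(r'^tea|coffee', s)]

# Parentheses group the alternatives, so '^' applies to both of them.
strict = [s for s in sentences if re.search(r'^(tea|coffee)', s)]

print(loose)   # → ['tea time', 'iced coffee', 'coffee first']
print(strict)  # → ['tea time', 'coffee first']
```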


### 7. Metacharacter: `+`

`+` matches the preceding character 1 or more times.

**CASE 12:**

Find the strings that contain 1 or more `b`s between `a` and `c`.

*eg.  abbbc, abc, aaabccc*

In [41]:
words = ['aaabc', 'abc', 'abccc', 'acb', 'ac', 'aaabbbbbcc']
pattern = r'ab+c'

for word in words:
    match = re.search(pattern, word, re.IGNORECASE)
    if match:
        print(word)

aaabc
abc
abccc
aaabbbbbcc
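
To make the difference between `*` and `+` concrete, the following sketch runs both quantifiers over the same list as CASE 12. Only `*` accepts strings where no `b` sits between `a` and `c`.

```python
import re

words = ['aaabc', 'abc', 'abccc', 'acb', 'ac', 'aaabbbbbcc']

# 'b*' allows zero b's, so 'ac' inside a string is enough.
star = [w for w in words if re.search(r'ab*c', w)]

# 'b+' needs at least one b between 'a' and 'c'.
plus = [w for w in words if re.search(r'ab+c', w)]

# Strings accepted by '*' but rejected by '+'.
only_star = [w for w in star if w not in plus]

print(plus)       # → ['aaabc', 'abc', 'abccc', 'aaabbbbbcc']
print(only_star)  # → ['acb', 'ac']
```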


### 8. Metacharacter: `?`

`?` matches the preceding character 0 or 1 time.

**CASE 13:**

Find the strings that end with `a`, optionally followed by a single `b`.

*eg:  aaab, ab*

In [42]:
words = ['aaab', 'ab', 'abccc', 'acbbbb', 'ac']
pattern = r'ab?$'

for word in words:
    match = re.search(pattern, word, re.IGNORECASE)
    if match:
        print(word)

aaab
ab
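
A common use of `?` is to accept spelling variants. As a small sketch with an invented word list, `colou?r` makes the `u` optional, so it accepts both `color` and `colour` but not a doubled `u` or a missing `o`.

```python
import re

words = ['color', 'colour', 'colouur', 'colr']  # illustrative list
pattern = r'colou?r'

# fullmatch is used so that the whole word has to fit the pattern.
matched = [w for w in words if re.fullmatch(pattern, w)]

print(matched)  # → ['color', 'colour']
```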
