# Matching Strings

# Metacharacters `[] . ^ $ * + ? { } \ | ()`
- Metacharacters don’t match themselves. Instead, they signal that something out of the ordinary should be matched.
- They affect other portions of the RE by repeating them or changing their meaning.
___

## Metacharacter `[]` (character class)
- Specifies a character class: the set of characters we want to match
- Matches any one character from the set, so `[ab]` matches `a` or `b`
___

#### re.search(pattern, string)
- Searches for the pattern in the string and returns a `re.Match` object for the first occurrence, or `None` if there is no match
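
A minimal sketch of inspecting what `re.search` returns. The sample text and variable names are illustrative assumptions, not part of the notes.

```python
import re

text = "regex makes text search easier"  # hypothetical sample text

match = re.search(r"[st]", text)   # first 's' or 't' in the text
miss = re.search(r"[qz]", text)    # neither 'q' nor 'z' occurs in the text

if match:
    # group() is the matched text, span() is its (start, end) position
    print(match.group(), match.span())
print(miss)
```

Because a failed search gives `None`, it is usually safest to test the result before calling `group()` or `span()` on it.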
___

#### re.findall(pattern, string)
- Finds all non-overlapping occurrences of the pattern and returns them as a list
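
A short sketch of `re.findall` on a made-up string (the sample text is an assumption for illustration):

```python
import re

text = "banana bread"  # hypothetical sample text

all_hits = re.findall(r"[ab]", text)   # every 'a' or 'b', in order of appearance
no_hits = re.findall(r"[xyz]", text)   # no match gives an empty list, not None

print(all_hits)
print(no_hits)
```

Unlike `re.search`, the result is a plain list of strings, so positions are not available from it.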
___

#### raw string `r" "` or `r'  '`
- Preceding a string literal with `r` makes it a raw string
- In a raw string, backslashes are kept as literal characters instead of starting an escape sequence
- `'\n'` is a single character (a newline), but `r'\n'` contains two characters: `\` and `n`
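
The difference can be checked directly. The sketch below also uses `\b` as an example of where a raw string changes what the regex engine receives; the sample strings are assumptions for illustration.

```python
import re

normal = "\n"    # one character: a newline
raw = r"\n"      # two characters: a backslash and the letter n
print(len(normal), len(raw))

# In a normal string, "\b" is a backspace character, so the pattern
# never reaches the regex engine as a word-boundary assertion.
plain_match = re.search("\bcat", "a cat")
raw_match = re.search(r"\bcat", "a cat")

print(plain_match)
print(raw_match)
```

This is why regex patterns are commonly written as raw strings.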
___

In [38]:
import re

In [39]:
string_list = ['rsxpxoru hdstyilu eikmcmrh', # random list of strings 
               'ynrknmpm xirpmixm jwdcllmk',
               'lbpvqhxt mksqpptd sgwrndfe',
               'ypzgtwfe aufmbrsh nddblklj',
               'hlqmrghi oiqljjfu jmutxxxe',
               'ggbsarff bixggvkr uqjbwdtu',]
pattern = r"[ab]"  # [] matches a single character: a or b
for index,string in enumerate(string_list, start =1):
    print(index,":",re.search(pattern, string))


1 : None
2 : None
3 : <re.Match object; span=(1, 2), match='b'>
4 : <re.Match object; span=(9, 10), match='a'>
5 : None
6 : <re.Match object; span=(2, 3), match='b'>


## Hyphen `-`: special inside a class (defines a range), but not a metacharacter
- defines a range
  - [a-d] any character from a to d
  - [a-z] any lowercase letter from a to z
  - [A-Z] any uppercase letter from A to Z
  - [0-9] any digit from 0 to 9
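
A small sketch of ranges in action, using a made-up string. The last line relies on the fact that a hyphen placed first inside a class is treated as a literal `-` rather than as a range marker.

```python
import re

text = "Room B12, floor 3"  # hypothetical sample text

lower = re.findall(r"[a-z]", text)    # lowercase letters only
upper = re.findall(r"[A-Z]", text)    # uppercase letters only
digits = re.findall(r"[0-9]", text)   # digits only

# A hyphen at the start of the class is a plain character, not a range
signs = re.findall(r"[-+]", "3-2+1")

print(lower, upper, digits, signs)
```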

#### Caret sign `^` : (metacharacter)
- The caret sign is used to match a pattern at the **start** of a string

In [40]:
start_str = "test caret sign", "caret test sign"
pattern_start = "^test"
for index,string in enumerate(start_str, start =1):
    print(index,":",re.search(pattern_start, string))

1 : <re.Match object; span=(0, 4), match='test'>
2 : None


**Matches only the strings which start with `test`**

#### Dollar sign `$` : (metacharacter)
- is used to search pattern at **end** of a string 

In [42]:
end_str = "test  sign caret", "caret test sign"
pattern_end = "sign$"
for index,string in enumerate(end_str, start =1):
    print(index,":",re.search(pattern_end, string))

1 : None
2 : <re.Match object; span=(11, 15), match='sign'>


**Matches `sign` only when it appears at the end of the string**
___

### **Metacharacters are not active inside a class, except at special locations inside the class**
- In `[a^bz]` the `^` is not at the start of the class, so the class matches any one of the literal characters `a`, `^`, `b` or `z`

In [45]:
str_L2 =["ca^", "xsdf", "cb", "^zad"]
pattern_2 = "[abz^]"
for index,string in enumerate(str_L2, start =1):
    print(index,":",re.search(pattern_2, string))

1 : <re.Match object; span=(1, 2), match='a'>
2 : None
3 : <re.Match object; span=(1, 2), match='b'>
4 : <re.Match object; span=(0, 1), match='^'>


#### Using `^` inside a class and at the beginning, i.e. `[^xyz]`
- When `^` is the first character inside the class, it acts as the complement of the set
- The pattern then matches any character apart from the ones in the set, i.e. anything except `x`, `y` or `z`

In [44]:
str_L3 =["abz", "xsdf", "cb", "^zad"]
pattern_3 = "[^abz]"
for index,string in enumerate(str_L3, start =1):
    print(index,":",re.search(pattern_3, string))

1 : None
2 : <re.Match object; span=(0, 1), match='x'>
3 : <re.Match object; span=(0, 1), match='c'>
4 : <re.Match object; span=(0, 1), match='^'>


**The first string returns `None` because it contains only characters from the excluded set**

## Backslash `\`
- It can be followed by various characters to signal various special sequences
- It is also used to escape metacharacters, removing their special meaning so that they are treated as normal characters
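
A minimal sketch of escaping, using the dot as the metacharacter. The sample text is an assumption for illustration.

```python
import re

price = "cost: 3.50 or 3x50"  # hypothetical sample text

any_char = re.findall(r"3.50", price)   # '.' is a metacharacter: any character fits
literal = re.findall(r"3\.50", price)   # '\.' matches only a real dot

print(any_char)
print(literal)
```

The unescaped pattern also accepts `3x50`, which is rarely what is intended when matching a number.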

### Special sequences beginning with `\`
#### - `\w` : matches any alphanumeric character or underscore (`[a-zA-Z0-9_]` for ASCII text)

In [48]:
str_L4 =["abz", "123", "ABC", "__ab90","#$%"]
pattern_4 = r"\w"  # raw string, so the backslash reaches the regex engine unchanged
for index,string in enumerate(str_L4, start =1):
    print(index,":",re.search(pattern_4, string))

1 : <re.Match object; span=(0, 1), match='a'>
2 : <re.Match object; span=(0, 1), match='1'>
3 : <re.Match object; span=(0, 1), match='A'>
4 : <re.Match object; span=(0, 1), match='_'>
5 : None


**It matches the strings containing a character from `a-z`, `A-Z`, `0-9` or `_` (underscore)**
___

#### - `\W` : matches any non-alphanumeric character, i.e. `[^a-zA-Z0-9_]`

In [83]:
str_L4 =["abz", "123", "ABC", "__ab90","#$%", "_", "."]
pattern_4 = r"\W"  # raw string, so the backslash reaches the regex engine unchanged
for index,string in enumerate(str_L4, start =1):
    print(index,":",re.search(pattern_4, string))

1 : None
2 : None
3 : None
4 : None
5 : <re.Match object; span=(0, 1), match='#'>
6 : None
7 : <re.Match object; span=(0, 1), match='.'>


**5 and 7 match because they contain a character outside `a-z`, `A-Z`, `0-9`, `_`; 6 is only an underscore, so it returns `None`**
___

#### - `\d` (lowercase) matches any decimal digit [0-9]

In [77]:
str_L4 =["abz", "123", "ABC", "__ab90","#$%"]
pattern_5 = r"\d"  # raw string, so the backslash reaches the regex engine unchanged
for index,string in enumerate(str_L4, start =1):
    print(index,":",re.search(pattern_5, string))

1 : None
2 : <re.Match object; span=(0, 1), match='1'>
3 : None
4 : <re.Match object; span=(4, 5), match='9'>
5 : None


**1, 3 and 5 do not have any digits, returning `None`**

#### - `\D` matches any non-digit character i.e complement of [0-9] or in regex terms [^0-9]

In [78]:
str_L4 =["abz", "123", "ABC", "__ab90","#$%"]
pattern_6 = r"\D"  # raw string, so the backslash reaches the regex engine unchanged
for index,string in enumerate(str_L4, start =1):
    print(index,":",re.search(pattern_6, string))

1 : <re.Match object; span=(0, 1), match='a'>
2 : None
3 : <re.Match object; span=(0, 1), match='A'>
4 : <re.Match object; span=(0, 1), match='_'>
5 : <re.Match object; span=(0, 1), match='#'>


**2 has just digits, thus returning `None`**
___

####  - `\s` (lowercase) matches any whitespace character, i.e. `[ \t\n\r\f\v]`
- If the string contains whitespace (including escape sequences such as `\n` or `\t`), the search returns a Match


In [79]:
str_L4 =["abz\n", "123", "ABC", "__ab90","#$%"," ", "df\n", "/ne", "abc"]
pattern_7 = r"\s"  # raw string, so the backslash reaches the regex engine unchanged
for index,string in enumerate(str_L4, start =1):
    print(f'{index} : {string}'.ljust(12), "-->",re.search(pattern_7, string))

1 : abz
     --> <re.Match object; span=(3, 4), match='\n'>
2 : 123      --> None
3 : ABC      --> None
4 : __ab90   --> None
5 : #$%      --> None
6 :          --> <re.Match object; span=(0, 1), match=' '>
7 : df
      --> <re.Match object; span=(2, 3), match='\n'>
8 : /ne      --> None
9 : abc      --> None


**Matches any string containing whitespace, including escape sequences such as `\n`**
___


####  - `\S` matches any non-whitespace character, i.e. `[^ \t\n\r\f\v]`
- If the string has any character other than whitespace, the search returns a Match

In [88]:
str_L4 =["abz", "123", "ABC", "__ab90","\t", " "]
pattern_8 = r"\S"  # raw string, so the backslash reaches the regex engine unchanged
for index,string in enumerate(str_L4, start =1):
    print(index,":",re.search(pattern_8, string))

1 : <re.Match object; span=(0, 1), match='a'>
2 : <re.Match object; span=(0, 1), match='1'>
3 : <re.Match object; span=(0, 1), match='A'>
4 : <re.Match object; span=(0, 1), match='_'>
5 : None
6 : None


**5 and 6 contain only an escape sequence (`\t`) or a space, thus return `None`**
___

####  - `\b` Zero-width assertions
- They don't cause the engine to advance through the string
> later
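
As a brief preview of the `\b` assertion mentioned in this part of the notes, here is a sketch on a made-up string. `\b` matches the position between a word character and a non-word character, so it consumes no text itself.

```python
import re

text = "cat concat cats"  # hypothetical sample text

anywhere = re.findall(r"cat", text)          # also hits inside 'concat' and 'cats'
whole_word = re.findall(r"\bcat\b", text)    # only the standalone word 'cat'

# The assertion alone matches an empty string: start and end are equal
boundary = re.search(r"\b", text)

print(anywhere)
print(whole_word)
print(boundary.span())
```

The equal start and end in the last span is what "zero width" refers to.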
___

### `.`  Dot : matches any character except a newline
- With the `re.DOTALL` flag, it also matches a newline
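
A minimal sketch comparing the default behaviour with `re.DOTALL`, on an assumed three-character string:

```python
import re

text = "a\nb"  # 'a', a newline, 'b'

default = re.findall(r".", text)                    # the newline is skipped
dotall = re.findall(r".", text, flags=re.DOTALL)    # the newline is matched too

print(default)
print(dotall)
```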


In [90]:
str_L4 =["abz\n", "123", "ABC", "__ab90","#$%"," ", "df\n", "\n", "abc"]
pattern_8 = "."
for index,string in enumerate(str_L4, start =1):
    print(f'{index} : {string}'.ljust(12), "-->",re.search(pattern_8, string))

1 : abz
     --> <re.Match object; span=(0, 1), match='a'>
2 : 123      --> <re.Match object; span=(0, 1), match='1'>
3 : ABC      --> <re.Match object; span=(0, 1), match='A'>
4 : __ab90   --> <re.Match object; span=(0, 1), match='_'>
5 : #$%      --> <re.Match object; span=(0, 1), match='#'>
6 :          --> <re.Match object; span=(0, 1), match=' '>
7 : df
      --> <re.Match object; span=(0, 1), match='d'>
8 : 
        --> None
9 : abc      --> <re.Match object; span=(0, 1), match='a'>


**8 is `\n` (newline), thus returns `None`**
___

### The `|` (OR) operator: `A|B` matches either pattern `A` or pattern `B`
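
A short sketch of alternation in the same style as the cells above. The word list and the pattern are assumptions chosen for illustration.

```python
import re

words = ["cat", "dog", "bird"]  # hypothetical sample data
pattern_or = r"cat|dog"         # matches 'cat' or 'dog'

results = [re.search(pattern_or, word) for word in words]
for index, result in enumerate(results, start=1):
    print(index, ":", result)
```

The first two strings match one side of the `|` each, while `bird` matches neither and gives `None`.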