### Regular Expressions (RegEx)

* A regular expression (RegEx) is a language-independent notation for describing a pattern to look for in a larger text.

* It can help us
    * select a portion of a large text
    * verify whether a given text matches a required pattern

* The whole idea is to search for text that fits a pattern inside a larger text.


### A very basic example

In [25]:
country_info='''


Here is a set of information Some countries

Country: India
Continent: Asia
Currency: INR
Capital: New Delhi

Country: USA
Continent: America
Currency: USD
Capital: Washington

Country: Vietnam
Continent: Asia
Currency: DNB
Capital: Hanoi

Country: U.K.
Continent: Europe
Currency: GBP
Capital: London


Country: Ukraine
Continent: Europe
Currency: GBP
Capital: London

Country: France
Continent: Europe
Currency: Euro
Capital: Paris

Country: UAE
Continent: Asia
Currency: Dehram
Capital: Abu Dhabi



'''

### Here we can write a simple text search, like:

* Is Russia present in country_info?
* Is India present in country_info?

In [3]:
print('India' in country_info) # True
print('Russia' in country_info) # False

True
False


### But what about complicated scenarios?

#### Get a list of all currencies?
* find the first place where the word "Currency" appears
* read till the end of that line
* extract the currency name
* search for the next occurrence

In [10]:
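def get_currencies(country_info):
    """Collect the value of every 'Currency: ...' line using plain string methods."""
    currencies = []
    start = 0
    while True:
        # Locate the next "Currency" label at or after the current position.
        start = country_info.find('Currency', start)
        if start == -1:
            break
        # Read to the end of that line (or to the end of the text on the last line).
        end_line = country_info.find("\n", start)
        if end_line == -1:
            end_line = len(country_info)
        # Keep the part after the colon, without surrounding whitespace.
        currency = country_info[start:end_line].split(":")[-1].strip()
        currencies.append(currency)
        # Continue searching after this line.
        start = end_line

    return currencies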

In [11]:
get_currencies(country_info)

['INR', 'USD', 'DNB', 'GBP', 'GBP', 'Euro', 'Dehram']

#### More complex

* find all country names that begin with U?
* find all countries and their capitals
* find all countries in Asia

### Regular expressions can perform wildcard-style matches

* the basic syntax is small.
* https://regex101.com is strongly recommended for trying patterns out

### How to use regular expressions

* we need the standard library module **re**
* it has several functions

    * re.search(pattern,text)
    * re.findall(pattern,text)
    * ...
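
The difference between the two functions listed above can be sketched as follows. The sample text is made up for this illustration and is not the notebook's country_info; the variable names are likewise only for the example.

```python
import re

# Made-up sample text for this sketch.
text = "Country: India\nCapital: New Delhi\nCountry: France\nCapital: Paris\n"

# re.search returns a match object for the first occurrence, or None.
first = re.search("Country", text)
print(first.group())   # the matched text
print(first.span())    # (start, end) positions of the match

missing = re.search("Russia", text)
print(missing)         # None, because the pattern does not occur

# re.findall returns a list with every non-overlapping match.
all_matches = re.findall("Country", text)
print(all_matches)
```

In short, re.search is handy for a yes/no check or for the position of the first hit, while re.findall is handy when every hit is needed.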

### let us use findall

In [13]:
import re

### Find how many times Asia occurs in the given text

* A pattern with no special characters is a simple text search.
* re.findall returns every non-overlapping occurrence of that text.

In [15]:
pattern="Asia"

result=re.findall(pattern, country_info)

print(result)
print(len(result))

['Asia', 'Asia', 'Asia']
3


### Case-insensitive search
* by default the search is case sensitive

In [16]:
pattern="asia"
re.findall(pattern,country_info)

[]

### We can make it case insensitive

In [17]:
pattern='asia'
re.findall(pattern,country_info, re.IGNORECASE)

['Asia', 'Asia', 'Asia']
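
As a small aside, the same flag can be written in a few equivalent ways. The sketch below uses a made-up sample string rather than country_info, and the variable names are only for illustration.

```python
import re

# Made-up sample with three different capitalisations.
text = "Continent: Asia\nContinent: ASIA\nContinent: asia\n"

with_flag = re.findall("asia", text, re.IGNORECASE)  # full flag name
with_alias = re.findall("asia", text, re.I)          # re.I is the short alias
with_inline = re.findall("(?i)asia", text)           # flag written inside the pattern

print(with_flag)
print(with_flag == with_alias == with_inline)
```

Each match is returned exactly as it is spelled in the text, not as it is spelled in the pattern.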

### Find all countries that start with U

* Now we can use wildcards
    * . --> any one character (except a newline, by default)
    * ... --> any three characters
    * [abc] --> any one of these three letters
    * [0-9] --> any one digit
    * [A-Z] --> any one upper case letter
    * \w --> a word character, roughly [A-Za-z0-9_]
    * \s --> any whitespace character
    * \d --> a digit, roughly [0-9]

* Repetitions
    * * --> 0 or more
    * ? --> 0 or 1
    * + --> 1 or more
    * {n} --> exactly n times
    * {2,3} --> 2 or 3 times

* Example
    * Simple mobile number match
        * [0-9]{10}
            * any digit, repeated exactly 10 times

    * If the mobile number must have 10 digits and the first digit can't be 0
        * [1-9][0-9]{9}
            * the first digit can be 1-9
            * the remaining 9 digits can be 0-9
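
The wildcards, repetitions and mobile number patterns listed above can be tried out with a short sketch like the one below. All sample strings and variable names are invented for illustration, and re.fullmatch is used here on the assumption that the whole string should be a mobile number.

```python
import re

# Wildcards and character classes (sample strings are made up).
dot = re.findall(r"c.t", "cat cot cut ct")          # . needs exactly one character
char_set = re.findall(r"[abc]", "a1b2c3d4")         # one of a, b or c
digits = re.findall(r"\d+", "room 12, floor 3")     # one or more digits
optional = re.findall(r"colou?r", "color colour")   # the u may appear 0 or 1 times
ranged = re.findall(r"\d{2,3}", "1 22 333")         # runs of 2 or 3 digits

print(dot, char_set, digits, optional, ranged)

# Mobile number: 10 digits, first digit not 0.
mobile_pattern = r"[1-9][0-9]{9}"
results = {}
for number in ["9876543210", "0123456789", "98765"]:
    # fullmatch succeeds only if the entire string fits the pattern.
    results[number] = re.fullmatch(mobile_pattern, number) is not None

print(results)
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting backslashes such as \d before the re module sees them.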


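#### Find the abbreviated country names that start with U (the "United" ones)

* U[A-Z.]+ means a U followed by one or more upper case letters or dots, so Ukraine (lower case letters after the U) is not matched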

In [26]:
pattern='U[A-Z.]+'
re.findall(pattern,country_info)

['USA', 'USD', 'U.K.', 'UAE']

#### How to eliminate USD from the result

* We have to pick up only the words starting with 'U' that appear after "Country", not after "Currency"

In [27]:
pattern="Country: U[A-Z.]+"
re.findall(pattern, country_info)

['Country: USA', 'Country: U.K.', 'Country: UAE']

### I don't want to include "Country:" in my result

* we have two things
    * selector
        * the whole pattern, which selects a segment of the text
    * group
        * selects a section of that selection
            * we create one by wrapping the part we need in ()
            * when the pattern has a group, re.findall returns the group's text instead of the whole match


In [28]:
pattern="Country: (U[A-Z.]+)"
re.findall(pattern, country_info)

['USA', 'U.K.', 'UAE']
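
Groups also give one way to approach the earlier questions about capitals and about countries in Asia. The sketch below is only one possible approach: it uses a shortened, made-up sample in the same layout as country_info, and it assumes every record has exactly the four lines Country, Continent, Currency, Capital in that order.

```python
import re

# Shortened, made-up sample in the same layout as country_info.
sample = """
Country: India
Continent: Asia
Currency: INR
Capital: New Delhi

Country: France
Continent: Europe
Currency: Euro
Capital: Paris

Country: Japan
Continent: Asia
Currency: JPY
Capital: Tokyo
"""

# One group: findall returns only the captured text of each match.
currencies = re.findall(r"Currency: (.+)", sample)

# Two groups: findall returns one (country, continent) tuple per match.
country_continent = re.findall(r"Country: (.+)\nContinent: (.+)", sample)
asian = [country for country, continent in country_continent if continent == "Asia"]

# Country and capital: the two .+ lines in between are matched but not captured.
country_capital = re.findall(r"Country: (.+)\n.+\n.+\nCapital: (.+)", sample)

print(currencies)
print(asian)
print(country_capital)
```

Because . does not match a newline by default, each (.+) stops at the end of its own line.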

### How do I create a pattern to match

* Indian PAN cards?

* A PAN has
    * 5 letters
    * 4 digits
    * 1 letter

In [31]:
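pan_pattern=r'[A-Za-z]{5}\d{4}[A-Za-z]'

pans=['AXPXM9842C','APXM9842C','A1XM9842C','AXPX99842C','AYPYM9842C']

for pan in pans:
    # fullmatch requires the whole string to fit the pattern;
    # re.match would also accept extra characters after a valid prefix.
    if re.fullmatch(pan_pattern, pan):
        print(f'{pan}\t\tVALID')
    else:
        print(f'{pan}\t\tINVALID')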
        



AXPXM9842C		VALID
APXM9842C		INVALID
A1XM9842C		INVALID
AXPX99842C		INVALID
AYPYM9842C		VALID
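
One point worth keeping in mind when validating whole strings: re.match only checks from the beginning of the string, while re.fullmatch requires the entire string to fit. A minimal sketch of the difference, using a made-up value that has a valid PAN shape followed by three extra characters:

```python
import re

pan_pattern = r'[A-Za-z]{5}\d{4}[A-Za-z]'
too_long = 'AXPXM9842CXYZ'   # made-up value: valid PAN shape plus 3 extra characters

# re.match succeeds because the first 10 characters fit the pattern.
match_result = bool(re.match(pan_pattern, too_long))

# re.fullmatch fails because the trailing 'XYZ' is not covered by the pattern.
fullmatch_result = bool(re.fullmatch(pan_pattern, too_long))

# A string with exactly the PAN shape passes fullmatch.
exact_result = bool(re.fullmatch(pan_pattern, 'AXPXM9842C'))

print(match_result, fullmatch_result, exact_result)
```

For validation tasks, re.fullmatch (or a pattern anchored with ^ and $) is usually the safer choice.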
