## REGEX

A single regular expression, commonly called a regex, is a string
formed according to the regular expression language.

### Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. 

### The re module
The functions in Python's built-in re module fall into three categories:

1) pattern matching,

2) substitution,

3) splitting.
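
As a quick orientation, the sketch below shows one call from each category using the standard-library `re` module. The sample string and the variable names are made up for illustration and are not part of the original notes.

```python
import re

sample = "a1b22c333"  # hypothetical sample string

# 1) pattern matching: collect every run of digits
digits = re.findall(r"\d+", sample)

# 2) substitution: replace every run of digits with '#'
masked = re.sub(r"\d+", "#", sample)

# 3) splitting: cut the string wherever a run of digits occurs
pieces = re.split(r"\d+", sample)

print(digits)
print(masked)
print(pieces)
```

Note that splitting leaves a trailing empty string here because the sample ends with a match.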


### Uses of Regex 
Naturally these are all related; a regex describes a pattern to locate in the
text, which can then be used for many purposes. Let’s look at a simple example:

Suppose we wanted to split a string on a variable number of whitespace characters
(tabs, spaces, and newlines).

##### eg text = "foo bar\t baz \tqux"

In [39]:
import re
import pandas as pd
import numpy as np
text="foo bar\t baz \tqux"

In [2]:
re.split(r'\s+',text)

['foo', 'bar', 'baz', 'qux']

### Compiling regex 

You can compile the regex yourself
with re.compile, forming a reusable regex object:

In [3]:
regex=re.compile(r'\s+')

In [4]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']
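
To illustrate the reuse, here is a minimal sketch that compiles the whitespace pattern once and applies it to several strings. The list `lines` and the name `whitespace` are assumptions made for this example.

```python
import re

# Compile once, outside the loop, then reuse the same object
whitespace = re.compile(r"\s+")

lines = ["foo bar\t baz", "one  two", "a\nb"]  # hypothetical inputs

# Each call reuses the already-compiled pattern
tokens = [whitespace.split(line) for line in lines]
print(tokens)
```

The `re` module also keeps an internal cache of recently used patterns, so the saving is likely to be modest for a handful of patterns; compiling explicitly mainly helps in tight loops and keeps the pattern in one named place.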

## 1) Finding all pattern matches

If, instead, you wanted to get a list of all patterns matching the regex, you can use the
findall method:

In [5]:
regex.findall(text)

[' ', '\t ', ' \t']

### Regex Object

Creating a regex object with re.compile is highly recommended if you intend to
apply the same expression to many strings; doing so will save CPU cycles.

match and search are closely related to findall:

- findall returns all matches in a string,

- search returns only the first match,

- match is more rigid and only matches at the beginning of the string.
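
The difference between the three methods can be sketched on a small string. The pattern `[a-z]+` and the sample `s` below are illustrative assumptions, chosen so that the first match does not sit at position 0.

```python
import re

word = re.compile(r"[a-z]+")
s = "123 abc def"  # hypothetical sample: the first word starts at index 4

all_matches = word.findall(s)  # every non-overlapping match, as strings
first = word.search(s)         # match object for the first match anywhere
at_start = word.match(s)       # only tries at position 0, so None here

print(all_matches)
print(first.group(), first.span())
print(at_start)
```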

In [6]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

In [7]:
# regex.findall(text)

In [8]:
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [9]:
regex=re.compile(pattern,flags=re.IGNORECASE)

In [10]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

## 2a) Search

search returns a special match object for the first email address in the text. 

For the preceding regex, the match object can only tell us the start and end position of the
pattern in the string:

In [11]:
m = regex.search(text)
m

<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>

In [12]:
text[m.start():m.end()]

'dave@google.com'

regex.match returns None, as it will only match if the pattern occurs at the start of the
string:

In [13]:
print(regex.match(text))

None


## 2b) Substitution

#### Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:

In [14]:
print(regex.sub('REDACTED',text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
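
Two variations on sub may be useful here: the `count` argument limits how many matches are replaced, and `subn` also reports how many replacements were made. The sketch below is self-contained; the shortened `contacts` string and the variable names are assumptions for illustration.

```python
import re

email = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)
contacts = "Dave dave@google.com\nSteve steve@gmail.com\n"  # shortened sample

# count=1 replaces only the first match
first_only = email.sub("REDACTED", contacts, count=1)

# subn returns a tuple: (new_string, number_of_replacements)
redacted, n = email.subn("REDACTED", contacts)

print(first_only)
print(redacted, n)
```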



## 3) Segmentation

Suppose you wanted to find email addresses and simultaneously segment each
address into its three components: username, domain name, and domain suffix.

To do this, put parentheses around the parts of the pattern to segment:

In [15]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [16]:
regex=re.compile(pattern,flags=re.IGNORECASE)

A match object produced by this modified regex returns a tuple of the pattern components with its groups method:

In [17]:
m=regex.match('wesm@bright.com')

In [18]:
m.groups()

('wesm', 'bright', 'com')

In [19]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
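
Captured groups can also be referenced in a replacement string with `\1`, `\2`, and so on, or given names with the `(?P<name>...)` syntax. The following is a minimal sketch on a shortened copy of the text; the group names `user`, `domain`, and `suffix` are assumptions chosen for this example.

```python
import re

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

text = "Dave dave@google.com\nSteve steve@gmail.com\n"  # shortened sample

# \1, \2, \3 refer to the first, second, and third captured group
labelled = regex.sub(r"Username: \1, Domain: \2, Suffix: \3", text)
print(labelled)

# Named groups make the components accessible by name via groupdict()
named = re.compile(
    r"(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})",
    flags=re.IGNORECASE,
)
parts = named.search(text).groupdict()
print(parts)
```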

In [20]:
t1='United Kingdom of Great Britain and Northern Ireland19'

In [29]:
pattern = r'([a-zA-Z\s]+)([0-9]+)'

In [30]:
# pattern=r'\d', r' \(([^)]+)\)'

In [31]:
regex=re.compile(pattern)

In [32]:
m=regex.match(t1)

In [33]:
m.group(1)

'United Kingdom of Great Britain and Northern Ireland'

In [34]:
m.group(2)

'19'

In [35]:
regex.findall(t1)

[('United Kingdom of Great Britain and Northern Ireland', '19')]

In [36]:
regex.split(t1)

['', 'United Kingdom of Great Britain and Northern Ireland', '19', '']
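
The empty strings at both ends appear because the pattern matches the entire string, and the captured groups show up because split keeps the text of capturing groups in its result. A small sketch of the contrast with non-capturing groups `(?:...)` follows; the variable names are illustrative.

```python
import re

t1 = "United Kingdom of Great Britain and Northern Ireland19"

# Capturing groups: the text of each group is kept in the result
with_groups = re.split(r"([a-zA-Z\s]+)([0-9]+)", t1)

# Non-capturing groups: the matched text is dropped, leaving only
# the (empty) pieces before and after the match
without_groups = re.split(r"(?:[a-zA-Z\s]+)(?:[0-9]+)", t1)

print(with_groups)
print(without_groups)
```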

In [54]:
# Method two on df column

df = pd.DataFrame({'iso': ['4He', '16O', '197Au']})
df

Unnamed: 0,iso
0,4He
1,16O
2,197Au


In [45]:
# splitting a column into alphabetic and numeric parts

result = df['iso'].str.split(r'(\d+)([A-Za-z]+)', expand=True)
result

Unnamed: 0,0,1,2,3
0,,4,He,
1,,16,O,
2,,197,Au,


In [43]:
result = result.loc[:,[1,2]]
result

Unnamed: 0,1,2
0,4,He
1,16,O
2,197,Au


In [46]:
##### Method 3




In [51]:
df['num']=df['iso'].str.extract(r'(\d+)', expand=False).astype(int)
df

Unnamed: 0,iso,num
0,4He,4
1,16O,16
2,197Au,197


In [62]:
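# Separate the alphabetic part. The first attempt, str.extract('a-zA-z\s+').astype(int),
# raised "ValueError: pattern contains no capture groups": str.extract needs a
# parenthesized group, the character class needs square brackets, and letters
# cannot be cast to int.

df['alpha']=df['iso'].str.extract(r'([A-Za-z]+)', expand=False)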
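
Both parts can also be pulled out in one pass with str.extract and two capture groups. The sketch below assumes isotope labels of the form digits followed by letters; the sample values and the group names `num` and `alpha` are assumptions for illustration. With named groups, pandas uses the group names as column names.

```python
import pandas as pd

df = pd.DataFrame({"iso": ["4He", "16O", "197Au"]})  # hypothetical sample data

# One capture group per output column; every group needs parentheses
parts = df["iso"].str.extract(r"(?P<num>\d+)(?P<alpha>[A-Za-z]+)")

# Only the numeric column is cast to int; the letters stay as strings
parts["num"] = parts["num"].astype(int)
print(parts)
```

Rows that do not match the pattern would come back as missing values, in which case the integer cast would fail and would need to be handled separately.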