# 2.1 Splitting Strings on Any of Multiple Delimiters 

**Problem** <br> 
You need to split a string into fields, but the delimiters aren't consistent throughout the string. <br> 

re.split() is useful here because the separator can be a pattern that matches several different delimiters, unlike str.split(), which takes one fixed separator. <br>
Regular Expression Operations -> https://docs.python.org/3/library/re.html <br>

In [1]:
line = 'asdf fjdk; afed, fjek,asdf, foo'

In [2]:
import re

In [3]:
re.split(r'[;,\s]\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

In [8]:
## when the pattern contains a capture group (parentheses), the matched
### separator text is also included in the result.
f = re.split(r'(;|,|\s)\s*', line)
f

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

In [9]:
# the split values sit at the even positions of the result
values = f[::2]
values

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

In [10]:
delimiters = f[1::2] + ['']
delimiters

[' ', ';', ',', ',', ',', '']

In [11]:
### join again using the same delimiters (trailing whitespace was not captured, so it is lost).
''.join(v+d for v,d in zip(values, delimiters))

'asdf fjdk;afed,fjek,asdf,foo'

In [12]:
## if the separator characters are not needed, use a non-capturing group (?:...)
re.split(r'(?:,|;|\s)\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
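
If the goal is to reassemble the original string exactly, one option is to capture the whole separator, trailing whitespace included. The following is only a minimal sketch that reuses `line` from above; the names `parts`, `separators` and `rebuilt` are illustrative and not part of the original notes.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# Capture the entire separator (delimiter plus any trailing whitespace)
# so that nothing from the original string is thrown away.
parts = re.split(r'([;,\s]\s*)', line)

values = parts[::2]        # fields sit at the even positions
separators = parts[1::2]   # captured separators sit at the odd positions

# Joining every piece in order gives back the original string.
rebuilt = ''.join(parts)
print(values)
print(rebuilt == line)
```

Because every character is either part of a field or part of a captured separator, the join is lossless, unlike the earlier version where the `\s*` sat outside the group.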

# 2.2 Matching Text at the Start or End of a String

**Problem** <br> 
You need to check the start or end of a string for specific text patterns, such as filename extensions, URL schemes and so on. <br> 

Use the *str.startswith()* or *str.endswith()* methods. 

In [1]:
filename = 'spam.txt'

In [2]:
filename.endswith('.txt')

True

In [3]:
filename.startswith('file:')

False

In [4]:
url = 'http://www.python.org'

In [5]:
url.startswith('http:')

True

### When you have multiple choices, provide them as a tuple

In [6]:
import os

In [7]:
filenames = os.listdir('.')
filenames

['README.md',
 'Data Structures and Algorithms.ipynb',
 '.ipynb_checkpoints',
 'Strings and Text .ipynb',
 '.git']

In [8]:
[name for name in filenames if name.endswith(('.ipynb', '.md'))]

['README.md',
 'Data Structures and Algorithms.ipynb',
 'Strings and Text .ipynb']

In [10]:
any(name.endswith('.py') for name in filenames)

False

In [11]:
any(name.endswith('.ipynb') for name in filenames)

True

### Different example

In [12]:
from urllib.request import urlopen

In [14]:
def read_data(name):
    if name.startswith(('http:', 'https:', 'ftp:')):
        return urlopen(name).read()
    
    else:
        with open(name) as f:
            return f.read()
        
## startswith()/endswith() only accept a str or a tuple of str | convert a list or set with tuple() first
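
To make the tuple requirement concrete, here is a small sketch. It assumes the choices arrive as a list (the variable `choices` is hypothetical, for example something loaded from a config file) and shows that the list is rejected until it is converted.

```python
choices = ['http:', 'https:', 'ftp:']   # a list, not a tuple
url = 'http://www.python.org'

try:
    url.startswith(choices)             # a list is not accepted
    list_accepted = True
except TypeError:
    list_accepted = False

# Converting to a tuple first makes the call work.
matches = url.startswith(tuple(choices))
print(list_accepted, matches)
```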

### Using slicing 

(copy pasted)

In [None]:
>>> filename = 'spam.txt'
>>> filename[-4:] == '.txt'
True
>>> url = 'http://www.python.org'
>>> url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'
True

### Using Regex 

(copy pasted )

In [None]:
>>> import re
>>> url = 'http://www.python.org'
>>> re.match('http:|https:|ftp:', url)
<_sre.SRE_Match object at 0x101253098>
>>>
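
The regex version above only covers the start of a string, because `re.match()` always begins matching at position 0. For the end of a string, one possible approach is `re.search()` with a `$` anchor. This is a sketch rather than something from the original notes, and the extension list is made up for illustration.

```python
import re

filename = 'spam.txt'
url = 'http://www.python.org'

# re.match() only tries to match at the beginning of the string.
scheme = re.match(r'(?:http|https|ftp):', url)

# To test the end of a string, use re.search() and anchor the pattern with $.
ext = re.search(r'\.(?:txt|md)$', filename)

print(scheme.group(0))
print(ext.group(0))
```

For simple prefix and suffix checks, `startswith()` and `endswith()` are usually easier to read than either form.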

# 2.3 Matching Strings Using Shell Wildcard Patterns 

**Problem** <br>
You want to match text using the same wildcard patterns that are commonly used in Unix shells (e.g., `*.py`, `Dat[0-9]*.csv`, etc.). <br> 

The fnmatch module provides two functions for this: fnmatch() and fnmatchcase(). <br>

In [15]:
from fnmatch import fnmatch, fnmatchcase

In [17]:
fnmatch('Data Structures and Algorithms.ipynb', '*.ipynb')

True

In [19]:
fnmatch('Data Structures and Algorithms.ipynb', '?Algorithms.ipynb')

False

In [21]:
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

In [22]:
[name for name in names if fnmatch(name, 'Dat*.csv')]

['Dat1.csv', 'Dat2.csv']

In [23]:
addresses = [
'5412 N CLARK ST',
'1060 W ADDISON ST',
'1039 W GRANVILLE AVE',
'2122 N CLARK ST',
'4802 N BROADWAY',
]

In [24]:
from fnmatch import fnmatchcase 

In [25]:
[addr for addr in addresses if fnmatchcase(addr, '* ST')]

['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']

In [26]:
[addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]

['5412 N CLARK ST']
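
The difference between the two functions is case handling: `fnmatch()` normalizes case according to the underlying operating system, while `fnmatchcase()` always compares exactly as written. A short sketch of that, plus `fnmatch.filter()` as an alternative to the list comprehension (imported here under the illustrative alias `fnmatch_filter`):

```python
from fnmatch import fnmatchcase, filter as fnmatch_filter

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# fnmatchcase() is case-sensitive on every platform.
same_case = fnmatchcase('foo.txt', '*.txt')
other_case = fnmatchcase('foo.txt', '*.TXT')
print(same_case, other_case)

# fnmatch.filter() returns the names that match a pattern.
csv_names = fnmatch_filter(names, 'Dat[0-9]*.csv')
print(csv_names)
```

The result of plain `fnmatch('foo.txt', '*.TXT')` is left out on purpose, since it depends on the platform the code runs on.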

# 2.4 Matching and Searching for Text Patterns 

**Problem** <br> 
You want to match or search text for a specific pattern. <br> 

In [27]:
text = 'yeah, but no, but yeah, but no, but yeah'

In [28]:
text == 'yeah'

False

In [29]:
text.startswith('yeah')

True

In [30]:
text.endswith('no')

False

In [31]:
text.find('no') #find the location of first occurrence

10

In [32]:
## matching dates 
text1 = '11/27/2012'
text2 = 'Nov 27, 2012'

In [33]:
import re

if re.match(r'\d+/\d+/\d+', text1):
    print('yes')
    
else:
    print('no')
    


yes


In [35]:
if re.match(r'\d+/\d+/\d+', text2):
    print('yes')
else:
    print('no')

no


#### If you are going to perform many matches with the same pattern, precompile it into a pattern object first. 

In [36]:
datepat = re.compile(r'\d+/\d+/\d+') #pattern to match with 

if datepat.match(text1):
    print('yes')
else:
    print('no')

yes


In [37]:
if datepat.match(text2):
    print('yes')
else:
    print('no')

no


#### match() only checks for a match at the start of a string. To find all occurrences of a pattern anywhere in the text, use findall(). 

In [38]:
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat.findall(text)

['11/27/2012', '3/13/2013']
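
Since `match()` anchors only at the start, it neither finds a date in the middle of the text nor rejects trailing junk. The sketch below contrasts it with `search()` and `fullmatch()`; the sample string `'11/27/2012abc'` is made up for illustration.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# match() fails because the text does not start with a date.
at_start = datepat.match(text)

# search() scans forward and returns the first match anywhere.
first = datepat.search(text)

# match() accepts a date followed by junk; fullmatch() does not.
prefix_ok = datepat.match('11/27/2012abc')
exact = datepat.fullmatch('11/27/2012abc')

print(at_start, first.group(0), prefix_ok.group(0), exact)
```

When an exact match is needed, `fullmatch()` or a pattern ending in `$` avoids accepting strings that merely begin with a date.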

In [39]:
#using capture groups 
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

In [40]:
m = datepat.match('11/27/2012')

In [41]:
m

<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>

In [42]:
m.group(0) # group 0 is the entire match; groups 1..n are the capture groups

'11/27/2012'

In [43]:
m.group(1)

'11'

In [44]:
m.group(2)

'27'

In [45]:
m.group(3)

'2012'

In [46]:
m.groups()

('11', '27', '2012')

In [47]:
month, day, year = m.groups() 

In [48]:
text

'Today is 11/27/2012. PyCon starts 3/13/2013.'

In [49]:
datepat.findall(text)

[('11', '27', '2012'), ('3', '13', '2013')]

In [50]:
for month,day,year in datepat.findall(text):
    print('{}-{}-{}'.format(year, month, day))

2012-11-27
2013-3-13
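
Note that the groups come back as strings, so the second date prints as `2013-3-13` without zero padding. If padded dates are wanted, one option is to convert the groups to integers before formatting. A minimal sketch, with `iso_dates` as an illustrative name:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

iso_dates = []
for month, day, year in datepat.findall(text):
    # Convert each group to int so the format spec can zero-pad it.
    iso_dates.append('{:04d}-{:02d}-{:02d}'.format(int(year), int(month), int(day)))

print(iso_dates)
```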


In [51]:
## alternatively find matches iteratively
for m in datepat.finditer(text):
    print(m.groups())
    
## finditer() yields match objects; m.groups() returns a tuple for each one

('11', '27', '2012')
('3', '13', '2013')
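
Because `finditer()` yields full match objects, more than the group tuple is available for each hit. The sketch below uses named groups, which are an addition to the pattern in these notes, to pull out fields by name along with the position of each match.

```python
import re

# Same date pattern as above, but with named groups (an illustrative variation).
named_datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

found = []
for m in named_datepat.finditer(text):
    start, end = m.span()              # where the match sits in the text
    found.append((text[start:end], m.group('year'), m.groupdict()))

for item in found:
    print(item)
```

Named groups cost a little extra typing in the pattern but make the calling code independent of group order.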
