## String Manipulation

### String Object Methods


In [1]:
val = 'a,b,guido'
val

'a,b,guido'

In [3]:
val.split(',')

['a', 'b', 'guido']

- __'split' is often combined with 'strip' to trim whitespace (including line breaks):__

In [6]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']
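Since `val` above contains no whitespace, `strip` has no visible effect there. A minimal sketch with a hypothetical input (`raw` is an assumed example string, not part of the original data) shows what it removes:

```python
# Hypothetical input with stray spaces and a trailing newline
raw = 'a, b ,  guido \n'

# split alone keeps the whitespace around each piece
unstripped = raw.split(',')

# strip removes leading and trailing whitespace, including the newline
cleaned = [x.strip() for x in raw.split(',')]

print(unstripped)  # ['a', ' b ', '  guido \n']
print(cleaned)     # ['a', 'b', 'guido']
```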

- __These substrings could be concatenated together with a two-colon delimiter using
addition:__

In [8]:
one, two, three = pieces

In [9]:
one + "::" + two + "::" + three

'a::b::guido'

- __This is not good practice for the general case. A faster and more Pythonic way is to pass the list to the 'join' method of the delimiter string:__

In [10]:
'::'.join(pieces)

'a::b::guido'

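- __Other useful operations: the 'in' keyword tests membership (here, whether an item is in the list 'pieces'), and 'replace' substitutes one substring for another:__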

In [12]:
'val' in pieces

False

In [14]:
'guido' in pieces

True

In [16]:
val.replace(',','-')

'a-b-guido'
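A few more built-in string methods are often useful alongside `in` and `replace`. The sketch below reuses `val` from above; the variable names are only illustrative:

```python
val = 'a,b,guido'

# find returns the index of the first occurrence, or -1 if absent
comma_pos = val.find(',')       # 1
missing_pos = val.find(':')     # -1

# index behaves like find but raises ValueError when the substring is absent
try:
    val.index(':')
    raised = False
except ValueError:
    raised = True

# count returns the number of non-overlapping occurrences
n_commas = val.count(',')       # 2

# replacing with an empty string deletes the substring
no_commas = val.replace(',', '')  # 'abguido'
```

Prefer `find` when a missing substring is an expected case, and `index` when it should be treated as an error.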

### Regular Expressions

- __Example: splitting a string on a variable amount of whitespace (spaces and tabs):__

In [19]:
import re

In [20]:
text = "foo bar\t baz \tqux"
text

'foo bar\t baz \tqux'

- __The regex describing one or more whitespace characters is '\s+':__

In [21]:
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']
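When the same pattern is applied to many strings, it can be compiled once with `re.compile` and reused. A minimal sketch, where `whitespace` is just an illustrative name:

```python
import re

text = "foo bar\t baz \tqux"

# Compile the pattern once; the resulting object can be reused
whitespace = re.compile(r'\s+')

# Equivalent to re.split(r'\s+', text)
parts = whitespace.split(text)

# maxsplit limits the number of splits performed
head_and_rest = whitespace.split(text, maxsplit=1)

print(parts)          # ['foo', 'bar', 'baz', 'qux']
print(head_and_rest)  # ['foo', 'bar\t baz \tqux']
```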

In [None]:
re.findall(r'\s+', text)

In [25]:
text = """Dave dave@google.com
          Steve steve@gmail.com
          Rob rob@gmail.com
          Ryan ryan@yahoo.com
       """
text

'Dave dave@google.com\n          Steve steve@gmail.com\n          Rob rob@gmail.com\n          Ryan ryan@yahoo.com\n       '

- __'match' and 'search' are closely related to 'findall'. While 'findall' returns all matches in a string, 'search' returns only the first match. More strictly, 'match' only matches at the beginning of the string.__

In [26]:
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
pattern

'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'

- __'re.IGNORECASE' makes the regex case-insensitive__

In [31]:
regex = re.compile(pattern, flags=re.IGNORECASE)
regex

re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.IGNORECASE|re.UNICODE)

In [32]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
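The difference between `search` and `match` described above can be sketched as follows. The shortened `text` here is an assumed stand-in for the longer string used in the cells above:

```python
import re

# Shortened, hypothetical version of the text above
text = """Dave dave@google.com
Steve steve@gmail.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# search returns a match object for the first email address only
first = regex.search(text)
first_email = text[first.start():first.end()]
print(first_email)  # dave@google.com

# match returns None because the text starts with 'Dave ', not an address
at_start = regex.match(text)
print(at_start)  # None
```

Because both methods return `None` when nothing matches, it is safer to check the result before calling methods such as `start` or `groups` on it.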

- __'sub' returns a new string with occurrences of the pattern replaced by a new string:__

In [34]:
print(regex.sub('REDACTED', text))

Dave REDACTED
          Steve REDACTED
          Rob REDACTED
          Ryan REDACTED
       


- __Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix.__
- __To do this, put parentheses around the parts of the pattern to segment:__

In [35]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex

re.compile(r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})',
re.IGNORECASE|re.UNICODE)

In [36]:
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

In [37]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
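With groups in the pattern, `sub` can also refer to each captured group in the replacement string using `\1`, `\2`, and so on. A minimal sketch on a single hypothetical line (`line` and `labeled` are illustrative names):

```python
import re

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

line = 'Dave dave@google.com'

# \1, \2, \3 refer to the first, second, and third captured groups
labeled = regex.sub(r'Username: \1, Domain: \2, Suffix: \3', line)
print(labeled)  # Dave Username: dave, Domain: google, Suffix: com

# match returns None for non-matching input, so guard before calling groups()
m = regex.match('not an email')
components = m.groups() if m is not None else None
print(components)  # None
```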

### Vectorized String Functions in pandas

- __Cleaning up a messy dataset for analysis often requires a lot of string munging and
regularization__

In [44]:
import numpy as np
import pandas as pd

data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com','Rob': 'rob@gmail.com', 'Wes': np.nan}

In [45]:
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [46]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

- __We can check whether each email address contains 'gmail' with 'str.contains' (missing values stay NaN):__

In [47]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
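Because the result above contains `NaN`, it cannot be used directly as a boolean mask. One way around this, sketched here, is the `na` argument of `str.contains`, which sets the value used for missing entries:

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# na=False treats missing values as "does not contain"
mask = data.str.contains('gmail', na=False)

# The mask can now be used for boolean indexing
gmail_users = data[mask]
print(gmail_users)
```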

In [48]:
pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [49]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object
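When the components are wanted as separate columns rather than lists of tuples, `str.extract` is one option: it returns a DataFrame with one column per captured group. A minimal sketch, where the column names are an assumption added for readability:

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per group; rows with missing input are filled with NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)

# Hypothetical column labels for the three groups
parts.columns = ['username', 'domain', 'suffix']
print(parts)
```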