# 7.4 String Manipulation

1. [Python Built-In String Object Methods](#builtin)
1. [Regular Expressions](#regex)
1. [String Functions in pandas](#pandas)

In [59]:
import pandas as pd
import numpy as np
import re

<a name="builtin"></a>
# Python Built-In String Object Methods

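Built-in methods are often sufficient for the majority of data cleaning tasks.

A summary of the methods shown below:
1. `split`: split a string on a delimiter
1. `strip`: strip leading and trailing whitespace
1. `join`: join strings with a delimiter
1. `in`: check for a substring
1. `find`: return the position of the first occurrence of a substring, or -1 if it is absent
1. `index`: like `find`, but raise `ValueError` if the substring is absent
1. `count`: count non-overlapping occurrences of a substring
1. `replace`: replace occurrences of a substring (like `gsub` in R)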

<img src="./myImages/table7.4_builtInStringMethods.png" width = 600>

In [3]:
# Create a string
val = "a,b,  guido"
val

'a,b,  guido'

In [4]:
# Split on a delimiter
val.split(",")

['a', 'b', '  guido']

In [5]:
# Split and also remove whitespace
pieces = [x.strip() for x in val.split(",")]
pieces

['a', 'b', 'guido']

In [8]:
# Assign pieces to individual variables with tuple unpacking
first, second, third = pieces

In [9]:
# Inefficiently join them back together
first + "::" + second + "::" + third

'a::b::guido'

In [6]:
# Efficiently join them with join (similar to paste)
"::".join(pieces)

'a::b::guido'

In [12]:
# Detect substring with in
"guido" in val

True

In [13]:
# Detect substring with index
val.index(",")

1

In [14]:
# Fail to detect substring with index
val.index(":")

ValueError: substring not found

In [15]:
# Detect substring with find
val.find(",")

1

In [16]:
# Fail to detect with find
val.find(":")

-1
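
The difference between `find` and `index` shown above matters mostly for control flow: `find` lets you test the return value, while `index` needs a `try`/`except`. A minimal sketch, assuming a hypothetical helper named `before_delimiter` that is not part of the original notebook:

```python
# Hypothetical helper: return the text before the first occurrence of a
# delimiter, or the whole string if the delimiter is absent.
def before_delimiter(s, delim):
    pos = s.find(delim)          # find returns -1 instead of raising
    return s if pos == -1 else s[:pos]

val = "a,b,  guido"
print(before_delimiter(val, ","))   # a
print(before_delimiter(val, ":"))   # a,b,  guido

# The same lookup with index needs explicit error handling.
try:
    val.index(":")
    found = True
except ValueError:
    found = False
print(found)                        # False
```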

In [None]:
# Count occurrences
val.count(",")

In [21]:
# String replace
val.replace(",", "::")

'a::b::  guido'

In [22]:
# Delete
val.replace(",", "")

'ab  guido'

<a name="regex"></a>
# Regular Expressions

The built-in `re` module handles regular expressions.

`re` module functions fall into three categories:
1. pattern matching
1. substitution
1. splitting

Note that regexes in Python are *compiled*: you can build a regex object once with `re.compile` and then apply it to many different strings, as shown below. The module-level `re` functions (such as `re.split`) compile the pattern inside the function call. Compiling once up front saves that per-call work when applying the same regex to many strings.
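
As a rough sketch of reusing one compiled pattern across several strings (the `lines` list is made-up sample data, not from the original notebook):

```python
import re

# Compile once, then reuse the same pattern object on many strings.
whitespace = re.compile(r"\s+")

lines = ["foo    bar", "baz\tqux", "one  two   three"]  # made-up sample data
tokens = [whitespace.split(line) for line in lines]
print(tokens)
# [['foo', 'bar'], ['baz', 'qux'], ['one', 'two', 'three']]

# The module-level function gives the same result; the pattern is handled
# inside each call instead of being prepared up front.
same = [re.split(r"\s+", line) for line in lines]
print(tokens == same)  # True
```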

Also, use *raw* string literals (prefix `r`, as in `r"\s+"`) so that backslashes in the pattern do not need to be doubled.
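
A short illustration of what the `r` prefix changes. The decimal-number pattern and the sample sentence are invented for this example:

```python
import re

# Without the r prefix, each backslash meant for the regex engine is doubled.
escaped = "\\d+\\.\\d+"
raw = r"\d+\.\d+"
print(escaped == raw)          # True: both spell the same pattern

# "\t" is a single tab character; r"\t" is a backslash followed by "t".
print(len("\t"), len(r"\t"))   # 1 2

decimals = re.findall(raw, "pi is 3.14, e is 2.72")
print(decimals)                # ['3.14', '2.72']
```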

[Here](https://docs.python.org/3/library/re.html) is a link to the Python regular expression documentation, while [here](https://www.dataquest.io/blog/regex-cheatsheet/) is a link to a cheatsheet (also shown below).

<img src="./myImages/python-regular-expressions-cheat-sheet.png" width = 700>   


<img src="./myImages/table7.5_regexMethods.png" width = 600>

In [24]:
# Example text
text = "foo    bar\t baz  \tqux"
text

'foo    bar\t baz  \tqux'

In [25]:
# Split on whitespace
re.split(r"\s+", text)

['foo', 'bar', 'baz', 'qux']

In [28]:
# Compile the regex first
regex = re.compile(r"\s+")
regex

re.compile(r'\s+', re.UNICODE)

In [29]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [30]:
# Get all the matches to the regex
regex.findall(text)

['    ', '\t ', '  \t']

In [31]:
# Other method:
re.findall(r"\s+", text)

['    ', '\t ', '  \t']

`match`, `search`, and `findall` all look for regex matches, with slightly different behavior:

1. `match` only matches at the beginning of the string
1. `search` scans the string and returns only the first match
1. `findall` returns all non-overlapping matches
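
The three behaviors can be compared side by side. This sketch uses a deliberately tiny pattern (`\w+@`) and a made-up sample string, both chosen here only for illustration:

```python
import re

sample = "Dave dave@google.com, Rob rob@gmail.com"  # made-up sample string
word_at = re.compile(r"\w+@")

at_start = word_at.match(sample)     # anchored at position 0
first = word_at.search(sample)       # first match anywhere
every = word_at.findall(sample)      # all matches, as a list of strings

print(at_start)        # None: the string does not start with a match
print(first.group())   # dave@
print(every)           # ['dave@', 'rob@']
```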

In [35]:
# Example text of emails
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com"""
text

'Dave dave@google.com\nSteve steve@gmail.com\nRob rob@gmail.com\nRyan ryan@yahoo.com'

In [32]:
# Example regex to match most email addresses
# [A-Z0-9._%+-]+ - one or more uppercase letters, digits, or common symbols
# @ - a literal @
# [A-Z0-9.-]+ - one or more uppercase letters, digits, . or -
# \. - a literal dot
# [A-Z]{2,4} - 2 through 4 uppercase letters
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
pattern

'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'

In [33]:
# Compile this regex with IGNORECASE so it matches both upper- and lowercase letters
regex = re.compile(pattern, flags=re.IGNORECASE)
regex

re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.IGNORECASE|re.UNICODE)

In [36]:
# Use find all to get all email addresses
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [37]:
# Use search to get the first match, which is returned as a match object
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [44]:
# Return the beginning of the match
m.start()

5

In [45]:
# Return the end
m.end()

20

In [46]:
# Use the above to grab the matching bit from the main string
text[m.start():m.end()]

'dave@google.com'

[Match objects](https://docs.python.org/3/library/re.html#match-objects) have their own methods. A few common ones:

<img src="./myImages/matchMethods.png" width=600>
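
A brief sketch of a few match object methods. The pattern `\w+@\w+` is a simplification assumed here for brevity, not the email regex used above:

```python
import re

m = re.search(r"\w+@\w+", "Dave dave@google.com")

print(m.group())   # dave@google  (the matched text)
print(m.start())   # 5
print(m.end())     # 16
print(m.span())    # (5, 16)  (start and end as one tuple)
```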

In [39]:
# match only succeeds if the regex matches at the very start of the string.
# text starts with "Dave ", not an email address, so our regex doesn't match
print(regex.match(text))

None


In [43]:
# A pattern of one or more letters does match at the start of the string, and gets only 'Dave'
print(re.match("[A-Za-z]+", text))

<re.Match object; span=(0, 4), match='Dave'>


In [47]:
# Use sub to replace every match (like gsub in R)
print(regex.sub("REDACTED", text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED


A slightly more complicated use of regular expressions is to use `()` to group different portions of the expression. This allows the use of the `groups` method for match objects.

In the above example, the regex `"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"` can be split into `username`, `domain`, `suffix` like so: `"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"`  

Note that neither the `@` nor the `\.` is included within parentheses, so they won't be returned.  

If a grouped regular expression is used with `findall`, a list of tuples is returned, one tuple per match.
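
To make the effect of grouping on `findall` concrete, here is a small comparison. The shortened `\w+` patterns and the two-address string are assumptions made for this sketch:

```python
import re

addresses = "dave@google.com steve@gmail.com"  # made-up sample string

# No groups: findall returns a list of whole-match strings.
no_groups = re.findall(r"\w+@\w+\.\w+", addresses)
print(no_groups)
# ['dave@google.com', 'steve@gmail.com']

# Three groups: findall returns a list of 3-tuples, one per match.
with_groups = re.findall(r"(\w+)@(\w+)\.(\w+)", addresses)
print(with_groups)
# [('dave', 'google', 'com'), ('steve', 'gmail', 'com')]
```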

In [49]:
# Compile new pattern
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

In [52]:
# Match
m = regex.match("wesm@bright.net")
m

<re.Match object; span=(0, 15), match='wesm@bright.net'>

In [54]:
# Use the groups method
m.groups()

('wesm', 'bright', 'net')

In [55]:
regex.findall("wesm@bright.net")

[('wesm', 'bright', 'net')]

In [56]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

Just like in `sed`, these groups can be used in substitutions with `sub`, where each group is referred to by `\number` (e.g. `\1` = `([A-Z0-9._%+-]+)`; `\2` = `([A-Z0-9.-]+)`; `\3` = `([A-Z]{2,4})`).  This makes nicely formatted output easy:

In [57]:
print(regex.sub(r"Username: \1, Domain: \2, Suffix: \3", text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com


<a name="pandas"></a>
# String Functions in pandas

The string and regular expression methods shown above will fail when mapped across data (`data.map`) that have missing values, because a missing value (`NaN`) is a float rather than a string.  

pandas has array-oriented methods that apply these string operations while skipping NA values and propagating them to the result.  

These are accessed through a Series's `str` attribute.
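
The contrast can be sketched as follows. The two-entry Series is a cut-down version of the data used below, and `dtype=object` is set explicitly here (an assumption of this sketch) so the result matches the object-dtype Series in this section:

```python
import numpy as np
import pandas as pd

emails = pd.Series({"Steve": "steve@gmail.com", "Wes": np.nan}, dtype=object)

# A plain Python string test applied with map breaks on the NaN entry,
# because NaN is a float and does not support the `in` operator.
try:
    emails.map(lambda x: "gmail" in x)
    map_failed = False
except TypeError:
    map_failed = True
print(map_failed)  # True

# The str accessor applies the test to the strings and leaves NaN in place.
result = emails.str.contains("gmail")
print(result)
```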

<img src="./myImages/table7.6_seriesStringMethods.png" width = 600>

In [60]:
# Dictionary with missing data
data = {"Dave": "dave@google.com", "Steve": "steve@gmail.com",
        "Rob": "rob@gmail.com", "Wes": np.nan}
data

{'Dave': 'dave@google.com',
 'Steve': 'steve@gmail.com',
 'Rob': 'rob@gmail.com',
 'Wes': nan}

In [61]:
# Convert to a series
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [62]:
# Check for substrings with contains
data.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

In [63]:
# Convert Series to string extension type
data_as_string_ext = data.astype('string')
data_as_string_ext

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                 <NA>
dtype: string

In [64]:
# Same check now returns boolean instead of object
data_as_string_ext.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes       <NA>
dtype: boolean

In [None]:
# Regular expression
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

In [65]:
# Use findall and also the re option of IGNORECASE
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [71]:
# Return a Series with the matches
matches = data.str.findall(pattern, flags=re.IGNORECASE)
matches

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [70]:
# Series type
type(matches)

pandas.core.series.Series

In [69]:
# .str returns a StringMethods accessor; its methods and indexing apply to each element of the Series
type(matches.str)

pandas.core.strings.accessor.StringMethods

In [72]:
type(matches.str[0])

pandas.core.series.Series

In [73]:
matches.str[0]

Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Wes                      NaN
dtype: object

In [74]:
matches.str[0].str.get(1)

Dave     google
Steve     gmail
Rob       gmail
Wes         NaN
dtype: object
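
The chained calls above work because indexing on `.str` is applied to each element, whether that element is a string, a list, or a tuple. A minimal sketch on a made-up Series of tuples (the name `pairs` is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up Series of tuples with one missing value.
pairs = pd.Series([("dave", "google"), ("rob", "gmail"), np.nan])

first = pairs.str[0]        # element 0 of each tuple
second = pairs.str.get(1)   # element 1 of each tuple, via get

print(first.tolist()[:2])   # ['dave', 'rob']
print(second.tolist()[:2])  # ['google', 'gmail']
print(pd.isna(first.iloc[2]), pd.isna(second.iloc[2]))  # True True
```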

In [75]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [78]:
# Slice
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

In [79]:
# Convert to DataFrame with extract
data.str.extract(pattern, flags=re.IGNORECASE)

           0       1    2
Dave    dave  google  com
Steve  steve   gmail  com
Rob      rob   gmail  com
Wes      NaN     NaN  NaN
