# Wrangling Text

When you think of data cleaning, one task that probably comes to mind is wrangling text. After all, when people enter data on a form, or when sources with different formatting conventions are combined, you will likely find yourself standardizing the data to make it consistent. You will also look for values that were lost in translation and are unusable.

In this section we will cover a variety of techniques to wrangle text and perform tasks like finding, replacing, and splitting values. Along the way, we will learn some regular expressions to perform pattern recognition in these tasks. 

First let's bring in our dependencies and look at this dataset from GitHub. Notice how we have some contact information as well as a log of IP addresses for different users. We are going to learn how to perform some common text operations to clean this dataset and enforce some consistency.

In [None]:
import pandas as pd 
import numpy as np 

url = 'https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/unprocessed/contacts.csv'
df = pd.read_csv(url)

df

These are the common string operations in Pandas we can use. Note that these typically accept a regular expression as a pattern, and we will cover this. 

| Function      | Description                                                                 |
|---------------|-----------------------------------------------------------------------------|
| `count()`     | Counts the number of occurrences of a pattern in each string                |
| `contains()`  | Returns a boolean True/False indicating whether a string contains a pattern |
| `replace()`   | Replaces the found patterns in a string with another specified string       |
| `fullmatch()` | Determines if the entire string matches the pattern                         |
| `split()`     | Splits a string into separate strings using the pattern as the separator    |
| `extract()`   | Extracts the capture groups of the first match and packages them into columns |
| `findall()`   | Finds all occurrences of a pattern and packages them into a list            |

But first, we will need to cover a few basics with regular expressions. 

## Regular Expression Basics

If you have ever used wildcards to search for text patterns, regular expressions are similar. **Regular expressions** are a specialized pattern language for matching complex text patterns. They allow matching, splitting, and replacing text based on a standardized pattern syntax. You can find them implemented on hundreds of platforms including Python, Java, and SQL. Even IDEs and text editors such as VSCode, PyCharm, and Notepad++ allow you to search text using regular expressions. They are so useful that Pandas makes them the default pattern convention for many of its aforementioned string methods.

We are going to learn just enough about regular expressions to get through this notebook.

> You can refer to Python's documentation on the `re` package here: https://docs.python.org/3/library/re.html. For a more thorough walkthrough on regular expressions, check out my article with O'Reilly: https://www.oreilly.com/content/an-introduction-to-regular-expressions/

Let's first just use plain Python's `re` library which implements regular expressions. We are going to test our regular expressions with the `fullmatch()` function, and wrap it up in a function called `regex_match()` that will simply print whether the pattern matches the string. It will also do some convenient font color formatting in the output. 

In [None]:
import re

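def red(text):
    # Wrap the text in ANSI escape codes for red font
    return '\033[91m' + text + '\033[0m'

def green(text):
    # Wrap the text in ANSI escape codes for green font
    return '\033[92m' + text + '\033[0m'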

def regex_match(string, pattern):
    result = re.fullmatch(pattern=pattern, string=string)

    if result:
        print(f"{green(string)} Matches {green(pattern)}")
    else:
        print(f"{red(string)} Doesn't Match {red(pattern)}")

To match a single uppercase alphabetic character, use the character range `[A-Z]` as a placeholder for a single character. Note how it is case sensitive, and you can also define arbitrary ranges of letters.

In [None]:
regex_match("C", "[A-Z]") # Match
regex_match("F", "[A-C]") # Doesn't Match
regex_match("3", "[A-Z]") # Doesn't Match 
regex_match("c", "[A-Z]") # Doesn't Match 
regex_match("-", "[A-Z]") # Doesn't Match 

To match both uppercase and lowercase letters, use `[A-Za-z]`. 

In [None]:
regex_match("C", "[A-Za-z]") # Match
regex_match("c", "[A-Za-z]") # Match
regex_match("3", "[A-Za-z]") # Doesn't Match 

We can also use `[0-9]` to match a single digit from 0 to 9, or an arbitrary range of digits such as `[3-6]`.

In [None]:
regex_match("9", "[0-9]") # Match
regex_match("c", "[A-Za-z0-9]") # Match
regex_match("9", "[3-6]") # Doesn't Match
regex_match("C", "[0-9]") # Doesn't Match

You can also specify a set of letters, digits and characters. Below we only qualify the characters A, C, F, 2, 8, or 9. 

In [None]:
regex_match("9", "[ACF289]") # Match
regex_match("C", "[ACF289]") # Match
regex_match("7", "[ACF289]") # Doesn't Match
regex_match("G", "[ACF289]") # Doesn't Match

Letters and digits outside a character range `[ ]` are treated as literal characters in regular expressions. They will match only those exact values.

In [None]:
regex_match("Texas", "Texas") # Match
regex_match("Texas", "Arizona") # Doesn't Match 
regex_match("Texas", "TEXAS") # Doesn't Match 

If you want to match 3 uppercase alphabetic letters, either write `[A-Z]` three times or put `{3}` repetitions next to the character range.  You can also use `{2,3}` to specify a minimum of 2 repetitions and a maximum of `3`. 

In [None]:
regex_match("AEH", "[A-Z][A-Z][A-Z]") # Match
regex_match("AFH", "[A-Z]{3}") # Match
regex_match("AFH", "[A-Z]{2,3}") # Match
regex_match("AF", "[A-Z]{2,3}") # Match
regex_match("A9H", "[A-Z]{2,3}") # Doesn't Match

If you want to match one or more instances of a pattern, put a `+` next to it. For example, `[A-Z]+` will match 1 or more alphabetic uppercase characters.  

In [None]:
regex_match("AEH", "[A-Z]+") # Match
regex_match("AEHSDHHHNHEHHBV", "[A-Z]+") # Match
regex_match("93572", "[0-9]+") # Match
regex_match("AEHSDHHHNHEHHBV", "[A-Z0-9]+") # Match
regex_match("93572", "[A-Z]+") # Doesn't Match
regex_match("AEHSDHHHNHEHHBV", "[0-9]+") # Doesn't Match

Another helpful quantifier is the `?`, which matches 0 or 1 instances of a pattern. For example, we can use it to specify an optional digit in front of two uppercase letters.

In [None]:
regex_match("2GH", "[0-9]?[A-Z]{2}") # Match
regex_match("GH", "[0-9]?[A-Z]{2}") # Match
regex_match("2H", "[0-9]?[A-Z]{2}") # Doesn't Match
regex_match("22H", "[0-9]?[A-Z]{2}") # Doesn't Match

The dot `.` represents a wildcard character, matching any single character (except a newline, by default) including non-alphanumeric characters like punctuation and symbols. If you intend to match a literal dot, put an escaping backslash in front of it: `\.`.

With a wildcard character, you can also put a quantifier like `{3}` or `+` after it to specify 3 characters or one or more characters respectively.

In [None]:
regex_match("A#H", "...") # Match
regex_match("A#H", ".{3}") # Match 
regex_match("A#H", ".+") # Match
regex_match("AH", ".{3}") # Doesn't Match
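
The cell above exercises the wildcard but not the escaped dot. As a minimal sketch using plain `re` (the variable names are chosen just for this illustration), the following contrasts the two and also shows the default newline behavior. The patterns are written as raw strings (`r"..."`) so Python passes the backslash through to the regular expression untouched.

```python
import re

# An unescaped dot is a wildcard, so any character is accepted in that position
wildcard_hit = re.fullmatch(r"outlook.com", "outlookXcom") is not None

# An escaped dot only accepts a literal period
escaped_miss = re.fullmatch(r"outlook\.com", "outlookXcom") is None
escaped_hit = re.fullmatch(r"outlook\.com", "outlook.com") is not None

# By default the dot does not match a newline character
newline_miss = re.fullmatch(r"A.H", "A\nH") is None

print(wildcard_hit, escaped_miss, escaped_hit, newline_miss)
```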

Finally, the last operators we need to know are grouping parentheses `()` and the alternation operator `|`. If I want to only match airport connections from `ABQ` or `DAL` to `HOU` or `PHX`, I could express that with `(ABQ|DAL)-(HOU|PHX)`.

In [None]:
regex_match("ABQ", "(ABQ|DAL)") # Match 
regex_match("ABQ-HOU", "(ABQ|DAL)-(HOU|PHX)") # Match 
regex_match("DAL-HOU", "(ABQ|DAL)-(HOU|PHX)") # Match 
regex_match("DAL-PHX", "(ABQ|DAL)-(HOU|PHX)") # Match 
regex_match("PHX-DAL", "(ABQ|DAL)-(HOU|PHX)") # Doesn't Match 
regex_match("MDW-DAL", "(ABQ|DAL)-(HOU|PHX)") # Doesn't Match 


## Partial String Matches

Let's say we want to find all records with an `Email` containing a domain of `outlook.com`. This is easy enough using the `contains()` function under the `str` property. Note that the pattern string is treated as a regular expression so we need to escape the dot `.` with a backslash `\.`. Otherwise, it will be treated as a wildcard.

In [None]:
df['Email'].str.contains(r'outlook\.com', regex=True)

Since one of the values for email is `NaN`, we will need to handle it if we are to use this as a filtering mask. We can do that by passing `na = False` to the `contains()` function. This will cause missing values to be treated as `False`. 

In [None]:
df[df['Email'].str.contains(r'outlook\.com', regex=True, na=False)]
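
To see the effect of `na=False` in isolation, here is a minimal sketch on a made-up Series (the `emails` values are hypothetical and not taken from the contacts dataset). How the missing entry appears without `na=False` can depend on the pandas version and the column's dtype, so the sketch only relies on the behavior with `na=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the contacts dataset
emails = pd.Series(["amy@outlook.com", np.nan, "bo@gmail.com"])

# Without na=False, the missing entry may come back as a missing value
# rather than False, depending on the pandas version and dtype
raw_mask = emails.str.contains(r"outlook\.com", regex=True)

# With na=False, the missing entry becomes False, giving a clean boolean mask
clean_mask = emails.str.contains(r"outlook\.com", regex=True, na=False)

print(raw_mask.tolist())
print(clean_mask.tolist())

# The clean mask can be used directly as a filter
print(emails[clean_mask].tolist())
```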

## Full String Matches

Let's say we want to hunt down invalid IP addresses. While we can [get wildly specific and elaborate with IPv4 patterns](https://stackoverflow.com/questions/5284147/validating-ipv4-addresses-with-regexp), let's keep it simple.

Below is a simplistic regular expression to match an IP address. We use the `fullmatch()` function to qualify the IP address string in full.

In [None]:
ipAddressRegex = r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

df['IP_ADDRESS'].str.fullmatch(ipAddressRegex)
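
To make the difference between a partial and a full match concrete, here is a minimal sketch that runs the same pattern through `contains()` and `fullmatch()`. The `values` Series is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical values for illustration only
values = pd.Series(["10.0.0.1", "host 10.0.0.1 (internal)", "10.0.0"])
ip_pattern = r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

# contains() is satisfied when the pattern appears anywhere in the string
partial = values.str.contains(ip_pattern, regex=True)

# fullmatch() requires the whole string to fit the pattern
full = values.str.fullmatch(ip_pattern)

print(partial.tolist())
print(full.tolist())
```

The second value passes `contains()` because an IP-shaped substring is present, but it fails `fullmatch()` because of the surrounding text.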

> Typically, you only need to make your regular expression specific enough to capture what you're looking for in the data. If you do not know your data well, you will want to err on the side of being more specific.

Let's use this in a condition to find the IP addresses that don't match. Sure enough, we have one broken IP address that has more than 3 digits between the `.` separators.

In [None]:
df[df['IP_ADDRESS'].str.fullmatch(ipAddressRegex) == False]
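
Keep in mind that this simple pattern checks the shape of an IP address, not the numeric range of each part. The sketch below illustrates the gap using plain `re`; the helper names `looks_like_ip()` and `is_valid_ipv4()` are hypothetical and written only for this example, and the range check is done in ordinary Python rather than in the regular expression.

```python
import re

ip_pattern = r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

def looks_like_ip(text):
    # Shape check only: four groups of 1 to 3 digits separated by dots
    return re.fullmatch(ip_pattern, text) is not None

def is_valid_ipv4(text):
    # Hypothetical stricter check: shape first, then each part must be 0-255
    if not looks_like_ip(text):
        return False
    return all(int(part) <= 255 for part in text.split("."))

print(looks_like_ip("999.999.999.999"))  # shape is fine
print(is_valid_ipv4("999.999.999.999"))  # but the parts are out of range
print(is_valid_ipv4("192.168.0.1"))
print(looks_like_ip("1234.1.1.1"))       # too many digits in the first part
```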

Here's another example finding invalid US phone numbers. Note how we qualify the first 3 digits, then the next 3, and then the final 4 digits. The pattern allows for variants that may or may not contain hyphens `-`, parentheses around the area code `( )`, and spaces. Sure enough, we find three broken phone numbers.

In [None]:
df[df['Phone'].str.fullmatch(r"\(?[0-9]{3}\)?[ -]?[0-9]{3}[ -]?[0-9]{4}") == False]

Let's go ahead and only include rows in our dataframe that have valid phone numbers and IP addresses. 

In [None]:
df = df[df['Phone'].str.fullmatch(r"\(?[0-9]{3}\)?[ -]?[0-9]{3}[ -]?[0-9]{4}")]

df = df[df['IP_ADDRESS'].str.fullmatch(ipAddressRegex)]

df

Finally, let's identify all invalid email addresses. An email needs to have a series of alphanumeric characters (with some allowable symbols like dot `.`), followed by the `@` symbol, then the domain. We will also treat `na` as false so that missing email addresses are captured too.

In [None]:
df[df['Email'].str.fullmatch(r'[.A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z]+', na=False) == False]

So we find two email addresses that are missing or broken. Lily's email is missing a domain! We will remove those two instances from the dataframe. 

In [None]:
df = df[df['Email'].str.fullmatch(r'[.A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z]+', na=False)]

df

## Finding All Matches

We can also use `findall()` to look for all partial matches of a regular expression and return them as a Series of lists. Below we extract the email domain from each value in the `Email` column. Note that this pattern assumes a three-letter top-level domain such as `.com`.

In [None]:
df['Email'].str.findall(r'[A-Za-z0-9]+\.[A-Za-z]{3}$')

If we wanted to gather the unique domains, we can join the "lists" of single items into a string and then qualify the unique values. 

In [None]:
df['Email'].str.findall(r'[A-Za-z0-9]+\.[A-Za-z]{3}$').str.join("").unique()
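
Two related methods from the table at the top, `count()` and `extract()`, are not demonstrated elsewhere in this notebook. Here is a minimal sketch on a made-up Series (the sample values and the column names `User` and `Domain` are chosen only for this illustration). Where `findall()` lists the matches, `count()` reports how many there are, and `extract()` places each parenthesized group of the first match into its own column.

```python
import pandas as pd

# Hypothetical sample data
emails = pd.Series(["amy.lee@outlook.com", "bo@gmail.com"])

# count() returns how many times the pattern occurs in each string
dot_counts = emails.str.count(r"\.")

# extract() puts each parenthesized group of the first match in its own column
parts = emails.str.extract(r"([.A-Za-z0-9]+)@([A-Za-z0-9]+\.[A-Za-z]+)")
parts.columns = ["User", "Domain"]

print(dot_counts.tolist())
print(parts)
```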

## Replacing Matches

Let's say we want to clean up phone numbers by removing any extraneous dashes `-`, parentheses `()`, and spaces ` `. We can do that by using a regular expression character set `[- ()]`. Note that we make the dash `-` the first character so it is not interpreted as a range operator. We also include a space ` ` in the set so spaces are captured.

In [None]:
df['Phone'].str.replace(r"[- ()]", "", regex=True)
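
Note that `str.replace()` returns a new Series rather than modifying the column in place, so the result needs to be assigned if you want to keep it. Going one step further than this notebook strictly requires, the sketch below rebuilds the stripped digits into a single consistent format. It assumes made-up phone values and relies on backreferences (`\1`, `\2`, `\3`), which refer back to the parenthesized groups in the pattern.

```python
import pandas as pd

# Hypothetical sample data
phones = pd.Series(["(555) 123-4567", "555 123 4567", "5551234567"])

# Step 1: strip dashes, spaces, and parentheses
digits = phones.str.replace(r"[- ()]", "", regex=True)

# Step 2: rebuild a consistent format; \1, \2, \3 are the three groups
formatted = digits.str.replace(
    r"([0-9]{3})([0-9]{3})([0-9]{4})", r"\1-\2-\3", regex=True
)

print(digits.tolist())
print(formatted.tolist())
```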

## Splitting Text 

A powerful tool we can use to split text into columns is the `str.split()` function. We provide a pattern that can be a simple separator (like commas `,`) or a full-on regular expression pattern.

Here is how we can split each email address into separate columns for the user name and the domain. We can then rename these columns and append them back to our dataframe.

In [None]:
df['Email'].str.split("@", expand=True, regex=False)
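
The renaming and appending steps are not shown in the cell above, so here is one way they might look. This is a minimal sketch on a made-up dataframe; the names `contacts`, `EmailUser`, and `EmailDomain` are assumptions for illustration. The last lines also show a split driven by a regular expression rather than a literal separator.

```python
import pandas as pd

# Hypothetical sample data
contacts = pd.DataFrame({
    "Name": ["Amy", "Bo"],
    "Email": ["amy.lee@outlook.com", "bo@gmail.com"],
})

# Split on the literal "@"; expand=True returns a DataFrame with columns 0 and 1
parts = contacts["Email"].str.split("@", expand=True, regex=False)

# Give the new columns readable names (names chosen for this example)
parts.columns = ["EmailUser", "EmailDomain"]

# Append the new columns alongside the original ones, aligned on the index
contacts = pd.concat([contacts, parts], axis=1)
print(contacts)

# A regular expression separator: a comma or semicolon, optionally followed by a space
tokens = pd.Series(["a, b;c"]).str.split(r"[,;] ?", regex=True)
print(tokens.tolist())
```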

When you use regular expression features like look-aheads, it opens up more powerful splitting capabilities based on surrounding characters. This is beyond the scope of this notebook. 

## Exercise

Complete the code below by replacing the question mark `?`. Replace it with a regular expression operation to identify records that are missing a street number in the dataframe.

In [None]:
import pandas as pd

df = pd.DataFrame({
    "CUSTOMER_NAME" : ["Rex Tooling", "Prairie Construction", "Banke Logistics"],
    "STREET_ADDRESS" : ["147 Collie Way", "56 Samson Dr", "Elijah Blvd"]
})

df[? == False]

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
import pandas as pd

df = pd.DataFrame({
    "CUSTOMER_NAME" : ["Rex Tooling", "Prairie Construction", "Banke Logistics"],
    "STREET_ADDRESS" : ["147 Collie Way", "56 Samson Dr", "Elijah Blvd"]
})

df[df["STREET_ADDRESS"].str.fullmatch("[0-9]+ [A-Za-z0-9]+ (Way|Blvd|Dr|St)") == False]