# Searching, Splitting, and Replacing Text 

In the previous sections we mastered the fundamentals of regular expressions. Now let's see how we can use this knowledge to accomplish some common tasks.

## Compiling a Regex

Previously we used shorthand functions like `match`, `fullmatch`, and `search` from the `re` package. These are fine for one-off matches, but there will be situations where you want to reuse a regular expression. A regular expression has to be compiled into a mini-program before it can run, and that compilation step takes time. This is why, when you intend to use a regular expression multiple times, you will want to compile it once and save the result.

Below we compile a regular expression that looks for websites that may start with `http` or `https` and end with `.com`, `.org`, or `.gov`.

In [None]:
import re

web_pattern = re.compile(r'(https?://)?(www\.)?([a-z0-9]+)\.(com|org|gov)')

We can now use this compiled and reusable `Pattern` object for multiple tasks. We can, for example, pass it as the `pattern` argument in place of a string. This way `fullmatch` will not spend any time on compilation.

In [None]:
re.fullmatch(pattern=web_pattern, string="https://www.anaconda.com") is not None
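
A compiled `Pattern` object also carries its own `fullmatch`, `match`, and `search` methods, so it can be called directly instead of being passed to the `re` functions. The sketch below redefines the pattern so it runs on its own; the `candidates` list is made up for illustration.

```python
import re

# Same pattern as above, repeated so this sketch is self-contained
web_pattern = re.compile(r'(https?://)?(www\.)?([a-z0-9]+)\.(com|org|gov)')

# Hypothetical inputs, chosen only to show a mix of hits and misses
candidates = ["https://www.anaconda.com", "python.org", "ftp://example.net"]

# Call fullmatch() on the Pattern itself and reuse it for every candidate
results = {c: web_pattern.fullmatch(c) is not None for c in candidates}

for candidate, is_match in results.items():
    print(candidate, is_match)
```

The pattern is compiled once and reused for every string in the list.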

## Scanning a Document

Once a document is loaded into a string, we can call the `finditer()` method on a `Pattern` object to find every [`Match` object](https://docs.python.org/3/library/re.html#re.Match) in that document. We can then iterate over those results in a `for` loop.

In [None]:
urls = """
Here are a few websites below: 

https://www.yawmanflight.com
http://microsoft.com
https://youtube.com
https://www.anaconda.com

These are non-commercial sites: 
https://www.python.org
https://whitehouse.gov 
"""

matches = web_pattern.finditer(urls)

for match in matches:
    print(match[0])



Interestingly, `match[0]` returns the full match, while the indices after that return the text captured by each pair of parentheses `( )`, numbered from left to right. For example, the fourth group of parentheses `(com|org|gov)` in our pattern captures the top-level domain. We can access it using `match[4]`.

In [None]:
matches = web_pattern.finditer(urls)

for match in matches:
    print(match[4])

Read more in the [Match object documentation](https://docs.python.org/3/library/re.html#re.Match) to learn about other methods available. 
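
As a small sketch of a few of those methods, the example below uses `group()`, `groups()`, and `span()` on a match. The sample sentences are invented for illustration, and the pattern is repeated so the code runs on its own.

```python
import re

web_pattern = re.compile(r'(https?://)?(www\.)?([a-z0-9]+)\.(com|org|gov)')

# Hypothetical sample text
match = web_pattern.search("Visit https://www.python.org today")

full_text = match.group(0)    # same as match[0]
all_groups = match.groups()   # tuple holding every captured group
position = match.span()       # (start, end) indices of the full match

print(full_text)
print(all_groups)
print(position)

# Optional groups that did not take part in the match come back as None
bare_groups = web_pattern.search("microsoft.com").groups()
print(bare_groups)
```

Because the first two groups are optional, it is worth checking for `None` before using them.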

## Splitting

Regular expressions can offer some interesting capabilities when it comes to splitting data. 

Let's load up the famous machine learning Iris dataset. While typically we would use Pandas to load tabular data (which we will discuss in the next section), let's learn a few tricks from scratch. 

In [None]:
import urllib.request

urllib.request.urlretrieve(
    r"https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/classification/iris.csv",
    "iris.csv"
)

filename = 'iris.csv'

# The with-block closes the file automatically, even if reading fails
with open(filename, encoding="utf-8") as file:
    text = file.read()

print(text)

We have now loaded the entire dataset into a single string, `text`. A common approach is to split the string on newlines to get the rows, then split each row on commas to get its values.

In [None]:
for row in text.split("\n"):
    print(row.split(","))

But what if we only wanted to separate the last column, which holds the species? This is an opportunity to use a regular expression as our separator. We can use a `,` followed by a suffix `(?=[a-z]+$)`, which matches only a comma that is followed by lowercase alphabetic characters `[a-z]+` and then the end of the string `$`. The suffix is checked but not consumed, so the species name stays in the output.

In [None]:
split_pattern = re.compile(r",(?=[a-z]+$)")

for row in text.split("\n"):
    print(re.split(split_pattern, row))

Perfect! With regular expressions you can split strings on much more elaborate and context-driven separators. 
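
To see the effect on a single row without downloading the file, here is a minimal sketch. The `row` string is typed in by hand in the style of the Iris data, and the `measurements` and `species` names are only for illustration.

```python
import re

split_pattern = re.compile(r",(?=[a-z]+$)")

# Hand-typed sample row, assumed to resemble a line of iris.csv
row = "5.1,3.5,1.4,0.2,setosa"

# Only the last comma qualifies, so the split yields exactly two pieces
measurements, species = split_pattern.split(row)

print(measurements)
print(species)
```

The numeric columns stay together as one string, which could then be split on plain commas if needed.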

## Replacing

Let's return to our previous example with the websites. 

In [None]:
urls = """
Here are a few websites below: 

https://www.yawmanflight.com
http://microsoft.com
https://youtube.com
https://www.anaconda.com

These are non-commercial sites: 
https://www.python.org
https://whitehouse.gov 
"""

matches = web_pattern.finditer(urls)

for match in matches:
    print(match[0])

Let's say we want to clean up the document and replace `http` with `https`. We do not want to replace the `http` inside strings that are already `https`, so we will make sure it is not followed by an "s". This can be done using the suffix `(?=[^s])`, which requires the next character to be anything other than `s`.

In [None]:
fix_https = re.sub(pattern="http(?=[^s])", 
                   repl="https", 
                   string=urls)
print(fix_https)

Note that `re.sub` takes [additional parameters](https://docs.python.org/3/library/re.html#re.sub): `flags`, as well as `count`, which is the maximum number of replacements to make.
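
A brief sketch of both parameters follows. The `text` string is a made-up example, not part of the document above.

```python
import re

# Hypothetical sample with one upper-case scheme
text = "HTTP://a.com http://b.com http://c.com"

# count=1 stops after the first replacement (matching is case-sensitive here)
first_only = re.sub(pattern=r"http(?=[^s])", repl="https", string=text, count=1)

# re.IGNORECASE lets the pattern match "HTTP" as well
any_case = re.sub(pattern=r"http(?=[^s])", repl="https", string=text,
                  flags=re.IGNORECASE)

print(first_only)
print(any_case)
```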

Now let's say we want to inject a `www.` where it is missing. To achieve this, we need what is called a **negative lookahead**, a suffix that must *not* be present for the match to qualify. We will match two slashes `//` that are not followed by `www`, which can be expressed as `//(?!www)`.

In [None]:
fix_www = re.sub(pattern="//(?!www)", 
                 repl="//www.", 
                 string=fix_https)
print(fix_www)
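
A negative lookahead could arguably express the earlier `http` fix as well. The sketch below compares `http(?=[^s])` with `http(?!s)` on a made-up string. The difference shows up when `http` is the very last thing in the text: the first form needs some following character to exist, while the second only requires that the next character, if there is one, is not an `s`.

```python
import re

# Hypothetical sample that ends with a bare "http"
text = "http://microsoft.com uses http"

# Requires a following non-"s" character, so the trailing "http" is skipped
with_lookahead = re.sub(r"http(?=[^s])", "https", text)

# Only forbids a following "s", so the trailing "http" is replaced too
with_negative = re.sub(r"http(?!s)", "https", text)

# Strings that are already https are left untouched
unchanged = re.sub(r"http(?!s)", "https", "https://youtube.com")

print(with_lookahead)
print(with_negative)
print(unchanged)
```

Which behavior is preferable depends on the data, so treat this as a comparison and not a recommendation.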

## Exercise

Split the string below on commas, but only on commas that sit between two digits. Replace the question mark `?` in the code below with your regular expression.

In [None]:
import re

split_pattern = re.compile(?)

re.split(pattern=split_pattern, string="6.1,2.9,4.7,1.4,versicolor")

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
import re

split_pattern = re.compile(r"(?<=[0-9]),(?=[0-9])")

re.split(pattern=split_pattern, string="6.1,2.9,4.7,1.4,versicolor")
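
The answer pairs a lookbehind `(?<=[0-9])`, a prefix that must appear right before the comma, with a lookahead `(?=[0-9])` that must appear right after it. As a quick check, a sketch like the following stores the result in a variable (`parts` is just an illustrative name) and prints it.

```python
import re

split_pattern = re.compile(r"(?<=[0-9]),(?=[0-9])")

parts = split_pattern.split("6.1,2.9,4.7,1.4,versicolor")

# The last comma is followed by a letter, so it is not a split point
print(parts)
```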