# Regular Expressions in Pandas, SQL, and NLP

In this section we will look at a few practical places where regular expressions can be applied through Python libraries. Regular expressions are supported in many tools, and these examples should give a sense of how they are used with some common libraries.

## Pandas 

When you import a CSV file in a Python environment, you will typically use Pandas.

In [5]:
import pandas as pd 

url = r"https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/classification/iris.csv"
df = pd.read_csv(url)
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


Recall how in the last section we manually separated the `species` column from the rest of the data. We can achieve the same result by passing a regular expression to the `sep` argument. We also need to tell Pandas to use the `python` engine, because the default C engine does not support regular expression separators.

In [9]:
# The lookahead splits only on the last comma, the one followed by the species name
pd.read_csv(url, sep=r",(?=[a-z]+$)", engine='python')

Unnamed: 0,"sepal_length,sepal_width,petal_length,petal_width",species
0,"5.1,3.5,1.4,0.2",setosa
1,"4.9,3.0,1.4,0.2",setosa
2,"4.7,3.2,1.3,0.2",setosa
3,"4.6,3.1,1.5,0.2",setosa
4,"5.0,3.6,1.4,0.2",setosa
...,...,...
145,"6.7,3.0,5.2,2.3",virginica
146,"6.3,2.5,5.0,1.9",virginica
147,"6.5,3.0,5.2,2.0",virginica
148,"6.2,3.4,5.4,2.3",virginica
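
To see why this separator splits only at the last comma, it can help to apply the same pattern to a single line with Python's `re.split`. The sketch below uses two sample lines shaped like the data above. It only illustrates the lookahead and makes no claim about how Pandas parses the file internally.

```python
import re

# Split on a comma only when it is followed by lowercase letters
# that run to the end of the line (a positive lookahead).
pattern = re.compile(r",(?=[a-z]+$)")

# Sample lines shaped like the iris data (assumed for illustration)
header = "sepal_length,sepal_width,petal_length,petal_width,species"
row = "5.1,3.5,1.4,0.2,setosa"

header_parts = pattern.split(header)
row_parts = pattern.split(row)

print(header_parts)
print(row_parts)
```

Each line splits into exactly two pieces because the lookahead checks what follows the comma without consuming it, and only the final comma is followed by nothing but lowercase letters.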


Going back to our original DataFrame with its columns separated as expected, suppose we want to match a regular expression against a field. We can use the `str.match()` function to return a Boolean Series and then use it to keep only the matching records. Below we match only species that start with a `v` and have an `r` as the third character, as specified by the regex `^v[a-z]r.*`. Because `str.match()` only matches from the start of each string, the leading `^` and trailing `.*` are optional here.

In [18]:
df[df['species'].str.match(r"^v[a-z]r.*")]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
53,5.5,2.3,4.0,1.3,versicolor
54,6.5,2.8,4.6,1.5,versicolor
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


Sure enough, we get the records where the species is `versicolor` or `virginica`.
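
The anchoring behavior of `str.match()` is easier to see next to its siblings `str.contains()` and `str.fullmatch()`. The following is a minimal sketch on a hypothetical Series, with values chosen only to show where the three methods differ. It is not part of the iris data.

```python
import pandas as pd

# Hypothetical sample values, picked to show the anchoring differences
species = pd.Series(["versicolor", "virginica", "setosa", "overripe", "var"])

pattern = r"v[a-z]r"

starts_with = species.str.match(pattern)      # must match at the start of the string
anywhere = species.str.contains(pattern)      # may match anywhere in the string
whole_value = species.str.fullmatch(pattern)  # must match the entire string

print(starts_with.tolist())
print(anywhere.tolist())
print(whole_value.tolist())
```

Here `overripe` is picked up only by `str.contains()`, and `var` is the only value that `str.fullmatch()` accepts.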

This example may be slightly contrived, but we can also replace text that matches a regular expression with different text. Below we take that regex pattern and replace the three matched letters with "XXX". This could be helpful if you are trying to mask sensitive information like social security numbers.

In [27]:
df['species'].str.replace("^v[a-z]r", "XXX", regex=True)

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    XXXginica
146    XXXginica
147    XXXginica
148    XXXginica
149    XXXginica
Name: species, Length: 150, dtype: object
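
As a sketch of the masking use case mentioned above, the example below applies `str.replace()` to a few made-up strings. The sample values and the `ddd-dd-dddd` layout are assumptions for illustration, and a real data set may need a stricter or looser pattern.

```python
import pandas as pd

# Made-up records; these are not real social security numbers
notes = pd.Series([
    "SSN 123-45-6789 on file",
    "no identifier provided",
    "primary 987-65-4321, secondary 111-22-3333",
])

# Three digits, two digits, four digits. The last four digits are captured
# so the replacement can keep them through the \1 backreference.
ssn_pattern = r"\b\d{3}-\d{2}-(\d{4})\b"

masked = notes.str.replace(ssn_pattern, r"XXX-XX-\1", regex=True)
print(masked.tolist())
```

Keeping the last four digits is a design choice for this sketch. Replacing with a fixed string such as `"XXX-XX-XXXX"` would hide the value entirely.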

There are a lot of places that accept regular expressions in Pandas, so be sure to keep an eye out for regex-related parameters in the functions you use!
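
Two more regex-aware entry points are `DataFrame.filter(regex=...)`, which selects columns by name, and `Series.str.extract()`, which pulls capture groups into new columns. The sketch below uses a small hypothetical frame that only borrows the iris column names.

```python
import pandas as pd

# Small hypothetical frame mirroring some of the iris column names
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "sepal_width": [3.5, 3.2],
    "petal_length": [1.4, 4.7],
    "species": ["setosa", "versicolor"],
})

# Keep only the columns whose names begin with "sepal_"
sepal_cols = sample.filter(regex=r"^sepal_")

# Capture the first three letters of each species; the named group
# becomes the column name of the resulting DataFrame
prefix = sample["species"].str.extract(r"^(?P<prefix>[a-z]{3})")

print(list(sepal_cols.columns))
print(prefix["prefix"].tolist())
```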

## SQL 

Regular expressions can also be used in SQL queries. Below we download a SQLite database, connect to it, and read its `CUSTOMER` table through Pandas.

In [28]:
import urllib.request
import sqlite3
import pandas as pd 

urllib.request.urlretrieve("https://github.com/thomasnield/anaconda_intro_to_sql/blob/main/company_operations.db?raw=true", "company_operations.db")
conn = sqlite3.connect('company_operations.db')


sql = "SELECT * FROM CUSTOMER"

pd.read_sql(sql, conn)

Unnamed: 0,CUSTOMER_ID,CUSTOMER_NAME,ADDRESS,CITY,STATE,ZIP,CATEGORY
0,1,Alpha Medical,18745 Train Dr,Dallas,TX,75021,INDUSTRIAL
1,2,Oak Cliff Base,2379 Cliff Ave,Abbevile,LA,70510,GOVERNMENT
2,3,Sports Unlimited,1605 Station Dr,Alexandrai,LA,71301,COMMERCIAL
3,4,Riley Sporting Goods,9854 Firefly Blvd,Austin,TX,78701,COMMERCIAL
4,5,Lite Industrial,462 Roadrunner Blvd,Houston,TX,77254,INDUSTRIAL
5,6,Prairie Sports Center,689 Stadium Way,Tulsa,OK,74101,COMMERCIAL
6,7,Facility 95,2396 Runway Dr,Oklahoma City,OK,73101,GOVERNMENT
7,8,Allen Stadium,573 HIllcrest Rd,Allen,TX,75002,COMMERCIAL
8,9,Dent Research,392 45th St,Waco,TX,76700,INDUSTRIAL
9,10,Gamma Solutions,2752 27th St,Phoenix,AZ,85001,COMMERCIAL


In [34]:
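import re 

# SQLite has a REGEXP operator but no built-in implementation for it,
# so we register a Python function; "X REGEXP Y" calls regexp(Y, X)
def regexp(pattern, string):
    # NULL values arrive as None and are treated as non-matches
    if string is None:
        return 0
    return 1 if re.search(pattern, string) else 0

conn.create_function('regexp', 2, regexp)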

In [35]:
# Find addresses containing Dr or Ave; re.search matches anywhere, so the leading .* is optional
sql = "SELECT * FROM CUSTOMER WHERE ADDRESS REGEXP '.*(Dr|Ave)'"

pd.read_sql(sql, conn)

Unnamed: 0,CUSTOMER_ID,CUSTOMER_NAME,ADDRESS,CITY,STATE,ZIP,CATEGORY
0,1,Alpha Medical,18745 Train Dr,Dallas,TX,75021,INDUSTRIAL
1,2,Oak Cliff Base,2379 Cliff Ave,Abbevile,LA,70510,GOVERNMENT
2,3,Sports Unlimited,1605 Station Dr,Alexandrai,LA,71301,COMMERCIAL
3,7,Facility 95,2396 Runway Dr,Oklahoma City,OK,73101,GOVERNMENT
