# Regular Expressions (regex)

Pages we need to visit:

- Example: http://dof.ca.gov/Forecasting/Economics/Economic_and_Revenue_Updates
- Live Testing: https://regex101.com/
- Tip sheet: http://sandeepmj.github.io/regex-email/regex-table
- Common pattern finder: https://regexlib.com/Default.aspx

Some dummy text:

In [None]:
## place dummy text into a variable called mydoc
mydoc = '''
12534 127 ab aba abba sandeepj@bloomberg.net abbba, abbbbba, (518) 469-4581 abcde.The dog is a not a hog. ABA ABBA ABBBA.


Ab_CD123  123456 and 12456	tor 12531245134562. 123867584789. $40.44 or $3 or $52,583.08 or $610,235.11

The cat sat down and called 514-957-3453 while the other caaaaaat purred. This cat is in California while this caaaat is in Iraq, but none are in ct. My dog prefers cat food to dog food but hates fish food.My food tastes yummy!

AB_cd 	<+>-.,!@# $%^&*();\/|_^@1# (917) 488-5410

*!dsar2d1

I told him to search the thesaurus where sandeep.junnarkar@journalism.cuny.edu he'd be able to sjnews@gmail.com find words like them.

abcdefgczhijklmnopqrstuvwxyz	ABCDEFGCZHIJKLMNOPQRSTUVWXYZ

A h0g is a hog.

Dog and dog and DOG. His number is 415.458.9163.

&^%@ 129

abba



'''

In [None]:
!python --version

In [None]:
mydoc

In [None]:
import re
import pandas as pd

In [None]:
## Run this cell
some_text = "The dog barked at the other dog."

## ```search```

- returns the first instance of the search pattern

#### Method 1. 
#### ```some_result = re.search(some_pattern, some_string)```

In [None]:
what = re.search("dog", some_text)
what

#### Method 2.

####  ```pattern_var = re.compile("some_pattern")```

#### ```a_result = pattern_var.search(some_string)```

In [None]:
my_pattern = re.compile("dog")
some_result = my_pattern.search(some_text)
some_result

#### The two methods return the same result, but method 2 stores the compiled pattern in a variable so you can reuse it without retyping it.

#### ```re.compile``` has other benefits we will get to soon.
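
As a minimal sketch of that equivalence, the snippet below runs both methods on the same sentence used above. The names `sample`, `direct`, `dog_pattern` and `compiled` are chosen here for illustration only.

```python
import re

sample = "The dog barked at the other dog."

# Method 1: pass the pattern string straight to re.search
direct = re.search("dog", sample)

# Method 2: compile once, then call .search on the compiled pattern
dog_pattern = re.compile("dog")
compiled = dog_pattern.search(sample)

# Both find the first "dog" at the same position
print(direct.span(), compiled.span())
print(direct.group() == compiled.group())
```

Python's `re` module also caches recently used patterns, so the speed difference is often small in short scripts. The clearer gain from compiling is that the pattern gets a name and can be reused.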

In [None]:
## what type of object is it?
type(what)

In [None]:
## pull out location from the match object
what.span()

#### ```.group()``` returns the matched text itself

In [None]:
## pull out the matched pattern itself
what.group()

In [None]:
## let's look for something that we know is NOT in the text
hello_kitty = re.search("cat", some_text)
hello_kitty

#### You can build conditional logic based on what is returned -- even if nothing is returned.

In [None]:
if hello_kitty is None:
    print("Your pattern was not found so we do step X...")
elif hello_kitty.group() == "cat":
    print("Your pattern was found so we do step Y")

In [None]:
some_text = "The dog barked at the other dog and cat."
hello_kitty = re.search("cat", some_text)

In [None]:
if hello_kitty is None:
    print("Your pattern was not found so we do step X...")
elif hello_kitty.group() == "cat":
    print("Your pattern was found so we do step Y")

#### Let's actually use a regex pattern

In [None]:
## find runs of word characters: letters, digits and underscores (no other symbols)
pat = re.compile(r'\w+')

In [None]:
## let's run it on our mydoc string
mymatch = re.search(pat,mydoc)
mymatch

In [None]:
## pull out the actual match
mymatch.group()

## ```findall```

- Returns a list of all found items
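
What each list item looks like depends on the capture groups in the pattern. The sketch below uses a made-up sentence (the second phone number is invented for illustration) to show the three cases: no groups, one group, and several groups.

```python
import re

# Sample text; the second number is a made-up placeholder
sample = "Call 514-957-3453 or 212-555-0199."

# No capture groups: each item is the full matched string
no_groups = re.findall(r"\d{3}-\d{3}-\d{4}", sample)

# One capture group: each item is only the captured part
one_group = re.findall(r"(\d{3})-\d{3}-\d{4}", sample)

# Several capture groups: each item is a tuple, one entry per group
many_groups = re.findall(r"(\d{3})-(\d{3})-(\d{4})", sample)

print(no_groups)
print(one_group)
print(many_groups)
```

This is why a pattern with several groups, when run through `findall`, gives back tuples rather than plain strings.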

In [None]:
print(pat.findall(mydoc))

In [None]:
## find any "a" followed by 1 or 2 "b"
pat1 = re.compile('ab{1,2}')

In [None]:
pat1

In [None]:
print(pat1.findall(mydoc))

In [None]:
## find any "a" followed by 1 or 2 "b" ignore case
pat2 = re.compile('ab{1,2}',
                  re.I)

In [None]:
print(pat2.findall(mydoc))

In [None]:
## find "dog" lower case only
pat3 = re.compile('dog')

In [None]:
print(pat3.findall(mydoc))

### Flags

- ```re.IGNORECASE``` or ```re.I``` to ignore case
- ```re.MULTILINE``` or ```re.M``` so that ```^``` and ```$``` match at the start and end of every line
- ```re.DOTALL``` or ```re.S``` so that the period also matches newlines
- ```re.VERBOSE``` or ```re.X``` to spread a regex across several lines with whitespace and comments

<a href="https://docs.python.org/3/library/re.html#re.ASCII">More on flags</a>.
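
A short sketch of what each flag changes, using a three-line sample string invented for this example (all variable names here are illustrative):

```python
import re

sample = "Dog one\ndog two\nDOG three"

# re.I: ignore case
any_case = re.findall(r"dog", sample, re.I)

# re.M: ^ matches at the start of every line, not just the start of the string
line_starts = re.findall(r"^dog", sample, re.I | re.M)
string_start = re.findall(r"^dog", sample, re.I)

# re.S: the period also matches the newline between "one" and "dog"
without_dotall = re.search(r"one.dog", sample)
with_dotall = re.search(r"one.dog", sample, re.S)

# re.X: whitespace and # comments inside the pattern are ignored
tel_pattern = re.compile(r"""
    \d{3}   # area code
    -       # separator
    \d{3}   # next three digits
    -       # separator
    \d{4}   # last four digits
    """, re.X)

print(any_case)
print(line_starts, string_start)
print(without_dotall, repr(with_dotall.group()))
print(tel_pattern.findall("called 514-957-3453 while"))
```

Flags can be combined with `|`, as in `re.I | re.M` above.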

In [None]:
## find "dog" ignore case
pat4 = re.compile('dog',
                 re.IGNORECASE)

In [None]:
## same pattern, using the short alias re.I
pat4 = re.compile('dog',
                 re.I)

In [None]:
print(pat4.findall(mydoc))

In [None]:
## find standalone three-digit numbers whose middle digit is 2 (527 matches; 349 and 1132611 do not)
pat5 = re.compile(r"(\b\d2\d\b)")

In [None]:
print(pat5.findall(mydoc))

## ```finditer```

- Returns an iterator of match objects instead of a list.
- Matches are produced one at a time, which saves memory on large texts.
- Extracting the data takes an extra step (a loop or a list comprehension).
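
One consequence worth seeing early: an iterator can be walked through only once. The sketch below uses a short made-up string and illustrative variable names to show that a second pass over the same iterator yields nothing.

```python
import re

sample = "127 ab 129 abba 527"
three_digit = re.compile(r"\b\d2\d\b")

# finditer hands back an iterator, not a list
matches = three_digit.finditer(sample)

# The first pass consumes the iterator
first_pass = [m.group() for m in matches]

# The second pass finds it already used up
second_pass = [m.group() for m in matches]

print(first_pass)
print(second_pass)
```

To loop again, call `finditer` again to get a fresh iterator.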

In [None]:
## let's run pat5 on mydoc but using finditer
mymatches = re.finditer(pat5, mydoc)

In [None]:
## call matches
mymatches

In [None]:
type(mymatches)

In [None]:
for match in mymatches:
    print(match)

<img src="../support_files/groups.png">
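
To make the group numbering concrete, here is a small sketch with three capture groups. The pattern and variable names are assumptions made for this example; the sentence is taken from the dummy text.

```python
import re

sample = "His number is 415.458.9163."

# Three capture groups: area code, next three digits, last four digits
tel = re.compile(r"(\d{3})\.(\d{3})\.(\d{4})")
m = tel.search(sample)

whole = m.group()        # the entire match
also_whole = m.group(0)  # group(0) is the same as group()
area = m.group(1)        # first capture group only
middle = m.group(2)      # second capture group only
every_group = m.groups() # tuple of all capture groups

print(whole, also_whole)
print(area, middle)
print(every_group)
```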

In [None]:
## group() with no argument returns the entire match
target_number = [match.group() for match in re.finditer(pat5, mydoc)]
target_number

In [None]:
## group(0) is the same as group(): it returns the entire match
target_number = [match.group(0) for match in re.finditer(pat5, mydoc)]
target_number

In [None]:
## group(1) always returns ONLY the 1st capture group
## if there were more groups, for example a second group, we'd pull it out using group(2)
target_number = [match.group(1) for match in re.finditer(pat5, mydoc)]
target_number

### Finding phone numbers

In [None]:
## pattern for finding tel. in the format: xxx-yyy-zzzz
telpat1 = re.compile(r'\d{3}-\d{3}-\d{4}')

In [None]:
## find and store what is found with that pattern
foundtel1 = telpat1.findall(mydoc)
foundtel1

In [None]:
## pattern for finding tel. in the format: (xxx) yyy-zzzz
telpat2 = re.compile(r'\(\d{3}\)\s\d{3}-\d{4}')

In [None]:
## print what is found with that pattern
print(telpat2.findall(mydoc))

In [None]:
## find and store what is found with that pattern
foundtel2 = telpat2.findall(mydoc)
foundtel2

In [None]:
## pattern covering three US formats: (xxx) yyy-zzzz, xxx-yyy-zzzz and xxx.yyy.zzzz
telpat3 = re.compile(r'((\(\d{3}\) ?)|(\d{3}\.)|(\d{3}-))\d{3}(-?|\.?)?\d{4}')

In [None]:
## find all and store
foundtel3 = telpat3.findall(mydoc)

In [None]:
## type
type(foundtel3)

In [None]:
## call it
foundtel3

In [None]:
## finditer returns an iterator that is used up after one pass,
## so rerun this cell every time you change the group number in the next cell
telfind3 = re.finditer(telpat3, mydoc)
type(telfind3)


In [None]:
## change the number inside group() below to demo each capture group
numbs = []
for tel in telfind3:
#     print(tel)
    numbs.append(tel.group())
    print(tel.group())

In [None]:
type(telfind3)

In [None]:
## CALL THE LIST
numbs

In [None]:
## list comprehension version
## switching group numbers is easier here because the iterator is recreated on every run
telphones = [match.group(0) for match in re.finditer(telpat3, mydoc)]
telphones

In [None]:
## the iterator was used up above, so recreate it before running the next cell
telfind3 = re.finditer(telpat3, mydoc)
type(telfind3)


In [None]:
## this works directly on telfind3 only if the cell above was just run, because an iterator can be consumed once
telphones = [match.group() for match in telfind3]
telphones

In [None]:
telpat2 = r"((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}"

mydoc2 = ("mydoc = '''12534 127  ab abba abbba, abbbbba, (518) 469-4581 abcde.The dog is a not a hog. ABA ABBA ABBBA.\n\n"
	"Ab_CD123  123456 and 12456	tor 12531245134562. 123867584789. $40.44 or $3\n\n"
	"The cat sat down 514-957-3453 while the other caaaaaat purred. This cat is in California while this caaaat is in Iraq, but none are in ct.\n\n"
	"AB_cd 	<+>-.,!@# $%^&*();\\/|_^@1#\n\n"
	"*!dsar2d1\n\n"
	"I told him to search the thesaurus where he'd be able to find words like them.\n\n"
	"abcdefgczhijklmnopqrstuvwxyz	ABCDEFGCZHIJKLMNOPQRSTUVWXYZ\n\n"
	"A h0g is a hog.\n\n"
	"Dog and dog and DOG.\n\n"
	"&^%@ 129\n\n"
	"abba\n\n"
	"'''")

telfound2 = re.finditer(telpat2, mydoc2, re.MULTILINE)

for match_num, match in enumerate(telfound2, start=1):
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")


## Find all emails

In [None]:
## pattern for finding emails
emailpat1 = re.compile(r'([a-z0-9][-a-z0-9_\+\.]*[a-z0-9])@([a-z0-9][-a-z0-9\.]*[a-z0-9]\.(arpa|root|aero|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|um|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)|([0-9]{1,3}\.{3}[0-9]{1,3}))')

In [None]:
emails1 = emailpat1.findall(mydoc)
emails1

In [None]:
## run it to find emails
emails = [match.group() for match in re.finditer(emailpat1, mydoc)]
emails

## Find key data points from multiple documents

- This approach only works when the documents share a nearly identical structure.
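
When one document departs from that structure, a pattern can fail to match. The sketch below is one possible way to guard against that. The helper `first_group` and the text in `sample_doc` are hypothetical (the real contents of the files in `docs/` are not shown here). The three patterns are the ones used in the cells that follow, except that the decision-date pattern writes `/` without the backslash, which matches the same text.

```python
import re

# Hypothetical stand-in for the contents of one file
sample_doc = """Date of Request: March 3, 2019
Case #: 1234567A
"""

DATE_PAT = re.compile(r"request:\s(\w+\s\d{0,2},\s\d{4})")
CASE_PAT = re.compile(r"case\s#:\s(\d{7}\w)")
DECISION_DATE_PAT = re.compile(r"dated: albany, new york (\d{1,2}/\d{1,2}/\d{4})")

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

# Lowercase the text first, as the cells below do
text = sample_doc.lower()

record = {
    "request_date": first_group(DATE_PAT, text),
    "case_number": first_group(CASE_PAT, text),
    "decision_date": first_group(DECISION_DATE_PAT, text),  # not in the sample
}
print(record)
```

A missing field shows up as `None` in the record instead of stopping the loop with an `IndexError`.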

In [None]:
import glob

In [None]:
## path to documents
path = "docs/*.txt"
myfiles = sorted(glob.glob(path))
myfiles

In [None]:
## quick read reminder
for file in myfiles:
    
    with open(file, "r") as document:
        print(document.read())

In [None]:
## find date pattern
for file in myfiles:
    
    with open(file, "r") as document:
        all_text = document.read()
        all_text = all_text.lower()
        ## Matches dates of appeals request
        date_pat = re.compile(r'request:\s(\w+\s\d{0,2},\s\d{4})')
        date = [match.group(1) for match in re.finditer(date_pat, all_text)]
        print(date)

In [None]:
## find date pattern and store findings in a list
request_dates_list = []
for file in myfiles:
    
    with open(file, "r") as document:
        all_text = document.read()
        all_text = all_text.lower()
        ## Matches dates of appeals request
        date_pat = re.compile(r'request:\s(\w+\s\d{0,2},\s\d{4})')
        date = [match.group(1) for match in re.finditer(date_pat, all_text)]
        request_date = date[0]
        print(request_date)
        request_dates_list.append(request_date)
        

In [None]:
request_dates_list

In [None]:
## add case number pattern and store findings in a list
request_dates_list = []
case_numbers_list = []
for file in myfiles:
    
    with open(file, "r") as document:
        all_text = document.read()
        all_text = all_text.lower()
        ## Matches dates of appeals request
        date_pat = re.compile(r'request:\s(\w+\s\d{0,2},\s\d{4})')
        date = [match.group(1).replace('request:', "").strip() for match in re.finditer(date_pat, all_text)]
        request_date = date[0]
        print(request_date)
        request_dates_list.append(request_date)
        
        case_pat = re.compile(r'case\s#:\s(\d{7}\w)')
        case = [match.group(1).replace('case #:', "").strip() for match in re.finditer(case_pat, all_text)]
        case_num = case[0]
        case_numbers_list.append(case[0])
        print(case_num)

In [None]:
case_numbers_list

In [None]:
## find decision pattern and store findings in a list

request_dates_list = []
case_numbers_list = []
decisions_list = []
for file in myfiles:
    
    with open(file, "r") as document:
        all_text = document.read()
        all_text = all_text.lower()
        ## Matches dates of appeals request
        date_pat = re.compile(r'request:\s(\w+\s\d{0,2},\s\d{4})')
        date = [match.group(1).replace('request:', "").strip() for match in re.finditer(date_pat, all_text)]
        request_date = date[0]
        print(request_date)
        request_dates_list.append(request_date)
        
        case_pat = re.compile(r'case\s#:\s(\d{7}\w)')
        case = [match.group(1).replace('case #:', "").strip() for match in re.finditer(case_pat, all_text)]
        case_num = case[0]
        case_numbers_list.append(case[0])
        print(case_num)
        
        decision_pat = re.compile(r'decision:\n.+(is\s\w+)')
        decision = [match.group(1).replace('is ', "").strip() for match in re.finditer(decision_pat, all_text)]
        print(decision[0])
        decisions_list.append(decision[0])

In [None]:
decisions_list

In [None]:
case_numbers_list

In [None]:
## find decision date pattern and store findings in a list
## added source list too

request_dates_list = []
case_numbers_list = []
decisions_list = []
decision_date_list =[]
source_list = []
for file in myfiles:
    source_list.append(file)
    with open(file, "r") as document:
        all_text = document.read()
        all_text = all_text.lower()
        ## Matches dates of appeals request
        date_pat = re.compile(r'request:\s(\w+\s\d{0,2},\s\d{4})')
        date = [match.group(1) for match in re.finditer(date_pat, all_text)]
        request_date = date[0]
        print(request_date)
        request_dates_list.append(request_date)
        
        case_pat = re.compile(r'case\s#:\s(\d{7}\w)')
        case = [match.group(1) for match in re.finditer(case_pat, all_text)]
        case_num = case[0]
        case_numbers_list.append(case[0])
        print(case_num)
        
        decision_pat = re.compile(r'decision:\n.+is\s(\w+)')
        decision = [match.group(1) for match in re.finditer(decision_pat, all_text)]
        print(decision[0])
        decisions_list.append(decision[0])
        
        decision_date_pat = re.compile(r"dated: albany, new york (\d{1,2}\/\d{1,2}\/\d{4})")
        decision_date = [match.group(1) for match in re.finditer(decision_date_pat, all_text)]
        print(decision_date[0])
        decision_date_list.append(decision_date[0])

In [None]:
decision_date_list

In [None]:
source_list

In [None]:
## zip it all together
final_decision = []
for (request_date, case_number, decision, decision_date, source) in zip(request_dates_list, case_numbers_list, decisions_list, decision_date_list, source_list):
    decision_dict = {"request_date": request_date, "case_number": case_number, "decision": decision, "decision_date": decision_date, "source": source}
#     print(decision_dict)
    final_decision.append(decision_dict)
    
final_decision

In [None]:
def export2csv(a_list, filename):
    '''
    provide list name first
    provide filename as a string
    '''
    df = pd.DataFrame(a_list)
    df.to_csv(filename, encoding='utf-8', index=False)
    print(f"{filename} is in your project folder!")

In [None]:
export2csv(final_decision,"decisions.csv")