## RegEx Review

- http://www.pyregex.com

## Activity:
- Try to convert every single stand-alone instance of 'i' to 'I' in the corpus. Make sure not to change the 'i' occurring within a word:

In [1]:
import re 

processed_sent = re.sub(r'\si\s', ' I ', 'when i go outside, i will enjoy my time')
print(processed_sent)

when I go outside, I will enjoy my time
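
The pattern `\si\s` requires whitespace on both sides, so it would miss an 'i' at the very start or end of the text, or one followed by punctuation. A minimal sketch of an alternative using the word-boundary anchor `\b` is shown below; the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample: 'i' at the start, in the middle, and at the end.
sample = "i think, therefore i am; so am i"

# \b matches the empty position between a word character and a non-word
# character (or the string edge), so no surrounding whitespace is required
# or consumed. The 'i' inside "think" is not touched.
fixed = re.sub(r"\bi\b", "I", sample)
print(fixed)
```

Because `\b` is zero-width, the replacement is just `'I'` rather than `' I '`.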


## Activity:

- Find all phone numbers (with area code or not) in a text

In [62]:
txt = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is." + \
"If you need my office line, it's 215-895-2185." 


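# Raw strings (r"...") keep backslashes such as \s intact for the regex engine.
numbers_with_area_code = re.findall(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", txt)
print(numbers_with_area_code)
# The following commented-out line would return ['867-5307', '895-2185'], which is not what we want:
# '895-2185' is only the tail of 215-895-2185.
# numbers_wo_area_code = re.findall(r"[0-9]{3}-[0-9]{4}", txt)
# Requiring whitespace (\s) before the number excludes that tail. Remove the () in the
# following line and see what happens: the leading whitespace becomes part of each match.
numbers_wo_area_code = re.findall(r"\s([0-9]{3}-[0-9]{4})", txt)
print(numbers_wo_area_code)
print(numbers_with_area_code + numbers_wo_area_code)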

['215-895-2185']
['867-5307']
['215-895-2185', '867-5307']
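
The whitespace-anchored pattern above would miss a seven-digit number at the very start of the text or right after an opening parenthesis. One possible alternative, sketched here with lookarounds, rejects a candidate only when it is glued to more digits or hyphens. The `txt` below repeats the notebook's sample with a space added between the two sentences.

```python
import re

txt = ("Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is. "
       "If you need my office line, it's 215-895-2185.")

# (?<![\d-]) : not preceded by a digit or hyphen, so the '895-2185' tail of
#              '215-895-2185' is rejected.
# (?![\d-])  : not followed by a digit or hyphen.
local_only = re.findall(r"(?<![\d-])\d{3}-\d{4}(?![\d-])", txt)
print(local_only)
```

Lookarounds are zero-width, so no capturing group is needed to trim the match.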


In [2]:
txt = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is." + \
"If you need my office line, it's 215-895-2185." 



# By grouping and using a flexible `{1,2}` repetition, we can match numbers with or without an area code.
# Note: the group must be non-capturing, (?:...). With a capturing group, findall would
# return only the group's last repetition (e.g. '895-') instead of the full number.
numbers = re.findall(r"(?:[0-9]{3}-){1,2}[0-9]{4}", txt)
print(numbers)

['867-5307', '215-895-2185']


In [45]:
txt = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is." + \
"If you need my office line, it's 215-895-2185." 

# Match the pattern xxx-xxx-xxxx; with no groups, findall returns the whole match
numbers = re.findall(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", txt)
print(numbers)
# Match xxx-xxx-xxxx, but return only the first three digits (the capturing group)
numbers = re.findall(r"([0-9]{3})-[0-9]{3}-[0-9]{4}", txt)
print(numbers)
# Match xxx-xxx-xxxx and return the whole number: (?:...) groups without capturing
numbers = re.findall(r"(?:[0-9]{3})-[0-9]{3}-[0-9]{4}", txt)
print(numbers)
# Match xxx-xxx-xxxx, but return only the captured xxx-xxxx part
numbers = re.findall(r"[0-9]{3}-([0-9]{3}-[0-9]{4})", txt)
print(numbers)
# A repeated capturing group keeps only its last repetition, so the full numbers are lost
numbers = re.findall(r"([0-9]{3}-){1,2}[0-9]{4}", txt)
print(numbers)

['215-895-2185']
['215']
['215-895-2185']
['895-2185']
['867-', '895-']
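
What `re.findall` returns depends on how many capturing groups the pattern has: the whole match with none, the group's text with one, and a tuple per match with several. When groups are needed but the full match is also wanted, `re.finditer` is one way around this, as sketched below. The sample text and pattern are illustrative only.

```python
import re

txt = "Call 867-5307 or 215-895-2185."

# Optional area code, exchange, and line number, each in its own group.
pattern = re.compile(r"(?:(\d{3})-)?(\d{3})-(\d{4})")

# With several capturing groups, findall returns one tuple per match.
print(pattern.findall(txt))

# finditer yields match objects; group(0) is always the full match,
# regardless of how many groups the pattern defines.
full = [m.group(0) for m in pattern.finditer(txt)]
print(full)
```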


## Activity:

- Find any money amount (represented by $ with cents or not) in a text

In [3]:
string='just testing,  wow that is expensive $127.89! Ugh, rather 25.87, or $85 '  # your string

# Make the whole subexpression optional with the trailing ?

reg=r'\$\d+(?:\.\d\d)?'   # your pattern between the quotes. Keep the "r" in front.
# (?:\.\d\d) -> non-capturing group: a literal . followed by two digits (the cents)
# ? (last one) -> makes that group optional, so amounts with or without cents match
re.findall(reg,string)

['$127.89', '$85']
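
If the amounts are needed as numbers rather than strings, a capturing group around the numeric part lets `findall` drop the `$` sign. This is a small sketch under that assumption; the sample sentence is invented.

```python
import re

text = "Lunch was $12.50, parking $3, and the 25.87 has no dollar sign."

# The outer (...) captures only the digits; the inner (?:...)? keeps the
# cents optional without adding a second capture group.
amounts = [float(m) for m in re.findall(r"\$(\d+(?:\.\d\d)?)", text)]
print(amounts)
print(sum(amounts))
```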

## Activity:

- Use regex for text data cleaning
- Review tokenization, lower and join
- Tokenization is the process of breaking up text into smaller units. Usually, this means breaking a string up into words.

In [17]:
s = 'This is a Book'
print(s.split())

['This', 'is', 'a', 'Book']


In [18]:
print(s.lower().split())

['this', 'is', 'a', 'book']
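
`split()` tokenizes on whitespace only, so punctuation stays attached to the neighbouring word. The sketch below contrasts that with a simple regex tokenizer and uses `join` to put the tokens back together. The sentence is a made-up variation of the one above.

```python
import re

s = "This is a Book, isn't it?"

# Whitespace tokenization keeps punctuation attached to the words.
ws_tokens = s.lower().split()
print(ws_tokens)

# \w+ keeps runs of letters, digits, and underscores only, so punctuation is
# dropped (and "isn't" is split at the apostrophe).
re_tokens = re.findall(r"\w+", s.lower())
print(re_tokens)

# join is the reverse step: tokens back into a single string.
joined = " ".join(re_tokens)
print(joined)
```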


In [4]:
import pandas as pd 

test_strs = ['THIS IS A TEST!', 'another test', 'JUS!*(*UDFLJ)']
df = pd.DataFrame(test_strs, columns=['text'])
df

,text
0,THIS IS A TEST!
1,another test
2,JUS!*(*UDFLJ)


In [5]:
from nltk.corpus import stopwords
import re

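def clean(x):
    """Lowercase x, strip punctuation, and drop English stop words."""
    x = x.lower()
    # Remove everything that is neither a word character (\w: letters, digits, underscore)
    # nor whitespace (\s)
    x = re.sub(r'[^\w\s]', '', x)
    # Requires the NLTK stop-word corpus (one-time nltk.download('stopwords'))
    stop = set(stopwords.words('english'))
    x = [word for word in x.split() if word not in stop]
    print(x)  # show the remaining tokens
    return " ".join(x)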

In [6]:
df['new_text'] = df['text'].apply(clean)

['test']
['another', 'test']
['jusudflj']


In [7]:
df

,text,new_text
0,THIS IS A TEST!,test
1,another test,another test
2,JUS!*(*UDFLJ),jusudflj
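
The same cleaning steps can be tried without pandas or the NLTK corpus. The sketch below is a standard-library variant; `STOP_WORDS` is a tiny hypothetical stand-in for the NLTK English list, and `clean_tokens` is an illustrative name, not part of any library.

```python
import re

# Hypothetical mini stop-word list, used only so the sketch runs without NLTK data.
STOP_WORDS = {"this", "is", "a", "the", "and"}

# Compile once: anything that is neither a word character nor whitespace.
PUNCT = re.compile(r"[^\w\s]")

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, drop stop words, and return the tokens."""
    text = PUNCT.sub("", text.lower())
    return [tok for tok in text.split() if tok not in stop_words]

samples = ["THIS IS A TEST!", "another test", "JUS!*(*UDFLJ)"]
cleaned = [" ".join(clean_tokens(s)) for s in samples]
print(cleaned)
```

Keeping the stop words in a set makes each membership check constant-time, which matters once the corpus grows beyond a few rows.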
