# Elements Of Data Processing (2020S2) - Week 4


## Regular expressions 
Regular expressions allow you to match patterns in strings, rather than matching exact characters.  
For example, 
if I wished to find all phone numbers of the form (03) xxxx xxxx, where x is some arbitrary digit, 
I could use a regular expression like this: 
    
\(03\) \d\d\d\d \d\d\d\d

*or*

\(03\) \d{4} \d{4}

The **re** library in Python allows you to use regular expressions. It provides a number of useful functions, 
including:
    
***search*** - Searches a string for the first match of a pattern | returns a match object if one is found, otherwise `None` (so the result can be used directly in an `if` test)

***findall*** - Finds all substrings that match a particular pattern

***sub*** - Replaces substrings that match a particular pattern with a new substring
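The three functions listed above can be compared side by side. The following is a minimal sketch; the sample text and the `<phone>` placeholder are made up for illustration and are not part of the original exercise.

```python
import re

# Hypothetical sample text containing two numbers in the (03) xxxx xxxx format
text = "Call (03) 9923 1123 or (03) 8344 0000 today"
pattern = r"\(03\) \d{4} \d{4}"

# re.search returns a match object for the first match, or None if there is none
first = re.search(pattern, text)
first_number = first.group(0) if first else None

# re.findall returns every non-overlapping match as a list of strings
all_numbers = re.findall(pattern, text)

# re.sub returns a new string in which each match has been replaced
masked = re.sub(pattern, "<phone>", text)

print(first_number)
print(all_numbers)
print(masked)
```

Because `re.search` returns either a match object or `None`, it can be placed directly in an `if` condition, as the cells below do.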


### This example looks for phone numbers that match the format above

In [1]:
# This example looks for phone numbers that match the format above
import re

string = r'Name: Chris, ph: (03) 9923 1123, comments: this is not my real number'
# r'...' marks a raw string: backslashes are kept as literal characters instead of starting escape sequences
# for example, '\n' is a newline, while r'\n' is the two characters \ and n
# note that r"\" is not a valid string, since the backslash would stop the following quote character from closing it
pattern = r'\(03\) \d{4} \d{4,4}' # {min repetitions, max repetitions}; {4,4} is equivalent to {4}
# the r prefix only controls how Python reads the literal; once the pattern is passed to an re function, its metacharacters have their special meaning
if re.search(pattern, string) :
    print("Phone number found")
else :
    print("Not found")

Phone number found


In [2]:
string = r'\s('
pattern = r'\s'
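# \s matches any whitespace character; \S matches anything but whitespace
# the raw string above holds a literal backslash, 's' and '(' with no whitespace, so there is no match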
if re.search(pattern, string):
    print("yes")
else:
    print("no")

no
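The result above follows from how raw strings are read. Here is a small sketch of the difference between a normal and a raw string literal; the variable names and the `a\tb` sample are hypothetical and chosen only for illustration.

```python
import re

plain = "a\tb"   # escape sequence processed: 'a', TAB, 'b'
raw = r"a\tb"    # raw string: 'a', backslash, 't', 'b'

print(len(plain), len(raw))

# \s in a pattern matches whitespace, so it finds the TAB but not the backslash + 't'
has_ws_plain = re.search(r"\s", plain) is not None
has_ws_raw = re.search(r"\s", raw) is not None

# To match the literal two characters backslash + 's', escape the backslash in the pattern
literal_match = re.search(r"\\s", r"\s(") is not None

print(has_ws_plain, has_ws_raw, literal_match)
```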


In [3]:
strings = [
    r'a[c',
    r'a.c',
    r'acc',
    r'a/c',
    r'acccc',
    r'acacc'
]

pattern = r'a.c'
# . matches any single character except a newline, so every string below contains a match
# special characters lose their special meaning inside sets: [(+*)] matches any one of the literal characters (, +, *, )
# so pattern = r'a[.]c' would match only r'a.c'
for s in strings:
    if re.search(pattern, s) :
        print("Phone number found")
    else :
        print("Not found")

Phone number found
Phone number found
Phone number found
Phone number found
Phone number found
Phone number found
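To make the comment about sets concrete, the sketch below runs three patterns over the same list of strings. The result variable names are illustrative only.

```python
import re

strings = ["a[c", "a.c", "acc", "a/c", "acccc", "acacc"]

# An unescaped dot matches any character except a newline
dot_any = [s for s in strings if re.search(r"a.c", s)]

# Inside a set the dot is a literal character
dot_in_set = [s for s in strings if re.search(r"a[.]c", s)]

# Escaping the dot with a backslash has the same effect
dot_escaped = [s for s in strings if re.search(r"a\.c", s)]

print(dot_any)
print(dot_in_set)
print(dot_escaped)
```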


### <span style="color:blue"> Exercise 1 </span>

Modify the example above so that it will also find phone numbers starting with 03 that:
    
- are missing brackets, and/or
- use hyphens, backslashes and/or spaces as separators instead of a single space.

Your program should match all elements in ***strings*** in the code segment below 

In [4]:
# This example looks for phone numbers in the formats described above
import re
strings = [
    r'Name: Chris, ph: (03) 9923 1123, comments: this is not my real number',
    r'Name: John, ph: 03-9923-1123, comments: this might be an old number',
    r'Name: Sara, phone: (03)-9923-1123, comments: there is data quality issues, so far, three people sharig the same number',
    r'Name: Christopher, ph: (03)\-9923 -1123, comments, is this the same Chris in the first record?'
]

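# change this line
pattern =  r'\(?03\)?[-\\\s]*\d{4}[-\\\s]*\d{4,4}'
# ? makes the preceding item optional (0 or 1 occurrences)
# [] defines a set that matches exactly one character from it; the * after it allows 0 or more repetitions
# \s matches any whitespace character; \\ matches a literal backslash
# [-\\\s]* therefore matches any run of hyphens, backslashes and whitespace
# [\\-\s]* does not work: a hyphen between two set members is read as a range, so the hyphen must come first or last in the set (or be escaped)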

for s in strings:
    if re.search(pattern, s) :
        print("Phone number found")
    else :
        print("Not found")

Phone number found
Phone number found
Phone number found
Phone number found
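The note about the position of the hyphen inside the set can be checked directly. The sketch below first shows that the `[\\-\s]` ordering is rejected when the pattern is compiled, then uses the working pattern to pull out the matched text. The `records` list is a shortened, hypothetical version of the strings above.

```python
import re

# A hyphen between two set members is read as a range; backslash-to-\s is not a valid range
try:
    re.compile(r"[\\-\s]*")
    bad_range_rejected = False
except re.error:
    bad_range_rejected = True

# Working pattern from the exercise, with {4,4} shortened to the equivalent {4}
pattern = r"\(?03\)?[-\\\s]*\d{4}[-\\\s]*\d{4}"

# Hypothetical shortened records based on the exercise strings
records = [
    r"ph: (03) 9923 1123, comments",
    r"ph: 03-9923-1123, comments",
    r"ph: (03)\-9923 -1123, comments",
]

# group(0) is the full text matched by the pattern
found = [re.search(pattern, s).group(0) for s in records]

print(bad_range_rejected)
print(found)
```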


### <span style="color:blue"> Exercise 2 </span>

Write a program that will remove all leading zeros from an IP address
    
For example, 0216.08.094.102 should become 216.8.94.102


In [5]:
# Exercise 2: Write a program that will remove all leading zeros from an IP address
# For example, 0216.08.094.102 should become 216.8.94.102
import re

ip_addr = '0216.08.094.102'
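# first, find a zero at the very start of the address
# and replace it with the empty string
pattern = r'^0'
replace = r''
revised_addr = re.sub(pattern, replace, ip_addr)
# second, find each literal ".0"
# and replace it with a literal "."
# note: this removes only one zero per octet, which is enough for this input
pattern2 = r'\.0' # the dot must be escaped to match a literal "."
replace2 = r'.'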
revised_addr2 = re.sub(pattern2, replace2, revised_addr)
print(revised_addr2)

216.8.94.102


In [6]:
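# (^|\.) is a capture group that matches either the start of the string or a literal "."
# the | inside the group means "or"; \1 in the replacement puts back whatever the group matched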
pattern = r'(^|\.)0'
revised_addr = re.sub(pattern, r'\1', ip_addr)
print(revised_addr)

216.8.94.102
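Both solutions above remove a single zero per octet, which is enough for the sample address. As a hedged alternative, the sketch below also handles octets with several leading zeros and keeps an octet that is just `0`. The helper name `strip_leading_zeros` and the second test address are assumptions made for illustration, and the pattern uses a lookahead, which is a different technique from the capture group used above.

```python
import re

def strip_leading_zeros(ip_addr):
    """Remove leading zeros from each octet, keeping a lone 0 intact."""
    # \b0+   : one or more zeros at the start of an octet
    # (?=\d) : lookahead requiring that at least one digit remains after them
    return re.sub(r"\b0+(?=\d)", "", ip_addr)

print(strip_leading_zeros("0216.08.094.102"))
print(strip_leading_zeros("001.000.010.0"))
```

The lookahead does not consume the digit it checks, so the last digit of an all-zero octet is always left in place.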


## Natural Language Processing ##
The ***nltk*** library provides you with tools for natural language processing, including tokenizing, stemming and lemmatization

In [7]:
import nltk
from nltk.stem.porter import PorterStemmer
# if this raises errors on the first run, download the required data:
#nltk.download('punkt')
#nltk.download('stopwords')
# as in Java, create an object of type PorterStemmer
porterStemmer = PorterStemmer()
# word tokenization
speech = 'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.'
wordList = nltk.word_tokenize(speech)
# import the stopwords corpus reader:
from nltk.corpus import stopwords
# stopwords.words('english') returns the English stop word list; a set gives fast membership tests
stopWords = set(stopwords.words('english'))

# the list comprehension below is equivalent to this loop:
# filteredList = []
# for w in wordList:
#     if w not in stopWords:
#         filteredList.append(w)

# the comparison is case-sensitive, so capitalised words such as 'It' and 'We' are not filtered out
filteredList = [w for w in wordList if w not in stopWords] #1
    
wordDict = {}
for word in filteredList:
    stemWord = porterStemmer.stem(word)
    if stemWord in wordDict : 
        wordDict[stemWord] = wordDict[stemWord] +1
    else :
        wordDict[stemWord] = 1
wordDict = {k: v for k, v in sorted(wordDict.items(), key=lambda item: item[1], reverse=True)} #2
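# a dict comprehension, similar in structure to the list comprehension at #1
# dicts keep insertion order (Python 3.7+), so the result is ordered by count, highest first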
# for key in wordDict: 
#     print(key, wordDict[key])

In [8]:
# items() returns a dict_items view of (key, value) pairs
wordDict.items()

dict_items([(',', 22), ('.', 10), ('--', 7), ('dedic', 6), ('nation', 5), ('live', 4), ('great', 3), ('It', 3), ('dead', 3), ('us', 3), ('shall', 3), ('peopl', 3), ('new', 2), ('conceiv', 2), ('men', 2), ('war', 2), ('long', 2), ('We', 2), ('gave', 2), ('consecr', 2), ('the', 2), ('far', 2), ('rather', 2), ('devot', 2), ('four', 1), ('score', 1), ('seven', 1), ('year', 1), ('ago', 1), ('father', 1), ('brought', 1), ('forth', 1), ('contin', 1), ('liberti', 1), ('proposit', 1), ('creat', 1), ('equal', 1), ('now', 1), ('engag', 1), ('civil', 1), ('test', 1), ('whether', 1), ('endur', 1), ('met', 1), ('battle-field', 1), ('come', 1), ('portion', 1), ('field', 1), ('final', 1), ('rest', 1), ('place', 1), ('might', 1), ('altogeth', 1), ('fit', 1), ('proper', 1), ('but', 1), ('larger', 1), ('sens', 1), ('hallow', 1), ('ground', 1), ('brave', 1), ('struggl', 1), ('poor', 1), ('power', 1), ('add', 1), ('detract', 1), ('world', 1), ('littl', 1), ('note', 1), ('rememb', 1), ('say', 1), ('never', 
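The count-then-sort steps used to build `wordDict` can also be written with `collections.Counter` from the standard library. This is an alternative technique, not what the cell above does, and the `stems` list is a small hypothetical stand-in for the stemmed tokens so that the sketch runs without NLTK.

```python
from collections import Counter

# Hypothetical stand-in for the stemmed tokens produced above
stems = ["nation", "dedic", "nation", "live", "dedic", "nation"]

# Counter tallies each distinct item in one pass
counts = Counter(stems)

# most_common() returns (token, count) pairs sorted by count, highest first
ranked = counts.most_common()

print(ranked)
```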

In [9]:
# more on lambda
# a lambda is an anonymous function: it doesn't need a name, i.e. it doesn't need to be assigned to anything
# to define a named function instead, use def function_name(): ...
# further explanation: https://stackoverflow.com/questions/8966538/syntax-behind-sortedkey-lambda
s = [[12, 'tall', 'blue', 1],
[2, 'short', 'red', 9],
[4, 'tall', 'blue', 13]]
# x refers to each element of s, and each element is a list
# x[1]: tall, short, tall
# x[2]: blue, red, blue
# the key (x[1], x[2]) sorts by x[1] first, then by x[2] to break ties
s = sorted(s, key = lambda x: (x[1], x[2]))
s

[[2, 'short', 'red', 9], [12, 'tall', 'blue', 1], [4, 'tall', 'blue', 13]]

In [10]:
# sorted(iterable, key=...) accepts any iterable and returns a new list
# list.sort() is a method available only on lists and sorts in place
student_tuples = [
    ('john', 'A', 15),
    ('jane', 'B', 12),
    ('dave', 'B', 10),]
sorted(student_tuples, key=lambda student: student[2]) 
# general form: key=lambda argument: expression

[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
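The difference between `sorted` and `list.sort`, and a few ways of writing the key, can be sketched as follows. `operator.itemgetter` is a standard-library alternative to the lambda; the variable names are illustrative only.

```python
from operator import itemgetter

student_tuples = [
    ("john", "A", 15),
    ("jane", "B", 12),
    ("dave", "B", 10),
]

# itemgetter(2) behaves like lambda student: student[2]
by_age = sorted(student_tuples, key=itemgetter(2))

# A tuple key sorts by grade first, then by age descending (negated) within a grade
by_grade_then_age_desc = sorted(student_tuples, key=lambda s: (s[1], -s[2]))

# list.sort() sorts in place and returns None; sorted() leaves its input unchanged
roster = list(student_tuples)
result = roster.sort(key=itemgetter(0))

print(by_age)
print(by_grade_then_age_desc)
print(roster, result)
```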

### <span style="color:blue"> Exercise 3 </span>

Modify the example above to use a WordNet lemmatizer instead of a Porter stemmer.

Comment on the differences.

In [11]:
# Solution to Exercise 3:
import nltk
from nltk.stem import WordNetLemmatizer 

nltk.download('wordnet')
# This part is a bit like Java: Scanner keyboard = new Scanner(System.in)
# creating an object from a class gives that object the properties and methods of the class
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [12]:
# create a dict that counts how often each lemma occurs
lemmaDict = {}
for word in filteredList:
    lemmaWord = lemmatizer.lemmatize(word)
    if lemmaWord in lemmaDict : 
        lemmaDict[lemmaWord] = lemmaDict[lemmaWord] +1
    else :
        lemmaDict[lemmaWord] = 1

lemmaDict = {k: v for k, v in sorted(lemmaDict.items(), key= lambda item: item[1], reverse = True)}
for item in lemmaDict:
    #print(item + ": "+ str(lemmaDict[item]))
    print(item, lemmaDict[item])

, 22
. 10
-- 7
nation 5
dedicated 4
great 3
It 3
dead 3
u 3
shall 3
people 3
new 2
conceived 2
men 2
war 2
long 2
We 2
dedicate 2
gave 2
The 2
living 2
far 2
rather 2
devotion 2
Four 1
score 1
seven 1
year 1
ago 1
father 1
brought 1
forth 1
continent 1
Liberty 1
proposition 1
created 1
equal 1
Now 1
engaged 1
civil 1
testing 1
whether 1
endure 1
met 1
battle-field 1
come 1
portion 1
field 1
final 1
resting 1
place 1
life 1
might 1
live 1
altogether 1
fitting 1
proper 1
But 1
larger 1
sense 1
consecrate 1
hallow 1
ground 1
brave 1
struggled 1
consecrated 1
poor 1
power 1
add 1
detract 1
world 1
little 1
note 1
remember 1
say 1
never 1
forget 1
unfinished 1
work 1
fought 1
thus 1
nobly 1
advanced 1
task 1
remaining 1
honored 1
take 1
increased 1
cause 1
last 1
full 1
measure 1
highly 1
resolve 1
died 1
vain 1
God 1
birth 1
freedom 1
government 1
perish 1
earth 1
