# Lecture 13

1. Replace
2. Reading URLs into strings
3. Captured Groups in Regular Expressions
4. Cleaning Data

__Useful links:__ 
- Regular Expression Cheat Sheet: https://www.debuggex.com/cheatsheet/regex/python
- Regular expression tester: http://pythex.org/

## 1. Replace

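We can use the `replace()` string method to replace every occurrence of a substring with another substring. It takes an optional third argument `k`, which limits the replacement to the first `k` occurrences. Because strings are immutable, `replace()` returns a new string and leaves the original unchanged.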


In [1]:
s='aba'
s.replace('a','c') #s is not modified as strings are immutable

'cbc'

In [2]:
s.replace('a','c',1)

'cba'
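
Because `replace()` returns a new string, the result has to be assigned to a name if we want to keep it. A minimal sketch (the name `t` is introduced here only for illustration):

```python
s = 'aba'
t = s.replace('a', 'c')      # replace() returns a new string
print(s, t)                  # s is unchanged: aba cbc

s = s.replace('a', 'c', 1)   # to "update" s, rebind the name to the result
print(s)                     # cba
```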

### Exercise:

- Use only the replace function to turn the string ‘aaaa’ into the string ‘abcd’.

In [3]:
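s = 'aaaa'
s.replace('aaaa','abcd')  # direct solution: replace the whole string at once
# alternative: limit each replacement with the count argument
# 'aaaa' -> 'dddd' -> 'cccd' -> 'bbcd' -> 'abcd'
s.replace('a','d',4).replace('d','c',3).replace('c','b',2).replace('b','a',1)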

'abcd'

## 2. Reading URLs into strings

For one of the homework questions, we need to read the contents of a URL into a string. We use the `urllib.request` module and the following code for this.

In [6]:
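import urllib.request  # "import urllib" alone does not reliably load the request submodule
url = "http://www.math.ucla.edu/~hangjie/contact"
page = urllib.request.urlopen(url).read()  # read() returns the raw bytes
print(type(page))
# decode the bytes into text (assuming UTF-8; undecodable bytes are replaced)
page = page.decode('utf-8', errors='replace')
print(type(page))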

#print(page)

<class 'bytes'>
<class 'str'>
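
Note that calling `str()` on a `bytes` object gives its printable representation (including the `b'...'` prefix and escape sequences), whereas `.decode()` gives the text itself. A small sketch, using made-up bytes in place of a downloaded page and assuming UTF-8 encoding:

```python
# Made-up bytes standing in for the result of urlopen(url).read()
raw = b'<p>caf\xc3\xa9</p>\n'

as_repr = str(raw)              # printable form of the bytes object
as_text = raw.decode('utf-8')   # the decoded text (assumes UTF-8)

print(as_repr)   # b'<p>caf\xc3\xa9</p>\n'
print(as_text)   # <p>café</p>
```

Both results have type `str`, but only the decoded version is convenient to search with regular expressions.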


## 3. Captured Groups in Regular Expressions
Recall that we use parentheses `(...)` to capture groups in a regular expression. The groups are numbered 1, 2, 3, ... in the order in which their opening parentheses appear. We can then require the text matched by a group to appear again later in the pattern by writing a backreference: `\1`, `\2`, and so on.

For example, suppose that we are looking for a pattern of the form `a1a` or `b2b` (a letter, followed by a digit, followed by the same letter again). We can do that as follows:

In [7]:
import re
re.search(r'([a-z])\d\1','h3g h4h').group()

'h4h'

In [8]:
re.search(r'([a-z])\d\1','g3g h4h').group()

'g3g'

In [11]:
re.findall(r'([a-z])\d\1','g3g h4h')  # findall returns only the captured group, here the letter matched by [a-z]

['g', 'h']

In [12]:
re.findall(r'([a-z])\d\1','g3g h4h g8d')  # the whole pattern must still match, so 'g8d' is not reported

['g', 'h']

In [9]:
re.findall(r'(([a-z])\d\2)','g3g h4h')

[('g3g', 'g'), ('h4h', 'h')]
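
To see how the groups are numbered by their opening parentheses, we can inspect a single match object. The sketch below also shows `re.finditer` as one possible alternative to the extra outer group when we want the full matches:

```python
import re

m = re.search(r'(([a-z])\d\2)', 'g3g h4h')
print(m.group(0))   # g3g  (the whole match)
print(m.group(1))   # g3g  (outer group: its parenthesis opens first)
print(m.group(2))   # g    (inner group)
print(m.groups())   # ('g3g', 'g')

# finditer yields match objects, so group() gives the whole match
whole = [x.group() for x in re.finditer(r'([a-z])\d\1', 'g3g h4h g8d')]
print(whole)        # ['g3g', 'h4h']
```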

Next, let's look for a repeated word. Note the spaces around the pattern: they prevent a match when one word merely ends with the letters that begin the next, as in `this is`.

In [13]:
import re
s='this is is a sentence'
re.search(r' (\w+) \1 ',s).group(1)

'is'

We can then use `re.sub()` to replace the matched pattern. In this case, we remove the repeated word by replacing the match with a single copy of the captured group:

In [14]:
re.sub(r' (\w+) \1 ',r' \1 ',s)

'this is a sentence'
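
The space-delimited pattern cannot match a repeated word at the very start or end of the string, since there is no surrounding space there. One possible variant, sketched below, uses the word-boundary anchor `\b` instead of the spaces. The helper name `drop_repeats` is made up for this illustration.

```python
import re

def drop_repeats(text):
    # Hypothetical helper: \b marks a word boundary, so no surrounding
    # spaces are needed and matches at the string edges are possible.
    return re.sub(r'\b(\w+) \1\b', r'\1', text)

print(drop_repeats('this is is a sentence'))   # this is a sentence
print(drop_repeats('is is this a sentence'))   # is this a sentence

# The space-delimited pattern leaves the second string unchanged
print(re.sub(r' (\w+) \1 ', r' \1 ', 'is is this a sentence'))
```

Like the original pattern, this sketch collapses each doubled word once. A word repeated three times in a row would still leave two copies behind.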

A single regular expression has a fixed number of captured groups, so it cannot by itself handle a check that needs arbitrarily many of them. We can often achieve such things by combining a regular expression with a loop or with recursion. For example, suppose that we want to check that a word (in lower-case letters) is a palindrome. A word is a palindrome if its first and last letters are the same and the part between them is a palindrome, so we can work recursively.

In [15]:
import re
def palindrome(word):
    if len(word) <= 1:
        return True
    elif re.match(r'^([a-z]).*\1$', word):  # check if the first lower-case letter is the same as the last one
        return palindrome(word[1:-1])  # strip both end letters and recurse
    else:
        return False

In [16]:
print(palindrome('abccca'))
print(palindrome('hanhnaa'))
print(palindrome('ahha'))

False
False
True
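
The same check can also be written with a loop instead of recursion. A minimal sketch, using the same regular expression (the name `palindrome_loop` is made up for this illustration):

```python
import re

def palindrome_loop(word):
    # Hypothetical iterative variant: keep stripping the two end
    # letters for as long as they match each other.
    while len(word) > 1:
        if not re.match(r'^([a-z]).*\1$', word):
            return False
        word = word[1:-1]
    return True

print(palindrome_loop('abccca'))   # False
print(palindrome_loop('ahha'))     # True
```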


### Exercise:
- Write a function that shortens sentences. "This is a sentence." becomes "This ... sentence."

- Write a regular expression that recognizes large integers with correct thousands separators, such as 10,000 and 3,746,982.

## 4. Cleaning Data

We can use regular expressions to transform data from .txt or other types of files into Python variables. Later on, we will work with modules (such as numpy and pandas) and variable types that are useful for data analysis, but for now we can practice putting data into lists using regular expressions.

For example, suppose we have the following marathon.txt file, with data on marathon times.

Andrea    5:31 <br>
Ben         5:02  <br>
Carl        6:21  <br>
Didi        5:10  <br>

We read this into a string using the `open()` function and the `read()` method.

**Make sure that your working directory matches the directory of the file.**

In [18]:
with open('marathon.txt', 'r') as f:  # the with block closes the file automatically
    times = f.read()
times

'Andrea    5:31\nBen       5:02\nCarl      6:21\nDidi      5:10'

In [17]:
"abc dse ".split(" ")

['abc', 'dse', '']
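
The cell above shows that splitting on a single space produces empty strings wherever separators are trailing or adjacent. This matters for our data, where names and times are separated by several spaces. A short sketch comparing three ways to split one row (the row is copied from the file contents shown earlier):

```python
import re

row = 'Andrea    5:31'

print(row.split(' '))          # ['Andrea', '', '', '', '5:31']
print(row.split())             # ['Andrea', '5:31']
print(re.split(r'\s+', row))   # ['Andrea', '5:31']
```

Calling `split()` with no argument and `re.split(r'\s+', ...)` both treat a run of whitespace as one separator.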

We first use the `re.split()` function to split our string at the newline characters:

In [19]:
Lrows=re.split(r'\n',times)
Lrows

['Andrea    5:31', 'Ben       5:02', 'Carl      6:21', 'Didi      5:10']

We then split each row at the runs of whitespace. In this case, it makes sense to split the time into two items, hours and minutes, so we also split at the `:` symbol.

In [20]:
L=[re.split(r'\s+|:',i) for i in Lrows]
L

[['Andrea', '5', '31'],
 ['Ben', '5', '02'],
 ['Carl', '6', '21'],
 ['Didi', '5', '10']]

Finally, we would like the hour and minute items to be integers, not strings:

In [21]:
for i in L:
    for j in [1,2]:
        i[j] = int(i[j])       
L

[['Andrea', 5, 31], ['Ben', 5, 2], ['Carl', 6, 21], ['Didi', 5, 10]]
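
As an alternative to splitting, the captured groups from Section 3 can pull out the three fields of each row in one step. A minimal sketch, which assumes that every valid row has the form name, whitespace, hours, colon, minutes (the string is copied from the file contents shown earlier, and the name `records` is made up):

```python
import re

times = 'Andrea    5:31\nBen       5:02\nCarl      6:21\nDidi      5:10'

records = []
for row in times.splitlines():
    m = re.match(r'(\w+)\s+(\d+):(\d+)$', row)
    if m:  # skip rows that do not fit the expected layout
        name, hours, minutes = m.groups()
        records.append([name, int(hours), int(minutes)])

print(records)
# [['Andrea', 5, 31], ['Ben', 5, 2], ['Carl', 6, 21], ['Didi', 5, 10]]
```

One advantage of this approach is that malformed rows are skipped rather than producing lists of the wrong length.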

### Exercise:
- Find different types of data that are of interest to you, and think about how to clean them up and import them into Python in a useful form.