# Lecture 9

1. Replace
2. Reading URLs into strings
3. Captured Groups in Regular Expressions
4. Cleaning Data

__Useful links:__ 
- Regular Expression Cheat Sheet: https://www.debuggex.com/cheatsheet/regex/python
- To test your regular expressions: http://pythex.org/

## 1. Replace

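We can use the string method __replace( )__ to replace substrings of a string with other substrings. It looks for its first argument as literal text (it does not interpret regular expressions) and returns a new string, leaving the original unchanged. There is an optional third argument k, which causes only the __first k instances__ to be replaced.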


In [5]:
s='aba'
#replace() looks for the literal text '[a-z]', not a regex, so nothing is replaced here
s1=s.replace(r'[a-z]','c') #replace() does NOT modify s; store the result in a new variable
print s1

aba


In [2]:
s.replace('a','c',1)

'cba'
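
The following sketch contrasts replace( ), which looks for literal text, with re.sub( ), which interprets a regular expression (re.sub( ) comes up again in Section 3). The variable names are only for illustration, and print( ) is written with parentheses so that the block also runs under Python 3.

```python
import re

s = 'aba'

# replace() looks for the literal text '[a-z]', which does not occur in s
literal_result = s.replace('[a-z]', 'c')

# re.sub() interprets '[a-z]' as a regular expression: any lowercase letter
regex_result = re.sub(r'[a-z]', 'c', s)

# the optional third argument limits how many instances are replaced
first_only = s.replace('a', 'c', 1)

print(literal_result)  # aba
print(regex_result)    # ccc
print(first_only)      # cba
print(s)               # aba (the original string is unchanged)
```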

### Exercise:

- Use only the replace function to turn the string ‘aaaa’ into the string ‘abcd’.

In [3]:
s = 'aaaa'

In [5]:
s.replace('aaaa','abcd')

'abcd'

In [6]:
s.replace('a','d',4).replace('d','c',3).replace('c','b',2).replace('b','a',1)

'abcd'

In [None]:
s1 = s.replace('a','d',4)
s2 = s1.replace('d','c',3)
s3 = s2.replace('c','b',2)
s4 = s3.replace('b','a',1) #same as above block
print s4

## 2. Reading URLs into strings

For one of the homework questions, we wish to read the data at a URL into a string. In Python 2 we can use the module __urllib2__ (or __urllib__) with the following code.

In [33]:
#python2
import urllib2

url = "http://www.math.ucla.edu/~hangjie/contact"
#page=urllib2.urlopen(url) # page now is an object
page=urllib2.urlopen(url).read() #get source code for the page
print page

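out = open("webtext.txt","w") #open a file for writing
print out #prints the file object itself, not the file's contents
out.write(str(page)) #save the page source to the file
out.close() #close the file so that everything is written to disk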

<!doctype html>
<!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if (IE 7)&!(IEMobile)]><html class="no-js lt-ie9 lt-ie8" lang="en"><![endif]-->
<!--[if (IE 8)&!(IEMobile)]><html class="no-js lt-ie9" lang="en"><![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"><!--<![endif]-->
<head>
<meta charset="utf-8">
<title>Hangjie Ji's site  &#8211; Contact Me </title>
<meta name="description" content="">
<meta name="keywords" content="contact">


<!-- Open Graph -->
<meta property="og:locale" content="en_US">
<meta property="og:type" content="article">
<meta property="og:title" content="Contact Me">
<meta property="og:description" content="Welcome to my site!">
<meta property="og:url" content="https://www.math.ucla.edu/~hangjie/contact/">
<meta property="og:site_name" content="Hangjie Ji's site">





<link rel="canonical" href="https://www.math.ucla.edu/~hangjie/contact/">
<link href="https://www.math.ucla.edu/~hangjie/feed.xml" type="applic


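For __Python3__ the following works (read( ) returns bytes there, so we decode them to get a string; the page above declares its charset as utf-8):

import urllib.request

page=urllib.request.urlopen(url).read().decode('utf-8')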
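
A minimal Python 3 sketch of the same idea, wrapped in a function. The name `fetch_text` and its `encoding` parameter are hypothetical helpers introduced for illustration, and UTF-8 is an assumption about the page being read.

```python
import urllib.request

def fetch_text(url, encoding='utf-8'):
    """Return the contents found at url as a string (hypothetical helper)."""
    # the with-statement closes the connection once the data has been read
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
    # Assumption: the data is encoded as `encoding`; undecodable bytes are replaced
    return raw.decode(encoding, errors='replace')

# Usage (needs network access, so it is left commented out here):
# page = fetch_text("http://www.math.ucla.edu/~hangjie/contact")
# print(page[:200])
```

Decoding explicitly matters in Python 3 because bytes and strings are different types there, and regular expressions written as ordinary strings only work on strings.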

## 3. Captured Groups in Regular Expressions
Recall that we use parentheses __(...)__ to capture groups in a regular expression. The groups are numbered 1,2,3,... by the position of their opening parenthesis. A backreference such as \1 or \2 in the pattern then requires the exact text matched by that group to appear again.

For example, suppose that we are looking for a pattern of the form ‘a1a’ or ‘b2b’ (a letter, followed by a digit, followed by the same letter again). We can do that as follows:

In [13]:
import re
re.search(r'([a-z])\d\1','h3g h4h').group() #\1 matches the same text that group 1 (the parenthesized part) matched, so 'h3g' fails

'h4h'

In [14]:
re.search(r'([a-z])\d\1','g3g h4h').group()

'g3g'

In [17]:
re.findall(r'([a-z])\d\1','g3g h4h')

['g', 'h']

In [20]:
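re.findall(r'(([a-z])\d\1)','g3g h4h') #now [a-z] is group 2, not 1; \1 refers to the outer group, which is still open, so the pattern is rejected

error: cannot refer to open group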

In [9]:
re.findall(r'(([a-z])\d\2)','g3g h4h') #\2 refers to the inner group [a-z]; findall returns one tuple (group 1, group 2) per match

[('g3g', 'g'), ('h4h', 'h')]
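
Since re.findall( ) returns only the captured groups when the pattern contains any, another way to get the full matches is to loop over re.finditer( ) and call group( ) on each match object. A minimal sketch (the test string is extended with 'h3g' for illustration):

```python
import re

text = 'g3g h4h h3g'

# finditer() yields one match object per match; group() is the whole match
whole_matches = [m.group() for m in re.finditer(r'([a-z])\d\1', text)]

# group(1) is still available if the captured letter is needed as well
letters = [m.group(1) for m in re.finditer(r'([a-z])\d\1', text)]

print(whole_matches)  # ['g3g', 'h4h']
print(letters)        # ['g', 'h']
```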

Next, let's look for a repeated word. (Note the spaces around the pattern: they prevent matches between consecutive words that merely share an ending and a beginning, such as 'this is'.)

In [15]:
import re
s='this is is a sentence'
re.search(r' (\w+) \1 ',s).group(1)

'is'

In [16]:
re.search(r' (\w+) \1 ',s).group() #group() without an argument returns the whole substring that matched

' is is '

We can then use __re.sub( )__ to replace whatever a pattern matches. In the replacement string, \1 stands for the text captured by group 1. In this case, we remove the repeated word:

In [23]:
re.sub(r' (\w+) \1 ',r' \1 ',s) #replace ' is is ' by ' is '

'this is a sentence'
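
The pattern with spaces misses a repeated word at the very start or end of the string, where there is no surrounding space. One possible variation, not part of the lecture code, uses the word boundary \b in place of the spaces. The function name `remove_repeated_word` is hypothetical.

```python
import re

def remove_repeated_word(sentence):
    """Collapse a word that appears twice in a row into one copy (sketch)."""
    # \b is a word boundary, so (\w+) and its repetition must be whole words
    return re.sub(r'\b(\w+) \1\b', r'\1', sentence)

print(remove_repeated_word('this is is a sentence'))  # this is a sentence
print(remove_repeated_word('is is fine'))             # is fine
print(remove_repeated_word('this island'))            # this island
```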

A regular expression has a fixed number of captured groups, so a single pattern cannot look for arbitrarily many groups or backreferences. We can often achieve such things by combining a regular expression with a loop or with recursion. For example, suppose that we want to check whether a word (in lower case letters) is a palindrome. A word is a palindrome if its first and last letters are the same and the part between them is a palindrome. So, we can work recursively.

In [9]:
import re
def palindrome(word):
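    if len(word)<=1: #empty and one-letter words are palindromes
        return True
    elif re.match(r'^([a-z]).*\1$',word): #first and last characters are the same lowercase letter (match already anchors at the start, so ^ is optional)
        return palindrome(word[1:-1]) #recurse on the part between them
    else:
        return False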

In [20]:
print palindrome('abccca')
print palindrome('hanhnaa')
print palindrome('ahha')
print palindrome('f  f') #False because the pattern only accepts letters in [a-z], not spaces

False
False
True
False
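
The same check can be written with a loop instead of recursion. This is a sketch under the same assumption as above (only the letters a-z count); the name `palindrome_loop` is introduced here for illustration.

```python
import re

def palindrome_loop(word):
    """Iterative version of the palindrome check (sketch)."""
    while len(word) > 1:
        # the first and last characters must be the same lowercase letter
        if not re.match(r'([a-z]).*\1$', word):
            return False
        word = word[1:-1]  # drop both ends and test what is left
    return True

print(palindrome_loop('abccca'))  # False
print(palindrome_loop('ahha'))    # True
print(palindrome_loop('f  f'))    # False
```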


### Exercise:
- Write a function that shortens sentences. "This is a sentence." becomes "This ... sentence."

- Write a regular expression that recognizes large integers with correct thousands separators, such as 10,000 and 3,746,982.

## 4. Cleaning Data

We can use regular expressions to transform data from .txt or other types of files into Python variables. Later on, we will work with modules (such as numpy and pandas) and variable types that are useful for data analysis, but for now we can practice putting data into lists using regular expressions.

For example, suppose we have the following marathon.txt file, with data on marathon times.

Andrea    5:31 <br>
Ben         5:02  <br>
Carl        6:21  <br>
Didi        5:10  <br>

We read this into a string using the open( ) and read( ) functions. 

**Make sure that your working directory matches the directory of the file.**

In [7]:
times=open('marathon.txt', 'r').read() #read the content of file and return a string
times

'Andrea    5:31\nBen       5:02\nCarl      6:21\nDidi      5:10'

We can use the re.split() function to split our string first by the newline separators:

In [10]:
Lrows=re.split(r'\n',times) #re.split returns a list of the pieces between matches of the pattern
Lrows

['Andrea    5:31', 'Ben       5:02', 'Carl      6:21', 'Didi      5:10']

We then split each row by the space separators. In this case, it makes sense to split the time into two items: hours and minutes. Therefore, we also split by the “:” symbol.

In [11]:
L=[re.split(r'\s+|:',i) for i in Lrows] #split each row on runs of whitespace or on ':'
L
print L[1][0]

Ben


Finally, we would like the hour and minute items to be integers, not strings:

In [24]:
for i in L:
    for j in [1,2]:
        i[j] = int(i[j])       
L

[['Andrea', 5, 31], ['Ben', 5, 2], ['Carl', 6, 21], ['Didi', 5, 10]]
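
The splitting and converting steps can also be combined by using captured groups, as in Section 3. A minimal sketch, assuming every row has the form name, whitespace, hours:minutes (the names `pattern` and `rows` are introduced here for illustration):

```python
import re

times = 'Andrea    5:31\nBen       5:02\nCarl      6:21\nDidi      5:10'

# group 1: name, group 2: hours, group 3: minutes
pattern = r'(\w+)\s+(\d+):(\d+)'

# findall() returns one tuple of groups per row; convert the numbers to int
rows = [[name, int(hours), int(minutes)]
        for name, hours, minutes in re.findall(pattern, times)]

print(rows)
# [['Andrea', 5, 31], ['Ben', 5, 2], ['Carl', 6, 21], ['Didi', 5, 10]]
```

A row that does not fit the pattern is silently skipped by this approach, whereas the split-based version would keep it in a malformed shape, so which one is preferable depends on the data.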

In [27]:
L.sort(key = lambda x : x[1:3]) #sort in place by [hours, minutes], so the fastest runner comes first
L

[['Ben', 5, 2], ['Didi', 5, 10], ['Andrea', 5, 31], ['Carl', 6, 21]]
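
Once the times are integers, other computations become easy as well. The sketch below uses sorted( ), which returns a new list and leaves the original alone, and converts each time to total minutes; the names `by_time`, `total_minutes` and `fastest` are introduced here for illustration.

```python
# Data as produced by the cleaning steps: [name, hours, minutes]
L = [['Andrea', 5, 31], ['Ben', 5, 2], ['Carl', 6, 21], ['Didi', 5, 10]]

# sorted() compares the (hours, minutes) tuples: hours first, then minutes
by_time = sorted(L, key=lambda row: (row[1], row[2]))

# total minutes per runner, e.g. 5:31 becomes 5*60 + 31 = 331
total_minutes = {name: 60 * hours + minutes for name, hours, minutes in L}

fastest = by_time[0][0]
print(fastest)                  # Ben
print(total_minutes['Andrea'])  # 331
```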

In [30]:
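out = open("output.txt","w") #open a file for writing
print out #prints the file object itself
out.write(str(L)) #writes the whole list as a single line of text
out.close()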

<open file 'output.txt', mode 'w' at 0x103e498a0>


In [32]:
out2 = open("output2.txt","w")
for line in L:
    out2.write(str(line)+'\n') #write one row per line
out2.close()
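
A variation on the cell above, sketched with a with-statement so that the file is closed automatically, and with each row written back in the original h:mm format. The function `write_rows` and the file name `marathon_sorted.txt` are hypothetical, and the file is placed in the system's temporary directory only to keep the example self-contained.

```python
import os
import tempfile

def write_rows(rows, path):
    """Write one 'name h:mm' line per row (hypothetical helper)."""
    with open(path, 'w') as out:  # the file is closed when the block ends
        for name, hours, minutes in rows:
            # %02d pads the minutes with a leading zero, so 2 becomes 02
            out.write('%s %d:%02d\n' % (name, hours, minutes))

rows = [['Ben', 5, 2], ['Didi', 5, 10]]
path = os.path.join(tempfile.gettempdir(), 'marathon_sorted.txt')
write_rows(rows, path)

# read the file back to check what was written
with open(path) as f:
    print(f.read())
```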


### Exercise:
- Find different types of data that are of interest to you, and think about how to clean them up and import them into Python in a useful form.