# Week 2 demonstration


The following scripts demonstrate how regular expressions work in Python.

In [1]:
import re

## group()

In [2]:
test_string = "Incident American Airlines Flight 11 involving a Boeing 767-223ER in 2001"

In [3]:
mobject = re.search(r"Incident (.*) involving", test_string)

In [4]:
mobject.group(0)

'Incident American Airlines Flight 11 involving'

In [5]:
mobject.group(1)

'American Airlines Flight 11'

In [6]:
mobject1 = re.findall(r"Incident (.*) involving", test_string)

In [7]:
mobject1[0]

'American Airlines Flight 11'
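
The difference between `re.search` and `re.findall` can be sketched in a few lines. The example reuses `test_string` from above; the two-group pattern and the variable names `whole`, `flight`, `aircraft`, and `found` are assumptions made for illustration and are not part of the original demonstration.

```python
import re

test_string = "Incident American Airlines Flight 11 involving a Boeing 767-223ER in 2001"

# re.search returns a match object (or None); groups are read with group()/groups().
m = re.search(r"Incident (.*) involving a (.*) in", test_string)
if m is not None:
    whole = m.group(0)      # the entire matched text
    flight = m.group(1)     # first parenthesised group
    aircraft = m.group(2)   # second parenthesised group
    print(m.groups())

# re.findall returns plain strings (tuples when the pattern has several groups),
# never match objects.
found = re.findall(r"Incident (.*) involving a (.*) in", test_string)
print(found)
```

Checking for `None` before calling `group()` avoids an `AttributeError` when the pattern does not match.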

## Raw String

In [8]:
string1 = '\section{Test the Python regular expressions}'
string1

'\\section{Test the Python regular expressions}'

In [9]:
string2 = "\\\\section"
string2

'\\\\section'

In [10]:
print(string2)

\\section


In [11]:
x = re.match(string2, string1)
x.group(0)

'\\section'

In [12]:
print(x.group(0))

\section


In [13]:
string3 = r"\\section"
string3

'\\\\section'

In [14]:
re.match(string3, string1)

<_sre.SRE_Match at 0x10af50f38>
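
The cells above can be condensed into one small sketch that compares the normal and the raw spelling of the same pattern. The names `text`, `plain`, and `raw` are illustrative assumptions, and the call to `re.escape` is an addition that is not used in the original demonstration.

```python
import re

text = "\\section{Test}"          # one literal backslash, then "section{Test}"

plain = "\\\\section"             # normal literal: four backslashes typed, two stored
raw = r"\\section"                # raw literal: two backslashes typed, two stored

print(plain == raw)               # both literals produce the same 9-character string
print(len(raw))

# The regex engine reads the two stored backslashes as "one literal backslash".
print(re.match(raw, text) is not None)

# re.escape builds the same kind of pattern from a literal string.
print(re.escape("\\section") == raw)
```

Writing patterns as raw strings keeps the number of backslashes in the source equal to the number the regex engine sees.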

## Case study 1: develop regular expression for validating dates

### Year 
Let's start with year.

We assume that a year can take one of the following two patterns:
1. Four-digit year, such as
```
    2016, 1999
```
2. Two-digit year, such as
```
    16, 99
```


In [15]:
def year(pattern, m):
    if re.match(pattern, m):
        print(m + " is a year")
    else:
        print(m + " is NOT a year")

For the four-digit year, the regular expression is 

```python
    r"^\d{4}$"
```

In [16]:
year(r"^\d{4}$", "2016")

2016 is a year


In [17]:
year(r"^\d{4}$", "1998")

1998 is a year


For the two-digit year, the regular expression is 
```python
    r"^\d{2}$"
```

In [18]:
year(r"^\d{2}$", "16")

16 is a year


In [19]:
year(r"^\d{2}$", "98")

98 is a year


<font color = 'red' size="4"> What is the regular expression for both patterns? </font>



Actually, we can merge the two regular expressions by using the repetition character <font color = 'red'>?</font>, which makes the group in front of it optional:
```python
    r"(?:\d{2})?\d{2}"
```
![day](./date0.jpg)


In [20]:
year(r"^(?:\d{2})?\d{2}$", "16")

16 is a year


What does "<font color='red'>(?: )</font>" mean?

"<font color='red'>(?: )</font>" indicates a non-capturing group. It matches whatever regular expression is inside the parentheses, but the substring matched by the group <font color='blue'>cannot be retrieved</font> afterwards with group() or groups().


In [21]:
obj = re.match(r"^(?:\d{2})?\d{2}$", "2016")
print(obj.groups())
print(obj.group(0))

()
2016


In [22]:
obj = re.match(r"^(\d{2})?\d{2}$", "2016")
print(obj.groups())
print(obj.group(0))
print(obj.group(1))

('20',)
2016
20


Therefore, the regular expression for either a four-digit or a two-digit year is
```python
    r"^(?:\d{2})?\d{2}$"
```
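
As a quick check of this pattern, the sketch below wraps it in a small helper and tries a few strings of different lengths. The helper name `is_year` and the list of candidates are assumptions made for illustration.

```python
import re

# Optional two-digit prefix followed by two mandatory digits: 2 or 4 digits in total.
YEAR = re.compile(r"^(?:\d{2})?\d{2}$")

def is_year(text):
    """Return True if text is a two- or four-digit year."""
    return YEAR.match(text) is not None

for candidate in ["16", "1999", "2016", "216", "20166", "20a6"]:
    print(candidate + " -> " + str(is_year(candidate)))
```

Three- and five-digit strings are rejected because the optional group can only contribute zero or two digits.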

### Month

Now let's consider the month. There are twelve months in a year, which can be written as
```
    1, 2, 3, 4, 5, 6, 7, 8, 9
    10, 11, 12
```
or 
```
    01, 02, 03, 04, 05, 06, 07, 08, 09 
    10, 11, 12 
```


In [23]:
def month(pattern, m):
    if re.match(pattern, m):
        print(m + " is a month")
    else:
        print(m + " is NOT a month")

First, assume that every month is written with two digits, so the months from January to September carry a leading zero.

In [24]:
month(r"^\d{2}$", "9")

9 is NOT a month


In [25]:
month(r"^\d{2}$", "12")

12 is a month


Now, assume that we allow the leading zeros to be omitted. 

In [26]:
month(r"^\d?\d$", "9")

9 is a month


<font color='red' size='5'>What is the problem with this regular expression?</font>
<br>





In [27]:
month(r"^\d?\d$", "13")

13 is a month


In [28]:
month(r"^\d?\d$", "00")

00 is a month


```
    01, 02, 03, 04, 05, 06, 07, 08, 09 
    10, 11, 12 
```

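As shown above, the first digit can only be 0 or 1. After a 0 (which may be omitted) the second digit is 1 to 9, and after a 1 it is 0 to 2. This gives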
```python
    r"(0?[1-9]|1[0-2])"
```
![month](./date1.jpg)

In [29]:
month(r"^(0?[1-9]|1[0-2])$", "10")

10 is a month


In [30]:
month(r"^(0?[1-9]|1[0-2])$", "00") ## try 00

00 is NOT a month
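
One way to gain confidence in the month pattern is to test it exhaustively, since there are only 110 one- and two-digit strings. The sketch below does that; the names `MONTH`, `valid`, `candidates`, and `accepted` are illustrative assumptions.

```python
import re

MONTH = re.compile(r"^(0?[1-9]|1[0-2])$")

# Every valid spelling of a month: 1..12 with and without a leading zero.
valid = {str(n) for n in range(1, 13)} | {"%02d" % n for n in range(1, 13)}

# Try every one- and two-digit string and collect what the regex accepts.
candidates = [str(n) for n in range(10)] + ["%02d" % n for n in range(100)]
accepted = {c for c in candidates if MONTH.match(c)}

print(accepted == valid)
print(sorted(accepted))
```

If the two sets differ, the difference shows exactly which strings are wrongly accepted or rejected.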


### Days 

What is the pattern of days in a month?

```
    01, 02, 03, 04, 05, 06, 07, 08, 09
    10, 11, 12, 13, 14, 15, 16, 17, 18, 19
    20, 21, 22, 23, 24, 25, 26, 27, 28, 29
    30, 31
```

In [31]:
def day(pattern, m):
    if re.match(pattern, m):
        print(m + " is a day")
    else:
        print(m + " is NOT a day")

Assume that every month has 31 days. This is obviously not true, but the assumption simplifies the regular expression.

The simplest regular expression is
```python
    r"(\d?\d)"
```

In [32]:
day(r"^(\d?\d)$", "1")

1 is a day


In [33]:
day(r"^(\d?\d)$", "21")

21 is a day


In [34]:
day(r"^(\d?\d)$", "32")

32 is a day


Oops! 32 is not a day in a month. 

Let's look at the first 9 days, which are
```
01, 02, 03, 04, 05, 06, 07, 08, 09
```
The pattern is (assuming leading zeros can be omitted):
```python
   0?[1-9]
```
Then, look at the next 20 days, which are
```
    10, 11, 12, 13, 14, 15, 16, 17, 18, 19 
    20, 21, 22, 23, 24, 25, 26, 27, 28, 29
```
The pattern is:
```python
    [12][0-9]
```
Finally, the last two days are
```
    30, 31
```
The pattern is
```python
    3[01]
```
Putting it all together, we have
```python
    r"(0?[1-9]|[12][0-9]|3[01])"
```
![](./date2.jpg)

In [35]:
day(r"^(0?[1-9]|[12][0-9]|3[01])$", "31")

31 is a day


In [36]:
day(r"^(0?[1-9]|[12][0-9]|3[01])$", "00")

00 is NOT a day


In [37]:
day(r"^(0?[1-9]|[12][0-9]|3[01])$", "33")

33 is NOT a day
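
The anchors `^` and `$` matter for the day pattern, because `re.match` only requires the match to begin at the start of the string. The sketch below, with the illustrative names `day_core`, `unanchored`, `anchored`, and `all_days_ok`, shows what happens to "33" with and without them.

```python
import re

day_core = r"(0?[1-9]|[12][0-9]|3[01])"

# Without anchors, the first alternative matches the leading "3" and stops there.
unanchored = re.match(day_core, "33")
print(unanchored.group(0))

# With anchors, the whole string has to be a day, so "33" is rejected.
anchored = re.match("^" + day_core + "$", "33")
print(anchored)

# Every day from 1 to 31 is still accepted by the anchored pattern.
all_days_ok = all(re.match("^" + day_core + "$", str(n)) for n in range(1, 32))
print(all_days_ok)
```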


Finally, combining day, month, and year with a / or - separator gives our regular expression for handling dates, which is
```python
    r"(0?[1-9]|[12][0-9]|3[01])[/-](0?[1-9]|1[0-2])[/-]((?:\d{2})?\d{2})"
```
![](./date3.jpg)

In [38]:
def date(pattern, m):
    if re.match(pattern, m):
        print(m + " is a date")
    else:
        print(m + " is NOT a date")

In [39]:
date(r"^(0?[1-9]|[12][0-9]|3[01])[/-](0?[1-9]|1[0-2])[/-]((?:\d{2})?\d{2})$", "19-10-2019")

19-10-2019 is a date


In [40]:
date(r"^(0?[1-9]|[12][0-9]|3[01])[/-](0?[1-9]|1[0-2])[/-]((?:\d{2})?\d{2})$", "19/10/2019")

19/10/2019 is a date


In [41]:
date(r"^(0?[1-9]|[12][0-9]|3[01])[/-](0?[1-9]|1[0-2])[/-]((?:\d{2})?\d{2})$", "19/13/2019")

19/13/2019 is NOT a date
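
Because the day, month, and year are each wrapped in a capturing group, the same pattern can also extract the three parts. The sketch below assumes an anchored variant of the date pattern in which the leading zero of the day is optional; the helper name `parse_date` is hypothetical.

```python
import re

# Assumed variant of the date pattern: anchored, leading zeros optional.
DATE = re.compile(
    r"^(0?[1-9]|[12][0-9]|3[01])[/-](0?[1-9]|1[0-2])[/-]((?:\d{2})?\d{2})$"
)

def parse_date(text):
    """Return (day, month, year) as strings, or None if text is not a date."""
    m = DATE.match(text)
    return m.groups() if m else None

print(parse_date("19-10-2019"))
print(parse_date("1/2/19"))
print(parse_date("19/13/2019"))
print(parse_date("19/10-2019"))  # mixed separators are still accepted
print(parse_date("31/02/2019"))  # accepted: the pattern knows nothing about month lengths
```

The last two calls show the limits of this pattern: it checks each field on its own and does not relate the separators or the day to the month.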


<font color="red" size = 4>Homework:</font> We have assumed that every month has 31 days, but a month can actually have 28, 29, 30, or 31 days. How can we refine the regular expression so that it distinguishes between those months? (Note: assume every year is a leap year, so that February always has 29 days.)

## Case study 2: validate credit card number

In [42]:
def isCreditCard(pattern, string):
    if re.match(pattern, string):
        print(string + " is a credit card nunmber!")
    else:
        print(string + " is NOT a credit card number!")

### Visa cards:

The pattern of Visa card numbers:
* 13 or 16 digits, starting with 4. 

Examples:
* 4123456789012
* 4123456789012345

For 13-digit Visa card numbers, the regular expression should be
```python
    4\d{12}
```

For 16-digit Visa card numbers, the regular expression should be
```python
    4\d{15}
```

In [43]:
isCreditCard(r"^4\d{12}$", "4123456789012")

4123456789012 is a credit card nunmber!


In [44]:
isCreditCard(r"^4\d{12}$", "4123456789012345")

4123456789012345 is NOT a credit card number!


In [45]:
isCreditCard(r"^4\d{15}$", "4123456789012")

4123456789012 is NOT a credit card number!


In [46]:
isCreditCard(r"^4\d{15}$", "4123456789012345")

4123456789012345 is a credit card nunmber!


Now, we need to write one regular expression that validates both kinds of Visa card numbers.

```python
    4\d{12}(?:\d{3})?
```
![](./credit1.jpg)

In [47]:
isCreditCard(r"^4\d{12}(?:\d{3})?$", "4123456789012")

4123456789012 is a credit card nunmber!


In [48]:
isCreditCard(r"^4\d{12}(?:\d{3})?$", "4123456789012345")

4123456789012345 is a credit card nunmber!
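
To confirm that the merged Visa pattern accepts exactly the two intended lengths, the sketch below feeds it prefixes of increasing length. The digit string and the names `VISA`, `digits`, and `accepted_lengths` are assumptions made for illustration.

```python
import re

VISA = re.compile(r"^4\d{12}(?:\d{3})?$")

# Try the prefixes "4", "41", "412", ... and record which lengths are accepted.
digits = "41234567890123456789"
accepted_lengths = [n for n in range(1, len(digits) + 1) if VISA.match(digits[:n])]
print(accepted_lengths)
```

Lengths 14 and 15 fail because the optional group adds either three digits or none.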


### MasterCard:
The pattern of MasterCard numbers:

* 16 digits, starting with 51 through 55.

For example,

* 5123456789012345
* 5523456789012345

The regular expression looks like
```python
5[1-5]\d{14}
```

![](./credit2.jpg)

In [49]:
isCreditCard(r"^5[1-5]\d{14}$", "5123456789012345")

5123456789012345 is a credit card nunmber!


In [50]:
isCreditCard(r"^5[1-5]\d{14}$", "5723456789012345")

5723456789012345 is NOT a credit card number!


### American Express 

The pattern of American Express card numbers:

* 15 digits, starting with 34 or 37.

For example, 
* 341234567890123
* 371234567890123

Now it should be easy to figure out the regular expression:
```python
    3[47]\d{13}
```

In [51]:
isCreditCard(r"^3[47]\d{13}$", "341234567890123")

341234567890123 is a credit card nunmber!


In [52]:
isCreditCard(r"^3[47]\d{13}$", "371234567890123")

371234567890123 is a credit card nunmber!


In [53]:
isCreditCard(r"^3[47]\d{13}$", "381234567890123")

381234567890123 is NOT a credit card number!


Now let's put the three regular expressions together:
```python
    (?x)
        4\d{12}(?:\d{3})? | # Visa
        5[1-5]\d{14} |      # Master
        3[47]\d{13}         # American Express 
```
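<font color='red'>(?x)</font> is an inline flag that switches on verbose mode: whitespace in the pattern is ignored (unless it is escaped or inside a character class), and everything from # to the end of a line is treated as a comment.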
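
Verbose mode can be switched on either inside the pattern or through a flag argument. The sketch below compares both spellings with the compact one-line form, using the American Express pattern; the names `inline`, `flagged`, `compact`, and `sample` are illustrative assumptions.

```python
import re

# Inline flag: (?x) must appear at the very start of the pattern.
inline = re.compile(r"""(?x)
    3[47]    # prefix 34 or 37
    \d{13}   # thirteen more digits
""")

# The same pattern with the flag passed as an argument instead.
flagged = re.compile(r"""
    3[47]    # prefix 34 or 37
    \d{13}   # thirteen more digits
""", re.VERBOSE)

# The compact form without comments or whitespace.
compact = re.compile(r"3[47]\d{13}")

sample = "371234567890123"
print([bool(p.match(sample)) for p in (inline, flagged, compact)])
```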

In [54]:
cardPattern = r'''(?x)
        ^(?:
        4\d{12}(?:\d{3})? | # Visa
        5[1-5]\d{14} |      # MasterCard
        3[47]\d{13}         # American Express
        )$
        '''

In [55]:
isCreditCard(cardPattern, "31234567890123")

31234567890123 is NOT a credit card number!
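
Keeping one anchored pattern per brand, instead of a single alternation, also makes it possible to report which brand matched. The sketch below is one way to do that under the rules described in this case study; the names `CARD_PATTERNS` and `card_brand` are hypothetical.

```python
import re

# Hypothetical helper table: one anchored pattern per brand, using the rules above.
CARD_PATTERNS = [
    ("Visa", re.compile(r"^4\d{12}(?:\d{3})?$")),
    ("MasterCard", re.compile(r"^5[1-5]\d{14}$")),
    ("American Express", re.compile(r"^3[47]\d{13}$")),
]

def card_brand(number):
    """Return the name of the first brand whose pattern matches, or None."""
    for brand, pattern in CARD_PATTERNS:
        if pattern.match(number):
            return brand
    return None

for number in ["4123456789012", "5523456789012345",
               "371234567890123", "31234567890123"]:
    print(number + " -> " + str(card_brand(number)))
```

These patterns check only the prefix and the number of digits, so a match says that a string has the right shape, not that the card exists.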
