## Regular Expression 
### Natural Language Processing
Submitted by : **Saikiran N. Pasikanti**

**Task1:** Write a Python script that uses regular expressions to extract time patterns from a text.

**Assumptions :**
1. Time pattern should consider valid hours, valid minutes and different styles of writing times.
2. Hours should be between **0** and **23**
3. Minutes should be between **0** and **59**
4. Minutes should always be a two-digit number
5. Hour should be either a single-digit or two-digit number
6. Separator should always be **" : "**

In [1]:
#loading the required regular expression library into python environment
import re 

#### Expression for Minutes

In [2]:
# Sample text for minutes-regular expression
minute = "59 60 00 01 1 5 12 24 25 66 999 456 45 This text contains SEVEN valid minute records"

# re.findall() compiles the regular expression and finds all matches in the text
# Output is a list of the matched strings
outm = re.findall(r'\b([0-5]\d)\b', minute)


print(outm)                  # Prints the valid minute records
print("valid =", len(outm))  # Count of matches, to compare with the expected SEVEN

['59', '00', '01', '12', '24', '25', '45']
valid = 7


##### Explanation of regular expression
`\b     - matches start position of the word (word boundary)`<br>
`[0-5]  - matches when first digit is between 0 to 5`<br>
`\d     - matches when second digit is any numeric number from 0 to 9`<br>
`\b     - matches end position of the word`<br>

***This will match two digit number from 00-59***

#### Expression for Hours

In [3]:
# Sample text for hours-regular expression
hour = "00 24 25 99 888 22 10 1 02 05 5 32 31 30 220 This text contains SEVEN valid hour records"

# re.findall() compiles the regular expression and finds all matches in the text
# Output is a list of the matched strings
# [01]?\d covers 0-19 and 00-19, 2[0-3] covers 20-23 (the form 0?\d|[12][0-3] would miss 14-19)
outh = re.findall(r'\b([01]?\d|2[0-3])\b', hour)


print(outh)                  # Prints the valid hour records
print("valid =", len(outh))  # Count of matches, to compare with the expected SEVEN

['00', '22', '10', '1', '02', '05', '5']
valid = 7


##### Explanation of regular expression
`\b     - matches start position of the word (word boundary)`<br>
`[01]?  - matches an optional first digit 0 or 1 (covers single-digit hours and 00-19)`<br>
`\d     - matches any digit from 0 to 9`<br>
`|      - alternative re`<br>
`2      - matches when first digit is 2`<br>
`[0-3]  - matches when second digit is between 0 to 3 (covers 20-23)`<br>
`\b     - matches end position of the word`<br>

***This will match either single digit number or two digit number from 0-23 or 00-23***
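
A range pattern like this can be checked exhaustively, because there are only 110 one- and two-digit strings. The sketch below is an illustration rather than part of the original notebook: the helper `fully_matched` and the list `candidates` are hypothetical names, and the hour pattern is assumed to be written as `[01]?\d|2[0-3]`.

```python
import re

def fully_matched(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [c for c in candidates if compiled.fullmatch(c)]

# Every one- and two-digit string: "0".."9" and "00".."99"
candidates = [str(n) for n in range(10)] + [f"{n:02d}" for n in range(100)]

minute_hits = fully_matched(r"[0-5]\d", candidates)
hour_hits = fully_matched(r"[01]?\d|2[0-3]", candidates)

print(len(minute_hits))                    # 60 strings: "00".."59"
print(len(hour_hits))                      # 34 strings: "0".."9" plus "00".."23"
print(sorted({int(h) for h in hour_hits})) # the integers 0..23, with no gaps
```

If any hour between 0 and 23 were missing from the last list, the pattern would not satisfy assumption 2.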

In [4]:
time = '''00:00 #Valid
01:01 #Valid
59:59
999:99
10:250
30:12
1:72
3:52 #Valid
23:59 #Valid
24:00
12:00 #Valid
1:60
2:45 #Valid
10:30 #Valid
5:45 #Valid
09:45 #Valid
This text contains NINE valid time records'''

In [5]:
# Output will be a list of strings, one per matched time
# Separator is always ":"
# Non-capturing groups (?:...) make findall return the complete time string
outt = re.findall(r'\b(?:[01]?\d|2[0-3])\b:\b(?:[0-5]\d)\b', time) 

outt                  # List of valid records
print("Valid = ", len(outt)) # Number of valid records

Valid =  9
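
When the hour and minute are needed as numbers rather than as one string, named groups are one option. The sketch below is a hedged variant, not the notebook's own cell: the names `TIME`, `sample` and `found` are hypothetical, the hour part is assumed to be `[01]?\d|2[0-3]`, and the inner word boundaries around ":" are left out because a digit next to a colon already forms a boundary.

```python
import re

# Named groups expose the two parts of each match separately
TIME = re.compile(r"\b(?P<hour>[01]?\d|2[0-3]):(?P<minute>[0-5]\d)\b")

sample = "Open 9:30 to 17:45, not 24:00 or 7:5."

found = []
for m in TIME.finditer(sample):
    hour, minute = int(m["hour"]), int(m["minute"])
    # Store the matched text together with minutes since midnight
    found.append((m.group(0), hour * 60 + minute))

print(found)   # [('9:30', 570), ('17:45', 1065)]
```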


**Task2:** Write a Python script that uses regular expressions to extract date patterns from a text.

**Assumptions :**
1. Date pattern should consider valid month names, day numbers (1 to 31), month numbers (1 to 12) and different styles of writing dates.

2. Years should be between **0001** and **9999** (always a four-digit number)
3. Months should be between **1** and **12** (either a single-digit or two-digit number)
4. Months can be alphabetical: either the full name or the three-letter short name, with only the first letter capitalized, i.e. **Jan-Dec or January-December**
5. Day should be between **1** and **31** (either a single-digit or two-digit number)
6. Separator can be "-", "/", "." or a single space
7. A leap year is a year that is a multiple of 4 and not a multiple of 100, or a multiple of 400

`31 days = Jan(1), March(3), May(5), Jul(7), Aug(8), Oct(10), Dec(12)
30 days = Apr(4), Jun(6), Sep(9), Nov(11)
29 days = Feb(2) for leap year
28 days = Feb(2)`
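
The leap-year rule and the days-per-month table can be written as plain Python, which gives a reference to compare the regular expression against. This is a minimal sketch: `is_leap` and `days_in_month` are hypothetical helper names, and the cross-check relies on the standard library `calendar` module.

```python
import calendar

def is_leap(year):
    """Leap-year rule from the assumptions: multiple of 4 and not of 100, or multiple of 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def days_in_month(month, year):
    """Number of days in a month, following the table above."""
    if month == 2:
        return 29 if is_leap(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31

# Cross-check both helpers against the standard library for a few years
for year in (1900, 2000, 2004, 2018):
    assert is_leap(year) == calendar.isleap(year)
    for month in range(1, 13):
        assert days_in_month(month, year) == calendar.monthrange(year, month)[1]

print(is_leap(1900), is_leap(2000), is_leap(2004))   # False True True
```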

In [35]:
date ='''This text contains (only 33) valid date formats as per the assumptions
15 November 1989  # Valid       30-11-1890       # Valid            # Invalid Leap year           1-Jan-12345
October 2013                    1-1-9899         # Valid            29-2-2000  # Valid            32/10/1900
16/11/2016        # Valid       30-9-1789        # Valid            29-2-1900                     31-2-2018
16.11.2016        # Valid       07-5-9890        # Valid            31-November-2018
16-11-2016        # Valid       1-10-2030        # Valid            # Other Invalid formats       32-Jan-2019
2016-11-16                      01-09-9008       # Valid            2-2/2000                      32 October 2013
9.9.1994          # Valid       11/11/200        # Valid            12 12 2010                    12-13-2018
6.02.2006         # Valid       01/1/1990        # Valid            392-12-2017                   19/15/2900
02-29-2011                      10.10.2010       # Valid            1-896-2017                    00/00/0000
32-12-2011                      1.1.0001         # Valid            01-JANUARY-2018               99/99/9999
01@11@2011                      01-Jan-2018      # Valid            1-JAN-2018                    32 Cricket 3001
Cricket 2013                    2/Apr/1996       # Valid            1-Januardy-2018 
1-10-1994         # Valid       02/Feb/1997      # Valid            11-jan-2017 
1-2-9999          # Valid       1-December-1789  # Valid            1-10-18 
1.5.1907          # Valid       30-November-1080 # Valid            1-Jan-18 
9.12.0198         # Valid       12-October-1670  # Valid            1-10-218 
01-01-2018        # Valid       29-2-2004        # Valid            1-Jan-218 
10-9-1650         # Valid       29.2.7152        # Valid            1-10-12345 
29/2/2020         # Valid 
'''

In [8]:
#Regular expression for extracting the day as per assumption 5
#Match from 1-31 or 01-31
out1 = re.findall(r'\b(0?[1-9]|[12]\d|3[01])\b', date) 
print(out1)

['30', '01', '31', '1', '30', '02', '31', '2', '30', '03', '31', '3', '30', '04', '31', '4', '30', '05', '31', '5', '30', '06', '31', '6', '30', '07', '31', '7', '30', '08', '31', '8', '30', '09', '31', '9', '30', '10', '31', '10', '30', '11', '31', '11', '30', '12', '31', '12', '01', '01', '31', '12', '13', '01', '01', '27', '02', '31', '12', '13', '01', '01', '01', '01', '31', '12', '28', '2', '01', '01', '1', '1', '31', '12', '13', '1', '01', '01', '28', '01', '29', '01', '25', '31', '31', '01', '12', '01', '31', '30', '31', '30']


#### Explanation of Regular Expression
`0?[1-9] - matches single digit from 1-9 or two digit number in which first digit is 0 and second digit from 1-9
[12]\d  - matches two digit number in which first digit is either 1 or 2 and second digit any number 0-9
3[01]   - matches two digit number in which first digit is always 3 and second digit is either 0 or 1
|       - for alternative regular expression
\b      - start/ end word boundary`

In [9]:
#Regular expression for extracting the numeric month as per assumption 3
#Match from 1-12 or 01-12
out2 = re.findall(r'\b(0?[1-9]|1[0-2])\b', date) 
print(out2)

['01', '1', '02', '2', '03', '3', '04', '4', '05', '5', '06', '6', '07', '7', '08', '8', '09', '9', '10', '10', '11', '11', '12', '12', '01', '01', '12', '01', '01', '02', '12', '01', '01', '01', '01', '12', '2', '01', '01', '1', '1', '12', '1', '01', '01', '01', '01', '01', '12', '01']


#### Explanation of Regular Expression
`0?[1-9] - matches single digit from 1-9 or two digit number in which first digit is 0 and second digit from 1-9
1[0-2]  - matches two digit number in which first digit is always 1 and second digit is either 0 or 1 or 2`

In [10]:
#Regular expression for extracting the year as per assumption 2
#Match from 0001 to 9999
out3 = re.findall(r'\b(?:(?!0{4})\d\d\d\d)\b', date) 
print(out3)

['1989', '2989', '1989', '1989', '1889', '1999', '1989', '1989', '1789', '2789', '1900', '1989', '1989', '1989', '1989', '1989', '1989', '1989', '1989', '1989', '1989', '1989', '1989', '1989', '1000', '3000', '3001', '0999', '9999', '1000', '3000', '3001', '0999', '9999', '1000', '3000', '3000', '0999', '9999', '1000', '3000', '3001', '1999', '9999', '1000', '3000', '3001', '0999', '9999', '1000', '2120', '3001', '2999', '9999', '2000', '3000', '3000', '2000', '3000', '3000', '2525', '2525']


#### Explanation of Regular Expression
`(?!0{4}) - negative lookahead, will not match 0000
\d\d\d\d - matches four digit number 0000-9999
Overall it matches number from 0001-9999`
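
The effect of the negative lookahead is easiest to see on a short string. The sketch below is an illustration with hypothetical names (`YEAR`, `sample`, `years`); it writes `\d\d\d\d` as the equivalent `\d{4}`.

```python
import re

# (?!0{4}) rejects the position if the next four characters are 0000;
# it consumes nothing, so \d{4} still matches the same four digits
YEAR = re.compile(r"\b(?!0{4})\d{4}\b")

sample = "0000 0001 0999 2018 9999 12345 218"
years = YEAR.findall(sample)

print(years)   # ['0001', '0999', '2018', '9999']
```

The five-digit and three-digit numbers are rejected by the word boundaries, not by the lookahead.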

In [11]:
#Regular expression for extracting the month name as per assumption 4
#Match from Jan-Dec or January-December
out4 = re.findall(r'\b(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b', date) 
print(out4)

['Jan', 'Feb', 'January', 'February', 'Apr', 'Apr', 'Nov', 'Nov']


#### Explanation of Regular Expression
`(Jan)     - capturing group, matches three-letter string Jan, Case sensitive
(?:uary)?  - (?: )non-capturing group, matches four-letter string uary, ()? implies it is optional`
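
The month alternation follows one rule (first three letters, remainder optional), so it can also be generated from a list of names. This is a sketch under that assumption, not the notebook's method: `MONTHS`, `parts`, `MONTH` and `hits` are hypothetical names.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Build "Jan(?:uary)?" style pieces; "May" has no remainder, so it stays as is
parts = [m[:3] + (f"(?:{m[3:]})?" if m[3:] else "") for m in MONTHS]
MONTH = re.compile(r"\b(?:" + "|".join(parts) + r")\b")

sample = "Jan January JANUARY jan Januardy Sept Sep May Cricket"
hits = MONTH.findall(sample)

print(hits)   # ['Jan', 'January', 'Sep', 'May']
```

The match is case sensitive, and the closing `\b` rejects "Januardy" and "Sept" because the optional remainder must match completely or not at all.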


### Final Regular Expression

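We extract each valid date as the complete original string by using only non-capturing groups `(?:...)`, so that `re.findall()` returns whole matches rather than group contents.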

In [36]:
ext = re.findall(r'''
        
\b(?:(?:   #30-day Months
        (?:  0?[1-9] | [12]\d | 30  )[ ./-]          #Matches days between 1-30 or 01-30 with separator./-
        (?:  0?[469] | 11 | Apr(?:il)? | Jun(?:e)? | Sep(?:tember)? | Nov(?:ember)?)  
|
                      
        #31-day Months
        (?:  0?[1-9] | [12]\d | 3[01] )[ ./-]        #Matches days between 1-31 or 01-31 with separator./-
        (?:  0?[13578] | 1[02] | Jan(?:uary)?| Mar(?:ch)? |May| Jul(?:y)?|Aug(?:ust)? | Oct(?:ober)?| Dec(?:ember)?) 
|
                      
        #28-day Month
        (?:  0?[1-9] | [12][0-8]      )[ ./-]        #Matches days between 1-28 or 01-28 with separator./-
        (?:  0?[2] | Feb(?:ruary)?    )
)         
                      
        [ ./-]
        #Year
        (?:  (?!0{4})   \d\d\d\d))                   #Matches any year between 0001-9999

|


# This expression will match Leap year 29 day of February
# Three non-capturing groups
        (?:(?:29[ ./-](?:0?[2] | Feb(?:ruary)?)[ ./-])  #Matches day and month
        (?:(?:\d\d)(?:[02468][48]|[13579][26]|[2468]0)  #Years which are multiples of 4 and not multiples of 100
        |
        (?:(?:[02468][48]|[13579][26]|[2468]0)00)))     #Years which are multiples of 400


\b
''', date, re.VERBOSE)
#print(ext)
print("There are ",len(ext), "strings which are in valid date format in the text-variable 'date'")
ext

There are  35 strings which are in valid date format in the text-variable 'date'


['15 November 1989',
 '30-11-1890',
 '1-Jan-1234',
 '1-1-9899',
 '29-2-2000',
 '16/11/2016',
 '30-9-1789',
 '16.11.2016',
 '07-5-9890',
 '16-11-2016',
 '1-10-2030',
 '01-09-9008',
 '2-2/2000',
 '9.9.1994',
 '12 12 2010',
 '6.02.2006',
 '01/1/1990',
 '10.10.2010',
 '1.1.0001',
 '01-Jan-2018',
 '2/Apr/1996',
 '1-10-1994',
 '02/Feb/1997',
 '1-2-9999',
 '1-December-1789',
 '1.5.1907',
 '30-November-1080',
 '9.12.0198',
 '12-October-1670',
 '01-01-2018',
 '29-2-2004',
 '10-9-1650',
 '29.2.7152',
 '1-10-1234',
 '29/2/2020']
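
The entries '1-Jan-1234' and '1-10-1234' in this list appear to be the first part of '1-Jan-12345' and '1-10-12345' from the sample text. A likely cause is that `|` has the lowest precedence, so a `\b` written at the start or end of the whole expression belongs only to the branch it sits in. The sketch below reproduces the effect with a deliberately tiny pattern; `loose`, `strict` and `text` are hypothetical names, and the patterns are simplified stand-ins rather than the expression above.

```python
import re

text = "1-Jan-12345 1-Jan-2018"

# \b binds only to its own branch: the first branch has no closing \b
loose = re.compile(r"\b\d-Jan-\d{4}|29-Feb-\d{4}\b")

# Grouping the branches makes both boundaries apply to every branch
strict = re.compile(r"\b(?:\d-Jan-\d{4}|29-Feb-\d{4})\b")

loose_hits = loose.findall(text)
strict_hits = strict.findall(text)

print(loose_hits)    # ['1-Jan-1234', '1-Jan-2018']
print(strict_hits)   # ['1-Jan-2018']
```

Wrapping all branches of the date expression in one outer `(?:...)` between the two `\b` anchors would apply the same idea.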


## Regular Expression for Date with same separator always

If the separator after the day and the separator after the month must be the same, we can use the following regular expression, which repeats the date pattern once for each separator.

In [37]:
ext = re.findall(r'''
\b(?:(?:   #30-day Months
        (?:  0?[1-9] | [12]\d | 30  )[-]          #Matches days between 1-30 or 01-30 with separator-
        (?:  0?[469] | 11 | Apr(?:il)? | Jun(?:e)? | Sep(?:tember)? | Nov(?:ember)?)  
|                
        #31-day Months
        (?:  0?[1-9] | [12]\d | 3[01] )[-]        #Matches days between 1-31 or 01-31 with separator-
        (?:  0?[13578] | 1[02] | Jan(?:uary)?| Mar(?:ch)? |May| Jul(?:y)?|Aug(?:ust)? | Oct(?:ober)?| Dec(?:ember)?) 
|
      #28-day Month
        (?:  0?[1-9] | [12][0-8]      )[-]        #Matches days between 1-28 or 01-28 with separator-
        (?:  0?[2] | Feb(?:ruary)?    )
)                       
        [-]
        #Year
        (?:  (?!0{4})   \d\d\d\d))                   #Matches any year between 0001-9999
|
# This expression will match Leap year 29 day of February
# Three non-capturing groups
        (?:(?:29[-](?:0?[2] | Feb(?:ruary)?)[-])  #Matches day and month
        (?:(?:\d\d)(?:[02468][48]|[13579][26]|[2468]0)  #Years which are multiples of 4 and not multiples of 100
        |
        (?:(?:[02468][48]|[13579][26]|[2468]0)00)))     #Years which are multiples of 400
\b
|
\b(?:(?:   #30-day Months
        (?:  0?[1-9] | [12]\d | 30  )[.]          #Matches days between 1-30 or 01-30 with separator.
        (?:  0?[469] | 11 | Apr(?:il)? | Jun(?:e)? | Sep(?:tember)? | Nov(?:ember)?)  
|                      
        #31-day Months
        (?:  0?[1-9] | [12]\d | 3[01] )[.]        #Matches days between 1-31 or 01-31 with separator.
        (?:  0?[13578] | 1[02] | Jan(?:uary)?| Mar(?:ch)? |May| Jul(?:y)?|Aug(?:ust)? | Oct(?:ober)?| Dec(?:ember)?) 
|                     
        #28-day Month
        (?:  0?[1-9] | [12][0-8]      )[.]        #Matches days between 1-28 or 01-28 with separator.
        (?:  0?[2] | Feb(?:ruary)?    )
)                               
        [.]
        #Year
        (?:  (?!0{4})   \d\d\d\d))                   #Matches any year between 0001-9999
|
# This expression will match Leap year 29 day of February
# Three non-capturing groups
        (?:(?:29[.](?:0?[2] | Feb(?:ruary)?)[.])  #Matches day and month
        (?:(?:\d\d)(?:[02468][48]|[13579][26]|[2468]0)  #Years which are multiples of 4 and not multiples of 100
        |
        (?:(?:[02468][48]|[13579][26]|[2468]0)00)))     #Years which are multiples of 400
\b 
|
\b(?:(?:   #30-day Months
        (?:  0?[1-9] | [12]\d | 30  )[/]          #Matches days between 1-30 or 01-30 with separator/
        (?:  0?[469] | 11 | Apr(?:il)? | Jun(?:e)? | Sep(?:tember)? | Nov(?:ember)?)  
|                      
       #31-day Months
        (?:  0?[1-9] | [12]\d | 3[01] )[/]        #Matches days between 1-31 or 01-31 with separator/
        (?:  0?[13578] | 1[02] | Jan(?:uary)?| Mar(?:ch)? |May| Jul(?:y)?|Aug(?:ust)? | Oct(?:ober)?| Dec(?:ember)?) 
|
        #28-day Month
        (?:  0?[1-9] | [12][0-8]      )[/]        #Matches days between 1-28 or 01-28 with separator/
        (?:  0?[2] | Feb(?:ruary)?    )
)                               
        [/]
        #Year
        (?:  (?!0{4})   \d\d\d\d))                   #Matches any year between 0001-9999
|
# This expression will match Leap year 29 day of February
# Three non-capturing groups
        (?:(?:29[/](?:0?[2] | Feb(?:ruary)?)[/])  #Matches day and month
        (?:(?:\d\d)(?:[02468][48]|[13579][26]|[2468]0)  #Years which are multiples of 4 and not multiples of 100
        |
        (?:(?:[02468][48]|[13579][26]|[2468]0)00)))     #Years which are multiples of 400
\b 
|
\b(?:(?:#30-day Months
        (?:  0?[1-9] | [12]\d | 30  )[ ]          #Matches days between 1-30 or 01-30 with separator " "
        (?:  Apr(?:il)? | Jun(?:e)? | Sep(?:tember)? | Nov(?:ember)?)  
|
        #31-day Months
        (?:  0?[1-9] | [12]\d | 3[01] )[ ]        #Matches days between 1-31 or 01-31 with separator " "
        (?: Jan(?:uary)?| Mar(?:ch)? |May| Jul(?:y)?|Aug(?:ust)? | Oct(?:ober)?| Dec(?:ember)?) 
|                    
        #28-day Month
        (?:  0?[1-9] | [12][0-8]      )[ ]        #Matches days between 1-28 or 01-28 with separator " "
        (?:  Feb(?:ruary)?    )
)                             
        [ ]
        #Year
        (?:  (?!0{4})   \d\d\d\d))                   #Matches any year between 0001-9999
|
# This expression will match Leap year 29 day of February
# Three non-capturing groups
        (?:(?:29[ ](?: Feb(?:ruary)?)[ ])  #Matches day and month
        (?:(?:\d\d)(?:[02468][48]|[13579][26]|[2468]0)  #Years which are multiples of 4 and not multiples of 100
        |
        (?:(?:[02468][48]|[13579][26]|[2468]0)00)))     #Years which are multiples of 400
\b 
''', date, re.VERBOSE)
#print(ext)
print("There are ",len(ext), "strings which are in valid date format in the text-variable 'date'")
ext

There are  33 strings which are in valid date format in the text-variable 'date'


['15 November 1989',
 '30-11-1890',
 '1-Jan-1234',
 '1-1-9899',
 '29-2-2000',
 '16/11/2016',
 '30-9-1789',
 '16.11.2016',
 '07-5-9890',
 '16-11-2016',
 '1-10-2030',
 '01-09-9008',
 '9.9.1994',
 '6.02.2006',
 '01/1/1990',
 '10.10.2010',
 '1.1.0001',
 '01-Jan-2018',
 '2/Apr/1996',
 '1-10-1994',
 '02/Feb/1997',
 '1-2-9999',
 '1-December-1789',
 '1.5.1907',
 '30-November-1080',
 '9.12.0198',
 '12-October-1670',
 '01-01-2018',
 '29-2-2004',
 '10-9-1650',
 '29.2.7152',
 '1-10-1234',
 '29/2/2020']
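
A backreference is another way to require the same separator twice without repeating the pattern per separator. The sketch below is a simplified illustration only: it checks the separator rule and the digit counts, not the day, month or leap-year ranges, and unlike the expression above it also accepts a space between numeric parts. The names `SAME_SEP` and `results` are hypothetical.

```python
import re

# (?P<sep>...) captures the first separator, (?P=sep) requires the same one again
SAME_SEP = re.compile(r"\b\d{1,2}(?P<sep>[ ./-])\d{1,2}(?P=sep)\d{4}\b")

results = {}
for s in ["16/11/2016", "9.9.1994", "2-2/2000", "12 12 2010"]:
    results[s] = bool(SAME_SEP.fullmatch(s))
    print(s, results[s])
```

Only "2-2/2000" is rejected here, because its two separators differ.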