# NLP Exercises - Regular Expressions

## 1.

Write a function named is_vowel. 

It should accept a string as input and use a regular expression to determine if the passed string is a vowel. 

While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.
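
The hint above can be checked directly. The following is a minimal sketch using only the standard re module; the sample strings are illustrative. re.search returns a match object when the pattern is found and None otherwise, so bool() turns the result into True or False.

```python
import re

# re.search returns a match object on success and None on failure.
found = re.search(r'[aeiou]', 'b a n')
missing = re.search(r'[aeiou]', 'xyz')

# A match object is truthy; None is falsy.
print(found is not None)
print(missing is None)
print(bool(found), bool(missing))
```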

In [46]:
import re
import pandas as pd

In [10]:
def is_vowel(string):
    # Anchor the pattern so only a single vowel matches,
    # not any string that merely contains one.
    regexp = r'^[aeiouAEIOU]$'
    return bool(re.search(regexp, string))

In [11]:
is_vowel('a')

True

## 2.

Write a function named is_valid_username that accepts a string as input. 

A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character. 

It should also be no longer than 32 characters. 

The function should return either True or False depending on whether the passed string is a valid username.

In [12]:
def is_valid_username(username):
    # One lowercase letter, then up to 31 lowercase letters, digits, or underscores.
    regexp = r'^[a-z][a-z0-9_]{,31}$'
    return bool(re.search(regexp, username))

In [15]:
is_valid_username('codeslinger99')

True

In [17]:
is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')

False

In [22]:
is_valid_username('codeup')

True

In [23]:
is_valid_username('Codeup')

False

In [24]:
is_valid_username('codeup123')

True

In [25]:
is_valid_username('1codeup')

False
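
One edge case is worth a sketch: with re.search, a trailing `$` also matches just before a final newline, so a string such as 'codeup\n' would pass a `^...$` pattern. The variant below uses re.fullmatch instead, which requires the whole string to match. The names USERNAME_RE and is_valid_username_strict are hypothetical and used only for illustration.

```python
import re

# Hypothetical stricter variant; the names are illustrative only.
USERNAME_RE = re.compile(r'[a-z][a-z0-9_]{0,31}')

def is_valid_username_strict(username):
    """Return True only when the entire string is a valid username."""
    # fullmatch anchors at both ends, so a trailing newline is rejected too.
    return USERNAME_RE.fullmatch(username) is not None

for candidate in ['codeslinger99', 'a' * 33, 'Codeup', '1codeup', 'codeup\n']:
    print(repr(candidate), is_valid_username_strict(candidate))
```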

## 3.

Write a regular expression to capture phone numbers. It should match all of the following:

* (210) 867 5309
* +1 210.867.5309
* 867-5309
* 210-867-5309


In [37]:
phone_numbers_re = re.compile(r'''
^
(?P<country_code>\+\d+)?
\D*?
(?P<area_code>\d{3})??  # lazy, so a 7-digit number fills exchange_code instead
\D*?
(?P<exchange_code>\d{3})?
\D*?
(?P<line_number>\d{4})
\D*
$
''', re.VERBOSE)

df = pd.DataFrame()
df['number'] = [
    '(210) 867 5309',
    '+1 210.867.5309',
    '867-5309',
    '210-867-5309',
    '2108675309',
]
pd.concat([df, df.number.str.extract(phone_numbers_re)], axis=1)

| | number | country_code | area_code | exchange_code | line_number |
|---|---|---|---|---|---|
| 0 | (210) 867 5309 | NaN | 210 | 867 | 5309 |
| 1 | +1 210.867.5309 | +1 | 210 | 867 | 5309 |
| 2 | 867-5309 | NaN | NaN | 867 | 5309 |
| 3 | 210-867-5309 | NaN | 210 | 867 | 5309 |
| 4 | 2108675309 | NaN | 210 | 867 | 5309 |
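
The same kind of check can be run without pandas. The sketch below uses a self-contained pattern named PHONE_RE (a hypothetical name) with a lazy area-code group, and prints the named groups for the four required formats. It is one possible pattern, not the only valid one.

```python
import re

# Illustrative pattern for this sketch. The area-code group is lazy (??)
# so a seven-digit number fills exchange_code and line_number.
PHONE_RE = re.compile(r'''
    ^
    (?P<country_code>\+\d+)?
    \D*?
    (?P<area_code>\d{3})??
    \D*?
    (?P<exchange_code>\d{3})?
    \D*?
    (?P<line_number>\d{4})
    \D*
    $
''', re.VERBOSE)

samples = ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']
parsed = {}
for sample in samples:
    match = PHONE_RE.search(sample)
    # groupdict() maps each named group to its text, or None if it did not match.
    parsed[sample] = match.groupdict() if match else None
    print(sample, '->', parsed[sample])
```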


## 4.

Use regular expressions to convert the dates below to the standardized year-month-day format.

* 02/04/19
* 02/05/19
* 02/06/19
* 02/07/19
* 02/08/19
* 02/09/19
* 02/10/19


In [39]:
dates = pd.Series(['02/04/19', 
                   '02/05/19',
                   '02/06/19',
                   '02/07/19',
                   '02/08/19',
                   '02/09/19',
                   '02/10/19'])
dates

0    02/04/19
1    02/05/19
2    02/06/19
3    02/07/19
4    02/08/19
5    02/09/19
6    02/10/19
dtype: object

In [45]:
dates.str.replace(r'(\d+)/(\d+)/(\d+)', r'20\3-\1-\2', regex=True)

0    2019-02-04
1    2019-02-05
2    2019-02-06
3    2019-02-07
4    2019-02-08
5    2019-02-09
6    2019-02-10
dtype: object
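
The same substitution can be sketched with the standard re module and named groups, which make it clearer which field is which. The names raw_dates, date_re, and iso_dates are illustrative, and the hard-coded '20' prefix assumes every two-digit year belongs to the 2000s.

```python
import re

raw_dates = ['02/04/19', '02/05/19', '02/10/19']

# Named groups document the month/day/year order of the input.
date_re = re.compile(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})')

# \g<name> refers to a named group in the replacement string.
# Assumption: all years are 20xx, hence the literal '20' prefix.
iso_dates = [date_re.sub(r'20\g<year>-\g<month>-\g<day>', d) for d in raw_dates]
print(iso_dates)  # ['2019-02-04', '2019-02-05', '2019-02-10']
```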

## 5.

Write a regex to extract the various parts of these logfile lines:
* GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
* POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
* GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58


In [47]:
subject = '''
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
'''

In [49]:
regexp = r'''
^
(?P<method>GET|POST)
\s
(?P<path>/[/\w\-\?=]+)
\s
\[(?P<timestamp>.+)\]
\s
(?P<http_version>HTTP/\d+\.\d+)
\s
\{(?P<status_code>\d+)\}
\s
(?P<bytes_out>\d+)
\s
"(?P<user_agent>.+)"
\s
(?P<ip>\d+\.\d+\.\d+\.\d+)
'''

[re.search(regexp, sub, re.VERBOSE).groupdict() for sub in subject.strip().split('\n')]

[{'method': 'GET',
  'path': '/api/v1/sales?page=86',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '200',
  'bytes_out': '510348',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'},
 {'method': 'POST',
  'path': '/users_accounts/file-upload',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '201',
  'bytes_out': '42',
  'user_agent': 'User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
  'ip': '97.105.19.58'},
 {'method': 'GET',
  'path': '/api/v1/items?page=3',
  'timestamp': '16/Apr/2019:193453+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '429',
  'bytes_out': '3561',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'}]
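
Calling .groupdict() directly on the result of re.search raises an AttributeError when a line does not match, because re.search returns None in that case. The sketch below wraps the parsing in a hypothetical helper, parse_log_line, that returns None for such lines and converts the numeric fields to integers. The compact pattern LOG_RE is an illustrative variant written for this sketch.

```python
import re

# Compact, illustrative log pattern for this sketch.
LOG_RE = re.compile(r'''
    ^(?P<method>GET|POST)\s
    (?P<path>\S+)\s
    \[(?P<timestamp>[^\]]+)\]\s
    (?P<http_version>HTTP/\d+\.\d+)\s
    \{(?P<status_code>\d+)\}\s
    (?P<bytes_out>\d+)\s
    "(?P<user_agent>[^"]+)"\s
    (?P<ip>\d+\.\d+\.\d+\.\d+)$
''', re.VERBOSE)

def parse_log_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_RE.search(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert numeric fields so they can be compared or summed.
    fields['status_code'] = int(fields['status_code'])
    fields['bytes_out'] = int(fields['bytes_out'])
    return fields

line = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 '
        '{200} 510348 "python-requests/2.21.0" 97.105.19.58')
print(parse_log_line(line))
print(parse_log_line('not a log line'))
```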