# Regular Expressions Exercises
---

### 1. Write a function named `is_vowel`. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of `re.search` as a boolean value that indicates whether or not the regular expression matches the given string.

In [1]:
# imports (note: regex is the third-party package, aliased here to re)
import regex as re
import pandas as pd

In [2]:
# regex search for single vowel
re.search(r"^[aeiou]$", "O", re.IGNORECASE)

<regex.Match object; span=(0, 1), match='O'>

In [3]:
# define function
def is_vowel(string):
    '''
    This function takes in a string and returns True if the passed string is a vowel.
    '''
    return bool(re.search(r"^[aeiou]$", string, re.IGNORECASE))
# test function
is_vowel('b'), is_vowel('u')

(False, True)
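The truthiness that `is_vowel` relies on comes from `re.search` returning either a match object or `None`. The sketch below illustrates that, along with one edge case: `$` also matches just before a trailing newline, whereas `re.fullmatch` must consume the whole string. It uses the standard-library `re` module rather than the `regex` package imported above, and the names `is_vowel_search` and `is_vowel_strict` are hypothetical, introduced only for this comparison.

```python
import re  # standard library; the notebook itself aliases the third-party regex package

VOWEL = r"[aeiou]"

def is_vowel_search(string):
    # Same approach as is_vowel above: anchors plus re.search
    return bool(re.search(rf"^{VOWEL}$", string, re.IGNORECASE))

def is_vowel_strict(string):
    # Hypothetical variant: fullmatch must consume the entire string
    return bool(re.fullmatch(VOWEL, string, re.IGNORECASE))

# re.search returns None (falsy) when nothing matches
print(re.search(r"^[aeiou]$", "b"))

# Both versions agree on ordinary input
print(is_vowel_search("u"), is_vowel_strict("u"))

# They disagree when the string ends with a newline
print(is_vowel_search("a\n"), is_vowel_strict("a\n"))
```

Whether the trailing-newline case matters depends on where the input comes from; for strings read from a file without stripping, the stricter variant is probably the safer choice.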

### 2. Write a function named `is_valid_username` that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the `_` character. It should also be no longer than 32 characters. The function should return either `True` or `False` depending on whether the passed string is a valid username.
- is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
 - False
- is_valid_username('codeup')
 - True
- is_valid_username('Codeup')
 - False
- is_valid_username('codeup123')
 - True
- is_valid_username('1codeup')
 - False

In [4]:
# regex search for:
# 1. a lowercase letter at the start
# 2. followed by lowercase letters, digits, or _
# 3. with step 2 repeated at most 31 times, for 32 characters total
re.search(r"^[a-z][a-z0-9_]{,31}$", 'codeup')

<regex.Match object; span=(0, 6), match='codeup'>

In [5]:
# define function
def is_valid_username(string):
    '''
    This function takes in a string and returns True if it meets the requirements for a
    valid username. A valid username starts with a lowercase letter, consists only of
    lowercase letters, numbers, or the _ character, and is no longer than 32 characters.
    '''
    return bool(re.search(r"^[a-z][a-z0-9_]{,31}$", string))
# test function
print(is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'))
print(is_valid_username('codeup'))
print(is_valid_username('Codeup'))
print(is_valid_username('codeup123'))
print(is_valid_username('1codeup'))

False
True
False
True
False
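The test strings above do not probe the exact length limit, so a few boundary cases are worth checking. The following is a minimal sketch using the standard-library `re` module; `is_valid_username_strict` is a hypothetical variant that uses `fullmatch` instead of the `^...$` anchors, and the extra test strings are made up for illustration.

```python
import re

# One lowercase letter, then up to 31 more allowed characters (32 total)
USERNAME = re.compile(r"[a-z][a-z0-9_]{0,31}")

def is_valid_username_strict(string):
    # Hypothetical variant of is_valid_username using fullmatch instead of ^...$
    return USERNAME.fullmatch(string) is not None

cases = {
    "a": True,             # shortest valid username
    "a" * 32: True,        # exactly 32 characters
    "a" * 33: False,       # one character too long
    "code_up_123": True,   # underscores and digits after the first letter
    "_codeup": False,      # must start with a lowercase letter
    "": False,             # empty string
}

for username, expected in cases.items():
    assert is_valid_username_strict(username) == expected, username
print("all boundary cases pass")
```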


### 3. Write a regular expression to capture phone numbers. It should match all of the following:
- (210) 867 5309
- +1 210.867.5309
- 867-5309
- 210-867-5309

In [6]:
# work through the subject strings in order of increasing complexity
# solve one at a time
# add optional parts as the pattern grows

# look for 7 numbers
re.search(r'\d{7}', '8675309')

<regex.Match object; span=(0, 7), match='8675309'>

In [7]:
# 3 digits, hyphen, 4 digits
re.search(r'\d{3}-\d{4}', '867-5309')

<regex.Match object; span=(0, 8), match='867-5309'>

In [8]:
# 3 digits, hyphen period or space, 4 digits
re.search(r'\d{3}[-. ]\d{4}', '867 5309')

<regex.Match object; span=(0, 8), match='867 5309'>

In [9]:
# 3 digits, optional non-digit, 4 digits
re.search(r'\d{3}\D?\d{4}', '867-5309')

<regex.Match object; span=(0, 8), match='867-5309'>

In [10]:
# optional country code: optional +, one or more digits (\+?\d+)?
# optional non-digit separator \D?
# optional area code: optional open parenthesis, 3 digits, optional close parenthesis (\(?\d{3}\)?)?
# optional non-digit separator \D?
# 3 digits \d{3}
# optional non-digit separator \D?
# 4 digits \d{4}
re.search(r"(\+?\d+)?\D?(\(?\d{3}\)?)?\D?\d{3}\D?\d{4}", '(210) 867 5309')

<regex.Match object; span=(0, 14), match='(210) 867 5309'>

In [11]:
# make sure it works on every phone number in the list above
# fullmatch requires the pattern to cover the whole string, not just part of it
phone_pattern = r"(\+?\d+)?\D?(\(?\d{3}\)?)?\D?\d{3}\D?\d{4}"
for number in ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']:
    assert re.fullmatch(phone_pattern, number), number
print('It works!')

It works!
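Matching a number is not the same as splitting it into the right parts. As a rough sketch, the same pattern can be written with named groups to see what each piece actually captures. This uses the standard-library `re` module, and the group names (`country_code`, `area_code`, `prefix`, `line_number`) are labels chosen here, not part of the exercise.

```python
import re

# Same pattern as above, with named groups added for inspection
phone_parts = re.compile(r"""
    (?P<country_code>\+?\d+)?     # optional country code
    \D?                           # optional separator
    (?P<area_code>\(?\d{3}\)?)?   # optional area code, parentheses included if present
    \D?                           # optional separator
    (?P<prefix>\d{3})             # first three digits of the local number
    \D?                           # optional separator
    (?P<line_number>\d{4})        # last four digits
""", re.VERBOSE)

numbers = ["(210) 867 5309", "+1 210.867.5309", "867-5309", "210-867-5309"]

# All four strings match in full, so fullmatch never returns None here
parsed = {number: phone_parts.fullmatch(number).groupdict() for number in numbers}

for number, parts in parsed.items():
    print(number, parts)
```

All four strings match, but the groups are looser than their names suggest. In `210-867-5309` the leading `210` is captured as the country code and the area code is left empty, and in `(210) 867 5309` the opening parenthesis is consumed by the first `\D?`, so the area code comes out as `210)`. A stricter pattern would be needed if the individual parts matter.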


### 4. Use regular expressions to convert the dates below to the standardized year-month-day format.
- 02/04/19
- 02/05/19
- 02/06/19
- 02/07/19
- 02/08/19
- 02/09/19
- 02/10/19

In [27]:
# put original dates into dataframe
dates = [
    "02/04/19",
    "02/05/19",
    "02/06/19",
    "02/07/19",
    "02/08/19",
    "02/09/19",
    "02/10/19"
]

df = pd.DataFrame({"original": dates})
df

Unnamed: 0,original
0,02/04/19
1,02/05/19
2,02/06/19
3,02/07/19
4,02/08/19
5,02/09/19
6,02/10/19


In [28]:
# define pattern to look for
pattern = re.compile(r"""
(?P<month>\d{2})/
(?P<day>\d{2})/
(?P<year>\d{2})
""", re.VERBOSE)

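In [30]:
# apply to dataframe; pass the pattern as a plain string, because pandas rejects the
# regex-module pattern compiled above (TypeError: first argument must be string or compiled pattern)
date_regex = r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})"
df = pd.concat([df, df.original.str.extract(date_regex)], axis=1)
# year-month-day; assumes every two-digit year falls in the 2000s
df["new_format"] = "20" + df.year + "-" + df.month + "-" + df.day
df

Unnamed: 0,original,month,day,year,new_format
0,02/04/19,02,04,19,2019-02-04
1,02/05/19,02,05,19,2019-02-05
2,02/06/19,02,06,19,2019-02-06
3,02/07/19,02,07,19,2019-02-07
4,02/08/19,02,08,19,2019-02-08
5,02/09/19,02,09,19,2019-02-09
6,02/10/19,02,10,19,2019-02-10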
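The same conversion can also be sketched without pandas, using `sub` with named backreferences from the standard-library `re` module. The `"20"` prefix is an assumption that every two-digit year belongs to the 2000s; the exercise itself does not say which century is meant.

```python
import re

dates = ["02/04/19", "02/05/19", "02/06/19", "02/07/19",
         "02/08/19", "02/09/19", "02/10/19"]

# month/day/year, each captured as a named group
date_pattern = re.compile(r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})")

# Assumption: every two-digit year belongs to the 2000s, hence the "20" prefix.
iso_dates = [date_pattern.sub(r"20\g<year>-\g<month>-\g<day>", d) for d in dates]
print(iso_dates)
```

The `\g<name>` syntax in the replacement string refers back to the named groups, which keeps the reordering readable compared with numbered references such as `\3-\1-\2`.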

### 5. Write a regex to extract the various parts of these logfile lines:
- GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
- POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
- GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58

In [21]:
# list logfile lines
lines = [
    """GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58""",
    """POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58""",
    """GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58"""
]
lines

['GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
 'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
 'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58']

In [22]:
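# isolate the response size: the digits between "} " and the opening quote
# (raw string, so the backslashes reach the regex engine unchanged)
x = re.compile(r"""
\}\s(?P<bytes>\d+)\s"
""", re.VERBOSE)
match = x.search(lines[0])
match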

<regex.Match object; span=(65, 75), match='} 510348 "'>

In [23]:
match.group("bytes")


'510348'

In [24]:
log_pattern = re.compile(r"""
(?P<method>GET|POST) 
\s
(?P<path>/[/\w\-\?=]+)
\s
\[(?P<timestamp>.+)\]
\s
(?P<http_version>HTTP/\d+\.\d+)
\s
\{(?P<status_code>\d+)\}
\s
(?P<bytes>\d+)
\s
"(?P<user_agent>.+)"
\s
(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
$
""", re.VERBOSE)

In [25]:
rows = [re.search(log_pattern, line).groupdict() for line in lines]
rows

[{'method': 'GET',
  'path': '/api/v1/sales?page=86',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '200',
  'bytes': '510348',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'},
 {'method': 'POST',
  'path': '/users_accounts/file-upload',
  'timestamp': '16/Apr/2019:193452+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '201',
  'bytes': '42',
  'user_agent': 'User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
  'ip': '97.105.19.58'},
 {'method': 'GET',
  'path': '/api/v1/items?page=3',
  'timestamp': '16/Apr/2019:193453+0000',
  'http_version': 'HTTP/1.1',
  'status_code': '429',
  'bytes': '3561',
  'user_agent': 'python-requests/2.21.0',
  'ip': '97.105.19.58'}]

In [26]:
df = pd.DataFrame(rows)
df

Unnamed: 0,method,path,timestamp,http_version,status_code,bytes,user_agent,ip
0,GET,/api/v1/sales?page=86,16/Apr/2019:193452+0000,HTTP/1.1,200,510348,python-requests/2.21.0,97.105.19.58
1,POST,/users_accounts/file-upload,16/Apr/2019:193452+0000,HTTP/1.1,201,42,User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ...,97.105.19.58
2,GET,/api/v1/items?page=3,16/Apr/2019:193453+0000,HTTP/1.1,429,3561,python-requests/2.21.0,97.105.19.58
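Every value extracted by `groupdict()` is a string, including the status code, byte count, and timestamp. A possible next step, sketched below, converts those fields to numeric and datetime types. The helper name `convert_types` is hypothetical, the sample row is copied from the first parsed line above, and the timestamp format string assumes English month abbreviations and the compact `HHMMSS` time seen in these log lines.

```python
from datetime import datetime, timezone

# First parsed row from the output above
row = {
    "method": "GET",
    "path": "/api/v1/sales?page=86",
    "timestamp": "16/Apr/2019:193452+0000",
    "http_version": "HTTP/1.1",
    "status_code": "200",
    "bytes": "510348",
    "user_agent": "python-requests/2.21.0",
    "ip": "97.105.19.58",
}

def convert_types(row):
    # Hypothetical helper: returns a copy with typed status_code, bytes, and timestamp
    converted = dict(row)
    converted["status_code"] = int(row["status_code"])
    converted["bytes"] = int(row["bytes"])
    # Assumes day/Mon/year:HHMMSS+zone, with an English month abbreviation
    converted["timestamp"] = datetime.strptime(row["timestamp"], "%d/%b/%Y:%H%M%S%z")
    return converted

typed = convert_types(row)
print(typed["status_code"], typed["bytes"], typed["timestamp"].isoformat())
```

With typed values, filtering on status codes or summing bytes works directly instead of comparing strings.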
