# Lab 8: Regular Expressions
In this lab assignment you will practice constructing regular expressions. First you will try your hand at some basic regular expressions in order to match words and numbers in a text file. Then you’ll try some more complex regular expressions in "real" situations.

For all of the exercises, you will use only the basic regular expressions (BRE) discussed in class. The only symbols you can use are:

    *     Zero or more
    +     One or more
    ?     Zero or one / Optional
    [X-Y] One character out of the range X to Y
    .     Any one character
    |     Or
    ( )   Grouping
    { }   Repetition
    \.    Dot
    \$    Dollar sign
    \     Escape any other special character
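
As a rough illustration of how a few of these symbols behave, here is a small sketch. The sample strings and the `all_matches` helper are made up for this example and are not part of the assignment files.

```python
import re

def all_matches(pattern, text):
    # Hypothetical helper: return the full text of every match
    return [m.group() for m in re.finditer(pattern, text)]

# ( ) and | : "gr", then either "a" or "e", then "y"
either = all_matches(r'gr(a|e)y', "gray grey groy")

# { } : the group "ha" repeated exactly twice
twice = all_matches(r'(ha){2}', "ha haha")

# ? : the "u" is optional
optional = all_matches(r'colou?r', "color colour")

# \$ : a literal dollar sign followed by one or more digits
dollars = all_matches(r'\$[0-9]+', "cost $15 or 15")

print(either)    # ['gray', 'grey']
print(twice)     # ['haha']
print(optional)  # ['color', 'colour']
print(dollars)   # ['$15']
```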

Some common regular expression patterns are:

    [0-9]     Any one digit
    [0-9]+    One or more digits
    [a-zA-Z]  Any one letter (upper or lower case)
    [a-z]{3}  Three lower case letters
    -?[0-9]+  A number with an optional negative sign
    .*        Any number of any characters (including no characters at all)
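
To see these common patterns in action, the sketch below runs them against a short made-up string (the sample text is an assumption for illustration only).

```python
import re

# Hypothetical sample text
sample = "cat -42 dog 7 XYZ"

# [0-9]+ : runs of digits (the minus sign is not included)
digits = re.findall(r'[0-9]+', sample)

# -?[0-9]+ : the same, but with an optional leading minus sign
signed = re.findall(r'-?[0-9]+', sample)

# [a-z]{3} : exactly three lower case letters in a row
triples = re.findall(r'[a-z]{3}', sample)

# [a-zA-Z] : one letter at a time, upper or lower case
letters = re.findall(r'[a-zA-Z]', "a1B")

# .* : matches anything, even an empty string
matches_empty = re.fullmatch(r'.*', "") is not None

print(digits)         # ['42', '7']
print(signed)         # ['-42', '7']
print(triples)        # ['cat', 'dog']
print(letters)        # ['a', 'B']
print(matches_empty)  # True
```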
    
## How to Do This Assignment

Before starting, upload the four text files for this assignment:

* contacts.txt
* program.txt
* perl.txt
* weblog.txt

In the code cell for each problem you'll see the string TODO. Replace TODO with the regular expression that solves the problem.

There are no automated checks for your solutions; compare the output yourself against the expected results listed in each problem.

In [None]:
# Run this cell to define the function you'll need for the rest of the assignment.

from re import finditer

# Return a list of the strings that match the given
# regular expression
def extract(regexp, fname):
    matches = []
    with open(fname) as f:
        for line in f:
            for match in finditer(regexp, line):
                matches.append(match.group())
    # The with block closes the file automatically
    return matches
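
If you want to try `extract` before opening the assignment files, here is a minimal sketch of how it might be called. The temporary file and its contents are invented for this example, and the function is repeated so the cell runs on its own.

```python
import os
import tempfile
from re import finditer

def extract(regexp, fname):
    # Same behavior as the extract() function defined above
    matches = []
    with open(fname) as f:
        for line in f:
            for match in finditer(regexp, line):
                matches.append(match.group())
    return matches

# Hypothetical sample data, written to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("red apple\ngreen pear\nred plum\n")
    sample_path = tmp.name

# Matches are collected line by line, in the order they appear
found = extract(r'red|green', sample_path)
print(found)  # ['red', 'green', 'red']

# Clean up the temporary file
os.remove(sample_path)
```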

# Words and Numbers
Take a look at the file named **contacts.txt**. It’s a list of fake names and addresses.

## Problem 1
Extract all numbers in the text. Numbers are simply one or more consecutive digits. It's OK if the numbers that get extracted are in the middle of email addresses.

**Check:** The first few numbers should be **1116, 42, 02919, 401, 783, 9567, 5361, 40108, ...**

In [None]:
extract(r'TODO', "contacts.txt")

## Problem 2
Extract all words in the text. Words are composed of one or more letters (upper and lower case). Words can be parts of email addresses, too.

**Check:** The first few words are **Chong, Merrih, Unit, G, Cranston, RI, merrihchong, gmail, com, ...**

In [None]:
extract(r'TODO', "contacts.txt")

## Problem 3
Extract all five- and nine-digit ZIP codes. Keep in mind that nine-digit ZIP codes have a hyphen in them.

**Check:** The first few ZIP codes are **02919, 40108-3837, 33912-1923, 03907, ...**

In [None]:
extract(r'TODO', "contacts.txt")

## Problem 4
Extract all email addresses. For our purposes, an email address is a *username*, followed by an @ symbol, followed by the *hostname*.  
The username consists of one or more letters, numbers, dots, dashes, and underscores.  
The hostname consists of one or more letters, numbers, dots, and dashes.

**Check:** Here are the first few email addresses: `merrihchong@gmail.com`, `verona@brandenburg.ci.ky.us`, `eunice_h@yahoo.com`, `ccobl@bbc.co.uk`, `brendon@snarky.me`, `taibl@hotmail.com`

In [None]:
extract(r'TODO', "contacts.txt")

# Tokenizing

Take a look at **program.txt**. This is a simple computer program from a non-existent programming language. In this language, keywords are uppercase and everything else is lowercase.

*Tokenizing* is the phase of compilation that identifies the parts of the program in preparation for the parser. The lexer needs to scan through the source code, locate the tokens, and tag them as such. Here, you will write regular expressions (one per problem) to find various things to be tokenized.
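
As a rough sketch of what "locate the tokens and tag them" could look like, the example below finds the uppercase keywords in a single made-up line of code and labels each one. The sample line and the `KEYWORD` tag name are assumptions for illustration; the real contents of program.txt may differ.

```python
import re

# Hypothetical line of source code
line = "IF total THEN PRINT total"

# Keywords are uppercase in this language, so [A-Z]+ finds them
keywords = re.findall(r'[A-Z]+', line)

# Tag each match so a parser could tell what kind of token it is
tokens = [("KEYWORD", word) for word in keywords]

print(tokens)
# [('KEYWORD', 'IF'), ('KEYWORD', 'THEN'), ('KEYWORD', 'PRINT')]
```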

## Problem 5
Extract all the *numbers* in the program. Keep in mind they could be a decimal amount and may be negative.

**Check:** Here are the numbers you should find: 1.8, 32, -2.547, 7, 20.0

In [None]:
extract(r'TODO', "program.txt")

## Problem 6
Extract all the *identifiers*. Identifiers are the names that programmers give to things, such as variables, function names, etc. In this programming language, they are composed of lowercase letters.

The identifiers in the program are: **project, x, y, age, convert, temp, zip, foo,** ...

In [None]:
extract(r'TODO', "program.txt")

# Tokenizing a Real Language

The next file is **perl.txt**. This is a real program written in the Perl programming language. It reads in a comma-separated file (CSV) in which each line is a student’s last name, first name, and ID number. It converts each line into an array containing the username, password, first name, and last name — enough information to make user accounts on a server.

## Problem 7
Extract the *variables*. In Perl, variables begin with a $ or @ symbol. The next character can be a letter or underscore. The rest of the variable name can be letters, digits, or underscores. In Perl, @_ is a valid variable, too!

In this program, some of the variables are: `$lname`, `$fname`, `$sid`, `$LIST_SEP`, `@line`, `$first4`, `$full_name`, `@_`, `$password`, ...

In [None]:
extract(r'TODO', "perl.txt")

## Problem 8
Extract the *comments*. Comments begin with a # symbol and continue to the end of the line. They can contain any number of any characters, including none.

Some examples from the program:

    #! /usr/bin/perl
    # Convert CSV file with student info into a file
    # that can be used to make accounts.
    #
    # Output is username, password, first name, last name
    # Change array element separator to comma
    # Print the whole array
    # Return first initial, last name, all lower case

In [None]:
extract(r'TODO', "perl.txt")

# Web Server Log Files

Web servers log every request that comes in. Each request is for an individual document, such as an HTML file, an image, a JavaScript program, or a CSS file. The last file, **weblog.txt**, contains a portion of a typical log file. Among other data points, each entry records:

* the IP address of the web browser
* the date/time of the request
* the file requested
* the status code
* the file size

## Problem 9
Extract the IP addresses. IP addresses are four numbers separated by dots. Remember that a dot is a special character, so you may need to *escape* it by using a backslash: `\.`
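
The difference between an unescaped and an escaped dot can be seen in this small sketch, which uses a made-up string rather than the log file:

```python
import re

# Hypothetical sample text
sample = "v1.2 and v1x2"

# Unescaped dot: matches ANY one character between 1 and 2
unescaped = re.findall(r'1.2', sample)

# Escaped dot: matches only a literal "." between 1 and 2
escaped = re.findall(r'1\.2', sample)

print(unescaped)  # ['1.2', '1x2']
print(escaped)    # ['1.2']
```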

In [None]:
extract(r'TODO', "weblog.txt")

## Problem 10
Extract the dates and times, which look like this: **06/May/2019:03:45:36**

In [None]:
extract(r'TODO', "weblog.txt")

## Problem 11
Extract the image filenames. Images end with **.jpg** or **.png**. Filenames can contain letters, numbers, and underscores.

Just extract the filename, not the whole path. For example, we want **header.jpg**, but not **/sites/all/themes/litejazz/js/header.jpg**.

You should get: header.jpg, greetings_art.jpg, backgnd.jpg, footer2.png

In [None]:
extract(r'TODO', "weblog.txt")