# Regex Assignment

In this assignment you're going to learn how to use regular expressions (regex) to extract and clean patterns from textual data. Regular expressions provide a special language to capture patterns in text. They can be combined without limit to extract *very* complex patterns under many conditions. Luckily for you, we will only go into the basics. In Python, the [`re`](https://docs.python.org/2/library/re.html) library provides us with a way of interpreting these patterns and using them for *search*.

Let's look at some examples:

In [1]:
import re

some_text = "hello, world"
re.findall(r'\w', some_text)  # raw string, so Python leaves the backslash alone

['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']

The above example simply matches each word character individually (`\w` covers the letters a through z in both cases, the digits 0 through 9, and the underscore). If we want to find actual words, we need to tell it that we're looking for a *sequence*. This can be done using the `*`, which means "zero or more of the preceding item". Basically it tells regex to do these steps:

- Start scanning the string
- If you find a character that matches `\w`, start recording.
- If you find any character that doesn't match this, stop recording.
- Give me a list of all the sequences you recorded.

In [2]:
re.findall(r'\w*', some_text)

['hello', '', '', 'world', '']

These 'empty' hits (`''`) appear because `*` also accepts *zero* characters: wherever the scan lands on a character that does not belong to the pattern you specified (e.g. `,` or whitespace), or on the end of the string, it records an empty match.
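If the empty hits are unwanted, one option is the `+` quantifier (one or more), which never matches zero characters. A minimal sketch on the same string:

```python
import re

some_text = "hello, world"

# `+` needs at least one word character, so empty strings cannot be matched.
words = re.findall(r'\w+', some_text)
print(words)  # ['hello', 'world']
```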

Of course, you can also only match letters (to clean data for example). By default regex searches for the **exact** order that you provide as a pattern, like so:

In [3]:
more_text = "abc cba abc bc ab"
re.findall("abc", more_text)

['abc', 'abc']

If you want to match any sequence made up of `a` and `b` in any order (so `ba` as well as `ab`), you can enclose these characters in square brackets (`[ ]`). For example:

In [4]:
re.findall("[ab]*", more_text)

['ab', '', '', '', 'ba', '', 'ab', '', '', 'b', '', '', 'ab', '']

Because it's pretty inconvenient to write `[abcdefghijklmnopqrstuvwxyz]` every time, you can abbreviate this with `[a-z]`. The same works for `[A-Z]` and `[0-9]`. Of course you can combine these as well (`[a-zA-Z0-9]`). This can be used for cleaning up data, for example:

In [5]:
raw_text = "h3ll0, w0rld."
re.findall('[a-z]*', raw_text)

['h', '', 'll', '', '', '', 'w', '', 'rld', '', '']
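As a small illustration of combining ranges, the sketch below reuses `raw_text` from the cell above. The `+` quantifier (one or more) is used instead of `*` to skip the empty hits; joining the pieces back together is just one possible way to clean the string.

```python
import re

raw_text = "h3ll0, w0rld."

# Letters in both cases plus digits, one or more in a row.
tokens = re.findall(r'[a-zA-Z0-9]+', raw_text)
print(tokens)  # ['h3ll0', 'w0rld']

# Keep only the lowercase letters and glue the pieces together.
cleaned = ''.join(re.findall(r'[a-z]+', raw_text))
print(cleaned)  # hllwrld
```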

Whitespace also counts as a character (simply a space, like ` `). So say we want to split a text into words, we can do this by:

In [6]:
example_text = "So say we want to split a text into words, we can do this by:"
re.findall('[A-Za-z0-9]*[ ,.:]', example_text)

['So ',
 'say ',
 'we ',
 'want ',
 'to ',
 'split ',
 'a ',
 'text ',
 'into ',
 'words,',
 ' ',
 'we ',
 'can ',
 'do ',
 'this ',
 'by:']

Let's quickly analyze the code. We look for "any alphanumeric character" until we see "either whitespace, comma, period or colon". We can exclude this second group of punctuation marks from the output by wrapping the `[A-Za-z0-9]*` part in parentheses (`( )`). Like so:

In [7]:
re.findall('([A-Za-z0-9]*)[ ,.:]', example_text)

['So',
 'say',
 'we',
 'want',
 'to',
 'split',
 'a',
 'text',
 'into',
 'words',
 '',
 'we',
 'can',
 'do',
 'this',
 'by']
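With more than one pair of parentheses, `findall` returns a tuple per hit, one element per group. A short sketch on a made-up sentence (the string `example` is only for illustration), again using `+` so the word part cannot be empty:

```python
import re

example = "split a text, like this:"

# Group 1 is the word, group 2 is the separator that follows it.
pairs = re.findall(r'([A-Za-z0-9]+)([ ,.:])', example)
print(pairs)
# [('split', ' '), ('a', ' '), ('text', ','), ('like', ' '), ('this', ':')]
```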

Some characters are also used as regex symbols (like `^`, `?` etc.), and therefore need to be 'escaped', as this is called. This is done by prepending a `\` to the symbol to indicate: THIS IS NOT PART OF THE PATTERN LANGUAGE, match it literally! So for example:

In [8]:
question_text = "why this question?"
re.findall(r'\?', question_text)

['?']

Try to remove the `\` and see that it gives an error.
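The sketch below shows that error being caught, and also uses `re.escape`, a standard-library helper that adds the backslashes for you. The variable names `failed` and `escaped` are only for illustration.

```python
import re

question_text = "why this question?"

# An unescaped '?' has nothing in front of it to repeat, so the pattern is invalid.
try:
    re.findall('?', question_text)
    failed = False
except re.error as exc:
    failed = True
    print("invalid pattern:", exc)

# re.escape puts a backslash in front of every special character.
escaped = re.escape('?')
print(escaped)                             # \?
print(re.findall(escaped, question_text))  # ['?']
```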

Finally, you can also control how many repetitions of something you want to match with quantifiers like `+` (1 or more), `?` (0 or 1), `{m}` (exactly $m$ times), and `{m,n}` (from $m$ to $n$ times). E.g.:

In [9]:
pattern_text = "aaaaaaaaaaaaaaaaa bbbbbbbbbbbb ccccccc aaaaaaaaa bbbaaaa"
re.findall('a{4}', pattern_text)

['aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa', 'aaaa']

In [10]:
re.findall('a{2,5}', pattern_text)

['aaaaa', 'aaaaa', 'aaaaa', 'aa', 'aaaaa', 'aaaa', 'aaaa']
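The `+` and `?` quantifiers are not shown in the cells above, so here is a minimal sketch of both. The two sample strings are made up for this illustration.

```python
import re

# `+`: one or more b's in a row.
b_runs = re.findall(r'b+', "aaa bb c abbb")
print(b_runs)  # ['bb', 'bbb']

# `?`: the preceding 'u' may appear zero times or once.
spellings = re.findall(r'colou?r', "color or colour")
print(spellings)  # ['color', 'colour']
```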

Note that by default, regex is 'greedy'. So if you define a minimum of two and a maximum of five, it will always 'eat' as many as it can, up to five, and only settles for fewer when there are not enough left. If you want it to be non-greedy, you add a `?` directly after the quantifier. Like so:

In [11]:
re.findall('a{2,5}?', pattern_text)

['aa',
 'aa',
 'aa',
 'aa',
 'aa',
 'aa',
 'aa',
 'aa',
 'aa',
 'aa',
 'aa',
 'aa',
 'aa',
 'aa']

As you can see, it now always stops at 2. It's 'full', so to speak, once it has eaten two a's.
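The difference matters most with open-ended quantifiers such as `*`. The sketch below assumes the `.` symbol (any character except a newline, see the cheat sheet) and a made-up string `quoted`:

```python
import re

quoted = 'say "yes" or "no"'

# Greedy: runs from the first quote all the way to the last quote.
greedy = re.findall(r'".*"', quoted)
print(greedy)  # ['"yes" or "no"']

# Non-greedy: stops at the first closing quote it can find.
lazy = re.findall(r'".*?"', quoted)
print(lazy)    # ['"yes"', '"no"']
```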

There's a lot more to regex, but these are the very basic things (and as much as you will need for this assignment). Here's a convenient cheat sheet with most of the patterns:

![cheatsheet](http://tartley.com/wp-content/uploads/2011/10/Python-regular-expression-cheatsheet-0.3.0.png)

# Important BEFORE Starting

Please run the code below. It provides two handy functions to help you finish the assignments:

In [12]:
from IPython.display import display, HTML

def highlight_regex(matches, text):
    """Highlights the items in 'matches' in the provided 'text'."""
    for match in set(matches):
        if match:  # skip empty matches; replacing '' would mark every position
            text = text.replace(match, '<mark>' + match + '</mark>')
    text = text.replace('\n', '<br>')  # convert line breaks once, after highlighting
    return display(HTML('<html>' + text + '</html>'))

def submit_answers(name, results):
    """Writes your answers to a txt file so we can check them."""
    # the with-block closes the file even if writing fails
    with open('regex-{0}.txt'.format(name), 'w', encoding='utf-8') as answer_file:
        answer_file.write('\n'.join(results) if name != 'html' else results)

## Find Hashtags in Tweets

We'll start off nice and easy. For the first assignment you are asked to find a number of hashtags in tweets. The tweets are from [Chris Manning](https://twitter.com/chrmanning); you are allowed to use his page as a visual reference. We will use the `findall` command, which will return a list of all hits of the pattern.

Now it's your turn.

### Task 1
*Find all hashtags in a user's Twitter timeline.*

Please note that when there are consecutive hashtags, these are matched **independently**.

In [13]:
import re

twitter_file = open('data/tweets.md', encoding='utf-8').read()

# ONLY EDIT THIS PART

hashtags = re.findall('#', twitter_file)  # edit the pattern '#'

# -------------------

print(hashtags)
highlight_regex(hashtags, twitter_file)

submit_answers('hashtags', hashtags)

['#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#']


## Generating New Features (DSBG only)

As we've seen in the Social Data Science lecture, sometimes you are provided with textual features that cannot be interpreted by the software you are using (in this case WEKA). The Titanic dataset is a perfect example of this. If we take a look at the rows in the `.arff` file (here saved as `.csv`), we can see that the names of the passengers contain some extra information:

In [14]:
import csv

reader = csv.reader(open('data/titanic.csv', encoding='utf-8'))
for i, row in enumerate(reader):
    if i == 10:
        break
    else:
        print(row[3])

Braund - Mr. Owen Harris
Cumings - Mrs. John Bradley (Florence Briggs Thayer)
Heikkinen - Miss. Laina
Futrelle - Mrs. Jacques Heath (Lily May Peel)
Allen - Mr. William Henry
Moran - Mr. James
McCarthy - Mr. Timothy J
Palsson - Master. Gosta Leonard
Johnson - Mrs. Oscar W (Elisabeth Vilhelmina Berg)
Nasser - Mrs. Nicholas (Adele Achem)


So apparently they noted 'titles' for certain passengers, which might indicate whether they are married, hold some rank, and so on. They are coded in a very distinct format, so they should be easy to extract.

### Task 2
*Find all titles in the Titanic dataset.*

In [15]:
import re

titanic_file = open('data/titanic.csv', encoding='utf-8').read()

# ONLY EDIT THIS PART

titles = re.findall(r'\.', titanic_file)  # edit the pattern '\.' -- DON'T REMOVE THE ESCAPE! :)

# -------------------

titles = set(titles)
print(titles)
highlight_regex(titles, titanic_file)

submit_answers('titles', titles)

{'.'}


## Finding E-mails

In this part, you are dealing with `.csv` data. This *comma-separated values* format is commonly used to store flat table data. Quick sample:

    "1","jan engelen","j.a.a.engelen@tilburguniversity.edu"
    "2","emannuel keuleers","e.a.keuleers@tilburguniversity.edu"
    "3","chris emmery","cmry@protonmail.com"

This usually works pretty well, and you can conveniently access each field of a row by its index (starting from 0) in your favourite programming language. Python uses the `csv` library for this. Basically you just open the file like we did before, and you pass it to the `csv.reader`. This reader will interpret the format, and try to make sense of the field boundaries (here we use `"`). I've saved the sample above under `data/sample-mails.csv`. We can open it up and list the emails like so:

In [16]:
import csv

reader = csv.reader(open("data/sample-mails.csv", encoding='utf-8'))

for line in reader:
    print(line[2])  # notice that 0 would be the numbers, 1 the names

j.a.a.engelen@tilburguniversity.edu
e.a.keuleers@tilburguniversity.edu
cmry@protonmail.com
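If you want to experiment without the data file, `csv.reader` also accepts any iterable of lines. The sketch below wraps the three sample rows in an in-memory `io.StringIO` object; the variable names are only for illustration.

```python
import csv
import io

# The three sample rows from above, kept in memory instead of in a file.
sample = '''"1","jan engelen","j.a.a.engelen@tilburguniversity.edu"
"2","emannuel keuleers","e.a.keuleers@tilburguniversity.edu"
"3","chris emmery","cmry@protonmail.com"
'''

rows = list(csv.reader(io.StringIO(sample)))
print(rows[0])  # ['1', 'jan engelen', 'j.a.a.engelen@tilburguniversity.edu']

# Index 2 holds the email field of each row.
sample_emails = [row[2] for row in rows]
print(sample_emails)
```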


The truth is that sometimes these `.csv` files are not so neatly structured (especially when written by hand or messed up by software). This is also the case for the file saved under `data/emails.csv`. Let's open it up and check the first few rows:

In [17]:
reader = csv.reader(open('data/emails.csv', encoding='utf-8'))

for i, line in enumerate(reader):
    if i == 10:
        break
    else:
        print(line)

['Fallon Olesen"fallonolesen@bing.com', 'RABY~A2a', '5d0c757513f62a706bdda1e0715540cf"']
['Agrippa Brandon', 'agrippabran@geocities.com', '6aWU,UrY?uVA', '2b44d6ab7313e4cc07ed81ae3532b088']
['Khalid Wann', 'kha.wan@yahoo.com', 'XuJaZaQ', '97865ab5c1942dd713f94ff6305032ef']
['Aden Odum', 'ade_odu@webmail.com', '#ajy5ej"', '9df0b29a273d52721b307a5297eba414']
['Amit Schlabach"', 'ami_sc@info.com', '@A5UNu.YXU', '7f103e43d0d41fd68b9902947d63ffd3']
['Adelinda Koepke', 'ade-koepke@toodles.com', '=UqeHEGaXE', '3bc74e6518600ce8628c66ed6446717b']
['Adila Kohut', 'adila_koh@toodles.com', 'myBU$uGaHE)a', '1ee5691a325f20ca360b15791a0c05e5']
['Talfryn Abeyta', '\'talfry-abeyta@yoohoo.com""Zare~Ete"', '044c01baf0887027784064db7482eb92']
["Corbin Silvis'cor_silvis@msn.com", 'Ba-UnY~YgeLU', '849ed89eec82fc3b2972a1fb5e25bc46"Rhoda Richman', 'rhoda-ri@myspace.com', 'zyGEDYGynUZy', 'd9d48f6cfe010fe8da0732b403d9b648']
['Malak Beeler', 'malak.beeler@yoohoo.com', '(e9unA?ERy2A', '4ac773d3b3e388d8b49844ddf1a

As we can see, not all rows behave like they should: sometimes the email is merged with the name, and the second-to-last row merges two entries together. Luckily, we can still retrieve a list of all the emails using regular expressions!

### Task 3
*Extend the regex below to recover as many of the 50 emails as you can.*

Note that each match (highlighted in the output) should cover the **whole** email address, not just the `@` or the `@hotmail.com` part.

In [18]:
import re

email_file = open('data/emails.csv', encoding="utf-8").read()

# ONLY EDIT THIS PART

emails = re.findall('@', email_file)  # edit the pattern '@'

# -------------------

print(emails)
highlight_regex(emails, email_file)

submit_answers('emails', emails)

['@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@']


## Cleaning HTML

An example of a task that can be solved with regex is cleaning your data of any junk that pollutes your nice running piece of natural language. Maybe most notorious are `.html` files: the structure of webpages. HTML includes a lot of `<tag>something</tag>`. For this part, you are going to clean [this](https://tweakers.net/reviews/4867/acer-swift-7-preview-de-dunste-13-komma-3-inch-notebook-ooit.html) webpage. We take Tweakers.net because it has a relatively neat HTML structure (compared to other monstrosities of web development). For this task we will use `sub`, which 'substitutes' every hit of a pattern with a string. So say we would want to remove all numbers from a string, we would do:

In [19]:
number_text = "123 assf 1231 oidsjg 093463"
re.sub('[0-9]', '', number_text)

' assf  oidsjg '

Or we convert them to `_` for example:

In [20]:
number_text = "123 assf 1231 oidsjg 093463"
re.sub('[0-9]', '_', number_text)

'___ assf ____ oidsjg ______'

To keep things neat, we can just use multiple lines:

In [21]:
number_text = "123 assf 1231 oidsjg 093463"
text = re.sub('[0-9]', '_', number_text)
text = re.sub(' ', '', text)
text

'___assf____oidsjg______'
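The same chaining works for tidying up what is left over. As a sketch, removing the digits leaves doubled and leading spaces behind; a second rule collapses them, and the string method `strip` (not part of `re`) trims the ends:

```python
import re

number_text = "123 assf 1231 oidsjg 093463"

text = re.sub(r'[0-9]+', '', number_text)  # ' assf  oidsjg '
text = re.sub(r' +', ' ', text)            # ' assf oidsjg '
text = text.strip()                        # drop leading and trailing spaces
print(text)  # assf oidsjg
```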

### Task 4
*Clean the HTML of the article so it shows as little junk as possible (i.e. at best only the article).*

Please don't waste too much time on this assignment. It's a **very** hard task to solve with regex.

In [22]:
import re

html = open('data/acer-swift.html', encoding='utf-8').read()

html = re.sub(r'\.', '', html)  #
html = re.sub(r'\.', '', html)  # use more lines to keep the rules simple

print(html)

submit_answers('html', html)


<!doctype html><!-- (c) 1998 - 2016 de Persgroep Online Services BV -->
<html lang="NL">
	<head>
		<meta http-equiv="content-type" content="text/html; charset=iso-8859-15">
		<meta name="referrer" content="origin-when-cross-origin">
		<meta name="viewport" content="width=device-width">
		<meta name="description" content="Acer onthulde op IFA een nieuwe lijn notebooks onder de naam Swift Volgens de fabrikant is het 13,3&quot; Swift 7-model de dunste notebook van dit moment">
		<meta property="fb:app_id" content="188199811217403">
		<meta property="og:site_name" content="Tweakers">
		<meta property="og:locale" content="nl_NL">
		<meta property="og:type" content="article">
		<meta property="og:title" content="Acer Swift 7 Preview: de dunste 13,3&quot;-notebook ooit">
		<meta property="og:url" content="https://tweakersnet/reviews/4867/acer-swift-7-preview-de-dunste-13-komma-3-inch-notebook-ooithtml">
		<meta property="og:description" content="Acer onthulde op IFA een nieuwe lijn notebooks