# NLTK Regular Expressions #
Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/

This is a tutorial for simple text processing with Python using the [nltk library](https://www.nltk.org/). For further reading I recommend the extensive online nltk book [available here](https://www.nltk.org/book/).
In this notebook we will
* load text files from disk
* find word patterns with regular expressions (and see where they fail)

It is assumed that you have some general knowledge of
* basic python

## Setup 
If you have never used nltk before, you need to download the example corpora. Uncomment the `nltk.download()` line to do so. We also import the nltk library and `re`, Python's library for regular expressions.

In [5]:
import nltk, re
from nltk import word_tokenize
# NOTE if the data (corpora, example files) is not yet downloaded, this needs to be done first
#nltk.download()


Let's see which free resources are readily available, and then take a closer look at Shakespeare's Hamlet (to pretend we are literature freaks).

In [6]:
print(nltk.corpus.gutenberg.fileids())
hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
print(len(hamlet))

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
37360


## Regular Expressions
So Hamlet consists of 37360 tokens (words and punctuation marks, as split by nltk). Let's investigate which patterns we find there.

* In which words does the character sequence "wre" occur?

In [7]:
[w for w in hamlet if re.search('wre', w)]

['wretch',
 'wretch',
 'wretched',
 'powres',
 'Powres',
 'wretched',
 'wretched',
 'showres',
 'wretch',
 'wretched']

* And which of them actually start with "wre"?

In [8]:
[w for w in hamlet if re.search('^wre', w)]

['wretch', 'wretch', 'wretched', 'wretched', 'wretched', 'wretch', 'wretched']
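
The difference between an unanchored and an anchored search can be tried on a small list first. The sketch below uses a hypothetical hand-made word list (`words`), not the corpus, and also shows `re.match`, which only ever tries to match at the start of a string and therefore behaves like `re.search` with a leading `^` here.

```python
import re

# Hypothetical mini word list for illustration, not taken from the corpus
words = ["wretch", "powres", "showres", "wretched"]

# re.search scans the whole string, so "wre" may occur anywhere
anywhere = [w for w in words if re.search(r"wre", w)]

# With "^" the match is pinned to the start of the string
anchored = [w for w in words if re.search(r"^wre", w)]

# re.match only tries the start of the string, so no "^" is needed
matched = [w for w in words if re.match(r"wre", w)]

print(anywhere)
print(anchored)
print(matched)
```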

* Find all words that start with "T" or "t", end with "r" and have exactly 5 other characters in the middle. To implement the "T" or "t" we use a character class specified by the brackets `[]`: `[Tt]` matches either "T" or "t". 
To match the letters in the middle we could use the character class `[a-zA-Z]`, but the abbreviation `\w` (any letter, digit or underscore) is much more convenient. The quantifier `{5,5}` means "at least 5 and at most 5 times" and is equivalent to `{5}`. Further predefined character classes are:
  * `\d` Matches any decimal digit
  * `\D` Matches any non-digit character (this includes whitespace and punctuation)
  * `\s` Matches any whitespace character (this could be line endings, blanks or tabs). This is tricky, because some of them are not visible if you look at the text with a text editor.

In [9]:
[w for w in hamlet if re.search(r'^[Tt]\w{5,5}r$', w)]

['Thunder',
 'truster',
 'thither',
 'Thunder',
 'Theater',
 'thicker',
 'thither',
 'thether']
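
To see how the choice of character class changes the result, the same "T...r" pattern can be run with `\w`, `\D` and `[a-zA-Z]` on a few made-up tokens. This is only a sketch: the list `tokens` is hypothetical and was chosen so that each class accepts a different subset.

```python
import re

# Hypothetical tokens: each has 7 characters, starts with T/t and ends with r
tokens = ["Thunder", "Th un r", "T12345r", "tr-ster", "t_uster"]

patterns = {
    "word chars": r"^[Tt]\w{5}r$",        # letters, digits, underscore
    "non-digits": r"^[Tt]\D{5}r$",        # anything except digits, incl. blanks
    "letters":    r"^[Tt][a-zA-Z]{5}r$",  # ASCII letters only
}

results = {
    name: [t for t in tokens if re.search(pattern, t)]
    for name, pattern in patterns.items()
}

for name, hits in results.items():
    print(name, hits)
```

Only the explicit letter class rejects digits, underscores, blanks and hyphens alike, which is worth keeping in mind when a shorthand class is used for convenience.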

* Did Shakespeare use any numbers (written as digits)? To match any digit we could similarly use `[0123456789]` or `[0-9]`, but the abbreviation `\d` is much more convenient.

In [71]:
[w for w in hamlet if re.search(r'\d', w)]

['1599', '1', '1', '1', '1']
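
Note that `re.search` with `\d` accepts any token that contains at least one digit. If only tokens consisting entirely of digits are wanted, `re.fullmatch` with `\d+` is one option. A minimal sketch on a hypothetical token list:

```python
import re

# Hypothetical tokens for illustration, not taken from the corpus
tokens = ["1599", "1", "2nd", "Act1", "one"]

# At least one digit somewhere in the token
contains_digit = [t for t in tokens if re.search(r"\d", t)]

# The whole token must consist of digits
only_digits = [t for t in tokens if re.fullmatch(r"\d+", t)]

print(contains_digit)
print(only_digits)
```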

* And is there something that starts with z and ends with g?

In [72]:
[w for w in hamlet if re.search('^z.*g$', w)]

[]

In the last example we cannot be sure whether there really is no such word or whether we got the regular expression wrong. To find out which one is the case, create strings you know should match and test your expression on them.

In [81]:
[w for w in ["zarhhg","zhang","zg","42"] if re.search('^z.*g$', w)]


['zarhhg', 'zhang', 'zg']
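
This kind of sanity check becomes more informative when it also includes strings that should not match. The helper below is a hypothetical sketch (`check_pattern` is not part of `re` or nltk): it reports the positive examples a pattern misses and the negative examples it wrongly accepts.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return (missed, false_hits) for a pattern and two lists of examples."""
    missed = [s for s in should_match if not re.search(pattern, s)]
    false_hits = [s for s in should_not_match if re.search(pattern, s)]
    return missed, false_hits

positives = ["zarhhg", "zhang", "zg"]
negatives = ["42", "zebra", "bzg", "Zhang"]

# The anchored pattern behaves as intended on all examples
good = check_pattern(r"^z.*g$", positives, negatives)

# Without the anchors, "bzg" is accepted although it does not start with z
sloppy = check_pattern(r"z.*g", positives, negatives)

print(good)
print(sloppy)
```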

That's all for this short introduction. See [the documentation of the re library](https://docs.python.org/3/library/re.html) for more examples of regular expressions.

In [140]:
# Create one or more regular expressions to extract 
# i) all URLs and 
# ii) all keyboard shortcuts (e.g. CTRL+A) from the firefox discussion forums.

from nltk.corpus import webtext

firefox = webtext.raw('firefox.txt')
print(firefox)


Cookie Manager: "Don't allow sites that set removed cookies to set future cookies" should stay checked
When in full screen mode
Pressing Ctrl-N should open a new browser when only download dialog is left open
add icons to context menu
So called "tab bar" should be made a proper toolbar or given the ability collapse / expand.
[XUL] Implement Cocoa-style toolbar customization.
#ifdefs for MOZ_PHOENIX
customize dialog's toolbar has small icons when small icons is not checked
nightly builds and tinderboxen for Phoenix
finish tearing prefs UI to pieces and then make it not suck
"mozbrowser" script doesn't start correct binary
Need bookmark groups icon
Dropping at top of palette box horks things
keyboard shortcut for Increase Text Size is broken
default phoenix bookmarks
[cust] need a toolbar spacer and spring spacer for customize
Can't launch phoenix while Mozilla is running (or vice versa)
separator not available when all toolbar items are in toolbar layout
history menu f

In [166]:
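# Split on blanks only, so a single item can still contain line breaks
firefox_list = firefox.split(" ")

# i) extract all URLs
# NOTE: [org|net]{3} is a character class, not an alternation. It accepts any three
# characters out of o, r, g, |, n, e, t, which is why "browser.xul.err" appears in the output.
url_pattern = re.compile(r"((http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?)|((www)\.[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?)|(([a-zA-Z0-9\-\.]+)(\.co)(\.[a-zA-Z]{2,3}(\/\S*)?))|(([a-zA-Z0-9\-\.]+)(\.[org|net]{3}))|(([a-zA-Z0-9\-\.]+)(\.com))")

for item in firefox_list:
    # search() returns None when nothing matches, so no try/except is needed
    match = url_pattern.search(item)
    if match:
        print(match.group())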
    


mozdev.org
mozilla.org
http://www.scripting.com/misc/msswitchad
eweek.com
mozilla.org
www.foo.com
www.localhost.net.au
http://www.watch.impress.co.jp
download.microsoft.com
http://bugzilla.mozilla.org
dyndns.org
News.com
photo.net
gentoo.org
www.aol.com
www.php.net
python.org
www.fnac.fr
Dict.org
NBA.com
http://www.htt
www.hvv.de
t-mobile.com
Edmunds.com
www.petetownshend.co.uk
fark.com
www.google.com
o2.co.uk
www.wamu.com
ew.com
ford.com
localhost.net
amazon.com
www.excite.com
freshmeat.net
bbc.co.uk
http://www.peterre.com
MozDev.org
sun.com
www.logitech.com
www.mozilla.org
http://texturizer.net/firebird
mail.yahoo.com
www.xy.com
php.net
apple.com
Adobe.com
Buy.com
FedEx.com
iWon.com
SmartSource.com
choiceradio.com
texturizer.net
Bestbuy.com
yodobashi.com
winamp.com
freebyte.com
theaa.com
loginnet.passport.com
browser.xul.err
googlesyndication.com
fed-ups-win32-users.net
foo.com
www.blogger.com
mail.com
fstv.org
freespeech.org
freshmeat.net
vons.com
www.pcpitstop.com
geocaching.com
ge
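
The result `browser.xul.err` in the list above is a false positive caused by `[org|net]{3}`: inside square brackets, `|` is an ordinary character, so the expression accepts any three characters drawn from `o r g | n e t`. The sketch below contrasts this with a real alternation `(?:org|net)`. The host names in `hosts` are hypothetical, and the patterns are simplified, anchored variants rather than the full expression used above.

```python
import re

# Hypothetical host names for illustration
hosts = ["mozdev.org", "photo.net", "browser.xul.err", "example.gon"]

# Character class: any 3 characters out of o, r, g, |, n, e, t
char_class = r"^[a-zA-Z0-9\-\.]+\.[org|net]{3}$"

# Alternation in a non-capturing group: exactly "org" or "net"
alternation = r"^[a-zA-Z0-9\-\.]+\.(?:org|net)$"

loose = [h for h in hosts if re.search(char_class, h)]
strict = [h for h in hosts if re.search(alternation, h)]

print(loose)
print(strict)
```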

In [142]:
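# ii) extract all keyboard shortcuts (e.g. CTRL+A)
# re.IGNORECASE replaces the inline (?i) flag, which Python 3.11+ only accepts
# at the very start of a pattern (elsewhere it raises re.error)
shortcut_pattern = re.compile(
    r"(ctrl|ctl|control|alt|command|cmd|shft|shift|fn|option|opt|f1|f2|f3|f4|f5)(\+|\-|\>)[a-zA-Z0-9\-\.\+\>]+",
    re.IGNORECASE)

for item in firefox_list:
    match = shortcut_pattern.search(item)
    if match:
        print(match.group())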

Ctrl-N
Ctrl-Y
Ctrl-W
Ctrl+Mousewheel
ctrl+tab
Ctrl+Mousewheel
ALT+F
Ctrl-click
shift-l
ctrl+dragging
ctrl-click
command-click
Ctrl-Click
Ctrl+Click
ctrl+Click
CTRL+click
Ctrl-click
control-b
control-tab
ctrl-shift-f
ctrl-d
ctrl-enter
Ctrl+Enter
control-tab
Alt-d
Ctrl+B
ctl-enter
Ctrl+Enter
CTRL+F
Control-click
Ctl-W
Alt-F4
Ctrl+Tab
Ctrl+Shift+Tab
Ctrl+Space
Alt+Home
ctrl+d
ctrl+t
ctrl>.
ctrl-enter
Ctrl+0
shift+link
ctrl-click
ctrl-click
Ctrl-I
ctrl+enter
alt+enter
ctrl+enter
ctrl-enter
ctrl+enter
Ctrl+-
Ctrl++
ctrl-middleclick
ctrl+enter
Ctrl+L
Shift+Ctrl+G
ctrl+enter
Ctrl+Q
Ctrl+W
alt+back
Ctrl+L
alt+f2
control-l
Ctrl+L
alt+scroll
shift+scroll
shift+scroll
alt+scroll
Ctrl+Enter
ALT+D
ctrl-t
Ctrl-Tab
Alt+D
Alt+d
Cmd+M
CTRL-W
Ctrl-Enter
Cmd-Shift-H
Ctrl+Enter
CTRL+Enter
Shift+G
ctrl+pgup
Command-Click
Ctrl-M
Shift-F10
Alt+Enter
CTRL-F
ALT+d
Ctrl+S
Alt+Click
ctrl++
Alt+F4
Ctrl-I
Alt+drag
Command-H
CTRL+K
Cmd-H
Ctrl+W
control-shift-w
ctrl-click
Ctrl-K
Ctrl+W
Alt-Enter
cmd-shift-left
Cmd-E
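
As an alternative to splitting the text on blanks and searching each piece, `re.findall` can collect all matches from the raw text in one pass. The following is a minimal sketch under stated assumptions: `text` is a hypothetical stand-in for the forum data, and the pattern is a simplified variant (fewer modifier keys, only `+` and `-` as separators), not the expression used above. Because it contains only non-capturing groups `(?:...)`, `findall` returns the full matches.

```python
import re

# Hypothetical forum lines standing in for webtext.raw('firefox.txt')
text = "Pressing Ctrl-N should open a new window\nALT+F does nothing; ctrl-shift-f works"

shortcut = re.compile(
    r"\b(?:ctrl|ctl|control|alt|command|cmd|shift|shft)"  # modifier key
    r"(?:[+-][a-z0-9]+)+",                                # one or more "+key" / "-key" parts
    re.IGNORECASE,
)

found = shortcut.findall(text)
print(found)
```

The `\b` word boundary keeps the pattern from starting in the middle of a longer word, and passing `re.IGNORECASE` as a flag avoids having to list upper- and lower-case spellings separately.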