Welcome to COMP90042, Web Search and Text Analysis! This is an IPython notebook to introduce you to some of the basics of Python.
Python is an interpreted scripting language. It is interpreted, rather than compiled, which means that you can see the results of a program (or program fragment) simply by entering it into a terminal that can interpret the commands. (Like this one!) This property has made Python popular for certain web apps. On the other hand, because it isn't pre-compiled, Python programs often aren't as fast as equivalent C programs (for example).
The "scripting" refers to the fact that Python programs are often short, and could equivalently be performed by issuing the various commands directly to an interpreter (which is another property we'll make use of here!).

Let's get to it! From the Python terminal below, I'm going to begin by entering a familiar string:


In [None]:
"Hello world!"

You can "evaluate" the "program" above, but it's just an r-value, so the statement has no effect. (Often the interpreter will automatically print its output anyway.)

Instead, let's instantiate a variable with that string:

In [None]:
sentence = "Hello world!"

Python is dynamically typed, which means that we don't need to tell the interpreter in advance what the data type of the variable *sentence* will be (although recent versions let us annotate it if we wish to). Notice also that the memory for the variable is dynamically allocated.
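As a minimal sketch of what this means in practice, the snippet below rebinds one name to values of different types without any declaration. The name *value* is made up for illustration, and *print(...)* is written with parentheses so that the snippet also runs under Python 3.

```python
# The type belongs to the value, not to the variable name,
# so the same name can be rebound to a value of another type.
value = "Hello world!"
type_before = type(value).__name__   # 'str'

value = 42
type_after = type(value).__name__    # 'int'

print(type_before)
print(type_after)
```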

Anyway, what can we do with a string? Well, a string is just a sequence of characters, so I can print an individual character by indexing into the string:

In [None]:
print sentence[4]

I can also print a "slice" of the string:

In [None]:
print sentence[6:11]
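To make the index arithmetic explicit, here is a small sketch of the same indexing and slicing, plus the two shorthand forms with an omitted start or end. The variable names other than *sentence* are made up for illustration.

```python
sentence = "Hello world!"

fifth_char = sentence[4]       # indices start at 0, so index 4 is the fifth character
second_word = sentence[6:11]   # characters 6 to 10; the end index 11 is excluded
first_word = sentence[:5]      # omitting the start means "from the beginning"
tail = sentence[6:]            # omitting the end means "up to the end"

print(fifth_char)    # o
print(second_word)   # world
print(first_word)    # Hello
print(tail)          # world!
```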

Let's do something a little more complicated. I can split the string *sentence* up using the built-in method *split()* as follows:

In [None]:
words = sentence.split()
print words[1]

The variable *words* is now a list of elements from the string *sentence*, in this case broken up on whitespace. One cute property of Python is that lists can also be indexed from the end:

In [None]:
print words[-2]
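A short sketch of how negative indices line up with positive ones, and of *split()* with an explicit separator. The example string *"a,b,,c"* and the new variable names are made up for illustration.

```python
sentence = "Hello world!"
words = sentence.split()             # no argument: split on runs of whitespace

last_word = words[-1]                # -1 is the last element
first_word = words[-2]               # the list has two elements, so -2 is the first
last_char = sentence[-1]             # negative indices work on strings too

# With an explicit separator, empty fields are kept.
comma_parts = "a,b,,c".split(",")

print(words)         # ['Hello', 'world!']
print(last_word)     # world!
print(first_word)    # Hello
print(last_char)     # !
print(comma_parts)   # ['a', 'b', '', 'c']
```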

To do something non-trivial, we'll need to be able to do some file I/O.

Let's open the file *words*, which is a list of English words, one per line. (This is open-source, and comes with most \*nix distributions; you can get a copy from the LMS.) We open files, as in many environments, by initialising a file handle, in this case with the built-in *open()* function (read mode takes the *'r'* flag, write mode takes the *'w'* flag, and so on).

In [None]:
f = open('./words', 'r')

There are a few methods for getting the (textual) content of this file. We can load the entire content of the file into a string with *read()*:

In [None]:
entire_file = f.read()
print entire_file[0:15]
f.close()

Since the file is organised into lines, we can instead build a list in which each entry is one line of the file, using *readlines()*:

In [None]:
f = open('./words', 'r')
all_lines = f.readlines()
print all_lines[0:5]
f.close()

Alternatively, we can use a looping structure to process each line of the file individually:

In [None]:
f = open('./words', 'r')
count = 0
for line in f:
    count += 1
print count
f.close()
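The cells above call *close()* by hand. A common alternative is the *with* statement, which closes the file automatically when the block ends. The sketch below does not touch the real *words* file: it writes a three-line stand-in (the name *sample_words* and its contents are made up) to a temporary directory, so it should run anywhere.

```python
import os
import tempfile

# Create a tiny stand-in for the word list (hypothetical file name and contents).
sample_path = os.path.join(tempfile.mkdtemp(), "sample_words")
with open(sample_path, "w") as f:
    f.write("apple\nbanana\ncherry\n")

# read(): the whole file as one string
with open(sample_path, "r") as f:
    entire_file = f.read()

# readlines(): a list with one string per line (newlines included)
with open(sample_path, "r") as f:
    all_lines = f.readlines()

# Iterating over the file object: one line at a time
count = 0
with open(sample_path, "r") as f:
    for line in f:
        count += 1

print(repr(entire_file))
print(all_lines)
print(count)
```

Looping over the file object reads one line at a time, so it is usually the better choice when a file is too large to hold in memory comfortably.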

The use of regular expressions in Python requires us to import the *re* module:

In [None]:
import re

Let's use our word list to build a tool to help us cheat at crossword puzzles. For each element of the list, we are going to use the *re.search(pattern, string)* function. Let's say that I want to find an 11-letter word whose third letter is "k" and whose tenth letter is "g":

In [None]:
f = open('./words', 'r')
for line in f:
    if re.search('^..k......g.$', line):
        print line,
f.close()

Hurray! What about an 8-letter word whose second and seventh letters are both "e"?

In [None]:
f = open('./words', 'r')
for line in f:
    if re.search('^.e....e.$', line):
        print line,
f.close()

Hmm... maybe we need to solve a few more clues!

Now let's see if we can make a function to do the above for *any* pattern:

In [None]:
def find_matching_words(pattern):
    f = open('./words', 'r')
    for line in f:
        if re.search(pattern, line):
            print line,
    f.close()

# and use it to find all words with the vowels in order
find_matching_words('^.*a.*e.*i.*o.*u.*$')
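If you'd rather collect the matches than print them, one possible variant returns a list and accepts any iterable of lines, which makes it easy to try on a handful of words. The function name *matching_words* and the list *sample_lines* are made up for this sketch; they are not part of the notebook or of the *re* module.

```python
import re

def matching_words(pattern, lines):
    """Return the words in `lines` (trailing newline removed) that match `pattern`."""
    compiled = re.compile(pattern)   # compile once, reuse for every line
    return [line.rstrip("\n") for line in lines if compiled.search(line)]

# A hypothetical stand-in for a few lines of the word list.
sample_lines = ["acknowledge\n", "facetious\n", "abstemious\n", "education\n"]

# "$" also matches just before a trailing newline, and "." never matches
# a newline, so the patterns work on lines that still end in "\n".
crossword = matching_words('^..k......g.$', sample_lines)
in_order = matching_words('^.*a.*e.*i.*o.*u.*$', sample_lines)

print(crossword)   # ['acknowledge']
print(in_order)    # ['facetious', 'abstemious']
```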

Another very useful aspect of Python is access to dictionaries (associative arrays). What if we wished to find out the distribution of final letters of words in the English language?

In [None]:
counts = {}
f = open('./words', 'r')
for line in f:
    last_letter = line[-2] # Why -2???
    if last_letter not in counts:
        counts[last_letter] = 0
    counts[last_letter] += 1
f.close()
for letter in counts.keys():
    print letter, counts[letter]

An equivalent version makes use of a *defaultdict*:

In [None]:
from collections import defaultdict
counts = defaultdict(int) # Any uninstantiated entries are integers (with the value 0)
f = open('./words', 'r')
for line in f:
    counts[line[-2]] += 1
f.close()
for letter in counts.keys():
    print letter, counts[letter]
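A third option, available from Python 2.7 onwards, is *collections.Counter*, a dictionary subclass built for exactly this kind of tallying. The sketch below runs on a made-up five-word list (*sample_lines*) rather than the real *words* file.

```python
from collections import Counter

# A hypothetical stand-in for a few lines of the word list.
sample_lines = ["apple\n", "table\n", "dog\n", "frog\n", "cat\n"]

# Counter tallies whatever the generator yields; missing keys count as 0.
counts = Counter(line[-2] for line in sample_lines)

for letter, count in sorted(counts.items()):
    print("%s %d" % (letter, count))

print(counts["z"])   # a letter that was never seen
```

*Counter* also has a *most_common(n)* method, which is handy when you only want the top few entries.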

Another important consideration is when to use a set, which behaves like a dictionary without values. This matters because it allows near-constant-time lookup. For example, consider the following text (Lewis Carroll's "Jabberwocky", from *Through the Looking-Glass*, 1871):

In [None]:
text = ''''Twas brillig, and the slithy toves
  Did gyre and gimble in the wabe:
All mimsy were the borogoves,
  And the mome raths outgrabe.

"Beware the Jabberwock, my son!
  The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
  The frumious Bandersnatch!"

He took his vorpal sword in hand:
  Long time the manxome foe he sought --
So rested he by the Tumtum tree,
  And stood awhile in thought.

And, as in uffish thought he stood,
  The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
  And burbled as it came!

One, two! One, two! And through and through
  The vorpal blade went snicker-snack!
He left it dead, and with its head
  He went galumphing back.

"And, has thou slain the Jabberwock?
  Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!'
  He chortled in his joy.

`Twas brillig, and the slithy toves
  Did gyre and gimble in the wabe;
All mimsy were the borogoves,
  And the mome raths outgrabe. '''
tokens = [token.lower() for token in re.split(r'\W+', text)]
tokens[:10]

The second-to-last line above is known as a list comprehension: it builds up a list by applying an operation (in this case, lower-casing) to each element of another list in turn (in this case, the list created by splitting the text on runs of non-alphanumeric characters, such as whitespace and punctuation). It's a very powerful technique for compactly describing a list; you'll probably be seeing it quite a bit.
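To see what the comprehension is shorthand for, the sketch below writes the same operation as an explicit loop, and then adds an optional *if* clause to filter elements. It uses a single line of the poem, and the variable names are made up for illustration.

```python
import re

line = "Beware the Jabberwock, my son!"

# The explicit loop...
tokens_loop = []
for token in re.split(r'\W+', line):
    tokens_loop.append(token.lower())

# ...and the equivalent list comprehension.
tokens_comp = [token.lower() for token in re.split(r'\W+', line)]

# A trailing "if" keeps only some elements; here it drops the empty
# string that re.split leaves behind after the final "!".
tokens_clean = [token.lower() for token in re.split(r'\W+', line) if token]

print(tokens_comp)    # ['beware', 'the', 'jabberwock', 'my', 'son', '']
print(tokens_clean)   # ['beware', 'the', 'jabberwock', 'my', 'son']
```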

What if we were interested in finding out which tokens in the text aren't elements of the words file above?

In [None]:
all_words = [word[:-1] for word in all_lines]
unknown_tokens = [token for token in tokens if token not in all_words]
unknown_tokens[:10]

It's subtle, but you might notice a small pause while this evaluates. Effectively, we're checking each token against the entire list of 99K words - the delay is minor here, but could be substantial with a longer text. By converting the word list to a set, we can check the membership much faster:

In [None]:
all_words = set(word[:-1] for word in all_lines)
unknown_tokens = [token for token in tokens if token not in all_words]
unknown_tokens[:10]
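If you want to measure the difference rather than feel it, the standard *timeit* module can time both lookups. This is only a rough sketch: the 20,000 made-up entries in *word_list* stand in for the real word list, and the exact timings will vary from machine to machine.

```python
import timeit

# A made-up stand-in for the word list, as a list and as a set.
word_list = ["word%d" % i for i in range(20000)]
word_set = set(word_list)

tokens = ["word19999", "jabberwock", "word0", "toves"]

# Both containers give the same answer...
unknown_from_list = [t for t in tokens if t not in word_list]
unknown_from_set = [t for t in tokens if t not in word_set]
print(unknown_from_set)

# ...but a failed lookup scans the whole list, while the set hashes the token once.
list_seconds = timeit.timeit(lambda: "toves" in word_list, number=200)
set_seconds = timeit.timeit(lambda: "toves" in word_set, number=200)
print("list: %.6f s, set: %.6f s" % (list_seconds, set_seconds))
```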

Anyway, the lesson ends here: you will need to try some problems for yourself if you want to improve. 

I recommend the following exercises from the NLTK book (http://nltk.org/book/), Chapter 3: 1, 2, and 24.

The NLTK book is a great reference, and it's pitched at people who've never used Python before. You'll probably also find that the Python documentation (http://docs.python.org) is a handy reference.