# Assignment 0

This notebook will help verify that you're all set up with the Python packages we'll be using this semester.

**Your task:** run the cells below and verify that the output is as expected. If anything looks wrong or a cell crashes, update your Python installation or contact the course staff. We don't want library issues to get in the way of the real coursework!

In [1]:
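# Version checks
import importlib
from distutils.version import LooseVersion

def version_check(libname, min_version):
    m = importlib.import_module(libname)
    # Compare parsed versions: as plain strings, "1.9" would rank above "1.11".
    ok = LooseVersion(m.__version__) >= LooseVersion(min_version)
    print "%s version %s is" % (libname, m.__version__),
    print ("OK" if ok
           else "out-of-date. Please upgrade!")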
    
version_check("numpy", "1.11")
version_check("matplotlib", "1.5")
version_check("pandas", "0.18")
version_check("nltk", "3.2")
version_check("tensorflow", "0.12.1")

numpy version 1.11.1 is OK
matplotlib version 1.5.3 is OK
pandas version 0.18.1 is OK
nltk version 3.2.1 is OK
tensorflow version 0.12.1 is OK
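
Version strings need some care when compared. As plain strings they are compared character by character, so `"1.9"` ranks above `"1.11"` even though 1.11 is the newer release. The sketch below shows the difference using a numeric comparison. The helper name `version_tuple` is hypothetical, and it assumes versions made up only of dot-separated integers (suffixes such as `rc1` are not handled).

```python
def version_tuple(version):
    """Convert "1.11.1" to (1, 11, 1).

    Assumes every dot-separated part is an integer; suffixes such as
    "rc1" or "dev0" would raise ValueError here.
    """
    return tuple(int(part) for part in version.split("."))


# Plain string comparison goes character by character, so "9" beats "1".
string_says_ok = "1.9" >= "1.11"
# Tuple comparison goes component by component, so 9 < 11.
tuple_says_ok = version_tuple("1.9") >= version_tuple("1.11")

print(string_says_ok)  # True: the string check wrongly accepts 1.9
print(tuple_says_ok)   # False: the numeric check correctly rejects it
```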


## TensorFlow

We'll be using [TensorFlow](https://www.tensorflow.org/) to build deep learning models this semester. TensorFlow is a whole programming system in itself, based around the idea of a computation graph and deferred execution: you first describe a computation as a graph of operations, and nothing is evaluated until you explicitly run it. We'll be talking a lot more about it in Assignment 1, but for now you should just test that it loads on your system.

Run the cell below; you should see:
```
Hello, TensorFlow!
42
```

In [2]:
import tensorflow as tf

hello = tf.constant("Hello, TensorFlow!")
sess = tf.Session()
print sess.run(hello)

a = tf.constant(10)
b = tf.constant(32)
print sess.run(a+b)

Hello, TensorFlow!
42
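
The cell above builds `a+b` first and only gets a number back from `sess.run`. To illustrate that idea of deferred execution without depending on TensorFlow, here is a toy sketch in plain Python. The names `Node`, `constant`, and `run` are hypothetical and are not part of the TensorFlow API; real graphs support many more operations and are evaluated very differently under the hood.

```python
class Node(object):
    """A node in a toy computation graph (hypothetical; not TensorFlow API)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const" or "add"
        self.inputs = inputs  # upstream nodes this node depends on
        self.value = value    # payload, used only by "const" nodes

    def __add__(self, other):
        # Writing a + b records a new node; no arithmetic happens here.
        return Node("add", inputs=(self, other))


def constant(value):
    """Create a leaf node that holds a fixed value."""
    return Node("const", value=value)


def run(node):
    """Evaluate a node by recursively evaluating its inputs."""
    if node.op == "const":
        return node.value
    if node.op == "add":
        left, right = node.inputs
        return run(left) + run(right)
    raise ValueError("unknown op: %s" % node.op)


a = constant(10)
b = constant(32)
total = a + b        # still just a graph node, not a number
result = run(total)  # evaluation happens only now
print(result)        # 42
```

Separating graph construction from evaluation is what lets a system inspect or optimize the whole computation before running it.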


(Optional) You can also test one of the built-in models. The cell below trains a CNN classifier on the MNIST handwritten-digit dataset. It generates a lot of output and may take several minutes.

In [3]:
# This is the same as calling python -m (...) on the command line
# You should see a bunch of output, and a final test error around 0.8%
# It might take a few minutes on a slower machine.
%run -m tensorflow.models.image.mnist.convolutional

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 20.0 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.6%
Step 100 (epoch 0.12), 242.3 ms
Minibatch loss: 3.250, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.6%
Step 200 (epoch 0.23), 242.1 ms
Minibatch loss: 3.376, learning rate: 0.010000
Minibatch error: 12.5%
Validation error: 4.2%
Step 300 (epoch 0.35), 242.0 ms
Minibatch loss: 3.176, learning rate: 0.010000
Minibatch error: 7.8%
Validation error: 3.0%
Step 400 (epoch 0.47), 242.2 ms
Minibatch loss: 3.216, learning 

## NLTK

[NLTK](http://www.nltk.org/) is a large compilation of Python NLP packages. It includes implementations of a number of classic NLP models, as well as utilities for working with linguistic data structures, preprocessing text, and managing corpora.

NLTK is included with Anaconda, but the corpora need to be downloaded separately. Be warned that this will take up around 3.2 GB of disk space if you download everything! If this is too much, you can download individual corpora as you need them through the same interface.

Type the following into a Python shell on the command line. It'll open a pop-up UI with the downloader:

```
import nltk
nltk.download()
```

Alternatively, you can download individual corpora by name. The cell below will download the famous [Brown corpus](http://www.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html):

In [4]:
import nltk
assert(nltk.download("brown"))  # should return True if successful, or already installed

[nltk_data] Downloading package brown to /home/ubuntu/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Now we can look at a few sentences. Expect to see:
```
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
```

In [5]:
from nltk.corpus import brown
# Look at the first two sentences
for s in brown.sents()[:2]:
    print " ".join(s)
    print ""

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .



NLTK also includes a sample of the [Penn Treebank](https://www.cis.upenn.edu/~treebank/), which we'll be using later in the course for parsing and part-of-speech tagging. Here's a sample sentence and its parse tree. Expect to see:
```
The top money funds are currently yielding well over 9 % .

(S
  (NP-SBJ (DT The) (JJ top) (NN money) (NNS funds))
  (VP
    (VBP are)
    (ADVP-TMP (RB currently))
    (VP (VBG yielding) (NP (QP (RB well) (IN over) (CD 9)) (NN %))))
  (. .))
```

In [6]:
assert(nltk.download("treebank"))  # should return True if successful, or already installed
print ""
from nltk.corpus import treebank
# Look at the parse of a sentence.
# Don't worry about what this means yet!
idx = 45
print " ".join(treebank.sents()[idx])
print ""
print treebank.parsed_sents()[idx]

[nltk_data] Downloading package treebank to /home/ubuntu/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.

The top money funds are currently yielding well over 9 % .

(S
  (NP-SBJ (DT The) (JJ top) (NN money) (NNS funds))
  (VP
    (VBP are)
    (ADVP-TMP (RB currently))
    (VP (VBG yielding) (NP (QP (RB well) (IN over) (CD 9)) (NN %))))
  (. .))
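
The bracketed notation in the tree above can be read by hand: each innermost `(TAG word)` pair is a part-of-speech tag followed by a token. As a rough illustration, the sketch below pulls out those pairs with a regular expression. It deliberately does not use NLTK's own tree API, and it assumes that neither tags nor tokens contain whitespace or literal parentheses, which holds for this sentence.

```python
import re

# The parse tree from the output above, copied as a plain string.
tree_str = """(S
  (NP-SBJ (DT The) (JJ top) (NN money) (NNS funds))
  (VP
    (VBP are)
    (ADVP-TMP (RB currently))
    (VP (VBG yielding) (NP (QP (RB well) (IN over) (CD 9)) (NN %))))
  (. .))"""

# An innermost "(TAG word)" pair contains no nested parentheses.
leaf_pattern = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")
tagged = leaf_pattern.findall(tree_str)

# Print one tag/word pair per line, in sentence order.
for tag, word in tagged:
    print("%-4s %s" % (tag, word))
```

Joining the extracted words with spaces recovers the original sentence, which is a quick sanity check that no leaf was missed.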


We can also look at the [Europarl corpus](http://www.statmt.org/europarl/), which consists of *parallel* text: each sentence appears alongside its translations into several other languages. You should see:
```
ENGLISH: Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .
```
followed by its translations into French and Spanish.

In [7]:
assert(nltk.download("europarl_raw"))  # should return True if successful, or already installed
print ""
from nltk.corpus import europarl_raw

idx = 0

print "ENGLISH: " + " ".join(europarl_raw.english.sents()[idx])
print ""
print "FRENCH: " + " ".join(europarl_raw.french.sents()[idx])
print ""
print "SPANISH: " + " ".join(europarl_raw.spanish.sents()[idx])

[nltk_data] Downloading package europarl_raw to
[nltk_data]     /home/ubuntu/nltk_data...
[nltk_data]   Unzipping corpora/europarl_raw.zip.

ENGLISH: Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .

FRENCH: Reprise de la session Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vœux en espérant que vous avez passé de bonnes vacances .

SPANISH: Reanudación del período de sesiones Declaro reanudado el período de sesiones del Parlamento Europeo , interrumpido el viernes 17 de diciembre pasado , y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones .
