# Course Set-Up

Welcome to MIDS W266: Natural Language Processing! 

This notebook is a quick guide to getting set up with the programming environment we'll be using this semester. We'll be using IPython (Jupyter) notebooks for most of the course exercises and assignments, and we'll make heavy use of NumPy, scikit-learn, TensorFlow, and NLTK. 

The instructions below should get you set up for most of the semester, although since this is the first time the course is being offered, we might add a few new things as we go along.

We'll do our best to support a variety of platforms, but most NLP software (and Data Science software in general) works best on UNIX-based operating systems, i.e. **Linux** or **Mac OSX**. While you'll be doing most coding in Python, it'll be very useful to be familiar with the **bash** command-line environment and common utilities.

# Google Cloud Instance set-up

If you plan on using a GCE instance for the course, follow the instructions in the [Cloud Instance Guide](https://github.com/datasci-w266/main/cloud/README.md). It will walk you through cloning the git repo, installing Anaconda and TensorFlow, and setting up Jupyter notebooks.

When you're done, run a notebook server on your cloud instance:
```
cd ~
jupyter notebook
```
Then, in your browser, navigate to http://localhost:8888/notebooks/w266/week1/Course%20Set-Up.ipynb to load a live version of this notebook. (_If you cloned to a folder other than `~/W266`, you might need to change this URL, or browse from the [tree view](http://localhost:8888/tree)._)
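
If the notebook lives at a different path, the URL can be rebuilt from the notebook's path relative to the directory where `jupyter notebook` was started. The sketch below is only an illustration: `notebook_url` is a hypothetical helper (not part of Jupyter), and it assumes the default port 8888 with no URL prefix or token.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def notebook_url(relative_path, host="localhost", port=8888):
    """Build a notebook URL from a path relative to the server's start directory.

    Hypothetical helper for illustration; spaces in the path become %20.
    """
    return "http://{}:{}/notebooks/{}".format(host, port, quote(relative_path))


print(notebook_url("w266/week1/Course Set-Up.ipynb"))
```

Passing a different relative path (for example, one that starts with another folder name) gives the matching URL for that location.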

You can skip the local set-up section and jump down to the **NLTK** part. 

# Local set-up

If you plan on working on your local machine, first install `git` and clone the course repo:
```
git clone https://github.com/iftenney/W266.git
```
Or, if you have authentication issues, clone over SSH instead:
```
git clone git@github.com:iftenney/W266.git
```

## Local set-up: Python

We strongly recommend the [**Anaconda**](https://www.continuum.io/downloads) Python distribution, which includes NumPy, scikit-learn, matplotlib, pandas, NLTK, and many other useful packages. For most code, we'll assume that you have Anaconda, and we'll explicitly mention anything that isn't included.

Download the Python 2.7 version from https://www.continuum.io/downloads, and follow the instructions to install.
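
Once the install finishes, a quick way to confirm that the packages listed above can be imported is a small check like the one below. This is a minimal sketch, not an official course script: `check_packages` is a hypothetical helper, and the module names are the usual import names for those packages (`sklearn` for scikit-learn).

```python
import importlib
import sys

# Import names for the packages mentioned above (sklearn is scikit-learn).
PACKAGES = ["numpy", "sklearn", "matplotlib", "pandas", "nltk"]


def check_packages(names):
    """Return a dict mapping each module name to True if it imports cleanly."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


print("Python version: " + sys.version.split()[0])
for name, ok in sorted(check_packages(PACKAGES).items()):
    print("{:<12} {}".format(name, "OK" if ok else "MISSING"))
```

Any package reported as `MISSING` probably means Anaconda is not the Python on your `PATH`, or the install did not complete.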

We'll be making use of a few Python features that you might not have encountered in previous courses. You might want to bookmark the links below and glance at the documentation now, although we'll explain these features in more detail as they appear.

- [Generators and generator expressions](https://wiki.python.org/moin/Generators), handy for working with streams of text
- [SciPy sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html) for representing large "one-hot" vectors
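
To give a flavor of the first item, here is a minimal sketch of a generator function and a generator expression working over a stream of text. The `tokens` function and the sample lines are made up for illustration; the point is only that tokens are produced one at a time rather than collected into a list first.

```python
def tokens(lines):
    """Yield lowercased, whitespace-separated tokens one at a time."""
    for line in lines:
        for tok in line.split():
            yield tok.lower()


# Illustrative input; in practice this could be an open file or a corpus reader.
lines = ["The jury said Friday", "The jury further said"]

# Generator expression: counts long tokens without building an intermediate list.
n_long = sum(1 for tok in tokens(lines) if len(tok) > 4)

print("tokens: " + " ".join(tokens(lines)))
print("tokens longer than 4 characters: " + str(n_long))
```

Because `tokens` is lazy, the same pattern works on inputs too large to hold in memory.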

## Local set-up: TensorFlow

[TensorFlow](https://www.tensorflow.org/) is Google's open-source numerical computation library. It's designed for deep learning, and has very good support for the neural network architectures, such as RNNs, that are commonly used in NLP.

**Note:** TensorFlow is only available for Linux and OSX. If you're on Windows, the easiest option is to use a Google Cloud instance. See the [Cloud Instance Guide](https://github.com/datasci-w266/main/cloud/README.md) for more details.

If you're using Anaconda as above, you can install the latest version of TensorFlow with:
```
conda install -c jjhelmus tensorflow
```

Alternatively, you can follow the Pip Installation instructions here: https://www.tensorflow.org/versions/r0.10/get_started/os_setup.html#pip-installation (Ignore everything about conda or virtualenv environments.)

### Advanced: TensorFlow and GPUs

TensorFlow can use a GPU to dramatically speed up neural network models. If you have a recent NVIDIA GPU, follow the instructions here to set it up:
https://www.tensorflow.org/versions/r0.10/get_started/os_setup.html#optional-linux-enable-gpu-support

Be warned that CUDA and NVIDIA drivers on Linux can be finicky and sometimes unstable, so consult the course staff if you plan on going this route.

## Local set-up: Notebooks

If the above steps completed successfully, you should be able to open a notebook with:
```
cd ~
jupyter notebook &
```
It should open a browser window to http://localhost:8888/tree; find this notebook and open it to continue. You can also try the direct link:
- http://localhost:8888/notebooks/w266/week1/Course%20Set-Up.ipynb

Run the cells below to test your Python installation.

In [1]:
print "Hello world!"
print "Welcome to Natural Language Processing!"

Hello world!
Welcome to Natural Language Processing!


Test TensorFlow:

In [2]:
import tensorflow as tf

hello = tf.constant("Hello, TensorFlow!")
sess = tf.Session()
print sess.run(hello)

a = tf.constant(10)
b = tf.constant(32)
print sess.run(a+b)

Hello, TensorFlow!
42


In [None]:
# This is the same as calling python -m (...) on the command line
# You should see a bunch of output, and a final test error around 0.8%
# It might take a few minutes on a slower machine.
%run -m tensorflow.models.image.mnist.convolutional

We'll interact with TensorFlow as a Python library, but it's really a whole programming system in itself. Continue on to the [**TensorFlow Tutorial Notebook**](TensorFlow%20Tutorial.ipynb) to learn more!

# NLTK

[NLTK](http://www.nltk.org/) is a large compilation of Python NLP packages. It includes implementations of a number of classic NLP models, as well as utilities for working with linguistic data structures, preprocessing text, and managing corpora.

NLTK is included with Anaconda, but the corpora need to be downloaded separately. Be warned that this will take up around 3.2 GB of disk space if you download everything! If this is too much, you can download individual corpora as you need them through the same interface.

Type the following into a Python shell. It'll open a pop-up UI with the downloader:
```
import nltk
nltk.download()
```
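
If you'd rather skip the pop-up and fetch only specific corpora, `nltk.download` also accepts a package identifier. The sketch below wraps that call in a hypothetical helper, `download_corpora`; the identifiers in `CORPORA` are assumed to match the names used with `nltk.corpus` in the rest of this notebook, and the optional `downloader` argument exists only so the helper can be tried without touching the network.

```python
# Assumed package identifiers for the corpora explored below.
CORPORA = ["brown", "treebank", "europarl_raw"]


def download_corpora(names, downloader=None):
    """Download each named NLTK package and return {name: succeeded}."""
    if downloader is None:
        import nltk  # imported lazily so the helper can be defined without NLTK
        downloader = nltk.download
    results = {}
    for name in names:
        results[name] = bool(downloader(name))
    return results
```

In a Python shell, `download_corpora(CORPORA)` should fetch just those three corpora, which takes far less disk space than downloading everything.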

Now we can explore the corpora a bit. Let's look at the famous [Brown corpus](http://www.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html):

In [3]:
from nltk.corpus import brown
# Look at the first five sentences
for s in brown.sents()[:5]:
    print " ".join(s)
    print ""

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .

The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .

`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .

The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .



In [12]:
# As words
print "\n".join(brown.words()[:40])

The
Fulton
County
Grand
Jury
said
Friday
an
investigation
of
Atlanta's
recent
primary
election
produced
``
no
evidence
''
that
any
irregularities
took
place
.
The
jury
further
said
in
term-end
presentments
that
the
City
Executive
Committee
,
which
had


NLTK also includes a sample of the [Penn treebank](https://www.cis.upenn.edu/~treebank/):

In [4]:
from nltk.corpus import treebank
# Look at the first five sentences
for s in treebank.sents()[:5]:
    print " ".join(s)
    print ""

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .

Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named *-1 a nonexecutive director of this British industrial conglomerate .

A form of asbestos once used * * to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed * to it more than 30 years ago , researchers reported 0 *T*-1 .

The asbestos fiber , crocidolite , is unusually resilient once it enters the lungs , with even brief exposures to it causing symptoms that *T*-1 show up decades later , researchers said 0 *T*-2 .



In [5]:
# Look at the parse of a sentence.
# Don't worry about what this means yet!
idx = 45
print " ".join(treebank.sents()[idx])
print ""
print treebank.parsed_sents()[idx]

The top money funds are currently yielding well over 9 % .

(S
  (NP-SBJ (DT The) (JJ top) (NN money) (NNS funds))
  (VP
    (VBP are)
    (ADVP-TMP (RB currently))
    (VP (VBG yielding) (NP (QP (RB well) (IN over) (CD 9)) (NN %))))
  (. .))


We can also look at the [Europarl corpus](http://www.statmt.org/europarl/), which consists of *parallel* text: each sentence is paired with its translations into multiple languages.

In [7]:
from nltk.corpus import europarl_raw

idx = 0

print "ENGLISH: " + " ".join(europarl_raw.english.sents()[idx])
print ""
print "FRENCH: " + " ".join(europarl_raw.french.sents()[idx])
print ""
print "SPANISH: " + " ".join(europarl_raw.spanish.sents()[idx])

ENGLISH: Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .

FRENCH: Reprise de la session Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances .

SPANISH: Reanudación del período de sesiones Declaro reanudado el período de sesiones del Parlamento Europeo , interrumpido el viernes 17 de diciembre pasado , y reitero a Sus Señorías mi deseo de que hayan tenido unas buenas vacaciones .
