# SDA 250 - Assignment 1 preparation

In this assignment, you will explore the basics of the NLTK package. You can read over [Chapters 1-3](https://www.nltk.org/book) of the online book, but you don't need to understand everything.

This notebook simply gives you some practice for the assignment. You don't need to turn it in, but you are welcome to reuse parts of it. Your tasks are given below and on Canvas.

##  1. Load packages and data

First, you'll need to load NLTK (a library) and some of the data that comes with it. 

After you run the `nltk.download()` command, you'll get a prompt (look for an open pop-up window) asking what to save to your drive. Choose "Everything used in the NLTK book".

In [None]:
import nltk

In [None]:
nltk.download()

In [None]:
from nltk.book import *

In [None]:
# import a couple other things we'll need
import numpy
import matplotlib

## 2. Explore texts

The texts that come with NLTK are listed after you do the import. You'll see a range of texts in the open domain. Here, I'm giving you some basic concepts to explore, so that you can then explore your own data. 

### Length

Find out how long each of those texts is.

In [None]:
len(text1)

In [None]:
len(text4)

### Dispersion plots

A dispersion plot shows where each word occurs in a text: each word gets its own row (y), and every occurrence is marked at its position along the length of the text (x). This is interesting for things like the novels and the inaugural addresses, perhaps less so for the Wall Street Journal (`text7`), which is simply a collection of newspaper articles.

In [None]:
text1.dispersion_plot(["monstrous", "Ishmael", "Moby"])

In [None]:
text4.dispersion_plot(["freedom", "duties", "America", "immigration"])

In [None]:
text7.dispersion_plot(["economy", "President", "freedom", "immigration", "America"])
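
If you want to see what a dispersion plot is built from, here is a minimal sketch in plain Python. It uses a made-up toy token list (an assumption, so it runs without the NLTK data), and the helper name `word_offsets` is hypothetical. Each number it returns is a position in the text, which is what gets marked along the x axis.

```python
# Toy token list standing in for an NLTK text (made up for illustration)
tokens = ["the", "whale", "swam", ";", "the", "ship", "chased", "the", "whale", "."]

def word_offsets(tokens, target):
    """Return the 0-based positions in the token list where target occurs."""
    return [i for i, tok in enumerate(tokens) if tok == target]

# One row per word, like the rows of a dispersion plot
for word in ["whale", "ship", "freedom"]:
    print(word, word_offsets(tokens, word))
```

A word that never occurs, like "freedom" here, simply gets an empty row.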

### Counting vocabulary - tokens and types

Tokens = total number of words. Types = total number of unique words, aka vocabulary. To get the types in NLTK, we simply create a set with the words in the text (in a set, duplicates are removed).

In [None]:
set(text1)

In [None]:
sorted(set(text1))

Note that these lists include words in uppercase and lowercase. To do case folding (aka case normalization), you can do the following, for all tokens, or only for tokens that are words, not punctuation.

In [None]:
sorted(set(w.lower() for w in text1))

In [None]:
sorted(set(w.lower() for w in text1 if w.isalpha()))
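
To see how much case folding and punctuation filtering change the counts, you can try the same expressions on a small list first. This is a sketch on a made-up token list (an assumption for illustration), not on one of the NLTK texts.

```python
# Toy token list (made up); note "The"/"the" and "whale"/"Whale"
tokens = ["The", "whale", "and", "the", "Whale", ",", "the", "sea", "."]

types_raw = set(tokens)                                      # case-sensitive types
types_folded = set(w.lower() for w in tokens)                # after case folding
types_words = set(w.lower() for w in tokens if w.isalpha())  # words only, no punctuation

print(len(tokens))        # number of tokens
print(len(types_raw))     # number of types, case-sensitive
print(len(types_folded))  # number of types after case folding
print(len(types_words))   # number of word types, punctuation removed
```

Each step shrinks the vocabulary, so be clear about which version you are reporting.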

Lexical diversity is simply the types divided by the tokens. We can create a function to calculate it quickly. Note that we define lexical diversity as type-token ratio (TTR). There are other measures.

In [None]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

In [None]:
lexical_diversity(text1)

In [None]:
lexical_diversity(text4)

In [None]:
lexical_diversity(text5)

In [None]:
lexical_diversity(text7)

Lexical density is another interesting measure. It gives you the ratio of content (lexical) words to total number of words. Think about how you'd calculate that in NLTK.
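
One possible starting point, as a rough sketch only: treat every word that is not in a list of function words as a content word. The `FUNCTION_WORDS` set below is a tiny, hypothetical list made up for illustration, and the toy tokens are also made up. A real analysis would need a much fuller list, or part-of-speech tags.

```python
# Hypothetical, deliberately tiny list of function words (an assumption)
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was"}

def lexical_density(tokens):
    """Ratio of content words to all words, ignoring punctuation tokens."""
    words = [w.lower() for w in tokens if w.isalpha()]
    if not words:
        return 0.0  # avoid dividing by zero on an empty text
    content = [w for w in words if w not in FUNCTION_WORDS]
    return len(content) / len(words)

tokens = ["The", "whale", "was", "in", "the", "sea", "."]
print(lexical_density(tokens))  # 2 content words out of 6 words
```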

### Frequency distributions

A frequency distribution records how many times each word appears in a text. NLTK has a built-in `FreqDist` class. Here we use it to get the counts for our four texts, and to print the most common words.

In [None]:
fdist1 = FreqDist(text1)

In [None]:
fdist4 = FreqDist(text4)

In [None]:
fdist5 = FreqDist(text5)

In [None]:
fdist7 = FreqDist(text7)

In [None]:
fdist1.most_common(20)

In [None]:
fdist4.most_common(20)

In [None]:
fdist5.most_common(20)

In [None]:
fdist7.most_common(20)
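
If you want to check your understanding on something small, the same kind of counting can be done with Python's standard `collections.Counter`. As far as I know, `FreqDist` is built on top of `Counter`, which is why `most_common` looks the same. The token list below is made up for illustration.

```python
from collections import Counter

# Toy token list (made up); punctuation counts as a token too
tokens = ["the", "whale", "the", "sea", "the", "whale", "."]

counts = Counter(tokens)
print(counts["the"])          # how many times "the" occurs
print(counts.most_common(2))  # the two most frequent tokens with their counts
```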

Most common words do not seem very different across texts. What about the least common words? Those are called hapax legomena, sometimes just hapax: words that occur only once in a given text/corpus.

In [None]:
fdist4.hapaxes()

In [None]:
fdist5.hapaxes()

In [None]:
fdist7.hapaxes()
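
Finding hapaxes amounts to keeping the words whose count is exactly 1. Here is a minimal sketch with `collections.Counter` on a made-up token list (an assumption, so it runs without the NLTK data).

```python
from collections import Counter

# Toy token list (made up for illustration)
tokens = ["the", "whale", "the", "sea", "the", "whale", "ship"]

counts = Counter(tokens)
# Keep only the words that occur exactly once
hapaxes = [word for word, count in counts.items() if count == 1]
print(sorted(hapaxes))
```

Keep in mind that without case folding, "Whale" and "whale" are counted separately, so either one could show up as a hapax.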

### Concordances, similar contexts

Concordances just show us the word in its context.

In [None]:
text1.concordance("monstrous")

In [None]:
text4.concordance("America")

In [None]:
text5.concordance("lol")
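
To make "word in its context" concrete, here is a small keyword-in-context sketch in plain Python. The function name `kwic`, the `window` parameter, and the token list are all hypothetical choices for illustration, and the output format is not the one NLTK prints. The match ignores case, which is an assumption made here for simplicity.

```python
def kwic(tokens, target, window=3):
    """Return one line per occurrence of target, with up to `window` tokens on each side."""
    lines = []
    target = target.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive match
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Toy token list (made up for illustration)
tokens = ["A", "most", "monstrous", "size", ";", "the", "Monstrous", "whale", "."]
for line in kwic(tokens, "monstrous", window=2):
    print(line)
```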

The `similar` function in NLTK gives you a list of words that appear in similar contexts to the word you are querying.

In [None]:
text1.similar("monstrous")

In [None]:
text1.concordance("contemptible")

In [None]:
text4.similar("America")

In [None]:
text5.similar("lol")
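
The idea of "appearing in similar contexts" can be sketched like this: record the word before and the word after each token, then look for other words that share at least one such pair with your query word. This is a simplified illustration on a made-up sentence, not NLTK's actual implementation, and the names `contexts_by_word` and `similar_words` are hypothetical.

```python
from collections import Counter, defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) pairs around it."""
    lowered = [t.lower() for t in tokens]
    index = defaultdict(set)
    # Skip the first and last tokens, which lack a neighbour on one side
    for i in range(1, len(lowered) - 1):
        index[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    return index

def similar_words(tokens, target):
    """Return other words sharing a context with target, most shared contexts first."""
    index = contexts_by_word(tokens)
    target = target.lower()
    target_contexts = index.get(target, set())
    shared = Counter()
    for word, contexts in index.items():
        overlap = len(contexts & target_contexts)
        if word != target and overlap:
            shared[word] = overlap
    return [word for word, _ in shared.most_common()]

# Toy sentence (made up for illustration)
tokens = "the monstrous pictures and the true pictures and the old ship".split()
print(similar_words(tokens, "monstrous"))
```

Here "true" comes back because both words occur between "the" and "pictures", while "old" does not because it occurs between "the" and "ship".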

The `common_contexts` function in NLTK gives you words that share the same context. The output below tells us that 'monstrous' and 'true' appear in the environment 'the __ pictures'.

In [None]:
text1.common_contexts(["monstrous", "true"])

In [None]:
text4.common_contexts(["America", "American"])
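
A shared context is just a (previous word, next word) pair that occurs around every word in your list. Here is a minimal sketch of that idea in plain Python, on a made-up sentence. The function name `shared_contexts` is hypothetical, and this is an illustration rather than NLTK's own code.

```python
from collections import defaultdict

def shared_contexts(tokens, words):
    """Return the (previous, next) pairs that occur around every word in words."""
    lowered = [t.lower() for t in tokens]
    index = defaultdict(set)
    # Skip the first and last tokens, which lack a neighbour on one side
    for i in range(1, len(lowered) - 1):
        index[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    context_sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*context_sets) if context_sets else set()

# Toy sentence (made up for illustration)
tokens = "the monstrous pictures and the true pictures and the old ship".split()
print(shared_contexts(tokens, ["monstrous", "true"]))
print(shared_contexts(tokens, ["monstrous", "old"]))
```

The first call finds the pair for 'the __ pictures'; the second finds nothing, because "old" never appears in that environment.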

Congratulations! You have gone through the basics of corpus analysis! 

## 3. Load your own text

For Assignment 1, you will need to load some of your own texts. Here's some basic code to do this. You can also consult [Chapter 2 of the NLTK book](https://www.nltk.org/book/ch02.html).

In [None]:
# Import corpus reader functionality
from nltk.corpus import PlaintextCorpusReader
# Point to the path where you have some files
# Change this to your own path
corpus_root = './data/'

Let's assume that I have a corpus of reviews in my directory. I'll load the files (all .txt) into the variable `reviews`.

In [None]:
# The second argument is a regular expression; this one matches only the .txt files
reviews = PlaintextCorpusReader(corpus_root, r'.*\.txt')

I know that my data contains 2 different types of texts, books and movies. I will tokenize each type by using the `words()` function and assigning the result to a variable.

In [None]:
books_words = reviews.words('books.txt')

In [None]:
movies_words = reviews.words('movies.txt')

Now I can see the list of words:

In [None]:
books_words

In [None]:
movies_words
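
If you are curious what kind of list to expect, here is a rough stand-in in plain Python. It writes a tiny made-up review to a temporary file, reads it back, and splits it into runs of word characters and runs of punctuation. The regular expression is my approximation of how `words()` tokenizes by default (an assumption), and the helper name `word_punct_tokens` is hypothetical.

```python
import re
import tempfile
from pathlib import Path

def word_punct_tokens(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

# Create a throwaway corpus folder with one made-up review file
with tempfile.TemporaryDirectory() as corpus_root:
    path = Path(corpus_root) / "books.txt"
    path.write_text("A great book. Loved it!", encoding="utf-8")
    books_words = word_punct_tokens(path.read_text(encoding="utf-8"))

print(books_words)
```

Note that punctuation marks come out as separate tokens, so they count toward the length of the text.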

You can check length, lexical diversity, sentences, etc.

In [None]:
len(movies_words)

Congratulations! Now you are ready to work on Assignment 1.