# Problem set 1: Counting words

## Description

The goal of this problem set is to create the world's least visually-sophisticated word count graphic.

Along the way, you'll experiment with stopword removal, case folding, and other processing steps.

## Count words, naïvely

We'll work with *Moby-Dick*, as we did in class. 

**Read the text of *Moby-Dick* from a file (it's on the class GitHub site, in the `data/texts` directory), tokenize it with NLTK's `word_tokenize` function, and count the resulting tokens in a `Counter` object.**

You can refer to the lecture notebook from Monday, September 7, to borrow code to do all of this. But you must get that code working in the cell below. This cell should produce a `Counter` object that holds the token counts from the novel.
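
As a rough illustration of the tokenize-then-count pattern, here is a minimal sketch on a one-line sample. The names `sample_text` and `rough_tokenize` are hypothetical, and the regular expression is only a stand-in for NLTK's `word_tokenize` (which handles many more cases); your own cell should read the novel from its file and use `word_tokenize` as described above.

```python
import re
from collections import Counter

# Hypothetical sample; in the problem set this would be the full text
# of the novel, read from its file.
sample_text = "Call me Ishmael. Some years ago, never mind how long precisely."

def rough_tokenize(text):
    # Stand-in for nltk.word_tokenize: runs of word characters and single
    # punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = rough_tokenize(sample_text)
token_counts = Counter(tokens)  # maps each token to its number of occurrences

print(token_counts["."])
```

Note that punctuation marks are counted as tokens here, just as they are in the expected output for the full novel.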

In [None]:
# Use standard Python file commands to open Moby-Dick,
#  then count the words in that file.

**Print the total number of words in your text (hint: use `Counter`'s `.values()` method along with the `sum` function), as well as the 20 most frequently occurring terms and their counts.**

We'll do this a lot, so wrap it up as a function that takes as input a `Counter` object and an optional number of top terms to print:

```
def word_stats(data, n=20):
```

The output of your function should look like this:

```
Total words in the text: 255380

Top 20 words by frequency:
,      19204
the    13715
.      7432

[and so on ...]
```
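
The two `Counter` features behind that output can be sketched on a toy example. The counts in `toy_counts` are made up, and the column width is an arbitrary choice; this is not a complete `word_stats`, just the building blocks.

```python
from collections import Counter

# Made-up counts, for illustration only
toy_counts = Counter({"the": 5, ",": 4, "whale": 2, "sea": 1})

total = sum(toy_counts.values())     # add up every token's count
top_two = toy_counts.most_common(2)  # list of (term, count) pairs, highest first

print(f"Total words in the text: {total}")
for term, count in top_two:
    # Left-align the term in a 6-character column so the counts line up
    print(f"{term:<6} {count}")
```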

In [None]:
# Your word_stats function here
def word_stats(data, n=20):
    '''
    Print total wordcount and n top terms.
    Takes a Counter object and a number of terms to print.
    Returns None.
    '''

In [None]:
# Call word_stats on your data
word_stats(...)

## Case folding and stopwords

As you can see, the top words that we counted aren't super informative. That said, list two things that you **can** say about the text with reasonable confidence on the basis of our results above:

**Two things you *can* tell about *Moby-Dick* from the naïve word counts:**

1. Thing one
1. Thing two

If we want our word list to be informative, we need to find a way to ignore high-frequency, low-information words. We can do this either by not counting them in the first place, or by excluding them from our reporting after we've collected them. Both methods have advantages and drawbacks. The one you pursue is up to you.
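
To make the trade-off concrete, here is a small sketch of both strategies on a toy token list. The tokens and the two-word `skip_words` set are hypothetical. Notice that the first strategy changes the total token count, while the second keeps the full counts available for later.

```python
from collections import Counter

tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]
skip_words = {"the", "and"}  # hypothetical mini stoplist

# Strategy 1: never count the unwanted words in the first place
filtered_counts = Counter(t for t in tokens if t not in skip_words)

# Strategy 2: count everything, then leave unwanted words out of the report
all_counts = Counter(tokens)
reported = [(t, c) for t, c in all_counts.most_common() if t not in skip_words]

print(sum(filtered_counts.values()))  # total excludes the skipped words
print(sum(all_counts.values()))       # total includes them
print(reported)
```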

**Modify the original code to ignore token case (e.g., 'The' and 'the' are both counted as occurrences of the same token; note the `.lower()` method for strings) and to remove the English-language stopwords defined by NLTK (`from nltk.corpus import stopwords`). Then display the total token count and top-20 tokens.**
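
The order of operations matters here: NLTK's English stopword entries are lowercase, so a token like 'The' only matches after folding. The sketch below shows the effect with a hypothetical one-word `mini_stoplist` standing in for the NLTK list.

```python
from collections import Counter

tokens = ["The", "whale", "the", "Whale", "THE", "sea"]
mini_stoplist = {"the"}  # hypothetical; stands in for NLTK's English list

# Without folding, "The" and "THE" slip past the lowercase stoplist entry,
# and "whale" / "Whale" are counted as different tokens.
unfolded = Counter(t for t in tokens if t not in mini_stoplist)

# Fold case first, then check the folded token against the stoplist.
folded = Counter(t.lower() for t in tokens if t.lower() not in mini_stoplist)

print(unfolded)
print(folded)
```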

In [None]:
# Count tokens with case folding and NLTK English stopwords removed

# ... your code here ...

word_stats(...)

Is this better? Maybe! **Note one advantage of this stopword-removed count, as well as one disadvantage:**

**Advantage:**

* Advantage details

**Disadvantage:**

* Disadvantage details

Let's see if we can further improve/refine our approach to continue narrowing our word list. Our goal is to produce a list that contains *only* interesting words and ranks them by frequency.

**List at least two ideas for modifying the stopword list to better approach our goal:**

1. Idea one
1. Idea two

**Implement one or more of your ideas to improve the stopword list, then display the output of your new version using `word_stats()`.**

In [None]:
# Better stopwords in action!

# ... your code here ...

word_stats(...)

Refine your stoplist until you're satisfied with it. Make sure your code above displays the final output of your `word_stats` function. Then move on.

## Visualization

Now, make the world's least visually-impressive word count graphic. Your task is to produce a visual representation of your top 10 words that shows the relative frequency of those terms.
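
One way to get plottable data out of a `Counter` is sketched below on made-up counts. Treating "relative frequency" as each term's share of all counted tokens is an assumption; plotting the raw counts shows the same relative bar lengths.

```python
from collections import Counter

# Made-up counts, for illustration only
toy_counts = Counter({"whale": 6, "sea": 3, "ship": 2, "ahab": 1})

top = toy_counts.most_common(3)           # (term, count) pairs, highest first
labels = [term for term, _ in top]        # terms, in rank order
counts = [count for _, count in top]      # matching raw counts

# Each term's share of all counted tokens
total = sum(toy_counts.values())
shares = [count / total for count in counts]

print(labels, counts, shares)
```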

The simplest acceptable version of this visualization is a bar chart. **Complete the starter code below to produce a bar chart of the top ten words in the text.**

Your output might look like this:

![bar chart](ps_02_bar_chart.png)

In [None]:
# Make a bar chart of the top 10 words
%matplotlib inline
import matplotlib.pyplot as plt

# Get labels and counts
labels = ...
counts = ...

# Create the figure
fig, ax = plt.subplots()
ax.barh(...)
...

## Optional: word clouds

**This is optional.** Make a word cloud. You can do this the ugly way in pure `matplotlib` or the easy-and-pretty way by using the [`wordcloud`](https://github.com/amueller/word_cloud) library:

```
conda install -c conda-forge wordcloud
```

If you use `wordcloud`, you'll be interested in the [`.generate_from_frequencies()` method](http://amueller.github.io/word_cloud/auto_examples/frequency.html).

Here are examples of the ugly and the pretty outputs. Your specific results might vary.

![ugly](ps_02_ugly_cloud.png)
![pretty](ps_02_pretty_cloud.png)

In [None]:
# The ugly way (matplotlib)
# Hint: you'll want to use the .text() plotting method
# Strictly optional

In [None]:
# The pretty way
# Strictly optional