# CC 303 Final Project
* Ryan Jinnette
* 12/9/2024


## 1. Setup: Install Packages and Dependencies

* Pandas: used for dataframe analysis and storing our counts and frequencies
* Altair: used for graphing our pandas dataframes
* NLTK: used for parsing and classifying the texts with natural language processing
* nltk.corpus stopwords: used for getting rid of filler words like "the", "and", "a", etc.
* spaCy: used for named entity recognition
* Wordcloud: used for creating the word clouds we see at the end of the code
* Matplotlib: used for rendering the word clouds

In [1]:
# For fresh projects only
# remove the hashtags before running the following cell

# !pip install pandas
# !pip install altair
# !pip install nltk
# !pip install SpaCy
# !pip install wordcloud
# !python3 -m spacy download en_core_web_sm

In [2]:
import pandas as pd
import altair as alt
import nltk
from nltk.corpus import stopwords as stop
from nltk.stem import PorterStemmer as stemmer
from nltk.stem import WordNetLemmatizer as lemmatizer
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag
import spacy
from wordcloud import WordCloud
import matplotlib.pyplot as plt
nlp = spacy.load("en_core_web_sm")

We need to tell NLTK which models and data sets we want to use for the following analysis. This only has to be done once per project, at the beginning.

In [3]:
def download_models():
    '''call this at the beginning of a session to install dependencies'''
    nltk.download('punkt') # necessary for tokenization
    nltk.download('wordnet') # necessary for lemmatization
    nltk.download('stopwords') # necessary for removal of stop words
    nltk.download('averaged_perceptron_tagger') # necessary for POS tagging
    nltk.download('maxent_ne_chunker' ) # necessary for entity extraction
    nltk.download('words')
    nltk.download('punkt_tab')
    nltk.download('averaged_perceptron_tagger_eng')
    alt.data_transformers.enable("vegafusion")

# download_models()

### String Cleaning and Text Ingest

Before we visualize any text, we need to do some normalization. First, we load our novel's text:

In [4]:
def ingest(text_name: str) -> str:
    with open(text_name, 'r') as f:
        story = f.read()
    return story

# story = ingest('Fagles.txt')
# # display first 100 characters
# print(story[:100]+"...")

At this point the entire story is loaded into one long string.

💡 *Examine the character length of the story and look at the current state of the memory in the kernel:*

In [5]:
print(len(story))



### Tokenization

Tokenization is the process of turning a text into chunks that we can more easily work with. Typically tokenization refers to the separation and extraction of words. This process largely relies on the use of spaces and punctuation marks. The latter are typically included as tokens themselves.

Suppose we have a sentence such as this one:

In [None]:
# sentence = "It was a dark and stormy night; the rain fell in torrents—except at occasional intervals, when it was checked by a violent gust of wind which swept up the streets (for it is in London that our scene lies), rattling along the housetops, and fiercely agitating the scanty flame of the lamps that struggled against the darkness."

Now, with nltk's `word_tokenize()` we can extract all tokens into a neat list:

In [None]:
# words = nltk.word_tokenize(sentence)
# words

As you can see, we also get the punctuation marks. We can avoid these with a different kind of tokenizer (e.g., the RegexpTokenizer) or by simply removing non-letter strings with Python's `isalpha()` method. This removes any token that contains anything other than letters, so if you expect numbers or embedded non-alpha characters in your text, you'll need to handle those separately or use another method.

In [None]:
# the list comprehension is extracting only words that contain exclusively letters.
# onlywords = [word for word in words if word.isalpha()]

# onlywords[0:20]
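
To see what the `isalpha()` filter discards, here is a minimal sketch that uses only the standard library. The sample string and the regular expression are illustrative assumptions, and `str.split()` stands in for a real tokenizer, so the tokens differ from what `word_tokenize()` would return.

```python
import re

# hypothetical sample text with a hyphenated word, a contraction, and a number
sample = "The well-known ship didn't sail in 1830."

# naive whitespace split, then keep only purely alphabetic tokens
alpha_only = [tok for tok in sample.split() if tok.isalpha()]

# regex alternative: runs of letters, optionally joined by an internal apostrophe or hyphen
pattern = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")
regex_tokens = pattern.findall(sample)

print(alpha_only)    # hyphenated word, contraction, and number are all dropped
print(regex_tokens)  # hyphenated word and contraction survive, the number does not
```

Which behavior is preferable depends on the text: for a word cloud, losing a few contractions is usually acceptable, but for a text full of hyphenated names a regex-based approach may keep more of the vocabulary.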

Sometimes capitalization changes the meaning of a word, but for a broad-strokes word cloud we can lowercase everything to reduce the number of unique words in our dataset.

In [None]:
# lowerwords = [word.lower() for word in onlywords]
# lowerwords[0:10]

### Stemming & lemmatizing

Words are often inflected to indicate plural, tense, case, etc. To get the word stem or lemma, you can apply stemming or lemmatization. The **stemmer** operates on a relatively robust but simplistic rule set. In contrast, lemmatization is more reliable at linking the different variants of a word to its dictionary entry, a.k.a. lemma, but it is computationally more expensive.

Let's compare them both:

In [None]:
# from nltk.stem import PorterStemmer as stemmer
# from nltk.stem import WordNetLemmatizer as lemmatizer
# from nltk.corpus import wordnet # for robust lemmatization

# word = "said"

# print(stemmer().stem(word))
# print(lemmatizer().lemmatize(word, pos = wordnet.VERB))
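
The difference between rule-based stemming and dictionary-based lemmatization can be illustrated with a toy sketch. Everything below is a hypothetical stand-in: `toy_stem` uses four hand-picked suffix rules (far simpler than the Porter algorithm), and `TOY_LEMMAS` is a tiny hand-written table playing the role of a dictionary such as WordNet.

```python
# toy suffix rules, far simpler than the Porter algorithm (illustration only)
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# tiny hand-written lookup table standing in for a real dictionary
TOY_LEMMAS = {"said": "say", "went": "go", "running": "run"}

def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)

# compare both approaches on regular and irregular forms
for w in ["walked", "running", "said", "went"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rules handle regular forms like "walked" cheaply but leave irregular forms like "said" untouched, while the lookup resolves irregular forms but only for words it already knows.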

To lemmatize we needed to indicate the word type via the second parameter `pos`. But how do we know that it's a verb?

### Part-of-speech tagging

Words assume specific roles in sentences. POS tagging identifies these roles as the parts of speech, which roughly translates to word categories such as verbs, nouns, adjectives etc.

To do POS tagging with NLTK, we need to first run the tokenization. So let's revisit the sentence from above:

In [None]:
# to save us some typing, we import these, so we can call them directly
# from nltk import word_tokenize, pos_tag

# # first we tokenize then we pos_tag
# sentence = pos_tag(word_tokenize(sentence))

# sentence

This gives us a list of tuples, each of which contains the token again, plus the part of speech encoded in a tag. There is actually a good overview of the POS tags with brief definitions and examples [on Stack Overflow](https://stackoverflow.com/a/38264311).
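
As a small illustration of working with that list of tuples, the sketch below filters tokens by the first letter of their tag. The `tagged` list is hand-written in the shape `pos_tag()` returns; it is not actual tagger output.

```python
# hand-written (token, tag) pairs in the shape pos_tag() returns; not real tagger output
tagged = [
    ("It", "PRP"), ("was", "VBD"), ("a", "DT"), ("dark", "JJ"),
    ("and", "CC"), ("stormy", "JJ"), ("night", "NN"),
]

# tuple unpacking gives readable names instead of word[0] / word[1]
verbs = [token for token, tag in tagged if tag.startswith("V")]
adjectives = [token for token, tag in tagged if tag.startswith("J")]

print(verbs)
print(adjectives)
```

Matching on the first letter works because the verb tags (VB, VBD, VBZ, ...) all start with V and the adjective tags (JJ, JJR, JJS) all start with J.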

Again, let's go back to our full story and extract its verbs, lemmatizing each one:

In [None]:
# same as above: first tokenize, then pos_tag
pos = pos_tag(word_tokenize(story))

# to keep things short & sweet, we define a function for lemmatizing verbs
def lemmatize_verb (word):
  return lemmatizer().lemmatize(word.lower(), pos = wordnet.VERB)

# remember this form? the condition matches verbs, whose POS tag starts with a V
verbs = [lemmatize_verb(word[0]) for word in pos if word[1][0]=="V"]

# let's look at the first 50 verbs
print(verbs[:50])

Woot! We just extracted all verbs from a story and normalized them! Awesome-sauce.
Repeating the last step for nouns, we need to define `lemmatize_noun()` and apply it:

In [None]:
# to keep things short & sweet, we define a function for lemmatizing nouns
def lemmatize_noun (word):
  return lemmatizer().lemmatize(word.lower(), pos = wordnet.NOUN)

# same form as before: the condition matches nouns, whose POS tag starts with an N
nouns = [lemmatize_noun(word[0]) for word in pos if word[1][0]=="N"]

# let's look at the first 50 nouns
print(nouns[:50])

## 📄 2. Process

Now we can turn a text into its components and distinguish these tokens as different word types. Let's proceed by extracting entities, removing irrelevant words, and finding the most frequent words.

### Extract entity types

Apart from identifying word types, we can distinguish between different entities that are mentioned in a text, such as persons, places, organizations, etc. This kind of text processing is also referred to as **Named Entity Recognition** (NER).

For this step, we stray from NLTK and use spaCy's statistical models for the English language. So first we import spaCy and load the English language model:



In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

Let's find the named entities in the story:

In [None]:
# carry out NLP processing
doc = nlp(story)

# get the text and entity label of each entity in the story
entities = [ (e.text, e.label_) for e in doc.ents if e.text ]

# see first 20 entities
entities[0:20]

Now all tokens that have been recognized as particular entities are extracted and associated with an entity type. Have a look at spaCy's [overview of NER tags](https://spacy.io/api/annotation#named-entities) to understand what they refer to. Entity recognizers like this one are typically trained on news stories, so you can see that the results are questionable for our novel.

### Remove stop words

The opposite of particularly interesting entities are so-called stop words. They are very common and serve as short function words such as "the", "is", "or", "at". In text processing it can be useful to remove these frequent words to focus on those terms that are more specific to a given document.

NLTK already includes stop words for several languages, including English. I'm going to add the word "chapter" to the stop list for this case, since it is a visual cue for the reader, not an integral part of the story.

In [None]:

def define_stopwords() -> list:
    stopwords = stop.words("english")
    stopwords.append('chapter')
    return stopwords

As a next step we remove the stop words from the short story to focus on those words that carry meaning:

In [None]:
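# build the stop list once; the function and cells below look it up by this name
stopwords = define_stopwords()

def tokenize(story: str) -> tuple:
    tokens = nltk.word_tokenize(story.lower())
    words = [word for word in tokens if word.isalpha()]
    without_stopwords = [word for word in words if word not in stopwords]
    return words, without_stopwords

words, without_stopwords = tokenize(story)
print(without_stopwords[:50])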
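
One optional refinement, sketched here under the assumption that the text is long: testing `word not in some_list` scans the list for every token, whereas a `set` answers each membership test in roughly constant time. The names `stop_list`, `stop_set`, and `tokens` are hypothetical, and the mini stop list is hand-written for the example.

```python
# hypothetical mini stop list; the real one comes from NLTK's stopwords corpus
stop_list = ["the", "and", "a", "in", "chapter"]
stop_set = set(stop_list)  # hashing makes each lookup roughly constant time

tokens = "the rain fell in torrents and the wind swept the streets".split()

# same filter as in tokenize(), but against the set instead of the list
kept = [tok for tok in tokens if tok not in stop_set]
print(kept)
```

The result is identical to filtering against the list; only the lookup cost changes.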

### Pack a bag of words

A common representation of text documents is the bag-of-words model, which simply considers a given text as an unordered collection of words, disregarding sentence or document structure. Typically, a bag-of-words representation is combined with the frequency of words in a document. I'm also going to use the text with the stop words removed, assuming that they aren't interesting to me.

In [None]:
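from collections import Counter

def bag_of_words(words: list[str]) -> list:
    # Counter tallies all words in a single pass, which avoids calling
    # words.count() once per token (very slow on a full novel)
    bow = Counter(words)
    # most_common() returns (word, count) tuples sorted by descending count
    words_frequency = bow.most_common()
    return words_frequency

# words_frequency = bag_of_words(without_stopwords)
# print(words_frequency[0:100])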
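
A tiny worked example may make the bag-of-words idea concrete. It uses `collections.Counter` from the standard library on a hand-written word list (`demo_words` is a hypothetical name and the words are made up for illustration).

```python
from collections import Counter

# hand-written token list; word order is irrelevant to the resulting counts
demo_words = ["rain", "wind", "rain", "night", "wind", "rain"]

demo_bow = Counter(demo_words)

# absolute count of one word and its share of all tokens
rain_count = demo_bow["rain"]
rain_share = rain_count / len(demo_words)

# (word, count) tuples for the two most frequent words
top_two = demo_bow.most_common(2)

print(rain_count, rain_share)
print(top_two)
```

Shuffling `demo_words` would give the same counts, which is exactly what "disregarding sentence or document structure" means.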

## 🥗 3. Present

Now let's turn all these words into visualizations!

### Word cloud

For text visualization, one technique has received a lot of attention, despite its limited perceptual and analytical qualities. The word cloud (a.k.a. tag cloud) emerged in the golden age of Web 2.0 (the 2000s) and probably succeeded due to its simplicity in terms of interpretation and implementation: the more frequent a word, the larger the font size. Altair itself does not support word clouds, so we resort to a specific `wordcloud` generator and use `matplotlib` to render the images. The wordcloud library is especially convenient because it takes the raw text as input:


In [None]:
# from wordcloud import WordCloud
# import matplotlib.pyplot as plt

def create_wordcloud(story:str):
    wc = WordCloud(width=500, height=500, background_color="white").generate(story)
    # display the generated image:
    my_dpi = 72
    plt.figure(figsize = (500/my_dpi, 500/my_dpi), dpi=my_dpi)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()

💡 *The word cloud library gives a lot of options for [customization](https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html#wordcloud.WordCloud). You can change the colors, fonts, and sizes, and keep the stopwords.*

### Common words

We shall move on to more precise representations of text. For this we will revisit an arguably mundane, but quite effective visualization technique: we draw a barchart of the most frequent words (excluding the stop words we removed earlier).

In [None]:

def get_common_words(words_frequency: list) -> pd.DataFrame:
    # first we create a dataframe from the word frequencies
    df = pd.DataFrame(words_frequency, columns=['word', 'count'])
    return df


def plot_top_words(df: pd.DataFrame) -> alt.Chart:
    # we want to focus just on the top 50 words
    df_top = df[:50]
    # df_top_100 = df[:100] # use this later
    
    # draw horizontal barchart and return it, so the notebook can display it
    return alt.Chart(df_top).mark_bar().encode(
      x = 'count:Q',
      y = alt.Y('word:N', sort = '-x')
    )

In [None]:
df = get_common_words(bag_of_words(without_stopwords))
df.head()

### All words by type

Through POS tagging we are able to identify the different word types, such as nouns, verbs, adjectives, adverbs, and several others. So let's do exactly this and distinguish between these common word types for the story:

In [None]:
# first we extract all words and their types (a.k.a. parts-of-speech or POS)
pos = pos_tag(word_tokenize(story))

# we will be collecting words and types in lists of the same length
words = []
types = []

# iterate over all entries in the pos list (generated above)
for p in pos:
  # get the word and turn it into lowercase
  word = p[0].lower()
  # get the word's type
  tag = p[1]

  # for this analysis we remove entries that contain punctuation or numbers
  # and we also ignore the stopwords (sorry: the, and, or, etc!)
  if word.isalpha() and word not in stopwords:
    # first we add this word to the words list
    words.append(word)
    # then we add its word type to types list, based on the 1st letter of the pos tag
    # note that we access letters in a string, like entries in a list
    if   (tag[0]=="J"): types.append("Adjective")
    elif (tag[0]=="N"): types.append("Noun")
    elif (tag[0]=="R"): types.append("Adverb")
    elif (tag[0]=="V"): types.append("Verb")
    # there are many more word types, we simply subsume them under 'other'
    else: types.append("Other")
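
The `if`/`elif` chain above can also be written as a dictionary lookup, which keeps the mapping in one place. This is only an alternative sketch: `TYPE_BY_PREFIX`, `word_type`, and the sample tags are hypothetical names and values chosen for illustration.

```python
from collections import Counter

# first letter of the POS tag mapped to a coarse word type
TYPE_BY_PREFIX = {"J": "Adjective", "N": "Noun", "R": "Adverb", "V": "Verb"}

def word_type(tag: str) -> str:
    """Map a POS tag to a coarse word type; anything unknown becomes 'Other'."""
    # tag[:1] is safe even for an empty string, unlike tag[0]
    return TYPE_BY_PREFIX.get(tag[:1], "Other")

# hand-picked sample tags, not output from a tagger
sample_tags = ["JJ", "NN", "NNS", "VBD", "RB", "IN", "DT"]
labels = [word_type(t) for t in sample_tags]

print(labels)
print(Counter(labels))
```

Adding another category then means adding one dictionary entry rather than another `elif` branch.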

💡 *This is a good point to check what we generated. Take a look at the two lists we created:*

In [None]:
words[0:10]

In [None]:
types[0:10]

With this information, we can now create two coordinated charts: one representing the frequency of the different word types and the other displaying the frequency of all words (given the current selection). But first things first: we need to create a dataframe with only the most popular 100 words by frequency.

In [None]:
# with the two lists of the same length, we create a dataframe with a dictionary,
# of which the keys will become the column labels
df = pd.DataFrame({"word": words, "type": types })

# Keep only the rows whose word is among the 100 most frequent words
top_100_words = df['word'].value_counts().head(100).index
index = df['word'].isin(top_100_words)
df_pared = df[index].reset_index(drop=True)
len(df_pared)
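
The filtering step can be checked on a small scale. The sketch below keeps only the rows for the two most frequent words of a hand-written dataframe; `df_demo`, `top_n`, and the data itself are assumptions made for illustration.

```python
import pandas as pd

# hand-written stand-in for the word/type dataframe
df_demo = pd.DataFrame({
    "word": ["rain", "wind", "rain", "night", "wind", "rain", "lamp"],
    "type": ["Noun"] * 7,
})

top_n = 2

# value_counts() sorts by descending frequency, so head() yields the top words
top_words = df_demo["word"].value_counts().head(top_n).index

# keep every row (not just one per word) whose word is in the top list
df_small = df_demo[df_demo["word"].isin(top_words)].reset_index(drop=True)

print(len(df_small))
print(df_small["word"].unique())
```

Note that the filtered frame still has one row per occurrence, which is what lets Altair's `count()` aggregation compute the frequencies later.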

In [None]:
# along the type column, we want to support a filter selection
# (selection_point and add_params are the Altair 5 names for the older
# alt.selection(type="multi") and add_selection)
selection = alt.selection_point(fields=['type'])

# we create a composite chart consisting of two sub-charts
# the base holds it together and acts as the concierge taking care of the data
base = alt.Chart(df_pared)

# this shows the types, note that we rely on Altair's aggregation prowess
chart1 = base.mark_bar().encode(
  x = alt.X('type:N'),
  y = alt.Y('count()'),
  # when a bar is selected, the others are displayed with reduced opacity
  opacity=alt.condition(selection, alt.value(1), alt.value(.25)),
).add_params(selection)

# this chart reacts to the selection made in the left/above chart
chart2 = base.mark_bar(width=5).encode(
  x = 'word:N',
  y = alt.Y('count()'),
).transform_filter(selection)

chart1 | chart2

### Keyword in context

Last but not least, it can be quite gratifying to see words in their original context. KWIC is a tried and tested method just for that purpose. Let's build one from scratch!

In [None]:
import re # regular expressions, we will need them to search through the text
# the following we need, to display a text input field and make it interactive
import ipywidgets as widgets
from IPython.display import display, clear_output

# we replace all line breaks with spaces, to not mess up the display (you'll see)
text = story.replace("\n", " ")

# create a search box …
search_box = widgets.Text(placeholder='Enter search term', description='Search:')
# … and make it appear
display(search_box)

# this function is triggered when a search query is entered
def f(sender):
  # we get the query's text value
  query = search_box.value

  # this is the window of characters displayed both sides
  span = 40 - int(len(query)/2)

  # for subsequent queries, we clear the output
  clear_output(wait=True)
  # which also removes the search box, so we return it
  display(search_box)

  # when the query is too short, we do not proceed and warn the user/reader
  if (len(query)<2):
    print("\nPlease enter a longer query\n")
    return

  # and find all the start positions of matches in the text
  # (re.escape makes characters like "." or "(" match literally)
  starts = [m.start() for m in re.finditer(re.escape(query), text)]

  # if there are no matches, we also tell the user/reader
  if (len(starts)==0):
    print("\nSorry, but there are no matches for your query\n")
    return

  # we go through all the start positions
  for start in starts:
    # determine the end position, based on the query's length
    end = start+len(query)

    # we get the string left and right of the match
    # rjust returns a right-justified string, if there are few letters left of match
    left = text[max(0, start-span):start].rjust(span)
    match = text[start:end]
    right = text[end:end+span]

    # we print left and right context with the actual match in the middle
    print(left+match+right)

# the function f is linked with searchbox' on_submit event
search_box.on_submit(f)

Try searching for a name or word that appears in your story!
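
The same lookup can be sketched as a plain function without widgets, which is easier to test or reuse outside a notebook. The function name `kwic`, its `span` default, and the demo text are assumptions for illustration; the query is matched literally and case-sensitively.

```python
import re

def kwic(text: str, query: str, span: int = 30) -> list[str]:
    """Return one context line per literal, case-sensitive match of query."""
    lines = []
    # re.escape ensures characters like "." are matched literally
    for m in re.finditer(re.escape(query), text):
        # pad the left context so that all matches line up in one column
        left = text[max(0, m.start() - span):m.start()].rjust(span)
        right = text[m.end():m.end() + span]
        lines.append(left + m.group() + right)
    return lines

# hand-written demo text
demo = "the rain fell in torrents. the rain stopped."
for line in kwic(demo, "rain", span=8):
    print(line)
```

Because the left context is right-justified to a fixed width, the matched word starts at the same column on every line, which is what makes a KWIC listing easy to scan.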

## Your Turn

  1. Find all of the named entities in the file `elsa_peretti_obit.txt`. Do any of the classifications seem wrong to you? Why or why not?
  2. From the `elsa_peretti_obit.txt` text file, create a horizontal bar chart of word frequencies as we did above, but sort it by values. [This example will help](https://altair-viz.github.io/gallery/bar_chart_sorted.html)
  3. Re-create the interactive graph with word type frequencies and word frequencies by selection from the tutorial, except this time use the `elsa_peretti_obit.txt` text file instead of the Alice in Wonderland text file. Sort the word frequencies from highest on left to lowest on the right.
  4. Create a different word cloud for Alice's story, in the shape of Alice and the white rabbit. We provided a mask, `alice_mask.png`. [This documentation](https://amueller.github.io/word_cloud/auto_examples/masked.html) will help.

In [None]:
import requests

request = requests.get('https://gist.githubusercontent.com/isha211/f1ce8a7020230205099399f7dc8edb30/raw/dfda5a993eaec33fdc4dd839556dfaab41abdaf7/elsa_peretti_obit.txt')
obit_text = request.text

In [None]:
# Question 1
# carry out NLP processing

# get the text and entity label of all word entities in the article

# print the entity classifications


➡️ YOUR ANSWER HERE: Do any of the classifications seem wrong to you? Why or why not? ⬅️

In [None]:
# Question 2
# read in the text file

# get tokens which contain letters and are longer than two characters

# remove stopwords

# pack a bag of words and get word-frequency tuples

# draw a sorted, horizontal barchart for the top 20 most frequent words

In [None]:
# Question 3
# refer to the tutorial for how to extract word types for the obituary

# get only the top 100 words of the obituary by frequency

# refer to the interactive chart in the tutorial to build a similar chart here

In [None]:
# Question 4

## Sources

Tutorials
- [A Complete Exploratory Data Analysis and Visualization for Text Data](https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a)
- [Named Entity Recognition with NLTK and SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)
- [What are all possible pos tags of NLTK? - Stack Overflow](https://stackoverflow.com/a/38264311)

Documentation
- [Natural Language Toolkit — NLTK 3.5 documentation](https://www.nltk.org)
- [spaCy API Documentation - Architecture](https://spacy.io/api)
- [WordCloud for Python documentation](https://amueller.github.io/word_cloud/)

