## Module 6: Ontologies and Document Classification 


Ontologies are sets of words that are used to help discern what is being talked about in digitized, text-based information. 

You learned a little bit about how this works in modules 5 and 6, because you dissected the words that are common in a particular set of books. 

What are ontologies used for in practice, though? How would a data scientist deploy them?  

Maybe the easiest way to show you is with a few examples from my research. 
1. Processing Twitter Data to Identify Events: [Football and Baseball Ontologies](./resources/baseball.pdf)
2. Processing online health support forums to identify the type of support being provided (emotional, community, or informational): [WebMD Health Support Forum Ontologies](./resources/onlinehealth.pdf)
3. Noting the emergence of terms in an adult kickball league's online discussion forum: [Finding Ontologies Where None Exist Yet - Topic Modeling](./resources/kickball.pdf)

In each case, you will notice that we picked a specific ontology with which to make sense of what was happening. In the case of Twitter, we used player names and sport-specific words like "out" for baseball and "touchdown" for football to identify new events on the field through the Twitter feed. 

In online health support forums, we used ontologies of emotional, informational, and community support words to identify specific types of support. If you look through the paper casually (skimming is OK!), you will see that different health conditions elicit different levels of each type of support, and exhibit different membership stability (unrelated to ontology, but still pretty cool!). 
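
To make the idea concrete, here is a minimal sketch of ontology-based labeling in base R. It is not the method used in the papers above: the word lists, the example posts, and the helper names (`tokenize`, `score_post`) are all made up for illustration. Each post is simply labeled with the ontology whose words it mentions most often.

```r
# Hypothetical ontologies: each support type is a small set of cue words.
ontologies <- list(
  emotional     = c("sorry", "hug", "hope", "love", "prayers"),
  informational = c("dosage", "doctor", "study", "treatment", "symptoms"),
  community     = c("welcome", "group", "everyone", "forum", "members")
)

# Made-up forum posts to classify.
posts <- c(
  "So sorry to hear that, sending a hug and lots of love",
  "Ask your doctor about the dosage before changing treatment",
  "Welcome to the group, everyone here is friendly"
)

# Lowercase a post and split it into word tokens.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  tokens[tokens != ""]
}

# Count how many tokens of a post fall in each ontology.
score_post <- function(text) {
  tokens <- tokenize(text)
  sapply(ontologies, function(words) sum(tokens %in% words))
}

# One row per post, one column per ontology.
scores <- t(sapply(posts, score_post, USE.NAMES = FALSE))

# Label each post with the highest-scoring ontology (ties go to the first).
labels <- colnames(scores)[max.col(scores, ties.method = "first")]
print(labels)
```

Note that a post matching no ontology words would still receive the first label under this tie rule, so a real classifier would need a "none" category or a minimum score.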

Think about an ontology (sports, a hobby, health-related) that you might want to use to classify a set of text. It's likely to come up again. :) 

Here are links to the labs and practices for this module.

**Labs**
1. [Lab: Text Analysis with Python](./labs/Lab_06.01_TextAnalysis_Python.ipynb)
2. [Lab: Text Classification](./labs/Lab_06.03_Text_Classification.ipynb)
3. [Lab: Ngrams and Text Clustering](./labs/Lab_06.04_Ngrams_Clustering.ipynb)

**Practices** 

1. [Practice: Classification](./practices/Practice_06.01_Classification.ipynb)


In [None]:
## If the code below doesn't work, install this library by uncommenting
## the line below:

# install.packages("tidytext")

### An Opening Example
So, let's take this example:

Let's count the number of words in Jane Austen's novels. 

In [None]:
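library(dplyr)
library(janeaustenr)
library(tidytext)

# Split the text into one word per row, then count each word within each book
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

# Total number of words in each book
total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))

# Attach each book's total to its word counts
book_words <- left_join(book_words, total_words, by = "book")

book_words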

There is one row in this book_words data frame for each word-book combination; n is the number of times that word is used in that book, and total is the total number of words in that book. The usual suspects are here with the highest n: “the”, “and”, “to”, and so forth. In the plot below, let’s look at the distribution of n/total for each novel: the number of times a word appears in a novel divided by the total number of terms (words) in that novel. This is exactly what term frequency is.

In [None]:
library(ggplot2)

ggplot(book_words, aes(n/total, fill = book)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~book, ncol = 2, scales = "free_y")

The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of Jane Austen’s novels as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Let’s do that now.
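
Written out, the statistic for a term $t$ in a document $d$ looks like the following. The symbols are chosen for this note rather than taken from the package; the definitions (term frequency as a share of the document's words, and a natural-log inverse document frequency) are the ones tidytext's bind_tf_idf uses.

$$
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\!\left(\frac{N}{N_t}\right), \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Here $n_{t,d}$ is the number of times term $t$ appears in document $d$ (the n column), the denominator of tf is the total number of words in $d$ (the total column), $N$ is the number of documents in the collection (six novels in this case), and $N_t$ is the number of documents that contain $t$.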

The bind_tf_idf function in the tidytext package takes a tidy text dataset as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (book in this case), and the last necessary column contains the counts, how many times each document contains each term (n in this example). We calculated a total for each book for our explorations in previous sections, but it is not necessary for the bind_tf_idf function; the table only needs to contain all the words in each document.

In [None]:
book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words

Notice that idf and thus tf-idf are zero for these extremely common words. These are all words that appear in all six of Jane Austen’s novels, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection.
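
The same arithmetic can be checked by hand on a tiny, made-up corpus. This is only a sketch in base R: the three "documents" and the helper functions (`docs_with`, `tf`, `idf`, `tf_idf`) are invented for illustration and are not part of tidytext, though they follow the same definitions (share-of-document term frequency and a natural-log idf).

```r
# Toy corpus (made up): three tiny "documents" as word vectors.
docs <- list(
  doc1 = c("the", "captain", "and", "the", "admiral"),
  doc2 = c("the", "parsonage", "and", "the", "garden"),
  doc3 = c("the", "abbey", "and", "the", "abbey")
)

n_docs <- length(docs)

# Number of documents that contain a given term at least once.
docs_with <- function(term) sum(sapply(docs, function(d) term %in% d))

# Term frequency: count of the term in one document / words in that document.
tf <- function(term, doc) sum(doc == term) / length(doc)

# Inverse document frequency with the natural log.
idf <- function(term) log(n_docs / docs_with(term))

# tf-idf is the product of the two.
tf_idf <- function(term, doc) tf(term, doc) * idf(term)

# "the" is in all three documents, so idf = log(3 / 3) = log(1) = 0.
print(idf("the"))
print(tf_idf("the", docs$doc1))

# "abbey" is in one document only, so idf = log(3 / 1), which is above zero.
print(idf("abbey"))
print(tf_idf("abbey", docs$doc3))
```

Even though "the" is the most frequent word in every toy document, its tf-idf is zero, while the rarer "abbey" gets a positive score in the one document that uses it.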

Let’s look at terms with high tf-idf in Jane Austen’s works.

In [None]:
book_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))

Let’s look at a visualization for these high tf-idf words.

In [None]:
plot_austen <- book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

plot_austen %>% 
  top_n(20, tf_idf) %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col() +
  labs(x = NULL, y = "tf-idf") +
  coord_flip()

And, if you're really into Jane Austen, you'll probably want to see the novels individually. 

In [None]:
plot_austen %>% 
  group_by(book) %>% 
  top_n(15, tf_idf) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()

## Data Sets

For this module, our data sets are files to process.

Found in the root directory datasets folder:
Path: `../../../../datasets/DSA-8630/newsgroups/`

 Dataset Name                | Files
-----------------------------|------------
 newsgroups                  | `newsgroups/` 
 tweets                      | `DSA-8630/tweets.csv` 
 
 
 Found in the course datasets folder:
 Path: `../../../datasets/`
 
 Dataset Name                | Files
-----------------------------|------------
 movie_reviews               | `movie_reviews` 


 A symlink to the directory is found in the exercise folder:

 Dataset Name                | Files
-----------------------------|------------
 book                        | `book` 

   
## Suggested Schedule
- Note: There is no topic discussion this week. 
- Please participate in Mutual Aid on Slack. Post questions you have, and post things you discovered or learned. Offer suggestions or answers for your classmates' questions.

### Monday - Tuesday
    
  - **Lab Notebooks**: 
    1. Lab: Text Analysis with Python
    2. Lab: Text Analysis with R
    3. Lab: Text Classification
    4. Lab: Ngrams and Text Clustering
     
### Wednesday
  - **Practice Notebooks**
      1. Practice: Classification
      

### Thursday
  - Practice answers to be released
  - **Exercises**: <span style="color:red">Exercise to be deployed later in the week.</span>
      1. Exercise 06.01

### Next Wednesday
  - **Exercise Due**