# tf-idf

This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com)

This tutorial shows how to use the tf-idf statistic to distinguish words that are important in a text from words that are simply common. It draws heavily on the excellent book [Text Mining With R](http://tidytextmining.com/) by the creators of the `tidytext` package, [Julia Silge](https://twitter.com/juliasilge) and [David Robinson](http://varianceexplained.org).

A term’s inverse document frequency (idf) decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. 

This can be combined with term frequency (tf) to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used.

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
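
Written out, the quantities described above look as follows. This is the variant that `bind_tf_idf` in `tidytext` computes (natural logarithm, unsmoothed); other libraries use slightly different weightings. The symbols $n_{t,d}$ (count of term $t$ in document $d$), $N$ (number of documents) and $D$ (the set of documents) are notation introduced here for illustration.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}
\qquad
\mathrm{idf}(t) = \ln \frac{N}{\lvert \{ d \in D : n_{t,d} > 0 \} \rvert}
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A term that occurs in every document has $\mathrm{idf} = \ln 1 = 0$, so its tf-idf is zero no matter how often it is used.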

In [None]:
library(tidyverse)
library(tidytext)

##### Read documents
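The code below is based on what was done in [another notebook](https://github.com/simonlindgren/Tidy-Text-first-steps/blob/master/Tidy%2Btext%2Bfirst%2Bsteps.ipynb). It reads a semicolon-separated `csv` file (the format `read_csv2` expects), splits the text into one word per row, and removes stop words, giving a tidy dataset.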

In [None]:
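# read_csv2() expects semicolon-separated values
documents <- read_csv2("tidyraw2.csv")

# Split the `text` column into one lowercase word per row
tidy_documents <- documents %>%
    unnest_tokens(word, text)
    # Alternative: tokenize into bigrams instead of single words
    # unnest_tokens(ngram, text, token = "ngrams", n = 2)

# Remove the English stop words shipped with tidytext
data(stop_words)
tidy_documents <- anti_join(tidy_documents, stop_words, by = "word")
# Remove custom stop words; swestop.csv must contain a `word` column
my_stop_words <- read_csv2("swestop.csv")
tidy_documents <- anti_join(tidy_documents, my_stop_words, by = "word")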

In [None]:
# Inspect the dataframe
tidy_documents
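
To make the tokenizing and stop-word steps concrete without the original `csv` files, here is a minimal sketch on made-up data. The tibble `toy_documents` and its contents are hypothetical; only the column names `blogger` and `text` mirror the ones used in this notebook.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-row input with the same columns the notebook relies on
toy_documents <- tibble(
  blogger = c("anna", "bertil"),
  text    = c("The cat sat on the mat.", "A dog chased the cat!")
)

toy_tidy <- toy_documents %>%
  unnest_tokens(word, text) %>%       # one lowercase word per row, punctuation dropped
  anti_join(stop_words, by = "word")  # drop rows whose word is in tidytext's stop_words

toy_tidy
```

Each remaining row is one (blogger, word) pair, which is the shape the counting steps below expect.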

##### What are the most commonly used words?
We count how often each word is used by each `blogger`.

In [None]:
# count() groups by the listed columns itself, so no group_by()/ungroup() is needed
document_words <- tidy_documents %>%
    count(blogger, word, sort = TRUE)
document_words

In [None]:
# Also, what are the total words for each blogger?
total_words <- document_words %>% 
  group_by(blogger) %>% 
  summarize(total = sum(n))
total_words

# Attach each blogger's total word count to every row (join key made explicit)
document_words <- left_join(document_words, total_words, by = "blogger")

##### TF-IDF
The `bind_tf_idf` function in the tidytext package takes as input a tidy text dataset with one row per token (term) per document. One column (`word` here) contains the terms/tokens, one column contains the documents (`blogger` in this case), and the last necessary column contains the counts, that is, how many times each document contains each term (`n` in this example).

We use `bind_tf_idf` to calculate TF-IDF:

In [None]:
document_words <- document_words %>%
  bind_tf_idf(word, blogger, n)
document_words
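
To see what the new `tf`, `idf` and `tf_idf` columns mean, the sketch below recomputes them by hand on a tiny made-up table and compares the result with `bind_tf_idf`. The data in `toy_counts` and the names `by_hand` and `with_tidytext` are hypothetical and used only for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for two bloggers
toy_counts <- tibble(
  blogger = c("anna", "anna", "bertil", "bertil"),
  word    = c("cat", "the", "dog", "the"),
  n       = c(2, 8, 1, 9)
)

n_docs <- n_distinct(toy_counts$blogger)  # number of documents (bloggers)

by_hand <- toy_counts %>%
  group_by(blogger) %>%
  mutate(tf = n / sum(n)) %>%                          # share of the blogger's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(blogger))) %>%  # ln(documents / documents with word)
  ungroup() %>%
  mutate(tf_idf = tf * idf)

with_tidytext <- toy_counts %>%
  bind_tf_idf(word, blogger, n)

# TRUE if the manual computation matches tidytext
all.equal(by_hand$tf_idf, with_tidytext$tf_idf)
```

In this toy table, "the" is used by both bloggers, so its idf and tf-idf are zero even though it is the most frequent word for each of them.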

Let's look at terms with high TF-IDF.

In [None]:
tf_idf_dataframe <- document_words %>%
    select(-total) %>%
    arrange(desc(tf_idf))

tf_idf_dataframe

Now, let's visualise the high tf-idf words.

In [None]:
my_plot <- document_words %>%
    arrange(desc(tf_idf)) %>%
    mutate(word = factor(word, levels = rev(unique(word)))) 

my_plot %>%
    top_n(10, tf_idf) %>%  # rank explicitly by tf-idf instead of the last column
    ggplot(aes(word, tf_idf, fill = blogger)) +
    geom_col() +
    labs(x = NULL, y = "tf-idf") +
    coord_flip()

Now, we look at each blogger individually:

In [None]:
my_plot %>% 
  group_by(blogger) %>% 
  top_n(5, tf_idf) %>%  # top five words per blogger, ranked explicitly by tf-idf
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = blogger)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~blogger, ncol = 2, scales = "free") +
  coord_flip()
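
In the faceted plot, the factor levels of `word` are ordered across all bloggers at once, so the bars inside a single panel are not necessarily sorted. One possible refinement, not part of the original notebook, is to use `reorder_within()` and `scale_x_reordered()` from `tidytext`, which order the words separately within each facet. The sketch below uses hypothetical scores in `toy_scores` in place of `document_words`.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Hypothetical tf-idf scores standing in for the notebook's `document_words`
toy_scores <- tibble(
  blogger = c("anna", "anna", "anna", "bertil", "bertil", "bertil"),
  word    = c("cat", "mat", "tea", "dog", "cat", "park"),
  tf_idf  = c(0.30, 0.10, 0.20, 0.25, 0.05, 0.15)
)

toy_top <- toy_scores %>%
  group_by(blogger) %>%
  top_n(2, tf_idf) %>%   # two highest-scoring words per blogger
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, blogger))  # order words within each blogger

toy_facets <- ggplot(toy_top, aes(word, tf_idf, fill = blogger)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +  # strip the suffix that reorder_within() adds to the labels
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~blogger, ncol = 2, scales = "free") +
  coord_flip()

toy_facets
```

The same pattern should carry over to the real data by replacing `toy_scores` with `document_words` and adjusting the number of words per blogger.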