In 2016, David Robinson published a great analysis of Donald Trump's tweets (http://varianceexplained.org/r/trump-tweets/). It got a lot of publicity, and his collaboration with Julia Silge resulted in a new book and an approach called tidytext (http://tidytextmining.com/).

The tidytext package (https://github.com/juliasilge/tidytext) has become another option for advanced text analysis and has quickly gained a lot of support.

The tidytext package makes it possible to apply tidy data principles (https://www.jstatsoft.org/article/view/v059i10) to unstructured text.

Let's take a single multi-line string made of 3 sentences, one sentence per line.

In [9]:
import pandas as pd

text = """
Using tidy data principles is important.
In this package, we provide functions for tidy formats.
The novels of Jane Austen can be so tidy!
"""

The dataset is not yet compatible with the tidy tools. The first step is to use unnest_tokens.

### unnest_tokens function

The unnest_tokens function splits a text column (input) into tokens (e.g. sentences, words, ngrams, etc.).

In [None]:
text_split = text.splitlines()

df = pd.DataFrame({
    "text": text_split,
    "line": list(range(len(text_split)))
})

Next comes the tidy text format.

### The tidy text format

The tidy text format is defined as 'a table with one-token-per-row'.

To tokenize into words (unigrams):

In [16]:
from tidytext import unnest_tokens

table = unnest_tokens(df, "word", "text")
table

Unnamed: 0,line,word
0,0,
1,1,using
1,1,tidy
1,1,data
1,1,principles
1,1,is
1,1,important
2,2,in
2,2,this
2,2,package
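
For readers who do not have the Python tidytext port installed, the one-token-per-row idea can be approximated in plain pandas. The sketch below is not the library's implementation: the helper name `unnest_words` and the regular expression used to find words are assumptions made for illustration.

```python
import pandas as pd


def unnest_words(df, output, input_col):
    """Return one row per lowercase word found in df[input_col] (illustrative helper)."""
    # Lowercase the text and collect runs of letters, digits and apostrophes.
    tokens = df[input_col].str.lower().str.findall(r"[a-z0-9']+")
    # Swap the text column for the token lists, then emit one row per token.
    out = df.drop(columns=[input_col]).assign(**{output: tokens})
    return out.explode(output)


demo = pd.DataFrame({
    "text": ["Using tidy data principles is important.", "So tidy!"],
    "line": [1, 2],
})
words = unnest_words(demo, "word", "text")
print(words)
```

Each token keeps the `line` value of the row it came from, which is what makes later grouping and counting straightforward.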


And tokenize into phrases (bigrams).

In [None]:
# Bigrams are not available in the Python function; is there a way around it?
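
One possible workaround is to build the bigrams by hand with pandas. This is a minimal sketch under stated assumptions: `unnest_bigrams` is a hypothetical helper, not part of any tidytext port, and the word pattern is a simple regular expression rather than a full tokenizer. Bigrams are formed within each row only, so they never span two lines.

```python
import pandas as pd


def unnest_bigrams(df, output, input_col):
    """Return one row per pair of consecutive words in df[input_col] (illustrative helper)."""
    tokens = df[input_col].str.lower().str.findall(r"[a-z0-9']+")
    # Pair each word with its successor; rows with fewer than two words yield no pairs.
    bigrams = tokens.apply(lambda ws: [" ".join(pair) for pair in zip(ws, ws[1:])])
    return df.drop(columns=[input_col]).assign(**{output: bigrams}).explode(output)


demo = pd.DataFrame({"text": ["The novels of Jane Austen"], "line": [1]})
bigram_table = unnest_bigrams(demo, "bigram", "text")
print(bigram_table)
```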

## Removing stopwords

In [20]:
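import nltk
from nltk.corpus import stopwords

# The stop word list must be downloaded once before first use.
nltk.download('stopwords', quiet=True)

# Use a separate name so the imported `stopwords` module is not shadowed.
stop_words = set(stopwords.words('english'))

# Keep only the rows whose word is not a stop word.
new_table = table[~table['word'].isin(stop_words)]
new_table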

Unnamed: 0,line,word
0,0,
1,1,using
1,1,tidy
1,1,data
1,1,principles
1,1,important
2,2,package
2,2,provide
2,2,functions
2,2,tidy


## Summarizing word frequencies

In [21]:
# Count function using nltk too? is it available in tidytext?
# bind_tfidf what does it do?
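
On the second question: in the R tidytext package the function is spelled `bind_tf_idf`, and it adds `tf`, `idf` and `tf_idf` columns to a table of per-document word counts. Whether the Python port offers an equivalent is not confirmed here, so the sketch below recomputes the same quantities with pandas. The helper name `bind_tf_idf_sketch` and the demo counts are made up for illustration, and each line is treated as a document.

```python
import numpy as np
import pandas as pd


def bind_tf_idf_sketch(counts, term, document, n):
    """Add tf, idf and tf_idf columns to a table of per-document term counts."""
    out = counts.copy()
    # Term frequency: count of the term divided by all term counts in its document.
    out["tf"] = out[n] / out.groupby(document)[n].transform("sum")
    # Inverse document frequency: natural log of (documents / documents containing the term).
    n_docs = out[document].nunique()
    docs_with_term = out.groupby(term)[document].transform("nunique")
    out["idf"] = np.log(n_docs / docs_with_term)
    out["tf_idf"] = out["tf"] * out["idf"]
    return out


counts = pd.DataFrame({
    "line": [1, 1, 2, 2],
    "word": ["tidy", "data", "tidy", "package"],
    "n": [1, 1, 1, 1],
})
scored = bind_tf_idf_sketch(counts, "word", "line", "n")
print(scored)
```

A word that appears in every document, like "tidy" here, gets an idf of zero, so tf-idf highlights the words that distinguish one document from the others.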

In [38]:
def frequencytable(df):
    words = df['word']
    freq_table = {}
    for word in words:
        if word in freq_table:
            freq_table[word] += 1
        else:
            freq_table[word] = 1
    return freq_table

In [39]:
frequencytable(new_table)

{nan: 1,
 'using': 1,
 'tidy': 3,
 'data': 1,
 'principles': 1,
 'important': 1,
 'package': 1,
 'provide': 1,
 'functions': 1,
 'formats': 1,
 'novels': 1,
 'jane': 1,
 'austen': 1}
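
The same count can also be sketched with pandas alone, which has the side benefit of dropping the missing token that shows up as `nan` in the dictionary above. The demo table here is a stand-in for the stop-word-filtered table, and the column name `n` is just a convention borrowed from the R `count` output.

```python
import pandas as pd

# Stand-in for the filtered word table; None plays the role of the empty first line.
demo = pd.DataFrame({"word": [None, "tidy", "data", "tidy", "tidy", "package"]})

counts = (
    demo["word"]
    .dropna()                 # discard the missing token
    .value_counts()           # frequency per word, most frequent first
    .rename_axis("word")
    .reset_index(name="n")
)
print(counts)
```

Keeping the result as a DataFrame rather than a dict means it can be filtered, joined and plotted like any other tidy table.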

## Case Study Gutenberg

### Gutenbergr

The gutenbergr R package (https://ropensci.org/tutorials/gutenbergr_tutorial.html) provides access to the Project Gutenberg collection. The package contains tools for downloading books and for finding works of interest.

In [40]:
import gutenberg

In [44]:
# sherlock holmes

# Retrieve the first 10 titles of Arthur Conan Doyle in the Gutenberg library.

# how to get the books? either download them beforehand or use beautifulsoup?
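
If the books are downloaded beforehand as plain-text files, the main cleaning step is removing the Project Gutenberg header and license footer. The sketch below assumes the files use the usual `*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines; the function name is hypothetical, and the sample string is a made-up stand-in for a downloaded file, not real book text.

```python
import re

# Marker lines assumed to delimit the book body in Project Gutenberg plain-text files.
START_RE = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END marker lines, if they are present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0       # fall back to the start of the file
    stop = end.start() if end else len(raw)   # fall back to the end of the file
    return raw[begin:stop].strip()


sample = (
    "Header and license text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Body line one.\n"
    "Body line two.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

The cleaned text can then be split into lines and placed in a DataFrame in the same way as the three-sentence example.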

In [None]:
# removing stopwords and word count can be done using the previous functions again
# ggplot in python
# sentiment analysis --> again using NLTK?
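
In the tidytext book, sentiment analysis is done by joining the one-token-per-row table with a sentiment lexicon. A pandas merge can play the same role. The sketch below uses a tiny, hypothetical lexicon purely to show the mechanics; a real analysis would substitute an established lexicon.

```python
import pandas as pd

# Hypothetical toy lexicon for illustration only; not a real sentiment resource.
toy_lexicon = pd.DataFrame({
    "word": ["important", "tidy", "terrible"],
    "sentiment": ["positive", "positive", "negative"],
})

# Stand-in for a tokenized, stop-word-filtered table.
words = pd.DataFrame({
    "line": [1, 1, 2, 3],
    "word": ["tidy", "important", "package", "tidy"],
})

# The inner join keeps only the words that appear in the lexicon.
scored = words.merge(toy_lexicon, on="word", how="inner")
sentiment_counts = scored.groupby("sentiment").size()
print(sentiment_counts)
```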

