Predictive Keyboard. Milestone report
=======================================

This project is an exercise in building a predictive model for text typed on a keyboard. The model could combine probabilistic models (N-grams and others) with rule-based models (which, in general, could *also* be expressed using probabilities), and different keyboard tasks will use different models. Since this document is only a milestone to check progress, I'll reserve the more detailed discussion for the final document.
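
As a point of reference, and only as an assumption about where the N-gram part may go (no estimator has been chosen in this report), the simplest bigram model scores a candidate next word $w_n$ by its relative frequency after the previous word $w_{n-1}$:

$$
P(w_n \mid w_{n-1}) \approx \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}
$$

Here $C(\cdot)$ is the number of times a word or word pair occurs in the training text. The keyboard would then suggest the words with the highest estimated probability.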


Deliverables for this milestone
-------------------------------

The main deliverables are: 
- Demonstrate that you've downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings that you have amassed so far.


What data do we have?  
---------------------

The data provided consists of 4 sets of files, one per language (English, German, Finnish and Russian), each containing samples of tweets, blogs and news. Some basic statistics follow: line and word counts per file (obtained with the *wc* command, run recursively).

In [None]:
  lines       words     file
 371440    12653185    .//de_DE/de_DE.blogs.txt
 244743    13219388    .//de_DE/de_DE.news.txt
 947774    11803735    .//de_DE/de_DE.twitter.txt
 
 899288    37334690    .//en_US/en_US.blogs.txt
1010242    34372720    .//en_US/en_US.news.txt
2360148    30374206    .//en_US/en_US.twitter.txt
 
 439785    12732013    .//fi_FI/fi_FI.blogs.txt
 485758    10446725    .//fi_FI/fi_FI.news.txt
 285214     3153003    .//fi_FI/fi_FI.twitter.txt
 
 337100     9691167    .//ru_RU/ru_RU.blogs.txt
 196360     9416099    .//ru_RU/ru_RU.news.txt
 881414     9542485    .//ru_RU/ru_RU.twitter.txt

So we have a large amount of data to analyze: the English set alone adds up to roughly 4.3 million lines and 102 million words.
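
The same kind of counts can also be obtained from inside R, which doubles as a check that the files load. The following is a minimal sketch, not the code used for the table above: `count_lines_words` is a hypothetical helper, and it runs on a small temporary file rather than on the real data. The results should be close to *wc* but need not match it exactly (for instance, *wc* counts newline characters, while `readLines` also counts a last line without a trailing newline).

```r
## Count lines and whitespace-separated words in a text file,
## roughly mirroring `wc -l` and `wc -w`.
## NOTE: count_lines_words is a hypothetical helper name for illustration.
count_lines_words <- function(path, encoding = "UTF-8") {
  txt <- readLines(path, encoding = encoding, warn = FALSE, skipNul = TRUE)
  ## split each line on runs of whitespace and ignore empty tokens
  tokens <- strsplit(txt, "[[:space:]]+")
  n_words <- sum(vapply(tokens, function(tok) sum(nzchar(tok)), integer(1)))
  data.frame(lines = length(txt), words = n_words, file = path,
             stringsAsFactors = FALSE)
}

## Toy demonstration on a temporary file (not the real data)
demo_file <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "  jumps over", "", "the lazy dog"),
           demo_file)
demo_counts <- count_lines_words(demo_file)
print(demo_counts[, c("lines", "words")])
```

On the real files this reads everything into memory, so it may be slow for the largest sets; pointing `path` at something like data/en_US/en_US.blogs.txt would be the actual use.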


Exploratory analysis
-----------------------------

The goal here is a basic report of summary statistics about the data sets. At this point we want a quick N-gram analysis, with unigrams and bigrams, to check the frequencies of the most used words and expressions. I'll mostly use the **tm** (text mining) and **RWeka** libraries for the initial exploration, starting with a subset of the English blogs set. The R code should be run from the folder that contains the *data* folder (so that *data/en_US* resolves from the working directory).
I'll use a small subset of the data for this initial exploratory task. There were several issues with RWeka, Weka and Java on the Mac, notably the error "Error in rep(seq_along(x), sapply(tflist, length)) : invalid 'times' argument", which forced me to use a single core: options(mc.cores=1). More details here: http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka
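
One possible way to build such a small subset is to draw a random sample of lines from a full file and write it into a separate folder. This is a sketch under assumptions: `sample_lines` is a hypothetical helper, and the sampling fraction and seed are illustrative values, not the ones used for this report. The demonstration uses a temporary folder as a stand-in for the data folder.

```r
## Write a random sample of lines from `src` to `dest`.
## NOTE: sample_lines, `fraction` and `seed` are illustrative assumptions.
sample_lines <- function(src, dest, fraction = 0.01, seed = 1234) {
  lines <- readLines(src, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  set.seed(seed)                                   # reproducible subset
  n_keep <- max(1L, round(length(lines) * fraction))
  keep <- sort(sample.int(length(lines), n_keep))  # keep original line order
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  writeLines(lines[keep], dest, useBytes = TRUE)   # write bytes unchanged
  invisible(length(keep))
}

## Toy demonstration in a temporary folder (stand-in for the data folder)
src_file <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), src_file)
dest_file <- file.path(tempdir(), "small", "sample.txt")
n_written <- sample_lines(src_file, dest_file, fraction = 0.05)
```

Fixing the seed means the same subset is drawn on every run, which keeps the exploratory numbers comparable between sessions.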

In [None]:
%%R
library(tm)
library(RWeka)

## create a UnigramTokenizer (RWeka)
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
## create a BigramTokenizer (RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

## load the english documents
en_texts <- VCorpus(DirSource(directory="data/en_US/small", encoding="UTF-8"), 
                              readerControl=list(language="en"))

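## clean up: lowercase first so capitalized stopwords ("The") are matched,
## remove stopwords before punctuation so contractions ("don't") still match,
## then collapse extra white spaces. DON'T STEM YET.
## content_transformer() keeps the result a tm text document (tm >= 0.6)
en_texts <- tm_map(x=en_texts, FUN=content_transformer(tolower))
en_texts <- tm_map(x=en_texts, FUN=removeWords, words=stopwords(kind="en"))
en_texts <- tm_map(x=en_texts, FUN=removePunctuation)
en_texts <- tm_map(x=en_texts, FUN=stripWhitespace)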

## create a TermDocumentMatrix  
## NOTE - without the "options" underneath, the TermDocumentMatrix call crashes - 
## (looks like a parallel processing issue)
options(mc.cores=1)
tdmUnigram <- TermDocumentMatrix(en_texts, control=list(tokenizer=UnigramTokenizer))
tdmBigram <- TermDocumentMatrix(en_texts, control=list(tokenizer=BigramTokenizer))
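
Once the term-document matrices exist, the next step is to rank terms by frequency. The sketch below shows the idea in base R on a toy input, so it runs without tm or RWeka. `ngram_freq` is a hypothetical helper, not part of either library. With the matrices built above, the analogous quantity would be the row sums of the matrix (for example after `as.matrix`), which can be memory hungry on a large corpus.

```r
## Count n-grams of size n in a character vector of (already cleaned) lines
## and return them sorted by decreasing frequency.
## NOTE: ngram_freq is a hypothetical helper name for illustration.
ngram_freq <- function(lines, n = 1) {
  grams <- unlist(lapply(strsplit(lines, "[[:space:]]+"), function(tok) {
    tok <- tok[nzchar(tok)]
    if (length(tok) < n) return(character(0))
    ## build every run of n consecutive tokens on this line
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

## Toy input standing in for the cleaned corpus
toy <- c("new york city", "new york times", "york city hall")
uni <- ngram_freq(toy, n = 1)
bi  <- ngram_freq(toy, n = 2)
print(head(uni, 3))
print(head(bi, 3))
```

N-grams are built per line here, so no bigram spans two lines; whether that is the right choice for tweets versus longer blog entries is left open.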