Text scrubbing to build a TF-IDF model
To download the data, go to: http://qwone.com/~jason/20Newsgroups/
and select 20news-bydate.tar.gz
For more information on the model, go to: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
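
As a rough illustration of what scrubbing text and weighting it by TF-IDF can look like, here is a minimal pure-Python sketch. It does not use Spark, and it is not necessarily how this project does it: the function names `scrub` and `tf_idf`, the cleaning rule (lowercase, letters only), and the two sample sentences are all assumptions made for the example. The IDF uses the smoothed form log((N + 1) / (df + 1)) described in the Spark MLlib guide linked above.

```python
import math
import re
from collections import Counter


def scrub(text):
    """Lowercase the text and keep only runs of letters (an assumed cleaning rule)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the raw term count; IDF is the smoothed log((N + 1) / (df + 1)).
    """
    tokenized = [scrub(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


# Hypothetical sample documents, standing in for newsgroup posts.
docs = ["The rocket launch was delayed.", "The hockey game was delayed!"]
weights = tf_idf(docs)
```

Terms that occur in every document, such as "the" here, get a weight of zero, while terms unique to one document keep a positive weight.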