# XKE Text Mining - Exercise
----
****

This notebook introduces the processing and analysis of text data in Spark with Python.

## Settings
----

In [None]:
# Check if you have a running SparkContext
sc

## 1 - Load data
----

### Text Corpus

##### Load the corpus of texts stored in `data/articles_blog/` into a DataFrame

> Hint 1: Texts are stored in directories named after their year and month of release. To load all the texts into one DataFrame, use `*` in place of each directory name:

`/articles_blog/*/*/*.txt`

> Hint 2: Each file contains three fields separated by `|`. The resulting DataFrame must therefore have three columns:
- title: String
- category: String
- content: String

> Hint 3: One way to proceed is to use the [read](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader) function with the `com.databricks.spark.csv` format.
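
To make the three-field layout from Hint 2 concrete, here is a minimal plain-Python sketch (not Spark code) that splits `|`-separated lines into `title`, `category` and `content`. The two sample lines are invented for illustration and are not taken from the corpus.

```python
import csv
import io

# Hypothetical sample in the title|category|content layout (not real corpus data)
sample = io.StringIO(
    "Intro to Spark|Data|Spark is a cluster computing engine\n"
    "Scrum basics|Agile|A sprint is a fixed-length iteration\n"
)

# csv.reader splits each line on the chosen delimiter
rows = [
    {"title": title, "category": category, "content": content}
    for title, category, content in csv.reader(sample, delimiter="|")
]

print(rows[0]["category"])  # Data
```

A Spark CSV reader configured with `|` as its delimiter is expected to perform the same kind of split across all the files matched by the glob pattern.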

In [None]:
df_corpus = <FILL IN>
df_corpus = (df_corpus
             .withColumnRenamed('C0', 'title')
             .withColumnRenamed('C1', 'category')
             .withColumnRenamed('C2', 'content')
             )

In [None]:
df_corpus.printSchema()

In [None]:
# Run this test to check if you got the right DataFrame size
assert(df_corpus.count() == 335)

### Stopwords

##### Load the stopwords stored in `data/stopwords_french.txt` into a Python list of strings

In [None]:
stopwords = <FILL IN>

In [None]:
# Run this test to check if you got the right array size
assert(len(stopwords) == 237)
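
As a rough sketch of the loading step, assuming the stopwords file holds one word per line (the notebook does not state the file layout), the snippet below reads a file-like object into a list. The in-memory `fake_file` and its contents are stand-ins invented for illustration.

```python
import io

# Hypothetical stand-in for the stopwords file, assuming one word per line
fake_file = io.StringIO("le\nla\nles\n\nun\n")

# Strip the trailing newline of each line and skip blank lines
sample_stopwords = [line.strip() for line in fake_file if line.strip()]

print(sample_stopwords)  # ['le', 'la', 'les', 'un']
```

With the real file, the same comprehension can run over `open(path)`; reading it through `sc.textFile(path).collect()` is another possible route.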

## 2 - Tokenizer
----

Now that the data is available, it is time to pre-process it before we can use it in algorithms. The first thing to do is to tokenize each text to get an array of tokens (words) that will be used afterwards.

##### Create a function extract_tokens which transforms a String into an array of tokens. The transformation can perform the following actions:

- Split on spaces (mandatory)
- Remove punctuation and numbers (can be done with `punctuation.sub(' ', sentence)`) 
- Convert to lowercase
- Remove all stopwords
- Keep only words strictly longer than 2 characters

In [None]:
import re
punctuation = re.compile(r'[-.?!,":;()|0-9]')

In [None]:
def extract_tokens(sentence):
    """Tokenization of a given sentence.
       Drop stopwords, punctuations, numbers.
       Change the sentence to lowercase."""
    
    <FILL IN>

In [None]:
# Test your function
extract_tokens("Hello, World!")
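
For comparison with your own solution, here is one possible sketch of the tokenization steps listed above. It reuses the same character class as the `punctuation` pattern; the name `sketch_tokens` and the tiny stopword set are hypothetical and only serve the illustration.

```python
import re

# Same character class as the notebook's `punctuation` pattern
PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')
# Hypothetical mini stopword list, for illustration only
SAMPLE_STOPWORDS = {"les", "des", "est", "une"}


def sketch_tokens(sentence, stopwords=SAMPLE_STOPWORDS):
    """Lowercase, replace punctuation and digits by spaces, split, then filter."""
    cleaned = PUNCTUATION.sub(" ", sentence.lower())
    return [
        token
        for token in cleaned.split()
        if token not in stopwords and len(token) > 2
    ]


print(sketch_tokens("Hello, World! Les 3 tests est OK."))
# ['hello', 'world', 'tests']
```

Lowercasing before the stopword check matters here: the stopword set is lowercase, so `Les` would otherwise slip through.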

##### Create a Spark UDF (User Defined Function) which wraps the previous `extract_tokens` function

> Hint: Use the [udf](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.udf) function with your previous function in a lambda function

##### Add a new column named `tokens` to the df_corpus DataFrame containing the result of the tokenizer UDF used on the `content` column

> Hint 1: Use the [withColumn](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn) method of the df_corpus DataFrame

> Hint 2: To apply the UDF on a DataFrame column, just do the following: `yourUDF(yourDataFrame.col_name)`

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import *

# Create UDF
extract_tokens_udf = <FILL IN>

# Add tokens column to DataFrame
df_tokens = df_corpus.<FILL IN>

In [None]:
# Check if your DataFrame has the right column names and types
df_tokens.dtypes

In [None]:
df_tokens.show(2)

##### What are the 10 most used words in the corpus?

> Hint 1: Use the [explode](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode) function to have a DataFrame with one word per line

> Hint 2: You can perform the following operations
- Use the `select` function and use the `explode` function on the `tokens` column, name it "word"
- Group By the "word" column
- Use the `count()` function to count the number of occurrences of each word
- Order By the count result, [descending](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.desc), and show the 10 first resulting words
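
The steps in Hint 2 amount to a classic word count. As a plain-Python analogue (not Spark code), with token lists invented for illustration:

```python
from collections import Counter

# Hypothetical token lists, one per document
docs_tokens = [
    ["spark", "data", "spark"],
    ["data", "agile"],
    ["spark"],
]

# "Explode": flatten to one word per row
words = [word for tokens in docs_tokens for word in tokens]

# Group by word, count, and keep the most frequent ones
top_words = Counter(words).most_common(2)
print(top_words)  # [('spark', 3), ('data', 2)]
```

In the DataFrame version, the flattening corresponds to `explode`, and the counting and sorting correspond to the group-by, `count()` and descending order-by steps.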

In [None]:
from pyspark.sql.functions import explode, desc

df_words = <FILL IN>
df_words.<FILL IN>

> You should observe that even with decent tokenization, the most used words are still not very useful for characterizing the articles.

##### (Bonus) Using the NGram Transformer, find the most used sequences of 2 consecutive words in the corpus

> Hint: Use the [ngram](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.NGram) transformer with n=2 and inputCol="tokens"
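
To see what a 2-gram looks like, here is a small plain-Python sketch (not the Spark transformer itself). It joins each pair of consecutive tokens with a space, which is how the NGram transformer represents its output; the function name `bigrams` and the sample tokens are made up for this illustration.

```python
def bigrams(tokens):
    """Return consecutive token pairs, each joined by a single space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


print(bigrams(["machine", "learning", "avec", "spark"]))
# ['machine learning', 'learning avec', 'avec spark']
```

A list with fewer than two tokens yields no bigram at all.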

In [None]:
from pyspark.ml.feature import NGram

bigram = NGram(<FILL IN>)
df_bigram = bigram.<FILL IN>

In [None]:
df_words = <FILL IN>
df_words.<FILL IN>

> You should see some combinations with interpretable meanings

## 3 - Word2Vec
----

Now that we have tokens, we can use them in some algorithms to extract useful features from them. One algorithm that we can use is [Word2Vec](https://spark.apache.org/docs/latest/ml-features.html#word2vec), which has an implementation in Spark ML.

In [None]:
from pyspark.ml.feature import Word2Vec

### Training model

##### Instantiate a new Word2Vec object with the following settings
- inputCol: "tokens"
- outputCol: "features"
- vectorSize: 50
- minCount: 10
- maxIter: 30

In [None]:
word2Vec = Word2Vec(<FILL IN>)

##### Train a model on the `df_tokens` DataFrame using the `fit` method of your word2Vec object

> This might take a few minutes to run depending on the parameters you chose

> You can try different parameters from the ones proposed, but avoid providing values that are too high when you don't work on a cluster

In [None]:
word2Vec_model = word2Vec.<FILL IN>

We now have on our hands a trained Word2Vec model that we can use and query.

##### Check how the learning phase went by finding synonyms of a few words of your choice

> Hint 1: Use the `findSynonyms(word, num_synonyms)` method called on the learned model

> Hint 2: The result of that function is a DataFrame. Use the show method on it to print the results

In [None]:
word2Vec_model.<FILL IN>

Word2Vec seems to do a good job finding synonyms. Now let's see whether we can find some associations like `king - man + woman -> queen`.

##### Write a function find_nearest_word(model, vectors_df, king_word, man_word, woman_word) which finds the word in the corpus which is nearest to the word of coordinates `king_word - man_word + woman_word`

> Hint 1: `model` is your Word2Vec model

> Hint 2: `vectors_df` is the DataFrame that comes from the getVectors() method of a Word2Vec model. It has two columns: 
- word: a word in the corpus
- vector: its associated coordinates

> Hint 3: The function may run the following actions
- Find the `vector` associated to the `king_word` in vectors_df
- Do the same for `man_word` and `woman_word`
- Use the findSynonyms function of the model with the vector `king_word - man_word + woman_word` as argument
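
The vector arithmetic behind Hint 3 can be sketched in plain Python (no Spark involved). The 2-D embeddings below are invented so that the result is easy to check by hand, and `nearest_word` is a hypothetical helper that ranks candidates by cosine similarity, the measure that `findSynonyms` reports.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_word(vectors, king_word, man_word, woman_word):
    """Word closest to king - man + woman, excluding the three inputs."""
    target = [
        k - m + w
        for k, m, w in zip(vectors[king_word], vectors[man_word], vectors[woman_word])
    ]
    excluded = {king_word, man_word, woman_word}
    candidates = [word for word in vectors if word not in excluded]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(nearest_word(vectors, "king", "man", "woman"))  # queen
```

Excluding the three input words is a choice made in this sketch: without it, one of the inputs is often the closest match to the computed vector.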

In [None]:
def find_nearest_word(model=None, vectors_df=None, king_word="", man_word="", woman_word=""):
    <FILL IN>

In [None]:
find_nearest_word(<FILL IN>)

##### Using the `transform()` method of the word2Vec_model, add a new column to the df_tokens DataFrame

In [None]:
df_features = word2Vec_model.<FILL IN>

In [None]:
df_features.show(3)
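
The new column holds one vector per document: the Word2Vec model averages the vectors of the words in each document. A minimal plain-Python sketch of that averaging, using invented 2-D word vectors and assuming every token has a vector:

```python
# Hypothetical 2-D word vectors, for illustration only
word_vectors = {"spark": [1.0, 0.0], "data": [0.0, 1.0]}


def document_vector(tokens, word_vectors):
    """Average the vectors of the tokens (assumes every token has a vector)."""
    rows = [word_vectors[token] for token in tokens]
    return [sum(column) / len(rows) for column in zip(*rows)]


print(document_vector(["spark", "spark", "data", "data"], word_vectors))
# [0.5, 0.5]
```

Two documents that use similar words therefore end up with nearby vectors, which is what the clustering step relies on.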

## 4 - KMeans
----

In this task, we would like to group the articles into clusters in which they share similar topics. 

For this, we will be using the KMeans algorithm on our corpus. As KMeans needs numerical features to run correctly, we will use the `features` column of our df_features DataFrame as input.
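
As a reminder of what KMeans does with those numerical features, here is a small plain-Python sketch of the assign-then-update loop. It is only an illustration under simplifying assumptions: the points and the initial centers are made up, and the initial centers are fixed by hand, whereas Spark chooses them itself.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


def update(points, labels, centers):
    """Mean of the points assigned to each center (an empty cluster keeps its center)."""
    new_centers = []
    for i, center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(center)
    return new_centers


def kmeans_sketch(points, centers, max_iter=30):
    """Alternate assignment and update until the centers stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical 2-D points and hand-picked initial centers
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels, centers = kmeans_sketch(points, centers=[[0.0, 0.0], [10.0, 0.0]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 1.0], [10.0, 1.0]]
```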

In [None]:
from pyspark.ml.clustering import KMeans

##### Instantiate a new KMeans object with the following settings
- k: an integer of your choice
- maxIter: 30
- featuresCol: "features"
- predictionCol: "cluster"

In [None]:
kmeans = KMeans(<FILL IN>)

##### Train a model on the `df_features` DataFrame using the `fit` method of your kmeans object

> This might take a few minutes to run depending on the parameters you chose

In [None]:
kmeans_model = kmeans.<FILL IN>

##### Using the `transform()` method of the kmeans_model, add a new column to the df_features DataFrame

In [None]:
df_clusters = kmeans_model.<FILL IN>

In [None]:
df_clusters.show(10)

We would like to know if the cluster centers represent meaningful topics. To do that, we will use the word2Vec_model to find the synonyms of the cluster centers in our corpus of words.

##### Write a function find_synonyms_cluster_center(word2Vec_model, kmeans_model, cluster_id) which finds the synonyms of the cluster center of your choice

> Hint 1: `word2Vec_model` is your Word2Vec model, on which you can call the findSynonyms method

> Hint 2: `kmeans_model` is your KMeans model, on which you can call the clusterCenters() method

In [None]:
def find_synonyms_cluster_center(word2Vec_model, kmeans_model, cluster_id):
    <FILL IN>

In [None]:
find_synonyms_cluster_center(word2Vec_model, kmeans_model, 0)

##### Write a function titles_in_cluster(df, cluster_id) which prints the titles of all articles found in a given cluster

> Hint 1: `df` is your df_clusters DataFrame

> Hint 2: `cluster_id` is the id of the selected cluster

In [None]:
def titles_in_cluster(df, cluster_id):
    <FILL IN>

In [None]:
titles_in_cluster(df_clusters, 0)

##### Write a function most_represented_categories_in_cluster(df, cluster_id) which prints the most represented categories among the articles in a given cluster

> Hint 1: `df` is your df_clusters DataFrame

> Hint 2: `cluster_id` is the id of the selected cluster

> Hint 3: Use the `category` column of the DataFrame

In [None]:
def most_represented_categories_in_cluster(df, cluster_id):
    <FILL IN>

In [None]:
most_represented_categories_in_cluster(df_clusters, 14)

##### (Bonus) Apply a KMeans to the words to see which words are clustered together

In [None]:
kmeans = KMeans(<FILL IN>)

In [None]:
kmeans_model = kmeans.<FILL IN>

In [None]:
words_clusters_df = kmeans_model.<FILL IN>

In [None]:
words_clusters_df.show(4)

In [None]:
words_clusters_df.filter(words_clusters_df.cluster == 0).show()

In [None]:
word2Vec_model.findSynonyms(kmeans_model.clusterCenters()[0], 10).show()

##### (Bonus) Identify clusters of words that seem unhelpful to you and add the corresponding words to the stopwords list. Then tokenize the texts with this new list, transform the resulting DataFrame with the Word2Vec model and run a KMeans. Does it help improve the interpretation of the clusters?

## 5 - Latent Dirichlet Allocation
----

We are now going to try another algorithm to infer topics from documents, which is Latent Dirichlet Allocation. First, we need to map each word to an index and filter undesirable words. We can then train an LDA model to find those topics.

##### Instantiate a new [CountVectorizer](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer) object with the following settings
- minTF: 5.0
- minDF: 15.0
- inputCol: "tokens"
- outputCol: "features"
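
To clarify what the two thresholds do, here is a plain-Python sketch (not the Spark implementation): `minDF` decides which terms enter the vocabulary, based on how many documents contain them, and `minTF` drops terms that are too rare within a single document. The sketch treats both as absolute counts; the helper names, the sample documents and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def fit_vocabulary(docs, min_df):
    """Keep terms present in at least `min_df` documents, most frequent first."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_freq = Counter(term for doc in docs for term in doc)
    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    # Ties are broken alphabetically: an arbitrary choice made for this sketch
    return sorted(kept, key=lambda term: (-term_freq[term], term))


def vectorize(doc, vocabulary, min_tf):
    """Sparse counts {term index: count}, dropping terms seen fewer than `min_tf` times."""
    counts = Counter(doc)
    return {
        index: counts[term]
        for index, term in enumerate(vocabulary)
        if counts[term] >= min_tf
    }


# Hypothetical tokenized documents
docs = [
    ["spark", "spark", "data"],
    ["spark", "agile"],
    ["data", "data", "spark"],
]

vocabulary = fit_vocabulary(docs, min_df=2)
print(vocabulary)                                # ['spark', 'data']
print(vectorize(docs[0], vocabulary, min_tf=2))  # {0: 2}
```

Here `agile` is left out of the vocabulary because it appears in only one document, and `data` is dropped from the first document's vector because it occurs there only once.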

In [None]:
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(<FILL IN>)

##### Fit the model on the `df_tokens` DataFrame, and then add a new column with the features

In [None]:
cv_model = cv.<FILL IN>

In [None]:
df_features = cv_model.<FILL IN>

In [None]:
df_features.show(5)

In order to train an LDA model, we need to provide an RDD in which each element pairs a document index with a sparse vector.

In [None]:
from pyspark.mllib.linalg import Vectors

In [None]:
rdd_features = <FILL IN>

##### Instantiate and train a new [LDA](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.LDA) object with the following settings
- k: 3
- maxIterations: 5

In [None]:
from pyspark.mllib.clustering import LDA

ldaModel = LDA.train(<FILL IN>)

> You can change the parameters of the model, but be careful, the training time can be very long

##### Use the `describeTopics` method of your ldaModel and the `vocabulary` attribute of your cv_model to display a 10-word description of each topic

In [None]:
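# print() with a single argument works in both Python 2 and Python 3
for topic in range(3):
    print("TOPIC " + str(topic))
    print("-------")
    # describeTopics() returns one (term indices, term weights) pair per topic
    for i in ldaModel.describeTopics()[topic][0][:10]:
        print(cv_model.vocabulary[i])
    print("\n")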