# XKE Text Mining - Solutions
----
****

This notebook aims at introducing the user to the processing and analysis of text data in Spark with Python.

## Settings
----

In [14]:
# Check if you have a running SparkContext
sc

<pyspark.context.SparkContext at 0x107548910>

## 1 - Load data
----

### Text Corpus

##### Load the corpus of texts stored in `data/articles_blog/` into a DataFrame

> Hint 1: Texts are stored in directories named after their year and month of release. To load all the texts into one DataFrame, you can use `*` in place of a directory name: 

`/articles_blog/*/*/*.txt`

> Hint 2: Each file contains three fields separated by `|`, so the resulting DataFrame must have three columns:
- title: String
- category: String
- content: String

> Hint 3: One way to proceed is to use the [read](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader) function with the `com.databricks.spark.csv` format.
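
To make the expected record layout concrete, here is a minimal pure-Python sketch of how one `|`-separated record maps to the three columns. The function name `parse_article_line` and the sample record are hypothetical, and the CSV reader applies its own splitting and quoting rules, so this is only an illustration of the structure.

```python
def parse_article_line(line, delimiter="|"):
    """Split one raw record into the three expected fields."""
    # maxsplit=2 is a choice made for this sketch: any further
    # delimiter is kept inside the content field.
    title, category, content = line.split(delimiter, 2)
    return {"title": title, "category": category, "content": content}


# Hypothetical record, for illustration only
record = parse_article_line("2014-01-08-some-title|Craft|Pour coder tous les jours")
print(record["category"])
```

A record with fewer than three fields raises a `ValueError` here, which is a quick way to spot malformed files.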

In [15]:
df_corpus = (sqlContext
      .read
      .format('com.databricks.spark.csv')
      .options(header='false', delimiter='|')
      .load('/Users/Yoann/Documents/Xebia/tests/text-mining/articles_blog/*/*/*.txt')
     )
df_corpus = (df_corpus
             .withColumnRenamed('C0', 'title')
             .withColumnRenamed('C1', 'category')
             .withColumnRenamed('C2', 'content')
             )

In [16]:
df_corpus.printSchema()

root
 |-- title: string (nullable = true)
 |-- category: string (nullable = true)
 |-- content: string (nullable = true)



In [17]:
# Run this test to check if you got the right DataFrame size
assert(df_corpus.count() == 335)

### Stopwords

##### Load the stopwords stored in `data/stopwords_french.txt` into a list of strings

In [18]:
stopwords = sc.textFile("/Users/Yoann/Documents/Xebia/tests/text-mining/stopwords_french.txt").collect()

In [19]:
# Run this test to check if you got the right array size
assert(len(stopwords) == 237)

## 2 - Tokenizer
----

Now that the data is available, it is time to pre-process it before we can use it in algorithms. The first thing to do is to tokenize each text to get an array of tokens (words) that will be used afterwards.

##### Create a function extract_tokens which transforms a String into an array of tokens. The transformation can perform the following actions:

- Split on spaces (mandatory)
- Remove punctuation and numbers (can be done with `punctuation.sub(' ', sentence)`) 
- Convert to lowercase
- Remove all stopwords
- Keep only words with a length strictly greater than 2

In [20]:
import re
punctuation = re.compile(r'[-.?!,":;()|0-9]')

In [21]:
def extract_tokens(sentence):
    """Tokenization of a given sentence.
       Drop stopwords, punctuations, numbers.
       Change the sentence to lowercase."""
    
    if(sentence) :
        tokens = punctuation.sub(' ', sentence.replace("'", " ")).lower().split()
        tokens_filtered = [word for word in tokens if (word not in stopwords) and (len(word) > 2)]
        return tokens_filtered
    else :
        return []

In [22]:
# Test your function
extract_tokens("Hello, World!")

['hello', 'world']
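
The "Hello, World!" test does not exercise the stopword filter. The sketch below is a self-contained variant that takes the stopwords as a parameter, so the effect is visible on a small example. The name `extract_tokens_with` and the three sample stopwords are assumptions made for this illustration, and passing a `set` is a design choice (membership tests are faster than on a list).

```python
import re

PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')


def extract_tokens_with(sentence, stopwords):
    """Tokenize a sentence, dropping stopwords and words of 2 characters or fewer."""
    if not sentence:
        return []
    # Replace straight apostrophes, punctuation and digits with spaces
    cleaned = PUNCTUATION.sub(" ", sentence.replace("'", " ")).lower()
    return [w for w in cleaned.split() if w not in stopwords and len(w) > 2]


# Hypothetical stopword sample, much smaller than the real list
sample_stopwords = {"les", "des", "pour"}
tokens = extract_tokens_with("Pour coder tous les jours, 2 tests!", sample_stopwords)
print(tokens)
```

Note that only the straight apostrophe `'` is replaced. A typographic apostrophe `’` is left untouched, so a word such as `c’est` stays a single token.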

##### Create a Spark UDF (User Defined Function) which uses the previous `extract_tokens` function

> Hint: Use the [udf](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.udf) function with your previous function in a lambda function

##### Add a new column named `tokens` to the df_corpus DataFrame containing the result of the tokenizer UDF used on the `content` column

> Hint 1: Use the [withColumn](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn) method of the df_corpus DataFrame

> Hint 2: To apply the UDF on a DataFrame column, just do the following: `yourUDF(yourDataFrame.col_name)`

In [23]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Create UDF
extract_tokens_udf = udf(lambda sentence: extract_tokens(sentence), ArrayType(StringType()))

# Add tokens column to DataFrame
df_tokens = df_corpus.withColumn('tokens', extract_tokens_udf(df_corpus.content))

In [24]:
# Check if your DataFrame has the right column names and types
df_tokens.dtypes

[('title', 'string'),
 ('category', 'string'),
 ('content', 'string'),
 ('tokens', 'array<string>')]

In [12]:
df_tokens.show(2)

+--------------------+-------------+--------------------+--------------------+
|               title|     category|             content|              tokens|
+--------------------+-------------+--------------------+--------------------+
|2014-01-08-crafts...|        Craft|  Pour coder tous...|[coder, jours, pl...|
|2014-01-10-androi...|AndroidMobile|  En tant que dév...|[tant, développeu...|
+--------------------+-------------+--------------------+--------------------+
only showing top 2 rows



##### What are the 10 most used words in the corpus?

> Hint 1: Use the [explode](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode) function to have a DataFrame with one word per line

> Hint 2: You can perform the following operations
- Use the `select` function and use the `explode` function on the `tokens` column, name it "word"
- Group By the "word" column
- Use the `count()` function to count the number of occurrences of each word
- Order By the count result, [descending](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.desc), and show the 10 first resulting words
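
To see what these steps compute, here is a pure-Python equivalent on a toy corpus. It does not use Spark: `docs_tokens` is made-up data, the list comprehension plays the role of `explode`, and `Counter` stands in for the group-by and count.

```python
from collections import Counter

# Toy data: one list of tokens per document (hypothetical)
docs_tokens = [
    ["spark", "données", "spark"],
    ["données", "spark", "code"],
]

# "explode": flatten to one word per row
words = [word for tokens in docs_tokens for word in tokens]

# "groupBy + count + orderBy desc": count occurrences and keep the top ones
top = Counter(words).most_common(2)
print(top)
```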

In [12]:
from pyspark.sql.functions import explode, desc

df_words = df_tokens.select(explode(df_tokens.tokens).alias('word'))
df_words.groupBy('word').count().orderBy(desc('count')).show()

+-----------+-----+
|       word|count|
+-----------+-----+
|       plus| 1576|
|      cette|  909|
|      c’est|  804|
|       d’un|  787|
|       code|  625|
|      faire|  617|
|     projet|  584|
|      d’une|  583|
|       bien|  576|
|     permet|  559|
|    données|  527|
|      xebia|  458|
|    exemple|  418|
|      tests|  407|
|       test|  406|
|        the|  395|
|      n’est|  369|
|       faut|  360|
|       http|  357|
|application|  355|
+-----------+-----+
only showing top 20 rows



> You should observe that even with decent tokenization, the most used words are still not very useful for characterizing the articles.

##### (Bonus) Using the NGram Transformer, find the most used sequences of 2 consecutive words in the corpus

> Hint: Use the [ngram](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.NGram) transformer with n=2 and inputCol="tokens"
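
As a rough picture of what the transformer produces, the following pure-Python sketch builds n-grams as space-joined runs of consecutive tokens. The helper name `ngrams` is an assumption for this illustration and is not part of Spark.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space."""
    # A list shorter than n yields no n-gram at all
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(["mise", "place", "machine", "learning"])
print(bigrams)
```

Because stopwords were removed before this step, a bigram such as `mise place` can join words that were not adjacent in the original text.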

In [13]:
from pyspark.ml.feature import NGram

bigram = NGram(n=2, inputCol='tokens', outputCol='bigrams')
df_bigram = bigram.transform(df_tokens)

In [14]:
df_words = df_bigram.select(explode(df_bigram.bigrams).alias('word'))
df_words.groupBy('word').count().orderBy(desc('count')).show()

+--------------------+-----+
|                word|count|
+--------------------+-----+
|         cet article|  149|
|          mise place|   92|
|        mettre place|   89|
|    machine learning|   88|
|             col col|   78|
|software craftsma...|   78|
|          chez xebia|   67|
|        scikit learn|   65|
|          c’est dire|   65|
|           plus plus|   65|
|        public class|   61|
|         d’un projet|   60|
|         public void|   60|
|       product owner|   56|
|     tests unitaires|   56|
|           plus loin|   55|
|   d’une application|   54|
| continuous delivery|   52|
|  cette présentation|   51|
|    http //localhost|   47|
+--------------------+-----+
only showing top 20 rows



> You should see some combinations with interpretable meanings

## 3 - Word2Vec
----

Now that we have tokens, we can use them in some algorithms to extract useful features from them. One algorithm that we can use is [Word2Vec](https://spark.apache.org/docs/latest/ml-features.html#word2vec), which has an implementation in Spark ML.

In [13]:
from pyspark.ml.feature import Word2Vec

### Training model

##### Instantiate a new Word2Vec object with the following settings
- inputCol: "tokens"
- outputCol: "features"
- vectorSize: 50
- minCount: 10
- maxIter: 50

In [14]:
word2Vec = Word2Vec(vectorSize=50, minCount=10, maxIter=50, inputCol='tokens', outputCol='features')

##### Train a model on the `df_tokens` DataFrame using the `fit` method of your word2Vec object

> This might take a few minutes to run depending on the parameters you chose

> You can try different parameters from the ones proposed, but avoid providing values that are too high when you don't work on a cluster

In [15]:
word2Vec_model = word2Vec.fit(df_tokens)

We now have on our hands a trained Word2Vec model that we can use and query.

##### Check how the learning phase went by finding synonyms of a few words of your choice

> Hint 1: Use the `findSynonyms(word, num_synonyms)` method called on the learned model

> Hint 2: The result of that function is a DataFrame. Use the show method on it to print the results
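
The idea behind a synonym lookup is to rank the other words by how close their vectors are to the query word's vector. Below is a minimal pure-Python sketch using cosine similarity on made-up 2-dimensional vectors. The names `cosine`, `nearest_words` and `toy_vectors` are assumptions for this illustration. The scores printed by the notebook exceed 1, so that Spark version evidently scales them differently from a normalized cosine, and only the ranking idea carries over.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_words(vectors, query, num=2):
    """Rank every other word by similarity to the query word."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:num]


# Made-up vectors, for illustration only
toy_vectors = {
    "conférence": [1.0, 0.0],
    "keynote": [0.9, 0.1],
    "talks": [0.7, 0.7],
    "docker": [0.0, 1.0],
}
ranked = nearest_words(toy_vectors, "conférence", 2)
print(ranked)
```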

In [23]:
word2Vec_model.findSynonyms("conférence", 10).show()

+-----------+------------------+
|       word|        similarity|
+-----------+------------------+
|    keynote| 1.556393936797057|
|conférences|1.4617819337912439|
|d’ouverture|1.3629812805060728|
|    édition|1.3451473302159676|
|      année|1.3071189316752012|
|    journée| 1.305614050319781|
|    xebicon|1.2829967047689437|
|      talks|1.2686259718399966|
|  programme|1.2677530964254287|
|  retrouvez| 1.264922965047682|
+-----------+------------------+



Word2Vec seems to do a good job of finding synonyms. Now let's see whether we can find some associations like `king - man + woman -> queen`.

##### Write a function find_nearest_word(model, vectors_df, king_word, man_word, woman_word) which finds the word in the corpus which is nearest to the word of coordinates `king_word - man_word + woman_word`

> Hint 1: `model` is your Word2Vec model

> Hint 2: `vectors_df` is the DataFrame that comes from the getVectors() method of a Word2Vec model. It has two columns: 
- word: a word in the corpus
- vector: its associated coordinates

> Hint 3: The function may run the following actions
- Find the `vector` associated to the `king_word` in vectors_df
- Do the same for `man_word` and `woman_word`
- Use the findSynonyms function of the model with the vector `king_word - man_word + woman_word` as argument
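
The arithmetic in these steps can be tried without Spark. The sketch below uses made-up 2-dimensional vectors whose axes loosely stand for "royalty" and "gender". The names `cosine`, `analogy` and `toy_vectors` are assumptions for this illustration. Unlike a plain nearest-neighbour query, it excludes the three input words from the candidates, which is a design choice of the sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(vectors, king_word, man_word, woman_word, num=1):
    """Find the words closest to king_word - man_word + woman_word."""
    target = [k - m + w for k, m, w in
              zip(vectors[king_word], vectors[man_word], vectors[woman_word])]
    excluded = {king_word, man_word, woman_word}
    scored = [(word, cosine(target, vec))
              for word, vec in vectors.items() if word not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:num]


# Made-up vectors, for illustration only
toy_vectors = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}
result = analogy(toy_vectors, "king", "man", "woman")
print(result[0][0])
```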

In [18]:
word2Vec_vectors_df = word2Vec_model.getVectors()

In [20]:
word2Vec_vectors_df.show(10)

+--------+--------------------+
|    word|              vector|
+--------+--------------------+
|     dns|[0.09927792102098...|
| speaker|[-0.1568965017795...|
|  assert|[-0.3278719782829...|
|    mise|[0.48140543699264...|
|  plugin|[0.35941249132156...|
| retenir|[0.14124287664890...|
|    afin|[0.14102584123611...|
|   alpha|[-0.2422264367341...|
|demandes|[-0.5457851290702...|
| cyrille|[0.03088080696761...|
+--------+--------------------+
only showing top 10 rows



In [25]:
def find_nearest_word(model=None, vectors_df=None, king_word="", man_word="", woman_word=""):
    vect_king_word = vectors_df.filter(vectors_df.word == king_word).first().vector
    vect_man_word = vectors_df.filter(vectors_df.word == man_word).first().vector
    vect_woman_word = vectors_df.filter(vectors_df.word == woman_word).first().vector
    return model.findSynonyms(vect_king_word - vect_man_word + vect_woman_word, 3).show()

In [32]:
find_nearest_word(model=word2Vec_model, vectors_df=word2Vec_vectors_df, 
                  king_word="xebian", man_word="conférence", woman_word="conférences")

+-----------+------------------+
|       word|        similarity|
+-----------+------------------+
|  collègues|2.3986518909221277|
|conférences| 2.275260079800317|
|    xebians| 2.215104610800545|
+-----------+------------------+



##### Using the `transform()` method of the word2Vec_model, add a new column to the df_tokens DataFrame

In [61]:
df_features = word2Vec_model.transform(df_tokens)

In [62]:
df_features.show(3)

+--------------------+-------------+--------------------+--------------------+--------------------+
|               title|     category|             content|              tokens|            features|
+--------------------+-------------+--------------------+--------------------+--------------------+
|2014-01-08-crafts...|        Craft|  Pour coder tous...|[coder, jours, pl...|[0.01064611069030...|
|2014-01-10-androi...|AndroidMobile|  En tant que dév...|[tant, développeu...|[-0.0081862772121...|
|2014-01-13-separe...|        Craft|  Lors de nos dév...|[lors, développem...|[0.03876683608669...|
+--------------------+-------------+--------------------+--------------------+--------------------+
only showing top 3 rows



## 4 - KMeans
----

In this task, we would like to group the articles into clusters in which they share similar topics. 

For this, we will be using the KMeans algorithm on our corpus. As KMeans needs numerical features to run correctly, we will use the `features` column of our df_features DataFrame as input.
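
As a rough picture of what KMeans does with those vectors, here is a minimal pure-Python sketch of the classic assign-then-update iterations. It is not Spark's implementation: the helper names are assumptions, the points are toy data, and the initial centers are passed in explicitly instead of being chosen by an initialization strategy.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centers.append(old)  # keep an empty cluster's center in place
            continue
        new_centers.append([sum(m[d] for m in members) / len(members)
                            for d in range(len(old))])
    return new_centers


def kmeans(points, centers, max_iter=30):
    labels = []
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    return labels, centers


# Toy 2-D points forming two obvious groups
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(labels, centers)
```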

In [114]:
from pyspark.ml.clustering import KMeans

##### Instantiate a new KMeans object with the following settings
- k: an integer of your choice
- maxIter: 30
- featuresCol: "features"
- predictionCol: "cluster"

In [146]:
kmeans = KMeans(k=15, maxIter=30, featuresCol="features", predictionCol="cluster")

##### Train a model on the `df_features` DataFrame using the `fit` method of your kmeans object

> This might take a few minutes to run depending on the parameters you chose

In [147]:
kmeans_model = kmeans.fit(df_features)

##### Using the `transform()` method of the kmeans_model, add a new column to the df_features DataFrame

In [148]:
df_clusters = kmeans_model.transform(df_features)

In [149]:
df_clusters.show(10)

+--------------------+-------------+--------------------+--------------------+--------------------+-------+
|               title|     category|             content|              tokens|            features|cluster|
+--------------------+-------------+--------------------+--------------------+--------------------+-------+
|2014-01-08-crafts...|        Craft|  Pour coder tous...|[coder, jours, pl...|[0.01064611069030...|      3|
|2014-01-10-androi...|AndroidMobile|  En tant que dév...|[tant, développeu...|[-0.0081862772121...|      3|
|2014-01-13-separe...|        Craft|  Lors de nos dév...|[lors, développem...|[0.03876683608669...|      7|
|2014-01-14-xebia-...|       Divers|  Xebia aura le p...|[xebia, plaisir, ...|[-0.0384734213196...|      1|
|2014-01-15-crafts...|        Craft|  La pratique des...|[pratique, tests,...|[-0.0429301756372...|      3|
|2014-01-17-fireba...|    BackFront|     Développer u...|[développer, appl...|[-0.0262986038291...|      3|
|2014-01-17-les-di...|      

We would like to know if the cluster centers represent meaningful topics. To do that, we will use the word2Vec_model to find the synonyms of the cluster centers in our corpus of words.

##### Write a function find_synonyms_cluster_center(word2Vec_model, kmeans_model, cluster_id) which finds the synonyms of the cluster center of your choice

> Hint 1: `word2Vec_model` is your Word2Vec model, on which you can call the findSynonyms method

> Hint 2: `kmeans_model` is your KMeans model, on which you can call the clusterCenters() method

In [35]:
def find_synonyms_cluster_center(word2Vec_model, kmeans_model, cluster_id):
    word2Vec_model.findSynonyms(kmeans_model.clusterCenters()[cluster_id], 10).show()

In [133]:
find_synonyms_cluster_center(word2Vec_model, kmeans_model, 0)

+-----------+-------------------+
|       word|         similarity|
+-----------+-------------------+
|      mllib| 0.5527822520467677|
|   learning| 0.5415354570414984|
|algorithmes| 0.5199606945530545|
|    données| 0.4988869351536844|
|    machine|0.49487814142295816|
|      learn|0.48959105512289014|
| algorithme| 0.4773796465860811|
|     scikit| 0.4722683352209581|
|        rdd|0.44398480875690705|
|  dataframe| 0.4399253758696312|
+-----------+-------------------+



In [102]:
def titles_in_cluster(df, cluster_id):
    # Keep only the articles assigned to the requested cluster
    df_cluster = df.filter(df.cluster == cluster_id)
    cluster_size = df_cluster.count()
    # print() with a single argument works in both Python 2 and Python 3
    print("Cluster Size : " + str(cluster_size))
    print("Cluster Content : ")
    for title_row in df_cluster.select('title').collect():
        print("\t" + title_row.title)

In [173]:
titles_in_cluster(df_clusters, 12)

Cluster Size : 2
Cluster Content : 
	2014-11-21-techevent-le-3-decembre-venez-decouvrir-watchkit-le-sdk-apple-watch
	2015-02-27-venez-suivre-la-keynote-apple-chez-xebia-2


In [152]:
def most_represented_categories_in_cluster(df, cluster_id):
    df_cluster = df.filter(df.cluster == cluster_id)
    df_cluster.groupBy('category').count().orderBy(desc('count')).show(5)

In [175]:
most_represented_categories_in_cluster(df_clusters, 14)

+--------+-----+
|category|count|
+--------+-----+
|   Front|    1|
+--------+-----+



##### (Bonus) Apply a KMeans to the words to see which words are clustered together

In [179]:
kmeans = KMeans(k=15, maxIter=50, featuresCol="vector", predictionCol="cluster")

In [180]:
kmeans_model = kmeans.fit(word2Vec_model.getVectors())

In [181]:
words_clusters_df = kmeans_model.transform(word2Vec_model.getVectors())

In [None]:
words_clusters_df.show(4)

In [191]:
words_clusters_df.filter(words_clusters_df.cluster == 0).show()

+----------------+--------------------+-------+
|            word|              vector|cluster|
+----------------+--------------------+-------+
|     //localhost|[-0.0727439373731...|      0|
|         network|[-0.9946057200431...|      0|
|application/json|[0.73389494419097...|      0|
|             any|[0.29274126887321...|      0|
|             win|[0.02392003312706...|      0|
|            name|[-0.0623365081846...|      0|
|           count|[-0.9167723059654...|      0|
|            demo|[-1.2303223609924...|      0|
|         mutable|[-0.0293252132833...|      0|
|            your|[-0.0517358668148...|      0|
|            rel=|[1.75475168228149...|      0|
|          server|[0.74753201007843...|      0|
|            path|[-0.5777864456176...|      0|
|           using|[0.91474348306655...|      0|
|        provider|[-0.9481966495513...|      0|
|           email|[-0.9109817743301...|      0|
|             yes|[-0.1472924202680...|      0|
|            auth|[-0.9163115024566...| 

In [205]:
word2Vec_model.findSynonyms(kmeans_model.clusterCenters()[14], 10).show()

+----------+------------------+
|      word|        similarity|
+----------+------------------+
|      veux|3.5881006251305543|
|    listes| 3.502384906595183|
|  d’envies|3.4487807050614574|
|      tant|3.0072742256851845|
|    invité|2.8871016986373803|
|constituer|2.7479848202778925|
|    privée| 2.056474293140105|
|  articles|1.9981329427129113|
|   pouvoir|1.9503680088204258|
|      lien| 1.913719766184153|
+----------+------------------+



##### (Bonus) Identify clusters of words that seem unhelpful to you and add the corresponding words to the stopwords list. Then tokenize the texts with this new list, transform the resulting DataFrame with the Word2Vec model and run a KMeans. Does it help improve the interpretation of the clusters?

## 5 - Latent Dirichlet Allocation
----

We are now going to try another algorithm to infer topics from documents: Latent Dirichlet Allocation. First, we need to map each word to an index and filter out undesirable words. We can then train an LDA model to find those topics.

##### Instantiate a new [CountVectorizer](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer) object with the following settings
- minTF: 5.0
- minDF: 15.0
- inputCol: "tokens"
- outputCol: "features"
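
To clarify the two thresholds, here is a pure-Python sketch of the filtering they describe, assuming both are given as counts (values of 1 or more): `min_df` is the minimum number of documents a term must appear in to enter the vocabulary, and `min_tf` is the minimum count a term needs within one document to be kept in that document's vector. The function names and the toy documents are assumptions, and ties in the vocabulary order are broken alphabetically here only to make the result deterministic.

```python
from collections import Counter


def fit_vocabulary(docs, min_df=1):
    """Keep terms present in at least min_df documents, most frequent first."""
    doc_freq = Counter()
    term_count = Counter()
    for tokens in docs:
        term_count.update(tokens)
        doc_freq.update(set(tokens))  # each document counts once per term
    kept = [t for t in term_count if doc_freq[t] >= min_df]
    return sorted(kept, key=lambda t: (-term_count[t], t))


def vectorize(tokens, vocabulary, min_tf=1):
    """Sparse counts {term index: count}, dropping terms seen fewer than min_tf times."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(t for t in tokens if t in index)
    return {index[t]: c for t, c in counts.items() if c >= min_tf}


# Toy documents, for illustration only
docs = [
    ["spark", "spark", "data"],
    ["spark", "data", "data", "kafka"],
    ["spark", "ml"],
]
vocabulary = fit_vocabulary(docs, min_df=2)
vectors = [vectorize(tokens, vocabulary, min_tf=2) for tokens in docs]
print(vocabulary, vectors)
```

With thresholds as high as those requested above, a short document can end up with an empty vector, as the third toy document does here.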

In [40]:
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="tokens", outputCol="features", minTF=5.0, minDF=15.0)

##### Fit the model on the `df_tokens` DataFrame, and then add a new column with the features

In [41]:
cv_model = cv.fit(df_tokens)

In [42]:
df_features = cv_model.transform(df_tokens)

In [43]:
df_features.show(5)

+--------------------+-------------+--------------------+--------------------+--------------------+
|               title|     category|             content|              tokens|            features|
+--------------------+-------------+--------------------+--------------------+--------------------+
|2014-01-08-crafts...|        Craft|  Pour coder tous...|[coder, jours, pl...|(1360,[0,1,2,4,5,...|
|2014-01-10-androi...|AndroidMobile|  En tant que dév...|[tant, développeu...|(1360,[0,1,2,4,5,...|
|2014-01-13-separe...|        Craft|  Lors de nos dév...|[lors, développem...|(1360,[0,6,11,12,...|
|2014-01-14-xebia-...|       Divers|  Xebia aura le p...|[xebia, plaisir, ...|(1360,[48,99],[12...|
|2014-01-15-crafts...|        Craft|  La pratique des...|[pratique, tests,...|(1360,[0,4,9,13,1...|
+--------------------+-------------+--------------------+--------------------+--------------------+
only showing top 5 rows



In order to train an LDA model, we need to provide an RDD in which each element pairs a document index with a sparse vector.
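
The shape of that input can be pictured with plain Python lists: each document vector is paired with a running index, index first. In the sketch below, `with_index` and the dict-based sparse rows are assumptions standing in for the RDD and Spark's sparse vectors.

```python
def with_index(vectors):
    """Pair each vector with its position, index first: [index, vector]."""
    return [[i, v] for i, v in enumerate(vectors)]


# Sparse rows written as {term index: count}, for illustration only
sparse_rows = [{0: 2.0, 3: 1.0}, {1: 5.0}]
indexed = with_index(sparse_rows)
print(indexed)
```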

In [59]:
from pyspark.mllib.linalg import Vectors

In [79]:
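rdd_features = (df_features
                .filter(df_features.category == "Data")
                .select("features")
                .rdd  # explicit conversion from DataFrame to RDD
                .map(lambda row: row.features)
                .zipWithIndex()
                # swap to [index, vector]; avoids tuple-unpacking lambdas, which Python 3 rejects
                .map(lambda pair: [pair[1], pair[0]])).cache()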

##### Instantiate and train a new [LDA](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.LDA) model with the following settings
- k: 3
- maxIterations: 5

In [81]:
from pyspark.mllib.clustering import LDA

ldaModel = LDA.train(rdd_features, k=3, maxIterations=5)

> You can change the parameters of the model, but be careful: training can take a very long time

##### Use the `describeTopics` method of your ldaModel and the `vocabulary` attribute of your cv_model to display the 10 most representative words of each topic
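
The lookup itself is a simple index-to-word mapping. The sketch below assumes each topic description is a pair `(term indices, term weights)` sorted by decreasing weight. The helper `top_words`, the small vocabulary and the weights are all made up for illustration.

```python
def top_words(described_topics, vocabulary, num_words=10):
    """Translate each topic's leading term indices into words."""
    return [[vocabulary[i] for i in term_indices[:num_words]]
            for term_indices, _weights in described_topics]


# Made-up vocabulary and topic descriptions, for illustration only
vocabulary = ["spark", "données", "plus", "docker"]
described = [
    ([0, 1, 2, 3], [0.4, 0.3, 0.2, 0.1]),
    ([3, 1, 0, 2], [0.5, 0.2, 0.2, 0.1]),
]
topics = top_words(described, vocabulary, num_words=2)
print(topics)
```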

In [106]:
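# Describe the topics once, then map term indices back to words
topics = ldaModel.describeTopics()
for topic in range(3):
    # print() with a single argument works in both Python 2 and Python 3
    print("TOPIC " + str(topic))
    print("-------")
    for i in topics[topic][0][:10]:
        print(cv_model.vocabulary[i])
    print("\n")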

TOPIC 0
-------
spark
données
plus
algorithmes
data
machine
import
learning
permet
the


TOPIC 1
-------
données
plus
import
the
data
d’un
map
docker
cluster
c’est


TOPIC 2
-------
spark
données
plus
algorithmes
data
true
learning
import
apache
machine


