# Concatenate features
In the NLP introduction notebook we built a bag of words from unigrams, bigrams and trigrams in a simple way: we just changed **ngram_range** in CountVectorizer. In Spark this is not so easy. 

For this task we will use [VectorAssembler](https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler). 

**Example**
You have a table:

id | hour | mobile | userFeatures     | clicked
----|------|--------|------------------|---------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0
 
The table has several feature columns. As you remember, a Spark estimator accepts only one features column, so we need to concatenate the three feature columns (hour, mobile and userFeatures) into a single vector column. We need to get:
 
 id | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]
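
To make the concatenation concrete, here is a minimal plain-Python sketch of what happens to one row of the table above. It is not Spark code: the helper `assemble` and the dictionary `row` are hypothetical names used only for illustration. The Spark call it mirrors is shown in the comment.

```python
# Plain-Python illustration of the concatenation for a single row.
# In Spark the equivalent step would be:
#   VectorAssembler(inputCols=["hour", "mobile", "userFeatures"], outputCol="features")
# `assemble` and `row` are hypothetical names, not part of any Spark API.

def assemble(row, input_cols):
    """Flatten the selected columns of one row into a single list of floats."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector-like columns contribute all of their elements, in order.
            features.extend(float(v) for v in value)
        else:
            # Scalar columns contribute a single element.
            features.append(float(value))
    return features

row = {"id": 0, "hour": 18, "mobile": 1.0, "userFeatures": [0.0, 10.0, 0.5], "clicked": 1.0}
features = assemble(row, ["hour", "mobile", "userFeatures"])
print(features)  # [18.0, 1.0, 0.0, 10.0, 0.5]
```

Only the columns listed in `input_cols` end up in the vector; `id` and `clicked` stay as separate columns, as in the table.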
 
 
Now we will create a new pipeline with n-gram features: bigrams and trigrams.
 
As you have probably read, VectorAssembler does not work with strings, so a simple pipeline that only generates n-grams won't work. But we can train a separate CountVectorizer for unigrams, bigrams and trigrams and then concatenate the resulting count vectors.
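
The idea can be sketched in plain Python before building it in Spark. This is only an illustration under assumptions: `ngrams` and `count_vector` are hypothetical helpers, and the vocabulary is sorted alphabetically for readability, whereas Spark's CountVectorizer builds its own vocabulary ordering when it is fitted.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return space-joined n-grams of consecutive tokens (empty if len(tokens) < n)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_vector(terms, vocabulary):
    """Count how often each vocabulary entry occurs in terms."""
    counts = Counter(terms)
    return [float(counts[term]) for term in vocabulary]

tokens = ["to", "be", "or", "not", "to", "be"]

# Strings cannot be concatenated into a numeric vector directly,
# so each n-gram list is first turned into a vector of counts.
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Assumption: alphabetical vocabulary, chosen only to keep the example readable.
unigram_vocab = sorted(set(unigrams))  # ['be', 'not', 'or', 'to']
bigram_vocab = sorted(set(bigrams))    # ['be or', 'not to', 'or not', 'to be']

# Concatenating the two count vectors plays the role of VectorAssembler.
combined = count_vector(unigrams, unigram_vocab) + count_vector(bigrams, bigram_vocab)
print(combined)  # [2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 2.0]
```

The first four entries count unigrams and the last four count bigrams, which is the layout the Spark pipeline produces with one count column per n.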

In [None]:
#TODO - read data / lowercase data / import necessary libraries / clean labels

In [None]:
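from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, NGram, VectorAssembler

def ngrams_count_vectorizer(inputCol="tokens", outputCol="features", ngram_range=(1, 1)):
    """Build a Pipeline that counts n-grams for every n in ngram_range and concatenates the counts."""
    # Values of n to generate; both ends of ngram_range are inclusive
    sizes = range(ngram_range[0], ngram_range[1] + 1)

    # One NGram transformer per n: tokens -> "<n>_grams"
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in sizes
    ]

    # One CountVectorizer per n: "<n>_grams" -> "<n>_counts"
    vectorizers = [
        CountVectorizer(inputCol="{0}_grams".format(i),
            outputCol="{0}_counts".format(i))
        for i in sizes
    ]

    # Concatenate all count vectors into a single features column
    assembler = [VectorAssembler(
        inputCols=["{0}_counts".format(i) for i in sizes],
        outputCol=outputCol
    )]

    return Pipeline(stages=ngrams + vectorizers + assembler)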

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import StopWordsRemover, Tokenizer

# Tokenize the lowercased text and drop stop words
tokenizer = Tokenizer(inputCol="lower_sentence", outputCol="words_tokenizer_pipeline")
remover = StopWordsRemover(inputCol="words_tokenizer_pipeline", outputCol="filtered_pipeline")
# (1, 2) produces unigram and bigram counts, concatenated into "features_pipeline"
ngrams = ngrams_count_vectorizer("filtered_pipeline", "features_pipeline", (1, 2))
nb = NaiveBayes(modelType="multinomial", featuresCol="features_pipeline", labelCol="indexed")
pipeline = Pipeline(stages=[tokenizer, remover, ngrams, nb])

In [None]:
#TODO fit the model - show the accuracy