![](img/SparkML.png)

# Questions
If you have any questions about NLP or Machine Learning, please ask me:
1. Face to face
2. [On Piazza](http://piazza.com/nlprafalpronko/spring2018/nlp101) access code **nlp101**
3. [On LinkedIn](https://www.linkedin.com/in/rafalpronko/)

### How do we assign an item on Amazon ([English](http://amazon.com) site) to the proper root category?

![](img/BrowseNodes.png)

Amazon browse nodes (categories) can be found at http://www.findbrowsenodes.com/. Please look at the first few categories and their children - you can see that the tree is quite big.

Example of category path: 'Clothing, Shoes & Jewelry'->'Novelty, Costumes & More'->'Costumes & Accessories'->'More Accessories'->'Kids & Baby'

To simplify our job, we will try to assign products (items) only to a root category.

What's more, we want to use only the **title** to achieve our goal.

### How do we want to achieve this goal?

1. Scraping data from Amazon
2. Analyzing collected data
3. Building a model

The initial data is provided by an external company that specializes in scraping sites.

So it's time to start!

Next steps:
1. Choose only the important data
2. Clean / normalize the text
3. Build the classifier


# NLP: moving from Sklearn to SparkML



Do you remember what we did in the morning? Here is a small reminder:
1. Read data to DataFrame
2. Clear data / remove nulls / get only two columns
3. Get root category from categories
4. Split data to train test set
5. Clear text data / remove stop words / lower case / normalize / build bag of words
6. Labeled categories
7. Build model / train model
8. Evaluate model

Now we will do the same, but in Spark.

First of all, we need to start Spark by creating a session.

In [None]:
# because a SparkContext is already loaded, getOrCreate() reuses it for the Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Now we can read the data into a DataFrame.

In [None]:
df_data = spark.read\
    .option("header", "true")\
    .option("mode", "DROPMALFORMED")\
    .option("delimiter", '^')\
    .csv('small_data.csv') 

In [None]:
import pyspark

In [None]:
df_data.persist(pyspark.StorageLevel.MEMORY_AND_DISK)  # if Spark does not have enough memory, this storage level spills to disk

In [None]:
#TODO - show first 5 elements from DataFrame

Now we can show how many rows we loaded into our DataFrame. In Pandas we did it using `.shape`; in Spark we will do it using `count()`.

**Question**
Do you remember what `count()` does on a Pandas DataFrame?

In [None]:
df_data.count()

In Pandas we had 240K rows. Do you know why we have a lower number here?

Do you remember `option("mode", "DROPMALFORMED")`? If Spark cannot parse a row properly, it simply drops that row (for example, a text field contains a special character that Spark cannot escape properly).
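To make the effect of dropping malformed rows concrete, here is a plain-Python sketch. It is only a rough analogy and not Spark's actual parser: it keeps the rows whose number of fields matches the header. The helper `drop_malformed` and the sample lines are made up for illustration.

```python
def drop_malformed(lines, delimiter="^"):
    """Keep only rows whose field count matches the header (rough analogy to DROPMALFORMED)."""
    header = lines[0].split(delimiter)
    rows = [line.split(delimiter) for line in lines[1:]]
    return header, [row for row in rows if len(row) == len(header)]


# Hypothetical sample data: the second record has an extra delimiter inside the title.
sample_lines = [
    "title^categories",
    "Red shoes^[['Clothing']]",
    "Broken ^ title^[['Toys']]",
    "Blue hat^[['Clothing']]",
]

header, good_rows = drop_malformed(sample_lines)
print(len(sample_lines) - 1, "rows read,", len(good_rows), "rows kept")
```

This is why a row count taken after loading can be lower than the number of lines in the file.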

How can we see the number of NaN / null values per column?

In [None]:
df_data.filter("title IS NULL").count()  # count the rows where title is null

In [None]:
#TODO show null for salesRank

As we can see, we have 9 null values in title. Above we showed the number of null / NaN values for a single column only; below we will see how to do it for every column.

In [None]:
from pyspark.sql.functions import isnan, when, count, col

df_data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_data.columns]).show()

We used both isnan and isNull because a missing value can show up as either of these two types: NaN or null.
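The distinction is easy to see in plain Python, where `None` plays the role of Spark's null and `float("nan")` is a real floating-point value. A minimal sketch with made-up values (the helper `is_missing` is only for illustration):

```python
import math

values = [1.5, None, float("nan"), 3.0]


def is_missing(value):
    """True for None (the analogue of null) and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


null_count = sum(1 for v in values if v is None)
nan_count = sum(1 for v in values if isinstance(v, float) and math.isnan(v))
missing_count = sum(1 for v in values if is_missing(v))
print(null_count, nan_count, missing_count)
```

Checking only one of the two conditions would miss half of the missing values in this list.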

### Done / ToDo
1. ~~Read data to DataFrame~~
2. Clear data / remove nulls / get only two columns
3. Get root category from categories
4. Split data to train test set
5. Clear text data / remove stop words / lower case / normalize / build bag of words
6. Labeled categories
7. Build model / train model
8. Evaluate model

How do we remove null values?

In [None]:
df_data = df_data.dropna(subset=['title'])

In [None]:
df_data.count()

Did we remove only 9 elements? 

Now we need only two columns from the DataFrame.

In [None]:
df_data = df_data.select('categories', 'title')

In [None]:
#TODO show 5 first elements

### Done / ToDo
1. ~~Read data to DataFrame~~
2. ~~Clear data / remove nulls / get only two columns~~
3. Get root category from categories
4. Split data to train test set
5. Clear text data / remove stop words / lower case / normalize / build bag of words
6. Labeled categories
7. Build model / train model
8. Evaluate model

Now we need to extract the root category - we can do it in the same way as in the first notebook, using `eval(str(x))[0][0]`.
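A note on `eval`: it executes whatever the string contains, which is risky on scraped data. As a hedged alternative, the sketch below uses `ast.literal_eval` from the standard library, which only accepts Python literals. The function name `root_category` and the example string are assumptions made for illustration; the string has the same shape as the category path shown earlier.

```python
import ast


def root_category(raw_categories):
    """Parse a string such as "[['A', 'B'], ['C']]" and return the first element of the first path."""
    parsed = ast.literal_eval(str(raw_categories))
    return parsed[0][0]


# Hypothetical value in the same shape as a scraped categories field
example = "[['Clothing, Shoes & Jewelry', 'Novelty, Costumes & More', 'Costumes & Accessories']]"
print(root_category(example))
```

The same function body could be wrapped in a UDF in exactly the way the lambda is wrapped in this notebook.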

First we need to [define a user-defined function (UDF)](https://gist.github.com/zoltanctoth/2deccd69e3d1cde1dd78).

In [None]:
from pyspark.sql.functions import udf  # import the user-defined function (UDF) helper
choose_only_first_root_category = udf(lambda x: eval(str(x))[0][0])  # take the first element of the first category path

Now we can create a new DataFrame with an additional column: "root_category". For this we will use the [withColumn](https://docs.databricks.com/spark/latest/sparkr/functions/withColumn.html) function.

In [None]:
df_data = df_data.withColumn('root_category', choose_only_first_root_category(df_data.categories))

In [None]:
#TODO show the result

In [None]:
#TODO create a DataFrame (name: df_data) with only two columns: title and root_category

### Done / ToDo
1. ~~Read data to DataFrame~~
2. ~~Clear data / remove nulls / get only two columns~~
3. ~~Get root category from categories~~
4. Split data to train test set
5. Clear text data / remove stop words / lower case / normalize / build bag of words
6. Labeled categories
7. Build model / train model
8. Evaluate model

Before we split the data into train and test sets, we should check how many rows we have per category. We can do it by using [groupBy](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy).

In [None]:
df_data.groupBy('root_category').count().show(100, truncate=False) #truncate=False to show full name of category 

As you can see, a few categories have fewer than 1000 items, so we can remove those categories from our DataFrame.

Why? Because we are almost sure that we cannot predict those categories correctly (the set of items is too small to learn from).
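The idea of counting per category and keeping only the frequent ones can be sketched in plain Python. This is an illustration on made-up toy rows, not the Spark solution; the helper `keep_frequent` is hypothetical and the threshold is 2 instead of 1000 to keep the example small.

```python
from collections import Counter


def keep_frequent(rows, min_count):
    """Keep (title, root_category) rows whose category occurs at least min_count times."""
    counts = Counter(category for _, category in rows)
    return [(title, category) for title, category in rows if counts[category] >= min_count]


# Hypothetical toy rows
rows = [
    ("red shoes", "Clothing"),
    ("blue hat", "Clothing"),
    ("car gps", "GPS & Navigation"),
    ("green scarf", "Clothing"),
]
filtered_rows = keep_frequent(rows, min_count=2)
print(filtered_rows)
```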

We can do it with the `filter` function.

In [None]:
df_data = df_data.filter('root_category != "GPS & Navigation"')

In [None]:
#TODO remove all rows from categories with fewer than 1000 items - additionally remove the categories "[" and empty
# these categories come from errors while reading the data

In PySpark, to split the data into two groups (train / test) we can use the built-in function randomSplit.
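To see what a seeded random split does, here is a plain-Python sketch under the assumption that each row is assigned independently, with probability proportional to the weights. It is an illustration of the idea, not Spark's implementation; `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point rounding at the upper edge
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=12345)
print(len(train_rows), len(test_rows))
```

Two things are worth noticing: the split sizes are only approximately 80% and 20%, and the same seed always gives the same split.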

In [None]:
train, test = df_data.randomSplit([0.8, 0.2], seed=12345)

In [None]:
#TODO show how many rows we have in test and how many rows we have in train

In [None]:
test.count()

### Done / ToDo
1. ~~Read data to DataFrame~~
2. ~~Clear data / remove nulls / get only two columns~~
3. ~~Get root category from categories~~
4. ~~Split data to train test set~~
5. Clear text data / remove stop words / lower case / normalize / build bag of words
6. Labeled categories
7. Build model / train model
8. Evaluate model

Now it is time to clean data.

## Lower

As you remember, the first step in data cleaning was lowercasing. In Spark we need to import two functions:
1. [lower](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.lower) - converts text to lower case
2. [col](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.col) - returns the column with the given name

In [None]:
from pyspark.sql.functions import col, lower

In [None]:
train_clean = train.withColumn('lower_sentence', lower(col('title')))

In [None]:
#TODO show the result

## Tokenization

[Simple tokenizer](https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.Tokenizer) - similar to Python's `split()` method: it splits the text on whitespace.

[Regex tokenizer](https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer) - tokenization based on a regular expression.
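The two approaches can be compared in plain Python before trying them in Spark. This is a sketch on a made-up title; the pattern `\W+` is just one possible choice of regular expression, not a default taken from Spark.

```python
import re

title = "Kids' Costume-Set, 3 pcs (Red/Blue)"

# Whitespace tokenization: lower-case the text, then split on runs of whitespace.
whitespace_tokens = title.lower().split()

# Regex tokenization: lower-case the text, then split on runs of non-word characters.
regex_tokens = [token for token in re.split(r"\W+", title.lower()) if token]

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, punctuation stays glued to the words (`costume-set,`), so the same word can end up as several different tokens.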

In [None]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer

In [None]:
tokenizer = Tokenizer(inputCol="lower_sentence", outputCol="words_tokenizer")

In [None]:
train_clean = tokenizer.transform(train_clean)

In [None]:
#TODO show the result

In [None]:
#TODO create tokenized word using regex tokenizer

**Question**

What is the difference between simple tokenization and regex tokenization?

## Stop words removal

https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover
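As a reminder of what stop-word removal does, here is a minimal plain-Python sketch. The stop-word set below is a tiny made-up list for illustration only, and `remove_stop_words` is a hypothetical helper.

```python
# Hypothetical, tiny stop-word list used only for this illustration
STOP_WORDS = {"the", "a", "for", "and", "of", "with"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set, preserving order."""
    return [token for token in tokens if token not in stop_words]


tokens = ["costume", "for", "kids", "with", "a", "hat"]
filtered = remove_stop_words(tokens)
print(filtered)
```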

In [None]:
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol="words_tokenizer", outputCol="filtered")
train_clean = remover.transform(train_clean)

In [None]:
#TODO show the result

## CountVectorizer - Create a bag-of-words model

https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer

In [None]:
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="filtered", outputCol="features")
cv_model = cv.fit(train_clean)  # learn the vocabulary from the training set only
train_clean = cv_model.transform(train_clean)

In [None]:
#TODO show the result

As you remember, in the first notebook we also did lemmatization. Spark ML does not provide it, so for now we need to skip this step.

**Exercise**
Do the same cleaning for the test set. Remember that every model has already been fit on the training set, so on the test set you only need to call transform.
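Why only transform? The sketch below shows the fit / transform idea for a bag-of-words model in plain Python, on made-up documents. The helpers `fit_vocabulary` and `transform` are hypothetical and only mimic the concept; the ordering of the vocabulary is my own choice.

```python
from collections import Counter


def fit_vocabulary(token_lists):
    """Build a token -> index mapping from the training documents only."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Most frequent tokens get the lowest indices; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}


def transform(tokens, vocabulary):
    """Turn one document into a count vector; tokens unseen during fit are ignored."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


train_docs = [["red", "shoes"], ["red", "hat"]]
test_docs = [["red", "red", "scarf"]]

vocabulary = fit_vocabulary(train_docs)  # fit: training data only
train_vectors = [transform(doc, vocabulary) for doc in train_docs]
test_vectors = [transform(doc, vocabulary) for doc in test_docs]  # transform only
print(vocabulary)
print(test_vectors)
```

The word "scarf" never appeared in the training data, so it has no column. Fitting again on the test set would produce a different vocabulary, and the feature vectors would no longer line up with what the classifier learned.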

Example for the first two steps:

In [None]:
test_clean = test.withColumn('lower_sentence', lower(col('title')))
test_clean = tokenizer.transform(test_clean)

In [None]:
#TODO show 5 results

### Done / ToDo
1. ~~Read data to DataFrame~~
2. ~~Clear data / remove nulls / get only two columns~~
3. ~~Get root category from categories~~
4. ~~Split data to train test set~~
5. ~~Clear text data / remove stop words / lower case / normalize / build bag of words~~
6. Labeled categories
7. Build model / train model
8. Evaluate model

## String indexer

As you remember, in the first model with sklearn we had to convert the string categories to indices (labels) - in Spark we have to do the same.

In Spark we have [StringIndexer](https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer) for this.
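The idea behind label indexing can be sketched in plain Python: the most frequent label gets index 0, the next one index 1, and so on. The helpers below are hypothetical, the labels are made up, and the alphabetical tie-breaking is my own assumption.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


def index_label(label, mapping):
    """Look up a label; an unseen label raises, similar in spirit to handleInvalid='error'."""
    if label not in mapping:
        raise ValueError("Unseen label: " + label)
    return mapping[label]


train_labels = ["Clothing", "Toys", "Clothing", "Books", "Clothing", "Toys"]
label_index = fit_label_index(train_labels)
print(label_index)
print(index_label("Toys", label_index))
```

As with the vocabulary, the mapping is learned once on the training set and then reused unchanged on the test set.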

In [None]:
from pyspark.ml.feature import StringIndexer
stringIndexer = StringIndexer(inputCol="root_category", outputCol="indexed", handleInvalid='error')

In [None]:
stringIndexer_model = stringIndexer.fit(train_clean)

In [None]:
train_clean = stringIndexer_model.transform(train_clean)

In [None]:
#TODO show train_clean - only 5 elements

In [None]:
#TODO do the same for test_clean - remember that the StringIndexer is already trained, so use only transform

### Done / ToDo
1. ~~Read data to DataFrame~~
2. ~~Clear data / remove nulls / get only two columns~~
3. ~~Get root category from categories~~
4. ~~Split data to train test set~~
5. ~~Clear text data / remove stop words / lower case / normalize / build bag of words~~
6. ~~Labeled categories~~
7. Build model / train model
8. Evaluate model

We will use our favourite classifier: Naive Bayes.
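As a reminder of what the multinomial model computes, here is a minimal plain-Python sketch on made-up count vectors. It assumes additive (Laplace) smoothing with a value of 1.0 and is meant as an illustration of the technique, not as Spark's implementation; all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(vectors, labels, smoothing=1.0):
    """Estimate log priors and smoothed log word likelihoods from count vectors."""
    n_features = len(vectors[0])
    label_counts = Counter(labels)
    feature_totals = defaultdict(lambda: [0] * n_features)
    for vector, label in zip(vectors, labels):
        for index, count in enumerate(vector):
            feature_totals[label][index] += count
    model = {}
    for label, n_docs in label_counts.items():
        totals = feature_totals[label]
        denominator = sum(totals) + smoothing * n_features
        log_prior = math.log(n_docs / len(labels))
        log_likelihoods = [math.log((t + smoothing) / denominator) for t in totals]
        model[label] = (log_prior, log_likelihoods)
    return model


def predict(model, vector):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_likelihoods = model[label]
        return log_prior + sum(c * ll for c, ll in zip(vector, log_likelihoods))
    return max(model, key=score)


# Hypothetical toy data; vocabulary order: ["red", "shoes", "lego", "bricks"]
train_vectors = [[1, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1]]
train_labels = [0.0, 0.0, 1.0, 1.0]
nb_model = train_naive_bayes(train_vectors, train_labels)
print(predict(nb_model, [1, 0, 0, 0]), predict(nb_model, [0, 0, 0, 3]))
```

The inputs have the same shape as in this notebook: a count vector per title and a numeric label index per category.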

In [None]:
from pyspark.ml.classification import NaiveBayes # import Naive Bayes

In [None]:
nb = NaiveBayes(modelType="multinomial", featuresCol="features", labelCol="indexed")  # declare the model

In [None]:
model = nb.fit(train_clean) # train the model

### Done / ToDo
1. ~~Read data to DataFrame~~
2. ~~Clear data / remove nulls / get only two columns~~
3. ~~Get root category from categories~~
4. ~~Split data to train test set~~
5. ~~Clear text data / remove stop words / lower case / normalize / build bag of words~~
6. ~~Labeled categories~~
7. ~~Build model / train model~~
8. Evaluate model

First we need to predict values for the test set.

In [None]:
prediction = model.transform(test_clean)

In [None]:
#TODO show 5 rows from prediction

Now we need to import the model [evaluator](http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator).
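The metric we will ask for is accuracy: the fraction of rows where the predicted index equals the true index. A plain-Python sketch with made-up labels and predictions (the helper `accuracy_score` is defined here and is not imported from any library):

```python
def accuracy_score(labels, predictions):
    """Fraction of rows where the predicted index equals the true index."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for label, predicted in zip(labels, predictions) if label == predicted)
    return correct / len(labels)


# Hypothetical label and prediction indices
true_labels = [0.0, 1.0, 2.0, 1.0, 0.0]
predicted_labels = [0.0, 1.0, 1.0, 1.0, 2.0]
print(accuracy_score(true_labels, predicted_labels))
```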

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol="indexed", predictionCol="prediction",
                                              metricName="accuracy")

In [None]:
accuracy = evaluator.evaluate(prediction)

In [None]:
print(accuracy)

### Done / ToDo
1. ~~Read data to DataFrame~~
2. ~~Clear data / remove nulls / get only two columns~~
3. ~~Get root category from categories~~
4. ~~Split data to train test set~~
5. ~~Clear text data / remove stop words / lower case / normalize / build bag of words~~
6. ~~Labeled categories~~
7. ~~Build model / train model~~
8. ~~Evaluate model~~

OK, now you can create a simple model with a simple count vectorizer that splits the text into single words (unigrams).

As you remember, in the first notebook we briefly mentioned n-grams, which should improve our model. In Sklearn it was really simple - we just needed to change `ngram_range=(1,1)` to, for example, `ngram_range=(1,2)`.

In Spark it is more complicated. But now we will learn how to do it.

## Ngrams model

A simple explanation of what n-grams are:

![](img/ngrams.png)
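The same idea in plain Python, as a small sketch on made-up tokens: an n-gram is a run of n consecutive tokens. The helper `ngrams` is hypothetical; joining the tokens with a single space is a choice made for readability.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["kids", "pirate", "costume", "hat"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(bigrams)

# sklearn's ngram_range=(1, 2) corresponds to unigrams and bigrams together
unigrams_and_bigrams = unigrams + bigrams
print(unigrams_and_bigrams)
```

Note that asking for n=2 gives bigrams only. To get the equivalent of a range such as (1, 2), the unigrams and bigrams have to be combined, which is why this takes more work in Spark than in Sklearn.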

In [None]:
from pyspark.ml.feature import NGram # import ngram model from spark ml

In [None]:
ngram = NGram(n=2, inputCol='filtered', outputCol='ngrams')

In [None]:
model_ngrams = ngram.transform(train_clean)

In [None]:
model_ngrams.select('ngrams').show(5, truncate=False)  # truncate=False lets us show the full text

In [None]:
#TODO - create new ngrams column with ngram number = 3