### Multiclass classification of news into categories using Multinomial Naive Bayes

This classifier treats each occurrence of a word as an event and assumes that, given the category, the word counts follow a multinomial distribution. This reference, https://syncedreview.com/2017/07/17/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation/, is a good refresher on how Multinomial NB works for text classification problems like the present use case.
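As a quick reminder of the decision rule, here is a sketch in notation assumed purely for illustration (it is not taken from the linked article): $d$ is a headline, $c$ a category, $V$ the vocabulary, and $n_w(d)$ the number of times word $w$ occurs in $d$. The classifier picks the category with the highest log-posterior score:

$$
\hat{c}(d) = \arg\max_{c} \left[ \log P(c) + \sum_{w \in V} n_w(d) \, \log P(w \mid c) \right]
$$

Working in log space avoids the numerical underflow that comes from multiplying many small probabilities.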

Laplace smoothing assigns a small non-zero probability to tokens in a news headline that were never seen with a given category during training. Without it, a single such token would squash the probability of the whole headline under that category to zero!
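The effect can be seen on a toy example. The sketch below is plain Python rather than Spark code, and the headlines, the vocabulary and the helper names (`word_probs`, `headline_prob`) are made up for illustration. It estimates P(word | category) with additive smoothing `alpha`, where `alpha=1.0` plays the role of `smoothing=1.0` passed to NaiveBayes later in this notebook.

```python
from collections import Counter


def word_probs(docs, vocabulary, alpha=1.0):
    """Estimate P(word | category) with additive (Laplace) smoothing."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts[w] for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}


def headline_prob(tokens, probs):
    """Multiply per-word probabilities (the naive independence assumption)."""
    p = 1.0
    for token in tokens:
        p *= probs[token]
    return p


# Toy, already tokenized "business" headlines (made-up data).
business_docs = [["stocks", "rally"], ["stocks", "fall"]]
vocabulary = ["stocks", "rally", "fall", "vaccine"]

unsmoothed = word_probs(business_docs, vocabulary, alpha=0.0)
smoothed = word_probs(business_docs, vocabulary, alpha=1.0)

# "vaccine" never occurs in the business headlines.
new_headline = ["stocks", "vaccine"]
print(headline_prob(new_headline, unsmoothed))  # one unseen token wipes out the product
print(headline_prob(new_headline, smoothed))    # stays small but non-zero
```

With `alpha=1.0` every vocabulary word gets one extra pseudo-count, so the smoothed probabilities still sum to one over the vocabulary.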

Pipeline stages: -
* Term frequency: Use either CountVectorizer or HashingTF
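  * [HashingTF] The transformer hashes each token and takes the result modulo 'numFeatures' to get the token's index. The index is not guaranteed to be unique, so keep 'numFeatures' reasonably large to avoid collisions that would give an erroneous TF for some important token. The default 'numFeatures' is 2^18 (262,144).
* Inverse Document Frequency (IDF): It is fitted on the TF vectors and rescales them so that terms appearing in many documents get less weight.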
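The two stages above can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for the hash function Spark uses, the helper names `hashing_tf` and `idf_weights` are hypothetical, and the headlines are made up. The IDF formula, log((m + 1) / (df + 1)) for m documents, follows the one given in the Spark MLlib feature-extraction guide.

```python
import math
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Count tokens per index, where index = hash(token) % num_features."""
    counts = Counter()
    for token in tokens:
        counts[zlib.crc32(token.encode("utf-8")) % num_features] += 1
    return dict(counts)


def idf_weights(tf_vectors, min_doc_freq=0):
    """IDF per index: log((m + 1) / (df + 1)), or 0 if df < min_doc_freq."""
    num_docs = len(tf_vectors)
    doc_freq = Counter(index for vec in tf_vectors for index in vec)
    return {
        index: math.log((num_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for index, df in doc_freq.items()
    }


# Toy, already tokenized headlines (made-up data).
headlines = [
    ["fed", "raises", "rates"],
    ["fed", "holds", "rates"],
    ["fed", "signals", "cut"],
]

num_features = 2 ** 18
tf_vectors = [hashing_tf(tokens, num_features) for tokens in headlines]
idf = idf_weights(tf_vectors)
tfidf = [{i: tf * idf[i] for i, tf in vec.items()} for vec in tf_vectors]

# "fed" occurs in every headline, so its IDF (and TF-IDF) is zero.
fed_index = zlib.crc32(b"fed") % num_features
print(idf[fed_index])

# With only 2 slots, 3 distinct tokens must share indices (a collision).
crowded = hashing_tf(["fed", "raises", "rates"], num_features=2)
print(crowded)
```

The `crowded` example shows why a small 'numFeatures' is risky: colliding tokens are merged into a single count and the downstream stages cannot tell them apart.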

Important Notes: -
* The UCI News Aggregator dataset has been taken from https://www.kaggle.com/uciml/news-aggregator-dataset. It contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. Upload the CSV file to the Databricks FileStore through "databricks >> Import & Explore Data".
* This notebook has been published at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1044241664895185/3279643061129108/3281307886843282/latest.html.
* This solution has been tried on Databricks Community!

In [2]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, StringIndexer, RegexTokenizer, StopWordsRemover, HashingTF, IDF, IndexToString
from pyspark.sql.functions import col, udf, regexp_replace, isnull, countDistinct
from pyspark.sql.types import StringType, IntegerType
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [3]:
# Read the uploaded CSV; the first row holds column names and column types are inferred
news = spark.read.csv("/FileStore/tables/uci_news_aggregator-e0002.csv", header=True, inferSchema=True)
#news.count()

In [4]:
news = news.select("TITLE", "CATEGORY")
news = news.dropna()
#news.count()

In [5]:
#news.show(truncate=False)

In [6]:
#news.agg(*(countDistinct(col(c)).alias(c) for c in news.columns)).show()

In [7]:
# The CATEGORY column only needs to be mapped to a numeric label index, which StringIndexer does

indexer = StringIndexer(inputCol="CATEGORY", outputCol="label")
news = indexer.fit(news).transform(news)

In [8]:
news_train, news_test = news.randomSplit([0.8, 0.2], seed=12345)
#print(news_train.count(), news_test.count())

In [9]:
# NaiveBayes labelCol default is 'label' & featuresCol default is 'features', so the stages below produce those column names

#categoryIndexer = StringIndexer(inputCol="CATEGORY", outputCol="label")

tokenizer = Tokenizer(inputCol="TITLE", outputCol="words")
wordsRemover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words_cleaned")

cv = CountVectorizer(inputCol=wordsRemover.getOutputCol(), outputCol="words_tf")
#hashingTF = HashingTF(inputCol=wordsRemover.getOutputCol(), outputCol="title_cleaned_hashed")

idf = IDF(minDocFreq=5, inputCol=cv.getOutputCol(), outputCol="features")

nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

In [10]:
# tokenizer & wordsRemover are of type Transformer (hashingTF would be one too)
# cv & idf are Estimators: fit() returns a model, which is then used as a Transformer
# (if IDF is used in standalone mode, it should be applied as below
#               idf_model = IDF().fit(tf)
#               tfidf = idf_model.transform(tf))
# nb is an Estimator; fitting it produces the NaiveBayes model

pipeline = Pipeline(stages=[tokenizer, wordsRemover, cv, idf, nb])

model = pipeline.fit(news_train)

In [11]:
nb_predictions = model.transform(news_test)

In [12]:
#nb_predictions.select("prediction", "label", "features").show(20)

In [13]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

In [14]:
nb_accuracy = evaluator.evaluate(nb_predictions)

In [15]:
print("Accuracy of NaiveBayes is = %g" % (nb_accuracy))
print("Test Error of NaiveBayes = %g " % (1.0 - nb_accuracy))

Misc notes: - 
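* [TODO] Chaining categoryIndexer as a pipeline stage to pre-process the CATEGORY column failed with "...Failed to execute user defined function($anonfun$9: (string) => double)...". A possible cause, not verified here, is that the test split contains a CATEGORY value the indexer never saw while fitting on the training split; StringIndexer's handleInvalid parameter controls how such values are handled.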