### Simple Text Processing and Classification with Apache Spark
---
The aim of this notebook is to practise basic text processing with Apache Spark, using the toxic comment text classification dataset. The machine learning and text processing used here are rudimentary. The main goal was to convert the `comment_text` column into a column of sparse vectors for use with a classification algorithm in the Spark `ml` library.

The `pyspark.ml` library is used for machine learning with Spark DataFrames. For machine learning with Spark RDDs, use the `pyspark.mllib` library.

In [None]:
import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

In [None]:
# Build a Spark session
hc = (SparkSession.builder
                  .appName('Toxic Comment Classification')
                  .enableHiveSupport()
                  .config("spark.executor.memory", "4G")
                  .config("spark.driver.memory","18G")
                  .config("spark.executor.cores","7")
                  .config("spark.python.worker.memory","4G")
                  .config("spark.driver.maxResultSize","0")
                  .config("spark.sql.crossJoin.enabled", "true")
                  .config("spark.serializer","org.apache.spark.serializer.KryoSerializer")
                  .config("spark.default.parallelism","2")
                  .getOrCreate())

In [None]:
hc.sparkContext.setLogLevel('INFO')

In [None]:
hc.version

In [None]:
train = hc.read.csv("../input/train.csv",
                    inferSchema=True, header=True,
                    quote='"', escape='"', multiLine=True, mode='FAILFAST')
test = hc.read.csv("../input/test.csv",
                   inferSchema=True, header=True,
                   quote='"', escape='"', multiLine=True, mode='FAILFAST')

In [None]:
out_cols = [i for i in train.columns if i not in ["id", "comment_text"]]

In [None]:


# Sadly the output is not as pretty as that of the pandas head() method
train.show(5)

In [None]:
# View some toxic comments
train.filter(F.col('toxic') == 1).show(5)

In [None]:
# Basic word tokenizer: lowercases the text and splits it on whitespace
tokenizer = Tokenizer(inputCol="comment_text", outputCol="words")
wordsData = tokenizer.transform(train)
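
To make the tokenization step concrete, here is a rough plain-Python approximation of what the `Tokenizer` does to a single comment. `simple_tokenize` is a hypothetical helper written for illustration, not part of Spark, and edge cases such as repeated whitespace may differ from Spark's behaviour.

```python
def simple_tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


tokens = simple_tokenize("You are WRONG about this, friend!")
print(tokens)
```

Note that punctuation stays attached to the words (for example `this,`), which is one reason the text processing here is only a starting point.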

In [None]:
# Hash each token to a feature index and count term frequencies per document
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
tf = hashingTF.transform(wordsData)

In [None]:
tf.select('rawFeatures').take(2)
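
The sparse vectors above come from the hashing trick: each token is hashed to a bucket index and the counts per bucket form the term-frequency vector. The sketch below illustrates the idea in plain Python. It is an assumption-laden illustration: `hashed_term_frequencies` is a hypothetical helper, and `zlib.crc32` is only a stand-in for Spark's MurmurHash3, so the indices will not match those Spark produces.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the default numFeatures of HashingTF


def hashed_term_frequencies(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index and count hits per bucket."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash for illustration; Spark uses MurmurHash3.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)


tf_example = hashed_term_frequencies(["spark", "is", "fast", "spark"])
print(tf_example)
```

Only the non-zero buckets are stored, which is why the result is naturally sparse. Two different tokens can land in the same bucket; a larger `num_features` makes such collisions less likely.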

In [None]:
# Build the idf model and transform the original token frequencies into their tf-idf counterparts
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(tf) 
tfidf = idfModel.transform(tf)

In [None]:
tfidf.select("features").first()
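
As a worked example of the weighting, Spark's `IDF` uses idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. The sketch below applies that formula to a toy corpus in plain Python. The helper name and the toy documents are made up for illustration, and it works on token strings rather than hashed indices.

```python
import math


def inverse_document_frequencies(docs):
    """Compute idf(t) = log((m + 1) / (df(t) + 1)) for every term."""
    num_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((num_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


docs = [["you", "are", "great"], ["you", "are", "awful"], ["great", "work"]]
idf_weights = inverse_document_frequencies(docs)

# tf-idf for the first document: term count times idf weight
first_doc_tfidf = {t: docs[0].count(t) * idf_weights[t] for t in set(docs[0])}
print(first_doc_tfidf)
```

Terms that appear in many documents, such as `you`, get a smaller weight than rarer terms such as `work`.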

Do a test first to practise with the `LogisticRegression` class. I like to create instances of objects first to check their methods and docstrings and figure out how to access data.

Build a logistic regression model for the binary toxic column.
Use the features column (the tfidf values) as the input vectors, `X`, and the toxic column as output vector, `y`.

In [None]:
REG = 0.1

In [None]:
lr = LogisticRegression(featuresCol="features", labelCol='toxic', regParam=REG)

In [None]:
tfidf.show(5)

In [None]:
# Fit on a 5,000-row subset only, since this is just a test run
lrModel = lr.fit(tfidf.limit(5000))

In [None]:
res_train = lrModel.transform(tfidf)

In [None]:
res_train.select("id", "toxic", "probability", "prediction").show(20)

In [None]:
res_train.show(5)

#### Select the probability column
---
Create a user-defined function (udf) to select the second element of each vector in the `probability` column, which is the predicted probability of the positive class.

In [None]:
extract_prob = F.udf(lambda x: float(x[1]), T.FloatType())

In [None]:
(res_train.withColumn("proba", extract_prob("probability"))
 .select("proba", "prediction")
 .show())
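
For a binary model, the `probability` column holds a two-element vector: index 0 is the probability of label 0 and index 1 is the probability of label 1. The plain-Python sketch below shows where those two numbers come from and why index 1 is the one to keep. `binary_probabilities` is a hypothetical helper, and the margin passed to it is an arbitrary example value rather than output from the fitted model.

```python
import math


def binary_probabilities(margin):
    """Turn a raw linear score into [P(label=0), P(label=1)]."""
    p1 = 1.0 / (1.0 + math.exp(-margin))  # logistic (sigmoid) function
    return [1.0 - p1, p1]


probability = binary_probabilities(0.0)
proba = float(probability[1])  # same indexing as the udf: keep P(label=1)
print(probability, proba)
```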

### Create the results DataFrame
---
Convert the test text using the same tokenizer, hashing transform and fitted IDF model as for the training data.

In [None]:
test_tokens = tokenizer.transform(test)
test_tf = hashingTF.transform(test_tokens)
test_tfidf = idfModel.transform(test_tf)

Initialize the new DataFrame with the id column

In [None]:
test_res = test.select('id')
test_res.head()

Make predictions for each class

In [None]:
test_probs = []
for col in out_cols:
    print(col)
    lr = LogisticRegression(featuresCol="features", labelCol=col, regParam=REG)
    print("...fitting")
    lrModel = lr.fit(tfidf)
    print("...predicting")
    res = lrModel.transform(test_tfidf)
    print("...appending result")
    test_res = test_res.join(res.select('id', 'probability'), on="id")
    print("...extracting probability")
    test_res = test_res.withColumn(col, extract_prob('probability')).drop("probability")
    test_res.show(5)

In [None]:
test_res.show(5)

In [None]:
test_res.write.csv('./results/spark_lr.csv', mode='overwrite', header=True)

The output is actually a directory, not a single CSV file. The directory contains one or more CSV part files, which together make up the full results. I used the `cat` command to concatenate these part files.

In [None]:
!cat results/spark_lr.csv/part*.csv > spark_lr.csv
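
One thing to watch when concatenating: with `header=True`, Spark writes a header row into each part file, so a plain `cat` over several part files repeats the header inside the combined file. A minimal sketch of a header-aware alternative in plain Python follows. `concat_csv_parts` is a hypothetical helper, and it assumes every part file shares the same header and that no field contains embedded newlines.

```python
import glob
import os


def concat_csv_parts(part_dir, out_path):
    """Concatenate part files, writing the header row only once."""
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part*.csv")))
    header_written = False
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in part_paths:
            with open(path, encoding="utf-8", newline="") as part:
                header = part.readline()
                if not header:
                    continue  # skip empty part files
                if not header_written:
                    out.write(header)
                    header_written = True
                out.writelines(part)  # remaining lines are data rows
    return len(part_paths)
```

For the paths used in this notebook, the call would be `concat_csv_parts("results/spark_lr.csv", "spark_lr.csv")`.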

In [None]:
ls

This submission scores 0.8797 on the public leaderboard.