# BBC text classification using the Spark NLP Universal Sentence Encoder and ClassifierDL annotators



The data can be downloaded from Kaggle:

https://www.kaggle.com/yufengdev/bbc-text-categorization?#Get-the-data


## Initialize Spark

In [1]:
# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

# Start Spark Session with Spark NLP
#spark = sparknlp.start()

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","8G")\
    .config("spark.memory.offHeap.enabled",True)\
    .config("spark.memory.offHeap.size","8G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.5")\
    .config("spark.kryoserializer.buffer.max", "1000M")\
    .getOrCreate()

## Read the Data

In [2]:

# File location and type
file_location = r'E:\Machine Learning\data\bbc-text.csv'
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","


df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)


df.count()

2225

In [3]:
df.show(truncate=False)

[Output truncated: the table lists the `category` and `text` columns, with the full article text on each row.]

As the output shows, each row holds a substantial amount of text. Let's apply ClassifierDL from Spark NLP together with the Universal Sentence Encoder and check the results.

## Split dataframe into train and test splits

In [4]:
(trainingData, testData) = df.randomSplit([0.7, 0.3], seed = 100)
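
`randomSplit` assigns each row to a split by an independent random draw, so the resulting sizes are close to 70/30 but generally not exact. The plain-Python sketch below illustrates the idea only; `bernoulli_split` is a hypothetical helper, and because Python's random generator differs from Spark's, the row assignments will not match those produced by `df.randomSplit`.

```python
import random


def bernoulli_split(rows, train_fraction=0.7, seed=100):
    """Assign each row to train or test with an independent seeded draw.

    Illustration only: this is not Spark's implementation.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training split with probability train_fraction
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(2225))  # stand-in for the 2225 articles in the dataset
train_rows, test_rows = bernoulli_split(rows)

# The sizes are near 70/30 but depend on the seed
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible across runs, which is why the notebook passes `seed = 100`.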

## Data Preprocessing Pipeline using Spark-NLP

In [5]:
from pyspark.ml.feature import HashingTF, IDF, OneHotEncoder, StringIndexer, VectorAssembler, SQLTransformer


# convert text column to spark nlp document
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# generate sentence embeddings using USE
use = UniversalSentenceEncoder.pretrained()\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

# the classes/labels/categories are in the category column
# setMaxEpochs sets the number of passes over the training data
# setEnableOutputLogs writes the training logs for each epoch
classifier_dl = ClassifierDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("result")\
  .setLabelColumn("category")\
  .setMaxEpochs(5)\
  .setEnableOutputLogs(True)

# create a pipeline
clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        use,
        classifier_dl
    ])


# fit the pipeline on training data
pipeline_model = clf_pipeline.fit(trainingData)

# perform predictions on test data
predictions =  pipeline_model.transform(testData)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [9]:
final_df=predictions.select('category','text','result.result')

final_df.show()

+--------+--------------------+----------+
|category|                text|    result|
+--------+--------------------+----------+
|business|air china in $1bn...|[business]|
|business|alfa romeos  to g...|[business]|
|business|algeria hit by fu...|[business]|
|business|amex shares up on...|[business]|
|business|asian quake hits ...|[business]|
|business|australia rates a...|[business]|
|business|axa sun life cuts...|[business]|
|business|bad weather hits ...|[business]|
|business|bank set to leave...|[business]|
|business|banker loses sexi...|[politics]|
|business|boeing unveils ne...|[business]|
|business|booming markets s...|[business]|
|business|borussia dortmund...|[business]|
|business|brewers  profits ...|[business]|
|business|bt offers equal a...|    [tech]|
|business|burren awarded eg...|[business]|
|business|businesses fail t...|[business]|
|business|cairn energy in i...|[business]|
|business|call to save manu...|[politics]|
|business|cash gives way to...|[business]|
+--------+--------------------+----------+

In the `result` column, each row is an array whose first element is the prediction. Let's unwrap the prediction from the array.

We will write a small Python function, register it as a UDF, and then use `df.withColumn` to generate the `prediction` column.

In [13]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


# take the first (and only) label from the result array
get_prediction_udf = udf(lambda labels: labels[0], StringType())


final_df = final_df.withColumn('prediction',get_prediction_udf('result'))

final_df_cleaned = final_df.select('category','text','prediction')
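
The UDF above indexes the first element directly, which raises an error if a row ever arrives with an empty or missing array. A minimal sketch of a more defensive version follows; `first_label` is a hypothetical helper name, and it could be registered in the same way with `udf(first_label, StringType())`.

```python
from typing import Optional, Sequence


def first_label(labels: Optional[Sequence[str]]) -> Optional[str]:
    """Return the first predicted label, or None when there is none."""
    if not labels:
        # Covers both None and an empty array
        return None
    return labels[0]


print(first_label(["business"]))  # business
print(first_label([]))            # None
print(first_label(None))          # None
```

Returning `None` leaves a null in the `prediction` column instead of failing the whole job.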

In [15]:
final_df_cleaned.show()

+--------+--------------------+----------+
|category|                text|prediction|
+--------+--------------------+----------+
|business|air china in $1bn...|  business|
|business|alfa romeos  to g...|  business|
|business|algeria hit by fu...|  business|
|business|amex shares up on...|  business|
|business|asian quake hits ...|  business|
|business|australia rates a...|  business|
|business|axa sun life cuts...|  business|
|business|bad weather hits ...|  business|
|business|bank set to leave...|  business|
|business|banker loses sexi...|  politics|
|business|boeing unveils ne...|  business|
|business|booming markets s...|  business|
|business|borussia dortmund...|  business|
|business|brewers  profits ...|  business|
|business|bt offers equal a...|      tech|
|business|burren awarded eg...|  business|
|business|businesses fail t...|  business|
|business|cairn energy in i...|  business|
|business|call to save manu...|  politics|
|business|cash gives way to...|  business|
+--------+--------------------+----------+

## Evaluate the model


We could convert the dataframe above to pandas and use an sklearn evaluation metric, but let's keep the code as native to Spark as possible. I am going to write some very simple logic to calculate the accuracy of the model.


In [24]:
from pyspark.sql.types import IntegerType

def analyze_prediction(category,prediction):
    
    if category.lower() ==  prediction.lower():
        return 1
    else:
        return 0

analyze_prediction_udf = udf(analyze_prediction,IntegerType())


final_df_cleaned = final_df_cleaned.withColumn('match_result',analyze_prediction_udf('category','prediction'))

final_df_cleaned.show()

+--------+--------------------+----------+------------+
|category|                text|prediction|match_result|
+--------+--------------------+----------+------------+
|business|air china in $1bn...|  business|           1|
|business|alfa romeos  to g...|  business|           1|
|business|algeria hit by fu...|  business|           1|
|business|amex shares up on...|  business|           1|
|business|asian quake hits ...|  business|           1|
|business|australia rates a...|  business|           1|
|business|axa sun life cuts...|  business|           1|
|business|bad weather hits ...|  business|           1|
|business|bank set to leave...|  business|           1|
|business|banker loses sexi...|  politics|           0|
|business|boeing unveils ne...|  business|           1|
|business|booming markets s...|  business|           1|
|business|borussia dortmund...|  business|           1|
|business|brewers  profits ...|  business|           1|
|business|bt offers equal a...|      tech|           0|

The formula for accuracy is:


accuracy = (number_of_matches / total_count) * 100, where number_of_matches is the count of rows whose `match_result` is 1
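
As a worked example of the formula, the sketch below applies it in plain Python to the 20 rows displayed by the earlier `final_df_cleaned.show()` output. It covers only that displayed sample, so the figure differs from the accuracy on the full test set.

```python
# True categories and predictions copied from the 20 displayed rows
categories = ["business"] * 20
predictions = (
    ["business"] * 9
    + ["politics"]
    + ["business"] * 4
    + ["tech"]
    + ["business"] * 3
    + ["politics"]
    + ["business"]
)

# 1 when the prediction matches the category, 0 otherwise
matches = [
    int(category.lower() == prediction.lower())
    for category, prediction in zip(categories, predictions)
]

accuracy = 100 * sum(matches) / len(matches)
print(accuracy)  # 85.0
```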

In [28]:
from pyspark.sql import functions as F

sum_of_1 = final_df_cleaned.agg(F.sum("match_result")).collect()

accuracy = (sum_of_1[0][0]/final_df_cleaned.count())*100

print(accuracy)

98.04216867469879


## Verify using sklearn

In [32]:
from sklearn.metrics import accuracy_score


final_df_pd = final_df_cleaned.toPandas()

print(accuracy_score(final_df_pd.category,final_df_pd.prediction)*100)

98.04216867469879
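
A single accuracy figure can hide categories that the model handles less well. The sketch below computes accuracy per true category in plain Python; `per_category_accuracy` is a hypothetical helper and the sample pairs are made up for illustration, not taken from the real test set. With the pandas dataframe above, the pairs could be built with `zip(final_df_pd.category, final_df_pd.prediction)`.

```python
from collections import defaultdict


def per_category_accuracy(pairs):
    """Map each true category to the percentage of rows predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, prediction in pairs:
        total[category] += 1
        correct[category] += int(category.lower() == prediction.lower())
    return {category: 100 * correct[category] / total[category] for category in total}


# Hypothetical (category, prediction) pairs for illustration only
sample = [
    ("business", "business"),
    ("business", "politics"),
    ("sport", "sport"),
    ("tech", "tech"),
    ("tech", "business"),
    ("tech", "tech"),
]

for category, score in per_category_accuracy(sample).items():
    print(category, round(score, 2))
```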
