# Can you predict an author's gender from a blog post?

This notebook demonstrates a simple end-to-end pipeline with Spark ML. I chose the Blog Authorship Corpus for this project; it is available at http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm. The script that parses the blog posts (in XML format) into a JSON dataset is provided alongside this project.

In [2]:
# Let's load all the dependencies
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, Tokenizer, HashingTF, StopWordsRemover, RegexTokenizer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


In [3]:
# Load the dataset into a DataFrame. The JSON file containing the dataset was uploaded to the Databricks FileStore and a table was created from it.
df = spark.sql("SELECT * FROM author_blogs_dataset_mini_json").select("text","gender")
df.cache()

In [4]:
display(df.take(10))

In [5]:
df.printSchema()

In [6]:
# Let's see the class distribution in our dataset
df.groupBy('gender').count().show()

In [7]:
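# Simple preprocessing function

import re

def preprocess(text):
    # Null posts reach a Python UDF as None; pass them through unchanged
    if text is None:
        return None
    # Replace escape sequences that survived as literal text (a backslash followed by r, n or t)
    text = text.replace('\\r\\n', ' ')
    text = text.replace('\\n\\t', ' ')
    text = text.replace('\\t', ' ')
    # Replace HTML non-breaking-space entities
    text = re.sub(r'&nbsp', ' ', text)
    # Replace real newline characters
    text = re.sub(r'\n', ' ', text)
    # Replace everything that is neither a word character nor whitespace.
    # The earlier class [^\w+|\s] also kept literal '+' and '|' characters.
    text = re.sub(r'[^\w\s]', ' ', text)
    text = text.lower()
    return text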
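
To see what this kind of cleaning does to a single post, here is a minimal standalone sketch. The function name `clean_sample` and the sample string are made up for illustration, and unlike `preprocess` it also collapses repeated whitespace so the result is easier to read.

```python
import re

def clean_sample(text):
    # Hypothetical helper, not part of the notebook's pipeline
    text = re.sub(r'&nbsp;?', ' ', text)   # HTML non-breaking spaces
    text = re.sub(r'[^\w\s]', ' ', text)   # punctuation and other symbols
    text = re.sub(r'\s+', ' ', text)       # newlines, tabs, repeated spaces
    return text.lower().strip()

sample = "Hello,&nbsp;World!\r\n\tIt's 2004."
print(clean_sample(sample))  # hello world it s 2004
```

Note that the apostrophe in "It's" becomes a space, so contractions are split into two tokens later on.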

In [8]:
# Create a user-defined function (UDF)

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
text_clean_udf = udf(preprocess, StringType())

In [9]:
# add the cleaned column to the dataset based on the udf
df_cleaned = df.withColumn('text_cleaned',text_clean_udf('text'))

In [10]:
# original
(df_cleaned
 .select('text')
 .show(1,truncate=False))

In [11]:
# preprocessed
(df_cleaned
 .select('text_cleaned')
 .show(1,truncate=False))

In [12]:
#Define feature-creation pipeline
labelIndexer = StringIndexer(inputCol="gender", outputCol="label", handleInvalid="keep")
regexTokenizer = RegexTokenizer(inputCol="text_cleaned", outputCol="words", pattern=r'\s+|[^a-zA-Z]')
stopwordsRemover = StopWordsRemover(inputCol="words",outputCol="filtered")
hashingTF = HashingTF(inputCol="filtered", outputCol="features")
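
The three text stages above (tokenize, drop stop words, hash into a fixed-size vector) can be sketched in plain Python to show the idea. This is only an illustration under assumptions: the function names and the tiny stop-word list are made up, the vector has 16 slots instead of a realistic size, and `zlib.crc32` stands in for whatever hash function Spark uses internally.

```python
import re
import zlib
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and"}  # tiny made-up list for illustration

def tokenize(text):
    # Split on whitespace or any non-letter, dropping empty pieces
    return [t for t in re.split(r'\s+|[^a-zA-Z]', text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def hashing_tf(tokens, num_features=16):
    # Map each token to a slot index and count how often each slot is hit
    counts = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens)
    return dict(counts)

tokens = remove_stop_words(tokenize("The cat and the hat is a cat"))
vector = hashing_tf(tokens)
print(tokens)  # ['cat', 'hat', 'cat']
print(vector)  # sparse {slot index: count}; the indices depend on the hash
```

Because different words can land in the same slot, a small vector size trades accuracy for memory; no vocabulary has to be stored either way.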

In [13]:
#build a pipeline using the components above.
pipeline = Pipeline(stages=[labelIndexer, regexTokenizer,stopwordsRemover,hashingTF])

In [14]:
#define a Logistic Regression Model
lr = LogisticRegression(maxIter=20)

In [15]:
# Fit the pipeline to the dataset
transformerModel = pipeline.fit(df_cleaned)

In [16]:
#show transformed dataset
display(transformerModel.transform(df_cleaned).take(3))

In [17]:
#extract only features and label from the transformed dataset
dataset = transformerModel.transform(df_cleaned).select('label','features')

In [18]:
display(dataset.take(3))
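
The `label` column is numeric because `StringIndexer` replaces each distinct string with an index, by default giving index 0 to the most frequent value. A rough sketch of that mapping, with a hypothetical function name and made-up label strings (the real values in the `gender` column may differ):

```python
from collections import Counter

def fit_string_index(values):
    # Most frequent value gets 0.0, the next one 1.0, and so on.
    # Ties are broken alphabetically here, which is an assumption of this sketch.
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}

mapping = fit_string_index(["male", "female", "male", "male", "female"])
print(mapping)  # {'male': 0.0, 'female': 1.0}
```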

In [19]:
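# Take a stratified sample as training data. Each row is kept with probability 0.8 for its label,
# so the 80-20 split between training and test data is approximate. The label column is a double,
# so the keys are written as floats; the seed (an arbitrary value) makes the sample reproducible.
train_data = dataset.sampleBy('label', fractions={0.0: 0.8, 1.0: 0.8}, seed=42)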
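
As I understand it, stratified sampling of this kind keeps each row independently with the fraction given for its label, so the class counts come out close to 80% rather than exactly 80%. A minimal plain-Python sketch of the idea, using made-up rows and a hypothetical `sample_by` helper:

```python
import random
from collections import Counter

def sample_by(rows, fractions, seed=0):
    # Keep each row with the probability assigned to its label (0.0 if the label is not listed)
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(row["label"], 0.0)]

# 1000 made-up rows with two balanced labels
rows = [{"id": i, "label": float(i % 2)} for i in range(1000)]

train = sample_by(rows, {0.0: 0.8, 1.0: 0.8})
train_ids = {row["id"] for row in train}
test = [row for row in rows if row["id"] not in train_ids]

print(Counter(row["label"] for row in train))  # roughly 400 per label
print(Counter(row["label"] for row in test))   # roughly 100 per label
```

Splitting the remainder by a unique `id`, as done here, avoids relying on row equality to find the held-out rows.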

In [20]:
train_data.groupBy('label').count().show()

In [21]:
# The remaining rows are held out to test the model's performance.
# Note: subtract() behaves like SQL EXCEPT DISTINCT, so duplicate rows are collapsed.
test_data = dataset.subtract(train_data)

In [22]:
test_data.groupBy('label').count().show()

In [23]:
# build our model by fitting on the training data
model = lr.fit(train_data)

In [24]:
# get the predictions
results = model.transform(test_data)
display(results)
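
For intuition on how a logistic regression model turns a feature vector into a prediction, here is a small standalone sketch. The weights, intercept and feature values are made up, `predict` is a hypothetical helper, and the 0.5 threshold is an assumption of this sketch rather than something read from the fitted model.

```python
import math

def predict(weights, intercept, features, threshold=0.5):
    # features and weights are sparse {index: value} dicts, like a hashed term-count vector
    margin = intercept + sum(weights.get(i, 0.0) * v for i, v in features.items())
    probability = 1.0 / (1.0 + math.exp(-margin))  # logistic (sigmoid) function
    label = 1.0 if probability > threshold else 0.0
    return probability, label

weights = {3: 0.8, 7: -1.2}   # made-up coefficients
features = {3: 2.0, 7: 1.0}   # term counts in slots 3 and 7

probability, label = predict(weights, 0.0, features)
print(round(probability, 3), label)  # 0.599 1.0
```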

In [25]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

#instantiate Evaluator object, default metric is areaUnderROC
evaluator = BinaryClassificationEvaluator()

In [26]:
evaluator.evaluate(results)

The areaUnderROC is 0.59, so our model is only slightly better than random guessing (areaUnderROC = 0.5).
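
The area under the ROC curve can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, which is why 0.5 corresponds to guessing. A minimal sketch of that pairwise definition, with made-up labels and scores (this is not how the evaluator computes it internally):

```python
def auc(labels, scores):
    # Compare every positive score with every negative score; ties count as half a win
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5, an uninformative model
```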

The objective of this notebook was to show an end-to-end pipeline for loading and preprocessing the dataset, building a model, and evaluating it. We'll investigate tuning the algorithm in upcoming posts.


Thank you. :)