# Classification in PySpark's MLlib

PySpark offers a good variety of algorithms for classification problems. However, because PySpark operates on distributed dataframes, we cannot apply popular single-machine Python libraries like scikit-learn directly to them, so we use PySpark's MLlib package for these tasks instead. In this notebook we will go over how to prep our data and how to train and test the classification algorithms PySpark offers.

## Algorithms Available

PySpark offers the following algorithms for classification. 

1. Logistic Regression 
2. Naive Bayes
3. One Vs Rest
4. Linear Support Vector Machine (SVC)
5. Random Forest Classifier
6. GBT Classifier
7. Decision Tree Classifier
8. Multilayer Perceptron Classifier (Neural Network)

In [1]:
# Importing pyspark and creating a session
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('classification').getOrCreate()
spark

In [2]:
#Importing required libraries
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler

### Data Set Name: Autistic Spectrum Disorder Screening Data for Adult
Autistic Spectrum Disorder (ASD) is a neurodevelopmental condition associated with significant healthcare costs, and early diagnosis can significantly reduce these. Unfortunately, waiting times for an ASD diagnosis are lengthy and procedures are not cost effective. The economic impact of autism and the increase in the number of ASD cases across the world reveal an urgent need for easily implemented and effective screening methods. A time-efficient and accessible ASD screening is therefore urgently needed to help health professionals and to inform individuals whether they should pursue a formal clinical diagnosis. The rapid growth in the number of ASD cases worldwide necessitates datasets related to behaviour traits. However, such datasets are rare, making it difficult to perform thorough analyses to improve the efficiency, sensitivity, specificity and predictive accuracy of the ASD screening process. Presently, very few autism datasets associated with clinical diagnosis or screening are available, and most of them are genetic in nature. Hence, we propose a new dataset related to autism screening of adults that contains 20 features to be utilised for further analysis, especially in determining influential autistic traits and improving the classification of ASD cases. In this dataset, we record ten behavioural features (AQ-10-Adult) plus ten individual characteristics that have proved to be effective in distinguishing ASD cases from controls in behaviour science.

### Source: 
https://www.kaggle.com/faizunnabi/autism-screening

In [3]:
path ="datasets-mlib/"
df = spark.read.csv(path+'Toddler Autism dataset July 2018.csv',inferSchema=True,header=True)

In [4]:
#Checking the dataset
df.limit(6).toPandas()

Unnamed: 0,Case_No,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Age_Mons,Qchat-10-Score,Sex,Ethnicity,Jaundice,Family_mem_with_ASD,Who completed the test,Class/ASD Traits
0,1,0,0,0,0,0,0,1,1,0,1,28,3,f,middle eastern,yes,no,family member,No
1,2,1,1,0,0,0,1,1,0,0,0,36,4,m,White European,yes,no,family member,Yes
2,3,1,0,0,0,0,0,1,1,0,1,36,4,m,middle eastern,yes,no,family member,Yes
3,4,1,1,1,1,1,1,1,1,1,1,24,10,m,Hispanic,no,no,family member,Yes
4,5,1,1,0,1,1,1,1,1,1,1,20,9,f,White European,no,yes,family member,Yes
5,6,1,1,0,0,1,1,1,1,1,1,21,8,m,black,no,no,family member,Yes


In [5]:
df.printSchema()

root
 |-- Case_No: integer (nullable = true)
 |-- A1: integer (nullable = true)
 |-- A2: integer (nullable = true)
 |-- A3: integer (nullable = true)
 |-- A4: integer (nullable = true)
 |-- A5: integer (nullable = true)
 |-- A6: integer (nullable = true)
 |-- A7: integer (nullable = true)
 |-- A8: integer (nullable = true)
 |-- A9: integer (nullable = true)
 |-- A10: integer (nullable = true)
 |-- Age_Mons: integer (nullable = true)
 |-- Qchat-10-Score: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Ethnicity: string (nullable = true)
 |-- Jaundice: string (nullable = true)
 |-- Family_mem_with_ASD: string (nullable = true)
 |-- Who completed the test: string (nullable = true)
 |-- Class/ASD Traits : string (nullable = true)



In the dataset,
- Independent variables (features): A1 - Who completed the test (Case_No is a row identifier and is dropped in the formatting step)
- Dependent variable: Class/ASD Traits 

In [6]:
# Identifying the number of classes in the dependent variable
df.groupBy("Class/ASD Traits ").count().show()

+-----------------+-----+
|Class/ASD Traits |count|
+-----------------+-----+
|               No|  326|
|              Yes|  728|
+-----------------+-----+



### Formatting data

In [7]:
input_columns = df.columns # Collect the column names as a list
input_columns = input_columns[1:-1] # Keep only the feature columns: drop the first (Case_No) and the last (the label)

dependent_var = 'Class/ASD Traits '

In [8]:
# Cast the label (class variable) to string type to prep for reindexing
# PySpark expects a zero-indexed numeric label column
# In case our data is not in that format, we convert it with the built-in StringIndexer
renamed = df.withColumn("label_str", df[dependent_var].cast(StringType())) # Copy the label into a new string-typed column
indexer = StringIndexer(inputCol="label_str", outputCol="label") # "label" is the default label column name PySpark estimators expect
indexed = indexer.fit(renamed).transform(renamed)
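
To make the indexing step concrete: with its default ordering (`stringOrderType="frequencyDesc"`), `StringIndexer` gives index 0 to the most frequent value, which is why the majority class "Yes" ends up as label 0.0 in this notebook. The following is only a plain-Python sketch of that idea, not Spark's implementation; the helper name `frequency_desc_index`, the toy labels, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def frequency_desc_index(values):
    """Map each distinct string to a float index, most frequent value first.

    Hypothetical helper: ties are broken alphabetically, which is an
    assumption of this sketch rather than a statement about Spark.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Toy labels made up for illustration (not rows from the dataset)
toy_labels = ["Yes", "No", "Yes", "Yes", "No"]
mapping = frequency_desc_index(toy_labels)
print(mapping)
```

Because the index depends on class frequencies, it is worth checking which class received 0.0 before interpreting predictions.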

In [9]:
# Convert every string-typed input column to a numeric index, otherwise the algorithms will not be able to process it
# The two lists built here are reused later on
numeric_inputs = []
string_inputs = []
for column in input_columns:
    # First identify the string columns in the input column list
    # (isinstance is used because the string form of a type differs across Spark versions)
    if isinstance(indexed.schema[column].dataType, StringType):
        # Name the indexed column so it can be distinguished from the original
        new_col_name = column+"_num"
        # Set up the StringIndexer, then fit and apply it
        indexer = StringIndexer(inputCol=column, outputCol=new_col_name)
        indexed = indexer.fit(indexed).transform(indexed)
        # Add the new column name to the string inputs list
        string_inputs.append(new_col_name)
    else:
        # Numeric columns need no conversion, so just record them
        numeric_inputs.append(column)

#### Treating for skewness and outliers

Skewness measures how much a distribution of values deviates from symmetry around the mean. A value of zero means the distribution is symmetric, a positive skewness indicates a longer right tail (many small values and a few large ones), and a negative skewness indicates a longer left tail (many large values and a few small ones).

As a general rule of thumb: 

 - If skewness is **less than -1 or greater than 1**, the distribution is highly skewed. 
 - If skewness is **between -1 and -0.5 or between 0.5 and 1**, the distribution is moderately skewed. 
 - If skewness is **between -0.5 and 0.5**, the distribution is approximately symmetric.
 
A common recommendation for treating skewness is either a log transformation for positive skewed data or an exponential transformation for negatively skewed data.
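
As a rough illustration of the rule of thumb above, the sketch below computes the population skewness (third central moment divided by the second central moment raised to the power 1.5) in plain Python and maps it to the three bands. The function names and sample values are hypothetical, and the sketch assumes the input is not constant, since the variance would then be zero.

```python
def population_skewness(values):
    """Population skewness: m3 / m2**1.5, using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def describe_skew(skew):
    """Apply the rule-of-thumb bands to a skewness value."""
    if skew < -1 or skew > 1:
        return "highly skewed"
    if skew < -0.5 or skew > 0.5:
        return "moderately skewed"
    return "approximately symmetric"


# Made-up samples for illustration
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

print(describe_skew(population_skewness(symmetric)))
print(describe_skew(population_skewness(right_tailed)))
```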


**Outliers** <br>
One common way to correct outliers is flooring and capping: any value below a lower threshold (for example the 1st percentile) is raised to that threshold, and any value above an upper threshold (for example the 99th percentile) is lowered to it. For example, if the 99th percentile is 96 and there is a value of 1,000, you would change that value to 96.
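
Flooring and capping amounts to clipping each value into a fixed range. A minimal plain-Python sketch, assuming the two thresholds have already been computed; the function name `floor_and_cap`, the lower bound of 2, and the sample values are made up for illustration, while 96 and 1,000 come from the example above.

```python
def floor_and_cap(values, lower, upper):
    """Raise values below `lower` to `lower` and reduce values above `upper` to `upper`."""
    return [min(max(x, lower), upper) for x in values]


# Assumed thresholds: 1st percentile = 2, 99th percentile = 96
capped = floor_and_cap([0, 5, 50, 1000], lower=2, upper=96)
print(capped)  # [2, 5, 50, 96]
```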

In [10]:
# Create a dictionary of quantiles for the numeric columns
# Using the top and bottom 1%, but this can be adjusted if needed
d = {}
for col_name in numeric_inputs:
    # The last argument is the relative error: 0 is exact, larger values are faster but coarser
    d[col_name] = indexed.approxQuantile(col_name,[0.01,0.99],0.25)

# Now check the skewness of every numeric column
# (col_name is used as the loop variable so that pyspark.sql.functions.col is not shadowed)
for col_name in numeric_inputs:
    skew = indexed.agg(skewness(indexed[col_name])).collect()
    skew = skew[0][0]
    # If strong skewness is found, floor and cap the column, then transform it
    if skew > 1: # Right skew: floor, cap and log(x+1)
        indexed = indexed.withColumn(col_name, \
        log(when(indexed[col_name] < d[col_name][0],d[col_name][0])\
        .when(indexed[col_name] > d[col_name][1], d[col_name][1])\
        .otherwise(indexed[col_name]) +1))
        print(col_name+" has been treated for positive (right) skewness. (skew =",skew,")")
    elif skew < -1: # Left skew: floor, cap and exp(x)
        indexed = indexed.withColumn(col_name, \
        exp(when(indexed[col_name] < d[col_name][0],d[col_name][0])\
        .when(indexed[col_name] > d[col_name][1], d[col_name][1])\
        .otherwise(indexed[col_name])))
        print(col_name+" has been treated for negative (left) skewness. (skew =",skew,")")

In [11]:
# Checking for negative values in the dataframe
# Naive Bayes cannot process negative feature values, so print a warning if any are found
# Note: Checking only the numeric inputs since anything that is indexed won't have negative values

# Calculate the minimum of every numeric input column (on the treated dataframe)
minimums = indexed.select([min(c).alias(c) for c in numeric_inputs])
# Collect all the column minimums into a single array column
min_array = minimums.select(array(numeric_inputs).alias("mins"))
# Collect the global minimum as a Python object
df_minimum = min_array.select(array_min(min_array.mins)).collect()
# Slice to get the number itself
df_minimum = df_minimum[0][0]

# If there are ANY negative values found in the df, print a warning message
if df_minimum < 0:
    print("WARNING: The Naive Bayes Classifier will not be able to process your dataframe as it contains negative values")
else:
    print("No negative values were found in your dataframe.")

No negative values were found in your dataframe.


In [12]:
# Before rescaling (which also corrects any negative values found above), we need to vectorize our df,
# because the scaler used for that correction requires a vector column
# Now creating the final features list
features_list = numeric_inputs + string_inputs
# Creating vector assembler object
assembler = VectorAssembler(inputCols=features_list,outputCol='features')
# And calling on the vector assembler to transform your dataframe
output = assembler.transform(indexed).select('features','label')

In [13]:
# Creating the min-max scaler object; rescaling every feature into [0, 1000] also removes any negative values
# I like to use a high range like 1,000 because I only see one decimal place in the final_data.show() call
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures",min=0,max=1000)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))

# Computing summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(output)

# Rescaling each feature to range [min, max].
scaled_data = scalerModel.transform(output)
final_data = scaled_data.select('label','scaledFeatures')
# Renaming to default value
final_data = final_data.withColumnRenamed("scaledFeatures","features")
final_data.show()

Features scaled to range: [0.000000, 1000.000000]
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(17,[6,7,9,10,11,...|
|  0.0|(17,[0,1,5,6,10,1...|
|  0.0|(17,[0,6,7,9,10,1...|
|  0.0|[1000.0,1000.0,10...|
|  0.0|[1000.0,1000.0,0....|
|  0.0|[1000.0,1000.0,0....|
|  0.0|(17,[0,3,4,5,8,10...|
|  0.0|(17,[1,4,6,7,8,9,...|
|  1.0|(17,[6,9,10,11,13...|
|  0.0|[1000.0,1000.0,10...|
|  0.0|[1000.0,0.0,0.0,1...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|(17,[10,12,13,14]...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|(17,[10,13],[250....|
|  0.0|(17,[0,1,2,4,6,7,...|
|  1.0|(17,[10,13,15],[1...|
|  0.0|[1000.0,1000.0,10...|
|  1.0|(17,[0,4,9,10,11,...|
|  0.0|(17,[0,1,2,4,6,7,...|
+-----+--------------------+
only showing top 20 rows
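
The rescaling applied to each feature can be written as `(x - col_min) / (col_max - col_min) * (max - min) + min`. Below is a plain-Python sketch of that formula for a single column, assuming the same output range of [0, 1000] used above. The function name and the sample ages are hypothetical, and the constant-column branch follows the behaviour described in Spark's `MinMaxScaler` documentation (the midpoint of the output range).

```python
def min_max_scale(values, out_min=0.0, out_max=1000.0):
    """Linearly rescale one column of numbers into [out_min, out_max]."""
    col_min, col_max = min(values), max(values)
    if col_max == col_min:
        # A constant column carries no spread, so map it to the midpoint
        return [0.5 * (out_min + out_max) for _ in values]
    scale = (out_max - out_min) / (col_max - col_min)
    return [(x - col_min) * scale + out_min for x in values]


# Made-up ages in months, for illustration only
ages = [12, 24, 36]
scaled = min_max_scale(ages)
print(scaled)
```

Since every column is mapped onto the same range, the binary A1 to A10 answers become either 0 or 1000, which matches the values visible in the output above.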



In [14]:
#Splitting the data into train and test
train, test = final_data.randomSplit([0.7,0.3])

In [15]:
train.count()

752

In [16]:
test.count()

302
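
The counts above are close to, but not exactly, 70% and 30% of the 1,054 rows. That is expected if each row is assigned to a split independently at random, which is the idea sketched below in plain Python. This is an illustration under that assumption rather than Spark's actual implementation; the function name `random_split` and the seed value are hypothetical.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing a uniform random number."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalised weights, e.g. [0.7, 1.0]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(row)
                break
        else:
            # Guards against floating-point rounding in the last bound
            splits[-1].append(row)
    return splits


train_rows, test_rows = random_split(range(1054), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

In PySpark itself, `randomSplit` accepts an optional `seed` argument, which makes the split reproducible across runs.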

#### Logistic Regression

In [18]:
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.sql.functions import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [19]:
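# Setting up our evaluation objects
# Note: rawPredictionCol is pointed at the hard 0/1 predictions here, so the binary evaluator's
# area under the ROC curve is computed from predicted labels rather than from model scores
Bin_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction')
MC_evaluator = MulticlassClassificationEvaluator(metricName='accuracy')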

In [20]:
# Initiating the logistic regression constructor
classifier = LogisticRegression()

In [21]:
fitModel = classifier.fit(train)

In [24]:
#Evaluation method for multiclass classification problem
predictions = fitModel.transform(test)
accuracy = (MC_evaluator.evaluate(predictions))*100

In [25]:
print("Accuracy",accuracy)

Accuracy 100.0
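
The accuracy metric used here is the share of test rows whose predicted label equals the true label, expressed as a percentage. A minimal plain-Python sketch of that calculation follows; the function name and the toy labels are made up for illustration and are not taken from the model's output.

```python
def accuracy_percent(labels, predictions):
    """Percentage of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return 100.0 * correct / len(labels)


# Toy values: three of the four predictions match their labels
toy_labels = [0.0, 0.0, 1.0, 1.0]
toy_predictions = [0.0, 1.0, 1.0, 1.0]
print(accuracy_percent(toy_labels, toy_predictions))  # 75.0
```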
