### Implement the following algorithms in PySpark using any dataset that you are comfortable with (for example, Iris).
    1. Decision tree
    2. Naïve Bayes classifier
    3. Logistic Regression
    4. Ensemble models (RandomForest, GBTClassifier)
    5. Multilayer Perceptron classifier

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier
from pyspark.ml.classification import NaiveBayes, LogisticRegression
from pyspark.ml.classification import MultilayerPerceptronClassifier, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [2]:
spark = SparkSession.builder.appName('iris').getOrCreate()

In [3]:
df = spark.read.csv('data/iris.csv', inferSchema=True, header=True)

In [4]:
df.show(5)

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows



In [5]:
indexer = StringIndexer(inputCol="species", outputCol="speciesIdx")
df = indexer.fit(df).transform(df)
df.show(5)

+------------+-----------+------------+-----------+-------+----------+
|sepal_length|sepal_width|petal_length|petal_width|species|speciesIdx|
+------------+-----------+------------+-----------+-------+----------+
|         5.1|        3.5|         1.4|        0.2| setosa|       0.0|
|         4.9|        3.0|         1.4|        0.2| setosa|       0.0|
|         4.7|        3.2|         1.3|        0.2| setosa|       0.0|
|         4.6|        3.1|         1.5|        0.2| setosa|       0.0|
|         5.0|        3.6|         1.4|        0.2| setosa|       0.0|
+------------+-----------+------------+-----------+-------+----------+
only showing top 5 rows



In [6]:
assembler = VectorAssembler(inputCols=['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width'],
                            outputCol='features')

In [7]:
output = assembler.transform(df)

In [8]:
final_data = output.select('features', 'speciesIdx')

In [9]:
# No seed is passed, so the split and the accuracies below vary between runs.
train_data, test_data = final_data.randomSplit([0.75, 0.25])

In [10]:
train_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- speciesIdx: double (nullable = false)



In [11]:
accuracy_eval = MulticlassClassificationEvaluator(metricName='accuracy', labelCol='speciesIdx')

### 1. Decision Tree Classifier

In [12]:
dtc = DecisionTreeClassifier(labelCol='speciesIdx', featuresCol='features')
dtc_model = dtc.fit(train_data)

In [13]:
dtc_preds = dtc_model.transform(test_data)

In [14]:
print('DTC Accuracy: ', accuracy_eval.evaluate(dtc_preds))

DTC Accuracy:  0.96875


### 2. Naive Bayes Classifier

In [15]:
nbc = NaiveBayes(labelCol='speciesIdx', featuresCol='features')
nbc_model = nbc.fit(train_data)

In [16]:
nbc_preds = nbc_model.transform(test_data)

In [17]:
print('Naive Bayes Accuracy: ', accuracy_eval.evaluate(nbc_preds))

Naive Bayes Accuracy:  0.90625


### 3. Logistic Regression

In [18]:
log_reg = LogisticRegression(labelCol='speciesIdx', featuresCol='features')
lg_model = log_reg.fit(train_data)

In [19]:
lg_preds = lg_model.transform(test_data)

In [20]:
print('Logistic Regression Accuracy: ', accuracy_eval.evaluate(lg_preds))

Logistic Regression Accuracy:  1.0


### 4. Ensemble Model (Random Forest)

In [21]:
rtc = RandomForestClassifier(labelCol='speciesIdx', featuresCol='features')
rtc_model = rtc.fit(train_data)

In [22]:
rf_preds = rtc_model.transform(test_data)

In [23]:
print('Random Forest Accuracy: ', accuracy_eval.evaluate(rf_preds))

Random Forest Accuracy:  0.96875


### 5. Multilayer Perceptron Classifier

In [24]:
# 4 input features, two hidden layers (5 and 4 units), 3 output classes
layers = [4, 5, 4, 3]
mlp = MultilayerPerceptronClassifier(labelCol='speciesIdx', featuresCol='features',
                                    maxIter=100, layers=layers, blockSize=128, seed=1234)
mlp_model = mlp.fit(train_data)

In [25]:
mlp_preds = mlp_model.transform(test_data)

In [26]:
print('MLP Accuracy: ', accuracy_eval.evaluate(mlp_preds))

MLP Accuracy:  1.0


### <ins>Additional Notes</ins>

```
- LinearRegression predicts continuous values, so it is not suitable for a classification problem.

- Support vector machine (LinearSVC) supports only binary classification, and the Iris dataset has 3 classes.

- Two ensemble models are natively supported by PySpark, namely:
    - GBTClassifier (binary labels only)
    - RandomForestClassifier

- These algorithms do not have a built-in implementation in PySpark:
    - K-nearest neighbours
    - Adaptive gradient descent (AdaGrad)
    - Root mean square propagation (RMSProp)
```

[reference](https://spark.apache.org/docs/2.2.0/ml-classification-regression.html)

In [27]:
# This fails on the three-class Iris labels, most likely because
# GBTClassifier accepts only binary labels (0 or 1).
# gbt = GBTClassifier(labelCol="speciesIdx",
#                     featuresCol="features", maxIter=10)
# gbt_model = gbt.fit(train_data)

# gbt_preds = gbt_model.transform(test_data)
# print('Gradient-boost Accuracy: ', accuracy_eval.evaluate(gbt_preds))