# Apache Spark with PySpark [2/2]
Lesson 8 Activity || Python Programming for Big Data

Objectives:
* Predict the species in the Iris dataset using Spark MLlib

Student details:
* Víctor Luque Martín
* Máster Avanzado en Programación en Python para Hacking, BigData y Machine Learning

Date: 26/09/2022

# Table of contents:
1. [Importing the libraries](#importes)
2. [Creating a Spark session](#sesion)
3. [Reading the Iris dataset](#iris)
4. [Vectorizing the numeric columns](#vectorizar)
5. [Indexing the species column](#indexar)
6. [Train/test split](#train_test)
7. [DecisionTree Classifier](#dt)
    1. [Model evaluation](#dt_eval)
8. [Gradient-Boosted Tree Classifier](#gbt)
    1. [Model evaluation](#gbt_eval)
9. [RandomForest Classifier](#rf)
    1. [Model evaluation](#rf_eval)
10. [LogisticRegression Classifier](#lr)
    1. [Model evaluation](#lr_eval)
11. [NaiveBayes Classifier](#nb)
    1. [Model evaluation](#nb_eval)
12. [Multilayer Perceptron Classifier](#mlp)
    1. [Model evaluation](#mlp_eval)

## Importing the libraries <a class="anchor" id="importes"></a>

In [1]:
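import pandas as pd
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
# Explicit imports instead of wildcards, so it is clear where each name comes from
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import (
    DecisionTreeClassifier, GBTClassifier, LogisticRegression,
    MultilayerPerceptronClassifier, NaiveBayes, OneVsRest,
    RandomForestClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator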

## Creating a Spark session <a class="anchor" id="sesion"></a>

In [2]:
spark = SparkSession.builder.appName('Iris_pbd8').getOrCreate()

22/09/25 22:51:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## Reading the Iris dataset <a class="anchor" id="iris"></a>
The Id column is dropped so that only the sepal measurements, the petal measurements and the species remain.

In [3]:
data = spark.read.csv('iris.csv', inferSchema=True, header=True)
data = data.drop('Id')
data.toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
187,6.7,3.0,5.2,2.3,virginica
188,6.3,2.5,5.0,1.9,virginica
189,6.5,3.0,5.2,2.0,virginica
190,6.2,3.4,5.4,2.3,virginica


## Vectorizing the numeric columns <a class="anchor" id="vectorizar"></a>
We use VectorAssembler to combine the numeric columns into a single vector column.

In [4]:
feature_cols = data.columns[:-1]
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
data = assembler.transform(data)
data.toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,features
0,5.1,3.5,1.4,0.2,setosa,"[5.1, 3.5, 1.4, 0.2]"
1,4.9,3.0,1.4,0.2,setosa,"[4.9, 3.0, 1.4, 0.2]"
2,4.7,3.2,1.3,0.2,setosa,"[4.7, 3.2, 1.3, 0.2]"
3,4.6,3.1,1.5,0.2,setosa,"[4.6, 3.1, 1.5, 0.2]"
4,5.0,3.6,1.4,0.2,setosa,"[5.0, 3.6, 1.4, 0.2]"
...,...,...,...,...,...,...
187,6.7,3.0,5.2,2.3,virginica,"[6.7, 3.0, 5.2, 2.3]"
188,6.3,2.5,5.0,1.9,virginica,"[6.3, 2.5, 5.0, 1.9]"
189,6.5,3.0,5.2,2.0,virginica,"[6.5, 3.0, 5.2, 2.0]"
190,6.2,3.4,5.4,2.3,virginica,"[6.2, 3.4, 5.4, 2.3]"
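
As a rough plain-Python picture of what the assembled `features` column holds, the sketch below packs the numeric values of each row into one list. The row dictionaries and the `assemble` helper are hypothetical stand-ins for illustration; they are not part of PySpark.

```python
# Plain-Python illustration of column assembly.
# `assemble` is a hypothetical helper, not a PySpark function.
rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4,
     "petal_width": 0.2, "species": "setosa"},
    {"sepal_length": 4.9, "sepal_width": 3.0, "petal_length": 1.4,
     "petal_width": 0.2, "species": "setosa"},
]

# Same idea as data.columns[:-1]: every column except the last one (species).
feature_cols = list(rows[0].keys())[:-1]

def assemble(row, input_cols):
    """Return the values of input_cols, in order, as one feature vector."""
    return [row[col] for col in input_cols]

for row in rows:
    row["features"] = assemble(row, feature_cols)

print(rows[0]["features"])  # [5.1, 3.5, 1.4, 0.2]
```

Note that selecting the features by position only works while the label stays the last column.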


## Indexing the species column <a class="anchor" id="indexar"></a>
We index the species names to convert them into numeric labels:
- 0 setosa
- 1 versicolor
- 2 virginica

In [5]:
label_indexer = StringIndexer(inputCol='species', outputCol='label').fit(data)
data = label_indexer.transform(data)
data.toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,features,label
0,5.1,3.5,1.4,0.2,setosa,"[5.1, 3.5, 1.4, 0.2]",0.0
1,4.9,3.0,1.4,0.2,setosa,"[4.9, 3.0, 1.4, 0.2]",0.0
2,4.7,3.2,1.3,0.2,setosa,"[4.7, 3.2, 1.3, 0.2]",0.0
3,4.6,3.1,1.5,0.2,setosa,"[4.6, 3.1, 1.5, 0.2]",0.0
4,5.0,3.6,1.4,0.2,setosa,"[5.0, 3.6, 1.4, 0.2]",0.0
...,...,...,...,...,...,...,...
187,6.7,3.0,5.2,2.3,virginica,"[6.7, 3.0, 5.2, 2.3]",2.0
188,6.3,2.5,5.0,1.9,virginica,"[6.3, 2.5, 5.0, 1.9]",2.0
189,6.5,3.0,5.2,2.0,virginica,"[6.5, 3.0, 5.2, 2.0]",2.0
190,6.2,3.4,5.4,2.3,virginica,"[6.2, 3.4, 5.4, 2.3]",2.0
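
The mapping above is produced by the fitted indexer, not chosen by hand. A minimal sketch of the ordering rule, assuming StringIndexer's default behaviour (most frequent label first, ties broken alphabetically in Spark 3.x); the species list and the `fit_label_index` helper are made up for illustration.

```python
from collections import Counter

# Made-up species column; the counts are not taken from the notebook.
species = ["virginica", "setosa", "versicolor", "setosa", "virginica", "versicolor"]

def fit_label_index(values):
    """Map each distinct string to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

mapping = fit_label_index(species)
print(mapping)  # {'setosa': 0.0, 'versicolor': 1.0, 'virginica': 2.0}
```

In the notebook itself, the actual order can be checked with `label_indexer.labels` rather than assumed.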


## Train/test split <a class="anchor" id="train_test"></a>

In [6]:
trainingData, testData = data.randomSplit([0.7, 0.3])
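
`randomSplit` also accepts an optional `seed` argument; without one, every run produces a different split and therefore different error figures. The sketch below mimics the idea in plain Python with a seeded generator. Each row is assigned independently, so the 70/30 proportions are only approximate. The `random_split` helper is hypothetical and is not how Spark implements the split internally.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row independently to train or test using a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(150))  # stand-in for 150 data rows
train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

# Same seed, same split; the sizes are close to 70/30 but not exact.
print(train_a == train_b, len(train_a) + len(test_a))  # True 150
```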

## DecisionTree Classifier <a class="anchor" id="dt"></a>

In [7]:
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
dt_model = dt.fit(trainingData)
dt_prediction = dt_model.transform(testData)
dt_prediction.toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,features,label,rawPrediction,probability,prediction
0,4.4,2.9,1.4,0.2,setosa,"[4.4, 2.9, 1.4, 0.2]",0.0,"[59.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
1,4.4,3.0,1.3,0.2,setosa,"[4.4, 3.0, 1.3, 0.2]",0.0,"[59.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
2,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[59.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
3,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[59.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
4,4.5,2.3,1.3,0.3,setosa,"[4.5, 2.3, 1.3, 0.3]",0.0,"[59.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
...,...,...,...,...,...,...,...,...,...,...
60,7.6,3.0,6.6,2.1,virginica,"[7.6, 3.0, 6.6, 2.1]",2.0,"[0.0, 0.0, 26.0]","[0.0, 0.0, 1.0]",2.0
61,7.7,2.8,6.7,2.0,virginica,"[7.7, 2.8, 6.7, 2.0]",2.0,"[0.0, 0.0, 26.0]","[0.0, 0.0, 1.0]",2.0
62,7.7,3.0,6.1,2.3,virginica,"[7.7, 3.0, 6.1, 2.3]",2.0,"[0.0, 0.0, 26.0]","[0.0, 0.0, 1.0]",2.0
63,7.7,3.8,6.7,2.2,virginica,"[7.7, 3.8, 6.7, 2.2]",2.0,"[0.0, 0.0, 26.0]","[0.0, 0.0, 1.0]",2.0
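
In the output above, `probability` looks like `rawPrediction` rescaled to sum to 1. For a decision tree the raw vector is, to our understanding, the per-class count of training rows in the leaf that the sample reaches. A minimal sketch of that normalization, using two raw vectors from the table (the `normalize` helper is hypothetical):

```python
def normalize(raw):
    """Scale a vector of non-negative class counts so that it sums to 1."""
    total = sum(raw)
    return [value / total for value in raw]

print(normalize([59.0, 0.0, 0.0]))  # [1.0, 0.0, 0.0]
print(normalize([0.0, 0.0, 26.0]))  # [0.0, 0.0, 1.0]
```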


### Model evaluation <a class="anchor" id="dt_eval"></a>

In [8]:
dt_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
dt_accuracy = dt_evaluator.evaluate(dt_prediction)
print("Test Error = %g " % (1.0 - dt_accuracy))

Test Error = 0.0153846 
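
The accuracy metric is the fraction of test rows whose prediction equals the label, and the test error is its complement. The sketch below recomputes it in plain Python. The 65-row test set with a single mistake is an assumption: it is consistent with the printed value (1/65 ≈ 0.0153846), but the notebook does not state the test-set size.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical test set: 65 rows with exactly one mistake.
labels = [0.0] * 25 + [1.0] * 20 + [2.0] * 20
predictions = list(labels)
predictions[30] = 2.0  # one versicolor row predicted as virginica

acc = accuracy(labels, predictions)
print("Test Error = %g " % (1.0 - acc))  # Test Error = 0.0153846
```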


## Gradient-Boosted Tree Classifier <a class="anchor" id="gbt"></a>
According to the [official documentation](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.classification.GBTClassifier.html) of Apache Spark, this classifier supports only binary labels, not multiclass ones, so it cannot be applied directly to this dataset, which has three species.

However, the [OneVsRest](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.classification.OneVsRest.html#pyspark.ml.classification.OneVsRest) algorithm reduces a multiclass problem to a set of binary problems: it trains one binary classifier per class and predicts the class whose classifier is the most confident.

In [9]:
ovr = OneVsRest(classifier=GBTClassifier(maxIter=10))
ovr_model = ovr.fit(trainingData)
ovr_prediction = ovr_model.transform(testData)
ovr_prediction.toPandas()

22/09/25 22:51:17 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/09/25 22:51:17 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
                                                                                

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,features,label,rawPrediction,prediction
0,4.4,2.9,1.4,0.2,setosa,"[4.4, 2.9, 1.4, 0.2]",0.0,"[1.325902679220332, -1.325902679220332, -1.325...",0.0
1,4.4,3.0,1.3,0.2,setosa,"[4.4, 3.0, 1.3, 0.2]",0.0,"[1.325902679220332, -1.325902679220332, -1.325...",0.0
2,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[1.325902679220332, -1.325902679220332, -1.325...",0.0
3,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[1.325902679220332, -1.325902679220332, -1.325...",0.0
4,4.5,2.3,1.3,0.3,setosa,"[4.5, 2.3, 1.3, 0.3]",0.0,"[1.325902679220332, -1.325902679220332, -1.325...",0.0
...,...,...,...,...,...,...,...,...,...
60,7.6,3.0,6.6,2.1,virginica,"[7.6, 3.0, 6.6, 2.1]",2.0,"[-1.3259026792203317, -1.325902679220332, 1.32...",2.0
61,7.7,2.8,6.7,2.0,virginica,"[7.7, 2.8, 6.7, 2.0]",2.0,"[-1.3259026792203317, -1.325902679220332, 1.32...",2.0
62,7.7,3.0,6.1,2.3,virginica,"[7.7, 3.0, 6.1, 2.3]",2.0,"[-1.3259026792203317, -1.325902679220332, 1.32...",2.0
63,7.7,3.8,6.7,2.2,virginica,"[7.7, 3.8, 6.7, 2.2]",2.0,"[-1.3259026792203317, -1.325902679220332, 1.32...",2.0
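
The one-vs-rest reduction can be sketched in two steps: relabel the data once per class as "this class vs. everything else", then pick the class whose binary classifier returns the highest score. Both helpers and the score values below are hypothetical illustrations, not the Spark implementation.

```python
def binary_labels(labels, positive_class):
    """Relabel a multiclass column as 1.0 for positive_class and 0.0 otherwise."""
    return [1.0 if y == positive_class else 0.0 for y in labels]

def one_vs_rest_predict(scores_per_class):
    """Pick the class whose binary classifier gave the highest score."""
    best = max(range(len(scores_per_class)), key=lambda k: scores_per_class[k])
    return float(best)

labels = [0.0, 1.0, 2.0, 1.0]
print(binary_labels(labels, positive_class=1.0))  # [0.0, 1.0, 0.0, 1.0]

# Hypothetical confidence scores from three binary classifiers for one row.
print(one_vs_rest_predict([-1.3, -1.3, 1.3]))  # 2.0
```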


### Model evaluation <a class="anchor" id="gbt_eval"></a>

In [10]:
ovr_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
ovr_accuracy = ovr_evaluator.evaluate(ovr_prediction)
print("Test Error = %g " % (1.0 - ovr_accuracy))

Test Error = 0.0153846 


## RandomForest Classifier <a class="anchor" id="rf"></a>

In [11]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
rf_model = rf.fit(trainingData)
rf_prediction = rf_model.transform(testData)
rf_prediction.toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,features,label,rawPrediction,probability,prediction
0,4.4,2.9,1.4,0.2,setosa,"[4.4, 2.9, 1.4, 0.2]",0.0,"[20.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
1,4.4,3.0,1.3,0.2,setosa,"[4.4, 3.0, 1.3, 0.2]",0.0,"[20.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
2,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[20.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
3,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[20.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
4,4.5,2.3,1.3,0.3,setosa,"[4.5, 2.3, 1.3, 0.3]",0.0,"[19.0, 1.0, 0.0]","[0.95, 0.05, 0.0]",0.0
...,...,...,...,...,...,...,...,...,...,...
60,7.6,3.0,6.6,2.1,virginica,"[7.6, 3.0, 6.6, 2.1]",2.0,"[0.0, 0.0, 20.0]","[0.0, 0.0, 1.0]",2.0
61,7.7,2.8,6.7,2.0,virginica,"[7.7, 2.8, 6.7, 2.0]",2.0,"[0.0, 0.0, 20.0]","[0.0, 0.0, 1.0]",2.0
62,7.7,3.0,6.1,2.3,virginica,"[7.7, 3.0, 6.1, 2.3]",2.0,"[0.0, 0.0, 20.0]","[0.0, 0.0, 1.0]",2.0
63,7.7,3.8,6.7,2.2,virginica,"[7.7, 3.8, 6.7, 2.2]",2.0,"[0.0, 0.0, 20.0]","[0.0, 0.0, 1.0]",2.0
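
Row 4 above has `rawPrediction` `[19.0, 1.0, 0.0]` and `probability` `[0.95, 0.05, 0.0]`, which is what we would expect if 20 trees (the default `numTrees`) each contribute a class-probability vector that is summed and then normalized. A sketch under that assumption, with a hypothetical `forest_predict` helper and made-up per-tree outputs:

```python
def forest_predict(tree_probabilities):
    """Sum per-tree class probabilities, then normalize to a probability vector."""
    n_classes = len(tree_probabilities[0])
    raw = [sum(tree[k] for tree in tree_probabilities) for k in range(n_classes)]
    total = sum(raw)
    probability = [value / total for value in raw]
    prediction = float(max(range(n_classes), key=lambda k: probability[k]))
    return raw, probability, prediction

# Hypothetical forest of 20 trees: 19 predict setosa, 1 predicts versicolor.
trees = [[1.0, 0.0, 0.0]] * 19 + [[0.0, 1.0, 0.0]]
print(forest_predict(trees))  # ([19.0, 1.0, 0.0], [0.95, 0.05, 0.0], 0.0)
```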


### Model evaluation <a class="anchor" id="rf_eval"></a>

In [12]:
rf_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
rf_accuracy = rf_evaluator.evaluate(rf_prediction)
print("Test Error = %g " % (1.0 - rf_accuracy))

Test Error = 0.0153846 


## LogisticRegression Classifier <a class="anchor" id="lr"></a>

In [13]:
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(trainingData)
lr_prediction = lr_model.transform(testData)
lr_prediction.toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,features,label,rawPrediction,probability,prediction
0,4.4,2.9,1.4,0.2,setosa,"[4.4, 2.9, 1.4, 0.2]",0.0,"[1.0089130050285169, -0.08636294607487635, -0....","[0.6232447302988551, 0.20844251922671933, 0.16...",0.0
1,4.4,3.0,1.3,0.2,setosa,"[4.4, 3.0, 1.3, 0.2]",0.0,"[1.0746699687976495, -0.08636294607487635, -0....","[0.6385555726891692, 0.19997168732076637, 0.16...",0.0
2,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[1.1656386119529125, -0.08636294607487635, -0....","[0.659275899276251, 0.1885080200004986, 0.1522...",0.0
3,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[1.1656386119529125, -0.08636294607487635, -0....","[0.659275899276251, 0.1885080200004986, 0.1522...",0.0
4,4.5,2.3,1.3,0.3,setosa,"[4.5, 2.3, 1.3, 0.3]",0.0,"[0.706731382822585, -0.08636294607487635, -0.2...","[0.5500052406143229, 0.24884581247920284, 0.20...",0.0
...,...,...,...,...,...,...,...,...,...,...
60,7.6,3.0,6.6,2.1,virginica,"[7.6, 3.0, 6.6, 2.1]",2.0,"[-1.038973069999116, -0.08636294607487635, -0....","[0.17457867240691138, 0.4525895339750624, 0.37...",1.0
61,7.7,2.8,6.7,2.0,virginica,"[7.7, 2.8, 6.7, 2.0]",2.0,"[-1.115708272559776, -0.08636294607487635, -0....","[0.16386066595358023, 0.45868412638740086, 0.3...",1.0
62,7.7,3.0,6.1,2.3,virginica,"[7.7, 3.0, 6.1, 2.3]",2.0,"[-1.0291854028321288, -0.08636294607487635, -0...","[0.17585577038404954, 0.4514599340972858, 0.37...",1.0
63,7.7,3.8,6.7,2.2,virginica,"[7.7, 3.8, 6.7, 2.2]",2.0,"[-0.7449194745012075, -0.08636294607487635, -0...","[0.22098421663015466, 0.426941844938803, 0.352...",1.0
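
For multinomial logistic regression, the `probability` column is, as far as we can tell from the values above, the softmax of `rawPrediction`. A minimal sketch of that transformation; the raw scores are hypothetical because the notebook output truncates the third component.

```python
import math

def softmax(raw):
    """Convert raw class scores to probabilities that sum to 1."""
    shift = max(raw)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in raw]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical raw scores, not copied from the table above.
probs = softmax([1.0, -0.1, -0.3])
print([round(p, 3) for p in probs])
```

The predicted class is simply the index of the largest probability.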


### Model evaluation <a class="anchor" id="lr_eval"></a>

In [14]:
lr_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
lr_accuracy = lr_evaluator.evaluate(lr_prediction)
print("Test Error = %g " % (1.0 - lr_accuracy))

Test Error = 0.307692 
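
A single error figure hides which classes are confused; in the rows shown above, virginica (label 2.0) is predicted as versicolor (1.0). Counting (label, prediction) pairs makes this visible. The sketch below does so in plain Python on hypothetical data, not the notebook's actual test set.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count (label, prediction) pairs; pairs with different values are errors."""
    return Counter(zip(labels, predictions))

# Hypothetical labels and predictions for illustration only.
labels      = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0]
predictions = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 2.0]

counts = confusion_counts(labels, predictions)
for (label, prediction), n in sorted(counts.items()):
    print(f"label={label} predicted={prediction} count={n}")
```

On the Spark DataFrame itself, `lr_prediction.groupBy("label", "prediction").count().show()` gives the same kind of table.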


## NaiveBayes Classifier <a class="anchor" id="nb"></a>

In [15]:
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
nb_model = nb.fit(trainingData)
nb_prediction = nb_model.transform(testData)
nb_prediction.toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,features,label,rawPrediction,probability,prediction
0,4.4,2.9,1.4,0.2,setosa,"[4.4, 2.9, 1.4, 0.2]",0.0,"[-10.481913080536966, -12.008786595175849, -12...","[0.7544493378114466, 0.16387674896956483, 0.08...",0.0
1,4.4,3.0,1.3,0.2,setosa,"[4.4, 3.0, 1.3, 0.2]",0.0,"[-10.39602551556471, -12.050097928070493, -12....","[0.7783934761860695, 0.14888284734843982, 0.07...",0.0
2,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[-10.612478858343385, -12.376239373915357, -13...","[0.7979487988574379, 0.13676770708290822, 0.06...",0.0
3,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[-10.612478858343385, -12.376239373915357, -13...","[0.7979487988574379, 0.13676770708290822, 0.06...",0.0
4,4.5,2.3,1.3,0.3,setosa,"[4.5, 2.3, 1.3, 0.3]",0.0,"[-10.079831319429056, -11.233680514571631, -11...","[0.6732155190849919, 0.21234585132884914, 0.11...",0.0
...,...,...,...,...,...,...,...,...,...,...
60,7.6,3.0,6.6,2.1,virginica,"[7.6, 3.0, 6.6, 2.1]",2.0,"[-29.989750712719687, -25.823511549131357, -25...","[0.007822986233260686, 0.5043676752612072, 0.4...",1.0
61,7.7,2.8,6.7,2.0,virginica,"[7.7, 2.8, 6.7, 2.0]",2.0,"[-29.66743650298772, -25.470004255184246, -25....","[0.007593652535481511, 0.5050942034825322, 0.4...",1.0
62,7.7,3.0,6.1,2.3,virginica,"[7.7, 3.0, 6.1, 2.3]",2.0,"[-29.831255837953723, -25.77689368849467, -25....","[0.00873853356212637, 0.5037623053192707, 0.48...",1.0
63,7.7,3.8,6.7,2.2,virginica,"[7.7, 3.8, 6.7, 2.2]",2.0,"[-31.491070823785677, -27.574914369496767, -27...","[0.010345000090133599, 0.5193924557544873, 0.4...",1.0
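
The multinomial model treats each feature value as a non-negative count-like weight and scores a class as its log prior plus the sum of `x_j * log(theta_j)`. The sketch below is a simplified plain-Python version with additive smoothing on the feature weights only; the helper names and the four training rows are hypothetical, and Spark's own estimator may differ in details such as prior smoothing.

```python
import math

def fit_multinomial_nb(features, labels, smoothing=1.0):
    """Estimate log priors and smoothed log feature weights per class."""
    classes = sorted(set(labels))
    n_features = len(features[0])
    model = {}
    for c in classes:
        rows = [x for x, y in zip(features, labels) if y == c]
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + smoothing * n_features
        log_theta = [math.log((t + smoothing) / denom) for t in totals]
        log_prior = math.log(len(rows) / len(labels))
        model[c] = (log_prior, log_theta)
    return model

def predict(model, x):
    """Return the class with the highest log prior + sum(x_j * log theta_j)."""
    def score(c):
        log_prior, log_theta = model[c]
        return log_prior + sum(xj * lt for xj, lt in zip(x, log_theta))
    return max(model, key=score)

# Hypothetical training rows with non-negative features.
X = [[5.0, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2],
     [6.7, 3.0, 5.2, 2.3], [6.3, 2.5, 5.0, 1.9]]
y = [0.0, 0.0, 2.0, 2.0]
model = fit_multinomial_nb(X, y)
print(predict(model, [5.1, 3.4, 1.5, 0.2]), predict(model, [6.5, 3.0, 5.5, 2.1]))  # 0.0 2.0
```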


### Model evaluation <a class="anchor" id="nb_eval"></a>

In [16]:
nb_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
nb_accuracy = nb_evaluator.evaluate(nb_prediction)
print("Test Error = %g " % (1.0 - nb_accuracy))

Test Error = 0.246154 


## Multilayer Perceptron Classifier <a class="anchor" id="mlp"></a>

In [17]:
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3])
mlp_model = mlp.fit(trainingData)
mlp_prediction = mlp_model.transform(testData)
mlp_prediction.toPandas()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,features,label,rawPrediction,probability,prediction
0,4.4,2.9,1.4,0.2,setosa,"[4.4, 2.9, 1.4, 0.2]",0.0,"[13.384783083882027, -5.941126792517606, -9.15...","[0.9999999957923589, 4.04449200347838e-09, 1.6...",0.0
1,4.4,3.0,1.3,0.2,setosa,"[4.4, 3.0, 1.3, 0.2]",0.0,"[13.384783076247986, -5.941126789419578, -9.15...","[0.9999999957923589, 4.044492046884149e-09, 1....",0.0
2,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[13.384782889756949, -5.941126723443072, -9.15...","[0.9999999957923578, 4.044493067987247e-09, 1....",0.0
3,4.4,3.2,1.3,0.2,setosa,"[4.4, 3.2, 1.3, 0.2]",0.0,"[13.384782889756949, -5.941126723443072, -9.15...","[0.9999999957923578, 4.044493067987247e-09, 1....",0.0
4,4.5,2.3,1.3,0.3,setosa,"[4.5, 2.3, 1.3, 0.3]",0.0,"[13.3847843320286, -5.941127242812765, -9.1515...","[0.9999999957923662, 4.044485134150254e-09, 1....",0.0
...,...,...,...,...,...,...,...,...,...,...
60,7.6,3.0,6.6,2.1,virginica,"[7.6, 3.0, 6.6, 2.1]",2.0,"[-13.668410547228927, 6.648521716399623, 6.471...","[8.168749968670019e-10, 0.544110584879301, 0.4...",1.0
61,7.7,2.8,6.7,2.0,virginica,"[7.7, 2.8, 6.7, 2.0]",2.0,"[-13.668410547228879, 6.648521716399587, 6.471...","[8.168749968670618e-10, 0.5441105848792945, 0....",1.0
62,7.7,3.0,6.1,2.3,virginica,"[7.7, 3.0, 6.1, 2.3]",2.0,"[-13.668410547226804, 6.648521716398025, 6.471...","[8.168749968696295e-10, 0.5441105848790253, 0....",1.0
63,7.7,3.8,6.7,2.2,virginica,"[7.7, 3.8, 6.7, 2.2]",2.0,"[-13.66841054723044, 6.648521716400759, 6.4716...","[8.168749968651312e-10, 0.544110584879495, 0.4...",1.0
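
The `layers=[4, 5, 4, 3]` argument describes a network with 4 inputs (the four features), two hidden layers of 5 and 4 units, and 3 outputs (one per species). As a quick sanity check of the model size, the sketch below counts the weights and biases of such a fully connected network; `count_parameters` is a hypothetical helper.

```python
def count_parameters(layers):
    """Weights plus biases for a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [4, 5, 4, 3]  # 4 input features, hidden layers of 5 and 4 units, 3 classes
print(count_parameters(layers))  # 64
```

The first entry must match the length of the `features` vector and the last entry the number of classes.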


### Model evaluation <a class="anchor" id="mlp_eval"></a>

In [18]:
mlp_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
mlp_accuracy = mlp_evaluator.evaluate(mlp_prediction)
print("Test Error = %g " % (1.0 - mlp_accuracy))

Test Error = 0.292308 
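
To compare the six models side by side, the test errors printed above can be collected and sorted. The sketch below only rearranges the figures already reported. Because they come from a single unseeded split, the ranking may change from run to run.

```python
# Test errors copied from the evaluation cells above.
test_errors = {
    "DecisionTree": 0.0153846,
    "OneVsRest(GBT)": 0.0153846,
    "RandomForest": 0.0153846,
    "LogisticRegression": 0.307692,
    "NaiveBayes": 0.246154,
    "MultilayerPerceptron": 0.292308,
}

# Sort from lowest to highest error (ties keep their insertion order).
ranking = sorted(test_errors.items(), key=lambda item: item[1])
for name, error in ranking:
    print(f"{name:<22} error={error:.4f} accuracy={1.0 - error:.4f}")
```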
