<a href="https://colab.research.google.com/github/tomasborrella/TheValley/blob/main/Ejercicio_resuelto_entrenamiento_modelos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Solved exercise: model training

Notebook by [Tomás Borrella Martín](https://www.linkedin.com/in/tomasborrella/).

Using the income data in [this dataset](https://archive.ics.uci.edu/ml/datasets/Adult), predict from census data whether a person's salary is above or below $50K.

### Links of interest
*   [Presentation slides](https://docs.google.com/presentation/d/1MotclVSrLoykWogG-WwLa-DbPNvVgHBaGuZJX2Gfc4o/edit?usp=sharing)

# 1. Spark installation

In [None]:
# Install JAVA
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
# Install Spark (releases this old are served from the Apache archive)
!wget -q https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz

In [None]:
# Install findspark
!pip install -q findspark

In [None]:
# Environment variables
import os 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"

In [None]:
# Find spark
import findspark
findspark.init()

In [None]:
# PySpark 
!pip install pyspark==3.1.1

Collecting pyspark==3.1.1
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=fc09b54999d8800bf8338f667256a7e59121f24af006e9a61fe0cf89dc097a77
  Stored in directory: /root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1

# 2. Spark Session
Entry point of the Spark application.

In [None]:
# Imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [None]:
# Create Spark Session
spark = (SparkSession
         .builder
         .master("local[*]")
         .appName("Spark Dataframes")
         .getOrCreate()
)

# Example

# Data

In [None]:
# Download the data to the Colab environment
!wget -P /content/data 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

--2021-06-19 16:29:56--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3974305 (3.8M) [application/x-httpd-php]
Saving to: ‘/content/data/adult.data’


2021-06-19 16:29:58 (3.17 MB/s) - ‘/content/data/adult.data’ saved [3974305/3974305]



First, take a quick look at the data:

In [None]:
!head /content/data/adult.data

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
31, 

The full description of the dataset is available at [this link](https://archive.ics.uci.edu/ml/datasets/Adult).

Load the data into a DataFrame, specifying the schema:

In [None]:
from pyspark.sql.types import DoubleType, StringType, StructField, StructType
 
schema = StructType([
  StructField("age", DoubleType(), False),
  StructField("workclass", StringType(), False),
  StructField("fnlwgt", DoubleType(), False),
  StructField("education", StringType(), False),
  StructField("education_num", DoubleType(), False),
  StructField("marital_status", StringType(), False),
  StructField("occupation", StringType(), False),
  StructField("relationship", StringType(), False),
  StructField("race", StringType(), False),
  StructField("sex", StringType(), False),
  StructField("capital_gain", DoubleType(), False),
  StructField("capital_loss", DoubleType(), False),
  StructField("hours_per_week", DoubleType(), False),
  StructField("native_country", StringType(), False),
  StructField("income", StringType(), False)
])
 
dataset = spark.read.format("csv").schema(schema).load("/content/data/adult.data")
cols = dataset.columns

In [None]:
dataset.show(5)

+----+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
| age|        workclass|  fnlwgt| education|education_num|     marital_status|        occupation|  relationship|  race|    sex|capital_gain|capital_loss|hours_per_week|native_country|income|
+----+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
|39.0|        State-gov| 77516.0| Bachelors|         13.0|      Never-married|      Adm-clerical| Not-in-family| White|   Male|      2174.0|         0.0|          40.0| United-States| <=50K|
|50.0| Self-emp-not-inc| 83311.0| Bachelors|         13.0| Married-civ-spouse|   Exec-managerial|       Husband| White|   Male|         0.0|         0.0|          13.0| United-States| <=50K|
|38.0|          Private|215646.0|   HS-grad| 
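
The string values printed above keep a leading space (for example ` State-gov`), because the raw file separates fields with a comma followed by a space. The sketch below uses Python's standard `csv` module, not Spark, to show the effect on the first row of the file and how `skipinitialspace=True` removes it. Spark's CSV reader has an `ignoreLeadingWhiteSpace` option that plays a similar role; this notebook does not set it, so the outputs shown here keep the space.

```python
import csv
import io

# First row of adult.data, copied from the `head` output above.
raw = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n")

# Default parsing keeps the space that follows each comma.
row_default = next(csv.reader(io.StringIO(raw)))

# skipinitialspace=True drops whitespace right after the delimiter.
row_trimmed = next(csv.reader(io.StringIO(raw), skipinitialspace=True))

print(repr(row_default[1]))   # ' State-gov'
print(repr(row_trimmed[1]))   # 'State-gov'
```

The leading space is harmless for the models trained here, since every value of a column carries it consistently, but it matters when filtering by an exact category name.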

# Data preprocessing

We create a Pipeline with all the transformations.

To use algorithms such as *Logistic Regression*, we first have to convert the categorical variables into numeric values.

In this notebook we use a combination of *StringIndexer* (which assigns a numeric index to each category) and *OneHotEncoder* (which converts each category index into a binary vector).
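
As a rough illustration of what these two transformers do, here is a plain-Python sketch (not Spark's implementation). It assumes the Spark defaults: `StringIndexer` orders categories by descending frequency (`stringOrderType="frequencyDesc"`) and `OneHotEncoder` drops the last category (`dropLast=True`). The helper names `string_index` and `one_hot` are made up for this example.

```python
from collections import Counter

def string_index(values):
    """Map each category to an index, most frequent first (ties: alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: float(i) for i, category in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Binary vector for an index; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

sex = ["Male", "Male", "Female", "Male", "Female"]   # toy column
mapping = string_index(sex)
print(mapping)                                   # {'Male': 0.0, 'Female': 1.0}
print([one_hot(mapping[s], len(mapping)) for s in sex])
# [[1.0], [1.0], [0.0], [1.0], [0.0]]
```

Dropping the last category avoids a redundant column: with two categories, a single 0/1 slot already identifies both.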

The stages for all the categorical variables are created in a loop:

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
 
categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
# List that will hold the stages of the Pipeline
stages = []

for categoricalCol in categoricalColumns:
    # First, StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    
    # Then OneHotEncoder, to convert the category indices into binary SparseVectors
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Append the stages to the list.
    # They are not run now; they will be added to the Pipeline later.
    stages += [stringIndexer, encoder]

We can check that the loop over the 8 categorical variables worked by looking at the contents of the *stages* variable:

In [None]:
stages

[StringIndexer_d5d764e2c239,
 OneHotEncoder_bd13a6263131,
 StringIndexer_2ebc7f3c8fe6,
 OneHotEncoder_4fe594756cfd,
 StringIndexer_7822cc7b4e7f,
 OneHotEncoder_3340bc53f631,
 StringIndexer_a655ba0fddf0,
 OneHotEncoder_bf546167c3cc,
 StringIndexer_755addbd14b5,
 OneHotEncoder_43360cacf78d,
 StringIndexer_770cbd967219,
 OneHotEncoder_160f5a785e8d,
 StringIndexer_ead56e96175b,
 OneHotEncoder_060ad367e7bc,
 StringIndexer_4982e725321c,
 OneHotEncoder_8f4378c88799]

We also add a stage that converts the target variable (*label*) to numeric using *StringIndexer*:

In [None]:
label_stringIdx = StringIndexer(inputCol="income", outputCol="label")
stages += [label_stringIdx]

Finally, we add a *VectorAssembler* stage to combine all the *features* into a single vector (this is the format the classification models expect):

In [None]:
# Combine all the features into one vector with VectorAssembler
numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

We run the whole preparation Pipeline and obtain a DataFrame that is ready for the model:

In [None]:
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dataset)
preppedDataDF = pipelineModel.transform(dataset)

Check the prepared DataFrame:

In [None]:
preppedDataDF.show(5)

+----+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+--------------+-----------------+--------------+-----------------+-------------------+----------------------+---------------+------------------+-----------------+--------------------+---------+-------------+--------+-------------+-------------------+----------------------+-----+--------------------+
| age|        workclass|  fnlwgt| education|education_num|     marital_status|        occupation|  relationship|  race|    sex|capital_gain|capital_loss|hours_per_week|native_country|income|workclassIndex|workclassclassVec|educationIndex|educationclassVec|marital_statusIndex|marital_statusclassVec|occupationIndex|occupationclassVec|relationshipIndex|relationshipclassVec|raceIndex| raceclassVec|sexIndex|  sexclassVec|native_countryIndex|native_countryclassVec|label|            features|
+----+------------

We keep only the columns we are interested in (the original ones plus "label" and "features", which are the two the models need):

In [None]:
selectedcols = ["label", "features"] + cols
dataset = preppedDataDF.select(selectedcols)
dataset.show(5)

+-----+--------------------+----+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
|label|            features| age|        workclass|  fnlwgt| education|education_num|     marital_status|        occupation|  relationship|  race|    sex|capital_gain|capital_loss|hours_per_week|native_country|income|
+-----+--------------------+----+-----------------+--------+----------+-------------+-------------------+------------------+--------------+------+-------+------------+------------+--------------+--------------+------+
|  0.0|(100,[4,10,24,32,...|39.0|        State-gov| 77516.0| Bachelors|         13.0|      Never-married|      Adm-clerical| Not-in-family| White|   Male|      2174.0|         0.0|          40.0| United-States| <=50K|
|  0.0|(100,[1,10,23,31,...|50.0| Self-emp-not-inc| 83311.0| Bachelors|         13.0| Married-civ-spouse|   Exec-managerial|    
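
The `features` column above is a sparse vector of size 100. The following back-of-the-envelope check shows where that number can come from. The distinct-value counts are an assumption: they are taken from the UCI description of the dataset, counting `?` as one extra category in the three columns that have missing values. They are not computed in this notebook, but they are consistent with the size shown in the output.

```python
# Distinct values per categorical column (ASSUMED, see the text above).
distinct_values = {
    "workclass": 9, "education": 16, "marital_status": 7, "occupation": 15,
    "relationship": 6, "race": 5, "sex": 2, "native_country": 42,
}
numeric_cols = ["age", "fnlwgt", "education_num",
                "capital_gain", "capital_loss", "hours_per_week"]

# OneHotEncoder drops the last category by default, so each column adds n - 1 slots.
one_hot_slots = sum(n - 1 for n in distinct_values.values())
total = one_hot_slots + len(numeric_cols)
print(one_hot_slots, total)  # 94 100
```

In the notebook itself, the counts could be verified with `dataset.select(col).distinct().count()` for each categorical column.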

Split the DataFrame into train and test sets:

In [None]:
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count())
print(testData.count())

22832
9729
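
The two counts add up to 32561 rows, but the split is not exactly 70/30. That is expected: `randomSplit` samples rows at random, so the weights are target proportions rather than exact sizes. The sketch below imitates that idea in plain Python; it is a simplification, not Spark's algorithm, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test with one random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(32561))          # same number of rows as the dataset
train, test = random_split(rows, 0.7, seed=100)
print(len(train), len(test))       # close to, but not exactly, 70 % / 30 %

# Share of training rows in the notebook's own split.
print(round(22832 / (22832 + 9729), 4))  # 0.7012
```

Fixing the seed makes the split reproducible, which is why the notebook passes `seed=100`.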


# Logistic Regression

## Initial version

In [None]:
from pyspark.ml.classification import LogisticRegression
 
# Create an initial model
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)
 
# Train the model on the training data
lrModel = lr.fit(trainingData)

# Predict on the test data with the transform() method.
# LogisticRegressionModel.transform() only needs the 'features' column.
predictions = lrModel.transform(testData)

# Show the model output (prediction and probability of each class)
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
# NOTE: any other columns could have been selected as well.
selected.show(5)

+-----+----------+--------------------+----+---------------+
|label|prediction|         probability| age|     occupation|
+-----+----------+--------------------+----+---------------+
|  0.0|       1.0|[0.16304404160706...|36.0| Prof-specialty|
|  0.0|       0.0|[0.70118653255393...|32.0| Prof-specialty|
|  0.0|       1.0|[0.49801131876699...|33.0| Prof-specialty|
|  0.0|       0.0|[0.68126165186417...|39.0| Prof-specialty|
|  0.0|       0.0|[0.61086205071159...|39.0| Prof-specialty|
+-----+----------+--------------------+----+---------------+
only showing top 5 rows
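
In the table above, `probability` is a vector `[P(label = 0), P(label = 1)]`, and only its first component fits in the column. The sketch below reproduces the `prediction` column from those values, assuming the default decision threshold of 0.5 on `P(label = 1)`. The function name is made up for the example and the probabilities are rounded copies of the values shown.

```python
def predict_from_probability(probability, threshold=0.5):
    """Return 1.0 when P(label = 1) exceeds the threshold, else 0.0."""
    p_one = probability[1]
    return 1.0 if p_one > threshold else 0.0

# First components copied from the table above (rounded); the second
# component is 1 minus the first because there are only two classes.
p_zero_values = [0.1630, 0.7012, 0.4980, 0.6813, 0.6109]
predictions_sketch = [predict_from_probability([p, 1.0 - p]) for p in p_zero_values]
print(predictions_sketch)  # [1.0, 0.0, 1.0, 0.0, 0.0]
```

The third row is a borderline case: `P(label = 0)` is about 0.498, so `P(label = 1)` is just above 0.5 and the row is predicted as 1.0.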



To evaluate the model we can use *BinaryClassificationEvaluator*:

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
 
# Create the evaluator
evaluator = BinaryClassificationEvaluator()

By default this evaluator uses the area under the ROC curve (*areaUnderROC*, often just called AUC), but we could make it use *areaUnderPR* as follows:
`evaluator.setMetricName("areaUnderPR")`

In [None]:
evaluator.getMetricName()

'areaUnderROC'
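
One way to read `areaUnderROC` is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on toy data. Spark builds the ROC curve and integrates it instead, so this is an illustration of the metric, not of the evaluator's implementation.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # toy scores, not from the notebook
print(area_under_roc(labels, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for the values of about 0.9 reported in this notebook.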

Evaluate the predictions:

In [None]:
evaluator.evaluate(predictions)

0.8993574699928725

## Tuning

We are going to tune the model using *ParamGridBuilder* and *CrossValidator*.

To find out which parameters of this model can be changed, we use `explainParams()`:

In [None]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The

If we use three values for *regParam*, three for *maxIter*, and three for *elasticNetParam*, the grid has 3 x 3 x 3 = 27 parameter combinations for the *CrossValidator* to try, and each one is trained once per fold.

**This will take a long time on a single machine.**

For testing we can reduce it to 2 x 2 = 4 combinations:
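
To make the cost concrete, this small sketch counts the combinations in each grid and the number of models a 5-fold cross-validation would train. The value lists are the ones used in the notebook's cell; the `+ 1` assumes that the best combination is refit once on the whole training set at the end, and `count_fits` is a hypothetical helper.

```python
from itertools import product

full_grid = {
    "regParam": [0.01, 0.5, 2.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [1, 5, 10],
}
small_grid = {
    "regParam": [0.01, 0.5],
    "elasticNetParam": [0.0, 0.5],
}

def count_fits(grid, num_folds):
    """Combinations in the grid and models trained during cross-validation."""
    combinations = list(product(*grid.values()))
    # One fit per combination and fold, plus a final refit of the best one.
    return len(combinations), len(combinations) * num_folds + 1

print(count_fits(full_grid, 5))   # (27, 136)
print(count_fits(small_grid, 5))  # (4, 21)
```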

In [None]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
 
# Build the ParamGrid that the CrossValidator will use
# This is the simplified version, so that it runs quickly
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5])
             .addGrid(lr.elasticNetParam, [0.0, 0.5])
             .build())

# A more complete version, which takes too long without a cluster
# paramGrid = (ParamGridBuilder()
#              .addGrid(lr.regParam, [0.01, 0.5, 2.0])
#              .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
#              .addGrid(lr.maxIter, [1, 5, 10])
#              .build())

# Create a 5-fold CrossValidator
cv = CrossValidator(estimator=lr, 
                    estimatorParamMaps=paramGrid, 
                    evaluator=evaluator, 
                    numFolds=5)
 
# Run the CrossValidator (with the 5 folds and the ParamGrid)
cvModel = cv.fit(trainingData)
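
Conceptually, the cross-validation step does something like the following sketch: split the rows into k folds, score every candidate on each held-out fold, and keep the candidate with the best average. Everything here is a simplification. Spark assigns rows to folds at random rather than by position, and `toy_score` stands in for "fit on the training folds, compute the AUC on the validation fold" using made-up numbers.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs; fold i validates rows i, i+k, ..."""
    for fold in range(k):
        validation = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, validation

def cross_validate(candidates, score, n_rows, k):
    """Return the candidate with the best average validation score."""
    averages = {}
    for name in candidates:
        fold_scores = [score(name, train, validation)
                       for train, validation in k_fold_indices(n_rows, k)]
        averages[name] = sum(fold_scores) / k
    return max(averages, key=averages.get), averages

# Hypothetical scores, one per candidate (not results from this notebook).
toy_scores = {"regParam=0.01": 0.90, "regParam=0.5": 0.85}

def toy_score(name, train_idx, validation_idx):
    return toy_scores[name]

best, averages = cross_validate(toy_scores, toy_score, n_rows=10, k=5)
print(best)  # regParam=0.01
```

Every row is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out set would.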

We use the new model to predict on the test data and measure its area under the ROC curve:

In [None]:
# Use the test data to create a new set of predictions
# cvModel uses the best model found during cross-validation
predictions = cvModel.transform(testData)

# And evaluate the predictions
evaluator.evaluate(predictions)

0.8977643264031809

We can look at the coefficient weights and the intercept of the model:

In [None]:
print('Model Intercept: ', cvModel.bestModel.intercept)

Model Intercept:  -1.3832039720849316


In [None]:
weights = cvModel.bestModel.coefficients
weights = [(float(w),) for w in weights]  # convert numpy type to float, and to tuple
weightsDF = spark.createDataFrame(weights, ["Feature Weight"])
weightsDF.show()

+--------------------+
|      Feature Weight|
+--------------------+
|  -0.281603275704831|
| -0.6264483359096494|
| -0.4360275569860984|
| -0.5064247711709583|
|  -0.506326689118052|
|-0.00494814417175...|
| 0.07086989623963032|
|   -2.66978938102928|
| -0.5593567014148134|
|-0.22394378958134853|
|  0.5737091727046981|
|  0.8976634297736545|
|-0.02732965272147547|
| -1.2761244527152253|
|-0.04222024367536...|
| -1.2432064202100166|
| -1.7513331537893073|
|   1.269765487533909|
| -1.4918049429191638|
| -0.7975316495227874|
+--------------------+
only showing top 20 rows
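
As a hedged aid for reading these numbers: in logistic regression the intercept and the coefficients live on the log-odds scale. The sketch below converts the intercept printed above into a probability, and one coefficient copied from the table into an odds ratio. The baseline is the model's `P(label = 1)` for a row whose feature vector is all zeros, which is a reference point rather than a realistic person.

```python
import math

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

intercept = -1.3832039720849316    # value printed above
baseline = sigmoid(intercept)
print(round(baseline, 4))          # 0.2005

coefficient = 0.5737091727046981   # one weight copied from the table above
odds_ratio = math.exp(coefficient)
print(round(odds_ratio, 2))        # 1.77
```

An odds ratio above 1 means that the feature raises the odds of the positive class when it increases by one unit, and a value below 1 means it lowers them.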



Finally, we can take a look at the predictions:

In [None]:
# Show the predictions of the best model found during cross-validation
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
selected.show()

+-----+----------+--------------------+----+---------------+
|label|prediction|         probability| age|     occupation|
+-----+----------+--------------------+----+---------------+
|  0.0|       1.0|[0.23296419268391...|36.0| Prof-specialty|
|  0.0|       0.0|[0.65520667452462...|32.0| Prof-specialty|
|  0.0|       0.0|[0.53910224525061...|33.0| Prof-specialty|
|  0.0|       0.0|[0.63734169446424...|39.0| Prof-specialty|
|  0.0|       0.0|[0.60160343580997...|39.0| Prof-specialty|
|  0.0|       0.0|[0.59477679040631...|50.0| Prof-specialty|
|  0.0|       0.0|[0.58993767395651...|51.0| Prof-specialty|
|  0.0|       0.0|[0.59763736817137...|60.0| Prof-specialty|
|  0.0|       0.0|[0.69074556096520...|34.0| Prof-specialty|
|  0.0|       0.0|[0.95784002797462...|20.0| Prof-specialty|
|  0.0|       1.0|[0.46810434624054...|35.0| Prof-specialty|
|  0.0|       0.0|[0.52250674393838...|42.0| Prof-specialty|
|  0.0|       0.0|[0.55520732570780...|43.0| Prof-specialty|
|  0.0|       0.0|[0.671

# Proposed exercise: Random Forest

Train a *RandomForestClassifier* and check whether its metrics are better than those of the *LogisticRegression*.

1.   First, an initial version of the Random Forest
2.   Then, try hyperparameter tuning



## Initial version

In [None]:
from pyspark.ml.classification import RandomForestClassifier
 
# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
 
# And train it on the training data
rfModel = rf.fit(trainingData)

In [None]:
# Make the predictions with the .transform() method
predictions = rfModel.transform(testData)

In [None]:
# Show the predictions to get a feel for them
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
selected.show()

+-----+----------+--------------------+----+---------------+
|label|prediction|         probability| age|     occupation|
+-----+----------+--------------------+----+---------------+
|  0.0|       1.0|[0.46234636822485...|36.0| Prof-specialty|
|  0.0|       0.0|[0.63502656323154...|32.0| Prof-specialty|
|  0.0|       0.0|[0.62414586540324...|33.0| Prof-specialty|
|  0.0|       0.0|[0.63502656323154...|39.0| Prof-specialty|
|  0.0|       0.0|[0.61375922551731...|39.0| Prof-specialty|
|  0.0|       0.0|[0.63502656323154...|50.0| Prof-specialty|
|  0.0|       0.0|[0.63502656323154...|51.0| Prof-specialty|
|  0.0|       0.0|[0.63502656323154...|60.0| Prof-specialty|
|  0.0|       0.0|[0.63502656323154...|34.0| Prof-specialty|
|  0.0|       0.0|[0.73466620709577...|20.0| Prof-specialty|
|  0.0|       0.0|[0.63221969275743...|35.0| Prof-specialty|
|  0.0|       0.0|[0.63221969275743...|42.0| Prof-specialty|
|  0.0|       0.0|[0.63221969275743...|43.0| Prof-specialty|
|  0.0|       0.0|[0.632
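
Many rows above share exactly the same probability. A likely reason is that a forest's probability is an average over its trees, and each tree returns one fixed value per leaf, so rows that fall into the same leaves get identical outputs. The sketch below shows that averaging with made-up per-tree values; it is a simplified picture, not Spark's code.

```python
def forest_probability(tree_probabilities):
    """Average the per-tree class probability vectors."""
    num_trees = len(tree_probabilities)
    num_classes = len(tree_probabilities[0])
    return [sum(tree[c] for tree in tree_probabilities) / num_trees
            for c in range(num_classes)]

# Hypothetical outputs of three trees for one row: [P(label 0), P(label 1)].
trees = [[0.8, 0.2], [0.6, 0.4], [0.4, 0.6]]
probability = forest_probability(trees)
prediction = float(probability.index(max(probability)))
print([round(p, 2) for p in probability], prediction)  # [0.6, 0.4] 0.0
```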

Evaluate the Random Forest model using a *BinaryClassificationEvaluator*:

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
 
# Evaluate the model
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)

0.8853737552309299

## Tuning

We are going to tune the model with *ParamGridBuilder* and *CrossValidator*.

Three values for *maxDepth*, two values for *maxBins*, and two values for *numTrees*. The parameter grid has 3 x 2 x 2 = 12 parameter combinations for the *CrossValidator*.

In [None]:
# Build the ParamGrid
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
 
paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 4, 6])
             .addGrid(rf.maxBins, [20, 60])
             .addGrid(rf.numTrees, [5, 20])
             .build())

# Create the CrossValidator (5-fold)
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Train the model (this can take a while depending on the number of combinations)
cvModel = cv.fit(trainingData)

Evaluate the model:

In [None]:
# First, predict on the test data
# cvModel uses the best model found during cross-validation
predictions = cvModel.transform(testData)

# Then evaluate the model on those predictions
evaluator.evaluate(predictions)

0.8936864692496512
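
To answer the question posed by the exercise, the four test-set results reported in this notebook can be put side by side. The values below are copied from the outputs above; the labels are just descriptive names chosen for this summary.

```python
# Area under the ROC curve on the test set, copied from the outputs above.
results = {
    "LogisticRegression (initial)": 0.8993574699928725,
    "LogisticRegression (tuned)": 0.8977643264031809,
    "RandomForest (initial)": 0.8853737552309299,
    "RandomForest (tuned)": 0.8936864692496512,
}

# Print the models from best to worst.
for name, auc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:30s} {auc:.4f}")

best = max(results, key=results.get)
print("Best:", best)  # Best: LogisticRegression (initial)
```

In this run, tuning improved the Random Forest but it still scored slightly below both Logistic Regression models, and all four values lie within about 0.014 of each other.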

# Spark Stop

In [None]:
spark.stop()