# Spark MLlib

### Classification Exercise

## Problem

Now that we have seen the MLlib API in action, let's try it on a supervised classification problem.

We will work with a well-known statistical dataset called **Adult**. The task is to predict, from a set of census attributes, whether an individual's annual income exceeds 50K.

The original description reads as follows:

> The Adult dataset we are going to use is publicly available at the UCI Machine Learning Repository. This data derives from census data, and consists of information about 48842 individuals and their annual income. We will use this information to predict if an individual earns >50K a year or <=50K a year. The dataset is rather clean, and consists of both numeric and categorical variables.

**Attribute Information:**
- age: continuous
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
- fnlwgt: continuous
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc...
- education-num: continuous
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent...
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners...
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
- sex: Female, Male
- capital-gain: continuous
- capital-loss: continuous
- hours-per-week: continuous
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany...
- Target/Label: <=50K, >50K

### Tasks:

- Index the categories: convert the categorical variables to numeric ones with `StringIndexer`
- Assemble the list of features into a single vector
- Train two algorithms (`LogisticRegression` and `DecisionTreeClassifier`) using cross-validation
  - Use the `BinaryClassificationEvaluator` evaluator for this
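
To make the first task concrete, here is a minimal plain-Python sketch of the mapping that `StringIndexer` learns with its default ordering (most frequent label first). It is only an illustration, not Spark's implementation: the function name `fit_string_index` and the sample values are hypothetical, and ties are broken alphabetically here simply to keep the result deterministic.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample of a categorical column
workclass = ["Private", "Private", "State-gov", "Private", "Self-emp-inc", "State-gov"]

mapping = fit_string_index(workclass)
indexed = [mapping[v] for v in workclass]

print(mapping)
print(indexed)
```

The most frequent category receives index 0.0, so the numeric values carry frequency rank, not any natural ordering of the categories.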


## Loading the data:

We load the dataset with `spark.read.csv`.


In [None]:
ls /opt/spark-data/

In [None]:
dfAdult = spark.read.csv("/opt/spark-data/adult.data", inferSchema=True)

# Give every column its proper name
dfAdult = dfAdult.toDF("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country" , "income")

cols = dfAdult.columns
dfAdult.printSchema()

In [None]:
dfAdult.show()

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
stages = []  # Pipeline stages

# Index the categorical columns with StringIndexer

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # One-hot encode each index into a sparse binary vector
    # (in Spark 3.0+ this class is named OneHotEncoder)
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]
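
As a rough picture of what the encoder stage produces, the sketch below one-hot encodes a category index in plain Python. It assumes the encoder's default `dropLast=True` behaviour, where the last category is represented by an all-zero vector; the helper name `one_hot` is hypothetical, and Spark returns sparse vectors where this sketch uses dense lists.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[int(index)] = 1.0
    return vector


first = one_hot(0, 3)                          # most frequent category
last = one_hot(2, 3)                           # last category, dropped slot
last_kept = one_hot(2, 3, drop_last=False)     # same index, keeping every slot

print(first, last, last_kept)
```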


In [None]:
# Convert the class column into 'label' with StringIndexer

label_stringIdx = StringIndexer(inputCol="income", outputCol="label")
stages += [label_stringIdx]

In [None]:
# Combine the features into a single vector with VectorAssembler

numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]

# Note: the indexed columns are used here; the one-hot "classVec" columns built above are not part of the feature vector
assemblerInputs = [c + "Index" for c in categoricalColumns] + numericCols

assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

In [None]:
from pyspark.ml.classification import LogisticRegression
  
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dfAdult)
preppedDataDF = pipelineModel.transform(dfAdult)

# Train the model on the prepared data

lrModel = LogisticRegression().fit(preppedDataDF)

# ROC curve on the training data

lrModel.summary.roc.show()

display(lrModel, preppedDataDF)

# Keep relevant columns

selectedcols = ["label", "features"] + cols
dataset = preppedDataDF.select(selectedcols)
display(dataset)

# Split the dataset: 70% for training and 30% for testing.

dfTrain, dfTest = dataset.randomSplit([0.7, 0.3], seed=1234)
print("We have %d training rows and %d test rows." % (dfTrain.count(), dfTest.count()))
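
The row counts printed by a weighted random split are only approximately 70/30, because each row is assigned independently at random. The sketch below mimics that idea in plain Python; it is an assumption-laden illustration (the function `random_split` is hypothetical and is not Spark's implementation), meant only to show why the sizes vary and why the seed makes the split repeatable.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of the normalised weights, e.g. [0.7, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding leftovers
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(list(range(1000)), [0.7, 0.3], seed=1234)
print(len(train), len(test))
```

Every row lands in exactly one split, and rerunning with the same seed reproduces the same partition.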

In [None]:
from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)

# Train model with Training Data
lrModel = lr.fit(dfTrain)

In [None]:
# Make predictions on test data using the fitted model's transform() method.
# lrModel.transform() will only use the 'features' column.

predictions = lrModel.transform(dfTest)

# View model's predictions and probabilities of each prediction class
# You can select any columns in the above schema to view as well.
# For example's sake we will choose age & occupation

selected = predictions.select("label", "prediction", "probability", "age", "occupation")
selected.show()
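
The `rawPrediction`, `probability` and `prediction` columns of a binary logistic regression are related in a simple way, sketched below in plain Python. The sketch assumes the default threshold of 0.5 and takes the linear margin as given; the function name `logistic_outputs` is hypothetical and this is not Spark code.

```python
import math


def logistic_outputs(margin, threshold=0.5):
    """Derive the three output columns from the linear margin w.x + b."""
    p1 = 1.0 / (1.0 + math.exp(-margin))       # probability of class 1
    raw_prediction = [-margin, margin]         # one raw score per class
    probability = [1.0 - p1, p1]               # one probability per class
    prediction = 1.0 if p1 > threshold else 0.0
    return raw_prediction, probability, prediction


raw, prob, pred = logistic_outputs(2.0)
print(raw, prob, pred)
```

Lowering the threshold makes the model predict class 1 more often, without changing the probabilities themselves.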

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate the model (the default metric is the area under the ROC curve)

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)
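
The number returned by the evaluator is, by default, the area under the ROC curve. One way to read it: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes the metric from that pairwise definition; it is not how Spark computes it internally, and the labels and scores are made-up illustration data.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.8]

auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why this metric does not depend on the classification threshold.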

In [None]:
print(lr.explainParams())

# Decision Trees

Now let's use another classification algorithm: decision trees.



In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

# Create the model.
# maxBins must be at least the number of categories of any categorical feature.

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=20, maxBins=500)

# Train the model

dtModel = dt.fit(dfTrain)

In [None]:
# We can inspect the tree's number of nodes and its depth

print("numNodes = ", dtModel.numNodes)
print("depth = ", dtModel.depth)

display(dtModel)

In [None]:
# Make predictions on the test data with the fitted model's transform().

predictions = dtModel.transform(dfTest)
predictions.printSchema()

# Inspect the predictions and the probabilities for each class

selected = predictions.select("label", "prediction", "probability", "age", "occupation")
selected.show()

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate the decision tree (area under the ROC curve by default, not accuracy)

evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)
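
The task list mentions cross-validation, which the cells above do not exercise: both models are scored on a single train/test split. In Spark this role is played by `CrossValidator` together with `ParamGridBuilder` from `pyspark.ml.tuning`. The plain-Python sketch below only illustrates the underlying idea (build k folds, hold each one out in turn, average the metric). The names `k_fold_indices`, `cross_validate` and `dummy_fit_and_score` are hypothetical, and the scoring function is a stand-in for "fit on the training rows, evaluate AUC on the validation rows".

```python
import random


def k_fold_indices(n_rows, k, seed):
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validate(n_rows, k, seed, fit_and_score):
    """Average the metric returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_rows, k, seed)]
    return sum(scores) / len(scores)


def dummy_fit_and_score(train_idx, val_idx):
    # Placeholder metric (share of rows used for training); a real version
    # would train a model on train_idx and evaluate it on val_idx.
    return len(train_idx) / (len(train_idx) + len(val_idx))


mean_score = cross_validate(10, 5, seed=1234, fit_and_score=dummy_fit_and_score)
print(mean_score)
```

Each row is used for validation exactly once, so the averaged metric is less sensitive to one lucky or unlucky split than the single 70/30 evaluation.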