<a href="https://colab.research.google.com/github/robertoarturomc/ProgramacionConcurrente/blob/main/28_Machine_Learning_Spark_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Concurrent Programming
## 28. Machine Learning with Spark II


In [48]:
import pandas as pd
import numpy as np
import seaborn as sns


from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

In [None]:
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark

In [5]:
titanic = sns.load_dataset("titanic")
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [6]:
df = spark.createDataFrame(titanic)
df.show(5)

+--------+------+------+----+-----+-----+-------+--------+-----+-----+----------+----+-----------+-----+-----+
|survived|pclass|   sex| age|sibsp|parch|   fare|embarked|class|  who|adult_male|deck|embark_town|alive|alone|
+--------+------+------+----+-----+-----+-------+--------+-----+-----+----------+----+-----------+-----+-----+
|       0|     3|  male|22.0|    1|    0|   7.25|       S|Third|  man|      true| NaN|Southampton|   no|false|
|       1|     1|female|38.0|    1|    0|71.2833|       C|First|woman|     false|   C|  Cherbourg|  yes|false|
|       1|     3|female|26.0|    0|    0|  7.925|       S|Third|woman|     false| NaN|Southampton|  yes| true|
|       1|     1|female|35.0|    1|    0|   53.1|       S|First|woman|     false|   C|Southampton|  yes|false|
|       0|     3|  male|35.0|    0|    0|   8.05|       S|Third|  man|      true| NaN|Southampton|   no| true|
+--------+------+------+----+-----+-----+-------+--------+-----+-----+----------+----+-----------+-----+-----+

As you may recall, the goal with the `titanic` dataset is to predict, from the information provided, whether a person survived; that is, whether `alive == "yes"` (equivalently, `survived == 1`). Let's build a simple model that predicts this.

In [16]:
# Check whether any columns contain nulls, empty strings, or NaN

df.select([
    F.count(F.when(F.col(c).isNull() | (F.col(c) == "") | F.isnan(F.col(c)), c)).alias(c)
    for c in df.columns
]).show()

+--------+------+---+---+-----+-----+----+--------+-----+---+----------+----+-----------+-----+-----+
|survived|pclass|sex|age|sibsp|parch|fare|embarked|class|who|adult_male|deck|embark_town|alive|alone|
+--------+------+---+---+-----+-----+----+--------+-----+---+----------+----+-----------+-----+-----+
|       0|     0|  0|177|    0|    0|   0|       2|    0|  0|         0| 688|          2|    0|    0|
+--------+------+---+---+-----+-----+----+--------+-----+---+----------+----+-----------+-----+-----+



It failed! `F.isnan` only works on numeric columns, and `adult_male` and `alone` are booleans. Let's fix the data types first...

In [15]:
df = df.withColumn("adult_male", F.col("adult_male").cast(DoubleType()))\
        .withColumn("alone", F.col("alone").cast(DoubleType()))

In [17]:
# Now we can count the nulls...

df.select([
    F.count(F.when(F.col(c).isNull() | (F.col(c) == "") | F.isnan(F.col(c)), c)).alias(c)
    for c in df.columns
]).show()

+--------+------+---+---+-----+-----+----+--------+-----+---+----------+----+-----------+-----+-----+
|survived|pclass|sex|age|sibsp|parch|fare|embarked|class|who|adult_male|deck|embark_town|alive|alone|
+--------+------+---+---+-----+-----+----+--------+-----+---+----------+----+-----------+-----+-----+
|       0|     0|  0|177|    0|    0|   0|       2|    0|  0|         0| 688|          2|    0|    0|
+--------+------+---+---+-----+-----+----+--------+-----+---+----------+----+-----------+-----+-----+
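
The check above combines three notions of "missing": null, empty string, and NaN. As a plain-Python sketch of the same logic (not Spark code; `is_missing`, `count_missing`, and the sample rows are hypothetical names and data made up for illustration), the NaN test is guarded by a type check so that booleans never reach it, which is roughly the role the cast to `DoubleType` plays above.

```python
import math


def is_missing(value):
    """Return True for None, empty strings, and float NaN."""
    if value is None:
        return True
    if isinstance(value, str):
        return value == ""
    # bool is not a float, so booleans never reach math.isnan
    return isinstance(value, float) and math.isnan(value)


def count_missing(rows):
    """Count missing values per column for a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + int(is_missing(value))
    return counts


# Hypothetical rows, loosely shaped like the titanic data
rows = [
    {"age": 22.0, "embarked": "S", "alone": False},
    {"age": float("nan"), "embarked": "", "alone": True},
    {"age": None, "embarked": "C", "alone": True},
]
missing_counts = count_missing(rows)
print(missing_counts)
```

Here `age` has two missing entries (one NaN, one `None`), `embarked` has one (the empty string), and `alone` has none.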



### Data Cleaning

In [25]:
# Drop rows with missing values, plus columns that won't be used
# (`deck` is mostly missing; `alive` duplicates the label `survived`)

df = df.dropna(subset=["age", "embarked"])\
        .drop('deck', 'alive')

df.columns

['survived',
 'pclass',
 'sex',
 'age',
 'sibsp',
 'parch',
 'fare',
 'embarked',
 'class',
 'who',
 'adult_male',
 'embark_town',
 'alone']

### Feature Engineering

In [43]:
# Index the categorical variables
sex_indexer = StringIndexer(inputCol="sex", outputCol="sex_indexed")
embarked_indexer = StringIndexer(inputCol="embarked", outputCol="embarked_indexed")

# One-Hot Encoding
encoder = OneHotEncoder(
    inputCols=["sex_indexed", "embarked_indexed"],
    outputCols=["sex_vec", "embarked_vec"]
)

# Assemble all the features into a single vector
assembler = VectorAssembler(
    inputCols=["pclass", "age", "sibsp", "parch", "fare", "sex_vec", "embarked_vec"],
    outputCol="features"
)

# Optional scaling
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
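
To make the indexing and encoding stages less of a black box, here is a plain-Python sketch of what they compute with their default settings, as I understand them: `StringIndexer` assigns index 0 to the most frequent label, and `OneHotEncoder` drops the last category, so three categories become a vector of length two. The helper names `fit_string_index` and `one_hot` and the sample list are hypothetical, and the sketch uses dense lists where Spark produces sparse vectors.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


# Hypothetical sample of the `embarked` column
embarked = ["S", "C", "S", "Q", "S", "C"]
mapping = fit_string_index(embarked)
encoded = {label: one_hot(idx, len(mapping)) for label, idx in mapping.items()}
print(mapping)
print(encoded)
```

In this sample `S` is the most frequent label, so it gets index 0 and the vector `[1.0, 0.0]`; `Q`, the rarest, is the dropped category and is encoded as all zeros.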

### Training and Evaluation

In [45]:
lr = LogisticRegression(featuresCol="scaledFeatures", labelCol="survived", maxIter=50)
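
The `scaledFeatures` column that the model reads comes from `StandardScaler`. With its default settings (`withStd=True`, `withMean=False`), it divides each feature by its sample standard deviation without centering it. The following plain-Python sketch shows that computation for a single column; `scale_to_unit_std` and the sample values are hypothetical, and returning zeros for a constant column is an assumption of this sketch.

```python
import statistics


def scale_to_unit_std(column):
    """Divide each value by the sample standard deviation (no centering)."""
    std = statistics.stdev(column)  # sample std: n - 1 in the denominator
    if std == 0:
        # Assumption for this sketch: a constant column is mapped to zeros
        return [0.0 for _ in column]
    return [value / std for value in column]


# Hypothetical fares
fares = [2.0, 4.0, 6.0]
scaled = scale_to_unit_std(fares)
print(scaled)
```

The values keep their relative spacing but are no longer centred on zero; passing `withMean=True` to the real `StandardScaler` would also subtract the mean.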

In [46]:
pipeline = Pipeline(stages=[
    sex_indexer, embarked_indexer, encoder, assembler, scaler, lr
])

In [47]:
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

pred = model.transform(test)
pred.select("survived", "probability", "prediction").show(5)

+--------+--------------------+----------+
|survived|         probability|prediction|
+--------+--------------------+----------+
|       0|[0.42226629210051...|       1.0|
|       0|[0.18437258172412...|       1.0|
|       0|[0.42228128223959...|       1.0|
|       0|[0.44399299840241...|       1.0|
|       0|[0.64575304232344...|       0.0|
+--------+--------------------+----------+
only showing top 5 rows
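
When reading the `probability` column, keep in mind that the first element of each vector is the estimated probability of label 0 and the second is the probability of label 1. A minimal sketch of how the `prediction` column follows from it, assuming the default decision threshold of 0.5 (the function name and the rounded probability vectors are hypothetical):

```python
def predict_from_probability(probability, threshold=0.5):
    """probability = [P(label=0), P(label=1)]; predict 1.0 when P(label=1) exceeds the threshold."""
    return 1.0 if probability[1] > threshold else 0.0


# Hypothetical, rounded probability vectors
rows = [[0.42, 0.58], [0.18, 0.82], [0.65, 0.35]]
predictions = [predict_from_probability(p) for p in rows]
print(predictions)  # [1.0, 1.0, 0.0]
```

This is why a row whose vector starts with 0.42 is predicted as `1.0`: the probability of survival is the remaining 0.58.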



In [50]:
evaluator = BinaryClassificationEvaluator(
    labelCol="survived", metricName="areaUnderROC"
)
roc_auc = evaluator.evaluate(pred)
print(f"AUC: {roc_auc:.3f}")


AUC: 0.858
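
One way to read the AUC: it is the probability that a randomly chosen survivor receives a higher score than a randomly chosen non-survivor. The sketch below computes it that way in plain Python by comparing every positive/negative pair. This is an illustration of the definition, not how `BinaryClassificationEvaluator` computes it internally; `roc_auc` and the toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC: {auc:.3f}")  # AUC: 0.750
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.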


### Homework (submit on Blackboard)

Build an ML model, entirely with `spark.ml`, that meets these requirements:

- Read the dataset provided below and build a model that predicts `class` as well as possible:
  - Create a `spark.ml` pipeline.
  - Add whatever Feature Engineering and Data Cleaning you can think of. Get creative!
  - Pick the ML algorithm of your choice.
  - Optional: if your hyperparameter combination isn't working, try using `CrossValidator` to select them.

Note: your grade will be based on the AUC you obtain on a test dataset.
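
For the optional step, it may help to see the idea behind k-fold cross-validation in isolation: split the rows into k folds, validate on each fold once while training on the rest, average the metric, and keep the hyperparameter value with the best average. The plain-Python sketch below illustrates only that idea. `k_fold_indices`, `select_best`, and `toy_score` are hypothetical, the folds are assigned round-robin rather than randomly, and `toy_score` merely stands in for "fit the pipeline on the training rows and compute AUC on the validation rows".

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs; each row is used for validation exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


def select_best(param_values, score_fn, n_rows, k=3):
    """Return the parameter value with the highest mean validation score."""
    best_param, best_score = None, float("-inf")
    for param in param_values:
        scores = [score_fn(param, train, valid)
                  for train, valid in k_fold_indices(n_rows, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_param, best_score = param, mean_score
    return best_param, best_score


def toy_score(reg_param, train_idx, valid_idx):
    """Hypothetical stand-in for training a model and evaluating it on the validation fold."""
    return 1.0 - abs(reg_param - 0.1)


best_param, best_score = select_best([0.01, 0.1, 1.0], toy_score, n_rows=9, k=3)
print(best_param, best_score)
```

In `spark.ml`, `CrossValidator` plays the role of `select_best`: it takes an estimator (for example, the whole pipeline), a grid of parameter maps, an evaluator, and a number of folds.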

In [51]:
from sklearn.datasets import fetch_openml

spark = SparkSession.builder.appName("AdultIncomeClassification").getOrCreate()

# Load the dataset from OpenML via scikit-learn
adult = fetch_openml("adult", version=2, as_frame=True)
pdf = adult.frame

# Convert to a Spark DataFrame
df = spark.createDataFrame(pdf)
df.show(5)

+---+---------+------+------------+-------------+------------------+-----------------+------------+-----+------+------------+------------+--------------+--------------+-----+
|age|workclass|fnlwgt|   education|education-num|    marital-status|       occupation|relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|class|
+---+---------+------+------------+-------------+------------------+-----------------+------------+-----+------+------------+------------+--------------+--------------+-----+
| 25|  Private|226802|        11th|            7|     Never-married|Machine-op-inspct|   Own-child|Black|  Male|           0|           0|            40| United-States|<=50K|
| 38|  Private| 89814|     HS-grad|            9|Married-civ-spouse|  Farming-fishing|     Husband|White|  Male|           0|           0|            50| United-States|<=50K|
| 28|Local-gov|336951|  Assoc-acdm|           12|Married-civ-spouse|  Protective-serv|     Husband|White|  Male|           0|