# Pyspark ML

<img src="spark_ml.JPG" height="300" width="300">

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

- **ML algorithms:** common learning algorithms such as classification, regression, clustering, and collaborative filtering
- **Featurization:** feature extraction, transformation, dimensionality reduction, and selection
- **Pipelines:** tools for constructing, evaluating, and tuning ML Pipelines
- **Persistence:** saving and loading algorithms, models, and Pipelines
- **Utilities:** linear algebra, statistics, data handling, etc.

# Import Spark

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

# Keep wide Spark tables on one line instead of wrapping them in the notebook
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

import pandas as pd

In [2]:
spark = (
    SparkSession.builder
        .appName("PysparkSession")
        .getOrCreate() ## Create Session
) 

# Correlation Matrix 

In [44]:
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

In [3]:
# Parquet files carry their own schema, so no "header" option is needed
df_auto = spark.read.parquet("./datasets/automobile_data.parquet")
df_auto.show(5)

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|        3|             null|alfa-romero|      gas|       std|         two|convertible|         rwd|          front|      88

In [36]:
# types_col = str(df_auto.dtypes).replace("(", "").replace(")", "")\
#                                .replace("'", "").replace("[", "")\
#                                .replace("]", "").split(",")

# types_col = list(map(lambda x: x.lstrip(), types_col))

# def Convert(lst):
#     res_dct = {lst[i]: lst[i + 1] for i in range(0, len(lst), 2)}
#     return res_dct

# df_types = pd.DataFrame([Convert(types_col)]).T.reset_index().rename(columns = {0: "Types", "index": "columns"})
# df_types_num = df_types.query("Types != 'string'")
# list_num = df_types_num["columns"].values.tolist()
# df_types_num.head()

In [43]:
df_types = pd.DataFrame(data = df_auto.dtypes).rename(columns = {0: "columns", 1: "Types"})
df_types_num = df_types.query("Types != 'string'")
list_num = df_types_num["columns"].values.tolist()
df_types_num.head()

Unnamed: 0,columns,Types
0,symboling,bigint
1,normalized-losses,double
9,wheel-base,double
10,length,double
11,width,double


In [56]:
df_auto_num = df_auto[list_num]
df_auto_num.show(3)

+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|symboling|normalized-losses|wheel-base|length|width|height|curb-weight|engine-size|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|
+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|        3|             null|      88.6| 168.8| 64.1|  48.8|       2548|        130|3.47|  2.68|              9.0|     111.0|  5000.0|      21|         27|13495.0|
|        3|             null|      88.6| 168.8| 64.1|  48.8|       2548|        130|3.47|  2.68|              9.0|     111.0|  5000.0|      21|         27|16500.0|
|        1|             null|      94.5| 171.2| 65.5|  52.4|       2823|        152|2.68|  3.47|              9.0|     154.0|  5000.0|      19|         26|16500.0|
+---------+-----

In [98]:
df_auto_num_drop = df_auto_num.na.drop()
df_auto_num_drop.show()

+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|symboling|normalized-losses|wheel-base|length|width|height|curb-weight|engine-size|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|
+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|        2|            164.0|      99.8| 176.6| 66.2|  54.3|       2337|        109|3.19|   3.4|             10.0|     102.0|  5500.0|      24|         30|13950.0|
|        2|            164.0|      99.4| 176.6| 66.4|  54.3|       2824|        136|3.19|   3.4|              8.0|     115.0|  5500.0|      18|         22|17450.0|
|        1|            158.0|     105.8| 192.7| 71.4|  55.7|       2844|        136|3.19|   3.4|              8.5|     110.0|  5500.0|      19|         25|17710.0|
|        1|     

## Vector Assembler

`VectorAssembler` is a transformer that combines a given list of columns into a single vector column.
It is useful for combining raw features and features generated by different feature transformers
into a single feature vector, in order to train ML models such as logistic regression
and decision trees. It accepts the following input column types: all numeric types, boolean type,
and vector type. In short, a `VectorAssembler` adds one column in which each row holds a vector
with the values of all the input columns.
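
To make the idea concrete, here is a minimal plain-Python sketch of what assembling does to a single row. It is not Spark's implementation: the `assemble_row` helper and the sample row are hypothetical, and only the `symboling` and `wheel-base` column names come from the dataset.

```python
from numbers import Number

def assemble_row(row, input_cols):
    """Flatten the selected columns of one row into a single list of floats.

    Hypothetical helper that mimics the idea behind VectorAssembler:
    numeric and boolean values become one entry each, and vector-like
    values (lists or tuples here) are expanded in place.
    """
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        elif isinstance(value, (bool, Number)):
            features.append(float(value))
        else:
            raise TypeError(f"unsupported type for column {name!r}: {type(value).__name__}")
    return features

# Sample row with illustrative values; "is_turbo" and "scaled" are made-up columns
row = {"symboling": 2, "wheel-base": 99.8, "is_turbo": False, "scaled": [0.5, 1.5]}
features = assemble_row(row, ["symboling", "wheel-base", "is_turbo", "scaled"])
print(features)
```

String columns are rejected in this sketch, which mirrors the list of accepted input types: categorical text has to be converted to numbers before it can be assembled.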



In [99]:
print(list_num)

['symboling', 'normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']


In [48]:
assembler = VectorAssembler(inputCols = list_num, outputCol = "features")
assembler

VectorAssembler_4905238424e2

In [60]:
df_auto_num_drop = assembler.transform(df_auto_num_drop)
df_auto_num_drop.show(5)

+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------------------+
|symboling|normalized-losses|wheel-base|length|width|height|curb-weight|engine-size|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|            features|
+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------------------+
|        2|            164.0|      99.8| 176.6| 66.2|  54.3|       2337|        109|3.19|   3.4|             10.0|     102.0|  5500.0|      24|         30|13950.0|[2.0,164.0,99.8,1...|
|        2|            164.0|      99.4| 176.6| 66.4|  54.3|       2824|        136|3.19|   3.4|              8.0|     115.0|  5500.0|      18|         22|17450.0|[2.0,164.0,99.4,1...|
|        1|            158.0|     105.8| 192.7| 71.4|  55.7|       2844|   

In [100]:
df_fil = df_auto_num_drop[list_num]
assembler = VectorAssembler(inputCols = list_num, outputCol = "features")
df_fil = assembler.transform(df_fil)

pearson_corr = Correlation.corr(df_fil, "features", "pearson").collect()[0][0]
print(pearson_corr)

DenseMatrix([[ 1.        ,  0.51838797, -0.52046477, -0.33621705, -0.21984964,
              -0.47399437, -0.25237234, -0.11023843, -0.25701277, -0.02053884,
              -0.13902179, -0.00366866,  0.19979781,  0.08891209,  0.14930948,
              -0.16332929],
             [ 0.51838797,  1.        , -0.06400101,  0.02911438,  0.1048565 ,
              -0.41708077,  0.12286025,  0.2038412 , -0.03616694,  0.06562699,
              -0.12997093,  0.29090559,  0.24067647, -0.23693364, -0.18969131,
               0.19992385],
             [-0.52046477, -0.06400101,  1.        ,  0.87196801,  0.81593501,
               0.55876376,  0.81050693,  0.6504878 ,  0.58048403,  0.16401196,
               0.2939676 ,  0.51450686, -0.29249053, -0.5766354 , -0.60826982,
               0.7347888 ],
             [-0.33621705,  0.02911438,  0.87196801,  1.        ,  0.83918412,
               0.50515596,  0.87035496,  0.72666638,  0.64905924,  0.11604912,
               0.18896778,  0.66672597, -0.2391
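
Each entry of the matrix is the Pearson correlation between two columns, in the order given by `list_num`, which is why the diagonal is 1 and the matrix is symmetric. As a rough illustration of the coefficient itself (not of Spark's distributed computation), here is a plain-Python sketch. The `pearson` helper is hypothetical and the two short lists are illustrative values only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Illustrative values: fuel economy tends to fall as engine size grows
engine_size = [109, 136, 136, 131, 108]
city_mpg = [24, 18, 19, 17, 23]

r = pearson(engine_size, city_mpg)
print(round(r, 3))
```

A value close to -1 or 1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.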

# Preprocessing

## Standard Scaler o Z-score

- Standardization transforms the data so that every column has a mean of zero and a standard deviation of one, which puts all columns on the same scale.
- Standardizing the input data for logistic regression is good practice, although in many cases it is not required. It can improve the algorithm's performance.
- The range of the scaled data depends on each variable; it is not fixed.
- It is typically preferred over min-max scaling when outliers should not dominate the result, although the mean and standard deviation are themselves still affected by outliers.
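
The transformation is `z = (x - mean) / std`, applied column by column. A minimal sketch using only the standard library follows; the `z_score` helper and the sample values are illustrative assumptions, not part of the notebook's data flow.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize values: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mu) / sigma for v in values]

# Illustrative horsepower values
horsepower = [102.0, 115.0, 110.0, 140.0, 101.0]
scaled = z_score(horsepower)

print([round(v, 3) for v in scaled])
# After scaling, the mean is 0 and the standard deviation is 1 (up to rounding error)
print(round(mean(scaled), 10), round(pstdev(scaled), 10))
```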

<img src="standard.PNG" height = "200" width = "200">

### Pandas

In [101]:
from sklearn.preprocessing import StandardScaler

In [105]:
df_auto_pd = spark.read.parquet("./datasets/automobile_data.parquet").toPandas()
df_auto_pd.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [108]:
df_auto_pd_num = df_auto_pd.select_dtypes(exclude = "object")
df_auto_pd_num = df_auto_pd_num.dropna()
df_auto_pd_num.head()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
3,2,164.0,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0
6,1,158.0,105.8,192.7,71.4,55.7,2844,136,3.19,3.4,8.5,110.0,5500.0,19,25,17710.0
8,1,158.0,105.8,192.7,71.4,55.9,3086,131,3.13,3.4,8.3,140.0,5500.0,17,20,23875.0
10,2,192.0,101.2,176.8,64.8,54.3,2395,108,3.5,2.8,8.8,101.0,5800.0,23,29,16430.0


In [109]:
scaler = StandardScaler()
scaler

In [115]:
df_auto_pd_num_scaler = scaler.fit_transform(df_auto_pd_num)
df_auto_pd_num_scaler

array([[ 1.06469263,  1.20312243,  0.30390372, ..., -0.41342425,
        -0.32219564,  0.43150225],
       [ 1.06469263,  1.20312243,  0.22619761, ..., -1.4031681 ,
        -1.56814855,  1.03025998],
       [ 0.22137173,  1.03406542,  1.46949528, ..., -1.23821079,
        -1.10091621,  1.07473913],
       ...,
       [-1.46527006, -0.74103326,  2.11057064, ..., -1.4031681 ,
        -1.41240444,  1.72054212],
       [-1.46527006, -0.74103326,  2.11057064, ..., -0.08350964,
        -0.78942798,  1.88904965],
       [-1.46527006, -0.74103326,  2.11057064, ..., -1.23821079,
        -1.10091621,  1.91556607]])

In [116]:
pd.DataFrame(df_auto_pd_num_scaler, columns = df_auto_pd_num.columns.tolist())

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,1.064693,1.203122,0.303904,0.371817,0.311066,0.185615,-0.255427,-0.332951,-0.406878,0.554700,-0.037497,0.200623,0.827343,-0.413424,-0.322196,0.431502
1,1.064693,1.203122,0.226198,0.371817,0.414111,0.185615,0.760441,0.557668,-0.406878,0.554700,-0.554245,0.626436,0.827343,-1.403168,-1.568149,1.030260
2,0.221372,1.034065,1.469495,1.770271,2.990229,0.802496,0.802161,0.557668,-0.406878,0.554700,-0.425058,0.462662,0.827343,-1.238211,-1.100916,1.074739
3,0.221372,1.034065,1.469495,1.770271,2.990229,0.890622,1.306966,0.392738,-0.632009,0.554700,-0.476733,1.445307,0.827343,-1.568125,-1.879637,2.129408
4,1.064693,1.992055,0.575875,0.389189,-0.410247,0.185615,-0.134441,-0.365937,0.756300,-1.491064,-0.347546,0.167868,1.474126,-0.578382,-0.477940,0.855765
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155,-1.465270,-0.741033,2.110571,1.431515,1.702170,0.714370,1.027446,0.722597,1.806913,-0.297701,-0.166684,0.593681,0.611749,-0.578382,-0.633684,0.926760
156,-1.465270,-0.741033,2.110571,1.431515,1.650648,0.714370,1.229785,0.722597,1.806913,-0.297701,-0.373383,2.100403,0.396154,-1.238211,-1.100916,1.303122
157,-1.465270,-0.741033,2.110571,1.431515,1.702170,0.714370,1.152604,1.778145,1.056476,-1.252391,-0.347546,1.248778,0.827343,-1.403168,-1.412404,1.720542
158,-1.465270,-0.741033,2.110571,1.431515,1.702170,0.714370,1.580229,0.854540,-1.082272,0.554700,3.321368,0.331643,-0.681817,-0.083510,-0.789428,1.889050


### Pyspark

In [117]:
# Note: this import shadows the scikit-learn StandardScaler imported above
from pyspark.ml.feature import StandardScaler

In [119]:
df_fil = df_auto_num_drop[list_num]
assembler = VectorAssembler(inputCols = list_num, outputCol = "features")
df_fil = assembler.transform(df_fil)
df_fil.show(5)

+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------------------+
|symboling|normalized-losses|wheel-base|length|width|height|curb-weight|engine-size|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|            features|
+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------------------+
|        2|            164.0|      99.8| 176.6| 66.2|  54.3|       2337|        109|3.19|   3.4|             10.0|     102.0|  5500.0|      24|         30|13950.0|[2.0,164.0,99.8,1...|
|        2|            164.0|      99.4| 176.6| 66.4|  54.3|       2824|        136|3.19|   3.4|              8.0|     115.0|  5500.0|      18|         22|17450.0|[2.0,164.0,99.4,1...|
|        1|            158.0|     105.8| 192.7| 71.4|  55.7|       2844|   

In [133]:
scaler_spark = StandardScaler(inputCol = "features", 
                              outputCol = "features_scaler", 
                              withStd = True, 
                              withMean = True)

scaler_model = scaler_spark.fit(df_fil)
df_fil_scaler = scaler_model.transform(df_fil)
df_fil_scaler.show(5)

+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------------------+--------------------+
|symboling|normalized-losses|wheel-base|length|width|height|curb-weight|engine-size|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|            features|     features_scaler|
+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------------------+--------------------+
|        2|            164.0|      99.8| 176.6| 66.2|  54.3|       2337|        109|3.19|   3.4|             10.0|     102.0|  5500.0|      24|         30|13950.0|[2.0,164.0,99.8,1...|[1.06136025068367...|
|        2|            164.0|      99.4| 176.6| 66.4|  54.3|       2824|        136|3.19|   3.4|              8.0|     115.0|  5500.0|      18|         22|17450.0|[2.0,164.0,99
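
The scaled values differ slightly between the two libraries: the first `symboling` value is 1.0647 in the scikit-learn output and 1.0614 in the Spark output. A likely explanation is the standard-deviation estimator. scikit-learn's `StandardScaler` divides by n (population standard deviation), while Spark's `StandardScaler` uses the corrected sample standard deviation, which divides by n - 1. The sketch below shows the effect on a small made-up column using only the standard library.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

values = [2.0, 2.0, 1.0, 1.0, 2.0, 0.0]  # made-up column
mu = mean(values)

# Population standard deviation (divides by n), as scikit-learn does
z_population = [(v - mu) / pstdev(values) for v in values]
# Sample standard deviation (divides by n - 1), as Spark does
z_sample = [(v - mu) / stdev(values) for v in values]

# The two results differ by the constant factor sqrt((n - 1) / n)
n = len(values)
factor = sqrt((n - 1) / n)
print(z_population[0], z_sample[0], factor)
```

The factor approaches 1 as the number of rows grows, so the gap between the two libraries is small on anything but tiny datasets.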

## Min Max Scaler

- It uses the range, computed as the difference between the maximum and the minimum: the minimum is subtracted from each value and the result is divided by the range.
- It is useful because it preserves the shape of the original distribution.
- It preserves outliers: their relative position is unchanged by the scaling, so an extreme value still defines the minimum or maximum and compresses the remaining values.
- It maps each variable to the range 0 to 1.
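
The formula is `(x - min) / (max - min)`. A minimal plain-Python sketch follows, assuming the column is not constant (otherwise the range is zero). The `min_max` helper and the sample values are illustrative, and the last value is a deliberate outlier.

```python
def min_max(values):
    """Rescale values to [0, 1]: subtract the minimum, then divide by the range."""
    lo, hi = min(values), max(values)
    value_range = hi - lo  # assumed non-zero for this sketch
    return [(v - lo) / value_range for v in values]

# Illustrative prices; the last one is a deliberate outlier
prices = [13950.0, 17450.0, 17710.0, 16430.0, 45000.0]
scaled = min_max(prices)
print([round(v, 3) for v in scaled])
```

The outlier is mapped to 1 and the other four values are squeezed into the bottom of the range, which illustrates how a single extreme value affects this scaler.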


### Pandas

In [127]:
from sklearn.preprocessing import MinMaxScaler

In [128]:
min_max = MinMaxScaler()
df_min_max = min_max.fit_transform(df_auto_pd_num)
df_min_max

array([[0.8       , 0.51832461, 0.45517241, ..., 0.26470588, 0.33333333,
        0.29500969],
       [0.8       , 0.51832461, 0.44137931, ..., 0.08823529, 0.11111111,
        0.41191796],
       [0.6       , 0.48691099, 0.66206897, ..., 0.11764706, 0.19444444,
        0.42060258],
       ...,
       [0.2       , 0.15706806, 0.77586207, ..., 0.08823529, 0.13888889,
        0.54669651],
       [0.2       , 0.15706806, 0.77586207, ..., 0.32352941, 0.25      ,
        0.57959784],
       [0.2       , 0.15706806, 0.77586207, ..., 0.11764706, 0.19444444,
        0.5847752 ]])

In [129]:
pd.DataFrame(df_min_max, columns = df_auto_pd_num.columns.tolist())

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,0.8,0.518325,0.455172,0.577236,0.517544,0.471154,0.329325,0.243655,0.464286,0.633333,0.18750,0.355263,0.551020,0.264706,0.333333,0.295010
1,0.8,0.518325,0.441379,0.577236,0.535088,0.471154,0.518231,0.380711,0.464286,0.633333,0.06250,0.440789,0.551020,0.088235,0.111111,0.411918
2,0.6,0.486911,0.662069,0.839024,0.973684,0.605769,0.525989,0.380711,0.464286,0.633333,0.09375,0.407895,0.551020,0.117647,0.194444,0.420603
3,0.6,0.486911,0.662069,0.839024,0.973684,0.625000,0.619860,0.355330,0.421429,0.633333,0.08125,0.605263,0.551020,0.058824,0.055556,0.626528
4,0.8,0.664921,0.503448,0.580488,0.394737,0.471154,0.351823,0.238579,0.685714,0.347619,0.11250,0.348684,0.673469,0.235294,0.305556,0.377848
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155,0.2,0.157068,0.775862,0.775610,0.754386,0.586538,0.567882,0.406091,0.885714,0.514286,0.15625,0.434211,0.510204,0.235294,0.277778,0.391710
156,0.2,0.157068,0.775862,0.775610,0.745614,0.586538,0.605508,0.406091,0.885714,0.514286,0.10625,0.736842,0.469388,0.117647,0.194444,0.465195
157,0.2,0.157068,0.775862,0.775610,0.754386,0.586538,0.591156,0.568528,0.742857,0.380952,0.11250,0.565789,0.551020,0.088235,0.138889,0.546697
158,0.2,0.157068,0.775862,0.775610,0.754386,0.586538,0.670675,0.426396,0.335714,0.633333,1.00000,0.381579,0.265306,0.323529,0.250000,0.579598


### Pyspark

In [130]:
# Note: this import shadows the scikit-learn MinMaxScaler imported above
from pyspark.ml.feature import MinMaxScaler

In [132]:
minmax_spark = MinMaxScaler(inputCol = "features", outputCol = "features_minmax")
minmax_model = minmax_spark.fit(df_fil)
df_fil_minmax = minmax_model.transform(df_fil)
df_fil_minmax.show(5)

+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------------------+--------------------+
|symboling|normalized-losses|wheel-base|length|width|height|curb-weight|engine-size|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|            features|     features_minmax|
+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------------------+--------------------+
|        2|            164.0|      99.8| 176.6| 66.2|  54.3|       2337|        109|3.19|   3.4|             10.0|     102.0|  5500.0|      24|         30|13950.0|[2.0,164.0,99.8,1...|[0.8,0.5183246073...|
|        2|            164.0|      99.4| 176.6| 66.4|  54.3|       2824|        136|3.19|   3.4|              8.0|     115.0|  5500.0|      18|         22|17450.0|[2.0,164.0,99

## Encoder

In [134]:
from pyspark.ml.feature import StringIndexer

In [135]:
df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

df.show()

+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       b|
|  2|       c|
|  3|       a|
|  4|       a|
|  5|       c|
+---+--------+



In [136]:
indexer = StringIndexer(inputCol = "category", outputCol = "category_num")
df_indexed = indexer.fit(df).transform(df)
df_indexed.show()

+---+--------+------------+
| id|category|category_num|
+---+--------+------------+
|  0|       a|         0.0|
|  1|       b|         2.0|
|  2|       c|         1.0|
|  3|       a|         0.0|
|  4|       a|         0.0|
|  5|       c|         1.0|
+---+--------+------------+
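
Note that `b` is mapped to 2.0 and `c` to 1.0. By default, `StringIndexer` orders labels by descending frequency, so the most frequent label gets index 0. Here is a minimal plain-Python sketch of that ordering, assuming ties are broken alphabetically; the `fit_string_index` helper is hypothetical.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

category = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_index(category)
category_num = [mapping[label] for label in category]

print(mapping)       # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print(category_num)  # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
```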



In [137]:
df_auto_pd = spark.read.parquet("./datasets/automobile_data.parquet")
df_auto_pd.show(3)

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|        3|             null|alfa-romero|      gas|       std|         two|convertible|         rwd|          front|      88

In [142]:
indexer = StringIndexer(inputCol = "fuel-type", outputCol = "fuel_type_num")
df_auto_pd_indexed = indexer.fit(df_auto_pd).transform(df_auto_pd)
df_auto_pd_indexed.show(3)

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+-------------+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|fuel_type_num|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+-------------+
|        3|             null|alfa-romero|      gas|       std|         two|convert

In [143]:
indexer = StringIndexer(inputCol = "fuel-system", outputCol = "fuel_system_num")
df_auto_pd_indexed = indexer.fit(df_auto_pd_indexed).transform(df_auto_pd_indexed)
df_auto_pd_indexed.show(5)

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+-------------+---------------+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|fuel_type_num|fuel_system_num|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+-------------+---------------+
|        3|             null|alfa-

### Exercise

Build a label encoder for each categorical column, once using `StringIndexer` and once using Pandas.

In [144]:
df_auto_pd = spark.read.parquet("./datasets/automobile_data.parquet")
df_auto_pd.show(3)

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+
|        3|             null|alfa-romero|      gas|       std|         two|convertible|         rwd|          front|      88

In [147]:
df_types = pd.DataFrame(data = df_auto_pd.dtypes).rename(columns = {0: "columns", 1: "Types"})
df_types_num = df_types.query("Types == 'string'")
list_cat = df_types_num["columns"].values.tolist()
df_types_num.head()

Unnamed: 0,columns,Types
2,make,string
3,fuel-type,string
4,aspiration,string
5,num-of-doors,string
6,body-style,string


In [148]:
list_cat

['make',
 'fuel-type',
 'aspiration',
 'num-of-doors',
 'body-style',
 'drive-wheels',
 'engine-location',
 'engine-type',
 'num-of-cylinders',
 'fuel-system']

In [169]:
df_auto_spark = spark.read.parquet("./datasets/automobile_data.parquet")

df_types = pd.DataFrame(data = df_auto_spark.dtypes).rename(columns = {0: "Column", 1: "Types"})
df_cat = df_types[df_types["Types"] == "string"]
list_cat = df_cat["Column"].values.tolist()

for col in list_cat:
    # Index each string column, then drop the original column
    indexer = StringIndexer(inputCol = col, outputCol = col + "_num")
    df_auto_spark = indexer.fit(df_auto_spark).transform(df_auto_spark)
    df_auto_spark = df_auto_spark.drop(col)

In [170]:
df_auto_spark.show(5)

+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------+-------------+--------------+----------------+--------------+----------------+-------------------+---------------+--------------------+---------------+
|symboling|normalized-losses|wheel-base|length|width|height|curb-weight|engine-size|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|  price|make_num|fuel-type_num|aspiration_num|num-of-doors_num|body-style_num|drive-wheels_num|engine-location_num|engine-type_num|num-of-cylinders_num|fuel-system_num|
+---------+-----------------+----------+------+-----+------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-------+--------+-------------+--------------+----------------+--------------+----------------+-------------------+---------------+--------------------+---------------+
|        3|             null
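
For the Pandas half of the exercise, one possible sketch is shown below. It runs on a small made-up frame (`df_cars` and its values are hypothetical; only the column names come from the dataset) and uses pandas categorical codes. These number the labels alphabetically rather than by frequency, so the codes will not necessarily match those produced by `StringIndexer`.

```python
import pandas as pd

# Hypothetical miniature of the automobile data
df_cars = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel", "gas"],
    "aspiration": ["std", "turbo", "std", "std"],
    "price": [13495.0, 16500.0, 7099.0, 13950.0],
})

# Treat every non-numeric column as categorical
list_cat = [c for c in df_cars.columns if not pd.api.types.is_numeric_dtype(df_cars[c])]

df_encoded = df_cars.copy()
for col in list_cat:
    # Categories are sorted alphabetically, so the codes follow that order
    df_encoded[col + "_num"] = df_encoded[col].astype("category").cat.codes
    df_encoded = df_encoded.drop(columns=col)

print(df_encoded)
```

With this approach, missing values receive the code -1, so they should be handled before or after encoding depending on the model being trained.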