# MLlib Basics - Data Transformation


In [None]:
# Run this cell on Google Colab to install PySpark
%pip install pyspark

Three components are commonly used for feature encoding: `StringIndexer`, `OneHotEncoder`, and `VectorAssembler`.

In [None]:
#import necessary packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import VectorAssembler


In [None]:
#Create Spark Session
spark = SparkSession.builder.appName('MlLib Basics').getOrCreate()

In [None]:
df = spark.createDataFrame( [(0, "Male"),
                             (1, "Male"),
                             (2, "Female"),
                             (3, "Female"),
                             (4, "Female"),
                             (5, "Male")
                          ], ["id", "gender"])

df.show()



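First we perform string indexing, which assigns a numeric index to each category. By default `StringIndexer` orders labels by descending frequency. Here both labels occur three times, so (in Spark 3.x) the tie is broken alphabetically, giving `Female` = 0 and `Male` = 1.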

In [None]:
# Fit a StringIndexer on the gender column and add the indexed column
indexer = StringIndexer(inputCol="gender", outputCol="genderIndexed")
indexed = indexer.fit(df).transform(df)
indexed.show()
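
The indexing rule described above can be sketched in plain Python. This is only an illustration of the ordering logic (most frequent label first, ties broken alphabetically), not Spark's implementation; the helper name `fit_string_index` is hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    # Sort by descending count, then by the label itself to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer emits indices as doubles, so floats are used here too.
    return {label: float(i) for i, label in enumerate(ordered)}


genders = ["Male", "Male", "Female", "Female", "Female", "Male"]
mapping = fit_string_index(genders)
indexed = [mapping[g] for g in genders]

print(mapping)  # {'Female': 0.0, 'Male': 1.0}
print(indexed)  # [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
```

Both labels appear three times, so only the alphabetical tie-break decides that `Female` comes first.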


Next, we encode the indexed categories with dummy encoding, which transforms a categorical variable of cardinality N into a binary vector of length N-1. `OneHotEncoder` behaves this way by default because its `dropLast` parameter defaults to `True`.

Here the cardinality of gender is 2, so the resulting vector has length **2 - 1 = 1**.

In [None]:
encoder = OneHotEncoder(inputCols=["genderIndexed"],
                        outputCols=["genderEncoded"])
encoded = encoder.fit(indexed).transform(indexed)
encoded.show()
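
To make the N-1 rule concrete, here is a minimal plain-Python sketch of dummy encoding with the last category dropped. The function `one_hot_drop_last` is a hypothetical helper for illustration and returns a dense list rather than a Spark vector.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a category index as a dense list of length num_categories - 1.

    The last category (index num_categories - 1) is represented by all zeros.
    """
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1.0
    return vector


female = one_hot_drop_last(0, 2)   # index 0 of 2 categories
male = one_hot_drop_last(1, 2)     # last category, so all zeros
third_of_five = one_hot_drop_last(2, 5)

print(female)         # [1.0]
print(male)           # [0.0]
print(third_of_five)  # [0.0, 0.0, 1.0, 0.0]
```

With two categories a single element is enough: 1.0 means the first category, 0.0 means the dropped last one.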

The result is a sparse vector, that is, a vector whose values are mostly 0. It is written as the tuple `(vector_length, [nonzero indices], [nonzero values])`.

For the example above:

`(1,[],[])` is a 1-element vector with no nonzero values, i.e. `[0]`

`(1,[0],[1.0])` is the vector `[1.0]`
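
The sparse notation can be expanded by hand. The sketch below, using a hypothetical helper `sparse_to_dense`, rebuilds the dense form from the three parts of the tuple. It mirrors the notation only and does not use Spark's own vector classes.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, [nonzero indices], [nonzero values]) into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


empty = sparse_to_dense(1, [], [])
single = sparse_to_dense(1, [0], [1.0])
longer = sparse_to_dense(4, [2], [1.0])

print(empty)   # [0.0]
print(single)  # [1.0]
print(longer)  # [0.0, 0.0, 1.0, 0.0]
```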

In [None]:
df = spark.createDataFrame( [("Female", "Blue", 300, 0.0, 0),
                             ("Female", "Black", 200, 15.1, 1),
                             ("Male", "Red", 100, 12.4, 0),
                             ("Female", "Green", 100, 0.0, 1),
                             ("Female", "Blue", 200, 0.0, 0),
                             ("Male", "Green", 400, 20.0, 1),
                             ("Male", "Yellow", 400, 20.0, 1)],
                            ["gender", "color", "num1", "num2", "target"])

df.show()

indexer_1 = StringIndexer(inputCol="gender", outputCol="genderIndex")
indexed_1 = indexer_1.fit(df).transform(df)
indexed_1.show()

indexer_2 = StringIndexer(inputCol="color", outputCol="colorIndex")
indexed_2 = indexer_2.fit(indexed_1).transform(indexed_1)
indexed_2.show()


encoder_1 = OneHotEncoder(inputCols=["genderIndex"],
                          outputCols=["genderEncoded"])
encoded_1 = encoder_1.fit(indexed_2).transform(indexed_2)
encoded_1.show()

encoder_2 = OneHotEncoder(inputCols=["colorIndex"],
                          outputCols=["colorEncoded"])
encoded_2 = encoder_2.fit(encoded_1).transform(encoded_1)
encoded_2.show()

`VectorAssembler` combines the encoded vectors and the numeric columns into a single feature vector.

Here the assembled vector has length 1 + 4 + 2 = 7: one `gender` column, four `color` columns (five colors minus the dropped last category), and two numeric columns (`num1` and `num2`).

In [None]:
# Assemble the vectors into a single features vector using the VectorAssembler transformer
assembler = VectorAssembler(
    inputCols=["genderEncoded", "colorEncoded", "num1", "num2"],
    outputCol="features")

output = assembler.transform(encoded_2)
output.show(truncate=False)
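
The length arithmetic can be checked with a small plain-Python sketch of the concatenation. The helper `assemble` is hypothetical, and the encoded inputs are assumed values for the first row (`Female`, `Blue`, 300, 0.0), taking `Female` and `Blue` to be index 0 in their columns.

```python
def assemble(*columns):
    """Concatenate vector-valued and scalar columns into one flat feature list."""
    features = []
    for column in columns:
        if isinstance(column, (list, tuple)):
            features.extend(float(x) for x in column)  # vector column
        else:
            features.append(float(column))             # scalar column
    return features


# Assumed encodings for the first row; dense lists stand in for Spark vectors.
gender_encoded = [1.0]                 # 2 categories -> length 1
color_encoded = [1.0, 0.0, 0.0, 0.0]   # 5 categories -> length 4
row = assemble(gender_encoded, color_encoded, 300, 0.0)

print(row)       # [1.0, 1.0, 0.0, 0.0, 0.0, 300.0, 0.0]
print(len(row))  # 7
```

The order of the elements follows the order of the input columns, which is why the numeric values end up at the tail of the vector.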

The steps above can be chained into a single workflow with `Pipeline`.

In [None]:
# Build the same process as a Pipeline

indexer_1 = StringIndexer(inputCol="gender", outputCol="genderIndex")
indexer_2 = StringIndexer(inputCol="color", outputCol="colorIndex")
indexers = [indexer_1, indexer_2]

encoder_1 = OneHotEncoder(inputCols=["genderIndex"], outputCols=["genderEncoded"])
encoder_2 = OneHotEncoder(inputCols=["colorIndex"], outputCols=["colorEncoded"])
encoders = [encoder_1, encoder_2]

assembler = VectorAssembler(inputCols=["genderEncoded", "colorEncoded", "num1", "num2"], outputCol="features")

pipeline = Pipeline(stages=indexers + encoders + [assembler])

In [None]:
model = pipeline.fit(df)
data = model.transform(df)

data.show(truncate=False)
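
What a pipeline does when fitted can be sketched with toy classes in plain Python: each stage is fitted on the output of the previous one, and the fitted stages are then applied in order. All class names below (`ToyIndexer`, `ToyPipeline`, and so on) are hypothetical, the rows are plain dictionaries, and the toy indexer orders labels alphabetically for simplicity. This is a sketch of the idea, not Spark's implementation.

```python
class ToyIndexer:
    """Toy estimator: fit() learns a label -> index mapping for one column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        labels = sorted({row[self.input_col] for row in rows})
        return ToyIndexerModel(self.input_col, self.output_col, labels)


class ToyIndexerModel:
    """Toy transformer produced by ToyIndexer.fit()."""

    def __init__(self, input_col, output_col, labels):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = {label: i for i, label in enumerate(labels)}

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]}
                for row in rows]


class ToyPipeline:
    """Fits stages in order, feeding each stage the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain transformers are used as-is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Applies the fitted stages one after another."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


rows = [{"gender": "Male", "color": "Red"},
        {"gender": "Female", "color": "Blue"}]
toy_pipeline = ToyPipeline([ToyIndexer("gender", "genderIndex"),
                            ToyIndexer("color", "colorIndex")])
toy_model = toy_pipeline.fit(rows)
result = toy_model.transform(rows)

print(result[0])
# {'gender': 'Male', 'color': 'Red', 'genderIndex': 1, 'colorIndex': 1}
```

The fitted model can be reused on new rows without refitting, which is the main practical benefit over running each stage by hand.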