<a href="https://colab.research.google.com/github/tyri0n11/distributed-system/blob/main/7_2_data_preparation_spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `StringIndexer` and `OneHotEncoder`

In [1]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

sc = SparkContext(conf=SparkConf())
spark = SparkSession(sparkContext=sc)

# Example data

In [2]:
import pandas as pd
pdf = pd.DataFrame({
        'x1': ['a','a','b','b', 'b', 'c'],
        'x2': ['apple', 'orange', 'orange','orange', 'peach', 'peach'],
        'x3': [1, 1, 2, 2, 2, 4],
        'x4': [2.4, 2.5, 3.5, 1.4, 2.1,1.5],
        'y1': [1, 0, 1, 0, 0, 1],
        'y2': ['yes', 'no', 'no', 'yes', 'yes', 'yes']
    })
# `pdf` is pandas dataframe while `df` is Spark dataframe
df = spark.createDataFrame(pdf)
df.show()

+---+------+---+---+---+---+
| x1|    x2| x3| x4| y1| y2|
+---+------+---+---+---+---+
|  a| apple|  1|2.4|  1|yes|
|  a|orange|  1|2.5|  0| no|
|  b|orange|  2|3.5|  1| no|
|  b|orange|  2|1.4|  0|yes|
|  b| peach|  2|2.1|  0|yes|
|  c| peach|  4|1.5|  1|yes|
+---+------+---+---+---+---+



In [3]:
print('Type of pdf', type(pdf))
print('Type of df', type(df))

Type of pdf <class 'pandas.core.frame.DataFrame'>
Type of df <class 'pyspark.sql.classic.dataframe.DataFrame'>


# StringIndexer

`StringIndexer` maps a string column to a index column that will be treated as a categorical column by spark. **The indices start with 0 and are ordered by label frequencies**. If it is a numerical column, the column will first be casted to a string column and then indexed by  StringIndexer.

There are three steps to implement the StringIndexer

1. Build the StringIndexer model: specify the input column and output column names.
2. Learn the StringIndexer model: fit the model with your data.
3. Execute the indexing: call the transform function to execute the indexing process.

### Example: `StringIndex` column "x1"

In [4]:
from pyspark.ml.feature import StringIndexer

# build indexer
string_indexer = StringIndexer(inputCol='x1', outputCol='indexed_x1')

# learn the model
string_indexer_model = string_indexer.fit(df)

# transform the data
df_stringindexer = string_indexer_model.transform(df)

# resulting df
df_stringindexer.show()

+---+------+---+---+---+---+----------+
| x1|    x2| x3| x4| y1| y2|indexed_x1|
+---+------+---+---+---+---+----------+
|  a| apple|  1|2.4|  1|yes|       1.0|
|  a|orange|  1|2.5|  0| no|       1.0|
|  b|orange|  2|3.5|  1| no|       0.0|
|  b|orange|  2|1.4|  0|yes|       0.0|
|  b| peach|  2|2.1|  0|yes|       0.0|
|  c| peach|  4|1.5|  1|yes|       2.0|
+---+------+---+---+---+---+----------+



### Your task `StringIndex` column "x2"

In [5]:
from pyspark.ml.feature import StringIndexer
# build indexer
string_indexer_2 = StringIndexer(inputCol='x2', outputCol='indexed_x2')
# learn the model
string_indexer_model_2 = string_indexer_2.fit(df)
# transform the data
df_stringindexer = string_indexer_model_2.transform(df)
# resulting df
df_stringindexer.show()

+---+------+---+---+---+---+----------+
| x1|    x2| x3| x4| y1| y2|indexed_x2|
+---+------+---+---+---+---+----------+
|  a| apple|  1|2.4|  1|yes|       2.0|
|  a|orange|  1|2.5|  0| no|       0.0|
|  b|orange|  2|3.5|  1| no|       0.0|
|  b|orange|  2|1.4|  0|yes|       0.0|
|  b| peach|  2|2.1|  0|yes|       1.0|
|  c| peach|  4|1.5|  1|yes|       1.0|
+---+------+---+---+---+---+----------+



From the result above, we can see that (a, b, c) in column x1 are converted to (1.0, 0.0, 2.0). They are ordered by their frequencies in column x1.



## OneHotEncoder

**`OneHotEncoder`** converts each categories of a **StringIndexed** column to a `sparse vector`. Each sparse vector has **at most one single active elements** that indicate the category index.

In [6]:
df.show(5)

+---+------+---+---+---+---+
| x1|    x2| x3| x4| y1| y2|
+---+------+---+---+---+---+
|  a| apple|  1|2.4|  1|yes|
|  a|orange|  1|2.5|  0| no|
|  b|orange|  2|3.5|  1| no|
|  b|orange|  2|1.4|  0|yes|
|  b| peach|  2|2.1|  0|yes|
+---+------+---+---+---+---+
only showing top 5 rows


In [7]:
df_ohe = df.select('x1')
df_ohe.show()

+---+
| x1|
+---+
|  a|
|  a|
|  b|
|  b|
|  b|
|  c|
+---+



### `StringIndex` column 'x1'

In [8]:
df_x1_indexed = StringIndexer(inputCol='x1', outputCol='indexed_x1').fit(df_ohe).transform(df_ohe)
df_x1_indexed.show()

+---+----------+
| x1|indexed_x1|
+---+----------+
|  a|       1.0|
|  a|       1.0|
|  b|       0.0|
|  b|       0.0|
|  b|       0.0|
|  c|       2.0|
+---+----------+



'x1' has three categories: 'a', 'b' and 'c',  which corresponding string indices 1.0, 0.0 and 2.0, respectively.

### Mapping string indices to sparse vectors

* Encoding format: 'string index': ['string indices vector size', 'index of string index in string indices vector', **1.0** ]

Here the string indices vector is `[0.0, 1.0, 2.0]`. Therefore, the mapping between string indices and sparse vectors are:
* `0.0: [3, [0], [1.0]]`
* `1.0: [3, [1], [1.0]]`
* `2.0: [3, [2], [1.0]]`

After we convert all sparse vectors to dense vectors, we get:

In [9]:
from pyspark.ml.linalg import DenseVector, SparseVector, DenseMatrix, SparseMatrix
x = [SparseVector(3, {0: 1.0}).toArray()] + \
    [SparseVector(3, {1: 1.0}).toArray()] + \
    [SparseVector(3, {2: 1.0}).toArray()]

import numpy as np
np.array(x)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

**The obtained matrix is exactly the matrix that we would use to represent our categorical variable in a statistical class**.

### One more step to go

`OneHotEncoder` by default will drop the last category. So the **string indices vector** becomes `[0.0, 1.0]`, and the mappings between string indices and sparse vectors are:

* `0.0: [2, [0], [1.0]]`
* `1.0: [2, [1], [1.0]]`
* `2.0: [2, [], []]`

We use a sparse vector that has **no active element**(basically all elements are 0's) to represent the last category.

# Verify

### OneHotEncode column 'indexed_x1'

In [10]:
from pyspark.ml.feature import OneHotEncoder

In [11]:
# review `df_x1_indexed`, what is it?
df_x1_indexed.show(5)

+---+----------+
| x1|indexed_x1|
+---+----------+
|  a|       1.0|
|  a|       1.0|
|  b|       0.0|
|  b|       0.0|
|  b|       0.0|
+---+----------+
only showing top 5 rows


In [12]:
OneHotEncoder(inputCol='indexed_x1', outputCol='encoded_x1').fit(df_x1_indexed).transform(df_x1_indexed).show()

+---+----------+-------------+
| x1|indexed_x1|   encoded_x1|
+---+----------+-------------+
|  a|       1.0|(2,[1],[1.0])|
|  a|       1.0|(2,[1],[1.0])|
|  b|       0.0|(2,[0],[1.0])|
|  b|       0.0|(2,[0],[1.0])|
|  b|       0.0|(2,[0],[1.0])|
|  c|       2.0|    (2,[],[])|
+---+----------+-------------+



### Specify to not drop the last category

If we choose to not drop the last category, we get the expected results.

In [13]:
OneHotEncoder(dropLast=False, inputCol='indexed_x1', outputCol='encoded_x1').fit(df_x1_indexed).transform(df_x1_indexed).show()

+---+----------+-------------+
| x1|indexed_x1|   encoded_x1|
+---+----------+-------------+
|  a|       1.0|(3,[1],[1.0])|
|  a|       1.0|(3,[1],[1.0])|
|  b|       0.0|(3,[0],[1.0])|
|  b|       0.0|(3,[0],[1.0])|
|  b|       0.0|(3,[0],[1.0])|
|  c|       2.0|(3,[2],[1.0])|
+---+----------+-------------+



## Exercise:
**Do the same OneHotEncoder for the columns `x2` and `y2`**

In [14]:
OneHotEncoder(dropLast=False, inputCol='indexed_x2', outputCol='encoded_x2').fit(df_stringindexer).transform(df_stringindexer).show()

+---+------+---+---+---+---+----------+-------------+
| x1|    x2| x3| x4| y1| y2|indexed_x2|   encoded_x2|
+---+------+---+---+---+---+----------+-------------+
|  a| apple|  1|2.4|  1|yes|       2.0|(3,[2],[1.0])|
|  a|orange|  1|2.5|  0| no|       0.0|(3,[0],[1.0])|
|  b|orange|  2|3.5|  1| no|       0.0|(3,[0],[1.0])|
|  b|orange|  2|1.4|  0|yes|       0.0|(3,[0],[1.0])|
|  b| peach|  2|2.1|  0|yes|       1.0|(3,[1],[1.0])|
|  c| peach|  4|1.5|  1|yes|       1.0|(3,[1],[1.0])|
+---+------+---+---+---+---+----------+-------------+



In [15]:
df_y2_indexed = StringIndexer(
    inputCol="y2",
    outputCol="indexed_y2"
).fit(df).transform(df)

OneHotEncoder(dropLast=False, inputCol='indexed_y2', outputCol='encoded_y2').fit(df_y2_indexed).transform(df_y2_indexed).show()


+---+------+---+---+---+---+----------+-------------+
| x1|    x2| x3| x4| y1| y2|indexed_y2|   encoded_y2|
+---+------+---+---+---+---+----------+-------------+
|  a| apple|  1|2.4|  1|yes|       0.0|(2,[0],[1.0])|
|  a|orange|  1|2.5|  0| no|       1.0|(2,[1],[1.0])|
|  b|orange|  2|3.5|  1| no|       1.0|(2,[1],[1.0])|
|  b|orange|  2|1.4|  0|yes|       0.0|(2,[0],[1.0])|
|  b| peach|  2|2.1|  0|yes|       0.0|(2,[0],[1.0])|
|  c| peach|  4|1.5|  1|yes|       0.0|(2,[0],[1.0])|
+---+------+---+---+---+---+----------+-------------+



# Vector assembler

## Example data

In [16]:
import pandas as pd
pdf = pd.DataFrame({
        'x1': ['a','a','b','b', 'b', 'c'],
        'x2': ['apple', 'orange', 'orange','orange', 'peach', 'peach'],
        'x3': [1, 1, 2, 2, 2, 4],
        'x4': [2.4, 2.5, 3.5, 1.4, 2.1,1.5],
        'y1': [1, 0, 1, 0, 0, 1],
        'y2': ['yes', 'no', 'no', 'yes', 'yes', 'yes']
    })
df = spark.createDataFrame(pdf)
df.show()

+---+------+---+---+---+---+
| x1|    x2| x3| x4| y1| y2|
+---+------+---+---+---+---+
|  a| apple|  1|2.4|  1|yes|
|  a|orange|  1|2.5|  0| no|
|  b|orange|  2|3.5|  1| no|
|  b|orange|  2|1.4|  0|yes|
|  b| peach|  2|2.1|  0|yes|
|  c| peach|  4|1.5|  1|yes|
+---+------+---+---+---+---+



# VectorAssembler

To fit a ML model in pyspark, we need to combine all feature columns into one single column of vectors: the **featuresCol**. The `VectorAssembler` can be used to combine multiple **`OneHotEncoder` columns** and **other continuous variable columns** into one single column.

The example below shows how to combine three OneHotEncoder columns and one numeric column into a **featureCol** column.



## StringIndex and OneHotEncode categorical columns

In [17]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

In [18]:
all_stages = [StringIndexer(inputCol=c, outputCol='idx_' + c) for c in ['x1', 'x2', 'x3']] + \
             [OneHotEncoder(inputCol='idx_' + c, outputCol='ohe_' + c) for c in ['x1', 'x2', 'x3']]
all_stages

[StringIndexer_fd5099100084,
 StringIndexer_e78869b60e6d,
 StringIndexer_de44d43ad2ec,
 OneHotEncoder_188440c8fa87,
 OneHotEncoder_fbbbbe8a4229,
 OneHotEncoder_d7b52d89d51e]

In [19]:
df_new = Pipeline(stages=all_stages).fit(df).transform(df)
df_new.show()

+---+------+---+---+---+---+------+------+------+-------------+-------------+-------------+
| x1|    x2| x3| x4| y1| y2|idx_x1|idx_x2|idx_x3|       ohe_x1|       ohe_x2|       ohe_x3|
+---+------+---+---+---+---+------+------+------+-------------+-------------+-------------+
|  a| apple|  1|2.4|  1|yes|   1.0|   2.0|   1.0|(2,[1],[1.0])|    (2,[],[])|(2,[1],[1.0])|
|  a|orange|  1|2.5|  0| no|   1.0|   0.0|   1.0|(2,[1],[1.0])|(2,[0],[1.0])|(2,[1],[1.0])|
|  b|orange|  2|3.5|  1| no|   0.0|   0.0|   0.0|(2,[0],[1.0])|(2,[0],[1.0])|(2,[0],[1.0])|
|  b|orange|  2|1.4|  0|yes|   0.0|   0.0|   0.0|(2,[0],[1.0])|(2,[0],[1.0])|(2,[0],[1.0])|
|  b| peach|  2|2.1|  0|yes|   0.0|   1.0|   0.0|(2,[0],[1.0])|(2,[1],[1.0])|(2,[0],[1.0])|
|  c| peach|  4|1.5|  1|yes|   2.0|   1.0|   2.0|    (2,[],[])|(2,[1],[1.0])|    (2,[],[])|
+---+------+---+---+---+---+------+------+------+-------------+-------------+-------------+



## Assemble feature columns into one single **feacturesCol** with **`VectorAssembler`**

In [20]:
df_assembled = VectorAssembler(inputCols=['ohe_x1', 'ohe_x2', 'ohe_x3', 'x4'], outputCol='featuresCol')\
    .transform(df_new)\
    .drop('idx_x1', 'idx_x2', 'idx_x3')
df_assembled.show(truncate=False)

+---+------+---+---+---+---+-------------+-------------+-------------+-----------------------------+
|x1 |x2    |x3 |x4 |y1 |y2 |ohe_x1       |ohe_x2       |ohe_x3       |featuresCol                  |
+---+------+---+---+---+---+-------------+-------------+-------------+-----------------------------+
|a  |apple |1  |2.4|1  |yes|(2,[1],[1.0])|(2,[],[])    |(2,[1],[1.0])|(7,[1,5,6],[1.0,1.0,2.4])    |
|a  |orange|1  |2.5|0  |no |(2,[1],[1.0])|(2,[0],[1.0])|(2,[1],[1.0])|[0.0,1.0,1.0,0.0,0.0,1.0,2.5]|
|b  |orange|2  |3.5|1  |no |(2,[0],[1.0])|(2,[0],[1.0])|(2,[0],[1.0])|[1.0,0.0,1.0,0.0,1.0,0.0,3.5]|
|b  |orange|2  |1.4|0  |yes|(2,[0],[1.0])|(2,[0],[1.0])|(2,[0],[1.0])|[1.0,0.0,1.0,0.0,1.0,0.0,1.4]|
|b  |peach |2  |2.1|0  |yes|(2,[0],[1.0])|(2,[1],[1.0])|(2,[0],[1.0])|[1.0,0.0,0.0,1.0,1.0,0.0,2.1]|
|c  |peach |4  |1.5|1  |yes|(2,[],[])    |(2,[1],[1.0])|(2,[],[])    |(7,[3,6],[1.0,1.5])          |
+---+------+---+---+---+---+-------------+-------------+-------------+---------------------

## Convert sparse vectors in featuresCol to dense vectors

In [21]:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
from pyspark.ml.linalg import SparseVector, DenseVector

In [22]:
def dense_features_col(x):
    return(x.toArray().dtype)
dense_features_col_udf = udf(dense_features_col, returnType=StringType())

In [23]:
df_assembled.rdd.map(lambda x: x['featuresCol']).take(6)

[SparseVector(7, {1: 1.0, 5: 1.0, 6: 2.4}),
 DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 2.5]),
 DenseVector([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 3.5]),
 DenseVector([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.4]),
 DenseVector([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 2.1]),
 SparseVector(7, {3: 1.0, 6: 1.5})]

In [24]:
df_assembled.rdd.map(lambda x: list(x['featuresCol'].toArray())).take(6)

[[np.float64(0.0),
  np.float64(1.0),
  np.float64(0.0),
  np.float64(0.0),
  np.float64(0.0),
  np.float64(1.0),
  np.float64(2.4)],
 [np.float64(0.0),
  np.float64(1.0),
  np.float64(1.0),
  np.float64(0.0),
  np.float64(0.0),
  np.float64(1.0),
  np.float64(2.5)],
 [np.float64(1.0),
  np.float64(0.0),
  np.float64(1.0),
  np.float64(0.0),
  np.float64(1.0),
  np.float64(0.0),
  np.float64(3.5)],
 [np.float64(1.0),
  np.float64(0.0),
  np.float64(1.0),
  np.float64(0.0),
  np.float64(1.0),
  np.float64(0.0),
  np.float64(1.4)],
 [np.float64(1.0),
  np.float64(0.0),
  np.float64(0.0),
  np.float64(1.0),
  np.float64(1.0),
  np.float64(0.0),
  np.float64(2.1)],
 [np.float64(0.0),
  np.float64(0.0),
  np.float64(0.0),
  np.float64(1.0),
  np.float64(0.0),
  np.float64(0.0),
  np.float64(1.5)]]

## Practice

In [25]:
import pandas as pd
pdf2 = pd.DataFrame({
        'x1': ['a','a','b','b', 'b', 'c'],
        'x2': ['apple', 'orange', 'orange','orange', 'peach', 'peach'],
        'x3': [1, 1, 2, 2, 2, 4],
        'x4': [2.4, 2.5, 3.5, 1.4, 2.1,1.5],
        'x5': ['man', 'woman', 'man', 'man', 'man', 'woman'],
        'x6': [10.3, 11.4, 45.3, 32.5, 13.8, 17.2],
        'x7': ['911', '113', '115', '113', '911', '115'],
        'y1': [1, 0, 1, 0, 0, 1],
        'y2': ['yes', 'no', 'no', 'yes', 'yes', 'yes']
    })
df2 = spark.createDataFrame(pdf2)
df2.show()

+---+------+---+---+-----+----+---+---+---+
| x1|    x2| x3| x4|   x5|  x6| x7| y1| y2|
+---+------+---+---+-----+----+---+---+---+
|  a| apple|  1|2.4|  man|10.3|911|  1|yes|
|  a|orange|  1|2.5|woman|11.4|113|  0| no|
|  b|orange|  2|3.5|  man|45.3|115|  1| no|
|  b|orange|  2|1.4|  man|32.5|113|  0|yes|
|  b| peach|  2|2.1|  man|13.8|911|  0|yes|
|  c| peach|  4|1.5|woman|17.2|115|  1|yes|
+---+------+---+---+-----+----+---+---+---+



#### Your task: Please do the Assemble feature columns (of all categorical and numerical features) into one single **feacturesCol** with **`VectorAssembler`**
**Hint: Categorical features (`x1`, `x2`, `x3`, `x5`, `x7`) and numerical features (`x4`, `x6`)**

In [26]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline

categorical_cols = ['x1', 'x2', 'x3', 'x5', 'x7']
numerical_cols = ['x4', 'x6']

In [27]:
indexers = [
    StringIndexer(
        inputCol=col,
        outputCol=f"{col}_idx",
        handleInvalid="keep"
    )
    for col in categorical_cols
]


In [28]:
encoders = [
    OneHotEncoder(
        inputCol=f"{col}_idx",
        outputCol=f"{col}_ohe"
    )
    for col in categorical_cols
]


In [29]:
assembler = VectorAssembler(
    inputCols=[f"{col}_ohe" for col in categorical_cols] + numerical_cols,
    outputCol="featuresCol"
)


In [30]:
pipeline = Pipeline(stages=indexers + encoders + [assembler])


In [31]:
df_assembled = pipeline.fit(df2).transform(df2)
df_assembled.select("featuresCol").show(truncate=False)


+-------------------------------------------------------+
|featuresCol                                            |
+-------------------------------------------------------+
|(16,[1,5,7,9,13,14,15],[1.0,1.0,1.0,1.0,1.0,2.4,10.3]) |
|(16,[1,3,7,10,11,14,15],[1.0,1.0,1.0,1.0,1.0,2.5,11.4])|
|(16,[0,3,6,9,12,14,15],[1.0,1.0,1.0,1.0,1.0,3.5,45.3]) |
|(16,[0,3,6,9,11,14,15],[1.0,1.0,1.0,1.0,1.0,1.4,32.5]) |
|(16,[0,4,6,9,13,14,15],[1.0,1.0,1.0,1.0,1.0,2.1,13.8]) |
|(16,[2,4,8,10,12,14,15],[1.0,1.0,1.0,1.0,1.0,1.5,17.2])|
+-------------------------------------------------------+



#### Exercise: Do the Assemble feature columns (of all categorical and numerical features) into one single **feacturesCol** with **`VectorAssembler`** for **`mtcars`** dataset

In [35]:
df = spark.read.csv('mtcars.csv', header=True, inferSchema=True)
df.show(5)

+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|              _c0| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|        Mazda RX4|21.0|  6|160.0|110| 3.9| 2.62|16.46|  0|  1|   4|   4|
|    Mazda RX4 Wag|21.0|  6|160.0|110| 3.9|2.875|17.02|  0|  1|   4|   4|
|       Datsun 710|22.8|  4|108.0| 93|3.85| 2.32|18.61|  1|  1|   4|   1|
|   Hornet 4 Drive|21.4|  6|258.0|110|3.08|3.215|19.44|  1|  0|   3|   1|
|Hornet Sportabout|18.7|  8|360.0|175|3.15| 3.44|17.02|  0|  0|   3|   2|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
only showing top 5 rows


In [46]:
categorical_cols = []
numerical_cols = ['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']

In [47]:
encoders = [
    OneHotEncoder(
        inputCol=f"{col}_idx",
        outputCol=f"{col}_ohe"
    )
    for col in categorical_cols
]

In [48]:
indexers = [
    StringIndexer(
        inputCol=col,
        outputCol=f"{col}_idx",
        handleInvalid="keep"
    )
    for col in categorical_cols
]

In [49]:
assembler = VectorAssembler(
    inputCols=[f"{col}_ohe" for col in categorical_cols] + numerical_cols,
    outputCol="featuresCol"
)


In [50]:
pipeline = Pipeline(stages=indexers + encoders + [assembler])


In [51]:
df_assembled = pipeline.fit(df).transform(df)
df_assembled.select("featuresCol").show(truncate=False)

+-------------------------------------------------------+
|featuresCol                                            |
+-------------------------------------------------------+
|[21.0,6.0,160.0,110.0,3.9,2.62,16.46,0.0,1.0,4.0,4.0]  |
|[21.0,6.0,160.0,110.0,3.9,2.875,17.02,0.0,1.0,4.0,4.0] |
|[22.8,4.0,108.0,93.0,3.85,2.32,18.61,1.0,1.0,4.0,1.0]  |
|[21.4,6.0,258.0,110.0,3.08,3.215,19.44,1.0,0.0,3.0,1.0]|
|[18.7,8.0,360.0,175.0,3.15,3.44,17.02,0.0,0.0,3.0,2.0] |
|[18.1,6.0,225.0,105.0,2.76,3.46,20.22,1.0,0.0,3.0,1.0] |
|[14.3,8.0,360.0,245.0,3.21,3.57,15.84,0.0,0.0,3.0,4.0] |
|[24.4,4.0,146.7,62.0,3.69,3.19,20.0,1.0,0.0,4.0,2.0]   |
|[22.8,4.0,140.8,95.0,3.92,3.15,22.9,1.0,0.0,4.0,2.0]   |
|[19.2,6.0,167.6,123.0,3.92,3.44,18.3,1.0,0.0,4.0,4.0]  |
|[17.8,6.0,167.6,123.0,3.92,3.44,18.9,1.0,0.0,4.0,4.0]  |
|[16.4,8.0,275.8,180.0,3.07,4.07,17.4,0.0,0.0,3.0,3.0]  |
|[17.3,8.0,275.8,180.0,3.07,3.73,17.6,0.0,0.0,3.0,3.0]  |
|[15.2,8.0,275.8,180.0,3.07,3.78,18.0,0.0,0.0,3.0,3.0]  |
|[10.4,8.0,472

#### Exercise: Do the Assemble feature columns (of all categorical and numerical features) into one single **feacturesCol** with **`VectorAssembler`** for **`titanic`** dataset

In [101]:
df = spark.read.csv('kaggle-titanic-test.csv', header = True, inferSchema=True)
df.show(20)

+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|        892|     3|    Kelly, Mr. James|  male|34.5|    0|    0|          330911| 7.8292| NULL|       Q|
|        893|     3|Wilkes, Mrs. Jame...|female|47.0|    1|    0|          363272|    7.0| NULL|       S|
|        894|     2|Myles, Mr. Thomas...|  male|62.0|    0|    0|          240276| 9.6875| NULL|       Q|
|        895|     3|    Wirz, Mr. Albert|  male|27.0|    0|    0|          315154| 8.6625| NULL|       S|
|        896|     3|Hirvonen, Mrs. Al...|female|22.0|    1|    1|         3101298|12.2875| NULL|       S|
|        897|     3|Svensson, Mr. Joh...|  male|14.0|    0|    0|            7538|  9.225| NULL|       S|
|        898|     3|Connolly, Miss. Kate|femal

In [102]:
df = df.drop('PassengerId', 'Name', 'Cabin').dropna()
df.show(5)
df.printSchema()

+------+------+----+-----+-----+-------+-------+--------+
|Pclass|   Sex| Age|SibSp|Parch| Ticket|   Fare|Embarked|
+------+------+----+-----+-----+-------+-------+--------+
|     3|  male|34.5|    0|    0| 330911| 7.8292|       Q|
|     3|female|47.0|    1|    0| 363272|    7.0|       S|
|     2|  male|62.0|    0|    0| 240276| 9.6875|       Q|
|     3|  male|27.0|    0|    0| 315154| 8.6625|       S|
|     3|female|22.0|    1|    1|3101298|12.2875|       S|
+------+------+----+-----+-----+-------+-------+--------+
only showing top 5 rows
root
 |-- Pclass: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Embarked: string (nullable = true)



In [103]:
categorial_cols = ['Sex', 'Embarked', 'Ticket']
numerical_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

In [104]:
encoders = [
    OneHotEncoder(
        inputCol=f"{col}_idx",
        outputCol=f"{col}_ohe"
    )
    for col in categorical_cols
]

In [105]:
indexers = [
    StringIndexer(
        inputCol=col,
        outputCol=f"{col}_idx",
        handleInvalid="keep"
    )
    for col in categorical_cols
]

In [106]:
assembler = VectorAssembler(
    inputCols=[f"{col}_ohe" for col in categorical_cols] + numerical_cols,
    outputCol="featuresCol"
)


In [107]:
pipeline = Pipeline(stages=indexers + encoders + [assembler])


In [108]:
df_assembled = pipeline.fit(df).transform(df)
df_assembled.select("featuresCol").show(truncate=False)

+--------------------------+
|featuresCol               |
+--------------------------+
|[3.0,34.5,0.0,0.0,7.8292] |
|[3.0,47.0,1.0,0.0,7.0]    |
|[2.0,62.0,0.0,0.0,9.6875] |
|[3.0,27.0,0.0,0.0,8.6625] |
|[3.0,22.0,1.0,1.0,12.2875]|
|[3.0,14.0,0.0,0.0,9.225]  |
|[3.0,30.0,0.0,0.0,7.6292] |
|[2.0,26.0,1.0,1.0,29.0]   |
|[3.0,18.0,0.0,0.0,7.2292] |
|[3.0,21.0,2.0,0.0,24.15]  |
|[1.0,46.0,0.0,0.0,26.0]   |
|[1.0,23.0,1.0,0.0,82.2667]|
|[2.0,63.0,1.0,0.0,26.0]   |
|[1.0,47.0,1.0,0.0,61.175] |
|[2.0,24.0,1.0,0.0,27.7208]|
|[2.0,35.0,0.0,0.0,12.35]  |
|[3.0,21.0,0.0,0.0,7.225]  |
|[3.0,27.0,1.0,0.0,7.925]  |
|[3.0,45.0,0.0,0.0,7.225]  |
|[1.0,55.0,1.0,0.0,59.4]   |
+--------------------------+
only showing top 20 rows
