# Featurization

Cleaning data and adding features creates the inputs for machine learning models, which are only as strong as the data they are fed.  This notebook examines the process of featurization including common tasks such as:

- Exercise 1: Handling missing data
- Exercise 2: Feature Engineering
- Exercise 3: Scaling Numeric features
- Exercise 4: Encoding Categorical Features

Run the following cell to load common libraries.

In [0]:
import urllib.request
import os
import numpy as np
from pyspark.sql.types import * 
from pyspark.sql.functions import col, lit
from pyspark.sql.functions import udf
print("Imported common libraries.")

## Load the training data

In this notebook, we will use a subset of the NYC Taxi & Limousine Commission green taxi trip records available from [Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/). The data is enriched with holiday and weather data. Each row of the table represents a taxi ride and includes columns such as the number of passengers, trip distance, datetime information, holiday and weather information, and the taxi fare for the trip.

Run the following cell to load the table into a Spark dataframe and review the dataframe.

In [0]:
dataset = spark.sql("select * from nyc_taxi")
display(dataset)

passengerCount,tripDistance,hour_of_day,day_of_week,month_num,normalizeHolidayName,isPaidTimeOff,snowDepth,precipTime,precipDepth,temperature,totalAmount
1.0,9.4,15,2,1,,False,29.058823529411764,24.0,3.0,6.18571428571429,44.3
,14.75,13,4,1,,False,0.0,6.0,0.0,4.571929824561403,44.8
1.0,3.35,23,4,1,,False,0.0,1.0,0.0,4.384090909090913,18.96
1.0,3.33,18,2,1,,False,29.058823529411764,24.0,3.0,6.18571428571429,16.3
1.0,0.47,17,6,1,,False,0.0,1.0,0.0,3.846428571428569,5.3
1.0,3.07,9,1,1,,False,0.0,6.0,0.0,0.1594594594594597,16.3
1.0,0.92,23,4,1,,False,0.0,1.0,0.0,-2.999107142857142,8.97
1.0,1.9,12,4,1,,False,0.0,1.0,0.0,4.384090909090913,11.8
1.0,0.77,0,1,1,,False,0.0,1.0,0.0,-5.393749999999998,7.3
,2.35,2,6,1,,False,0.0,24.0,254.0,10.943654822335034,14.16


## Exercise 1: Handling missing data

Null values refer to unknown or missing data as well as irrelevant responses. Strategies for dealing with this scenario include:
* **Dropping these records:** Works when you do not need to use the information for downstream workloads
* **Adding a placeholder (e.g. `-1`):** Allows you to see missing data later on without violating a schema
* **Basic imputing:** Allows you to have a "best guess" of what the data could have been, often by using the mean of non-missing data
* **Advanced imputing:** Determines the "best guess" of what data should be using more advanced strategies such as clustering machine learning algorithms or oversampling techniques <a href="https://jair.org/index.php/jair/article/view/10302" target="_blank">such as SMOTE.</a>
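
The first three strategies can be sketched in plain Python on a toy column. This is only an illustration with made-up values and hypothetical helper names (`drop_missing`, `fill_placeholder`, `impute_median`); it does not use Spark, which the rest of the notebook relies on for the same steps.

```python
from statistics import median

# Hypothetical toy column: None marks a missing passenger count
passenger_counts = [1.0, None, 1.0, 2.0, None, 5.0]


def drop_missing(values):
    """Drop the records whose value is missing."""
    return [v for v in values if v is not None]


def fill_placeholder(values, placeholder=-1):
    """Keep every record, marking missing values with a placeholder."""
    return [placeholder if v is None else v for v in values]


def impute_median(values):
    """Replace missing values with the median of the observed values."""
    fill = median(drop_missing(values))
    return [fill if v is None else v for v in values]


print(drop_missing(passenger_counts))      # [1.0, 1.0, 2.0, 5.0]
print(fill_placeholder(passenger_counts))  # [1.0, -1, 1.0, 2.0, -1, 5.0]
print(impute_median(passenger_counts))     # [1.0, 1.5, 1.0, 2.0, 1.5, 5.0]
```

Dropping shrinks the dataset, the placeholder keeps every row but must not be mistaken for a real value later, and imputing keeps every row with a plausible value.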

Run the following cell to review summary statistics of each column in the dataframe. Observe that, based on the **count** metric, the two columns `passengerCount` and `totalAmount` have some null or missing values (11,147 and 11,617 non-null values, respectively, out of 11,734 rows).

In [0]:
display(dataset.describe())

summary,passengerCount,tripDistance,hour_of_day,day_of_week,month_num,normalizeHolidayName,snowDepth,precipTime,precipDepth,temperature,totalAmount
count,11147.0,11734.0,11734.0,11734.0,11734.0,11734,11734.0,11734.0,11734.0,11734.0,11617.0
mean,1.348703687090697,2.866139423896368,13.633884438384182,3.22387932503835,3.5028975626384864,,1.6090149848022983,12.02837906937106,190.78234191239133,10.31424417384186,14.724533872771849
stddev,1.0152961119265145,2.905810032580362,6.670529654348203,1.961855394396239,1.707729094068861,,7.146770932668942,10.158597241219244,1211.087724397753,8.50059991817598,10.96651683941929
min,1.0,0.01,0.0,0.0,1.0,"Martin Luther King, Jr. Day",0.0,1.0,0.0,-13.379464285714295,3.3
max,6.0,62.55,23.0,6.0,6.0,Washington's Birthday,67.0909090909091,24.0,9999.0,26.52410714285713,339.38


A common option for working with missing data is to impute the missing values with a best guess for their value. We will try imputing the `passengerCount` column with its median value. Run the following cell to create the **Imputer** with **strategy="median"** and impute the `passengerCount` column.

In [0]:
from pyspark.ml.feature import Imputer

inputCols = ["passengerCount"]
outputCols = ["passengerCount"]

imputer = Imputer(strategy="median", inputCols=inputCols, outputCols=outputCols)
imputerModel = imputer.fit(dataset)
imputedDF = imputerModel.transform(dataset)
display(imputedDF)

passengerCount,tripDistance,hour_of_day,day_of_week,month_num,normalizeHolidayName,isPaidTimeOff,snowDepth,precipTime,precipDepth,temperature,totalAmount
1,9.4,15,2,1,,False,29.058823529411764,24.0,3.0,6.18571428571429,44.3
1,14.75,13,4,1,,False,0.0,6.0,0.0,4.571929824561403,44.8
1,3.35,23,4,1,,False,0.0,1.0,0.0,4.384090909090913,18.96
1,3.33,18,2,1,,False,29.058823529411764,24.0,3.0,6.18571428571429,16.3
1,0.47,17,6,1,,False,0.0,1.0,0.0,3.846428571428569,5.3
1,3.07,9,1,1,,False,0.0,6.0,0.0,0.1594594594594597,16.3
1,0.92,23,4,1,,False,0.0,1.0,0.0,-2.999107142857142,8.97
1,1.9,12,4,1,,False,0.0,1.0,0.0,4.384090909090913,11.8
1,0.77,0,1,1,,False,0.0,1.0,0.0,-5.393749999999998,7.3
1,2.35,2,6,1,,False,0.0,24.0,254.0,10.943654822335034,14.16


In the next lesson, we will train a machine learning model to predict the taxi fares, so the `totalAmount` column will be the target column for training. Given the importance of this column, the strategy for the `totalAmount` column will be to drop the rows with null values in that column. Run the following cell to drop the null rows and review the final imputed dataset.

In [0]:
imputedDF = imputedDF.na.drop(subset=["totalAmount"])

display(imputedDF.describe())

summary,passengerCount,tripDistance,hour_of_day,day_of_week,month_num,normalizeHolidayName,snowDepth,precipTime,precipDepth,temperature,totalAmount
count,11617.0,11617.0,11617.0,11617.0,11617.0,11617,11617.0,11617.0,11617.0,11617.0,11617.0
mean,1.32994749074632,2.86314539037617,13.634242919858828,3.2207971076870106,3.503055866402686,,1.594342123854158,12.02143410519067,191.4620814323836,10.318198223395576,14.724533872771849
stddev,0.9905854727655304,2.8995739877114945,6.668682319466743,1.9629105573867032,1.707677463883683,,7.084436666546873,10.157326735285835,1213.6354936137388,8.497340521033312,10.96651683941929
min,1.0,0.01,0.0,0.0,1.0,"Martin Luther King, Jr. Day",0.0,1.0,0.0,-13.379464285714295,3.3
max,6.0,62.55,23.0,6.0,6.0,Washington's Birthday,67.0909090909091,24.0,9999.0,26.52410714285713,339.38


## Exercise 2: Feature Engineering

In some situations, it is beneficial to engineer new features or columns from existing data. In this case, the `hour_of_day` column represents hours from 0 to 23. Time is cyclical in nature: for example, hour 23 is very close to hour 0, even though the raw values are far apart. Thus, it can be useful to transform the `hour_of_day` column with **sine** and **cosine** functions, which are inherently cyclical. Run the following cell to set up a user-defined function (UDF) that takes in the `hour_of_day` column and transforms it to its sine and cosine representation.

In [0]:
def get_sin_cosine(value, max_value):
  # Map the value onto a circle so that 0 and max_value coincide
  angle = value * (2.*np.pi/max_value)
  # .tolist() converts the NumPy scalars to plain Python floats for Spark
  return (np.sin(angle).tolist(), np.cos(angle).tolist())

schema = StructType([
    StructField("sine", DoubleType(), False),
    StructField("cosine", DoubleType(), False)
])

get_sin_cosineUDF = udf(get_sin_cosine, schema)

print("UDF get_sin_cosineUDF defined.")
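
To see why the sine and cosine pair helps, the following plain-Python sketch compares distances between encoded hours. It is independent of Spark and of the UDF above; `encode_hour` and `distance` are hypothetical helper names used only for this illustration.

```python
import math


def encode_hour(hour, period=24):
    """Return the (sine, cosine) position of an hour on the unit circle."""
    angle = 2.0 * math.pi * hour / period
    return (math.sin(angle), math.cos(angle))


def distance(hour_a, hour_b):
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(hour_a), encode_hour(hour_b))


print(round(distance(23, 0), 4))  # 0.2611, the same as distance(0, 1)
print(round(distance(0, 1), 4))   # 0.2611
print(round(distance(0, 12), 4))  # 2.0, opposite sides of the circle
```

In the raw column, hours 23 and 0 differ by 23; after encoding, they are as close as any two neighbouring hours.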

Run the following cell to transform the `hour_of_day` column, name the two new columns `hour_sine` and `hour_cosine`, and drop the original column. To review the resulting dataframe, scroll to the right to observe the two new columns.

In [0]:
engineeredDF = imputedDF.withColumn("udfResult", get_sin_cosineUDF(col("hour_of_day"), lit(24))).withColumn("hour_sine", col("udfResult.sine")).withColumn("hour_cosine", col("udfResult.cosine")).drop("udfResult").drop("hour_of_day")
display(engineeredDF)

passengerCount,tripDistance,day_of_week,month_num,normalizeHolidayName,isPaidTimeOff,snowDepth,precipTime,precipDepth,temperature,totalAmount,hour_sine,hour_cosine
1,9.4,2,1,,False,29.058823529411764,24.0,3.0,6.18571428571429,44.3,-0.7071067811865471,-0.7071067811865479
1,14.75,4,1,,False,0.0,6.0,0.0,4.571929824561403,44.8,-0.2588190451025203,-0.9659258262890684
1,3.35,4,1,,False,0.0,1.0,0.0,4.384090909090913,18.96,-0.2588190451025215,0.965925826289068
1,3.33,2,1,,False,29.058823529411764,24.0,3.0,6.18571428571429,16.3,-1.0,-1.8369701987210294e-16
1,0.47,6,1,,False,0.0,1.0,0.0,3.846428571428569,5.3,-0.965925826289068,-0.2588190451025215
1,3.07,1,1,,False,0.0,6.0,0.0,0.1594594594594597,16.3,0.7071067811865476,-0.7071067811865475
1,0.92,4,1,,False,0.0,1.0,0.0,-2.999107142857142,8.97,-0.2588190451025215,0.965925826289068
1,1.9,4,1,,False,0.0,1.0,0.0,4.384090909090913,11.8,1.2246467991473532e-16,-1.0
1,0.77,1,1,,False,0.0,1.0,0.0,-5.393749999999998,7.3,0.0,1.0
1,2.35,6,1,,False,0.0,24.0,254.0,10.943654822335034,14.16,0.4999999999999999,0.8660254037844387


## Exercise 3: Scaling Numeric features

Common types of data in machine learning include:
- Numerical
  - Numerical values, either integers or floats
  - Example: house prices
- Categorical
  - Discrete and limited set of values
  - The values typically do not make sense unless there is a meaning or a category attached to the values
  - Example: a person's gender or ethnicity
- Time-Series
  - Data series over time
  - Typically, data collected over equally spaced points in time
  - Example: real-time stock performance
- Text
  - Words or sentences
  - Example: news articles
  
In the example we are working with, we have **numerical** and **categorical** features. Run the following cell to create lists of the numerical and categorical features in the dataset. In this exercise, we will look at how to work with numerical features, and in the next exercise we will look at encoding categorical features.

In [0]:
numerical_cols = ["passengerCount", "tripDistance", "snowDepth", "precipTime", "precipDepth", "temperature", "hour_sine", "hour_cosine"]
categorical_cols = ["day_of_week", "month_num", "normalizeHolidayName", "isPaidTimeOff"]
label_column = "totalAmount"
print("Numerical and categorical features list defined. Label column identified.")

For numerical features the ranges of values can vary widely and thus it is common practice in machine learning to scale the numerical features. The two common approaches for data scaling are:
- **Normalization**: Rescales the data into the range [0, 1].
- **Standardization**: Rescales the data to have Mean = 0 and Variance = 1.
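
Both rescalings are simple arithmetic per feature. The sketch below uses plain Python rather than Spark, with hypothetical helper names. The min-max example reuses the `tripDistance` minimum (0.01) and maximum (62.55) from the summary statistics above. The standardization helper uses the population standard deviation, which is an assumption made for this illustration.

```python
from statistics import mean, pstdev


def min_max_scale(x, col_min, col_max):
    """Normalization: map x from [col_min, col_max] onto [0, 1]."""
    return (x - col_min) / (col_max - col_min)


def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]


# A tripDistance of 9.4 with the column minimum 0.01 and maximum 62.55
print(min_max_scale(9.4, 0.01, 62.55))  # approximately 0.1501

# Hypothetical toy values; the result has mean 0 and variance 1
z_scores = standardize([2.0, 4.0, 6.0])
print(z_scores)
```

Min-max scaling depends only on the two extremes of a column, so a single outlier can compress all other values into a narrow band; standardization is less sensitive to that but does not bound the output range.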

Run the following cell to see how to use the **VectorAssembler** and **MinMaxScaler** to scale the numerical features into the range [0, 1]. Observe how we combine the two-step transformation into a single pipeline. Finally, review the resulting dataframe by scrolling to the right to observe the new assembled and scaled column: **scaled_numerical_features**.

In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml import Pipeline


assembler = VectorAssembler().setInputCols(numerical_cols).setOutputCol('numerical_features')
scaler = MinMaxScaler(inputCol=assembler.getOutputCol(), outputCol="scaled_numerical_features")

partialPipeline = Pipeline().setStages([assembler, scaler])
pipelineModel = partialPipeline.fit(engineeredDF)
scaledDF = pipelineModel.transform(engineeredDF)

display(scaledDF)

passengerCount,tripDistance,day_of_week,month_num,normalizeHolidayName,isPaidTimeOff,snowDepth,precipTime,precipDepth,temperature,totalAmount,hour_sine,hour_cosine,numerical_features,scaled_numerical_features
1,9.4,2,1,,False,29.058823529411764,24.0,3.0,6.18571428571429,44.3,-0.7071067811865471,-0.7071067811865479,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 9.4, 29.058823529411764, 24.0, 3.0, 6.18571428571429, -0.7071067811865471, -0.7071067811865479))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.1501439078989447, 0.433126095966842, 1.0, 3.0003000300030005E-4, 0.49031146513917523, 0.14644660940672644, 0.14644660940672605))"
1,14.75,4,1,,False,0.0,6.0,0.0,4.571929824561403,44.8,-0.2588190451025203,-0.9659258262890684,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 14.75, 0.0, 6.0, 0.0, 4.571929824561403, -0.25881904510252035, -0.9659258262890684))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.23568915893827952, 0.0, 0.21739130434782608, 0.0, 0.4498693592479367, 0.37059047744873985, 0.01703708685546579))"
1,3.35,4,1,,False,0.0,1.0,0.0,4.384090909090913,18.96,-0.2588190451025215,0.965925826289068,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 3.35, 0.0, 1.0, 0.0, 4.384090909090913, -0.25881904510252157, 0.9659258262890681))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6, 7), values -> List(0.05340582027502399, 0.44516203835545143, 0.3705904774487392, 0.9829629131445341))"
1,3.33,2,1,,False,29.058823529411764,24.0,3.0,6.18571428571429,16.3,-1.0,-1.8369701987210294e-16,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 3.33, 29.058823529411764, 24.0, 3.0, 6.18571428571429, -1.0, -1.8369701987210297E-16))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.053086024944035824, 0.433126095966842, 1.0, 3.0003000300030005E-4, 0.49031146513917523, 0.0, 0.4999999999999999))"
1,0.47,6,1,,False,0.0,1.0,0.0,3.846428571428569,5.3,-0.965925826289068,-0.2588190451025215,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 0.47, 0.0, 1.0, 0.0, 3.846428571428569, -0.9659258262890681, -0.2588190451025215))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6, 7), values -> List(0.007355292612727854, 0.43168799785196477, 0.017037086855465955, 0.37059047744873924))"
1,3.07,1,1,,False,0.0,6.0,0.0,0.1594594594594597,16.3,0.7071067811865476,-0.7071067811865475,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 3.07, 0.0, 6.0, 0.0, 0.1594594594594597, 0.7071067811865476, -0.7071067811865475))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.04892868564118964, 0.0, 0.21739130434782608, 0.0, 0.3392910273560057, 0.8535533905932737, 0.14644660940672627))"
1,0.92,4,1,,False,0.0,1.0,0.0,-2.999107142857142,8.97,-0.2588190451025215,0.965925826289068,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 0.92, 0.0, 1.0, 0.0, -2.999107142857142, -0.25881904510252157, 0.9659258262890681))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6, 7), values -> List(0.014550687559961625, 0.26013604224469733, 0.3705904774487392, 0.9829629131445341))"
1,1.9,4,1,,False,0.0,1.0,0.0,4.384090909090913,11.8,1.2246467991473532e-16,-1.0,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 1.9, 0.0, 1.0, 0.0, 4.384090909090913, 1.2246467991473532E-16, -1.0))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6), values -> List(0.030220658778381836, 0.44516203835545143, 0.5000000000000001))"
1,0.77,1,1,,False,0.0,1.0,0.0,-5.393749999999998,7.3,0.0,1.0,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 0.77, 0.0, 1.0, 0.0, -5.393749999999998, 0.0, 1.0))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6, 7), values -> List(0.012152222577550368, 0.20012530206748444, 0.5, 1.0))"
1,2.35,6,1,,False,0.0,24.0,254.0,10.943654822335034,14.16,0.4999999999999999,0.8660254037844387,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 2.35, 0.0, 24.0, 254.0, 10.943654822335034, 0.49999999999999994, 0.8660254037844387))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.037416053725615614, 0.0, 1.0, 0.025402540254025403, 0.6095474223801856, 0.75, 0.9330127018922194))"


## Exercise 4: Encoding Categorical Features

In machine learning, we ultimately always work with numbers or, more specifically, vectors. In this context, a vector is either an array of numbers or a nested array of arrays of numbers. All non-numeric data types, such as the categories `normalizeHolidayName` and `isPaidTimeOff` in the dataframe, are eventually represented as numbers. Numerical categories, such as `day_of_week` and `month_num`, also need to be encoded. Otherwise, a machine learning model might assume that month 6 (June) is six times as large as month 1 (January).

**One Hot Encoding** is often the recommended approach to encode categorical features. In this approach, for each categorical column, N new columns are added to the data set, where N is the cardinality (the number of distinct values) of the column. Each new column corresponds to one of the categories and has a value of 1 if the row has that category or 0 if it does not.
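
As a minimal sketch of the idea in plain Python (not the Spark API; `one_hot` and the example categories are made up for illustration), each value becomes a list with a single 1 in the slot of its category:

```python
def one_hot(value, categories):
    """Return a 0/1 list with a 1 only in the slot matching value."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical categorical column with three distinct values (N = 3)
categories = ["holiday", "weekday", "weekend"]

for value in ["weekday", "weekend", "holiday"]:
    print(value, one_hot(value, categories))
```

Note that Spark's **OneHotEncoder** drops the last category by default (`dropLast=True`), so its output vectors have N-1 slots and the dropped category is represented by all zeros.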

Run the following cell to encode the categorical features in the dataset using One Hot Encoding. Since **OneHotEncoder** only operates on numerical values, we will first use **StringIndexer** to index the categorical columns to numerical values and then encode them using **OneHotEncoder**.

In [0]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder

stages = [] # stages in our Pipeline
for categorical_col in categorical_cols:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categorical_col, outputCol=categorical_col + "_index", handleInvalid="skip")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categorical_col + "_classVector"])
    # Add stages.  These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

encodedDF = scaledDF.withColumn("isPaidTimeOff", col("isPaidTimeOff").cast("integer"))
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(encodedDF)
encodedDF = pipelineModel.transform(encodedDF)

display(encodedDF)

passengerCount,tripDistance,day_of_week,month_num,normalizeHolidayName,isPaidTimeOff,snowDepth,precipTime,precipDepth,temperature,totalAmount,hour_sine,hour_cosine,numerical_features,scaled_numerical_features,day_of_week_index,day_of_week_classVector,month_num_index,month_num_classVector,normalizeHolidayName_index,normalizeHolidayName_classVector,isPaidTimeOff_index,isPaidTimeOff_classVector
1,9.4,2,1,,0,29.058823529411764,24.0,3.0,6.18571428571429,44.3,-0.7071067811865471,-0.7071067811865479,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 9.4, 29.058823529411764, 24.0, 3.0, 6.18571428571429, -0.7071067811865471, -0.7071067811865479))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.1501439078989447, 0.433126095966842, 1.0, 3.0003000300030005E-4, 0.49031146513917523, 0.14644660940672644, 0.14644660940672605))",4.0,"Map(vectorType -> sparse, length -> 6, indices -> List(4), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
1,14.75,4,1,,0,0.0,6.0,0.0,4.571929824561403,44.8,-0.2588190451025203,-0.9659258262890684,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 14.75, 0.0, 6.0, 0.0, 4.571929824561403, -0.25881904510252035, -0.9659258262890684))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.23568915893827952, 0.0, 0.21739130434782608, 0.0, 0.4498693592479367, 0.37059047744873985, 0.01703708685546579))",0.0,"Map(vectorType -> sparse, length -> 6, indices -> List(0), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
1,3.35,4,1,,0,0.0,1.0,0.0,4.384090909090913,18.96,-0.2588190451025215,0.965925826289068,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 3.35, 0.0, 1.0, 0.0, 4.384090909090913, -0.25881904510252157, 0.9659258262890681))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6, 7), values -> List(0.05340582027502399, 0.44516203835545143, 0.3705904774487392, 0.9829629131445341))",0.0,"Map(vectorType -> sparse, length -> 6, indices -> List(0), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
1,3.33,2,1,,0,29.058823529411764,24.0,3.0,6.18571428571429,16.3,-1.0,-1.8369701987210294e-16,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 3.33, 29.058823529411764, 24.0, 3.0, 6.18571428571429, -1.0, -1.8369701987210297E-16))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.053086024944035824, 0.433126095966842, 1.0, 3.0003000300030005E-4, 0.49031146513917523, 0.0, 0.4999999999999999))",4.0,"Map(vectorType -> sparse, length -> 6, indices -> List(4), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
1,0.47,6,1,,0,0.0,1.0,0.0,3.846428571428569,5.3,-0.965925826289068,-0.2588190451025215,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 0.47, 0.0, 1.0, 0.0, 3.846428571428569, -0.9659258262890681, -0.2588190451025215))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6, 7), values -> List(0.007355292612727854, 0.43168799785196477, 0.017037086855465955, 0.37059047744873924))",2.0,"Map(vectorType -> sparse, length -> 6, indices -> List(2), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
1,3.07,1,1,,0,0.0,6.0,0.0,0.1594594594594597,16.3,0.7071067811865476,-0.7071067811865475,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 3.07, 0.0, 6.0, 0.0, 0.1594594594594597, 0.7071067811865476, -0.7071067811865475))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.04892868564118964, 0.0, 0.21739130434782608, 0.0, 0.3392910273560057, 0.8535533905932737, 0.14644660940672627))",5.0,"Map(vectorType -> sparse, length -> 6, indices -> List(5), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
1,0.92,4,1,,0,0.0,1.0,0.0,-2.999107142857142,8.97,-0.2588190451025215,0.965925826289068,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 0.92, 0.0, 1.0, 0.0, -2.999107142857142, -0.25881904510252157, 0.9659258262890681))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6, 7), values -> List(0.014550687559961625, 0.26013604224469733, 0.3705904774487392, 0.9829629131445341))",0.0,"Map(vectorType -> sparse, length -> 6, indices -> List(0), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
1,1.9,4,1,,0,0.0,1.0,0.0,4.384090909090913,11.8,1.2246467991473532e-16,-1.0,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 1.9, 0.0, 1.0, 0.0, 4.384090909090913, 1.2246467991473532E-16, -1.0))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6), values -> List(0.030220658778381836, 0.44516203835545143, 0.5000000000000001))",0.0,"Map(vectorType -> sparse, length -> 6, indices -> List(0), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
1,0.77,1,1,,0,0.0,1.0,0.0,-5.393749999999998,7.3,0.0,1.0,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 0.77, 0.0, 1.0, 0.0, -5.393749999999998, 0.0, 1.0))","Map(vectorType -> sparse, length -> 8, indices -> List(1, 5, 6, 7), values -> List(0.012152222577550368, 0.20012530206748444, 0.5, 1.0))",5.0,"Map(vectorType -> sparse, length -> 6, indices -> List(5), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
1,2.35,6,1,,0,0.0,24.0,254.0,10.943654822335034,14.16,0.4999999999999999,0.8660254037844387,"Map(vectorType -> dense, length -> 8, values -> List(1.0, 2.35, 0.0, 24.0, 254.0, 10.943654822335034, 0.49999999999999994, 0.8660254037844387))","Map(vectorType -> dense, length -> 8, values -> List(0.0, 0.037416053725615614, 0.0, 1.0, 0.025402540254025403, 0.6095474223801856, 0.75, 0.9330127018922194))",2.0,"Map(vectorType -> sparse, length -> 6, indices -> List(2), values -> List(1.0))",3.0,"Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))",0.0,"Map(vectorType -> sparse, length -> 1, indices -> List(0), values -> List(1.0))"
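
The `_index` columns in the output above come from **StringIndexer**, which by default orders values by descending frequency, so the most frequent value receives index 0. The following plain-Python sketch mimics that idea on made-up data; `string_index` is a hypothetical helper, and its tie-breaking rule is an assumption chosen for the illustration.

```python
from collections import Counter


def string_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties by the value itself
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical day_of_week sample: 4 occurs most often, then 6, then 2
days = [4, 4, 4, 2, 6, 6]
mapping = string_index(days)
print(mapping)  # {4: 0.0, 6: 1.0, 2: 2.0}
```

This is why an index value does not preserve the original number: it reflects how often a category occurs, not its numeric size.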


In the resulting dataframe, observe the new column **isPaidTimeOff_classVector** is a vector. The difference between a sparse and dense vector is whether Spark records all of the empty values. In a sparse vector, like we see here, Spark saves space by only recording the places where the vector has a non-zero value. The value of 0 in the first position indicates that it's a sparse vector. The second value indicates the length of the vector.

Example interpretation of the following vector: **[0, 1, [0], [1]]**
- 0 - it is a sparse vector
- 1 - the length of the vector is 1
- [0] - the positions that hold a non-zero value; here only position 0
- [1] - the values at those positions
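
A sparse vector can be expanded back into its dense form by placing each stored value at its index and filling the rest with zeros. A minimal plain-Python sketch (the helper `sparse_to_dense` is hypothetical, not a Spark function, and the length-8 values are made up):

```python
def sparse_to_dense(length, indices, values):
    """Expand a sparse (length, indices, values) triple into a full list."""
    dense = [0.0] * length
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# The example above: length 1, value 1.0 at position 0
print(sparse_to_dense(1, [0], [1.0]))  # [1.0]

# A length-8 vector with non-zero entries at positions 1, 5, 6 and 7
print(sparse_to_dense(8, [1, 5, 6, 7], [0.5, 0.25, 0.5, 1.0]))
# [0.0, 0.5, 0.0, 0.0, 0.0, 0.25, 0.5, 1.0]
```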