<a href="https://colab.research.google.com/github/ysowti/Data-Science-Portfolio-Yahya/blob/master/Deep_Learning_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook I use the PySpark, Keras, and Elephas Python libraries to build an end-to-end deep learning pipeline that runs on Spark. Spark is an open-source distributed analytics engine built to process large amounts of data quickly. PySpark is the Python API for Spark, which lets you write in an approachable language like Python while leveraging the power of Apache Spark.

My interest in putting together this example was to learn and prototype: specifically, to learn more about PySpark pipelines and how I could integrate deep learning into one. I ran this entire project using Jupyter on my local machine to build a prototype for an upcoming data science project where the data will be massive. Since I work for IBM, I'll take this entire analytics project (Jupyter Notebook) and move it to IBM. This allows me to do my data ingestion, pipelining, training, and deployment on a unified platform and on a much larger Spark cluster. Obviously, if you had a real, sizable project or were using image data, you would NOT do this on your local machine.

Overall, I found it not too difficult to put together this prototype or working example so I hope others will find it useful.

## Step 1: Start Spark Session

In [1]:
!apt-get update

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://mirrors.sonic.net/apache/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

In [4]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Deep Learning Pipeline").getOrCreate()

In [5]:
pip install elephas

## Step 2: Import Libraries

In [6]:
# Spark Session, Pipeline, Functions, and Metrics
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.ml.feature import OneHotEncoder, StringIndexer, StandardScaler, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import rand
from pyspark.mllib.evaluation import MulticlassMetrics

# Keras / Deep Learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Elephas for Deep Learning on Spark
from elephas.ml_model import ElephasEstimator

## Step 3: Load and Preview Data

Here we'll load the data. The data we'll use comes from a Kaggle competition. It's a typical banking dataset. I use the inferSchema parameter here, which infers the feature types when loading the data. Per the PySpark documentation this "requires one extra pass over the data". Since the bank data I'm loading only has ~11k observations it doesn't take long at all, but it's worth noting if you have a very large dataset.

After we load the data we can see the schema and the various feature types. All our features are either string type or integer. We also preview the first 5 observations. I'm pretty familiar with the pandas Python library, so throughout this example you'll see me use toPandas() to convert the Spark dataframe to a pandas dataframe and do some manipulations. Not right or wrong, just easier for me.

Finally, we'll drop the 2 date columns (day and month) since we won't be using those in our deep learning model. They could possibly be significant and could be featurized, but I decided to just drop them altogether.

In [7]:
from google.colab import files
uploaded = files.upload()

Saving bank.csv to bank (1).csv


In [8]:
df = spark.read.csv('bank.csv', inferSchema=True, header=True)

In [9]:
df.limit(5).toPandas()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes


In [11]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- deposit: string (nullable = true)



In [12]:
df = df.drop('day', 'month')
df.columns

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'deposit']

## Step 4: Create the Spark Data Pipeline

Now we create the pipeline using PySpark. This essentially takes your data and, per the feature lists you pass, will do the transformations and vectorizing so it is ready for modeling. I referenced the "Extracting, transforming and selecting features" Apache Spark documentation a lot for this pipeline and project.

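Below is a helper function that selects which numeric features to standardize based on the kurtosis of each feature. The thresholds are named lower_skew and upper_skew, but the code compares them against kurtosis, not skewness. Both thresholds default to None, in which case every column of the requested dtype is returned. The values of -2 and 2 that I pass later are just general guidelines (depending on where you read), and you can modify them as desired.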

In [14]:
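def select_features(df, lower_skew=None, upper_skew=None, dtypes='int32', drop_col=None):
  # Convert to pandas once; every toPandas() call collects the whole dataframe
  pdf = df.toPandas()
  feature_list = list(pdf.select_dtypes(include=[dtypes]).columns.drop(drop_col or []))
  # With no thresholds given, return every column of the requested dtype
  if lower_skew is None and upper_skew is None:
    return feature_list
  selected_features = []
  for feature in feature_list:
    # Despite the parameter names, the statistic compared here is kurtosis
    kurt = pdf[feature].kurtosis()
    if (lower_skew is not None and kurt < lower_skew) or (upper_skew is not None and kurt > upper_skew):
      selected_features.append(feature)
  return selected_features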
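
To make the selection rule concrete, here is a minimal plain-Python sketch of the statistic involved. It assumes the bias-corrected excess kurtosis (Fisher's definition, where a normal distribution scores 0), which is my understanding of what pandas' kurtosis() returns. The names excess_kurtosis and outside_band, and the two toy columns, are made up for this illustration and are not part of the notebook.

```python
def excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (needs at least 4 values)."""
    n = len(values)
    mean = sum(values) / n
    # Sums of squared and fourth-power deviations from the mean
    m2 = sum((v - mean) ** 2 for v in values)
    m4 = sum((v - mean) ** 4 for v in values)
    numerator = n * (n + 1) * (n - 1) * m4
    denominator = (n - 2) * (n - 3) * m2 ** 2
    adjustment = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return numerator / denominator - adjustment


def outside_band(values, lower=-2, upper=2):
    """Flag a column whose kurtosis falls outside [lower, upper]."""
    k = excess_kurtosis(values)
    return k < lower or k > upper


flat = [1, 2, 3, 4, 5]     # evenly spread values, light tails
spiky = [0] * 9 + [100]    # one extreme value, heavy tail
print(excess_kurtosis(flat), outside_band(flat))
print(excess_kurtosis(spiky), outside_band(spiky))
```

A column like spiky, dominated by a few extreme values, is the kind of feature the -2/2 thresholds would pick out for scaling.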

Now we'll get into the actual data pipeline. The feature list selection part can be further enhanced to be more dynamic vs listing out each feature, but for this small dataset I just left it as is with cat_features, num_features, and label. Selecting features by type can be done similar to how I did it in the select_features helper function, using something like spark_df.toPandas().select_dtypes(include=['object']).columns, which would return all the columns in your Spark dataframe that are object or string type.

The first thing we want to do is create an empty list called stages. This will contain each step that the data pipeline needs to complete all transformations within our pipeline. I print out each step of the stages after the pipeline so you can see the sequential steps from my code to a list.

The second part is going to be a basic loop to go through each categorical feature from our list cat_features and then index and encode those features using one-hot encoding. StringIndexer encodes your categorical feature to a feature index with the highest frequency label (count) as feature index 0 and so on. I will preview the transformed data frame after the pipeline, in Step 5, where you can see each feature index created from the categorical features. For more information and a basic example of StringIndexer, check the "Extracting, transforming and selecting features" documentation mentioned above.
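
As a rough illustration of the indexing rule just described, here is a plain-Python sketch (not Spark code). The function name fit_string_index and the toy values are hypothetical, and the alphabetical tie-break is my own choice to keep the output deterministic.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically for a stable result
    ordered = sorted(counts, key=lambda cat: (-counts[cat], cat))
    return {cat: float(i) for i, cat in enumerate(ordered)}


marital = ["married", "single", "married", "divorced", "married", "single"]
mapping = fit_string_index(marital)
print(mapping)                        # {'married': 0.0, 'single': 1.0, 'divorced': 2.0}
print([mapping[m] for m in marital])  # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```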

Within the loop we also do one-hot encoding (OHE) using OneHotEncoder (Spark 2.3 and 2.4 also provide OneHotEncoderEstimator, which was renamed to OneHotEncoder in Spark 3.0). The encoder only takes a label index, so if you have categorical data (objects or strings) you have to use StringIndexer first so you can pass a label index to the encoder. One nice thing I found from looking at dozens of examples was that you can chain the StringIndexer output right into the encoder using string_indexer.getOutputCol(). If you have a lot of features to transform you'll want to think about the output column names (outputCol), because you can't just overwrite feature names, so get creative. We'll append all those pipeline steps within our loop to our pipeline list stages.
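
Here is a small plain-Python sketch of what the encoder does with a label index, assuming the default behavior of dropping the last category (dropLast=True). That assumption matches the preview in Step 5, where marital_index 0.0 becomes (1.0, 0.0) and housing_index 1.0 becomes (0.0). The helper name one_hot_drop_last is hypothetical.

```python
def one_hot_drop_last(index, n_categories):
    """One-hot encode a label index, dropping the slot for the last category."""
    vec = [0.0] * (n_categories - 1)
    # The last index has no slot of its own, so it stays all zeros
    if index < n_categories - 1:
        vec[int(index)] = 1.0
    return vec


print(one_hot_drop_last(0.0, 3))  # [1.0, 0.0]
print(one_hot_drop_last(1.0, 2))  # [0.0]
print(one_hot_drop_last(2.0, 3))  # [0.0, 0.0]
```

Dropping one slot avoids a redundant column: the last category is implied when every other slot is zero.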

We also use StringIndexer on our label (the dependent variable); in the code below this stage is appended last. Before that, we scale the numeric variables: the select_features helper function from above picks the list to scale, VectorAssembler vectorizes those features, and StandardScaler standardizes the features within that vector. Then we append those steps to our ongoing pipeline list stages.

The last step is assembling all our features into a single vector. We find the numeric features from num_features that were not scaled by taking the difference between num_features and unscaled_features (which, despite its name, is the list of numeric features selected FOR scaling). Then we assemble all the one-hot encoded categorical features and the remaining numeric features into assembled_inputs and add that step to our pipeline stages. Finally, we combine assembled_inputs with scaled_features to get a single final vector, features, for our modeling.

In [15]:
label = 'deposit'
cat_features = select_features(df, dtypes='object', drop_col=[label])
num_features = select_features(df)
stages = []

for feature in cat_features:
  string_indexer = StringIndexer(inputCol=feature, outputCol=feature + '_index')
  encoder = OneHotEncoder(inputCol=string_indexer.getOutputCol(), outputCol= feature + '_class_vec')
  stages += [string_indexer, encoder]

unscaled_features = select_features(df, lower_skew=-2, upper_skew=2, dtypes='int32')
unscaled_assembler = VectorAssembler(inputCols=unscaled_features, outputCol='unscaled_features')
scaler = StandardScaler(inputCol='unscaled_features', outputCol='scaled_features')
stages += [unscaled_assembler, scaler]

# Keep the original column order; a set difference would return an arbitrary order
unscaled_num_features = [f for f in num_features if f not in unscaled_features]
num_str_assembler = VectorAssembler(inputCols=[cat + '_class_vec' for cat in cat_features] +
                                    unscaled_num_features, outputCol='assembled_inputs')
stages += [num_str_assembler]

final_assembler = VectorAssembler(inputCols=['assembled_inputs', 'scaled_features'], outputCol='features')
stages += [final_assembler]

label_indexer = StringIndexer(inputCol=label, outputCol='label_index')
stages += [label_indexer]
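
For intuition on the scaling stage, here is a plain-Python sketch of what I understand StandardScaler to do with its default settings: divide by the sample standard deviation without centering on the mean. That reading is consistent with the Step 5 preview, where a small balance of 45 scales to a small positive number rather than a negative one. The function name and toy values are made up for illustration.

```python
from statistics import stdev


def scale_to_unit_std(values):
    """Divide each value by the column's sample standard deviation (no centering)."""
    sd = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [v / sd for v in values]


balances = [10.0, 20.0, 30.0]       # made-up values; their sample std is 10.0
print(scale_to_unit_std(balances))  # [1.0, 2.0, 3.0]
```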

We can see all the steps within our pipeline by looking at our stages list that we've been sequentially adding.

In [16]:
stages

[StringIndexer_07e341da0889,
 OneHotEncoder_cc7721ef5eb7,
 StringIndexer_3fa613e3c07f,
 OneHotEncoder_fc78a049d11a,
 StringIndexer_d99c999fc1bd,
 OneHotEncoder_f459d440330c,
 StringIndexer_8ca87bc9550f,
 OneHotEncoder_5dc1a9a23711,
 StringIndexer_863228da661a,
 OneHotEncoder_c684d573cd29,
 StringIndexer_4efb055e6620,
 OneHotEncoder_040230b12232,
 StringIndexer_94cde08c14c8,
 OneHotEncoder_f7096b861146,
 StringIndexer_2d9cf9550bcf,
 OneHotEncoder_aa23903e073e,
 VectorAssembler_728eef4041e9,
 StandardScaler_3bd7fbb8aba7,
 VectorAssembler_de890137d403,
 VectorAssembler_369fac0116a8,
 StringIndexer_e479ba02615f]

## Step 5: Run Data Through the Spark Pipeline

Now that the "hard" part is over we can simply pipeline the stages and fit our data to the pipeline by using fit(). Then we actually transform the data by using transform(). We can now preview our newly transformed PySpark dataframe with all the original and transformed features.

In [18]:
pipeline = Pipeline(stages=stages)
pipeline_model = pipeline.fit(df)
df_transform = pipeline_model.transform(df)

In [19]:
df_transform.limit(5).toPandas()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome,deposit,job_index,job_class_vec,marital_index,marital_class_vec,education_index,education_class_vec,default_index,default_class_vec,housing_index,housing_class_vec,loan_index,loan_class_vec,contact_index,contact_class_vec,poutcome_index,poutcome_class_vec,unscaled_features,scaled_features,assembled_inputs,features,label_index
0,59,admin.,married,secondary,no,2343,yes,no,unknown,1042,1,-1,0,unknown,yes,3.0,"(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(1.0, 0.0)",0.0,"(1.0, 0.0, 0.0)",0.0,(1.0),1.0,(0.0),0.0,(1.0),1.0,"(0.0, 1.0)",0.0,"(1.0, 0.0, 0.0)","[2343.0, 1042.0, 1.0, -1.0, 0.0]","[0.7264185278681131, 3.0017712260834295, 0.367...","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.0
1,56,admin.,married,secondary,no,45,no,no,unknown,1467,1,-1,0,unknown,yes,3.0,"(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(1.0, 0.0)",0.0,"(1.0, 0.0, 0.0)",0.0,(1.0),0.0,(1.0),0.0,(1.0),1.0,"(0.0, 1.0)",0.0,"(1.0, 0.0, 0.0)","[45.0, 1467.0, 1.0, -1.0, 0.0]","[0.013951700279157103, 4.226102100445672, 0.36...","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.0
2,41,technician,married,secondary,no,1270,yes,no,unknown,1389,1,-1,0,unknown,yes,2.0,"(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(1.0, 0.0)",0.0,"(1.0, 0.0, 0.0)",0.0,(1.0),1.0,(0.0),0.0,(1.0),1.0,"(0.0, 1.0)",0.0,"(1.0, 0.0, 0.0)","[1270.0, 1389.0, 1.0, -1.0, 0.0]","[0.39374798565621155, 4.001401375268602, 0.367...","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.0
3,55,services,married,secondary,no,2476,yes,no,unknown,579,1,-1,0,unknown,yes,4.0,"(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(1.0, 0.0)",0.0,"(1.0, 0.0, 0.0)",0.0,(1.0),1.0,(0.0),0.0,(1.0),1.0,"(0.0, 1.0)",0.0,"(1.0, 0.0, 0.0)","[2476.0, 579.0, 1.0, -1.0, 0.0]","[0.7676535531376218, 1.667970767660562, 0.3673...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...",1.0
4,54,admin.,married,tertiary,no,184,no,no,unknown,673,2,-1,0,unknown,yes,3.0,"(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"(1.0, 0.0)",1.0,"(0.0, 1.0, 0.0)",0.0,(1.0),0.0,(1.0),0.0,(1.0),1.0,"(0.0, 1.0)",0.0,"(1.0, 0.0, 0.0)","[184.0, 673.0, 2.0, -1.0, 0.0]","[0.05704695225255348, 1.938763949284211, 0.734...","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.0


## Step 6: Final Data Prep before Deep Learning Model

There are a couple of last, quick things we need to do before modeling. First, we create a PySpark dataframe that contains only the 2 columns we need from the recently transformed dataframe: features (X) and label_index (y). That's easy to do in PySpark with a simple select statement. Then, just because, we preview the dataframe.

Finally, we want to shuffle our dataframe and then split the data into train and test sets. You always want to shuffle the data before splitting and modeling to avoid any bias from how the data may be sorted or otherwise organized.

In [21]:
df_transform_fin = df_transform.select('features', 'label_index')
df_transform_fin.show()

+--------------------+-----------+
|            features|label_index|
+--------------------+-----------+
|(30,[3,11,13,16,1...|        1.0|
|(30,[3,11,13,16,1...|        1.0|
|(30,[2,11,13,16,1...|        1.0|
|(30,[4,11,13,16,1...|        1.0|
|(30,[3,11,14,16,1...|        1.0|
|(30,[0,12,14,16,2...|        1.0|
|(30,[0,11,14,16,2...|        1.0|
|(30,[5,13,16,18,2...|        1.0|
|(30,[2,11,13,16,1...|        1.0|
|(30,[4,12,13,16,1...|        1.0|
|(30,[3,12,13,16,1...|        1.0|
|(30,[1,11,13,16,1...|        1.0|
|(30,[0,11,14,16,2...|        1.0|
|(30,[1,12,14,16,1...|        1.0|
|(30,[2,12,14,16,1...|        1.0|
|(30,[0,14,16,18,2...|        1.0|
|(30,[1,12,15,16,1...|        1.0|
|(30,[4,11,13,16,1...|        1.0|
|(30,[3,11,13,16,1...|        1.0|
|(30,[3,13,16,20,2...|        1.0|
+--------------------+-----------+
only showing top 20 rows



In [22]:
df_transform_fin = df_transform_fin.orderBy(rand())
train_data, test_data = df_transform_fin.randomSplit([0.8, 0.2], seed=1234)
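
As a mental model for the split, here is a plain-Python sketch that assigns each row to train or test with an independent random draw. My understanding is that randomSplit works along these lines, which is why the split sizes come out close to, but not exactly, 80/20. The function name random_split and the stand-in rows are hypothetical.

```python
import random


def random_split(rows, train_weight=0.8, seed=1234):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(10_000))  # stand-in for dataframe rows
train, test = random_split(rows)
print(len(train), len(test))  # roughly 8000 and 2000
```

Fixing the seed makes the split repeatable, which helps when comparing models across runs.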

## Step 7: Build a Deep Learning Model

We'll now build a basic deep learning model using Keras. The Keras documentation describes it as "a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano." I find Keras to be one of the easiest deep learning APIs for Python. Also, I found an extension of Keras that allowed me to do easy distributed deep learning on Spark that could integrate with my PySpark pipeline.

First we need to determine the number of classes as well as the number of inputs from our data so we can plug those values into our Keras deep learning model.

In [23]:
nb_classes = train_data.select('label_index').distinct().count()
input_dim = len(train_data.select('features').first()[0])

Next we create a basic deep learning model. Using model = Sequential() from Keras, it's easy to add layers and build a deep learning model with all the desired settings (# of units, dropout %, regularization - l2, activation functions, etc.). I selected the common Adam optimizer and binary cross-entropy since our outcome label is binary.

In [24]:
model = keras.Sequential()
model.add(keras.layers.Dense(256, input_shape=(input_dim,), activation='relu', activity_regularizer=keras.regularizers.l2(0.01)))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Dense(256, activation='relu', activity_regularizer=keras.regularizers.l2(0.01)))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Dense(nb_classes, activation='sigmoid'))
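
For intuition about the binary cross-entropy loss mentioned above, here is a minimal plain-Python sketch of the formula. It is not how Keras implements it internally; the function name, the clipping constant eps, and the toy inputs are my own choices for illustration.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


print(binary_crossentropy([1, 0], [0.9, 0.1]))  # confident and right: small loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```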

Once the model is built we can view the architecture. Notice that we went from 30 input features to 74,242 trainable parameters. The beauty (sometimes laziness :) ) of deep learning is the automatic feature engineering.

In [25]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 256)               7936      
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 514       
Total params: 74,242
Trainable params: 74,242
Non-trainable params: 0
_________________________________________________________________
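
The parameter counts in the summary can be checked by hand: each Dense layer has one weight per input-unit pair plus one bias per unit. Here is a quick sketch of that arithmetic; the helper name dense_params is made up for this example.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [(30, 256), (256, 256), (256, 2)]  # (inputs, units) for the three Dense layers
per_layer = [dense_params(i, u) for i, u in layer_sizes]
print(per_layer)       # [7936, 65792, 514]
print(sum(per_layer))  # 74242
```

The Dropout layers contribute no parameters, which is why they show 0 in the summary.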


## Step 8: Distributed Deep Learning

Now that we have a model built, using Keras as our deep learning framework, we want to run that model on Spark to leverage its distributed analytics engine. We do that by using a Python library and an extension to Keras called Elephas. Elephas makes it pretty easy to run your Keras models on Apache Spark with a few lines of configuration. I found Elephas to be easier and more stable to use than the several other libraries I read about and tried.

The first thing we do with Elephas is create an estimator similar to some of the PySpark pipeline items above. We can set the optimizer settings right from the Keras optimizer function and then pass that to our Elephas estimator. I only explicitly use the Adam optimizer with a set learning rate, but you can use any Keras optimizer with its respective parameters (clipnorm, beta_1, beta_2, etc.).

Then within the Elephas estimator you specify a variety of items: features column, label column, # of epochs, batch size for training, validation split of your training data, loss function, metric, etc. I just used the settings from an Elephas example and modified the code slightly.

Notice that after we run the estimator the output, ElephasEstimator_b9a9846d15f7, looks similar to one of our pipeline stages list items. This can be passed directly into our PySpark pipeline to fit and transform our data which we'll do in the next step!

In [26]:
# Set and Serialize Optimizer
optimizer_conf = keras.optimizers.Adam(learning_rate=0.01)  # 'lr' is a deprecated alias of learning_rate
opt_conf = keras.optimizers.serialize(optimizer_conf)

# Initialize SparkML Estimator and Get Settings
estimator = ElephasEstimator()
estimator.setFeaturesCol("features")
estimator.setLabelCol("label_index")
estimator.set_keras_model_config(model.to_yaml())
estimator.set_categorical_labels(True)
estimator.set_nb_classes(nb_classes)
estimator.set_num_workers(1)
estimator.set_epochs(25) 
estimator.set_batch_size(64)
estimator.set_verbosity(1)
estimator.set_validation_split(0.10)
estimator.set_optimizer_config(opt_conf)
estimator.set_mode("synchronous")
estimator.set_loss("binary_crossentropy")
estimator.set_metrics(['acc'])

ElephasEstimator_b9a9846d15f7

## Step 9: Distributed Deep Learning Pipeline and Results

Now that our deep learning model is set up to run on Spark, using Elephas, we can pipeline it exactly how we did above using Pipeline(). You could also append it to our stages list and do all of this in one pass with a new dataset now that it's all built out, which would be super cool!

I created another helper function below called dl_pipeline_fit_score_results that takes the deep learning pipeline dl_pipeline and then does all the fitting, transforming, and prediction on both the train and test data sets. It also outputs the accuracy for both data sets and their confusion matrices.

In [28]:
# Create Deep Learning Pipeline
dl_pipeline = Pipeline(stages=[estimator])

In [29]:
def dl_pipeline_fit_score_results(dl_pipeline=dl_pipeline,
                                  train_data=train_data,
                                  test_data=test_data,
                                  label='label_index'):

    # Fit on the training data, then score both splits
    fit_dl_pipeline = dl_pipeline.fit(train_data)
    pred_train = fit_dl_pipeline.transform(train_data)
    pred_test = fit_dl_pipeline.transform(test_data)

    pnl_train = pred_train.select(label, "prediction")
    pnl_test = pred_test.select(label, "prediction")

    # MulticlassMetrics expects (prediction, label) pairs, in that order
    pred_and_label_train = pnl_train.rdd.map(lambda row: (row['prediction'], row[label]))
    pred_and_label_test = pnl_test.rdd.map(lambda row: (row['prediction'], row[label]))

    metrics_train = MulticlassMetrics(pred_and_label_train)
    metrics_test = MulticlassMetrics(pred_and_label_test)

    # .accuracy replaces the deprecated no-argument precision() call
    print("Training Data Accuracy: {}".format(round(metrics_train.accuracy, 4)))
    print("Training Data Confusion Matrix")
    display(pnl_train.crosstab(label, 'prediction').toPandas())

    print("\nTest Data Accuracy: {}".format(round(metrics_test.accuracy, 4)))
    print("Test Data Confusion Matrix")
    display(pnl_test.crosstab(label, 'prediction').toPandas())

Let's use our new deep learning pipeline and helper function on both data sets and test our results!

In [30]:
dl_pipeline_fit_score_results(dl_pipeline=dl_pipeline,
                              train_data=train_data,
                              test_data=test_data,
                              label='label_index');

>>> Fit model
>>> Synchronous training complete.
Training Data Accuracy: 0.7873
Training Data Confusion Matrix


Unnamed: 0,label_index_prediction,0.0,1.0
0,1.0,430,3843
1,0.0,3192,1471



Test Data Accuracy: 0.7884
Test Data Confusion Matrix


Unnamed: 0,label_index_prediction,0.0,1.0
0,1.0,83,933
1,0.0,822,388
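
The reported accuracies can be reproduced from the confusion matrices above. Here is a small sketch of that arithmetic, which also derives precision and recall for the positive class (label 1.0). The function name binary_metrics is made up for this example; the counts are copied from the crosstabs, reading rows as labels and columns as predictions.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, plus precision and recall for the positive class (label 1.0)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Counts copied from the crosstabs above (rows are labels, columns are predictions)
train = binary_metrics(tp=3843, fp=1471, fn=430, tn=3192)
test = binary_metrics(tp=933, fp=388, fn=83, tn=822)
print([round(m, 4) for m in train])
print([round(m, 4) for m in test])
```

The first value in each list is the accuracy, which should match the 0.7873 and 0.7884 printed by the helper function.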


# Conclusion

I hope that this example has been helpful. I know it was for me in learning more about PySpark pipelines and doing deep learning on Spark using an easy deep learning framework like Keras. Like I mentioned, I ran all this locally with little to no issue. My main objective was to prototype something for an upcoming project that will contain a massive dataset. Luckily, working for IBM allows me to leverage Watson, so I'll train and deploy on a large Spark cluster. My hope is that you (and I) can use this as a template while making a few changes to things like the Spark session, data, feature selection, and maybe adding or removing some pipeline stages. As always, thanks for reading and good luck on your next project!