# Introduction 

Spark is an analytics engine for big data processing and a popular topic in big data analytics. It provides high-level APIs in Scala (its native language), Java, R, and Python; the Python API is known as pySpark. Spark itself is written mainly in Scala and runs on the Java Virtual Machine (JVM), and pySpark is the Python wrapper that connects a Python session to that JVM engine. As big data and distributed computing become mainstream in analytics, demand for Spark skills is likely to keep growing. The good news is that the Spark API is simple to learn. Jupyter notebooks support Spark and are a convenient interactive environment for running pySpark code. In this notebook we focus on the pySpark API in Jupyter.


### Spark SQL 

Spark SQL is a Spark module for structured data processing. One obvious use of Spark SQL is to execute SQL queries. Unlike what the name suggests, however, Spark SQL is not just about SQL. It also contains two other objects: the Dataset and the DataFrame. A Dataset is a distributed collection of data. A DataFrame is a Dataset organized into named columns. Python does not support the Dataset API of Spark, but it does support the Spark dataframe, which is similar to the Pandas dataframe. Internally, however, the Spark dataframe is designed to scale data analysis to big data. One of the aims of this notebook is to see what the Spark dataframe is capable of and to compare and contrast it with the Pandas dataframe.


### Spark RDDs

Alongside the Spark dataframe there is another object called the RDD. RDD stands for resilient distributed dataset: a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. The Spark dataframe is a later addition to Spark. It is usually faster than working with RDDs directly and resembles the Pandas dataframe, so it is generally preferred over RDDs. Machine learning in Spark likewise comes in two separate libraries. One is RDD-based and is called MLlib. The other is dataframe-based and is known as Spark ML. A source of confusion: the Spark documentation refers to both libraries as MLlib. In the Spark community they are usually called Spark ML and MLlib because they are imported as spark.ml and spark.mllib respectively. The Spark documentation also states that the DataFrame-based API is the primary API going forward. We will not discuss RDDs and the RDD-based MLlib further in this notebook. Instead we focus on dataframes and Spark ML.


### Spark ML 

The Spark ML library is the primary machine learning library of Spark. Spark ML mimics the API of scikit-learn, which makes it familiar to Python users. Internally it is designed to make machine learning scalable to big data. Much like scikit-learn, Spark ML has the following features:
- Machine learning algorithms such as classification, regression, clustering, and collaborative filtering.
- Feature extraction, transformation, dimensionality reduction, and selection.
- Tools for constructing, evaluating, and tuning machine learning pipelines.
- Saving and loading algorithms, models, and pipelines.
- Linear algebra, statistics, data handling, etc.

The Spark ML library is not as big as scikit-learn, but it is growing steadily. We will use the Spark ML API in this notebook to perform machine learning tasks. In particular, we study logistic regression with Ridge and Lasso regularization and two popular tree-based ensemble learning methods, random forest and gradient boosting.



In addition to these, Spark has the GraphX library for graph processing, and Spark Streaming and Structured Streaming for incremental computation and stream processing, which we will not discuss here.


### Transformations & Actions 

The main purpose of Spark is to perform calculations on very large datasets stored in a distributed format. The data usually lives on a cluster of computers, which is very different from analyzing data that fits on a single computer and can be loaded into memory. Spark also supports running on a single machine, and this is the best way to start learning it. Even so, we should write Spark code with the big data case in mind, so that we do not exhaust the memory of the local machine when the data grows. We follow Spark best practices when writing code.

Generally, there are two types of operations in Spark: transformations and actions. Transformations construct a new dataframe from a previous one. Actions, on the other hand, compute a result based on a dataframe and either return it to the driver program or save it to an external storage system. For example, say we have a dataframe and select only a few of its columns. This is a transformation: it defines a new dataframe. But Spark does not execute the operation immediately. Remember that the datasets Spark is designed to handle are big, and executing a step means using memory and storage. Usually we apply a series of transformations during an analysis, and the final result may be much lighter on memory than the intermediate steps. For example, if our goal is to calculate the average of each of the selected columns, the end result is just a few numbers. In this example, calculating the average is an action, and an action demands execution on the dataframe.

Here is another example in the context of machine learning. A fitted ML model is a transformation: it transforms a DataFrame with features into a DataFrame with predictions. On the other hand, training a learning algorithm on a DataFrame and calculating the scores that evaluate the quality of the model both demand execution and produce a result, so they behave like actions.

So in this sense, we say Spark evaluates transformations lazily. One drawback of lazy evaluation can be seen in the following example. Suppose we select some columns from the dataframe and calculate the mean. If we then want the standard deviation, Spark has to select the same columns again before calculating it. For big data, recomputing in this way can still be the more effective route, because it avoids holding large intermediate results in memory. But if the same transformation feeds many actions, it can be convenient to store the intermediate result in memory. This is called caching. We will not discuss it further here.
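To make the idea concrete without a cluster, here is a minimal pure-Python sketch of lazy evaluation and caching. It is only an analogy, not how Spark is implemented: the class `LazyColumn` and its methods `map`, `cache`, and `collect` are hypothetical names chosen to echo Spark's vocabulary, and the fare values are toy numbers.

```python
class LazyColumn:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, source, steps=()):
        self.source = source        # the raw data
        self.steps = tuple(steps)   # recorded transformations, not yet run
        self._use_cache = False
        self._cached = None         # filled in by the first action after cache()
        self.runs = 0               # how many times the full plan was executed

    def map(self, fn):
        # Transformation: only records the step and returns a new object.
        return LazyColumn(self.source, self.steps + (fn,))

    def cache(self):
        # Ask for the result to be kept in memory once it has been computed.
        self._use_cache = True
        return self

    def collect(self):
        # Action: this is where the recorded steps actually run.
        if self._cached is not None:
            return self._cached
        self.runs += 1
        rows = list(self.source)
        for fn in self.steps:
            rows = [fn(r) for r in rows]
        if self._use_cache:
            self._cached = rows
        return rows


fares = LazyColumn([7.25, 71.28, 7.92, 53.1])

doubled = fares.map(lambda x: x * 2)   # nothing is computed here
runs_before_action = doubled.runs

total = sum(doubled.collect())         # first action executes the plan
largest = max(doubled.collect())       # second action executes it again

cached = fares.map(lambda x: x * 2).cache()
total_cached = sum(cached.collect())   # executes the plan and stores the rows
largest_cached = max(cached.collect()) # served from the stored rows

print(doubled.runs, cached.runs)       # 2 1
```

Without caching, every action replays the whole chain of transformations; with caching, the second action reuses the stored rows at the cost of keeping them in memory.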


# Starting pySpark

Spark is not available in Kaggle notebooks by default, but installing pySpark in a Kaggle kernel is easy.

In [None]:
!pip install pyspark

First, we need to start a SparkSession and create a spark instance.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('classification').getOrCreate()

In [None]:
from itertools import chain
from pyspark.sql.functions import count, mean, when, lit, create_map, regexp_extract

Reading CSV files in Spark is not that different from Pandas. The new thing here is the schema, which describes the structure of the DataFrame: the name and type (integer, float, string, and so on) of each column. If we enable inferSchema while reading, Spark scans the data and infers a type for each column; otherwise every column is read as a string.

In [None]:
df1 = spark.read.csv('../input/titanic/train.csv',\
                     header=True, inferSchema=True)
df2 = spark.read.csv('../input/titanic/test.csv', \
                     header=True, inferSchema=True)

We can view the schema here. There is no need to print the column names separately, although that option is available in Spark. We will use it shortly for a different purpose.

In [None]:
df1.printSchema()

Pandas has head(), tail(), and sample() to view a few rows of a dataframe. The Spark dataframe has methods with the same names, which you can try. Here we use the show(n) method to view the first n rows; by default, show() presents 20 rows. Recall the discussion of transformations and actions above: show(), head(), and tail() are actions, whereas sample() is a transformation that returns a new dataframe.

In [None]:
df1.show(4)

The output of show() can look ugly, especially if the dataframe has a large number of columns. At this point, we might miss the Pandas head(). There is an option to convert the Spark dataframe into a Pandas dataframe, but we have to be careful here. Spark usually handles a large volume of data, and converting it to Pandas loads everything into the memory of the driver at once, which we should avoid for large data. However, there is a way out. Recall the lazy evaluation of Spark transformations: we can first apply limit(n), a transformation that keeps only n rows, and then convert the result to Pandas with toPandas(), which is an action.

In [None]:
df1.limit(5).toPandas()

Alternatively we can select() a few columns and inspect within Spark. select() is an example of Spark transformation. Therefore that step is evaluated lazily. Hence we pass a Spark action show() at the end to print the result. 

In [None]:
df1.select('Survived', 'Pclass', 'Age', 'Fare').show(4)

Spark has a describe() method similar to the one in Pandas, but I find the summary() method more versatile. Please check [here](https://github.com/roshankoirala/pySpark_tutorial/blob/master/Exploratory_data_analysis_with_pySpark.ipynb) for details. Both describe() and summary() return a new dataframe, so we call show() at the end to print the result.

In [None]:
df1.select('Survived', 'Pclass', 'Age', 'Fare').summary().show()

count() acts differently in Pandas and Spark. In Spark, it gives the total number of rows in the dataframe. There is no direct equivalent of the Pandas shape attribute, but we can use the following trick. Here count() is an action, while columns is an attribute that only reads the schema and triggers no computation on the data.

In [None]:
print('Number of rows: \t', df1.count())
print('Number of columns: \t', len(df1.columns))

# Exploratory Data Analysis 

### About the data visualization in the Spark  

There is no native visualization library in Spark. But we can apply lazy transformations to the dataframe, extract the necessary numbers, and build visualizations from them. An implementation of this idea can be found [here](https://github.com/roshankoirala/pySpark_tutorial/blob/master/Data_visualization_in_pySpark%20.ipynb).

It is also possible to visualize Spark data with other libraries. However, we do not take that route here. We focus on tabular summaries, which are not a bad option.

### How many people survived?

We can use the groupBy() and count() transformations to do that. Both are lazy. Again, remember that Spark transformations alone do not evaluate anything until we call an action on them. Here show() is the action that prints the results.

In [None]:
df1.groupBy('Survived').count().show()

### Continuous variables

Among the features, Fare and Age are the continuous (non-categorical) variables. We can inspect them closely here. Our interest is in the average fare and age. We already used summary() to calculate the mean. If we want only the mean, we can either call summary('mean') or call mean() directly and pass the columns to it. We can also pass multiple arguments to summary(), like df.summary('mean', 'stddev') and so on. Again, please refer to [this](https://github.com/roshankoirala/pySpark_tutorial/blob/master/Exploratory_data_analysis_with_pySpark.ipynb) link for details. Here groupBy() and mean() are Spark transformations. By now you have probably started to figure out which operations are transformations and which are actions, so I will stop repeating it in every single case from now on.

Passengers who paid a higher fare were more likely to survive than those who paid less. This variable might be collinear with the passenger class, which we investigate later. Age seems to be less important for survival than fare.

In [None]:
df1.groupBy('Survived').mean('Fare', 'Age').show()

### Categorical variables

Here we see how each of the categorical variables affected the survival of the passengers. The Spark dataframe has a pivot() method, very similar to the one in Pandas, to perform this task.

Below we see that sex is an important factor for survival, probably the most important one. The survival ratio of females is much higher than that of males.

In [None]:
df1.groupBy('Survived').pivot('Sex').count().show()

Similarly, first-class passengers were more likely to survive than second-class passengers, and third-class passengers fared worst.

In [None]:
df1.groupBy('Survived').pivot('Pclass').count().show()

The number of siblings and spouses (SibSp) and the number of parents and children (Parch) aboard also play some role in survival. Members of large families were less likely to survive. Similarly, passengers with no companion were also less likely to survive.


In [None]:
df1.groupBy('Survived').pivot('SibSp').count().show()

In [None]:
df1.groupBy('Survived').pivot('Parch').count().show()

Embarked also seems to be important, but it can be collinear with Pclass and Fare.

In [None]:
df1.groupBy('Survived').pivot('Embarked').count().show()

# Feature Engineering 

First, let's see if there is any missing data.

In [None]:
for col in df1.columns:
    print(col.ljust(20), df1.filter(df1[col].isNull()).count())

Much of the Cabin information is missing. Cabin is related to Pclass, so we drop this feature. Two entries of Embarked are missing; we fill them with the most frequent value, S. Many ages are missing as well. The simplest way to impute age would be to fill with the average, but we use a more elaborate imputation method, discussed below. For Fare we choose the median and fill it in with Spark's fillna() method. For now we focus on the train data. Different features can be missing in the test data; in fact, one fare is missing there, which is why we calculate the median fare here. We come back to the test data at the end of this notebook.

In [None]:
df1.select('Fare', 'Embarked').summary('mean', '50%', 'max').show()

In [None]:
df1 = df1.fillna({'Embarked': 'S', 'Fare':14.45})

The basic idea for age imputation is to take each person's title from the Name column and impute with the average age of the people sharing that title. For example, passengers titled Mrs tend to be older than those titled Miss. This method originally appeared in [this](https://www.kaggle.com/konstantinmasich/titanic-0-82-0-83) kernel. We present a pySpark version of the implementation.

First, we extract the title using the regular expression and observe the count and average age with each of the titles. 
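To see what the pattern picks out before running it in Spark, here is a small pure-Python sketch that applies the same regular expression with the standard `re` module. The helper name `extract_title` is hypothetical, and the sample names are only illustrative strings in the dataset's "Surname, Title. Given names" format. Returning an empty string on no match is meant to mirror how `regexp_extract` behaves, which is my understanding rather than something shown in this notebook.

```python
import re

# Same pattern as in the Spark cell: one or more letters followed by a literal dot.
TITLE_PATTERN = re.compile(r'([A-Za-z]+)\.')


def extract_title(name):
    """Return the first title-like token in name, or '' when nothing matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ''


names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'No title here',
]
titles = [extract_title(n) for n in names]
print(titles)  # ['Mr', 'Miss', '']
```

The capture group keeps only the letters, so the trailing dot is not part of the extracted title.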

In [None]:
# Raw string so that the backslash reaches the regex engine unchanged
df1 = df1.withColumn('Title', regexp_extract(df1['Name'],
                r'([A-Za-z]+)\.', 1))

df1.groupBy('Title').agg(count('Age'), mean('Age')).sort('count(Age)').show()

We see that Mr, Miss, and Mrs occur far more often than the other titles. The count of Master is not that high, but its average age is much lower than the others. So we keep those four titles and map every other title to one of the first three.

In [None]:
title_dic = {'Mr':'Mr', 'Miss':'Miss', 'Mrs':'Mrs', 'Master':'Master', \
             'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr',\
             'Don': 'Mr', 'Mme': 'Miss', 'Jonkheer': 'Mr', 'Lady': 'Mrs',\
             'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs', \
             'Dr':'Mr', 'Rev':'Mr'}

mapping = create_map([lit(x) for x in chain(*title_dic.items())])

df1 = df1.withColumn('Title', mapping[df1['Title']])
df1.groupBy('Title').mean('Age').show()
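The `chain(*title_dic.items())` expression in the cell above can look cryptic. The pure-Python sketch below shows what it produces for a three-entry sample of the dictionary: a flat sequence alternating key, value, key, value, which is the argument layout `create_map` expects. The names `title_sample`, `flat`, and `rebuilt` are illustrative only.

```python
from itertools import chain

# A small sample of the title dictionary, for illustration.
title_sample = {'Mlle': 'Miss', 'Major': 'Mr', 'Lady': 'Mrs'}

# items() yields (key, value) pairs; chain(*pairs) flattens them in order.
flat = list(chain(*title_sample.items()))
print(flat)  # ['Mlle', 'Miss', 'Major', 'Mr', 'Lady', 'Mrs']

# Pairing even and odd positions back up recovers the original mapping.
rebuilt = dict(zip(flat[0::2], flat[1::2]))
print(rebuilt == title_sample)  # True
```

Because the map is looked up by the extracted title, every title that occurs in the data needs an entry in the dictionary; a title without one would not be mapped to any of the four groups.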

Now we create a function that imputes the age column with the average age of the group of people having the same name title as theirs. And use it to impute the ages in the next stage. 

In [None]:
def age_imputer(df, title, age):
    
    '''Fill null values in the 'Age' column of the dataframe df
    with the given age, but only for rows whose 'Title' equals
    the given title. Rows where 'Age' is not null keep their
    original age.'''
    
    return df.withColumn('Age', \
                         when((df['Age'].isNull()) & (df['Title']==title), \
                              age).otherwise(df['Age']))

In [None]:
df1 = age_imputer(df1, 'Mr', 33.02)
df1 = age_imputer(df1, 'Mrs', 35.98)
df1 = age_imputer(df1, 'Miss', 21.86)
df1 = age_imputer(df1, 'Master', 4.75)
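The constants passed to age_imputer() are the group means read off the earlier groupBy output. As a rough pure-Python sketch of the same logic on made-up toy rows, the snippet below computes the mean age per title and fills only the missing ages. The helper names `mean_age_by_title` and `impute_age` are hypothetical, and the numbers are not taken from the Titanic data.

```python
from collections import defaultdict

# Toy rows, invented for illustration; None marks a missing age.
rows = [
    {'Title': 'Mr', 'Age': 30.0},
    {'Title': 'Mr', 'Age': None},
    {'Title': 'Miss', 'Age': 20.0},
    {'Title': 'Miss', 'Age': 24.0},
    {'Title': 'Miss', 'Age': None},
]


def mean_age_by_title(rows):
    """Average the known ages within each title group."""
    totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of ages, count]
    for row in rows:
        if row['Age'] is not None:
            totals[row['Title']][0] += row['Age']
            totals[row['Title']][1] += 1
    return {title: s / n for title, (s, n) in totals.items()}


def impute_age(rows, means):
    """Fill a missing age with the title's mean; keep known ages unchanged."""
    return [
        dict(row, Age=means[row['Title']]) if row['Age'] is None else dict(row)
        for row in rows
    ]


means = mean_age_by_title(rows)
filled = impute_age(rows, means)
print(means)                        # {'Mr': 30.0, 'Miss': 22.0}
print([r['Age'] for r in filled])   # [30.0, 30.0, 20.0, 24.0, 22.0]
```

The conditional in `impute_age` plays the role of `when(...).otherwise(...)` in the Spark version: a value is replaced only when it is missing.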

### Creating a new column and dropping a column 

Now we create a new column called FamilySize by adding Parch and SibSp. This API differs significantly between Spark and Pandas. In Spark we use the withColumn() method. Its first argument is the name of the new column as a string, and its second is the expression that defines the column. This creates the new column and keeps the old ones, so we drop the Parch and SibSp columns afterward.

In [None]:
df1 = df1.withColumn('FamilySize', df1['Parch'] + df1['SibSp']).\
            drop('Parch', 'SibSp')

And drop the unwanted columns. 

In [None]:
df1 = df1.drop('PassengerId', 'Cabin', 'Name', 'Ticket', 'Title')

Now we have a trimmed dataframe. For a dataframe with few columns, the show(n) method is not much worse than the Pandas head(). Note that the Sex and Embarked columns are strings. We have to convert them to numeric categories.

In [None]:
df1.show(4)

And there are no missing values now.

In [None]:
for col in df1.columns:
    print(col.ljust(20), df1.filter(df1[col].isNull()).count())

# Model building 

So far we used Spark dataframe available in Spark SQL for EDA and feature engineering. Now we will use the Spark ML library to do ML tasks. We will cover the following ML task on Spark ML here:

- StringIndexer: Converts string categories to numerical categories. 
- VectorAssembler: Specific to the Spark API. Details follow shortly. 
- Logistic regression based on Ridge and Lasso regularization. 
- Tree-based ensemble methods: Random forest and Gradient boosting. 
- Pipeline: It is a big deal for big data. 


In [None]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression,\
                    RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

### String Indexer 

We will convert the Sex and Embarked columns from strings to numeric indices. This creates new numeric columns and leaves the originals intact, so we drop the original columns afterward.

In [None]:
stringIndex = StringIndexer(inputCols=['Sex', 'Embarked'], 
                       outputCols=['SexNum', 'EmbNum'])

stringIndex_model = stringIndex.fit(df1)

df1_ = stringIndex_model.transform(df1).drop('Sex', 'Embarked')
df1_.show(4)

### What is VectorAssembler?

In Python's scikit-learn API the model takes the X and y variables as separate objects. The target y is usually a column vector and the features X form a matrix, and scikit-learn accepts X as a matrix or dataframe directly. The Spark API is different. First, for training it requires X and y together in a single dataframe instead of two objects. For prediction it needs only X, as it should. Second, X must be a single column holding one vector per row (see the output below). In short, we cannot feed the dataframe to the model directly. Packing the feature columns into that vector column is what VectorAssembler does.

Below, inputCols are the feature columns that are going to be merged into one vector per row, and outputCol is the name of the merged column. This is the column that Spark ML treats as the feature column. It is common practice to name it features, because that is the name Spark ML looks for by default. If its name is different, you have to pass the column name when creating the model. After assembling, we can select only the features column and the y column. See the illustration below.

In [None]:
vec_asmbl = VectorAssembler(inputCols=df1_.columns[1:], 
                           outputCol='features')

df1_ = vec_asmbl.transform(df1_).select('features', 'Survived')
df1_.show(4, truncate=False)

Now we split the training data at random into train and validation parts, in a 7:3 ratio.

In [None]:
train_df, valid_df = df1_.randomSplit([0.7, 0.3])

In [None]:
train_df.show(4, truncate=False)

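The format of the features column in this output can be confusing. When a row contains many zeros, Spark prints the vector in sparse form, (size,[indices],[values]). For example, an illustrative value such as (6,[0,1,2],[3.0,22.0,7.25]) is a vector of length 6 whose entries at positions 0, 1, and 2 are 3.0, 22.0, and 7.25, with every other entry zero. Rows with few zeros are printed as ordinary dense vectors. Note also that the split divides the data by rows, not by columns.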
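As a quick aid for reading that notation, here is a minimal pure-Python sketch that expands a (size, indices, values) triple into the full list of numbers. The function name `sparse_to_dense` and the example values are illustrative and not part of the Spark API.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list of floats."""
    dense = [0.0] * size              # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = v                  # write the stored non-zero entries
    return dense


# Reads like the printed form (6,[0,1,2],[3.0,22.0,7.25])
row = sparse_to_dense(6, [0, 1, 2], [3.0, 22.0, 7.25])
print(row)  # [3.0, 22.0, 7.25, 0.0, 0.0, 0.0]
```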


### Linear model 

We study logistic regression here. Spark ML's logistic regression supports elastic net regularization out of the box. The regularization term is given by 

$$ \alpha \left( \lambda \| {\bf{w}} \|_1 \right) + (1 - \alpha) \left( \frac{\lambda}{2} \| {\bf{w}} \|_2^2 \right) $$

In the Spark API $\alpha$ is elasticNetParam and $\lambda$ is regParam. We obtain Ridge by choosing $\alpha=0$ and Lasso by choosing $\alpha=1$. 
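The following pure-Python sketch evaluates the regularization term above for a made-up weight vector, to show how the two parameters interact. The function `elastic_net_penalty` is a hypothetical helper written for this illustration; it is not a Spark function, and the weights are arbitrary.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """alpha * (lambda * ||w||_1) + (1 - alpha) * (lambda / 2 * ||w||_2^2)."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_sq = sum(w * w for w in weights)
    l1_term = elastic_net_param * reg_param * l1_norm
    l2_term = (1 - elastic_net_param) * (reg_param / 2) * l2_norm_sq
    return l1_term + l2_term


w = [0.5, -1.0, 2.0]   # arbitrary weights: ||w||_1 = 3.5, ||w||_2^2 = 5.25

# alpha = 0 keeps only the squared L2 term (Ridge)
ridge_penalty = elastic_net_penalty(w, reg_param=0.03, elastic_net_param=0.0)

# alpha = 1 keeps only the L1 term (Lasso)
lasso_penalty = elastic_net_penalty(w, reg_param=0.0003, elastic_net_param=1.0)

# Values in between mix the two
mixed_penalty = elastic_net_penalty(w, reg_param=0.03, elastic_net_param=0.5)

print(ridge_penalty, lasso_penalty, mixed_penalty)
```

Setting regParam to zero removes the penalty entirely, whatever the value of elasticNetParam.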

Please note that we need to specify the label column at this stage. If the features column had a different name, we would have to specify that here too. In Spark, we fit() the model much as in scikit-learn, but unlike scikit-learn, fit() returns a new fitted model object that we have to assign to a name instead of updating the estimator in place. The fitted linear model also has a unified evaluate() method, and the object it returns exposes quantities such as the predictions and the accuracy. 



### Evaluation and metric 


First, we instantiate MulticlassClassificationEvaluator(). We need to specify the metric we want at this stage, metricName='accuracy' in our case. Fitting and evaluating every model then follows the same pattern. For the linear model there is an alternative way to obtain the accuracy, using the following set of commands: 

- model_name = model.fit(data) 
- pred = model_name.evaluate(data)
- pred.accuracy

With this method, there is no need to import MulticlassClassificationEvaluator. But we stick with the first convention because it provides a uniform API for all the models. 
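For reference, accuracy is simply the fraction of rows whose predicted class equals the true label. The pure-Python sketch below spells that out on made-up labels and predictions. The function `accuracy` is an illustrative helper and not the evaluator's implementation, which may differ in details such as optional row weights.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError('labels and predictions must have the same length')
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Toy values: integer labels next to float-typed predictions,
# similar in spirit to a label column and a prediction column.
labels = [0, 1, 1, 0, 1]
predictions = [0.0, 1.0, 0.0, 0.0, 1.0]

score = accuracy(labels, predictions)
print(score)  # 0.8
```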


It is worth recalling the discussion of transformations and actions here. Creating an estimator, a preprocessing stage, or an evaluator such as MulticlassClassificationEvaluator() only builds an object and runs no computation, and calling transform() on a fitted model is a lazy transformation. Computation is triggered when we fit() a model, because training needs the data, and when we ask for a score with evaluate(), which is an action. 

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='Survived', 
                                          metricName='accuracy')

In [None]:
ridge = LogisticRegression(labelCol='Survived', 
                        maxIter=100, 
                        elasticNetParam=0, # Ridge regularization is chosen 
                        regParam=0.03)

model = ridge.fit(train_df)
pred = model.transform(valid_df)
evaluator.evaluate(pred)

In [None]:
lasso = LogisticRegression(labelCol='Survived', 
                           maxIter=100,
                           elasticNetParam=1, # Lasso
                           regParam=0.0003)

model = lasso.fit(train_df)
pred = model.transform(valid_df)
evaluator.evaluate(pred)

On this dataset, Lasso generally performs better than Ridge, though the exact scores depend on the random split. 


### Ensemble Tree 


Currently Spark ML supports two types of ensemble algorithms: random forest for bagging and gradient boosting for boosting. There is no stacking algorithm available in Spark ML yet. Here we study both available ensemble methods, which are both tree-based. 


In [None]:
rf = RandomForestClassifier(labelCol='Survived', 
                           numTrees=100, maxDepth=3)

model = rf.fit(train_df)
pred = model.transform(valid_df)
evaluator.evaluate(pred)

In [None]:
gb = GBTClassifier(labelCol='Survived', maxIter=75, maxDepth=3)

model = gb.fit(train_df)
pred = model.transform(valid_df)
evaluator.evaluate(pred)

On this dataset, the tree-based ensemble methods generally perform better than the linear models, and among the tree-based models, gradient boosting generally performs better than random forest, though the exact scores depend on the random split. We may test different methods for our final submission.  


# Prediction 

Now we focus on making predictions on the test data and submitting the result. We need to follow exactly the same data cleaning procedure for the test data. First we look at the first few rows and check whether there are any missing values in the test data. 

In [None]:
df2.show(4)

In [None]:
for col in df2.columns:
    print(col.ljust(20), df2.filter(df2[col].isNull()).count())

Unlike the train data, the test data has no missing value in the Embarked column, but there is one missing value for Fare, and some ages are missing. We fill the missing fare with the median fare (of the train data, not the test data). We ignore the missing Cabin values since we drop Cabin from our model. First, we make the family size feature and drop the columns it is built from.

In [None]:
df2 = df2.fillna({'Embarked': 'S', 'Fare':14.45})
df2 = df2.withColumn('FamilySize', df2['Parch'] + df2['SibSp']).\
            drop('Parch', 'SibSp')

Now we come to imputing the missing ages in the test data. We need to follow exactly the same stages as we did for the train data. The only thing to be careful about is that we impute with the averages computed from the training data, not from the test data. You will recognize these steps below. 

In [None]:
# Raw string so that the backslash reaches the regex engine unchanged
df2 = df2.withColumn('Title', regexp_extract(df2['Name'],
                r'([A-Za-z]+)\.', 1))

df2 = df2.withColumn('Title', mapping[df2['Title']])

df2.groupBy('Title').agg(count('Age'), mean('Age')).sort('count(Age)').show()

In [None]:
df2 = age_imputer(df2, 'Mr', 33.02)
df2 = age_imputer(df2, 'Mrs', 35.98)
df2 = age_imputer(df2, 'Miss', 21.86)
df2 = age_imputer(df2, 'Master', 4.75)

df2 = df2.drop('Cabin', 'Name', 'Ticket', 'Title') # keep PassengerId 
df2.show(4)

Let's check the missing values again. 

In [None]:
for col in df2.columns:
    print(col.ljust(20), df2.filter(df2[col].isNull()).count())

### Pipeline 

At this stage, it is worth introducing the pipeline. In machine learning, it is common to run a sequence of algorithms to process and learn from data. In our example, we applied a StringIndexer, a VectorAssembler, and an ML model. In other cases, the intermediate stages can be standardization, vectorization (for text processing), normalization, etc. These operations have to be performed in a specific order. Spark represents such a workflow as a Pipeline, which consists of a sequence of stages to be run in a specific order. A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. 

Without a pipeline, we have to execute each stage, store the outcome, feed it into the next stage, evaluate, and so on. We prefer a pipeline over this manual approach for the following reasons: 

- A pipeline is less error-prone because the stages are applied automatically and in the same order to the training and test data. 
- In a production environment, a pipeline is the standard way to do machine learning end to end. 
- A pipeline fits naturally with Spark's lazy evaluation, since the stages are declared first and executed only when needed. This matters even more for big data.
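The mechanics can be sketched in a few lines of pure Python. The classes below (`ToyIndexer`, `ToyScaler`, `ToyPipeline`) are hypothetical stand-ins written for this illustration and are not Spark classes. They differ from Spark in details: for instance, as far as I know Spark's StringIndexer orders labels by frequency by default rather than alphabetically, and Spark's Pipeline.fit() returns a separate fitted model object instead of the pipeline itself.

```python
class ToyIndexer:
    """Estimator-like stage: learns a string-to-index mapping in fit()."""

    def fit(self, rows):
        self.mapping = {v: i for i, v in enumerate(sorted(set(rows)))}
        return self

    def transform(self, rows):
        return [self.mapping[v] for v in rows]


class ToyScaler:
    """Estimator-like stage: learns the maximum in fit(), divides by it later."""

    def fit(self, rows):
        self.max = max(rows) or 1   # avoid dividing by zero
        return self

    def transform(self, rows):
        return [r / self.max for r in rows]


class ToyPipeline:
    """Runs the stages in order, feeding each stage's output to the next."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = ['male', 'female', 'female', 'other']
pipe = ToyPipeline([ToyIndexer(), ToyScaler()]).fit(train)

# New data passes through the same fitted stages in the same order.
result = pipe.transform(['female', 'other'])
print(result)  # [0.0, 1.0]
```

Everything learned in fit() (the index mapping, the maximum) comes from the training rows only and is then reused on new data, which is the property that makes a pipeline safer than repeating the steps by hand.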


### Grid-search and cross-validation 

Usually, the model we select has many hyperparameters, and some combination of them gives the best result. Tuning them requires checking many combinations of hyperparameter values. Doing this manually is a tedious bookkeeping task. Fortunately, Spark offers a grid search option, as scikit-learn does. 

When doing the grid search we need to validate the model on a separate dataset that was not used to train it. So far we have used a single hold-out validation set to compare different models. Spark usually handles very big data, and for big data a train-validation split can be sufficient. For small datasets like this one, however, cross-validation is preferred over a single train-validation split. Cross-validation is available in Spark. We will use five-fold cross-validation for better model selection. 
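To illustrate what k-fold cross-validation does with the rows, here is a minimal pure-Python sketch that produces the train and validation indices for each fold. The function `k_fold_indices` is a hypothetical helper; it assigns rows to folds deterministically for clarity, whereas a real implementation such as Spark's would assign them at random.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, valid_indices) for each of the k folds."""
    # Deal the row indices into k folds, round-robin.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


splits = list(k_fold_indices(n_rows=10, k=5))
for train, valid in splits:
    print(valid, len(train))
```

Each row is used for validation exactly once and for training in the other four folds, and the model's score is averaged over the five validation folds.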

We use CrossValidator available in Spark ML for the cross-validation. CrossValidator accepts estimatorParamMaps in which we can pass a grid search object built with ParamGridBuilder which is also available in Spark ML. 

We have chosen a random forest for our submission model. We tune three hyperparameters of the random forest: the maximum depth of the trees, the minimum information gain required for a split, and the number of trees. Tuning the number of trees is not tricky: more is generally better, and the only concern is the time it takes to train a large number of trees, so we fix it at a single value. The depth of the trees should be tuned carefully. A larger depth with some non-zero minimum info gain can give the best performance. We do not tune any parameters of the other stages in the pipeline. If we wanted to, we could add them to the grid as well. 
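A grid search grows multiplicatively with the number of values per hyperparameter, so it helps to count the fits before launching it. The pure-Python sketch below enumerates the combinations for the grid used in this notebook (three depths, three info-gain thresholds, one forest size) with `itertools.product`. The variable names are illustrative, and the fit count assumes one training run per combination per fold.

```python
from itertools import product

# Same values as the grid passed to ParamGridBuilder in this notebook.
grid = {
    'maxDepth': [3, 4, 5],
    'minInfoGain': [0.0, 0.01, 0.1],
    'numTrees': [1000],
}

# Every combination of one value per hyperparameter.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 5
cv_fits = len(combos) * num_folds

print(len(combos))  # 9
print(cv_fits)      # 45
```

With 1000 trees per forest, those 45 training runs explain why the cross-validation cell takes noticeably longer than the single fits earlier in the notebook.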



In [None]:
pipeline_rf = Pipeline(stages=[stringIndex, vec_asmbl, rf])

paramGrid = ParamGridBuilder().\
            addGrid(rf.maxDepth, [3, 4, 5]).\
            addGrid(rf.minInfoGain, [0., 0.01, 0.1]).\
            addGrid(rf.numTrees, [1000]).\
            build()

selected_model = CrossValidator(estimator=pipeline_rf, 
                                estimatorParamMaps=paramGrid, 
                                evaluator=evaluator, 
                                numFolds=5)

model_final = selected_model.fit(df1)
pred_train = model_final.transform(df1)
evaluator.evaluate(pred_train)

This is the in-sample accuracy, which is generally higher than the out-of-sample accuracy. 

In [None]:
pred_test = model_final.transform(df2)

predictions = pred_test.select('PassengerId', 'prediction')
predictions = predictions.\
                withColumn('Survived', predictions['prediction'].\
                cast('integer')).drop('prediction')
predictions.show(5)

The following cells show how to write a CSV file in Spark and how to read it back with the Spark API. There is a catch, though: Spark writes a directory named submission_file.csv containing one or more part files rather than a single file, so Pandas cannot read it directly and submission through the kernel does not work with it. For this reason we convert the predictions to a Pandas dataframe and write the submission file from there.  

In [None]:
# Writing csv file in Spark 
predictions.coalesce(1).write.csv('submission_file.csv', header=True)

In [None]:
# Reading the saved file from spark 
spark.read.csv('submission_file.csv', header=True).show(4)

In [None]:
# Writing csv file using Pandas 
predictions.toPandas().to_csv('submission.csv', index=False)

In [None]:
# Inspecting csv file in pandas 
import pandas as pd
pd.read_csv('submission.csv').head()

We can also save the model itself for future use, so that we do not have to train it every time.  

In [None]:
model_final.write().save('titanic_classification.model')

In [None]:
! ls titanic_classification.model/*

# Final thoughts

- We presented only a base model here and established that Spark can handle many machine learning tasks and the model building workflow seamlessly. The model is not fine-tuned yet. It would be interesting to see whether Spark can match scikit-learn's performance. 

- We did not perform standardization. The reason is that standardization is built into Spark's linear models (the standardization parameter, which is on by default) and is not needed for the tree-based models. As far as I understand, the built-in standardization uses only the data passed to fit(), so if we had chosen a linear model for prediction, the scaling would be based on the train data rather than the test data. 

- We did not include everything in the pipeline. For example, we imputed null values outside the pipeline. In a production environment, it is required to have everything in the pipeline. So there is room for improvement. 