# Logistic Regression Code Along
This is a code-along using the famous Titanic dataset. It is a nice dataset to start with because you will find it used as an example in pretty much every data analysis language.

## Sample Data
```
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
```


# B1: Creating SparkSession

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logistic_regression_code_along").getOrCreate()

# B2: Loading input corpus
Input file: 'titanic.csv'

In [2]:
dir_input_path = "./../input_data/"
file_input_path = dir_input_path + 'titanic.csv'

In [3]:
import os

if not os.path.exists(file_input_path):
    print("File Not Found : ", file_input_path)
else:
    print("Verified Input Path : ", file_input_path)

Verified Input Path :  ./../input_data/titanic.csv


In [4]:
df = spark.read.csv(file_input_path, header=True, inferSchema=True)

# B3: Showing overview of input corpus
## Schema

In [5]:
df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



## Description

In [6]:
df.describe().show()

+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|summary|      PassengerId|           Survived|            Pclass|                Name|   Sex|               Age|             SibSp|              Parch|            Ticket|             Fare|Cabin|Embarked|
+-------+-----------------+-------------------+------------------+--------------------+------+------------------+------------------+-------------------+------------------+-----------------+-----+--------+
|  count|              891|                891|               891|                 891|   891|               714|               891|                891|               891|              891|  204|     889|
|   mean|            446.0| 0.3838383838383838| 2.308641975308642|                null|  null| 29.69911764705882|0.5230078563411896|0.38159371492704824|260318.54916792738| 32.20420

## The column names

In [7]:
df.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

## Sample Data

In [8]:
df.show(2)

+-----------+--------+------+--------------------+------+----+-----+-----+---------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|   Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+---------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0| PC 17599|71.2833|  C85|       C|
+-----------+--------+------+--------------------+------+----+-----+-----+---------+-------+-----+--------+
only showing top 2 rows



In [10]:
df.toPandas()["Embarked"].unique()

array(['S', 'C', 'Q', None], dtype=object)

In [11]:
col_name = "Embarked"
[i[col_name] for i in df.select(col_name).distinct().collect()]

['Q', None, 'C', 'S']

In [12]:
col_name = "Embarked"
df.select(col_name).distinct().show()

+--------+
|Embarked|
+--------+
|       Q|
|    null|
|       C|
|       S|
+--------+



## Printing each item in the first row

In [13]:
for item in df.head():
    print(item)

1
0
3
Braund, Mr. Owen Harris
male
22.0
1
0
A/5 21171
7.25
None
S


## Analyzing the data in columns (in more detail)

As you can see, the dataset has some categorical features:

+ Sex (male, female)
+ Embarked (S, C, Q, plus a few missing values)

### Show distinct values in some columns: Sex, Embarked

In [14]:
col_name = "Sex"
df.select(col_name).distinct().show()

+------+
|   Sex|
+------+
|female|
|  male|
+------+



In [15]:
col_name = "Embarked"
df.select(col_name).distinct().show()

+--------+
|Embarked|
+--------+
|       Q|
|    null|
|       C|
|       S|
+--------+



### Show rows containing NULL values

In [16]:
col_name = "Embarked"
df.where(df[col_name].isNull()).show()

+-----------+--------+------+--------------------+------+----+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+------+----+-----+--------+
|         62|       1|     1| Icard, Miss. Amelie|female|38.0|    0|    0|113572|80.0|  B28|    null|
|        830|       1|     1|Stone, Mrs. Georg...|female|62.0|    0|    0|113572|80.0|  B28|    null|
+-----------+--------+------+--------------------+------+----+-----+-----+------+----+-----+--------+
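The `count` row of `df.describe()` shown earlier already tells how many values are missing in each column, because it only counts non-null entries. A small plain-Python sketch of that arithmetic, using the counts copied from that output (the variable names are illustrative only):

```python
# Non-null counts copied from the `count` row of df.describe() above.
total_rows = 891
non_null_counts = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Missing values per column = total rows - non-null count.
missing_counts = {col: total_rows - n for col, n in non_null_counts.items()}

for col, missing in missing_counts.items():
    print(f"{col}: {missing} missing ({missing / total_rows:.1%})")
```

The two missing Embarked values correspond to the two rows listed above. Cabin is missing for most passengers, which matters when rows with NULL values are dropped later.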



In [17]:
col_name = "Embarked"
df.where(df[col_name].isNotNull()).show(3)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
only showing top 3 rows



# B4: Data Preprocessing

## Dealing with the categorical variable
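+ Using StringIndexer to map each string label to a numeric index
+ Using OneHotEncoderEstimator to turn each index into a one-hot vector (in Spark 3.0 and later this class is named OneHotEncoder)

Example columns: Sex, Embarked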

In [18]:
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

In [19]:
df.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [20]:
sex_indexer = StringIndexer(inputCol="Sex", outputCol="sex_indexer")
sex_onehot = OneHotEncoderEstimator(inputCols=["sex_indexer"], outputCols=["sex_onehot"])

embarked_indexer = StringIndexer(inputCol="Embarked", outputCol="embarked_indexer")
embarked_onehot = OneHotEncoderEstimator(inputCols=["embarked_indexer"], outputCols=["embarked_onehot"])
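To make the two stages above more concrete, here is a rough plain-Python sketch of what they do conceptually. It is not Spark's implementation: the helper names `fit_string_index` and `one_hot_drop_last` are made up for illustration, the toy `embarked` list is invented, and breaking ties alphabetically is an assumption. It mirrors the default behaviour of indexing the most frequent label as 0 and dropping the last category from the one-hot vector.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here (an assumption for determinism).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot_drop_last(index, num_categories):
    """Dense one-hot vector of length num_categories - 1.

    The last category is represented by the all-zeros vector.
    """
    vec = [0.0] * (num_categories - 1)
    if int(index) < num_categories - 1:
        vec[int(index)] = 1.0
    return vec


# Toy column, invented for illustration.
embarked = ["S", "C", "S", "Q", "S", "C"]

mapping = fit_string_index(embarked)
encoded = {
    label: one_hot_drop_last(idx, len(mapping))
    for label, idx in mapping.items()
}

print(mapping)
print(encoded)
```

With three Embarked categories the encoded vector has length 2, and with two Sex categories it has length 1, which is consistent with the `(2,[0],[1.0])` and `(1,[0],[1.0])` sparse vectors that appear in the test output later in this notebook.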

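## Drop rows with NULL values

In [21]:
# na.drop() with no arguments removes every row that has a NULL in any column,
# including columns such as Cabin that are not used as features later.
df_not_null = df.na.drop()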

In [22]:
# Verify that no NULL values remain in the Embarked column
col_name = "Embarked"
df_not_null.where(df_not_null[col_name].isNull()).show()

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
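Dropping on every column is quite aggressive for this dataset, since Cabin is NULL for most passengers but is never used as a feature. The plain-Python sketch below contrasts "drop if any column is missing" with "drop only if a needed column is missing", using four rows taken from the outputs above (trimmed to three columns). The helper names are hypothetical; in PySpark the second behaviour is usually obtained by passing a `subset` list to `na.drop`, or by selecting the needed columns first.

```python
# Rows copied from the outputs above, trimmed to the columns that matter here.
rows = [
    {"PassengerId": 1, "Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"PassengerId": 2, "Age": 38.0, "Cabin": "C85", "Embarked": "C"},
    {"PassengerId": 3, "Age": 26.0, "Cabin": None, "Embarked": "S"},
    {"PassengerId": 62, "Age": 38.0, "Cabin": "B28", "Embarked": None},
]


def drop_any_null(records):
    """Keep only records with no missing value in any column."""
    return [r for r in records if all(v is not None for v in r.values())]


def drop_null_in(records, subset):
    """Keep records that have no missing value in the listed columns."""
    return [r for r in records if all(r[c] is not None for c in subset)]


kept_any = [r["PassengerId"] for r in drop_any_null(rows)]
kept_subset = [r["PassengerId"] for r in drop_null_in(rows, ["Age", "Embarked"])]

print("drop on all columns   :", kept_any)
print("drop on Age, Embarked :", kept_subset)
```

Only passenger 2 survives the all-columns drop, which matches the first row of `df_not_null` shown in the next section.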



# B5: Creating VectorAssembler for 'features'

In [23]:
df_not_null.columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [24]:
df_not_null.show(1)

+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|  Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----+--------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|PC 17599|71.2833|  C85|       C|
+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----+--------+
only showing top 1 row



In [25]:
from pyspark.ml.feature import VectorAssembler

In [26]:
assembler = VectorAssembler(inputCols=['Pclass', 'sex_onehot', 
                                       'Age', 'SibSp', 
                                       'Parch', 'Fare', 
                                       'embarked_onehot'], outputCol="features")
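Conceptually, the assembler concatenates the listed columns, scalars and vectors alike, into one flat feature vector per row. A minimal plain-Python sketch of that idea follows; `assemble` is a hypothetical helper, not a Spark API, and the row values are those of passenger 4 from the test output shown later in this notebook, with the sparse one-hot vectors written out densely.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features


# Passenger 4; (1,[0],[1.0]) -> [1.0] and (2,[0],[1.0]) -> [1.0, 0.0].
row = {
    "Pclass": 1,
    "sex_onehot": [1.0],
    "Age": 35.0,
    "SibSp": 1,
    "Parch": 0,
    "Fare": 53.1,
    "embarked_onehot": [1.0, 0.0],
}
input_cols = ["Pclass", "sex_onehot", "Age", "SibSp",
              "Parch", "Fare", "embarked_onehot"]

features = assemble(row, input_cols)
print(features)
```

The first entries of the result line up with the truncated `features` value `[1.0,1.0,35.0,1.0...` displayed for that passenger.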

# B7: Training & Testing Phase

## Creating Model object

In [27]:
from pyspark.ml.classification import LogisticRegression

In [28]:
lg = LogisticRegression(featuresCol="features", labelCol="Survived")

## (Approach 2) Applying pipeline
+ Using Pipeline from pyspark.ml with the following stages: StringIndexer, OneHotEncoderEstimator, VectorAssembler, LogisticRegression
+ Splitting the full data into a training set and a testing set
    - Fitting the pipeline on the training set
    - Transforming the testing set

In [29]:
from pyspark.ml import Pipeline

In [30]:
pipeline = Pipeline(stages=[
    sex_indexer, embarked_indexer, sex_onehot, embarked_onehot, assembler, lg
])

In [31]:
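# Splitting the full data into a training set and a testing set.
# No seed is passed, so the split (and the results below) will differ between runs;
# randomSplit also accepts a seed argument for reproducibility.
train_set, test_set = df_not_null.randomSplit([0.7, 0.3])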

In [32]:
## Fitting pipeline with training set
lg_model_fit = pipeline.fit(train_set)

In [33]:
## Transforming with testing set
test_result = lg_model_fit.transform(test_set)

In [34]:
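# Note: test_result is a DataFrame, so this only stores a reference to the
# DataFrame.summary method. It is not a model evaluation summary and is not used below.
test_summary = test_result.summary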

In [35]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [36]:
test_result.show(2)

+-----------+--------+------+--------------------+------+----+-----+-----+------+-----+-----+--------+-----------+----------------+-------------+---------------+--------------------+--------------------+--------------------+----------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|Ticket| Fare|Cabin|Embarked|sex_indexer|embarked_indexer|   sex_onehot|embarked_onehot|            features|       rawPrediction|         probability|prediction|
+-----------+--------+------+--------------------+------+----+-----+-----+------+-----+-----+--------+-----------+----------------+-------------+---------------+--------------------+--------------------+--------------------+----------+
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|113803| 53.1| C123|       S|        0.0|             0.0|(1,[0],[1.0])|  (2,[0],[1.0])|[1.0,1.0,35.0,1.0...|[-3.0880694482562...|[0.04360206986101...|       1.0|
|         12|       1|     1|Bonnell, Miss. El...|female
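The `rawPrediction` and `probability` columns in the row above are related through the logistic (sigmoid) function. The sketch below checks this for the first entry of each vector, assuming that the two entries of `rawPrediction` are the negative and positive margin and that the default decision threshold of 0.5 is in use. The raw value is copied from the truncated output, so the comparison is only approximate.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# First entry of rawPrediction for passenger 4 (digits as shown, truncated).
raw_class0 = -3.0880694482562

prob_class0 = sigmoid(raw_class0)   # probability of Survived = 0
prob_class1 = 1.0 - prob_class0     # probability of Survived = 1

# Assumed default threshold of 0.5 on the class-1 probability.
predicted = 1.0 if prob_class1 > 0.5 else 0.0

print(f"P(class 0) = {prob_class0:.6f}")
print(f"P(class 1) = {prob_class1:.6f}")
print(f"prediction = {predicted}")
```

The computed class-0 probability agrees with the displayed `0.04360206986101...`, and the resulting label matches the `prediction` column for that row.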

In [37]:
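# Note: rawPredictionCol points at the hard 0/1 "prediction" column, so the AUC below
# is computed from thresholded labels rather than from the model's scores.
eval = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="Survived")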

In [38]:
test_result.select("prediction", "Survived").show(3)

+----------+--------+
|prediction|Survived|
+----------+--------+
|       1.0|       1|
|       1.0|       1|
|       0.0|       1|
+----------+--------+
only showing top 3 rows



In [39]:
print("AUC :", eval.evaluate(test_result))

AUC : 0.708403361344538
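Because the evaluator above was given the thresholded `prediction` column, the reported figure reflects only hard 0/1 labels. The plain-Python sketch below illustrates why that matters: the same toy model can get a noticeably different AUC depending on whether scores or thresholded labels are ranked. The data and the `auc` helper are invented for illustration and are not taken from the Titanic run.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive is scored above a
    random negative, with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (hypothetical values).
labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
hard_predictions = [1.0 if p > 0.5 else 0.0 for p in probabilities]

auc_from_scores = auc(labels, probabilities)
auc_from_labels = auc(labels, hard_predictions)

print(f"AUC from probabilities   : {auc_from_scores:.4f}")
print(f"AUC from hard predictions: {auc_from_labels:.4f}")
```

In PySpark, leaving `rawPredictionCol` at its default value of "rawPrediction" should make the evaluator rank rows by the model's scores instead, which is usually the more informative measure.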
