# Predict Liver Failure based on People's Demographics

***(Feature Creation Notebook)***

## 3. Feature Creation

In this notebook, we perform feature creation and feature engineering on the demographic and health data that we stored in the IBM Object store during the previous step, Extract, Transform and Load (ETL).

### 3.1. Load Source Data from Data Store
Let us start by loading the data from the IBM Object store into this notebook for further processing. We will connect to the object store, read a Parquet file and create a DataFrame from it. Registering the DataFrame as a temporary view also lets us query it with Spark SQL, much like a database table.

In [1]:
# import required packages and libraries
import types
import pandas as pd
import numpy as np
import ibmos2spark

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190405073152-0000
KERNEL_ID = bcc1b214-9566-449c-9959-23a2e65f0662


In [2]:
# @hidden_cell
credentials = {
    'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'api_key': 'yR6pr44dLxKcEe_-J-YBRKtI9LaoOcG9v_c2zK_I1epP',
    'service_id': 'iam-ServiceId-dd08a5f3-28d2-4f87-bc12-4ec0662689f2',
    'iam_service_endpoint': 'https://iam.bluemix.net/oidc/token'}

configuration_name = 'os_85bf8a7fa4e54387abd3bbb49b9490af_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_data = spark.read.parquet(cos.url('ALF_Data.parquet', 'fundamentalsofscalabledatascience-donotdelete-pr-qbkdskud4vsck0'))
print("Number of records = ", df_data.count(), "\n")
df_data.createOrReplaceTempView('alf_data')
df_data.show()

Number of records =  5221 

+---+------+-------------+-----+--------------------+--------------------+---------------+--------------+----------------+------------+---+----------------+----------+------------------+------------+------------------+--------+--------------+---------+---------------+--------------+---+
|Age|Gender|BodyMassIndex|Waist|MaximumBloodPressure|MinimumBloodPressure|GoodCholesterol|BadCholesterol|TotalCholesterol|Dyslipidemia|PVD|PhysicalActivity|PoorVision|AlcoholConsumption|HyperTension|FamilyHyperTension|Diabetes|FamilyDiabetes|Hepatitis|FamilyHepatitis|ChronicFatigue|ALF|
+---+------+-------------+-----+--------------------+--------------------+---------------+--------------+----------------+------------+---+----------------+----------+------------------+------------+------------------+--------+--------------+---------+---------------+--------------+---+
| 65|     1|        21.31| 83.6|               135.0|                71.0|           48.0|         249.0|   

### 3.2. Feature Engineering

Let us now apply One Hot Encoding to the categorical integer features viz. Gender, Dyslipidemia, PVD, PhysicalActivity, PoorVision, AlcoholConsumption, HyperTension, FamilyHyperTension, Diabetes, FamilyDiabetes, Hepatitis, FamilyHepatitis and ChronicFatigue.

In [3]:
from pyspark.ml.feature import OneHotEncoder

# Create one hot encoders for the categorical features
encoder1 = OneHotEncoder(inputCol = 'Gender', outputCol = 'GenderVec')
encoder2 = OneHotEncoder(inputCol = 'Dyslipidemia', outputCol = 'DyslipidemiaVec')
encoder3 = OneHotEncoder(inputCol = 'PVD', outputCol = 'PVDVec')
encoder4 = OneHotEncoder(inputCol = 'PhysicalActivity', outputCol = 'PhysicalActivityVec')
encoder5 = OneHotEncoder(inputCol = 'PoorVision', outputCol = 'PoorVisionVec')
encoder6 = OneHotEncoder(inputCol = 'AlcoholConsumption', outputCol = 'AlcoholConsumptionVec')
encoder7 = OneHotEncoder(inputCol = 'HyperTension', outputCol = 'HyperTensionVec')
encoder8 = OneHotEncoder(inputCol = 'FamilyHyperTension', outputCol = 'FamilyHyperTensionVec')
encoder9 = OneHotEncoder(inputCol = 'Diabetes', outputCol = 'DiabetesVec')
encoder10 = OneHotEncoder(inputCol = 'FamilyDiabetes', outputCol = 'FamilyDiabetesVec')
encoder11 = OneHotEncoder(inputCol = 'Hepatitis', outputCol = 'HepatitisVec')
encoder12 = OneHotEncoder(inputCol = 'FamilyHepatitis', outputCol = 'FamilyHepatitisVec')
encoder13 = OneHotEncoder(inputCol = 'ChronicFatigue', outputCol = 'ChronicFatigueVec')
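
To make the encoding step concrete, here is a minimal pure-Python sketch of what one-hot encoding produces for a single category index. It assumes the behaviour of Spark's `dropLast` parameter, which defaults to `True` and leaves the last category as an all-zero vector. The helper name `one_hot_drop_last` and the example category counts are hypothetical and are not taken from the dataset.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a category index as a vector of length num_categories - 1.

    The last category is represented by an all-zero vector, mirroring the
    dropLast=True default of Spark's OneHotEncoder (an assumption of this
    sketch, not code from the notebook).
    """
    if not 0 <= index < num_categories:
        raise ValueError("category index out of range")
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec


# Hypothetical binary flag with category indices 0 and 1
print(one_hot_drop_last(0, 2))
print(one_hot_drop_last(1, 2))

# Hypothetical three-level feature with category indices 0, 1 and 2
print(one_hot_drop_last(1, 3))
print(one_hot_drop_last(2, 3))
```

Dropping the last category keeps the encoded columns from summing to one in every row, which avoids a redundant column in the features vector.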

Let us now merge the features into a single features vector and then normalize them.

In [4]:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# Create a features vector
vectorAssembler = VectorAssembler(inputCols = ['GenderVec', 'DyslipidemiaVec', 'PVDVec', 'PhysicalActivityVec', 'PoorVisionVec', 
                                               'AlcoholConsumptionVec', 'HyperTensionVec', 'FamilyHyperTensionVec', 
                                               'DiabetesVec', 'FamilyDiabetesVec', 'HepatitisVec', 'FamilyHepatitisVec', 
                                               'ChronicFatigueVec','Age', 'BodyMassIndex', 'Waist', 'MaximumBloodPressure', 
                                               'MinimumBloodPressure', 'GoodCholesterol', 'BadCholesterol', 'TotalCholesterol'],
                                  outputCol = 'featuresVec')

# Normalize the features data
normalizer = MinMaxScaler(inputCol = 'featuresVec', outputCol = 'features')
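
The two stages above can be pictured with a small pure-Python sketch: first the encoded vectors and numeric columns are flattened into one list per row, then each column is rescaled to the range [0, 1]. The helper names `assemble` and `min_max_scale` and the three sample rows are made up for illustration. The handling of constant columns follows the rule documented for Spark's `MinMaxScaler`, which maps them to `0.5 * (max + min)` of the target range.

```python
def assemble(*parts):
    """Concatenate scalars and lists into one flat feature list."""
    flat = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            flat.extend(float(x) for x in part)
        else:
            flat.append(float(part))
    return flat


def min_max_scale(rows, lo=0.0, hi=1.0):
    """Rescale each column of rows to [lo, hi] using the column min and max."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        out = []
        for x, cmin, cmax in zip(row, mins, maxs):
            if cmax == cmin:
                # Constant column: use the midpoint of the target range
                out.append(0.5 * (hi + lo))
            else:
                out.append((x - cmin) / (cmax - cmin) * (hi - lo) + lo)
        scaled.append(out)
    return scaled


# Made-up rows: (one-hot flag vector, Age, BodyMassIndex)
rows = [
    assemble([0.0], 40, 20.0),
    assemble([0.0], 50, 25.0),
    assemble([0.0], 60, 30.0),
]
scaled = min_max_scale(rows)
for row in scaled:
    print(row)
```

In this sketch the first column never varies, so it is mapped to 0.5 in every row. A constant column is one possible explanation for a 0.5 that repeats at the same position in every scaled features vector, although this notebook does not confirm that for its own data.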

Now, let us create a Machine Learning Pipeline to apply the above feature engineering stages to our dataset.

In [5]:
from pyspark.ml import Pipeline

# Create a Feature Engineering ML pipeline
pipeline = Pipeline(stages = [encoder1, encoder2, encoder3, encoder4, encoder5, encoder6, encoder7, encoder8, encoder9,
                              encoder10, encoder11, encoder12, encoder13, vectorAssembler, normalizer])
df_normalized_data = pipeline.fit(df_data).transform(df_data)
df_normalized_data.show()

+---+------+-------------+-----+--------------------+--------------------+---------------+--------------+----------------+------------+---+----------------+----------+------------------+------------+------------------+--------+--------------+---------+---------------+--------------+---+-------------+---------------+-------------+-------------------+-------------+---------------------+---------------+---------------------+-------------+-----------------+-------------+------------------+-----------------+--------------------+--------------------+
|Age|Gender|BodyMassIndex|Waist|MaximumBloodPressure|MinimumBloodPressure|GoodCholesterol|BadCholesterol|TotalCholesterol|Dyslipidemia|PVD|PhysicalActivity|PoorVision|AlcoholConsumption|HyperTension|FamilyHyperTension|Diabetes|FamilyDiabetes|Hepatitis|FamilyHepatitis|ChronicFatigue|ALF|    GenderVec|DyslipidemiaVec|       PVDVec|PhysicalActivityVec|PoorVisionVec|AlcoholConsumptionVec|HyperTensionVec|FamilyHyperTensionVec|  DiabetesVec|FamilyDiab
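
As a rough illustration of what `pipeline.fit(...).transform(...)` does, the following pure-Python sketch mimics the fit/transform contract of a pipeline: stages that need to learn something from the data (estimators, such as a scaler that learns minima and maxima) are fitted on the output of the stages before them, while stateless stages (transformers) are applied directly. All class names here are hypothetical stand-ins, not Spark classes.

```python
class AddOne:
    """Transformer: stateless, so it only exposes transform()."""

    def transform(self, data):
        return [x + 1 for x in data]


class CenterModel:
    """Fitted model produced by CenterEstimator."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


class CenterEstimator:
    """Estimator: fit() learns the mean and returns a fitted model."""

    def fit(self, data):
        return CenterModel(sum(data) / len(data))


class SimplePipelineModel:
    """Holds fitted stages and applies them in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class SimplePipeline:
    """Fits each estimator on the data as transformed by earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        current = data
        for stage in self.stages:
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            current = model.transform(current)
            fitted.append(model)
        return SimplePipelineModel(fitted)


data = [1.0, 2.0, 3.0]
model = SimplePipeline([AddOne(), CenterEstimator()]).fit(data)
result = model.transform(data)
print(result)
```

The point of the design is that the fitted pipeline model can be reused: the statistics learned during `fit` are stored in the model and applied unchanged whenever `transform` is called on new data.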

Now that we have created the normalized features vector, let us drop all the other columns from our dataset and retain only the label column (ALF) and the normalized features vector (features).

In [6]:
# Keep only the label column and the normalized features vector;
# this is equivalent to dropping every other column one by one
df_normalized_data = df_normalized_data.select('ALF', 'features')
                    
df_normalized_data.show()

+---+--------------------+
|ALF|            features|
+---+--------------------+
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,0.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|1.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,0.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
+---+--------------------+
only showing top 20 rows



### 3.3. Store Feature Engineered Data in IBM Object Store

Let us persist the feature-engineered data to the IBM Object store so that we can use it in the next step of our process, i.e. Model Definition.

In [7]:
df_normalized_data = df_normalized_data.repartition(1)
df_normalized_data.write.parquet(cos.url('ALF_Normalized.parquet', 
                                         'fundamentalsofscalabledatascience-donotdelete-pr-qbkdskud4vsck0'))

Now that the data has been persisted in the IBM Object store, let us read it back and confirm that it looks correct.

In [8]:
df_persisted_data = spark.read.parquet(cos.url('ALF_Normalized.parquet', 
                                               'fundamentalsofscalabledatascience-donotdelete-pr-qbkdskud4vsck0'))
print('Number of records persisted = ', df_persisted_data.count())
df_persisted_data.show()

Number of records persisted =  5221
+---+--------------------+
|ALF|            features|
+---+--------------------+
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,0.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|1.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,0.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,0.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
|0.0|[0.5,1.0,1.0,1.0,...|
+---+--------------------+
only showing top 20 rows



The feature-engineered data persisted in the Object store looks good: the record count still matches the 5221 records we loaded at the start. We can now go ahead and define our model in the next step of our process.