<h1>Logistic Regression using Spark MLlib</h1>


The dataset is the <strong>Pima Indians Diabetes Dataset</strong> [https://www.kaggle.com/uciml/pima-indians-diabetes-database/version/1]. 
It originates from the National Institute of Diabetes and Digestive and Kidney Diseases and is used to classify whether
a person has diabetes. All records are from female patients. The features are: Pregnancies (number of pregnancies), Glucose (glucose concentration), BloodPressure, SkinThickness, 
Insulin, BMI, DiabetesPedigreeFunction, and Age. The label (Outcome) is 1 for diabetic and 0 for non-diabetic; 268 of the 768 participants are labeled 1 (diabetic). 

In [1]:
from pyspark.sql import SparkSession #import spark session
spark= SparkSession.builder.appName("MLlib demo").getOrCreate() #Create a spark session using MLlib demo as name

In [2]:
#conda install -c conda-forge pyspark


<h3>Load the Diabetes Dataset</h3>

In [3]:
#Load diabetes.csv from local storage into a DataFrame. header=True treats the first row as column names;
#inferSchema=True infers the column types directly from the data
diab_ds=spark.read.csv('diabetes.csv', header=True, inferSchema=True)

<strong>Get the name of features of the dataset</strong>

In [4]:
diab_ds.columns

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Outcome']

<strong>Get the datatype of the features</strong>

In [5]:
diab_ds.dtypes

[('Pregnancies', 'int'),
 ('Glucose', 'int'),
 ('BloodPressure', 'int'),
 ('SkinThickness', 'int'),
 ('Insulin', 'int'),
 ('BMI', 'double'),
 ('DiabetesPedigreeFunction', 'double'),
 ('Age', 'int'),
 ('Outcome', 'int')]

<strong> Read First row of dataset</strong>

In [6]:
diab_ds.head() #returns the first row as a Row object (field name/value pairs)

Row(Pregnancies=6, Glucose=148, BloodPressure=72, SkinThickness=35, Insulin=0, BMI=33.6, DiabetesPedigreeFunction=0.627, Age=50, Outcome=1)

<p> <strong>To see the number of examples in each class, we group by Outcome and count the rows. 
The result shows 268 diabetic and 500 non-diabetic examples, so the dataset is imbalanced. </strong></p>

In [7]:
diab_ds.groupBy('Outcome').count().show()

+-------+-----+
|Outcome|count|
+-------+-----+
|      1|  268|
|      0|  500|
+-------+-----+
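
<p>The counts above can be turned into class proportions and a majority-class baseline, which is a useful reference point when judging a classifier on imbalanced data. The following is a minimal plain-Python sketch (no Spark needed); the variable names are illustrative only.</p>

```python
# Class counts taken from the groupBy output above.
counts = {1: 268, 0: 500}
total = sum(counts.values())

# Share of each class in the dataset.
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(proportions)
print(majority_label, round(baseline_accuracy, 3))
```

<p>A model whose test accuracy is close to this baseline has learned little beyond the class frequencies.</p>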



<h3>Dataset preparation and preprocessing for logistic regression</h3>

<p>We need <strong>VectorAssembler</strong>, a transformer that combines a set of selected columns into
a single feature vector. For our dataset, it combines the 8 feature columns into one 
feature vector per row. </p>
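
<p>Conceptually, assembling just concatenates the selected column values of a row, in the given order, into one numeric vector. The sketch below illustrates this in plain Python using the first row shown earlier; <code>assemble</code> is a hypothetical helper for illustration, not Spark's implementation.</p>

```python
# Column order used for assembly; the label column (Outcome) is deliberately excluded.
feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

def assemble(row, cols):
    """Return the selected columns of `row` as one list of floats, in `cols` order."""
    return [float(row[c]) for c in cols]

# First row of the dataset, as returned by diab_ds.head() above.
first_row = {'Pregnancies': 6, 'Glucose': 148, 'BloodPressure': 72,
             'SkinThickness': 35, 'Insulin': 0, 'BMI': 33.6,
             'DiabetesPedigreeFunction': 0.627, 'Age': 50, 'Outcome': 1}

features = assemble(first_row, feature_cols)
print(features)
```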


In [9]:
!pip install numpy

Collecting numpy
  Using cached numpy-1.18.1.zip (5.4 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: numpy
  Building wheel for numpy (PEP 517) ... done
  Created wheel for numpy: filename=numpy-1.18.1-cp36-cp36m-linux_x86_64.whl size=13382120 sha256=8de28aa28fe356f8108fd4d1954b15d93d1f4f19ef7836c42517d0ff4af7d3d8
  Stored in directory: /root/.cache/pip/wheels/92/18/20/83339b2576b5911519a6b616d8b4a6df8b14358ba5cd612a0b
Successfully built numpy
Installing collected packages: numpy
Successfully installed numpy-1.18.1


In [10]:
#!pip install numpy
#import vectors and vector assembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [12]:
#create a VectorAssembler transformer that combines all feature columns into one vector column called "features"
assembler = VectorAssembler(
    inputCols=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
               'DiabetesPedigreeFunction', 'Age'],
    outputCol="features"
)

In [13]:
diab_ds_vec=assembler.transform(diab_ds) #Transform diab_ds using the assembler

In [14]:
diab_ds_vec.show(3) #show the first three rows

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+--------------------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|            features|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+--------------------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|[6.0,148.0,72.0,3...|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|[1.0,85.0,66.0,29...|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|[8.0,183.0,64.0,0...|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+--------------------+
only showing top 3 rows



<strong> Now we select the features and Outcome columns to build the dataset used in the preprocessing steps that follow </strong>

In [15]:
diab_db_final= diab_ds_vec.select('features','Outcome') #select features and Outcome from diab_ds_vec
diab_db_final.show(3) #show the first three rows

+--------------------+-------+
|            features|Outcome|
+--------------------+-------+
|[6.0,148.0,72.0,3...|      1|
|[1.0,85.0,66.0,29...|      0|
|[8.0,183.0,64.0,0...|      1|
+--------------------+-------+
only showing top 3 rows



<strong>Feature scaling</strong>
<p>Since the features are on different scales, we apply scaling to bring them onto a comparable scale. We use <strong>StandardScaler</strong>, an MLlib estimator that is fit on a dataset of vector rows and produces a model that scales each feature to unit standard deviation and/or zero mean. Feature scaling can improve accuracy for some classifiers.</p>
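
<p>To make the scaling step concrete, here is a minimal plain-Python sketch of scaling one column to unit standard deviation without centering (the <code>withStd=True, withMean=False</code> setting used in this notebook). It assumes the sample (n - 1) standard deviation, which is what the Spark documentation describes; <code>scale_to_unit_std</code> is a hypothetical helper, and the three values are the Glucose readings from the rows shown earlier, whereas Spark computes the statistic over the whole dataset.</p>

```python
import statistics

def scale_to_unit_std(column):
    """Divide each value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(column)  # sample (n - 1) standard deviation
    return [value / std for value in column]

# Glucose values of the three rows displayed earlier (illustration only).
glucose = [148.0, 85.0, 183.0]
scaled = scale_to_unit_std(glucose)

print(scaled)
print(statistics.stdev(scaled))  # spread of the scaled column
```

<p>Because no mean is subtracted, zeros stay zero after scaling, so vectors with many zero entries can remain sparse.</p>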

In [16]:
from pyspark.ml.feature import StandardScaler

In [17]:
#withStd=True scales each feature to unit standard deviation; withMean=False skips centering the data to zero mean before scaling
scaler=StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

In [18]:
#Now build scaler model using diab_db_final
ScalerModel=scaler.fit(diab_db_final)

In [19]:
#Transform diab_db_final using the scaler model
diab_db_scaled= ScalerModel.transform(diab_db_final)
diab_db_scaled.show(1) #show first row of scaled dataset

+--------------------+-------+--------------------+
|            features|Outcome|      scaledFeatures|
+--------------------+-------+--------------------+
|[6.0,148.0,72.0,3...|      1|[1.78063837321943...|
+--------------------+-------+--------------------+
only showing top 1 row



<strong>Now we split our scaled dataset into training and testing data. We use 75% for training and 25% for testing</strong>

In [20]:
diab_db_train, diab_db_test=diab_db_scaled.select('scaledFeatures','Outcome').randomSplit([0.75,0.25])
diab_db_train.show(3)

+--------------------+-------+
|      scaledFeatures|Outcome|
+--------------------+-------+
|(8,[0,1,6,7],[0.5...|      0|
|(8,[0,1,6,7],[0.5...|      0|
|(8,[0,1,6,7],[0.8...|      0|
+--------------------+-------+
only showing top 3 rows
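
<p>The split above is random, so the rows in each part change between runs unless a seed is fixed (<code>randomSplit</code> accepts an optional <code>seed</code> argument). The plain-Python sketch below illustrates the idea of a seeded, per-row random split; <code>random_split</code> is a hypothetical helper and not Spark's implementation. As with this kind of sampling in general, the resulting sizes are only approximately 75/25.</p>

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row independently to train or test using a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(768))  # stand-in for the 768 rows of the dataset

# Using the same seed twice gives the same split.
train_a, test_a = random_split(rows, 0.75, seed=42)
train_b, test_b = random_split(rows, 0.75, seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b)
```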



<strong>Logistic Regression classifier</strong>
<p>Logistic regression is a classification algorithm. Binomial logistic regression is used for binary classification, and multinomial logistic regression is used for multi-class classification. Here the label is binary, so the binomial form applies.</p>
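
<p>As a rough illustration of how binomial logistic regression produces a prediction, the sketch below computes a linear score from weights and features, passes it through the logistic (sigmoid) function to get a probability, and thresholds it at 0.5. The weights, bias, and feature values are made up for the example and are not taken from the trained model.</p>

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical parameters and input, for illustration only.
weights = [0.5, -0.25]
bias = 0.1
features = [2.0, 4.0]

p = predict_proba(weights, bias, features)  # linear score z = 0.1 + 1.0 - 1.0 = 0.1
label = 1 if p >= 0.5 else 0

print(p, label)
```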

In [21]:
from pyspark.ml.classification import LogisticRegression
lr= LogisticRegression(maxIter=50, featuresCol='scaledFeatures', labelCol='Outcome') #define the estimator, specifying labelCol as Outcome
lrModel=lr.fit(diab_db_train) # Train the model using diab_db_train
trainingSummary= lrModel.summary


<strong>Evaluation of logistic regression using test data</strong>

In [22]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator #evaluator for binary classifiers (area under ROC by default)
predictions= lrModel.evaluate(diab_db_test) #evaluate the model on the test data; returns a summary object
predictions.predictions.show() #show the summary's predictions DataFrame
#evaluator= BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol="Outcome")
#evaluator.evaluate(predictions.predictions)



+--------------------+-------+--------------------+--------------------+----------+
|      scaledFeatures|Outcome|       rawPrediction|         probability|prediction|
+--------------------+-------+--------------------+--------------------+----------+
|(8,[0,1,6,7],[1.7...|      0|[2.88688855254209...|[0.94719447233909...|       0.0|
|(8,[1,5,6,7],[5.2...|      1|[-1.3156757608887...|[0.21153863383538...|       1.0|
|[0.0,2.9087389538...|      0|[2.09266489629441...|[0.89018819869731...|       0.0|
|[0.0,2.9400157167...|      0|[1.62563943834837...|[0.83557140956014...|       0.0|
|[0.0,2.9712924797...|      1|[2.19498630430049...|[0.89979837499423...|       0.0|
|[0.0,3.0338460056...|      0|[1.74460304716309...|[0.85127079468984...|       0.0|
|[0.0,3.0651227685...|      0|[3.06592745158107...|[0.95546520065146...|       0.0|
|[0.0,3.1276762944...|      0|[2.14000371221546...|[0.89473096012218...|       0.0|
|[0.0,3.1902298203...|      0|[2.22243331431568...|[0.90224601966007...|    