Predicting insurance charges from patient details with PySpark

1) Log in to the virtual machine
2) Open LXTerminal
3) Start Hadoop by running ./allstart.sh
4) Copy the required CSV file (insurance.csv) to the local system using the Bitvise client
5) Once Hadoop is running, import the file into HDFS with the command hadoop fs -put insurance.csv
6) With Hadoop running, start PySpark with the command pysparknb
7) In PySpark, open a new Jupyter notebook and start the project

In [12]:
from pyspark.sql import SparkSession

In [13]:
spark= SparkSession.builder.appName('nlp').getOrCreate()

In [14]:
#Importing all the necessary libraries
import numpy as np

from pyspark.ml.feature import StringIndexer, OneHotEncoder

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler, StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [15]:
#Reading the dataset
insurance= spark.read.csv('insurance.csv', header=True, inferSchema=True)

In [16]:
insurance.printSchema()

root
 |-- age: integer (nullable = true)
 |-- sex: string (nullable = true)
 |-- bmi: double (nullable = true)
 |-- children: integer (nullable = true)
 |-- smoker: string (nullable = true)
 |-- region: string (nullable = true)
 |-- charges: double (nullable = true)



In [17]:
# Counting the dataset by sex
insurance.groupBy('sex').count().show()

+------+-----+
|   sex|count|
+------+-----+
|female|  662|
|  male|  676|
+------+-----+



In [18]:
# Counting the dataset by smoker type
insurance.groupBy('smoker').count().show()

+------+-----+
|smoker|count|
+------+-----+
|    no| 1064|
|   yes|  274|
+------+-----+



In [19]:
# Counting the dataset by region
insurance.groupBy('region').count().show()

+---------+-----+
|   region|count|
+---------+-----+
|northwest|  325|
|southeast|  364|
|northeast|  324|
|southwest|  325|
+---------+-----+



In [20]:
# Counting the dataset by children
insurance.groupBy('children').count().show()

+--------+-----+
|children|count|
+--------+-----+
|       1|  324|
|       3|  157|
|       5|   18|
|       4|   25|
|       2|  240|
|       0|  574|
+--------+-----+



String indexing is applied to the dataset to map string values to numeric indices, so that the categorical columns (sex, smoker, region) can be used as features for modelling.
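
As a rough illustration of the indexing rule, the sketch below reproduces it in plain Python rather than calling Spark: labels are ordered by descending frequency, so the most frequent label gets index 0.0. The helper name index_labels is made up for this example, the counts are the ones printed by the groupBy cells above, and the alphabetical tie-break is an assumption based on how recent Spark versions document the default frequencyDesc ordering of StringIndexer.

```python
def index_labels(counts):
    """Map each label to a float index, most frequent label first.

    Hypothetical helper mimicking StringIndexer's default ordering;
    ties are broken alphabetically (an assumption, see the lead-in).
    """
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Counts copied from the groupBy('sex') and groupBy('region') outputs
sex_index = index_labels({"female": 662, "male": 676})
region_index = index_labels(
    {"northwest": 325, "southeast": 364, "northeast": 324, "southwest": 325}
)
print(sex_index)
print(region_index)
```

Under these assumptions male maps to 0.0 and female to 1.0, and southwest maps to 2.0, which agrees with the sex_index and region_index values in the data.show(10) output of the next cell.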

In [23]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
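# Note: the exclusion set below holds one long string that names no column of
# insurance.csv, so nothing is excluded and every column (numeric ones included)
# gets an "_index" column. Only sex_index, smoker_index and region_index are used later.
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(insurance) for column in list(set(insurance.columns)-set(['Provider_Id, Provider_Zip_Code, Total_Discharges, Average_Covered_Charges, Average_Total_Payments, Average_Medicare_Payments'])) ]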
pipeline = Pipeline(stages=indexer)
data = pipeline.fit(insurance).transform(insurance)
data.show(10)

+---+------+------+--------+------+---------+-----------+---------+--------------+------------+-------------+---------+---------+------------+
|age|   sex|   bmi|children|smoker|   region|    charges|bmi_index|children_index|region_index|charges_index|age_index|sex_index|smoker_index|
+---+------+------+--------+------+---------+-----------+---------+--------------+------------+-------------+---------+---------+------------+
| 19|female|  27.9|       0|   yes|southwest|  16884.924|    412.0|           0.0|         2.0|        340.0|      1.0|      1.0|         1.0|
| 18|  male| 33.77|       1|    no|southeast|  1725.5523|    283.0|           1.0|         0.0|        358.0|      0.0|      0.0|         0.0|
| 28|  male|  33.0|       3|    no|southeast|   4449.462|     32.0|           3.0|         0.0|        891.0|     17.0|      0.0|         0.0|
| 33|  male|22.705|       0|    no|northwest|21984.47061|    130.0|           0.0|         1.0|        500.0|     30.0|      0.0|         0.0|

In [24]:
from pyspark.ml.feature import VectorAssembler

# Assemble the numeric columns and the indexed categorical columns into a single "features" vector
assembler = VectorAssembler(
    inputCols=["age","children","bmi","region_index", "smoker_index", "sex_index"], outputCol="features")
output = assembler.transform(data)
output.select('features', 'charges').show(10)

+--------------------+-----------+
|            features|    charges|
+--------------------+-----------+
|[19.0,0.0,27.9,2....|  16884.924|
|[18.0,1.0,33.77,0...|  1725.5523|
|[28.0,3.0,33.0,0....|   4449.462|
|[33.0,0.0,22.705,...|21984.47061|
|[32.0,0.0,28.88,1...|  3866.8552|
|[31.0,0.0,25.74,0...|  3756.6216|
|[46.0,1.0,33.44,0...|  8240.5896|
|[37.0,3.0,27.74,1...|  7281.5056|
|[37.0,2.0,29.83,3...|  6406.4107|
|[60.0,0.0,25.84,1...|28923.13692|
+--------------------+-----------+
only showing top 10 rows
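
Conceptually, the assembler concatenates the selected columns of each row, in the order given by inputCols, into one numeric vector. The plain-Python sketch below (no Spark involved; assemble is a hypothetical helper) does the same for the first row of the dataset.

```python
INPUT_COLS = ["age", "children", "bmi", "region_index", "smoker_index", "sex_index"]

def assemble(row, input_cols=INPUT_COLS):
    """Return the row's values for input_cols, in order, as a list of floats."""
    return [float(row[col]) for col in input_cols]

# First row of the indexed dataset, copied from the data.show(10) output
first_row = {
    "age": 19, "sex": "female", "bmi": 27.9, "children": 0,
    "smoker": "yes", "region": "southwest", "charges": 16884.924,
    "region_index": 2.0, "sex_index": 1.0, "smoker_index": 1.0,
}
features = assemble(first_row)
print(features)  # [19.0, 0.0, 27.9, 2.0, 1.0, 1.0]
```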



Splitting the dataset into training (70%) and test (30%) sets for modelling

In [26]:
split = output.randomSplit([0.7, 0.3])
train_data = split[0]
test_data = split[1]
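
Because randomSplit is called without a seed, each run yields a different partition, so the metrics reported further down will vary slightly from run to run; randomSplit accepts an optional seed argument that makes the split repeatable. The plain-Python sketch below illustrates the idea with an independent random draw per row. It only approximates the behaviour and is not Spark's actual sampling code; random_split is a hypothetical helper.

```python
import random

def random_split(rows, train_fraction=0.7, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1338))  # stand-in for the 1338 rows of insurance.csv
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)
print(len(train_a), len(test_a))
```

With a fixed seed both calls return the same partition. The split sizes are only approximately 70/30 because each row is drawn independently.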

Training a Linear Regression model on the training data

In [27]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
linr = LinearRegression(maxIter=6, regParam=0.0, labelCol='charges', solver="normal")
model = linr.fit(train_data)

In [34]:
# Training-set metrics from the model summary
Summary = model.summary
print("RMSE: %f" % Summary.rootMeanSquaredError)
print("r2: %f" % Summary.r2)

RMSE: 5923.434330
r2: 0.764535


In [35]:
# Predicting the test dataset using the trained model
predict = model.transform(test_data)
predict.show(10)

+---+------+------+--------+------+---------+-----------+---------+--------------+------------+-------------+---------+---------+------------+--------------------+-------------------+
|age|   sex|   bmi|children|smoker|   region|    charges|bmi_index|children_index|region_index|charges_index|age_index|sex_index|smoker_index|            features|         prediction|
+---+------+------+--------+------+---------+-----------+---------+--------------+------------+-------------+---------+---------+------------+--------------------+-------------------+
| 18|female| 20.79|       0|    no|southeast|  1607.5101|    353.0|           0.0|         0.0|        306.0|      0.0|      1.0|         0.0|[18.0,0.0,20.79,0...|-2227.4462131201926|
| 18|female|26.315|       0|    no|northeast| 2198.18985|     44.0|           0.0|         3.0|        499.0|      0.0|      1.0|         0.0|[18.0,0.0,26.315,...|  981.0823568349751|
| 18|female| 27.28|       3|   yes|southeast| 18223.4512|    408.0|           3.

In [36]:
evaluator = RegressionEvaluator(labelCol="charges")
# metricName "rmse" already returns the root mean squared error, so no further square root is needed
rmse = evaluator.evaluate(predict, {evaluator.metricName: "rmse"})
rmse

6412.241413541546

In [37]:
print("R Squared (R2) on test data = %g" % evaluator.evaluate(predict,{evaluator.metricName:"r2" }))

R Squared (R2) on test data = 0.706879
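
For reference, the two metrics reported by the evaluator can be written out directly: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean label. The sketch below uses plain Python and made-up toy numbers (not values from insurance.csv); rmse_and_r2 is a hypothetical helper, not part of PySpark.

```python
import math

def rmse_and_r2(labels, predictions):
    """Return (RMSE, R2) for paired lists of true and predicted values."""
    n = len(labels)
    mean_label = sum(labels) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return math.sqrt(ss_res / n), 1.0 - ss_res / ss_tot

# Toy data: predicting the mean label for every row gives R2 = 0
labels = [1.0, 2.0, 3.0]
rmse_mean, r2_mean = rmse_and_r2(labels, [2.0, 2.0, 2.0])
# Perfect predictions give RMSE = 0 and R2 = 1
rmse_exact, r2_exact = rmse_and_r2(labels, labels)
print(rmse_mean, r2_mean)
print(rmse_exact, r2_exact)
```

RMSE is expressed in the same unit as the label (here, charges), while R2 is unitless and can fall below zero when a model does worse than always predicting the mean.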


The analysis shows that the Linear Regression model reaches an RMSE of 5923.4 and an R2 of 0.765 on the training data, and an RMSE of 6412.2 and an R2 of 0.707 on the test data. The model therefore explains roughly 70% of the variance in charges for patients it has not seen, which suggests it can give a reasonable first estimate of the charges from the patient details.