## Experiment with Spark ML (PySpark)



> ## Below are the steps to transfer files from local drive to Hadoop Ecosystem:
    
cd ~

hdfs dfs -rm -r hdfs://localhost:9000/user/ashok/data_files/prabhash_assignment_allstate_claims

hdfs dfs -mkdir -p hdfs://localhost:9000/user/ashok/data_files/prabhash_assignment_allstate_claims

Transfer files from local file system

cd ~

hdfs dfs -put /cdata/prabhash_assignment_allstate_claims/train.csv hdfs://localhost:9000/user/ashok/data_files/prabhash_assignment_allstate_claims

hdfs dfs -ls -h hdfs://localhost:9000/user/ashok/data_files/prabhash_assignment_allstate_claims

In [None]:
!pip install --quiet sparkmagic
!pip install --quiet pyspark

In [None]:
!pyspark --version


In [None]:
#  Increase the width of notebook to display all columns of data

from IPython.display import display, HTML   # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:100% !important; }</style>"))


In [None]:
# Show multiple outputs of a single cell

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.functions import udf, col

In [None]:
spark = SparkSession \
            .builder.master('local[*]')\
            .appName('allstate_claims')\
            .getOrCreate()     
spark


In [None]:
sc = spark.sparkContext
sc

In [None]:
sqlContext = SQLContext(spark.sparkContext)
sqlContext

In [None]:
#  Read, transform and understand the data
#    pyspark creates a spark-session variable: spark

df = spark.read.csv(
                   path = "../input/allstate-claims-severity/train.csv",   
                   header = True,
                   inferSchema= True,           # Infer datatypes automatically
                   sep=","
                   )

In [None]:
df.take(2)

In [None]:
df.show(2)

In [None]:
df.dtypes


In [None]:
# Data shape
df.count()     #How many rows?      
cols = df.columns
len(cols)            
print(cols)

In [None]:
df.printSchema()

In [None]:
# What is the nature of df:
type(df)                     # pyspark.sql.dataframe.DataFrame

In [None]:
#  We also cache the data so that we only read it from disk once.
df.cache()
df.is_cached            # Checks if df is cached

In [None]:
# Show the DataFrame in parts (a slice of columns at a time):
df.select(cols[:15]).show(3)
df.select(cols[15:25]).show(3)
df.select(cols[25:35]).show(3)
df.select(cols[35:45]).show(3)
df.select(cols[45:]).show(3)

In [None]:
df.tail(2)

Summary Statistics:
Spark DataFrames include some built-in functions for statistical processing. The describe() function performs summary statistics calculations on all numeric columns and returns them as a DataFrame.
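
As a rough illustration of what describe() reports for one numeric column, the plain-Python sketch below computes the same five statistics (count, mean, sample standard deviation, min, max). The helper name `describe_column` and the sample values are made up for this example; they are not taken from the dataset.

```python
import statistics

def describe_column(values):
    """Return the five statistics that describe() lists for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (divides by n - 1)
        "min": min(values),
        "max": max(values),
    }

# Hypothetical values standing in for a column such as `cont1`
sample = [0.2, 0.4, 0.6, 0.8]
summary = describe_column(sample)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

Spark computes these values in a distributed way over the whole DataFrame; the sketch only mirrors the definitions.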

In [None]:
(df.describe().select(
                    "summary",
                    F.round("cont1", 4).alias("cont1"),
                    F.round("cont2", 4).alias("cont2"),
                    F.round("cont3", 4).alias("cont3"),
                    F.round("cont4", 4).alias("cont4"),
                    F.round("cont5", 4).alias("cont5"),
                    F.round("cont6", 4).alias("cont6"),
                    F.round("cont7", 4).alias("cont7"),
                    F.round("cont13", 4).alias("cont13"),
                    F.round("cont14", 4).alias("cont14"),
                    F.round("loss", 4).alias("loss"))
                    .show())

Look at the minimum and maximum values of all the (numerical) attributes. We see that multiple attributes have a wide range of values: we will need to normalize our dataset.

## Preprocessing the Target Values

First, let's start with the loss column, our dependent variable. To make the target values easier to work with, we will express the loss values in units of 1000. That means that a target such as 3037.3377 should become roughly 3.037:

In [None]:
# Express the values of `loss` in units of 1000
df = df.withColumn("loss", col("loss")/1000)
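
Because the model is later trained on the scaled target, its predictions and any error metric are also expressed in thousands. A minimal plain-Python sketch of the scaling and its inverse follows; the helper names `to_thousands` and `from_thousands` are hypothetical and do not appear elsewhere in the notebook.

```python
SCALE = 1000  # same divisor as used for the `loss` column

def to_thousands(loss):
    """Scale a raw loss value into units of 1000."""
    return loss / SCALE

def from_thousands(scaled):
    """Convert a scaled value (for example, a prediction) back to raw units."""
    return scaled * SCALE

scaled = to_thousands(3037.3377)
restored = from_thousands(scaled)
print(round(scaled, 3), restored)
```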

In [None]:
df.show(2)

In [None]:
#  Which columns to drop?

columns_to_drop = ['id']
df= df.drop(*columns_to_drop)

In [None]:
df.dtypes

In [None]:
from pyspark.sql.functions import col

df.select(col("loss")).show(5)
df.select("loss").show(5)

In [None]:

df = df.withColumnRenamed('loss', 'label')
print(df.columns)

In [None]:
# Set the random seed for notebook reproducibility
import pandas as pd
import numpy as np

rnd_seed = 23
np.random.seed(rnd_seed)   # call the function; assigning to np.random.seed would overwrite it

In [None]:
# Data splitting  #

# Split the dataset randomly into 70% for training and 30% for testing.
train, validation = df.randomSplit([0.7, 0.3],seed=rnd_seed)


print(train.count()/df.count())
print(validation.count()/df.count())
# Alternative: split without a fixed seed (kept for reference)

#splits = df.randomSplit([0.7, 0.3])
#train = splits[0]
#test = splits[1].withColumnRenamed("loss", "Label")
#train_rows = train.count()
#test_rows = test.count()
#print("Training Rows:", train_rows, " Testing Rows:", test_rows)
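
The printed fractions are only approximately 0.7 and 0.3 because a random split assigns rows one by one rather than cutting the data at a fixed position. The sketch below is a simplified plain-Python model of a seeded weighted split, not Spark's implementation (which samples per partition); the function name `random_split` is an assumption made for this example.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 1.0]
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off

    rng = random.Random(seed)          # a fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()            # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(10_000))
train_rows, validation_rows = random_split(rows, [0.7, 0.3], seed=23)
print(len(train_rows) / len(rows), len(validation_rows) / len(rows))
```

Running the function twice with the same seed returns the same split, which is why a seed is passed to randomSplit.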

The number of operations applied to our training data is large.
As a result, the DAG (Directed Acyclic Graph, the logical flow of operations constructed by Spark) for our data becomes too large to handle, and we may end up with the following error:
Py4JJavaError: An error occurred while calling o903.fit.

Converting the train DataFrame into an RDD (Resilient Distributed Dataset) and then back into a DataFrame shrinks the DAG considerably.
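
The idea can be pictured with a toy model in plain Python. The class `LazyTable` below is purely hypothetical and is not how Spark is implemented; it only shows how a lazily evaluated dataset accumulates pending steps, and how rebuilding the dataset from its current rows leaves an empty plan while the data stays the same.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated dataset (not Spark's implementation)."""

    def __init__(self, rows, plan=()):
        self._rows = list(rows)
        self.plan = tuple(plan)   # pending transformations, i.e. the "lineage"

    def map(self, fn):
        # Record the step instead of running it, as a lazy engine would
        return LazyTable(self._rows, self.plan + (fn,))

    def collect(self):
        # Run every recorded step, in order, over the source rows
        rows = self._rows
        for fn in self.plan:
            rows = [fn(r) for r in rows]
        return rows

    def truncate_lineage(self):
        # Rebuild the table from its current rows; the new plan is empty
        return LazyTable(self.collect())

table = LazyTable([1, 2, 3])
for _ in range(50):
    table = table.map(lambda x: x + 1)

short = table.truncate_lineage()
print(len(table.plan), len(short.plan))
```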

In [None]:
train.count()

In [None]:
train.explain(extended=True)


In [None]:
#train.checkpoint()
train = spark.createDataFrame(train.rdd, schema=train.schema)


In [None]:
# Now, check the size of your DAG

# Prints the logical and physical plans; compare their length with the earlier output
train.explain(extended=True)


In [None]:
validation.explain(extended=True)

In [None]:
#validation.checkpoint()
validation = spark.createDataFrame(validation.rdd, schema=validation.schema)

In [None]:
validation.explain(extended=True)


## Creating transformation objects

In [None]:
#  Encode 'string' column to index-column. 
#     Indexing begins from 0.
from pyspark.ml.feature import StringIndexer

# List all categorical columns and create objects to StringIndex all these categorical columns

cat_columns = [ c[0] for c in df.dtypes if c[1] == "string"]


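# handleInvalid='keep' puts labels not seen during fit into an extra index
#   instead of raising an error when the validation data is transformed
stringindexer_stages = [ StringIndexer(inputCol=c, outputCol='stringindexed_' + c, handleInvalid='keep') for c in cat_columns]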
stringindexer_stages

In [None]:
len(stringindexer_stages)
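
To make the indexing concrete, here is a small plain-Python sketch of frequency-based string indexing, where the most frequent label receives index 0. The function names and sample values are made up, and the alphabetical tie-break is an assumption about the default ordering that may differ between Spark versions.

```python
from collections import Counter

def fit_string_index(values):
    """Learn a label -> index mapping, with the most frequent label at index 0."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_string_index(values, mapping):
    """Replace each label by its learned index."""
    return [mapping[v] for v in values]

# Hypothetical values standing in for a categorical column such as `cat1`
cat1 = ["A", "B", "A", "C", "A", "B"]
mapping = fit_string_index(cat1)
indexed = transform_string_index(cat1, mapping)
print(mapping)
print(indexed)
```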

In [None]:
# Prepare one OneHotEncoder object for all categorical columns (received from above)
#  It one-hot encodes each StringIndexed column into a new column
from pyspark.ml.feature import OneHotEncoder

in_cols = ['stringindexed_' + c for c in cat_columns]
ohe_cols = ['onehotencoded_' + c  for c in cat_columns]
onehotencoder_stages = [OneHotEncoder(inputCols=in_cols, outputCols=ohe_cols)]
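
A short plain-Python sketch of one-hot encoding a category index follows. It assumes the drop-last convention (the last category is encoded as an all-zeros vector), which is the default of Spark's OneHotEncoder; Spark additionally stores the result as a sparse vector, which this dense-list sketch does not model. The function name `one_hot` is hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a dense 0/1 list."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:              # the dropped last category stays all zeros
        vector[int(index)] = 1.0
    return vector

# Three categories with indices 0.0, 1.0 and 2.0
encoded = [one_hot(i, 3) for i in (0.0, 1.0, 2.0)]
print(encoded)
```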

In [None]:
# Prepare a (one) list of all numerical and OneHotEncoded columns. Exclude the 'label' column (formerly 'loss') from this list.

# Unlike in other languages, in Spark
#       type classes are to be separately imported
#       They are not part of core classes or modules
from pyspark.sql.types import DoubleType

double_cols =   [  i[0] for i in df.dtypes if i[1] == 'double' ] 

double_cols.remove('label')  

double_cols



In [None]:
#  Create a combined list of double + ohe_cols

featuresCols = double_cols + ohe_cols
print(featuresCols)
len(featuresCols)


In [None]:
# Create a VectorAssembler object to assemble all the columns as above
from pyspark.ml.feature import VectorAssembler
#   Create an instance of VectorAssembler class.
#          This object will be used to assemble all featuresCols
#          (a list of columns) into one column with name
#           'rawFeatures'

vectorassembler = VectorAssembler(
                                  inputCols=featuresCols,
                                  outputCol="rawFeatures"
                                 )
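
Conceptually, assembling means concatenating scalar columns and vector columns into one flat feature vector per row, in the order given by the input column list. The sketch below shows this in plain Python; the function `assemble` and the row values are hypothetical.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector columns of one row into a flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)   # vector column
        else:
            features.append(float(value))              # scalar column
    return features

# Hypothetical row with two numeric columns and one one-hot encoded column
row = {"cont1": 0.7263, "cont2": 0.2459, "onehotencoded_cat1": [1.0, 0.0]}
raw_features = assemble(row, ["cont1", "cont2", "onehotencoded_cat1"])
print(raw_features)
```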

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

In [None]:
# Create an object to perform modeling using GBTRegressor

gbt = GBTRegressor(labelCol="label",featuresCol="rawFeatures",predictionCol='predlabel', maxIter=10)

In [None]:
# Create the pipeline (an estimator; calling fit() on it returns the fitted model)
pipeline = Pipeline(stages=[
                             *stringindexer_stages,
                             *onehotencoder_stages,
                             vectorassembler,
                             gbt
                           ]
                   )
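
The fit/transform pattern behind a pipeline can be sketched with toy classes in plain Python. Everything below (`MeanCenterer`, `MaxScaler`, `ToyPipeline`) is hypothetical and much simpler than Spark's Pipeline, which returns a separate model object from fit(). The sketch only shows that each stage is fitted on the output of the stages before it, and that the fitted stages are then applied in the same order to new data.

```python
class MeanCenterer:
    """Toy stage: learns the mean during fit, subtracts it in transform."""
    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self

    def transform(self, rows):
        return [r - self.mean for r in rows]

class MaxScaler:
    """Toy stage: learns the largest absolute value, divides by it in transform."""
    def fit(self, rows):
        self.scale = max(abs(r) for r in rows) or 1.0
        return self

    def transform(self, rows):
        return [r / self.scale for r in rows]

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            stage.fit(rows)
            rows = stage.transform(rows)   # the next stage sees transformed data
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

model = ToyPipeline([MeanCenterer(), MaxScaler()]).fit([1.0, 2.0, 3.0])
print(model.transform([2.0, 4.0]))
```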

In [None]:
#  Run the pipeline
import os, time

start = time.time()
pipelineModel = pipeline.fit(train)
end = time.time()
(end - start)/60           # Elapsed time in minutes


In [None]:
# Make predictions on validation data.
#      Note it is NOT pipelineModel.predict()

prediction = pipelineModel.transform(validation)
predicted = prediction.select("predlabel", "label")
#predicted.show(100, truncate=False)

In [None]:
#  Show 10 rows of the predicted and actual label columns
predicted.show(10, truncate=False)


In [None]:
predicted

In [None]:

# Evaluate results
# Create an evaluator object that computes RMSE on the predictions:

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.mllib.evaluation import RegressionMetrics


evaluator = RegressionEvaluator(predictionCol='predlabel', labelCol='label', metricName='rmse')

print("RMSE: {0}".format(evaluator.evaluate(predicted)))
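
For reference, RMSE is the square root of the mean squared difference between predictions and labels. The plain-Python sketch below computes it for a few made-up values and shows that, because the label was divided by 1000 earlier, the reported RMSE has to be multiplied by 1000 to read it in the original loss units. The function name `rmse` and the sample numbers are hypothetical.

```python
import math

def rmse(predictions, labels):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up predictions and labels, in units of 1000
preds = [3.0, 2.5, 4.0]
labels = [2.0, 2.5, 6.0]

rmse_scaled = rmse(preds, labels)
rmse_raw = rmse([p * 1000 for p in preds], [y * 1000 for y in labels])
print(rmse_scaled, rmse_raw)
```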