# Multiple Linear Regression on Customer Data

## Agenda

* Business Understanding
* Data Understanding
* Data Preparation
* Exploratory Data Analysis
* Building a Linear Model
* Evaluation

### Business Understanding

#### Problem Statement

A large children's educational toy company, which sells edutainment tablets and gaming systems
both online and in retail stores, wants to analyze its customer data. The company has been operating
for the last few years and maintains all of its transactional data. The given file
‘CustomerData.csv’ is a sample of customer-level data, extracted and processed for this
analysis from various sets of transactional files.

The objective of today’s activity is:
* Build a regression model to predict customer revenue from the other attributes, and understand how those attributes influence revenue

### Identify the Right Error Metrics

##### Error Metrics for Regression

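* Mean Absolute Error (MAE):

$$MAE = \dfrac{1}{n}\sum_{i = 1}^{n}\left|y_{i} - \hat{y}_{i}\right|$$


* Mean Squared Error (MSE):

$$MSE = \dfrac{1}{n}\sum_{i = 1}^{n}\left(y_{i} - \hat{y}_{i}\right)^2$$


* Root Mean Squared Error (RMSE):

$$RMSE = \sqrt{\dfrac{1}{n}\sum_{i = 1}^{n}\left(y_{i} - \hat{y}_{i}\right)^2}$$


* Mean Absolute Percentage Error (MAPE):

$$MAPE = \dfrac{100}{n}\sum_{i = 1}^{n}\left|\dfrac{y_{i} - \hat{y}_{i}}{y_{i}}\right|$$

Here $y_{i}$ is the actual value, $\hat{y}_{i}$ is the predicted value and $n$ is the number of records. The absolute value or square is applied to each error before summing, so positive and negative errors do not cancel.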
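
As a quick check of what these four metrics compute, here is a minimal plain-Python sketch. The function name `regression_errors` and the sample values are made up for illustration; they are not taken from CustomerData.csv.

```python
import math

def regression_errors(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE for two equal-length sequences."""
    n = len(y_true)
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r ** 2 for r in residuals) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the actual value, so it is undefined if any actual is zero.
    mape = 100.0 / n * sum(abs(r / y) for r, y in zip(residuals, y_true))
    return {"mae": mae, "mse": mse, "rmse": rmse, "mape": mape}

# Hypothetical revenue values, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
errors = regression_errors(actual, predicted)
print(errors)
```

RMSE is in the same units as the target (revenue here), which is one reason it is a convenient metric to report for this problem.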


### Set the JAVA_HOME and SPARK_HOME environment variables

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

In [2]:
!tar xf spark-3.0.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
import os 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

In [4]:
import findspark
findspark.init()

In [5]:
from pyspark.sql.types import *
from pyspark.sql.functions import * 
import numpy as np
import pandas as pd
from io import StringIO

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Initializing Spark

Build __SparkConf__ object 

    Contains information about your application.  


Create __SparkContext__ object 
    
    Tells Spark how to access a cluster. 
    

Create __SparkSession__ object

    The entry point to programming Spark with the Dataset and DataFrame API.

    Used to create DataFrame, register DataFrame as tables and execute SQL over tables etc.

In [7]:
from pyspark.conf import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("Customer Use Case")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [8]:
spark

#### Loading the required libraries

In [9]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
# from pyspark.sql.functions import isnan, when, count, col, countDistinct

#### Loading the data

In [10]:
## Read data and create a dataframe
data = spark.read.format("csv")\
       .option("header", "true")\
       .option("inferSchema", "true")\
       .load("/content/drive/MyDrive/BigData/CustomerData.csv")

### Data Understanding

In [11]:
# Print Schema
data.printSchema()

root
 |-- CustomerID: integer (nullable = true)
 |-- City: integer (nullable = true)
 |-- NoOfChildren: integer (nullable = true)
 |-- MinAgeOfChild: integer (nullable = true)
 |-- MaxAgeOfChild: integer (nullable = true)
 |-- Tenure: integer (nullable = true)
 |-- FrquncyOfPurchase: integer (nullable = true)
 |-- NoOfUnitsPurchased: integer (nullable = true)
 |-- FrequencyOFPlay: integer (nullable = true)
 |-- NoOfGamesPlayed: integer (nullable = true)
 |-- NoOfGamesBought: integer (nullable = true)
 |-- FavoriteChannelOfTransaction: string (nullable = true)
 |-- FavoriteGame: string (nullable = true)
 |-- TotalRevenueGenerated: double (nullable = true)



Total number of Columns and Records

In [12]:
print("No. of Columns = {}".format(len(data.columns)))

print('No. of Records = {}'.format(data.count()))

No. of Columns = 14
No. of Records = 3209


See the top rows of the data

In [13]:
data.show(3)

+----------+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|CustomerID|City|NoOfChildren|MinAgeOfChild|MaxAgeOfChild|Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|FrequencyOFPlay|NoOfGamesPlayed|NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|
+----------+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|      1001|   1|           2|            3|            8|   210|               11|                11|           2344|            108|             10|                     Uniform|     Uniform|               107.51|
|      1002|   1|           2|            3|            6|   442|               20|                20|            245|             22|      

Show a quick statistical summary of the data using describe()

In [14]:
data.describe().show()

+-------+----------------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+----------------------------+------------+---------------------+
|summary|      CustomerID|              City|      NoOfChildren|     MinAgeOfChild|    MaxAgeOfChild|           Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|   FrequencyOFPlay|  NoOfGamesPlayed|   NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|
+-------+----------------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+----------------------------+------------+---------------------+
|  count|            3209|              3209|              3209|              3209|             3209|             3209|             3209|              3209|             

Display the data type of each variable

In [15]:
data.dtypes 

[('CustomerID', 'int'),
 ('City', 'int'),
 ('NoOfChildren', 'int'),
 ('MinAgeOfChild', 'int'),
 ('MaxAgeOfChild', 'int'),
 ('Tenure', 'int'),
 ('FrquncyOfPurchase', 'int'),
 ('NoOfUnitsPurchased', 'int'),
 ('FrequencyOFPlay', 'int'),
 ('NoOfGamesPlayed', 'int'),
 ('NoOfGamesBought', 'int'),
 ('FavoriteChannelOfTransaction', 'string'),
 ('FavoriteGame', 'string'),
 ('TotalRevenueGenerated', 'double')]

### Data Preparation

#### Observations:
    1. City is read as numeric although it is actually categorical; FavoriteGame and FavoriteChannelOfTransaction are read as strings.
    2. The maximum age of a child is 113, which must be a wrong entry.
    3. Summary statistics for CustomerID are not meaningful.

So we now handle these appropriately: treat City, FavoriteGame and FavoriteChannelOfTransaction as categorical, exclude CustomerID from the analysis, and remove the records with wrong entries.

##### Check and delete CustomerID attribute

In [16]:
print(data.select("CustomerID").distinct().count())

3209


In [17]:
# Delete CustomerID attribute
data = data.drop("CustomerID")

#### Data type conversion 
    Cast 'City', 'FavoriteChannelOfTransaction' and 'FavoriteGame' to string so that they are treated as categorical, and cast the remaining attributes to double.

In [18]:
# Creating a list of categorical and numerical features
cols = data.columns

cat_Var_Names = ['City', 'FavoriteChannelOfTransaction', 'FavoriteGame']

num_Var_Names =  list(set(cols) - set(cat_Var_Names))

for colTmp1 in cat_Var_Names:
    data = data.withColumn(colTmp1, data[colTmp1].cast("string"))

for colTmp2 in num_Var_Names:
    data = data.withColumn(colTmp2, data[colTmp2].cast("double"))

In [19]:
data.dtypes

[('City', 'string'),
 ('NoOfChildren', 'double'),
 ('MinAgeOfChild', 'double'),
 ('MaxAgeOfChild', 'double'),
 ('Tenure', 'double'),
 ('FrquncyOfPurchase', 'double'),
 ('NoOfUnitsPurchased', 'double'),
 ('FrequencyOFPlay', 'double'),
 ('NoOfGamesPlayed', 'double'),
 ('NoOfGamesBought', 'double'),
 ('FavoriteChannelOfTransaction', 'string'),
 ('FavoriteGame', 'string'),
 ('TotalRevenueGenerated', 'double')]

In [20]:
data.describe().show()

+-------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+----------------------------+------------+---------------------+
|summary|              City|      NoOfChildren|     MinAgeOfChild|    MaxAgeOfChild|           Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|   FrequencyOFPlay|  NoOfGamesPlayed|   NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|
+-------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+----------------------------+------------+---------------------+
|  count|              3209|              3209|              3209|             3209|             3209|             3209|              3209|              3209|             3209|              3209|                        3

#### Observe how many records have values 113 for age of children

In [21]:
data.where((data['MinAgeOfChild']==113) | (data['MaxAgeOfChild']==113)).show()

+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|City|NoOfChildren|MinAgeOfChild|MaxAgeOfChild|Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|FrequencyOFPlay|NoOfGamesPlayed|NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|
+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|   1|         2.0|          4.0|        113.0| 205.0|             17.0|              17.0|          158.0|           51.0|            8.0|                    Favorite|     Uniform|               218.85|
|   1|         2.0|          3.0|        113.0| 379.0|              6.0|               6.0|          242.0|           32.0|            0.0|                    Favorite|     Uniform|   


#### Removing outliers

In [22]:
# Let's ignore these 20 records for the analysis
data = data.where((data['MinAgeOfChild'] != 113) & (data['MaxAgeOfChild'] != 113))

In [23]:
data.count()

3189

#### Missing Data

Spark represents missing data as null; floating-point columns can additionally hold NaN.

Check for missing values

    isNull() and isnan() return a boolean, i.e. true if the value is missing and false otherwise.

    count(when(...)) counts the rows where that condition is true, which gives the total number of missing values per column

In [24]:
# Checking for null values at each column
data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in data.columns]).show()

+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|City|NoOfChildren|MinAgeOfChild|MaxAgeOfChild|Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|FrequencyOFPlay|NoOfGamesPlayed|NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|
+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|   0|           0|            0|            0|     0|                0|                 0|              0|              0|              0|                           0|           0|                    0|
+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---

In this case there are no missing values. However, if we do find missing values in the data, as a rule of thumb:


    If a particular row/column has a large number of missing values, then drop that row/column
    
        e.g. to drop any rows that have missing data use data.na.drop()
        
    Otherwise, impute/fill the missing data based on domain knowledge or using imputation techniques
        
        e.g. to fill missing numeric values with the column mean use pyspark.ml.feature.Imputer, whose default strategy is "mean"
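
The rule of thumb can be sketched in plain Python, independent of Spark. This is only an illustration under stated assumptions: the function `handle_missing`, the 0.5 cutoff and the toy values are hypothetical, and the sketch handles columns only (the same idea applies to rows).

```python
import math

def handle_missing(columns, max_missing_fraction=0.5):
    """Drop columns that are mostly missing; mean-impute the rest.

    `columns` maps a column name to a list of floats, with None for missing.
    `max_missing_fraction` is a hypothetical cutoff, not a value from this notebook.
    """
    cleaned = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_fraction = 1 - len(present) / len(values)
        if missing_fraction > max_missing_fraction:
            continue  # too sparse to impute reliably, so drop the column
        # math.fsum is used because the notebook's wildcard import of
        # pyspark.sql.functions shadows the built-in sum.
        mean = math.fsum(present) / len(present)
        cleaned[name] = [mean if v is None else v for v in values]
    return cleaned

# Toy values, for illustration only.
toy = {
    "Tenure": [200.0, None, 400.0, 300.0],       # 25% missing: imputed
    "NoOfGamesBought": [None, None, None, 5.0],  # 75% missing: dropped
}
result = handle_missing(toy)
print(result)
```

In a real pipeline the mean should be computed on the training data only and then reused for the test data, so that no information leaks from the held-out set.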

In [25]:
# "NA" entries are read as strings; convert them to null so that Spark treats them as missing

from pyspark.sql.functions import when

# Use a loop variable other than `col` so the imported pyspark.sql.functions.col is not shadowed
for col_name in cols:
    data = data.withColumn(col_name, when(data[col_name] == "NA", None).otherwise(data[col_name]))

In [26]:
data.columns

['City',
 'NoOfChildren',
 'MinAgeOfChild',
 'MaxAgeOfChild',
 'Tenure',
 'FrquncyOfPurchase',
 'NoOfUnitsPurchased',
 'FrequencyOFPlay',
 'NoOfGamesPlayed',
 'NoOfGamesBought',
 'FavoriteChannelOfTransaction',
 'FavoriteGame',
 'TotalRevenueGenerated']

In [27]:
data.dtypes

[('City', 'string'),
 ('NoOfChildren', 'double'),
 ('MinAgeOfChild', 'double'),
 ('MaxAgeOfChild', 'double'),
 ('Tenure', 'double'),
 ('FrquncyOfPurchase', 'double'),
 ('NoOfUnitsPurchased', 'double'),
 ('FrequencyOFPlay', 'double'),
 ('NoOfGamesPlayed', 'double'),
 ('NoOfGamesBought', 'double'),
 ('FavoriteChannelOfTransaction', 'string'),
 ('FavoriteGame', 'string'),
 ('TotalRevenueGenerated', 'double')]

### Train-Test Split

In [28]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
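
A note on what a random split of this kind does. As far as I understand it, `randomSplit` assigns each row to a split independently at random, so the 70/30 proportions are approximate and the result changes from run to run unless a seed is supplied. The plain-Python sketch below mimics that behaviour; `bernoulli_split` is a hypothetical helper, not Spark's implementation.

```python
import random

def bernoulli_split(rows, train_fraction=0.7, seed=None):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(3189))  # stand-in for the 3189 records kept above
train_a, test_a = bernoulli_split(rows, seed=42)
train_b, test_b = bernoulli_split(rows, seed=42)

print(len(train_a), len(test_a))  # close to, but not exactly, 70% / 30%
print(train_a == train_b)         # same seed, same split
```

In PySpark, `randomSplit` also accepts a `seed` argument, e.g. `data.randomSplit([0.7, 0.3], seed=42)` (the value 42 is arbitrary), which should make the reported RMSE values easier to reproduce.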

In [29]:
num_Var_Names = list(set(num_Var_Names) - set(["TotalRevenueGenerated"]))

### Use VectorAssembler to combine a given list of numeric columns into a single vector column.

In [30]:
from pyspark.ml.feature import VectorAssembler

assembler_Num = VectorAssembler(inputCols=num_Var_Names, outputCol="num_features")

### Scaling numeric attributes using MinMaxScaler method

1. Scale all the numeric attributes using MinMaxScaler
2. MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). 
3. MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel.
4. The model can then transform each feature individually such that it is in the given range.
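
The rescaling itself is simple arithmetic. The following plain-Python sketch applies the min-max formula to a single feature, assuming the default target range [0, 1]; `min_max_scale` and the sample values are illustrative only, and the constant-feature branch follows the behaviour described in Spark's MinMaxScaler documentation.

```python
# `from pyspark.sql.functions import *` earlier in the notebook shadows
# min and max, so the built-ins are referenced explicitly.
import builtins

def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale values linearly so the smallest maps to lower and the largest to upper."""
    e_min = builtins.min(values)
    e_max = builtins.max(values)
    if e_max == e_min:
        # A constant feature has no spread; map it to the midpoint of the range.
        return [0.5 * (lower + upper)] * len(values)
    span = e_max - e_min
    return [(v - e_min) / span * (upper - lower) + lower for v in values]

# Illustrative Tenure values.
tenure = [205.0, 379.0, 210.0, 442.0]
scaled = min_max_scale(tenure)
print(scaled)
```

Because the scaler is fitted inside the pipeline on the training data, a test record whose value lies outside the training minimum or maximum will be mapped outside [0, 1].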

In [31]:
from pyspark.ml.feature import MinMaxScaler

min_Max_Scalar = MinMaxScaler(inputCol="num_features", outputCol="scaled_num_features")

### Convert categorical to numeric: OneHotEncoder, StringIndexer, VectorAssembler,  VectorIndexer

In [32]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

indexers_Cat = [StringIndexer(inputCol=cat_Var_Name, outputCol="{0}_index".format(cat_Var_Name)) for cat_Var_Name in cat_Var_Names ]
encoders_Cat = [OneHotEncoder(inputCol=indexer.getOutputCol(), outputCol="{0}_vec".format(indexer.getInputCol())) for indexer in indexers_Cat]
assembler_Cat = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders_Cat], outputCol="cat_features")

assembler = VectorAssembler(inputCols=["scaled_num_features","cat_features"], outputCol="features")
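
To make the two encoding steps above concrete, here is a plain-Python sketch of what they do with their default settings: StringIndexer assigns index 0 to the most frequent label, and OneHotEncoder drops the last category so that it is represented by an all-zero vector. The helper names and the city codes are hypothetical, and Spark stores the result as sparse vectors rather than Python lists.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Encode an index as a vector of length num_categories - 1.

    The last category gets the all-zero vector, which avoids a redundant column.
    """
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1.0
    return vector

# Hypothetical city codes, stored as strings after the cast above.
cities = ["1", "2", "1", "3", "1", "2"]
mapping = fit_string_index(cities)
encoded = [one_hot_drop_last(mapping[c], len(mapping)) for c in cities]

print(mapping)
print(encoded)
```

Dropping one category per attribute matters for linear regression: with an intercept in the model, keeping every dummy column would make the columns linearly dependent.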


### Defining the pipeline

In [33]:
preprocessiong_Stages = [assembler_Num]+[min_Max_Scalar]+indexers_Cat+encoders_Cat+[assembler_Cat]+[assembler]

### Model Building, Tuning and Evaluation

### Linear Regression

In [34]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(maxIter=100,labelCol="TotalRevenueGenerated", featuresCol="features")

In [35]:
# Adding Linear regression model to pipeline
from pyspark.ml import Pipeline

lr_Pipeline = Pipeline(stages=preprocessiong_Stages+[lr]) 

lr_Pipeline_model = lr_Pipeline.fit(trainingData)

In [36]:
print("Coefficients: " + str(lr_Pipeline_model.stages[-1].coefficients))
print("Intercept: " + str(lr_Pipeline_model.stages[-1].intercept))

Coefficients: [55.953569718254734,-6.029955988043001,-41.61223621456934,-23.985654297988145,59.45697359777263,964.3345084741106,-1300.8254809820721,1278.8358169729913,37.996431225008514,-10.125054231427477,15.238157740623286,-5.6066638488726035]
Intercept: 40.71976018574679


#### Predicting on train and test data

In [37]:

train_predictions_lr = lr_Pipeline_model.transform(trainingData)

test_predictions_lr = lr_Pipeline_model.transform(testData)

In [38]:
test_predictions_lr.show(2)

+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+--------------------+--------------------+----------+----------------------------------+------------------+-------------+--------------------------------+----------------+-------------+--------------------+------------------+
|City|NoOfChildren|MinAgeOfChild|MaxAgeOfChild|Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|FrequencyOFPlay|NoOfGamesPlayed|NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|        num_features| scaled_num_features|City_index|FavoriteChannelOfTransaction_index|FavoriteGame_index|     City_vec|FavoriteChannelOfTransaction_vec|FavoriteGame_vec| cat_features|            features|        prediction|
+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+---

In [39]:
# Find the error metric - RMSE
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="TotalRevenueGenerated",
                            predictionCol="prediction",
                            metricName="rmse" )

In [40]:
lmRegTrain_rmse = evaluator.evaluate(train_predictions_lr)
print('RMSE value on Train data is', lmRegTrain_rmse)

lmRegTest_rmse = evaluator.evaluate(test_predictions_lr)
print('RMSE value on Test data is', lmRegTest_rmse)

RMSE value on Train data is 43.135444226238576
RMSE value on Test data is 43.489969777101905


### Tuning LR Model

In [41]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [42]:
# Defining the grid parameters and Cross validator

paramGridLR = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1]) \
    .addGrid(lr.elasticNetParam, [0.5])\
    .addGrid(lr.maxIter, [100])\
    .build()
    
lr_crossval = CrossValidator(estimator=lr_Pipeline,
                             estimatorParamMaps=paramGridLR,
                             evaluator=RegressionEvaluator(labelCol="TotalRevenueGenerated"),
                             numFolds=2)          
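
A parameter grid is the Cartesian product of the value lists given to `addGrid`. The plain-Python sketch below counts the combinations for the grid defined above and for a wider grid; `build_param_grid` is a hypothetical stand-in for ParamGridBuilder, and the wider value lists are examples, not recommendations.

```python
from itertools import product

def build_param_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

# The grid used in this notebook: one value per parameter.
notebook_grid = build_param_grid(
    {"regParam": [0.1], "elasticNetParam": [0.5], "maxIter": [100]}
)

# A hypothetical wider grid, for comparison only.
wider_grid = build_param_grid(
    {"regParam": [0.01, 0.1, 1.0], "elasticNetParam": [0.0, 0.5, 1.0], "maxIter": [100]}
)

num_folds = 2
fits_notebook = len(notebook_grid) * num_folds
fits_wider = len(wider_grid) * num_folds

print(len(notebook_grid), fits_notebook)
print(len(wider_grid), fits_wider)
```

With a single combination, cross-validation has nothing to choose between, so it only estimates the error of that one regularized model. Listing several values per parameter is what turns this step into actual tuning, at the cost of one pipeline fit per combination per fold.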

In [43]:
# Run cross-validation, and choose the best set of parameters.
lr_crossval_Model = lr_crossval.fit(trainingData)

In [44]:
# Predicting on train and test data using cross validation model
train_predictions_lrcv = lr_crossval_Model.transform(trainingData)
test_predictions_lrcv = lr_crossval_Model.transform(testData)

In [45]:
# Evaluating the model
lmRegTrain_rmsecv = evaluator.evaluate(train_predictions_lrcv)
print('RMSE value on Train data is', lmRegTrain_rmsecv)

lmRegTest_rmsecv = evaluator.evaluate(test_predictions_lrcv)
print('RMSE value on Test data is', lmRegTest_rmsecv)

RMSE value on Train data is 43.14253963828042
RMSE value on Test data is 43.49978078230788


##### Correlation between numeric attributes 
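
For reference, the Pearson coefficient between two attributes is their covariance divided by the product of their standard deviations, which gives a value between -1 and 1. A minimal plain-Python sketch follows; the function `pearson` and the sample values are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    # math.fsum is used because the notebook's wildcard import of
    # pyspark.sql.functions shadows the built-in sum.
    mean_x = math.fsum(x) / n
    mean_y = math.fsum(y) / n
    cov = math.fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = math.fsum((a - mean_x) ** 2 for a in x)
    var_y = math.fsum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for three columns, for illustration only.
units_purchased = [11.0, 20.0, 17.0, 6.0]
purchase_frequency = [11.0, 20.0, 17.0, 6.0]
games_bought = [10.0, 3.0, 8.0, 0.0]

print(pearson(units_purchased, purchase_frequency))  # identical columns
print(pearson(units_purchased, games_bought))
```

Pairs of predictors with a coefficient close to 1 or -1 carry nearly the same information, which tends to make individual linear regression coefficients unstable.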

In [46]:
num_Var_Names = num_Var_Names + ["TotalRevenueGenerated"]

In [47]:
num_attr = data.select(num_Var_Names)
num_attr

DataFrame[MinAgeOfChild: double, Tenure: double, NoOfGamesPlayed: double, MaxAgeOfChild: double, FrequencyOFPlay: double, NoOfUnitsPurchased: double, NoOfGamesBought: double, FrquncyOfPurchase: double, NoOfChildren: double, TotalRevenueGenerated: double]

In [48]:
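from pyspark.ml.stat import Correlation

# assembler_Num was built before TotalRevenueGenerated was appended to num_Var_Names,
# so "num_features" (and the matrix printed here) covers the nine predictors only
num_attr = assembler_Num.transform(num_attr)

r1 = Correlation.corr(num_attr, "num_features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))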

Pearson correlation matrix:
DenseMatrix([[ 1.        , -0.16315477, -0.11758011,  0.26784645, -0.09015244,
              -0.09163769, -0.08549426, -0.10104056, -0.35249942],
             [-0.16315477,  1.        ,  0.27172387, -0.04409811,  0.24003489,
               0.19155512,  0.18771993,  0.19267247,  0.08658319],
             [-0.11758011,  0.27172387,  1.        ,  0.01760575,  0.73815438,
               0.43456408,  0.39794182,  0.39633292,  0.21422604],
             [ 0.26784645, -0.04409811,  0.01760575,  1.        , -0.00239256,
              -0.01276066, -0.01104729, -0.00497733,  0.46133215],
             [-0.09015244,  0.24003489,  0.73815438, -0.00239256,  1.        ,
               0.31043645,  0.28539018,  0.27869029,  0.1665357 ],
             [-0.09163769,  0.19155512,  0.43456408, -0.01276066,  0.31043645,
               1.        ,  0.86811298,  0.93389444,  0.13824432],
             [-0.08549426,  0.18771993,  0.39794182, -0.01104729,  0.28539018,
               0.