# Multiple Linear Regression on Customer Data

## Agenda

* Business Understanding
* Data Understanding
* Data Preparation
* Exploratory Data Analysis
* Building a Linear Model
* Evaluation

### Business Understanding

#### Problem Statement

A large children's educational toy company, which sells edutainment tablets and gaming systems
both online and in retail stores, wants to analyze its customer data. The company has been operating
for the last few years and maintains all of its transactional data. The given data,
‘CustomerData.csv’, is a sample of customer-level data extracted and processed for this
analysis from various sets of transactional files.

The objective of today’s activity is:
* Build a regression model to predict customer revenue from the other attributes, and understand the influence of those attributes on revenue

### Identify the Right Error Metrics

##### Error Metrics for Regression

* Mean Absolute Error (MAE):

$$MAE = \dfrac{1}{n}\sum_{i = 1}^{n}\left|y_{i} - \hat{y}_{i}\right|$$


* Mean Squared Error (MSE):

$$MSE = \dfrac{1}{n}\sum_{i = 1}^{n}(y_{i} - \hat{y}_{i})^2$$


* Root Mean Squared Error (RMSE):

$$RMSE = \sqrt{\dfrac{1}{n}\sum_{i = 1}^{n}(y_{i} - \hat{y}_{i})^2}$$


* Mean Absolute Percentage Error (MAPE):

$$MAPE = \dfrac{100}{n}\sum_{i = 1}^{n}\left|\dfrac{y_{i} - \hat{y}_{i}}{y_{i}}\right|$$
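As a quick check of these definitions, the four metrics can be computed by hand in plain Python. This is a minimal sketch: the function name `regression_errors` and the revenue values are made up for illustration, and each error is taken per record before averaging.

```python
import math


def regression_errors(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE for two equal-length sequences."""
    n = len(y_true)
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r ** 2 for r in residuals) / n
    rmse = math.sqrt(mse)
    # MAPE is undefined when any actual value is zero
    mape = 100.0 / n * sum(abs(r / y) for r, y in zip(residuals, y_true))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Hypothetical actual and predicted revenue values (illustration only)
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
errors = regression_errors(actual, predicted)
print(errors)
```

RMSE is in the same unit as the target, which is one reason it is a common choice for a revenue model.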


### Create SPARK_HOME and PYLIB env var and update PATH env var

In [3]:
import os
import sys
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.4-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

### Initializing Spark

Build __SparkConf__ object 

    Contains information about your application.  


Create __SparkContext__ object 
    
    Tells Spark how to access a cluster. 
    

Create __SparkSession__ object

    The entry point to programming Spark with the Dataset and DataFrame API.

    Used to create DataFrame, register DataFrame as tables and execute SQL over tables etc.

In [4]:
from pyspark.conf import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("Customer Use Case").setMaster('local')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

#### Loading the required libraries

In [5]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
# from pyspark.sql.functions import isnan, when, count, col, countDistinct

#### Loading the data

In [6]:
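## Read data and create a dataframe
# The file location is an assumption; point the path at your copy of CustomerData.csv
data = spark.read.csv("CustomerData.csv", header=True, inferSchema=True)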

### Data Understanding

In [7]:
# Print Schema
data.printSchema()

root
 |-- CustomerID: integer (nullable = true)
 |-- City: integer (nullable = true)
 |-- NoOfChildren: integer (nullable = true)
 |-- MinAgeOfChild: integer (nullable = true)
 |-- MaxAgeOfChild: integer (nullable = true)
 |-- Tenure: integer (nullable = true)
 |-- FrquncyOfPurchase: integer (nullable = true)
 |-- NoOfUnitsPurchased: integer (nullable = true)
 |-- FrequencyOFPlay: integer (nullable = true)
 |-- NoOfGamesPlayed: integer (nullable = true)
 |-- NoOfGamesBought: integer (nullable = true)
 |-- FavoriteChannelOfTransaction: string (nullable = true)
 |-- FavoriteGame: string (nullable = true)
 |-- TotalRevenueGenerated: double (nullable = true)



Total number of Columns and Records

No. of Columns = 14
No. of Records = 3209


See the top rows of the data

+----------+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|CustomerID|City|NoOfChildren|MinAgeOfChild|MaxAgeOfChild|Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|FrequencyOFPlay|NoOfGamesPlayed|NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|
+----------+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|      1001|   1|           2|            3|            8|   210|               11|                11|           2344|            108|             10|                     Uniform|     Uniform|               107.51|
|      1002|   1|           2|            3|            6|   442|               20|                20|            245|             22|      

Show a quick statistical summary of the data using describe()

+-------+----------------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+----------------------------+------------+---------------------+
|summary|      CustomerID|              City|      NoOfChildren|     MinAgeOfChild|    MaxAgeOfChild|           Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|   FrequencyOFPlay|  NoOfGamesPlayed|   NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|
+-------+----------------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+----------------------------+------------+---------------------+
|  count|            3209|              3209|              3209|              3209|             3209|             3209|             3209|              3209|             

Display the data type of each variable

[('CustomerID', 'int'),
 ('City', 'int'),
 ('NoOfChildren', 'int'),
 ('MinAgeOfChild', 'int'),
 ('MaxAgeOfChild', 'int'),
 ('Tenure', 'int'),
 ('FrquncyOfPurchase', 'int'),
 ('NoOfUnitsPurchased', 'int'),
 ('FrequencyOFPlay', 'int'),
 ('NoOfGamesPlayed', 'int'),
 ('NoOfGamesBought', 'int'),
 ('FavoriteChannelOfTransaction', 'string'),
 ('FavoriteGame', 'string'),
 ('TotalRevenueGenerated', 'double')]

### Data Preparation

#### Observations:
    1. City is interpreted as numeric (it is actually categorical), and FavoriteGame and FavoriteChannelOfTransaction are interpreted as strings.
    2. The maximum age of children is 113, which must be a wrong entry.
    3. Summary statistics for CustomerID are not meaningful.

So we now change these appropriately, i.e., convert City, FavoriteGame and FavoriteChannelOfTransaction to categorical attributes, exclude CustomerID from the data for analysis, and treat the records with wrong entries.

##### Check and delete CustomerID attribute

3209


In [11]:
# Delete CustomerID attribute


#### Data type conversion 
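    astype('category') is a pandas method; on a Spark DataFrame use cast instead. Cast 'City' to string so that it is treated as categorical along with 'FavoriteChannelOfTransaction' and 'FavoriteGame', and cast the numeric attributes to double.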

In [12]:
# Creating a list of categorical and numerical features


[('City', 'string'),
 ('NoOfChildren', 'double'),
 ('MinAgeOfChild', 'double'),
 ('MaxAgeOfChild', 'double'),
 ('Tenure', 'double'),
 ('FrquncyOfPurchase', 'double'),
 ('NoOfUnitsPurchased', 'double'),
 ('FrequencyOFPlay', 'double'),
 ('NoOfGamesPlayed', 'double'),
 ('NoOfGamesBought', 'double'),
 ('FavoriteChannelOfTransaction', 'string'),
 ('FavoriteGame', 'string'),
 ('TotalRevenueGenerated', 'double')]

+-------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+----------------------------+------------+---------------------+
|summary|              City|      NoOfChildren|     MinAgeOfChild|    MaxAgeOfChild|           Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|   FrequencyOFPlay|  NoOfGamesPlayed|   NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|
+-------+------------------+------------------+------------------+-----------------+-----------------+-----------------+------------------+------------------+-----------------+------------------+----------------------------+------------+---------------------+
|  count|              3209|              3209|              3209|             3209|             3209|             3209|              3209|              3209|             3209|              3209|                        3

#### Observe how many records have the value 113 for the age of children

+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|City|NoOfChildren|MinAgeOfChild|MaxAgeOfChild|Tenure|FrquncyOfPurchase|NoOfUnitsPurchased|FrequencyOFPlay|NoOfGamesPlayed|NoOfGamesBought|FavoriteChannelOfTransaction|FavoriteGame|TotalRevenueGenerated|
+----+------------+-------------+-------------+------+-----------------+------------------+---------------+---------------+---------------+----------------------------+------------+---------------------+
|   1|         2.0|          4.0|        113.0| 205.0|             17.0|              17.0|          158.0|           51.0|            8.0|                    Favorite|     Uniform|               218.85|
|   1|         2.0|          3.0|        113.0| 379.0|              6.0|               6.0|          242.0|           32.0|            0.0|                    Favorite|     Uniform|   


#### Removing outliers

#### Missing Data

pandas primarily uses the value np.nan to represent missing data; in a Spark DataFrame, missing data appears as null.

Check for missing values

    isNull() returns a boolean column, i.e. true if the value is missing, else false.

    Counting the 'true' values in each column gives the total number of missing values.

In [1]:
# Checking for null values at each column

In this case there are no missing values. However, if we find any missing values in the data, as a rule of thumb:


    If a particular row/column has a large number of missing values, then drop that row/column

        e.g. To drop any rows that have missing data, use data.dropna(axis=0, inplace=True) in pandas, or data.dropna() on a Spark DataFrame

    Otherwise, impute/fill missing data based on domain knowledge or using imputation techniques

        e.g. To fill missing values with the mean, use data.fillna(data.mean(), inplace=True) in pandas; on a Spark DataFrame, fillna() takes a single value or a dict of per-column values

In [None]:
# NA values are read as strings; to make them null, we convert the 'NA' strings to null values


In [None]:
data.columns

In [None]:
data.dtypes

### Train-Test Split

In [None]:
# Split the data into training and test sets (30% held out for testing)


### Use VectorAssembler to combine a given list of numeric columns into a single vector column

### Scaling numeric attributes using MinMaxScaler method

1. Scale all the numeric attributes using MinMaxScaler
2. MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). 
3. MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel.
4. The model can then transform each feature individually such that it is in the given range.
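
The rescaling itself is simple arithmetic: each value becomes `(x - min) / (max - min) * (upper - lower) + lower`. The plain-Python sketch below works through it for one feature. The function name and the tenure values are made up, and the handling of a constant feature follows the rule described in the Spark documentation.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale a list of numbers linearly into the range [lower, upper]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; Spark maps it to the midpoint of the range
        return [0.5 * (upper + lower)] * len(values)
    return [(v - lo) / (hi - lo) * (upper - lower) + lower for v in values]


# Hypothetical tenure values
tenure = [210.0, 442.0, 326.0]
scaled = min_max_scale(tenure)
print(scaled)
```

In Spark the minimum and maximum are learned when the scaler is fitted, so the same statistics from the training data are reused when transforming the test data.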

### Convert categorical to numeric: OneHotEncoder, StringIndexer, VectorAssembler,  VectorIndexer

### Defining the pipeline

### Model Building, Tuning and Evaluation

### Linear Regression

In [None]:
from pyspark.ml.regression import LinearRegression



In [None]:
# Adding Linear regression model to pipeline
from pyspark.ml import Pipeline



#### Predicting on train and test data

In [None]:
# Find the error metric - RMSE
from pyspark.ml.evaluation import RegressionEvaluator

### Tuning LR Model

In [None]:
# Defining the grid parameters and Cross validator
       

In [None]:
# Run cross-validation, and choose the best set of parameters.


In [None]:
# Predicting on train and test data using cross validation model


In [None]:
# Evaluating the model

##### Correlation between numeric attributes 

### Reference

* https://spark.apache.org/docs/2.3.0/ml-guide.html
* https://spark.apache.org/docs/2.3.0/ml-classification-regression.html