## Linear Regression on the Ecommerce Customers Dataset

This notebook examines an Ecommerce Customers dataset covering a company's website and mobile app, then builds a linear regression model to predict each customer's yearly spend on the company's products.

In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder.appName('LR_Ecommerce').getOrCreate()

In [0]:
df = spark.read.csv('dbfs:/FileStore/Ecommerce_Customers.csv', inferSchema=True, header=True)

In [0]:
df.limit(10).display()

Email,Address,Avatar,Avg Session Length,Time on App,Time on Website,Length of Membership,Yearly Amount Spent
mstephenson@fernandez.com,"835 Frank TunnelWrightmouth, MI 82180-9605",Violet,34.49726772511229,12.65565114916675,39.57766801952616,4.0826206329529615,587.9510539684005
hduke@hotmail.com,"4547 Archer CommonDiazchester, CA 06566-8576",DarkGreen,31.92627202636016,11.109460728682564,37.268958868297744,2.66403418213262,392.2049334443264
pallen@yahoo.com,"24645 Valerie Unions Suite 582Cobbborough, DC 99414-7564",Bisque,33.000914755642675,11.330278057777512,37.110597442120856,4.104543202376424,487.54750486747207
riverarebecca@gmail.com,"1414 David ThroughwayPort Jason, OH 22070-1220",SaddleBrown,34.30555662975554,13.717513665142508,36.72128267790313,3.120178782748092,581.8523440352177
mstephens@davidson-herman.com,"14023 Rodriguez PassagePort Jacobville, PR 37242-1057",MediumAquaMarine,33.33067252364639,12.795188551078114,37.53665330059473,4.446308318351434,599.4060920457634
alvareznancy@lucas.biz,"645 Martha Park Apt. 611Jeffreychester, MN 67218-7250",FloralWhite,33.871037879341976,12.026925339755056,34.47687762925054,5.493507201364199,637.102447915074
katherine20@yahoo.com,"68388 Reyes Lights Suite 692Josephbury, WV 92213-0247",DarkSlateBlue,32.02159550138701,11.366348309710526,36.68377615286961,4.685017246570912,521.5721747578274
awatkins@yahoo.com,Unit 6538 Box 8980DPO AP 09026-4941,Aqua,32.739142938380326,12.35195897300293,37.37335885854755,4.4342734348999375,549.9041461052942
vchurch@walter-martinez.com,"860 Lee KeyWest Debra, SD 97450-0495",Salmon,33.98777289568564,13.386235275676436,37.534497341555735,3.2734335777477144,570.2004089636196
bonnie69@lin.biz,"PSC 2734, Box 5255APO AA 98456-7482",Brown,31.93654861844892,11.814128294972196,37.14516822352819,3.202806071553459,427.1993848953282


In [0]:
df.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [0]:
df.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [0]:
# describe data
df.select('Avg Session Length', 'Time on App', 'Time on Website',
           'Length of Membership', 'Yearly Amount Spent').describe().show()

+-------+------------------+------------------+------------------+--------------------+-------------------+
|summary|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|
+-------+------------------+------------------+------------------+--------------------+-------------------+
|  count|               500|               500|               500|                 500|                500|
|   mean| 33.05319351819619|12.052487937166134| 37.06044542094859|   3.533461555915055|  499.3140382585909|
| stddev|0.9925631110845354|0.9942156084725424|1.0104889067564033|  0.9992775024112585|   79.3147815497068|
|    min|29.532428967057943| 8.508152176032603| 33.91384724758464|  0.2699010899842742| 256.67058229005585|
|    max| 36.13966248879052|15.126994288792467|40.005181638101895|   6.922689335035808|  765.5184619388373|
+-------+------------------+------------------+------------------+--------------------+-------------------+



In [0]:
from pyspark.sql.functions import col, sum as _sum

In [0]:
# count missing (null) values in each column
missing_val = df.select([_sum(col(c).isNull().cast('int')).alias(c) for c in df.columns])

In [0]:
missing_val.show()

+-----+-------+------+------------------+-----------+---------------+--------------------+-------------------+
|Email|Address|Avatar|Avg Session Length|Time on App|Time on Website|Length of Membership|Yearly Amount Spent|
+-----+-------+------+------------------+-----------+---------------+--------------------+-------------------+
|    0|      0|     0|                 0|          0|              0|                   0|                  0|
+-----+-------+------+------------------+-----------+---------------+--------------------+-------------------+



In [0]:
# check for duplicated rows
duplicates = df.exceptAll(df.dropDuplicates())

In [0]:
duplicates.show()

+-----+-------+------+------------------+-----------+---------------+--------------------+-------------------+
|Email|Address|Avatar|Avg Session Length|Time on App|Time on Website|Length of Membership|Yearly Amount Spent|
+-----+-------+------+------------------+-----------+---------------+--------------------+-------------------+
+-----+-------+------+------------------+-----------+---------------+--------------------+-------------------+
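
The empty table above means no row occurs more than once. The check works because `exceptAll` keeps duplicates when subtracting, so removing the de-duplicated rows from the original leaves only the extra copies. A minimal plain-Python sketch of that idea follows; the helper names `except_all` and `drop_duplicates` and the toy rows are hypothetical stand-ins, not Spark APIs or dataset values.

```python
from collections import Counter


def except_all(left, right):
    """Multiset difference: each row of `left` survives as many times as it
    appears in `left` minus the number of times it appears in `right`."""
    remaining = Counter(left) - Counter(right)
    return list(remaining.elements())


def drop_duplicates(rows):
    """Keep one copy of every distinct row, preserving first-seen order."""
    return list(dict.fromkeys(rows))


# Hypothetical toy rows (not from the dataset): one row appears three times.
rows = [("a@x.com", 1.0), ("b@x.com", 2.0), ("a@x.com", 1.0), ("a@x.com", 1.0)]

# Only the two surplus copies of the repeated row remain.
extra_copies = except_all(rows, drop_duplicates(rows))
print(extra_copies)

# A list without repeats leaves nothing behind, like the empty table above.
unique_rows = drop_duplicates(rows)
print(except_all(unique_rows, drop_duplicates(unique_rows)))
```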



In [0]:
df.head(5)

[Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005),
 Row(Email='hduke@hotmail.com', Address='4547 Archer CommonDiazchester, CA 06566-8576', Avatar='DarkGreen', Avg Session Length=31.92627202636016, Time on App=11.109460728682564, Time on Website=37.268958868297744, Length of Membership=2.66403418213262, Yearly Amount Spent=392.2049334443264),
 Row(Email='pallen@yahoo.com', Address='24645 Valerie Unions Suite 582Cobbborough, DC 99414-7564', Avatar='Bisque', Avg Session Length=33.000914755642675, Time on App=11.330278057777512, Time on Website=37.110597442120856, Length of Membership=4.104543202376424, Yearly Amount Spent=487.54750486747207),
 Row(Email='riverarebecca@gmail.com', Address='1414 David ThroughwayPort Jason, OH 22070-1220', Avatar='Sad

In [0]:
for item in df.head():
    print(item)

mstephenson@fernandez.com
835 Frank TunnelWrightmouth, MI 82180-9605
Violet
34.49726772511229
12.65565114916675
39.57766801952616
4.0826206329529615
587.9510539684005


## Format for MLlib

In [0]:
from pyspark.ml.feature import VectorAssembler

In [0]:
df.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [0]:
assembler = VectorAssembler(inputCols=['Avg Session Length', 'Time on App',
                                        'Time on Website', 'Length of Membership'], 
                            outputCol='features')

In [0]:
output = assembler.transform(df)

In [0]:
output.limit(5).display()

Email,Address,Avatar,Avg Session Length,Time on App,Time on Website,Length of Membership,Yearly Amount Spent,features
mstephenson@fernandez.com,"835 Frank TunnelWrightmouth, MI 82180-9605",Violet,34.49726772511229,12.65565114916675,39.57766801952616,4.0826206329529615,587.9510539684005,"Map(vectorType -> dense, length -> 4, values -> List(34.49726772511229, 12.65565114916675, 39.57766801952616, 4.0826206329529615))"
hduke@hotmail.com,"4547 Archer CommonDiazchester, CA 06566-8576",DarkGreen,31.92627202636016,11.109460728682564,37.268958868297744,2.66403418213262,392.2049334443264,"Map(vectorType -> dense, length -> 4, values -> List(31.92627202636016, 11.109460728682564, 37.268958868297744, 2.66403418213262))"
pallen@yahoo.com,"24645 Valerie Unions Suite 582Cobbborough, DC 99414-7564",Bisque,33.000914755642675,11.330278057777512,37.110597442120856,4.104543202376424,487.54750486747207,"Map(vectorType -> dense, length -> 4, values -> List(33.000914755642675, 11.330278057777512, 37.110597442120856, 4.104543202376424))"
riverarebecca@gmail.com,"1414 David ThroughwayPort Jason, OH 22070-1220",SaddleBrown,34.30555662975554,13.717513665142508,36.72128267790313,3.120178782748092,581.8523440352177,"Map(vectorType -> dense, length -> 4, values -> List(34.30555662975554, 13.717513665142507, 36.72128267790313, 3.120178782748092))"
mstephens@davidson-herman.com,"14023 Rodriguez PassagePort Jacobville, PR 37242-1057",MediumAquaMarine,33.33067252364639,12.795188551078114,37.53665330059473,4.446308318351434,599.4060920457634,"Map(vectorType -> dense, length -> 4, values -> List(33.33067252364639, 12.795188551078114, 37.53665330059473, 4.446308318351434))"


In [0]:
output.select('features').show()

+--------------------+
|            features|
+--------------------+
|[34.4972677251122...|
|[31.9262720263601...|
|[33.0009147556426...|
|[34.3055566297555...|
|[33.3306725236463...|
|[33.8710378793419...|
|[32.0215955013870...|
|[32.7391429383803...|
|[33.9877728956856...|
|[31.9365486184489...|
|[33.9925727749537...|
|[33.8793608248049...|
|[29.5324289670579...|
|[33.1903340437226...|
|[32.3879758531538...|
|[30.7377203726281...|
|[32.1253868972878...|
|[32.3388993230671...|
|[32.1878120459321...|
|[32.6178560628234...|
+--------------------+
only showing top 20 rows



In [0]:
final_data = output.select('features', 'Yearly Amount Spent')

In [0]:
final_data.show()

+--------------------+-------------------+
|            features|Yearly Amount Spent|
+--------------------+-------------------+
|[34.4972677251122...|  587.9510539684005|
|[31.9262720263601...|  392.2049334443264|
|[33.0009147556426...| 487.54750486747207|
|[34.3055566297555...|  581.8523440352177|
|[33.3306725236463...|  599.4060920457634|
|[33.8710378793419...|   637.102447915074|
|[32.0215955013870...|  521.5721747578274|
|[32.7391429383803...|  549.9041461052942|
|[33.9877728956856...|  570.2004089636196|
|[31.9365486184489...|  427.1993848953282|
|[33.9925727749537...|  492.6060127179966|
|[33.8793608248049...|  522.3374046069357|
|[29.5324289670579...|  408.6403510726275|
|[33.1903340437226...|  573.4158673313865|
|[32.3879758531538...|  470.4527333009554|
|[30.7377203726281...|  461.7807421962299|
|[32.1253868972878...| 457.84769594494855|
|[32.3388993230671...| 407.70454754954415|
|[32.1878120459321...|  452.3156754800354|
|[32.6178560628234...|   605.061038804892|
+--------------------+-------------------+

## Train Test Split

In [0]:
train, test = final_data.randomSplit([0.7, 0.3])

In [0]:
train.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                351|
|   mean| 505.17809542832737|
| stddev|  78.11058464987376|
|    min| 256.67058229005585|
|    max|  744.2218671047146|
+-------+-------------------+



In [0]:
test.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                149|
|   mean|  485.5000512345802|
| stddev|  80.67206073508915|
|    min|  275.9184206503857|
|    max|  765.5184619388373|
+-------+-------------------+
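
The split sizes above (351 and 149) are close to, but not exactly, 350 and 150. That is expected: the weights passed to `randomSplit` act as per-row probabilities rather than exact quotas. The sketch below imitates this behaviour in plain Python. It is a simplified illustration, not Spark's implementation, and the function `random_split` and the seed value 42 are assumptions made for the example.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability `train_fraction`.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(500))  # stand-in for the 500 customers
train_rows, test_rows = random_split(rows, train_fraction=0.7, seed=42)

# Sizes land near 350 / 150 but are generally not exact.
print(len(train_rows), len(test_rows))

# Reusing the seed reproduces the same split.
repeat_train, repeat_test = random_split(rows, train_fraction=0.7, seed=42)
print(repeat_train == train_rows and repeat_test == test_rows)
```

In PySpark, `randomSplit` accepts an optional `seed` argument, for example `final_data.randomSplit([0.7, 0.3], seed=42)`. Without a seed, the split, and therefore every number reported below, changes from run to run.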



## Create the Linear Regression Model

In [0]:
from pyspark.ml.regression import LinearRegression

In [0]:
lr = LinearRegression(labelCol='Yearly Amount Spent')

In [0]:
# fit the model
lrModel = lr.fit(train)

In [0]:
# print Coefficients and Intercept
print('Coefficients : {}, Intercept : {}'.format(lrModel.coefficients, lrModel.intercept))

Coefficients : [25.934049376822202,38.91951853287949,-0.18301096020995175,61.43265734610271], Intercept : -1036.9722076302494
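
To make the fitted numbers concrete, here is a small sketch that applies them by hand: a prediction is the intercept plus the weighted sum of the four features, in the order given to the assembler. The coefficients and intercept are copied from the output above and the feature values come from the first customer displayed earlier; the helper `predict` is a hypothetical name used only for this illustration.

```python
# Coefficients and intercept copied from the fitted model's printed output.
coefficients = [25.934049376822202, 38.91951853287949,
                -0.18301096020995175, 61.43265734610271]
intercept = -1036.9722076302494

# First customer shown earlier, in assembler order:
# Avg Session Length, Time on App, Time on Website, Length of Membership.
features = [34.49726772511229, 12.65565114916675,
            39.57766801952616, 4.0826206329529615]
actual = 587.9510539684005  # that customer's Yearly Amount Spent


def predict(features, coefficients, intercept):
    """Linear model: intercept + sum of coefficient * feature."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


prediction = predict(features, coefficients, intercept)
print(f"prediction = {prediction:.2f}, residual = {actual - prediction:.2f}")
```

Read this way, `Length of Membership` carries the largest weight (about 61 per one-unit increase) and `Time on Website` contributes almost nothing. Because the four features have similar spreads (standard deviations near 1 in the summary table), the coefficients are roughly comparable in this fit. This describes the fitted model only and is not a causal claim.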


In [0]:
test_result = lrModel.evaluate(test)

In [0]:
# test residuals
test_result.residuals.show()

+-------------------+
|          residuals|
+-------------------+
| 11.374413278382008|
| 0.5885689853882354|
| 10.134829589539436|
| -4.005515735647975|
| -6.571278309970296|
|-22.260746384262006|
| 20.015309808878158|
|  4.022966377151647|
| 1.4169238638910429|
| 3.1698185509637256|
| -6.542770909238584|
| -3.969255137290702|
|-3.3507563600281287|
|  4.221756525244075|
|-18.264828421822358|
| 17.880582881947475|
| -5.147494372395897|
|-5.8706453821743025|
| 1.9996216042657693|
| -2.931492297555394|
+-------------------+
only showing top 20 rows



In [0]:
unlabeled_data = test.select('features')

In [0]:
# predict on the unlabeled test features
predictions = lrModel.transform(unlabeled_data)

In [0]:
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[29.5324289670579...| 397.2659377942455|
|[30.5743636841713...| 441.4758447726774|
|[30.7377203726281...|451.64591260669044|
|[30.8794843441274...|494.21211572050265|
|[31.0613251567161...| 494.1267363678719|
|[31.1239743499119...| 509.2078002240278|
|[31.3123495994443...|443.57610821906246|
|[31.3662121671876...|426.56591617933327|
|[31.3895854806643...|408.65268719609185|
|[31.4459724827577...|481.70714638416484|
|[31.4474464941278...| 425.1455130044626|
|[31.5147378578019...| 493.7817431337521|
|[31.5171218025062...|279.26917701041384|
|[31.5316044825729...| 432.2938492041185|
|[31.5702008293202...| 564.2103205632272|
|[31.6005122003032...|461.29226860914946|
|[31.6253601348306...| 381.4843951293201|
|[31.7207699002873...| 544.6455788601972|
|[31.7366356860502...|494.93382465126615|
|[31.8186165667690...|449.35016566769104|
+--------------------+------------------+

## Evaluation

In [0]:
print('RMSE : {}'.format(test_result.rootMeanSquaredError))
print("MSE : {}".format(test_result.meanSquaredError))

RMSE : 10.355674437788581
MSE : 107.23999306146783
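
As a sanity check on these metrics, the sketch below recomputes MSE and RMSE from residuals (label minus prediction) and relates the reported error to the spread of the test labels. The five residuals are only the first rows of the residuals output above, so the sample values will not match the full test-set metrics. The R² figure is a rough estimate that assumes the `stddev` printed by `describe()` is the sample standard deviation.

```python
import math


def mse(residuals):
    """Mean squared error: average of the squared residuals."""
    return sum(r * r for r in residuals) / len(residuals)


def rmse(residuals):
    """Root mean squared error, in the same units as the label."""
    return math.sqrt(mse(residuals))


# First five test residuals from the output above (partial sample only).
sample_residuals = [11.374413278382008, 0.5885689853882354,
                    10.134829589539436, -4.005515735647975,
                    -6.571278309970296]
print("sample MSE :", mse(sample_residuals))
print("sample RMSE:", rmse(sample_residuals))

# Reported test metrics: RMSE should be the square root of MSE.
reported_mse = 107.23999306146783
reported_rmse = 10.355674437788581
consistent = math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-6)
print("RMSE == sqrt(MSE):", consistent)

# Rough R^2 from the printed test statistics (assumption: sample stddev).
n, label_stddev = 149, 80.67206073508915
label_variance = label_stddev ** 2 * (n - 1) / n
approx_r2 = 1 - reported_mse / label_variance
print("approximate R^2:", round(approx_r2, 3))
```

An RMSE of about 10 against a label standard deviation of about 80 suggests the model explains most of the variation in yearly spend on this particular split. The evaluation summary also exposes `r2` directly (`test_result.r2`), which is the authoritative value.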


### Good Job..!