You are the owner of an e-commerce web store and you are interested in predicting a customer’s annual e-commerce spend based on historical purchase patterns. In order to develop a linear regression model, execute the following steps:

1. Read in the underlying dataset
2. Clean the “read-in” dataset if needed
3. Split the underlying dataset into “train” and “test” sets
4. Train the ML model on the “train” dataset
5. Execute the trained model on the “test” dataset
6. Compare the output from the ML model to the actual results
7. Examine the efficacy of the ML model using the performance metrics covered in the linear regression activity and state the results

**Step 1**: Install Spark

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# unzip the spark file to the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

In [None]:
# install findspark using pip
!pip install -q findspark

In [None]:
!pip3 install pyspark==3.0.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark==3.0.2
  Downloading pyspark-3.0.2.tar.gz (204.8 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.2-py2.py3-none-any.whl size=205186690 sha256=65ff6e49abb3d57ff9244094d9deb988bceb14ae4212c7bd51073de9cb8f828e
  Stored in directory: /root/.cache/pip/wheels/9a/39/f6/970565f38054a830e9a8593f388b36e14d75dba6c6fdafc1ec
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.2


In [None]:
import findspark
findspark.init()

In [None]:
# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark


In [None]:
# Create spark_session
spark_session = SparkSession.builder.getOrCreate()

In [None]:
sc = spark_session.sparkContext

# The objective of this exercise is to predict the time on app

**Step 2**: Read in the data file

In [None]:
# Load in the data
df = spark_session.read.option("header", "true").csv('/content/drive/MyDrive/Ecomm-Customers-new.csv')

**Step 3**: Split the lines on commas and examine the first 2 lines

In [None]:
# The CSV reader has already split each line on commas; view the rows as an RDD
rdd = df.rdd
rdd.take(2)

[Row(Email='mstephenson@fernandez.com', Address='835 Frank Tunnel', Avatar=None, Avg. Session Length=None, Time on App=None, Time on Website=None, Length of Membership=None, Yearly Amount Spent=None),
 Row(Email='Wrightmouth', Address=' MI 82180-9605"', Avatar='Violet', Avg. Session Length='34.49726772511229', Time on App='12.655651149166752', Time on Website='39.57766801952616', Length of Membership='4.082620632952961', Yearly Amount Spent='587.9510539684005')]

**Step 4**: Display the contents of the DataFrame

In [None]:
df.show()

+-------------------+-----------+---------------+--------------------+-------------------+
|Avg. Session Length|Time on App|Time on Website|Length of Membership|Yearly Amount Spent|
+-------------------+-----------+---------------+--------------------+-------------------+
|        34.49726773|12.65565115|    39.57766802|         4.082620633|         587.951054|
|        31.92627203|11.10946073|    37.26895887|         2.664034182|        392.2049334|
|        33.00091476|11.33027806|    37.11059744|         4.104543202|        487.5475049|
|        34.30555663|13.71751367|    36.72128268|         3.120178783|         581.852344|
|        33.33067252|12.79518855|     37.5366533|         4.446308318|         599.406092|
|        33.87103788|12.02692534|    34.47687763|         5.493507201|        637.1024479|
|         32.0215955|11.36634831|    36.68377615|         4.685017247|        521.5721748|
|        32.73914294|12.35195897|    37.37335886|         4.434273435|        549.9041461|

**Step 5**: Display the data types

In [None]:
df.dtypes

[('Avg. Session Length', 'string'),
 ('Time on App', 'string'),
 ('Time on Website', 'string'),
 ('Length of Membership', 'string'),
 ('Yearly Amount Spent', 'string')]

In [None]:
# Import all from `sql.types`
from pyspark.sql.types import *


**Step 6**: Function that converts the data types of the DataFrame columns

In [None]:
# Write a custom function to convert the data type of DataFrame columns
def convertColumn(df, names, newType):
  for name in names: 
     df = df.withColumn(name, df[name].cast(newType))
  return df 

In [None]:
# Assign the names of the columns to convert to `columns`
# (`Avg. Session Length` is not converted; it is dropped in a later step)
columns = ['Time on App', 'Time on Website', 'Length of Membership', 'Yearly Amount Spent']

**Step 7**: Convert the data types of the above mentioned columns into a float type

In [None]:
from pyspark.sql.types import *
# Convert the `df` columns to `FloatType()`
df = convertColumn(df, columns, FloatType())

**Step 8**: Confirm that the data type has been converted into float

In [None]:
df.dtypes

[('Avg. Session Length', 'string'),
 ('Time on App', 'float'),
 ('Time on Website', 'float'),
 ('Length of Membership', 'float'),
 ('Yearly Amount Spent', 'float')]
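
As a plain-Python analogy to Steps 5 to 8 (not Spark code), the sketch below shows the same idea: values read from CSV text arrive as strings and have to be cast before they can be used numerically. The two data rows are copied from the `df.show()` output above; the helper `convert_columns` and the constant names are hypothetical and only mirror the notebook's `convertColumn`.

```python
import csv
import io

# Two rows copied from the `df.show()` output above.
CSV_TEXT = """Avg. Session Length,Time on App,Time on Website,Length of Membership,Yearly Amount Spent
34.49726773,12.65565115,39.57766802,4.082620633,587.951054
31.92627203,11.10946073,37.26895887,2.664034182,392.2049334
"""

NUMERIC_COLUMNS = ["Time on App", "Time on Website",
                   "Length of Membership", "Yearly Amount Spent"]


def convert_columns(rows, names, new_type):
    """Return copies of `rows` with the columns in `names` cast to `new_type`."""
    converted = []
    for row in rows:
        new_row = dict(row)
        for name in names:
            new_row[name] = new_type(new_row[name])
        converted.append(new_row)
    return converted


# Every field read from CSV text starts out as a string.
raw_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
typed_rows = convert_columns(raw_rows, NUMERIC_COLUMNS, float)

print(type(raw_rows[0]["Time on App"]).__name__)
print(type(typed_rows[0]["Time on App"]).__name__)
```

Columns that are not listed, such as `Avg. Session Length` here, stay as strings, which matches the `df.dtypes` output above.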

In [None]:
# Print the schema of `df`
df.printSchema()

root
 |-- Avg. Session Length: string (nullable = true)
 |-- Time on App: float (nullable = true)
 |-- Time on Website: float (nullable = true)
 |-- Length of Membership: float (nullable = true)
 |-- Yearly Amount Spent: float (nullable = true)



In [None]:
df.describe().show()

+-------+-------------------+------------------+------------------+--------------------+-------------------+
|summary|Avg. Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|
+-------+-------------------+------------------+------------------+--------------------+-------------------+
|  count|                500|               500|               500|                 500|                500|
|   mean|     33.05319351824|12.052487915039062|37.060445365905764|   3.533461554646492|  499.3140381469727|
| stddev| 0.9925631111602911|0.9942156264611745|1.0104888427768801|  0.9992775015130736|  79.31478158115246|
|    min|        29.53242897|          8.508152|          33.91385|           0.2699011|           256.6706|
|    max|        36.13966249|         15.126994|          40.00518|           6.9226894|          765.51843|
+-------+-------------------+------------------+------------------+--------------------+-------------------+




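You should probably standardize your data: the `describe()` output shows that the columns are on very different scales (for example, `Yearly Amount Spent` ranges from about 257 to 766, while `Length of Membership` ranges from about 0.27 to 6.92).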

Your dependent variable is also quite large; you should adjust the values slightly.

**Step 9**: Processing the data

In [None]:
# Import all from `sql.functions` 
from pyspark.sql.functions import *

# Scale `Time on App` down into a new column, `medianTime`
# (this column is dropped again by the `select` in the next cell)
df = df.withColumn("medianTime", col("Time on App")/100000)

# Show the first 2 lines of `df`
df.take(2)

[Row(Avg. Session Length='34.49726773', Time on App=12.655651092529297, Time on Website=39.577667236328125, Length of Membership=4.082620620727539, Yearly Amount Spent=587.9510498046875, medianTime=0.00012655651092529296),
 Row(Avg. Session Length='31.92627203', Time on App=11.109460830688477, Time on Website=37.268959045410156, Length of Membership=2.664034128189087, Yearly Amount Spent=392.2049255371094, medianTime=0.00011109460830688477)]

In [None]:
# Re-order and select columns
df = df.select("Time on App",
               "Time on Website",
               "Length of Membership",
               "Yearly Amount Spent")

In [None]:
df.show(10)

+-----------+---------------+--------------------+-------------------+
|Time on App|Time on Website|Length of Membership|Yearly Amount Spent|
+-----------+---------------+--------------------+-------------------+
|  12.655651|      39.577667|           4.0826206|          587.95105|
|  11.109461|       37.26896|           2.6640341|          392.20493|
|  11.330278|      37.110596|            4.104543|          487.54752|
|  13.717514|      36.721283|           3.1201787|          581.85236|
|  12.795189|       37.53665|            4.446308|          599.40607|
|  12.026925|       34.47688|           5.4935074|           637.1025|
|  11.366348|      36.683777|            4.685017|           521.5722|
|  12.351959|       37.37336|           4.4342732|           549.9042|
|  13.386235|      37.534496|           3.2734337|          570.20044|
|  11.814128|       37.14517|            3.202806|          427.19937|
+-----------+---------------+--------------------+-------------------+
only showing top 10 rows

**Step 10**: Specifying the label and the features 

In [None]:
# Import `DenseVector`
# A DenseVector stores an array of values for use in PySpark ML.
from pyspark.ml.linalg import DenseVector

# Define `input_data`: the first column (`Time on App`) becomes the label,
# and the remaining columns are packed into a feature vector
input_data = df.rdd.map(lambda x: (x[0], DenseVector(x[1:])))

# Replace `df` with the new DataFrame
df = spark_session.createDataFrame(input_data, ["label", "features"])

# Pull out the label and feature columns as RDDs
label = df.rdd.map(lambda x: x.label)
features = df.rdd.map(lambda x: x.features)
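
To make the label/feature split concrete, here is a minimal plain-Python sketch of what the `lambda x: (x[0], DenseVector(x[1:]))` mapping does to one row. The row values are copied from the first line of the `df.show(10)` output above; a plain list stands in for `DenseVector`.

```python
# First row of the four-column DataFrame shown above, in column order:
# (Time on App, Time on Website, Length of Membership, Yearly Amount Spent)
row = (12.655651, 39.577667, 4.0826206, 587.95105)

# x[0]: the first column, `Time on App`, is the value being predicted.
label = row[0]

# x[1:]: the remaining three columns become the predictors.
features = list(row[1:])

print("label:", label)
print("features:", features)
```

Because the column order set by `df.select(...)` decides which column ends up as `x[0]`, changing that order changes the target of the regression.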

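**Step 11**: Scaling the features using 'StandardScaler'. `StandardScaler` can standardize each feature by subtracting the mean and scaling to unit variance (that is, dividing all the values by the standard deviation). With the default settings used here (`withMean=False`, `withStd=True`), it only divides each feature by its standard deviation; the mean is not subtracted.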


In [None]:
# Import `StandardScaler` 
from pyspark.ml.feature import StandardScaler

# Initialize the `standardScaler`
standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")

# Fit the DataFrame to the scaler
scaler = standardScaler.fit(df.select('features'))

# Transform the data in `df` with the scaler
scaled_df = scaler.transform(df)

# Inspect the result
scaled_df.take(2)

[Row(label=12.655651092529297, features=DenseVector([39.5777, 4.0826, 587.951]), features_scaled=DenseVector([39.1669, 4.0856, 7.4129])),
 Row(label=11.109460830688477, features=DenseVector([37.269, 2.664, 392.2049]), features_scaled=DenseVector([36.8821, 2.666, 4.9449]))]
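
The `features_scaled` values in this output look like each feature divided by its standard deviation, with no mean subtracted. A quick check in plain Python, using the first row from the `df.take(2)` output in Step 9 and the `stddev` row of the `describe()` table (the variable names are only for illustration):

```python
# First row of features, copied from the `df.take(2)` output in Step 9:
# Time on Website, Length of Membership, Yearly Amount Spent
features = [39.577667236328125, 4.082620620727539, 587.9510498046875]

# Matching standard deviations, copied from the `describe()` table
stddevs = [1.0104888427768801, 0.9992775015130736, 79.31478158115246]

# Divide each feature by its standard deviation (no mean subtraction)
scaled = [value / std for value, std in zip(features, stddevs)]

# Values reported by Spark for the same row
reported = [39.1669, 4.0856, 7.4129]

for ours, theirs in zip(scaled, reported):
    print(round(ours, 4), theirs)
```

The computed values agree with the reported `features_scaled` vector to about three decimal places, which is consistent with scaling by the standard deviation only.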

**Step 12**: Create the "Train/Test" split

In [None]:
# Split the data into train and test sets
train_data, test_data = scaled_df.randomSplit([.7,.3],seed=1234)
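
A minimal plain-Python sketch of a seeded, weighted random split follows. It illustrates the general idea only and is not Spark's implementation: each row is assigned to a split by an independent random draw, so the 70/30 proportions are approximate rather than exact. The function name `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using independent, seeded random draws."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative boundaries in (0, 1], e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        index = 0
        # Find the first boundary that the draw falls below
        while index < len(bounds) - 1 and draw >= bounds[index]:
            index += 1
        splits[index].append(row)
    return splits


rows = list(range(500))  # stand-in for the 500 rows of the dataset
train, test = random_split(rows, [0.7, 0.3], seed=1234)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is why the notebook passes `seed=1234`.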

In [None]:
# Import `LinearRegression`
from pyspark.ml.regression import LinearRegression

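# Initialize `lr`
# Note: `featuresCol` defaults to "features", so the model is trained on the
# unscaled features; pass featuresCol="features_scaled" to use the scaled ones
lr = LinearRegression(labelCol="label", maxIter=100, regParam=0.3, elasticNetParam=0.8)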
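
As a rough guide to what `regParam` and `elasticNetParam` control, the sketch below computes the elastic-net penalty term as it is described in the Spark ML documentation: `regParam` scales the whole penalty, and `elasticNetParam` mixes the L1 (lasso) and L2 (ridge) parts. The function name and the example weights are hypothetical and not taken from the fitted model.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1_norm
        + (1.0 - elastic_net_param) / 2.0 * l2_norm_squared
    )


# Hypothetical coefficients, for illustration only
example_weights = [0.5, -0.25, 0.0]

# Same settings as the notebook: regParam=0.3, elasticNetParam=0.8
penalty = elastic_net_penalty(example_weights, reg_param=0.3, elastic_net_param=0.8)
print(penalty)
```

With `elasticNetParam=0.8` the penalty is mostly L1, which tends to push small coefficients toward zero; `elasticNetParam=0.0` would give a pure ridge penalty.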


In [None]:
train_data.show()

+------------------+--------------------+--------------------+
|             label|            features|     features_scaled|
+------------------+--------------------+--------------------+
|  8.50815200805664|[35.4623985290527...|[35.0942999346732...|
| 9.316288948059082|[36.9149513244628...|[36.5317752772198...|
| 9.477777481079102|[37.9060134887695...|[37.5125502470682...|
|  9.82440185546875|[35.7427787780761...|[35.3717698454274...|
| 9.846124649047852|[36.8763122558593...|[36.4935372809473...|
| 9.953994750976562|[37.3457374572753...|[36.9580898633647...|
| 9.954976081848145|[37.3883132934570...|[37.0002237636903...|
| 9.984514236450195|[35.9334487915039...|[35.5604607100428...|
| 10.01258373260498|[38.3549613952636...|[37.9568380882485...|
|10.047314643859863|[37.1814460754394...|[36.7955038209652...|
|10.079463005065918|[38.0706634521484...|[37.6754911489454...|
|10.101632118225098|[38.0434532165527...|[37.6485633547493...|
|10.131712913513184|[34.8456115722656...|[34.4839152073

In [None]:
# Fit the data to the model
linearModel = lr.fit(train_data)

**Step 14**: Make the predictions

In [None]:
# Make predictions on test data
predicted = linearModel.transform(test_data)


In [None]:
# Retrieve the predictions and the "known" labels
predictions = predicted.select("prediction").rdd.map(lambda x: x[0])
labels = predicted.select("label").rdd.map(lambda x: x[0])


In [None]:
# Combine the predictions and the label
predictionAndLabel = predictions.zip(labels).collect()

**Step 15**: Output the predictions and the associated labels

In [None]:
# Print out the first 15 (prediction, label) pairs of `predictionAndLabel`
predictionAndLabel[:15]

[(11.536391248815368, 8.668349266052246),
 (12.08905682542873, 9.607315063476562),
 (11.793706088211092, 10.34787654876709),
 (12.012616190148162, 10.480506896972656),
 (11.730953944977474, 10.534553527832031),
 (11.96864320505492, 10.537307739257812),
 (11.824247707899433, 10.542645454406738),
 (11.841318913567704, 10.674653053283691),
 (11.597772446103434, 10.75713062286377),
 (11.859469558780473, 10.771074295043945),
 (11.92348186873618, 10.869163513183594),
 (12.54183375245982, 10.8898286819458),
 (12.137527616771205, 10.902556419372559),
 (11.862445317921114, 10.933252334594727),
 (11.909516386186423, 10.956790924072266)]

**Step 16**: Evaluating the model

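**RMSE**: The root mean squared error (RMSE) measures the typical difference between the values predicted by the model and the actual values. The smaller the RMSE, the closer the predicted and actual values are. Note that `linearModel.summary` reports this metric on the training data.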

In [None]:
linearModel.summary.rootMeanSquaredError

0.8957950065119261
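
For reference, RMSE can be computed by hand from (prediction, label) pairs. The sketch below is a plain-Python illustration; the three sample pairs are copied from the `predictionAndLabel` output above, so the result describes only those three test rows and is not comparable to the training-set figure reported by the model summary.

```python
import math


def rmse(pairs):
    """Root mean squared error over (prediction, actual) pairs."""
    squared_errors = [(predicted - actual) ** 2 for predicted, actual in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# First three (prediction, label) pairs from the output above
sample_pairs = [
    (11.536391248815368, 8.668349266052246),
    (12.08905682542873, 9.607315063476562),
    (11.793706088211092, 10.34787654876709),
]

print(rmse(sample_pairs))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.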

**R-Squared**, known as the "coefficient of determination", indicates how much of the variability in "Time on App" can be explained by the linear regression model. The higher the R-squared, the better the model fits the underlying data.

In [None]:
linearModel.summary.r2

0.16814355413269244
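
R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a plain-Python illustration with made-up numbers (the pairs are hypothetical, not notebook output):

```python
def r_squared(pairs):
    """Coefficient of determination over (prediction, actual) pairs."""
    actuals = [actual for _, actual in pairs]
    mean_actual = sum(actuals) / len(actuals)

    # Variation left unexplained by the predictions
    ss_residual = sum((actual - predicted) ** 2 for predicted, actual in pairs)
    # Total variation of the actual values around their mean
    ss_total = sum((actual - mean_actual) ** 2 for actual in actuals)

    return 1.0 - ss_residual / ss_total


# Hypothetical pairs: always predicting the mean of the actual values (2.0)
mean_only = [(2.0, 1.0), (2.0, 2.0), (2.0, 3.0)]
# Hypothetical pairs: perfect predictions
perfect = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

print(r_squared(mean_only))
print(r_squared(perfect))
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so a value of about 0.17 means the model is only slightly better than predicting the mean.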

Only about 17% of the variability in "Time on App" is explained by the linear regression model. There is definitely room for improvement. You can experiment with the parameters that you passed to your model, such as `regParam` and `elasticNetParam`.

## Conclusion
#### RQ/Assumption: In this project, I analyze how time spent on the website, yearly amount spent, and length of membership predict the time spent on the app.
#### I collected the data from XX resource and selected three features F1, F2, F3 (according to articles, these three features strongly predict the time spent on the app).
#### Method: linear regression, correlation of features, and a regression algorithm to predict the next 3 months of time spent on the app.
#### Results: 1. Evaluation output: RMSE 0.90; R2 17%. 2. The RMSE is a bit high; feature 1 has a high SE and should probably be removed. 3. R2 is low: the 3 features together cannot strongly predict the time on the app.
