# **Linear regression implementation using Spark MLlib**

Linear regression is one of the most familiar algorithms in machine learning. It fits the best-fitting line through the data points in a dataset. The output of a linear regression model is a continuous numeric value, so the model suits any business use case that involves predicting a numeric outcome.

Spark has a library known as "MLlib", which lets us apply machine learning algorithms to a Spark dataframe. It supports a range of supervised and unsupervised machine learning algorithms.

> Creating a spark session

> Data ingestion using spark

> Exploring the data using spark functions

> Segregating input and output features for the linear regression model

> Using linear regression model from the spark MLlib library

> Evaluating the model with regression evaluation metrics








## **Creating a spark session**

In [1]:
#Creating a PySpark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("linear_regression").getOrCreate()

## **Data ingestion using spark**

In [2]:
#Loading the data
df=spark.read.csv("50_Startups.csv", inferSchema=True, header=True)

## **Exploring data using pyspark functions**

In [3]:
#Printing schema
df.printSchema()

root
 |-- R&D Spend: double (nullable = true)
 |-- Administration: double (nullable = true)
 |-- Marketing Spend: double (nullable = true)
 |-- State: string (nullable = true)
 |-- Profit: double (nullable = true)



In [4]:
#Getting first five rows of the dataframe.
df.head(5)

[Row(R&D Spend=165349.2, Administration=136897.8, Marketing Spend=471784.1, State='New York', Profit=192261.83),
 Row(R&D Spend=162597.7, Administration=151377.59, Marketing Spend=443898.53, State='California', Profit=191792.06),
 Row(R&D Spend=153441.51, Administration=101145.55, Marketing Spend=407934.54, State='Florida', Profit=191050.39),
 Row(R&D Spend=144372.41, Administration=118671.85, Marketing Spend=383199.62, State='New York', Profit=182901.99),
 Row(R&D Spend=142107.34, Administration=91391.77, Marketing Spend=366168.42, State='Florida', Profit=166187.94)]

In [5]:
#viewing the spark dataframe
df.show()

+---------+--------------+---------------+----------+---------+
|R&D Spend|Administration|Marketing Spend|     State|   Profit|
+---------+--------------+---------------+----------+---------+
| 165349.2|      136897.8|       471784.1|  New York|192261.83|
| 162597.7|     151377.59|      443898.53|California|191792.06|
|153441.51|     101145.55|      407934.54|   Florida|191050.39|
|144372.41|     118671.85|      383199.62|  New York|182901.99|
|142107.34|      91391.77|      366168.42|   Florida|166187.94|
| 131876.9|      99814.71|      362861.36|  New York|156991.12|
|134615.46|     147198.87|      127716.82|California|156122.51|
|130298.13|     145530.06|      323876.68|   Florida| 155752.6|
|120542.52|     148718.95|      311613.29|  New York|152211.77|
|123334.88|     108679.17|      304981.62|California|149759.96|
|101913.08|     110594.11|      229160.95|   Florida|146121.95|
|100671.96|      91790.61|      249744.55|California| 144259.4|
| 93863.75|     127320.38|      249839.4

Since State is a categorical feature, we need to encode its strings as numbers, because the linear regression model accepts only numeric inputs.

In [6]:
#Identifying the number of categories in the State column
df.groupBy("State").count().show()

+----------+-----+
|     State|count|
+----------+-----+
|   Florida|   16|
|California|   17|
|  New York|   17|
+----------+-----+



## **Converting state column to numeric form**

In [7]:
#Importing StringIndexer from pyspark for encoding the State column
#The output of this conversion will be saved in the Indexed_State column
from pyspark.ml.feature import StringIndexer
Indexed = StringIndexer(inputCol="State", outputCol="Indexed_State")
df=Indexed.fit(df).transform(df)
df.show()

+---------+--------------+---------------+----------+---------+-------------+
|R&D Spend|Administration|Marketing Spend|     State|   Profit|Indexed_State|
+---------+--------------+---------------+----------+---------+-------------+
| 165349.2|      136897.8|       471784.1|  New York|192261.83|          1.0|
| 162597.7|     151377.59|      443898.53|California|191792.06|          0.0|
|153441.51|     101145.55|      407934.54|   Florida|191050.39|          2.0|
|144372.41|     118671.85|      383199.62|  New York|182901.99|          1.0|
|142107.34|      91391.77|      366168.42|   Florida|166187.94|          2.0|
| 131876.9|      99814.71|      362861.36|  New York|156991.12|          1.0|
|134615.46|     147198.87|      127716.82|California|156122.51|          0.0|
|130298.13|     145530.06|      323876.68|   Florida| 155752.6|          2.0|
|120542.52|     148718.95|      311613.29|  New York|152211.77|          1.0|
|123334.88|     108679.17|      304981.62|California|149759.96| 
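
To make the numbers in Indexed_State less of a black box, here is a minimal plain-Python sketch of how a frequency-based indexer could arrive at them. It assumes the default StringIndexer ordering (most frequent label first) and that ties in frequency are broken alphabetically; that matches the output above, but it is an assumption about the Spark version in use. The helpers `frequency_index` and `one_hot` are made up for illustration and are not part of PySpark.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Assumption: ties in frequency are broken alphabetically.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot(index, size):
    """Expand an index into a 0/1 vector with a single 1 at that position."""
    return [1.0 if i == int(index) else 0.0 for i in range(size)]

# State counts taken from the groupBy output above.
states = ["California"] * 17 + ["New York"] * 17 + ["Florida"] * 16
mapping = frequency_index(states)
print(mapping)
print(one_hot(mapping["Florida"], len(mapping)))
```

The indices 0, 1 and 2 carry no real order, yet a linear model treats them as a quantity. One-hot encoding is the usual way to avoid that, and PySpark ships a OneHotEncoder in pyspark.ml.feature for this purpose.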

In [8]:
#Viewing the columns in the dataframe
df.columns

['R&D Spend',
 'Administration',
 'Marketing Spend',
 'State',
 'Profit',
 'Indexed_State']

In [9]:
#Selecting only the required columns from the dataframe.
#This effectively drops the original State column
df=df.select(['R&D Spend','Administration','Marketing Spend','Indexed_State','Profit'])
df.show()

+---------+--------------+---------------+-------------+---------+
|R&D Spend|Administration|Marketing Spend|Indexed_State|   Profit|
+---------+--------------+---------------+-------------+---------+
| 165349.2|      136897.8|       471784.1|          1.0|192261.83|
| 162597.7|     151377.59|      443898.53|          0.0|191792.06|
|153441.51|     101145.55|      407934.54|          2.0|191050.39|
|144372.41|     118671.85|      383199.62|          1.0|182901.99|
|142107.34|      91391.77|      366168.42|          2.0|166187.94|
| 131876.9|      99814.71|      362861.36|          1.0|156991.12|
|134615.46|     147198.87|      127716.82|          0.0|156122.51|
|130298.13|     145530.06|      323876.68|          2.0| 155752.6|
|120542.52|     148718.95|      311613.29|          1.0|152211.77|
|123334.88|     108679.17|      304981.62|          0.0|149759.96|
|101913.08|     110594.11|      229160.95|          2.0|146121.95|
|100671.96|      91790.61|      249744.55|          0.0| 14425

## **Segregating input and output features for the linear regression model**

To train a machine learning model, the dataframe should be split into two parts, training data and testing data. We also have to select the independent variables (input columns) and the dependent variable (output column) from the dataframe to pass to the model.

For this, PySpark provides the ml package, which contains the classes needed for machine learning tasks.

In [10]:
#VectorAssembler from pyspark is used to define the input features for the model
#It merges all the input features into a single column of vectors, which is given as the input to the model.
from pyspark.ml.feature import VectorAssembler
vec_assembler = VectorAssembler(inputCols=['R&D Spend','Administration','Marketing Spend','Indexed_State'],outputCol="Features")

In [11]:
#Converting the df using vector assembler.
input_data = vec_assembler.transform(df)

In [12]:
#Listing the df after the conversion
#The features column will be the input and it's the last column of the dataframe.
input_data.show()

+---------+--------------+---------------+-------------+---------+--------------------+
|R&D Spend|Administration|Marketing Spend|Indexed_State|   Profit|            Features|
+---------+--------------+---------------+-------------+---------+--------------------+
| 165349.2|      136897.8|       471784.1|          1.0|192261.83|[165349.2,136897....|
| 162597.7|     151377.59|      443898.53|          0.0|191792.06|[162597.7,151377....|
|153441.51|     101145.55|      407934.54|          2.0|191050.39|[153441.51,101145...|
|144372.41|     118671.85|      383199.62|          1.0|182901.99|[144372.41,118671...|
|142107.34|      91391.77|      366168.42|          2.0|166187.94|[142107.34,91391....|
| 131876.9|      99814.71|      362861.36|          1.0|156991.12|[131876.9,99814.7...|
|134615.46|     147198.87|      127716.82|          0.0|156122.51|[134615.46,147198...|
|130298.13|     145530.06|      323876.68|          2.0| 155752.6|[130298.13,145530...|
|120542.52|     148718.95|      

In [13]:
#Selecting only the features and profit column from the dataframe.
input_data = input_data.select(["Features","Profit"])

In [14]:
#Viewing the final dataframe
input_data.show()

+--------------------+---------+
|            Features|   Profit|
+--------------------+---------+
|[165349.2,136897....|192261.83|
|[162597.7,151377....|191792.06|
|[153441.51,101145...|191050.39|
|[144372.41,118671...|182901.99|
|[142107.34,91391....|166187.94|
|[131876.9,99814.7...|156991.12|
|[134615.46,147198...|156122.51|
|[130298.13,145530...| 155752.6|
|[120542.52,148718...|152211.77|
|[123334.88,108679...|149759.96|
|[101913.08,110594...|146121.95|
|[100671.96,91790....| 144259.4|
|[93863.75,127320....|141585.52|
|[91992.39,135495....|134307.35|
|[119943.24,156547...|132602.65|
|[114523.61,122616...|129917.04|
|[78013.11,121597....|126992.93|
|[94657.16,145077....|125370.37|
|[91749.16,114175....| 124266.9|
|[86419.7,153514.1...|122776.86|
+--------------------+---------+
only showing top 20 rows



In [15]:
#Splitting the dataframe into training and testing data.
#Training data will be used to train the model, similarly testing data will be used to test the trained model.
train_data, test_data = input_data.randomSplit([0.7,0.3])
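
The call above draws a different split on every run because no seed is given; randomSplit accepts an optional seed argument for repeatable results. The plain-Python sketch below illustrates the idea, assuming each row is assigned by an independent random draw, so the 70/30 proportions are approximate rather than exact. `random_split` is a hypothetical helper written for this illustration, not Spark's implementation.

```python
import random

def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(50))  # stand-in for the 50 rows of the dataset
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))   # sizes are close to 35/15, not guaranteed
print(train_a == train_b)          # same seed, same split
```

With only 50 rows, the coefficients and metrics can shift noticeably from one split to the next, which is one reason to fix the seed, for example input_data.randomSplit([0.7, 0.3], seed=42).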

## **Using linear regression model from the spark MLlib library**

Linear regression is a mathematical algorithm, commonly fitted with the least squares method, that predicts a quantitative (numerical) outcome from the input features. In other words, linear regression finds the line that best fits the data points, so that the outcome for any new input can be predicted from that line.
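
As a rough illustration of "best fitting line", the sketch below fits a line to a single input feature using the closed-form least squares solution in plain Python. It is not how Spark fits its model internally; the function name `fit_line` and the toy points are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy points lying exactly on y = 2x + 1 (illustrative, not from the dataset).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # → 2.0 1.0
```

With several input features, as in this dataset, the same idea yields one coefficient per feature plus an intercept instead of a single slope.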

In [16]:
#Importing the LinearRegression class from the pyspark.ml.regression module.
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="Features", labelCol="Profit")

In [17]:
#Training the linear regression model with the training data.
lr_model = lr.fit(train_data)

In [18]:
#Obtaining the coefficients of the trained linear regression model.
lr_model.coefficients

DenseVector([0.875, -0.0894, 0.0191, 151.7491])
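
Each coefficient multiplies one input column, in the order the columns were given to the VectorAssembler. The sketch below shows how such coefficients combine into a prediction. The notebook does not print the intercept, so the value used here is only a placeholder (the trained model exposes the real one as lr_model.intercept), and `predict` is a hypothetical helper, not a PySpark function. The coefficients are the rounded values printed above and will differ between runs because the split is random.

```python
def predict(features, coefficients, intercept):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Coefficients as printed above, in the order of the assembled input columns.
coefficients = [0.875, -0.0894, 0.0191, 151.7491]
# Placeholder only: the notebook does not print the intercept.
intercept = 0.0

# First row of the dataset: R&D Spend, Administration, Marketing Spend, Indexed_State.
row = [165349.2, 136897.8, 471784.1, 1.0]
weighted_sum = predict(row, coefficients, intercept)
print(weighted_sum)  # weighted sum of the features; add the real intercept for a profit estimate
```

Read this way, the first coefficient suggests that each additional unit of R&D Spend adds roughly 0.875 to the predicted Profit, holding the other inputs fixed.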

## **Evaluation of the trained linear regression model using regression evaluation metrics**

Several metrics are available for evaluating a trained regression model:
1) Mean absolute error
2) Mean squared error
3) Root mean squared error
4) R squared (coefficient of determination)
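
For reference, the metrics listed above can be computed by hand from pairs of actual and predicted values. The plain-Python sketch below uses the standard textbook definitions; the function name `regression_metrics` and the toy numbers are assumptions for illustration, and this is not Spark's own implementation.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R squared for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Toy values for illustration only, not taken from the dataset.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are expressed in the same units as the target (here, Profit), MSE is in squared units, and R squared is unitless, with values closer to 1 indicating a better fit.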

In [19]:
# Testing the model performance using testing data.
results = lr_model.evaluate(test_data)

In [20]:
print(f"Mean Absolute Error (MAE)  : {results.meanAbsoluteError}")
print(f"Mean Squared Error (MSE)  :  {results.meanSquaredError}")
print(f"Root Mean Squared Error (RMSE) : {results.rootMeanSquaredError}")
print(f"R Squared error : {results.r2}")

Mean Absolute Error (MAE)  : 9173.393807110879
Mean Squared Error (MSE)  :  126037249.99652538
Root Mean Squared Error (RMSE) : 11226.631284429242
R Squared error : 0.8828224793862188
