
In [7]:
import urllib.request

In [8]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285387 sha256=785a1e795b88965ac035fa133c6cf891616e9ed4a831a51ba8773ec1ca7f6d13
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


## Importing the necessary libraries and downloading the sample data

In [9]:
url = "https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_linear_regression_data.txt"

In [10]:
local_filename = "sample_linear_regression_data.txt"

In [11]:
urllib.request.urlretrieve(url,local_filename)

('sample_linear_regression_data.txt',
 <http.client.HTTPMessage at 0x7b972b666a70>)

In [12]:
import os

In [13]:
from pyspark.ml.regression import LinearRegression

In [None]:
if os.path.exists(local_filename):
  print(f"File '{local_filename}' has been successfully downloaded and saved.")
else:
  print(f"Failed to download '{local_filename}'.")

## Building our PySpark session and loading the training data


In [15]:
from pyspark.sql import SparkSession

In [16]:
spark = SparkSession.builder.appName('lr_ex').getOrCreate()

In [17]:
training = spark.read.format("libsvm").load("sample_linear_regression_data.txt")


In [18]:
training.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|  5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+
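
The `libsvm` format read above stores each row as a label followed by `index:value` pairs, with indices starting at 1. Spark shifts them to 0-based positions, which is why the `features` column displays as `(10,[0,1,2,...`. The sketch below shows that conversion in plain Python. `parse_libsvm_line` is a hypothetical helper written for illustration, not part of PySpark, and the example line is made up rather than taken from the sample file.

```python
def parse_libsvm_line(line, num_features):
    """Parse one 'label index:value ...' line into (label, dense feature list)."""
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * num_features  # indices not listed in the line stay at zero
    for item in parts[1:]:
        index, value = item.split(":")
        # LIBSVM indices are 1-based; shift to 0-based list positions.
        features[int(index) - 1] = float(value)
    return label, features


# Hypothetical line, not taken from sample_linear_regression_data.txt.
example_label, example_features = parse_libsvm_line("1.5 1:0.25 3:-0.5", num_features=4)
print(example_label, example_features)
```

Spark keeps the features as a sparse vector rather than a dense list; the dense list here is only for readability.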

## Building our Linear Regression Training Model


In [19]:
lr = LinearRegression(featuresCol = 'features', labelCol= 'label', predictionCol = 'prediction')

In [20]:
lrModel = lr.fit(training)

In [21]:
print('Coefficients:', str(lrModel.coefficients))
print('Intercept:', str(lrModel.intercept))


Coefficients: [0.0073350710225801715,0.8313757584337543,-0.8095307954684084,2.441191686884721,0.5191713795290002,1.1534591903547016,-0.2989124112808717,-0.5128514186201779,-0.619712827067017,0.695615180432293]
Intercept: 0.14228558260358093
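
To make the coefficients and intercept concrete: a linear regression prediction is the intercept plus the dot product of the coefficients with the feature vector. The sketch below reproduces that arithmetic in plain Python with the values printed above. `predict` is a hypothetical helper and the input rows are made up for illustration.

```python
# Values copied from the printed model output above.
coefficients = [
    0.0073350710225801715, 0.8313757584337543, -0.8095307954684084,
    2.441191686884721, 0.5191713795290002, 1.1534591903547016,
    -0.2989124112808717, -0.5128514186201779, -0.619712827067017,
    0.695615180432293,
]
intercept = 0.14228558260358093


def predict(features):
    """Return intercept + sum(coefficient_i * feature_i)."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


# Hypothetical rows: all zeros, and a row with only feature 3 set to 1.0.
zero_row = [0.0] * 10
one_hot_row = [0.0] * 10
one_hot_row[3] = 1.0

print(predict(zero_row))     # an all-zero row predicts the intercept
print(predict(one_hot_row))  # intercept plus the coefficient of feature 3
```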


In [23]:
trainSummary = lrModel.summary

In [24]:
trainSummary

<pyspark.ml.regression.LinearRegressionTrainingSummary at 0x7b97264fee00>

**MAE**: Mean absolute error is the average absolute difference between the predicted values and the actual values of the data.
**MSE**: Mean squared error is the average of the squared differences between predicted and actual values.
**RMSE**: Root mean squared error is the square root of the *MSE*, which expresses the prediction error in the same units as the label. Lower values indicate better model performance.
**R2**: R-squared measures the proportion of the variance in the dependent variable (target) that is explained by the independent variables (features) in the model. On the training data of a model fitted with an intercept it ranges from 0 to 1, with higher values indicating that a larger proportion of the variance is explained by the model. On held-out data it can be negative, which means the model predicts worse than simply using the mean of the labels.
**AdjR2**: *Adjusted R-squared* is a modified version of R-squared that takes into account the number of features in the model. It penalizes the addition of unnecessary features and provides a more reliable measure of model fit, especially in the presence of many features.
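
The definitions above can be sketched in plain Python. This is a minimal illustration of the standard formulas, assuming a model fitted with an intercept. `regression_metrics` is a hypothetical helper, not a PySpark function, and the sample values are made up.

```python
import math


def regression_metrics(actual, predicted, num_features):
    """Compute MAE, MSE, RMSE, R2 and adjusted R2 from two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R2 for a model with an intercept and num_features predictors.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - num_features - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


# Hypothetical values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
good = regression_metrics(actual, [1.5, 2.0, 2.5, 4.0, 5.5], num_features=1)
bad = regression_metrics(actual, [5.0, 4.0, 3.0, 2.0, 1.0], num_features=1)
print(good)
print(bad["r2"])  # negative: these predictions are worse than the mean of actual
```

The second call shows that R2 drops below zero whenever the squared error of the predictions exceeds the squared error of always predicting the mean.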



In [27]:
print('MAE:', trainSummary.meanAbsoluteError)
print('MSE:', trainSummary.meanSquaredError)
print('RMSE:', trainSummary.rootMeanSquaredError)
print('R2:', trainSummary.r2)
print('Adj R2', trainSummary.r2adj)

MAE: 8.145215527783876
MSE: 103.28843028724194
RMSE: 10.16309157133015
R2: 0.027839179518600154
Adj R2 0.007999162774081858


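The output shows that this model fits the data very poorly: an R2 of about 0.028 means it explains less than 3% of the variance in the label. These metrics were also computed on the same data the model was trained on, so they say nothing about how it performs on unseen data. A train/test split, performed below, addresses that.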

## Train Test Split with PySpark

Pass the split between training and test data as a list of weights. There is no single correct ratio, but 70/30 or 60/40 splits are generally used, depending on how much data you have and how unbalanced it is.
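
A weighted random split assigns each row to a split by an independent random draw, so a 70/30 split gives sizes that are close to, but usually not exactly, 70% and 30%. The plain-Python sketch below illustrates the idea. It is a simplified illustration under that assumption, not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using normalized weights and a seeded RNG."""
    total = sum(weights)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total       # cumulative upper bound of each split
        bounds.append(running)
    rng = random.Random(seed)      # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


# Hypothetical data: 1000 integer "rows".
train_rows, test_rows = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Because the assignment is random, fixing the seed is what keeps the training and test sets the same across reruns.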



In [30]:
newDf = spark.read.format("libsvm").load("sample_linear_regression_data.txt") # Full Dataset

In [32]:
train_data, test_data = newDf.randomSplit([0.7, 0.3], seed=42)

In [34]:
train_data.show()
test_data.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|-28.571478869743427|(10,[0,1,2,3,4,5,...|
|-28.046018037776633|(10,[0,1,2,3,4,5,...|
|-26.736207182601724|(10,[0,1,2,3,4,5,...|
| -23.51088409032297|(10,[0,1,2,3,4,5,...|
|-23.487440120936512|(10,[0,1,2,3,4,5,...|
|-22.837460416919342|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-19.884560774273424|(10,[0,1,2,3,4,5,...|
|-19.872991038068406|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|-18.845922472898582|(10,[0,1,2,3,4,5,...|
| -18.27521356600463|(10,[0,1,2,3,4,5,...|
|-17.494200356883344|(10,[0,1,2,3,4,5,...|
| -17.32672073267595|(10,[0,1,2,3,4,5,...|
| -16.71909683360509|(10,[0,1,2,3,4,5,...|
|-16.692207021311106|(10,[0,1,2,3,4,5,...|
| -16.26143027545273|(10,[0,1,2,3,4,5,...|
| -15.86200932757056|(10,[0,1,2,3,4,5,...|
|-15.732088272239245|(10,[0,1,2,3,4,5,...|
|-15.375857723312297|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

In [35]:
unlabeled_data = test_data.select('features')

In [36]:
new_model = lr.fit(train_data)

In [37]:
results = new_model.evaluate(test_data)

In [38]:
print('MAE:', results.meanAbsoluteError)
print('MSE:', results.meanSquaredError)
print('RMSE:', results.rootMeanSquaredError)
print('R2:', results.r2)
print('Adj R2', results.r2adj)

MAE: 9.855750048378727
MSE: 142.31866794563598
RMSE: 11.929738804585622
R2: -0.14679155085585793
Adj R2 -0.24651255527810645


In [39]:
predictions = new_model.transform(unlabeled_data)

In [40]:
predictions.show()

+--------------------+--------------------+
|            features|          prediction|
+--------------------+--------------------+
|(10,[0,1,2,3,4,5,...|  1.5004193024392305|
|(10,[0,1,2,3,4,5,...|   6.540721556576252|
|(10,[0,1,2,3,4,5,...|  1.4369775273526635|
|(10,[0,1,2,3,4,5,...|  1.3156052948594423|
|(10,[0,1,2,3,4,5,...|-0.09510236182489817|
|(10,[0,1,2,3,4,5,...|  0.1264840774927029|
|(10,[0,1,2,3,4,5,...|-0.40745999229762586|
|(10,[0,1,2,3,4,5,...|  -1.382750455726864|
|(10,[0,1,2,3,4,5,...|  2.6965070486236957|
|(10,[0,1,2,3,4,5,...|  2.4228427074240106|
|(10,[0,1,2,3,4,5,...|-0.33620505674116286|
|(10,[0,1,2,3,4,5,...|  1.5811910073932327|
|(10,[0,1,2,3,4,5,...|  -0.912686515312681|
|(10,[0,1,2,3,4,5,...| -2.4337353560269612|
|(10,[0,1,2,3,4,5,...|  4.7238640017384945|
|(10,[0,1,2,3,4,5,...|  1.7972086764514912|
|(10,[0,1,2,3,4,5,...| -0.3727532193177281|
|(10,[0,1,2,3,4,5,...|  3.3935938829568832|
|(10,[0,1,2,3,4,5,...|  1.1738235336515077|
|(10,[0,1,2,3,4,5,...| 0.4009232