# **Running Pyspark in Colab**

To run Spark in Colab, we first need to install the dependencies in the Colab environment: Apache Spark 2.3.3 with Hadoop 2.7, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out from inside the Colab notebook itself. One important note: if you are new to Spark, it is better to avoid version 2.4.0, since some users have reported compatibility issues with Python.
Follow these steps to install the dependencies:

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
!tar xf spark-2.3.3-bin-hadoop2.7.tgz
!pip install -q findspark

Now that Spark and Java are installed in Colab, set the environment variables that let PySpark find them. Set the locations of Java and Spark by running the following code:

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.3-bin-hadoop2.7"
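
Before starting Spark, it can help to confirm that both variables point at directories that actually exist, since a wrong path typically surfaces later as a less obvious error. The following is a minimal sketch; the helper name `missing_paths` is illustrative and not part of the original notebook.

```python
import os

def missing_paths(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

problems = missing_paths(os.environ)
if problems:
    print("Check these variables:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories")
```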

Run a local Spark session to test your installation:

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext
sqlSession = SparkSession.builder.master("local[*]").getOrCreate()
sc = SparkContext.getOrCreate()

In [5]:
nums = sc.parallelize([1,2,3,4])
squared = nums.map(lambda x: x * x).collect()
for num in squared:
  print(num)

1
4
9
16


Upload "auto-miles-per-gallon.csv"

In [6]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving auto-miles-per-gallon.csv to auto-miles-per-gallon.csv
User uploaded file "auto-miles-per-gallon.csv" with length 17321 bytes


# **Import libraries** 

In [0]:
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# **Load and prepare data**



In [9]:
autoData = sc.textFile("auto-miles-per-gallon.csv")
autoData.cache()
autoData.take(5)

['MPG,CYLINDERS,DISPLACEMENT,HORSEPOWER,WEIGHT,ACCELERATION,MODELYEAR,NAME',
 '18,8,307,130,3504,12,70,chevrolet chevelle malibu',
 '15,8,350,165,3693,11.5,70,buick skylark 320',
 '18,8,318,150,3436,11,70,plymouth satellite',
 '16,8,304,150,3433,12,70,amc rebel sst']

Remove header

In [10]:
dataLines = autoData.filter(lambda x: "CYLINDERS" not in x)
dataLines.count()

398
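
The filter above drops every line that contains the text `CYLINDERS`, which works as long as no car name includes that word. A slightly stricter variant compares each line with the header line itself. The sketch below shows the idea on a plain Python list (the variable names are illustrative); with an RDD, the same pattern would fetch the header with `autoData.first()` and drop it with `filter`.

```python
# Two data rows from the output above, preceded by the header line
lines = [
    "MPG,CYLINDERS,DISPLACEMENT,HORSEPOWER,WEIGHT,ACCELERATION,MODELYEAR,NAME",
    "18,8,307,130,3504,12,70,chevrolet chevelle malibu",
    "15,8,350,165,3693,11.5,70,buick skylark 320",
]

header = lines[0]
# Keep only the lines that are not identical to the header
data_lines = [line for line in lines if line != header]
print(len(data_lines))
```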

# **Cleanup Data**

Default average HP

In [0]:
avgHP = sc.broadcast(80.0)
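
The value 80.0 is a hard-coded default. As an alternative, the average could be computed from the rows whose horsepower is known. The sketch below does this in plain Python on a few sample lines; the function name `mean_horsepower` and the last sample row are made up for illustration, and the original notebook does not compute the value this way.

```python
def mean_horsepower(lines, hp_index=3):
    """Average the horsepower field over all lines where it is not '?'."""
    values = []
    for line in lines:
        field = line.split(",")[hp_index]
        if field != "?":
            values.append(float(field))
    return sum(values) / len(values)

sample = [
    "18,8,307,130,3504,12,70,chevrolet chevelle malibu",
    "15,8,350,165,3693,11.5,70,buick skylark 320",
    "18,8,318,150,3436,11,70,plymouth satellite",
    "16,8,304,150,3433,12,70,amc rebel sst",
    "25,4,98,?,2046,19,71,hypothetical car",  # made-up row with a missing value
]
print(mean_horsepower(sample))  # 148.75
```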

Clean data function

In [0]:
def CleanupData(inputStr):
    # avgHP is the broadcast variable defined above; reading it needs no `global`
    attList = inputStr.split(",")

    # Replace missing ("?") horsepower values with the default average
    hpValue = attList[3]
    if hpValue == "?":
        hpValue = avgHP.value

    # Create a Row with cleaned-up, type-converted data
    values = Row(MPG=float(attList[0]),
                 CYLINDERS=float(attList[1]),
                 DISPLACEMENT=float(attList[2]),
                 HORSEPOWER=float(hpValue),
                 WEIGHT=float(attList[4]),
                 ACCELERATION=float(attList[5]),
                 MODELYEAR=float(attList[6]),
                 NAME=attList[7])
    return values

In [14]:
autoMap = dataLines.map(CleanupData)
autoMap.cache()
autoMap.take(5)

[Row(ACCELERATION=12.0, CYLINDERS=8.0, DISPLACEMENT=307.0, HORSEPOWER=130.0, MODELYEAR=70.0, MPG=18.0, NAME='chevrolet chevelle malibu', WEIGHT=3504.0),
 Row(ACCELERATION=11.5, CYLINDERS=8.0, DISPLACEMENT=350.0, HORSEPOWER=165.0, MODELYEAR=70.0, MPG=15.0, NAME='buick skylark 320', WEIGHT=3693.0),
 Row(ACCELERATION=11.0, CYLINDERS=8.0, DISPLACEMENT=318.0, HORSEPOWER=150.0, MODELYEAR=70.0, MPG=18.0, NAME='plymouth satellite', WEIGHT=3436.0),
 Row(ACCELERATION=12.0, CYLINDERS=8.0, DISPLACEMENT=304.0, HORSEPOWER=150.0, MODELYEAR=70.0, MPG=16.0, NAME='amc rebel sst', WEIGHT=3433.0),
 Row(ACCELERATION=10.5, CYLINDERS=8.0, DISPLACEMENT=302.0, HORSEPOWER=140.0, MODELYEAR=70.0, MPG=17.0, NAME='ford torino', WEIGHT=3449.0)]

In [16]:
autoDf = sqlSession.createDataFrame(autoMap)
autoDf.show()

+------------+---------+------------+----------+---------+----+--------------------+------+
|ACCELERATION|CYLINDERS|DISPLACEMENT|HORSEPOWER|MODELYEAR| MPG|                NAME|WEIGHT|
+------------+---------+------------+----------+---------+----+--------------------+------+
|        12.0|      8.0|       307.0|     130.0|     70.0|18.0|chevrolet chevell...|3504.0|
|        11.5|      8.0|       350.0|     165.0|     70.0|15.0|   buick skylark 320|3693.0|
|        11.0|      8.0|       318.0|     150.0|     70.0|18.0|  plymouth satellite|3436.0|
|        12.0|      8.0|       304.0|     150.0|     70.0|16.0|       amc rebel sst|3433.0|
|        10.5|      8.0|       302.0|     140.0|     70.0|17.0|         ford torino|3449.0|
|        10.0|      8.0|       429.0|     198.0|     70.0|15.0|    ford galaxie 500|4341.0|
|         9.0|      8.0|       454.0|     220.0|     70.0|14.0|    chevrolet impala|4354.0|
|         8.5|      8.0|       440.0|     215.0|     70.0|14.0|   plymouth fury 

# **Data Analytics**


In [17]:
autoDf.select("MPG","CYLINDERS").describe().show()

+-------+-----------------+------------------+
|summary|              MPG|         CYLINDERS|
+-------+-----------------+------------------+
|  count|              398|               398|
|   mean|23.51457286432161| 5.454773869346734|
| stddev|7.815984312565782|1.7010042445332125|
|    min|              9.0|               3.0|
|    max|             46.6|               8.0|
+-------+-----------------+------------------+



In [31]:
for i in autoDf.columns:
  if not isinstance(autoDf.select(i).take(1)[0][0], str):
    print( "Correlation to MPG for ", i, autoDf.stat.corr('MPG',i))

Correlation to MPG for  ACCELERATION 0.4202889121016501
Correlation to MPG for  CYLINDERS -0.7753962854205548
Correlation to MPG for  DISPLACEMENT -0.8042028248058979
Correlation to MPG for  HORSEPOWER -0.7746308409203807
Correlation to MPG for  MODELYEAR 0.5792671330833091
Correlation to MPG for  MPG 1.0
Correlation to MPG for  WEIGHT -0.8317409332443347
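
`stat.corr` computes the Pearson correlation coefficient. As a rough illustration of what that number means, the sketch below computes it by hand for MPG and WEIGHT on the first five rows shown earlier. The function name `pearson` is illustrative, and because only five rows are used, the result will not equal the full-dataset value printed above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of the DataFrame shown earlier
mpg = [18.0, 15.0, 18.0, 16.0, 17.0]
weight = [3504.0, 3693.0, 3436.0, 3433.0, 3449.0]
print(pearson(mpg, weight))  # negative: heavier cars tend to have lower MPG
```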


# **Prepare data**

Transform to a DataFrame for input to machine learning

Keep ACCELERATION, DISPLACEMENT and WEIGHT as features and drop the remaining columns

In [0]:
def transformToLabeledPoint(row):
    # Return a (label, features) pair: MPG is the label, the rest the feature vector
    lp = (row["MPG"], Vectors.dense([row["ACCELERATION"],
                                     row["DISPLACEMENT"],
                                     row["WEIGHT"]]))
    return lp

In [34]:
autoLp = autoMap.map(transformToLabeledPoint)
autoDF = sqlSession.createDataFrame(autoLp,["label", "features"])
autoDF.select("label","features").show(10)

+-----+-------------------+
|label|           features|
+-----+-------------------+
| 18.0|[12.0,307.0,3504.0]|
| 15.0|[11.5,350.0,3693.0]|
| 18.0|[11.0,318.0,3436.0]|
| 16.0|[12.0,304.0,3433.0]|
| 17.0|[10.5,302.0,3449.0]|
| 15.0|[10.0,429.0,4341.0]|
| 14.0| [9.0,454.0,4354.0]|
| 14.0| [8.5,440.0,4312.0]|
| 14.0|[10.0,455.0,4425.0]|
| 15.0| [8.5,390.0,3850.0]|
+-----+-------------------+
only showing top 10 rows



# **Linear Regression**

Split into training and testing data

In [36]:
(trainingData, testData) = autoDF.randomSplit([0.9, 0.1])
print(trainingData.count())
print(testData.count())

357
41
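
`randomSplit` assigns rows at random, so the counts above, and every number that follows, can change from run to run. `DataFrame.randomSplit` accepts an optional `seed` argument for repeatable splits. The sketch below mimics the idea in plain Python; it is not Spark's actual sampling implementation, and the function name `random_split` and the seed value are illustrative.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test independently, using a seeded RNG."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(398))  # stand-in for the 398 data rows
train_a, test_a = random_split(rows, 0.9, seed=42)
train_b, test_b = random_split(rows, 0.9, seed=42)

print(len(train_a), len(test_a))  # roughly a 90% / 10% split
print(train_a == train_b)         # True: same seed, same split
```

In the notebook itself, this would correspond to a call such as `autoDF.randomSplit([0.9, 0.1], seed=42)`.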


Build the model on training data

In [0]:
lr = LinearRegression(maxIter=10)
lrModel = lr.fit(trainingData)

Print the model coefficients and intercept

In [40]:
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

Coefficients: [0.17302020526793177,-0.011194610723708823,-0.006008098039159543]
Intercept: 40.726790830195036


Predict on the test data

In [41]:
predictions = lrModel.transform(testData)
predictions.select("prediction","label","features").show()

+------------------+-----+-------------------+
|        prediction|label|           features|
+------------------+-----+-------------------+
|  8.16193370038711| 12.0|[11.5,429.0,4952.0]|
|13.206308520744244| 13.0|[12.0,350.0,4274.0]|
| 9.810755710075497| 13.0|[12.0,400.0,4746.0]|
|15.097644765211804| 13.0|[13.0,350.0,3988.0]|
|  16.6215590029507| 14.0| [8.0,340.0,3609.0]|
|11.364915111684628| 14.0| [8.5,440.0,4312.0]|
| 17.25162553097493| 14.0|[11.5,304.0,3672.0]|
|14.700386941961774| 15.0| [8.5,390.0,3850.0]|
|18.648601642899596| 15.0|[11.0,318.0,3399.0]|
| 22.31156282406449| 18.0|[16.5,225.0,3121.0]|
|18.754736239232386| 18.0|[19.0,225.0,3785.0]|
| 20.08862006793819| 18.0|[21.0,250.0,3574.0]|
|20.777491807567515| 20.0|[13.5,262.0,3221.0]|
|22.477528891031422| 21.0|[15.0,231.0,3039.0]|
| 24.17112934361465| 22.0|[15.5,198.0,2833.0]|
|27.024584602900852| 24.0|[15.0,120.0,2489.0]|
|30.451941358951615| 24.5| [22.1,98.0,2164.0]|
| 27.83422364205696| 26.0|[15.5,108.0,2391.0]|
|27.490033414
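
Each prediction is the intercept plus the dot product of the coefficients with the feature vector (ACCELERATION, DISPLACEMENT, WEIGHT). The sketch below reproduces the second row of the table above by hand, using the coefficients and intercept printed earlier; the function name `predict` is illustrative.

```python
# Values copied from the "Coefficients" and "Intercept" output above
coefficients = [0.17302020526793177, -0.011194610723708823, -0.006008098039159543]
intercept = 40.726790830195036

def predict(features):
    """Linear model: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Features of the second row in the prediction table
print(predict([12.0, 350.0, 4274.0]))  # about 13.2063
```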

Find R2 for Linear Regression

In [42]:
evaluator = RegressionEvaluator(predictionCol="prediction",
                                labelCol="label", metricName="r2")
evaluator.evaluate(predictions)

0.7596688267173703
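
R2 compares the model's squared error with that of always predicting the mean label; by the usual definition it is 1 minus the ratio of the two. The sketch below applies that definition to six rows hand-picked from the prediction table (predictions rounded to three decimals). Because it uses only a subset, the result will differ from the value above; the function name `r_squared` is illustrative.

```python
def r_squared(labels, predictions):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Six (label, prediction) pairs taken from the prediction table
labels = [12.0, 15.0, 18.0, 21.0, 24.0, 26.0]
preds = [8.162, 14.700, 22.312, 22.478, 27.025, 27.834]
print(r_squared(labels, preds))
```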