## Installing PySpark in Google Colab

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 2.4.8 with Hadoop 2.7, Java 8, and findspark, which locates Spark on the system. The installation can be carried out directly from the Colab notebook. If you are new to Spark, note that some users have reported compatibility issues between Spark 2.4.0 and Python, so it is better to avoid that specific release.
Follow these steps to install the dependencies:

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
!tar xf spark-2.4.8-bin-hadoop2.7.tgz
!pip install -q findspark

Now that Spark and Java are installed in Colab, set the environment variables that allow PySpark to run in your Colab environment. Set the locations of Java and Spark by running the following code:

In [2]:
# Point JAVA_HOME and SPARK_HOME at the installed locations
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.8-bin-hadoop2.7"
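A common source of errors at this step is a mismatch between the downloaded archive and the `SPARK_HOME` path. The following is a minimal sketch, not part of the original notebook, that derives both names from a single pair of version strings and reports directories that do not exist. The helper names `spark_paths` and `missing_dirs` are hypothetical, and the `/content` base directory is assumed from the path used above.

```python
import os

# Versions assumed to match the archive downloaded in the install cell.
SPARK_VERSION = "2.4.8"
HADOOP_VERSION = "2.7"

def spark_paths(spark_version, hadoop_version, base_dir="/content"):
    """Build the archive name and SPARK_HOME from one pair of versions."""
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    return {"archive": f"{name}.tgz", "SPARK_HOME": f"{base_dir}/{name}"}

def missing_dirs(paths):
    """Return the paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]

paths = spark_paths(SPARK_VERSION, HADOOP_VERSION)
print(paths["SPARK_HOME"])
# An empty list means both locations exist on this machine.
print(missing_dirs([paths["SPARK_HOME"], "/usr/lib/jvm/java-8-openjdk-amd64"]))
```

Running this check before `findspark.init()` makes a wrong path visible early, instead of surfacing later as a less obvious import or JVM error.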

Run a local Spark session to test your installation:

In [3]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [4]:
# Check the Spark version of the running session
spark.version

'2.4.8'

## Mounting Google Drive

In [None]:
#from google.colab import files
#files.upload()

In [5]:
# Mount Google Drive; the data file was previously uploaded there
from google.colab import drive
drive.mount("/content/drive")
%ls

Mounted at /content/drive
drive/  sample_data/  spark-2.4.8-bin-hadoop2.7/  spark-2.4.8-bin-hadoop2.7.tgz


## Importing the Data

Boston Housing dataset

The data for this model come from the UCI Machine Learning repository. They were collected in 1978, and each of the 506 entries represents aggregated data on 14 characteristics of homes in various Boston neighborhoods. The Boston Housing dataset contains the original data of Harrison and Rubinfeld (1979).

The data have the following features, with medv as the target (dependent) variable:

crim - per capita crime rate by town

zn - proportion of residential land zoned for lots over 25,000 square feet

indus - proportion of non-retail business acres per town

chas - Charles River dummy variable (= 1 if the tract bounds the river, 0 otherwise)

nox - nitric oxides concentration (parts per 10 million)

rm - average number of rooms per dwelling

age - proportion of owner-occupied units built before 1940

dis - weighted distances to five Boston employment centers

rad - index of accessibility to radial highways

tax - full-value property tax rate per USD 10,000

ptratio - pupil-teacher ratio by town

black - 1000 (B - 0.63)^2, where B is the proportion of Black residents by town

lstat - percentage of the population of lower status

medv - median value of owner-occupied homes in USD 1000s
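The `black` column is the only feature defined by a formula rather than a direct measurement, so a small worked example may help when reading its values. The sketch below simply evaluates the stated transformation for a few proportions; the function name `black_index` is hypothetical and not part of the dataset or of PySpark.

```python
def black_index(b):
    """Evaluate the dataset's transformation 1000 * (B - 0.63)^2."""
    return 1000 * (b - 0.63) ** 2

# B is a proportion, so it lies between 0 and 1.
for b in (0.0, 0.63, 1.0):
    print(b, round(black_index(b), 2))
```

For B = 0 the formula gives 396.9, which is the value that appears repeatedly in the `black` column of the data preview.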

In [6]:
path_csv="/content/drive/MyDrive/Colab Notebooks/Certificado Data Science y AI (DSA) Live/BostonHousing.csv"
dataset = spark.read.csv(path_csv,inferSchema=True, header =True)
dataset.show(10)

+-------+----+-----+----+-----+-----+-----+------+---+---+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm|  age|   dis|rad|tax|ptratio| black|lstat|medv|
+-------+----+-----+----+-----+-----+-----+------+---+---+-------+------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575| 65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421| 78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185| 61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998| 45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147| 54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|
|0.02985| 0.0| 2.18|   0|0.458| 6.43| 58.7|6.0622|  3|222|   18.7|394.12| 5.21|28.7|
|0.08829|12.5| 7.87|   0|0.524|6.012| 66.6|5.5605|  5|311|   15.2| 395.6|12.43|22.9|
|0.14455|12.5| 7.87|   0|0.524|6.172| 96.1|5.9505|  5|311|   15.2| 396.9|19.15|27.1|
|0.21124|12.5| 7.87|   0|0.524|5.631|100.0|6.0821|  5|311|   15.2

## Linear and Nonlinear Regression Models


Linear regression is one of the oldest and most widely used machine learning approaches. It assumes a linear relationship between a dependent variable and one or more independent variables. For example, a modeler might want to predict rainfall from the humidity ratio. Linear regression finds the best-fitting line through the scattered points on a graph; this line is known as the regression line.
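To make the idea of a best-fitting line concrete, here is a minimal sketch of ordinary least squares for a single input variable, written in plain Python. The humidity and rainfall numbers are made up for illustration, and `fit_line` is a hypothetical helper, not a PySpark function.

```python
def fit_line(xs, ys):
    """Return the least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on the line y = 0.1 * x - 2.
humidity = [30.0, 45.0, 60.0, 75.0, 90.0]
rain_mm = [1.0, 2.5, 4.0, 5.5, 7.0]

slope, intercept = fit_line(humidity, rain_mm)
print(slope, intercept)
```

With several input variables, as in the Boston Housing data, the same principle applies, but the line becomes a hyperplane with one coefficient per feature.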

The goal of this exercise is to predict housing prices from the given features. We predict prices in the Boston Housing dataset, using medv as the output variable and all the other variables as inputs.

The dataset can be downloaded from [here](https://github.com/asifahmed90/pyspark-ML-in-Colab/blob/master/BostonHousing.csv). Save it to your computer and then upload it to your Colab environment; in this notebook it was uploaded to Google Drive and read from there in the previous section.

For our linear regression model we need to import two classes from PySpark: VectorAssembler and LinearRegression. VectorAssembler is a transformer that combines multiple numeric columns into a single feature vector. If any of our columns contained string values, we could use StringIndexer to convert them into numeric values. Luckily, the BostonHousing dataset contains only numeric values, so we don't need to worry about StringIndexer for now.

In [10]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
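Although StringIndexer is not needed for this dataset, the idea behind it is easy to illustrate. The sketch below is a plain-Python approximation, assuming the default behavior in which the most frequent label receives index 0.0; it is not Spark's implementation, and the `index_labels` helper and the sample labels are hypothetical.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct string to a numeric index, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically in this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[v] for v in values], mapping

indexed, mapping = index_labels(["b", "a", "b", "c", "b", "a"])
print(mapping)   # label -> index
print(indexed)   # the column after indexing
```

The resulting numeric column could then be passed to VectorAssembler together with the other features.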

Notice that we passed inferSchema=True to read.csv. With this option, Spark automatically infers the data type of each column.

Let us print the schema to see the data type of each column:

In [11]:
dataset.printSchema()

root
 |-- crim: double (nullable = true)
 |-- zn: double (nullable = true)
 |-- indus: double (nullable = true)
 |-- chas: integer (nullable = true)
 |-- nox: double (nullable = true)
 |-- rm: double (nullable = true)
 |-- age: double (nullable = true)
 |-- dis: double (nullable = true)
 |-- rad: integer (nullable = true)
 |-- tax: integer (nullable = true)
 |-- ptratio: double (nullable = true)
 |-- black: double (nullable = true)
 |-- lstat: double (nullable = true)
 |-- medv: double (nullable = true)
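The schema above shows both `integer` and `double` columns. As a rough illustration of how a type can be inferred from raw CSV text, the sketch below tries the narrowest type first and falls back to wider ones. This is a deliberate simplification under stated assumptions: Spark's actual inference covers more types and handles nulls, and `infer_type` is a hypothetical helper.

```python
def infer_type(values):
    """Guess a column type from its string values: integer, double, or string."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "string"

print(infer_type(["0", "0", "1"]))             # values like the chas column
print(infer_type(["0.00632", "0.02731"]))      # values like the crim column
print(infer_type(["18.0", "not available"]))   # mixed content
```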



The next step is to combine the features from the separate columns into a single vector column, which we name 'Attributes' via the outputCol parameter.

In [12]:
#Input all the features in one vector column
assembler = VectorAssembler(inputCols=['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat'], outputCol = 'Attributes')

output = assembler.transform(dataset)

#Input vs Output
finalized_data = output.select("Attributes","medv")

finalized_data.show(10)

+--------------------+----+
|          Attributes|medv|
+--------------------+----+
|[0.00632,18.0,2.3...|24.0|
|[0.02731,0.0,7.07...|21.6|
|[0.02729,0.0,7.07...|34.7|
|[0.03237,0.0,2.18...|33.4|
|[0.06905,0.0,2.18...|36.2|
|[0.02985,0.0,2.18...|28.7|
|[0.08829,12.5,7.8...|22.9|
|[0.14455,12.5,7.8...|27.1|
|[0.21124,12.5,7.8...|16.5|
|[0.17004,12.5,7.8...|18.9|
+--------------------+----+
only showing top 10 rows
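Conceptually, the assembler takes the selected columns of each row and packs their values, in order, into one vector. The following plain-Python sketch mimics that for a single row using a few values from the first record shown earlier. It only illustrates the idea: `assemble` is a hypothetical helper, and Spark stores the result as a vector type rather than a Python list.

```python
def assemble(row, input_cols):
    """Collect the chosen columns of a row into one list of floats."""
    return [float(row[col]) for col in input_cols]

# A subset of the first record of the dataset.
row = {"crim": 0.00632, "zn": 18.0, "indus": 2.31, "chas": 0, "medv": 24.0}

features = assemble(row, ["crim", "zn", "indus", "chas"])
print(features, row["medv"])
```

The target column `medv` is left out of the vector on purpose, since it is what the model has to predict.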



Here, 'Attributes' holds the input features from all the columns, and 'medv' is the target column.
Next, we split the data into training and test sets (80% and 20% in this case).

In [13]:
#Split training and testing data
train_data,test_data = finalized_data.randomSplit([0.8,0.2])


regressor = LinearRegression(featuresCol = 'Attributes', labelCol = 'medv')

#Learn to fit the model from training set
regressor = regressor.fit(train_data)

# Alternative: evaluate() returns a summary whose .predictions holds the predicted values
#pred_lr = regressor.evaluate(test_data)
#pred_lr.predictions.show(10)

# Predict the prices on the test set
pred_lr = regressor.transform(test_data)
pred_lr.select('Attributes', 'medv', 'prediction').show(15)

+--------------------+----+------------------+
|          Attributes|medv|        prediction|
+--------------------+----+------------------+
|[0.00632,18.0,2.3...|24.0|29.737074286701933|
|[0.00906,90.0,2.9...|32.2| 31.50878230441754|
|[0.01301,35.0,1.5...|32.7| 30.19710136062311|
|[0.01951,17.5,1.3...|33.0|23.186315589682287|
|[0.02055,85.0,0.7...|24.7|24.350928554693756|
|[0.02729,0.0,7.07...|34.7|30.521622402464057|
|[0.02899,40.0,1.2...|26.6|21.745583918966155|
|[0.03113,0.0,4.39...|17.5|16.182746874540307|
|[0.0315,95.0,1.47...|34.9|29.761912760779715|
|[0.03237,0.0,2.18...|33.4|28.401690431332206|
|[0.03359,75.0,2.9...|34.9| 33.99636786407778|
|[0.03466,35.0,6.0...|19.4| 23.06345162584452|
|[0.03548,80.0,3.6...|20.9| 21.25177868815085|
|[0.03738,0.0,5.19...|20.7|21.220851388122092|
|[0.03871,52.5,5.3...|23.2|26.804837096689347|
+--------------------+----+------------------+
only showing top 15 rows
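Because the split is random, each run of the notebook can produce different training and test sets, and therefore different coefficients and metrics. `randomSplit` accepts an optional `seed` argument (for example `randomSplit([0.8, 0.2], seed=42)`) to make the split repeatable. The sketch below illustrates the idea of a seeded, weighted split in plain Python; it is an assumption-laden approximation rather than Spark's algorithm, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part with probability proportional to its weight."""
    total = sum(weights)
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                parts[i].append(row)
                break
        else:
            # Guard against floating-point rounding at the upper edge.
            parts[-1].append(row)
    return parts

rows = list(range(506))  # one entry per record in the dataset
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))
```

As with Spark's version, the resulting sizes are only approximately 80% and 20%, since every row is assigned independently.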



In [14]:
import numpy as np
print ("Note: the last rows are the information for Intercept")
print ("##","-------------------------------------------------")
print ("##","  Estimate   |   Std.Error | t Values  |  P-value")
coef = np.append(list(regressor.coefficients),regressor.intercept)
Summary=regressor.summary

for i in range(len(Summary.pValues)):
    print ("##",'{:10.6f}'.format(coef[i]),\
    '{:10.6f}'.format(Summary.coefficientStandardErrors[i]),\
    '{:8.3f}'.format(Summary.tValues[i]),\
    '{:10.6f}'.format(Summary.pValues[i]))

print ("##",'---')
print ("##","Mean squared error: % .6f" \
        % Summary.meanSquaredError, ", \
        RMSE: % .6f" \
        % Summary.rootMeanSquaredError )
print ("##","Multiple R-squared: %f" % Summary.r2, ", \
        Total iterations: %i"% Summary.totalIterations)

Note: the last rows are the information for Intercept
## -------------------------------------------------
##   Estimate   |   Std.Error | t Values  |  P-value
##  -0.115210   0.034041   -3.384   0.000784
##   0.045146   0.015063    2.997   0.002897
##   0.032864   0.069903    0.470   0.638515
##   2.814747   0.883792    3.185   0.001563
## -19.010502   4.106463   -4.629   0.000005
##   4.043601   0.463514    8.724   0.000000
##  -0.000252   0.014394   -0.018   0.986024
##  -1.522442   0.224414   -6.784   0.000000
##   0.317895   0.074365    4.275   0.000024
##  -0.012904   0.004231   -3.050   0.002441
##  -0.982464   0.146130   -6.723   0.000000
##   0.008421   0.002937    2.868   0.004357
##  -0.499863   0.054961   -9.095   0.000000
##  36.413644   5.710907    6.376   0.000000
## ---
## Mean squared error:  21.628740 ,         RMSE:  4.650671
## Multiple R-squared: 0.744075 ,         Total iterations: 1
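Two of the columns in this summary can be cross-checked by hand: each t value is the estimate divided by its standard error, and the RMSE is the square root of the mean squared error. The short sketch below repeats that arithmetic with numbers copied from the output above; `t_value` is a hypothetical helper name.

```python
import math

def t_value(estimate, std_error):
    """t statistic of a coefficient: estimate divided by its standard error."""
    return estimate / std_error

# First coefficient row and the intercept row of the summary.
print(round(t_value(-0.115210, 0.034041), 3))
print(round(t_value(36.413644, 5.710907), 3))

# RMSE recomputed from the reported mean squared error.
print(round(math.sqrt(21.628740), 6))
```

The recomputed values agree with the printed table to the displayed precision, which is a quick sanity check that the columns are labeled correctly.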


We can also print the coefficients and the intercept of the regression model with the following commands:

In [None]:
# Coefficients of the regression model
#coeff = regressor.coefficients

# Intercept of the regression model
#intr = regressor.intercept

print ("The coefficient of the model is : ", regressor.coefficients)
print ("The Intercept of the model is : ", regressor.intercept)


The coefficient of the model is :  [-0.1043061977316422,0.04592454511455656,0.039942019197666136,2.1598717051076752,-19.011240129760672,3.9056221349873566,0.002665470978042072,-1.4729229068444338,0.28224671866765494,-0.011930871777798644,-0.9814309777852848,0.008200207580733462,-0.4953285811233533]
The Intercept of the model is :  36.75314025390781
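The printed coefficient vector is easier to read when each value is paired with its feature name, in the same order as the `inputCols` list given to the assembler. The sketch below does that with the values copied from the output above and also shows how a linear model turns a feature vector into a prediction. The `predict` helper is hypothetical; it mirrors the linear formula, not Spark's internal code.

```python
feature_names = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
                 "dis", "rad", "tax", "ptratio", "black", "lstat"]

# Values copied from the printed output of this cell.
coefficients = [-0.1043061977316422, 0.04592454511455656, 0.039942019197666136,
                2.1598717051076752, -19.011240129760672, 3.9056221349873566,
                0.002665470978042072, -1.4729229068444338, 0.28224671866765494,
                -0.011930871777798644, -0.9814309777852848, 0.008200207580733462,
                -0.4953285811233533]
intercept = 36.75314025390781

def predict(features, coefficients, intercept):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

named = dict(zip(feature_names, coefficients))
for name, value in named.items():
    print(f"{name:>8}: {value: .6f}")

# First record of the dataset, in the same column order.
first_row = [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2,
             4.09, 1, 296, 15.3, 396.9, 4.98]
print(predict(first_row, coefficients, intercept))
```

Read this way, a positive coefficient such as the one for `rm` raises the predicted price as the feature increases, while a negative one such as `lstat` lowers it.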


In [None]:
# Carry out random forest regression
from pyspark.ml.regression import RandomForestRegressor
rfr = RandomForestRegressor(featuresCol = 'Attributes', labelCol = 'medv')

#Learn to fit the model from training set
rfr = rfr.fit(train_data)

#To predict the prices on testing set
pred_rfr = rfr.transform(test_data)

# Select example rows to display.
pred_rfr.select('Attributes', 'medv', 'prediction').show(10)

+--------------------+----+------------------+
|          Attributes|medv|        prediction|
+--------------------+----+------------------+
|[0.00632,18.0,2.3...|24.0| 28.67275780592778|
|[0.00906,90.0,2.9...|32.2|28.232331229956777|
|[0.01311,90.0,1.2...|35.4|33.739155780888936|
|[0.0187,85.0,4.15...|23.1|23.982504675639657|
|[0.01951,17.5,1.3...|33.0|31.847689380064388|
|[0.02009,95.0,2.6...|50.0|46.915843018093014|
|[0.02187,60.0,2.9...|31.1|29.025862067176604|
|[0.02729,0.0,7.07...|34.7| 38.44132210725263|
|[0.02731,0.0,7.07...|21.6| 23.44991918071137|
|[0.02763,75.0,2.9...|30.8| 27.80914596081069|
+--------------------+----+------------------+
only showing top 10 rows



In [None]:
#Carry out Gradient-boosted tree regression
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(featuresCol = 'Attributes', labelCol = 'medv')

#Learn to fit the model from training set
gbt = gbt.fit(train_data)

#To predict the prices on testing set
pred_gbt = gbt.transform(test_data)

# Select example rows to display.
pred_gbt.select('Attributes', 'medv', 'prediction').show(10)

+--------------------+----+------------------+
|          Attributes|medv|        prediction|
+--------------------+----+------------------+
|[0.00632,18.0,2.3...|24.0|26.283065059360695|
|[0.00906,90.0,2.9...|32.2|32.220113745574835|
|[0.01311,90.0,1.2...|35.4|  34.9656790467629|
|[0.0187,85.0,4.15...|23.1|  22.6350524449578|
|[0.01951,17.5,1.3...|33.0| 35.55635889114806|
|[0.02009,95.0,2.6...|50.0|  48.5425851394225|
|[0.02187,60.0,2.9...|31.1| 34.85967453431064|
|[0.02729,0.0,7.07...|34.7| 43.00241130659505|
|[0.02731,0.0,7.07...|21.6|23.640480647364864|
|[0.02763,75.0,2.9...|30.8| 24.83272778888755|
+--------------------+----+------------------+
only showing top 10 rows



## Statistical Analysis of the Regression

Once the models have been fitted, we can go a step further and evaluate them statistically with the RegressionEvaluator class from PySpark.

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
# Evaluate the linear regression model
eval_lr = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="rmse")

print("Linear regression model\n")

# Root Mean Square Error
rmse = eval_lr.evaluate(pred_lr)
#rmse = eval_lr.evaluate(pred_lr.predictions)
print("RMSE: %.3f" % rmse)

# Mean Square Error
#mse = eval_lr.evaluate(pred_lr.predictions, {eval_lr.metricName: "mse"})
#print("MSE: %.3f" % mse)

# Mean Absolute Error
#mae = eval_lr.evaluate(pred_lr.predictions, {eval_lr.metricName: "mae"})
#print("MAE: %.3f" % mae)

# r2 - coefficient of determination
#r2 = eval_lr.evaluate(pred_lr.predictions, {eval_lr.metricName: "r2"})
#print("r2: %.3f" %r2)

# r2 - coefficient of determination
r2 = eval_lr.evaluate(pred_lr, {eval_lr.metricName: "r2"})
print("r2: %.3f" %r2)

Linear regression model

RMSE: 5.248
r2: 0.647
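The metric names used with the evaluator (`rmse`, `mse`, `mae`, `r2`) correspond to standard formulas. The sketch below computes them in plain Python from the usual textbook definitions, using the first four label and prediction pairs from the linear regression output shown earlier as sample input. `regression_metrics` is a hypothetical helper, and four rows are far too few to reproduce the figures reported above.

```python
import math

def regression_metrics(labels, predictions):
    """Compute RMSE, MSE, MAE and R^2 from paired labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    squared_error_sum = sum(e * e for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(labels) / n
    total_sum_squares = sum((y - mean_y) ** 2 for y in labels)
    r2 = 1 - squared_error_sum / total_sum_squares
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "r2": r2}

labels = [24.0, 32.2, 32.7, 33.0]
predictions = [29.737074286701933, 31.50878230441754,
               30.19710136062311, 23.186315589682287]

for name, value in regression_metrics(labels, predictions).items():
    print(f"{name}: {value:.3f}")
```

RMSE and MAE are expressed in the units of the target (USD 1000s here), which makes them easier to interpret than MSE.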


In [None]:
eval_rfr = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="rmse")
# Evaluate the random forest regressor

print("Random forest regression model\n")

# Root Mean Square Error
rmse = eval_rfr.evaluate(pred_rfr)
print("RMSE: %.3f" % rmse)

# Mean Square Error
mse = eval_rfr.evaluate(pred_rfr, {eval_rfr.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = eval_rfr.evaluate(pred_rfr, {eval_rfr.metricName: "mae"})
print("MAE: %.3f" % mae)

# r2 - coefficient of determination
r2 = eval_rfr.evaluate(pred_rfr, {eval_rfr.metricName: "r2"})
print("r2: %.3f" % r2)

Random forest regression model

RMSE: 3.419
MSE: 11.691
MAE: 2.508
r2: 0.878


In [None]:
eval_gbt = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="rmse")
# Evaluate the gradient-boosted tree regressor

print("Gradient-boosted tree model\n")

# Root Mean Square Error
rmse = eval_gbt.evaluate(pred_gbt)
print("RMSE: %.3f" % rmse)

# Mean Square Error
mse = eval_gbt.evaluate(pred_gbt, {eval_gbt.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = eval_gbt.evaluate(pred_gbt, {eval_gbt.metricName: "mae"})
print("MAE: %.3f" % mae)

# r2 - coefficient of determination
r2 = eval_gbt.evaluate(pred_gbt, {eval_gbt.metricName: "r2"})
print("r2: %.3f" % r2)

Gradient-boosted tree model

RMSE: 3.951
MSE: 15.613
MAE: 2.881
r2: 0.837
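With the three evaluations side by side, the models can be ranked by test error. The sketch below collects the RMSE and r2 values reported in the outputs above and sorts them in plain Python. It is only a convenience for reading the results; the ranking comes from a single random split and may change from run to run.

```python
# Test-set metrics copied from the evaluation outputs above.
results = {
    "linear regression": {"rmse": 5.248, "r2": 0.647},
    "random forest": {"rmse": 3.419, "r2": 0.878},
    "gradient-boosted trees": {"rmse": 3.951, "r2": 0.837},
}

# Lower RMSE is better, so sort in ascending order.
ranked = sorted(results.items(), key=lambda item: item[1]["rmse"])

for name, metrics in ranked:
    print(f"{name:<24} RMSE={metrics['rmse']:.3f}  r2={metrics['r2']:.3f}")

best = ranked[0][0]
print("Lowest RMSE:", best)
```

On this run, both tree-based models have a lower RMSE and a higher r2 than the linear model, with the random forest ahead of the gradient-boosted trees.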
