## DataCamp Tutorial on Spark

[Apache Spark in Python: Beginner's Guide](https://www.datacamp.com/community/tutorials/apache-spark-python#gs.fMIIqxM)

[Apache Spark Tutorial: ML with PySpark](https://www.datacamp.com/community/tutorials/apache-spark-tutorial-machine-learning)

[6 tips](https://www.datacamp.com/community/blog/fast-track-apache-spark)

$ pip install findspark

In [3]:
import findspark

In [4]:
findspark.init()

### Quick Intro to RDD

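Transformations are lazy and return a new RDD; actions trigger the computation and return a result to the driver or write it to storage.

The most common transformations are:

* map()
* filter()
* flatMap()
* reduceByKey()
* sample()
* randomSplit()
* coalesce()
* repartition()

The most common actions are:

* reduce()
* collect()
* first()
* take()
* count()
* saveAsHadoopFile()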

In [10]:
# Import SparkSession
from pyspark.sql import SparkSession

# Build the SparkSession
spark = SparkSession.builder \
   .master("local") \
   .appName("Linear Regression Model") \
   .config("spark.executor.memory", "4gb") \
   .getOrCreate()
   
sc = spark.sparkContext

In [11]:
# from pyspark import SparkContext, SparkConf

Note that the SparkSession object contains the SparkContext object, which you can access with spark.sparkContext. For backwards compatibility, you can still work with the SparkContext directly through sc, as in

rdd1 = sc.parallelize([('a',7),('a',2),('b',2)])

#### rdd1

In [12]:
rdd1 = sc.parallelize([('a',7),('a',2),('b',2)])

In [13]:
rdd1.collect()

[('a', 7), ('a', 2), ('b', 2)]

In [14]:
from operator import add

In [15]:
rdd1_merge = rdd1.reduceByKey(add)

In [16]:
rdd1_merge.collect()

[('a', 9), ('b', 2)]

In [17]:
rdd1.reduce(lambda a,b: a+b)

('a', 7, 'a', 2, 'b', 2)

In [18]:
rdd1.reduceByKey(lambda a,b: a+b).collect()

[('a', 9), ('b', 2)]
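
The difference between reduce() and reduceByKey() seen above can be mimicked in plain Python. This is only a local sketch, not how Spark executes either call: `reduce_by_key` is a hypothetical helper, and `pairs` repeats the data used for rdd1. reduce() folds whole elements together, so adding tuples concatenates them, while the by-key version only combines the values that share a key.

```python
from functools import reduce
from operator import add

pairs = [('a', 7), ('a', 2), ('b', 2)]

# reduce() combines whole elements; tuple + tuple concatenates
flat = reduce(add, pairs)

def reduce_by_key(items, func):
    """Hypothetical local stand-in for reduceByKey: merge the values of each key."""
    merged = {}
    for key, value in items:
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())

by_key = reduce_by_key(pairs, add)

print(flat)
print(by_key)
```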

#### rdd2

In [19]:
rdd2 = sc.parallelize([("a",["x","y","z"]), ("b",["p", "r"])])

In [20]:
rdd2.collect()

[('a', ['x', 'y', 'z']), ('b', ['p', 'r'])]

In [21]:
rdd2.flatMapValues(lambda x: x).collect()

[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
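
As a rough plain-Python analogy (an illustration, not Spark's implementation), flatMapValues() applies a function to each value, keeps the key, and emits one pair per item the function returns. The helper name `flat_map_values` is hypothetical.

```python
pairs = [("a", ["x", "y", "z"]), ("b", ["p", "r"])]

def flat_map_values(items, func):
    """Hypothetical local stand-in: one (key, item) pair per item in func(value)."""
    return [(key, item) for key, value in items for item in func(value)]

flattened = flat_map_values(pairs, lambda x: x)
print(flattened)
```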

#### rdd3

In [22]:
rdd3 = sc.parallelize(range(100))

In [23]:
rdd3.take(10)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

### Model CalHousing data

In [9]:
# Import SparkSession
from pyspark.sql import SparkSession

# Build the SparkSession
spark = SparkSession.builder \
   .master("local") \
   .appName("Linear Regression Model") \
   .config("spark.executor.memory", "4gb") \
   .getOrCreate()
   
sc = spark.sparkContext

#### Load data

In [75]:
# Load in the header
header = sc.textFile('./CaliforniaHousing/cal_housing.domain')

# Load in the data
rdd = sc.textFile('./CaliforniaHousing/cal_housing.data')

#### Explore data

In [76]:
header.collect()

['longitude: continuous.',
 'latitude: continuous.',
 'housingMedianAge: continuous. ',
 'totalRooms: continuous. ',
 'totalBedrooms: continuous. ',
 'population: continuous. ',
 'households: continuous. ',
 'medianIncome: continuous. ',
 'medianHouseValue: continuous. ']
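
The header lines follow a `name: type.` pattern, so the column names can be pulled out with a simple split. The snippet below is a small sketch on a hard-coded copy of the lines shown above; the variable names are illustrative only.

```python
# A copy of the lines returned by header.collect()
header_lines = [
    'longitude: continuous.',
    'latitude: continuous.',
    'housingMedianAge: continuous. ',
    'totalRooms: continuous. ',
    'totalBedrooms: continuous. ',
    'population: continuous. ',
    'households: continuous. ',
    'medianIncome: continuous. ',
    'medianHouseValue: continuous. ',
]

# Keep only the part before the colon and trim whitespace
column_names = [line.split(":")[0].strip() for line in header_lines]
print(column_names)
```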

Avoid rdd.collect() on large RDDs: it pulls every element back to the driver and can exhaust the driver's memory. Use take(n) to inspect a few elements instead.

In [77]:
rdd.take(2)

['-122.230000,37.880000,41.000000,880.000000,129.000000,322.000000,126.000000,8.325200,452600.000000',
 '-122.220000,37.860000,21.000000,7099.000000,1106.000000,2401.000000,1138.000000,8.301400,358500.000000']

In [78]:
# Split lines on commas
rdd = rdd.map(lambda line: [float(i) for i in line.split(",")])

# Inspect the first 2 lines 
rdd.take(2)

[[-122.23, 37.88, 41.0, 880.0, 129.0, 322.0, 126.0, 8.3252, 452600.0],
 [-122.22, 37.86, 21.0, 7099.0, 1106.0, 2401.0, 1138.0, 8.3014, 358500.0]]

In [79]:
rdd.first()

[-122.23, 37.88, 41.0, 880.0, 129.0, 322.0, 126.0, 8.3252, 452600.0]

In [80]:
rdd.top(2)

[[-114.31, 34.19, 15.0, 5612.0, 1283.0, 1015.0, 472.0, 1.4936, 66900.0],
 [-114.47, 34.4, 19.0, 7650.0, 1901.0, 1129.0, 463.0, 1.82, 80100.0]]

In [81]:
# convert to dataframe
from pyspark.sql import Row

# Map the RDD to a DF
df = rdd.map(lambda line: Row(longitude=line[0], 
                              latitude=line[1], 
                              housingMedianAge=line[2],
                              totalRooms=line[3],
                              totalBedRooms=line[4],
                              population=line[5], 
                              households=line[6],
                              medianIncome=line[7],
                              medianHouseValue=line[8])).toDF()

In [82]:
df.columns

['households',
 'housingMedianAge',
 'latitude',
 'longitude',
 'medianHouseValue',
 'medianIncome',
 'population',
 'totalBedRooms',
 'totalRooms']

In [83]:
df.dtypes

[('households', 'double'),
 ('housingMedianAge', 'double'),
 ('latitude', 'double'),
 ('longitude', 'double'),
 ('medianHouseValue', 'double'),
 ('medianIncome', 'double'),
 ('population', 'double'),
 ('totalBedRooms', 'double'),
 ('totalRooms', 'double')]

In [84]:
df.printSchema()

root
 |-- households: double (nullable = true)
 |-- housingMedianAge: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- medianHouseValue: double (nullable = true)
 |-- medianIncome: double (nullable = true)
 |-- population: double (nullable = true)
 |-- totalBedRooms: double (nullable = true)
 |-- totalRooms: double (nullable = true)



In [85]:
df.show(5)

+----------+----------------+--------+---------+----------------+------------+----------+-------------+----------+
|households|housingMedianAge|latitude|longitude|medianHouseValue|medianIncome|population|totalBedRooms|totalRooms|
+----------+----------------+--------+---------+----------------+------------+----------+-------------+----------+
|     126.0|            41.0|   37.88|  -122.23|        452600.0|      8.3252|     322.0|        129.0|     880.0|
|    1138.0|            21.0|   37.86|  -122.22|        358500.0|      8.3014|    2401.0|       1106.0|    7099.0|
|     177.0|            52.0|   37.85|  -122.24|        352100.0|      7.2574|     496.0|        190.0|    1467.0|
|     219.0|            52.0|   37.85|  -122.25|        341300.0|      5.6431|     558.0|        235.0|    1274.0|
|     259.0|            52.0|   37.85|  -122.25|        342200.0|      3.8462|     565.0|        280.0|    1627.0|
+----------+----------------+--------+---------+----------------+------------+----------+-------------+----------+

In [86]:
rdd.take(2)

[[-122.23, 37.88, 41.0, 880.0, 129.0, 322.0, 126.0, 8.3252, 452600.0],
 [-122.22, 37.86, 21.0, 7099.0, 1106.0, 2401.0, 1138.0, 8.3014, 358500.0]]

In [87]:
# Import all from `sql.types`
from pyspark.sql.types import *

# Write a custom function to convert the data type of DataFrame columns
def convertColumn(df, names, newType):
    for name in names: 
        df = df.withColumn(name, df[name].cast(newType))
    return df 

# Assign all column names to `columns`
columns = ['households', 'housingMedianAge', 'latitude', 'longitude', 'medianHouseValue', 'medianIncome', 'population', 'totalBedRooms', 'totalRooms']

# Convert the `df` columns to `FloatType()`
df2 = convertColumn(df, columns, FloatType())

In [88]:
df2.dtypes

[('households', 'float'),
 ('housingMedianAge', 'float'),
 ('latitude', 'float'),
 ('longitude', 'float'),
 ('medianHouseValue', 'float'),
 ('medianIncome', 'float'),
 ('population', 'float'),
 ('totalBedRooms', 'float'),
 ('totalRooms', 'float')]

In [89]:
df2.describe().show()

+-------+-----------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+------------------+
|summary|       households|  housingMedianAge|         latitude|          longitude|  medianHouseValue|      medianIncome|        population|    totalBedRooms|        totalRooms|
+-------+-----------------+------------------+-----------------+-------------------+------------------+------------------+------------------+-----------------+------------------+
|  count|            20640|             20640|            20640|              20640|             20640|             20640|             20640|            20640|             20640|
|   mean|499.5396802325581|28.639486434108527|35.63186143109965|-119.56970444871473|206855.81690891474|3.8706710030346416|1425.4767441860465|537.8980135658915|2635.7630813953488|
| stddev|382.3297528316098| 12.58555761211163|2.135952380602968|  2.003531742932898|115395.61587441359|1.

In [90]:
df2.select('population','totalBedRooms').show(3)

+----------+-------------+
|population|totalBedRooms|
+----------+-------------+
|     322.0|        129.0|
|    2401.0|       1106.0|
|     496.0|        190.0|
+----------+-------------+
only showing top 3 rows



In [91]:
df2.groupBy("housingMedianAge").count().sort("housingMedianAge",ascending=False).show(5)


+----------------+-----+
|housingMedianAge|count|
+----------------+-----+
|            52.0| 1273|
|            51.0|   48|
|            50.0|  136|
|            49.0|  134|
|            48.0|  177|
+----------------+-----+
only showing top 5 rows



#### Data Preprocessing

In [92]:
# Import all from `sql.functions` 
from pyspark.sql.functions import *

# Adjust the values of `medianHouseValue`
# express the house values in units of 100,000
df2 = df2.withColumn("medianHouseValue", col("medianHouseValue")/100000)

# Show the first 2 rows of `df2`
df2.take(2)

[Row(households=126.0, housingMedianAge=41.0, latitude=37.880001068115234, longitude=-122.2300033569336, medianHouseValue=4.526, medianIncome=8.325200080871582, population=322.0, totalBedRooms=129.0, totalRooms=880.0),
 Row(households=1138.0, housingMedianAge=21.0, latitude=37.86000061035156, longitude=-122.22000122070312, medianHouseValue=3.585, medianIncome=8.301400184631348, population=2401.0, totalBedRooms=1106.0, totalRooms=7099.0)]
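
The long decimal tails on latitude, longitude and medianIncome come from the cast to FloatType, which stores 32-bit floats; when a value is read back into Python it is widened to a 64-bit float and the rounding becomes visible. The following sketch reproduces the effect locally with the standard `struct` module; `to_float32` is a hypothetical helper.

```python
import struct

def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float and widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]

latitude = to_float32(37.88)   # not exactly representable in 32 bits
age = to_float32(41.0)         # small whole numbers survive unchanged

print(latitude)
print(age)
```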

##### feature engineering

In [93]:
# Import all from `sql.functions` if you haven't yet
from pyspark.sql.functions import *

# Divide `totalRooms` by `households`
roomsPerHousehold = df2.select("totalRooms", "households", col("totalRooms")/col("households"))

# Divide `population` by `households`
populationPerHousehold = df2.select(col("population")/col("households"))

# Divide `totalBedRooms` by `totalRooms`
bedroomsPerRoom = df2.select(col("totalBedRooms")/col("totalRooms"))

In [94]:
roomsPerHousehold.show(2)

+----------+----------+-------------------------+
|totalRooms|households|(totalRooms / households)|
+----------+----------+-------------------------+
|     880.0|     126.0|        6.984126984126984|
|    7099.0|    1138.0|        6.238137082601054|
+----------+----------+-------------------------+
only showing top 2 rows



In [95]:
populationPerHousehold.show(2)

+-------------------------+
|(population / households)|
+-------------------------+
|       2.5555555555555554|
|        2.109841827768014|
+-------------------------+
only showing top 2 rows



In [96]:
bedroomsPerRoom.show(2)

+----------------------------+
|(totalBedRooms / totalRooms)|
+----------------------------+
|         0.14659090909090908|
|         0.15579659106916466|
+----------------------------+
only showing top 2 rows



In [97]:
# Add the new columns to `df2`

df2 = df2.withColumn("roomsPerHousehold", col("totalRooms")/col("households")) \
   .withColumn("populationPerHousehold", col("population")/col("households")) \
   .withColumn("bedroomsPerRoom", col("totalBedRooms")/col("totalRooms"))



# Inspect the result
df2.show(2)

+----------+----------------+--------+---------+----------------+------------+----------+-------------+----------+-----------------+----------------------+-------------------+
|households|housingMedianAge|latitude|longitude|medianHouseValue|medianIncome|population|totalBedRooms|totalRooms|roomsPerHousehold|populationPerHousehold|    bedroomsPerRoom|
+----------+----------------+--------+---------+----------------+------------+----------+-------------+----------+-----------------+----------------------+-------------------+
|     126.0|            41.0|   37.88|  -122.23|           4.526|      8.3252|     322.0|        129.0|     880.0|6.984126984126984|    2.5555555555555554|0.14659090909090908|
|    1138.0|            21.0|   37.86|  -122.22|           3.585|      8.3014|    2401.0|       1106.0|    7099.0|6.238137082601054|     2.109841827768014|0.15579659106916466|
+----------+----------------+--------+---------+----------------+------------+----------+-------------+----------+-----------------+----------------------+-------------------+

In [98]:
df2.first()

Row(households=126.0, housingMedianAge=41.0, latitude=37.880001068115234, longitude=-122.2300033569336, medianHouseValue=4.526, medianIncome=8.325200080871582, population=322.0, totalBedRooms=129.0, totalRooms=880.0, roomsPerHousehold=6.984126984126984, populationPerHousehold=2.5555555555555554, bedroomsPerRoom=0.14659090909090908)

In [99]:
# Re-order and select columns (display only: the result is not assigned, so `df2` keeps its column order)
df2.select("medianHouseValue", 
              "totalBedRooms", 
              "population", 
              "households", 
              "medianIncome", 
              "roomsPerHousehold", 
              "populationPerHousehold", 
              "bedroomsPerRoom").show(2)

+----------------+-------------+----------+----------+------------+-----------------+----------------------+-------------------+
|medianHouseValue|totalBedRooms|population|households|medianIncome|roomsPerHousehold|populationPerHousehold|    bedroomsPerRoom|
+----------------+-------------+----------+----------+------------+-----------------+----------------------+-------------------+
|           4.526|        129.0|     322.0|     126.0|      8.3252|6.984126984126984|    2.5555555555555554|0.14659090909090908|
|           3.585|       1106.0|    2401.0|    1138.0|      8.3014|6.238137082601054|     2.109841827768014|0.15579659106916466|
+----------------+-------------+----------+----------+------------+-----------------+----------------------+-------------------+
only showing top 2 rows



##### Standardization

In [100]:
# Import `DenseVector`
from pyspark.ml.linalg import DenseVector

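# Define the `input_data` as (label, features) pairs
# Note: `df2` still has its alphabetical column order, so x[0] is `households`, not `medianHouseValue`
input_data = df2.rdd.map(lambda x: (float(x[0]), DenseVector(x[1:])))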

# convert RDD to DataFrame
df3 = spark.createDataFrame(input_data, ["label", "features"])

In [101]:
input_data.take(3)

[(126.0,
  DenseVector([41.0, 37.88, -122.23, 4.526, 8.3252, 322.0, 129.0, 880.0, 6.9841, 2.5556, 0.1466])),
 (1138.0,
  DenseVector([21.0, 37.86, -122.22, 3.585, 8.3014, 2401.0, 1106.0, 7099.0, 6.2381, 2.1098, 0.1558])),
 (177.0,
  DenseVector([52.0, 37.85, -122.24, 3.521, 7.2574, 496.0, 190.0, 1467.0, 8.2881, 2.8023, 0.1295]))]

In [102]:
df3.show(2)

+------+--------------------+
| label|            features|
+------+--------------------+
| 126.0|[41.0,37.88000106...|
|1138.0|[21.0,37.86000061...|
+------+--------------------+
only showing top 2 rows



In [103]:
# Import `StandardScaler` 
from pyspark.ml.feature import StandardScaler

# Initialize the `standardScaler`
standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")

# Fit the scaler to the DataFrame
scaler = standardScaler.fit(df3)

# Transform the data in `df3` with the scaler
scaled_df = scaler.transform(df3)

# Inspect the result
scaled_df.take(2)

[Row(label=126.0, features=DenseVector([41.0, 37.88, -122.23, 4.526, 8.3252, 322.0, 129.0, 880.0, 6.9841, 2.5556, 0.1466]), features_scaled=DenseVector([3.2577, 17.7345, -61.0073, 3.9222, 4.3821, 0.2843, 0.3062, 0.4034, 2.8228, 0.2461, 2.5264])),
 Row(label=1138.0, features=DenseVector([21.0, 37.86, -122.22, 3.585, 8.3014, 2401.0, 1106.0, 7099.0, 6.2381, 2.1098, 0.1558]), features_scaled=DenseVector([1.6686, 17.7251, -61.0023, 3.1067, 4.3696, 2.1202, 2.6255, 3.254, 2.5213, 0.2031, 2.6851]))]
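
With its default settings (withStd=True, withMean=False), StandardScaler divides each feature by its sample standard deviation and does not subtract the mean. That matches the output above: the first feature, housingMedianAge, is 41.0, and dividing it by the stddev reported by describe() gives the scaled value 3.2577. A plain-Python sketch of the same arithmetic, with the hypothetical helper `scale_by_std`:

```python
from statistics import stdev

def scale_by_std(values):
    """Divide each value by the column's sample standard deviation (no centering)."""
    s = stdev(values)
    return [v / s for v in values]

# Toy column whose sample standard deviation is 1.0
toy = scale_by_std([1.0, 2.0, 3.0])

# housingMedianAge of the first row divided by the stddev from describe()
check = 41.0 / 12.58555761211163

print(toy)
print(round(check, 4))
```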

### Building A Machine Learning Model With Spark ML

In [67]:
# Split the data into train and test sets
train_data, test_data = scaled_df.randomSplit([.8,.2],seed=1234)

In [68]:
train_data.describe().show()

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|             16479|
|   mean| 497.8825778263244|
| stddev|380.28622388404426|
|    min|               2.0|
|    max|            6082.0|
+-------+------------------+
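
The training set holds 16479 of the 20640 rows, about 79.8% rather than exactly 80%, because randomSplit() assigns rows at random according to the weights instead of cutting at a fixed position. The sketch below imitates that behaviour locally with one random draw per row; it is an illustration under that assumption, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using a random draw and cumulative weights."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total          # normalize so the weights sum to 1
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # the last split also catches any floating-point remainder
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(range(20640), [0.8, 0.2], seed=1234)
print(len(train), len(test))
```

The split sizes are close to the requested proportions but vary with the seed.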



In [69]:
# Import `LinearRegression`
from pyspark.ml.regression import LinearRegression

# Initialize `lr` (featuresCol defaults to "features", so the unscaled column is used)
lr = LinearRegression(labelCol="label", maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model to the training data
linearModel = lr.fit(train_data)

In [70]:
# Generate predictions
predicted = linearModel.transform(test_data)

# Extract the predictions and the "known" correct labels
predictions = predicted.select("prediction").rdd.map(lambda x: x[0])
labels = predicted.select("label").rdd.map(lambda x: x[0])

# Zip `predictions` and `labels` into a list
predictionAndLabel = predictions.zip(labels).collect()

# Print out first 5 instances of `predictionAndLabel` 
predictionAndLabel[:5]

[(17.703271290170505, 1.0),
 (40.39784229916637, 2.0),
 (79.06895639200735, 2.0),
 (22.432441402048084, 5.0),
 (35.64138989305235, 7.0)]

### Evaluating the Model

In [71]:
# Coefficients for the model
linearModel.coefficients

DenseVector([0.0124, 0.219, -4.4558, 7.2281, -0.332, 0.0534, 0.6431, 0.0191, -14.2521, -0.5603, 0.0])

In [72]:
# Intercept for the model
linearModel.intercept

-448.4032179025119

In [73]:
# Get the RMSE (computed on the training data)
linearModel.summary.rootMeanSquaredError

61.13583281212973

In [74]:
# Get the R2 (computed on the training data)
linearModel.summary.r2

0.9741537920380764
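
The values from linearModel.summary describe the fit on the training data. To get the same metrics for held-out data, the standard RMSE and R2 formulas can be applied to a list of (prediction, label) pairs such as predictionAndLabel. The functions below are a minimal sketch with hypothetical names, shown on toy numbers rather than on the housing data.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (prediction, label) pairs."""
    return sqrt(sum((p - y) ** 2 for p, y in pairs) / len(pairs))

def r2(pairs):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    labels = [y for _, y in pairs]
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for p, y in pairs)
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# Toy (prediction, label) pairs
toy_pairs = [(1.0, 1.0), (2.0, 2.0), (4.0, 3.0)]

print(rmse(toy_pairs))
print(r2(toy_pairs))
```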

In [None]:
spark.stop()