# Exercise 01 : Basics of PySpark and SparkML
This exercise runs the same simple linear regression task twice on Databricks: once with a **pandas dataframe** and scikit-learn, and once with a **pyspark dataframe** and Spark MLlib. Comparing the two shows how the latter works.

- pandas dataframe : It runs only on the driver (master) node and is not distributed.
- pyspark dataframe : It runs as jobs on the worker nodes and is distributed.

## Section 01 : Using pandas dataframe

In [3]:
# prepare data
import numpy as np
np.random.seed(0)
x = np.arange(-10, 11)
# np.random.normal() without size returns one scalar, so every y gets the same offset
y = 2*x + 1 + np.random.normal()
l = list(zip(x, y))
l
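
Because `np.random.normal()` is called without a `size` argument, the cell above adds one shared constant to every point rather than independent noise. The sketch below contrasts that with a per-point variant. The per-point version is an assumption about what might have been intended, not something the original notebook does, and the names `shared_noise`, `per_point_noise`, and the `residuals_*` arrays are introduced here only for illustration.

```python
import numpy as np

x = np.arange(-10, 11)

# Notebook behaviour: a single scalar draw is broadcast to all 21 points.
np.random.seed(0)
shared_noise = np.random.normal()
y_shared = 2 * x + 1 + shared_noise

# Hypothetical variant (not in the original): one independent draw per point.
np.random.seed(0)
per_point_noise = np.random.normal(size=x.shape)
y_per_point = 2 * x + 1 + per_point_noise

# Residuals from the noiseless line y = 2x + 1.
residuals_shared = y_shared - (2 * x + 1)        # all entries identical
residuals_per_point = y_per_point - (2 * x + 1)  # entries differ
```

With the shared offset the data still lie exactly on a straight line, so a linear regression fits them with zero error. The rest of this exercise keeps the notebook's original scalar noise.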

In [4]:
# create pandas dataframe
import pandas as pd
df = pd.DataFrame(l, columns=["x","y"])
df

In [5]:
# pandas dataframe transform
df["x"] = df["x"] + 1
df

In [6]:
# linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
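# scikit-learn expects 2-D arrays; a pandas Series has no reshape in current pandas,
# so convert to a NumPy array first
X = df["x"].values.reshape(-1, 1)
Y = df["y"].values.reshape(-1, 1)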
lr.fit(X, Y)

In [7]:
# predict
# predict expects a 2-D input: one row per sample, one column per feature
lr.predict([[20]])

## Section 02 : Using pyspark dataframe

In [9]:
# prepare data (same as Section 01)
import numpy as np
np.random.seed(0)
x = np.arange(-10, 11)
y = 2*x + 1 + np.random.normal()
l = list(zip(x, y))
l

In [10]:
# create pyspark dataframe (sc and spark are predefined in Databricks notebooks)
from pyspark.sql import Row
rdd = sc.parallelize(l)
people = rdd.map(lambda z: Row(x=int(z[0]), y=float(z[1])))
df = spark.createDataFrame(people)
df.toPandas()

In [11]:
# pyspark dataframe transform
df = df.withColumn("x", df.x + 1)
df.toPandas()

In [12]:
# SparkML linear regression
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols = ["x"], outputCol = "features")
va_df = vectorAssembler.transform(df)

from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol = "features", labelCol="y")
model = lr.fit(va_df)

In [13]:
# predict
from pyspark.sql.types import StructType, StructField, IntegerType

test_schema = StructType([StructField("x", IntegerType())])
test_row = [Row(x=20)]
test_df = spark.createDataFrame(test_row, test_schema)
va_test_df = vectorAssembler.transform(test_df)

pred = model.transform(va_test_df)
test_res = pred.select("x", "prediction")
test_res.cache()
display(test_res)

x,prediction
20,40.76405234596766
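
The prediction above can be checked by hand without Spark. The following is a minimal sketch that rebuilds the same data with NumPy and fits an ordinary least-squares line with `np.polyfit`. Here `np.polyfit` is a stand-in for both regressors used in this exercise, on the assumption that they run with their default, unregularized settings; the names `x_orig`, `slope`, `intercept`, and `prediction` are introduced only for this check.

```python
import numpy as np

# Rebuild the notebook's data: one shared scalar offset, seed 0.
np.random.seed(0)
x_orig = np.arange(-10, 11)
y = 2 * x_orig + 1 + np.random.normal()

# Apply the same "+ 1" transform that was applied to the x column.
x = x_orig + 1

# Ordinary least-squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line at x = 20, the test point used above.
prediction = slope * 20 + intercept
print(slope, intercept, prediction)
```

Since `y` was generated from the untransformed `x`, shifting `x` by 1 leaves the slope at 2 and lowers the intercept by 2. The fitted line is therefore `y = 2*x - 1 + noise`, which at `x = 20` gives a value of about 40.764, in agreement with the Spark result.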
