### Machine learning with Spark on Dataproc

Source: https://www.cloudskillsboost.google/focuses/3390?locale=en&parent=catalog
    

Apache Spark is an analytics engine for large-scale data processing. Logistic regression is available in Apache Spark's machine learning library, MLlib (also called Spark ML), which includes implementations of most standard machine learning algorithms, such as k-means clustering, random forests, alternating least squares, decision trees, and support vector machines. Spark can run on a Hadoop cluster, such as one provided by Dataproc, to process very large datasets in parallel.

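Note: the code below reads from a Cloud Storage bucket named after your project ID with the suffix "-dsongcp" (for example, "my-project-1011-1012-dsongcp"). Create this bucket before running the notebook; the guide does not mention this step.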
    

In [None]:
PROJECT = !gcloud config get-value project

In [None]:
PROJECT=PROJECT[0]
BUCKET = PROJECT + '-dsongcp'
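
The `!gcloud ...` syntax only works inside IPython/Jupyter, where it returns the command's output as a list of lines (hence `PROJECT[0]`). Outside a notebook, a rough equivalent can be sketched with `subprocess`. The helper name `first_output_line` and the stand-in command are assumptions made for illustration; they are not part of the lab.

```python
import subprocess
import sys


def first_output_line(cmd):
    """Run cmd (a list of arguments) and return the first line of its stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()[0].strip()


# Stand-in command so the sketch runs anywhere. In an environment where
# gcloud is configured, use ["gcloud", "config", "get-value", "project"].
demo_cmd = [sys.executable, "-c", "print('demo-project')"]

PROJECT_DEMO = first_output_line(demo_cmd)
BUCKET_DEMO = PROJECT_DEMO + "-dsongcp"
print(BUCKET_DEMO)
```

Taking only the first line mirrors the `PROJECT[0]` indexing used in the notebook cell.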


In [None]:
import os
os.environ['BUCKET'] = BUCKET

In [None]:
from pyspark.sql import SparkSession

In [None]:
from pyspark import SparkContext

In [None]:
sc = SparkContext('local', 'logistic')

In [None]:
spark = SparkSession \
    .builder \
    .appName('Logistic regression w/ Spark ML') \
    .getOrCreate()

In [None]:
# SparkContext is already imported above; only the MLlib classes are needed here
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

In [None]:
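# Load and parse the data
def parsePoint(line):
    """Convert one space-separated line into a LabeledPoint.

    The first value is the label; the remaining values are the features.
    """
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])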
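
To see what `parsePoint` does to a single line without starting Spark, the same parsing can be sketched in plain Python. The function `parse_values` and the sample line are hypothetical; the line is not taken from `sample_svm_data.txt`.

```python
def parse_values(line):
    """Split a space-separated line into (label, features) as plain floats."""
    values = [float(x) for x in line.split(' ')]
    return values[0], values[1:]


# Hypothetical input line: label first, then four feature values
label, features = parse_values("1 0 2.5 0 4")
print(label, features)  # → 1.0 [0.0, 2.5, 0.0, 4.0]
```

`LabeledPoint` wraps the same two pieces, the label and the feature vector, in the form the MLlib trainers expect.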

In [None]:
data = sc.textFile('gs://{}/spark/sample_svm_data.txt'.format(BUCKET))


In [None]:
parsedData = data.map(parsePoint)

In [None]:
# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)
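
As a rough illustration of what a trained binary logistic regression model computes at prediction time, the sketch below applies the logistic (sigmoid) function to a weighted sum of the features and thresholds the result. The weights, intercept, and the function name `predict_binary` are made up for this example; MLlib's internals may differ in detail.

```python
import math


def predict_binary(weights, intercept, features, threshold=0.5):
    """Return 1 if the logistic score exceeds the threshold, else 0."""
    # Linear combination of the features plus the intercept
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    # Logistic (sigmoid) function maps the margin to a score in (0, 1)
    score = 1.0 / (1.0 + math.exp(-margin))
    return 1 if score > threshold else 0


# Hypothetical weights, not the ones learned by the model above
weights = [2.0, -1.0]
print(predict_binary(weights, 0.0, [1.0, 0.5]))  # margin is 1.5
print(predict_binary(weights, 0.0, [0.0, 1.0]))  # margin is -1.0
```

A positive margin gives a score above 0.5 and therefore class 1; a negative margin gives class 0.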

In [None]:
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))

In [None]:
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
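
The training error above is the fraction of examples whose predicted label differs from the true label. The same arithmetic can be sketched on a small in-memory list, without Spark. The `(label, prediction)` pairs below are invented for illustration.

```python
def training_error(pairs):
    """Fraction of (label, prediction) pairs that disagree."""
    mismatches = [lp for lp in pairs if lp[0] != lp[1]]
    return len(mismatches) / float(len(pairs))


# Hypothetical (label, prediction) pairs: one of the four is wrong
labels_and_preds = [(1.0, 1), (0.0, 0), (1.0, 0), (0.0, 0)]
err = training_error(labels_and_preds)
print("Training Error = " + str(err))  # → Training Error = 0.25
```

Because this error is measured on the same data the model was trained on, it likely understates the error on unseen data.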