# Machine Learning in Spark with MLlib

Spark implements most of the common machine learning algorithms, including:

* Statistical Tests
* Classification and regression
  * Linear models (SVMs, logistic regression, linear regression)
  * Naive Bayes
  * Decision Trees
  * Ensembles of Trees (Random Forests and Gradient-Boosted Trees)
* Collaborative filtering
  * Alternating Least Squares (set to non-negative for NMF)
* Clustering
  * K-means
* Dimensionality reduction
  * Singular Value Decomposition (SVD)
  * Principal Component Analysis (PCA)

For a full list of implementations, see the MLlib guide at http://spark.apache.org/docs/latest/mllib-guide.html.


## Supervised Models and LabeledPoint

Supervised models in Spark require an RDD of LabeledPoint objects, where each observation pairs a label with a feature vector. LabeledPoints can be instantiated by:

```python
from pyspark.mllib.regression import LabeledPoint

# Assumes column 0 holds the label and the remaining columns hold the features
labeled_rdd = some_data_rdd.map(lambda x: LabeledPoint(x[0], x[1:]))
```

To extract the labels and features, map over the points:

```python
labels = some_labelpoints_rdd.map(lambda x: x.label)
features = some_labelpoints_rdd.map(lambda x: x.features)
```

## StandardScaler

Several models in Spark are sensitive to differences in feature scale, so we will often need StandardScaler to put the features on a common scale. StandardScaler is fit on an RDD of feature vectors and can be used as follows:

```python
from pyspark.mllib.feature import StandardScaler

# Fit on the feature vectors, then rescale them to zero mean and unit variance
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled_features = scaler.transform(features)
```
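
To make the effect concrete, here is a plain-Python sketch of the arithmetic that `withMean=True, withStd=True` asks for: subtract the column mean, then divide by the standard deviation. It assumes the corrected (n - 1) sample standard deviation, which is what the Spark documentation describes. The sample values are made up, and Spark itself is not involved.

```python
from statistics import mean, stdev

def standardize(column):
    """Center a list of numbers on 0 and scale it to unit sample std."""
    mu = mean(column)
    sigma = stdev(column)  # corrected (n - 1) sample standard deviation
    return [(value - mu) / sigma for value in column]

# Hypothetical feature column measured on a large scale
day_minutes = [120.0, 180.0, 240.0, 300.0]
scaled = standardize(day_minutes)
print(scaled)
```

After scaling, a feature measured in hundreds and a 0/1 flag sit on comparable scales, which is what scale-sensitive models need.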

## Train Test Split

There is no dedicated train/test split function in Spark, but RDDs provide methods for random sampling and splitting, such as randomSplit.
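
A minimal sketch of a split built on `randomSplit`. The helper name `split_train_test`, the 70/30 ratio, and the seed are illustrative assumptions. The split is random per element, so the resulting sizes only approximate the requested fractions.

```python
def split_train_test(rdd, train_fraction=0.7, seed=42):
    """Randomly split an RDD into (train, test) RDDs via RDD.randomSplit."""
    weights = [train_fraction, 1.0 - train_fraction]
    train, test = rdd.randomSplit(weights, seed=seed)
    return train, test

# Usage, assuming `labeled_rdd` is an existing RDD of LabeledPoint objects:
# train_rdd, test_rdd = split_train_test(labeled_rdd, train_fraction=0.7)
```

Fixing the seed makes the split reproducible between runs on the same data.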

## Model Evaluation

Spark has an evaluation package that allows you to calculate prediction errors.  Some options available are:

* BinaryClassificationMetrics
* MulticlassMetrics
* RegressionMetrics
* RankingMetrics
* More at https://spark.apache.org/docs/1.5.2/mllib-evaluation-metrics.html

Each metrics class offers a selection of suitable metrics, such as mean squared error for regression and precision for classification. All of these classes require an RDD of tuples that pair each prediction with its true label, prediction first. For example:

```python
from pyspark.mllib.evaluation import RegressionMetrics

# valuesAndPreds is an RDD of (prediction, label) tuples
metrics = RegressionMetrics(valuesAndPreds)
print(metrics.meanSquaredError)
```

## Example: Predicting churn

### Step 1: Set up the churn RDD

```python
import pyspark as ps

sc = ps.SparkContext('local[4]')
```

```python
churn_rdd = sc.textFile('churn.csv')
churn_rdd = churn_rdd.map(lambda x: x.split(','))
header = churn_rdd.first()
churn_rdd = churn_rdd.filter(lambda x: x != header)
```

### Step 2: Clean our churn data

```python
cleaned_churn_rdd = churn_rdd.map(lambda x: [x[1]] + [0 if z == 'no' else 1 for z in x[4:6]] + x[6:-1] + [0 if x[-1] == 'False.' else 1]) \
                             .map(lambda x: [float(z) for z in x])
```
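
To see what the cleaning step does, the same logic can be applied to a single row in plain Python. The helper name `clean_row` and the sample row are hypothetical. The row is shortened and only mimics the layout the lambda expects: a kept value at index 1, two yes/no flags at indices 4 and 5, numeric columns after that, and a `'True.'`/`'False.'` value last.

```python
def clean_row(x):
    """Same logic as the two map steps above, applied to one split CSV row."""
    selected = (
        [x[1]]                                      # column 1 kept as-is
        + [0 if z == 'no' else 1 for z in x[4:6]]   # two yes/no columns -> 0/1
        + x[6:-1]                                   # remaining feature columns
        + [0 if x[-1] == 'False.' else 1]           # last column -> 0/1 label
    )
    return [float(z) for z in selected]

# Hypothetical, shortened row in the shape produced by line.split(',')
row = ['KS', '128', '415', '382-4657', 'no', 'yes', '25', '265.1', '1', 'False.']
print(clean_row(row))  # [128.0, 0.0, 1.0, 25.0, 265.1, 1.0, 0.0]
```

Indices 0, 2, and 3 are dropped, and the label ends up as the last element of each cleaned row.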

### Step 3: Create a LabeledPoint object (because we're building a supervised model)
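
One possible sketch, assuming each cleaned row ends with the 0/1 churn label and every earlier value is a feature. The helper names `split_label` and `to_labeled_points` are illustrative, not part of Spark.

```python
def split_label(row):
    """Return (label, features) for a cleaned row whose last value is the label."""
    return row[-1], row[:-1]

def to_labeled_points(cleaned_rdd):
    """Map an RDD of cleaned rows to an RDD of LabeledPoint objects."""
    # Imported here so split_label can be tried without a Spark installation
    from pyspark.mllib.regression import LabeledPoint
    return cleaned_rdd.map(lambda row: LabeledPoint(*split_label(row)))

# Hypothetical cleaned row: features followed by a 0/1 churn label
print(split_label([128.0, 0.0, 1.0, 25.0, 0.0]))  # (0.0, [128.0, 0.0, 1.0, 25.0])
```

With the RDD from Step 2, this would be called as `to_labeled_points(cleaned_churn_rdd)`.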

### Step 4: Create a train test split

### Step 5: Run a Random Forest model to predict churn
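
A hedged sketch using `RandomForest.trainClassifier` from `pyspark.mllib.tree`. The hyperparameter values, the `FOREST_PARAMS` name, and the `train_churn_forest` helper are illustrative assumptions, not tuned or recommended settings.

```python
# Hyperparameters are illustrative assumptions, not tuned values
FOREST_PARAMS = {
    'numClasses': 2,                # churned vs. did not churn
    'categoricalFeaturesInfo': {},  # treat every feature as continuous
    'numTrees': 10,
    'maxDepth': 4,
    'seed': 42,
}

def train_churn_forest(train_rdd, params=FOREST_PARAMS):
    """Train a random forest classifier on an RDD of LabeledPoint objects."""
    # Imported inside the function so the settings can be inspected without Spark
    from pyspark.mllib.tree import RandomForest
    return RandomForest.trainClassifier(train_rdd, **params)
```

Here `train_rdd` is assumed to be the training portion of the LabeledPoint RDD.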

### Step 6: Let's make a prediction

### Step 7: In order to look at our performance, we need to combine our predictions with the actual labels
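
A sketch of one way to do this, assuming `model` is a trained MLlib model and `test_rdd` is an RDD of LabeledPoint objects. The helper name is hypothetical. In PySpark, tree-based models cannot be called inside an RDD transformation, so the sketch predicts on the whole feature RDD and zips the result with the labels.

```python
def predictions_and_labels(model, test_rdd):
    """Return an RDD of (prediction, label) pairs for an RDD of LabeledPoints."""
    # Predict on the whole feature RDD at once rather than inside a map,
    # then zip the predictions back together with the true labels
    predictions = model.predict(test_rdd.map(lambda point: point.features))
    labels = test_rdd.map(lambda point: point.label)
    return predictions.zip(labels)
```

The (prediction, label) order matches what the classes in `pyspark.mllib.evaluation` expect.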

### Step 8: Let's take a look at accuracy
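
Accuracy is the share of test observations whose prediction matches the label. A minimal sketch, assuming an RDD of (prediction, label) pairs; the helper name `accuracy` is illustrative.

```python
def accuracy(pred_and_labels):
    """Fraction of (prediction, label) pairs where the two values agree."""
    correct = pred_and_labels.filter(lambda pair: pair[0] == pair[1]).count()
    total = pred_and_labels.count()
    return correct / float(total)
```

For churn data, where one class is usually much rarer than the other, accuracy alone can look good even when few churners are caught.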

### Step 9: How about the Area under ROC?

### Practice #1: Build a logistic model to predict churn

### Practice #2: Build an SVM to predict churn