# Spark MLlib

![](https://blog.osservatori.net/hubfs/AI/machine-learning.jpg)
[Osservatori.net](https://blog.osservatori.net/it_it/machine-learning-come-funziona-apprendimento-automatico)

<div class="jumbotron">
    <center>
        <b>MLlib</b> is Apache Spark's scalable machine learning library.
    </center>
</div>

## Ease of Use

***Usable in Java, Scala, Python, and R.***

MLlib fits into Spark's APIs and interoperates with NumPy in Python (as of Spark 0.9) and R libraries (as of Spark 1.5). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.

```python
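from pyspark.ml.clustering import KMeans

# Assumes an active SparkSession named `spark`
data = spark.read.format("libsvm")\
  .load("hdfs://...")

# Fit a k-means model with 10 clusters
model = KMeans(k=10).fit(data)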
```

## Performance

***High-quality algorithms, up to 100x faster than MapReduce.***

Spark excels at iterative computation, enabling MLlib to run fast. At the same time, we care about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.

![](https://spark.apache.org/images/logistic-regression.png)

## Algorithms and Utilities

*Algorithms*

* Classification: logistic regression, naive Bayes,...
* Regression: generalized linear regression, survival regression,...
* Decision trees, random forests, and gradient-boosted trees
* Recommendation: alternating least squares (ALS)
* Clustering: K-means, Gaussian mixtures (GMMs),...
* Topic modeling: latent Dirichlet allocation (LDA)
* Frequent itemsets, association rules, and sequential pattern mining


*Utilities*

* Feature transformations: standardization, normalization, hashing,...
* ML Pipeline construction
* Model evaluation and hyper-parameter tuning
* ML persistence: saving and loading models and Pipelines
* Distributed linear algebra: SVD, PCA,...

## Announcement: DataFrame-based API is primary API
https://spark.apache.org/docs/latest/ml-guide.html

The MLlib RDD-based API is now in maintenance mode.

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

* DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
* The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
* DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.

*What is "Spark ML"?*

"Spark ML" is not an official name, but it is occasionally used to refer to the MLlib DataFrame-based API.

## Highlights in 3.0

The list below highlights some of the new features and enhancements added to MLlib in the `3.0`
release of Spark:

* Multiple columns support was added to `Binarizer` ([SPARK-23578](https://issues.apache.org/jira/browse/SPARK-23578)), `StringIndexer` ([SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215)), `StopWordsRemover` ([SPARK-29808](https://issues.apache.org/jira/browse/SPARK-29808)) and PySpark `QuantileDiscretizer` ([SPARK-22796](https://issues.apache.org/jira/browse/SPARK-22796)).
* Tree-Based Feature Transformation was added
    ([SPARK-13677](https://issues.apache.org/jira/browse/SPARK-13677)).
* Two new evaluators `MultilabelClassificationEvaluator` ([SPARK-16692](https://issues.apache.org/jira/browse/SPARK-16692)) and `RankingEvaluator` ([SPARK-28045](https://issues.apache.org/jira/browse/SPARK-28045)) were added.
* Sample weights support was added in `DecisionTreeClassifier/Regressor` ([SPARK-19591](https://issues.apache.org/jira/browse/SPARK-19591)), `RandomForestClassifier/Regressor` ([SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478)), `GBTClassifier/Regressor` ([SPARK-9612](https://issues.apache.org/jira/browse/SPARK-9612)),  `MulticlassClassificationEvaluator` ([SPARK-24101](https://issues.apache.org/jira/browse/SPARK-24101)), `RegressionEvaluator` ([SPARK-24102](https://issues.apache.org/jira/browse/SPARK-24102)), `BinaryClassificationEvaluator` ([SPARK-24103](https://issues.apache.org/jira/browse/SPARK-24103)), `BisectingKMeans` ([SPARK-30351](https://issues.apache.org/jira/browse/SPARK-30351)), `KMeans` ([SPARK-29967](https://issues.apache.org/jira/browse/SPARK-29967)) and `GaussianMixture` ([SPARK-30102](https://issues.apache.org/jira/browse/SPARK-30102)).
* R API for `PowerIterationClustering` was added
    ([SPARK-19827](https://issues.apache.org/jira/browse/SPARK-19827)).
* Added Spark ML listener for tracking ML pipeline status
    ([SPARK-23674](https://issues.apache.org/jira/browse/SPARK-23674)).
* Fit with validation set was added to Gradient Boosted Trees in Python
    ([SPARK-24333](https://issues.apache.org/jira/browse/SPARK-24333)).
* [`RobustScaler`](https://spark.apache.org/docs/latest/ml-features.html#robustscaler) transformer was added
    ([SPARK-28399](https://issues.apache.org/jira/browse/SPARK-28399)).
* [`Factorization Machines`](https://spark.apache.org/docs/latest/ml-classification-regression.html#factorization-machines) classifier and regressor were added
    ([SPARK-29224](https://issues.apache.org/jira/browse/SPARK-29224)).
* Gaussian Naive Bayes Classifier ([SPARK-16872](https://issues.apache.org/jira/browse/SPARK-16872)) and Complement Naive Bayes Classifier ([SPARK-29942](https://issues.apache.org/jira/browse/SPARK-29942)) were added.
* ML function parity between Scala and Python
    ([SPARK-28958](https://issues.apache.org/jira/browse/SPARK-28958)).
* `predictRaw` is made public in all the Classification models. `predictProbability` is made public in all the Classification models except `LinearSVCModel`
    ([SPARK-30358](https://issues.apache.org/jira/browse/SPARK-30358)).

## Data Types

### Local vector
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. 

MLlib supports two types of local vectors: dense and sparse. 

A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. 

For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
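
The relationship between the two formats can be sketched in plain Python. This is only an illustration of the data layout, not MLlib code: the helper names `sparse_to_dense` and `dense_to_sparse` are hypothetical. In PySpark, the corresponding constructors are `Vectors.dense` and `Vectors.sparse` from `pyspark.ml.linalg`.

```python
from typing import List, Sequence, Tuple


def sparse_to_dense(
    size: int, indices: Sequence[int], values: Sequence[float]
) -> List[float]:
    """Expand a (size, indices, values) sparse triple into a dense list."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v  # indices are 0-based
    return dense


def dense_to_sparse(dense: Sequence[float]) -> Tuple[int, List[int], List[float]]:
    """Keep only the non-zero entries as two parallel lists."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


dense = [1.0, 0.0, 3.0]
sparse = dense_to_sparse(dense)
print(sparse)                    # (3, [0, 2], [1.0, 3.0])
print(sparse_to_dense(*sparse))  # [1.0, 0.0, 3.0]
```

The sparse form pays off when most entries are zero, since only the non-zero positions and their values are stored.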


# Basic Statistics
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/2243907263472147/2956912205716139/latest.html

# Pipelines
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/2243907263472164/2956912205716139/latest.html

## Extracting, transforming and selecting features
https://spark.apache.org/docs/latest/ml-features.html#extracting-transforming-and-selecting-features

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/2689304922466405/2956912205716139/latest.html

# Classification and regression
https://spark.apache.org/docs/latest/ml-classification-regression.html#classification-and-regression

## Pipelines
https://spark.apache.org/docs/latest/ml-pipeline.html

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/2243907263472164/2956912205716139/latest.html

# Sentiment Analysis

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/3917430877036588/2956912205716139/latest.html

# Bibliography

* https://spark.apache.org/mllib/
* https://spark.apache.org/docs/latest/ml-guide.html
* https://blog.osservatori.net/it_it/machine-learning-come-funziona-apprendimento-automatico
* https://towardsdatascience.com/hands-on-big-data-streaming-apache-spark-at-scale-fd89c15fa6b0
* https://towardsdatascience.com/apache-spark-mllib-tutorial-ec6f1cb336a9
* https://www.guru99.com/pyspark-tutorial.html
* https://towardsdatascience.com/sentiment-analysis-simplified-ac30720a5827
* http://web.cs.ucla.edu/~mtgarip/statistics.html
* https://towardsdatascience.com/machine-learning-with-pyspark-and-mllib-solving-a-binary-classification-problem-96396065d2aa
* https://runawayhorse001.github.io/LearningApacheSpark/index.html