# Spark MLlib {background-color="black" background-image="https://blog.osservatori.net/hubfs/AI/machine-learning.jpg" background-size="80%" background-opacity="0.8"}

MLlib is Apache Spark's scalable machine learning library.

## Ease of Use

:::: {.columns}

::: {.fragment .column width="50%"}
***Usable in Java, Scala, Python, and R.***

MLlib fits into Spark's APIs and interoperates with NumPy in Python (as of Spark 0.9) and R libraries (as of Spark 1.5). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
::: 

::: {.fragment .column width="50%"}
```python
from pyspark.ml.clustering import KMeans

# Assumes an active SparkSession named `spark`
data = spark.read.format("libsvm")\
  .load("hdfs://...")

model = KMeans(k=10).fit(data)
```
:::
::::


## Performance
:::: {.columns}

::: {.fragment .column width="50%"}
***High-quality algorithms, up to 100x faster than MapReduce.***

Spark excels at iterative computation, enabling MLlib to run fast. At the same time, we care about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
::: 

::: {.fragment .column width="50%"}
![](https://spark.apache.org/images/logistic-regression.png)
:::
::::
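
The point about iteration can be illustrated with a minimal plain-Python sketch. It is not MLlib code: the dataset, the learning rate, and the helper names `log_loss` and `train` are assumptions made up for this slide. It fits a logistic regression by batch gradient descent, where every iteration is one full pass over the data, and compares the loss after one pass with the loss after many passes.

```python
import math

# Tiny, made-up dataset of (feature, label) pairs, purely for illustration.
DATA = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b):
    """Average logistic loss of the model (w, b) over DATA."""
    total = 0.0
    for x, y in DATA:
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(DATA)


def train(iterations, lr=0.5):
    """Batch gradient descent: each iteration is one full pass over DATA."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in DATA) / len(DATA)
        grad_b = sum(sigmoid(w * x + b) - y for x, y in DATA) / len(DATA)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


one_pass_loss = log_loss(*train(1))
many_pass_loss = log_loss(*train(100))
print(one_pass_loss, many_pass_loss)
```

Each extra pass lowers the loss on this toy data. With Spark, the dataset can stay cached in memory between passes, which is what makes this style of algorithm practical at scale.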

## Algorithms and Utilities
:::: {.columns}

::: {.fragment .column width="50%"}
*Algorithms*

* Classification: logistic regression, naive Bayes,...
* Regression: generalized linear regression, survival regression,...
* Decision trees, random forests, and gradient-boosted trees
* Recommendation: alternating least squares (ALS)
* Clustering: K-means, Gaussian mixtures (GMMs),...
* Topic modeling: latent Dirichlet allocation (LDA)
* Frequent itemsets, association rules, and sequential pattern mining
::: 

::: {.fragment .column width="50%"}
*Utilities*

* Feature transformations: standardization, normalization, hashing,...
* ML Pipeline construction
* Model evaluation and hyper-parameter tuning
* ML persistence: saving and loading models and Pipelines
* Distributed linear algebra: SVD, PCA,...
:::
::::

## [The DataFrame-based API is the primary API](https://spark.apache.org/docs/latest/ml-guide.html)

**The MLlib RDD-based API is now in maintenance mode.**

As of Spark 2.0, the [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode. The primary Machine Learning API for Spark is now the [DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html)-based API in the `spark.ml` package.

:::{.fragment}
_What are the implications?_

*   MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
*   MLlib will not add new features to the RDD-based API.
*   In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.
:::

:::{.fragment}
_Why is MLlib switching to the DataFrame-based API?_

*   DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
*   The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
*   DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the [Pipelines guide](https://spark.apache.org/docs/latest/ml-pipeline.html) for details.
::: 

:::{.fragment}
_What is “Spark ML”?_

*   “Spark ML” is not an official name, but it is occasionally used to refer to the MLlib DataFrame-based API. This is mainly due to the `org.apache.spark.ml` Scala package name used by the DataFrame-based API, and the “Spark ML Pipelines” term we used initially to emphasize the pipeline concept.
:::

## Data Types: Local vector

- A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. 
- MLlib supports two types of local vectors: dense and sparse. 
- A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. 
- For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.


## Basic Statistics

<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/2243907263472147/2956912205716139/latest.html>


## Pipelines
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/2243907263472164/2956912205716139/latest.html>

## Extracting, transforming and selecting features
<https://spark.apache.org/docs/latest/ml-features.html#extracting-transforming-and-selecting-features>

<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/1061986010927738/2956912205716139/latest.html> 

## MLlib - Data Sources - Classification and regression - Clustering
<https://spark.apache.org/docs/latest/ml-classification-regression.html#classification-and-regression>

<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1408031979081866/1061986010927738/2956912205716139/latest.html>

## Use Apache Spark MLlib on Databricks


<https://docs.databricks.com/en/machine-learning/train-model/mllib.html>

![](https://media1.tenor.com/images/257a13ee5e204efdca4bb135a8f75a2e/tenor.gif?itemid=16088629)

# Biblio

* https://spark.apache.org/mllib/
* https://spark.apache.org/docs/latest/ml-guide.html
* https://blog.osservatori.net/it_it/machine-learning-come-funziona-apprendimento-automatico
* https://towardsdatascience.com/hands-on-big-data-streaming-apache-spark-at-scale-fd89c15fa6b0
* https://towardsdatascience.com/apache-spark-mllib-tutorial-ec6f1cb336a9
* https://www.guru99.com/pyspark-tutorial.html
* https://towardsdatascience.com/sentiment-analysis-simplified-ac30720a5827
* http://web.cs.ucla.edu/~mtgarip/statistics.html
* https://towardsdatascience.com/machine-learning-with-pyspark-and-mllib-solving-a-binary-classification-problem-96396065d2aa
* https://runawayhorse001.github.io/LearningApacheSpark/index.html