# scikit-learn

scikit-learn (lowercase) is the premier Python machine learning library.

https://scikit-learn.org/

See also:
- https://github.com/phausamann/sklearn-xarray
- https://github.com/nbren12/sklearn-xarray
- https://github.com/dask/dask-ml/


## Imports

In [None]:
import numpy as np

## scikit-learn Tutorials

- https://github.com/jakevdp/sklearn_tutorial/


## sklearn Overview

Feature arrays in scikit-learn are expected to be 2-D with shape `(n_samples, n_features)`, and input is typically cast to `np.float64`.
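A minimal sketch of the shape convention, assuming `StandardScaler` as a stand-in for any estimator (the array `x` is made up for illustration). A 1-D array is rejected, while reshaping it into a single-feature column works and the result comes back as float64:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.arange(5)  # 1-D integer array, shape (5,)

# Passing a 1-D array where a 2-D feature array is expected raises ValueError.
try:
    StandardScaler().fit(x)
    raised = False
except ValueError:
    raised = True

# Reshape to (n_samples, n_features) with a single feature column.
X = x.reshape(-1, 1)
X_scaled = StandardScaler().fit_transform(X)

print(raised, X.shape, X_scaled.dtype)
```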

In scikit-learn, an estimator for classification is a Python object that implements the methods `fit(X, y)` and `predict(T)`.

Fit the estimator instance to the training data using the `fit()` method, then call `predict()` on new samples.
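The fit-then-predict pattern can be sketched as follows. The classifier choice (`KNeighborsClassifier`) and the toy data are assumptions picked only to keep the example small:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: 6 samples, 2 features, two well-separated classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3)  # parameters are set at construction
clf.fit(X, y)                              # learn from the training data

# New samples must have the same number of features as X.
T = np.array([[0.05, 0.05], [1.0, 0.95]])
y_pred = clf.predict(T)
print(y_pred)
```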


Estimators have parameters (hyperparameters) that are normally set when creating an instance. They are stored as attributes of the same name and can be read with `get_params()` and changed with `set_params()`.
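A short sketch of the three ways to reach estimator parameters, using `PCA` as the example estimator (the parameter values here are arbitrary):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2, whiten=True)

# Constructor arguments are stored as attributes of the same name.
initial_n_components = pca.n_components

# get_params() returns all parameters as a dict.
params = pca.get_params()

# set_params() updates parameters in place and returns the estimator itself.
pca.set_params(n_components=3)

print(initial_n_components, params["whiten"], pca.n_components)
```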

`PCA` has a `transform()` method but no `predict()` method, which makes it a transformer. See [5. Dataset transformations](https://scikit-learn.org/stable/data_transforms.html). In scikit-learn terms, any object with `fit()` is an estimator; a transformer is an estimator that also implements `transform()`, and a predictor is an estimator that also implements `predict()`. So a transformer is still an estimator even without `predict()`. `transform()` is used for dimensionality reduction and preprocessing, and some unsupervised learning methods provide it as well.
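One way to check which role an object plays is to inspect its base classes and methods. This is only a sketch; `KNeighborsClassifier` is an arbitrary choice of predictor for comparison:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pca = PCA()
knn = KNeighborsClassifier()

# Both objects are estimators.
both_estimators = isinstance(pca, BaseEstimator) and isinstance(knn, BaseEstimator)

# PCA is a transformer: it has transform() but no predict().
pca_is_transformer = isinstance(pca, TransformerMixin)
pca_has_predict = hasattr(pca, "predict")

# The classifier is a predictor: it has predict() but no transform().
knn_has_predict = hasattr(knn, "predict")
knn_has_transform = hasattr(knn, "transform")

print(both_estimators, pca_is_transformer, pca_has_predict,
      knn_has_predict, knn_has_transform)
```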


## scikit-learn modules

### sklearn.decomposition: Matrix Decomposition
[sklearn.decomposition](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition)

- [2.5. Decomposing signals in components (matrix factorization problems)](https://scikit-learn.org/stable/modules/decomposition.html)

Algorithms include:
- `IncrementalPCA`
- `KernelPCA`
- `PCA`
- `TruncatedSVD`---does not center the data before computing the SVD (unlike `PCA`)


#### PCA

Parameters include:
- `n_components`
- `whiten`
- `svd_solver`---options include `auto` (the default), `full`, `arpack`, and `randomized`

Attributes include:
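- `components_`---principal axes (eigenvectors of the covariance matrix), one per row, sorted by decreasing explained variance
- `explained_variance_`---variance explained by each component (the sorted eigenvalues)
- `explained_variance_ratio_`---fraction of the total variance explained by each component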

Methods include:
- `fit()`
- `fit_transform()`
- `get_covariance()`
- `transform()`
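The link between these attributes and an eigen-decomposition of the sample covariance matrix can be checked numerically. This is a sketch on synthetic data (the mixing matrix is made up), and eigenvectors are compared up to sign because their sign is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing  # correlated synthetic features

pca = PCA(n_components=3).fit(X)

# Eigen-decomposition of the sample covariance, sorted by decreasing eigenvalue.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

same_values = np.allclose(pca.explained_variance_, eigvals)
same_vectors = np.allclose(np.abs(pca.components_), np.abs(eigvecs.T), atol=1e-6)
ratio_total = pca.explained_variance_ratio_.sum()      # 1 when all components are kept
same_cov = np.allclose(pca.get_covariance(), cov)      # holds when no component is dropped

print(same_values, same_vectors, ratio_total, same_cov)
```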

If `copy=False`, data passed to `fit()` are overwritten and running `fit(X).transform(X)` will not yield the expected results; use `fit_transform(X)` instead.

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html

Useful resources:
- https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html


### sklearn.preprocessing: Preprocessing and Normalization

The [sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) module

- [5.3 Preprocessing data](https://scikit-learn.org/stable/modules/preprocessing.html)

Includes:
- `StandardScaler`---standardize features by removing the mean and scaling to unit variance (divide by standard deviation); use like `StandardScaler().fit_transform(X)`. Compare to `scale()`. Works best when features are roughly normally distributed.
- `Normalizer`---rescale each sample to unit norm (L2 by default). Compare to `normalize()`. Applied to rows/observations, not columns/features.
- `MinMaxScaler`---rescale each feature to a given range, `[0, 1]` by default.
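The difference between column-wise and row-wise scaling is easiest to see on a tiny array. A minimal sketch with made-up data, using default settings for all three classes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

X_std = StandardScaler().fit_transform(X)  # per column: zero mean, unit variance
X_unit = Normalizer().fit_transform(X)     # per row: unit L2 norm
X_01 = MinMaxScaler().fit_transform(X)     # per column: rescaled to [0, 1]

print(X_std.mean(axis=0), X_std.std(axis=0))
print(np.linalg.norm(X_unit, axis=1))
print(X_01.min(axis=0), X_01.max(axis=0))
```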


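So here is the question: how do you project data when PCA was run on standardized data? Do you transform the standardized data or the original? Because the PCA was fit on standardized data, it expects standardized input, so you project the standardized data (scaled with the same fitted `StandardScaler`).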

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
X_pca = PCA().fit_transform(X_std)
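For projecting new data, one option is to keep the fitted scaler and PCA together so the same scaling is always applied first. The sketch below uses synthetic data and `make_pipeline` from `sklearn.pipeline`; the variable names and `n_components=2` are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0, 1000.0])  # features on very different scales
X = rng.normal(size=(100, 4)) * scales
X_new = rng.normal(size=(5, 4)) * scales

# Manual route: fit both steps on X, then push new data through both.
scaler = StandardScaler().fit(X)
pca = PCA(n_components=2).fit(scaler.transform(X))
Z_manual = pca.transform(scaler.transform(X_new))

# Pipeline route: the same two steps bundled into one object.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
Z_pipe = pipe.transform(X_new)

# Map the projection back to the original feature units (lossy with 2 of 4 components).
X_back = scaler.inverse_transform(pca.inverse_transform(Z_manual))

print(np.allclose(Z_manual, Z_pipe), X_back.shape)
```

The pipeline avoids accidentally passing unscaled data to `PCA.transform()`.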

### sklearn.neighbors: Nearest Neighbors

- [1.6 Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html)
- [2.8 Density Estimation](https://scikit-learn.org/stable/modules/density.html)

Includes:
- `KernelDensity`
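`KernelDensity` follows the same estimator pattern: `fit()` on samples, then `score_samples()` to evaluate the log-density at new points. A minimal sketch on synthetic 1-D data, where the Gaussian kernel and the bandwidth of 0.5 are arbitrary assumptions:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(500, 1))  # 500 samples, 1 feature

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# Evaluate on a grid; score_samples() returns the log of the estimated density.
grid = np.linspace(-4, 4, 9).reshape(-1, 1)
log_density = kde.score_samples(grid)
density = np.exp(log_density)

print(density)
```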
