# scikit-learn
Documentation: https://scikit-learn.org/stable/

scikit-learn, also known as sklearn, provides a variety of machine learning methods and example datasets. Its main areas are:

- ### Classification
- ### Regression
- ### Clustering
- ### Dimensionality reduction
- ### Model Selection
- ### Preprocessing
------
## Regression
1. SVR 
2. Ridge regression
3. Lasso
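
As a rough illustration of the three regressors listed above, the sketch below fits each one and reports its $R^2$ on held-out data. The diabetes dataset, the 80/20 split, the variable names, and the hyperparameters (`C=1.0`, `alpha=1.0`, `alpha=0.1`) are assumptions chosen for the example, not tuned values.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Small regression dataset bundled with scikit-learn (assumed here for illustration)
X_reg, y_reg = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=0
)

# Hyperparameters are illustrative defaults, not tuned values
regressors = {
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0)),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

r2_scores = {}
for name, reg in regressors.items():
    reg.fit(Xr_train, yr_train)
    r2_scores[name] = reg.score(Xr_test, yr_test)  # R^2 on the held-out split
    print(f"{name}: R^2 = {r2_scores[name]:.3f}")
```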

## Clustering
1. k-means
2. spectral clustering
3. mean-shift
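
The three clustering methods above share the same `fit_predict` interface. The sketch below runs them on synthetic blobs; the data (`make_blobs` with three centers), the variable names, and the choice of `n_clusters=3` are assumptions made for the example. Note that mean-shift infers the number of clusters itself, so it may not agree with the other two.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift, SpectralClustering
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (assumed for illustration)
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_blobs)
spectral_labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X_blobs)
meanshift_labels = MeanShift().fit_predict(X_blobs)  # cluster count is estimated from the data

for name, labels in [
    ("k-means", kmeans_labels),
    ("spectral clustering", spectral_labels),
    ("mean-shift", meanshift_labels),
]:
    print(f"{name}: {len(np.unique(labels))} clusters")
```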

## Dimensionality reduction
1. PCA
2. feature selection
3. non-negative matrix factorization
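
As a minimal sketch of dimensionality reduction, the code below projects the 30 features of the breast cancer dataset (bundled with scikit-learn) onto two principal components. Standardizing first and keeping two components are assumptions made for the example; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_bc, _ = load_breast_cancer(return_X_y=True)

# PCA is sensitive to feature scale, so standardize each feature first
X_bc_scaled = StandardScaler().fit_transform(X_bc)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_bc_scaled)

print("Reduced shape:", X_2d.shape)
print("Explained variance ratio per component:", pca.explained_variance_ratio_)
```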

## Model Selection
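
A minimal sketch of two common model selection tools, cross-validation and grid search, is given below. The estimator (a scaled logistic regression), the 5-fold setting, and the grid of `C` values are assumptions picked for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_ms, y_ms = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy of a single configuration
cv_scores = cross_val_score(pipe, X_ms, y_ms, cv=5)
print("Mean CV accuracy:", cv_scores.mean())

# Grid search over the regularization strength C (illustrative grid)
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_ms, y_ms)
print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)
```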

## Preprocessing
1. pre-processing
2. feature selection


## Classification
An example that compares different classifiers:

![Classifier comparison](https://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png)

All classification examples below use the breast cancer dataset bundled with scikit-learn.

In [48]:
import numpy as np
from sklearn.datasets import load_breast_cancer
X,y = load_breast_cancer(return_X_y = True)
print('Number of instances:{} \nNumber of features:{}'.format(X.shape[0],X.shape[1]))
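# class counts, ordered by label (0 = malignant, 1 = benign)
_, class_counts = np.unique(y, return_counts=True)
print('Note the two classes are not balanced, their ratio is {}:{}'.format(class_counts[0], class_counts[1]))
# random train/test partition: each sample lands in the training set with probability 0.8
# (no seed is set, so the split and the scores below change from run to run)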
msk = np.random.rand(len(X))<0.8
X_train = X[msk]
X_test = X[~msk]
y_train = y[msk]
y_test = y[~msk]

Number of instances:569 
Number of features:30
Note the two classes are not balanced, their ratio is 212:357
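
Because the classes are imbalanced and the random mask above is unseeded, a reproducible, stratified split may be preferable. The sketch below uses `train_test_split`; the names `X_tr`, `X_te`, `y_tr`, `y_te` and the seed `random_state=0` are assumptions, chosen so that the variables used in the following cells are not overwritten.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# stratify=y_all keeps the class ratio (roughly) equal in both partitions;
# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=0
)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts:", np.bincount(y_te))
```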


------
### Classifier: SVM

In [49]:
from sklearn import svm
clf = svm.SVC(gamma='auto').fit(X_train,y_train)
clf.score(X_test,y_test)


0.6982758620689655
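
The SVM score above is noticeably lower than the other classifiers. One plausible cause is feature scale: the 30 features have very different ranges, and an RBF kernel with `gamma='auto'` is sensitive to that. The sketch below compares the same classifier with and without standardization; the stratified split, the seed, and the variable names are assumptions, so the numbers will not match the cell above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_svm, y_svm = load_breast_cancer(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_svm, y_svm, test_size=0.2, stratify=y_svm, random_state=0
)

# Same classifier as above, trained on the raw features
raw_svm = SVC(gamma='auto').fit(Xs_train, ys_train)
raw_score = raw_svm.score(Xs_test, ys_test)

# Standardize each feature to zero mean and unit variance before the SVM
scaled_svm = make_pipeline(StandardScaler(), SVC(gamma='auto')).fit(Xs_train, ys_train)
scaled_score = scaled_svm.score(Xs_test, ys_test)

print("Accuracy without scaling:", raw_score)
print("Accuracy with scaling:", scaled_score)
```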

------
### Classifier: Random Forest 

In [59]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators = 50).fit(X_train,y_train)
clf.score(X_test,y_test)

0.9396551724137931

------
### Classifier: Gaussian Process
Reference: Gaussian Processes for Machine Learning (GPML) by Rasmussen and Williams: http://www.gaussianprocess.org/gpml/chapters/RW.pdf


In [56]:
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
kernel = 1.0*RBF(1.0)
gpc = GaussianProcessClassifier(kernel = kernel).fit(X_train,y_train)
gpc.score(X_test,y_test)

0.9482758620689655

------
### Classifier: Logistic Regression
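Logistic regression is a binary classification method. The sigmoid function maps $(-\infty, \infty)$ to $(0,1)$, which is the range required of a probability. Logistic regression therefore models the log-odds of the positive class as a linear function of the input $x$:
\begin{equation}
\log\Big( \frac{p(x)}{1-p(x)} \Big) = Wx + b
\end{equation}

Solving for $p(x)$ gives the sigmoid of the linear score:

\begin{equation}
p(x) = \frac{1}{1+\exp(-(Wx + b))}
\end{equation}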

The optimal coefficients can be found by maximum likelihood estimation. For labels $y_i \in \{0, 1\}$, the log-likelihood is:

\begin{equation}
\mathcal{L} = \sum_{i=1}^N \Big[ y_i \log p(x_i) + (1-y_i)\log(1-p(x_i)) \Big]
\end{equation}

To find the maximum likelihood estimates, we set the gradient to zero, $\frac{\partial\mathcal{L}}{\partial\theta}=0$, where $\theta = (W, b)$ collects the parameters. This equation is transcendental and has no closed-form solution, so it is usually solved with a numerical method.
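
One simple numerical method is gradient ascent on the log-likelihood, whose gradient is $\frac{\partial\mathcal{L}}{\partial W} = \sum_i (y_i - p(x_i))\,x_i$ and $\frac{\partial\mathcal{L}}{\partial b} = \sum_i (y_i - p(x_i))$. The sketch below is a minimal NumPy illustration on synthetic data, not how scikit-learn solves the problem internally; the toy data, the learning rate, and the iteration count are all assumptions.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, b, X, y):
    """Bernoulli log-likelihood; eps guards against log(0)."""
    p = sigmoid(X @ w + b)
    eps = 1e-12
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Synthetic data drawn from a known logistic model (assumed for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y_toy = (rng.random(200) < sigmoid(X_toy @ true_w + true_b)).astype(float)

w, b = np.zeros(2), 0.0
learning_rate = 0.1
ll_start = log_likelihood(w, b, X_toy, y_toy)

for _ in range(500):
    residual = y_toy - sigmoid(X_toy @ w + b)              # y_i - p(x_i)
    w += learning_rate * X_toy.T @ residual / len(y_toy)   # averaged gradient w.r.t. W
    b += learning_rate * residual.mean()                   # averaged gradient w.r.t. b

ll_end = log_likelihood(w, b, X_toy, y_toy)
print("Log-likelihood before and after:", ll_start, ll_end)
print("Estimated W and b:", w, b)
```

The gradient is averaged over the samples here only to make the step size independent of the dataset size; the maximizer is the same.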

#### Example

In [54]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', multi_class='multinomial').fit(X_train,y_train)
clf.score(X_test,y_test)



0.9396551724137931
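
To connect the fitted model back to the sigmoid formula, the sketch below recomputes the predicted probabilities by hand from `coef_` and `intercept_` and compares them with `predict_proba`. It assumes a binary `LogisticRegression` with default settings (not the `multi_class='multinomial'` option used above) and standardized features so that the solver converges; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_lr, y_lr = load_breast_cancer(return_X_y=True)
X_lr = StandardScaler().fit_transform(X_lr)

lr = LogisticRegression(max_iter=1000).fit(X_lr, y_lr)

# Linear score Wx + b for every sample, then the sigmoid
z = X_lr @ lr.coef_.ravel() + lr.intercept_[0]
manual_p = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
sklearn_p = lr.predict_proba(X_lr)[:, 1]

print("Manual and library probabilities agree:", np.allclose(manual_p, sklearn_p))
```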