# [Scikit-learn](https://scikit-learn.org/stable/)
---

- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license

## Example dataset: Iris
---
The iris dataset is a classic and very easy multi-class classification dataset.

| Feature name | Unit / values |
|---|---|
| sepal length | cm |
| sepal width | cm |
| petal length | cm |
| petal width | cm |
| class (target) | Setosa, Versicolour, Virginica |


![iris](https://thegoodpython.com/assets/images/iris-species.png)

## Load datasets
---
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

In [3]:
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
Y = iris.target
target_names = iris.target_names

In [4]:
X.shape

(150, 4)
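
To connect the loaded arrays back to the table of features and classes above, the following minimal sketch prints the column names, the class labels, and the number of samples per class. The variable name `class_counts` is introduced here only for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Names of the four feature columns and of the three target classes
print(iris.feature_names)
print(list(iris.target_names))

# Number of samples belonging to each class (targets are encoded as 0, 1, 2)
class_counts = np.bincount(iris.target)
print(class_counts)  # [50 50 50]
```

The dataset is balanced: each of the three species contributes 50 of the 150 rows.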

## Train/Test Split

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

X_tr, X_ts, Y_tr, Y_ts = train_test_split(X, Y, test_size=0.3)

print(X_tr.shape, Y_tr.shape)
print(X_ts.shape, Y_ts.shape)

(105, 4) (105,)
(45, 4) (45,)
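
The split above is reshuffled on every run, so downstream scores vary from run to run. As a hedged variation (not what the cell above does), passing `random_state` makes the split reproducible and `stratify=Y` keeps the class proportions identical in both parts. The seed value `0` is an arbitrary choice for this sketch.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, Y = datasets.load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=Y preserves the 1:1:1 class ratio
X_tr, X_ts, Y_tr, Y_ts = train_test_split(
    X, Y, test_size=0.3, random_state=0, stratify=Y
)

print(X_tr.shape, X_ts.shape)   # (105, 4) (45, 4)
print(np.bincount(Y_tr))        # [35 35 35]
print(np.bincount(Y_ts))        # [15 15 15]
```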


## RandomForest
---
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn.ensemble.RandomForestClassifier

```python
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
```

### Cross-validation and grid search for classification models
---
Hyperparameters of the RandomForest model

1. n_estimators : The number of trees in the forest.
2. criterion : The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.
3. max_depth : The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
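
As a minimal sketch of how these three hyperparameters are passed in practice, the example below builds one forest with hand-picked values and scores it with 5-fold cross-validation. The specific values (`50`, `"entropy"`, `3`) and the seed are arbitrary assumptions chosen for illustration, not recommended settings.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, Y = datasets.load_iris(return_X_y=True)

# One fixed configuration of the three hyperparameters listed above
rf = RandomForestClassifier(
    n_estimators=50,      # number of trees in the forest
    criterion="entropy",  # split quality measured by information gain
    max_depth=3,          # each tree is at most 3 levels deep
    random_state=0,       # fixed seed so the run is repeatable
)

# 5-fold cross-validation: one accuracy value per fold
scores = cross_val_score(rf, X, Y, cv=5)
print(scores)
print(scores.mean())
```

Grid search automates exactly this loop: it repeats the cross-validation for every combination of candidate values and keeps the best one.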

### GridSearchCV
---
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
```python
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
```

In [10]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
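
# Candidate values for each hyperparameter; every combination is evaluated.
# 'sqrt' is used instead of 'auto': for classifiers both mean sqrt(n_features),
# and 'auto' has been removed from recent scikit-learn releases.
param_grid = {
    'n_estimators': [10, 50, 100, 150],
    'max_features': ['sqrt', 'log2'],
    'max_depth': np.arange(2, 10),
    'criterion': ['gini', 'entropy']
}
rf = RandomForestClassifier()
rf_cv = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)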
rf_cv.fit(X_tr, Y_tr)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [11]:
rf_cv.best_estimator_

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=3, max_features='log2',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
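
Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_`, `best_score_`, and the full `cv_results_` dictionary. The sketch below shows how they might be inspected; it uses a deliberately small, hypothetical grid (`small_grid`) and fixed seeds so that it runs quickly, and is not the grid searched above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, Y = datasets.load_iris(return_X_y=True)
X_tr, X_ts, Y_tr, Y_ts = train_test_split(X, Y, test_size=0.3, random_state=0)

# Hypothetical 2 x 2 grid, kept small on purpose
small_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
search.fit(X_tr, Y_tr)

# Winning combination and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# All candidates, best first, with their mean validation accuracy
order = np.argsort(search.cv_results_["rank_test_score"])
for i in order:
    params = search.cv_results_["params"][i]
    mean_score = search.cv_results_["mean_test_score"][i]
    print(params, round(mean_score, 4))
```

Note that `best_score_` is the mean accuracy over the validation folds of the training set, which is a different quantity from the test-set score.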

In [12]:
rf_cv.score(X_ts, Y_ts)

0.9333333333333333
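
A single accuracy number does not show which classes are confused with each other. As a hedged illustration, the sketch below trains one forest with fixed, assumed hyperparameters (rather than rerunning the grid search) and prints a confusion matrix and a per-class report for the held-out test set.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_ts, Y_tr, Y_ts = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Assumed hyperparameters, for illustration only
clf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X_tr, Y_tr)
Y_pred = clf.predict(X_ts)

# Rows are true classes, columns are predicted classes
labels = [0, 1, 2]
cm = confusion_matrix(Y_ts, Y_pred, labels=labels)
print(cm)

# Precision, recall and F1 score per class
print(classification_report(Y_ts, Y_pred, labels=labels,
                            target_names=iris.target_names))
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy returned by `score`.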

## Digits dataset

In [14]:
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

X = digits.data
Y = digits.target

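# Split the dataset into training (70%) and test (30%) parts
X_tr, X_ts, Y_tr, Y_ts = train_test_split(X, Y, test_size=0.3)

# 'sqrt' is used instead of 'auto': for classifiers both mean sqrt(n_features),
# and 'auto' has been removed from recent scikit-learn releases.
param_grid = {
    'n_estimators': [10, 50, 100, 150],
    'max_features': ['sqrt', 'log2'],
    'max_depth': np.arange(2, 50, 2),
    'criterion': ['gini', 'entropy']
}
rf = RandomForestClassifier()
rf_cv = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)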
rf_cv.fit(X_tr, Y_tr)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,...
                                              random_state=None, verbose=0,
                                   

In [15]:
rf_cv.score(X_ts, Y_ts)

0.9611111111111111
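
The digits grid contains 4 × 2 × 24 × 2 = 384 combinations, which at 5 folds means 1,920 model fits before the final refit. When that is too slow, one possible alternative (a different technique from the exhaustive search used above) is `RandomizedSearchCV`, which samples a fixed number of combinations from the same candidate values. The budget `n_iter=10` and the seeds below are arbitrary assumptions for this sketch.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, Y = datasets.load_digits(return_X_y=True)
X_tr, X_ts, Y_tr, Y_ts = train_test_split(X, Y, test_size=0.3, random_state=0)

# Same candidate values as the grid, but only a random subset is tried
param_distributions = {
    "n_estimators": [10, 50, 100, 150],
    "max_features": ["sqrt", "log2"],
    "max_depth": list(range(2, 50, 2)),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # evaluate 10 sampled combinations instead of all 384
    cv=5,
    random_state=0,   # fixed seed so the same combinations are sampled
)
search.fit(X_tr, Y_tr)

print(search.best_params_)
print(search.score(X_ts, Y_ts))
```

A random search may miss the single best combination, so it trades a small risk in accuracy for a large reduction in training time.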