### Scikit-learn API training #1

* __Feature__: a general attribute of the dataset, i.e. every attribute other than the target value.
  
* __Label, Class, Target__: the ground-truth answer supplied for training in supervised learning. In classification, this value is called the label or class.
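
As a small illustration of the two terms, the sketch below separates the iris dataset (the same toy dataset used later in this notebook) into a feature matrix and a target vector. The variable names `X` and `y` are only a common convention, not something the notebook itself defines.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # features: one row per sample, one column per attribute
y = iris.target    # target: one integer class label per sample

print(X.shape)             # (150, 4)
print(y.shape)             # (150,)
print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # class names that the integer labels refer to
```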

#### Scikit-learn framework
* __Estimator__  
    - Training method: __fit( )__
    - Prediction method: __predict( )__
    - Estimators are divided into two kinds: __Classifier__ and __Regressor__
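
The shared `fit( )` / `predict( )` interface can be sketched with one classifier and one regressor. This is only an illustrative example: using `DecisionTreeRegressor` and predicting petal width from the other three measurements are assumptions made here for demonstration, not part of the original walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classifier: the target is a discrete class label (the iris species).
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
class_pred = clf.predict(X[:3])

# Regressor: the target is a continuous value. Purely for illustration,
# petal width (column 3) is predicted from the other three measurements.
reg = DecisionTreeRegressor(random_state=0)
reg.fit(X[:, :3], X[:, 3])
width_pred = reg.predict(X[:3, :3])

print(class_pred)   # integer class labels
print(width_pred)   # floating-point estimates
```

Both estimators are trained and queried with the same two method calls; only the type of target differs.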

#### Iris dataset classification: prediction process

* __Split the dataset__: divide the dataset into a training set and a test set
* __Model training__: fit the model on the training set
* __Prediction__: use the trained model to predict the target values of the test set
* __Evaluation__: compare the predictions with the actual target values of the test set to measure the model's performance

In [1]:
import sklearn # import ML package 'scikit-learn'

In [2]:
from sklearn.datasets import load_iris # load iris dataset (Toy dataset)
from sklearn.tree import DecisionTreeClassifier # decision tree (for classification)
from sklearn.model_selection import train_test_split # import the data split method

In [3]:
import pandas as pd

iris = load_iris() # Bunch object holding the data, targets, and metadata
iris_data = iris.data # feature values only (NumPy array)

iris_label = iris.target # target values: integer-encoded class labels
print('target values: \n',iris_label)
print('target columns: \n',iris.target_names) # class names for labels 0, 1, 2

target values: 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
target columns: 
 ['setosa' 'versicolor' 'virginica']


In [4]:
iris_df = pd.DataFrame(data = iris_data, columns = iris.feature_names)
iris_df.head() # show the values of the dataset

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [5]:
# split iris dataset
X_train,X_test,y_train,y_test = train_test_split(iris_data,iris_label,
                                                 test_size = 0.2,random_state = 1)
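
To make the effect of `test_size = 0.2` concrete, the sketch below repeats the same split and inspects the resulting sizes. The `stratify` variant is an optional extra that this notebook does not use; it is shown only as an assumption about what one might try when class balance in the test set matters.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split parameters as in this notebook: 20% of the 150 samples
# go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(np.bincount(y_test))           # test samples per class, not necessarily equal

# Optional variant (not used in this notebook): stratify=y keeps the class
# proportions the same in both parts, so each class gets 10 test samples.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
print(np.bincount(y_test_strat))     # [10 10 10]
```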

In [6]:
# model training
DT_df = DecisionTreeClassifier(random_state = 1) # create the decision tree classifier
DT_df.fit(X_train,y_train) # train the model on the training set

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=1, splitter='best')

In [7]:
pred = DT_df.predict(X_test) # predict the class of each test sample

In [8]:
pred # show the predicted class labels

array([0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       2, 0, 2, 1, 0, 0, 1, 2])

In [9]:
# model evaluation
from sklearn.metrics import accuracy_score
print('prediction accuracy: {0:.4f}'.format(accuracy_score(y_test,pred)))

prediction accuracy: 0.9667
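
Accuracy is the fraction of test samples whose predicted label matches the true label. With 30 test samples, the 0.9667 reported above corresponds to 29 correct predictions. The following sketch recomputes the score by hand and compares it with `accuracy_score`; it rebuilds the model so that it runs on its own, and the exact count may differ across scikit-learn versions. The names `model`, `n_correct`, and `manual_accuracy` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
pred = model.predict(X_test)

# Count the matches between predictions and true labels, then divide
# by the number of test samples.
n_correct = int(np.sum(pred == y_test))
manual_accuracy = n_correct / len(y_test)

print(n_correct, 'of', len(y_test), 'correct')
print(np.isclose(manual_accuracy, accuracy_score(y_test, pred)))
```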
