# Scikit-learn Tutorial

scikit-learn documentation: https://scikit-learn.org/dev/index.html

In [1]:
import matplotlib.pyplot as plt
import numpy as np

# Dataset & Data Splitting


In [2]:
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

In [4]:
import pandas as pd

# Use a separate name so the pandas module (pd) is not overwritten
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [5]:
print('Shape of X: ', X.shape)
print('Shape of y: ', y.shape)
print('\nlabels: ', iris.target_names)
print('y:\n', y)
print('first 5 samples: \n', X[:5])

Shape of X:  (150, 4)
Shape of y:  (150,)

labels:  ['setosa' 'versicolor' 'virginica']
y:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
first 5 samples: 
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


In [None]:
from sklearn.model_selection import train_test_split
# split the data with 50% in each set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.5,
                                                    random_state=123)

print("Labels for training data: \n", y_train)
print("\nLabels for testing data: \n", y_test)

Labels for training data: 
 [1 1 0 2 2 0 0 1 1 2 0 0 1 0 1 2 0 2 0 0 1 0 0 1 2 1 1 1 0 0 1 2 0 0 1 1 1
 2 1 1 1 2 0 0 1 2 2 2 2 0 1 0 1 1 0 1 2 1 2 2 0 1 0 2 2 1 1 2 2 1 0 1 1 2
 2]

Labels for testing data: 
 [1 2 2 1 0 2 1 0 0 1 2 0 1 2 2 2 0 0 1 0 0 2 0 2 0 0 0 2 2 0 2 2 0 0 1 1 2
 0 0 1 1 0 2 2 2 2 2 1 0 0 2 0 0 1 1 1 1 2 1 2 0 2 1 0 0 2 1 2 2 0 1 1 2 0
 2]


In [None]:
print('Percentage of each of the three classes in each split:')
print('All:', np.bincount(y) / float(len(y)) * 100.0)
print('Training:', np.bincount(y_train) / float(len(y_train)) * 100.0)
print('Test:', np.bincount(y_test) / float(len(y_test)) * 100.0)

Percentage of each of the three classes in each split:
All: [33.33333333 33.33333333 33.33333333]
Training: [30.66666667 40.         29.33333333]
Test: [36.         26.66666667 37.33333333]


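## Tip: Stratified Split
A stratified split (stratified sampling) is a way of splitting data into training and test sets that ensures each class (label) keeps the same proportion in both sets.

In a standard split, samples are assigned to the training and test sets at random, which can leave some classes unevenly represented after the split, especially when the class distribution is imbalanced. A stratified split keeps the proportion of each class consistent across the training and test sets, so that both training and evaluation reflect the overall composition of the dataset more faithfully.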

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.5,
                                                    random_state=123,
                                                    stratify=y)

print('All:', np.bincount(y) / float(len(y)) * 100.0)
print('Training:', np.bincount(y_train) / float(len(y_train)) * 100.0)
print('Test:', np.bincount(y_test) / float(len(y_test)) * 100.0)

All: [33.33333333 33.33333333 33.33333333]
Training: [33.33333333 33.33333333 33.33333333]
Test: [33.33333333 33.33333333 33.33333333]
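The effect is easier to see on an imbalanced label set. The following is a minimal sketch using made-up labels (90 samples of class 0 and 10 of class 1); the names `X_imb` and `y_imb` are hypothetical and not part of the tutorial data. It suggests that `stratify` keeps the 9:1 ratio in both halves of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 samples of class 0, 10 of class 1.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(len(y_imb)).reshape(-1, 1)  # placeholder feature column

# Stratified 50/50 split: each half should keep the 9:1 class ratio.
_, _, y_imb_train, y_imb_test = train_test_split(
    X_imb, y_imb, train_size=0.5, random_state=0, stratify=y_imb
)

train_counts = np.bincount(y_imb_train, minlength=2)
test_counts = np.bincount(y_imb_test, minlength=2)
print("Training class counts:", train_counts)
print("Test class counts:", test_counts)
```

Without `stratify`, the minority class count in each half would depend on the random seed.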


---

# Basic Classification Models

## 1. Decision Tree [(link)](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- max_depth(int, default=None):
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

In [None]:
from sklearn.tree import DecisionTreeClassifier
# Create the model
dtree = DecisionTreeClassifier(max_depth=None)

# Fit the model on the training subset
dtree.fit(X_train, y_train)

# Predict the class labels of the testing subset
prediction = dtree.predict(X_test)
print("Predicted labels: ", prediction)
print("True labels: ", y_test)

# Calculate accuracy
accuracy = dtree.score(X_test, y_test)
print('Accuracy of testing dataset: ', accuracy)

Predicted labels:  [0 2 1 0 2 0 1 2 0 0 2 1 2 0 1 2 2 2 2 2 1 2 1 1 2 2 0 0 1 0 0 2 0 1 0 0 1
 1 2 2 0 1 0 1 2 2 0 1 1 2 0 1 1 2 1 0 0 1 1 0 2 1 0 2 0 2 1 1 2 0 2 1 0 0
 1]
True labels:  [0 2 1 0 2 0 1 2 0 0 2 1 2 0 1 2 2 2 2 2 1 2 1 1 2 2 0 0 1 0 0 2 0 1 0 0 1
 1 2 2 0 1 0 1 1 2 0 1 1 1 0 2 2 2 1 0 0 1 1 0 2 1 0 2 0 2 1 1 2 0 2 1 0 0
 1]
Accuracy of testing dataset:  0.9466666666666667
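To make the role of `max_depth` more concrete, the sketch below fits one unrestricted tree and one depth-limited tree on the same stratified split and prints their depth, leaf count, and test accuracy. The value `max_depth=2` and the `random_state=0` seed are illustrative assumptions, not settings from the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=123, stratify=y
)

# Unrestricted tree: keeps splitting until no further split is possible.
full_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
# Depth-limited tree: at most two levels of splits (illustrative value).
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

for name, tree in [("max_depth=None", full_tree), ("max_depth=2", shallow_tree)]:
    print(name,
          "| depth:", tree.get_depth(),
          "| leaves:", tree.get_n_leaves(),
          "| test accuracy:", tree.score(X_test, y_test))
```

A shallower tree is easier to interpret and less prone to overfitting, at the possible cost of some accuracy.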


## 2. Random Forest [(link)](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

  *   n_estimators: the number of trees in the forest, default=100
  *   max_depth: the maximum depth of the tree, default=None

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Create the model
RF = RandomForestClassifier(n_estimators=100)

# Fit the model on the training subset
RF.fit(X_train, y_train)

# Predict the class labels of the testing subset
prediction = RF.predict(X_test)
print("Predicted labels: ", prediction)
print("True labels: ", y_test)

# Calculate accuracy
accuracy = RF.score(X_test, y_test)
print('Accuracy of testing dataset: ', accuracy)

Predicted labels:  [0 2 1 0 2 0 1 2 0 0 2 1 2 0 1 2 2 2 2 2 1 2 1 1 2 2 0 0 1 0 0 2 0 1 0 0 1
 1 2 2 0 1 0 1 2 2 0 1 1 2 0 1 1 1 1 0 0 1 1 0 1 1 0 2 0 2 1 1 2 0 2 1 0 0
 1]
True labels:  [0 2 1 0 2 0 1 2 0 0 2 1 2 0 1 2 2 2 2 2 1 2 1 1 2 2 0 0 1 0 0 2 0 1 0 0 1
 1 2 2 0 1 0 1 1 2 0 1 1 1 0 2 2 2 1 0 0 1 1 0 2 1 0 2 0 2 1 1 2 0 2 1 0 0
 1]
Accuracy of testing dataset:  0.92
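Because a random forest is randomized, its accuracy can differ slightly between runs unless a seed is fixed. The sketch below (the `random_state=0` seed is an assumption added for reproducibility) fits a forest, confirms that `n_estimators` controls the number of fitted trees, and prints the per-feature importances.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.5, random_state=123, stratify=iris.target
)

# Fixing random_state makes the forest (and its score) reproducible.
rf = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0)
rf.fit(X_train, y_train)

# One fitted decision tree per estimator.
print("Number of trees:", len(rf.estimators_))

# Impurity-based importances, normalized to sum to 1.
for name, importance in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {importance:.3f}")

print("Test accuracy:", rf.score(X_test, y_test))
```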


## 3. Logistic regression [(link)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
  *   penalty: Used to specify the norm used in the penalization, {l1, l2, elasticnet, None}, default=’l2’

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=1000, penalty=None)
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)
print("Predicted labels: ", prediction)
print("True labels: ", y_test)

accuracy = classifier.score(X_test, y_test)
print('Accuracy: ', accuracy)

Predicted labels:  [0 2 1 0 2 0 1 2 0 0 2 1 2 0 1 2 2 2 2 2 1 2 1 1 2 2 0 0 1 0 0 2 0 1 0 0 1
 1 2 2 0 1 0 1 2 2 0 1 1 1 0 1 1 2 1 0 0 1 1 0 2 1 0 2 0 2 1 1 2 0 2 1 0 0
 1]
True labels:  [0 2 1 0 2 0 1 2 0 0 2 1 2 0 1 2 2 2 2 2 1 2 1 1 2 2 0 0 1 0 0 2 0 1 0 0 1
 1 2 2 0 1 0 1 1 2 0 1 1 1 0 2 2 2 1 0 0 1 1 0 2 1 0 2 0 2 1 1 2 0 2 1 0 0
 1]
Accuracy:  0.96


In [None]:
classifier.score(X_train, y_train)

1.0
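A training accuracy of 1.0 alongside a lower test accuracy suggests the unpenalized model fits the training set very closely. As a rough illustration of what regularization does, the sketch below keeps the default `l2` penalty and varies `C`, the inverse regularization strength. `C` is not discussed above, and the three values tried here are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=123, stratify=y
)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Default penalty is l2; smaller C means stronger regularization.
    clf = LogisticRegression(C=C, max_iter=5000)
    clf.fit(X_train, y_train)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: coefficient norm={coef_norms[C]:.3f}, "
          f"train acc={clf.score(X_train, y_train):.3f}, "
          f"test acc={clf.score(X_test, y_test):.3f}")
```

Stronger regularization (smaller `C`) shrinks the coefficients, which usually trades some training accuracy for a smoother decision boundary.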

## 4. Support Vector Machine [(link)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
  *   gamma: kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’, {scale, auto} or float, default=’scale’
  *   C: regularization parameter, the strength of the regularization is inversely proportional to C, default=1.0
  *   kernel: specifies the kernel type to be used in the algorithm, {linear, poly, rbf, sigmoid, precomputed}, default=’rbf’

In [None]:
from sklearn import svm
# Create the model
svc = svm.SVC(gamma=0.001, C=100., kernel='rbf')

# Fit the model on the training subset
svc.fit(X_train, y_train)

# Predict the class labels of the testing subset
prediction = svc.predict(X_test)
print("Predicted labels: ", prediction)
print("True labels: ", y_test)

# Calculate accuracy
accuracy = svc.score(X_test, y_test)
print('Accuracy of testing dataset: ', accuracy)

Predicted labels:  [0 2 1 0 2 0 1 2 0 0 2 1 2 0 1 2 2 2 2 2 1 2 1 1 2 2 0 0 1 0 0 2 0 1 0 0 1
 1 2 2 0 1 0 1 2 2 0 1 1 2 0 2 2 2 1 0 0 1 1 0 1 1 0 2 0 2 1 1 2 0 2 1 0 0
 1]
True labels:  [0 2 1 0 2 0 1 2 0 0 2 1 2 0 1 2 2 2 2 2 1 2 1 1 2 2 0 0 1 0 0 2 0 1 0 0 1
 1 2 2 0 1 0 1 1 2 0 1 1 1 0 2 2 2 1 0 0 1 1 0 2 1 0 2 0 2 1 1 2 0 2 1 0 0
 1]
Accuracy of testing dataset:  0.96
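The `kernel` parameter can change the result noticeably. The sketch below tries the four built-in kernels with the default `gamma='scale'` and `C=1.0`. These settings are assumptions chosen for comparison and differ from the `gamma=0.001, C=100.` used above.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=123, stratify=y
)

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # Same C and gamma for every kernel so only the kernel type varies.
    clf = svm.SVC(kernel=kernel, gamma="scale", C=1.0)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(f"{kernel}: test accuracy = {scores[kernel]:.3f}")
```

In practice, `C`, `gamma`, and `kernel` are usually tuned together, for example with a cross-validated grid search.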


## 5. XGBoost [(link)](https://xgboost.readthedocs.io/en/stable/parameter.html)
  *   n_estimators: number of boosting rounds
  *   learning_rate: boosting learning rate
  *   max_depth: maximum tree depth for base learners
  *   objective: specify the learning task and the corresponding learning objective or a custom objective function to be used


In [None]:
! pip install xgboost



In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Initialize the XGBoost model
model = XGBClassifier(n_estimators=100, learning_rate=0.3, max_depth=6)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(model.score(X_test, y_test))


Accuracy: 0.95
0.9466666666666667
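To see how the number of boosting rounds affects accuracy, the sketch below tracks test accuracy after each round. It deliberately swaps in scikit-learn's `GradientBoostingClassifier` rather than XGBoost, so that it runs without extra packages. The two libraries implement gradient boosting differently, so the numbers are not expected to match the XGBoost result above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=123, stratify=y
)

# Stand-in for XGBClassifier, using the same three parameter values.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.3, max_depth=6, random_state=0
)
gb.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting round.
staged_acc = [accuracy_score(y_test, y_stage) for y_stage in gb.staged_predict(X_test)]

for n_rounds in (1, 10, 50, 100):
    print(f"after {n_rounds:3d} rounds: test accuracy = {staged_acc[n_rounds - 1]:.3f}")
```

A smaller `learning_rate` generally needs more rounds to reach the same accuracy, which is why the two parameters are usually tuned together.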


## 6. CatBoost [(link)](https://catboost.ai/en/docs/concepts/tutorials)
  *   iterations: the maximum number of trees that can be built when solving machine learning problems, default=1000
  *   learning_rate: the learning rate, used for reducing the gradient step
  *   depth: depth of the tree, default=6
  *   loss_function: the metric to use in training, the specified value also determines the machine learning problem to solve

In [None]:
! pip install catboost

Collecting catboost
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.5


In [None]:
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# Initialize the CatBoost model
model = CatBoostClassifier(iterations=1000, depth=6, learning_rate=0.1, loss_function='MultiClass')

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

0:	learn: 0.9883182	total: 46.8ms	remaining: 46.8s
1:	learn: 0.9308413	total: 47.9ms	remaining: 23.9s
2:	learn: 0.8593279	total: 48.9ms	remaining: 16.2s
3:	learn: 0.7904196	total: 50.6ms	remaining: 12.6s
4:	learn: 0.7388238	total: 52.4ms	remaining: 10.4s
5:	learn: 0.6774368	total: 56ms	remaining: 9.28s
6:	learn: 0.6455785	total: 59.9ms	remaining: 8.5s
7:	learn: 0.6046704	total: 62.2ms	remaining: 7.71s
8:	learn: 0.5677353	total: 66.4ms	remaining: 7.32s
9:	learn: 0.5304439	total: 71.4ms	remaining: 7.07s
10:	learn: 0.4984478	total: 76.3ms	remaining: 6.86s
11:	learn: 0.4768248	total: 81ms	remaining: 6.67s
12:	learn: 0.4528372	total: 88.8ms	remaining: 6.74s
13:	learn: 0.4223327	total: 90.3ms	remaining: 6.36s
14:	learn: 0.3961940	total: 93.6ms	remaining: 6.15s
15:	learn: 0.3709223	total: 94ms	remaining: 5.78s
16:	learn: 0.3498890	total: 98.2ms	remaining: 5.68s
17:	learn: 0.3317499	total: 102ms	remaining: 5.59s
18:	learn: 0.3143794	total: 103ms	remaining: 5.33s
19:	learn: 0.3005645	total: 104

## 7. LightGBM [(link)](https://lightgbm.readthedocs.io/en/stable/index.html)

In [None]:
! pip install lightgbm



In [None]:
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

# Initialize the LightGBM model
model = LGBMClassifier(n_estimators=1000, learning_rate=0.05, num_leaves=31)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000231 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 69
[LightGBM] [Info] Number of data points in the train set: 75, number of used features: 4
[LightGBM] [Info] Start training from score -1.098612
[LightGBM] [Info] Start training from score -1.098612
[LightGBM] [Info] Start training from score -1.098612
Accuracy: 0.92
