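Using the cork stoppers data set, build a decision tree model with the C4.5 algorithm (implemented by hand, without a library).

Requirements:
1. Randomly select the training and test sets.
2. Build the decision tree model and evaluate it (confusion matrix, recall, precision, F1 score).
3. Build a decision tree with the CART algorithm (a library may be used), compare it with the C4.5 result, and assess the strengths and weaknesses of the two algorithms.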

Measurements taken on binary images of cork stopper defects,
used to assess stopper quality:

- ART:    Total area of the defects (in pixels)
- N:      Total number of defects
- PRT:    Total perimeter of the defects (in pixels)
- ARM:    Average area of the defects (in pixels) = ART/N
- PRM:    Average perimeter of the defects (in pixels) = PRT/N
- ARTG:   Total area of big defects (in pixels)
- NG:     Number of big defects (bigger than a specified threshold)
- PRTG:   Total perimeter of big defects (in pixels)
- RAAR:   Area ratio of the defects = ARTG/ART
- RAN:    Ratio of the number of defects = NG/N

Source: A. Campilho, Engineering Faculty, Porto University.
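
The derived columns can be checked against the first data row of the sheet (ART=81, N=41, PRT=250, ARTG=9, NG=1). The sketch below is only a sanity check. It assumes that RAAR and RAN are stored as percentages, which is an inference from that row (11.11 and 2.44) and not something the source states.

```python
# First data row of the sheet: ART, N, PRT, ARTG, NG
art, n, prt, artg, ng = 81.0, 41.0, 250.0, 9.0, 1.0

arm = art / n            # average defect area
prm = prt / n            # average defect perimeter
raar = 100 * artg / art  # area ratio, ASSUMED to be stored as a percentage
ran = 100 * ng / n       # count ratio, ASSUMED to be stored as a percentage

derived = {name: round(value, 2) for name, value in
           [("ARM", arm), ("PRM", prm), ("RAAR", raar), ("RAN", ran)]}
print(derived)
```

Under that assumption the four values agree with the ARM, PRM, RAAR and RAN entries of the first row.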

In [23]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


In [24]:
data = pd.read_excel('DATASETS_CorkStoppers.xls', sheet_name='Data')
data.head()


Unnamed: 0,#,C,ART,N,PRT,ARM,PRM,ARTG,NG,PRTG,RAAR,RAN
0,,,,,,,,,,,,
1,1.0,1.0,81.0,41.0,250.0,1.98,6.1,9.0,1.0,12.0,11.11,2.44
2,2.0,1.0,80.0,42.0,238.0,1.91,5.67,0.0,0.0,0.0,0.0,0.0
3,3.0,1.0,81.0,26.0,196.0,3.12,7.54,9.8,1.8,15.0,12.04,6.73
4,4.0,1.0,125.0,63.0,368.0,1.98,5.84,20.0,1.0,18.0,16.0,1.59


In [25]:
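# Data preprocessing
data = data.dropna()  # drop rows with missing values (including the empty first row)
data = data.drop_duplicates()  # drop duplicate rows
data = data.drop(['#'], axis=1)  # drop the sample-number column, which is not a feature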
data.head()


Unnamed: 0,C,ART,N,PRT,ARM,PRM,ARTG,NG,PRTG,RAAR,RAN
1,1.0,81.0,41.0,250.0,1.98,6.1,9.0,1.0,12.0,11.11,2.44
2,1.0,80.0,42.0,238.0,1.91,5.67,0.0,0.0,0.0,0.0,0.0
3,1.0,81.0,26.0,196.0,3.12,7.54,9.8,1.8,15.0,12.04,6.73
4,1.0,125.0,63.0,368.0,1.98,5.84,20.0,1.0,18.0,16.0,1.59
5,1.0,146.0,45.0,350.0,3.24,7.78,42.8,2.8,43.0,29.28,6.11


In [26]:
# Split into training and test sets (70% / 30%)
y = data['C']
X = data.drop(['C'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
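
The task asks for a hand-written C4.5, whereas the notebook's models come from sklearn. As a starting point, here is a minimal pure-Python sketch of the core C4.5 idea: binary splits on continuous attributes, chosen by gain ratio. It is a simplification, not full C4.5. It selects both the attribute and the threshold by gain ratio and omits pruning and missing-value handling. All function names, the `max_depth` parameter and the demo data are hypothetical and do not come from the notebook.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(column, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [y for x, y in zip(column, labels) if x <= threshold]
    right = [y for x, y in zip(column, labels) if x > threshold]
    if not left or not right:
        return 0.0
    w = len(left) / len(labels)
    gain = entropy(labels) - w * entropy(left) - (1 - w) * entropy(right)
    split_info = -(w * math.log2(w) + (1 - w) * math.log2(1 - w))
    return gain / split_info

def best_split(rows, labels):
    """Return (gain ratio, feature index, threshold) of the best split."""
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        values = sorted(set(column))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            gr = gain_ratio(column, labels, t)
            if gr > best[0]:
                best = (gr, j, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=5):
    """Grow the tree recursively; a leaf is the majority class label."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    _, j, t = best_split(rows, labels)
    if j is None:  # no split has a positive gain ratio
        return majority
    left = [i for i, row in enumerate(rows) if row[j] <= t]
    right = [i for i, row in enumerate(rows) if row[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([rows[i] for i in left],
                           [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right],
                            [labels[i] for i in right], depth + 1, max_depth),
    }

def predict_one(tree, row):
    """Walk from the root to a leaf for a single sample."""
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree

def predict(tree, rows):
    return [predict_one(tree, row) for row in rows]

# Tiny made-up data set (NOT the cork data) to exercise the functions.
rows_demo = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0],
             [7.0, 5.5], [8.0, 4.5], [9.0, 6.5]]
labels_demo = [1, 1, 1, 2, 2, 2]
tree_demo = build_tree(rows_demo, labels_demo)
print(tree_demo)
print(predict(tree_demo, rows_demo))
```

On the notebook's data the call would be along the lines of `build_tree(X_train.values.tolist(), y_train.tolist())`, with the predictions from `predict` passed to `confusion_matrix` and `classification_report` in the same way as the sklearn models.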


In [27]:
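# Entropy-criterion decision tree.
# Note: sklearn's DecisionTreeClassifier is an optimized CART; criterion='entropy'
# only changes the impurity measure, so this is neither ID3 nor a hand-written C4.5.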
id3 = DecisionTreeClassifier(criterion='entropy')
id3.fit(X_train, y_train)
y_pred_id3 = id3.predict(X_test)

print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_id3))
print('Classification Report:\n', classification_report(y_test, y_pred_id3))


Confusion Matrix:
 [[11  5  0]
 [ 3 11  4]
 [ 0  0 11]]
Classification Report:
               precision    recall  f1-score   support

         1.0       0.79      0.69      0.73        16
         2.0       0.69      0.61      0.65        18
         3.0       0.73      1.00      0.85        11

    accuracy                           0.73        45
   macro avg       0.74      0.77      0.74        45
weighted avg       0.73      0.73      0.73        45
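
The per-class figures in the report follow directly from the confusion matrix printed above, whose rows are true classes and whose columns are predicted classes. A minimal sketch of that arithmetic, using the printed matrix as input (the variable names are illustrative):

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[11, 5, 0],
      [3, 11, 4],
      [0, 0, 11]]
classes = [1.0, 2.0, 3.0]

report = {}
for k, label in enumerate(classes):
    tp = cm[k][k]                             # correctly predicted as class k
    predicted_k = sum(row[k] for row in cm)   # column sum: all predicted as k
    actual_k = sum(cm[k])                     # row sum: all truly in class k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2))

accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(report)
print(round(accuracy, 2))
```

For class 1.0, for instance, precision is 11/14 and recall is 11/16, which round to the 0.79 and 0.69 shown in the report.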



In [28]:
# CART decision tree (library allowed): sklearn default, criterion='gini'
cart = DecisionTreeClassifier()
cart.fit(X_train, y_train)
y_pred_cart = cart.predict(X_test)

print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_cart))
print('Classification Report:\n', classification_report(y_test, y_pred_cart))


Confusion Matrix:
 [[11  5  0]
 [ 3 11  4]
 [ 0  0 11]]
Classification Report:
               precision    recall  f1-score   support

         1.0       0.79      0.69      0.73        16
         2.0       0.69      0.61      0.65        18
         3.0       0.73      1.00      0.85        11

    accuracy                           0.73        45
   macro avg       0.74      0.77      0.74        45
weighted avg       0.73      0.73      0.73        45

