Goal: train a classification model that predicts a wine's grower from its chemical analysis.

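We use an SVM classifier. Because there are more than two classes (three here), a multiclass strategy is needed; this notebook relies on `SVC`'s default `decision_function_shape='ovr'` (one-vs-rest). Note that scikit-learn's `SVC` always trains one-vs-one classifiers internally, and the `'ovr'` setting only converts the decision function to a one-vs-rest shape.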
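For comparison, an explicit one-vs-rest scheme trains one binary classifier per class. The following is a minimal sketch, assuming scikit-learn's `OneVsRestClassifier` wrapper is acceptable for illustration; `X_demo`, `y_demo`, and `ovr_model` are names introduced here and are not part of the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative names only; the notebook itself uses X_pd / y_pd.
X_demo, y_demo = load_wine(return_X_y=True)

# Explicit one-vs-rest: the wrapper fits one binary SVC per class.
ovr_model = make_pipeline(StandardScaler(), OneVsRestClassifier(SVC(C=0.1)))
ovr_model.fit(X_demo, y_demo)

# The last pipeline step holds the fitted binary classifiers.
n_binary_classifiers = len(ovr_model[-1].estimators_)
print(n_binary_classifiers)  # 3 classes -> 3 binary classifiers
```

With three classes, one-vs-one also needs three pairwise classifiers, so the two schemes cost about the same here; the difference grows with the number of classes.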

# 1. Load the dataset

In [1]:
from sklearn.datasets import load_wine
wine_data = load_wine(as_frame=True)
X_pd = wine_data["data"]
y_pd = wine_data["target"]

In [2]:
X_pd.describe()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 746.893258 |
| std | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 314.907474 |
| min | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 25% | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 |
| 50% | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 |
| 75% | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 |
| max | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |


In [3]:
X_pd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
dtypes: fl

In [4]:
y_pd.info()

<class 'pandas.core.series.Series'>
RangeIndex: 178 entries, 0 to 177
Series name: target
Non-Null Count  Dtype
--------------  -----
178 non-null    int32
dtypes: int32(1)
memory usage: 840.0 bytes


In [5]:
y_pd.value_counts()

target
1    71
0    59
2    48
Name: count, dtype: int64

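Observations:
- The feature matrix X has no missing values --> no imputation strategy is needed
- All features in X are float64 --> no categorical encoding is needed
- The features in X differ widely in scale (proline has a mean of about 747, nonflavanoid_phenols about 0.36) --> standardize
- The label vector y has no missing values either --> no imputation is needed
- With only 178 samples, the model might underfit --> polynomial feature expansion is an option, although on so few samples the extra features also raise the risk of overfitting (left out for now, to first see how well the plain model performs)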
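The observations above were read off the `describe()` and `info()` output by eye. As a hedged sketch, the same checks can be expressed in code; the variable names (`X_check`, `y_check`, `std_ratio`, and so on) are introduced here for illustration and do not appear in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)
X_check, y_check = wine["data"], wine["target"]

# Missing values in features and labels.
n_missing_X = int(X_check.isna().sum().sum())
n_missing_y = int(y_check.isna().sum())

# Are all feature columns floating point (so no categorical encoding is needed)?
all_float = all(pd.api.types.is_float_dtype(t) for t in X_check.dtypes)

# Ratio between the largest and smallest per-feature standard deviation;
# a large ratio suggests the features should be standardized.
feature_std = X_check.std()
std_ratio = feature_std.max() / feature_std.min()

print(n_missing_X, n_missing_y, all_float, round(std_ratio))
```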

# 2. Split the dataset

In [6]:
from sklearn.model_selection import train_test_split
X, y = X_pd.to_numpy(), y_pd.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
X_train.shape, y_train.shape

((142, 13), (142,))
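The three classes are not perfectly balanced (71 / 59 / 48), so a plain random split can leave the small test set with skewed class proportions. One optional variation, not used in this notebook, is a stratified split. A minimal sketch, with illustrative variable names:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X_all, y_all = load_wine(return_X_y=True)

# stratify=y_all keeps the class proportions roughly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

train_counts = np.bincount(y_tr)  # samples per class in the training set
test_counts = np.bincount(y_te)   # samples per class in the test set
print(X_tr.shape, X_te.shape)
print(train_counts, test_counts)
```

The subset sizes are the same as in the unstratified split (142 training and 36 test samples); only the class composition of each subset changes.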

# 3. Fit an SVM classifier

In [8]:
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
svc_model = make_pipeline(
    StandardScaler(),
    SVC(C=0.1))             # multiclass problem: default decision_function_shape='ovr' (one-vs-rest)
svc_model.fit(X_train, y_train)
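The observations in section 1 mention polynomial features as an option if the plain model turns out to underfit. As a hedged sketch of what that variant could look like (it is not the model used in this notebook), a `PolynomialFeatures` step can be placed before the scaler. The names `poly_svc_model`, `X_tr`, and so on are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

poly_svc_model = make_pipeline(
    # Degree-2 expansion; the constant bias column is dropped because
    # it carries no information after standardization.
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SVC(C=0.1),
)
poly_svc_model.fit(X_tr, y_tr)

# 13 linear + 13 squared + 78 pairwise interaction terms.
n_expanded = poly_svc_model[0].n_output_features_
print(n_expanded)
print(poly_svc_model.score(X_te, y_te))
```

Expanding 13 features to 104 on only 142 training samples makes overfitting more likely, so this variant is worth trying only if the plain pipeline scores poorly.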

In [9]:
from sklearn.metrics import accuracy_score
accuracy_rate = accuracy_score(y_test, svc_model.predict(X_test))
print(f"Prediction accuracy: {accuracy_rate * 100:.3f} %")

Prediction accuracy: 100.000 %
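A perfect score on a 36-sample test set is encouraging, but it depends on one particular split. As a hedged complement, cross-validation gives an estimate averaged over several splits. The sketch below assumes 5 stratified folds, a choice made here rather than anything prescribed by the notebook; `cv_model` and `cv_scores` are illustrative names.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_wine(return_X_y=True)

# Same pipeline as above; the scaler is re-fitted inside each fold,
# so no information leaks from the validation part of a fold.
cv_model = make_pipeline(StandardScaler(), SVC(C=0.1))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_model, X_all, y_all, cv=cv)

print(f"Cross-validated accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```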
