<a href="https://colab.research.google.com/github/roberthsu2003/machine_learning/blob/main/%E4%BD%BF%E7%94%A8%E6%95%B8%E6%93%9A/%E5%A8%81%E6%96%AF%E5%BA%B7%E8%BE%9B%E5%B7%9E%E4%B9%B3%E7%99%8C%E6%95%B8%E6%93%9A%E9%9B%86_load_breast_cancer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
%pip install wget
%pip install mglearn

In [None]:
import wget
wget.download("https://github.com/roberthsu2003/machine_learning/raw/refs/heads/main/source_data/ChineseFont.ttf")


In [None]:
import matplotlib as mpl
from matplotlib.font_manager import fontManager

fontManager.addfont("ChineseFont.ttf")
mpl.rc('font', family="ChineseFont")

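## Example: Wisconsin Breast Cancer Dataset (Scikit-learn)

`load_breast_cancer` loads a classic real-world binary classification dataset that ships with Scikit-learn. The features are computed from digitized images of fine needle aspirates (FNA) of breast masses, and the goal is to predict whether a tumor is malignant or benign.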
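
As a minimal sketch of how the binary target is encoded, the snippet below pairs each integer label with its class name and decodes the first few samples. The variable names `label_map` and `first_five` are illustrative and do not come from the original notebook.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target holds integer labels; target_names maps each integer to a class name
label_map = {int(i): str(name) for i, name in enumerate(cancer.target_names)}
print(label_map)

# Decode the labels of the first five samples into readable class names
first_five = [label_map[int(t)] for t in cancer.target[:5]]
print(first_five)
```

In this encoding, label 0 is malignant and label 1 is benign, so metrics that treat 1 as the positive class will score benign detection unless told otherwise.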

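- **Use**: commonly used to test and compare the performance of classification algorithms.
- **Characteristics**:
    - 30 numeric features.
    - Two classes: malignant and benign.
    - The data is relatively clean (569 samples, no missing values) and needs little preprocessing beyond feature scaling for scale-sensitive models.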
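
To illustrate the points above, here is a rough sketch that checks the data for missing values and then compares two classifiers with 5-fold cross-validation. The choice of models, the scaling step, and the `models` / `results` names are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Confirm the "clean data" claim: 569 samples, 30 features, no NaN entries
has_missing = bool(np.isnan(X).any())
print("Shape:", X.shape, "| any missing values:", has_missing)

# Two illustrative models; logistic regression is scaled because the
# features span very different ranges (e.g. area vs. smoothness)
models = {
    "logistic regression (scaled)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 stratified folds for each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = float(scores.mean())
    print(f"{name}: mean accuracy = {results[name]:.3f}")
```

Because the scaler sits inside the pipeline, it is refit on the training folds only, which avoids leaking information from the held-out fold.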

### Wisconsin Breast Cancer dataset

In [None]:
import pandas as pd

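# Traditional Chinese names for the 30 features, in the same order as cancer.feature_names.
# The "error" features are standard errors, so they are labeled 標準誤 rather than 標準差 (standard deviation).
feature_names_zh = [
    '平均半徑', '平均紋理', '平均周長', '平均面積', '平均平滑度', '平均緊密度', '平均凹度', '平均凹點數', '平均對稱性', '平均分形維度',
    '半徑標準誤', '紋理標準誤', '周長標準誤', '面積標準誤', '平滑度標準誤', '緊密度標準誤', '凹度標準誤', '凹點數標準誤', '對稱性標準誤', '分形維度標準誤',
    '最差半徑', '最差紋理', '最差周長', '最差面積', '最差平滑度', '最差緊密度', '最差凹度', '最差凹點數', '最差對稱性', '最差分形維度'
]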
from sklearn.datasets import load_breast_cancer
import numpy as np  # needed for np.bincount

# Load the breast cancer dataset and print its basic information
cancer = load_breast_cancer()
print("cancer.keys():\n{}".format(cancer.keys()))
print("Shape of the breast cancer data: {}".format(cancer.data.shape))
# Convert NumPy scalars to plain str/int so the dict prints cleanly
print("Sample counts per class:\n{}".format(
    {str(n): int(v) for n, v in zip(cancer.target_names, np.bincount(cancer.target))}
))
print("Feature names:\n{}".format(cancer.feature_names))

# Build a DataFrame with the Chinese column names, plus a readable
# diagnosis column ('診斷結果' means "diagnosis")
df_cancer = pd.DataFrame(cancer.data, columns=feature_names_zh)
df_cancer['診斷結果'] = [cancer.target_names[t] for t in cancer.target]
df_cancer.head()

cancer.keys():
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
Shape of the breast cancer data: (569, 30)
Sample counts per class:
{'malignant': 212, 'benign': 357}
Feature names:
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


|  | 平均半徑 | 平均紋理 | 平均周長 | 平均面積 | 平均平滑度 | 平均緊密度 | 平均凹度 | 平均凹點數 | 平均對稱性 | 平均分形維度 | ... | 最差紋理 | 最差周長 | 最差面積 | 最差平滑度 | 最差緊密度 | 最差凹度 | 最差凹點數 | 最差對稱性 | 最差分形維度 | 診斷結果 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | malignant |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | malignant |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | malignant |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | malignant |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | malignant |
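
As an alternative to building the DataFrame by hand, the loader can return one directly. The sketch below assumes a Scikit-learn version that supports `as_frame=True` and keeps the original English column names; the `diagnosis` column name is chosen here for illustration only.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True makes the loader return pandas objects;
# .frame holds the 30 feature columns plus an integer 'target' column
cancer_bunch = load_breast_cancer(as_frame=True)
frame = cancer_bunch.frame.copy()

# Add a readable label column next to the integer target
name_by_label = {i: str(n) for i, n in enumerate(cancer_bunch.target_names)}
frame["diagnosis"] = frame["target"].map(name_by_label)

print(frame.shape)
print(frame[["mean radius", "mean area", "diagnosis"]].head())
```

This route avoids maintaining a separate list of column names, at the cost of keeping the headers in English.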
