<a href="https://colab.research.google.com/github/roberthsu2003/machine_learning/blob/main/%E9%82%8F%E8%BC%AF%E8%BF%B4%E6%AD%B8/cancer%E8%AA%AA%E6%98%8E3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=e2d0fac4f936ea9c102e4b85891c575604fcf63567d3166e94dd90a6502d2502
  Stored in directory: /root/.cache/pip/wheels/40/b3/0f/a40dbd1c6861731779f62cc4babcb234387e11d697df70ee97
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


### Inspecting the Wisconsin breast cancer dataset

In [10]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the dataset
cancer = load_breast_cancer()

# Dictionary mapping the English feature names to Chinese labels
feature_names_zh = {
    'mean radius': '平均半徑',
    'mean texture': '平均紋理',
    'mean perimeter': '平均周長',
    'mean area': '平均面積',
    'mean smoothness': '平均平滑度',
    'mean compactness': '平均緊密度',
    'mean concavity': '平均凹度',
    'mean concave points': '平均凹點',
    'mean symmetry': '平均對稱性',
    'mean fractal dimension': '平均分形維度',
    'radius error': '半徑誤差',
    'texture error': '紋理誤差',
    'perimeter error': '周長誤差',
    'area error': '面積誤差',
    'smoothness error': '平滑度誤差',
    'compactness error': '緊密度誤差',
    'concavity error': '凹度誤差',
    'concave points error': '凹點誤差',
    'symmetry error': '對稱性誤差',
    'fractal dimension error': '分形維度誤差',
    'worst radius': '最差半徑',
    'worst texture': '最差紋理',
    'worst perimeter': '最差周長',
    'worst area': '最差面積',
    'worst smoothness': '最差平滑度',
    'worst compactness': '最差緊密度',
    'worst concavity': '最差凹度',
    'worst concave points': '最差凹點',
    'worst symmetry': '最差對稱性',
    'worst fractal dimension': '最差分形維度'
}

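# Build the DataFrame; each column is named "<Chinese label>特徵<index>" (特徵 means "feature")
df = pd.DataFrame(cancer.data, columns=[f"{v}特徵{i}" for i, v in enumerate(feature_names_zh.values())])

# Add the target column
df['label'] = cancer.target

# Show the first few rows of the DataFrame
display(df.head())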

,平均半徑特徵0,平均紋理特徵1,平均周長特徵2,平均面積特徵3,平均平滑度特徵4,平均緊密度特徵5,平均凹度特徵6,平均凹點特徵7,平均對稱性特徵8,平均分形維度特徵9,...,最差紋理特徵21,最差周長特徵22,最差面積特徵23,最差平滑度特徵24,最差緊密度特徵25,最差凹度特徵26,最差凹點特徵27,最差對稱性特徵28,最差分形維度特徵29,label
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0
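
The first five rows above all carry label 0, so it helps to check what the labels mean and how the classes are distributed. A minimal sketch using only attributes of the object returned by `load_breast_cancer()`; the variable name `class_counts` is introduced here for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Number of samples and number of features
print(cancer.data.shape)

# target_names[i] is the meaning of label i (label 0 is 'malignant', label 1 is 'benign')
print(list(cancer.target_names))

# Count how many samples belong to each class
class_counts = {
    str(name): int(count)
    for name, count in zip(cancer.target_names, np.bincount(cancer.target))
}
print(class_counts)
```

The classes are not balanced, which is one reason the later cells pass `stratify=cancer.target` to `train_test_split`.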


In [2]:
import wget

wget.download("https://github.com/roberthsu2003/machine_learning/raw/refs/heads/main/source_data/ChineseFont.ttf")

'ChineseFont.ttf'

In [3]:
import matplotlib as mpl
from matplotlib.font_manager import fontManager
fontManager.addfont("ChineseFont.ttf")
mpl.rc('font', family="ChineseFont")

## Using the Wisconsin breast cancer dataset, which has more features

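## Notes
What does max_iter mean?
1. The optimization process
   - Logistic regression is fitted with an iterative optimization algorithm that searches for the best model parameters (gradient descent is one example; scikit-learn's default solver is `lbfgs`)
   - Each iteration is one step of adjusting the model parameters
   - The goal is to minimize the loss function (usually log loss)
2. Use of the full dataset
   - With a full-batch solver such as the default `lbfgs`, every iteration uses the entire training set
   - That is, all training samples contribute to each parameter update
3. Stopping conditions
   - The algorithm stops early once it converges (further iterations improve the solution by less than the tolerance `tol`)
   - If it reaches the max_iter limit (1000 in the code in this notebook; the scikit-learn default is 100) without converging, it is forced to stop and scikit-learn emits a `ConvergenceWarning`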
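
The stopping behaviour described above can be observed directly. The following is a rough sketch, not part of the original notebook: the helper `fit_and_report` and the deliberately tiny limit of 2 iterations are assumptions chosen for illustration. It reads the fitted model's `n_iter_` attribute and checks whether a `ConvergenceWarning` was raised.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
X_train_scaled = StandardScaler().fit_transform(X_train)


def fit_and_report(max_iter):
    """Hypothetical helper: return (iterations used, ConvergenceWarning raised?)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(max_iter=max_iter).fit(X_train_scaled, y_train)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(model.n_iter_[0]), hit_limit


# Compare a very small iteration budget with the notebook's max_iter=1000
results = {max_iter: fit_and_report(max_iter) for max_iter in (2, 1000)}
for max_iter, (n_iter, hit_limit) in results.items():
    print(f"max_iter={max_iter}: iterations used={n_iter}, hit the limit={hit_limit}")
```

With the larger budget the solver should stop well before 1000 iterations, which shows that `max_iter` is an upper bound rather than a fixed number of steps.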

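### `fit_transform()`
- This method **performs two steps at once**:
  1. **fit**: compute the statistics of the training data (mean and standard deviation)
  2. **transform**: use those statistics to standardize the data
- Intended for the **training set**
- Call it on the training set only; calling it again on other data refits the scaler and overwrites the learned statistics

### `transform()`
- Performs **only the transform step**
- Standardizes with the statistics computed earlier by `fit_transform()` (or `fit()`)
- Used for the **test set**
- Can be applied repeatedly to different datasets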
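
To make the two methods concrete, here is a small sketch, assuming the same train/test split used elsewhere in this notebook. It checks that `transform()` simply applies the mean and standard deviation stored in the fitted scaler's `mean_` and `scale_` attributes; the names `manual`, `train_offset` and `test_offset` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_, then scales
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# transform() is (x - training mean) / training standard deviation
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_scaled, manual))

# Training columns are centred on zero; test columns only approximately,
# because they were scaled with the training statistics, not their own
train_offset = float(np.abs(X_train_scaled.mean(axis=0)).max())
test_offset = float(np.abs(X_test_scaled.mean(axis=0)).max())
print(train_offset, test_offset)
```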

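### Why make this distinction?
1. **Data leakage**:
   - Calling `fit_transform()` on the test set lets the preprocessing see the distribution of the test data, which leaks information
   - The correct approach is to transform the test set using only the training-set statistics

2. **Consistency**:
   - It ensures the test set is scaled with the same parameters
   - Training and test sets are transformed by the same standard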
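
One common way to make this rule hard to break is to wrap the scaler and the classifier in a scikit-learn pipeline. This is a suggestion, not something the original notebook does. The sketch below uses `make_pipeline` and compares it with the manual fit/transform approach; the variable names are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Pipeline: fit() fits the scaler on the training data only;
# score() applies transform() to the test data before classifying.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
pipe_test_score = pipe.score(X_test, y_test)

# Manual equivalent: fit_transform() on train, transform() on test
scaler = StandardScaler()
manual_model = LogisticRegression(max_iter=1000).fit(
    scaler.fit_transform(X_train), y_train)
manual_test_score = manual_model.score(scaler.transform(X_test), y_test)

print("Pipeline test score: {:.3f}".format(pipe_test_score))
print("Manual test score:   {:.3f}".format(manual_test_score))
```

Both routes perform the same computation, so the scores should agree; the pipeline just removes the opportunity to call `fit_transform()` on the test set by mistake.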



In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Standardize the data: fit the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logreg = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
print("Training score: {:.3f}".format(logreg.score(X_train_scaled, y_train)))
print("Test score: {:.3f}".format(logreg.score(X_test_scaled, y_test)))

Training score: 0.988
Test score: 0.986


In [5]:
# Adjust the C parameter (a larger C means weaker regularization)
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Standardize the data: fit the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logreg100 = LogisticRegression(max_iter=1000, C=100).fit(X_train_scaled, y_train)
print("Training score: {:.3f}".format(logreg100.score(X_train_scaled, y_train)))
print("Test score: {:.3f}".format(logreg100.score(X_test_scaled, y_test)))

Training score: 0.998
Test score: 0.944
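
With C=100 the training score rises while the test score drops, which suggests overfitting. To see the trend rather than a single point, one might sweep several values of C. The sketch below is illustrative: the grid `(0.01, 1, 100)` and the `summary` list are assumptions, not part of the original notebook. It also records the norm of the coefficient vector, which grows as regularization weakens.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit one model per C value and collect scores plus coefficient size
summary = []
for C in (0.01, 1, 100):
    model = LogisticRegression(max_iter=1000, C=C).fit(X_train_scaled, y_train)
    summary.append({
        "C": C,
        "train": model.score(X_train_scaled, y_train),
        "test": model.score(X_test_scaled, y_test),
        "coef_norm": float(np.linalg.norm(model.coef_)),
    })

for row in summary:
    print("C={C:<6} train={train:.3f} test={test:.3f} |coef|={coef_norm:.2f}".format(**row))
```

A single train/test split is a noisy guide for choosing C; cross-validation on the training set would give a more reliable comparison.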
