By [Yulandy Chiu](https://www.youtube.com/@YulandySpace)

Aided with Gemini/Claude/ChatGPT and modified by Yulandy Chiu

Version: 2024/05/10

Videos:
* [[10分鐘搞懂機器學習] 1.5 Python實作 類別屬性 Categorical attribute Label encoding One-hot encoding](https://youtu.be/sdIIH4MMKyk?si=Xm2IM-TlaRA1zt0h)

Facebook: [Yulandy Chiu的AI資訊站](https://www.facebook.com/yulandychiu)

 This code is licensed under the Creative Commons Attribution-NonCommercial 4.0
 International License (CC BY-NC 4.0). You are free to use, modify, and share this code for non-commercial purposes, provided you give appropriate credit. For more details, see the LICENSE file or visit: https://creativecommons.org/licenses/by-nc/4.0/
 © [2024] [Yulandy Chiu](https://www.youtube.com/@YulandySpace)


Topic: Handling categorical attributes

In [2]:
# Dataset source: https://github.com/ageron/handson-ml2
# Path to the housing data in that repo: handson-ml2/datasets/housing/housing.csv

from google.colab import drive
import pandas as pd

# Mount Google Drive
drive.mount('/content/drive')

# Read the CSV file
file_path = "/content/drive/My Drive/ML/ch1 data processing/housing.csv"  # Replace with the actual path to your CSV file
housing_prices_data = pd.read_csv(file_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Preview the first five rows with head()
housing_prices_data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [None]:
import pandas as pd
# housing_prices_data is the DataFrame; 'ocean_proximity' is the categorical column to inspect
unique_categories = housing_prices_data['ocean_proximity'].unique()
print(unique_categories)


['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']


In [None]:
housing_prices_data['ocean_proximity'].value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

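Text attribute: free-form words, sentences, or documents; handling these falls under natural language processing.  
Categorical attribute: a text value drawn from a limited set of categories, e.g. "low," "medium," "high"


Most machine learning algorithms work better with numerical data, so categorical attributes are usually converted to numbers before training.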
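
For a category with a natural order, such as the "low," "medium," "high" example above, one simple option is to write the mapping by hand so the integers follow that order. The snippet below is a minimal sketch on a made-up Series (`levels` and `order` are hypothetical names, not part of the housing data):

```python
import pandas as pd

# Hypothetical toy data: an ordered categorical attribute
levels = pd.Series(['low', 'high', 'medium', 'low'])

# Hand-written mapping that preserves the natural order low < medium < high
order = {'low': 0, 'medium': 1, 'high': 2}

# Series.map replaces each label with its integer code
encoded = levels.map(order)
print(encoded.tolist())  # [0, 2, 1, 0]
```

This only makes sense when the order is meaningful. The 'ocean_proximity' column has no such obvious ranking, which is why the encodings in the following cells are used instead.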

In [None]:
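# prompt: Python, convert categorical attributes to numerical values
# Method 1: Label encoding, which maps each category to an integer
# Drawback: ML algorithms often treat numerically close values as more similar, which may not be true of these categories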
from sklearn.preprocessing import LabelEncoder
df=housing_prices_data.copy()
label_encoder = LabelEncoder()
df['ocean_proximity'] = label_encoder.fit_transform(df['ocean_proximity'])
df['ocean_proximity'].value_counts()

ocean_proximity
0    9136
1    6551
4    2658
3    2290
2       5
Name: count, dtype: int64
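
The counts above show integer codes but not which category each code stands for. `LabelEncoder` assigns codes in the sorted order of the labels and stores them in its `classes_` attribute. The sketch below rebuilds the mapping on a small made-up Series containing the same five categories (`toy` and `mapping` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: one row per category found in 'ocean_proximity'
toy = pd.Series(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
# {'<1H OCEAN': 0, 'INLAND': 1, 'ISLAND': 2, 'NEAR BAY': 3, 'NEAR OCEAN': 4}

# inverse_transform converts codes back to the original labels
print(list(encoder.inverse_transform(codes)))
```

Read against this mapping, code 2 with only 5 rows is 'ISLAND'. Note that scikit-learn documents `LabelEncoder` as intended for target labels; for input features it provides `OrdinalEncoder`, which may be the better fit in a real pipeline.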

In [None]:
# Method 2: One-hot encoding, which turns each category into its own binary (True/False) feature column
# Drawback: uses more memory, since every category adds a column
df=housing_prices_data.copy()
df = pd.get_dummies(df, columns=['ocean_proximity'])
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,False,False,False,True,False
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,False,False,False,True,False
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,False,False,False,True,False
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,False,False,False,True,False
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,False,True,False,False,False
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,False,True,False,False,False
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,False,True,False,False,False
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,False,True,False,False,False
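
One possible way to soften the memory drawback noted above is scikit-learn's `OneHotEncoder`, which returns a SciPy sparse matrix by default and so stores only the non-zero entries. This is an alternative to the `pd.get_dummies` call used in the notebook, shown here as a minimal sketch on made-up data (`toy` is a hypothetical DataFrame, not the housing dataset):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one row per category found in 'ocean_proximity'
toy = pd.DataFrame({
    'ocean_proximity': ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']
})

# With default settings the encoder returns a sparse matrix
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[['ocean_proximity']])

print(encoder.categories_[0])  # column order of the one-hot features
print(encoded.shape)           # (5, 5): five rows, one column per category
print(encoded.nnz)             # 5: only one stored value per row
print(encoded.toarray())       # dense view, for inspection only
```

Each row has exactly one non-zero entry, so the sparse form stores one value per row regardless of how many categories exist. Whether the saving matters depends on the number of categories; with only five, the dense `get_dummies` output is usually fine.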
