# One-Hot Encoding and Label Encoding
Categorical variables
- Binary: e.g., gender
- Multi-class: e.g., color, country

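#### **One-Hot Encoding**
- Creates one binary column for each category of a categorical feature
    - "Gender" feature -> "Female" column (1 if the value is "Female", 0 if "Male") and "Male" column (1 if the value is "Male", 0 if "Female")
- When to use
    - The feature has only a few unique categories
    - Tree-based models, logistic regression, neural networks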
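As a minimal sketch of the idea above, the following uses `pd.get_dummies` on a small made-up DataFrame (the `sex` and `color` columns are hypothetical toy data, not part of the Titanic dataset used later). Passing `dtype=int` is a choice made here so the columns show 0/1 rather than `True`/`False`.

```python
import pandas as pd

# Toy data (hypothetical): one binary and one multi-class categorical column
df = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "color": ["red", "blue", "green"],
})

# One binary column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(df, columns=["sex", "color"], dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```

Each row has exactly one 1 among the columns generated from the same source feature.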

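#### **Label Encoding**
- Assigns a unique integer to each category
    - ["Red", "Blue", "Green"] -> [0, 1, 2]
- When to use
    - Categories whose order matters
- Limitation: a model may read the integers as an ordering even when the categories have no natural order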
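The ordering caveat can be illustrated with a small sketch. It assumes a made-up `sizes` column whose natural order is small < medium < large; scikit-learn's `LabelEncoder` assigns integers in sorted label order, which does not match that natural order here, while an explicit mapping (the hypothetical `size_order` dict) does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy ordinal feature (hypothetical data)
sizes = pd.Series(["medium", "small", "large", "small"])

# LabelEncoder assigns integers in sorted (alphabetical) order of the labels:
# large -> 0, medium -> 1, small -> 2
alphabetical = LabelEncoder().fit_transform(sizes)

# An explicit mapping keeps the intended order small < medium < large
size_order = {"small": 0, "medium": 1, "large": 2}
ordered = sizes.map(size_order)

print(alphabetical.tolist())
print(ordered.tolist())
```

When the order carries meaning, an explicit mapping like this makes the encoding match it instead of relying on alphabetical order.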

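#### Handling High-Cardinality Categorical Features
- Categorical features with a large number of unique categories
    - Low-cardinality ... color
    - High-cardinality ... postal code, user ID
- Challenges
    - Higher dimensionality: one-hot encoding adds one column per category, so the feature count, memory use, and training time all grow sharply
    - Sparse representation: most entries are 0, so each column carries little signal and the model is harder to train
- Mitigations
    1. **Frequency Encoding**
        - Replaces each category with its **occurrence count**, i.e. the number of times it appears in the dataset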
        ```
        City = ['NY', 'LA', 'NY', 'SF', 'LA']
        Occurrence counts:
          NY → 2
          LA → 2
          SF → 1
        Encoded: [2, 2, 2, 1, 2]
        ```
    2. **Target Encoding**
        - Replaces each category with the **mean of the target variable** over the samples that belong to that category

        ```
        City:   NY,  LA,  NY,  SF,  LA
        Target: 1,   0,   1,   0,   1
        Mean per category:
          NY → (1 + 1) / 2 = 1.0
          LA → (0 + 1) / 2 = 0.5
          SF → (0) / 1 = 0.0
        Encoded: [1.0, 0.5, 1.0, 0.0, 0.5]
        ```
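The two worked examples above can be reproduced with a few lines of pandas. This is a minimal sketch on the same five-row toy data; the column names `City_freq` and `City_target` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["NY", "LA", "NY", "SF", "LA"],
    "Target": [1, 0, 1, 0, 1],
})

# Frequency encoding: replace each category with its occurrence count
city_counts = df["City"].value_counts()
df["City_freq"] = df["City"].map(city_counts)

# Target encoding: replace each category with the mean target of its rows
city_target_mean = df.groupby("City")["Target"].mean()
df["City_target"] = df["City"].map(city_target_mean)

print(df)
```

In practice the target means are usually computed on the training split only, often with smoothing or out-of-fold estimates, because encoding a row with a mean that includes its own target leaks label information into the feature.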

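#### Choosing an Encoding
| Encoding technique | Use case |
|--------------------------|----------|
| One-hot encoding | Nominal features with few unique categories |
| Label encoding | Ordinal features, or features fed to algorithms such as tree-based models |
| Frequency encoding | High-cardinality features in both regression and classification tasks |
| Target encoding | High-cardinality features in supervised learning tasks |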
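One way to apply different encodings to different columns in a single step is scikit-learn's `ColumnTransformer`. The sketch below assumes a made-up DataFrame with a nominal `color` column and an ordinal `size` column. The transformer names `"nominal"` and `"ordinal"` are arbitrary labels, and `sparse_threshold=0.0` is set only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data (hypothetical)
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],      # nominal: no natural order
    "size": ["small", "large", "medium", "small"],  # ordinal: small < medium < large
})

encoder = ColumnTransformer(
    transformers=[
        # One-hot for the nominal feature; unseen categories encode as all zeros
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["color"]),
        # Integer codes in the stated order for the ordinal feature
        ("ordinal", OrdinalEncoder(categories=[["small", "medium", "large"]]), ["size"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Output columns: color_blue, color_green, color_red, size
encoded = encoder.fit_transform(df)
print(encoded)
```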

----
## Exercises
1. Apply one-hot encoding and label encoding to a dataset that contains categorical variables
2. Try different encoding methods and observe how they affect model performance

In [73]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [74]:
# Load Titanic dataset
titanic_df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

In [75]:
# info() prints the summary itself and returns None, so call it directly
print("Dataset info:")
titanic_df.info()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Inspect and clean dataset (Titanic)

In [76]:
# Display the first few rows of the dataset
print(titanic_df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [77]:
# Check the number of NaN values per column
titanic_df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [78]:
# Fill NaN in 'Age' with the median age
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())

# Keep a copy that still has 'Cabin' (its 'Age' is already imputed), then drop 'Cabin'
titanic_df_raw = titanic_df.copy()
titanic_df = titanic_df.drop(columns=['Cabin'])

# Drop the rows where 'Embarked' is NaN
titanic_df = titanic_df.dropna(subset=['Embarked'])

# Check the NaN count per column again
titanic_df.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [79]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  889 non-null    int64  
 1   Survived     889 non-null    int64  
 2   Pclass       889 non-null    int64  
 3   Name         889 non-null    object 
 4   Sex          889 non-null    object 
 5   Age          889 non-null    float64
 6   SibSp        889 non-null    int64  
 7   Parch        889 non-null    int64  
 8   Ticket       889 non-null    object 
 9   Fare         889 non-null    float64
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB


In [80]:
# Set the target variable
y = titanic_df['Survived']

### One-Hot Encoding

In [81]:
# Apply One-Hot Encoding
titanic_df_one_hot = pd.get_dummies(titanic_df, columns=['Sex', 'Embarked'])
print("\nOne-Hot Encoded Dataset:\n", titanic_df_one_hot.head())


One-Hot Encoded Dataset:
    PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0      1      0   
2                             Heikkinen, Miss. Laina  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0   
4                           Allen, Mr. William Henry  35.0      0      0   

             Ticket     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  \
0         A/5 21171   7.2500       False      True       False       False   
1          PC 17599  71.2833        True     False        True       False   
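2  STON/O2. 3101282   7.9250        True     False       False       False   
3            113803  53.1000        True     False       False       False   
4            373450   8.0500       False      True       False       False   

   Embarked_S  
0        True  
1       False  
2        True  
3        True  
4        True  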

In [82]:
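# Drop identifiers, the target, and free-text columns.
# Note: 'Age' is also dropped here but kept in the later feature sets,
# so the accuracies in this notebook are not a strict like-for-like comparison.
X_one_hot = titanic_df_one_hot.drop(columns=['PassengerId', 'Survived', 'Name', 'Ticket', 'Age'])
y = titanic_df['Survived']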

# Split dataset
X_one_hot_train, X_one_hot_test, y_train, y_test = train_test_split(X_one_hot, y, test_size=0.2, random_state=42)

# Train logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_one_hot_train, y_train)

y_pred = model.predict(X_one_hot_test)
print("Accuracy with One-Hot Encoding:", accuracy_score(y_test, y_pred))

Accuracy with One-Hot Encoding: 0.7808988764044944


### Label Encoding

In [83]:
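# Apply Label Encoding to an ordinal category (Pclass: 1 < 2 < 3)
# Note: LabelEncoder is designed for target labels; OrdinalEncoder is the feature-side equivalent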
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
titanic_df_one_hot['Pclass_encoded'] = label_encoder.fit_transform(titanic_df_one_hot['Pclass'])

print("\nLabel Encoded Feature:\n", titanic_df_one_hot[['Pclass', 'Pclass_encoded']].head())
print("\nCurrent columns:\n", titanic_df_one_hot.columns)



Label Encoded Feature:
    Pclass  Pclass_encoded
0       3               2
1       1               0
2       3               2
3       1               0
4       3               2

Current columns:
 Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q',
       'Embarked_S', 'Pclass_encoded'],
      dtype='object')


In [None]:
# Build the feature set with 'Pclass_encoded' in place of 'Pclass'
X_Pclass_encoded = titanic_df_one_hot.drop(columns=['PassengerId', 'Survived', 'Pclass', 'Name', 'Ticket'])
X_Pclass_encoded.columns

Index(['Age', 'SibSp', 'Parch', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C',
       'Embarked_Q', 'Embarked_S', 'Pclass_encoded'],
      dtype='object')

In [85]:
# Split dataset
X_Pclass_encoded_train, X_Pclass_encoded_test, y_train, y_test = train_test_split(X_Pclass_encoded, y, test_size=0.2, random_state=42)

# Train logistic regression model
# (with these unscaled features lbfgs hits the max_iter=200 limit; see the warning in the output)
model = LogisticRegression(max_iter=200)
model.fit(X_Pclass_encoded_train, y_train)

y_pred = model.predict(X_Pclass_encoded_test)
print("Accuracy with Pclass Encoded:", accuracy_score(y_test, y_pred))

Accuracy with Pclass Encoded: 0.7696629213483146


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
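
The warning above suggests scaling the data. A minimal sketch of that suggestion follows, using synthetic features as a stand-in (the data, the seed, and the coefficients that generate `y` are all made up for illustration and are not the Titanic features). It wraps `StandardScaler` and `LogisticRegression` in a pipeline so that scaling is fitted together with the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for unscaled features (hypothetical data, not Titanic)
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(30, 13, size=500),    # age-like scale
    rng.exponential(32, size=500),   # fare-like scale
    rng.integers(0, 2, size=500),    # binary indicator
])
y = (X[:, 2] + 0.02 * X[:, 1] + rng.normal(0, 0.5, size=500) > 1.0).astype(int)

# Standardize the features, then fit the classifier with the same max_iter as above
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
clf.fit(X, y)

print("Iterations used:", clf[-1].n_iter_[0])
print("Training accuracy:", clf.score(X, y))
```

Whether scaling changes the test accuracy on the Titanic features would have to be checked by rerunning the cells. The sketch only shows how the pipeline is assembled.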


### Frequency Encoding

In [86]:
# Apply Frequency Encoding
titanic_df_one_hot['Ticket_frequency'] = titanic_df_one_hot['Ticket'].map(titanic_df_one_hot['Ticket'].value_counts())

print("\nFrequency Encoded Feature:\n", titanic_df_one_hot[['Ticket', 'Ticket_frequency']].head())


Frequency Encoded Feature:
              Ticket  Ticket_frequency
0         A/5 21171                 1
1          PC 17599                 1
2  STON/O2. 3101282                 1
3            113803                 2
4            373450                 1


In [87]:
print("\nCurrent columns:\n", titanic_df_one_hot.columns)



Current columns:
 Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q',
       'Embarked_S', 'Pclass_encoded', 'Ticket_frequency'],
      dtype='object')


In [88]:
X_ticket_encoded = titanic_df_one_hot.drop(columns=['PassengerId', 'Survived', 'Name', 'Ticket', 'Pclass_encoded'])
X_ticket_encoded.columns

Index(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_female', 'Sex_male',
       'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Ticket_frequency'],
      dtype='object')

In [89]:
# Split dataset
X_ticket_encoded_train, X_ticket_encoded_test, y_train, y_test = train_test_split(X_ticket_encoded, y, test_size=0.2, random_state=42)

# Train logistic regression model
# (with these unscaled features lbfgs hits the max_iter=200 limit; see the warning in the output)
model = LogisticRegression(max_iter=200)
model.fit(X_ticket_encoded_train, y_train)

y_pred = model.predict(X_ticket_encoded_test)
print("Accuracy with Ticket Encoded:", accuracy_score(y_test, y_pred))

Accuracy with Ticket Encoded: 0.7752808988764045


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
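
The frequency-encoding cell above computes `Ticket_frequency` on the full dataset before the train/test split. A common alternative is to learn the counts from the training rows only and reuse them on the test rows. The sketch below shows that variant on a made-up ticket column; the ticket codes and the choice to encode unseen tickets as 0 are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy high-cardinality column (hypothetical ticket codes)
df = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3", "A1", "D4", "B2", "E5", "C3", "A1"],
})

train, test = train_test_split(df, test_size=0.3, random_state=0)
train, test = train.copy(), test.copy()

# Learn the counts from the training rows only
train_counts = train["Ticket"].value_counts()

train["Ticket_frequency"] = train["Ticket"].map(train_counts)
# Tickets never seen during training get a count of 0
test["Ticket_frequency"] = test["Ticket"].map(train_counts).fillna(0).astype(int)

print(test)
```

This keeps information from the test rows out of the feature, at the cost of needing a rule for categories that appear only at prediction time.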
