<a href="https://colab.research.google.com/github/rtajeong/M2_2025/blob/main/gg_25_%EB%B2%94%EC%A3%BC%ED%98%95%EB%8D%B0%EC%9D%B4%ED%84%B0%EC%BD%94%EB%94%A9_rev4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What is Categorical Data?
- A categorical field takes its value from a limited, fixed set of possible values.
  Some examples of fields and values are:

  - Blood type - A, B, AB, O
  - Customer responses on satisfaction of a product - happy, content, sad
  - Eye color - green, blue, brown
- There are two common types of categorical data: nominal and ordinal.
  - Nominal categorical data has values with no inherent order such as the eye color example above.
  - Ordinal categorical data contains values with an intended order. One example is the customer responses above: happy is more positive than content, which in turn is more positive than sad.

### When to use categorical data?
- A string variable consists of only a few distinct values; converting it to a categorical saves memory.
- The lexical order of a variable is not the same as its logical order (“one”, “two”, “three”); a categorical with an explicit order makes sorting and min/max follow the logical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable.
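
The first two points can be checked with a small sketch. The `sizes` column and its three labels are hypothetical, chosen only so that lexical and logical order differ; the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical sample: a string column with only three distinct values.
sizes = pd.Series(["small", "large", "medium", "small", "large"] * 1000)

# 1) Memory: the category dtype stores small integer codes plus one copy of each label.
as_object = sizes.memory_usage(deep=True)
as_category = sizes.astype("category").memory_usage(deep=True)
print(as_object, as_category)

# 2) Logical order: declare the order explicitly so min/max follow it.
ordered = pd.Series(
    pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)
)
lexical = sorted(sizes.unique())       # plain string sorting
print(lexical)
print(ordered.min(), ordered.max())    # follows the declared order
```

Without the explicit `categories` list, pandas would fall back to the sorted (lexical) order of the labels.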

### Categorical Data Encoding
Most machine learning algorithms can only read numerical values, so categorical features must be encoded as numbers.
Here are four different ways of encoding categorical features:
1. Find and Replace
2. Ordinal Encoding (ordinal data)
3. One-hot Encoding (nominal data)
4. Custom-binary Encoding

- Good reference for Encoding Categorical data
  - https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

In [1]:
import pandas as pd
import numpy as np

# Data transformation (categorical data encoding)
- One-hot Encoding
- Label Encoding
- Ordinal Encoding
- many more ...

## One-hot encoding
- pandas : get_dummies()
- sklearn: OneHotEncoder()

In [2]:
n_samples = 10
height = 3*np.random.randn(n_samples).round() + 170
nationality = np.random.randint(0,3,n_samples)
nationality = pd.Series(nationality).map({0: '한국', 1:'일본', 2:'중국'})

height, nationality

(array([170., 167., 170., 170., 170., 173., 170., 173., 170., 167.]),
 0    중국
 1    중국
 2    한국
 3    중국
 4    일본
 5    한국
 6    일본
 7    한국
 8    일본
 9    일본
 dtype: object)

In [3]:
list(zip(height, nationality))

[(np.float64(170.0), '중국'),
 (np.float64(167.0), '중국'),
 (np.float64(170.0), '한국'),
 (np.float64(170.0), '중국'),
 (np.float64(170.0), '일본'),
 (np.float64(173.0), '한국'),
 (np.float64(170.0), '일본'),
 (np.float64(173.0), '한국'),
 (np.float64(170.0), '일본'),
 (np.float64(167.0), '일본')]

In [4]:
df = pd.DataFrame(list(zip(height, nationality)),
                  columns=["height","nationality"])
df

Unnamed: 0,height,nationality
0,170.0,중국
1,167.0,중국
2,170.0,한국
3,170.0,중국
4,170.0,일본
5,173.0,한국
6,170.0,일본
7,173.0,한국
8,170.0,일본
9,167.0,일본


- Method 1: Pandas - get_dummies()

In [5]:
# You can also encode only some of the columns
# (ex) nat = pd.get_dummies(df['nationality'], prefix='nat')

new_df = pd.get_dummies(df, columns=['nationality'])
new_df

Unnamed: 0,height,nationality_일본,nationality_중국,nationality_한국
0,170.0,False,True,False
1,167.0,False,True,False
2,170.0,False,False,True
3,170.0,False,True,False
4,170.0,True,False,False
5,173.0,False,False,True
6,170.0,True,False,False
7,173.0,False,False,True
8,170.0,True,False,False
9,167.0,True,False,False


- Method 2: **Sklearn** : OneHotEncoder()

In [6]:
# sklearn - OneHotEncoder()
from sklearn.preprocessing import OneHotEncoder
ohc = OneHotEncoder()
ohe = ohc.fit_transform(df.nationality.values.reshape(-1,1)).toarray()
ohe

array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])

In [7]:
ohc.categories_

[array(['일본', '중국', '한국'], dtype=object)]

In [8]:
ohc.categories_[0][2]   # note: categories_ is a list of 1-D arrays, one per feature

'한국'

In [9]:
[ohc.categories_[0][i] for i in range(len(ohc.categories_[0]))]

['일본', '중국', '한국']

In [10]:
col = ['nat_'+str(ohc.categories_[0][i]) for i in range(len(ohc.categories_[0]))]
col

['nat_일본', 'nat_중국', 'nat_한국']

In [11]:
ohe_df = pd.DataFrame(ohe, columns=col)
ohe_df

Unnamed: 0,nat_일본,nat_중국,nat_한국
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,0.0,1.0
3,0.0,1.0,0.0
4,1.0,0.0,0.0
5,0.0,0.0,1.0
6,1.0,0.0,0.0
7,0.0,0.0,1.0
8,1.0,0.0,0.0
9,1.0,0.0,0.0


In [12]:
new_df2 = pd.concat([df,ohe_df],  axis=1); new_df2

Unnamed: 0,height,nationality,nat_일본,nat_중국,nat_한국
0,170.0,중국,0.0,1.0,0.0
1,167.0,중국,0.0,1.0,0.0
2,170.0,한국,0.0,0.0,1.0
3,170.0,중국,0.0,1.0,0.0
4,170.0,일본,1.0,0.0,0.0
5,173.0,한국,0.0,0.0,1.0
6,170.0,일본,1.0,0.0,0.0
7,173.0,한국,0.0,0.0,1.0
8,170.0,일본,1.0,0.0,0.0
9,167.0,일본,1.0,0.0,0.0


In [13]:
new_df2.drop('nationality', axis=1, inplace=True)

In [14]:
new_df2

Unnamed: 0,height,nat_일본,nat_중국,nat_한국
0,170.0,0.0,1.0,0.0
1,167.0,0.0,1.0,0.0
2,170.0,0.0,0.0,1.0
3,170.0,0.0,1.0,0.0
4,170.0,1.0,0.0,0.0
5,173.0,0.0,0.0,1.0
6,170.0,1.0,0.0,0.0
7,173.0,0.0,0.0,1.0
8,170.0,1.0,0.0,0.0
9,167.0,1.0,0.0,0.0


## Label Encoding
- Mainly used for the target variable (takes a 1-D array as input).
- Each category is assigned an integer from 0 through N-1, where N is the number of categories of the feature.
- One major issue: even when the classes have no inherent order or relationship, an algorithm may interpret the integer codes as ordered or related.
- For example, with the classes Cold, Hot, Very Hot and Warm, the codes follow alphabetical order (Cold < Hot < Very Hot < Warm, i.e. 0 < 1 < 2 < 3), which is not the natural order. The scikit-learn code for the DataFrame above is as follows:

In [15]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['nat_label_encoded'] = le.fit_transform(df.nationality)
df

Unnamed: 0,height,nationality,nat_label_encoded
0,170.0,중국,1
1,167.0,중국,1
2,170.0,한국,2
3,170.0,중국,1
4,170.0,일본,0
5,173.0,한국,2
6,170.0,일본,0
7,173.0,한국,2
8,170.0,일본,0
9,167.0,일본,0


In [16]:
le.classes_

array(['일본', '중국', '한국'], dtype=object)
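
The alphabetical-ordering issue noted for label encoding can be reproduced with a few temperature labels. This is a minimal sketch; the `temps` list is a hypothetical sample, not data from this notebook.

```python
from sklearn.preprocessing import LabelEncoder

temps = ["Hot", "Cold", "Very Hot", "Warm", "Cold"]   # hypothetical sample values
le_demo = LabelEncoder()
codes = le_demo.fit_transform(temps)

print(list(le_demo.classes_))                  # classes are stored in sorted order
print(list(codes))                             # position of each label in classes_
print(list(le_demo.inverse_transform(codes)))  # codes map back to the original labels
```

Because the codes come from sorted order, Warm receives the largest code even though it is not the hottest label.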

## Ordinal Encoding
- Mainly used for feature variables (takes a 2-D array as input).
- Encode categorical features as an integer array.
- The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

In [17]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
df['nat_ordinal_encoded'] = oe.fit_transform(df.nationality.values.reshape(-1,1))
df

Unnamed: 0,height,nationality,nat_label_encoded,nat_ordinal_encoded
0,170.0,중국,1,1.0
1,167.0,중국,1,1.0
2,170.0,한국,2,2.0
3,170.0,중국,1,1.0
4,170.0,일본,0,0.0
5,173.0,한국,2,2.0
6,170.0,일본,0,0.0
7,173.0,한국,2,2.0
8,170.0,일본,0,0.0
9,167.0,일본,0,0.0


In [18]:
oe.categories_

[array(['일본', '중국', '한국'], dtype=object)]

## Difference between LabelEncoder() and OrdinalEncoder()
- Both map categories to integer codes; they differ in their intended use.
- OrdinalEncoder is for converting features, while LabelEncoder is for converting the target variable.

- That's why OrdinalEncoder fits data of shape (n_samples, n_features),
  while LabelEncoder only fits data of shape (n_samples,). (In the past, LabelEncoder was often
  applied in a loop over columns to do what is now the job of OrdinalEncoder.)
- LabelEncoder learns `classes_`, while OrdinalEncoder learns `categories_`.
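
The shape difference can be seen directly in a short sketch. The arrays `y_demo` and `X_demo` are hypothetical toy inputs.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

y_demo = np.array(["no", "yes", "no"])                     # shape (n_samples,)
X_demo = np.array([["a", "x"], ["b", "y"], ["a", "y"]])    # shape (n_samples, n_features)

y_codes = LabelEncoder().fit_transform(y_demo)     # 1-D integer codes
X_codes = OrdinalEncoder().fit_transform(X_demo)   # 2-D float codes, one column per feature

print(y_codes.shape, y_codes.dtype)
print(X_codes.shape, X_codes.dtype)

# OrdinalEncoder rejects 1-D input; a single column must be reshaped to (n_samples, 1).
try:
    OrdinalEncoder().fit_transform(y_demo)
    rejected_1d = False
except ValueError:
    rejected_1d = True
print(rejected_1d)
```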

In [19]:
df = pd.DataFrame({
    'Age':[33,44,22,44,55,22, 60],
    'Income':['Low','Low','High','Medium','Medium','High','Very High']})
df

Unnamed: 0,Age,Income
0,33,Low
1,44,Low
2,22,High
3,44,Medium
4,55,Medium
5,22,High
6,60,Very High


- LabelEncoding

In [20]:
df1 = df.copy()
enc = LabelEncoder()     # LabelEncoder has no `categories` parameter; classes are sorted alphabetically
df1["Income_coded"] = enc.fit_transform(df1.Income)
print(df1.dtypes)
df1

Age              int64
Income          object
Income_coded     int64
dtype: object


Unnamed: 0,Age,Income,Income_coded
0,33,Low,1
1,44,Low,1
2,22,High,0
3,44,Medium,2
4,55,Medium,2
5,22,High,0
6,60,Very High,3


- direct mapping

In [22]:
df2 = df.copy()
income_dict = {'Low':0, 'Medium':1, 'High':2, 'Very High': 3}
df2['Income_coded'] = df2['Income'].map(income_dict)
print(df2.dtypes)
df2

Age              int64
Income          object
Income_coded     int64
dtype: object


Unnamed: 0,Age,Income,Income_coded
0,33,Low,0
1,44,Low,0
2,22,High,2
3,44,Medium,1
4,55,Medium,1
5,22,High,2
6,60,Very High,3


In [23]:
df1['Income_coded'].max(), df2.Income_coded.max()

(3, 3)

In [None]:
df1.Income_coded.mean(), df2.Income_coded.mean()

(1.2857142857142858, 1.2857142857142858)

- OrdinalEncoding

In [24]:
from sklearn.preprocessing import OrdinalEncoder   # for features
df3 = df.copy()
enc = OrdinalEncoder(categories = [['Low', 'Medium', 'High', 'Very High']])

df3["Income_Ordered"] = enc.fit_transform(np.array(df3.Income).reshape(-1,1))  # must be a matrix
print(df3.dtypes, enc.categories_)
df3

Age                 int64
Income             object
Income_Ordered    float64
dtype: object [array(['Low', 'Medium', 'High', 'Very High'], dtype=object)]


Unnamed: 0,Age,Income,Income_Ordered
0,33,Low,0.0
1,44,Low,0.0
2,22,High,2.0
3,44,Medium,1.0
4,55,Medium,1.0
5,22,High,2.0
6,60,Very High,3.0


- make data type categorical

In [25]:
df4 = df.copy()
df4.Income=pd.Categorical(df4.Income, ['Low', 'Medium', 'High', 'Very High'],
                         ordered=True)
print(df4.dtypes)

Age          int64
Income    category
dtype: object


In [None]:
df4

Unnamed: 0,Age,Income
0,33,Low
1,44,Low
2,22,High
3,44,Medium
4,55,Medium
5,22,High
6,60,Very High


In [26]:
df3.Income.max(), df3.Income_Ordered.max(), df4.Income.max()

('Very High', 3.0, 'Very High')

In [27]:
enc = OrdinalEncoder(categories=[['Low','Medium', 'High','Very High']] )
# df4["Income_Ordered"] = enc.fit_transform(df4.Income)   # fails: OrdinalEncoder expects 2-D input
xx = np.array(df4.Income).reshape(-1,1)
df4["Income_Ordered"] = enc.fit_transform(xx)
print(df4.dtypes, enc.categories_)
df4

Age                  int64
Income            category
Income_Ordered     float64
dtype: object [array(['Low', 'Medium', 'High', 'Very High'], dtype=object)]


Unnamed: 0,Age,Income,Income_Ordered
0,33,Low,0.0
1,44,Low,0.0
2,22,High,2.0
3,44,Medium,1.0
4,55,Medium,1.0
5,22,High,2.0
6,60,Very High,3.0


## Other custom encoding methods

In [28]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.9.0-py3-none-any.whl.metadata (7.9 kB)
Downloading category_encoders-2.9.0-py3-none-any.whl (85 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.9.0


In [29]:
import pandas as pd
from category_encoders import TargetEncoder, BinaryEncoder

data = {'color': ['red', 'green', 'blue', 'green', 'red', 'blue', 'red'],
        'target': [1, 0, 0, 1, 0, 0, 0]}
df = pd.DataFrame(data)
df

Unnamed: 0,color,target
0,red,1
1,green,0
2,blue,0
3,green,1
4,red,0
5,blue,0
6,red,0


In [31]:
target_encoder = TargetEncoder(smoothing=2)
df['color_encoded'] = target_encoder.fit_transform(df['color'], df['target'])
df

Unnamed: 0,color,target,color_encoded
0,red,1,0.285724
1,green,0,0.285741
2,blue,0,0.285679
3,green,1,0.285741
4,red,0,0.285724
5,blue,0,0.285679
6,red,0,0.285724


In [32]:
target_encoder.mapping

{'color': color
  1    0.285724
  2    0.285741
  3    0.285679
 -1    0.285714
 -2    0.285714
 dtype: float64}
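
All encoded values above sit close to 0.2857, which is the overall mean of `target` (2/7) and also the value stored for the `-1` and `-2` entries of the mapping. A likely explanation is that, with only two or three rows per color, the encoder's smoothing (controlled by `smoothing` and `min_samples_leaf`) shrinks each category mean almost entirely toward the overall mean. For comparison, the sketch below computes the unsmoothed per-category means with plain pandas; it is not a reimplementation of `TargetEncoder`, and the `demo` frame simply repeats the toy data.

```python
import pandas as pd

demo = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "blue", "red"],
                     "target": [1, 0, 0, 1, 0, 0, 0]})

global_mean = demo["target"].mean()                      # overall mean of the target
category_mean = demo.groupby("color")["target"].mean()   # unsmoothed mean per category

# Replace each color by the mean target of its category.
demo["color_mean"] = demo["color"].map(category_mean)
print(round(global_mean, 6))
print(category_mean)
```

Unsmoothed means separate the categories more strongly, but with so few rows they are also much noisier, which is the trade-off smoothing addresses.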

In [33]:
df = pd.DataFrame(data)

binary_encoder = BinaryEncoder()
df_encoded = binary_encoder.fit_transform(df['color'])
df_encoded

Unnamed: 0,color_0,color_1
0,0,1
1,1,0
2,1,1
3,1,0
4,0,1
5,1,1
6,0,1


In [34]:
binary_encoder.mapping   # each category is converted to a binary code

[{'col': 'color',
  'mapping':     color_0  color_1
   1        0        1
   2        1        0
   3        1        1
  -1        0        0
  -2        0        0}]
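
The mapping above suggests that each category first receives an ordinal code (1, 2, 3 in order of first appearance) and that the code is then written in binary, one column per bit. The sketch below reproduces that idea by hand with pandas and NumPy; it is an illustration under that assumption, not the library's implementation.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green", "red", "blue", "red"])

# Step 1: ordinal code per category in order of first appearance, starting at 1.
codes, uniques = pd.factorize(colors)
codes = codes + 1                      # red -> 1, green -> 2, blue -> 3

# Step 2: write each code in binary, most significant bit first.
n_bits = int(np.ceil(np.log2(len(uniques) + 1)))
bits = (codes[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1

manual = pd.DataFrame(bits, columns=[f"color_{i}" for i in range(n_bits)])
print(manual)
```

With N categories this needs only about log2(N) columns, compared with N columns for one-hot encoding.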

# Standard Scaling

In [35]:
n_samples = 10
height = 3*np.random.randn(n_samples).round(1) + 170
weight = 4*np.random.randn(n_samples).round(1) + 70

X = pd.DataFrame(list(zip(height, weight)))
X.head()

Unnamed: 0,0,1
0,171.2,71.6
1,167.3,68.0
2,164.9,66.4
3,168.5,69.2
4,172.1,78.0


In [None]:
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)   # on the dataframe -> result is an array
X_std

array([[ 1.60832159, -0.68544437],
       [-0.09747404, -0.22073632],
       [-1.19405694, -1.26632943],
       [ 0.87726632,  1.63809587],
       [ 1.12095141, -1.15015242],
       [-1.68142712,  1.17338782],
       [ 0.14621105,  0.24397173],
       [-0.58484421,  0.47632575],
       [ 0.51173869, -1.15015242],
       [-0.70668676,  0.9410338 ]])

In [None]:
x = X.values   # array
x_std = StandardScaler().fit_transform(x)  # on the array
x_std

array([[ 1.60832159, -0.68544437],
       [-0.09747404, -0.22073632],
       [-1.19405694, -1.26632943],
       [ 0.87726632,  1.63809587],
       [ 1.12095141, -1.15015242],
       [-1.68142712,  1.17338782],
       [ 0.14621105,  0.24397173],
       [-0.58484421,  0.47632575],
       [ 0.51173869, -1.15015242],
       [-0.70668676,  0.9410338 ]])
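
Standard scaling replaces each value by z = (x - mean) / std, computed column by column. The sketch below checks this against `StandardScaler` on seeded random data; `data_demo` is a hypothetical stand-in for the height and weight columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)     # hypothetical data, seeded for repeatability
data_demo = np.column_stack([3 * rng.standard_normal(10) + 170,
                             4 * rng.standard_normal(10) + 70])

# z = (x - mean) / std per column; StandardScaler uses the population std (ddof=0).
manual = (data_demo - data_demo.mean(axis=0)) / data_demo.std(axis=0)
scaled = StandardScaler().fit_transform(data_demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, whatever its original units.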

# Breast cancer example.
- The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.
- A reasonable classification accuracy on this dataset is between 68% and 73%; the models here are for demonstration only.
- Breast Cancer Dataset (breast-cancer.csv)
- Dataset description: https://github.com/jbrownlee/Datasets/blob/master/breast-cancer.names

- Download the data file from https://github.com/jbrownlee/Datasets/blob/master/breast-cancer.csv

In [36]:
!curl 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv' -o breast-cancer.csv



In [37]:
!head -10 breast-cancer.csv

'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
'50-59','premeno','25-29','3-5','no','2','right','left_up','yes','no-recurrence-events'
'50-59','ge40','40-44','0-2','no','3','left','left_up','no','no-recurrence-events'
'40-49','premeno','10-14','0-2','no','2','left','left_up','no','no-recurrence-events'
'40-49','premeno','0-4','0-2','no','2','right','right_low','no','no-recurrence-events'
'40-49','ge40','40-44','15-17','yes','2','right','left_up','yes','no-recurrence-events'


In [38]:
import pandas as pd
data = pd.read_csv('breast-cancer.csv', header=None)
pd.concat([data.head(3), data.tail(3)])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
1,'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
2,'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
283,'30-39','premeno','30-34','6-8','yes','2','right','right_up','no','no-recurrence-events'
284,'50-59','premeno','15-19','0-2','no','2','right','left_low','no','no-recurrence-events'
285,'50-59','ge40','40-44','0-2','no','3','left','right_up','no','no-recurrence-events'


In [39]:
cols = ['age','menopause','tumor_size','inv_nodes', 'node_caps',
        'deg_malig', 'breast', 'breast_quad', 'irradiat','Class']
data.columns = cols
data.head()

Unnamed: 0,age,menopause,tumor_size,inv_nodes,node_caps,deg_malig,breast,breast_quad,irradiat,Class
0,'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
1,'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
2,'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
3,'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
4,'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'


In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          286 non-null    object
 1   menopause    286 non-null    object
 2   tumor_size   286 non-null    object
 3   inv_nodes    286 non-null    object
 4   node_caps    278 non-null    object
 5   deg_malig    286 non-null    object
 6   breast       286 non-null    object
 7   breast_quad  285 non-null    object
 8   irradiat     286 non-null    object
 9   Class        286 non-null    object
dtypes: object(10)
memory usage: 22.5+ KB


In [41]:
data.describe()

Unnamed: 0,age,menopause,tumor_size,inv_nodes,node_caps,deg_malig,breast,breast_quad,irradiat,Class
count,286,286,286,286,278,286,286,285,286,286
unique,6,3,11,7,2,3,2,5,2,2
top,'50-59','premeno','30-34','0-2','no','2','left','left_low','no','no-recurrence-events'
freq,96,150,60,213,222,130,152,110,218,201


In [42]:
data = data.drop('node_caps', axis=1); data.head()   # missing values in this feature

Unnamed: 0,age,menopause,tumor_size,inv_nodes,deg_malig,breast,breast_quad,irradiat,Class
0,'40-49','premeno','15-19','0-2','3','right','left_up','no','recurrence-events'
1,'50-59','ge40','15-19','0-2','1','right','central','no','no-recurrence-events'
2,'50-59','ge40','35-39','0-2','2','left','left_low','no','recurrence-events'
3,'40-49','premeno','35-39','0-2','3','right','left_low','yes','no-recurrence-events'
4,'40-49','premeno','30-34','3-5','2','left','right_up','no','recurrence-events'


In [43]:
data = data.dropna() ; data.head().T

Unnamed: 0,0,1,2,3,4
age,'40-49','50-59','50-59','40-49','40-49'
menopause,'premeno','ge40','ge40','premeno','premeno'
tumor_size,'15-19','15-19','35-39','35-39','30-34'
inv_nodes,'0-2','0-2','0-2','0-2','3-5'
deg_malig,'3','1','2','3','2'
breast,'right','right','left','right','left'
breast_quad,'left_up','central','left_low','left_low','right_up'
irradiat,'no','no','no','yes','no'
Class,'recurrence-events','no-recurrence-events','recurrence-events','no-recurrence-events','recurrence-events'


In [None]:
data.describe()

Unnamed: 0,age,menopause,tumor_size,inv_nodes,deg_malig,breast,breast_quad,irradiat,Class
count,285,285,285,285,285,285,285,285,285
unique,6,3,11,7,3,2,5,2,2
top,'50-59','premeno','30-34','0-2','2','left','left_low','no','no-recurrence-events'
freq,95,150,59,212,130,151,110,217,201


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 285 entries, 0 to 285
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          285 non-null    object
 1   menopause    285 non-null    object
 2   tumor_size   285 non-null    object
 3   inv_nodes    285 non-null    object
 4   deg_malig    285 non-null    object
 5   breast       285 non-null    object
 6   breast_quad  285 non-null    object
 7   irradiat     285 non-null    object
 8   Class        285 non-null    object
dtypes: object(9)
memory usage: 22.3+ KB


In [None]:
dataset = data.values
X = dataset[:, :-1]
y = dataset[:,-1]

In [None]:
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, OneHotEncoder
# ordinal encoder
oe = OrdinalEncoder()
X_enc = oe.fit_transform(X)
le = LabelEncoder()            # encoding y is optional; scikit-learn classifiers accept string labels
y_enc = le.fit_transform(y)

In [None]:
oe.categories_ , le.classes_

([array(["'20-29'", "'30-39'", "'40-49'", "'50-59'", "'60-69'", "'70-79'"],
        dtype=object),
  array(["'ge40'", "'lt40'", "'premeno'"], dtype=object),
  array(["'0-4'", "'10-14'", "'15-19'", "'20-24'", "'25-29'", "'30-34'",
         "'35-39'", "'40-44'", "'45-49'", "'5-9'", "'50-54'"], dtype=object),
  array(["'0-2'", "'12-14'", "'15-17'", "'24-26'", "'3-5'", "'6-8'",
         "'9-11'"], dtype=object),
  array(["'1'", "'2'", "'3'"], dtype=object),
  array(["'left'", "'right'"], dtype=object),
  array(["'central'", "'left_low'", "'left_up'", "'right_low'",
         "'right_up'"], dtype=object),
  array(["'no'", "'yes'"], dtype=object)],
 array(["'no-recurrence-events'", "'recurrence-events'"], dtype=object))

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [None]:
X_train_enc, X_test_enc, y_train_enc, y_test_enc = train_test_split(X_enc, y_enc, test_size=0.33, random_state=1)

In [None]:
X_train.shape, X_test.shape

((190, 8), (95, 8))

In [None]:
X_train[:5]

array([["'50-59'", "'premeno'", "'50-54'", "'0-2'", "'2'", "'right'",
        "'left_up'", "'yes'"],
       ["'30-39'", "'premeno'", "'20-24'", "'0-2'", "'2'", "'left'",
        "'right_low'", "'no'"],
       ["'40-49'", "'ge40'", "'20-24'", "'3-5'", "'3'", "'right'",
        "'left_low'", "'yes'"],
       ["'60-69'", "'ge40'", "'30-34'", "'0-2'", "'3'", "'left'",
        "'left_low'", "'no'"],
       ["'50-59'", "'ge40'", "'20-24'", "'0-2'", "'3'", "'right'",
        "'left_up'", "'no'"]], dtype=object)

In [None]:
X_train_enc[:5]

array([[ 3.,  2., 10.,  0.,  1.,  1.,  2.,  1.],
       [ 1.,  2.,  3.,  0.,  1.,  0.,  3.,  0.],
       [ 2.,  0.,  3.,  4.,  2.,  1.,  1.,  1.],
       [ 4.,  0.,  5.,  0.,  2.,  0.,  1.,  0.],
       [ 3.,  0.,  3.,  0.,  2.,  1.,  2.,  0.]])

## Ordinal encode categorical data
- sometimes referred to simply as an integer encoding
- We can use the OrdinalEncoder() from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.
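
Specifying the order matters for range labels, because the default order is the sorted order of the strings. A minimal sketch with a few hypothetical range labels (written without the extra quote characters that appear in this dataset):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of range labels, similar in form to the tumor-size column.
labels = ["10-14", "0-4", "5-9", "50-54", "5-9"]
sizes = np.array(labels).reshape(-1, 1)

# Default: categories are sorted as strings, so "5-9" lands after "10-14".
default_enc = OrdinalEncoder().fit(sizes)
print(list(default_enc.categories_[0]))

# Explicit order: sort the labels by their numeric lower bound.
ordered_labels = sorted(set(labels), key=lambda s: int(s.split("-")[0]))
ordered_enc = OrdinalEncoder(categories=[ordered_labels]).fit(sizes)
ordered_codes = ordered_enc.transform(sizes).ravel()
print(ordered_labels)
print(ordered_codes)
```

With the explicit order, a larger code always means a larger range, which is what a linear model implicitly assumes about an ordinal feature.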

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_enc, y_train_enc)
model.score(X_test_enc, y_test_enc)

0.7894736842105263

In [None]:
X_train_enc[:5], y_train_enc[:5]

(array([[ 3.,  2., 10.,  0.,  1.,  1.,  2.,  1.],
        [ 1.,  2.,  3.,  0.,  1.,  0.,  3.,  0.],
        [ 2.,  0.,  3.,  4.,  2.,  1.,  1.,  1.],
        [ 4.,  0.,  5.,  0.,  2.,  0.,  1.,  0.],
        [ 3.,  0.,  3.,  0.,  2.,  1.,  2.,  0.]]),
 array([0, 0, 1, 0, 0]))

In [None]:
print(X_train.dtype, y_train.dtype, X_train.shape, y_train.shape)
X_train_enc.dtype, y_train_enc.dtype, X_train_enc.shape, y_train_enc.shape

object object (190, 8) (190,)


(dtype('float64'), dtype('int64'), (190, 8), (190,))

# One-hot encoding
- A one hot encoding is appropriate for categorical data where no relationship exists between categories.
- It involves representing each categorical variable with a binary vector that has one element for each unique label and marking the class label with a 1 and all other elements 0.
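
The binary-vector representation can be built by hand in a few lines of NumPy. This is a sketch for illustration only; the `labels` array is a hypothetical sample.

```python
import numpy as np

labels = np.array(["red", "green", "blue", "green"])   # hypothetical sample

# np.unique returns the sorted unique labels and, for each sample, its index among them.
uniques, idx = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the binary vector with a single 1 in position i.
one_hot = np.eye(len(uniques), dtype=int)[idx]

print(uniques)
print(one_hot)
```

Each row contains exactly one 1, so no category appears numerically closer to another.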

In [None]:
ohe = OneHotEncoder()
X_enc = ohe.fit_transform(X)
le = LabelEncoder()            # encoding y is optional; scikit-learn classifiers accept string labels
y_enc = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X_enc, y_enc,
                                                    test_size=0.33, random_state=1)

model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.7263157894736842

In [None]:
X.shape, X_enc.shape

((285, 8), (285, 39))

In [None]:
print(X_enc[:5].toarray())

[[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0.]]


In [None]:
ohe.categories_

[array(["'20-29'", "'30-39'", "'40-49'", "'50-59'", "'60-69'", "'70-79'"],
       dtype=object),
 array(["'ge40'", "'lt40'", "'premeno'"], dtype=object),
 array(["'0-4'", "'10-14'", "'15-19'", "'20-24'", "'25-29'", "'30-34'",
        "'35-39'", "'40-44'", "'45-49'", "'5-9'", "'50-54'"], dtype=object),
 array(["'0-2'", "'12-14'", "'15-17'", "'24-26'", "'3-5'", "'6-8'",
        "'9-11'"], dtype=object),
 array(["'1'", "'2'", "'3'"], dtype=object),
 array(["'left'", "'right'"], dtype=object),
 array(["'central'", "'left_low'", "'left_up'", "'right_low'",
        "'right_up'"], dtype=object),
 array(["'no'", "'yes'"], dtype=object)]

In [None]:
ohe.get_feature_names_out()

array(["x0_'20-29'", "x0_'30-39'", "x0_'40-49'", "x0_'50-59'",
       "x0_'60-69'", "x0_'70-79'", "x1_'ge40'", "x1_'lt40'",
       "x1_'premeno'", "x2_'0-4'", "x2_'10-14'", "x2_'15-19'",
       "x2_'20-24'", "x2_'25-29'", "x2_'30-34'", "x2_'35-39'",
       "x2_'40-44'", "x2_'45-49'", "x2_'5-9'", "x2_'50-54'", "x3_'0-2'",
       "x3_'12-14'", "x3_'15-17'", "x3_'24-26'", "x3_'3-5'", "x3_'6-8'",
       "x3_'9-11'", "x4_'1'", "x4_'2'", "x4_'3'", "x5_'left'",
       "x5_'right'", "x6_'central'", "x6_'left_low'", "x6_'left_up'",
       "x6_'right_low'", "x6_'right_up'", "x7_'no'", "x7_'yes'"],
      dtype=object)

# Exercise

In [None]:
X.shape, X[0],

((285, 8),
 array(["'40-49'", "'premeno'", "'15-19'", "'0-2'", "'3'", "'right'",
        "'left_up'", "'no'"], dtype=object))

In [None]:
X1 = OrdinalEncoder().fit_transform(X)
X1.shape, X1[0]

((285, 8), array([2., 2., 2., 0., 2., 1., 2., 0.]))

In [None]:
X2 = OneHotEncoder().fit_transform(X)
X2.shape, X2[0].toarray()

((285, 39),
 array([[0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.,
         0., 0., 1., 0., 0., 1., 0.]]))

### for data handling


In [None]:
data.head()

Unnamed: 0,age,menopause,tumor_size,inv_nodes,deg_malig,breast,breast_quad,irradiat,Class
0,'40-49','premeno','15-19','0-2','3','right','left_up','no','recurrence-events'
1,'50-59','ge40','15-19','0-2','1','right','central','no','no-recurrence-events'
2,'50-59','ge40','35-39','0-2','2','left','left_low','no','recurrence-events'
3,'40-49','premeno','35-39','0-2','3','right','left_low','yes','no-recurrence-events'
4,'40-49','premeno','30-34','3-5','2','left','right_up','no','recurrence-events'


In [None]:
# each list gives the logical order of one column's labels (values keep their quote characters)
age_order = ["'20-29'", "'30-39'", "'40-49'", "'50-59'", "'60-69'", "'70-79'"]
menopause_order = ["'ge40'", "'lt40'", "'premeno'"]   # alphabetical; not a clearly ordinal feature
tumor_size_order = ["'0-4'", "'5-9'", "'10-14'", "'15-19'", "'20-24'", "'25-29'",
                    "'30-34'", "'35-39'", "'40-44'", "'45-49'", "'50-54'"]
inv_nodes_order = ["'0-2'", "'3-5'", "'6-8'", "'9-11'", "'12-14'", "'15-17'", "'24-26'"]
deg_malig_order = ["'1'", "'2'", "'3'"]

In [None]:
# assign with bracket notation so each ordered categorical replaces the existing column
data['age'] = pd.Categorical(data['age'], age_order, ordered=True)
data['menopause'] = pd.Categorical(data['menopause'], menopause_order, ordered=True)
data['tumor_size'] = pd.Categorical(data['tumor_size'], tumor_size_order, ordered=True)
data['inv_nodes'] = pd.Categorical(data['inv_nodes'], inv_nodes_order, ordered=True)
data['deg_malig'] = pd.Categorical(data['deg_malig'], deg_malig_order, ordered=True)

In [None]:
data.dtypes

Unnamed: 0,0
age,category
menopause,category
tumor_size,category
inv_nodes,category
deg_malig,category
breast,object
breast_quad,object
irradiat,object
Class,object
