## 9-5 Converting Categorical Variables to Numeric Values

In [1]:
import numpy as np
import pandas as pd

In [2]:
production = pd.read_csv('production.csv')
print(production.shape)
production.head()

(1000, 4)


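|   | type | length | thickness | fault_flg |
|---|------|--------|-----------|-----------|
| 0 | E | 274.027383 | 40.241131 | False |
| 1 | D | 86.319269 | 16.906715 | False |
| 2 | E | 123.940388 | 1.018462 | False |
| 3 | B | 175.554886 | 16.414924 | False |
| 4 | B | 244.93474 | 29.061081 | False |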


`production.query('fault_flg')` extracts only the records whose `fault_flg` is `True`.

In [3]:
fault_cnt_per_type = production.query('fault_flg').groupby('type')['fault_flg'].count()
fault_cnt_per_type

type
A    11
B     6
C    16
D     7
E    12
Name: fault_flg, dtype: int64
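
One caveat with filtering before grouping: a type that has no faulty records drops out of the result entirely, and a later lookup of that type would raise a `KeyError`. A minimal sketch of an alternative sums the boolean flag per type so that zero counts are kept. The `toy` DataFrame is made up for illustration and is not taken from `production.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for production.csv; type 'C' has no faults.
toy = pd.DataFrame({
    'type': ['A', 'A', 'B', 'B', 'C'],
    'fault_flg': [True, False, True, True, False],
})

# Filter-then-count, as in the cell above: 'C' is missing from the result.
cnt_query = toy.query('fault_flg').groupby('type')['fault_flg'].count()

# Summing the boolean flag counts the True values and keeps 'C' with 0.
cnt_sum = toy.groupby('type')['fault_flg'].sum()

print(cnt_query)
print(cnt_sum)
```

In the data used here every type has at least one fault, so both approaches give the same counts.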

In [4]:
type_cnt = production.groupby('type')['fault_flg'].count()
type_cnt

type
A    202
B    175
C    211
D    215
E    197
Name: fault_flg, dtype: int64

In [5]:
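# Leave-one-out fault rate per type: each record's own flag is excluded
# from both the fault count and the record count (columns accessed by label).
production['type_fault_rate'] = production[['type', 'fault_flg']].apply(
    lambda x: (fault_cnt_per_type[x['type']] - int(x['fault_flg'])) / (type_cnt[x['type']] - 1), axis=1)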
production.head()

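|   | type | length | thickness | fault_flg | type_fault_rate |
|---|------|--------|-----------|-----------|-----------------|
| 0 | E | 274.027383 | 40.241131 | False | 0.061224 |
| 1 | D | 86.319269 | 16.906715 | False | 0.03271 |
| 2 | E | 123.940388 | 1.018462 | False | 0.061224 |
| 3 | B | 175.554886 | 16.414924 | False | 0.034483 |
| 4 | B | 244.93474 | 29.061081 | False | 0.034483 |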
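
The same per-row calculation can probably be done without a row-wise `apply`. The sketch below is one possible vectorized variant using `groupby(...).transform`, shown on made-up data (`toy` is hypothetical); it is not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for production.csv.
toy = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B'],
    'fault_flg': [True, False, False, True, True],
})

flag = toy['fault_flg'].astype(int)
grouped = flag.groupby(toy['type'])

# transform broadcasts each group's aggregate back onto every row of that group.
fault_sum = grouped.transform('sum')    # faults in the row's type
type_size = grouped.transform('count')  # records in the row's type

# Leave-one-out rate: drop the row's own flag from numerator and denominator.
toy['type_fault_rate'] = (fault_sum - flag) / (type_size - 1)
print(toy)
```

In this sketch, a type with only one record gives 0/0 and therefore `NaN`, which would need separate handling.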


Check the processing inside the `lambda` function step by step.

In [6]:
a = fault_cnt_per_type[production['type']]
print(a.shape)
a.head()

(1000,)


type
E    12
D     7
E    12
B     6
B     6
Name: fault_flg, dtype: int64

In [7]:
b = type_cnt[production['type']]
print(b.shape)
b.head()

(1000,)


type
E    197
D    215
E    197
B    175
B    175
Name: fault_flg, dtype: int64
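
Note that `a` and `b` are indexed by the type labels rather than by row number, because a Series indexed with a sequence of labels returns those labels as its index. As a small illustration on made-up rows (`toy_type` and `toy_cnt` are hypothetical names), `Series.map` looks up the same values while keeping the original row index:

```python
import pandas as pd

# Hypothetical rows and per-type counts, for illustration only.
toy_type = pd.Series(['E', 'D', 'E', 'B'], name='type')
toy_cnt = pd.Series({'B': 6, 'D': 7, 'E': 12}, name='fault_flg')

# Label-based lookup: the result's index is the type labels (E, D, E, B).
by_label = toy_cnt[toy_type]

# map: same values, but the index stays 0..3, aligned with the rows.
by_map = toy_type.map(toy_cnt)

print(by_label)
print(by_map)
```

With `map`, the looked-up values stay aligned with the rows of the original DataFrame, so no index reset is needed before combining them.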

In [8]:
c = production['fault_flg'].astype(int)
print(c.shape)
c.head()

(1000,)


0    0
1    0
2    0
3    0
4    0
Name: fault_flg, dtype: int64

In [9]:
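# a and b share the same type-label index; replace it with the row number
df = pd.concat([a, b], axis=1).reset_index(drop=True)
df['fault'] = c
df.columns = ['fault_cnt_per_type', 'type_cnt', 'fault']
# Subtract the record itself from both the fault count and the record count
df['type_fault_rate'] = (df['fault_cnt_per_type'] - df['fault']) / (df['type_cnt'] - 1)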

print(df.shape)
df.head()

(1000, 4)


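|   | fault_cnt_per_type | type_cnt | fault | type_fault_rate |
|---|--------------------|----------|-------|-----------------|
| 0 | 12 | 197 | 0 | 0.061224 |
| 1 | 7 | 215 | 0 | 0.03271 |
| 2 | 12 | 197 | 0 | 0.061224 |
| 3 | 6 | 175 | 0 | 0.034483 |
| 4 | 6 | 175 | 0 | 0.034483 |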


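When a fault-prediction model predicts the fault flag (`fault_flg`, or `fault` in `df`), a record's own fault information must not leak into its average fault rate `type_fault_rate`. For that reason, the record's own contribution is subtracted from both the numerator and the denominator.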
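
The adjustment can be written as a formula. The notation is introduced here only for illustration: $y_i \in \{0, 1\}$ is the fault flag of record $i$, $t_i$ is its type, $F_t$ is the number of faulty records of type $t$, $N_t$ is the number of records of type $t$, and $r_i$ is the resulting `type_fault_rate`.

$$
r_i = \frac{F_{t_i} - y_i}{N_{t_i} - 1}
$$

With the counts shown earlier for type E ($F_E = 12$, $N_E = 197$), a non-faulty record gets $(12 - 0) / (197 - 1) = 12/196 \approx 0.061224$, which matches the table above. A faulty record of the same type would get $(12 - 1) / 196 = 11/196 \approx 0.056122$.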