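### Parkinson's disease data
- A table of records describing the state of each patient's brain scan, with each patient's `status` added (1: diagnosed with Parkinson's disease, 0: no Parkinson's disease)
- (data/parkinsons.csv)
1. Build a model that predicts Parkinson's disease using logistic regression
2. Select the three variables that most influence the prediction, in order of importance
3. Write the diagnostic criterion as a function (function name = `cutoff`, parameter name = `threshold`) and compare the F1-scores when the threshold is 0.5 and when it is 0.8
    - Analysis conditions
        - Drop the unneeded column `name`
        - Normalize the data with a min-max scaler
        - Add a constant term for the logistic regression
        - Convert `status` to a categorical type
        - Split the training and test sets 9:1
        - Use logistic regression as the model
        - Use "bfgs" as the model's optimization method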
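
The constant-term and "bfgs" conditions above can be illustrated with a small, self-contained sketch. It fits an unpenalized logistic regression by minimizing the negative log-likelihood with SciPy's BFGS optimizer. The data are synthetic and the helper names (`neg_log_likelihood`, `gradient`) are made up for this example; it is not the notebook's own solution.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid


def neg_log_likelihood(beta, X, y):
    """Negative log-likelihood of a logistic regression with coefficients beta."""
    z = X @ beta
    # log(1 + exp(z)) evaluated stably as logaddexp(0, z)
    return np.sum(np.logaddexp(0.0, z) - y * z)


def gradient(beta, X, y):
    """Gradient of the negative log-likelihood: X^T (p - y)."""
    return X.T @ (expit(X @ beta) - y)


# Synthetic data (an assumption for illustration, not the Parkinson's table).
rng = np.random.default_rng(0)
n = 300
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])  # explicit constant term in column 0
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=n) < expit(X @ true_beta)).astype(float)

# Fit with the BFGS quasi-Newton method, starting from all-zero coefficients.
start = np.zeros(X.shape[1])
result = minimize(neg_log_likelihood, start, args=(X, y), jac=gradient, method="BFGS")
beta_hat = result.x
print("estimated coefficients:", beta_hat)
```

statsmodels exposes the same optimizer through `Logit(...).fit(method='bfgs')`, which is probably what the condition has in mind; scikit-learn's `LogisticRegression` offers `lbfgs` but no plain `bfgs` solver, and it applies L2 regularization by default, so its coefficients will generally differ from an unpenalized fit.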

In [5]:
import pandas as pd

data=pd.read_csv("../../csv/parkinsons.csv")

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [11]:
data

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,119.992,157.302,74.997,0.00784,0.00007,0.00370,0.00554,0.01109,0.04374,0.426,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,122.400,148.650,113.819,0.00968,0.00008,0.00465,0.00696,0.01394,0.06134,0.626,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674
2,116.682,131.111,111.555,0.01050,0.00009,0.00544,0.00781,0.01633,0.05233,0.482,...,0.08270,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,116.676,137.871,111.366,0.00997,0.00009,0.00502,0.00698,0.01505,0.05492,0.517,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,0.584,...,0.10470,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.332180,0.410335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,174.188,230.978,94.261,0.00459,0.00003,0.00263,0.00259,0.00790,0.04087,0.405,...,0.07008,0.02764,19.517,0,0.448439,0.657899,-6.538586,0.121952,2.657476,0.133050
191,209.516,253.017,89.488,0.00564,0.00003,0.00331,0.00292,0.00994,0.02751,0.263,...,0.04812,0.01810,19.147,0,0.431674,0.683244,-6.195325,0.129303,2.784312,0.168895
192,174.688,240.005,74.287,0.01360,0.00008,0.00624,0.00564,0.01873,0.02308,0.256,...,0.03804,0.10715,17.883,0,0.407567,0.655683,-6.787197,0.158453,2.679772,0.131728
193,198.764,396.961,74.904,0.00740,0.00004,0.00370,0.00390,0.01109,0.02296,0.241,...,0.03794,0.07223,19.020,0,0.451221,0.643956,-6.744577,0.207454,2.138608,0.123306


In [None]:
# Drop the unneeded column (left commented out: the loaded data shows no 'name' column)
# data.drop(columns=['name'], inplace=True)
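
The displayed table has an `Unnamed: 0` column, which is what pandas typically produces when a CSV was saved together with its row index. Assuming that is the case here, the sketch below shows one way to keep that index out of the feature columns and to drop `name` only when it exists. The CSV text is a tiny made-up sample, not the real file.

```python
import io

import pandas as pd

# Made-up sample mimicking a CSV written with its row index (note the empty first header).
csv_text = ",MDVP:Fo(Hz),status\n0,119.992,1\n1,122.400,1\n2,174.188,0\n"

# Default read: the unnamed index column becomes a regular column called 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Reading it as the index keeps it out of the features.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

# errors='ignore' makes the drop a no-op when 'name' is absent.
clean = clean.drop(columns=["name"], errors="ignore")
print(list(clean.columns))
```

If the index column is left in place, it is scaled and passed to the model like any other feature, which can matter when the rows are ordered by class.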

In [12]:
# Normalize the data with min-max scaling
# Note: this scales every column, including 'Unnamed: 0' and status (already 0/1)
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

# Add a constant term
data_scaled['intercept'] = 1

# Convert status to a categorical type
data_scaled['status'] = data_scaled['status'].astype('category')

# Split into training and test sets (9:1)
X = data_scaled.drop(columns=['status'])
y = data_scaled['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Create the logistic regression model
# scikit-learn has no 'bfgs' solver, so 'liblinear' is used here instead
log_reg_model = LogisticRegression(solver='liblinear')

# Train the model
log_reg_model.fit(X_train, y_train)

# Compute feature importance as the absolute value of each coefficient
feature_importances = abs(log_reg_model.coef_[0])
feature_importance_df = pd.DataFrame({'feature': X.columns, 'importance': feature_importances})
top_3_features = feature_importance_df.nlargest(3, 'importance')['feature'].tolist()
print("Variables with the most influence on predicting Parkinson's disease (top 3):", top_3_features)

# Function implementing the Parkinson's disease diagnostic cutoff
def cutoff(threshold):
    # Predict 1 when the probability of class 1 is at or above the threshold
    y_pred_train = (log_reg_model.predict_proba(X_train)[:, 1] >= threshold).astype(int)
    y_pred_test = (log_reg_model.predict_proba(X_test)[:, 1] >= threshold).astype(int)
    f1_train = f1_score(y_train, y_pred_train)
    f1_test = f1_score(y_test, y_pred_test)
    return f1_train, f1_test

# Compare the F1-scores at threshold 0.5 and 0.8
f1_train_05, f1_test_05 = cutoff(0.5)
f1_train_08, f1_test_08 = cutoff(0.8)
print("Threshold 0.5 - Train F1-score:", f1_train_05, "Test F1-score:", f1_test_05)
print("Threshold 0.8 - Train F1-score:", f1_train_08, "Test F1-score:", f1_test_08)

Variables with the most influence on predicting Parkinson's disease (top 3): ['spread1', 'PPE', 'spread2']
Threshold 0.5 - Train F1-score: 0.9057971014492754 Test F1-score: 0.918918918918919
Threshold 0.8 - Train F1-score: 0.7962962962962963 Test F1-score: 0.7333333333333333
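
The drop in F1-score at the higher cutoff can be traced by hand on a small example. The sketch below uses made-up probabilities and labels (not the model's actual predictions) and a hypothetical helper, `f1_at_threshold`, that computes F1 from the true positive, false positive, and false negative counts.

```python
def f1_at_threshold(probs, labels, threshold):
    """Return the F1-score of the positive class (label 1) at the given cutoff."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred == 1 and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred == 1 and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if pred == 0 and y == 1)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up predicted probabilities and true labels, for illustration only.
probs = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 1, 0, 1, 0, 0]

f1_05 = f1_at_threshold(probs, labels, 0.5)  # tp=4, fp=1, fn=1 -> F1 = 0.8
f1_08 = f1_at_threshold(probs, labels, 0.8)  # tp=2, fp=0, fn=3 -> F1 = 4/7
print(f1_05, f1_08)
```

In this toy case the stricter cutoff removes the single false positive but misses three of the five true cases, so the loss in recall outweighs the gain in precision. The same trade-off is consistent with the lower scores printed above for threshold 0.8.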
