In [1]:
import pandas as pd
from statsmodels.api import Logit

# Feature Selection

In [2]:
clinical_info = pd.read_csv("clinical_info.csv")
clinical_info['PatientID']=clinical_info['PatientID'].str.slice(start=-3)
clinical_info.head()

Unnamed: 0,PatientID,age,clinical.T.Stage,Clinical.N.Stage,Overall.Stage,gender,two-year.survival
0,4,70,2,1,II,male,dead
1,5,80,4,2,IIIb,male,dead
2,6,73,3,1,IIIa,male,dead
3,7,81,2,2,IIIa,male,dead
4,8,71,2,2,IIIa,male,dead


In [3]:
# clinical_info['clinical.T.Stage'] = clinical_info['clinical.T.Stage'].astype(str)
# clinical_info['Clinical.N.Stage'] = clinical_info['Clinical.N.Stage'].astype(str)
# clinical_info.info()

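clinical.T.Stage and Clinical.N.Stage are ordinal variables from the TNM staging system. They are categorical in nature but stored as integers. The cell above would convert them to categorical (string) values, but it is commented out, so both columns stay numeric (int64) for the rest of this notebook.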
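If the conversion were enabled, an ordered categorical would keep the stage ranking that a plain string cast loses. The following is a minimal sketch, not the notebook's method: it uses the five T-stage values visible in the `head()` output rather than the full column, and the names `t_stage`, `levels`, and `t_stage_cat` are hypothetical.

```python
import pandas as pd

# The five T-stage values shown in the head() output; the real data is
# clinical_info['clinical.T.Stage'].
t_stage = pd.Series([2, 4, 3, 2, 2], name="clinical.T.Stage")

# Build an ordered categorical so that 2 < 3 < 4 is preserved.
levels = sorted(t_stage.unique())
t_stage_cat = pd.Series(
    pd.Categorical(t_stage, categories=levels, ordered=True),
    name=t_stage.name,
)

print(t_stage_cat.dtype)
print(t_stage_cat.min(), t_stage_cat.max())
```

An ordered categorical still supports comparisons such as `min()` and `max()`, which a column cast with `astype(str)` would only support lexicographically.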

In [3]:
clinical_info['Overall.Stage'] = clinical_info['Overall.Stage'].map({"I":1, "II":2, "IIIa":3, "IIIb":4})
clinical_info['Overall.Stage'].value_counts().sort_index()

1    17
2    15
3    30
4    38
Name: Overall.Stage, dtype: int64

Survival decreases from stage I through II, IIIa, and IIIb, so Overall.Stage is mapped to the ordinal codes 1, 2, 3, and 4.
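`Series.map` with a dictionary silently turns any label missing from the dictionary into NaN, so it can be worth checking for unmapped labels. Here the counts above sum to 100, so none were lost. The sketch below uses hypothetical labels (including an `"IV"` that the mapping does not cover) purely to illustrate the check.

```python
import pandas as pd

stage_order = {"I": 1, "II": 2, "IIIa": 3, "IIIb": 4}

# Hypothetical labels; "IV" is deliberately absent from stage_order.
stages = pd.Series(["II", "IIIb", "IIIa", "I", "IV"])

encoded = stages.map(stage_order)

# Labels that produced NaN because the mapping has no entry for them.
unmapped = stages[encoded.isna()].unique().tolist()
print(unmapped)
```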

In [4]:
clinical_info['two-year.survival'] = clinical_info['two-year.survival'].map({"dead":0, "survived":1})
clinical_info['two-year.survival'].value_counts()

0    68
1    32
Name: two-year.survival, dtype: int64

The target variable, two-year.survival, records whether the patient survived for two years. Accordingly, survived is mapped to 1 and dead to 0.
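The class counts above already determine the log-likelihood of an intercept-only logistic model, which is the `LL-Null` value that the statsmodels summaries in this notebook report. A small sketch of that arithmetic, using only the 68/32 split (variable names are illustrative):

```python
import math

# Counts taken from the value_counts() output: 68 dead (0), 32 survived (1).
n_dead, n_survived = 68, 32
n = n_dead + n_survived

# An intercept-only model predicts the sample survival rate for everyone.
p = n_survived / n

# Bernoulli log-likelihood summed over all observations.
ll_null = n_survived * math.log(p) + n_dead * math.log(1 - p)
print(round(ll_null, 3))
```

The result should agree with the `LL-Null` of -62.687 shown in the Logit summaries, since both fitted models are compared against this same baseline.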

In [5]:
clinical_info.columns

Index(['PatientID', 'age', 'clinical.T.Stage', 'Clinical.N.Stage',
       'Overall.Stage', 'gender', 'two-year.survival'],
      dtype='object')

In [6]:
clinical_info.columns = ['PatientID', 'age', 'clinical_T_Stage', 'Clinical_N_Stage', 'Overall_Stage', 'gender', 'two_year_survival']
clinical_info.columns

Index(['PatientID', 'age', 'clinical_T_Stage', 'Clinical_N_Stage',
       'Overall_Stage', 'gender', 'two_year_survival'],
      dtype='object')

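The regression below uses the formula interface, which interprets characters such as . and - as Python syntax rather than as part of a variable name, so they are replaced with _ in the column names.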
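The same renaming can be done without typing each name by hand. This is an alternative sketch, not what the notebook does; it applies a regular expression to the original column names, and `cols` and `safe_cols` are hypothetical names.

```python
import pandas as pd

# Original column names as shown by clinical_info.columns.
cols = pd.Index(['PatientID', 'age', 'clinical.T.Stage', 'Clinical.N.Stage',
                 'Overall.Stage', 'gender', 'two-year.survival'])

# Replace every "." or "-" with "_" so the names are valid in a formula.
safe_cols = cols.str.replace(r"[.\-]", "_", regex=True)
print(list(safe_cols))
```

Assigning `safe_cols` back to `clinical_info.columns` would give the same names as the manual list above.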

In [7]:
clinical_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          100 non-null    object
 1   age                100 non-null    int64 
 2   clinical_T_Stage   100 non-null    int64 
 3   Clinical_N_Stage   100 non-null    int64 
 4   Overall_Stage      100 non-null    int64 
 5   gender             100 non-null    object
 6   two_year_survival  100 non-null    int64 
dtypes: int64(5), object(2)
memory usage: 5.6+ KB


## CASE 1) age, clinical.T.Stage, Clinical.N.Stage, gender

In [8]:
clinical_info_case1 = clinical_info.drop("Overall_Stage", axis=1)
clinical_info_case1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          100 non-null    object
 1   age                100 non-null    int64 
 2   clinical_T_Stage   100 non-null    int64 
 3   Clinical_N_Stage   100 non-null    int64 
 4   gender             100 non-null    object
 5   two_year_survival  100 non-null    int64 
dtypes: int64(4), object(2)
memory usage: 4.8+ KB


Case 1 uses only age, T stage, N stage, and gender, so the remaining explanatory variable (Overall_Stage) is excluded from the analysis.

In [9]:
log_model = Logit.from_formula("""two_year_survival ~ age + clinical_T_Stage + Clinical_N_Stage + C(gender)""", clinical_info_case1)
log_result = log_model.fit()
print(log_result.summary())

Optimization terminated successfully.
         Current function value: 0.605220
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:      two_year_survival   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Fri, 14 Jan 2022   Pseudo R-squ.:                 0.03454
Time:                        21:47:10   Log-Likelihood:                -60.522
converged:                       True   LL-Null:                       -62.687
Covariance Type:            nonrobust   LLR p-value:                    0.3632
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             1.3969      2.052      0.681      0.496      -2.626       5.420
C(gender

In [10]:
log_result.pvalues

Intercept            0.496141
C(gender)[T.male]    0.233228
age                  0.257422
clinical_T_Stage     0.389106
Clinical_N_Stage     0.377412
dtype: float64

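In the coefficient tests, no variable is significant at the 0.1 level. At the 0.3 level, gender (p = 0.233) and age (p = 0.257) are significant, and at the 0.4 level clinical_T_Stage, Clinical_N_Stage, gender, and age all are. In other words, feature selection at the 0.4 level keeps every variable. Note that no female term appears because female is the reference level: the C(gender)[T.male] coefficient is the effect of male relative to female.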
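The threshold rule described here can be written as a small helper. This is a sketch of the selection logic only: the p-values are copied from the `log_result.pvalues` output above, and `select_by_pvalue` is a hypothetical function name, not part of statsmodels.

```python
def select_by_pvalue(pvalues, alpha):
    """Return the non-intercept terms whose p-value is below alpha."""
    return [name for name, p in pvalues.items()
            if name != "Intercept" and p < alpha]

# P-values copied from the case 1 output.
case1_pvalues = {
    "Intercept": 0.496141,
    "C(gender)[T.male]": 0.233228,
    "age": 0.257422,
    "clinical_T_Stage": 0.389106,
    "Clinical_N_Stage": 0.377412,
}

for alpha in (0.1, 0.3, 0.4):
    print(alpha, select_by_pvalue(case1_pvalues, alpha))
```

With a fitted model, `log_result.pvalues.to_dict()` could be passed in place of the hand-copied dictionary.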

In [11]:
clinical_info_case1.to_csv("clinical_info_case1.csv", index=False)

Save the clinical information for case 1.

## CASE 2) age, Overall.Stage, gender

In [12]:
clinical_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          100 non-null    object
 1   age                100 non-null    int64 
 2   clinical_T_Stage   100 non-null    int64 
 3   Clinical_N_Stage   100 non-null    int64 
 4   Overall_Stage      100 non-null    int64 
 5   gender             100 non-null    object
 6   two_year_survival  100 non-null    int64 
dtypes: int64(5), object(2)
memory usage: 5.6+ KB


In [13]:
clinical_info_case2 = clinical_info.drop(["clinical_T_Stage", "Clinical_N_Stage"], axis=1)
clinical_info_case2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          100 non-null    object
 1   age                100 non-null    int64 
 2   Overall_Stage      100 non-null    int64 
 3   gender             100 non-null    object
 4   two_year_survival  100 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 4.0+ KB


Case 2 uses only age, Overall.Stage, and gender, so the remaining explanatory variables (the T and N stages) are excluded from the analysis.

In [14]:
log_model = Logit.from_formula("""two_year_survival ~ age + Overall_Stage + C(gender)""", clinical_info_case2)
log_result = log_model.fit()
print(log_result.summary())

Optimization terminated successfully.
         Current function value: 0.609668
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:      two_year_survival   No. Observations:                  100
Model:                          Logit   Df Residuals:                       96
Method:                           MLE   Df Model:                            3
Date:                Fri, 14 Jan 2022   Pseudo R-squ.:                 0.02744
Time:                        21:49:18   Log-Likelihood:                -60.967
converged:                       True   LL-Null:                       -62.687
Covariance Type:            nonrobust   LLR p-value:                    0.3286
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -0.0002      2.041     -0.000      1.000      -4.000       3.999
C(gender

In [15]:
log_result.pvalues

Intercept            0.999912
C(gender)[T.male]    0.279172
age                  0.559588
Overall_Stage        0.351708
dtype: float64

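The dataset has 100 observations, with 96 residual degrees of freedom and 3 model degrees of freedom. The model's explanatory power, measured by the pseudo R-squared, is about 2.7%. In the coefficient tests, no variable is significant at the 0.1 level. At the 0.3 level only gender is significant, and at the 0.5 level gender and Overall_Stage are. In other words, feature selection at the 0.5 level drops age and keeps gender and Overall_Stage.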
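The pseudo R-squared and the LLR p-value in the summary can both be recomputed from the two log-likelihoods it reports. The sketch below assumes the pseudo R-squared is McFadden's, 1 - LL/LL-Null, and uses the closed form of the chi-square survival function for exactly 3 degrees of freedom; the function names are hypothetical and the inputs are the rounded values from the summary.

```python
import math

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - LL_model / LL_null."""
    return 1.0 - ll_model / ll_null

def llr_pvalue_df3(ll_model, ll_null):
    """Likelihood-ratio test p-value, valid only for 3 degrees of freedom."""
    x = 2.0 * (ll_model - ll_null)  # likelihood-ratio statistic
    # Chi-square survival function for df = 3 in closed form.
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# Rounded values from the case 2 summary.
ll_model, ll_null = -60.967, -62.687

r2 = mcfadden_r2(ll_model, ll_null)
p_llr = llr_pvalue_df3(ll_model, ll_null)
print(round(r2, 4), round(p_llr, 3))
```

Because the inputs are rounded, the results should match the reported 0.02744 and 0.3286 only approximately. An LLR p-value this large means the model as a whole is not distinguishable from the intercept-only baseline at conventional levels.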

In [16]:
clinical_info_case2.drop("age", axis=1, inplace=True)
clinical_info_case2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          100 non-null    object
 1   Overall_Stage      100 non-null    int64 
 2   gender             100 non-null    object
 3   two_year_survival  100 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 3.2+ KB


In [17]:
clinical_info_case2.to_csv("clinical_info_case2.csv", index=False)

Save the clinical information for case 2.