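# [Assignment 3] Logistic Regression
### - Run a logistic regression analysis using the sklearn package.
### - Compute performance metrics and interpret them.
### - Try to improve performance. (State which performance metric you targeted and why.)
### - Please add detailed comments explaining your steps and reasoning. :)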

## Data

Source: https://www.kaggle.com/mlg-ulb/creditcardfraud


* V1 ~ V28 : de-identified personal information
* **Class** : target variable  
  - 1 : fraudulent transactions
  - 0 : otherwise

In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings(action='ignore')

In [3]:
data = pd.read_csv("/content/sample_data/assignment3_creditcard.csv")

In [4]:
data.head()

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class
0,-1.848212,2.3849,0.379573,1.048381,-0.84507,2.537837,-4.542983,-10.201458,-1.504967,-2.234167,...,2.585817,-5.29169,0.859364,0.423231,-0.506985,1.020052,-0.627751,-0.017753,0.280982,0
1,2.071805,-0.477943,-1.444444,-0.548657,0.010036,-0.582242,-0.042878,-0.24716,1.171923,-0.342382,...,-0.077306,0.042858,0.390125,0.041569,0.598427,0.098803,0.979686,-0.093244,-0.065615,0
2,-2.985294,-2.747472,1.194068,-0.003036,-1.151041,-0.263559,0.5535,0.6356,0.438545,-1.806488,...,1.345776,0.37376,-0.385777,1.197596,0.407229,0.008013,0.762362,-0.299024,-0.303929,0
3,-1.479452,1.542874,0.290895,0.838142,-0.52929,-0.717661,0.484516,0.545092,-0.780767,0.324804,...,0.038397,0.116771,0.40556,-0.116453,0.541275,-0.216665,-0.415578,0.027126,-0.150347,0
4,-0.281976,-0.309699,-2.162299,-0.851514,0.106167,-1.483888,1.930994,-0.843049,-1.249272,1.079608,...,-0.875516,-0.004199,1.015108,-0.026748,0.077115,-1.468822,0.7517,0.496732,0.331001,0


In [28]:
# Check for missing values in each column
data.isnull().sum()

V1       0
V2       0
V3       0
V4       0
V5       0
V6       0
V7       0
V8       0
V9       0
V10      0
V11      0
V12      0
V13      0
V14      0
V15      0
V16      0
V17      0
V18      0
V19      0
V20      0
V21      0
V22      0
V23      0
V24      0
V25      0
V26      0
V27      0
V28      0
Class    0
dtype: int64

In [8]:
X = data.drop(columns=['Class'])
y = data['Class']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
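
With so few fraud cases, a purely random split can leave the train and test sets with different fraud rates. Passing `stratify=y` to `train_test_split` keeps the class ratio the same in both. The sketch below uses synthetic labels (`X_demo` and `y_demo` are made up, not the assignment data) to show the effect; the results reported in this notebook come from the unstratified split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 1000 rows, 2% positives (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 20 + [0] * 980)

# stratify=y_demo preserves the positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare the positive rate in the train and test splits
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```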

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

# Baseline: logistic regression with default settings (no class weighting)
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

# Hard 0/1 predictions at the default 0.5 probability threshold
y_pred = logreg.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy:", accuracy_score(y_test, y_pred))
# Note: this score is computed from hard labels, so it equals the mean of the two class recalls.
# Passing logreg.predict_proba(X_test)[:, 1] instead would give a threshold-independent ROC AUC.
print("ROC AUC Score:", roc_auc_score(y_test, y_pred))

Confusion Matrix:
 [[5688    0]
 [   7   41]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      5688
           1       1.00      0.85      0.92        48

    accuracy                           1.00      5736
   macro avg       1.00      0.93      0.96      5736
weighted avg       1.00      1.00      1.00      5736


Accuracy: 0.9987796373779637
ROC AUC Score: 0.9270833333333333


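The baseline's accuracy is high (about 0.999), but recall for fraudulent transactions (class 1) is only 0.85: 7 of the 48 fraud cases in the test set are missed. Frauds make up under 1% of the test set, so accuracy is dominated by the majority class and says little about fraud detection. Missing a fraud is the costly error here, so recall on class 1 is the metric targeted for improvement, by adjusting the class weights.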
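
As a check on the numbers above, the reported metrics can be recomputed by hand from the baseline confusion matrix. The sketch below copies the four counts from the output; the variable names are only illustrative. It also shows that the "ROC AUC" printed from hard labels is the average of recall and specificity.

```python
import numpy as np

# Baseline confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[5688, 0],
               [7, 41]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all predictions that are correct
precision = tp / (tp + fp)             # of the flagged transactions, how many are fraud
recall = tp / (tp + fn)                # of the actual frauds, how many are caught
specificity = tn / (tn + fp)           # of the legitimate transactions, how many pass
balanced = (recall + specificity) / 2  # matches roc_auc_score on hard 0/1 labels

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.4f} mean of recalls={balanced:.4f}")
```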

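Setting class_weight to 'balanced' weights each class inversely to its frequency, so errors on the rare fraud class count for more in the loss and the model learns better from the imbalanced dataset.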
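
To make the weighting concrete, scikit-learn's 'balanced' mode uses `n_samples / (n_classes * count_per_class)`. The sketch below reproduces that with `compute_class_weight` on made-up labels (`y_demo` is illustrative, not the assignment data).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 90 negatives and 10 positives (not the assignment data)
y_demo = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same formula written out by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The rarer class gets the larger weight, so each misclassified positive costs more during fitting.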

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

# Same model, but each class is weighted inversely to its frequency
lr_model = LogisticRegression(class_weight='balanced', random_state=42)
lr_model.fit(X_train, y_train)

# Hard 0/1 predictions at the default 0.5 probability threshold
y_pred_lr = lr_model.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("\nClassification Report:\n", classification_report(y_test, y_pred_lr))
print("\nAccuracy:", accuracy_score(y_test, y_pred_lr))
# As in the baseline cell, this score is computed from hard labels
print("ROC AUC Score:", roc_auc_score(y_test, y_pred_lr))

Confusion Matrix:
 [[5578  110]
 [   3   45]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99      5688
           1       0.29      0.94      0.44        48

    accuracy                           0.98      5736
   macro avg       0.64      0.96      0.72      5736
weighted avg       0.99      0.98      0.99      5736


Accuracy: 0.9802998605299861
ROC AUC Score: 0.9590805203938115
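
The weighted model raises fraud recall from 0.85 to 0.94 (45 of 48 caught), but precision on class 1 falls from 1.00 to 0.29 because 110 legitimate transactions are now flagged. One way to explore this trade-off is to vary the decision threshold on the predicted probabilities instead of using the default 0.5, and to score ROC AUC on probabilities rather than hard labels. A rough sketch on synthetic data follows; `make_classification` stands in for the assignment file, and the 0.90 recall target is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the credit card file (assumption)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)
model.fit(X_tr, y_tr)

# Probability of the positive class, used for both AUC and threshold selection
proba = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)

# precision and recall have one more entry than thresholds; drop the last point to align
precision, recall, thresholds = precision_recall_curve(y_te, proba)
target_recall = 0.90  # arbitrary target for this sketch
ok = recall[:-1] >= target_recall

# Among thresholds that meet the recall target, pick the one with the best precision
best = np.argmax(np.where(ok, precision[:-1], -1.0))
threshold = thresholds[best]

# Apply the chosen threshold to get hard predictions
y_pred_thr = (proba >= threshold).astype(int)
print(f"AUC={auc:.3f} threshold={threshold:.3f} "
      f"precision={precision[best]:.3f} recall={recall[best]:.3f}")
```

In practice the threshold would be chosen on a validation split rather than the test set, so that the final test metrics stay unbiased.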
