# Logistic Regression
- In this notebook
    - we build a classification model using Logistic Regression.
- Packages
    - We use the scikit-learn package to train, evaluate, and make predictions with the classification model.
    - A detailed description of scikit-learn's `LogisticRegression` class is available in the documentation below.
        - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


## 1. Logistic Regression on "Universal Bank" dataset

### (1) Prepare example data

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/gchoi/Dataset/master/UniversalBank.csv"
bank_df = pd.read_csv(url)

In [2]:
bank_df

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,PersonalLoan,SecuritiesAccount,CDAccount,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,29,3,40,92697,1,1.9,3,0,0,0,0,1,0
4996,4997,30,4,15,92037,4,0.4,1,85,0,0,0,1,0
4997,4998,63,39,24,93023,2,0.3,3,0,0,0,0,0,0
4998,4999,65,40,49,90034,3,0.5,2,0,0,0,0,1,0


### (2) Preprocessing

In [3]:
bank_df.drop(columns=['ID', 'ZIP Code'], inplace=True)
bank_df.columns = [c.replace(' ', '_') for c in bank_df.columns]

# Treat education as categorical, convert to dummy variables
bank_df['Education'] = bank_df['Education'].astype('category')
new_categories = {1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'}
bank_df.Education = bank_df.Education.cat.rename_categories(new_categories)
bank_df = pd.get_dummies(bank_df, prefix_sep='_', drop_first=True, dtype=int)
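
The dummy-variable step can be checked on a small frame. The sketch below uses a hypothetical three-row `toy_df` (not the Universal Bank data) whose `Education` codes mirror the ones above. It shows that `drop_first=True` drops the first category (`Undergrad`), so that category becomes the baseline encoded as all zeros.

```python
import pandas as pd

# Hypothetical toy frame (assumption): only the Education codes 1/2/3 mirror the dataset.
toy_df = pd.DataFrame({"Income": [49, 100, 40], "Education": [1, 2, 3]})

# Same steps as the preprocessing cell: cast to category, then rename the codes.
toy_df["Education"] = toy_df["Education"].astype("category")
toy_df["Education"] = toy_df["Education"].cat.rename_categories(
    {1: "Undergrad", 2: "Graduate", 3: "Advanced/Professional"}
)

# drop_first=True removes the first category (Undergrad); it becomes the baseline.
toy_dummies = pd.get_dummies(toy_df, prefix_sep="_", drop_first=True, dtype=int)
print(toy_dummies)
```

A row with both `Education_*` columns equal to 0 is therefore an `Undergrad` row.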

In [4]:
bank_df

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Mortgage,PersonalLoan,SecuritiesAccount,CDAccount,Online,CreditCard,Education_Graduate,Education_Advanced/Professional
0,25,1,49,4,1.6,0,0,1,0,0,0,0,0
1,45,19,34,3,1.5,0,0,1,0,0,0,0,0
2,39,15,11,1,1.0,0,0,0,0,0,0,0,0
3,35,9,100,1,2.7,0,0,0,0,0,0,1,0
4,35,8,45,4,1.0,0,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,29,3,40,1,1.9,0,0,0,0,1,0,0,1
4996,30,4,15,4,0.4,85,0,0,0,1,0,0,0
4997,63,39,24,2,0.3,0,0,0,0,0,0,0,1
4998,65,40,49,3,0.5,0,0,0,0,1,0,1,0


In [5]:
X = bank_df.drop(columns=['PersonalLoan'])
y = bank_df['PersonalLoan']

### (3) Split the data into training and test sets

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### (4) Define and Train a logistic regression model

In [7]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### (5) Test and Evaluate the model

In [8]:
# Print the accuracy on the training data and the test data.
print("Training set score: {:.3f}".format(clf.score(X_train, y_train)))
print("Test set score: {:.3f}".format(clf.score(X_test, y_test)))

Training set score: 0.957
Test set score: 0.959


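- To evaluate how closely the model's predictions match the actual values, we compute accuracy with the model's built-in `score` method.
- The trained Logistic Regression model reaches a classification accuracy of 0.957 on the training set and 0.959 on the test set. The two scores are high and nearly identical, which suggests the model performs well and is not overfitting.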

In [9]:
# Compute the accuracy on the training data and the test data with the accuracy_score function.
from sklearn.metrics import accuracy_score

y_train_hat = clf.predict(X_train)
print(accuracy_score(y_train, y_train_hat))

y_test_hat = clf.predict(X_test)
print(accuracy_score(y_test, y_test_hat))

0.9568571428571429
0.9586666666666667
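
Both approaches return the same number because, for a classifier, `score` is the mean accuracy of `predict` against the true labels. A minimal sketch of that equivalence follows, assuming synthetic data from `make_classification` rather than the bank dataset. The names `Xa`, `ya`, and `acc_clf` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (assumption), not the Universal Bank dataset.
Xa, ya = make_classification(n_samples=300, n_features=5, random_state=0)
acc_clf = LogisticRegression(max_iter=1000).fit(Xa, ya)
ya_hat = acc_clf.predict(Xa)

acc_score_method = acc_clf.score(Xa, ya)  # built-in score method
acc_metric = accuracy_score(ya, ya_hat)   # sklearn.metrics function
acc_manual = np.mean(ya_hat == ya)        # fraction of matching predictions

print(acc_score_method, acc_metric, acc_manual)
```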


In [None]:
print('intercept ', clf.intercept_[0])
print()
print(pd.DataFrame({'coeff': clf.coef_[0]}, index=X.columns).transpose())


### (6) Model Equation
- Using the coefficients and intercept above, the model equation can be written as follows.
  
    - $\text{logit}(\text{Personal Loan} = \text{Yes}) = -0.383 - 0.425 \times \text{Age} + 0.428 \times \text{Experience} + \dots$

    - $P(\text{Personal Loan} = \text{Yes}) = \frac{1}{1+e^{-(-0.383 - 0.425 \times \text{Age} + 0.428 \times \text{Experience} + \dots)}}$
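
The relationship between the logit and the probability can be verified numerically. The sketch below assumes synthetic data from `make_classification` (so its coefficients differ from the values quoted above). It rebuilds the probability from `intercept_` and `coef_` with the sigmoid function and compares it with `predict_proba`. The names `Xe`, `ye`, and `eq_clf` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption), not the Universal Bank dataset.
Xe, ye = make_classification(n_samples=300, n_features=5, random_state=0)
eq_clf = LogisticRegression(max_iter=1000).fit(Xe, ye)

# logit = intercept + sum_j(coef_j * x_j), computed for every row
logit = eq_clf.intercept_[0] + Xe @ eq_clf.coef_[0]

# P(y = 1) = 1 / (1 + exp(-logit))
p_manual = 1.0 / (1.0 + np.exp(-logit))

# Column 1 of predict_proba holds the probability of the positive class.
p_sklearn = eq_clf.predict_proba(Xe)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```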