Data: https://dacon.io/competitions/official/236202/overview/description

# **Goal: Detect users who are likely to default on credit card payments**

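Contents:  
1. Data cleaning  
1-1. Inspecting the data  
1-2. Data preprocessing  
1-3. Correlation coefficients  
2. Models  
2-1. Random forest  
2-2. CatBoost  
2-3. Lasso  
2-4. Ridge  
2-5. Gradient boosting regression  
3. Points to improve  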

# 1. Data

## 1-1. Inspecting the data

### 1-1-1. Loading the data

In [1]:
import pandas as pd

In [2]:
train_df  = pd.read_csv("/content/drive/MyDrive/data_0306/train.csv")

In [3]:
test_df = pd.read_csv("/content/drive/MyDrive/data_0306/test.csv")

In [4]:
train_df.keys()

Index(['ID', 'TARGET', '성별', '차량 소유 여부', '부동산 소유 여부', '자녀 수', '연간 수입', '수입 유형',
       '최종 학력', '결혼 여부', '주거 형태', '거주지 인구 비율', '휴대전화 소유 여부', '업무용 휴대전화 소유 여부',
       '이메일 소유 여부', '직업', '가족 구성원 수', '산업군', '나이', '근속연수', '가입연수'],
      dtype='object')

In [5]:
train_df.head()

Unnamed: 0,ID,TARGET,성별,차량 소유 여부,부동산 소유 여부,자녀 수,연간 수입,수입 유형,최종 학력,결혼 여부,...,거주지 인구 비율,휴대전화 소유 여부,업무용 휴대전화 소유 여부,이메일 소유 여부,직업,가족 구성원 수,산업군,나이,근속연수,가입연수
0,TRAIN_00000,0,여성,1,1,2,18054000.0,연금수령자,고등학교 졸업,기혼,...,0.00496,1,0,0,Unknown,4.0,기타 1,39,1000,23.0
1,TRAIN_00001,0,남성,1,0,0,59472000.0,근로자,대학교 졸업 이상,기혼,...,0.018029,1,1,0,기술직,2.0,사업 1,45,4,16.0
2,TRAIN_00002,0,여성,0,1,0,29736000.0,근로자,고등학교 졸업,기혼,...,0.0105,1,1,0,단순 노동자,2.0,사업 0,32,3,9.0
3,TRAIN_00003,0,여성,1,0,1,38232000.0,기타,고등학교 졸업,기혼,...,0.004849,1,1,0,Unknown,3.0,산업 4,34,6,12.0
4,TRAIN_00004,0,여성,0,1,0,26550000.0,근로자,고등학교 졸업,기혼,...,0.025164,1,1,0,Unknown,2.0,사업 2,38,0,4.0


In [6]:
len(train_df['직업'].unique())

19

In [7]:
for i in train_df.keys():
    print(f"{i}: {len(train_df[i].unique())}")

ID: 60000
TARGET: 2
성별: 3
차량 소유 여부: 2
부동산 소유 여부: 2
자녀 수: 11
연간 수입: 817
수입 유형: 7
최종 학력: 4
결혼 여부: 5
주거 형태: 4
거주지 인구 비율: 80
휴대전화 소유 여부: 2
업무용 휴대전화 소유 여부: 2
이메일 소유 여부: 2
직업: 19
가족 구성원 수: 12
산업군: 56
나이: 49
근속연수: 49
가입연수: 55


In [8]:
for i in test_df.keys():
    print(f"{i}: {len(test_df[i].unique())}")

ID: 40000
성별: 2
차량 소유 여부: 2
부동산 소유 여부: 2
자녀 수: 7
연간 수입: 657
수입 유형: 6
최종 학력: 4
결혼 여부: 5
주거 형태: 4
거주지 인구 비율: 80
휴대전화 소유 여부: 1
업무용 휴대전화 소유 여부: 2
이메일 소유 여부: 2
직업: 19
가족 구성원 수: 8
산업군: 56
나이: 49
근속연수: 46
가입연수: 57
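
The two loops above can be condensed with `DataFrame.nunique()`, which also makes it easier to see which values exist in only one of the two files. This is a minimal sketch on toy frames; `toy_train` and `toy_test` are hypothetical stand-ins for `train_df` and `test_df`, not the competition data.

```python
import pandas as pd

# Toy stand-ins for train_df / test_df (hypothetical values).
toy_train = pd.DataFrame({"성별": ["여성", "남성", "기타"], "자녀 수": [0, 1, 2]})
toy_test = pd.DataFrame({"성별": ["여성", "남성", "남성"], "자녀 수": [0, 0, 1]})

# Unique-value counts per column, train and test side by side.
summary = pd.DataFrame({"train": toy_train.nunique(), "test": toy_test.nunique()})
print(summary)

# Values that appear in train but never in test, per shared column.
only_in_train = {
    col: sorted(set(toy_train[col]) - set(toy_test[col]))
    for col in toy_test.columns
}
print(only_in_train)
```

A column whose train-only list is non-empty (such as 성별 here) is a candidate for closer inspection, which is what the next subsection does by hand.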


### 1-1-2. Checking why 성별 has three unique values

In [9]:
train_df['성별'].unique()

array(['여성', '남성', '기타'], dtype=object)

In [10]:
train_df[train_df['성별']=='기타']

Unnamed: 0,ID,TARGET,성별,차량 소유 여부,부동산 소유 여부,자녀 수,연간 수입,수입 유형,최종 학력,결혼 여부,...,거주지 인구 비율,휴대전화 소유 여부,업무용 휴대전화 소유 여부,이메일 소유 여부,직업,가족 구성원 수,산업군,나이,근속연수,가입연수
17820,TRAIN_17820,0,기타,1,1,0,58410000.0,기타,대학교 중퇴,사실혼,...,0.035792,1,1,0,Unknown,2.0,의학,26,6,11.0


### solution) Only one row has 성별 == '기타', so we simply drop it. The test data has no rows with 성별 == '기타' either.

In [11]:
train_df.drop(17820,inplace= True)
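
Dropping by the hard-coded index label 17820 works for this file, but a boolean filter expresses the same intent without depending on the row position. The sketch below uses a hypothetical toy frame rather than `train_df`.

```python
import pandas as pd

# Toy frame standing in for train_df (hypothetical values).
toy = pd.DataFrame({"성별": ["여성", "기타", "남성"], "TARGET": [0, 0, 1]})

# Keep every row whose 성별 is not '기타'; the original index labels are preserved.
cleaned = toy[toy["성별"] != "기타"]
print(cleaned)
```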

## 1-2. Data preprocessing

### 1-2-1. Handling missing values

In [12]:
train_df.isnull().sum()

ID                0
TARGET            0
성별                0
차량 소유 여부          0
부동산 소유 여부         0
자녀 수              0
연간 수입             0
수입 유형             0
최종 학력             0
결혼 여부             0
주거 형태             0
거주지 인구 비율         0
휴대전화 소유 여부        0
업무용 휴대전화 소유 여부    0
이메일 소유 여부         0
직업                0
가족 구성원 수          0
산업군               0
나이                0
근속연수              0
가입연수              0
dtype: int64

In [13]:
test_df.isnull().sum()

ID                0
성별                0
차량 소유 여부          0
부동산 소유 여부         0
자녀 수              0
연간 수입             0
수입 유형             0
최종 학력             0
결혼 여부             0
주거 형태             0
거주지 인구 비율         0
휴대전화 소유 여부        0
업무용 휴대전화 소유 여부    0
이메일 소유 여부         0
직업                0
가족 구성원 수          0
산업군               0
나이                0
근속연수              0
가입연수              0
dtype: int64

### solution) There are no missing values!

### 1-2-2. Standardizing numeric features

Numeric features: 자녀 수, 연간 수입, 거주지 인구 비율, 가족 구성원 수, 나이, 근속연수, 가입연수

In [14]:
from sklearn.preprocessing import StandardScaler

In [15]:
scaler = StandardScaler()

In [16]:
from sklearn.preprocessing import StandardScaler

def normalize(df):
    # Standardize the numeric columns in place (zero mean, unit variance).
    # A new scaler is fitted on whichever frame is passed in, so train and
    # test are each scaled with their own mean and standard deviation.
    numeric_cols = ["자녀 수", "연간 수입", "거주지 인구 비율", "가족 구성원 수", "나이", "근속연수", "가입연수"]
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    return df


In [17]:
normalize(train_df)
normalize(test_df)

Unnamed: 0,ID,성별,차량 소유 여부,부동산 소유 여부,자녀 수,연간 수입,수입 유형,최종 학력,결혼 여부,주거 형태,거주지 인구 비율,휴대전화 소유 여부,업무용 휴대전화 소유 여부,이메일 소유 여부,직업,가족 구성원 수,산업군,나이,근속연수,가입연수
0,TEST_00000,남성,0,0,2.209346,-0.112224,근로자,대학교 중퇴,기혼,주택 / 아파트,-1.057116,1,1,0,핵심 노동자,2.038853,기타 0,-1.437313,-0.489308,-0.540907
1,TEST_00001,남성,0,0,-0.578252,0.349266,근로자,대학교 졸업 이상,기혼,주택 / 아파트,-1.154388,1,1,0,관리직,-0.179914,정부,-0.335442,-0.473816,-0.233501
2,TEST_00002,남성,1,1,0.815547,-0.112224,공무원,고등학교 졸업,기혼,주택 / 아파트,0.406053,1,1,0,관리직,0.929470,국가 안보,-0.081163,-0.455741,0.073906
3,TEST_00003,여성,0,1,-0.578252,-0.342969,연금수령자,고등학교 졸업,기혼,주택 / 아파트,-0.462412,1,0,0,Unknown,-0.179914,기타 1,0.596911,2.087605,-0.643376
4,TEST_00004,여성,0,1,2.209346,-0.896757,근로자,고등학교 졸업,기혼,공공분양,0.765510,1,1,0,의료 업계 종사자,2.038853,의학,-1.098276,-0.468652,-1.155719
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,TEST_39995,여성,1,1,-0.578252,1.964480,기타,대학교 졸업 이상,기혼,주택 / 아파트,3.763709,1,1,1,Unknown,-0.179914,사업 2,-1.437313,-0.489308,-0.131032
39996,TEST_39996,여성,0,1,-0.578252,-0.804459,근로자,고등학교 졸업,별거,주택 / 아파트,-0.847499,1,1,0,영업직,-1.289298,자영업,0.851190,-0.484144,1.610937
39997,TEST_39997,여성,0,0,0.815547,-0.804459,공무원,고등학교 졸업,기혼,공공분양,3.763709,1,1,0,단순 노동자,0.929470,국가 안보,0.342633,-0.458324,0.893656
39998,TEST_39998,여성,1,0,-0.578252,-0.942906,연금수령자,고등학교 졸업,기혼,주택 / 아파트,-0.663438,1,0,0,Unknown,-0.179914,기타 1,1.444505,2.087605,1.815874
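
Because `normalize` refits the scaler on each frame, the same raw value can map to different standardized values in train and test. A common alternative is to fit once on train and reuse those statistics for test. The sketch below shows that pattern on hypothetical toy frames with only two of the numeric columns; it is not the code that produced the outputs above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["연간 수입", "나이"]  # subset of the numeric columns, for brevity

# Toy frames standing in for train_df / test_df (hypothetical values).
toy_train = pd.DataFrame({"연간 수입": [10.0, 20.0, 30.0], "나이": [30.0, 40.0, 50.0]})
toy_test = pd.DataFrame({"연간 수입": [20.0, 40.0], "나이": [40.0, 60.0]})

scaler = StandardScaler()
toy_train[num_cols] = scaler.fit_transform(toy_train[num_cols])  # learn mean/std on train
toy_test[num_cols] = scaler.transform(toy_test[num_cols])        # reuse them on test
print(toy_test)
```

With this approach a test value equal to the train mean (20 for 연간 수입 here) becomes 0 after scaling, regardless of how the test set itself is distributed.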


### 1-2-3. Converting categorical features to numeric

Categorical features: 수입 유형, 최종 학력, 결혼 여부, 주거 형태, 직업, 산업군

In [18]:
# from sklearn.preprocessing import OneHotEncoder

# one_hot_encoder = OneHotEncoder()

# def oneHot(df):
#     labels = ["수입 유형", "최종 학력", "주거 형태", "직업", "산업군"]
#     for i in range(5):
#         data_to_onehot = df[[labels[i]]]
#         print(labels[i])
#         onehot_data = one_hot_encoder.fit_transform(data_to_onehot)
#         onehot_df = pd.DataFrame(onehot_data,columns=[labels[i]])
#         df[[labels[i]]] = onehot_df
#     return df



In [19]:
from sklearn.preprocessing import LabelEncoder
import numpy as np
categorical_features = ['수입 유형', '최종 학력', '결혼 여부', '주거 형태', '직업', '산업군']

for i in categorical_features:
    le = LabelEncoder()
    le=le.fit(train_df[i])                  # learn the label -> integer mapping on train
    train_df[i]=le.transform(train_df[i])

    # Register categories that appear only in test so that transform() does not raise
    for case in np.unique(test_df[i]):
        if case not in le.classes_:
            le.classes_ = np.append(le.classes_, case)
    test_df[i]=le.transform(test_df[i])

display(train_df.head(3))
display(test_df.head(3))

Unnamed: 0,ID,TARGET,성별,차량 소유 여부,부동산 소유 여부,자녀 수,연간 수입,수입 유형,최종 학력,결혼 여부,...,거주지 인구 비율,휴대전화 소유 여부,업무용 휴대전화 소유 여부,이메일 소유 여부,직업,가족 구성원 수,산업군,나이,근속연수,가입연수
0,TRAIN_00000,0,여성,1,1,2.193303,-0.881063,5,0,0,...,-1.149509,1,0,0,1,2.030616,6,-0.423674,2.078564,1.000115
1,TRAIN_00001,0,남성,1,0,-0.569142,0.794206,1,1,0,...,-0.205869,1,1,0,4,-0.171512,24,0.08256,-0.486515,0.28047
2,TRAIN_00002,0,여성,0,1,-0.569142,-0.408551,1,0,0,...,-0.749496,1,1,0,5,-0.171512,23,-1.014281,-0.48909,-0.439176


Unnamed: 0,ID,성별,차량 소유 여부,부동산 소유 여부,자녀 수,연간 수입,수입 유형,최종 학력,결혼 여부,주거 형태,거주지 인구 비율,휴대전화 소유 여부,업무용 휴대전화 소유 여부,이메일 소유 여부,직업,가족 구성원 수,산업군,나이,근속연수,가입연수
0,TEST_00000,남성,0,0,2.209346,-0.112224,1,2,0,3,-1.057116,1,1,0,17,2.038853,5,-1.437313,-0.489308,-0.540907
1,TEST_00001,남성,0,0,-0.578252,0.349266,1,1,0,3,-1.154388,1,1,0,3,-0.179914,49,-0.335442,-0.473816,-0.233501
2,TEST_00002,남성,1,1,0.815547,-0.112224,0,0,0,3,0.406053,1,1,0,3,0.92947,3,-0.081163,-0.455741,0.073906
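
Appending to `le.classes_` is one way to cope with categories that occur only in test. scikit-learn's `OrdinalEncoder` offers a built-in option for the same situation; the sketch below is an alternative under that assumption, using hypothetical toy frames and an arbitrary placeholder code of -1 for unseen categories.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["직업"]

# Toy frames standing in for train_df / test_df (hypothetical values).
toy_train = pd.DataFrame({"직업": ["기술직", "관리직", "Unknown"]})
toy_test = pd.DataFrame({"직업": ["관리직", "영업직"]})  # "영업직" is unseen in train

# Unseen categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[cat_cols])
test_codes = encoder.transform(toy_test[cat_cols])

print(encoder.categories_)
print(test_codes.ravel())
```

Either way, the integer codes impose an arbitrary order on the categories, which matters more for linear models such as Lasso and Ridge than for tree-based ones.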


### 1-2-4. Encoding binary features

Binary feature: 성별

In [20]:
train_df['결혼 여부'].unique()

array([0, 1, 4, 3, 2])

In [21]:
def binary(df):
    df['성별'] = df['성별'].apply(lambda x: 1 if x == '남성' else 2)

In [22]:
binary(train_df)
binary(test_df)

In [23]:
train_df

Unnamed: 0,ID,TARGET,성별,차량 소유 여부,부동산 소유 여부,자녀 수,연간 수입,수입 유형,최종 학력,결혼 여부,...,거주지 인구 비율,휴대전화 소유 여부,업무용 휴대전화 소유 여부,이메일 소유 여부,직업,가족 구성원 수,산업군,나이,근속연수,가입연수
0,TRAIN_00000,0,2,1,1,2.193303,-0.881063,5,0,0,...,-1.149509,1,0,0,1,2.030616,6,-0.423674,2.078564,1.000115
1,TRAIN_00001,0,1,1,0,-0.569142,0.794206,1,1,0,...,-0.205869,1,1,0,4,-0.171512,24,0.082560,-0.486515,0.280470
2,TRAIN_00002,0,2,0,1,-0.569142,-0.408551,1,0,0,...,-0.749496,1,1,0,5,-0.171512,23,-1.014281,-0.489090,-0.439176
3,TRAIN_00003,0,2,1,0,0.812080,-0.064906,2,0,0,...,-1.157523,1,1,0,1,0.929552,32,-0.845536,-0.481364,-0.130757
4,TRAIN_00004,0,2,0,1,-0.569142,-0.537418,1,0,0,...,0.309310,1,1,0,1,-0.171512,25,-0.508047,-0.496817,-0.953209
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59995,TRAIN_59995,0,2,0,1,-0.569142,-0.966974,5,0,1,...,-0.459234,1,0,0,1,-1.272576,6,1.348145,2.078564,1.616955
59996,TRAIN_59996,1,1,1,1,-0.569142,0.536473,1,0,0,...,-0.783288,1,1,0,12,-0.171512,47,-0.170557,-0.494241,-1.261629
59997,TRAIN_59997,1,2,0,0,-0.569142,0.321695,1,1,1,...,-0.128466,1,1,0,1,-1.272576,45,-1.520515,-0.494241,-0.644789
59998,TRAIN_59998,0,2,1,1,-0.569142,-0.408551,1,0,3,...,-0.408619,1,1,0,17,-1.272576,39,0.504422,-0.442734,0.280470


## 1-3. Correlation coefficients

In [28]:
corr_df = train_df.corr(numeric_only=True)


In [29]:
corr_df

Unnamed: 0,TARGET,성별,차량 소유 여부,부동산 소유 여부,자녀 수,연간 수입,수입 유형,최종 학력,결혼 여부,주거 형태,거주지 인구 비율,휴대전화 소유 여부,업무용 휴대전화 소유 여부,이메일 소유 여부,직업,가족 구성원 수,산업군,나이,근속연수,가입연수
TARGET,1.0,-0.06733,-0.01772,-0.000838,0.0324,-0.022024,-0.053901,-0.037489,0.017048,-0.014381,-0.046622,0.001414,0.055624,-0.001688,0.019887,0.019742,0.02792,-0.089798,-0.056896,-0.053072
성별,-0.06733,1.0,-0.340175,0.037254,-0.052991,-0.169285,0.137769,-0.006742,0.112772,-0.002619,-0.01751,0.005717,-0.159225,-0.014526,0.038602,-0.089103,0.016189,0.145445,0.160666,0.08316
차량 소유 여부,-0.01772,-0.340175,1.0,-0.005021,0.103836,0.189888,-0.135761,0.064665,-0.123037,0.036675,0.039477,-0.00572,0.153406,0.03086,0.047933,0.153207,0.049751,-0.137355,-0.153999,-0.099057
부동산 소유 여부,-0.000838,0.037254,-0.005021,1.0,0.002714,9.6e-05,0.058174,-0.02209,0.007203,0.171488,0.01359,-0.002618,-0.055869,0.029606,-0.022366,0.004772,-0.020689,0.084955,0.05625,0.014242
자녀 수,0.0324,-0.052991,0.103836,0.002714,1.0,0.02359,-0.243421,0.019981,-0.107573,0.00221,-0.031034,0.002324,0.24419,0.024043,0.13275,0.880868,0.141921,-0.349623,-0.245188,-0.186737
연간 수입,-0.022024,-0.169285,0.189888,9.6e-05,0.02359,1.0,-0.103961,0.146532,-0.024869,0.001772,0.164026,0.001668,0.148945,0.086938,0.015512,0.031993,0.024221,-0.063646,-0.14859,-0.063624
수입 유형,-0.053901,0.137769,-0.135761,0.058174,-0.243421,-0.103961,1.0,-0.060752,0.072864,0.019809,0.037467,0.002374,-0.943751,-0.049645,-0.470488,-0.241915,-0.535867,0.582561,0.943201,0.207564
최종 학력,-0.037489,-0.006742,0.064665,-0.02209,0.019981,0.146532,-0.060752,1.0,-0.021902,0.011161,0.049783,-0.011252,0.074644,0.065163,0.063399,0.005819,0.012541,-0.140863,-0.075239,-0.07733
결혼 여부,0.017048,0.112772,-0.123037,0.007203,-0.107573,-0.024869,0.072864,-0.021902,1.0,-0.021898,0.003994,0.002509,-0.071795,-0.013692,-0.035217,-0.267587,-0.034476,0.061079,0.071565,0.053783
주거 형태,-0.014381,-0.002619,0.036675,0.171488,0.00221,0.001772,0.019809,0.011161,-0.021898,1.0,-0.023863,-0.001039,-0.019401,0.000676,-0.014499,0.019858,-0.002946,0.023891,0.019598,-0.037303


In [30]:
# Keep positive correlations of at least 0.5, excluding the diagonal (1.0)
mask = (corr_df >= 0.5) & (corr_df < 0.99)

row_names, col_names = np.where(mask)

for row, col in zip(row_names, col_names):
    print("row:", corr_df.index[row], ", column:", corr_df.columns[col])


row: 자녀 수 , column: 가족 구성원 수
row: 수입 유형 , column: 나이
row: 수입 유형 , column: 근속연수
row: 업무용 휴대전화 소유 여부 , column: 산업군
row: 가족 구성원 수 , column: 자녀 수
row: 산업군 , column: 업무용 휴대전화 소유 여부
row: 나이 , column: 수입 유형
row: 나이 , column: 근속연수
row: 근속연수 , column: 수입 유형
row: 근속연수 , column: 나이
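
The mask above keeps only positive correlations and reports every pair twice. In the table, 수입 유형 and 업무용 휴대전화 소유 여부 have a correlation of about -0.94, which this filter does not list. A sketch of a variant that uses absolute values and only the upper triangle follows; the small matrix `corr` is hypothetical and stands in for `corr_df`.

```python
import numpy as np
import pandas as pd

# Toy symmetric correlation matrix standing in for corr_df (hypothetical values).
corr = pd.DataFrame(
    [[1.0, 0.9, -0.8],
     [0.9, 1.0, 0.1],
     [-0.8, 0.1, 1.0]],
    index=list("abc"), columns=list("abc"),
)

# True strictly above the diagonal, so each pair is visited once and
# self-correlations are skipped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Keep pairs whose absolute correlation is at least 0.5, strongest first.
strong = pairs[pairs.abs() >= 0.5].sort_values(key=np.abs, ascending=False)
print(strong)
```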


# 2. Models

## 2-1. Random forest

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

In [25]:
import numpy as np

def basemodel(data):
    global base_tray
    model = RandomForestRegressor(n_estimators=30)   # RandomForestRegressor, n_estimators = number of trees
    X = data.drop(['ID', 'TARGET'], axis=1)                 # train_X
    Y = data['TARGET']
    skf = StratifiedKFold(n_splits=5)
    scores = []
    for train_index, test_index in skf.split(X, Y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
        model.fit(X_train, Y_train)
        Y_pred = pd.Series(model.predict(X_test), index=Y_test.index)
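        # Exact-match rate between continuous predictions and the 0/1 labels;
        # this is the evaluation metric flagged for revision in section 3.
        score = (Y_pred == Y_test).mean()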
        scores.append(score)
    average = sum(scores) / len(scores)
    base_tray = model.predict(test_df.drop('ID', axis=1))
    return average


In [26]:
basemodel(train_df)

0.09496047413673361
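
The score above is low because a regressor returns continuous values, and those rarely equal a 0/1 label exactly. One option is to score the predicted values with a ranking metric. The sketch below assumes ROC AUC is acceptable (the competition's official metric is not stated in this notebook), swaps in `RandomForestClassifier` so that `predict_proba` is available, and runs on synthetic data from `make_classification`. The helper `cv_auc` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for X, Y (hypothetical, about 10% positives).
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.9], random_state=0
)

def cv_auc(make_model, X, y, n_splits=5):
    """Out-of-fold ROC AUC for any model factory (hypothetical helper)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    oof = np.zeros(len(y))
    for train_idx, valid_idx in skf.split(X, y):
        model = make_model()  # fresh model for every fold
        model.fit(X[train_idx], y[train_idx])
        # Probability of the positive class for the held-out rows.
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    return roc_auc_score(y, oof)

auc = cv_auc(
    lambda: RandomForestClassifier(n_estimators=30, random_state=0),
    X_toy, y_toy,
)
print(f"out-of-fold ROC AUC: {auc:.3f}")
```

Passing a factory rather than a fitted model means the same helper could score each of the models in this section without copying the cross-validation loop.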

In [27]:
submission = pd.read_csv('/content/drive/MyDrive/data_0306/sample_submission.csv')
submission['TARGET'] = base_tray
submission.to_csv('/content/drive/MyDrive/data_0306/submission4.csv',index=False)

## 2-2. CatBoost

In [31]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.3-cp310-cp310-manylinux2014_x86_64.whl (98.5 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.3


In [32]:
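# Only the regressor class is imported: the function defined below is also
# named `catboost`, which would shadow a module imported under that name.
from catboost import CatBoostRegressor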

def catboost(data):
    global cat_tray
    model = CatBoostRegressor(iterations=200,learning_rate=0.1,depth=6)
    X = data.drop(['ID', 'TARGET'], axis=1)                 # train_X
    Y = data['TARGET']
    skf = StratifiedKFold(n_splits=5)
    scores =[]
    for train_index, test_index in skf.split(X,Y):
        X_train, X_test = X.iloc[train_index],X.iloc[test_index]
        Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
        model.fit(X_train,Y_train)
        Y_pred = pd.Series(model.predict(X_test), index=Y_test.index)
        score = (Y_pred == Y_test).mean()
        scores.append(score)
    cat_tray = model.predict(test_df.drop('ID', axis=1))
    scores = scores[1:]   # drops the first fold's score before averaging
    average = sum(scores) / len(scores)
    return average

In [None]:
catboost(train_df)

In [34]:
cat_tray

array([0.1348261 , 0.07785765, 0.09206592, ..., 0.05190107, 0.04673009,
       0.07009854])

In [35]:
submission = pd.read_csv('/content/drive/MyDrive/data_0306/sample_submission.csv')
submission['TARGET'] = cat_tray
submission.to_csv('/content/drive/MyDrive/data_0306/submission3.csv',index=False)

## 2-3. Lasso

In [51]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import Lasso

lasso_tray = []

def lasso(data):
    global lasso_tray
    model = Lasso(alpha = 0.1)
    X = data.drop(['ID', 'TARGET'], axis=1)
    Y = data['TARGET']
    skf = StratifiedKFold(n_splits=5)
    scores =[]
    for train_index, test_index in skf.split(X,Y):
        X_train, X_test = X.iloc[train_index],X.iloc[test_index]
        Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]
        model.fit(X_train,Y_train)
        Y_pred = pd.Series(model.predict(X_test), index=Y_test.index)
        score = (Y_pred == Y_test).mean()
        scores.append(score)
    lasso_tray = model.predict(test_df.drop('ID', axis=1))
    scores = scores[1:]
    average = sum(scores) / len(scores)
    return average

In [52]:
lasso(train_df)

0.0

In [56]:
pd.Series(lasso_tray).unique()

array([0.1034412 , 0.11140315, 0.1030793 , 0.10362216, 0.11086029,
       0.10706027, 0.10905076, 0.11104124, 0.10669836, 0.11176506,
       0.10253644, 0.10796504, 0.11212696, 0.1061555 , 0.10941266,
       0.10470788, 0.10326025, 0.10977457, 0.10488883, 0.10651741,
       0.10452692, 0.10506978, 0.11031743, 0.10760313, 0.11013648,
       0.10814599, 0.10289834, 0.10959362, 0.10832694, 0.10995552,
       0.10923171, 0.10687932, 0.10398406, 0.10271739, 0.1057936 ,
       0.1088698 , 0.10380311, 0.11049838, 0.10416502, 0.11067934,
       0.10561264, 0.11230792, 0.11194601, 0.1112222 , 0.10633646,
       0.11248887, 0.10597455, 0.10543169, 0.10868885, 0.10434597,
       0.10724122, 0.10742218, 0.1115841 , 0.1085079 , 0.10778408,
       0.10525074])

### Q) What value should alpha be set to?
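
One common way to answer this is to try a grid of alpha values with cross-validation and keep the best one. The sketch below uses scikit-learn's `LassoCV` on synthetic regression data from `make_regression`. The grid bounds (1e-4 to 10) are an arbitrary assumption, and the result says nothing about the best alpha for the competition data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data standing in for X, Y (hypothetical).
X_toy, y_toy = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Candidate alphas on a log scale; LassoCV picks the one with the lowest CV error.
alphas = np.logspace(-4, 1, 30)
search = LassoCV(alphas=alphas, cv=5).fit(X_toy, y_toy)

print("best alpha:", search.alpha_)
print("non-zero coefficients:", int(np.sum(search.coef_ != 0)))
```

The count of non-zero coefficients is worth watching: a large alpha shrinks most coefficients to zero, which would be consistent with the narrow range of Lasso predictions shown above. `RidgeCV` provides the same kind of search for Ridge.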

## 2-4. Ridge

In [68]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import Ridge

ridge_tray = []

def ridge(data):
    global ridge_tray
    model = Ridge(alpha = 1.0)
    X = data.drop(['ID','TARGET'],axis = 1)
    Y = data['TARGET']
    skf = StratifiedKFold(n_splits=5)
    scores = []
    for train_index, test_index in skf.split(X,Y):
        X_train, X_test = X.iloc[train_index],X.iloc[test_index]
        Y_train, Y_test = Y.iloc[train_index],Y.iloc[test_index]
        model.fit(X_train,Y_train)
        Y_pred = pd.Series(model.predict(X_test),index = Y_test.index)
        score = (Y_pred == Y_test).mean()
        scores.append(score)
    ridge_tray = model.predict(test_df.drop('ID',axis=1))
    scores = scores[1:]
    average = sum(scores) / len(scores)
    return average

In [69]:
ridge(train_df)

0.0

In [70]:
ridge_tray

array([0.14831199, 0.14660124, 0.11203179, ..., 0.05339325, 0.02041346,
       0.1095875 ])

In [71]:
pd.Series(ridge_tray).unique()

array([0.14831199, 0.14660124, 0.11203179, ..., 0.05339325, 0.02041346,
       0.1095875 ])

In [73]:
submission = pd.read_csv('/content/drive/MyDrive/data_0306/sample_submission.csv')
submission['TARGET'] = ridge_tray
submission.to_csv('/content/drive/MyDrive/data_0306/submission5.csv',index=False)

## 2-5. Gradient boosting regression

In [74]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import StratifiedKFold

gbr_tray = []

def gbr(data):
    global gbr_tray
    model = GradientBoostingRegressor(n_estimators=30)
    X = data.drop(['ID','TARGET'],axis=1)
    Y = data['TARGET']
    skf = StratifiedKFold(n_splits = 5)
    scores = []
    for train_index, test_index in skf.split(X,Y):
        X_train,X_test = X.iloc[train_index],X.iloc[test_index]
        Y_train,Y_test = Y.iloc[train_index],Y.iloc[test_index]
        model.fit(X_train,Y_train)
        Y_pred = pd.Series(model.predict(X_test),index = Y_test.index)
        score = (Y_pred==Y_test).mean()
        scores.append(score)
    gbr_tray = model.predict(test_df.drop('ID',axis=1))
    average = sum(scores)/len(scores)
    return average

In [75]:
gbr(train_df)

0.0

In [76]:
gbr_tray

array([0.13746739, 0.09289564, 0.07836291, ..., 0.06754274, 0.06386232,
       0.08499268])

In [77]:
submission = pd.read_csv("/content/drive/MyDrive/data_0306/sample_submission.csv")
submission['TARGET'] = gbr_tray
submission.to_csv('/content/drive/MyDrive/data_0306/submission6.csv',index=False)

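# 3. Points to improve

1. Fix the evaluation metric  
2. Do exploratory data analysis (EDA)  
3. Take a semantic approach to the data (understand what each feature means)  
4. Use the Lasso / Ridge models properly (hyperparameter search)  
5. Work out how to apply one-hot encoding  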
