---
title: "Machine Learning Algorithms - LightGBM"
date: 2020-07-16T18:00:47+09:00
tags:
  - "Machine Learning"
  - "Python"
categories:
  - "Machine Learning"
  - "Python"
menu: 
  kaggle:
    name: Machine Learning Algorithms - LightGBM
---


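## Overview
- This exercise downloads the Kaggle data needed to predict house prices and stores it in BigQuery.
- The data is then loaded back and used to build a machine learning model with `LightGBM`.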

## I. Preparation
- Install the `Kaggle API`, connect it, and load the data into `GCP`.

### (1) Installing the Kaggle API
- To use the `API` in Google Colab, run the following code.

In [None]:
!pip install kaggle



### (2) Downloading the Kaggle Token
- Download an API token from Kaggle.
- Click [Kaggle]-[My Account]-[API]-[Create New API Token] to download the `kaggle.json` file.
- Move the file to your desktop, then run the code below.

In [None]:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# Move kaggle.json into the folder below, then set permissions so the file can be used.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
uploaded file "kaggle.json" with length 64 bytes


- This means that the `kaggle.json` file was actually uploaded.

In [None]:
ls -1ha ~/.kaggle/kaggle.json

/root/.kaggle/kaggle.json


### (3) Downloading the Kaggle Data
- List the `Kaggle` competitions.

In [None]:
!kaggle competitions list

ref                                            deadline             category            reward  teamCount  userHasEntered  
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
tpu-getting-started                            2030-06-03 23:59:00  Getting Started      Kudos        190           False  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       2946           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge      22234            True  
house-prices-advanced-regression-techniques    2030-01-01 00:00:00  Getting Started  Knowledge       5049            True  
connectx                                       2030-01-01 00:00:00  Getting Started  Knowledge        768           False  
nlp-getting-started                            2030-01-01 00:00:00  Getting Started      Kudos       1511            True  
competit

- From this list, download the dataset of the competition you want to join.
- This `basic` lecture practices data processing and visualization with the `house-prices-advanced-regression-techniques` data, so run the code below to download it.

In [None]:
!kaggle competitions download -c house-prices-advanced-regression-techniques

sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
data_description.txt: Skipping, found more recently modified local copy (use --force to force download)


- Check that the data was actually downloaded.

In [None]:
!ls

data_description.txt  sample_data  sample_submission.csv  test.csv  train.csv


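### (4) Loading the Data into BigQuery
- Read `sample_submission.csv`, `test.csv`, and `train.csv`, and load them into BigQuery.
- There are several ways to load local data into BigQuery:
  + Upload directly from `Local` (files of 10MB or less only)
  + Use `Google Storage`
  + Use `Pandas`
- Using `Google Storage` would turn this into a cloud lecture, so the `Pandas` package is used here.
  + This relies on the `to_gbq` function, which normally requires installing the `pandas-gbq` package separately.
  + Fortunately, the package does not need to be installed separately in Google `Colab`.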

In [None]:
import pandas as pd
from pandas.io import gbq

# import sample_submission file
sample_submission = pd.read_csv('sample_submission.csv')

# Connect to Google Cloud API and Upload DataFrame
sample_submission.to_gbq(destination_table='house_price.sample_submission', 
                         project_id='bigquerytutorial-274406', 
                         if_exists='replace')

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=725825577420-unm2gnkiprugilg743tkbig250f4sfsj.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fbigquery&state=iDanF0tXkfmDnxY4RD2uUUl44sAOkl&prompt=consent&access_type=offline
Enter the authorization code: 4/2AGO8F1EJcAFws0wc37c5YDgiO3eCfF7k6D5casBL8OtJCulQruEnu0


1it [00:02,  2.60s/it]


In [None]:
import pandas as pd
from pandas.io import gbq
# import train file 
train = pd.read_csv('train.csv')

- Check the `column` names.

In [None]:
print(train.columns)

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [None]:
train.to_gbq(destination_table='house_price.train', 
                  project_id='bigquerytutorial-274406', 
                  if_exists='replace')

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=725825577420-unm2gnkiprugilg743tkbig250f4sfsj.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fbigquery&state=bfs5rTp22NpJ34YGvmUn4sYiGJ824n&prompt=consent&access_type=offline
Enter the authorization code: 4/2AE2Islh7hWhTD483wZqZq6mM0nPEnUTl2jmCtms4oDW_7S5SM3WmPI


GenericGBQException: ignored

- When loading data into BigQuery, a column name must not start with a digit, so the affected columns are renamed.
  + Here, `my` is simply prepended to each name that starts with a digit.

In [None]:
colnames_dict = {"1stFlrSF": "my1stFlrSF", "2ndFlrSF": "my2ndFlrSF", "3SsnPorch": "my3SsnPorch"}
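
The mapping above is written by hand. As a minimal sketch, assuming the same `my` prefix convention, it could also be derived from a list of column names. `prefix_digit_columns` is a hypothetical helper name used only for this illustration; it is not part of pandas or pandas-gbq.

```python
def prefix_digit_columns(columns, prefix="my"):
    """Return a rename mapping for column names that start with a digit.

    Hypothetical helper for illustration: names beginning with a digit get
    `prefix` prepended; all other names are left out of the mapping.
    """
    return {name: prefix + name for name in columns if name[:1].isdigit()}


# A few column names taken from train.csv
mapping = prefix_digit_columns(["Id", "1stFlrSF", "2ndFlrSF", "3SsnPorch", "GrLivArea"])
print(mapping)
```

The resulting dictionary can be passed to `DataFrame.rename(columns=...)` in the same way as `colnames_dict`.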

In [None]:
# Connect to Google Cloud API and Upload DataFrame
train = train.rename(columns=colnames_dict)
train.to_gbq(destination_table='house_price.train', 
                  project_id='bigquerytutorial-274406', 
                  if_exists='replace')

1it [00:09,  9.53s/it]


In [None]:
# Connect to Google Cloud API and Upload DataFrame
test = pd.read_csv('test.csv')
test = test.rename(columns=colnames_dict)
test.to_gbq(destination_table='house_price.test', 
            project_id='bigquerytutorial-274406', 
            if_exists='replace')

1it [00:03,  3.25s/it]


- Check in BigQuery that the data was actually loaded.

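## II. Feature Engineering
- Most scikit-learn estimators do not accept missing values by default, so missing values must always be checked and handled.
- This time, the data is loaded through `BigQuery`.
- This section covers feature engineering for extracting the key data.

### (1) Loading the Main Packages
- Now load the main packages.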

In [None]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import norm
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, cross_val_predict



### (2) Loading the Data

In [None]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
# Google authentication library
from google.colab import auth

# BigQuery-related libraries
from google.cloud import bigquery
from tabulate import tabulate
import pandas as pd

- First, load the training data.

In [None]:
from google.cloud import bigquery
from tabulate import tabulate
import pandas as pd

project_id = 'warm-rock-286903'
client = bigquery.Client(project=project_id)

df_train = client.query('''
  SELECT 
      * 
  FROM `warm-rock-286903.house_price_train.train`
  ''').to_dataframe()

df_train.shape

(1460, 81)

- Next, load the test data.

In [None]:
df_test = client.query('''
  SELECT 
      * 
  FROM `warm-rock-286903.house_price_train.test`
  ''').to_dataframe()

df_test.shape

(1459, 80)

- The setting below makes the output show every `Column`.

In [None]:
pd.options.display.max_columns = None 
df_train.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,my1stFlrSF,my2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,my3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1379.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,567.240411,1057.429452,1162.626712,346.992466,5.844521,1515.463699,0.425342,0.057534,1.565068,0.382877,2.866438,1.046575,6.517808,0.613014,1978.506164,1.767123,472.980137,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,441.866955,438.705324,386.587738,436.528436,48.623081,525.480383,0.518911,0.238753,0.550916,0.502885,0.815778,0.220338,1.625393,0.644666,24.689725,0.747315,213.804841,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1900.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,223.0,795.75,882.0,0.0,0.0,1129.5,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,334.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,477.5,991.5,1087.0,0.0,0.0,1464.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,808.0,1298.25,1391.25,728.0,0.0,1776.75,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,4692.0,2065.0,572.0,5642.0,3.0,2.0,3.0,2.0,8.0,3.0,14.0,3.0,2010.0,4.0,1418.0,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


### (3) Checking Missing Data

In [None]:
# Compute the percentage of missing values in each column.
def check_fill_na(data):
  # Share of missing values per column, in percent
  new_df_na = (data.isnull().sum() / len(data)) * 100
  # Keep only the columns that have missing values, largest share first
  new_df_na = new_df_na[new_df_na > 0].sort_values(ascending=False)
  return new_df_na

check_fill_na(df_train)

PoolQC          99.520548
MiscFeature     96.301370
Alley           93.767123
Fence           80.753425
FireplaceQu     47.260274
LotFrontage     17.739726
GarageYrBlt      5.547945
GarageType       5.547945
GarageFinish     5.547945
GarageQual       5.547945
GarageCond       5.547945
BsmtFinType2     2.602740
BsmtExposure     2.602740
BsmtFinType1     2.534247
BsmtCond         2.534247
BsmtQual         2.534247
MasVnrArea       0.547945
MasVnrType       0.547945
Electrical       0.068493
dtype: float64

### (4) Defining Helper Functions
- Define functions for imputing missing values in numeric and categorical data.


In [None]:
def fill_missing(df, cols, val):
    """Fill missing values in the given columns with val."""
    for col in cols:
        df[col] = df[col].fillna(val)

def fill_missing_with_mode(df, cols):
    """Fill missing values in the given columns with the most frequent value."""
    for col in cols:
        df[col] = df[col].fillna(df[col].mode()[0])

def addlogs(res, cols):
    """Append a log-transformed copy of each column, named <col>_log."""
    res = res.copy()
    for c in cols:
        # The 1.01 offset keeps the logarithm finite when a value is 0
        res[c + '_log'] = np.log(1.01 + res[c])
    return res

### (5) Adding a Total Area Feature
- Sum the basement, first-floor, and second-floor areas of each house to create an additional variable, `TotalSF`.

In [None]:
df_train['TotalSF'] = df_train['TotalBsmtSF'] + df_train['my1stFlrSF'] + df_train['my2ndFlrSF']

- Apply a `log transformation` to the numeric columns listed below.

In [None]:
loglist = ['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF',
            'TotalBsmtSF','my1stFlrSF','my2ndFlrSF','LowQualFinSF','GrLivArea',
            'BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr',
            'TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF',
            'EnclosedPorch','my3SsnPorch','ScreenPorch','PoolArea','MiscVal','YearRemodAdd','TotalSF']

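df_train = addlogs(df_train, loglist)

# Build the same features on the test set; otherwise fix_missing_cols later fills them with 0
df_test['TotalSF'] = df_test['TotalBsmtSF'] + df_test['my1stFlrSF'] + df_test['my2ndFlrSF']
df_test = addlogs(df_test, loglist)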

### (6) Log-Transforming the Target Variable
- Because the dataset is small, the target is log-transformed to make the model more stable.

In [None]:
df_train["SalePrice"] = np.log1p(df_train["SalePrice"])
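
To see why `log1p` here pairs with `expm1` at submission time, the following sketch runs the round trip with the standard-library `math` module instead of NumPy. The three prices are the minimum, median, and maximum `SalePrice` from the `describe()` output above; everything else is illustrative.

```python
import math

# Minimum, median, and maximum SalePrice from the describe() output
prices = [34900.0, 163000.0, 755000.0]

# log1p(x) = log(1 + x) compresses the long right tail of the prices
log_prices = [math.log1p(p) for p in prices]

# expm1(y) = exp(y) - 1 is the inverse, used to turn predictions back into prices
restored = [math.expm1(v) for v in log_prices]

for p, v, r in zip(prices, log_prices, restored):
    print(f"{p:>9.0f} -> {v:.4f} -> {r:.1f}")
```

On the log scale the cheapest and the most expensive house are only about three units apart, which is what keeps a few very expensive houses from dominating the squared error.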

### (7) Imputing Missing Values
- Impute the missing values.

In [None]:
# First, fill missing values in the categorical columns with "None"
fill_missing(df_train, ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", 
                        "GarageType", "GarageFinish", "GarageQual", "GarageCond",
                       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
                       "MasVnrType", "MSSubClass"], "None") 

# Fill the numeric columns with 0
fill_missing(df_train, ["GarageYrBlt", "GarageArea", "GarageCars",
                       'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath',
                       "MasVnrArea"], 0)
# Fill some of the columns with their most frequent value
fill_missing_with_mode(df_train, ["MSZoning", "KitchenQual", "Exterior1st", "Exterior2nd", "SaleType"])
fill_missing(df_train, ["Functional"],"Typ")

### (8) Dropping Variables
- Drop columns that contain only a single value.

In [None]:
df_train.drop(['Utilities'], axis=1, inplace=True)
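
One way to spot such columns is to look at the share of rows taken by the most frequent value. The sketch below uses a made-up toy frame; the data, the helper name `dominant_share`, and the 0.99 threshold are illustrative assumptions, not values from the lecture.

```python
import pandas as pd

# Toy data: 'Utilities' is almost constant, 'Street' is not
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 199 + ["NoSeWa"],
    "Street": ["Pave"] * 120 + ["Grvl"] * 80,
})

def dominant_share(series):
    """Fraction of non-missing rows taken by the most frequent value."""
    return series.value_counts(normalize=True).iloc[0]

shares = {col: dominant_share(toy[col]) for col in toy.columns}

# Columns where one value covers at least 99% of the rows
near_constant = [col for col, share in shares.items() if share >= 0.99]
print(shares)
print(near_constant)
```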

### (9) Removing Outliers
- With so little data, extreme high or low outliers are harmful, so those observations are removed.

In [None]:
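# SalePrice was log1p-transformed in step (6), so the dollar thresholds are put on the same scale
df_train.drop(df_train[(df_train['OverallQual']<5) & (df_train['SalePrice']>np.log1p(200000))].index, inplace=True)
df_train.drop(df_train[(df_train['GrLivArea']>4000) & (df_train['SalePrice']<np.log1p(300000))].index, inplace=True)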
df_train.reset_index(drop=True, inplace=True)

### (10) Recategorization
- Some numeric columns are really closer to categorical data.
- These columns are therefore converted to strings.

In [None]:
df_train['MSSubClass'] = df_train['MSSubClass'].apply(str)
df_train['YrSold'] = df_train['YrSold'].astype(str)
df_train['MoSold'] = df_train['MoSold'].astype(str)
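
The effect of the cast can be seen on a toy frame with made-up values. With its default arguments, `pd.get_dummies` encodes only object, string, and category columns, so an integer-coded `MSSubClass` passes through unchanged until it is converted to strings. This is a small sketch, not the lecture's code.

```python
import pandas as pd

toy = pd.DataFrame({"MSSubClass": [20, 60, 20], "LotArea": [8450, 9600, 11250]})

# As integers, MSSubClass is treated as a number and left as a single column
plain = pd.get_dummies(toy)

# After the cast to str, each class code becomes its own indicator column
toy["MSSubClass"] = toy["MSSubClass"].astype(str)
encoded = pd.get_dummies(toy)

print(list(plain.columns))
print(list(encoded.columns))
```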

### (11) Handling Categorical Data
- Now convert the categorical data with one-hot encoding.
- One-hot encoding is needed because the algorithm works on numeric input.

In [None]:
def fix_missing_cols(in_train, in_test):
    missing_cols = set(in_train.columns) - set(in_test.columns)
    # Add every column that exists in train but not in test, filled with 0
    for c in missing_cols:
        in_test[c] = 0
    # Put the columns in the same order as the training data
    in_test = in_test[in_train.columns]
    return in_test

def dummy_encode(in_df_train, in_df_test):
    df_train = in_df_train
    df_test = in_df_test
    categorical_feats = [
        f for f in df_train.columns if df_train[f].dtype == 'object'
    ]
    print(categorical_feats)
    for f_ in categorical_feats:
        prefix = f_
        df_train = pd.concat([df_train, pd.get_dummies(df_train[f_], prefix=prefix)], axis=1).drop(f_, axis=1)
        df_test = pd.concat([df_test, pd.get_dummies(df_test[f_], prefix=prefix)], axis=1).drop(f_, axis=1)
        df_test = fix_missing_cols(df_train, df_test)
    return df_train, df_test

- If the training and test data have a different number of columns, prediction raises an error.

In [None]:
df_train, df_test = dummy_encode(df_train, df_test)
print("Shape train: %s, test: %s" % (df_train.shape, df_test.shape))

['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']
Shape train: (1456, 361), test: (1459, 361)
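
The `fix_missing_cols` helper above aligns the test columns one by one. As an alternative sketch (not the lecture's method), the same alignment can be written with `DataFrame.reindex`; the two toy frames are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test = pd.DataFrame({"Street": ["Pave", "Pave"]})  # 'Grvl' never appears in test

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Reindex test to the training columns: missing dummies become 0, extras are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.shape, test_enc.shape)
print(list(test_enc.columns))
```

After the reindex both frames have identical columns in identical order, which is what the model expects at prediction time.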


## III. Building the Machine Learning Model
- Now build a machine learning model with `LightGBM`.
- Watch the YouTube lecture below.



In [None]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/OQHlmscvkRI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

### (1) Handling the Dependent Variable
- Store the dependent variable in the `y` object.

In [None]:
y = df_train["SalePrice"]
y.sample(3)

245     11.849405
140     11.289794
1449    11.440366
Name: SalePrice, dtype: float64

- Drop the dependent variable from the training and test data.

In [None]:
df_train.drop(["SalePrice"], axis=1, inplace=True)
df_test.drop(["SalePrice"], axis=1, inplace=True)

print("Shape train: %s, test: %s" % (df_train.shape, df_test.shape))

Shape train: (1456, 360), test: (1459, 360)


### (2) Splitting the Dataset
- Split the training data into a training set and a 20% hold-out set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split( df_train, y, test_size=0.2, random_state=42)

### (3) Defining the LightGBM Parameters
- Define the `LightGBM` parameters after reading the following manual.
- [LightGBM parameter manual](https://lightgbm.readthedocs.io/en/latest/Parameters.html)

In [None]:
hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': ['l2', 'auc'],
    'learning_rate': 0.005,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'verbose': 0,
    "max_depth": 8,
    "num_leaves": 128,  
    "max_bin": 512,
    "num_iterations": 100000,
    "n_estimators": 1000
}
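
The LightGBM tuning guide points out that a tree limited to `max_depth` levels can have at most `2 ** max_depth` leaves, so `num_leaves` is usually kept at or below that bound. A quick sanity check for the two values above; `max_leaves_for_depth` is a hypothetical helper written for this illustration, not a LightGBM function.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the number of leaves of a binary tree with the given depth."""
    return 2 ** max_depth

# The two tree-shape parameters from hyper_params
params = {"max_depth": 8, "num_leaves": 128}

bound = max_leaves_for_depth(params["max_depth"])
print(bound, params["num_leaves"] <= bound)
```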

### (4) Defining the Model
- Now define the model.

In [None]:
gbm = lgb.LGBMRegressor(**hyper_params)

### (5) Training the Model
- Now train the model.

In [None]:
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        early_stopping_rounds=1000)



[1]	valid_0's auc: 1	valid_0's l2: 0.186477	valid_0's l1: 0.333038
Training until validation scores don't improve for 1000 rounds.
[2]	valid_0's auc: 1	valid_0's l2: 0.185073	valid_0's l1: 0.331606
[3]	valid_0's auc: 1	valid_0's l2: 0.183684	valid_0's l1: 0.330192
[4]	valid_0's auc: 1	valid_0's l2: 0.18231	valid_0's l1: 0.328797
[5]	valid_0's auc: 1	valid_0's l2: 0.18095	valid_0's l1: 0.327404
[6]	valid_0's auc: 1	valid_0's l2: 0.179603	valid_0's l1: 0.326023
[7]	valid_0's auc: 1	valid_0's l2: 0.17828	valid_0's l1: 0.324652
[8]	valid_0's auc: 1	valid_0's l2: 0.176947	valid_0's l1: 0.323279
[9]	valid_0's auc: 1	valid_0's l2: 0.175652	valid_0's l1: 0.321952
[10]	valid_0's auc: 1	valid_0's l2: 0.174352	valid_0's l1: 0.320633
[11]	valid_0's auc: 1	valid_0's l2: 0.172985	valid_0's l1: 0.319239
[12]	valid_0's auc: 1	valid_0's l2: 0.171633	valid_0's l1: 0.317859
[13]	valid_0's auc: 1	valid_0's l2: 0.170309	valid_0's l1: 0.316502
[14]	valid_0's auc: 1	valid_0's l2: 0.16898	valid_0's l1: 0.3151

LGBMRegressor(bagging_fraction=0.7, bagging_freq=10, boosting_type='gbdt',
              class_weight=None, colsample_bytree=1.0, feature_fraction=0.9,
              importance_type='split', learning_rate=0.005, max_bin=512,
              max_depth=8, metric=['l2', 'auc'], min_child_samples=20,
              min_child_weight=0.001, min_split_gain=0.0, n_estimators=1000,
              n_jobs=-1, num_iterations=100000, num_leaves=128,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0, task='train',
              verbose=0)

### (6) Evaluating the Model
- Evaluate the model. (RMSE)

In [None]:
y_pred = gbm.predict(X_train, num_iteration=gbm.best_iteration_)
print('The rmse of prediction is:', round(mean_squared_log_error(y_pred, y_train) ** 0.5, 5))

The rmse of prediction is: 0.02951
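
Because `SalePrice` was transformed with `log1p` in step II-(6), a plain RMSE on the transformed values is the same quantity as the RMSLE on the original prices. The sketch below shows this with made-up prices and predictions, using only the standard library; the helper names `rmse` and `rmsle` are defined here for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, based on log(1 + x)."""
    return rmse([math.log1p(t) for t in y_true], [math.log1p(p) for p in y_pred])

# Made-up prices and predictions, for illustration only
actual = [129975.0, 163000.0, 214000.0]
predicted = [135000.0, 158000.0, 220000.0]

# The same values on the log1p scale, as the model sees them
log_actual = [math.log1p(v) for v in actual]
log_predicted = [math.log1p(v) for v in predicted]

print(rmse(log_actual, log_predicted))  # RMSE on the log scale
print(rmsle(actual, predicted))         # RMSLE on the price scale
```

Applying `mean_squared_log_error` to values that are already log-transformed, as in the cell above, takes the logarithm a second time, so the printed number is on a different scale from this RMSE.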


### (7) Submitting the Results
- Now submit the results.

In [None]:
test_pred = np.expm1(gbm.predict(df_test, num_iteration=gbm.best_iteration_))
df_test["SalePrice"] = test_pred
df_test.to_csv("results.csv", columns=["Id", "SalePrice"], index=False)