# 👩‍🚀**<span style="color:#4527A0;"> Spaceship Titanic with</span> <span style="color:#F44336;"> LGBMClassifier </span>🌠**

# <span style="color:#4527A0;"> Project </span> 🔥
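- Course: 2022 K-Digital Training program, Big Data Analysis Specialist Course Using an AI Data Platform
- Subjects: Big data analysis and visualization, AI development basics, AI programming
- Project topic: Building a classification model on the Spaceship Titanic data to predict whether each passenger was transported
- Project deadline: Tuesday, April 12, 2022
- Instructor: Evan
- Student: 이가영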

# <span style="color:#4527A0;"> Table of contents </span> 📋

- [Load data](#Load-Data)
- [EDA](#EDA)
- [Preprocessing](#Preprocessing)
- [Model](#Model)

# Load Data
- Load the libraries

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV

from lightgbm import LGBMClassifier

import warnings
warnings.filterwarnings('ignore')

- Load the data

In [2]:
train = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
test = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
submission = pd.read_csv("/kaggle/input/spaceship-titanic/sample_submission.csv")

In [3]:
train.shape, test.shape, submission.shape

((8693, 14), (4277, 13), (4277, 2))

## train Data

In [4]:
train.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [6]:
train.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [7]:
train.describe()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.0 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 |
| mean | 28.82793 | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 |
| std | 14.489021 | 666.717663 | 1611.48924 | 604.696458 | 1136.705535 | 1145.717189 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 38.0 | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |


## Test Data

In [8]:
test.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Nelly Carsoning |
| 1 | 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | Lerome Peckers |
| 2 | 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Sabih Unhearfus |
| 3 | 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | Meratz Caltilter |
| 4 | 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | Brence Harperez |


In [9]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4277 entries, 0 to 4276
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   4277 non-null   object 
 1   HomePlanet    4190 non-null   object 
 2   CryoSleep     4184 non-null   object 
 3   Cabin         4177 non-null   object 
 4   Destination   4185 non-null   object 
 5   Age           4186 non-null   float64
 6   VIP           4184 non-null   object 
 7   RoomService   4195 non-null   float64
 8   FoodCourt     4171 non-null   float64
 9   ShoppingMall  4179 non-null   float64
 10  Spa           4176 non-null   float64
 11  VRDeck        4197 non-null   float64
 12  Name          4183 non-null   object 
dtypes: float64(6), object(7)
memory usage: 434.5+ KB


In [10]:
test.isna().sum()

PassengerId       0
HomePlanet       87
CryoSleep        93
Cabin           100
Destination      92
Age              91
VIP              93
RoomService      82
FoodCourt       106
ShoppingMall     98
Spa             101
VRDeck           80
Name             94
dtype: int64

In [11]:
test.describe()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 4186.0 | 4195.0 | 4171.0 | 4179.0 | 4176.0 | 4197.0 |
| mean | 28.658146 | 219.266269 | 439.484296 | 177.295525 | 303.052443 | 310.710031 |
| std | 14.179072 | 607.011289 | 1527.663045 | 560.821123 | 1117.186015 | 1246.994742 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 26.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 37.0 | 53.0 | 78.0 | 33.0 | 50.0 | 36.0 |
| max | 79.0 | 11567.0 | 25273.0 | 8292.0 | 19844.0 | 22272.0 |


## EDA

- Examine how passengers are distributed across home planets (**HomePlanet**) and destinations (**Destination**).

In [12]:
from plotly.subplots import make_subplots

# Count each category once and reuse the result for the x, y, and marker-size values.
home_counts = train['HomePlanet'].value_counts()
dest_counts = train['Destination'].value_counts()

fig = make_subplots(rows=1, cols=2, shared_yaxes=True)

# Bubble chart: marker diameter scales with the passenger count.
fig.add_trace(go.Scatter(x=home_counts.index, y=home_counts.values,
                         mode='markers', name='HomePlanet',
                         marker=dict(sizemode='diameter', sizeref=25, size=home_counts.values)),
              row=1, col=1)

fig.add_trace(go.Scatter(x=dest_counts.index, y=dest_counts.values,
                         mode='markers', name='Destination',
                         marker=dict(sizemode='diameter', sizeref=25, size=dest_counts.values)),
              row=1, col=2)

# title_font replaces the deprecated titlefont layout attribute.
fig.update_layout(title='<b>Planets</b>',
                  title_x=0.5,
                  title_font=dict(size=20, color='black', family='Space Mono'),
                  plot_bgcolor='rgba(0,0,0,0)')
fig.show()
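
The chart above plots raw counts. If shares are wanted instead, `value_counts(normalize=True)` is one way to get them. The sketch below uses a small made-up series in place of `train['HomePlanet']`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for train['HomePlanet']; the values are made up for illustration.
home = pd.Series(["Earth", "Earth", "Europa", "Mars", None, "Earth"], name="HomePlanet")

# normalize=True returns shares instead of counts; missing values are excluded by default.
share = home.value_counts(normalize=True)

# dropna=False keeps the missing values as their own category.
share_with_na = home.value_counts(normalize=True, dropna=False)

print(share)
print(share_with_na)
```

Whether missing values should count toward the denominator depends on the question being asked, so both variants are shown.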

---
More visualizations to be added

---

# Preprocessing

### Splitting PassengerId

- Use the first four characters of PassengerId as a family group ID and store it in a new column, Group.

In [13]:
train['Group'] = train['PassengerId'].apply(lambda x : str(x[:4]))
test['Group'] = test['PassengerId'].apply(lambda x : str(x[:4]))
train[['PassengerId','Group']].head()

| | PassengerId | Group |
|---|---|---|
| 0 | 0001_01 | 1 |
| 1 | 0002_01 | 2 |
| 2 | 0003_01 | 3 |
| 3 | 0003_02 | 3 |
| 4 | 0004_01 | 4 |


In [14]:
train.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
Group            object
dtype: object

The first four characters of PassengerId were extracted correctly and stored with the object dtype.
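
As an alternative to slicing, the ID can be split on the underscore, which also exposes the part after it. This is a minimal sketch on toy IDs in the format shown above; the `Member` column is a hypothetical addition and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy IDs in the same format as the data: group number, underscore, member number.
ids = pd.DataFrame({"PassengerId": ["0001_01", "0003_01", "0003_02"]})

# Split once on "_" and expand the pieces into two string columns.
parts = ids["PassengerId"].str.split("_", expand=True)
ids["Group"] = parts[0]                # stays a string, so leading zeros are kept
ids["Member"] = parts[1].astype(int)   # hypothetical extra column: position within the group

print(ids)
```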

- Use Group to compute the number of family members.

In [15]:
train['Family'] = train.groupby('Group')['Group'].transform('count')
test['Family'] = test.groupby('Group')['Group'].transform('count')
train[['Group','Family']].sort_values(['Group']).head()

| | Group | Family |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 2 |
| 3 | 3 | 2 |
| 4 | 4 | 1 |


In [16]:
train['Family'].describe()

count    8693.000000
mean        2.035546
std         1.596347
min         1.000000
25%         1.000000
50%         1.000000
75%         3.000000
max         8.000000
Name: Family, dtype: float64

Family size ranges from 1 (a single passenger) to a maximum of 8.
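
To make the `transform('count')` step concrete, here is a small sketch on made-up group IDs. It also shows an equivalent lookup through `value_counts()`; the `Family_alt` column name is hypothetical.

```python
import pandas as pd

# Made-up group IDs: group "0003" has two members, the others have one.
toy = pd.DataFrame({"Group": ["0001", "0003", "0003", "0004"]})

# transform('count') broadcasts each group's size back to every row of that group.
toy["Family"] = toy.groupby("Group")["Group"].transform("count")

# Equivalent alternative: look up each group's size in value_counts().
toy["Family_alt"] = toy["Group"].map(toy["Group"].value_counts())

print(toy)
```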

### Splitting Cabin

- cd : Cabin Deck
- cn : Cabin Number
- cs : Cabin Side

In [17]:
train['cd'] = train['Cabin'].str.split('/').str[0]
train['cn'] = train['Cabin'].str.split('/').str[1]
train['cs'] = train['Cabin'].str.split('/').str[2]

test['cd'] = test['Cabin'].str.split('/').str[0]
test['cn'] = test['Cabin'].str.split('/').str[1]
test['cs'] = test['Cabin'].str.split('/').str[2]
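
The three separate splits can also be written as a single call with `expand=True`. The sketch below runs on toy cabin strings and additionally converts the cabin number to a numeric type, which is an assumption about how `cn` is meant to be used later.

```python
import pandas as pd

# Toy cabins in deck/number/side form, including one missing value.
cabins = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", None]})

# One split instead of three; expand=True returns one column per piece.
cabins[["cd", "cn", "cs"]] = cabins["Cabin"].str.split("/", expand=True)

# The cabin number arrives as text; convert it so it can be treated as a number.
cabins["cn"] = pd.to_numeric(cabins["cn"])

print(cabins)
```

Rows with a missing Cabin stay missing in all three derived columns, so they can still be imputed afterwards.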

### Handling missing values

Load the required libraries.

In [18]:
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

Collect the object-dtype columns to be encoded in the variable label_cols.

In [19]:
label_cols= ["HomePlanet", "CryoSleep","Cabin", "Destination" ,"VIP","cd","cs","Name"]

- LabelEncoder
    - Converts categorical columns into numeric data

In [20]:
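def label_encoder(train, test, columns):
    for col in columns:
        # astype(str) turns NaN into the string 'nan', which is then encoded as its own category.
        train[col] = train[col].astype(str)
        test[col] = test[col].astype(str)
        # Train and test are fitted separately, so the same category can get a different code in each.
        train[col] = LabelEncoder().fit_transform(train[col])
        test[col] = LabelEncoder().fit_transform(test[col])
    return train, test

train, test = label_encoder(train, test, label_cols)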
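
One thing to watch in `label_encoder`: because train and test each get their own encoder, the same category can receive different integer codes whenever the two splits do not contain exactly the same set of values (likely for a column such as Name). The sketch below shows the effect on made-up values, along with one possible remedy, fitting a single encoder on the union of both splits. It is an illustration, not what this notebook ran.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns: the test split lacks one category that train has.
tr = pd.Series(["Earth", "Europa", "Mars"])
te = pd.Series(["Europa", "Mars", "Mars"])

# Separate fits: each encoder numbers its own sorted categories from 0.
sep_tr = LabelEncoder().fit_transform(tr)   # Earth=0, Europa=1, Mars=2
sep_te = LabelEncoder().fit_transform(te)   # Europa=0, Mars=1

# Shared fit: one encoder learns the union, then transforms both splits.
enc = LabelEncoder().fit(pd.concat([tr, te]))
shared_tr = enc.transform(tr)
shared_te = enc.transform(te)

print(sep_tr, sep_te)
print(shared_tr, shared_te)
```

With the shared encoder, "Mars" maps to the same code in both splits, so a model trained on one split sees consistent values in the other.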

- SimpleImputer
    - Fill missing values in numeric columns with the median
    - Fill missing values in categorical columns with the most frequent value
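
The two strategies can be sketched on a toy frame as follows. Here the imputers run on the raw columns, before any encoding, so that missing categorical entries are still visible to the imputer; that ordering is an assumption of this sketch and differs from the cell order in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with one missing value per column.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", np.nan, "Earth", "Mars"],
    "Age": [20.0, 30.0, np.nan, 100.0],
})

# Categorical column: fill with the most frequent value.
cat_imputer = SimpleImputer(strategy="most_frequent")
toy[["HomePlanet"]] = cat_imputer.fit_transform(toy[["HomePlanet"]])

# Numeric column: fill with the median, which is less sensitive to outliers than the mean.
num_imputer = SimpleImputer(strategy="median")
toy[["Age"]] = num_imputer.fit_transform(toy[["Age"]])

print(toy)
```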

In [21]:
train.isna().sum()

PassengerId       0
HomePlanet        0
CryoSleep         0
Cabin             0
Destination       0
Age             179
VIP               0
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name              0
Transported       0
Group             0
Family            0
cd                0
cn              199
cs                0
dtype: int64

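Before imputation, the numeric columns and `cn` still contain many missing values. The label-encoded columns show 0 because `astype(str)` turned their NaN entries into the string 'nan', which was then encoded as a category of its own.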

In [22]:
imputer_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP","cd","cs"]
STRATEGY = 'most_frequent'

imputer = SimpleImputer(strategy=STRATEGY)
imputer.fit(train[imputer_cols])
train[imputer_cols] = imputer.transform(train[imputer_cols])
test[imputer_cols] = imputer.transform(test[imputer_cols])

print("train_data:\n", train[imputer_cols].isnull().sum())
print(20*"-")
print("test:\n", test[imputer_cols].isnull().sum())

train_data:
 HomePlanet     0
CryoSleep      0
Destination    0
VIP            0
cd             0
cs             0
dtype: int64
--------------------
test:
 HomePlanet     0
CryoSleep      0
Destination    0
VIP            0
cd             0
cs             0
dtype: int64


In [23]:
imputer_cols = ["Age", "FoodCourt", "ShoppingMall", "Spa", "VRDeck" ,"RoomService","Family","cn"]
STRATEGY = 'median'

imputer = SimpleImputer(strategy=STRATEGY)
imputer.fit(train[imputer_cols])
train[imputer_cols] = imputer.transform(train[imputer_cols])
test[imputer_cols] = imputer.transform(test[imputer_cols])

print("train_data:\n", train[imputer_cols].isnull().sum())
print(20*"-")
print("test:\n", test[imputer_cols].isnull().sum())

train_data:
 Age             0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
RoomService     0
Family          0
cn              0
dtype: int64
--------------------
test:
 Age             0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
RoomService     0
Family          0
cn              0
dtype: int64


📍 Missing values handled

Drop the columns that will not be used for ML.

In [24]:
train = train.drop(['Cabin','PassengerId','Group'],axis = 1)
test = test.drop(['Cabin','PassengerId','Group'],axis = 1)

---

# Model

In [25]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    8693 non-null   int64  
 1   CryoSleep     8693 non-null   int64  
 2   Destination   8693 non-null   int64  
 3   Age           8693 non-null   float64
 4   VIP           8693 non-null   int64  
 5   RoomService   8693 non-null   float64
 6   FoodCourt     8693 non-null   float64
 7   ShoppingMall  8693 non-null   float64
 8   Spa           8693 non-null   float64
 9   VRDeck        8693 non-null   float64
 10  Name          8693 non-null   int64  
 11  Transported   8693 non-null   bool   
 12  Family        8693 non-null   float64
 13  cd            8693 non-null   int64  
 14  cn            8693 non-null   float64
 15  cs            8693 non-null   int64  
dtypes: bool(1), float64(8), int64(7)
memory usage: 1.0 MB


📍 Apart from the **TARGET** column, **Transported**, the data is now ready for ML.

- Store the names of the features used for ML in col, and the name of the target, Transported, in the TARGET variable.

In [26]:
col = train.drop(['Transported'], axis = 1).columns.tolist()
TARGET = 'Transported'

- Split the data into training and validation sets before modeling

In [27]:
X = train[col]
Y = train[TARGET]
# train_test_split returns (X_train, X_val, y_train, y_val), in that order
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.33)
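
The return order of `train_test_split` is easy to get wrong, so here is a small self-contained check on made-up arrays. The `stratify` and `random_state` arguments are additions for this sketch and are not part of the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 rows, 2 made-up features
y_demo = np.array([0, 1] * 5)           # balanced binary target

# Return order: features first (train, validation), then labels (train, validation).
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_va.shape, y_tr.shape, y_va.shape)
```

Passing `stratify` keeps the class ratio the same in both parts, which is usually helpful for a binary target like Transported.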

- StratifiedKFold
    - Run 5-fold cross-validation.
    - Store the train and validation indices of each fold in idx_fold.

In [28]:
idx_fold = []

skf = StratifiedKFold(5, random_state=12, shuffle=True)
for idx_train, idx_valid in skf.split(X = X, y = Y):
    idx_fold.append((idx_train, idx_valid))
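
What stratification buys can be seen on a toy target: every validation fold keeps the same share of positive labels as the full data. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_demo = np.array([True] * 10 + [False] * 10)   # made-up balanced target
X_demo = np.zeros((len(y_demo), 1))             # features are irrelevant to the split

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)

# Share of positive labels in each validation fold.
fold_rates = [y_demo[valid_idx].mean() for _, valid_idx in skf_demo.split(X_demo, y_demo)]

print(fold_rates)
```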

- Store the hyperparameters

In [29]:
lgb_params = {
    'objective' : 'binary',
    'n_estimators' : 975,
    'min_child_samples' : 5,
    'max_depth' : 8,
    'learning_rate' : 0.01,
    'n_jobs' : -1,
    'importance_type' : 'gain'
}

# Configuration with feature and bagging subsampling
# (feature_fraction=0.6, bagging_freq=5, bagging_fraction=0.9).
# The cross-validation loop below builds its models from lgb_params instead.
model = LGBMClassifier(n_estimators=975, min_child_samples=5, max_depth=8, learning_rate=0.01,
                       feature_fraction=0.6, bagging_freq=5, bagging_fraction=0.9,
                       objective='binary')

- Set up variables to store the validation scores and the predictions

In [30]:
lgb_scores = []
lgb_auc = []
predictions = 0

- Train the model five times, once for each fold in idx_fold.

In [31]:
from sklearn.metrics import roc_auc_score

In [32]:
for fold, (idx_train, idx_valid) in enumerate(idx_fold):
    print(10*"=", f"Fold={fold+1}", 10*"=")
    
    X_train, X_valid = train.iloc[idx_train][col], train.iloc[idx_valid][col]
    Y_train, Y_valid = train.iloc[idx_train][TARGET], train.iloc[idx_valid][TARGET]
    
    # verbose=-1 silences LightGBM logging; fit() no longer accepts a verbose argument in LightGBM 4.x
    model = LGBMClassifier(**lgb_params, verbose=-1)
    model.fit(X_train, Y_train)
    
    # Measure and store the validation accuracy
    preds_valid = model.predict(X_valid)
    acc = accuracy_score(Y_valid, preds_valid)
    lgb_scores.append(acc)
    
    # predict() returns hard True/False labels, so this accumulates the share of folds voting True
    preds = model.predict(test[col])
    predictions += preds/5

    # Validation AUC from the predicted probability of the positive class
    roc = roc_auc_score(Y_valid, model.predict_proba(X_valid)[:, 1])
    lgb_auc.append(roc)
    
    # Both scores are fractions between 0 and 1, not percentages
    print(f"Fold={fold+1}, Accuracy score : {acc:.6f}")
    print(f"Fold={fold+1}, AUC : {roc:.6f}\n")
    

print("Mean Accuracy : ", np.mean(lgb_scores))
print("Mean AUC : ", np.mean(lgb_auc))

Fold=1, Accuracy score : 0.823462
Fold=1, AUC : 0.910910

Fold=2, Accuracy score : 0.800460
Fold=2, AUC : 0.896033

Fold=3, Accuracy score : 0.809661
Fold=3, AUC : 0.895227

Fold=4, Accuracy score : 0.802071
Fold=4, AUC : 0.898884

Fold=5, Accuracy score : 0.809551
Fold=5, AUC : 0.900163

Mean Accuracy :  0.8090410146698861
Mean AUC :  0.9002434106663824


**Mean accuracy: 0.809**

---

# Submission

In [33]:
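# predictions holds the share of folds that voted True; astype("bool") maps any nonzero share to True
submission[TARGET] = predictions.astype("bool")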
submission.to_csv("submission.csv",index=False)
submission.head()

| | PassengerId | Transported |
|---|---|---|
| 0 | 0013_01 | True |
| 1 | 0018_01 | False |
| 2 | 0019_01 | True |
| 3 | 0021_01 | True |
| 4 | 0023_01 | True |
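
Because `predictions` accumulates `preds/5` over five folds, it ends up as the share of folds that predicted True for each passenger. Casting that share with `astype("bool")` marks a passenger as True as soon as a single fold says so. The sketch below contrasts this with a majority threshold, using made-up votes rather than actual model output.

```python
import numpy as np

# Made-up hard votes from 5 fold models for 3 passengers (rows = folds).
votes = np.array([
    [True,  False, False],
    [True,  False, False],
    [True,  True,  False],
    [False, False, False],
    [True,  False, False],
])

vote_share = votes.mean(axis=0)       # same quantity as summing preds / 5 over the folds

any_vote = vote_share.astype(bool)    # True if at least one fold voted True
majority = vote_share > 0.5           # True only if most folds voted True

print(vote_share, any_vote, majority)
```

The second passenger is the interesting case: one vote out of five is enough for `astype(bool)`, but not for the majority rule. Averaging `predict_proba` outputs across folds and thresholding at 0.5 would be another option; which variant scores better on this competition has not been checked here.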


# Results
4/4 : accuracy : 0.791   |  rank :   341 / total  

4/5 : accuracy : 0.80102    |    rank  :  323  / 1187  

4/6 : accuracy : 0.80243    |    rank  :  292  / 1187
