## Objective of the notebook

To predict abalone rings (a proxy for age) with a regression model, this notebook first explores the data and the properties of its features.

## About the dataset

The dataset covers 4,177 abalones and has 9 columns: sex, length, diameter, height, whole weight, shucked weight, viscera weight, shell weight, and rings.

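| Column         | Meaning        | Notes                                   |
|----------------|----------------|-----------------------------------------|
| Sex            | Sex            | Male (M), Female (F), Infant (I)        |
| Length         | Length         | Longest measurable shell length         |
| Diameter       | Diameter       | Measured perpendicular to the length    |
| Height         | Height         | Height with the meat in the shell       |
| Whole weight   | Whole weight   | Weight of the whole abalone             |
| Shucked weight | Shucked weight | Weight of the meat only                 |
| Viscera weight | Viscera weight | Gut weight after bleeding               |
| Shell weight   | Shell weight   | Shell weight after drying               |
| Rings          | Rings          | Rings + 1.5 gives the age in years      |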

In [5]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [6]:
abal_df=pd.read_csv('abalone.csv')

In [7]:
abal_df.shape

(4177, 9)

In [16]:
abal_df.head()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,sex_f,sex_i,sex_m
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,0.0,0.0,1.0
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,0.0,0.0,1.0
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,1.0,0.0,0.0
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,0.0,0.0,1.0
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,0.0,1.0,0.0


# Testing

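Our goal is to try several models and learn which one works better in which case. To check whether this dataset suits that goal, we ran a few simple models (DecisionTreeRegressor, MLPClassifier, ...) in advance against a variety of targets.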

## Preprocessing

Sex is a nominal variable, so we convert it to dummy variables.

In [9]:
RANDOM_STATE=11

In [10]:
test_df=pd.get_dummies(abal_df.Sex)
test_df=test_df.rename(index=str, columns={'F': 'sex_f','I': 'sex_i','M':'sex_m'})
test_df=test_df.astype('float64', copy=False)
test_df.head()

Unnamed: 0,sex_f,sex_i,sex_m
0,0.0,0.0,1.0
1,0.0,0.0,1.0
2,1.0,0.0,0.0
3,0.0,0.0,1.0
4,0.0,1.0,0.0


In [11]:
abal_df=abal_df.drop(['Sex'],axis=1)

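Concatenating the original abal_df with test_df would be simpler, but the result kept containing null values; we are still working on a fix. For now the dummy columns are copied over as plain lists.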
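One likely cause of the null values is the `rename(index=str, ...)` call above: it converts the integer row labels to strings, and `pd.concat(..., axis=1)` aligns rows by label, so the two frames no longer match. The sketch below reproduces this on a tiny made-up frame (the values and the names `df`, `misaligned`, `aligned` are illustrative assumptions, not the abalone data).

```python
import pandas as pd

# Tiny stand-in frame; the values are illustrative only
df = pd.DataFrame({"Sex": ["M", "F", "I"], "Length": [0.455, 0.53, 0.33]})

dummies = (
    pd.get_dummies(df["Sex"])
    .rename(columns={"F": "sex_f", "I": "sex_i", "M": "sex_m"})
    .astype("float64")
)

# index=str turns the labels 0, 1, 2 into '0', '1', '2'
dummies_str_index = dummies.rename(index=str)

# Row labels differ (int vs str), so concat cannot line the rows up
misaligned = pd.concat([df.drop(columns="Sex"), dummies_str_index], axis=1)

# Keeping the original integer index lets concat align row by row
aligned = pd.concat([df.drop(columns="Sex"), dummies], axis=1)

print(misaligned.shape, aligned.shape)
```

If this is indeed the cause, renaming only the columns would make the list-copy workaround unnecessary.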

In [12]:
tmp_1=[i for i in test_df.sex_f]
tmp_2=[i for i in test_df.sex_i]
tmp_3=[i for i in test_df.sex_m]

In [14]:
abal_df['sex_f']=tmp_1
abal_df['sex_i']=tmp_2
abal_df['sex_m']=tmp_3

In [15]:
abal_df.drop(['Rings'],axis=1).head()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,sex_f,sex_i,sex_m
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,0.0,0.0,1.0
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,0.0,0.0,1.0
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,1.0,0.0,0.0
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,0.0,0.0,1.0
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,0.0,1.0,0.0


## Regression Rings

With the abalone ring count as the target, we tried a DecisionTreeRegressor.

In [101]:
X_train, X_test, y_train, y_test = train_test_split(abal_df.drop(['Rings'],axis=1),abal_df.Rings,test_size=0.3,random_state=RANDOM_STATE)

In [102]:
from sklearn.tree import DecisionTreeRegressor

In [103]:
tree=DecisionTreeRegressor()

In [104]:
tree.fit(X_train,y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [105]:
tree.score(X_test,y_test)

0.18145444453558013

In [106]:
tree.predict(X_test)

array([12.,  9.,  8., ..., 13.,  8., 16.])
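Note that for scikit-learn regressors, `score` returns the coefficient of determination (R²), not an accuracy. The following is a minimal sketch on synthetic data (generated with `make_regression`, an assumption standing in for the abalone features) that computes R² by hand and compares it with `score` and `r2_score`, alongside the mean absolute error as a more directly interpretable metric.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; not the abalone measurements
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=11)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=11)

model = DecisionTreeRegressor(random_state=11).fit(X_tr, y_tr)
pred = model.predict(X_te)

# R^2 = 1 - SS_res / SS_tot, which is what model.score reports
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_score_value = r2_score(y_te, pred)
mae = mean_absolute_error(y_te, pred)  # average absolute error in target units

print(r2_manual, r2_score_value, mae)
```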

## Regression Rings

With the ring count as the target again, this time we tried an ensemble model, RandomForestRegressor.

In [83]:
from sklearn.ensemble import RandomForestRegressor

In [84]:
rf=RandomForestRegressor()

In [85]:
rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [86]:
rf.score(X_test,y_test)

0.5280659952831619

In [87]:
standard_tree.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('classifier', DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'))])
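The cell that defines `standard_tree` does not appear in this notebook. Judging only from the printed repr above, it was presumably built along the following lines; treat this as a reconstruction under that assumption, not the original cell.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Reconstructed from the printed repr; the original defining cell is not shown.
# The step is named 'classifier' in the repr even though it holds a regressor.
standard_tree = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", DecisionTreeRegressor()),
])
```

Decision trees split on feature thresholds, so standardizing the inputs generally has little effect on their predictions; the scaler matters more for models such as MLPs.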

## Regression Length

With the abalone length (longest shell measurement) as the target, we tried a DecisionTreeRegressor.

In [88]:
X_train, X_test, y_train, y_test = train_test_split(abal_df.drop(['Length'],axis=1),abal_df['Length'],test_size=0.3,random_state=RANDOM_STATE)

In [89]:
tree.fit(X_train,y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [90]:
tree.score(X_test,y_test)

0.9597331813352243

## Regression Viscera weight

With the abalone viscera weight as the target, we tried a DecisionTreeRegressor.

In [91]:
X_train, X_test, y_train, y_test = train_test_split(abal_df.drop(['Viscera weight'],axis=1),abal_df['Viscera weight'],test_size=0.3,random_state=RANDOM_STATE)

In [92]:
tree.fit(X_train,y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [93]:
tree.score(X_test,y_test)

0.8948880314413669

## Classification Sex
Finally, to see whether sex can be classified, we tried an MLPClassifier.

In [94]:
X_train, X_test, y_train, y_test = train_test_split(abal_df.drop(['sex_f','sex_i','sex_m'],axis=1),test_df,test_size=0.3,random_state=RANDOM_STATE)

In [95]:
from sklearn.neural_network import MLPClassifier

In [96]:
mlp=MLPClassifier()

In [97]:
mlp.fit(X_train,y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [98]:
mlp.score(X_test,y_test)

0.23923444976076555

In [99]:
mlp.predict(X_test)

array([[0, 0, 1],
       [0, 1, 0],
       [0, 0, 0],
       ...,
       [1, 0, 1],
       [0, 0, 0],
       [0, 0, 0]])
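Because the target passed to `fit` is the three-column one-hot frame, `MLPClassifier` treats the problem as multilabel: each column is predicted independently, which is why rows such as `[1, 0, 1]` or `[0, 0, 0]` appear, and `score` only counts rows where all three columns match. A minimal sketch of the multiclass alternative is shown below on synthetic data; the feature names `f1`, `f2` and the random labels are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(11)

# Synthetic stand-in: two numeric features and a one-hot encoded 3-class label
X = pd.DataFrame(rng.normal(size=(300, 2)), columns=["f1", "f2"])
codes = rng.integers(0, 3, size=300)
onehot = pd.DataFrame(np.eye(3)[codes], columns=["sex_f", "sex_i", "sex_m"])

# Collapse the one-hot columns back into a single class label per row
y = onehot.idxmax(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=11)

# With a 1-D target the classifier predicts exactly one class per row
clf = MLPClassifier(max_iter=300, random_state=11).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(pred[:5])
```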

# Conclusion

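For targets other than rings (whole weight, viscera weight, length), the test scores were close to 1 (about 0.96 for length and 0.89 for viscera weight above). These targets are already well explained by the other features, so searching for hyperparameters or trying different models would be comparatively uninformative. By contrast, as the results above show, the regression on rings depends heavily on the model (for example, DecisionTreeRegressor at about 0.18 versus RandomForestRegressor at about 0.53) and leaves plenty of room for improvement through hyperparameter tuning. We therefore judged it a good fit for our goal and chose rings as the target.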
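As a rough idea of what the hyperparameter tuning mentioned above could look like, here is a minimal sketch using `GridSearchCV` with a `RandomForestRegressor`. The data is synthetic and the grid values are arbitrary assumptions chosen to keep the example fast, not tuned recommendations for the abalone data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; not the abalone measurements
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=11)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=11)

# Illustrative grid; the values are assumptions, not recommendations
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

# 3-fold cross-validation on the training split only
search = GridSearchCV(RandomForestRegressor(random_state=11), param_grid, cv=3)
search.fit(X_tr, y_tr)

best_params = search.best_params_
test_r2 = search.score(X_te, y_te)  # R^2 of the refit best model on held-out data

print(best_params, test_r2)
```

Fixing `random_state` on the model as well as on the split keeps the comparison between parameter settings reproducible.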