## [Kaggle Clone Coding] Statoil/C-CORE Iceberg Classifier Challenge  
- [Statoil/C-CORE Iceberg Classifier Challenge](https://www.kaggle.com/c/statoil-iceberg-classifier-challenge/data)
- [Transfer Learning with VGG-16 CNN+AUG LB 0.1712](https://www.kaggle.com/devm2024/transfer-learning-with-vgg-16-cnn-aug-lb-0-1712)
  
- Task: binary classification (decide from the given data whether each image shows a ship or an iceberg)  
  - 0: ship
  - 1: iceberg
---

### Downloading the data to Colab via the Kaggle API

In [None]:
!pip install kaggle
from google.colab import files
files.upload()



Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"yoonj98","key":"<redacted>"}'}

In [None]:
import os 
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/Kaggle/kaggle/'

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
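
The upload cell above prints the contents of `kaggle.json` into the notebook output, which exposes the API key to anyone who reads the notebook. A minimal sketch of a quieter alternative, assuming the token file is already on disk: copy it into the directory that `KAGGLE_CONFIG_DIR` points to and restrict its permissions, without echoing it. The function name `install_kaggle_token` and the paths are hypothetical, and the token written in the demo is a placeholder.

```python
import os
import shutil
import stat
import tempfile


def install_kaggle_token(src_path, config_dir):
    """Copy a Kaggle API token into config_dir and make it owner-readable only."""
    os.makedirs(config_dir, exist_ok=True)
    dst_path = os.path.join(config_dir, "kaggle.json")
    shutil.copyfile(src_path, dst_path)
    os.chmod(dst_path, stat.S_IRUSR | stat.S_IWUSR)  # mode 0o600
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return dst_path


# Demo with a placeholder token in a temporary directory (not real credentials).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "uploaded.json")
with open(src, "w") as f:
    f.write('{"username": "<your-username>", "key": "<your-key>"}')

token_path = install_kaggle_token(src, os.path.join(workdir, "kaggle"))
token_mode = stat.S_IMODE(os.stat(token_path).st_mode)
print(token_path, oct(token_mode))
```

In Colab, ending the upload cell with a semicolon or assigning the result (`_ = files.upload()`) also keeps the returned dictionary out of the cell output.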


In [None]:
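# Copy the API command from the Data tab of the competition you want to download
# Note: each `!` line runs in its own subshell, so this `cd` does not carry over
# to the next command; `%cd` would change the notebook's working directory instead.
! cd /content/drive/MyDrive/Kaggle/kaggle/
! kaggle competitions download -c statoil-iceberg-classifier-challenge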

Downloading sample_submission.csv.7z to /content
  0% 0.00/37.7k [00:00<?, ?B/s]
100% 37.7k/37.7k [00:00<00:00, 31.2MB/s]
Downloading test.json.7z to /content
100% 244M/245M [00:02<00:00, 98.3MB/s]
100% 245M/245M [00:02<00:00, 86.2MB/s]
Downloading train.json.7z to /content
 77% 33.0M/42.9M [00:00<00:00, 55.6MB/s]
100% 42.9M/42.9M [00:00<00:00, 67.4MB/s]


In [None]:
# Check the downloaded data
!ls

drive	sample_data		  test.json.7z
gdrive	sample_submission.csv.7z  train.json.7z


In [None]:
# Extract the archives
!p7zip -d test.json.7z
!p7zip -d train.json.7z


7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 257127394 bytes (246 MiB)

Extracting archive: test.json.7z
--
Path = test.json.7z
Type = 7z
Physical Size = 257127394
Headers Size = 154
Method = LZMA2:24
Solid = -
Blocks = 1

  5% - data/processed/test.json

However, the files kept being downloaded and extracted into the root working directory (`/content`), so I moved them over by hand.
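
The files land in `/content` because each `!` command runs in its own child shell, so a `cd` issued on one line is gone by the next. A small sketch that reproduces this outside a notebook with `subprocess`; the temporary directory is only a stand-in for the Drive folder.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = os.path.realpath(tempfile.mkdtemp())

# A `cd` inside a child shell (what `! cd ...` does) ends with that shell.
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
after_shell_cd = os.getcwd()  # still the starting directory

# Changing the directory of the Python process itself does persist
# (in a notebook, `%cd` has the same effect).
os.chdir(target_dir)
after_chdir = os.path.realpath(os.getcwd())

# Commands launched afterwards inherit the new working directory.
child_cwd = subprocess.run("pwd", shell=True, check=True,
                           capture_output=True, text=True).stdout.strip()

os.chdir(start_dir)  # restore the original directory
print(after_shell_cd == start_dir, after_chdir == target_dir)
```

Under that reading, replacing `! cd ...` with `%cd ...` before the download and extraction commands should place the files in the Drive folder directly.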

---
### Code

This notebook uses a pre-trained VGG-16 network, an architecture reported to work well on small images such as CIFAR-10.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from os.path import join as opj
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pylab
plt.rcParams['figure.figsize'] = 10, 10
%matplotlib inline

In [3]:
train = pd.read_json('/content/drive/MyDrive/Kaggle/kaggle/Statoil C-CORE Iceberg Classifier/processed/train.json')
test = pd.read_json('/content/drive/MyDrive/Kaggle/kaggle/Statoil C-CORE Iceberg Classifier/processed/test.json')
target_train = train['is_iceberg']

In [4]:
train.head()

| | id | band_1 | band_2 | inc_angle | is_iceberg |
|---|---|---|---|---|---|
| 0 | dfd5f913 | [-27.878360999999998, -27.15416, -28.668615, -... | [-27.154118, -29.537888, -31.0306, -32.190483,... | 43.9239 | 0 |
| 1 | e25388fd | [-12.242375, -14.920304999999999, -14.920363, ... | [-31.506321, -27.984554, -26.645678, -23.76760... | 38.1562 | 0 |
| 2 | 58b2aaa0 | [-24.603676, -24.603714, -24.871029, -23.15277... | [-24.870956, -24.092632, -20.653963, -19.41104... | 45.2859 | 1 |
| 3 | 4cfc3a18 | [-22.454607, -23.082819, -23.998013, -23.99805... | [-27.889421, -27.519794, -27.165262, -29.10350... | 43.8306 | 0 |
| 4 | 271f93f4 | [-26.006956, -23.164886, -23.164886, -26.89116... | [-27.206915, -30.259186, -30.259186, -23.16495... | 35.6256 | 0 |


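- Brief description of the data  
  - id: the ID of the image
  - band_1, band_2: the flattened image data.  
  Each band is a list of 75x75 pixel values, so each list has 5625 elements. Because the values carry physical meaning, they are not the non-negative integers usual in image files but floating-point numbers in dB.  
  Band 1 and band 2 are signals characterized by radar backscatter produced from different polarizations at a particular incidence angle. The polarizations are HH (transmit and receive horizontally) and HV (transmit horizontally, receive vertically). 
  - inc_angle: the incidence angle at which the image was taken. This field has missing values marked "na", and the images with an "na" incidence angle are all in the training data to prevent leakage.
  - is_iceberg: the target variable, set to 1 for an iceberg and 0 for a ship.
  
  >The given data is the backscatter coefficient, which can be obtained as follows.<br>  
  ![image](https://user-images.githubusercontent.com/69336270/130349509-2c6dd078-7298-4c25-8e50-019a4d166366.png)
  1. ip = the incidence angle for a particular pixel
  2. ic = the incidence angle for the center of the image
  3. K = a constant    
  The backscatter coefficient depends on the surface that scattered the signal. The HH component varies over a wide range, whereas the HV component does not. 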
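
To make the band layout concrete, here is a minimal sketch on synthetic values (not the competition data) that turns one flattened 5625-element band back into a 75x75 image and converts dB to linear power. The value range and the row-major ordering assumed by `reshape` are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for one flattened band: 75 * 75 = 5625 dB values.
rng = np.random.default_rng(0)
band_flat = rng.uniform(-35.0, -10.0, size=75 * 75).tolist()

# Restore the 2-D image layout (assumes row-major ordering).
band_img = np.asarray(band_flat, dtype=np.float32).reshape(75, 75)

# dB is a logarithmic scale: power = 10 ** (dB / 10).
band_linear = 10.0 ** (band_img / 10.0)

print(band_img.shape, float(band_img.min()), float(band_img.max()))
```

The negative dB values map to small positive linear powers, which is why the raw pixel values look nothing like the 0 to 255 range of ordinary image files.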




Keras ships a pre-trained VGG model, so we drop VGG's final (classifier) layers and attach a sigmoid layer for binary classification. 

In [5]:
target_train=train['is_iceberg']
# "na" strings become NaN
test['inc_angle']=pd.to_numeric(test['inc_angle'], errors='coerce')
train['inc_angle']=pd.to_numeric(train['inc_angle'], errors='coerce')
# forward-fill missing training angles (same result as the deprecated fillna(method='pad'))
train['inc_angle']=train['inc_angle'].ffill()
X_angle=train['inc_angle']

X_test_angle=test['inc_angle']
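
The effect of the angle cleanup is easy to see on a toy series. This is an illustrative sketch with made-up values: `errors="coerce"` turns the `"na"` strings into `NaN`, and a forward fill (what `method='pad'` means) copies the previous valid angle into each gap.

```python
import pandas as pd

angles = pd.Series([43.92, "na", 38.16, "na", 45.29])

numeric = pd.to_numeric(angles, errors="coerce")  # "na" -> NaN
filled = numeric.ffill()                          # each NaN takes the previous valid value

print(filled.tolist())
```

A forward fill leaves a leading `NaN` untouched, and the copied value depends on row order rather than on the image itself, so it is a convenience rather than a principled imputation.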

In [6]:
# Build the train and test data: HH, HV, and the mean of the two

X_band_1=np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in train["band_1"]])
X_band_2=np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in train["band_2"]])
X_band_3=(X_band_1+X_band_2)/2
X_train = np.concatenate([X_band_1[:, :, :, np.newaxis]
                          , X_band_2[:, :, :, np.newaxis]
                         , X_band_3[:, :, :, np.newaxis]], axis=-1)

X_band_test_1=np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in test["band_1"]])
X_band_test_2=np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in test["band_2"]])
X_band_test_3=(X_band_test_1+X_band_test_2)/2
X_test = np.concatenate([X_band_test_1[:, :, :, np.newaxis]
                          , X_band_test_2[:, :, :, np.newaxis]
                         , X_band_test_3[:, :, :, np.newaxis]], axis=-1)
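
As a sanity check on the channel layout, the sketch below builds the same three-channel tensor from random stand-in bands and confirms that `np.stack` along the last axis gives the same result as the `np.newaxis` plus `np.concatenate` pattern used above. The array names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # number of toy images
hh = rng.normal(size=(n, 75, 75)).astype(np.float32)
hv = rng.normal(size=(n, 75, 75)).astype(np.float32)
mean_band = (hh + hv) / 2

# Channels-last tensor: (samples, height, width, 3)
stacked = np.stack([hh, hv, mean_band], axis=-1)

# Same result via the newaxis + concatenate pattern
concatenated = np.concatenate(
    [hh[..., np.newaxis], hv[..., np.newaxis], mean_band[..., np.newaxis]], axis=-1
)

print(stacked.shape)
```

The third channel carries no new information; it exists so the input has the three channels that the ImageNet-pretrained VGG16 weights expect.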

In [10]:
from matplotlib import pyplot
from tensorflow import keras
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Input, Flatten, Activation
from keras.layers import GlobalMaxPooling2D
from keras.layers import BatchNormalization
from keras.layers.merge import Concatenate
from keras.models import Model
from keras import initializers
from keras.layers.advanced_activations import LeakyReLU, PReLU
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping

from keras.datasets import cifar10
from keras.applications.inception_v3 import InceptionV3
from keras.applications.vgg16 import VGG16
from keras.applications.xception import Xception
from keras.applications.mobilenet import MobileNet
from keras.applications.vgg19 import VGG19
from keras.layers import Concatenate, Dense, LSTM, Input, concatenate
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input	

batch_size=64

In [11]:
gen = ImageDataGenerator(horizontal_flip = True,
                         vertical_flip = True,
                         width_shift_range = 0.,
                         height_shift_range = 0.,
                         channel_shift_range=0,
                         zoom_range = 0.2,
                         rotation_range = 10)
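
With both shift ranges and the channel shift set to 0, the generator above effectively applies only flips, zoom, and small rotations. As a rough illustration of what the two flip options do to a channels-last image, here is a NumPy sketch; it is not Keras's own implementation.

```python
import numpy as np

# Tiny image with shape (height, width, channels)
img = np.arange(12, dtype=np.float32).reshape(2, 3, 2)

flipped_lr = img[:, ::-1, :]  # horizontal flip: reverse the width axis
flipped_ud = img[::-1, :, :]  # vertical flip: reverse the height axis

print(img[0, :, 0], flipped_lr[0, :, 0])
```

Flips are a natural fit for this data, since a ship or iceberg seen by radar has no canonical left/right or up/down orientation.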

In [16]:
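# data generator
def gen_flow_for_two_inputs(X1, X2, y):
    # Both flows take the images X1 and share a seed, so they shuffle and augment identically.
    genX1 = gen.flow(X1, y, batch_size=batch_size, seed=55)
    # X2 (the angles) rides along in the label slot so it stays aligned with the images.
    genX2 = gen.flow(X1, X2, batch_size=batch_size, seed=55)
    while True:
        X1i = next(genX1)
        X2i = next(genX2)
        # X1i = (images, labels), X2i = (images, angles)
        yield [X1i[0], X2i[1]], X1i[1]

def get_callbacks(filepath, patience=2):
    # Note: the patience argument is not used; early stopping is fixed at 10 epochs.
    es = EarlyStopping('val_loss', patience=10, mode="min")
    msave = ModelCheckpoint(filepath, save_best_only=True)
    return [es, msave]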
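
The point of the two seeded flows is that each image must stay paired with its own angle and label after shuffling. The NumPy sketch below shows the same idea without augmentation, by applying one shared permutation to all three arrays. The function name `paired_batches` and the toy data are hypothetical.

```python
import numpy as np


def paired_batches(images, angles, labels, batch_size, seed=0):
    """Yield ([image_batch, angle_batch], label_batch) forever with aligned rows."""
    rng = np.random.default_rng(seed)
    n = len(images)
    while True:
        order = rng.permutation(n)  # one shuffle shared by all three arrays
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield [images[idx], angles[idx]], labels[idx]


# Toy data in which the image, angle, and label all encode the row index,
# so any misalignment would be easy to spot.
n = 10
images = np.arange(n, dtype=np.float32).reshape(n, 1, 1, 1) * np.ones((1, 2, 2, 3), dtype=np.float32)
angles = np.arange(n, dtype=np.float32)
labels = np.arange(n) % 2

(batch_img, batch_angle), batch_y = next(paired_batches(images, angles, labels, batch_size=4))
print(batch_angle, batch_y)
```

If the two Keras flows used different seeds, the angle fed to the network would belong to a different image than the one it is concatenated with.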

In [19]:
def getVggAngleModel():
    input_2 = Input(shape=[1], name="angle")
    angle_layer = Dense(1, )(input_2)
    # include_top=False drops VGG16's fully connected classifier, so `classes` has no effect here
    base_model = VGG16(weights='imagenet', include_top=False, 
                 input_shape=X_train.shape[1:], classes=1)
    x = base_model.get_layer('block5_pool').output
    

    x = GlobalMaxPooling2D()(x)
    merge_one = concatenate([x, angle_layer])
    merge_one = Dense(512, activation='relu', name='fc2')(merge_one)
    merge_one = Dropout(0.3)(merge_one)
    merge_one = Dense(512, activation='relu', name='fc3')(merge_one)
    merge_one = Dropout(0.3)(merge_one)
    
    predictions = Dense(1, activation='sigmoid')(merge_one)
    
    model = Model([base_model.input, input_2], predictions)
    
    sgd = keras.optimizers.SGD(lr=1e-3, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='binary_crossentropy',
                  optimizer=sgd,
                  metrics=['accuracy'])
    return model

In [14]:
#Using K-fold Cross Validation with Data Augmentation.
def myAngleCV(X_train, X_angle, X_test):
    K=3
    folds = list(StratifiedKFold(n_splits=K, shuffle=True, random_state=16).split(X_train, target_train))
    y_test_pred_log = 0
    y_train_pred_log=0
    y_valid_pred_log = 0.0*target_train
    for j, (train_idx, test_idx) in enumerate(folds):
        print('\n===================FOLD=',j)
        X_train_cv = X_train[train_idx]
        y_train_cv = target_train[train_idx]
        X_holdout = X_train[test_idx]
        Y_holdout= target_train[test_idx]
        
        X_angle_cv=X_angle[train_idx]
        X_angle_hold=X_angle[test_idx]

        file_path = "%s_aug_model_weights.hdf5"%j
        callbacks = get_callbacks(filepath=file_path, patience=5)
        gen_flow = gen_flow_for_two_inputs(X_train_cv, X_angle_cv, y_train_cv)
        galaxyModel= getVggAngleModel()
        galaxyModel.fit_generator(
                gen_flow,
                steps_per_epoch=24,
                epochs=100,
                shuffle=True,
                verbose=1,
                validation_data=([X_holdout,X_angle_hold], Y_holdout),
                callbacks=callbacks)

        galaxyModel.load_weights(filepath=file_path)

        score = galaxyModel.evaluate([X_train_cv,X_angle_cv], y_train_cv, verbose=0)
        print('Train loss:', score[0])
        print('Train accuracy:', score[1])

        score = galaxyModel.evaluate([X_holdout,X_angle_hold], Y_holdout, verbose=0)
        print('Test loss:', score[0])
        print('Test accuracy:', score[1])

        pred_valid=galaxyModel.predict([X_holdout,X_angle_hold])
        y_valid_pred_log[test_idx] = pred_valid.reshape(pred_valid.shape[0])

        temp_test=galaxyModel.predict([X_test, X_test_angle])
        y_test_pred_log+=temp_test.reshape(temp_test.shape[0])

        temp_train=galaxyModel.predict([X_train, X_angle])
        y_train_pred_log+=temp_train.reshape(temp_train.shape[0])

    y_test_pred_log=y_test_pred_log/K
    y_train_pred_log=y_train_pred_log/K

    print('\n Train Log Loss Validation= ',log_loss(target_train, y_train_pred_log))
    print(' Test Log Loss Validation= ',log_loss(target_train, y_valid_pred_log))
    return y_test_pred_log
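
Stripped of the network, the cross-validation loop does two things: it fills in an out-of-fold prediction for every training row, and it averages the per-fold predictions on the test set. The sketch below shows that bookkeeping on synthetic data, with a scikit-learn `LogisticRegression` standing in for the VGG model; the data and the stand-in model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)
X = rng.normal(size=(60, 2)) + y[:, None]  # class 1 is shifted by +1
X_new = rng.normal(size=(5, 2))            # stand-in for the unlabeled test set

K = 3
oof_pred = np.zeros(len(y))       # out-of-fold predictions for the training rows
test_pred = np.zeros(len(X_new))  # running sum of per-fold test predictions

skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=16)
for train_idx, hold_idx in skf.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    oof_pred[hold_idx] = model.predict_proba(X[hold_idx])[:, 1]
    test_pred += model.predict_proba(X_new)[:, 1]

test_pred /= K  # average the K fold models
oof_loss = log_loss(y, oof_pred)
print(round(oof_loss, 4))
```

Only the out-of-fold log loss is an honest estimate of generalization. The averaged prediction on the full training set, which `myAngleCV` prints as the train log loss, includes rows each fold model was trained on.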

In [20]:
preds=myAngleCV(X_train, X_angle, X_test)




  "The `lr` argument is deprecated, use `learning_rate` instead.")


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Train loss: 0.15708371996879578
Train accuracy: 0.937324583530426
Test loss: 0.22895066440105438
Test accuracy: 0.9140186905860901

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Train loss: 0.191402405500412
Train accuracy: 0.9176800847053528
Test loss: 0.2092941254377365
Test accuracy: 0.9177570343017578

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/10

In [None]:
# submission = pd.DataFrame()
# submission['id']=test['id']
# submission['is_iceberg']=preds
# submission.to_csv('sub.csv', index=False)
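
The commented-out cell above writes the submission file. A minimal sketch of the same step on placeholder data, assuming the two-column `id`, `is_iceberg` layout that cell uses; the IDs, probabilities, and temporary output path are made up.

```python
import os
import tempfile

import numpy as np
import pandas as pd

ids = ["a1", "b2", "c3"]             # placeholder image IDs
preds = np.array([0.02, 0.97, 0.50])  # placeholder iceberg probabilities

submission = pd.DataFrame({"id": ids, "is_iceberg": np.clip(preds, 0.0, 1.0)})

out_path = os.path.join(tempfile.mkdtemp(), "sub.csv")
submission.to_csv(out_path, index=False)

# Read the file back to confirm the layout.
check = pd.read_csv(out_path)
print(check)
```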