# Introduction
    After feature selection, we use a Linear regression model as the baseline and then apply modeling techniques such as DecisionTree, MLP, and Ensemble methods to find the model with the highest accuracy.

## 1) Feature selection
    When the number of features is large relative to the number of samples, model complexity rises and overfitting becomes more likely. We therefore remove irrelevant and redundant features and compare the results against the original feature set.

###### - Load the training and test datasets with pandas

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as matplot
import numpy as np

import re
import sklearn

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

df_train = pd.read_csv('../input/Train_data.csv')
df_test = pd.read_csv('../input/test_data.csv')
df_test = df_test.drop('Unnamed: 0', axis=1)

In [None]:
df_train.head()

In [None]:
df_test.head()

###### - Store the training and test datasets in separate DataFrames and split each into X and Y (the analysis features vs. the xAttack label)

In [None]:
X_train = df_train.drop('xAttack', axis=1)
Y_train = df_train.loc[:,['xAttack']]
X_test = df_test.drop('xAttack', axis=1)
Y_test = df_test.loc[:,['xAttack']]

###### - Apply preprocessing and one-hot encoding: OneHotEncoder for X, LabelBinarizer for Y

In [None]:
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

In [None]:
le = preprocessing.LabelEncoder()
enc = OneHotEncoder()
lb = preprocessing.LabelBinarizer()

- X encoding (LabelEncoder is applied here; the OneHotEncoder calls are left commented out)

In [None]:
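# Integer-encode protocol_type using a mapping learned from the training set
X_train['protocol_type'] = le.fit_transform(X_train['protocol_type'])
# enc.fit_transform(X_train['protocol_type'])

# Reuse the training-set mapping so both sets share the same codes
X_test['protocol_type'] = le.transform(X_test['protocol_type'])
# enc.fit_transform(X_test['protocol_type'])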

X_train.head()
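
The `OneHotEncoder` calls above are left commented out, so `protocol_type` ends up integer-coded. As a rough sketch of what one-hot encoding this column could look like, the following uses `pd.get_dummies` on toy data and reindexes the test frame so both sets share the same columns. The values `tcp`, `udp`, `icmp` and the names `toy_train`, `toy_test`, `train_oh`, `test_oh` are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the real data (hypothetical values)
toy_train = pd.DataFrame({'protocol_type': ['tcp', 'udp', 'icmp', 'tcp']})
toy_test = pd.DataFrame({'protocol_type': ['udp', 'tcp']})

# One 0/1 column per category seen in the training set
train_oh = pd.get_dummies(toy_train, columns=['protocol_type'], dtype=int)

# Encode the test set, then align it to the training columns;
# categories missing from the test set become all-zero columns
test_oh = pd.get_dummies(toy_test, columns=['protocol_type'], dtype=int)
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)

print(train_oh.columns.tolist())
print(test_oh)
```

Aligning on the training columns keeps the feature matrix shape identical for fitting and scoring, which the models further down rely on.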

- Y encoding (LabelEncoder; the LabelBinarizer output is computed but not stored)

In [None]:
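Y_train['xAttack'] = le.fit_transform(Y_train['xAttack'])
lb.fit_transform(Y_train['xAttack'])  # result is not stored

# Reuse the training-set mapping so class codes match between train and test
Y_test['xAttack'] = le.transform(Y_test['xAttack'])
lb.transform(Y_test['xAttack'])  # result is not stored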

Y_train.describe()
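
To illustrate why the label mapping should be learned once and shared, here is a minimal sketch with made-up labels. The class names `normal`, `dos`, `probe` and all variable names are hypothetical. It compares an encoder fitted on the training labels with one fitted separately on the test labels.

```python
from sklearn.preprocessing import LabelEncoder

toy_train_labels = ['normal', 'dos', 'probe', 'normal']  # hypothetical labels
toy_test_labels = ['probe', 'normal']

# Fit once on the training labels, then reuse the same mapping
shared = LabelEncoder().fit(toy_train_labels)
consistent = shared.transform(toy_test_labels)

# Fitting a second encoder on the test labels alone yields different codes
separate = LabelEncoder().fit_transform(toy_test_labels)

print(list(shared.classes_))   # classes are sorted alphabetically
print(consistent.tolist())
print(separate.tolist())
```

In this toy case the same label receives a different integer under the two encoders, which would silently corrupt any accuracy computed on the test set.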

### 1. Standard deviation
    We tried excluding features with a small standard deviation (low spread). However, this is unreasonable for discrete features, whose deviation is necessarily small, so we applied the criterion only to continuous features.

In [None]:
# exclude the discrete (non-continuous) features
con_list = ['protocol_type', 'service', 'flag', 'land', 'logged_in', 'su_attempted', 'is_host_login', 'is_guest_login']
con_train = X_train.drop(con_list, axis=1)

# list the 10 features with the smallest standard deviation
stdtrain = con_train.std(axis=0)
std_X_train = stdtrain.to_frame()
std_X_train.nsmallest(10, columns=0).head(10)
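
The same idea on a tiny frame, as a hedged sketch: discrete columns are set aside, the standard deviation is computed per remaining column, and zero-spread or lowest-spread columns are listed. The column names `duration`, `constant_col`, `rate` and the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'duration': [0.0, 12.0, 3.0, 40.0],
    'constant_col': [0.0, 0.0, 0.0, 0.0],  # zero spread, carries no information
    'rate': [0.10, 0.11, 0.10, 0.12],
    'land': [0, 0, 1, 0],                  # discrete flag, excluded below
})

discrete_cols = ['land']
stds = toy.drop(columns=discrete_cols).std(axis=0)

# Columns that never vary, and the two columns with the least spread
zero_std = stds[stds == 0].index.tolist()
lowest_two = stds.nsmallest(2).index.tolist()

print(zero_std)
print(lowest_two)
```

Because standard deviation depends on each feature's scale, a ranking like this compares features measured in different units; that limitation applies to this sketch as well.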

#### num_outbound_cmds has a standard deviation of 0, so we remove it first.

In [None]:
X_train = X_train.drop(['num_outbound_cmds'], axis=1)
X_test = X_test.drop(['num_outbound_cmds'], axis=1)

df_train = pd.concat([X_train, Y_train], axis=1)
df_train.head()

X_train.head()

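#### Pick the 10 features with the lowest std and drop them -> stored in X_train_stdrop (to be used after the Ensemble feature selection). The list below has nine entries, presumably because num_outbound_cmds was already removed above.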

In [None]:
stdrop_list = ['urgent', 'num_shells', 'root_shell',
        'num_failed_logins', 'num_access_files', 'dst_host_srv_diff_host_rate',
        'diff_srv_rate', 'dst_host_diff_srv_rate', 'wrong_fragment']

X_test_stdrop = X_test.drop(stdrop_list, axis=1)

X_train_stdrop = X_train.drop(stdrop_list, axis=1)

df_train_stdrop = pd.concat([X_train_stdrop, Y_train], axis=1)

df_train_stdrop.head()

### Baseline - checking performance with Linear regression

- Linear regression

In [None]:
from sklearn import linear_model

In [None]:
LR = linear_model.LinearRegression()

In [None]:
LR.fit(X_train, Y_train)

In [None]:
lr_score = LR.score(X_test, Y_test)
print('Linear regression processing ,,,')
print('Linear regression Score (R^2): %.2f' % lr_score)

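##### Linear regression scores only about 33%. Note that `LinearRegression.score` returns R², not classification accuracy, so this number is not directly comparable to the classifier scores.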
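
As a small illustration of the gap between R² and accuracy, the sketch below fits `LinearRegression` on four invented points with 0/1 labels, then rounds the continuous predictions to class labels. The data and the rounding rule are assumptions made for this example, not the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

reg = LinearRegression().fit(X_toy, y_toy)
r2 = reg.score(X_toy, y_toy)  # coefficient of determination, not accuracy

# Turn continuous predictions into class labels by rounding and clipping
pred_labels = np.clip(np.rint(reg.predict(X_toy)), 0, 1).astype(int)
accuracy = float((pred_labels == y_toy).mean())

print('R^2: %.2f' % r2)
print('Accuracy after rounding: %.2f' % accuracy)
```

Here every rounded prediction is correct even though R² is below 1, so a low R² on the real data does not by itself say how often the rounded predictions would be right.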

### 2. Ensemble feature selection
    Ensemble models let us check how much each feature contributed within each model. We therefore ran feature selection centered on those features (an attempt to remove irrelevant features).
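
A minimal sketch of reading feature importances, assuming synthetic data in which one column fully determines the label and the other is pure noise (both columns and all names are invented for this example):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
informative = rng.rand(200)          # decides the label
noise = rng.rand(200)                # unrelated to the label
X_toy = np.column_stack([informative, noise])
y_toy = (informative > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# One non-negative value per column; the values sum to 1
importances = forest.feature_importances_
print(importances)
```

The informative column should receive most of the importance, which is the signal used in this section to decide which features to keep.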

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.tree import DecisionTreeClassifier

In [None]:
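# The base estimator is passed positionally because its keyword name differs across scikit-learn versions
AB = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=100, learning_rate=1.0)
# max_features='sqrt' matches what the deprecated 'auto' meant for these classifiers
RF = RandomForestClassifier(n_estimators=10, criterion='entropy', max_features='sqrt', bootstrap=True)
ET = ExtraTreesClassifier(n_estimators=10, criterion='gini', max_features='sqrt', bootstrap=False)
# The default loss is used in place of the deprecated 'deviance' name
GB = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200, max_features='sqrt')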

In [None]:
y_train = Y_train['xAttack'].to_numpy()
x_train = X_train.values
x_test = X_test.values

### While inspecting the feature importances, we also check in advance how much accuracy the original feature set achieves.

In [None]:
AB.fit(X_train, Y_train)

In [None]:
AB_feature = AB.feature_importances_
AB_feature

ab_score = AB.score(X_test, Y_test)

print('AdaBoostClassifier processing ,,,')
print('AdaBoostClassifier Score: %.1f %%' % (ab_score * 100))

In [None]:
RF.fit(X_train, Y_train)

In [None]:
RF_feature = RF.feature_importances_
RF_feature

rf_score = RF.score(X_test, Y_test)

print('RandomForestClassifier processing ,,,')
print('RandomForestClassifier Score: %.1f %%' % (rf_score * 100))

In [None]:
ET.fit(X_train, Y_train)

In [None]:
ET_feature = ET.feature_importances_
ET_feature

et_score = ET.score(X_test, Y_test)

print('ExtraTreesClassifier processing ,,,')
print('ExtraTreesClassifier Score: %.1f %%' % (et_score * 100))

In [None]:
GB.fit(X_train, Y_train)

In [None]:
GB_feature = GB.feature_importances_
GB_feature

gb_score = GB.score(X_test, Y_test)

print('GradientBoostingClassifier processing ,,,')
print('GradientBoostingClassifier Score: %.1f %%' % (gb_score * 100))
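
The four fit/score cells above repeat one pattern. A possible way to express it as a loop is sketched below on synthetic data; the arrays `X_demo`, `y_demo`, the 200/100 split, and the dictionary names are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic stand-in for the real train/test split
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 4)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)
X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

models = {
    'RandomForest': RandomForestClassifier(n_estimators=10, random_state=0),
    'ExtraTrees': ExtraTreesClassifier(n_estimators=10, random_state=0),
}

scores = {}
importances = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)        # mean accuracy in [0, 1]
    importances[name] = model.feature_importances_
    print('%s Score: %.1f %%' % (name, scores[name] * 100))
```

Keeping the scores and importances in dictionaries keyed by model name also makes it easier to build comparison tables later without tracking one variable per model.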

### Using the Ensemble models trained above, let's see how much influence each feature has

In [None]:
cols = X_train.columns.values

feature_df = pd.DataFrame({'features': cols,
                           'AdaBoost' : AB_feature,
                           'RandomForest' : RF_feature,
                           'ExtraTree' : ET_feature,
                           'GradientBoost' : GB_feature
                          })
feature_df.head(8)

- Plot the influence of each feature as a graph

In [None]:
from matplotlib.ticker import MaxNLocator
from collections import namedtuple

graph = feature_df.plot.bar(figsize = (18, 10), title = 'Feature distribution', grid=True, legend=True, fontsize = 15, 
                            xticks=feature_df.index)
graph.set_xticklabels(feature_df.features, rotation = 80)

#### Extract the top 12 features from each Ensemble model

In [None]:
a_f = feature_df.nlargest(12, 'AdaBoost')
e_f = feature_df.nlargest(12, 'ExtraTree')
g_f = feature_df.nlargest(12, 'GradientBoost')
r_f = feature_df.nlargest(12, 'RandomForest')

Remove the duplicates

In [None]:
result = pd.concat([a_f, e_f, g_f, r_f])
result = result.drop_duplicates() # drop duplicate features
result

In [None]:
selected_features = result['features'].values.tolist()
selected_features
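
The top-k union step can be pictured on a four-row table. This is a minimal sketch; the feature names `f1` to `f4`, the model columns `ModelA`/`ModelB`, and the importance values are made up.

```python
import pandas as pd

toy_importance = pd.DataFrame({
    'features': ['f1', 'f2', 'f3', 'f4'],
    'ModelA': [0.50, 0.30, 0.15, 0.05],
    'ModelB': [0.40, 0.10, 0.30, 0.20],
})

top_a = toy_importance.nlargest(2, 'ModelA')   # f1, f2
top_b = toy_importance.nlargest(2, 'ModelB')   # f1, f3

# Concatenate, then keep each feature once (first occurrence wins)
union = pd.concat([top_a, top_b]).drop_duplicates(subset='features')
selected = union['features'].tolist()
print(selected)
```

Deduplicating on the `features` column states the intent directly; dropping fully identical rows, as in the cell above, gives the same result as long as every copy of a feature row carries the same values.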

### Below are the results of training after excluding the low-standard-deviation features.

In [None]:
AB.fit(X_train_stdrop, Y_train)

In [None]:
ab2_score = AB.score(X_test_stdrop, Y_test)

print('AdaBoostClassifier_stdrop processing ,,,')
print('AdaBoostClassifier Score: %.1f %%' % (ab2_score * 100))

In [None]:
RF.fit(X_train_stdrop, Y_train)

In [None]:
rf2_score = RF.score(X_test_stdrop, Y_test)

print('RandomForestClassifier_stdrop processing ,,,')
print('RandomForestClassifier Score: %.1f %%' % (rf2_score * 100))

In [None]:
ET.fit(X_train_stdrop, Y_train)

In [None]:
et2_score = ET.score(X_test_stdrop, Y_test)

print('ExtraTreesClassifier_stdrop processing ,,,')
print('ExtraTreesClassifier Score: %.1f %%' % (et2_score * 100))

In [None]:
GB.fit(X_train_stdrop, Y_train)

In [None]:
gb2_score = GB.score(X_test_stdrop, Y_test)

print('GradientBoostingClassifier_stdrop processing ,,,')
print('GradientBoostingClassifier Score: %.1f %%' % (gb2_score * 100))

- Continue with only the features obtained through the ensembles

In [None]:
X_train_ens = X_train[selected_features]
X_train_ens.head()

X_test_ens = X_test[selected_features]
X_test_ens.head()

### 3. Correlation
    Among the features, those with high mutual correlation (redundant features) were merged or dropped, because when features are strongly correlated there is little reason to keep all of them.
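
In this notebook the redundant features are picked by reading the heatmap. As a hedged alternative sketch, the same decision can be automated by scanning the upper triangle of the absolute correlation matrix; the columns `a`, `b`, `c` and the 0.9 threshold are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * a, so fully redundant
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()

# Keep only entries above the diagonal so each pair is inspected once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.9  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

Of each highly correlated pair, this rule drops the column that appears later in the frame, so column order influences which feature survives.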

In [None]:
sample = X_train_ens[:10000]

colormap = plt.cm.viridis
plt.figure(figsize=(20, 20))
sns.heatmap(sample.astype(float).corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, annot=True)

- Analysis of the heatmap above showed that the following features are highly dependent on others, so they are removed

In [None]:
selected2 = ['flag', 'dst_host_serror_rate', 'serror_rate']
X_train_cordrop = X_train_ens.drop(selected2, axis=1)
X_train_cordrop.describe()

X_test_cordrop = X_test_ens.drop(selected2, axis=1)
X_test_cordrop.describe()

## 2) Modeling

### Modeling after all feature selection steps (low deviation and high correlation removed)

### Compare the final modeling results using the features that most influence the Ensemble models

In [None]:
AB.fit(X_train_cordrop, Y_train)

In [None]:
ab_finalscore = AB.score(X_test_cordrop, Y_test)

print('AdaBoostClassifier_final processing ,,,')
print('AdaBoostClassifier_final Score: %.1f %%' % (ab_finalscore * 100))

In [None]:
RF.fit(X_train_cordrop, Y_train)

In [None]:
rf_finalscore = RF.score(X_test_cordrop, Y_test)

print('RandomForestClassifier_final processing ,,,')
print('RandomForestClassifier_final Score: %.1f %%' % (rf_finalscore * 100))

In [None]:
ET.fit(X_train_cordrop, Y_train)

In [None]:
et_finalscore = ET.score(X_test_cordrop, Y_test)

print('ExtraTreesClassifier_final processing ,,,')
print('ExtraTreesClassifier_final Score: %.1f %%' % (et_finalscore * 100))

In [None]:
GB.fit(X_train_cordrop, Y_train)

In [None]:
gb_finalscore = GB.score(X_test_cordrop, Y_test)

print('GradientBoostingClassifier_final processing ,,,')
print('GradientBoostingClassifier_final Score: %.1f %%' % (gb_finalscore * 100))

In [None]:
LR.fit(X_train_cordrop, Y_train)

In [None]:
lr_finalscore = LR.score(X_test_cordrop, Y_test)

print('LinearRegression_final processing ,,,')
print('LinearRegression_final Score (R^2): %.3f' % lr_finalscore)

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:
MLP = MLPClassifier(hidden_layer_sizes=(1000, 300, 300), solver='adam', shuffle=False, tol = 0.0001)

In [None]:
MLP.fit(X_train_cordrop, Y_train)

In [None]:
mlp_finalscore = MLP.score(X_test_cordrop, Y_test)

print('MLP_final processing ,,,')
print('MLP_final Score: %.1f %%' % (mlp_finalscore * 100))

## 3) Result

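In the end, feature selection and extraction did not bring a large gain. Accuracy rose by only about 1 to 2 percentage points, but with fewer features the computation was somewhat faster, and we expect the reduced feature set to help guard against overfitting when new data arrives.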

### Score comparison by model

- first models

In [None]:
first_model = {'Model': ['Linear Regression', 'Adaboost', 'RandomForest', 'ExtraTrees', 'GradientBoost'],
               'accuracy' : [lr_score, ab_score, rf_score, et_score, gb_score]}

result_df = pd.DataFrame(data = first_model)
result_df

In [None]:
r1 = result_df.plot(x='Model', y='accuracy', kind='bar', figsize=(8, 8), grid=True, title='FIRST MODEL ACCURACY',
                    colormap=plt.cm.viridis)  # sort_columns is omitted: recent pandas no longer accepts it
r1.set_xticklabels(result_df.Model, rotation = 45)

- second models

In [None]:
second_model = {'Model': ['Adaboost', 'RandomForest', 'ExtraTrees', 'GradientBoost'],
               'accuracy' : [ab2_score, rf2_score, et2_score, gb2_score]}

result_df = pd.DataFrame(data = second_model)
result_df

In [None]:
r2 = result_df.plot(x='Model', y='accuracy', kind='bar', figsize=(8, 8), grid=True, title='SECOND MODEL ACCURACY',
                    colormap=plt.cm.viridis)  # sort_columns is omitted: recent pandas no longer accepts it
r2.set_xticklabels(result_df.Model, rotation = 45)

- final models

In [None]:
final_model = {'Model': ['Linear Regression', 'Adaboost', 'RandomForest', 'ExtraTrees', 'GradientBoost', 'MLP'],
               'accuracy' : [lr_finalscore, ab_finalscore, rf_finalscore, et_finalscore, gb_finalscore, mlp_finalscore]}

result_df = pd.DataFrame(data = final_model)
result_df

In [None]:
r3 = result_df.plot(x='Model', y='accuracy', kind='bar', figsize=(8, 8), grid=True, title='FINAL MODEL ACCURACY',
                    colormap=plt.cm.viridis)  # sort_columns is omitted: recent pandas no longer accepts it
r3.set_xticklabels(result_df.Model, rotation = 45)

## FASTEST AND ACCURATE MODEL - ExtraTrees from the final models (76.4%)
## STRONGEST AND THE MOST ACCURATE MODEL - GradientBoost from the final models (77.1%)

Gradient boost reaches about 77 percent accuracy, but ExtraTrees is far faster.