# Training Goal

* Build an algorithm whose accuracy is evaluated with the quadratic weighted kappa score

* ### quadratic weighted kappa
$$ k = 1-\dfrac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}} $$
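The formula can be checked by hand on small inputs. The sketch below is a minimal pure-Python version, assuming the usual definitions that the notebook does not spell out: $O$ is the confusion matrix of true versus predicted ratings, $E$ is the outer product of the two rating histograms scaled to the same total as $O$, and $w_{i,j} = (i-j)^2/(N-1)^2$ for $N$ rating levels. The function name and the `min_rating` / `max_rating` arguments are illustrative.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Quadratic weighted kappa for integer ratings in [min_rating, max_rating]."""
    if max_rating <= min_rating:
        raise ValueError("need at least two rating levels")
    n = max_rating - min_rating + 1
    total = len(y_true)

    # O[i][j]: number of samples with true rating i and predicted rating j.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    # Marginal histograms, used to build the expected matrix E.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2            # assumed w_{i,j}
            expected = hist_true[i] * hist_pred[j] / total  # E_{i,j}
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # The denominator is zero only when every true and predicted rating is
    # the same single value; this sketch treats that as perfect agreement.
    if denominator == 0:
        return 1.0
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4))
```

Identical ratings give a numerator of zero and therefore a kappa of 1, while predictions that disagree more than chance would give a negative value.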

# Modeling
* Regression
* Classification
  * OvR / OvO
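As a rough illustration of how the two multiclass strategies differ, the sketch below lists the binary sub-problems each one trains. The label set `[1, 2, 3, 4]` is an assumption made for the example, not something stated in this notebook.

```python
from itertools import combinations


def ovr_tasks(classes):
    """One-vs-rest: one binary problem per class (that class against all others)."""
    return [(c, "rest") for c in classes]


def ovo_tasks(classes):
    """One-vs-one: one binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


classes = [1, 2, 3, 4]  # assumed label set, for illustration only
print(len(ovr_tasks(classes)), "OvR problems:", ovr_tasks(classes))
print(len(ovo_tasks(classes)), "OvO problems:", ovo_tasks(classes))
```

For K classes, OvR trains K classifiers on the full data set and OvO trains K(K-1)/2 classifiers on pairwise subsets. scikit-learn's `SVC` already handles multiclass targets with a one-vs-one scheme internally, which may be why wrapping it in `OneVsOneClassifier` changes the results only slightly.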

In [1]:
import numpy as np
import pandas as pd
import sklearn as sk

import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from sklearn import pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, roc_auc_score, accuracy_score
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

import xgboost as xgb
from xgboost import XGBClassifier

import joblib
from joblib import dump, load

pd.options.display.max_columns = 400
pd.options.display.max_rows = 200
pd.options.display.max_colwidth = 600
pd.options.display.precision = 10

In [2]:
df_train = pd.read_excel("./__data/excel/train.xlsx").fillna("")
df_test = pd.read_excel("./__data/excel/test.xlsx").fillna("")

In [3]:
train_X = joblib.load('train_X_mixed.pkl')
test_X = joblib.load('test_X_mixed.pkl')
y = joblib.load('y.pkl')

# Modeling (1) Regression

## Process
1. Singular value decomposition of the sparse matrix
2. Scaling
3. Regression

#### SVD - n_components=400 (Best parameter)
* Uses the parameter that scored highest when GridSearchCV was run on the title-only version

In [33]:
svd = TruncatedSVD(n_components=400)
scl = StandardScaler()
xgb_model = xgb.XGBRegressor()

xgb_reg = pipeline.Pipeline([('svd', svd), ('scl', scl), ('xgb', xgb_model)])

In [34]:
%%time
model_reg = xgb_reg.fit(train_X, y)

Wall time: 30min 10s


In [35]:
# Apply the fitted pipeline to the test data
reg_pred = model_reg.predict(test_X)
reg_pred

array([ 3.66499567,  3.18388104,  3.18961453, ...,  2.36185646,
        3.58581257,  3.46036673], dtype=float32)

In [36]:
# Convert the float predictions to integer labels (round to nearest)
reg_pred = np.rint(reg_pred).astype(int)
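Rounding alone does not guarantee a valid label, because a regressor can predict values outside the rating scale. The sketch below rounds and then clips to an assumed range of 1 to 4; the range and the helper name `to_ratings` are assumptions, and the out-of-range inputs in the second call are made up for illustration.

```python
def to_ratings(predictions, min_rating, max_rating):
    """Round each prediction to the nearest integer and clip it to the rating scale."""
    ratings = []
    for value in predictions:
        rounded = int(round(value))
        ratings.append(min(max(rounded, min_rating), max_rating))
    return ratings


# The first three values are taken from the regression output shown above.
print(to_ratings([3.66499567, 3.18388104, 2.36185646], 1, 4))
# Hypothetical out-of-range predictions get pulled back onto the scale.
print(to_ratings([4.7, 0.2], 1, 4))
```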

In [37]:
# Pair each id with its prediction and save the result as a CSV file for submission
reg_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(reg_pred, columns=['prediction'])], axis=1)
reg_answer.to_csv('./reg_400_answer_mixed.csv', index=False)

* score : 0.34883
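The same id/prediction pairing is repeated for every model in this notebook. One way to factor it out is a small helper, sketched here with only the standard library; `submission_rows` and `write_submission` are hypothetical names, and the column names follow the `id` / `prediction` layout used above.

```python
import csv


def submission_rows(ids, predictions):
    """Return the header plus one (id, prediction) row per test sample."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    return [("id", "prediction")] + list(zip(ids, predictions))


def write_submission(ids, predictions, path):
    """Write the submission rows to a CSV file at the given path."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(submission_rows(ids, predictions))


print(submission_rows([101, 102], [4, 3]))
```

The length check catches the case where the prediction array and the test ids fall out of step, which a column-wise concatenation would otherwise fill with missing values.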

# Modeling (2) Classification

### Process
1. KSVM
2. OvO and OvR
3. Pipeline (SVD, Scaling, Best Model)

### (1) KSVM

#### only KSVM

In [7]:
%%time
poly_svc = SVC(kernel="poly", degree=2, gamma=1, coef0=0).fit(train_X, y)

Wall time: 3min 27s


In [8]:
cv = KFold(10)
cross_val_score(poly_svc, train_X, y, cv=cv)

array([ 0.68011811,  0.64370079,  0.65846457,  0.6761811 ,  0.63779528,
        0.64665354,  0.67027559,  0.68275862,  0.64334975,  0.66896552])
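Called without a `scoring` argument, `cross_val_score` uses the estimator's own `score` method, which for `SVC` is mean accuracy. The fold values above are therefore accuracies and are not directly comparable to the kappa-based submission scores. A short sketch for summarizing them, using the ten values printed above:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
fold_scores = [0.68011811, 0.64370079, 0.65846457, 0.6761811, 0.63779528,
               0.64665354, 0.67027559, 0.68275862, 0.64334975, 0.66896552]


def summarize(scores):
    """Return the mean and sample standard deviation of the fold scores."""
    return mean(scores), stdev(scores)


avg, spread = summarize(fold_scores)
print(f"mean accuracy = {avg:.4f}, std = {spread:.4f}")
```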

In [9]:
poly_pred = poly_svc.predict(test_X)

In [10]:
poly_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(poly_pred, columns=['prediction'])], axis=1)
poly_answer.to_csv('./poly_pred_mixed.csv', index=False)

* score : 0.46529

#### SVD / Scaling / KSVM

In [4]:
svd = TruncatedSVD(n_components=400)
scl = StandardScaler()
svc = SVC(kernel="poly", degree=2, gamma=1, coef0=0)

In [5]:
pipe_svc = pipeline.Pipeline([('svd', svd), ('scl', scl), ('svc', svc)])

In [6]:
%%time
model_svc = pipe_svc.fit(train_X, y)

Wall time: 31min 10s


In [7]:
pipe_svc_pred = model_svc.predict(test_X)

In [8]:
pipe_svc_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(pipe_svc_pred, columns=['prediction'])], axis=1)
pipe_svc_answer.to_csv('./pipe_svc_answer_mixed.csv', index=False)

* score : 0.48007

### (2) XGBClassifier

In [4]:
xgb_clf = XGBClassifier()
xgb_param_grid = {'max_depth': [200, 300, 400, 500, 600]}
xgb_grid = GridSearchCV(estimator=xgb_clf, param_grid=xgb_param_grid, cv=5)

In [5]:
%%time
xgb_grid.fit(train_X, y)

Wall time: 2d 8h 31min 9s


GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [200, 300, 400, 500, 600]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [6]:
xgb_grid.grid_scores_



[mean: 0.64182, std: 0.00547, params: {'max_depth': 200},
 mean: 0.64350, std: 0.00699, params: {'max_depth': 300},
 mean: 0.64241, std: 0.00557, params: {'max_depth': 400},
 mean: 0.64320, std: 0.00536, params: {'max_depth': 500},
 mean: 0.64232, std: 0.00679, params: {'max_depth': 600}]

In [7]:
print('XGBClassifier Best score : ', xgb_grid.best_score_)
print('XGBClassifier Best parameter : ', xgb_grid.best_params_)

XGBClassifier Best score :  0.643497095599
XGBClassifier Best parameter :  {'max_depth': 300}
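Before treating `max_depth=300` as a clear winner, it helps to compare the gap between the candidates with the fold-to-fold noise. The sketch below uses the means and standard deviations printed by the grid search above; the helper name is illustrative.

```python
# (mean, std) per max_depth, copied from the grid search output above.
grid_results = {
    200: (0.64182, 0.00547),
    300: (0.64350, 0.00699),
    400: (0.64241, 0.00557),
    500: (0.64320, 0.00536),
    600: (0.64232, 0.00679),
}


def spread_vs_noise(results):
    """Return (gap between best and worst mean, smallest per-candidate std)."""
    means = [m for m, _ in results.values()]
    stds = [s for _, s in results.values()]
    return max(means) - min(means), min(stds)


gap, smallest_std = spread_vs_noise(grid_results)
print(f"gap between candidates = {gap:.5f}, smallest fold std = {smallest_std:.5f}")
print("gap is within fold noise:", gap < smallest_std)
```

The gap between the best and worst candidate is smaller than the smallest fold standard deviation, which suggests that `max_depth` makes little measurable difference in this range.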


In [8]:
joblib.dump(xgb_grid, 'xgb_grid_mixed.pkl')

['xgb_grid_mixed.pkl']

#### only XGB - max_depth=300 (Best parameter)

In [9]:
%%time
# Named xgb_only so it does not shadow the imported xgboost module (xgb)
xgb_only = XGBClassifier(max_depth=300).fit(train_X, y)

Wall time: 2h 59min 10s


In [14]:
%%time
xgb_pred = xgb_only.predict(test_X)



In [15]:
xgb_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(xgb_pred, columns=['prediction'])], axis=1)
xgb_answer.to_csv('./xgb_answer_mixed.csv', index=False)

* score : 0.34603

#### SVD / Scaling / XGB

In [11]:
svd = TruncatedSVD(n_components=400)
scl = StandardScaler()
# Named xgb_svd so it does not shadow the imported xgboost module (xgb)
xgb_svd = XGBClassifier(max_depth=300)

In [12]:
pipe_xgb = pipeline.Pipeline([('svd', svd), ('scl', scl), ('xgb', xgb_svd)])

In [13]:
%%time
model_xgb = pipe_xgb.fit(train_X, y)

Wall time: 38min 2s


In [16]:
model_xgb_pred = model_xgb.predict(test_X)

In [17]:
model_xgb_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(model_xgb_pred, columns=['prediction'])], axis=1)
model_xgb_answer.to_csv('./pipe_xgb_answer_mixed.csv', index=False)

* score : 0.44797

### Score ranking
* svm (kernel="poly") / TruncatedSVD / StandardScaler : **0.48007**
* svm (kernel="poly") : **0.46529**
* xgbclassifier (max_depth=300) / TruncatedSVD / StandardScaler : **0.44797**
* xgbregressor / TruncatedSVD (n_components=400) / StandardScaler : **0.34883**
* xgbclassifier (max_depth=300) : **0.34603**

# OvR

* (1) KSVM

In [4]:
%%time
poly_ovr = OneVsRestClassifier(SVC(kernel="poly", degree=2, gamma=1, coef0=0)).fit(train_X, y)

Wall time: 15min 52s


In [5]:
poly_ovr_pred = poly_ovr.predict(test_X)

In [6]:
poly_ovr_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(poly_ovr_pred, columns=['prediction'])], axis=1)
poly_ovr_answer.to_csv('./poly_ovr_answer_mixed.csv', index=False)

* score : **0.49430**

* (2) XGBClassifier / TruncatedSVD / StandardScaler

In [12]:
svd = TruncatedSVD(n_components=400)
scl = StandardScaler()
xgb_clf_ovr = OneVsRestClassifier(XGBClassifier(max_depth=300))

pipe_xgb_ovr = pipeline.Pipeline([('svd', svd), ('scl', scl), ('xgb', xgb_clf_ovr)])

In [13]:
%%time
pipe_xgb_ovr_model = pipe_xgb_ovr.fit(train_X, y)

Wall time: 41min 38s


In [14]:
pipe_xgb_ovr_pred = pipe_xgb_ovr_model.predict(test_X)

In [15]:
pipe_xgb_ovr_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(pipe_xgb_ovr_pred, columns=['prediction'])], axis=1)
pipe_xgb_ovr_answer.to_csv('./pipe_xgb_ovr_answer_mixed.csv', index=False)

* score : **0.42857**

* (3) XGBClassifier

In [5]:
%%time
xgb_clf_ovr = OneVsRestClassifier(XGBClassifier(max_depth=300)).fit(train_X, y)

Wall time: 2h 50min 44s


In [6]:
xgb_clf_ovr_pred = xgb_clf_ovr.predict(test_X)

In [7]:
xgb_ovr_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(xgb_clf_ovr_pred, columns=['prediction'])], axis=1)
xgb_ovr_answer.to_csv('./xgb_ovr_answer_mixed.csv', index=False)

* score : **0.33192**

* (4) KSVM / TruncatedSVD / StandardScaler

In [12]:
svd = TruncatedSVD(n_components=400)
scl = StandardScaler()
svc_clf_ovr = OneVsRestClassifier(SVC(kernel="poly", degree=2, gamma=1, coef0=0))

pipe_svc_ovr = pipeline.Pipeline([('svd', svd), ('scl', scl), ('svc', svc_clf_ovr)])

In [13]:
%%time
model_svc_ovr = pipe_svc_ovr.fit(train_X, y)

Wall time: 33min 31s


In [14]:
svc_ovr_pred = model_svc_ovr.predict(test_X)

In [15]:
pipe_svc_ovr_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(svc_ovr_pred, columns=['prediction'])], axis=1)
pipe_svc_ovr_answer.to_csv('./pipe_svc_ovr_answer_mixed.csv', index=False)

* score : **0.45505**

# OvO

* (1) KSVM

In [16]:
%%time
poly_ovo = OneVsOneClassifier(SVC(kernel="poly", degree=2, gamma=1, coef0=0)).fit(train_X, y)

Wall time: 3min 25s


In [17]:
poly_ovo_pred = poly_ovo.predict(test_X)

In [18]:
poly_ovo_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(poly_ovo_pred, columns=['prediction'])], axis=1)
poly_ovo_answer.to_csv('./poly_ovo_answer_mixed.csv', index=False)

* score : **0.46366**

* (2) XGBClassifier / TruncatedSVD / StandardScaler

In [19]:
svd = TruncatedSVD(n_components=400)
scl = StandardScaler()
xgb_clf_ovo = OneVsOneClassifier(XGBClassifier(max_depth=300))

pipe_xgb = pipeline.Pipeline([('svd', svd), ('scl', scl), ('xgb', xgb_clf_ovo)])

In [20]:
%%time
pipe_xgb_ovo = pipe_xgb.fit(train_X, y)

Wall time: 36min 17s


In [21]:
pipe_xgb_ovo_pred = pipe_xgb_ovo.predict(test_X)

In [22]:
pipe_xgb_ovo_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(pipe_xgb_ovo_pred, columns=['prediction'])], axis=1)
pipe_xgb_ovo_answer.to_csv('./pipe_xgb_ovo_answer_mixed.csv', index=False)

* score : **0.39720**

* (3) XGBClassifier

In [23]:
%%time
xgb_clf_ovo = OneVsOneClassifier(XGBClassifier(max_depth=300)).fit(train_X, y)

Wall time: 1h 52min 40s


In [24]:
xgb_clf_ovo_pred = xgb_clf_ovo.predict(test_X)

In [25]:
xgb_ovo_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(xgb_clf_ovo_pred, columns=['prediction'])], axis=1)
xgb_ovo_answer.to_csv('./xgb_ovo_answer_mixed.csv', index=False)

* score : **0.29887**

* (4) KSVM / TruncatedSVD / StandardScaler

In [4]:
svd = TruncatedSVD(n_components=400)
scl = StandardScaler()
svc_clf_ovo = OneVsOneClassifier(SVC(kernel="poly", degree=2, gamma=1, coef0=0))

pipe_svc_ovo = pipeline.Pipeline([('svd', svd), ('scl', scl), ('svc', svc_clf_ovo)])

In [5]:
%%time
model_svc_ovo = pipe_svc_ovo.fit(train_X, y)

Wall time: 32min 29s


In [6]:
svc_ovo_pred = model_svc_ovo.predict(test_X)

In [7]:
pipe_svc_ovo_answer = pd.concat([pd.DataFrame(df_test['id'], columns=['id']), pd.DataFrame(svc_ovo_pred, columns=['prediction'])], axis=1)
pipe_svc_ovo_answer.to_csv('./pipe_svc_ovo_answer_mixed.csv', index=False)

* score : **0.47133**

# Top 5 models by score

#### (1) KSVM + OvR

* OneVsRestClassifier(SVC(kernel="poly", degree=2, gamma=1, coef0=0))
* score : 0.49430

#### (2) TruncatedSVD / StandardScaler / KSVM

* TruncatedSVD(n_components=400) / StandardScaler() / SVC(kernel="poly", degree=2, gamma=1, coef0=0)
* score : 0.48007

#### (3) TruncatedSVD / StandardScaler / KSVM + OvO

* TruncatedSVD(n_components=400) / StandardScaler() / OneVsOneClassifier(SVC(kernel="poly", degree=2, gamma=1, coef0=0))
* score : 0.47133

#### (4) KSVM

* SVC(kernel="poly", degree=2, gamma=1, coef0=0)
* score : 0.46529

#### (5) KSVM + OvO

* OneVsOneClassifier(SVC(kernel="poly", degree=2, gamma=1, coef0=0))
* score : 0.46366
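The ranking can be reproduced by collecting the submission scores reported in each section and sorting them. This is a small sketch; the dictionary keys are shorthand labels chosen here, and the values are the scores listed throughout the notebook.

```python
# Submission scores as reported in each section of this notebook.
scores = {
    "KSVM + OvR": 0.49430,
    "SVD / Scaler / KSVM": 0.48007,
    "SVD / Scaler / KSVM + OvO": 0.47133,
    "KSVM": 0.46529,
    "KSVM + OvO": 0.46366,
    "SVD / Scaler / KSVM + OvR": 0.45505,
    "SVD / Scaler / XGBClassifier": 0.44797,
    "SVD / Scaler / XGBClassifier + OvR": 0.42857,
    "SVD / Scaler / XGBClassifier + OvO": 0.39720,
    "SVD / Scaler / XGBRegressor": 0.34883,
    "XGBClassifier": 0.34603,
    "XGBClassifier + OvR": 0.33192,
    "XGBClassifier + OvO": 0.29887,
}


def top_k(results, k):
    """Return the k best (name, score) pairs, highest score first."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)[:k]


for rank, (name, score) in enumerate(top_k(scores, 5), start=1):
    print(f"({rank}) {name}: {score:.5f}")
```

All five top entries are kernel SVM variants, and every XGBoost configuration scores below them.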