# HW2

#### Machine Learning in Korea University
#### COSE362, Fall 2018
#### Due : 11/26 (TUE) 11:59 PM

#### In this assignment, you will learn various classification methods with given datasets.
* Implementation detail: Anaconda 5.3 with Python 3.7
* Use the given dataset. Please do not change the train / valid / test split.
* Use the numpy, scikit-learn, and matplotlib libraries.
* You don't have to use all of the packages imported below (some are optional). <br>
Also, you can import additional packages in the "(Option) Other Classifiers" part. 
* <b>*DO NOT MODIFY ANY PART OF THE CODE EXCEPT "Your Code Here"*</b>

In [1]:
# Basic packages
%matplotlib inline
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt

# Machine Learning Models
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Additional packages
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

In [2]:
# Import your own packages if you need(only in scikit-learn, numpy, pandas).
# Your Code Here
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import ExtraTreeClassifier 

#End Your Code

## Process

> 1. Load "train.csv". It includes all samples' features and labels.
> 2. Train four types of classifiers (logistic regression, decision tree, random forest, support vector machine) and <b>validate</b> them in your own way. <b>(You can't get full credit if you don't conduct validation.)</b>
> 3. Optionally, if you train your own classifier (e.g. ensembling or gradient boosting), you can evaluate it on the development data. <br>
> 4. <b>You should submit your predictions on the test data, made with the classifier you selected.</b>

## Task & dataset description
1. 6 Features (1~6)<br>
Feature 2, 4, 6 : Real-valued<br>
Feature 1, 3, 5 : Categorical <br>

2. Samples <br>
>In development set : 2,000 samples <br>
>In test set : 1,500 samples

## Load development dataset
Load your development dataset. You should read <b>"train.csv"</b>. This is a classification task, and you need to preprocess your data before training your model. <br>
> You need to use the <b>1-of-K coding scheme</b> to convert categorical features to one-hot vectors. <br>
> For example, if a feature takes 3 categorical values, you can encode them as [1,0,0], [0,1,0], [0,0,1] with the 1-of-K coding scheme. <br>
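
As a small illustration of the 1-of-K scheme described above, here is a sketch on a toy frame. The column names and values are hypothetical and are not taken from "train.csv"; `pd.get_dummies` is one common way to do this, not the only one.

```python
import pandas as pd

# Toy data (hypothetical): one categorical column with 3 values, one real-valued column.
toy = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1.5, 2.0, 0.7, 3.1],
})

# 1-of-K coding: each category becomes its own 0/1 indicator column.
# Columns that are not listed in `columns` are passed through unchanged.
onehot = pd.get_dummies(toy, columns=["color"], dtype=int)
print(onehot)
```

Each row has exactly one indicator set to 1 among the `color_*` columns, which matches the [1,0,0], [0,1,0], [0,0,1] pattern above.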

In [3]:
# For training your model, you need to convert categorical features to one-hot encoding vectors.
# Your Code Here
#load data
df_train = pd.read_csv('./data/train.csv')

#one-hot
df_train_onehot = pd.get_dummies(df_train.drop(columns=['feature2','feature4','feature6','target']))
df_train = pd.concat([df_train,df_train_onehot], axis=1)
df_train.drop(['feature1','feature3','feature5'], axis = 1, inplace = True)

#feature && label
data = df_train.drop(['target'],axis = 1,inplace = False)
target = df_train.target

#helper functions for cross-validation
#note: despite the "error" names, these return macro F1 scores (higher is better)
def calc_train_error(X_train, y_train, model):
    predictions = model.predict(X_train)
    f1_train = f1_score(y_train, predictions, average='macro')
    return f1_train

def calc_test_error(X_val, y_val, model):
    predictions = model.predict(X_val)
    f1_val = f1_score(y_val, predictions, average='macro')
    return f1_val
    
def calc_metrics(X_train, y_train, X_val, y_val, model):
    #fit on the training fold, then score both the training and validation folds
    model.fit(X_train, y_train)
    train_error = calc_train_error(X_train, y_train, model)
    val_error= calc_test_error(X_val, y_val, model)
    return train_error, val_error

# End Your Code

### Logistic Regression
Train and validate your <b>logistic regression classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>
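
To make the 'macro' option concrete, the following sketch computes macro F1 by hand with numpy: F1 is computed per class and the per-class values are averaged with equal weight. The helper name `macro_f1` and the toy labels are hypothetical; in the assignment itself, `f1_score(..., average='macro')` from scikit-learn is the function to use.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives for class c
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives for class c
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no hits at all scores 0.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
score_demo = macro_f1(y_true_demo, y_pred_demo)
print(round(score_demo, 4))
```

Because every class counts equally, a rare class that is predicted poorly lowers the macro score as much as a frequent one would.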

In [4]:
# Training your logistic regression classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here

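#hyperparameter search: grid over C and penalty, scored by macro F1
#solver='liblinear' supports both the l1 and l2 penalties
clf = LogisticRegression(solver='liblinear')
param_grid = {'C': [0.1, 0.3, 0.5, 0.7, 1, 1.3, 1.5], 'penalty': ['l1', 'l2']}
gridsearch = GridSearchCV(clf, 
                          param_grid,
                          scoring = "f1_macro"
                         )
gridsearch.fit(data, target)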

#Create Pipeline
standardizer = StandardScaler()

#Create K-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

f1_trains = []
f1_vals = []
f1_val_result = []

for train_index, val_index in kf.split(data, target):
        
    #split data
    X_train, X_val = data.iloc[train_index], data.iloc[val_index]
    y_train, y_val = target.iloc[train_index], target.iloc[val_index]
        
    #instantiate model
    #create a pipeline that standardizes the features, then fits the classifier
    pipeline = make_pipeline(standardizer,
                             LogisticRegression(random_state = 42,
                                                solver = 'liblinear',
                                                C = gridsearch.best_params_['C'], 
                                                penalty=gridsearch.best_params_['penalty']))
        
    #calculate macro F1 on the training and validation folds
    f1_train, f1_val = calc_metrics(X_train, y_train, X_val, y_val, pipeline)
        
    #append to the appropriate list
    f1_trains.append(f1_train)
    f1_vals.append(f1_val)
    
    #running mean of the validation F1 over the folds seen so far
    f1_val_result.append(np.mean(f1_vals))
    
    #generate report (running means after each fold)
    print('mean(f1_train): {:7} | mean(f1_validation): {}'.
            format(
            round(np.mean(f1_trains),4),
            round(np.mean(f1_vals),4)
            ))
    
#print(np.mean(f1_vals))
print("======================")
print("validation f1_score")
print(np.mean(f1_val_result))
print("======================")
# End Your Code

mean(f1_train):  0.3571 | mean(f1_validation): 0.242
mean(f1_train):  0.3663 | mean(f1_validation): 0.2446
mean(f1_train):  0.3698 | mean(f1_validation): 0.2452
mean(f1_train):  0.3656 | mean(f1_validation): 0.2471
mean(f1_train):  0.3664 | mean(f1_validation): 0.2473
validation f1_score
0.24523288539881988

### Decision Tree
Train and validate your <b>decision tree classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>
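
As one way to read the "regularization" hint above for a decision tree, the sketch below compares an unrestricted tree with a depth-limited one under 5-fold cross-validation. It runs on synthetic data from `make_classification`, not on the assignment's dataset, and `max_depth=3` is an arbitrary illustrative value rather than a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem (assumption: stands in for the real features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

results = {}
for depth in (None, 3):  # None = grow until leaves are pure; 3 = regularized
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv = cross_validate(
        tree, X_demo, y_demo, cv=5, scoring="f1_macro", return_train_score=True
    )
    results[depth] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(f"max_depth={depth!s:>4} | train F1: {results[depth][0]:.4f}"
          f" | validation F1: {results[depth][1]:.4f}")
```

A large gap between the training and validation scores suggests overfitting; limiting `max_depth` or raising `min_samples_leaf` are common ways to narrow it, although whether they improve the validation score depends on the data.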

In [5]:
# Training your decision tree classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here

#Create Pipeline
standardizer = StandardScaler()

#Create K-fold cross-validation

kf = KFold(n_splits=5, shuffle=True, random_state=42)

f1_trains = []
f1_vals = []
f1_val_result = []

for train_index, val_index in kf.split(data, target):       
    #split data
    X_train, X_val = data.iloc[train_index], data.iloc[val_index]
    y_train, y_val = target.iloc[train_index], target.iloc[val_index]
        
    #instantiate model
    #create a pipeline that standardizes the features, then fits the classifier
    pipeline = make_pipeline(standardizer,
                            DecisionTreeClassifier(random_state = 42))
        
    #calculate macro F1 on the training and validation folds
    f1_train, f1_val = calc_metrics(X_train, y_train, X_val, y_val, pipeline)
        
    #append to the appropriate list
    f1_trains.append(f1_train)
    f1_vals.append(f1_val)
    
    #running mean of the validation F1 over the folds seen so far
    f1_val_result.append(np.mean(f1_vals))
    
    #generate report (running means after each fold)
    print('mean(f1_train): {:7} | mean(f1_validation): {} '.
        format(
                round(np.mean(f1_trains),4),
                round(np.mean(f1_vals),4)
                ))
    
print("======================")
print("validation f1_score")
print(np.mean(f1_val_result))
print("======================")

# End Your Code

mean(f1_train):     1.0 | mean(f1_validation): 0.3441 
mean(f1_train):     1.0 | mean(f1_validation): 0.3573 
mean(f1_train):     1.0 | mean(f1_validation): 0.3667 
mean(f1_train):     1.0 | mean(f1_validation): 0.3787 
mean(f1_train):     1.0 | mean(f1_validation): 0.3759 
validation f1_score
0.36454714922240783



### Random Forest
Train and validate your <b>random forest classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>

In [6]:
# Training your random forest classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here

#Create Pipeline
standardizer = StandardScaler()

#Create K-fold cross-validation

kf = KFold(n_splits=5, shuffle=True, random_state=42)

f1_trains = []
f1_vals = []
f1_val_result = []

for train_index, val_index in kf.split(data, target):       
    #split data
    X_train, X_val = data.iloc[train_index], data.iloc[val_index]
    y_train, y_val = target.iloc[train_index], target.iloc[val_index]
        
    #instantiate model
    #create a pipeline that standardizes the features, then fits the classifier
    pipeline = make_pipeline(standardizer,
                            RandomForestClassifier(n_estimators=400,
                                                   random_state=42
                                                  ))
        
    #calculate macro F1 on the training and validation folds
    f1_train, f1_val = calc_metrics(X_train, y_train, X_val, y_val, pipeline)
        
    #append to the appropriate list
    f1_trains.append(f1_train)
    f1_vals.append(f1_val)
    
    #running mean of the validation F1 over the folds seen so far
    f1_val_result.append(np.mean(f1_vals))
    
    #generate report (running means after each fold)
    print('mean(f1_train): {:7} | mean(f1_validation): {} '.
        format(
                round(np.mean(f1_trains),4),
                round(np.mean(f1_vals),4)
                ))
    
print("======================")
print("validation f1_score")
print(np.mean(f1_val_result))
print("======================")

# End Your Code

mean(f1_train):     1.0 | mean(f1_validation): 0.5149 
mean(f1_train):     1.0 | mean(f1_validation): 0.4771 
mean(f1_train):     1.0 | mean(f1_validation): 0.446 
mean(f1_train):     1.0 | mean(f1_validation): 0.45 
mean(f1_train):     1.0 | mean(f1_validation): 0.4449 
validation f1_score
0.4665712229084713


### Support Vector Machine
Train and validate your <b>support vector machine classifier</b>, and print out your validation(or cross-validation) error.
> If you want, you can use cross validation, regularization, or feature selection methods. <br>
> <b> You should use F1 score('macro' option) as evaluation metric. </b>
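
One possible way to tune and validate an SVM in a single step is to put the scaler and the classifier in a pipeline and pass the pipeline to `GridSearchCV`, so that scaling is refit inside every fold. The sketch below uses synthetic data from `make_classification` (not the assignment's dataset), and the grid values are hypothetical starting points.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class problem (assumption: stands in for the real features).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# make_pipeline names each step after its lowercased class name ("svc"),
# so SVC parameters are addressed as "svc__<parameter>".
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))  # mean cross-validated macro F1 of the best setting
```

`search.best_score_` is the mean validation macro F1 across the folds for the best parameter combination, so it can serve directly as the reported cross-validation score.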

In [7]:
# Training your support vector machine classifier, and print out your validation(or cross-validation) error.
# Save your own model
# Your Code Here

#Create Pipeline
standardizer = StandardScaler()

#Create K-fold cross-validation

kf = KFold(n_splits=5, shuffle=True, random_state=42)

f1_trains = []
f1_vals = []
f1_val_result = []

for train_index, val_index in kf.split(data, target):       
    #split data
    X_train, X_val = data.iloc[train_index], data.iloc[val_index]
    y_train, y_val = target.iloc[train_index], target.iloc[val_index]
        
    #instantiate model
    #create a pipeline that standardizes the features, then fits the classifier
    pipeline = make_pipeline(standardizer,
                            SVC(
                                C = 10, 
                                kernel='rbf',
                                random_state = 42))
        
    #calculate macro F1 on the training and validation folds
    f1_train, f1_val = calc_metrics(X_train, y_train, X_val, y_val, pipeline)
        
    #append to the appropriate list
    f1_trains.append(f1_train)
    f1_vals.append(f1_val)
    
    #running mean of the validation F1 over the folds seen so far
    f1_val_result.append(np.mean(f1_vals))
    
    #generate report (running means after each fold)
    print('mean(f1_train): {:7} | mean(f1_validation): {} '.
        format(
                round(np.mean(f1_trains),4),
                round(np.mean(f1_vals),4)
                ))
    
print("======================")
print("validation f1_score")
print(np.mean(f1_val_result))
print("======================")
# End Your Code

mean(f1_train):  0.8019 | mean(f1_validation): 0.3916 
mean(f1_train):  0.8046 | mean(f1_validation): 0.3728 
mean(f1_train):  0.8142 | mean(f1_validation): 0.355 
mean(f1_train):  0.8122 | mean(f1_validation): 0.3428 
mean(f1_train):    0.81 | mean(f1_validation): 0.3519 
validation f1_score
0.3628068238781192


### (Option) Other Classifiers.
Train and validate other classifiers in your own way.
> <b> If you need to, you can import other models in this cell only, and only from scikit-learn. </b>

In [8]:
# If you need additional packages, import your own packages below.
# Your Code Here

#Create Pipeline
standardizer = StandardScaler()

#Create K-fold cross-validation

kf = KFold(n_splits=5, shuffle=True, random_state=42)

f1_trains = []
f1_vals = []
f1_val_result = []

for train_index, val_index in kf.split(data, target):       
    #split data
    X_train, X_val = data.iloc[train_index], data.iloc[val_index]
    y_train, y_val = target.iloc[train_index], target.iloc[val_index]
        
    #instantiate model
    #create a pipeline that standardizes the features, then fits the classifier
    pipeline = make_pipeline(standardizer,
                             ExtraTreeClassifier(
                                                 random_state = 42
                                                      )
                            )
        
    #calculate macro F1 on the training and validation folds
    f1_train, f1_val = calc_metrics(X_train, y_train, X_val, y_val, pipeline)
        
    #append to the appropriate list
    f1_trains.append(f1_train)
    f1_vals.append(f1_val)
    
    #running mean of the validation F1 over the folds seen so far
    f1_val_result.append(np.mean(f1_vals))
    
    #generate report (running means after each fold)
    print('mean(f1_train): {:7} | mean(f1_validation): {} '.
        format(
                round(np.mean(f1_trains),4),
                round(np.mean(f1_vals),4)
                ))
    
print("======================")
print("validation f1_score")
print(np.mean(f1_val_result))
print("======================")

# End Your Code

mean(f1_train):     1.0 | mean(f1_validation): 0.3262 
mean(f1_train):     1.0 | mean(f1_validation): 0.2926 
mean(f1_train):     1.0 | mean(f1_validation): 0.2813 
mean(f1_train):     1.0 | mean(f1_validation): 0.2801 
mean(f1_train):     1.0 | mean(f1_validation): 0.2748 
validation f1_score
0.29098182964544445



In [9]:
# If you need additional packages, import your own packages below.
# Your Code Here

#Create Pipeline
standardizer = StandardScaler()

#Create K-fold cross-validation

kf = KFold(n_splits=5, shuffle=True, random_state=42)

f1_trains = []
f1_vals = []
f1_val_result = []

for train_index, val_index in kf.split(data, target):       
    #split data
    X_train, X_val = data.iloc[train_index], data.iloc[val_index]
    y_train, y_val = target.iloc[train_index], target.iloc[val_index]
        
    #instantiate model
    #create a pipeline that standardizes the features, then fits the classifier
    pipeline = make_pipeline(standardizer,
                             GradientBoostingClassifier(
                                                       random_state = 42,
                                                       n_estimators = 400,  
                                                       subsample = 0.8
                                                      )
                            )
        
    #calculate macro F1 on the training and validation folds
    f1_train, f1_val = calc_metrics(X_train, y_train, X_val, y_val, pipeline)
        
    #append to the appropriate list
    f1_trains.append(f1_train)
    f1_vals.append(f1_val)
    
    #running mean of the validation F1 over the folds seen so far
    f1_val_result.append(np.mean(f1_vals))
    
    #generate report (running means after each fold)
    print('mean(f1_train): {:7} | mean(f1_validation): {} '.
        format(
                round(np.mean(f1_trains),4),
                round(np.mean(f1_vals),4)
                ))
    
print("======================")
print("validation f1_score")
print(np.mean(f1_val_result))
print("======================")

# End Your Code

mean(f1_train):  0.9949 | mean(f1_validation): 0.5173 
mean(f1_train):  0.9949 | mean(f1_validation): 0.5077 
mean(f1_train):  0.9945 | mean(f1_validation): 0.4808 
mean(f1_train):  0.9945 | mean(f1_validation): 0.4631 
mean(f1_train):  0.9947 | mean(f1_validation): 0.4665 
validation f1_score
0.4871095327259532


## Submit your prediction on the test data.

* Select your model and explain it briefly.
* You should read <b>"test.csv"</b>.
* Store your model's predictions in array form.
* Prediction example <br>
[2, 6, 14, 8, $\cdots$]
* We will rank your result by the <b>F1 metric (with the 'macro' option)</b>.
* <b>If you don't submit a prediction file, or you submit it in the wrong format, you will get no points for this part.</b>
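
When the test data is one-hot encoded separately from the training data, the two frames can end up with different indicator columns (for example, if a category appears in only one of the files). The sketch below shows one common way to align them with `DataFrame.reindex`. The toy frames and column names are hypothetical, and whether "test.csv" actually needs this step is an assumption that should be checked against the data.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the training and test features.
train_demo = pd.DataFrame({"f1": ["a", "b", "c"], "f2": [0.1, 0.2, 0.3]})
test_demo = pd.DataFrame({"f1": ["a", "a", "d"], "f2": [0.5, 0.6, 0.7]})

train_X = pd.get_dummies(train_demo, columns=["f1"], dtype=int)
test_X = pd.get_dummies(test_demo, columns=["f1"], dtype=int)

# Force the test frame to have exactly the training columns, in the same order:
# indicators missing from the test data are filled with 0, and indicators
# for categories unseen during training (here "f1_d") are dropped.
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
print(test_X)
```

After this step the fitted model receives the test features in the same layout it was trained on.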

# Explain your final model

    I completed the given classification task using the four given models (logistic regression, decision tree, random forest, support vector machine) plus two additional scikit-learn models, GradientBoostingClassifier and ExtraTreeClassifier.

    In 5-fold cross-validation, GradientBoostingClassifier achieved the highest macro F1 score, so I selected that model.
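
The selection step described above can be sketched roughly as follows. This is a compact illustration on synthetic data from `make_classification`, with a reduced set of candidates and smaller, hypothetical model settings so that it runs quickly; it is not the exact code used in the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem (assumption: stands in for the real features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = {}
for name, estimator in candidates.items():
    pipe = make_pipeline(StandardScaler(), estimator)
    # Mean macro F1 over the 5 validation folds.
    scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=kf, scoring="f1_macro").mean()
    print(f"{name:>20}: {scores[name]:.4f}")

best_name = max(scores, key=scores.get)
print("selected:", best_name)
```

Using the same `KFold` object for every candidate keeps the folds identical, so the scores are compared on the same splits.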

In [10]:
# Load test dataset.
# Your Code Here
df_test = pd.read_csv('./data/test.csv')
# End Your Code

In [11]:
# Predict target class
# Make variable "my_answer", type of array, and fill this array with your class predictions.
# Modify file name into your student number and your name.
# Your Code Here
my_answer = []

#one-hot
df_test_onehot = pd.get_dummies(df_test.drop(columns=['feature2','feature4','feature6']))
df_test = pd.concat([df_test,df_test_onehot], axis=1)
df_test.drop(['feature1','feature3','feature5'], axis = 1, inplace = True)

#create a pipeline that standardizes the features, then fits the selected classifier
pipeline = make_pipeline(standardizer,
                         GradientBoostingClassifier(
                                                    random_state = 42,
                                                    n_estimators = 400,  
                                                    subsample = 0.8
                                                      ))

my_model = pipeline.fit(data,target)
my_answer = my_model.predict(df_test)

file_name = "HW2_2016320120_정소영.csv"
# End Your Code


In [12]:
# This section is for saving predicted answers. DO NOT MODIFY.
pd.Series(my_answer).to_csv("./data/" + file_name, header=None, index=None)