# Preparation

Preparation gets the data ready before it enters the modeling stage. <br>
The `HR_comma_sep.csv` data goes through the following steps before modeling:
1. Import Library
2. Input Dataset
3. Preprocessing
4. Train-Test Split

## Import Library

In [4]:
import pandas as pd
import numpy as np

## Input Dataset

In [5]:
df = pd.read_csv('HR_comma_sep.csv')

In [6]:
df.sample(5)

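| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 10011 | 0.87 | 0.91 | 3 | 229 | 3 | 1 | 0 | 0 | sales | medium |
| 13389 | 0.61 | 0.96 | 3 | 214 | 2 | 0 | 0 | 0 | IT | medium |
| 12893 | 0.18 | 0.54 | 4 | 145 | 5 | 0 | 0 | 0 | RandD | low |
| 6734 | 0.53 | 0.5 | 3 | 231 | 3 | 0 | 0 | 0 | sales | low |
| 482 | 0.77 | 0.87 | 4 | 242 | 6 | 0 | 1 | 0 | sales | low |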


## Preprocessing

In [7]:
df = pd.get_dummies(df)

In [9]:
X = df.drop(columns=['left'])  # features: every column except the target
y = df['left']                 # target: 1 if the employee left, 0 otherwise

In [16]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_transform = scaler.fit_transform(X)

In [22]:
X_transform = pd.DataFrame(X_transform,columns = X.columns)
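
To make the two preprocessing steps concrete, here is a minimal sketch on a hypothetical three-row frame (the values are invented for illustration and are not taken from `HR_comma_sep.csv`). It shows `pd.get_dummies` turning a text column into indicator columns, and `MinMaxScaler` rescaling each column to the range 0 to 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame, invented for illustration only
toy = pd.DataFrame({
    "number_project": [2, 4, 6],
    "salary": ["low", "medium", "low"],
})

# One-hot encode the text column; numeric columns pass through unchanged
toy_encoded = pd.get_dummies(toy, dtype=float)

# Rescale every column to [0, 1] using (x - min) / (max - min)
toy_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy_encoded),
    columns=toy_encoded.columns,
)
print(toy_scaled)
```

Here `number_project` becomes 0.0, 0.5 and 1.0, while `salary` is replaced by the indicator columns `salary_low` and `salary_medium`.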

## Train-Test Split

In [23]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X_transform,y,test_size = 0.3,random_state = 123)
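
The notebook fits the scaler on all rows and splits afterwards. A common alternative is to split first and learn the scaling from the training rows only, so the test rows do not influence preprocessing. The sketch below illustrates that order on hypothetical random data; the names `X_demo`, `y_demo` and `demo_scaler` are assumptions made for this example and do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the encoded feature table and the `left` target
rng = np.random.default_rng(123)
X_demo = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.0, 1.0, size=200),
    "average_montly_hours": rng.integers(96, 311, size=200).astype(float),
})
y_demo = (X_demo["satisfaction_level"] < 0.4).astype(int)

# 1. Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

# 2. Learn the column minima and maxima from the training rows only
demo_scaler = MinMaxScaler().fit(X_tr)

# 3. Apply the same learned scaling to both parts
X_tr_scaled = pd.DataFrame(demo_scaler.transform(X_tr), columns=X_tr.columns)
X_te_scaled = pd.DataFrame(demo_scaler.transform(X_te), columns=X_te.columns)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this order the scaled test values can fall slightly outside 0 to 1, which is expected because the test rows were not used to find the minimum and maximum.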

# Modeling

In this part, we implement in Python several of the models we have studied:
1. k-Nearest Neighbor
2. Decision Tree
3. Logistic Regression
    
Each model also comes with an example of **hyperparameter tuning** using `GridSearchCV`.

## k-Nearest Neighbor

k-Nearest Neighbor is a model whose prediction <br>
**depends on the nearest neighbors** of the sample being classified.<br>
A new sample is assigned the majority class among its k closest training samples.
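
The nearest-neighbor idea can be sketched in a few lines of NumPy. This is a simplified illustration, assuming Euclidean distance and a plain majority vote; it is not how `KNeighborsClassifier` is implemented internally, and `knn_predict` and the demo arrays are hypothetical names and values.

```python
import numpy as np

def knn_predict(X_ref, y_ref, x_new, k=3):
    """Return the majority label among the k reference points closest to x_new."""
    distances = np.linalg.norm(X_ref - x_new, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                 # indices of the k closest points
    labels, counts = np.unique(y_ref[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # most frequent label wins

# Hypothetical 2-D points forming two small clusters
X_demo = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [0.9, 0.8], [0.8, 0.9]])
y_demo = np.array([0, 0, 0, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.2, 0.2])))    # close to the label-0 cluster
print(knn_predict(X_demo, y_demo, np.array([0.85, 0.85])))  # close to the label-1 cluster
```

Because the vote is based on distances, features on very different scales would dominate the result, which is one reason the features were rescaled during preprocessing.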

In [12]:
def evaluasi_model(model,X_test,y_test):
    """Return the accuracy of a fitted model on the given features and labels."""
    from sklearn.metrics import accuracy_score
    y_pred = model.predict(X_test)
    return accuracy_score(y_test,y_pred)

In [13]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)

KNeighborsClassifier()

In [15]:
evaluasi_model(knn,X_test,y_test)

0.9306666666666666
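
Accuracy is the fraction of predictions that match the true labels. The sketch below, on hypothetical labels, shows that calculation next to a confusion matrix, which breaks the same predictions down by class. The confusion matrix is an addition for illustration and is not part of `evaluasi_model`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, invented for illustration
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 1, 1, 0]

# 3 of the 5 predictions are correct
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(acc)
print(cm)
```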

### Tuning Hyperparameter - k-Nearest Neighbor

In [31]:
params = {'n_neighbors':[1,2,3,4,5,6,7,8,9,10]} # keep the grid small: every added combination is refitted for each fold

In [32]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
             estimator=knn,
             param_grid=params,
             scoring = 'accuracy',
             n_jobs = 10, # number of CPU cores used in parallel
             cv = 10 # 10-fold cross-validation (each candidate is fitted and scored 10 times)
            )

In [33]:
grid.fit(X_train,y_train)

GridSearchCV(cv=10, estimator=KNeighborsClassifier(), n_jobs=10,
             param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
             scoring='accuracy')

In [34]:
grid.best_params_

{'n_neighbors': 1}

In [35]:
evaluasi_model(grid,X_test,y_test)

0.9597777777777777
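
To see how the search arrives at its choice, it can help to inspect `cv_results_`, which stores the mean cross-validated score of every candidate. The sketch below runs a smaller search on hypothetical synthetic data from `make_classification`; the data, the grid values and the name `demo_grid` are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data standing in for the HR dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=123)

demo_grid = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5]},
    scoring="accuracy",
    cv=5,  # 5 folds: each candidate is fitted and scored 5 times
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate, with its mean and spread across the folds
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)
print(demo_grid.best_params_, demo_grid.best_score_)
```

`best_params_` is the candidate with the highest `mean_test_score`. After the search, `GridSearchCV` refits that candidate on the full training data by default, which is why the fitted `grid` object can be passed to `evaluasi_model` like any other model.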

## Decision Tree

Decision Tree is a model that builds a **decision tree** <br>
whose depth we can control.
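
The depth is controlled through the `max_depth` parameter, where `None` means the tree grows without a depth limit. The sketch below compares a depth-limited tree with an unlimited one on hypothetical synthetic data; the data and the names `shallow` and `full` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the HR dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=123)

# A tree limited to 3 levels and a tree with no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X_demo, y_demo)
full = DecisionTreeClassifier(max_depth=None, random_state=123).fit(X_demo, y_demo)

# Actual depth reached by each tree
print(shallow.get_depth(), full.get_depth())

# Accuracy on the data the trees were fitted on
print(shallow.score(X_demo, y_demo), full.score(X_demo, y_demo))
```

The unlimited tree can keep splitting until it fits its training data perfectly, so a high training accuracy on its own says little about how well the tree generalizes.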

In [45]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train,y_train)

DecisionTreeClassifier()

In [54]:
evaluasi_model(dtc,X_train,y_train)

1.0

In [55]:
evaluasi_model(dtc,X_test,y_test)

0.9766666666666667

In [56]:
params = {'max_depth':[3,5,7,9,11,None]} # None means no depth limit; keep the grid small: every added combination is refitted for each fold

In [57]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
             estimator=dtc,
             param_grid=params,
             scoring = 'accuracy',
             n_jobs = 10, # number of CPU cores used in parallel
             cv = 10 # 10-fold cross-validation (each candidate is fitted and scored 10 times)
            )

In [58]:
grid.fit(X_train,y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(), n_jobs=10,
             param_grid={'max_depth': [3, 5, 7, 9, 11, None]},
             scoring='accuracy')

In [59]:
grid.best_params_

{'max_depth': 9}

In [60]:
evaluasi_model(grid,X_train,y_train)

0.9864749023716545

In [61]:
evaluasi_model(grid,X_test,y_test)

0.9777777777777777

## Logistic Regression

Logistic Regression is a model based on the concept of **regression** <br>
However, the regression is applied to a transformed target: the log-odds (logit) of the class probability is modeled as a linear function of the features.
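
The transformation can be sketched with NumPy. A linear score `z` is mapped to a probability by the sigmoid function, and the logit maps that probability back to `z`. The weights, intercept and feature values below are hypothetical numbers chosen for illustration, not coefficients fitted in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: the log-odds of a probability."""
    return np.log(p / (1.0 - p))

# Hypothetical weights, intercept and one scaled sample
w_demo = np.array([1.5, -2.0])
b_demo = 0.25
x_demo = np.array([0.4, 0.7])

z = w_demo @ x_demo + b_demo   # linear part: 0.6 - 1.4 + 0.25 = -0.55
p = sigmoid(z)                 # probability of class 1
label = int(p >= 0.5)          # predict class 1 only if p reaches 0.5

print(z, p, label)
```

A negative score gives a probability below 0.5, so this sample would be assigned class 0.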

In [62]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

LogisticRegression()

In [63]:
evaluasi_model(logreg,X_train,y_train)

0.7882655490999143

In [64]:
evaluasi_model(logreg,X_test,y_test)

0.7902222222222223

In [65]:
params = {'C':[0.1,0.5,1,2,3]} # C is the inverse of regularization strength; keep the grid small: every added combination is refitted for each fold

In [67]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(
             estimator=logreg,
             param_grid=params,
             scoring = 'accuracy',
             n_jobs = 10, # number of CPU cores used in parallel
             cv = 10 # 10-fold cross-validation (each candidate is fitted and scored 10 times)
            )

In [68]:
grid.fit(X_train,y_train)

GridSearchCV(cv=10, estimator=LogisticRegression(), n_jobs=10,
             param_grid={'C': [0.1, 0.5, 1, 2, 3]}, scoring='accuracy')

In [69]:
grid.best_params_

{'C': 3}

In [70]:
evaluasi_model(grid,X_train,y_train)

0.7875035717687399

In [71]:
evaluasi_model(grid,X_test,y_test)

0.7897777777777778