<h1>Employee Classifier Model</h1>

<p>Import the openml package to fetch the dataset, along with NumPy and pandas:</p>

In [1]:
import openml
import numpy as np
import pandas as pd

<p>Import the scikit-learn modules needed for binary classification with 10-fold cross-validation, using an SVM inside a Pipeline:</p>

In [2]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

<p>Get the dataset from OpenML by ID and retrieve features and target:</p>

In [3]:
dataset = openml.datasets.get_dataset(4135)
# get_data returns features, target, categorical indicator, and attribute names
X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)

<p>Build a Pipeline that standardizes the features with a StandardScaler and then fits a support vector machine with a linear kernel:</p>

In [4]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),    # standardize each feature to zero mean, unit variance
    ('svm', SVC(kernel='linear'))    # linear-kernel support vector classifier
])
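
<p>Placing the scaler inside the Pipeline means its statistics are learned from the training portion of each fold only and then reused on the held-out portion. The following is a minimal plain-Python sketch of that idea on made-up numbers; the helpers fit_scaler and transform are hypothetical names for illustration, not scikit-learn functions:</p>

```python
import statistics

def fit_scaler(train_values):
    """Learn mean and population standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std

def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]

# Toy single-feature data (assumed values, not taken from the dataset)
train = [2.0, 4.0, 6.0, 8.0]
test = [10.0]

mean, std = fit_scaler(train)                 # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)      # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

<p>The held-out value is scaled with the training mean and standard deviation, so no information from the test fold influences preprocessing.</p>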

<p>Define a 10-fold cross-validation splitter with shuffling enabled and a random state of 42 for reproducibility:</p>

In [5]:
kf = KFold(n_splits=10, shuffle=True, random_state=42)
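
<p>As a rough illustration of what a shuffled k-fold split does, the sketch below shuffles the sample indices once with a fixed seed and cuts them into contiguous folds, each serving as the test set exactly once. The function shuffled_kfold_indices is a hypothetical helper, and because it uses Python's random module it will not reproduce the exact splits that scikit-learn generates for the same seed:</p>

```python
import random

def shuffled_kfold_indices(n_samples, n_splits, seed):
    """Shuffle indices once, then cut them into n_splits near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        # The first `extra` folds receive one additional sample
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

# Assumed toy size: 25 samples split into 10 folds
folds = shuffled_kfold_indices(n_samples=25, n_splits=10, seed=42)

for i, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [j for fold in folds for j in fold if j not in held_out]
    print(f"fold {i}: {len(train_idx)} train, {len(test_idx)} test")
```

<p>Fixing the seed makes the folds identical from run to run, which is the role random_state plays in KFold.</p>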

<p>Compute the cross-validation scores by passing the pipeline, features, target, and KFold splitter to cross_val_score, using accuracy as the metric:</p>

In [6]:
cv_scores = cross_val_score(pipeline, X, y, cv=kf, scoring='accuracy')

<p>Output the accuracy per fold, mean accuracy and standard deviation:</p>

In [7]:
print(f"Accuracy per fold: {cv_scores}")

Accuracy per fold: [0.94354593 0.94385108 0.94263045 0.94110467 0.94842844 0.93774794
 0.93713763 0.93744278 0.94415624 0.94505495]


In [8]:
print(f"Mean accuracy: {cv_scores.mean():.4f}")

Mean accuracy: 0.9421


In [9]:
print(f"Standard deviation: {cv_scores.std():.4f}")

Standard deviation: 0.0035
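
<p>NumPy's std() uses the population formula (ddof=0) by default, which is what the value above reflects. The sketch below recomputes the summary statistics from the printed fold accuracies using only the standard library and also shows the sample standard deviation (ddof=1) for comparison; the scores are copied from the output above:</p>

```python
import statistics

# Fold accuracies copied from the printed output above
fold_scores = [
    0.94354593, 0.94385108, 0.94263045, 0.94110467, 0.94842844,
    0.93774794, 0.93713763, 0.93744278, 0.94415624, 0.94505495,
]

mean_acc = statistics.fmean(fold_scores)
pop_std = statistics.pstdev(fold_scores)     # divides by n, like cv_scores.std()
sample_std = statistics.stdev(fold_scores)   # divides by n - 1, like cv_scores.std(ddof=1)

print(f"Mean accuracy: {mean_acc:.4f}")
print(f"Population standard deviation: {pop_std:.4f}")
print(f"Sample standard deviation: {sample_std:.4f}")
```

<p>With only 10 folds the two estimates differ slightly (0.0035 versus 0.0037), so it is worth stating which one is reported.</p>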
