The following workflow was used to solve the problem:

Formatting the columns: X
1. Num -> Cat: EmployeeNumber
2. Cat -> Num: None
3. Cat -> Date: None
4. Num -> Date: None
5. Others: Attrition ()

Formatting the columns: y
1. Num -> Cat: None
2. Cat -> Num: None
3. Cat -> Date: None
4. Num -> Date: None
5. Others: Encoded Attrition: 1 if Yes else 0
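
The target encoding described above can be sketched as follows. This is a minimal illustration on a hypothetical toy series, not the notebook's actual data; the name `attrition` is an assumption made for the example.

```python
import pandas as pd

# Hypothetical toy target column with the two labels used in the dataset
attrition = pd.Series(['Yes', 'No', 'No', 'Yes'])

# 1 if the label is 'Yes', else 0
encoded = (attrition == 'Yes').astype(int)
```

A vectorized comparison avoids a Python-level `apply` and yields an integer column.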

Including additional features:
1. Manual: None
2. Kernel-based: None

Exploratory Data Analysis: Col-wise: Univariate

1. Numeric Columns
a. Mean, Median, Percentiles: Done. Used standard formula
b. Higher-order moments: Done. Used standard formula
c. Outlier: Done. Used IQR to find % of outliers and outliers mask
d. perc_negatives, perc_positives
e. perc_missing

2. Categorical Columns
a. Count, Unique, Mode, Freq of Mode,
b. Missing_perc
c. Mode_freq_perc
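
As a rough sketch of the IQR rule mentioned under the numeric columns, the snippet below flags values outside the fences q1 - 1.5 * IQR and q3 + 1.5 * IQR. The function name `iqr_outlier_mask` and the toy series are assumptions for illustration only.

```python
import pandas as pd

def iqr_outlier_mask(ser: pd.Series, iqr_width: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1, q3 = ser.quantile(0.25), ser.quantile(0.75)
    width = iqr_width * (q3 - q1)
    lower, upper = q1 - width, q3 + width
    # Missing values compare as False on both sides, so they are not flagged here
    return (ser < lower) | (ser > upper)

# Hypothetical toy column: nine ordinary values and one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
mask = iqr_outlier_mask(toy)
perc_outliers = mask.mean()  # share of flagged values
```

The mask identifies which rows are outliers, and its mean gives the percentage of outliers for the column.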

Preparing data for bivariate exploratory analysis and row-wise exploratory analysis:
a. Cols removed
b. Rows removed
c. Cat -> Num

Exploratory Data Analysis: Col-wise: Bivariate

1. Numeric ~ y. Used sklearn.feature_selection.chi2 since all features are non-negative
2. Categorical ~ y. Same as above
3. Numeric ~ Numeric. Used correlation
4. Numeric ~ Categorical. Same as above
5. Categorical ~ Categorical. Same as above
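
The feature-versus-target check can be sketched with `sklearn.feature_selection.chi2`, which expects non-negative features and a one-dimensional target. The toy arrays below are assumptions constructed for illustration: one feature is tied to the target and one is unrelated noise.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Hypothetical balanced binary target
y_toy = np.array([0] * 50 + [1] * 50)

# A count-like feature tied to the target, and an unrelated one
informative = np.where(y_toy == 1, 5, 1)
noise = rng.integers(0, 3, size=100)
X_toy = np.column_stack([informative, noise])

# chi2 returns one statistic and one p-value per feature
scores, pvals = chi2(X_toy, y_toy)
```

A small p-value suggests the feature and the target are not independent; sorting p-values in ascending order puts the strongest candidates first.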

Exploratory Data Analysis: Row-wise:

1. Missing rows: Rows where more than 50% of the columns are missing
2. Clustering: Used KMeans with default parameters to determine the clusters
3. Outlier Detection: Unsupervised outlier detection using DBSCAN clustering
4. Exploratory Data Analysis: Overall
a. Linearity (For regression problem)
b. Linearly separable (For classification problem)
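
The DBSCAN-based outlier step relies on the fact that DBSCAN assigns the label -1 to points it treats as noise. The sketch below shows this on hypothetical toy data; the `eps` and `min_samples` values are arbitrary choices for the example, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: one tight cloud plus a single distant point
dense = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
far_point = np.array([[10.0, 10.0]])
X_toy = np.vstack([dense, far_point])

# Standardize so that eps is measured on a comparable scale for every column
X_std = StandardScaler().fit_transform(X_toy)

# Points that belong to no dense region receive the label -1
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_std)
outlier_mask = labels < 0
```

In practice `eps` has to be tuned to the data, since it directly controls how many rows end up flagged.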

Preparing the data for modeling: Dimensionality Reduction: Excluding rows and irrelevant columns

1. Excluding irrelevant features: features with a high share of missing values, near-constant features, or perfectly correlated features
2. Excluding rows: outlier rows
3. Preparing the data for modeling: Imputing missing values: None
a. Basis Mean: None
b. Basis Median: None
c. Basis Mode: None
d. Custom: None

Preparing the data for modeling: Reducing dimensionality

1. Manual Feature Selection: None
2. Algorithmic: Used PCA
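
A minimal sketch of the PCA step, assuming the projection is learned from the training split and then applied unchanged to held-out data. The random arrays stand in for the encoded feature matrices and are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the encoded training and test features
train = rng.normal(size=(80, 5))
test = rng.normal(size=(20, 5))

pca = PCA(n_components=2)
train_reduced = pca.fit_transform(train)  # learn the components on train
test_reduced = pca.transform(test)        # reuse the same components on test

# Share of the training variance retained by the two components
explained = pca.explained_variance_ratio_.sum()
```

Refitting PCA on each split would produce components that are not comparable across splits, so a model trained on one projection could not be applied to another.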

Selecting a performance measure: Chose accuracy_score
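
Accuracy is the share of predictions that match the true labels. The sketch below, on a hypothetical 84/16 class split, shows why it is worth comparing a model's accuracy against a constant majority-class baseline when the classes are imbalanced.

```python
import numpy as np

# Hypothetical imbalanced target: 84 negatives and 16 positives
y_true = np.array([0] * 84 + [1] * 16)

# Baseline that always predicts the majority class
y_majority = np.zeros_like(y_true)

# Accuracy: fraction of matching labels
accuracy = (y_true == y_majority).mean()

# Share of actual positives that the baseline identifies
recall_positive = (y_majority[y_true == 1] == 1).mean()
```

Here the baseline reaches high accuracy while identifying none of the positive cases, so a candidate model is only useful if it beats this number.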

Selecting candidate models and parameters: 
1. Standalone models: 
1.a. Unsupervised approach: Nearest Neighbor approach
1.b. Supervised approach: Logistic Regression, Ridge Classifier, SGD Classifier with default parameters
2. Ensemble models: None
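
The nearest-neighbor approach can be read as: find the closest training rows for each test row, then take a majority vote over their labels. The sketch below implements that idea directly in NumPy for 0/1 labels; the function name `knn_majority_vote` and the toy points are assumptions for illustration, and the notebook itself uses `sklearn.neighbors.NearestNeighbors`.

```python
import numpy as np

def knn_majority_vote(train_x, train_y, test_x, n_neighbors=5):
    """Predict 0/1 labels by majority vote among the nearest training rows."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    # Indices of the n_neighbors closest training rows for each test row
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    votes = train_y[nearest]
    # For 0/1 labels the majority vote is 1 when more than half the votes are 1
    return (votes.mean(axis=1) > 0.5).astype(int)

# Hypothetical toy data: two well-separated groups
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]])
train_y = np.array([0, 0, 0, 1, 1, 1])
test_x = np.array([[0.05, 0.05], [1.05, 1.05]])

pred = knn_majority_vote(train_x, train_y, test_x, n_neighbors=3)
```

An odd `n_neighbors` avoids ties in the binary case.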


Selecting a fitting strategy and fitting the models: Non-iterative fitting strategy

Evaluating and selecting the best model: SGD classifier

Submitting the results

In [None]:
!pip install pingouin

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import iqr
import math
from pingouin import multivariate_normality
from scipy.stats import mode

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.preprocessing import OneHotEncoder, StandardScaler 
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier

In [None]:
data = pd.read_csv('../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')
data_model, data_submission = train_test_split(
    data,
    test_size = 0.25,
    random_state = 42,
    stratify = data.loc[:, 'Attrition'])

In [None]:
data_model.head().T

In [None]:
data_model.info()

In [None]:
convert_num_to_cat = ['EmployeeNumber']

# 1. Numeric -> Categorical
def convert_to_categorical(x):
    return x.astype(object)

# 5. Others: encode the target as 1 if Attrition is 'Yes', else 0
data_model['Attrition'] = data_model['Attrition'].apply(lambda x: 0 if x == 'No' else 1)
data_submission['Attrition'] = data_submission['Attrition'].apply(lambda x: 0 if x == 'No' else 1)

for col in convert_num_to_cat:
    data_model[col] = convert_to_categorical(data_model[col])

In [None]:
data_model.info()

In [None]:
y = ['Attrition']
data_x = data_model.loc[:, [col for col in data_model.columns if col not in y]]
data_y = data_model.loc[:, ['Attrition']]
submission_x = data_submission.loc[:, [col for col in data_submission.columns if col not in y]]
submission_y = data_submission.loc[:, ['Attrition']]
train_x, test_x, train_y, test_y = train_test_split(
    data_x,
    data_y,
    test_size = 0.25,
    random_state = 42,
    stratify = data_y.loc[:, 'Attrition'])
all_cols = list(data_x.columns)
numeric_cols = [col for col in all_cols if data_x[col].dtypes != object]
cat_cols = [col for col in all_cols if data_x[col].dtypes == object]
len(numeric_cols + cat_cols) == len(all_cols)

In [None]:
# Checking if the stratified split has happened or not
print(data_y.loc[:, 'Attrition'].sum() / data_y.loc[:, 'Attrition'].count())
print(train_y.loc[:, 'Attrition'].sum() / train_y.loc[:, 'Attrition'].count())
print(test_y.loc[:, 'Attrition'].sum() / test_y.loc[:, 'Attrition'].count())
print(submission_y.loc[:, 'Attrition'].sum() / submission_y.loc[:, 'Attrition'].count())

In [None]:
# Exploratory Data Analysis: Col-wise

def derive_outlier_perc(ser, iqr_width = 1.5):
    # Share of values outside [q1 - width, q3 + width]
    q1 = ser.quantile(q = 0.25, interpolation = 'linear')
    q3 = ser.quantile(q = 0.75, interpolation = 'linear')
    width = iqr_width * (q3 - q1)
    down_, up_ = q1 - width, q3 + width
    mask = ~ser.between(down_, up_)
    return mask.sum() / mask.shape[0]
def derive_negative_perc(ser, threshold = 0):
    # Share of values strictly below the threshold
    mask = ser < threshold
    return mask.sum() / mask.shape[0]
def derive_positive_perc(ser, threshold = 0):
    # Share of values at or above the threshold (zeros count as positive)
    mask = ser >= threshold
    return mask.sum() / mask.shape[0]
    
univariate_numeric = train_x.loc[:, numeric_cols].describe().T
univariate_categorical = train_x.loc[:, cat_cols].describe().T
univariate_numeric['perc_missing'] = 1 - univariate_numeric['count'] / train_x.shape[0]
univariate_numeric['skew'] = [train_x[col].skew() for col in univariate_numeric.index]
univariate_numeric['kurt'] = [train_x[col].kurt() for col in univariate_numeric.index]
univariate_numeric['perc_outliers'] = [derive_outlier_perc(train_x[col]) for col in univariate_numeric.index]
univariate_numeric['perc_negatives'] = [derive_negative_perc(train_x[col]) for col in univariate_numeric.index]
univariate_numeric['perc_positives'] = [derive_positive_perc(train_x[col]) for col in univariate_numeric.index]
univariate_categorical['perc_missing'] = 1 - univariate_categorical['count'] / train_x.shape[0]
univariate_categorical['mode_freq'] = univariate_categorical['freq'] / univariate_categorical['count']

In [None]:
# Preparing data for bivariate and row-wise analysis

missing_cols = []
irrelevant_cols = ['EmployeeNumber', 'EmployeeCount']
constant_cols = ['Over18']
cols_to_remove = missing_cols + irrelevant_cols + constant_cols
train_x_copy = train_x.loc[:, [col for col in train_x.columns if col not in cols_to_remove]]
train_x_nm_mask_full = ~pd.isna(train_x_copy)
# Keep only rows without any missing value, since the encoder and models below cannot handle NaN
train_x_nm_mask = (train_x_nm_mask_full.all(axis = 1))
identifier = train_x_copy.loc[train_x_nm_mask, :].index
all_cols = list(train_x_copy.columns)
numeric_cols = [col for col in all_cols if train_x_copy[col].dtypes != object]
cat_cols = [col for col in all_cols if train_x_copy[col].dtypes == object]
len(numeric_cols + cat_cols) == len(all_cols)
numeric_data, cat_data = train_x_copy.loc[train_x_nm_mask, numeric_cols], train_x_copy.loc[train_x_nm_mask, cat_cols]
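# scikit-learn >= 1.2 uses sparse_output and get_feature_names_out;
# older releases used sparse = False and get_feature_names()
ohe = OneHotEncoder(sparse_output = False)
ohe_data_array = ohe.fit_transform(cat_data.values)
cat_ohe = pd.DataFrame(ohe_data_array, columns = ohe.get_feature_names_out(), index = identifier)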
train_x_analysis = pd.concat(
    objs = (numeric_data, cat_ohe),
    axis = 1,
    join = 'outer',
    ignore_index = False,
    copy = True)
train_x_analysis.index = identifier

In [None]:
# Exploratory Data Analysis: Bivariate Analysis

# X ~ y
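# chi2 expects a one-dimensional target aligned with the rows of X
_, pvals = chi2(X = train_x_analysis, y = train_y.loc[train_x_analysis.index, 'Attrition'])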
bivariate_x_y = pd.DataFrame(pvals, index = train_x_analysis.columns, columns = ['chi2_pvals'])
bivariate_x_y.sort_values(by = 'chi2_pvals', ascending = False, inplace = True)

# X ~ X
bivariate_x_x = train_x_analysis.corr()

In [None]:
# Exploratory Data Analysis: Row-wise analysis

# Missing
threshold_perc_n_missing = 0.50
threshold_n_missing = math.ceil(train_x_nm_mask_full.shape[1] * threshold_perc_n_missing)
train_x_nm_count = train_x_nm_mask_full.sum(axis = 1)
missing_ = train_x_nm_count < threshold_n_missing

# Grouping
cluster_data = pd.concat(
    objs = (train_x_analysis, train_y),
    axis = 1,
    join = 'outer',
    ignore_index = False,
    copy = True)
cluster_data_standardized = pd.DataFrame(
    StandardScaler().fit_transform(X = cluster_data.values), 
    index = cluster_data.index, 
    columns = cluster_data.columns)
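n_cluster_candidates = np.arange(10, 15)
def return_inertia(n_clusters):
    kmeans = KMeans(
        n_clusters = n_clusters,
        init = 'k-means++',
        n_init = 10,
        max_iter = 300,
        verbose = 0,
        random_state = 42,
        copy_x = True,
        algorithm = 'lloyd') # 'auto' was removed in scikit-learn 1.3
    # Cluster on the standardized data so that large-scale columns do not dominate the distances
    kmeans.fit(X = cluster_data_standardized.values)
    return kmeans.inertia_
inertias = list(map(return_inertia, n_cluster_candidates))
# Map the position of the lowest inertia back to its candidate cluster count.
# Inertia falls as n_clusters grows, so this tends to pick the largest candidate.
op_n_cluster = int(n_cluster_candidates[np.argmin(np.array(inertias))])
kmeans = KMeans(
    n_clusters = op_n_cluster,
    init = 'k-means++',
    n_init = 10,
    max_iter = 300,
    verbose = 0,
    random_state = 42,
    copy_x = True,
    algorithm = 'lloyd')
kmeans.fit(X = cluster_data_standardized.values)
group = pd.DataFrame(kmeans.labels_, index = identifier) #This excludes the rows where there are missing values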

# Outlier Detection
db = DBSCAN(eps = 7.5) #Chosen this to get the desired no of outliers
cluster_pred = pd.Series(db.fit_predict(X = cluster_data_standardized), index = cluster_data_standardized.index)
outlier = cluster_pred < 0 # DBSCAN labels noise points as -1
outlier.sum() / outlier.shape[0]

In [None]:
# Exploratory Data Analysis: Overall

# Testing for normality
result = multivariate_normality(X = cluster_data_standardized.values, alpha = 0.05)
pval = result.pval

In [None]:
# Preparing the data for modeling: Excluding columns
dataset = dict(
    train_x_model = train_x,
    test_x_model = test_x,
    train_y_model = train_y,
    test_y_model = test_y,
    submission_x_model = submission_x,
    submission_y_model = submission_y)
exclude_datanames = ['train_y_model', 'test_y_model', 'submission_y_model']
for data_name in dataset:
    if data_name in exclude_datanames:
        continue
    data = dataset[data_name].copy(deep = False)
    data = data.loc[:, [col for col in data.columns if col not in cols_to_remove]]
    dataset.update({data_name: data})

In [None]:
# Preparing the data for modeling: Excluding rows
exclude_datanames = ['test_x_model', 'submission_x_model', 'test_y_model', 'submission_y_model']
for data_name in dataset:
    if data_name in exclude_datanames:
        continue
    data = dataset[data_name].copy(deep = False)
    data = data.loc[~outlier.values, :]
    dataset.update({data_name: data})

In [None]:
# Preparing the data for modeling: Imputing missing values: None as there are no missing values

In [None]:
# Preparing the data for modeling: OneHotEncode the categorical features

numeric_cols = [col for col in dataset['train_x_model'].columns if dataset['train_x_model'][col].dtypes != object]
cat_cols = [col for col in dataset['train_x_model'].columns if dataset['train_x_model'][col].dtypes == object]
len(numeric_cols + cat_cols) == len(all_cols)
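# Fit the encoder on the training split only and reuse it for every split,
# so that all splits end up with the same encoded columns.
# Requires scikit-learn >= 1.2 (sparse_output, get_feature_names_out).
model_ohe = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
model_ohe.fit(dataset['train_x_model'].loc[:, cat_cols].values)
def ohe_data(data):
    identifier = data.index
    numeric_data, cat_data = data.loc[:, numeric_cols], data.loc[:, cat_cols]
    ohe_data_array = model_ohe.transform(cat_data.values)
    cat_ohe = pd.DataFrame(ohe_data_array, columns = model_ohe.get_feature_names_out(), index = identifier)
    c_data = pd.concat(
        objs = (numeric_data, cat_ohe),
        axis = 1,
        join = 'outer',
        ignore_index = False,
        copy = True)
    return c_data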
exclude_datanames = ['train_y_model', 'test_y_model', 'submission_y_model']
for data_name in dataset:
    if data_name in exclude_datanames:
        continue
    data = dataset[data_name].copy(deep = False)
    data_ = ohe_data(data = data)
    dataset.update({data_name: data_})

In [None]:
# Preparing the data for modeling: Reducing dimensionality: Manual feature selection: None

In [None]:
# Preparing the data for modeling: Reducing dimensionality: Algorithmic feature selection

# Determine the n_components
pca = PCA(n_components = 2)
pca.fit(X = dataset['train_x_model'].values)
pca.explained_variance_ratio_
opt_n_components = 2

# Reduce the data using the opt_n_components
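opt_pca = PCA(n_components = opt_n_components)
# Fit the projection on the training split only, then apply it unchanged to every split
opt_pca.fit(X = dataset['train_x_model'].values)
def reduce_data(data):
    data = opt_pca.transform(data.values)
    return data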
exclude_datanames = ['train_y_model', 'test_y_model', 'submission_y_model']
for data_name in dataset:
    if data_name in exclude_datanames:
        continue
    data = dataset[data_name].copy(deep = False)
    data_ = reduce_data(data = data)
    dataset.update({data_name: data_})

In [None]:
# Select a performance measure

eval_metric = accuracy_score

In [None]:
# Selecting the standalone models:
# 1.a. Unsupervised approach: Nearest Neighbor approach
# 1.b. Supervised approach: Logistic Regression, Ridge Classifier, SGD Classifier with default parameters

models = []
models.append(NearestNeighbors(n_neighbors = 5))
models.append(LogisticRegression())
models.append(RidgeClassifier())
models.append(SGDClassifier())

# Prediction functions
# NN Model
def predict_and_score_from_nn_model(train_x, test_x, train_y, test_y, model, return_pred = False):
    # Convert any DataFrame inputs to arrays (test_y may be None when only predictions are needed)
    train_x, test_x, train_y, test_y = [
        data.values if isinstance(data, pd.DataFrame) else data
        for data in (train_x, test_x, train_y, test_y)]
    model.fit(X = train_x, y = None)
    _, indices = model.kneighbors(X = test_x)
    y_pred = []
    for neighbors in indices:
        # Majority vote among the neighbors' labels; np.atleast_1d copes with
        # SciPy versions where mode returns a scalar instead of an array
        pred_point = np.atleast_1d(mode(train_y[neighbors, -1]).mode)[0]
        y_pred.append(pred_point)
    if return_pred:
        return y_pred
    return eval_metric(y_true = test_y.ravel(), y_pred = y_pred)
def predict_and_score_from_other_models(train_x, test_x, train_y, test_y, model, return_pred = False):
    # Fit the model that was passed in; previously it was always replaced by LogisticRegression
    model.fit(X = train_x, y = train_y.values.ravel())
    y_pred = model.predict(X = test_x)
    if return_pred:
        return y_pred
    return eval_metric(y_true = test_y.values.ravel(), y_pred = y_pred)
predict_and_score = []
predict_and_score.append(predict_and_score_from_nn_model)
predict_and_score += [predict_and_score_from_other_models] * 3

In [None]:
# Selecting the ensemble models: None

In [None]:
# Fitting strategy: Non-iterative fitting strategy

scores = []
for model_, pred_score in zip(models, predict_and_score):
    scores.append(pred_score(
        train_x = dataset['train_x_model'],
        test_x = dataset['test_x_model'],
        train_y = dataset['train_y_model'],
        test_y = dataset['test_y_model'],
        model = model_))

In [None]:
# Selecting the best model and submitting the predictions

# Pick the model with the highest score on the test split
best_index = int(np.argmax(scores))
submission = predict_and_score[best_index](
    train_x = dataset['train_x_model'],
    test_x = dataset['submission_x_model'],
    train_y = dataset['train_y_model'],
    test_y = None,
    model = models[best_index],
    return_pred = True)

In [None]:
submission