# Data Preprocessing

> The dataset can be downloaded from Kaggle [here](https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists).

**Download dataset (for Google Colab users)**

In [None]:
!wget https://raw.githubusercontent.com/s-sairung/intro-dw-dm/master/term_project/dataset/aug_train.csv -P /content/dataset/
!wget https://raw.githubusercontent.com/s-sairung/intro-dw-dm/master/term_project/dataset/aug_test.csv -P /content/dataset/

**Import data**

In [None]:
import pandas as pd

df_train = pd.read_csv("dataset/aug_train.csv")
df_test = pd.read_csv("dataset/aug_test.csv")

**Remove unused attributes**

Note: `city` is dropped because it is correlated with `city_development_index`; `enrollee_id` is dropped because it is only a row identifier.

In [None]:
unused_columns = ["city", "enrollee_id"]
df_train = df_train.drop(columns=unused_columns)
df_test = df_test.drop(columns=unused_columns)

**Replace missing values with string, mode, or mean**

Co-authored-by: chain2543

In [None]:
def replace_missing(df):
    """Fill missing values in place with a sentinel string, the column mode, or the column mean."""
    # A missing education level or major is treated as its own category.
    for col in ["education_level", "major_discipline"]:
        df[col] = df[col].fillna("None")
    # The remaining categorical columns take their most frequent value.
    for col in ["relevent_experience", "enrolled_university", "experience",
                "company_size", "company_type", "last_new_job", "gender"]:
        df[col] = df[col].fillna(df[col].mode()[0])
    # The numeric column takes its mean.
    df["training_hours"] = df["training_hours"].fillna(df["training_hours"].mean())

replace_missing(df_train)
replace_missing(df_test)
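
The function above fills each frame from its own statistics, so the test set is imputed with test-set modes and means. A minimal sketch of one alternative, assuming you would rather learn the fill values from the training frame only and reuse them on the test frame. The helper names `learn_fill_values` and `apply_fill_values` and the toy frames are hypothetical and not part of the notebook:

```python
import pandas as pd


def learn_fill_values(train, mode_cols, mean_cols):
    """Compute one fill value per column from the training frame only."""
    fills = {col: train[col].mode()[0] for col in mode_cols}
    fills.update({col: train[col].mean() for col in mean_cols})
    return fills


def apply_fill_values(df, fills):
    """Return a copy of df with missing values replaced by the learned fills."""
    return df.fillna(value=fills)


# Toy frames standing in for df_train and df_test (hypothetical values).
toy_train = pd.DataFrame({
    "gender": ["Male", "Male", None, "Female"],
    "training_hours": [10.0, 20.0, None, 30.0],
})
toy_test = pd.DataFrame({
    "gender": [None, "Female"],
    "training_hours": [None, 5.0],
})

fills = learn_fill_values(toy_train, ["gender"], ["training_hours"])
toy_train_filled = apply_fill_values(toy_train, fills)
toy_test_filled = apply_fill_values(toy_test, fills)

print(fills)
print(toy_test_filled)
```

Both frames end up filled with the same values, so a category that happens to dominate the test set cannot change how its gaps are filled.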

**Apply ordinal encoding**

In [None]:
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
columns = ["gender",
        "relevent_experience",
        "enrolled_university",
        "education_level",
        "major_discipline",
        "experience",
        "company_size",
        "company_type",
        "last_new_job"]

enc.fit(df_train[columns])
df_train[columns] = enc.transform(df_train[columns])
df_test[columns] = enc.transform(df_test[columns])
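
With default settings, `OrdinalEncoder` assigns codes in sorted category order and raises an error when `transform` meets a category that was not seen during `fit`. The sketch below illustrates both points on a toy column. The values `low`, `medium`, `high`, and `tiny` are hypothetical and do not come from the dataset:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy single-column data (hypothetical category values).
train_col = np.array([["low"], ["medium"], ["high"], ["medium"]], dtype=object)
test_col = np.array([["high"], ["low"], ["tiny"]], dtype=object)  # "tiny" is unseen

# Default behaviour: categories are sorted, so the codes follow alphabetical order.
default_enc = OrdinalEncoder().fit(train_col)
print(default_enc.categories_[0])

# Explicit order plus a reserved code for unseen categories.
ordered_enc = OrdinalEncoder(
    categories=[["low", "medium", "high"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
).fit(train_col)
codes = ordered_enc.transform(test_col)
print(codes.ravel())
```

Whether an explicit order helps depends on the column. For a decision tree, it mostly matters for columns with a natural ranking, where thresholds on the code then correspond to meaningful cut points.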

**Oversample the minority class (target 1) with SMOTE to match the number of target 0 samples**

In [None]:
from imblearn.over_sampling import SMOTE

print("Before upsampling")
print("target 0: " + str(df_train[df_train["target"] == 0].shape))
print("target 1: " + str(df_train[df_train["target"] == 1].shape))

X = df_train.drop("target", axis=1)
y = df_train["target"]
sm = SMOTE(random_state=42)
X, y = sm.fit_resample(X, y)
df_train = X.join(y)

print("\nAfter upsampling")
print("target 0: " + str(df_train[df_train["target"] == 0].shape))
print("target 1: " + str(df_train[df_train["target"] == 1].shape))

**Print out the results**

In [None]:
print(df_train.head())

In [None]:
print(df_test.head())

**Split the data into training and test sets (70:30)**

In [None]:
from sklearn.model_selection import train_test_split

X = df_train.drop("target", axis=1)
y = df_train["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print("Train set size:\t" + str(X_train.shape))
print("Test set size:\t" + str(X_test.shape))

# Apply Model

**Run model**

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="gini", max_depth=None)
tree = tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
# Mean accuracy on the held-out test split.
y_score = tree.score(X_test, y_test)

**Model evaluation**

In [None]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nAccuracy (%):\n", accuracy_score(y_test, y_pred) * 100)
print("\nReport:\n", classification_report(y_test, y_pred))
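
To make the link between the confusion matrix and the reported metrics explicit, the sketch below recomputes accuracy, precision, and recall for class 1 from the four matrix cells. The labels are a small hand-made example, not results from the model above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, scikit-learn lays the matrix out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the rows predicted 1, how many are truly 1
recall = tp / (tp + fn)     # of the rows that are truly 1, how many were found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```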

**Visualize decision tree**

> WARNING: This step may take some time because the unpruned decision tree (`max_depth=None`) is very large.

In [None]:
from sklearn.tree import plot_tree
from matplotlib import pyplot as plt

fig = plt.figure(figsize=(25, 20))
_ = plot_tree(tree,
              # Pass a plain list of names; some scikit-learn versions reject a pandas Index here.
              feature_names=df_train.columns.drop("target").tolist(),
              class_names=["0", "1"],
              filled=True,
              rounded=True)
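
When the full plot is too slow or too dense to read, a truncated text view can be a quicker way to inspect the top of the tree. The sketch below uses `export_text` from `sklearn.tree` with its `max_depth` parameter on a small synthetic tree (`plot_tree` accepts a `max_depth` argument as well). The data and the feature names `f0` to `f3` are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the real training split.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]

demo_tree = DecisionTreeClassifier(criterion="gini", max_depth=None, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Print only the first levels; deeper branches are summarised as truncated.
summary = export_text(demo_tree, feature_names=feature_names, max_depth=2)
print("full depth:", demo_tree.get_depth())
print(summary)
```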

**Export visualization as a file**

In [None]:
fig.savefig("decision_tree.png")

# Prediction

**Predict test set**

In [None]:
df_test["target"] = tree.predict(df_test)

**Apply label inverse transform**

In [None]:
df_test[columns] = enc.inverse_transform(df_test[columns])

**Save predicted results as CSV**

In [None]:
df_test.to_csv("predict.csv")

**Print out the results**

In [None]:
print(df_test.head())