This report was generated using the [FastBenchmark](http://fastbenchmark.me/) project.  
# Predicting target using the aug_train.csv data set  
Here we will try to predict **target** using the **aug_train.csv** data set.  
## Reading the data and splitting it into train and test sets
First we import the necessary libraries, read the data from the CSV file and split it 
into training and test sets. The default split is 80/20, but you can change it in the code 
(by changing `test_size=0.2` on the line under the _make train and test_ comment).  


In [None]:
# The code was generated by FastBenchmark telegram bot


import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import xgboost as xgb
from datetime import datetime
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_log_error
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder


warnings.simplefilter(action='ignore')
pd.options.mode.chained_assignment = None

# which correlation is good enough?
correlation_level = 0.7

# how many points to draw to visualize correlations
sequence_length = 100

# list of errors
errors = []
# list of columns to delete
to_drop = []

# coefficient to make plots for categories (if more categories - will not plot it)
cat_coef = 10

# reading the data
data = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv', sep = ',')


# checking that the target column is present in the data
target_col_name = 'target'



if target_col_name not in data.columns:
    print('No ' + str(target_col_name) + ' in data set')
    errors.append('No ' + str(target_col_name) + ' in data set')
    to_drop.append(target_col_name)
    quit()

# coefficient to check if categories are rare enough
cat_check = 0.01*len(data[target_col_name])

# make train and test
train, test = train_test_split(data, test_size=0.2)

# remove rows where target is NaN
train = train[train[target_col_name].notna()]
if target_col_name in test.columns:
    test = test[test[target_col_name].notna()]
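
The split above is drawn at random on every run, so results can differ between runs. `train_test_split` accepts a `random_state` argument to fix this. As a minimal sketch of the same idea using only pandas (the toy frame and the seed value are assumptions for illustration, not part of the generated report):

```python
import pandas as pd

# Toy frame standing in for aug_train.csv (values are made up for illustration).
data = pd.DataFrame({
    "enrollee_id": range(10),
    "target": [0.0, 1.0] * 5,
})

# Hypothetical seed; any fixed integer makes the split repeatable.
SEED = 42

# 80/20 split: sample 80% of the rows for training, keep the rest for testing.
train = data.sample(frac=0.8, random_state=SEED)
test = data.drop(train.index)

print(len(train), len(test))
```

Re-running this cell with the same seed yields the same rows in each set, which makes later comparisons between models easier to trust.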



The data is now loaded and split into training and test sets.  
We will train the model on the training set and then predict the **target** values in the test
 set from its other variables.  
But first we need to look at the data and prepare it so that it can be used to
train an xgboost model.   

Let's look at the first rows of the training set and then at a pair plot of the data (the diagonal shows the distribution of each variable).  
The pair plot is very useful for understanding dependencies in the data:   


In [None]:
train.head()


In [None]:
sns.pairplot(train, hue=target_col_name)

## NaN functions
Here are the functions for handling NaN values.  
By default, NaNs in numeric columns are replaced with the column mean (`np.nanmean`) and NaNs in categorical columns with -1.  


In [None]:
def nan_numeric(data):
    result = np.nanmean(data)
    return result


def nan_categorical(data):
    result = -1
    return result
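
As a rough illustration of how the numeric default is applied later in this report, the sketch below computes the mean on the training column only and uses it to fill both sets. The two toy series are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy columns (illustrative values only).
train_col = pd.Series([0.9, np.nan, 0.7, 0.8])
test_col = pd.Series([np.nan, 0.6])

# The mean is computed on the training column only, ignoring NaNs.
fill_value = np.nanmean(train_col)

# The same training mean fills the gaps in both sets.
train_filled = train_col.fillna(fill_value)
test_filled = test_col.fillna(fill_value)

print(train_filled.tolist(), test_filled.tolist())
```

Using the training mean for the test set keeps information from the test rows out of the preprocessing step.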


## Target variable target (categorical)
Here we deal with **target**, the variable whose values we want to predict.  
First we look at the distribution of its values.
Then we transform the data so that it can be used in the xgboost model.



In [None]:
# TARGET COLUMN (categorical) - target

col_name = train.columns[13]
train[col_name] = train[col_name].astype(str)
if col_name in test.columns:
    test[col_name] = test[col_name].astype(str)
categoryType = 'category_bool'
try:
    sns.displot(x=train[col_name])
except:
    sns.distplot(train[col_name])


In [None]:
if categoryType == 'category' or categoryType == 'category_bool':
    # replacing category names with numbers
    encoder = LabelEncoder()
    encoder.fit(train[col_name])
    target = encoder.transform(train[col_name])
    target_real_values = {'Category name': encoder.classes_, 'Category value': range(len(encoder.classes_))}
    if col_name in test.columns:
        test_target = encoder.transform(test[col_name])
    to_drop.append(col_name)

    # dealing with NaN values
    target = pd.to_numeric(target, errors='coerce')
    target[pd.isna(target)] = np.nanmean(target)
    if col_name in test.columns:
        test_target = pd.to_numeric(test_target, errors='coerce')
        test_target[pd.isna(test_target)] = np.nanmean(test_target)

else:
    print('This version can predict only numeric and categorical values')
    quit()



if col_name in train.columns:
    to_drop.append(col_name)
else:
    print('No ' + str(col_name) + ' in data set')
    errors.append('No ' + str(col_name) + ' in data set')
    to_drop.append(col_name)
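
`LabelEncoder` assigns integer codes to class names in sorted order, and `target_real_values` above stores that name-to-code mapping. A minimal sketch of the same mapping with plain NumPy, on made-up labels (the label strings are an assumption about how the target looks after `astype(str)`):

```python
import numpy as np

# Toy target values after astype(str); illustrative only.
labels = np.array(["1.0", "0.0", "0.0", "1.0", "0.0"])

# np.unique returns the sorted class names and, for each label, its class index.
classes, codes = np.unique(labels, return_inverse=True)

# Name -> code lookup, comparable to target_real_values in the cell above.
mapping = {str(name): int(code) for code, name in enumerate(classes)}

print(mapping, codes.tolist())
```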



## enrollee_id variable (numeric)
This is an ID column, so it will be dropped.


In [None]:
# NUMERIC  COLUMN - enrollee_id

col_name = train.columns[0]
to_drop.append(col_name)

## city variable (categorical)
Here we deal with the **city** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.

In [None]:
# CATEGORY COLUMN - city

col_name = train.columns[1]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)
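
The encoding cell above has to cope with categories that appear in the test set but not in the training set. One possible way to express that idea, sketched here with pandas only and made-up city names (`city_999` is a hypothetical unseen category), is to build the mapping from the training data and send everything unknown to a single extra code:

```python
import pandas as pd

# Toy columns; the values are illustrative, and "city_999" is unseen in train.
train_city = pd.Series(["city_103", "city_21", "city_103", "city_16"])
test_city = pd.Series(["city_21", "city_999", "city_16"])

# Codes are assigned in sorted order of the training categories.
categories = sorted(train_city.astype(str).unique())
mapping = {name: code for code, name in enumerate(categories)}

# Every category not seen during training shares one extra "other" code.
other_code = len(mapping)

train_encoded = train_city.astype(str).map(mapping)
test_encoded = test_city.astype(str).map(mapping).fillna(other_code).astype(int)

print(train_encoded.tolist(), test_encoded.tolist())
```

This is a simplification of the cell above, not a replacement for it: it never refits the encoder on the test data, so training codes stay stable.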



## city_development_index variable (numeric)
Here we deal with the **city_development_index** variable, which is numeric.  
We will look at its statistics, its distribution and its relationship with the target variable **target**.


In [None]:
# NUMERIC  COLUMN - city_development_index

col_name = train.columns[2]
try:
    sns.displot(x=train[col_name].dropna())
except:
    sns.distplot(train[col_name].dropna())


**city_development_index** variable statistics:

In [None]:
pd.DataFrame(train[col_name].describe())


Now we will prepare the **city_development_index** variable for use in the xgboost model.

In [None]:
if col_name in test.columns:

    # dealing with NaNs in column
    train[col_name] = pd.to_numeric(train[col_name], errors='coerce')
    test[col_name] = pd.to_numeric(test[col_name], errors='coerce')
    # fill NaNs in both sets with the training mean (avoids chained assignment)
    train_mean = nan_numeric(train[col_name])
    train[col_name] = train[col_name].fillna(train_mean)
    test[col_name] = test[col_name].fillna(train_mean)

# if there is no such column in test set - add information to errors and delete column in training set
else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



It can also be useful to add a log-transformed copy of a numeric variable to the data set, so let's do that and check its distribution:

In [None]:
if col_name in test.columns:
    train[col_name + '_log'] = np.log(train[col_name] + 1 - min(0, min(train[col_name])))
    test[col_name + '_log'] = np.log(test[col_name] + 1 - min(0, min(test[col_name])))
    try:
        sns.displot(x=train[col_name + '_log'])
    except:
        sns.distplot(train[col_name + '_log'])
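
The log transform above shifts the column before taking the logarithm, so that the smallest value maps to at least 1 and the result is always defined. A small sketch of that formula as a standalone function (the name `shifted_log` and the sample values are assumptions for illustration):

```python
import numpy as np


def shifted_log(values):
    """Return log(x + 1 - min(0, min(x))), as in the cell above."""
    values = np.asarray(values, dtype=float)
    # For non-negative data the shift is 1; negative minima increase it.
    shift = 1 - min(0.0, values.min())
    return np.log(values + shift)


non_negative = shifted_log([0.0, 1.0, 9.0])
with_negative = shifted_log([-4.0, 0.0, 5.0])

print(non_negative, with_negative)
```

For non-negative columns this is the same as `np.log1p`. Note that the shift depends on the minimum of the data it is computed on, so training and test columns with different minima are shifted differently.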


## gender variable (categorical)
Here we deal with the **gender** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.


In [None]:
# CATEGORY COLUMN - gender

col_name = train.columns[3]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



## relevent_experience variable (categorical)
Here we deal with the **relevent_experience** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.


In [None]:
# CATEGORY COLUMN - relevent_experience

col_name = train.columns[4]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



## enrolled_university variable (categorical)
Here we deal with the **enrolled_university** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.


In [None]:
# CATEGORY COLUMN - enrolled_university

col_name = train.columns[5]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



## education_level variable (categorical)
Here we deal with the **education_level** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.


In [None]:
# CATEGORY COLUMN - education_level

col_name = train.columns[6]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



## major_discipline variable (categorical)
Here we deal with the **major_discipline** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.


In [None]:
# CATEGORY COLUMN - major_discipline

col_name = train.columns[7]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



## experience variable (categorical)
Here we deal with the **experience** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.


In [None]:
# CATEGORY COLUMN - experience

col_name = train.columns[8]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



## company_size variable (categorical)
Here we deal with the **company_size** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.


In [None]:
# CATEGORY COLUMN - company_size

col_name = train.columns[9]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



## company_type variable (categorical)
Here we deal with the **company_type** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.


In [None]:
# CATEGORY COLUMN - company_type

col_name = train.columns[10]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



## last_new_job variable (categorical)
Here we deal with the **last_new_job** variable, which is categorical.  
We will look at its distribution and its relationship with the target variable **target**.


In [None]:
# CATEGORY COLUMN - last_new_job

col_name = train.columns[11]
try:
    sns.displot(x=train.dropna()[col_name], hue=train.dropna()[target_col_name])
except: pass


In [None]:
if col_name in test.columns:
    try:
        train[col_name] = train[col_name].astype(float)
        test[col_name] = test[col_name].astype(float)
    except: pass
    if train.dtypes[col_name] in ['int64', 'float64'] and test.dtypes[col_name] in ['int64', 'float64']:
        test[col_name + '_encoded'] = test[col_name]
        train[col_name + '_encoded'] = train[col_name]
    else:
        # replacing category names with numbers
        encoder = LabelEncoder()
        encoder.fit(train[col_name].astype(str))
        train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
        try:
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
        except:
            encoder = LabelEncoder()
            encoder.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
            train[col_name + '_encoded'] = encoder.transform(train[col_name].astype(str))
            test[col_name + '_encoded'] = encoder.transform(test[col_name].astype(str))
            if max(test[col_name + '_encoded']) > max(train[col_name + '_encoded']):
                other_cat = max(train[col_name + '_encoded']) + 1
                test[col_name + '_encoded'][test[col_name + '_encoded'] > max(train[col_name + '_encoded'])] = other_cat
    to_drop.append(col_name)

    # dealing with NaN values
    train[col_name + '_encoded'] = pd.to_numeric(train[col_name + '_encoded'], errors='coerce')
    train[col_name + '_encoded'][pd.isna(train[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])
    test[col_name + '_encoded'] = pd.to_numeric(test[col_name + '_encoded'], errors='coerce')
    test[col_name + '_encoded'][pd.isna(test[col_name + '_encoded'])] = nan_categorical(train[col_name + '_encoded'])


# if there is no such column in test set - add information to errors and delete column in training set

else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



## training_hours variable (numeric)
Here we deal with the **training_hours** variable, which is numeric.  
We will look at its statistics, its distribution and its relationship with the target variable **target**.


In [None]:
# NUMERIC  COLUMN - training_hours

col_name = train.columns[12]
try:
    sns.displot(x=train[col_name].dropna())
except:
    sns.distplot(train[col_name].dropna())


**training_hours** variable statistics:

In [None]:
pd.DataFrame(train[col_name].describe())


Now we will prepare the **training_hours** variable for use in the xgboost model.

In [None]:
if col_name in test.columns:

    # dealing with NaNs in column
    train[col_name] = pd.to_numeric(train[col_name], errors='coerce')
    test[col_name] = pd.to_numeric(test[col_name], errors='coerce')
    # fill NaNs in both sets with the training mean (avoids chained assignment)
    train_mean = nan_numeric(train[col_name])
    train[col_name] = train[col_name].fillna(train_mean)
    test[col_name] = test[col_name].fillna(train_mean)

# if there is no such column in test set - add information to errors and delete column in training set
else:
    print('No ' + str(col_name) + ' in test set')
    errors.append('No ' + str(col_name) + ' in test set')
    to_drop.append(col_name)



It can also be useful to add a log-transformed copy of a numeric variable to the data set, so let's do that and check its distribution:

In [None]:
if col_name in test.columns:
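    # the shift is taken from the training set so that train and test share one transformation
    shift = 1 - min(0, train[col_name].min())
    train[col_name + '_log'] = np.log(train[col_name] + shift)
    # test values below the training minimum are clipped so the log argument stays >= 1
    test[col_name + '_log'] = np.log((test[col_name] + shift).clip(lower=1))
    try:
        sns.displot(x=train[col_name + '_log'])
    except AttributeError:
        # older seaborn versions have no displot
        sns.distplot(train[col_name + '_log'])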
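
The shift inside the logarithm keeps its argument at 1 or above even when the column contains negative values. A minimal sketch on toy data, assuming the shift is computed once from the training column and reused for the test column (the function name `shifted_log` is made up for this example):

```python
import numpy as np
import pandas as pd


def shifted_log(values, reference):
    """Return log(values + 1 - min(0, min(reference))).

    `reference` is the column the shift is derived from (assumed here to be
    the training column), so train and test share the same transformation.
    """
    shift = 1 - min(0, reference.min())
    return np.log(values + shift)


toy_train = pd.Series([-3.0, 0.0, 5.0])
toy_test = pd.Series([-1.0, 2.0])

toy_train_log = shifted_log(toy_train, toy_train)
toy_test_log = shifted_log(toy_test, toy_train)
print(toy_train_log.round(3).tolist())
print(toy_test_log.round(3).tolist())
```

For a column without negative values the shift is simply 1, so the transformation reduces to `log(x + 1)`.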


## Remove all useless columns
Here we will remove every column collected in `to_drop` from both the training and the test set.


In [None]:
# clean data
for name in to_drop:
    if name in train.columns:
        del train[name]
    if name in test.columns:
        del test[name]
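
The same clean-up can also be written with `DataFrame.drop`, which skips names that are not present when `errors='ignore'` is passed. A small sketch on toy frames (the column names are placeholders):

```python
import pandas as pd

toy_train = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
toy_test = pd.DataFrame({'a': [7, 8], 'c': [9, 10]})
toy_to_drop = ['b', 'missing_column']

# errors='ignore' makes drop skip names that are not present in the frame
toy_train = toy_train.drop(columns=toy_to_drop, errors='ignore')
toy_test = toy_test.drop(columns=toy_to_drop, errors='ignore')
print(toy_train.columns.tolist(), toy_test.columns.tolist())
```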



## Prediction for target
Here we will train an xgboost model to predict our target variable **target**.  
Everything we did above was preparation, and now we will use it to make a better prediction for the target variable.  
Here you can use different models, make ensembles, and try to improve the solution.  



In [None]:
# MAKE MODEL

# keep the full sets, then hold out 20% of the training data for validation
real_train = train
real_target = target
real_test = test
train, test, target, test_target = train_test_split(train, target, test_size=0.2)

# the learning rate is tied to the number of rounds (eta * rounds = 10),
# so more rounds mean smaller steps and longer training
num_boost_rounds = 1000
eta = 10/num_boost_rounds

# parameters for xgboost
xgb_params = {
    'eta': eta,
    'subsample': 0.80,
    'objective': 'binary:logistic',
    'eval_metric': 'error'
}

# transforming the data for xgboost
dtrain = xgb.DMatrix(train, target)
dtest = xgb.DMatrix(test)
dreal_test = xgb.DMatrix(real_test)
# training the model
model = xgb.train(dict(xgb_params), dtrain, num_boost_round=num_boost_rounds)

# predict
preds = model.predict(dtest)
preds = pd.to_numeric(preds)
preds[pd.isna(preds)] = np.nanmean(preds)

real_preds = model.predict(dreal_test)
real_preds = pd.to_numeric(real_preds)
real_preds[pd.isna(real_preds)] = np.nanmean(real_preds)

preds_categorical = []
test_target_categorical = []

# positional copy of the targets: test_target may keep the shuffled index of the split
test_target_values = np.asarray(test_target)

for i in range(len(preds)):
    if round(preds[i]) in target_real_values['Category value']:
        pred_name_index = target_real_values['Category value'].index(round(preds[i]))
    else:
        pred_name_index = target_real_values['Category value'].index(min(target_real_values['Category value']))
    test_target_name_index = target_real_values['Category value'].index(test_target_values[i])
    preds_categorical.append(target_real_values['Category name'][pred_name_index])
    test_target_categorical.append(target_real_values['Category name'][test_target_name_index])
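
With the `binary:logistic` objective the model returns probabilities, so the loop above rounds each prediction to 0 or 1 and looks up the matching category name. A vectorised sketch of that mapping, using a made-up `toy_target_real_values` dictionary in place of the notebook's `target_real_values` (the category names are invented):

```python
import numpy as np

# hypothetical stand-in for target_real_values built earlier in the notebook
toy_target_real_values = {
    'Category value': [0, 1],
    'Category name': ['stays', 'looks for a job'],
}

toy_preds = np.array([0.12, 0.81, 0.49, 0.50, 0.97])

# threshold at 0.5; note that Python's round() sends exactly 0.5 to 0
toy_labels = (toy_preds > 0.5).astype(int)

value_to_name = dict(zip(toy_target_real_values['Category value'],
                         toy_target_real_values['Category name']))
toy_names = [value_to_name[int(v)] for v in toy_labels]
print(toy_names)
```

Changing the 0.5 threshold trades precision against recall, which can matter when the classes are imbalanced.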


Feature importance for **target** prediction:

In [None]:
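# preparing the data and printing the statistics
xgb_fea_imp = pd.DataFrame(list(model.get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
# newer pandas versions replace Styler.hide_index() with Styler.hide(axis='index')
styler = xgb_fea_imp.style
styler.hide(axis='index') if hasattr(styler, 'hide') else styler.hide_index()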


Classification report and plot for prediction:

In [None]:
for_plot = pd.DataFrame({'Id': range(len(preds_categorical)), 'Predictions': preds_categorical, 'Real values': test_target_categorical})
sns.scatterplot(data=for_plot, x='Id', y='Predictions', hue='Real values')
pd.DataFrame(classification_report(test_target_categorical, preds_categorical, output_dict=True)).T


We will also look at the ROC AUC score for the non-rounded predictions and plot them:

In [None]:
for_plot = pd.DataFrame({'Id': range(len(preds)), 'Prediction': preds, 'Real values': np.round(test_target)})
sns.scatterplot(data=for_plot, x='Id', y='Prediction', hue='Real values')
print('roc auc score is ' + str(roc_auc_score(np.round(test_target), preds)))
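
ROC AUC can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative one, which is why it is computed on the non-rounded predictions. The sketch below computes it from ranks on toy data; `rank_auc` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np
import pandas as pd


def rank_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    Ties in `scores` get average ranks, so a tied pair counts as one half.
    """
    y_true = np.asarray(y_true)
    ranks = pd.Series(scores).rank(method='average').to_numpy()
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    rank_sum_pos = ranks[y_true == 1].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(rank_auc(toy_y, toy_scores))
```

In the toy data three of the four positive-negative pairs are ordered correctly, which gives a score of 0.75.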



## Errors statistics
Here we will look at the largest mistakes and compare them to our training set.  
That could help us with feature generation.
First we will look at the largest absolute errors.


In [None]:
# how many top errors to work with
top_errors = 5

# how many of the best variables to use
top_features = 4

# how many rows from the training set to look at
top_rows = 5

# maximum distance between values, in %, for them to count as similar (i.e. to fall in the same cluster)
dist = 5

preds_df = pd.DataFrame({'prediction': preds})
preds_df.index = test.index
test_target_df = pd.DataFrame({target_col_name: test_target})
test_target_df.index = test.index
target_df = pd.DataFrame({target_col_name: target})
target_df.index = train.index
test_with_target = test.join(test_target_df).join(preds_df)
train_with_target = train.join(target_df)
test_with_target['abs_err'] = abs(test_with_target[target_col_name] - test_with_target['prediction'])
test_top_errors = test_with_target.sort_values(by='abs_err', ascending = False).head(top_errors)
mean_top_err = test_top_errors.describe().iloc[1:2, :]
show_order = [target_col_name, 'prediction', 'abs_err'] + xgb_fea_imp['feature'].tolist()
test_top_errors[show_order]
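
The selection of the largest errors boils down to sorting by absolute difference and taking the first rows. A compact sketch on toy values (all numbers are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'target': [0, 1, 1, 0, 1],
    'prediction': [0.10, 0.20, 0.90, 0.70, 0.55],
})
toy['abs_err'] = (toy['target'] - toy['prediction']).abs()

# on this data nlargest returns the same rows as
# sort_values(by='abs_err', ascending=False).head(2)
toy_top_errors = toy.nlargest(2, 'abs_err')
print(toy_top_errors)
```

Rows where a confident prediction lands on the wrong side of the target come out on top.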


Now we will try to find the rows in our training set that are most similar to these largest-error rows:

In [None]:
features = xgb_fea_imp['feature'][0:top_features].tolist()
to_compare = train_with_target
for feature in features:
    # band of +/- dist percent around the mean of the top-error rows;
    # abs() keeps the band valid when the mean is negative
    center = mean_top_err[feature].values[0]
    margin = abs(center) * dist / 100
    condition = (to_compare[feature] >= center - margin) & (to_compare[feature] < center + margin)
    if to_compare[condition].shape[0] >= top_rows:
        to_compare = to_compare[condition]
    else:
        break
show_order = [target_col_name] + xgb_fea_imp['feature'].tolist()
to_compare[show_order].head(top_rows)
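
The filter above keeps training rows whose value lies within `dist` percent of the mean over the top-error rows. A sketch of a single filtering step on toy data; the helper `within_percent` is hypothetical, and it uses the absolute value of the reference so that the band also works for negative means:

```python
import pandas as pd


def within_percent(values, reference, percent):
    """Boolean mask of values in [reference - m, reference + m),
    where m = |reference| * percent / 100."""
    margin = abs(reference) * percent / 100
    return (values >= reference - margin) & (values < reference + margin)


toy_train = pd.DataFrame({'training_hours': [40, 48, 50, 52, 90],
                          'target': [0, 1, 0, 1, 0]})
mask = within_percent(toy_train['training_hours'], reference=50, percent=5)
print(toy_train[mask])
```

Applying such a mask feature by feature, most important feature first, narrows the training set down to rows that resemble the hardest test cases.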
