# Practice assignment: Handling categorical data

In this programming assignment, you are going to work with the following dataset from Kaggle:

https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists


## Dataset description

Suppose that a company working with Big Data and Data Science wants to hire data scientists from among the people who have successfully passed courses conducted by the company. Many people sign up for the training. The company wants to focus on the candidates who really want to work for it after training. Information about demographics, education, and experience is provided by the candidates via a sign-up form.

This dataset is designed for understanding the factors that lead a person to leave their current job, which is useful for HR research. Based on the provided data, you are going to predict whether a candidate is looking for a job change.

The data is divided into train and test parts. It contains several categorical features, which need to be encoded.

## Feature description

- `enrollee_id`: Unique ID for a candidate
- `city`: City code
- `city_development_index`: Development index of the city (scaled)
- `gender`: Gender of a candidate
- `relevent_experience`: Relevant experience of a candidate
- `enrolled_university`: Type of University course enrolled in if any
- `education_level`: Education level of a candidate
- `major_discipline`: Education major discipline of a candidate
- `experience`: Candidate total experience in years
- `company_size`: Number of employees in a current employer's company
- `company_type`: Type of a current employer
- `last_new_job`: Difference in years between previous and current jobs
- `training_hours`: training hours completed
- `target`: 0 – Not looking for job change, 1 – Looking for a job change


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

from warnings import filterwarnings
filterwarnings('ignore')


In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

## 1

Before modifying the features, let's study our dataframe a bit. First, call `.dtypes` on the `train` dataframe.

**q1:** How many columns in `train` have non-numeric data types (anything other than ints and floats)?

In [4]:
cat_cols = train.select_dtypes(exclude=np.number).columns

print(len(cat_cols))

10


## 2

**q2:** How many unique categories does the feature `'city'` contain in the train data?

In [5]:
train.city.nunique()

123

## 3

**q3:** How many features in train data contain missing values (NaN)?

In [6]:
len([i for i in train.isna().sum() if i > 0])

8

## 4

Some features have a relatively small number of missing values. For instance, `'experience'` has only 65 missing values in the train data. Replacing these missing values with a special category might introduce bias into the data. However, since the number of missing values is small, we can accept that here.

Replace missing values in `'experience'` feature with a category -1.

**q4:** How many categories does this feature contain in the train data now?

_(hint: you might want to use `fillna` function from `pandas` library in this task, but remember that it's not an in-place function by default. Or you can use `SimpleImputer` from `sklearn`)_
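The point of the hint can be seen on a tiny example. This is only a sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the Kaggle file): `fillna` returns a new Series, and the column changes only after the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value (object dtype, like the raw CSV column).
toy = pd.DataFrame({"experience": pd.Series(["5", ">20", np.nan, "<1"], dtype=object)})

# fillna returns a new Series; the original column is untouched until it is reassigned.
filled = toy["experience"].fillna(-1)
n_missing_before = int(toy["experience"].isna().sum())

# Assign the result back to make the change stick.
toy["experience"] = filled
n_missing_after = int(toy["experience"].isna().sum())
n_categories = toy["experience"].nunique()
```

Here the imputed value -1 counts as one more category, so the toy column ends up with four distinct values.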

In [7]:
# fillna returns a new Series, so assign it back to the column
train['experience'] = train['experience'].fillna(-1)

train.experience.nunique()

23

## 5

`'education_level'` is an example of an ordinal feature: we can define an order on it. For instance, `'High School'` is "bigger" than `'Primary School'`, and `'Phd'` is "bigger" than `'Graduate'`. We can encode this feature with integers that respect this order.

In this task, apply a correct mapping for the values of `'education_level'` feature. The mapping should be the following:

* `'Primary School'` -> 0
* `'High School'` -> 1
* `'Graduate'` -> 2
* `'Masters'` -> 3
* `'Phd'` -> 4

At the same time, impute missing values in `'education_level'` with a new category -1. So another part of the mapping would be:

* `np.nan` -> -1

**q5:** What will be the mean value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest TWO decimal places (e.g. 12.3456789 -> 12.35).

_(hint: you might want to use `map` function from `pandas` in this task)_
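As a small illustration of the `map` hint, here is a sketch on a hypothetical four-row Series (the variable names `edu`, `order` and `encoded` are made up for this example). It maps the known levels first and then fills everything left unmapped with -1, which is one possible way to handle the `np.nan -> -1` part of the mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the level names follow the mapping given in the task.
edu = pd.Series(["Graduate", np.nan, "Phd", "Primary School"], dtype=object)

order = {"Primary School": 0, "High School": 1, "Graduate": 2, "Masters": 3, "Phd": 4}

# map() turns values that are absent from the dict (including NaN) into NaN,
# so a following fillna(-1) sends them to the extra category -1.
encoded = edu.map(order).fillna(-1).astype(int)
mean_value = encoded.mean()
```

Note that with this variant any unexpected spelling of a level would also end up as -1, so it is worth checking the set of unique values first.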

In [8]:
# mapping
map_education = {'Primary School': 0, 'High School': 1, 'Graduate': 2, 'Masters': 3, 'Phd': 4, np.nan: -1}

# call map
train['education_level'] = train['education_level'].map(map_education)

np.mean(train.education_level)


2.061384278108362

## 6

Feature `'relevent_experience'` is an example of a binary feature. You can also check that it has no missing values, which makes its encoding pretty easy.

Encode this feature with the following mapping:

*   `"No relevent experience"` -> 0
*   `"Has relevent experience"` -> 1

**q6:** What will be the mean value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest TWO decimal places (e.g. 12.3456789 -> 12.35).

In [9]:
map_relevent_experience = {'No relevent experience': 0, 'Has relevent experience': 1}

train['relevent_experience'] = train['relevent_experience'].map(map_relevent_experience)
np.mean(train.relevent_experience)

0.7199081323728991

## 7

In our case, `'gender'` is an example of a nominal feature (notice that it is not a binary feature because it contains three categories plus NaNs). We will use One-Hot encoding to encode it. There are several ways to do it in practice: for example, the `get_dummies` function from `pandas` or `OneHotEncoder` from `sklearn.preprocessing`. Here, we will go with the first option.

If you want to encode a whole dataset with One-Hot encoding, you can pass it directly to either of these tools. But here we want to encode only one feature. We suggest doing it in four steps:

1. Obtain a One-Hot encoding dataframe for the feature `'gender'`: apply `pd.get_dummies` to this feature. As a result, you should obtain a dataframe with three features (three different categories for gender). Don't include `'NaN'` feature - it will already be included as the encoding (0, 0, 0).
2. Change the column names of this dataframe, so that a feature `'<category_name>'` will become `'gender-<category_name>'`. Of course, it is not a necessary step in general. However, it probably would be more convenient to work with a full dataframe if we remember that these One-Hot encoded features originally came from the feature `'gender'`. On a technical side, you can perform it by changing `df_gender.columns` list.
3. Concatenate original and new dataframes. You can do it by calling `pd.concat` function. Don't forget to set `'axis'` parameter, so that you concatenate dataframes by columns, not by rows. As a result of this step, you should obtain a new `train` dataframe with three new columns named like `'gender-<category_name>'`.
4. Finally, drop `'gender'` feature because we have already encoded it.

As a result of these four steps, you should obtain a new version of `train` dataframe, with dropped `'gender'` feature and three new features named like `'gender-<category_name>'`. A total number of columns by this time should be equal to 16.

**q7:** What is the total number of zero values which appear in the columns starting with `'gender-'`?

_Example: suppose that you obtain the following dataframe:_

| gender-category1 | gender-category2 | gender-category3 |
|------|------|------|
| 0 | 1 | 0 |
| 0 | 0 | 0 |

_Then the answer for the question above should be 5._
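The four steps can be sketched on a toy dataframe as follows. The data and the names `toy`, `dummies` and `gender_cols` are hypothetical; only `pd.get_dummies`, `pd.concat` and `drop` are the actual pandas calls mentioned in the steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three categories and one missing value.
toy = pd.DataFrame({
    "gender": pd.Series(["Male", "Female", np.nan, "Other", "Male"], dtype=object),
    "training_hours": [10, 20, 30, 40, 50],
})

# Step 1: one-hot encode; by default NaN gets no column and becomes the all-zero row.
dummies = pd.get_dummies(toy["gender"]).astype(int)

# Step 2: rename '<category_name>' to 'gender-<category_name>'.
dummies.columns = [f"gender-{name}" for name in dummies.columns]

# Step 3: concatenate by columns (axis=1), not by rows.
toy = pd.concat([toy, dummies], axis=1)

# Step 4: drop the original feature.
toy = toy.drop(columns="gender")

# Count zero entries across all 'gender-' columns.
gender_cols = [c for c in toy.columns if c.startswith("gender-")]
n_zeros = int((toy[gender_cols] == 0).sum().sum())
```

On this toy data there are 15 cells in the three new columns and 4 of them are ones, so `n_zeros` is 11.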



In [10]:
# get dummies
df_gender = pd.get_dummies(train['gender'], prefix='gender').astype(np.int8)

# concat train and df_gender
train = pd.concat([train, df_gender], axis=1)

# drop original column
train = train.drop(columns='gender')

In [11]:
# count zero entries across the one-hot gender columns
gender_cols = ['gender_Male', 'gender_Female', 'gender_Other']

zeros_gender = int((train[gender_cols] == 0).sum().sum())
print(zeros_gender)

42824


## 8

Perform One-Hot encoding for the feature `'enrolled_university'`, using the same procedure as in the previous task:

1. Obtain a One-Hot encoding dataframe for `'enrolled_university'`.
2. Rename its columns.
3. Concatenate original and One-Hot encoding dataframes.
4. Drop `'enrolled_university'` column.

A total number of columns by this time should be equal to 18.

**q8:** What is the total number of zero values which appear in the columns starting with `'enrolled_university-'`?

In [12]:
# ohe on enrolled_university feature
df_enrolled_university = pd.get_dummies(train['enrolled_university'], prefix='enrolled_university').astype(np.int8)

# concat train and df_enrolled_university
train = pd.concat([train, df_enrolled_university], axis=1)

# drop original column
train = train.drop(columns='enrolled_university')

In [13]:
# count zero entries across the one-hot enrolled_university columns
uni_cols = [c for c in train.columns if c.startswith('enrolled_university')]

zeros_uni = int((train[uni_cols] == 0).sum().sum())
print(zeros_uni)

38702


## 9

Encode feature `'city'` using frequency encoding. You should map each category `'city_i'` to its count (the total number of observations in `train` with `city == 'city_i'`). Save this mapping, since you will later apply the same mapping to the test data.

As a result of this task, feature `'city'` should be encoded with category counts in `train`. 

**q9:** What will be the mean value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
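A minimal sketch of frequency encoding with a saved mapping is shown here on hypothetical toy frames (`train_toy`, `test_toy` and `city_counts` are illustrative names). Filling unseen test categories with 0 is an assumption of this sketch, not something the task prescribes.

```python
import pandas as pd

# Hypothetical toy data; 'city_9' never appears in the train part.
train_toy = pd.DataFrame({"city": ["city_1", "city_2", "city_1", "city_3", "city_1"]})
test_toy = pd.DataFrame({"city": ["city_2", "city_1", "city_9"]})

# Fit: the mapping is computed once, on the train data only, and kept for later.
city_counts = train_toy["city"].value_counts()

# Transform train and test with the SAME mapping.
train_toy["city"] = train_toy["city"].map(city_counts)
# Assumption: a category unseen in train is given the count 0.
test_toy["city"] = test_toy["city"].map(city_counts).fillna(0)
```

Keeping `city_counts` as a separate object is what makes it possible to encode the test data without recounting anything on it.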


In [14]:
def freq_encode_city(df):
    map_city = df.city.value_counts().to_dict()
    df['city'] = df['city'].map(map_city)
    return df['city']

# encode
train['city'] = freq_encode_city(train)

np.round(np.mean(train.city), 5)

1709.79643

## 10

Encode feature `'last_new_job'` with target encoding with no modifications. First, impute missing values in this feature with a new category `'-1'`. Then, map each category of `'last_new_job'` to the mean target value of the observations with the corresponding category. Save this mapping, since you will later apply the same mapping to the test data.

**q10:** What will be the maximum value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

_(hint: you might want to use `groupby` function from `pandas` in this task)_
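Plain target encoding with a saved mapping might look like the following sketch. The toy frames and the name `job_means` are hypothetical; the point is that the per-category means are computed on the train part only and then reused for the test part, which has no need for its own target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in the feature.
train_toy = pd.DataFrame({
    "last_new_job": pd.Series(["1", ">4", np.nan, "1", np.nan, ">4"], dtype=object),
    "target": [1, 0, 1, 0, 1, 0],
})
test_toy = pd.DataFrame({"last_new_job": pd.Series([">4", np.nan, "1"], dtype=object)})

# Impute missing values with the new category '-1'.
train_toy["last_new_job"] = train_toy["last_new_job"].fillna("-1")

# Fit: mean target per category, computed on train only and saved.
job_means = train_toy.groupby("last_new_job")["target"].mean()

# Transform train and test with the same saved mapping.
train_toy["last_new_job"] = train_toy["last_new_job"].map(job_means)
test_toy["last_new_job"] = test_toy["last_new_job"].fillna("-1").map(job_means)
```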

In [15]:
def encode_last_new_job(df):
    df['last_new_job'] = df['last_new_job'].fillna('-1')
    map_last_new_job = df.groupby('last_new_job')['target'].mean().to_dict()
    df['last_new_job'] = df['last_new_job'].map(map_last_new_job)
    return df['last_new_job']

# encode
train['last_new_job'] = encode_last_new_job(train)

np.round(np.max(train.last_new_job), 5)

0.36407

## 11

Encode feature `'experience'` with M-estimate encoding. Map each category of `'experience'` according to the following formula:

$$
\hat{x_{ij}} = \frac{\text{target}\left(j, x_{ij}\right) + m\times y_{\text{mean}}}{\text{count}\left(j, x_{ij}\right) + m}\quad,
$$

where

* $x_{ij}$ is a category being encoded,
* $\hat{x_{ij}}$ is its corresponding M-estimate encoding value,
* $\text{count}\left(j, x_{ij}\right)$ is the total number of times $x_{ij}$ appeared in `train`,
* $\text{target}\left(j, x_{ij}\right)$ is the sum of the target values of the observations with the corresponding category,
* $y_{\text{mean}}$ is the mean target value over all observations in `train`,
* $m$ is a parameter.

In this task, set $m = 0.5$. 

**q11:** What will be the maximum value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
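To make the formula concrete, here is a hand-rolled sketch on a hypothetical five-row dataset. It reads $\text{target}(j, x_{ij})$ as the per-category sum of target values, which is the reading under which the formula is a smoothed category mean; the names `toy`, `stats` and `m_estimate` are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: category A has targets 1, 0, 0 and category B has 1, 0.
toy = pd.DataFrame({
    "feature": ["A", "A", "B", "B", "A"],
    "target": [1, 0, 1, 0, 0],
})

m = 0.5
y_mean = toy["target"].mean()  # global prior: 2 / 5 = 0.4

# Per-category sum of targets and number of observations.
stats = toy.groupby("feature")["target"].agg(["sum", "count"])

# M-estimate: (sum + m * y_mean) / (count + m).
m_estimate = (stats["sum"] + m * y_mean) / (stats["count"] + m)

toy["feature_m_encoded"] = toy["feature"].map(m_estimate)
```

For category A this gives (1 + 0.2) / 3.5 and for B it gives (1 + 0.2) / 2.5; larger values of `m` pull both toward the global mean 0.4.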

In [16]:
from category_encoders import MEstimateEncoder

# map
map_experience = MEstimateEncoder(cols='experience', m=0.5)

# encode
train['experience'] = map_experience.fit_transform(train['experience'], train['target'])

# res
np.round(np.max(train.experience), 5)

0.45383

## 12

Encode feature `'major_discipline'` with Leave-One-Out encoding. Remember that this technique is similar to target encoding, but here while computing the encoding for a particular observation, we exclude it from the target encoding formula. Therefore a category $x_{ij}$ for the $i$-th observation will be encoded according to the following formula:

$$
\hat{x_{ij}} = \frac{\text{target}\left(j, x_{ij}\right) - y_i}{\text{count}\left(j, x_{ij}\right) - 1}\quad,
$$

where

* $\hat{x_{ij}}$ is its corresponding Leave-One-Out encoding value,
* $\text{count}\left(j, x_{ij}\right)$ is the total number of times $x_{ij}$ appeared in `train`,
* $\text{target}\left(j, x_{ij}\right)$ is the sum of the target values of the observations with the corresponding category,
* $y_i$ is the target value of the $i$-th observation.

For example, suppose that you have the following train data:

|feature|target|
|-|-|
|A|1|
|A|0|
|B|1|
|B|0|
|A|0|

Then you obtain the following Leave-One-Out encoding:

|feature|feature_loo_encoded|
|-|-|
|A|0.0|
|A|0.5|
|B|0.0|
|B|1.0|
|A|0.5|

It is very important to notice that in this method the same category can be encoded differently for different observations. Thus, after encoding the train part of data, you should create a mapping which will help to encode the test data. In order to do this, simply average train encoding values within each category to obtain the final encoding. For instance, suppose that you obtain the following train dataframe:

|feature|feature_loo_encoded|
|-|-|
|A|0.2|
|A|0.6|
|B|0.3|
|B|0.7|
|A|0.4|

Then, for the test data, you should obtain the following mapping:

* A -> 0.4 (because 0.4 is a mean value of the encoded values for the category A)
* B -> 0.5 (because 0.5 is a mean value of the encoded values for the category B)

Don't impute any missing values in this task. After completing this task, drop the feature `'major_discipline'` and rename `'major_discipline_loo_encoded'` to `'major_discipline'`.

**q12:** What will be the maximum value of the encoding values for `'major_discipline'` for the TEST data in the mapping described above? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
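The first worked example above can be reproduced by hand with a short sketch. It uses the per-category sum of targets in the numerator, which is what the example table implies; the names `toy`, `grp` and `test_mapping` are illustrative only.

```python
import pandas as pd

# The same five rows as in the first example table.
toy = pd.DataFrame({
    "feature": ["A", "A", "B", "B", "A"],
    "target": [1, 0, 1, 0, 0],
})

grp = toy.groupby("feature")["target"]
cat_sum = grp.transform("sum")      # sum of targets in the row's category
cat_count = grp.transform("count")  # size of the row's category

# Leave-One-Out: drop the row's own target from both the sum and the count.
toy["feature_loo_encoded"] = (cat_sum - toy["target"]) / (cat_count - 1)

# Mapping for the test data: average the train encodings within each category.
test_mapping = toy.groupby("feature")["feature_loo_encoded"].mean()
```

A category that occurs only once would lead to a division by zero in this sketch, so such categories need separate handling.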

In [17]:
from category_encoders import LeaveOneOutEncoder

map_major_discipline = LeaveOneOutEncoder(cols='major_discipline')

# encode
train['major_discipline_loo_encoded'] = map_major_discipline.fit_transform(train['major_discipline'], train['target'])

# drop original column, rename encoded column
train = train.drop(columns='major_discipline').rename(columns={'major_discipline_loo_encoded': 'major_discipline'})

# fit to test
np.round(np.max(map_major_discipline.transform(test['major_discipline'])), 5)

0.26772

## 13

Encode feature `'company_size'` with Catboost encoding. The technique was described in the lecture. Here, for the sake of simplicity, let's use the implementation from `category_encoders` library.

As you may remember, Catboost encoding depends on how the data is ordered. Normally, you should shuffle the data one time or even several times. In this task, we will assume that the data has already been shuffled, so you don't need to shuffle it again.

Take `CatBoostEncoder` and use the default values for all its parameters except `handle_missing`: set it to `'return_nan'` so that your encoder doesn't do anything with missing values. Fit it on `'company_size'` (train data) and `'target'`, and transform this feature. Save the encoder, since it will be used later to transform this feature in the test data.

Don't impute any missing values in this task.

**q13:** What will be the most frequent value (the mode) of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
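To see why the row order matters, here is a hand-written sketch of an ordered target statistic of the kind Catboost encoding is based on. The formula with a prior equal to the global target mean and a smoothing weight `a = 1` is an assumption of this sketch; it is meant to convey the idea and is not guaranteed to match the `category_encoders` implementation in every detail.

```python
import pandas as pd

# Hypothetical toy data, assumed to be already shuffled.
toy = pd.DataFrame({
    "feature": ["A", "A", "B", "B", "A"],
    "target": [1, 0, 1, 0, 0],
})

prior = toy["target"].mean()  # 0.4
a = 1.0                       # assumed smoothing weight

grp = toy.groupby("feature")["target"]
prev_sum = grp.cumsum() - toy["target"]  # targets of EARLIER rows in the same category
prev_count = grp.cumcount()              # number of earlier rows in the same category

# Each row is encoded using only the rows that come before it.
toy["feature_cb_encoded"] = (prev_sum + prior * a) / (prev_count + a)
```

The first occurrence of every category receives the prior itself, and the three rows of category A get three different values, so reordering the rows would change the encoding.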

In [18]:
from category_encoders.cat_boost import CatBoostEncoder
from scipy.stats import mode

# map to train
map_company_size = CatBoostEncoder(cols='company_size', handle_missing='return_nan')

# encode
train['company_size'] = map_company_size.fit_transform(train['company_size'], train['target'])

np.round(mode(train.company_size, nan_policy='omit')[0], 5)

0.24935

## 14

Encode feature `'company_type'` with Weight of Evidence (WoE) encoding. A category $x_{ij}$ will be encoded according to the following formula:

$$
\hat{x_{ij}} = \ln\left(\frac{\mathbb{P}\left(x_{ij}\mid y=1\right)}{\mathbb{P}\left(x_{ij}\mid y=0\right)}\right)\quad.
$$

Here:

$$
\mathbb{P}\left(x_{ij}\mid y=1\right) = \frac{\text{count}\left(y=1\mid x_{ij}\right)}{\text{count}\left(y=1\right)}
$$
$$
\mathbb{P}\left(x_{ij}\mid y=0\right) = \frac{\text{count}\left(y=0\mid x_{ij}\right)}{\text{count}\left(y=0\right)}
$$

The notation means the following:

* $\text{count}\left(y=1\mid x_{ij}\right)$ denotes the number of observations with the category $x_{ij}$ where the target value is equal to $1$;
* $\text{count}\left(y=0\mid x_{ij}\right)$ denotes the same but for the target value $0$;
* $\text{count}\left(y=1\right)$ denotes the number of observations with the target value equal to $1$;
* $\text{count}\left(y=0\right)$ denotes the same but for the target value $0$.


For example, suppose that you have the following train data:

|feature|target|
|-|-|
|A|1|
|A|0|
|B|1|
|B|0|
|A|0|

Then you obtain the following WoE encoding mapping:

* A -> $\ln\left(\frac{\frac{1}{2}}{\frac{2}{3}}\right) \approx -0.288$ 
* B -> $\ln\left(\frac{\frac{1}{2}}{\frac{1}{3}}\right) \approx 0.405$ 

Don't impute any missing values in this task.

**q14:** What will be the most frequent value (the mode) of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
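The WoE example above can be checked with a few lines of pandas. This is a sketch without any regularization, and the names `counts`, `dist` and `woe` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# The same five rows as in the example table.
toy = pd.DataFrame({
    "feature": ["A", "A", "B", "B", "A"],
    "target": [1, 0, 1, 0, 0],
})

# Rows: categories, columns: target values 0 and 1, cells: observation counts.
counts = pd.crosstab(toy["feature"], toy["target"])

# Divide each column by its total to get P(x | y=0) and P(x | y=1).
dist = counts / counts.sum(axis=0)

# Weight of Evidence: log of the ratio of the two conditional probabilities.
woe = np.log(dist[1] / dist[0])
toy["feature_woe_encoded"] = toy["feature"].map(woe)
```

A category with no observations in one of the two classes would produce an infinite value here, which is the situation that regularization in library implementations is meant to avoid.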

In [19]:
from category_encoders import WOEEncoder
map_company_type = WOEEncoder(cols='company_type', regularization=0)

train['company_type'] = map_company_type.fit_transform(train['company_type'], train['target'])

# most frequent encoded value (the mode), as asked in q14
np.round(train['company_type'].mode()[0], 5)

-0.40878

## 15

We have encoded all categorical features. Next, we drop `'enrollee_id'` because it is not an informative feature. After this, we split the train part of the data into a dataframe which contains only features (without the target) and the target array.

Before training the models, we should impute the remaining missing values. You might have noticed that we didn't impute missing values for the features `'major_discipline'`, `'company_size'` and `'company_type'`. This is because the number of missing values in these features is relatively large (you can check it yourself). In practice, you might just create a special category (like `'-1'`) for each of these features before the encoding. However, in this task, let's perform the imputation using the KNN approach, where we impute missing values by looking at similar observations.

Import `KNNImputer` from `sklearn`. It works only with numeric data, which is why we didn't run it before the encoding. Set `n_neighbors=3` and leave the other parameters at their default values. Fit it on the train dataframe with features, and then transform it. Notice that the transformation returns a `numpy.array`; make a `pandas.DataFrame` out of it with the same columns as before.

Save the KNN imputer - you will need it to process the test data. Check that there are no missing values in the train data anymore.

**q15:** What is the mean value of the `'company_size'` feature in the train data after the imputation? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
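A tiny sketch of what `KNNImputer` does is given below on hypothetical data. It uses `n_neighbors=2` instead of 3 only because the toy frame has four rows; the names `X_toy` and `X_toy_imp` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: the value of 'b' is missing in the second row.
X_toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 10.0],
    "b": [2.0, np.nan, 6.0, 20.0],
})

imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a numpy array, so rebuild the dataframe
# with the original column names and index.
X_toy_imp = pd.DataFrame(
    imputer.fit_transform(X_toy), columns=X_toy.columns, index=X_toy.index
)
```

The two rows closest to the incomplete one along `a` are the first and the third, so the missing `b` becomes the average of 2 and 6. The fitted `imputer` can later be applied to new data with `imputer.transform(...)`.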

In [20]:
X_train = train.drop(['enrollee_id', 'target'], axis=1)
y_train = train['target']

X_train.shape, y_train.shape

((19158, 16), (19158,))

In [21]:
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=3)

# apply imputer
X_train_imp = knn_imputer.fit_transform(X_train)

# make pandas dataframe
X_train = pd.DataFrame(X_train_imp, columns=X_train.columns)

np.round(np.mean(X_train.company_size), 5)

0.18226

In [22]:
assert X_train.isna().sum().sum() == 0

## 16

Finally, train a Random Forest classifier from `sklearn` on the train data. Set `n_estimators=500`, `max_depth=8`, `random_state=13`, and let other parameters have the default values. Check feature importances. 

**q16:** What is the name of the most important feature for this model? Provide the name of the feature.

In [23]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=500, max_depth=8, random_state=13)
rf.fit(X_train, y_train)

y_pred_train = rf.predict(X_train)

# score
train_score = accuracy_score(y_train, y_pred_train)

# get feature importances and zip it with feature names
feature_importances = sorted(zip(rf.feature_importances_, X_train.columns), reverse=True)
feature_importances


[(0.7232224747048347, 'major_discipline'),
 (0.09259636011790308, 'city_development_index'),
 (0.071106090906331, 'education_level'),
 (0.0338978512331609, 'company_type'),
 (0.03003712000207121, 'city'),
 (0.011585915861990839, 'experience'),
 (0.007888260376452148, 'last_new_job'),
 (0.0063619633978227895, 'company_size'),
 (0.005568817828513686, 'training_hours'),
 (0.005442621002178778, 'enrolled_university_Full time course'),
 (0.005133107213602752, 'relevent_experience'),
 (0.004732604057988538, 'enrolled_university_no_enrollment'),
 (0.0010335703963148064, 'gender_Male'),
 (0.0006705666869188496, 'gender_Female'),
 (0.0004487958150345902, 'enrolled_university_Part time course'),
 (0.000273880398881501, 'gender_Other')]

## 17

In this last task, process the test data so that it is possible to make Random Forest predictions for it. Perform the same operations as for the train data, but remember that you now work with test observations, so all operations are in inference mode: reuse the mappings and encoders fitted on the train data.

1. (task 4) Feature `'experience'`: impute missing values by -1. 
2. (task 5) Feature `'education_level'`: perform ordinal encoding mapping.
3. (task 6) Feature `'relevent_experience'`: perform binary mapping.
4. (task 7) Feature `'gender'`: perform One-Hot encoding and obtain three new features starting with `'gender-'`. Drop `'gender'` feature.
5. (task 8) Feature `'enrolled_university'`: perform One-Hot encoding and obtain three new features starting with `'enrolled_university-'`. Drop `'enrolled_university'` feature.
6. (task 9) Feature `'city'`: perform frequency encoding mapping.
7. (task 10) Feature `'last_new_job'`: impute missing values by -1 and perform target encoding mapping.
8. (task 11) Feature `'experience'`: perform M-estimate encoding mapping.
9. (task 12) Feature `'major_discipline'`: perform Leave-One-Out encoding mapping.
10. (task 13) Feature `'company_size'`: perform Catboost encoding mapping.
11. (task 14) Feature `'company_type'`: perform WoE encoding mapping.
12. (task 15) Split `test` into `X_test` (a dataframe with no `'enrollee_id'` and `'target'`) and `y_test` (an array with target values). Impute missing values by using KNN imputer which you used before (now only in a transform mode).

As a result of the operations described above, you should obtain `X_test` which is a `pandas.DataFrame` with a shape (2129, 16). Calculate the predictions of the trained Random Forest model on it. Check the accuracy of the predictions on test data.

Then calculate the predictions of the same model on `X_train` and check the accuracy there. Compare the accuracies on train and test data. Do you notice something? What, in your opinion, caused such a difference in the accuracies?

**q17:** As a result of this task, provide the accuracy for the test data, rounded to the nearest TWO decimal places (e.g. 12.3456789 -> 12.35).
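One detail of inference mode that is easy to miss is that One-Hot encoding the test data on its own may yield fewer (or differently ordered) columns than on train. The sketch below, on hypothetical toy frames, shows one way to align them with `reindex`; the names `train_ohe` and `test_ohe` are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: the test part happens to contain a single category.
train_toy = pd.DataFrame({"gender": ["Male", "Female", "Other"]})
test_toy = pd.DataFrame({"gender": ["Male", "Male"]})

train_ohe = pd.get_dummies(train_toy["gender"], prefix="gender").astype(int)
test_ohe = pd.get_dummies(test_toy["gender"], prefix="gender").astype(int)

# Align the test columns to the train columns; categories absent from
# the test data become all-zero columns.
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
```

After this step the test frame has exactly the columns the model was trained on, in the same order.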

In [24]:
# experience
test['experience'] = test['experience'].fillna(-1)

# education_level
test['education_level'] = test['education_level'].map(map_education)

# relevant experience
test['relevent_experience'] = test['relevent_experience'].map(map_relevent_experience)

# gender
df_gender = pd.get_dummies(test['gender'], prefix='gender').astype(np.int8)
test = pd.concat([test, df_gender], axis=1)
test = test.drop(columns='gender')

# enrolled_university
df_enrolled_university = pd.get_dummies(test['enrolled_university'], prefix='enrolled_university').astype(np.int8)
test = pd.concat([test, df_enrolled_university], axis=1)
test = test.drop(columns='enrolled_university')

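# city: frequency encoding
# (this call recounts the frequencies on the test rows; task 9 asks for the mapping saved from train)
test['city'] = freq_encode_city(test)

# last_new_job: impute '-1' and apply target encoding (the imputation happens inside the function)
# (this call recomputes the category means from the test target; task 10 asks for the mapping saved from train)
test['last_new_job'] = encode_last_new_job(test)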

# experience
test['experience'] = map_experience.transform(test['experience'])

# major_discipline
test['major_discipline'] = map_major_discipline.transform(test['major_discipline'])

# company_size
test['company_size'] = map_company_size.transform(test['company_size'])

# company_type
test['company_type'] = map_company_type.transform(test['company_type'])

# split into X_test and y_test
X_test = test.drop(['enrollee_id', 'target'], axis=1)
y_test = test['target']

# missing values
X_test = X_test[X_train.columns]
X_test_imp = knn_imputer.transform(X_test)

# to pd
X_test = pd.DataFrame(X_test_imp, columns=X_test.columns)

assert X_test.shape == (2129, 16)
assert X_test.isna().sum().sum() == 0

In [25]:
y_pred = rf.predict(X_test)

# compare scores
test_score = accuracy_score(y_test, y_pred)

print(f'Train score: {train_score:.2f}')
print(f'Test score: {test_score:.2f}')


Train score: 0.99
Test score: 0.73
