### Step 1: Import and Encode the Data

Import data using Pandas.

In [1]:
import pandas as pd
df = pd.read_csv("aug_train.csv")

Then, encode the ordinal features with dictionary mappings, which turn categorical values into numbers (the model accepts only numerical input).

In [3]:
# making dictionaries of ordinal features

relevent_experience_map = {
    'Has relevent experience':  1,
    'No relevent experience':    0
}

experience_map = {
    '<1'      :    0,
    '1'       :    1, 
    '2'       :    2, 
    '3'       :    3, 
    '4'       :    4, 
    '5'       :    5,
    '6'       :    6,
    '7'       :    7,
    '8'       :    8, 
    '9'       :    9, 
    '10'      :    10, 
    '11'      :    11,
    '12'      :    12,
    '13'      :    13, 
    '14'      :    14, 
    '15'      :    15, 
    '16'      :    16,
    '17'      :    17,
    '18'      :    18,
    '19'      :    19, 
    '20'      :    20, 
    '>20'     :    21
} 

last_new_job_map = {
    'never'        :    0,
    '1'            :    1, 
    '2'            :    2, 
    '3'            :    3, 
    '4'            :    4, 
    '>4'           :    5
}

In [5]:
# apply the ordinal mappings defined above to the dataframe
def encode(df_pre):
    df_pre['relevent_experience'] = df_pre['relevent_experience'].map(relevent_experience_map)
    df_pre['last_new_job'] = df_pre['last_new_job'].map(last_new_job_map)
    df_pre['experience'] = df_pre['experience'].map(experience_map)
  
    return df_pre

df = encode(df)
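
To see what `.map` does with a dictionary, here is a minimal sketch on a made-up column (the `toy` series and its `'unknown'` value are illustrative assumptions, not part of the dataset). Values that are missing from the dictionary come back as `NaN`, so they are left for the imputation step later on.

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.Series(['never', '1', '>4', 'unknown'])
toy_map = {'never': 0, '1': 1, '>4': 5}

# Keys found in the dictionary are replaced; anything else becomes NaN
encoded = toy.map(toy_map)
print(encoded.tolist())
```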

### Step 2: Define Sets of Columns to be Transformed in Different Ways

In [6]:
num_cols = ['city_development_index','relevent_experience', 'experience','last_new_job', 'training_hours']
cat_cols = ['gender', 'enrolled_university', 'education_level', 'major_discipline', 'company_size', 'company_type']

### Step 3: Create Pipelines for Numerical and Categorical Features

The syntax of the pipeline is: `Pipeline(steps=[('step name', transform function), …])`

For numerical features, perform the following actions:

1. **SimpleImputer**: Fill in the missing values with the mean of that column.
2. **MinMaxScaler**: Scale the values to range from 0 to 1 (feature scale changes how a regularized logistic regression weighs each feature, so it can affect model performance).

For categorical features, perform the following actions:

1. **SimpleImputer**: Fill in the missing values with the most frequent value of that column.
2. **OneHotEncoder**: Split into multiple numerical columns for model training. (`handle_unknown='ignore'` is specified to prevent errors when it finds an unseen category in the test set).
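
The effect of `handle_unknown='ignore'` can be sketched on a made-up category column (the values below are assumptions for illustration). A category that was not seen during `fit` is encoded as a row of zeros instead of raising an error. The sketch calls `.toarray()` on the default sparse output rather than relying on `sparse_output=False`, so it does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two hypothetical categories
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([['Male'], ['Female']]))

# A known category gets a single 1; an unseen one gets all zeros
seen = enc.transform(np.array([['Female']])).toarray()
unseen = enc.transform(np.array([['Other']])).toarray()
print(enc.categories_[0], seen, unseen)
```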

In [8]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale',MinMaxScaler())
])
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot',OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
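
As a quick sanity check of the numerical pipeline, here is a minimal sketch on a made-up single-column array (the data is an assumption for illustration). The missing value is replaced by the column mean (2.0), and the column is then scaled to the 0 to 1 range.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column with one missing value
toy_num = np.array([[1.0], [np.nan], [3.0]])

toy_num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),  # NaN -> mean of 1 and 3
    ('scale', MinMaxScaler())                    # rescale to [0, 1]
])

toy_out = toy_num_pipeline.fit_transform(toy_num)
print(toy_out.ravel())  # [0.  0.5 1. ]
```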

### Step 4: Create ColumnTransformer to Apply the Pipeline for Each Column Set

The syntax of the ColumnTransformer is: `ColumnTransformer(transformers=[('step name', transform function, cols), …])`

Pass numerical columns through the numerical pipeline and pass categorical columns through the categorical pipeline created in step 3.

`remainder='drop'` is specified to ignore the other columns in the dataframe.

`n_jobs=-1` means that all processors are used to run the transformers in parallel.

In [9]:
from sklearn.compose import ColumnTransformer

col_trans = ColumnTransformer(transformers=[
    ('num_pipeline',num_pipeline,num_cols),
    ('cat_pipeline',cat_pipeline,cat_cols)
    ],
    remainder='drop',
    n_jobs=-1)
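
The sketch below shows, on a made-up three-row dataframe, how a ColumnTransformer routes columns and what `remainder='drop'` does. The column names and the `id` column are assumptions for illustration, and `sparse_threshold=0` is added only to force a dense array in this toy case.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical dataframe: one numerical, one categorical, one unused column
toy_df = pd.DataFrame({
    'hours': [10, 20, 30],
    'gender': ['Male', 'Female', 'Male'],
    'id': [101, 102, 103],
})

toy_trans = ColumnTransformer(transformers=[
    ('num', MinMaxScaler(), ['hours']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender'])
    ],
    remainder='drop',      # 'id' is not listed, so it is dropped
    sparse_threshold=0)    # always return a dense array

toy_out = toy_trans.fit_transform(toy_df)
print(toy_out.shape)  # (3, 3): 1 scaled column + 2 one-hot columns
```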

### Step 5: Add a Model to the Final Pipeline

This example uses a logistic regression model.

Create a new pipeline that combines the ColumnTransformer from step 4 with the logistic regression model. A pipeline is used here because the entire dataframe must pass through the ColumnTransformer step and then the modeling step, in that order.

In [10]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)
clf_pipeline = Pipeline(steps=[
    ('col_trans', col_trans),
    ('model', clf)
])

### Step 6: Display the Pipeline

The syntax for this is `display(pipeline_name)`:

In [11]:
from sklearn import set_config

set_config(display='diagram')
display(clf_pipeline)

### Step 7: Split the Data into Train and Test Sets

Split 20% of the data into a test set like this:

In [12]:
from sklearn.model_selection import train_test_split

X = df[num_cols+cat_cols]
y = df['target']
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

Fit the pipeline for the train set and use that fitted pipeline for the test set to prevent data leakage from the test set to the model.
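
The split above passes `stratify=y`. A small sketch with made-up labels shows what that does: the class ratio is the same in the train and test sets. The toy arrays are assumptions for illustration, and `random_state=0` is added here only for reproducibility (the original split does not set it).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75% class 0, 25% class 1
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

# Share of class 1 in each part
print(y_tr.mean(), y_te.mean())  # 0.25 0.25
```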

### Step 8: Pass Data through the Pipeline

Here's the syntax for this: pipeline_name.fit, pipeline_name.predict, pipeline_name.score

- `pipeline.fit`: Passes data through a pipeline and fits the model.
- `pipeline.predict`: Uses the model trained during `pipeline.fit` to predict new data.
- `pipeline.score`: Gets a score of the model in the pipeline (accuracy of logistic regression in this example).

In [13]:
clf_pipeline.fit(X_train, y_train)
# preds = clf_pipeline.predict(X_test)
score = clf_pipeline.score(X_test, y_test)
print(f"Model score: {score}") # model accuracy

Model score: 0.7625260960334029
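
To make the relationship between `predict` and `score` concrete, here is a minimal sketch on synthetic data (the data from `make_classification` and the two-step `toy_pipe` are assumptions, not the article's dataset or pipeline). For a classifier, `score` is the accuracy of the pipeline's own predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

toy_pipe = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', LogisticRegression())
])
toy_pipe.fit(X_tr, y_tr)          # fits the scaler, then the model

preds = toy_pipe.predict(X_te)    # scaler.transform, then model.predict
pipe_score = toy_pipe.score(X_te, y_te)
manual_score = accuracy_score(y_te, preds)
print(pipe_score, manual_score)   # the two values match
```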


### Step 9: Save the Pipeline

The syntax for this is `joblib.dump`.

Use the joblib library to save the pipeline for later use, so you don’t need to create and fit the pipeline again. When you want to use a saved pipeline, just load the file using joblib.load like this:

In [14]:
import joblib

# save pipeline to file "pipe.joblib"
joblib.dump(clf_pipeline,"pipe.joblib")

# load pipeline when you want to use
same_pipe = joblib.load("pipe.joblib")
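
A quick way to convince yourself that the saved pipeline behaves like the original is a round trip on toy data. This is a sketch under assumptions: the tiny dataset, the `toy_pipe` pipeline, and the temporary file name are all made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data and pipeline
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_pipe = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', LogisticRegression())
]).fit(X_toy, y_toy)

# Save to a temporary file, load it back, and compare predictions
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipe.joblib')
    joblib.dump(toy_pipe, path)
    restored_pipe = joblib.load(path)

same_predictions = np.array_equal(
    toy_pipe.predict(X_toy), restored_pipe.predict(X_toy))
print(same_predictions)  # True
```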

### How to Find the Best Hyperparameter Sets: Add a Pipeline to Grid Search

Grid Search is a method you can use to perform hyperparameter tuning. It helps you find the optimum parameter sets that yield the highest model accuracy.

#### Set the tuning parameters and their range.

Create a dictionary of tuning parameters (hyperparameters): `{'tuning parameter': [possible values], …}`

In this example, find the best penalty type and C of a logistic regression model.

In [16]:
import numpy as np
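# Note: in the run shown below, every fit with penalty 'none' fails; recent
# scikit-learn versions expect None (not the string 'none') for no penalty.
grid_params = {'model__penalty' : ['none', 'l2'],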
               'model__C' : np.logspace(-4, 4, 20)}
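
The keys in the dictionary follow the pattern `<step name>__<parameter name>`, with a double underscore. A minimal sketch, assuming a two-step toy pipeline (not the article's `clf_pipeline`), shows how to list the valid names and set one of them directly:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy_pipe = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', LogisticRegression())
])

# Every key returned by get_params() can be used in a grid search dictionary
param_names = sorted(toy_pipe.get_params().keys())
print([name for name in param_names if name.startswith('model__')][:5])

# 'model__C' addresses the C parameter of the step named 'model'
toy_pipe.set_params(model__C=0.1)
print(toy_pipe.named_steps['model'].C)  # 0.1
```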

#### Add the pipeline to Grid Search

GridSearchCV(model, tuning parameter, …)

Our pipeline has a model step as the final step, so we can input the pipeline directly to the GridSearchCV function.

In [17]:
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(clf_pipeline, grid_params, cv=5, scoring='accuracy')
gs.fit(X_train, y_train)

print("Best Score of train set: "+str(gs.best_score_))
print("Best parameter set: "+str(gs.best_params_))
print("Test Score: "+str(gs.score(X_test,y_test)))

Best Score of train set: 0.7657574470937899
Best parameter set: {'model__C': 0.23357214690901212, 'model__penalty': 'l2'}
Test Score: 0.7627870563674322


100 fits failed out of a total of 200.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
100 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\base.py",

After setting up Grid Search, you can fit Grid Search with the data and see the results. Let's see what the code is doing:

- `.fit`: Fits the model and tries all sets of parameters in the tuning parameter dictionary.
- `.best_score_`: The highest mean cross-validated accuracy across all sets of parameters (computed on the training data passed to `.fit`).
- `.best_params_`: The set of parameters that yield the best score.
- `.score(X_test, y_test)`: The score when trying the best model with the test set.
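
The attributes above can be tried out on a small synthetic problem. This is a sketch under assumptions: the data comes from `make_classification`, and the pipeline and the three-value `C` grid are made up to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

toy_pipe = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', LogisticRegression())
])
toy_gs = GridSearchCV(toy_pipe, {'model__C': [0.01, 1.0, 100.0]},
                      cv=3, scoring='accuracy')
toy_gs.fit(X_tr, y_tr)

# best_estimator_ is the pipeline refit on all training data with best_params_;
# toy_gs.score(...) evaluates that refit pipeline
refit_score = toy_gs.best_estimator_.score(X_te, y_te)
print(toy_gs.best_params_, toy_gs.best_score_, toy_gs.score(X_te, y_te))
```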

### How to Adjust the Current Pipeline a Little

I want to know which scaling method works best for my data: MinMaxScaler or StandardScaler. To find out, I add a StandardScaler step to the numerical pipeline, after the MinMaxScaler step. The rest doesn't change.

In [18]:
from sklearn.preprocessing import StandardScaler

num_pipeline2 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('minmax_scale', MinMaxScaler()),
    ('std_scale', StandardScaler()),
])

col_trans2 = ColumnTransformer(transformers=[
    ('num_pipeline',num_pipeline2,num_cols),
    ('cat_pipeline',cat_pipeline,cat_cols)
    ],
    remainder='drop',
    n_jobs=-1)
    
clf_pipeline2 = Pipeline(steps=[
    ('col_trans', col_trans2),
    ('model', clf)
])

### How to Perform Grid Search

In the grid search parameters, name the step to skip and set its value to `'passthrough'`. Since MinMaxScaler and StandardScaler should not be applied together, use a list of dictionaries for the grid search parameters, e.g., [{case 1}, {case 2}].

With a list of dictionaries, grid search tries every combination of the parameters in case 1, and then every combination of the parameters in case 2. So there is no case where MinMaxScaler and StandardScaler are used together.

In [19]:
grid_step_params = [{'col_trans__num_pipeline__minmax_scale': ['passthrough']},
                    {'col_trans__num_pipeline__std_scale': ['passthrough']}]
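
To check what `'passthrough'` does to a step, here is a minimal sketch on a made-up column (the data and the two-step `two_scalers` pipeline are assumptions for illustration). Once `minmax_scale` is set to `'passthrough'`, the pipeline output matches a plain StandardScaler.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_toy = np.array([[1.0], [2.0], [4.0]])

two_scalers = Pipeline(steps=[
    ('minmax_scale', MinMaxScaler()),
    ('std_scale', StandardScaler())
])

# Skip the MinMaxScaler step; data flows through it unchanged
two_scalers.set_params(minmax_scale='passthrough')
pipeline_out = two_scalers.fit_transform(X_toy)

# Reference: StandardScaler applied on its own
standard_only = StandardScaler().fit_transform(X_toy)
print(np.allclose(pipeline_out, standard_only))  # True
```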

Perform Grid Search and print the results (like a normal grid search).

In [20]:
gs2 = GridSearchCV(clf_pipeline2, grid_step_params, scoring='accuracy')
gs2.fit(X_train, y_train)

print("Best Score of train set: "+str(gs2.best_score_))
print("Best parameter set: "+str(gs2.best_params_))
print("Test Score: "+str(gs2.score(X_test,y_test)))

Best Score of train set: 0.7650395379944643
Best parameter set: {'col_trans__num_pipeline__minmax_scale': 'passthrough'}
Test Score: 0.7627870563674322


The best case is `minmax_scale: 'passthrough'`, which skips MinMaxScaler, so StandardScaler is the best scaling method for this data.

### How to Find the Best Hyperparameter Sets and the Best Data Preparation Method

Find the best hyperparameter sets and the best data preparation method by adding tuning parameters to the dictionary of each case of the data preparation method.

In [24]:
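# As before, the fits with penalty 'none' fail in the run shown below;
# recent scikit-learn versions expect None instead of the string 'none'.
grid_params = {'model__penalty' : ['none', 'l2'],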
               'model__C' : np.logspace(-4, 4, 20)}
               
grid_step_params2 = [{**{'col_trans__num_pipeline__minmax_scale': ['passthrough']}, **grid_params},
                    {**{'col_trans__num_pipeline__std_scale': ['passthrough']}, **grid_params}]

`grid_params` will be added to both case 1 (skip MinMaxScaler) and case 2 (skip StandardScaler).

Merge dictionaries using this syntax: `merge_dict = {**dict_1, **dict_2}`
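
The number of candidates that the merged grid produces can be counted without fitting anything. The sketch below uses `ParameterGrid` from scikit-learn for that; the variable names `toy_grid_params` and `toy_cases` are made up, and the values are copied from the grid above purely for counting.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Values copied from the article's grid; only the count matters here
toy_grid_params = {'model__penalty': ['none', 'l2'],
                   'model__C': np.logspace(-4, 4, 20)}

toy_cases = [
    {**{'col_trans__num_pipeline__minmax_scale': ['passthrough']}, **toy_grid_params},
    {**{'col_trans__num_pipeline__std_scale': ['passthrough']}, **toy_grid_params},
]

# Each dictionary expands on its own: 2 cases x 2 penalties x 20 values of C
n_candidates = len(ParameterGrid(toy_cases))
print(n_candidates)  # 80
```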

In [25]:
# perform Grid Search and print the results (like a normal grid search)
gs3 = GridSearchCV(clf_pipeline2, grid_step_params2, scoring='accuracy')
gs3.fit(X_train, y_train)

print("Best Score of train set: "+str(gs3.best_score_))
print("Best parameter set: "+str(gs3.best_params_))
print("Test Score: "+str(gs3.score(X_test,y_test)))

Best Score of train set: 0.7669317430876349
Best parameter set: {'col_trans__num_pipeline__minmax_scale': 'passthrough', 'model__C': 0.004832930238571752, 'model__penalty': 'l2'}
Test Score: 0.7640918580375783


200 fits failed out of a total of 400.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
200 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\base.py",

Find the best parameter set using `.best_params_`. Since it contains `minmax_scale: 'passthrough'`, MinMaxScaler is skipped, so StandardScaler is the best scaling method for this data.

Show all grid search cases using .cv_results_:

In [26]:
pd.DataFrame(gs3.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_col_trans__num_pipeline__minmax_scale,param_model__C,param_model__penalty,param_col_trans__num_pipeline__std_scale,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.034625,0.006329,0.000000,0.000000,passthrough,0.0001,none,,{'col_trans__num_pipeline__minmax_scale': 'pas...,,,,,,,,41
1,0.035888,0.002077,0.011861,0.001807,passthrough,0.0001,l2,,{'col_trans__num_pipeline__minmax_scale': 'pas...,0.750489,0.750734,0.750734,0.750734,0.750408,0.750620,0.000142,35
2,0.032155,0.002674,0.000000,0.000000,passthrough,0.000264,none,,{'col_trans__num_pipeline__minmax_scale': 'pas...,,,,,,,,41
3,0.035982,0.002943,0.011009,0.004184,passthrough,0.000264,l2,,{'col_trans__num_pipeline__minmax_scale': 'pas...,0.750489,0.750734,0.750734,0.750734,0.750408,0.750620,0.000142,35
4,0.033266,0.000643,0.000000,0.000000,passthrough,0.000695,none,,{'col_trans__num_pipeline__minmax_scale': 'pas...,,,,,,,,41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,0.050299,0.001175,0.011707,0.000385,,1438.449888,l2,passthrough,{'col_trans__num_pipeline__std_scale': 'passth...,0.768102,0.767700,0.752692,0.762480,0.773573,0.764909,0.007046,10
76,0.032805,0.002020,0.000000,0.000000,,3792.690191,none,passthrough,{'col_trans__num_pipeline__std_scale': 'passth...,,,,,,,,41
77,0.047058,0.005294,0.010679,0.004314,,3792.690191,l2,passthrough,{'col_trans__num_pipeline__std_scale': 'passth...,0.768102,0.767700,0.752692,0.762480,0.773573,0.764909,0.007046,10
78,0.033474,0.003093,0.000000,0.000000,,10000.0,none,passthrough,{'col_trans__num_pipeline__std_scale': 'passth...,,,,,,,,41


There are 80 cases in this example. The table reports both the running time and the accuracy of each case, which matters because sometimes the fastest model with acceptable accuracy is a better choice than the most accurate one.
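
When comparing speed and accuracy, it often helps to keep only a few columns of `cv_results_` and sort them by rank. A minimal sketch on synthetic data follows; the data, the toy pipeline, and the three-value grid are assumptions, while the column names are the standard ones that `GridSearchCV` produces.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data and a small grid to keep the run short
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
toy_pipe = Pipeline(steps=[
    ('scale', MinMaxScaler()),
    ('model', LogisticRegression())
])
toy_gs = GridSearchCV(toy_pipe, {'model__C': [0.01, 1.0, 100.0]},
                      cv=3, scoring='accuracy')
toy_gs.fit(X_toy, y_toy)

# Keep the columns needed to weigh speed against accuracy, best rank first
results = pd.DataFrame(toy_gs.cv_results_)
summary = (results[['params', 'mean_fit_time', 'mean_test_score', 'rank_test_score']]
           .sort_values('rank_test_score')
           .reset_index(drop=True))
print(summary)
```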

### How to Add a Custom Transformation

Apart from standard data transformation functions such as MinMaxScaler from sklearn, you can create your own data transformation.

This example creates a class that encodes an ordinal feature, using a mapping to transform its categories into numbers. In simple words, we'll change data from text to numbers.

First we'll do the required data processing before regression model training.

In [29]:
from sklearn.base import TransformerMixin

class Encode(TransformerMixin):
    
    def __init__(self):
        # Making Dictionaries of ordinal features
        self.rel_exp_map = {
            'Has relevent experience': 1,
            'No relevent experience': 0}
            
    def fit(self, df, y = None):
        # nothing to learn from the data, so just return the transformer
        return self
        
    def transform(self, df, y = None):
        # work on a copy so the caller's dataframe is left untouched
        df_pre = df.copy()
        df_pre['relevent_experience'] = df_pre['relevent_experience'].map(self.rel_exp_map)
        return df_pre

Here's an explanation of what's going on in this code:

- Create a class named `Encode` which inherits from the base class `TransformerMixin` from `sklearn`.
- Inside the class, there are 3 necessary methods: `__init__`, `fit`, and `transform`.
  - `__init__` is called when the `Encode()` object is created for the pipeline. It is where we define variables inside the class. I created a variable `rel_exp_map`, which is a dictionary that maps categories to numbers.
  - `fit` is called when fitting the pipeline. It has nothing to learn in this case, so it only returns `self`.
  - `transform` is called when the pipeline transforms data. This method requires a dataframe (`df`) as an input, while the `y` argument defaults to `None`.
- In `transform`, the dataframe column `relevent_experience` (the column name used in Step 1) is mapped with `rel_exp_map`.

Add this Encode class as a pipeline step.

In [30]:
pipeline = Pipeline(steps=[
    ('Encode', Encode()),
    ('col_trans', col_trans),
    ('model', LogisticRegression())
])

Then you can fit, transform, or grid search the pipeline like a normal pipeline.
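
Before adding a custom transformer to a larger pipeline, it can be tested on its own. The sketch below is a hypothetical variant named `ToyEncode`, run on a made-up two-row dataframe. One assumption differs from the class above: it also inherits from `BaseEstimator`, which supplies `get_params` and `set_params`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ToyEncode(TransformerMixin, BaseEstimator):
    """Hypothetical variant of the Encode class, for illustration only."""

    def __init__(self):
        self.rel_exp_map = {
            'Has relevent experience': 1,
            'No relevent experience': 0}

    def fit(self, df, y=None):
        # nothing to learn from the data
        return self

    def transform(self, df, y=None):
        df_pre = df.copy()
        df_pre['relevent_experience'] = df_pre['relevent_experience'].map(self.rel_exp_map)
        return df_pre

# Made-up input using the column name from Step 1
toy_df = pd.DataFrame({
    'relevent_experience': ['Has relevent experience', 'No relevent experience']
})

encoded_df = ToyEncode().fit_transform(toy_df)
print(encoded_df['relevent_experience'].tolist())  # [1, 0]
```

The input dataframe keeps its original text values because `transform` works on a copy.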

### How to Find the Best Machine Learning Model

Create a custom transformation that receives a model as an input and performs grid search to find the best model.