### Step 1: Import and Encode the Data

Import data using Pandas.

In [1]:
import pandas as pd
df = pd.read_csv("aug_train.csv")

Then, encode the ordinal feature using mapping to transform categorical features into numerical features (since the model takes only numerical input).

In [3]:
# making dictionaries of ordinal features

relevent_experience_map = {
    'Has relevent experience':  1,
    'No relevent experience':    0
}

experience_map = {
    '<1'      :    0,
    '1'       :    1, 
    '2'       :    2, 
    '3'       :    3, 
    '4'       :    4, 
    '5'       :    5,
    '6'       :    6,
    '7'       :    7,
    '8'       :    8, 
    '9'       :    9, 
    '10'      :    10, 
    '11'      :    11,
    '12'      :    12,
    '13'      :    13, 
    '14'      :    14, 
    '15'      :    15, 
    '16'      :    16,
    '17'      :    17,
    '18'      :    18,
    '19'      :    19, 
    '20'      :    20, 
    '>20'     :    21
} 

last_new_job_map = {
    'never'        :    0,
    '1'            :    1, 
    '2'            :    2, 
    '3'            :    3, 
    '4'            :    4, 
    '>4'           :    5
}

In [5]:
# making dictionaries of ordinal features
def encode(df_pre):
    df_pre['relevent_experience'] = df_pre['relevent_experience'].map(relevent_experience_map)
    df_pre['last_new_job'] = df_pre['last_new_job'].map(last_new_job_map)
    df_pre['experience'] = df_pre['experience'].map(experience_map)
  
    return df_pre

df = encode(df)

### Step 2: Define Sets of Columns to be Transformed in Different Ways

In [6]:
num_cols = ['city_development_index','relevent_experience', 'experience','last_new_job', 'training_hours']
cat_cols = ['gender', 'enrolled_university', 'education_level', 'major_discipline', 'company_size', 'company_type']

### Step 3: Create Pipelines for Numerical and Categorical Features

The syntax of the pipeline is: Pipeline(steps = [(‘step name’, transform function), …])

For numerical features, I perform the following actions:

1. **SimpleImputer**: Fill in the missing values with the mean of that column.
2. **MinMaxScaler**: Scale the values to range from 0 to 1 (this will affect regression performance).

For categorical features, perform the following actions:

1. **SimpleImputer**: Fill in the missing values with the most frequent value of that column.
2. **OneHotEncoder**: Split into multiple numerical columns for model training. (`handle_unknown='ignore'` is specified to prevent errors when it finds an unseen category in the test set).

In [8]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale',MinMaxScaler())
])
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot',OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

### Step 4: Create ColumnTransformer to Apply the Pipeline for Each Column Set

The syntax of the ColumnTransformer is: ColumnTransformer(transformers=[(‘step name’, transform function,cols), …])

Pass numerical columns through the numerical pipeline and pass categorical columns through the categorical pipeline created in step 3.

remainder=’drop’ is specified to ignore other columns in a dataframe.

n_job = -1 means that we'll be using all processors to run in parallel.

In [9]:
from sklearn.compose import ColumnTransformer

col_trans = ColumnTransformer(transformers=[
    ('num_pipeline',num_pipeline,num_cols),
    ('cat_pipeline',cat_pipeline,cat_cols)
    ],
    remainder='drop',
    n_jobs=-1)

### Step 5: Add a Model to the Final Pipeline

Using the logistic regression model in this example.

Create a new pipeline to commingle the ColumnTransformer in step 4 with the logistic regression model. Using a pipeline in this case because the entire dataframe must pass the ColumnTransformer step and modeling step, respectively.

In [10]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)
clf_pipeline = Pipeline(steps=[
    ('col_trans', col_trans),
    ('model', clf)
])

### Step 6: Display the Pipeline

The syntax for this is display(pipeline name):

In [11]:
from sklearn import set_config

set_config(display='diagram')
display(clf_pipeline)

### Step 7: Split the Data into Train and Test Sets

Split 20% of the data into a test set like this:

In [12]:
from sklearn.model_selection import train_test_split

X = df[num_cols+cat_cols]
y = df['target']
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

Fit the pipeline for the train set and use that fitted pipeline for the test set to prevent data leakage from the test set to the model.

### Step 8: Pass Data through the Pipeline

Here's the syntax for this: pipeline_name.fit, pipeline_name.predict, pipeline_name.score

- `pipeline.fit`: Passes data through a pipeline and fits the model.
- `pipeline.predict`: Uses the model trained during `pipeline.fit` to predict new data.
- `pipeline.score`: Gets a score of the model in the pipeline (accuracy of logistic regression in this example).

In [13]:
clf_pipeline.fit(X_train, y_train)
# preds = clf_pipeline.predict(X_test)
score = clf_pipeline.score(X_test, y_test)
print(f"Model score: {score}") # model accuracy

Model score: 0.7625260960334029


### Step 9: Save the Pipeline

The syntax for this is `joblib.dumb`.

Use the joblib library to save the pipeline for later use, so you don’t need to create and fit the pipeline again. When you want to use a saved pipeline, just load the file using joblib.load like this:

In [14]:
import joblib

# save pipeline to file "pipe.joblib"
joblib.dump(clf_pipeline,"pipe.joblib")

# load pipeline when you want to use
same_pipe = joblib.load("pipe.joblib")

### How to Find the Best Hyperparameter Sets: Add a Pipeline to Grid Search

Grid Search is a method you can use to perform hyperparameter tuning. It helps you find the optimum parameter sets that yield the highest model accuracy.

#### Set the tuning parameters and their range.

Create a dictionary of tuning parameters (hyperparameters): { ‘tuning parameter’ : ‘possible value’, … }

In this example, find the best penalty type and C of a logistic regression model.

In [16]:
import numpy as np
grid_params = {'model__penalty' : ['none', 'l2'],
               'model__C' : np.logspace(-4, 4, 20)}

#### Add the pipeline to Grid Search

GridSearchCV(model, tuning parameter, …)

Our pipeline has a model step as the final step, so we can input the pipeline directly to the GridSearchCV function.

In [17]:
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(clf_pipeline, grid_params, cv=5, scoring='accuracy')
gs.fit(X_train, y_train)

print("Best Score of train set: "+str(gs.best_score_))
print("Best parameter set: "+str(gs.best_params_))
print("Test Score: "+str(gs.score(X_test,y_test)))

Best Score of train set: 0.7657574470937899
Best parameter set: {'model__C': 0.23357214690901212, 'model__penalty': 'l2'}
Test Score: 0.7627870563674322


100 fits failed out of a total of 200.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
100 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "c:\Users\16476\anaconda3\Lib\site-packages\sklearn\base.py",

After setting up Grid Search, you can fit Grid Search with the data and see the results. Let's see what the code is doing:

- `.fit`: Fits the model and tries all sets of parameters in the tuning parameter dictionary.
- `.best_score_`: The highest accuracy across all sets of parameters.
- `.best_params_`: The set of parameters that yield the best score.
- `.score(X_test, y_test)`: The score when trying the best model with the test set.

### How to adjust the current pipeline a little

I want to know which scaling method will work best for my data between MinMaxScaler and StandardScaler. I add a step StandardScaler in the num_pipeline. The rest doesn't change.

In [18]:
from sklearn.preprocessing import StandardScaler

num_pipeline2 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('minmax_scale', MinMaxScaler()),
    ('std_scale', StandardScaler()),
])

col_trans2 = ColumnTransformer(transformers=[
    ('num_pipeline',num_pipeline2,num_cols),
    ('cat_pipeline',cat_pipeline,cat_cols)
    ],
    remainder='drop',
    n_jobs=-1)
    
clf_pipeline2 = Pipeline(steps=[
    ('col_trans', col_trans2),
    ('model', clf)
])