<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold">
Logistic Regression using Julia (ScikitLearn.jl):</p>
<p style="font-family: Arial; font-size:2.25em;color:green; font-style:bold">
Kumar Rahul</p><br>

ScikitLearn.jl implements the popular scikit-learn interface and algorithms in Julia. It supports both models from the Julia ecosystem and those of the scikit-learn library (via PyCall.jl).

* More at: https://cstjean.github.io/ScikitLearn.jl/dev/man/python/
* Examples at: https://github.com/cstjean/ScikitLearn.jl/blob/master/docs/src/man/examples.md

### We will be using HR data in this exercise. Refer to Exhibit 1 to understand the feature list. Use the HR data to answer the questions below.

1.	Load the dataset in Jupyter Notebook using pandas
2.	Build a correlation matrix between all the numeric features in the dataset.
3.	Build a new feature named LOB_Hike_Offered using LOB and percentage hike offered. Include this as a part of the data frame created in step 1. What assumption are you trying to test with such variables?
4.	Create a new data frame with the numeric features and categorical features as dummy variable coded features. Which features will you include for model building and why?
5.	Split the data into training set and test set. Use 80% of data for model training and 20% for model testing. 
6.	Build a model using Gender and Age as independent variable and Status as dependent variable.
    * Are Gender and Age significant features in this model?
    * What inferences can be drawn from this model? 
7.	Build a model with statsmodels.api to predict the probability of Not Joining. How do you interpret the model outcome? Report the model performance on the test set.
8.	Build a model with statsmodels.formula.api to predict the probability of Not Joining and report the model performance on the test set. What differences do you observe between the model built here and the one built in step 7?
9.	Build a model using the sklearn package to predict the probability of Not Joining. What differences do you observe in this model compared to the models built in steps 7 and 8?
10.	Fine-tune the cut-off value using cost of misclassification as a strategy. The cut-off should help classify the maximum number of Not Joining cases correctly.
11.	Fine-tune the cut-off value using Youden's index as a strategy. The cut-off should help balance the classification of Joined and Not Joined cases.
12.	Apply the cut-off values obtained in step 10 and step 11 on the test set. What inference can be deduced from it?
13. Build a model using gradient descent to get an intuition about the inner workings of optimization algorithms.
14. Build a model using gradient descent with regularization to get an intuition about the inner workings of optimization algorithms.

**PS: Not all the questions are being answered as a part of the same notebook. You are encouraged to answer the questions if you find them missing.**

**Exhibit 1**


|Sl.No.|Name of Variable|Variable Description|
|:-------|----------------|:--------------------|
|1	|Candidate reference number|	Unique number to identify the candidate|
|2	|DOJ extended|Binary variable identifying whether candidate asked for date of joining extension (Yes/No)|
|3	|Duration to accept the offer|	Number of days taken by the candidate to accept the offer (continuous variable)|
|4	|Notice period|	Notice period to be served in the parting company before candidate can join this company (continuous variable)|
|5	|Offered band|	Band offered to the candidate based on experience and performance in interview rounds (categorical variable labelled C0/C1/C2/C3/C4/C5/C6)|
|6	|Percentage hike (CTC) expected|	Percentage hike expected by the candidate (continuous variable)|
|7	|Percentage hike offered (CTC)| Percentage hike offered by the company (continuous variable)|
|8	|Percent difference CTC|	Percentage difference between offered and expected CTC (continuous variable)|
|9	|Joining bonus|	Binary variable indicating if joining bonus was given or not (Yes/No)|
|10	|Gender|	Gender of the candidate (Male/Female)|
|11	|Candidate source|	Source from which resume of the candidate was obtained (categorical variables with categories  Employee referral/Agency/Direct)|
|12	|REX (in years)|	Relevant years of experience of the candidate for the position offered (continuous variable)|
|13	|LOB|	Line of business for which offer was rolled out (categorical variable)|
|14	|DOB|	Date of birth of the candidate|
|15	|Joining location|	Company location for which offer was rolled out for candidate to join (categorical variable)|
|16	|Candidate relocation status|	Binary variable indicating whether candidate has to relocate from one city to another city for joining (Yes/No)|
|17 |HR status|	Final joining status of candidate (Joined/Not-Joined)|

***

# Code starts here

We will use the libraries below for **data import, processing and visualization**. As we progress, we will use other specific libraries for model building and evaluation. 

In [1]:
using Pkg
using CSV
using DataFrames
using Statistics
using FreqTables
using StatsBase
using Gadfly
using Printf
using MLJ ##Machine Learning Julia, schema() from this package.
using ScikitLearn ##Machine Learning using SciKitLearn
using JLD ##To save model object
using PyCallJLD #to save model object

In [2]:
ENV["COLUMNS"] = 1000
ENV["LINES"] = 30
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code/Julia_logistic/Code"


## Data Import and Manipulation

### 1. Importing a data set

_Give the correct path to the data_



Modify the `ast_node_interactivity` option (an IPython setting, so it applies to Python kernels) to see the value of multiple statements at once.

In [3]:
raw_df = CSV.read( "../HR_case/data/IMB533_HR_Data_No_Missing_Value.csv", DataFrame, 
                    delim = ",", header =1,
                    normalizenames=true,
                    missingstrings = ["", " "]
                    )
head(raw_df)

Unnamed: 0_level_0,SLNO,Candidate_Ref,DOJ_Extended,Duration_to_accept_offer,Notice_period,Offered_band,Pecent_hike_expected_in_CTC,Percent_hike_offered_in_CTC,Percent_difference_CTC,Joining_Bonus,Candidate_relocate_actual,Gender,Candidate_Source,Rex_in_Yrs,LOB,Location,Age,Status
Unnamed: 0_level_1,Int64,Int64,String,Int64,Int64,String,Float64,Float64,Float64,String,String,String,String,Int64,String,String,Int64,String
1,1,2110407,Yes,14,30,E2,-20.79,13.16,42.86,No,No,Female,Agency,7,ERS,Noida,34,Joined
2,2,2112635,No,18,30,E2,50.0,320.0,180.0,No,No,Male,Employee Referral,8,INFRA,Chennai,34,Joined
3,3,2112838,No,3,45,E2,42.84,42.84,0.0,No,No,Male,Agency,4,INFRA,Noida,27,Joined
4,4,2115021,No,26,30,E2,42.84,42.84,0.0,No,No,Male,Employee Referral,4,INFRA,Noida,34,Joined
5,5,2115125,Yes,1,120,E2,42.59,42.59,0.0,No,Yes,Male,Employee Referral,6,INFRA,Noida,34,Joined
6,6,2117167,Yes,17,30,E1,42.83,42.83,0.0,No,No,Male,Employee Referral,2,INFRA,Noida,34,Joined


In [4]:
rename!(raw_df, lowercase.(names(raw_df)));

In [5]:
head(raw_df)

Unnamed: 0_level_0,slno,candidate_ref,doj_extended,duration_to_accept_offer,notice_period,offered_band,pecent_hike_expected_in_ctc,percent_hike_offered_in_ctc,percent_difference_ctc,joining_bonus,candidate_relocate_actual,gender,candidate_source,rex_in_yrs,lob,location,age,status
Unnamed: 0_level_1,Int64,Int64,String,Int64,Int64,String,Float64,Float64,Float64,String,String,String,String,Int64,String,String,Int64,String
1,1,2110407,Yes,14,30,E2,-20.79,13.16,42.86,No,No,Female,Agency,7,ERS,Noida,34,Joined
2,2,2112635,No,18,30,E2,50.0,320.0,180.0,No,No,Male,Employee Referral,8,INFRA,Chennai,34,Joined
3,3,2112838,No,3,45,E2,42.84,42.84,0.0,No,No,Male,Agency,4,INFRA,Noida,27,Joined
4,4,2115021,No,26,30,E2,42.84,42.84,0.0,No,No,Male,Employee Referral,4,INFRA,Noida,34,Joined
5,5,2115125,Yes,1,120,E2,42.59,42.59,0.0,No,Yes,Male,Employee Referral,6,INFRA,Noida,34,Joined
6,6,2117167,Yes,17,30,E1,42.83,42.83,0.0,No,No,Male,Employee Referral,2,INFRA,Noida,34,Joined


In [6]:
#?pd.read_csv

Dropping SLNo and Candidate.Ref as these will not be used for any analysis or model building.

In [7]:
drop_feature = [ "slno", "candidate_ref" ]

2-element Array{String,1}:
 "slno"
 "candidate_ref"

In [8]:
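# ⊆ checks that every listed column exists; ∈ would test whether the whole vector is a single element
drop_feature ⊆ names(raw_df)

true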

In [9]:
select!(raw_df, Not(["slno","candidate_ref"]))

Unnamed: 0_level_0,doj_extended,duration_to_accept_offer,notice_period,offered_band,pecent_hike_expected_in_ctc,percent_hike_offered_in_ctc,percent_difference_ctc,joining_bonus,candidate_relocate_actual,gender,candidate_source,rex_in_yrs,lob,location,age,status
Unnamed: 0_level_1,String,Int64,Int64,String,Float64,Float64,Float64,String,String,String,String,Int64,String,String,Int64,String
1,Yes,14,30,E2,-20.79,13.16,42.86,No,No,Female,Agency,7,ERS,Noida,34,Joined
2,No,18,30,E2,50.0,320.0,180.0,No,No,Male,Employee Referral,8,INFRA,Chennai,34,Joined
3,No,3,45,E2,42.84,42.84,0.0,No,No,Male,Agency,4,INFRA,Noida,27,Joined
4,No,26,30,E2,42.84,42.84,0.0,No,No,Male,Employee Referral,4,INFRA,Noida,34,Joined
5,Yes,1,120,E2,42.59,42.59,0.0,No,Yes,Male,Employee Referral,6,INFRA,Noida,34,Joined
6,Yes,17,30,E1,42.83,42.83,0.0,No,No,Male,Employee Referral,2,INFRA,Noida,34,Joined
7,Yes,37,30,E2,31.58,31.58,0.0,No,No,Male,Employee Referral,7,INFRA,Noida,32,Joined
8,Yes,16,0,E1,-20.0,-20.0,0.0,No,No,Female,Direct,8,Healthcare,Noida,34,Joined
9,No,1,30,E1,-22.22,-22.22,0.0,No,No,Female,Employee Referral,3,BFSI,Gurgaon,26,Joined
10,No,6,30,E1,240.0,220.0,-5.88,No,No,Male,Employee Referral,3,CSMP,Chennai,34,Joined



### 2. Structure of the dataset



In [10]:
Dict(names(raw_df) .=> eltype.(eachcol(raw_df)))

Dict{String,DataType} with 16 entries:
  "rex_in_yrs"                  => Int64
  "joining_bonus"               => String
  "lob"                         => String
  "age"                         => Int64
  "percent_difference_ctc"      => Float64
  "notice_period"               => Int64
  "location"                    => String
  "status"                      => String
  "pecent_hike_expected_in_ctc" => Float64
  "duration_to_accept_offer"    => Int64
  "percent_hike_offered_in_ctc" => Float64
  "doj_extended"                => String
  "candidate_relocate_actual"   => String
  "gender"                      => String
  "offered_band"                => String
  "candidate_source"            => String

In [11]:
schema(raw_df)

┌─────────────────────────────┬─────────┬────────────┐
│ _.names                     │ _.types │ _.scitypes │
├─────────────────────────────┼─────────┼────────────┤
│ doj_extended                │ String  │ Textual    │
│ duration_to_accept_offer    │ Int64   │ Count      │
│ notice_period               │ Int64   │ Count      │
│ offered_band                │ String  │ Textual    │
│ pecent_hike_expected_in_ctc │ Float64 │ Continuous │
│ percent_hike_offered_in_ctc │ Float64 │ Continuous │
│ percent_difference_ctc      │ Float64 │ Continuous │
│ joining_bonus               │ String  │ Textual    │
│ candidate_relocate_actual   │ String  │ Textual    │
│ gender                      │ String  │ Textual    │
│ candidate_source            │ String  │ Textual    │
│ rex_in_yrs                  │ Int64   │ Count      │
│ lob                         │ String  │ Textual    │
│ location                    │ String  │ Textual    │
│ age                         │ Int64 

In [12]:
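# eltypes() and single-argument df[...] indexing are deprecated; use eltype.(eachcol(df)) instead
numerical_features = names(raw_df)[(<:).(eltype.(eachcol(raw_df)), Union{Number,Missing})]
categorical_features = names(raw_df)[(<:).(eltype.(eachcol(raw_df)), Union{String,Missing})]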

9-element Array{String,1}:
 "doj_extended"
 "offered_band"
 "joining_bonus"
 "candidate_relocate_actual"
 "gender"
 "candidate_source"
 "lob"
 "location"
 "status"

In [13]:
describe(raw_df, :all, cols=numerical_features)

Unnamed: 0_level_0,variable,mean,std,min,q25,median,q75,max,nunique,nmissing,first,last,eltype
Unnamed: 0_level_1,Symbol,Float64,Float64,Real,Float64,Float64,Float64,Real,Nothing,Nothing,Real,Real,DataType
1,duration_to_accept_offer,21.4345,25.8116,0.0,3.0,10.0,33.0,224.0,,,14.0,2.0,Int64
2,notice_period,39.2918,22.2202,0.0,30.0,30.0,60.0,120.0,,,30.0,0.0,Int64
3,pecent_hike_expected_in_ctc,43.8648,29.789,-68.83,27.27,40.0,53.85,359.77,,,-20.79,45.25,Float64
4,percent_hike_offered_in_ctc,40.6574,36.0641,-60.53,22.09,36.0,50.0,471.43,,,13.16,14.09,Float64
5,percent_difference_ctc,-1.5738,19.6107,-67.27,-8.33,0.0,0.0,300.0,,,42.86,-21.45,Float64
6,rex_in_yrs,4.23902,2.54757,0.0,3.0,4.0,6.0,24.0,,,7.0,1.0,Int64
7,age,29.9132,4.09791,20.0,27.0,29.0,34.0,60.0,,,34.0,34.0,Int64


In [14]:
describe(raw_df, :all, cols=categorical_features)

Unnamed: 0_level_0,variable,mean,std,min,q25,median,q75,max,nunique,nmissing,first,last,eltype
Unnamed: 0_level_1,Symbol,Nothing,Nothing,String,Nothing,Nothing,Nothing,String,Int64,Nothing,String,String,DataType
1,doj_extended,,,No,,,,Yes,2,,Yes,No,String
2,offered_band,,,E0,,,,E3,4,,E2,E1,String
3,joining_bonus,,,No,,,,Yes,2,,No,No,String
4,candidate_relocate_actual,,,No,,,,Yes,2,,No,No,String
5,gender,,,Female,,,,Male,2,,Female,Female,String
6,candidate_source,,,Agency,,,,Employee Referral,3,,Agency,Employee Referral,String
7,lob,,,AXON,,,,MMS,9,,ERS,INFRA,String
8,location,,,Ahmedabad,,,,Pune,11,,Noida,Chennai,String
9,status,,,Joined,,,,Not Joined,2,,Joined,Joined,String


### 2. Summarizing the dataset
Create a new data frame and store the raw data copy. This is being done to have a copy of the raw data intact for further manipulation if needed.

In [15]:
filter_df = copy(dropmissing(raw_df))
schema(filter_df)

┌─────────────────────────────┬─────────┬────────────┐
│ _.names                     │ _.types │ _.scitypes │
├─────────────────────────────┼─────────┼────────────┤
│ doj_extended                │ String  │ Textual    │
│ duration_to_accept_offer    │ Int64   │ Count      │
│ notice_period               │ Int64   │ Count      │
│ offered_band                │ String  │ Textual    │
│ pecent_hike_expected_in_ctc │ Float64 │ Continuous │
│ percent_hike_offered_in_ctc │ Float64 │ Continuous │
│ percent_difference_ctc      │ Float64 │ Continuous │
│ joining_bonus               │ String  │ Textual    │
│ candidate_relocate_actual   │ String  │ Textual    │
│ gender                      │ String  │ Textual    │
│ candidate_source            │ String  │ Textual    │
│ rex_in_yrs                  │ Int64   │ Count      │
│ lob                         │ String  │ Textual    │
│ location                    │ String  │ Textual    │
│ age                         │ Int64 

We will first start by printing the unique labels in categorical features

In [16]:
for f in categorical_features
    unq = unique(filter_df[:, f]) ## Set(filter_df[:, (f)]) also works.
    val_cnt = StatsBase.countmap(filter_df[:, (f)])
    @printf("\nThe unique labels in %s is %s \n", f, unq)
    @printf("\nThe unique labels in %s is %s \n", f, val_cnt)
end


The unique labels in doj_extended is ["Yes", "No"] 

The unique labels in doj_extended is Dict("Yes" => 4207,"No" => 4788) 

The unique labels in offered_band is ["E2", "E1", "E3", "E0"] 

The unique labels in offered_band is Dict("E0" => 211,"E3" => 505,"E2" => 2711,"E1" => 5568) 

The unique labels in joining_bonus is ["No", "Yes"] 

The unique labels in joining_bonus is Dict("Yes" => 417,"No" => 8578) 

The unique labels in candidate_relocate_actual is ["No", "Yes"] 

The unique labels in candidate_relocate_actual is Dict("Yes" => 1290,"No" => 7705) 

The unique labels in gender is ["Female", "Male"] 

The unique labels in gender is Dict("Female" => 1551,"Male" => 7444) 

The unique labels in candidate_source is ["Agency", "Employee Referral", "Direct"] 

The unique labels in candidate_source is Dict("Agency" => 2585,"Direct" => 4801,"Employee Referral" => 1609) 

The unique labels in lob is ["ERS", "INFRA", "Healthcare", "BFSI", "CSMP", "ETS", "AXON", "EAS", "MMS"] 

The unique la

Looking at the feature **line of business**, it seems that *EAS, Healthcare and MMS* do not have enough observations and may be clubbed together.

In [17]:
filter_df[(filter_df.lob .=="EAS") .| 
          (filter_df.lob .=="Healthcare") .|
          (filter_df.lob .=="MMS"),"lob"] .= "Others";

In [18]:
countmap(filter_df.lob)

Dict{String,Int64} with 7 entries:
  "Others" => 485
  "BFSI"   => 1396
  "CSMP"   => 579
  "AXON"   => 568
  "INFRA"  => 2850
  "ETS"    => 691
  "ERS"    => 2426

We will use the **groupby** function of DataFrames.jl to get deeper insights into the behaviour of people **Joining** or **Not Joining** the company. We will write a generic function to report the mean by any categorical variable.

In [19]:
##Write your code



In [20]:
## Call the function group_by() defined above

#group_by("doj_extended")
#group_by("status")
#group_by("location")

### 3. Visualizing the Data using Gadfly

Write a custom function to create bar plot to visualize the average of numeric features w.r.t each categorical feature. Say, average number of days to accept the offer w.r.t status as joined vs. not joined.

In [21]:
## Write your code here

## Model Building

### Dummy Variable coding

Remove the response variable from the dataset


In [22]:
removed_features = ["status","pecent_hike_expected_in_ctc",
                    "percent_hike_offered_in_ctc","candidate_relocate_actual"]

4-element Array{String,1}:
 "status"
 "pecent_hike_expected_in_ctc"
 "percent_hike_offered_in_ctc"
 "candidate_relocate_actual"

In [23]:
X_features = [x for x ∈ names(filter_df) if x ∉ removed_features]

12-element Array{String,1}:
 "doj_extended"
 "duration_to_accept_offer"
 "notice_period"
 "offered_band"
 "percent_difference_ctc"
 "joining_bonus"
 "gender"
 "candidate_source"
 "rex_in_yrs"
 "lob"
 "location"
 "age"

In [24]:
# Select feature names by column element type (avoids the deprecated eltypes())
X_numeric = [x for x in X_features if eltype(filter_df[!, x]) <: Union{Number,Missing}];
X_categoric = [x for x in X_features if eltype(filter_df[!, x]) <: Union{String,Missing}];

### MLJ Package

MLJ (Machine Learning in Julia) is a toolbox written in Julia providing a common interface and meta-algorithms for selecting, tuning, evaluating, composing and comparing over 150 machine learning models written in Julia and other languages. In particular MLJ wraps a large number of scikit-learn models.

https://alan-turing-institute.github.io/MLJ.jl/dev/list_of_supported_models/

In [25]:
localmodels("logistic")

NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple[]

In [26]:
schema(filter_df)

┌─────────────────────────────┬─────────┬────────────┐
│ _.names                     │ _.types │ _.scitypes │
├─────────────────────────────┼─────────┼────────────┤
│ doj_extended                │ String  │ Textual    │
│ duration_to_accept_offer    │ Int64   │ Count      │
│ notice_period               │ Int64   │ Count      │
│ offered_band                │ String  │ Textual    │
│ pecent_hike_expected_in_ctc │ Float64 │ Continuous │
│ percent_hike_offered_in_ctc │ Float64 │ Continuous │
│ percent_difference_ctc      │ Float64 │ Continuous │
│ joining_bonus               │ String  │ Textual    │
│ candidate_relocate_actual   │ String  │ Textual    │
│ gender                      │ String  │ Textual    │
│ candidate_source            │ String  │ Textual    │
│ rex_in_yrs                  │ Int64   │ Count      │
│ lob                         │ String  │ Textual    │
│ location                    │ String  │ Textual    │
│ age                         │ Int64 

Since we want to coerce all Textual columns to Multiclass, we can write:

In [27]:
head(coerce!(filter_df, Textual => Multiclass))

Unnamed: 0_level_0,doj_extended,duration_to_accept_offer,notice_period,offered_band,pecent_hike_expected_in_ctc,percent_hike_offered_in_ctc,percent_difference_ctc,joining_bonus,candidate_relocate_actual,gender,candidate_source,rex_in_yrs,lob,location,age,status
Unnamed: 0_level_1,Cat…,Int64,Int64,Cat…,Float64,Float64,Float64,Cat…,Cat…,Cat…,Cat…,Int64,Cat…,Cat…,Int64,Cat…
1,Yes,14,30,E2,-20.79,13.16,42.86,No,No,Female,Agency,7,ERS,Noida,34,Joined
2,No,18,30,E2,50.0,320.0,180.0,No,No,Male,Employee Referral,8,INFRA,Chennai,34,Joined
3,No,3,45,E2,42.84,42.84,0.0,No,No,Male,Agency,4,INFRA,Noida,27,Joined
4,No,26,30,E2,42.84,42.84,0.0,No,No,Male,Employee Referral,4,INFRA,Noida,34,Joined
5,Yes,1,120,E2,42.59,42.59,0.0,No,Yes,Male,Employee Referral,6,INFRA,Noida,34,Joined
6,Yes,17,30,E1,42.83,42.83,0.0,No,No,Male,Employee Referral,2,INFRA,Noida,34,Joined


Calling `transform` on a fitted machine is analogous to calling `predict`. In MLJ:
* For a supervised problem, we call `predict`
* For an unsupervised problem (such as an encoder), we call `transform`

The function 'machine()' binds a model (i.e., a choice of algorithm + hyperparameters) to data. A machine is also the object storing learned parameters. Under the hood, calling fit! on a machine calls either MLJBase.fit or MLJBase.update, depending on the machine's internal state (as recorded in private fields old_model and old_rows). 

In [28]:
X_ohe = machine(MLJ.OneHotEncoder(drop_last=true), filter_df[:,X_features])
MLJ.fit!(X_ohe)
encoded_X_df = MLJ.transform(X_ohe, filter_df[:,X_features]);

┌ Info: Training Machine{OneHotEncoder,…} @688.
└ @ MLJBase /Users/Rahul/.julia/packages/MLJBase/hLtde/src/machines.jl:342
┌ Info: Spawning 1 sub-features to one-hot encode feature :doj_extended.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 3 sub-features to one-hot encode feature :offered_band.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 1 sub-features to one-hot encode feature :joining_bonus.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 1 sub-features to one-hot encode feature :gender.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 2 sub-features to one-hot encode feature :candidate_source.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142
┌ Info: Spawning 6 sub-features to one

In [30]:
Y_ohe = machine(MLJ.OneHotEncoder(drop_last=true), filter_df[:,["status"]])
MLJ.fit!(Y_ohe)
encoded_Y_df = MLJ.transform(Y_ohe, filter_df[:,["status"]]);

┌ Info: Training Machine{OneHotEncoder,…} @600.
└ @ MLJBase /Users/Rahul/.julia/packages/MLJBase/hLtde/src/machines.jl:342
┌ Info: Spawning 1 sub-features to one-hot encode feature :status.
└ @ MLJModels /Users/Rahul/.julia/packages/MLJModels/E8BbE/src/builtins/Transformers.jl:1142


In [33]:
head(encoded_X_df)

Unnamed: 0_level_0,doj_extended__No,duration_to_accept_offer,notice_period,offered_band__E0,offered_band__E1,offered_band__E2,percent_difference_ctc,joining_bonus__No,gender__Female,candidate_source__Agency,candidate_source__Direct,rex_in_yrs,lob__AXON,lob__BFSI,lob__CSMP,lob__ERS,lob__ETS,lob__INFRA,location__Ahmedabad,location__Bangalore,location__Chennai,location__Cochin,location__Gurgaon,location__Hyderabad,location__Kolkata,location__Mumbai,location__Noida,location__Others,age
Unnamed: 0_level_1,Float64,Int64,Int64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Int64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Int64
1,0.0,14,30,0.0,0.0,1.0,42.86,1.0,1.0,1.0,0.0,7,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,34
2,1.0,18,30,0.0,0.0,1.0,180.0,1.0,0.0,0.0,0.0,8,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,34
3,1.0,3,45,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,27
4,1.0,26,30,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,34
5,0.0,1,120,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,6,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,34
6,0.0,17,30,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,34
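
The encoded frame keeps one indicator column fewer than the number of levels in each categorical feature: `gender` yields only `gender__Female`, and `offered_band` yields `E0`, `E1` and `E2` but not `E3`. The plain-Python sketch below illustrates that drop-last scheme on a toy column. The helper name `one_hot_drop_last` is hypothetical, and sorting the levels alphabetically is an assumption made for illustration, not a description of MLJ's internals.

```python
def one_hot_drop_last(values, prefix):
    """One-hot encode a list of labels, omitting the column for the last (sorted) level."""
    levels = sorted(set(values))
    kept = levels[:-1]  # the dropped level becomes the all-zeros baseline
    columns = [f"{prefix}__{level}" for level in kept]
    rows = [[1.0 if v == level else 0.0 for level in kept] for v in values]
    return columns, rows


columns, rows = one_hot_drop_last(["E2", "E1", "E3", "E0"], "offered_band")
print(columns)
for row in rows:
    print(row)
```

A row of all zeros (here the `E3` case) identifies the dropped baseline level, which avoids perfectly collinear dummy columns in the regression.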


In [32]:
head(encoded_Y_df)

Unnamed: 0_level_0,status__Joined
Unnamed: 0_level_1,Float64
1,1.0
2,1.0
3,1.0
4,1.0
5,1.0
6,1.0


In [34]:
X = Matrix(encoded_X_df);
Y = Matrix(encoded_Y_df);

8995×1 Array{Float64,2}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 ⋮
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0

### Train and test data split using Python

The train and test split can also be done using the **sklearn module**. If we use `@sk_import` to bring in the `train_test_split` function from the `model_selection` module, we get a warning, because the native ScikitLearn package in Julia already defines `train_test_split()` in its `CrossValidation` module. It is therefore better to use it from that module.

In MLJ, we have partition() function to do the split but we are not using it as of now.

In [35]:
using ScikitLearn.CrossValidation: train_test_split

In [36]:
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 42);

In [37]:
@show size(X_train)
@show size(X_test)

size(X_train) = (6296, 29)
size(X_test) = (2699, 29)


(2699, 29)

In case there is class imbalance, the below code chunk can be used to remove the class imbalance before any algorithm is tried.

## Model Building: Using the **sklearn** 



In [38]:
@sk_import linear_model: (LogisticRegression)

PyObject <class 'sklearn.linear_model._logistic.LogisticRegression'>

In [41]:
lg_reg_model = ScikitLearn.fit!(LogisticRegression(), X_train, y_train)

PyObject LogisticRegression()

## Model Evaluation


### 1. The prediction on train data.

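To predict the outcome:
> * Use the **predict** function of the model object for class labels, or **predict_proba** for class probabilities
> * The first two cells below score the **test set**; the last cell of this section scores the **train set**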


In [43]:
# Make predictions using the testing set
y_pred = ScikitLearn.predict(lg_reg_model,X_test)

2699-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

In [44]:
y_pred = lg_reg_model.predict_proba(X_test)

2699×2 Array{Float64,2}:
 0.0450288  0.954971
 0.109835   0.890165
 0.0849472  0.915053
 0.222328   0.777672
 0.0854943  0.914506
 0.226441   0.773559
 0.218286   0.781714
 0.17077    0.82923
 0.0592248  0.940775
 0.173216   0.826784
 0.289107   0.710893
 0.0777633  0.922237
 0.111257   0.888743
 ⋮          
 0.315407   0.684593
 0.369644   0.630356
 0.0954781  0.904522
 0.2535     0.7465
 0.181376   0.818624
 0.313577   0.686423
 0.269765   0.730235
 0.2568     0.7432
 0.30588    0.69412
 0.127011   0.872989
 0.219395   0.780605
 0.37469    0.62531

In [45]:
# The coefficients
print("Coefficients: \n", lg_reg_model.coef_)
print("Intercept: \n", lg_reg_model.intercept_)

Coefficients: 
[-0.11195296063309901 0.0020243874193852135 -0.020062165454874045 -0.509670606327758 0.43273468204578897 0.12314217949720567 0.0034563775473930624 0.056298252746685706 0.01948241670657845 -0.6979869844884219 -0.3997031930051729 -0.03162935326834916 -0.20377634216801174 -0.046837731013220626 0.05840433396316921 0.1162952357379013 0.288598744322289 0.6192756335716718 -0.027521383163637964 -0.05140982751605044 -0.04149328182996375 0.026376992142095435 -0.127901128067067 -0.015543111473654657 0.06629587575726618 0.11827507689112314 0.32697633790810043 0.043587440759225535 0.06592847638113887]Intercept: 
[0.35238705737807335]
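
The coefficients above are on the log-odds scale of the encoded target `status__Joined`. As a rough illustration, exponentiating a coefficient gives the multiplicative change in the odds of joining for a one-unit increase in that feature, and the logistic function maps a log-odds score to a probability. The sketch below uses the printed `notice_period` coefficient (third in the list) and the intercept. The helper name `sigmoid` is introduced only for illustration, and because scikit-learn's default `LogisticRegression` applies L2 regularization, these are shrunken estimates rather than plain maximum-likelihood ones.

```python
import math


def sigmoid(z):
    """Logistic function: maps a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


notice_period_coef = -0.020062165454874045  # third printed coefficient
intercept = 0.35238705737807335

# Odds multiplier for one extra day of notice period, other features held fixed
odds_ratio = math.exp(notice_period_coef)
print(f"odds ratio per extra day of notice: {odds_ratio:.4f}")

# Probability implied by the intercept alone (all features at zero, which is
# not a realistic candidate since age would be zero; shown only for the mechanics)
baseline_prob = sigmoid(intercept)
print(f"probability at zero features: {baseline_prob:.4f}")
```

An odds ratio below 1 means that, in this fit, a longer notice period is associated with lower odds of joining.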

In [49]:
predict_prob_train_df = convert(DataFrame, lg_reg_model.predict_proba(X_train))

head(predict_prob_train_df)

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Float64,Float64
1,0.144336,0.855664
2,0.0943273,0.905673
3,0.152119,0.847881
4,0.276366,0.723634
5,0.193261,0.806739
6,0.303941,0.696059


### 2. The prediction on test data.

The prediction can be carried out by **defining functions** as well. Below is one such instance wherein a function is defined and is used for prediction

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics

def get_predictions(test_class, model, test_data):
    predicted_df = pd.DataFrame(model.predict_proba(test_data))  # column 1 = P(Joined)
    y_pred_df = pd.concat([test_class.reset_index(drop=True), predicted_df.iloc[:, 1:]], axis=1)
    return y_pred_df

Label the Y column of the test set using a Python dictionary. This is done for the model that was built using dummy variable coding. The labels are used later to generate the confusion matrix.

In [None]:
test_series = y_test
train_series = y_train

status_dict = {1:"Joined", 0:"Not Joined"}
class_test_df = test_series.replace(dict(Joined=status_dict))
class_test_df.rename({'Joined': 'status'}, axis='columns', inplace=True )

class_train_df = train_series.replace(dict(Joined=status_dict))
class_train_df.rename({'Joined': 'status'}, axis='columns', inplace=True )

class_test_df.head()
y_test.head()
#class_train_df.info()

In [54]:
predict_test_df = convert(DataFrame, lg_reg_model.predict_proba(X_test))

head(predict_test_df)

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Float64,Float64
1,0.0450288,0.954971
2,0.109835,0.890165
3,0.0849472,0.915053
4,0.222328,0.777672
5,0.0854943,0.914506
6,0.226441,0.773559


In [None]:
predict_test_df = pd.DataFrame(get_predictions(class_test_df.status, lg_reg_model, X_test))
predict_test_df.rename(columns = {1:'predicted_prob'}, inplace=True)
predict_test_df.head()

In [None]:
predict_test_df['predicted'] = predict_test_df.predicted_prob.map(lambda x: 'Joined' if x > 0.5 else 'Not Joined')
predict_test_df[0:10]

In [None]:
pd.crosstab(predict_test_df.status,predict_test_df.predicted)

### 3. Confusion Matrix

We will build the classification matrix using the **metrics** module of the **sklearn** package. We will also write a custom function to build a classification matrix and use it for reporting the performance measures.
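
As a minimal sketch of what such a custom function could look like, the plain-Python helper below counts (actual, predicted) pairs into a 2×2 table with rows as actual classes and columns as predicted classes, the same orientation as `sklearn.metrics.confusion_matrix`. The function name `classification_matrix` and the toy labels are hypothetical.

```python
def classification_matrix(actual, predicted, labels=("Joined", "Not Joined")):
    """Return a nested list: rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    table = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        table[index[a]][index[p]] += 1
    return table


actual = ["Joined", "Joined", "Not Joined", "Joined", "Not Joined", "Joined"]
predicted = ["Joined", "Not Joined", "Not Joined", "Joined", "Joined", "Joined"]

cm = classification_matrix(actual, predicted)
# With "Joined" as the first label, row 0 holds the actual Joined cases
sensitivity = cm[0][0] / (cm[0][0] + cm[0][1])
specificity = cm[1][1] / (cm[1][0] + cm[1][1])
print(cm, sensitivity, specificity)
```

Fixing the label order explicitly keeps row 0 and column 0 tied to "Joined", so the sensitivity and specificity formulas do not silently swap if a class is missing from the predictions.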

#### 3a. Confusion Matrix using sklearn

In [50]:
@sk_import metrics: (confusion_matrix, classification_report)

PyObject <function classification_report at 0x7fc8808bc3a0>

In [51]:
#from sklearn import metrics
#from sklearn.metrics import confusion_matrix
#from sklearn.metrics import classification_report

In [52]:
# class_test_df and predict_test_df exist only in the unexecuted Python cells above,
# so score the 0/1 test labels (1 = Joined) against the model's class predictions.
y_test_vec = vec(y_test)
y_pred_class = ScikitLearn.predict(lg_reg_model, X_test)
println("The model with dummy variable coding output: ")
println(confusion_matrix(y_test_vec, y_pred_class))
lg_reg_report = classification_report(y_test_vec, y_pred_class)
println(lg_reg_report)

### 4. Performance Measure on the test set


In [None]:
def measure_performance (clasf_matrix):
    measure = pd.DataFrame({
                        'sensitivity': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[0,1]),2)], 
                        'specificity': [round(clasf_matrix[1,1]/(clasf_matrix[1,0]+clasf_matrix[1,1]),2)],
                        'recall': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[0,1]),2)],
                        'precision': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[1,0]),2)],
                        'overall_acc': [round((clasf_matrix[0,0]+clasf_matrix[1,1])/
                                              (clasf_matrix[0,0]+clasf_matrix[0,1]+clasf_matrix[1,0]+clasf_matrix[1,1]),2)]
                       })
    return measure

In [None]:
cm = metrics.confusion_matrix(predict_test_df.status, predict_test_df.predicted)

lg_reg_metrics_df = pd.DataFrame(measure_performance(cm))
lg_reg_metrics_df

print( 'Total Accuracy sklearn: ',np.round( metrics.accuracy_score( class_test_df.status, predict_test_df.predicted ), 2 ))




### 5. The optimal cut-off

We are going to use the model with dummy variable coding to select the optimal cut-off. 



#### Select the optimal cut-off value, if:

> 1. Misclassifying Not Joined as Joined is twice as costly as misclassifying Joined as Not Joined
> 2. Both sensitivity and specificity are equally important

The best cut-off is the one which minimizes the misclassification cost (in case of **_option 1_**) or which maximizes the Youden's Index (in case of **_Option 2_**).
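
Both criteria can be sketched in plain Python on a toy set of predicted probabilities. The code below sweeps a few cut-offs, computes the misclassification rates P10 (actual Joined classified as Not Joined) and P01 (actual Not Joined classified as Joined), and reports the cut-off that minimizes `1 * P10 + 2 * P01` alongside the one that maximizes Youden's index `P11 + P00 - 1`. The data and the helper name `rates_at` are made up for illustration; the 1:2 weights come from option 1.

```python
# Toy data (made up): predicted probability of "Joined", grouped by actual status
joined_probs = [0.97, 0.94, 0.91, 0.88, 0.85, 0.82, 0.76, 0.68, 0.63, 0.45]
not_joined_probs = [0.83, 0.74, 0.66, 0.58, 0.55, 0.52, 0.47, 0.36, 0.24, 0.12]


def rates_at(cutoff):
    """Return (P11, P10, P01, P00) when p > cutoff is classified as Joined."""
    p11 = sum(p > cutoff for p in joined_probs) / len(joined_probs)
    p01 = sum(p > cutoff for p in not_joined_probs) / len(not_joined_probs)
    return p11, 1 - p11, p01, 1 - p01


results = []
for step in range(1, 10):
    cutoff = step / 10  # 0.1, 0.2, ..., 0.9
    p11, p10, p01, p00 = rates_at(cutoff)
    cost = 1 * p10 + 2 * p01  # Not Joined classified as Joined weighs double
    youden = p11 + p00 - 1
    results.append((cutoff, cost, youden))

best_cost = min(results, key=lambda r: r[1])
best_youden = max(results, key=lambda r: r[2])
print("min-cost cut-off:", best_cost[0])
print("max-Youden cut-off:", best_youden[0])
```

On this toy data the cost-based rule picks a higher cut-off than Youden's index, because wrongly predicting Joined is penalized twice as heavily as the opposite error.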

In [None]:
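lg_pred_prob = pd.DataFrame(lg_reg_model.predict_proba(X_train))
n = len(X_train)

# Rows are the actual class, columns the predicted class.
# Predicting Joined for an actual Not Joined costs 2; the reverse error costs 1.
d = {"Joined": (0, 2), "Not Joined": (1, 0)}

costs = pd.DataFrame(d, index = ('Joined', 'Not Joined'))

print(costs)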


The other way to create the cost table

In [None]:
costs =  pd.DataFrame.from_dict({'Joined': [0,1], 'Not Joined': [2,0]},
                    orient='index', columns=['Joined', 'Not Joined'])

print(costs)
costs.iloc[0, 1]  # refer to a specific value by position (row 0, column 1)

In [None]:
lg_pred_prob.rename(columns = {1: 'predicted'}, inplace=True)


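Define a generator to loop through float values


In [None]:
def frange(start, stop, step):
    # Multiply instead of accumulating so rounding error does not build up
    n = 0
    while start + n * step < stop:
        yield round(start + n * step, 10)
        n += 1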

* `P11`: `tbl[0,0]/(tbl[0,0]+tbl[0,1])`, the share of actual Joined cases classified as Joined
* `P00`: `tbl[1,1]/(tbl[1,0]+tbl[1,1])`, the share of actual Not Joined cases classified as Not Joined

In [None]:
# Empty lists to store the results for each cut-off.
cutoff = []
P11 = []  # actual Joined classified as Joined
P00 = []  # actual Not Joined classified as Not Joined
P10 = []  # actual Joined classified as Not Joined
P01 = []  # actual Not Joined classified as Joined

for i in frange(0.00, 1, 0.05):
    predicted_y = lg_pred_prob.predicted.map(lambda x: 'Joined' if x > i else 'Not Joined')
    # Fix the label order so row/column 0 is always 'Joined'
    tbl = metrics.confusion_matrix(class_train_df.status, predicted_y, labels=['Joined', 'Not Joined'])
    P01.append(tbl[1,0]/(tbl[1,0] + tbl[1,1]))
    P00.append(tbl[1,1]/(tbl[1,0] + tbl[1,1]))
    P10.append(tbl[0,1]/(tbl[0,0] + tbl[0,1]))
    P11.append(tbl[0,0]/(tbl[0,0] + tbl[0,1]))
    cutoff.append(i)

d = {'cutoff':cutoff,'P10':P10,'P01': P01,'P00': P00,'P11':P11}
df_cost_table = pd.DataFrame(d, columns=['cutoff','P00','P01','P10','P11'])

In [None]:
df_cost_table


The table summarizing the optimal cut-off value:

_write the cost.table into a csv file_


In [None]:
df_cost_table['msclaf_cost'] = df_cost_table.P10*costs.iloc[0,1]+df_cost_table.P01*costs.iloc[1,0]
df_cost_table['youden_index'] = df_cost_table.P00+df_cost_table.P11 -1
df_cost_table

#write to csv
#df_cost_table.to_csv("optimal_Cutoff_caret.csv", sep=',')
#os.getcwd()


### 6. Confusion Matrix using Optimal Cut-off

The probability value along with the optimal cut-off can be used to build the confusion matrix. We will use the **measure_performance** function defined previously, together with a **draw_cm** plotting helper (not defined in this notebook), to report the performance of the model.
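
Applying a cut-off amounts to a single comparison per case. The short sketch below, on made-up probabilities, shows how the predicted labels shift when the cut-off moves from 0.8 to 0.9, the two values tried in the cells that follow. The helper name `apply_cutoff` is hypothetical.

```python
def apply_cutoff(probabilities, cutoff):
    """Label a case 'Joined' only when its predicted probability exceeds the cut-off."""
    return ["Joined" if p > cutoff else "Not Joined" for p in probabilities]


test_probs = [0.95, 0.89, 0.83, 0.78, 0.62]  # made-up predicted probabilities of Joined

labels_at_8 = apply_cutoff(test_probs, 0.8)
labels_at_9 = apply_cutoff(test_probs, 0.9)
print(labels_at_8)
print(labels_at_9)
```

A stricter cut-off moves borderline cases into the Not Joined class, which raises the share of Not Joined cases caught at the expense of more Joined cases being flagged incorrectly.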

In [None]:
predict_test_df['predicted'] = predict_test_df.predicted_prob.map(lambda x: 'Joined' if x > 0.9 else 'Not Joined') 
predict_test_df[0:10]

In [None]:
draw_cm( predict_test_df.status, predict_test_df.predicted )

In [None]:
predict_test_df['predicted_8'] = predict_test_df.predicted_prob.map(lambda x: 'Joined' if x > 0.8 else 'Not Joined') 
draw_cm( predict_test_df.status, predict_test_df.predicted_8)

In [None]:
cm = metrics.confusion_matrix(predict_test_df.status, predict_test_df.predicted)

pd.DataFrame(measure_performance(cm))

In [None]:
cm = metrics.confusion_matrix(predict_test_df.status, predict_test_df.predicted_8)

pd.DataFrame(measure_performance(cm))

## Deployment - Save model

Save the model using JLD and PyCallJLD (needed if using @sk_import).

In [None]:
JLD.save("lg_reg_model.jld", "model", lg_reg_model)

## Use model on New Cases

We can load the model object for later use. Here we assume that X_test is new data on which we want to use the model.

In [None]:
new_model = JLD.load("lg_reg_model.jld", "model")    # Load it back

In [None]:
new_model.estimator

In [None]:
new_model.predict(X_test)

In [None]:
new_model.score(X_test,y_test)


#### End of Document

***
***
