# Old but Interpretable: The Logit Model
The motivation for using the logit model (logistic regression) lies in the interpretability of its coefficients as log odds ratios.

**Recap odds:**  
Odds are the ratio between the probabilities of two mutually exclusive events. In our problem, these probabilities are the predicted probabilities generated by the logit model. The odds are greater than one if the probability of an employee looking for a new job, given a particular set of feature values, is greater than the probability of the counter event, and less than one in the opposite case.
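
As a quick numeric illustration of the recap above, here is a minimal sketch. The helper name `odds` and the probabilities are made up for illustration and are not part of the notebook's pipeline.

```python
def odds(p):
    """Odds of an event with probability p against its counter event (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return p / (1 - p)

# Hypothetical predicted probabilities of 'looking for a new job'
for p in (0.2, 0.5, 0.8):
    print(f"P(y=1)={p:.1f} -> odds={odds(p):.2f}")
```

A probability below 0.5 gives odds below one, a probability above 0.5 gives odds above one.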

**Log odds ratio in the context of a logit model:**  
$$
\exp \left(\hat{\beta}_{i}\right)=\frac{\frac{P\left(y=1 \mid x_{1}, \ldots, x_{i}+1, \ldots, x_{p}\right)}{P\left(y=0 \mid x_{1}, \ldots, x_{i}+1, \ldots, x_{p}\right)}}{\frac{P\left(y=1 \mid x_{1}, \ldots, x_{i}, \ldots, x_{p}\right)}{P\left(y=0 \mid x_{1}, \ldots, x_{i}, \ldots, x_{p}\right)}}
$$

with $\hat{\beta}_{i}$ being the i-th estimated coefficient (the log odds ratio) and ${P\left(Y \mid X \right)}$ being the probability output of our logistic model, where $Y$ is the random variable of the target and $X$ is the vector of random variables of the features.


**Interpretation:**  
- $\exp \left(\hat{\beta}_{i}\right)$ < 1: Increase of feature $x_{i}$ by one unit decreases the odds 
- $\exp \left(\hat{\beta}_{i}\right)$ > 1: Increase of feature $x_{i}$ by one unit increases the odds 

When $x_{i}$ is increased, all other features remain constant.  
The odds change by a constant factor, but the resulting change in the predicted probability is not constant: it depends on the starting probability, because the sigmoid (the inverse of the model's link function) is non-linear.
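
The following sketch checks this numerically for a toy model with a single feature. The intercept `beta_0`, the coefficient `beta_1` and the chosen feature values are hypothetical, not estimates from this notebook.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def odds(p):
    return p / (1 - p)

beta_0, beta_1 = -1.0, 0.5  # hypothetical intercept and coefficient

def prob(x):
    """Predicted probability P(y=1 | x) of the toy logit model."""
    return sigmoid(beta_0 + beta_1 * x)

for x in (0, 2, 6):
    p_before, p_after = prob(x), prob(x + 1)
    ratio = odds(p_after) / odds(p_before)
    print(f"x={x}: P {p_before:.3f} -> {p_after:.3f} "
          f"(diff {p_after - p_before:+.3f}), odds ratio {ratio:.4f}")

print(f"exp(beta_1) = {math.exp(beta_1):.4f}")
```

The odds ratio equals `exp(beta_1)` for every starting value of `x`, while the difference in probability shrinks as the probability approaches one.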

In [None]:
# the only way to order your imports
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.experimental import enable_iterative_imputer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import confusion_matrix
from sklearn.impute import IterativeImputer
from scipy.stats import loguniform
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from collections import OrderedDict
from scipy import sparse
import pandas as pd
import numpy as np
import random

In [None]:
random.seed(42)
np.random.seed(42)
train = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
test = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_test.csv')
answers = np.load('../input/job-change-dataset-answer/jobchange_test_target_values.npy')

# Preprocessing
Ordinal: Order matters.  
Nominal: Order does not matter.  

- LabelEncode categorical data
- Scale numerical data
- Impute missing data
- Scale ordinal data
- OneHotEncode nominal data
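
To make the difference between the two encodings concrete, here is a small stand-alone sketch in plain Python. The category lists are shortened, hypothetical examples and `one_hot` is an illustrative helper. The notebook itself builds its mappings by hand and uses scikit-learn's `OneHotEncoder`.

```python
# Ordinal feature: the position in the list carries meaning.
education_levels = ['High School', 'Graduate', 'Masters']
ordinal_map = {level: rank for rank, level in enumerate(education_levels)}

# Nominal feature: one indicator column per category, no implied order.
company_types = ['NGO', 'Pvt Ltd', 'Public Sector']

def one_hot(value, categories):
    """Return a 0/1 indicator list with a single 1 at the position of value."""
    return [1 if value == c else 0 for c in categories]

print(ordinal_map['Masters'])
print(one_hot('Pvt Ltd', company_types))
```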

In [None]:
x_train = train.drop(['enrollee_id', 'target'], axis=1)
y_train = train['target']
x_test = test.drop(['enrollee_id'], axis=1)

cats = [c for c in x_train.columns if train[c].dtypes =='object']
nominals = ['city', 'gender', 'relevent_experience', 'enrolled_university', 'major_discipline', 'company_type'] # will be onehot encoded
ordinals = [c for c in cats if c not in nominals] # list comprehension keeps the column order deterministic
nums = [c for c in x_train.columns if train[c].dtypes !='object']
print(len(nominals),len(ordinals), len(nums))
# create mappings for categorical features
gender = ['Female', 'Male', 'Other']
relevent_experience = ['No relevent experience', 'Has relevent experience']
enrolled_university = ['no_enrollment', 'Full time course', 'Part time course']
education_level = ['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']
major_discipline = ['STEM', 'Business Degree', 'Arts', 'Humanities', 'No Major', 'Other']
experience = ['<1', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '>20']
company_type = ['Pvt Ltd', 'Funded Startup', 'Early Stage Startup', 'Other', 'Public Sector', 'NGO']
company_size = ['<10', '10/49', '50-99', '100-500', '500-999', '1000-4999', '5000-9999', '10000+']
last_new_job = ['never', '1', '2', '3', '4', '>4']
mapping = dict()
for key in cats:
    if key == 'city': 
        continue
    mapping[key] = OrderedDict()
    for subkey, value in zip(eval(key),range(len(eval(key)))):
        mapping[key][subkey] = value

city_label = LabelEncoder().fit(x_train['city'])
for df in [x_train, x_test]:
    df[cats] = df[cats].apply(lambda x: x.map(mapping[x.name]) if x.name != 'city' else city_label.transform(x))

num_scaler = StandardScaler().fit(x_train[nums])
ordinals_scaler = StandardScaler().fit(x_train[ordinals])

# Important: Impute train and test set separately. Otherwise information of the training set will leak into the test set.
# ConvergenceWarning of IterativeImputer can be ignored. github.com/scikit-learn/scikit-learn/issues/14338
for df in [x_train, x_test]:
    df[nums] = num_scaler.transform(df[nums])
    df.iloc[:, :] = IterativeImputer(estimator=KNeighborsRegressor(),random_state=42).fit_transform(df)
    df[cats] = df[cats].round(0).astype(int)
    df[ordinals] = ordinals_scaler.transform(df[ordinals])
    

# Same OneHotEncoder for both sets, otherwise columns will not match 
enc = OneHotEncoder(sparse=False)
enc.fit(x_train[nominals])
x_train = sparse.csr_matrix(np.hstack([x_train[nums + ordinals].to_numpy(), enc.transform(x_train[nominals])]))
x_test = sparse.csr_matrix(np.hstack([x_test[nums + ordinals].to_numpy(), enc.transform(x_test[nominals])]))

# Hyperparam Tuning
Extend `LogisticRegression` so that the SMOTE resampling happens inside each cross-validation fold, i.e. only on the training part of the fold.

In [None]:
# A not-so-fancy random parameter search
searchspace = {
    'C': loguniform(1e-3, 1e2),
    'l1_ratio': np.random.default_rng().uniform(0,1,1000),
}

class SMOTELogisticRegression(LogisticRegression):
    
    def __init__(self,C=1.0, l1_ratio=None, penalty='elasticnet', solver='saga', max_iter=1000, **kwargs):
        super().__init__(C=C, l1_ratio=l1_ratio, penalty=penalty, solver=solver, max_iter=max_iter, **kwargs)
        
    def fit(self, X, y, sample_weight=None):
        smote = SMOTE(random_state=42)
        x_smote, y_smote = smote.fit_resample(X, y)
        # return the fitted estimator, as the scikit-learn API expects from fit()
        return super().fit(x_smote, y_smote, sample_weight)

rs = RandomizedSearchCV(estimator=SMOTELogisticRegression(),
                        param_distributions=searchspace,
                        n_iter=25,
                        scoring='roc_auc',
                        n_jobs=-1,
                        refit=False,
                        cv=5,
                        verbose=3,
                        random_state=42)
rs.fit(x_train,y_train)
print(rs.best_params_)

# Prediction

In [None]:
# Refit of best params with correct resampling. Kudos to kaggle.com/arashnic/handling-imbalanced-resampling-the-right-way/notebook
folds = KFold(n_splits=5, shuffle=True, random_state=42)
pred_cls = np.zeros(len(answers))
train_auc = []
val_auc = []

for train_idx, val_idx in folds.split(x_train,y_train):
    x_smote, y_smote = SMOTE(random_state=42).fit_resample(x_train[train_idx],y_train[train_idx])
    x_val, y_val = x_train[val_idx], y_train[val_idx]
    cls = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=1000, random_state=42, **rs.best_params_)
    cls.fit(x_smote,y_smote)
    train_auc.append(roc_auc_score(y_smote, cls.predict_proba(x_smote)[:,1]))
    val_auc.append(roc_auc_score(y_val, cls.predict_proba(x_val)[:,1]))
    # accumulate the fold predictions so that pred_cls ends up as their average
    pred_cls += cls.predict_proba(x_test)[:,1]/folds.n_splits

print(f'Train auc: {np.mean(train_auc)}')
print(f'Val auc: {np.mean(val_auc)}')
print(f'Test auc: {roc_auc_score(answers, pred_cls)}')
print('Confusion matrix for test data:')
for label, value in zip(['TN', 'FP', 'FN', 'TP'], confusion_matrix(answers,np.where(pred_cls > 0.5,1,0)).flatten()):
    print(label, value)

# submission
pd.DataFrame({'enrollee_id':test['enrollee_id'],'target':pred_cls}).to_csv('submission.csv',index=False)

## Prediction Results 
The model overfits and generalizes poorly, with many Type II errors (false negatives).
Well, it is a logistic model. If you have any tips for improvement, I am happy to read them in the comment section.

# Statistical Inference

Overfitting does not matter in this case. We want the best possible estimates of the coefficients with respect to the data.

Estimate the coefficients on the whole upsampled dataset, without cross-validation.

In [None]:
cls = LogisticRegression(penalty='elasticnet', solver='saga', max_iter=1000,random_state=42, **rs.best_params_)
# fit on the whole training set (not only on the last CV fold's train_idx)
cls.fit(*SMOTE(random_state=42).fit_resample(x_train, y_train))
odd_ratios = np.round(np.exp(cls.coef_),3).flatten()

In [None]:
def print_changes_in_odds(features, odd_ratios, padding=30):
    # Print one row per feature: name, direction of influence, absolute change of the odds in %
    header = ['feature', 'influence on odds', 'change in %']
    print(''.join(x.ljust(padding) for x in header))
    for k, v in zip(features, odd_ratios):
        influence = 'negative' if v < 1 else 'positive'
        row = [k, influence, f'{np.abs(1-v)*100:.2f}']
        print(''.join(x.ljust(padding) for x in row))

## Change In The Odds

How to read a negative/positive percentage change in the odds ratio?  

**Structure of odds ratio:**   
The odds with the feature increased by one unit are in the numerator, and the odds without the increase are in the denominator.

**Influence is positive:**  
An increase of the feature by one unit/level increases the odds by x percent.  
Odds in the numerator are bigger than odds in the denominator.  
The numerator gets bigger if the probability of the event `not looking for job` decreases or, equivalently, if the probability of `looking for a job` increases.

**Influence is negative:**  
An increase of the feature by one unit/level decreases the odds by x percent.  
Odds in the numerator are smaller than odds in the denominator.  
The numerator gets smaller if the probability of the event `not looking for job` increases or, equivalently, if the probability of `looking for a job` decreases.  
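
As a small worked example of this reading, the sketch below converts an odds ratio into a direction and a percentage change. The function name `describe_odds_ratio` and the two ratios are hypothetical and only serve as illustration.

```python
def describe_odds_ratio(odds_ratio):
    """Return the direction of influence and the signed change of the odds in percent."""
    change = (odds_ratio - 1) * 100
    direction = 'positive' if odds_ratio >= 1 else 'negative'
    return direction, change

for ratio in (1.25, 0.80):
    direction, change = describe_odds_ratio(ratio)
    print(f"odds ratio {ratio:.2f}: {direction} influence, odds change by {change:+.1f} %")
```

An odds ratio of 1.25 means the odds grow by 25 percent per unit, while an odds ratio of 0.80 means they shrink by 20 percent.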


### Numerical Features

In [None]:
upper=len(nums)
print_changes_in_odds(nums, odd_ratios[:upper])

### Ordinal Features

In [None]:
lower = upper
upper += len(ordinals)
print_changes_in_odds(ordinals,odd_ratios[lower:upper])

### Nominal Features

In [None]:
for nominal in nominals:
    lower = upper
    upper += city_label.classes_.shape[0] if nominal == 'city' else len(mapping[nominal])
    print('#'*20, nominal, '#'*20)
    if nominal == 'city':
        # only use the highly influential cities
        highly_influential_idx = np.where(np.abs(1 - odd_ratios[lower:upper]) > 1.1)
        print_changes_in_odds(city_label.classes_[highly_influential_idx],odd_ratios[lower:upper][highly_influential_idx])
    else:
        print_changes_in_odds(mapping[nominal].keys(),odd_ratios[lower:upper])

## Final Interpretation
Numerical:
- Higher `city_development_index` => employee probably not looking for new job
- Higher `training_hours` => employee probably not looking for new job

Ordinal:
- `company_size` has nearly no influence
- Higher `experience` => employee probably not looking for new job
- Higher `education_level` => employee probably looking for new job
- Higher `last_new_job` (longer in same job) => employee probably looking for new job

Nominal (subset of most influential):
- If employee works in city `100`, `103`, `160` or `21` => employee probably looking for job
- If employee is `female` => employee probably not looking for job
- If employee has `no relevent experience` => employee probably looking for job
- If employee has `relevent experience` => employee probably not looking for job
- If employee is enrolled in `full time course` at university => employee probably looking for job
- If employee majored in `STEM` => employee probably looking for job
- If employee works for `Funded Startup` => employee probably looking for job
- If employee works in `Other` => employee probably looking for job
- If employee works for `NGO` => employee probably not looking for job


# Last Words

Thank you for the attention. Feel free to point out any errors.

Unsolved Problem:  
I would like to test the significance of the coefficients of the logit model. Unfortunately, the logit model of the statsmodels package could not compute the inverse of the Hessian (I have tried different solvers, increased max_iter and different initializations). In maximum likelihood estimation, the negative inverse of the expected Hessian of the log-likelihood is an approximation of the covariance matrix of the estimated coefficients. 

No inverse Hessian => no variances of coefficients => no test statistics => no p-values => no test decisions => no tests 
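
For illustration, the last links of this chain could look as follows once a standard error is available. This is a minimal sketch of a two-sided Wald test under the usual normal approximation. The function name `wald_test` and the numbers are hypothetical, not results from this notebook.

```python
import math

def wald_test(beta_hat, std_error):
    """Two-sided Wald test of H0: beta = 0, using the normal approximation."""
    z = beta_hat / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical coefficient and standard error (the square root of a diagonal
# entry of the estimated covariance matrix).
z, p = wald_test(beta_hat=0.42, std_error=0.15)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

The standard error is exactly the piece that is missing here, because it comes from the inverse Hessian.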

:(

I would appreciate your help!  
Yes there is not a single chart in this notebook. :p