# Will you get a job?
  
The data set consists of MBA students' basic demographic variables, their past grades (from secondary school up to the MBA), whether they got a job after graduation, and the salary.   

To understand which factors affect recruitment and how, I first conduct an EDA, then build a model to estimate the impact of each variable and to make predictions. The full analysis follows.

# Just Guess

*Before we start, I want to make some guesses about this data set. After that, we will check them against the data.*
* **Better Grades Better:**  
Companies are unlikely to be less interested in candidates with better grades than in those with lower grades. The better your grades are, the more likely you are to get a job. 
  
* **The Recents Matter:**  
More recent grades should be more representative of a candidate's current ability. As a result, MBA grades might affect the chance of getting a job more than grades in higher secondary or secondary education do.   
  
* **Popular Central Board of Education Is More Competitive:**  
There are several boards of education in India, and the central board is the most popular one. We believe it is more competitive because most students want to attend schools that belong to it; thus, students from the central board might have a higher chance of getting a job.  
  
* **Gender Equality?**  
The ratio of literate and educated females to males in India is relatively low. Because the data was collected from MBA students, we could guess that the number of females would be lower than the number of males.  
However, for females with such a high level of education, it might not necessarily be more difficult to get a job.  



# A. Exploratory Data Analysis  
  
We want to quickly find out:
* The dimensions of the data.
* Which variables are in this data set.
* Whether there are any implausible or missing values.
* The distribution of every variable.

## A-1. Generally  

Quickly see the dimension and columns.

In [None]:
import pandas as pd
import os
from scipy import stats
from statsmodels.stats import anova
import statsmodels.api as sm
from statsmodels.formula.api import ols

wd = './'
mydat = pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')
print(type(mydat))

# dimensions
print(mydat.shape)
print(len(mydat))
cols = (mydat.columns)
print(cols.tolist())


Then check whether there are any missing values.

In [None]:
# types of columns
import numpy as np

is_numeric = mydat.dtypes != 'object'
numeric_cols = mydat.columns[is_numeric]


# missing values
missing_cols = mydat.columns[np.where(mydat.isna())[1]]
print('Missing columns:\n', np.unique(missing_cols),'\n')
print('Missing percentage:\n', missing_cols.value_counts() / len(mydat),'\n')


It seems that salary is the only column with missing values. Let's dig deeper.

In [None]:
### a closer look at the missing salaries
not_placed = np.where(mydat['status']=='Not Placed')[0]
salary_missing = np.where(mydat['salary'].isna())[0]

# np.array_equal also handles the case where the two index arrays differ in length
print('Are the rows with missing salary exactly the "Not Placed" rows:\n', np.array_equal(not_placed, salary_missing),'\n')


All missing salaries belong to students who did not get a job, which is quite reasonable.  
  
Next, let's take a look at the numeric variables.

## A-2. Numeric Variables

In [None]:
# numeric distribution
### distribution
pd.set_option("display.max_columns", 15)
print('Summary of numeric variables:\n',mydat.describe(include=[np.number]),'\n')

### outlier detect
def inRangeCheck(x, left, right): # x: series, left: int, right: int; outlier: list
    #outlier = [x[i] for i in range(len(x)) if x[i]<left] + [x[i] for i in range(len(x)) if x[i]>right]
    outlier = pd.concat([x[x<left], x[x>right]]).sort_values()
    return outlier

def outlierDetect(x): # x: series; returns a sorted series of the values outside the IQR fences
    x_series = pd.Series(data=x)
    q1 = x_series.quantile(.25)
    q3 = x_series.quantile(.75)
    iqr = q3 - q1
    # lower fence: q1 - 1.5*IQR; upper fence: q3 + 1.5*IQR
    outlier = pd.concat([x_series[x_series<(q1-1.5*iqr)], x_series[x_series>(q3+1.5*iqr)]]).sort_values()
    return outlier


cols = [nc for nc in numeric_cols if nc not in ['sl_no','salary']]
out_range = mydat.loc[:,cols].apply(inRangeCheck, left=0, right=100)
out_salary = inRangeCheck(mydat.loc[:,'salary'], left=0,right=float('inf'))
print('Out of range:\n', out_range, '\n')
print('Negative salary:\n ', out_salary, '\n')



We can see that the variables expressed as percentages all fall in the reasonable range, [0, 100].  

Besides, we checked whether there is any negative value in salary, and the column looks clean. 

Next, we check for outliers in the data set.

In [None]:
cols = [nc for nc in numeric_cols if nc not in ['sl_no']]
outlier_normal = []
for col in cols:
    temp = outlierDetect(mydat.loc[:, col])
    outlier_normal.append(temp)

outlier_len = {}
for i in range(len(outlier_normal)):
    outlier_len.update({outlier_normal[i].name :len(outlier_normal[i])})

print("outlier_len: \n", outlier_len, '\n')

The IQR method does flag some values as outliers, but the exploration above suggests the data is quite clean, so I keep these flagged values in the data set.
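
As a small illustration of the IQR rule used above, the sketch below applies Tukey's fences to a toy series. The function name `iqr_fences` and the toy values are assumptions made up for this example; they are not part of the analysis.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    s = pd.Series(values)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

toy = pd.Series([1, 2, 3, 4, 5, 100])  # hypothetical values
lower, upper = iqr_fences(toy)
flagged = toy[(toy < lower) | (toy > upper)]  # values outside the fences
print(lower, upper, flagged.tolist())
```

Only the values beyond either fence are flagged; whether to drop them is a separate judgment, as in the text above.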


Also, the correlation coefficients show that the MBA grade and the employment test score have some correlation with salary among those who got a job.

In [None]:
### correlation
corr = mydat.loc[:,numeric_cols].corrwith(mydat['salary'])
print('Correlation with salary:\n',corr,'\n')


## A-3. Categorical Variables  
  
First, take a look at the distribution of each categorical variable.

In [None]:
# categorical distribution
categorical_cols = mydat.columns[mydat.dtypes == 'object']

frequency = {}
i = 0
for cc in categorical_cols:
    if i == 0:
        print('Count Values:')
    temp = mydat[cc].value_counts()
    frequency.update({cc: temp})
    print(temp,'\n')
    i += 1


Then we plot the distributions.

In [None]:
# Visualization
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## Single variable distribution check
### Numeric Variables
plot_numeric_cols = [nc for nc in numeric_cols if nc not in ['sl_no']]
n_row = 3
n_col = 2
fig, axes = plt.subplots(nrows=n_row, ncols=n_col, figsize=(8,7))
plt.subplots_adjust(hspace=0.7)
count=0
for i in range(n_row):
    for j in range(n_col):
        sns.histplot(mydat[plot_numeric_cols[count]], kde=True, stat='density',
            ax=axes[i,j])
        axes[i,j].set_title(plot_numeric_cols[count], fontsize=15)
        count+=1

plt.show()

#### box plot
fig, axes = plt.subplots(nrows=n_row, ncols=n_col, figsize=(8,7))
plt.subplots_adjust(hspace=0.7)

count=0
for i in range(n_row):
    for j in range(n_col):
        sns.boxplot(x=mydat[plot_numeric_cols[count]],
            ax=axes[i,j])
        axes[i,j].set_title(plot_numeric_cols[count], fontsize=15)
        count+=1

plt.show()


From the density plots, most variables look approximately normal, but we suggest further checking the normality assumption for "etest_p" against "status". (We don't check "salary" because we will not use it as a predictor or as the dependent variable.)  
  
The box plots show some outliers in hsc_p, the percentage grade in higher secondary education. The only restriction on it is that the value must lie between 0 and 100. Since all samples follow this rule, and we can't come up with any other solid explanation for how the outliers arose, we keep these data for further analysis.

In [None]:
### Categorical Variables
n_row = 4
n_col = 2
fig, axes = plt.subplots(nrows=n_row, ncols=n_col, figsize=(10,7))
plt.subplots_adjust(hspace=0.7, wspace=0.3)
count=0
for i in range(n_row):
    for j in range(n_col):
            sns.barplot(x=frequency[categorical_cols[count]].values, y=frequency[categorical_cols[count]].index,
                ax=axes[i,j])
            axes[i,j].set_title(categorical_cols[count], fontsize=15)
            count+=1

#plt.savefig(wd + '/categorical_eda1.png')
plt.show()

Visualize the relationship between each predictor and job status, first for the numeric variables and then for the categorical ones.

In [None]:
### Which variable is associated with the "status" variable
##### Numeric variables
plot_numeric_cols = [nc for nc in numeric_cols if nc not in ['sl_no']]

fig, axes = plt.subplots(nrows=len(plot_numeric_cols), ncols=1, figsize=(7,10))
for i in range(len(plot_numeric_cols)):
    if(plot_numeric_cols[i]!='salary'):
        sns.violinplot(data=mydat, x=plot_numeric_cols[i], y='status', #hue='status',
            cut=0, order=['Placed','Not Placed'], scale='count', bw=.3, orient='h',
            ax=axes[i]) # specify which subplot to draw on
    else:
        sns.violinplot(data=mydat, x=plot_numeric_cols[i], y='status', hue='status',
            cut=0, order=['Placed','Not Placed'], scale='count', bw=.3, orient='h',
            ax=axes[i])
    axes[i].set_ylabel(plot_numeric_cols[i], rotation=0, fontsize=15, labelpad=27) # ax.set_ylabel
    axes[i].set_yticks(ticks=[])
    axes[i].set_xlabel('')

axes[0].set_title('EDA of Numeric Variables', fontsize=20)

#plt.savefig(wd + '/numeric_eda.png')
plt.show()

In [None]:
##### categorical variables
plot_categorical_cols = categorical_cols[categorical_cols != 'status']
#cross_dat = pd.crosstab(index=mydat['status'], columns=[mydat['workex']])

n_row = 4
n_col = 2
fig, axes = plt.subplots(nrows=n_row, ncols=n_col, figsize=(10,7))
plt.subplots_adjust(hspace=0.7, wspace=0.3)
for i in range(n_row):
    end = False
    for j in range(n_col):
        ind = i*2+j
        if ind>(len(plot_categorical_cols)-1):
            end = True
            break
        col = plot_categorical_cols[ind]
        count_table = mydat.groupby(['status',col]).size().reset_index(name='counts')
        count_table['total'] = count_table['counts'].groupby(count_table[col]).transform('sum')
        count_table['proportion'] = (count_table['counts']/count_table['total']*100).round(2)

        sns.barplot(data=count_table, x=col, y="proportion", hue="status", ax=axes[i,j])
        ys = count_table['proportion'].groupby(count_table[col]).max()
        values = count_table.loc[:,[col,'total']].drop_duplicates(subset=col).set_index(col)['total']

        for x, y, value in zip(range(len(ys)), ys.values, values.values):
            axes[i,j].text(x=x, y=y, s=str(value), fontsize=12, horizontalalignment='center')

        axes[i,j].set_title(col, fontsize=14)
        axes[i,j].set_ylabel('')
        axes[i,j].set_xlabel('')

    if end:
        break
plt.show()

### *Brief Summary:*
* The grade variables are quite clean.
* We didn't remove any outliers because we believe the values are reasonable.
* Surprisingly, the grades from college and earlier seem more correlated with the chance of getting a job than the most recent grades from the MBA and the employment test.  
* Students from the central board were not placed at a higher rate.
* The ratio of placed to not placed is indeed higher for males than for females.
* Students who had work experience before the MBA were placed at a higher rate than those who had not.  


# B. Select Predictors  
Having seen the distribution plots, we still want to know, through statistical tests, which variables are correlated with the probability of getting a job.  
* For categorical variables, we choose the chi-square test of independence.
* For numeric variables, we apply a t-test ***after making sure they satisfy its assumptions***. 
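
As a rough illustration of checking the t-test assumptions, the sketch below runs a Shapiro-Wilk normality test and a Levene equal-variance test on two synthetic groups, then a Welch t-test. The group sizes, means and variable names are assumptions made up for this example; the analysis itself inspects residual plots instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical grade percentages for the two status groups
placed = rng.normal(loc=70, scale=10, size=150)
not_placed = rng.normal(loc=60, scale=10, size=70)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
_, p_norm_placed = stats.shapiro(placed)
_, p_norm_not_placed = stats.shapiro(not_placed)

# Levene: the null hypothesis is that the two groups have equal variances
_, p_equal_var = stats.levene(placed, not_placed)

# Welch's t-test does not assume equal variances
t_stat, p_value = stats.ttest_ind(placed, not_placed, equal_var=False)
print(p_norm_placed, p_norm_not_placed, p_equal_var, t_stat, p_value)
```

Small p-values in the first two checks would argue for a transformation or a non-parametric test before trusting the t-test.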


In [None]:
# Test for a difference in means for every numeric variable
# #Before the hypothesis tests we first do data transformation and a normality check
# #salary and etest_p look like the candidates for a transformation toward normality
# #etest_p seems a little right-skewed

lm_model = ols('etest_p~status', data=mydat).fit()
sns.histplot(lm_model.resid, kde=True, stat='density')
plt.show()
 
mydat['etest_p_trans0'] = mydat['etest_p'] ** .1 #Make it less right-skewed
lm_model = ols('etest_p_trans0~status', data=mydat).fit()
sns.histplot(lm_model.resid, kde=True, stat='density')
plt.show()

mydat['etest_p_trans'] = (mydat['etest_p_trans0']-min(mydat['etest_p_trans0']))*100/(max(mydat['etest_p_trans0'])-min(mydat['etest_p_trans0']))

mydat.to_csv(wd + '/Placement_Data_Transformed.csv', index=False)

After the power transformation, the residual density plot of 'etest_p' looks less right-skewed.  
  
Next, we run the hypothesis tests.  
### **Null hypothesis:**  
***Numeric variables:***  
The mean of the given variable is the same in the "Placed" group and the "Not Placed" group.

***Categorical variables:***  
The given variable is independent of the "status" variable.  

If the null hypothesis is rejected, the value of the variable differs when the employment status changes, which says: **The variable is associated with employment status.** We could then consider putting the variable in the model.
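
The numeric test can be run as a regression of the variable on a 0/1 status dummy: the slope is the difference in group means, and its t-test matches the classic pooled-variance two-sample t-test. The sketch below checks this on synthetic data; the group sizes and effect size are assumptions made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group = np.repeat([0, 1], [120, 60])             # 0 = Placed, 1 = Not Placed (toy labels)
score = rng.normal(loc=70 - 8 * group, scale=9)  # hypothetical grades, one per label

# simple regression of the score on the group dummy
reg = stats.linregress(group, score)

# classic two-sample t-test with pooled variance
t_res = stats.ttest_ind(score[group == 1], score[group == 0], equal_var=True)

diff_in_means = score[group == 1].mean() - score[group == 0].mean()
print(reg.slope, diff_in_means)   # slope equals the difference in means
print(reg.pvalue, t_res.pvalue)   # both tests give the same p-value
```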

In [None]:
test_numeric_cols = plot_numeric_cols
test_numeric_cols = ['etest_p_trans' if col == 'etest_p' else col for col in test_numeric_cols]
test_numeric_cols.remove('salary')

t_rows = []  # one row of test results per numeric variable
for col in test_numeric_cols:
    X=np.array(mydat['status'].map({'Placed':0, 'Not Placed':1}))
    X = sm.add_constant(X)
    Y = np.array(mydat[col])
    temp = sm.OLS(Y,X).fit()

    # the slope on the status dummy is the difference in group means
    t_rows.append({'coef':temp.params[1],
        'se':temp.bse[1],
        'tvalue':temp.tvalues[1],
        'pvalue':temp.pvalues[1]})

# build the table in one step (DataFrame.append is not available in pandas >= 2.0)
t_test = pd.DataFrame(t_rows, columns=['coef','se','tvalue','pvalue'], index=test_numeric_cols)
t_test = t_test.round(2)
print(t_test)

# chi-square independence test
outcome = 'status'
cols = categorical_cols[categorical_cols!='status']
chi_rows = []  # one row of test results per categorical variable
for i in range(len(cols)):
    col = cols[i]
    contingency_table = pd.crosstab(index=mydat[outcome], columns=mydat[col])
    chi2, p, dof, expected = stats.chi2_contingency(contingency_table, correction=False)
    chi_rows.append({'chi2':chi2, 'pvalue':p})

chiind_test = pd.DataFrame(chi_rows, columns=['chi2','pvalue'], index=cols)
chiind_test = chiind_test.round(2)
print(chiind_test)

According to the results of the statistical tests, **ssc_p, hsc_p, degree_p, workex** and **specialisation** are the 5 variables that differ significantly between the 2 status groups. Thus, they are the candidate predictors for the model.
  
Considering the plots we made before, it seemed to us that, although the difference is not significant, **gender** and the **employment test grade (etest_p)** still have some association with status; as a result, these two are also considered in the model.  


# C. Building model  
Here, we would like to build a model to explain the effect of each variable and to predict one's probability of getting a job. There are 4 main parts in this section.  

1. Data preparing
1. Model selection
1. Select polynomial features
1. Hyperparameter tuning  

## C-1. Data Preparing:  
* The data is split into training and testing sets with a ratio of 4:1.  
* Categorical variables are one-hot encoded; the numerical ones all share the same scale, from 0 to 100.
* Note that "Not Placed" status is set to be positive event.
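
The preparation steps above can be sketched on a tiny made-up table. The rows in `toy` are hypothetical and only mimic the column types of the real data; the encoding and the 4:1 split follow the description above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical mini data set with the same kind of columns
toy = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'workex': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No'],
    'ssc_p': [67, 79, 65, 56, 85, 55, 46, 82, 73, 58],
    'status': ['Placed', 'Placed', 'Placed', 'Not Placed', 'Placed',
               'Not Placed', 'Not Placed', 'Placed', 'Placed', 'Not Placed'],
})

# drop_first=True keeps one dummy column per binary variable
X_toy = pd.concat([pd.get_dummies(toy[['gender', 'workex']], drop_first=True),
                   toy[['ssc_p']]], axis=1)
y_toy = toy['status'].map({'Placed': 0, 'Not Placed': 1})  # "Not Placed" is the positive event

# test_size=0.2 gives the 4:1 train/test ratio
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=101)
print(X_toy.columns.tolist(), X_tr.shape, X_te.shape)
```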

In [None]:
import pandas as pd
import numpy as np
import math
import os
from joblib import dump, load


wd = os.path.abspath(os.getcwd())
mydat = pd.read_csv(wd + '/Placement_Data_Transformed.csv')

"""
1. Data Preparing
"""
from sklearn import model_selection
categorical_predictor = ['gender', 'specialisation', 'workex']
numeric_predictor = ['ssc_p', 'hsc_p', 'degree_p', 'etest_p_trans']
response = 'status'

X_dummy = pd.get_dummies(mydat.loc[:, categorical_predictor], drop_first=True)
X = pd.concat([X_dummy, mydat.loc[:, numeric_predictor]], axis=1)
Y = mydat.loc[:, response].map({'Placed':0, 'Not Placed':1})
train_X, test_X, train_Y, test_Y = model_selection.train_test_split(X, Y, test_size=0.2, random_state=101)

## C-2. Model Selection  
* In the section the following models would be compared:  

  1. Logistic regression
  2. Support Vector Machine
  3. Naive Bayes
  4. Decision Tree  
  
  Note that we tend to choose relatively simple models because we want the model to be explainable.
  
  
* Hyperparameters are chosen by 10-fold CV as the setting with the best **accuracy**, and the estimators that support it use **balanced** class weights. We use balanced weights because the dependent variable is imbalanced, although only mildly. 

* Select models based on metrics including:  
  
    1. ROC AUC
    2. F1 score
    3. PR curve
    4. Sensitivity
    5. Accuracy
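
To make the five metrics concrete, the sketch below computes them on four toy observations. The labels, scores and the 0.5 cut-off are assumptions chosen for illustration, not values from the analysis.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1])             # toy labels, 1 = positive event
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # toy predicted scores
y_pred = (y_score >= 0.5).astype(int)       # class prediction at a 0.5 cut-off

results = {
    # the first two use the continuous scores
    'ROC_AUC': metrics.roc_auc_score(y_true, y_score),
    'PR_score': metrics.average_precision_score(y_true, y_score),
    # the last three use the thresholded class predictions
    'F1_score': metrics.f1_score(y_true, y_pred),
    'sensitivity': metrics.recall_score(y_true, y_pred),
    'accuracy': metrics.accuracy_score(y_true, y_pred),
}
print(results)
```

ROC AUC and the PR score depend only on how the scores rank the observations, while F1, sensitivity and accuracy change with the chosen cut-off.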

In [None]:
"""
2. Model Selection

- Compare models including
    1. Logistic regression
    2. Support Vector Machine
    3. Naive Bayes
    4. Decision Tree

- Hyperparameters are obtained by 10-fold CV being the one
    * with the best 'accuracy', using 'balanced' class weights *

- Select models based on metrics including
    1. ROC AUC
    2. F1 score
    3. PR curve
    4. Sensitivity
    5. Accuracy

"""
from sklearn import linear_model
from sklearn import svm
from sklearn import naive_bayes
from sklearn.tree import DecisionTreeClassifier
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import metrics

n_cv = 10
n_Cs = 10
random_state = 101
max_iter = 300
scoring = 'accuracy'
class_weight = 'balanced'
cv_splitter = model_selection.KFold(n_splits=n_cv, shuffle=True, random_state=random_state)


def PerformanceCompare(class_true, class_predict, method, score_predict, exist_score=True):
    if exist_score:
        roc_auc = metrics.roc_auc_score(y_true=class_true, y_score=score_predict)
        pr_score = metrics.average_precision_score(y_true=class_true, y_score=score_predict)
    else:
        roc_auc = float('nan')
        pr_score = float('nan')
    f1_score = metrics.f1_score(y_true=class_true, y_pred=class_predict)
    sensitivity = metrics.recall_score(y_true=class_true, y_pred=class_predict)
    accuracy = metrics.accuracy_score(y_true=class_true, y_pred=class_predict)

    output = pd.DataFrame(np.array([roc_auc, pr_score, f1_score, sensitivity, accuracy]),
        index=['ROC_AUC', 'PR_score', 'F1_score', 'sensitivity', 'accuracy'], columns=method)
    return(output)


# Logistic Regression
def LRPerformance(X, Y):
    param_grid = {
        'C': [10**uu for uu in np.linspace(-4, 4, n_Cs)]
    }

    lr_model = linear_model.LogisticRegression(random_state=random_state, class_weight=class_weight, max_iter=max_iter)
    lr_cv = GridSearchCV(estimator=lr_model, param_grid=param_grid, cv=cv_splitter, scoring=scoring).fit(X,Y)
    Y_predict = lr_cv.predict(X)
    score_predict = lr_cv.predict_proba(X)[:,1]
    best_estimator = lr_cv.best_estimator_

    performance = PerformanceCompare(class_true=Y, class_predict=Y_predict, score_predict=score_predict, method=['LR'])
    return({'performance': performance, 'best_estimator': best_estimator})


# Support Vector Machine
def SVMPerformance(X, Y):
    #penalty = 'l1'
    #loss = 'hinge'
    param_grid = {
        'C' : [10**uu for uu in np.linspace(-4, 4, n_Cs)]
    }
    svm_model = svm.SVC(random_state=random_state, class_weight=class_weight)
    svm_cv = GridSearchCV(estimator=svm_model, param_grid=param_grid, cv=cv_splitter, scoring=scoring).fit(X, Y)
    Y_predict = svm_cv.predict(X)
    score_predict = svm_cv.decision_function(X)
    best_estimator = svm_cv.best_estimator_

    performance = PerformanceCompare(class_true=Y, class_predict=Y_predict, score_predict=score_predict, method=['SVM'])
    return({'performance': performance, 'best_estimator': best_estimator})


# Naive Bayes
def NBPerformance(X, Y):
    categorical_cols = ['gender_M', 'specialisation_Mkt&HR', 'workex_Yes']
    numerical_cols = ['ssc_p', 'hsc_p', 'degree_p', 'etest_p_trans']

    cate_nb = naive_bayes.BernoulliNB().fit(X.loc[:, categorical_cols], Y)
    conti_nb = naive_bayes.GaussianNB().fit(X.loc[:, numerical_cols], Y)
    prob = cate_nb.predict_proba(X.loc[:, categorical_cols]) * conti_nb.predict_proba(X.loc[:, numerical_cols])
    Y_predict = list(np.argmax(prob, axis=1))
    score_predict = prob[:,1]

    performance = PerformanceCompare(class_true=Y, class_predict=Y_predict, score_predict=score_predict, method=['NB'])
    return({'performance': performance, 'best_estimator': [cate_nb, conti_nb]})


# Decision Tree
def DTPerformance(X, Y):
    param_grid = {
        'max_leaf_nodes': range(3,15),
        'ccp_alpha': np.linspace(0, 0.05, 5)
    }
    dt_model = DecisionTreeClassifier(random_state=random_state, class_weight=class_weight)
    dt_cv = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=cv_splitter, scoring=scoring).fit(X, Y)
    Y_predict = dt_cv.predict(X)
    score_predict = dt_cv.predict_proba(X)[:,1]
    best_estimator = dt_cv.best_estimator_

    performance = PerformanceCompare(class_true=Y, class_predict=Y_predict, score_predict=score_predict, method=['DT'])
    return({'performance': performance, 'best_estimator': best_estimator})


lr_pb = LRPerformance(train_X,train_Y)
svm_pb = SVMPerformance(train_X, train_Y)
nb_pb = NBPerformance(train_X, train_Y)
dt_pb = DTPerformance(train_X, train_Y)

print(lr_pb['best_estimator'])
performance_table = pd.concat([lr_pb['performance'],
    svm_pb['performance'],
    nb_pb['performance'],
    dt_pb['performance']], axis=1)
performance_table = round(performance_table, 3)
print(performance_table)




From the table, which is computed on the training set, we can easily see that the logistic regression model excels on every metric we chose; it is also easy to explain. Therefore, we choose logistic regression as the final model.

## C-3. Select Polynomial Features  
After selecting our preferred model, we generated polynomial features and then did feature selection through 
1. Recursive feature elimination  

    *The number of variables remaining is decided by cross validation.*  


1. L1 penalty  
  
    *Variables whose coefficients have an absolute value under 10^-5 are discarded.*
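
The two selection methods can be sketched on synthetic data as follows. The data set from `make_classification`, the value `C=0.1` and the variable names are assumptions made up for this example, so the selected features have no connection to the placement data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# synthetic data: 10 features, only 3 of them informative
X_syn, y_syn = make_classification(n_samples=300, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=0)

# 1. recursive feature elimination, number of features chosen by cross validation
rfecv_syn = RFECV(LogisticRegression(solver='liblinear'), step=1,
                  cv=StratifiedKFold(5), scoring='roc_auc').fit(X_syn, y_syn)

# 2. L1 penalty: keep features whose absolute coefficient reaches the threshold
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
sfm_syn = SelectFromModel(l1_model, threshold=1e-5).fit(X_syn, y_syn)

print('RFECV keeps', rfecv_syn.n_features_, 'features:', rfecv_syn.support_)
print('L1 keeps', sfm_syn.get_support().sum(), 'features:', sfm_syn.get_support())
```

Both methods return a boolean mask over the input columns, which is what makes the side-by-side comparison table possible.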

In [None]:
"""
3. Select Polynomial Features

- CV to select best feature set from polynomial of the basic predictors.
- L1 penalty to eliminate redundant variables.
"""
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectFromModel
import matplotlib.pyplot as plt

poly = PolynomialFeatures(2)
train_poly = poly.fit_transform(train_X)
poly_name = list(poly.get_feature_names_out(train_X.columns))

stay_id= [i for i, x in enumerate(poly_name) if x not in ['gender_M^2','specialisation_Mkt&HR^2','workex_Yes^2']]
train_poly = train_poly[:,stay_id]
poly_name = [poly_name[i] for i in stay_id]

max_iter = 10000
lr_model = linear_model.LogisticRegression(C=0.36, random_state=random_state, class_weight=class_weight, max_iter=max_iter, solver='liblinear')
rfecv = RFECV(estimator = lr_model, step=1, cv=StratifiedKFold(5), scoring='roc_auc').fit(train_poly, train_Y)

plt.figure()
plt.xlabel('Number of features selected')
plt.ylabel('ROC_AUC score')
mean_scores = rfecv.cv_results_['mean_test_score']  # mean CV ROC AUC per number of features
plt.plot(range(1, len(mean_scores)+1), mean_scores)
plt.show()

rfe = RFE(estimator = lr_model, step=1, n_features_to_select=21).fit(train_poly, train_Y)
poly_variables_rfe = [poly_name[id] for id, x in enumerate(rfe.support_) if x]

# Lasso Feature Selection
param_grid = {
    'C': [10**x for x in np.linspace(-4,4,n_Cs)]
}


lr_model = linear_model.LogisticRegression(penalty='l1', random_state=random_state, class_weight=class_weight, max_iter=max_iter, solver='liblinear')
#lr_cv = GridSearchCV(lr_model, param_grid=param_grid, scoring='roc_auc').fit(train_poly, train_Y)
#importance = np.abs(lr_cv.best_estimator_.coef_)[0]

threshold = 1e-5
sfm = SelectFromModel(lr_model, threshold=threshold).fit(train_poly, train_Y)


poly_variables_l1 = [poly_name[id] for id, x in enumerate(sfm.get_support()) if x]

feature_selection_output = pd.DataFrame({
    'Feature Name': poly_name,
    'RFE': rfe.support_*1,
    'L1': sfm.get_support()*1
})
feature_selection_output.to_csv(wd+'/feature_selection_output.csv', index=False)

print(feature_selection_output)

# Set the final version of FEATURE SET
stay_id = [id for id, x in enumerate(poly_name) if x in poly_variables_rfe]
train_X = train_poly[:, stay_id]

test_poly = poly.transform(test_X)
stay_id = [id for id, x in enumerate(poly.get_feature_names_out(test_X.columns)) if x in poly_variables_rfe]
test_X = test_poly[:, stay_id]


**RFE:**  
The ROC AUC score becomes stable once more than 20 variables are chosen.  

**L1:**  
The number of variables remaining is 22.

The numbers of variables kept by these 2 methods are close. Compared to the L1-penalty method, the RFE method keeps fewer quadratic variables, which is easier for people to explain and understand; thus we use the feature set obtained by RFE.

## C-4. Hyperparameter Tuning  
Finally, we fine-tune the model and choose the best hyperparameter set.

In [None]:
"""
4. Hyperparameter Tuning

- Tune to the best hyperparameter
"""
param_grid = {
    'C': np.linspace(1e-5, 1e5, 20),
    'class_weight': ['balanced']
}

lr_model = linear_model.LogisticRegression(random_state=random_state, max_iter=max_iter, solver='liblinear')
lr_cv = GridSearchCV(lr_model, param_grid=param_grid, scoring='roc_auc').fit(train_X, train_Y)

dump(lr_cv, wd+'/final_model.joblib')




In [None]:
# Model Performance
Y_predict = lr_cv.predict(test_X)
Y_score = lr_cv.predict_proba(test_X)[:,1]

confusion_matrix_train = metrics.confusion_matrix(train_Y, lr_cv.predict(train_X))
confusion_matrix_train = pd.DataFrame(confusion_matrix_train, index=['Neg','Pos'], columns=['Pred_Neg','Pred_Pos'])

confusion_matrix_test = metrics.confusion_matrix(y_true=test_Y, y_pred=Y_predict)
#tn, fp, fn, tp = confusion_matrix.ravel()
confusion_matrix_test = pd.DataFrame(confusion_matrix_test, index=['Neg','Pos'], columns=['Pred_Neg','Pred_Pos'])

roc_auc = metrics.roc_auc_score(y_true=test_Y, y_score=Y_score)
average_precision = metrics.average_precision_score(y_true=test_Y, y_score=Y_score)
sensitivity = metrics.recall_score(y_true=test_Y, y_pred=Y_predict)
specificity = metrics.recall_score(y_true=test_Y, y_pred=Y_predict, pos_label=0)
accuracy = metrics.accuracy_score(y_true=test_Y, y_pred=Y_predict)
precision = metrics.precision_score(y_true=test_Y, y_pred=Y_predict)
precision_neg = metrics.precision_score(y_true=test_Y, y_pred=Y_predict, pos_label=0)



metrics.RocCurveDisplay.from_estimator(estimator=lr_cv.best_estimator_, X=test_X, y=test_Y)
plt.title('ROC Curve')
plt.show()

metrics.PrecisionRecallDisplay.from_estimator(estimator=lr_cv.best_estimator_, X=test_X, y=test_Y)
plt.title('Precision-Recall Curve')
plt.show()

print(lr_cv.best_estimator_)
print("best roc_auc:",lr_cv.best_score_)
print('confusion_matrix_train:\n',confusion_matrix_train,'\n')

print('confusion_matrix_test:\n', confusion_matrix_test)
print(pd.Series({
    'roc_auc': roc_auc, 'average_precision':average_precision, 
    'accuracy':accuracy, 'sensitivity':sensitivity, 'specificity':specificity,
    'precision':precision, 'precision_neg':precision_neg
}))

### Model Performance:  
* The chosen C, the inverse of the regularization strength, is quite large, which means that there is almost no penalty on the size of the coefficients.
* ROC-AUC and AP are quite good, near 0.85 and 0.83 respectively, showing that the model discriminates well between the two classes.
* The accuracy of the model is about 0.81, while the sensitivity (recall) is relatively small compared to the specificity. This means the model is worse at detecting the true positives than at detecting the true negatives.  
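
One thing worth keeping in mind when reading the large C: the tuning grid is linearly spaced between 1e-5 and 1e5, so almost every candidate is large. The sketch below compares it with a log-spaced grid over the same range; the log-spaced grid is only an assumption about what one could try, not what the analysis used.

```python
import numpy as np

linear_grid = np.linspace(1e-5, 1e5, 20)  # evenly spaced values, as in the tuning grid
log_grid = np.logspace(-5, 5, 20)         # evenly spaced exponents (an alternative, not used above)

# count how many candidates correspond to fairly strong regularization (C < 1)
print('linear grid, values below 1:', int((linear_grid < 1).sum()))
print('log grid, values below 1:   ', int((log_grid < 1).sum()))
print('second smallest candidate, linear vs log:', linear_grid[1], log_grid[1])
```

With the linear grid, the search effectively chooses between one very strongly regularized model and many nearly unregularized ones.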


In [None]:
# Feature Impact  
pd_coef = pd.DataFrame({'Feature Name': poly_variables_rfe, 'Coefficient': lr_cv.best_estimator_.coef_[0,:]})


feature_selection_output = pd.merge(feature_selection_output, pd_coef, left_on='Feature Name', right_on='Feature Name', how='outer')
print(feature_selection_output)

default_value = X.loc[:,['ssc_p','hsc_p','degree_p','etest_p_trans']].apply(lambda x: sum(x)/len(x), axis=0)
default_value = round(default_value, 0).tolist()

def makeSample(moving_variable=None, fixed_variable=None, default_value=default_value):
    """
    To generate sample with some variable values being fixed & ONE variable value moving. 
    The sample would be transformed to polynomial and be selected as the feature set we've selected.
    
    * moving_variable is a dict with only one element. The key is column name. The value could be a number or list containing multiple numbers.
    * fixed_variable is a dict containing one or multiple elements. The keys are column names. The values could only be a number.
    * The default values of the categoricals are 0, and those of the numericals are the rounded column means in default_value.
    """
    
    columns0=['gender_M','specialisation_Mkt&HR','workex_Yes','ssc_p','hsc_p','degree_p','etest_p_trans']
    values0 = [0,0,0]+default_value
    
    if fixed_variable is None:
        X = values0
    else:
        change_names = [x for x in fixed_variable.keys()] # would be a list even if theres only one element
        change_values = [x for x in fixed_variable.values()] # would be a list even if theres only one element
        change_id = [id for cn in change_names for id, x in enumerate(columns0) if x == cn]
        
        X = values0
        for ind in range(len(change_names)):
            X[change_id[ind]] = change_values[ind]
    
    
    if moving_variable is None:
        N=1
        moving_name = None
        moving_value = []
    else: 
        moving_name = [x for x in moving_variable.keys()][0]
        moving_value = [x for x in moving_variable.values()][0] # would have been a double list if not choosing the first element
        N = len(moving_value)
    
    
    X = np.array(X*N)
    X = np.reshape(X, (N,7))
    if moving_name is not None:
        # exact name match (a substring test on a string could match the wrong column)
        moving_id = columns0.index(moving_name)
        X[:,moving_id] = moving_value
    X = pd.DataFrame(X, columns=columns0)
    
    X_poly = poly.fit_transform(X)
    poly_name = poly.get_feature_names_out(X.columns)
    
    stay_id= [i for i, x in enumerate(poly_name) if x in poly_variables_rfe]
    X_poly = X_poly[:,stay_id]
        
    
    return(X_poly)


def makeComparisonSample(categoricals, numerical):
    output = []
    name = []
    for ind1 in [0,1]:
        name1 = "".join([categoricals[0], str(ind1)])
        for ind2 in [0,1]:
            name2 = "".join([categoricals[1], str(ind2)])
            for ind3 in [0,1]:
                name3 = "".join([categoricals[2], str(ind3)])
                name_temp = " ".join([name1, name2, name3])
                temp = makeSample(moving_variable={numerical:range(101)}, fixed_variable={categoricals[0]:ind1, categoricals[1]:ind2, categoricals[2]:ind3})
                output.append(temp)
                name.append(name_temp)
    return(output, name) # output is a list made of 8 arrays, one per combination of the 3 binary variables

        
def plotComparison(labels, categoricals, suptitle):
    
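    # Show the predicted probability for every combination of the categorical variables
    numerical_cols = ['ssc_p','hsc_p','degree_p','etest_p_trans']
    # one panel per numerical variable; one curve per gender/specialisation/work-experience combination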
    s = []
    s_name = []
    linestyle = ['solid']*(len(labels)+1)
    count = 0
    n_row = 2
    n_col = 2
    fig, axes = plt.subplots(nrows=n_row, ncols=n_col, figsize=(10,7))
    plt.subplots_adjust(hspace=0.7, wspace=0.3)
    for col in numerical_cols:
        s_temp, s_name_temp = makeComparisonSample(categoricals=categoricals, numerical=col)

        r_id = count//n_col
        c_id = count%n_col
        for ind in range(8):
            prob_temp = lr_cv.predict_proba(s_temp[ind])[:,0]

            axes[r_id, c_id].plot(range(101), prob_temp, linestyle=linestyle[ind], label=labels[ind])
        axes[r_id, c_id].set_ylim(-0.1,1.1)
        axes[r_id, c_id].set_yticks([0,.2,.4,.6,.8,1])
        axes[r_id, c_id].hlines(y=0.5, xmin=0, xmax=100, linestyles='dashed', colors='gray', linewidth=1)
        axes[r_id, c_id].set_ylabel('Probability')
        axes[r_id, c_id].set_xlabel(col)
        #axes[r_id, c_id].set_title('Prob. of being PLACED')
        s.append(s_temp)
        s_name.append(s_name_temp)
        count+=1
    
    lines, labels = fig.axes[-1].get_legend_handles_labels()
    fig.legend(lines, labels, loc = 'right')
    fig.suptitle(suptitle)
    return(fig, axes)



# Create Labels   
label_grid = {'Gender':['Female','Male'], 'Specialization':['Fin','HR'], 'WorkExperience':["Haven't Worked", "Have Worked"]}
grid = model_selection.ParameterGrid(label_grid)
labels = []
for g in grid:
    temp ="/".join(list(g.values()))
    labels.append(temp)

plotComparison(labels=labels, categoricals=['gender_M','specialisation_Mkt&HR', 'workex_Yes'], suptitle='Prob. of being PLACED')    
plt.show()




### Feature Impact:  
> Note that the positive event means **Not Placed** in our analysis. The larger the coefficient is, the less likely it is that someone gets a job.  
For example, the coefficient of ssc_p is -0.23, which can be interpreted as follows: a 1-unit increase in ssc_p **RAISES** the log odds of "**getting a job**" versus "**not getting a job**" by 0.23.

> The categorical variables, including gender, specialization and work experience, only take the values 0 and 1, so their effects on the model output can't be larger than the coefficients themselves, while the numerical predictors can go up to a maximum of 100, which allows their effects to reach dozens of times their own coefficients.  
For example, suppose 2 people differ in ssc_p by 60. Then their difference in the log odds of getting a job is 0.23\*60=13.8. However, the difference in log odds between people of different gender would be only 7.3, which is no larger than the gender coefficient itself and is small compared to what ssc_p can contribute.  
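
To make the log-odds interpretation concrete, the sketch below converts the quoted ssc_p coefficient into an odds ratio and a probability change. It looks at the main effect only and ignores the polynomial interaction terms; the baseline of even odds is an assumption chosen for illustration.

```python
import math

coef_ssc_p = -0.23  # coefficient quoted above; positive event = "Not Placed"

# flipping the sign gives the change in the log odds of being placed per 1 unit of ssc_p
log_odds_change_placed = -coef_ssc_p
odds_ratio_per_unit = math.exp(log_odds_change_placed)

def sigmoid(z):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical baseline: a candidate whose log odds of being placed are 0 (probability 0.5)
p_before = sigmoid(0.0)
p_after = sigmoid(0.0 + log_odds_change_placed)
print(round(odds_ratio_per_unit, 3), p_before, round(p_after, 3))
```

The same change in log odds moves the probability less when the baseline is already close to 0 or 1, which is why the curves in the graph flatten at the extremes.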

Here is what the graph shows:
  
Basically, the effects of ssc_p, hsc_p, degree_p and etest_p_trans on the probability of getting a job are positive, except in some cases: 
* For females who haven't worked, and for males who haven't worked and specialize in HR, the employment test grade has a negative impact.
* For females who haven't worked and specialize in HR, the secondary school grade has barely any impact.  
  
With the other variables held at the same mean values, males usually have a better chance of being placed than females when their grades are above average, which confirms the trend we observed in the EDA.
  
However, the specialization doesn't affect the chance as much as we thought. In some cases the curves are similar; in other cases people with the HR specialization have a higher chance than those who specialize in Finance.  
  
Work experience has a clearly positive impact on the chance of being placed.  
  
The effect of the secondary school grade is quite consistent and influential in every situation except for one case.  
  
The effect of the employment test grade seems erratic: the trend goes up in some situations and down in others.
  
The effects of the high school grade and the college grade are similarly strong. 
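
The curves discussed above come from moving one predictor over its range while the others stay fixed. The sketch below shows that idea on a synthetic two-variable model; the data, the coefficients and the names `grade` and `worked` are assumptions made up for illustration, and here the positive class is "placed", unlike in the analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
grade = rng.uniform(0, 100, n)        # hypothetical numeric predictor
worked = rng.integers(0, 2, n)        # hypothetical binary predictor
true_logit = 0.08 * (grade - 50) + 1.0 * worked
placed = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

model = LogisticRegression(max_iter=1000).fit(np.column_stack([grade, worked]), placed)

# move `grade` over 0..100 while `worked` stays fixed at 0 or 1
grade_grid = np.arange(101)
curves = {}
for w in (0, 1):
    sample = np.column_stack([grade_grid, np.full(101, w)])
    curves[w] = model.predict_proba(sample)[:, 1]   # P(placed) along the grid

print(curves[0][[0, 50, 100]].round(3), curves[1][[0, 50, 100]].round(3))
```

Without interaction terms the curves for the two groups never change direction; the up and down trends in the graph above can only appear because the final model includes polynomial interactions.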

# Reference  

https://www.orfonline.org/research/literacy-in-india-the-gender-and-age-dimension-57150/