### Recap

We have our random forest model with tuned parameters. Let's do some feature engineering.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import Image
from sklearn.model_selection import train_test_split

import stackoverflow_helper as soh
import dictionaries as look

In [2]:
raw_import = pd.read_csv('/Users/pang/repos/stack-overflow-survey/_data/output', index_col='Respondent')

In [3]:
y_big = raw_import['JobSat']
X_big = raw_import.drop(columns='JobSat')
X_train_big, X_test_final, y_train_big, y_test_final = train_test_split(
    X_big, y_big, test_size=0.20, random_state=4444)

y = y_train_big
X = X_train_big
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=4444)
df = pd.DataFrame(y_train).merge(X_train, on='Respondent')
df_te = pd.DataFrame(y_test).merge(X_test, on='Respondent')

And as a reminder, here are the baselines from previous notebooks.

| Metrics       | Baseline(5 fts) | All features | After tuning |
| ------------: | :-------------: | :----------: | :----------: |
| **ROC AUC**   | 0.671741        | 0.665168     | 0.744265     |
| **Accuracy**  | 0.619717        | 0.667045     | 0.691969     |
| **Precision** | 0.698451        | 0.728230     | 0.679767     |
| **Recall**    | 0.937549        | 0.794274     | 0.987484     |
| **F1**        | 0.800528        | 0.759820     | 0.806109     |

### Feature Engineering

**Amplify bad managers with those who report toxic work environment and lack of support from management and no input into the technology**

Re-run model with old data using the parameters we selected as our baslines

In [4]:
soh.test_model(X_train, y_train, X_test, y_test)

{'roc_auc': 0.7455572379007773,
 'accuracy': 0.6866431637027812,
 'precision': 0.6775076874279609,
 'recall': 0.9886858594187498,
 'f1': 0.804159967142672}

Create a feature that gets exponentially larger the more complaints the respondent has about their work environment:

In [5]:
plus_tr = pd.DataFrame()
plus_te = pd.DataFrame()

plus_tr['f_bad_mgr_score'] = (4 - df['MgrIdiot']) ** ((df['WorkChallenge_Lack of support from management']\
                                                       + df['WorkChallenge_Toxic work environment']\
                                                       + df['PurchaseHow']\
                                                       + df['PurchaseWhat']\
                                                       + 1))
plus_te['f_bad_mgr_score'] = (4 - df_te['MgrIdiot']) ** ((df_te['WorkChallenge_Lack of support from management']\
                                                          + df_te['WorkChallenge_Toxic work environment']\
                                                          + df_te['PurchaseHow']\
                                                          + df_te['PurchaseWhat']\
                                                          + 1))

r_mgr = (X_train).merge(plus_tr, on='Respondent')
r_mgr_te = (X_test).merge(plus_te, on='Respondent')

soh.test_model(r_mgr, y_train, r_mgr_te, y_test)

{'roc_auc': 0.7504356282780374,
 'accuracy': 0.7031832252202087,
 'precision': 0.7017585649581797,
 'recall': 0.9698251369493548,
 'f1': 0.8124270091023382}

There is enough improvement to keep the changes.

**Desire to work with technology**

There is a lot of data around technology respondants want to work with. What if the general desire to work with tech leads to higher JobSat?

In [6]:
new_tech = soh.get_sum_of_tech(r_mgr)
mgr_tec = r_mgr.merge(new_tech, left_index=True, right_index=True)

new_tech_t = soh.get_sum_of_tech(r_mgr_te)
mgr_tec_t = r_mgr.merge(new_tech_t, left_index=True, right_index=True)

In [7]:
soh.test_model(r_mgr, y_train, r_mgr_te, y_test)

{'roc_auc': 0.7465895773839281,
 'accuracy': 0.7034115873562217,
 'precision': 0.7052647354120543,
 'recall': 0.9699965165551816,
 'f1': 0.8114663828444876}

Not much of an improvement so we will drop this line of questioning.

**Review feature importance**

As we can see from below, the new feature related to how upset they are at the company is a fairly important feature.

In [8]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

In [9]:
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(mgr_tec, y_train)
importance_scores = clf.feature_importances_  

feature_importance = pd.DataFrame(importance_scores.reshape(1,372), columns = mgr_tec.columns).T
sorted_feats = feature_importance.sort_values(by=0, ascending=False)
sorted_feats.head(5)

Unnamed: 0,0
MgrIdiot,0.043846
f_bad_mgr_score,0.017253
WorkChallenge_Lack of support from management,0.012404
WorkChallenge_Toxic work environment,0.009818
PurchaseHow,0.008743


Here are some metrics showing performance of some other models reviewed. Not all were saved for this notebook, but I've summarized my analysis below:

- Just MgrIdiot: I wanted to see how effective just using the `MgrIdiot` field would be. We actually don't lose a lot here in terms of ROC AUC. We also actually get pretty good recall.
- All Fields: This is our baseline.
- MgrId/f_bad: This is the same data as the first column, but with the engineered feature `f_bad_mgr_score`. 
- Work Challenge: Model with just `MgrIdiot`, `WorkChallenge_Lack of support from management` and `WorkChallenge_Toxic work environment` had on the model.
- All Mgrfields: Same as above but with `PurchaseHow` and `PurchaseWhat`.

| Metrics | Just MgrIdiot | All Fields | MgrId/f_bad | Work Challenge | All Mgrfields | 
| ----: | :----: | :----: | :----: | :----: | :----: | 
| **ROC AUC** | 0.6816 |  0.7423 | 0.6991 | 0.7068 | 0.7066 | 
| **Accuracy** | 0.7158 | 0.6906 | 0.7141 | 0.7114 | 0.7141 | 
| **Precision** | 0.7196 | 0.6826 | 0.7234 | 0.7234 | 0.7234 | 
| **Recall** | 0.9338 | 0.9857 | 0.9240 | 0.9180 | 0.9129 | 
| **F1** | 0.8128 | 0.8049 | 0.8091 | 0.8101 | 0.8091 | 

Saving the new csv for future use.

In [10]:
temp_df = pd.DataFrame()

temp_df['f_bad_mgr_score'] = (4 - raw_import['MgrIdiot']) ** ((raw_import['WorkChallenge_Lack of support from management']\
                                                       + raw_import['WorkChallenge_Toxic work environment']\
                                                       + raw_import['PurchaseHow']\
                                                       + raw_import['PurchaseWhat']\
                                                       + 1))

for_output = raw_import.merge(temp_df, on='Respondent')

In [11]:
# for_output.to_csv('/Users/pang/repos/stack-overflow-survey/_data/output_ft.csv')