## Dealing with a Large Number of Features

Human resources are critical to any organization. Organizations spend a huge amount of time and money hiring and nurturing their employees, so it is a major loss when employees leave, especially key ones. If HR can predict whether employees are at risk of leaving the company, it can identify attrition risks, provide the support needed to retain those employees, or hire preventively to minimize the impact on the organization.

This dataset is taken from Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics

Fields in the dataset include:

- Employee satisfaction level
- Last evaluation
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Department (stored in the `sales` column)
- Salary
- Whether the employee has left


In [None]:
import sklearn
print(sklearn.__version__)

1.0.2


In [None]:
import pandas as pd
import numpy as np

In [None]:
hr_df = pd.read_csv('https://drive.google.com/uc?export=download&id=1XwDeBvO2VtO7z6TXifQTDJfsDeU3Lw0x')

In [None]:
hr_df.head(10)

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low
5,0.41,0.5,2,153,3,0,1,0,sales,low
6,0.1,0.77,6,247,4,0,1,0,sales,low
7,0.92,0.85,5,259,5,0,1,0,sales,low
8,0.89,1.0,5,224,5,0,1,0,sales,low
9,0.42,0.53,2,142,3,0,1,0,sales,low


In [None]:
hr_df.shape

(14999, 10)

In [None]:
hr_df['left'].value_counts()

0    11428
1     3571
Name: left, dtype: int64

In [None]:
hr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   sales                  14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


## Encoding Categorical variables

**Note**: We use `pd.get_dummies()` here for a quick demonstration of feature selection. In a real-world implementation, consider scikit-learn's `OneHotEncoder` (OHE), which is fitted on the training data and can then apply the same encoding to new data.

In [None]:
encoded_hr_df = pd.get_dummies( hr_df,
                                columns = ['Work_accident', 'promotion_last_5years', 'sales', 'salary'])

In [None]:
encoded_hr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   satisfaction_level       14999 non-null  float64
 1   last_evaluation          14999 non-null  float64
 2   number_project           14999 non-null  int64  
 3   average_montly_hours     14999 non-null  int64  
 4   time_spend_company       14999 non-null  int64  
 5   left                     14999 non-null  int64  
 6   Work_accident_0          14999 non-null  uint8  
 7   Work_accident_1          14999 non-null  uint8  
 8   promotion_last_5years_0  14999 non-null  uint8  
 9   promotion_last_5years_1  14999 non-null  uint8  
 10  sales_IT                 14999 non-null  uint8  
 11  sales_RandD              14999 non-null  uint8  
 12  sales_accounting         14999 non-null  uint8  
 13  sales_hr                 14999 non-null  uint8  
 14  sales_management      

## Split Dataset

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_df, test_df = train_test_split(encoded_hr_df,
                                     train_size = 0.8,
                                     random_state = 100)

In [None]:
x_features = list(train_df.columns)

In [None]:
x_features.remove('left')

## L1 Based Feature Selection

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data for use with another classifier, they can be combined with `SelectFromModel` to keep only the features with non-zero coefficients.
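
A minimal sketch of that combination follows, using synthetic data from `make_classification` as a stand-in for the HR features (the data, the names `X_demo` and `y_demo`, and the value `C=0.05` are assumptions made for illustration). With an L1-penalized estimator and no explicit threshold, `SelectFromModel` keeps the features whose coefficients are effectively non-zero.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 20 features, only 4 of them informative.
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=4, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)

# L1-penalized model; liblinear is one of the solvers that supports 'l1'.
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear")

# Fit the model and keep only features with non-zero coefficients.
selector = SelectFromModel(l1_model).fit(X_demo, y_demo)
X_reduced = selector.transform(X_demo)

print(X_demo.shape, "->", X_reduced.shape)
```

The reduced matrix `X_reduced` can then be passed to any other classifier.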

In [None]:
from sklearn.linear_model import LogisticRegression

- **C, default=1.0** - Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
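
To make the effect of `C` concrete, the sketch below counts the non-zero coefficients of an L1-penalized logistic regression for several values of `C`. The synthetic data and the grid of `C` values are assumptions chosen for illustration; the exact counts depend on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the HR dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=4, n_redundant=0,
                                     n_clusters_per_class=1, random_state=0)

nonzero_counts = {}
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X_demo, y_demo)
    # Smaller C means stronger regularization, so fewer coefficients survive.
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))

for C, n in nonzero_counts.items():
    print(f"C={C:<6} non-zero coefficients: {n}")
```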


In [None]:
logreg = LogisticRegression( penalty = 'l1', C = .1, solver = 'liblinear' )

In [None]:
logreg.fit(train_df[x_features], train_df['left'])

LogisticRegression(C=0.1, penalty='l1', solver='liblinear')

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report( test_df['left'], 
                            logreg.predict(test_df[x_features])))

              precision    recall  f1-score   support

           0       0.81      0.93      0.87      2250
           1       0.63      0.34      0.44       750

    accuracy                           0.79      3000
   macro avg       0.72      0.64      0.65      3000
weighted avg       0.77      0.79      0.76      3000



In [None]:
l1_selection_df = pd.DataFrame( {"features": x_features,
                                 "coef": np.round(logreg.coef_[0], 2)} )

In [None]:
l1_selection_df[l1_selection_df.coef == 0.0]

Unnamed: 0,features,coef
3,average_montly_hours,0.0
5,Work_accident_0,0.0
7,promotion_last_5years_0,0.0
11,sales_accounting,0.0
14,sales_marketing,0.0
16,sales_sales,0.0
17,sales_support,0.0
21,salary_medium,-0.0
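
One caveat on the check above: the coefficients are rounded to two decimals before being compared with zero, so a coefficient that is small only because its feature is measured on a large scale is also reported as `0.0`. The sketch below illustrates this on synthetic data (an assumption made for illustration; `X_demo`, `y_demo`, and the scale factor of 1000 are hypothetical), where an informative feature gets a tiny but non-zero coefficient.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
z = rng.normal(size=1000)                           # informative signal
y_demo = (z + rng.normal(size=1000) > 0).astype(int)
X_demo = np.column_stack([1000 * z,                 # same signal, large scale
                          rng.normal(size=1000)])   # pure noise

model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
model.fit(X_demo, y_demo)

raw = model.coef_[0]
print("raw coefficients:   ", raw)
print("rounded to 2 places:", np.round(raw, 2))
print("exactly zero:       ", raw == 0)
```

Comparing the unrounded coefficients with zero, or standardizing the features before fitting, are two possible ways to separate coefficients that the L1 penalty actually removed from coefficients that are merely small.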


## Sequential Feature Selection

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier( max_depth = 10 )

In [None]:
sfs = SequentialFeatureSelector(tree, n_features_to_select=10)

In [None]:
sfs.fit(train_df[x_features], train_df['left'])

SequentialFeatureSelector(estimator=DecisionTreeClassifier(max_depth=10),
                          n_features_to_select=10)

In [None]:
sfs_features = [feature for feature, selected in zip(x_features, sfs.support_) if selected]

In [None]:
sfs_features

['satisfaction_level',
 'last_evaluation',
 'number_project',
 'average_montly_hours',
 'time_spend_company',
 'sales_accounting',
 'sales_sales',
 'sales_technical',
 'salary_high',
 'salary_medium']

## Embedded Methods

- Embedded methods perform feature selection as part of model training: the learning algorithm scores features with a statistical criterion, e.g. information gain, and the subset of features with the highest importance is then selected.

- Unlike RFE, embedded methods do not refit the model iteratively.
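
As a hedged sketch of this idea, the example below fits a single decision tree once and lets `SelectFromModel` keep the features whose importance is at least the mean importance. The synthetic data, the names `X_demo`, `y_demo`, and `tree_demo`, and the `"mean"` threshold are assumptions made for illustration, not part of the analysis in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the HR dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# One fit is enough: importances come out of the trained tree itself.
tree_demo = DecisionTreeClassifier(max_depth=5, random_state=0)
selector = SelectFromModel(tree_demo, threshold="mean").fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_
selected = np.flatnonzero(selector.get_support())

print("importances:", np.round(importances, 3))
print("selected feature indices:", selected)
```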

In [None]:
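from sklearn.tree import DecisionTreeClassifier

# Note: despite the `rf_reg` name, this is a single decision tree classifier,
# not a random forest or a regressor.
rf_reg = DecisionTreeClassifier(max_depth = 5, criterion = 'gini')
rf_reg.fit(train_df[x_features], train_df['left'])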

DecisionTreeClassifier(max_depth=5)

In [None]:
features_rf_imp = pd.DataFrame({"features": list(x_features),
                                "importance": rf_reg.feature_importances_})
features_rf_imp = features_rf_imp.sort_values("importance", ascending=False).reset_index()
features_rf_imp

Unnamed: 0,index,features,importance
0,0,satisfaction_level,0.520572
1,4,time_spend_company,0.158379
2,1,last_evaluation,0.150904
3,2,number_project,0.102812
4,3,average_montly_hours,0.066148
5,17,sales_support,0.001089
6,18,sales_technical,9.7e-05
7,13,sales_management,0.0
8,20,salary_low,0.0
9,19,salary_high,0.0


In [None]:
features_rf_imp['cumsum'] = features_rf_imp.importance.cumsum()
features_rf_imp

Unnamed: 0,index,features,importance,cumsum
0,0,satisfaction_level,0.520572,0.520572
1,4,time_spend_company,0.158379,0.678951
2,1,last_evaluation,0.150904,0.829854
3,2,number_project,0.102812,0.932667
4,3,average_montly_hours,0.066148,0.998815
5,17,sales_support,0.001089,0.999903
6,18,sales_technical,9.7e-05,1.0
7,13,sales_management,0.0,1.0
8,20,salary_low,0.0,1.0
9,19,salary_high,0.0,1.0



## Recursive Feature Elimination (RFE)

- RFE uses a machine learning algorithm as a black-box evaluator to find the best subset of features, so the result depends on the chosen estimator.
- It trains the model iteratively and each time removes the least important feature, using the model's coefficients or feature importances as the criterion.
- It is a multivariate method in the sense that it evaluates the relevance of several features considered jointly.
- When used as a ranker, in each iteration, the feature that is removed is added into a stack until all features are tested.
- More than one feature can be removed at a single step for computational efficiency.
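
The elimination loop described in the bullets above can be sketched by hand as follows. This is a simplified illustration on synthetic data (an assumption; `X_demo`, `y_demo`, and `n_keep` are hypothetical names), not scikit-learn's actual `RFE` implementation, which also records rankings and supports removing several features per step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the HR dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

remaining = list(range(X_demo.shape[1]))  # column indices still in play
eliminated = []                           # removal order, first to last
n_keep = 3

while len(remaining) > n_keep:
    # Refit on the surviving features only.
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X_demo[:, remaining], y_demo)
    # Drop the feature the current model considers least important.
    weakest = int(np.argmin(model.feature_importances_))
    eliminated.append(remaining.pop(weakest))

print("kept:", remaining)
print("eliminated (first to last):", eliminated)
```

Features eliminated later survived more rounds, which is the intuition behind the ranks that `RFE` reports.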

In [None]:
from sklearn.feature_selection import RFE

In [None]:
rfe_selector = RFE(tree, 
                   n_features_to_select=5, 
                   step=1, 
                   verbose=1)
rfe_selector.fit(train_df[x_features], train_df['left'])

Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.


RFE(estimator=DecisionTreeClassifier(max_depth=10), n_features_to_select=5,
    verbose=1)

In [None]:
features_rfe = pd.DataFrame({"features": list(x_features),
                             "rank": rfe_selector.ranking_})
features_rfe.sort_values("rank", ascending=True)

Unnamed: 0,features,rank
0,satisfaction_level,1
1,last_evaluation,1
2,number_project,1
3,average_montly_hours,1
4,time_spend_company,1
17,sales_support,2
9,sales_IT,3
6,Work_accident_1,4
18,sales_technical,5
19,salary_high,6
