## <u>Introduction:</u> <br>
Broadly, there are __three__ categories of methods for __Feature Selection__. <br>
1. Filter Methods
2. Embedded Methods
3. Wrapper Methods.
<br>

In this Jupyter notebook I will be using __Random Forest__ as an __Embedded Method__, __Mutual Information__ as a __Filter Method__ and __Recursive Feature Elimination__ as a __Wrapper Method__. 
<br>
<br>
<a href='https://towardsdatascience.com/why-how-and-when-to-apply-feature-selection-e9c69adfabf2'>For more details</a>

## Imports #

In [2]:
import pandas as pd 
import numpy as np 
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix
from sklearn.feature_selection import mutual_info_classif,RFE,RFECV,VarianceThreshold

## Methods:

In [3]:
def validation_metrics(model, test_X,test_Y):
    # 01 predict using model #
    prediction = model.predict(test_X)
    # 02 accuracy #
    accuracy = accuracy_score(test_Y, prediction)*100
    # 03 precision #
    precision = precision_score(test_Y, prediction,pos_label=1,labels=[1,0])*100
    # 04 recall #
    recall = recall_score(test_Y, prediction,pos_label=1,labels=[1,0])*100
    # 05 confusion Matrix #
    cm = confusion_matrix(test_Y, prediction,labels=[1,0])
    return accuracy, precision, recall, cm

In [4]:
def split_data(data, X,y):
    train_X,test_X,train_Y, test_Y = train_test_split(X,y)
    return train_X,test_X,train_Y, test_Y

In [5]:
def train_model(model,train_X,train_Y):
    model.fit(train_X,train_Y)
    return model

In [6]:
def separate_data(dt):
    defaulters = dt[dt['default payment next month']==1]
    non_defaulters = dt[dt['default payment next month']==0]
    return defaulters,non_defaulters

In [7]:
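def clean_payment_status(defaulters, non_defaulters, payment_status):
    # work on copies so the slices returned by separate_data are not modified in place
    defaulters, non_defaulters = defaulters.copy(), non_defaulters.copy()
    valid = [-1,1,2,3,4,5,6,7,8,9]
    for status in payment_status:
        key = 'updated_' + status
        # invalid codes become 2 for defaulters and -1 for non-defaulters
        defaulters[key] = defaulters[status].apply(lambda x: x if x in valid else 2)
        non_defaulters[key] = non_defaulters[status].apply(lambda x: x if x in valid else -1)
    return defaulters, non_defaulters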


## Step 01 Fetch the data 

<u>Facts About Data </u>: 
1. This research examined the case of customers' default payments in Taiwan. 
<br> 
2. This research employed a binary variable, __default payment (Yes = 1, No = 0), as the response variable__.
<br>
3. This study reviewed the literature and used the following __23 variables as explanatory variables__.
<br>
4. __Limit of Balance__: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
<br>
5. __Gender__: (1 = male; 2 = female).
<br>
6. __Education__: (1 = graduate school; 2 = university; 3 = high school; 4 = others).
<br>
7. __Marital status__ (1 = married; 2 = single; 3 = others).
<br>
8. __Age__ (year).
<br>
9. __PAY_0 till PAY_6__: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
<br>
10. __BILL_AMT1 till BILL_AMT6__: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
<br>
11. __PAY_AMT1-PAY_AMT6__: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
<br>
<br>
<a href='https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients'> For more Information </a>

In [8]:
dt = pd.read_excel('default of credit card clients.xls', skiprows=1)

In [9]:
dt['default payment next month'].value_counts()/len(dt)*100

0    77.88
1    22.12
Name: default payment next month, dtype: float64

__Information__: __Defaulters__ make up about __22%__ of the data, whereas __non-defaulters__ make up about __78%__. 
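
With a class split of roughly 22/78, a plain random split can leave the train and test sets with slightly different default rates. The sketch below shows one way to keep the ratio equal in both parts. It assumes scikit-learn's `stratify` and `random_state` arguments of `train_test_split`, neither of which the notebook's `split_data` uses, and it runs on synthetic stand-in data rather than the real file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 22% positives (assumed for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 220 + [0] * 780)

# stratify=y_demo keeps the class ratio the same in both parts;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train default rate: {train_rate:.3f}, test default rate: {test_rate:.3f}")
```

Fixing `random_state` also makes the metrics of two runs comparable, since both are then evaluated on the same test rows.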

## Step 02 Feature Selection

### 1. Embedded Methods
__Description__: Feature selection can also be achieved using the insights provided by some Machine Learning models.

i.<u>  Random Forest:</u>

a. Splitting the dataset:

In [10]:
X = dt.drop(['ID','default payment next month'],axis=1)
y = dt['default payment next month']
train_X,test_X,train_Y, test_Y = split_data(dt,X,y)

b. Train Model: <br>
Here we train the model on the complete data (all features) so that we can see the __Feature Importance__ according to __Random Forest__.

In [11]:
model = RandomForestClassifier(n_estimators=100)
trained_model = train_model(model,train_X,train_Y)

c. Cross Validation of the model:

In [16]:
cross_val_score(estimator=model, X = train_X, y=train_Y,cv=5 , n_jobs=-1)*100

array([81.69295712, 81.29304599, 81.84444444, 81.50700156, 81.48477439])
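
When `cross_val_score` is called on a classifier without a `scoring` argument, it reports accuracy, which says little about how well the minority class is found when the classes are split about 78/22. As a hedged sketch, `cross_validate` can report several metrics per fold. The data here is synthetic and the variable names are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data (assumed): roughly 78% class 0 and 22% class 1.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.78, 0.22], random_state=0
)

cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=5,
    scoring=['accuracy', 'precision', 'recall'],
)

# One array of 5 fold scores per metric, keyed as 'test_<metric>'.
for metric in ('accuracy', 'precision', 'recall'):
    print(metric, cv_results[f'test_{metric}'].mean() * 100)
```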

d. Validation of Model:

In [17]:
accuracy, precision, recall, cm = validation_metrics(trained_model,test_X,test_Y)

i. __accuracy__:

In [18]:
accuracy

81.73333333333333

ii. __precision__:

In [19]:
precision

64.16309012875536

iii. __recall__:

In [20]:
recall

36.59730722154223

iv. __confusion matrix__:

In [21]:
cm

array([[ 598, 1036],
       [ 334, 5532]])
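
The three scores above can be recomputed from the confusion matrix alone. Because `validation_metrics` passes `labels=[1,0]`, the first row and column refer to class 1 (defaulters). The sketch below copies the matrix printed above; the `*_demo` variable names are made up for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above; rows are actual classes,
# columns are predicted classes, both in the order [1, 0].
cm_demo = np.array([[598, 1036],
                    [334, 5532]])

tp, fn = cm_demo[0]   # actual defaulters: predicted 1, predicted 0
fp, tn = cm_demo[1]   # actual non-defaulters: predicted 1, predicted 0

accuracy_demo = (tp + tn) / cm_demo.sum() * 100
precision_demo = tp / (tp + fp) * 100
recall_demo = tp / (tp + fn) * 100

print(round(accuracy_demo, 2), round(precision_demo, 2), round(recall_demo, 2))
```

Reading the matrix this way shows that 1036 of the 1634 actual defaulters in the test set were missed, which is why recall is low even though accuracy is above 81%.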

e. What is the feature importance according to __Random Forest__? 

In [22]:
pd.Series(trained_model.feature_importances_, index=X.columns.values).sort_values(ascending=False)*100

PAY_0        9.876261
AGE          6.594053
BILL_AMT1    6.039220
LIMIT_BAL    5.961639
BILL_AMT2    5.469988
BILL_AMT3    5.157291
PAY_AMT1     5.152024
BILL_AMT5    5.026638
BILL_AMT4    4.981807
BILL_AMT6    4.962905
PAY_AMT2     4.826063
PAY_AMT3     4.643719
PAY_AMT6     4.597552
PAY_AMT5     4.398325
PAY_AMT4     4.355377
PAY_2        4.135263
PAY_3        2.614682
PAY_4        2.493960
PAY_5        2.065524
EDUCATION    2.036881
PAY_6        2.004002
MARRIAGE     1.384056
SEX          1.222769
dtype: float64

__Conclusion__: According to Random Forest, __PAY_0__, __AGE__, __BILL_AMT1__, __LIMIT_BAL__ and __BILL_AMT2__ are the most significant features. 
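
The top features can also be picked programmatically instead of being read off the listing. This is a minimal sketch: `top_k_features` is a hypothetical helper that is not part of the notebook, and the demo values are a handful copied from the listing above.

```python
import pandas as pd

def top_k_features(importances, columns, k=5):
    """Return the k column names with the largest importance (hypothetical helper)."""
    ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
    return ranked.head(k).index.tolist()

# A few values copied from the listing above, for illustration only.
demo_importances = [9.876261, 6.594053, 6.039220, 5.961639, 5.469988, 1.222769]
demo_columns = ['PAY_0', 'AGE', 'BILL_AMT1', 'LIMIT_BAL', 'BILL_AMT2', 'SEX']

selected = top_k_features(demo_importances, demo_columns, k=5)
print(selected)
```

In the notebook it could be called as `top_k_features(trained_model.feature_importances_, X.columns)`.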

f. Run the __Random Forest__ using __top 5 features__ from features suggested by __feature importance__ of __Random Forest__

In [23]:
X = dt[['PAY_0','AGE','BILL_AMT1','LIMIT_BAL','BILL_AMT2']]
y = dt['default payment next month']
train_X,test_X,train_Y, test_Y = split_data(dt,X,y)

b. Train Model: <br>
Here we are training the model with __selected features suggested by Random Forest feature importance__. 

In [24]:
model = RandomForestClassifier(n_estimators=100)
trained_model = train_model(model,train_X,train_Y)

c. Cross Validation of the model:

In [25]:
cross_val_score(estimator=model, X = train_X, y=train_Y,cv=5 , n_jobs=-1)*100

array([80.11552988, 80.69317929, 80.04444444, 80.28450767, 81.06245832])

d. Validation of Model:

In [26]:
accuracy, precision, recall, cm = validation_metrics(trained_model,test_X,test_Y)

i. __accuracy__:

In [27]:
accuracy

80.14666666666666

ii. __precision__:

In [28]:
precision

60.082730093071355

iii. __recall__:

In [29]:
recall

34.50118764845605

iv. __confusion matrix__:

In [30]:
cm

array([[ 581, 1103],
       [ 386, 5430]])

__Conclusion__: Model metrics __don't improve after using the selected features__ (accuracy, precision and recall are all slightly lower), so we turn to a filter method to look for a better set of features.

---

----

### 2. Filter Methods 
__Description__:
Filter Methods consider the relationship between each feature and the target variable to compute the importance of features.

i.<u>  Mutual Information:</u><br>

_Before Data Cleaning_:

In [31]:
X = dt.drop(['default payment next month'],axis=1)
y = dt['default payment next month']

In [32]:
mutual_info = mutual_info_classif(X,y)

In [33]:
pd.Series(mutual_info, index=X.columns).sort_values(ascending=False)*100

PAY_0        7.701669
PAY_2        4.462218
PAY_3        4.014667
PAY_4        3.428443
PAY_5        3.227407
PAY_6        3.041289
PAY_AMT1     2.340312
PAY_AMT3     1.751627
PAY_AMT5     1.607144
PAY_AMT4     1.593538
PAY_AMT2     1.440210
LIMIT_BAL    1.200238
PAY_AMT6     1.074484
BILL_AMT1    1.016530
BILL_AMT5    0.842754
BILL_AMT6    0.778144
BILL_AMT3    0.741111
EDUCATION    0.679060
BILL_AMT2    0.539135
ID           0.394234
BILL_AMT4    0.305516
MARRIAGE     0.285118
SEX          0.030395
AGE          0.019791
dtype: float64

__Conclusion__: According to the __Mutual Information__, __PAY_0__, __PAY_2__, __PAY_3__, __PAY_4__ and __PAY_5__ are the most significant features. 
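
For intuition about what these scores measure, mutual information between two discrete variables can be computed directly from empirical frequencies. The sketch below uses made-up toy arrays and a hypothetical `mutual_information` helper. Note that `mutual_info_classif` treats the columns of a dense input as continuous by default and uses a nearest-neighbour estimator, so its values will not match this plug-in estimate exactly.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete arrays, from empirical frequencies."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                p_x = np.mean(x == xv)
                p_y = np.mean(y == yv)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Toy data (made up): feature_a tracks the target closely, feature_b is independent of it.
target    = np.array([0, 0, 0, 0, 1, 1, 1, 1])
feature_a = np.array([0, 0, 0, 1, 1, 1, 1, 1])
feature_b = np.array([0, 1, 0, 1, 0, 1, 0, 1])

mi_a = mutual_information(feature_a, target)
mi_b = mutual_information(feature_b, target)
print(mi_a, mi_b)
```

A feature that is independent of the target scores zero, and the score grows as knowing the feature narrows down the target.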

ii.<u>  Random Forest:</u><br>
Running Random Forest with __Selected features suggested by Mutual Information__

a. Splitting the dataset:

In [34]:
X = dt[['PAY_0','PAY_2','PAY_3','PAY_4','PAY_5']]
y = dt['default payment next month']
train_X,test_X,train_Y, test_Y = split_data(dt,X,y)

b. Train Model: <br>
Here we are training the model with __selected features suggested by mutual information before cleaning__.

In [35]:
model = RandomForestClassifier(n_estimators=100)
trained_model = train_model(model,train_X,train_Y)

c. Cross Validation of the model:

In [36]:
cross_val_score(estimator=model, X = train_X, y=train_Y,cv=5 , n_jobs=-1)*100

array([81.44856699, 81.40413242, 81.33333333, 81.5959102 , 81.97377195])

d. Validation of Model:

In [37]:
accuracy, precision, recall, cm = validation_metrics(trained_model,test_X,test_Y)

i. __accuracy__:

In [38]:
accuracy

81.89333333333333

ii. __precision__:

In [39]:
precision

63.85135135135135

iii. __recall__:

In [40]:
recall

35.34912718204489

iv. __confusion matrix__:

In [41]:
cm

array([[ 567, 1037],
       [ 321, 5575]])

__Conclusion__: Model metrics __don't improve much after using selected features__.

__Note__:
<br>
Based on EDA (in another notebook), a consistent trend is observed between __defaulters__ and __non-defaulters__ in __Payment Status__. The __defaulters__ usually delay payment by __two months__, whereas __non-defaulters__ usually __pay duly__.
<br>
<br>
The legitimate values according to the UCI Repository are __-1, 1, 2, 3, 4, 5, 6, 7, 8 and 9__, whereas a significant share of the records have __status 0 or -2__, which aren't valid as per the UCI description of the data. We can't remove these entries as they make up a major portion of the data. 
<br>
<br>
Based on this assumption, I have replaced invalid statuses with __2 (assuming they usually delay payment by two months)__ for __defaulters__ and with __-1 (assuming they usually pay duly)__ for __non-defaulters__. 
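
The replacement rule described in this note can be sketched on a toy frame as follows. `clean_status` and `VALID_STATUS` are hypothetical names used only for illustration, and the rows are made up. Because the replacement value is chosen from the target label, the `updated_` columns carry label information and the rule cannot be applied to unlabeled rows.

```python
import pandas as pd

VALID_STATUS = [-1, 1, 2, 3, 4, 5, 6, 7, 8, 9]

def clean_status(frame, status_col, target_col):
    """Add an 'updated_' column; invalid codes become 2 for defaulters and -1 otherwise."""
    out = frame.copy()  # work on a copy so the input frame is left untouched
    fallback = out[target_col].map({1: 2, 0: -1})
    # keep valid codes, take the class-dependent fallback everywhere else
    out['updated_' + status_col] = out[status_col].where(
        out[status_col].isin(VALID_STATUS), fallback
    )
    return out

# Toy rows (made up) covering valid and invalid codes for both classes.
toy = pd.DataFrame({
    'PAY_0': [0, -2, 3, 0, -1],
    'default payment next month': [1, 0, 1, 0, 0],
})
cleaned = clean_status(toy, 'PAY_0', 'default payment next month')
print(cleaned['updated_PAY_0'].tolist())
```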

_After Data Cleaning_:

iii.<u>  Mutual Information After Data Cleaning:</u><br>

1. Split the __defaulters__ and __non-defaulters__ data. 

In [42]:
defaulters,non_defaulters = separate_data(dt)

2. Clean the __Payment status__, i.e. replace invalid values with __2__ for __defaulters__ and __-1__ for __non-defaulters__ (as observed).

In [43]:
payment_status = ['PAY_0','PAY_2','PAY_3','PAY_5','PAY_4','PAY_6']
defaulters,non_defaulters = clean_payment_status(defaulters,non_defaulters, payment_status)



3. Merge the data

In [44]:
dt = pd.concat([defaulters,non_defaulters])

4. Split the dataset for Mutual Information

In [45]:
X = dt.drop(['ID','default payment next month'],axis=1)
y = dt['default payment next month']

5. Run the __Mutual Information__:

In [46]:
mutual_info = mutual_info_classif(X,y)

In [47]:
pd.Series(mutual_info, index=X.columns).sort_values(ascending=False)*100

updated_PAY_5    29.375431
updated_PAY_6    27.736797
updated_PAY_4    27.601670
updated_PAY_2    25.598799
updated_PAY_3    25.590725
updated_PAY_0    24.030481
PAY_0             7.628413
PAY_2             4.461428
PAY_3             3.835841
PAY_4             3.238270
PAY_5             3.086649
PAY_6             2.714388
PAY_AMT1          2.677441
PAY_AMT3          2.131717
PAY_AMT4          1.778382
LIMIT_BAL         1.662372
PAY_AMT5          1.499841
PAY_AMT2          1.471083
BILL_AMT1         1.059813
PAY_AMT6          1.001190
EDUCATION         0.721552
BILL_AMT6         0.714375
BILL_AMT5         0.642963
BILL_AMT3         0.638283
BILL_AMT2         0.491247
AGE               0.422595
BILL_AMT4         0.336962
MARRIAGE          0.215574
SEX               0.109374
dtype: float64

__Conclusion__: __After Data Cleaning__, according to the __Mutual Information__, __updated_PAY_5__, __updated_PAY_6__, __updated_PAY_4__, __updated_PAY_3__ and __updated_PAY_2__ are the most significant features. 

iv.<u>  Random Forest:</u><br>
Re-Running Random Forest with __Selected features using Mutual Information after data cleaning__

a. Splitting the dataset:

In [48]:
X = dt[['updated_PAY_5','updated_PAY_6','updated_PAY_4','updated_PAY_3','updated_PAY_2']]
y = dt['default payment next month']
train_X,test_X,train_Y, test_Y = split_data(dt,X,y)

b. Train Model: <br>
Here we are training the model with __selected features after data cleaning__

In [49]:
model = RandomForestClassifier(n_estimators=100)
trained_model = train_model(model,train_X,train_Y)

c. Cross Validation of the model:

In [50]:
cross_val_score(estimator=model, X = train_X, y=train_Y,cv=15 , n_jobs=-1)*100

array([94.00399734, 93.73750833, 93.66666667, 94.73333333, 93.93333333,
       94.8       , 94.4       , 95.        , 93.6       , 94.2       ,
       95.06666667, 94.6       , 93.4       , 93.86257505, 93.86257505])

d. Validation of Model:

In [51]:
accuracy, precision, recall, cm = validation_metrics(trained_model,test_X,test_Y)

i. __accuracy__:

In [52]:
accuracy

93.93333333333334

ii. __precision__:

In [53]:
precision

88.61323155216286

iii. __recall__:

In [54]:
recall

83.4631515877771

iv. __confusion matrix__:

In [55]:
cm

array([[1393,  276],
       [ 179, 5652]])

__Conclusion__: Model metrics __improve significantly after data cleaning__. <br>
1. __Recall jumped from 35% to 83%__ 
2. __Precision jumped from 64% to 89%__ 
3. __Accuracy jumped from 82% to 94%__.

----

----

### 3. Wrapper Methods
__Description__: Wrapper Methods build models on different subsets of features and gauge their performance.

a. <u> Recursive Feature Elimination:</u>

i.<u>  Random Forest:</u>

a. Train Model:<br>
Training the model with the complete dataset (all features including the updated payment statuses) 

In [51]:
X = dt.drop(['ID','default payment next month'],axis=1)
y = dt['default payment next month']
train_X,test_X,train_Y, test_Y = split_data(dt,X,y)

ii. Rank the features using RFE (Recursive Feature Elimination):

In [52]:
RF_model = RandomForestClassifier(n_estimators=100)
selector = RFE(estimator=RF_model, step=1, n_features_to_select=5)

In [53]:
rfe = selector.fit(X,y)

iii. Check the selected features.

In [54]:
X = dt[X.columns[rfe.support_].tolist()]
y = dt['default payment next month']
train_X,test_X,train_Y, test_Y = split_data(dt,X,y)
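
To actually look at what RFE kept, the fitted selector exposes `support_` (a boolean mask) and `ranking_`. The sketch below runs on synthetic data from `make_classification`, not on the credit card data, and the `*_demo` names are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data (assumed for illustration): 8 features, 3 of them informative.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

selector_demo = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=3,
    step=1,
)
selector_demo.fit(X_demo, y_demo)

# support_ is a boolean mask of kept features; ranking_ is 1 for kept features
# and grows the earlier a feature was eliminated.
selected_demo = X_demo.columns[selector_demo.support_].tolist()
ranking_demo = pd.Series(selector_demo.ranking_, index=X_demo.columns).sort_values()
print(selected_demo)
print(ranking_demo)
```

In the notebook the equivalent inspection would be `X.columns[rfe.support_].tolist()`, evaluated before `X` is overwritten with the selected columns.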

In [55]:
trained_model = train_model(RF_model, train_X,train_Y)

In [56]:
cross_val_score(estimator=trained_model, X = train_X, y=train_Y,cv=3 , n_jobs=-1)*100

array([92.41333333, 92.54666667, 92.70666667])

b. Validation of Model:

In [57]:
accuracy, precision, recall, cm = validation_metrics(trained_model,test_X,test_Y)

i. __accuracy__:

In [58]:
accuracy

92.47999999999999

ii. __precision__:

In [59]:
precision

83.44175044352454

iii. __recall__:

In [60]:
recall

83.24483775811208

iv. __confusion matrix__:

In [61]:
cm

array([[1411,  284],
       [ 280, 5525]])

__Conclusion__: Compared with the filter method, which has so far produced the best results, __recall is essentially unchanged whereas precision and accuracy have decreased__. <br>
1. __Recall stayed at about 83%__ 
2. __Precision decreased from 89% to 83%__ 
3. __Accuracy decreased from 94% to 92%__.

### 4. Store the model

__Note__: The features suggested by the `Filter Method` turned out to be the best set of features in terms of model `accuracy`, `precision` and `recall`, so we opted for this model.

In [62]:
#import pickle

In [63]:
#filename = 'model.pkl'
#pickle.dump(model, open(filename, 'wb'))
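
As a hedged sketch of what the commented-out cells would do, the round trip below pickles a fitted model and checks that the restored copy predicts the same labels. The tiny training set and the `*_demo` names are made up, and the model is serialized to bytes in memory rather than to `model.pkl`.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training set, only to have a fitted model to store.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10)
y_demo = np.array([0, 0, 1, 1] * 10)
model_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Serialize to bytes and back; pickle.dump / pickle.load do the same with a file object.
payload = pickle.dumps(model_demo)
restored = pickle.loads(payload)

same_predictions = np.array_equal(model_demo.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

When writing to disk, opening the file in a `with open(filename, 'wb') as f:` block ensures it is closed after `pickle.dump(model, f)`. A pickled model should be loaded with the same scikit-learn version that produced it.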