##### ABOUT THE DATASET


###### "Bank Marketing" dataset from Kaggle
###### This dataset has 16 features (7 numeric, 9 categorical) and a binary target variable, `deposit`. We'll use it to predict whether a client will subscribe to a term deposit.

##### LOAD AND PREPROCESS DATASET

In [3]:
import opendatasets as od
from ydata_profiling import ProfileReport
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report


In [4]:
od.download("https://www.kaggle.com/janiobachmann/bank-marketing-dataset")
# Load the data from the folder created by od.download (relative to the working directory)
data = pd.read_csv("bank-marketing-dataset/bank.csv")

Skipping, found downloaded files in ".\bank-marketing-dataset" (use force=True to force download)


In [5]:
profile= ProfileReport(data)
profile.to_file(output_file="bank_marketing.html")

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'unknown'')
Summarize dataset: 100%|██████████| 75/75 [00:05<00:00, 14.51it/s, Completed]                 
Generate report structure: 100%|██████████| 1/1 [00:02<00:00,  2.27s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.10it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 123.87it/s]


In [6]:
data.head(10)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes
5,42,management,single,tertiary,no,0,yes,yes,unknown,5,may,562,2,-1,0,unknown,yes
6,56,management,married,tertiary,no,830,yes,yes,unknown,6,may,1201,1,-1,0,unknown,yes
7,60,retired,divorced,secondary,no,545,yes,no,unknown,6,may,1030,1,-1,0,unknown,yes
8,37,technician,married,secondary,no,1,yes,no,unknown,6,may,608,1,-1,0,unknown,yes
9,28,services,single,secondary,no,5090,yes,no,unknown,6,may,1297,3,-1,0,unknown,yes


In [7]:

# Encode categorical variables
data_encoded = pd.get_dummies(data, drop_first=True)

# Print column names
print(data_encoded.columns)

# Identify the target column
target_column = [col for col in data_encoded.columns if col.startswith('deposit_')][0]
print(f"Target column: {target_column}")

# Split features and target
X = data_encoded.drop(target_column, axis=1)
y = data_encoded[target_column]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_married', 'marital_single', 'education_secondary',
       'education_tertiary', 'education_unknown', 'default_yes', 'housing_yes',
       'loan_yes', 'contact_telephone', 'contact_unknown', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_other', 'poutcome_success', 'poutcome_unknown',
       'deposit_yes'],
      dtype='object')
Target column: deposit_yes
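
KNN ranks neighbours by distance, so a feature on a large scale (such as `balance`) can drown out one on a small scale (such as `campaign`). The sketch below shows how the nearest neighbour can change once features are standardized. It uses a handful of made-up points, not rows from the dataset, and a hand-rolled z-score standing in for `StandardScaler`.

```python
import math
from statistics import mean, pstdev

# Hypothetical (balance, campaign) pairs, invented for illustration only.
train = [(1100, 10), (1500, 1), (3000, 2), (200, 3)]
query = (1000, 1)


def nearest(points, q):
    """Return the index of the point closest to q (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))


# Column-wise mean and population standard deviation, learned from the
# training points only (the query is scaled with the same statistics).
columns = list(zip(*train))
means = [mean(col) for col in columns]
stds = [pstdev(col) for col in columns]


def scale(point):
    """Apply the z-score learned from the training points."""
    return tuple((v - m) / s for v, m, s in zip(point, means, stds))


raw_nn = nearest(train, query)
scaled_nn = nearest([scale(p) for p in train], scale(query))

print("nearest on raw features:   ", train[raw_nn])
print("nearest on scaled features:", train[scaled_nn])
```

On the raw values the closest point is the one with the most similar balance, even though its campaign count is very different. After scaling, the closest point is the one that matches on campaign and is moderately close on balance.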


Initial model without feature selection:

#### Build a K-Nearest Neighbors classifier without feature selection:

In [8]:
# Train KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Make predictions
y_pred = knn.predict(X_test_scaled)

# Evaluate performance
print("Model performance without feature selection:")
print(classification_report(y_test, y_pred))

Model performance without feature selection:
              precision    recall  f1-score   support

       False       0.75      0.83      0.79      1166
        True       0.79      0.70      0.74      1067

    accuracy                           0.77      2233
   macro avg       0.77      0.76      0.76      2233
weighted avg       0.77      0.77      0.77      2233



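In this report, 'True' means the client subscribed (`deposit_yes`) and 'False' means they did not.
Accuracy: 0.77 (77% of all test predictions were correct)
Precision: 0.75 (False), 0.79 (True) (75% of clients predicted 'False' were actually 'False'; 79% of those predicted 'True' were actually 'True')
Recall: 0.83 (False), 0.70 (True) (83% of actual 'False' clients were correctly identified, against 70% of actual 'True' clients)
F1-score: 0.79 (False), 0.74 (True) (harmonic mean of precision and recall for each class)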
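
To make the definitions above concrete, here is a minimal sketch that computes the same four metrics by hand. The labels are a toy example invented for illustration, not predictions from the notebook, and `per_class_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
# Toy labels, invented for illustration; they are not taken from the dataset.
y_true_toy = [True, True, True, True, False, False, False, False, False, False]
y_pred_toy = [True, True, True, False, True, False, False, False, False, False]


def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 with `positive` treated as the class of interest.

    Assumes the class is both present and predicted at least once, so no
    denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)  # of everything predicted as this class, how much was right
    recall = tp / (tp + fn)     # of everything truly in this class, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(f"accuracy: {accuracy_toy:.2f}")
for label in (False, True):
    p, r, f = per_class_metrics(y_true_toy, y_pred_toy, label)
    print(f"{label!s:>5}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision divides by the number of predictions for a class, while recall divides by the number of true members of that class, which is why the two can diverge as they do in the report above.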

##### Applying feature selection methods:

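Now, let's apply one filter method (correlation with the target), one wrapper method (recursive feature elimination, RFE), and one embedded method (Lasso, an L1-penalized linear regression fitted here on the 0/1 target). Each selector is fitted on the training set only: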

In [9]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LogisticRegression

# Filter method: keep features whose absolute correlation with the target exceeds 0.05
correlation = X_train.corrwith(y_train).abs().sort_values(ascending=False)
selected_features_correlation = correlation[correlation > 0.05].index.tolist()

# Wrapper method: Recursive Feature Elimination down to 5 features
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=5)
rfe.fit(X_train_scaled, y_train)
selected_features_rfe = X_train.columns[rfe.support_].tolist()

# Embedded method: Lasso (keep features with a non-zero coefficient)
lasso = Lasso(alpha=0.01)
lasso.fit(X_train_scaled, y_train)
selected_features_lasso = X_train.columns[lasso.coef_ != 0].tolist()

print("Selected features (Correlation):", selected_features_correlation)
print("Selected features (RFE):", selected_features_rfe)
print("Selected features (Lasso):", selected_features_lasso)

Selected features (Correlation): ['duration', 'poutcome_success', 'contact_unknown', 'poutcome_unknown', 'housing_yes', 'month_may', 'pdays', 'previous', 'month_sep', 'month_oct', 'month_mar', 'campaign', 'loan_yes', 'education_tertiary', 'job_blue-collar', 'job_retired', 'job_student', 'marital_single', 'marital_married', 'month_dec', 'balance', 'education_secondary', 'day', 'month_jul']
Selected features (RFE): ['duration', 'housing_yes', 'contact_unknown', 'month_jul', 'poutcome_success']
Selected features (Lasso): ['balance', 'duration', 'campaign', 'job_blue-collar', 'job_retired', 'job_student', 'marital_married', 'marital_single', 'education_tertiary', 'housing_yes', 'loan_yes', 'contact_unknown', 'month_aug', 'month_dec', 'month_jan', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep', 'poutcome_success', 'poutcome_unknown']
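
Correlation and Lasso both return 24 features, so it is worth checking how far the selections actually agree. The sketch below compares the three lists with plain set operations. The lists are copied from the output above, and the variable names are chosen here for illustration.

```python
# Feature lists copied from the printed output above.
correlation_features = [
    "duration", "poutcome_success", "contact_unknown", "poutcome_unknown",
    "housing_yes", "month_may", "pdays", "previous", "month_sep", "month_oct",
    "month_mar", "campaign", "loan_yes", "education_tertiary",
    "job_blue-collar", "job_retired", "job_student", "marital_single",
    "marital_married", "month_dec", "balance", "education_secondary", "day",
    "month_jul",
]
rfe_features = [
    "duration", "housing_yes", "contact_unknown", "month_jul",
    "poutcome_success",
]
lasso_features = [
    "balance", "duration", "campaign", "job_blue-collar", "job_retired",
    "job_student", "marital_married", "marital_single", "education_tertiary",
    "housing_yes", "loan_yes", "contact_unknown", "month_aug", "month_dec",
    "month_jan", "month_jul", "month_jun", "month_mar", "month_may",
    "month_nov", "month_oct", "month_sep", "poutcome_success",
    "poutcome_unknown",
]

corr_set, rfe_set, lasso_set = map(
    set, (correlation_features, rfe_features, lasso_features)
)

shared_by_all = corr_set & rfe_set & lasso_set  # chosen by every method
only_correlation = corr_set - lasso_set         # correlation kept, Lasso dropped
only_lasso = lasso_set - corr_set               # Lasso kept, correlation dropped

print("Shared by all three:", sorted(shared_by_all))
print("Correlation only:   ", sorted(only_correlation))
print("Lasso only:         ", sorted(only_lasso))
```

All five RFE features appear in both larger sets. Correlation and Lasso share 20 features and differ on four each: correlation keeps `pdays`, `previous`, `education_secondary` and `day`, while Lasso keeps `month_aug`, `month_jan`, `month_jun` and `month_nov` instead.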


Model with selected features:



#### Build KNN models using the features selected by each method:

In [10]:
def evaluate_model(X_train, X_test, y_train, y_test, feature_set):
    """Train a 5-nearest-neighbours model on `feature_set` and return its classification report."""
    # Keep only the selected columns
    X_train_selected = X_train[feature_set]
    X_test_selected = X_test[feature_set]

    # Refit the scaler on the reduced training set, then apply it to the test set
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_selected)
    X_test_scaled = scaler.transform(X_test_selected)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)

    return classification_report(y_test, y_pred)

print("Model performance with Correlation-based feature selection:")
print(evaluate_model(X_train, X_test, y_train, y_test, selected_features_correlation))

print("Model performance with RFE-based feature selection:")
print(evaluate_model(X_train, X_test, y_train, y_test, selected_features_rfe))

print("Model performance with Lasso-based feature selection:")
print(evaluate_model(X_train, X_test, y_train, y_test, selected_features_lasso))

Model performance with Correlation-based feature selection:
              precision    recall  f1-score   support

       False       0.77      0.81      0.79      1166
        True       0.78      0.74      0.76      1067

    accuracy                           0.78      2233
   macro avg       0.78      0.78      0.78      2233
weighted avg       0.78      0.78      0.78      2233

Model performance with RFE-based feature selection:
              precision    recall  f1-score   support

       False       0.78      0.75      0.77      1166
        True       0.74      0.77      0.75      1067

    accuracy                           0.76      2233
   macro avg       0.76      0.76      0.76      2233
weighted avg       0.76      0.76      0.76      2233

Model performance with Lasso-based feature selection:
              precision    recall  f1-score   support

       False       0.81      0.81      0.81      1166
        True       0.80      0.80      0.80      1067

    accuracy    

Comparison and Summary:

1. Initial model (without feature selection):
   - Features: 42 (43 encoded columns minus the target)
   - Accuracy: 0.77
   - Precision: 0.75 (False), 0.79 (True)
   - Recall: 0.83 (False), 0.70 (True)
   - F1-score: 0.79 (False), 0.74 (True)

2. Correlation-based feature selection:
   - Selected 24 features
   - Accuracy: 0.78 (slight improvement)
   - Precision: 0.77 (False), 0.78 (True)
   - Recall: 0.81 (False), 0.74 (True)
   - F1-score: 0.79 (False), 0.76 (True)

3. RFE-based feature selection:
   - Selected 5 features
   - Accuracy: 0.76 (slight decrease)
   - Precision: 0.78 (False), 0.74 (True)
   - Recall: 0.75 (False), 0.77 (True)
   - F1-score: 0.77 (False), 0.75 (True)

4. Lasso-based feature selection:
   - Selected 24 features
   - Accuracy: 0.81 (largest improvement, +0.04 over the initial model)
   - Precision: 0.81 (False), 0.80 (True)
   - Recall: 0.81 (False), 0.80 (True)
   - F1-score: 0.81 (False), 0.80 (True)

Summary:

1. Feature Reduction: All methods reduced the number of features from 42. RFE was the most aggressive, selecting only 5, while the Correlation and Lasso methods each selected 24 (though not the same 24).

2. Performance Impact:
   - Correlation-based selection slightly improved overall accuracy from 0.77 to 0.78.
   - RFE slightly decreased overall accuracy from 0.77 to 0.76.
   - Lasso-based selection gave the largest gain, improving overall accuracy from 0.77 to 0.81.

3. Method Comparison:
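   - Lasso performed best: it achieved the highest accuracy and improved every reported metric except recall on the 'False' class, which dipped from 0.83 to 0.81.
   - The Correlation method gave mixed results: recall and F1 for 'True' rose (0.70 to 0.74 and 0.74 to 0.76), while recall for 'False' (0.83 to 0.81) and precision for 'True' (0.79 to 0.78) fell slightly.
   - RFE, despite using only 5 features, stayed within 0.01 of the initial model's accuracy.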

4. Per-Class Balance:
   - The initial model performed unevenly across classes: recall was 0.83 for 'False' but only 0.70 for 'True'.
   - Lasso-based selection achieved the most balanced performance across classes (recall 0.81 vs. 0.80).
   - Correlation and RFE also narrowed the recall gap compared to the initial model (0.81 vs. 0.74 and 0.75 vs. 0.77, respectively).

5. Interpretability and Efficiency:
   - RFE yields the most compact feature set (5 features), which is the easiest to reason about and lowers the cost of every KNN distance computation.
   - Lasso and Correlation keep 24 of the 42 features, so they still reduce complexity compared to the full feature set, though less sharply.
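
The accuracy changes in point 2 and the per-class balance in point 4 can be recomputed directly from the reported figures. The sketch below does this with the rounded values from the reports above, so the results carry the same two-decimal rounding. The Lasso accuracy is taken from the comparison list because the printed report is cut off, and the dictionary layout is just one convenient choice.

```python
# Figures copied from the reports and comparison above (rounded to two decimals).
# The Lasso accuracy comes from the comparison list; its printed report is truncated.
reports = {
    "initial":     {"n_features": 42, "accuracy": 0.77, "recall_false": 0.83, "recall_true": 0.70},
    "correlation": {"n_features": 24, "accuracy": 0.78, "recall_false": 0.81, "recall_true": 0.74},
    "rfe":         {"n_features": 5,  "accuracy": 0.76, "recall_false": 0.75, "recall_true": 0.77},
    "lasso":       {"n_features": 24, "accuracy": 0.81, "recall_false": 0.81, "recall_true": 0.80},
}

initial_accuracy = reports["initial"]["accuracy"]
comparison = {}
for name, r in reports.items():
    # Change in accuracy relative to the model trained on all 42 features
    delta = round(r["accuracy"] - initial_accuracy, 2)
    # Absolute difference between the two per-class recalls
    gap = round(abs(r["recall_false"] - r["recall_true"]), 2)
    comparison[name] = (delta, gap)
    print(f"{name:<12} features={r['n_features']:>2}  "
          f"accuracy change={delta:+.2f}  recall gap={gap:.2f}")
```

A smaller recall gap means the model treats the two classes more evenly. On these figures the gap shrinks from 0.13 for the initial model to 0.01 with the Lasso selection.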

Conclusion:
On this single 80/20 split, Lasso-based feature selection provided the best overall performance, raising accuracy from 0.77 to 0.81 and giving nearly equal results for both classes. The Correlation method changed accuracy by only +0.01, with gains on some metrics and small losses on others. RFE lowered accuracy by 0.01 but cut the feature set from 42 to 5, which can be worthwhile where interpretability and computational efficiency matter most.
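
Differences of 0.01 in accuracy can easily come from the particular train/test split, so one way to check the ranking is cross-validation. The sketch below is not part of the original analysis. It runs on synthetic data from `make_classification` as a stand-in (an assumption) and wraps scaling, Lasso-based selection via `SelectFromModel`, and KNN in a single `Pipeline`, so each fold refits the scaler and selector on its own training portion.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption); swap in X and y from the notebook.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

# Scaling and selection sit inside the pipeline, so they are refitted on the
# training portion of every fold and never see the held-out fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(Lasso(alpha=0.01))),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print("accuracy per fold:", scores.round(2))
print(f"mean={scores.mean():.2f}  std={scores.std():.2f}")
```

Running the same pipeline with each selector and comparing the mean and spread of the fold scores gives a steadier basis for the comparison than one split alone.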
