# Task 1: Feature Selection Schemes

In this assignment you will learn and apply feature selection techniques.

### Forward Selection:
Forward selection is an iterative method that starts with no features in the model. In each iteration, we add the feature that most improves the model, and we stop when adding a new feature no longer improves performance.
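
The loop described above can be sketched as follows. This is a minimal illustration, not the assignment's reference solution: it assumes synthetic data from `make_classification` in place of the bank dataset, uses mean 5-fold cross-validated accuracy as the improvement measure, and the helper name `forward_select` and the threshold `tol` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the bank dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

def forward_select(X, y, tol=1e-3):
    """Greedily add the feature that most improves mean CV accuracy."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score each candidate feature when added to the current subset.
        trial = {
            j: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [j]], y, cv=5).mean()
            for j in remaining
        }
        j_best = max(trial, key=trial.get)
        if trial[j_best] - best_score <= tol:
            break  # no candidate improves the score enough, so stop
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = trial[j_best]
    return selected, best_score

selected, score = forward_select(X, y)
print("selected feature indices:", selected)
print("mean CV accuracy:", round(score, 3))
```

Scoring by cross-validation rather than training accuracy is a design choice here; training accuracy rarely decreases when a feature is added, so it would seldom trigger the stopping rule.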

### Backward Elimination:
In backward elimination, we start with all the features and, at each iteration, remove the least significant feature, that is, the one whose removal most improves (or least harms) the performance of the model. We repeat this until removing a feature no longer yields an improvement.
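
One way to try backward elimination is scikit-learn's `SequentialFeatureSelector` with `direction="backward"`. The sketch below assumes synthetic data and a fixed target of 4 features (both are illustrative choices, not part of the assignment). Note that with a fixed `n_features_to_select` the selector stops at that subset size rather than when improvement ceases, so it only approximates the stopping rule described above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the bank dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Start from all 8 features and drop one per step until 4 remain.
# At each step the feature whose removal gives the best CV accuracy is dropped.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,   # illustrative target size
    direction="backward",
    scoring="accuracy",
    cv=5,
)
sfs.fit(X, y)

kept = sfs.get_support(indices=True)  # indices of the surviving features
X_reduced = sfs.transform(X)          # data restricted to those features
print("kept feature indices:", kept)
print("reduced shape:", X_reduced.shape)
```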

### Recursive Feature Elimination:
Recursive feature elimination is a greedy optimization algorithm that aims to find the best-performing feature subset. It repeatedly builds models and sets aside the best or the worst performing feature at each iteration. It then constructs the next model with the remaining features, continuing until all the features are exhausted. Finally, it ranks the features based on the order of their elimination.
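
As a rough illustration of the ranking this produces, the sketch below runs scikit-learn's `RFE` on synthetic data (an assumption; the bank dataset is not used here). In this implementation the feature with the smallest absolute coefficient is dropped at each step, and the number of features to keep (3 here) is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the bank dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Remove one feature per iteration (step=1) until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3, step=1)
rfe.fit(X, y)

# ranking_ == 1 marks the selected features; larger ranks were eliminated earlier.
for idx, rank in sorted(enumerate(rfe.ranking_), key=lambda pair: pair[1]):
    print(f"feature {idx}: rank {rank}")
```

Because rank 1 is the best rank, sorting in ascending order lists the most useful features first.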

## Dataset
The dataset is available at "data/bank-full.csv" in the respective challenge's repo.
The dataset can be obtained from:
https://www.kaggle.com/sonujha090/bank-marketing

# Features (X)
## Input variables:
### Bank client data:
1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')
### Related to the last contact of the current campaign:
8. contact: contact communication type (categorical: 'cellular','telephone')
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
### Other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
### Social and economic context attributes:
16. emp.var.rate: employment variation rate. quarterly indicator (numeric)
17. cons.price.idx: consumer price index. monthly indicator (numeric)
18. cons.conf.idx: consumer confidence index. monthly indicator (numeric)
19. euribor3m: euribor 3 month rate. daily indicator (numeric)
20. nr.employed: number of employees. quarterly indicator (numeric)

## Output variable (desired target):
21. y: has the client subscribed to a term deposit? (binary: 'yes','no')

#### Objective
- To apply different feature selection approaches, such as Forward Selection, Backward Elimination, and Recursive Feature Elimination, to a Logistic Regression model.


#### Tasks
- Download and load the data (csv file)
- Preprocess the data
- Split the dataset into 70% for training and the remaining 30% for testing (`sklearn.model_selection.train_test_split`)
- Train Logistic Regression
- Apply feature selection techniques
- Train the models on the feature-reduced datasets
- Compare their accuracies and print each feature subset
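
The first few tasks can be sketched end to end on a toy table. Everything in the data below is made up for illustration: the column values are random, the target is a hypothetical rule loosely tied to `age`, and one-hot encoding via `pd.get_dummies` plus `StandardScaler` is just one reasonable preprocessing choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the bank data (all values are synthetic assumptions).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "job": rng.choice(["admin.", "technician", "retired"], size=n),
    "campaign": rng.integers(1, 10, size=n),
})
# Hypothetical target loosely tied to age so the model has something to learn.
toy["y"] = np.where(toy["age"] + rng.normal(0, 10, size=n) > 45, "yes", "no")

# One-hot encode the categorical column and map the target to 0/1.
X = pd.get_dummies(toy.drop(columns="y"), columns=["job"],
                   drop_first=True, dtype=float)
y = (toy["y"] == "yes").astype(int)

# 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

accuracy = clf.score(scaler.transform(X_test), y_test)
print("test accuracy:", round(accuracy, 3))
```

Fitting the scaler on the training split alone keeps test-set statistics out of the model, which makes the reported accuracy a fairer estimate.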

#### Further Fun
- Perform feature selection with other schemes in scikit-learn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection

#### Helpful links
- pd.get_dummies() and One Hot Encoding: https://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example
- Feature Scaling: https://scikit-learn.org/stable/modules/preprocessing.html
- Train-test splitting: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- Feature selection in ML: https://machinelearningmastery.com/feature-selection-machine-learning-python/
- Feature selection in sklearn: https://scikit-learn.org/stable/modules/feature_selection.html
- Use Slack for questions: https://join.slack.com/t/deepconnectai/shared_invite/zt-givlfnf6-~cn3SQ43k0BGDrG9_YOn4g




In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from sklearn import preprocessing

In [None]:
banking =  pd.read_csv(?)

In [None]:
banking.columns

In [None]:
banking.dtypes

In [None]:
banking.head(6)

In [None]:
banking.describe()

In [None]:
banking.shape

In [None]:
banking.rename(columns={"y":"Action"},inplace = True)

In [None]:
#banking.Action.value_counts()

In [None]:
sns.heatmap(banking.isnull(),yticklabels = False, cbar = False , cmap ='RdYlGn')

In [None]:
new_data =  banking.select_dtypes(include='object')

In [None]:
# Check the number of unique categories in each column
for i in new_data.columns:
    print(i, ';', ?, 'labels')

In [None]:
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
new_data_encoded = new_data.apply(lb.fit_transform)
new_data_nonobject = banking.select_dtypes(exclude = ["object"])
banking1 = pd.concat([new_data_nonobject,new_data_encoded], axis = 1)

In [None]:
banking1.head()

In [None]:
banking1.shape

# K Best Features
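
`SelectKBest` scores each feature independently against the target and keeps the `k` highest-scoring ones. The sketch below assumes synthetic data with made-up column names `f0` to `f5` and uses `f_classif` (the ANOVA F-test) as the scoring function; `chi2` is an alternative, but it requires non-negative feature values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data with hypothetical column names.
X_arr, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Score every feature with the ANOVA F-test and keep the top 3.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# Pair each column name with its score, best first.
score_table = (
    pd.DataFrame({"feature": X.columns, "score": selector.scores_})
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)
print(score_table)

kept = X.columns[selector.get_support()].tolist()
print("kept features:", kept)
```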

In [None]:
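# chi2 here is the scoring function from sklearn.feature_selection
# (scipy.stats.chi2 is a probability distribution and cannot be passed to SelectKBest).
# Note: chi2 requires non-negative feature values; f_classif does not.
from sklearn.feature_selection import SelectKBest, chi2, f_classif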

In [None]:
# Use the label-encoded frame; the target column was renamed from "y" to "Action" above
X = banking1.iloc[:, 0:16]
y = banking1['Action']

In [None]:
X.shape

In [None]:
Kbest = SelectKBest(?, ?)
kfit = Kbest.fit(?,?)

In [None]:
scores = pd.DataFrame(?)
columns = pd.DataFrame(?)

In [None]:
# Train logistic regression model with subset of features from K Best

In [None]:
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE, RFECV, SelectKBest, chi2
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split


# Forward Selection

In [None]:
# Train a logistic regression model here

In [None]:
# Print the absolute weights of the model and sort them in descending order

In [None]:
# Run a for loop that trains a new model at each step, adding one feature at a time
# (from 1 feature up to n), in descending order of absolute weight from the initial model
# Note: you may also add the features in random order

In [None]:
# Print the accuracy of each trained model along with the names of the features it used

In [None]:
# Find the smallest number of features at which accuracy is maximal

# Backward Elimination

In [None]:
# Train a logistic regression model here

In [None]:
# Print the absolute weights of the model and sort them in ascending order

In [None]:
# Run a for loop that trains a new model at each step, removing one feature at a time
# (from n features down to 1), in ascending order of absolute weight from the initial model
# Note: you may also remove the features in random order

In [None]:
# Print the accuracy of each trained model along with the names of the features it used

In [None]:
# Find the smallest number of features at which accuracy is maximal

# Recursive Feature Elimination
Recursive Feature Elimination (RFE), as its name suggests, recursively removes features, builds a model using the remaining features, and calculates the model's accuracy.


In [None]:
X = banking1.iloc[:,0:16]
y = banking1.iloc[:,16]
logit = LogisticRegression()

In [None]:
X_train,X_test,y_train,y_test = train_test_split(?,?, test_size = ?, random_state = 10)

In [None]:
rfe = RFE(estimator=?, step=1)
rfe = rfe.fit(?,?)

In [None]:
cols = pd.DataFrame(?)
ranking = pd.DataFrame(rfe.ranking_)

In [None]:
rankings_of_features = pd.concat([cols,ranking],axis = 1)

In [None]:
rankings_of_features

In [None]:
rankings_of_features.columns = [?,?]

In [None]:
rankings_of_features

In [None]:
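print(rankings_of_features.nlargest(5,'rank'))
# Caution: this gives misleading results. In rfe.ranking_, rank 1 marks the selected
# features, so nlargest returns the 5 worst-ranked features, not the best ones.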

In [None]:
rankings_of_features.sort_values(by='rank')

In [None]:
# Replace X_train and X_test with the reduced data that contains only the most impactful features
X_trainRFE = rfe.transform(?)
X_testRFE = rfe.transform(?)

In [None]:
model = logit.fit(?,?)

In [None]:
from sklearn import metrics
from sklearn.model_selection import cross_val_score

In [None]:
logit.predict(X_testRFE)

In [None]:
score = logit.score(X_testRFE, y_test)
print(score)

# RFE using cross validation

In [None]:
rfecv = RFECV(estimator=logit, step=1, cv=5, scoring='accuracy')
rfecv = rfecv.fit(?, ?)

In [None]:
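# grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the per-subset CV scores
rfecv.cv_results_['mean_test_score']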

In [None]:
X_train_rfecv = rfecv.transform(?)
X_test_rfecv = rfecv.transform(?)

In [None]:
model = logit.fit(?,?)

In [None]:
logit.predict(?)


In [None]:
# Store the new score; otherwise print(score) shows the stale value from the RFE section
score = logit.score(?,?)
print(score)