## Pipelines and Feature Selection
### Pipelines Objective

<span style="color:blue">
To use a Pipeline to repeat the evaluation from the previous assignment, which was done using hold-out testing. In the Algorithm Bias Assessment, the results were unstable across different train-test splits, so this time we use cross-validation with the Pipeline strategy to highlight the differences between the two strategies.
</span>


- To start with, load the hold-out evaluation strategy from the first assignment.
- Evaluate this strategy with cross-validation using a Pipeline.
- Compare the results obtained by the two techniques.




In [None]:
import numpy as np
import pandas as pd
from collections import Counter
surv = pd.read_csv('survival.csv')   #loading the survival dataset as dataframe

In [None]:
targetcount=surv['Class'].value_counts()
print('total count of class type 1 in the survival dataset:',targetcount[1])
print('total count of class type 2 in the survival dataset:',targetcount[2])

In [None]:
X = surv.drop('Class', axis=1)   # X holds the independent variables (features) for the models.
y = surv['Class']                # y holds the dependent variable (target).
X.shape, y.shape                 # check the number of rows and columns in X and y

In [None]:
X.head()
#print("Minority class type 2 in the entire dataset(percentage wise): %0.2f" % (Counter(y)[2]/len(y)))

### 1.1 Applying Hold-Out Testing on Models
- Loading the hold-out testing strategy from the first assignment

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.5,random_state=42)
print('Actual Class type [1 and 2] counts in test set: ',Counter(y_test))
print('Minority Class Type [2] in test set: ',Counter(y_test)[2])
test_neg = Counter(y_test)[2]
Minority_test= test_neg/len(y_test)
print("Minority class in test set percentage wise : %0.2f" % (Minority_test))
print('*' * 20)


MLalgos ={}

MLalgos['KNN'] = KNeighborsClassifier(n_neighbors=3)
MLalgos['DecisionTree']= DecisionTreeClassifier(criterion='entropy')
MLalgos['LogRegression'] = LogisticRegression(random_state=42,max_iter=10000)
MLalgos['GradBoosting']= GradientBoostingClassifier(random_state=42)
bias_calculated ={}
accuracy_calculated={}


for algo in MLalgos:
    print(type(MLalgos[algo]).__name__)
    y_predicted = MLalgos[algo].fit(X_train, y_train).predict(X_test)
    confusion = confusion_matrix(y_test, y_predicted)
    print("Confusion matrix is :\n{}".format(confusion)) 
    acc = accuracy_score(y_test, y_predicted)
    print('Accuracy:  %0.2f' % acc)
    accuracy_calculated[algo]=acc
    # Labels are coded 1/2, so sum - n equals the number of class-2 predictions.
    count_predicted = (y_predicted.sum()-len(y_predicted))
    bias_calculated[algo]= count_predicted
    print("Predicted minority class type 2 :",count_predicted)
    pred_neg = Counter(y_predicted)[2]
    test_neg = Counter(y_test)[2]
    print("Predicted minority class type 2 percentage wise : %0.2f" % (pred_neg/len(y_predicted)))
    predicted_count= pred_neg/len(y_predicted)
    print('*' * 20)
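
The expression `y_predicted.sum()-len(y_predicted)` in the loop above counts class-2 predictions only because the labels are coded as 1 and 2. A minimal sketch with made-up predictions (the array below is hypothetical, not taken from the survival data) shows why it works and gives two alternatives that do not depend on the coding:

```python
import numpy as np
from collections import Counter

# Hypothetical predictions using the same 1/2 label coding as the Class column.
y_predicted = np.array([1, 2, 1, 1, 2, 1, 2, 1])

# Each label 1 adds 1 to the sum and each label 2 adds 2,
# so sum - n equals the number of 2s (valid only for this 1/2 coding).
count_via_sum = int(y_predicted.sum() - len(y_predicted))

# Coding-independent alternatives that give the same count.
count_via_counter = Counter(y_predicted.tolist())[2]
count_via_mask = int((y_predicted == 2).sum())

print(count_via_sum, count_via_counter, count_via_mask)
```

If the labels were ever recoded (for example to 0/1 or to strings such as 'GE5'/'L5'), the sum-based version would silently give a wrong count, while the other two would still work.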
    

### 1.1 Using the Synthetic Minority Over-sampling Technique (SMOTE) to upsample the minority class

In [None]:
from imblearn.over_sampling import SMOTE, RandomOverSampler, ADASYN

In [None]:
surv = pd.read_csv('survival.csv')
surv['Survived'] = 'GE5'
surv.loc[surv['Class']==2,'Survived']='L5'
vc=surv['Survived'].value_counts() 
y = surv.pop('Survived').values
surv.pop('Class')
X = surv.values
X.shape, y.shape


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.5, random_state=42)
print("Before upsampling training set {}".format(Counter(y_train)))


In [None]:
sm = SMOTE(random_state=20, sampling_strategy = 0.7)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)   # fit_sample was removed from recent imbalanced-learn releases

print("After sampling training set {}".format(Counter(y_train_res)))
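
To make the upsampling step less of a black box, here is a simplified NumPy sketch of the idea behind SMOTE: each synthetic row is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration under stated assumptions, not imbalanced-learn's implementation; the helper `smote_like` and the toy data are hypothetical. It also shows how a `sampling_strategy` of 0.7 can be read as a target minority-to-majority ratio.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among minority samples only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]    # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)   # randomly chosen starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical training set: 100 majority rows and 30 minority rows.
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(100, 3))
X_min = rng.normal(2.0, 1.0, size=(30, 3))

# Target ratio minority / majority = 0.7, so 70 minority rows are needed in total.
n_new = int(0.7 * len(X_maj)) - len(X_min)
X_syn = smote_like(X_min, n_new)
print(n_new, X_syn.shape)
```

Because every synthetic row is a convex combination of two real minority rows, the new points stay inside the region already occupied by the minority class.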

In [None]:

acc_bal = {}
predictedMinority={}

print("Total count of Minority class L5 in test set: {}".format(len(y_test)-Counter(y_test)['GE5']))
print('*' * 20)


for algo in MLalgos:
    print(type(MLalgos[algo]).__name__)
    y_pred = MLalgos[algo].fit(X_train_res, y_train_res).predict(X_test)    
    predictedMinority[algo] = len(y_pred)-(Counter(y_pred)['GE5'])
    print("Predicted minority:", predictedMinority[algo])
    acc_bal[algo] = accuracy_score(y_test, y_pred)
    print('Accuracy:  %0.2f' % acc_bal[algo])
    print('*' * 20)
    


### 1.2 Using a Pipeline and evaluate the strategy using Cross-Validation technique

In [None]:
from collections import Counter

import matplotlib.pyplot as plt
from imblearn.over_sampling import KMeansSMOTE
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
# The pipelines below contain only a classifier step, so sklearn's Pipeline is sufficient.
from sklearn.pipeline import Pipeline

In [None]:
kNNPipeline=Pipeline(steps=[('KNN_Classifier',KNeighborsClassifier(n_neighbors=3))])
DTPipeline=Pipeline(steps=[('DT_Classifier',DecisionTreeClassifier(criterion='entropy'))])
LRPipeline=Pipeline(steps=[('LR_Classifier',LogisticRegression(random_state=42,max_iter=10000))])
GBPipeline=Pipeline(steps=[('GB_Classifier',GradientBoostingClassifier(random_state=42))])

pipelines=[kNNPipeline,DTPipeline,LRPipeline,GBPipeline]  # one pipeline per model: KNN, Decision Tree, Logistic Regression, Gradient Boosting
pipe_dict = {0: 'KNN', 1: 'Decision Tree', 2: 'Logistic Regression',3:'Gradient Boosting'}


In [None]:
classifier_counter=0
for pipe in pipelines:
    print("Classifier:",pipe_dict[classifier_counter])
    accuracy=list()
    kf = KFold(n_splits=12)
    minority=0
    minorityPredicted=0
    
    for fold, (tr_pointer, ts_pointer) in enumerate(kf.split(X), 1):
        X_train = X[tr_pointer]
        y_train = y[tr_pointer]
        X_test = X[ts_pointer]
        y_test = y[ts_pointer]  
        X_train_UP, y_train_UP = SMOTE(random_state=20, sampling_strategy = 0.7).fit_resample(X_train, y_train)
        pipe.fit(X_train_UP, y_train_UP)  
        y_pred = pipe.predict(X_test)
        accuracy.append(pipe.score(X_test,y_test))
        minority=len(y_test)-Counter(y_test)['GE5'] + minority
        minorityPredicted=len(y_pred)-(Counter(y_pred)['GE5']) + minorityPredicted
    print("Mean Accuracy in 12 folds is: %0.2f" %(sum(accuracy)/len(accuracy)))
    print("Total Minority in test folds: ",minority)
    print("Total Minority predicted in test folds:",minorityPredicted)
    classifier_counter=classifier_counter+1
    print('*' * 20)
    
    

### 1.3 Evaluating the outcomes of both methods

- Cross-validation with K folds incurs a higher computational cost: the data is divided into K folds and the model is trained K times, each time on K-1 folds and tested on the remaining one. The final performance is the average of the K test-fold scores.
- In hold-out testing, we simply fix a single train-test split. A larger training set generally gives better accuracy on unseen data, but it leaves fewer samples for testing, so the estimate depends heavily on which rows land in each part.
- Hold-out testing fits the model only once, so it is faster to compute than cross-validation.
- Cross-validation gives a less variable estimate because every sample is used for both training and testing across the folds, and the fold scores are averaged.
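
The stability argument above can be checked empirically. The following sketch repeats a 50/50 hold-out split and a 12-fold cross-validation with ten different random seeds and reports the spread of each estimate. The data comes from `make_classification` and is only a stand-in for the survival dataset; the spread of the cross-validation means is typically smaller, although this is not guaranteed for every dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption, not the survival dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=3, n_informative=3,
                                     n_redundant=0, weights=[0.75, 0.25],
                                     random_state=0)
model = DecisionTreeClassifier(criterion='entropy', random_state=0)

holdout_acc, cv_acc = [], []
for seed in range(10):
    # One hold-out estimate per seed: a single 50/50 split.
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5,
                                              random_state=seed)
    holdout_acc.append(model.fit(X_tr, y_tr).score(X_te, y_te))
    # One cross-validation estimate per seed: the mean over 12 shuffled folds.
    kf = KFold(n_splits=12, shuffle=True, random_state=seed)
    cv_acc.append(cross_val_score(model, X_demo, y_demo, cv=kf).mean())

print("Hold-out:   mean %0.3f, std %0.3f" % (np.mean(holdout_acc), np.std(holdout_acc)))
print("12-fold CV: mean %0.3f, std %0.3f" % (np.mean(cv_acc), np.std(cv_acc)))
```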

## 2
### Objective: To assess the impact of feature selection on the training and test datasets


In [None]:
HTrain=pd.read_csv('heart-train.csv')   #Loading the Heart training data as dataFrame
HTest=pd.read_csv('heart-test.csv')     #Loading the Heart test data as dataFrame 
HTrain.head(10)                         #showing the top 10 rows of the training dataset 

In [None]:
from sklearn.feature_selection import mutual_info_classif,SelectKBest
from sklearn.feature_selection import chi2

y_Train=HTrain.pop('DEATH_EVENT').values
X_Train=HTrain.values

y_Test=HTest.pop('DEATH_EVENT').values
X_Test=HTest.values

### 2.1 Utilize Gradient Boosting for this task

In [None]:
gb_classifier=GradientBoostingClassifier(random_state=42, learning_rate=0.1) #Gradient Boosting Classifier declaration with learning rate as 0.1

### 2.2 Determining Accuracy on the training and test data using all features

In [None]:
gb_classifier.fit(X_Train,y_Train)   #Gradient Boosting Classifier trained on the Heart training dataset
y_pred=gb_classifier.predict(X_Test) #Gradient Boosting Classifier Tested on the Heart_test dataset

In [None]:
print("The Accuracy comes out to be:",accuracy_score(y_Test,y_pred))

### Using information gain for feature selection to determine the relevance of each attribute in the Heart dataset

In [None]:
iGainScore=mutual_info_classif(X_Train,y_Train)

In [None]:
Feature=pd.DataFrame(iGainScore,index=HTrain.columns,columns=['Information_Gain'])
Feature.sort_values(by=['Information_Gain'],ascending=False,inplace=True)


In [None]:
print("The features are:")
print(Feature)

### Critical information from the feature set
- Out of the 12 attributes, 'time' has the highest information gain. The remaining attributes contribute very little (scores close to 0).
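
For intuition about what an information gain score measures, the sketch below computes it by hand for discrete toy features as the entropy of the labels minus the weighted entropy of the labels within each feature value. The helpers `entropy` and `information_gain` and the toy arrays are hypothetical. Note that `mutual_info_classif` estimates mutual information with a nearest-neighbour method for continuous features, so its values will not match this discrete calculation exactly.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """H(labels) minus the weighted entropy of labels within each feature value."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    conditional = 0.0
    for value in np.unique(feature):
        mask = feature == value
        conditional += mask.mean() * entropy(labels[mask])
    return entropy(labels) - conditional

y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])
x_weak = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # agrees with y_toy on 6 of 8 rows
x_perfect = y_toy.copy()                      # determines y_toy completely
x_constant = np.zeros(8, dtype=int)           # carries no information

for name, x in [('weak', x_weak), ('perfect', x_perfect), ('constant', x_constant)]:
    print(name, round(information_gain(x, y_toy), 3))
```

A feature whose score is close to 0, like most attributes in the table above, behaves like `x_constant` here: knowing its value barely reduces the uncertainty about the class.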

In [None]:
acc_scores = []
# Refit the classifier on the k best features (by mutual information) for k = 1..n_features.
for count in range(1, X_Train.shape[1]+1):
    Feature_transform = SelectKBest(mutual_info_classif, k=count).fit(X_Train,y_Train)
    X_new_trainset = Feature_transform.transform(X_Train)
    X_new_testset = Feature_transform.transform(X_Test)
    gb_model = gb_classifier.fit(X_new_trainset, y_Train)
    y_dash = gb_model.predict(X_new_testset)
    acc = accuracy_score(y_Test, y_dash)
    acc_scores.append(acc)

# The i-th row of the sorted table receives the test accuracy obtained with the top i features.
Feature['Accuracy'] = acc_scores
print(Feature)


### Using sequential feature selection to automatically select the subset of features most relevant to the problem

- Our main aim is to reduce the computational cost involved and to remove unnecessary features, which can also reduce errors caused by irrelevant attributes.
- Sequential feature selection is a wrapper approach that adds or removes attributes one at a time based on the performance of the classifier. The process continues until the desired number of features is reached.
- INPUT: all the attributes of the dataset provided.
- OUTPUT: the number of features must be specified in advance; the output is a subset of the input features of that size. In this computation, I use k_features=8.
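
The wrapper idea described above can be sketched in a few lines without mlxtend: start with no features and, at each step, add the single feature that gives the best cross-validated accuracy. The function `forward_select`, the synthetic data, and the shallow decision tree are assumptions chosen to keep the example small and fast; this is not mlxtend's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_select(estimator, X, y, k_features, cv=5):
    """Greedy forward search: repeatedly add the feature with the best CV accuracy."""
    selected, remaining, history = [], list(range(X.shape[1])), []
    while len(selected) < k_features:
        scored = []
        for j in remaining:
            cols = selected + [j]   # candidate subset: current selection plus feature j
            score = cross_val_score(estimator, X[:, cols], y,
                                    cv=cv, scoring='accuracy').mean()
            scored.append((score, j))
        best_score, best_j = max(scored)
        selected.append(best_j)
        remaining.remove(best_j)
        history.append(best_score)
    return selected, history

# Synthetic stand-in data: 6 features, of which 3 are informative.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
chosen, cv_scores = forward_select(tree, X_demo, y_demo, k_features=3)
print("Selected column indices:", chosen)
print("CV accuracy after each addition:", [round(s, 3) for s in cv_scores])
```

Backward elimination follows the same pattern in reverse: start with all features and repeatedly drop the one whose removal hurts the cross-validated score the least.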


In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS #Importing the sequential feature selector from the mlxtend library
feature_names=HTrain.columns

In [None]:
sfs_fwd_search = SFS(gb_classifier,      # sequential forward search with 10-fold cross-validation
                  k_features=8, 
                  forward=True,
                  floating=False, 
                  verbose=1,
                  scoring='accuracy',
                  cv=10, n_jobs = -1)


In [None]:
sfs_fwd_search.fit(X_Train, y_Train, custom_feature_names=feature_names)

In [None]:
print(sfs_fwd_search.k_feature_names_)

In [None]:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

fig1 = plot_sfs(sfs_fwd_search.get_metric_dict(), 
                ylabel='Accuracy',
                kind='std_dev')
plt.ylim([0.5, 1])
plt.title('Sequential Forward Selection with 8 Features Specified A Priori')
plt.grid()
plt.show()


### From the above plot, accuracy is highest for feature subsets of size 4, 5 and 6; it then decreases for subsets of size 7 and 8

### Selecting 5 as the best subset size, the sequential forward search is rerun with k_features=5

In [None]:
sfs_fwd_search_subset5 = SFS(gb_classifier, 
                        k_features=5, 
                        forward=True, 
                        floating=False,     
                        verbose=1,
                        scoring='accuracy',
                        cv=10, n_jobs = -1)


In [None]:
sfs_fwd_search_subset5.fit(X_Train, y_Train, custom_feature_names=feature_names)

In [None]:
features_subset_5=sfs_fwd_search_subset5.k_feature_names_

In [None]:
print("The 5 feature set based on the Sequential forward search are:")
print(features_subset_5)

### Performing sequential backward elimination to find the 5-feature subset by setting forward=False

In [None]:
seq_backward_elimination = SFS(gb_classifier, 
                  k_features=5, 
                  forward=False, 
                  floating=False, 
                  verbose=1,
                  scoring='accuracy',
                  cv=10, n_jobs = -1)

In [None]:
seq_backward_elimination.fit(X_Train, y_Train, 
                              custom_feature_names=feature_names)

In [None]:
features_subset_back_5=seq_backward_elimination.k_feature_names_

In [None]:
print("The 5 feature-set based on the Sequential Backward elimination are:")
print(features_subset_back_5)

### Finding the features that are common to both approaches, i.e. forward search and backward elimination

In [None]:
print("Intersection of both feature sets :",set(features_subset_5).intersection(set(features_subset_back_5)))

### Next task: To check the accuracy of the classifier with the subset of features identified as significant. They are 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time'

In [None]:
Xtrain_5=HTrain[['ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']].values

In [None]:
gb_classifier.fit(Xtrain_5,y_Train)

In [None]:
Xtest_5=HTest[['ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']].values
y_predicted=gb_classifier.predict(Xtest_5)

In [None]:
print("Accuracy for the Backward Elimination Process")
accuracy_score(y_Test,y_predicted)

In [None]:
XtrainFwd_5=HTrain[['anaemia', 'creatinine_phosphokinase', 'platelets', 'smoking', 'time']].values

In [None]:
gb_classifier.fit(XtrainFwd_5,y_Train)

In [None]:
XtestFwd_5=HTest[['anaemia', 'creatinine_phosphokinase', 'platelets', 'smoking', 'time']].values
y_predictedFwd=gb_classifier.predict(XtestFwd_5)

In [None]:
print("Accuracy for the Forward Search Process")
accuracy_score(y_Test,y_predictedFwd)

### Outcomes of the methods: Forward Sequential Search and Backward Elimination
- Accuracy obtained from the forward search approach comes out to be 79%.
- Accuracy obtained from the backward elimination approach comes out to be 85%. Hence, we go forward with this method.
- Sequential forward search and backward elimination were performed on the Heart training dataset and evaluated on the Heart test dataset.

### Calculating other metrics for evaluating the performance of the model

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

In [None]:
f1 = f1_score(y_Test,y_predicted)   # separate name so the imported f1_score function is not overwritten
print('The F1_Score Value is: %0.2f '%f1)

precision = precision_score(y_Test,y_predicted)
print("The precision Value is: %0.2f " %precision)

recall = recall_score(y_Test,y_predicted)
print("The recall Value is : %0.2f " %recall)

acc=accuracy_score(y_Test,y_predicted)
print("The Accuracy Value is : %0.2f " %acc)

### Outcomes from the performance metrics => F1Score, Precision Value, Recall and Accuracy Score
- The accuracy and precision of the classification model are decent, at 85 and 80 percent respectively.
- The recall is low in comparison to the other metrics. Because this classification model supports medical diagnosis, the true positive rate is especially important. Recall is the ratio of correctly predicted positive samples to all samples that actually belong to the positive class, i.e. the share of actual death events that the system predicts as death events.
- To improve recall, we apply the SMOTE technique to upsample the Heart training dataset.
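
The definitions above can be verified directly from the confusion matrix. In this small sketch the label arrays are made up for illustration (1 is assumed to mean a death event and 0 survival); the hand-computed values are compared with scikit-learn's `recall_score` and `precision_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = death event (positive class), 0 = survived.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels 0/1, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)        # share of actual death events that were predicted
precision = tp / (tp + fp)     # share of predicted death events that were real

print("Recall:    %0.2f (sklearn: %0.2f)" % (recall, recall_score(y_true, y_hat)))
print("Precision: %0.2f (sklearn: %0.2f)" % (precision, precision_score(y_true, y_hat)))
```

Here two of the four actual death events are missed, so recall is low even though most predictions overall are correct, which is the situation described in the second bullet.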


In [None]:
from imblearn.over_sampling import SMOTE
sm=SMOTE(random_state=20)
X_TrainUP,y_TrainUP=sm.fit_resample(Xtrain_5,y_Train)
gb_classifier.fit(X_TrainUP,y_TrainUP)

In [None]:
y_pred_sm=gb_classifier.predict(Xtest_5)

In [None]:
acc_score_sm=accuracy_score(y_Test,y_pred_sm)
print("The Accuracy score is : %0.2f " %acc_score_sm)
recall_score_sm= recall_score(y_Test,y_pred_sm)
print("The recall Value is : %0.2f " %recall_score_sm) 
precision_sm = precision_score(y_Test,y_pred_sm)
print("The precision Value is: %0.2f " %precision_sm)

### Final Outcome
- After applying SMOTE to the Heart training data, the effects on the evaluation metrics are:
- Accuracy and recall both improve noticeably: accuracy rose from 85% to 87% and recall rose from 72% to 81%.
- Upsampling has little or no effect on precision.