# Copyright

This lecture was prepared by Samir Abdelrahman and Adam Kotter. Some of the content draws on links and websites that are cited and used during the lecture.



# Objectives

We will run through a full example of model development, including the following steps:
1. Problem selection
2. Dataset evaluation
3. Dataset cleaning
4. Cross validation
5. Selecting which model type to use (with statistics!)
6. Selecting which features to use

Please think about how these concepts apply to your group project during the lecture.  
This lecture gives students hands-on practice with cross-validation, pairwise statistical tests between several classifiers, and an introduction to feature selection. Please study the medium-length links provided below.

# Define a problem statement and goals

Why do we want to use this dataset?  
Which feature do we want to predict to accomplish our goal?  
Do we want to predict classes (classifier) or predict values (regressor)?

# [Understand the dataset and features](https://archive.ics.uci.edu/ml/datasets/hepatitis)

## Features
   
     1. Class: DIE, LIVE
     2. AGE: 10, 20, 30, 40, 50, 60, 70, 80
     3. SEX: male, female
     4. STEROID: no, yes
     5. ANTIVIRALS: no, yes
     6. FATIGUE: no, yes
     7. MALAISE: no, yes
     8. ANOREXIA: no, yes
     9. LIVER BIG: no, yes
    10. LIVER FIRM: no, yes
    11. SPLEEN PALPABLE: no, yes
    12. SPIDERS: no, yes
    13. ASCITES: no, yes
    14. VARICES: no, yes
    15. BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
    16. ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
    17. SGOT: 13, 100, 200, 300, 400, 500
    18. ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
    19. PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
    20. HISTOLOGY: no, yes

The values listed above "represent so-called 'boundary' values; according to these 'boundary' values the attribute can be discretized. At the same time, because of the continuous attribute, one can perform some other test since the continuous information is preserved."
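
As a minimal sketch of what discretizing by boundary values could look like, the snippet below bins a few made-up ages with `pd.cut`, using the AGE boundaries from the feature list as bin edges. The sample ages and the names `age_bins`, `ages`, and `age_groups` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import pandas as pd

# Boundary values for AGE, copied from the feature list above
age_bins = [10, 20, 30, 40, 50, 60, 70, 80]

# Hypothetical ages, made up for illustration only
ages = pd.Series([15, 34, 50, 62, 78])

# Each age is assigned to a half-open interval (lower, upper]
age_groups = pd.cut(ages, bins=age_bins)
print(age_groups.value_counts().sort_index())
```

The raw ages are kept in `ages`, so the continuous information is still available for tests that need it.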

In [None]:
# Import statements

import numpy as np
import pandas as pd
import math
from scipy.stats import loguniform
from statistics import mean
from scipy import stats
import matplotlib.pyplot as plt
import warnings 



from sklearn.model_selection import StratifiedShuffleSplit


from sklearn.metrics import classification_report,f1_score
from sklearn.impute import SimpleImputer 

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
from scikit_posthocs import posthoc_nemenyi_friedman


warnings.filterwarnings("ignore")

## Loading the dataset

In [None]:
# We load a CSV with data into a DataFrame object; we use commas to denote the thousands place, and any values marked with a ? are considered missing
data = pd.read_csv('~/DATA/hepatitis.csv', thousands=',', na_values='?')

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
# The dtypes shown above do not contradict the feature description from the link
# We want to use the first feature (Class) as our outcome, so we will skip that when defining our features
allFeatures=data.columns[1:len(data.columns)]
# We need to separate the categorical features from the numeric features; this is done based on manual inspection of the dataset
catFeatures=data.columns[list(range(2,14))+list(range(19,20))]
# To make sure we don't miss any features, we use list comprehension syntax to get every feature in allFeatures that isn't already in catFeatures
numFeatures= [feature for feature in allFeatures if not(feature in catFeatures)]

In [None]:
numFeatures

In [None]:
catFeatures

# Descriptive analysis
What does the data look like?

In [None]:
# What shape do the numeric features in the data have?
data[numFeatures].describe()

In [None]:
# What does a visual description of the numeric data look like?
data[numFeatures].hist(figsize=(7.50, 7.50))

In [None]:
# What does the distribution of categorical data look like?

# This is to make the plot look better
plt.rcParams["figure.figsize"] = [7.50, 3.50]
plt.rcParams["figure.autolayout"] = True

# This loops through each categorical feature and visually plots it
for cat in catFeatures:
    fig, ax = plt.subplots()
    data[cat].value_counts().plot(ax=ax, kind='bar', xlabel=cat, ylabel='frequency')
    plt.show()

In [None]:
# How imbalanced is our dataset?
data['Class'].value_counts()

In [None]:
# We define X and y for our models here
X = data[data.columns[1:20]]
y = data[data.columns[0]]

# Developing Imputer and Standardized Scaler

Our imputer fills in missing data for us. Many algorithms will break if they hit a missing value, so we need to find a way to put in a "best guess" for what the missing values would be.  
Our scaler brings all of the data into a more uniform range without losing information so that features can be compared more easily to each other.
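
To make the two ideas concrete, here is a small sketch on a made-up numeric column (the `toy` array is an assumption for illustration, not part of the hepatitis data). It shows mean imputation followed by standardization.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy numeric column with one missing value (made up for illustration)
toy = np.array([[1.0], [2.0], [np.nan], [3.0]])

# Mean imputation replaces the NaN with the mean of the observed values (here 2.0)
imputed = SimpleImputer(strategy='mean').fit_transform(toy)
print(imputed.ravel())

# Standardization rescales the column to zero mean and unit variance
scaled = StandardScaler().fit_transform(imputed)
print(scaled.mean(), scaled.std())
```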

In [None]:
# This imputer fills missing values with the mode of the data; this works better for discretized data than for continuous data
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# This imputer fills missing values with the mean of the data; this only works for continuous data
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# This scaler has not been trained yet
scaler=StandardScaler()

# Developing baseline and ensemble classifiers

In [None]:
# Here we define all of the classifiers we want to test along with strings representing their human-readable names
baselineClassifiers=[LogisticRegression(), SVC(), KNeighborsClassifier(),DecisionTreeClassifier()]
nameBaselineClassifiers=['LR','SVC','KNN','DT']

# This section just puts all of the classifiers together with their names in a single list
estimators=[]
for bc in range(0,len(baselineClassifiers)):
    estimators.append((nameBaselineClassifiers[bc],baselineClassifiers[bc]))

# Splitting the data: Use any Splitting Criterion

In [None]:
# In this example we use repeated stratified random hold-out splits (StratifiedShuffleSplit); each split preserves the percentage of samples in each class, but unlike k-fold the test sets of different splits may overlap
number_of_splits=5
sss = StratifiedShuffleSplit(n_splits=number_of_splits, test_size=0.3, random_state=0)
# We set random_state above so that we get the same results every time we run the code
train_indexes=[]
test_indexes=[]
for train_index, test_index in sss.split(X, y):
    train_indexes.append(train_index)
    test_indexes.append(test_index)
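
The following sketch checks the stratification property on made-up labels. The names `toy_X`, `toy_y`, and `toy_sss` are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels: 80 samples of class 0 and 20 of class 1 (made up)
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.zeros((100, 1))  # feature values are irrelevant for the split itself

toy_sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for toy_train, toy_test in toy_sss.split(toy_X, toy_y):
    # The fraction of class 1 in each part should stay close to 0.20
    print(toy_y[toy_train].mean(), toy_y[toy_test].mean())
```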

In [None]:
results={}    # We initialize a dictionary to hold testing results for each classifier
for bc in range(0,len(baselineClassifiers)):    # The variable 'bc' is an integer representing the index value of each classifier
    print(nameBaselineClassifiers[bc])
    results[nameBaselineClassifiers[bc]]=[]    # We create a dictionary entry based on the human-readable name of the classifier
    for tr,te in zip(train_indexes,test_indexes):
        X_train, X_test = X.iloc[tr].copy(), X.iloc[te].copy()    # Copy so the assignments below never touch the original X
        y_train, y_test = y.iloc[tr], y.iloc[te]
        
        # Imputation
        imp_mode.fit(X_train[catFeatures])    # We replace missing categorical values with the most common value in their feature
        imp_mean.fit(X_train[numFeatures])    # We replace missing numeric values with the mean of their feature
        X_train[catFeatures]=imp_mode.transform(X_train[catFeatures])
        X_test[catFeatures]=imp_mode.transform(X_test[catFeatures])
        X_train[numFeatures]=imp_mean.transform(X_train[numFeatures])
        X_test[numFeatures]=imp_mean.transform(X_test[numFeatures])
    
        # Scaling numeric features
        scaler.fit(X_train[numFeatures])    # We need to train the scaler on the training data before we can scale the training or testing sets
        X_train[numFeatures]=scaler.transform(X_train[numFeatures])
        X_test[numFeatures]=scaler.transform(X_test[numFeatures])
    
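        # Encoding the Categorical features
        # This step is necessary if any categorical features have more than two possible values
        # After imputation the categorical columns are numeric, so we name them explicitly for get_dummies
        X_train=pd.get_dummies(X_train,columns=list(catFeatures))
        X_test=pd.get_dummies(X_test,columns=list(catFeatures))
        X_test = X_test.reindex(columns = X_train.columns, fill_value=0)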
        
        # Train and test the model on the scaled and imputed data
        model = baselineClassifiers[bc].fit(X_train,y_train)
        y_test_pred = model.predict(X_test)
        f1_value = f1_score(y_test,y_test_pred,average='micro')
        print(f1_value)    # We base model performance on F1-score; with one label per sample, 'micro' averaging equals accuracy, unlike 'binary', which scores only the positive class
        results[nameBaselineClassifiers[bc]].append(f1_value)
    print('The average of the classifier\'s F1-score results', np.average(results[nameBaselineClassifiers[bc]]))
    print()
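
A short sketch of how the `average` argument of `f1_score` behaves on a binary problem, using made-up labels and predictions. When every sample has exactly one label, micro-averaged F1 equals accuracy, which can differ from the F1-score of the positive class alone.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy binary labels and predictions (made up for illustration)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]

micro = f1_score(y_true, y_pred, average='micro')
binary = f1_score(y_true, y_pred, average='binary')  # F1 of the positive class (label 1)
acc = accuracy_score(y_true, y_pred)
print(micro, acc, binary)
```

On an imbalanced dataset, it may be worth reporting the positive-class or macro-averaged F1 alongside the micro average.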

# Which classifier is the best based on the mean over repeated hold-out splits?

In [None]:
maxClassifier=''
maximum=float('-inf')
# Go through the name of each classifier to find the one with the best results in the 'results' dictionary
for nameClassifier in results.keys():
    avg=np.average(results[nameClassifier])
    if (avg > maximum):    # Save the average value as a potential maximum if it is greater than every previous average value
        maximum=avg
        maxClassifier=nameClassifier

print('The classifier {0} has the maximum average F1-score of {1}'.format(maxClassifier,maximum))

# Is the best classifier actually significantly better than the other classifiers?

In [None]:
test_significance=[]
for nameClassifier in results.keys():
    test_significance.append(results[nameClassifier])

# Check whether the Friedman test is significant
chi_square,p_value_mean=stats.friedmanchisquare(*test_significance)
print(p_value_mean)

In [None]:
# If a significant difference exists, we can check for pairwise significant differences
trans_groups=np.array(test_significance).T
p=posthoc_nemenyi_friedman(trans_groups)
print(p)
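
The decision logic can be sketched as follows: run the Friedman test first and only move on to pairwise comparisons if it rejects the null hypothesis. The scores below are hypothetical numbers made up for illustration, and the threshold `alpha = 0.05` is an assumed convention.

```python
from scipy import stats

# Hypothetical F1-scores over five splits for three classifiers (made up)
scores_a = [0.70, 0.72, 0.71, 0.69, 0.73]
scores_b = [0.80, 0.82, 0.81, 0.79, 0.83]
scores_c = [0.90, 0.92, 0.91, 0.89, 0.93]

stat, p_value = stats.friedmanchisquare(scores_a, scores_b, scores_c)
alpha = 0.05  # conventional significance threshold, assumed here
if p_value < alpha:
    print('At least one classifier differs; a post-hoc test is justified')
else:
    print('No significant difference detected; skip the post-hoc test')
```

With only a handful of splits the chi-square approximation behind the p-value is rough, so borderline results should be read with caution.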

# Feature selection methods 


What are the differences between feature selection, dimensionality reduction, and feature extraction?



Resources: [link1](https://towardsdatascience.com/feature-selection-for-machine-learning-3-categories-and-12-methods-6a4403f86543)

See also the feature selection module in [Sklearn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)
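
One way to see the contrast is in what the output columns are. In the sketch below, which uses synthetic data and hypothetical names (`toy_X`, `toy_y`, `selected`, `pca`), feature selection returns a subset of the original columns, while PCA, a dimensionality reduction technique, builds new columns that mix all of the inputs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(40, 4))
toy_y = (toy_X[:, 0] > 0).astype(int)  # only column 0 carries the label

# Feature selection: the output columns are a subset of the input columns
selected = SelectKBest(f_classif, k=2).fit(toy_X, toy_y)
kept = selected.get_support(indices=True)
print('kept original columns:', kept)

# Dimensionality reduction: each output column is a weighted mix of all inputs
pca = PCA(n_components=2).fit(toy_X)
print('component weight matrix shape:', pca.components_.shape)
```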

## Feature selection 

1. It eliminates irrelevant and noisy features by keeping the ones with minimum redundancy and maximum relevance to the target variable.
2. It reduces the computational time and complexity of training and testing a classifier, so it results in more cost-effective models.
3. It improves learning algorithms’ performance, avoids overfitting, and helps to create better general models.

There are three categories of feature selection methods, depending on how they interact with the classifier:
1. Filter methods
2. Wrapper methods
3. Embedded methods

### Filter methods 

Filter methods scale to very high-dimensional data and select features quickly before classification, so the bias of the learning algorithm does not interact with the bias of the feature selection algorithm.

#### Chi-square

The chi-square test measures the dependence between a feature and the target variable. If the target is independent of the feature, the feature gets a low score; if they are dependent, the feature gets a high score and is considered important. A higher chi-square value means that the feature is more relevant to the class.
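
A tiny sketch of this scoring on made-up data: one feature copies the class and the other is unrelated to it. The arrays `toy_X` and `toy_y` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data (made up): feature 0 copies the class, feature 1 ignores it
toy_y = np.array([0, 0, 0, 1, 1, 1])
toy_X = np.array([
    [0, 1],
    [0, 0],
    [0, 1],
    [1, 1],
    [1, 0],
    [1, 1],
])

scores, p_values = chi2(toy_X, toy_y)
print(scores)    # the dependent feature gets the larger statistic
print(p_values)
```

Note that `chi2` in scikit-learn expects non-negative feature values, such as counts or binary indicators.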

In [None]:
from sklearn.feature_selection import SelectKBest, chi2
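# chi2 requires non-negative feature values, so selection runs on the imputed data before scaling
chi_selector = SelectKBest(chi2, k=15)  # We're using the 15 best features here; use k='all' if you need to rank all features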

results={}
for bc in range(0,len(baselineClassifiers)):
    print(nameBaselineClassifiers[bc])
    results[nameBaselineClassifiers[bc]]=[]
    for tr,te in zip(train_indexes,test_indexes):
        X_train, X_test = X.iloc[tr].copy(), X.iloc[te].copy()
        y_train, y_test = y.iloc[tr], y.iloc[te]
        
        #Imputation
        imp_mode.fit(X_train[catFeatures])
        imp_mean.fit(X_train[numFeatures])
        X_train[catFeatures]=imp_mode.transform(X_train[catFeatures])
        X_test[catFeatures]=imp_mode.transform(X_test[catFeatures])
        X_train[numFeatures]=imp_mean.transform(X_train[numFeatures])
        X_test[numFeatures]=imp_mean.transform(X_test[numFeatures])
        
        cols=X_train.columns
    
        #Feature Selection
        chi_selector.fit(X_train,y_train)
        X_train=chi_selector.transform(X_train)
        X_test=chi_selector.transform(X_test)
        
        column_names = cols[chi_selector.get_support()]
        
        X_train=pd.DataFrame(X_train,columns=column_names)
        X_test=pd.DataFrame(X_test,columns=column_names)    
        newNumFeaturs=[num for num in numFeatures if num in column_names]
        newCatFeaturs=[num for num in catFeatures if num in column_names]
            
        #Scaling numeric features
        scaler.fit(X_train[newNumFeaturs])
        X_train[newNumFeaturs]=scaler.transform(X_train[newNumFeaturs])
        X_test[newNumFeaturs]=scaler.transform(X_test[newNumFeaturs])
    
        #Encoding the Categorical features
        X_train=pd.get_dummies(X_train,columns=newCatFeaturs)
        X_test=pd.get_dummies(X_test,columns=newCatFeaturs)
        X_test = X_test.reindex(columns = X_train.columns, fill_value=0)
        
        
        model=baselineClassifiers[bc].fit(X_train,y_train)
        y_test_pred=model.predict(X_test)
        f1_value=f1_score(y_test,y_test_pred,average='micro')
        print(f1_value)
        results[nameBaselineClassifiers[bc]].append(f1_value)
    print('The average of the classifier\'s F1-score results',np.average(results[nameBaselineClassifiers[bc]]))
    print()

####  Mutual Information

From sklearn:  
"Mutual information between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency."

A feature is considered relevant if it has a high information gain. Because each feature is scored on its own (in a univariate way), this method cannot detect redundant features.
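
The sketch below illustrates both points on made-up discrete data: an informative feature scores high, an independent one scores zero, and an exact duplicate of the informative feature gets the same high score. Passing `discrete_features=True` is an assumption that fits this toy data; the notebook's own call uses the default setting.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

toy_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
informative = toy_y.copy()                      # identical to the class
noise = np.array([0, 1, 0, 1, 0, 1, 0, 1])      # independent of the class
duplicate = informative.copy()                  # redundant copy of 'informative'
toy_X = np.column_stack([informative, noise, duplicate])

# Treat every column as discrete so the estimate is exact for this toy data
mi = mutual_info_classif(toy_X, toy_y, discrete_features=True)
print(mi)
```

Selecting the top two features here would keep both copies of the same information, which is the redundancy problem described above.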

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
me_selector = SelectKBest(mutual_info_classif, k=15)  # We're using the 15 best features here; use k='all' if you need to rank all features

results={}
for bc in range(0,len(baselineClassifiers)):
    print(nameBaselineClassifiers[bc])
    results[nameBaselineClassifiers[bc]]=[]
    for tr,te in zip(train_indexes,test_indexes):
        X_train, X_test = X.iloc[tr].copy(), X.iloc[te].copy()
        y_train, y_test = y.iloc[tr], y.iloc[te]
        
        #Imputation
        imp_mode.fit(X_train[catFeatures])
        imp_mean.fit(X_train[numFeatures])
        X_train[catFeatures]=imp_mode.transform(X_train[catFeatures])
        X_test[catFeatures]=imp_mode.transform(X_test[catFeatures])
        X_train[numFeatures]=imp_mean.transform(X_train[numFeatures])
        X_test[numFeatures]=imp_mean.transform(X_test[numFeatures])
        
        cols=X_train.columns
    
        #Feature Selection
        me_selector.fit(X_train,y_train)
        X_train=me_selector.transform(X_train)
        X_test=me_selector.transform(X_test)
        
        column_names = cols[me_selector.get_support()]
        
        X_train=pd.DataFrame(X_train,columns=column_names)
        X_test=pd.DataFrame(X_test,columns=column_names)    
        newNumFeaturs=[num for num in numFeatures if num in column_names]
        newCatFeaturs=[num for num in catFeatures if num in column_names]
            
        #Scaling numeric features
        scaler.fit(X_train[newNumFeaturs])
        X_train[newNumFeaturs]=scaler.transform(X_train[newNumFeaturs])
        X_test[newNumFeaturs]=scaler.transform(X_test[newNumFeaturs])
    
        #Encoding the Categorical features
        X_train=pd.get_dummies(X_train,columns=newCatFeaturs)
        X_test=pd.get_dummies(X_test,columns=newCatFeaturs)
        X_test = X_test.reindex(columns = X_train.columns, fill_value=0)
        
        
        model=baselineClassifiers[bc].fit(X_train,y_train)
        y_test_pred=model.predict(X_test)
        f1_value=f1_score(y_test,y_test_pred,average='micro')
        print(f1_value)
        results[nameBaselineClassifiers[bc]].append(f1_value)
    print('The average of the classifier\'s F1-score results',np.average(results[nameBaselineClassifiers[bc]]))
    print()

###  Wrapper methods

A widely used wrapper method, recursive feature elimination, trains a model iteratively and each time removes the least important feature, using the model's weights as the criterion.
It is a multivariate method in the sense that it evaluates the relevance of several features considered jointly.
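
A minimal sketch of this idea using scikit-learn's `RFE` (recursive feature elimination) with logistic regression as the underlying model. The synthetic dataset, the choice of five features to keep, and the variable names are assumptions for illustration, not settings taken from this lecture.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed for illustration): 10 features, 3 of them informative
toy_X, toy_y = make_classification(n_samples=200, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=0)

# Refit the model repeatedly, dropping the weakest feature each time until five remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(toy_X, toy_y)

print(rfe.support_)   # True for the features that were kept
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

As with the filter methods, the selector should be fitted on the training split only, so that the test split does not influence which features are kept.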

## Feature selection using scipy.stats

1. [Mann–Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test)
2. [Chi-Squared test](https://en.wikipedia.org/wiki/Chi-squared_test)
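
As a sketch of how these two tests can be applied per feature, the example below uses a made-up table with one numeric feature, one categorical feature, and a binary outcome. The column names `lab_value`, `symptom`, and `outcome` are hypothetical. The numeric feature is split by outcome class before the Mann-Whitney U test, and the categorical feature is cross-tabulated against the outcome for the chi-squared test.

```python
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

# Toy data (made up): one numeric and one binary feature plus a binary class
toy = pd.DataFrame({
    'lab_value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'symptom':   [1, 1, 1, 1, 2, 1, 2, 2, 2, 2],
    'outcome':   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Numeric feature: compare its values between the two outcome groups
group0 = toy.loc[toy['outcome'] == 0, 'lab_value']
group1 = toy.loc[toy['outcome'] == 1, 'lab_value']
_, p_num = mannwhitneyu(group0, group1, alternative='two-sided')

# Categorical feature: test independence on the feature-by-outcome table
table = pd.crosstab(toy['symptom'], toy['outcome'])
_, p_cat, _, _ = chi2_contingency(table)
print(p_num, p_cat)
```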

In [None]:
# This is an example of a filter method that uses a chi-squared test for categorical features and a Mann-Whitney U test for numeric features

from scipy.stats import mannwhitneyu,chi2_contingency

results={}
for bc in range(0,len(baselineClassifiers)):
    print(nameBaselineClassifiers[bc])
    results[nameBaselineClassifiers[bc]]=[]
    for tr,te in zip(train_indexes,test_indexes):
        X_train, X_test = X.iloc[tr].copy(), X.iloc[te].copy()
        y_train, y_test = y.iloc[tr], y.iloc[te]
        
        #Imputation
        imp_mode.fit(X_train[catFeatures])
        imp_mean.fit(X_train[numFeatures])
        X_train[catFeatures]=imp_mode.transform(X_train[catFeatures])
        X_test[catFeatures]=imp_mode.transform(X_test[catFeatures])
        X_train[numFeatures]=imp_mean.transform(X_train[numFeatures])
        X_test[numFeatures]=imp_mean.transform(X_test[numFeatures])
        
        # Remove all insignificant categorical features from consideration
        newCatFeatures=[]
        for fe in catFeatures:
            table=pd.crosstab(X_train[fe].to_numpy().flatten(), y_train.to_numpy().flatten())
            _, p, _, _ = chi2_contingency(table)
            if (p <=0.05):
                 newCatFeatures.append(fe) 
        
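        # Remove all insignificant numeric features from consideration
        # The test compares a feature's values between the two outcome classes (this assumes a binary outcome)
        newNumFeatures=[]
        classLabels=np.unique(y_train)
        for fe in numFeatures:
            _, p = mannwhitneyu(X_train.loc[y_train==classLabels[0],fe], X_train.loc[y_train==classLabels[1],fe], alternative='two-sided')
            if (p <=0.05):
                newNumFeatures.append(fe)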
    
        #print('Feature Selection')
        #print(numFeatures)
        #print(catFeatures)
        #print(newNumFeatures)
        #print(newCatFeatures)
        #input('Feature Selection')
        

        X_train=X_train[newNumFeatures+newCatFeatures]
        X_test=X_test[newNumFeatures+newCatFeatures]
        
        #Scaling numeric features
        scaler.fit(X_train[newNumFeatures])
        X_train[newNumFeatures]=scaler.transform(X_train[newNumFeatures])
        X_test[newNumFeatures]=scaler.transform(X_test[newNumFeatures])
    
        #Encoding the Categorical features
        X_train=pd.get_dummies(X_train,columns=newCatFeatures)
        X_test=pd.get_dummies(X_test,columns=newCatFeatures)
        X_test = X_test.reindex(columns = X_train.columns, fill_value=0)
        
        
        model=baselineClassifiers[bc].fit(X_train,y_train)
        y_test_pred=model.predict(X_test)
        f1_value=f1_score(y_test,y_test_pred,average='micro')
        print(f1_value)
        results[nameBaselineClassifiers[bc]].append(f1_value)
    print('The average of the classifier\'s F1-score results',np.average(results[nameBaselineClassifiers[bc]]))
    print()