# Question 2 Naive Bayes

Build a Naive Bayes classifier for the given training data with add 1 smoothing technique covered in the lecture slides:


## I. Setups and import data 

1. import data
2. split train and test (don't need to split here)

In [196]:
# setup and import data

import pandas as pd
import numpy as np

df = pd.read_csv('Desktop/Code/data/nb_instances.csv', sep = ',', index_col = 'Instance')
df
#df.shape

Unnamed: 0_level_0,Education Level,Career,Years of Experience,Salary
Instance,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,High School,Management,Less than 3,Low
2,High School,Management,3 to 10,Low
3,College,Management,Less than 3,High
4,College,Service,More than 10,Low
5,High School,Service,3 to 10,Low
6,College,Service,3 to 10,High
7,College,Management,More than 10,High
8,College,Service,Less than 3,Low
9,High School,Management,More than 10,High
10,High School,Service,More than 10,Low


In [197]:
# split the data into training and testing 
# pre-defined train and test 

train = df.iloc[0:10,:]
test = df.iloc[10:13,:]

## II. Calculate the priors: P(class) or P(A)

priors: can be constants or probability distribution 
e.g. the % of being male or female 

It is the probability of each class label, wihch are p(high) and p(low) in our case

In [201]:
# compute the number of each class

total_ppl = train['Salary'].count() 
n_high = train['Salary'][train['Salary'] == 'High'].count()
n_low = train['Salary'][train['Salary'] == 'Low'].count()

In [202]:
# compute the prior probability: % of being high or low

p_high = n_high/total_ppl
p_low = n_low/total_ppl

## III. Calculate likelihood: P(data|class)


For likelihood, we have two assumptions: 

1) we assume each features is uncorrelated from each other  
2) we assume the value of the features are normally/Guassian distributed. So in the case of continous variables, we will need to impute p(data|class) by the probability density function of the normal distribution

For categorical data, 
we compute the probability by dividing the number instances within each class by the number of each class. 

For continuous data, 
we need to calculate the mean and std of each attribute. 


In [203]:
# the likelihood when class label = high

n_high = train['Salary'][train['Salary'] == 'High'].count()
n_low = train['Salary'][train['Salary'] == 'Low'].count()

# number of education/salary
n_college_high = train[(train['Salary'] == 'High') & (train['Education Level'] == 'College')].count()
n_highschool_high = train[(train['Salary'] == 'High') & (train['Education Level'] == 'High School')].count()

n_college_low = train[(train['Salary'] == 'Low') & (train['Education Level'] == 'College')].count()
n_highschool_low = train[(train['Salary'] == 'Low') & (train['Education Level'] == 'High School')].count()

# number of career/salary
n_management_high = train[(train['Salary'] == 'High') & (train['Career'] == 'Management')].count()
n_service_high = train[(train['Salary'] == 'High') & (train['Career'] == 'Service')].count()

n_management_low = train[(train['Salary'] == 'Low') & (train['Career'] == 'Management')].count()
n_service_low = train[(train['Salary'] == 'Low') & (train['Career'] == 'Service')].count()

# number of experience/salary
n_less3_high = train[(train['Salary'] == 'High') & (train['Years of Experience'] == 'Less than 3')].count()
n_3to10_high = train[(train['Salary'] == 'High') & (train['Years of Experience'] == '3 to 10')].count()
n_more3_high = train[(train['Salary'] == 'High') & (train['Years of Experience'] == 'More than 10')].count()

n_less3_low = train[(train['Salary'] == 'Low') & (train['Years of Experience'] == 'Less than 3')].count()
n_3to10_low = train[(train['Salary'] == 'Low') & (train['Years of Experience'] == '3 to 10')].count()
n_more3_low = train[(train['Salary'] == 'Low') & (train['Years of Experience'] == 'More than 10')].count()


In [204]:
# compute the conditional probability of each attribute

# for class label: high 
p_college_high = n_college_high/n_high
p_highschool_high = n_highschool_high/n_high

p_management_high = n_management_high/n_high
p_service_high = n_service_high/n_high

p_less3_high = n_less3_high/n_high
p_3to10_high = n_3to10_high/n_high
p_more3_high = n_more3_high/n_high


# for class label: low
p_college_low = n_college_low/n_low
p_highschool_low = n_highschool_low/n_low

p_management_low = n_management_low/n_low
p_service_low = n_service_low/n_low

p_less3_low = n_less3_low/n_low
p_3to10_low = n_3to10_low/n_low
p_more3_low = n_more3_low/n_low


In [245]:
# we'd like to apply Laplace Smoothing so we won't end up with zero probabilities

# for class label: high 
p_college_high_lap = (n_college_high + 1)/(n_high + 2)
p_highschool_high_lap = (n_highschool_high + 1)/(n_high + 2)

p_management_high_lap = (n_management_high + 1)/(n_high + 2)
p_service_high_lap = (n_service_high + 1)/(n_high + 2)

p_less3_high_lap = (n_less3_high + 1)/(n_high + 3)
p_3to10_high_lap = (n_3to10_high + 1)/(n_high + 3)
p_more3_high_lap = (n_more3_high + 1)/(n_high + 3)


# for class label: low
p_college_low_lap = (n_college_low + 1)/(n_low + 2)
p_highschool_low_lap = (n_highschool_low + 1)/(n_low + 2)

p_management_low_lap = (n_management_low + 1)/(n_low + 2)
p_service_low_lap = (n_service_low + 1)/(n_low + 2)

p_less3_low_lap = (n_less3_low + 1)/(n_low + 3)
p_3to10_low_lap = (n_3to10_low + 1)/(n_low + 3)
p_more3_low_lap = (n_more3_low + 1)/(n_low + 3)
p_college_low_lap

Education Level        0.375
Career                 0.375
Years of Experience    0.375
Salary                 0.375
dtype: float64

## IV. Compute the marginal probability: P(data)

This step can be neglected
since each class with have the same marginal probability 

In reality, this % will be hard to define

In [227]:
# compute the number of instances in each attribute 

n_highschool = train['Education Level'][train['Education Level'] == 'High School'].count()
n_college = train['Education Level'][train['Education Level'] == 'College'].count()

n_management = train['Career'][train['Career'] == 'Management'].count()
n_service = train['Career'][train['Career'] == 'Service'].count()

n_less3 = train['Years of Experience'][train['Years of Experience'] == 'Less than 3'].count()
n_3to10 = train['Years of Experience'][train['Years of Experience'] == '3 to 10'].count()
n_more3 = train['Years of Experience'][train['Years of Experience'] == 'More than 10'].count()


In [233]:
# conpute the marginal probability of each attributes

p_highschool = n_highschool/total_ppl
p_college = n_college/total_ppl

p_management = n_management/total_ppl
p_service = n_service/total_ppl

p_less3 = n_less3/total_ppl
p_3to10 = n_3to10/total_ppl
p_more3 = n_more3/total_ppl

p_retail = 1/total_ppl
p_graduate = 1/total_ppl

## V. Apply Bayes Classifier to test data

Here, we'd like to know and compare: 

P(High|1) = P(High) * P(High School|High) * P(Service|High) * P(Less than 3|High) 

P(Low|1) = P(Low) * P(College|Low) * P(Retail|Low) * P(Less than 3|Low) 

P(High|2) = P(High) * P(College|High) * P(Retail|High) * P(Less than 3|High) 

P(Low|2) = P(Low) * P(College|Low) * P(Retail|Low) * P(Less than 3|Low) 

P(High|3) = P(High) * P(Graduate|High) * P(Service|High) * P(3 to 10|High) 

P(Low|3) = P(Low) * P(Graduate|Low) * P(Service|Low) * P(3 to 10|Low) 

In [183]:
test

Unnamed: 0_level_0,Education Level,Career,Years of Experience,Salary
Instance,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
11,High School,Service,Less than 3,
12,College,Retail,Less than 3,
13,Graduate,Service,3 to 10,


In [246]:
# here we have instance with undefined attribute value 
# so define there likelihood by adding a minimum count 

p_retail_high_lap = 1 /(n_high + 2)
p_retail_low_lap = 1 /(n_low + 2)

p_graduate_high_lap = 1 /(n_high + 2)
p_graduate_low_lap = 1 /(n_low + 2)

0.125

In [252]:
# apply classifier to instance 1

p_high_1 = p_high * p_highschool_high_lap * p_service_high_lap * p_less3_high_lap
p_high_1w = (p_high * p_highschool_high_lap * p_service_high_lap * p_less3_high_lap)/(p_highschool * p_service * p_less3)

p_low_1 = p_low * p_highschool_low_lap * p_service_low_lap * p_less3_low_lap
p_low_1w = (p_low * p_highschool_low_lap * p_service_low_lap * p_less3_low_lap)/(p_highschool * p_service * p_less3)

print(p_high_1w, p_low_1w)

Education Level        0.169312
Career                 0.169312
Years of Experience    0.169312
Salary                 0.169312
dtype: float64 Education Level        1.041667
Career                 1.041667
Years of Experience    1.041667
Salary                 1.041667
dtype: float64


For instance 1, we would classify it to Low

In [254]:
# apply classifier to instance 2

p_high_2 = p_high * p_college_high_lap * p_retail_high_lap * p_less3_high_lap
p_high_2w = (p_high * p_college_high_lap * p_retail_high_lap * p_less3_high_lap)/(p_college * p_retail * p_less3)

p_low_2 = p_low * p_college_low_lap * p_retail_low_lap * p_less3_low_lap
p_low_2w = (p_low * p_college_low_lap * p_retail_low_lap * p_less3_low_lap)/(p_college * p_retail * p_less3)

print(p_high_2w, p_low_2w)

Education Level        0.846561
Career                 0.846561
Years of Experience    0.846561
Salary                 0.846561
dtype: float64 Education Level        0.625
Career                 0.625
Years of Experience    0.625
Salary                 0.625
dtype: float64


For instance 2, we would classify it to High

In [255]:
# apply classifier to instance 3

p_high_3 = p_high * p_graduate_high_lap * p_service_high_lap * p_3to10_high_lap
p_high_3w = (p_high * p_graduate_high_lap * p_service_high_lap * p_3to10_high_lap)/(p_graduate * p_service * p_3to10)

p_low_3 = p_low * p_graduate_low_lap * p_service_low_lap * p_3to10_low_lap
p_low_3w = (p_low * p_graduate_low_lap * p_service_low_lap * p_3to10_low_lap)/(p_graduate * p_service * p_3to10)

print(p_high_3w, p_low_3w)

Education Level        0.42328
Career                 0.42328
Years of Experience    0.42328
Salary                 0.42328
dtype: float64 Education Level        1.041667
Career                 1.041667
Years of Experience    1.041667
Salary                 1.041667
dtype: float64


For instance 3, we would classify it to Low

In [195]:
# sum the result 

result = pd.DataFrame(columns = list('123'))
result.loc[0] = ['Low', 'High', 'Low']    
result

Unnamed: 0,1,2,3
0,Low,High,Low


# Question 3: KNN with Filter Method

For Question 3 and 4, you will be extending your KNN classifier to include automated feature selection.

Feature selection is used to remove irrelevant or correlated features in order to improve classification performance. You will be performing feature selection on a variant of the UCI vehicle dataset in the file veh-prime.arff. You will be comparing 2 different feature selection methods: the Filter method which doesn’t make use of cross-validation performance and the Wrapper method which does.

Fix the KNN parameter to be k = 7 for all runs of LOOCV in Question 3 and 4.

Make the class labels numeric (set “noncar”=0 and “car”=1) and calculate the Pearson Correlation Coefficient (PCC) of each feature with the numeric class label. The PCC value is commonly referred to as r. For a simple method to calculate the PCC that is both computationally efficient and numerically stable, see the pseudo code in the pearson.html file.

https://www.kaggle.com/sz8416/6-ways-for-feature-selection

## I. Setup and Cleaning 

Firstly, we need to convert the categorical class lable into numeric value

In [256]:
# setup and import data

from scipy.io import arff
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv('Desktop/code/CISC6930_DM/data/veh-prime.csv', sep = ',')
df = df.drop(df.columns[0], axis=1)
df.head()

#df.shape
#df.dtypes

Unnamed: 0,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f27,f28,f29,f30,f31,f32,f33,f34,f35,CLASS
0,0.063,0.16,0.509,-0.967,0.058,0.0,0.874,0.271,1.307,-0.011,...,-0.924,-0.077,0.108,-0.003,0.381,-0.314,0.929,0.184,-0.001,noncar
1,-0.037,-0.325,-0.626,-0.029,0.121,-0.409,-0.002,-0.835,-0.595,-0.253,...,0.27,0.533,0.152,-0.978,0.157,0.011,-0.254,0.453,-0.621,noncar
2,-0.0,1.253,0.833,-0.97,1.516,0.014,-0.378,1.197,0.546,-0.402,...,-0.408,1.55,0.01,-0.652,-0.403,-0.151,0.0,0.049,-0.113,car
3,-0.743,-0.082,-0.626,0.723,-0.006,-0.0,-0.08,-0.297,0.166,0.311,...,0.819,-0.077,-0.099,-0.001,-0.291,1.633,0.686,1.528,-0.0,noncar
4,-0.939,-1.054,-0.14,0.036,-0.766,0.0,-0.272,1.077,5.236,-0.366,...,0.676,0.533,-0.003,0.122,-0.179,-1.449,0.024,-1.698,0.083,noncar


In [257]:
# change the CLASS into numeric values 0 for b'noncar and 1 for b'car'
# the reason to separate the data and convert to boolean first is that 
# dictionary or replace didn't work

df_car = df.loc[df['CLASS'] == 'car']
df_noncar = df.loc[df['CLASS'] == 'noncar']

df_car.is_copy = False
df_car['CLASS'] = df_car['CLASS'].astype('bool')
df_car['CLASS'] = df_car['CLASS'] * 1

df_noncar.is_copy = False
df_noncar['CLASS'] = df_noncar['CLASS'].astype('bool')
df_noncar['CLASS'] = df_noncar['CLASS'] * 0


In [258]:
df = pd.concat([df_car, df_noncar], ignore_index=True)
df.shape

(846, 37)

## II. Data preprocessing

Here we follow the following steps:

1. Standardization
2. Feature selection by filter method based on PCC 


### 1. Standardization

In [259]:
# seperate x and y for preprocessing 

x = df.iloc[:, 0:36]
y = df['CLASS']

In [260]:
# preprocess the whole dataset first

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

scaler = preprocessing.StandardScaler()
x_scaled = scaler.fit_transform(x)
x_scaled = pd.DataFrame(x_scaled)
x_scaled.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
0,-0.044227,1.253766,0.833595,-2.548894,1.516837,0.093443,-0.952771,1.197664,0.546367,-1.084037,...,0.072051,-1.043293,1.550859,0.033408,-1.739382,-0.403208,-0.151117,-0.014298,0.049026,-0.229782
1,-0.171682,-0.082067,-0.140049,0.831308,1.009535,-0.666495,1.907348,0.838459,0.039072,1.617702,...,-1.130658,-0.013961,-0.483254,-0.053639,1.143713,0.157087,0.984566,2.221898,1.125647,0.048134
2,0.736437,-0.447294,-1.76112,-0.057667,-1.020676,0.034987,0.020128,-0.98458,-0.849443,0.020615,...,-0.99758,-2.497949,-0.280143,-1.944198,-0.005384,1.500795,1.147664,-1.100374,0.856492,0.003138
3,0.032777,0.282161,1.64413,0.007892,1.326724,0.099011,0.101627,0.958527,0.419293,-0.663471,...,0.205129,-0.305899,-0.077031,-1.525283,0.597633,-1.187621,-0.476312,-2.306533,-0.219129,-0.02333
4,1.102871,1.253766,1.482023,0.220302,1.136611,-0.031821,1.237525,0.510272,-0.088001,1.434037,...,0.205129,0.23016,-0.280143,0.087813,2.259165,-0.17909,-0.63841,-0.065,-0.085051,0.064015


In [261]:
dataframe = pd.concat([x_scaled, y], axis = 1)
dataframe.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,27,28,29,30,31,32,33,34,35,CLASS
0,-0.044227,1.253766,0.833595,-2.548894,1.516837,0.093443,-0.952771,1.197664,0.546367,-1.084037,...,-1.043293,1.550859,0.033408,-1.739382,-0.403208,-0.151117,-0.014298,0.049026,-0.229782,1
1,-0.171682,-0.082067,-0.140049,0.831308,1.009535,-0.666495,1.907348,0.838459,0.039072,1.617702,...,-0.013961,-0.483254,-0.053639,1.143713,0.157087,0.984566,2.221898,1.125647,0.048134,1
2,0.736437,-0.447294,-1.76112,-0.057667,-1.020676,0.034987,0.020128,-0.98458,-0.849443,0.020615,...,-2.497949,-0.280143,-1.944198,-0.005384,1.500795,1.147664,-1.100374,0.856492,0.003138,1
3,0.032777,0.282161,1.64413,0.007892,1.326724,0.099011,0.101627,0.958527,0.419293,-0.663471,...,-0.305899,-0.077031,-1.525283,0.597633,-1.187621,-0.476312,-2.306533,-0.219129,-0.02333,1
4,1.102871,1.253766,1.482023,0.220302,1.136611,-0.031821,1.237525,0.510272,-0.088001,1.434037,...,0.23016,-0.280143,0.087813,2.259165,-0.17909,-0.63841,-0.065,-0.085051,0.064015,1


### 2. Feature selection by filter method

Here we'd like to calculate the Pearson Correlation Coefficient (PCC) of each feature with the numeric class label, and rank the features based on r. 

PCC measures features' linear correlation with the outcome variable. The result will be within [-1, 1]. 

$$r=\frac{\sum_{i=1}^n (x_{i}- \overline{x})(y_{i}- \overline{y})}{\sum_{i=1}^n (x_{i}- \overline{x})^2 \sum_{i=1}^n (y_{i}- \overline{y})^2}$$

reference: https://onlinecourses.science.psu.edu/stat501/node/256/


### Question (a) 

List the features from highest |r| (the absolute value of r) to lowest, along with their r values. Why would one be interested in the absolute value of r rather than the raw value?

**Answer**: 

We are interested in the absolute value of r rather than the raw value because r measures the correlation between our features and response variable. 

The raw value only represent the features themselves without taken consideration of any relationships, so we won't be able to measure the importance and select features purely based on the raw value

Also r is unitless, which allows us to measure the correlation coefficients calculated on different data sets with different units and compare them on the same bases. 


In [48]:
# calculate the PCC between features and response 

correlation = set()
correlation_matrix = dataframe.corr(method = 'pearson')

# list the features from the highest |r| to lowest 
features_response = correlation_matrix.iloc[:, 36]
features_response = pd.DataFrame(features_response.sort_values(ascending = False))
features_response

Unnamed: 0,CLASS
CLASS,1.0
4,0.436922
13,0.368269
16,0.366025
7,0.352141
22,0.35135
1,0.308811
20,0.299049
31,0.290783
34,0.266093


## III. Model fitting and LOOCV

### Question (b) 

Select the features that have the highest m values of |r|, and run LOOCV on the dataset restricted to only those m features. Which value of m gives the highest LOOCV classification accuracy, and what is the value of this optimal accuracy?

** Answer: **

As we can see from PCC, only the first 20 features in the list have a r score over 0.002. 

So, we are going to use LOOCV to evaluate the classification accuracy of m out of those 20 features with m = range(1:20), and choose the value of m with the highest LOOCV score. 

As a result, 

the value of m that gives the highest LOOCV classification accuracy is: **18**

the value of the optimal accuracy is: **92.435%** 


In [262]:
# Evaluate using LOOCV
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
# from sklearn.model_selection import cross_val_score
# from sklearn.model_selection import LeaveOneOut 
# from sklearn.model_selection import KFold

# select the top 20 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-20:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-20:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 92.553% (26.253%)


[0.925531914893617]

In [158]:
# select the top 19 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-19:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-19:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 91.726% (27.549%)


[0.91725768321513]

In [185]:
# select the top 18 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-18:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-18:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)

print("Subset features: %s" %(cor_feature))
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))

cv_scores.append(scores.mean())     
cv_scores
cor_feature

Subset features: [17, 19, 25, 28, 2, 34, 31, 20, 1, 26, 22, 7, 16, 14, 13, 4]
Accuracy: 92.435% (26.444%)


[17, 19, 25, 28, 2, 34, 31, 20, 1, 26, 22, 7, 16, 14, 13, 4]

In [161]:
# select the top 17 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-17:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-17:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 91.017% (28.594%)


[0.9101654846335697]

In [162]:
# select the top 16 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-16:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-16:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 89.716% (30.375%)


[0.8971631205673759]

In [163]:
# select the top 15 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-15:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-15:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 90.898% (28.763%)


[0.9089834515366431]

In [164]:
# select the top 14 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-14:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-14:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 89.125% (31.132%)


[0.8912529550827423]

In [165]:
# select the top 13 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-13:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-13:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 88.889% (31.427%)


[0.8888888888888888]

In [166]:
# select the top 12 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-12:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-12:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 90.426% (29.424%)


[0.9042553191489362]

In [167]:
# select the top 11 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-11:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-11:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 90.189% (29.746%)


[0.9018912529550828]

In [168]:
# select the top 10 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-10:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-10:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 88.534% (31.861%)


[0.8853427895981087]

In [169]:
# select the top 9 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-9:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-9:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 89.480% (30.681%)


[0.8947990543735225]

In [170]:
# select the top 8 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-8:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-8:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 87.470% (33.105%)


[0.8747044917257684]

In [171]:
# select the top 7 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-7:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-7:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 86.170% (34.521%)


[0.8617021276595744]

In [172]:
# select the top 6 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-6:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-6:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 83.688% (36.948%)


[0.8368794326241135]

In [173]:
# select the top 5 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-5:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-5:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 83.452% (37.162%)


[0.83451536643026]

In [174]:
# select the top 4 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-4:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-4:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 83.333% (37.268%)


[0.8333333333333334]

In [175]:
# select the top 3 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-3:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-3:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 82.624% (37.890%)


[0.8262411347517731]

In [176]:
# select the top 2 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-2:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-2:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 79.433% (40.419%)


[0.7943262411347518]

In [177]:
# select the top 1 features based on PCC
X = x_scaled
y = y 

def cor_selector(X, y):
    cor_list = []
    
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
        
    # extract feature name
    # np.argsort sort an array by ascending order(smallest to largest)
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-1:]].columns.tolist()
    
    return cor_feature

x_features = X.iloc[:, np.argsort(np.abs(cor_list))[-1:]]

# empty list that will hold cv scores
cv_scores = []

# perform LOOCV
loocv = model_selection.LeaveOneOut()
x = x_features

knn = KNeighborsClassifier(n_neighbors = 7)
scores = cross_val_score(knn, x, y, cv = loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean()*100.0, scores.std()*100.0))
cv_scores.append(scores.mean())     
cv_scores

Accuracy: 70.095% (45.784%)


[0.7009456264775413]

In [None]:
# plot m value against LOOCV classification accuracy



In [None]:
# knn without sklearn 

from collections import Counter
from sklearn.metrics import accuracy_score

# Training method
def train(X_train, y_train):
    # do nothing
    return

# prediction method
def predict(X_train, y_train, X_test, k):
    # create list for distances and targets
    distances = []
    targets = []
    
    for i in range(len(X_train)):
        # first compute the euclidean distance
        distance = np.sqrt(np.sum(np.square(X_test - X_train[i, :])))
        # add it to the list of distances
        distances.append([distance, i])
    
    # sort the list
    distances = sorted(distances)
    
    # make a list of k neighbors' targets
    for i in range(k):
        index = distances[i][1]
        targets.append(y_train[index])
    
    # return most common target: majority vote 
    return Counter(targets).most_common(1)[0][0]


# knn
def kNearestNeighbor(X_train, y_train, X_test, predictions, k):
    # check if k larger than n
    if k > len(X_train):
        raise ValueError
    
    # train on the input data
    train(X_train, y_train)
    
    # predict for each testing observation 
    for i in range(len(X_test)):
        predictions.append(predict(X_train, y_train, X_test[i, :], k))

# make predictions
predictions = []
try:
        kNearestNeighbor(X_train, y_train, X_test, predictions, 1)
        predictions = np.asarray(predictions)
        
        # evaluating accuracy 
        accuracy = accuracy_score(y_test, predictions) * 100
        print('\nThe accuracy of Our classifier is %d%%' % accuracy)
        
except ValueError:
        print('Can\'t have more neighbors than training samples!!')

# Question: 4 KNN with Wrapper Method

Starting with the empty set of features, use a greedy approach to add the single feature that improves performance by the largest amount when added to the feature set. This is Sequential Forward Selection. 

Define performance as the LOOCV classification accuracy of the KNN classifier using only the features in the selection set (including the ?candidate? feature). 

Stop adding features only when there is no candidate that when added to the selection set increases the LOOCV accuracy.

### Question (a) Sequential Forward Selection
Show the set of selected features at each step, as it grows from size zero to its final size (increasing in size by exactly one feature at each step).

**Answer: **
Coded as below

### Question (b) LOOCV accuracy

What is the LOOCV accuracy over the final set of selected features?

**Answer: **

As we can see from the following  results, the best subset is: 

(1, 4, 7, 8, 10, 14, 16, 19, 20, 22, 25)

with a LOOCV accuracy rate of 0.9633569739952719


In [264]:
# sequential forward selection with mlxtend
# http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut 
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X = x_scaled
y = y 

cv_gen = LeaveOneOut().split(X)
cv = list(cv_gen)

knn = KNeighborsClassifier(n_neighbors = 7)

sfs = SFS(estimator = knn, 
           k_features = (1, 36), 
           forward = True, 
           floating = False, 
           scoring = 'accuracy',
           cv = cv)

sfs = sfs.fit(X, y)

print('best sebset (CV Score: %.3f): %s\n' % (sfs.k_score_, sfs.k_feature_idx_))


best sebset (CV Score: 0.963): (1, 4, 7, 8, 10, 14, 16, 19, 20, 22, 25)



In [265]:
#  Show the set of selected features at each step
pd.DataFrame.from_dict(sfs.get_metric_dict()).T

Unnamed: 0,avg_score,ci_bound,cv_scores,feature_idx,feature_names,std_dev,std_err
1,0.755319,0.0290273,"[1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, ...","(14,)","(14,)",0.429898,0.0147889
2,0.86643,0.0229701,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, ...","(10, 14)","(10, 14)",0.340189,0.0117029
3,0.903073,0.0199767,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","(10, 14, 19)","(10, 14, 19)",0.295858,0.0101778
4,0.93617,0.0165056,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","(8, 10, 14, 19)","(8, 10, 14, 19)",0.244449,0.00840932
5,0.956265,0.0138085,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","(7, 8, 10, 14, 19)","(7, 8, 10, 14, 19)",0.204505,0.0070352
6,0.958629,0.0134467,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","(7, 8, 10, 14, 19, 25)","(7, 8, 10, 14, 19, 25)",0.199147,0.00685087
7,0.962175,0.0128813,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","(1, 7, 8, 10, 14, 19, 25)","(1, 7, 8, 10, 14, 19, 25)",0.190773,0.00656279
8,0.959811,0.0132614,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","(1, 7, 8, 10, 14, 19, 20, 25)","(1, 7, 8, 10, 14, 19, 20, 25)",0.196403,0.00675645
9,0.960993,0.0130729,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","(1, 4, 7, 8, 10, 14, 19, 20, 25)","(1, 4, 7, 8, 10, 14, 19, 20, 25)",0.193612,0.00666045
10,0.962175,0.0128813,"[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...","(1, 4, 7, 8, 10, 14, 19, 20, 22, 25)","(1, 4, 7, 8, 10, 14, 19, 20, 22, 25)",0.190773,0.00656279
