# Task 2 - Ionosphere Dataset

Sharyar Memon, 1299819
<br> 
Alyson Wu, 1399985

For Task 2, we were asked to build a predictor using three different models/approaches, presented in the following order: (1) Regression, (2) Support Vector Machine and (3) Random Forest. 

The dataset analyzed for this task was the Ionosphere dataset, which describes radar data collected by a system located in Goose Bay, Labrador. The system, consisting of a phased array of 16 high-frequency antennas, targeted free electrons in the ionosphere. Each radar return is recorded as either "good" or "bad": a "good" return shows evidence of some kind of structure in the ionosphere, whereas a "bad" return indicates that the signal passed through the ionosphere.

Hence, for classification, our goal is to create a predictor that performs the following mapping:
g for good and b for bad = function(ionosphere features)

In our analysis of each model, 10-fold cross-validation is performed to compare performance between the different models/approaches. Lastly, this information, along with the results of an ANOVA, is used to identify the best model.

## Ionosphere Data Set Pre-processing

Prior to model creation, the ionosphere data was pre-processed. Upon visual inspection of the data, the second column was found to have no variance (i.e. all values were the same) and therefore carries no information for classification. Hence, the second column was removed from the analysis.
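The zero-variance observation can also be checked programmatically rather than by eye. The following is a minimal sketch on a made-up toy frame; `toy_df` and its values are illustrative assumptions, not rows taken from the data set:

```python
import pandas as pd

# Toy frame standing in for the ionosphere features (values are made up).
toy_df = pd.DataFrame({
    "f1": [1, 0, 1, 1],
    "f2": [0, 0, 0, 0],              # constant column, like f2 in the real data
    "f3": [0.99, 1.00, 0.83, 0.95],
})

# A column with a single distinct value has zero variance.
constant_cols = [col for col in toy_df.columns if toy_df[col].nunique() == 1]
print(constant_cols)  # ['f2']
```

Applied to the real frame, the same list comprehension would flag any constant feature before it is dropped.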

For classification, the label was changed to a binary encoding where "g" was mapped to 1 and "b" was mapped to 0. This encoding lets the logistic regression model output a binary (0/1) class for each observation.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Naming of all columns
colnames=['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 
          'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29',
          'f30', 'f31', 'f32', 'f33', 'f34', 'label']

ionosphere_df = pd.read_csv('data_files/ionosphere.data', names=colnames, header=None)
ionosphere_df

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f26,f27,f28,f29,f30,f31,f32,f33,f34,label
0,1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1.00000,0.03760,...,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300,g
1,1,0,1.00000,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1.00000,-0.04549,...,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447,b
2,1,0,1.00000,-0.03365,1.00000,0.00485,1.00000,-0.12062,0.88965,0.01198,...,-0.40220,0.58984,-0.22145,0.43100,-0.17365,0.60436,-0.24180,0.56045,-0.38238,g
3,1,0,1.00000,-0.45161,1.00000,1.00000,0.71216,-1.00000,0.00000,0.00000,...,0.90695,0.51613,1.00000,1.00000,-0.20099,0.25682,1.00000,-0.32382,1.00000,b
4,1,0,1.00000,-0.02401,0.94140,0.06531,0.92106,-0.23255,0.77152,-0.16399,...,-0.65158,0.13290,-0.53206,0.02431,-0.62197,-0.05707,-0.59573,-0.04608,-0.65697,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
346,1,0,0.83508,0.08298,0.73739,-0.14706,0.84349,-0.05567,0.90441,-0.04622,...,-0.04202,0.83479,0.00123,1.00000,0.12815,0.86660,-0.10714,0.90546,-0.04307,g
347,1,0,0.95113,0.00419,0.95183,-0.02723,0.93438,-0.01920,0.94590,0.01606,...,0.01361,0.93522,0.04925,0.93159,0.08168,0.94066,-0.00035,0.91483,0.04712,g
348,1,0,0.94701,-0.00034,0.93207,-0.03227,0.95177,-0.03431,0.95584,0.02446,...,0.03193,0.92489,0.02542,0.92120,0.02242,0.92459,0.00442,0.92697,-0.00577,g
349,1,0,0.90608,-0.01657,0.98122,-0.01989,0.95691,-0.03646,0.85746,0.00110,...,-0.02099,0.89147,-0.07760,0.82983,-0.17238,0.96022,-0.03757,0.87403,-0.16243,g


In [2]:
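# Encode the categories as 1 and 0 (g = 1, b = 0)
# Assigning the mapped column back avoids relying on in-place replacement
# through attribute access, and keeps the label as a plain integer column
encoding = {'g': 1, 'b': 0}
ionosphere_df['label'] = ionosphere_df['label'].map(encoding).astype(int)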

Upon closer inspection of the data set, the second column (feature 2) was removed: all of its values were identical, so the feature has no variance.

In [3]:
# Removal of the second column (f2) as all of its values are identical and there is no variance
ionosphere_df.drop(columns=['f2'], inplace=True)

X = ionosphere_df.values[:, :-1]
y = ionosphere_df.values[:, -1]

For a more complete analysis, the data set was both standardized (zero mean and unit variance per feature) and min-max normalized (each feature rescaled to the range [0, 1]).

In [4]:
def normalization(Features):
    """
    Take a set of features and return min-max normalized values,
    computed as (x - xmin) / (xmax - xmin) for each feature
    """
    fmin = np.min(Features,axis=0)
    fmax = np.max(Features,axis=0)
    Features_norm = (Features - fmin)/(fmax-fmin)
    return Features_norm

In [5]:
from sklearn.preprocessing import StandardScaler

# Perform standardization on feature set data
X_scaled = StandardScaler().fit_transform(X)

# Perform normalization of the feature set data
X_normalized = normalization(X)
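To make the difference between the two transforms concrete, here is a small sketch on a made-up 3x2 array (the `toy` values are assumptions for illustration only). It uses plain NumPy; the z-score line mirrors what `StandardScaler` computes by default, using the population standard deviation.

```python
import numpy as np

# Made-up data: two features on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# Standardization (z-score): zero mean and unit variance per column.
standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Min-max normalization: each column rescaled to the range [0, 1].
col_min, col_max = toy.min(axis=0), toy.max(axis=0)
normalized = (toy - col_min) / (col_max - col_min)

print(standardized.mean(axis=0), standardized.std(axis=0))
print(normalized.min(axis=0), normalized.max(axis=0))
```

Note that the min-max formula divides by `xmax - xmin`, which is zero for a constant column. This is a further reason to drop `f2` before normalizing.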

### Logistic Regression Model

The logistic regression model was evaluated on the regular, scaled and normalized data. Unlike the linear regression model used in Task 1, logistic regression is a classification method that assigns observations to a discrete set of classes (instead of predicting continuous values). Following the creation of the model, a 10-fold cross-validation was conducted to determine the model performance. The 10-fold cross-validation helper is defined below:

In [6]:
from sklearn.model_selection import KFold, cross_val_score

def kfold_10_cross_validation(model, X, Y):
    # Initialize cross validation for k-fold = 10
    cross_validation = KFold(n_splits=10, random_state=1, shuffle=True)
    
    return cross_val_score(model, X, Y, scoring='accuracy', cv=cross_validation, n_jobs=-1)
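As a rough illustration of what a shuffled 10-fold split does, the sketch below partitions row indices with NumPy. It assumes 351 rows (the count listed for the UCI Ionosphere data) and uses NumPy's own random generator, so the exact indices will differ from those produced by `KFold`; only the fold sizes are meant to carry over.

```python
import numpy as np

n_samples = 351   # assumed row count for the Ionosphere data
n_folds = 10

rng = np.random.default_rng(1)              # seed chosen only for repeatability
shuffled = rng.permutation(n_samples)       # plays the role of shuffle=True
folds = np.array_split(shuffled, n_folds)   # 10 nearly equal test folds

fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)  # [36, 35, 35, 35, 35, 35, 35, 35, 35, 35]
```

Each fold serves once as the test set while the remaining nine folds are used for training. The fold sizes also explain the granularity of the accuracy values reported later: for example, 0.805556 is 29/36 and 0.942857 is 33/35.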

**Train 3 different models with no scaling/normalization, scaling, normalization**

For this exercise, we created three different models using different variations of the original data set (original data, scaled data and normalized data). Following this, the performance of each model was compared. The steps were:
1. Create logistic regression models
2. Compare performance
3. Select the best logistic regression model

From this analysis we aimed to retrieve the model that performed the best of the three. Sklearn's native logistic regression model was leveraged for the creation of the model.

In [7]:
from sklearn import linear_model

# Utilize sklearn's native logistic regression model
logreg_model = linear_model.LogisticRegression(solver='lbfgs')

logreg_scores = kfold_10_cross_validation(logreg_model, X, y)
logreg_scaled_scores = kfold_10_cross_validation(logreg_model, X_scaled, y)
logreg_normalized_scores = kfold_10_cross_validation(logreg_model, X_normalized, y)

scores = {
    'Regular Scores': logreg_scores,
    'Scaled Scores': logreg_scaled_scores,
    'Normalized Scores': logreg_normalized_scores,
}
pd.DataFrame(scores)

Unnamed: 0,Regular Scores,Scaled Scores,Normalized Scores
0,0.805556,0.833333,0.805556
1,0.942857,1.0,0.971429
2,1.0,0.942857,1.0
3,0.857143,0.857143,0.885714
4,0.828571,0.857143,0.8
5,0.828571,0.885714,0.857143
6,0.914286,0.914286,0.885714
7,0.857143,0.914286,0.885714
8,0.8,0.828571,0.828571
9,0.885714,0.885714,0.942857


In [8]:
# Print the mean value of the scores to find best performing model
logreg_stats_values = {
    'Regular': [logreg_scores.mean(), logreg_scores.std()],
    'Scaled': [logreg_scaled_scores.mean(), logreg_scaled_scores.std()],
    'Normalized': [logreg_normalized_scores.mean(), logreg_normalized_scores.std()],  
}

pd.DataFrame(logreg_stats_values, index=['Mean Score Value', 'Standard Deviation'])

Unnamed: 0,Regular,Scaled,Normalized
Mean Score Value,0.871984,0.891905,0.88627
Standard Deviation,0.060986,0.050217,0.064439


From the mean values of the 10 fold cross validation using the regular data, scaled data and normalized data, the logistic model with the highest mean score value was selected. In this case, the logistic model performed best with the scaled data.
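The scores above were computed on data that was scaled using all rows before the folds were split. A possible variation, not used in this notebook, is to wrap the scaler and the classifier in a pipeline so that the scaler is re-fitted on the training portion of each fold. The sketch below runs on synthetic data from `make_classification` (an assumption standing in for the real features), and `max_iter=1000` is likewise an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ionosphere features (not the real data).
X_demo, y_demo = make_classification(n_samples=351, n_features=33, random_state=0)

# The scaler is fitted only on the training part of every fold.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)

cv = KFold(n_splits=10, shuffle=True, random_state=1)
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, scoring="accuracy", cv=cv)
print(pipeline_scores.mean(), pipeline_scores.std())
```

With this arrangement the test fold never influences the scaling statistics, which can make the cross-validated accuracy a slightly more conservative estimate.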

### Support Vector Machine

A support vector machine (SVM) model was also evaluated for regular, scaled and normalized data, as well as for different kernel functions (linear, poly and rbf). For each generated model, a 10-fold cross-validation was performed and analyzed to determine the best model.

**Train 3 different models with no scaling/normalization, scaling, normalization**

For this exercise, we created models using different variations of the original data set (original data, scaled data and normalized data) and different kernel functions. Following this, the performance of each model was compared. The steps were:
1. Create support vector machine (SVM) models using different variations of the original data set (original, scaled, normalized) and using different kernel functions (linear, polynomial, radial basis function)
2. Compare performance
3. Select the best support vector machine (SVM) model

From this analysis we aimed to retrieve the model that performed the best of the three. Sklearn's native SVC model was leveraged for the creation of the model.
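For reference, the three kernel functions can be written out directly. The sketch below is a plain NumPy illustration for two vectors; the `gamma`, `coef0` and `degree` values are illustrative assumptions, and `SVC` chooses its own defaults for these parameters.

```python
import numpy as np

def linear_kernel(a, b):
    """Linear kernel: the plain dot product."""
    return a @ b

def poly_kernel(a, b, gamma=1.0, coef0=0.0, degree=3):
    """Polynomial kernel: (gamma * <a, b> + coef0) ** degree."""
    return (gamma * (a @ b) + coef0) ** degree

def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

u = np.array([1.0, 2.0])
v = np.array([2.0, 0.0])

print(linear_kernel(u, v))  # 2.0
print(poly_kernel(u, v))    # 8.0
print(rbf_kernel(u, u))     # 1.0
```

The linear and polynomial kernels depend on dot products and therefore on the scale of the features, whereas the RBF kernel depends on the distance between observations.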

**SVM with Linear Kernel**

In [9]:
from sklearn.svm import SVC

svc_linear_model = SVC(kernel='linear').fit(X, y)
svc_linear_model_scaled = SVC(kernel='linear').fit(X_scaled, y)
svc_linear_model_normalized = SVC(kernel='linear').fit(X_normalized, y)

In [10]:
# K-fold cross validation
svc_linear_model_scores = kfold_10_cross_validation(svc_linear_model, X, y)
svc_linear_model_scaled_scores = kfold_10_cross_validation(svc_linear_model_scaled, X_scaled, y)
svc_linear_model_normalized_scores = kfold_10_cross_validation(svc_linear_model_normalized, X_normalized, y)

In [11]:
scores = {
    'Regular Scores': svc_linear_model_scores,
    'Scaled Scores': svc_linear_model_scaled_scores,
    'Normalized Scores': svc_linear_model_normalized_scores,
}
pd.DataFrame(scores)

Unnamed: 0,Regular Scores,Scaled Scores,Normalized Scores
0,0.805556,0.833333,0.833333
1,0.942857,0.971429,0.942857
2,0.971429,0.942857,1.0
3,0.857143,0.942857,0.885714
4,0.828571,0.8,0.828571
5,0.857143,0.885714,0.885714
6,0.885714,0.885714,0.885714
7,0.885714,0.914286,0.885714
8,0.8,0.8,0.8
9,0.885714,0.914286,0.914286


In [12]:
# Print the mean value of the scores to find best performing model
svc_linear_mean_values = {
    'Regular': [svc_linear_model_scores.mean(), svc_linear_model_scores.std()],
    'Scaled': [svc_linear_model_scaled_scores.mean(), svc_linear_model_scaled_scores.std()],
    'Normalized': [svc_linear_model_normalized_scores.mean(), svc_linear_model_normalized_scores.std()],  
}

pd.DataFrame(svc_linear_mean_values, index=['Mean Score Value', 'Standard Deviation'])

Unnamed: 0,Regular,Scaled,Normalized
Mean Score Value,0.871984,0.889048,0.88619
Standard Deviation,0.052343,0.057303,0.055224


The best performing SVM model (using the linear kernel) was the one using scaled data. Note that the mean score values were comparable across models, with the normalized data performing similarly to the scaled data. The same analysis is performed using the polynomial kernel and the rbf kernel.

**SVM with Polynomial Kernel**

In [13]:
# Generate SVM
svc_poly_model = SVC(kernel='poly').fit(X, y)
svc_poly_model_scaled = SVC(kernel='poly').fit(X_scaled, y)
svc_poly_model_normalized = SVC(kernel='poly').fit(X_normalized, y)

# K-fold cross validation
svc_poly_model_scores = kfold_10_cross_validation(svc_poly_model, X, y)
svc_poly_model_scaled_scores = kfold_10_cross_validation(svc_poly_model_scaled, X_scaled, y)
svc_poly_model_normalized_scores = kfold_10_cross_validation(svc_poly_model_normalized, X_normalized, y)

# Print the mean value of the scores to find best performing model
svc_poly_mean_values = {
    'Regular': [svc_poly_model_scores.mean(), svc_poly_model_scores.std()],
    'Scaled': [svc_poly_model_scaled_scores.mean(), svc_poly_model_scaled_scores.std()],
    'Normalized': [svc_poly_model_normalized_scores.mean(), svc_poly_model_normalized_scores.std()],  
}

pd.DataFrame(svc_poly_mean_values, index=['Mean Score Value', 'Standard Deviation'])

Unnamed: 0,Regular,Scaled,Normalized
Mean Score Value,0.903016,0.758175,0.911825
Standard Deviation,0.05754,0.058111,0.05142


The best performing SVM model (using the polynomial kernel) was the one using normalized data. Note that the model performed considerably worse with scaled data (mean score of 0.758) and had comparable performance between the regular data and the normalized data.

**SVM with Radial Basis Kernel**

In [14]:
# Generate SVM
svc_rbf_model = SVC(kernel='rbf').fit(X, y)
svc_rbf_model_scaled = SVC(kernel='rbf').fit(X_scaled, y)
svc_rbf_model_normalized = SVC(kernel='rbf').fit(X_normalized, y)

# K-fold cross validation
svc_rbf_model_scores = kfold_10_cross_validation(svc_rbf_model, X, y)
svc_rbf_model_scaled_scores = kfold_10_cross_validation(svc_rbf_model_scaled, X_scaled, y)
svc_rbf_model_normalized_scores = kfold_10_cross_validation(svc_rbf_model_normalized, X_normalized, y)

# Print the mean value of the scores to find best performing model
svc_rbf_mean_values = {
    'Regular': [svc_rbf_model_scores.mean(), svc_rbf_model_scores.std()],
    'Scaled': [svc_rbf_model_scaled_scores.mean(), svc_rbf_model_scaled_scores.std()],
    'Normalized': [svc_rbf_model_normalized_scores.mean(), svc_rbf_model_normalized_scores.std()],  
}

pd.DataFrame(svc_rbf_mean_values, index=['Mean Score Value', 'Standard Deviation'])

Unnamed: 0,Regular,Scaled,Normalized
Mean Score Value,0.934683,0.934683,0.934683
Standard Deviation,0.039762,0.039762,0.039762


The same mean score values and standard deviation values were found across all generated SVM models using the rbf kernel. Therefore any of the three can be chosen for the subsequent analysis.

**Comparison Across SVM Models**

Comparison across SVM models was achieved by looking at the mean score value (taken from the array generated following k-fold cross validation) and the standard deviation of each model.

In [15]:
svm_stats_scores = {
    'Best Linear': svc_linear_mean_values['Scaled'],
    'Best Polynomial': svc_poly_mean_values['Normalized'],
    'Best RBF': svc_rbf_mean_values['Regular'],
}

pd.DataFrame(svm_stats_scores, index=['Mean Score Value', 'Standard Deviation'])

Unnamed: 0,Best Linear,Best Polynomial,Best RBF
Mean Score Value,0.889048,0.911825,0.934683
Standard Deviation,0.057303,0.05142,0.039762


Following the comparison of all 3 SVM models, the SVM model generated using the radial basis function was found to have the highest mean score value and hence was chosen as the best SVM model for further comparison.
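The kernel-by-data comparison above repeats the same steps nine times, so it could also be expressed as a loop. The following is a sketch under stated assumptions: it runs on synthetic data from `make_classification` rather than the ionosphere features, and it uses `MinMaxScaler` in place of the hand-written `normalization` function.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the ionosphere features (not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

datasets = {
    "Regular": X_demo,
    "Scaled": StandardScaler().fit_transform(X_demo),
    "Normalized": MinMaxScaler().fit_transform(X_demo),
}

cv = KFold(n_splits=10, shuffle=True, random_state=1)
summary = {}
for kernel in ("linear", "poly", "rbf"):
    for name, data in datasets.items():
        scores = cross_val_score(SVC(kernel=kernel), data, y_demo,
                                 scoring="accuracy", cv=cv)
        summary[(kernel, name)] = (scores.mean(), scores.std())

# Pick the (kernel, data) combination with the highest mean accuracy.
best = max(summary, key=lambda key: summary[key][0])
print(best, summary[best])
```

Because `cross_val_score` clones and re-fits the estimator for every fold, the model does not need to be fitted beforehand.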

### Random Forest

Lastly, a random forest model was evaluated for different numbers of estimators (25, 50 and 100). The regular, standardized and normalized data were each used to find the best performing model. First, a random forest model consisting of 50 estimators was analyzed:

**Train Random Forest Models**

For this exercise, we created models using different variations of the original data set (original data, scaled data and normalized data) and different numbers of estimators (25, 50 and 100).
1. Create random forest models using original, scaled and normalized data with 25, 50 and 100 estimators
2. Compare performance
3. Select the best random forest model

From this analysis we aimed to retrieve the model that performed the best of the three. Sklearn's native RandomForestClassifier was leveraged for the creation of the model.

In [16]:
from sklearn.ensemble import RandomForestClassifier

# Creation of a random forest model with 50 estimators
random_forest_model = RandomForestClassifier(n_estimators=50, random_state=0, max_depth=12)
random_forest = random_forest_model.fit(X, y)

In [17]:
# Perform k-fold validation using regular data, scaled data and normalized data
random_forest_scores = kfold_10_cross_validation(random_forest, X, y)
random_forest_scaled_scores = kfold_10_cross_validation(random_forest, X_scaled, y)
random_forest_normalized_scores = kfold_10_cross_validation(random_forest, X_normalized, y)

In [18]:
# Print the mean value of the scores to find best performing model
random_forest_mean_values = {
    'Regular': [random_forest_scores.mean(), random_forest_scores.std()],
    'Scaled': [random_forest_scaled_scores.mean(), random_forest_scaled_scores.std()],
    'Normalized': [random_forest_normalized_scores.mean(), random_forest_normalized_scores.std()],  
}

pd.DataFrame(random_forest_mean_values, index=['Mean Score Value', 'Standard Deviation'])

Unnamed: 0,Regular,Scaled,Normalized
Mean Score Value,0.931667,0.931667,0.931667
Standard Deviation,0.02604,0.02604,0.02604
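The identical scores in the table above are consistent with how tree-based models choose splits: a threshold split depends only on the ordering of values within a feature, and standardization preserves that ordering. The sketch below illustrates this with a single hand-written decision stump on made-up data. The `gini` and `best_split_mask` helpers are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a binary 0/1 label array."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def best_split_mask(x, y):
    """Boolean mask of the rows sent left by the lowest-impurity split."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    best_score, best_i = np.inf, 0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between equal values
        score = (i * gini(ys[:i]) + (len(xs) - i) * gini(ys[i:])) / len(xs)
        if score < best_score:
            best_score, best_i = score, i
    mask = np.zeros(len(x), dtype=bool)
    mask[order[:best_i]] = True
    return mask

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=50)                  # made-up feature
y = (x + rng.normal(scale=1.0, size=50) > 5.0).astype(int)   # made-up labels

x_standardized = (x - x.mean()) / x.std()
same_partition = np.array_equal(best_split_mask(x, y),
                                best_split_mask(x_standardized, y))
print(same_partition)
```

The stump sends the same rows left before and after standardization, which is why rescaling the features leaves the forest's predictions unchanged when the random seed is fixed.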


Following the analysis, the mean score value and standard deviation are not affected by the data being used (the regular, standardized and normalized data all give the same values). This is expected, since decision-tree splits depend only on the ordering of values within each feature, which rescaling does not change. Therefore, we evaluate based on the number of estimators instead. For this, 25 estimators and 100 estimators were also evaluated:

In [19]:
random_forest_model_100 = RandomForestClassifier(n_estimators=100, random_state=0, max_depth=12).fit(X, y)
random_forest_100_scores = kfold_10_cross_validation(random_forest_model_100, X, y)

random_forest_model_25 = RandomForestClassifier(n_estimators=25, random_state=0, max_depth=12).fit(X, y)
random_forest_25_scores = kfold_10_cross_validation(random_forest_model_25, X, y)

**Overall Comparison**

Comparison across Random Forest models was achieved by looking at the mean score value (taken from the array generated following k-fold cross validation) and the standard deviation of each model.

In [20]:
scores = {
    '25 Estimators Scores': random_forest_25_scores,
    '50 Estimators Scores': random_forest_scores,
    '100 Estimators Scores':random_forest_100_scores,
}
pd.DataFrame(scores)

Unnamed: 0,25 Estimators Scores,50 Estimators Scores,100 Estimators Scores
0,0.916667,0.916667,0.916667
1,0.942857,0.942857,0.942857
2,0.942857,0.942857,0.971429
3,0.942857,0.942857,0.942857
4,0.914286,0.885714,0.914286
5,0.942857,0.942857,0.942857
6,0.942857,0.942857,0.942857
7,0.942857,0.971429,0.942857
8,0.885714,0.885714,0.857143
9,0.971429,0.942857,0.971429


In [21]:
# Print the mean value of the scores to find best performing model
random_forest_mean_values = {
    '25 Estimators': [random_forest_25_scores.mean(), random_forest_25_scores.std()],
    '50 Estimators': [random_forest_scores.mean(), random_forest_scores.std()],
    '100 Estimators': [random_forest_100_scores.mean(), random_forest_100_scores.std()],  
}

pd.DataFrame(random_forest_mean_values, index=['Mean Score Value', 'Standard Deviation'])

Unnamed: 0,25 Estimators,50 Estimators,100 Estimators
Mean Score Value,0.934524,0.931667,0.934524
Standard Deviation,0.022112,0.02604,0.031285


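For the selection of the best random forest model, a high mean score value was desired. The models with 25 and 100 estimators tied for the highest mean score value (0.934524), and the random forest model with 100 estimators was selected. Note that the 25-estimator model had the lower standard deviation of the two (0.022 vs. 0.031), i.e. more consistent scores across folds.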

### Model Performance Comparison

The generated models were compared using the mean model score (the output of the 10-fold cross-validation) and the ANOVA (Analysis of Variance) test. Note that when conducting the ANOVA test, models were compared both in pairs and as a whole group.


**ANOVA Test**

Using the f_oneway function from scipy.stats, the fold scores of the models could be compared directly with a one-way ANOVA test. The null hypothesis (H0) is that the groups of scores share the same population mean. When the p-value fell below the alpha value of 0.05, the null hypothesis was rejected, indicating a statistically significant difference between the models' mean scores at the 5% significance level (reported below as "different distributions"). Otherwise we failed to reject the null hypothesis, meaning there was no evidence of a difference between the models (reported below as "same distributions").

In [22]:
# Analysis of variance test
from scipy.stats import f_oneway

# Definition of the ANOVA test
def ANOVA_test(list_of_models):
    stat, p = f_oneway(*list_of_models)
    
    alpha = 0.05
    if p > alpha:
        result = 'Same distributions (fail to reject H0)'
    else:
        result = 'Different distributions (reject H0)'
    return [stat, p, result]
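To show what the F statistic measures, the sketch below computes it by hand with NumPy for two of the score arrays reported earlier in this notebook (the scaled logistic regression scores and the 100-estimator random forest scores, copied from the output tables and therefore rounded to six decimals). The `f_statistic` helper is a hypothetical name; it only reproduces the statistic, not the p-value.

```python
import numpy as np

# Fold accuracies copied from the score tables in this notebook.
logreg_scaled = np.array([0.833333, 1.0, 0.942857, 0.857143, 0.857143,
                          0.885714, 0.914286, 0.914286, 0.828571, 0.885714])
forest_100 = np.array([0.916667, 0.942857, 0.971429, 0.942857, 0.914286,
                       0.942857, 0.942857, 0.942857, 0.857143, 0.971429])

def f_statistic(*groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

print(round(f_statistic(logreg_scaled, forest_100), 2))  # 4.67
```

A large F value means the gap between the group means is large relative to the fold-to-fold variation within each group.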

Comparing pairs, no significant difference was found between the SVC model and either the random forest model (p ≈ 0.99) or the logistic regression model (p ≈ 0.06). The logistic regression model, however, was found to differ significantly from the random forest model (p ≈ 0.04). When all three models were compared together, the null hypothesis was not rejected (p ≈ 0.055), although only narrowly. This indicates that the differences between the models are small relative to the fold-to-fold variation in their scores.

In [23]:
# Compare the best models pairwise and as a group using the ANOVA test
svc_vs_forest = ANOVA_test([svc_rbf_model_scores, random_forest_100_scores])
logreg_vs_svc = ANOVA_test([logreg_scaled_scores, svc_rbf_model_scores])
logreg_vs_forest = ANOVA_test([logreg_scaled_scores, random_forest_100_scores])
all_model_compare = ANOVA_test([logreg_scaled_scores, random_forest_100_scores, svc_rbf_model_scores])

# Collect the ANOVA results for each comparison
comparison_values = {
    'SVC vs. Random Forest': svc_vs_forest,
    'Logistic Regression vs. SVC': logreg_vs_svc,
    'Logistic Regression vs. Random Forest': logreg_vs_forest,  
    'All Models': all_model_compare,
}

row_names = ['Stat Value', 'P value', 'Comparison Result']
pd.DataFrame(comparison_values, index=row_names).transpose()

Unnamed: 0,Stat Value,P value,Comparison Result
SVC vs. Random Forest,8.9e-05,0.992594,Same distributions (fail to reject H0)
Logistic Regression vs. SVC,4.014218,0.060407,Same distributions (fail to reject H0)
Logistic Regression vs. Random Forest,4.670019,0.044418,Different distributions (reject H0)
All Models,3.229066,0.055289,Same distributions (fail to reject H0)


**Model Selection**

The model was selected based on its performance in 10-fold cross-validation (the mean and standard deviation of the fold scores) together with the preceding ANOVA results for the three models. As stated previously, the SVC model was not significantly different from either the random forest model or the logistic regression model. The logistic regression model, however, was significantly different from the random forest model. Overall, when all three models were compared together, the null hypothesis was not rejected (p ≈ 0.055).

In [25]:
# Collect the mean and standard deviation of the best model from each family
stats_scores = {
    'Best Logistic Regression Model': logreg_stats_values['Scaled'],
    'Best SVM Model': svm_stats_scores['Best RBF'],
    'Best Random Forest Model': random_forest_mean_values['100 Estimators'],
}

pd.DataFrame(stats_scores, index=['Mean Score Value', 'Standard Deviation'])

Unnamed: 0,Best Logistic Regression Model,Best SVM Model,Best Random Forest Model
Mean Score Value,0.891905,0.934683,0.934524
Standard Deviation,0.050217,0.039762,0.031285


For the selection of the best model, a high mean score value was desired, with a low standard deviation indicating consistent performance across folds. The best model for the ionosphere data set is therefore the SVM, as it has the highest mean score value (0.934683) with a standard deviation of ~0.04. The Random Forest model had a comparable mean score value (0.934524) and a slightly lower standard deviation (~0.03), so its scores varied slightly less from fold to fold; the ANOVA test likewise found no significant difference between the two. The worst performing model of the 3 was the logistic regression model, which had the lowest mean score value (0.891905) and the highest standard deviation (~0.05).
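One possible way to make the ranking explicit is sketched below using the values from the summary table. The tie-breaking rule (preferring a lower standard deviation when means are equal) is an assumption for illustration, not something applied elsewhere in this notebook.

```python
# (mean accuracy, standard deviation) copied from the summary table above.
results = {
    "Logistic Regression": (0.891905, 0.050217),
    "SVM (RBF)": (0.934683, 0.039762),
    "Random Forest (100)": (0.934524, 0.031285),
}

# Rank by highest mean first; a lower standard deviation breaks ties.
ranking = sorted(results, key=lambda name: (-results[name][0], results[name][1]))
print(ranking)  # ['SVM (RBF)', 'Random Forest (100)', 'Logistic Regression']
```

Under this rule the SVM ranks first, ahead of the random forest by a margin of less than 0.0002 in mean accuracy.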