# Task 2 - Ionosphere Dataset

For Task 2, we were asked to create a predictor using three different models/approaches, presented in the following order: (1) Logistic Regression, (2) Support Vector Machine, and (3) Random Forest.

The dataset analyzed for this task was the Ionosphere dataset, which describes radar data collected by a system located in Goose Bay, Labrador. The system, consisting of a phased array of 16 high-frequency antennas, targeted free electrons in the ionosphere. Each observation is recorded as either a "good" or a "bad" radar return, where a "good" return shows evidence of some kind of structure in the ionosphere and a "bad" return indicates that the signal passed through the ionosphere.

Hence, our goal is to create a predictor that performs the following classification:
label ("g" for good or "b" for bad) = function(ionosphere features)

In our analysis, each model is built and analyzed using a split of training and testing data. Following the initial construction of all models, 10-fold cross-validation is performed to compare performance between the different models/approaches. Lastly, this information, along with the results of a t-test, is used to identify the best model.

### Ionosphere Data Set Pre-processing

Prior to model creation, the ionosphere data was pre-processed. Upon visual inspection of the data, the second column (f2) was found to have no variance (i.e. all of its values are the same) and therefore carries no information for distinguishing the two classes. Hence, the second column was removed from the analysis.
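The visual check described above can also be done programmatically. The sketch below uses a small made-up DataFrame (the values are illustrative and not taken from the Ionosphere file) and flags any column that contains a single unique value.

```python
import pandas as pd

# Toy data standing in for the real feature table (values are made up)
toy_df = pd.DataFrame({
    'f1': [1, 0, 1, 1],
    'f2': [0, 0, 0, 0],              # constant column, like f2 in the text
    'f3': [0.99, 1.00, 0.83, 0.95],
})

# A column with exactly one unique value has zero variance
constant_cols = [col for col in toy_df.columns if toy_df[col].nunique() == 1]
print(constant_cols)

# Drop the constant columns before modelling
toy_reduced = toy_df.drop(columns=constant_cols)
```

On the toy data only `f2` is flagged, so `toy_reduced` keeps `f1` and `f3`.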

For classification, the label was converted to a binary encoding in which "g" maps to 1 and "b" maps to 0. With this encoding, the logistic regression model predicts a binary (0/1) outcome, where 1 corresponds to a "good" return.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Naming of all columns
colnames=['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 
          'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29',
          'f30', 'f31', 'f32', 'f33', 'f34', 'label']

ionosphere_df = pd.read_csv('data_files/ionosphere.data', names=colnames, header=None)
ionosphere_df

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f26,f27,f28,f29,f30,f31,f32,f33,f34,label
0,1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1.00000,0.03760,...,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300,g
1,1,0,1.00000,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1.00000,-0.04549,...,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447,b
2,1,0,1.00000,-0.03365,1.00000,0.00485,1.00000,-0.12062,0.88965,0.01198,...,-0.40220,0.58984,-0.22145,0.43100,-0.17365,0.60436,-0.24180,0.56045,-0.38238,g
3,1,0,1.00000,-0.45161,1.00000,1.00000,0.71216,-1.00000,0.00000,0.00000,...,0.90695,0.51613,1.00000,1.00000,-0.20099,0.25682,1.00000,-0.32382,1.00000,b
4,1,0,1.00000,-0.02401,0.94140,0.06531,0.92106,-0.23255,0.77152,-0.16399,...,-0.65158,0.13290,-0.53206,0.02431,-0.62197,-0.05707,-0.59573,-0.04608,-0.65697,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
346,1,0,0.83508,0.08298,0.73739,-0.14706,0.84349,-0.05567,0.90441,-0.04622,...,-0.04202,0.83479,0.00123,1.00000,0.12815,0.86660,-0.10714,0.90546,-0.04307,g
347,1,0,0.95113,0.00419,0.95183,-0.02723,0.93438,-0.01920,0.94590,0.01606,...,0.01361,0.93522,0.04925,0.93159,0.08168,0.94066,-0.00035,0.91483,0.04712,g
348,1,0,0.94701,-0.00034,0.93207,-0.03227,0.95177,-0.03431,0.95584,0.02446,...,0.03193,0.92489,0.02542,0.92120,0.02242,0.92459,0.00442,0.92697,-0.00577,g
349,1,0,0.90608,-0.01657,0.98122,-0.01989,0.95691,-0.03646,0.85746,0.00110,...,-0.02099,0.89147,-0.07760,0.82983,-0.17238,0.96022,-0.03757,0.87403,-0.16243,g


In [2]:
# Encode the categories as 1 and 0 (g = 1, b = 0)
# The result is assigned back to the column; an in-place replace on
# `ionosphere_df.label` may not modify the DataFrame in recent pandas versions
encoding = {'g': 1, 'b': 0}
ionosphere_df['label'] = ionosphere_df['label'].map(encoding)

# Removal of the second column (f2) as all of its values are identical and there is no variance
ionosphere_df.drop(columns=['f2'], inplace=True)

X = ionosphere_df.values[:, :-1]
y = ionosphere_df.values[:, -1]

In [3]:
from sklearn.preprocessing import StandardScaler

# Perform scaling on feature set data
X_scaled = StandardScaler().fit_transform(X)
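As a quick illustration of what the scaling step does, the sketch below standardizes a small made-up array and checks the result against the z-score formula (x - mean) / std, computed per column with the population standard deviation. The array values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: 3 samples, 2 features (values are made up)
toy_X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

scaled = StandardScaler().fit_transform(toy_X)

# Manual z-score: subtract the column mean, divide by the column std (ddof=0)
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

# Both approaches should agree up to floating point error
print(np.allclose(scaled, manual))
```

Note that fitting the scaler on the full data set before cross-validation lets every test fold influence the scaling statistics. Wrapping the scaler and the model in a scikit-learn `Pipeline` is one way to avoid this.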

<b>(1) Logistic Regression Model</b>

The logistic regression model was evaluated on the scaled data, with 25 features selected using the SelectKBest method. Following the creation of the model, 10-fold cross-validation was conducted to determine model performance.

In [4]:
from sklearn.model_selection import KFold, cross_val_score

# Initialize cross validation for k-fold = 10
cross_validation = KFold(n_splits=10, random_state=1, shuffle=True)

In [5]:
from sklearn import linear_model

logreg_model = linear_model.LogisticRegression(solver='lbfgs')

In [6]:
# Scaled data
logreg_scores = cross_val_score(logreg_model, X_scaled, y, scoring='accuracy', cv=cross_validation, n_jobs=-1)
logreg_scores

array([0.83333333, 1.        , 0.94285714, 0.85714286, 0.85714286,
       0.88571429, 0.91428571, 0.91428571, 0.82857143, 0.88571429])
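The cells above show the evaluation on the scaled data only. A minimal sketch of how a 25-feature SelectKBest variant could be evaluated is given below; it is not the notebook's actual code. It assumes the `f_classif` score function (the text does not say which one was used) and runs on synthetic data from `make_classification` so that it is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 33 features (34 original features minus f2)
X_demo, y_demo = make_classification(n_samples=351, n_features=33,
                                     n_informative=10, random_state=1)

# Scaling and feature selection are refit inside every training fold,
# so the held-out fold never influences which features are kept
kbest_logreg = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=25),   # f_classif is an assumption
    LogisticRegression(solver='lbfgs', max_iter=1000),
)

cv_demo = KFold(n_splits=10, random_state=1, shuffle=True)
kbest_scores = cross_val_score(kbest_logreg, X_demo, y_demo,
                               scoring='accuracy', cv=cv_demo)
print(kbest_scores.mean())
```

Putting the selector inside the pipeline, rather than selecting features once on the full data, keeps the cross-validation estimate from being optimistically biased by the selection step.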

<b>(2) Support Vector Machine</b>

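A soft-margin support vector machine with a linear kernel (scikit-learn's `SVC` with its default regularization) was fit to the scaled data. Its accuracy was first computed on the same data used for fitting, and 10-fold cross-validation was then conducted to estimate model performance.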

In [7]:
from sklearn.svm import SVC

svc_soft_model = SVC(kernel='linear')
svc_soft_model.fit(X_scaled, y)

SVC(kernel='linear')

In [8]:
# Predict on the same data used for fitting (gives training accuracy, not held-out accuracy)
Yp_soft = svc_soft_model.predict(X_scaled)

In [9]:
from sklearn.metrics import accuracy_score
print("Soft Margin Model accuracy:",accuracy_score(y,Yp_soft))

Soft Margin Model accuracy: 0.9430199430199431
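The accuracy above is computed on the same samples the model was fit on, so it is a training accuracy. A hedged sketch of a held-out evaluation with `train_test_split` follows; the 70/30 split ratio, the `random_state` values, and the synthetic data are assumptions for illustration, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the ionosphere features and labels
X_demo, y_demo = make_classification(n_samples=351, n_features=33,
                                     n_informative=10, random_state=1)

# Hold out 30% of the samples, keeping class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1)

# The scaler is fit on the training part only, then applied to the test part
svc_demo = make_pipeline(StandardScaler(), SVC(kernel='linear'))
svc_demo.fit(X_train, y_train)

train_acc = accuracy_score(y_train, svc_demo.predict(X_train))
test_acc = accuracy_score(y_test, svc_demo.predict(X_test))
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

Comparing the two numbers gives a first indication of overfitting: a test accuracy well below the training accuracy suggests the model does not generalize as well as the training score implies.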


In [10]:
# K-fold cross validation
svc_soft_model_scores = cross_val_score(svc_soft_model, X_scaled, y, scoring='accuracy', cv=cross_validation, n_jobs=-1)
svc_soft_model_scores

array([0.83333333, 0.97142857, 0.94285714, 0.94285714, 0.8       ,
       0.88571429, 0.88571429, 0.91428571, 0.8       , 0.91428571])

<b>(3) Random Forest</b>

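A random forest classifier with 50 trees and a maximum tree depth of 12 was fit to the scaled data, followed by 10-fold cross-validation to estimate model performance.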

In [11]:
from sklearn.ensemble import RandomForestClassifier

random_forest_model = RandomForestClassifier(n_estimators=50, random_state=0, max_depth=12)
random_forest_model.fit(X_scaled, y)

RandomForestClassifier(max_depth=12, n_estimators=50, random_state=0)

In [12]:
# K-fold cross validation
random_forest_model_scores = cross_val_score(random_forest_model, X_scaled, y, scoring='accuracy', cv=cross_validation, n_jobs=-1)
random_forest_model_scores

array([0.91666667, 0.94285714, 0.94285714, 0.94285714, 0.88571429,
       0.94285714, 0.94285714, 0.97142857, 0.88571429, 0.94285714])

<b>Model Performance Comparison</b>

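Model performance is compared using the 10-fold cross-validation accuracies reported above. Averaged over the ten folds, the random forest scores highest (mean accuracy of about 0.932), followed by logistic regression (about 0.892) and the linear support vector machine (about 0.889). A t-test on the fold scores is intended to check whether these differences are statistically significant.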

In [None]:
# Comparison of model performance
# Summary of the 10-fold cross-validation accuracy across all models
import pandas as pd

model_scores = {'Logistic Regression': logreg_scores,
                'Support Vector Machine': svc_soft_model_scores,
                'Random Forest': random_forest_model_scores}

pd.DataFrame({'Model': list(model_scores.keys()),
              'Mean accuracy': [scores.mean() for scores in model_scores.values()],
              'Std. dev.': [scores.std() for scores in model_scores.values()]})
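One way the t-test comparison could be carried out is a paired t-test on the per-fold accuracies, since all three models were scored on the same ten folds. The sketch below copies the fold scores printed in the cross-validation outputs above and uses `scipy.stats.ttest_rel`. The choice of a paired test and of SciPy is an assumption, not something this notebook specifies.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

# Per-fold accuracies copied from the cross-validation outputs above
scores = {
    'Logistic Regression': np.array(
        [0.83333333, 1.0, 0.94285714, 0.85714286, 0.85714286,
         0.88571429, 0.91428571, 0.91428571, 0.82857143, 0.88571429]),
    'Support Vector Machine': np.array(
        [0.83333333, 0.97142857, 0.94285714, 0.94285714, 0.8,
         0.88571429, 0.88571429, 0.91428571, 0.8, 0.91428571]),
    'Random Forest': np.array(
        [0.91666667, 0.94285714, 0.94285714, 0.94285714, 0.88571429,
         0.94285714, 0.94285714, 0.97142857, 0.88571429, 0.94285714]),
}

# Paired t-test for every pair of models (same folds, so scores are paired)
results = {}
for name_a, name_b in combinations(scores, 2):
    t_stat, p_value = ttest_rel(scores[name_a], scores[name_b])
    results[(name_a, name_b)] = (t_stat, p_value)
    print(f"{name_a} vs {name_b}: t = {t_stat:.3f}, p = {p_value:.3f}")
```

A negative t statistic means the first model of the pair has the lower mean fold accuracy. Because the ten training sets overlap heavily, the fold scores are not independent, so the p-values are best read as a rough guide rather than an exact significance level.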