Import libraries:

In [67]:
import pandas as pd
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in later pandas releases
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

### Dataset
This dataset contains 36 audio features extracted from 280 samples. For each feature, we have 6 statistics (mean, median, std, std by mean, max, and min).

We therefore expect a 280 x 217 matrix (36 x 6 = 216 feature columns plus one label column), where the last column is the "label" (1 = good roof tile, 2 = roof tile with problems).
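The column count can be checked with a little arithmetic. The sketch below follows the `<feature>_<statistic>` naming pattern visible in the dataset header; the 36 base feature names are placeholders, since the full list is not given here.

```python
# Placeholder base names: only the naming pattern is taken from the dataset header.
base_features = [f"feature_{i:02d}" for i in range(36)]
statistics = ["mean", "median", "std", "stdbymean", "max", "min"]

# One column per (feature, statistic) pair, plus the label column at the end.
expected_columns = [f"{feat}_{stat}" for feat in base_features for stat in statistics]
expected_columns.append("label")

print(len(expected_columns))  # 217
```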

In [68]:
dt = pd.read_csv("dataset.csv")
dt.shape

(280, 217)

Let's see what our data looks like:

In [69]:
dt.head()

Unnamed: 0,t_zcr_mean,t_zcr_median,t_zcr_std,t_zcr_stdbymean,t_zcr_max,t_zcr_min,t_energy_mean,t_energy_median,t_energy_std,t_energy_stdbymean,...,f_chrvec12_stdbymean,f_chrvec12_max,f_chrvec12_min,f_persistence_mean,f_persistence_median,f_persistence_std,f_persistence_stdbymean,f_persistence_max,f_persistence_min,label
0,0.047869,0.005975,1.5104,0.14549,0.14635,0.90982,0.005272,0.14386,-30.571,-8.2801,...,0.0,0.0,0,0.0,0,0.0,0.0,0.0,49.4,1
1,0.053551,0.004888,1.6245,0.14694,0.14312,0.96656,0.005142,0.15898,-30.124,-8.3968,...,0.0,0.0,0,0.0,0,0.0,0.0,0.0,88.878,1
2,0.04,0.007517,1.4857,0.087087,0.10073,0.41488,0.007266,0.0675,-23.242,-6.7864,...,0.0,0.0,0,1.6896,0,0.0,2.1297,0.0,98.378,1
3,0.04108,0.01057,1.7121,0.12932,0.14035,0.68629,0.007476,0.11477,-30.336,-8.4624,...,0.0,0.0,0,0.0,0,0.0,2.5871,0.0,100.02,1
4,0.040909,0.012872,1.8787,0.13533,0.15745,0.53158,0.010323,0.098977,-34.642,-9.3728,...,0.0,0.0,0,0.0,0,0.0,0.0,0.0,147.75,1


Let's separate the data:

X will be a matrix containing all features from all samples.

y will be a vector containing the labels of all observations.

In [70]:
array = dt.values
X = array[:, 0:216]
y = array[:, 216]
X.shape

(280, 216)

Now we need to split the data into a set to train the models and a set to validate them. The chosen proportion is 70% for training and 30% for validation.

In [71]:
validation_size = 0.30
seed = 7
X_train, X_validation, y_train, y_validation = model_selection.train_test_split(X, y, test_size=validation_size, random_state=seed)
X_train.shape

(196, 216)
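As a sanity check on the split sizes, here is a minimal sketch on placeholder data with the same shape as the dataset. The 140/140 class balance and the `stratify` argument are assumptions for illustration; the cell above does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: same shape as the dataset, assumed balanced classes.
X_demo = np.zeros((280, 216))
y_demo = np.array([1] * 140 + [2] * 140)

# stratify keeps the class proportions equal in both subsets (optional extra).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=7, stratify=y_demo
)
print(X_tr.shape, X_val.shape)  # (196, 216) (84, 216)
```

With 280 rows, a 30% validation share gives 84 rows and leaves 196 for training, which matches the shape printed above.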

### L1-based feature selection
Our dataset contains a lot of features (216, to be specific).

Some of these features are collinear, so we should reduce them before training. To do that, I chose an L1-based feature selection method.

It is important to note that a smaller C results in fewer selected features.

In [72]:
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X_train, y_train)
modellsvc = SelectFromModel(lsvc, prefit=True)
X_train_new = modellsvc.transform(X_train)
X_train_new.shape

(196, 11)
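To illustrate how C controls the number of selected features, the following sketch repeats the selection for several values of C on synthetic data. The data, the list of C values, and `max_iter` are assumptions for illustration, and the selector is fitted directly here rather than with `prefit=True`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in for the audio features (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

counts = {}
for C in (0.001, 0.01, 0.1, 1.0):
    # The L1 penalty drives some coefficients to exactly zero; a smaller C
    # means stronger regularization and therefore more zeroed coefficients.
    svc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000)
    selector = SelectFromModel(svc).fit(X_demo, y_demo)
    counts[C] = int(selector.get_support().sum())

print(counts)  # number of retained features per value of C
```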

### Select a classifier
We will evaluate six classifiers and choose the best one to classify our validation data. The selection criterion is the mean cross-validated accuracy of each model on the training data.

We use k-fold cross-validation (with k = 10) to evaluate the models.
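To make the fold sizes concrete, this small sketch splits 196 row indices (the size of the training set) into 10 folds. The zero-filled array is only a placeholder for the training data.

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 196  # number of training rows after the 70/30 split
placeholder = np.zeros((n_train, 1))

kfold_demo = KFold(n_splits=10)
fold_sizes = [len(val_idx) for _, val_idx in kfold_demo.split(placeholder)]
print(fold_sizes)  # [20, 20, 20, 20, 20, 20, 19, 19, 19, 19]
```

Each model is therefore trained on 176 or 177 rows and scored on the remaining 20 or 19, ten times over.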

In [73]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

scoring = 'accuracy'
results = []
names = []
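# Without shuffling, KFold is deterministic, so random_state is not needed
# (recent scikit-learn versions raise an error if it is set without shuffle=True).
kfold = model_selection.KFold(n_splits=10)
for name, model in models:
    # 10-fold cross-validated accuracy on the reduced training set
    cv_results = model_selection.cross_val_score(model, X_train_new, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results.mean())
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)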

LR: 0.974737 (0.040482)
LDA: 0.994737 (0.015789)
KNN: 0.856579 (0.096757)
CART: 0.959474 (0.049108)
NB: 0.984474 (0.023727)
SVM: 0.433684 (0.023265)
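Note that the selector was fitted on the whole training set before cross-validation, so every validation fold had already influenced which features were kept. One possible alternative, sketched here on synthetic data, is to put the selector and the classifier into a single pipeline so that selection is refitted inside each fold. The data and the value of C are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in with the same number of rows as the training set.
X_demo, y_demo = make_classification(
    n_samples=196, n_features=50, n_informative=5, random_state=0
)

# Feature selection and classification are fitted together on each training fold.
pipeline = make_pipeline(
    SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=10000)),
    LinearDiscriminantAnalysis(),
)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=KFold(n_splits=10), scoring="accuracy")
print("%f (%f)" % (scores.mean(), scores.std()))
```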


Select the best classifier model:

In [74]:
best_model_idx = results.index(max(results))
best_model = models[best_model_idx]
best_model[0]

'LDA'
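The same selection can be written more compactly with a dictionary. This sketch uses the mean accuracies printed above; the name `cv_means` is introduced only for illustration.

```python
# Mean cross-validation accuracies reported above.
cv_means = {
    "LR": 0.974737,
    "LDA": 0.994737,
    "KNN": 0.856579,
    "CART": 0.959474,
    "NB": 0.984474,
    "SVM": 0.433684,
}

# max() with key=cv_means.get returns the name with the highest mean accuracy.
best_name = max(cv_means, key=cv_means.get)
print(best_name)  # LDA
```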

### Transform the validation input data
We reuse modellsvc, the selector fitted on the training data, so that the validation set is reduced to the same features.

In [75]:
X_validation_new = modellsvc.transform(X_validation)
X_validation_new.shape

(84, 11)
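It can be useful to see which columns the selector kept. The sketch below shows one way to do that with `get_support()`, on synthetic data with placeholder column names. The row split and the value of C are assumptions, and the selector is fitted directly rather than with `prefit=True`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=4, random_state=0
)
column_names = np.array([f"feature_{i:02d}" for i in range(X_demo.shape[1])])  # placeholders

# Fit the selector on the first 150 rows only, then reuse it on the held-out rows.
X_fit, y_fit, X_held_out = X_demo[:150], y_demo[:150], X_demo[150:]
selector = SelectFromModel(
    LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=10000)
).fit(X_fit, y_fit)

mask = selector.get_support()          # boolean mask over the original columns
X_held_out_new = selector.transform(X_held_out)

print(column_names[mask])              # names of the retained columns
print(X_held_out_new.shape)            # same number of columns as mask.sum()
```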

### Make the predictions on validation data
Finally, we evaluate the accuracy of the selected model by fitting it on the training data and predicting the labels of X_validation_new.

In [76]:
best_model[1].fit(X_train_new, y_train)
predictions = best_model[1].predict(X_validation_new)
accuracy_score(y_validation, predictions)

0.97619047619047616
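An accuracy of 0.97619 on 84 validation samples corresponds to 82 correct predictions and 2 errors. Accuracy alone does not show which class the errors fall in; `confusion_matrix` and `classification_report`, both imported at the top, can show that. A minimal sketch on hypothetical labels (not the real predictions):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels for illustration (1 = good tile, 2 = tile with problems).
y_true = [1, 1, 2, 2, 1, 2]
y_pred = [1, 1, 2, 1, 1, 2]

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
acc = accuracy_score(y_true, y_pred)  # 5 of 6 predictions are correct

print(cm)
print(acc)
print(classification_report(y_true, y_pred))
```

In the notebook itself, the same calls would take `y_validation` and `predictions` as arguments.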