In [1]:
# Data Processing
import numpy as np

train_data = np.genfromtxt('trainData.csv', delimiter=',', dtype=str)
m = train_data.shape[0] - 1
d = train_data.shape[1]
print(m, d)
train_input, train_output = train_data[1:,0:d-1].tolist(), train_data[1:,d-1].tolist()

24712 20


The above step uses `numpy`'s `genfromtxt` function since there are no missing values. Below I randomly print to get the actual types of data like textual information and specific fields

In [2]:
print([train_input[0], train_output[0]])

[['28', '"admin."', '"single"', '"university.degree"', '"no"', '"yes"', '"no"', '"cellular"', '"aug"', '"mon"', '1', '999', '0', '"nonexistent"', '-2.9', '92.201', '-31.4', '0.861', '5076.2'], '1']


#### Preprocessing step 1: 
Remove quotes in textual information

#### Preprocessing step 2:
Convert numeric information to numbers as opposed to within single quotes. These fields are `0, 10, 11, 12, 14, ..., 19`.

In [3]:
for i in range(0, m-1):
    for j in range(0, d-1):
        if train_input[i][j].find("\"") != -1:
            train_input[i][j] = train_input[i][j][1:-1] # removing double quotes in pairs
        else:
            train_input[i][j] = float(train_input[i][j])
    train_output[i] = float(train_output[i])
train_output = np.array(train_output).astype(float)

print(train_input[0])
print(train_output[0])

[28.0, 'admin.', 'single', 'university.degree', 'no', 'yes', 'no', 'cellular', 'aug', 'mon', 1.0, 999.0, 0.0, 'nonexistent', -2.9, 92.201, -31.4, 0.861, 5076.2]
1.0


Now that the preprocessing has been completed, it is time to separate the input and outputs and begin training. Clearly, we might have to go with Ensemble methods since the data has both textual and numeric information in it.

In [4]:
rand_index = np.random.randint(low=0, high=m)
print(train_input[rand_index])
print(train_output[rand_index])

[53.0, 'management', 'divorced', 'university.degree', 'no', 'no', 'no', 'telephone', 'jun', 'fri', 3.0, 999.0, 0.0, 'nonexistent', 1.4, 94.465, -41.8, 4.959, 5228.1]
0.0


Now that I have the data through a simple level of preprocessing, I am going to convert text data into numeric values. For this I am first going to tokenize and then merge this into the actual training input.

#### Preprocessing step 3:
Convert the text data into numbers using `LabelEncoder`.

#### Preprocessing step 4:
Normalize the data.

In [5]:
train_text_data = np.hstack((np.array(train_input)[:,1:10], np.array(train_input)[:,13:14]))
print('Size of training text data : {0}'.format(train_text_data.shape))

# Converting the textual data into one vector per training input
train_text_vectors = []

from sklearn.preprocessing import LabelEncoder, normalize, StandardScaler

for j in range(0, train_text_data.shape[1]):
    lbl_enc = LabelEncoder()
    train_text_vectors.append(lbl_enc.fit_transform(train_text_data[:,j]))
train_text_vectors = np.array(train_text_vectors).T

print('Size of training text data: {0}'.format(train_text_vectors.shape))
print('Text: {0} --> Data: {1}'.format(train_text_data[0], train_text_vectors[0]))
print('Text: {0} --> Data: {1}'.format(train_text_data[10], train_text_vectors[10]))

Size of training text data : (24712, 10)
Size of training text data: (24712, 10)
Text: ['admin.' 'single' 'university.degree' 'no' 'yes' 'no' 'cellular' 'aug'
 'mon' 'nonexistent'] --> Data: [1 3 7 1 3 1 1 2 2 2]
Text: ['blue-collar' 'single' 'basic.9y' 'no' 'yes' 'no' 'cellular' 'jul' 'tue'
 'nonexistent'] --> Data: [2 3 3 1 3 1 1 4 4 2]


In [6]:
train_input = np.hstack((np.array(train_input)[:,0:1], 
                         train_text_vectors[:,0:train_text_vectors.shape[1] - 1], 
                         np.array(train_input)[:,10:13],
                         train_text_vectors[:,train_text_vectors.shape[1] - 1].reshape(-1, 1),
                         np.array(train_input)[:,14:]
                        )).astype(float)
mean_reduce = StandardScaler()
train_input = mean_reduce.fit_transform(train_input)
print(train_input.shape)
print('Vector = {0}: \nNorm = {1}'.format(train_input[0], np.linalg.norm(train_input[0])))

train_input = normalize(train_input)
print(train_input.shape)
print('Vector = {0}: \nNorm = {1}'.format(train_input[0], np.linalg.norm(train_input[0])))

(24712, 19)
Vector = [-1.1518788  -1.04245572  1.36067936  1.05600834 -0.51164653  0.94134947
 -0.44949276 -0.75723938 -1.38470179 -0.70956021 -0.56425354  0.19653049
 -0.35249416  0.19500115 -1.90310477 -2.38261316  1.97762226 -1.5909553
 -1.25826956]: 
Norm = 5.250495168763574
(24712, 19)
Vector = [-0.21938479 -0.19854427  0.25915258  0.20112548 -0.09744729  0.17928775
 -0.08560959 -0.14422247 -0.26372785 -0.13514158 -0.10746673  0.03743085
 -0.06713541  0.03713957 -0.36246196 -0.45378828  0.37665443 -0.30301052
 -0.23964779]: 
Norm = 1.0


Now the preprocessing steps are completed. I will try my preprocessed dataset on multiple methods of classification.
For this, I will generate 10 folds (stratified) and use 10-fold cross validation to get the best method.

In [7]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True)
folds = []
for tr_split, va_split in skf.split(train_input, train_output):
    folds.append((tr_split, va_split))
for i in range(0, len(folds)):
    print("Fold {0}\tTrain number: {1} Validation number: {2}".format(i+1, len(folds[i][0]), len(folds[i][1])))

Fold 1	Train number: 22240 Validation number: 2472
Fold 2	Train number: 22240 Validation number: 2472
Fold 3	Train number: 22240 Validation number: 2472
Fold 4	Train number: 22240 Validation number: 2472
Fold 5	Train number: 22241 Validation number: 2471
Fold 6	Train number: 22241 Validation number: 2471
Fold 7	Train number: 22241 Validation number: 2471
Fold 8	Train number: 22241 Validation number: 2471
Fold 9	Train number: 22242 Validation number: 2470
Fold 10	Train number: 22242 Validation number: 2470


### Attempt 1: SVM with Gaussian Kernel with 10 fold cross validation

In [11]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

accu_train = prec_train = rcll_train = 0.0
accu_valid = prec_valid = rcll_valid = 0.0
g_svm = SVC(kernel='rbf', cache_size=2000)
for i in range(0, len(folds)):
    print("Performing fold {0}".format(i+1))
    g_svm.fit(train_input[folds[i][0]], train_output[folds[i][0]])

    train_pred = g_svm.predict(train_input[folds[i][0]])
    valid_pred = g_svm.predict(train_input[folds[i][1]])

    accu_train += accuracy_score(train_output[folds[i][0]], train_pred)
    prec_train += precision_score(train_output[folds[i][0]], train_pred)
    rcll_train += recall_score(train_output[folds[i][0]], train_pred)
    
    accu_valid += accuracy_score(train_output[folds[i][1]], valid_pred)
    prec_valid += precision_score(train_output[folds[i][1]], valid_pred)
    rcll_valid += recall_score(train_output[folds[i][1]], valid_pred)
    
print("Average training accuracy after 10 fold cross validation = {0}".format(accu_train/len(folds)))
print("Average training precision after 10 fold cross validation = {0}".format(prec_train/len(folds)))
print("Average training recall after 10 fold cross validation = {0}".format(rcll_train/len(folds)))

print("Average validation accuracy after 10 fold cross validation = {0}".format(accu_valid/len(folds)))
print("Average validation precision after 10 fold cross validation = {0}".format(prec_valid/len(folds)))
print("Average validation recall after 10 fold cross validation = {0}".format(rcll_valid/len(folds)))

Performing fold 1
Performing fold 2
Performing fold 3
Performing fold 4
Performing fold 5
Performing fold 6
Performing fold 7
Performing fold 8
Performing fold 9
Performing fold 10
Average training accuracy after 10 fold cross validation = 0.8958805519146763
Average training precision after 10 fold cross validation = 0.6148393596168257
Average training recall after 10 fold cross validation = 0.20294544191744207
Average validation accuracy after 10 fold cross validation = 0.8958811959569284
Average validation precision after 10 fold cross validation = 0.6184453796596394
Average validation recall after 10 fold cross validation = 0.2029486088548516


### Attempt 2: Random Forests with multiple number of trees and 10 fold cross validation

In [13]:
from sklearn.ensemble import RandomForestClassifier

for m in range(3, 9):
    rnd_frst = RandomForestClassifier(n_estimators = m)
    accu_train = prec_train = rcll_train = 0.0
    accu_valid = prec_valid = rcll_valid = 0.0
    for i in range(0, len(folds)):
        rnd_frst.fit(train_input[folds[i][0]], train_output[folds[i][0]])

        train_pred = rnd_frst.predict(train_input[folds[i][0]])
        valid_pred = rnd_frst.predict(train_input[folds[i][1]])

        accu_train += accuracy_score(train_output[folds[i][0]], train_pred)
        prec_train += precision_score(train_output[folds[i][0]], train_pred)
        rcll_train += recall_score(train_output[folds[i][0]], train_pred)
    
        accu_valid += accuracy_score(train_output[folds[i][1]], valid_pred)
        prec_valid += precision_score(train_output[folds[i][1]], valid_pred)
        rcll_valid += recall_score(train_output[folds[i][1]], valid_pred)
    
    print("Average training accuracy after 10 fold cross validation = {0} for m = {1}".format(accu_train/len(folds), m))
    print("Average training precision after 10 fold cross validation = {0} for m = {1}".format(prec_train/len(folds), m))
    print("Average training recall after 10 fold cross validation = {0} for m = {1}".format(rcll_train/len(folds), m))

    print("Average validation accuracy after 10 fold cross validation = {0} for m = {1}".format(accu_valid/len(folds), m))
    print("Average validation precision after 10 fold cross validation = {0} for m = {1}".format(prec_valid/len(folds), m))
    print("Average validation recall after 10 fold cross validation = {0} for m = {1}\n".format(rcll_valid/len(folds), m))

Average training accuracy after 10 fold cross validation = 0.9704237076585225 for m = 3
Average training precision after 10 fold cross validation = 0.9102265945253913 for m = 3
Average training recall after 10 fold cross validation = 0.8181666674631585 for m = 3
Average validation accuracy after 10 fold cross validation = 0.8704685107436507 for m = 3
Average validation precision after 10 fold cross validation = 0.3956610189833982 for m = 3
Average validation recall after 10 fold cross validation = 0.28160439390423153 for m = 3

Average training accuracy after 10 fold cross validation = 0.9645831190743366 for m = 4
Average training precision after 10 fold cross validation = 0.9656396595876655 for m = 4
Average training recall after 10 fold cross validation = 0.7109277852913487 for m = 4
Average validation accuracy after 10 fold cross validation = 0.8864926111460741 for m = 4
Average validation precision after 10 fold cross validation = 0.4917565623870731 for m = 4
Average validation rec

### Attempt 3: AdaBoost with multiple weak classifiers and 10 fold cross validation

In [14]:
from sklearn.ensemble import AdaBoostClassifier

for t in [10, 25, 50, 100, 200, 250, 500, 1000]:
    rnd_frst = AdaBoostClassifier(n_estimators = t)
    accu_train = prec_train = rcll_train = 0.0
    accu_valid = prec_valid = rcll_valid = 0.0
    for i in range(0, len(folds)):
        rnd_frst.fit(train_input[folds[i][0]], train_output[folds[i][0]])

        train_pred = rnd_frst.predict(train_input[folds[i][0]])
        valid_pred = rnd_frst.predict(train_input[folds[i][1]])

        accu_train += accuracy_score(train_output[folds[i][0]], train_pred)
        prec_train += precision_score(train_output[folds[i][0]], train_pred)
        rcll_train += recall_score(train_output[folds[i][0]], train_pred)
    
        accu_valid += accuracy_score(train_output[folds[i][1]], valid_pred)
        prec_valid += precision_score(train_output[folds[i][1]], valid_pred)
        rcll_valid += recall_score(train_output[folds[i][1]], valid_pred)
    
    print("Average training accuracy after 10 fold cross validation = {0} for t = {1}".format(accu_train/len(folds), t))
    print("Average training precision after 10 fold cross validation = {0} for t = {1}".format(prec_train/len(folds), t))
    print("Average training recall after 10 fold cross validation = {0} for t = {1}".format(rcll_train/len(folds), t))

    print("Average validation accuracy after 10 fold cross validation = {0} for t = {1}".format(accu_valid/len(folds), t))
    print("Average validation precision after 10 fold cross validation = {0} for t = {1}".format(prec_valid/len(folds), t))
    print("Average validation recall after 10 fold cross validation = {0} for t = {1}\n".format(rcll_valid/len(folds), t))

Average training accuracy after 10 fold cross validation = 0.8986187557383187 for t = 10
Average training precision after 10 fold cross validation = 0.6718016025616685 for t = 10
Average training recall after 10 fold cross validation = 0.19683875664473133 for t = 10
Average validation accuracy after 10 fold cross validation = 0.8979448922597184 for t = 10
Average validation precision after 10 fold cross validation = 0.6594612902374594 for t = 10
Average validation recall after 10 fold cross validation = 0.19648926020473945 for t = 10

Average training accuracy after 10 fold cross validation = 0.8986861981425639 for t = 25
Average training precision after 10 fold cross validation = 0.6575265242220661 for t = 25
Average training recall after 10 fold cross validation = 0.21040864798734535 for t = 25
Average validation accuracy after 10 fold cross validation = 0.8973379324053521 for t = 25
Average validation precision after 10 fold cross validation = 0.6391766417150258 for t = 25
Average v

Now it seems that we are randomly trying to use multiple classifiers. I believe that there is a fundamental problem in the preprocessing. I think we have to use the `OneHotEncoder` to encode categorical data after encoding using label as shown [here](http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).