# Powerful classifiers
Here we try several powerful classifiers: support vector machines, random forests, and neural networks. Remember that a large share of machine learning practice is classification. The classifiers in this lecture are accurate but hard to interpret.


# Load file
Two libraries are commonly used to load a CSV file:
- NumPy functions `np.loadtxt` and `np.genfromtxt`
- pandas function `pd.read_csv`

Here we prefer pandas.
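
To make the difference between the two loaders concrete, here is a minimal sketch that reads the same small table with both. The CSV text and its column names are made up for illustration and are not taken from `spamdata.csv`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = "word_freq,char_freq,label\n0.1,0.5,1\n0.0,0.2,0\n0.3,0.9,1\n"

# NumPy: returns a plain float array; skip_header drops the column names.
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

# pandas: returns a DataFrame that keeps column names and per-column dtypes.
df = pd.read_csv(io.StringIO(csv_text))

print(arr.shape)         # (3, 3)
print(list(df.columns))  # ['word_freq', 'char_freq', 'label']
```

With pandas the header row is parsed automatically, which is one reason it tends to be more convenient for tabular data.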

In [1]:
import pandas as pd
import numpy as np
path='data/'
filename = path+'spamdata.csv'
spam = pd.read_csv(filename)

In [2]:
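# The first 57 columns are the features; column 57 is the class label
X = spam.values[:,:57]
y = spam.values[:,57]

# Hold out 10% of the samples as a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.1)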

# Random Forest
Random forest is one of the most powerful classification tools, and training is fairly fast for a moderate number of samples.



In [5]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [6]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred); accuracy itself is symmetric
accuracy_score(y_test, rf.predict(X_test))


0.9566160520607375
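
Both the train/test split and the forest are randomized (note `random_state=None` in the printed parameters), so the accuracy above changes from run to run. The following is a minimal sketch of fixing the seeds to get a repeatable number. It uses synthetic data from `make_classification` as a stand-in for the spam data, and the helper `fit_and_score` is a hypothetical name introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: 57 features, binary label).
X_demo, y_demo = make_classification(n_samples=500, n_features=57,
                                     n_informative=10, random_state=0)

def fit_and_score(seed):
    """Split, fit a forest, and return test accuracy, all driven by one seed."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.1, random_state=seed)
    rf_demo = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf_demo.fit(X_tr, y_tr)
    return accuracy_score(y_te, rf_demo.predict(X_te))

acc_a = fit_and_score(seed=0)
acc_b = fit_and_score(seed=0)
print(acc_a == acc_b)  # the same seed reproduces the same accuracy
```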

# Support Vector Machines
SVMs are like a linear model fitted in a kernel-induced feature space. With a local kernel such as the RBF kernel used below, their behaviour is similar to local regression.


In [16]:
# try C=1, C=10, C=100
from sklearn.svm import SVC
sv = SVC(C=10)
sv.fit(X_train,y_train)


SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [17]:
accuracy_score(y_test, sv.predict(X_test))

0.8872017353579176
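
The RBF kernel depends on Euclidean distances between samples, so features with large numeric ranges can dominate it. This may be part of why the SVC trails the random forest here, although that is not verified on the spam data. A minimal sketch of standardizing the features inside a pipeline, on synthetic data with three artificially inflated columns (both the data and the inflation factor are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the spam data (assumption, not the real file).
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=5, random_state=0)
# Inflate three columns to mimic features measured on a much larger scale.
X_demo[:, :3] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0)

# Raw features: the inflated columns dominate the kernel distances.
raw_svc = SVC(C=10).fit(X_tr, y_tr)
raw_acc = accuracy_score(y_te, raw_svc.predict(X_te))

# Standardize inside a pipeline so the scaler is fitted on training data only.
scaled_svc = make_pipeline(StandardScaler(), SVC(C=10)).fit(X_tr, y_tr)
scaled_acc = accuracy_score(y_te, scaled_svc.predict(X_te))

print(raw_acc, scaled_acc)
```

Putting the scaler in the pipeline also keeps it out of the test folds when the model is cross-validated.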

In [10]:
# One iteration of k-fold for a single C
from sklearn.model_selection import KFold
k = 5
acck = np.zeros(k)  # number of correct predictions in each fold
kf = KFold(n_splits=k, shuffle=True)
for i, (train_i, test_i) in enumerate(kf.split(X)):
    sv = SVC(C=0.1)
    sv.fit(X[train_i], y[train_i])
    # normalize=False returns the count of correct predictions, not the fraction
    acck[i] = accuracy_score(y[test_i], sv.predict(X[test_i]), normalize=False)
# Pooled accuracy over all folds
np.sum(acck)/X.shape[0]


0.7594001304064334
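
The manual loop above can also be written with scikit-learn's `cross_val_score`. This is a minimal sketch on synthetic data (an assumption standing in for the spam data), not a replacement for the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the spam data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
# One accuracy per fold; the estimator is cloned and refitted for each split.
fold_acc = cross_val_score(SVC(C=0.1), X_demo, y_demo,
                           cv=kf_demo, scoring="accuracy")

# The folds have equal size here (300 / 5), so the mean of the fold
# accuracies equals the pooled accuracy that the manual loop computes.
print(fold_acc.mean())
```

When the folds differ in size, the mean of fold accuracies and the pooled accuracy can differ slightly.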

In [11]:
# B iterations of k-fold for a single C
k = 5
acck = np.zeros(k)
kf = KFold(n_splits=k, shuffle=True)
B = 2
acc = np.zeros(B)  # number of correct predictions in each repetition
for j in range(B):
    # kf.split reshuffles on every call because shuffle=True and random_state is unset
    for i, (train_i, test_i) in enumerate(kf.split(X)):
        sv = SVC(C=0.1)
        sv.fit(X[train_i], y[train_i])
        acck[i] = accuracy_score(y[test_i], sv.predict(X[test_i]), normalize=False)
    acc[j] = np.sum(acck)
acc/X.shape[0]

array([0.75157574, 0.76374701])
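
Repeating k-fold B times is also available as `RepeatedKFold`, where `n_repeats` plays the role of B. A minimal sketch on synthetic data (an assumption standing in for the spam data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the spam data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)

# Each repeat reshuffles the samples before splitting into 5 folds.
rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(SVC(C=0.1), X_demo, y_demo, cv=rkf)

# Scores come back repeat by repeat, so reshape to (B, k) and average each row.
per_repeat = scores.reshape(2, 5).mean(axis=1)
print(per_repeat)
```

The spread between the entries of `per_repeat` gives a rough idea of how much the estimate depends on the particular split.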

In [12]:
# One iteration of k-fold, tuning C between 0.01 and 100
nb = 5
penalty = np.linspace(0.01, 100, nb)
k = 5
acck = np.zeros(k)
acc = np.zeros(len(penalty))  # number of correct predictions for each C
kf = KFold(n_splits=k, shuffle=True)
for c, C in enumerate(penalty):
    for i, (train_i, test_i) in enumerate(kf.split(X)):
        sv = SVC(C=C)
        sv.fit(X[train_i], y[train_i])
        acck[i] = accuracy_score(y[test_i], sv.predict(X[test_i]), normalize=False)
    acc[c] = np.sum(acck)
# Counts of correct predictions; divide by X.shape[0] to get accuracies
acc


array([2788., 3946., 3947., 3891., 3917.])
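
The same search over C can be sketched with `GridSearchCV`, which runs the cross-validation for every candidate value and records the mean accuracy. The data below is synthetic (an assumption standing in for the spam data); the grid mirrors `np.linspace(0.01, 100, 5)` from the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Synthetic stand-in for the spam data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)

param_grid = {"C": np.linspace(0.01, 100, 5)}
# A fixed random_state means every C is evaluated on the same five folds.
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(), param_grid, cv=cv_demo, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.cv_results_["mean_test_score"])  # one mean accuracy per C
```

One design difference from the manual loop: there, `kf.split` reshuffles for each C, so different values of C are compared on different folds; here all candidates share the same folds.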

In [18]:
# B iterations of k-fold, tuning C between 0.01 and 100
# Left as an exercise


In [19]:
# Fill it in here.


# Neural Networks
Shallow neural networks often produce results comparable to random forests, bagging, and boosting. Deep neural networks for large data require loading TensorFlow, which we do not discuss.


In [14]:
# build two hidden layers, each layer with 10 neurons
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(hidden_layer_sizes=(10, 10), activation='logistic')
nn.fit(X_train, y_train)

MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(10, 10), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [15]:
accuracy_score(y_test, nn.predict(X_test))

0.9544468546637744
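
To see what "two hidden layers, each with 10 neurons" means in terms of fitted parameters, one can inspect the weight matrices of a trained `MLPClassifier`. A minimal sketch on synthetic data with 57 features (an assumption mirroring the shape of the spam data, not the data itself):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in with 57 features and a binary label (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=57,
                                     n_informative=10, random_state=0)

nn_demo = MLPClassifier(hidden_layer_sizes=(10, 10), activation='logistic',
                        random_state=0)
# May emit a ConvergenceWarning; that is harmless for this shape check.
nn_demo.fit(X_demo, y_demo)

# One weight matrix per layer transition: input->hidden1->hidden2->output.
print([w.shape for w in nn_demo.coefs_])       # [(57, 10), (10, 10), (10, 1)]
print([b.shape for b in nn_demo.intercepts_])  # [(10,), (10,), (1,)]
```

This network has 701 parameters in total (580 + 110 + 11), which is small compared with the number of training samples in a data set of a few thousand rows.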