## A "Hello World" Example of Machine Learning - Revisited

Loading the Iris dataset from scikit-learn. 

The first column represents the sepal length, the second the sepal width, the third the petal length, and the fourth the petal width of the flower samples. The classes (species) are already encoded as integer labels, where 0 = Iris-Setosa, 1 = Iris-Versicolor, and 2 = Iris-Virginica.

Here, we are using only two features: the third and fourth columns (petal length and petal width).

In [227]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
#iris.data

In [228]:
X = iris.data[:, [2, 3]]

In [229]:
y = iris.target

In [230]:
print('Class labels:', np.unique(y))

Class labels: [0 1 2]


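Scikit-learn's linear classifiers, including the `Perceptron` used below, support multi-class classification via the One-versus-Rest (OvR) method: one binary classifier is trained per class, and the class with the highest score is predicted.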
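
To make the One-versus-Rest idea concrete, here is a minimal sketch that fits a `Perceptron` on the two petal features and inspects the fitted weights. It is illustrative only: it fits on the full, unscaled dataset for brevity, and the variable names `clf`, `scores`, and `pred_from_scores` are assumptions, not part of the notebook.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # petal length and petal width
y = iris.target

clf = Perceptron(max_iter=100, eta0=0.1, random_state=42)
clf.fit(X, y)

# With three classes, one binary classifier is trained per class,
# so there is one row of weights and one intercept per class.
print(clf.coef_.shape)       # (n_classes, n_features)
print(clf.intercept_.shape)  # (n_classes,)

# decision_function returns one score per class for each sample;
# predict() picks the class with the largest score.
scores = clf.decision_function(X[:5])
pred_from_scores = clf.classes_[scores.argmax(axis=1)]
print(pred_from_scores, clf.predict(X[:5]))
```

Each row of `coef_` is the weight vector of one "this class versus the rest" classifier.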

Splitting data into 70% training and 30% test data:

In [231]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

In [232]:
print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_test:', np.bincount(y_test))

Labels counts in y: [50 50 50]
Labels counts in y_train: [35 35 35]
Labels counts in y_test: [15 15 15]
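
The equal class counts above come from `stratify=y`. As a rough illustration, the sketch below repeats the split with and without stratification and prints the class counts for each; the `_s` and `_r` suffixes are hypothetical names used only here.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# Stratified split: class proportions are preserved in both subsets.
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Plain random split: class proportions are left to chance.
_, _, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.3, random_state=1)

print('Stratified:', np.bincount(y_train_s), np.bincount(y_test_s))
print('Random:    ', np.bincount(y_train_r, minlength=3),
      np.bincount(y_test_r, minlength=3))
```

Without `stratify`, the counts per class are generally not identical across the training and test sets.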


### Standardizing the features:

In [234]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler() #center the distribution around zero (mean), with a standard deviation of 1.
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
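
For reference, standardization subtracts the training mean and divides by the training standard deviation, feature by feature. The sketch below reproduces `StandardScaler` by hand under that assumption; `mu`, `sigma`, and the `_manual` arrays are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)

# Statistics are estimated on the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same training statistics are applied to both subsets.
X_train_manual = (X_train - mu) / sigma
X_test_manual = (X_test - mu) / sigma

print(np.allclose(X_train_manual, sc.transform(X_train)))
print(np.allclose(X_test_manual, sc.transform(X_test)))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.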

In [235]:
from sklearn.linear_model import Perceptron

ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(X_train_std, y_train)

Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=0.1,
           fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=42, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)

### Test the model with the hold-out test set

In [236]:
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

Misclassified samples: 2


In [237]:
from sklearn.metrics import accuracy_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Accuracy: 0.9555555555555556
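
The accuracy is simply the fraction of test samples classified correctly. A small worked check, using the counts reported above (45 test samples, 2 misclassified):

```python
n_test = 45          # 15 samples per class in the test split
misclassified = 2    # count reported by the previous cell

# Accuracy = correctly classified samples / all test samples
accuracy = (n_test - misclassified) / n_test
print('Accuracy:', accuracy)
```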


In [238]:
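# Note: ppn was trained on standardized features. If X_new holds raw
# measurements, scale it first with sc.transform(X_new) before predicting.
X_new = [[1.1, 0.2], [0.4, 1.9], [1.4, 0.2]]
y_new = ppn.predict(X_new)
y_new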

array([1, 2, 1])
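
Because the perceptron was fitted on standardized features, new samples given in raw units should go through the same fitted scaler before prediction. A minimal sketch, assuming the three samples are raw petal length and width measurements (the name `y_new_scaled` is illustrative):

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Fit the scaler and the model on the training data only.
sc = StandardScaler().fit(X_train)
ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(sc.transform(X_train), y_train)

# Assumed to be raw measurements (petal length, petal width).
X_new = [[1.1, 0.2], [0.4, 1.9], [1.4, 0.2]]

# Apply the training-set scaling before predicting.
y_new_scaled = ppn.predict(sc.transform(X_new))
print(y_new_scaled)
```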

### Evaluate the model using cross validation

In [239]:
from sklearn.model_selection import cross_val_score
cross_val_score(ppn, X_train_std, y_train, cv=4, scoring="accuracy")

array([0.88888889, 0.48148148, 0.7037037 , 0.95833333])

Two features, 4-fold cross-validation accuracy scores: array([0.88888889, 0.48148148, 0.7037037 , 0.95833333])
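
One possible refinement, sketched here under the assumption that scaling should be re-fitted inside each fold, is to wrap the scaler and the perceptron in a pipeline and pass the unscaled training data to `cross_val_score`. The name `pipe` is illustrative, and the scores may differ from those printed above.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# The scaler is fitted on the training part of each fold only.
pipe = make_pipeline(
    StandardScaler(),
    Perceptron(max_iter=100, eta0=0.1, random_state=42))

scores = cross_val_score(pipe, X_train, y_train, cv=4, scoring="accuracy")
print('Fold scores:', scores)
print('Mean: %.3f, std: %.3f' % (scores.mean(), scores.std()))
```

Reporting the mean and standard deviation makes it easier to compare runs than reading the individual fold scores.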

# Exercise 1: Use all four features to train the model and use cross-validation to check whether the results are better. Briefly explain why.

In [242]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
#iris.data

In [243]:
A = iris.data

In [244]:
b = iris.target
print('Class labels:', np.unique(b))

Class labels: [0 1 2]


In [245]:
from sklearn.model_selection import train_test_split

A_train, A_test, b_train, b_test = train_test_split(
    A, b, test_size=0.3, random_state=1, stratify=b)

In [246]:
print('Labels counts in b:', np.bincount(b))
print('Labels counts in b_train:', np.bincount(b_train))
print('Labels counts in b_test:', np.bincount(b_test))

Labels counts in b: [50 50 50]
Labels counts in b_train: [35 35 35]
Labels counts in b_test: [15 15 15]


In [247]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler() #center the distribution around zero (mean), with a standard deviation of 1.
sc.fit(A_train)
A_train_std = sc.transform(A_train)
A_test_std = sc.transform(A_test)

In [248]:
from sklearn.linear_model import Perceptron

ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(A_train_std, b_train)

Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=0.1,
           fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=42, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)

In [249]:
b_pred = ppn.predict(A_test_std)
print('Misclassified samples: ' + str((b_test != b_pred).sum()))

Misclassified samples: 3


In [250]:
from sklearn.metrics import accuracy_score

print('Accuracy: ' + str(accuracy_score(b_test, b_pred)))

Accuracy: 0.9333333333333333


In [251]:
from sklearn.model_selection import cross_val_score
cross_val_score(ppn, A_train_std, b_train, cv=4, scoring="accuracy")

array([0.92592593, 0.77777778, 0.66666667, 0.875     ])

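On the hold-out test set, using all four features lowered the accuracy from about 96% (2 of 45 samples misclassified) to about 93% (3 of 45). The 4-fold cross-validation scores tell a less clear story: their mean rises from roughly 0.76 with two features to roughly 0.81 with four, and the scores vary widely between folds in both cases. A likely explanation for the test-set drop is that petal length and petal width already separate the classes well, while the sepal measurements overlap more between Versicolor and Virginica. The perceptron does not settle on a stable boundary when the classes are not linearly separable, so the extra features can shift the decision boundary without improving it.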
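
To compare the two cross-validation runs at a glance, the fold scores printed above can be summarized by their mean and standard deviation. The sketch below copies the recorded scores by hand; the array names are illustrative.

```python
import numpy as np

# Fold scores copied from the cross_val_score outputs above.
two_feature_scores = np.array([0.88888889, 0.48148148, 0.7037037, 0.95833333])
four_feature_scores = np.array([0.92592593, 0.77777778, 0.66666667, 0.875])

for name, scores in [('two features', two_feature_scores),
                     ('four features', four_feature_scores)]:
    print('%s: mean=%.3f, std=%.3f' % (name, scores.mean(), scores.std()))
```

With only four folds and large fold-to-fold variation, a difference of a few points in the mean should be read with caution.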

# Exercise 2: Try the scikit-learn stochastic gradient descent model instead of the perceptron. Use all four features. Evaluate with cross-validation how the model performs in terms of accuracy using both two features and four features.

In [252]:
S = iris.data[:, [2, 3]]
t = iris.target

In [253]:
from sklearn.model_selection import train_test_split

S_train, S_test, t_train, t_test = train_test_split(
    S, t, test_size=0.3, random_state=1, stratify=t)

In [254]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler() #center the distribution around zero (mean), with a standard deviation of 1.
sc.fit(S_train)
S_train_std = sc.transform(S_train)
S_test_std = sc.transform(S_test)

In [255]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(random_state=42)
sgd.fit(S_train_std, t_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)
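
The repr above shows `loss='hinge'`, so the default `SGDClassifier` trains a linear SVM rather than a perceptron. The scikit-learn documentation describes `Perceptron()` as equivalent to `SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None)`. The sketch below fits both side by side on the two petal features; the names `ppn_ref` and `sgd_as_ppn` are hypothetical, and no particular agreement rate is claimed here.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

# Reference perceptron with default settings.
ppn_ref = Perceptron(random_state=42).fit(X_train_std, y_train)

# SGDClassifier configured with the perceptron loss and a constant step size.
sgd_as_ppn = SGDClassifier(loss="perceptron", eta0=1.0,
                           learning_rate="constant", penalty=None,
                           random_state=42).fit(X_train_std, y_train)

agreement = (ppn_ref.predict(X_test_std) == sgd_as_ppn.predict(X_test_std)).mean()
print('Fraction of identical test predictions:', agreement)
```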

In [256]:
t_pred = sgd.predict(S_test_std)
print('Misclassified samples: ' + str((t_test != t_pred).sum()))

Misclassified samples: 0


In [257]:
from sklearn.metrics import accuracy_score

print('Accuracy: ' + str(accuracy_score(t_test, t_pred)))

Accuracy: 1.0


### Four features

In [258]:
C = iris.data
d = iris.target
print('Class labels:', np.unique(d))

Class labels: [0 1 2]


In [259]:
from sklearn.model_selection import train_test_split

C_train, C_test, d_train, d_test = train_test_split(
    C, d, test_size=0.3, random_state=1, stratify=d)

In [260]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler() #center the distribution around zero (mean), with a standard deviation of 1.
sc.fit(C_train)
C_train_std = sc.transform(C_train)
C_test_std = sc.transform(C_test)

In [261]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(random_state=42)
sgd.fit(C_train_std, d_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=42, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [262]:
d_pred = sgd.predict(C_test_std)
print('Misclassified samples: ' + str((d_test != d_pred).sum()))

Misclassified samples: 5


In [263]:
from sklearn.metrics import accuracy_score

print('Accuracy: ' + str(accuracy_score(d_test, d_pred)))

Accuracy: 0.8888888888888888


As with the perceptron, the stochastic gradient descent model performed better on the hold-out test set with two features than with four: it misclassified 0 of 45 samples with two features (accuracy 1.0) and 5 of 45 with all four (accuracy about 0.89). A likely reason is that petal length and petal width already separate the three species well, while the sepal measurements overlap more between classes, so the extra features add noise rather than useful signal for a linear model. Stochastic gradient descent also updates the weights one shuffled sample at a time, and with four features there are more weights to fit from the same 105 training samples, so the result after a fixed run can vary more. These figures come from a single train/test split, so cross-validation scores would be needed to confirm the difference.
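
The exercise asks for a cross-validated comparison, which the cells above do not run. A minimal sketch of one way to do it follows, assuming the same split, `cv=4`, and a pipeline so that scaling is fitted inside each fold; the `feature_sets` dictionary and `results` are illustrative names, and no scores are claimed here.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
feature_sets = {
    'two features (petal)': iris.data[:, [2, 3]],
    'four features (all)': iris.data,
}

results = {}
for name, features in feature_sets.items():
    # Same 70/30 stratified split as in the cells above.
    F_train, F_test, t_train, t_test = train_test_split(
        features, iris.target, test_size=0.3, random_state=1,
        stratify=iris.target)

    # Scaling and the classifier are fitted together inside each fold.
    pipe = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))
    scores = cross_val_score(pipe, F_train, t_train, cv=4, scoring="accuracy")
    results[name] = scores
    print('%s: mean=%.3f, std=%.3f' % (name, scores.mean(), scores.std()))
```

Comparing the mean and spread of the fold scores gives a steadier picture than a single hold-out split of 45 samples.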